How to Install PySpark

In our previous article, we learned about the ETL process in the context of PySpark and why it is needed. In this second part, we will walk you through the installation, configuration, and implementation of PySpark.

Let us begin.

Setting up the right environment

First of all, you must verify whether or not you have installed Java and Scala. You can check the Java installation using the following command:

java -version
  • Running this command will give you output containing the current Java version:
openjdk version "1.8.0_252"
OpenJDK Runtime Environment (build 1.8.0_252-8u252-b09-1~18.04-b09)
OpenJDK 64-Bit Server VM (build 25.252-b09, mixed mode)
  • Now, similarly, check the Scala installation on your system using this command:
scala -version
  • Running this command shows the current version of Scala present on your system:
Scala code runner version 2.11.10 -- Copyright 2002-2017, LAMP/EPFL

Let us now download and configure PySpark with the following steps:

>> Step 1: Go to the official Apache Spark website and download the latest Apache Spark version available there, as shown below. Here, I am using spark-2.2.1-bin-hadoop2.6.

[Screenshot: Apache Spark download page]

>> Step 2: Extract the downloaded Spark tar file by using this command:

tar -xvf spark-2.2.1-bin-hadoop2.6.tgz

Configuring your PySpark installation

A new directory, spark-2.2.1-bin-hadoop2.6, will be created. Before starting PySpark, you must set the following environment variables so that the Spark and py4j paths are picked up:

export SPARK_HOME=/home/pranjal/spark-2.2.1-bin-hadoop2.6
export PATH=$PATH:$SPARK_HOME/bin
export PATH=$PATH:$SPARK_HOME/sbin
export PYTHONPATH=$SPARK_HOME/python:$SPARK_HOME/python/lib/py4j-0.10.4-src.zip:$PYTHONPATH
export PATH=$SPARK_HOME/python:$PATH

Alternatively, you can set the above environment variables globally by adding them to your .bashrc file. You can then run the following command for the configuration to take effect:

source ~/.bashrc
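
As a quick sanity check (a minimal sketch, assuming the exports above are active in your current shell), you can confirm that Python can locate the PySpark package before launching the shell:

# Run in a plain Python interpreter; if PYTHONPATH is set correctly,
# the import succeeds and the version matches the downloaded release.
import pyspark
print(pyspark.__version__)   # expected: 2.2.1 for this download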

Now, you are done with all your configurations. So, go to the Spark directory and then to the bin folder. Here, we invoke the PySpark shell by running the following command:

pyspark

This will start the PySpark shell:

Python 2.7.17 (default, Nov 7 2019, 10:07:09) 
[GCC 7.4.0] on linux2
Type "help", "copyright" or "license" for more information.
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.2.1
      /_/
Using Python version 2.7.17 (default, Nov 7 2019 10:07:09)
SparkSession available as 'spark'.
>>>

Implementation

Before starting any implementation, we must know about SparkContext. This is the entry point for any Spark functionality. When we run any Spark application, a driver program starts, which has the main function. This starts your SparkContext.

SparkContext uses the py4j library to launch a JVM and create a JavaSparkContext. By default, PySpark has SparkContext available as ‘sc’. So, creating a new SparkContext will not work and you will get the following error:

ValueError: Cannot run multiple SparkContexts at once
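
If you ever need a handle on the context in your own scripts, a safe pattern (a small sketch, not specific to this tutorial) is to reuse the existing context rather than constructing a second one:

# Returns the already-running SparkContext ('sc' in the PySpark shell)
# instead of raising the error above; creates one only if none exists.
from pyspark import SparkContext
sc = SparkContext.getOrCreate()
print(sc.version)   # should print 2.2.1 for this installation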

Now, we are done with our setup. So let us run a sample example in our PySpark shell. In this example, we will be counting the number of lines containing special characters — ‘#’ or ‘$’ — in the example.txt file:

>>> logFile = "file:///home/pranjal/spark-2.2.1-bin-hadoop2.6/example.txt"
>>> logData = sc.textFile(logFile).cache()
>>> num_hash = logData.filter(lambda s: '#' in s).count()
>>> num_dollar = logData.filter(lambda s: '$' in s).count()
>>> print("Lines with #: %i, lines with $: %i" % (num_hash, num_dollar))
Lines with #: 10, lines with $: 5

Note that we can read JSON files or other types of files from our system or from other systems. Further, by performing various data transformations, we can load these onto another system or data warehouse by creating ETL pipelines.
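
For instance, here is a minimal ETL-style sketch run from the same shell (the input path, the output path, and the age column are illustrative assumptions, not files shipped with Spark):

# Extract: read a JSON file into a DataFrame using the 'spark' session
raw = spark.read.json("file:///home/pranjal/data/input.json")
# Transform: keep only the rows we care about
adults = raw.filter(raw["age"] >= 18)
# Load: write the result out as Parquet for a downstream system
adults.write.mode("overwrite").parquet("file:///home/pranjal/data/output_parquet")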

Let us now run our example as a Python program. Create a Python file called sample.py and add the following code to it:

from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("Python Spark SQL basic") \
    .config("spark.config", "sample-value") \
    .getOrCreate()

# spark is an existing SparkSession
df = spark.read.json("examples/src/main/resources/sample.json")

# Show the content of the DataFrame to stdout
df.show()

We then execute the following spark-submit command to run our Python file:

spark-submit sample.py

Doing so gives us the following output:

# This is the sample output:
# +----+-------+
# | age|   name|
# +----+-------+
# |null|Pranjal|
# |  30|  Rahul|
# |  19|  Candy|
# +----+-------+
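
Once the DataFrame is loaded, you can keep working with it in the same script. Here are a few illustrative transformations (column names taken from the sample output above; this is a sketch, not part of the original sample.py):

# Project a single column
df.select("name").show()
# Keep only rows where age is greater than 20
df.filter(df["age"] > 20).show()
# Count how many rows share each age value
df.groupBy("age").count().show()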

We can do much more by using the large set of libraries provided by Python, and we can transform large amounts of data by building ETL pipelines.

Conclusion

To sum up, Spark gives us the ability to simplify the challenging, compute-intensive task of processing large amounts of real-time or archived data, both structured and unstructured.

Thanks for reading!

Apache Spark is the registered trademark of The Apache Software Foundation.

Author — Pranjal Gupta, DLT Labs

About the Author: Pranjal is currently working with our DL Asset Track team as a Node.js developer.

Disclaimer: This article was originally published on the DLT Labs Blog page:
https://www.dltlabs.com/blog/how-to-install-pyspark-903994

DLT Labs is a global leader in Distributed Ledger Technology and Enterprise Products. To know more, head over to: https://www.dltlabs.com/
