Setting up the right environment
First of all, you must verify whether Java and Scala are installed on your system.
- Running the following command will give you output containing the current Java version:
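The version check itself (assuming `java` is on your PATH):

```shell
java -version
```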
openjdk version "1.8.0_252"
OpenJDK Runtime Environment (build 1.8.0_252-8u252-b09-1~18.04-b09)
OpenJDK 64-Bit Server VM (build 25.252-b09, mixed mode)
- Now, similarly, check for the Scala installation on your system by running the following command, which prints the current Scala version:
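The Scala check (assuming `scala` is on your PATH):

```shell
scala -version
```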
Scala code runner version 2.11.10 -- Copyright 2002-2017, LAMP/EPFL
Let us now download and configure PySpark with the following steps:
>> Step 1: Go to the official Apache Spark website and download the latest Apache Spark version available there. Here, I am using spark-2.2.1-bin-hadoop2.6.
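If you prefer the command line, older releases such as this one are kept in the Apache archive; the URL below follows the archive's usual layout, so verify it before downloading:

```shell
wget https://archive.apache.org/dist/spark/spark-2.2.1/spark-2.2.1-bin-hadoop2.6.tgz
```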
>> Step 2: Extract the downloaded Spark tar file by using this command:
tar -xvf spark-2.2.1-bin-hadoop2.6.tgz
Configuring your PySpark installation
A new directory will be created: spark-2.2.1-bin-hadoop2.6. Before starting PySpark, you must set the following environment variables so that Python can find Spark and py4j:
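A minimal sketch of those settings, assuming the archive was extracted under /home/pranjal as in the shell example later in this article; the py4j version bundled in python/lib may differ for your download, so check that directory and adjust:

```shell
export SPARK_HOME=/home/pranjal/spark-2.2.1-bin-hadoop2.6
export PYTHONPATH=$SPARK_HOME/python:$SPARK_HOME/python/lib/py4j-0.10.4-src.zip:$PYTHONPATH
```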
Alternatively, you can set the above environment variables globally by putting them in your .bashrc file. You can then run the following command for the configuration to take effect:
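The reload command is:

```shell
source ~/.bashrc
```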
Now, you are done with all your configurations. So, go to the Spark directory and then to the bin folder. Here, we invoke the PySpark shell by running the following command:
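Assuming the extraction location above, this looks like:

```shell
cd /home/pranjal/spark-2.2.1-bin-hadoop2.6/bin
./pyspark
```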
This will start your PySpark shell:
Python 2.7.17 (default, Nov 7 2019, 10:07:09)
[GCC 7.4.0] on linux2
Type "help", "copyright" or "license" for more information.
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.2.1
      /_/

Using Python version 2.7.17 (default, Nov 7 2019 10:07:09)
SparkSession available as 'spark'.
Before starting any implementation, we must know about SparkContext, which is the entry point for any Spark functionality. When we run any Spark application, a driver program starts, which has the main function, and your SparkContext gets started there.
SparkContext uses the py4j library to launch a JVM and create a JavaSparkContext. By default, PySpark has SparkContext available as ‘sc’. So, creating a new SparkContext will not work and you will get the following error:
ValueError: Cannot run multiple SparkContexts at once
Now, we are done with our setup, so let us run a sample example in our PySpark shell. In this example, we will count the number of lines containing the special characters '#' or '$' in the example.txt file:
>>> logFile = "file:///home/pranjal/spark-2.2.1-bin-hadoop2.6/example.txt"
>>> logData = sc.textFile(logFile).cache()
>>> num_hashes = logData.filter(lambda s: '#' in s).count()
>>> num_dollars = logData.filter(lambda s: '$' in s).count()
>>> print "Lines with #: %i, lines with $: %i" % (num_hashes, num_dollars)
Lines with #: 10, lines with $: 5
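For intuition, the filter here is just an ordinary predicate applied line by line; the same counting logic can be written in plain Python without Spark. The sample lines below are made up for illustration:

```python
# Plain-Python equivalent of the PySpark filter/count example above
lines = [
    "# config line",
    "price is $5",
    "plain line",
    "# another comment with a $ sign",
]

# Count lines containing each special character
num_hashes = sum(1 for s in lines if '#' in s)
num_dollars = sum(1 for s in lines if '$' in s)
print("Lines with #: %i, lines with $: %i" % (num_hashes, num_dollars))
```

The difference in Spark is that the predicate runs on partitions of the RDD in parallel across the cluster, rather than in a single loop.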
Note that we can read JSON files or other types of files from our system or from other systems. Further, by performing various data transformations, we can load these onto another system or data warehouse by creating ETL pipelines.
Let us now run our example as a standalone Python program. Create a Python file called sample.py and put the following code in it:
from pyspark.sql import SparkSession

# Create (or reuse) a SparkSession: the entry point for the DataFrame API
spark = SparkSession \
    .builder \
    .appName("Python Spark SQL basic") \
    .config("spark.config", "sample-value") \
    .getOrCreate()

df = spark.read.json("examples/src/main/resources/sample.json")

# Show the content of the DataFrame on stdout
df.show()
We then execute the following spark-submit command to run our Python file:
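A minimal invocation, run from the Spark directory and assuming sample.py is in the current directory (adjust the paths as needed):

```shell
./bin/spark-submit sample.py
```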
Doing so gives us the following output:
# This is the sample output:
# +----+-------+
# | age|   name|
# +----+-------+
# |null|Pranjal|
# |  30|  Rahul|
# |  19|  Candy|
# +----+-------+
We can do much more using the large set of libraries provided by Python, and we can transform large amounts of data by building ETL pipelines.
To sum up, Spark gives us the ability to simplify the challenging, compute-intensive task of processing large amounts of real-time or archived data, both structured and unstructured.
Thanks for reading!
Apache Spark is the registered trademark of The Apache Software Foundation.
Author — Pranjal Gupta, DLT Labs™
About the Author: Pranjal is currently working with our DL Asset Track team as a Node.js developer.
Disclaimer: This article was originally published on the DLT Labs Blog page: