- Step 1: Make sure that Java and Scala are installed on your computer (PySpark doesn't work well with Java 9, so it's better to use Java 8)
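To check which versions are installed, you can run the following commands in a terminal (the exact output depends on your setup)
java -version
scala -version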
- Step 2: Download Spark from here
- Step 3: Move the downloaded file to a chosen directory, for example ~/Dev, and extract it with
tar -xvf Downloads/spark-2.1.0-bin-hadoop2.7.tgz
- Step 4: Now we need to set the following environment variables in the bash profile so that the Spark path and the Py4j path are picked up
export SPARK_HOME=~/Dev/hadoop/spark-2.1.0-bin-hadoop2.7
export PATH=$PATH:~/Dev/hadoop/spark-2.1.0-bin-hadoop2.7/bin
export PYTHONPATH=$SPARK_HOME/python:$SPARK_HOME/python/lib/py4j-0.10.4-src.zip:$PYTHONPATH
export PATH=$SPARK_HOME/python:$PATH
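Assuming the profile you edited is ~/.bash_profile, you can reload it in the current session with
source ~/.bash_profile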
- Step 5: Restart the terminal application and then test that it works by typing
$SPARK_HOME/bin/pyspark
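If the shell starts correctly, a SparkContext is already available under the name sc; a quick sanity check (just an illustrative job, assuming the default shell setup) is
sc.parallelize(range(100)).count()  # should return 100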
- Step 6: To run a Python script, for example FirstSparkApp.py (a minimal sketch of such a script is shown below), type
$SPARK_HOME/bin/spark-submit FirstSparkApp.py
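The contents of FirstSparkApp.py are not shown here; as a minimal sketch of what such a script could look like (the sum job is purely illustrative), it has to create its own SparkContext:
from pyspark import SparkContext

# Create a SparkContext for this application
sc = SparkContext(appName="FirstSparkApp")

# Illustrative job: sum the numbers 0..99 in parallel
total = sc.parallelize(range(100)).reduce(lambda a, b: a + b)
print("Sum is {}".format(total))

# Release the resources held by the context
sc.stop()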
- On Windows, download and install the JDK from here
- Via the terminal, install pyspark using the following command
pip install pyspark
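To check that the installation worked, you can print the installed version from the terminal
python -c "import pyspark; print(pyspark.__version__)"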
- Download and unzip winutils from here
- Modify your environment variables in the following way (a Python-based alternative is sketched after this list):
  - Edit the environment variable "Path" and append "path_to_unzip_winutils_folder/bin" to it
  - Add a new environment variable called "HADOOP_HOME" and set its value to "path_to_unzip_winutils_folder"
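If you prefer not to touch the system settings, an alternative that often works is to set the same variables from Python before the SparkContext is created; the path below is only a placeholder for wherever you unzipped winutils:
import os

# Placeholder path: point it at the folder that contains bin\winutils.exe
os.environ["HADOOP_HOME"] = "C:\\path_to_unzip_winutils_folder"
os.environ["PATH"] = os.environ["HADOOP_HOME"] + "\\bin;" + os.environ["PATH"]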
- In your Python code you will need to add the following lines of code
from pyspark import SparkContext
sc = SparkContext.getOrCreate()
- For example, the following script estimates Pi with a Monte Carlo method:
import numpy as np
from pyspark import SparkContext, SparkConf

# Configure and create the Spark context for this application
conf = SparkConf().setAppName("PiEstimate")
sc = SparkContext(conf=conf)

# Number of random trials
n = 10000

def f(_):
    # Draw a random point in the square [-1, 1] x [-1, 1]
    x, y = np.random.random(2) * 2 - 1
    # The point counts as a hit if it falls inside the unit circle
    if x**2 + y**2 <= 1:
        return 1
    else:
        return 0

# Run n trials in parallel and sum the hits
count = sc.parallelize(range(1, n + 1)).map(f).reduce(lambda x, y: x + y)
print("Pi is roughly {}".format(4.0 * count / n))