- Step 1: Make sure that Java and Scala are installed on your computer (PySpark doesn't work well with Java 9, so it's better to use Java 8)
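To check which versions are installed, you can run the following commands in a terminal (the exact output depends on your setup)
java -version
scala -version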
- Step 2: Download Spark from here
- Step 3: Move the downloaded file to a chosen directory, for example ~/Dev, and extract it with
tar -xvf Downloads/spark-2.1.0-bin-hadoop2.7.tgz
- Step 4: Now we need to set the following environment variables in the bash profile so that the Spark path and the Py4j path are picked up
export SPARK_HOME=~/Dev/hadoop/spark-2.1.0-bin-hadoop2.7
export PATH=$PATH:~/Dev/hadoop/spark-2.1.0-bin-hadoop2.7/bin
export PYTHONPATH=$SPARK_HOME/python:$SPARK_HOME/python/lib/py4j-0.10.4-src.zip:$PYTHONPATH
export PATH=$SPARK_HOME/python:$PATH
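Assuming the profile you edited is ~/.bash_profile, you can reload it in the current session with
source ~/.bash_profile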
- Step 5: Restart the terminal application and then test that it works by typing
$SPARK_HOME/bin/pyspark
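If the shell starts correctly, a SparkContext is already available under the name sc; a quick sanity check (just an illustrative job, assuming the default shell setup) is
sc.parallelize(range(100)).count()  # should return 100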
- Step 6: To run a Python script, for example FirstSparkApp.py (a minimal sketch of such a script is shown below), type
$SPARK_HOME/bin/spark-submit FirstSparkApp.py
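The contents of FirstSparkApp.py are not shown here; as a minimal sketch of what such a script could look like (the sum job is purely illustrative), it has to create its own SparkContext:
from pyspark import SparkContext

# Create a SparkContext for this application
sc = SparkContext(appName="FirstSparkApp")

# Illustrative job: sum the numbers 0..99 in parallel
total = sc.parallelize(range(100)).reduce(lambda a, b: a + b)
print("Sum is {}".format(total))

# Release the resources held by the context
sc.stop()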
- On Windows, download and install the JDK from here
- Via the terminal, install pyspark using the following command
pip install pyspark
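To check that the installation worked, you can print the installed version from the terminal
python -c "import pyspark; print(pyspark.__version__)"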
- Download and unzip winutils from here
- Modify your environment variables in the following way (a Python-based alternative is sketched after this list):
  - Edit the environment variable "Path" and append "path_to_unzip_winutils_folder/bin" to it
  - Add a new environment variable called "HADOOP_HOME" and set its value to "path_to_unzip_winutils_folder"
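If you prefer not to touch the system settings, an alternative that often works is to set the same variables from Python before the SparkContext is created; the path below is only a placeholder for wherever you unzipped winutils:
import os

# Placeholder path: point it at the folder that contains bin\winutils.exe
os.environ["HADOOP_HOME"] = "C:\\path_to_unzip_winutils_folder"
os.environ["PATH"] = os.environ["HADOOP_HOME"] + "\\bin;" + os.environ["PATH"]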
- In your Python code you will need to add the following lines of code
from pyspark import SparkContext
sc = SparkContext.getOrCreate()
- For example, the following script estimates Pi with a Monte Carlo method:
import numpy as np
from pyspark import SparkContext, SparkConf

# Configure and create the Spark context for this application
conf = SparkConf().setAppName("PiEstimate")
sc = SparkContext(conf=conf)

# Number of random trials
n = 10000

def f(_):
    # Draw a random point in the square [-1, 1] x [-1, 1]
    x, y = np.random.random(2) * 2 - 1
    # The point counts as a hit if it falls inside the unit circle
    if x**2 + y**2 <= 1:
        return 1
    else:
        return 0

# Run n trials in parallel and sum the hits
count = sc.parallelize(range(1, n + 1)).map(f).reduce(lambda x, y: x + y)
print("Pi is roughly {}".format(4.0 * count / n))