In this tutorial, we will see how to install PySpark with Java 8 on Ubuntu 18.04.
We will install Java 8 and Spark, and configure all the required environment variables.
My machine runs Ubuntu 18.04, and I am using Java 8 along with Anaconda3. If you follow these steps, you should be able to install PySpark without any problems.
Make Sure That You Have Java Installed
If you don’t, run the following command in the terminal:
sudo apt install openjdk-8-jdk
After installation, if you type java -version in the terminal you will get something like:
openjdk version "1.8.0_212"
OpenJDK Runtime Environment (build 1.8.0_212-8u212-b03-0ubuntu1.18.04.1-b03)
OpenJDK 64-Bit Server VM (build 25.212-b03, mixed mode)
Download Spark from https://spark.apache.org/downloads.html
Remember the directory where you downloaded it. Mine went to the default Downloads folder, which is also where I will install Spark. (This tutorial uses spark-2.4.3-bin-hadoop2.7, the package pre-built for Apache Hadoop 2.7.)
Set the $JAVA_HOME Environment Variable
For this, run the following in the terminal:
sudo vim /etc/environment
It will open the file in vim. Then, on a new line after the PATH variable, add:
JAVA_HOME="/usr/lib/jvm/java-8-openjdk-amd64"
Type :wq! and hit Enter. This saves the file and exits vim. Back in the terminal, run
source /etc/environment
Don’t forget to run that last command, as it loads the new variable into the currently running shell. Now, if you run
echo $JAVA_HOME
The output should be:
/usr/lib/jvm/java-8-openjdk-amd64
Just as it was added. Some versions of Ubuntu do not source the /etc/environment file every time a terminal is opened, so it is safer to source it from the .bashrc file, which is loaded for every new terminal session. Run the following command in the terminal:
vim ~/.bashrc
The file opens. Add the following line at the end:
source /etc/environment
We will add the Spark variables below it later. Save and exit for now, then reload the .bashrc file in the terminal by running the following command.
source ~/.bashrc
Or you can close this terminal and open another. By now, if you run echo $JAVA_HOME you should get the expected output.
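Since PySpark locates the JVM through JAVA_HOME, you can also confirm that new processes inherit the variable from Python itself. A quick check, run in any python3 session:

import os

# PySpark reads this variable at startup to find the JVM.
print(os.environ.get("JAVA_HOME"))  # expect /usr/lib/jvm/java-8-openjdk-amd64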
Installing PySpark
Easy Way
This method is best for WSL (Windows Subsystem for Linux) Ubuntu:
If you already have Python and pip installed, just execute the command below.
pip install pyspark
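To confirm the installation works end to end, here is a minimal smoke test; the app name is arbitrary, and you can paste this into a python3 session or save it as a script:

from pyspark.sql import SparkSession

# Start a local Spark session using all available cores.
spark = SparkSession.builder.master("local[*]").appName("install-check").getOrCreate()

# A tiny DataFrame confirms the JVM and the Python workers can talk to each other.
df = spark.createDataFrame([(1, "spark"), (2, "pyspark")], ["id", "name"])
df.show()

print("Spark version:", spark.version)
spark.stop()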
Manual Way
Go to the directory where the Spark archive was downloaded and run the following command to extract it:
cd Downloads
sudo tar -zxvf spark-2.4.3-bin-hadoop2.7.tgz
Note: if your Spark archive is a different version, adjust the filename accordingly.
Configure Environment Variables for Spark
This step is only needed if you installed Spark the “Manual Way”. Open .bashrc again:
vim ~/.bashrc
Add the following lines at the end:
export SPARK_HOME=~/Downloads/spark-2.4.3-bin-hadoop2.7
export PATH=$PATH:$SPARK_HOME/bin
export PATH=$PATH:~/anaconda3/bin
export PYTHONPATH=$SPARK_HOME/python:$PYTHONPATH
export PYSPARK_DRIVER_PYTHON="jupyter"
export PYSPARK_DRIVER_PYTHON_OPTS="notebook"
export PYSPARK_PYTHON=python3
export PATH=$PATH:$JAVA_HOME/jre/bin
Save the file and exit. Finally, reload the .bashrc file in the terminal by running
source ~/.bashrc
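With the PYTHONPATH line above, a plain python3 session should now be able to import the bundled PySpark. If import pyspark fails with a py4j error, note that Spark also ships py4j as a zip under $SPARK_HOME/python/lib (the exact filename varies by release), and that zip needs to be on the path too. A small sketch of the workaround from inside Python:

import glob, os, sys

# Spark bundles py4j inside $SPARK_HOME/python/lib; the version in the
# filename differs between Spark releases, so match it with a glob.
spark_home = os.environ["SPARK_HOME"]
sys.path.append(os.path.join(spark_home, "python"))
sys.path.extend(glob.glob(os.path.join(spark_home, "python", "lib", "py4j-*-src.zip")))

import pyspark
print(pyspark.__version__)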
Now run:
pyspark
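Because PYSPARK_DRIVER_PYTHON is set to jupyter above, this command starts a Jupyter notebook server instead of the plain PySpark shell. In a new notebook, the PySpark startup normally predefines spark and sc, so a one-cell check like the following should work (if spark is undefined, the driver variables were not picked up):

# `spark` is predefined by the PySpark startup in the notebook.
spark.range(5).show()   # a small DataFrame with the numbers 0 through 4
print(spark.version)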
Finally, you can verify the install from the Scala side as well; the command below prints the Spark Shell’s version information (run spark-shell without the --version flag to launch the interactive shell):
cd $SPARK_HOME
cd bin
spark-shell --version