In this tutorial, we will see how to install PySpark with Java 8 on Ubuntu 18.04.
We will install Java 8 and Spark, and configure all the required environment variables.
My machine runs Ubuntu 18.04, and I am using Java 8 along with Anaconda3. If you follow these steps, you should be able to install PySpark without any problems.
Make Sure That You Have Java Installed
If you don’t, run the following command in the terminal:
sudo apt install openjdk-8-jdk
After installation, if you type java -version in the terminal you will get something like:
openjdk version "1.8.0_212"
OpenJDK Runtime Environment (build 1.8.0_212-8u212-b03-0ubuntu1.18.04.1-b03)
OpenJDK 64-Bit Server VM (build 25.212-b03, mixed mode)
Download Spark from https://spark.apache.org/downloads.html
Remember the directory where you downloaded it. Mine went to the default Downloads folder, which is also where I will install Spark. (This tutorial uses spark-2.4.3-bin-hadoop2.7, the package pre-built for Apache Hadoop 2.7.)
Set the $JAVA_HOME Environment Variable
For this, run the following in the terminal:
sudo vim /etc/environment
It will open the file in vim. Then, on a new line after the PATH variable, add:
JAVA_HOME="/usr/lib/jvm/java-8-openjdk-amd64"
Type :wq! and hit Enter. This saves the file and exits vim. Back in the terminal, run
source /etc/environment
Don’t forget to run that last command, as it loads the new variable into the currently running shell. Now, if you run
echo $JAVA_HOME
The output should be:
/usr/lib/jvm/java-8-openjdk-amd64
Just as it was added. Some versions of Ubuntu do not source the /etc/environment file every time a terminal is opened, so it is safer to source it from the .bashrc file, which is loaded for every new terminal session. Run the following command in the terminal:
vim ~/.bashrc
The file opens. Add the following line at the end:
source /etc/environment
We will add the Spark variables below it later. Save and exit for now, then reload the .bashrc file in the terminal by running the following command.
source ~/.bashrc
Or you can close this terminal and open another. By now, if you run echo $JAVA_HOME you should get the expected output.
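Since PySpark locates the JVM through JAVA_HOME, you can also confirm that new processes inherit the variable from Python itself. A quick check, run in any python3 session:

import os

# PySpark reads this variable at startup to find the JVM.
print(os.environ.get("JAVA_HOME"))  # expect /usr/lib/jvm/java-8-openjdk-amd64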
Installing PySpark
Easy Way
This method is best for WSL (Windows Subsystem for Linux) Ubuntu:
If you already have Python and pip installed, just execute the command below.
pip install pyspark
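To confirm the installation works end to end, here is a minimal smoke test; the app name is arbitrary, and you can paste this into a python3 session or save it as a script:

from pyspark.sql import SparkSession

# Start a local Spark session using all available cores.
spark = SparkSession.builder.master("local[*]").appName("install-check").getOrCreate()

# A tiny DataFrame confirms the JVM and the Python workers can talk to each other.
df = spark.createDataFrame([(1, "spark"), (2, "pyspark")], ["id", "name"])
df.show()

print("Spark version:", spark.version)
spark.stop()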
Manual Way
Go to the directory where the Spark archive was downloaded and run the following command to extract it:
cd Downloads
sudo tar -zxvf spark-2.4.3-bin-hadoop2.7.tgz
Note: if your Spark archive is a different version, adjust the filename accordingly.
Configure Environment Variables for Spark
This step is only needed if you installed Spark the “Manual Way”. Open .bashrc again:
vim ~/.bashrc
Add the following lines at the end:
export SPARK_HOME=~/Downloads/spark-2.4.3-bin-hadoop2.7
export PATH=$PATH:$SPARK_HOME/bin
export PATH=$PATH:~/anaconda3/bin
export PYTHONPATH=$SPARK_HOME/python:$PYTHONPATH
export PYSPARK_DRIVER_PYTHON="jupyter"
export PYSPARK_DRIVER_PYTHON_OPTS="notebook"
export PYSPARK_PYTHON=python3
export PATH=$PATH:$JAVA_HOME/jre/bin
Save the file and exit. Finally, reload the .bashrc file in the terminal by running
source ~/.bashrc
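With the PYTHONPATH line above, a plain python3 session should now be able to import the bundled PySpark. If import pyspark fails with a py4j error, note that Spark also ships py4j as a zip under $SPARK_HOME/python/lib (the exact filename varies by release), and that zip needs to be on the path too. A small sketch of the workaround from inside Python:

import glob, os, sys

# Spark bundles py4j inside $SPARK_HOME/python/lib; the version in the
# filename differs between Spark releases, so match it with a glob.
spark_home = os.environ["SPARK_HOME"]
sys.path.append(os.path.join(spark_home, "python"))
sys.path.extend(glob.glob(os.path.join(spark_home, "python", "lib", "py4j-*-src.zip")))

import pyspark
print(pyspark.__version__)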
Now run:
pyspark
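Because PYSPARK_DRIVER_PYTHON is set to jupyter above, this command starts a Jupyter notebook server instead of the plain PySpark shell. In a new notebook, the PySpark startup normally predefines spark and sc, so a one-cell check like the following should work (if spark is undefined, the driver variables were not picked up):

# `spark` is predefined by the PySpark startup in the notebook.
spark.range(5).show()   # a small DataFrame with the numbers 0 through 4
print(spark.version)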
Finally, you can verify the install from the Scala side as well; the command below prints the Spark Shell’s version information (run spark-shell without the --version flag to launch the interactive shell):
cd $SPARK_HOME
cd bin
spark-shell --version