How to access Apache PySpark from command line? - apache-spark

I'm taking an online course on Apache PySpark using Jupyter notebooks. In order to easily open the Jupyter notebooks they had me enter these lines of code into my bash profile (I'm using MAC OS):
export SPARK_HOME="(INSERTED MY SPARK DIRECTORY)"
export PATH=$SPARK_HOME/bin:$PATH
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS='notebook'
I'm not too familiar with Linux and the course didn't explain what these lines of code do. Before I did this, I could access PySpark via the command line by typing "pyspark". But now when I type "pyspark" it opens a jupyter notebook. Now I can't figure out how to access it from the command line. What does this code do and how do I access command line pyspark?

Are you using local installation of Pyspark?
You can use https://github.com/minrk/findspark
Install findspark using Anaconda.
First, you add these two lines and it will be able to find pyspark.
import findspark
findspark.init()
import pyspark
sc = pyspark.SparkContext(appName="myAppName")

Related

SparkSession | Ubuntu | Pycharm not working

I'm trying to use PySpark locally on Ubuntu using PyCharm rather than a jupyter notebook in order to build an Electron app. However, when I'm trying to set up a SparkSession, it doesn't work. When I try this:
spark = SparkSession.builder.master('local[*]').appName('Search').enableHiveSupport().getOrCreate
df = pd.DataFrame([1,2,3], columns=['Test'])
myschema = StructType([StructField('Test'),Integertype(),True)])
df2 = spark.createDataFrame(df,schema=myschema)
print(type(df2))
the session opens but it tells me
"AttributeError: 'function' object has no attribute 'createDataFrame' "
Then, rewrite the above with ".getOrCreate()" and it tells me
"FileNotFoundError: [Error 2] No such file or directory "home/...././bin/spark-submit'
I guess the set up in Pycharm might be off, but I don't really understand why.
You need to to use method invocation getOrCreate(), not getOrCreate. Also, make sure you install pyspark inside the python interpreter used for your project in pycharm. You can access it via Preferences -> Python Interpreter in pycharm.
Update:
Try downloading and extracting the spark binaries (e.g. spark 2.4.0) on your local and then add the following entries in your bashrc (and source it). I'm assuming you're using spark 2.4.0, so the py4j is specific to this version. For any other version of spark, check the py4j version and add accordingly.
export SPARK_HOME=/<your_path>/spark-2.4.0-bin-hadoop2.7
export PYTHONPATH=${SPARK_HOME}/python:$PYTHONPATH
export PYTHONPATH=${SPARK_HOME}/python/lib/py4j-0.10.7-src.zip:$PYTHONPATH
export PYSPARK_PYTHON=/<location_of_python_interpreter>
Whatever python interpreter you're linking into PYSPARK_PYTHON, make sure to use the same python in your pycharm project.

PySpark: No such file or directory error when creating a SparkContext on Jupyter Notebook [duplicate]

I've spent a few days now trying to make Spark work with my Jupyter Notebook and Anaconda. Here's what my .bash_profile looks like:
PATH="/my/path/to/anaconda3/bin:$PATH"
export JAVA_HOME="/my/path/to/jdk"
export PYTHON_PATH="/my/path/to/anaconda3/bin/python"
export PYSPARK_PYTHON="/my/path/to/anaconda3/bin/python"
export PATH=$PATH:/my/path/to/spark-2.1.0-bin-hadoop2.7/bin
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS="notebook" pyspark
export SPARK_HOME=/my/path/to/spark-2.1.0-bin-hadoop2.7
alias pyspark="pyspark --conf spark.local.dir=/home/puifais --num-executors 30 --driver-memory 128g --executor-memory 6g --packages com.databricks:spark-csv_2.11:1.5.0"
When I type /my/path/to/spark-2.1.0-bin-hadoop2.7/bin/spark-shell, I can launch Spark just fine in my command line shell. And the output sc is not empty. It seems to work fine.
When I type pyspark, it launches my Jupyter Notebook fine. When I create a new Python3 notebook, this error appears:
[IPKernelApp] WARNING | Unknown error in handling PYTHONSTARTUP file /my/path/to/spark-2.1.0-bin-hadoop2.7/python/pyspark/shell.py:
And sc in my Jupyter Notebook is empty.
Can anyone help solve this situation?
Just want to clarify: There is nothing after the colon at the end of the error. I also tried to create my own start-up file using this post and I quote here so you don't have to go look there:
I created a short initialization script init_spark.py as follows:
from pyspark import SparkConf, SparkContext
conf = SparkConf().setMaster("yarn-client")
sc = SparkContext(conf = conf)
and placed it in the ~/.ipython/profile_default/startup/ directory
When I did this, the error then became:
[IPKernelApp] WARNING | Unknown error in handling PYTHONSTARTUP file /my/path/to/spark-2.1.0-bin-hadoop2.7/python/pyspark/shell.py:
[IPKernelApp] WARNING | Unknown error in handling startup files:
Well, it really gives me pain to see how crappy hacks, like setting PYSPARK_DRIVER_PYTHON=jupyter, have been promoted to "solutions" and tend now to become standard practices, despite the fact that they evidently lead to ugly outcomes, like typing pyspark and ending up with a Jupyter notebook instead of a PySpark shell, plus yet-unseen problems lurking downstream, such as when you try to use spark-submit with the above settings... :(
(Don't get me wrong, it is not your fault and I am not blaming you; I have seen dozens of posts here at SO where this "solution" has been proposed, accepted, and upvoted...).
At the time of writing (Dec 2017), there is one and only one proper way to customize a Jupyter notebook in order to work with other languages (PySpark here), and this is the use of Jupyter kernels.
The first thing to do is run a jupyter kernelspec list command, to get the list of any already available kernels in your machine; here is the result in my case (Ubuntu):
$ jupyter kernelspec list
Available kernels:
python2 /usr/lib/python2.7/site-packages/ipykernel/resources
caffe /usr/local/share/jupyter/kernels/caffe
ir /usr/local/share/jupyter/kernels/ir
pyspark /usr/local/share/jupyter/kernels/pyspark
pyspark2 /usr/local/share/jupyter/kernels/pyspark2
tensorflow /usr/local/share/jupyter/kernels/tensorflow
The first kernel, python2, is the "default" one coming with IPython (there is a great chance of this being the only one present in your system); as for the rest, I have 2 more Python kernels (caffe & tensorflow), an R one (ir), and two PySpark kernels for use with Spark 1.6 and Spark 2.0 respectively.
The entries of the list above are directories, and each one contains one single file, named kernel.json. Let's see the contents of this file for my pyspark2 kernel:
{
"display_name": "PySpark (Spark 2.0)",
"language": "python",
"argv": [
"/opt/intel/intelpython27/bin/python2",
"-m",
"ipykernel",
"-f",
"{connection_file}"
],
"env": {
"SPARK_HOME": "/home/ctsats/spark-2.0.0-bin-hadoop2.6",
"PYTHONPATH": "/home/ctsats/spark-2.0.0-bin-hadoop2.6/python:/home/ctsats/spark-2.0.0-bin-hadoop2.6/python/lib/py4j-0.10.1-src.zip",
"PYTHONSTARTUP": "/home/ctsats/spark-2.0.0-bin-hadoop2.6/python/pyspark/shell.py",
"PYSPARK_PYTHON": "/opt/intel/intelpython27/bin/python2"
}
}
I have not bothered to change my details to /my/path/to etc., and you can already see that there are some differences between our cases (I use Intel Python 2.7, and not Anaconda Python 3), but hopefully you get the idea (BTW, don't worry about the connection_file - I don't use one either).
Now, the easiest way for you would be to manually do the necessary changes (paths only) to my above shown kernel and save it in a new subfolder of the .../jupyter/kernels directory (that way, it should be visible if you run again a jupyter kernelspec list command). And if you think this approach is also a hack, well, I would agree with you, but it is the one recommended in the Jupyter documentation (page 12):
However, there isn’t a great way to modify the kernelspecs. One approach uses jupyter kernelspec list to find the kernel.json file and then modifies it, e.g. kernels/python3/kernel.json, by hand.
If you don't have already a .../jupyter/kernels folder, you can still install a new kernel using jupyter kernelspec install - haven't tried it, but have a look at this SO answer.
Finally, don't forget to remove all the PySpark-related environment variables from your bash profile (leaving only SPARK_HOME should be OK). And confirm that, when you type pyspark, you find yourself with a PySpark shell, as it should be, and not with a Jupyter notebook...
UPDATE (after comment): If you want to pass command-line arguments to PySpark, you should add the PYSPARK_SUBMIT_ARGS setting under env; for example, here is the last line of my respective kernel file for Spark 1.6.0, where we still had to use the external spark-csv package for reading CSV files:
"PYSPARK_SUBMIT_ARGS": "--master local --packages com.databricks:spark-csv_2.10:1.4.0 pyspark-shell"
Conda can help correctly manage a lot of dependencies...
Install spark. Assuming spark is installed in /opt/spark, include this in your ~/.bashrc:
export SPARK_HOME=/opt/spark
export PATH=$SPARK_HOME/bin:$PATH
Create a conda environment with all needed dependencies apart from spark:
conda create -n findspark-jupyter-openjdk8-py3 -c conda-forge python=3.5 jupyter=1.0 notebook=5.0 openjdk=8.0.144 findspark=1.1.0
Activate the environment
$ source activate findspark-jupyter-openjdk8-py3
Launch a Jupyter Notebook server:
$ jupyter notebook
In your browser, create a new Python3 notebook
Try calculating PI with the following script (borrowed from this)
import findspark
findspark.init()
import pyspark
import random
sc = pyspark.SparkContext(appName="Pi")
num_samples = 100000000
def inside(p):
x, y = random.random(), random.random()
return x*x + y*y < 1
count = sc.parallelize(range(0, num_samples)).filter(inside).count()
pi = 4 * count / num_samples
print(pi)
sc.stop()
I just conda installed sparkmagic (after re-installing a newer version of Spark).
I think that alone simply works, and it is much simpler than fiddling configuration files by hand.

Use Jupyter notebook with pyspark in my cluster HDP

I have a cluster of 4 nodes where it's already installed Spark, I use Pyspark or spark-shell to launch spark and start programming.
I knew how to use Zepplin, but I would like to use jupyter instead as Programation interface (IDE) because it's more useful.
I read that I should export this 2 variable to my .bashrc to make it work:
export PYSPARK_DRIVER_PYTHON="jupyter"
export PYSPARK_DRIVER_PYTHON_OPTS="notebook"
how can I use Pyspark with jupyter?

pySpark not found: value % %pyspark

I am using spark cluster on EMR with Zepplin notebook along with it
I opened the Zepplin notebook in webbroswer and created a notebook, typed in
%pyspark
get the error
<console>:26: error: not found: value % %pyspark
how can I use pyspark in Zepplin ? What have I done wrong here?
Try checking into your zeppelin.python property. Maybe your default system python and Zeppelins' Python have a conflict in their versions.
Try adding this line to your .bashrc
export PYSPARK_PYTHON=/home/$USER/path/to/your/default/system/python
You might have missed settig SPARK_HOME but if it isnt the case you can use findspark library
https://github.com/minrk/findspark/blob/master/README.md
Import findspark
findspark.find(path to spark folder)
Or if you intend to use pyspark 2.2 you can directly do
pip install pyspark
And if the above line throws error try it with sudo
export PYSPARK_PYTHON=/home/user/anaconda3/bin/python
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS='notebook'
Set these environment variables in IDE or system variables
SPARK_HOME = <path to spark home>
PYSPARK_SUBMIT_ARGS = "--master local[2] pyspark-shell"
PYTHONPATH = %SPARK_HOME%\python;%SPARK_HOME%\python\build;%PYTHONPATH%;
It may be the interpreter binding for spark is not set up in that note. There is a gear icon on the right next to a lock and keyboard icon.
Click that icon and the interpreters list will be displayed. Make sure the spark binding is blue.
If the spark binding is not listed, use some of these other answers to understand why Zeppelin does not have an available spark binding.

PySpark in jupyter notebook using spark-csv package [duplicate]

This page was inspiring me to try out spark-csv for reading .csv file in PySpark
I found a couple of posts such as this describing how to use spark-csv
But I am not able to initialize the ipython instance by including either the .jar file or package extension in the start-up that could be done through spark-shell.
That is, instead of
ipython notebook --profile=pyspark
I tried out
ipython notebook --profile=pyspark --packages com.databricks:spark-csv_2.10:1.0.3
but it is not supported.
Please advise.
You can simply pass it in the PYSPARK_SUBMIT_ARGS variable. For example:
export PACKAGES="com.databricks:spark-csv_2.11:1.3.0"
export PYSPARK_SUBMIT_ARGS="--packages ${PACKAGES} pyspark-shell"
These property can be also set dynamically in your code before SparkContext / SparkSession and corresponding JVM have been started:
packages = "com.databricks:spark-csv_2.11:1.3.0"
os.environ["PYSPARK_SUBMIT_ARGS"] = (
"--packages {0} pyspark-shell".format(packages)
)
I believe you can also add this as a variable to your spark-defaults.conf file. So something like:
spark.jars.packages com.databricks:spark-csv_2.10:1.3.0
This will load the spark-csv library into PySpark every time you launch the driver.
Obviously zero's answer is more flexible because you can add these lines to your PySpark app before you import the PySpark package:
import os
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages com.databricks:spark-csv_2.10:1.3.0 pyspark-shell'
from pyspark import SparkContext, SparkConf
This way you are only importing the packages you actually need for your script.

Resources