PySpark in jupyter notebook using spark-csv package [duplicate] - apache-spark

This page inspired me to try out spark-csv for reading .csv files in PySpark.
I found a couple of posts, such as this one, describing how to use spark-csv.
But I am not able to initialize the IPython instance with either the .jar file or the package included at start-up, as can be done with spark-shell.
That is, instead of
ipython notebook --profile=pyspark
I tried out
ipython notebook --profile=pyspark --packages com.databricks:spark-csv_2.10:1.0.3
but it is not supported.
Please advise.

You can simply pass it in the PYSPARK_SUBMIT_ARGS variable. For example:
export PACKAGES="com.databricks:spark-csv_2.11:1.3.0"
export PYSPARK_SUBMIT_ARGS="--packages ${PACKAGES} pyspark-shell"
This property can also be set dynamically in your code, before the SparkContext / SparkSession and the corresponding JVM have been started:
import os
packages = "com.databricks:spark-csv_2.11:1.3.0"
os.environ["PYSPARK_SUBMIT_ARGS"] = (
    "--packages {0} pyspark-shell".format(packages)
)
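When more than one package is needed, the same variable can be assembled programmatically. The following is only a sketch: build_submit_args and set_submit_args are hypothetical helper names, not part of PySpark, and it relies on --packages accepting a comma-separated list of Maven coordinates.

```python
import os


def build_submit_args(packages, extra_args=()):
    """Return a PYSPARK_SUBMIT_ARGS string for the given Maven coordinates."""
    if not packages:
        raise ValueError("at least one package coordinate is required")
    # --packages takes a single comma-separated list of coordinates.
    parts = ["--packages", ",".join(packages)]
    parts.extend(extra_args)
    # Keep the trailing "pyspark-shell" token, as in the answer above.
    parts.append("pyspark-shell")
    return " ".join(parts)


def set_submit_args(packages, environ=os.environ):
    """Store the arguments in the environment; call before creating a SparkContext."""
    environ["PYSPARK_SUBMIT_ARGS"] = build_submit_args(packages)
    return environ["PYSPARK_SUBMIT_ARGS"]


if __name__ == "__main__":
    print(build_submit_args(["com.databricks:spark-csv_2.11:1.3.0"]))
```

The call has to happen before the JVM is launched; changing the variable afterwards has no effect on an already running context.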

I believe you can also add this as a variable to your spark-defaults.conf file. So something like:
spark.jars.packages com.databricks:spark-csv_2.10:1.3.0
This will load the spark-csv library into PySpark every time you launch the driver.
Obviously zero's answer is more flexible because you can add these lines to your PySpark app before you import the PySpark package:
import os
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages com.databricks:spark-csv_2.10:1.3.0 pyspark-shell'
from pyspark import SparkContext, SparkConf
This way you are only importing the packages you actually need for your script.
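For the spark-defaults.conf route, the entry could also be written by a small script instead of by hand. This is a rough sketch, not an official tool: set_conf_property is a hypothetical helper, and it assumes whitespace-separated "key value" entries like the line shown in the answer.

```python
from pathlib import Path


def set_conf_property(conf_path, key, value):
    """Set `key value` in a spark-defaults.conf style file, replacing any existing entry."""
    path = Path(conf_path)
    lines = path.read_text().splitlines() if path.exists() else []
    kept = []
    for line in lines:
        stripped = line.strip()
        # Keep comments, blank lines, and every property other than `key`.
        is_property = stripped and not stripped.startswith("#")
        if is_property and stripped.split(None, 1)[0] == key:
            continue
        kept.append(line)
    kept.append(f"{key} {value}")
    path.write_text("\n".join(kept) + "\n")
    return path
```

Calling it with the path to your own spark-defaults.conf, the key spark.jars.packages, and the coordinate from the answer should leave exactly one such entry at the end of the file.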

Debug PySpark in VS Code

I'm building a project in PySpark using VS Code. I have Python and Spark installed, and PySpark is correctly imported and running in a Jupyter notebook. To do so, I run:
import findspark
findspark.init()
import pyspark
from pyspark.context import SparkContext
from pyspark.sql.session import SparkSession
sc = SparkContext('local')
spark = SparkSession(sc)
[my code... ]
Now, how do I debug my PySpark code in VS Code? I don't want to run findspark in my project. Do I need to create a virtual environment and run a pre-script? What are the best practices?
The steps I followed and what I learnt:
Make sure Python, Spark and the environment variables are correctly installed and set.
Create a virtual environment and pip install all required libraries (including pyspark).
Run the debugger. The debugger will just analyze the code and not run it, so a SparkContext and a SparkSession are not required.
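The first step could be checked with a few lines of Python before starting the debugger. This is a sketch under assumptions: the answer does not say which variables it means, so SPARK_HOME and JAVA_HOME are my guess at the usual ones, and missing_settings is a hypothetical helper.

```python
import os
import shutil


def missing_settings(environ=os.environ, required_vars=("SPARK_HOME", "JAVA_HOME")):
    """Return the names of required environment variables that are unset or empty."""
    return [name for name in required_vars if not environ.get(name)]


if __name__ == "__main__":
    for name in missing_settings():
        print(f"{name} is not set")
    # shutil.which returns None when the executable is not on PATH.
    if shutil.which("java") is None:
        print("java was not found on PATH")
```

With a pip-installed pyspark inside the virtual environment, SPARK_HOME may not be needed at all, so treat the list as a starting point.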

Not able to import sparkdl in jupyter notebook

I am trying to use the Spark deep learning library (https://github.com/databricks/spark-deep-learning) in a Jupyter notebook.
When I try to "import sparkdl" in the notebook, I get the error "no module found".
When I run the command below in the CLI
pyspark --packages databricks:spark-deep-learning:1.5.0-spark2.4-s_2.11
I am able to import sparkdl in the Spark shell and it works.
How can I use this library in a Jupyter notebook?
Here is the snippet that I use with PySpark 2.4. You will need a connection to the web to be able to install the package.
# Import libraries
from pyspark.sql import SparkSession
# Creating SparkSession
spark = (SparkSession
    .builder
    .config('spark.jars.packages', 'databricks:spark-deep-learning:1.5.0-spark2.4-s_2.11')
    .getOrCreate()
)
# Import Spark Deep Learning Pipelines
import sparkdl
You can check several points:
Use %conda list|grep "sparkdl" in a Jupyter notebook cell to check whether sparkdl is installed as you expect.
Check the virtual environment: is sparkdl installed in a different virtual environment?
Hope this helps.
First, download the sparkdl jar file using the command below:
wget https://repos.spark-packages.org/databricks/spark-deep-learning/1.5.0-spark2.4-s_2.11/spark-deep-learning-1.5.0-spark2.4-s_2.11.jar
Second, install the sparkdl PyPI package using the command below:
pip install sparkdl
Then you can use the snippet below in a Jupyter notebook:
import findspark
findspark.init()
from pyspark.conf import SparkConf
from pyspark import SparkContext
conf = SparkConf().set("spark.jars", "./spark-deep-learning-1.5.0-spark2.4-s_2.11.jar")
conf.setAppName("ML")
sc = SparkContext(conf=conf)
from pyspark.sql import SparkSession
spark = SparkSession(sc)
import sparkdl
This solution doesn't require a web connection once you have downloaded the jar file.
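The download URL in the wget command appears to follow the standard Maven repository layout (group/artifact/version/artifact-version.jar). Assuming that holds for other packages on the same host, the URL for a coordinate could be derived like this; jar_url and jar_file_name are hypothetical helper names.

```python
SPARK_PACKAGES_REPO = "https://repos.spark-packages.org"


def jar_file_name(coordinate):
    """Return `artifact-version.jar` for a `group:artifact:version` coordinate."""
    _group, artifact, version = coordinate.split(":")
    return f"{artifact}-{version}.jar"


def jar_url(coordinate, repo=SPARK_PACKAGES_REPO):
    """Build the jar URL for a coordinate, assuming the Maven repository layout."""
    group, artifact, version = coordinate.split(":")
    # Dots in the group id become path separators in a Maven repository.
    group_path = group.replace(".", "/")
    return f"{repo}/{group_path}/{artifact}/{version}/{jar_file_name(coordinate)}"


if __name__ == "__main__":
    print(jar_url("databricks:spark-deep-learning:1.5.0-spark2.4-s_2.11"))
```

For the coordinate used in this thread, the result matches the URL passed to wget above.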

pySpark not found: value % %pyspark

I am using a Spark cluster on EMR with a Zeppelin notebook alongside it.
I opened the Zeppelin notebook in a web browser, created a notebook, and typed in
%pyspark
I get the error
<console>:26: error: not found: value % %pyspark
How can I use PySpark in Zeppelin? What have I done wrong here?
Try checking your zeppelin.python property. Your default system Python and Zeppelin's Python may have conflicting versions.
Try adding this line to your .bashrc:
export PYSPARK_PYTHON=/home/$USER/path/to/your/default/system/python
You might have missed setting SPARK_HOME, but if that isn't the case you can use the findspark library:
https://github.com/minrk/findspark/blob/master/README.md
import findspark
findspark.init("/path/to/spark/folder")
Or, if you intend to use PySpark 2.2, you can directly run:
pip install pyspark
If the above command throws an error, try it with sudo.
export PYSPARK_PYTHON=/home/user/anaconda3/bin/python
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS='notebook'
Set these environment variables in your IDE or as system variables:
SPARK_HOME = <path to spark home>
PYSPARK_SUBMIT_ARGS = "--master local[2] pyspark-shell"
PYTHONPATH = %SPARK_HOME%\python;%SPARK_HOME%\python\build;%PYTHONPATH%;
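If editing system variables is not convenient, the same three values could be assembled in Python and handed to a child process. This is a sketch only: spark_environment is a hypothetical helper, and the directory layout under SPARK_HOME is copied from the lines above rather than verified for every Spark version.

```python
import os


def spark_environment(spark_home, base_env=None, master="local[2]"):
    """Return a copy of the environment with the Spark variables listed above."""
    env = dict(os.environ if base_env is None else base_env)
    env["SPARK_HOME"] = spark_home
    env["PYSPARK_SUBMIT_ARGS"] = f"--master {master} pyspark-shell"
    python_dirs = [
        os.path.join(spark_home, "python"),
        os.path.join(spark_home, "python", "build"),
    ]
    # Preserve whatever PYTHONPATH was already set, after the Spark entries.
    existing = env.get("PYTHONPATH")
    if existing:
        python_dirs.append(existing)
    # os.pathsep is ";" on Windows and ":" elsewhere.
    env["PYTHONPATH"] = os.pathsep.join(python_dirs)
    return env
```

The returned dictionary can be passed as the env argument of subprocess.run when launching a script or notebook server. Setting PYTHONPATH inside an already running interpreter does not change its import path.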
It may be that the Spark interpreter binding is not set up for that note. There is a gear icon on the right, next to the lock and keyboard icons.
Click that icon and the list of interpreters will be displayed. Make sure the spark binding is blue.
If the spark binding is not listed, use some of the other answers to understand why Zeppelin does not have an available spark binding.

Using Spark packages with Jupyter Notebook on HDInsight

I'm trying to use GraphFrames on PySpark via a Jupyter notebook. My Spark cluster is on HDInsight, so I don't have access to edit kernel.json.
The solutions suggested here and here didn't work. This is what I tried to run:
import os
packages = "graphframes:graphframes:0.3.0-spark2.0" # -s_2.11
os.environ["PYSPARK_SUBMIT_ARGS"] = (
    "--packages {0} pyspark-shell".format(packages)
)
from graphframes import *
This resulted in an error that a module named graphframes doesn't exist. Is there a way to initiate a new SparkContext after changing this env variable?
I've also tried passing the PYSPARK_SUBMIT_ARGS variable to IPython via the %set_env magic command and then importing graphframes:
%set_env PYSPARK_SUBMIT_ARGS='--packages graphframes:graphframes:0.3.0-spark2.0-s_2.11 pyspark-shell'
from graphframes import *
But this resulted in the same error.
I saw some suggestions to pass the jar to IPython, but I'm not sure how to download the needed jar to my HDInsight cluster.
Do you have any suggestions?
It turns out I had two separate issues:
1) I was using the wrong syntax to configure the notebook. You should use:
# For HDInsight 3.3 and HDInsight 3.4
%%configure
{ "packages":["com.databricks:spark-csv_2.10:1.4.0"] }
# For HDInsight 3.5
%%configure
{ "conf": {"spark.jars.packages": "com.databricks:spark-csv_2.10:1.4.0" }}
Here are the relevant docs from Microsoft.
2) According to this useful answer, there seems to be a bug in Spark that causes it to miss the package's jar. This worked for me:
sc.addPyFile(os.path.expanduser('./graphframes_graphframes-0.3.0-spark2.0-s_2.11.jar'))
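The body of the %%configure cell in point 1 is plain JSON, so it could be generated rather than typed when several packages are involved. A minimal sketch, assuming the HDInsight 3.5 form shown above and that spark.jars.packages takes a comma-separated list; configure_body is a hypothetical helper.

```python
import json


def configure_body(packages):
    """Return the JSON body for a %%configure cell (HDInsight 3.5 form shown above)."""
    if not packages:
        raise ValueError("at least one package coordinate is required")
    # spark.jars.packages takes a comma-separated list of Maven coordinates.
    return json.dumps({"conf": {"spark.jars.packages": ",".join(packages)}})


if __name__ == "__main__":
    print(configure_body(["graphframes:graphframes:0.3.0-spark2.0-s_2.11"]))
```

The printed string is meant to be pasted on the line after %%configure in the first cell of the notebook, before any Spark session has started.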
