I'm building a project in PySpark using VS Code. I have Python and Spark installed and PySpark is correctly imported and running in a Jupyter notebook. Do do so, I run:
import findspark
findspark.init()
import pyspark
from pyspark.context import SparkContext
from pyspark.sql.session import SparkSession
sc = SparkContext('local')
spark = SparkSession(sc)
[my code... ]
Now, how do I debug my PySpark code in VS Code? I don't want to run findspark in my project. Do I need to create a virtual environment and run a pre-script? What are the best practices?
So the steps I followed and what I learnt:
make sure Python, Spark and the environmental variables are correctly installed and set
Create a virtual environment, and pip install all required libraries (including pyspark)
Run de debugger. The debugger will just analyze the code and not run it. So a SparkContext and a SparkSession are not required.
Related
I'm taking an online course on Apache PySpark using Jupyter notebooks. In order to easily open the Jupyter notebooks they had me enter these lines of code into my bash profile (I'm using MAC OS):
export SPARK_HOME="(INSERTED MY SPARK DIRECTORY)"
export PATH=$SPARK_HOME/bin:$PATH
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS='notebook'
I'm not too familiar with Linux and the course didn't explain what these lines of code do. Before I did this, I could access PySpark via the command line by typing "pyspark". But now when I type "pyspark" it opens a jupyter notebook. Now I can't figure out how to access it from the command line. What does this code do and how do I access command line pyspark?
Are you using local installation of Pyspark?
You can use https://github.com/minrk/findspark
Install findspark using Anaconda.
First, you add these two lines and it will be able to find pyspark.
import findspark
findspark.init()
import pyspark
sc = pyspark.SparkContext(appName="myAppName")
we have developed the code using databricks connect and used from pyspark.dbutils import DBUtils, while packaging the code to databricks in a wheel file, it fails with an error no module found pyspark.dbutils.
There is no pip install pyspark.dbutils present.
How to fix this?
Dbutils should already be available with databricks-connect, so import it using this script:
from pyspark.sql import SparkSession
from pyspark import dbutils
import argparse
spark = SparkSession.builder.getOrCreate()
setting = spark.conf.get("spark.master")
if "local" in setting:
from pyspark.dbutils import DBUtils
dbutils = DBUtils(spark.sparkContext)
else:
print("Do nothing - dbutils should be available already")
I am trying to use spark deep learning library(https://github.com/databricks/spark-deep-learning) in jupyter notebook.
When I try to "import sparkdl" in jupyter notebook I am getting error "no module found".
When I am running the below command in cli
pyspark --packages databricks:spark-deep-learning:1.5.0-spark2.4-s_2.11
I am able to import sparkdl in the spark shell and its working.
How can I use this library in jupyter notebook?
Here is the snippet that I use with PySpark 2.4. You will need a connection to the web to be able to install the package.
# Import libraries
from pyspark.sql import SparkSession
# Creating SparkSession
spark = (SparkSession
.builder
.config('spark.jars.packages', 'databricks:spark-deep-learning:1.5.0-spark2.4-s_2.11')
.getOrCreate()
)
# Import Spar-Deep-Learning-Pipelines
import sparkdl
You can check several points:
use %conda list|grep "sparkdl" in jupyter notebook cell to check if sparkdl is installed as you wish.
virtual environment. Is sparkdl installed into another virtual environment?
hope this could help u.
First you have to download sparkdl jar file using below command:
wget https://repos.spark-packages.org/databricks/spark-deep-learning/1.5.0-spark2.4-s_2.11/spark-deep-learning-1.5.0-spark2.4-s_2.11.jar
Second you have to install sparkdl pypi package using below command:
pip install sparkdl
Then you can use the below snippet in jupyter notebook:
import findspark
findspark.init()
from pyspark.conf import SparkConf
from pyspark import SparkContext
conf = SparkConf().set("spark.jars", "./spark-deep-learning-1.5.0-spark2.4-s_2.11.jar")
conf.setAppName("ML")
sc = SparkContext(conf=conf)
from pyspark.sql import SparkSession
spark = SparkSession(sc)
import sparkdl
This solution doesn't require a web connection once you download the jar file Krystof
This page was inspiring me to try out spark-csv for reading .csv file in PySpark
I found a couple of posts such as this describing how to use spark-csv
But I am not able to initialize the ipython instance by including either the .jar file or package extension in the start-up that could be done through spark-shell.
That is, instead of
ipython notebook --profile=pyspark
I tried out
ipython notebook --profile=pyspark --packages com.databricks:spark-csv_2.10:1.0.3
but it is not supported.
Please advise.
You can simply pass it in the PYSPARK_SUBMIT_ARGS variable. For example:
export PACKAGES="com.databricks:spark-csv_2.11:1.3.0"
export PYSPARK_SUBMIT_ARGS="--packages ${PACKAGES} pyspark-shell"
These property can be also set dynamically in your code before SparkContext / SparkSession and corresponding JVM have been started:
packages = "com.databricks:spark-csv_2.11:1.3.0"
os.environ["PYSPARK_SUBMIT_ARGS"] = (
"--packages {0} pyspark-shell".format(packages)
)
I believe you can also add this as a variable to your spark-defaults.conf file. So something like:
spark.jars.packages com.databricks:spark-csv_2.10:1.3.0
This will load the spark-csv library into PySpark every time you launch the driver.
Obviously zero's answer is more flexible because you can add these lines to your PySpark app before you import the PySpark package:
import os
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages com.databricks:spark-csv_2.10:1.3.0 pyspark-shell'
from pyspark import SparkContext, SparkConf
This way you are only importing the packages you actually need for your script.
This page was inspiring me to try out spark-csv for reading .csv file in PySpark
I found a couple of posts such as this describing how to use spark-csv
But I am not able to initialize the ipython instance by including either the .jar file or package extension in the start-up that could be done through spark-shell.
That is, instead of
ipython notebook --profile=pyspark
I tried out
ipython notebook --profile=pyspark --packages com.databricks:spark-csv_2.10:1.0.3
but it is not supported.
Please advise.
You can simply pass it in the PYSPARK_SUBMIT_ARGS variable. For example:
export PACKAGES="com.databricks:spark-csv_2.11:1.3.0"
export PYSPARK_SUBMIT_ARGS="--packages ${PACKAGES} pyspark-shell"
These property can be also set dynamically in your code before SparkContext / SparkSession and corresponding JVM have been started:
packages = "com.databricks:spark-csv_2.11:1.3.0"
os.environ["PYSPARK_SUBMIT_ARGS"] = (
"--packages {0} pyspark-shell".format(packages)
)
I believe you can also add this as a variable to your spark-defaults.conf file. So something like:
spark.jars.packages com.databricks:spark-csv_2.10:1.3.0
This will load the spark-csv library into PySpark every time you launch the driver.
Obviously zero's answer is more flexible because you can add these lines to your PySpark app before you import the PySpark package:
import os
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages com.databricks:spark-csv_2.10:1.3.0 pyspark-shell'
from pyspark import SparkContext, SparkConf
This way you are only importing the packages you actually need for your script.