I recently used jar files to enable MongoDB integration with Spark, so I type:
pyspark --jars mongo-hadoop-spark-2.0.2.jar,mongo-java-driver-3.4.2.jar,mongo-hadoop-2.0.2.jar
which lets me interact with a MongoDB database from the pyspark shell.
Secondly, I use Jupyter Notebook, launched with the command 'jupyter notebook', and write:
import findspark
findspark.init()
import pyspark
sc = pyspark.SparkContext()
to run pyspark commands inside Jupyter.
How could I tell Spark to automatically pick up my jar files the way the shell command above does? Is there some config file I should edit inside the Spark directory (in my $SPARK_HOME), or can I do it from inside the Jupyter notebook?
Thanks.
PS: I am a newbie at this ;)
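One option, sketched here under the assumption that the jar files sit in the notebook's working directory, is the PYSPARK_SUBMIT_ARGS technique shown in the spark-csv answers further down this page: set the variable with --jars before the SparkContext is created.
import os
import findspark
# Same --jars list as the shell command above; the paths are assumed to be
# relative to the notebook's working directory.
os.environ["PYSPARK_SUBMIT_ARGS"] = (
    "--jars mongo-hadoop-spark-2.0.2.jar,"
    "mongo-java-driver-3.4.2.jar,"
    "mongo-hadoop-2.0.2.jar pyspark-shell"
)
findspark.init()
import pyspark
sc = pyspark.SparkContext()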
I have written the following pyspark code.
from pyspark.sql import SparkSession
import sys
import sklearn
spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext
print (sys.version_info)
When I run with:
spark-submit --master yarn --deploy-mode client test.py
it executes correctly. However, when I change --deploy-mode to cluster, i.e.:
spark-submit --master yarn --deploy-mode cluster test.py
I see the following error, and I have no idea why it happens or how to resolve it:
ImportError: No module named sklearn
I have seen this post, but it did not help me.
--deploy-mode client uses the machine from which you submit your Spark application as the driver, and that machine obviously has the sklearn package installed. With --deploy-mode cluster, however, the driver is picked from the cluster's available resources, so you don't know upfront which machine will run the driver, and that machine may not have sklearn installed, hence the error you're facing. The solution is to install the sklearn package on every node of your cluster.
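As a quick, hedged sanity check (not part of the original answer), you can also ask the worker nodes whether sklearn imports there; the check_sklearn helper below is just an illustrative name.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

def check_sklearn(_):
    # Runs on the executors: report the executor's hostname and its sklearn
    # version, or None if the import fails there.
    import socket
    try:
        import sklearn
        return [(socket.gethostname(), sklearn.__version__)]
    except ImportError:
        return [(socket.gethostname(), None)]

# One partition per default-parallelism slot gives a rough per-executor sample.
print(sc.parallelize(range(sc.defaultParallelism), sc.defaultParallelism)
        .mapPartitions(check_sklearn)
        .collect())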
I'm taking an online course on Apache PySpark using Jupyter notebooks. To make it easy to open the Jupyter notebooks, the course had me add these lines to my bash profile (I'm using macOS):
export SPARK_HOME="(INSERTED MY SPARK DIRECTORY)"
export PATH=$SPARK_HOME/bin:$PATH
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS='notebook'
I'm not too familiar with Linux, and the course didn't explain what these lines do. Before I did this, I could access PySpark via the command line by typing "pyspark". But now when I type "pyspark" it opens a Jupyter notebook, and I can't figure out how to access PySpark from the command line anymore. What does this code do, and how do I access command-line PySpark?
Are you using a local installation of PySpark?
You can use https://github.com/minrk/findspark
Install findspark using Anaconda.
First, add the first two lines so Python can find pyspark, then create the context:
import findspark
findspark.init()
import pyspark
sc = pyspark.SparkContext(appName="myAppName")
This page inspired me to try out spark-csv for reading .csv files in PySpark.
I found a couple of posts, such as this one, describing how to use spark-csv.
But I am not able to start the IPython instance with the .jar file or the package included at start-up, the way it can be done with spark-shell.
That is, instead of
ipython notebook --profile=pyspark
I tried out
ipython notebook --profile=pyspark --packages com.databricks:spark-csv_2.10:1.0.3
but it is not supported.
Please advise.
You can simply pass it in the PYSPARK_SUBMIT_ARGS variable. For example:
export PACKAGES="com.databricks:spark-csv_2.11:1.3.0"
export PYSPARK_SUBMIT_ARGS="--packages ${PACKAGES} pyspark-shell"
This property can also be set dynamically in your code, before the SparkContext / SparkSession and the corresponding JVM have been started:
import os

packages = "com.databricks:spark-csv_2.11:1.3.0"
os.environ["PYSPARK_SUBMIT_ARGS"] = (
    "--packages {0} pyspark-shell".format(packages)
)
I believe you can also add this as a property in your spark-defaults.conf file. Something like:
spark.jars.packages com.databricks:spark-csv_2.10:1.3.0
This will load the spark-csv library into PySpark every time you launch the driver.
Obviously zero's answer is more flexible because you can add these lines to your PySpark app before you import the PySpark package:
import os
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages com.databricks:spark-csv_2.10:1.3.0 pyspark-shell'
from pyspark import SparkContext, SparkConf
This way you are only importing the packages you actually need for your script.
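For completeness, once the package is on the classpath, a typical spark-csv read looks roughly like this (Spark 1.x style; the file name is just a placeholder):
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext()
sqlContext = SQLContext(sc)

# Options follow the spark-csv README; "some_file.csv" is a placeholder path.
df = (sqlContext.read
      .format("com.databricks.spark.csv")
      .option("header", "true")
      .option("inferSchema", "true")
      .load("some_file.csv"))
df.printSchema()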
I have a cluster of 4 nodes with Spark already installed, and I use pyspark or spark-shell to launch Spark and start programming.
I know how to use Zeppelin, but I would like to use Jupyter as the programming interface (IDE) instead because it's more useful.
I read that I should export these 2 variables in my .bashrc to make it work:
export PYSPARK_DRIVER_PYTHON="jupyter"
export PYSPARK_DRIVER_PYTHON_OPTS="notebook"
How can I use PySpark with Jupyter?
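As a sketch of what those two exports do: with them in place, running pyspark from a terminal opens Jupyter instead of the plain shell, and a notebook started that way already has the SparkContext that the pyspark launcher normally creates, so a first cell can use sc (and, on Spark 2.x, spark) directly:
# Inside a notebook launched via `pyspark` with the two exports above:
# the launcher pre-creates `sc`, so no SparkContext construction is needed.
print(sc.version)
rdd = sc.parallelize(range(10))
print(rdd.sum())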