How to run a script in PySpark - apache-spark

I'm trying to run a script in the pyspark environment but so far I haven't been able to.
How can I run a script like python script.py but in pyspark?

You can do: ./bin/spark-submit mypythonfile.py
Running Python applications through the pyspark shell is not supported as of Spark 2.0.
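For what it's worth, a minimal sketch of what mypythonfile.py could contain (the app name and the toy DataFrame are placeholders); unlike the interactive shell, a submitted script has to create its own SparkSession:
# mypythonfile.py - minimal sketch for spark-submit (names are placeholders)
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('myApp').getOrCreate()
df = spark.range(10)        # tiny example DataFrame
print(df.count())           # any action, just to prove the job ran
spark.stop()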

pyspark 2.0 and later execute the script file referenced by the PYTHONSTARTUP environment variable, so you can run:
PYTHONSTARTUP=code.py pyspark
Compared to the spark-submit answer, this is useful for running initialization code before using the interactive pyspark shell.
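For example, a startup file could add imports and tweak the session that the shell has already created (a sketch; the shuffle-partitions value is just an example):
# code.py - executed by the pyspark shell via PYTHONSTARTUP (a sketch)
from pyspark.sql import functions as F   # shorthand that is handy in interactive use

# the shell has already created `spark` and `sc` by the time this runs
spark.conf.set('spark.sql.shuffle.partitions', '8')   # example runtime setting
print('startup file loaded, Spark version:', spark.version)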

Just spark-submit mypythonfile.py should be enough.

You can execute script.py as follows:
pyspark < script.py
or
# if you want to run pyspark on a YARN cluster
pyspark --master yarn < script.py
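Note that a script piped in this way runs inside the shell, so it can use the spark session and sc context that the shell has already created (a small sketch; the DataFrame is a placeholder):
# script.py - piped into the shell with: pyspark < script.py (a sketch)
# `spark` and `sc` already exist here, so no SparkSession.builder is needed
df = spark.range(100)
print(df.count())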

Existing answers are right (that is, use spark-submit), but some of us might want to just get started with a SparkSession object as in the pyspark shell.
So in the PySpark script to be run, first add:
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .master('yarn') \
    .appName('pythonSpark') \
    .enableHiveSupport() \
    .getOrCreate()
Then use spark.conf.set('conf_name', 'conf_value') to set configuration values; note that settings like executor cores and memory generally have to be fixed before the session is created, for example with .config() on the builder.
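A short sketch of both styles, with arbitrary example values (the resource numbers are placeholders):
from pyspark.sql import SparkSession

# resource settings are applied while building the session (example values)
spark = SparkSession.builder \
    .master('yarn') \
    .appName('pythonSpark') \
    .config('spark.executor.memory', '4g') \
    .config('spark.executor.cores', '2') \
    .enableHiveSupport() \
    .getOrCreate()

# runtime-changeable settings can still be adjusted afterwards
spark.conf.set('spark.sql.shuffle.partitions', '200')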

The Spark environment provides a command to execute an application file, whether it is a Scala or Java (jar), Python, or R program.
The command is:
$ spark-submit --master <url> <SCRIPTNAME>.py
I'm running Spark on a 64-bit Windows system with JDK 1.8.

Related

How to add jar files to $SPARK_HOME/jars correctly?

I have used this command and it works fine:
spark = SparkSession.builder.appName('Apptest')\
.config('spark.jars.packages', 'org.mongodb.spark:mongo-spark-connector_2.11:2.3.5').getOrCreate()
But I'd like to download the jar file and always start with:
spark = SparkSession.builder.appName('Apptest').getOrCreate()
How can I do it? I have tried:
Move to the SPARK_HOME jars directory:
cd /de/spark-2.4.6-bin-hadoop2.7/jars
Download the jar file:
curl https://repo1.maven.org/maven2/org/mongodb/spark/mongo-spark-connector_2.11/2.3.5/mongo-spark-connector_2.11-2.3.5.jar --output mongo-spark-connector_2.11-2.3.5.jar
But Spark doesn't see it. I get the following error:
Py4JJavaError: An error occurred while calling o66.save.
: java.lang.NoClassDefFoundError: com/mongodb/ConnectionString
I know there is the ./spark-shell --jars option, but I am using a Jupyter notebook. Is there some step missing?
Since you're using SparkSession in a Jupyter notebook, unfortunately you have to use .config('spark.jars.packages', '...') to add the jars you want when you create the spark object.
If instead you want the jar added by default when you launch the notebook, I would recommend creating a custom kernel, so that every time you create a new notebook you don't even need to create the SparkSession yourself. If you're using Anaconda, you can check the docs: https://docs.anaconda.com/ae-notebooks/admin-guide/install/config/custom-pyspark-kernel/
What I was looking for is .config("spark.jars",".."):
spark = SparkSession.builder.appName('Test')\
.config("spark.jars", "/root/mongo-spark-connector_2.11-2.3.5.jar,/root/mongo-java-driver-3.12.5.jar") \
.getOrCreate()
Or:
import os
os.environ["PYSPARK_SUBMIT_ARGS"]="--jars /root/mongo-spark-connector_2.11-2.3.5.jar,/root/mongo-java-driver-3.12.5.jar pyspark-shell"
Also, it seems that just putting the jar files in $SPARK_HOME/jars works as well, but in my case the example from the question was missing the dependency mongo-java-driver-3.12.5.jar. After downloading all the dependencies into $SPARK_HOME/jars I was able to run with only:
spark = SparkSession.builder.appName('Test').getOrCreate()
I found the dependencies at: https://mvnrepository.com/artifact/org.mongodb.spark/mongo-spark-connector_2.11/2.3.5
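As a quick sanity check that the connector jars are actually on the classpath, a read along these lines should succeed instead of raising NoClassDefFoundError (the URI, database, and collection are placeholders, and the fully qualified source name is used rather than a short alias):
# sketch: verify the MongoDB connector is loadable (connection details are placeholders)
df = spark.read.format('com.mongodb.spark.sql.DefaultSource') \
    .option('uri', 'mongodb://127.0.0.1:27017/mydb.mycollection') \
    .load()
df.printSchema()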

How to run the pyspark shell code as a job like we run Python jobs

I am very new to spark.
I have written the Spark code in the pyspark REPL shell; how can I run it as a script?
I tried python script.py but it fails as it cannot access spark libraries.
spark-submit script.py is the way to submit PySpark code.
Before using this, please make sure pyspark and Spark are properly set up in your environment.
If they are not set up yet, do that first.

Is there a way to use PySpark with Hadoop 2.8+?

I would like to run a PySpark job locally, using a specific version of Hadoop (let's say hadoop-aws 2.8.5) because of some features.
PySpark versions seem to be aligned with Spark versions.
Here I use PySpark 2.4.5, which seems to wrap Spark 2.4.5.
When submitting my PySpark job with spark-submit --master local[4] ..., with the option --conf spark.jars.packages=org.apache.hadoop:hadoop-aws:2.8.5, I encounter the following error:
py4j.protocol.Py4JJavaError: An error occurred while calling o32.sql
With the following java exceptions:
java.lang.NoClassDefFoundError: org/apache/hadoop/fs/StorageStatistics
Or:
java.lang.IllegalAccessError: tried to access method org.apache.hadoop.metrics2.lib.MutableCounterLong.<init>(Lorg/apache/hadoop/metrics2/MetricsInfo;J)V from class org.apache.hadoop.fs.s3a.S3AInstrumentation
I suppose that the Hadoop version used by the PySpark job is not aligned with the one I pass via the spark-submit option spark.jars.packages.
But I have no idea how to make it work. :)
The default Spark distro has the Hadoop libraries included. Spark uses its own (system) libraries first, so you should either set --conf spark.driver.userClassPathFirst=true (and, for a cluster, also add --conf spark.executor.userClassPathFirst=true) or download a Spark distro without Hadoop. You will probably have to put your Hadoop distro's jars into the Spark distro's jars directory.
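Combining that with the submit command from the question, the invocation might look roughly like this (the script name is a placeholder):
spark-submit --master local[4] --conf spark.jars.packages=org.apache.hadoop:hadoop-aws:2.8.5 --conf spark.driver.userClassPathFirst=true --conf spark.executor.userClassPathFirst=true my_job.py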
Ok, I found a solution:
1 - Install Hadoop in the expected version (2.8.5 for me)
2 - Install a Hadoop Free version of Spark (2.4.4 for me)
3 - Set the SPARK_DIST_CLASSPATH environment variable, to make Spark use the custom version of Hadoop (see the note after this list).
(cf. https://spark.apache.org/docs/2.4.4/hadoop-provided.html)
4 - Add the PySpark directories to the PYTHONPATH environment variable, like the following:
export PYTHONPATH=$SPARK_HOME/python/lib/py4j-0.10.7-src.zip:$SPARK_HOME/python:$SPARK_HOME/python/build:$PYTHONPATH
(Note that the py4j version may differ.)
That's it.
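For step 3, the linked hadoop-provided page derives the value from the Hadoop installation itself (this assumes the hadoop command is on your PATH):
export SPARK_DIST_CLASSPATH=$(hadoop classpath)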

Understanding Spark Version

When I run pyspark in the shell, it displays the Spark version as 1.6.0 in the console.
But when I run spark2-submit --version, it says version 2.2.0.cloudera2.
I want to understand the difference between them and which version pyspark actually runs on. Whenever I run a .py script, I use spark2-submit script.py.
Before executing pyspark, try setting the Spark major version environment variable. Run this command in your terminal:
SPARK_MAJOR_VERSION=2 pyspark
When I run pyspark2, it shows version 2.2.0. This matches spark2-submit --version.

How to launch programs in Apache Spark?

I have a “myprogram.py” and a “myprogram.scala” that I need to run on my Spark machine. How can I upload and launch them?
I have been using the shell to do my transformations and call actions, but now I want to launch a complete program on the Spark machine instead of entering single commands every time. I also believe that will make it easier to change my program rather than typing commands into the shell.
I did a standalone installation on Ubuntu 14.04, on a single machine, not a cluster, using Spark 1.4.1.
I went through the Spark docs online, but I only found instructions for doing this on a cluster. Please help me with that.
Thank you.
The documentation to do this (as commented above) is available here: http://spark.apache.org/docs/latest/submitting-applications.html
However, the code you need is here:
# Run application locally on 8 cores
./bin/spark-submit \
--class org.apache.spark.examples.SparkPi \
--master local[8] \
/path/to/examples.jar \
100
You'll need to compile the scala file using sbt (documentation here: http://www.scala-sbt.org/0.13/tutorial/index.html)
Here's some information on the build.sbt file you'll need in order to grab the right dependencies: http://spark.apache.org/docs/latest/quick-start.html
Once the scala file is compiled, you'll send the resulting jar using the above submit command.
To put it simply:
In a Linux terminal, cd to the directory where Spark is unpacked/installed.
Note that this folder normally contains subfolders like “bin”, “conf”, “lib”, “logs” and so on.
To run the Python program locally with simple/default settings, type the command
./bin/spark-submit --master local[*] myprogram.py
More complete descriptions are in the other answers, as zero323 and ApolloFortyNine described.
