Understanding Spark Version - apache-spark

When I type pyspark in the shell, it displays the Spark version as 1.6.0 in the console.
But when I run spark2-submit --version, it says version 2.2.0.cloudera2.
I want to understand the difference between them and which version pyspark actually runs on. Whenever I run a .py script, I use spark2-submit script.py.

Before launching PySpark, try setting the Spark major version environment variable. Run the command below in your terminal:
SPARK_MAJOR_VERSION=2 pyspark

When I run pyspark2 it shows version 2.2.0, which matches spark2-submit --version.
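To double-check which version a given session actually runs, you can also print it from inside PySpark itself (a quick sketch that works in both the shell and a submitted script):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
print(spark.version)               # e.g. 2.2.0.cloudera2 when launched via pyspark2/spark2-submit
print(spark.sparkContext.version)  # same value, exposed on the SparkContext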

Related

Where is the local Hadoop folder in pyspark (Mac)

I have installed pyspark on my local Mac using Homebrew. I can see Spark under /usr/local/Cellar/apache-spark/3.2.1/
but I cannot find a Hadoop folder. If I run pyspark in the terminal, it starts the Spark shell.
Where can I see its path?
I am trying to connect S3 to pyspark and I have the dependency JARs.
You do not need to know the location of Hadoop to do this.
You should use a command like spark-submit --packages org.apache.hadoop:hadoop-aws:3.3.1 app.py instead, which pulls all the necessary dependencies at submit time rather than requiring you to download every JAR (and its dependencies) locally.
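As a rough sketch of what the submitted app might then look like when reading from S3 (the bucket and key are placeholders, and AWS credentials are assumed to come from the environment or an instance profile):

# app.py -- read a CSV from S3 via the s3a connector pulled in by --packages
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("s3-read").getOrCreate()
df = spark.read.csv("s3a://your-bucket/some/path/data.csv", header=True)  # placeholder path
df.show(5)
spark.stop()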

How to run the pyspark shell code as a job like we run Python jobs

I am very new to Spark.
I have written my Spark code in the pyspark REPL shell; how can I run it as a script?
I tried python script.py, but it fails because it cannot access the Spark libraries.
spark-submit script.py is the way to submit PySpark code.
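For example, a minimal submittable script might look like this (the file name and data are purely illustrative):

# script.py
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("example-job").getOrCreate()
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])  # toy data
df.show()
spark.stop()

Run it with spark-submit script.py.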
Before using it, please make sure pyspark and Spark are properly set up in your environment.
If that is not done yet, follow the link below.

Is there a way to use PySpark with Hadoop 2.8+?

I would like to run a PySpark job locally, using a specific version of Hadoop (let's say hadoop-aws 2.8.5) because of some features.
PySpark versions seem to be aligned with Spark versions.
Here I use PySpark 2.4.5 which seems to wrap a Spark 2.4.5.
When submitting my PySpark job with spark-submit --master local[4] ..., with the option --conf spark.jars.packages=org.apache.hadoop:hadoop-aws:2.8.5, I encounter the following error:
py4j.protocol.Py4JJavaError: An error occurred while calling o32.sql
With the following java exceptions:
java.lang.NoClassDefFoundError: org/apache/hadoop/fs/StorageStatistics
Or:
java.lang.IllegalAccessError: tried to access method org.apache.hadoop.metrics2.lib.MutableCounterLong.<init>(Lorg/apache/hadoop/metrics2/MetricsInfo;J)V from class org.apache.hadoop.fs.s3a.S3AInstrumentation
I suppose that the PySpark job's Hadoop version is not aligned with the one I pass via the spark-submit option spark.jars.packages,
but I have no idea how to make it work. :)
The default Spark distro has the Hadoop libraries included, and Spark uses its own bundled libraries first. So you should either set --conf spark.driver.userClassPathFirst=true (and, for a cluster, also --conf spark.executor.userClassPathFirst=true), or download a Spark distro built without Hadoop. You will probably have to put your Hadoop distro's JARs into the Spark distro's jars directory.
OK, I found a solution:
1 - Install Hadoop in the expected version (2.8.5 for me).
2 - Install a Hadoop-free build of Spark (2.4.4 for me).
3 - Set the SPARK_DIST_CLASSPATH environment variable to make Spark use the custom version of Hadoop.
(cf. https://spark.apache.org/docs/2.4.4/hadoop-provided.html)
4 - Add the PySpark directories to the PYTHONPATH environment variable, like the following:
export PYTHONPATH=$SPARK_HOME/python/lib/py4j-0.10.7-src.zip:$SPARK_HOME/python:$SPARK_HOME/python/build:$PYTHONPATH
(Note that the py4j version may differ.)
That's it.
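To confirm that Spark really picked up the custom Hadoop, one option is to ask the JVM for its Hadoop version through the py4j gateway (this relies on an internal, underscore-prefixed handle, so treat it as a debugging trick rather than a stable API):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
# Hadoop version the running Spark JVM is linked against -- should print 2.8.5 in this setup
print(spark.sparkContext._jvm.org.apache.hadoop.util.VersionInfo.getVersion())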

How to run a script in PySpark

I'm trying to run a script in the pyspark environment but so far I haven't been able to.
How can I run a script like python script.py but in pyspark?
You can do: ./bin/spark-submit mypythonfile.py
Running python applications through pyspark is not supported as of Spark 2.0.
pyspark 2.0 and later execute the script file named in the PYTHONSTARTUP environment variable, so you can run:
PYTHONSTARTUP=code.py pyspark
Compared to the spark-submit answer, this is useful for running initialization code before dropping into the interactive pyspark shell.
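For instance, code.py could register some data before the prompt appears (a sketch; it assumes the shell's spark session already exists by the time the startup file runs, as the answer above implies):

# code.py -- executed before the interactive pyspark prompt; `spark` is provided by the shell
df = spark.range(100)                     # illustrative initialization
df.createOrReplaceTempView("numbers")
print("Registered temp view 'numbers' with", df.count(), "rows")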
Just spark-submit mypythonfile.py should be enough.
You can execute "script.py" as follows
pyspark < script.py
or
# if you want to run pyspark in yarn cluster
pyspark --master yarn < script.py
The existing answers are right (that is, use spark-submit), but some of us might want to just get started with a SparkSession object as in the pyspark shell.
So in the PySpark script to be run, first add:
from pyspark.sql import SparkSession
spark = SparkSession.builder \
.master('yarn') \
.appName('pythonSpark') \
.enableHiveSupport() \
.getOrCreate()
Then use spark.conf.set('conf_name', 'conf_value') for configuration that can be changed at runtime; settings such as executor cores and memory are fixed at session creation, so pass those to the builder via .config(...) (or to spark-submit) instead.
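For example, a conf that can be changed on a live session (the value here is arbitrary):

spark.conf.set('spark.sql.shuffle.partitions', '50')  # tune shuffle parallelism at runtime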
Spark provides a command to execute an application file, whether it is written in Scala or Java (as a JAR), Python, or R.
The command is:
$ spark-submit --master <url> <SCRIPTNAME>.py
I'm running Spark on a 64-bit Windows system with JDK 1.8.

How to add spark-csv package to jupyter server on Azure for use with iPython

I want to use the spark-csv package from https://github.com/databricks/spark-csv from within the Jupyter service running on a Spark HDInsight cluster on Azure.
On a local cluster I know I can do this like:
export PACKAGES="com.databricks:spark-csv_2.11:1.3.0"
export PYSPARK_SUBMIT_ARGS="--packages ${PACKAGES} pyspark-shell"
However, I don't know where to put this in the Azure Spark configuration. Any clues or hints are appreciated.
You can use the %%configure magic to add any required external package.
It should be as simple as putting the following snippet in your first code cell.
%%configure
{ "packages":["com.databricks:spark-csv_2.10:1.4.0"] }
This specific example is also covered in the documentation. Just make sure you start the Spark session after the %%configure cell.
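Once the session starts with the package loaded, a PySpark cell could read a CSV through spark-csv roughly like this (the WASB path is a placeholder, and on Spark 1.x-era HDInsight the entry point is sqlContext rather than spark):

df = sqlContext.read.format("com.databricks.spark.csv") \
    .option("header", "true") \
    .load("wasb:///example/data/sample.csv")  # placeholder path
df.show(5)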
One option for managing Spark packages in a cluster from a Jupyter notebook is Apache Toree. Toree gives you some extra line magics that allow you to manage Spark packages from within a Jupyter notebook. For example, inside a Jupyter scala notebook, you would install spark-csv with
%AddDeps com.databricks spark-csv_2.11 1.4.0 --transitive
To install Apache Toree on your Spark clusters, ssh into your Spark clusters and run:
sudo pip install --pre toree
sudo jupyter toree install \
--spark_home=$SPARK_HOME \
--interpreters=PySpark,SQL,Scala,SparkR
I know you specifically asked about Jupyter notebooks running PySpark. At this time, Apache Toree is an incubating project. I have run into trouble using the provided line magics with pyspark notebooks specifically. Maybe you will have better luck. I am looking into why this is, but personally, I prefer Scala in Spark. Hope this helps!
You can try executing your two lines of code (the export ... lines) in a script that you invoke in Azure at the time the HDInsight cluster is created.
Since you are using HDInsight, you can use a "Script Action" at Spark cluster provisioning that installs the needed libraries. The script can be a very simple shell script, it is executed automatically on startup, and it is automatically re-executed on new nodes if the cluster is resized.
https://azure.microsoft.com/en-us/documentation/articles/hdinsight-hadoop-customize-cluster-linux/
