Creation of SparkContext in PySpark - apache-spark

I am wondering which process creates the SparkContext in a PySpark application. I understand that there is a Python main process and a JVM process, and it sounds like the Python main script spawns the JVM one, but I am not sure that is accurate. I have two questions:
Which process actually creates the SparkContext? I am guessing it is the JVM process, and the context is then exposed to the Python main process?
If I run my PySpark application via spark-submit, it looks like a JVM process is launched. Would that be the one creating the SparkContext?
Overall it is not clear to me whether there is any difference, in terms of which process creates the SparkContext, between running PySpark code with spark-submit and running it with the Python interpreter (python3 my_pyspark.py, for example).

Related

Notebook vs spark-submit

I'm very new to PySpark.
I am running a script (mainly creating a tf-idf and predicting 9 categorical columns with it) in a Jupyter Notebook. It takes about 5 minutes when I execute all cells manually. When running the same script from spark-submit it takes about 45 minutes. What is happening?
The same excess time also occurs if I run the code using python from the terminal.
I am also setting the configuration in the script as
conf = SparkConf().set('spark.executor.memory', '45G').set('spark.driver.memory', '80G').set('spark.driver.maxResultSize', '20G')
Any help is appreciated. Thanks in advance.
There are various ways to run your Spark code, as you have mentioned a few: a notebook, the pyspark shell, and spark-submit.
Regarding Jupyter Notebook or the pyspark shell.
When you run your code in a Jupyter notebook or the pyspark shell, it may already have set default values for executor memory, driver memory, executor cores, etc.
Regarding spark-submit.
However, when you use spark-submit those defaults can be different. So the best way is to pass these values as flags when submitting the PySpark application with the spark-submit utility.
The configuration object which you have created can also be passed while creating the SparkContext (sc):
sc = SparkContext(conf=conf)
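For reference, the same settings from the question could instead be passed as spark-submit flags; this is only a sketch, the script name is a placeholder and the values are the ones from the question:
spark-submit \
  --driver-memory 80G \
  --executor-memory 45G \
  --conf spark.driver.maxResultSize=20G \
  my_script.py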
Hope this helps.
Regards,
Neeraj
I had the same problem, but to initialize my spark variable I was using this line:
spark = SparkSession.builder.master("local[1]").appName("Test").getOrCreate()
The problem is that "local[X]" tells Spark to do the operations on the local machine, using X cores. So you have to tune X to the number of cores available on your machine.
To use it with a YARN cluster, you have to set the master to "yarn".
There are many other possibilities listed here: https://spark.apache.org/docs/latest/submitting-applications.html
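As a rough sketch of the difference (the app name and core count are placeholders), only the master string changes, not the rest of the builder chain:

from pyspark.sql import SparkSession

# local[4]: run everything on the local machine, using 4 cores ("local[*]" uses all available cores)
spark = SparkSession.builder.master("local[4]").appName("Test").getOrCreate()

# "yarn": run on a YARN cluster instead of the local machine
# spark = SparkSession.builder.master("yarn").appName("Test").getOrCreate()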

Running a PySpark code in python vs spark-submit

I have a PySpark code/application. What is the best way to run it (to utilize the maximum power of PySpark): using the python interpreter or using spark-submit?
The SO answer here was similar but did not explain it in great detail. I would love to know why.
Any help is appreciated. Thanks in advance.
I am assuming that when you say python interpreter you are referring to the pyspark shell.
You can run your Spark code either way: using the pySpark interpreter, using spark-submit, or even with one of the various available notebooks (Jupyter/Zeppelin).
When to use PySpark Interpreter.
Generally we use the pySpark interpreter when we are learning or doing some very basic operations for understanding or exploration purposes.
Spark Submit.
This is usually used when you have written your entire application in pySpark and packaged it into .py files, so that you can submit your entire code to the Spark cluster for execution.
A little analogy may help here. Take Unix shell commands: we can execute shell commands directly at the command prompt, or we can create a shell script (.sh) to execute a bunch of instructions at once. You can think of the pyspark interpreter and the spark-submit utility similarly: in the pySpark interpreter you execute individual commands, whereas with spark-submit you package your Spark application into .py files and execute it as a whole.
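To make the analogy concrete, a hedged sketch (the application file name is a placeholder):
# Interactive, like typing commands at the shell prompt: opens a REPL with spark/sc already created
./bin/pyspark
# Batch, like running a .sh script: submit a packaged application
./bin/spark-submit my_app.py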
Hope this helps.
Regards,
Neeraj
Running your job in the pyspark shell will always use client mode, whereas with spark-submit you can execute it in either mode, i.e. client or cluster.
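For instance (a sketch; the master and script name are placeholders), the mode is chosen with the --deploy-mode flag:
spark-submit --master yarn --deploy-mode client my_app.py    # driver runs on the submitting machine
spark-submit --master yarn --deploy-mode cluster my_app.py   # driver runs inside the cluster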

What PySpark API calls require the same version of Python in workers in yarn-client mode

Usually I run my code with a different version of Python in the driver than in the worker nodes, using yarn-client mode.
For instance, I usually use python3.5 in the driver and the default python2.6 in the workers, and this works pretty well.
I am currently in a project where we need to call
sqlContext.createDataFrame
But this seems to try to execute that statement in Python on the workers, and then I hit the requirement of installing the same version of Python on the workers, which is what I am trying to avoid.
So, to use "sqlContext.createDataFrame", is it a requirement to have the same Python version in the driver and the workers?
And if so, which other "pure" pyspark.sql API calls would also have this requirement?
Thanks,
Jose
Yes, the same Python version is the requirement in general. Some API calls may not fail because no Python executor is in use, but that is not a valid configuration.
Every call that interacts with Python code, like udf or DataFrame.rdd.*, will trigger the same exception.
If you want to avoid upgrading the cluster's Python, then use Python 2 on the driver.
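As a side note, once matching interpreters are available on both sides, you can point the driver and the workers at specific ones with environment variables; a minimal sketch, the paths are hypothetical:
export PYSPARK_PYTHON=/usr/bin/python3.5          # interpreter the workers will use
export PYSPARK_DRIVER_PYTHON=/usr/bin/python3.5   # interpreter the driver will use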
In general, many pyspark operations are just wrappers around Spark operations on the JVM. For these operations it doesn't matter which version of Python is used on the worker, because no Python is executed on the worker, only JVM operations.
Examples of such operations include reading a dataframe from a file and all built-in functions which do not require Python objects/functions as input.
Once a function requires an actual python object or function this becomes a little trickier.
Let's say for example that you want to use a UDF and use lambda x: x+1 as the function.
Spark doesn't really know what the function is. Instead it serializes it and sends it to the workers, which de-serialize it in turn.
For this serialization/de-serialization process to work, the Python versions on both sides need to be compatible, and that is often not the case (especially between major versions).
All of this leads us to createDataFrame. If you pass an RDD as one of the parameters, for example, the RDD contains Python objects as its records, and these need to be serialized and de-serialized, so both sides must have the same Python version.
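A short sketch of the distinction (the file path, column names, and the UDF are hypothetical):

from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import IntegerType

spark = SparkSession.builder.getOrCreate()

# JVM-only: no Python runs on the workers, so the worker Python version does not matter
df = spark.read.parquet("/data/example.parquet")
df2 = df.select(F.col("x") + 1)

# Python on the workers: the lambda is pickled on the driver and unpickled by worker Python,
# so driver and worker Python versions must be compatible
add_one = F.udf(lambda x: x + 1, IntegerType())
df3 = df.select(add_one("x"))

# createDataFrame from an RDD of Python objects also serializes Python records to the workers
rdd = spark.sparkContext.parallelize([(1, "a"), (2, "b")])
df4 = spark.createDataFrame(rdd, ["id", "value"])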

Why does SparkContext randomly close, and how do you restart it from Zeppelin?

I am working in Zeppelin, writing spark-sql queries, and sometimes I suddenly start getting this error (without having changed any code):
Cannot call methods on a stopped SparkContext.
Then the output says further down:
The currently active SparkContext was created at:
(No active SparkContext.)
This obviously doesn't make sense. Is this a bug in Zeppelin? Or am I doing something wrong? How can I restart the SparkContext?
Thank you
I have faced this problem a couple of times.
If you are setting your master to yarn-client, it might be due to a stop/restart of the Resource Manager: the interpreter process may still be running, but the Spark context (which is a YARN application) does not exist any more.
You could check whether the Spark context is still running by consulting your Resource Manager web interface and checking whether there is an application named Zeppelin running.
Sometimes restarting the interpreter process from within Zeppelin (interpreter tab --> spark --> restart) will solve the problem.
Other times you need to:
kill the Spark interpreter process from the command line
remove the Spark Interpreter PID file
and the next time you start a paragraph it will start a new Spark context (a rough sketch of these steps follows below)
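Roughly, something like this; the process name and the PID file path depend on the Zeppelin installation, so treat them as assumptions:
# Find and kill the Zeppelin Spark interpreter process
ps -ef | grep zeppelin
kill <interpreter_pid>
# Remove the interpreter PID file so a fresh interpreter (and Spark context) starts next time
rm $ZEPPELIN_HOME/run/zeppelin-interpreter-spark-*.pid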
I'm facing the same problem when running multiple jobs in PySpark. It seems that in Spark 2.0.0, with SparkSession, when I call spark.stop(), SparkSession runs the following:
# SparkSession
self._sc.stop()
# SparkContext.stop()
self._jsc = None
Then, when I try to create a new job with a new SparkContext, SparkSession returns the same SparkContext as before, with self._jsc = None.
I solved it by setting SparkSession._instantiatedContext = None after spark.stop(), forcing SparkSession to create a new SparkContext the next time I ask for one.
It's not the best option, but in the meantime it solves my issue.
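In code, the workaround described above looks roughly like this (it relies on a private SparkSession attribute, so it may break in other Spark versions; the app name is a placeholder):

from pyspark.sql import SparkSession

spark.stop()                                  # stops the underlying SparkContext as well
SparkSession._instantiatedContext = None      # clear the cached context (private attribute)

# The next builder call now creates a fresh SparkContext instead of returning the stopped one
spark = SparkSession.builder.appName("NextJob").getOrCreate()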
I've noticed this issue more when running pyspark commands: even with trivial variable declarations, a cell execution hangs in the running state.
As mentioned above by user1314742, just killing the relevant PID solves this issue for me.
e.g.:
ps -ef | grep zeppelin
This is a case where restarting the Spark interpreter and restarting the Zeppelin notebook does not solve the issue, I guess because it cannot control the hung PID itself.
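For completeness, once the hung process id has been identified from the ps output above, it can be force-killed (the PID is a placeholder):
kill -9 <PID>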
Could you check whether your driver memory is sufficient? I solved this issue by (an example invocation follows the list):
enlarging the driver memory
tuning GC:
--conf spark.cleaner.periodicGC.interval=60
--conf spark.cleaner.referenceTracking.blocking=false
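For example, passed via spark-submit (or as the corresponding Spark properties in the Zeppelin interpreter settings); the memory value and script name are placeholders, the GC settings are the ones given above:
spark-submit \
  --driver-memory 8G \
  --conf spark.cleaner.periodicGC.interval=60 \
  --conf spark.cleaner.referenceTracking.blocking=false \
  my_job.py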

stop all existing spark contexts

I am trying to create a new Spark context using pyspark, and I get the following:
WARN SparkContext: Another SparkContext is being constructed (or threw
an exception in its constructor). This may indicate an error, since
only one SparkContext may be running in this JVM (see SPARK-2243). The
other SparkContext was created at:
org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:61)
I do not have any other context active (in theory), but maybe it did not finish correctly and it is still there. How can I find out whether there is another one, or kill all the current ones? I am using Spark 1.5.1.
When you run the pyspark shell and execute a Python script inside it, e.g. using 'execfile()', a SparkContext is already available as sc and a HiveContext as sqlContext. To run a Python script without any pre-created contexts, just use ./bin/spark-submit 'your_python_file'.
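If you suspect a context is still lying around, a hedged sketch is to reuse or stop the active one instead of constructing a second SparkContext (the app name is a placeholder):

from pyspark import SparkConf, SparkContext

# getOrCreate returns the already-running context, if any, instead of constructing a second one
sc = SparkContext.getOrCreate(SparkConf().setAppName("example"))

# ... do your work ...

sc.stop()   # stop it explicitly so a later SparkContext(...) does not hit the "only one per JVM" warning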
