Notebook vs spark-submit - apache-spark

I'm very new to PySpark.
I am running a script (mainly building a TF-IDF and using it to predict 9 categorical columns) in a Jupyter notebook. It takes about 5 minutes when I execute all the cells manually. When I run the same script through spark-submit it takes about 45 minutes. What is happening?
The same excess time also occurs if I run the code with python from the terminal.
I am also setting the configuration in the script as
conf = SparkConf().set('spark.executor.memory', '45G').set('spark.driver.memory', '80G').set('spark.driver.maxResultSize', '20G')
Any help is appreciated. Thanks in advance.

There are various ways to run your Spark code, as you mentioned: a notebook, the pyspark shell, and spark-submit.
Regarding Jupyter Notebook or pyspark shell.
While you are running your code in a Jupyter notebook or the pyspark shell, defaults may already have been set for executor memory, driver memory, executor cores, etc.
Regarding spark-submit.
However, when you use spark-submit these defaults can be different. So the best way is to pass these values as flags when submitting the PySpark application with the spark-submit utility.
Regarding the configuration object you have created: it can be passed while creating the SparkContext (sc).
sc = SparkContext(conf=conf)
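The same settings shown in the question can instead be passed as flags on the command line; a sketch (the script name and memory sizes are placeholders taken from the question, not prescriptions):

```shell
# Pass the memory configuration at submit time rather than inside the script;
# your_script.py is a placeholder for your application file.
spark-submit \
  --driver-memory 80G \
  --executor-memory 45G \
  --conf spark.driver.maxResultSize=20G \
  your_script.py
```

Flags given this way take effect before the driver JVM starts, which is why they are more reliable than setting them inside the script.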
Hope this helps.
Regards,
Neeraj

I had the same problem, but to initialize my spark variable I was using this line:
spark = SparkSession.builder.master("local[1]").appName("Test").getOrCreate()
The problem is that "local[X]" tells Spark to do the operations on the local machine, on X cores. So you have to match X to the number of cores available on your machine.
To use it with a YARN cluster, you have to pass "yarn" instead.
There are many other possibilities listed here: https://spark.apache.org/docs/latest/submitting-applications.html
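As a rough sketch (pure Python, no Spark required) of picking X from the machine's actual core count:

```python
import os

# Ask the OS how many cores are available; fall back to 1 if unknown.
cores = os.cpu_count() or 1
master = f"local[{cores}]"
print(master)  # e.g. "local[8]" on an 8-core machine
```

The resulting string can then be passed to `SparkSession.builder.master(...)`; alternatively, `local[*]` asks Spark to use all available cores itself.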

Related

Out of memory error while running spark submit

I am trying to load a 60gb table data onto a spark python dataframe and then write that into a hive table.
I have set driver memory, executor memory, and max result size sufficiently to handle the data, but I am getting an error when I run through spark-submit with all the above configs on the command line.
Note: Through the Spark python shell (by specifying driver & executor memory while launching the shell), I am able to populate the target hive table.
Any thoughts??
Try using syntax:
./spark-submit --conf ...
for the memory-related configuration. What I suspect you're doing is setting these values while initializing the SparkSession, which is irrelevant by then, since the JVM has already started. The same parameters you set when launching the shell will work here.
https://spark.apache.org/docs/latest/submitting-applications.html
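A sketch of that syntax (the file name and sizes are placeholders, not values from the question):

```shell
# Memory-related settings must be supplied at submit time,
# before the driver JVM starts; my_job.py is a placeholder.
./spark-submit \
  --conf spark.driver.memory=16g \
  --conf spark.executor.memory=16g \
  --conf spark.driver.maxResultSize=8g \
  my_job.py
```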

Running Spark on local machine with master = local[*] and invoking .collect method

I need some help in understanding this documentation on Spark website:
Another common idiom is attempting to print out the elements of an RDD using rdd.foreach(println) or rdd.map(println). [1st category] On a single machine, this will generate the expected output and print all the RDD’s elements. [2nd category] However, in cluster mode, the output to stdout being called by the executors is now writing to the executor’s stdout instead...
Does Spark running locally (with local[*] inside the Eclipse IDE), connecting to a staging Cassandra cluster (which runs on multiple nodes), fall into the first category or the second?
Any help is appreciated.
You're not submitting your code to a cluster, therefore your code falls into the first category.

Running a PySpark code in python vs spark-submit

I have a PySpark code/application. What is the best way to run it (utilize the maximum power of PySpark), using the python interpreter or using spark-submit?
The SO answer here was almost similar but did not explain it in great details. Would love to know, why?
Any help is appreciated. Thanks in advance.
I am assuming that when you say python interpreter you are referring to the pyspark shell.
You can run your Spark code any of these ways: using the pySpark interpreter, using spark-submit, or even with one of the available notebooks (Jupyter/Zeppelin).
When to use PySpark Interpreter.
Generally, when we are learning or doing some very basic operations for understanding or exploration purposes, we use the pySpark interpreter.
Spark Submit.
This is usually used when you have written your entire application in PySpark and packaged it into .py files, so that you can submit the whole application to a Spark cluster for execution.
A little analogy may help here. Let's take the example of Unix shell commands. We can execute shell commands directly at the command prompt, or we can create a shell script (.sh) to execute a bunch of instructions at once. Similarly, you can think of the pyspark interpreter and the spark-submit utility: in the pySpark interpreter you can execute individual commands, whereas with spark-submit you package your Spark application into .py files and execute it as a whole.
Hope this helps.
Regards,
Neeraj
Running your job in the pyspark shell will always be in client mode, whereas with spark-submit you can execute it in either mode, i.e. client or cluster.
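The deploy mode is selected with a flag at submit time; a sketch (my_app.py is a placeholder):

```shell
# Client mode: the driver runs on the machine where spark-submit was launched.
spark-submit --master yarn --deploy-mode client my_app.py

# Cluster mode: the driver runs inside the cluster, on one of the nodes.
spark-submit --master yarn --deploy-mode cluster my_app.py
```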

Spark Job not getting any cores on EC2

I use flintrock 0.9.0 with Spark 2.2.0 to start my cluster on EC2; the code is written in PySpark. I have been doing this for a while now and have run a couple of successful jobs. In the last 2 days I encountered a problem: when I start a cluster on certain instance types, I don't get any cores. I observed this behavior on c1.medium and now on r3.xlarge. The code to get the Spark and SparkContext objects is this:
from pyspark import SparkConf, SparkContext
from pyspark.sql import SparkSession

conf = SparkConf().setAppName('the_final_join')\
    .setMaster(master)\
    .set('spark.executor.memory', '29G')\
    .set('spark.driver.memory', '29G')
sc = SparkContext(conf=conf)
spark = SparkSession.builder.config(conf=conf).getOrCreate()
On c1.medium I used .set('spark.executor.cores', '2') and it seemed to work. But now I tried to run my code on a bigger cluster of r3.xlarge instances and my job doesn't get any cores, no matter what I do. All workers are alive and I see that each of them should have 4 cores. Did something change in the last 2 months, or am I missing something in the startup process? I launch the instances in us-east-1c; I don't know if this has something to do with it.
Part of your issue may be that you are trying to allocate more memory to the driver/executors than you have access to.
yarn.nodemanager.resource.memory-mb controls the maximum sum of memory used by the containers on each node (cite)
You can look up this value for various instance types here. An r3.xlarge has access to 23,424 MB, but you're trying to give your driver/executor 29 GB. Ultimately, YARN is not launching Spark because it doesn't have access to enough memory to run your job.
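A quick back-of-the-envelope check (pure Python). The 23,424 MB node limit is the figure cited above; the overhead term assumes YARN's usual rule of adding max(384 MB, 10% of the requested executor memory) on top of the request, which is the default in Spark 2.x:

```python
# Compare the requested container size against what the node offers.
node_limit_mb = 23424                 # yarn.nodemanager.resource.memory-mb on r3.xlarge
requested_mb = 29 * 1024              # the 29G executor/driver memory from the question
overhead_mb = max(384, int(requested_mb * 0.10))  # default memory-overhead rule
container_mb = requested_mb + overhead_mb

fits = container_mb <= node_limit_mb
print(container_mb, fits)  # 32665 False -> YARN cannot schedule the container
```

Dropping the request to something like 20G would leave room for the overhead within the node's limit.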

Why does SparkContext randomly close, and how do you restart it from Zeppelin?

I am working in Zeppelin writing spark-sql queries and sometimes I suddenly start getting this error (after not changing code):
Cannot call methods on a stopped SparkContext.
Then the output says further down:
The currently active SparkContext was created at:
(No active SparkContext.)
This obviously doesn't make sense. Is this a bug in Zeppelin? Or am I doing something wrong? How can I restart the SparkContext?
Thank you
I have faced this problem a couple of times.
If you are setting your master as yarn-client, it might be due to a stop/restart of the Resource Manager: the interpreter process may still be running, but the Spark context (which is a YARN application) no longer exists.
You can check whether the Spark context is still running by consulting your Resource Manager web interface and checking whether there is an application named Zeppelin running.
Sometimes restarting the interpreter process from within Zeppelin (interpreter tab --> spark --> restart) will solve the problem.
Other times you need to:
kill the Spark interpreter process from the command line
remove the Spark Interpreter PID file
and the next time you start a paragraph it will start a new Spark context.
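Those two steps can be sketched as shell commands (the PID and the PID-file path are placeholders; the actual path depends on your Zeppelin install):

```shell
# Find the Spark interpreter process started by Zeppelin.
ps -ef | grep zeppelin | grep spark

# Kill it, using the PID printed above (12345 is a placeholder).
kill 12345

# Remove the stale interpreter PID file (path is install-specific).
rm zeppelin/run/zeppelin-interpreter-spark-*.pid
```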
I'm facing the same problem running multiple jobs in PySpark. It seems that in Spark 2.0.0, with SparkSession, when I call spark.stop(), SparkSession executes the following:
# SparkSession
self._sc.stop()
# SparkContext.stop()
self._jsc = None
Then, when I try to create a new job with a new SparkContext, SparkSession returns the same SparkContext as before, with self._jsc = None.
I solved it by setting SparkSession._instantiatedContext = None after spark.stop(), forcing SparkSession to create a new SparkContext the next time I ask for one.
It's not the best option, but meanwhile it's solving my issue.
I've noticed this issue more when running pyspark commands: even with trivial variable declarations, a cell execution hangs in the running state.
As mentioned above by user1314742, just killing the relevant PID solves this issue for me.
e.g.:
ps -ef | grep zeppelin
This happens when restarting the Spark interpreter and restarting the Zeppelin notebook do not solve the issue. I guess that's because Zeppelin cannot control the hung PID itself.
Could you check whether your driver memory is enough? I solved this issue by:
enlarging driver memory
tuning GC:
--conf spark.cleaner.periodicGC.interval=60
--conf spark.cleaner.referenceTracking.blocking=false

Resources