Running PySpark code with the python interpreter vs spark-submit - apache-spark

I have a PySpark application. What is the best way to run it (to utilize the maximum power of PySpark): with the Python interpreter or with spark-submit?
The SO answer here was similar but did not explain it in much detail. I would love to know why.
Any help is appreciated. Thanks in advance.

I am assuming that when you say "python interpreter" you are referring to the pyspark shell.
You can run your Spark code in several ways: with the pyspark interpreter, with spark-submit, or even from one of the available notebooks (Jupyter/Zeppelin).
When to use PySpark Interpreter.
Generally, we use the pyspark interpreter when we are learning or doing some very basic operations for understanding or exploration purposes.
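For instance, a quick interactive exploration might look like this (the file path is just a placeholder):

# started with the `pyspark` command; the shell already provides `spark` and `sc`
df = spark.read.csv("data/sample.csv", header=True)   # placeholder path
df.printSchema()
df.show(5)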
Spark Submit.
This is usually used when you have written your entire application in PySpark and packaged it into .py files, so that you can submit the whole codebase to the Spark cluster for execution.
A little analogy may help here. Take Unix shell commands: we can execute them directly at the command prompt, or we can put a bunch of them into a shell script (.sh) and execute them all at once. You can think of the pyspark interpreter and the spark-submit utility the same way: in the pyspark interpreter you execute individual commands, whereas with spark-submit you package your Spark application into .py files and execute it as a whole.
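To make the second half of the analogy concrete, a minimal sketch (the file name and master are placeholders):

# my_app.py - the whole application packaged into a .py file
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("MyApp").getOrCreate()
df = spark.range(100)
print(df.count())
spark.stop()

You would then hand it to the cluster with something like spark-submit --master yarn my_app.py.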
Hope this helps.
Regards,
Neeraj

Running your job in the pyspark shell will always be in client mode, whereas with spark-submit you can execute it in either mode, i.e. client or cluster.
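For example (the script name is a placeholder), the same application can be submitted in either mode just by changing the flag:

spark-submit --master yarn --deploy-mode client  my_app.py   # driver runs where you launch the command
spark-submit --master yarn --deploy-mode cluster my_app.py   # driver runs inside the cluster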

Related

pyspark using Pytest is not showing spark UI

I have written a pytest case using PySpark (Spark 3.0) that reads a file and gets the count of a data frame, but I cannot see the Spark UI and I am getting an OOM error. What is the solution, and how can I debug this without seeing the Spark UI?
Thanks,
Xi
Wrap all the tests inside a shell script, call that shell script from a Python file,
and then launch it with spark-submit python.py
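A minimal sketch of that idea (file and directory names are hypothetical, and this variant calls pytest directly from the Python entry point rather than going through an intermediate shell script):

# python.py - entry point that runs the test suite, so it can be launched with spark-submit
import sys
import pytest

if __name__ == "__main__":
    # run everything under tests/ and propagate pytest's exit code
    sys.exit(pytest.main(["tests/", "-v"]))

You would then launch it with something like spark-submit --driver-memory 4g python.py; the memory flag value is only illustrative, but it is also the usual lever for the OOM error mentioned in the question, and the submitted application gets its own entry in the Spark UI.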

Notebook to write Java jobs for Spark

I am writing my first Spark job using Java API.
I want to run it using a notebook.
I am looking into Zeppelin and Jupyter.
In the Zeppelin documentation I see support for Scala, IPySpark and SparkR. It is not clear to me whether using two interpreters (%spark.sql and %java) will allow me to work with the Java API of Spark SQL.
Jupyter has an "IJava" kernel, but I see no support for Spark with Java.
Are there other options?
@Victoriia Zeppelin 0.9.0 has a %java interpreter, with an example here:
zeppelin.apache.org
I tried to start with it on Google Cloud, but had some problems...
Use the magic command %jars path/to/spark.jar in an IJava cell, according to IJava's author,
then try import org.apache.spark.sql.*, for example.

Is it possible to run "spark-submit" in Databricks without creating jobs? If yes, what are the possibilities?

I am trying to execute spark-submit from a Databricks workspace notebook without creating jobs. Help me!
No, that is not possible in the way one would do it with /bin/spark-submit, as it does not fit in with Databricks' notebook approach of making things easier for less technical users.
The closest you can get is as stated here: https://docs.databricks.com/dev-tools/api/latest/examples.html#create-a-spark-submit-job
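For reference, a rough sketch of what that linked approach looks like through the Jobs API (the cluster settings, file path and workspace URL are placeholders; check the linked docs for the exact payload):

# create a spark-submit job via the Databricks Jobs API
import requests

payload = {
    "name": "my-spark-submit-job",
    "new_cluster": {
        "spark_version": "7.3.x-scala2.12",   # placeholder cluster settings
        "node_type_id": "i3.xlarge",
        "num_workers": 2,
    },
    "spark_submit_task": {
        "parameters": ["dbfs:/path/to/my_app.py"],   # placeholder path
    },
}

resp = requests.post(
    "https://<databricks-instance>/api/2.0/jobs/create",
    headers={"Authorization": "Bearer <personal-access-token>"},
    json=payload,
)
print(resp.json())   # returns the job_id on success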

Notebook vs spark-submit

I'm very new to PySpark.
I am running a script (mainly creating a TF-IDF and predicting 9 categorical columns with it) in a Jupyter notebook. It takes about 5 minutes when I execute all the cells manually. When I run the same script with spark-submit it takes about 45 minutes. What is happening?
The same thing happens (the excess time) if I run the code with python from the terminal.
I am also setting the configuration in the script as
conf = SparkConf().set('spark.executor.memory', '45G').set('spark.driver.memory', '80G').set('spark.driver.maxResultSize', '20G')
Any help is appreciated. Thanks in advance.
There are various ways to run your Spark code, and you have mentioned a few of them: notebook, pyspark shell and spark-submit.
Regarding Jupyter Notebook or pyspark shell.
While you are running your code in a Jupyter notebook or the pyspark shell, the environment might have set some default values for executor memory, driver memory, executor cores, etc.
Regarding spark-submit.
However, when you use spark-submit these defaults can be different. So the best approach is to pass these values explicitly as flags while submitting the PySpark application with the spark-submit utility.
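A minimal sketch with the values from your script (the script name is a placeholder):

spark-submit \
  --executor-memory 45G \
  --driver-memory 80G \
  --conf spark.driver.maxResultSize=20G \
  your_script.py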
The configuration object you have created can be passed while creating the SparkContext (sc):
sc = SparkContext(conf=conf)
Hope this helps.
Regards,
Neeraj
I had the same problem, but to initialize my spark variable I was using this line:
spark = SparkSession.builder.master("local[1]").appName("Test").getOrCreate()
The problem is that "local[X]" means that Spark will do the operations on the local machine, on X cores. So you have to match X to the number of cores available on your machine.
To use it with a YARN cluster, you have to put "yarn" as the master.
There are many other possibilities listed here: https://spark.apache.org/docs/latest/submitting-applications.html
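For example, a small sketch of the two options mentioned above (the script name is a placeholder):

from pyspark.sql import SparkSession

# use every core of the local machine
spark = SparkSession.builder.master("local[*]").appName("Test").getOrCreate()

# or leave the master out of the code entirely and choose it at submit time:
#   spark-submit --master yarn your_script.py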

SparkUI for pyspark - corresponding line of code for each stage?

I have a PySpark program running on an AWS cluster. I am monitoring the job through the Spark UI (see attached). However, I noticed that unlike a Scala or Java Spark program, where the UI shows which line of code each stage corresponds to, I can't find which stage corresponds to which line of my PySpark code.
Is there a way to figure out which stage corresponds to which line of the PySpark code?
Thanks!
When you run a toPandas call, the corresponding line of the Python code is shown in the SQL tab. Other collect-style commands, such as count or parquet, do not show the line number. I'm not sure why that is, but I find it can be very handy.
Is there a way to figure out which stage corresponds to which line of the PySpark code?
Yes. The Spark UI provides the Scala methods called from the PySpark actions in your Python code. Armed with the PySpark codebase, you can readily identify the calling PySpark method. In your example, cache is self-explanatory and a quick search for javaToPython reveals that it is called by the PySpark DataFrame.rdd method.
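A tiny sketch of the calls discussed above (the data is a placeholder), in case you want to reproduce what shows up in the UI:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("stage-name-demo").getOrCreate()
df = spark.range(1000000)   # placeholder data

df.cache().count()          # 'cache' shows up in the stage names, as noted above
df.toPandas()               # this line is referenced in the SQL tab
df.rdd.count()              # DataFrame.rdd calls javaToPython under the hood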
