My pyspark did not start in the terminal, but with Jupyter Notebook - apache-spark

Not long ago, when I typed pyspark in my terminal, it would eventually end up like this:
some information
>>>
But now it starts a Jupyter Notebook automatically.
This happens with spark-3.0.0-preview2-bin-hadoop3.2, and I have used many versions of Spark.
Is the above due to an error in my configuration, or to a change in a newer Spark release?
Thanks for your help.
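For what it's worth, this behaviour is usually driven by environment variables rather than by the Spark version itself: the pyspark launcher honours PYSPARK_DRIVER_PYTHON and PYSPARK_DRIVER_PYTHON_OPTS. A minimal Python check, assuming that standard mechanism (the values shown in the comments are only the typical Jupyter setup, not taken from your machine):
import os
# If these are set (e.g. in ~/.bashrc or conf/spark-env.sh), the pyspark launcher
# starts that program as the driver front end instead of the plain >>> shell.
for var in ("PYSPARK_DRIVER_PYTHON", "PYSPARK_DRIVER_PYTHON_OPTS"):
    print(var, "=", os.environ.get(var))
# Typical values that produce the Jupyter behaviour described above:
#   PYSPARK_DRIVER_PYTHON=jupyter
#   PYSPARK_DRIVER_PYTHON_OPTS=notebook
# Unsetting both should bring back the classic terminal REPL.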

Related

PySpark session builder doesn't start

I have a problem with PySpark in a Jupyter notebook. I installed Java and Spark, added the path variables, and didn't get an error. However, when I call the builder it keeps running and never starts. I waited more than 30 minutes, but it just kept running. The code is below:
import pyspark
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('Practise').getOrCreate()
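One way to narrow this down, sketched under the assumption of a plain local install (app name kept from the question): pin the master to local mode so the builder is not left waiting on a cluster manager it cannot reach, and print the version to confirm the session actually came up.
from pyspark.sql import SparkSession

# Minimal sketch: force local mode and keep the session bare-bones.
spark = (
    SparkSession.builder
    .master("local[*]")    # run on all local cores, no cluster required
    .appName("Practise")
    .getOrCreate()
)
print(spark.version)       # if this prints, the session started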

Notebook to write Java jobs for Spark

I am writing my first Spark job using Java API.
I want to run it using a notebook.
I am looking into Zeppelin and Jupyter.
In the Zeppelin documentation I see support for Scala, IPySpark, and SparkR. It is not clear to me whether using two interpreters (%spark.sql and %java) will let me work with the Java API of Spark SQL.
Jupyter has an "IJava" kernel, but I see no support for Spark with Java.
Are there other options?
@Victoriia Zeppelin 0.9.0 has a %java interpreter, with an example here:
zeppelin.apache.org
I tried to start with it on Google Cloud, but had some problems...
Use the magic command %jars path/to/spark.jar in the IJava cell, according to IJava's author,
then try import org.apache.spark.sql.* for example.

Spark exception 5063 in TensorFlow extended example code on GPU

I am trying to run the TensorFlow Extended example code at https://www.tensorflow.org/tfx/tutorials/transform/census
on a Databricks GPU cluster.
My env:
Databricks Runtime 7.1 ML (Spark 3.0.0, Scala 2.12, GPU)
Python 3.7
tensorflow==2.1.1
tensorflow-transform==0.22.0
apache_beam==2.21.0
When I run
transform_data(train, test, temp)
I get this error:
Exception: It appears that you are attempting to reference SparkContext from a broadcast variable, action, or transformation. SparkContext can only be used on the driver, not in code that it run on workers. For more information, see SPARK-5063
It seems that this is a known issue with RDDs in Spark:
https://issues.apache.org/jira/browse/SPARK-5063
I tried to find solutions here, but none of them work for me:
how to deal with error SPARK-5063 in spark
In the example code, I do not see where SparkContext is accessed from a worker explicitly.
Is it called from Apache Beam?
Thanks
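For reference, a minimal PySpark sketch (separate from the TFX code itself) of the pattern that SPARK-5063 complains about: a function shipped to the workers that captures the SparkContext.
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("spark-5063-demo").getOrCreate()
sc = spark.sparkContext
rdd = sc.parallelize(range(10))

# This is the forbidden pattern: the lambda captures sc, so Spark would have to
# serialize the SparkContext and ship it to the workers.
# bad = rdd.map(lambda x: sc.parallelize([x]).count()).collect()   # raises the SPARK-5063 error

# Worker-side code should only touch plain data; sc stays on the driver.
ok = rdd.map(lambda x: x * 2).collect()
print(ok)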

Notebook vs spark-submit

I'm very new to PySpark.
I am running a script (mainly creating a tf-idf matrix and predicting 9 categorical columns with it) in a Jupyter Notebook. It takes about 5 minutes when manually executing all cells, but about 45 minutes when running the same script via spark-submit. What is happening?
The same excess time also occurs if I run the code with python from the terminal.
I am also setting the configuration in the script as:
conf = SparkConf().set('spark.executor.memory', '45G').set('spark.driver.memory', '80G').set('spark.driver.maxResultSize', '20G')
Any help is appreciated. Thanks in advance.
There are various ways to run your Spark code, as you mentioned: a notebook, the pyspark shell, and spark-submit.
Regarding the Jupyter Notebook or pyspark shell:
while you run your code in a Jupyter notebook or the pyspark shell, it may have set some default values for executor memory, driver memory, executor cores, etc.
Regarding spark-submit:
when you use spark-submit, these defaults can be different. So the best approach is to pass these values as flags when submitting the PySpark application with the spark-submit utility.
The configuration object you created can also be passed when creating the SparkContext (sc):
sc = SparkContext(conf=conf)
Hope this helps.
Regards,
Neeraj
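To make the flags-versus-code point above concrete, here is a sketch of the in-code route (the app name is made up); note that spark.driver.memory set from inside the script generally only takes effect if the driver JVM has not already started, which is why passing --driver-memory to spark-submit is the more reliable route.
from pyspark import SparkConf, SparkContext

# Sketch only: the same settings as in the question, attached at context creation.
conf = (
    SparkConf()
    .setAppName("tfidf-job")                      # hypothetical app name
    .set("spark.executor.memory", "45G")
    .set("spark.driver.maxResultSize", "20G")
)
sc = SparkContext(conf=conf)

# Roughly equivalent flags when submitting instead:
#   spark-submit --executor-memory 45G --driver-memory 80G \
#                --conf spark.driver.maxResultSize=20G your_script.py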
I had the same problem, but I was initializing my spark variable with this line:
spark = SparkSession.builder.master("local[1]").appName("Test").getOrCreate()
The problem is that "local[X]" tells Spark to do the operations on the local machine, on X cores, so you have to tune X to the number of cores available on your machine.
To use it with a YARN cluster, you have to put "yarn".
There are many other possibilities listed here: https://spark.apache.org/docs/latest/submitting-applications.html
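A small sketch of that point, assuming a standalone machine (nothing here comes from the original poster's setup): check how many cores are available and hand that number to local[X], or switch the master to "yarn" on a cluster.
import os
from pyspark.sql import SparkSession

print("cores on this machine:", os.cpu_count())

# Match local[X] to the available cores; on a cluster use .master("yarn")
# or pass --master yarn to spark-submit instead of hard-coding it here.
spark = (
    SparkSession.builder
    .master(f"local[{os.cpu_count()}]")
    .appName("Test")
    .getOrCreate()
)
print("default parallelism:", spark.sparkContext.defaultParallelism)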

Why does SparkContext randomly close, and how do you restart it from Zeppelin?

I am working in Zeppelin writing spark-sql queries and sometimes I suddenly start getting this error (after not changing code):
Cannot call methods on a stopped SparkContext.
Then the output says further down:
The currently active SparkContext was created at:
(No active SparkContext.)
This obviously doesn't make sense. Is this a bug in Zeppelin? Or am I doing something wrong? How can I restart the SparkContext?
Thank you
I have faced this problem a couple of times.
If you are setting your master as yarn-client, it might be due to a stop/restart of the Resource Manager: the interpreter process may still be running, but the SparkContext (which is a YARN application) no longer exists.
You can check whether the SparkContext is still running by consulting your Resource Manager web interface and checking whether there is an application named Zeppelin running.
Sometimes restarting the interpreter process from within Zeppelin (interpreter tab --> spark --> restart) will solve the problem.
Other times you need to:
kill the Spark interpreter process from the command line
remove the Spark Interpreter PID file
and the next time you run a paragraph it will start a new SparkContext
I'm facing the same problem running multiple jobs in PySpark. It seems that in Spark 2.0.0, with SparkSession, when I call spark.stop(), SparkSession runs the following:
# SparkSession
self._sc.stop()
# SparkContext.stop()
self._jsc = None
Then, when I try to create a new job with a new SparkContext, SparkSession returns the same SparkContext as before, with self._jsc = None.
I solved it by setting SparkSession._instantiatedContext = None after spark.stop(), forcing SparkSession to create a new SparkContext the next time I ask for one.
It's not the best option, but for now it solves my issue.
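Spelled out, that workaround looks roughly like this (it relies on a private attribute, so it may differ between Spark versions):
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("job-1").getOrCreate()
# ... run the first job ...
spark.stop()

# Clear the cached context so the next getOrCreate() builds a fresh SparkContext
# instead of handing back the stopped one (Spark 2.0.x behaviour).
SparkSession._instantiatedContext = None

spark = SparkSession.builder.appName("job-2").getOrCreate()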
I've noticed this issue more when running pyspark commands: even with trivial variable declarations, a cell execution hangs in the running state.
As mentioned above by user1314742, just killing the relevant PID solves this issue for me.
e.g.:
ps -ef | grep zeppelin
This is for the case where restarting the Spark interpreter and restarting the Zeppelin notebook do not solve the issue, I guess because Zeppelin cannot control the hung PID itself.
Could you check whether your driver memory is enough? I solved this issue by:
enlarging the driver memory
tuning GC:
--conf spark.cleaner.periodicGC.interval=60
--conf spark.cleaner.referenceTracking.blocking=false
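If you would rather attach those two cleaner settings in code than on the command line, a sketch (same values as above):
from pyspark.sql import SparkSession

# Sketch: the same two cleaner settings as the --conf flags above.
spark = (
    SparkSession.builder
    .config("spark.cleaner.periodicGC.interval", "60")
    .config("spark.cleaner.referenceTracking.blocking", "false")
    .getOrCreate()
)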
