Killing a SparkContext so I can create a new one - apache-spark

I've been trying to run a Jupyter Notebook setup for PySpark v2.1.1, but every time I try to instantiate a context (with a freshly restarted kernel, and with the derby.log file and metastore_db directory deleted), I get the following error telling me a context is already running.
ValueError: Cannot run multiple SparkContexts at once;
existing SparkContext(app=PySparkShell, master=local[16]) created by
<module> at /home/ubuntu/anaconda2/lib/python2.7/site-packages/IPython/utils/py3compat.py:289
I've tried restarting the kernel and deleting derby.log, and I've also attempted to load the existing context using the app name and master shown in the error and then stop it, all to no avail:
sc = SparkContext(appName='PySparkShell', master='local[16]')
sc.stop()
Has anyone had this problem and know how to just get a context running in a Jupyter Notebook when this happens?

So instead of figuring out how to kill the SparkContext that is already running, you can apparently "get" (or "create") the already created context by calling
from pyspark import SparkContext
sc = SparkContext.getOrCreate()
at the beginning of your Jupyter notebook.
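If it helps, here is a minimal sketch of that pattern (the stop-and-recreate part is my own addition rather than something from the original answer, and the app name and core count are just placeholders): reuse whatever context getOrCreate() hands back, or stop it and build a replacement with your own configuration, since PySpark only allows one active SparkContext at a time.
from pyspark import SparkConf, SparkContext

# Reuse the context the notebook (or a previous cell) already created.
sc = SparkContext.getOrCreate()
print(sc.appName, sc.master)

# To change the configuration, stop the existing context first and then
# create a replacement -- only one SparkContext can be active at a time.
sc.stop()
conf = SparkConf().setAppName("MyNotebookApp").setMaster("local[4]")  # placeholder values
sc = SparkContext(conf=conf)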

Related

Exception: "SparkContext should only be created and accessed on the driver" while trying foreach()

Being new to Spark, I need to read data from a MySQL DB and then update (or upsert) rows in another table based on what I've read.
AFAIK, unfortunately, there's no way to do an update with DataFrameWriter, so I want to try querying the DB directly after/while iterating over partitions.
For now I'm writing a script and testing with local gluepyspark shell, Spark version 3.1.1-amzn-0.
from pyspark.context import SparkContext
from pyspark.sql.session import SparkSession
sc = SparkContext.getOrCreate()
spark = SparkSession(sc)
def f(p):
    pass
sc.parallelize([1, 2, 3, 4, 5]).foreachPartition(lambda p: f(p))
When I try to import this simple code in the gluepyspark shell, it raises an error saying "SparkContext should only be created and accessed on the driver."
However, there are some conditions under which it works.
It works if I run the script via gluesparksubmit.
It works if I use lambda expression instead of function declaration.
It works if I declare a function within REPL and pass it as argument.
It does not work if I put both the def func(): ... declaration and the .foreachPartition(func) call in the same script.
Moving the function declaration to another module also seems to work, but that isn't an option for me, since I need to pack everything into one job script.
Could anyone please help me understand:
why the error is thrown
why the error is NOT thrown in other cases
Complete error log: https://justpaste.it/37tj6
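No accepted answer is reproduced here, but one commonly suggested workaround, sketched below under the assumption that the usual explanation applies (a function defined at module level is pickled by reference, so the executors re-import the script and re-run its top-level SparkContext creation), is to keep all driver-side setup out of module scope:
from pyspark.context import SparkContext
from pyspark.sql.session import SparkSession

def f(partition):
    # Executor-side work per partition (e.g. open a DB connection and upsert rows).
    pass

def main():
    # Driver-only setup: with the context created here, importing this module
    # (on an executor, or in the REPL) no longer touches SparkContext.
    sc = SparkContext.getOrCreate()
    spark = SparkSession(sc)
    sc.parallelize([1, 2, 3, 4, 5]).foreachPartition(f)

if __name__ == "__main__":
    main()
Run via gluesparksubmit this should behave as before; when the script is imported into the gluepyspark shell, you would call main() explicitly.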

Cannot call methods on a stopped SparkContext while creating SparkSession

I tried to run a ScalaTest suite locally and got an error pointing to the SparkSession creation. I have no clue why this error message pops up. Please refer to the image for more details.
The only thing I have tried so far is to look for any spark.stop() call in my codebase, but I found none. I wonder if anyone has any ideas why it stops at the indicated line (line 53 in the image).
val spark: SparkSession = SparkSession.builder()
  .master("local[*]")
  .appName("Data Validation")
  .config("spark.sql.autoBroadcastJoinThreshold", -1)
  .getOrCreate()
Thanks!

Why my first Spark/YARN app doesn't start (spark-submit error)

I am a newbie in distributed systems and big data. I recently started with Hadoop/YARN and Spark (Spark on YARN) for my graduation project, and for now I am blocked.
I want to start my first Spark application, but I don't know what the issue is. When I use spark-submit to start the Python script
#!/usr/bin/env python
from numpy import array
from pyspark import SparkContext

sc = SparkContext("local[*]", appName="app")

# Read the text file and print its raw lines.
data = sc.textFile("test.txt")
print(data.collect())

# Parse each line into a numpy array of floats.
parsedData = data.map(lambda line: array([float(x) for x in line.split(' ')]))
print(parsedData.collect())
this error shows up (unable to load Hadoop library ...).
If someone can help me, please do.
Here's a capture of the error:

Setting PYSPARK_SUBMIT_ARGS causes creating SparkContext to fail

A little backstory to my problem: I've been working on a Spark project and recently switched my OS to Debian 9. After the switch, I reinstalled Spark version 2.2.0 and started getting the following error when running pytest:
E Exception: Java gateway process exited before sending the driver its port number
After googling for a little while, it looks like people have been seeing this cryptic error in two situations: 1) when trying to use Spark with Java 9; 2) when the environment variable PYSPARK_SUBMIT_ARGS is set.
It looks like I'm in the second scenario, because I'm using Java 1.8. I have written a minimal example:
from pyspark import SparkContext
import os
def test_whatever():
    os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages graphframes:graphframes:0.5.0-spark2.1-s_2.11,com.databricks:spark-avro_2.11:3.2.0 pyspark-shell'
    sc = SparkContext.getOrCreate()
It fails with said error, but when the fourth line is commented out, the test is fine (I invoke it with pytest file_name.py).
Removing this env variable is not -- at least I don't think it is -- a solution to the problem, because it passes some important information to the SparkContext. I can't find any documentation on this and am completely lost.
I would appreciate any hints on this.
Putting this at the top of my Jupyter notebook works for me:
import os
os.environ['JAVA_HOME'] = '/usr/lib/jvm/java-8-openjdk-amd64/'
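Putting the two snippets from this question together, a minimal sketch looks like the following; the JAVA_HOME path and the package list are the ones quoted above (adjust them for your machine), and both variables have to be set before the first SparkContext is created, with PYSPARK_SUBMIT_ARGS ending in pyspark-shell as in the example.
import os
from pyspark import SparkContext

# Both variables are read when the Java gateway process is launched, so set
# them before the first SparkContext is created.
os.environ['JAVA_HOME'] = '/usr/lib/jvm/java-8-openjdk-amd64/'  # path from the answer above; adjust for your system
os.environ['PYSPARK_SUBMIT_ARGS'] = (
    '--packages graphframes:graphframes:0.5.0-spark2.1-s_2.11,'
    'com.databricks:spark-avro_2.11:3.2.0 pyspark-shell'  # needs to end with pyspark-shell
)

sc = SparkContext.getOrCreate()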

NoClassDefFoundError when using avro in spark-shell

I keep getting
java.lang.NoClassDefFoundError: org/apache/avro/mapred/AvroWrapper
when calling show() on a DataFrame object. I'm attempting to do this through the shell (spark-shell --master yarn). I can see that the shell recognizes the schema when creating the DataFrame object, but if I execute any actions on the data it will always throw the NoClassDefFoundError when trying to instantiate the AvroWrapper. I've tried adding avro-mapred-1.8.0.jar to my $HDFS_USER/lib directory on the cluster, and I've even included it using the --jars option when launching the shell. Neither of these options worked. Any advice would be greatly appreciated. Below is example code:
scala> import org.apache.spark.sql._
scala> import com.databricks.spark.avro._
scala> val sqc = new SQLContext(sc)
scala> val df = sqc.read.avro("my_avro_file") // recognizes the schema and creates the DataFrame object
scala> df.show // this is where I get NoClassDefFoundError
The DataFrame object itself is created at the val df = ... line, but the data is not read yet. Spark only starts reading and processing the data when you ask for some kind of output (like df.count() or df.show()).
So the original issue is that the avro-mapred package is missing.
Try launching your Spark Shell like this:
spark-shell --packages org.apache.avro:avro-mapred:1.7.7,com.databricks:spark-avro_2.10:2.0.1
The Spark Avro package marks the Avro Mapred package as provided, but it is not available on your system (or classpath) for one reason or another.
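If you are driving this from PySpark rather than spark-shell, a rough equivalent (my own sketch, not part of the original answer; the package coordinates are copied from the command above and would need to match your Spark and Scala versions) is to supply the same coordinates through the spark.jars.packages setting when building the session:
from pyspark.sql import SparkSession

# Sketch: pass the same package coordinates as the spark-shell command above,
# so avro-mapred ends up on the classpath (adjust versions to your cluster).
spark = (
    SparkSession.builder
    .appName("avro-example")  # placeholder app name
    .config("spark.jars.packages",
            "org.apache.avro:avro-mapred:1.7.7,com.databricks:spark-avro_2.10:2.0.1")
    .getOrCreate()
)

# The Databricks spark-avro package exposes the 'com.databricks.spark.avro' format.
df = spark.read.format("com.databricks.spark.avro").load("my_avro_file")
df.show()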
If anyone else runs into this problem, I finally solved it. I removed the CDH Spark package and downloaded it from http://spark.apache.org/downloads.html. After that everything worked fine. Not sure what the issue was with the CDH version, but I'm not going to waste any more time trying to figure it out.
