Setting PYSPARK_SUBMIT_ARGS causes creating SparkContext to fail - apache-spark

A little backstory to my problem: I've been working on a Spark project and recently switched my OS to Debian 9. After the switch, I reinstalled Spark version 2.2.0 and started getting the following error when running pytest:
E Exception: Java gateway process exited before sending the driver its port number
After googling for a little while, it looks like people have been seeing this cryptic error in two situations: 1) when trying to use Spark with Java 9; 2) when the environment variable PYSPARK_SUBMIT_ARGS is set.
It looks like I'm in the second scenario, because I'm using Java 1.8. I have written a minimal example:
from pyspark import SparkContext
import os
def test_whatever():
    os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages graphframes:graphframes:0.5.0-spark2.1-s_2.11,com.databricks:spark-avro_2.11:3.2.0 pyspark-shell'
    sc = SparkContext.getOrCreate()
It fails with said error, but when the fourth line (the one setting PYSPARK_SUBMIT_ARGS) is commented out, the test passes (I invoke it with pytest file_name.py).
Removing this env variable is not a solution to this problem (at least I don't think it is), because it passes important information to the SparkContext. I can't find any documentation on this and am completely lost.
I would appreciate any hints on this.

Putting this at the top of my Jupyter notebook works for me:
import os
os.environ['JAVA_HOME'] = '/usr/lib/jvm/java-8-openjdk-amd64/'
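For what it's worth, since the question sets PYSPARK_SUBMIT_ARGS and this answer sets JAVA_HOME, a combined sketch might look like the following. The JAVA_HOME path and the package coordinates are copied from the snippets above and are assumptions that depend on your machine; keeping pyspark-shell at the end of PYSPARK_SUBMIT_ARGS is commonly required for the gateway to start.
import os
from pyspark import SparkContext

# Both values below are copied from the question/answer above; adjust them
# for your own machine (Java 8 install path, package versions).
os.environ['JAVA_HOME'] = '/usr/lib/jvm/java-8-openjdk-amd64/'
os.environ['PYSPARK_SUBMIT_ARGS'] = (
    '--packages graphframes:graphframes:0.5.0-spark2.1-s_2.11,'
    'com.databricks:spark-avro_2.11:3.2.0 pyspark-shell'
)

# Create the context only after both variables are in place.
sc = SparkContext.getOrCreate()
print(sc.version)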

Related

Pyspark - No module named coverage_daemon

I am trying to execute this simple code on my dataframe:
import ast
rddAlertsRdd = df.rdd.map(lambda message: ast.literal_eval(message['value']))
rddAlerts = rddAlertsRdd.collect()
But I'm getting the error below:
Versions:
Spark: 3.3.1
Hadoop: 2.7
Python: 3.7
Pyspark: 3.3.1
Py4j: 0.10.9.5
OpenJDK: 8
Could it be a problem related to version compatibility? I'd appreciate your help!
In order to solve the problem I tried changing the Spark environment variables in my Dockerfile.
This is what I have in my Dockerfile:
tl;dr: No idea what could be wrong, but here is a little more about the possible cause from reading the source code. Hope this helps.
The only place with coverage_daemon is python/test_coverage/conf/spark-defaults.conf, which (as you may have guessed already) is for test coverage and does not seem to be used in production.
It appears that for some reason python/run-tests-with-coverage got executed.
It looks as if you're using a Jupyter environment that is misconfigured.
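Not a definitive fix, but a small sketch that can make such a misconfiguration visible: print which Python interpreter the driver (the notebook kernel) and the executors actually use, together with the relevant environment variables.
import os
import sys
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

# Interpreter used by the driver (i.e. the notebook kernel).
print("driver python:", sys.executable)
print("PYSPARK_PYTHON:", os.environ.get("PYSPARK_PYTHON"))
print("PYSPARK_DRIVER_PYTHON:", os.environ.get("PYSPARK_DRIVER_PYTHON"))

# Interpreter used on the executor side (tiny job, one partition).
print("executor python:", sc.parallelize([0], 1).map(lambda _: sys.executable).collect())
If the driver and executor interpreters (or their environments) differ, that mismatch is a plausible source of the stray coverage_daemon behaviour described above.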

Exception: "SparkContext should only be created and accessed on the driver" while trying foreach()

Being new to Spark, I need to read data from a MySQL DB and then update (or upsert) rows in another table based on what I've read.
AFAIK, unfortunately, there's no way I can do an update with DataFrameWriter, so I want to try querying the DB directly after/while iterating over partitions.
For now I'm writing a script and testing it with the local gluepyspark shell, Spark version 3.1.1-amzn-0.
from pyspark.context import SparkContext
from pyspark.sql.session import SparkSession

sc = SparkContext.getOrCreate()
spark = SparkSession(sc)

def f(p):
    pass

sc.parallelize([1, 2, 3, 4, 5]).foreachPartition(lambda p: f(p))
When I try to import this simple code in the gluepyspark shell, it raises an error saying "SparkContext should only be created and accessed on the driver."
However, there are some conditions under which it works.
It works if I run the script via gluesparksubmit.
It works if I use a lambda expression instead of a function declaration (a minimal sketch of both variants follows the error log link below).
It works if I declare a function within the REPL and pass it as an argument.
It does not work if I put both the def func(): ... definition and the .foreachPartition(func) call in the same script.
Moving the function declaration to another module also seems to work, but that isn't an option for me because I need to pack everything into one job script.
Could anyone please help me understand:
why the error is thrown
why the error is NOT thrown in other cases
Complete error log: https://justpaste.it/37tj6
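For reference, a minimal sketch of the two variants described above, exactly as the question reports them; it does not explain the difference, it only isolates it.
from pyspark.context import SparkContext
from pyspark.sql.session import SparkSession

sc = SparkContext.getOrCreate()
spark = SparkSession(sc)

# Reported as working when the script is imported into the gluepyspark shell:
sc.parallelize([1, 2, 3, 4, 5]).foreachPartition(lambda p: None)

# Reported as failing in the same situation: a module-level named function.
# def func(p):
#     pass
# sc.parallelize([1, 2, 3, 4, 5]).foreachPartition(func)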

why my first spark/yarn app doesn't start (spark-submit error)

I am a newbie in distributed systems and big data. I recently started with Hadoop/YARN and Spark (Spark on YARN) for my graduation project, and for now I am stuck.
I want to start my first Spark application, but I don't know what the issue is. When I use spark-submit to start the Python script
#!/usr/bin/env python
from pyspark import SparkContext
from numpy import array

sc = SparkContext("local[*]", appName="app")

data = sc.textFile("test.txt")
print(data.collect())

parsedData = data.map(lambda line: array([float(x) for x in line.split(' ')]))
print(parsedData.collect())
this error shows up (unable to load Hadoop library...).
If someone can help me, please.
Here's a capture of the error:
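One detail that may be worth double-checking, given that the goal is to run on YARN: the script above hardcodes local[*] as the master. A sketch that leaves the master to spark-submit instead (for example --master yarn on the command line); note that on a cluster the input path usually has to be reachable by the executors (for example on HDFS):
#!/usr/bin/env python
from pyspark import SparkContext
from numpy import array

# No master here: let spark-submit supply it (local[*], yarn, ...).
sc = SparkContext(appName="app")

data = sc.textFile("test.txt")  # on YARN this path should normally live on HDFS
print(data.collect())

parsedData = data.map(lambda line: array([float(x) for x in line.split(' ')]))
print(parsedData.collect())

sc.stop()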

Killing a SparkContext so I can create a new one

I've been trying to run a Jupyter Notebook setup for PySpark v2.1.1, but every time I try instantiating a context (with a freshly restarted kernel, and with the derby.log file and metastore_db dir deleted), I get the following error telling me a context is already running:
ValueError: Cannot run multiple SparkContexts at once;
existing SparkContext(app=PySparkShell, master=local[16]) created by
<module> at /home/ubuntu/anaconda2/lib/python2.7/site-packages/IPython/utils/py3compat.py:289
I've tried restarting the kernel and deleting derby.log, and I also attempted to load the existing context using the app name and master given in the error and then stop it, to no avail:
sc = SparkContext(appName='PySparkShell', master='local[16]')
sc.stop()
Has anyone had this problem and know how to just get a context running in a Jupyter Notebook when this happens?
So instead of figuring out how to kill the SparkContext that is already running, apparently you can "get" (or "create") an already created context by calling
from pyspark import SparkContext
sc = SparkContext.getOrCreate()
at the beginning of your Jupyter notebook.
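If the goal really is a brand-new context (for example with a different configuration) rather than reusing the running one, a small sketch building on the same idea; the app name and master below are just examples:
from pyspark import SparkConf, SparkContext

# Grab whatever context is already running (or create one if none exists).
sc = SparkContext.getOrCreate()

# Stop it, then build a fresh context with the settings you actually want.
sc.stop()
conf = SparkConf().setAppName("MyNotebookApp").setMaster("local[4]")
sc = SparkContext(conf=conf)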

Combining PyCharm, Spark and Jupyter

In the current setup I use a Jupyter notebook server that has a pyspark profile to use Spark. This all works great. I'm however working on a pretty big project and the notebook environment is lacking a bit for me. I found out that PyCharm allows you to run notebooks inside the IDE, giving you more of the advantages of a full IDE as opposed to Jupyter.
In the best case scenario I would run PyCharm locally as opposed to remote desktop on the gateway but using the gateway would be an acceptable alternative.
I'm trying first to get it to work on the gateway. If I have my (Spark) Jupyter server running with the IP address set correctly (127.0.0.1:8888) and I create an .ipynb file, then after I enter a line and press enter (not running it, just adding a newline), I get the following error in the terminal I started PyCharm from:
ERROR - pplication.impl.LaterInvocator - Not a stub type: Py:IPNB_TARGET in class org.jetbrains.plugins.ipnb.psi.IpnbPyTargetExpression
Googling doesn't get me anywhere.
I was able to get all three working by installing Spark via the terminal on OS X. Then I added the following packages to the PyCharm project interpreter: findspark, pyspark.
Tested it out with:
import findspark
findspark.init()

import pyspark
import random

sc = pyspark.SparkContext(appName="Pi")
num_samples = 100000000

def inside(p):
    x, y = random.random(), random.random()
    return x*x + y*y < 1

count = sc.parallelize(range(0, num_samples)).filter(inside).count()
pi = 4 * count / num_samples
print(pi)
sc.stop()
outputting: 3.14160028
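If Spark isn't picked up automatically, findspark.init() also accepts an explicit Spark home; the path below is only an example and depends on where Spark was installed:
import findspark

# Point findspark at the Spark installation explicitly (example path).
findspark.init("/usr/local/spark")

import pyspark
print(pyspark.__version__)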
