Spark SQL RDD loads in pyspark but not in spark-submit: "JDBCRDD: closed connection" - apache-spark

I have the following simple code for loading a table from my Postgres database into an RDD.
# this setup is just for spark-submit, will be ignored in pyspark
from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext
conf = SparkConf().setAppName("GA")#.setMaster("localhost")
sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)
# func for loading table
def get_db_rdd(table):
url = "jdbc:postgresql://localhost:5432/harvest?user=postgres"
print(url)
lower = 0
upper = 1000
ret = sqlContext \
.read \
.format("jdbc") \
.option("url", url) \
.option("dbtable", table) \
.option("partitionColumn", "id") \
.option("numPartitions", 1024) \
.option("lowerBound", lower) \
.option("upperBound", upper) \
.option("password", "password") \
.load()
ret = ret.rdd
return ret
# load table, and print results
print(get_db_rdd("mytable").collect())
I run ./bin/pyspark then paste that into the interpreter, and it prints out the data from my table as expected.
Now, if I save that code to a file named test.py then do ./bin/spark-submit test.py, it starts to run, but then I see these messages spam my console forever:
17/02/16 02:24:21 INFO Executor: Running task 45.0 in stage 0.0 (TID 45)
17/02/16 02:24:21 INFO JDBCRDD: closed connection
17/02/16 02:24:21 INFO Executor: Finished task 45.0 in stage 0.0 (TID 45). 1673 bytes result sent to driver
Edit: This is on a single machine. I haven't started any masters or slaves; spark-submit is the only command I run after system start. I also tried the master/slave setup, with the same results.
My spark-env.sh file looks like this:
export SPARK_WORKER_INSTANCES=2
export SPARK_WORKER_CORES=2
export SPARK_WORKER_MEMORY=800m
export SPARK_EXECUTOR_MEMORY=800m
export SPARK_EXECUTOR_CORES=2
export SPARK_CLASSPATH=/home/ubuntu/spark/pg_driver.jar # Postgres driver I need for SQLContext
export PYTHONHASHSEED=1337 # have to make workers use same seed in Python3
It works if I spark-submit a Python file that just creates an RDD from a list or something. I only have problems when I try to use a JDBC RDD. What piece am I missing?

When using spark-submit, you should supply the jar to the executors.
As mentioned in the Spark 2.1 JDBC documentation:
To get started you will need to include the JDBC driver for your particular database on the Spark classpath. For example, to connect to Postgres from the Spark shell you would run the following command:
bin/spark-shell --driver-class-path postgresql-9.4.1207.jar --jars postgresql-9.4.1207.jar
Note: the same applies to the spark-submit command.
Troubleshooting
The JDBC driver class must be visible to the primordial class loader
on the client session and on all executors. This is because Java’s
DriverManager class does a security check that results in it ignoring
all drivers not visible to the primordial class loader when one goes
to open a connection. One convenient way to do this is to modify
compute_classpath.sh on all worker nodes to include your driver JARs.
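Applied to the setup in the question, that would look something like the following sketch (using the pg_driver.jar path from spark-env.sh and the test.py file from the question):
./bin/spark-submit \
  --driver-class-path /home/ubuntu/spark/pg_driver.jar \
  --jars /home/ubuntu/spark/pg_driver.jar \
  test.py
With the jar passed this way, the SPARK_CLASSPATH export in spark-env.sh is presumably no longer needed for the driver.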

This is a horrible hack. I'm not considering this the answer, but it does work.
Alright, only pyspark works? Fine, then we'll use it. I wrote this Bash script (pyspark_wrapper.sh, referenced below):
cat $1 | $SPARK_HOME/bin/pyspark # pipe the Python file into pyspark
I run that script from the Python script that submits my jobs. I'm also including the code I use to pass arguments between the processes, in case it helps someone:
import os
import subprocess

new_env = os.environ.copy()
new_env["pyspark_argument_1"] = "some param I need in my Spark script"  # etc...
p = subprocess.Popen(["pyspark_wrapper.sh {}".format(py_fname)], shell=True, env=new_env)
In my Spark script:
something_passed_from_submitter = os.environ["pyspark_argument_1"]
# do stuff in Spark...
I feel like Spark is better supported and (if this is a bug) less buggy with Scala than with Python 3, so that might be the better solution for now. But my script uses some files we wrote in Python 3, so...

Related

pyspark connection to MariaDB fails with ClassNotFoundException

I'm trying to retrieve data from MariaDB with pyspark.
I created the SparkSession with configuration intended to include the JDBC jar file, but that didn't solve the problem. The current code to create the session looks like this.
path = "hdfs://nameservice1/user/PATH/TO/JDBC/mariadb-java-client-2.7.1.jar"
# or path = "/home/PATH/TO/JDBC/mariadb-java-client-2.7.1.jar"
spark = SparkSession.builder \
    .config("spark.jars", path) \
    .config("spark.driver.extraClassPath", path) \
    .config("spark.executor.extraClassPath", path) \
    .enableHiveSupport() \
    .getOrCreate()
Note that I've tried every configuration variation I know of (checking permissions, switching the path between HDFS and local, adding and removing configuration options, ...).
The code to load the data is:
sql = "SOME_SQL_TO_RETRIEVE_DATA"
df = spark.read.format('jdbc') \
    .option('dbtable', sql) \
    .option('url', 'jdbc:mariadb://{host}:{port}/{db}') \
    .option("user", SOME_USER) \
    .option("password", SOME_PASSWORD) \
    .option("driver", 'org.mariadb.jdbc.Driver') \
    .load()
But it fails with java.lang.ClassNotFoundException: org.mariadb.jdbc.Driver
When I tried this with spark-submit, I saw this log message:
... INFO SparkContext: Added Jar /PATH/TO/JDBC/mariadb-java-client-2.7.1.jar at spark://SOME_PATH/jars/mariadb-java-client-2.7.1.jar with timestamp SOME_TIMESTAMP
What is wrong?
For anyone who suffers from the same problem:
I figured it out. The Spark documentation says:
Note: In client mode, this config must not be set through the SparkConf directly in your application, because the driver JVM has already started at that point. Instead, please set this through the --driver-class-path command line option or in your default properties file.
So instead of setting the configuration in the Python code, I added arguments to spark-submit, following that documentation:
spark-submit {other arguments ...} \
--driver-class-path PATH/TO/JDBC/my-jdbc.jar \
--jars PATH/TO/JDBC/my-jdbc.jar \
MY_PYTHON_SCRIPT.py
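Alternatively, since the quoted documentation also mentions the default properties file, the same two settings can presumably go into conf/spark-defaults.conf instead of the command line (a sketch, reusing the placeholder jar path from the command above):
spark.driver.extraClassPath  PATH/TO/JDBC/my-jdbc.jar
spark.jars                   PATH/TO/JDBC/my-jdbc.jar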

How to run python spark script with specific jars

I have to run a Python script on an EMR instance using pyspark to query DynamoDB. I am able to query DynamoDB from the pyspark shell when I launch it with the required jars, using the following command:
`pyspark --jars /usr/share/aws/emr/ddb/lib/emr-ddb-hive.jar,/usr/share/aws/emr/ddb/lib/emr-ddb-hadoop.jar`
I ran the following Python 3 script to query data using the pyspark Python module:
import time
from pyspark import SparkContext, SparkConf
from pyspark.sql import SparkSession, HiveContext
start_time = time.time()
SparkContext.setSystemProperty("hive.metastore.uris", "thrift://nn1:9083")
sparkSession = (SparkSession
                .builder
                .appName('example-pyspark-read-and-write-from-hive')
                .enableHiveSupport()
                .getOrCreate())
df_load = sparkSession.sql("SELECT * FROM example")
df_load.show()
print(time.time() - start_time)
This caused the following runtime exception for the missing jars:
java.lang.ClassNotFoundException Class org.apache.hadoop.hive.dynamodb.DynamoDBSerDe not found
How do I convert the pyspark --jars ... invocation to a Pythonic equivalent?
As of now I have tried copying the jars from /usr/share/... to $SPARK_HOME/libs/jars and adding that path to the spark-defaults.conf extra class path, which had no effect.
Use the spark-submit command to execute your Python script. Example:
spark-submit --jars /usr/share/aws/emr/ddb/lib/emr-ddb-hive.jar,/usr/share/aws/emr/ddb/lib/emr-ddb-hadoop.jar script.py
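If you specifically want a Pythonic equivalent of pyspark --jars, one possible sketch is to set PYSPARK_SUBMIT_ARGS before the SparkSession is created, so that the JVM PySpark launches picks up the same --jars list (the jar paths are the EMR DynamoDB jars from the question; this is an illustration, not part of the original answer):
import os

# Must be set before the SparkSession/SparkContext is created,
# because that is when PySpark launches the JVM via spark-submit.
os.environ["PYSPARK_SUBMIT_ARGS"] = (
    "--jars /usr/share/aws/emr/ddb/lib/emr-ddb-hive.jar,"
    "/usr/share/aws/emr/ddb/lib/emr-ddb-hadoop.jar pyspark-shell"
)

from pyspark.sql import SparkSession

sparkSession = (SparkSession
                .builder
                .appName('example-pyspark-read-and-write-from-hive')
                .enableHiveSupport()
                .getOrCreate())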

--files option in pyspark not working

I tried the sc.addFile option (working without any issues) and the --files option from the command line (which failed).
Run 1 : spark_distro.py
from pyspark import SparkContext, SparkConf
from pyspark import SparkFiles
def import_my_special_package(x):
    from external_package import external
    ext = external()
    return ext.fun(x)
conf = SparkConf().setAppName("Using External Library")
sc = SparkContext(conf=conf)
sc.addFile("/local-path/readme.txt")
with open(SparkFiles.get('readme.txt')) as test_file:
    lines = [line.strip() for line in test_file]
print(lines)
int_rdd = sc.parallelize([1, 2, 4, 3])
mod_rdd = sorted(int_rdd.filter(lambda z: z%2 == 1).map(lambda x:import_my_special_package(x)))
external package: external_package.py
class external(object):
    def __init__(self):
        pass
    def fun(self, input):
        return input*2
readme.txt
MY TEXT HERE
spark-submit command
spark-submit \
--master yarn-client \
--py-files /path to local codelib/external_package.py \
/local-pgm-path/spark_distro.py \
1000
Output: Working as expected
['MY TEXT HERE']
But if I try to pass the file (readme.txt) from the command line using --files (instead of sc.addFile), it fails, as shown below.
Run 2 : spark_distro.py
from pyspark import SparkContext, SparkConf
from pyspark import SparkFiles
def import_my_special_package(x):
from external_package import external
ext = external()
return ext.fun(x)
conf = SparkConf().setAppName("Using External Library")
sc = SparkContext(conf=conf)
with open(SparkFiles.get('readme.txt')) as test_file:
    lines = [line.strip() for line in test_file]
print(lines)
int_rdd = sc.parallelize([1, 2, 4, 3])
mod_rdd = sorted(int_rdd.filter(lambda z: z%2 == 1).map(lambda x: import_my_special_package(x)))
external_package.py: same as above
spark-submit command
spark-submit \
--master yarn-client \
--py-files /path to local codelib/external_package.py \
--files /local-path/readme.txt#readme.txt \
/local-pgm-path/spark_distro.py \
1000
Output:
Traceback (most recent call last):
File "/local-pgm-path/spark_distro.py", line 31, in <module>
with open(SparkFiles.get('readme.txt')) as test_file:
IOError: [Errno 2] No such file or directory: u'/tmp/spark-42dff0d7-c52f-46a8-8323-08bccb412cd6/userFiles-8bd16297-1291-4a37-b080-bbc3836cb512/readme.txt'
Are sc.addFile and --files used for the same purpose? Can someone please share your thoughts?
I have finally figured out the issue, and it is a very subtle one indeed.
As suspected, the two options (sc.addFile and --files) are not equivalent, and this is (admittedly very subtly) hinted at in the documentation (emphasis added):
addFile(path, recursive=False)
Add a file to be downloaded with this Spark job on every node.
--files FILES
Comma-separated list of files to be placed in the working
directory of each executor.
In plain English, while files added with sc.addFile are available to both the executors and the driver, files added with --files are available only to the executors; hence, when trying to access them from the driver (as is the case in the OP), we get a No such file or directory error.
Let's confirm this (getting rid of all the irrelevant --py-files and 1000 stuff in the OP):
test_fail.py:
from pyspark import SparkContext, SparkConf
from pyspark import SparkFiles
conf = SparkConf().setAppName("Use External File")
sc = SparkContext(conf=conf)
with open(SparkFiles.get('readme.txt')) as test_file:
    lines = [line.strip() for line in test_file]
print(lines)
Test:
spark-submit --master yarn \
--deploy-mode client \
--files /home/ctsats/readme.txt \
/home/ctsats/scripts/SO/test_fail.py
Result:
[...]
17/11/10 15:05:39 INFO yarn.Client: Uploading resource file:/home/ctsats/readme.txt -> hdfs://host-hd-01.corp.nodalpoint.com:8020/user/ctsats/.sparkStaging/application_1507295423401_0047/readme.txt
[...]
Traceback (most recent call last):
File "/home/ctsats/scripts/SO/test_fail.py", line 6, in <module>
with open(SparkFiles.get('readme.txt')) as test_file:
IOError: [Errno 2] No such file or directory: u'/tmp/spark-8715b4d9-a23b-4002-a1f0-63a1e9d3e00e/userFiles-60053a41-472e-4844-a587-6d10ed769e1a/readme.txt'
In the above script test_fail.py, it is the driver program that requests access to the file readme.txt; let's change the script, so that access is requested for the executors (test_success.py):
from pyspark import SparkContext, SparkConf
conf = SparkConf().setAppName("Use External File")
sc = SparkContext(conf=conf)
lines = sc.textFile("readme.txt") # run in the executors
print(lines.collect())
Test:
spark-submit --master yarn \
--deploy-mode client \
--files /home/ctsats/readme.txt \
/home/ctsats/scripts/SO/test_success.py
Result:
[...]
17/11/10 15:16:05 INFO yarn.Client: Uploading resource file:/home/ctsats/readme.txt -> hdfs://host-hd-01.corp.nodalpoint.com:8020/user/ctsats/.sparkStaging/application_1507295423401_0049/readme.txt
[...]
[u'MY TEXT HERE']
Notice also that here we don't need SparkFiles.get - the file is readily accessible.
As said above, sc.addFile will work in both cases, i.e. when access is requested either by the driver or by the executors (tested but not shown here).
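For completeness, a minimal sketch of that sc.addFile variant (my own illustration, reusing the /home/ctsats/readme.txt path from the tests above):
from pyspark import SparkContext, SparkConf
from pyspark import SparkFiles

conf = SparkConf().setAppName("Use External File")
sc = SparkContext(conf=conf)

# distribute the file ourselves instead of relying on --files;
# addFile also makes it available on the driver
sc.addFile("/home/ctsats/readme.txt")

with open(SparkFiles.get('readme.txt')) as test_file:  # driver-side access works here
    lines = [line.strip() for line in test_file]
print(lines)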
Regarding the order of the command line options: as I have argued elsewhere, all Spark-related arguments must be before the script to be executed; arguably, the relative order of --files and --py-files is irrelevant (leaving it as an exercise).
Tested with both Spark 1.6.0 & 2.2.0.
UPDATE (after the comments): Seems that my fs.defaultFS setting points to HDFS, too:
$ hdfs getconf -confKey fs.defaultFS
hdfs://host-hd-01.corp.nodalpoint.com:8020
But let me focus on the forest here (instead of the trees, that is), and explain why this whole discussion is of academic interest only:
Passing files to be processed with the --files flag is bad practice; in hindsight, I can now see why I could find almost no usage references online - probably nobody uses it in practice, and with good reason.
(Notice that I am not talking for --py-files, which serves a different, legitimate role.)
Since Spark is a distributed processing framework, running over a cluster and a distributed file system (HDFS), the best thing to do is to have all the files to be processed already in HDFS - period. The "natural" place for files to be processed by Spark is HDFS, not the local FS - although there are some toy examples using the local FS for demonstration purposes only. What's more, if at some point in the future you want to change the deploy mode to cluster, you'll discover that the cluster, by default, knows nothing of local paths and files, and rightfully so...

Connect to Oracle DB using PySpark

I am trying to connect to an Oracle DB using PySpark.
spark_config = SparkConf().setMaster(config['cluster']).setAppName('sim_transactions_test').set("jars", "..\Lib\ojdbc7.jar")
sc = SparkContext(conf=spark_config)
sqlContext = SQLContext(sc)
df_sim_input = sqlContext.read \
    .format("jdbc") \
    .option("driver", "oracle.jdbc.driver.OracleDriver") \
    .option("url", config["db.url"]) \
    .option("dbtable", query) \
    .option("user", config["db.user"]) \
    .option("password", config["db.password"]) \
    .load()
This gives me a
py4j.protocol.Py4JJavaError: An error occurred while calling o31.load.
: java.lang.ClassNotFoundException: oracle.jdbc.driver.OracleDriver
So it seems it cannot find the jar file in the SparkContext. It seems to be possible to load a PySpark shell with external jars, but I want to load them from the Python code.
Can someone explain to me how you can add this external jar from Python and make a query to an Oracle DB?
Extra question, how come that for a postgres DB the code works fine without importing an external jdbc? Is that because if it is installed on your system, it will automatically find it?
You should probably also set driver-class-path, as --jars only sends the jar file to the workers, not to the driver.
That said, you should be very careful when setting JVM configuration from Python code, as you need to make sure the JVM loads with those options (you can't add them later). You can try setting PYSPARK_SUBMIT_ARGS, e.g.:
export PYSPARK_SUBMIT_ARGS="--jars jarname --driver-class-path jarname pyspark-shell"
This tells pyspark to add these options when loading the JVM, the same as if you had added them on the command line.
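A minimal sketch of doing that from inside the Python code (the ojdbc7.jar path is copied from the question and will need adjusting); the key point is that the variable must be set before the SparkContext is created:
import os

# must be set before the SparkContext is created, i.e. before the JVM is launched
jar = r"..\Lib\ojdbc7.jar"
os.environ["PYSPARK_SUBMIT_ARGS"] = (
    "--jars {0} --driver-class-path {0} pyspark-shell".format(jar)
)

from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext

conf = SparkConf().setAppName("sim_transactions_test")
sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)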

PySpark distributed processing on a YARN cluster

I have Spark running on a Cloudera CDH5.3 cluster, using YARN as the resource manager. I am developing Spark apps in Python (PySpark).
I can submit jobs and they run successfully, however they never seem to run on more than one machine (the local machine I submit from).
I have tried a variety of options, like setting --deploy-mode to cluster and --master to yarn-client and yarn-cluster, yet it never seems to run on more than one server.
I can get it to run on more than one core by passing something like --master local[8], but that obviously doesn't distribute the processing over multiple nodes.
I have a very simple Python script processing data from HDFS, like so:
import simplejson as json
from pyspark import SparkContext
sc = SparkContext("", "Joe Counter")
rrd = sc.textFile("hdfs:///tmp/twitter/json/data/")
data = rrd.map(lambda line: json.loads(line))
joes = data.filter(lambda tweet: "Joe" in tweet.get("text",""))
print joes.count()
And I am running a submit command like:
spark-submit atest.py --deploy-mode client --master yarn-client
What can I do to ensure the job runs in parallel across the cluster?
Can you swap the arguments for the command?
spark-submit --deploy-mode client --master yarn-client atest.py
If you see the help text for the command:
spark-submit
Usage: spark-submit [options] <app jar | python file>
I believe @MrChristine is correct -- the option flags you specify are being passed to your Python script, not to spark-submit. In addition, you'll want to specify --executor-cores and --num-executors, since by default it will run on a single core and use two executors.
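Putting both points together, the corrected submission might look something like this sketch (the executor numbers are illustrative assumptions, not values from the question):
spark-submit --master yarn-client \
  --num-executors 4 \
  --executor-cores 2 \
  atest.py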
It's not true that a Python script doesn't run in cluster mode. I am not sure about previous versions, but this executes on Spark 2.2 on a Hortonworks cluster.
Command : spark-submit --master yarn --num-executors 10 --executor-cores 1 --driver-memory 5g /pyspark-example.py
Python Code :
from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext
conf = (SparkConf()
        .setMaster("yarn")
        .setAppName("retrieve data"))
sc = SparkContext(conf = conf)
sqlContext = SQLContext(sc)
parquetFile = sqlContext.read.parquet("/<hdfs-path>/*.parquet")
parquetFile.createOrReplaceTempView("temp")
df1 = sqlContext.sql("select * from temp limit 5")
df1.show()
df1.write.save('/<hdfs-path>/test.csv', format='csv', mode='append')
sc.stop()
Output: it's big so I am not pasting it, but it runs perfectly.
It seems that PySpark does not run in distributed mode using Spark/YARN - you need to use stand-alone Spark with a Spark Master server. In that case, my PySpark script ran very well across the cluster with a Python process per core/node.
