Spark on EMR w/multiple JDBC jars - apache-spark

My setup: small Spark project built w/SBT (+ sbt-assembly for making "fat" jars) that needs to talk to multiple DB backends using JDBC (PostgreSQL + SQL Server in this case, but I think my problem generalizes). I can build + run my project in local driver mode with no problems, using either the fully-shaded JAR or a slim one w/JDBC libs added to the classpath using spark-submit. I've confirmed the classfiles are in my jar and the various drivers are correctly being concatenated into META-INF/services/java.sql.Driver, and can load any of the classes in question via the Scala repl when the fat JAR is in my classpath.
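For reference, the merge-strategy piece of that sbt-assembly setup looks roughly like the sketch below (the exact build differs, and older sbt-assembly versions spell the key as assemblyMergeStrategy in assembly):
// build.sbt (sketch) -- concatenate JDBC service registrations instead of picking one
assembly / assemblyMergeStrategy := {
  case PathList("META-INF", "services", xs @ _*) => MergeStrategy.concat
  case PathList("META-INF", xs @ _*)             => MergeStrategy.discard
  case _                                         => MergeStrategy.first
}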
Now the problem: no combination of build options, job submission options, etc. that I can puzzle out allows me access to more than one JDBC driver once I submit the job to EMR. I've tried the plain fat JAR as well as adding the drivers via various spark-submit options (--jars, --packages, etc.). In every case my job throws the good ol' "No suitable driver" error, but only for the second driver to be loaded. One extra wrinkle: I'm submitting the job to EMR via an EC2 host rather than my local development machine (b/c cloud security, that's why), but it's an identical JAR in either case.
One other fun data point: I've verified the driver classes are available at runtime in the EMR job by forcing a Class.forName(...) on each of 'em before actually attempting to connect. Not a single ClassNotFoundException to be seen. Likewise dropping into spark-shell on the EMR master node and running the same code path to grab a DB connection (or more than one!) appears to work fine.
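(The Class.forName check is just something like this, illustrative only, run ahead of the first connection attempt:)
// both driver classes resolve fine on EMR -- no ClassNotFoundException
Seq("org.postgresql.Driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")
  .foreach(name => Class.forName(name))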
I've been poking at this for a few days now and am honestly starting to worry that it's an underlying classloader issue or something equally obtuse.
A few standard disclaimers: this is not an open source tool so I can't hand out much in the way of source code or raw logs, but I'm happy to look at and report back on anything that can be suitably redacted.

Since your investigation doesn't show any obvious problems, it might just be a Spark problem. In that case, explicitly declaring the driver class might help:
val postgresDF = spark.read
  .format("jdbc")
  .option("driver", "org.postgresql.Driver")
  ...
  .load()

val msSQLDF = spark.read
  .format("jdbc")
  .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")
  ...
  .load()
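For completeness, the kind of spark-submit this pairs with looks something like the line below (class and jar names are placeholders); the important part is that each read names its driver class explicitly instead of leaving the choice to DriverManager:
spark-submit --class com.example.MyJob --jars postgresql-42.2.1.jar,mssql-jdbc-6.4.0.jre8.jar my-job.jar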

Related

Standalone Spark application in IntelliJ

I am trying to run a Spark application (written in Scala) on a local server for debugging. It seems that YARN is the default in the version of Spark (2.2.1) that I have in my sbt build definitions, and according to an error I'm consistently getting, there is no Spark/YARN server listening:
Client:920 - Failed to connect to server: 0.0.0.0/0.0.0.0:8032: retries get failed due to exceeded maximum allowed retries number
According to netstat, there is indeed no port 8032 in a listening state on my local server.
How would I typically accomplish running my Spark application locally, in a way that bypasses this problem? I only need the application to process a small amount of data for debugging, and hence would like to be able to run it locally, without relying on a specific Spark/YARN installation and setup on the local server; that would be an ideal debug setup.
Is that possible?
My sbt definitions already bring in all the necessary spark and spark.yarn jars. The problem also reproduces when running the same project in sbt, outside of IntelliJ.
You can add this property to the VM options in your debug configuration instead of hardcoding it in the code:
-Dspark.master=local[2]
You could submit the Spark application in local mode with .master("local[*]") if you have to test the pipeline with minuscule data.
Full code:
val spark = SparkSession
  .builder
  .appName("myapp")
  .master("local[*]")
  .getOrCreate()
For spark-submit, use --master local[*] as one of the arguments. Refer to this: https://spark.apache.org/docs/latest/submitting-applications.html
Note: do not hard-code the master in your codebase; always supply these variables from the command line. This keeps the application reusable for local/test/Mesos/Kubernetes/YARN/whatever.
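For example (class and jar names here are placeholders):
spark-submit --master local[*] --class com.example.MyApp target/scala-2.11/myapp.jar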

How to install a postgresql JDBC driver in pyspark

I use PySpark with Spark 2.2.0 on Lubuntu 16.04 and I want to write a DataFrame to my PostgreSQL database. As far as I understand it, I have to install a JDBC driver on the Spark master for this. I downloaded the PostgreSQL JDBC driver from their website and tried to follow this post. I added spark.jars.packages /path/to/driver/postgresql-42.2.1.jar to spark-defaults.conf, with the only result that pyspark no longer launches.
I'm kinda lost in Java land. For one, I don't know if this is the right format. The documentation tells me I should add a list, but I don't know what a path list is supposed to look like. I also don't know whether I have to specify spark.jars and/or spark.driver.extraClassPath, or whether spark.jars.packages is enough. And if I have to add them, what format do they take?
spark.jars.packages is for dependencies that can be pulled from Maven (think of it as pip for Java, although the analogy is probably kinda loose).
You can submit your job with the option --jars /path/to/driver/postgresql-42.2.1.jar, so that the submission also ships the library, which the cluster manager will distribute to all worker nodes on your behalf.
If you want to set this as a configuration, you can use the spark.jars key instead of spark.jars.packages. The latter requires Maven coordinates rather than a path (which is probably the reason why your job is failing).
You can read more about the configuration keys I introduced in the official documentation.
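Concretely, in spark-defaults.conf the two options would look something like this (org.postgresql:postgresql is the standard Maven coordinate for the driver; match the version to the jar you downloaded):
spark.jars            /path/to/driver/postgresql-42.2.1.jar
# or, to have Spark resolve it from Maven instead:
spark.jars.packages   org.postgresql:postgresql:42.2.1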

What setup is needed to use the Spark Cassandra Connector with Spark Job Server

I am working with Spark and Cassandra and, in general, things are straightforward and working as intended; in particular the spark-shell and running .scala processes to get results.
I'm now looking at utilising the Spark Job Server; I have the Job Server up and running and working as expected for both the test items, as well as some initial, simple .scala programs I've developed.
However I now want to take one of the .scala programs that works in spark-shell and get it onto the Spark Job Server to access via that mechanism. The issue I have is that the Job Server doesn't seem to recognise the import statements around cassandra and fails to build (sbt compile; sbt package) a jar for upload to the Job Server.
At some level it just looks like I need the Job Server equivalent of the spark-shell packages switch (--packages datastax:spark-cassandra-connector:2.0.1-s_2.11) so that import com.datastax.spark.connector._ and similar code in the .scala files will work.
Currently when I attempt to build (sbt compile) I get messages such as:
[error] /home/SparkCassandraTest.scala:10: object datastax is not a member of package com
[error] import com.datastax.spark.connector._
I have added different items to the build.sbt file based on searches and message-board advice, but with no real change; if that is the answer I'm after, what should be added to the base Job Server to enable that usage of the Cassandra connector?
I think that you need spark-submit to do this. I am also working with Spark and Cassandra, but only for a month, so I've needed to read a lot of information. I have compiled this info in a repository; maybe it could help you, though it is an alpha version, sorry about that.
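If it helps, the sbt equivalent of that --packages switch is roughly the dependency below (the Maven coordinates corresponding to datastax:spark-cassandra-connector:2.0.1-s_2.11; the Spark version is an assumption, so match it to your cluster):
// build.sbt (sketch)
libraryDependencies += "com.datastax.spark" %% "spark-cassandra-connector" % "2.0.1"
// Spark itself is usually marked "provided" so it isn't bundled into the jar you upload
libraryDependencies += "org.apache.spark" %% "spark-core" % "2.1.1" % "provided"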

Spark-submit Executers are not getting the properties

I am trying to deploy a Spark application to a 4-node DSE Spark cluster. I have created a fat jar with all dependent jars, and a property file under src/main/resources which has properties like batch interval, master URL, etc.
I have copied this fat jar to the master and I am submitting the application with spark-submit; below is my submit command:
dse spark-submit --class com.Processor.utils.jobLauncher --supervise application-1.0.0-develop-SNAPSHOT.jar qa
Everything works properly when I run on a single-node cluster, but when run on the DSE Spark standalone cluster, the properties mentioned above, like batch interval, become unavailable to the executors. I have googled and found that this is a common issue that many have solved, so I followed one of those solutions, created a fat jar and tried to run it, but still my properties are unavailable to the executors.
Can someone please give any pointers on how to solve the issue?
I am using DSE 4.8.5 and Spark 1.4.2
and this is how I am loading the properties
System.setProperty("env",args(0))
val conf = com.typesafe.config.ConfigFactory.load(System.getProperty("env") + "_application")
Figured out the solution:
I was getting the property file name from a system property (which I set in the main method from a command-line parameter), and when the code gets shipped and executed on a worker node that system property is not available (obviously..!!), so instead of using Typesafe ConfigFactory to load the property file I am using simple Scala file reading.
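Roughly what that looks like (a sketch only; the file-name pattern and the property key are assumptions, and here the file is read off the classpath, which is where src/main/resources ends up in the fat jar):
import java.util.Properties

// build a Properties object from <env>_application.properties packaged in the jar
def loadProps(env: String): Properties = {
  val props = new Properties()
  val in = getClass.getResourceAsStream(s"/${env}_application.properties")
  try props.load(in) finally in.close()
  props
}

val props = loadProps(args(0))                          // e.g. "qa", as in the submit command above
val batchInterval = props.getProperty("batchInterval")  // example key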

Submit spark application from laptop

I want to submit spark python applications from my laptop. I have a standalone spark cluster, and the master is running at some visible IP (MASTER_IP). After downloading and unzipping Spark on my laptop, I got this to work
./bin/spark-submit --master spark://MASTER_IP:7077 ~/PATHTO/pi.py
From what I understand, it is defaulting to client mode (vs cluster mode). According to the Spark docs (http://spark.apache.org/docs/latest/submitting-applications.html), "only YARN supports cluster mode for Python applications." Since I'm not using YARN, I must use client mode.
My question is - do I need to download all of Spark on my laptop? Or just a few libraries?
I want to allow the rest of my team to use my Spark cluster, but I want them to do the least amount of work possible. They don't need to set up a cluster. They only need to submit jobs to it. Having them download all of Spark seems like overkill.
So, what exactly is the minimum that they need?
The spark-1.5.0-bin-hadoop2.6 package I have here is 304 MB unpacked. More than half of that, 175 MB, is made up of spark-assembly-1.5.0-hadoop2.6.0.jar, the main Spark stuff. You can't get rid of this unless maybe you want to compile your own package. A large part of the rest is spark-examples-1.5.0-hadoop2.6.0.jar, at 113 MB. Removing this and zipping back up is harmless and already saves you a lot.
However, using tools so that they don't have to work with the Spark package directly, like spark-jobserver (I've never used it, but I've also never heard anybody be very positive about its current state) or spark-kernel (you still need your own code to interface with it, and when used with a notebook (see below) it is limited compared to the alternatives), as suggested by Reactormonk, makes it even easier for them.
A popular thing to do in that sense is to set up access to a notebook. As you're using Python, IPython with a PySpark profile would be the most straightforward to set up. Other alternatives are Zeppelin and spark-notebook (my favourite) for using Scala.
