Adding JDBC driver to AWS Glue for existing Spark code - apache-spark

I am trying to run existing Spark (Scala) code on AWS Glue.
This code uses spark.read.format("jdbc"), and I have been adding the JDBC driver to the Spark classpath with the spark.driver.extraClassPath option.
This works fine locally as well as on EMR, assuming I can copy the driver from S3 to the instances first with a bootstrap action.
But what's the equivalent on Glue? If I add the driver to the "dependent JARs" option, it doesn't work and I get the "no suitable driver" error, presumably because the JAR must be visible to Spark's own classloader.

Edit your job; at the bottom of the screen you will find the libraries option.
Some additional options are also needed; see the last part of the documentation.

Related

How to use the new Hadoop Parquet magic committer with a custom S3 server in Spark

I have Spark 2.4.0 and Hadoop 3.1.1. According to the Hadoop documentation, to use the new magic committer, which allows Parquet files to be written to S3 consistently, I've set these values in conf/spark-defaults.conf:
spark.sql.sources.commitProtocolClass com.hortonworks.spark.cloud.commit.PathOutputCommitProtocol
spark.sql.parquet.output.committer.class org.apache.hadoop.mapreduce.lib.output.BindingPathOutputCommitter
spark.hadoop.mapreduce.outputcommitter.factory.scheme.s3a org.apache.hadoop.fs.s3a.commit.S3ACommitterFactory
spark.hadoop.fs.s3a.committer.name magic
spark.hadoop.fs.s3a.committer.magic.enabled true
With this configuration I end up with the exception:
java.lang.ClassNotFoundException: com.hortonworks.spark.cloud.commit.PathOutputCommitProtocol
My question is twofold. First, do I understand correctly that Hadoop 3.1.1 allows Parquet files to be written to S3 consistently?
Second, if I understood correctly, how do I use the new committer properly from Spark?
Edit:
OK, I have two server instances, one of them a bit old now. I tried the latest version of MinIO with these parameters:
sc.hadoopConfiguration.set("hadoop.fs.s3a.path.style.access","true")
sc.hadoopConfiguration.set("hadoop.fs.s3a.fast.upload","true")
sc.hadoopConfiguration.set("hadoop.fs.s3a.fast.upload.buffer","bytebuffer")
sc.hadoopConfiguration.set("fs.s3a.path.style.access","true")
sc.hadoopConfiguration.set("fs.s3a.multipart.size","128M")
sc.hadoopConfiguration.set("fs.s3a.fast.upload.active.blocks","4")
sc.hadoopConfiguration.set("fs.s3a.committer.name","partitioned")
I'm able to write so far without trouble.
However, my Swift server, which is a bit older and uses this config:
sc.hadoopConfiguration.set("fs.s3a.signing-algorithm","S3SignerType")
does not seem to support the partitioned committer properly.
Regarding Hadoop S3Guard:
This is not currently possible. Hadoop S3Guard, which keeps metadata about the S3 files, must be enabled in Hadoop, but S3Guard relies on DynamoDB, a proprietary Amazon service.
There is currently no alternative, such as a SQLite file or another DB system, for storing the metadata.
So if you're using S3 with MinIO or any other S3 implementation, you're missing DynamoDB.
This article explains nicely how S3Guard works.
Kiwy: that's my code, so I can help you with this. Some of the classes haven't got into the ASF Spark releases, but you'll find them in the Hadoop JARs, and I could have a go at building the ASF release with the relevant dependencies in (I could put them in downstream; they used to be there).
You do not need S3Guard turned on to use the "staging committer"; it's only the "magic" variant which needs consistent object store listings during the commit phase.
All the documentation on configuring the new committers that I have read to date omits one fundamental fact:
Spark 2.x.x does not have the support classes needed for the new S3A committers to function.
They promise those cloud integration libs will be bundled with Spark 3.0.0, but for now you have to add the libraries yourself.
Under the cloud integration Maven repos there are multiple distributions supporting the committers; I found one that works with the directory committer but not with the magic one.
In general the directory committer is recommended over the magic one, as it has been well tested and tried. It requires a shared filesystem such as HDFS or NFS (we use AWS EFS) to coordinate Spark worker writes to S3; the magic committer does not require one, but needs S3Guard.
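For reference, here is a minimal sketch of what the equivalent settings might look like on a Spark 3 build that includes the spark-hadoop-cloud module, using the directory committer recommended above. The two org.apache.spark.internal.io.cloud class names come from that module and replace the com.hortonworks ones from the question. The helper name render_spark_defaults is hypothetical, and the snippet only renders text for spark-defaults.conf; it does not talk to Spark or S3.

```python
# Sketch only: assumes a Spark 3 build with the spark-hadoop-cloud module
# (and matching hadoop-aws JARs) on the classpath.
COMMITTER_SETTINGS = {
    "spark.hadoop.fs.s3a.committer.name": "directory",
    "spark.hadoop.mapreduce.outputcommitter.factory.scheme.s3a":
        "org.apache.hadoop.fs.s3a.commit.S3ACommitterFactory",
    "spark.sql.sources.commitProtocolClass":
        "org.apache.spark.internal.io.cloud.PathOutputCommitProtocol",
    "spark.sql.parquet.output.committer.class":
        "org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter",
}


def render_spark_defaults(settings):
    """Render a dict of settings as 'key value' lines for spark-defaults.conf."""
    return "\n".join(f"{key} {value}" for key, value in settings.items())


if __name__ == "__main__":
    print(render_spark_defaults(COMMITTER_SETTINGS))
```

The same keys can also be passed individually with --conf on spark-submit instead of being written to the defaults file.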

Spark on EMR w/multiple JDBC jars

My setup: small Spark project built w/SBT (+ sbt-assembly for making "fat" jars) that needs to talk to multiple DB backends using JDBC (PostgreSQL + SQL Server in this case, but I think my problem generalizes). I can build + run my project in local driver mode with no problems, using either the fully-shaded JAR or a slim one w/JDBC libs added to the classpath using spark-submit. I've confirmed the classfiles are in my jar and the various drivers are correctly being concatenated into META-INF/services/java.sql.Driver, and can load any of the classes in question via the Scala repl when the fat JAR is in my classpath.
Now the problem: no combination of build options, job submission options, etc. that I can puzzle out allows me access to >1 JDBC Driver once I submit the job to EMR. I've tried the plain fat JAR as well as adding the drivers via various spark-submit options (--jars, --packages, etc.). In every case my job throws the good ol' "No suitable driver" error, but only for the second driver to be loaded. One extra wrinkle: I'm submitting the job to EMR via an EC2 host rather than my local development machine (b/c cloud security, that's why) but it's an identical JAR in either case.
One other fun data point: I've verified the driver classes are available at runtime in the EMR job by forcing a Class.forName(...) on each of 'em before actually attempting to connect. Not a single ClassNotFoundException to be seen. Likewise dropping into spark-shell on the EMR master node and running the same code path to grab a DB connection (or more than one!) appears to work fine.
I've been poking at this for a few days now and am honestly starting to worry that it's an underlying classloader issue or something equally obtuse.
A few standard disclaimers: this is not an open source tool so I can't hand out much in the way of source code or raw logs, but I'm happy to look at and report back on anything that can be suitably redacted.
Since your investigation doesn't show any obvious problems, it might just be a Spark problem. In that case, explicitly declaring the driver class might help:
val postgresDF = spark.read
  .format("jdbc")
  .option("driver", "org.postgresql.Driver")
  ...
  .load()
val msSQLDF = spark.read
  .format("jdbc")
  .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")
  ...
  .load()
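When a job talks to several backends, one way to apply the explicit-driver approach consistently is to derive the driver class from the JDBC URL. The sketch below uses PySpark rather than Scala (the reader API has the same shape). driver_for_url and read_table are hypothetical helper names, `spark` is assumed to be an existing SparkSession, and the mapping covers only the two backends mentioned in the question.

```python
# URL prefix -> driver class, limited to the two backends from the question.
JDBC_DRIVERS = {
    "jdbc:postgresql:": "org.postgresql.Driver",
    "jdbc:sqlserver:": "com.microsoft.sqlserver.jdbc.SQLServerDriver",
}


def driver_for_url(url):
    """Return the driver class name for a JDBC URL, or raise ValueError."""
    for prefix, driver in JDBC_DRIVERS.items():
        if url.startswith(prefix):
            return driver
    raise ValueError(f"no driver registered for URL: {url}")


def read_table(spark, url, table, **options):
    """Read a table over JDBC, always passing the driver class explicitly.

    `spark` is assumed to be an existing SparkSession; extra keyword
    arguments (user, password, ...) are forwarded as reader options.
    """
    reader = (
        spark.read.format("jdbc")
        .option("url", url)
        .option("dbtable", table)
        .option("driver", driver_for_url(url))
    )
    for key, value in options.items():
        reader = reader.option(key, value)
    return reader.load()
```

Failing fast with a ValueError for an unknown URL prefix surfaces a missing mapping before Spark reports the less specific "No suitable driver" error.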

How to install a postgresql JDBC driver in pyspark

I use PySpark with Spark 2.2.0 on Lubuntu 16.04 and I want to write a DataFrame to my PostgreSQL database. As far as I understand it, I have to install a JDBC driver on the Spark master for that. I downloaded the PostgreSQL JDBC driver from their website and tried to follow this post. I added spark.jars.packages /path/to/driver/postgresql-42.2.1.jar to spark-defaults.conf, with the only result that pyspark no longer launches.
I'm kinda lost in Java land. For one, I don't know if this is the right format. The documentation tells me I should add a list, but I don't know what a path list is supposed to look like. I also don't know whether I have to specify spark.jars and/or spark.driver.extraClassPath as well, or whether spark.jars.packages is enough. And if I have to add them, what format do they take?
spark.jars.packages is for dependencies that can be pulled from Maven (think of it as pip for Java, although the analogy is probably kinda loose).
You can submit your job with the option --jars /path/to/driver/postgresql-42.2.1.jar, so that the submission also provides the library, which the cluster manager will distribute to all worker nodes on your behalf.
If you want to set this as a configuration, you can use the spark.jars key instead of spark.jars.packages. The latter requires Maven coordinates rather than a path (which is probably the reason why your job is failing).
You can read more about the configuration keys I introduced in the official documentation.
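To make the distinction concrete, here is a small sketch that sorts dependency strings into the two keys. The helper name split_dependencies and the regular expression are illustrative assumptions, not part of Spark. Both keys take comma-separated lists.

```python
import re

# Maven coordinates look like groupId:artifactId:version and contain no slashes.
MAVEN_COORDINATE = re.compile(r"^[\w.\-]+:[\w.\-]+:[\w.\-]+$")


def split_dependencies(deps):
    """Map JAR paths to spark.jars and Maven coordinates to spark.jars.packages."""
    jars = [d for d in deps if not MAVEN_COORDINATE.match(d)]
    packages = [d for d in deps if MAVEN_COORDINATE.match(d)]
    conf = {}
    if jars:
        conf["spark.jars"] = ",".join(jars)
    if packages:
        conf["spark.jars.packages"] = ",".join(packages)
    return conf
```

For the path from the question this yields a spark.jars entry, while a coordinate such as org.postgresql:postgresql:42.2.1 ends up under spark.jars.packages.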

Getting "AssertionError("Unknown application type")" when Connecting to DSE 5.1.0 Spark

I am connecting to DSE (Spark) using this:
new SparkConf()
.setAppName(name)
.setMaster("spark://localhost:7077")
This works fine with DSE 5.0.8 (Spark 1.6.3), but with DSE 5.1.0 it now fails with this error:
java.lang.AssertionError: Unknown application type
at org.apache.spark.deploy.master.DseSparkMaster.registerApplication(DseSparkMaster.scala:88) ~[dse-spark-5.1.0.jar:2.0.2.6]
After checking the dse-spark jar, I found this check:
if(rpcendpointref instanceof DseAppProxy)
Within Spark, the reference seems to be an RpcEndpointRef (NettyRpcEndpointRef).
How can I fix this problem?
I had a similar issue, and fixed it by following this:
https://docs.datastax.com/en/dse/5.1/dse-dev/datastax_enterprise/spark/sparkRemoteCommands.html
Then you need to run your job using dse spark-submit, without specifying any master.
Resource Manager Changes
The DSE Spark Resource Manager is different from the OSS Spark Standalone Resource Manager. The DSE method uses a different URI, "dse://", because under the hood it actually performs a CQL-based request. This has a number of benefits over the Spark RPC but, as noted, does not match some of the submission mechanisms possible in OSS Spark.
There are several articles on this on the DataStax Blog, as well as documentation notes:
Network Security with DSE 5.1 Spark Resource Manager
Process Security with DSE 5.1 Spark Resource Manager
Instructions on the URL Change
Programmatic Spark Jobs
While it is still possible to launch an application using "setJars", you must also add the DSE-specific JARs and config options to talk to the resource manager. DSE 5.1.3+ provides a class, DseConfiguration, which can be applied to your SparkConf with DseConfiguration.enableDseSupport(conf) (or invoked via an implicit) and which will set these options for you.
Example
Docs
This is of course for advanced users only and we strongly recommend using dse spark-submit if at all possible.
I found a solution.
First of all, I think it is impossible to run a Spark job from within an application in DSE 5.1. It has to be submitted with dse spark-submit.
Once submitted, it works perfectly. To communicate with the job I used Apache Kafka.
If you don't want to use a job, you can always go back to plain Apache Spark.

spark driver not found

I am trying to write a DataFrame to SQL Server using Spark, via the write method of DataFrameWriter.
Using DriverManager.getConnection I am able to get a connection to SQL Server and write to it, but when I use the jdbc method and pass the URI I get "No suitable driver found".
I have passed the jTDS jar via --jars in spark-shell.
Spark version : 1.4
The issue is that Spark is not finding the driver JAR file. Download the JAR, place it on every worker node of the Spark cluster at the same path, and add this path to SPARK_CLASSPATH in the spark-env.sh file as follows:
SPARK_CLASSPATH=/home/mysql-connector-java-5.1.6.jar
Hope it helps.
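Note that SPARK_CLASSPATH has been deprecated since Spark 1.0 in favour of spark-submit's --driver-class-path flag and the spark.executor.extraClassPath setting. A minimal sketch of assembling such a command line follows. classpath_args is a hypothetical helper, and the JAR path is simply the one from the answer above.

```python
def classpath_args(jar_path):
    """Build spark-submit arguments that put one JAR on driver and executor classpaths.

    The JAR is assumed to exist at the same path on every node, as the
    answer above requires.
    """
    return [
        "--driver-class-path", jar_path,
        "--conf", f"spark.executor.extraClassPath={jar_path}",
    ]


if __name__ == "__main__":
    # Prints the start of a spark-submit command line; append the
    # application JAR or script and its arguments.
    jar = "/home/mysql-connector-java-5.1.6.jar"
    print(" ".join(["spark-submit"] + classpath_args(jar)))
```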
