With sbt accembly pkg, but still got Err "java.sql.SQLException: No suitable driver found for jdbc:mysql" [duplicate] - apache-spark

I am using
df.write.mode("append").jdbc("jdbc:mysql://ip:port/database", "table_name", properties)
to insert into a table in MySQL.
Also, I have added Class.forName("com.mysql.jdbc.Driver") in my code.
When I submit my Spark application:
spark-submit --class MY_MAIN_CLASS
--master yarn-client
--jars /path/to/mysql-connector-java-5.0.8-bin.jar
--driver-class-path /path/to/mysql-connector-java-5.0.8-bin.jar
MY_APPLICATION.jar
This yarn-client mode works for me.
But when I use yarn-cluster mode:
spark-submit --class MY_MAIN_CLASS
--master yarn-cluster
--jars /path/to/mysql-connector-java-5.0.8-bin.jar
--driver-class-path /path/to/mysql-connector-java-5.0.8-bin.jar
MY_APPLICATION.jar
It doens't work. I also tried setting "--conf":
spark-submit --class MY_MAIN_CLASS
--master yarn-cluster
--jars /path/to/mysql-connector-java-5.0.8-bin.jar
--driver-class-path /path/to/mysql-connector-java-5.0.8-bin.jar
--conf spark.executor.extraClassPath=/path/to/mysql-connector-java-5.0.8-bin.jar
MY_APPLICATION.jar
but still get the "No suitable driver found for jdbc" error.

I had to add the driver option when using the sparkSession's read function.
.option("driver", "org.postgresql.Driver")
var jdbcDF - sparkSession.read
.option("driver", "org.postgresql.Driver")
.option("url", "jdbc:postgresql://<host>:<port>/<DBName>")
.option("dbtable", "<tableName>")
.option("user", "<user>")
.option("password", "<password>")
.load()
Depending on how your dependencies are setup, you'll notice that when you include something like compile group: 'org.postgresql', name: 'postgresql', version: '42.2.8' in Gradle, for example, this will include the Driver class at org/postgresql/Driver.class, and that's the one you want to instruct spark to load.

There is 3 possible solutions,
You might want to assembly you application with your build manager (Maven,SBT) thus you'll not need to add the dependecies in your spark-submit cli.
You can use the following option in your spark-submit cli :
--jars $(echo ./lib/*.jar | tr ' ' ',')
Explanation : Supposing that you have all your jars in a lib directory in your project root, this will read all the libraries and add them to the application submit.
You can also try to configure these 2 variables : spark.driver.extraClassPath and spark.executor.extraClassPath in SPARK_HOME/conf/spark-default.conf file and specify the value of these variables as the path of the jar file. Ensure that the same path exists on worker nodes.

I tried the suggestions shown here which didn't work for me (with mysql). While debugging through the DriverManager code, I realized that I needed to register my driver since this was not happening automatically with "spark-submit". I therefore added
Driver driver = new Driver();
The constructor registers the driver with the DriverManager, which solved the SQLException problem for me.

Related

pyspark connection to MariaDB fails with ClassNotFoundException

I'm trying to retrieve data from MariaDB with pyspark.
I created spark_session with configuration to include jdbc jar file, but couldn't solve problem. Current code to create session looks like below.
path = "hdfs://nameservice1/user/PATH/TO/JDBC/mariadb-java-client-2.7.1.jar"
# or path = "/home/PATH/TO/JDBC/mariadb-java-client-2.7.1.jar"
spark = SparkSession.config("spark.jars", path)\
.config("spark.driver.extraClassPath", path)\
.config("spark.executor.extraClassPath", path)\
.enableHiveSupport()
.getOrCreate()
Note that I've tried every case of configuration I know
(Check Permission, change directory both hdfs or local, add or remove configuration ...)
And then, code to load data is.
sql = "SOME_SQL_TO_RETRIEVE_DATA"
spark = spark.read.format('jdbc').option('dbtable', sql)
.option('url', 'jdbc:mariadb://{host}:{port}/{db}')\
.option("user", SOME_USER)
.option("password", SOME_PASSWORD)
.option("driver", 'org.mariadb.jdbc.Driver')
.load()
But it fails with java.lang.ClassNotFoundException: org.mariadb.jdbc.Driver
When I tried this with spark-submit, I saw log message.
... INFO SparkContext: Added Jar /PATH/TO/JDBC/mariadb-java-client-2.7.1.jar at spark://SOME_PATH/jars/mariadb-java-client-2.7.1.jar with timestamp SOME_TIMESTAMP
What is wrong?
For anyone who suffers from same problem.
I figured out. Spark Document says that
Note: In client mode, this config must not be set through the SparkConf directly in your application, because the driver JVM has already started at that point. Instead, please set this through the --driver-class-path command line option or in your default properties file.
So instead setting configuration on python code, I added arguments on spark-submit following this document.
spark-submit {other arguments ...} \
--driver-class-path PATH/TO/JDBC/my-jdbc.jar \
--jars PATH/TO/JDBC/my-jdbc.jar \
MY_PYTHON_SCRIPT.py

Spark --jars option added jar are not working

I am trying to add redshift jar using spark-submit option:
Running command on Spark 2.1.0
spark-submit --class Test --master spark://xyz.local:7077 --executor-cores 4 --total-executor-cores 32 --executor-memory 6G --driver-memory 4G --driver-cores 2 --deploy-mode cluster -jars s3a://d11-batch-jobs-on-spark/jars/redshift-jdbc42-1.2.10.1009.jar,s3a://mybucket/jars/spark-redshift_2.11-3.0.0-preview1.jar s3a://mybucket/jars/app.jar
and in code I am reading from redshift table but getting
ClassNotFoundException: com.databricks.spark.redshift.DefaultSource
What am I doing wrong?
I'm having issues using the --jars as well...
My advise is, for packages in the Maven repository, to use --packages instead of --jars, as it resolves other dependencies withing those packages.
USAGE
spark-submit --packages <groupId>:<artifactId>:<version>
In your case, except all other options and args, it'd look like this:
spark-submit --packages com.amazon.redshift:redshift-jdbc42:1.2.10.1009
You can find IDs and version from an XML-style provided by Maven after following the link to your desired version.
The accepted answer to this question provides more info on --jars and -packages

spark-submit works for yarn-cluster mode but SparkLauncher doesn't, with same params

I'm able to submit a spark job through spark-submit however when I try to do the same programatically using SparkLauncher, it gives me nothing ( I dont even see a Spark job on the UI)
Below is the scenario:
I've a server(say hostname: cr-hdbc101.dev.local:7123) which hosts the hdfs cluster. I push a fat jar to the server which I'm trying to exec.
The following spark-submit works as expected and a spark job is submitted in yarn-cluster mode
spark-submit \
--verbose \
--class com.digital.StartSparkJob \
--master yarn \
--deploy-mode cluster \
--num-executors 2 \
--driver-memory 2g \
--executor-memory 3g \
--executor-cores 4 \
/usr/share/Deployments/Consolidateservice.jar "<arg_to_main>"
However the following piece of SparkLauncher code doesn't work
val sparkLauncher = new SparkLauncher()
sparkLauncher
.setSparkHome("/opt/cloudera/parcels/CDH-5.7.2-1.cdh5.7.2.p0.18/lib/spark")
.setAppResource("/usr/share/Deployments/Consolidateservice.jar")
.setMaster("yarn-cluster")
.setVerbose(true)
.setMainClass("com.digital.StartSparkJob")
.setDeployMode("cluster")
.setConf("spark.driver.cores", "2")
.setConf("spark.driver.memory", "2g")
.setConf("spark.executor.cores", "4")
.setConf("spark.executor.memory", "3g")
.addAppArgs(<arg_to_main>)
.startApplication()
I thought maybe SparkLauncher is not getting correct env variables to work with, so I send the following to SparkLauncher, but to no avail(basically I pass everything in the spark-env.sh to SparkLauncher)
val env: java.util.Map[String, String] = new java.util.HashMap[String, String]
env.put("SPARK_CONF_DIR", "/etc/spark/conf.cloudera.spark_on_yarn")
env.put("HADOOP_HOME", "/opt/cloudera/parcels/CDH-5.7.2-1.cdh5.7.2.p0.18/lib/hadoop")
env.put("YARN_CONF_DIR", "/etc/spark/conf.cloudera.spark_on_yarn/yarn-conf")
env.put("SPARK_LIBRARY_PATH", "/opt/cloudera/parcels/CDH-5.7.2-1.cdh5.7.2.p0.18/lib/spark/lib")
env.put("SCALA_LIBRARY_PATH", "/opt/cloudera/parcels/CDH-5.7.2-1.cdh5.7.2.p0.18/lib/spark/lib")
env.put("LD_LIBRARY_PATH", "/opt/cloudera/parcels/CDH-5.7.2-1.cdh5.7.2.p0.18/lib/hadoop/lib/native")
env.put("SPARK_DIST_CLASSPATH", "/etc/spark/conf.cloudera.spark_on_yarn/classpath.txt")
val sparkLauncher = new SparkLauncher(env)
sparkLauncher
.setSparkHome("/opt/cloudera/parcels/CDH-5.7.2-1.cdh5.7.2.p0.18/lib/spark")...
What adds to the frustration, is that when I use same SparkLauncher code for yarn-client mode, it works perfectly fine.
Can someone please point to me what am I missing, I just feel I'm staring at the issue without recognizing it.
NOTE: Both the main class(com.digital.StartSparkJob) and SparkLauncher code are part of the fat jar I'm pushing to the server. I just call the SparkLauncher code with an external API, which in turn should open a driver JVM on the cluster
SparkVersion: 1.6.0, scala ver: 2.10.5
I wasn't even getting logs on the Spark-UI...the sparkApp wasn't even running. Therefore I ran the sparkLauncher as a process(using .launch().waitFor() ) so that I can capture the error Logs.
I captured the logs using .getInputStream and .getErrorStream and found out that the user being passed to the cluster is wrong. My cluster will work only for user "abcd".
I did set System.setProperty("HADOOP_USER_NAME", "abcd"), as well as added "spark.yarn.appMasterEnv.HADOOP_USER_NAME=abcd" to spark-default.conf, before launching SparkLauncher. However looks like they don't get ported over to cluster.
I therefore passed the HADOOP_USER_NAME as an childArg to the SparkLauncher
val env: java.util.Map[String, String] = new java.util.HashMap[String, String]
env.put("SPARK_CONF_DIR", "/etc/spark/conf.cloudera.spark_on_yarn")
env.put("YARN_CONF_DIR", "/etc/spark/conf.cloudera.spark_on_yarn/yarn-conf")
env.put("HADOOP_USER_NAME", "abcd")
try {
val sparkLauncher = new SparkLauncher(env)...

spark.executor.extraJavaOptions ignored in spark-submit

I am a newbie trying to profile a local spark job.
Here is the command that I am trying to execute, but I am getting a warning stating my executor options are being ignored since they are non-spark config properties.
error:
Warning: Ignoring non-spark config property: “spark.executor.extraJavaOptions=javaagent:statsd-jvm-profiler-2.1.0-jar-with-dependencies.jar=server=localhost,port=8086,reporter=InfluxDBReporter,database=profiler,username=profiler,password=profiler,prefix=MyNamespace.MySparkApplication,tagMapping=namespace.application”
Command:
./bin/spark-submit --master local[2] --class org.apache.spark.examples.GroupByTest --conf “spark.executor.extraJavaOptions=-javaagent:statsd-jvm-profiler-2.1.0-jar-with-dependencies.jar=server=localhost,port=8086,reporter=InfluxDBReporter,database=profiler,username=profiler,password=profiler,prefix=MyNamespace.MySparkApplication,tagMapping=namespace.application” --name HdfsWordCount --jars /Users/shprin/statD/statsd-jvm-profiler-2.1.0-jar-with-dependencies.jar libexec/examples/jars/spark-examples_2.11-2.3.0.jar
Spark version : 2.0.3
Please let me know, how to solve this.
Thanks in Advance.
I think the problem is the double quote you are using to specify the spark.executor.extraJavaOptions. It should have been a single quote.
./bin/spark-submit --master local[2] --conf 'spark.executor.extraJavaOptions=-javaagent:statsd-jvm-profiler-2.1.0-jar-with-dependencies.jar=server=localhost,port=8086,reporter=InfluxDBReporter,database=profiler,username=profiler,password=profiler,prefix=MyNamespace.MySparkApplication,tagMapping=namespace.application' --class org.apache.spark.examples.GroupByTest --name HdfsWordCount --jars /Users/shprin/statD/statsd-jvm-profiler-2.1.0-jar-with-dependencies.jar libexec/examples/jars/spark-examples_2.11-2.3.0.jar
Apart from answers above, if your parameter contains both spaces and single quotes (for instance a query paramter) you should enclose it with in escaped double quote \"
Example:
spark-submit --master yarn --deploy-mode cluster --conf "spark.driver.extraJavaOptions=-DfileFormat=PARQUET -Dquery=\"select * from bucket where code in ('A')\" -Dchunk=yes" spark-app.jar

Spark on YARN + Secured hbase

I am submitting a job to YARN (on spark 2.1.1 + kafka 0.10.2.1) which connects to a secured hbase cluster. This job, performs just fine when i am running in "local" mode (spark.master=local[*]).
However, as soon as I submit the job with master as YARN (and deploy mode as client), I see the following error message -
Caused by: javax.security.auth.login.LoginException: Unable to obtain password from user
I am following hortonworks recommendations for providing information to yarn cluster regarding the HBase and keytab etc. Followed this kb article - https://community.hortonworks.com/content/supportkb/48988/how-to-run-spark-job-to-interact-with-secured-hbas.html
Any pointers what could be going on ?
the mechanism for logging into HBase:
UserGroupInformation.setConfiguration(hbaseConf)
val keyTab = "keytab-location")
val principal = "kerberos-principal"
val ugi = UserGroupInformation.loginUserFromKeytabAndReturnUGI(principal, keyTab)
UserGroupInformation.setLoginUser(ugi)
ugi.doAs(new PrivilegedExceptionAction[Void]() {
override def run: Void = {
hbaseCon = Some(ConnectionFactory.createConnection(hbaseConf))
null
}
})
Also, I tried the alternative mechanism for logging in, as:
UserGroupInformation.loginUserFromKeytab(principal, keyTab)
connection=ConnectionFactory.createConnection(hbaseConf)
please suggest.
You are not alone in the quest for Kerberos auth to HBase from Spark, cf. SPARK-12279
A little-known fact is that Spark now generates Hadoop "auth tokens" for Yarn, HDFS, Hive, HBase on startup. These tokens are then broadcasted to the executors, so that they don't have to mess again with Kerberos auth, keytabs, etc.
The first problem is that it's not explicitly documented, and in case of failure the errors are hidden by default (i.e. most people don't connect to HBase with Kerberos, so it's usually pointless to state that the HBase JARs are not in the CLASSPATH and no HBase token was created... usually.)
To log all details about these tokens, you have to set the log level for org.apache.spark.deploy.yarn.Client to DEBUG.
The second problem is that beyond the properties, Spark supports many env variables, some documented, some not documented, and some actually deprecated.
For instance, SPARK_CLASSPATH is now deprecated, and its content actually injected in Spark properties spark.driver / spark.executor.extraClassPath.But SPARK_DIST_CLASSPATH is still in use, and in the Cloudera distro for example, it is used to inject the core Hadoop libs & config into the Spark "launcher" so that it can bootstrap a YARN-cluster execution, before the driver is started (i.e. before spark.driver.extraClassPath is evaluated).
Other variables of interest are
HADOOP_CONF_DIR
SPARK_CONF_DIR
SPARK_EXTRA_LIB_PATH
SPARK_SUBMIT_OPTS
SPARK_PRINT_LAUNCH_COMMAND
The third problem is that, in some specific cases (e.g. YARN-cluster mode in the Cloudera distro), the Spark property spark.yarn.tokens.hbase.enabled is set silently to false -- which makes absolutely no sense, that default is hard-coded to true in Spark source code...!
So you are advised to force it explicitly to true in your job config.
The fourth problem is that, even if the HBase token has been created at startup, then the executors must explicitly use it to authenticate. Fortunately, Cloudera has contributed a "Spark connector" to HBase, to take care of this kind of nasty stuff automatically. It's now part of the HBase client, by default (cf. hbase-spark*.jar).
The fifth problem is that, AFAIK, if you don't have metrics-core*.jar in the CLASSPATH then the HBase connections will fail with puzzling (and unrelated) ZooKepper errors.
¤¤¤¤¤ How to make that stuff work, with debug traces
# we assume that spark-env.sh and spark-default.conf are already Hadoop-ready,
# and also *almost* HBase-ready (as in a CDH distro);
# especially HADOOP_CONF_DIR and SPARK_DIST_CLASSPATH are expected to be set
# but spark.*.extraClassPath / .extraJavaOptions are expected to be unset
KRB_DEBUG_OPTS="-Dlog4j.logger.org.apache.spark.deploy.yarn.Client=DEBUG -Dlog4j.logger.org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper=DEBUG -Dlog4j.logger.org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation=DEBUG -Dlog4j.logger.org.apache.hadoop.hbase.spark.HBaseContext=DEBUG -Dsun.security.krb5.debug=true -Djava.security.debug=gssloginconfig,configfile,configparser,logincontext"
EXTRA_HBASE_CP=/etc/hbase/conf/:/opt/cloudera/parcels/CDH/lib/hbase/hbase-spark.jar:/opt/cloudera/parcels/CDH/lib/hbase/lib/metrics-core-2.2.0.jar
export SPARK_SUBMIT_OPTS="$KRB_DEBUG_OPTS"
export HADOOP_JAAS_DEBUG=true
export SPARK_PRINT_LAUNCH_COMMAND=True
spark-submit --master yarn-client \
--files "/etc/spark/conf/log4j.properties#yarn-log4j.properties" \
--principal XX#Z.NET --keytab /a/b/XX.keytab \
--conf spark.yarn.tokens.hbase.enabled=true \
--conf spark.driver.extraClassPath=$EXTRA_HBASE_CP \
--conf spark.executor.extraClassPath=$EXTRA_HBASE_CP \
--conf "spark.executor.extraJavaOptions=$KRB_DEBUG_OPTS -Dlog4j.configuration=yarn-log4j.properties" \
--conf spark.executorEnv.HADOOP_JAAS_DEBUG=true \
--class TestSparkHBase TestSparkHBase.jar
spark-submit --master yarn-cluster --conf spark.yarn.report.interval=4000 \
--files "/etc/spark/conf/log4j.properties#yarn-log4j.properties" \
--principal XX#Z.NET --keytab /a/b/XX.keytab \
--conf spark.yarn.tokens.hbase.enabled=true \
--conf spark.driver.extraClassPath=$EXTRA_HBASE_CP \
--conf "spark.driver.extraJavaOptions=$KRB_DEBUG_OPTS -Dlog4j.configuration=yarn-log4j.properties" \
--conf spark.driverEnv.HADOOP_JAAS_DEBUG=true \
--conf spark.executor.extraClassPath=$EXTRA_HBASE_CP \
--conf "spark.executor.extraJavaOptions=$KRB_DEBUG_OPTS -Dlog4j.configuration=yarn-log4j.properties" \
--conf spark.executorEnv.HADOOP_JAAS_DEBUG=true \
--class TestSparkHBase TestSparkHBase.jar
PS: when using a HBaseContext you don't need /etc/hbase/conf/ in the executor's CLASSPATH, the conf is propagated automatically.
PPS: I advise you to set log4j.logger.org.apache.zookeeper.ZooKeeper=WARN in log4j.properties because it's verbose, useless, and even confusing (all the interesting stuff is logged at HBase level)
PPS: instead of that verbose SPARK_SUBMIT_OPTS var, you could also list statically the Log4J options in $SPARK_CONF_DIR/log4j.properties and the rest in $SPARK_CONF_DIR/java-opts; same goes for the Spark properties in $SPARK_CONF_DIR/spark-defaults.conf and env variables in $SPARK_CONF_DIR/spark-env.sh
¤¤¤¤¤ About the "Spark connector" to HBase
Excerpt from the official HBase documentation, chapter 83 Basic Spark
At the root of all Spark and HBase integration is the HBaseContext.
The HBaseContext takes in HBase configurations and pushes them to
the Spark executors. This allows us to have an HBase Connection per
Spark Executor in a static location.
What is not mentioned in the doc is that the HBaseContext uses automatically the HBase "auth token" (when present) to authenticate the executors.
Note also that the doc has an example (in Scala then in Java) of a Spark foreachPartition operation on a RDD, using a BufferedMutator for async bulk load into HBase.

Resources