Our project uses Gradle and Scala to build a Spark app. After I added the GCP KMS library, the job fails on Dataproc with a missing Guava method (stack trace below).
I followed the recommendations in the following guide to shade the Google libraries:
My shadowJar definition in the Gradle build:
shadowJar {
    zip64 true
    relocate '', ''
    relocate '', ''
    relocate '', ''
    exclude 'META-INF/**'
    exclude "LICENSE*"
    archiveFileName = "myjar"
}
Running jar tf on the compiled fat jar shows the Guava classes, including the one that contains checkArgument, relocated under shadow.
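The jar tf check above can also be scripted. The following is a minimal sketch, not part of the original build: it assumes Guava's standard package path com/google/common/, and it treats any class entry that contains that path without starting with it as relocated, so it does not need to know the relocation prefix (which is not shown in the shadowJar definition above). The function name is hypothetical.

```python
import zipfile


def split_package_entries(jar_path, original_prefix="com/google/common/"):
    """Return (unrelocated, relocated) .class entries for a package path.

    unrelocated: entries that still start with the original package path.
    relocated:   entries that contain the path under some other prefix.
    """
    unrelocated, relocated = [], []
    with zipfile.ZipFile(jar_path) as jar:
        for name in jar.namelist():
            if not name.endswith(".class"):
                continue  # ignore resources and directory entries
            if name.startswith(original_prefix):
                unrelocated.append(name)
            elif original_prefix in name:
                relocated.append(name)
    return unrelocated, relocated
```

If the first list is empty, the fat jar itself carries no unshaded Guava, and any remaining conflict would have to come from classes that the cluster puts on the classpath.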
But it still fails when running the Dataproc Spark submit, and it still seems to pick the old versions from Hadoop at runtime. Below is the stack trace, starting from my KmsSymettric class, which uses GCP KMS decrypt:
Exception in thread "main" java.lang.NoSuchMethodError:;CLjava/lang/Object;)V
at io.grpc.Metadata$Key.validateName(
at io.grpc.Metadata$Key.<init>(
at io.grpc.Metadata$Key.<init>(
at io.grpc.Metadata$AsciiKey.<init>(
at io.grpc.Metadata$AsciiKey.<init>(
at io.grpc.Metadata$Key.of(
at io.grpc.Metadata$Key.of(
My dataproc submit is:
gcloud dataproc jobs submit spark \
--cluster=${CLUSTER_NAME} \
--project ${PROJECT_ID} \
--region=${REGION} \
--jars=gs://${APP_BUCKET}/${JAR} \
--class=${CLASS} \
--app args --arg1 val1 etc
I'm using dataproc image version 1.4
What am I missing?


org.apache.spark.SparkException: Writing job aborted on Databricks

I have used Databricks to ingest data from Event Hub and process it in real time with Pyspark Streaming. The code is working fine, but after this line:
df.writeStream.trigger(processingTime='100 seconds').queryName("myquery")\
I'm getting the following error:
org.apache.spark.SparkException: Writing job aborted.
Caused by: org.apache.spark.eventhubs.rdd.EventHubsRDD; local class incompatible: stream classdesc
I have read that this could be due to low processing power, but I am using a Standard_F4 machine, standard cluster mode with autoscaling enabled.
Any ideas?
This looks like a JAR issue. Go to the JAR folder of your Spark installation and check whether you have multiple JARs for azure-eventhubs-spark_XXX.XX. I think you've downloaded different versions and placed them there; remove every JAR with that name from your collection. This error can also occur if your JAR version is incompatible with other JARs. Try adding the Spark JARs through the Spark config instead:
from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName('my-spark') \
    .config('spark.jars.packages', '') \
    .getOrCreate()
This way Spark downloads the JAR files from Maven.
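The duplicate-JAR check suggested above can be sketched as follows. This is only an illustration under assumptions: it expects file names of the form artifact-version.jar, takes the version to start at the first hyphen followed by a digit, and the function name is hypothetical. The directory to scan depends on your installation.

```python
import re
from collections import defaultdict
from pathlib import Path

# "name-1.2.3.jar" -> artifact "name", version "1.2.3".
# Assumption: the version starts at the first "-" followed by a digit.
_JAR_NAME = re.compile(r"^(?P<artifact>.+?)-(?P<version>\d[\w.\-]*)\.jar$")


def find_duplicate_jars(jar_dir):
    """Map each artifact that appears in more than one version to its versions."""
    versions = defaultdict(set)
    for path in Path(jar_dir).glob("*.jar"):
        match = _JAR_NAME.match(path.name)
        if match:
            versions[match.group("artifact")].add(match.group("version"))
    # Keep only artifacts present in several versions.
    return {a: sorted(v) for a, v in versions.items() if len(v) > 1}
```

An empty result means no artifact is present twice in that folder; a non-empty one lists the candidates to remove.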

spark 3.x on HDP 3.1 in headless mode with hive - hive tables not found

How can I configure Spark 3.x on HDP 3.1 to interact with Hive, using the headless version of Spark (the build without Hadoop)?
First, I have downloaded and unzipped the headless spark 3.x:
cd ~/development/software/spark-3.0.0-bin-without-hadoop
export HADOOP_CONF_DIR=/etc/hadoop/conf/
export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk
export SPARK_DIST_CLASSPATH=$(hadoop --config /usr/hdp/current/spark2-client/conf classpath)
ls /usr/hdp # note the version and substitute it for 3.1.x.x-xxx below
./bin/spark-shell --master yarn --queue myqueue --conf spark.driver.extraJavaOptions='-Dhdp.version=3.1.x.x-xxx' --conf'-Dhdp.version=3.1.x.x-xxx' --conf spark.hadoop.metastore.catalog.default=hive --files /usr/hdp/current/hive-client/conf/hive-site.xml
spark.sql("show databases").show
// only showing default namespace, existing hive tables are missing
| default|
res2: String = in-memory # I want to see hive here - how? How to add hive jars onto the classpath?
This is an updated version of "How can I run spark in headless mode in my custom version on HDP?" for Spark 3.x on HDP 3.1, and of "custom spark does not find hive databases when running on yarn".
Furthermore: I am aware of the problems with ACID Hive tables in Spark. For now, I simply want to be able to see the existing databases.
We must get the hive jars onto the class path. Trying as follows:
export SPARK_DIST_CLASSPATH="/usr/hdp/current/hive-client/lib*:${SPARK_DIST_CLASSPATH}"
And now using spark-sql:
./bin/spark-sql --master yarn --queue myqueue --conf spark.driver.extraJavaOptions='-Dhdp.version=3.1.x.x-xxx' --conf'-Dhdp.version=3.1.x.x-xxx' --conf spark.hadoop.metastore.catalog.default=hive --files /usr/hdp/current/hive-client/conf/hive-site.xml
fails with:
Error: Failed to load class org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.
Failed to load main class org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.
I.e. the line: export SPARK_DIST_CLASSPATH="/usr/hdp/current/hive-client/lib*:${SPARK_DIST_CLASSPATH}", had no effect (same issue if not set).
As noted above and in "custom spark does not find hive databases when running on yarn", the Hive JARs are needed. They are not supplied in the headless version.
I was unable to retrofit them.
Solution: instead of fighting this, simply use the Spark build with Hadoop 3.2 (on HDP 3.1).

DataProc Avro Version Causing Error on Image v1.0.0

We are running a few dataproc jobs with dataproc image 1.0 and spark-redshift.
We have two clusters, here are some details:
Cluster A -> Runs PySpark Streaming job, last created 2016. Jul 15. 11:27:12 AEST
Cluster B -> Runs PySpark Batch jobs; the cluster is created every time the job runs and torn down afterwards.
A and B run the same code base and use the same init script, the same node types, etc.
Since sometime last Friday (2016-08-05 AEST), our code stopped working on cluster B with the following error, while cluster A is running without issues.
The following code can reproduce the issue on Cluster B (or any new cluster with image v1.0.0) while it runs fine on cluster A.
Sample PySpark Code:
from pyspark import SparkContext, SQLContext
sc = SparkContext()
sql_context = SQLContext(sc)
rdd = sc.parallelize([{'user_id': 'test'}])
df = rdd.toDF()
sc._jsc.hadoopConfiguration().set("fs.s3n.awsAccessKeyId", "FOO")
sc._jsc.hadoopConfiguration().set("fs.s3n.awsSecretAccessKey", "BAR")
df.write \
    .format("com.databricks.spark.redshift") \
    .option("url", "jdbc:redshift://") \
    .option("dbtable", 'foo') \
    .option("tempdir", "s3n://bar") \
    .option("extracopyoptions", "TRUNCATECOLUMNS") \
    .mode("append") \
    .save()
The above code fails in both of the following situations on Cluster B, while running fine on A. Note that RedshiftJDBC41- is created via the cluster init script.
Running in interactive mode on master node:
PYSPARK_DRIVER_PYTHON=ipython pyspark \
--verbose \
--master "local[*]"\
--jars /usr/lib/hadoop/lib/RedshiftJDBC41- \
--packages com.databricks:spark-redshift_2.10:1.0.0
Submit the job via gcloud dataproc
gcloud --project foo \
dataproc jobs submit pyspark \
--cluster bar \
--properties ^#^spark.jars.packages=com.databricks:spark-redshift_2.10:1.0.0#spark.jars=/usr/lib/hadoop/lib/RedshiftJDBC41- \
The error it produces (Trace):
2016-08-08 06:12:23 WARN TaskSetManager:70 - Lost task 6.0 in stage 45.0 (TID 121275,
java.lang.NoSuchMethodError: org.apache.avro.generic.GenericData.createDatumWriter(Lorg/apache/avro/Schema;)Lorg/apache/avro/io/DatumWriter;
at org.apache.avro.mapreduce.AvroKeyRecordWriter.<init>(
at org.apache.avro.mapreduce.AvroKeyOutputFormat$RecordWriterFactory.create(
at org.apache.avro.mapreduce.AvroKeyOutputFormat.getRecordWriter(
at com.databricks.spark.avro.AvroOutputWriter.<init>(AvroOutputWriter.scala:82)
at com.databricks.spark.avro.AvroOutputWriterFactory.newInstance(AvroOutputWriterFactory.scala:31)
at org.apache.spark.sql.execution.datasources.BaseWriterContainer.newOutputWriter(WriterContainer.scala:129)
at org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:255)
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:148)
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:148)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.executor.Executor$
at java.util.concurrent.ThreadPoolExecutor.runWorker(
at java.util.concurrent.ThreadPoolExecutor$
2016-08-08 06:12:24 ERROR YarnScheduler:74 - Lost executor 63 on kinesis-ma-sw-o7he.c.bupa-ma.internal: Container marked as failed: container_1470632577663_0003_01_000065 on host: kinesis-ma-sw-o7he.c.bupa-ma.internal. Exit status: 50. Diagnostics: Exception from container-launch.
Container id: container_1470632577663_0003_01_000065
Exit code: 50
Stack trace: ExitCodeException exitCode=50:
at org.apache.hadoop.util.Shell.runCommand(
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(
at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(
at java.util.concurrent.ThreadPoolExecutor.runWorker(
at java.util.concurrent.ThreadPoolExecutor$
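The NoSuchMethodError above names the missing method with a JVM method descriptor, which is easy to misread. The sketch below decodes such a descriptor into a return type and parameter list, so it can be compared against the method signatures of the Avro version that is actually loaded. It is an illustrative helper (function names are hypothetical) that covers primitives, object types and arrays.

```python
_PRIMITIVES = {"B": "byte", "C": "char", "D": "double", "F": "float",
               "I": "int", "J": "long", "S": "short", "Z": "boolean",
               "V": "void"}


def _read_type(desc, pos):
    """Decode one type starting at desc[pos]; return (readable_name, next_pos)."""
    dims = 0
    while desc[pos] == "[":  # each '[' adds one array dimension
        dims += 1
        pos += 1
    if desc[pos] == "L":  # object type: L<internal/name>;
        end = desc.index(";", pos)
        name = desc[pos + 1:end].replace("/", ".")
        pos = end + 1
    else:  # single-letter primitive or void
        name = _PRIMITIVES[desc[pos]]
        pos += 1
    return name + "[]" * dims, pos


def decode_descriptor(desc):
    """Turn '(Lorg/x/A;I)V' into ('void', ['org.x.A', 'int'])."""
    if not desc.startswith("("):
        raise ValueError("a method descriptor must start with '('")
    pos, params = 1, []
    while desc[pos] != ")":
        name, pos = _read_type(desc, pos)
        params.append(name)
    return_type, _ = _read_type(desc, pos + 1)
    return return_type, params
```

For the descriptor in the trace, (Lorg/apache/avro/Schema;)Lorg/apache/avro/io/DatumWriter;, this yields a method that takes an org.apache.avro.Schema and returns an org.apache.avro.io.DatumWriter.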
SparkRedshift:1.0.0 requires com.databricks.spark-avro:2.0.1, which requires org.apache.avro:1.7.6.
Upon checking the version of org.apache.avro.generic.GenericData on Cluster A:
root@foo-bar-m:/home/foo# spark-shell \
> --verbose \
> --master "local[*]" \
> --deploy-mode client \
> --packages com.databricks:spark-redshift_2.10:1.0.0 \
> --jars "/usr/lib/hadoop/lib/RedshiftJDBC41-"
It produces (Trace):
scala> import org.apache.avro.generic._
import org.apache.avro.generic._
scala> val c = GenericData.get()
c: org.apache.avro.generic.GenericData = org.apache.avro.generic.GenericData@496a514f
scala> c.getClass.getProtectionDomain().getCodeSource()
res0: = (file:/usr/lib/hadoop/lib/bigquery-connector-0.7.5-hadoop2.jar <no signer certificates>)
While running the same command on Cluster B:
scala> import org.apache.avro.generic._
import org.apache.avro.generic._
scala> val c = GenericData.get()
c: org.apache.avro.generic.GenericData = org.apache.avro.generic.GenericData@72bec302
scala> c.getClass.getProtectionDomain().getCodeSource()
res0: = (file:/usr/lib/hadoop/lib/bigquery-connector-0.7.7-hadoop2.jar <no signer certificates>)
Screenshot of Env on Cluster B. (Apologies for all the redactions).
We've tried the methods described here and here, without any success.
This is really frustrating: Dataproc updates the image content without bumping the release version, which is the complete opposite of immutable releases. Now our code is broken and there is no way for us to roll back to the previous version.
Sorry for the trouble! It's certainly not intended for breaking changes to occur within an image version. Note that subminor versions are rolled out "under the hood" for non-breaking bug fixes and Dataproc-specific patches.
You can revert to using the 1.0.* version from before last week by simply specifying --image-version 1.0.8 when deploying clusters from the command-line:
gcloud dataproc clusters create --image-version 1.0.8
Edit: For additional clarification, we've investigated the Avro versions in question and verified that Avro version numbers did not actually change in any recent subminor Dataproc release. The core issue is a latent bug in Hadoop: Hadoop brings avro-1.7.4 under /usr/lib/hadoop/lib/, while Spark uses avro-1.7.7. Coincidentally, Google's BigQuery connector also uses avro-1.7.7, but this turns out to be orthogonal to the known Spark/Hadoop problem with 1.7.4 vs 1.7.7. The recent image update was deemed non-breaking because versions in fact did not change, but classloading order changed in a nondeterministic way: Hadoop's bad Avro version used to be hidden from the Spark job by pure luck, and is no longer accidentally hidden in the latest image.
Dataproc's preview image currently includes a fix to the avro version in the Hadoop layer which should make it into any future Dataproc 1.1 version when it comes out; you might want to consider trying the preview version to see if Spark 2.0 is a seamless transition.
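When several JARs can supply the same class, as with the Avro classes described above, it helps to list every JAR in a directory that contains the class in question. The following is a minimal sketch with a hypothetical function name; the directory and class name are passed in by the caller, and it only looks at *.jar files directly inside that directory.

```python
import zipfile
from pathlib import Path


def jars_providing(class_name, lib_dir):
    """List JAR file names in lib_dir that contain class_name, sorted by name."""
    entry = class_name.replace(".", "/") + ".class"
    hits = []
    for jar_path in sorted(Path(lib_dir).glob("*.jar")):
        try:
            with zipfile.ZipFile(jar_path) as jar:
                if entry in jar.namelist():
                    hits.append(jar_path.name)
        except zipfile.BadZipFile:
            continue  # skip files that are not valid archives
    return hits
```

Calling it with org.apache.avro.generic.GenericData and /usr/lib/hadoop/lib would show every candidate source for that class in that directory. More than one hit means the class that wins depends on classpath order.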

Spark 1.6 kafka streaming on dataproc py4j error

I get the following error:
Py4JError(u'An error occurred while calling o73.createDirectStreamWithoutMessageHandler. Trace:\npy4j.Py4JException: Method createDirectStreamWithoutMessageHandler([class, class java.util.HashMap, class java.util.HashSet, class java.util.HashMap]) does not exist\n\tat py4j.reflection.ReflectionEngine.getMethod(\n\tat py4j.reflection.ReflectionEngine.getMethod(\n\tat py4j.Gateway.invoke(\n\tat py4j.commands.AbstractCommand.invokeMethod(\n\tat py4j.commands.CallCommand.execute(\n\tat\n\tat\n\n',)
I am using spark-streaming-kafka-assembly_2.10-1.6.0.jar (which is present in the /usr/lib/hadoop/lib/ folder on all my nodes + master)
The actual error was: java.lang.NoSuchMethodError: org.apache.hadoop.yarn.util.Apps.crossPlatformify(Ljava/lang/String;)Ljava/lang/String;
This was due to a wrong hadoop version. Therefore spark should be compiled with the correct hadoop version:
mvn -Phadoop-2.6 -Dhadoop.version=2.7.2 -DskipTests clean package
This will result in a jar in the external/kafka-assembly/target folder.
Using image version 1, I've successfully run the PySpark streaming / Kafka wordcount example.
In each of these examples "ad-kafka-inst" is my test kafka instance with a 'test' topic.
Using a cluster with no initialization actions:
$ gcloud dataproc jobs submit pyspark --cluster ad-kafka2 --properties spark.jars.packages=org.apache.spark:spark-streaming-kafka_2.10:1.6.0 ./ ad-kafka-inst:2181 test
Using initialization actions with a full kafka assembly:
Download / unpack spark-1.6.0.tgz
Build with:
$ mvn -Phadoop-2.6 -Dhadoop.version=2.7.2 package
Upload spark-streaming-kafka-assembly_2.10-1.6.0.jar to a new GCS bucket (MYBUCKET for example).
Create the following initialization action in the same GCS bucket (e.g., gs://MYBUCKET/
#!/bin/bash
gsutil cp gs://MYBUCKET/spark-streaming-kafka-assembly_2.10-1.6.0.jar /usr/lib/hadoop/lib/
chmod 755 /usr/lib/hadoop/lib/spark-streaming-kafka-assembly_2.10-1.6.0.jar
Start a cluster with the above initialization action:
$ gcloud dataproc clusters create ad-kafka-init --initialization-actions gs://MYBUCKET/
Start the streaming word count:
$ gcloud dataproc jobs submit pyspark --cluster ad-kafka-init ./ ad-kafka-inst:2181 test

spark-submit classpath issue with --repositories --packages options

I'm running Spark in a standalone cluster where the Spark master, worker and submit each run in their own Docker container.
When I spark-submit my Java app with the --repositories and --packages options, I can see that it successfully downloads the app's required dependencies. However, the stderr logs in the Spark worker's web UI report a java.lang.ClassNotFoundException: kafka.serializer.StringDecoder. This class is available in one of the dependencies downloaded by spark-submit, but it doesn't seem to be available on the worker classpath.
16/02/22 16:17:09 INFO SparkDeploySchedulerBackend: SchedulerBackend is ready for scheduling beginning after reached minRegisteredResourcesRatio: 0.0
Exception in thread "main" java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(
at sun.reflect.DelegatingMethodAccessorImpl.invoke(
at java.lang.reflect.Method.invoke(
at org.apache.spark.deploy.worker.DriverWrapper$.main(DriverWrapper.scala:58)
at org.apache.spark.deploy.worker.DriverWrapper.main(DriverWrapper.scala)
Caused by: java.lang.NoClassDefFoundError: kafka/serializer/StringDecoder
... 6 more
Caused by: java.lang.ClassNotFoundException: kafka.serializer.StringDecoder
at java.lang.ClassLoader.loadClass(
at java.lang.ClassLoader.loadClass(
... 7 more
The spark-submit call:
${SPARK_HOME}/bin/spark-submit --deploy-mode cluster \
--master spark://spark-master:7077 \
--repositories \
--packages org.apache.spark:spark-streaming-kafka_2.10:1.6.0,org.elasticsearch:elasticsearch-spark_2.10:2.2.0 \
--class \
/app/spark-app.jar kafka-server:9092 mytopic
I was working with Spark 2.4.0 when I ran into this problem. I don't have a solution yet, only some observations based on experimentation and on reading around for solutions. I am noting them here in case they help someone in their investigation. I will update this answer if I find more information later.
The --repositories option is required only if some custom repository has to be referenced
By default the maven central repository is used if the --repositories option is not provided
When --packages option is specified, the submit operation tries to look for the packages and their dependencies in the ~/.ivy2/cache, ~/.ivy2/jars, ~/.m2/repository directories.
If they are not found, then they are downloaded from maven central using ivy and stored under the ~/.ivy2 directory.
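The lookup described above can be checked by hand. The sketch below parses a --packages value and reports which of the listed packages have no JAR in a given directory. It rests on an assumption worth verifying on your own machine: that spark-submit stores resolved JARs under ~/.ivy2/jars as groupId_artifactId-version.jar. It only checks the packages named on the command line, not their transitive dependencies, and both function names are hypothetical.

```python
from pathlib import Path


def parse_packages(packages):
    """Split 'g1:a1:v1,g2:a2:v2' into a list of (group, artifact, version)."""
    coords = []
    for item in packages.split(","):
        parts = item.strip().split(":")
        if len(parts) != 3 or not all(parts):
            raise ValueError(f"expected groupId:artifactId:version, got {item!r}")
        coords.append(tuple(parts))
    return coords


def missing_from_ivy(packages, ivy_jars_dir):
    """Return the expected JAR names that are absent from ivy_jars_dir."""
    missing = []
    for group, artifact, version in parse_packages(packages):
        # Assumed naming scheme: <groupId>_<artifactId>-<version>.jar
        jar_name = f"{group}_{artifact}-{version}.jar"
        if not (Path(ivy_jars_dir) / jar_name).is_file():
            missing.append(jar_name)
    return missing
```

Running it on the machine that executes spark-submit and again on the node that hosts the driver shows whether the downloaded JARs exist only on the submitting side.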
In my case I had observed that
spark-shell worked perfectly with the --packages option
spark-submit would fail to do the same. It would download the dependencies correctly but fail to pass on the jars to the driver and worker nodes
spark-submit worked with the --packages option if I ran the driver locally using --deploy-mode client instead of cluster.
This would run the driver locally in the command shell where I ran the spark-submit command but the worker would run on the cluster with the appropriate dependency jars
I found the following discussion useful but I still have to nail down this problem.
Most people just use an uber JAR to avoid this problem, and also to avoid conflicting JAR versions where the platform provides a different version of the same dependency.
But I don't like that idea as anything more than a stopgap, and I am still looking for a solution.
