DataProc Avro Version Causing Error on Image v1.0.0 - apache-spark

We are running a few dataproc jobs with dataproc image 1.0 and spark-redshift.
We have two clusters, here are some details:
Cluster A -> Runs a PySpark Streaming job, last created 2016-07-15 11:27:12 AEST
Cluster B -> Runs PySpark batch jobs; the cluster is created every time a job runs and torn down afterwards.
A & B run the same code base, use the same init script, the same node types, etc.
Since sometime last Friday (2016-08-05 AEST), our code stopped working on cluster B with the error below, while cluster A keeps running without issues.
The following code can reproduce the issue on Cluster B (or any new cluster with image v1.0.0) while it runs fine on cluster A.
Sample PySpark Code:
from pyspark import SparkContext, SQLContext
sc = SparkContext()
sql_context = SQLContext(sc)
rdd = sc.parallelize([{'user_id': 'test'}])
df = rdd.toDF()
sc._jsc.hadoopConfiguration().set("fs.s3n.awsAccessKeyId", "FOO")
sc._jsc.hadoopConfiguration().set("fs.s3n.awsSecretAccessKey", "BAR")
df.write \
    .format("com.databricks.spark.redshift") \
    .option("url", "jdbc:redshift://foo.ap-southeast-2.redshift.amazonaws.com/bar") \
    .option("dbtable", 'foo') \
    .option("tempdir", "s3n://bar") \
    .option("extracopyoptions", "TRUNCATECOLUMNS") \
    .mode("append") \
    .save()
The above code fails in both of the following situations on Cluster B, while running fine on A. Note that RedshiftJDBC41-1.1.10.1010.jar is put in place by the cluster init script.
Running in interactive mode on the master node:
PYSPARK_DRIVER_PYTHON=ipython pyspark \
--verbose \
--master "local[*]"\
--jars /usr/lib/hadoop/lib/RedshiftJDBC41-1.1.10.1010.jar \
--packages com.databricks:spark-redshift_2.10:1.0.0
Submitting the job via gcloud dataproc:
gcloud --project foo \
dataproc jobs submit pyspark \
--cluster bar \
--properties ^#^spark.jars.packages=com.databricks:spark-redshift_2.10:1.0.0#spark.jars=/usr/lib/hadoop/lib/RedshiftJDBC41-1.1.10.1010.jar \
foo.bar.py
The error it produces (Trace):
2016-08-08 06:12:23 WARN TaskSetManager:70 - Lost task 6.0 in stage 45.0 (TID 121275, foo.bar.internal):
java.lang.NoSuchMethodError: org.apache.avro.generic.GenericData.createDatumWriter(Lorg/apache/avro/Schema;)Lorg/apache/avro/io/DatumWriter;
at org.apache.avro.mapreduce.AvroKeyRecordWriter.<init>(AvroKeyRecordWriter.java:55)
at org.apache.avro.mapreduce.AvroKeyOutputFormat$RecordWriterFactory.create(AvroKeyOutputFormat.java:79)
at org.apache.avro.mapreduce.AvroKeyOutputFormat.getRecordWriter(AvroKeyOutputFormat.java:105)
at com.databricks.spark.avro.AvroOutputWriter.<init>(AvroOutputWriter.scala:82)
at com.databricks.spark.avro.AvroOutputWriterFactory.newInstance(AvroOutputWriterFactory.scala:31)
at org.apache.spark.sql.execution.datasources.BaseWriterContainer.newOutputWriter(WriterContainer.scala:129)
at org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:255)
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:148)
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:148)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:227)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
2016-08-08 06:12:24 ERROR YarnScheduler:74 - Lost executor 63 on kinesis-ma-sw-o7he.c.bupa-ma.internal: Container marked as failed: container_1470632577663_0003_01_000065 on host: kinesis-ma-sw-o7he.c.bupa-ma.internal. Exit status: 50. Diagnostics: Exception from container-launch.
Container id: container_1470632577663_0003_01_000065
Exit code: 50
Stack trace: ExitCodeException exitCode=50:
at org.apache.hadoop.util.Shell.runCommand(Shell.java:545)
at org.apache.hadoop.util.Shell.run(Shell.java:456)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:722)
at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:212)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
spark-redshift 1.0.0 requires com.databricks:spark-avro:2.0.1, which requires org.apache.avro:1.7.6.
Upon checking the version of org.apache.avro.generic.GenericData on Cluster A:
root@foo-bar-m:/home/foo# spark-shell \
> --verbose \
> --master "local[*]" \
> --deploy-mode client \
> --packages com.databricks:spark-redshift_2.10:1.0.0 \
> --jars "/usr/lib/hadoop/lib/RedshiftJDBC41-1.1.10.1010.jar"
It produces (Trace):
scala> import org.apache.avro.generic._
import org.apache.avro.generic._
scala> val c = GenericData.get()
c: org.apache.avro.generic.GenericData = org.apache.avro.generic.GenericData@496a514f
scala> c.getClass.getProtectionDomain().getCodeSource()
res0: java.security.CodeSource = (file:/usr/lib/hadoop/lib/bigquery-connector-0.7.5-hadoop2.jar <no signer certificates>)
While running the same command on Cluster B:
scala> import org.apache.avro.generic._
import org.apache.avro.generic._
scala> val c = GenericData.get()
c: org.apache.avro.generic.GenericData = org.apache.avro.generic.GenericData@72bec302
scala> c.getClass.getProtectionDomain().getCodeSource()
res0: java.security.CodeSource = (file:/usr/lib/hadoop/lib/bigquery-connector-0.7.7-hadoop2.jar <no signer certificates>)
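For reference, a rough PySpark equivalent of the Scala check above (a sketch, not from the original post; it can also reuse the pyspark shell's existing sc):
from pyspark import SparkContext

sc = SparkContext()  # or reuse the shell's `sc`
# Locate the jar that actually provides org.apache.avro.generic.GenericData
# as seen by the driver JVM.
avro_cls = sc._jvm.java.lang.Class.forName("org.apache.avro.generic.GenericData")
print(avro_cls.getProtectionDomain().getCodeSource().getLocation())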
Screenshot of Env on Cluster B. (Apologies for all the redactions).
We've tried the methods described here and here without any success.
This is really frustrating, as Dataproc updates the image content without bumping the release version, the complete opposite of immutable releases. Now our code is broken and there is no way we could roll back to the previous version.

Sorry for the trouble! It's certainly not intended for breaking changes to occur within an image version. Note that subminor versions are rolled out "under the hood" for non-breaking bug fixes and Dataproc-specific patches.
You can revert to using the 1.0.* version from before last week by simply specifying --image-version 1.0.8 when deploying clusters from the command-line:
gcloud dataproc clusters create --image-version 1.0.8
Edit: For additional clarification, we've investigated the Avro versions in question and verified that the Avro version numbers actually did not change in any recent subminor Dataproc release. The core issue is that Hadoop itself has a latent bug where it brings avro-1.7.4 under /usr/lib/hadoop/lib/ while Spark uses avro-1.7.7. Coincidentally, Google's BigQuery connector also uses avro-1.7.7, but this turns out to be orthogonal to the known Spark/Hadoop problem with 1.7.4 vs 1.7.7. The recent image update was deemed non-breaking because the versions in fact did not change, but classloading order changed in a nondeterministic way: Hadoop's bad Avro version used to be hidden from the Spark job by pure luck, and is no longer accidentally hidden in the latest image.
Dataproc's preview image currently includes a fix to the avro version in the Hadoop layer which should make it into any future Dataproc 1.1 version when it comes out; you might want to consider trying the preview version to see if Spark 2.0 is a seamless transition.
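As a possible stopgap while staying on the affected image (our own assumption; it is not part of the answer above and is untested here), Spark's experimental userClassPathFirst setting can make executors prefer the Avro pulled in via --packages over the copy on the Hadoop classpath:
from pyspark import SparkConf, SparkContext

# Untested workaround sketch: have executors load classes from user-supplied
# jars (e.g. the avro that spark-redshift/spark-avro pulls in via --packages)
# before the cluster's Hadoop jars. spark.driver.userClassPathFirst also
# exists, but it only applies in cluster mode.
conf = SparkConf().set("spark.executor.userClassPathFirst", "true")
sc = SparkContext(conf=conf)
Whether this actually resolves the NoSuchMethodError depends on which classloader wins on the executors, so treat it as an experiment rather than a fix.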

Related

How do I add bigquery property to a dataproc (jobs submit spark) through gcp shell?

I am trying to load google patent data using bigquery with the following code (Python).
df = spark.read \
    .format("bigquery") \
    .option("table", table) \
    .option("filter", "country_code = '{}' AND application_kind = '{}' AND publication_date >= {} AND publication_date < {}".format(country, kind, date_from, date_to)) \
    .load()
Then I got the following error:
Py4JJavaError: An error occurred while calling o144.load. : java.lang.ClassNotFoundException: Failed to find data source: bigquery. Please find packages at http://spark.apache.org/third-party-projects.html
I suspect the cause of this error is that the bigquery package is missing from my cluster (I checked the cluster configuration). I tried to add it with a gcloud shell command:
$ gcloud dataproc jobs submit spark \
    --cluster=my-cluster \
    --region=europe-west2 \
    --jar=gs://spark-lib/bigquery/spark-bigquery-latest_2.12.jar
But I got another error:
Exception in thread "main" org.apache.spark.SparkException: Failed to get main class in JAR with error 'null'. Please specify one with --class.
How do I install the bigquery package (via gcloud dataproc jobs submit spark)? Ideally on Dataproc.
I don't know what to specify for --class.
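One possible direction (a sketch; the connector coordinates, version, and table name below are assumptions, not from the original post): pull the spark-bigquery connector in through spark.jars.packages when building the SparkSession, so the bigquery data source can be resolved without passing a main jar to gcloud:
from pyspark.sql import SparkSession

# Pick the artifact that matches the cluster's Scala/Spark version.
spark = (SparkSession.builder
         .appName("patents")
         .config("spark.jars.packages",
                 "com.google.cloud.spark:spark-bigquery-with-dependencies_2.12:0.23.2")
         .getOrCreate())

df = (spark.read.format("bigquery")
      .option("table", "my-project.my_dataset.publications")  # placeholder table
      .load())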

org.apache.spark.SparkException: Writing job aborted on Databricks

I have used Databricks to ingest data from Event Hub and process it in real time with Pyspark Streaming. The code is working fine, but after this line:
df.writeStream.trigger(processingTime='100 seconds').queryName("myquery")\
.format("console").outputMode('complete').start()
I'm getting the following error:
org.apache.spark.SparkException: Writing job aborted.
Caused by: java.io.InvalidClassException: org.apache.spark.eventhubs.rdd.EventHubsRDD; local class incompatible: stream classdesc
I have read that this could be due to low processing power, but I am using a Standard_F4 machine, standard cluster mode with autoscaling enabled.
Any ideas?
This looks like a JAR issue. Go to your JARs folder in Spark and check whether you have multiple jars for azure-eventhubs-spark_XXX.XX. You may have downloaded different versions of it and placed them there; remove every JAR with that name from your collection. This error may also occur if your JAR version is incompatible with the other JARs. Try adding the Spark jars using Spark config:
spark = SparkSession \
.builder \
.appName('my-spark') \
.config('spark.jars.packages', 'com.microsoft.azure:azure-eventhubs-spark_2.11:2.3.12') \
.getOrCreate()
This way Spark will download the JAR files through Maven.

Spark output when submitting via Yarn cluster vs. client

I am new to Spark and just got it running on my cluster (Spark 2.0.1 on a 9 node cluster running Community version of MapR). I submit the wordcount example via
./bin/spark-submit --master yarn --jars ~/hadoopPERMA/jars/hadoop-lzo-0.4.21-SNAPSHOT.jar examples/src/main/python/wordcount.py ./README.md
and get the following output
17/04/07 13:21:34 WARN Client: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.
: 68
help: 1
when: 1
Hadoop: 3
...
Looks like everything is working properly. When I add --deploy-mode cluster I get the following output:
17/04/07 13:23:52 WARN Client: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.
So there are no errors, but I am not seeing the wordcount results. What am I missing? I see the job in my History Server and it says it completed successfully. Also, I checked my user directory in DFS, but no new files were written except for this empty directory: /user/myuser/.sparkStaging
Code (wordcount.py example shipped with Spark):
from __future__ import print_function
import sys
from operator import add
from pyspark.sql import SparkSession
if __name__ == "__main__":
    if len(sys.argv) != 2:
        print("Usage: wordcount <file>", file=sys.stderr)
        exit(-1)

    spark = SparkSession\
        .builder\
        .appName("PythonWordCount")\
        .getOrCreate()

    lines = spark.read.text(sys.argv[1]).rdd.map(lambda r: r[0])
    counts = lines.flatMap(lambda x: x.split(' ')) \
        .map(lambda x: (x, 1)) \
        .reduceByKey(add)
    output = counts.collect()
    for (word, count) in output:
        print("%s: %i" % (word, count))

    spark.stop()
The reason your output is not printing:
When you run in client mode, the node from which you submit the job is the driver, so when you collect the result it is collected on that node and printed there.
In yarn-cluster mode, your driver is some other node, not the one from which you submitted the job. So when you call .collect(), the result is collected and printed on that node, and you can find it in the stdout log of the driver (available through the YARN application logs).
A better approach would be to write the output somewhere in HDFS.
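For instance, a minimal sketch using the counts RDD from the question (the output path is a placeholder):
# Persist the word counts to HDFS so the result is visible regardless of
# whether the driver ran on the submitting node or inside the cluster.
counts.map(lambda kv: "%s: %i" % kv) \
      .saveAsTextFile("hdfs:///user/myuser/wordcount-output")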
The reason for your spark.yarn.jars warning:
To run a Spark job, YARN needs some binaries available on all nodes of the cluster. If these binaries are not available, then as part of job preparation Spark will create a zip file with all the jars under $SPARK_HOME/jars and upload it to the distributed cache.
To solve this:
By default, Spark on YARN will use Spark jars installed locally, but the Spark jars can also be in a world-readable(chmod 777) location on HDFS. This allows YARN to cache it on nodes so that it doesn't need to be distributed each time an application runs. To point to jars on HDFS, for example, set spark.yarn.jars to hdfs:///some/path.
After placing your jars on HDFS, submit your job with the extra conf before the application file, for example:
./bin/spark-submit --master yarn --conf spark.yarn.jars="hdfs:///some/path" --jars ~/hadoopPERMA/jars/hadoop-lzo-0.4.21-SNAPSHOT.jar examples/src/main/python/wordcount.py ./README.md
Source : http://spark.apache.org/docs/latest/running-on-yarn.html

Spark SQL RDD loads in pyspark but not in spark-submit: "JDBCRDD: closed connection"

I have the following simple code for loading a table from my Postgres database into an RDD.
# this setup is just for spark-submit, will be ignored in pyspark
from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext

conf = SparkConf().setAppName("GA")  # .setMaster("localhost")
sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)

# func for loading table
def get_db_rdd(table):
    url = "jdbc:postgresql://localhost:5432/harvest?user=postgres"
    print(url)
    lower = 0
    upper = 1000
    ret = sqlContext \
        .read \
        .format("jdbc") \
        .option("url", url) \
        .option("dbtable", table) \
        .option("partitionColumn", "id") \
        .option("numPartitions", 1024) \
        .option("lowerBound", lower) \
        .option("upperBound", upper) \
        .option("password", "password") \
        .load()
    ret = ret.rdd
    return ret

# load table, and print results
print(get_db_rdd("mytable").collect())
I run ./bin/pyspark then paste that into the interpreter, and it prints out the data from my table as expected.
Now, if I save that code to a file named test.py then do ./bin/spark-submit test.py, it starts to run, but then I see these messages spam my console forever:
17/02/16 02:24:21 INFO Executor: Running task 45.0 in stage 0.0 (TID 45)
17/02/16 02:24:21 INFO JDBCRDD: closed connection
17/02/16 02:24:21 INFO Executor: Finished task 45.0 in stage 0.0 (TID 45). 1673 bytes result sent to driver
Edit: This is on a single machine. I haven't started any masters or slaves; spark-submit is the only command I run after system start. I tried with the master/slave setup with the same results.
My spark-env.sh file looks like this:
export SPARK_WORKER_INSTANCES=2
export SPARK_WORKER_CORES=2
export SPARK_WORKER_MEMORY=800m
export SPARK_EXECUTOR_MEMORY=800m
export SPARK_EXECUTOR_CORES=2
export SPARK_CLASSPATH=/home/ubuntu/spark/pg_driver.jar # Postgres driver I need for SQLContext
export PYTHONHASHSEED=1337 # have to make workers use same seed in Python3
It works if I spark-submit a Python file that just creates an RDD from a list or something. I only have problems when I try to use a JDBC RDD. What piece am I missing?
When using spark-submit, you need to supply the JDBC driver jar to the executors as well.
As mentioned in the Spark 2.1 JDBC documentation:
To get started you will need to include the JDBC driver for your
particular database on the spark classpath. For example, to connect to
postgres from the Spark Shell you would run the following command:
bin/spark-shell --driver-class-path postgresql-9.4.1207.jar --jars postgresql-9.4.1207.jar
Note: the same applies to the spark-submit command.
Troubleshooting
The JDBC driver class must be visible to the primordial class loader
on the client session and on all executors. This is because Java’s
DriverManager class does a security check that results in it ignoring
all drivers not visible to the primordial class loader when one goes
to open a connection. One convenient way to do this is to modify
compute_classpath.sh on all worker nodes to include your driver JARs.
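Relatedly, a small sketch (our assumption, not part of the quoted docs): naming the JDBC driver class explicitly in the read options lets Spark register the driver itself rather than relying on DriverManager discovery on the executors:
from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(conf=SparkConf().setAppName("GA"))
sqlContext = SQLContext(sc)

# Same read as in the question, but with the driver class spelled out.
df = (sqlContext.read
      .format("jdbc")
      .option("url", "jdbc:postgresql://localhost:5432/harvest?user=postgres")
      .option("dbtable", "mytable")
      .option("driver", "org.postgresql.Driver")
      .option("password", "password")
      .load())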
This is a horrible hack. I'm not considering this the answer, but it does work.
Alright, only pyspark works? Fine, then we'll use it. I wrote this Bash script:
cat $1 | $SPARK_HOME/bin/pyspark # pipe the Python file into pyspark
I run that script from my Python script that submits the jobs. Also, I'm including the code I use to pass arguments between the processes, in case it helps someone:
import os
import subprocess

new_env = os.environ.copy()
new_env["pyspark_argument_1"] = "some param I need in my Spark script"  # etc...
p = subprocess.Popen(["pyspark_wrapper.sh {}".format(py_fname)], shell=True, env=new_env)
In my Spark script:
something_passed_from_submitter = os.environ["pyspark_argument_1"]
# do stuff in Spark...
I feel like Spark is better supported and (if this is a bug) less buggy with Scala than with Python 3, so that might be the better solution for now. But my script uses some files we wrote in Python 3, so...

How to specify multiple dependencies using --packages for spark-submit?

I have the following as the command line to start a spark streaming job.
spark-submit --class com.biz.test \
--packages \
org.apache.spark:spark-streaming-kafka_2.10:1.3.0 \
org.apache.hbase:hbase-common:1.0.0 \
org.apache.hbase:hbase-client:1.0.0 \
org.apache.hbase:hbase-server:1.0.0 \
org.json4s:json4s-jackson:3.2.11 \
./test-spark_2.10-1.0.8.jar \
>spark_log 2>&1 &
The job fails to start with the following error:
Exception in thread "main" java.lang.IllegalArgumentException: Given path is malformed: org.apache.hbase:hbase-common:1.0.0
at org.apache.spark.util.Utils$.resolveURI(Utils.scala:1665)
at org.apache.spark.deploy.SparkSubmitArguments.parse$1(SparkSubmitArguments.scala:432)
at org.apache.spark.deploy.SparkSubmitArguments.parseOpts(SparkSubmitArguments.scala:288)
at org.apache.spark.deploy.SparkSubmitArguments.<init>(SparkSubmitArguments.scala:87)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:105)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
I've tried removing the formatting and returning to a single line, but that doesn't resolve the issue. I've also tried a bunch of variations: different versions, added _2.10 to the end of the artifactId, etc.
According to the docs (spark-submit --help):
The format for the coordinates should be groupId:artifactId:version.
So what I have should be valid and should reference this package.
If it helps, I'm running Cloudera 5.4.4.
What am I doing wrong? How can I reference the hbase packages correctly?
A list of packages should be separated using commas without whitespace (breaking lines should work just fine), for example:
--packages org.apache.spark:spark-streaming-kafka_2.10:1.3.0,\
org.apache.hbase:hbase-common:1.0.0
I found it worthwhile to use SparkSession in Spark 3.0.0 for MySQL and Postgres:
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName('mysql-postgres') \
    .config('spark.jars.packages',
            'mysql:mysql-connector-java:8.0.20,org.postgresql:postgresql:42.2.16') \
    .getOrCreate()
@Mohammad thanks for this input. This worked for me too. I had to load the Kafka and MySQL packages in a single SparkSession. I did something like this:
spark = (SparkSession
         .builder
         ...
         .appName('myapp')
         # Add the Kafka and MySQL packages
         .config("spark.jars.packages",
                 "org.apache.spark:spark-sql-kafka-0-10_2.12:3.1.2,mysql:mysql-connector-java:8.0.26")
         .getOrCreate())
