NoSuchMethodError using Databricks Spark-Avro 3.2.0 - apache-spark

I have a Spark master and worker running in Docker containers with Spark 2.0.2 and Hadoop 2.7. I'm trying to submit a job from PySpark in a different container (on the same network) by running
df = spark.read.json("/data/test.json")
df.write.format("com.databricks.spark.avro").save("/data/test.avro")
But I'm getting this error:
java.lang.NoSuchMethodError: org.apache.avro.generic.GenericData.createDatumWriter(Lorg/apache/avro/Schema;)Lorg/apache/avro/io/DatumWriter;
It makes no difference whether I run it interactively or with spark-submit. These are the packages loaded in Spark:
com.databricks#spark-avro_2.11;3.2.0 from central in [default]
com.thoughtworks.paranamer#paranamer;2.7 from central in [default]
org.apache.avro#avro;1.8.1 from central in [default]
org.apache.commons#commons-compress;1.8.1 from central in [default]
org.codehaus.jackson#jackson-core-asl;1.9.13 from central in [default]
org.codehaus.jackson#jackson-mapper-asl;1.9.13 from central in [default]
org.slf4j#slf4j-api;1.7.7 from central in [default]
org.tukaani#xz;1.5 from central in [default]
org.xerial.snappy#snappy-java;1.1.1.3 from central in [default]
spark-submit --version output:
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/ '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.0.2
      /_/
Branch
Compiled by user jenkins on 2016-11-08T01:39:48Z
Revision
Url
Type --help for more information.
The Scala version is 2.11.8.
My pyspark command:
PYSPARK_PYTHON=ipython /usr/spark-2.0.2/bin/pyspark --master spark://master:7077 --packages com.databricks:spark-avro_2.11:3.2.0,org.apache.avro:avro:1.8.1
My spark-submit command:
spark-submit script.py --master spark://master:7077 --packages com.databricks:spark-avro_2.11:3.2.0,org.apache.avro:avro:1.8.1
I've read that this can be caused by "an older version of avro being used", so I tried using 1.8.1, but I keep getting the same error. Reading Avro works fine. Any help?

The cause of this error is that Apache Avro 1.7.4 is included in Hadoop by default. If the SPARK_DIST_CLASSPATH environment variable lists the Hadoop common jars ($HADOOP_HOME/share/hadoop/common/lib/) before the ivy2 jars, the old version gets picked up instead of the version required by spark-avro (>= 1.7.6), which is installed in ivy2.
To check if this is the case, open a spark-shell and run
sc.getClass().getResource("/org/apache/avro/generic/GenericData.class")
This should tell you the location of the class like so:
java.net.URL = jar:file:/lib/ivy/jars/org.apache.avro_avro-1.7.6.jar!/org/apache/avro/generic/GenericData.class
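Since the same mismatch usually bites on the executors as well, a similar check can be run across the workers. This is only a sketch, assuming a running SparkContext sc in spark-shell; both the driver and the executors should report the ivy2 jar:
sc.parallelize(1 to 4).map { _ =>
  // ask the executor's classloader where GenericData is resolved from
  String.valueOf(classOf[org.apache.avro.generic.GenericData]
    .getResource("/org/apache/avro/generic/GenericData.class"))
}.distinct.collect.foreach(println)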
If that URL points to $HADOOP_HOME/share/hadoop/common/lib/ instead, simply list your ivy2 jars before the Hadoop common jars in the SPARK_DIST_CLASSPATH environment variable.
For example, in a Dockerfile:
ENV SPARK_DIST_CLASSPATH="/home/root/.ivy2/*:$HADOOP_HOME/etc/hadoop/*:$HADOOP_HOME/share/hadoop/common/lib/*:$HADOOP_HOME/share/hadoop/common/*:$HADOOP_HOME/share/hadoop/hdfs/*:$HADOOP_HOME/share/hadoop/hdfs/lib/*:$HADOOP_HOME/share/hadoop/yarn/lib/*:$HADOOP_HOME/share/hadoop/yarn/*:$HADOOP_HOME/share/hadoop/mapreduce/lib/*:$HADOOP_HOME/share/hadoop/mapreduce/*:$HADOOP_HOME/share/hadoop/tools/lib/*"
Note: /home/root/.ivy2 is the default location for the ivy2 jars (under the running user's home directory); you can change it by setting spark.jars.ivy in your spark-defaults.conf, which is probably a good idea.
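For instance, a one-line spark-defaults.conf entry along these lines (the path is purely illustrative) pins the Ivy cache to a known location:
spark.jars.ivy /opt/spark/ivy2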

I have encountered a similar problem before.
Try using the --jars {path to spark-avro_2.11-3.2.0.jar} option with spark-submit.
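For example, something along these lines, with the jar path adjusted to wherever the jar lives in your setup (note that spark-submit options go before the application script):
spark-submit --master spark://master:7077 --jars /path/to/spark-avro_2.11-3.2.0.jar script.py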

Related

Google PubSub in Apache Spark 2.2.1

I'm trying to use Google Cloud PubSub within a Spark application. For simplicity let's just say that this application is Spark's shell. Trying to instantiate a Publisher throws a NoClassDefFoundError, which is most likely the result of dependency version conflicts. However, with a simple setup like this (just Spark and a Google Cloud PubSub dependency), I can't figure out how to resolve this issue.
bash-4.4# spark-shell --packages com.google.cloud:google-cloud-pubsub:1.105.0
Ivy Default Cache set to: /root/.ivy2/cache
The jars for the packages stored in: /root/.ivy2/jars
:: loading settings :: url = jar:file:/opt/spark-2.2.1-bin-hadoop2.7/jars/ivy-2.4.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
com.google.cloud#google-cloud-pubsub added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent;1.0
confs: [default]
found com.google.cloud#google-cloud-pubsub;1.105.0 in central
found io.grpc#grpc-api;1.28.1 in central
...
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/ '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.2.1
      /_/
Using Scala version 2.11.8 (OpenJDK 64-Bit Server VM, Java 1.8.0_212)
Type in expressions to have them evaluated.
Type :help for more information.
scala> com.google.cloud.pubsub.v1.Publisher.newBuilder("topic").build
java.lang.NoClassDefFoundError: com/google/api/gax/grpc/InstantiatingGrpcChannelProvider
at com.google.cloud.pubsub.v1.stub.PublisherStubSettings.defaultGrpcTransportProviderBuilder(PublisherStubSettings.java:225)
at com.google.cloud.pubsub.v1.TopicAdminSettings.defaultGrpcTransportProviderBuilder(TopicAdminSettings.java:169)
at com.google.cloud.pubsub.v1.Publisher$Builder.<init>(Publisher.java:674)
at com.google.cloud.pubsub.v1.Publisher$Builder.<init>(Publisher.java:625)
at com.google.cloud.pubsub.v1.Publisher.newBuilder(Publisher.java:621)
... 48 elided
Caused by: java.lang.ClassNotFoundException: com.google.api.gax.grpc.InstantiatingGrpcChannelProvider
at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
... 53 more
Is there any way to get this to work? I could change the pubsub dependency version, but not the Spark version.
That is due to a Guava dependency conflict, which is a known issue whenever Spark and Google libraries are used together. The workaround (with Maven) is to build a shaded jar with the maven-shade-plugin, relocating the conflicting packages.

Apache Spark installation error.

I am able to install Apache Spark on Ubuntu 16 with the following set of commands:
dpkg -i scala-2.12.1.deb
mkdir /opt/spark
tar -xvf spark-2.0.2-bin-hadoop2.7.tgz
cp -rv spark-2.0.2-bin-hadoop2.7/* /opt/spark
cd /opt/spark
Executing the Spark shell worked well:
./bin/spark-shell --master local[2]
It returned this output on the shell:
jai#jaiPC:/opt/spark$ ./bin/spark-shell --master local[2]
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel).
18/05/15 19:00:55 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
18/05/15 19:00:55 WARN Utils: Your hostname, jaiPC resolves to a loopback address: 127.0.1.1; using 172.16.16.46 instead (on interface enp4s0)
18/05/15 19:00:55 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
18/05/15 19:00:55 WARN SparkContext: Use an existing SparkContext, some configuration may not take effect.
Spark context Web UI available at http://172.16.16.46:4040
Spark context available as 'sc' (master = local[2], app id = local-1526391055793).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/ '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.0.2
      /_/
Using Scala version 2.11.8 (OpenJDK 64-Bit Server VM, Java 1.8.0_171)
Type in expressions to have them evaluated.
Type :help for more information.
scala>
But when I tried to access the Spark context Web UI at
http://172.16.16.46:4040
it shows
The page cannot be displayed
How can I resolve this problem?
Please help.
Thanks and regards

How can the startup time of a Spark application on a YARN cluster be shortened?

I'm managing a system that submits a Spark application to a YARN cluster in client mode. It takes about 30 seconds for a submitted Spark application to become ready on the YARN cluster, which seems a bit slow to me.
Are there any ways to shorten the startup time of a Spark application on a YARN cluster in client mode?
Here are specs of my environment:
- Spark version: 2.1.1 (the actual version is 2.1.1.2.6.2.0-205 which is provided by Hortonworks)
- YARN version: 2.7.3 (the actual version is 2.7.3.2.6.2.0-205 which is provided by Hortonworks)
- Number of RM: 2 (HA)
- Number of NM: 300
Here is the sample command and output for checking the startup time of a Spark application:
$ /usr/hdp/current/spark2-client/bin/spark-shell -v --master yarn --deploy-mode client --driver-memory 4g --conf spark.default.parallelism=180 --conf spark.executor.cores=6 --conf spark.executor.instances=30 --conf spark.executor.memory=6g --conf spark.yarn.am.cores=4 --conf spark.yarn.containerLauncherMaxThreads=30 --conf spark.yarn.dist.archives=/usr/hdp/current/spark2-client/R/lib/sparkr.zip#sparkr --conf spark.yarn.dist.files=/etc/spark2/conf/hive-site.xml --proxy-user xxxx
Using properties file: /usr/hdp/current/spark2-client/conf/spark-defaults.conf
Adding default property: spark.history.kerberos.keytab=/etc/security/keytabs/spark.headless.keytab
Adding default property: spark.history.fs.logDirectory=hdfs:///spark2-history/
Adding default property: spark.eventLog.enabled=true
Adding default property: spark.driver.extraLibraryPath=/usr/hdp/current/hadoop-client/lib/native:/usr/hdp/current/hadoop-client/lib/native/Linux-amd64-64
Adding default property: spark.yarn.queue=default
Adding default property: spark.yarn.historyServer.address=xxxx
Adding default property: spark.history.kerberos.principal=xxxx#xxxx
Adding default property: spark.history.provider=org.apache.spark.deploy.history.FsHistoryProvider
Adding default property: spark.executor.extraLibraryPath=/usr/hdp/current/hadoop-client/lib/native:/usr/hdp/current/hadoop-client/lib/native/Linux-amd64-64
Adding default property: spark.eventLog.dir=hdfs:///spark2-history/
Adding default property: spark.history.ui.port=18081
Adding default property: spark.history.kerberos.enabled=true
Parsed arguments:
master yarn
deployMode client
executorMemory 6g
executorCores 6
totalExecutorCores null
propertiesFile /usr/hdp/current/spark2-client/conf/spark-defaults.conf
driverMemory 4g
driverCores null
driverExtraClassPath null
driverExtraLibraryPath /usr/hdp/current/hadoop-client/lib/native:/usr/hdp/current/hadoop-client/lib/native/Linux-amd64-64
driverExtraJavaOptions null
supervise false
queue null
numExecutors 30
files null
pyFiles null
archives null
mainClass org.apache.spark.repl.Main
primaryResource spark-shell
name Spark shell
childArgs []
jars null
packages null
packagesExclusions null
repositories null
verbose true
Spark properties used, including those specified through
--conf and those from the properties file /usr/hdp/current/spark2-client/conf/spark-defaults.conf:
(spark.history.kerberos.enabled,true)
(spark.yarn.queue,default)
(spark.default.parallelism,180)
(spark.executor.extraLibraryPath,/usr/hdp/current/hadoop-client/lib/native:/usr/hdp/current/hadoop-client/lib/native/Linux-amd64-64)
(spark.history.kerberos.principal,xxxx#xxxx)
(spark.executor.memory,6g)
(spark.driver.memory,4g)
(spark.driver.extraLibraryPath,/usr/hdp/current/hadoop-client/lib/native:/usr/hdp/current/hadoop-client/lib/native/Linux-amd64-64)
(spark.executor.instances,30)
(spark.yarn.historyServer.address,xxxx)
(spark.eventLog.enabled,true)
(spark.yarn.dist.files,/etc/spark2/conf/hive-site.xml)
(spark.history.ui.port,18081)
(spark.history.provider,org.apache.spark.deploy.history.FsHistoryProvider)
(spark.history.fs.logDirectory,hdfs:///spark2-history/)
(spark.yarn.am.cores,4)
(spark.yarn.containerLauncherMaxThreads,30)
(spark.history.kerberos.keytab,/etc/security/keytabs/spark.headless.keytab)
(spark.yarn.dist.archives,/usr/hdp/current/spark2-client/R/lib/sparkr.zip#sparkr)
(spark.eventLog.dir,hdfs:///spark2-history/)
(spark.executor.cores,6)
Main class:
org.apache.spark.repl.Main
Arguments:
System properties:
(spark.yarn.queue,default)
(spark.history.kerberos.enabled,true)
(spark.default.parallelism,180)
(spark.executor.extraLibraryPath,/usr/hdp/current/hadoop-client/lib/native:/usr/hdp/current/hadoop-client/lib/native/Linux-amd64-64)
(spark.history.kerberos.principal,xxxx#xxxx)
(spark.driver.memory,4g)
(spark.executor.memory,6g)
(spark.executor.instances,30)
(spark.driver.extraLibraryPath,/usr/hdp/current/hadoop-client/lib/native:/usr/hdp/current/hadoop-client/lib/native/Linux-amd64-64)
(spark.yarn.historyServer.address,xxxx)
(spark.eventLog.enabled,true)
(spark.yarn.dist.files,file:/etc/spark2/conf/hive-site.xml)
(spark.history.ui.port,18081)
(SPARK_SUBMIT,true)
(spark.history.provider,org.apache.spark.deploy.history.FsHistoryProvider)
(spark.app.name,Spark shell)
(spark.history.fs.logDirectory,hdfs:///spark2-history/)
(spark.yarn.am.cores,4)
(spark.yarn.containerLauncherMaxThreads,30)
(spark.jars,)
(spark.history.kerberos.keytab,/etc/security/keytabs/spark.headless.keytab)
(spark.submit.deployMode,client)
(spark.yarn.dist.archives,file:/usr/hdp/current/spark2-client/R/lib/sparkr.zip#sparkr)
(spark.eventLog.dir,hdfs:///spark2-history/)
(spark.master,yarn)
(spark.executor.cores,6)
Classpath elements:
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Spark context Web UI available at http://xxxx:4040
Spark context available as 'sc' (master = yarn, app id = application_1519898793082_6925).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/ '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.1.1.2.6.2.0-205
      /_/
Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_151)
Type in expressions to have them evaluated.
Type :help for more information.
scala>
Thanks.

spark-shell, dependency jars and class not found exception

I'm trying to run my Spark app in spark-shell. Here is what I tried (and many more variants, after hours of reading about this error), but none seem to work.
spark-shell --class my_home.myhome.RecommendMatch —jars /Users/anon/Documents/Works/sparkworkspace/myhome/target/myhome-0.0.1-SNAPSHOT.jar,/Users/anon/Documents/Works/sparkworkspace/myhome/target/original-myhome-0.0.1-SNAPSHOT.jar
What I get instead is
java.lang.ClassNotFoundException: my_home.myhome.RecommendMatch
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:348)
at org.apache.spark.util.Utils$.classForName(Utils.scala:229)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:695)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:187)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:212)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:126)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Any ideas please? Thanks!
UPDATE:
I found that the jars must be colon (:) separated and not comma (,) separated, as described in several articles/docs.
spark-shell --class my_home.myhome.RecommendMatch —jars /Users/anon/Documents/Works/sparkworkspace/myhome/target/myhome-0.0.1-SNAPSHOT.jar:/Users/anon/Documents/Works/sparkworkspace/myhome/target/original-myhome-0.0.1-SNAPSHOT.jar
However, now the errors have changed. Note that ls -la finds the paths, although the following lines complain that they don't exist. Bizarre...
Warning: Local jar /Users/anon/Documents/Works/sparkworkspace/myhome/target/myhome-0.0.1-SNAPSHOT.jar:/Users/anon/Documents/Works/sparkworkspace/myhome/target/original-myhome-0.0.1-SNAPSHOT.jar does not exist, skipping.
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
java.lang.SecurityException: Invalid signature file digest for Manifest main attributes
at sun.security.util.SignatureFileVerifier.processImpl(SignatureFileVerifier.java:314)
at sun.security.util.SignatureFileVerifier.process(SignatureFileVerifier.java:268)
UPDATE 2:
spark-shell —class my_home.myhome.RecommendMatch —-jars “/Users/anon/Documents/Works/sparkworkspace/myhome/target/myhome-0.0.1-SNAPSHOT.jar:/Users/anon/Documents/Works/sparkworkspace/myhome/target/original-myhome-0.0.1-SNAPSHOT.jar”
The above command yields the following on spark-shell.
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
17/05/16 01:19:08 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
17/05/16 01:19:13 WARN ObjectStore: Failed to get database global_temp, returning NoSuchObjectException
Spark context Web UI available at http://192.168.0.101:4040
Spark context available as 'sc' (master = local[*], app id = local-1494877749685).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/ '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.1.0
      /_/
Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_121)
Type in expressions to have them evaluated.
Type :help for more information.
scala> :load my_home.myhome.RecommendMatch
That file does not exist
scala> :load RecommendMatch
That file does not exist
scala> :load my_home.myhome.RecommendMatch.scala
That file does not exist
scala> :load RecommendMatch.scala
That file does not exist
The jars don't seem to be loaded :( based on what I see at http://localhost:4040/environment/
The URLs supplied to --jars must be separated by commas. Your first command is correct.
You also have to add the application jar as the last parameter to spark-submit. Let's say my_home.myhome.RecommendMatch is part of the myhome-0.0.1-SNAPSHOT.jar file.
spark-submit --class my_home.myhome.RecommendMatch \
--jars "/Users/anon/Documents/Works/sparkworkspace/myhome/target/original-myhome-0.0.1-SNAPSHOT.jar" \
/Users/anon/Documents/Works/sparkworkspace/myhome/target/myhome-0.0.1-SNAPSHOT.jar

error: not found: value sqlContext on EMR

I am on EMR using Spark 2. When I ssh into the master node and run spark-shell, I can't seem to have access to sqlContext. Is there something I'm missing?
[hadoop#ip-172-31-13-180 ~]$ spark-shell
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel).
16/11/10 21:07:05 WARN Client: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.
16/11/10 21:07:14 WARN SparkContext: Use an existing SparkContext, some configuration may not take effect.
Spark context Web UI available at http://172.31.13.180:4040
Spark context available as 'sc' (master = yarn, app id = application_1478720853870_0003).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/ '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.0.1
      /_/
Using Scala version 2.11.8 (OpenJDK 64-Bit Server VM, Java 1.8.0_111)
Type in expressions to have them evaluated.
Type :help for more information.
scala> import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.SQLContext
scala> sqlContext
<console>:25: error: not found: value sqlContext
sqlContext
^
Since I'm getting the same error on my local computer, I've tried the following to no avail:
exported SPARK_LOCAL_IP
➜ play grep "SPARK_LOCAL_IP" ~/.zshrc
export SPARK_LOCAL_IP=127.0.0.1
➜ play source ~/.zshrc
➜ play spark-shell
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel).
16/11/10 16:12:18 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
16/11/10 16:12:19 WARN SparkContext: Use an existing SparkContext, some configuration may not take effect.
Spark context Web UI available at http://127.0.0.1:4040
Spark context available as 'sc' (master = local[*], app id = local-1478812339020).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/ '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.0.1
      /_/
Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0_79)
Type in expressions to have them evaluated.
Type :help for more information.
scala> sqlContext
<console>:24: error: not found: value sqlContext
sqlContext
^
scala>
My /etc/hosts contains the following
127.0.0.1 localhost
255.255.255.255 broadcasthost
::1 localhost
Spark 2.0 doesn't use SQLContext anymore:
use SparkSession (initialized in spark-shell as spark).
For legacy applications you can use:
val sqlContext = spark.sqlContext
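As a quick illustration in spark-shell (the query is only a placeholder), both of these work in Spark 2.x:
spark.sql("SELECT 1 AS one").show()
val sqlContext = spark.sqlContext
sqlContext.sql("SELECT 1 AS one").show()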
