Google PubSub in Apache Spark 2.2.1

I'm trying to use Google Cloud PubSub within a Spark application. For simplicity, let's just say that this application is the Spark shell. Trying to instantiate a Publisher throws a NoClassDefFoundError, which is most likely the result of dependency version conflicts. However, even with a setup this simple (just Spark and a single Google Cloud PubSub dependency), I can't figure out how to resolve it.
bash-4.4# spark-shell --packages com.google.cloud:google-cloud-pubsub:1.105.0
Ivy Default Cache set to: /root/.ivy2/cache
The jars for the packages stored in: /root/.ivy2/jars
:: loading settings :: url = jar:file:/opt/spark-2.2.1-bin-hadoop2.7/jars/ivy-2.4.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
com.google.cloud#google-cloud-pubsub added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent;1.0
confs: [default]
found com.google.cloud#google-cloud-pubsub;1.105.0 in central
found io.grpc#grpc-api;1.28.1 in central
...
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.2.1
      /_/
Using Scala version 2.11.8 (OpenJDK 64-Bit Server VM, Java 1.8.0_212)
Type in expressions to have them evaluated.
Type :help for more information.
scala> com.google.cloud.pubsub.v1.Publisher.newBuilder("topic").build
java.lang.NoClassDefFoundError: com/google/api/gax/grpc/InstantiatingGrpcChannelProvider
at com.google.cloud.pubsub.v1.stub.PublisherStubSettings.defaultGrpcTransportProviderBuilder(PublisherStubSettings.java:225)
at com.google.cloud.pubsub.v1.TopicAdminSettings.defaultGrpcTransportProviderBuilder(TopicAdminSettings.java:169)
at com.google.cloud.pubsub.v1.Publisher$Builder.<init>(Publisher.java:674)
at com.google.cloud.pubsub.v1.Publisher$Builder.<init>(Publisher.java:625)
at com.google.cloud.pubsub.v1.Publisher.newBuilder(Publisher.java:621)
... 48 elided
Caused by: java.lang.ClassNotFoundException: com.google.api.gax.grpc.InstantiatingGrpcChannelProvider
at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
... 53 more
Is there any way to get this to work? I could change the pubsub dependency version, but not the Spark version.

That is due to Google's Guava dependency conflict, which is known to occur whenever Spark is used together with Google client libraries. The workaround (with Maven) is to build a shaded jar and relocate the conflicting Google packages with the maven-shade-plugin, then submit that jar instead of pulling the library in via --packages.
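If the build uses sbt rather than Maven, a roughly equivalent relocation can be sketched with the sbt-assembly plugin (a sketch only, under that assumption; the shaded prefix below is arbitrary, and other packages such as com.google.protobuf may need relocating as well):

// build.sbt -- assumes the sbt-assembly plugin is enabled in project/plugins.sbt
libraryDependencies += "com.google.cloud" % "google-cloud-pubsub" % "1.105.0"

// Relocate Guava classes so the Pub/Sub client does not resolve against
// the older Guava that ships with Spark
assembly / assemblyShadeRules := Seq(
  ShadeRule.rename("com.google.common.**" -> "shaded.com.google.common.@1").inAll
)

The fat jar produced by sbt assembly would then be passed to spark-shell/spark-submit with --jars instead of using --packages.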

Related

Apache Spark installation error.

I am able to install Apache Spark with the following set of commands on Ubuntu 16:
dpkg -i scala-2.12.1.deb
mkdir /opt/spark
tar -xvf spark-2.0.2-bin-hadoop2.7.tgz
cp -rv spark-2.0.2-bin-hadoop2.7/* /opt/spark
cd /opt/spark
Executing the Spark shell worked well:
./bin/spark-shell --master local[2]
It returned this output on the shell:
jai#jaiPC:/opt/spark$ ./bin/spark-shell --master local[2]
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel).
18/05/15 19:00:55 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
18/05/15 19:00:55 WARN Utils: Your hostname, jaiPC resolves to a loopback address: 127.0.1.1; using 172.16.16.46 instead (on interface enp4s0)
18/05/15 19:00:55 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
18/05/15 19:00:55 WARN SparkContext: Use an existing SparkContext, some configuration may not take effect.
Spark context Web UI available at http://172.16.16.46:4040
Spark context available as 'sc' (master = local[2], app id = local-1526391055793).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.0.2
      /_/
Using Scala version 2.11.8 (OpenJDK 64-Bit Server VM, Java 1.8.0_171)
Type in expressions to have them evaluated.
Type :help for more information.
scala>
But when I tried to access the Web UI at http://172.16.16.46:4040, it shows:
The page cannot be displayed
How can I resolve this problem? Thanks and regards.

Spark: Executing the python kinesis streaming example

I'm (very) new to Spark, so apologies if this is a stupid question.
I am trying to execute the Spark (2.2.0) Python Kinesis streaming example, but I keep running into the issue below:
Traceback (most recent call last):
File "/Users/rmanoch/Downloads/spark-2.2.0-bin-hadoop2.7/kinesis_wordcount_asl.py", line 76, in <module>
ssc, appName, streamName, endpointUrl, regionName, InitialPositionInStream.LATEST, 2)
File "/Users/rmanoch/Downloads/spark-2.2.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/streaming/kinesis.py", line 92, in createStream
File "/Users/rmanoch/Downloads/spark-2.2.0-bin-hadoop2.7/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in __call__
File "/Users/rmanoch/Downloads/spark-2.2.0-bin-hadoop2.7/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py", line 323, in get_return_value
py4j.protocol.Py4JError: An error occurred while calling o27.createStream. Trace:
py4j.Py4JException: Method createStream([class org.apache.spark.streaming.api.java.JavaStreamingContext, class java.lang.String, class java.lang.String, class java.lang.String, class java.lang.String, class java.lang.Integer, class org.apache.spark.streaming.Duration, class org.apache.spark.storage.StorageLevel, null, null, null, null, null]) does not exist
at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:318)
at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:326)
at py4j.Gateway.invoke(Gateway.java:272)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:214)
at java.lang.Thread.run(Thread.java:745)
The tarball I downloaded from Spark's website did not include the external folder (apparently due to a licensing issue), so this is the command I have been trying to execute (after downloading kinesis_wordcount_asl.py from GitHub):
bin/spark-submit --packages org.apache.spark:spark-streaming-kinesis-asl_2.11:2.2.0 kinesis_wordcount_asl.py sparkEnrichedDev relay-enriched-dev https://kinesis.us-west-2.amazonaws.com us-west-2
Happy to provide any additional details if needed.
Based on the exception, it looks like there is a version mismatch between core Spark / Spark Streaming and spark-kinesis. The API changed between Spark 2.1 and 2.2 (SPARK-19405), and a version mismatch would cause exactly this kind of error.
This makes me think you're submitting with the wrong binaries (just a guess); it could be a PATH, PYTHONPATH, or SPARK_HOME issue if you use local mode. Because you get a signature mismatch rather than a missing class, we can assume that spark-kinesis is loaded correctly and that org.apache.spark.streaming.kinesis.KinesisUtilsPythonHelper is present on the classpath.
In case somebody winds up here like I did: this is due to version mismatches. I was having the same problem and managed to solve it by matching the Kinesis package coordinates to my installation. The two version numbers in the package should match the Scala version the libraries were compiled against and the Spark version. For example, I have the following:
$ spark-submit --version
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.4.5
      /_/
Using Scala version 2.11.12, OpenJDK 64-Bit Server VM, 1.8.0_222
Branch HEAD
Compiled by user centos on 2020-02-02T19:38:06Z
Revision cee4ecbb16917fa85f02c635925e2687400aa56b
Url https://gitbox.apache.org/repos/asf/spark.git
Type --help for more information.
This corresponds to Spark 2.4.5 compiled using Scala 2.11.12. Therefore, the corresponding package should be
spark-submit --packages org.apache.spark:spark-streaming-kinesis-asl_2.11:2.4.5 kinesis_...

spark-shell, dependency jars and class not found exception

I'm trying to run my Spark app in spark-shell. Here is what I tried (and many more variants after hours of reading on this error), but none of them seem to work:
spark-shell --class my_home.myhome.RecommendMatch —jars /Users/anon/Documents/Works/sparkworkspace/myhome/target/myhome-0.0.1-SNAPSHOT.jar,/Users/anon/Documents/Works/sparkworkspace/myhome/target/original-myhome-0.0.1-SNAPSHOT.jar
What I get instead is:
java.lang.ClassNotFoundException: my_home.myhome.RecommendMatch
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:348)
at org.apache.spark.util.Utils$.classForName(Utils.scala:229)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:695)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:187)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:212)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:126)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Any ideas please? Thanks!
UPDATE:
I found that the jars must be colon (:) separated and not comma (,) separated, as described in several articles/docs:
spark-shell --class my_home.myhome.RecommendMatch —jars /Users/anon/Documents/Works/sparkworkspace/myhome/target/myhome-0.0.1-SNAPSHOT.jar:/Users/anon/Documents/Works/sparkworkspace/myhome/target/original-myhome-0.0.1-SNAPSHOT.jar
However, now the errors have changed. Note that ls -la finds the paths, although the following lines complain that they don't exist. Bizarre.
Warning: Local jar /Users/anon/Documents/Works/sparkworkspace/myhome/target/myhome-0.0.1-SNAPSHOT.jar:/Users/anon/Documents/Works/sparkworkspace/myhome/target/original-myhome-0.0.1-SNAPSHOT.jar does not exist, skipping.
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
java.lang.SecurityException: Invalid signature file digest for Manifest main attributes
at sun.security.util.SignatureFileVerifier.processImpl(SignatureFileVerifier.java:314)
at sun.security.util.SignatureFileVerifier.process(SignatureFileVerifier.java:268)
UPDATE 2:
spark-shell —class my_home.myhome.RecommendMatch —-jars “/Users/anon/Documents/Works/sparkworkspace/myhome/target/myhome-0.0.1-SNAPSHOT.jar:/Users/anon/Documents/Works/sparkworkspace/myhome/target/original-myhome-0.0.1-SNAPSHOT.jar”
The above command yields the following on spark-shell.
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
17/05/16 01:19:08 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
17/05/16 01:19:13 WARN ObjectStore: Failed to get database global_temp, returning NoSuchObjectException
Spark context Web UI available at http://192.168.0.101:4040
Spark context available as 'sc' (master = local[*], app id = local-1494877749685).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.1.0
      /_/
Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_121)
Type in expressions to have them evaluated.
Type :help for more information.
scala> :load my_home.myhome.RecommendMatch
That file does not exist
scala> :load RecommendMatch
That file does not exist
scala> :load my_home.myhome.RecommendMatch.scala
That file does not exist
scala> :load RecommendMatch.scala
That file does not exist
The jars don't seem to be loaded :( based on what I see at http://localhost:4040/environment/
The URLs supplied to --jars must be separated by commas. Your first command is correct.
You also have to pass the application jar as the last parameter to spark-submit. Let's say my_home.myhome.RecommendMatch is part of the myhome-0.0.1-SNAPSHOT.jar file:
spark-submit --class my_home.myhome.RecommendMatch \
--jars "/Users/anon/Documents/Works/sparkworkspace/myhome/target/original-myhome-0.0.1-SNAPSHOT.jar" \
/Users/anon/Documents/Works/sparkworkspace/myhome/target/myhome-0.0.1-SNAPSHOT.jar

NoSuchMethodError using Databricks Spark-Avro 3.2.0

I have a Spark master and worker running in Docker containers with Spark 2.0.2 and Hadoop 2.7. I'm trying to submit a job from pyspark in a different container (same network) by running:
df = spark.read.json("/data/test.json")
df.write.format("com.databricks.spark.avro").save("/data/test.avro")
But I'm getting this error:
java.lang.NoSuchMethodError: org.apache.avro.generic.GenericData.createDatumWriter(Lorg/apache/avro/Schema;)Lorg/apache/avro/io/DatumWriter;
It makes no difference if I try interactively or with spark-submit. These are my loaded packages in spark:
com.databricks#spark-avro_2.11;3.2.0 from central in [default]
com.thoughtworks.paranamer#paranamer;2.7 from central in [default]
org.apache.avro#avro;1.8.1 from central in [default]
org.apache.commons#commons-compress;1.8.1 from central in [default]
org.codehaus.jackson#jackson-core-asl;1.9.13 from central in [default]
org.codehaus.jackson#jackson-mapper-asl;1.9.13 from central in [default]
org.slf4j#slf4j-api;1.7.7 from central in [default]
org.tukaani#xz;1.5 from central in [default]
org.xerial.snappy#snappy-java;1.1.1.3 from central in [default]
spark-submit --version output:
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.0.2
      /_/
Branch
Compiled by user jenkins on 2016-11-08T01:39:48Z
Revision
Url
Type --help for more information.
The Scala version is 2.11.8.
My pyspark command:
PYSPARK_PYTHON=ipython /usr/spark-2.0.2/bin/pyspark --master spark://master:7077 --packages com.databricks:spark-avro_2.11:3.2.0,org.apache.avro:avro:1.8.1
My spark-submit command:
spark-submit script.py --master spark://master:7077 --packages com.databricks:spark-avro_2.11:3.2.0,org.apache.avro:avro:1.8.1
I've read here that this can be caused by "an older version of avro being used", so I tried using 1.8.1, but I keep getting the same error. Reading Avro works fine. Any help?
The cause of this error is that Apache Avro 1.7.4 is included in Hadoop by default, and if the SPARK_DIST_CLASSPATH environment variable lists the Hadoop common libraries ($HADOOP_HOME/share/hadoop/common/lib/) before the ivy2 jars, the wrong version can get used instead of the version required by spark-avro (>= 1.7.6) and installed in ivy2.
To check if this is the case, open a spark-shell and run
sc.getClass().getResource("/org/apache/avro/generic/GenericData.class")
This should tell you the location of the class like so:
java.net.URL = jar:file:/lib/ivy/jars/org.apache.avro_avro-1.7.6.jar!/org/apache/avro/generic/GenericData.class
If that class is pointing to $HADOOP_HOME/share/hadoop/common/lib/, then you simply have to list your ivy2 jars before the Hadoop common libraries in the SPARK_DIST_CLASSPATH environment variable.
For example, in a Dockerfile
ENV SPARK_DIST_CLASSPATH="/home/root/.ivy2/*:$HADOOP_HOME/etc/hadoop/*:$HADOOP_HOME/share/hadoop/common/lib/*:$HADOOP_HOME/share/hadoop/common/*:$HADOOP_HOME/share/hadoop/hdfs/*:$HADOOP_HOME/share/hadoop/hdfs/lib/*:$HADOOP_HOME/share/hadoop/hdfs/*:$HADOOP_HOME/share/hadoop/yarn/lib/*:$HADOOP_HOME/share/hadoop/yarn/*:$HADOOP_HOME/share/hadoop/mapreduce/lib/*:$HADOOP_HOME/share/hadoop/mapreduce/*:$HADOOP_HOME/share/hadoop/tools/lib/*"
Note: /home/root/.ivy2 is the default location for ivy2 jars; you can change it by setting spark.jars.ivy in your spark-defaults.conf, which is probably a good idea.
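For example, a hypothetical spark-defaults.conf entry (the path below is only an illustration, not taken from the original answer):

spark.jars.ivy    /opt/spark/ivy2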
I have encountered a similar problem before. Try using the --jars {path to spark-avro_2.11-3.2.0.jar} option in spark-submit.

error: not found: value sqlContext on EMR

I am on EMR using Spark 2. When I SSH into the master node and run spark-shell, I don't seem to have access to sqlContext. Is there something I'm missing?
[hadoop#ip-172-31-13-180 ~]$ spark-shell
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel).
16/11/10 21:07:05 WARN Client: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.
16/11/10 21:07:14 WARN SparkContext: Use an existing SparkContext, some configuration may not take effect.
Spark context Web UI available at http://172.31.13.180:4040
Spark context available as 'sc' (master = yarn, app id = application_1478720853870_0003).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.0.1
      /_/
Using Scala version 2.11.8 (OpenJDK 64-Bit Server VM, Java 1.8.0_111)
Type in expressions to have them evaluated.
Type :help for more information.
scala> import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.SQLContext
scala> sqlContext
<console>:25: error: not found: value sqlContext
sqlContext
^
Since I'm getting the same error on my local computer, I've tried the following to no avail: I exported SPARK_LOCAL_IP:
➜ play grep "SPARK_LOCAL_IP" ~/.zshrc
export SPARK_LOCAL_IP=127.0.0.1
➜ play source ~/.zshrc
➜ play spark-shell
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel).
16/11/10 16:12:18 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
16/11/10 16:12:19 WARN SparkContext: Use an existing SparkContext, some configuration may not take effect.
Spark context Web UI available at http://127.0.0.1:4040
Spark context available as 'sc' (master = local[*], app id = local-1478812339020).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.0.1
      /_/
Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0_79)
Type in expressions to have them evaluated.
Type :help for more information.
scala> sqlContext
<console>:24: error: not found: value sqlContext
sqlContext
^
scala>
My /etc/hosts contains the following
127.0.0.1 localhost
255.255.255.255 broadcasthost
::1 localhost
Spark 2.0 doesn't use SQLContext anymore: use SparkSession (initialized in spark-shell as spark). For legacy applications you can use:
val sqlContext = spark.sqlContext
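For illustration, a minimal spark-shell session on Spark 2.x (a sketch; the query itself is arbitrary):

scala> spark.version                          // SparkSession is pre-created as 'spark'
scala> spark.sql("SELECT 1 AS one").show()    // use it where you used sqlContext.sql before
scala> val sqlContext = spark.sqlContext      // wrapped SQLContext for legacy code
scala> sqlContext.sql("SELECT 1 AS one").show()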

Resources