Spark: Executing the python kinesis streaming example - apache-spark

I'm (very) new to Spark, so apologies if this is a stupid question.
I am trying to execute the Spark (2.2.0) Python Kinesis streaming example, but I keep running into the issue below:
Traceback (most recent call last):
File "/Users/rmanoch/Downloads/spark-2.2.0-bin-hadoop2.7/kinesis_wordcount_asl.py", line 76, in <module>
ssc, appName, streamName, endpointUrl, regionName, InitialPositionInStream.LATEST, 2)
File "/Users/rmanoch/Downloads/spark-2.2.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/streaming/kinesis.py", line 92, in createStream
File "/Users/rmanoch/Downloads/spark-2.2.0-bin-hadoop2.7/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in __call__
File "/Users/rmanoch/Downloads/spark-2.2.0-bin-hadoop2.7/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py", line 323, in get_return_value
py4j.protocol.Py4JError: An error occurred while calling o27.createStream. Trace:
py4j.Py4JException: Method createStream([class org.apache.spark.streaming.api.java.JavaStreamingContext, class java.lang.String, class java.lang.String, class java.lang.String, class java.lang.String, class java.lang.Integer, class org.apache.spark.streaming.Duration, class org.apache.spark.storage.StorageLevel, null, null, null, null, null]) does not exist
at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:318)
at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:326)
at py4j.Gateway.invoke(Gateway.java:272)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:214)
at java.lang.Thread.run(Thread.java:745)
The tarball I downloaded from Spark's website does not include the external folder (apparently due to a license issue), so this is the command I have been trying to execute (after downloading kinesis_wordcount_asl.py from GitHub):
bin/spark-submit --packages org.apache.spark:spark-streaming-kinesis-asl_2.11:2.2.0 kinesis_wordcount_asl.py sparkEnrichedDev relay-enriched-dev https://kinesis.us-west-2.amazonaws.com us-west-2
Happy to provide any additional details if needed.

Based on the exception, it looks like there is a version mismatch between core Spark / Spark Streaming and spark-kinesis. The API changed between Spark 2.1 and 2.2 (SPARK-19405), and a version mismatch would cause exactly this kind of error.
This makes me think you're submitting with the wrong binaries (just a guess); if you run in local mode it can be a PATH, PYTHONPATH or SPARK_HOME issue. Because you get a signature mismatch rather than a missing class, we can assume that spark-kinesis is loaded correctly and that org.apache.spark.streaming.kinesis.KinesisUtilsPythonHelper is present on the CLASSPATH.
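One quick way to check which binaries are actually being picked up (a minimal sketch, not from the original answer; it assumes PySpark is importable in the environment you submit from):
import os
import pyspark

print(pyspark.__version__)            # should report 2.2.0 here
print(pyspark.__file__)               # which PySpark installation PYTHONPATH resolves to
print(os.environ.get("SPARK_HOME"))   # which Spark distribution spark-submit will use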

In case somebody winds up here like I did: this is due to version mismatches. I was having the same problem and managed to solve it by matching the Kinesis package coordinate to my installation. Both numbers should match the Scala version the libraries were compiled against and the Spark version. For example, I have the following:
$ spark-submit --version
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/___/ .__/\_,_/_/ /_/\_\ version 2.4.5
/_/
Using Scala version 2.11.12, OpenJDK 64-Bit Server VM, 1.8.0_222
Branch HEAD
Compiled by user centos on 2020-02-02T19:38:06Z
Revision cee4ecbb16917fa85f02c635925e2687400aa56b
Url https://gitbox.apache.org/repos/asf/spark.git
Type --help for more information.
This corresponds to Spark 2.4.5 compiled using Scala 2.11.12. Therefore, the corresponding package should be
spark-submit --packages org.apache.spark:spark-streaming-kinesis-asl_2.11:2.4.5 kinesis_...
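In other words, both the Scala suffix and the version number in the coordinate have to track the running Spark. A small sketch of deriving the coordinate (an illustration, not part of the original answer; the Scala binary version is read off the spark-submit --version banner above):
import pyspark

scala_binary = "2.11"                 # from the "Using Scala version 2.11.12" line above
spark_version = pyspark.__version__   # e.g. "2.4.5"
print("org.apache.spark:spark-streaming-kinesis-asl_%s:%s" % (scala_binary, spark_version))
# org.apache.spark:spark-streaming-kinesis-asl_2.11:2.4.5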

Related

Hive queries failing with "Unable to fetch table test_table. Invalid method name: 'get_table_req'" with pyspark 3.0.0 & Hive 1.1.0

I'm digging into a PoC for Spark in a fairly new environment and checking Spark's capabilities, but I'm having issues running SQL queries from the pyspark terminal, whereas Hive itself is working since we can query the metadata.
Any idea what's happening here and how to resolve it?
$ pyspark --driver-class-path /etc/spark2/conf:/etc/hive/conf
>>> from pyspark.sql import SparkSession
>>> from pyspark.sql import Row
>>> spark = SparkSession \
... .builder \
... .appName("sample_query_test") \
... .enableHiveSupport() \
... .getOrCreate()
>>> spark.sql("show tables in user_tables").show(5)
20/08/18 19:57:01 WARN conf.HiveConf: HiveConf of name hive.enforce.sorting does not exist
20/08/18 19:57:01 WARN conf.HiveConf: HiveConf of name hive.enforce.bucketing does not exist
20/08/18 19:57:01 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
+-----------+--------------------+-----------+
| database| tableName|isTemporary|
+-----------+--------------------+-----------+
|user_tables| a_2019| false|
|user_tables|abcdefgjeufjdsahh...| false|
|user_tables|testtesttesttestt...| false|
|user_tables|newnewnewnewnenwn...| false|
|user_tables|blahblahblablahbl...| false|
+-----------+--------------------+-----------+
only showing top 5 rows
>>> spark.sql("select count(*) from user_tables.test_table where date_partition='2020-08-17'").show(5)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/opt/conda/lib/python3.7/site-packages/pyspark/sql/session.py", line 646, in sql
return DataFrame(self._jsparkSession.sql(sqlQuery), self._wrapped)
File "/opt/conda/lib/python3.7/site-packages/pyspark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1305, in __call__
File "/opt/conda/lib/python3.7/site-packages/pyspark/sql/utils.py", line 137, in deco
raise_from(converted)
File "<string>", line 3, in raise_from
pyspark.sql.utils.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: Unable to fetch table test_spark_cedatatransfer. Invalid method name: 'get_table_req';
Information on the cluster :
$ hive --version
Hive 1.1.0-cdh5.13.0
Subversion file:///data/jenkins/workspace/generic-package-ubuntu64-16-04/CDH5.13.0-Packaging-Hive-2017-10-04_10-50-44/hive-1.1.0+cdh5.13.0+1269-1.cdh5.13.0.p0.34~xenial -r Unknown
Compiled by jenkins on Wed Oct 4 11:46:53 PDT 2017
$ pyspark --version
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/___/ .__/\_,_/_/ /_/\_\ version 3.0.0
/_/
Using Scala version 2.12.10, OpenJDK 64-Bit Server VM, 1.8.0_252
Branch HEAD
Compiled by user ubuntu on 2020-06-06T11:32:25Z
Revision 3fdfce3120f307147244e5eaf46d61419a723d50
Url https://gitbox.apache.org/repos/asf/spark.git
$ hadoop version
Hadoop 2.6.0-cdh5.13.0
Subversion http://github.com/cloudera/hadoop -r 42e8860b182e55321bd5f5605264da4adc8882be
Compiled by jenkins on 2017-10-04T18:50Z
Compiled with protoc 2.5.0
From source with checksum 5e84c185f8a22158e2b0e4b8f85311
This command was run using /usr/lib/hadoop/hadoop-common-2.6.0-cdh5.13.0.jar
As is evident, I added the Hive conf to make sure the same metastore is being used, and while simple queries work, the count query above (and insert overwrite) is failing!
I have run into the same issue, trying to use Spark 3.0.1 with HDP 2.6.
The problem was resolved by removing all hive*.jar files from Spark's jars folder and copying the hive*.jar files from the Spark 2 shipped with the HDP distribution.
Setting the following properties in spark-defaults.conf worked for me:
spark.sql.hive.metastore.jars=maven
spark.sql.hive.metastore.version=1.2.1
spark.sql.catalogImplementation=hive
spark.sql.warehouse.dir=/user/hive/warehouse
Of course the last line may be customized to the appropriate location; this path is for Cloudera (CDH).
These settings instruct Spark to download the Hive 1.2.1 jars and cache them in ~/.ivy2/jars. You may also choose to provide your own jars from a path of your choosing, but maven is the most convenient option.
See the Spark configuration properties for more details!
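If editing spark-defaults.conf is not an option, the same properties can be passed when the session is built, e.g. from the pyspark shell. A hedged sketch along the lines of the snippet in the question (the metastore version and warehouse path are examples, and they must be set before the first SparkSession is created):
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("sample_query_test")
         .config("spark.sql.hive.metastore.version", "1.2.1")
         .config("spark.sql.hive.metastore.jars", "maven")
         .config("spark.sql.warehouse.dir", "/user/hive/warehouse")
         .enableHiveSupport()  # equivalent to spark.sql.catalogImplementation=hive
         .getOrCreate())

spark.sql("show tables in user_tables").show(5)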

Google PubSub in Apache Spark 2.2.1

I'm trying to use Google Cloud PubSub within a Spark application. For simplicity let's just say that this application is Spark's shell. Trying to instantiate a Publisher throws a NoClassDefFoundError, which is most likely the result of dependency version conflicts. However, with a simple setup like this (just Spark and a Google Cloud PubSub dependency), I can't figure out how to resolve this issue.
bash-4.4# spark-shell --packages com.google.cloud:google-cloud-pubsub:1.105.0
Ivy Default Cache set to: /root/.ivy2/cache
The jars for the packages stored in: /root/.ivy2/jars
:: loading settings :: url = jar:file:/opt/spark-2.2.1-bin-hadoop2.7/jars/ivy-2.4.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
com.google.cloud#google-cloud-pubsub added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent;1.0
confs: [default]
found com.google.cloud#google-cloud-pubsub;1.105.0 in central
found io.grpc#grpc-api;1.28.1 in central
...
Spark session available as 'spark'.
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/___/ .__/\_,_/_/ /_/\_\ version 2.2.1
/_/
Using Scala version 2.11.8 (OpenJDK 64-Bit Server VM, Java 1.8.0_212)
Type in expressions to have them evaluated.
Type :help for more information.
scala> com.google.cloud.pubsub.v1.Publisher.newBuilder("topic").build
java.lang.NoClassDefFoundError: com/google/api/gax/grpc/InstantiatingGrpcChannelProvider
at com.google.cloud.pubsub.v1.stub.PublisherStubSettings.defaultGrpcTransportProviderBuilder(PublisherStubSettings.java:225)
at com.google.cloud.pubsub.v1.TopicAdminSettings.defaultGrpcTransportProviderBuilder(TopicAdminSettings.java:169)
at com.google.cloud.pubsub.v1.Publisher$Builder.<init>(Publisher.java:674)
at com.google.cloud.pubsub.v1.Publisher$Builder.<init>(Publisher.java:625)
at com.google.cloud.pubsub.v1.Publisher.newBuilder(Publisher.java:621)
... 48 elided
Caused by: java.lang.ClassNotFoundException: com.google.api.gax.grpc.InstantiatingGrpcChannelProvider
at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
... 53 more
Is there any way to get this to work? I could change the pubsub dependency version, but not the Spark version.
That is due to a Guava dependency version conflict, which is known to occur whenever Spark and Google libraries are used together. The workaround (when building with Maven) is to shade and relocate the conflicting dependencies with the maven-shade-plugin.

how to write spark dataframe into avro file format in jupyter notebook?

I have configured an Amazon EMR cluster with 1 master node and 2 core nodes. The following software is installed on EMR:
Hive 2.3.4, Pig 0.17.0, Hue 4.3.0, Ganglia 3.7.2, Spark 2.4.0, TensorFlow 1.12.0.
I have not configured any bootstrap action. Now that the cluster is up and waiting for a step, I have started a notebook from EMR; below are the details of the code.
sdf = spark.read.csv('hdfs://i....:8020/user/root/temp.csv')
This executes perfectly, and I am able to see my dataframe through sdf.show().
However, when I try to write to an Avro file, it fails:
sdf.write.format("avro").save("avro_file.avro")
ERR:
u'Failed to find data source: avro. Avro is built-in but external data source module since Spark 2.4. Please deploy the application as per the deployment section of "Apache Avro Data Source Guide".;'
Traceback (most recent call last):
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/readwriter.py", line 736, in save
self._jwrite.save(path)
File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
answer, self.gateway_client, self.target_id, self.name)
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 69, in deco
raise AnalysisException(s.split(': ', 1)[1], stackTrace)
AnalysisException: u'Failed to find data source: avro. Avro is built-in but external data source module since Spark 2.4. Please deploy the application as per the deployment section of "Apache Avro Data Source Guide".;'
I tried:
sdf.write.format("org.apache.spark.sql.avro").save("avro_file.avro")
which gave the same error:
u'Failed to find data source: org.apache.spark.sql.avro. Avro is built-in but external data source module since Spark 2.4. Please deploy the application as per the deployment section of "Apache Avro Data Source Guide".;'
Traceback (most recent call last):
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/readwriter.py", line 736, in save
self._jwrite.save(path)
File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
answer, self.gateway_client, self.target_id, self.name)
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 69, in deco
raise AnalysisException(s.split(': ', 1)[1], stackTrace)
AnalysisException: u'Failed to find data source: org.apache.spark.sql.avro. Avro is built-in but external data source module since Spark 2.4. Please deploy the application as per the deployment section of "Apache Avro Data Source Guide".;'
I also tried through an interactive PySpark session:
[ec2-user#ip-xxxx conf]$ sudo pyspark --packages org.apache.spark:spark-avro_2.12:2.4.2
Python 2.7.16 (default, Mar 18 2019, 18:38:44)
[GCC 4.8.5 20150623 (Red Hat 4.8.5-28)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
Ivy Default Cache set to: /root/.ivy2/cache
The jars for the packages stored in: /root/.ivy2/jars
:: loading settings :: url = jar:file:/usr/lib/spark/jars/ivy-2.4.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
org.apache.spark#spark-avro_2.12 added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-e8c82e1e-629a-4d83-844d-a86057fc5ae7;1.0
confs: [default]
found org.apache.spark#spark-avro_2.12;2.4.2 in central
found org.spark-project.spark#unused;1.0.0 in central
:: resolution report :: resolve 209ms :: artifacts dl 6ms
:: modules in use:
org.apache.spark#spark-avro_2.12;2.4.2 from central in [default]
org.spark-project.spark#unused;1.0.0 from central in [default]
---------------------------------------------------------------------
| | modules || artifacts |
| conf | number| search|dwnlded|evicted|| number|dwnlded|
---------------------------------------------------------------------
| default | 2 | 0 | 0 | 0 || 2 | 0 |
---------------------------------------------------------------------
:: retrieving :: org.apache.spark#spark-submit-parent-e8c82e1e-629a-4d83-844d-a86057fc5ae7
confs: [default]
0 artifacts copied, 2 already retrieved (0kB/6ms)
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
19/05/02 07:23:00 WARN Client: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.
19/05/02 07:23:03 WARN Client: Same path resource file:///root/.ivy2/jars/org.apache.spark_spark-avro_2.12-2.4.2.jar added multiple times to distributed cache.
19/05/02 07:23:03 WARN Client: Same path resource file:///root/.ivy2/jars/org.spark-project.spark_unused-1.0.0.jar added multiple times to distributed cache.
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/__ / .__/\_,_/_/ /_/\_\ version 2.4.0
/_/
Using Python version 2.7.16 (default, Mar 18 2019 18:38:44)
SparkSession available as 'spark'.
>>> df = spark.createDataFrame(
... [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)],
... ("id", "v"))
>>> df.write.format("avro").save("avro_file.avro")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/spark/python/pyspark/sql/readwriter.py", line 736, in save
self._jwrite.save(path)
File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
File "/usr/lib/spark/python/pyspark/sql/utils.py", line 63, in deco
return f(*a, **kw)
File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o83.save.
: java.util.ServiceConfigurationError: org.apache.spark.sql.sources.DataSourceRegister: Provider org.apache.spark.sql.avro.AvroFileFormat could not be instantiated
at java.util.ServiceLoader.fail(ServiceLoader.java:232)
at java.util.ServiceLoader.access$100(ServiceLoader.java:185)
at java.util.ServiceLoader$LazyIterator.nextService(ServiceLoader.java:384)
at java.util.ServiceLoader$LazyIterator.next(ServiceLoader.java:404)
at java.util.ServiceLoader$1.next(ServiceLoader.java:480)
at scala.collection.convert.Wrappers$JIteratorWrapper.next(Wrappers.scala:43)
at scala.collection.Iterator$class.foreach(Iterator.scala:891)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1334)
at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
at scala.collection.TraversableLike$class.filterImpl(TraversableLike.scala:247)
at scala.collection.TraversableLike$class.filter(TraversableLike.scala:259)
at scala.collection.AbstractTraversable.filter(Traversable.scala:104)
at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:630)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:244)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:228)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.NoSuchMethodError: org.apache.spark.sql.execution.datasources.FileFormat.$init$(Lorg/apache/spark/sql/execution/datasources/FileFormat;)V
at org.apache.spark.sql.avro.AvroFileFormat.<init>(AvroFileFormat.scala:44)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at java.lang.Class.newInstance(Class.java:442)
at java.util.ServiceLoader$LazyIterator.nextService(ServiceLoader.java:380)
... 24 more
>>>
I have also tried updating /etc/spark/conf/spark-defaults.conf to include
spark.jars.packages org.apache.spark:spark-avro_2.12:2.4.2, com.databricks:spark-csv_2.11:1.5.0
However, after this configuration the Jupyter notebook could not start Spark and gave the error below:
The code failed because of a fatal error:
Session 4 did not start up in 60 seconds..
Some things to try:
a) Make sure Spark has enough available resources for Jupyter to create a Spark context.
b) Contact your Jupyter administrator to make sure the Spark magics library is configured correctly.
c) Restart the kernel.
On Spark 2.4.3:
Switching the spark-avro package to org.apache.spark:spark-avro_2.11:2.4.3 fixed this issue for me (the _2.12 artifact does not match a Spark build compiled against Scala 2.11, which is what produces the NoSuchMethodError above).
Also, in your Jupyter notebook, add the following lines before initiating the Spark context:
import os
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages org.apache.spark:spark-avro_2.11:2.4.3 pyspark-shell'
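Putting that together, a minimal notebook sketch (illustrative only; the package version must match your Spark version and its Scala build, and the environment variable has to be set before the SparkSession is created):
import os
os.environ['PYSPARK_SUBMIT_ARGS'] = (
    '--packages org.apache.spark:spark-avro_2.11:2.4.3 pyspark-shell'
)

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("avro_write_test").getOrCreate()
df = spark.createDataFrame([(1, 1.0), (1, 2.0), (2, 3.0)], ("id", "v"))
df.write.format("avro").save("avro_file.avro")  # the "avro" short name resolves once the package is on the classpath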

spark-shell, dependency jars and class not found exception

I'm trying to run my Spark app from the spark-shell. Here is what I tried (and many more variants, after hours of reading on this error), but none seem to work.
spark-shell --class my_home.myhome.RecommendMatch —jars /Users/anon/Documents/Works/sparkworkspace/myhome/target/myhome-0.0.1-SNAPSHOT.jar,/Users/anon/Documents/Works/sparkworkspace/myhome/target/original-myhome-0.0.1-SNAPSHOT.jar
What I get instead is:
java.lang.ClassNotFoundException: my_home.myhome.RecommendMatch
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:348)
at org.apache.spark.util.Utils$.classForName(Utils.scala:229)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:695)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:187)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:212)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:126)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Any ideas please? Thanks!
UPDATE:
I found in several articles/docs that the jars must be colon (:) separated rather than comma (,) separated, so I tried:
spark-shell --class my_home.myhome.RecommendMatch —jars /Users/anon/Documents/Works/sparkworkspace/myhome/target/myhome-0.0.1-SNAPSHOT.jar:/Users/anon/Documents/Works/sparkworkspace/myhome/target/original-myhome-0.0.1-SNAPSHOT.jar
However, now the errors have changed. Note that ls -la finds the paths, although the following lines complain that they don't exist. Bizarre...
Warning: Local jar /Users/anon/Documents/Works/sparkworkspace/myhome/target/myhome-0.0.1-SNAPSHOT.jar:/Users/anon/Documents/Works/sparkworkspace/myhome/target/original-myhome-0.0.1-SNAPSHOT.jar does not exist, skipping.
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
java.lang.SecurityException: Invalid signature file digest for Manifest main attributes
at sun.security.util.SignatureFileVerifier.processImpl(SignatureFileVerifier.java:314)
at sun.security.util.SignatureFileVerifier.process(SignatureFileVerifier.java:268)
UPDATE 2:
spark-shell —class my_home.myhome.RecommendMatch —-jars “/Users/anon/Documents/Works/sparkworkspace/myhome/target/myhome-0.0.1-SNAPSHOT.jar:/Users/anon/Documents/Works/sparkworkspace/myhome/target/original-myhome-0.0.1-SNAPSHOT.jar”
The above command yields the following on spark-shell.
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
17/05/16 01:19:08 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
17/05/16 01:19:13 WARN ObjectStore: Failed to get database global_temp, returning NoSuchObjectException
Spark context Web UI available at http://192.168.0.101:4040
Spark context available as 'sc' (master = local[*], app id = local-1494877749685).
Spark session available as 'spark'.
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/___/ .__/\_,_/_/ /_/\_\ version 2.1.0
/_/
Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_121)
Type in expressions to have them evaluated.
Type :help for more information.
scala> :load my_home.myhome.RecommendMatch
That file does not exist
scala> :load RecommendMatch
That file does not exist
scala> :load my_home.myhome.RecommendMatch.scala
That file does not exist
scala> :load RecommendMatch.scala
That file does not exist
The jars don't seem to be loaded :( based on what I see at http://localhost:4040/environment/
The URLs supplied to --jars must be separated by commas. Your first command is correct.
You also have to pass the application jar as the last parameter to spark-submit. Let's say my_home.myhome.RecommendMatch is part of the myhome-0.0.1-SNAPSHOT.jar file.
spark-submit --class my_home.myhome.RecommendMatch \
--jars "/Users/anon/Documents/Works/sparkworkspace/myhome/target/original-myhome-0.0.1-SNAPSHOT.jar" \
/Users/anon/Documents/Works/sparkworkspace/myhome/target/myhome-0.0.1-SNAPSHOT.jar

NoSuchMethodError using Databricks Spark-Avro 3.2.0

I have a spark master & worker running in Docker containers with spark 2.0.2 and hadoop 2.7. I'm trying to submit a job from pyspark from a different container (same network) by running
df = spark.read.json("/data/test.json")
df.write.format("com.databricks.spark.avro").save("/data/test.avro")
But I'm getting this error:
java.lang.NoSuchMethodError: org.apache.avro.generic.GenericData.createDatumWriter(Lorg/apache/avro/Schema;)Lorg/apache/avro/io/DatumWriter;
It makes no difference if I try interactively or with spark-submit. These are my loaded packages in spark:
com.databricks#spark-avro_2.11;3.2.0 from central in [default]
com.thoughtworks.paranamer#paranamer;2.7 from central in [default]
org.apache.avro#avro;1.8.1 from central in [default]
org.apache.commons#commons-compress;1.8.1 from central in [default]
org.codehaus.jackson#jackson-core-asl;1.9.13 from central in [default]
org.codehaus.jackson#jackson-mapper-asl;1.9.13 from central in [default]
org.slf4j#slf4j-api;1.7.7 from central in [default]
org.tukaani#xz;1.5 from central in [default]
org.xerial.snappy#snappy-java;1.1.1.3 from central in [default]
spark-submit --version output:
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/___/ .__/\_,_/_/ /_/\_\ version 2.0.2
/_/
Branch
Compiled by user jenkins on 2016-11-08T01:39:48Z
Revision
Url
Type --help for more information.
scala version is 2.11.8
My pyspark command:
PYSPARK_PYTHON=ipython /usr/spark-2.0.2/bin/pyspark --master spark://master:7077 --packages com.databricks:spark-avro_2.11:3.2.0,org.apache.avro:avro:1.8.1
My spark-submit command:
spark-submit script.py --master spark://master:7077 --packages com.databricks:spark-avro_2.11:3.2.0,org.apache.avro:avro:1.8.1
I've read that this can be caused by "an older version of avro being used", so I tried using 1.8.1, but I keep getting the same error. Reading Avro works fine. Any help?
The cause of this error is that Apache Avro 1.7.4 is included in Hadoop by default, and if the SPARK_DIST_CLASSPATH environment variable lists the Hadoop common libraries ($HADOOP_HOME/share/hadoop/common/lib/) before the ivy2 jars, the wrong version gets used instead of the version required by spark-avro (>= 1.7.6) and installed in ivy2.
To check if this is the case, open a spark-shell and run
sc.getClass().getResource("/org/apache/avro/generic/GenericData.class")
This should tell you the location of the class like so:
java.net.URL = jar:file:/lib/ivy/jars/org.apache.avro_avro-1.7.6.jar!/org/apache/avro/generic/GenericData.class
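Since the job here is submitted from pyspark, a rough equivalent of that check from a PySpark session (a sketch that goes through the internal py4j gateway, so the _jvm attribute may change between versions):
jvm = spark.sparkContext._jvm  # internal py4j gateway to the driver JVM
avro_cls = jvm.java.lang.Class.forName("org.apache.avro.generic.GenericData")
print(avro_cls.getResource("GenericData.class"))  # shows which jar the class was loaded from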
If that class is pointing to $HADOOP_HOME/share/hadoop/common/lib/, then you simply need to list your ivy2 jars before the Hadoop common libraries in the SPARK_DIST_CLASSPATH environment variable.
For example, in a Dockerfile
ENV SPARK_DIST_CLASSPATH="/home/root/.ivy2/*:$HADOOP_HOME/etc/hadoop/*:$HADOOP_HOME/share/hadoop/common/lib/*:$HADOOP_HOME/share/hadoop/common/*:$HADOOP_HOME/share/hadoop/hdfs/*:$HADOOP_HOME/share/hadoop/hdfs/lib/*:$HADOOP_HOME/share/hadoop/hdfs/*:$HADOOP_HOME/share/hadoop/yarn/lib/*:$HADOOP_HOME/share/hadoop/yarn/*:$HADOOP_HOME/share/hadoop/mapreduce/lib/*:$HADOOP_HOME/share/hadoop/mapreduce/*:$HADOOP_HOME/share/hadoop/tools/lib/*"
Note: /home/root/.ivy2 is the default location for the ivy2 jars; you can change it by setting spark.jars.ivy in your spark-defaults.conf, which is probably a good idea.
I have encountered a similar problem before.
Try using the --jars {path to spark-avro_2.11-3.2.0.jar} option with spark-submit.
