Zeppelin + Spark: Reading Parquet from S3 throws NoSuchMethodError: com.fasterxml.jackson

Zeppelin + Spark: Reading Parquet from S3 throws NoSuchMethodError: com.fasterxml.jackson - apache-spark

Using Zeppelin 0.7.2 binaries from the main download, and Spark 2.1.0 w/ Hadoop 2.6, the following paragraph:
val df = spark.read.parquet(DATA_URL).filter(FILTER_STRING).na.fill("")
Produces the following:
java.lang.NoSuchMethodError: com.fasterxml.jackson.module.scala.deser.BigDecimalDeserializer$.handledType()Ljava/lang/Class;
at com.fasterxml.jackson.module.scala.deser.NumberDeserializers$.<init>(ScalaNumberDeserializersModule.scala:49)
at com.fasterxml.jackson.module.scala.deser.NumberDeserializers$.<clinit>(ScalaNumberDeserializersModule.scala)
at com.fasterxml.jackson.module.scala.deser.ScalaNumberDeserializersModule$class.$init$(ScalaNumberDeserializersModule.scala:61)
at com.fasterxml.jackson.module.scala.DefaultScalaModule.<init>(DefaultScalaModule.scala:20)
at com.fasterxml.jackson.module.scala.DefaultScalaModule$.<init>(DefaultScalaModule.scala:37)
at com.fasterxml.jackson.module.scala.DefaultScalaModule$.<clinit>(DefaultScalaModule.scala)
at org.apache.spark.rdd.RDDOperationScope$.<init>(RDDOperationScope.scala:82)
at org.apache.spark.rdd.RDDOperationScope$.<clinit>(RDDOperationScope.scala)
at org.apache.spark.SparkContext.withScope(SparkContext.scala:701)
at org.apache.spark.SparkContext.parallelize(SparkContext.scala:715)
at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$.mergeSchemasInParallel(ParquetFileFormat.scala:594)
at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat.inferSchema(ParquetFileFormat.scala:235)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$7.apply(DataSource.scala:184)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$7.apply(DataSource.scala:184)
at scala.Option.orElse(Option.scala:289)
at org.apache.spark.sql.execution.datasources.DataSource.org$apache$spark$sql$execution$datasources$DataSource$$getOrInferFileFormatSchema(DataSource.scala:183)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:387)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:152)
at org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:441)
at org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:425)
... 47 elided
This error does not happen in the normal spark-shell, only in Zeppelin. I have attempted the following fixes, which do nothing:
Download jackson 2.6.2 jars to the zeppelin lib folder and restart
Add jackson 2.9 dependencies from the maven repositories to the interpreter settings
Deleting the jackson jars from the zeppelin lib folder
Googling is turning up no similar situations. Please don't hesitate to ask for more information, or make suggestions. Thanks!

I had the same problem. I added com.amazonaws:aws-java-sdk and org.apache.hadoop:hadoop-aws as dependencies for the Spark interpreter. These dependencies bring in their own versions of com.fasterxml.jackson.core:* and conflict with Spark's.
You also must exclude com.fasterxml.jackson.core:* from other dependencies, this is an example ${ZEPPELIN_HOME}/conf/interpreter.json Spark interpreter depenency section:
"dependencies": [
{
"groupArtifactVersion": "com.amazonaws:aws-java-sdk:1.7.4",
"local": false,
"exclusions": ["com.fasterxml.jackson.core:*"]
},
{
"groupArtifactVersion": "org.apache.hadoop:hadoop-aws:2.7.1",
"local": false,
"exclusions": ["com.fasterxml.jackson.core:*"]
}
]

Another way is to include it right in the notebook cell:
%dep
z.load("com.fasterxml.jackson.core:jackson-core:2.6.2")

Related

Getting scalatest to run from databricks

What's missing in order to get this simple scalatest working in databricks? (mostly as per their userguide)
I have cluster, into which I've uploaded the following latest scalatest jar and restarted cluster:
https://oss.sonatype.org/content/groups/public/org/scalatest/scalatest_2.13/3.2.13/scalatest_2.13-3.2.13.jar
Following: https://www.scalatest.org/user_guide/using_the_scalatest_shell
I'm running their sample code:
import org.scalatest._
class ArithmeticSuite extends AnyFunSuite with matchers.should.Matchers {
test("addition works") {
1 + 1 should equal (2)
}
}
run(new ArithmeticSuite)
The errors start with:
command-1362189841383727:3: error: not found: type AnyFunSuite
class ArithmeticSuite extends AnyFunSuite with matchers.should.Matchers {
^
command-1362189841383727:3: error: object should is not a member of package org.scalatest.matchers
class ArithmeticSuite extends AnyFunSuite with matchers.should.Matchers {

Databricks clusters are using Scala 2.12, while you have installed library compiled for Scala 2.13 that is binary incompatible with Databricks cluster. Take version for Scala 2.12 (with _2.12 suffix).
P.S. Although scalatest already could be a part of Databricks Runtime - check it by running %sh ls -ls /dataricks/jars/*scalatest*

Needed to use scalatest test structure for 3.0.8
https://www.scalatest.org/scaladoc/3.0.8/org/scalatest/FeatureSpec.html
<artifactId>scalatest_2.12</artifactId>
<version>3.0.8</version>
to match the version 3.0.8 included with databricks runtime
This resolved our errors

Spark spark-sql-kafka - java.lang.NoClassDefFoundError: org/apache/kafka/common/serialization/ByteArraySerializer

I am experimenting with spark reading from a kafka topic through "Structured Streaming + Kafka Integration Guide".
Spark version: 3.2.1
Scala version: 2.12.15
Following their guide on the spark-shell including the dependencies, I start my shell:
spark-shell --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.2.1
However, once I run something like the following in my shell:
val df = spark.readStream.format("kafka").option("kafka.bootstrap.servers","http://HOST:PORT").option("subscribe", "my-topic").load()
I get the following exception:
java.lang.NoClassDefFoundError: org/apache/kafka/common/serialization/ByteArraySerializer`
Any ideas how to overcome this issue?
My assumption was with using --packages, all dependencies should be loaded as well. But this does not seem to be the case. From the logs I assume that the package gets loaded successfully, including the kafka-clients dependency:
org.apache.spark#spark-sql-kafka-0-10_2.12 added as a dependency
resolving dependencies :: org.apache.spark#spark-submit-parent-3b04f646-471c-4cc8-88fb-7e32bc3226ed;1.0
confs: \[default\]
found org.apache.spark#spark-sql-kafka-0-10_2.12;3.2.1 in central
found org.apache.spark#spark-token-provider-kafka-0-10_2.12;3.2.1 in central
found org.apache.kafka#kafka-clients;2.8.0 in central
found org.lz4#lz4-java;1.7.1 in central
found org.xerial.snappy#snappy-java;1.1.8.4 in central
found org.slf4j#slf4j-api;1.7.30 in central
found org.apache.hadoop#hadoop-client-runtime;3.3.1 in central
found org.spark-project.spark#unused;1.0.0 in central
found org.apache.hadoop#hadoop-client-api;3.3.1 in central
found org.apache.htrace#htrace-core4;4.1.0-incubating in central
found commons-logging#commons-logging;1.1.3 in central
found com.google.code.findbugs#jsr305;3.0.0 in central
found org.apache.commons#commons-pool2;2.6.2 in central

The logs seem fine, but you can try to include kafka-clients dependency in --packages argument as well
Otherwise, I'd suggest creating an uber jar instead of downloading libraries every time you submit the app

Error java.lang.NoSuchFieldError: NO_INTS

Getting the below error when running spark streaming application to fetch the data from kinesis.
Exception in thread "Kinesis Receiver 0" java.lang.NoSuchFieldError: NO_INTS
at com.fasterxml.jackson.dataformat.cbor.CBORParser.<init>(CBORParser.java:285)
at com.fasterxml.jackson.dataformat.cbor.CBORParserBootstrapper.constructParser(CBORParserBootstrapper.java:91)
at com.fasterxml.jackson.dataformat.cbor.CBORFactory._createParser(CBORFactory.java:392)
at com.fasterxml.jackson.dataformat.cbor.CBORFactory.createParser(CBORFactory.java:308)
at com.fasterxml.jackson.dataformat.cbor.CBORFactory.createParser(CBORFactory.java:295)
at com.fasterxml.jackson.dataformat.cbor.CBORFactory.createParser(CBORFactory.java:26)
at com.fasterxml.jackson.databind.ObjectMapper.readTree(ObjectMapper.java:2294)
at com.amazonaws.protocol.json.JsonContent.parseJsonContent(JsonContent.java:72)
at com.amazonaws.protocol.json.JsonContent.<init>(JsonContent.java:64)
at com.amazonaws.protocol.json.JsonContent.createJsonContent(JsonContent.java:54)
at com.amazonaws.http.JsonErrorResponseHandler.handle(JsonErrorResponseHandler.java:89)
at com.amazonaws.http.JsonErrorResponseHandler.handle(JsonErrorResponseHandler.java:40)
at com.amazonaws.http.AwsErrorResponseHandler.handleAse(AwsErrorResponseHandler.java:53)
at com.amazonaws.http.AwsErrorResponseHandler.handle(AwsErrorResponseHandler.java:41)
at com.amazonaws.http.AwsErrorResponseHandler.handle(AwsErrorResponseHandler.java:26)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.handleErrorResponse(AmazonHttpClient.java:1781)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.handleServiceErrorResponse(AmazonHttpClient.java:1383)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeOneRequest(AmazonHttpClient.java:1359)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeHelper(AmazonHttpClient.java:1139)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.doExecute(AmazonHttpClient.java:796)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeWithTimer(AmazonHttpClient.java:764)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.execute(AmazonHttpClient.java:738)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.access$500(AmazonHttpClient.java:698)
at com.amazonaws.http.AmazonHttpClient$RequestExecutionBuilderImpl.execute(AmazonHttpClient.java:680)
at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:544)
at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:524)
at com.amazonaws.services.kinesis.AmazonKinesisClient.doInvoke(AmazonKinesisClient.java:2809)
at com.amazonaws.services.kinesis.AmazonKinesisClient.invoke(AmazonKinesisClient.java:2776)
at com.amazonaws.services.kinesis.AmazonKinesisClient.invoke(AmazonKinesisClient.java:2765)
at com.amazonaws.services.kinesis.AmazonKinesisClient.executeListShards(AmazonKinesisClient.java:1557)
at com.amazonaws.services.kinesis.AmazonKinesisClient.listShards(AmazonKinesisClient.java:1528)
at com.amazonaws.services.kinesis.clientlibrary.proxies.KinesisProxy.listShards(KinesisProxy.java:325)
at com.amazonaws.services.kinesis.clientlibrary.proxies.KinesisProxy.getShardList(KinesisProxy.java:440)
at com.amazonaws.services.kinesis.clientlibrary.lib.worker.KinesisShardSyncer.getShardList(KinesisShardSyncer.java:349)
at com.amazonaws.services.kinesis.clientlibrary.lib.worker.KinesisShardSyncer.syncShardLeases(KinesisShardSyncer.java:159)
at com.amazonaws.services.kinesis.clientlibrary.lib.worker.KinesisShardSyncer.checkAndCreateLeasesForNewShards(KinesisShardSyncer.java:112)
at com.amazonaws.services.kinesis.clientlibrary.lib.worker.ShardSyncTask.call(ShardSyncTask.java:84)
at com.amazonaws.services.kinesis.clientlibrary.lib.worker.MetricsCollectingTaskDecorator.call(MetricsCollectingTaskDecorator.java:49)
at com.amazonaws.services.kinesis.clientlibrary.lib.worker.Worker.initialize(Worker.java:683)
at com.amazonaws.services.kinesis.clientlibrary.lib.worker.Worker.run(Worker.java:614)
at org.apache.spark.streaming.kinesis.KinesisReceiver$$anon$1.run(KinesisReceiver.scala:191)
Here is the command that was executed:
spark-submit --jars spark-streaming-kinesis-asl_2.11-2.4.5.jar,amazon-kinesis-client-1.13.2.jar,aws-java-sdk-kinesis-1.11.745.jar,aws-java-sdk-core-1.11.745.jar,aws-java-sdk-sts-1.11.745.jar,aws-java-sdk-1.11.745.jar,aws-java-sdk-dynamodb-1.11.745.jar,aws-java-sdk-cloudwatch-1.11.745.jar,jackson-core-2.9.8.jar,jackson-dataformat-cbor-2.9.8.jar,jackson-databind-2.9.8.jar snowplow_spark/src/main.py
And code is very basic:
kinesisStream = KinesisUtils.createStream(
ssc, kinesisAppName=appName, streamName=streamName, endpointUrl=endpointUrl,
regionName=regionName, initialPositionInStream=InitialPositionInStream.LATEST,
checkpointInterval=10)
I have been stuck on this since days and have no idea what to do. I know that somewhere the jackson version is not matching in spark and aws-sdk but dont know which one to put in --jars.

Not sure if you have it resolved already, but https://issues.apache.org/jira/browse/SPARK-25455 issue is related to that. I ran into same issue. I got it to work with the below dependency in pom.xml. Also all the aws libraries are of 2.16.0 version in my project. Hope this helps.
<dependency>
<groupId>com.fasterxml.jackson.dataformat</groupId>
<artifactId>jackson-dataformat-cbor</artifactId>
<version>2.6.7</version>
</dependency>

ClassNotFoundException for parquet on CentOS 7.3

I have been using spark-1.5.2 built with hadoop-1.0.4 along with spark-csv_2.10-1.4.0 and commons-csv:1.1(for reading data). I run naivebayes/randomforest algorithms in CentOS 6.7 and it is working fine. When I upgraded the OS to 7.3, I get the following exception :
Caused by: java.lang.ClassNotFoundException: Failed to load class for data source: parquet.
at org.apache.spark.sql.execution.datasources.ResolvedDataSource$.lookupDataSource(ResolvedDataSource.scala:67)
at org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:167)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:146)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:137)
at org.apache.spark.sql.DataFrameWriter.parquet(DataFrameWriter.scala:304)
at org.apache.spark.mllib.classification.NaiveBayesModel$SaveLoadV2_0$.save(NaiveBayes.scala:206)
at org.apache.spark.mllib.classification.NaiveBayesModel.save(NaiveBayes.scala:169)
at com.zlabs.ml.core.algo.classification.naivebayes.NaiveBayesClassificationImpl.createModel(NaiveBayesClassificationImpl.java:111)
... 10 more
This exception happens when I save the model. Training is completed successfully.
Is there anyone else facing this issue ? Are there any OS level package dependencies for spark?

Spark SQL: TwitterUtils Streaming fails for unknown reason

I am using the latest Spark master and additionally, I am loading these jars:
- spark-streaming-twitter_2.10-1.1.0-SNAPSHOT.jar
- twitter4j-core-4.0.2.jar
- twitter4j-stream-4.0.2.jar
My simple test program that I execute in the shell looks as follows:
import org.apache.spark.streaming._
import org.apache.spark.streaming.twitter._
import org.apache.spark.streaming.StreamingContext._
System.setProperty("twitter4j.oauth.consumerKey", "jXgXF...")
System.setProperty("twitter4j.oauth.consumerSecret", "mWPvQRl1....")
System.setProperty("twitter4j.oauth.accessToken", "26176....")
System.setProperty("twitter4j.oauth.accessTokenSecret", "J8Fcosm4...")
var ssc = new StreamingContext(sc, Seconds(1))
var tweets = TwitterUtils.createStream(ssc, None)
var statuses = tweets.map(_.getText)
statuses.print()
ssc.start()
However, I won't get any tweets. The main error I see is
14/08/04 10:52:35 ERROR scheduler.ReceiverTracker: Deregistered receiver for stream 0: Error starting receiver 0 - java.lang.NoSuchMethodError: twitter4j.TwitterStream.addListener(Ltwitter4j/StatusListener;)V
at org.apache.spark.streaming.twitter.TwitterReceiver.onStart(TwitterInputDStream.scala:72)
....
And then for each iteration:
INFO scheduler.ReceiverTracker: Stream 0 received 0 blocks
I'm not sure where the problem lies.
How can I verify that my twitter credentials are correctly recognized?
Might there be another jar missing?

NoSuchMethodError should always cause you to ask whether you are running with the same versions of libraries and classes that you compiled with.
If you look at the pom.xml file for the Spark examples module, you'll see that it uses twitter4j 3.0.3. You're bringing incompatible 4.0.2 with you at runtime and that breaks it.

Yes, Sean Owen has given the good reason, after I add two dependency files on the pom.xml file:
<dependency>
<groupId>org.twitter4j</groupId>
<artifactId>twitter4j-core</artifactId>
<version>3.0.6</version>
</dependency>
<dependency>
<groupId>org.twitter4j</groupId>
<artifactId>twitter4j-stream</artifactId>
<version>3.0.6</version>
</dependency>
In this way we change the default twitter4j version from 4.0.x to 3.0.x (http://mvnrepository.com/artifact/org.twitter4j/twitter4j-core), then the incompatible problem will be solved.

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string

Zeppelin + Spark: Reading Parquet from S3 throws NoSuchMethodError: com.fasterxml.jackson - apache-spark

Another way is to include it right in the notebook cell: %dep z.load("com.fasterxml.jackson.core:jackson-core:2.6.2")

Related

Getting scalatest to run from databricks

Spark spark-sql-kafka - java.lang.NoClassDefFoundError: org/apache/kafka/common/serialization/ByteArraySerializer

Error java.lang.NoSuchFieldError: NO_INTS

ClassNotFoundException for parquet on CentOS 7.3

Spark SQL: TwitterUtils Streaming fails for unknown reason

Categories

Resources