Spark SQL: TwitterUtils Streaming fails for unknown reason

I am using the latest Spark master and am additionally loading these jars:
- spark-streaming-twitter_2.10-1.1.0-SNAPSHOT.jar
- twitter4j-core-4.0.2.jar
- twitter4j-stream-4.0.2.jar
My simple test program that I execute in the shell looks as follows:
import org.apache.spark.streaming._
import org.apache.spark.streaming.twitter._
import org.apache.spark.streaming.StreamingContext._
System.setProperty("twitter4j.oauth.consumerKey", "jXgXF...")
System.setProperty("twitter4j.oauth.consumerSecret", "mWPvQRl1....")
System.setProperty("twitter4j.oauth.accessToken", "26176....")
System.setProperty("twitter4j.oauth.accessTokenSecret", "J8Fcosm4...")
var ssc = new StreamingContext(sc, Seconds(1))
var tweets = TwitterUtils.createStream(ssc, None)
var statuses = tweets.map(_.getText)
statuses.print()
ssc.start()
However, I don't get any tweets. The main error I see is:
14/08/04 10:52:35 ERROR scheduler.ReceiverTracker: Deregistered receiver for stream 0: Error starting receiver 0 - java.lang.NoSuchMethodError: twitter4j.TwitterStream.addListener(Ltwitter4j/StatusListener;)V
at org.apache.spark.streaming.twitter.TwitterReceiver.onStart(TwitterInputDStream.scala:72)
....
And then for each iteration:
INFO scheduler.ReceiverTracker: Stream 0 received 0 blocks
I'm not sure where the problem lies.
How can I verify that my twitter credentials are correctly recognized?
Might there be another jar missing?

NoSuchMethodError should always cause you to ask whether you are running with the same versions of libraries and classes that you compiled with.
If you look at the pom.xml file for the Spark examples module, you'll see that it uses twitter4j 3.0.3. You're bringing incompatible 4.0.2 with you at runtime and that breaks it.
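A quick way to confirm which versions are actually on the classpath is to ask where a class was loaded from. A minimal check from the Spark shell, assuming the twitter4j jars are loadable:
// Print the jar that TwitterStream was loaded from (getCodeSource may be null for bootstrap classes)
println(classOf[twitter4j.TwitterStream].getProtectionDomain.getCodeSource.getLocation)
If that prints the twitter4j-stream-4.0.2 jar, the runtime/compile-time mismatch is confirmed.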

Yes, Sean Owen has given the right reason. After I added these two dependencies to the pom.xml file:
<dependency>
  <groupId>org.twitter4j</groupId>
  <artifactId>twitter4j-core</artifactId>
  <version>3.0.6</version>
</dependency>
<dependency>
  <groupId>org.twitter4j</groupId>
  <artifactId>twitter4j-stream</artifactId>
  <version>3.0.6</version>
</dependency>
This changes the default twitter4j version from 4.0.x to 3.0.x (http://mvnrepository.com/artifact/org.twitter4j/twitter4j-core), and the incompatibility problem is solved.
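After rebuilding, one way to double-check which twitter4j version ended up on the classpath is twitter4j's own Version class (a quick sanity check from the shell, assuming the class is available):
// Should print a 3.0.x version once the override takes effect
println(twitter4j.Version.getVersion)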

Related

Spark spark-sql-kafka - java.lang.NoClassDefFoundError: org/apache/kafka/common/serialization/ByteArraySerializer

I am experimenting with Spark reading from a Kafka topic, following the "Structured Streaming + Kafka Integration Guide".
Spark version: 3.2.1
Scala version: 2.12.15
Following their guide for the spark-shell and including the dependencies, I start my shell:
spark-shell --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.2.1
However, once I run something like the following in my shell:
val df = spark.readStream.format("kafka").option("kafka.bootstrap.servers","http://HOST:PORT").option("subscribe", "my-topic").load()
I get the following exception:
java.lang.NoClassDefFoundError: org/apache/kafka/common/serialization/ByteArraySerializer
Any ideas how to overcome this issue?
My assumption was that with --packages all transitive dependencies would be loaded as well, but this does not seem to be the case. From the logs I assume that the package gets loaded successfully, including the kafka-clients dependency:
org.apache.spark#spark-sql-kafka-0-10_2.12 added as a dependency
resolving dependencies :: org.apache.spark#spark-submit-parent-3b04f646-471c-4cc8-88fb-7e32bc3226ed;1.0
confs: [default]
found org.apache.spark#spark-sql-kafka-0-10_2.12;3.2.1 in central
found org.apache.spark#spark-token-provider-kafka-0-10_2.12;3.2.1 in central
found org.apache.kafka#kafka-clients;2.8.0 in central
found org.lz4#lz4-java;1.7.1 in central
found org.xerial.snappy#snappy-java;1.1.8.4 in central
found org.slf4j#slf4j-api;1.7.30 in central
found org.apache.hadoop#hadoop-client-runtime;3.3.1 in central
found org.spark-project.spark#unused;1.0.0 in central
found org.apache.hadoop#hadoop-client-api;3.3.1 in central
found org.apache.htrace#htrace-core4;4.1.0-incubating in central
found commons-logging#commons-logging;1.1.3 in central
found com.google.code.findbugs#jsr305;3.0.0 in central
found org.apache.commons#commons-pool2;2.6.2 in central
The logs seem fine, but you can try including the kafka-clients dependency in the --packages argument as well.
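For example, something like this (a sketch; kafka-clients 2.8.0 matches the version resolved in the log above):
spark-shell --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.2.1,org.apache.kafka:kafka-clients:2.8.0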
Otherwise, I'd suggest creating an uber jar instead of downloading libraries every time you submit the app.
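For a Maven build, a minimal maven-shade-plugin configuration along these lines can produce that uber jar (a sketch with an assumed plugin version, not from the original answer):
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <version>3.2.4</version>
  <executions>
    <execution>
      <!-- Build the shaded (uber) jar during package -->
      <phase>package</phase>
      <goals>
        <goal>shade</goal>
      </goals>
    </execution>
  </executions>
</plugin>
The shaded jar is then passed directly to spark-submit, so nothing has to be downloaded at submit time.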

Quarkus native and org.kohsuke:github-api - Failed to deserialize exception

I'm playing with Quarkus native and org.kohsuke:github-api:1.111, and I see a "Failed to deserialize" exception in native mode when doing a simple new GitHubBuilder().withOAuthToken(ghToken).build(). This works in JVM mode.
The main problem is probably the fact that org.kohsuke:github-api is not ready for native mode.
I still want to ask if there are any options to work around this, maybe some Jackson tricks for https://github.com/github-api/github-api/blob/master/src/main/java/org/kohsuke/github/GHMyself.java (the stack trace contains Cannot construct instance of org.kohsuke.github.GHMyself (no Creators, like default construct, exist)).
Exception details:
2020-05-05 10:47:06,891 ERROR [io.qua.ver.htt.run.QuarkusErrorHandler] (executor-thread-1) HTTP Request to /hello failed, error id: a71018e5-de46-43a0-a194-80bd0b477f3d-1: org.jboss.resteasy.spi.UnhandledException: org.kohsuke.github.HttpException: Server returned HTTP response code: 200, message: '200 OK' for URL: https://api.github.com/user
...
Caused by: java.io.IOException: Failed to deserialize {"login":"rsvoboda","id":925259,"node_id":"MDQ6VXNlcjkyNTI1OQ==","avatar_url":"https://avatars0.githubusercontent.com/u/925259?v=4","gravatar_id":"","url":"https://api.github.com/users/rsvoboda","html_url":"https://github.com/rsvoboda","followers_url":"https://api.github.com/users/rsvoboda/followers","following_url":"https://api.github.com/users/rsvoboda/following{/other_user}","gists_url":"https://api.github.com/users/rsvoboda/gists{/gist_id}","starred_url":"https://api.github.com/users/rsvoboda/starred{/owner}{/repo}","subscriptions_url":"https://api.github.com/users/rsvoboda/subscriptions","organizations_url":"https://api.github.com/users/rsvoboda/orgs","repos_url":"https://api.github.com/users/rsvoboda/repos","events_url":"https://api.github.com/users/rsvoboda/events{/privacy}","received_events_url":"https://api.github.com/users/rsvoboda/received_events","type":"User","site_admin":false,"name":"Rostislav Svoboda","company":"JBoss by Red Hat by IBM","blog":"https://twitter.com/r_svoboda","location":"Brno, Czech Republic","email":"rsvoboda#redhat.com","hireable":null,"bio":null,"public_repos":138,"public_gists":3,"followers":18,"following":2,"created_at":"2011-07-19T12:18:08Z","updated_at":"2020-04-29T14:38:31Z"}
at org.kohsuke.github.GitHubResponse.parseBody(GitHubResponse.java:87)
at org.kohsuke.github.GitHubClient.lambda$fetch$0(GitHubClient.java:146)
at org.kohsuke.github.GitHubClient.createResponse(GitHubClient.java:404)
at org.kohsuke.github.GitHubClient.sendRequest(GitHubClient.java:358)
... 37 more
Caused by: com.fasterxml.jackson.databind.exc.InvalidDefinitionException: Cannot construct instance of `org.kohsuke.github.GHMyself` (no Creators, like default construct, exist): cannot deserialize from Object value (no delegate- or property-based Creator)
at [Source: (String)"{"login":"rsvoboda","id":925259,"node_id":"MDQ6VXNlcjkyNTI1OQ==","avatar_url":"https://avatars0.githubusercontent.com/u/925259?v=4","gravatar_id":"","url":"https://api.github.com/users/rsvoboda","html_url":"https://github.com/rsvoboda","followers_url":"https://api.github.com/users/rsvoboda/followers","following_url":"https://api.github.com/users/rsvoboda/following{/other_user}","gists_url":"https://api.github.com/users/rsvoboda/gists{/gist_id}","starred_url":"https://api.github.com/users/rsvobod"[truncated 734 chars]; line: 1, column: 2]
at com.fasterxml.jackson.databind.DeserializationContext.reportBadDefinition(DeserializationContext.java:1592)
at com.fasterxml.jackson.databind.DeserializationContext.handleMissingInstantiator(DeserializationContext.java:1058)
at com.fasterxml.jackson.databind.deser.BeanDeserializerBase.deserializeFromObjectUsingNonDefault(BeanDeserializerBase.java:1297)
at com.fasterxml.jackson.databind.deser.BeanDeserializer.deserializeFromObject(BeanDeserializer.java:326)
at com.fasterxml.jackson.databind.deser.BeanDeserializer.deserialize(BeanDeserializer.java:159)
at com.fasterxml.jackson.databind.ObjectReader._bindAndClose(ObjectReader.java:1719)
at com.fasterxml.jackson.databind.ObjectReader.readValue(ObjectReader.java:1261)
at org.kohsuke.github.GitHubResponse.parseBody(GitHubResponse.java:84)
Reproducer details:
generate the app - https://quarkus.io/guides/getting-started#bootstrapping-the-project
add dependencies and the quarkus.native.enable-https-url-handler property for native
<dependency>
  <groupId>org.kohsuke</groupId>
  <artifactId>github-api</artifactId>
  <version>1.111</version>
</dependency>
<dependency>
  <groupId>io.quarkus</groupId>
  <artifactId>quarkus-resteasy-jackson</artifactId>
</dependency>
...
<properties>
  <quarkus.package.type>native</quarkus.package.type>
  <quarkus.native.enable-https-url-handler>true</quarkus.native.enable-https-url-handler>
</properties>
change GreetingResource
GitHub github = new GitHubBuilder().withOAuthToken(ghToken).build();
GHRepository ghRepo = github.getRepository("quarkusio/quarkus");
return ghRepo.toString();
drop .body(is("hello")) from GreetingResourceTest
run mvn clean verify -Dnative
You can also use @RegisterForReflection(targets = GHObject.class) on any of your classes.
My guess is that you will need to register all the GHObject hierarchy for reflection using a ReflectiveHierarchyBuildItem.
So that requires an extension. It would be a useful addition for the Platform IMHO.
If you are running in native mode, this error is caused by reflection. To resolve it, annotate your class with @RegisterForReflection and also add a no-args constructor, then rebuild the application and run it; the error will be resolved.
For more reference, see the Quarkus guide.
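For illustration, a minimal sketch of such a class (MyPayload is a hypothetical name; the essential parts are the annotation and the no-args constructor):
import io.quarkus.runtime.annotations.RegisterForReflection;

// Registered for reflection so Jackson can instantiate it in a native image
@RegisterForReflection
public class MyPayload {
    private String login;

    // Jackson needs a no-args constructor for property-based deserialization
    public MyPayload() {
    }

    public String getLogin() { return login; }
    public void setLogin(String login) { this.login = login; }
}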

Error java.lang.NoSuchFieldError: NO_INTS

Getting the below error when running a Spark streaming application to fetch data from Kinesis.
Exception in thread "Kinesis Receiver 0" java.lang.NoSuchFieldError: NO_INTS
at com.fasterxml.jackson.dataformat.cbor.CBORParser.<init>(CBORParser.java:285)
at com.fasterxml.jackson.dataformat.cbor.CBORParserBootstrapper.constructParser(CBORParserBootstrapper.java:91)
at com.fasterxml.jackson.dataformat.cbor.CBORFactory._createParser(CBORFactory.java:392)
at com.fasterxml.jackson.dataformat.cbor.CBORFactory.createParser(CBORFactory.java:308)
at com.fasterxml.jackson.dataformat.cbor.CBORFactory.createParser(CBORFactory.java:295)
at com.fasterxml.jackson.dataformat.cbor.CBORFactory.createParser(CBORFactory.java:26)
at com.fasterxml.jackson.databind.ObjectMapper.readTree(ObjectMapper.java:2294)
at com.amazonaws.protocol.json.JsonContent.parseJsonContent(JsonContent.java:72)
at com.amazonaws.protocol.json.JsonContent.<init>(JsonContent.java:64)
at com.amazonaws.protocol.json.JsonContent.createJsonContent(JsonContent.java:54)
at com.amazonaws.http.JsonErrorResponseHandler.handle(JsonErrorResponseHandler.java:89)
at com.amazonaws.http.JsonErrorResponseHandler.handle(JsonErrorResponseHandler.java:40)
at com.amazonaws.http.AwsErrorResponseHandler.handleAse(AwsErrorResponseHandler.java:53)
at com.amazonaws.http.AwsErrorResponseHandler.handle(AwsErrorResponseHandler.java:41)
at com.amazonaws.http.AwsErrorResponseHandler.handle(AwsErrorResponseHandler.java:26)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.handleErrorResponse(AmazonHttpClient.java:1781)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.handleServiceErrorResponse(AmazonHttpClient.java:1383)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeOneRequest(AmazonHttpClient.java:1359)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeHelper(AmazonHttpClient.java:1139)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.doExecute(AmazonHttpClient.java:796)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeWithTimer(AmazonHttpClient.java:764)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.execute(AmazonHttpClient.java:738)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.access$500(AmazonHttpClient.java:698)
at com.amazonaws.http.AmazonHttpClient$RequestExecutionBuilderImpl.execute(AmazonHttpClient.java:680)
at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:544)
at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:524)
at com.amazonaws.services.kinesis.AmazonKinesisClient.doInvoke(AmazonKinesisClient.java:2809)
at com.amazonaws.services.kinesis.AmazonKinesisClient.invoke(AmazonKinesisClient.java:2776)
at com.amazonaws.services.kinesis.AmazonKinesisClient.invoke(AmazonKinesisClient.java:2765)
at com.amazonaws.services.kinesis.AmazonKinesisClient.executeListShards(AmazonKinesisClient.java:1557)
at com.amazonaws.services.kinesis.AmazonKinesisClient.listShards(AmazonKinesisClient.java:1528)
at com.amazonaws.services.kinesis.clientlibrary.proxies.KinesisProxy.listShards(KinesisProxy.java:325)
at com.amazonaws.services.kinesis.clientlibrary.proxies.KinesisProxy.getShardList(KinesisProxy.java:440)
at com.amazonaws.services.kinesis.clientlibrary.lib.worker.KinesisShardSyncer.getShardList(KinesisShardSyncer.java:349)
at com.amazonaws.services.kinesis.clientlibrary.lib.worker.KinesisShardSyncer.syncShardLeases(KinesisShardSyncer.java:159)
at com.amazonaws.services.kinesis.clientlibrary.lib.worker.KinesisShardSyncer.checkAndCreateLeasesForNewShards(KinesisShardSyncer.java:112)
at com.amazonaws.services.kinesis.clientlibrary.lib.worker.ShardSyncTask.call(ShardSyncTask.java:84)
at com.amazonaws.services.kinesis.clientlibrary.lib.worker.MetricsCollectingTaskDecorator.call(MetricsCollectingTaskDecorator.java:49)
at com.amazonaws.services.kinesis.clientlibrary.lib.worker.Worker.initialize(Worker.java:683)
at com.amazonaws.services.kinesis.clientlibrary.lib.worker.Worker.run(Worker.java:614)
at org.apache.spark.streaming.kinesis.KinesisReceiver$$anon$1.run(KinesisReceiver.scala:191)
Here is the command that was executed:
spark-submit --jars spark-streaming-kinesis-asl_2.11-2.4.5.jar,amazon-kinesis-client-1.13.2.jar,aws-java-sdk-kinesis-1.11.745.jar,aws-java-sdk-core-1.11.745.jar,aws-java-sdk-sts-1.11.745.jar,aws-java-sdk-1.11.745.jar,aws-java-sdk-dynamodb-1.11.745.jar,aws-java-sdk-cloudwatch-1.11.745.jar,jackson-core-2.9.8.jar,jackson-dataformat-cbor-2.9.8.jar,jackson-databind-2.9.8.jar snowplow_spark/src/main.py
And code is very basic:
kinesisStream = KinesisUtils.createStream(
    ssc, kinesisAppName=appName, streamName=streamName, endpointUrl=endpointUrl,
    regionName=regionName, initialPositionInStream=InitialPositionInStream.LATEST,
    checkpointInterval=10)
I have been stuck on this for days and have no idea what to do. I know that the Jackson versions in Spark and the AWS SDK do not match somewhere, but I don't know which one to put in --jars.
Not sure if you have it resolved already, but the issue https://issues.apache.org/jira/browse/SPARK-25455 is related to this. I ran into the same issue and got it to work with the dependency below in pom.xml. Also, all the AWS libraries in my project are version 2.16.0. Hope this helps.
<dependency>
  <groupId>com.fasterxml.jackson.dataformat</groupId>
  <artifactId>jackson-dataformat-cbor</artifactId>
  <version>2.6.7</version>
</dependency>
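If you are passing jars with --jars rather than building with Maven, the equivalent change would be to use a CBOR jar matching Spark's own Jackson rather than the 2.9.8 set (a sketch, assuming Spark 2.4 bundles Jackson 2.6.x; other jars as in the original command):
spark-submit --jars ...,jackson-dataformat-cbor-2.6.7.jar snowplow_spark/src/main.py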

ClassNotFoundException for parquet on CentOS 7.3

I have been using spark-1.5.2 built with hadoop-1.0.4, along with spark-csv_2.10-1.4.0 and commons-csv:1.1 (for reading data). I run naivebayes/randomforest algorithms on CentOS 6.7 and they work fine. When I upgraded the OS to 7.3, I get the following exception:
Caused by: java.lang.ClassNotFoundException: Failed to load class for data source: parquet.
at org.apache.spark.sql.execution.datasources.ResolvedDataSource$.lookupDataSource(ResolvedDataSource.scala:67)
at org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:167)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:146)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:137)
at org.apache.spark.sql.DataFrameWriter.parquet(DataFrameWriter.scala:304)
at org.apache.spark.mllib.classification.NaiveBayesModel$SaveLoadV2_0$.save(NaiveBayes.scala:206)
at org.apache.spark.mllib.classification.NaiveBayesModel.save(NaiveBayes.scala:169)
at com.zlabs.ml.core.algo.classification.naivebayes.NaiveBayesClassificationImpl.createModel(NaiveBayesClassificationImpl.java:111)
... 10 more
This exception happens when I save the model; training completes successfully.
Is anyone else facing this issue? Are there any OS-level package dependencies for Spark?

Zeppelin + Spark: Reading Parquet from S3 throws NoSuchMethodError: com.fasterxml.jackson

Using Zeppelin 0.7.2 binaries from the main download, and Spark 2.1.0 w/ Hadoop 2.6, the following paragraph:
val df = spark.read.parquet(DATA_URL).filter(FILTER_STRING).na.fill("")
Produces the following:
java.lang.NoSuchMethodError: com.fasterxml.jackson.module.scala.deser.BigDecimalDeserializer$.handledType()Ljava/lang/Class;
at com.fasterxml.jackson.module.scala.deser.NumberDeserializers$.<init>(ScalaNumberDeserializersModule.scala:49)
at com.fasterxml.jackson.module.scala.deser.NumberDeserializers$.<clinit>(ScalaNumberDeserializersModule.scala)
at com.fasterxml.jackson.module.scala.deser.ScalaNumberDeserializersModule$class.$init$(ScalaNumberDeserializersModule.scala:61)
at com.fasterxml.jackson.module.scala.DefaultScalaModule.<init>(DefaultScalaModule.scala:20)
at com.fasterxml.jackson.module.scala.DefaultScalaModule$.<init>(DefaultScalaModule.scala:37)
at com.fasterxml.jackson.module.scala.DefaultScalaModule$.<clinit>(DefaultScalaModule.scala)
at org.apache.spark.rdd.RDDOperationScope$.<init>(RDDOperationScope.scala:82)
at org.apache.spark.rdd.RDDOperationScope$.<clinit>(RDDOperationScope.scala)
at org.apache.spark.SparkContext.withScope(SparkContext.scala:701)
at org.apache.spark.SparkContext.parallelize(SparkContext.scala:715)
at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$.mergeSchemasInParallel(ParquetFileFormat.scala:594)
at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat.inferSchema(ParquetFileFormat.scala:235)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$7.apply(DataSource.scala:184)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$7.apply(DataSource.scala:184)
at scala.Option.orElse(Option.scala:289)
at org.apache.spark.sql.execution.datasources.DataSource.org$apache$spark$sql$execution$datasources$DataSource$$getOrInferFileFormatSchema(DataSource.scala:183)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:387)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:152)
at org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:441)
at org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:425)
... 47 elided
This error does not happen in the normal spark-shell, only in Zeppelin. I have attempted the following fixes, which do nothing:
- Downloading jackson 2.6.2 jars to the Zeppelin lib folder and restarting
- Adding jackson 2.9 dependencies from the Maven repositories to the interpreter settings
- Deleting the jackson jars from the Zeppelin lib folder
Googling is turning up no similar situations. Please don't hesitate to ask for more information, or make suggestions. Thanks!
I had the same problem. I added com.amazonaws:aws-java-sdk and org.apache.hadoop:hadoop-aws as dependencies for the Spark interpreter. These dependencies bring in their own versions of com.fasterxml.jackson.core:* and conflict with Spark's.
You also must exclude com.fasterxml.jackson.core:* from the other dependencies. Here is an example Spark interpreter dependency section from ${ZEPPELIN_HOME}/conf/interpreter.json:
"dependencies": [
{
"groupArtifactVersion": "com.amazonaws:aws-java-sdk:1.7.4",
"local": false,
"exclusions": ["com.fasterxml.jackson.core:*"]
},
{
"groupArtifactVersion": "org.apache.hadoop:hadoop-aws:2.7.1",
"local": false,
"exclusions": ["com.fasterxml.jackson.core:*"]
}
]
Another way is to include it right in the notebook cell:
%dep
z.load("com.fasterxml.jackson.core:jackson-core:2.6.2")
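Note that a %dep cell has to run before the Spark interpreter starts; if the Spark context is already running, restart the interpreter and execute the %dep cell first.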
