I am trying to run my Kafka Spark Streaming application with spark-submit. It's a Java Maven project, and I built a fat jar with the assembly plugin. When I run that jar with spark-submit, it fails with the error below. My pom.xml dependencies are as follows.
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 https://maven.apache.org/xsd/maven-4.0.0.xsd">
<!-- https://mvnrepository.com/artifact/org.apache.spark/spark-streaming-kafka-0-10 -->
<!-- https://mvnrepository.com/artifact/com.influxdb/influxdb-client-java -->
<!-- https://mvnrepository.com/artifact/com.influxdb/influxdb-client-core -->
<!-- https://mvnrepository.com/artifact/org.apache.spark/spark-streaming -->
<!-- https://mvnrepository.com/artifact/org.apache.spark/spark-core -->
<!-- https://mvnrepository.com/artifact/org.scala-lang/scala-library -->
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/kafka/clients/consumer/Consumer
at org.apache.spark.streaming.kafka010.ConsumerStrategies$.Subscribe(ConsumerStrategy.scala:299)
at org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe(ConsumerStrategy.scala)
at consumer.consumer.consume(consumer.java:156)
at consumer.consumer.main(consumer.java:80)
I tried passing all the dependencies on the classpath with --jars in spark-submit, but no luck.
I tried lowering the Spark, Kafka, and Scala versions, but I still get the same error.
I tried using the same Scala version for Kafka and Scala, but no luck.
I tried using the same versions as the Kafka and Scala installation jars, but that didn't fix it.
I am using spark-2.4.6-bin-hadoop2.7 and kafka_2.13-2.6.0 on a standalone machine.
Is there anything I am missing here? I also tried the answers from similar questions, but it still errors out. I'd appreciate any help with this. Thanks!
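A NoClassDefFoundError for org/apache/kafka/clients/consumer/Consumer means the JVM running the driver could not load that class at runtime, so a reasonable first step is to check whether the class is visible at all and which jar supplies it. Below is a minimal diagnostic sketch using only the JDK. It is not part of the original application: the class name ClasspathProbe and the helper locate are made up for illustration. It is meant to be run with the same classpath the failing job sees.

```java
import java.security.CodeSource;

class ClasspathProbe {
    /**
     * Returns the location (usually a jar URL) that supplies the class,
     * or null when the class cannot be loaded from the current classpath.
     */
    static String locate(String className) {
        try {
            // initialize=false: only resolve the class, do not run static initializers.
            Class<?> cls = Class.forName(className, false, ClasspathProbe.class.getClassLoader());
            CodeSource source = cls.getProtectionDomain().getCodeSource();
            // JDK classes loaded by the bootstrap loader have no code source.
            return source == null ? "(JDK runtime)" : String.valueOf(source.getLocation());
        } catch (ClassNotFoundException | LinkageError e) {
            return null;
        }
    }

    public static void main(String[] args) {
        // Defaults to the class named in the stack trace; pass another name as the first argument.
        String name = args.length > 0 ? args[0] : "org.apache.kafka.clients.consumer.Consumer";
        String where = locate(name);
        System.out.println(where == null
                ? name + " is NOT visible on this classpath"
                : name + " is loaded from " + where);
    }
}
```

If the probe reports that the class is not visible, the kafka-clients classes are not reaching the JVM, and the jar contents are worth inspecting before changing any versions.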
I am unable to build a SparkSession in Scala without an error. I am using Maven as my build tool. I have Spark 3.3.0 installed locally. I also have an Azure Databricks 3.3.0 cluster. Whenever I pass the databricks master URL it bombs out with this error:
Error: java.lang.IllegalArgumentException: requirement failed: Can only call getServletHandlers on a running MetricsSystem
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
package sample
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession
object SparkMain extends App {
val sparkConf = new SparkConf()
.set("spark.driver.allowMultipleContexts", "false")
.set("spark.ui.enabled", "false")
val spark = SparkSession.builder()
val df = spark.read.option("header","true")
I have an Apache Beam pipeline that indexes some data into Elasticsearch. I was trying to use the Spark or Flink runner to run the job on AWS EMR. When I run the job on a standalone Spark in my local setup, the pipeline works with source files on the local disk; however, when I read the file from GCS, it does not work. The same happens when I run it on the EMR cluster.
The configs I set in the Hadoop core-site.xml, passed as EMR configuration:
{
  "Classification": "core-site",
  "Properties": {
    "fs.gs.impl": "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem",
    "fs.AbstractFileSystem.gs.impl": "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS",
    "fs.gs.project.id": "data-warehouse",
    "google.cloud.auth.service.account.enable": "true",
    "fs.gs.auth.service.account.json.keyfile": "/home/hadoop/utils/key.json"
  }
}
Also, the GCS connector jar is in both the Spark jar path and the Hadoop jar path.
The Maven pom file for the pipeline:
<project xmlns="http://maven.apache.org/POM/4.0.0"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
<name>Apache Development Snapshot Repository</name>
<!-- https://mvnrepository.com/artifact/org.apache.beam/beam-sdks-java-extensions-google-cloud-platform-core -->
<!-- https://mvnrepository.com/artifact/org.apache.beam/beam-runners-google-cloud-dataflow-java -->
<!-- <scope>runtime</scope>-->
<!-- https://mvnrepository.com/artifact/org.apache.beam/beam-runners-flink -->
<!-- https://mvnrepository.com/artifact/org.apache.spark/spark-core -->
<!-- slf4j API frontend binding with JUL backend -->
<!-- <dependency>-->
<!-- <groupId>org.slf4j</groupId>-->
<!-- <artifactId>slf4j-api</artifactId>-->
<!-- <version>${slf4j.version}</version>-->
<!-- </dependency>-->
<!-- <dependency>-->
<!-- <groupId>org.slf4j</groupId>-->
<!-- <artifactId>slf4j-jdk14</artifactId>-->
<!-- <version>${slf4j.version}</version>-->
<!-- </dependency>-->
There is no error: EMR shows the task as completed, but the pipeline has not run.
I could not figure out whether it's an Apache Beam problem or a cluster config problem.
I figured out the issue. The Apache Beam SDK uses gsutil to access the GCS files. According to the Flink documentation, the Hadoop connectors are responsible for any other file system access, but when Apache Beam runs on the Flink runner, the data is read using gsutil and fed downstream. So I installed the Google Cloud SDK and activated the service account.
I want to connect my Spark cluster to TiDB through TiSpark, but when I run my Spark application, an error occurs:
java.io.InvalidClassException: com.pingcap.tikv.region.TiRegion; local class incompatible: stream classdesc serialVersionUID = -3091715739322916126, local class serialVersionUID = -3556238418089320368
I set up a TiDB cluster following the guide at https://pingcap.com/docs/v3.0/how-to/get-started/deploy-tidb-from-binary/
After that, I followed the guide at https://pingcap.com/docs/v3.0/reference/tispark/ to download tispark-core-2.2.0-SNAPSHOT-jar-with-dependencies.jar and copied it to the jars folder in Spark.
I also configured:
spark.sql.extensions org.apache.spark.sql.TiExtensions
Here is my pom file:
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
<name>Scala-Tools Maven2 Repository</name>
<name>Scala-Tools Maven2 Repository</name>
My Spark Session is:
val _spark = SparkSession.builder()
.config("spark.tispark.pd.addresses", "")
When I run a simple query against the database:
_spark.sql("use locdb")
val df = _spark.sql("select * from bang")
I got an error:
java.io.InvalidClassException: com.pingcap.tikv.region.TiRegion; local class incompatible: stream classdesc serialVersionUID = -3091715739322916126, local class serialVersionUID = -3556238418089320368
My full log is here:
I think the reason is that I am using TiSpark 2.1.1-2.4 in the Maven pom file, while the TiSpark jar file I downloaded and copied to the jars folder is 2.2.0. But I can't find any other version of the TiSpark jar, such as tispark-core-2.1.1-SNAPSHOT-jar-with-dependencies.jar.
I'm a dev of TiSpark.
Yes, your educated guess is correct :). Two different versions of the TiSpark jars were present during your run, which caused the problem. Version 2.2 (in your cluster env) has not been officially released to the Maven repo yet.
Since you already have the TiSpark jars deployed in your cluster, you can just remove the TiSpark dependency from your pom. In most cases you don't need any special API from TiSpark unless you are using an older version (< 2.0), and you can still query TiDB directly.
Or you could remove the jars from your cluster environment and rely on the TiSpark in your pom (if so, please pack it with dependencies).
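To confirm that two different builds of a class are in play, it can help to print the serialVersionUID that each JVM sees and compare the values with the two numbers in the InvalidClassException. The following is only a sketch under stated assumptions: SerialUidProbe, serialUid, and the nested Region class are hypothetical names for illustration (Region stands in for a class such as com.pingcap.tikv.region.TiRegion), and only JDK calls are used.

```java
import java.io.ObjectStreamClass;
import java.io.Serializable;

class SerialUidProbe {
    /**
     * Returns the serialVersionUID that Java serialization uses for the class
     * in this JVM, or null when the class is not Serializable.
     */
    static Long serialUid(Class<?> cls) {
        ObjectStreamClass descriptor = ObjectStreamClass.lookup(cls);
        return descriptor == null ? null : descriptor.getSerialVersionUID();
    }

    // Hypothetical stand-in class so the sketch runs without any extra jars.
    static class Region implements Serializable {
        private static final long serialVersionUID = 42L;
        String name;
    }

    public static void main(String[] args) throws Exception {
        // Pass a fully qualified class name to inspect a real class on the classpath.
        Class<?> cls = args.length > 0 ? Class.forName(args[0]) : Region.class;
        System.out.println(cls.getName() + " serialVersionUID = " + serialUid(cls));
    }
}
```

Running it once against the application jar and once against the jars folder of the Spark installation should show whether the two sides disagree, which is the situation the error message describes.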
I have a Flink jar that sinks a DataStream of a serializable data type to Elasticsearch and Cassandra, but it behaves differently than it does in a standalone context.
I have read about Netty conflicts with the Flink process, and I excluded Netty in the pom file, but it is still included in the package.
Any suggestions?
This is the exception:
java.lang.ClassCastException: io.netty.channel.epoll.EpollEventLoopGroup cannot be cast to io.netty.channel.EventLoopGroup
at com.datastax.driver.core.NettyUtil.newEventLoopGroupInstance(NettyUtil.java:134)
at com.datastax.driver.core.NettyOptions.eventLoopGroup(NettyOptions.java:99)
at com.datastax.driver.core.Connection$Factory.<init>(Connection.java:774)
at com.datastax.driver.core.Cluster$Manager.init(Cluster.java:1446)
at com.datastax.driver.core.Cluster.init(Cluster.java:159)
at com.datastax.driver.core.Cluster.connectAsync(Cluster.java:330)
at com.datastax.driver.core.Cluster.connectAsync(Cluster.java:305)
at com.datastax.driver.core.Cluster.connect(Cluster.java:247)
at it.almaviva.wtf.mms.integratemobilitystatusevent.repository.cassandra.ScheduledEventRepositoryImpl.findProgrammed(ScheduledEventRepositoryImpl.java:51)
at it.almaviva.wtf.mms.integratemobilitystatusevent.transformation.ObservedEventDelayProcessFunction.loadScheduledEvents(ObservedEventDelayProcessFunction.java:368)
at it.almaviva.wtf.mms.integratemobilitystatusevent.transformation.ObservedEventDelayProcessFunction.processElement(ObservedEventDelayProcessFunction.java:64)
at it.almaviva.wtf.mms.integratemobilitystatusevent.transformation.ObservedEventDelayProcessFunction.processElement(ObservedEventDelayProcessFunction.java:1)
at org.apache.flink.streaming.api.operators.KeyedProcessOperator.processElement(KeyedProcessOperator.java:94)
at org.apache.flink.streaming.runtime.io.StreamInputProcessor.processInput(StreamInputProcessor.java:207)
at org.apache.flink.streaming.runtime.tasks.OneInputStreamTask.run(OneInputStreamTask.java:69)
at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:264)
at org.apache.flink.runtime.taskmanager.Task.run(Task.java:718)
at java.lang.Thread.run(Thread.java:748)
Suppressed: java.lang.NullPointerException
at com.datastax.driver.core.Cluster$Manager.close(Cluster.java:1676)
at com.datastax.driver.core.Cluster$Manager.access$200(Cluster.java:1354)
at com.datastax.driver.core.Cluster.closeAsync(Cluster.java:566)
at com.datastax.driver.core.Cluster.close(Cluster.java:578)
at it.almaviva.wtf.mms.integratemobilitystatusevent.repository.cassandra.ScheduledEventRepositoryImpl.findProgrammed(ScheduledEventRepositoryImpl.java:68)
This is the pom file:
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
<name>Passing Event Process</name>
<name>Apache Development Snapshot Repository</name>
<!-- Execute "mvn clean package -Pbuild-jar" to build a jar file out of
this project! How to use the Flink Quickstart pom: a) Adding new dependencies:
You can add dependencies to the list below. Please check if the maven-shade-plugin
below is filtering out your dependency and remove the exclude from there.
b) Build a jar for running on the cluster: There are two options for creating
a jar from this project b.1) "mvn clean package" -> this will create a fat
jar which contains all dependencies necessary for running the jar created
by this pom in a cluster. The "maven-shade-plugin" excludes everything that
is provided on a running Flink cluster. b.2) "mvn clean package -Pbuild-jar"
-> This will also create a fat-jar, but with much nicer dependency exclusion
handling. This approach is preferred and leads to much cleaner jar files. -->
<!-- Apache Flink dependencies -->
<!-- This dependency is required to actually execute jobs. It is currently pulled in by
flink-streaming-java, but we explicitly depend on it to safeguard against future changes. -->
<!-- explicitly add a standard logging framework, as Flink does not have
a hard dependency on one specific framework by default -->
<!-- https://mvnrepository.com/artifact/ch.qos.logback/logback-classic -->
<!-- DATASTAX -->
<!-- Profile for packaging correct JAR files -->
<!-- DTO -->
<!-- disable the exclusion rules -->
<excludes combine.self="override"/>
<!-- We use the maven-shade plugin to create a fat jar that contains all
dependencies except flink and its transitive dependencies. The resulting
fat-jar can be executed on a cluster. Change the value of Program-Class if
your program entry point changes. -->
<!-- Run shade goal on package phase -->
<!-- This list contains all dependencies of flink-dist Everything
else will be packaged into the fat-jar -->
<!-- <exclude>org.apache.flink:flink-scala_2.10</exclude> -->
<!-- Also exclude very big transitive dependencies of Flink WARNING:
You have to remove these excludes if your code relies on other versions of
these dependencies. -->
<!-- exclude shaded google but include shaded curator -->
<!-- Do not copy the signatures in the META-INF folder. Otherwise,
this might cause SecurityExceptions when using the JAR. -->
<!-- If you want to use ./bin/flink run <quickstart jar> uncomment
the following lines. This will add a Main-Class entry to the manifest file -->
You can disable Netty's native epoll transport and force the default NIO-based transport by adding the JVM argument -Dcom.datastax.driver.FORCE_NIO=true.
In Flink, you have to set env.java.opts in conf/flink-conf.yaml to that argument.
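With that setting, the entry in conf/flink-conf.yaml would read env.java.opts: -Dcom.datastax.driver.FORCE_NIO=true. To verify that the flag actually reached the JVM running your code, a small check like the one below can be logged at startup. This is a hedged sketch: ForceNioCheck and forceNioEnabled are hypothetical names, and the check only reads the system property named above; it does not inspect the DataStax driver itself.

```java
class ForceNioCheck {
    // Property name taken from the JVM argument above.
    static final String FLAG = "com.datastax.driver.FORCE_NIO";

    /** Returns true when the JVM was started with -Dcom.datastax.driver.FORCE_NIO=true. */
    static boolean forceNioEnabled() {
        return Boolean.parseBoolean(System.getProperty(FLAG, "false"));
    }

    public static void main(String[] args) {
        System.out.println(FLAG + " = " + forceNioEnabled());
    }
}
```

If it reports false inside the job, the option was probably not picked up, for example because the processes were not restarted after the config change.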
I added the argument to conf/flink-conf.yaml and great, it works like a charm!!!!
I lost hours checking the pom file with my colleagues. :)
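As a complement to reading the pom, the runtime classpath can be asked directly how many copies of a class it contains. A ClassCastException between a class and its own supertype, as in the trace above, usually points to the two types coming from different copies or class loaders. The sketch below uses only the JDK; DuplicateClassFinder and copiesOf are hypothetical names, and the Netty class name is taken from the exception message.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URL;
import java.util.ArrayList;
import java.util.Enumeration;
import java.util.List;

class DuplicateClassFinder {
    /** Lists every location from which the given loader can see the class file. */
    static List<URL> copiesOf(String className, ClassLoader loader) {
        String resource = className.replace('.', '/') + ".class";
        List<URL> found = new ArrayList<>();
        try {
            Enumeration<URL> urls = loader.getResources(resource);
            while (urls.hasMoreElements()) {
                found.add(urls.nextElement());
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return found;
    }

    public static void main(String[] args) {
        String name = args.length > 0 ? args[0] : "io.netty.channel.EventLoopGroup";
        ClassLoader loader = Thread.currentThread().getContextClassLoader();
        List<URL> copies = copiesOf(name, loader);
        System.out.println(name + ": " + copies.size() + " copy/copies visible");
        for (URL url : copies) {
            System.out.println("  " + url);
        }
    }
}
```

More than one location in the output would suggest that the exclusion in the pom did not remove every copy of Netty from the packaged jar or the cluster classpath.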
I am getting a NoClassDefFoundError while using the Spark Streaming API. Here is my streaming code.
I know this is a problem with some missing jars and dependencies, but I couldn't figure out exactly which ones.
I am using Kafka 0.9.0 and Spark 1.6.1. Are these dependencies fine, or do I need to change them? I have attached the pom.xml below.
Here is the streaming API I am using.
JavaPairInputDStream<String, byte[]> directKafkaStream = KafkaUtils.createDirectStream(jsc, String.class,
byte[].class, StringDecoder.class, DefaultDecoder.class, kafkaParams, topicSet);
Here is my code. I receive the error at while (itr.hasNext()).
directKafkaStream.foreachRDD(rdd -> {
rdd.foreachPartition(itr -> {
try {
while (itr.hasNext()) {
java.lang.NoClassDefFoundError: org/apache/kafka/common/message/KafkaLZ4BlockOutputStream
Here is my POM.xml
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
<!-- http://mvnrepository.com/artifact/org.springframework/spring-core -->
<!-- http://mvnrepository.com/artifact/org.springframework/spring-jdbc -->
<!-- http://mvnrepository.com/artifact/org.apache.spark/spark-streaming_2.10 -->
<!-- https://mvnrepository.com/artifact/org.apache.spark/spark-streaming-kafka_2.10 -->
<!-- http://mvnrepository.com/artifact/ojdbc/ojdbc -->
<!-- <dependency> <groupId>ojdbc</groupId> <artifactId>ojdbc</artifactId> <version>14</version> </dependency>-->
<!-- https://mvnrepository.com/artifact/org.mongodb/mongo-java-driver -->
<!-- http://mvnrepository.com/artifact/org.springframework.data/spring-data-mongodb -->
<!-- https://mvnrepository.com/artifact/com.googlecode.json-simple/json-simple -->
<transformer implementation="org.apache.maven.plugins.shade.resource.AppendingTransformer">
<transformer implementation="org.apache.maven.plugins.shade.resource.ManifestResourceTransformer">
KafkaLZ4BlockOutputStream is in the kafka-clients jar.
Till kafka-clients version it is in org/apache/kafka/common/message/KafkaLZ4BlockOutputStream
From it is in /org/apache/kafka/common/record/
Though my kafka cluster version is .
I use a Maven pom like this to process Kafka with Spark Streaming:
But I get the error described above.
Then I added the following dependency, and it works:
I used kafka jar for version to resolve this issue.
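Because the class lives in a different package depending on the kafka-clients version, one quick way to see which layout your runtime classpath has is to try both names. The following is a minimal sketch using only the JDK: RelocatedClassProbe and firstAvailable are hypothetical names, and the two candidate class names are the package locations mentioned above.

```java
class RelocatedClassProbe {
    /** Returns the first candidate class name that can be loaded, or null if none can. */
    static String firstAvailable(String... candidates) {
        ClassLoader loader = RelocatedClassProbe.class.getClassLoader();
        for (String name : candidates) {
            try {
                // initialize=false: resolve the class without running static initializers.
                Class.forName(name, false, loader);
                return name;
            } catch (ClassNotFoundException | LinkageError e) {
                // Not present under this name; try the next candidate.
            }
        }
        return null;
    }

    public static void main(String[] args) {
        String found = firstAvailable(
                "org.apache.kafka.common.message.KafkaLZ4BlockOutputStream",
                "org.apache.kafka.common.record.KafkaLZ4BlockOutputStream");
        System.out.println(found == null
                ? "KafkaLZ4BlockOutputStream not found in either package"
                : "Found " + found);
    }
}
```

If the name it finds is not the one in the NoClassDefFoundError, the kafka-clients jar on the classpath is from a different version than the one the Spark Kafka integration expects.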