Consume Kafka in Spark Streaming (Spark 2.0)

I found there are two methods to consume a Kafka topic in Spark Streaming (Spark 2.0):
1) using KafkaUtils.createDirectStream to get a DStream every k seconds; please refer to this document
2) using sqlContext.read.format("json").stream("kafka://KAFKA_HOST") to create an infinite DataFrame with Spark 2.0's new feature, Structured Streaming; the related doc is here
Method 1) works, but 2) doesn't; I get the following error:
Exception in thread "main" java.lang.NoSuchMethodError: org.apache.spark.sql.DataFrameReader.stream(Ljava/lang/String;)Lorg/apache/spark/sql/Dataset;
...
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:729)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:185)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:210)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:124)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
My questions are:
What is "kafka://KAFKA_HOST" referring to?
How should I fix this problem?
Thank you in advance!

Spark 2.0 doesn't yet support Kafka as a source of infinite DataFrames/Datasets. Support is planned for 2.1.
Edit: (6.12.2016)
Kafka 0.10 is now experimentally supported in Spark 2.0.2:
val ds1 = spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "host1:port1,host2:port2")
.option("subscribe", "topic1")
.load()
ds1
.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
.as[(String, String)]
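If you just want to confirm that records are arriving, here is a minimal follow-up sketch (building on the ds1 above; the console sink is used here only for verification):
import spark.implicits._  // needed for .as[(String, String)]

// Cast the raw Kafka bytes to strings and print each micro-batch to the console.
val stringStream = ds1
  .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
  .as[(String, String)]

val query = stringStream
  .writeStream
  .format("console")
  .outputMode("append")
  .start()

query.awaitTermination()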

Related

Apache Spark with kafka stream - Missing Kafka

I have been trying to set up Apache Spark with Kafka and wrote a simple program locally; it is failing and I am not able to figure out the cause from debugging.
build.gradle.kts
implementation ("org.jetbrains.kotlin:kotlin-stdlib:1.4.0")
implementation ("org.jetbrains.kotlinx.spark:kotlin-spark-api-3.0.0_2.12:1.0.0-preview1")
compileOnly("org.apache.spark:spark-sql_2.12:3.0.0")
implementation("org.apache.kafka:kafka-clients:3.0.0")
Main function code is
val spark = SparkSession
.builder()
.master("local[*]")
.appName("Ship metrics").orCreate
val shipmentDataFrame = spark
.readStream()
.format("kafka")
.option("kafka.bootstrap.servers", "localhost:9092")
.option("subscribe", "test")
.option("includeHeaders", "true")
.load()
val query = shipmentDataFrame.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
query.writeStream()
.format("console")
.outputMode("append")
.start()
.awaitTermination()
and I am getting this error:
Exception in thread "main" org.apache.spark.sql.AnalysisException: Failed to find data source: kafka. Please deploy the application as per the deployment section of "Structured Streaming + Kafka Integration Guide".;
at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:666)
at org.apache.spark.sql.streaming.DataStreamReader.load(DataStreamReader.scala:194)
at com.tgt.ff.axon.shipmetriics.stream.ShipmentStream.run(ShipmentStream.kt:23)
at com.tgt.ff.axon.shipmetriics.ApplicationKt.main(Application.kt:12)
21/12/25 22:22:56 INFO SparkContext: Invoking stop() from shutdown hook
The Kotlin API for Spark by JetBrains (https://github.com/Kotlin/kotlin-spark-api) has support for streaming since the 1.1.0 update.
There is also an example with a Kafka stream which might be of help to you: https://github.com/Kotlin/kotlin-spark-api/blob/spark-3.2/examples/src/main/kotlin/org/jetbrains/kotlinx/spark/examples/streaming/KotlinDirectKafkaWordCount.kt
It does use the Spark DStream API instead of the Spark Structured Streaming API you appear to be using.
You can, of course, also still use the Structured Streaming one if you prefer that, but then it needs to be deployed as described here.
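Note that the kafka source is not bundled with spark-sql itself; it lives in the separate spark-sql-kafka-0-10 artifact, which is what the "Failed to find data source: kafka" message is pointing at. Assuming Spark 3.0.0 with Scala 2.12 as in your build file, adding implementation("org.apache.spark:spark-sql-kafka-0-10_2.12:3.0.0") to build.gradle.kts (or passing the same coordinate to spark-submit via --packages) should make the format resolvable.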

Cassandra Sink for PySpark Structured Streaming from Kafka topic

I want to write Structured Streaming data into Cassandra using the PySpark Structured Streaming API.
My data flow is as follows:
REST API -> Kafka -> Spark Structured Streaming (PySpark) -> Cassandra
Sources and versions are below:
Spark version: 2.4.3
DataStax DSE: 6.7.6-1
initialize spark:
spark = SparkSession.builder\
.master("local[*]")\
.appName("Analytics")\
.config("kafka.bootstrap.servers", "localhost:9092")\
.config("spark.cassandra.connection.host","localhost:9042")\
.getOrCreate()
Subscribe to the topic from Kafka:
df = spark.readStream.format("kafka")\
.option("kafka.bootstrap.servers", "localhost:9092")\
.option("subscribe", "topic") \
.load()
Write into Cassandra:
w_df_3 = df...
write_db = w_df_3.writeStream \
.option("checkpointLocation", '/tmp/check_point/') \
.format("org.apache.spark.sql.cassandra") \
.option("keyspace", "analytics") \
.option("table", "table") \
.outputMode(outputMode="update")\
.start()
executed with the following command:
$spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.0,datastax:spark-cassandra-connector:2.4.0-s_2.11 Analytics.py localhost:9092 topic
I am facing the below exception during writeStream into Cassandra:
py4j.protocol.Py4JJavaError: An error occurred while calling o81.start.
: java.lang.UnsupportedOperationException: Data source org.apache.spark.sql.cassandra does not support streamed writing
at org.apache.spark.sql.execution.datasources.DataSource.createSink(DataSource.scala:297)
at org.apache.spark.sql.streaming.DataStreamWriter.start(DataStreamWriter.scala:322)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)
Could anyone help me out with how to resolve this and proceed further? Any help will be appreciated.
Thanks in advance.
As I mentioned in the comment, if you're using DSE, you can use OSS Apache Spark with the so-called BYOS (bring your own Spark) jar - a special jar that contains DataStax's version of the Spark Cassandra Connector (SCC) with direct support for Structured Streaming.
Since SCC 2.5.0, support for Structured Streaming is also available in the open-source version, so you can simply use writeStream with the Cassandra format. SCC 2.5.0 also contains a lot of features previously not available in open source, such as additional optimizations. There is a blog post that describes them in great detail.
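For illustration, here is a minimal sketch of that direct sink in Scala (the PySpark calls are analogous); it assumes the SCC 2.5.0+ jar is on the classpath, with parsedDf standing in for your transformed streaming DataFrame (w_df_3 in the question) and the keyspace and table names taken from the question:
// With SCC 2.5.0+, the cassandra format can be used directly as a structured streaming sink.
val query = parsedDf
  .writeStream
  .format("org.apache.spark.sql.cassandra")
  .option("checkpointLocation", "/tmp/check_point/")
  .option("keyspace", "analytics")
  .option("table", "table")
  .outputMode("update")
  .start()

query.awaitTermination()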
Thanks a lot for your response.
I have implemented it using the foreachBatch sink instead of a direct sink.
# save_to_cassandra is a user-defined function taking (batch_df, batch_id) that writes each micro-batch to Cassandra
w_df_3.writeStream\
.trigger(processingTime='5 seconds')\
.outputMode('update')\
.foreachBatch(save_to_cassandra)\
.start()
It's working. Thank you all.

Spark Dataframe to Kafka

I am trying to stream a Spark DataFrame to a Kafka consumer. I am unable to do so; can you please advise me?
I am able to pull the data from the Kafka producer into Spark, and I have performed some manipulation. After manipulating the data, I want to stream it back to Kafka (the consumer).
Here is an example of producing to Kafka from a streaming query, but the batch version is almost identical.
Streaming from a source to Kafka:
val ds = df
.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
.writeStream
.format("kafka")
.option("kafka.bootstrap.servers", "host1:port1,host2:port2")
.option("topic", "topic1")
.start()
Writing a static DataFrame (not streamed from a source) to Kafka:
df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
.write
.format("kafka")
.option("kafka.bootstrap.servers", "host1:port1,host2:port2")
.option("topic", "topic1")
.save()
Please keep in mind that:
each row will be a message (see the sketch below for shaping rows into key and value columns);
the DataFrame must be a streaming DataFrame for writeStream; if you have a static DataFrame then use the static version.
Take a look at the basic documentation: https://spark.apache.org/docs/2.2.0/structured-streaming-kafka-integration.html
It sounds like you have a static DataFrame that isn't streaming from a source.
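The Kafka sink expects string or binary key and value columns, so you usually have to shape your manipulated DataFrame into that form before writing. A minimal sketch, assuming a hypothetical static DataFrame df with an id column plus arbitrary other fields:
import org.apache.spark.sql.functions.{col, struct, to_json}

// Use the id as the message key and serialize the whole row to JSON as the message value.
val kafkaReady = df.select(
  col("id").cast("string").as("key"),
  to_json(struct(df.columns.map(col): _*)).as("value")
)

kafkaReady.write
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
  .option("topic", "topic1")
  .save()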

Spark Structured Streaming Batch

I am running a batch query with Spark Structured Streaming. The below code snippet throws an error saying "kafka is not a valid Spark SQL Data Source". The version I am using is spark-sql-kafka-0-10_2.10. Your help is appreciated. Thanks.
Dataset<Row> df = spark
.read()
.format("kafka")
.option("kafka.bootstrap.servers", "*****")
.option("subscribePattern", "test.*")
.option("startingOffsets", "earliest")
.option("endingOffsets", "latest")
.load();
Exception in thread "main" org.apache.spark.sql.AnalysisException: kafka is not a valid Spark SQL Data Source.;
I had the same problem, and like me you are using read instead of readStream. Changing spark.read() to spark.readStream worked fine for me.
Use the spark-submit mechanism and pass along --jars spark-sql-kafka-0-10_2.11-2.1.1.jar
Adjust the Kafka, Scala, and Spark versions of that library according to your own situation.

How to write streaming dataset to Kafka?

I'm trying to do some enrichment of the topic data, and therefore to read from Kafka and sink back to Kafka using Spark Structured Streaming.
val ds = spark.readStream
.format("kafka")
.option("kafka.bootstrap.servers", bootstrapServers)
.option("group.id", groupId)
.option("subscribe", "topicname")
.load()
val enriched = ds.select("key", "value", "topic").as[(String, String, String)]
  .map(record => enrich(record._1, record._2, record._3))
val query = enriched.writeStream
.format("kafka")
.option("kafka.bootstrap.servers", bootstrapServers)
.option("group.id", groupId)
.option("topic", "desttopic")
.start()
But I'm getting an exception:
Exception in thread "main" java.lang.UnsupportedOperationException: Data source kafka does not support streamed writing
at org.apache.spark.sql.execution.datasources.DataSource.createSink(DataSource.scala:287)
at org.apache.spark.sql.streaming.DataStreamWriter.start(DataStreamWriter.scala:266)
at kafka_bridge.KafkaBridge$.main(KafkaBridge.scala:319)
at kafka_bridge.KafkaBridge.main(KafkaBridge.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at com.intellij.rt.execution.application.AppMain.main(AppMain.java:147)
Any workarounds?
As T. Gawęda mentioned, there is no kafka format to write streaming datasets to Kafka (i.e. a Kafka sink).
The currently recommended solution in Spark 2.1 is to use foreach operator.
The foreach operation allows arbitrary operations to be computed on the output data. As of Spark 2.1, this is available only for Scala and Java. To use this, you will have to implement the interface ForeachWriter (Scala/Java docs), which has methods that get called whenever there is a sequence of rows generated as output after a trigger. Note the following important points.
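A minimal sketch of such a ForeachWriter, assuming the enriched dataset has been shaped into (String, String) key/value pairs; the KafkaSinkWriter class and the producer settings below are illustrative, not part of Spark:
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
import org.apache.spark.sql.ForeachWriter

// Opens one producer per partition and trigger, sends every row as a record, then closes it.
class KafkaSinkWriter(servers: String, topic: String) extends ForeachWriter[(String, String)] {
  var producer: KafkaProducer[String, String] = _

  override def open(partitionId: Long, version: Long): Boolean = {
    val props = new Properties()
    props.put("bootstrap.servers", servers)
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    producer = new KafkaProducer[String, String](props)
    true
  }

  override def process(record: (String, String)): Unit = {
    producer.send(new ProducerRecord[String, String](topic, record._1, record._2))
  }

  override def close(errorCause: Throwable): Unit = {
    if (producer != null) producer.close()
  }
}

val query = enriched
  .writeStream
  .foreach(new KafkaSinkWriter(bootstrapServers, "desttopic"))
  .start()

Because open/process/close runs per partition per trigger, the producer is created on the executors rather than being serialized from the driver.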
Spark 2.1 (which is currently the latest release of Spark) doesn't have a built-in Kafka sink. The next release, 2.2, will have a Kafka writer; see this commit.
A Kafka sink is the same thing as a Kafka writer.
Try this
ds.map(_.toString.getBytes).toDF("value")
.writeStream
.format("kafka")
.option("kafka.bootstrap.servers", "localhost:9092"))
.option("topic", topic)
.start
.awaitTermination()
