Spark Dataframe to Kafka - apache-spark

I am trying to stream the Spark Dataframe to Kafka consumer. I am unable to do , Can you please advice me.
I am able to pick the data from Kafka producer to Spark , and I have performed some manipulation, After manipulating the data , I am interested to stream it back to Kafka (Consumer).

Here is an example of producing to kafka in streaming, but the batch version is almost identical
streaming from a source to kafka:
val ds = df
.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
.writeStream
.format("kafka")
.option("kafka.bootstrap.servers", "host1:port1,host2:port2")
.option("topic", "topic1")
.start()
writing a static dataframe (not streamed from a source) to kafka
df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
.write
.format("kafka")
.option("kafka.bootstrap.servers", "host1:port1,host2:port2")
.option("topic", "topic1")
.save()
please keep in mind that
each row will be a message.
the dataframe must be a streaming dataframe. If you have a static dataframe then use the static version.
take a look at the basic documentation: https://spark.apache.org/docs/2.2.0/structured-streaming-kafka-integration.html
it sounds like you have a static dataframe, that isn't streaming from a source.

Related

Kafka Sink seems not to be working properly

I have a streaming job which is supposed to read some data from kafka, transform it to desired form using aggregations, and push it back to kafka. When I've used Kafka Source + Console Sink, everything worked fine. The problems started when I started using Kafka Sink instead of Console - messages are not being sent to the topic. When I try to push messages to the same topic using kafka-console-producer, everything is working fine. Note that I am trying to run this job in Intellij. I am using Spark 3.2
Here is my kafka-reader code:
spark.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "localhost:9092")
.option("subscribe", "demo-topic")
.option("startingOffsets", "earliest")
.load()
And kafka-writer:
val queryX = df
.select(to_json(struct("*")).as("value"))
.withColumn("key", lit("key"))
.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
.writeStream
.outputMode("update")
.format("kafka")
.option("kafka.bootstrap.servers", "localhost:9092")
.option("topic", "result-data")
.option("checkpointLocation", "/temp/checkpoint")
.start()
queryX.awaitTermination()
checkpointLocation is set to dir in my local file system.
Is there anything that I am doing wrong here?

Using two WriteStreams in same spark structured streaming job

I have a scenario where I would like to save the same streaming dataframe to two different streaming sinks.
I have created a streaming dataframe which I need to send to both Kafka topic and delta lake.
I thought of using forEachBatch, but looks like it doesn't support multiple STREAMING SINKS.
Also, I tried using spark session.awaitAnyTermination() with multiple write streams. But the second stream is not getting processed !
Is there a way through which we can achieve this ?!
This is my code:
I am reading from Kafka stream and creating a single streaming dataframe.
val df = spark.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "localhost:9092")
.option("subscribe", "ingestionTopic1")
.load()
df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)").as[(String, String)]
writing the above dataframe to a Kafka topic
val ds1 = df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
.writeStream
.format("kafka")
.option("kafka.bootstrap.servers", "localhost:9082")
.option("topic", "outputTopic1")
.start()
writing the same streaming dataframe to delta lake
val ds2 = df.format("delta")
.outputMode("append")
.option("checkpointLocation", "/test/delta/events/_checkpoints/etlflow")
.start("/test/delta/events")
ds1.awaitTermination
ds2.awaitTermination
There are a few things you need to follow to use one input stream for multiple output streams:
You need to make sure to have two different checkpointLocations in the two output streams.
Furthermore, you need to ensure to have the writeStream call also on your second output query.
Overall, it is important to start both of the queries before waiting for the termination of both queries. (You are already doing this)

Spark-Streaming hangs with kafka starting offset at earliest (Kafka 2, spark 2.4.3)

i'm having an issue with Spark-Streaming and Kafka. While running a sample program to consume from a Kafka topic and output micro-batched results to the terminal, my job seems to hang when i set the option:
df.option("startingOffsets", "earliest")
Starting the job from the latest offset works fine, results are printed to the terminal as each micro batch streams through.
I was thinking maybe this was a resouces issue--i'm trying to read from a topic with quite a bit of data. However i don't seem to have memory/cpu issues (running this job with a local[*] cluster). The job never really seems to start, but just hangs on the line:
19/09/17 15:21:37 INFO Metadata: Cluster ID: JFXVL24JQ3K4CEbE-VA58A
val sc = new SparkConf().setMaster("local[*]").setAppName("spark-test")
val streamContext = new StreamingContext(sc, Seconds(1))
val spark = SparkSession.builder().appName("spark-test")
.getOrCreate()
val topic = "topic.with.alotta.data"
//subscribe tokafka
val df = spark.readStream.format("kafka")
.option("kafka.bootstrap.servers", "127.0.0.1:9092")
.option("subscribe", topic)
.option("startingOffsets", "earliest")
.load()
//write
df.writeStream
.outputMode("append")
.format("console")
.option("truncate", "false")
.start()
.awaitTermination()
I'd expect to see results printed to the console....but, the application just seems to hang as I mentioned. Any thoughts? It feels like a spark resource issue (because i'm running a local "cluster" against a topic that has a lot of data. Is there something about the nature of streaming dataframes that i'm missing?
Writing to console causes all data to be collected in memory in the driver every trigger. Since you're currently not limiting the size of your batches, this means the entire topic contents is being accumulated in the driver. See https://spark.apache.org/docs/2.4.3/structured-streaming-programming-guide.html#output-sinks
Setting a limit on your batch sizes should fix your issue.
Try adding the maxOffsetsPerTrigger setting when reading from Kafka...
val df = spark.readStream.format("kafka")
.option("kafka.bootstrap.servers", "127.0.0.1:9092")
.option("subscribe", topic)
.option("startingOffsets", "earliest")
.option("maxOffsetsPerTrigger", 1000)
.load()
See https://spark.apache.org/docs/2.4.3/structured-streaming-kafka-integration.html for details.

Spark Streaming failing due to error on a different Kafka topic than the one being read

For the following write topic/read topic air2008rand tandem :
import org.apache.spark.sql.streaming.Trigger
(spark.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "localhost:9092")
.option("startingOffsets", "earliest")
.option("subscribe", "air2008rand")
.load()
.groupBy('value.cast("string").as('key))
.agg(count("*").cast("string") as 'value)
.writeStream
.format("kafka")
.option("kafka.bootstrap.servers", "localhost:9092")
.option("startingOffsets", "earliest")
.option("includeTimestamp", true)
.option("topic","t1")
.trigger(Trigger.ProcessingTime("2 seconds"))
.outputMode("update")
.option("checkpointLocation","/tmp/cp")
.start)
There is an error generated due to a a different topic air2008m1-0:
scala> 19/07/14 13:27:22 ERROR MicroBatchExecution: Query [id = 711d44b2-3224-4493-8677-e5c8cc4f3db4, runId = 68a3519a-e9cf-4a82-9d96-99be833227c0]
terminated with error
java.lang.IllegalStateException: Set(air2008m1-0) are gone.
Some data may have been missed.
Some data may have been lost because they are not available in Kafka any more; either the
data was aged out by Kafka or the topic may have been deleted before all the data in the
topic was processed. If you don't want your streaming query to fail on such cases, set the
source option "failOnDataLoss" to "false".
at org.apache.spark.sql.kafka010.KafkaMicroBatchReader.org$apache$spark$sql$kafka010$KafkaMicroBatchReader$$reportDataLoss(KafkaMicroBatchReader.scala:261)
at org.apache.spark.sql.kafka010.KafkaMicroBatchReader.planInputPartitions(KafkaMicroBatchReader.scala:124)
at org.apache.spark.sql.execution.datasources.v2.DataSourceV2ScanExec.partitions$lzycompute(DataSourceV2ScanExec.scala:76)
at org.apache.spark.sql.execution.datasources.v2.DataSourceV2ScanExec.partitions(DataSourceV2ScanExec.scala:75)
at org.apache.spark.sql.execution.datasources.v2.DataSourceV2ScanExec.outputPartitioning(DataSourceV2ScanExec.scala:65)
This behavior is repeatable by stopping the read/write code (in spark-shell repl) and then re-running it.
Why is there "cross-talk" between different kafka topics here?
The problem is due to a checkpoint directory containing data from an earlier spark streaming operation. The resolution is to change the checkpoint directory.
The solution was found as a comment (from #jaceklaskowski himself) in this question [IllegalStateException]: Spark Structured Streaming is termination Streaming Query with Error

How to write streaming dataset to Kafka?

I'm trying to do some enrichment to the topics data. Therefore read from Kafka sink back to Kafka using Spark structured streaming.
val ds = spark.readStream
.format("kafka")
.option("kafka.bootstrap.servers", bootstrapServers)
.option("group.id", groupId)
.option("subscribe", "topicname")
.load()
val enriched = ds.select("key", "value", "topic").as[(String, String, String)].map(record => enrich(record._1,
record._2, record._3)
val query = enriched.writeStream
.format("kafka")
.option("kafka.bootstrap.servers", bootstrapServers)
.option("group.id", groupId)
.option("topic", "desttopic")
.start()
But im getting an exception:
Exception in thread "main" java.lang.UnsupportedOperationException: Data source kafka does not support streamed writing
at org.apache.spark.sql.execution.datasources.DataSource.createSink(DataSource.scala:287)
at org.apache.spark.sql.streaming.DataStreamWriter.start(DataStreamWriter.scala:266)
at kafka_bridge.KafkaBridge$.main(KafkaBridge.scala:319)
at kafka_bridge.KafkaBridge.main(KafkaBridge.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at com.intellij.rt.execution.application.AppMain.main(AppMain.java:147)
Any workarounds?
As T. Gawęda mentioned, there is no kafka format to write streaming datasets to Kafka (i.e. a Kafka sink).
The currently recommended solution in Spark 2.1 is to use foreach operator.
The foreach operation allows arbitrary operations to be computed on the output data. As of Spark 2.1, this is available only for Scala and Java. To use this, you will have to implement the interface ForeachWriter (Scala/Java docs), which has methods that get called whenever there is a sequence of rows generated as output after a trigger. Note the following important points.
Spark 2.1 (which is currently the latest release of Spark) doesn't have it. The next release - 2.2 - will have Kafka Writer, see this commit.
Kafka Sink is the same as Kafka Writer.
Try this
ds.map(_.toString.getBytes).toDF("value")
.writeStream
.format("kafka")
.option("kafka.bootstrap.servers", "localhost:9092"))
.option("topic", topic)
.start
.awaitTermination()

Resources