I have a streaming job which is supposed to read some data from kafka, transform it to desired form using aggregations, and push it back to kafka. When I've used Kafka Source + Console Sink, everything worked fine. The problems started when I started using Kafka Sink instead of Console - messages are not being sent to the topic. When I try to push messages to the same topic using kafka-console-producer, everything is working fine. Note that I am trying to run this job in Intellij. I am using Spark 3.2
Here is my kafka-reader code:
spark.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "localhost:9092")
.option("subscribe", "demo-topic")
.option("startingOffsets", "earliest")
.load()
And kafka-writer:
val queryX = df
.select(to_json(struct("*")).as("value"))
.withColumn("key", lit("key"))
.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
.writeStream
.outputMode("update")
.format("kafka")
.option("kafka.bootstrap.servers", "localhost:9092")
.option("topic", "result-data")
.option("checkpointLocation", "/temp/checkpoint")
.start()
queryX.awaitTermination()
checkpointLocation is set to dir in my local file system.
Is there anything that I am doing wrong here?
Related
I want to extract streaming data from a Kafka cluster in batches of one hour, so I run a script every hour, having set writeStream to .trigger(once=True) and the startingOffsets set to earliest, like this:
df = spark.readStream \
.format("kafka") \
.option("kafka.bootstrap.servers",
config.get("kafka_servers")) \
.option("subscribe", config.get("topic_list")) \
.option("startingOffsets", "earliest") \
.load()
df.writeStream \
.format("parquet") \
.option("checkpointLocation", config.get("checkpoint_path")) \
.option("path", config.get("s3_path_raw")) \
.trigger(once=True)
.partitionBy('date', 'hour') \
.start()
But every time the script gets triggered it only writes to the S3 the messages that are coming from the Kafka cluster in that precise moment, instead of taking all the messages from the last hour as I was expecting it to do.
What might be the problem?
Edit: I should mention that the kafka cluster retention is set to 24 hours
Option 1:
.trigger(once=True) is supposed to process only one patch of data.
Please try to replace it with .trigger(availableNow=True)
Option 2:
To have an up and running job with a 1-hour processing interval; .trigger(processingTime='60 minutes')
Also, you will need to set the following option while reading the stream
.option("failOnDataLoss", "false")
I have a scenario where I would like to save the same streaming dataframe to two different streaming sinks.
I have created a streaming dataframe which I need to send to both Kafka topic and delta lake.
I thought of using forEachBatch, but looks like it doesn't support multiple STREAMING SINKS.
Also, I tried using spark session.awaitAnyTermination() with multiple write streams. But the second stream is not getting processed !
Is there a way through which we can achieve this ?!
This is my code:
I am reading from Kafka stream and creating a single streaming dataframe.
val df = spark.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "localhost:9092")
.option("subscribe", "ingestionTopic1")
.load()
df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)").as[(String, String)]
writing the above dataframe to a Kafka topic
val ds1 = df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
.writeStream
.format("kafka")
.option("kafka.bootstrap.servers", "localhost:9082")
.option("topic", "outputTopic1")
.start()
writing the same streaming dataframe to delta lake
val ds2 = df.format("delta")
.outputMode("append")
.option("checkpointLocation", "/test/delta/events/_checkpoints/etlflow")
.start("/test/delta/events")
ds1.awaitTermination
ds2.awaitTermination
There are a few things you need to follow to use one input stream for multiple output streams:
You need to make sure to have two different checkpointLocations in the two output streams.
Furthermore, you need to ensure to have the writeStream call also on your second output query.
Overall, it is important to start both of the queries before waiting for the termination of both queries. (You are already doing this)
i'm having an issue with Spark-Streaming and Kafka. While running a sample program to consume from a Kafka topic and output micro-batched results to the terminal, my job seems to hang when i set the option:
df.option("startingOffsets", "earliest")
Starting the job from the latest offset works fine, results are printed to the terminal as each micro batch streams through.
I was thinking maybe this was a resouces issue--i'm trying to read from a topic with quite a bit of data. However i don't seem to have memory/cpu issues (running this job with a local[*] cluster). The job never really seems to start, but just hangs on the line:
19/09/17 15:21:37 INFO Metadata: Cluster ID: JFXVL24JQ3K4CEbE-VA58A
val sc = new SparkConf().setMaster("local[*]").setAppName("spark-test")
val streamContext = new StreamingContext(sc, Seconds(1))
val spark = SparkSession.builder().appName("spark-test")
.getOrCreate()
val topic = "topic.with.alotta.data"
//subscribe tokafka
val df = spark.readStream.format("kafka")
.option("kafka.bootstrap.servers", "127.0.0.1:9092")
.option("subscribe", topic)
.option("startingOffsets", "earliest")
.load()
//write
df.writeStream
.outputMode("append")
.format("console")
.option("truncate", "false")
.start()
.awaitTermination()
I'd expect to see results printed to the console....but, the application just seems to hang as I mentioned. Any thoughts? It feels like a spark resource issue (because i'm running a local "cluster" against a topic that has a lot of data. Is there something about the nature of streaming dataframes that i'm missing?
Writing to console causes all data to be collected in memory in the driver every trigger. Since you're currently not limiting the size of your batches, this means the entire topic contents is being accumulated in the driver. See https://spark.apache.org/docs/2.4.3/structured-streaming-programming-guide.html#output-sinks
Setting a limit on your batch sizes should fix your issue.
Try adding the maxOffsetsPerTrigger setting when reading from Kafka...
val df = spark.readStream.format("kafka")
.option("kafka.bootstrap.servers", "127.0.0.1:9092")
.option("subscribe", topic)
.option("startingOffsets", "earliest")
.option("maxOffsetsPerTrigger", 1000)
.load()
See https://spark.apache.org/docs/2.4.3/structured-streaming-kafka-integration.html for details.
For the following write topic/read topic air2008rand tandem :
import org.apache.spark.sql.streaming.Trigger
(spark.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "localhost:9092")
.option("startingOffsets", "earliest")
.option("subscribe", "air2008rand")
.load()
.groupBy('value.cast("string").as('key))
.agg(count("*").cast("string") as 'value)
.writeStream
.format("kafka")
.option("kafka.bootstrap.servers", "localhost:9092")
.option("startingOffsets", "earliest")
.option("includeTimestamp", true)
.option("topic","t1")
.trigger(Trigger.ProcessingTime("2 seconds"))
.outputMode("update")
.option("checkpointLocation","/tmp/cp")
.start)
There is an error generated due to a a different topic air2008m1-0:
scala> 19/07/14 13:27:22 ERROR MicroBatchExecution: Query [id = 711d44b2-3224-4493-8677-e5c8cc4f3db4, runId = 68a3519a-e9cf-4a82-9d96-99be833227c0]
terminated with error
java.lang.IllegalStateException: Set(air2008m1-0) are gone.
Some data may have been missed.
Some data may have been lost because they are not available in Kafka any more; either the
data was aged out by Kafka or the topic may have been deleted before all the data in the
topic was processed. If you don't want your streaming query to fail on such cases, set the
source option "failOnDataLoss" to "false".
at org.apache.spark.sql.kafka010.KafkaMicroBatchReader.org$apache$spark$sql$kafka010$KafkaMicroBatchReader$$reportDataLoss(KafkaMicroBatchReader.scala:261)
at org.apache.spark.sql.kafka010.KafkaMicroBatchReader.planInputPartitions(KafkaMicroBatchReader.scala:124)
at org.apache.spark.sql.execution.datasources.v2.DataSourceV2ScanExec.partitions$lzycompute(DataSourceV2ScanExec.scala:76)
at org.apache.spark.sql.execution.datasources.v2.DataSourceV2ScanExec.partitions(DataSourceV2ScanExec.scala:75)
at org.apache.spark.sql.execution.datasources.v2.DataSourceV2ScanExec.outputPartitioning(DataSourceV2ScanExec.scala:65)
This behavior is repeatable by stopping the read/write code (in spark-shell repl) and then re-running it.
Why is there "cross-talk" between different kafka topics here?
The problem is due to a checkpoint directory containing data from an earlier spark streaming operation. The resolution is to change the checkpoint directory.
The solution was found as a comment (from #jaceklaskowski himself) in this question [IllegalStateException]: Spark Structured Streaming is termination Streaming Query with Error
I am trying to stream the Spark Dataframe to Kafka consumer. I am unable to do , Can you please advice me.
I am able to pick the data from Kafka producer to Spark , and I have performed some manipulation, After manipulating the data , I am interested to stream it back to Kafka (Consumer).
Here is an example of producing to kafka in streaming, but the batch version is almost identical
streaming from a source to kafka:
val ds = df
.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
.writeStream
.format("kafka")
.option("kafka.bootstrap.servers", "host1:port1,host2:port2")
.option("topic", "topic1")
.start()
writing a static dataframe (not streamed from a source) to kafka
df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
.write
.format("kafka")
.option("kafka.bootstrap.servers", "host1:port1,host2:port2")
.option("topic", "topic1")
.save()
please keep in mind that
each row will be a message.
the dataframe must be a streaming dataframe. If you have a static dataframe then use the static version.
take a look at the basic documentation: https://spark.apache.org/docs/2.2.0/structured-streaming-kafka-integration.html
it sounds like you have a static dataframe, that isn't streaming from a source.