I use the spark to write data to kafka in this way.
df.write(). format("kafka"). save()
can I control the writing speed to kafka to avoid pressure on kafka?
Is there some options that helps to slow down the speed?
I think setting linger.ms to a non-zero value would help. As it controls the amount of time to wait for additional messages before sending the current batch. Code can look like the following
df.write.format("kafka").option("linger.ms", "100").save()
But this really depends on a lot of things. If your Kafka is 'big' enough and configured properly, I wouldn't worry too much about the speed. After all, kafka is designed to cope with this situation (traffic spike).
Generally, Structured Streaming will try to process data as fast as possible by default. There are options in each source to allow to control the processing rate, such as maxFilesPerTrigger in File source, and maxOffsetsPerTrigger in Kafka source.
val streamingETLQuery = cloudtrailEvents
.withColumn("date", $"timestamp".cast("date") // derive the date
.writeStream
.trigger(ProcessingTime("10 seconds")) // check for files every 10s
.format("parquet") // write as Parquet partitioned by date
.partitionBy("date")
.option("path", "/cloudtrail")
.option("checkpointLocation", "/cloudtrail.checkpoint/")
.start()
val df = spark.readStream
.format("text")
.option("maxFilesPerTrigger", 1)
.load("text-logs")
Read the following links for more details:
https://jaceklaskowski.gitbooks.io/spark-structured-streaming/spark-sql-streaming-KafkaSource.html
https://jaceklaskowski.gitbooks.io/spark-structured-streaming/spark-sql-streaming-FileStreamSource.html
https://databricks.com/blog/2017/01/19/real-time-streaming-etl-structured-streaming-apache-spark-2-1.html
http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#input-sources
http://spark.apache.org/docs/latest/structured-streaming-kafka-integration.html
Related
Background:
I have written a simple spark structured steaming app to move data from Kafka to S3. Found that in order to support exactly-once guarantee spark creates _spark_metadata folder, which ends up growing too large, when the streaming app runs for a long time the metadata folder grows so big that we start getting OOM errors. I want to get rid of metadata and checkpoint folders of Spark Structured Streaming and manage offsets myself.
How we managed offsets in Spark Streaming:
I have used val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges to get offsets in Spark Structured Streaming. But want to know how to get the offsets and other metadata to manage checkpointing ourself using Spark Structured Streaming. Do you have any sample program that implements checkpointing?
How we managed offsets in Spark Structured Streaming??
Looking at this JIRA https://issues-test.apache.org/jira/browse/SPARK-18258. looks like offsets are not provided. How should we go about?
The issue is in 6 hours size of metadata increased to 45MB and it grows till it reaches nearly 13 GB. Driver memory allocated is 5GB. At that time system crashes with OOM. Wondering how to avoid making this meta data grow so large? How to make metadata not log so much information.
Code:
1. Reading records from Kafka topic
Dataset<Row> inputDf = spark \
.readStream \
.format("kafka") \
.option("kafka.bootstrap.servers", "host1:port1,host2:port2") \
.option("subscribe", "topic1") \
.option("startingOffsets", "earliest") \
.load()
2. Use from_json API from Spark to extract your data for further transformation in a dataset.
Dataset<Row> dataDf = inputDf.select(from_json(col("value").cast("string"), EVENT_SCHEMA).alias("event"))
....withColumn("oem_id", col("metadata.oem_id"));
3. Construct a temp table of above dataset using SQLContext
SQLContext sqlContext = new SQLContext(sparkSession);
dataDf.createOrReplaceTempView("event");
4. Flatten events since Parquet does not support hierarchical data.
5. Store output in parquet format on S3
StreamingQuery query = flatDf.writeStream().format("parquet")
Dataset dataDf = inputDf.select(from_json(col("value").cast("string"), EVENT_SCHEMA).alias("event"))
.select("event.metadata", "event.data", "event.connection", "event.registration_event","event.version_event"
);
SQLContext sqlContext = new SQLContext(sparkSession);
dataDf.createOrReplaceTempView("event");
Dataset flatDf = sqlContext
.sql("select " + " date, time, id, " + flattenSchema(EVENT_SCHEMA, "event") + " from event");
StreamingQuery query = flatDf
.writeStream()
.outputMode("append")
.option("compression", "snappy")
.format("parquet")
.option("checkpointLocation", checkpointLocation)
.option("path", outputPath)
.partitionBy("date", "time", "id")
.trigger(Trigger.ProcessingTime(triggerProcessingTime))
.start();
query.awaitTermination();
For non-batch Spark Structured Streaming KAFKA integration:
Quote:
Structured Streaming ignores the offsets commits in Apache Kafka.
Instead, it relies on its own offsets management on the driver side which is responsible for distributing offsets to executors and
for checkpointing them at the end of the processing round (epoch or
micro-batch).
You need not worry if you follow the Spark KAFKA integration guides.
Excellent reference: https://www.waitingforcode.com/apache-spark-structured-streaming/apache-spark-structured-streaming-apache-kafka-offsets-management/read
For batch the situation is different, you need to manage that yourself and store the offsets.
UPDATE
Based on the comments I suggest the question is slightly different and advise you look at Spark Structured Streaming Checkpoint Cleanup. In addition to your updated comments and the fact that there is no error, I suggest you consukt this on metadata for Spark Structured Streaming https://www.waitingforcode.com/apache-spark-structured-streaming/checkpoint-storage-structured-streaming/read. Looking at the code, different to my style, but cannot see any obvious error.
I've set up a Spark structured streaming query that reads from a Kafka topic.
If the number of partitions in the topic is changed while the Spark query is running, Spark does not seem to notice and data on new partitions is not consumed.
Is there a way to tell Spark to check for new partitions in the same topic apart from stopping the query an restarting it?
EDIT:
I'm using Spark 2.4.4. I read from kafka as follows:
spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", kafkaURL)
.option("startingOffsets", "earliest")
.option("subscribe", topic)
.option("failOnDataLoss", value = false)
.load()
after some processing, I write to HDFS on a Delta Lake table.
Answering my own question. Kafka consumers check for new partitions/topic (in case of subscribing to topics with a pattern) every metadata.max.age.ms, whose default value is 300000 (5 minutes).
Since my test was lasting far less than that, I wouldn't notice the update. For tests, reduce the value to something smaller, e.g. 100 ms, by setting the following option of the DataStreamReader:
.option("kafka.metadata.max.age.ms", 100)
I have a Spark Structured Streaming job which is configured to read data from Kafka. Please go through the code to check the readStream() with parameters to read the latest data from Kafka.
I understand that readStream() reads from the first offset when a new query is started and not on resume.
But I don't know how to start a new query every time I restart my job in IntelliJ.
val kafkaStreamingDF = spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", AppProperties.getProp(AppConstants.PROPS_SERVICES_KAFKA_SERVERS))
.option("subscribe", AppProperties.getProp(AppConstants.PROPS_SDV_KAFKA_TOPICS))
.option("failOnDataLoss", "false")
.option("startingOffsets","earliest")
.load()
.selectExpr("CAST(value as STRING)", "CAST(topic as STRING)")
I have also tried setting the offsets by """{"topicA":{"0":0,"1":0}}"""
Following is my writestream
val query = kafkaStreamingDF
.writeStream
.format("console")
.start()
Every time I restart my job in IntelliJ IDE, logs show that the offset has been set to latest instead of 0 or earliest.
Is there way I can clean my checkpoint, in that case I don't know where the checkpoint directory is because in the above case I don't specify any checkpointing.
Kafka relies on the property auto.offset.reset to take care of the Offset Management.
The default is “latest,” which means that lacking a valid offset, the consumer will start reading from the newest records (records that were written after the consumer started running). The alternative is “earliest,” which means that lacking a valid offset, the consumer will read all the data in the partition, starting from the very beginning.
As per your question you want to read the entire data from the topic. So setting the "startingOffsets" to "earliest" should work. But, also make sure that you are setting the enable.auto.commit to false.
By setting enable.auto.commit to true means that offsets are committed automatically with a frequency controlled by the config auto.commit.interval.ms.
Setting this to true commits the offsets to Kafka automatically when messages are read from Kafka which doesn’t necessarily mean that Spark has finished processing those messages. To enable precise control for committing offsets, set Kafka parameter enable.auto.commit to false.
Try to set up .option("kafka.client.id", "XX"), to use a different client.id.
My ultimate goal is to see if a kafka topic is running and if the data in it is good, otherwise fail / throw an error
if I could pull just 100 messages, or pull for just 60 seconds I think I could accomplish what i wanted. But all the streaming examples / questions I have found online have no intention of shutting down the streaming connection.
Here is the best working code I have so far, that pulls data and displays it, but it keeps trying to pull for more data, and if I try to access it in the next line, it hasnt had a chance to pull the data yet. I assume I need some sort of call back. has anyone done something similar? is this the best way of going about this?
I am using databricks notebooks to run my code
import org.apache.spark.sql.functions.{explode, split}
val kafka = spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "<kafka server>:9092")
.option("subscribe", "<topic>")
.option("startingOffsets", "earliest")
.load()
val df = kafka.select(explode(split($"value".cast("string"), "\\s+")).as("word"))
display(df.select($"word"))
The trick is you don't need streaming at all. Kafka source supports batch queries, if you replace readStream with read and adjust startingOffsets and endingOffsets.
val df = spark
.read
.format("kafka")
... // Remaining options
.load()
You can find examples in the Kafka streaming documentation.
For streaming queries you can use once trigger, although it might not be the best choice in this case:
df.writeStream
.trigger(Trigger.Once)
... // Handle the output, for example with foreach sink (?)
You could also use standard Kafka client to fetch some data without starting SparkSession.
I am running Structured Streaming (2.1.0) on a 3-node yarn cluster and stream json records to parquet. My code fragment looks like this:
val query = ds.writeStream
.format("parquet")
.option("checkpointLocation", "/data/kafka_streaming.checkpoint")
.start("/data/kafka_streaming.parquet")
I notice it creates thousands of small files quickly for only 1,000 records. I suspect it has to do frequency of trigger. So I changed it:
val query = ds.writeStream
.format("parquet")
.option("checkpointLocation", "/data/kafka_streaming.checkpoint")
.**trigger(ProcessingTime("60 seconds"))**
.start("/data/kafka_streaming.parquet")
The difference is very obvious. Now I can see much smaller number of files created for the same number of records.
My question: Is there any way to have low latency for the trigger and keep smaller number of larger output files?