Kafka + spark streaming: kafka.common.OffsetOutOfRangeException - apache-spark

I'm new to this whole Kafka/Spark thing. I have Spark Streaming (PySpark) taking in data from a Kafka producer. It runs fine for a minute and then always throws a kafka.common.OffsetOutOfRangeException. The Kafka consumer is version 0.8 (0.10 is apparently not supported for PySpark). I have a master node with 3 workers on AWS Ubuntu 14.04. I don't know if this is relevant, but the Kafka messages here are relatively large (~1-10 kB) and I've adjusted the producer/broker/consumer configs accordingly. The data is being passed through fine, though possibly more slowly than the producer is producing it (this may be the source of the problem?).
A similar problem was solved by increasing the retention time/size here: Kafka OffsetOutOfRangeException
But my retention time is an hour and the size limit is 1 GB in each node's server.properties, and more importantly, changing the retention time/size makes no difference to Spark's time-to-failure.
Is there any other possibility for adjustment, maybe on the Spark Streaming configs? All the answers I see online have to do with Kafka provisioning, but it doesn't seem to make a difference in my case.
EDIT 1: I tried a) having multiple streams reading from the producer and b) slowing down the producer stream itself with time.sleep(1.0). Neither had a lasting effect.
i.e.
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

n_secs = 1
n_streams = 3  # one stream per worker
ssc = StreamingContext(sc, n_secs)
kds = [KafkaUtils.createDirectStream(ssc, ['test-video'], {
           'bootstrap.servers': 'localhost:9092',
           'group.id': 'test-video-group',
           'fetch.message.max.bytes': '15728640',
           'auto.offset.reset': 'largest'}) for _ in range(n_streams)]
stream = ssc.union(*kds)

Is it possible that your producer generates messages so quickly that 1 GB of retention is not enough on each broker? 1 GB is very low in all reality. After Spark Streaming decides the offset range it needs to process in the micro-batch and tries to retrieve the messages from the broker based on those offsets, the messages are already gone due to the size limit. Try increasing the broker retention size to something much bigger, like 100 GB, and see if that fixes your problem.
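For reference, a minimal sketch of the broker-side settings in play, in server.properties; the values are only illustrative, not a sizing recommendation for this particular cluster:
# server.properties (per broker); example values only
# Time-based retention: keep log segments for 24 hours instead of 1
log.retention.hours=24
# Size-based retention: note this limit applies per partition, not per broker
log.retention.bytes=107374182400
# How often the broker checks whether segments are eligible for deletion
log.retention.check.interval.ms=300000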

Related

PySpark Structured Streaming with Kafka - Scaling Consumers for multiple topics with different loads

We subscribed to 7 topics with spark.readStream in 1 single running spark app.
After transforming the event payloads, we save them with spark.writeStream to our database.
For one of the topics, the data is inserted only batch-wise (once a day) with a very high load. This delays our reading from all other topics, too. For example (Grafana), the delay between a produced and a consumed record stays below 1 minute across all topics the whole day. When the bulk topic receives its events, our delay increases up to 2 hours on all (!) topics.
How can we solve this? We already tried 2 successive readStreams (the bulk topic separately), but it didn't help.
Further info: We use 6 executors, 2 executor-cores. The topics have a different number of partitions (3 to 30). Structured Streaming Kafka Integration v0.10.0.
General question: How can we scale the consumers in spark structured streaming? Is 1 readStream equal to 1 consumer? or 1 executor? or what else?
Partitions are the main source of parallelism in Kafka, so I suggest you increase the number of partitions (at least for the topic that has performance issues). You may also tweak some of the consumer caching options mentioned in the docs. Try to keep the number of partitions a power of two (2^n). Finally, you could increase the size of the driver machine if possible.
I'm not completely sure, but I think Spark will try to keep the same number of consumers as there are partitions per topic. I also think the stream is actually always fetched from the Spark driver (not from the workers).
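If you are on a newer Spark version (2.4+), the Kafka source's minPartitions option is another lever: it asks Spark to split Kafka topic-partitions into more input tasks, so parallelism is no longer capped at the number of Kafka partitions. A hedged PySpark sketch; the broker address, topic name, and value are placeholders:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("consumer-scaling-sketch").getOrCreate()
stream = (spark.readStream.format("kafka")
          .option("kafka.bootstrap.servers", "localhost:9092")  # placeholder broker
          .option("subscribe", "bulk-topic")                    # placeholder topic
          .option("minPartitions", "60")  # ask for more Spark tasks than Kafka partitions
          .load())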
We found a solution for our problem:
After the change, our Grafana shows that the batch-data topic still peaks, but without blocking consumption on the other topics.
What we did:
We still have 1 Spark app. We used 2 successive spark.readStreams, but also added a sink for each.
In code:
priority_topic_stream = (spark.readStream.format('kafka')
    .options(..)
    .option('subscribe', ','.join([T1, T2, T3]))
    .load())
bulk_topic_stream = (spark.readStream.format('kafka')
    .options(..)
    .option('subscribe', BULK_TOPIC)
    .load())
priority_topic_stream.writeStream.foreachBatch(..).trigger(..).start()
bulk_topic_stream.writeStream.foreachBatch(..).trigger(..).start()
spark.streams.awaitAnyTermination()
To minimize the peak on the bulk stream we will try increasing its partitions, as advised by #partlov. But that would only have sped up consumption on the bulk stream; it would not have resolved the issue of it blocking our reads from the priority topics.

How to stream 100GB of data in Kafka topic?

So, in one of our Kafka topics there is close to 100 GB of data.
We are running Spark Structured Streaming to get the data into S3.
When the data volume is up to 10 GB, streaming runs fine and we are able to get the data into S3.
But with 100 GB, it takes forever to stream the data from Kafka.
Question: How does Spark Streaming read data from Kafka?
Does it take the entire data from current offset?
Or does it take in batch of some size?
Spark will work off consumer groups, just as any other Kafka consumer, but in batches. Therefore it takes as much data as possible (based on various Kafka consumer settings) from the last consumed offsets. In theory, if you have the same number of partitions and the same commit interval as in the 10 GB case, it should only take 10x longer to do 100 GB. You haven't stated how long that currently takes, but to some people 1 minute vs 10 minutes might seem like "forever", sure.
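Since you are on Structured Streaming, you can also bound how much each micro-batch pulls instead of letting one trigger try to drain the whole backlog. A hedged sketch; the broker, topic, bucket, and numbers are placeholders, and maxOffsetsPerTrigger is the relevant Kafka source option:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("kafka-to-s3-sketch").getOrCreate()
df = (spark.readStream.format("kafka")
      .option("kafka.bootstrap.servers", "broker1:9092")   # placeholder broker
      .option("subscribe", "big-topic")                    # placeholder topic
      .option("startingOffsets", "earliest")               # read the full backlog
      .option("maxOffsetsPerTrigger", "1000000")           # cap records per micro-batch
      .load())
query = (df.writeStream.format("parquet")
         .option("path", "s3a://my-bucket/output")              # placeholder bucket
         .option("checkpointLocation", "s3a://my-bucket/checkpoints")
         .start())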
I would recommend you plot the consumer lag over time using the kafka-consumer-groups command line tool combined with something like Burrow or Remora... If you notice an upward trend in the lag, then Spark is not consuming records fast enough.
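For example, assuming your consumer group commits offsets to Kafka (the group name below is a placeholder), a describe call shows per-partition lag:
# Watch the LAG column over repeated runs; a steadily growing lag
# means the consumers are not keeping up with the producers.
bin/kafka-consumer-groups.sh --bootstrap-server localhost:9092 \
  --describe --group <your-consumer-group>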
To overcome this, the first option would be to ensure that the Spark executors are evenly consuming all Kafka partitions.
You'll also want to make sure you're not doing major data transforms, other than simple filters and maps, between consuming and writing the records, as this also introduces lag.
For non-Spark approaches, I would like to point out that the Confluent S3 connector is also batch-y in that it'll only periodically flush to S3, but the consumption itself is still closer to real-time than Spark. I can verify that it's able to write very large S3 files (several GB in size), though, if the heap is large enough and the flush configurations are set to large values.
Secor by Pinterest is another option that requires no manual coding.
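For reference, a hedged example of the kind of S3 sink connector "flush configuration" mentioned above; the property names come from the Confluent S3 connector, while the values and names like big-topic/my-bucket are placeholders:
# Example S3 sink connector properties (illustrative values only)
connector.class=io.confluent.connect.s3.S3SinkConnector
topics=big-topic
s3.bucket.name=my-bucket
# records buffered per output file before flushing
flush.size=1000000
# or rotate on a time boundary
rotate.interval.ms=600000
# multipart upload part size in bytes
s3.part.size=104857600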

Spark streaming with Kafka: when recovering from checkpointing, all data is processed in only one micro batch

I'm running a Spark Streaming application that reads data from Kafka.
I have activated checkpointing to recover the job in case of failure.
The problem is that if the application fails, when it restarts it tries to execute all the data from the point of failure in only one micro batch.
This means that if a micro-batch usually receives 10,000 events from Kafka, and the job fails and restarts after 10 minutes, it will have to process one micro-batch of 100,000 events.
Now, if I want the recovery with checkpointing to be successful, I have to assign much more memory than I normally would.
Is it normal that, when restarting, Spark Streaming tries to execute all the past events from checkpointing at once or am I doing something wrong?
Many thanks.
If your application finds it difficult to process all events in one micro-batch after recovering from a failure, you can set the spark.streaming.kafka.maxRatePerPartition configuration in your Spark conf, either in spark-defaults.conf or inside your application.
i.e. if you believe your system/app can safely handle 10K events per second, and your Kafka topic has 2 partitions, add this line to spark-defaults.conf:
spark.streaming.kafka.maxRatePerPartition 5000
or add it inside your code:
val conf = new SparkConf()
conf.set("spark.streaming.kafka.maxRatePerPartition", "5000")
Additionally, I suggest you set this number a little bit higher and enable backpressure. This will try to stream data at a rate that doesn't destabilize your streaming app.
conf.set("spark.streaming.backpressure.enabled","true")
Update: there was a mistake earlier; the configuration is the maximum number of records per second per partition, not per minute.

spark streaming DirectKafkaInputDStream: kafka data source can easily stress the driver node

I am building a prototype with spark streaming 1.5.0. DirectKafkaInputDStream is used.
A simple stage that reads from Kafka via DirectKafkaInputDStream can't handle a massive amount of messages. The stage takes longer than the batch interval once the message rate reaches or exceeds a certain value, and that rate is much lower than I expect. (I have benchmarked my Kafka cluster separately with multiple consumer instances on different servers.)
JavaPairInputDStream<String, String> recipeDStream =
    KafkaUtils.createDirectStream(jssc,
        String.class,
        String.class,
        StringKeyDecoder.class,
        StringDecoder.class,
        kafkaParams, kafkaTopicsSet);
After reading this article, I got the impression that DirectKafkaInputDStream runs on the same node as the driver program. Is that true? If so, DirectKafkaInputDStream could easily be stressed, since it would read all messages on one node and then dispatch them to all executors.
That would mean JavaPairReceiverInputDStream has better performance for handling high-volume data, since the receivers run on multiple executor instances.
Am I right? Can someone explain this? Thank you.
No, the direct stream is only communicating from the driver to kafka in order to find the latest available offsets. Actual messages are read only on the executors.
Switching .createStream to .createDirectStream should in general perform better, not worse. If you've got a small reproducible example to the contrary, please share it on the spark mailing list or jira.

New directStream API reads topic's partitions sequentially. Why?

I am trying to read a Kafka topic with the new directStream method in KafkaUtils.
I have Kafka topic with 8 partitions.
I am running the streaming job on YARN with 8 executors, each with 1 core (--num-executors 8 --executor-cores 1).
I noticed that Spark reads all of the topic's partitions in one executor, sequentially - this is obviously not what I want.
I want spark to read all partitions in parallel.
How can I achieve that?
Thank you in advance.
An initial communication to Kafka at job creation occurs, solely to set the offsets of the KafkaRDD - more specifically, the offsets for each KafkaRDD partition that makes up the KafkaRDD across the cluster.
Those offsets are then used to fetch data on each executor once the job is actually executed. Depending on what you noticed, it's possible you saw that initial communication (from the driver). If all your tasks really are executing on the same executor, then something other than just the Kafka integration is going wrong.
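If you want to check the mapping yourself, here is a small hedged diagnostic in PySpark (0.8 direct API era); the topic name and broker are placeholders. Each batch's KafkaRDD should have one Spark partition per Kafka topic partition, and those partitions are fetched on the executors:
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

sc = SparkContext(appName="partition-check")
ssc = StreamingContext(sc, 5)
stream = KafkaUtils.createDirectStream(
    ssc, ["mytopic"], {"metadata.broker.list": "localhost:9092"})  # placeholders

def show_partitions(rdd):
    # should equal the number of Kafka partitions of the topic (8 in your case)
    print("Spark partitions in this batch:", rdd.getNumPartitions())

stream.foreachRDD(show_partitions)
ssc.start()
ssc.awaitTermination()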
