Kafka OffsetOutOfRangeException - apache-spark

I am streaming loads of data through Kafka, and I have a Spark Streaming job consuming these messages. Somewhere down the line, Spark Streaming throws this error:
kafka.common.OffsetOutOfRangeException
Now I am aware of what this error means, so I changed the retention policy to 5 days. However, I still encountered the same issue. Then I listed all the messages for the topic using --from-beginning in Kafka. Sure enough, a ton of messages from the beginning of the Kafka stream were not present, and since Spark Streaming runs a little behind the Kafka producer, Spark Streaming tries to consume messages that Kafka has already deleted. I thought changing the retention policy would take care of this:
--add-config retention.ms=....
What I suspect is happening is that Kafka is deleting messages from the topic to free up space for new messages (because we are streaming tons of data). Is there a property I can configure that specifies how many bytes of data Kafka can store before deleting older messages?

You can set the maximum size of the topic when you create it, using the topic-level configuration property retention.bytes via the console, like:
bin/kafka-topics.sh --zookeeper localhost:2181 --create --topic my-topic --partitions 1 --replication-factor 1 --config retention.bytes=10485760
Or you can use the global broker configuration property log.retention.bytes to set the maximum size for all topics.
What is important to know is that neither property enforces a hard limit on the topic size: the limit applies per partition, and it just signals to Kafka when to start deleting the oldest messages.
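The "not a hard limit" point can be made concrete with a small stand-alone sketch (plain Python, no Kafka client required): the broker deletes whole closed log segments, oldest first, and never deletes the active segment, so a partition can sit above the configured size. This is a rough model of the rule, not the broker's actual code:

```python
# Rough model of Kafka's size-based retention: delete the oldest closed
# segments while removing one more would still leave at least
# retention_bytes behind. The active (newest) segment is never deleted,
# so the partition can remain above the configured limit.

def apply_size_retention(segment_sizes, retention_bytes):
    """segment_sizes: log segment sizes in bytes, oldest first; the last
    entry is the active segment. Returns the surviving segments."""
    segments = list(segment_sizes)
    overshoot = sum(segments) - retention_bytes
    deleted = 0
    for seg in segments[:-1]:  # the active segment is not a candidate
        if overshoot - seg < 0:
            break  # deleting this one would drop below the retention size
        overshoot -= seg
        deleted += 1
    return segments[deleted:]

# Three closed 1 KiB segments plus a 512 B active one, 2 KiB limit:
print(apply_size_retention([1024, 1024, 1024, 512], 2048))  # -> [1024, 1024, 512]
```

Note that the surviving 2560 bytes still exceed the 2048-byte limit, which is exactly the soft-limit behaviour described above.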

Another way to mitigate this problem is to set the Spark configuration parameter:
spark.streaming.kafka.maxRatePerPartition
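This parameter caps how many records each partition contributes per second, so it bounds the size of every micro-batch. The parameter name is real; the helper below is just an illustration of the arithmetic:

```python
# spark.streaming.kafka.maxRatePerPartition caps records *per partition,
# per second*, so the upper bound on one micro-batch is:
#   maxRatePerPartition * number_of_partitions * batch_interval_seconds

def max_records_per_batch(max_rate_per_partition, partitions, batch_interval_s):
    return max_rate_per_partition * partitions * batch_interval_s

# e.g. 1000 records/s/partition, 8 partitions, 10 s batches:
print(max_records_per_batch(1000, 8, 10))  # -> 80000
```

In practice you would set it with, for example, SparkConf().set("spark.streaming.kafka.maxRatePerPartition", "1000"), so a restarted job cannot try to pull an unbounded backlog in one batch.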

Related

Confluent Kafka: Setting Retention.ms for an individual topic not working as expected

I'm using Confluent Kafka in my project, where messages sent to a particular topic need to be deleted after a retention time. So I set retention.ms for the individual topics, but it's not working (I can still see messages after their retention time has passed).
I have browsed most of the Stack Overflow questions but still can't find a proper reason for, or solution to, the retention.ms-not-working problem.
I followed the steps below to create the topic and set its retention time in ms:
Created a topic, say 'user_status'.
Updated its retention.ms with the following code:
from confluent_kafka.admin import AdminClient, ConfigResource

# create the admin client (the broker address is whatever your setup uses)
admin = AdminClient({'bootstrap.servers': 'localhost:9092'})
topic_config = ConfigResource('topic', 'user_status')
topic_config.set_config('retention.ms', '5000')
# alter_configs returns futures; wait on them so any error surfaces
for future in admin.alter_configs([topic_config]).values():
    future.result()
Sent a message from the producer end.
Waited for 6000 ms.
Tried to receive the message from the topic. But I did receive the message; it had not been deleted by the retention policy.
Note:
I ensured the following:
After updating retention.ms, I verified that it had been updated in the Kafka topic information (topic describe).
Also, I updated server.properties with log.retention.check.interval.ms=1 and restarted the Kafka service after updating the properties file.
What I expect from the above:
I want to set retention.ms on an individual topic, and messages older than that time should be automatically deleted as defined by the Kafka retention policy.
What is happening now with my current code:
The messages are still received by the consumer even after the retention.ms time has passed.
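One detail worth knowing here (an assumption about the cause, but a common one): time-based retention also works at the segment level. The broker only deletes whole closed segments whose newest record is older than retention.ms; the active segment is not a candidate until it is rolled (see segment.ms), so a freshly produced message sitting in the active segment survives well past retention.ms. A stdlib-only sketch of that rule:

```python
# Rough model of time-based retention: only *closed* segments whose
# newest record timestamp is older than retention.ms are eligible for
# deletion; the active segment is never deleted until it is rolled.

def expired_segments(segment_last_ts_ms, now_ms, retention_ms):
    """segment_last_ts_ms: newest record timestamp per segment, oldest
    segment first; the final entry is the active segment."""
    closed = segment_last_ts_ms[:-1]  # the active segment is excluded
    return [ts for ts in closed if now_ms - ts > retention_ms]

# One closed segment last written 10 s ago, an active segment written
# just now, retention.ms = 5000: only the closed segment is expired.
print(expired_segments([10_000, 19_000], now_ms=20_000, retention_ms=5000))  # -> [10000]
```

In the scenario above, the test message almost certainly still lives in the active segment, which is why it outlives retention.ms even with log.retention.check.interval.ms set to 1.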

Parameter to control the handshake interval between Kafka and Spark

While the Kafka brokers are up and running, the Spark process running in cluster mode is able to read messages from the Kafka topic. But when the brokers are shut down intentionally, the Spark consumer remains in RUNNING status.
Is there any parameter to control the handshake interval between the Spark consumer and the ZooKeeper process, so that the Spark process can fail if the brokers are not reachable? Or is there any alternate way to fail the consumer? Please suggest.
No, there is not.
Kafka and Spark Structured Streaming (SSS) are loosely coupled by design, to cope with high-availability scenarios, failures, etc. SSS just waits and will process rebalanced topics when Kafka rebalances the load.
The whole premise is that Kafka will do something to alleviate the situation if a broker goes down. Even if there are zero brokers for a while, SSS will wait, as you have noted. It knows nothing about the cause, of course; it just waits.
As long as the topics still exist, and "fail on data loss" is not set to true in case a topic is deleted, the SSS apps will go on.
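The "fail on data loss" switch mentioned here is the failOnDataLoss option of the Structured Streaming Kafka source. A minimal sketch of the option map you would pass to spark.readStream.format("kafka"); the option names are real, while the broker address and topic are placeholders:

```python
# Builds the option map for a Structured Streaming Kafka source.
# "failOnDataLoss" controls whether the query fails when data may have
# been lost (e.g. a topic was deleted or offsets are out of range).

def kafka_source_options(bootstrap_servers, topic, fail_on_data_loss):
    return {
        "kafka.bootstrap.servers": bootstrap_servers,
        "subscribe": topic,
        "failOnDataLoss": "true" if fail_on_data_loss else "false",
    }

opts = kafka_source_options("localhost:9092", "my-topic", fail_on_data_loss=True)
print(opts["failOnDataLoss"])  # -> true
```

With a live SparkSession this would be used as spark.readStream.format("kafka").options(**opts).load().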

Spark Streaming task graceful shutdown when the Kafka client sends messages asynchronously

I am building a Spark Streaming application that reads input messages from a Kafka topic, transforms them, and outputs the result messages to another Kafka topic. Now I am confused about how to prevent data loss when the application restarts, covering both the Kafka read and the output. Can setting the Spark configuration spark.streaming.stopGracefullyOnShutdown to true help?
You can configure Spark to checkpoint to HDFS and store the Kafka offsets in ZooKeeper (or HBase, or configure them elsewhere that allows fast, fault-tolerant lookups).
Though, if you process some records and write the results before you're able to commit the offsets, you'll end up reprocessing those records on restart. It's claimed that Spark can do exactly-once with Kafka, but as far as I know that only holds with proper offset management: for example, set enable.auto.commit to false in the Kafka properties, then commit only after you've processed and written the data to its destination.
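This advice (enable.auto.commit=false, commit only after the write succeeds) looks roughly like the sketch below with the confluent_kafka consumer. The broker address, group id, topic, and process_and_store callback are placeholders, and the client import is deferred into the function so the config helper runs without the library installed; treat it as a sketch, not a hardened consumer:

```python
def manual_commit_config(bootstrap_servers, group_id):
    # enable.auto.commit=False: offsets advance only when we commit
    # explicitly, i.e. after the record is safely written downstream.
    return {
        "bootstrap.servers": bootstrap_servers,
        "group.id": group_id,
        "enable.auto.commit": False,
        "auto.offset.reset": "earliest",
    }

def run(process_and_store, topic="input-topic"):
    from confluent_kafka import Consumer  # deferred: needs a live setup
    consumer = Consumer(manual_commit_config("localhost:9092", "my-group"))
    consumer.subscribe([topic])
    while True:
        msg = consumer.poll(1.0)
        if msg is None or msg.error():
            continue
        process_and_store(msg.value())  # write to the destination first...
        consumer.commit(msg)            # ...then commit its offset

print(manual_commit_config("localhost:9092", "g1")["enable.auto.commit"])  # -> False
```

Committing per message is the simplest-to-reason-about variant; batching commits trades a larger reprocessing window for throughput.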
If you're just moving data between Kafka topics, Kafka Streams is the Kafka library included for exactly that, and it doesn't require YARN or a cluster scheduler.

Download data from http using Python Spark streaming

I am new to PySpark, and I installed Kafka (single node, single broker) on Ubuntu 14.04.
After the installation I tested that Kafka can send and receive data using kafka-console-producer and kafka-console-consumer.
Below are the steps I followed
Started a consumer to consume messages:
bin/kafka-console-consumer.sh --zookeeper localhost:2181 --topic kafkatopic --from-beginning
Started a producer to send messages in a new terminal window:
bin/kafka-console-producer.sh --broker-list localhost:9092 --topic kafkatopic
[2016-09-25 7:26:58,179] WARN Property topic is not valid (kafka.utils.VerifiableProperties)
Good morning
Future big data
this is test message
In the consumer terminal
bin/kafka-console-consumer.sh --zookeeper localhost:2181 --topic kafkatopic --from-beginning
Good morning
Future big data
this is test message
The link below from meetup.com produces streaming data:
http://stream.meetup.com/2/rsvps
My requirement is to collect the streaming data from this HTTP site into Spark using Kafka. What is the transformation command to download the streaming data?
After downloading the data I can compute the count by city and other analyses for a particular time interval.
There are different ways to process realtime streaming data. The one I am considering is like the one below.
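One common pipeline is: read the HTTP stream line by line, publish each JSON record to a Kafka topic, and let Spark consume that topic. The parsing step can be sketched with the stdlib alone; the field names (group / group_city) follow the meetup RSVP payload but should be verified against the live feed, and the sample record here is made up for illustration:

```python
import json

# Each line of http://stream.meetup.com/2/rsvps is one JSON RSVP record.
# To count RSVPs by city, pull the city out of each record; a producer
# would publish the raw line to a Kafka topic for Spark to consume.

def extract_city(rsvp_line):
    record = json.loads(rsvp_line)
    # "group" / "group_city" follow the RSVP payload; verify on the feed.
    return record.get("group", {}).get("group_city", "unknown")

sample = '{"group": {"group_city": "Dublin"}, "response": "yes"}'  # made-up record
print(extract_city(sample))  # -> Dublin
```

To feed Kafka you could iterate over urllib.request.urlopen("http://stream.meetup.com/2/rsvps") line by line and produce each line with a Kafka producer; Spark Streaming then subscribes to that topic and aggregates counts by city per window.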

Spark Streaming from Kafka Consumer

I might need to work with Kafka, and I am absolutely new to it. I understand that there are Kafka producers which publish the logs (called events, messages, or records in Kafka) to Kafka topics.
I will need to read from Kafka topics via a consumer. Do I need to set up the consumer API first and then stream using a SparkStreaming context (PySpark), or can I use the KafkaUtils module directly to read from Kafka topics?
In case I need to set up a Kafka consumer application, how do I do that? Please share links to the right docs.
Thanks in advance!!
Spark provides internal Kafka stream support, so you don't need to create a custom consumer. There are two approaches to connect with Kafka: 1. the receiver-based approach, 2. the direct approach.
For more detail go through this link http://spark.apache.org/docs/latest/streaming-kafka-integration.html
There's no need to set up a Kafka consumer application; Spark itself creates the consumer, with two approaches. One is the receiver-based approach, which uses the KafkaUtils class, and the other is the direct approach, which uses the createDirectStream method.
In any case of failure in Spark Streaming there's no loss of data; it restarts from the offset where you left off.
For more details,use this link: http://spark.apache.org/docs/latest/streaming-kafka-integration.html
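For the direct approach described in these answers, the PySpark entry point is KafkaUtils.createDirectStream(ssc, topics, kafkaParams) from pyspark.streaming.kafka (in the Spark versions the linked guide covers). A sketch of the parameter map, with a placeholder broker address; the pyspark call itself is shown in the note since it needs a running StreamingContext:

```python
# Minimal parameter map for the direct (receiver-less) Kafka stream.
# "metadata.broker.list" tells Spark which brokers to fetch from directly.

def direct_stream_params(brokers):
    return {"metadata.broker.list": brokers}

params = direct_stream_params("localhost:9092")
print(params["metadata.broker.list"])  # -> localhost:9092
```

With a live context: from pyspark.streaming.kafka import KafkaUtils; stream = KafkaUtils.createDirectStream(ssc, ["my-topic"], direct_stream_params("localhost:9092")).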
