Spark running for 10 hrs even after Kafka shows 0 message lag - apache-spark

I am running Spark Streaming, and it is consuming messages from Kafka. I have also defined a checkpoint directory in my Spark code.
We did a bulk message upload to Kafka yesterday. When I check the offset status in Kafka using:
bin/kafka-run-class.sh kafka.tools.ConsumerOffsetChecker --group xxx-streaming-consumer-group \
  --zookeeper xxx.xxx.xxx.xxx:2181
It shows there is no message lag. However, my Spark job has still been running for the last 10 hrs.
My understanding is that the Spark Streaming code should read the messages sequentially and update the offsets in Kafka accordingly.
I am not able to figure out why Spark is still running even though there is no message lag in Kafka. Can someone explain?
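For reference, the lag that the offset checker reports is, per partition, the log end offset minus the committed offset. Below is a minimal sketch of that arithmetic; the offsets and the helper name consumer_lag are made up for illustration and are not output from the real tool. Note that a lag of zero only says the consumer has caught up; it is not a signal that stops a streaming job.

```python
# Hypothetical per-partition offsets; real values come from the offset checker output.
log_end_offsets = {0: 1500, 1: 1480, 2: 1520}
committed_offsets = {0: 1500, 1: 1480, 2: 1520}


def consumer_lag(log_end, committed):
    """Return per-partition lag: messages written but not yet committed as consumed."""
    return {p: log_end[p] - committed.get(p, 0) for p in log_end}


lag = consumer_lag(log_end_offsets, committed_offsets)
total_lag = sum(lag.values())
print(lag, total_lag)
```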

Related

gracefully exit without running pending batches in spark Dstreams

I am trying Spark DStreams with Kafka.
I am not able to commit offsets for the pending/lagging batches consumed by the Spark DStream. This issue happens after the streaming context is stopped using ssc.stop(true, true). In this case the streaming context is stopped, but the Spark context is still running the pending batches.
Here are a few things I have done.
Create a DStream to get data from a Kafka topic (successfully).
Commit offsets manually back to Kafka using
stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
Batch interval is 60 seconds.
Batch time (time taken to perform some operation on incoming data) is 2 mins.
streamingContext.stop(true,true)
Please tell me if there is a way to commit the offsets for the pending batches as well, or to exit gracefully after the currently running batch and discard the pending batches, so that the offsets for the pending batches are not committed.
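To see why pending batches pile up with these numbers, here is a rough sketch in plain Python. It is a simplified model, not Spark code: it assumes one batch is processed at a time and that every batch takes exactly the stated processing time. The function name pending_batches is hypothetical.

```python
def pending_batches(elapsed_s, batch_interval_s=60, processing_time_s=120):
    """Batches generated but not yet finished after elapsed_s seconds.

    Simplified model: batches are processed one at a time, in order,
    and each takes exactly processing_time_s seconds.
    """
    generated = elapsed_s // batch_interval_s
    completed = 0
    free_at = 0  # time at which the single processing slot becomes free
    for i in range(1, generated + 1):
        available_at = i * batch_interval_s  # batch i is cut at the end of interval i
        start = max(available_at, free_at)
        finish = start + processing_time_s
        if finish > elapsed_s:
            break  # this batch and all later ones are still pending
        completed += 1
        free_at = finish
    return generated - completed


# With a 60 s interval and 120 s batch time, the backlog keeps growing.
for minutes in (10, 30, 60):
    print(minutes, "min ->", pending_batches(minutes * 60), "pending batches")
```

In this model the queue never drains while the batch time stays above the batch interval, so there will always be pending batches at the moment stop is called.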

Restarting a PySpark job doesn't get the records which were inserted into Kafka Topic while the pyspark consumer is down

I am running a PySpark job, and the data is streamed from Kafka.
I am trying to replicate a scenario on my Windows system to find out what happens when the consumer goes down while data is continuously being fed into Kafka.
Here is what I expect.
The producer is started and produces messages 1, 2 and 3.
The consumer is online and consumes messages 1, 2 and 3.
Now the consumer goes down for some reason while the producer produces messages 4, 5, 6 and so on.
When the consumer comes back up, I expect it to resume where it left off, so it should read messages 4, 5, 6 and so on.
My PySpark application is not able to achieve what I expect. Here is how I created the streaming reader from the Spark session:
session.readStream.format("kafka") \
    .option("kafka.bootstrap.servers", "localhost:9092") \
    .option("subscribe", "clickapijson") \
    .option("startingOffsets", "latest") \
    .load()
I googled and gathered quite a bit of information. It seems the group ID is relevant here. Kafka keeps track of the offsets read by each consumer in a particular group ID. If a consumer subscribes to a topic with a group ID, say G1, Kafka registers this group and consumer ID and keeps track of them. If the consumer goes down for some reason and restarts with the same group ID, Kafka still has the offsets that were already read, so the consumer reads the data from where it left off.
This is exactly what happens when I use the following command to start a consumer from the CLI:
kafka-console-consumer.bat --bootstrap-server localhost:9092 --topic "clickapijson" --consumer-property group.id=test
Now when my producer produces messages 1, 2 and 3, the consumer is able to consume them. I killed the running consumer job (CLI .bat file) after the 3rd message was read. My producer then produced messages 4, 5, 6 and so on.
Now I bring back my consumer job (CLI .bat file), and it is able to read the data from where it left off (from message 4). This behaves as I expect.
I am unable to do the same thing in PySpark.
When I include option("group.id", "test"), it throws an error saying Kafka option group.id is not supported as user-specified consumer groups are not used to track offsets.
Looking at the console output, each time my PySpark consumer job is started, it creates a new group ID. If my PySpark job previously ran with one group ID and failed, it does not pick up the same group ID when restarted; it gets a new, randomly generated one. Kafka has the offset information for the previous group ID but not for the newly generated one. Hence my PySpark application is not able to read the data fed into Kafka while it was down.
If this is the case, won't I lose data when the consumer job goes down due to some failure?
How can I give my own group ID to the PySpark application, or how can I restart it with the same old group ID?
In the current Spark version (2.4.5) it is not possible to provide your own group.id, as it is created automatically by Spark (as you already observed). The full details on offset management in Spark when reading from Kafka are given in the Structured Streaming + Kafka Integration Guide and summarised below:
Note that the following Kafka params cannot be set and the Kafka source or sink will throw an exception:
group.id: Kafka source will create a unique group id for each query automatically.
auto.offset.reset: Set the source option startingOffsets to specify where to start instead. Structured Streaming manages which offsets are consumed internally, rather than rely on the kafka Consumer to do it. This will ensure that no data is missed when new topics/partitions are dynamically subscribed. Note that startingOffsets only applies when a new streaming query is started, and that resuming will always pick up from where the query left off.
enable.auto.commit: Kafka source doesn’t commit any offset.
For Spark to be able to remember where it left off reading from Kafka, you need to have checkpointing enabled and provide a path location to store the checkpointing files. In Python this would look like:
aggDF \
.writeStream \
.outputMode("complete") \
.option("checkpointLocation", "path/to/HDFS/dir") \
.format("memory") \
.start()
More details on checkpointing are given in the Spark docs on Recovering from Failures with Checkpointing.
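To illustrate the resume behaviour described above without a running cluster, here is a toy model in plain Python. It is not Spark's implementation (a real checkpoint is a directory of offset and commit logs); the function run_query, the file name offsets.json and the message names are all hypothetical. It only shows the rule that the starting option applies on the first start, while a restart with an existing checkpoint picks up where the query left off.

```python
import json
import os
import tempfile


def run_query(log, checkpoint_path, starting="latest"):
    """Toy model of one run: return the records read and persist the next offset."""
    if os.path.exists(checkpoint_path):
        # Resuming: the stored offset wins, the starting option is ignored.
        with open(checkpoint_path) as f:
            start = json.load(f)["next_offset"]
    else:
        # First start: the starting option decides where to begin.
        start = len(log) if starting == "latest" else 0
    batch = log[start:]
    with open(checkpoint_path, "w") as f:
        json.dump({"next_offset": len(log)}, f)
    return batch


checkpoint = os.path.join(tempfile.mkdtemp(), "offsets.json")
topic = ["m1", "m2", "m3"]
first_run = run_query(topic, checkpoint, starting="earliest")

topic += ["m4", "m5", "m6"]  # produced while the consumer is down
second_run = run_query(topic, checkpoint, starting="latest")

# Same restart without a checkpoint: "latest" skips everything already in the topic.
no_checkpoint = os.path.join(tempfile.mkdtemp(), "offsets.json")
lost_run = run_query(topic, no_checkpoint, starting="latest")

print(first_run, second_run, lost_run)
```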

Spark kafka Streaming pull more messages

I'm using Kafka 0.9 and Spark 1.6. The Spark Streaming application streams messages from Kafka through the direct stream API (version 2.10-1.6.0).
I have 3 workers with 8 GB of memory each. Kafka receives 4000 messages every minute, and in Spark each worker is streaming 600 messages. I always see the Spark offset lagging behind the Kafka offset.
I have 5 Kafka partitions.
Is there a way to make Spark stream more messages for each pull from Kafka?
My streaming frequency is 2 seconds.
Spark configurations in the app:
"maxCoresForJob": 3,
"durationInMilis": 2000,
"auto.offset.reset": "largest",
"autocommit.enable": "true",
Could you please explain more? Did you check which piece of code is taking longer to execute? In Cloudera Manager, go to YARN -> Applications -> select your application -> Application Master -> Streaming, then select one batch and click it. Try to find out which task is taking longer to execute. How many executors are you using? For 5 partitions, it is better to have 5 executors.
You can post your transformation logic; there could be some way to tune it.
Thanks
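One more thing that may be worth checking, as an assumption since the question does not show it, is whether spark.streaming.kafka.maxRatePerPartition is set. With the direct API this setting caps the records read per partition per second, so the most one batch can pull is rate x partitions x batch duration. A small sketch of that arithmetic follows; the rate of 10 is a hypothetical value, while the other numbers come from the question.

```python
def max_records_per_batch(max_rate_per_partition, num_partitions, batch_duration_s):
    """Upper bound on records one batch can read when a per-partition rate cap is set."""
    return max_rate_per_partition * num_partitions * batch_duration_s


# From the question: 4000 messages per minute, 5 partitions, 2 second batches.
incoming_per_batch = 4000 / 60 * 2  # about 133 messages arrive per batch

# Hypothetical cap of 10 records per second per partition.
cap = max_records_per_batch(10, 5, 2)
keeps_up = cap >= incoming_per_batch
print(round(incoming_per_batch), cap, keeps_up)
```

If the cap is below the incoming rate per batch, the lag grows no matter how fast the processing is; if no cap is set, the bottleneck is more likely the processing time per batch.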

how can spark streaming rerun the failed batch job

My question is as follows.
I use Spark Streaming to read data from Kafka with the direct stream API, process the RDD, and then update the ZooKeeper offset manually.
The data read from Kafka is inserted into a Hive table.
Now I have run into a problem.
Sometimes the Hive metastore process exits for some reason (there is currently only a single Hive metastore instance).
Some batch jobs fail for that reason, and the Spark Streaming job does not exit; it just logs some warnings.
Then, when I restart the Hive metastore process, the program carries on and new batch jobs succeed.
But I find that the data the failed batches read from Kafka is missing.
I can see this in the metadata in the job details.
Imagine that each batch job reads 20 offsets from Kafka:
batch 1 reads offsets 1 to 20,
batch 2 reads offsets 21 to 40.
If batch 1 fails and batch 2 succeeds, the data of the failed batch 1 is lost.
How can I handle this?
How can I rerun the failed batch job?
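One common pattern for this situation is to advance the stored offset only after the write for that batch has succeeded, and to retry or stop on failure instead of moving on to the next batch. The sketch below is plain Python, not Spark code: process_batches, flaky_write and the retry count are hypothetical stand-ins for the batch loop, the Hive insert and the manual ZooKeeper update.

```python
def process_batches(batches, write, max_attempts=3):
    """Process offset ranges in order; advance the committed offset only after a successful write."""
    committed = 0  # last offset known to be safely written
    for start, end in batches:
        for _attempt in range(max_attempts):
            try:
                write(start, end)
            except RuntimeError:
                continue  # retry the same offset range
            committed = end  # commit only after the write succeeded
            break
        else:
            # All attempts failed: stop rather than skip, so a restart
            # resumes from the last committed offset and nothing is lost.
            return committed
    return committed


written = []
failures = {"left": 1}


def flaky_write(start, end):
    # Stand-in for the Hive insert; fails once to mimic a metastore outage.
    if failures["left"] > 0:
        failures["left"] -= 1
        raise RuntimeError("metastore unavailable")
    written.append((start, end))


last_committed = process_batches([(1, 20), (21, 40)], flaky_write)
print(written, last_committed)
```

Here batch 1 fails once, is retried, and only then does batch 2 run, so no offset range is skipped.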

Spark kinesis stuck on reading records after 25 days

I am working on Spark Streaming with Kinesis on EMR. For roughly 25 days my streaming job ran fine, running every minute and processing around 1000 records.
I have written the job in Scala.
Today when I looked at the master UI, I saw that Spark was not processing any records and had queued 40 jobs with zero records. It was also stuck reading records from Kinesis.
Has anyone experienced this issue before? I am using Spark 1.6.
