Spark Streaming - Restarting from checkpoint replays last batch - apache-spark

We are trying to build a fault-tolerant Spark Streaming job, but there's one problem we are running into. Here's our scenario:
1) Start a spark streaming process that runs batches of 2 mins
2) We have checkpoint enabled. Also the streaming context is configured to either create a new context or build from checkpoint if one exists
3) After a particular batch completes, the spark streaming job is manually killed using yarn application -kill (basically mimicking a sudden failure)
4) The spark streaming job is then restarted from checkpoint
The issue we are having is that after the Spark Streaming job is restarted, it replays the last successful batch. It always does this, and it is only the last successful batch that is replayed, not the earlier batches.
The side effect is that the data from that batch is duplicated. We even tried waiting more than a minute after the last successful batch before killing the process (in case writing the checkpoint takes time), but that didn't help.
Any insights? I have not added the code here, hoping someone has faced this as well and can offer some ideas or insights. I can post the relevant code if that helps. Shouldn't Spark Streaming checkpoint right after a successful batch, so that the batch is not replayed after a restart? Does it matter where I place the ssc.checkpoint command?

You have the answer in the last line of your question: the placement of ssc.checkpoint() matters. When you restart the job from the saved checkpoint, the job comes up with whatever was last saved. So in your case, when you killed the job right after a batch completed, the most recent checkpoint is that of the last successful batch. By this point you have probably understood that checkpointing is mainly meant to pick up from where you left off, especially for failed jobs.

There are two things that need to be taken care of:
1] Ensure that the same checkpoint directory is used in the getOrCreate streaming context method when you restart the program.
2] Set "spark.streaming.stopGracefullyOnShutdown" to "true". This allows Spark to finish processing the current data and update the checkpoint directory accordingly. If it is set to false, it may lead to corrupt data in the checkpoint directory.
Note: please post code snippets if possible. And yes, the placement of ssc.checkpoint does matter (see the sketch below).
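For reference, here is a minimal sketch of the getOrCreate pattern described above. The checkpoint path, app name, and socket source are placeholders; only the 2-minute batch interval and the stopGracefullyOnShutdown setting come from the question and the points above:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val checkpointDir = "hdfs:///tmp/streaming-checkpoint"   // placeholder path

def createContext(): StreamingContext = {
  val conf = new SparkConf()
    .setAppName("checkpointed-stream")
    .set("spark.streaming.stopGracefullyOnShutdown", "true")
  val ssc = new StreamingContext(conf, Seconds(120))     // 2-minute batches, as in the question
  ssc.checkpoint(checkpointDir)                          // checkpointing is set inside the factory function
  // Placeholder pipeline; define your real sources and outputs here.
  val lines = ssc.socketTextStream("localhost", 9999)
  lines.count().print()
  ssc
}

// Rebuilds the context from the checkpoint if one exists; createContext() runs only on a fresh start.
val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
ssc.start()
ssc.awaitTermination()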

In such a scenario, you should ensure that the checkpoint directory used in the streaming context method is the same after the Spark application restarts. Hopefully this helps.

Related

How to get spark streaming to continue where spark batch left off

I have monthly directories of parquet files (~10TB each directory). Files are being atomically written to this directory every minute or so. When we get to a new month, a new directory is created and data is written there. Once data is written, it cannot be moved.
I easily run batch queries on this data using spark (batch mode). I can also easily run spark streaming queries.
I am wondering how I can reconcile the two modes: batch and stream.
For example: let's say I run a batch query on the data. I get the results of the query and do something with them. I can then checkpoint this dataframe. Now let's say I want to start a streaming job to only process new files relative to what was processed in the batch job, i.e. only files not processed in the batch job should now be processed.
Is this possible with spark streaming? If I start a spark streaming job and use the same checkpoint that the batch job used, will it proceed as I want it to?
Or, with the batch job, do I need to keep track of what files were processed and then somehow pass this to spark streaming so it can know to not process these.
This seems like a pretty common problem, so I am asking here to see what some other big data software developers have done.
I apologize for not having any code to post in this question, but I hope that my explanation is all it takes for someone to see a potential solution. If needed, I can come up with some snippets
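For illustration only, here is a minimal sketch of a Structured Streaming file-source query with its own checkpointLocation; the paths and directory layout are hypothetical, and the point is simply that the file source records which files it has already processed in that checkpoint:

import org.apache.spark.sql.SparkSession

// Hypothetical paths standing in for the monthly parquet directories described above.
val spark = SparkSession.builder().appName("parquet-stream").getOrCreate()
val inputDir = "hdfs:///data/current-month"

// Batch pass over the data that already exists.
val batchDf = spark.read.parquet(inputDir)
// ... process batchDf and do something with the results ...

// Streaming pass: the file source lists the directory and, using the query's
// checkpointLocation, picks up only files it has not seen before.
val streamDf = spark.readStream
  .schema(batchDf.schema)                 // file streams require an explicit schema
  .parquet(inputDir)

val query = streamDf.writeStream
  .format("parquet")
  .option("path", "hdfs:///output/processed")
  .option("checkpointLocation", "hdfs:///checkpoints/parquet-stream")
  .start()

query.awaitTermination()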

How are writes managed in Spark with speculation enabled?

Let's say I have a Spark 2.x application, which has speculation enabled (spark.speculation=true), which writes data to a specific location on HDFS.
Now if the task (which writes data to HDFS) takes long, Spark would create a copy of the same task on another executor, and both copies of the task would be running in parallel.
How does Spark handle this? Obviously both the tasks shouldn't be trying to write data at the same file location at the same time (which seems to be happening in this case).
Any help would be appreciated.
Thanks
As I understand it, this is what happens in my tasks:
If one of the speculative task attempts finishes, the other is killed.
When Spark kills that task, it deletes the temporary file written by it.
So no data will be duplicated.
If you choose mode overwrite, some speculative tasks may fail with this exception:
Caused by: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.protocol.AlreadyBeingCreatedException):
Failed to CREATE_FILE /<hdfs_path>/.spark-staging-<...>///part-00191-.c000.snappy.parquet
for DFSClient_NONMAPREDUCE_936684547_1 on 10.3.110.14
because this file lease is currently owned by DFSClient_NONMAPREDUCE_-1803714432_1 on 10.0.14.64
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.recoverLeaseInternal(FSNamesystem.java:2629)
I will continue to study this situation, so maybe the answer will be more helpful some day
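For context, here is a rough sketch of how speculation is typically switched on; only spark.speculation=true comes from the question, the other values are illustrative:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("speculative-writes")
  .config("spark.speculation", "true")           // from the question
  .config("spark.speculation.quantile", "0.75")  // fraction of tasks that must finish before speculating
  .config("spark.speculation.multiplier", "1.5") // how much slower than the median counts as slow
  .getOrCreate()

// Each task attempt writes to its own temporary/staging file; the output committer
// promotes only the winning attempt, which is why committed data is not duplicated.
spark.range(0, 1000000).write.mode("overwrite").parquet("hdfs:///tmp/speculation-demo")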

Spark Streaming - Jobs run concurrently with default spark.streaming.concurrentJobs setting

I have come across a weird behaviour in a Spark Streaming job.
We have used the default value for spark.streaming.concurrentJobs which is 1.
The same streaming job was running for more than a day properly with the batch interval being set to 10 Minutes.
Suddenly that same job has started running concurrently for all the batches that come in, without putting them in a queue.
Has anyone faced this before?
This would be of great help!
This kind of behavior is curious, but with only one job running at a time (the default), the system should remain stable as long as the batch processing time stays below the batch interval.
Spark Streaming creator Tathagata Das has commented on this: How jobs are assigned to executors in Spark Streaming?.
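If you ever need to change it, the setting is passed through SparkConf; this is just a sketch showing where it lives, since the question already uses the default of 1:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Minutes, StreamingContext}

// spark.streaming.concurrentJobs is undocumented/experimental; 1 is the default.
val conf = new SparkConf()
  .setAppName("concurrent-jobs-demo")
  .set("spark.streaming.concurrentJobs", "1")

val ssc = new StreamingContext(conf, Minutes(10))   // 10-minute batch interval, as in the question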

Spark Streaming directory fills disk

I have a streaming job which is intended to run continuously, with a single step that uses mapWithState and therefore requires checkpointing to be configured. I set it up with a local directory, as this is only running on a single node at this stage.
I'm observing that the checkpoint directory grows quickly and continuously. Over the course of a few days it grows to over a million files and exhausts the inodes on the disk.
Questions:
Is this expected behavior?
Assuming not, how can I isolate what might be causing the snapshots not to be pruned?
The error was that checkpointing was enabled via sparkContext.setCheckpointDir(checkpointDir) rather than sparkStreamingContext.checkpoint(checkpointDir).
The former was enough to make Spark run the stateful stream instead of complaining that checkpointing was not enabled, but the appropriate logic for streaming checkpoints was never invoked because sparkStreamingContext.checkpointDir was null.
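Roughly, the difference between the two looks like this (paths, source, and batch interval are placeholders):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, State, StateSpec, StreamingContext}

val conf = new SparkConf().setAppName("stateful-stream")
val ssc = new StreamingContext(conf, Seconds(10))

// Not sufficient: this only sets the RDD checkpoint directory.
// ssc.sparkContext.setCheckpointDir("/data/checkpoints")

// Correct for mapWithState: streaming checkpointing, which also prunes old checkpoint files.
ssc.checkpoint("/data/checkpoints")

// Placeholder stateful pipeline: a running count per word.
val counts = ssc.socketTextStream("localhost", 9999)
  .map(word => (word, 1))
  .mapWithState(StateSpec.function((word: String, one: Option[Int], state: State[Int]) => {
    val sum = one.getOrElse(0) + state.getOption.getOrElse(0)
    state.update(sum)
    (word, sum)
  }))
counts.print()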

Spark streaming with Kafka: when recovering from checkpointing, all data is processed in only one micro batch

I'm running a Spark Streaming application that reads data from Kafka.
I have activated checkpointing to recover the job in case of failure.
The problem is that if the application fails, when it restarts it tries to execute all the data from the point of failure in only one micro batch.
This means that if a micro-batch usually receives 10,000 events from Kafka, and it fails and restarts after 10 minutes, it will have to process one micro-batch of 100,000 events.
Now, if I want the recovery with checkpointing to be successful, I have to assign much more memory than I normally would.
Is it normal that, when restarting, Spark Streaming tries to execute all the past events from checkpointing at once or am I doing something wrong?
Many thanks.
If your application finds it difficult to process all the events in one micro-batch after recovering from a failure, you can set the spark.streaming.kafka.maxRatePerPartition configuration, either in spark-defaults.conf or inside your application.
i.e. if you believe your system/app can safely handle 10K events per second, and your Kafka topic has 2 partitions, add this line to spark-defaults.conf:
spark.streaming.kafka.maxRatePerPartition 5000
or set it inside your code:
val conf = new SparkConf()
conf.set("spark.streaming.kafka.maxRatePerPartition", "5000")
Additionally, I suggest you set this number a little higher and enable backpressure. This will try to stream data at a rate that doesn't destabilize your streaming app.
conf.set("spark.streaming.backpressure.enabled","true")
Update: there was a mistake above; the configuration is the number of events per second, not per minute.
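Putting the two settings together in one place, a rough sketch (the app name and batch interval are placeholders):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// 5000 events per second per partition is just the example figure from above.
val conf = new SparkConf()
  .setAppName("kafka-recovery")
  .set("spark.streaming.kafka.maxRatePerPartition", "5000")  // per-partition, per-second cap
  .set("spark.streaming.backpressure.enabled", "true")       // let Spark adapt the rate dynamically

val ssc = new StreamingContext(conf, Seconds(60))
// ... create the Kafka direct stream and the rest of the job here ...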
