I noticed that when I start the Spark Streaming application, the first job takes more time than the following ones, even when there is no input data. I also noticed that the first job after input data arrives takes longer to process than the following ones. Is there a reason for this behavior?
Thank You
I have two streaming DataFrames, firstDataframe and secondDataframe. I want to stream firstDataframe completely, and only if that first stream finishes successfully would I like to stream the other DataFrame.
For example, in the code below, I would like the first streaming action to execute completely and only then the second to begin:
firstDataframe.writeStream.format("console").start
secondDataframe.writeStream.format("console").start
Spark follows FIFO job scheduling by default. This means it would give priority to the first streaming job. However, if the first streaming job does not require all the available resources, it would start the second streaming job in parallel. I essentially want to avoid this parallelism. Is there a way to do this?
Reference: https://spark.apache.org/docs/latest/job-scheduling.html#scheduling-within-an-application
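One way this is often sketched (not a confirmed answer, just an illustration, assuming the Structured Streaming API shown above): block on the first query with awaitTermination() so the second query is started only after the first one stops.
// Hedged sketch: firstDataframe and secondDataframe are the DataFrames from the question.
// awaitTermination() blocks until the query stops or fails, so the second query
// is only started once the first stream has finished.
val firstQuery = firstDataframe.writeStream.format("console").start()
firstQuery.awaitTermination()

val secondQuery = secondDataframe.writeStream.format("console").start()
secondQuery.awaitTermination()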
I am trying to optimize a Spark Streaming application which collects data from a Kafka cluster, processes it and saves the results to various database tables. The Jobs tab in the Spark UI shows the duration of each job as well as the time it was submitted.
I would expect that for a specific batch, a job starts processing when the previous job is done. However, in the attached screenshot, the "Submitted" time of a job is not right after the previous job finishes. For example, job 1188 has a duration of 1 second and it was submitted at 12:02:12. I would expect that the next job would be submitted one second later, or at least close to it, but instead it was submitted six seconds later.
Any ideas on how this delay can be explained? These jobs belong to the same batch and are done sequentially. I know that there is some scheduling delay between jobs and tasks, but I would not expect it to be that large. Moreover, the Event Timeline of a Stage does not show large Scheduling Delay.
I am using PySpark in standalone mode.
We are trying to build a fault-tolerant Spark Streaming job, and there's one problem we are running into. Here's our scenario:
1) Start a Spark Streaming process that runs batches of 2 minutes
2) Checkpointing is enabled, and the streaming context is configured to either create a new context or build from the checkpoint if one exists
3) After a particular batch completes, the Spark Streaming job is manually killed using yarn application -kill (basically mimicking a sudden failure)
4) The Spark Streaming job is then restarted from the checkpoint
The issue we are having is that after the Spark Streaming job is restarted, it replays the last successful batch. It always does this: only the last successful batch is replayed, not the earlier batches.
The side effect is that the data of that batch is duplicated. We even tried waiting more than a minute after the last successful batch before killing the process (in case writing to the checkpoint takes time), but that didn't help.
Any insights? I have not added the code here, hoping someone has run into this as well and can share some ideas; I can post the relevant code if that helps. Shouldn't Spark Streaming checkpoint right after a successful batch so that it is not replayed after a restart? Does it matter where I place the ssc.checkpoint command?
You have the answer in the last line of your question: the placement of ssc.checkpoint() matters. When you restart the job from the saved checkpoint, the job comes up with whatever was last saved. So in your case, since you killed the job right after a batch completed, the most recent saved state corresponds to that last successful batch. By now you might have understood that checkpointing is mainly there to pick up from where you left off, especially for failed jobs.
There are two things that need to be taken care of:
1] Ensure that the same checkpoint directory is used in the getOrCreate streaming context method when you restart the program.
2] Set "spark.streaming.stopGracefullyOnShutdown" to "true". This allows Spark to finish processing the current data and update the checkpoint directory accordingly. If set to false, it may lead to corrupt data in the checkpoint directory.
Note: Please post code snippets if possible. And yes, the placement of ssc.checkpoint does matter.
In such a scenario, one should ensure that the checkpoint directory used in the streaming context method is the same after the Spark application restarts. Hopefully this helps.
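Putting the two points together, here is a minimal sketch of that setup (the checkpoint directory, application name, and Kafka placeholder are illustrative assumptions, not taken from the question):
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val checkpointDir = "hdfs:///tmp/streaming-checkpoint"  // must be identical on every restart

def createContext(): StreamingContext = {
  val conf = new SparkConf()
    .setAppName("fault-tolerant-stream")
    .set("spark.streaming.stopGracefullyOnShutdown", "true")
  val ssc = new StreamingContext(conf, Seconds(120))  // 2-minute batches, as in the scenario
  ssc.checkpoint(checkpointDir)                       // placement: inside the creating function
  // ... define the Kafka DStream and output operations here ...
  ssc
}

// Reuses the saved checkpoint if it exists, otherwise calls createContext()
val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
ssc.start()
ssc.awaitTermination()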
I use Spark with the Cassandra Spark connector and direct Kafka.
And I see batch processing time increasing slowly over time.
Even when there is nothing incoming from Kafka to process.
It is only a few milliseconds per batch, but after a long time a batch can take several seconds more, until it reaches the batch interval and the application finally crashes.
At first I thought it was a memory leak, but in that case I would expect the processing time to grow exponentially rather than roughly linearly.
I don't really know whether it is the stages that become longer and longer, or the latency between stages that increases.
I use Spark 1.4.0.
Any pointers about this?
EDIT:
I took a closer look at the evolution of the processing time of each batch, comparing it with the total job processing time.
It appears that even though the batch processing time increases, the job processing times do not.
Example: for a batch that takes 7s, the sum of the job processing times is 1.5s (as shown in the image below).
Is it because the computation time on the driver side increases, and not the computation time on the executor side?
And is this driver-side time not shown in the job processing UI?
If that's the case, how can I correct it?
I finally found the solution to my problem.
I had this code in the function that adds filters and transformations to my RDD:
TypeConverter.registerConverter(new SomethingToOptionConverter[EventCC])
TypeConverter.registerConverter(new OptionToSomethingConverter[EventCC])
Because it is called at each batch, the same converters get registered into TypeConverter many times over. I don't know exactly how the Cassandra Spark converters work, but they seem to use reflection internally with these objects, and doing that slow reflection over an ever-growing set of registered converters makes the processing time of each batch keep increasing. Registering the converters only once, outside the per-batch code, fixed the problem.
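A sketch of that fix, assuming the asker's own classes (SomethingToOptionConverter, OptionToSomethingConverter, EventCC): register the converters a single time at application startup instead of inside the per-batch code.
import com.datastax.spark.connector.types.TypeConverter

// Guarded one-time registration; the converter and EventCC classes are the asker's own.
object ConverterRegistry {
  @volatile private var registered = false

  def ensureRegistered(): Unit = synchronized {
    if (!registered) {
      TypeConverter.registerConverter(new SomethingToOptionConverter[EventCC])
      TypeConverter.registerConverter(new OptionToSomethingConverter[EventCC])
      registered = true
    }
  }
}

// Call once, before starting the streaming context, rather than in every batch:
// ConverterRegistry.ensureRegistered()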
Hi, I am new to Spark and Spark Streaming.
From the official documentation I could understand how to manipulate input data and save it.
The problem is that the Spark Streaming quick example made me confused.
I understand that the job should get data from the DStream you have set up and do something with it, but since it runs 24/7, how will the application be loaded and run?
Will it run every n seconds, or run just once at the beginning and then enter a [read-process-loop] cycle?
BTW, I am using Python, so I checked the Python code of that example. If it is the latter case, how does Spark's executor know which code snippet is the loop part?
Spark Streaming actually does micro-batch processing. That means a new batch is executed every interval, which you can customize.
Look at the code of the example you mentioned:
sc = SparkContext("local[2]", "NetworkWordCount")
ssc = StreamingContext(sc, 1)
You define a streaming context with a micro-batch interval of 1 second.
This means that the subsequent code, which uses the streaming context,
lines = ssc.socketTextStream("localhost", 9999)
...
gets executed every second.
The streaming process is initially triggered by this line:
ssc.start() # Start the computation
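For reference, here is the complete structure of that quick-start example, shown as a sketch in Scala (the Python version in the docs is structurally identical): everything defined between creating the StreamingContext and calling start() is the pipeline that Spark re-executes every batch interval; there is no explicit loop in user code.
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setMaster("local[2]").setAppName("NetworkWordCount")
val ssc = new StreamingContext(conf, Seconds(1))   // 1-second micro-batches

// Pipeline definition: evaluated once here, then run by Spark on every batch interval
val lines = ssc.socketTextStream("localhost", 9999)
val words = lines.flatMap(_.split(" "))
val wordCounts = words.map(word => (word, 1)).reduceByKey(_ + _)
wordCounts.print()

ssc.start()             // start the micro-batch loop
ssc.awaitTermination()  // block the driver until the context is stopped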