I need to set up a Spark Streaming application whose jobs make decisions based on how long the whole application has been running.
For example, assume the Spark Streaming application was submitted at 08:00. Jobs that run between 08:00 and 10:00 should do a plus operation, while jobs that run after 10:00 should do a minus operation.
How can I record the first job's (or the application's) start time and determine the interval between each job and the first job? Or is there any other good solution?
SparkContext's startTime() method returns the time when it became active.
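A minimal sketch of how that start time could drive the decision, written in Python. In PySpark the value is available as sc.startTime (epoch milliseconds). The function names choose_operation and apply_operation are hypothetical, and the two-hour cutoff is taken from the example in the question. Timestamps are passed in explicitly so the logic can be checked without a cluster.

```python
from datetime import timedelta

# Cutoff from the question: "plus" for the first two hours, "minus" afterwards.
SWITCH_AFTER = timedelta(hours=2)


def choose_operation(app_start_ms: int, now_ms: int) -> str:
    """Return "plus" or "minus" based on time elapsed since the application started.

    app_start_ms is assumed to come from the SparkContext start time
    (epoch milliseconds); now_ms would typically be the batch time.
    """
    elapsed = timedelta(milliseconds=now_ms - app_start_ms)
    return "plus" if elapsed < SWITCH_AFTER else "minus"


def apply_operation(total: int, value: int, op: str) -> int:
    """Apply the chosen operation to a running total (hypothetical helper)."""
    return total + value if op == "plus" else total - value
```

In a streaming job, the start time would be read once on the driver and the batch time compared against it inside each batch's processing function.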
I'm using Spark 3.0.2 and I have a streaming job that consumes data from Kafka with trigger duration of "1 minute".
I see in the Spark UI that there is a new job every minute, as defined, but the onQueryProgress method is only being called every 5~6 minutes. I thought this method should be called directly after each micro-batch.
Is there a way to control this interval and make it equal to the trigger duration?
The onQueryProgress method of the StreamingQueryListener is called asynchronously after the data has been completely processed within each micro-batch.
You are seeing this listener being triggered only every 5~6 minutes because the streaming job needs that long to process all the data fetched in the micro-batch. Setting the trigger duration to 1 minute makes Spark plan tasks accordingly, but it does not mean that the job is able to process all available data within that 1-minute time frame.
To reduce the amount of data being fetched by your query from Kafka you can play around with the source option maxOffsetsPerTrigger.
By the way, if you are not processing any data, this method is called every 10 seconds by default. If you want to ignore those calls, you can guard your logic with if (event.progress.numInputRows > 0).
I found the reason in my case: the onQueryProgress method itself was taking 5 minutes to complete.
As Mike mentioned, onQueryProgress is called asynchronously, but I think the same thread is used for every call to this method, so it waits for one call to finish before making the next.
So the solution in my case was to figure out why it was taking that long and to make it faster than the trigger duration.
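One way to keep the callback faster than the trigger duration is to hand slow work to a background thread. The sketch below is plain Python and only illustrates the idea; ProgressHandler, on_query_progress, and slow_work are hypothetical names, and the arguments stand in for event.progress.numInputRows and the batch id. It assumes the slow part (for example, writing metrics to an external system) does not need to finish before the next micro-batch.

```python
import queue
import threading


class ProgressHandler:
    """Hypothetical helper: keeps the listener callback cheap by deferring slow work."""

    def __init__(self, slow_work):
        self._slow_work = slow_work      # callable taking (batch_id, num_input_rows)
        self._events = queue.Queue()
        self._worker = threading.Thread(target=self._drain, daemon=True)
        self._worker.start()

    def on_query_progress(self, num_input_rows, batch_id):
        """Call this from the listener callback; it only enqueues and returns."""
        if num_input_rows > 0:           # skip the periodic "no data" progress events
            self._events.put((batch_id, num_input_rows))

    def _drain(self):
        # Runs on the background thread and processes events in arrival order.
        while True:
            item = self._events.get()
            if item is None:             # sentinel: stop the worker
                return
            self._slow_work(*item)

    def close(self):
        """Process everything still queued, then stop the worker thread."""
        self._events.put(None)
        self._worker.join()
```

The trade-off is that the deferred work can lag behind the stream if it is consistently slower than the trigger interval, so the queue size is worth monitoring.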
I am using NodeJS, MongoDB and the node-cron npm module to schedule jobs. With 10K jobs, scheduling takes little time and memory. But when I schedule 100K jobs, it takes more than 10 minutes and nearly 1.5GB of RAM, and sometimes runs out of memory. Is there a better way to achieve this, such as using ActiveMQ or RabbitMQ?
One strategy is that you only schedule the next job to run. When it runs, you query the database and find the next job and schedule it.
If you add a new job, you check if it wants to run sooner than the now current next job and, if so, you schedule it and deschedule the previous next job (it will get rescheduled later after this new job runs).
If you remove a job, you check if it is the current next job. If it is, you deschedule it and find the next job in the database and schedule it.
If your database is configured for efficiently querying by job run time, this can be very efficient, uses hardly any memory and scales to an infinitely large number of jobs.
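The strategy above can be sketched independently of Node.js. The following Python sketch is only an illustration: NextJobScheduler and its methods are hypothetical names, an in-memory heap stands in for a database indexed by run time, and "armed" stands for the one job that currently has a real timer.

```python
import heapq


class NextJobScheduler:
    """Keeps only the single earliest job armed; everything else stays in the store."""

    def __init__(self):
        self._store = []        # (run_at, job_id) pairs, stand-in for the jobs collection
        self._removed = set()   # ids of jobs deleted while still in the store
        self.armed = None       # the one job that currently has a timer, or None

    def _arm_next(self):
        """Find the earliest remaining job in the store and arm it."""
        while self._store and self._store[0][1] in self._removed:
            self._removed.discard(heapq.heappop(self._store)[1])
        self.armed = heapq.heappop(self._store) if self._store else None

    def add(self, job_id, run_at):
        new = (run_at, job_id)
        if self.armed is None:
            self.armed = new
        elif new < self.armed:
            # The new job is due sooner: put the armed job back and arm the new one.
            heapq.heappush(self._store, self.armed)
            self.armed = new
        else:
            heapq.heappush(self._store, new)

    def remove(self, job_id):
        if self.armed is not None and self.armed[1] == job_id:
            self._arm_next()             # the armed job was removed: arm its successor
        else:
            self._removed.add(job_id)    # skipped later, when it reaches the front

    def fire(self):
        """Called when the armed timer goes off (assumes a job is armed)."""
        _, job_id = self.armed
        self._arm_next()
        return job_id
```

Only one timer exists at any moment, so memory use does not grow with the number of stored jobs; the cost moves to the "find the next job" query, which is why the index on run time matters.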
After I submitted a job to node/partition cn430 today, I found that the node stayed occupied.
After the previous job finished, my job still did not start running because of its priority. Then I noticed that all of these jobs have the same prefix, namely 4988443, which is ahead of my job id 4988560.
It seems that the user has submitted about 1000 jobs together with the same priority across multiple partitions.
I am wondering how to implement it.
First off, cn430 really looks like a node rather than a partition. The partition to which it belongs seems to be named shared-gp.
What you see is a job array. It is a way to submit a large number of jobs that only differ in a specific parameter. Each job in the array is scheduled independently, so if you do not request a specific node (e.g. with -w or --nodelist), Slurm will distribute them across the nodes that are available.
Note that job priorities decay over time if fairshare is in use, so that user's currently pending jobs will have their priority decrease because of the ones currently running.
I am running a dummy Spark job that does exactly the same set of operations in every iteration. The following figure shows 30 iterations, where each job corresponds to one iteration. The duration is always around 70 ms except for jobs 0, 4, 16, and 28. The behavior of job 0 is expected, as that is when the data is first loaded.
But when I click on job 16 to enter its detailed view, the duration is only 64 ms, which is similar to the other jobs. The screenshot of this duration is as follows:
I am wondering where Spark spends the remaining (2000 - 64) ms on job 16.
Gotcha! That's exactly the same question I asked myself a few days ago. I'm glad to share the findings with you (hoping that where my understanding is lacking, others will chime in and fill the gaps).
The difference between what you can see in Jobs and Stages pages is the time required to schedule the stage for execution.
In Spark, a single job can have one or many stages with one or many tasks. That creates an execution plan.
By default, a Spark application runs in FIFO scheduling mode, which gives jobs priority in the order they were submitted, so a later job may have to wait for an earlier one regardless of how many cores the application has (you can check the scheduling mode in the web UI's Jobs page).
Quoting Scheduling Within an Application:
By default, Spark’s scheduler runs jobs in FIFO fashion. Each job is divided into "stages" (e.g. map and reduce phases), and the first job gets priority on all available resources while its stages have tasks to launch, then the second job gets priority, etc. If the jobs at the head of the queue don’t need to use the whole cluster, later jobs can start to run right away, but if the jobs at the head of the queue are large, then later jobs may be delayed significantly.
You should then see how many tasks a single job will execute and divide that by the number of cores assigned to the Spark application (you can check it in the web UI's Executors page).
That will give you the estimate on how many "cycles" you may need to wait before all tasks (and hence the jobs) complete.
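As a rough illustration of that arithmetic, here is a small Python sketch. The function name estimate_cycles is hypothetical, and it assumes all tasks take about the same time; cpus_per_task corresponds to the spark.task.cpus setting, which defaults to 1.

```python
import math


def estimate_cycles(num_tasks: int, total_cores: int, cpus_per_task: int = 1) -> int:
    """Estimate how many rounds of tasks are needed to run num_tasks.

    Each round can run as many tasks in parallel as there are task slots,
    where a slot is cpus_per_task cores.
    """
    slots = total_cores // cpus_per_task
    if slots < 1:
        raise ValueError("not enough cores to run a single task")
    return math.ceil(num_tasks / slots)
```

For example, 200 tasks on 16 cores need 13 rounds, while a job with a single task always needs just one, which matches the one-stage, one-task jobs in the question.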
NB: That's where dynamic allocation comes into play, as you may sometimes want to start with very few cores upfront and get more later. That's the conclusion I offered to my client when we noticed similar behaviour.
I can see that all the jobs in your example have 1 stage with 1 task (which makes them very simple and highly unrealistic for a production environment). That tells me that your machine could have been busier at different intervals, so the time Spark took to schedule a job was longer, but once scheduled the corresponding stage finished as quickly as the stages of the other jobs. I'd say it's the beauty of profiling that it may sometimes (often?) get very unpredictable and hard to reason about.
Just to shed more light on the internals of how the web UI works: the web UI uses a bunch of Spark listeners that collect the current status of the running Spark application. There is at least one Spark listener per page in the web UI. They intercept different execution times depending on their role.
Read about the org.apache.spark.scheduler.SparkListener interface and review its callbacks to learn about the variety of events they can intercept.
I am new to Spark Streaming and have a few questions:
Do we always need more than one executor, or can we do our job with one?
I am pulling data from Kafka using createDirectStream, which is the receiver-less method, and the batch duration is one minute. Is my data received during one batch and then processed during the next batch duration, or is it processed simultaneously?
If it is processed simultaneously, how is it ensured that my processing finishes within the batch duration?
How do I use the web UI for monitoring and debugging?
Do we always need more than one executor, or can we do our job with one?
It depends :). If you have a very small volume of traffic coming in, it could very well be that one machine could suffice in terms of load. In terms of fault tolerance that might not be a very good idea, since a single executor could crash and make your entire stream fail.
I am pulling data from Kafka using createDirectStream, which is the receiver-less method, and the batch duration is one minute. Is my data received during one batch and then processed during the next batch duration, or is it processed simultaneously?
Your data is read once per minute, processed, and only upon the completion of the entire job will it continue to the next. As long as your batch processing time is less than one minute, there shouldn't be a problem. If processing takes more than a minute, you will start to accumulate delays.
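To make the accumulation concrete, here is a small Python simulation. It is a simplified model rather than Spark's actual scheduler: the function name scheduling_delays is hypothetical, batches become ready at fixed intervals, and each batch starts only after the previous one has finished.

```python
def scheduling_delays(batch_interval_s, processing_times_s):
    """Return how long (in seconds) each batch waits before it starts.

    Batch i becomes ready at i * batch_interval_s and is processed only
    after the previous batch has finished (one batch at a time).
    """
    delays = []
    free_at = 0  # time at which the previous batch finishes
    for i, processing in enumerate(processing_times_s):
        ready_at = i * batch_interval_s
        start_at = max(ready_at, free_at)
        delays.append(start_at - ready_at)
        free_at = start_at + processing
    return delays
```

With a 60-second interval, batches that each take 45 seconds never wait, while batches that each take 90 seconds wait 0, 30, 60, ... seconds, so the delay grows by 30 seconds per batch.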
If it is processed simultaneously, how is it ensured that my processing finishes within the batch duration?
As long as you don't set spark.streaming.concurrentJobs to more than 1, a single streaming graph will be executed, one at a time.
How do I use the web UI for monitoring and debugging?
This question is generally too broad for SO. I suggest starting with the Streaming tab that gets created once you submit your application, and start diving into each batch details and continuing from there.
To add a bit more on monitoring
How do I use the web UI for monitoring and debugging?
Monitor your application in the Streaming tab on localhost:4040. The main metrics to look for are Processing Time and Scheduling Delay. Have a look at the official doc: http://spark.apache.org/docs/latest/streaming-programming-guide.html#monitoring-applications
batch duration is one minute
Your batch duration is a bit long; try lower values to improve your latency. 4 seconds can be a good start.
Also it's a good idea to monitor these metrics on Graphite and set alerts. Have a look at this post https://stackoverflow.com/a/29983398/3535853