I have the following SLA for my big data project: regardless of the number of concurrent Spark tasks, the maximum execution time must not exceed 5 minutes. For example: if there are 10 concurrent Spark tasks,
the slowest task must take < 5 minutes, and as the number of tasks increases I have to be sure this time still does not exceed 5 minutes. Ordinary autoscaling is not appropriate here because adding new nodes takes a couple of minutes, and it doesn't solve the problem of sudden growth in the number of tasks (e.g. a jump from 10 concurrent tasks to 30 concurrent tasks).
I came across the idea of spinning up a new cluster on demand to meet the SLA requirements. Let's say I found the maximum number of concurrent tasks (all of them are roughly equal and take the same resources) that can be executed simultaneously on my cluster within 5 minutes, e.g. 30 tasks. When the number of tasks approaches that threshold, a new cluster is spun up. The idea of this pattern is to work around the slowness of autoscaling and meet the SLA.
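For reference, here is a rough Python sketch of the threshold check I have in mind (the 30-task limit, the headroom factor, and the provisioning call are my own assumptions, not a real provider API):

# Rough sketch of the scale-out decision; provision_new_cluster() is a placeholder,
# not a real provider API.
MAX_TASKS_PER_CLUSTER = 30   # measured: tasks one cluster finishes within the 5-minute SLA
HEADROOM = 0.8               # start provisioning before the limit is actually reached

def provision_new_cluster() -> None:
    ...  # request a pre-sized cluster from the provider; details depend on the provider

def should_provision(concurrent_tasks: int, active_clusters: int) -> bool:
    """Return True when the task load is close enough to total capacity
    that a new cluster should be requested now (provisioning takes minutes)."""
    capacity = active_clusters * MAX_TASKS_PER_CLUSTER
    return concurrent_tasks >= HEADROOM * capacity

def on_metrics_tick(concurrent_tasks: int, active_clusters: int) -> None:
    if should_provision(concurrent_tasks, active_clusters):
        provision_new_cluster()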
My question is: are there any alternative options to this pattern, apart from autoscaling a single cluster (which is not suitable for my use case because of the slowness of my Spark provider)?
My use case is to create dynamic delayed jobs. (I am using Bull queues, which can be used to create delayed jobs.)
Based on some event, add more delay to the delay interval (i.e. further delay the job).
Since I could not find any function to update the delay interval of a job, I came up with the following steps:
// queue is of type Bull.Queue, job is of type Bull.Job
async function onEvent(jobId) {
  const job = await queue.getJob(jobId);
  const data = job.data;
  const delay = job.toJSON().delay;
  await job.remove();
  // re-add under the same jobId, with the same data and the delay extended
  await queue.add("jobName", data, { jobId: jobId, delay: delay + someValue });
}
This pretty much solves my problem.
But I am worried about the SCALE at which these operations will happen.
I am expecting nearly 50K events per minute, or even more in the near future.
My queue size is expected to grow with the number of unique job IDs.
I am expecting more than:
1 million entries daily
around 4-5 million entries weekly
10-12 million entries monthly.
Also, after 60-70 days the delay interval of jobs will be reached, and older jobs will be removed one by one.
I can run multiple processors to handle these delayed jobs, which is not an issue.
My queue size will stabilise after 60-70 days, and more or less my queue will hold around 10 million jobs.
I can vertically scale my Redis as required.
But I want to understand the time complexity of the queries below:
queue.getJob(jobId) // Get Job By Id
job.remove() // remove job from queue
queue.add(name, data, opts) // add a delayed job to this queue
If any of these operations is O(N), or the queue can only hold some maximum number of jobs that is less than 10 million,
then I might have to discard this design and come up with something entirely different.
I need advice from experienced folks who can guide me on how to solve this problem.
Any kind of help is appreciated.
Taking the bull source code as a reference:
queue.getJob(jobId)
This should be O(1) since it mostly relies on hash-based lookups via HMGET. You're only requesting one job, and according to the official Redis docs the time complexity of HMGET is O(N) where N is the number of requested fields, which here is effectively O(1) since Bull stores only a small number of fields at the hash key.
job.remove()
Considering that a considerable number of your jobs will be delayed and only a fraction of them moved to the waiting or active queue, this should be O(log N) amortized, as it mostly uses ZREM for these operations.
queue.add(name, data, opts)
For adding a job to a delayed queue, Bull uses ZADD, so this is again O(log N).
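To make this concrete, here is a small sketch of the underlying Redis commands and their documented complexities (shown with redis-py purely for illustration; the key names are made up and are not Bull's actual key layout):

import redis

r = redis.Redis()

# HMGET: O(N) in the number of requested fields, so effectively O(1)
# for a single job hash with a handful of fields.
r.hmget("bull:myqueue:job-123", ["data", "opts", "delay"])

# ZADD: O(log N) per member added to the sorted set of delayed jobs.
r.zadd("bull:myqueue:delayed", {"job-123": 1700000000000})

# ZREM: O(M * log N), i.e. O(log N) when removing a single member.
r.zrem("bull:myqueue:delayed", "job-123")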
There are 4 major actions (JDBC writes) in the application and a few counts, which in total take around 4-5 minutes to complete.
But the total uptime of the application is around 12-13 minutes.
I see certain jobs named "run at ThreadPoolExecutor.java:1149". The long, invisible delays occur just before these jobs show up in the Spark UI.
I want to know what are the possible causes for these delays.
My application reads 8-10 CSV files and 5-6 views from tables. There are around 59 joins, a few groupBy with agg(sum), and 3 unions.
I am not able to reproduce the issue in the DEV/UAT environments since the data there is not nearly as large.
It's in production, where the application is run by my manager, that I see the issue.
If anyone has come across such delays in their jobs, please share your experience of what the potential cause could be. Currently I am working around the unions, i.e. caching the associated dataframes and calling count so as to get the benefit of the cache in the subsequent union (yet to test whether the unions are the reason for the delays).
Similarly, I tried to break the long chain of transformations with cache and count in between, to cut the long lineage.
The time went down from the initial 18 minutes to 12 minutes, but the issue with the invisible delays still persists.
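For reference, here is a minimal PySpark sketch of the cache-and-count workaround described above (the file paths and dataframe names are just illustrative):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("union-cache-workaround").getOrCreate()

# Illustrative inputs; in the real job these come from CSV files and views.
df_a = spark.read.csv("/data/part_a.csv", header=True, inferSchema=True)
df_b = spark.read.csv("/data/part_b.csv", header=True, inferSchema=True)

# Cache and materialize each side before the union so the long lineage
# behind it is computed once and the union reuses the cached data.
df_a = df_a.cache()
df_a.count()          # action to force materialization of the cache
df_b = df_b.cache()
df_b.count()

combined = df_a.unionByName(df_b)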
Thanks in advance
I assume you don't have CPU- or IO-heavy code between your Spark jobs.
So it really is Spark, and 99% of the time it is query planning delay.
You can use spark.listenerManager.register(QueryExecutionListener) to check different metrics of query planning performance.
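If the application is driven from PySpark, the listener class itself lives on the JVM side; one way to wire it in is the spark.sql.queryExecutionListeners configuration, pointing at a listener implementation shipped on the classpath. A minimal sketch (the listener class name below is hypothetical):

from pyspark.sql import SparkSession

# The listener must be a JVM-side implementation of
# org.apache.spark.sql.util.QueryExecutionListener available on the classpath;
# "com.example.PlanningTimeListener" is a hypothetical name.
spark = (
    SparkSession.builder
    .appName("query-planning-metrics")
    .config("spark.sql.queryExecutionListeners", "com.example.PlanningTimeListener")
    .getOrCreate()
)

# The listener's onSuccess/onFailure callbacks receive the QueryExecution object,
# from which planning-related timings can be inspected and logged.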
I apologize for the long question but I have to explain it.
I am doing a change point detection through python for almost 50 customers and for different processes.
I am getting minute-by-minute numeric data from InfluxDB.
I am calculating the Z-score and saving it in MongoDB locally on the machine on which the cron job is running.
I am comparing the Z-score with the historic Z-score and then alerting the system.
Issues:
As I am doing this for 50 customers, which can scale to say 500 or 5000, and each customer has say 10 processes, it's not practical to have that many cron jobs.
The growing number of cron jobs creates high CPU usage. Also, since my data sits locally, if I lose the Linux machine I lose all the data sitting there and won't be able to compare it with historic data.
Solutions:
Create a clustered MongoDB server rather than saving the data locally.
Replace the cron jobs with multithreading and multiprocessing.
Suggestions:
What could be the best way to implement this to decrease the load on the CPU, considering this is going to run all the time in a loop?
After fixing the above, what could be the best way to decrease the number of false positives in alerts?
Keep in mind:
It's time series data.
Every time stamp has 5 or 6 variables, and as of now I am doing the same operation for each variable separately.
Time       var1     var2     var3      ...   varN
09:00 PM   10,000   5,000    150,000   ...   10
09:01 PM   10,500   5,050    160,000   ...   25
There is a possibility that these numbers are correlated.
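For context, here is a simplified sketch of the per-variable Z-score check I am running today (column names, window size, and threshold are illustrative):

import pandas as pd

def zscore_alerts(df: pd.DataFrame, window: int = 60, threshold: float = 3.0) -> pd.DataFrame:
    """Compute a rolling Z-score per variable column and flag points that
    deviate strongly from the recent history. df is indexed by timestamp."""
    alerts = pd.DataFrame(index=df.index)
    for col in df.columns:                   # var1, var2, ..., varN
        mean = df[col].rolling(window).mean()
        std = df[col].rolling(window).std()
        z = (df[col] - mean) / std
        alerts[col] = z.abs() > threshold    # True where a change point is suspected
    return alerts

# Usage: df has one row per minute from InfluxDB and one column per variable.
# alerts = zscore_alerts(df)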
Thanks!
I am running a dummy Spark job that does exactly the same set of operations in every iteration. The following figure shows 30 iterations, where each job corresponds to one iteration. It can be seen that the duration is always around 70 ms except for jobs 0, 4, 16, and 28. The behavior of job 0 is expected, as it is when the data is first loaded.
But when I click on job 16 to enter its detailed view, the duration is only 64 ms, which is similar to the other jobs; the screenshot of this duration is as follows:
I am wondering where Spark spends the (2000 - 64) ms on job 16.
Gotcha! That's exactly the same question I asked myself a few days ago. I'm glad to share my findings with you (hoping that where my understanding is lacking, others will chime in and fill the gaps).
The difference between what you can see in Jobs and Stages pages is the time required to schedule the stage for execution.
In Spark, a single job can have one or many stages with one or many tasks. That creates an execution plan.
By default, a Spark application runs in FIFO scheduling mode which is to execute one Spark job at a time regardless of how many cores are in use (you can check it in the web UI's Jobs page).
Quoting Scheduling Within an Application:
By default, Spark’s scheduler runs jobs in FIFO fashion. Each job is divided into "stages" (e.g. map and reduce phases), and the first job gets priority on all available resources while its stages have tasks to launch, then the second job gets priority, etc. If the jobs at the head of the queue don’t need to use the whole cluster, later jobs can start to run right away, but if the jobs at the head of the queue are large, then later jobs may be delayed significantly.
You should then see how many tasks a single job will execute and divide it by the number of cores the Spark application has assigned (you can check it in the web UI's Executors page).
That will give you an estimate of how many "cycles" you may need to wait before all tasks (and hence the jobs) complete.
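As a tiny worked example with made-up numbers:

import math

num_tasks = 200     # total tasks of the job (web UI, Jobs/Stages pages)
total_cores = 48    # cores assigned to the application (web UI, Executors page)

# Number of scheduling "waves" needed before all tasks have run.
waves = math.ceil(num_tasks / total_cores)
print(waves)        # 5 -> roughly 5 x (average task duration) to finish the job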
NB: That's where dynamic allocation comes into play, as you may sometimes want more cores later after starting with very few upfront. That's the conclusion I offered to my client when we noticed similar behaviour.
I can see that all the jobs in your example have 1 stage with 1 task (which makes them very simple and highly unrealistic for a production environment). That tells me that your machine could have been busier at different intervals, so the time Spark took to schedule a job was longer, but once scheduled, the corresponding stage finished in about the same time as the stages of the other jobs. I'd say it's a beauty of profiling that it may sometimes (often?) get very unpredictable and hard to reason about.
Just to shed more light on the internals of how the web UI works: it uses a bunch of Spark listeners that collect the current status of the running Spark application. There is at least one Spark listener per page in the web UI. They intercept different execution times depending on their role.
Read about the org.apache.spark.scheduler.SparkListener interface and review the different callbacks to learn about the variety of events they can intercept.
In other words, what does a parallelism value of 5 and a priority value of 1000 mean?
They impact how and when your job can run. Priority determines the order in which a job runs in relation to other queued jobs, while parallelism sets how many parallel processes are started for it (more means it runs faster but costs more).
https://learn.microsoft.com/en-us/azure/data-lake-analytics/data-lake-analytics-manage-use-portal
Priority
A lower number means a higher priority. If two jobs are both queued, the one with the lower priority value runs first.
The default value is 1000.
Parallelism
The maximum number of compute processes that can run at the same time. Increasing this number can improve performance but can also increase cost.