Hill climbing search algorithm stopping criteria for job assignment

Let's say there are 10 jobs and 15 workers. The objective is to assign the jobs to workers such that each job's requirements are satisfied and the total job processing time is minimized.
In each iteration, a job is selected at random and reassigned to the worker with the next-lower processing time compared to the currently assigned worker. For example, say the job is currently assigned to worker 3 with a processing time of 10; the next-lower processing time is 8 at worker 5, so the job is reassigned to worker 5.
My question is: how do I determine the stopping criterion for the iterations? For the time being, I just set the number of iterations to the number of jobs or to the number of workers.
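
One common stopping rule for hill climbing is "stop after K consecutive iterations without improvement", optionally combined with a hard cap on total iterations, rather than tying the count to the number of jobs or workers. Below is a minimal Java sketch of that rule under some simplifying assumptions: processing times are known in a jobs-by-workers matrix, feasibility constraints are ignored, and the names (processingTime, assignment, maxNoImprovement) are illustrative, not from the question.

import java.util.Random;

public class JobAssignmentHillClimb {

    // Total processing time of the current assignment:
    // processingTime[j][w] = time for worker w to process job j (assumed known up front).
    static int totalTime(int[][] processingTime, int[] assignment) {
        int total = 0;
        for (int j = 0; j < assignment.length; j++) {
            total += processingTime[j][assignment[j]];
        }
        return total;
    }

    // Hill climbing with a "no improvement for maxNoImprovement consecutive iterations" stopping rule.
    static int[] hillClimb(int[][] processingTime, int[] assignment, int maxNoImprovement) {
        Random rnd = new Random();
        int best = totalTime(processingTime, assignment);
        int noImprovement = 0;
        while (noImprovement < maxNoImprovement) {
            int job = rnd.nextInt(assignment.length);
            int current = assignment[job];
            // The move from the question: find the worker with the next-lower
            // processing time for this job than the currently assigned worker.
            int candidate = -1;
            for (int w = 0; w < processingTime[job].length; w++) {
                if (processingTime[job][w] < processingTime[job][current]
                        && (candidate < 0 || processingTime[job][w] > processingTime[job][candidate])) {
                    candidate = w;
                }
            }
            if (candidate >= 0) {
                assignment[job] = candidate;
                int cost = totalTime(processingTime, assignment);
                if (cost < best) {
                    best = cost;
                    noImprovement = 0;         // reset the counter on every improving move
                    continue;
                }
                assignment[job] = current;     // revert a non-improving move
            }
            noImprovement++;                   // count iterations that did not improve the objective
        }
        return assignment;
    }
}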

Related

How does Hazelcast Jet assign task-to-CPU priority?

If I have the following code and let's say I'm running on 10 nodes of 32 cores each:
IList<...> ds = ....; // large collection, e.g. 1e6 elements
ds
  .map()       // expensive computation
  .flatMap()   // generates 10,000x more elements for every 1 incoming element
  .rebalance()
  .map()       // expensive computation
  ....         // other transformations (i.e. can be a sink, keyBy, flatMap, map, etc.)
What will Hazelcast do with respect to task-to-CPU assignment priority when the SECOND map operation wants to process the 10,000 elements that were generated from the 1st original element? Will it devote the 320 CPU cores (from 10 nodes) to processing the 1st original element's 10,000 generated elements? If so, will it "boot off" already running tasks? Or will it wait for already running tasks to complete, and then give priority to the 10,000 elements resulting from the output of the flatmap-rebalance operations? Or would the 10,000 elements be forced to run on a single core, since the remaining 319 cores are already being consumed by the output of the ds operation (i.e. the input of the 1st map)? Or is there some random competition for who gets access to the CPU cores?
What I would ideally like to happen is that Hazelcast does NOT boot off running tasks (it lets them complete), but when deciding which tasks get priority to run on a core, it chooses the path that would lead to the lowest latency, i.e. it would process all 10,000 elements resulting from the output of the flatmap-rebalance operation on all 320 cores.
Note: I asked a virtually identical question about Flink a few weeks ago, but have since switched to trying out Hazelcast: How does Flink (in streaming mode) assign task-to-CPU priority?
First, IList is a non-distributed data structure; all its data is stored on a single member. The IList source therefore produces all the data on that member, so the 1st expensive map will be done entirely there. However, map is backed, by default, by as many workers as there are cores, so 32 workers in your case.
The rebalance stage will cause the 2nd map to run on all members. Each of the 10,000 elements produced by the flatMap is handled separately, so if you have 1 element in your IList, the 10k elements produced from it will be processed concurrently by 320 workers.
The workers backing the different stages of the pipeline compete for cores normally. There will be a total of 96 workers for the 1st map, the flatMap and the 2nd map together. Jet uses cooperative scheduling for these workers, which means it cannot preempt a computation that is taking too long. This means that one item taking a long time to process will block other workers.
Also keep in mind that the map and flatMap functions must be cooperative; that means they must not block (by waiting on IO, sleeping, or waiting for monitors). If they block, you'll see less than 100% CPU utilization. Check out the documentation for more information.
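
For reference, here is a rough sketch of such a pipeline with the Jet Pipeline API, assuming Jet 4.2 or later (where rebalance() is available). The list name "ds", the expensive() transform and the 10,000x fan-out are placeholders standing in for the original code, not Hazelcast-provided names:

import com.hazelcast.jet.Jet;
import com.hazelcast.jet.JetInstance;
import com.hazelcast.jet.Traversers;
import com.hazelcast.jet.pipeline.Pipeline;
import com.hazelcast.jet.pipeline.Sinks;
import com.hazelcast.jet.pipeline.Sources;
import java.util.stream.LongStream;

public class RebalanceSketch {
    public static void main(String[] args) {
        JetInstance jet = Jet.bootstrappedInstance();

        Pipeline p = Pipeline.create();
        p.readFrom(Sources.<Long>list("ds"))          // IList source: all items are read on one member
         .map(RebalanceSketch::expensive)             // 1st expensive map: 32 workers on that member
         .flatMap(x -> Traversers.traverseStream(     // 1 -> 10,000 fan-out
                 LongStream.range(0, 10_000).map(i -> x + i).boxed()))
         .rebalance()                                 // distribute the fanned-out items to all members
         .map(RebalanceSketch::expensive)             // 2nd expensive map: 32 workers x 10 members = 320
         .writeTo(Sinks.logger());

        jet.newJob(p).join();
    }

    // Placeholder for a CPU-bound, non-blocking (cooperative) transformation.
    static long expensive(long x) {
        return x * 31;
    }
}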

Scaling Spark application with adding new clusters

I have the following SLA for my big data project: regardless of the number of concurrent Spark tasks, the maximum execution time shouldn't exceed 5 minutes. For example, if there are 10 concurrent Spark tasks,
the slowest task should take < 5 minutes, and as the number of tasks increases I have to be sure this time still won't exceed 5 minutes. Usual autoscaling is not appropriate here because adding new nodes takes a couple of minutes, and it doesn't solve the problem of sudden growth in the number of tasks (e.g. a jump from 10 concurrent tasks to 30 concurrent tasks).
I came across the idea of spinning up a new cluster on demand to meet the SLA requirements. Let's say I have found the maximum number of concurrent tasks (all of them nearly identical and taking the same resources) that can be executed simultaneously on my cluster within 5 minutes, e.g. 30 tasks. When the number of tasks approaches that threshold, a new cluster is spun up. The idea of this pattern is to work around the slowness of autoscaling and still meet the SLA.
My question is: are there any alternative options to this pattern, other than autoscaling a single cluster (which is not suitable for my use case because of the slowness of my Spark provider)?

How to submit jobs across multiple partitions at the same time (Slurm)

After I submitted a job to node/partition cn430 today, I found that the node was kept occupied by other jobs.
After the previous job finished, my job still didn't start running due to priority. Then I noticed that all of those jobs have the same prefix, namely 4988443, which is ahead of my job ID 4988560.
It seems that the user has submitted about 1000 jobs with the same priority across multiple partitions at once.
I am wondering how to do that.
First off, cn430 really looks like a node rather than a partition. The partition it belongs to seems to be named shared-gp.
What you see is a job array. It is a way to submit a large number of jobs that differ only in a specific parameter. Each job in the array is scheduled independently, so if you do not request a specific node (e.g. with -w or --nodelist), Slurm will spread them over whatever nodes are available.
Note that job priorities decay over time if fairshare is implemented, so the jobs that are currently pending will have their priority decrease because of those currently running.
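
For illustration, a job array like that is typically submitted with a batch script along these lines. The script contents, program name, partition and array range here are made up; the essential parts are the --array option and the SLURM_ARRAY_TASK_ID variable:

#!/bin/bash
#SBATCH --job-name=myarray
#SBATCH --partition=shared-gp
#SBATCH --array=1-1000            # one array job, 1000 tasks; they show up as <jobid>_1, <jobid>_2, ...
#SBATCH --time=01:00:00

# Each array task sees its own index and uses it to pick its parameter or input file.
srun ./my_program --input "input_${SLURM_ARRAY_TASK_ID}.dat"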

What do priority and parallelism value mean in Azure Data Lakes (Hadoop)?

In other words, what do a parallelism value of 5 and a priority value of 1000 mean?
They impact how and when your job can run. Priority determines the order in which a job runs relative to other queued jobs; parallelism sets how many parallel processes are started for it (more means it runs faster but costs more).
https://learn.microsoft.com/en-us/azure/data-lake-analytics/data-lake-analytics-manage-use-portal
Priority
A lower number means higher priority. If two jobs are both queued, the one with the lower priority value runs first.
The default value is 1000.
Parallelism
Maximum number of compute processes that can run at the same time. Increasing this number can improve performance but can also increase cost.
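
If jobs are submitted from the command line instead of the portal, both values are passed at submission time. A rough example with the Azure CLI follows; the account, job and script names are placeholders, and the option names are from memory, so verify them against the linked documentation:

az dla job submit \
    --account myadlaaccount \
    --job-name myjob \
    --script "$(cat myscript.usql)" \
    --degree-of-parallelism 5 \
    --priority 1000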

Spark Streaming Processing Time vs Total Delay vs Processing Delay

I am trying to understand what the different metrics that Spark Streaming outputs mean, and I am slightly confused about the difference between the Processing Time, Total Delay and Processing Delay of the last batch.
I have looked at the Spark Streaming guide, which mentions the Processing Time as a key metric for figuring out whether the system is falling behind, but other places, such as "Pro Spark Streaming: The Zen of Real-Time Analytics Using Apache Spark", speak about using the Total Delay and Processing Delay. I have failed to find any documentation that lists all the metrics produced by Spark Streaming with an explanation of what each one of them means.
I would appreciate it if someone could outline what each of these three metrics means, or point me to any resources that can help me understand that.
Let's break down each metric. For that, let's define a basic streaming application which reads a batch from some arbitrary source at a 4-second interval and computes the classic word count:
inputDStream.flatMap(line => line.split(" "))
.map(word => (word, 1))
.reduceByKey(_ + _)
.saveAsTextFile("hdfs://...")
Processing Time: The time it takes to compute a given batch for all its jobs, end to end. In our case this means a single job which starts at flatMap and ends at saveAsTextFile, and assumes as a prerequisite that the job has been submitted.
Scheduling Delay: The time taken by the Spark Streaming scheduler to submit the jobs of the batch. How is this computed? As we've said, our batch reads from the source every 4 seconds. Now let's assume that a given batch took 8 seconds to compute. This means that we're now 8 - 4 = 4 seconds behind, thus making the scheduling delay 4 seconds long.
Total Delay: This is Scheduling Delay + Processing Time. Following the same example, if we're 4 seconds behind, meaning our scheduling delay is 4 seconds, and the next batch took another 8 seconds to compute, this means that the total delay is now 8 + 4 = 12 seconds long.
A live example from a working Streaming application (the screenshot of the Streaming UI is not reproduced here):
We see that:
The bottom job took 11 seconds to process, so the next batch's scheduling delay is 11 - 4 = 7 seconds.
If we look at the second row from the bottom, we see that scheduling delay + processing time = total delay; in that case (rounding 0.9 to 1) 7 + 1 = 8.
We're experiencing a stable processing time, but an increasing scheduling delay.
Based on the answer, the scheduling delay should be influenced only by the processing time of previous runs.
Spark is running only streaming, nothing else.
The time window is 1 minute, processing 120K records.
If your window is 1 minute and the average processing time is 1 minute 7 seconds, you have a problem: each batch will delay the next one by 7 seconds.
Your processing time graph shows a stable processing time, but one that is always higher than the batch time.
I think that after a given amount of time your driver will crash with a "GC overhead limit exceeded" error, as it will be full of pending batches waiting to be executed.
You can change this by reducing the processing time so that it goes under the expected microbatch max duration (this requires code and/or resource allocation changes), by increasing the microbatch size, or by going to continuous streaming.
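
For the "increase the microbatch size" option, the batch interval is set when the streaming context is created. A minimal sketch (the app name and the 2-minute interval are illustrative; processing must still finish within the chosen interval for the scheduling delay to stay flat):

import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

public class BatchIntervalSketch {
    public static void main(String[] args) throws InterruptedException {
        SparkConf conf = new SparkConf().setAppName("streaming-sketch");
        // A 2-minute batch interval instead of 1 minute.
        JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.minutes(2));
        // ... define the DStream transformations here ...
        jssc.start();
        jssc.awaitTermination();
    }
}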
