How to submit jobs across multiple partitions at the same time (Slurm)

After I submitted a job to node/partition cn430 today, I found that the node stayed occupied.
Even after the previous job finished, my job still did not start running because of priority. Then I noticed that all of the jobs ahead of mine share the same prefix, namely 4988443, which is lower than my job ID 4988560.
It seems that the user submitted about 1000 jobs at once, with the same priority, across multiple partitions.
I am wondering how to do that.

First off, cn430 really looks like a node rather than a partition. The partition to which it belongs seems to be named shared-gp.
What you see is a job array. It is a way to submit a large number of jobs that differ only in a specific parameter. Each job in the array is scheduled independently, so if you do not request a specific node (e.g. with -w or --nodelist), Slurm will dispatch them to whichever nodes are available.
Note that job priorities decay over time if fairshare is in use, so the jobs that are currently pending will have their priority decrease because of those currently running.
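As a rough illustration of the job-array approach described above, the following Python sketch builds the text of an sbatch script. --array and --partition are standard sbatch options, and a comma-separated partition list lets Slurm start each task in whichever listed partition can run it first. The helper name build_array_script, the script run_case.sh and the partition name other-partition are made up for the example; shared-gp is taken from the discussion above.

```python
def build_array_script(command, n_tasks, partitions, max_concurrent=None):
    """Return the text of an sbatch script that runs `command` as a job array.

    `command` and the partition names are placeholders chosen by the caller.
    """
    if n_tasks < 1:
        raise ValueError("n_tasks must be at least 1")
    # Array indices 0..n_tasks-1; an optional %N suffix caps concurrent tasks.
    array_spec = f"0-{n_tasks - 1}"
    if max_concurrent is not None:
        array_spec += f"%{max_concurrent}"
    lines = [
        "#!/bin/bash",
        f"#SBATCH --array={array_spec}",
        # Comma-separated list: each task may start in any of these partitions.
        f"#SBATCH --partition={','.join(partitions)}",
        # %A is the array's master job ID, %a the task index.
        "#SBATCH --output=slurm-%A_%a.out",
        "",
        # Each task receives its own index through SLURM_ARRAY_TASK_ID.
        f'{command} "$SLURM_ARRAY_TASK_ID"',
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical example: 1000 tasks, at most 50 running at once.
    print(build_array_script("./run_case.sh", 1000,
                             ["shared-gp", "other-partition"],
                             max_concurrent=50))
```

The printed text can be saved to a file and submitted with sbatch. The tasks of one array share a single job ID prefix followed by the task index, which matches the common prefix observed in the question.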

Related

How to schedule millions of jobs in a node js properly?

I am using Node.js, MongoDB and the node-cron npm module to schedule jobs. For 10K jobs it takes little time and memory. But when I schedule 100K jobs, it takes more than 10 minutes and nearly 1.5 GB of RAM, and sometimes runs out of memory. Is there a better way to achieve this, such as using ActiveMQ or RabbitMQ?

One strategy is that you only schedule the next job to run. When it runs, you query the database and find the next job and schedule it.
If you add a new job, you check if it wants to run sooner than the now current next job and, if so, you schedule it and deschedule the previous next job (it will get rescheduled later after this new job runs).
If you remove a job, you check if it is the current next job. If it is, you deschedule it and find the next job in the database and schedule it.
If your database is configured for efficiently querying by job run time (for example, with an index on that field), this can be very efficient: it uses hardly any memory, because only one job is ever scheduled in the process, and it scales to a very large number of jobs.
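The strategy above can be sketched as follows. This is a minimal Python illustration rather than Node.js code, and it makes several assumptions: a plain dict stands in for the database, the class name NextJobScheduler and its methods are invented for the example, job IDs are unique, and "arming" a job simply records which job would have the single timer set.

```python
class NextJobScheduler:
    """Keeps a timer for only the soonest job; all others stay in the store."""

    def __init__(self):
        self.store = {}    # job_id -> run_at; stand-in for the database
        self.armed = None  # job_id of the one job that currently has a timer

    def _query_next(self):
        # Stand-in for an indexed "earliest run time first, limit 1" query.
        if not self.store:
            return None
        return min(self.store, key=self.store.get)

    def add(self, job_id, run_at):
        self.store[job_id] = run_at
        # Re-arm only if the new job is due before the currently armed one.
        if self.armed is None or run_at < self.store[self.armed]:
            self.armed = job_id

    def remove(self, job_id):
        self.store.pop(job_id, None)
        # If the armed job was removed, look up the next one in the store.
        if self.armed == job_id:
            self.armed = self._query_next()

    def fire(self):
        """Called when the timer goes off; returns the job that just ran."""
        job_id = self.armed
        if job_id is None:
            return None
        del self.store[job_id]
        self.armed = self._query_next()
        return job_id


if __name__ == "__main__":
    s = NextJobScheduler()
    s.add("a", 30)
    s.add("b", 10)
    s.add("c", 20)
    print(s.armed)   # the job with the earliest run time
    print(s.fire())  # runs it and arms the next one
    print(s.armed)
```

In a real service the armed job would correspond to one timer (for instance a single setTimeout in Node.js), so the process holds one timer regardless of how many jobs are stored.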

Edit the Job ID number of a pbs submitted job to achieve submission before other jobs in queue

Dear UNIX/PBS experts:
I am a user of a UNIX HPC system (CentOS Linux 7 (Core), Linux 3.10.0-693.5.2.el7.x86_64) and I do not have any root privileges.
Various jobs have been submitted at an HPC system and almost all resources are being used.
Jobs from other users may run for weeks while my submitted job would finish in less than a day.
My goal is to run my job as soon as the first resources are freed instead of waiting for all other users to have their jobs finished.
My submitted job has the ID 66005.pbs.
However, the last job running at this moment has the ID 55004.pbs.
By checking the status of job: qstat 55005,
I obtain: qstat: Unknown Job Id 55005.pbs
Thus my question is whether it is possible to change the name of job 66005.pbs to 55005.pbs, and if this action will allow my job to run?
If yes, how can this be achieved?
If not, are there any other solutions/alternatives for making sure that my jobs run before those ones of other users in queue?
Thank you very much for your help and any suggestion.

The good thing about a computer system is that it is not human. It would be unfair to run your job (which was clearly submitted after those of other users) ahead of theirs, so no, it is not possible to change your job ID.
You can work with your admin to move the job to a higher priority queue instead.

Apache Spark DAGScheduler Flow of Data

I am trying to understand how exactly the Apache Spark scheduler works. To do so, I've set up a local cluster with one master and two workers. I submit only one application, which simply reads 4 files (2 small (~10 MB) and 2 big (~1.1 GB)), joins them and collects the result. In addition, I cache the two small files in memory.
I am running the standalone cluster mode with FIFO. I've understood how the stages are formed, but I cannot figure out how the flow of data (the arrows) is determined. When I look at the Spark UI, I notice that each time, even though the stages are formed in the same way, the arrows (flow of data and control, I guess) are different. It's like the scheduler works non-deterministically.
I've read the relevant chapters (about the DAG and Task Scheduler) from Jacek Laskowski's book, but it still isn't clear to me how the flow of control is determined. Thanks in advance for the help.
Cheers,
Jim

It's like the scheduler works non-deterministically.
Yes, there's some randomness in scheduling tasks to make it more "fair". In that sense Spark scheduler does work "non-deterministically", but within acceptable limits of execution placement (i.e. assigning tasks with lesser location preferences to executors).
The component in Apache Spark that does the work of selecting a task for a task set (that corresponds to a stage) is TaskSetManager:
Schedules the tasks within a single TaskSet in the TaskSchedulerImpl. This class keeps track of each task, retries tasks if they fail (up to a limited number of times), and handles locality-aware scheduling for this TaskSet via delay scheduling. The main interfaces to it are resourceOffer, which asks the TaskSet whether it wants to run a task on one node, and statusUpdate, which tells it that one of its tasks changed state (e.g. finished).

Why does web UI show different durations in Jobs and Stages pages?

I am running a dummy Spark job that does exactly the same set of operations in every iteration. The following figure shows 30 iterations, where each job corresponds to one iteration. It can be seen that the duration is always around 70 ms except for jobs 0, 4, 16, and 28. The behavior of job 0 is expected, as that is when the data is first loaded.
But when I click on job 16 to enter its detailed view, the duration is only 64 ms, which is similar to the other jobs. The screenshot of this duration is as follows:
I am wondering where Spark spends the remaining (2000 - 64) ms on job 16.

Gotcha! That's exactly the same question I asked myself a few days ago. I'm glad to share the findings with you (hoping that where my understanding is lacking, others will chime in and fill the gaps).
The difference between what you can see in Jobs and Stages pages is the time required to schedule the stage for execution.
In Spark, a single job can have one or many stages with one or many tasks. That creates an execution plan.
By default, a Spark application runs in FIFO scheduling mode which is to execute one Spark job at a time regardless of how many cores are in use (you can check it in the web UI's Jobs page).
Quoting Scheduling Within an Application:
By default, Spark’s scheduler runs jobs in FIFO fashion. Each job is divided into "stages" (e.g. map and reduce phases), and the first job gets priority on all available resources while its stages have tasks to launch, then the second job gets priority, etc. If the jobs at the head of the queue don’t need to use the whole cluster, later jobs can start to run right away, but if the jobs at the head of the queue are large, then later jobs may be delayed significantly.
You should then see how many tasks a single job will execute and divide it by the number of cores the Spark application has been assigned (you can check it in the web UI's Executors page).
That will give you an estimate of how many "cycles" you may need to wait before all tasks (and hence the jobs) complete.
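The estimate described above is a ceiling division. A minimal Python sketch, assuming every task takes one core and all tasks take roughly the same time (the function name estimate_cycles and the numbers are made up for illustration):

```python
import math


def estimate_cycles(num_tasks, num_cores):
    """Rough number of scheduling "cycles" needed to run all tasks of a job.

    Assumes one core per task and similar task durations.
    """
    if num_cores < 1:
        raise ValueError("num_cores must be at least 1")
    return math.ceil(num_tasks / num_cores)


if __name__ == "__main__":
    # Hypothetical job: 200 tasks on an application that was given 8 cores.
    print(estimate_cycles(200, 8))
```

With 200 tasks and 8 cores this gives 25 cycles, while a job with a single task, like the ones in the question, needs only one.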
NB: That's where dynamic allocation comes to the stage, as you may sometimes want to start with very few cores upfront and get more later. That's the conclusion I offered to my client when we noticed a similar behaviour.
I can see that all the jobs in your example have 1 stage with 1 task (which makes them very simple and highly unrealistic for a production environment). That tells me that your machine could have got busier at different intervals, so the time Spark took to schedule a Spark job was longer, but once scheduled, the corresponding stage finished as quickly as the stages from the other jobs. I'd say it's a beauty of profiling that it may sometimes (often?) get very unpredictable and hard to reason about.
Just to shed more light on the internals of how the web UI works: the web UI uses a bunch of Spark listeners that collect the current status of the running Spark application. There is at least one Spark listener per page in the web UI. They intercept different execution times depending on their role.
Read about the org.apache.spark.scheduler.SparkListener interface and review the different callbacks to learn about the variety of events they can intercept.

Best practice beanstalkd (queue) and node.js

I am currently building a service using beanstalkd and Node.js.
When a job fails, I would like to retry it n times before giving up on it.
If the job succeeds, I want to run the same job 10 times.
So what is the best practice: store the error and success counts in MongoDB keyed by the job ID, or delete the job and put a new one with the error and success counts in the body?
I don't know if I'm being clear, so tell me if not. Thanks a lot.

There is a stats-job <id>\r\n command, which should also be available via the API library, that returns, among other things, how many times the specific job has been reserved, released, buried, and so on.
This allows for a number of retries of failed jobs by checking previous reservation/releases.
To run the same job multiple times, I would personally create either one additional job, with a success count that would then be incremented (into another new job) - or, all nine new jobs, with optional delays before they start.

You have a couple of ways to do this:
you can release the job, and obtain from stats the number of reserves
you can put a new job with a retry count, and keep track of history in the data payload
You should do the latter, and you don't need MongoDB as a second dependency.
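The second option, keeping the counters in the job payload, might look roughly like the sketch below. It is written in Python with an in-memory deque standing in for the beanstalkd tube, so no real client library calls are shown. The names new_job and process, the payload fields, and the limits MAX_FAILURES and TARGET_SUCCESSES are all assumptions made for the example.

```python
import json
from collections import deque

MAX_FAILURES = 3       # hypothetical: give up after this many failed attempts
TARGET_SUCCESSES = 10  # run a succeeding job this many times, per the question


def new_job(data):
    """Serialize a job body that carries its own history counters."""
    return json.dumps({"data": data, "failures": 0, "successes": 0})


def process(queue, handler):
    """Take one job off the queue, run it, and re-queue it if needed."""
    job = json.loads(queue.popleft())  # stand-in for reserve + delete
    try:
        handler(job["data"])
    except Exception:
        job["failures"] += 1
        if job["failures"] >= MAX_FAILURES:
            return "gave up"  # with beanstalkd one might bury the job here
        queue.append(json.dumps(job))  # stand-in for putting a new job
        return "retry"
    job["successes"] += 1
    if job["successes"] < TARGET_SUCCESSES:
        queue.append(json.dumps(job))
        return "repeat"
    return "done"


if __name__ == "__main__":
    tube = deque([new_job("send-report")])
    while tube:
        print(process(tube, lambda data: None))
```

Because the counters travel with the job body, no second datastore is needed. Whether failures should reset after a success is a policy choice this sketch leaves out.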
