Can I submit a job that may be executed in different partitions - slurm

When using the parameter -p you can define the partition for your job.
In my case a job can run in different partitions, so I do not want to restrict my job to only a single partition.
If my job can perfectly run in partitions "p1" and "p3", how can I configure the sbatch command to allow more than one partition?

The --partition option accepts a list of partitions. So in your case you would write
#SBATCH --partition=p1,p3
The job will start in the partition that offers the resources the earliest.

Related

Number of Tasks in Spark UI

I am new to Spark. I have a couple of questions regarding the Spark Web UI:
I have seen that Spark can create multiple Jobs for the same application. On what basis does it create the Jobs?
I understand Spark creates multiple Stages for a single Job around Shuffle boundaries, and that there is 1 task per partition. However, I have seen a particular Stage (e.g. Stage 1) of a particular Job create fewer tasks than the default shuffle partitions value (e.g. only 2/2 completed), and I have also seen the next Stage (Stage 2) of the same Job create 1500 tasks (e.g. 1500/1500 completed), which is more than the default shuffle partitions value.
So, how does Spark determine how many tasks it should create for any particular Stage?
Can anyone please help me understand the above.
The maximum number of tasks running at any one moment depends on your number of executors and the cores per executor; beyond that, different stages have different task counts because each stage gets one task per partition of the data it processes.
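As a rough illustration of where those numbers come from, here is a minimal PySpark sketch; the path, the column name and the value 1500 are placeholders chosen only to mirror the figures in the question. The first (read) stage gets one task per input partition, while stages after a shuffle get spark.sql.shuffle.partitions tasks.

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("stage-task-count")
         .config("spark.sql.shuffle.partitions", "1500")  # task count of post-shuffle stages
         .getOrCreate())

df = spark.read.csv("hdfs:///data/input.csv", header=True)  # hypothetical input
print(df.rdd.getNumPartitions())  # task count of the first (read) stage, e.g. 2

counts = df.groupBy("some_column").count()                  # shuffle boundary -> new stage
counts.write.mode("overwrite").parquet("hdfs:///data/out")  # triggers the job (and its stages)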

Is there any way to run spark scripts and store outputs in parallel with oozie?

I have 3 Spark scripts, and each of them has one Spark SQL query that reads a partitioned table and stores the result to some HDFS location. Every script has a different SQL statement and a different folder location to store data into.
test1.py - Read from table 1 and store to location 1.
test2.py - Read from table 2 and store to location 2.
test3.py - Read from table 3 and store to location 3.
I run these scripts using a fork action in Oozie and all three run. But the problem is that the scripts are not storing data in parallel.
Once the store from one script is done, the next store starts.
My expectation is to store all 3 tables' data into their respective locations in parallel.
I have tried FAIR scheduling and other scheduler techniques in the Spark scripts, but those don't work. Can anyone please help? I have been stuck on this for the last 2 days.
I am using AWS EMR 5.15, Spark 2.4 and Oozie 5.0.0.
For the Capacity Scheduler
If you are submitting jobs to a single queue, whichever job comes first in the queue gets the resources; intra-queue preemption won't work.
There is a related Jira for intra-queue preemption in the Capacity Scheduler: https://issues.apache.org/jira/browse/YARN-10073
You can read more at https://blog.cloudera.com/yarn-capacity-scheduler/
For the Fair Scheduler
Setting the "yarn.scheduler.fair.preemption" parameter to "true" in yarn-site.xml enables preemption at the cluster level. By default it is false, i.e. no preemption.
Your problem could be one of the following:
One job is taking the maximum resources. To verify this, check the YARN UI and the Spark UI.
Or you have more than one YARN queue (other than default); in that case, try setting User Limit Factor > 1 for the queue you are using.
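If separate YARN queues are available (or can be created), one way to keep the three forked actions from all waiting behind a single queue is to point each script at its own queue. The sketch below is only an illustration under assumptions: the queue names q1/q2/q3, the table name and the output path are placeholders, and spark.yarn.queue only takes effect if it is set before the application is submitted to YARN (set it in the SparkSession builder as shown, or pass --queue via the Oozie spark action's spark-opts).

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("test1")                  # e.g. the session created in test1.py
         .config("spark.yarn.queue", "q1")  # hypothetical queue; test2.py -> q2, test3.py -> q3
         .getOrCreate())

df = spark.sql("SELECT * FROM table1")                        # placeholder query
df.write.mode("overwrite").parquet("hdfs:///data/location1")  # placeholder output path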

how to limit the number of jobs running on the same node using SLURM?

I have a job array of 100 jobs. I want at most 2 jobs from the job array can be allocated to the same node. How can I do this using SLURM? Thanks!
Assuming that jobs can share nodes, that nodes have a homogeneous configuration, and that you are alone on the cluster:
use the sinfo -Nl command to find the number of CPUs per node
submit jobs that request half that number of CPUs, with either #SBATCH --ntasks-per-node=... or #SBATCH --cpus-per-task=... depending on what your jobs do
If you are administering a cluster that is shared with other people, you can define a GRES of a dummy type, assign two of them to each node in slurm.conf, and then request one per job with --gres=dummy:1

Spark Yarn running 1000 jobs in queue

I am trying to schedule 1000 jobs on a YARN cluster. I want to run more than 1000 jobs daily at the same time and have YARN manage the resources. For 1000 files of different categories in HDFS, I am trying to build spark-submit commands from Python and execute them, but I am getting an out-of-memory error because spark-submit uses driver memory.
How can I schedule 1000 jobs on a Spark YARN cluster? I even tried the Oozie job-scheduling framework along with Spark; it did not work as expected with HDP.
Actually, you might not need 1000 jobs to read 1000 files from HDFS. You could try to load everything into a single RDD instead (the APIs do support reading multiple files and wildcards in paths). After reading all the files into a single RDD, you should focus on ensuring you have enough memory, cores, etc. assigned to it, and on business logic that avoids costly operations like shuffles.
But if you insist that you need to spawn 1000 jobs, one for each file, you should look at --executor-memory and --executor-cores (along with --num-executors for parallelism). These give you leverage to optimise the memory/CPU footprint.
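As a rough sketch of the first suggestion (reading many files in one application) together with explicit executor sizing, here is a minimal PySpark example; the path, sizes and counts are placeholders, and the config keys are simply the property equivalents of the spark-submit flags above:

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("bulk-read")
         .config("spark.executor.memory", "4g")     # same knob as --executor-memory
         .config("spark.executor.cores", "2")       # same knob as --executor-cores
         .config("spark.executor.instances", "20")  # same knob as --num-executors
         .getOrCreate())

# A wildcard (or a comma-separated list of paths) reads many files as one dataset.
df = spark.read.text("hdfs:///data/incoming/*")  # hypothetical directory layout
print(df.rdd.getNumPartitions())                 # one task is scheduled per partition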
I am also curious that you say you get OOM during spark-submit (using driver memory). The driver doesn't really use much memory at all, unless you do things like collect or take on a large result set, which bring data from the executors to the driver. Also, are you firing the jobs in yarn-client mode? Another hunch is to check whether the box from which you spawn the Spark jobs even has enough memory just to spawn them in the first place.
It will be easier if you could also paste some logs here.

what factors affect how many spark jobs run concurrently

We recently set up the Spark Job Server, to which our Spark jobs are submitted. But we found that our 20-node (8 cores/128 GB memory per node) Spark cluster can only run 10 Spark jobs concurrently.
Can someone share some detailed info about what factors actually affect how many Spark jobs can run concurrently? How can we tune the configuration so that we can take full advantage of the cluster?
The question is missing some context, but first: it seems that Spark Job Server limits the number of concurrent jobs (unlike Spark itself, which puts a limit on the number of tasks, not jobs):
From application.conf
# Number of jobs that can be run simultaneously per context
# If not set, defaults to number of cores on machine where jobserver is running
max-jobs-per-context = 8
If that's not the issue (you set the limit higher, or are using more than one context), then the total number of cores in the cluster (8*20 = 160) is the maximum number of concurrent tasks. If each of your jobs creates 16 tasks, Spark would queue the next incoming job waiting for CPUs to be available.
Spark creates a task per partition of the input data, and the number of partitions is decided by the partitioning of the input on disk, or by calling repartition or coalesce on the RDD/DataFrame to manually change the partitioning. Some operations that work on more than one RDD (e.g. union) may also change the number of partitions.
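As a small illustration of that point (the input path and the numbers below are placeholders, not anything from the question):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("partition-demo").getOrCreate()

df = spark.read.parquet("hdfs:///data/events")  # hypothetical input
print(df.rdd.getNumPartitions())                # decided by how the input is laid out on disk

wider = df.repartition(200)  # full shuffle; the next stage will run 200 tasks
narrower = df.coalesce(4)    # no shuffle; merges down to 4 partitions, i.e. 4 tasks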
Some things that could limit the parallelism that you're seeing:
If your job consists only of map operations (or other shuffle-less operations), it will be limited by the number of partitions of data you have. So even if you have 20 executors, if you have 10 partitions of data it will only spawn 10 tasks (unless the data is splittable, as with Parquet, LZO-indexed text, etc.).
If you're performing a take() operation (without a shuffle), it performs an exponential take, using only one task at first and then growing until it collects enough data to satisfy the take operation.
Can you share more about your workflow? That would help us diagnose it.
