I'm trying to launch a heterogeneous job with SLURM using an arbitrary distribution (--distribution=arbitrary) defined in a host file. It seems that SLURM does not provide a separate SLURM_HOSTFILE environment variable for each HET_GROUP, so I tried to define the layout in a single file containing the tasks of all the groups.
The problem is that SLURM_JOB_NUM_NODES ends up equal to SLURM_JOB_NUM_NODES_HET_GROUP_0, so I get an error:
Requested node configuration is not available
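For reference, here is roughly what I am trying; the hostnames, task counts and executables below are placeholders:

    # one hostfile covering the tasks of *all* het groups, one line per task
    printf '%s\n' node01 node01 node02 node02 > hosts.txt
    export SLURM_HOSTFILE=$PWD/hosts.txt

    # two heterogeneous components, each requesting an arbitrary task layout
    srun --distribution=arbitrary -N1 -n2 ./app_a : --distribution=arbitrary -N1 -n2 ./app_b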
Has anyone managed to use an arbitrary distribution with heterogeneous jobs?
I have a practical use case: three notebooks (PySpark) that all share one common parameter.
I need to schedule all three notebooks in a sequence.
Is there any way to run them by setting the parameter value once, since it is the same in all of them?
Please suggest the best way to do this.
I have a dynamic list, and I'd like to run a given job for each item in the list.
Basically this is a list of tags (~20) for a Docker container that would ideally all be processed concurrently. As the list of tags can change, I'd like to define the list once and consume it in multiple places, including in the job definitions.
On Travis CI this is known as Matrix Expansion, but I've been unable to find a similar feature on GitLab CI.
Use the keyword parallel:matrix.
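For example, a minimal sketch of a .gitlab-ci.yml job (the job name, image and tag values are placeholders):

    process-tag:
      image: alpine:latest
      script:
        - echo "processing docker tag $TAG"
      parallel:
        matrix:
          - TAG: ["v1.0", "v1.1", "latest"]

This expands into one job per TAG value, and the expanded jobs run concurrently. If the same list is needed in several jobs, YAML anchors or extends can help to avoid repeating it.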
I'm setting up a SLURM cluster with two 'physical' nodes.
Each of the two nodes has two GPUs.
I would like to give the option to use only one of the GPUs (and have the other GPU still available for computation).
I managed to set up something with gres, but I later realized that even if only one of the GPUs is used, the whole node is occupied and the other GPU cannot be used.
Is there a way to make the GPUs the consumable resources and have two 'nodes' within a single physical node? And to assign a limited number of CPUs and a share of the memory to each?
I've had the same problem and I managed to make it work by allowing oversubscription.
Here's the documentation about it:
https://slurm.schedmd.com/cons_res_share.html
I'm not sure what I did is exactly right, but I set SelectType=select/cons_tres and SelectTypeParameters=CR_Core in slurm.conf, and OverSubscribe=FORCE on my partition. Now I can launch several GPU jobs on the same node.
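In slurm.conf terms that amounts to something like the sketch below; the node definition, CPU/memory figures, partition name and the gres.conf line here are just illustrative placeholders:

    # slurm.conf
    SelectType=select/cons_tres
    SelectTypeParameters=CR_Core
    GresTypes=gpu
    NodeName=node[1-2] Gres=gpu:2 CPUs=16 RealMemory=64000 State=UNKNOWN
    PartitionName=gpu Nodes=node[1-2] OverSubscribe=FORCE Default=YES MaxTime=INFINITE State=UP

    # gres.conf on each node (device files are placeholders)
    Name=gpu File=/dev/nvidia[0-1]

A job then asks for a single GPU plus a slice of the CPUs, e.g. sbatch --gres=gpu:1 --cpus-per-task=8 job.sh, and a second such job can land on the same node; if memory should also be tracked as a consumable, use CR_Core_Memory and add a --mem request.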
I am working with NGS data and the newest test files are massive.
Normally our pipeline uses just one node, and the output from the different tools goes to that node's ./scratch folder.
Using just one node is not possible with the current massive data set. That's why I would like to use at least two nodes, to address issues such as speed, not all jobs being submitted, etc.
Using multiple nodes or even multiple partitions is easy - I know which parameters to use for that step.
So my issue is not about missing parameters, but about the logic behind Slurm for solving the following I/O problem:
Let's say I have tool-A. Tool-A runs as 700 jobs across two nodes (340 jobs on node1 and 360 jobs on node2) - the output is saved to ./scratch on each node separately.
Tool-B then uses the results from tool-A - which now sit on two different nodes.
What is the best approach to fix that?
- Is there a parameter which tells Slurm which jobs belong together and where to find the input for tool-B?
- Would it be smarter to change the output from ./scratch to a local folder?
- Or would it be better to merge the output from tool-A from both nodes onto one node?
- Any other ideas?
I hope I made my issue simple to understand... My apologies if that is not the case!
My naive suggestion would be: why not share a scratch NFS volume across all nodes? That way all the output data of tool-A would be accessible to tool-B regardless of the node. It might not be the best solution for read/write speed, but to my mind it would be the easiest for your situation.
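As a rough illustration of the idea (hostnames, the network range and paths are placeholders): export a directory from one machine and mount it on every compute node.

    # /etc/exports on the file server
    /export/scratch  10.0.0.0/24(rw,sync,no_subtree_check)

    # /etc/fstab on each compute node
    fileserver:/export/scratch  /shared_scratch  nfs  defaults  0  0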
A more software-oriented solution (not too hard to develop) would be to implement a database that tracks where each file has been generated.
I hope it helps!
... just for those coming across this via search engines: if you cannot use any kind of shared filesystem (NFS, GPFS, Lustre, Ceph) and your data sets are not too massive, you could use "staging", meaning data transfer before and after your job actually runs.
Though this is termed "cast"ing in the Slurm universe, it generally means you define:
- files to be copied to all nodes assigned to your job BEFORE the job starts,
- files to be copied from the nodes assigned to your job AFTER the job completes.
This can be a way to get everything needed back and forth from/to your job's nodes even without a shared file system.
Check the man page of "sbcast" and amend your sbatch job scripts accordingly.
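A rough sketch of what such a job script could look like (all paths and the tool name are placeholders; sgather is the companion command for collecting files back):

    #!/bin/bash
    #SBATCH --nodes=2
    #SBATCH --ntasks-per-node=1

    SCRATCH=/scratch/$SLURM_JOB_ID

    # create the node-local work directory on every node of the allocation
    srun --ntasks-per-node=1 mkdir -p "$SCRATCH"

    # stage the input to all allocated nodes BEFORE the real work starts
    sbcast input.dat "$SCRATCH/input.dat"

    # run the tool against the node-local copies
    srun ./my_tool "$SCRATCH/input.dat" "$SCRATCH/output.dat"

    # collect the per-node results AFTER the run
    # (sgather appends the source node's name to the destination file)
    mkdir -p results
    sgather "$SCRATCH/output.dat" results/output.dat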
Spark has broadcast variables, which are read-only, and accumulator variables, which can be updated by the nodes but not read. Is there a way - or a workaround - to define a variable which is both updatable and readable?
One use case for such a read/write global variable would be to implement a cache. As files are loaded and processed as RDDs, calculations are performed. The results of these calculations - happening on several nodes running in parallel - need to be placed into a map whose key is built from some of the attributes of the entity being processed. As subsequent entities within the RDDs are processed, the cache is queried.
Scala does have ScalaCache, which is a facade for cache implementations such as Google Guava. But how would such a cache be included and accessed within a Spark application?
The cache could be defined as a variable in the driver application which creates the SparkContext. But then there would be two issues:
- Performance would presumably be bad because of the network overhead between the nodes and the driver application.
- To my understanding, each RDD will be passed a copy of the variable (the cache in this case) when the variable is first accessed by the function passed to the RDD. Each RDD would then have its own copy, not access to a shared global variable.
What is the best way to implement and store such a cache?
Thanks
Well, the best way of doing this is not to do it at all. In general, the Spark processing model doesn't provide any guarantees* regarding:
- where,
- when,
- in what order (excluding, of course, the order of transformations defined by the lineage / DAG),
- and how many times
a given piece of code is executed. Moreover, any updates that depend directly on the Spark architecture are not granular.
These are the properties that make Spark scalable and resilient, but at the same time they are what makes keeping shared mutable state very hard to implement and, most of the time, completely useless.
If all you want is a simple cache then you have multiple options:
- use one of the methods described by Tzach Zohar in Caching in Spark,
- use local caching (per JVM or executor thread) combined with application-specific partitioning to keep things local (see the sketch after this list),
- for communication with external systems, use a node-local cache independent of Spark (for example an Nginx proxy for HTTP requests).
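As an illustration of the second option, here is a minimal sketch of a per-executor cache kept in a singleton object; the Guava dependency, the key/value types and the keyOf/expensiveComputation functions are just assumptions for the example:

    import com.google.common.cache.{Cache, CacheBuilder}

    // One instance per executor JVM; all tasks running in that JVM share it.
    object LocalCache {
      lazy val cache: Cache[String, String] = CacheBuilder.newBuilder()
        .maximumSize(10000)
        .build[String, String]()

      def getOrCompute(key: String)(compute: => String): String =
        Option(cache.getIfPresent(key)).getOrElse {
          val value = compute
          cache.put(key, value)
          value
        }
    }

    // Usage inside a transformation; entries computed on one executor are
    // not visible to the others, which is exactly the trade-off of this option:
    // rdd.map { record =>
    //   LocalCache.getOrCompute(keyOf(record)) { expensiveComputation(record) }
    // }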
If the application requires much more complex communication, you may try different message-passing tools to keep the state synchronized, but in general this requires complex and potentially fragile code.
* This partially changed in Spark 2.4, with the introduction of barrier execution mode (SPARK-24795, SPARK-24822).