Slurm: split a single node into multiple nodes

I'm setting up a SLURM cluster with two 'physical' nodes.
Each of the two nodes has two GPUs.
I would like to give the option to use only one of the GPUs (and have the other GPU still available for computation).
I managed to set up something with GRES, but I later realized that even if only one of the GPUs is used, the whole node is occupied and the other GPU cannot be used.
Is there a way to set the GPUs as the consumables and have two 'nodes' within a single node? And to assign a limited number of CPUs and memory to each?

I've had the same problem and I managed to make it work by allowing oversubscribing.
Here's the documentation about it:
https://slurm.schedmd.com/cons_res_share.html
I'm not sure what I did was exactly right, but I set
SelectType=select/cons_tres and SelectTypeParameters=CR_Core, and set OverSubscribe=FORCE for my partition. Now I can launch several GPU jobs on the same node.
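For reference, a minimal sketch of what such a configuration could look like, assuming two nodes named node[1-2] with 16 cores, 64 GB of RAM, and 2 NVIDIA GPUs each (the node names, core/memory counts, and device paths are placeholders, not taken from the question):

    # slurm.conf excerpt (hypothetical node/partition values)
    GresTypes=gpu
    SelectType=select/cons_tres
    SelectTypeParameters=CR_Core_Memory   # the answer above used CR_Core; CR_Core_Memory also makes memory a consumable resource
    NodeName=node[1-2] CPUs=16 RealMemory=64000 Gres=gpu:2 State=UNKNOWN
    PartitionName=gpu Nodes=node[1-2] Default=YES OverSubscribe=FORCE State=UP

    # gres.conf on each node (device paths assume NVIDIA GPUs)
    NodeName=node[1-2] Name=gpu File=/dev/nvidia[0-1]

A job that should take only half of a node can then request one GPU plus a slice of the cores and memory, e.g. sbatch --gres=gpu:1 --cpus-per-task=8 --mem=30G job.sh, leaving the second GPU and the remaining cores free for another job.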

Related

Spark, does size of master node on EMR matter?

When running a Spark ETL job on EMR, does the size of the master node instance matter? Based on my understanding, the master node does not handle processing/computation of data and is responsible for scheduling tasks, communicating with core and task nodes, and other admin tasks.
Does this mean if I have 10 TB of data that I need to transform and then write out, I can use 1 medium instance for master and 10 8xlarge for core nodes?
Based on my reading, I see most people suggest that the master node instance type should be the same as the core instance type, which is what I currently do, and it works fine. That would be 1 8xlarge for the master and 10 8xlarge for core nodes.
According to the AWS docs, however, we should use m4.large, so I'm confused about what's right.
https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-plan-instances-guidelines.html
The master node does not have large computational requirements. For most clusters of 50 or fewer nodes, consider using an m4.large instance. For clusters of more than 50 nodes, consider using an m4.xlarge.
The way the question is asked is a little vague. Size does matter, in the sense of load and so on, so I'll answer it from a slightly different perspective; the "most people ..." argument is neither here nor there.
The way the Master was assigned in the past was a weakness of the EMR approach, in my opinion, when I trialled it some nine months ago for a PoC: you allocated big resources for the Workers and, by default, one such instance went to the Master, which was complete overkill.
So, if you did things the standard way, you paid for a larger-than-required resource for the Master node. There is a way to define a smaller resource for the Master, but I am on holiday and cannot find the reference again.
However, look at the URL here and you will see that during EMR cluster configuration you can now easily define a smaller Master node, or several such Master nodes for failover; things have moved along since I last looked:
https://confusedcoders.com/data-engineering/how-to-create-emr-cluster-with-apache-spark-and-apache-zeppelin
See also https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-plan-ha-launch.html for multiple such Master nodes.
In general the Master node can differ in characteristics from the Workers, usually being smaller, though maybe not in all cases. That said, the purpose of EMR would tend to point to a smaller Master node config.
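As a concrete illustration of such a mixed layout, here is a hedged boto3 sketch that mirrors the question's 1 master + 10 core idea; the cluster name, release label, instance types, region, and IAM roles are placeholder assumptions, not a recommendation:

    import boto3

    # Hypothetical cluster layout: small master node, large core fleet.
    emr = boto3.client("emr", region_name="us-east-1")

    response = emr.run_job_flow(
        Name="etl-10tb",                       # placeholder name
        ReleaseLabel="emr-6.10.0",             # placeholder release
        Applications=[{"Name": "Spark"}],
        Instances={
            "InstanceGroups": [
                {
                    "Name": "Master",
                    "InstanceRole": "MASTER",
                    "InstanceType": "m4.large",    # small master, per the AWS guidance quoted above
                    "InstanceCount": 1,
                },
                {
                    "Name": "Core",
                    "InstanceRole": "CORE",
                    "InstanceType": "r5.8xlarge",  # placeholder 8xlarge worker type
                    "InstanceCount": 10,
                },
            ],
            "KeepJobFlowAliveWhenNoSteps": False,
        },
        JobFlowRole="EMR_EC2_DefaultRole",
        ServiceRole="EMR_DefaultRole",
    )
    print(response["JobFlowId"])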

Cassandra with mixed disk sizes?

Can we use Cassandra on nodes with different disk sizes? If so, how does Cassandra balance the nodes, and do we have any control over it?
I've found this thread but it's quite old http://grokbase.com/t/cassandra/user/113nvs23r4/cassandra-nodes-with-mixed-hard-disk-sizes
It's highly recommended not to introduce an imbalance of nodes within a cluster (at least within the same DC) in terms of hard disk, CPU, or memory. All nodes in the cluster are treated as equals, and there is no intelligence behind the disk capacity of each node.
Unless you can take the pain of manually distributing tokens instead of using vnodes, this is not advisable. With manual distribution, one has control over which nodes are assigned more tokens and which fewer, again hoping and praying that the data distribution is uniform and hence that a node with fewer tokens will get less data.

Is it better to create many small Spark clusters or a smaller number of very large clusters

I am currently developing an application to wrangle a huge amount of data using Spark. The data is a mixture of Apache (and other) log files as well as csv and json files. The directory structure of my Google bucket will look something like this:
root_dir
    web_logs
        \input (subdirectory)
        \output (subdirectory)
    network_logs (same subdirectories as web_logs)
    system_logs (same subdirectories as web_logs)
The directory structure under the \input directories is arbitrary. Spark jobs pick up all of their data from the \input directory and place it in the \output directory. There is an arbitrary number of *_logs directories.
My current plan is to split the entire wrangling task into about 2000 jobs and use the cloud dataproc api to spin up a cluster, do the job, and close down. Another option would be to create a smaller number of very large clusters and just send jobs to the larger clusters instead.
The first approach is being considered because each individual job is taking about an hour to complete. Simply waiting for one job to finish before starting the other will take too much time.
My questions are: 1) besides the cluster startup costs, are there any downsides to taking the first approach? and 2) is there a better alternative?
Thanks so much in advance!
Besides startup overhead, the main other consideration when using single-use clusters per job is that some jobs might be more prone to "stragglers" where data skew leads to a small number of tasks taking much longer than other tasks, so that the cluster isn't efficiently utilized near the end of the job. In some cases this can be mitigated by explicitly downscaling, combined with the help of graceful decommissioning, but if a job is shaped such that many "map" partitions produce shuffle output across all the nodes but there are "reduce" stragglers, then you can't safely downscale nodes that are still responsible for serving shuffle data.
That said, in many cases, simply tuning the size/number of partitions to occur in several "waves" (i.e. if you have 100 cores working, carving the work into something like 1000 to 10,000 partitions) helps mitigate the straggler problem even in the presence of data skew, and the downside is on par with startup overhead.
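As a rough illustration of that "waves" tuning, here is a minimal PySpark sketch; the bucket paths echo the question's layout, and the core count and partition count are illustrative assumptions only:

    # Hypothetical wrangling job: read one *_logs/input directory, repartition
    # so that ~100 cores process the data in many short waves, then write out.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("wrangle-web-logs").getOrCreate()

    df = spark.read.text("gs://root_dir/web_logs/input/*")

    # With ~100 cores, 2000 partitions means ~20 waves of tasks, so a few
    # skewed partitions delay only the last wave instead of the whole job.
    df = df.repartition(2000)

    df.write.mode("overwrite").text("gs://root_dir/web_logs/output/")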
Despite the overhead of startup and stragglers, though, usually the pros of using new ephemeral clusters per-job vastly outweigh the cons; maintaining perfect utilization of a large shared cluster isn't easy either, and the benefits of using ephemeral clusters include vastly improved agility and scalability, letting you optionally adopt new software versions, switch regions, switch machine types, incorporate brand-new hardware features (like GPUs) if they become needed, etc. Here's a blog post by Thumbtack discussing the benefits of such "job-scoped clusters" on Dataproc.
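For the per-job pattern the question describes (spin up, run, tear down via the Dataproc API), a hedged sketch with the google-cloud-dataproc Python client might look like the following; the project, region, cluster name, machine types, worker count, and job path are all placeholder assumptions:

    # Hypothetical per-job lifecycle: create a cluster, run one PySpark job, delete the cluster.
    from google.cloud import dataproc_v1

    project_id, region = "my-project", "us-central1"   # placeholders
    cluster_name = "wrangle-web-logs-0001"             # one ephemeral cluster per job
    endpoint = {"api_endpoint": f"{region}-dataproc.googleapis.com:443"}

    cluster_client = dataproc_v1.ClusterControllerClient(client_options=endpoint)
    job_client = dataproc_v1.JobControllerClient(client_options=endpoint)

    cluster = {
        "project_id": project_id,
        "cluster_name": cluster_name,
        "config": {
            "master_config": {"num_instances": 1, "machine_type_uri": "n1-standard-4"},
            "worker_config": {"num_instances": 8, "machine_type_uri": "n1-standard-8"},
        },
    }

    # Create the cluster and wait for it to be ready.
    cluster_client.create_cluster(
        request={"project_id": project_id, "region": region, "cluster": cluster}
    ).result()

    # Submit the wrangling job and block until it finishes.
    job = {
        "placement": {"cluster_name": cluster_name},
        "pyspark_job": {"main_python_file_uri": "gs://root_dir/jobs/wrangle.py"},
    }
    job_client.submit_job_as_operation(
        request={"project_id": project_id, "region": region, "job": job}
    ).result()

    # Tear the cluster down again.
    cluster_client.delete_cluster(
        request={"project_id": project_id, "region": region, "cluster_name": cluster_name}
    ).result()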
A slightly different architecture, if your jobs are very short (i.e. if each one runs only a couple of minutes, which amplifies the downside of startup overhead) or the straggler problem is unsolvable, is to use "pools" of clusters. This blog post touches on using "labels" to easily maintain pools of larger clusters where you still tear down and create clusters regularly to keep the agility of version updates, adopting new hardware, etc.
You might want to explore my solution for Autoscaling Google Dataproc Clusters
The source code can be found here

How to add new nodes in Cassandra with different machine configuration

I have 6 Cassandra nodes in two datacenters, each with 16 GB of memory and a 1 TB hard drive.
Now I am adding 3 more nodes with 32 GB of memory. Will these machines cause overhead for the existing machines (maybe in token distribution)? If so, please suggest how to configure these machines to avoid those problems.
Thanks in advance.
The "balance" between nodes is best regulated using vnodes. If you recall (if you don't, you should read about it), the ring that Cassandra nodes form is actually consisted out of virtual nodes (vnodes). Each node in the ring has a certain portion of vnodes, which is set up in the Cassandra configuration on each node. Based on that number of vnodes, or rather the proportion between them, the amount of data going to those nodes is calculated. The configuration you are looking for is num_tokens. If you have similarly powerful machines, than an equal vnode number is available. The default is 256.
When adding a new, more powerful machine, you should assign a greater number of vnodes to it. How much? I think it's hard to tell. It's unwise to give it twice more, only be looking at the RAM, since those nodes will have twice as many data than the others. Than you might expect more IO operations on them (remember, you still have the same HDD) and CPU utilization (and the same CPU). You might want to take a look at this answer also.
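To make that concrete, a minimal cassandra.yaml sketch; the 320 figure is purely an illustrative guess at a modest weighting, not a recommended value:

    # cassandra.yaml on the existing 16 GB nodes (the default)
    num_tokens: 256

    # cassandra.yaml on a new 32 GB node: weighted somewhat higher, but not
    # blindly doubled, since disk and CPU are unchanged (see the caveats above)
    num_tokens: 320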

How does partitions map to tasks in Spark?

If I partition an RDD into, say, 60 partitions and I have a total of 20 cores spread across 20 machines, i.e. 20 single-core machines, then the number of tasks is 60 (equal to the number of partitions). Why is this beneficial over having a single partition per core and having 20 tasks?
Additionally, I have run an experiment where I set the number of partitions to 2; checking the UI shows 2 tasks running at any one time. However, what has surprised me is that it switches instances on completion of tasks, e.g. node1 and node2 do the first 2 tasks, then node6 and node8 do the next set of 2 tasks, etc. I thought that by setting the number of partitions to fewer than the cores (and instances) in the cluster, the program would just use the minimum number of instances required. Can anyone explain this behaviour?
For the first question: you might want to have more granular tasks than strictly necessary in order to load less into memory at the same time. It also helps with fault tolerance, as less work needs to be redone in case of failure. It is nevertheless a parameter to tune. In general the answer depends on the kind of workload (IO bound, memory bound, CPU bound).
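A tiny sketch of that first point, using the question's 20-core / 60-partition setup (the numbers come from the question; the data is made up):

    # 60 partitions on 20 cores => each stage runs as 60 tasks in ~3 waves,
    # so each task holds a smaller slice of the data in memory at once.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("partition-demo").getOrCreate()
    sc = spark.sparkContext

    rdd = sc.parallelize(range(1_000_000), numSlices=60)
    print(rdd.getNumPartitions())            # 60
    print(rdd.map(lambda x: x * x).sum())    # executes as 60 tasks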
As for the second one, I believe version 1.3 has some code to dynamically request resources. I'm unsure in which version the change landed, but older versions just request the exact resources you configure your driver with. As for why a partition moves from one node to another: as far as I know, Spark will pick the node for a task based on which node has a local copy of that data on HDFS, and since HDFS keeps multiple copies (3 by default) of each block, there are multiple candidate nodes for running any given piece of work.
