Can slurm run 3 separate computers as one "node"? - slurm

I'm an intern who's been tasked with installing Slurm across three compute units running Ubuntu. Right now, people SSH into one of the compute units and run their job there; the three units share storage through an NFS mount, but are otherwise separate machines.
My issue is that, from what I've read in the documentation, it seems that when installing Slurm I would specify each of these compute units as a completely separate node, and any job I'd like to run that uses multiple cores would still be limited by how many cores are available on that individual node. My supervisor has told me, however, that the three units should be set up as a single node, and when a job comes in that needs more cores than are available on a single compute unit, Slurm should just use all of the cores. The intention is that we won't change how we execute jobs (like a parallelized R script), just "wrap" them in an sbatch script before sending them to Slurm for scheduling and execution.
So is my supervisor correct that Slurm can be used to run our parallelized scripts, unchanged, with more cores than are available on a single machine?

Running a script on more cores than are available is nonsense. It does not provide any performance increase; rather the opposite, as more threads have to be managed while the computing power stays the same.
But he is right in the sense that you can wrap your current script and send it to SLURM for execution using a whole node. The three machines will, however, be three nodes. They cannot work as a single node because, well, they are not a single node/machine. They share neither memory nor buses nor peripherals; they just share some disk through the network.
You say that
any jobs I'd like to run that use multiple cores would still be limited by how many cores are available on the individual node
but that's already the current situation with SSH. Nothing is lost by using SLURM to manage the resources. In fact, SLURM will take care of giving each job the proper resources and of keeping other users from interfering with your computations.
Your best bet: create a cluster of three nodes as usual and let people submit their jobs asking for as many resources as they need, without exceeding what a single node offers.
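As a minimal sketch of the "wrapping" idea (the script name, core count, and time limit are made-up placeholders, not details from the question), an sbatch wrapper for an existing parallelized R script could look like this:

#!/bin/bash
#SBATCH --job-name=r-analysis        # hypothetical job name
#SBATCH --nodes=1                    # the R script cannot span machines, so stay on one node
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=16           # all cores of one compute unit; adjust to your hardware
#SBATCH --time=04:00:00              # illustrative time limit

# The script itself runs unchanged; Slurm only reserves the resources and starts it.
Rscript my_parallel_analysis.R

Submitted with sbatch, the job simply waits in the queue until one of the three nodes has that many cores free.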

Related

Should You Use PM2, Node Cluster, or Neither in Kubernetes?

I am deploying some NodeJS code into Kubernetes. It used to be that you needed to run either PM2 or the NodeJS cluster module in order to take full advantage of multi-core hardware.
Now that we have Kubernetes, it is unclear whether one must use one or the other to get the full benefit of multiple cores.
Should a person specify the number of CPU units in their pod YAML configuration?
Or is there simply no need to account for multiple cores with NodeJS in Kubernetes?
You'll achieve utilization of multiple cores either way; the difference is that with the Node.js cluster module approach you'd have to request more resources from Kubernetes (i.e., multiple cores) for a single pod, which might be more difficult for Kubernetes to schedule than several containers requesting one core (or less) each, since those can in turn be scheduled across multiple nodes rather than requiring one node with enough available cores.
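If you do run a single multi-process pod, the core reservation is expressed in the pod spec's resource requests. A minimal sketch, with made-up names and values purely for illustration:

apiVersion: v1
kind: Pod
metadata:
  name: node-app                # hypothetical pod name
spec:
  containers:
  - name: app
    image: node:20              # illustrative image
    resources:
      requests:
        cpu: "4"                # reserve four cores for the cluster-module workers
      limits:
        cpu: "4"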

Is it possible to isolate spark cluster nodes for each individual application

We have a Spark cluster consisting of 16 nodes. Is it possible to limit nodes 1 & 2 to application 'A'; nodes 3, 4, 5 to application 'B'; nodes 10, 11, 12, 15 to application 'C'; and so on?
From the documentation, I understand that we can set some properties to control Spark executor cores, the number of executors to be launched, memory, etc. But I am curious to know whether I can achieve the above use case.
One obvious way to do that is to configure 3 different clusters with the desired topology; otherwise you're out of luck, as Spark does not have any provision for this, because it is usually a bad idea and generally against the design principles of Spark and of clustering in general. Why? If you assign application A to specific hosts but it sits idle while application B is running at 100%, you have 2 idle hosts that could be working for B, so you would be wasting costly computing resources. Usually, what you want is to assign a certain amount of resources per application and let the system decide how to allocate them via scheduling (plain Spark is pretty elementary here, but running under YARN or Mesos you can be more sophisticated).
Another reason why it's a bad idea is that you don't want rules that tie applications to a specific host or set of hosts. What if you assign nodes 1 & 2 to application A and they both go down? Besides not using your resources efficiently, tying your app to specific hosts also makes it difficult to make it resilient to failure by rescheduling it on other hosts.
You may have other ways to do something similar, though: if you're running Spark under YARN or Mesos, you can define queues or quotas and limit the amount of resources that each application can use at a given time.
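For instance, with YARN each application can be submitted to its own scheduler queue instead of to specific hosts; the queue name and sizes below are assumptions, and the queue itself has to already exist in your YARN scheduler configuration:

# Submit application A to a hypothetical 'team_a' queue with a capped resource footprint
spark-submit \
  --master yarn \
  --queue team_a \
  --num-executors 4 \
  --executor-cores 4 \
  --executor-memory 8g \
  app_a.jar

YARN then enforces the queue's share of the cluster while remaining free to place the executors on whichever hosts have capacity.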
In general, it depends on the reason why you want to statically allocate resources to applications. If it's for resource management, you should instead look at schedulers and queues. If it's for security, you should have multiple clusters, keeping in mind that you'd be losing performance.

Separate cpuset for jobs

Is there a way to restrict cpus and memory for users running scripts directly, but allow more cpus and memory on job submission?
I am running Torque/PBS on an Ubuntu 14.04 server and want to allow "normal" usage of 8 CPUs and 16 GB RAM, with the rest dedicated as a "mom" resource for the cluster. A normal cgroups/cpuset configuration also restricts the running jobs.
If you configure Torque with --enable-cpuset, the mom will automatically create a cpuset for each job. Torque isn't really equipped to use only part of a machine, but a hack that might work in conjunction with only using half the machine is to specify np= in the nodes file; the mom will then restrict the jobs to the first X CPUs.
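As a sketch of that hack, the server's nodes file entry would advertise fewer cores than the machine actually has; the hostname and core count below are assumptions, and the path is just the common default location:

# $TORQUE_HOME/server_priv/nodes
# Advertise only 8 cores of a (hypothetical) 16-core machine, so jobs are
# confined to the first 8 CPUs and the rest stay available for normal use.
node01 np=8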

Presto configuration

As I set up a Presto cluster and try to do some performance tuning, I wonder if there's a more comprehensive configuration guide for Presto, e.g. how can I control how many CPU cores a Presto worker can use? And is it good practice to start multiple Presto workers on a single server (in which case I wouldn't need a dedicated server to run the coordinator)?
Besides, I don't quite understand the task.max-memory argument. Will a Presto worker start multiple tasks for a single query? If so, maybe I can use task.max-memory together with the -Xmx JVM argument to control the level of parallelism?
Thanks in advance.
Presto is a multithreaded Java program and works hard to use all available CPU resources when processing a query (assuming the input table is large enough to warrant such parallelism). You can artificially constrain the amount of CPU resources that Presto uses at the operating system level using cgroups, CPU affinity, etc.
There is no reason or benefit to starting multiple Presto workers on a single machine. You should not do this because they will needlessly compete with each other for resources and likely perform worse than a single process would.
We use a dedicated coordinator in our deployments that have 50+ machines because we found that having the coordinator process queries would slow it down while it performs the query coordination work, which has a negative impact on overall query performance. For small clusters, dedicating a machine to coordination is likely a waste of resources. You'll need to run some experiments with your own cluster setup and workload to determine which way is best for your environment.
You can have a single Presto process act as both a coordinator and worker, which can be useful for tiny clusters or testing purposes. To do so, add this to the etc/config.properties file:
coordinator=true
node-scheduler.include-coordinator=true
Your idea of starting a dedicated coordinator process on a machine shared with a worker process is interesting. For example, on a machine with 16 processors, you could use cgroups or CPU affinity to dedicate 2 cores to the coordinator process and restrict the worker process to 14 cores. We have never tried this, but it could be a good option for small clusters.
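A rough sketch of that split (the install paths, core numbers, and the use of taskset rather than cgroups are all assumptions; as noted above, this setup is untested):

# Pin the coordinator to 2 cores and the worker to the remaining 14 on a 16-core box
taskset -c 0,1  /opt/presto-coordinator/bin/launcher start
taskset -c 2-15 /opt/presto-worker/bin/launcher start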
A task is a stage in a query plan that runs on a worker (the CLI shows the list of stages while the query is running). For a query like SELECT COUNT(*) FROM t, there will be a task on every worker that performs the table scan and partial aggregation, and another task on a single worker for the final aggregation. More complex queries that have joins, subqueries, etc., can result in multiple tasks on every worker node for a single query.
-Xmx must be higher than task.max-memory, or at least equal to it; otherwise you are likely to see OOM issues, as I have experienced before.
Also, since Presto 0.113 the way Presto manages query memory, and the corresponding configuration properties, has changed. Please refer to this link:
https://prestodb.io/docs/current/installation/deployment.html
For your question regarding how many CPU cores a Presto worker can use, I think it's controlled by the task.concurrency parameter, which defaults to 16.
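Putting the two points together, the relevant knobs live in two files; the values below are purely illustrative, and the exact property names depend on your Presto version (the memory properties changed around 0.113):

# etc/jvm.config: heap size for the whole Presto JVM (illustrative value)
-Xmx16G

# etc/config.properties: per-task memory should fit comfortably inside the heap
task.max-memory=1GB
task.concurrency=16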

How to make sure that nodejs cluster assigns processes to different cores?

When utilizing multiple cores via Node.js' cluster module, is it guaranteed that each forked node worker is assigned to a different core?
If it's not guaranteed, is there any way to control or manage it and eventually guarantee that they all end up on different cores? Or does the OS scheduler distribute them evenly?
A while ago I did some tests with the cluster module, which you can check in this post that I wrote. Looking at the system monitor screenshots, it is pretty straightforward to understand what happens under the hood (with and without the cluster module).
It is indeed up to the OS to distribute processes over the cores. You could obtain the pid of a child process and use an external utility to set the CPU affinity of that process. That would, of course, not be very portable.
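On Linux that external utility could be taskset; a rough, non-portable sketch (the master PID is a placeholder, and this assumes the workers were forked by a single node master process):

# Find the forked node workers by their parent PID and pin each one to its own core
pids=$(pgrep -P <master_pid> node)   # <master_pid> is a placeholder
core=0
for pid in $pids; do
  taskset -cp "$core" "$pid"
  core=$((core + 1))
done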
