SLURM: Run programs in another node - linux

I've just finished the SLURM installation in our cluster. How do I run programs in node2 (slave) from node1 (head/controller node)? Is there a special configuration for this kind of setup?

Make sure that UIDs are uniform across the cluster nodes, and that a common filesystem (for instance NFS) is mounted on all nodes.
Then you typically run programs by submitting jobs with the sbatch command.
You can refer to this set of slides for a quick start.
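For a quick test, a minimal batch script (the file name hello.sh and the options here are just placeholders) could look like this:

#!/bin/bash
#SBATCH --job-name=hello
#SBATCH --nodes=1
#SBATCH --output=hello-%j.out
srun hostname

Submitted with sbatch hello.sh from the head node, it should be dispatched to a compute node (for example node2) and write its output file into the shared filesystem.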

Related

Slurm jobs queued but not running

I'm trying to install Slurm on VirtualBox running Ubuntu. We're using it to run long-running jobs via a web interface, and we use Slurm to queue and run the jobs. I'm using VirtualBox to create a sandbox for development.
I've set slurm up, but when I queue a job and run squeue I get:
$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
2 debug test.sh pchandle PD 0:00 1 (Nodes required for job are DOWN, DRAINED or reserved for jobs in higher priority partitions)
When I run it on my actual hardware, the jobs run successfully.
The output of sinfo is:
$ sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
debug* up infinite 0 n/a
Yes, it says nodes are 0, but the output is the same on my actual hardware, and jobs run fine. Any suggestions on why it's saying 0 nodes?
Is this an issue with my setup, or is it simply not possible to run Slurm on VirtualBox due to the hardware limitations? I'm running 4 CPUs. The only obvious difference I can see is that threads per core is only 1 (there are 2 on my local hardware).
Is there any way to debug why the nodes aren't running jobs? Or why there are no nodes available?
It turned out to be a configuration error.
In the config file /etc/slurm-llnl/slurm.conf, I had left NodeName at the default, NodeName=localhost[0-31]. Since I am running on a single host, it should have been NodeName=localhost for a single node on the same machine.
Slurm Single Instance had a description of what the values should be set to, which helped me find the answer.
Install Slurm on a stand alone Ubuntu had the instructions I originally followed.
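For reference, a minimal single-node slurm.conf pairs matching NodeName and PartitionName lines, roughly like this (the CPU count is an assumption, adjust it to your machine):

NodeName=localhost CPUs=4 State=UNKNOWN
PartitionName=debug Nodes=localhost Default=YES MaxTime=INFINITE State=UP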

Per-node default partition in SLURM

I'm configuring a small cluster, controlled by SLURM.
This cluster has one master node and two partitions.
Users submit their jobs from the worker nodes; I've restricted their access to the master node.
Each partition in the cluster is dedicated to a team in our company.
I'd like members of different teams to submit their jobs to different partitions without bothering with additional command-line switches.
That is, I'd like the default partition for srun or sbatch to be different depending on the node running these commands.
For example: all jobs submitted from the host worker1 should go to partition1,
and all jobs submitted from the hosts worker[2-4] should go to partition2,
without any invocation of sbatch or srun having to contain the -p (or --partition) switch.
I've tried setting Default=YES on different partition lines in the slurm.conf files on different computers, but this did not help.
The partition-level Default=YES setting in slurm.conf applies cluster-wide (slurm.conf is expected to be identical on all nodes), so it cannot vary with the submission host. This can instead be solved using the SLURM_PARTITION and SBATCH_PARTITION environment variables, put in the /etc/environment file of each worker.
Details on these environment variables are in the manual pages for srun and sbatch.
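A sketch of what that might look like, reusing the partition and host names from the question. On worker1, /etc/environment would contain:

SLURM_PARTITION=partition1
SBATCH_PARTITION=partition1

and on worker[2-4]:

SLURM_PARTITION=partition2
SBATCH_PARTITION=partition2

After a re-login, srun and sbatch pick up their default partition from the environment, and -p is only needed to override it.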

Spark Standalone Mode multiple shell sessions (applications)

In Spark 1.0.0 Standalone mode with multiple worker nodes, I'm trying to run a Spark shell from two different computers (same Linux user).
In the documentation, it says "By default, applications submitted to the standalone mode cluster will run in FIFO (first-in-first-out) order, and each application will try to use all available nodes."
The number of cores per worker is set to 4 with 8 being available (via SPARK_JAVA_OPTS="-Dspark.cores.max=4"). Memory is also limited such that enough should be available for both.
However, when looking at the Spark Master WebUI, the shell application that was started later always remains in state "WAITING" until the first one is exited. The number of cores assigned to it is 0, and the memory per node is 10G (the same as the one that is already running).
Is there a way to have both shells running at the same time without using Mesos?
Before a shell starts processing on a Spark standalone cluster, there have to be sufficient cores and memory available. You must specify from each Spark shell the number of cores you want, or it will use them all. If you specify 5 cores with executor memory of 10G (the amount of memory you allocated for the executors), and run the second Spark shell with 2 cores and 10G of memory, the second one will still not start, because the first shell is using both executors and all of the memory on both. If you specify 5G of executor memory for each Spark shell, then they can run concurrently.
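For example, with spark.cores.max already capping each application at 4 of the 8 cores, launching each shell with half the worker memory (the 5g figure assumes the 10G workers from the question) should let both be scheduled:

bin/spark-shell --master spark://$MASTER:7077 --executor-memory 5g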
Essentially you want to have multiple jobs running on a standalone cluster -- unfortunately, it is really not designed to handle this case well. If you want to do that you should use either mesos or yarn.
One workaround to this is to restrict the number of cores per spark shell using total-executor-cores. For example to restrict it to 16 cores, launch it like this:
bin/spark-shell --total-executor-cores 16 --master spark://$MASTER:7077
In this case each shell will use only 16 cores, so you can have two shells running on your 32 cores cluster. They can then run simultaneously but never use more than 16 cores each :(
This solution is far from ideal, I know. You depend on users to restrict themselves, to shut down their shells, and resources are wasted when a user is not running code. I have created a request to fix this on JIRA, which you can vote for.
The application ends when your shell dies, so you cannot run two spark-shells concurrently on two laptops. What you can do is launch one spark-shell, launch the other, and have the second start when the first one dies.
Unlike spark-shell, spark-submit does terminate once the computation is over. So you can spark-submit one app, launch a spark-shell, and have the shell take over the moment the application is done.
Or you can run two apps sequentially (one after the other) with two spark-submit launches.
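A sketch of that spark-submit variant (the class and jar names are made up for illustration):

bin/spark-submit --master spark://$MASTER:7077 --class com.example.MyApp myapp.jar
bin/spark-shell --master spark://$MASTER:7077

Once the submitted application finishes and releases its cores, the waiting shell moves from WAITING to RUNNING.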

How can I run more than one Cassandra server on a single machine and form one cluster ring?

I would like to know whether there is any way to run multiple Cassandra servers on a single machine, so that all the servers on that machine form one ring (cluster).
There's always a way!
There is an excellent tool, ccm, that allows you to configure a multi-node cluster locally, but it is currently not supported under Windows. When you build a cluster and start it, it will configure the ring for you. You can check out the ring using ./nodetool -h 127.0.0.1 -p 7100 ring after it has started.
Just a side note: the ccm tool starts the cluster as a background process.
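A minimal sketch with ccm (the version and node count here are just examples):

ccm create test -v 2.0.5 -n 3 -s
ccm status
ccm node1 ring

This creates a three-node cluster named test, starts it, and then shows the ring as seen by node1.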

Can HBase, MapReduce and HDFS work on a single machine having Hadoop installed and running on it?

I am working on a search engine design, which is to be run in the cloud.
We have just started and don't have much experience with Hadoop yet.
Can anyone tell me whether HBase, MapReduce and HDFS can work on a single machine that has Hadoop installed and running on it?
Yes you can. You can even create a virtual machine and run it there on a single "computer" (which is what I have :) ).
The key is simply to install Hadoop in "Pseudo-Distributed Mode", which is described in the Hadoop Quickstart.
If you use the Cloudera distribution, they have even created the configs needed for that in an RPM. Look here for more info on that.
HTH
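For reference, the pseudo-distributed configuration from the (Hadoop 1.x era) Quickstart boils down to pointing everything at localhost and turning replication down to 1; roughly (the ports are the conventional defaults):

In conf/core-site.xml:
<property><name>fs.default.name</name><value>hdfs://localhost:9000</value></property>

In conf/hdfs-site.xml:
<property><name>dfs.replication</name><value>1</value></property>

In conf/mapred-site.xml:
<property><name>mapred.job.tracker</name><value>localhost:9001</value></property>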
Yes. In my development environment, I run
NameNode (HDFS)
SecondaryNameNode (HDFS)
DataNode (HDFS)
JobTracker (MapReduce)
TaskTracker (MapReduce)
Master (HBase)
RegionServer (HBase)
QuorumPeer (ZooKeeper - needed for HBase)
In addition, I run my applications, and map and reduce tasks launched by the task tracker.
Running so many processes on the same machine results in a lot of contention for CPU cores, memory, and disk I/O, so it's definitely not great for high performance, but there is no limitation other than the amount of resources available.
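With a Hadoop 1.x style pseudo-distributed setup, bringing that whole stack up looks roughly like this (assuming HBase manages its own ZooKeeper, so the QuorumPeer starts with it):

bin/hadoop namenode -format   # one-time HDFS format
bin/start-dfs.sh              # NameNode, SecondaryNameNode, DataNode
bin/start-mapred.sh           # JobTracker, TaskTracker
start-hbase.sh                # HBase Master, RegionServer, QuorumPeer
jps                           # list the running Java daemons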
Same here; I am running Hadoop/HBase/Hive on a single computer.
If you really, really want to see distributed computing on a single computer, grab lots of RAM and some hard disk space and go like this:
make one or two virtual machines (use VirtualBox)
install Hadoop on each of them, making your real installation (not a virtual one) the master and the rest slaves
configure Hadoop for a real distributed environment
now when Hadoop starts, you should actually have a cluster of multiple computers (one real, the rest virtual)
This should just be an experiment, because unless you have a decent multi-CPU or multi-core system, such a configuration will actually spend more resources maintaining itself than giving you any performance.
Good luck.
--l4l
