Queuing deep learning training jobs on a shared server - Linux

We have multiple GPU servers for deep learning training, and we run our training code inside Docker containers.
Today every user can log into these machines interactively and run a job. There is no restriction on job duration or on how many GPUs one can allocate, so we haggle over GPUs via a chat channel.
What's the simplest framework for queuing jobs and allocating them to GPUs? I'm looking for GPU-level granularity, and am fine with multiple users running on the same machine.
The desired workflow is:
Allocate GPU
Create docker container
Run job (and install requirements on top of the vanilla container if necessary). Can be a bash file?
Finish job / kill job if running too long and close container
Additionally, I'd like an interactive workflow similar to the above, where the time limits are reduced (say, 2 hours max of interactive time?)
I'd like to prevent someone from hogging a machine for a week, so a job should have a finite runtime and be killed afterward. Also, multiple people shouldn't be able to use the same GPU simultaneously.
I realize I can do this with a cron job that monitors and kills jobs, but I'm looking for something more elegant, ready-made, and preferably with a nice UI. I've tried ClearML, but can't figure out how to use it for this purpose. I know SLURM is used for allocating entire machines; it's unclear to me whether it can allocate specific GPUs.
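For what it's worth, SLURM does offer GPU-level granularity through its GRES (generic resources) mechanism, and a batch script in the spirit of the workflow above might look roughly like the sketch below. It is only a sketch: the partition name, image, and paths are made up, and it assumes gres/gpu is configured on the nodes and the NVIDIA container toolkit is installed.

    #!/bin/bash
    # Minimal sketch of a one-GPU batch job. Partition, image, and paths are
    # placeholders; assumes gres/gpu is configured and the NVIDIA container
    # toolkit is installed on the node.
    #SBATCH --job-name=train-example
    #SBATCH --partition=gpu
    #SBATCH --gres=gpu:1
    #SBATCH --time=12:00:00
    #SBATCH --output=%x-%j.log

    # SLURM exposes the GPU(s) it allocated via CUDA_VISIBLE_DEVICES; hand only
    # those devices to Docker so jobs never share a GPU.
    IMAGE="pytorch/pytorch:latest"   # vanilla base image, adjust as needed

    docker run --rm \
        --gpus "device=${CUDA_VISIBLE_DEVICES}" \
        -v "$PWD:/workspace" -w /workspace \
        "$IMAGE" \
        bash -c "pip install -r requirements.txt && python train.py"

The interactive case would map to something like srun --gres=gpu:1 --time=02:00:00 --pty bash, which gives a shell on a node with one GPU and a two-hour limit.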

Related

How does a project like Folding@home communicate with your computer to solve protein folding?

How does a program like Folding@home work? Does my computer individually perform a unit of "work" on it, completely separate from other computers running Folding@home, and then send the answer back when it's completed?
Or does Folding@home see all the computers connected to it as the project having, say, 1000 cores, so that doing the work is the equivalent of running something like make -j <total number of cores>?
Projects like Folding@home and BOINC are examples of loosely coupled parallel computing, where each task is fully self-contained and can be completed without communication with other computing entities. They are also examples of a pattern known as controller/worker (formerly master/worker), in which a central controller splits a large task into a pool of smaller subtasks and distributes them to a bunch of worker processes on a first-come, first-served basis. This corresponds to your first point.
In F@H (and BOINC), client computers connect to the server, request a task, work on it until it's complete, then connect to the server again to return the result and request a new task. The benefits of this are automatic load balancing, fault tolerance (via redundancy), and no need for scheduling.
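As a toy illustration of that pull loop (not how F@H is actually implemented; the server endpoints and the compute_work_unit program are invented), a worker might look like this:

    #!/bin/bash
    # Toy pull-model worker: fetch a task, compute locally, return the result,
    # repeat. Endpoints and the compute_work_unit program are made up.
    SERVER="https://example.org/api"

    while true; do
        # Ask the controller for a self-contained work unit.
        task=$(curl -fsS "$SERVER/fetch-task") || { sleep 60; continue; }
        task_id=$(echo "$task" | head -n 1)

        # Do the actual computation with no coordination with other workers.
        echo "$task" | ./compute_work_unit > "result-$task_id.out"

        # Upload the result and immediately ask for the next unit.
        curl -fsS -X POST --data-binary "@result-$task_id.out" \
             "$SERVER/submit-result?id=$task_id"
    done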
When you run make -j #cores, make launches a number of parallel jobs, but those jobs are usually interdependent, so make has to schedule them in an optimal way. The jobs are then run as processes on the same computer, which gives make full process control. If a build step fails, the entire build aborts immediately and the user can quickly look into the problem, fix it, and restart the build. This is not a viable model when a client computer could have an arbitrary compute speed, could connect and disconnect at any time, and/or could decide to simply stop processing tasks. There are distributed versions of make, such as dmake, that run different parts of the build on different remote nodes, but that still happens in a tightly controlled environment, typically on a build cluster.
Note that at a very high level of abstraction the two are basically equivalent, with the main difference being whether jobs are pushed or pulled. While job pulling works fine on all kinds of systems, job pushing usually requires (tightly coupled) systems with predictable characteristics and good scheduling algorithms to be efficient.

VM - LLVMPIPE (Software Rendering) and Effect of Core/Thread Usage on Application

I have written an application (Qt/C++) that creates a lot of concurrent worker threads to accomplish its task, utilizing QThreadPool (from the Qt framework). It has worked flawlessly running on a dedicated server/hardware.
A copy of this application is now running in a virtual machine (RHEL 7), and performance has suffered significantly: the thread pool's queue is being exercised quite extensively and things are getting backed up. This, despite the VM having more cores available to the application than the dedicated, non-virtualized server.
Today, I did some troubleshooting with the top -H -p <pid> command, and found that there were 16 total llvmpipe-# threads running all at once, apparently for software rendering of my application's very simple graphical display. It looks to me like the presence of so many of these rendering threads has left limited resources available for my actual application's threads to be concurrently running. Meaning, my worker threads are yielding/taking a back seat to these.
As this is a small, simple GUI running on a server, I don't care to dedicate so many threads to software rendering of its display. I read some Mesa3D documentation about the LP_NUM_THREADS environment variable, which limits llvmpipe's thread use. I set LP_NUM_THREADS=4, and as a result I seem to have effectively freed up 12 cores for my application's worker threads.
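For concreteness, the change amounts to setting the variable in the application's environment before it starts; the application name and unit details below are placeholders.

    # One-off launch from a shell (application name is a placeholder):
    LP_NUM_THREADS=4 ./my_qt_app

    # Or, if the app is started by systemd, in its unit file:
    # [Service]
    # Environment=LP_NUM_THREADS=4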
Does this sound reasonable, or will I pay some sort of other consequence for doing this?

Monitor node.js scripts running on ubuntu instance

I have a Node.js script that runs once a day on an Ubuntu EC2 instance. The script pulls data from a few hundred thousand remote APIs and saves it to our local database. Is there any way we can monitor this script on the remote server? There have been a few instances where the script crashed for some reason and we were unable to figure it out without SSHing into the instance and checking the logs. After the first few crashes I did build a small system that emails us whenever the script crashes on an uncaught exception, and also when it completes execution.
However, we need a better system where we can monitor the script's progress via the web interface of our admin application (which is deployed on another instance) and also start/stop the script from that interface. What are the possible options for achieving this?
If you'd like to stay in Node.js, there are several process monitoring tools:
PM2 comes with lots of other features besides process monitoring. You can monitor your processes via the CLI or its official web interface: https://keymetrics.io/. A quick search on npm also turns up a bunch of nice unofficial GUI tools: https://www.npmjs.com/search?q=pm2+web
Forever is not as feature-rich as PM2 but handles the basic process operations, and a couple of GUIs for it are also available on npm.
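For a sense of day-to-day use, the basic CLI flow for both tools looks roughly like this (the script name is a placeholder):

    # PM2: start the script, restart it automatically on crashes, and inspect it
    pm2 start daily-sync.js --name daily-sync
    pm2 list                      # overview of all managed processes
    pm2 logs daily-sync           # tail stdout/stderr
    pm2 monit                     # live CPU/memory view in the terminal

    # Forever: same idea with fewer features
    forever start daily-sync.js
    forever list
    forever logs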
There are two problems here that you are trying to solve:
Scheduling work to be done
Monitoring a process for failure
At a simple level, this is easy: schedule a cron job and restart failed things so they keep trying.
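A minimal version of that could be a cron entry plus a retry wrapper, as in this sketch (paths and names are placeholders):

    #!/bin/bash
    # run-daily-pull.sh - naive "keep trying" wrapper around the real script.
    # Scheduled from cron, e.g.:
    #   0 3 * * * /srv/app/run-daily-pull.sh >> /var/log/daily-pull.log 2>&1

    for attempt in 1 2 3; do
        /usr/bin/node /srv/app/pull-apis.js && exit 0
        echo "attempt $attempt failed, retrying in 5 minutes" >&2
        sleep 300
    done
    echo "giving up after 3 failed attempts" >&2
    exit 1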
However, when things don't go smoothly, it helps to have a lot more granularity over what you are scheduling and how it is executed. This also gives you visibility into each little piece of work.
Adding a little more complexity, you can end up with something like this:
Schedule the script that starts everything (via cron, if that's comfortable)
That script generates the jobs that need to be executed and puts them into a queue
A worker process (or n worker processes) consume that queue and execute pending jobs
You can monitor both the progress of the jobs, as well as the state of each worker (# of crashes, failures, jobs completed, etc.). The other tools mentioned above are good candidates for this (forever, pm2, etc.)
When jobs fail, other workers can pick up the small piece of work that was in progress and restart it. This is much more efficient than restarting the entire process, and also lets you parallelize things across n workers based on how you can split up the workloads.
You could easily throw the status onto a web app so you can check in periodically rather than have to dig through server logs.
You can also get more intelligent about different types of failures. Network error? Retry 5 times. Rate limited? Back off gradually. Crash? Don't retry, and notify via email. And so on.
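Sketched with the tools already mentioned (all names and paths hypothetical), that shape can be as simple as a cron-driven producer and a pm2-managed pool of consumers:

    # Producer, run once a day from cron (script and queue backend are placeholders):
    #   0 3 * * * /usr/bin/node /srv/app/enqueue-jobs.js

    # Consumers: a small pool of long-lived workers, restarted by pm2 on crash.
    pm2 start /srv/app/worker.js --name api-worker -i 4   # 4 worker processes
    pm2 logs api-worker                                   # per-job progress and failures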
I have tried this with PM2: you can get info about the task, then cat or tail the log files. Or you could use a logging server; see also: https://github.com/papertrail/remote_syslog2
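Roughly like this (the app name is a placeholder; the log path follows pm2's defaults):

    pm2 describe daily-sync                       # status, restart count, log file paths
    tail -n 200 ~/.pm2/logs/daily-sync-out.log    # or cat/grep the log file directly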

Node.js Clusters with Additional Processes

We use clustering with our Express apps on multi-CPU boxes. It works well; we get maximum use out of our AWS Linux servers.
We inherited an app we are fixing up. It's unusual in that it has two processes. It has an Express API portion to take incoming requests, but the process that acts on those requests can run for several minutes, so it was built as a separate background process: Node calling Python and Maya.
Originally the two were tightly coupled, with the Python script called by the request that uploads the data. But this of course was suboptimal, as it would leave the client waiting for a response for as long as the job took to run, so it was rewritten as a background process that runs in a loop, checking for new uploads and processing them sequentially.
So my question is this: if we have this separate Node process running in the background, and we run clusters, which start up a process for each CPU, how is that going to work? Aren't we going to get two Node processes competing for the same CPU? We were getting a bit of weird behaviour and crashing yesterday, without a lot of error messages (god I love Node), so it's a bit concerning. I'm assuming Linux will just swap the processes in and out as they are being used, but I wonder if it will be problematic, and I also wonder about someone's web session getting swapped out for several minutes while the longer-running process runs.
The smart thing to do would be to rewrite this to run on two different servers, but the files that maya uses/creates are on the server's file system, and we were not given the budget to rebuild the way we should. So, we're stuck with this architecture for now.
Any thoughts on possible problems and how to avoid them would be appreciated.
From an overall architecture perspective, spawning one Node.js process per core is a great way to go. You do have a lot of interdependencies, though: the Node.js processes are calling Maya, which may use multiple threads (keep that in mind).
The part that concerns me is your random crashes and your "process that runs in a loop". If that process is just checking the file system, you probably have a race condition where the Node.js processes are competing to work on the same input/output files.
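If the loop really is just polling a directory, one common way to rule that race out is to take an exclusive lock per input file, e.g. with flock; a rough sketch with placeholder paths and script names:

    #!/bin/bash
    # Sketch: claim each upload with an exclusive, non-blocking lock before
    # processing it, so two looping processes can never grab the same file.
    for f in /srv/uploads/*.json; do
        (
            flock -n 9 || exit 0          # someone else already owns this file
            ./process_upload.sh "$f"      # the long-running python/maya work
            mv "$f" /srv/uploads/done/
        ) 9>"$f.lock"
    done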
In theory, one Node.js process per core will work great and should help you utilize all your CPUs. Linux constantly swaps processes in and out, so that is not an issue; you could even start multiple Node.js processes per core and still not have a problem.
One last note: keep an eye on your memory usage. Several Linux distributions on EC2 do not have a swap file enabled by default, and running out of memory can be another silent app killer, so it's best to add a swap file in case you run into memory issues.
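For reference, adding a swap file takes only a few commands (the 2G size here is arbitrary):

    # Create and enable a 2 GiB swap file (run as root).
    fallocate -l 2G /swapfile
    chmod 600 /swapfile
    mkswap /swapfile
    swapon /swapfile

    # Make it persistent across reboots.
    echo '/swapfile none swap sw 0 0' >> /etc/fstab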

Resource manage external nodes in Jenkins for tests

My problem is that I have code that needs a rebooted node. I have many long-running Jenkins test jobs that need to be executed on rebooted nodes.
My existing solution is to define multiple "proxy" machines in Jenkins with the same label (TestLable) and 1 executor per machine. I bind all the test jobs to that label (TestLable). In the test execution script I detect the Jenkins machine (via the Jenkins NODE_NAME environment variable) and use that to know which physical machine the tests should use.
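Concretely, the detection in the test execution script is just a lookup along these lines (node and machine names here are placeholders):

    #!/bin/bash
    # Map the Jenkins "proxy" node this job landed on to the physical machine
    # it stands in for (names invented for illustration).
    case "$NODE_NAME" in
        test-proxy-01) TARGET_HOST="lab-box-01" ;;
        test-proxy-02) TARGET_HOST="lab-box-02" ;;
        *) echo "unknown proxy node: $NODE_NAME" >&2; exit 1 ;;
    esac

    # Reboot the physical box and run the long tests against it.
    ssh "$TARGET_HOST" sudo reboot
    # ... wait for it to come back up, then point the test suite at $TARGET_HOST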
Does anybody know of a better solution?
The above works, but I need to define a high number of "nodes/machines" that may not be needed. What I would like is a plugin that could grant a token to a Jenkins job. That way a job would not be executed before both a Jenkins executor and a token were free. The token should be a string, so that my test jobs could use it to know which external node to use.
We have written our own scheduler that allocates stuff before starting Jenkins nodes. There may be a better solution - but this works for us mostly. I've yet to come across an off-the-shelf scheduler that can deal with complicated allocation of different hardware resources. We have n box types, allocated to n build types.
Some of our build types are not compatible with each other without destroying all persistent data, which may be needed since it takes a long time to gather. Some jobs require combinations of these hardware types. We store the details in a DB and then use business logic to determine how it is allocated. We've often found that particular job types need additional business logic or extra data fields to account for their specific requirements.
So the best way may be to write your own scheduler, in your language of choice, which takes into account your particular needs.
