Remotely access a qsub compute node

Using qsub, I have submitted a long-running job that spawns two Java processes, one of which listens for Java RMI calls on some port. Say qsub assigns that job to node "compute-0-37". How can I communicate with compute-0-37 remotely (from a node other than the head node) over an RPC call (Java RMI in this case)?
I have not been able to find this from reading existing docs (e.g. http://gridscheduler.sourceforge.net/htmlman/htmlman1/qsub.html, http://docs.adaptivecomputing.com/torque/4-0-2/Content/topics/commands/qsub.htm). As far as I can tell, the only way to access a compute node directly is from the head node, but it seems like that would be pretty restrictive for use cases like mine.

The reason you can't find this in the resource management software documentation is that it isn't a resource management question. Accessing worker nodes is simply that: a function of network access. Whether you're hoping to do RPC communication between the nodes themselves, or with some machine on a different subnet, you should be able to do so (provided that site policies and the system administrator(s) allow it).
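As a rough illustration (the job id, RMI port, and qstat output format below are assumptions that depend on your scheduler and site), you could look up which node the job landed on and verify that the port is reachable from some other machine on the network; once it is, the RMI client simply points at that host and port:
#!/bin/bash
# Hypothetical job id and RMI registry port -- substitute your own values.
JOB_ID=12345
RMI_PORT=1099
# Find the node the job was assigned to.
# Torque/PBS: look at the exec_host field; Grid Engine: the queue@host column of plain qstat.
qstat -f "${JOB_ID}" | grep -i exec_host
# From any machine that can route to the compute nodes, check that the port is open.
nc -z compute-0-37 "${RMI_PORT}" && echo "RMI port on compute-0-37 is reachable"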

Related

Docker containers instead of multiprocessing

One of the main applications of Docker containers is load balancing. For example, in the case of a web application, instead of having only one instance handling all requests, we have many containers doing exactly the same thing, and the requests are split across all of these instances.
But can it be used to run the same service, but with different "parameters"?
For instance, suppose I want to create a platform storing crypto-currency data from different exchange platforms (Bitfinex, Bittrex, etc.).
A lot of these platforms expose websockets. So in order to create one socket per platform, I would do something like this at the "code layer" (language agnostic):
foreach (platform in platforms)
    client = createClient(platform)
    socket = client.createSocket()
    socket.GetData()
Now of course, this loop would be stuck on the first iteration, because the websocket call blocks while waiting for data (although I could use asynchronous I/O instead). To circumvent that, I could use multiprocessing, something like:
foreach (platform in platforms)
    client = createClient(platform)
    socket = client.createSocket()
    process = new ProcessWhichGetData(socket)
    process.Launch()
Is there any way to do that at the "Docker layer", that is, to use Docker so that different containers handle different platforms? I would have one Docker container for Bittrex, one Docker container for Bitfinex, etc.
I know this would imply that either the different containers communicate with each other (who takes care of Bitfinex? who takes care of Bittrex?), or the container orchestrator (Docker Swarm / Kubernetes) handles this distribution itself.
Is it something we could do, and, on top of that, is it something we would want to do?
Docker containerization simply adds various layers of isolation around regular user-land processes. It does not by itself introduce coordination among several processes, though it certainly can be exploited to build a multi-process system in which each process performs some job, whether those jobs are redundant or complementary.
If you can design your solution so that one process is launched for each "platform" (for example, by passing the specific platform an instance should handle as a command-line parameter), then indeed, this can technically be done in Docker.
I should, however, point out that it is not clear why you would want to run each process in a distinct container. Is isolation pertinent for security reasons? For resource accounting? To have each process dispatched to a distinct host in order to access more processing power? Also, is any coordination required among these processes, beyond initially determining which process handles which platform? If so, do they need access to shared storage, or the ability to send signals to each other? These questions will help you decide how to approach the dockerization of your solution.
In the simplest case, assuming that all you want is for the whole set of processes to be isolated from the rest of the system, with no requirement that they be isolated from each other, the most straightforward strategy is a single container whose entrypoint shell script launches one process per platform.
entrypoint.sh (inside your docker image):
#!/bin/bash
# Platforms to handle; quote the list so both names land in the variable.
platforms="Bitfinex Bittrex"
for platform in ${platforms} ; do
    # One background process per platform.
    ./myprogram "${platform}" &
done
# Keep the container alive as long as the background processes run.
wait
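Assuming the script above is baked into an image as its ENTRYPOINT (the image name my_platform_image below is hypothetical), building and running the single container could look like:
# Build the image from a Dockerfile that copies entrypoint.sh and sets it as ENTRYPOINT.
docker build -t my_platform_image .
# Run one container; all of the per-platform processes live inside it.
docker run -d --name platforms my_platform_image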
If you really need a distinct container for each platform, then you would use a similar script, but this time it would run directly on the host machine (that is, outside of a container) and would wrap each process in a docker container.
launch.sh (directly on the host):
#!/bin/bash
platforms="Bitfinex Bittrex"
for platform in ${platforms} ; do
    # One detached container per platform, named after the platform it handles.
    docker run -d --name "program_${platform}" my_program_docker \
        /usr/local/bin/myprogram "${platform}"
done
Alternatively, you could use docker-compose to define the list of docker containers to be launched, but I will not discuss this option further at present (just ask if it seems pertinent to your case).
If you need the containers to be distributed among several host machines, then that same loop could be used, but this time the containers would be launched via docker-machine. Alternatively, if using docker-compose, they could be distributed using Swarm.
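As a hedged sketch of the docker-machine variant (the machine names node-1 and node-2 are placeholders, and the image is the same hypothetical my_program_docker as above), the loop simply points the docker client at a different host before each run:
#!/bin/bash
platforms="Bitfinex Bittrex"
machines="node-1 node-2"
set -- ${machines}
for platform in ${platforms} ; do
    machine="$1"; shift
    # Point the docker CLI at the remote host managed by docker-machine.
    eval "$(docker-machine env "${machine}")"
    docker run -d --name "program_${platform}" my_program_docker \
        /usr/local/bin/myprogram "${platform}"
done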
Say you restructured this as a long-running program that handled only one platform at a time, and controlled which platform that was via a command-line option or an environment variable. Instead of having your "launch all the platforms" loop in code, you might write a shell script like:
#!/bin/sh
for platform in $(cat platforms.txt); do
    ./run_platform "$platform" &
done
This setup is easy to transplant into Docker.
You should not plan on processes launching Docker containers dynamically. This is hard to set up and has significant security implications (by which I mean "a bug in your container launcher could easily root your host").
If the individual processing tasks can all run totally independently (maybe they use a shared database to store data) then you're basically done. You could replace that shell script with something like a Docker Compose YAML file that lists out all of the containers; if you want to run this on multiple hosts you can use tools like Ansible, or Docker Swarm, or Kubernetes to spread the containers out (with varying levels of infrastructure complexity).
You can group the different Docker containers in a stack and also configure the networking so that the containers remain isolated from the outside world but can communicate with each other.
More info here: Docker Stack
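A minimal sketch of the networking part (the container and image names below are placeholders): a user-defined bridge network lets containers resolve each other by name without publishing any ports to the outside world:
# Create an isolated, user-defined bridge network.
docker network create appnet
# Attach both containers to it; no -p flags, so nothing is exposed outside the host.
docker run -d --name db  --network appnet my_db_image
docker run -d --name web --network appnet my_web_image
# Inside "web", the database container is now reachable simply as the hostname "db".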

Submit & monitor spark jobs via java in cluster mode

I have a Java class which manages jobs and executes them via Spark (using 1.6).
I am using the API sparkLauncher.startApplication(SparkAppHandle.Listener... listeners) in order to monitor the state of the job.
The problem is that I have moved to a real cluster environment, and this approach can't work when the master and the workers are not on the same machine, as the internal implementation uses only localhost (loopback) to open a port for the workers to bind to.
The API sparkLauncher.launch() works, but doesn't let me monitor the status.
What is the best practice for a cluster environment using Java code?
I also saw the option of the hidden REST API; is it mature enough? Should I enable it in Spark somehow (I am getting access denied, even though the port is open from the outside)?
REST API
In addition to viewing the metrics in the UI, they are also available as JSON. This gives developers an easy way to create new visualizations and monitoring tools for Spark. The JSON is available both for running applications and in the history server. The endpoints are mounted at /api/v1. E.g., for the history server, they would typically be accessible at http://<server-url>:18080/api/v1, and for a running application, at http://localhost:4040/api/v1.
More details can be found here.
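For example, assuming the driver runs on a host called driver-node and the application id shown below is a placeholder, the endpoints can be queried with plain curl:
# Applications known to a running driver (default UI port 4040).
curl http://driver-node:4040/api/v1/applications
# Jobs of one application (substitute the real application id).
curl http://driver-node:4040/api/v1/applications/app-20160101000000-0001/jobs
# Completed applications recorded by the history server (default port 18080).
curl http://history-server:18080/api/v1/applications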
Every SparkContext launches a web UI, by default on port 4040, that displays useful information about the application. This includes:
A list of scheduler stages and tasks
A summary of RDD sizes and memory usage
Environmental information.
Information about the running executors
You can access this interface by simply opening http://driver-node:4040 in a web browser. If multiple SparkContexts are running on the same host, they will bind to successive ports beginning with 4040 (4041, 4042, etc).
More details can be found here.

How to organize the containers in Docker?

I am now developing new content, so I am building the server.
On my server the base system is CentOS 7. I installed Docker, pulled the CentOS image, and set up a "web server container" running Django with uWSGI and nginx.
Now I want to bring up another service (a database, with PostgreSQL). What is the best way to do it?
Install PostgreSQL in my existing container (with the web server).
Build a new container only for the database.
I want to know the advantages and weak points of each approach.
It's idiomatic to use two separate containers. Also, this is simpler - if you have two or more processes in a container, you need a parent process to monitor them (typically people use a process manager such as supervisord). With only one process, you won't need to do this.
By monitoring, I mainly mean that you need to make sure that all processes are correctly shut down if the container receives a SIGTERM signal. If you don't do this properly, you will end up with zombie processes. You won't need to worry about this if you only have a single process or use a process manager.
Further, as Greg points out, having separate containers allows you to orchestrate and schedule the containers separately, so you can do update/change/scale/restart each container without affecting the other one.
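As a hedged sketch of the two-container approach (the image tag, password, and network/volume names are placeholders), the database gets its own container with a named volume so the data survives restarts, and the web container reaches it by hostname over a shared network:
# Network shared by the web and database containers, and a named volume for the data.
docker network create app-net
docker volume create pgdata
# Database container.
docker run -d --name db --network app-net \
    -e POSTGRES_PASSWORD=change-me \
    -v pgdata:/var/lib/postgresql/data \
    postgres:9.6
# Web container: Django connects to host "db", port 5432.
docker run -d --name web --network app-net -p 80:80 my_django_image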
If you want to keep the data in the database after a restart, the database shouldn't be in a container but on the host. I will assume you want the db in a container as well.
Setting up a second container is a lot more work. You need a way for the containers to learn each other's addresses. The address changes each time you start a container, so you need some scripts on the host: the host must find out the IP addresses and inform the containers.
The containers might then update their /etc/hosts files with the address of the other container. This is a nice solution when you want to emulate different servers and perform resilience tests, but you will need quite some bash knowledge before you get it running well.
In almost all other situations, choose one container. Installing everything in one container is easier to set up and to develop with afterwards. Docker is just the environment in which you do your real work; tooling should help you with that work, not take all your time and effort.

Accessing Matlab MDCS Cluster over SSH

I just installed MATLAB's Distributed Computing Server on a bunch of machines and it works, but only for those physically connected to the cluster's network. For remote access, those machines are two SSH hops away. How is this problem usually solved? I thought of setting up a VPN, but to me that seems like a last resort.
What I want is for everybody in the lab, using their own version of MATLAB with the correct toolbox, to just run their code on the cluster somewhat effortlessly. I guess I could ask everybody to just tar-ball their files and access a remote installation of MATLAB, somehow forwarding the GUI session (VNC or X forwarding), but that seems ugly.
Any help?
It is possible to set up "remote access" to a cluster running MDCS so that clients without direct access can submit jobs there. The documentation for this starts here:
http://www.mathworks.com/help/mdce/configure-parallel-computing-products-for-a-generic-scheduler.html
I'm not quite sure how to configure things so that the submission can work across two SSH connections - the example integration scripts shipped with MDCS all presume only one. However, it should be possible provided that:
The client can put the job and task files somewhere the execution nodes can see them
The client can trigger the appropriate qsub (or equivalent) on the cluster head node
You might also consider simply contacting MathWorks installation support.
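On the second point, one hedged possibility for triggering the submission across two SSH hops is OpenSSH's jump-host support (the host names and job script path are placeholders, and this assumes key-based logins on both hops):
# Single command from the client machine, hopping through the gateway to the head node.
ssh -J user@gateway user@headnode "qsub /shared/jobs/matlab_job.sh"
# Older OpenSSH versions without -J can chain the two hops explicitly.
ssh user@gateway ssh user@headnode "qsub /shared/jobs/matlab_job.sh"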

Jenkins - Managing a pool of resources

I'm trying to set up a Jenkins system where a certain program has to be run on a board on the network, accessed using telnet. We're talking about hundreds of such jobs here, so we will be setting up multiple boards. Each job therefore has to be allocated a board, but the catch is that only one job can use a given board at a time, otherwise the program fails.
The solution I have right now is a master-slave set-up where I connect to the same machine over SSH (so a master and multiple slaves on the same machine). Each slave node then has a label for the IP address the program has to telnet to. This works, scheduling-wise, but it might cause issues because all nodes connect to the same machine over SSH. Connecting to the boards themselves using SSH is not an option.
Is there any way to get the same functionality as above, but without using SSH to connect to the same machine? Basically I want to be able to say: we have n available machines; when a job comes in, give it one of those machines and pass it a label belonging to that machine (its IP address in this case); now there are n-1 machines left.
Mutual exclusion comes close, but does not allow the above functionality, and jobs waiting for a resource take up one of the executors of a node.
Thanks a lot!
I realize your problem was probably solved years ago, but in case someone else is looking for the answer and runs into this:
You can use the "Lockable Resources" plugin, set the IP address as the name of the resource, and give it a label such as test-board-ip. It is simple and easy to use.
Another possibility is the "External Resource Dispatcher" plugin. It offers a few more possibilities, but it has a bug that causes it to hang sometimes, and it no longer seems to be maintained (last updates are from 2013).
Maybe you should have a look at the Locks and Latches plugin. With it you can lock a resource by simply requiring the job to lock the board it needs.
https://wiki.jenkins-ci.org/display/JENKINS/Locks+and+Latches+plugin

Resources