I'm on a slurm cluster with a bunch of nodes. I want to run two seperate notebooks, on two seperate nodes. Unfortunatley, when I run two jupyter lab instances, they wind up clobbering .ipynb_checkpoint and other hidden files that are essential for jupyter, even if they use differrent ports. I think this is because they are sharing the same home directory. I'm wondering if there is some addon which will allow me to select which node to use when initializing a kernel, but I can't find it.
The soloution was to start jupyter lab instances on different nodes from different directories (starting them in the same directory causes them to clobber each other). I don't know why this didn't work before. Watch out for editing the same file in two different instances though.
Related
We recently got a supercomputer (I will call it the "cluster", it has 4 GPUs and 12-core processor with some decent storage and RAM) to our lab for machine learning research. A Linux distro (most possibly CentOS or Ubuntu depending on your suggestions of course) will be installed in the machine. We want to design the remote access in such a way that we have the following user hierarchy:
Admin (1 person, the professor): This will be the only superuser of the cluster.
Privileged User (~3 people, PhD students): These guys will be the more tech-savvy or long-term researchers of the lab that will have a user defined for themselves at the cluster. They should be able to setup their own environment (through docker or conda), remote dev their projects and transfer files in and out of the cluster freely.
Regular User (~3 people, Master's students): We expect these kind of users to only interact with the cluster for its computing capabilities and the data it stores. They should not have their own user at the cluster. It is ok if they can only use Jupyter Notebooks. They should be able to access the read-only data in the cluster as the data we are working on will be too much for them to download it locally. However, they should not be able to change anything within the cluster and only be able to have their notebooks and a number of output files there which they should be able to download to their local system whenever necessary for reporting purposes.
We also want to allocate only a certain portion of our computing capabilities for type 3 users. The others should be able to access all the capabilities when they need.
For all users, it should be easy to access the cluster from whatever OS they have on their personal computers. For type 1 and 2 I think PyCharm for remote developing .py files and tunneling for jupyter notebooks is the best option.
I did a lot of research on this but since I don't have an IT background I cannot be sure if the following approach would work.
Set up JupyterHub for type 3 users. This way we don't have to have these guys to have a user at the cluster. However, I am not sure about the GPU support of this. According to here, we can only limit CPU per user. Also, will they be able to access the data under Admin's home directory when we set up the hub or do we have to duplicate the data for that? We only want them to be able to access specific portions of data (the ones related to whatever project they are working on since they sign a confidentiality to only that project). Is this possible with JuptyterHub?
The rest (type-1 and type-2) will have their (sudo or not) users at the cluster. For this case, is there UI to workaround so that users can more easily transfer files from and to the cluster (that they don't have to use scp)? Is FileZilla an option for example?
Finally, if the type-2 users can resolve the issues type-3 users have so that they don't have refer to the professor each time they have a problem. But afaik, you have to be a superuser to control stuff at JupyterHub.
If anyone had to setup this kind of an environment at their own lab and share their experiences I would be grateful.
I'm now running a simulation code called CMAQ on a remote cluster. I first ran a benchmark test in serial to see the performance of the software. However, the job always runs for dozens of hours and then crashes with the following "Stale file handle, errno=116" error message:
PBS Job Id: 91487.master.cluster
Job Name: cmaq_cctm_benchmark_serial.sh
Exec host: hs012/0
An error has occurred processing your job, see below.
Post job file processing error; job 91487.master.cluster on host hs012/0Unknown resource type REJHOST=hs012.cluster MSG=invalid home directory '/home/shangxin' specified, errno=116 (Stale file handle)
This is very strange because I never modify the home directory and this "/home/shangxin/" is surely my permanent directory where the code is....
Also, in the standard output .log file, the following message is always shown when the job fails:
Bus error
100247.930u 34.292s 27:59:02.42 99.5% 0+0k 16480+0io 2pf+0w
What does this message mean specifically?
I once thought this error is due to that the job consumes the RAM up and this is a memory overflow issue. However, when I logged into the computing node while running to check the memory usage with "free -m" and "htop" command, I noticed that both the RAM and swap memory occupation never exceed 10%, at a very low level, so the memory usage is not a problem.
Because I used "tee" to record the job running to a log file, this file can contain up to tens of thousands of lines and the size is over 1MB. To test whether this standard output overwhelms the cluster system, I ran another same job but without the standard output log file. The new job still failed with the same "Stale file handle, errno=116" error after dozens of hours, so the standard output is also not the reason.
I also tried running the job in parallel with multiple cores, it still failed with the same error after dozens of hours running.
I can make sure that the code I'm using has no problem because it can successfully finish on other clusters. The administrator of this cluster is looking into the issue but also cannot find out the specific reasons for now.
Has anyone ever run into this weird error? What should we do to fix this problem on the cluster? Any help is appreciated!
On academic clusters, home directories are frequently mounted via NFS on each node in the cluster to give you a uniform experience across all of the nodes. If this were not the case, each node would have its own version of your home directory, and you would have to take explicit action to copy relevant files between worker nodes and/or the login node.
It sounds like the NFS mount of your home directory on the worker node probably failed while your job was running. This isn't a problem you can fix directly unless you have administrative privileges on the cluster. If you need to make a work-around and cannot wait for sysadmins to address the problem, you could:
Try using a different network drive on the worker node (if one is available). On clusters I've worked on, there is often scratch space or other NFS directly under root /. You might get lucky and find an NFS mount that is more reliable than your home directory.
Have your job work in a temporary directory local to the worker node, and write all of its output files and logs to that directory. At the end of your job, you would need to make it copy everything to your home directory on the login node on the cluster. This could be difficult with ssh if your keys are in your home directory, and could require you to copy keys to the temporary directory which is generally a bad idea unless you restrict access to your keys with file permissions.
Try getting assigned to a different node of the cluster. In my experience, academic clusters often have some nodes that are more flakey than others. Depending on local settings, you may be able to request certain nodes directly, or potentially request resources that are only available on stable nodes. If you can track which nodes are unstable, and you find your job assigned to an unstable node, you could resubmit your job, then cancel the job that is on an unstable node.
The easiest solution is to work with the cluster administrators, but I understand they don't always work on your schedule.
I want to crawl multiple websites by simultaneously running multiple instances of apache nutch-1.6. Should I install multiple copy of apache nutch in different locations and create a single(or master) .sh file for executing nutch crawl command for every copy? OR is it possible to configure a single copy of nutch for multiple instances?
I used the 'bin/crawl' script. Ran it in 2 different terminals simultaneously. Both finished their execution without any bug (atleast as per my judgment). I had supplied different seed directory and crawl directory to each simultaneous instance.
However as per an another thread here it states that you must run bin/nutch command by supplying different 'configuration' file every time you want to run a different simultaneous instance and supply a different /tmp/ path for each instance. I myself didnt had to go through that trouble. The above method worked pretty well for me
I need to design a Clustered application which runs separate instances on 2 nodes. These nodes are both Linux VM's running on VMware. Both application instances need to access a database & a set of files.
My intention is that a shared storage disk (external to both nodes) should contain the database & files. The applications would co-ordinate (via RPC-like mechanism) to determine which instance is master & which one is slave. The master would have write-access to the shared storage disk & the slave will have read-only access.
I'm having problems determining the file system for the shared storage device, since it would need to support concurrent access across 2 nodes. Going for a proprietary clustered file system (like GFS) is not a viable alternative owing to costs. Is there any way this can be accomplished in Linux (EXT3) via other means?
Desired behavior is as follows:
Instance A writes to file foo on shared disk
Instance B can read whatever A wrote into file foo immediately.
I also tried using SCSI PGR3 but it did not work.
Q: Are both VM's co-located on the same physical host?
If so, why not use VMWare shared folders?
Otherwise, if both are co-located on the same LAN, what about good old NFS?
try using heartbeat+pacemaker, it has couple of inbuilt options to monitor cluster. Should have something to look for data too
you might look at an active/passive setup with "drbd+(heartbeat|pacemaker)" ..
drbd gives you a distributed block device over 2 nodes, where you can deploy an ext3-fs ontop ..
heartbeat|pacemaker gives you a solution to handle which node is active and passive and some monitoring/repair functions ..
if you need read access on the "passive" node too, try to configure a NAS too on the nodes, where the passive node may mount it e.g. nfs|cifs ..
handling a database like pgsq|mysql on a network attached storage might not work ..
Are you going to be writing the applications from scratch? If so, you can consider using Zookeeper for the coordination between master and slave. This will put the coordination logic purely into the application code.
GPFS Is inherently a clustered filesystem.
You setup your servers to see the same lun(s), build the GPFS filesystem on the lun(s) and mount the GPFS filesystem on the machines.
If you are familiar with NFS, it looks like NFS, but it's GPFS, A Clustered filesystem by nature.
And if one of your GPFS servers goes down, if you defined your environment correctly, no one is the wiser and things continue to run.
I have 4 computers in a network. I am curios if anyone know how i can i make python to look for some files in different folders from the network or if i can create a mount point that include some folders from different computers. The reason is that i am running a script that needs to open some daemons on different computer. For instance i have the following folders from which i need to run:
/temp on 10.18.2.25
/opt on 10.18.2.35
/var-temp on 10.18.4.12
/spam on 10.18.2.17
I am using the command os.system('exec .....') to launch it , but only works for the current directory.
It sounds like you don't merely want to execute files stored on different machines on one machine, but on the machines they're stored on. Mounting won't help with that.
You'd need a daemon already running on the target, and tell it over the network to execute a file. xinetd is a common one.