Missing IIB Broker Directories - Linux

As far as I am aware, /var/mqsi/components/broker_name is the default directory in which a broker called broker_name stores its configuration information. Below this directory are the pid directories containing execution group information, stdout/stderr, and other things.
I don't know if it's relevant to the issue we're seeing, but for historical reasons we keep this directory in a different location for our brokers: /var/broker_name/data/mqsi/components/broker_name.
We are using IIB 9.0.0.7 with multi-instance brokers (so all the directories are on the same HNAS with different mounts). We have eight brokers, so /var/broker_name_1, /var/broker_name_2, etc. are each mount points, with directories /data/mqsi/components/broker_name_n below them.
Yesterday afternoon, the directory /var/broker_name_n/data/mqsi/components/broker_name_n disappeared from the file system for two of the brokers. The directory /var/broker_name_n/data/mqsi/components still exists, but the broker_name_n directory below components is gone.
Because the processes were running in memory, the broker and the applications on its execution groups carried on working until we restarted one of the brokers, at which point all of the execution groups disappeared. The other broker was still running, so I looked and saw that its directory was also missing. Restarting the broker recreated the directory, but not the execution groups.
Has anyone seen this behaviour before? Could this be caused by something in IIB, or is it more likely something on the system? If it's a file-system issue, it's odd that it hit that specific directory on both servers.
In /var/log/messages we don't see any issues, except for errors from processes that can't find the files they need, e.g. BIP3108S: Unable to initialize the listener environment. Exception text getListenerParametersFromFile: java.io.FileNotFoundException: /var/broker_name_1/data/mqsi/components/broker_name_1/config/wsplugin6.conf (No such file or directory)
Is there any way to recover from this in IIB without reinstalling everything?

Related

Weird "Stale file handle, errno=116" on remote cluster after dozens of hours running

I'm running a simulation code called CMAQ on a remote cluster. I first ran a benchmark test in serial to check the performance of the software. However, the job always runs for dozens of hours and then crashes with the following "Stale file handle, errno=116" error message:
PBS Job Id: 91487.master.cluster
Job Name: cmaq_cctm_benchmark_serial.sh
Exec host: hs012/0
An error has occurred processing your job, see below.
Post job file processing error; job 91487.master.cluster on host hs012/0
Unknown resource type REJHOST=hs012.cluster MSG=invalid home directory '/home/shangxin' specified, errno=116 (Stale file handle)
This is very strange because I never modified my home directory, and /home/shangxin/ is definitely my permanent directory, where the code lives.
Also, in the standard output .log file, the following message is always shown when the job fails:
Bus error
100247.930u 34.292s 27:59:02.42 99.5% 0+0k 16480+0io 2pf+0w
What does this message mean specifically?
I initially thought the job was using up all the RAM and this was a memory overflow issue. However, when I logged into the compute node while the job was running and checked memory usage with the "free -m" and "htop" commands, I saw that both RAM and swap usage never exceeded 10%, so memory usage is not the problem.
Because I used "tee" to record the job's output to a log file, that file can grow to tens of thousands of lines and over 1 MB. To test whether this standard output was overwhelming the cluster, I ran the same job again without the log file. The new job still failed with the same "Stale file handle, errno=116" error after dozens of hours, so the standard output is not the cause either.
I also tried running the job in parallel on multiple cores; it still failed with the same error after dozens of hours of running.
I am confident the code itself is fine because it finishes successfully on other clusters. The administrator of this cluster is looking into the issue but has not found the specific cause yet.
Has anyone ever run into this weird error? What should we do to fix this problem on the cluster? Any help is appreciated!
On academic clusters, home directories are frequently mounted via NFS on each node in the cluster to give you a uniform experience across all of the nodes. If this were not the case, each node would have its own version of your home directory, and you would have to take explicit action to copy relevant files between worker nodes and/or the login node.
It sounds like the NFS mount of your home directory on the worker node probably failed while your job was running. This isn't a problem you can fix directly unless you have administrative privileges on the cluster. If you need a workaround and cannot wait for the sysadmins to address the problem, you could:
Try using a different network drive on the worker node (if one is available). On clusters I've worked on, there is often scratch space or another NFS mount directly under the root /. You might get lucky and find an NFS mount that is more reliable than your home directory.
Have your job work in a temporary directory local to the worker node, and write all of its output files and logs to that directory. At the end of the job, copy everything back to your home directory on the login node (see the sketch at the end of this answer). This can be difficult with ssh if your keys are in your home directory, and it could require copying keys into the temporary directory, which is generally a bad idea unless you restrict access to them with file permissions.
Try getting assigned to a different node of the cluster. In my experience, academic clusters often have some nodes that are flakier than others. Depending on local settings, you may be able to request certain nodes directly, or request resources that are only available on stable nodes. If you track which nodes are unstable and find your job assigned to one, you can resubmit the job and then cancel the copy running on the unstable node.
The easiest solution is to work with the cluster administrators, but I understand they don't always work on your schedule.
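As a rough illustration of the second work-around, here is a minimal Python sketch of that staging pattern. The scratch location, results directory, and script path are my own placeholders, not anything from the original post, so adjust them for your cluster.

    import os
    import shutil
    import subprocess
    import tempfile

    # Hypothetical locations -- adjust for your cluster's layout.
    HOME_RESULTS = os.path.expanduser("~/cmaq_results")   # final destination on the NFS home
    LOCAL_SCRATCH = os.environ.get("TMPDIR", "/tmp")      # node-local disk

    def run_staged_job():
        # Work on the node's local disk so the long run does not depend on
        # the NFS home mount staying healthy for dozens of hours.
        workdir = tempfile.mkdtemp(prefix="cmaq_", dir=LOCAL_SCRATCH)
        try:
            with open(os.path.join(workdir, "run.log"), "w") as log:
                subprocess.run(
                    ["/path/to/cmaq_cctm_benchmark_serial.sh"],  # placeholder script path
                    cwd=workdir, stdout=log, stderr=subprocess.STDOUT, check=True,
                )
        finally:
            # Touch the home mount only once, at the very end of the job.
            os.makedirs(HOME_RESULTS, exist_ok=True)
            shutil.copytree(workdir, os.path.join(HOME_RESULTS, os.path.basename(workdir)))
            shutil.rmtree(workdir, ignore_errors=True)

    if __name__ == "__main__":
        run_staged_job()

The same idea can of course be expressed directly in the PBS job script; the point is simply to hit the home mount once at the end instead of continuously for the whole run.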

Disk space getting filled up due to jobcache in the tmp directory of a Nutch Linux instance

I am a newbie. We have set up a Solr environment and are facing an issue in Nutch: disk space is 100% utilized. When we debugged it, we saw that the jobcache in the location below is using most of the space (approximately 70%).
"/tmp/hadoop-root/mapred/local/taskTracker/root/jobcache/".
I have searched many forums to understand what exactly this jobcache folder contains.
Can anyone help me understand what this jobcache folder contains and how I can stop this tmp folder from using up all the disk space?
What effect will it have if I remove the jobcache folder and then recreate it with the mkdir command?
Thanks in advance.
The directory you mentioned is /tmp/hadoop-root/mapred/local/taskTracker/root/jobcache/. This directory is used by the TaskTracker (slave) daemons to localize job files when tasks run on the slaves. When a job completes, the directories under the jobcache should be cleaned up automatically.
This email chain http://mail-archives.apache.org/mod_mbox/hadoop-user/201301.mbox/%3C26850_1357828735_0MGE0023YZCTOO30_99DD75DC8938B743BBBC2CA54F7224A706D2E1AF#NYSGMBXB06.a.wcmc-ad.net%3E discussed a similar problem.
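If you do need to reclaim space by hand, here is a hedged Python sketch (the retention window is an assumption on my part; the path is the one from the question) that reports how much each job directory uses and removes only directories old enough that their jobs cannot still be running. Make sure no TaskTracker is actively using a directory before deleting it.

    import os
    import shutil
    import time

    # Path from the question; adjust if your hadoop.tmp.dir points elsewhere.
    JOBCACHE = "/tmp/hadoop-root/mapred/local/taskTracker/root/jobcache"
    MAX_AGE_DAYS = 7  # assumed retention window -- tune to taste

    def dir_size(path):
        """Total size in bytes of all files under path."""
        total = 0
        for root, _dirs, files in os.walk(path):
            for name in files:
                try:
                    total += os.path.getsize(os.path.join(root, name))
                except OSError:
                    pass  # file vanished while we were walking
        return total

    def prune_old_job_dirs():
        now = time.time()
        for entry in os.listdir(JOBCACHE):
            job_dir = os.path.join(JOBCACHE, entry)
            if not os.path.isdir(job_dir):
                continue
            age_days = (now - os.path.getmtime(job_dir)) / 86400
            print(f"{entry}: {dir_size(job_dir) / 1e6:.1f} MB, {age_days:.1f} days old")
            # Only remove directories old enough that their jobs cannot still be running.
            if age_days > MAX_AGE_DAYS:
                shutil.rmtree(job_dir, ignore_errors=True)

    if __name__ == "__main__":
        prune_old_job_dirs()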

After restarting Arch Linux, ZooKeeper deletes all nodes I have added

I am creating ZooKeeper nodes with the command "create /[node_name] [data]", and when I restart my machine all the nodes are gone.
If you're using the default zoo.cfg, your ZooKeeper data directory (dataDir) might be under /tmp, which gets cleared on boot. Try setting it to something else, like /var/lib/zookeeper.
Also ensure you are creating your nodes as CreateMode.PERSISTENT; EPHEMERAL nodes will disappear as soon as the session that created them ends.
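For illustration, here is a minimal sketch using the Python kazoo client (my choice of client, not something from the question; the same distinction exists in the Java API and the zkCli shell). It assumes a ZooKeeper server reachable on 127.0.0.1:2181.

    from kazoo.client import KazooClient  # pip install kazoo

    # Assumes a ZooKeeper server reachable on the default port.
    zk = KazooClient(hosts="127.0.0.1:2181")
    zk.start()

    # Persistent node: survives client disconnects and server restarts,
    # as long as dataDir in zoo.cfg points at a durable, absolute path
    # (e.g. /var/lib/zookeeper) rather than somewhere under /tmp.
    if not zk.exists("/node_name"):
        zk.create("/node_name", b"data", makepath=True)   # ephemeral=False by default

    # Ephemeral node: removed automatically when this session ends.
    zk.create("/transient_node", b"data", ephemeral=True)

    zk.stop()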
Did you figure this out? If not, one more WAG:
The default zoo.cfg uses a relative path for dataDir. So if you start ZK from within the bin directory, the data directory will be in one place, but if you start it from the directory above (via ./bin/...) it will be one directory higher. I would always use an absolute path there.

What is the proper place to put named pipes on Linux?

I've got a few processes that talk to each other through named pipes. Currently I'm creating all my pipes in the local working directory and keeping the applications in the same directory. At some point these programs can (and will) be run from different directories, so I need to create the pipes in a known location where all of the applications will be able to find the ones they need.
I'm new to working on Linux and am not familiar with the filesystem structure. In Windows, I'd use something like the AppData folder to keep these pipes. I'm not sure what the equivalent is in Linux.
The /tmp directory looks like it could work nicely. I've read in a few places that it's cleared on system shutdown (and that's fine; I have no problem re-creating the pipes when I start back up), but I've also seen people say they lose files while the system is up, as if /tmp is cleaned periodically, which I don't want happening while my applications are using those pipes!
Is there a place better suited for application-specific files, or would /tmp be the place to keep these (since they are, after all, temporary)?
I've seen SaltStack use /var/run. The only problem is that you need root access to write into that directory, but let's say you are going to run your process as a system daemon. SaltStack creates /var/run/salt at installation time and changes the owner to salt so that it can later be used without root privileges.
I also checked the Filesystem Hierarchy Standard, and even though it isn't that important here, even it says (of /var/run):
System programs that maintain transient UNIX-domain sockets must place them in this directory.
Since named pipes are something very similar, I would go the same way.
On newer Linux distros with systemd, /run/user/<userid> (created by pam_systemd during login if it doesn't already exist) can be used for opening sockets and putting .pid files there instead of /var/run, where only root has access. Also note that /var/run is a symlink to /run, so /var/run/user/<userid> can also be used. For more info check out this thread. The idea is that system daemons should have a /var/run/<daemon name>/ directory created during installation with proper permissions and put their sockets/pid files there, while daemons run by the user (such as pulseaudio) should use /run/user/<userid>/. Another option is /tmp or /var/tmp.
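As a concrete sketch of that per-user layout, here is a small Python example (the application name and the /tmp fallback are my own placeholders, not anything from the thread) that creates its pipes under the runtime directory pam_systemd provides, falling back to /tmp otherwise.

    import os

    APP_NAME = "my-app"  # hypothetical application name

    def pipe_directory():
        # Prefer the per-user runtime directory (usually /run/user/<uid>),
        # which pam_systemd sets up at login; fall back to /tmp otherwise.
        base = os.environ.get("XDG_RUNTIME_DIR", "/tmp")
        path = os.path.join(base, APP_NAME)
        os.makedirs(path, mode=0o700, exist_ok=True)  # restrict to the owning user
        return path

    def create_pipe(name):
        pipe_path = os.path.join(pipe_directory(), name)
        if not os.path.exists(pipe_path):
            os.mkfifo(pipe_path, mode=0o600)
        return pipe_path

    if __name__ == "__main__":
        print(create_pipe("worker-input"))  # e.g. /run/user/1000/my-app/worker-input

A system daemon, by contrast, would have its /var/run/<daemon name>/ directory created at install time with the right ownership, as described above.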

Linux: Remove application settings after program is uninstalled?

I'm writing a program for Linux that stores its data and settings in the home directory (e.g. /home/username/.program-name/stuff.xml). The data can take up 100 MB or more.
I've always wondered what should happen with the data and the settings when the system admin removes the program. Should I then delete these files from every (!) home directory, or should I just leave them alone? Leaving hundreds of MB in the home directories seems quite wasteful...
I don't think you should remove user data: the program could be installed again in the future, or the user could choose to move their data to another machine where the program is installed.
Anyway, this kind of thing is usually handled by a removal script (it can be make uninstall; more often it's an uninstallation script run by your package manager). Different distributions have different policies. Some package managers have an option to specify whether to remove logs, configuration files (from /etc) and so on. None of them touches files in user homes, as far as I know.
What happens if the home directories are shared between multiple workstations (i.e. NFS mounted)? If you remove the program from one of those workstations and then go blasting the files out of every home directory, you'll probably really annoy the people who are still using the program on other workstations.