Linux file descriptor logs

One of my Java build processes on a Linux machine has been running slowly of late. One of the things I suspect is causing the slowness is the process hitting the maximum file descriptor limit. I don't have permission to find out how many file descriptors my build process is using. So, does Linux log to a file when a process hits the max file descriptor limit, so that I can check whether my build process is slowing down because of it?

Check /proc/PIDOFPROCESS/fd/. This directory contains one entry for every descriptor the program has open; ls /proc/PIDOFPROCESS/fd | wc -l will give you the count.
ulimit -n will give you the maximum number of open descriptors. You can also set this value before running the program.
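For completeness, here is a minimal C sketch that does the same check programmatically: it counts the entries under /proc/<pid>/fd and reads that process's RLIMIT_NOFILE via the Linux-specific prlimit(2). This is only an illustration of the approach above; inspecting another user's process still requires the same permissions that ls would need.

/* fdcount.c - count a process's open descriptors and show its fd limit.
 * Build: gcc -o fdcount fdcount.c   Run: ./fdcount <pid>
 * Reading another user's /proc/<pid>/fd usually requires root. */
#define _GNU_SOURCE
#include <dirent.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/resource.h>
#include <sys/types.h>

int main(int argc, char **argv)
{
    if (argc != 2) {
        fprintf(stderr, "usage: %s <pid>\n", argv[0]);
        return 1;
    }
    pid_t pid = (pid_t)atoi(argv[1]);

    char path[64];
    snprintf(path, sizeof path, "/proc/%d/fd", (int)pid);

    DIR *dir = opendir(path);
    if (!dir) {
        perror(path);               /* EACCES here is the "no permission" case */
        return 1;
    }

    long count = 0;
    struct dirent *de;
    while ((de = readdir(dir)) != NULL)
        if (de->d_name[0] != '.')   /* skip "." and ".." */
            count++;
    closedir(dir);

    struct rlimit rl;
    if (prlimit(pid, RLIMIT_NOFILE, NULL, &rl) == 0)
        printf("pid %d: %ld open fds, soft limit %llu\n",
               (int)pid, count, (unsigned long long)rl.rlim_cur);
    else
        printf("pid %d: %ld open fds (could not read limit)\n", (int)pid, count);
    return 0;
}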

Related

SYSTEM ERROR: I/O error 0 in writeto, ret 2048, file 56(/mfgtmp/tmp/srtE5yybD), addr 77010944. (290) - PROGRESS 4GL

I am suddenly getting the error below after my Progress program has been executing for more than 80 minutes. I think this is an OS error, and error 0 suggests it is out of disk space. I checked the disk space and it shows 14 GB available, but I am not sure why I am getting this error.
Is it because a write ran out of disk space (exceeding the 14 GB) and stopped, so the available 14 GB stayed the same as it is?
SYSTEM ERROR: I/O error 0 in writeto, ret 2048, file 56(/mfgtmp/tmp/srtE5yybD), addr 77010944. (290)
By default temp files are created "unlinked". Because of this, the space they were using is automatically reclaimed by the OS if the session crashes, so you will often have a situation where your temp file ran out of space, the session crashed, and then when you investigate there is plenty of free space.
You can change the default behavior by using the -t (lower case) startup parameter. This will result in the files not being removed if a session crashes - so the space will not be returned to the OS. You will have to manually delete "stale" files if you enable -t.
On UNIX -t will also make the files visible in the -T (upper case) directory so that you can see their growth in real time.
On Windows the files are always visible and the current length is not consistently reported by system tools.
If your temp files are being written to a different filesystem than your working directory (the -T startup parameter is where temp files go) then you should have a "protrace.pid" file corresponding to the crashed session's process id and the timestamp of the crash. This will then lead you to the 4gl code that was creating the very large srt file.
14GB is far beyond "reasonable" so you really should look at that code and see if there is a better way to do whatever it is doing.
There are a number of k-base articles on that issue, for instance: https://knowledgebase.progress.com/articles/Knowledge/000027351
When you check disk space, please make sure you're checking the correct file system (/mfgtmp in this case).
The error message references an srt file, so you might want to try to reduce the srt file usage; see this article for some initial help: https://knowledgebase.progress.com/articles/Knowledge/P95930
Or: https://knowledgebase.progress.com/articles/Knowledge/P84475
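One quick sanity check, whichever direction you take, is to confirm the free space on the file system that actually holds the -T directory (df -h /mfgtmp does it from the shell). If you want to run the same check from code, here is a minimal sketch using statvfs(3); the path below is just the one taken from the error message, adjust it to wherever your -T parameter actually points.

/* spacecheck.c - report free space on the filesystem holding a path.
 * A minimal sketch; pass the -T directory (e.g. /mfgtmp/tmp) as argv[1]. */
#include <stdio.h>
#include <sys/statvfs.h>

int main(int argc, char **argv)
{
    const char *path = (argc > 1) ? argv[1] : "/mfgtmp/tmp";
    struct statvfs sv;

    if (statvfs(path, &sv) != 0) {
        perror(path);
        return 1;
    }

    /* f_bavail counts blocks available to unprivileged users,
       which is what a non-root session actually has to work with. */
    unsigned long long avail = (unsigned long long)sv.f_bavail * sv.f_frsize;
    printf("%s: %llu MB available\n", path, avail / (1024 * 1024));
    return 0;
}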

If the size of the file exceeds the maximum size of the file system, what happens?

For example, on a FAT32 partition the maximum file size is 4 GB, but I was able to create a 5 GB file with vim. I saved the file and opened it again, and the console output was broken like a staircase. I have three questions:
1) If the size of a file exceeds the maximum size allowed by the file system, what happens?
2) In my case, why did it break?
3) The Unix stat() system call can succeed only up to 2 GB (2^31 - 1). Does this have anything to do with the file system? Is there a relationship between the limits in stat() and the limits of each file system?
If the size of the file exceeds the maximum size of the file system, what happens?
By definition, that can never happen. What really happens is that some system call (probably write(2) ...) fails, and the code doing the writing should handle that case.
Notice that FAT32 filesystems restrict the maximum size of a file to 4 GiB minus one byte. Use a better file system on your USB key if you want more (or split(1) large files into smaller chunks before copying them to your FAT32-formatted USB key).
If using <stdio.h>, notice that fflush(3), fprintf(3), fclose(3) (and most other standard I/O functions) can fail (e.g. because the write(2) they eventually perform fails).
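For example, here is a minimal sketch that checks every stdio call instead of assuming the writes worked (the output path is invented); on a full or size-limited file system the error would typically surface here as ENOSPC or EFBIG rather than as a silently broken file.

/* writecheck.c - write a lot of data and actually check for I/O errors. */
#include <errno.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    FILE *f = fopen("/mnt/usb/bigfile.dat", "w");    /* hypothetical path */
    if (!f) {
        perror("fopen");
        return 1;
    }

    char chunk[4096];
    memset(chunk, 'x', sizeof chunk);

    for (long i = 0; i < 2 * 1024 * 1024; i++) {      /* ~8 GiB in total */
        if (fwrite(chunk, 1, sizeof chunk, f) != sizeof chunk) {
            fprintf(stderr, "write failed after %ld chunks: %s\n",
                    i, strerror(errno));              /* e.g. ENOSPC or EFBIG */
            fclose(f);
            return 1;
        }
    }

    /* fclose flushes buffered data, so it can fail too. */
    if (fclose(f) != 0) {
        perror("fclose");
        return 1;
    }
    return 0;
}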
the console output was broken like a staircase
probably because your pseudoterminal was in some broken state. See stty(1), reset(1), termios(3) and read "The TTY demystified".
In Unix system call, stat() can succeed up to a 2GB(2^31 - 1)
You are misunderstanding stat(2). Read its documentation again.
Read Advanced Linux Programming then syscalls(2).
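As a side note on the 2 GB figure: with a 32-bit off_t, stat(2) does not quietly stop at 2 GB, it fails with EOVERFLOW on larger files. Compiling with large file support avoids that; a minimal sketch, assuming a 32-bit userland:

/* The define must come before any #include to get a 64-bit off_t on 32-bit builds. */
#define _FILE_OFFSET_BITS 64
#include <stdio.h>
#include <sys/stat.h>

int main(int argc, char **argv)
{
    if (argc < 2) {
        fprintf(stderr, "usage: %s <file>\n", argv[0]);
        return 1;
    }

    struct stat st;
    if (stat(argv[1], &st) != 0) {
        perror("stat");   /* on a 32-bit build without LFS this is EOVERFLOW for files over 2 GB */
        return 1;
    }
    printf("%s: %lld bytes\n", argv[1], (long long)st.st_size);
    return 0;
}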
I was able to create a 5GB file with vim
To understand the behavior of vim, first read its documentation, then study its source code (it is free software, so you can, and perhaps should, study its code).
You could also use strace(1) to understand what system calls are done by some command or process.

Why does inotify error when watching one directory but not another?

A program that uses inotify with IN_CREATE to watch a directory for file creation fails on some directories but works on others. For example, it works on /home/randomtroll/testdir but fails on /home/randomtroll; both have the same owner and permissions. When it fails, the read returns EINVAL. The inotify descriptor and watch have been created successfully; the buffer into which it reads is properly aligned and large enough to accommodate the data read.
The buffer into which I read the inotify descriptor was too small. I had made it large enough to accommodate the names of the files I was looking for; the creation of files with longer names caused the read error. One cannot limit how many bytes one reads from an inotify descriptor; the buffer has to be big enough for the next complete event. This seems like a bug to me.
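In other words, the read buffer has to be sized for the largest event the kernel can deliver, not for the names you expect. A minimal sketch of a watcher sized that way (the watched path is just the example from the question):

/* inotify_watch.c - watch a directory for file creation with a buffer
 * large enough for any single event, so read() does not fail with EINVAL. */
#include <limits.h>
#include <stdio.h>
#include <sys/inotify.h>
#include <unistd.h>

int main(void)
{
    int fd = inotify_init();
    if (fd < 0) { perror("inotify_init"); return 1; }

    if (inotify_add_watch(fd, "/home/randomtroll", IN_CREATE) < 0) {
        perror("inotify_add_watch");
        return 1;
    }

    /* One event is sizeof(struct inotify_event) plus up to NAME_MAX+1 bytes
       of file name; anything smaller can make read() fail with EINVAL. */
    char buf[sizeof(struct inotify_event) + NAME_MAX + 1]
        __attribute__((aligned(__alignof__(struct inotify_event))));

    for (;;) {
        ssize_t len = read(fd, buf, sizeof buf);
        if (len < 0) { perror("read"); return 1; }

        for (char *p = buf; p < buf + len; ) {
            struct inotify_event *ev = (struct inotify_event *)p;
            if (ev->len > 0)
                printf("created: %s\n", ev->name);
            p += sizeof(struct inotify_event) + ev->len;
        }
    }
}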

Core dump is created, but not written to a file?

I'm trying to get a core dump of a proprietary application running on an embedded Linux system, for which I wrote some plugins.
What I did was:
ulimit -c unlimited
echo "/tmp/cores/core.%e.%p.%h.%t" > /proc/sys/kernel/core_pattern
kill -3 <PID>
However, no core dump is created. '/tmp/cores' exists and is writable for everyone, and the disk has enough space available. When I try the same thing with sleep 100 & as an example process and then kill it, the core dump is created.
I tried the example for the pipe syntax from the core manpage, which writes some parameters and the size of the core dump into a file called core.info. This file IS created, and the size is greater than 0. So if the core dump is created, why isn't it written to /tmp/cores? To be sure, I also searched for core* on the file system - it's not there. dmesg doesn't show any errors (but it does if I pipe the core dump to an invalid program).
Some more info: The system is probably based on Debian, but I'm not quite sure. GDB is not available, as well as many other tools - there is only busybox for basic stuff.
The process I'm trying to debug is automatically restarted soon after being killed.
So, I guess one solution would be to modify the example program in order to write the dump to a file instead of just counting bytes. But why doesn't it work just normally if there obviously is some data?
If your proprietary application calls setrlimit(2) with RLIMIT_CORE set to 0, or if it is setuid, no core dump happens. See core(5). Perhaps use strace(1) to find out. And you could install gdb (perhaps by [cross-]compiling it). See also gcore(1).
Also, check (and maybe set) the limit in the invoking shell. With bash(1), use the ulimit builtin. Otherwise, cat /proc/self/limits should display the limits. If you don't have bash you could code a small wrapper in C calling setrlimit then execve ...
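That last suggestion, the small C wrapper, could look roughly like this minimal sketch (the target program path is whatever launches your proprietary application):

/* corewrap.c - raise RLIMIT_CORE to the hard limit, then exec the real
 * program, so the target inherits the core limit even without a shell. */
#include <stdio.h>
#include <sys/resource.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    if (argc < 2) {
        fprintf(stderr, "usage: %s <program> [args...]\n", argv[0]);
        return 1;
    }

    struct rlimit rl;
    if (getrlimit(RLIMIT_CORE, &rl) == 0) {
        rl.rlim_cur = rl.rlim_max;        /* raise soft limit to the hard limit */
        if (setrlimit(RLIMIT_CORE, &rl) != 0)
            perror("setrlimit");          /* keep going even if this fails */
    }

    execv(argv[1], &argv[1]);             /* replaces this process with the target */
    perror("execv");
    return 1;
}

Note that if the application itself later calls setrlimit(2) to turn core dumps off again (as the answer above suggests it might), the wrapper will not help; strace will show you whether it does.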

VFS: file-max limit 1231582 reached

I'm running a Linux 2.6.36 kernel, and I'm seeing some random errors. Things like
ls: error while loading shared libraries: libpthread.so.0: cannot open shared object file: Error 23
Yes, my system can't consistently run an 'ls' command. :(
I note several errors in my dmesg output:
# dmesg | tail
[2808967.543203] EXT4-fs (sda3): re-mounted. Opts: (null)
[2837776.220605] xv[14450] general protection ip:7f20c20c6ac6 sp:7fff3641b368 error:0 in libpng14.so.14.4.0[7f20c20a9000+29000]
[4931344.685302] EXT4-fs (md16): re-mounted. Opts: (null)
[4982666.631444] VFS: file-max limit 1231582 reached
[4982666.764240] VFS: file-max limit 1231582 reached
[4982767.360574] VFS: file-max limit 1231582 reached
[4982901.904628] VFS: file-max limit 1231582 reached
[4982964.930556] VFS: file-max limit 1231582 reached
[4982966.352170] VFS: file-max limit 1231582 reached
[4982966.649195] top[31095]: segfault at 14 ip 00007fd6ace42700 sp 00007fff20746530 error 6 in libproc-3.2.8.so[7fd6ace3b000+e000]
Obviously, the file-max errors look suspicious, being clustered together and recent.
# cat /proc/sys/fs/file-max
1231582
# cat /proc/sys/fs/file-nr
1231712 0 1231582
That also looks a bit odd to me, but the thing is, there's no way I have 1.2 million files open on this system. I'm the only one using it, and it's not visible to anyone outside the local network.
# lsof | wc
16046 148253 1882901
# ps -ef | wc
574 6104 44260
I saw some documentation saying:
file-max & file-nr:
The kernel allocates file handles dynamically, but as yet it doesn't free them again.
The value in file-max denotes the maximum number of file- handles that the Linux kernel will allocate. When you get lots of error messages about running out of file handles, you might want to increase this limit.
Historically, the three values in file-nr denoted the number of allocated file handles, the number of allocated but unused file handles, and the maximum number of file handles. Linux 2.6 always reports 0 as the number of free file handles -- this is not an error, it just means that the number of allocated file handles exactly matches the number of used file handles.
Attempts to allocate more file descriptors than file-max are reported with printk, look for "VFS: file-max limit reached".
My first reading of this is that the kernel basically has a built-in file descriptor leak, but I find that very hard to believe. It would imply that any system in active use needs to be rebooted every so often to free up the file descriptors. As I said, I can't believe this would be true, since it's normal to me to have Linux systems stay up for months (even years) at a time. On the other hand, I also can't believe that my nearly-idle system is holding over a million files open.
Does anyone have any ideas, either for fixes or further diagnosis? I could, of course, just reboot the system, but I don't want this to be a recurring problem every few weeks. As a stopgap measure, I've quit Firefox, which was accounting for almost 2000 lines of lsof output (!) even though I only had one window open, and now I can run 'ls' again, but I doubt that will fix the problem for long. (edit: Oops, spoke too soon. By the time I finished typing out this question, the symptom was/is back)
Thanks in advance for any help.
I hate to leave a question open, so a summary for anyone who finds this.
I ended up reposting the question on Server Fault instead.
They weren't able to come up with anything, actually, but I did some more investigation and ultimately found that it's a genuine bug with NFSv4, specifically the server-side locking code. I had an NFS client which was running a monitoring script every 5 seconds, using rrdtool to log some data to an NFS-mounted file. Every time it ran, it locked the file for writing, and the server allocated (but erroneously never released) an open file descriptor. That script (plus another that ran less frequently) resulted in about 900 open files consumed per hour, and two months later, it hit the limit.
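Incidentally, a slow leak rate like that is easy to confirm by sampling /proc/sys/fs/file-nr over time; a minimal sketch (the interval is arbitrary):

/* filenr_watch.c - sample /proc/sys/fs/file-nr periodically to watch
 * the allocated-handle count (the first of the three numbers) grow. */
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    for (;;) {
        FILE *f = fopen("/proc/sys/fs/file-nr", "r");
        if (!f) { perror("fopen"); return 1; }

        long allocated, unused, max;
        if (fscanf(f, "%ld %ld %ld", &allocated, &unused, &max) == 3)
            printf("allocated=%ld max=%ld\n", allocated, max);
        fclose(f);

        sleep(60);    /* arbitrary sampling interval */
    }
}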
Several solutions are possible:
1) Use NFSv3 instead.
2) Stop running the monitoring script.
3) Store the monitoring results locally instead of on NFS.
4) Wait for the patch to NFSv4 that fixes this (Bruce Fields actually sent me a patch to try, but I haven't had time)
I'm sure you can think of other possible solutions.
Thanks for trying.
