SYSTEM ERROR: I/O error 0 in writeto, ret 2048, file 56(/mfgtmp/tmp/srtE5yybD), addr 77010944. (290) - PROGRESS 4GL - linux

I am suddenly getting the error below after my Progress program has been running for more than 80 minutes. I think this is an OS error, and that error 0 means out of disk space. I checked the disk space and it shows 14 GB available, so I am not sure why I am getting this error.
Is it because a write ran out of disk space (exceeding 14 GB) and stopped, so that the available 14 GB stayed the same?
SYSTEM ERROR: I/O error 0 in writeto, ret 2048, file 56(/mfgtmp/tmp/srtE5yybD), addr 77010944. (290)

By default temp files are created "unlinked". Because of this, the space they were using is automatically reclaimed by the OS if the session crashes. So you will often have a situation where your temp file ran out of space, the session crashed, and then when you investigate there is plenty of free space.
You can change the default behavior by using the -t (lower case) startup parameter. This will result in the files not being removed if a session crashes - so the space will not be returned to the OS. You will have to manually delete "stale" files if you enable -t.
On UNIX -t will also make the files visible in the -T (upper case) directory so that you can see their growth in real time.
On Windows the files are always visible and the current length is not consistently reported by system tools.
If your temp files are being written to a different filesystem than your working directory (the -T startup parameter is where temp files go) then you should have a "protrace.pid" file corresponding to the crashed session's process id and the timestamp of the crash. This will then lead you to the 4gl code that was creating the very large srt file.
14GB is far beyond "reasonable" so you really should look at that code and see if there is a better way to do whatever it is doing.
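The "unlinked" behaviour described above can be reproduced outside Progress. The following is a minimal Python sketch, assuming a Linux or other POSIX system; the function name `unlinked_tempfile_demo` is made up for this illustration and is not how Progress itself creates its srt files.

```python
import os
import tempfile


def unlinked_tempfile_demo(directory=None, size=1024 * 1024):
    """Create a temp file, unlink it while it is open, and keep writing to it."""
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        os.unlink(path)             # the name disappears from the directory
        os.write(fd, b"\0" * size)  # the open file still consumes disk blocks
        info = os.fstat(fd)
        return {
            "visible": os.path.exists(path),  # False: nothing to see with ls
            "links": info.st_nlink,           # 0: no directory entry left
            "bytes": info.st_size,            # the data is still there
        }
    finally:
        os.close(fd)  # only now does the OS give the space back


if __name__ == "__main__":
    print(unlinked_tempfile_demo())
```

While such a file is open it counts against the filesystem's free space but cannot be seen in the directory. Once the process exits or crashes, the space reappears, which matches the "plenty of free space afterwards" symptom.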

There are a number of k-base articles on that issue, for instance: https://knowledgebase.progress.com/articles/Knowledge/000027351
When you check disk space, please make sure you're checking the correct file system (/mfgtmp in this case).
The error message references an srt file, so you might want to try to make the code rely less heavily on the srt file. See this article for some initial help: https://knowledgebase.progress.com/articles/Knowledge/P95930
Or: https://knowledgebase.progress.com/articles/Knowledge/P84475
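To make sure the right filesystem is being checked, the free space can be queried for the temp directory itself rather than for the working directory. This is a small Python sketch using only the standard library; the helper names `free_gib` and `same_filesystem` are invented for this example.

```python
import os
import shutil
import sys


def free_gib(path):
    """Free space, in GiB, on the filesystem that holds `path`."""
    return shutil.disk_usage(path).free / 2**30


def same_filesystem(path_a, path_b):
    """True if both paths live on the same mounted filesystem (same device id)."""
    return os.stat(path_a).st_dev == os.stat(path_b).st_dev


if __name__ == "__main__":
    # Pass the -T directory as the first argument, e.g. /mfgtmp/tmp
    temp_dir = sys.argv[1] if len(sys.argv) > 1 else "."
    print(f"{temp_dir}: {free_gib(temp_dir):.1f} GiB free")
    print("same filesystem as cwd:", same_filesystem(temp_dir, os.getcwd()))
```

If the two paths are on different filesystems, the free space reported for the working directory says nothing about the space available to the srt file.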

Related

Running python3 multiprocessing job with slurm makes lots of core.###### files. What are they?

So I have a python3 job that is being run by slurm. The python job uses lots of multiprocessing, generating about 20 or so threads. The code is far from perfect, uses lots of memory, and occasionally reaches some unexpected data and throws an error. That in itself is not a problem; I don't need every one of the 20 processes to complete.
The issue is that sometimes something is causing the program to create files named like core.356729, (the number after the dot changes), and these files are massive! Like GB of data. Eventually I end up with so many that I don't have any disk space left and all my jobs are stopped. I can't tell what they are, their contents are not human readable. Google searches for "core files slurm" or "core.number files" are not giving anything relevant.
The quick and dirty solution would be just to add a process that deletes these files as soon as they appear. But I'd rather understand why they are being created first.
Does anyone know what would create a file of the format "core.######"? Is there a name for this type of file? Is there any way to identify which slurm job created the file?
Those are core dump files, used for debugging. They're essentially the contents of memory for the process that crashed. You can disable their creation with ulimit -c 0.
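The same limit can also be lowered from inside the Python job, before the worker processes are started, using the standard `resource` module. This is a sketch of the idea rather than a drop-in fix; the function name `disable_core_dumps` is made up here, and it lowers only the soft limit, which child processes inherit.

```python
import resource


def disable_core_dumps():
    """Set the soft core-file size limit to 0 for this process and its children."""
    _soft, hard = resource.getrlimit(resource.RLIMIT_CORE)
    # Lowering the soft limit is always allowed; the hard limit is left as it was.
    resource.setrlimit(resource.RLIMIT_CORE, (0, hard))
    return resource.getrlimit(resource.RLIMIT_CORE)


if __name__ == "__main__":
    # Call this before creating any multiprocessing workers.
    print(disable_core_dumps())
```

This only suppresses the files. The crashes that produce them (typically a segfault or abort in native code) still happen, so keeping one core file around for inspection may be worthwhile.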

File contents lost after power outage

I am using C++ ofstream to write a log file on Linux. When I monitor the file contents with the tail -f command I can see the contents are correctly populated. But if a power outage happens and I check the file again after a power cycle, the last couple of lines of records are gone. With hexdump I can see those records have turned into null characters '\0' instead. I tried flush() and the manipulator std::endl and they don't help either.
Is it true that what tail showed me was not actually written to the disk and was just in a buffer? Was the inode table not updated before the power outage? I can accept this, but I don't understand why the records turned into null characters if they weren't written to the file.
By the way, I tried Google's glog and got the same results (a bunch of null characters at the end). I also tried zlog, a C library, and found it only lost the last records but didn't replace them with null chars.
Well, when you have a power outage and then start the system again, the Linux kernel replays the journal to detect and correct the inconsistencies between memory and disk left by the crash. Normally this means redoing and committing every operation that can be completed up to the crash, and undoing (and erasing) all data not committed at the time of the crash.
Linux (and other Un*x kernels, like FreeBSD) has a facility called ordered data write, which forces metadata (like block pointers from inodes, or directory entries) to be updated only after the actual data they point to has been written to disk, so inconsistencies are reduced to a minimum. I don't know the actual Linux implementation, but what you describe (a block of zeros in a file instead of the actual data written) is completely impossible with the FreeBSD kernel (well, you can do it on purpose, but not accidentally). Most probably Linux journals just the block information and not the file contents, or it updated the file size but not the data up to that point. This should not happen, as it's an already solved problem.
The other question is how much data you had written, and why what you saw on the screen doesn't appear after the system crash. You have probably heard of delayed write, which lets the kernel save disk writes on busy systems by not writing data to disk immediately, but waiting some time so that updates can be merged in memory buffers before they go to disk. Disk writes are, in any case, forced after some delay, which is 5 seconds in Linux (I'm going from memory; it's been a long time since I last checked that value, and I'm unsure between 5 and 30 seconds), so you can lose about your last five seconds at most.
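The delayed write described above is why flush() alone is not enough: it only moves data from the program's buffer into the kernel's cache. To ask the kernel to push a record to the device before continuing, the usual tool is fsync. Below is a minimal sketch in Python (the question uses C++, where the equivalent is calling fsync() on the underlying file descriptor); the function name `append_durably` is invented for this example, and it does not by itself explain the null bytes.

```python
import os


def append_durably(path, line):
    """Append one log line and wait until the kernel reports it as written out."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(line + "\n")
        f.flush()             # program buffer -> kernel page cache
        os.fsync(f.fileno())  # kernel page cache -> storage device


if __name__ == "__main__":
    append_durably("app.log", "service started")
```

Calling fsync after every record is slow, so a logger would typically do it only for important records or at intervals, accepting that the most recent lines can still be lost.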

Release disk space used by cgi.FieldStorage temp files

I am writing a pyramid application that accepts many large file uploads (as a POST). Similar to How can I serve temporary files from Python Pyramid, I'm having a problem where the temp files created by cgi.FieldStorage are orphaned, consuming GBs of disk space. lsof indicates that my wsgi process has deleted files from /tmp but the files haven't been closed. Restarting the application clears the orphans.
How can I cause these files to be closed so that the disk space is returned to the OS?
The problem I encountered was unrelated to cgi.FieldStorage; pyramid actually uses WebOb for serializing data.
The cause of the high disk space usage was pyramid_debugtoolbar. The debugger states in its documentation that it maintains the data from the previous 100 requests, which took up a great amount of memory and disk space in my case. Removing the include for the debugger from __init__.py and restarting the server resolved the problem.

centos free space on disk not updating

I am new to Linux and working with a CentOS system.
Running the command df -H shows the disk is 82% full, that is, only 15 GB is free.
I want some more space, so using WinSCP I shift-deleted a 15 GB file
and executed df -H once again, but it still shows 15 GB free. Where did the space from the deleted
file go?
Please help me find a solution to this.
In most unix filesystems, if a file is open, the OS will delete the file right away, but will not release the space until the file is closed. Why? Because the file is still accessible to the process that has it open.
Windows, on the other hand, used to complain that it couldn't delete a file because it was in use; it seems that in later incarnations Explorer will pretend to delete the file.
Some applications are famous for bad behavior related to this fact. For example, I have to deal with some versions of MySQL that will not properly close some files; over time I can find several GB of space wasted in /tmp.
You can use the lsof command to list open files (man lsof). If the problem is related to open files, and you can afford a reboot, that is most likely the easiest way to fix the problem.
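As an alternative to reading lsof output by eye, deleted-but-still-open files can be found by walking /proc. The sketch below is Linux-specific and relies on the kernel appending " (deleted)" to the link target of such descriptors; the function name `deleted_open_files` is made up for this example, and inspecting other users' processes needs root.

```python
import os


def deleted_open_files(pid="self"):
    """Return (fd, target, size) for open descriptors whose file was unlinked."""
    fd_dir = f"/proc/{pid}/fd"
    found = []
    for name in os.listdir(fd_dir):
        link = os.path.join(fd_dir, name)
        try:
            target = os.readlink(link)
            if target.endswith(" (deleted)"):
                size = os.stat(link).st_size  # follows the link to the open file
                found.append((int(name), target, size))
        except OSError:
            continue  # descriptor closed in the meantime, or permission denied
    return found


if __name__ == "__main__":
    for pid in filter(str.isdigit, os.listdir("/proc")):
        try:
            for fd, target, size in deleted_open_files(pid):
                print(pid, fd, size, target)
        except OSError:
            pass  # process exited or is not readable by this user
```

Restarting only the process that holds the large deleted file is usually enough to release the space, which avoids a full reboot.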

VFS: file-max limit 1231582 reached

I'm running a Linux 2.6.36 kernel, and I'm seeing some random errors. Things like
ls: error while loading shared libraries: libpthread.so.0: cannot open shared object file: Error 23
Yes, my system can't consistently run an 'ls' command. :(
I note several errors in my dmesg output:
# dmesg | tail
[2808967.543203] EXT4-fs (sda3): re-mounted. Opts: (null)
[2837776.220605] xv[14450] general protection ip:7f20c20c6ac6 sp:7fff3641b368 error:0 in libpng14.so.14.4.0[7f20c20a9000+29000]
[4931344.685302] EXT4-fs (md16): re-mounted. Opts: (null)
[4982666.631444] VFS: file-max limit 1231582 reached
[4982666.764240] VFS: file-max limit 1231582 reached
[4982767.360574] VFS: file-max limit 1231582 reached
[4982901.904628] VFS: file-max limit 1231582 reached
[4982964.930556] VFS: file-max limit 1231582 reached
[4982966.352170] VFS: file-max limit 1231582 reached
[4982966.649195] top[31095]: segfault at 14 ip 00007fd6ace42700 sp 00007fff20746530 error 6 in libproc-3.2.8.so[7fd6ace3b000+e000]
Obviously, the file-max errors look suspicious, being clustered together and recent.
# cat /proc/sys/fs/file-max
1231582
# cat /proc/sys/fs/file-nr
1231712 0 1231582
That also looks a bit odd to me, but the thing is, there's no way I have 1.2 million files open on this system. I'm the only one using it, and it's not visible to anyone outside the local network.
# lsof | wc
16046 148253 1882901
# ps -ef | wc
574 6104 44260
I saw some documentation saying:
file-max & file-nr:
The kernel allocates file handles dynamically, but as yet it doesn't free them again.
The value in file-max denotes the maximum number of file-handles that the Linux kernel will allocate. When you get lots of error messages about running out of file handles, you might want to increase this limit.
Historically, the three values in file-nr denoted the number of allocated file handles, the number of allocated but unused file handles, and the maximum number of file handles. Linux 2.6 always reports 0 as the number of free file handles -- this is not an error, it just means that the number of allocated file handles exactly matches the number of used file handles.
Attempts to allocate more file descriptors than file-max are reported with printk, look for "VFS: file-max limit reached".
My first reading of this is that the kernel basically has a built-in file descriptor leak, but I find that very hard to believe. It would imply that any system in active use needs to be rebooted every so often to free up the file descriptors. As I said, I can't believe this would be true, since it's normal to me to have Linux systems stay up for months (even years) at a time. On the other hand, I also can't believe that my nearly-idle system is holding over a million files open.
Does anyone have any ideas, either for fixes or further diagnosis? I could, of course, just reboot the system, but I don't want this to be a recurring problem every few weeks. As a stopgap measure, I've quit Firefox, which was accounting for almost 2000 lines of lsof output (!) even though I only had one window open, and now I can run 'ls' again, but I doubt that will fix the problem for long. (edit: Oops, spoke too soon. By the time I finished typing out this question, the symptom was/is back)
Thanks in advance for any help.
I hate to leave a question open, so a summary for anyone who finds this.
I ended up reposting the question on serverfault instead (this article)
They weren't able to come up with anything, actually, but I did some more investigation and ultimately found that it's a genuine bug with NFSv4, specifically the server-side locking code. I had an NFS client which was running a monitoring script every 5 seconds, using rrdtool to log some data to an NFS-mounted file. Every time it ran, it locked the file for writing, and the server allocated (but erroneously never released) an open file descriptor. That script (plus another that ran less frequently) resulted in about 900 open files consumed per hour, and two months later, it hit the limit.
Several solutions are possible:
1) Use NFSv3 instead.
2) Stop running the monitoring script.
3) Store the monitoring results locally instead of on NFS.
4) Wait for the patch to NFSv4 that fixes this (Bruce Fields actually sent me a patch to try, but I haven't had time)
I'm sure you can think of other possible solutions.
Thanks for trying.
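For anyone wanting to catch this kind of leak before the limit is hit, the three file-nr fields quoted in the question can be parsed and logged periodically. This is a minimal Python sketch; the function names are invented here, and the interpretation in the comments follows the kernel documentation quoted above.

```python
def parse_file_nr(text):
    """Split the contents of /proc/sys/fs/file-nr into its three fields."""
    allocated, unused, maximum = (int(field) for field in text.split())
    return {
        "allocated": allocated,          # file handles the kernel has allocated
        "unused": unused,                # allocated but free (0 on Linux 2.6)
        "maximum": maximum,              # same value as /proc/sys/fs/file-max
        "in_use": allocated - unused,
    }


def read_file_nr(path="/proc/sys/fs/file-nr"):
    """Read and parse the live counters (Linux only)."""
    with open(path) as f:
        return parse_file_nr(f.read())


if __name__ == "__main__":
    stats = read_file_nr()
    print(stats, "usage: {:.1%}".format(stats["in_use"] / stats["maximum"]))
```

A count that climbs steadily while the lsof line count stays flat, as in the question, suggests handles held inside the kernel rather than by any user process, which is consistent with the NFSv4 server-side leak found here.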
