Running python3 multiprocessing job with slurm makes lots of core.###### files. What are they?

So I have a python3 job that is being run by slurm. The python job uses lots of multiprocessing, spawning about 20 or so processes. The code is far from perfect, uses lots of memory, and occasionally hits some unexpected data and throws an error. That in itself is not a problem; I don't need every one of the 20 processes to complete.
The issue is that sometimes something causes the program to create files named like core.356729 (the number after the dot changes), and these files are massive, gigabytes each. Eventually I end up with so many that I have no disk space left and all my jobs are stopped. I can't tell what they are, and their contents are not human readable. Google searches for "core files slurm" or "core.number files" are not giving anything relevant.
The quick and dirty solution would be just to add a process that deletes these files as soon as they appear. But I'd rather understand why they are being created first.
Does anyone know what would create a file of the format "core.######"? Is there a name for this type of file? Is there any way to identify which slurm job created the file?

Those are core dump files, used for debugging. Each one is essentially the contents of memory of the process that crashed. You can disable their creation by running ulimit -c 0 in the shell that starts the program (for example, in the job script).
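If changing the job script is inconvenient, the same limit can be set from inside the Python program. The following is a minimal sketch, assuming a Unix system where the standard-library resource module is available; the function name disable_core_dumps is made up for illustration.

```python
import resource

def disable_core_dumps():
    """Set the soft core-file size limit to 0, like `ulimit -S -c 0`."""
    # Keep the existing hard limit so the soft limit can be raised again later.
    _soft, hard = resource.getrlimit(resource.RLIMIT_CORE)
    resource.setrlimit(resource.RLIMIT_CORE, (0, hard))

disable_core_dumps()
print(resource.getrlimit(resource.RLIMIT_CORE)[0])  # soft limit after the call
```

Resource limits are inherited by child processes, so calling this once in the parent before the multiprocessing workers are started should cover them as well.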

Related

disk usage increasing indefinitely with php script

I am using the following code to create backups of the php variables.
if(file_exists(old_backup.txt))
unlink('old_backup.txt');
copy('new_backup.txt', 'old_backup.txt');
$content = serialize($some_ar);
file_put_contents('new_backup.txt', $content);
new_backup.txt will hold a dump of the current variables, and old_backup.txt will hold a dump from some time earlier.
The dump size is constant, around 300 MB. But every time the above code runs, disk usage increases, seemingly without limit. When the PHP script is killed, disk usage returns to normal.
I am not sure where a file handle is still being held open for the deleted files.
How do I make the above code work without such an increase in disk usage?
Not sure about what exactly is causing the disk usage increase, because you posted only a snippet and not the full script. However there are a few things that are not correct for sure:
if(file_exists(old_backup.txt))
should be
if(file_exists('old_backup.txt'))
Also, the mere existence of the file does not mean you can unlink it; you should check permissions too.
That being said, those aren't good reasons to fill the disk, but we need to see where you get the $some_ar variable from to give better advice.

SLURM / Sbatch creates many small output files

I am running a pipeline on a SLURM cluster, and for some reason a lot of small files (between 500 and 2000 bytes in size) are being created, named along the lines of slurm-XXXXXX.out (where XXXXXX is a number). I've tried to find out what these files are on the SLURM website, but I can't find any mention of them. I assume they are some sort of in-progress files that the system uses while parsing my pipeline?
If it matters, the pipeline I'm running uses snakemake. I know I've seen these types of files before without snakemake, but they weren't a big problem back then. I'm afraid that clearing these files from the working directory after each step of the workflow will interrupt in-progress steps, so I'm not doing anything with them at the moment.
What are these files, and how can I suppress their output or, alternatively, delete them after their corresponding job is finished? Did I mess up my workflow somehow, and that's why they are created?
You might want to take a look at the sbatch documentation. The files that you are referring to are essentially SLURM logs as explained there:
By default both standard output and standard error are directed to a
file of the name "slurm-%j.out", where the "%j" is replaced with the
job allocation number.
You can change the filename with the --error=<filename pattern> and --output=<filename pattern> command line options. The filename_pattern can have one or more symbols that will be replaced as explained in the documentation. According to the FAQs, you should be able to suppress standard output and standard error by using the following command line options:
sbatch --output=/dev/null --error=/dev/null [...]
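When jobs are submitted from a script rather than typed by hand, the same two options can be assembled programmatically. This is only a sketch: the helper name build_sbatch_command and the script name job.sh are hypothetical, and only the --output and --error options quoted above are used.

```python
def build_sbatch_command(script, output="/dev/null", error="/dev/null"):
    """Return the sbatch argument list with stdout/stderr redirected."""
    # --output and --error accept a filename pattern; "%j" in the pattern
    # is replaced with the job allocation number.
    return ["sbatch", f"--output={output}", f"--error={error}", script]

# Discard both streams (job.sh is a placeholder script name).
print(build_sbatch_command("job.sh"))
# Or keep the logs, but in a dedicated directory.
print(build_sbatch_command("job.sh", output="logs/slurm-%j.out", error="logs/slurm-%j.err"))
```

The resulting list can be passed to subprocess.run on a machine where sbatch is installed.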

Node .fs Working with a HUGE Directory

Picture a directory with a ton of files. As a rough gauge of magnitude I think the most that we've seen so far is a couple of million but it could technically go another order higher. Using node, I would like to read files from this directory, process them (upload them, basically), and then move them out of the directory. Pretty simple. New files are constantly being added while the application is running, and my job (like a man on a sinking ship holding a bucket) is to empty this directory as fast as it's being filled.
So what are my options? fs.readdir is not ideal, it loads all of the filenames into memory which becomes a problem at this kind of scale. Especially as new files are being added all the time and so it would require repeated calls. (As an aside for anybody referring to this in the future, there is something being proposed to address this whole issue which may or may not have been realised within your timeline.)
I've looked at the myriad of fs drop-ins (graceful-fs, chokidar, readdirp, etc.), none of which have this particular use case within their remit.
I've also come across a couple of people suggesting that this can be handled with child_process, and there's a wrapper called inotifywait which tasks itself with exactly what I am asking but I really don't understand how this addresses the underlying problem, especially at this scale.
I'm wondering if what I really need to do is find a way to just get the first file (or, realistically, batch of files) from the directory without having the overhead of reading the entire directory structure into memory. Some sort of stream that could be terminated after a certain number of files had been read? I know Go has a parameter for reading the first n files from a directory but I can't find a node equivalent, has anybody here come across one or have any interesting ideas? Left-field solutions more than welcome at this point!
You can use your operating system's file-listing command and stream its output into Node.js.
For example in Linux:
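var cp = require('child_process');
// spawn streams stdout as it is produced; exec buffers it, subject to its maxBuffer limit
var stdout = cp.spawn('ls').stdout;
stdout.setEncoding('utf8');
stdout.on('data', function (chunk) {
  // each chunk is a string holding part of the listing
  console.log(chunk);
});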
RunKit: https://runkit.com/aminanadav/57da243180f3bb140059a31d
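For comparison only (this is Python, not Node), the "first batch of files" idea from the question can be sketched with a lazy directory iterator. It assumes Python's os.scandir, which yields entries as they are read instead of building the full list first; the function name first_batch is made up for this example.

```python
import os
from itertools import islice

def first_batch(directory, batch_size):
    """Return up to batch_size regular-file paths from directory."""
    # os.scandir yields entries lazily, so iteration stops as soon as
    # batch_size files have been collected.
    with os.scandir(directory) as entries:
        files = (entry.path for entry in entries if entry.is_file())
        return list(islice(files, batch_size))

print(first_batch(".", 5))
```

Each batch would be processed and moved out of the directory before the next call, so repeated calls keep seeing files that have not been handled yet.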

How to open and read 1000s of files very quickly

My problem is that my application takes too long to load thousands of files. Yes, I know it's going to take a long time, but I would like to make it faster by any amount of time. What I mean by "load" is to open the file to get its descriptor and then read the first 100 bytes or so of it.
So, my main strategy has been to create a second thread that will open and close (without reading any contents) all the files. This seems to help because the thread runs ahead of the main thread and I'm guessing the OS is caching these file descriptors ahead of time so that when my main thread opens them it's a quick open. This has actually helped because the thread can start caching these file descriptors while my main thread is parsing the data read in from these files.
So my real question is...what else can I do to make this faster? What approaches are there? Has anyone had success doing this?
I've heard of OS prefetching calls, but those were for virtual memory pages. Is there a way to tell the OS: hey, I'm going to be needing all these files pretty soon, so I suggest you start gathering them for me ahead of time? My lookahead thread is pretty crude.
Are there low-level disk techniques I could use? Is there possibly a pattern of file access that would help? Right now, the files that are loaded all come from the same folder. I suppose there is no way to determine where exactly on disk they lie and which ordering of file opens would be fastest for the disk. I'm also guessing that the disk has some hardware to make this as efficient as possible too.
My application is mainly for windows, but unix suggestions would help as well.
I am programming in C++ if that makes a difference.
Thanks,
-julian
My first thought is that this is going to be hard to work around from a programmatic level.
You'll find Linux and OSX can access thousands of files like this in a fraction of the time it takes Windows. I don't know how much control you have over the machine. If you can keep the thousands of files on a FAT partition, you should see better results than with NTFS.
How often are you scanning these files, and how often are they changing? If the ratio is heavily on the reading side, it would make sense to copy the start of each file into a cache. The cache could store the filename, modification time, and first 100 bytes of each of the thousands of files.
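The cache described above could look roughly like the following. This is a sketch in Python rather than C++, kept in memory for brevity (a real one would probably be persisted to a single file so it survives restarts); the class name HeaderCache and the disk_reads counter are illustrative assumptions.

```python
import os

HEADER_SIZE = 100  # bytes kept per file, matching the question

class HeaderCache:
    """Cache the first bytes of each file, keyed by path and modification time."""

    def __init__(self):
        self._entries = {}   # path -> (mtime in ns, header bytes)
        self.disk_reads = 0  # how many times a file was actually opened

    def get(self, path):
        mtime = os.stat(path).st_mtime_ns
        cached = self._entries.get(path)
        if cached is not None and cached[0] == mtime:
            return cached[1]  # unchanged since last read: no open needed
        with open(path, "rb") as f:
            header = f.read(HEADER_SIZE)
        self.disk_reads += 1
        self._entries[path] = (mtime, header)
        return header
```

A file is reopened only when its modification time differs from the cached one, so a mostly-read workload pays one stat call per file instead of an open and a read.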

Multiple Machines -- Process Many Files Concurrently?

I need to concurrently process a large number of files (thousands of different files, with an average size of 2MB per file).
All the information is stored on one (1.5TB) network hard drive, and will be accessed (read) by about 30 different machines. For efficiency, each machine will be reading (and processing) different files (there are thousands of files that need to be processed).
Every machine -- following its reading of a file from the 'incoming' folder on the 1.5TB hard drive -- will process the information and then output the processed information back to the 'processed' folder on the 1.5TB drive. The processed information for every file is roughly the same size as the input file (about 2MB per file).
Are there any 'dos' and 'don'ts' when one is building such an operation? Is it a problem to have 30 machines or so read (or write) information to the same network drive at the same time?
(note: existing files will only be read, not appended/written; new files will be created from scratch, so there are no issues of multiple access to the same file...).
Are there any bottlenecks that I should expect?
(I am using Linux, Ubuntu 10.04 LTS, on all machines, if it matters at all.)
Things you should think about:
If the processing to be done for each file is simple, then your real bottleneck isn't the number of files you read in parallel, but the capabilities of the hard disk drive.
Unless processing takes a long time (say, some seconds per file) you'll go past a point in which adding more processes will only slow down matters to a crawl, since every process is reading and writing results, and the disk can only do so much.
Try to minimize disk access: for example, download files and produce results locally while other processes are downloading, and send the results back when the load on the disk goes down.
The more I write the more it boils down to how much processing needs to be done for each file. If it's simple parsing, something that takes milliseconds, 1 machine or 30 will make little difference.
You need to be careful that two worker processes don't pick up (and try to do) the same piece of work at the same time.
Unfortunately, NFS filesystems don't have semantics that allow you to easily do that.
So what I'd recommend is to use something like Gearman and a producer/consumer model, where one process gives out work to whoever is available to do it.
Another possibility is to have a database (e.g. mysql) with a table of all tasks, and have the processes atomically "claim" tasks for themselves.
But all of this is only worthwhile if your processes are mostly CPU-bound. If you're trying to get more IO bandwidth (or operations) out of your NAS by using multiple clients, it's not going to work.
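The database approach of atomically claiming tasks can be sketched as follows. This uses Python's built-in sqlite3 purely as a stand-in for the MySQL server mentioned above; the table layout and the function names create_task_table and claim_task are assumptions made for illustration.

```python
import sqlite3

def create_task_table(conn, filenames):
    """Create one unclaimed task row per input file."""
    conn.execute("CREATE TABLE tasks (filename TEXT PRIMARY KEY, owner TEXT)")
    conn.executemany("INSERT INTO tasks (filename) VALUES (?)",
                     [(name,) for name in filenames])
    conn.commit()

def claim_task(conn, worker):
    """Return a filename now owned by worker, or None if nothing is left."""
    while True:
        row = conn.execute(
            "SELECT filename FROM tasks WHERE owner IS NULL "
            "ORDER BY filename LIMIT 1").fetchone()
        if row is None:
            return None
        # The UPDATE succeeds only if the row is still unclaimed, so two
        # workers cannot both end up owning the same file.
        cur = conn.execute(
            "UPDATE tasks SET owner = ? WHERE filename = ? AND owner IS NULL",
            (worker, row[0]))
        conn.commit()
        if cur.rowcount == 1:
            return row[0]
        # Another worker claimed it first; try the next unclaimed row.
```

The key design choice is the "AND owner IS NULL" condition: the claim is decided by a single conditional UPDATE and its affected-row count, not by the earlier SELECT.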
I am assuming that you will be running at least gigabit ethernet here (or it's probably not worth it).
Have you tried running multiple processes on the same machine?
