Python3 pathlib's Path.glob() generator keeps increasing memory usage when performed on large file structure - python-3.x

I used pathlib's Path(<path>).glob() function to walk through file directories and grab each file's name and extension. My Python script is meant to run on a large file system, so I tested it on the root directory of my Linux machine. After leaving it running for a few hours, I noticed that my machine's memory usage had increased by over a GB.
After profiling with memray and memory_profiler, I found that memory usage kept climbing as I looped over the items yielded by the generator.
Here's the problematic code (path is the path to the root directory):
from pathlib import Path

dir_items = Path(path).glob("**/*")
for item in dir_items:
    pass
Since I was using a generator, my expectation was that my memory requirements would remain constant throughout. I think I might have some fundamental misunderstanding. Can anyone explain where I've gone wrong?
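For anyone trying to reproduce the observation, here is a rough sketch of how the growth can be watched from inside the loop using the standard resource module (Linux-specific; ru_maxrss reports the peak resident set size in kilobytes, and path = "/" mirrors the root-directory test described above):

import resource
from pathlib import Path

path = "/"  # the root directory, as in the test described above

for i, item in enumerate(Path(path).glob("**/*")):
    if i % 100_000 == 0:
        # Peak resident set size so far, in kilobytes on Linux.
        print(i, resource.getrusage(resource.RUSAGE_SELF).ru_maxrss)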

Related

How do I read a large .conll file with python?

My attempts to read a very large file with pyconll and conllu keep running into memory errors. The file is 27 GB in size, and even using iterators to read it does not help. I'm using Python 3.7.
Both pyconll and conllu have iterative versions that use less memory at any given moment. If you call pyconll.load_from_file, it will try to read and parse the entire file into memory, and your machine most likely has much less than 27 GB of RAM. Instead, use pyconll.iter_from_file. This reads the sentences one by one and uses minimal memory, and you can extract what's needed from each sentence piecemeal.
If you need to do larger processing that requires having all of the information at once, supporting that kind of scenario is a bit outside the scope of either of these libraries.
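A minimal sketch of the iterative approach (the file name corpus.conllu is a placeholder; form and upos are the usual CoNLL-U token fields):

import pyconll

# Stream sentences one at a time instead of parsing the whole 27 GB file up front.
for sentence in pyconll.iter_from_file("corpus.conllu"):
    for token in sentence:
        # Extract only what is needed; each sentence can then be garbage collected.
        print(token.form, token.upos)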

Running python3 multiprocessing job with slurm makes lots of core.###### files. What are they?

So I have a Python 3 job that is run by Slurm. The job uses lots of multiprocessing, spawning about 20 or so worker processes. The code is far from perfect, uses lots of memory, and occasionally hits some unexpected data and throws an error. That in itself is not a problem; I don't need every one of the 20 processes to complete.
The issue is that sometimes something causes the program to create files named like core.356729 (the number after the dot changes), and these files are massive, on the order of gigabytes each. Eventually I end up with so many that I have no disk space left and all my jobs stop. I can't tell what they are; their contents are not human-readable. Google searches for "core files slurm" or "core.number files" turn up nothing relevant.
The quick and dirty solution would be just to add a process that deletes these files as soon as they appear. But I'd rather understand why they are being created first.
Does anyone know what would create a file of the format "core.######"? Is there a name for this type of file? Is there any way to identify which slurm job created the file?
Those are core dump files, used for debugging. They're essentially the contents of memory for the process that crashed. You can disable their creation with ulimit -c 0.
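If it is more convenient to disable them from inside the Python job itself (for example at the top of the script that Slurm launches), the standard resource module offers the equivalent of ulimit -c 0. A minimal sketch:

import resource

# Limit core dumps to 0 bytes for this process (and for workers forked after this point),
# so crashing processes no longer leave core.###### files behind.
resource.setrlimit(resource.RLIMIT_CORE, (0, 0))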

Is there a way to see how much memory a Python module takes?

In Python 3, is there a simple way to see how much memory is used when loading a module (not while running its contents, such as functions or methods, which may load data and so on)?
# Memory used before, in bytes
import mymodule
# Memory used after, in bytes
# Delta memory = memory used after - memory used before
(E.g. these three extra comment lines to insert are what I would call "simple".)
In the Spyder IDE, for example, the "File explorer" tab at the top right shows the size of the file containing my module (i.e. its size on disk), but I don't think that is the amount of memory taken up once Python has actually loaded its contents, along with the many imports I need in there.
And in the "Memory and Swap History" graph of the System Monitor (Ubuntu 18.04) I can see a little bump while my module is being loaded in Python (it will presumably get bigger as the module grows), which is probably the amount I'm looking for.
I would mainly use this inside the Spyder IDE, in a Jupyter notebook, or directly in a Python console.
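As a rough first approximation, those three comment lines can be filled in by comparing the process's resident set size before and after the import, for example with psutil (a sketch; psutil must be installed, mymodule is a stand-in, and the delta also includes anything else the interpreter happens to allocate in between):

import psutil

process = psutil.Process()              # the current Python process

rss_before = process.memory_info().rss  # resident set size before the import, in bytes

import mymodule                         # the module whose footprint is being estimated

rss_after = process.memory_info().rss   # resident set size after the import, in bytes
print("Approximate memory used by the import:", rss_after - rss_before, "bytes")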

Release disk space used by cgi.FieldStorage temp files

I am writing a Pyramid application that accepts many large file uploads (as POSTs). Similar to How can I serve temporary files from Python Pyramid, I'm having a problem where the temp files created by cgi.FieldStorage are orphaned, consuming GBs of disk space. lsof indicates that my WSGI process has deleted files from /tmp but the files haven't been closed. Restarting the application clears the orphans.
How can I cause these files to be closed so that the disk space is returned to the OS?
The problem I encountered turned out to be unrelated to cgi.FieldStorage; Pyramid actually uses WebOb for serializing request data.
The cause of the high disk space usage was pyramid_debugtoolbar. The toolbar states in its documentation that it maintains the data from the previous 100 requests, which took up a great deal of memory and disk space in my case. Removing the include for the debug toolbar from __init__.py and restarting the server resolved the problem.
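For reference, the include in question usually looks like the commented-out line below (a sketch of a typical Pyramid __init__.py; the toolbar can also be pulled in via pyramid.includes in the .ini file):

from pyramid.config import Configurator

def main(global_config, **settings):
    config = Configurator(settings=settings)
    # config.include('pyramid_debugtoolbar')  # removing this line stops the per-request data retention
    config.scan()
    return config.make_wsgi_app()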

How does the kernel handle new file creation?

I wish to understand how the kernel works when a user or application tries to create a file in a directory.
The background: we have a Java application which consumes messages over JMS, processes them, and then writes the XML to an outbound queue plus a local directory. Yesterday we observed unusual delays in writing to the directory. Running 'ls | wc -l' showed more than 300,000 files in there. A quick strace on the process was full of mutex calls (more than 3/4 of the calls in the trace were mutex-related).
So I thought that creating a new file was taking time because the system has to check certain things every time (e.g. the existing file names, to make sure a new file with a specific name can be created) among 300,000 files before creating the file.
I cleared the directory and the application returned to normal service levels.
My questions
Was my analysis correct? (It seems so, because the app started working fine after the clear-down.)
More importantly, how does the kernel work when you try to create a new file in a directory?
Can the abnormal number of mutex calls be attributed to the high number of files in the directory?
Many thanks
J
Please read about the Linux Filesystem, i-nodes and d-nodes.
http://en.wikipedia.org/wiki/Inode_pointer_structure
The file system is organized into fixed-sized blocks. If your directory is relatively small, it fits in the direct blocks and things are fast. If your directory is not too big, it fits in the direct blocks and some indirect blocks, and is still reasonably fast. If your directory becomes too big, it spills into double indirect blocks and becomes slow.
Actual sizes depend on file system and kernel configuration.
Rule of thumb is to keep the directory under 12 blocks, depending on your block size. Many systems use 8K blocks; a fast directory is under 98,304 bytes.
A file entry is something like 16*4 bytes in size (IIRC), so plan on no more than 1500 files per directory as a practical upper limit.
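For reference, the practical limit follows directly from those numbers:

block_size = 8192                    # "many systems use 8K blocks"
entry_size = 16 * 4                  # rough size of one directory entry, in bytes
fast_dir_bytes = 12 * block_size     # 12 direct blocks -> 98,304 bytes
print(fast_dir_bytes // entry_size)  # 1536, hence the ~1500-files rule of thumb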
Directories with large numbers of entries are often slow - how slow depends on the underlying filesystem.
The common solution is to create a hierarchy of directories, so each dir only has a few hundred entries.
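A common way to build that hierarchy is to fan files out into subdirectories keyed on a hash of the filename. A sketch (the base directory and filename below are made up for illustration; two hex characters per level gives 256 buckets per level):

import hashlib
from pathlib import Path

def bucketed_path(base_dir, filename):
    """Return a path like base_dir/ab/cd/filename so no single directory grows huge."""
    digest = hashlib.md5(filename.encode("utf-8")).hexdigest()
    subdir = Path(base_dir) / digest[:2] / digest[2:4]
    subdir.mkdir(parents=True, exist_ok=True)
    return subdir / filename

# Example: write an outbound XML message into its bucket instead of one flat directory.
target = bucketed_path("/var/spool/outbound", "message-123456.xml")
target.write_text("<message/>")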
The mutex calls you see are the result of the application itself (probably something in the JVM or the Java libraries) taking locks.
Synchronisation internal to the kernel is not visible via strace, since strace only shows the system calls themselves.
A directory with lots of files should not become inefficient if you are using a filesystem which uses directory indexes; most now do (ext3 does optionally but it's normally enabled nowadays).
Non-indexed directories (like those used on the bad old filesystems - ext2, vfat etc) get really bad with lots of files, and you'll see the "open" system call taking a lot longer.
