How do I read a large .conll file with python? - python-3.x

My attempt to read a very large file with pyconll and conllu keeps running into memory errors. The file is 27Gb in size and even using iterators to read it does not help. I'm using python 3.7.

Both pyconll and conllu have iterative versions that use less memory at a given moment. If you call pyconll.load_from_file it is going to try to read and parse the entire file into memory and likely your machine has much less than 27Gb of RAM. Instead use pyconll.iter_from_file. This will read the sentences one by one and use minimal memory, and you can extract that's needed from those sentences piecemeal.
If you need to do some larger processing that requires having all information at once, it's a bit outside the scope of either of these libraries to support that type of scenario.

Related

external multithreading sort

I need to implement external multithreading sort. I dont't have experience in multithreading programming and now I'm not sure if my algorithm is good anoth also I don't know how to complete it. My idea is:
Thread reads next block of data from input file
Sort it using standart algorith(std::sort)
Writes it to another file
After this I have to merge such files. How should I do this?
If I wait untill input file will be entirely processed until merge
I recieve a lot of temporary files
If I try to merge file straight after sort, I can not come up with
an algorithm to avoid merging files with quite different sizes, which
will lead to O(N^2) difficulty.
Also I suppose this is a very common task, however I cannot find good prepared algoritm in the enternet. I would be very grateful for such a link especially for it's c++ implementation.
Well, the answer isn't that simple, and it actually depends on many factors, amongst them the number of items you wish to process, and the relative speed of your storage system and CPUs.
But the question is why to use multithreading at all here. Data too big to be held in memory? So many items that even a qsort algorithm can't sort fast enough? Take advantage of multiple processors or cores? Don't know.
I would suggest that you first write some test routines to measure the time needed to read and write the input file and the output files, as well as the CPU time needed for sorting. Please note that I/O is generally A LOT slower than CPU execution (actually they aren't even comparable), and I/O may not be efficient if you read data in parallel (there is one disk head which has to move in and out, so reads are in effect serialized - even if it's a digital drive it's still a device, with input and output channels). That is, the additional overhead of reading/writing temporary files may more than eliminate any benefit from multithreading. So I would say, first try making an algorithm that reads the whole file in memory, sorts it and writes it, and put in some time counters to check their relative speed. If I/O is some 30% of the total time (yes, that little!), it's definitely not worth, because with all that reading/merging/writing of temporary files, this will rise a lot more, so a solution processing the whole data at once would rather be preferable.
Concluding, don't see why use multithreading here, the only reason imo would be if data are actually delivered in blocks, but then again take into account my considerations above, about relative I/O-CPU speeds and the additional overhead of reading/writing the temporary files. And a hint, your file accessing must be very efficient, eg reading/writing in larger blocks using application buffers, not one by one (saves on system calls), otherwise this may have a detrimental effect if the file(s) are stored on a machine other than yours (eg a server).
Hope you find my suggestions useful.

How much ram do different python objects use?

I want to optimize my program as it consumes a lot of ram and stores a lot of data. To decide which datatype to use I need to know how much RAM a single python objects uses. How can I determine those sizes.
Just use sys.getsizeof().
If you want the size of a container pluss all of its content, then the doc even has a link to a recipe to do that.

Valgrind hangs when reading large HDF5 dataset in Fortran

I have an application written in Fortran which makes use of parallel HDF5 for input / output.
A matching post-processing code is used to read its output, in the form of a *.h5 file, and process it.
When I try to use valgrind to check for memory leaks, however, it stalls when reading large datasets.
More exactly, the stalling occurs at a call to H5Dread_f for large datasets, for example 1069120 doubles (where doubles are defined as H5kind_to_type(REAL64,H5_REAL_KIND)), whereas for smaller ones it is okay.
I tried recompiling the HDF5 library using --enable-using-memchecker, as described here, but it didn't help.
Does anybody have more experience with this?
I found the cause / solution: Due to an error of mine, the chunk size I used in these HDF5 routines was often only 1 byte, which is of course much too low.
Fixing this also makes valgrind much much faster and usuable.

How to parallelize file reading and writing

I have a program which reads data from 2 text files and then save the result to another file. Since there are many data to be read and written which cause a performance hit, I want to parallize the reading and writing operations.
My initial thought is, use 2 threads as an example, one thread read/write from the beginning, and another thread read/write from the middle of the file. Since my files are formatted as lines, not bytes(each line may have different bytes of data), seek by byte does not work for me. And the solution I could think of is use getline() to skip over the previous lines first, which might be not efficient.
Is there any good way to seek to a specified line in a file? or do you have any other ideas to parallize file reading and writing?
Environment: Win32, C++, NTFS, Single Hard Disk
Thanks.
-Dbger
Generally speaking, you do NOT want to parallelize disk I/O. Hard disks do not like random I/O because they have to continuously seek around to get to the data. Assuming you're not using RAID, and you're using hard drives as opposed to some solid state memory, you will see a severe performance degradation if you parallelize I/O(even when using technologies like those, you can still see some performance degradation when doing lots of random I/O).
To answer your second question, there really isn't a good way to seek to a certain line in a file; you can only explicitly seek to a byte offset using the read function(see this page for more details on how to use it.
Queuing multiple reads and writes won't help when you're running against one disk. If your app also performed a lot of work in CPU then you could do your reads and writes asynchronously and let the CPU work while the disk I/O occurs in the background. Alternatively, get a second physical hard drive: read from one, write to the other. For modestly sized data sets that's often effective and quite a bit cheaper than writing code.
This isn't really an answer to your question but rather a re-design (which we all hate but can't help doing). As already mentioned, trying to speed up I/O on a hard disk with multiple threads probably won't help.
However, it might be possible to use another approach depending on data sensitivity, throughput needs, data size, etc. It would not be difficult to create a structure in memory that maintains a picture of the data and allows easy/fast updates of the lines of text anywhere in the data. You could then use a dedicated thread that simply monitors that structure and whose job it is to write the data to disk. Writing data sequentially to disk can be extremely fast; it can be much faster than seeking randomly to different sections and writing it in pieces.

Linux/perl mmap performance

I'm trying to optimize handling of large datasets using mmap. A dataset is in the gigabyte range. The idea was to mmap the whole file into memory, allowing multiple processes to work on the dataset concurrently (read-only). It isn't working as expected though.
As a simple test I simply mmap the file (using perl's Sys::Mmap module, using the "mmap" sub which I believe maps directly to the underlying C function) and have the process sleep. When doing this, the code spends more than a minute before it returns from the mmap call, despite this test doing nothing - not even a read - from the mmap'ed file.
Guessing, I though maybe linux required the whole file to be read when first mmap'ed, so after the file had been mapped in the first process (while it was sleeping), I invoked a simple test in another process which tried to read the first few megabytes of the file.
Suprisingly, it seems the second process also spends a lot of time before returning from the mmap call, about the same time as mmap'ing the file the first time.
I've made sure that MAP_SHARED is being used and that the process that mapped the file the first time is still active (that it has not terminated, and that the mmap hasn't been unmapped).
I expected a mmapped file would allow me to give multiple worker processes effective random access to the large file, but if every mmap call requires reading the whole file first, it's a bit harder. I haven't tested using long-running processes to see if access is fast after the first delay, but I expected using MAP_SHARED and another separate process would be sufficient.
My theory was that mmap would return more or less immediately, and that linux would load the blocks more or less on-demand, but the behaviour I am seeing is the opposite, indicating it requires reading through the whole file on each call to mmap.
Any idea what I'm doing wrong, or if I've completely misunderstood how mmap is supposed to work?
Ok, found the problem. As suspected, neither linux or perl were to blame. To open and access the file I do something like this:
#!/usr/bin/perl
# Create 1 GB file if you do not have one:
# dd if=/dev/urandom of=test.bin bs=1048576 count=1000
use strict; use warnings;
use Sys::Mmap;
open (my $fh, "<test.bin")
|| die "open: $!";
my $t = time;
print STDERR "mmapping.. ";
mmap (my $mh, 0, PROT_READ, MAP_SHARED, $fh)
|| die "mmap: $!";
my $str = unpack ("A1024", substr ($mh, 0, 1024));
print STDERR " ", time-$t, " seconds\nsleeping..";
sleep (60*60);
If you test that code, there are no delays like those I found in my original code, and after creating the minimal sample (always do that, right!) the reason suddenly became obvious.
The error was that I in my code treated the $mh scalar as a handle, something which is light weight and can be moved around easily (read: pass by value). Turns out, it's actually a GB long string, definitively not something you want to move around without creating an explicit reference (perl lingua for a "pointer"/handle value). So if you need to store in in a hash or similar, make sure you store \$mh, and deref it when you need to use it like ${$hash->{mh}}, typically as the first parameter in a substr or similar.
If you have a relatively recent version of Perl, you shouldn't be using Sys::Mmap. You should be using PerlIO's mmap layer.
Can you post the code you are using?
On 32-bit systems the address space for mmap()s is rather limited (and varies from OS to OS). Be aware of that if you're using multi-gigabyte files and your are only testing on a 64-bit system. (I would have preferred to write this in a comment but I don't have enough reputation points yet)
one thing that can help performance is the use of 'madvise(2)'. probably most easily
done via Inline::C. 'madvise' lets you tell the kernel what your access pattern will be like (e.g. sequential, random, etc).
If I may plug my own module: I'd advice using File::Map instead of Sys::Mmap. It's much easier to use, and is less crash-prone than Sys::Mmap.
That does sound surprising. Why not try a pure C version?
Or try your code on a different OS/perl version.
See Wide Finder for perl performance with mmap. But there is one big pitfall. If your dataset will be on classical HD and you will read from multiple processes, you can easily fall in random access and your IO will fall down to unacceptable values (20~40 times).
Ok, here's another update. Using Sys::Mmap or PerlIO's ":mmap" attribute both works fine in perl, but only up to 2 GB files (the magic 32 bit limit). Once the file is more than 2 GB, the following problems appear:
Using Sys::Mmap and substr for accessing the file, it seems that substr only accepts a 32 bit int for the position parameter, even on systems where perl supports 64 bit. There's at least one bug posted about it:
#62646: Maximum string length with substr
Using open(my $fh, "<:mmap", "bigfile.bin"), once the file is larger than 2 GB, it seems perl will either hang/or insist on reading the whole file on the first read (not sure which, I never ran it long enough to see if it completed), leading to dead slow performance.
I haven't found any workaround to either of these, and I'm currently stuck with slow file (non mmap'ed) operations for working on these files. Unless I find a workaround I may have to implement the processing in C or another higher level language that supports mmap'ing huge files better.
Your access to that file had better be well random to justify a full mmap. If your usage isn't evenly distributed, you're probably better off with a seek, read to a freshly malloced area and process that, free, rinse and repeat. And work with chunks of multiples of 4k, say 64k or so.
I once benchmarked a lot string pattern matching algorithms. mmaping the entire file was slow and pointless. Reading to a static 32kish buffer was better, but still not particularly good. Reading to freshly malloced chunk, processing that and then letting it go allows kernel to work wonders under the hood. The difference in speed was enormous, but then again pattern matching is very fast complexitywise and more emphasis must be put on handling efficiency than perhaps is usually needed.

Resources