Read and write a very big text file using OpenMP - io

I have a very big text file (greater than 90 GB) that I have to read. I have tried reading it sequentially, but as expected it takes a lot of time. If I use OpenMP to do the same, is that feasible? I am pretty new to OpenMP.

Related

How do I read from a wordlist concurrently with multiple threads efficiently?

I am building an application that requires me to read a really big wordlist (a txt file containing strings separated by newline characters) without loading it entirely into memory, as it could be hundreds of MBs large.
The application would have anywhere between one and hundreds of threads that read from that file and independently execute a process based on the words/strings read.
My current solution uses concurrent.futures: I read through the file once to load the contents into memory along with the corresponding line count. I would then split the work based on that line count and the number of threads the user requested for this process.
I understand that I can read from disk using the read and seek methods, but I am more interested in the logical steps involved in doing this.
How do I make this process more efficient?
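For reference, here is a minimal sketch of the splitting approach described above, assuming a plain newline-separated wordlist and a hypothetical process_word step standing in for the real per-word work:

    from concurrent.futures import ThreadPoolExecutor

    def process_word(word):
        # Hypothetical stand-in for the real per-word processing.
        return len(word)

    def run(path, num_threads=4):
        # One pass over the file to load the words and learn the line count.
        with open(path, encoding="utf-8") as f:
            words = f.read().splitlines()

        # Split the lines into one contiguous chunk per requested thread.
        chunk_size = max(1, (len(words) + num_threads - 1) // num_threads)
        chunks = [words[i:i + chunk_size] for i in range(0, len(words), chunk_size)]

        # Each worker processes its own chunk independently.
        with ThreadPoolExecutor(max_workers=num_threads) as pool:
            results = pool.map(lambda c: [process_word(w) for w in c], chunks)

        return [r for part in results for r in part]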

How do I read a large .conll file with Python?

My attempt to read a very large file with pyconll and conllu keeps running into memory errors. The file is 27 GB in size and even using iterators to read it does not help. I'm using Python 3.7.
Both pyconll and conllu have iterative versions that use less memory at a given moment. If you call pyconll.load_from_file it is going to try to read and parse the entire file into memory, and your machine likely has much less than 27 GB of RAM. Instead use pyconll.iter_from_file. This will read the sentences one by one and use minimal memory, and you can extract what's needed from those sentences piecemeal.
If you need to do some larger processing that requires having all information at once, it's a bit outside the scope of either of these libraries to support that type of scenario.
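A minimal sketch of the iterative approach (the file name is just a placeholder):

    import pyconll

    # Stream the corpus one sentence at a time instead of parsing all 27 GB at once.
    for sentence in pyconll.iter_from_file('corpus.conll'):
        for token in sentence:
            # Extract only what is needed from each sentence, e.g. form and UPOS tag.
            print(token.form, token.upos)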

Reading multiple lines from a text file

I need to read and process large text files. I currently read one line at a time and process it synchronously. I need to improve performance, and I realise that disk access is a bottleneck. I want to refactor so that a disk-reading thread puts data on a queue, waiting to be processed by multiple processing threads. My concern is that by only reading one line at a time I might not be able to supply the data to the processing threads fast enough. Is there a way to read multiple lines at a time? I need to make sure that I don't break any words, as the processing is based on words.
Whereas your program is reading one line at a time, the runtime library is reading large blocks of data from the file and then parsing the lines from a memory buffer. So when you read the first line of the file, what really happens is that the runtime library loads a large buffer, scans it to find the end of the first line, and returns that line to you. The next time you ask for a line, the runtime library doesn't have to read, but rather just find the end of the next line.
How large that buffer is depends on the runtime library, and possibly on how you initialize the file.
In addition, the file system likely maintains an even larger buffer. Your runtime library, for example, might have a 4 kilobyte file buffer, and the operating system might be buffering the input file in 64 kilobyte blocks.
In short, you probably don't need to do anything special to optimize reading the text file. You could perhaps specify a larger file buffer, and in some cases I've seen that help. Other than that, it's not worth worrying about.
Unless you have an especially fast disk subsystem, a typical developer's machine will sustain between 50 and 100 megabytes per second if you're sequentially reading line by line. In most text processing applications, that's going to be your limiting factor.
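If you do want to experiment with a larger buffer, here is a minimal sketch in Python (the 8 MiB size is an arbitrary example to tune against your own disk and workload):

    # Stream the file line by line through a larger (8 MiB) buffer instead of the default.
    def count_words(path):
        total = 0
        with open(path, 'r', buffering=8 * 1024 * 1024, encoding='utf-8') as f:
            for line in f:
                total += len(line.split())  # stand-in for the real per-line processing
        return total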

Multithreading and reading from one file (Perl)

Hi, sharp minds!
I need your expert guidance in making some choices.
Situation is like this:
1. I have approx. 500 flat files containing from 100 to 50000 records that have to be processed.
2. Each record in the files mentioned above has to be replaced using a value from a separate huge file (2-15 GB) containing 100-200 million entries.
So I thought to do the processing using multiple cores - one file per thread/fork.
Is that a good idea, given that each thread needs to read from the same huge file? Loading it into memory is a bit of a problem due to its size. Using Tie::File is an option, but does that work with threads/forks?
I need your advice on how to proceed.
Thanks
Yes, of course, using multiple cores for a multi-threaded application is a good idea, because that's what those cores are for. Though it sounds like your problem involves heavy I/O, so it might be that you will not use that much CPU anyway.
Also, since you are only going to read that big file, tie should work perfectly; I haven't heard of problems with that. But if you are going to scan that big file for each record in your smaller files, then I guess it would take a long time regardless of the number of threads you use. If the data in the big file can be indexed by some key, then I would advise putting it in some NoSQL database and accessing it from your program. That would probably speed up your task even more than using multiple threads/cores.
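To illustrate the keyed-lookup idea, here is a minimal sketch in Python using the standard dbm key-value store rather than a particular NoSQL product; the tab-separated record format is an assumption about the data:

    import dbm

    # One-time pass: index the huge lookup file by key in an on-disk key-value store.
    def build_index(big_file_path, index_path='lookup.db'):
        with dbm.open(index_path, 'n') as db, \
                open(big_file_path, encoding='utf-8') as f:
            for line in f:
                key, value = line.rstrip('\n').split('\t', 1)
                db[key] = value

    # Workers then look records up by key instead of scanning the 2-15 GB file each time.
    def lookup(key, index_path='lookup.db'):
        with dbm.open(index_path, 'r') as db:
            k = key.encode('utf-8')
            return db[k].decode('utf-8') if k in db else None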

How to parallelize file reading and writing

I have a program which reads data from 2 text files and then saves the result to another file. Since there is a lot of data to be read and written, which causes a performance hit, I want to parallelize the reading and writing operations.
My initial thought is to use 2 threads as an example: one thread reads/writes from the beginning of the file, and the other reads/writes from the middle. Since my files are formatted as lines, not bytes (each line may have a different number of bytes of data), seeking by byte does not work for me. The only solution I could think of is to use getline() to skip over the previous lines first, which is probably not efficient.
Is there any good way to seek to a specified line in a file, or do you have any other ideas for parallelizing file reading and writing?
Environment: Win32, C++, NTFS, Single Hard Disk
Thanks.
-Dbger
Generally speaking, you do NOT want to parallelize disk I/O. Hard disks do not like random I/O, because they have to continuously seek around to get to the data. Assuming you're not using RAID, and you're using hard drives as opposed to some solid-state memory, you will see severe performance degradation if you parallelize I/O (even with technologies like those, you can still see some performance degradation when doing lots of random I/O).
To answer your second question, there really isn't a good way to seek to a certain line in a file; you can only explicitly seek to a byte offset using the read function (see its documentation for details on how to use it).
Queuing multiple reads and writes won't help when you're running against one disk. If your app also performs a lot of CPU work, then you could do your reads and writes asynchronously and let the CPU work while the disk I/O occurs in the background. Alternatively, get a second physical hard drive: read from one, write to the other. For modestly sized data sets that's often effective and quite a bit cheaper than writing code.
This isn't really an answer to your question but rather a re-design (which we all hate but can't help doing). As already mentioned, trying to speed up I/O on a hard disk with multiple threads probably won't help.
However, it might be possible to use another approach depending on data sensitivity, throughput needs, data size, etc. It would not be difficult to create a structure in memory that maintains a picture of the data and allows easy/fast updates of the lines of text anywhere in the data. You could then use a dedicated thread that simply monitors that structure and whose job it is to write the data to disk. Writing data sequentially to disk can be extremely fast; it can be much faster than seeking randomly to different sections and writing it in pieces.
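One common way to wire up such a dedicated writer is with a thread-safe queue; here is a minimal sketch of that shape in Python rather than C++, with illustrative names, just to show the pattern:

    import queue
    import threading

    write_queue = queue.Queue()
    STOP = object()  # sentinel telling the writer to finish up

    # The only thread that touches the output file; it writes strictly sequentially.
    def writer(path):
        with open(path, 'w', encoding='utf-8') as out:
            while True:
                item = write_queue.get()
                if item is STOP:
                    break
                out.write(item + '\n')

    def main():
        t = threading.Thread(target=writer, args=('output.txt',))
        t.start()
        # Worker threads (any number of them) only update memory and enqueue results.
        for i in range(1000):
            write_queue.put(f'record {i}')
        write_queue.put(STOP)
        t.join()

    if __name__ == '__main__':
        main()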
