multithreading and reading from one file (perl)

Hi sharp minds!
I need your expert guidance in making some choices.
Situation is like this:
1. I have approx. 500 flat files containing from 100 to 50000 records that have to be processed.
2. Each record in the files mentioned above has to be replaced using a value from a separate huge file (2-15 GB) containing 100-200 million entries.
So I thought I would do the processing on multiple cores - one file per thread/fork.
Is that a good idea, given that each thread needs to read from the same huge file? Loading it into memory is a problem due to its size. Using Tie::File is an option, but does that work with threads/forks?
I need your advice on how to proceed.
Thanks

Yes, of course, using multiple cores for a multi-threaded application is a good idea, because that's what those cores are for. However, it sounds like your problem involves heavy I/O, so you might not use that much CPU anyway.
Also, since you are only going to read that big file, tie should work fine; I haven't heard of problems with that. But if you are going to search that big file for each record in your smaller files, it will take a long time regardless of the number of threads you use. If the data in the big file can be indexed by some key, I would advise putting it into a NoSQL database and accessing it from your program. That would probably speed up your task even more than using multiple threads/cores.
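To make the keyed-store suggestion concrete: the question is Perl, but the idea is language-agnostic, so here is a rough C++ sketch using SQLite as a stand-in for "some NoSQL database". The table layout, column names, and file names are assumptions for the example:

```cpp
// Illustration only: look each record key up in an indexed on-disk store
// instead of scanning the huge flat file. SQLite is used here as a stand-in
// for "some keyed database"; table/column names and file names are assumed.
#include <sqlite3.h>
#include <iostream>
#include <string>

std::string lookup(sqlite3* db, const std::string& key) {
    sqlite3_stmt* stmt = nullptr;
    std::string value;
    // Assumes the big file was bulk-loaded once into: big(k TEXT PRIMARY KEY, v TEXT)
    if (sqlite3_prepare_v2(db, "SELECT v FROM big WHERE k = ?1;", -1, &stmt, nullptr) == SQLITE_OK) {
        sqlite3_bind_text(stmt, 1, key.c_str(), -1, SQLITE_TRANSIENT);
        if (sqlite3_step(stmt) == SQLITE_ROW)
            value = reinterpret_cast<const char*>(sqlite3_column_text(stmt, 0));
    }
    sqlite3_finalize(stmt);
    return value;
}

int main() {
    sqlite3* db = nullptr;
    if (sqlite3_open("big_lookup.db", &db) != SQLITE_OK) return 1;  // hypothetical DB file
    std::cout << lookup(db, "some-record-key") << "\n";             // hypothetical key
    sqlite3_close(db);
    return 0;
}
```

Each worker thread or forked process can then open its own read-only connection, so the 2-15 GB of data never has to fit in memory.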

Related

external multithreading sort

I need to implement an external multithreaded sort. I don't have experience with multithreaded programming, and I'm not sure whether my algorithm is good enough; I also don't know how to complete it. My idea is:
A thread reads the next block of data from the input file
Sorts it using a standard algorithm (std::sort)
Writes it to another file
After this I have to merge such files. How should I do that?
If I wait until the input file has been entirely processed before merging, I end up with a lot of temporary files.
If I try to merge files straight after sorting, I cannot come up with an algorithm that avoids merging files of quite different sizes, which would lead to O(N^2) complexity.
Also, I suppose this is a very common task, yet I cannot find a good ready-made algorithm on the internet. I would be very grateful for a link to one, especially to its C++ implementation.
Well, the answer isn't that simple, and it actually depends on many factors, amongst them the number of items you wish to process, and the relative speed of your storage system and CPUs.
But the question is why use multithreading here at all. Is the data too big to be held in memory? Are there so many items that even a quicksort can't sort them fast enough? Do you want to take advantage of multiple processors or cores? I don't know.
I would suggest that you first write some test routines to measure the time needed to read and write the input and output files, as well as the CPU time needed for sorting. Please note that I/O is generally A LOT slower than CPU execution (they aren't really even comparable), and I/O may not be efficient if you read data in parallel (on a spinning disk there is one head which has to move back and forth, so reads are effectively serialized; even on another kind of drive it's still a single device with limited input and output channels). That is, the additional overhead of reading/writing temporary files may more than eliminate any benefit from multithreading. So I would say: first try an algorithm that reads the whole file into memory, sorts it, and writes it out, and put in some time counters to check the relative costs. If I/O is even some 30% of the total time (yes, that little!), it's definitely not worth it, because with all that reading/merging/writing of temporary files that share will rise a lot more, so a solution that processes the whole data set at once would be preferable.
In conclusion, I don't see why you'd use multithreading here; the only reason, IMO, would be if the data are actually delivered in blocks, but even then, take into account my considerations above about relative I/O-CPU speeds and the additional overhead of reading/writing the temporary files. And a hint: your file access must be very efficient, e.g. reading/writing in larger blocks using application buffers rather than item by item (this saves on system calls); otherwise it may have a detrimental effect, especially if the file(s) are stored on a machine other than yours (e.g. a server).
Hope you find my suggestions useful.
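For the merge step the question asks about: if you do end up with several sorted runs, the usual approach is a single k-way merge through a min-heap rather than repeated pairwise merges. A minimal sketch, assuming runs of newline-delimited integers and placeholder file names:

```cpp
// Minimal k-way merge of already-sorted runs (one integer per line) into a
// single output file, using a min-heap. Run and output file names are
// placeholders; real code would take them from the run-generation phase.
#include <fstream>
#include <functional>
#include <queue>
#include <string>
#include <utility>
#include <vector>

int main() {
    std::vector<std::string> runs = {"run0.txt", "run1.txt", "run2.txt"};  // assumed
    std::vector<std::ifstream> in;
    for (const auto& name : runs) in.emplace_back(name);

    using Item = std::pair<long long, std::size_t>;   // (value, index of source run)
    std::priority_queue<Item, std::vector<Item>, std::greater<Item>> heap;

    long long v;
    for (std::size_t i = 0; i < in.size(); ++i)
        if (in[i] >> v) heap.push({v, i});            // prime the heap with one value per run

    std::ofstream out("merged.txt");
    while (!heap.empty()) {
        auto [val, src] = heap.top();
        heap.pop();
        out << val << '\n';
        if (in[src] >> v) heap.push({v, src});        // refill from the run we just consumed
    }
    return 0;
}
```

Merging all k runs at once this way costs O(N log k), so the unequal run sizes the question worries about no longer matter; it's repeated pairwise merging that pushes the cost toward O(N^2).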

Strategies for playing (long) audio files from disk

I wanted to start a thread on this. A lot of people are wondering how to do it in a specific context or with a specific language, but I was wondering what the best strategy is in general.
I see two main practices:
Load small chunks (like 2048 samples) of the file into a buffer. It seems the most straightforward, but it hits the disk a lot, so I suspect it is not the best.
Load the whole file into a big buffer. Gentler on the hard drive, but it needs a lot of RAM if you use several long files. And if your file is very long, or has a lot of channels, I imagine the index variable could overflow. For example, if it's a 16-bit integer, maybe it cannot reach the end of the file (or am I being paranoid?)
And I'm thinking about hybrid approaches, like:
using very big buffers without loading the whole file
storing the file on the hard drive in a custom format that is optimized for quick access.
So, what do you think, how do you deal with this?
I don't really care what's "best"; I'm more wondering about the pros and cons of each.
Answering part of my own question (the part about hybrid solutions).
Audacity uses a custom BlockFiles format for storage and playback. It combines both the idea of big (bigger-than-callback) buffers, which are around 1 MB, and the idea of a custom file type (.aup).
"BlockFiles balance two conflicting forces. We can insert and delete audio without excessive copying, and during playback we are guaranteed to get reasonably large chunks of audio with each request to the disk. The smaller the blocks, the more potential disk requests to fetch the same amount of audio data; the larger the blocks, the more copying on insertions and deletions." (from : http://www.aosabook.org/en/audacity.html)
From what I've read, it was primarily designed to speed up editing of very long files (for example, inserting data at the beginning without having to move everything after it).
But for playback of relatively short audio data (< 1 hour) I guess putting everything in RAM is just fine.
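For reference, the "small chunks" option from the original question boils down to a refill loop over a fixed buffer. A rough sketch, assuming a raw 16-bit PCM file and a caller-supplied callback standing in for the audio output:

```cpp
// Rough sketch of the "small chunks" option: stream a raw 16-bit PCM file
// in fixed-size blocks instead of loading it whole. The consume callback
// stands in for whatever hands samples to the audio device.
#include <cstdint>
#include <cstdio>
#include <functional>
#include <vector>

void stream_file(const char* path,
                 const std::function<void(const int16_t*, std::size_t)>& consume) {
    const std::size_t kChunkSamples = 2048;            // per-read block, as in the question
    std::vector<int16_t> buffer(kChunkSamples);
    std::FILE* f = std::fopen(path, "rb");
    if (!f) return;
    std::size_t got;
    while ((got = std::fread(buffer.data(), sizeof(int16_t), kChunkSamples, f)) > 0)
        consume(buffer.data(), got);                   // only the small buffer is indexed
    std::fclose(f);
}

int main() {
    std::size_t total = 0;
    stream_file("audio.raw",                           // hypothetical file name
                [&](const int16_t*, std::size_t n) { total += n; });
    std::printf("streamed %zu samples\n", total);
    return 0;
}
```

Only the small buffer is ever indexed directly, which sidesteps the index-overflow worry for very long files.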

How to open and read 1000s of files very quickly

My problem is that my application takes too long to load thousands of files. Yes, I know it's going to take a long time, but I would like to make it faster by any amount. By "load" I mean opening the file to get its descriptor and then reading the first 100 bytes or so of it.
So my main strategy has been to create a second thread that opens and closes (without reading any contents) all the files. This seems to help because the thread runs ahead of the main thread and, I'm guessing, the OS is caching these files ahead of time, so that when my main thread opens them it's a quick open. It has actually helped, because that thread can start warming the cache while my main thread is parsing the data read in from the files.
So my real question is...what else can I do to make this faster? What approaches are there? Has anyone had success doing this?
I've heard of OS prefetching calls, but those were for virtual memory pages. Is there a way to tell the OS: hey, I'm going to be needing all these files pretty soon - I suggest you start gathering them for me ahead of time? My lookahead thread is pretty crude.
Are there low-level disk techniques I could use? Is there perhaps a pattern of file access that would help? Right now, the files being loaded all come from the same folder. I suppose there is no way to determine where exactly on disk they lie and which ordering of file opens would be fastest for the disk. I'm also guessing that the disk has some hardware to make this as efficient as possible, too.
My application is mainly for windows, but unix suggestions would help as well.
I am programming in C++ if that makes a difference.
Thanks,
-julian
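For concreteness, the lookahead thread described in the question is essentially the following (a rough sketch; the file list and the 128-byte touch size are placeholders):

```cpp
// Sketch of the lookahead thread described above: it runs ahead of the main
// thread, opening each file and touching its first bytes so the OS cache is
// warm by the time the main thread parses it. File names are placeholders.
#include <cstdio>
#include <functional>
#include <string>
#include <thread>
#include <vector>

void prefetch(const std::vector<std::string>& paths) {
    char head[128];
    for (const auto& p : paths) {
        if (std::FILE* f = std::fopen(p.c_str(), "rb")) {
            std::fread(head, 1, sizeof head, f);   // pull the first block into the cache
            std::fclose(f);
        }
    }
}

int main() {
    std::vector<std::string> files = {"a.dat", "b.dat", "c.dat"};  // placeholder names
    std::thread warmer(prefetch, std::cref(files));
    // ... main thread opens and parses the same files here ...
    warmer.join();
    return 0;
}
```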
My first thought is that this is going to be hard to work around from a programmatic level.
You'll find Linux and OSX can access thousands of files like this in a fraction of the time it takes Windows. I don't know how much control you have over the machine. If you can keep the thousands of files on a FAT partition, you should see better results than with NTFS.
How often are you scanning these files, and how often are they changing? If the ratio is heavily on the reading side, it would make sense to copy the start of each file into a cache. The cache could store the filename, modification time, and the first 100 bytes of each of the thousands of files.
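A rough sketch of the cache suggested here, using POSIX stat/open/read (the question is mainly Windows, so treat this as the unix-flavoured variant; the structure and path are illustrative):

```cpp
// Rough sketch of the suggested cache: filename, modification time, and the
// first 100 bytes of each file, re-read only when the mtime changes on disk.
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#include <array>
#include <cstdio>
#include <ctime>
#include <string>
#include <unordered_map>

struct Entry {
    time_t mtime = 0;
    std::array<char, 100> head{};   // first 100 bytes
    std::size_t head_len = 0;
};

std::unordered_map<std::string, Entry> cache;

// Returns the cached header, refreshing it only if the file changed on disk.
const Entry* get_header(const std::string& path) {
    struct stat st{};
    if (stat(path.c_str(), &st) != 0) return nullptr;

    Entry& e = cache[path];
    if (e.mtime == st.st_mtime && e.head_len > 0) return &e;   // still fresh

    int fd = open(path.c_str(), O_RDONLY);
    if (fd < 0) return nullptr;
    ssize_t n = read(fd, e.head.data(), e.head.size());
    close(fd);
    if (n < 0) return nullptr;

    e.head_len = static_cast<std::size_t>(n);
    e.mtime = st.st_mtime;
    return &e;
}

int main() {
    if (const Entry* e = get_header("some_file.dat"))   // hypothetical path
        std::printf("cached %zu header bytes\n", e->head_len);
    return 0;
}
```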

How to parallelize file reading and writing

I have a program which reads data from 2 text files and then saves the result to another file. Since there is a lot of data to read and write, which causes a performance hit, I want to parallelize the reading and writing operations.
My initial thought is to use 2 threads as an example: one thread reads/writes from the beginning of the file, and the other reads/writes from the middle. Since my files are formatted as lines, not fixed-size records (each line may contain a different number of bytes), seeking by byte offset does not work for me. The only solution I could think of is to use getline() to skip over the earlier lines first, which is probably not efficient.
Is there any good way to seek to a specified line in a file? Or do you have any other ideas for parallelizing file reading and writing?
Environment: Win32, C++, NTFS, Single Hard Disk
Thanks.
-Dbger
Generally speaking, you do NOT want to parallelize disk I/O. Hard disks do not like random I/O, because they have to continuously seek around to get to the data. Assuming you're not using RAID, and you're using hard drives as opposed to some solid-state memory, you will see severe performance degradation if you parallelize I/O (and even with those technologies you can still see some performance degradation when doing lots of random I/O).
To answer your second question: there really isn't a good way to seek to a certain line in a file; you can only seek explicitly to a byte offset, using a seek call such as fseek (see its documentation for details on how to use it).
Queuing multiple reads and writes won't help when you're running against one disk. If your app also performed a lot of work in CPU then you could do your reads and writes asynchronously and let the CPU work while the disk I/O occurs in the background. Alternatively, get a second physical hard drive: read from one, write to the other. For modestly sized data sets that's often effective and quite a bit cheaper than writing code.
This isn't really an answer to your question but rather a re-design (which we all hate but can't help doing). As already mentioned, trying to speed up I/O on a hard disk with multiple threads probably won't help.
However, it might be possible to use another approach depending on data sensitivity, throughput needs, data size, etc. It would not be difficult to create a structure in memory that maintains a picture of the data and allows easy/fast updates of the lines of text anywhere in the data. You could then use a dedicated thread that simply monitors that structure and whose job it is to write the data to disk. Writing data sequentially to disk can be extremely fast; it can be much faster than seeking randomly to different sections and writing it in pieces.
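A rough sketch of that dedicated-writer design: worker threads push finished lines into a queue and a single thread drains it to disk sequentially (names and the output path are illustrative):

```cpp
// Rough sketch of the dedicated-writer idea: workers update data in memory
// and push completed lines to a queue; one thread drains the queue and
// writes sequentially to disk. Names and the output path are illustrative.
#include <condition_variable>
#include <fstream>
#include <mutex>
#include <queue>
#include <string>
#include <thread>

std::queue<std::string> pending;
std::mutex mtx;
std::condition_variable cv;
bool done = false;

void writer_thread(const std::string& path) {
    std::ofstream out(path);
    std::unique_lock<std::mutex> lock(mtx);
    while (!done || !pending.empty()) {
        cv.wait(lock, [] { return done || !pending.empty(); });
        while (!pending.empty()) {
            std::string line = std::move(pending.front());
            pending.pop();
            lock.unlock();               // write without holding the lock
            out << line << '\n';
            lock.lock();
        }
    }
}

void submit(std::string line) {
    {
        std::lock_guard<std::mutex> lock(mtx);
        pending.push(std::move(line));
    }
    cv.notify_one();
}

int main() {
    std::thread w(writer_thread, "result.txt");
    for (int i = 0; i < 1000; ++i)
        submit("line " + std::to_string(i));   // stands in for the real processing
    {
        std::lock_guard<std::mutex> lock(mtx);
        done = true;
    }
    cv.notify_one();
    w.join();
    return 0;
}
```

The point of the design is that the disk only ever sees one sequential stream of writes, no matter how many threads produce data.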

Can running 'cat' speed up subsequent file random access on a linux box?

On a Linux box with plenty of memory (a few gigs), I need to randomly access a big file as fast as possible.
I was thinking of doing a cat myfile > /dev/null before accessing it, so that the file's pages are brought into memory sequentially, which is faster than pulling them in via cold random access.
Does this approach make sense to you?
While doing that may force the contents of the file into the system's cache, you are better off using posix_fadvise() (with the POSIX_FADV_WILLNEED advice) or the (blocking) readahead() call to make the kernel precache the data you will need.
EDIT:
You might also want to try using the POSIX_FADV_RANDOM advice to disable readahead altogether.
There's an article with a decent explanation of usage here: Advising the Linux Kernel on File I/O
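A minimal sketch of the WILLNEED advice on Linux (the path is a placeholder and error handling is trimmed):

```cpp
// Minimal sketch: ask the kernel to start reading the file into the page
// cache before we actually need it. The path is a placeholder.
#include <fcntl.h>
#include <unistd.h>
#include <cstdio>

int main() {
    int fd = open("myfile", O_RDONLY);
    if (fd < 0) return 1;
    // Hint: the whole file (offset 0, length 0 = to end of file) will be needed soon.
    if (posix_fadvise(fd, 0, 0, POSIX_FADV_WILLNEED) != 0)
        std::fprintf(stderr, "posix_fadvise failed\n");
    // ... random reads on fd follow; pages are likely already cached ...
    close(fd);
    return 0;
}
```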
As the others said, you'll need to benchmark it in your particular case.
It is quite possible it will result in a significant performance increase though.
On traditional rotating media (i.e. a hard disk), sequential access (cat file > /dev/null, or fadvise) is much faster than random access.
Only one way to be sure that any (possibly premature?) optimization is worthwhile: benchmark it.
It could theoretically speed up the access (especially if you access almost everything from the file), but I wouldn't bet on a big difference.
The only really useful approach is to benchmark it for your specific case.
If you really want the speed, I'd recommend trying memory-mapped I/O instead of trying to hack something up with cat. Of course, it depends on the size of the file you're trying to access and the type of access you want... this may not be possible...
readahead is a good call too...
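A minimal sketch of the memory-mapped approach on Linux, with an madvise(MADV_RANDOM) hint playing the same role as the POSIX_FADV_RANDOM advice mentioned above (the path is a placeholder):

```cpp
// Minimal memory-mapped read sketch: map the whole file read-only and let
// the kernel page it in on demand. Path is a placeholder; errors trimmed.
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>
#include <cstdio>

int main() {
    int fd = open("myfile", O_RDONLY);
    if (fd < 0) return 1;
    struct stat st{};
    if (fstat(fd, &st) != 0) { close(fd); return 1; }

    void* p = mmap(nullptr, static_cast<size_t>(st.st_size), PROT_READ, MAP_PRIVATE, fd, 0);
    if (p == MAP_FAILED) { close(fd); return 1; }

    // Optional: hint random access so the kernel does not waste effort on readahead.
    madvise(p, static_cast<size_t>(st.st_size), MADV_RANDOM);

    const char* data = static_cast<const char*>(p);
    std::printf("first byte: %d\n", data[0]);   // touching a byte faults its page in

    munmap(p, static_cast<size_t>(st.st_size));
    close(fd);
    return 0;
}
```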
Doing "cat" on a big file might bring the data in and blow more valuable data out of the cache; this is not what you want.
If performance is at all important to you, you'll be doing regular performance testing anyway (and soak tests etc), so continue to do that and watch your graphs, figures etc.
