We have to deal with extracting gzip/bzip2 files downloaded over the internet; sometimes they are many gigabytes in size (e.g. a 15 GB wiki dump).
Is there a way for those to be extracted by multiple computers instead of by one? Perhaps each node in the cluster could read the header plus the bytes between X and Y, and write its part into a shared folder?
Or is there any other way to accelerate the process?
Have you considered using a parallelized alternative to gzip/bzip?
If you are using bzip2, pbzip2 is a parallelized alternative that uses pthreads to speed up compression and decompression. In addition, a parallel alternative to gzip is pgzip.
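For illustration, a minimal sketch of driving those tools from Python to decompress a downloaded archive (assuming pbzip2 and pigz are installed and on the PATH; the archive names are placeholders):

    import subprocess

    # Decompress a bzip2 archive with pbzip2 using 8 worker threads.
    # -d: decompress, -k: keep the original archive, -p8: use 8 processors.
    # (Parallel decompression works best on archives pbzip2 itself produced.)
    subprocess.run(["pbzip2", "-d", "-k", "-p8", "archive.bz2"], check=True)

    # Decompress a gzip archive with pigz. Note that pigz's decompression is
    # largely single-threaded, so the gain is mostly on the compression side.
    subprocess.run(["pigz", "-d", "-k", "archive.gz"], check=True)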
I am using the PIGZ library. https://zlib.net/pigz/
I compressed large files using multiple threads per file with this library and now I want to decompress those files using multiple threads per file too. As per the documentation:
Decompression can’t be parallelized, at least not without specially
prepared deflate streams for that purpose.
However, the documentation doesn't specify how to do that, and I'm finding it difficult to find information on this.
How would I create these "specially prepared deflate streams" that PIGZ can utilise for decompression?
pigz does not currently support parallel decompression, so it wouldn't help to specially prepare such a deflate stream.
The main reason this has not been implemented is that, in most situations, decompression is fast enough to be i/o bound, not processor bound. This is not the case for compression, which can be much slower than decompression, and where parallel compression can speed things up quite a bit.
You could write your own parallel decompressor using zlib and pthreads. pigz 2.3.4 and later will in fact produce a specially prepared stream for parallel decompression when given the --independent (-i) option. That makes the blocks independently decompressible, and puts two sync markers in front of each one so they can be found quickly by scanning the compressed data.
The uncompressed size of a block is set with --blocksize or -b. You might want to make that size larger than the default, e.g. 1M instead of 128K, to reduce the impact of -i on the compression ratio. Some testing will tell you how much compression you lose by using -i.
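For example, a rough sketch of producing such an independent-block stream from Python (assuming pigz 2.3.4 or later is on the PATH; the input file name is a placeholder):

    import subprocess

    # Compress with independent 1 MB blocks so the deflate stream contains the
    # sync markers a custom parallel decompressor could scan for.
    # -i / --independent: make blocks independently decompressible
    # -b 1024: uncompressed block size in KB (1 MB instead of the 128K default)
    # -k: keep the original file
    subprocess.run(["pigz", "-i", "-b", "1024", "-k", "bigfile"], check=True)

Note that pigz itself still won't decompress this in parallel; the -i stream only makes it feasible to write your own parallel decompressor.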
(By the way, pigz is not a library, it is a command-line utility.)
I am writing an application using OpenCV 2.2 under VC++. I receive videos from different network streams and write them frame by frame to AVI files, each in a separate thread. There are hundreds of video streams, so my application is writing hundreds of files to disk, which is very heavy. Can someone advise me on an optimized way to do this?
Thanks in advance
Oh dear. I hope you have plenty of RAM.
Writing multiple files is a real pain. The best you can do is mitigate the write seeks by always writing as large a chunk of AVI frames (preferably a multiple of the sector size) as reasonably possible. Maybe:
1) A 'FrameBuf' frame-buffer class. Create a shitload of *FrameBuf at startup and pool them on a producer-consumer queue.
2) A 'FrameVec' container class for multiple *FrameBuf instances. You may need to pool these as well.
3) A threadpool for writing the contents of a *FrameVec to the disk system. This will contain very few threads, possibly only one, for best disk-write performance with few seeks. Best make the number of threads configurable/changeable at runtime to optimize overall throughput. Best make it all configurable - depth of *FrameBuf pool, number of *FrameBuf in each *FrameVec - everything.
If possible, use an SSD. If the system has any 'quiet' time, it could move the accumulated avi's to a big spinner, or networked disks, to free up the SSD for the next 'busy' time.
When moving your various instances about, remember these mantras:
'Stack-objects, copy ctors, any template class with no * bad', and 'pointers, pools, pointer containers good'.
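To make the shape of that concrete, here is a rough sketch of the pooled producer-consumer layout in Python (your question is VC++, but the structure is the same; buffer sizes, counts and file names are made-up placeholders):

    import queue
    import threading

    FRAMES_PER_CHUNK = 64          # batch this many frames per disk write
    free_buffers = queue.Queue()   # pool of reusable frame buffers ('FrameBuf')
    write_queue = queue.Queue()    # filled chunks ('FrameVec') awaiting writing

    # Pre-allocate the buffer pool at startup.
    for _ in range(1024):
        free_buffers.put(bytearray(2 * 1024 * 1024))   # placeholder frame size

    def capture_thread(stream_id):
        # Producer: grab frames, batch them, hand full chunks to the writer.
        # One of these would be started per network stream.
        chunk = []
        while True:
            buf = free_buffers.get()          # blocks if the pool is exhausted
            # ... fill buf with one encoded frame from the network stream ...
            chunk.append(buf)
            if len(chunk) >= FRAMES_PER_CHUNK:
                write_queue.put((stream_id, chunk))
                chunk = []

    def writer_thread():
        # Consumer: one (or very few) threads do all the disk writes.
        # Placeholder: a real implementation would go through an AVI writer.
        while True:
            stream_id, chunk = write_queue.get()
            with open("stream_%d.avi" % stream_id, "ab") as f:
                for buf in chunk:
                    f.write(buf)
                    free_buffers.put(buf)     # recycle the buffer into the pool

    threading.Thread(target=writer_thread, daemon=True).start()

The point is that the capture threads never touch the disk; they only shuttle pre-allocated buffers between queues, while a very small number of writer threads do all the writes in large sequential chunks.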
Good luck..
Hi, sharp minds!
I need your expert guidance in making some choices.
Situation is like this:
1. I have approx. 500 flat files containing from 100 to 50000 records that have to be processed.
2. Each record in the files mentioned above has to be replaced using a value from a separate huge file (2-15 GB) containing 100-200 million entries.
So I thought I would do the processing using multiple cores: one file per thread/fork.
Is that a good idea, given that each thread needs to read from the same huge file? Loading it into memory is a bit of a problem due to its size. Using file::tie is an option, but does that work with threads/forks?
I need your advice on how to proceed.
Thanks
Yes, of course, using multiple cores for a multi-threaded application is a good idea, because that's what those cores are for. That said, it sounds like your problem involves heavy I/O, so you may not use that much CPU anyway.
Also, since you are only going to read that big file, tie should work perfectly; I haven't heard of problems with that. But if you are going to search that big file for each record in your smaller files, it will take a long time regardless of the number of threads you use. If the data from the big file can be indexed by some key, then I would advise putting it in a NoSQL database and accessing it from your program. That would probably speed up your task even more than using multiple threads/cores.
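As a sketch of that keyed-lookup idea (in Python rather than Perl for illustration; the tab-separated key/value format and all file names are assumptions, and dbm stands in for whatever keyed store you pick):

    import dbm
    import glob
    from multiprocessing import Pool

    def build_index(huge_file="huge_file.txt", index_path="huge.db"):
        # One-time step: index the huge file into an on-disk key/value store so
        # workers can do cheap lookups without loading 2-15 GB into memory.
        with dbm.open(index_path, "c") as db, open(huge_file) as fh:
            for line in fh:
                key, value = line.rstrip("\n").split("\t", 1)   # assumed format
                db[key] = value

    def process_file(path, index_path="huge.db"):
        # Each worker handles one flat file, replacing records via index lookups.
        with dbm.open(index_path, "r") as db, open(path) as fh, \
             open(path + ".out", "w") as out:
            for record in fh:
                key = record.strip()
                replacement = db.get(key.encode(), key.encode())
                out.write(replacement.decode() + "\n")

    if __name__ == "__main__":
        build_index()
        with Pool(processes=8) as pool:        # one file per worker process
            pool.map(process_file, glob.glob("flatfiles/*.txt"))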
I'd like to know if there's a way to make a zip file, or any other compressed file (tar, gz, etc.), that will extract as quickly as possible. I'm just trying to move one folder to another computer, so I'm not concerned with the size of the file. However, I'm zipping up a large folder (~100 MB), and I was wondering if there's a method to extract a zip file more quickly, or if another standard can decompress files more quickly.
Thanks!
The short answer is that compression is always a trade-off between speed and size, i.e. faster compression usually means a larger file - but unless you're using floppy disks to transfer the data, the time you gain by using a faster compression method is lost to the extra network time needed to haul the data about. Having said that, the speed and compression ratio of different methods vary depending on the structure of the file(s) you are compressing.
You also have to consider the availability of software - is it worth spending the time downloading and compiling a compression program? I guess if it's worth the time waiting for an answer here, then either you're using an RFC 1149 network or you're going to be doing this a lot.
In which case the answer is simple: test the programs yourself using a representative dataset.
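If you want to measure that in code rather than with the command-line tools, a rough sketch using Python's standard-library codecs (the sample file name is a placeholder; use an archive of your actual data):

    import bz2
    import gzip
    import lzma
    import time

    with open("sample.tar", "rb") as f:
        data = f.read()

    # Compare size and speed of the standard-library codecs on a representative
    # sample; the fastest decompressor for your data wins.
    for name, mod in [("gzip", gzip), ("bz2", bz2), ("lzma", lzma)]:
        t0 = time.perf_counter()
        packed = mod.compress(data)
        t1 = time.perf_counter()
        mod.decompress(packed)
        t2 = time.perf_counter()
        print("%s: size=%d compress=%.2fs decompress=%.2fs"
              % (name, len(packed), t1 - t0, t2 - t1))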
I need to concurrently process a large number of files (thousands of different files, with an average size of 2 MB per file).
All the information is stored on one (1.5TB) network hard drive, and will be accessed (read) by about 30 different machines. For efficiency, each machine will be reading (and processing) different files (there are thousands of files that need to be processed).
Every machine -- after reading a file from the 'incoming' folder on the 1.5TB hard drive -- will process the information and output the processed information back to the 'processed' folder on the 1.5TB drive. The processed information for every file is of roughly the same average size as the input files (~2 MB per file).
Are there any 'dos' and 'don'ts' when building such an operation? Is it a problem to have 30 or so machines read (or write) information to the same network drive at the same time?
(note: existing files will only be read, not appended/written; new files will be created from scratch, so there are no issues of multiple access to the same file...).
Are there any bottlenecks that I should expect?
(I am using Linux, Ubuntu 10.04 LTS, on all machines, if it matters at all)
Things you should think about:
If the processing to be done for each file is simple, then your real bottleneck isn't the number of files you read in parallel, but the capabilities of the hard disk drive.
Unless processing takes a long time (say, several seconds per file), you'll pass a point at which adding more processes only slows things down to a crawl, since every process is reading and writing results, and the disk can only do so much.
Try to minimize disk access: for example, download files and produce results locally while other processes are downloading, and send the results back when the load on the disk goes down.
The more I write, the more it boils down to how much processing needs to be done for each file. If it's simple parsing, something that takes milliseconds, 1 machine or 30 will make little difference.
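A sketch of that 'work on a local copy' pattern (mount points and the process() step are placeholders):

    import shutil
    import tempfile
    from pathlib import Path

    NAS_OUT = Path("/mnt/nas/processed")     # placeholder mount point

    def process(src, dst):
        # Placeholder for the real per-file processing.
        dst.write_bytes(src.read_bytes())

    def handle(remote_file):
        # Copy to fast local storage, process there, then push only the finished
        # result back, so the NAS sees two sequential transfers per file instead
        # of many scattered small reads and writes.
        with tempfile.TemporaryDirectory() as tmp:
            local_in = Path(tmp) / remote_file.name
            local_out = Path(tmp) / (remote_file.name + ".out")
            shutil.copy2(remote_file, local_in)
            process(local_in, local_out)
            shutil.copy2(local_out, NAS_OUT / local_out.name)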
You need to be careful that two worker processes don't pick up (and try to do) the same piece of work at the same time.
Unfortunately, NFS filesystems don't have semantics that allow you to easily do that.
So what I'd recommend is to use something like Gearman and a producer/consumer model, where one process gives out work to whoever is available to do it.
Another possibility is to have a database (e.g. mysql) with a table of all tasks, and have the processes atomically "claim" tasks for themselves.
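For illustration, a sketch of that atomic-claim idea (using sqlite3 and a made-up tasks table for brevity; the same single-UPDATE pattern works in MySQL, and a real MySQL server also avoids SQLite's locking caveats over NFS):

    import sqlite3
    import uuid

    # Assumed schema: CREATE TABLE tasks (id INTEGER PRIMARY KEY,
    #                                     filename TEXT, claimed_by TEXT)

    def claim_task(db_path="tasks.db"):
        # Atomically claim one unclaimed task; returns its filename, or None.
        token = str(uuid.uuid4())
        con = sqlite3.connect(db_path)
        with con:   # the UPDATE runs in a single transaction
            con.execute(
                "UPDATE tasks SET claimed_by = ? "
                "WHERE id = (SELECT id FROM tasks "
                "            WHERE claimed_by IS NULL LIMIT 1)",
                (token,),
            )
        row = con.execute(
            "SELECT filename FROM tasks WHERE claimed_by = ?", (token,)
        ).fetchone()
        con.close()
        return row[0] if row else None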
But all of this is only worthwhile if your processes are mostly CPU-bound. If you're trying to get more IO bandwidth (or operations) out of your NAS by using multiple clients, it's not going to work.
I am assuming that you will be running at least gigabit ethernet here (or it's probably not worth it).
Have you tried running multiple processes on the same machine?