Efficiency of multiple threads reading the same file - multithreading

Will reading from the same file make threads run slower? If so, how do YouTube's or Netflix's servers handle so many people watching the same movie, with everyone at a different place in it?
Or, if reading from the same file does make threads slower and space is not a concern, is it better to have multiple copies of the file, or to split the file into parts?

Will reading from the same file make threads run slower?
No. Modern operating systems handle this situation extremely efficiently: each open file handle keeps its own read offset, and data that many readers request ends up in the OS page cache, so a popular file is served mostly from memory rather than from disk.
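As a minimal sketch of why (the file name and offsets below are made up): each thread can open its own handle to the same file and read from its own position, and the page cache serves the shared, frequently-read data from memory. This is, in miniature, how many viewers can be at different points in the same movie: each connection simply reads a different range of the same file.

```python
import threading

PATH = "movie.mp4"  # hypothetical file name; any large existing file works

def watch(offset: int, length: int) -> None:
    # Each thread opens its own handle, so it has its own independent offset.
    with open(PATH, "rb") as f:
        f.seek(offset)              # jump to this viewer's position in the file
        chunk = f.read(length)      # hot data comes straight from the page cache
        print(f"offset {offset}: read {len(chunk)} bytes")

threads = [threading.Thread(target=watch, args=(i * 1_000_000, 64 * 1024))
           for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```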

Related

Multithreaded reading line by line a file in Crystal

I'm getting started with Crystal and I'd like to know whether we can read a file line by line with multiple threads, like C# does with Parallel (and the MaxDegreeOfParallelism option).
Thanks
If I understand C#'s Parallel correctly, it just implements concurrent (and possibly multithreaded) execution of a number of similar tasks. This is certainly possible in Crystal, even without multithreading. In the stdlib, HTTP::Server uses this, and there are several shards for job processing, for example. Once multithreading lands, this will give us the option to run tasks truly in parallel.
Issue #6468 suggests how to structure such concurrent tasks, and potentially also how to configure how many tasks are executed in parallel.
I'm not sure what you mean by "multithreaded reading a file line by line". Sharing a file descriptor for simultaneous access from multiple threads sounds like a dangerous idea in any language. Are you certain C#'s Parallel can do that?
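To illustrate the pattern this answer describes, here is a rough sketch of bounded parallel line processing; it is Python rather than Crystal or C#, and the file name and the per-line work are placeholders. A single owner reads the file, so the descriptor is never shared between threads; only the per-line work fans out.

```python
from concurrent.futures import ThreadPoolExecutor

MAX_DEGREE_OF_PARALLELISM = 4        # analogue of C#'s MaxDegreeOfParallelism

def handle(line: str) -> int:
    # hypothetical per-line work; replace with the real processing
    return len(line.strip())

# One reader owns the file handle and only the per-line work is parallelized,
# so the file descriptor is never shared between threads.
with open("input.txt") as f, \
     ThreadPoolExecutor(max_workers=MAX_DEGREE_OF_PARALLELISM) as pool:
    results = list(pool.map(handle, f))

print(sum(results), "characters processed")
```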

Which is faster? Multi-threading VS Multi-tasking approach

As we know, multitasking involves running multiple processes, while multithreading means running multiple threads that share the same memory space of a process.
So I want to know: which approach is better and faster in terms of system performance?
Which can bring a noticeable difference in performance?
Thanks in advance!
As SergeyA has indicated, this is an awfully broad question. The answer is really going to depend upon the problem being solved.
If the various tasks are largely separate, with only occasional communication between them, then multiple processes offer the advantage of being able to split the work across different compute servers.
If the tasks are tightly coupled, then the inter-process communications can eat up a lot of that advantage. At that point, multithreading is most likely more efficient and most likely easier to implement.
Creating multiple processes can be somewhat expensive. Spawning threads is exceedingly easy. That becomes a factor.
Resources required can also be a factor. If you're processing a large dataset and do that across processes, then each process needs to pull the dataset into memory. That takes both time and memory. If you multithread, you can load it once and share the data between your threads.
So it depends. For most problems, multithreading is probably significantly faster than using multiple processes, but as soon as you encounter hardware limitations, that answer goes out the window.
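As a small illustration of the "load it once and share it" point above, here is a rough Python sketch (the dataset and the query are invented). With threads, every worker reads the same in-memory object; a process pool would have to copy or reload the dataset in each worker.

```python
from concurrent.futures import ThreadPoolExecutor

# Loaded once; every thread reads the same object in place. A process pool
# would instead have to copy or reload this dataset in each worker.
dataset = list(range(10_000_000))    # stand-in for a large dataset

def query(lo: int, hi: int) -> int:
    # Read-only access to shared memory needs no copying (and no locking).
    return sum(dataset[lo:hi])

chunks = [(i, i + 2_500_000) for i in range(0, 10_000_000, 2_500_000)]

# Note: the point here is memory sharing, not raw speed; in CPython the GIL
# keeps pure-Python CPU work from running in parallel on threads.
with ThreadPoolExecutor(max_workers=4) as pool:
    totals = list(pool.map(query, *zip(*chunks)))

print(sum(totals))
```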

Fastest Way of Storing Data [closed]

I have a server which generates some output, like this:
http://192.168.0.1/getJPG=[ID]
I have to go through ID 1 to 20M.
I see that most of the delay is in storing the files; currently I store every request result as a separate file in a folder, in the form [ID].jpg.
The server responds quickly, the generator server is really fast, but I can't handle the received data rapidly enough.
What is the best way of storing the data for later processing?
I can use any kind of storage: a DB, a single big file that I parse later, etc.
I can code in .NET, PHP, C++, etc. No restrictions on programming language. Please advise.
Thanks
So you're downloading 20 million files from a server, and the speed at which you can save them to disk is a bottleneck? If you're accessing the server over the Internet, that's very strange. Perhaps you're downloading over a local network, or maybe the "server" is even running locally.
With 20 million files to save, I'm sure they won't all fit in RAM, so buffering the data in memory won't help. And if the maximum speed at which data can be written to your disk is really a bottleneck, using MS SQL or any other DB will not change anything. There's nothing "magic" about a DB -- it is limited by the performance of your disk, just like any other program.
It sounds like your best bet would be to use multiple disks. Download multiple files in parallel, and as each is received, write it out to a different disk, in a round-robin fashion. The more disks you have, the better. Use multiple threads OR non-blocking I/O so downloads and disk writes all happen concurrently.
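A rough sketch of the round-robin idea, assuming Python, invented mount points, and a faked download: each fetched file is handed to a writer pool and targeted at the next disk in rotation.

```python
import itertools
import os
from concurrent.futures import ThreadPoolExecutor

# Placeholder mount points, one per physical disk.
DISKS = ["/mnt/disk0", "/mnt/disk1", "/mnt/disk2"]
next_disk = itertools.cycle(DISKS)

def fetch(file_id: int) -> bytes:
    # placeholder for the real HTTP GET of /getJPG=[ID]
    return b"\xff\xd8" + file_id.to_bytes(4, "big")

def save(disk: str, file_id: int, data: bytes) -> None:
    os.makedirs(disk, exist_ok=True)
    with open(os.path.join(disk, f"{file_id}.jpg"), "wb") as f:
        f.write(data)

with ThreadPoolExecutor(max_workers=len(DISKS) * 2) as writers:
    for file_id in range(1, 101):        # 1..20M in the real workload
        data = fetch(file_id)
        disk = next(next_disk)           # round-robin: rotate through the disks
        writers.submit(save, disk, file_id, data)
```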
To do this efficiently, I would multi-thread your application (c++).
The main thread of your application will make these web-requests and push them to the back of a std::list. This is all your main application thread will do.
Spawn a pthread (my preferred threading method, even on Windows...) once, keep it running rather than spawning repeatedly, and set it up to check the same std::list in a while loop. In the loop, check the size of the list and, if there are items to be processed, pop the front item off the list and write it to disk. Note that a plain std::list is not thread-safe, so guard the push and pop with a mutex (or use a proper concurrent queue).
This will allow you to queue up the responses in memory while asynchronously saving the files to disk. If your server really is as quick as you say, you might run out of memory. In that case I would implement some 'waiting' when the number of items to be processed is over a certain threshold, but that will only run a little better than doing it serially.
The real way to improve the speed of this is to have many worker threads (each with its own std::list and 'smart' pushing onto the list with the fewest items, or one std::list shared via a mutex) processing the files. If you have a multi-core machine with multiple hard drives, this will greatly increase the speed of saving these files to disk.
The other solution is to off-load the saving of the files to many different computers (if the number of disks on your current computer is limiting the writes). By using a message-passing system such as ZMQ/0MQ, you can very easily push the saving of files off to different systems (set up in a PULL fashion) with more hard drives accessible than are on one machine. ZMQ makes the round-robin style of message passing trivial, as a fan-out architecture is built in and takes literally minutes to implement.
Yet another solution is to create a ramdisk (easily done natively on Linux; for Windows... I've used this). This lets you parallelize the writing of the files with as many writers as you want without issue. You then need to make sure to copy those files to real storage before you restart, or you lose them, but while the job is running you can store the files in real time without issue.
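The answer above proposes C++ with pthreads and a std::list; here is the same producer/consumer shape as a Python sketch with invented names. queue.Queue stands in for the list-plus-mutex, and its maxsize provides the 'waiting' threshold automatically, since put() blocks when the writer falls behind.

```python
import queue
import threading

# Bounded queue: stands in for the std::list + mutex; put() blocks when the
# writer falls behind, which provides the 'waiting' threshold automatically.
work = queue.Queue(maxsize=1000)

def writer() -> None:
    while True:
        item = work.get()            # thread-safe; no explicit mutex needed
        if item is None:             # sentinel: no more work
            return
        file_id, data = item
        with open(f"{file_id}.jpg", "wb") as f:
            f.write(data)

writer_thread = threading.Thread(target=writer)
writer_thread.start()

# Main thread: fetch responses and hand them to the disk writer.
for file_id in range(1, 101):                          # 1..20M in the real job
    data = b"\xff\xd8" + file_id.to_bytes(4, "big")    # placeholder response body
    work.put((file_id, data))

work.put(None)           # tell the writer we are done
writer_thread.join()
```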
It probably helps to access the disk sequentially. Here is a simple trick to do this: stream all incoming files into an uncompressed ZIP file (there are libraries for that). This makes all I/O sequential, and there is only one file. You can also split off a new ZIP file after 10,000 images or so to keep the individual ZIPs small.
You can later read all files by streaming out of the ZIP file. Little overhead there as it is uncompressed.
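A minimal sketch of this idea using Python's zipfile module (the incoming images are faked): ZIP_STORED writes the data uncompressed, so each image becomes a cheap sequential append to a single archive file, and a fresh archive is started every 10,000 images.

```python
import zipfile

IMAGES_PER_ARCHIVE = 10_000     # split off a new ZIP to keep each one small

def new_archive(index: int) -> zipfile.ZipFile:
    # ZIP_STORED = no compression, so writes are cheap and purely sequential
    return zipfile.ZipFile(f"images-{index}.zip", "w", zipfile.ZIP_STORED)

archive_index, count = 0, 0
zf = new_archive(archive_index)

for file_id in range(1, 101):                          # 1..20M in the real job
    data = b"\xff\xd8" + file_id.to_bytes(4, "big")    # placeholder JPEG bytes
    zf.writestr(f"{file_id}.jpg", data)                # appended to one big file
    count += 1
    if count == IMAGES_PER_ARCHIVE:
        zf.close()
        archive_index, count = archive_index + 1, 0
        zf = new_archive(archive_index)

zf.close()
```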
It sounds like you are trying to write an application which downloads as much content as you can as quickly as possible. You should be aware that when you do this, chances are people will notice as this will suck up a good amount of bandwidth and other resources.
Since this is Windows/NTFS, there are some things you need to keep in mind:
- Do not have more than 2k files in one folder.
- Use async/buffered writes as much as possible.
- Spread over as many disks as you have available for best I/O performance.
One thing that wasn't mentioned but is somewhat important is file size. Since it looks like you are fetching JPEGs, I'm going to assume an average file size of ~50 KB.
I've recently done something like this with an endless stream of ~1KB text files using .Net 4.0 and was able to saturate a 100mbit network controller on the local net. I used the TaskFactory to generate HttpWebRequest threads to download the data to memory streams. I buffered them in memory so I did not have to write them to disk. The basic approach I would recommend is similar - Spin off threads that each make the request, grab the response stream, and write it to disk. The hardest part will be generating the sequential folders and file names. You want to do this as quickly as possible, make it thread safe, and do your bookkeeping in memory to avoid hitting the disk with unnecessary calls for directory contents.
I would not worry about trying to sequence your writes. There are enough layers of the OS/NTFS that will try and do this for you. You should be saturating some piece of your pipe in no time.
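The answer above is about .NET (TaskFactory, HttpWebRequest); as a sketch of the folder and bookkeeping advice, here is a Python version with a faked download. The target folder is derived arithmetically from the ID, so the disk never has to be queried for directory contents and each folder stays under the ~2k-file guideline.

```python
import os
from concurrent.futures import ThreadPoolExecutor

FILES_PER_FOLDER = 2000     # stay under the "2k files per folder" guideline
ROOT = "images"

def fetch(file_id: int) -> bytes:
    # placeholder for the real HTTP GET of /getJPG=[ID]
    return b"\xff\xd8" + file_id.to_bytes(4, "big")

def download_and_save(file_id: int) -> None:
    # The target folder is computed from the ID, so there is never a need to
    # list directory contents on disk; all bookkeeping stays in memory.
    folder = os.path.join(ROOT, f"{file_id // FILES_PER_FOLDER:05d}")
    os.makedirs(folder, exist_ok=True)
    with open(os.path.join(folder, f"{file_id}.jpg"), "wb") as f:
        f.write(fetch(file_id))

with ThreadPoolExecutor(max_workers=16) as pool:
    list(pool.map(download_and_save, range(1, 101)))   # 1..20M in the real job
```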

C# TPL Tasks - How many at one time

I'm learning how to use the TPL for parallelizing an application I have. The application processes ZIP files, extracting all of the files held within them and importing the contents into a database. There may be several thousand ZIP files waiting to be processed at a given time.
Am I right in kicking off a separate task for each of these ZIP files or is this an inefficient way to use the TPL?
Thanks.
This seems like a problem better suited for worker threads (separate thread for each file) managed with the ThreadPool rather than the TPL. TPL is great when you can divide and conquer on a single item of data but your zip files are treated individually.
Disk I/O is going to be your bottleneck, so I think you will need to throttle the number of jobs running simultaneously. It's simple to manage this with worker threads, but I'm not sure how much control you have (if any) over Parallel.For/ForEach as far as how much parallelism goes on at once, which could choke your process and actually slow it down.
Anytime that you have a long running process, you can typically gain additional performance on multi-processor systems by making different threads for each input task. So I would say that you are most likely going down the right path.
I would have thought that this depends on whether the process is limited by CPU or by disk. If it is limited by disk, it might be a bad idea to kick off too many threads, since the various extractions might just compete with each other.
This feels like something you might need to measure to get the correct answer for what's best.
I have to disagree with certain statements here guys.
First of all, I do not see any difference between the ThreadPool and Tasks in coordination or control. Especially since tasks run on the ThreadPool, you have easy control over tasks, and exceptions are nicely propagated to the caller during await or when awaiting Task.WhenAll(tasks), etc.
Second, I/O won't necessarily be the only bottleneck here; depending on the data and the level of compression, the zipping will most likely take more time than reading the file from disk.
It can be thought of in many ways, but I would go for something like the number of CPU cores, or a little less.
Load the file paths into a ConcurrentQueue and then allow the running tasks to dequeue file paths, load the files, zip them, and save them (see the sketch below).
From there you can tweak the number of cores and play with load balancing.
I do not know whether ZIP supports file partitioning during compression, but in some advanced/complex cases it could be a good idea, especially for large files...
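The answer above is C#/TPL (ConcurrentQueue plus tasks); the following Python sketch mirrors the same shape, with the worker count tied to the CPU core count. The ZIP paths and the database import step are placeholders.

```python
import os
import queue
import threading
import zipfile

def import_into_db(name: str, data: bytes) -> None:
    pass   # stand-in for the real database import

def worker(paths: "queue.Queue[str]") -> None:
    while True:
        try:
            path = paths.get_nowait()     # dequeue the next pending ZIP
        except queue.Empty:
            return                        # queue drained: this worker is done
        with zipfile.ZipFile(path) as zf:
            for name in zf.namelist():
                import_into_db(name, zf.read(name))

pending = queue.Queue()
for p in ["a.zip", "b.zip", "c.zip"]:     # placeholder list of waiting ZIPs
    pending.put(p)

# "Number of CPU cores or a little less" worth of workers.
n_workers = max(1, (os.cpu_count() or 2) - 1)
threads = [threading.Thread(target=worker, args=(pending,)) for _ in range(n_workers)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```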
Wow, it is a six-year-old question, bummer! I had not noticed... :)

Multiple Backup Jobs Simultaneously: Theory vs Practice

While designing a fairly simple backup system for Linux in Python, I find myself asking whether there could be any time advantage to backing up several datasets/archives simultaneously.
My intuition tells me that writing to several archives simultaneously would not buy me much time as I/O would already be the greatest bottleneck.
On the other hand, if using something like bz2, would there be an advantage to multi-threading, since the higher CPU demand would decrease the I/O demand? Or is it a wash, since all threads would be doing essentially the same thing and therefore sharing the same bottlenecks?
It depends on your system. If you have multiple disks, it could be very worthwhile to parallelize your backup job. If you have multiple processors, compressing multiple jobs in parallel may be worth your while.
If the processor is slow enough (and the disks are fast enough) that zipping makes your CPU a bottleneck, you'll make some gains on multicore or hyperthreaded processors. The reduced I/O demand from zipped data being written is almost certainly a win if your CPU can keep up with the read speed of your drive(s).
Anyway, this is all very system dependent. Try it and see. Run two jobs at once and then run the same two in serial and see which took longer. The cheap (coding-wise) way is to just run your backup script twice with different input and output parameters. Once you've established a winner, you can go farther down the path.
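A small sketch of that "run it both ways and compare" experiment in Python (the directories and archive names are made up): each job tars and bz2-compresses one dataset, and a process pool runs the jobs side by side so the compression can use more than one core.

```python
import tarfile
import time
from concurrent.futures import ProcessPoolExecutor

# Hypothetical datasets to back up.
JOBS = [("/data/photos", "photos.tar.bz2"),
        ("/data/mail",   "mail.tar.bz2")]

def backup(src: str, dest: str) -> str:
    # bz2 compression is CPU-heavy, so separate processes can use separate cores.
    with tarfile.open(dest, "w:bz2") as tar:
        tar.add(src)
    return dest

def run_serial() -> float:
    start = time.perf_counter()
    for src, dest in JOBS:
        backup(src, dest)
    return time.perf_counter() - start

def run_parallel() -> float:
    start = time.perf_counter()
    with ProcessPoolExecutor(max_workers=len(JOBS)) as pool:
        list(pool.map(backup, *zip(*JOBS)))
    return time.perf_counter() - start

if __name__ == "__main__":
    print("serial:  ", run_serial(), "s")
    print("parallel:", run_parallel(), "s")
```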
