Sync item list between perl scripts, across servers - multithreading

I have a multi-threaded perl script which does the following:
1) One boss thread searches through a folder structure on an external server. For each file it finds, it adds its path/name to a thread queue. If the path/file is already in the queue, or being processed by the worker threads, the enqueuing is skipped.
2) A dozen worker threads dequeue from the above queue, process the files, and remove them from the hard disk.
It runs on a single physical server, and everything works fine.
Now I want to add a second server, which will work concurrently with the first one, searching through the same folder structure, looking for files to enqueue/process. I need a means of making both servers aware of what each one is doing, so that they don't process the same files. The queue is small, ranging from 20 to 100 items. The list is very dynamic and changes many times per second.
Do I simply write to/read from a regular file to keep them synced about the current item list? Any ideas?

I would be very wary of using a regular file - it'll be difficult to manage locking and caching semantics.
IPC is a big and difficult topic, and when you're doing server to server - it can get very messy indeed. You'll need to think about much more complicated scenarios, like 'what if host A crashes with partial processing'.
So first off I would suggest you need to (if at all possible) make your process idempotent. Specifically - set it up so IF both servers do end up processing the same things, then no harm is done - it's 'just' inefficient.
I can't tell you how to do this, but the general approach is to permit (and discard) duplication of effort.
In terms of synchronising your two processes on different servers - I don't think a file will do the trick - shared-filesystem IPC is not really suitable for a near-real-time sort of operation, because of caching. The default cache lag on NFS is somewhere on the order of 60s.
I would suggest that you think in terms of sockets - they're a fairly standard way of doing server-to-server IPC. Since you already check for 'pending' items in the queue, you could expand this check to query the other host (note - consider what you'll do if it's offline or otherwise unreachable) before enqueueing.
The caveat here is that parallelism works better the less IPC is going on. Talking across a network is generally a bit faster than talking to a disk, but it's considerably slower than the speed at which a processor runs. So if you can work out some sort of caching/locking mechanism where you don't need to update for each and every file, then it'll run much better.


Fastest Way of Storing Data [closed]

I have a server which generates some output, like this:
http://192.168.0.1/getJPG=[ID]
I have to go through ID 1 to 20M.
I see that most of the delay is in storing the files; currently I store every request result as a separate file in a folder, in the form [ID].jpg.
The generator server responds really quickly, but I can't handle the received data fast enough.
What is the best way of storing the data for later processing?
I can use any kind of storage: a DB, a SINGLE file that I later parse, etc.
I can code in .NET, PHP, C++, etc. There are no restrictions on programming language. Please advise.
Thanks
So you're downloading 20 million files from a server, and the speed at which you can save them to disk is a bottleneck? If you're accessing the server over the Internet, that's very strange. Perhaps you're downloading over a local network, or maybe the "server" is even running locally.
With 20 million files to save, I'm sure they won't all fit in RAM, so buffering the data in memory won't help. And if the maximum speed at which data can be written to your disk is really a bottleneck, using MS SQL or any other DB will not change anything. There's nothing "magic" about a DB -- it is limited by the performance of your disk, just like any other program.
It sounds like your best bet would be to use multiple disks. Download multiple files in parallel, and as each is received, write it out to a different disk, in a round-robin fashion. The more disks you have, the better. Use multiple threads OR non-blocking I/O so downloads and disk writes all happen concurrently.
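For illustration only, a rough sketch (in C#, which the question allows) of the parallel-download, round-robin-write idea. The disk roots, the degree of parallelism, and the use of WebClient are assumptions for the sketch, not anything from the original post:

using System.IO;
using System.Net;
using System.Threading;
using System.Threading.Tasks;

class RoundRobinDownloader
{
    // Assumed disk roots; adjust to the drives you actually have (and create the folders first).
    static readonly string[] DiskRoots = { @"D:\jpg", @"E:\jpg", @"F:\jpg" };
    static int _counter = -1;

    static void Main()
    {
        Parallel.For(1, 20000001,
            new ParallelOptions { MaxDegreeOfParallelism = 16 },
            id =>
            {
                using (var client = new WebClient())
                {
                    // download one image from the generator server (URL pattern from the question)
                    byte[] data = client.DownloadData("http://192.168.0.1/getJPG=" + id);
                    // pick the next disk in round-robin order
                    int disk = (int)((uint)Interlocked.Increment(ref _counter) % DiskRoots.Length);
                    File.WriteAllBytes(Path.Combine(DiskRoots[disk], id + ".jpg"), data);
                }
            });
    }
}

The round-robin counter is the only shared state, so the downloads and writes proceed independently on each disk.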
To do this efficiently, I would multi-thread your application (C++).
The main thread of your application will make these web-requests and push them to the back of a std::list. This is all your main application thread will do.
Spawn a pthread (my preferred threading method, even on Windows...) and keep it running - do not spawn repeatedly - and set it up to check the same std::list in a while loop. In the loop, check the size of the list, and if there are items to be processed, pop the front item off the list and write it to disk (in practice, guard the list with a mutex - std::list is not safe for concurrent modification from different threads).
This will allow you to queue up the responses in memory and at the same time asynchronously save the files to disk. If your server really is as quick as you say it is, you might run out of memory. In that case I would implement some 'waiting' when the number of items to be processed is over a certain threshold, but this will only run a little better than doing it serially.
The real way to 'improve' the speed of this is to have many worker threads (each with their own std::list and 'smart' pushing onto the list with the least items or one std::list shared with a mutex) processing the files. If you have a multi-core machine with multiple hard drives, this will greatly increase the speed of saving these files to disk.
The other solution is to off-load the saving of the files to many different computers as well (if the number of disks on your current computer is limiting the writes). By using a message-passing system such as ZMQ/0MQ, you'd be able to very easily push the saving of files off to different systems (set up in a PULL fashion), with more hard drives accessible than just what is currently on one machine. Using ZMQ makes the round-robin style message passing trivial, as a fan-out architecture is built in and takes literally minutes to implement.
Yet another solution is to create a ramdisk (easily done natively on Linux; for Windows... I've used this). This will allow you to parallelize the writing of the files with as many writers as you want without issue. Then you'd need to make sure to copy those files to a real storage location before you restart, or you'd lose them. But while it's running, you'd be able to store the files in real time without issue.
It probably helps to access the disk sequentially. Here is a simple trick to do this: stream all incoming files into an uncompressed ZIP file (there are libraries for that). This makes all I/O sequential, and there is only one file. You can also split off a new ZIP file after 10,000 images or so to keep the individual ZIPs small.
You can later read all the files back by streaming them out of the ZIP file. There is little overhead, as it is uncompressed.
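If you happen to be on .NET 4.5 or later, System.IO.Compression can do this directly. A minimal sketch of such an uncompressed-ZIP sink (the class and method names are made up for illustration):

using System.IO;
using System.IO.Compression;

class ZipSink
{
    private readonly ZipArchive _archive;

    public ZipSink(string path)
    {
        // ZipArchiveMode.Create writes entries strictly one after another
        _archive = new ZipArchive(File.Create(path), ZipArchiveMode.Create);
    }

    // Not thread-safe: use one ZipSink per writer thread, or lock around Add.
    public void Add(string name, byte[] data)
    {
        ZipArchiveEntry entry = _archive.CreateEntry(name, CompressionLevel.NoCompression);
        using (Stream s = entry.Open())
        {
            s.Write(data, 0, data.Length);
        }
    }

    public void Close()
    {
        _archive.Dispose();
    }
}

Reading back is the reverse: open the archive in ZipArchiveMode.Read and copy each entry's stream out, which again is a sequential pass over one file.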
It sounds like you are trying to write an application which downloads as much content as you can as quickly as possible. You should be aware that when you do this, chances are people will notice as this will suck up a good amount of bandwidth and other resources.
Since this is Windows/NTFS, there are some things you need to keep in mind:
- Do not have more than 2k files in one folder.
- Use async/buffered writes as much as possible.
- Spread over as many disks as you have available for best I/O performance.
One thing that wasn't mentioned that is somewhat important is file size. Since it looks like you are fetching JPEGs, I'm going to assume an average file size of ~50 KB.
I've recently done something like this with an endless stream of ~1KB text files using .Net 4.0 and was able to saturate a 100mbit network controller on the local net. I used the TaskFactory to generate HttpWebRequest threads to download the data to memory streams. I buffered them in memory so I did not have to write them to disk. The basic approach I would recommend is similar - Spin off threads that each make the request, grab the response stream, and write it to disk. The hardest part will be generating the sequential folders and file names. You want to do this as quickly as possible, make it thread safe, and do your bookkeeping in memory to avoid hitting the disk with unnecessary calls for directory contents.
I would not worry about trying to sequence your writes. There are enough layers of the OS/NTFS that will try and do this for you. You should be saturating some piece of your pipe in no time.

Managing the TPL Queue

I've got a service that runs scans of various servers. The networks in question can be huge (hundreds of thousands of network nodes).
The current version of the software uses a queueing/threading architecture designed by us, which works but isn't as efficient as it could be (not least because jobs can spawn child jobs, which isn't handled well).
V2 is coming up and I'm considering using the TPL. It seems like it should be ideally suited.
I've seen this question, the answer to which implies there's no limit to the tasks TPL can handle. In my simple tests (Spin up 100,000 tasks and give them to TPL), TPL barfed fairly early on with an Out-Of-Memory exception (fair enough - especially on my dev box).
The Scans take a variable length of time but 5 mins/task is a good average.
As you can imagine, scans for huge networks can take a considerable length of time, even on beefy servers.
I've already got a framework in place which allows the scan jobs (stored in a Db) to be split between multiple scan servers, but the question is how exactly I should pass work to the TPL on a specific server.
Can I monitor the size of TPL's queue and (say) top it up if it falls below a couple of hundred entries? Is there a downside to doing this?
I also need to handle the situation where a scan needs to be paused. This seems easier to do by not giving the work to the TPL than by cancelling/resetting tasks which may already be partially processed.
All of the initial tasks can be run in any order. Children must be run after the parent has started executing but since the parent spawns them, this shouldn't ever be a problem. Children can be run in any order. Because of this, I'm currently envisioning that child tasks be written back to the Db not spawned directly into TPL. This would allow other servers to "work steal" if required.
Has anyone had any experience with using the TPL in this way? Are there any considerations I need to be aware of?
TPL is about starting small units of work and running them in parallel. It is not about monitoring, pausing, or throttling this work.
You should see TPL as a low-level tool to start "work" and to synchronize threads.
Key point: TPL tasks != logical tasks. Logical tasks are, in your case, scan tasks ("scan an IP range from x to y"). Such a task should not correspond to a physical System.Threading.Tasks.Task, because the two are different concepts.
You need to schedule, orchestrate, monitor and pause the logical tasks yourself because TPL does not understand them and cannot be made to.
Now the more practical concerns:
TPL can certainly start 100k tasks without OOM. The OOM happened because your tasks' code exhausted memory.
Scanning networks sounds like a great case for asynchronous code because while you are scanning you are likely to wait on results while having a great degree of parallelism. You probably don't want to have 500 threads in your process all waiting for a network packet to arrive. Asynchronous tasks fit well with the TPL because every task you run becomes purely CPU-bound and small. That is the sweet spot for TPL.
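As a concrete illustration of "orchestrate the logical tasks yourself", here is a hedged sketch that keeps a bounded number of asynchronous scan tasks in flight. ScanJob, ScanRangeAsync and the limit of 200 are placeholders, not anything from your codebase:

using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

class ScanOrchestrator
{
    // Cap on in-flight TPL tasks; the logical jobs live in your Db, not in the TPL.
    private readonly SemaphoreSlim _slots = new SemaphoreSlim(200);

    public async Task RunAsync(IEnumerable<ScanJob> jobs, CancellationToken ct)
    {
        var running = new List<Task>();
        foreach (ScanJob job in jobs)
        {
            await _slots.WaitAsync(ct);        // wait here when 200 tasks are already running
            running.Add(ProcessAsync(job, ct));
        }
        await Task.WhenAll(running);           // rethrows any scan exceptions
    }

    private async Task ProcessAsync(ScanJob job, CancellationToken ct)
    {
        try
        {
            await ScanRangeAsync(job, ct);     // your real, mostly I/O-bound scan goes here
        }
        finally
        {
            _slots.Release();                  // free a slot so the next job can start
        }
    }

    private Task ScanRangeAsync(ScanJob job, CancellationToken ct)
    {
        return Task.Delay(100, ct);            // placeholder for the actual scan
    }
}

class ScanJob
{
    public string IpRange;
}

Pausing then amounts to simply not feeding more jobs in (and optionally cancelling via the token), which matches your preference for not cancelling partially processed tasks.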

Reducing seek times when reading many small files

I need to write some code (in any language) to process 10,000 files that reside on a local Linux filesystem. Each file is ~500KB in size, and consists of fixed-size records of 4KB each.
The processing time per record is negligible, and the records can be processed in any order, both within and across different files.
A naïve implementation would read the files one by one, in some arbitrary order. However, since my disks are very fast to read but slow to seek, this will almost certainly produce code that's bound by disk seeks.
Is there any way to code the reading up so that it's bound by disk throughput rather than seek time?
One line of inquiry is to try and get an approximate idea of where the files reside on disk, and use that to sequence the reads. However, I am not sure what API could be used to do that.
I am of course open to any other ideas.
The filesystem is ext4, but that's negotiable.
Perhaps you could do the reads by scheduling all of them in quick succession with aio_read. That would put all reads in the filesystem read queue at once, and then the filesystem implementation is free to complete the reads in a way that minimizes seeks.
A very simple approach, although no results are guaranteed: open as many of the files at once as you can and read from all of them at once, either using threads or asynchronous I/O. This way the disk scheduler knows what you are reading and can reduce the seeks by itself. Edit: as wildplasser observes, parallel open() is probably only doable using threads, not async I/O.
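The question allows any language, so purely as an illustration (here in C#), this is roughly what "read everything at once and let the scheduler sort it out" looks like; the directory, parallelism level and ProcessRecords are assumptions:

using System.IO;
using System.Threading.Tasks;

class BulkReader
{
    static void Main()
    {
        // Many concurrent readers keep the kernel's I/O queue full,
        // so the elevator/scheduler has a chance to reorder reads and cut seeks.
        Parallel.ForEach(
            Directory.EnumerateFiles("/data/files"),
            new ParallelOptions { MaxDegreeOfParallelism = 32 },   // deliberately more than the core count: this is I/O-bound
            path =>
            {
                byte[] data = File.ReadAllBytes(path);             // ~500KB per file, fine to read whole
                ProcessRecords(data);                              // hypothetical per-record processing
            });
    }

    static void ProcessRecords(byte[] data)
    {
        // process the fixed-size 4KB records here
    }
}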
The alternative is to try to do the heavy lifting yourself. Unfortunately this involves a difficult step - getting the mapping of the files to physical blocks. There is no standard interface to do that, you could probably extract the logic from something like ext2fsprogs or the kernel FS driver. And this involves reading the physical device underlying a mounted filesystem, which can be writing to it at the same time you're trying to get a consistent snapshot.
Once you get the physical blocks, just order them, reverse the mapping back to the file offsets and execute the reads in the physical block order.
Could you use an SSD for the file storage? That should reduce seek times greatly, as there's no head to move.
Since the operations are similar and the data are independent, you can try using a thread pool to submit jobs that each work on a number of files (it can be a single file). Then you can have an idle thread complete a single job. This might help overlap I/O operations with execution.
A simple way would be to keep the original program, but fork an extra process which has no other task than to prefetch the files and prime the disk buffer cache (a unix/linux system uses all "free" memory as disk buffer).
The main task will stay a few files behind (say ten). The hard part would be to keep things synchronised. A pipe seems the obvious way to accomplish this.
UPDATE:
Pseudo code for the main process:
1. fetch filename from worklist
   if empty goto 2.
   (maybe) fork a worker process or thread
   add to prefetch queue
   add to internal queue
   if fewer than XXX items on internal queue goto 1
2. fetch filename from internal queue
   process it
   goto 1
For the slave processes:
fetch from queue
if empty: quit
prefetch file
loop or quit
For the queue, a message queue seems most appropriate, since it maintains message boundaries. Another way would be to have one pipe per child (in the fork() case) or to use mutexes (when using threads).
You'll need approximately seektime_per_file / processing_time_per_file worker threads/processes.
As a simplification: if seeking the files is not required (only sequential access), the slave processes could consist of the equivalent of
dd if=name bs=500K
, which could be wrapped into a popen() or a pipe+fork().

C# TPL Tasks - How many at one time

I'm learning how to use the TPL for parallelizing an application I have. The application processes ZIP files, extracting all of the files held within them and importing the contents into a database. There may be several thousand ZIP files waiting to be processed at a given time.
Am I right in kicking off a separate task for each of these ZIP files or is this an inefficient way to use the TPL?
Thanks.
This seems like a problem better suited for worker threads (separate thread for each file) managed with the ThreadPool rather than the TPL. TPL is great when you can divide and conquer on a single item of data but your zip files are treated individually.
Disk I/O is going to be your bottleneck, so I think you will need to throttle the number of jobs running simultaneously. It's simple to manage this with worker threads, but I'm not sure how much control you have (if any) over how much parallelism goes on at once with the parallel for/foreach, which could choke your process and actually slow it down.
Anytime that you have a long running process, you can typically gain additional performance on multi-processor systems by making different threads for each input task. So I would say that you are most likely going down the right path.
I would have thought that this would depend on whether the process is limited by CPU or by disk. If the process is limited by disk, I'd have thought it might be a bad idea to kick off too many threads, since the various extractions might just compete with each other.
This feels like something you might need to measure to get the correct answer for what's best.
I have to disagree with certain statements here guys.
First of all, I do not see any difference between the ThreadPool and Tasks in coordination or control. Especially since tasks run on the ThreadPool, you have easy control over tasks, and exceptions are nicely propagated to the caller during await or when awaiting Task.WhenAll(tasks), etc.
Second, I/O won't necessarily be the only bottleneck here; depending on the data and the level of compression, the zipping is most likely going to take more time than reading the file from the disk.
It can be thought of in many ways, but I would go for something like the number of CPU cores, or a little less.
Load the file paths into a ConcurrentQueue and then allow the running tasks to dequeue file paths, load the files, zip them, and save them.
From there you can tweak the number of cores and play with load balancing.
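A minimal sketch of that queue-plus-worker-tasks shape, assuming .NET 4.5 for Task.Run and ZipFile; the paths and ImportToDatabase are placeholders for illustration:

using System;
using System.Collections.Concurrent;
using System.IO;
using System.IO.Compression;
using System.Linq;
using System.Threading.Tasks;

class ZipImporter
{
    static void Main()
    {
        // load all pending ZIP paths up front
        var queue = new ConcurrentQueue<string>(Directory.EnumerateFiles(@"C:\inbox", "*.zip"));

        // roughly one worker task per core, as suggested above
        Task[] workers = Enumerable.Range(0, Environment.ProcessorCount)
            .Select(i => Task.Run(() =>
            {
                string path;
                while (queue.TryDequeue(out path))
                {
                    string target = Path.Combine(@"C:\extracted", Path.GetFileNameWithoutExtension(path));
                    ZipFile.ExtractToDirectory(path, target);
                    ImportToDatabase(target);            // placeholder for the actual import
                }
            }))
            .ToArray();

        Task.WaitAll(workers);   // any worker exception is rethrown here
    }

    static void ImportToDatabase(string folder)
    {
        // import the extracted contents into the database
    }
}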
I do not know if ZIP supports file partitioning during compression, but in some advanced/complex cases it could be a good idea, especially on large files...
Wow, this is a 6-year-old question, bummer! I had not noticed... :)

Threads or asynch?

How do you make your application multithreaded?
Do you use asynch functions?
Or do you spawn a new thread?
I think that asynch functions are already spawning a thread, so if your job is just doing some file reading, being lazy and just spawning your job on a thread would just "waste" resources...
So is there some kind of design rule for when to use threads or asynch functions?
If you are talking about .Net, then don't forget the ThreadPool. The thread pool is also what asynch functions often use. Spawning too many threads can actually hurt your performance. A thread pool is designed to spawn just enough threads to do the work the fastest. So do use a thread pool instead of spawning your own threads, unless the thread pool doesn't meet your needs.
PS: and keep an eye on the Parallel Extensions from Microsoft.
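For illustration, a tiny sketch of handing work to the ThreadPool rather than creating threads yourself; the work items are obviously just placeholders:

using System;
using System.Threading;

class PoolExample
{
    static void Main()
    {
        using (var done = new CountdownEvent(10))
        {
            for (int i = 0; i < 10; i++)
            {
                int n = i;   // copy the loop variable for the closure
                ThreadPool.QueueUserWorkItem(state =>
                {
                    Console.WriteLine("Working on item " + n);   // stand-in for real work
                    done.Signal();
                });
            }
            done.Wait();   // block until all queued items have finished
        }
    }
}

The pool decides how many threads actually run, which is the point of the advice above.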
Spawning threads is only going to waste resources if you start spawning tons of them; one or two extra threads isn't going to affect the platform's performance. In fact, System currently has over 70 threads for me, and MSN is using 32 (I really have no idea how a messenger can use that many threads, especially when it's minimised and not really doing anything...).
Usually a good time to spawn a thread is when something will take a long time, but you need to keep doing something else.
E.g. say a calculation will take 30 seconds. The best thing to do is spawn a new thread for the calculation, so that you can continue to update the screen and handle any user input, because users will hate it if your app freezes until it has finished doing the calculation.
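A rough sketch of that case (console-flavoured, with LongCalculation standing in for the real 30-second job):

using System;
using System.Threading;

class ResponsiveApp
{
    static void Main()
    {
        var worker = new Thread(() =>
        {
            double result = LongCalculation();       // the ~30 second job
            Console.WriteLine("Result: " + result);
        });
        worker.IsBackground = true;
        worker.Start();

        while (worker.IsAlive)
        {
            // the main thread stays free to repaint the screen / handle input
            Console.Write(".");
            Thread.Sleep(500);
        }
    }

    static double LongCalculation()
    {
        Thread.Sleep(30000);    // placeholder for the real calculation
        return 42.0;
    }
}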
On the other hand, creating threads to do something that can be done almost instantly is nearly pointless, since the overhead of creating a thread (or even just passing work to an existing thread using a thread pool) will be higher than just doing the job in the first place.
Sometimes you can break your app into a couple of separate parts which run in their own threads. For example, in games the updates/physics etc. may be one thread, while graphics is another, sound/music is a third, and networking is another. The problem here is that you really have to think about how these parts will interact, or else you may get worse performance, bugs that happen seemingly "randomly", or it may even deadlock.
I'll second Fire Lancer's answer - creating your own threads is an excellent way to process big tasks or to handle a task that would otherwise be "blocking" to the rest of a synchronous app, but you have to have a clear understanding of the problem that you must solve, and develop in a way that clearly defines the task of a thread and limits the scope of what it does.
For an example I recently worked on - a Java console app runs periodically to capture data by essentially screen-scraping urls, parsing the document with DOM, extracting data and storing it in a database.
As a single-threaded application, it took an age, as you would expect, averaging around 1 URL a second for a 50 KB page. Not too bad, but when you scale out to needing to process thousands of URLs in a batch, it's no good.
Profiling the app showed that most of the time the active thread was idle - it was waiting for I/O operations - opening of a socket to the remote URL, opening a connection to the database etc. It's this sort of situation that can easily be improved with multithreading. Rewriting to be multi-threaded and with just 5 threads instead of one, even on a single core cpu, gave an increase in throughput of over 20 times.
In this example, each "worker" thread was explicitly limited in what it did - open a remote URL, parse the data, store it in the db. All the "high level" processing - generating the list of URLs to parse, working out which to do next, handling errors - remained under the control of the main thread.
The use of threads makes you think more about the way your application needs threading and can in the long run make it easier to improve / control your performance.
Async methods are faster to use but they are a bit magic - a lot of things happen to make them possible - so it's probable that at some point you will need something that they can't give you. Then you can try and roll some custom threading code.
It all depends on your needs.
The answer is "it depends".
It depends on what you're trying to achieve. I'm going to assume that you're aiming for more performance.
The simplest solution is to find another way to improve your performance. Run a profiler. Look for hot spots. Reduce unnecessary IO.
The next solution is to break your program into multiple processes, each of which can run in their own address space. This is easiest because there is no chance of the individual processes messing each other up.
The next solution is to use threads. At this point you're opening a major can of worms, so start small, and only multi-thread the critical path of the code.
The next solution is to use asynch IO. This is generally only recommended for people writing some sort of very heavily loaded server, and even then I would rather re-use one of the existing frameworks that abstract away the details, e.g. the C++ framework ICE, or an EJB server under Java.
Note that each of these solutions has multiple sub-solutions - there are different breeds of threads and different kinds of asynch IO, each with slightly different performance characteristics, but again, it's generally best to let the framework handle it for you.

Resources