TensorFlow: More than one thread in shuffle_batch for single sample files


I'm trying to understand the significance of using num_threads>1 in tf.train.shuffle_batch connected to a tf.WholeFileReader reading image files (each file contains a single data sample). Will setting num_threads>1 make any difference in this case compared to num_threads=1? What are the mechanics of the file and batch queues in this case?

A short answer: it will probably make execution faster. Here is an authoritative explanation from the guide:
... single reader via the tf.train.shuffle_batch with num_threads bigger than 1. This will make it read from a single file at the same time (but faster than with 1 thread), instead of N files at once. This can be important:
- If you have more reading threads than input files, to avoid the risk that you will have two threads reading the same example from the same file near each other.
- Or if reading N files in parallel causes too many disk seeks.
How many threads do you need? The tf.train.shuffle_batch* functions add a summary to the graph that indicates how full the example queue is. If you have enough reading threads, that summary will stay above zero.
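To make the queue mechanics a bit more concrete, here is a plain-Python analogy (it is not TensorFlow code, and run_pipeline and its parameters are made-up names): a filename queue feeds num_threads reader threads, each "file" yields exactly one sample, and the samples land in a bounded example queue that the consumer drains.

```python
import queue
import threading


def run_pipeline(filenames, num_threads, capacity=8):
    """Simulate a filename queue feeding reader threads that fill an example queue."""
    filename_q = queue.Queue()
    for name in filenames:
        filename_q.put(name)
    example_q = queue.Queue(maxsize=capacity)  # bounded, like the example queue

    def reader():
        while True:
            try:
                name = filename_q.get_nowait()
            except queue.Empty:
                return  # no files left for this reader
            # Stand-in for a whole-file reader: one file gives exactly one sample.
            example_q.put(("sample", name))

    threads = [threading.Thread(target=reader) for _ in range(num_threads)]
    for t in threads:
        t.start()

    # The consumer (the training loop in the real setting) drains the example queue.
    samples = [example_q.get() for _ in filenames]
    for t in threads:
        t.join()
    return samples
```

With one sample per file, extra threads cannot split a file between them; they only let several files be read and enqueued concurrently, which helps when a single reader cannot keep the example queue from running empty.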

Related

Which one I should use in Clojure? go block or thread?

I want to understand the intrinsic difference between a thread and a long-running go block in Clojure. In particular, I want to figure out which one I should use in my context.
I understand that a go block runs on a shared thread pool whose default size is 8, whereas thread creates a new thread.
In my case, there is an input stream that takes values from somewhere, and each value is taken as an input. Some calculations are performed and the result is put onto a result channel. In short, we have an input and an output channel, and the calculation is done in a loop. To achieve concurrency, I have two choices: use a go block or use thread.
I wonder what the intrinsic difference between these two is. (We may assume there is no I/O during the calculations.) The sample code looks like the following:
(go-loop []
  (when-let [input (<! input-stream)]
    ... ; calculations here
    (>! result-chan result)
    (recur))) ; recur only while the channel yields values

(thread
  (loop []
    (when-let [input (<!! input-stream)]
      ... ; calculations here
      (put! result-chan result)
      (recur)))) ; stop looping once the channel is closed
I realize that the number of threads that can run simultaneously is at most the number of CPU cores. In that case, is there no difference between go blocks and threads if I create more than 8 of either?
I could try to measure the difference in performance on my own laptop, but the production environment is quite different from it, so I could draw no conclusions.
By the way, the calculation is not so heavy. If the inputs are not so large, 8,000 loops can be run in 1 second.
Another consideration is whether go-block vs thread will have an impact on GC performance.

There are a few things to note here.
Firstly, the thread pool that threads are created on via clojure.core.async/thread is what is known as a cached thread pool, meaning that although it will re-use recently used threads inside that pool, it's essentially unbounded. That of course means it could potentially hog a lot of system resources if left unchecked.
But given that what you're doing inside each asynchronous process is very lightweight, threads seem a little overkill to me. Of course, it's also important to take into account the quantity of items you expect to hit the input stream: if this number is large, you could potentially overwhelm core.async's thread pool for go macros, to the point where you're waiting for a thread to become available.
You also didn't mention precisely where you're getting the input values from. Are the inputs some fixed data set that remains constant from the start of the program, or are inputs continuously fed into the input stream from some source over time?
If it's the former then I would suggest you lean more towards transducers and I would argue that a CSP model isn't a good fit for your problem since you aren't modelling communication between separate components in your program, rather you're just processing data in parallel.
If it's the latter then I presume you have some other process that's listening to the result channel and doing something important with those results, in which case I would say your usage of go-blocks is perfectly acceptable.

Perl threads to execute a sybase stored proc parallel

I have written a sybase stored procedure to move data from certain tables[~50] on primary db for given id to archive db. Since it's taking a very long time to archive, I am thinking to execute the same stored procedure in parallel with unique input id for each call.
I manually ran the stored proc twice at same time with different input and it seems to work. Now I want to use Perl threads[maximum 4 threads] and each thread execute the same procedure with different input.
Please advise if this is recommended way or any other efficient way to achieve this. If the experts choice is threads, any pointers or examples would be helpful.

What you do in Perl does not really matter here: what matters is what happens on the side of the Sybase server. Assuming each client task creates its own connection to the database, it's all fine, and how the client achieves this makes no difference to the Sybase server. But do not use a model where the different client tasks try to use the same client-server connection, as that will never run in parallel.
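The same one-connection-per-task rule applies whatever the client language is. As a rough sketch in Python rather than Perl: the connect factory, the procedure name archive_proc and the "?" parameter marker are all placeholders (the marker style depends on the driver), so treat this as an outline only.

```python
from concurrent.futures import ThreadPoolExecutor


def archive_ids(ids, connect, max_workers=4):
    """Run the archive procedure once per id, each call on its own connection.

    `connect` is a hypothetical zero-argument factory returning a DB-API style
    connection; the procedure name used below is a placeholder as well.
    """
    def archive_one(record_id):
        conn = connect()  # private connection, never shared between workers
        try:
            cur = conn.cursor()
            cur.execute("exec archive_proc ?", (record_id,))
            conn.commit()
            return record_id
        finally:
            conn.close()

    # At most max_workers procedure calls are in flight at any time.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(archive_one, ids))
```

Capping the pool at 4 workers mirrors the limit mentioned in the question; whether 4 is actually a sensible number depends on the server-side considerations.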

No 'answer' per se, but some questions/comments:
Can you quantify taking a very long time to archive? Assuming your archive process consists of a mix of insert/select and delete operations, do query plans and MDA data show fast, efficient operations? If you're seeing table scans, sort merges, deferred inserts/deletes, etc ... then it may be worth the effort to address said performance issues.
Can you expand on the comment that running two stored proc invocations at the same time seems to work? Again, any sign of performance issues for the individual proc calls? Any sign of contention (eg, blocking) between the two proc calls? If the archival proc isn't designed properly for parallel/concurrent operations (eg, eliminate blocking), then you may not be gaining much by running multiple procs in parallel.
How many engines does your dataserver have, and are you planning on running your archive process during a period of moderate-to-heavy user activity? If the current archive process runs at/near 100% cpu utilization on a single dataserver engine, then spawning 4 copies of the same process could see your archive process tying up 4 dataserver engines with heavy cpu utilization ... and if your dataserver doesn't have many engines ... combined with moderate-to-heavy user activity at the same time ... you could end up invoking the wrath of your DBA(s) and users. Net result is that you may need to make sure your archive process does not hog the dataserver.
One other item to consider, and this may require input from the DBAs ... if you're replicating out of either database (source or archive), increasing the volume of transactions per a given time period could have a negative effect on replication throughput (ie, an increase in replication latency); if replication latency needs to be kept at a minimum, then you may want to rethink your entire archive process from the point of view of spreading out transactional activity enough so as to not have an effect on replication latency (eg, single-threaded archive process that does a few insert/select/delete operations, sleeps a bit, then does another batch, then sleeps, ...).
It's been my experience that archive processes are not considered high-priority operations (assuming they're run on a regular basis, and before the source db fills up); this in turn means the archive process is usually designed so that it's efficient while at the same time putting a (relatively) light load on the dataserver (think: running as a trickle in the background) ... ymmv ...

MPI I/O, matching processes to files

I have a number of files, say 100, and a number of processors, say 1000. Each proc needs to read parts of some subset of files. For instance, proc 3 needs file04.dat, file05.dat, and file09.dat, while proc 4 needs file04.dat, file07.dat, and file08.dat, etc. Which files are needed by which procs is not known at compile time and cannot be determined from any algorithm, but is easily determined at runtime from an existing metadata file.
I am trying to determine the best way to do this, using MPI I/O. It occurs to me that I could just have all the procs cycle through the files they need, calling MPI_File_open with MPI_COMM_SELF as the communicator argument. However, I'm a beginner with MPI I/O, and I suspect this would create some problems with large numbers of procs or files. Is this the case?
I have also thought that perhaps the thing to do would be to establish a separate communicator for each file, and each processor that needs a particular file would be a member of the file's associated communicator. But here, first, would that be a good idea? And second, I'm not an expert on communicators either and I can't figure out how to set up the communicators in this manner. Any ideas?
And if anyone has a completely different idea that would work better, I would be glad to hear it.

What are the general design ideas for a thread-safe read-compute-write program based on its single-threaded version?

Consider that the sequential version of the program already exists and implements a sequence of "read-compute-write" operations on a single input file and a single output file. The "read" and "write" operations are performed by 3rd-party library functions which are hard (but possible) to modify, while the "compute" function is performed by the program itself. The read/write library functions seem not to be thread-safe, since they operate with internal flags and internal memory buffers.
It was discovered that the program is CPU-bound, and it is planned to improve it by taking advantage of multiple CPUs (up to 80) by designing a multi-processor version of the program and using OpenMP for that purpose. The idea is to instantiate multiple "compute" functions with the same single input and single output.
It is obvious that something needs to be done to ensure consistent access to reads, data transfers, computations and data storage. Possible solutions are: (hard) rewrite the IO library functions in a thread-safe manner, (moderate) write a thread-safe wrapper for the IO functions that would also serve as a data cache.
Are there any general patterns that cover converting, wrapping or rewriting single-threaded code to comply with OpenMP thread-safety assumptions?
EDIT1: The program is fresh enough for changes to make it multi-threaded (or, generally a parallel one, implemented either by multi-threading, multi-processing or other ways).

As a quick response: if you are processing a single file and writing to another, with OpenMP it's easy to convert the sequential version of the program to a multi-threaded version without taking too much care about the IO part, provided that the compute algorithm itself can be parallelized.
This is true because usually the main thread takes care of the IO. If this cannot be achieved because the chunks of data are too big to read at once, and the compute algorithm cannot process smaller chunks, you can use the OpenMP API to synchronize the IO in each thread. This does not mean that the whole application will stop or wait until the other threads finish computing so new data can be read or written; it means that only the read and write parts need to be done atomically.
For example, if the flow of your sequential application is as follows:
1) Read
2) compute
3) Write
Given that it truly can be parallelized, and each chunk of data needs to be read from within each thread, each thread could follow this design:
1) Synchronized read of a chunk from the input (only one thread at a time can execute this section)
2) Compute the chunk of data (done in parallel)
3) Synchronized write of the computed chunk to the output (only one thread at a time can execute this section)
If you need to write the chunks in the same order you read them, you need to buffer first, or adopt a different strategy like fseek to the correct position, but that really depends on whether the output file size is known from the start, ...
Pay special attention to the OpenMP scheduling strategy, because the default may not be the best for your compute algorithm. And if you need to share results between threads, like the offset of the input file you have read, you may use the reduction operations provided by the OpenMP API, which is way more efficient than making a single part of your code run atomically between all threads just to update a global variable; OpenMP knows when it's safe to write.
EDIT:
With regard to the "read, process, write" operation, as long as you keep each read and write atomic between all workers, I can't think of any reason you'd run into trouble. Even when the data read is stored in an internal buffer, with every worker accessing it atomically, that data is acquired in the exact same order. You only need to pay special attention when saving a chunk to the output file, because you don't know the order in which the workers will finish processing their assigned chunks, so you could have a chunk ready to be saved that was read after others that are still being processed. You just need each worker to keep track of the position of each chunk, and you can keep a list of pointers to chunks that need to be saved until you have a sequence of finished chunks following the last one saved to the output file. Some additional care may need to be taken here.
If you are worried about the internal buffer itself (keeping in mind that I don't know the library you are talking about, so I can be wrong): if you make a request for some chunk of data, that internal buffer should only be modified after you requested that data and before the data is returned to you; and as you made that request atomically (meaning that every other worker has to wait in line for its turn), when the next worker asks for its piece of data, that internal buffer should be in the same state as when the last worker received its chunk. Even if the library specifically says it returns a pointer to a position in the internal buffer and not a copy of the chunk itself, you can make a copy into the worker's memory before releasing the lock on the whole atomic read operation.
If the pattern I suggested is followed correctly, I really don't think you would find any problem you wouldn't find in the same sequential version of the algorithm.
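The pattern can be sketched outside OpenMP as well. The following is a minimal Python illustration using plain threads and locks instead of OpenMP constructs; process_in_order, read_chunk, compute and write_chunk are made-up names standing in for the library's read/write calls and the program's own compute step. Each chunk gets a sequence number under the read lock, and finished chunks are held back until every earlier one has been written.

```python
import threading


def process_in_order(read_chunk, compute, write_chunk, num_workers=4):
    """Serialized reads, concurrent compute, serialized writes in read order.

    read_chunk() must return the next chunk, or None once the input is
    exhausted (and keep returning None afterwards).
    """
    read_lock = threading.Lock()
    write_lock = threading.Lock()
    state = {"next_read": 0, "next_write": 0}
    finished = {}  # sequence number -> computed chunk waiting for its turn

    def worker():
        while True:
            with read_lock:  # only one thread reads at a time
                chunk = read_chunk()
                if chunk is None:
                    return
                seq = state["next_read"]
                state["next_read"] += 1
            result = compute(chunk)  # runs concurrently across workers
            with write_lock:  # only one thread writes at a time
                finished[seq] = result
                # Flush every chunk that is now contiguous with the last one written.
                while state["next_write"] in finished:
                    write_chunk(finished.pop(state["next_write"]))
                    state["next_write"] += 1

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```

If the read call hands back a pointer into an internal buffer, the copy into worker-owned memory would go inside the read lock, before it is released.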

With a little synchronisation you can go even further. Consider something like this:
#pragma omp parallel sections num_threads(3) /* one thread per section */
{
    #pragma omp section
    {
        input();
        notify_read_complete();
    }
    #pragma omp section
    {
        wait_read_complete();
        /* nested parallelism may need to be enabled for this inner region */
        #pragma omp parallel num_threads(N)
        {
            do_compute_with_threads();
        }
        notify_compute_complete();
    }
    #pragma omp section
    {
        wait_compute_complete();
        output();
    }
}
So, the basic idea would be that input() and output() read/write chunks of data. The compute part then would work on a chunk of data while the other threads are reading/writing. It will take a bit of manual synchronization work in notify*() and wait*(), but that's not magic.
Cheers,
-michael

Speed Up with multithreading

I have a parse method in my program which first reads a file from disk, then parses the lines and creates an object for every line. For every file, a collection with the objects from the lines is saved afterwards. The files are about 300MB.
This takes about 2.5-3 minutes to complete.
My question: can I expect a significant speed-up if I split the tasks up into one thread just reading files from disk, another parsing the lines and a third saving the collections? Or would this maybe slow down the process?
How long does a modern notebook hard disk typically take to read 300MB? I think the bottleneck in my task is the CPU, because when I execute the method, one core of the CPU is always at 100% while the disk is idle more than half of the time.
greetings, rain
EDIT:
private CANMessage parseLine(String line)
{
    try
    {
        CANMessage canMsg = new CANMessage();
        int offset = 0;
        int offset_add = 0;
        char[] delimiterChars = { ' ', '\t' };
        string[] elements = line.Split(delimiterChars);
        if (!isMessageLine(ref elements))
        {
            return null;
        }
        offset = getPositionOfFirstWord(ref elements);
        canMsg.TimeStamp = Double.Parse(elements[offset]);
        offset += 3;
        offset_add = getOffsetForShortId(ref elements, ref offset);
        canMsg.ID = UInt16.Parse(elements[offset], System.Globalization.NumberStyles.HexNumber);
        offset += 17; // for signs between identifier and data length number
        canMsg.DataLength = Convert.ToInt16(elements[offset + offset_add]);
        offset += 1;
        parseDataBytes(ref elements, ref offset, ref offset_add, ref canMsg);
        return canMsg;
    }
    catch (Exception exp)
    {
        MessageBox.Show(line);
        MessageBox.Show(exp.Message + "\n\n" + exp.StackTrace);
        return null;
    }
}
So this is the parse method. It works this way, but maybe you are right and it is inefficient. I have .NET Framework 4.0 and I am on Windows 7. I have a Core i7 where every core has Hyper-Threading, so I am only using about 1/8 of the CPU.
EDIT2: I am using Visual Studio 2010 Professional. It looks like the tools for performance profiling are not available in this version (according to the MSDN Beginners Guide to Performance Profiling).
EDIT3: I have now changed the code to use threads. It looks like this:
foreach (string str in checkedListBoxImport.CheckedItems)
{
    toImport.Add(str);
}
for (int i = 0; i < toImport.Count; i++)
{
    String newString = new String(toImport.ElementAt(i).ToArray());
    Thread t = new Thread(() => importOperation(newString));
    t.Start();
}
The parsing you saw above is called in importOperation(...).
With this code it was possible to reduce the time from about 2.5 minutes to "only" 40 seconds. I got some concurrency problems I have to track down, but at least this is much faster than before.
Thank you for your advice.

It's unlikely that you are going to get consistent metrics for laptop hard disk performance, as we have no idea how old your laptop is, nor do we know if the disk is solid state or spinning.
Considering you have already done some basic profiling, I'd wager the CPU really is your bottleneck, as it is impossible for a single-threaded application to use more than 100% of a single CPU. This is of course ignoring your operating system splitting the process over multiple cores and other oddities. If you were getting 5% CPU usage instead, you would most likely be bottlenecked on IO.
That said, your best bet would be to create a new thread task for each file you are processing and send that to a pooled thread manager. Your thread manager should limit the number of threads you are running to either the number of cores you have available or, if memory is an issue (you did say you were processing 300MB files after all), the maximum amount of RAM you can use for the process.
Finally, to answer why you don't want to use a separate thread for each operation, consider what you already know about your performance bottlenecks. You are bottlenecked on CPU processing and not IO. This means that if you split your application into separate threads, your read and write threads would be starved most of the time waiting for your processing thread to finish. Additionally, even if you made them process asynchronously, you have the very real risk of running out of memory as your read thread continues to consume data that your processing thread can't keep up with.
Thus, be careful not to start each thread immediately; let them instead be managed by some form of blocking queue. Otherwise you run the risk of slowing your system to a crawl as you spend more time in context switches than processing. This is of course assuming you don't crash first.
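As a rough illustration of the pooled approach (in Python rather than C#, with import_files and import_one as made-up names for the per-file work), the pool below queues one task per file but never runs more than max_workers of them at once, defaulting to the core count:

```python
import os
from concurrent.futures import ThreadPoolExecutor


def import_files(paths, import_one, max_workers=None):
    """Run import_one(path) for every path on a pool capped at the core count."""
    if max_workers is None:
        max_workers = os.cpu_count() or 1
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() queues the remaining files; at most max_workers run at once,
        # and results come back in the same order as the input paths.
        return list(pool.map(import_one, paths))
```

One caveat specific to this sketch: in CPython, threads running pure-Python CPU-bound code are serialized by the global interpreter lock, so there a ProcessPoolExecutor would be the closer equivalent; the point here is only the cap on concurrent tasks.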

It's unclear how many of these 300MB files you've got. A single 300MB file takes about 5 or 6 seconds to read on my netbook, with a quick test. It does indeed sound like you're CPU-bound.
It's possible that threading will help, although it's likely to complicate things significantly of course. You should also profile your current code - it may well be that you're just parsing inefficiently. (For example, if you're using C# or Java and you're concatenating strings in a loop, that's frequently a performance "gotcha" which can be easily remedied.)
If you do opt for a multi-threaded approach, then to avoid thrashing the disk, you may want to have one thread read each file into memory (one at a time) and then pass that data to a pool of parsing threads. Of course, that assumes you've also got enough memory to do so.
If you could specify the platform and provide your parsing code, we may be able to help you optimize it. At the moment all we can really say is that yes, it sounds like you're CPU bound.

That long for only 300 MB is bad.
Different things could be impacting performance depending on the situation. Typically, reading from the hard disk is the biggest bottleneck unless something intense is going on during the parsing, which seems to be the case here, because it only takes several seconds to read 300MB from a hard disk (unless it is badly fragmented, maybe).
If you have some inefficient algorithm in the parsing, then picking or coming up with a better algorithm would probably be more beneficial. If you absolutely need that algorithm and there's no algorithmic improvement available, it sounds like you might be stuck.
Also, don't try to read and write at the same time with multithreading; you'll likely slow things way down due to increased seeking.

Given that you think this is a CPU bound task, you should see some overall increase in throughput with separate IO threads (since otherwise your only processing thread would block waiting for IO during disk read/write operations).
Interestingly I had a similar issue recently and did see a significant net improvement by running separate IO threads (and enough calculation threads to load all CPU cores).
You don't state your platform, but I used the Task Parallel Library and a BlockingCollection for my .NET solution and the implementation was almost trivial. MSDN provides a good example.
UPDATE:
As Jon notes, the time spent on IO is probably small compared to the time spent calculating, so while you can expect an improvement, the best use of time may be profiling and improving the calculation itself. Using multiple threads for the calculation will speed it up significantly.

Hmm.. 300MB of lines that have to be split up into a lot of CAN message objects - nasty! I suspect the trick might be to thread off the message assembly while avoiding excessive disk-thrashing between the read and write operations.
If I was doing this as a 'fresh' requirement, (and of course, with my 20/20 hindsight, knowing that CPU was going to be the problem), I would probably use just one thread for reading, one for writing the disk and, initially at least, one thread for the message object assembly. Using more than one thread for message assembly means the complication of resequencing the objects after processing to prevent the output file being written out-of-order.
I would define a nice disk-friendly sized chunk-class of lines and message-object array instances, say 1024 of them, and create a pool of chunks at startup, 16 say, and shove them onto a storage queue. This controls and caps memory use, greatly reduces new/dispose/malloc/free, (looks like you have a lot of this at the moment!), improves the efficiency of the disk r/w operations as only large r/w are performed, (except for the last chunk which will be, in general, only partly filled), provides inherent flow-control, (the read thread cannot 'run away' because the pool will run out of chunks and the read thread will block on the pool until the write thread returns some chunks), and inhibits excess context-switching because only large chunks are processed.
The read thread opens the file, gets a chunk from the queue, reads the disk, parses into lines and shoves the lines into the chunk. It then queues the whole chunk to the processing thread and loops around to get another chunk from the pool. Possibly, the read thread could, on start or when idle, be waiting on its own input queue for a message class instance that contains the read/write filespecs. The write filespec could be propagated through a field of the chunks, so supplying the write thread with everything it needs via the chunks. This makes a nice subsystem to which filespecs can be queued and it will process them all without any further intervention.
The processing thread gets chunks from its input queue, splits the lines up into the message objects in the chunk, and then queues the completed, whole chunks to the write thread.
The write thread writes the message objects to the output file and then requeues the chunk to the storage pool queue for re-use by the read thread.
All the queues should be blocking producer-consumer queues.
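A minimal sketch of this three-thread design, written in Python for brevity; run_chunk_pipeline, parse and write are made-up names, and plain lists stand in for the chunk class. The fixed pool of chunk buffers is what caps memory use and stops the reader from running ahead.

```python
import queue
import threading

DONE = object()  # sentinel marking the end of the input


def run_chunk_pipeline(lines, parse, write, chunk_size=1024, pool_size=16):
    pool_q = queue.Queue()  # empty, reusable chunk buffers
    work_q = queue.Queue()  # chunks filled with raw lines, waiting to be parsed
    out_q = queue.Queue()   # chunks filled with parsed objects, waiting to be written
    for _ in range(pool_size):
        pool_q.put([])

    def reader():
        chunk = pool_q.get()  # blocks when the pool is exhausted (flow control)
        for line in lines:
            chunk.append(line)
            if len(chunk) == chunk_size:
                work_q.put(chunk)
                chunk = pool_q.get()
        if chunk:
            work_q.put(chunk)  # last, partly filled chunk
        work_q.put(DONE)

    def processor():
        while True:
            chunk = work_q.get()
            if chunk is DONE:
                break
            chunk[:] = [parse(line) for line in chunk]  # parse in place
            out_q.put(chunk)
        out_q.put(DONE)

    def writer():
        while True:
            chunk = out_q.get()
            if chunk is DONE:
                break
            for obj in chunk:
                write(obj)
            chunk.clear()
            pool_q.put(chunk)  # recycle the buffer for the reader

    threads = [threading.Thread(target=f) for f in (reader, processor, writer)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```

With a single processing thread the chunks reach the writer in read order, so no resequencing is needed in this version.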
One issue with threaded subsystems is completion notification. When the write thread has written the last chunk of a file, it probably needs to do something. I would probably fire an event with the last chunk as a parameter so that the event handler knows which file has been completely written. I would probably do something similar with error notifications.
If this is not fast enough, you could try:
1) Ensure that the read and write threads cannot be preempted in favour of the other during chunk-disking by using a mutex. If your chunks are big enough, this probably won't make much difference.
2) Use more than one processing thread. If you do this, chunks may arrive at the write-thread 'out-of-order'. You would maybe need a local list and perhaps some sort of sequence-number in the chunks to ensure that the disk writes are correctly ordered.
Good luck, whatever design you come up with..
Rgds,
Martin
