How do TensorFlow queue runners work? - multithreading

I am trying to understand how Queue Runners work. I know that queue runners are used to fill a queue more efficiently, running several enqueuing threads in parallel. But what does the resulting queue look like? What exactly is happening?
Let me make my question more specific:
I have a number of files whose content I want to put in one big queue.
file 1: A, B, C
file 2: D, E, F
file 3: G, H, I
I follow the TensorFlow input pipeline. I first create a list of filenames, then a queue of filenames using tf.train.string_input_producer(), a reader operation and a decoder. In the end I want to put my serialized examples into an example queue that can be shared with the graph. For this I use a QueueRunner:
qr = tf.train.QueueRunner(queue, [enqueue_op] * numberOfThreads)
and I add it to the QUEUE_RUNNERS collection:
tf.train.add_queue_runner(qr)
My enqueue_op enqueues one example at a time. So, when I use e.g. numberOfThreads = 2, what does the resulting queue look like? In which order are the files read into the queue? For example, does the queue look something like
q = [A, B, C, D, E, F, G, H, I] (so despite the parallel processing the content of the files is not mixed in the queue)
Or does the queue rather look like
q = [A, D, B, E, C, F, G, H, I] ?
def get_batch(file_list, batch_size, input_size,
              num_enqueuing_threads):
    file_queue = tf.train.string_input_producer(file_list)
    reader = tf.TFRecordReader()
    _, serialized_example = reader.read(file_queue)
    sequence_features = {
        'inputs': tf.FixedLenSequenceFeature(shape=[input_size],
                                             dtype=tf.float32),
        'labels': tf.FixedLenSequenceFeature(shape=[],
                                             dtype=tf.int64)}
    _, sequence = tf.parse_single_sequence_example(
        serialized_example, sequence_features=sequence_features)
    length = tf.shape(sequence['inputs'])[0]
    queue = tf.PaddingFIFOQueue(
        capacity=1000,
        dtypes=[tf.float32, tf.int64, tf.int32],
        shapes=[(None, input_size), (None,), ()])
    enqueue_ops = [queue.enqueue([sequence['inputs'],
                                  sequence['labels'],
                                  length])] * num_enqueuing_threads
    tf.train.add_queue_runner(tf.train.QueueRunner(queue, enqueue_ops))
    return queue.dequeue_many(batch_size)

I actually find the animated figure on the same page you mention rather illustrative.
There is a first queue, the Filename Queue, that feeds the file-reading threads. Within each of these threads, files are read sequentially -- or rather, according to the order followed by your reader. Each of these threads feeds another queue, the Example Queue, whose output is consumed by your training.
In your question, you are interested in the resulting order of the examples in the Example Queue.
The answer depends on how the reader works. The best practice and the most efficient way (illustrated in the animated figure above) is to send a sample to the Example Queue as soon as it is ready. This streaming behaviour is usually provided by the various Reader variants that TensorFlow offers, to promote and ease this best practice. However, it is certainly possible to write a reader that would enqueue_many all its samples at once at the end.
In all cases, since the readers run in separate, unsynchronized threads, you have no guarantee on the order of their respective outputs. For example, in the standard streaming case, the examples may be interleaved regularly, or not -- virtually anything can happen.
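To make this concrete, here is a minimal sketch (TF 1.x queue API, reusing the get_batch function from the question; the file names are hypothetical) of how the queue runners are started and where the nondeterministic interleaving happens:

# Sketch only: file names are placeholders; get_batch is the question's function.
import tensorflow as tf

file_list = ['file1.tfrecord', 'file2.tfrecord', 'file3.tfrecord']
batch = get_batch(file_list, batch_size=4, input_size=10,
                  num_enqueuing_threads=2)

with tf.Session() as sess:
    coord = tf.train.Coordinator()
    # Starts one thread per enqueue op registered in the QUEUE_RUNNERS
    # collection; from here on the enqueuing threads run unsynchronized.
    threads = tf.train.start_queue_runners(sess=sess, coord=coord)
    try:
        inputs, labels, lengths = sess.run(batch)   # example order is not guaranteed
    finally:
        coord.request_stop()
        coord.join(threads)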

Related

python Concurrent Futures gives different results each time

I am very confused why the concurrent.futures module is giving me different results each time. I have a function, say foo(), which runs on segments of a larger set of data d.
I consistently break this larger data set d into parts and make a list
d_parts = [d1, d2, d3, ...]
Then following the documentation, I do the following
results = [executor.submit(foo, d) for d in d_parts]
which is supposed to give me a list of "futures" objects in the order of foo(d1), foo(d2), and so on.
However, when I try to compile results with
done, _ = concurrent.futures.wait(results)
The results stored in done seem to be out of order, i.e. they are not the returns of foo(d1), foo(d2), ... but follow some different ordering. Hence, running this program on the same data set multiple times yields different results, because it is indeterminate which job finishes first (the d1, d2, ... are of roughly the same size). Is there a reason why? It seems that wait() should preserve the ordering in which the jobs were submitted.
Thanks!
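Note that wait() returns sets of futures, and sets are unordered; the futures list itself, however, keeps the submission order. A minimal sketch (foo and d_parts are stand-ins for the question's function and data):

# Recover results in submission order despite out-of-order completion.
import concurrent.futures

def foo(part):
    return sum(part)                              # hypothetical work

if __name__ == '__main__':
    d_parts = [[1, 2], [3, 4], [5, 6]]
    with concurrent.futures.ProcessPoolExecutor() as executor:
        futures = [executor.submit(foo, part) for part in d_parts]
        concurrent.futures.wait(futures)          # 'done' is a set: unordered
        results = [f.result() for f in futures]   # submission order preserved
        # Equivalent shortcut: results = list(executor.map(foo, d_parts))
    print(results)                                # [3, 7, 11]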

Array assignment using multiprocessing

I have a uniform 2D coordinate grid stored in a numpy array. The values of this array are assigned by a function that looks roughly like the following:
def update_grid(grid):
    n, m = grid.shape
    for i in range(n):
        for j in range(m):
            #assignment
Calling this function takes 5-10 seconds for a 100x100 grid, and it needs to be called several hundred times during the execution of my main program. This function is the rate limiting step in my program, so I want to reduce the process time as much as possible.
I believe that the assignment expression inside can be split up in a manner which accommodates multiprocessing. The value at each gridpoint is independent of the others, so the assignments can be split something like this:
def update_grid(grid):
    n, m = grid.shape
    for i in range(n):
        for j in range(m):
            p = Process(target=#assignment)
            p.start()
So my questions are:
Does the above loop structure ensure that each process will only operate on a single gridpoint? Do I need anything else to allow the processes to write to the same array, even if they're writing to different places in that array?
The assignment expression requires a set of parameters. These are constant, but each process will be reading them at the same time. Is this okay?
To explicitly write the code I've structured above, I would need to define my assignment expression as another function inside of update_grid, correct?
Is this actually worthwhile?
Thanks in advance.
Edit:
I still haven't figured out how to speed up the assignment, but I was able to avoid the problem by changing my main program. It no longer needs to update the entire grid with each iteration, and instead tracks changes and only updates what has changed. This cut the execution time of my program down from an hour to less than a minute.
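For reference, one common way to parallelize such a grid update without sharing writable state is to have each worker compute and return whole rows, with the parent process doing the single write back into the array. A minimal sketch (compute_point is a placeholder for the real assignment expression):

# Parallelize the update by rows; only the parent writes into the array.
from multiprocessing import Pool
import numpy as np

def compute_point(i, j, params):
    return i * params + j              # stand-in for the real assignment

def compute_row(args):
    i, m, params = args
    return [compute_point(i, j, params) for j in range(m)]

def update_grid(grid, params, processes=4):
    n, m = grid.shape
    with Pool(processes) as pool:
        rows = pool.map(compute_row, [(i, m, params) for i in range(n)])
    grid[:] = np.array(rows)           # single writer: the parent process
    return grid

if __name__ == '__main__':
    print(update_grid(np.zeros((4, 4)), params=10))

For very cheap per-point work the process overhead can outweigh the gain, so this is only worthwhile when each row involves substantial computation.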

Using multiple map_async (Multiprocessing) in Python3

I have sample code that uses map_async in multiprocessing with Python 3. What I'm trying to figure out is how I can run map_async(a, c) and map_async(b, d) concurrently. But it seems like the second map_async(b, d) statement only starts running when the first one is about to finish. Is there a way I can make the two map_async calls run at the same time? I tried to search online but didn't get the answer that I wanted. Below is the sample code. If you have other suggestions, I'm very happy to listen to those as well. Thank you all for the help!
from multiprocessing import Pool
import time
import os

def a(i):
    print('First:', i)
    return

def b(i):
    print('Second:', i)
    return

if __name__ == '__main__':
    c = range(100)
    d = range(100)
    pool = Pool(os.cpu_count())
    pool.map_async(a, c)
    pool.map_async(b, d)
    pool.close()
    pool.join()
map_async simply splits the iterable into a set of chunks and sends those chunks to the workers via an os.pipe. Therefore, two subsequent calls to map_async will appear to the workers as a single list composed of the two above-mentioned sets joined together.
This is the correct behaviour, as the workers really don't care which map_async call a chunk belongs to. Running two map_async calls in parallel would not bring any improvement in terms of speed or throughput.
If for any reason you really need the two calls to be executed in parallel, the only way is to create two different Pool objects. I would nevertheless recommend against such an approach, as it would make things much more unpredictable.
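If you do go that way anyway, here is a minimal sketch of the two-Pool variant (reusing the question's placeholder functions a and b; splitting the available cores between the pools is an arbitrary choice):

# Two independent pools, so neither map_async waits behind the other.
from multiprocessing import Pool
import os

def a(i):
    print('First:', i)

def b(i):
    print('Second:', i)

if __name__ == '__main__':
    workers = max(os.cpu_count() // 2, 1)
    with Pool(workers) as pool_a, Pool(workers) as pool_b:
        r1 = pool_a.map_async(a, range(100))
        r2 = pool_b.map_async(b, range(100))
        r1.wait()            # both pools now run their chunks concurrently
        r2.wait()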

What does the Queue Standard Library Interface of Chisel 3 synthesize to?

There are brief definitions of Queue and other Standard Library Interfaces of Chisel (Decoupled, Valid, etc) in the Cheat-Sheet and a bit more detail in the Chisel Manual. I also found these two answers here at StackOverflow - here and here.
However, neither of these resources explains in a concrete, graphic way - and I feel that would help me better understand the purpose of these interfaces - what these lines of code synthesize to: what do they look like in actual hardware?
For example, here is a snippet of the FPU code from the package HardFloat:
val input = Decoupled(new DivRecFN_io(expWidth, sigWidth)).flip
where DivRecFN_io is a class as follows:
class DivRecFN_io(expWidth: Int, sigWidth: Int) extends Bundle {
    val a = ...
    val b = ...
    val ...
    ...
}
What exactly is achieved with the line containing Decoupled?
Thank you.
For what it looks like in actual hardware:
The default Chisel util Queue is a standard circular-buffer implementation. This means it has a series of registers together with an enqueue pointer and a dequeue pointer that move as a result of operations on the queue and are checked for fullness and emptiness.
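As a rough software analogue (plain Python, purely illustrative - not what the generated hardware literally is), the pointer bookkeeping of such a circular buffer looks like this:

# Illustrative model of a circular-buffer queue: fixed storage plus
# enqueue/dequeue pointers and full/empty checks.
class CircularQueue:
    def __init__(self, depth):
        self.mem = [None] * depth
        self.enq_ptr = 0
        self.deq_ptr = 0
        self.count = 0                 # hardware often tracks a 'maybe_full' flag instead

    def enqueue(self, value):          # only legal when not full (ready)
        assert self.count < len(self.mem), 'queue full'
        self.mem[self.enq_ptr] = value
        self.enq_ptr = (self.enq_ptr + 1) % len(self.mem)
        self.count += 1

    def dequeue(self):                 # only legal when not empty (valid)
        assert self.count > 0, 'queue empty'
        value = self.mem[self.deq_ptr]
        self.deq_ptr = (self.deq_ptr + 1) % len(self.mem)
        self.count -= 1
        return value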
Decoupled wires a DivRecFN_io Bundle to a field named bits and adds ready and valid signals, which are typically used to manage flow control for Modules that do not return results within a single cycle. By default, DecoupledIO's data fields would be Output; the flip at the end of the line converts them to Input. Consider a module C which contains the val input and a parent module P that uses an instance of Module(C): the module C would be consuming the data in the Bundle, and the parent P would be producing the data placed in the Bundle. C would assert ready to indicate it is ready for data, and would read/use that data when valid is asserted by P.
The fields in the decoupled Bundle would be
input.ready
input.valid
input.bits.a
input.bits.b
...

Parallel processing - Connected Data

Problem
Summary: Apply a function F in parallel to each element of an array, where F is NOT thread-safe.
I have a set of elements E to process, let's say a queue of them.
I want to process all these elements in parallel using the same function f( E ).
Now, ideally I could use a map-based parallel pattern, but the problem has the following constraints.
Each element contains a pair of two objects. ( E = (A,B) )
Two elements may share an object. ( E1 = (A1,B1); E2 = (A1, B2) )
The function f cannot process two elements that share an object, so E1 and E2 cannot be processed in parallel.
What is the right way of doing this?
My thoughts are like so,
1. Trivial thought: Keep a set of active As and Bs, and start processing an element only when no other thread is already using its A or its B.
So, when you hand an element to a thread, you add its A and B to the active set.
Pick the first element; if its objects are not in the active set, spawn a new thread, otherwise push it to the back of the queue of elements.
Do this till the queue is empty.
Will this cause a deadlock? Ideally, when some processing finishes, elements will become available again, right?
2. The other thought is to build a graph of these connected objects.
Each node represents an object (A / B). Each element is an edge connecting an A and a B, and the data is then processed in such a way that we know the elements never overlap.
Questions
How can we achieve this best?
Is there a standard pattern for doing this?
Is there a problem with these approaches?
Not necessary, but if you could point out the TBB methods to use, that would be great.
The "best" approach depends on a lot of factors here:
How many elements "E" do you have and how much work is needed for f(E). --> Check if it's really worth it to work the elements in parallel (if you need a lot of locking and don't have much work to do, you'll probably slow down the process by working in parallel)
Is there any possibility to change the design that can make f(E) multi-threading safe?
How many elements "A" and "B" are there? Is there any logic to which elements "E" share specific versions of A and B? --> If you can sort the elements E into separate lists where each A and B only appears in a single list, then you can process these lists parallel without any further locking.
If there are many different A's and B's and you don't share too many of them, you may want to do a trivial approach where you just lock each "A" and "B" when entering and wait until you get the lock.
Whenever you do "lock and wait" with multiple locks it's very important that you always take the locks in the same order (e.g. always A first and B second) because otherwise you may run into deadlocks. This locking order needs to be observed everywhere (a single place in the whole application that uses a different order can cause a deadlock)
Edit: Also, if you do "try lock" you need to ensure that the order is always the same. Otherwise you can cause a livelock:
thread 1 locks A
thread 2 locks B
thread 1 tries to lock B and fails
thread 2 tries to lock A and fails
thread 1 releases lock A
thread 2 releases lock B
Goto 1 and repeat...
The chance that this actually goes on endlessly is relatively slim, but it should be avoided anyway.
Edit 2: In principle, I guess I'd just split E(Ax, Bx) into different lists based on Ax (e.g. one list for all E's that share the same A). Then process these lists in parallel while locking "B" (there you can still "TryLock" and continue if the required B is already in use).
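As a minimal sketch of the fixed-order locking idea (plain Python threading rather than TBB; the element IDs and the process function are hypothetical):

# One lock per shared object; every worker acquires A's lock before B's lock,
# so no cycle of waiting threads can form.
import threading
from concurrent.futures import ThreadPoolExecutor

lock_for = {}                          # one lock per shared object A or B
registry_lock = threading.Lock()

def get_lock(obj_id):
    with registry_lock:
        return lock_for.setdefault(obj_id, threading.Lock())

def process(element):                  # element = (a_id, b_id); stands in for f(E)
    a_id, b_id = element
    with get_lock(a_id):               # fixed order: always A first ...
        with get_lock(b_id):           # ... then B
            print('processing', element)

elements = [('A1', 'B1'), ('A1', 'B2'), ('A2', 'B1')]
with ThreadPoolExecutor(max_workers=4) as pool:
    pool.map(process, elements)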
