Tensorflow while_loop for training #2 - multithreading

Here I asked how to solve an overhead problem by using while_loop for training (which allows evaluating train_op several times with a single run call). After that I created 4 threads and ran one while_loop per thread to optimize in parallel. Is there a native mechanism in TensorFlow for such parallel optimization?
I use the Ftrl optimizer.
Thanks!
EDITED:
In my situation I have a big dataset, which I read gradually in the main thread and enqueue to a FIFOQueue. I use batch optimization, and one optimization step on a small batch (ideally only one element) takes little time (I use a linear model), so I want to perform all optimization steps in one run call, without returning to the Python interpreter on each step (because of the overhead problem). Currently I call run as many times as there are threads.
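A minimal sketch of the pattern described above (TF 1.x graph mode; the data, shapes, and step counts are all illustrative). It packs many optimization steps into a single session.run call via tf.while_loop and launches one such call per Python thread. Note that GradientDescentOptimizer is swapped in for Ftrl here, since it creates no slot variables; Ftrl's accumulator slots would have to be created before the loop is built.

import threading
import numpy as np
import tensorflow as tf

data_x = np.random.randn(100, 10).astype(np.float32)
data_y = np.random.randn(100).astype(np.float32)

x = tf.constant(data_x)
y = tf.constant(data_y)
w = tf.Variable(tf.zeros([10]))
opt = tf.train.GradientDescentOptimizer(0.01)

def body(i):
    # Rebuild the loss inside the body so each iteration reads fresh weights.
    loss = tf.reduce_mean(tf.square(tf.tensordot(x, w, 1) - y))
    with tf.control_dependencies([opt.minimize(loss)]):
        return i + 1

# 1000 training steps per run call -- no Python round-trip between steps.
train_loop = tf.while_loop(lambda i: i < 1000, body, [tf.constant(0)])

sess = tf.Session()
sess.run(tf.global_variables_initializer())
threads = [threading.Thread(target=lambda: sess.run(train_loop)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()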

Related

Run a parallel process saving results from a main process in Python

I have a function that creates some results for a list of tasks. I would like to save the results on the fly, to 1) release memory, compared to appending to a results_list, and 2) have the results of the first part in case of errors.
Here is a very short sample code:
for task in task_list:
    result = do_awesome_stuff_to_task(task)
    save_nice_results_to_db(result)  # Send this job to another process and let the main process continue
Is there a way for the main process to create results for each task in task_list and, each time a result is created, send it to another process/thread to save it, so the main loop can continue without waiting for the slow saving process?
I have looked at multiprocessing, but that seems mostly suited to speeding up the loop over task_list rather than letting a secondary subprocess do other parts of the work. I have also looked into asyncio, but that seems mostly used for I/O.
All in all, I am looking for a way to have a main process looping over the task_list. For each finished task I would like to send the result to another subprocess to save it. Note that do_awesome_stuff_to_task is much faster than the saving process; hence, the main loop will have worked through multiple tasks before the first result is saved. I have thought of two ways of tackling this:
Use multiple subprocesses to save
Save every xx iterations - the saving scales okay, so perhaps the save process can save xx iterations at a time while the main loop continues?
Is this possible to do with Python? Where to look and what key considerations to take?
All help is appreciated.
It's hard to know what will be faster in your case without testing, but here are some thoughts on how to choose what to do.
If save_nice_results_to_db is slow because it's writing data to disk or network, make sure you aren't already at the maximum write speed of your hardware. Depending on the server at the other end, network traffic can sometimes benefit greatly from opening multiple ports at once to read/write, so long as you stay within your total network transfer speed (of the network interface as well as your ISP). SSDs can see some limited benefit from initiating multiple reads/writes at once, but too many will hurt performance. HDDs are almost universally slower when trying to do more than one thing at a time. Everything is more efficient when reading/writing larger chunks at a time.
multiprocessing must typically transfer data between the parent and child processes using pickle because they don't share memory. This has a high overhead, so if result is a large object, you may waste more time on the added overhead of sending the data to a child process than you could save by any sort of concurrency (emphasis on the may; always test for yourself). As of Python 3.8 the shared_memory module was added, which may be somewhat more efficient, but it is much less flexible and convenient to use.
threading benefits from all threads sharing memory, so there is zero transfer overhead to "send" data between threads. Python threads, however, cannot execute bytecode concurrently due to the GIL (global interpreter lock), so multiple CPU cores cannot be leveraged to increase computation speed. This is because Python itself has many parts which are not thread-safe. Specific functions written in C may release this lock to get around the issue and leverage multiple CPU cores, but once execution returns to the Python interpreter, that lock is held again. Typically, functions involving network access or file I/O can release the GIL, as the interpreter is waiting on an operating system call which is usually thread-safe. Other popular libraries like NumPy also make an effort to release the GIL while doing complex math operations on large arrays. You can only release the GIL from C/C++ code, however, not from Python itself.
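A toy illustration of that behavior (timings are machine-dependent): four threads doing pure-Python math take roughly as long as one thread doing the work four times over, while four threads doing GIL-releasing waits overlap almost completely.

import threading
import time

def cpu_work():
    sum(i * i for i in range(2_000_000))  # pure Python: holds the GIL

def io_work():
    time.sleep(1)  # blocking OS call: releases the GIL while waiting

for fn in (cpu_work, io_work):
    start = time.perf_counter()
    threads = [threading.Thread(target=fn) for _ in range(4)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    print(fn.__name__, round(time.perf_counter() - start, 2), "s")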
asyncio deserves a special mention here, as it's designed specifically with concurrent network/file operations in mind. It uses coroutines instead of threads (even lower overhead than threads, which themselves are much lower overhead than processes) to queue up a bunch of operations, then uses an operating system call to wait for any of them to finish (the event loop). Using this would also require do_awesome_stuff_to_task to happen in a coroutine for it to run at the same time as save_nice_results_to_db.
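One way to approximate this without rewriting do_awesome_stuff_to_task as a true coroutine is to push just the blocking save into a worker thread via asyncio.to_thread (Python 3.9+); a sketch, reusing the question's stand-in functions:

import asyncio

async def main():
    pending = []
    for task in task_list:
        result = do_awesome_stuff_to_task(task)
        # Push the blocking save into a worker thread (Python 3.9+).
        pending.append(asyncio.create_task(asyncio.to_thread(save_nice_results_to_db, result)))
        await asyncio.sleep(0)  # yield once so the save task actually starts
    await asyncio.gather(*pending)

asyncio.run(main())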
A trivial example of firing each result off to a thread to be processed:
import threading

for task in task_list:
    result = do_awesome_stuff_to_task(task)
    # Fire off the save in a separate thread and let the main loop continue
    threading.Thread(target=save_nice_results_to_db, args=(result,)).start()
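And a sketch of the multiple-subprocess variant the question mentions, again with the question's stand-in functions: a dedicated saver process consumes results from a queue while the main loop keeps producing.

import multiprocessing as mp

def saver(queue):
    # Dedicated process: pull results off the queue and persist them.
    while True:
        result = queue.get()
        if result is None:  # sentinel: no more results coming
            break
        save_nice_results_to_db(result)

if __name__ == '__main__':
    queue = mp.Queue()
    proc = mp.Process(target=saver, args=(queue,))
    proc.start()
    for task in task_list:
        result = do_awesome_stuff_to_task(task)
        queue.put(result)  # main loop continues without waiting for the save
    queue.put(None)
    proc.join()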

Why Tensorflow creates so many CPU threads

Even with
inter_op_parallelism_threads = 1
intra_op_parallelism_threads = 1
values set, the TensorFlow 1.5 process is not single-threaded. Why? Is there a way to completely disable the unexpected thread spawning?
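(For reference, these options are applied through the session configuration in TF 1.x; a minimal sketch:)

import tensorflow as tf

config = tf.ConfigProto(inter_op_parallelism_threads=1,
                        intra_op_parallelism_threads=1)
sess = tf.Session(config=config)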
First of all, TensorFlow is a multi-level software stack, and each layer tries to be smart and introduces some worker threads of its own:
One thread is created by the Python runtime
Two more threads are created by the NVIDIA CUDA runtime
Next, there are threads originating from the way TensorFlow administers internal compute jobs:
Threads are created/joined all the time to poll on job completion (gRPC engine)
Thus, TensorFlow cannot be single-threaded, even with all options set to 1. Perhaps this design is intended to reduce latencies for async jobs. Yet there is a certain drawback: multicore compute libraries, such as linear algebra libraries, perform cache-intensive operations best with a static, symmetric core-to-thread mapping, and the dangling callback threads produced by TensorFlow will disturb this symmetry all the time.

NaN values in tensorflow when running from GPU and using multiple threads to feed queues

I'm running into NaN values in a model I'm training.
I have two TensorFlow queues. The first is fed by an independent loader thread; a second thread reads from the first queue, performs preprocessing, and loads into the second queue.
That all occurs on the CPU. Then the model reads a batch from the second queue and trains on the GPU.
I get NaNs after a variable number of steps, usually on the order of 10-20.
I can sidestep the problem in 2 ways:
Run on the CPU. The same code runs fine on the CPU.
Remove threading: same code, but instead of running the loader and preprocessor in threads, just do those two steps in sequence before the training step.
So the problem is only encountered when I run with multiple threads accessing the queues from different devices.
Or so it seems; I've thus far failed to distill the problem into a minimal test case. A simplified version of it seems to work.
Wondering if there are any known related issues.
I've reproduced this on 2 systems, one running TF 1.0.1 and one running TF 1.1.0-rc1. I've tried both the cuDNN 5 and cuDNN 6 libraries.
This problem appeared to be related to having some of the tf.image processing functions placed on the GPU while feeding them data from a queue on the CPU. I wouldn't have expected that to be a problem, but as soon as I tied those operations to the CPU (with tf.device), everything worked fine.
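A minimal sketch of that fix (the queue shape and the particular tf.image ops are illustrative): build the preprocessing explicitly on the CPU, next to the queue it reads from.

import tensorflow as tf

queue = tf.FIFOQueue(capacity=100, dtypes=[tf.float32], shapes=[[224, 224, 3]])

with tf.device('/cpu:0'):
    image = queue.dequeue()
    image = tf.image.random_flip_left_right(image)
    image = tf.image.per_image_standardization(image)

# The model itself can then consume `image` on the GPU as usual.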

Multiprocessors and multithreading - Operating Systems

I was going through topics of Operating Systems using the textbook by Galvin (the 9th edition). In Chapter 4 on multithreading, I came across problem 14, which is as follows:
A system with two dual-core processors has four processors available for scheduling. A CPU-intensive application is running on this system. All input is performed at program start-up, when a single file must be opened. Similarly, all output is performed just before the program terminates, when the program results must be written to a single file. Between startup and termination, the program is entirely CPU-bound. Your task is to improve the performance of this application by multithreading it. The application runs on a system that uses the one-to-one threading model (each user thread maps to a kernel thread).
• How many threads will you create to perform the input and output? Explain.
• How many threads will you create for the CPU-intensive portion of the application? Explain.
For the first part, I think we could create 4 threads for reading the input from a file as well as for writing the output to a file, because during either input or output there is no updating of the data being carried out.
For the second part, the nature of the operation to be carried out on the data is not known; for example, whether (1) the average of the data is to be printed, or (2) a function prints the average of the first and last data points, then the average of the second and second-to-last data points, and so on.
Therefore, for the second part, one thread could be employed to handle the operation.
But I am not very sure that the answer I gave here is right, so I would be very grateful if you could let me know the right answer.
The question is testing whether you understand some principles about parallelizing work to increase speed. Some of these principles are:
In the usual case, reading and writing a single file cannot be sped up using multiple cores. The speed of file I/O is determined by the properties of where and how the file is stored. Throwing more threads at it is not going to help, because those threads will just be waiting for the I/O to complete.
How many threads you use for the CPU-intensive portion depends entirely on what is being computed. If the program is generating imagery for a movie, use 4 threads, because that work is completely parallel. If the workload is entirely serial, use 1 thread, because adding more threads won't help (by definition).
Computing the averages in your example is almost completely parallel, so you should use four threads, not one.
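As a sketch of the kind of data-parallel split meant here (assuming the averaging example; worker processes are used because, in CPython, threads doing pure-Python math would be serialized by the GIL):

import multiprocessing as mp

def chunk_sum(chunk):
    return sum(chunk)

if __name__ == '__main__':
    data = list(range(1_000_000))
    n = 4  # one worker per core
    chunks = [data[i::n] for i in range(n)]  # split the data four ways
    with mp.Pool(n) as pool:
        total = sum(pool.map(chunk_sum, chunks))
    print('average =', total / len(data))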

Statistics for a randomized task on multiple cores

Suppose the time to complete a task on a processor core follows a distribution with mean m and standard deviation s. If the same task runs on n cores, what are the mean and standard deviation of the time it takes to complete the task? (The task is finished when one of the cores finishes it.)
This is more of a statistics question than anything else. Without information on the distribution function of the time t a single task needs to complete, I can only give you a hint: you need to calculate the distribution function of the minimum of t over n of your tasks, as seen here. If F(t) is the distribution function of a single task's time, the minimum of n independent copies has distribution F_min(t) = 1 - (1 - F(t))^n. For example, if t is exponentially distributed with mean m (for which s = m), the minimum of n copies is again exponential, with mean m/n and standard deviation m/n. Using the resulting distribution you can then calculate the mean and the standard deviation.
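A quick numerical check of that hint (the exponential distribution is assumed purely for illustration):

import numpy as np

m, n, trials = 10.0, 4, 100_000
samples = np.random.exponential(m, size=(trials, n))  # task times on n cores
t_min = samples.min(axis=1)  # the task finishes when the first core does
print(t_min.mean(), t_min.std())  # both come out near m/n = 2.5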
PS: Is this homework?
EDIT:
Whether, and how much, it's worth using multiple cores depends on several things:
What you need to do. If you have to run the same program with different inputs, launching multiple instances makes a lot of sense. It might not cut the overall time down to 1/n, and each experiment will still need at least as much time as before, but the time needed for the whole series will be significantly less.
If, on the other hand, you are hoping to run the same task with e.g. a different seed and keep the one that converges fastest, you will probably gain far less, as estimated by the first part of my answer.
How well you have parallelized your tasks. n completely independent tasks is the ideal scenario; n threads with multiple synchronization points etc. are not going to be nearly as efficient.
How well your hardware can handle multiple tasks. For example, if each of these tasks needs a lot of memory, it will probably be faster to use a single core than to force the system to use the swap space/pagefile/whatever your OS calls it by running multiple instances at once.
