Tensorflow shared queue in PS server with reader in workers - multithreading

I am running a distributed Tensorflow program with large input files (150 MB per example).
I wish to have a shared input queue for file names in the PS in order for the workers to work on different examples.
I want the CPU of each worker to then read the shared input queue and generate data for the GPU to process.
The code below is only run by the workers:
with tf.Graph().as_default():
    with tf.device('/job:ps/replica:0/task:0'):
        file_queue = tf.train.string_input_producer(file_paths, shared_name='train_queue')
    with tf.device('/cpu:0'):
        input_tensors = model.input_fn(file_queue, ...)
    # sets variables to PS and ops default to GPU
    with tf.device(tf.train.replica_device_setter(cluster=cluster_spec)):
        output_tensors = model.model_fn(input_tensors, ...)
However, the Reader() for file_queue (which is inside model.input_fn()) is being placed on the PS instead of on the workers' CPUs, as I tried to specify using tf.device().
This causes 150 MB messages to be sent between the PS and the workers, which slows down training (I only noticed this because Google Cloud ML Engine raises a warning when large messages are being sent).
Why is the Reader() not being placed on the workers' CPU?
Is it mandatory for a queue and its reader to be on the same device?
Here is a link to my previous question, which might provide more context.
Here is the code for input_fn():
def input_fn(file_queue, ...):
    reader = tf.TFRecordReader()
    _, example = reader.read(file_queue)
    image, ground_truth = my_decoder(example)
    image, ground_truth = tf.train.shuffle_batch([image, ground_truth], ...)
    return image, ground_truth
The problem is that tf.TFRecordReader() is being placed on the PS. All the other ops (decoder and batch) are correctly placed on the workers' CPUs.


Convert function to exploit parallelization of the GPU

I have a function that uses values stored in one array to operate on another array. This behaves similarly to the numpy.histogram function. For example:
import numpy as np
from numba import jit
def array_func(x, y, output_counts, output_weights):
    for row in range(x.size):
        col = int(x[row] * 10)
        output_counts[col] += 1
        output_weights[col] += y[row]
    return (output_counts, output_weights)

# in the current code these arrays exist as PyTorch tensors
# on the GPU and get converted to numpy arrays on the CPU before
# being passed to "array_func"
x = np.random.randint(0, 11, (1000)) / 10
y = np.random.randint(0, 100, (10000))
output_counts, output_weights = array_func(x, y, np.zeros(y.size), np.zeros(y.size))
While this works for arrays, it does not work for torch tensors that are on the GPU. This is close to what histogram functions do, but I also need the summation of binned values (i.e., the output_weights array/tensor). The current function requires me to continually pass the data from GPU to CPU, after which the CPU function runs serially.
Can this function be converted to run in parallel on the GPU?
The challenge is caused by the following line:
output_weights[col] += y[row]
If it weren't for that line I could just use the torch.histc function.
Here's my thought: GPUs are "fast" because they have hundreds/thousands of threads available and can run parts of a big job (or many smaller jobs) on these threads. However, if I convert the function above to work on torch tensors, then there is no benefit to running on the GPU (it actually kills the performance). I wonder if there is a way I can break up x so that each value gets sent to a different thread (similar to what apply_async does within multiprocessing)?
I'm open to other options.
In its current form the function is fast, but the GPU-to-CPU data transfer is killing me.
Your computation is indeed a general histogram operation. There are multiple ways to compute this on a GPU, depending on the number of items to scan, the size of the histogram, and the distribution of the values.
For example, one solution consists of building a local histogram in each separate kernel block and then performing a reduction. However, this solution is not well suited to your case, since len(x) / len(y) is relatively small.
An alternative solution is to perform atomic updates of the histogram in parallel. This solution only scales well if there are few atomic conflicts, which depends on the actual input data. Indeed, if all values of x are equal, then all updates will be serialized, which is slower than doing the accumulation sequentially on a CPU (due to the overhead of the atomic operations). Such conflicts are frequent on small histograms, but assuming the distribution is close to uniform, this can be fine.
This operation can be done with Numba using CUDA (targeting Nvidia GPUs). Here is an example of a kernel solving your problem:
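To make the "local histograms, then a reduction" idea concrete, here is a minimal CPU-only sketch in plain Python. It only emulates the strategy: each chunk of block_size items stands in for one GPU block with its own private histogram. The function name blockwise_histogram, the block size, and the sample data are illustrative assumptions, not part of the original code.

```python
def blockwise_histogram(x, y, n_bins, block_size=256):
    """Emulate the 'one local histogram per block, then reduce' strategy on the CPU."""
    partials = []
    for start in range(0, len(x), block_size):
        # Private to this (emulated) block, so no update conflicts can occur
        counts = [0] * n_bins
        weights = [0.0] * n_bins
        for pos in range(start, min(start + block_size, len(x))):
            col = int(x[pos] * 10)
            counts[col] += 1
            weights[col] += y[pos]
        partials.append((counts, weights))

    # Reduction step: sum the per-block histograms bin by bin
    total_counts = [sum(p[0][b] for p in partials) for b in range(n_bins)]
    total_weights = [sum(p[1][b] for p in partials) for b in range(n_bins)]
    return total_counts, total_weights


# Hypothetical sample data: x values are multiples of 0.1 in [0, 1]
x = [(i % 11) / 10 for i in range(1000)]
y = [float(i % 100) for i in range(1000)]
counts, weights = blockwise_histogram(x, y, n_bins=11)
```

The cost of this approach is one histogram per block plus the final reduction, which is why it pays off mainly when there are many items per bin.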
from numba import cuda

@cuda.jit
def array_func(x, y, output_counts, output_weights):
    tx = cuda.threadIdx.x  # Thread id in a 1D block
    ty = cuda.blockIdx.x   # Block id in a 1D grid
    bw = cuda.blockDim.x   # Block width, i.e. number of threads per block
    pos = tx + ty * bw     # Compute flattened index inside the array
    if pos < x.size:       # Guard: the last block may contain spare threads
        col = int(x[pos] * 10)
        cuda.atomic.add(output_counts, col, 1)
        cuda.atomic.add(output_weights, col, y[pos])
For more information about how to run this kernel, please read the documentation. Note that the arrays output_counts and output_weights can be created directly on the GPU to avoid transfers. x and y should already be on the GPU for good performance (otherwise a CPU reduction will almost certainly be faster). Also note that the kernel should be quite fast, so the overhead of launching it, waiting for it, and allocating/freeing temporary arrays may be significant, possibly even larger than the kernel's own run time (though still certainly cheaper than transferring the data to the CPU and back just to do the computation there). Finally, such atomic accesses are only fast on fairly recent Nvidia GPUs that have dedicated hardware units for atomic operations.
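As a rough illustration of what "running the kernel" involves, the sketch below computes a 1D launch configuration and then replays, sequentially on the CPU, what each GPU thread would do. It is plain Python and needs no GPU. The helper names launch_config and emulate_kernel, the choice of 256 threads per block, and the sample data are assumptions made for this example.

```python
import math


def launch_config(n_items, threads_per_block=256):
    """Return (blocks_per_grid, threads_per_block) so that every item gets one thread."""
    blocks_per_grid = math.ceil(n_items / threads_per_block)
    return blocks_per_grid, threads_per_block


def emulate_kernel(x, y, n_bins, threads_per_block=256):
    """Sequentially replay the per-thread work of the histogram kernel."""
    counts = [0] * n_bins
    weights = [0.0] * n_bins
    blocks, bw = launch_config(len(x), threads_per_block)
    for ty in range(blocks):      # block index in the grid
        for tx in range(bw):      # thread index in the block
            pos = tx + ty * bw    # same flattened index as in the kernel
            if pos < len(x):      # same bounds guard as in the kernel
                col = int(x[pos] * 10)
                counts[col] += 1
                weights[col] += y[pos]
    return counts, weights


# Hypothetical sample data: x values are multiples of 0.1 in [0, 1]
x = [(i % 11) / 10 for i in range(1000)]
y = [float(i % 100) for i in range(1000)]
blocks_per_grid, threads_per_block = launch_config(len(x))
counts, weights = emulate_kernel(x, y, n_bins=11)

# With Numba, the launch itself would look like this (arrays already on the GPU):
# array_func[blocks_per_grid, threads_per_block](x, y, output_counts, output_weights)
```

The grid is rounded up to a whole number of blocks, which is why the kernel needs the pos < x.size guard for the spare threads in the last block.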

When would I use model.to("cuda:1") as opposed to model.to("cuda:0")?

I have a user with two GPUs; the first one is an AMD card which can't run CUDA, and the second one is a CUDA-capable NVIDIA GPU. I am using the code model.half().to("cuda:0"). I'm not sure if the invocation successfully used the GPU, nor am I able to test it, because I don't have a spare computer with more than one GPU lying around.
In this case, does "cuda:0" mean the first device which can run CUDA, so it would've worked even if their first device was AMD? Or would I need to say "cuda:1" instead? How would I detect which number is the first CUDA-capable device?
The nvidia_smi package can help you track GPU memory while running your code.
To install it, run pip install nvidia-ml-py3. Take a look at this code snippet:
import nvidia_smi

nvidia_smi.nvmlInit()  # NVML must be initialized before requesting a device handle
cuda_idx = 0  # edit device index that you want to track
to_cuda = f'cuda:{cuda_idx}'  # 'cuda:0' in this case
handle = nvidia_smi.nvmlDeviceGetHandleByIndex(cuda_idx)

def B2G(num):
    # convert bytes to GB, rounded to two decimals
    return round(num / (1024**3), 2)

def print_memory(name, handle, pre_used):
    info = nvidia_smi.nvmlDeviceGetMemoryInfo(handle)
    used = info.used
    print(f'{name}: {B2G(used)}')
    print(f'This step use: {B2G(used - pre_used)}')
    return used

# start
mem = print_memory('Start', handle, 0)
model = ...  # init your model
mem = print_memory('Init model', handle, mem)
The example above uses nvidia-smi to track the memory needed by each part of the model and prints it in GB.
Edited: To check the list of GPUs:
import torch

def check_gpu():
    for i in range(torch.cuda.device_count()):
        device_name = f'cuda:{i}'
        print(f'{i} device name:{torch.cuda.get_device_name(torch.device(device_name))}')
I tested it and, as I suspected, model.half().to("cuda:0") will put your model on the first available GPU with CUDA support, i.e. the NVIDIA GPU in your case. The AMD GPU isn't visible as a CUDA device, so it is safe to assume that cuda:0 always refers to a CUDA-enabled GPU and that the AMD GPU won't be seen by your program.
Have a good day.
There are plenty of functions in torch.cuda to query and monitor GPU devices.
For example, you can check the type of each device with torch.cuda.get_device_name(0) or torch.cuda.get_device_name('cuda:0').
In my case, the output of get_device_name returns:
'Quadro RTX 6000'
If you want a more programmatic way to explore the properties of your devices, you can use torch.cuda.get_device_properties.
Once you are working with a device (or believe you are), you can use torch.cuda's memory management functions to monitor GPU memory usage.
For instance, you can get a very detailed account of the current state of your device's memory using torch.cuda.memory_stats(0) or torch.cuda.memory_summary(0).
If you want nvidia-smi-like stats on utilization, you can use torch.cuda.utilization.
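To check at runtime whether "cuda:0" is actually usable rather than assuming it, one option is a small helper that falls back to the CPU. This is only a sketch: pick_device and resolve_device are hypothetical names introduced here, and torch is imported lazily inside resolve_device so that the selection logic can be exercised without a GPU.

```python
def pick_device(cuda_available, device_count, preferred_index=0):
    """Return 'cuda:<index>' if that CUDA device exists, otherwise 'cpu'."""
    if cuda_available and 0 <= preferred_index < device_count:
        return f"cuda:{preferred_index}"
    return "cpu"


def resolve_device(preferred_index=0):
    """Query PyTorch for the visible CUDA devices and pick one (or fall back to CPU)."""
    import torch  # imported lazily; only needed when actually querying the hardware

    return pick_device(
        torch.cuda.is_available(),
        torch.cuda.device_count(),
        preferred_index,
    )
```

A possible usage would be model.to(resolve_device()), followed by next(model.parameters()).device to confirm where the weights ended up.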

Python 3.8 RAM overflow and loading issues

First, I want to mention that this is our first larger-scale project, so we don't know everything, but we learn fast.
We developed code for image recognition. We tried it on a Raspberry Pi 4B but quickly found that it is way too slow overall. Currently we are using an NVIDIA Jetson Nano. The first recognition was OK (around 30 sec.) and the second try was even better (around 6-7 sec.). The first took so long because the model is loaded for the first time. The image recognition can be triggered via an API, and the metadata from the AI model is returned as the response. We use FastAPI for this.
But there is a problem right now: if I load my CNN as a global variable at the beginning of my classification file (loaded on import) and use it within a thread, I need to use mp.set_start_method('spawn'), because otherwise I get the following error:
"RuntimeError: Cannot re-initialize CUDA in forked subprocess.
To use CUDA with multiprocessing, you must use the 'spawn' start method"
Now that is of course an easy fix: just call the method above before starting my thread. This does work, but another challenge appears at the same time. After setting the start method to 'spawn' the error disappears, but the Jetson starts to allocate way too much memory.
Because of the overhead and the preloaded CNN model, RAM usage is around 2.5 GB before the thread starts. After the start it doesn't stop allocating RAM; it consumes all 4 GB of RAM and also the whole 6 GB of swap. Right after this, the whole API process is killed with the error "cannot allocate memory", which is to be expected.
I managed to fix that as well, just by loading the CNN model inside the classification function (not preloading it on the GPU as in the two cases before). However, this causes another problem. Loading the model onto the GPU takes around 15-20 s, and this happens every time a recognition starts. This is not suitable for us, and we are wondering why we cannot preload the model without killing the whole thing after two image recognitions. Our goal is to stay under 5 sec.
import multiprocessing as mp  # assumed: the original snippet does not show how mp is imported
import time

import torch
import torchvision.transforms as transforms
from skimage import io
from torch.utils.data import Dataset

from .loader import *
from .ResNet import *

# if this part is in the classify() function then no allocation problem occurs
net = ResNet152(num_classes=25)
net = net.to('cuda')
save_file = torch.load("./model.pt", map_location=torch.device('cuda'))

def classify(imgp="", return_dict=None):
    # do some classification with the net
    ...

if __name__ == '__main__':
    mp.set_start_method('spawn')  # if commented out, the first error occurs
    manager = mp.Manager()
    return_dict = manager.dict()
    p = mp.Process(target=classify, args=('./bild.jpg', return_dict))
Any help here will be much appreciated. Thank you.

Tensorflow supports multiple threads/streams on one GPU for training?

I found the source code of GPUDevice; it hard-codes max streams to 1. May I know the reason?
GPUDevice(const SessionOptions& options, const string& name,
          Bytes memory_limit, const DeviceLocality& locality,
          TfGpuId tf_gpu_id, const string& physical_device_desc,
          Allocator* gpu_allocator, Allocator* cpu_allocator)
    : BaseGPUDevice(options, name, memory_limit, locality, tf_gpu_id,
                    physical_device_desc, gpu_allocator, cpu_allocator,
                    false /* sync every op */, 1 /* max_streams */) {
  if (options.config.has_gpu_options()) {
    force_gpu_compatible_ =
I am wondering whether TensorFlow (1.x) supports multiple threads or multiple streams on a single GPU. If not, I am curious about the underlying reasons: did TF do this on purpose, do libraries such as CUDA prevent TF from providing it, or are there other reasons?
Like some previous posts [1, 2], I tried to run multiple training ops in TF, i.e. sess.run([train_op1, train_op2], feed_dict={...}), and used the TF timeline to profile each iteration. However, the TF timeline always showed that the two train ops run sequentially (although the timeline is not accurate [3], the wall time of each op suggests sequential execution). I also looked at some of the TF source code; it looks like each op is computed in device->ComputeAsync() or device->Compute(), and the GPU is blocked while computing an op. If I am correct, one GPU can only run a single op at a time, which may lower GPU utilization.
1. Running multiple tensorflow sessions concurrently
2. Run parallel op with different inputs and same placeholder
I have had a similar experience.
I have two GPUs. When each GPU runs three threads, each thread running its own session, the session run times fluctuate a lot.
If I run only one thread on each GPU, the session run time is quite stable.
From these observations, we can conclude that threads in TensorFlow do not cooperate well,
which suggests a problem in TensorFlow's execution mechanism.

scikit learn unwanted parallel processing

I have a problem with nested multiprocessing which starts when I use scikit-learn (v. 0.22) Quadratic Discriminant Analysis. The relevant system configuration: a 24-thread Xeon machine running Fedora 30.
I repeatedly run QDA on a randomly selected subset of predictors:
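import numpy as np
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis as QDA
from sklearn.metrics import accuracy_score

def process(X, y, n_features, i=1):
    comb = np.random.choice(range(X.shape[1]), n_features, replace=False)
    # fit before predicting; predict() on an unfitted estimator raises an error
    qda = QDA(tol=1e-8).fit(X[:, comb], y)
    y_pred = qda.predict(X[:, comb])
    return (accuracy_score(y, y_pred), comb, i)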
where n_features is the number of features randomly selected from the full set of possible predictors, and X, y are the explanatory and dependent variables.
When n_features is 18 or less, process runs in single-thread mode, which means that I can use any tool for parallel processing (I use ray). When n_features is 19 or above, for an unknown reason it (not me) starts all available threads, and the entire calculation takes more time, even compared to a single thread.
tmp = [process(X,y,n_features,i=1) for _ in range(1000)]
Based on my previous experience with other libraries on Linux (R gstat, to be precise), the same situation (uncontrolled multithreading) was caused by the Linux implementation of BLAS, but here that might not be the case. In general, the question is: what starts this multithreading, and how can I control it to avoid nested multiprocessing, given that LDA/QDA has no n_jobs parameter?
QDA in scikit-learn does not expose n_jobs, meaning that there is nothing you can set on the estimator itself. However, it could be due to numpy, which does not restrict the number of threads.
The options to limit the number of threads are:
set the environment variables OMP_NUM_THREADS, MKL_NUM_THREADS, or OPENBLAS_NUM_THREADS (whichever applies to your setup) to limit the number of threads;
you can use threadpoolctl which provides a context manager to set the number of threads.
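Both options can be sketched as follows. The helper names limit_threads_via_env and run_single_threaded are made up for this example. The environment variables generally have to be set before numpy or scikit-learn is first imported in the process, and threadpoolctl is a third-party package that is imported lazily here, so it is only needed if that helper is actually called.

```python
import os

THREAD_ENV_VARS = ("OMP_NUM_THREADS", "MKL_NUM_THREADS", "OPENBLAS_NUM_THREADS")


def limit_threads_via_env(n_threads=1):
    """Option 1: set the thread-count variables (call before importing numpy/sklearn)."""
    for var in THREAD_ENV_VARS:
        os.environ[var] = str(n_threads)


def run_single_threaded(func, *args, **kwargs):
    """Option 2: run func with BLAS limited to one thread via threadpoolctl."""
    from threadpoolctl import threadpool_limits  # third-party; imported lazily

    with threadpool_limits(limits=1, user_api="blas"):
        return func(*args, **kwargs)


limit_threads_via_env(1)
```

With the second helper, the loop from the question could be written as tmp = [run_single_threaded(process, X, y, n_features) for _ in range(1000)], leaving the outer parallelism to ray.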
