I'm using PyTorch DDP on the SageMaker PyTorch Training DLC 1.8.1. The code seems properly DDP-formatted. I'm using instance_count = 2 and launching with torch.distributed.launch, and I believe the ranks and world size are properly set. However, dist.init_process_group waits and then times out:
RuntimeError: Timed out initializing process group in store based barrier on rank: 0, for key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00)
What could be going wrong? Are the machines not networked together?

This is usually something to do with the way local_rank is retrieved and used during initialization. Please refer to the below example and see if you can figure out the difference.

torch.distributed.launch is the helper utility within the torch.distributed package that can be used to launch multiple processes per node for distributed training. It tells all workers the IP address of rank 0, which is set by MASTER_ADDR.
Each rank needs to be able to communicate with MASTER_ADDR on the port MASTER_PORT. If those are set but the workers cannot reach MASTER_ADDR, that can be the root cause of the hang and timeout for the job.
Besides, it will also wait until all nodes have reported in, up to the number of nodes given by --nnodes in the launch command.
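To rule out the networking side, a quick reachability check can be run on each non-zero rank before calling dist.init_process_group. This is only a minimal sketch: the helper name can_reach_master is made up for illustration, and the fallback values (127.0.0.1 and port 29500) are assumptions used when the environment variables are not set. It can only succeed once rank 0 is already listening on the port.

```python
import os
import socket


def can_reach_master(addr=None, port=None, timeout_s=5.0):
    """Return True if a TCP connection to MASTER_ADDR:MASTER_PORT succeeds.

    Hypothetical helper: the defaults below are assumptions for the case
    where the launcher did not export the variables.
    """
    addr = addr or os.environ.get("MASTER_ADDR", "127.0.0.1")
    port = int(port or os.environ.get("MASTER_PORT", "29500"))
    try:
        # create_connection resolves the host and opens a TCP socket.
        with socket.create_connection((addr, port), timeout=timeout_s):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution errors.
        return False
```

If this returns False on a worker while rank 0 is waiting in init_process_group, the problem is more likely in the network setup (security groups, wrong address, wrong port) than in the DDP code itself.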


Tuning hyperparameters in sklearn with Ray

I could not find any information on why the following appears every time I try to tune sklearn hyperparameters with TuneSearchCV.
Note that the important part is the Log sync warning. As a result, logging in combination with TensorFlow and a search_optimization such as optuna does not work:
Backend is sklearn
Concatenating h5 datasets of the following files:
('output/example_train_1.h5', 'output/example_train_2.h5')
based on the following keys:
('x', 'y')
Concatenation successful, resulting shapes for the given dsets:
Key: x, shape: (20000, 25)
Key: y, shape: (20000,)
Log sync requires rsync to be installed.
Process finished with exit code 0
The tuning processes seem to be working, as long as I do not use a search_optimization such as optuna.
I run it inside a Docker container. I went through the Ray documentation and could find the source where I think the warning is raised. However, I could not find any settings or additional options to prevent it.
Furthermore, it seems that rsync is only necessary if I use a cluster, which I am not doing right now.
The warning (Log sync requires rsync to be installed.) does not stop the script from executing. If rsync is not installed, it will just not synchronize logs between nodes, which seems to be unnecessary in your case anyway. You shouldn't run into any problem there.
It's hard to say what the problem is here, as we're missing crucial information: which version of Ray are you running, which version of tune-sklearn, and what does your training script look like?
If you're running into problems and you suspect it is a bug, please consider opening an issue in the tune-sklearn repository, and make sure to include the above information and preferably a minimal reproducible script so the maintainers can look into this.
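If you want to confirm whether the warning applies to your container at all, you can check from Python whether an rsync executable is on the PATH. This is a small sketch using only the standard library; the function name rsync_available is made up for illustration and is not part of Ray or tune-sklearn.

```python
import shutil


def rsync_available():
    """Return True if an rsync executable is found on PATH."""
    return shutil.which("rsync") is not None


if __name__ == "__main__":
    print("rsync found:", rsync_available())
```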

Python 3.8 RAM overflow and loading issues

First, I want to mention that this is our first larger-scale project, so we don't know everything yet, but we learn fast.
We developed code for image recognition. We tried it on a Raspberry Pi 4B but quickly found that it is way too slow overall. Currently we are using an NVIDIA Jetson Nano. The first recognition was OK (around 30 sec.) and the second try was even better (around 6-7 sec.). The first one took so long because the model is loaded for the first time. The image recognition can be triggered via an API, and the metadata from the AI model is returned as the response. We use FastAPI for this.
But there is a problem right now: if I load my CNN as a global variable at the beginning of my classification file (loaded on import) and use it within a thread, I need to use mp.set_start_method('spawn'), because otherwise I get the following error:
"RuntimeError: Cannot re-initialize CUDA in forked subprocess.
To use CUDA with multiprocessing, you must use the 'spawn' start method"
Now that is of course an easy fix: just call the method above before starting my thread. This does work, but another problem appears at the same time. After setting the start method to 'spawn' the error disappears, but the Jetson starts to allocate way too much memory.
Because of the overhead and the preloaded CNN model, RAM usage is around 2.5 GB before the thread starts. After the start it doesn't stop allocating RAM; it consumes all 4 GB of RAM and also the whole 6 GB of swap. Right after this, the whole API process is killed with the error "cannot allocate memory", which is to be expected.
I managed to fix that as well by loading the CNN model inside the classification function (not preloading it on the GPU as in the two cases before). However, this causes another problem. Loading the model to the GPU takes around 15-20 s, and this happens every time a recognition starts. This is not suitable for us, and we are wondering why we cannot preload the model without killing the whole thing after two image recognitions. Our goal is to be under 5 sec.
import torchvision.transforms as transforms
from skimage import io
import time
from torch.utils.data import Dataset
from .loader import *
from .ResNet import *
# if this part is inside the classify() function, then no allocation problem occurs
net = ResNet152(num_classes=25)
net = net.to('cuda')
save_file = torch.load("./model.pt", map_location=torch.device('cuda'))

def classify(imgp="", return_dict=None):
    # do some classification with the net
    pass

if __name__ == '__main__':
    mp.set_start_method('spawn')  # if commented out, the first error occurs
    manager = mp.Manager()
    return_dict = manager.dict()
    p = mp.Process(target=classify, args=('./bild.jpg', return_dict))
Any help here will be much appreciated. Thank you.
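One thing worth checking: with the 'spawn' start method, the child process imports the classification module afresh, so module-level code such as net.to('cuda') runs again in every spawned process in addition to the parent. A possible workaround is to load the model lazily, once, inside the single long-lived process that actually does the inference, and to reuse it for every request. The following is only a sketch of that caching pattern, not a confirmed fix for the Jetson memory problem; get_model and load_fn are hypothetical names, and load_fn stands in for whatever builds the ResNet and moves it to the GPU.

```python
import threading

_model = None                    # cached model, one per process
_model_lock = threading.Lock()   # guards the first load against races


def get_model(load_fn):
    """Return the cached model, calling load_fn() only on first use.

    load_fn is a hypothetical zero-argument callable that builds the
    network and moves it to the GPU.
    """
    global _model
    if _model is None:
        with _model_lock:
            # Re-check inside the lock so only one thread loads the model.
            if _model is None:
                _model = load_fn()
    return _model
```

With this pattern nothing is loaded at import time, so importing the module in another process does not allocate GPU memory. Only the first call to classify() in the worker pays the 15-20 s loading cost, provided the worker process is kept alive between requests instead of being started per image.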

Does TensorFlow support multiple threads/streams on one GPU for training?

I found the source code of GPUDevice. It hard-codes max streams to 1; may I know the reason?
GPUDevice(const SessionOptions& options, const string& name,
          Bytes memory_limit, const DeviceLocality& locality,
          TfGpuId tf_gpu_id, const string& physical_device_desc,
          Allocator* gpu_allocator, Allocator* cpu_allocator)
    : BaseGPUDevice(options, name, memory_limit, locality, tf_gpu_id,
                    physical_device_desc, gpu_allocator, cpu_allocator,
                    false /* sync every op */, 1 /* max_streams */) {
  if (options.config.has_gpu_options()) {
    force_gpu_compatible_ =
I am wondering whether TensorFlow (1.x) supports multiple threads or multiple streams on a single GPU. If not, I am curious about the underlying reasons: did TF do this on purpose, do libraries such as CUDA prevent TF from providing it, or is there some other reason?
Like some previous posts [1,2], I tried to run multiple training ops in TF, i.e. sess.run([train_op1, train_op2], feed_dict={...}), and used the TF timeline to profile each iteration. However, the TF timeline always showed the two train ops running sequentially (although the timeline is not accurate [3], the wall time of each op suggests sequential execution). I also looked at some of the TF source code; it looks like each op is computed in device->ComputeAsync() or device->Compute(), and the GPU is blocked while computing an op. If I am correct, one GPU can only run a single op at a time, which may lower GPU utilization.
[1] Running multiple tensorflow sessions concurrently
[2] Run parallel op with different inputs and same placeholder
I have had a similar experience.
I have two GPUs. Each GPU runs three threads, each thread runs a session, and the running time of each session fluctuates a lot.
If I run only one thread on each GPU, the session running time is quite stable.
From these observations, it appears that threads in TensorFlow do not cooperate well on a single GPU,
which suggests a problem in TensorFlow's scheduling mechanism.

Debugging the optimization run while training variables of a pre-trained TensorFlow model

I am loading a pre-trained model and then extracting only the trainable variables that I want to optimize (basically change or fine-tune) according to my custom loss. The problem is that the moment I pass a mini-batch of data to it, it just hangs and there is no progress. I used TensorBoard for visualization but don't know how to debug when there is no log info available. I put some basic print statements around it but didn't get any helpful information.
Just to give an idea, this is the relevant code, in order:
# Load and build the model
model = skip_thoughts_model.SkipThoughtsModel(model_config, mode="train")
with tf.variable_scope("SkipThoughts"):
theta = [v for v in tf.get_collection(tf.GraphKeys.MODEL_VARIABLES, scope='SkipThoughts') if "SkipThoughts" in v.name]
# F Representation using Skip-Thoughts model
opt_F = tf.train.AdamOptimizer(learning_rate).minimize(model.total_loss, var_list=[theta])
# Training
sess.run([opt_F], feed_dict = {idx: idxTensor})
And the model is from this repository:
The problem is with training, i.e. the last step. I verified that the theta list is not empty; it has 26 elements in it, like ...
Also, even after using tf.debug the issue remains. Maybe it really takes a lot of time, or it is stuck waiting for some other process? So, I also tried breaking down the
tf.train.AdamOptimizer(learning_rate).minimize(model.total_loss, var_list=[theta])
step into
opt = tf.train.AdamOptimizer(learning_rate)
gvs = opt.compute_gradients(model.total_loss, var_list=theta)
opt_F = opt.apply_gradients(gvs)
g = sess.run(gvs, feed_dict = {idx: idxTensor})
so that I can check if the gradients are computed in the first place, which got stuck at the same point. In addition to that, I also tried computing the gradients with tf.gradients over just one of the variables and that too for one dimension, but the issue still exists.
I am running this piece of code on an IPython notebook on Azure Cluster with 1 GPU Tesla K80. The GPU usage stays the same throughout the execution and there is no out of memory error.
The kernel interrupt doesn't work, and the only way to stop it is by restarting the notebook. Moreover, if I put this code into a Python file, I still need to kill the process explicitly. In either case I don't get a stack trace that shows exactly where it is stuck. How should one debug such an issue?
Any help and pointers in this regard would be much appreciated.
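One way to get at least a Python-level stack trace from a process that hangs is the standard-library faulthandler module, which can dump the tracebacks of all threads after a timeout. The sketch below assumes you can add a few lines around the sess.run call; the helper names start_hang_watchdog and stop_hang_watchdog are made up for illustration. It only shows Python frames, so for a hang inside native TensorFlow code it will tell you which Python line is blocked, not what the C++ side is doing.

```python
import faulthandler


def start_hang_watchdog(log_path, timeout_s=60.0):
    """Dump all thread tracebacks to log_path every timeout_s seconds.

    Hypothetical helper around faulthandler.dump_traceback_later; the
    returned file object must stay open until the watchdog is stopped.
    """
    log_file = open(log_path, "w")
    faulthandler.dump_traceback_later(timeout_s, repeat=True, file=log_file)
    return log_file


def stop_hang_watchdog(log_file):
    """Cancel the pending dump and close the log file."""
    faulthandler.cancel_dump_traceback_later()
    log_file.close()
```

Calling start_hang_watchdog("hang.log") right before the training step and stop_hang_watchdog(...) right after it means the log stays empty when the step finishes in time and fills with tracebacks when it does not.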

How to deal with a large linnet object

I am trying to use a whole city network for a particular analysis, and I know the network is very large. I have also set it up as a sparse network.
SS_StreetsUTM =as.psp(StreetsUTM)
SS_linnetUTM = as.linnet(SS_StreetsUTM, sparse=TRUE)
> SS_linnetUTM
Linear network with 321631 vertices and 341610 lines
Enclosing window: rectangle = [422130.9, 456359.7] x [4610458,
4652536] units
> SS_linnetUTM$sparse
[1] TRUE
I have the following problems:
It took 15-20 minutes to build the psp object.
It took almost 5 hours to build the linnet object.
Every time I want to analyse it for a point pattern or envelope, R crashes.
I understand I should try to reduce the network size, but:
I was wondering if there is a smart way to overcome this problem. Would rescaling help?
How can I put more processing power behind it?
I am also curious to know whether spatstat can be used with the parallel package.
Finally, what are the limitations on network size for spatstat?
R crashes
R crashes when I use the instructions from Spatstat book:
KN <- linearK(spiders, correction="none")  # on my network (linnet), of course
envelope(spiders, linearK, correction="none", nsim=39)  # on my network
I do not think RAM is the problem. I have 16 GB of RAM and a 2.5 GHz dual-core i5 processor on a machine with an SSD.
Could someone guide me, please?
Please be more specific about the commands you used.
Did you build the linnet object from a psp object using as.linnet.psp (in which case the connectivity of the network must be guessed, and this can take a long time), or did you have information about the connectivity of the network that you passed to the linnet() command?
Exactly what commands to "analyse it for a point pattern or envelope" cause a crash, and what kind of crash?
The code for linear networks in spatstat is research code which is still under development. Faster algorithms for the K-function will be released soon.
I could only resolve this by simplifying my network in QGIS with the Douglas-Peucker algorithm in the Simplify Geometries tool. So it is a slight compromise on the geometry of the linear network in the shapefile.
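For readers who want to see what that kind of simplification does to a single street polyline, here is a minimal pure-Python sketch of the Douglas-Peucker idea. It is an illustration only, not the QGIS implementation, and the function names and the tolerance parameter are chosen for this example. A vertex is dropped when it lies within the tolerance of the straight line between the retained endpoints.

```python
import math


def _point_segment_distance(p, a, b):
    """Distance from point p to the line segment a-b (2D tuples)."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    if dx == 0 and dy == 0:
        return math.hypot(px - ax, py - ay)
    # Project p onto the segment and clamp to its endpoints.
    t = ((px - ax) * dx + (py - ay) * dy) / (dx * dx + dy * dy)
    t = max(0.0, min(1.0, t))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))


def douglas_peucker(points, tolerance):
    """Simplify a polyline, keeping vertices farther than tolerance."""
    if len(points) < 3:
        return list(points)
    # Find the interior vertex farthest from the chord first-last.
    index, dmax = 0, 0.0
    for i in range(1, len(points) - 1):
        d = _point_segment_distance(points[i], points[0], points[-1])
        if d > dmax:
            index, dmax = i, d
    if dmax <= tolerance:
        # Every interior vertex is close enough: keep only the endpoints.
        return [points[0], points[-1]]
    # Otherwise keep the farthest vertex and simplify both halves.
    left = douglas_peucker(points[: index + 1], tolerance)
    right = douglas_peucker(points[index:], tolerance)
    return left[:-1] + right
```

Fewer vertices per street means fewer segments in the psp object, which is why the simplified shapefile is easier on as.linnet, at the cost of some geometric detail.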
