In a detectron2 instance segmentation context: does anyone out there know an easy way (ideally by setting some parameters in https://detectron2.readthedocs.io/en/latest/modules/config.html#yaml-config-references) to reject results where, for a given instance, multiple bitmask patches have been detected?
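For illustration, one workaround is to filter predictions after inference rather than through the config. A minimal sketch along those lines, assuming the standard Mask R-CNN pred_masks output and using scipy.ndimage.label (which is not part of detectron2) to count connected patches:

```python
import torch
from scipy import ndimage

def keep_single_patch_instances(instances):
    """Drop predictions whose binary mask splits into more than one connected patch."""
    masks = instances.pred_masks.cpu().numpy()        # (N, H, W) boolean masks
    keep = torch.zeros(len(instances), dtype=torch.bool)
    for i, mask in enumerate(masks):
        _, num_patches = ndimage.label(mask)          # count connected components
        keep[i] = num_patches <= 1
    return instances[keep]

# Hypothetical usage after running a DefaultPredictor:
# outputs = predictor(image)
# filtered = keep_single_patch_instances(outputs["instances"])
```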
My network includes 'torch.nn.MaxPool3d', which throws a RuntimeError when the cudnn deterministic flag is on, according to the PyTorch docs (version 1.7 - https://pytorch.org/docs/stable/generated/torch.set_deterministic.html#torch.set_deterministic). However, when I inserted 'torch.backends.cudnn.deterministic=True' at the beginning of my code, there was no RuntimeError. Why doesn't that code throw a RuntimeError?
I also wonder whether that code guarantees that the computation in my training process is deterministic.
torch.backends.cudnn.deterministic=True only applies to CUDA convolution operations, and nothing else. Therefore, no, it will not guarantee that your training process is deterministic, since you're also using torch.nn.MaxPool3d, whose backward function is nondeterministic for CUDA.
torch.set_deterministic(), on the other hand, affects all the normally-nondeterministic operations listed here (note that set_deterministic has been renamed to use_deterministic_algorithms in 1.8): https://pytorch.org/docs/stable/generated/torch.use_deterministic_algorithms.html?highlight=use_deterministic#torch.use_deterministic_algorithms
As the documentation states, some of the listed operations don't have a deterministic implementation. So if torch.use_deterministic_algorithms(True) is set, they will throw an error.
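A minimal repro of the difference (illustrative only; it assumes a CUDA device is available, and the exact error text varies by PyTorch version):

```python
import torch
import torch.nn as nn

pool = nn.MaxPool3d(2)
x = torch.randn(1, 1, 4, 4, 4, device="cuda", requires_grad=True)

# This flag only affects cuDNN convolutions; MaxPool3d's nondeterministic backward still runs:
torch.backends.cudnn.deterministic = True
pool(x).sum().backward()          # no error raised

# The global switch rejects ops that lack a deterministic implementation:
torch.use_deterministic_algorithms(True)   # torch.set_deterministic(True) in 1.7
try:
    x.grad = None
    pool(x).sum().backward()
except RuntimeError as err:
    print(err)  # reports that the CUDA max_pool3d backward has no deterministic implementation
```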
If you need to use nondeterministic operations like torch.nn.MaxPool3d, then, at the moment, there is no way for your training process to be deterministic--unless you write a custom deterministic implementation yourself. Or you could open a GitHub issue requesting a deterministic implementation: https://github.com/pytorch/pytorch/issues
In addition, you might want to check out this page: https://pytorch.org/docs/stable/notes/randomness.html
I wanted to see how the Conv1d module is implemented: https://pytorch.org/docs/stable/_modules/torch/nn/modules/conv.html#Conv1d. So I looked at functional.py but still couldn't find the looping and cross-correlation computation.
Then I searched GitHub for the keyword 'conv1d' and checked conv.cpp (https://github.com/pytorch/pytorch/blob/eb5d28ecefb9d78d4fff5fac099e70e5eb3fbe2e/torch/csrc/api/src/nn/modules/conv.cpp), but still couldn't locate where the computation is happening.
My question is two-fold.
Where is the source code in which conv1d is implemented?
In general, if I want to check how the modules are implemented, where is the best place to look? Any pointer to the documentation would be appreciated. Thank you.
It depends on the backend (GPU, CPU, distributed, etc.), but in the most interesting case of GPU it's pulled from cuDNN, which is released in binary format, so you can't inspect its source code. It's a similar story for MKL-DNN on CPU. I am not aware of any place where PyTorch would "handroll" its own convolution kernels, but I may be wrong. EDIT: indeed, I was wrong, as pointed out in an answer below.
It's difficult without knowing how PyTorch is structured. A lot of code is actually autogenerated from various markup files, as explained here. Figuring this out requires a lot of jumping around. For instance, the conv.cpp file you're linking to uses torch::conv1d, which is defined here and uses at::convolution, which in turn uses at::_convolution, which dispatches to multiple variants, for instance at::cudnn_convolution. at::cudnn_convolution is, I believe, created here via a markup file and just plugs directly into the cuDNN implementation (though I cannot pinpoint the exact point in the code where that happens).
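One way to watch that dispatch happen from Python is to profile a single call; this is a rough sketch, and the exact operator names in the trace depend on your build and device:

```python
import torch
import torch.nn.functional as F

x = torch.randn(1, 3, 32)
w = torch.randn(8, 3, 5)

# Profile one call; the trace exposes the C++ dispatch chain rather than any Python loop.
with torch.profiler.profile() as prof:
    F.conv1d(x, w)

# Expect something like aten::conv1d -> aten::convolution -> aten::_convolution
# -> a backend kernel (e.g. an MKL-DNN convolution on CPU, aten::cudnn_convolution on GPU).
print(prof.key_averages().table(sort_by="cpu_time_total"))
```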
Below is an answer that I got from the PyTorch discussion board:
I believe the "handrolled" convolution is defined here: https://github.com/pytorch/pytorch/blob/master/aten/src/THNN/generic/SpatialConvolutionMM.c
The NN module implementations are here: https://github.com/pytorch/pytorch/tree/master/aten/src
The GPU version is in THCUNN and the CPU version in THNN
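For reference, the computation those kernels implement is plain cross-correlation. A naive Python sketch of it, for illustration only (the C/CUDA kernels are heavily optimized and look nothing like this):

```python
import torch
import torch.nn.functional as F

def naive_conv1d(x, w):
    # Reference cross-correlation loop: stride 1, no padding, no dilation, no bias.
    batch, in_ch, length = x.shape
    out_ch, _, k = w.shape
    out = torch.zeros(batch, out_ch, length - k + 1)
    for b in range(batch):
        for o in range(out_ch):
            for t in range(length - k + 1):
                out[b, o, t] = (x[b, :, t:t + k] * w[o]).sum()
    return out

x = torch.randn(2, 3, 16)
w = torch.randn(4, 3, 5)
print(torch.allclose(naive_conv1d(x, w), F.conv1d(x, w), atol=1e-4))  # True
```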
I am new to fastText, a library for efficient learning of word representations and sentence classification. I am trying to generate word vectors for a huge dataset, but in a single process it takes a significantly long time.
So let me put my questions clearly:
Are there any options I can use to speed up a single fastText process?
Is there any way to generate word vectors in parallel across fastText processes?
Are there any other implementations or workarounds available that could solve the problem? I read that a Caffe2 implementation is available, but I am unable to find it.
Thanks
I understand your question as: you would like to distribute fastText and do parallel training.
As mentioned in Issue #144
... a future feature we might consider implementing. For now it's not on our list of priorities, but it might very well soon.
Apart from the Word2Vec Spark implementation also mentioned there, I am not aware of any other implementations.
The original FastText release by Facebook includes a command-line option, thread (default 12), which controls the number of worker threads that will do parallel training (on a single machine). If you have more CPU cores and haven't yet tried increasing it, try that.
The gensim implementation (as gensim.models.fasttext.FastText) includes an initialization parameter, workers, which controls the number of worker threads. If you haven't yet tried increasing it, up to the number of cores, it may help. However, due to extra multithreading bottlenecks in its Python implementation, if you have a lot of cores (especially 16+), you might find maximum throughput with fewer workers than cores – often something in the 4-12 range. (You have to experiment & watch the achieved rates via logging to find the optimal value, and all cores won't be maxed.)
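For example, a rough gensim training sketch; the parameter values here are placeholders to tune for your corpus, and corpus.txt is assumed to be a pre-tokenized text file with one sentence per line (the corpus_file path tends to parallelize better than a Python iterable):

```python
from gensim.models.fasttext import FastText

model = FastText(
    vector_size=100,   # called `size` in older gensim releases
    window=5,
    min_count=5,
    workers=8,         # try values between 4 and the number of physical cores
)
model.build_vocab(corpus_file="corpus.txt")
model.train(
    corpus_file="corpus.txt",
    total_examples=model.corpus_count,
    total_words=model.corpus_total_words,
    epochs=5,
)
print(model.wv["example"])  # the trained word vector
```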
You'll only get significant multithreading in gensim if your installation is able to make use of its Cython-optimized routines. If you watch the logging when you install gensim via pip or similar, there should be a clear error if this fails. Or, if you are watching logs/output when loading/using gensim classes, there will usually be a warning if the slower non-optimized versions are being used.
Finally, in the way people often use gensim, the bottleneck can be the corpus iterator or I/O rather than the parallelism. To minimize this slowdown:
Check how fast your corpus can iterate over all examples on its own, separate from passing it to the gensim class (see the timing sketch after this list).
Avoid doing any database-selects or complicated/regex preprocessing/tokenization in the iterator – do it once, and save the easy-to-read-as-tokens resulting corpus somewhere.
If the corpus is coming from a network volume, test if streaming it from a local volume helps. If coming from a spinning HD, try an SSD.
If the corpus can be made to fit in RAM, perhaps on a special-purpose giant-RAM machine, try that.
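A quick way to time the raw iteration, assuming a one-sentence-per-line, already-tokenized corpus.txt and gensim's LineSentence helper:

```python
import time
from gensim.models.word2vec import LineSentence

corpus = LineSentence("corpus.txt")   # assumed pre-tokenized, one sentence per line

start = time.time()
n_sentences = sum(1 for _ in corpus)  # iterate exactly as training would, but do no training
print(f"{n_sentences} sentences read in {time.time() - start:.1f}s")
```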
We are having trouble when we run certain GPU functions (e.g. a point tracker that uses texture references in multiple places) in parallel from multiple CPU threads (all CPU threads use the same GPU).
It seems that the 'texture references', which afaik are a sort of 'global device variable', are the problem, as these are the only 'global' variables we have (note that 'constant memory' might also be an issue, but we will focus on the texture reference problem for now). We mainly use texture references to 2D images (pitched-linear memory), as we work in the image processing field.
How can we rewrite the kernels that use the texture references so that they are CPU-thread-safe? Is it possible at all? Note that in our framework we plan to have exactly 4 CPU threads per GPU (each CPU thread is a GPU worker thread which gets some 'GPU job' issued to it, which it then executes).
This question seems to be related to the problem of 'arrays of texture references'; I don't know whether an array of texture references is possible with newer CUDA toolkits / newer GPU architectures.
See forum postings at
https://devtalk.nvidia.com/default/topic/376700/?comment=2688621
https://devtalk.nvidia.com/default/topic/368336/cuda-programming-and-performance/-help-compilation-problem-with-cuda-texture-cuda-texture-usage/
Or just search the NVIDIA CUDA forum for 'texture references array' and notice that it really seems to be a hot topic :-)
In one of these postings a function 'cuTexRefCreate' was mentioned; is that the way to go? I suppose it could also be used with the CUDA runtime API, but it seems to be deprecated, so it may not be a safe way.
Any help on this question would be appreciated. Note that any possible strategies should also work on Fermi architecture GPUs.
A related question is whether this multi-threading issue is also a problem on the latest Kepler architecture, where pointers of type 'const __restrict__' may get mapped automatically to a texture object.
Textures need to be global objects, hence thread safety is a concern in the same way as with any global variable shared by multiple threads. The possible solutions are also similar: you can use the typical concurrency constructs to ensure thread safety, e.g. a mutex to ensure that only one thread binds the texture and the others use the reference it created.
What I'm not entirely sure of is whether the texture reference manipulation functions themselves are thread-safe, i.e. whether you need to make sure that texture manipulations do not happen concurrently on the same reference.
However, what you should really consider are texture objects, which, although supported only on compute capability >= 3.0, are not required to be declared globally.
EDIT:
NVIDIA engineers confirmed that the texture reference manipulation functions are thread-safe.
I'm looking specifically for a set of operations to be performed within a given length of time, so I can compare my library's performance to that of other libraries without having to build them all.