How to move multiple tensors to the CUDA device concurrently? - pytorch

policy_data, value_data, action_mask = policy_data.cuda(non_blocking=True), value_data.cuda(non_blocking=True), action_mask.cuda(non_blocking=True)
rewards, regret_probs = rewards.cuda(non_blocking=True), regret_probs.cuda(non_blocking=True)
return action_probs.cpu(), sample_probs.cpu(), sample_indices.cpu(), update
I am doing some RL work and am wondering whether it would be possible to speed up fragments like the above by launching the data transfer to the GPU on different streams before waiting on them together. Does PyTorch have any functions that would make this easier? I'd rather ask here before I dive into the minutiae of optimizing data transfers.

Seems like one potential solution would be to pack all of the data into a single tensor (though of course you'd likely pay a small cost for unused elements within this compacted representation). An alternative would be to store this compact tensor as a sparse tensor (no padding values, but slightly more memory consumption per stored value). You'd have to test both to determine which is more efficient for your use case.
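A rough sketch of both directions, assuming arbitrary tensor shapes chosen just for illustration. Note that non_blocking=True only overlaps the copy with other work when the source tensor lives in pinned (page-locked) host memory; the packing variant trades one large copy for a little slicing on the GPU:
import torch

# Hypothetical shapes, just to make the sketch self-contained.
policy_data = torch.randn(1024, 128)
value_data = torch.randn(1024)
action_mask = torch.randint(0, 2, (1024, 128)).float()

# Option 1: pin each host tensor so non_blocking=True can actually overlap.
pinned = [t.pin_memory() for t in (policy_data, value_data, action_mask)]
policy_gpu, value_gpu, mask_gpu = (t.cuda(non_blocking=True) for t in pinned)

# Option 2: pack everything into a single pinned buffer, issue one copy,
# and slice views back out on the GPU.
sizes = [t.numel() for t in (policy_data, value_data, action_mask)]
flat = torch.cat([t.reshape(-1) for t in (policy_data, value_data, action_mask)]).pin_memory()
flat_gpu = flat.cuda(non_blocking=True)
policy_gpu, value_gpu, mask_gpu = (
    chunk.view(shape) for chunk, shape in zip(flat_gpu.split(sizes),
                                              [(1024, 128), (1024,), (1024, 128)]))
Whether the single packed copy beats three pinned non-blocking copies depends on tensor sizes and how much other work the copies can overlap with, so it is worth profiling both.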

Related

Using GPU with PyTorch

I am confused about the usage of .to(device) in PyTorch. I know that it loads the variables onto the GPU. But after that, let's say we multiply two GPU tensors together: will our computer know to use the GPU to do that, or will it cast them back to the CPU and perform the computation there? I guess I am just confused about when and how our computer knows to use the GPU beyond us telling it to hold some variable in its memory.
All operations on tensors that live in GPU memory are performed on the GPU. Furthermore, if multiple tensors are involved in an operation, all of them need to be on the same .device.
The way this works is that no matter what operation we apply to tensors, we always end up calling methods of the torch.Tensor class. Adding two tensors with a + b actually calls torch.Tensor.add, so every operation is executed with knowledge of which device the tensors live on.
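A quick way to see this for yourself; a minimal sketch that falls back to the CPU when no CUDA device is available:
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

a = torch.randn(3, 3, device=device)
b = torch.randn(3, 3, device=device)

c = a + b          # dispatched on the tensors' device; nothing is copied back to the CPU
print(c.device)    # e.g. cuda:0 when a GPU is available

# Mixing devices does not silently fall back to the CPU:
# a + b.cpu()      # would raise a RuntimeError about tensors being on different devices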

Which data structure is more efficient for A*?

In A* search, which data structure would be more efficient: a min-heap or a binary search tree?
Considering that the operations below are to be handled frequently:
(a) extract min
(b) search a node
(c) update the node
(d) insert a node
Note: the search operation will be very frequent, as we need to check the presence of each candidate child node in the open list of A*.
A min-heap is a much simpler data structure than a balanced binary search tree, and it's typically implemented in an array, which reduces memory used and improves cache locality.
For these reasons, a min-heap implementation will be much faster if you do it right.
It's often tricky to implement the decrease-key operation in an array-based min-heap, though. The usual solution is not to implement decrease-key at all, but just to insert another record in the min-heap whenever the distance to a node is decreased.
This will not increase the time complexity of the algorithm, but the min-heap will take O(|E|) space. If your graph is very dense, then a node's key may be decreased many times, and this memory consumption might be too much. If that's so, then you should just clean up the min-heap (remove invalid entries and re-heapify) whenever more than half of the entries in the heap are invalid. This will keep memory consumption down to O(|V|) without significantly affecting run time.
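For illustration, a minimal Python sketch of this lazy decrease-key approach using heapq, assuming hypothetical neighbors(node) -> iterable of (next_node, cost) and heuristic(node) callables supplied by the caller:
import heapq
import itertools

def a_star(start, goal, neighbors, heuristic):
    """A* with a heapq-based open list and lazy decrease-key: instead of
    updating an entry in place, push a new record and skip stale ones."""
    g = {start: 0}                      # best known cost to reach each node
    counter = itertools.count()         # tie-breaker so nodes are never compared
    open_heap = [(heuristic(start), next(counter), start)]
    closed = set()
    while open_heap:
        f, _, node = heapq.heappop(open_heap)
        if node in closed:              # stale duplicate, already expanded
            continue
        if node == goal:
            return g[node]
        closed.add(node)
        for nxt, cost in neighbors(node):
            new_g = g[node] + cost
            if new_g < g.get(nxt, float("inf")):
                g[nxt] = new_g          # "decrease-key" = push a fresh record
                heapq.heappush(open_heap, (new_g + heuristic(nxt), next(counter), nxt))
    return None                         # goal unreachable
The closed-set check on pop is what makes the duplicate records harmless: only the cheapest record for a node is ever expanded.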

Is it possible to make the nodes and trees of MCTS work on GPU-only with PyTorch?

I've seen some discussion about MCTS and GPU. It's said there's no advantage to using a GPU, as MCTS doesn't involve much matrix multiplication. But running it on the CPU has a drawback: transferring data between devices really takes time.
Here I mean that the nodes and the tree should live on the GPU. Then they could process the data on the GPU, without copying it from the CPU. If I just write ordinary node and tree classes, their methods will run on the CPU.
So I wonder whether I can move the searching part to GPU. Is there any example?
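There is no ready-made example in PyTorch itself, but one common direction is to keep the per-node statistics in preallocated tensors on the device instead of in Python objects, so the selection math runs on the GPU. A very rough, hypothetical sketch (the tensor layout, sizes, and PUCT constant are made up for illustration):
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
max_nodes, max_children = 10_000, 8

# All per-node statistics live in preallocated tensors on the device.
visit_counts = torch.zeros(max_nodes, max_children, device=device)
value_sums = torch.zeros(max_nodes, max_children, device=device)
priors = torch.full((max_nodes, max_children), 1.0 / max_children, device=device)

def select_child(node, c_puct=1.5):
    # Vectorized PUCT-style selection for one node, computed on the device.
    n = visit_counts[node]
    q = torch.where(n > 0, value_sums[node] / n.clamp(min=1), torch.zeros_like(n))
    u = c_puct * priors[node] * torch.sqrt(n.sum() + 1.0) / (1.0 + n)
    return int(torch.argmax(q + u))   # only a single scalar crosses back to the CPU
The tree-walking control flow still runs in Python here; a fully GPU-resident search would also need the expansion and backup steps expressed as tensor operations, which is considerably more involved.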

AWS, CUDA, TensorFlow

When I run my Python code on the most powerful AWS GPU instances (with 1 or 8 x Tesla V100 16 GB, i.e. p3.2xlarge or p3.16xlarge), they are both only 2-3 times faster than my Dell XPS laptop with a GeForce 1050 Ti?
I'm using Windows, Keras, CUDA 9, TensorFlow 1.12 and the newest Nvidia drivers.
When I check the GPU load via GZU, the GPU peaks at 43% load for a very short period each time, while the controller runs at 100%...
The dataset I use consists of matrices in JSON format, and the files are located on a 10 TB Nitro drive with a maximum of 64,000 IOPS. Whether the folder contains 10 TB, 1 TB or 100 MB, training is still very slow per iteration.
All advice is more than welcome!
UPDATE 1:
From the TensorFlow docs:
"To start an input pipeline, you must define a source. For example, to construct a Dataset from some tensors in memory, you can use tf.data.Dataset.from_tensors() or tf.data.Dataset.from_tensor_slices(). Alternatively, if your input data are on disk in the recommended TFRecord format, you can construct a tf.data.TFRecordDataset."
Previously I had the matrices stored in JSON format (generated by Node). My TF code runs in Python.
I will now save only the coordinates in Node, still in JSON format.
The question is now: in Python, what is the best way to load the data? Can TF use the coordinates directly, or do I have to turn the coordinates back into matrices first?
The performance of any machine learning model depends on many things, including but not limited to: how much pre-processing you do, how much data you copy from CPU to GPU, op bottlenecks, and many more. Check out the TensorFlow performance guide as a first step. There are also a few videos from the TensorFlow Dev Summit 2018 that talk about performance; the ones on how to properly use tf.data and how to debug performance are two that I recommend.
The only thing I can say for sure is that JSON is a bad format for this purpose. You should switch to the TFRecord format, which uses protocol buffers (better suited than JSON).
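As a rough sketch of the conversion (assuming a recent TensorFlow where the tf.io and tf.train APIs are available, and a hypothetical JSON file holding a list of 2-D matrices):
import json
import numpy as np
import tensorflow as tf

def json_matrices_to_tfrecord(json_path, tfrecord_path):
    """Convert a JSON file holding a list of 2-D matrices into one TFRecord file."""
    with open(json_path) as f:
        matrices = json.load(f)                      # e.g. [[[...], ...], ...]
    with tf.io.TFRecordWriter(tfrecord_path) as writer:
        for m in matrices:
            arr = np.asarray(m, dtype=np.float32)
            example = tf.train.Example(features=tf.train.Features(feature={
                "matrix": tf.train.Feature(
                    float_list=tf.train.FloatList(value=arr.ravel().tolist())),
                "shape": tf.train.Feature(
                    int64_list=tf.train.Int64List(value=list(arr.shape))),
            }))
            writer.write(example.SerializeToString())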
Unfortunately performance and optimisation of any system takes a lot of effort and time, and can be a rabbit hole that just keeps going down.
First off, you should have a really good reason to accept the increased computational overhead of a Windows-based AMI.
If your CPU is at ~100% while the GPU is below 100%, then your CPU is likely the bottleneck. If you are on the cloud, consider moving to instances with a larger CPU count (CPU is cheap, GPU is scarce). If you can't increase the CPU count, moving some parts of your graph to the GPU is an option. However, a tf.data-based input pipeline runs entirely on the CPU (but is highly scalable thanks to its C++ implementation). Prefetching to the GPU might also help here, but the cost of spawning another background thread to populate the buffer for downstream ops might dampen this effect. Another option is to do some or all pre-processing steps offline (i.e. prior to training).
A word of caution on using Keras as the input pipeline. Keras relies on Python's multithreading (and optionally multiprocessing) libraries, which may lack both performance (when doing heavy I/O or on-the-fly augmentations) and scalability (when running on multiple CPUs) compared to GIL-free implementations. Consider performing preprocessing offline, pre-loading input data, or using alternative input pipelines (such as the aforementioned native tf.data, or third-party ones like Tensorpack).
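For the reading side, a minimal tf.data pipeline that parses records in parallel and prefetches so the input pipeline overlaps with training might look like this (assuming TF 2.4+ for tf.data.AUTOTUNE, a hypothetical train.tfrecord file, and fixed-size matrices):
import tensorflow as tf

ROWS, COLS = 64, 64   # hypothetical fixed matrix shape

def parse_example(serialized):
    features = tf.io.parse_single_example(serialized, {
        "matrix": tf.io.FixedLenFeature([ROWS * COLS], tf.float32),
    })
    return tf.reshape(features["matrix"], (ROWS, COLS))

dataset = (tf.data.TFRecordDataset(["train.tfrecord"])
           .map(parse_example, num_parallel_calls=tf.data.AUTOTUNE)   # parallel parsing on CPU
           .batch(32)
           .prefetch(tf.data.AUTOTUNE))                               # overlap input with training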

How to optimize an algorithm for a given multi-core architecture

I would like to know what techniques I should look into for optimizing a given algorithm for a given architecture. How do I improve performance through better use of the caches? How do I reduce cache-coherency overhead, and which access patterns should I avoid in my algorithm/program so that cache coherency doesn't hurt performance?
I understand a few standard techniques for reusing recently cached data in L1, but how would I effectively use data in a shared cache (say L2) on a multi-core, so that I avoid a main-memory access, which is even costlier?
Basically, I am interested in which data-access patterns I should try to exploit or avoid for a better mapping to my given architecture, and which data structures I could use, in which scenarios and on which architectures (with different levels of private and shared cache), to improve performance. Thanks.
What techniques should I look into for optimizing a given algorithm for a given architecture?
Micro-architectures vary, so learn the details of your specific processor. Intel provides good documentation in their optimization guide. If you are using an Intel processor you'll want to read sections 8.3 and 8.6:
8.3 OPTIMIZATION GUIDELINES
This section summarizes optimization guidelines for tuning multithreaded applications. Five areas are listed (in order of importance):
Thread synchronization
Bus utilization
Memory optimization
Front end optimization
Execution resource optimization
Practices associated with each area are listed in this section. Guidelines for each area are discussed in greater depth in sections that follow. Most of the coding recommendations improve performance scaling with processor cores; and scaling due to HT Technology. Techniques that apply to only one environment are noted.
8.6 MEMORY OPTIMIZATION
Efficient operation of caches is a critical aspect of memory optimization. Efficient operation of caches needs to address the following:
Cache blocking
Shared memory optimization
Eliminating 64-KByte aliased data accesses
Preventing excessive evictions in first-level cache
What data-access patterns should I try to exploit or avoid for a better mapping to my given architecture?
Exploit
When the cache is full and an access misses, the cache must evict something to make room for the new data/code; what is evicted is usually chosen by an approximation of least-recently-used (LRU). If possible, your code should have strong locality of reference:
Try to pack data that is used close in time in the algorithm such that it is close in space (address)
Pack data tightly, don't use a 64-bit integer when a 32-bit integer will do, for example
Sometimes the alignment of an "object" (related data) relative to a cache line matters. For example, if there is an array of objects, each 64 bytes, and they are accessed randomly, then aligning them on a 64-byte boundary improves cache efficiency by not bringing in data that is not used. If an object isn't aligned, then every object touched brings in two cache lines even though only 64 bytes are needed, so 50% of the data transferred isn't used (assuming 64-byte cache lines).
As #PaulA.Clayton pointed out in the comments, pre-fetching data is very important, as it hides part or all of the memory latency. "Also, exploiting stride-based hardware prefetching can be quite beneficial. (Software prefetching can also be useful in some cases.) Getting pointers early helps increase memory-level parallelism."
To help the hardware prefetcher and to increase the utilization of the data that is brought into the cache, pay careful attention to how matrices and other large structures are stored and accessed; see the Wikipedia article on row-major order and the sketch below.
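As an illustration of why storage order matters, a small NumPy timing sketch (the array size is arbitrary): sweeping along rows walks memory contiguously and plays well with the prefetcher, while sweeping down columns is strided and touches a new cache line per element.
import numpy as np
import time

a = np.zeros((4096, 4096), dtype=np.float32)   # NumPy arrays are row-major (C order)

start = time.perf_counter()
for i in range(a.shape[0]):
    a[i, :] += 1.0          # contiguous sweep: cache- and prefetcher-friendly
row_time = time.perf_counter() - start

start = time.perf_counter()
for j in range(a.shape[1]):
    a[:, j] += 1.0          # strided sweep: each element lands in a different cache line
col_time = time.perf_counter() - start

print(f"row sweep: {row_time:.3f}s, column sweep: {col_time:.3f}s")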
Avoid
Data that you don't use often shouldn't be close to data that you use frequently
Avoid false sharing. If two or more threads access the same cache line but are not sharing the same data within that cache line, and at least one of them is a writer, you have false sharing: there will be an unnecessary burden and latency hit from the cache-coherency protocol.
Try not to use new data until you are done with the older data
Measure
As Andrei Alexandrescu said in this talk, when it comes to performance tuning the only intuition that is right is "I should measure this." Familiarize yourself with cache performance monitoring tools, for example:
perf
Cachegrind
The key principle is locality: when you have the choice, process nearby data first (avoid sparse accesses), and perform data reuse as soon as possible (regroup successive passes over the same data).
For multithreaded programs, the principle is separate locality: ensure that the threads work on disjoint data sets (use distinct copies if necessary/possible).
Unless you have very good reasons to do so, stay away from the peculiarities of the hardware.
It should be mentioned that code is cached in the same way as data. Small, dense code with a lot of inlining and few jumps/calls puts less strain on the L1 instruction cache and, ultimately, on L2, L3 and RAM, where collisions with data fetches occur.
If you are using hyperthreading, there appears to be evidence that a lower optimization level (-O1) on two hyperthreads in a core will, overall, get more work done than a single highly optimized (-O2 and higher) thread.
