How to use multiple cores in Caffe2 on mobile? - pytorch

I’ve got a simple model consisting only of convolutions (even no activation between) and I wanted to benchmark it in Caffe2 on ARM Android device using multiple cores.
When I run
./speed_benchmark --init_net=model_for_inference-simplified-init-net.pb --net=model_for_inference-simplified-predict-net.pb --iter=1
it runs on single core.
Speed benchmark was built using:
scripts/build_android.sh -DANDROID_ABI=arm64-v8a -DANDROID_TOOLCHAIN=clang -DBUILD_BINARY=ON
On X86 it has been build via
mkdir build
cd build
cmake .. -DBUILD_BINARY=ON
and setting OMP_NUM_THREADS=8 helps but not on ARM
Do I need to change the building command for arm, set some environmental variables, some binary arguments or something else?

I didn't know that I need to set the engine information in the model like described in https://caffe2.ai/docs/mobile-integration.html
After updating prediction net by:
for op in predict_net.op:
if op.type == 'Conv':
op.engine = 'NNPACK'
more cores started to be used

Related

How to run multi-thread programs in rocket-chip?

I am trying to run multi-thread programs/benchmark in a rocket-chip SoC I generated from chipyard.
I generated the TutorialConfig SoC given in https://fires.im/isca22-slides-pdf/03_building_custom_socs.pdf, which consists of a Rocket core and a BOOM core.
To check whether I can run a multi-threaded program in this configuration, I compiled the mt-matmul benchmark in riscv-tests after changing the number of cores in crt.S file.
I ran it using the following command:
make CONFIG=TutorialStarterConfig run-binary BINARY=riscv-tests/benchmarks/mt-matmul.riscv
In the output trace, I can only see 'C0' at the beginning of each line, I am assuming I should see 'C1' if the program was executed in a second core.
Is this the correct way to run multi-threaded programs with RocketChip SoCs?
Do I need to change anything else in the programs or in the SoC?

Use PyTorch DistributedDataParallel with Hugging Face on Amazon SageMaker

Even for single-instance training, PyTorch DistributedDataParallel (DDP) is generally recommended over PyTorch DataParallel (DP) because DP's strategy is less performant and it uses more memory on the default device. (Per this PyTorch forums thread)
Hugging Face recommend to run distributed training via the python -m torch.distributed.launch launcher, because their Trainer API supports DDP but will fall back to DP if you don't. (Per this HF forums thread)
I recently ran in to this problem: scaling a HF training job from p3.8xlarge to p3.16xlarge increased memory consumption on (I think) one of the GPUs to the point where I had to significantly reduce batch size to avoid CUDA Out of Memory errors - basically losing all scaling advantage.
So the good news is for p3.16xl+ I can just enable SageMaker Distributed Data Parallel and the PyToch DLC will automatically launch via torch.distributed for me.
The bad news for use cases with smaller workloads or wanting to test before they scale up, is that SMDistributed doesn't support all multi-GPU instance types. No p3.8xl or g series, for example. I did try manually setting the sagemaker_distributed_dataparallel_enabled environment variable, but no joy.
So how else can we launch HF Trainer scripts with PyTorch DDP on SageMaker?
Great question, thanks for asking! PyTorch DDP runs data parallel workers in multiple processes, that must be launched and managed by developers. DDP should be seen as a managed allreduce, more than a managed data-parallelism library, since it requires you to launch and manage the workers and even assigning resources to workers. In order to launch the DDP processes in a SageMaker Training job you have many options:
If you do multi-GPU, single-machine, you can use torch.multiprocessing.spawn, as shown in this official PyTorch demo (that is broken by the way)
If you do multi-GPU, single-machine, you can also use the Ray Train library to launch those processes. I was able to use it in a Notebook, but not in the DLC yet (recent library that is a bit rough to learn and make work, see all my issues here). Ray Train should work on multi-node too.
If you do multi-GPU, any-machine, you can use torch.distributed.launch, wrapped in a launcher script in shell or Python. Example here https://gitlab.aws.dev/cruchant/a2d2-segmentation/-/blob/main/3_2D-Seg-Audi-A2D2-Distributed-Training-DDP.ipynb
You can also launch those processes with the SageMaker MPI integration instead of torch.distributed. Unfortunately, we didn't create documentation for this, so no one uses it nor pitches it. But it looks cool, because it allows to run copies of your script directly in the EC2 machines without the need to invoke an intermediary PyTorch launcher. Example here
So for now, my recommendation would be to go the route (3), which is the closest to what the PyTorch community does, so provides easier development and debugging path.
Notes:
PyTorch DDP evolves fast. In PT 1.10 torch.distributed is replaced by torchrun, and a torchX tool is being created to...simplify things!).
Not having to manage that mess is a reason why SageMaker Distributed Data Parallel is a great value prop: you only need to edit your script, and the SM service handles process creation. Unfortunately, as you point out, SMDP being limited to P3 and P4 training jobs seriously limits its use.
Below are important PT DDP concepts to understand to alter single-GPU code into multi-machine code
Unlike Apache Spark, which takes care of workload partitioning on your behalf, Pytorch distributed training requires the user to assign specific pieces of work to specific GPUs. In the following section, we assume that we train on GPU.
In PyTorch DDP, each GPU runs a customized copy of you training code. A copy of the training code running on one GPU is generally called a rank, a data parallel replica, a process, a worker, but other names may exist.
For PyTorch DDP to launch a training cluster on the MxN GPUs spread over your M machines, you must specify to PyTorch DDP the number of machines you have and the number of processes to launch per machine. This is respectively done by the parameters -nnodes and -nproc_per_node of the torch.distributed.launch utility. You must run the torch.distributed.lauch once on each node of the training cluster. You can achieve this parallel command with multiple tools, for example with MPI or SageMaker Training as mentioned above. In order to establish the necessary handshakes and form a cluster, you must also specify in the torch.distributed.launch command -node_rank, which must take a unique machine ID between 0 and N-1 on each of the machines, and -master_addr and -master_port, optional if you run a single-machine cluster, which must be the same across all machines.
In the init_process_group DDP initialization method running from within each data parallel replica script, you must specify the world size and replica ID, respectively with the world_size and rank parameters. Hence you must have a way to communicate to each script a unique ID, generally called the global rank. The global rank can help you personalize the work done by each GPU, for example saving a model just from one card, or running validation only in one card. In a cluster composed of 3 machines having 4 GPUs each, global ranks would range from 0 to 11. Within a machine, in order to assign DDP data parallel replicas to available GPUs, the script running in each replica must be assigned a GPU ID, unique within the machine it's running on. This is called the local rank and can be set as an argument by the PyTorch DDP torch.distributed.launch. In a cluster composed of 3 machines having 4 GPUs each, on each machine the DDP processes would have local ranks ranging from 0 to 3

Executable gives different FPS values ​in Yocto and Raspbian(everything looks the same in terms of configuration)

In Yocto project, built my project which is running on Raspbian OS. When i run executable, i get half FPS compared to executable running on Raspbian OS.
The libraries i use:
OpenCV
Tensorflow-Lite, Flatbuffer, Libedgetpu
I use Libedgetpu1-std, Tensorflow-lite 2.4.0 on Raspbian and Libedgetpu 2.5.0, Tensorflow-lite 2.5.0 on Yocto.
Thinking that the problem is that the versions or configurations of the libraries are not the same, i followed these steps:
I ran the executable which i built in Raspbian directly in the runtime of the Yocto project.(I have set the required library versions to the same library versions available in raspbian for it to work in runtime.)
But i still got low FPS. Here is how i calculate that i get half the FPS:
I am using TFLite's interpreter invoke function. I set a timer when entering and exiting the function, i calculate FPS over it. I can exemplify like this:
Timer_Begin();
m_tf_interpreter->Invoke();
Timer_End();
Somehow i think the Interpreter Invoke function is running slower on the Yocto side. I checked Kernel versions, CPU speeds, /boot/config.txt contents, USB power consumes of Raspbian and Yocto. However, I couldn't catch anything from anywhere.
Note : Using RPI4 and Coral-TPU(Plugged into USB 2.0).
We spoke with #Paulo Neves. He recommend Perf profiling and i did . In the perf profiling, i noticed that the CPU is running slowly. Although the frequencies are the same.
When i check the "scaling_governor", i saw that it was in "powersave" mode. The problem solved when i switched from "powersave" to "performance" mode from virtual kernel.
In addition, if you want to make the governor change permanent, you need to create a kernel config fragment.

How to run tensorflow in single process mode?

I am trying to run tensorflow 1.13.1 with Python 2.7 on SLF 6 without GPU support. When I start my model, tensorflow appears to be spawning multiple subprocesses and running my model in parallel, trying to load every core in the system. While in most cases this is what one would probably want, this is not my case. I would like to run my model on single core only.
I have tried setting these variables:
export OMP_NUM_THREADS=1
export KMP_BLOCKTIME=0
export KMP_AFFINITY=granularity=fine,verbose,compact,1,0
in different combinations, but was not able to achieve single-core running.
Is there a way to run Tensorflow in "dumb" single-process mode ?
There are two configurable options regarding parallelism inter_op_parallelism_threads and intra_op_parallelism_threads in the tf.ConfigProto protocol buffer. To use a single process, I think you can try:
import tensorflow as tf
config = tf.ConfigProto(intra_op_parallelism_threads=1,
inter_op_parallelism_threads=1,
allow_soft_placement=True)
There are other possible forms of parallelism, see mrry# 's answer is this thread.

Writing CUDA program for more than one GPU

I have more than one GPU and want to execute my kernels on them. Is there an API or software that can schedule/manage GPU resources dynamically? Utilizing resources of all available GPUs for the program.
A utility that may periodically report the available resource and my program will launch as many threads to GPUs.
Secondly, I am using Windows+ Visual Studio for my development. I have read that CUDA is supported on Linux. what changes do I need to do in my program?
I have more than GPUs and want to execute my kernels on them. Is there an API or software that can schedule/manage GPU resources dynamically.
For arbitrary kernels that you write, there is no API that I am aware of (certainly no CUDA API) that "automatically" makes use of multiple GPUs. Today's multi-GPU aware programs often use a strategy like this:
detect how many GPUs are available
partition the data set into chunks based on the number of GPUs available
successively transfer the chunks to each GPU, and launch the computation kernel on each GPU, switching GPUs using cudaSetDevice().
A program that follows the above approach, approximately, is the cuda simpleMultiGPU sample code. Once you have worked out the methodology for 2 GPUs, it's not much additional effort to go to 4 or 8 GPUs. This of course assumes your work is already separable and the data/algorithm partitioning work is "done".
I think this is an area of active research in many places, so if you do a google search you may turn up papers like this one or this one. Whether these are of interest to you will probably depend on your exact needs.
There are some new developments with CUDA libraries available with CUDA 6 that can perform certain specific operations (e.g. BLAS, FFT) "automatically" using multiple GPUs. To investigate this further, review the relevant CUBLAS XT documentation and CUFFT XT multi-GPU documentation and sample code. As far as I know, at the current time, these operations are limited to 2 GPUs for automatic work distribution. And these allow for automatic distribution of specific workloads (BLAS, FFT) not arbitrary kernels.
Secondly, I am using Windows+ Visual Studio for my development. I have read that CUDA is supported on Linux. what changes do I need to do in my program?
With the exception of the OGL/DX interop APIs CUDA is mostly orthogonal to choice of windows or linux as a platform. The typical IDE's are different (windows: nsight Visual Studio edition, Linux: nsight eclipse edition) but your code changes will mostly consist of ordinary porting differences between windows and linux. If you want to get started with linux, follow the getting started document.

Resources