TensorFlow Video Decoding on a Separate Thread in a Distributed System

In a distributed system, I need to enqueue frames on a CPU device while the frames are processed, that is, the network is trained, on a GPU device. Can this be done in parallel (simultaneously) in TensorFlow?
TensorFlow currently supports audio decoding through FFmpeg (in contrib); are there any features for video encoding and decoding that are multi-threaded?
Again, the purpose is to perform an enqueue operation in one thread on a CPU device and, in another thread, dequeue the frames and feed them to the graph, which lives on a GPU device.
Currently I need to process more than 100 videos with a minimum duration of 10 minutes each.

Related

Pre-processing on multiple CPUs in parallel and feeding output to tensorflow which trains on Multi-GPUs

I am trying to use Tensorflow for my work on a classification problem.
Before feeding the input in the form of images, I want to do some pre-processing on the images. I would like to carry out this pre-processing on multiple CPU cores in parallel and feed the results to the TensorFlow graph, which I want to run in a multi-GPU setting (I have 2 TitanX GPUs).
The reason I want this setup is that while the GPUs are performing the training, the CPUs keep doing their job of pre-processing, and so the GPUs do not sit idle after each iteration. I have been looking into the TensorFlow API for this, but could not locate anything that specifically addresses such a scenario.
So, multiple CPU cores should keep pre-processing a list of files and fill a queue from which TensorFlow extracts its batches of data. Whenever this queue is full, the CPU cores should wait and start processing again when the queue (or part of it) is vacated by examples being fed to the TensorFlow graph.
I have two questions specifically :
How may I achieve this setup?
Is it a good idea to have this setup?
A clear example would be a great help.

multiple streams in one GPU device

I have a multi-threaded program which is supposed to run on 6 GPU devices.
I want to open 6 streams on each device to reuse during the lifetime of my program (36 in total).
I'm using cudaStreamCreate(), cublasCreate(), and cublasSetStream() to create each stream and handle.
I also use a GPU memory monitor to see the memory usage of each handle.
However, when I look at the GPU memory usage on each device, it grows only on the first stream creation and doesn't change for the rest of the streams I create.
As far as I know there isn't any limitation on the number of streams I can use.
But I can't figure out why the memory usage of the handles and streams doesn't show up in the GPU memory usage.
All the streams you create are residing within a single context on a given device, so there is no context related overhead from creating additional streams after the first one. The streams themselves are lightweight and are (mostly) a host side scheduler abstraction. As you have observed, they don't in themselves consume much (if any) device memory.
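To make the setup concrete, here is a minimal sketch (not the poster's code) of creating the 36 handle/stream pairs; the device and stream counts are taken from the question, everything else is an assumption, and error checking is omitted for brevity:

```c
/* Sketch: one context per device (created lazily on first use), with
 * 6 streams and 6 cuBLAS handles per device, each handle bound to a stream. */
#include <cuda_runtime.h>
#include <cublas_v2.h>

#define NUM_DEVICES        6
#define STREAMS_PER_DEVICE 6

int main(void)
{
    cudaStream_t   streams[NUM_DEVICES][STREAMS_PER_DEVICE];
    cublasHandle_t handles[NUM_DEVICES][STREAMS_PER_DEVICE];

    for (int dev = 0; dev < NUM_DEVICES; ++dev) {
        cudaSetDevice(dev);   /* select device; its context is created lazily on first use */
        for (int s = 0; s < STREAMS_PER_DEVICE; ++s) {
            cudaStreamCreate(&streams[dev][s]);   /* lightweight host-side object */
            cublasCreate(&handles[dev][s]);
            cublasSetStream(handles[dev][s], streams[dev][s]);
        }
    }

    /* ... enqueue cuBLAS work on the 36 streams ... */

    for (int dev = 0; dev < NUM_DEVICES; ++dev) {
        cudaSetDevice(dev);
        for (int s = 0; s < STREAMS_PER_DEVICE; ++s) {
            cublasDestroy(handles[dev][s]);
            cudaStreamDestroy(streams[dev][s]);
        }
    }
    return 0;
}
```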

What is the smallest number of (manageable) samples I can give to a PCM buffer?

Some APIs, like this one, can create a PCM buffer from an array of samples (represented by a number).
Say I want to generate and play some audio in (near) real time. I could generate a PCM buffer with 100 samples and send them off to the sound card, using my magic API functions. While those 100 samples are playing, 100 more samples are generated, and then the buffers are switched. Finally, I can repeat the writing / playing / switching process to create a constant stream of audio.
Now, for my question. What is the smallest buffer size (in samples) I can use with the write / play / switch approach without a perceivable pause occurring in the audio stream? I understand the answer here will depend on sample rate, processor speed, and transfer time to the sound card - so please provide a "rule of thumb" answer if that's more appropriate!
(I'm a bit new to audio stuff, so please feel free to point out any misconceptions I might have!)
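For concreteness, here is a minimal sketch of that write / play / switch loop; write_to_sound_card() is a hypothetical stand-in for the "magic API" (assumed to block until the device has accepted the buffer), and the 440 Hz test tone and 44.1 kHz rate are assumptions:

```c
/* Double-buffering sketch: fill one buffer while the other plays, then swap. */
#include <math.h>
#include <stdint.h>

#define FRAMES      100        /* samples per buffer, as in the question */
#define SAMPLE_RATE 44100.0
#define TWO_PI      6.283185307179586

static double phase;

static void generate_samples(int16_t *buf, int n)
{
    for (int i = 0; i < n; ++i) {               /* 440 Hz test tone */
        buf[i] = (int16_t)(32767.0 * sin(phase));
        phase += TWO_PI * 440.0 / SAMPLE_RATE;
    }
}

static void write_to_sound_card(const int16_t *buf, int n)
{
    (void)buf; (void)n;                         /* replace with the real API call */
}

int main(void)
{
    int16_t buffers[2][FRAMES];
    int current = 0;

    generate_samples(buffers[current], FRAMES);        /* prime the first buffer */
    for (int i = 0; i < 1000; ++i) {
        write_to_sound_card(buffers[current], FRAMES); /* this buffer plays... */
        current = 1 - current;                         /* ...switch... */
        generate_samples(buffers[current], FRAMES);    /* ...and fill the other one */
    }
    return 0;
}
```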
TL;DR: 1 ms buffers are easily achievable on desktop operating systems if care is taken, but they may not be desirable from a performance and energy-usage perspective.
The lower limit on buffer size (and thus output latency) is set by the worst-case scheduling latency of your operating system.
The sequence of events is:
The audio hardware progressively outputs samples from its buffer
At some point, it reaches a low-water-mark and generates an interrupt, signalling that the buffer needs replenishing with more samples
The operating system services the interrupt and marks the thread as ready to run
The operating system schedules the thread to run on a CPU
The thread computes, or otherwise obtains samples, and writes them into the output buffer.
The scheduling latency is the time between steps 2 and 4 above, and it is dictated largely by the design of the host operating system. If you are using a hard RTOS such as VxWorks or eCos with pre-emptive priority scheduling, the worst case can be on the order of fractions of a microsecond.
General-purpose desktop operating systems are generally less slick. macOS supports real-time user-space scheduling and is easily capable of servicing 1 ms buffers. The Linux kernel can be configured for pre-emptive real-time threads and bottom-half interrupt handlers handled by kernel threads. You ought to be able to achieve 1 ms buffer sizes there too. I can't comment on the capabilities of recent versions of the NT kernel.
It's also possible to take a (usually bad) latency hit in step 5 - when your process fills the buffer - if it takes a page fault. The usual practice is to obtain all of the heap and stack memory you require up front and mlock() it, along with your program code and data, into physical memory.
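A minimal sketch of that lock-and-prefault practice on Linux (the stack-prefault size is an arbitrary assumption):

```c
/* Lock all current and future pages into RAM and pre-fault the stack so the
 * real-time thread never page-faults while filling an audio buffer. */
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

#define PREFAULT_STACK_BYTES (512 * 1024)   /* assumed worst-case stack depth */

static void prefault_stack(void)
{
    unsigned char dummy[PREFAULT_STACK_BYTES];
    memset(dummy, 0, sizeof(dummy));        /* touch every page so it is resident */
}

int main(void)
{
    if (mlockall(MCL_CURRENT | MCL_FUTURE) != 0) {
        perror("mlockall");                 /* may need CAP_IPC_LOCK / higher RLIMIT_MEMLOCK */
        return 1;
    }
    prefault_stack();

    /* ... allocate all heap buffers up front, then start the audio thread ... */
    return 0;
}
```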
Absolutely forget about achieving low latency in an interpreted or JITed language run-time. You have far too little control over what the language run-time is doing and no realistic prospect of preventing page faults (e.g. from memory allocation). I suspect 10 ms is pushing your luck in these cases.
It's worth noting that rendering short buffers has a significant impact on system performance (and energy consumption) due to the high rate of interrupts and context switches. These destroy L1 cache locality in a way that's disproportionate to the work they actually do.
While 1ms audio buffers are possible they are not necessarily desirable. The tick rate of modern Windows, for example, is between 10ms and 30ms. That means that, usually at the audio driver end, you need to keep a ring buffer of a bunch of those 1 ms packets to deal with buffer starvation conditions, in case the CPU gets pulled out from under you by some other thread.
All modern audio engines use powers of 2 for their audio buffer sizes. Start with 256 samples per frame and see how that works for you. Every piece of hardware is different, and you can't depend on how a Mac or PC gives you time slices. The user might be calculating pi on some other process while you are running your audio program. It's safer to leave the buffer size large, like 2048 samples per frame, and let the user turn it down if the latency bothers them.
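For reference, the arithmetic behind the buffer sizes mentioned on this page, assuming a 44.1 kHz sample rate:

```c
/* Buffer length in samples -> latency per buffer in milliseconds. */
#include <stdio.h>

int main(void)
{
    const double rate = 44100.0;                 /* assumed sample rate */
    const int sizes[] = { 100, 256, 576, 2048 };

    for (int i = 0; i < 4; ++i)
        printf("%4d samples -> %6.2f ms\n", sizes[i], sizes[i] / rate * 1000.0);
    /* prints: 100 -> 2.27 ms, 256 -> 5.80 ms, 576 -> 13.06 ms, 2048 -> 46.44 ms */
    return 0;
}
```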

Real-time audio on multi-core Linux system

I'm working on an audio application on a multi-core (Debian) Linux machine with an RT kernel. The audio source generation takes more processing time than a single core can handle, so I have three different threads:
The main portaudio thread running on core 0
Source generation 1 running on core 1
Source generation 2 running on core 2
Threads 2 and 3 write to a ring buffer, while thread 1 reads data from the ring buffer and sums it into the portaudio buffer.
I've tried many buffer sizes and scheduling policies; my best result was the FIFO policy with an audio buffer size of 16 stereo samples and a ring buffer size of 576. That gives more than 13 ms (576/44100*1000) of latency, which is too much.
I'm sure that this latency can be reduced, but I'm not an expert in Linux scheduling. Any ideas?
As long as you keep the RT priority of your process above any other on the core, the policy doesn't matter.
Make sure you kick every other application off the cores you use for RT (e.g. with the isolcpus= kernel command-line parameter). Otherwise low-priority processes can trigger I/O which will block your RT threads. You should also assign all the interrupts your application doesn't care about to the unused core. Actually, I would suggest using core 0 for normal tasks and cores 1, 2, and 3 for RT in your case, because core 0 is the boot CPU and has to perform some special housekeeping tasks.
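As a rough illustration of that partitioning (not the poster's code), the sketch below pins a source-generation thread to core 1 and gives it SCHED_FIFO priority; the priority value of 80 is an assumption:

```c
/* Create a SCHED_FIFO thread pinned to an isolated core (e.g. booted with
 * isolcpus=1,2,3).  Requires root or CAP_SYS_NICE. */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>

static void *source_thread(void *arg)
{
    (void)arg;
    /* ... generate audio and write it into the ring buffer ... */
    return NULL;
}

int main(void)
{
    pthread_t tid;
    pthread_attr_t attr;
    struct sched_param sp = { .sched_priority = 80 };   /* assumed RT priority */
    cpu_set_t cpus;
    int err;

    pthread_attr_init(&attr);
    pthread_attr_setinheritsched(&attr, PTHREAD_EXPLICIT_SCHED);
    pthread_attr_setschedpolicy(&attr, SCHED_FIFO);
    pthread_attr_setschedparam(&attr, &sp);

    CPU_ZERO(&cpus);
    CPU_SET(1, &cpus);                                  /* pin to core 1 */
    pthread_attr_setaffinity_np(&attr, sizeof(cpus), &cpus);

    err = pthread_create(&tid, &attr, source_thread, NULL);
    if (err != 0) {
        fprintf(stderr, "pthread_create failed: %d\n", err);
        return 1;
    }
    pthread_join(tid, NULL);
    return 0;
}
```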
Once you partition the system as described above try latency-measurement tools to figure out what is causing delays. Googling linux rt latency trace will give you a lot of useful links. This is the basic one: http://people.redhat.com/williams/latency-howto/rt-latency-howto.txt
If it turns out some kernel processing is blocking your app you may find a solution by looking at the description of kernel threads here: http://lxr.free-electrons.com/source/Documentation/kernel-per-CPU-kthreads.txt
You should definitely be able to go below 2ms.

Raspberry Pi Linux kernel module: real-time for a very short time period

I want to connect a TCD1304AP CCD array to an RPi board. The problem is that when normally clocked (4 MHz), the TCD1304AP yields ~500,000 samples per second (3648 samples in total), so reading all the data out of the CCD array takes ~7.3 milliseconds. I want the RPi processor to handle the whole interaction programmatically through GPIO inputs and outputs.
Is it a good idea to write a Linux kernel module that disables all interrupts for 7.3 ms, performs all the I/O actions according to the timing diagram, and then re-enables interrupts and returns control to the OS? What is the common approach to problems like this?
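For illustration only, a rough, untested sketch of the approach described above (mask local interrupts, bit-bang the readout, restore interrupts). The pin numbers, delay values, and the assumption of a digital data line (i.e. an external ADC or comparator in front of the CCD's analog output) are all hypothetical, and note that local_irq_save() only masks interrupts on the current core:

```c
/* Sketch of a "disable interrupts, read the CCD, re-enable" kernel module. */
#include <linux/module.h>
#include <linux/kernel.h>
#include <linux/gpio.h>
#include <linux/delay.h>

#define CLK_GPIO  17            /* assumed BCM pin numbers */
#define DATA_GPIO 27
#define NSAMPLES  3648

static u8 samples[NSAMPLES];

static int __init ccd_read_init(void)
{
    unsigned long flags;
    int i;

    if (gpio_request(CLK_GPIO, "ccd-clk") || gpio_request(DATA_GPIO, "ccd-data"))
        return -EBUSY;
    gpio_direction_output(CLK_GPIO, 0);
    gpio_direction_input(DATA_GPIO);

    local_irq_save(flags);               /* mask interrupts on this core */
    for (i = 0; i < NSAMPLES; i++) {
        gpio_set_value(CLK_GPIO, 1);
        ndelay(125);                     /* assumed half-period of the sample clock */
        gpio_set_value(CLK_GPIO, 0);
        ndelay(125);
        samples[i] = gpio_get_value(DATA_GPIO);
    }
    local_irq_restore(flags);            /* ~7.3 ms later, hand the CPU back */

    pr_info("ccd_read: captured %d samples\n", NSAMPLES);
    return 0;
}

static void __exit ccd_read_exit(void)
{
    gpio_free(CLK_GPIO);
    gpio_free(DATA_GPIO);
}

module_init(ccd_read_init);
module_exit(ccd_read_exit);
MODULE_LICENSE("GPL");
```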
