OpenGL PBO mapped buffer: multi-threaded unpack slow, memcpy fast

We are working on a workstation with a Core i7 and an AMD FirePro 8000. For video decoding (8K, 7680x4320 video frames, ~66 MB per frame, HapQ codec) we tried to use the following obvious loop:
1. get frame from stream
2. map buffer
3. decode frame slices multi-threaded into mapped buffer
4. unmap buffer
5. texsubimage into texture from bound PBO
BUT the step
3. decode slices multi-threaded into mapped buffer
is horribly slow: it takes at least ~40 ms to finish.
When we split this into two steps:
3a. decode frame slices multi-threaded into malloced memory
3b. memcpy from malloced memory into mapped buffer
both steps together take about 8 ms + 9 ms ≈ 17 ms to finish. Now we have a somewhat acceptable solution, but the extra copy step is painful.
Why is multithreaded unpacking into mapped memory so exceptionally slow? How can we avoid the extra copy step?
Edit 1:
This is how the buffer is generated, defined and mapped:
glGenBuffers(1, &hdf.m_pbo_id);
glBindBuffer(GL_PIXEL_UNPACK_BUFFER, hdf.m_pbo_id);
glBufferData(GL_PIXEL_UNPACK_BUFFER, m_compsize, nullptr, GL_STREAM_DRAW);
hdf.mapped_buffer = (GLubyte*)glMapBuffer(GL_PIXEL_UNPACK_BUFFER, GL_WRITE_ONLY);
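For completeness, steps 4 and 5 of the loop look roughly like this on our side (the texture target, width, height, format and type below are simplified placeholders, not the exact values from our code):
glBindBuffer(GL_PIXEL_UNPACK_BUFFER, hdf.m_pbo_id);
glUnmapBuffer(GL_PIXEL_UNPACK_BUFFER);
// with a PBO bound, the last argument of glTexSubImage2D is a byte offset into the PBO, not a client pointer
glTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, width, height, format, type, (const GLvoid*)0);
glBindBuffer(GL_PIXEL_UNPACK_BUFFER, 0);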
Edit 2:
There was a question raised about how the time is measured. Only the non-GL code is measured. The pseudo-code is like this.
Case 1 (very slow, t2 - t1 ≈ 40 ms):
gl_map();
t1 = elapse_time();
unpack_multithreaded_multiple_snappy_slices_into_mapped_buffer();
t2 = elapse_time();
gl_unmap();
Case 2 (medium slow, t2 - t1 ≈ 8 ms, t3 - t2 ≈ 9 ms):
gl_map();
malloc_sys_buffer();
t1 = elapse_time();
unpack_multithreaded_multiple_snappy_slices_into_sys_buffer();
t2 = elapse_time();
memcpy_sys_buffer_into_mapped_buffer();
t3 = elapse_time();
gl_unmap();
Inside the measured code blocks there is no OpenGL code involved. Maybe it is a write-through / CPU-cache issue.

Unpacking into mapped memory is slow because this memory is write-combined. For each write into this type of memory, a full cache line is transferred to the GPU over the bus. The best way to interact with this memory is to write data in chunks that are as big as possible. To avoid the extra copy step, you may need to modify your decoder to write large contiguous chunks of memory. It is also worth experimenting with the number of threads doing the writing.
There is a great overview of this here https://fgiesen.wordpress.com/2013/01/29/write-combining-is-not-your-friend/
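For illustration, here is a minimal sketch of the "big contiguous chunks" idea. decode_slice() is a placeholder for your own decoder, not a real API: each thread decodes into ordinary cached scratch memory and touches the mapped (write-combined) pointer only with one large sequential memcpy per slice. The thread count is a parameter so you can experiment with it.
#include <cstddef>
#include <cstring>
#include <thread>
#include <vector>

// Placeholder for your real decoder: decompress one slice into dst.
void decode_slice(const void* compressed_slice, unsigned char* dst, std::size_t slice_size);

void unpack_into_mapped(unsigned char* mapped,                 // pointer returned by glMapBuffer
                        const void* const* compressed_slices,  // one compressed blob per slice
                        std::size_t slice_count,
                        std::size_t slice_size,
                        unsigned num_threads)
{
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < num_threads; ++t) {
        workers.emplace_back([=] {
            // scratch lives in normal cached memory, so the decoder's scattered writes are cheap
            std::vector<unsigned char> scratch(slice_size);
            for (std::size_t s = t; s < slice_count; s += num_threads) {
                decode_slice(compressed_slices[s], scratch.data(), slice_size);
                // one big, sequential burst into the write-combined mapped buffer
                std::memcpy(mapped + s * slice_size, scratch.data(), slice_size);
            }
        });
    }
    for (auto& w : workers) w.join();
}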

Related

Rust Multi-threading Memory Allocation on the RP Pico/RP2040

I'm working with the Raspberry Pi Pico to perform the basic task of reading data from a UART signal, modifying it, and writing it back out to a different UART address. However, I simultaneously need to be constantly monitoring an on-board sensor and sending the values it generates as well.
I found a good example at cortexm-threads but it performs some stack allocation like this:
let mut stack1 = [0xDEADBEEF; 512];
let mut stack2 = [0xDEADBEEF; 512];
How do I know (or find out) what memory addresses I can allocate the stacks to on the RP2040/Pico?
In the example, 0xDEADBEEF denotes the initial per-cell value of the stack1 and stack2 arrays and can be set to anything. Since those arrays are function-local non-constants/non-statics, they will end up on the (main thread) stack.
Just make sure that the arrays are large enough for your use case; otherwise you risk a stack overflow (How does a "stack overflow" occur and how do you prevent it?).
Regarding where those variables end up in the device memory space: the cortex-m-rt runtime will set the initial SP value to the largest possible memory address (0x20040000 on the Pico; RP2040 SRAM is located between 0x20000000 and 0x20040000, i.e. 256 kB). See https://github.com/rust-embedded/cortex-m/blob/657af97d66b7157d6a6e5704d86dd59b398e7108/cortex-m-rt/link.x.in#L63. Thereby, the location of those variables will be close to the end of SRAM. See also https://docs.rust-embedded.org/embedonomicon/memory-layout.html
Regarding the multicore use case, see also https://github.com/rp-rs/rp-hal/blob/427344667e9f24f03d132fa08e2dfaa709bc805d/rp2040-hal/src/multicore.rs.
You could also achieve similar functionality (but using only one core) with an interrupt-driven approach, where you store each incoming UART byte into a circular buffer, handle the on-board sensor read in a (DMA or timer) interrupt, and process the circular-buffer contents (and possibly the read sensor value) in the idle loop. For more information, see https://en.wikipedia.org/wiki/Circular_buffer and https://rtic.rs/.
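The circular buffer itself is tiny. Here is a language-agnostic sketch of the single-producer/single-consumer structure (written in C++ for brevity; in Rust you would typically reach for an existing SPSC queue such as the one in the heapless crate): the UART interrupt pushes bytes, the idle loop pops them.
#include <atomic>
#include <cstddef>
#include <cstdint>

// Fixed-size single-producer/single-consumer ring buffer (N must be a power of two).
template <std::size_t N>
struct RingBuffer {
    std::uint8_t data[N];
    std::atomic<std::size_t> head{0};  // only written by the producer (UART ISR)
    std::atomic<std::size_t> tail{0};  // only written by the consumer (idle loop)

    bool push(std::uint8_t byte) {     // called from the interrupt handler
        std::size_t h = head.load(std::memory_order_relaxed);
        if (h - tail.load(std::memory_order_acquire) == N)
            return false;              // buffer full, byte dropped
        data[h % N] = byte;
        head.store(h + 1, std::memory_order_release);
        return true;
    }

    bool pop(std::uint8_t& byte) {     // called from the idle loop
        std::size_t t = tail.load(std::memory_order_relaxed);
        if (head.load(std::memory_order_acquire) == t)
            return false;              // buffer empty
        byte = data[t % N];
        tail.store(t + 1, std::memory_order_release);
        return true;
    }
};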

What is a quick way to check if file contents are null?

I have a rather large file (32 GB) which is an image of an SD card, created using dd.
I suspected that the file is empty (i.e. filled with the null byte \x00) starting from a certain point.
I checked this using python in the following way (where f is an open file handle with the cursor at the last position I could find data at):
for i in xrange(512):
    if set(f.read(64*1048576)) != set(['\x00']):
        print i
        break
This worked well (in fact it revealed some data at the very end of the image), but took >9 minutes.
Has anyone got a better way to do this? There must be a much faster way, I'm sure, but cannot think of one.
Looking at a guide about memory buffers in Python here, I suspected that the comparison itself was the issue. In most non-typed languages, memory copies are not very obvious, despite being a killer for performance.
In this case, as Oded R. established, creating a buffer from the read and comparing the result with a previously prepared null-filled one is much more efficient.
size = 512
data = bytearray(size)
cmp = bytearray(size)
And when reading:
f = open(FILENAME, 'rb')
while f.readinto(data):
    if data != cmp:
        break  # non-null data found somewhere before f.tell()
Two things that need to be taken into account are:
The size of the compared buffers should be equal, but comparing bigger buffers should be faster up to some point (I would expect memory fragmentation to be the main limit)
The last read may not fill the whole buffer; reading the file into the prepared buffer will keep the trailing zeroes where we want them.
Here the comparison of the two buffers will be quick, there will be no attempt at casting the bytes to a string (which we don't need), and since we reuse the same memory all the time, the garbage collector won't have much work either... :)

Slow Buffer.concat

When I read a 16 MB file in pieces of 64 KB and do Buffer.concat on each piece, the latter proves to be incredibly slow: it takes a whole 4 s to go through the lot.
Is there a better way to concatenate a buffer in Node.js?
Node.js version used: 7.10.0, under Windows 10 (both are 64-bit).
This question is asked while researching the following issue: https://github.com/brianc/node-postgres/issues/1286, which affects a large audience.
The PostgreSQL driver reads large bytea columns in chunks of 64 KB and then concatenates them. We found out that calling Buffer.concat is the culprit behind a huge loss of performance in such examples.
Rather than concatenating every time (which creates a new buffer each time), just keep an array of all of your buffers and concat at the end.
Buffer.concat() can take a whole list of buffers. Then it's done in one operation. https://nodejs.org/api/buffer.html#buffer_class_method_buffer_concat_list_totallength
If you read from a file and know the size of that file, then you can pre-allocate the final buffer. Then each time you get a chunk of data, you can simply write it into that large 16 MB buffer.
// use the "unsafe" version to avoid clearing 16 MB for nothing
let buf = Buffer.allocUnsafe(file_size)
let pos = 0
file.on('data', (chunk) => {
    chunk.copy(buf, pos)  // copy each chunk straight into the pre-allocated buffer
    pos += chunk.length
})
file.on('end', () => {
    if (pos != file_size) throw new Error('Ooops! something went wrong.')
})
The main difference with @Brad's code sample is that you're going to use 16 MB + the size of one chunk (roughly) instead of 32 MB + the size of one chunk.
Also, each chunk has a header, various pointers, etc., so you may well end up using 33 MB or even 34 MB... that's a lot more RAM. The amount of RAM copied is otherwise the same. That being said, it could be that Node starts reading the next chunk while you copy, which could make it transparent. When done in one large chunk in the 'end' event, you're going to have to wait for the concat() to complete while doing nothing else in parallel.
In case you are receiving an HTTP POST and reading it, remember that you get a Content-Length header, so you also have the length in that case and can pre-allocate the entire buffer before reading the data.

DMA memcpy operation in Linux

I want to perform DMA using the dma_async_memcpy_buf_to_buf function, which is in dmaengine.c (linux/drivers/dma). For this, I added a function to dmatest.c (linux/drivers/dma) as follows:
void foo()
{
    int index = 0;
    dma_cookie_t cookie;
    size_t len = 0x20000;
    ktime_t start, end, end1, end2, end3;
    s64 actual_time;
    u16 *dest;
    u16 *src;

    dest = kmalloc(len, GFP_KERNEL);
    src = kmalloc(len, GFP_KERNEL);

    for (index = 0; index < len/2; index++)
    {
        dest[index] = 0xAA55;
        src[index] = 0xDEAD;
    }

    start = ktime_get();
    cookie = dma_async_memcpy_buf_to_buf(chan, dest, src, len);
    while (dma_async_is_tx_complete(chan, cookie, NULL, NULL) == DMA_IN_PROGRESS)
    {
        dma_sync_wait(chan, cookie);
    }
    end = ktime_get();
    actual_time = ktime_to_ns(ktime_sub(end, start));
    printk("Time taken for function() execution dma: %lld\n", (long long)actual_time);

    memset(dest, 0, len);

    start = ktime_get();
    memcpy(dest, src, len);
    end = ktime_get();
    actual_time = ktime_to_ns(ktime_sub(end, start));
    printk("Time taken for function() execution non-dma: %lld\n", (long long)actual_time);
}
There are some issues with DMA:
Interestingly, the memcpy execution time is less than that of dma_async_memcpy_buf_to_buf. Maybe this is related to a problem with ktime_get().
Is my approach with the foo function correct for performing a DMA operation? I'm not sure about this.
How can I measure the tick counts of the memcpy and dma_async_memcpy_buf_to_buf functions in terms of CPU usage?
Finally, is a DMA operation possible at the application level? Up to now I have used it at the kernel level, as you can see above (dmatest.c is inserted as a kernel module).
There are multiple issues in your question, which make it kind of hard to answer exactly what you're asking:
Yes, your general DMA operation invocation algorithm is correct.
The fundamental difference between using plain memcpy and DMA operations for copying memory is not direct performance gains, but (a) performance gains from preserving the CPU cache / prefetcher state when using a DMA operation (which would likely be garbled when using plain old memcpy, executed on the CPU itself), and (b) a true background operation which leaves the CPU available to do other stuff.
Given (a), it's kind of pointless to use DMA operations on anything smaller than the CPU cache size, i.e. dozens of megabytes. Typically it's done for fast off-CPU stream processing, i.e. moving data that would be produced/consumed anyhow by external devices, such as fast networking cards, video streaming / capturing / encoding hardware, etc.
Comparing async and sync operations in terms of wall-clock elapsed time is wrong. There might be hundreds of threads / processes running, and no one guarantees that you'd get scheduled on the next tick and not several thousand ticks later.
Using ktime_get for benchmarking purposes is wrong: it's fairly imprecise, especially for such short jobs. Profiling kernel code is in fact a pretty hard and complex task which is well beyond the scope of this question. A quick recommendation here would be to refrain from such micro-benchmarks altogether and profile a much bigger and more complete job, similar to what you're ultimately trying to achieve.
Measuring "ticks" for modern CPUs is also kind of pointless, although you can use CPU vendor-specific tools, such as Intel's VTune.
Using DMA copy operations at the application level is fairly pointless - at least I can't come up with a single viable scenario off the top of my head where it would be worth the trouble. It's not innately faster, and, more importantly, I seriously doubt that your application's performance bottleneck is memory copying. For that to be the case, you would generally have to be doing everything else faster than regular memory copying, and I can't really think of anything at the application level that would be faster than memcpy. And if we're talking about communication with some other, off-CPU processing device, then it's automatically not application level.
Generally, memory copy performance is usually limited by memory speed, i.e. clock frequency and timings. You aren't going to get any miracle boosts over regular memcpy in direct performance, just because memcpy executed on the CPU is fast enough, as the CPU usually runs at 3x-5x-10x higher clock frequencies than memory.

Purpose of ibs/obs/bs in dd

I have a script that creates a file system in a file on a Linux machine. I see that to create the file system, it uses dd with the bs=x option, reading from /dev/zero and writing to a file. I think specifying ibs/obs/bs is usually useful for reading from real hardware devices, as one has specific block-size constraints. In this case, however, as it is reading from a virtual device and writing to a file, I don't see any point in using the 'bs=x bytes' option. Is my understanding wrong here?
(Just in case if it helps, this file system is later on used to boot a qemu vm)
To understand block sizes, you have to be familiar with tape drives. If you're not interested in tape drives - for example, you don't think you're ever going to use one - then you can go back to sleep now.
Remember the tape drives from films in the 60s, 70s, maybe even 80s? The ones where the reel went spinning around, and so on? Not your Exabyte or even QIC - quarter-inch cartridge - tapes; your good old fashioned reel-to-reel half-inch tape drives? On those, block size mattered.
The data on a tape was written in blocks. Each block was separated from the next by an inter-record gap.
----+-------+-----+-------+-----+----
... | block | IRG | block | IRG | ...
----+-------+-----+-------+-----+----
Depending on the tape drive hardware and software, there were a variety of problems that could happen. For example, if the tape was written with a block size of 5120 bytes and you read the tape with a block size of 512 bytes, then the tape drive might read the first block, return you 512 bytes of it, and then discard the remaining data; the next read would start on the next block. Conversely, if the tape was written with a block size of 512 bytes and you requested blocks of 5120 bytes, you would get short reads; each read would return just 512 bytes, and if your software wasn't paying attention, you'd be reading garbage. There was also the issue that the tape drive had to get up to speed to read the block, and then slow down. The ASCII art suggests that the IRG was smaller than the data blocks; that was not necessarily the case. And it took time to read one block, overshoot the IRG, rewind backwards to get to the next block, and start forwards again. And if the tape drive didn't have the memory to buffer data - the cheaper ones did not - then you could seriously affect your tape drive performance.
War story: work prepared on newer machine with a slightly more modern tape drive. I wrote a tape using tar without a sensible block size (so it defaulted to 512 bytes). It was a large bit of software - must have been, oh, less than 100 MB in total (a long time ago, in other words). The tape wrote nicely because the machine was modern enough, and it took just a few seconds to do so. But, I had to get the material off the tape on a machine with an older tape drive, one that did not have any on-board buffer. So, it read the material, 512 bytes at a time, and the reel rocked forward, reading one block, and then rocked back all but maybe half an inch, and then read forwards to get to the next block, and then rocked back, and ... well, you could see it doing this, and since it took appreciable portions of a second to read each 512 byte block, the total time taken was horrendous. My flight was due to leave...and I needed to get that data across too. (It was long enough ago, and in a land far enough away, that last minute changes to flights weren't much of an option either.) To cut a long story short, it did get read - but if I'd used a sensible block size (such as 5120 bytes instead of the default of 512), I would have been done much, much quicker and with much less danger of missing the plane (but I did actually catch the plane, with maybe 20 minutes to spare, despite a taxi ride across Paris in the rush hour).
With more modern tape drives, there was enough memory on the drive to do buffering and getting a tape drive to stream - write continuously without reversing - was feasible. It used to be that I'd use a block size like 256 KB to get QIC tapes to stream. I've not done much with tape drives recently - let's see, not this millennium and not much for a few years before that, either; certainly not much since CD and DVD became the software distribution mechanisms of choice (when electronic download wasn't used).
But the block size really did matter in the old days. And dd provided good support for it. You could even transfer data from a tape drive that was written with, say, 4 KB block to another that you wanted to write with, say, 16 KB blocks, by specifying the ibs (input block size) separately from the obs (output block size). Darned useful!
Also, the count parameter is in terms of the (input) block size. It was useful to say 'dd bs=1024 count=1024 if=/dev/zero of=/my/file/of/zeroes' to copy 1 MB of zeroes around. Or to copy 1 MB of a file.
The importance of dd is vastly diminished; it was an essential part of the armoury for anybody who worked with tape drives a decade or more ago.
The block size is the number of bytes that are read and written at a time. Presumably there is a count= option, and that is specified in units of the block size. If there is a skip= or seek= option, those will also be in block size units. However if you are reading and writing a regular file, and there are no disk errors, then the block size doesn't really matter as long as you can scale those parameters accordingly and they are still integers. However certain sizes may be more efficient than others.
For reading from /dev/zero, it doesn't matter. ibs/obs/bs specify how many bytes will be read at a time. It's helpful to choose a number based on the way bytes are read/written in the operating system. For instance, Linux usually reads from a hard drive in 4096 byte chunks. If you have at least some idea about how the underlying hardware reads/writes, it might be a good idea to specify ibs/obs/bs. By the way, if you specify bs, it will override whatever you specify for ibs and obs.
In addition to the great answer by Jonathan Leffler, keep in mind that the bs= option isn't always a substitute for using both ibs= and obs=, in particular for the old [ugly] days of tape drives.
The bs= option reserves the right for dd to write the data as soon as it's read. This can cause you to no longer have identically sized blocks on the output. Here is GNU's take on this, but the behavior dates back as far as I can remember (80's):
(bs=) Set both input and output block sizes to bytes. This makes dd read and write bytes per block, overriding any ‘ibs’ and ‘obs’ settings. In addition, if no data-transforming conv option is specified, input is copied to the output as soon as it’s read, even if it is smaller than the block size.
For instance, back in the QIC days on an old Sun system, if you did this:
tar cvf /dev/rst0c /bla
It would work, but cause an enormous amount of back-and-forth thrashing while the drive wrote a small block, then tried to back up and read to reposition itself properly for the next write.
If you swapped this with:
tar cvf - /bla | dd ibs=16K obs=16K of=/dev/rst0c
You'd get the QIC drive writing much larger chunks and not thrashing quite so much.
However, if you made the mistake of this:
tar cvf - /bla | dd bs=16K of=/dev/rst0c
You'd run the risk of having precisely the same thrashing you had before depending upon how much data was available at the time of each read.
Specifying both ibs= and obs= precludes this from happening.
