combining Intel IPP and TBB - multithreading

I am trying to speed up my image-resizing algorithm further by combining IPP and TBB. There are two ways I can accomplish this:
1. Use IPP without TBB
2. Use IPP with TBB inside a parallel_for loop
I have coded the application and I get the correct result, but surprisingly the computation time is larger when the two are combined. To avoid clutter, I paste only part of my code here, but I can provide the whole code if needed. (The base of the algorithm was borrowed from the Intel TBB sample code for image resizing.) For the first case, when I use only IPP, the code looks like this:
ippiResizeSqrPixel_8u_C1R(src, srcSize, srcStep, srcRoi, dst, dstStep, dstRoi,
m_nzoom_x,m_nzoom_y,0, 0, interpolation, pBufferWhole);
and the parallel_for loop looks like this:
parallel_for(
blocked_range<size_t>(0,CHUNK),
[=](const blocked_range<size_t> &r){
for (size_t i= r.begin(); i!= r.end(); i++){
ippiResizeSqrPixel_8u_C1R(src+((int)(i*srcWidth*srcHeight)), srcSize,
srcStep, srcRoi, dst+((int)(i*dstWidth*dstHeight)), dstStep, dstRoi,
m_nzoom_x,m_nzoom_y,0, 0, interpolation, pBuffer);
}
}
);
src and dst are pointers to the source image and the destination image. When TBB is used, the image is partitioned into CHUNK parts, and parallel_for loops over all of them, calling an IPP function to resize each part independently. The values of dstHeight, srcHeight, srcRoi, and dstRoi are modified to accommodate the partitioning of the image, and src+((int)(i*srcWidth*srcHeight)) and dst+((int)(i*dstWidth*dstHeight)) point to the beginning of each partition in the source and destination image.
Apparently IPP and TBB can be combined in this manner, since I get the correct result, but what baffles me is that the computation time gets worse when they are combined than when IPP is used alone. Any thoughts on what could be the cause, or how I could solve this issue?
Thanks!

In your code, each parallelized task in parallel_for consists of multiple ippiResizeSqrPixel calls.
This may add needless overhead compared to the serial version, which calls the function only once: such a function may contain a preparation phase (for example, setting up an interpolation coefficient table), and it is generally designed to process a large memory block at a time for runtime efficiency. (I don't know what IPP actually does internally, though.)
I suggest the following parallel structure:
parallel_for(
// Range = src (or dst) height of image.
blocked_range<size_t>(0, height),
[=](const blocked_range<size_t> &r) {
// 'r' = vertical range of image to process in this task.
// You can calculate src/dst region from 'r' here,
// and call ippiResizeSqrPixel once per task.
ippiResizeSqrPixel_8u_C1R( ... );
}
);
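As a rough sketch of the "calculate src/dst region from 'r'" step, the helpers below map the destination rows handed to one task to a row count and byte offset, and estimate which source rows cover them. The names (RowBand, dstBand, srcBand) are hypothetical and not part of IPP or TBB; the sketch assumes a single-channel image addressed by a row step in bytes and a vertical zoom factor where dstRow = srcRow * zoomY. It does not add the border rows an interpolation kernel needs, and each task would still need its own work buffer.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>

// Hypothetical helper type; not part of IPP or TBB.
struct RowBand {
    std::size_t firstRow;   // first row of the band
    std::size_t rowCount;   // number of rows in the band
    std::size_t byteOffset; // offset of the band's first row from the image base
};

// Destination band for the rows [begin, end) handed to one task.
inline RowBand dstBand(std::size_t begin, std::size_t end, std::size_t dstStep) {
    assert(begin <= end);
    return RowBand{begin, end - begin, begin * dstStep};
}

// Source rows covering destination rows [begin, end), assuming
// dstRow = srcRow * zoomY. Border rows for interpolation are NOT included.
inline RowBand srcBand(std::size_t begin, std::size_t end, double zoomY,
                       std::size_t srcHeight, std::size_t srcStep) {
    assert(begin <= end && zoomY > 0.0);
    std::size_t first = static_cast<std::size_t>(std::floor(begin / zoomY));
    std::size_t last  = static_cast<std::size_t>(std::ceil(end / zoomY));
    first = std::min(first, srcHeight);                 // clamp to the image
    last  = std::min(std::max(last, first), srcHeight); // keep last >= first
    return RowBand{first, last - first, first * srcStep};
}
```

Inside the lambda, r.begin() and r.end() would be passed as begin and end, and the resulting offsets added to the src and dst base pointers before the single resize call.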

It turns out that some IPP functions use multithreading automatically. For such functions, no improvement can be gained by using TBB. Apparently ippiResizeSqrPixel_8u_C1R( ... ) is one of those functions. When I disabled all the cores but one, both versions performed equally well.

Why do GBuffers need to be created for each frame in D3D12?

I have experience with D3D11 and want to learn D3D12. I am reading the official D3D12 multithreading example and don't understand why the shadow map (generated in the first pass as a DSV, consumed in the second pass as an SRV) is created for each frame (actually only 2 copies, as the FrameResource is reused every 2 frames).
The code that creates the shadow map resource is here, in the FrameResource class, instances of which are created here.
There is actually another resource that is created for each frame: the constant buffer. I kind of understand the constant buffer. It is written by the CPU (D3D11 dynamic usage) and needs to remain unchanged until the GPU finishes using it, so there need to be 2 copies. However, I don't understand why the shadow map needs the same treatment, because it is only modified by the GPU (D3D11 default usage), and there are fence commands to separate reading from and writing to that texture anyway. As long as the GPU respects the fence, a single texture should be enough for the GPU to work correctly. Where am I wrong?
Thanks in advance.
EDIT
According to the comment below, the "fence" I mentioned above should more accurately be called "resource barrier".

For best performance, the key issue is that you don't want to stall the GPU. Double-buffering is a minimal requirement, but typically triple-buffering is better for smoothing out frame-to-frame rendering spikes, etc.
FWIW, the default behavior of DXGI Present is to stall only after you have submitted THREE frames of work, not two.
Of course, there's a trade-off between triple-buffering and input responsiveness, but if you are maintaining 60 Hz or better then it's likely not noticeable.
With all that said, typically you don't need to double-buffer depth/stencil buffers for rendering, although if you wanted the initial write of the depth buffer to overlap with the read of the previous frame's depth buffer, then you would want distinct buffers per frame for performance and correctness.
The 'writes' are all complete before the 'reads' in DX12 because of the injection of the 'Resource Barrier' into the command-list:
void FrameResource::SwapBarriers()
{
// Transition the shadow map from writeable to readable.
m_commandLists[CommandListMid]->ResourceBarrier(1, &CD3DX12_RESOURCE_BARRIER::Transition(m_shadowTexture.Get(), D3D12_RESOURCE_STATE_DEPTH_WRITE, D3D12_RESOURCE_STATE_PIXEL_SHADER_RESOURCE));
}
void FrameResource::Finish()
{
m_commandLists[CommandListPost]->ResourceBarrier(1, &CD3DX12_RESOURCE_BARRIER::Transition(m_shadowTexture.Get(), D3D12_RESOURCE_STATE_PIXEL_SHADER_RESOURCE, D3D12_RESOURCE_STATE_DEPTH_WRITE));
}
Note that this sample is a port/rewrite of the older legacy DirectX SDK sample MultithreadedRendering11, so it may be just an artifact of convenience to have two shadow buffers instead of just one.
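To make the stalling argument concrete, here is a minimal model of N per-frame resource slots guarded by fence values. It is a sketch under stated assumptions, not code from the sample: FrameRing, slotFor, canReuse and submit are hypothetical names, and the "GPU completed fence" value stands in for what an application would read back from its fence object.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical model of N per-frame resource slots guarded by a fence counter.
template <std::size_t N>
class FrameRing {
public:
    // Slot the CPU would record into for the given frame number.
    std::size_t slotFor(std::uint64_t frame) const { return frame % N; }

    // True if the slot's previous GPU work has finished, i.e. the CPU can
    // overwrite the slot's resources without waiting.
    bool canReuse(std::uint64_t frame, std::uint64_t gpuCompletedFence) const {
        return fenceValues_[slotFor(frame)] <= gpuCompletedFence;
    }

    // Record that the slot's work was submitted with this fence value.
    void submit(std::uint64_t frame, std::uint64_t fenceValue) {
        assert(fenceValue > 0); // 0 is reserved for "never submitted"
        fenceValues_[slotFor(frame)] = fenceValue;
    }

private:
    std::array<std::uint64_t, N> fenceValues_{}; // 0 = never submitted
};
```

With N = 1 the CPU has to wait for every frame to finish before recording the next one; with N = 2 or 3 it only waits when it gets that many frames ahead of the GPU.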

Version 3 AudioUnits: minimum frameCount in internalRenderBlock

The example code for creating a version 3 AudioUnit demonstrates how the implementation needs to return a function block for render processing. The block both gets samples from the previous AudioUnit in the chain via pullInputBlock and supplies the output buffers with the processed samples. It must also provide some output buffers if the unit further down the chain did not. Here is an excerpt of code from an AudioUnit subclass:
- (AUInternalRenderBlock)internalRenderBlock {
/*
Capture in locals to avoid ObjC member lookups.
*/
// Specify captured objects are mutable.
__block FilterDSPKernel *state = &_kernel;
__block BufferedInputBus *input = &_inputBus;
return Block_copy(^AUAudioUnitStatus(
AudioUnitRenderActionFlags *actionFlags,
const AudioTimeStamp *timestamp,
AVAudioFrameCount frameCount,
NSInteger outputBusNumber,
AudioBufferList *outputData,
const AURenderEvent *realtimeEventListHead,
AURenderPullInputBlock pullInputBlock) {
...
});
This is fine if the processing does not require knowing the frameCount before the call to this block, but many applications do need to know the frameCount beforehand in order to allocate memory, prepare processing parameters, etc.
One way around this would be to accumulate past buffers of output, outputting only frameCount samples on each call to the block, but this only works if there is a known minimum frameCount. The processing must be initialized with a size greater than this frame count in order to work. Is there a way to specify or obtain a minimum value for frameCount, or to force it to be a specific value?
The example code is taken from: https://github.com/WildDylan/appleSample/blob/master/AudioUnitV3ExampleABasicAudioUnitExtensionandHostImplementation/FilterDemoFramework/FilterDemo.mm

Under iOS, an audio unit callback must be able to handle variable frameCounts. You can't force it to be a constant.
Thus any processing that requires a fixed size buffer should be done outside the audio unit callback. You can pass data to a processing thread by using a lock-free circular buffer/fifo or similar structure that does not require memory management in the callback.
You can suggest that the frameCount be a certain size by setting a buffer duration using the AVAudioSession APIs. But the OS is free to ignore this, depending on other audio needs in the system (power saving modes, system sounds, etc.). In my experience, the audio driver will only increase your suggested size, not decrease it (except by a couple of samples when resampling to a size that is not a power of 2).
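A minimal sketch of the kind of lock-free FIFO mentioned above, written in plain C++ rather than against any Apple API. SampleFifo is a hypothetical name; the sketch assumes exactly one producer thread (the render callback) and one consumer thread (the processing thread), and a capacity chosen up front so that push and pop never allocate or lock.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <vector>

// Minimal single-producer/single-consumer FIFO of float samples.
// Hypothetical illustration; storage is allocated once in the constructor.
class SampleFifo {
public:
    explicit SampleFifo(std::size_t capacity)
        : buf_(capacity + 1), head_(0), tail_(0) {
        assert(capacity > 0);
    }

    // Producer side: stores up to n samples and returns how many were stored.
    std::size_t push(const float* src, std::size_t n) {
        const std::size_t size = buf_.size();
        std::size_t tail = tail_.load(std::memory_order_relaxed);
        const std::size_t head = head_.load(std::memory_order_acquire);
        std::size_t written = 0;
        while (written < n) {
            const std::size_t next = (tail + 1) % size;
            if (next == head) break; // full: drop the rest instead of blocking
            buf_[tail] = src[written++];
            tail = next;
        }
        tail_.store(tail, std::memory_order_release);
        return written;
    }

    // Consumer side: reads up to n samples and returns how many were read.
    std::size_t pop(float* dst, std::size_t n) {
        const std::size_t size = buf_.size();
        std::size_t head = head_.load(std::memory_order_relaxed);
        const std::size_t tail = tail_.load(std::memory_order_acquire);
        std::size_t read = 0;
        while (read < n && head != tail) {
            dst[read++] = buf_[head];
            head = (head + 1) % size;
        }
        head_.store(head, std::memory_order_release);
        return read;
    }

private:
    std::vector<float> buf_;        // one spare slot distinguishes full from empty
    std::atomic<std::size_t> head_; // next slot to read (written by consumer only)
    std::atomic<std::size_t> tail_; // next slot to write (written by producer only)
};
```

The processing thread can then pop fixed-size blocks whenever enough samples have accumulated, regardless of the frameCount each callback happened to deliver.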

VTK efficiency of vtkRenderWindow::GetZbufferData and vtkWindowToImageFilter::Update

I have a snippet that converts a VTK (off-screen) rendering to 1) a point cloud and 2) a color image. The implementation is correct; only the speed/efficiency is an issue.
At the beginning of every iteration, I update my rendering by calling:
renderWin->Render ();
For the point cloud, I get the depth using the following lines and then convert it to a point cloud (code not posted).
float *depth = new float[width * height];
renderWin->GetZbufferData (0, 0, width - 1, height - 1, &(depth[0]));
For the color image, I use vtkWindowToImageFilter to get the currently rendered color image:
windowToImageFilter->Modified(); // Must have this to get updated rendered image
windowToImageFilter->Update(); // this line takes a lot of time
render_img_vtk = windowToImageFilter->GetOutput();
The above program runs sequentially in a single thread. The render window size is about 1000x1000. There is not much polydata to render. VTK was compiled with OpenGL2 support.
Issue:
This code runs at only about 15-20 Hz. When I disable/comment out the windowToImageFilter part (vtkWindowToImageFilter::Update() takes a lot of time), the frame rate goes up to about 30 Hz.
When I disable/comment out vtkRenderWindow::GetZbufferData, it goes up to 50 Hz (which is how fast I call my loop and update the rendering).
I had a quick look at the VTK source for these two functions and saw that they copy data using GL commands. I am not sure how I can speed this up.
Update:
After some searching, I found that the glReadPixels function called in GetZbufferData causes the delay, as it tries to synchronize the data. Please see this post: OpenGL read pixels faster than glReadPixels.
In that post, it is suggested that a PBO should be used. VTK has a class vtkPixelBufferObject, but I can find no example of using it to avoid blocking the pipeline when calling glReadPixels().
So how can I do this within the VTK pipeline?

My answer is only about the GetZbufferData portion.
From what I can tell, vtkOpenGLRenderWindow already uses glReadPixels with little overhead.
I believe it is what happens after that call that can introduce overhead. The main thing to note is that vtkOpenGLRenderWindow has 3 method overloads for GetZbufferData. You are using the overload with the same signature as the one used in vtkWindowToImageFilter.
I believe you are copying that part of the implementation in vtkWindowToImageFilter, which makes total sense. What do you do with the float pointer depthBuffer after you get it? Looking at the vtkWindowToImageFilter implementation, I see that it has a for loop that calls memcpy. I believe its memcpy has to be in a for loop in order to deal with spacing, because of the variables inIncrY and outIncrY. In your situation you should only have to call memcpy once and then free the array pointed to by depthBuffer, unless you are just using the pointer directly. In that case you have to think about who deletes that float array, because it was created with new.
I think the better option is to use the method with this signature: int GetZbufferData( int x1, int y1, int x2, int y2, vtkFloatArray* z )
In Python that looks like this:
import vtk
# create render pipeline (not shown)
# define image bounds (not shown)
vfa = vtk.vtkFloatArray()
ib = image_bounds
render_window.GetZbufferData(ib[0], ib[1], ib[2], ib[3], vfa)
The major benefit is that the pointer for the vtkFloatArray gets handed straight to glReadPixels. Also, VTK will take care of garbage collection of the vtkFloatArray if you create it with vtkSmartPointer (not needed in Python).
My Python implementation runs at about 150 Hz for a single pass on a 640x480 render window.

vtkImageData from 3rd party structure

I have a volume stored as slices in C# memory. The slices may not be consecutive in memory. I want to import this data and create a vtkImageData object.
The first way I found is to use vtkImageImport, but it seems this importer only accepts a single void pointer as data input. Since my slices may not be consecutive in memory, I cannot hand it a single pointer to my slice data.
A second option is to create the vtkImageData from scratch, use vtkImageData->GetScalarPointer() to get a pointer to its data, and then fill it using a loop. This is quite costly (although memcpy could speed things up a bit). I could also combine the copy approach with vtkImageImport, of course.
Are these my only options, or is there a better way to get the data into a VTK object? I want to be sure there is no other option before I take the copy approach (performance heavy) or modify the low-level storage of my slices so they become consecutive in memory.

I'm not too familiar with VTK for C# (ActiViz). In C++, using vtkImageData->GetScalarPointer() and copying your slices manually is a good and rather fast approach. Storing all the slices contiguously in memory first, as you said, would be faster still; otherwise you may want to do it this more robust way (change the numbers):
vtkImageData * img = vtkImageData::New();
img->SetExtent(0, 255, 0, 255, 0, 9);
img->SetSpacing(sx , sy, sz);
img->SetOrigin(ox, oy, oz);
img->SetNumberOfScalarComponents(1);
img->SetScalarTypeToFloat();
img->AllocateScalars();
Then it is not too hard to do something like:
float * fp = static_cast<float *>(img->GetScalarPointer());
for ( int i = 0; i < 256 * 256 * 10; i++) {
fp[i] = mydata[i];
}
Another, fancier option is to create your own importer class, basing the code on vtkImageImport.
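For slices that are not consecutive in memory, the copy can be done one slice at a time with memcpy instead of voxel by voxel. The following is a minimal sketch in plain C++: packSlices is a hypothetical helper, dst stands in for the pointer returned by GetScalarPointer(), and it assumes float scalars with one component and that every slice holds exactly voxelsPerSlice values in the same x-then-y order as the image.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Copies separately allocated slices into one contiguous, slice-major buffer.
// Hypothetical helper; `dst` must have room for slices.size() * voxelsPerSlice floats.
inline void packSlices(const std::vector<const float*>& slices,
                       std::size_t voxelsPerSlice, float* dst) {
    assert(dst != nullptr);
    for (std::size_t z = 0; z < slices.size(); ++z) {
        assert(slices[z] != nullptr);
        // One block copy per slice; slice z lands at offset z * voxelsPerSlice.
        std::memcpy(dst + z * voxelsPerSlice, slices[z],
                    voxelsPerSlice * sizeof(float));
    }
}
```

With the extent used above, voxelsPerSlice would be 256 * 256 and there would be 10 slice pointers.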

HDF5 write thread concurrency

Is HDF5 able to handle multiple threads on its own, or does it have to be externally synchronized? The OpenMP example suggests the latter.
If the former, what is the proper way to define the dataspace to write to?

Anycorn,
HDF5 can handle multiple threads without external synchronization, although the writes will still be serial. You should compile the latest version (1.8.6 as of 4/5/2011) and run ./configure with the --enable-threadsafe and -with-pthreads=/pthreads-include-path/,/pthreads-lib-path/ flags.
For example:
./configure --enable-threadsafe -with-pthreads=/usr/include,/usr/lib
With regard to defining a dataspace for writing, the simplest way is to construct a simple rectangular dataspace from an array of dimension sizes, a rank value, and the H5Screate_simple function. Mine usually follow the same steps:
//NUM = Number of elements in this dimension
//Create a 1-element array holding the size of each dimension
hsize_t dsDim[1] = {NUM};
//Create the 1-D data space with NUM elements (rank param = 1).
hid_t dSpace = H5Screate_simple(1, dsDim, NULL);
//...
//Create datasets using the dataspace
//...
//Release the data space
H5Sclose(dSpace);
Hope this helps!
