I have a volume stored as slices in C# memory. The slices may not be contiguous in memory. I want to import this data and create a vtkImageData object.
The first way I found is to use vtkImageImport, but it seems this importer only accepts a single void pointer as data input. Since my slices may not be contiguous in memory, I cannot hand it a single pointer to my slice data.
A second option is to create the vtkImageData from scratch and use vtkImageData->GetScalarPointer() to get a pointer to its data, then fill this using a loop. This is quite costly (although memcpy could speed things up a bit). I could also combine the copy approach with vtkImageImport, of course.
Are these my only options, or is there a better way to get the data into a VTK object? I want to be sure there is no other option before I take the copy approach (performance heavy) or modify the low-level storage of my slices so they become contiguous in memory.
I'm not too familiar with VTK for C# (ActiViz). In C++, a good and fairly fast approach is to use vtkImageData->GetScalarPointer() and copy your slices manually. Allocating all the memory up front, as you said, will speed things up; you may want to do it in this more robust way (change the numbers):
vtkImageData * img = vtkImageData::New();
img->SetExtent(0, 255, 0, 255, 0, 9);
img->SetSpacing(sx , sy, sz);
img->SetOrigin(ox, oy, oz);
img->SetNumberOfScalarComponents(1);
img->SetScalarTypeToFloat();
img->AllocateScalars();
Then it is not too hard to do something like:
float * fp = static_cast<float *>(img->GetScalarPointer());
for (int i = 0; i < 256 * 256 * 10; i++) {
    fp[i] = mydata[i];
}
Another, fancier option is to write your own importer, basing the code on vtkImageImport.
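Since the slices in the question are not contiguous, the copy can be done with one bulk copy per slice rather than one assignment per voxel. Below is a minimal sketch of that idea in plain Python, not VTK code: `pack_slices` and its parameters are hypothetical names, and the destination `array` stands in for the buffer returned by GetScalarPointer(). In C++ the equivalent would presumably be one memcpy per slice into the scalar pointer at offset z * width * height.

```python
from array import array


def pack_slices(slices, width, height):
    """Copy separately allocated 2D slices into one contiguous float buffer."""
    slice_len = width * height
    # Allocate the whole destination volume once, zero-filled.
    volume = array("f", [0.0]) * (slice_len * len(slices))
    for z, sl in enumerate(slices):
        if len(sl) != slice_len:
            raise ValueError("slice %d has %d values, expected %d" % (z, len(sl), slice_len))
        start = z * slice_len
        # One bulk copy per slice instead of one assignment per voxel.
        volume[start:start + slice_len] = sl
    return volume


# Three 2x2 slices that live in separate allocations.
slices = [array("f", [z * 10 + i for i in range(4)]) for z in range(3)]
volume = pack_slices(slices, 2, 2)
```

The slice index only determines the destination offset, so the source slices can live anywhere in memory.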
Let's make an example:
I want to compute a vector dot product concurrently (this is not my actual case, only an example), so I have two large input vectors and a large output vector of the same size. The number of available work items is smaller than the size of these vectors. How can I compute this product in OpenCL in that case? Is it possible, or do I have to resort to some tricks?
Something like:
for(i = 0; i < n; i++){
output[i] = input1[i]*input2[i];
}
with n > available work items
If by "available work items" you mean you're running into the maximum given by CL_DEVICE_MAX_WORK_ITEM_SIZES, you can always enqueue your kernel multiple times for different ranges of the array.
Depending on your actual workload, it may be more sensible to make each work item perform more work though. In the simplest case, you can use the SIMD types such as float4, float8, float16, etc. and operate on large chunks like that in one go. As always though, there is no replacement for trying different approaches and measuring the performance of each.
Divide and conquer the data. If the workgroup size evenly divides the global work size, the job amounts to N workgroups, of which perhaps k can run per kernel launch. So launch N/k kernels, each with k*workgroup_size work items, and offset the buffer addressing inside each kernel accordingly.
Once you have per-workgroup partial sums of the partial dot products (after multiple in-group reduction steps), you can simply sum them on the CPU or on whichever device the data is going to.
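The "make each work item perform more work" suggestion can be sketched as a strided loop: work item g handles elements g, g + G, g + 2G, and so on, where G is the number of work items. The following is a plain Python simulation under that assumption, not OpenCL code; `multiply_kernel`, `run` and `partial_dot` are hypothetical names, and `global_id` / `global_size` play the role of get_global_id(0) / get_global_size(0).

```python
def multiply_kernel(global_id, global_size, a, b, out):
    """Body of one simulated work item: handles every global_size-th element."""
    for i in range(global_id, len(out), global_size):
        out[i] = a[i] * b[i]


def run(a, b, global_size):
    """Launch global_size simulated work items over vectors of any length."""
    out = [0] * len(a)
    for gid in range(global_size):  # a real device would run these concurrently
        multiply_kernel(gid, global_size, a, b, out)
    return out


def partial_dot(global_id, global_size, a, b):
    """Partial dot product accumulated by one simulated work item."""
    return sum(a[i] * b[i] for i in range(global_id, len(a), global_size))


a = list(range(10))
b = [2] * 10
products = run(a, b, global_size=4)  # 10 elements, only 4 work items
dot = sum(partial_dot(g, 4, a, b) for g in range(4))  # final sum done on the host
```

With this layout the vector length does not have to match the number of work items, and the final sum of the partial results can be done on the host, as described above.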
I've been working on a server that expects data to be received through a buffer. I have an object which is defined like this and some procedures that modify the buffer in it:
type
  Packet* = ref object
    buf*: seq[int8]
    #buf*: array[0..4096, int8]
    pos*: int
proc newPacket*(size: int): Packet =
  result = Packet(buf: newSeq[int8](size))
  #result = Packet()
proc sendPacket*(s: AsyncSocket, p: Packet) =
  asyncCheck s.send(addr(p.buf), p.pos)
Now the reason I have two lines commented is because that was the code I originally used, but creating an object that initialises an array with 4096 elements every time probably wasn't very good for performance. However, it works and the seq[int8] version does not.
The strange thing, though, is that my current code works perfectly fine if I use the old static buffer buf*: array[0..4096, int8]. In sendPacket, I have checked the data contained in the buffer to make sure the array and seq[int8] versions are equal, and they are (or at least appear to be). In other words, if I do var p = newPacket(17) and write exactly 17 bytes to p.buf, the values of the elements appear to be the same in both versions.
So despite the data appearing to be the same in both versions, I get a different result when calling send when passing the address of the buffer.
In case it matters, the data would be read like this:
result = p.buf[p.pos]
inc(p.pos)
And written to like this:
p.buf[p.pos] = cast[int8](value)
inc(p.pos)
Just a few things I've looked into, which were probably unrelated to my problem anyway: I looked at GC_ref and GC_unref which had no effect on my problem and also looked at maybe trying to use alloc0 where buf is defined as pointer but I couldn't seem to access the data of that pointer and that probably isn't what I should be doing in the first place. Also if I do var data = p.buf and pass the addr of data instead, I get a different result, but still not the intended one.
So I guess what I want to get to the bottom of is:
Why does send work perfectly fine when I use array[0..4096, int8] but not seq[int8] which is initialised with newSeq, even when they appear to contain the same data?
Does my current layout for receiving and writing data even make sense in a language like Nim (or any language for that matter)? Is there a better way?
In order not to initialize the array you can use the noinit pragma like this:
buf* {.noinit.}: array[0..4096, int8]
You are probably taking the pointer to the seq, not the pointer to the data inside the seq, so try using addr(p.buf[0]).
A pos field is useless if you are using the seq version since you have p.buf.len already, but you probably know that already and just left it in for the array. If you want to use the seq and expect large packets, make sure to use newSeqOfCap to only allocate the memory once.
Also, your array is 1 byte too big: it goes from 0 to 4096 inclusive! Instead you can use array[0..4095, int8] or just array[4096, int8].
Personally, I would prefer to use a uint8 type inside of buf, so that you can just put in values from 0 to 255 instead of -128 to 127.
Using a seq inside of a ref object means you have two layers of indirection when accessing buf, as well as two objects that the GC will have to clean up. You could just make Packet an alias for seq[uint8] (without ref): type Packet* = seq[uint8]. Or you can use the array version if you want to store some more data inside the Packet later on.
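To illustrate the int8 versus uint8 remark: the bytes on the wire are identical either way, and only the interpretation of the values differs. This is a small Python sketch (not Nim) using the standard struct module; the three sample bytes are arbitrary.

```python
import struct

raw = bytes([0x7F, 0x80, 0xFF])

# Same three bytes, read as signed 8-bit values (like int8).
as_int8 = struct.unpack("3b", raw)

# Same three bytes, read as unsigned 8-bit values (like uint8).
as_uint8 = struct.unpack("3B", raw)

print(as_int8)   # (127, -128, -1)
print(as_uint8)  # (127, 128, 255)
```

So switching the element type does not change what is sent, only how conveniently values above 127 can be written and read.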
I have a snippet that converts a VTK (off-screen) rendering to 1) a point cloud and 2) a color image. The implementation is correct; only the speed/efficiency is an issue.
At the beginning of every iteration, I update my rendering by calling:
renderWin->Render ();
For the point cloud, I get the depth using the following lines and then convert it to a point cloud (code not posted).
float *depth = new float[width * height];
renderWin->GetZbufferData (0, 0, width - 1, height - 1, &(depth[0]));
For color image, I use vtkWindowToImageFilter to get current color rendered image:
windowToImageFilter->Modified(); // Must have this to get updated rendered image
windowToImageFilter->Update(); // this line takes a lot of time
render_img_vtk = windowToImageFilter->GetOutput();
The program above runs sequentially in a single thread. The render window size is about 1000x1000. There is not much polydata to render. VTK was compiled with OpenGL2 support.
Issue:
This code only runs at about 15-20 Hz. When I disable/comment out the windowToImageFilter part (vtkWindowToImageFilter::Update() takes a lot of time), the frame rate goes up to about 30 Hz.
When I disable/comment out vtkRenderWindow::GetZbufferData, it goes up to 50 Hz (which is how fast I call my loop and update the rendering).
I had a quick look at the VTK source for these two functions and saw that they copy data using GL commands. I am not sure how to speed this up.
Update:
After some searching, I found that the glReadPixels call in GetZbufferData causes a delay because it tries to synchronize the data. See this post: OpenGL read pixels faster than glReadPixels.
In that post, it is suggested that a PBO should be used. VTK has a class vtkPixelBufferObject, but I could not find any example of using it to avoid blocking the pipeline when calling glReadPixels().
So how can I do this within the VTK pipeline?
My answer is just about the GetZbufferData portion.
From what I can tell from the source, vtkOpenGLRenderWindow already calls glReadPixels with little overhead.
What happens after that is, I believe, where overhead can be introduced. The main thing to note is that vtkOpenGLRenderWindow has 3 overloads of GetZbufferData. You are using the overload with the same signature as the one used in vtkWindowToImageFilter.
I believe you are copying that part of the implementation in vtkWindowToImageFilter, which makes total sense. What do you do with the float pointer depthBuffer after you get it? Looking at the vtkWindowToImageFilter implementation, I see that it has a for loop that calls memcpy. I believe the memcpy has to be in a for loop in order to deal with spacing, because of the variables inIncrY and outIncrY. In your situation you should only have to call memcpy once and then free the array pointed to by depthBuffer, unless you are just using the pointer directly. In that case you have to think about who deletes that float array, because it was created with new.
I think the better option is to use the method with this signature: int GetZbufferData( int x1, int y1, int x2, int y2, vtkFloatArray* z )
In Python that looks like this:
import vtk
# create render pipeline (not shown)
# define image bounds (not shown)
vfa = vtk.vtkFloatArray()
ib = image_bounds
render_window.GetZbufferData(ib[0], ib[1], ib[2], ib[3], vfa)
A major benefit is that the pointer for the vtkFloatArray gets handed straight to glReadPixels. Also, VTK will take care of garbage collection of the vtkFloatArray if you create it with vtkSmartPointer (not needed in Python).
My Python implementation runs at about 150 Hz for a single pass on a 640x480 render window.
I am trying to gain further improvement in my Image Resizing algorithm by combining IPP and TBB. The two ways that I can accomplish this task are:
Use IPP without TBB
Use IPP with TBB inside a parallel_for loop
My problem is that the application I coded gives the correct result but, surprisingly, the computation time is larger when the two are combined. To avoid clutter, I only paste part of my code here, but I can provide the whole code if needed. For the first case, when I use only IPP, the code looks like this (the base of the algorithm was borrowed from the Intel TBB sample code for image resizing):
ippiResizeSqrPixel_8u_C1R(src, srcSize, srcStep, srcRoi, dst, dstStep, dstRoi,
m_nzoom_x,m_nzoom_y,0, 0, interpolation, pBufferWhole);
and the parallel_for loop looks like this:
parallel_for(
    blocked_range<size_t>(0, CHUNK),
    [=](const blocked_range<size_t> &r) {
        for (size_t i = r.begin(); i != r.end(); i++) {
            ippiResizeSqrPixel_8u_C1R(src + ((int)(i*srcWidth*srcHeight)), srcSize,
                srcStep, srcRoi, dst + ((int)(i*dstWidth*dstHeight)), dstStep, dstRoi,
                m_nzoom_x, m_nzoom_y, 0, 0, interpolation, pBuffer);
        }
    }
);
src and dst are pointers to the source image and the destination image. When TBB is used, the image is partitioned into CHUNK parts, and the parallel_for loops over all of them and uses an IPP function to resize each part independently. The values of dstHeight, srcHeight, srcRoi, and dstRoi are modified to accommodate the partitioning of the image, and src+((int)(i*srcWidth*srcHeight)) and dst+((int)(i*dstWidth*dstHeight)) point to the beginning of each partition in the source and destination image.
Apparently, IPP and TBB can be combined in this manner -- as I get the correct result -- but what baffles me is that the computational time deteriorates when they're combined compared to when IPP is used alone. Any thought on what could be the cause, or how I could solve this issue?
Thanks!
In your code, each parallelized task in parallel_for consists of multiple ippiResizeSqrPixel calls.
This may add needless overhead compared to the serial version, which calls the function only once, because such a function may contain a preparation phase (for example, setting up an interpolation coefficient table) and is generally designed to process a large memory block at a time for runtime efficiency. (I don't know how IPP actually does it, though.)
I suggest the following parallel structure:
parallel_for(
    // Range = src (or dst) height of image.
    blocked_range<size_t>(0, height),
    [=](const blocked_range<size_t> &r) {
        // 'r' = vertical range of image to process in this task.
        // You can calculate src/dst region from 'r' here,
        // and call ippiResizeSqrPixel once per task.
        ippiResizeSqrPixel_8u_C1R( ... );
    }
);
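The "one resize call per vertical range" structure can be sketched independently of IPP and TBB. The Python below is a rough illustration under stated assumptions: `resize_rows` is a hypothetical nearest-neighbour stand-in for the real resize call, and `resize_parallel` splits the destination height into contiguous row bands, one per task, then stitches the bands back together.

```python
from concurrent.futures import ThreadPoolExecutor


def resize_rows(src, dst_w, dst_h, row_start, row_end):
    """Nearest-neighbour resize of destination rows [row_start, row_end)."""
    src_h, src_w = len(src), len(src[0])
    band = []
    for y in range(row_start, row_end):
        sy = y * src_h // dst_h  # source row that maps to destination row y
        band.append([src[sy][x * src_w // dst_w] for x in range(dst_w)])
    return band


def resize_parallel(src, dst_w, dst_h, bands=4):
    """Split the destination height into row bands and resize each band once."""
    edges = [dst_h * i // bands for i in range(bands + 1)]
    ranges = [(edges[i], edges[i + 1]) for i in range(bands)]
    with ThreadPoolExecutor() as pool:
        parts = pool.map(lambda r: resize_rows(src, dst_w, dst_h, *r), ranges)
        # Bands come back in order, so concatenating them rebuilds the image.
        return [row for part in parts for row in part]


src = [[y * 10 + x for x in range(8)] for y in range(8)]
whole = resize_rows(src, 4, 4, 0, 4)            # single call over the full image
banded = resize_parallel(src, 4, 4, bands=3)    # one call per row band
```

The sketch only demonstrates the partitioning and addressing; Python threads are not expected to make this pure-Python loop faster, so it says nothing about the timings discussed in this thread.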
It turns out that some IPP functions use multi-threading automatically. For such functions, no improvement can be gained from using TBB. Apparently ippiResizeSqrPixel_8u_C1R( ... ) is one of them. When I disabled all cores but one, both versions performed equally well.
Does J2ME have something similar to the RandomAccessFile class, or is there any way to emulate this particular (random access) functionality?
The problem is this: I have a rather large binary data file (~600 KB) and would like to create a mobile application for using that data. Format of that data is home-made and contains many index blocks and data blocks. Reading the data on other platforms (like PHP or C) usually goes like this:
Read 2 bytes for index key (K), another 2 for index value (V) for the data type needed
Skip V bytes from the start of the file to seek to the file position where the data for index key K starts
Read the data
Profit :)
This happens many times during the program flow.
Um, and I'm investigating the possibility of doing the very same on J2ME. While I admit I'm quite new to the whole Java thing, I can't seem to find anything beyond the InputStream (DataInputStream) classes, which don't have the basic functions I need: seeking/skipping to a given byte and returning the current position.
So, what are my chances?
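You should have something like this:
try {
    // 'is' is the InputStream opened on the data file; mark() only works if is.markSupported().
    DataInputStream di = new DataInputStream(is);
    di.mark(9999);                     // remember the start of the file
    int key = di.readUnsignedShort();  // 2-byte index key (K)
    int val = di.readUnsignedShort();  // 2-byte index value (V), assuming the file stores it unsigned
    di.reset();                        // go back to the marked start
    di.skip(val);                      // may skip fewer bytes than requested; check the return value
    byte[] b = new byte[255];
    di.read(b);
} catch (Exception ex) {
    ex.printStackTrace();
}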
I prefer not to use the mark/reset methods. I think it is better to store the offset relative to the position right after val rather than from the start of the file, so you can do without these methods. I think they have some issues on some devices.
One more note: I don't recommend opening a 600 KB file. It will crash the application on many low-end devices; you should split the file into multiple files.
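The idea of avoiding mark/reset can be sketched by tracking how many bytes have been consumed and skipping only the remaining distance to the target offset. The Python below is an illustration of that bookkeeping, not J2ME code: `ForwardReader` and `read_record` are hypothetical names, and the record layout (one big-endian 2-byte key and 2-byte offset at the start of the file, fixed-length data) is an assumption made for the example.

```python
import io
import struct


class ForwardReader:
    """Tracks the position in a forward-only stream so that absolute
    offsets can be reached by skipping, without mark/reset."""

    def __init__(self, stream):
        self.stream = stream
        self.pos = 0

    def read(self, n):
        data = self.stream.read(n)
        if len(data) != n:
            raise EOFError("stream ended early")
        self.pos += n
        return data

    def seek_forward(self, offset):
        if offset < self.pos:
            raise ValueError("cannot seek backwards in a forward-only stream")
        self.read(offset - self.pos)  # read and discard the gap


def read_record(stream, length):
    reader = ForwardReader(stream)
    # Big-endian unsigned shorts, matching DataInputStream's byte order.
    key, offset = struct.unpack(">HH", reader.read(4))
    reader.seek_forward(offset)  # offset counted from the start of the file
    return key, reader.read(length)


payload = b"hello"
blob = struct.pack(">HH", 7, 10) + b"\x00" * 6 + payload
key, data = read_record(io.BytesIO(blob), len(payload))
```

Because the reader knows it has already consumed 4 bytes, it skips only 6 more to reach offset 10, which is the same effect as storing offsets relative to the position after the index entry.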