I have constructed a cairo (v1.12.16) image surface with:
surface = cairo_image_surface_create (CAIRO_FORMAT_ARGB32, size.width, size.height);
and, targeting 60 fps, cleared it, drew my content, and flushed it with:
cairo_surface_flush(surface);
then, got the resulting canvas with:
unsigned char * data = cairo_image_surface_get_data(surface);
but the resulting data variable was only modified (approximately) every second, not 60 times a second. I got the same (unexpected) result even when using cairo's quartz backend... Are there any flush/refresh rate settings in cairo that I am not (yet) aware of?
Edit: I am just trying to draw some filled (random and/or calculated) rectangles; I tested 100 to 10K rects per frame. All related code runs in the same (display?) thread. I am not caching the 'data' variable. I even modified one corner of it to flicker, and I could see the flicker at 60 fps (for 100 rects) and at 2-3 fps (for 10K rects); so the 'data' variable returned is apparently not being refreshed!? In a different project using cairo's quartz backend, I got the same 1 fps result!?
Edit2: The culprit turned out to be the time() function; when used in srand(time(NULL)) it was producing the same random values within the same second. I used srand(std::clock()) instead. Thanks for the quick reply (and it still answers my question!).
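A minimal sketch of what was going wrong (illustrative only, not the original drawing code; draw_frame is a made-up name):

#include <cstdio>
#include <cstdlib>   // srand, rand
#include <ctime>     // time, clock

void draw_frame()    // called ~60 times per second
{
    // std::srand(std::time(nullptr));  // time() only changes once per second, so every frame within
    //                                  // that second gets the same seed and rand() repeats the same rects
    std::srand(std::clock());           // clock() advances many times per second, so the seed differs per frame
    std::printf("first rect x: %d\n", std::rand() % 1000);
}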
No, there are no such flush/refresh rate settings. Cairo draws everything you tell it to and then just returns control.
I have two ideas:
Either cairo is drawing fast enough and something else is slowing things down (e.g. copying the result of the drawing somewhere). You should measure the time that elapses between when you begin drawing and your call to cairo_surface_flush().
Or you are drawing something really, really complex and cairo really does need a second to render it (however, I have no idea how one could accidentally cause such a complex rendering).
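For the first idea, a minimal sketch of that timing measurement (the surface comes from the question; the drawing calls themselves are elided):

#include <cairo.h>
#include <chrono>
#include <cstdio>

void draw_and_time(cairo_surface_t *surface)
{
    auto t0 = std::chrono::steady_clock::now();
    // ... clear the surface and draw the rectangles here ...
    cairo_surface_flush(surface);
    auto t1 = std::chrono::steady_clock::now();
    std::printf("frame took %.2f ms (the budget at 60 fps is ~16.7 ms)\n",
                std::chrono::duration<double, std::milli>(t1 - t0).count());
}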
Related
I am using Direct2D to show some text (FPS, resolution, etc.) on a Direct3D surface. The odd thing is this: my window class has a method called CalculateFrameStats() which runs every loop iteration, calculates the FPS and related information, and uses IDWriteFactory::CreateTextLayout to create a new TextLayout with the latest FPS strings. In the 3DFrameDraw() function I then call BeginDraw(), DrawTextLayout(), and EndDraw(), and after that I don't release the TextLayout pointer. The next round goes to CalculateFrameStats() again, which calls CreateTextLayout again with the newly updated FPS strings, and 3DFrameDraw() draws the text layout again. It loops like this over and over. Yet when I run the program, there seem to be no memory leaks at all; memory usage stays low and constant.
But when I put IDWriteFactory::CreateTextLayout inside the 3DFrameDraw() function, so that at the beginning of every 3D frame I create a new TextLayout with the updated FPS string, do some 3D manipulation, and then call BeginDraw(), DrawTextLayout(), and EndDraw() just before the D3D present (the same place as in the previous version), memory leaks: I can see the usage keep growing as time elapses. If I add a Release() on the TextLayout pointer after BeginDraw(), DrawTextLayout(), EndDraw(), the leak is gone.
I don't really understand why, in the first scenario, the TextLayout pointer is never released until the program closes and yet memory never leaks. Does the TextLayout need to be released every time/frame its text string is updated?
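For reference, a sketch of the per-frame pattern described above (the function and interface pointer names are illustrative, not the actual code; error handling omitted):

#include <d2d1.h>
#include <dwrite.h>
#include <string>

void DrawFpsText(IDWriteFactory *dwriteFactory, IDWriteTextFormat *textFormat,
                 ID2D1RenderTarget *renderTarget, ID2D1SolidColorBrush *textBrush,
                 const std::wstring &fpsText, float maxWidth, float maxHeight)
{
    IDWriteTextLayout *textLayout = nullptr;
    dwriteFactory->CreateTextLayout(fpsText.c_str(), (UINT32)fpsText.size(),
                                    textFormat, maxWidth, maxHeight, &textLayout);

    renderTarget->BeginDraw();
    renderTarget->DrawTextLayout(D2D1::Point2F(10.0f, 10.0f), textLayout, textBrush);
    renderTarget->EndDraw();

    textLayout->Release();  // the variant without this Release() is the one that leaks every frame
}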
I wrote a Python script that sometimes crashes with a memory allocation error. I noticed that the pagefile.sys of my 64-bit Windows 10 system skyrockets while this script runs and ends up exceeding the free memory.
My current solution is to run the script in steps, so that every time the script runs through, the pagefile empties.
I would like the script to run through all at once, though.
Moving the pagefile to another drive is not an option, unfortunately, because I only have this one drive and moving the pagefile to an external drive does not seem to work.
During my research, I found out about the module gc but that is not working:
import gc
and after every iteration I use
gc.collect()
Am I using it wrong or is there another (python-based!) option?
[Edit:]
The script is very basic: it only iterates over image files (using Pillow), checks each image's width, height, and resolution, and calculates the dimensions in cm.
If height > width, the image is rotated 90° counterclockwise.
The images are meant to be enlarged or shrunk to A3 size (42 x 29.7 cm), so I use the width/height ratio to check whether I can enlarge the width to 42 cm while the height stays below 29.7 cm; if the height would exceed 29.7 cm, I enlarge the height to 29.7 cm instead.
For the moment, I still do the actual enlarging/shrinking in Photoshop. Depending on whether it is a width or a height enlargement, the file is moved to a folder for that type.
Anyway, the memory explosion happens in the iteration that only reads the file dimensions.
For that I use
from PIL import Image  # Pillow

with Image.open(imgOri) as pic:
    widthPX = pic.size[0]
    heightPX = pic.size[1]
    resolution = pic.info["dpi"][0]
    widthCM = float(widthPX) / resolution * 2.54
    heightCM = float(heightPX) / resolution * 2.54
I also check whether the shrinking would be too strong; in that case the image is divided in half and re-evaluated.
Even though it should be unnecessary, I also added pic.close() inside the with Image.open() block, because I thought Python might be keeping the image files open, but that didn't help.
Once the iteration finishes, pagefile.sys goes back to its original size, so when that error occurs I take some files out and process them in smaller batches.
I have a snippet that converts a VTK (off-screen) rendering into 1) a point cloud and 2) a color image. The implementation is correct; it's just that the speed/efficiency is an issue.
At the beginning of every iteration, I update my rendering by calling:
renderWin->Render ();
For the point cloud, I get the depth buffer using the following lines and then convert it to a point cloud (code not posted):
float *depth = new float[width * height];
renderWin->GetZbufferData (0, 0, width - 1, height - 1, &(depth[0]));
For color image, I use vtkWindowToImageFilter to get current color rendered image:
windowToImageFilter->Modified(); // Must have this to get updated rendered image
windowToImageFilter->Update(); // this line takes a lot of time
render_img_vtk = windowToImageFilter->GetOutput();
The above program runs sequentially in the same thread. The render window size is about 1000x1000, and there is not a lot of polydata to render. VTK was compiled with OpenGL2 support.
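For reference, the filter itself is wired up once, roughly like this (a sketch; the variable names match the snippet above):

vtkSmartPointer<vtkWindowToImageFilter> windowToImageFilter =
    vtkSmartPointer<vtkWindowToImageFilter>::New();
windowToImageFilter->SetInput(renderWin);
windowToImageFilter->ReadFrontBufferOff();  // read from the back buffer of the off-screen window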
Issue:
This code only runs at about 15-20 Hz. When I disable/comment out the windowToImageFilter part (vtkWindowToImageFilter::Update() takes a lot of time), the frame rate goes up to about 30 Hz.
When I also disable/comment out vtkRenderWindow::GetZbufferData, it goes up to 50 Hz (which is how fast I call my loop and update the rendering).
I had a quick look at the VTK source for these two functions; I see that both copy data using GL commands. I am not sure how I can speed this up.
Update:
After some searching, I found that the glReadPixels call inside GetZbufferData causes the delay, because it blocks while it synchronizes the data. Please see this post: OpenGL read pixels faster than glReadPixels.
In that post, it is suggested that a PBO should be used. VTK has a class vtkPixelBufferObject, but I can't find an example of using it to avoid blocking the pipeline when doing glReadPixels().
So how can I do this within the VTK pipeline?
My answer is just about the GetZbufferData portion.
vtkOpenGLRenderWindow already uses glReadPixels with little overhead, from what I can tell (here).
What happens after that, I believe, is what can introduce overhead. The main thing to note is that vtkOpenGLRenderWindow has three method overloads for GetZbufferData. You are using the overload with the same signature as the one used in vtkWindowToImageFilter (here).
I believe you are copying that part of the vtkWindowToImageFilter implementation, which makes total sense. What do you do with the float pointer depthBuffer after you get it? Looking at the vtkWindowToImageFilter implementation, I see that it has a for loop that calls memcpy (here). I believe their memcpy has to be in a for loop in order to deal with spacing, because of the variables inIncrY and outIncrY. For your situation you should only have to call memcpy once and then free the array pointed to by depthBuffer. Unless you are just using the pointer directly; then you have to think about who has to delete that float array, because it was created with new.
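Something along these lines (illustrative names; myDepthCopy stands for wherever the depth values actually end up, and std::memcpy comes from <cstring>):

float *depth = new float[width * height];
renderWin->GetZbufferData(0, 0, width - 1, height - 1, depth);
std::memcpy(myDepthCopy, depth, sizeof(float) * width * height);  // one memcpy, no per-row loop needed
delete[] depth;  // this buffer was allocated with new[], so the caller owns the delete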
I think the better option is to use the method with this signature: int GetZbufferData( int x1, int y1, int x2, int y2, vtkFloatArray* z )
In Python that looks like this:
import vtk
# create render pipeline (not shown)
# define image bounds (not shown)
vfa = vtk.vtkFloatArray()
ib = image_bounds
render_window.GetZbufferData(ib[0], ib[1], ib[2], ib[3], vfa)
The major benefit is that the pointer of the vtkFloatArray gets handed straight to glReadPixels. Also, VTK will take care of garbage collection of the vtkFloatArray if you create it with vtkSmartPointer (not needed in Python).
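In C++ the same call would look roughly like this (a sketch; render_window and the bounds ib are assumed to exist already, as in the Python snippet):

vtkSmartPointer<vtkFloatArray> zBuffer = vtkSmartPointer<vtkFloatArray>::New();
render_window->GetZbufferData(ib[0], ib[1], ib[2], ib[3], zBuffer);
// zBuffer is reference counted, so it is released automatically when the smart pointer goes out of scope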
My Python implementation runs at about 150 Hz on a single pass, with a 640x480 render window.
edit: Running at 150Hz
I'm developing an app (an XNA game) for the Xbox, and it is a pretty simple app. The start page contains tiles with moving GIF-like animations. Those animations are actually all PNG images, which get loaded once by every tile and put in an array. Then, using a defined delay, the images are played back (using a counter that increases every time the delay passes).
This all works well; however, I noticed a small lag every x seconds in the movement of the animated images. I then started to add some benchmarking:
http://gyazo.com/f5fe0da3ff81bd45c0c52d963feb91d8
As you can see, the FPS is pretty low for such a simple program (this is in debug; when running the app on the Xbox itself, I get an average of 62 fps).
2 important settings:
Graphics.SynchronizeWithVerticalRetrace = false;
IsFixedTimeStep = false;
Changing IsFixedTimeStep to true increases the lag. The settings tile has wheels that rotate, and you can see the wheels jump back a little every x seconds. The same goes for SynchronizeWithVerticalRetrace: enabling it also increases the lag.
I noticed a connection between the lag and the moments the garbage collector kicks in: every time it kicks in, there is a lag...
Don't mind the MAX HMU (heap memory usage), as it reflects the amount at startup; the average is more realistic.
Here is another screenshot, from the performance monitor; however, I don't understand much of this tool, as it's my first time using it... Hope it helps:
http://gyazo.com/f70a3d400657ac61e6e9f2caaaf17587
After a little research I found the culprit.
I have custom components that all derive from GameComponent and get added to the Components list of the main Game class.
This was one of two major problems: it caused everything to be updated, even things that didn't need an update. (The Draw method was the only one that kept the page state in mind and only drew when needed.)
I fixed this by using different "screens" (or pages, as I called them), which are now the only components that derive from GameComponent.
Then I only update the page that is active, and the custom components on that page get updated through it. Problem fixed.
The second big problem is the following:
I made a class that helps me position things on the screen relatively, with percentages and the like: parent containers, aligns and v-aligns, etc.
That class had properties for sizes and vectors, but instead of saving the calculated values in backing fields, I recalculated them every time a property was accessed. Calculating complex things like that involves lots of references (to parent and child containers, for example), which made it very hard on the CLR because it had a lot of work to do.
I have now rebuilt the whole positioning class into a fully functional, optimized class with flags that trigger recalculation only when necessary, and instead of drops of 20 fps, I now get an average of 170+ fps!
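The idea boils down to caching the computed value behind a dirty flag and recomputing it only when an input changes. A rough sketch of the pattern (written in C++ here purely for illustration; the actual class is C#/XNA and all names are made up):

struct PositionedElement {
    float widthPercent = 100.0f;                      // an input that can change
    void SetWidthPercent(float w) { widthPercent = w; dirty = true; }

    float PixelWidth() {                              // property-style accessor
        if (dirty) {                                  // recompute only when an input changed
            cachedPixelWidth = ComputePixelWidth();
            dirty = false;
        }
        return cachedPixelWidth;                      // otherwise return the cached value
    }

private:
    bool  dirty = true;
    float cachedPixelWidth = 0.0f;
    float ComputePixelWidth() { return widthPercent * 19.2f; }  // placeholder for the real layout math
};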
I have some serial code that I have started to parallelize using Intel's TBB. My first aim was to parallelize almost all the for loops in the code (I have even parallelized a for within a for loop), and having done that I do get some speedup. I am looking for more places/ideas/options to parallelize. I know this might sound a bit vague without much reference to the problem, but I am looking for generic ideas here that I can explore in my code.
Overview of the algorithm (it is run over all levels of the image, starting with the smallest and increasing width and height by 2 each time until the actual height and width are reached):
For all image pairs, starting with the smallest pair:
    For height = 2 to image_height - 2:
        Create a 5 x image_width ROI of both the left and right images.
        For width = 2 to image_width - 2:
            Create a 5 x 5 window of the left ROI centered around width and find the best match in the right ROI using NCC.
            Create a 5 x 5 window of the right ROI centered around width and find the best match in the left ROI using NCC.
            Disparity = current_width - best match
    Edge pixels that did not receive a disparity get the disparity of their neighbors.
    For height = 0 to image_height:
        For width = 0 to image_width:
            Check smoothness, uniqueness and order constraints (parallelized separately).
    For height = 0 to image_height:
        For width = 0 to image_width:
            For disparities that failed the constraints, use the average disparity of the neighbors that passed the constraints.
    Normalize all disparities and output to screen.
Just for some perspective, it may not always be worthwhile to parallelize something.
Just because you have a for loop whose iterations can be done independently of each other doesn't always mean you should parallelize it.
TBB has some overhead for starting those parallel_for loops, so unless you're looping a large number of times, you probably shouldn't parallelize it.
But if each iteration is extremely expensive (like in CirrusFlyer's example), then feel free to parallelize it.
More specifically, look for places where the overhead of setting up the parallel computation is small relative to the amount of work being parallelized.
Also, be careful about doing nested parallel_for loops, as this can get expensive. You may want to just stick with parallelizing the outer for loop, as in the sketch below.
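A minimal sketch of that approach, with loop bounds borrowed from the matching stage above (the per-pixel body is left as a comment):

#include <tbb/parallel_for.h>
#include <tbb/blocked_range.h>

// Parallelize only the outer height loop; the 5x5 NCC work per pixel stays serial
// inside each chunk of rows handed out by TBB.
void match_rows(int image_height, int image_width)
{
    tbb::parallel_for(tbb::blocked_range<int>(2, image_height - 2),
        [&](const tbb::blocked_range<int> &rows) {
            for (int h = rows.begin(); h != rows.end(); ++h) {
                for (int w = 2; w < image_width - 2; ++w) {
                    // match the 5x5 window centered at (w, h) with NCC and store the disparity
                }
            }
        });
}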
The silly answer is anything that is time-consuming or iterative. I use Microsoft's .NET 4.0 Task Parallel Library (TPL), and one of the interesting things about its design is its "expressed parallelism", an interesting term for "attempted parallelism": your code may say "use the TPL here", but if the host platform doesn't have the necessary cores, it will simply invoke the old-fashioned serial code in its place.
I have begun to use the TPL in all my projects, especially any place there are loops (this requires that I design my classes and methods so that there are no dependencies between the loop iterations). But any place that might previously have been just good old-fashioned multithreaded code, I now look to see whether it's something I can spread across different cores.
My favorite so far has been an application of mine that downloads ~7,800 different URLs to analyze the contents of the pages and, if it finds the information it's looking for, does some additional processing. This used to take between 26 and 29 minutes to complete. My Dell T7500 workstation with dual quad-core 3 GHz Xeon processors, 24 GB of RAM, and Windows 7 Ultimate 64-bit now crunches the entire thing in about 5 minutes. A huge difference for me.
I also have a publish/subscribe communication engine that I have been refactoring to take advantage of the TPL (especially for "pushing" data from the server to clients: you may have 10,000 client computers that have stated their interest in specific things, and once such an event occurs, I need to push data to all of them). I don't have this done yet, but I'm really looking forward to seeing the results on this one.
Food for thought ...