OpenGL geometry performance - multithreading

I have an application which renders many filled polygons with OpenGL, in 2D. Filling is done by tessellation, but performance is not optimal: 1900 polygons made up of 122000 vertices (that is, about 64 vertices per polygon) are displayed in about 3 seconds.
Apparently the CPU is not the bottleneck: if I replace the calls to gluTessVertex with calls to glColor (just to test where the bottleneck is), performance is doubled.
I have the same problem with loading many small textures.
Now, what are the options to improve performance? It seems that most of the time is spent in the geometry subsystem; rendering itself is fast enough.
I already have a worker thread which does the loading (tessellation, texture binding) in one context, and another thread which does the drawing in another context. The two contexts share objects via wglShareLists and it works like a charm.
Could I have a third thread, in a third context, which would also handle tessellation for half of the polygons? Has anyone tried that? Is it safe? Any example of sharing objects between three contexts?
Forgot to say: I have an ATI Radeon HD 4550 graphics card, so I suppose it can handle more than 39 kB/s of data.
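For reference, what I have in mind looks roughly like this (just a sketch, Win32/WGL, error handling omitted; the names are placeholders):
HDC   hdc      = GetDC(hwnd);               // pixel format already set on this DC
HGLRC drawCtx  = wglCreateContext(hdc);     // draw thread
HGLRC loadCtx  = wglCreateContext(hdc);     // current worker (tessellation, texture upload)
HGLRC loadCtx2 = wglCreateContext(hdc);     // proposed second worker
wglShareLists(drawCtx, loadCtx);            // this already works for two contexts
wglShareLists(drawCtx, loadCtx2);           // would this make all three share safely?
// each thread then calls wglMakeCurrent() with its own context before issuing GL calls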

Increase Performance
Sounds like you're using the old fixed-function pipeline.
If you're unsure of what that is, well, the following functions are a part of the fixed-function pipeline.
glBegin()
glEnd()
glVertex*()
glTexCoord*()
glNormal*()
glColor*()
etc.
Those functions are old and render geometry immediately. That means that each time you call one of the above functions, that geometry gets sent to the GPU. By doing that a lot of times, you can easily push the FPS way under 60 just by rendering simple things.
Now you need to use buffers, and to be more precise VAOs with/or VBOs (and IBOs).
A VBO, or Vertex Buffer Object, is a buffer that stores vertices which you can then render. This is much faster and better to use than glBegin() and glEnd(). When you create a VBO you supply it with vertices, and they only need to be sent to the GPU once; that is basically why VBOs are fast: the data already lives on the GPU and only requires a single draw call instead of many.
The reason I said "with/or" is that in the newer versions you need to create a VAO which then uses a VBO, whereas before you could simply render the VBOs directly.
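As a minimal sketch of the buffer route (assuming a 2D, position-only vertex format, an already-tessellated triangle list in vertices/vertexCount, and a shader program that reads attribute 0; not a drop-in replacement for your code):
// One-time setup: upload the tessellated triangles once.
GLuint vao, vbo;
glGenVertexArrays(1, &vao);
glGenBuffers(1, &vbo);
glBindVertexArray(vao);
glBindBuffer(GL_ARRAY_BUFFER, vbo);
glBufferData(GL_ARRAY_BUFFER, vertexCount * 2 * sizeof(float), vertices, GL_STATIC_DRAW);
glEnableVertexAttribArray(0);                                   // attribute 0 = 2D position
glVertexAttribPointer(0, 2, GL_FLOAT, GL_FALSE, 0, (void*)0);
glBindVertexArray(0);

// Every frame: a single draw call instead of thousands of immediate-mode calls.
glBindVertexArray(vao);
glDrawArrays(GL_TRIANGLES, 0, vertexCount);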
Tessellation
There are multiple ways to do tessellation, and ways to get an effect that looks like tessellation.
For instance, you could simply render different models according to the required LOD (Level of Detail): when you're up close to an object you render the model with all its details, which probably has a high vertex count, and the further you are from the model, the less detailed (lower vertex count) version of it you render. You can't really do that on something like terrain, though, and you definitely shouldn't do it on dynamic and/or procedurally generated terrain.
You can also do actual geometry tessellation, which you do through a shader. Since tessellation is a really huge topic, I will provide you with two URLs which both explain it and have code on them.
Both of these articles use modern OpenGL 4.0/4.0+.
http://prideout.net/blog/?p=48
http://antongerdelan.net/opengl/tessellation.html
Texturing
Generating and binding textures still works the same way.
Instead of using gluBuild2DMipmaps() you can use glGenerateMipmap(GL_TEXTURE_2D); it was added around OpenGL 3.0, if I remember correctly.
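For example (a sketch; width, height and pixels are placeholders for your own image data):
GLuint tex;
glGenTextures(1, &tex);
glBindTexture(GL_TEXTURE_2D, tex);
glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA8, width, height, 0, GL_RGBA, GL_UNSIGNED_BYTE, pixels);
glGenerateMipmap(GL_TEXTURE_2D);    // replaces gluBuild2DMipmaps()
glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_LINEAR_MIPMAP_LINEAR);
glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MAG_FILTER, GL_LINEAR);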
Again, you can (and should) replace all your glBegin() - glEnd() calls (and everything in between) with VAOs and VBOs. You can store everything you want inside a buffer: vertices, texture coordinates, normals, colors, etc. You can keep them in separate buffers, or store them inside a single buffer, usually called an Interleaved Buffer or Interleaved VBO.
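An interleaved buffer could be set up roughly like this (a sketch; the Vertex struct and the attribute locations are just an example, and offsetof needs <cstddef>):
struct Vertex { float pos[2]; float uv[2]; float color[4]; };   // example interleaved layout

glBindBuffer(GL_ARRAY_BUFFER, vbo);
glBufferData(GL_ARRAY_BUFFER, count * sizeof(Vertex), data, GL_STATIC_DRAW);
glEnableVertexAttribArray(0);   // position
glVertexAttribPointer(0, 2, GL_FLOAT, GL_FALSE, sizeof(Vertex), (void*)offsetof(Vertex, pos));
glEnableVertexAttribArray(1);   // texture coordinates
glVertexAttribPointer(1, 2, GL_FLOAT, GL_FALSE, sizeof(Vertex), (void*)offsetof(Vertex, uv));
glEnableVertexAttribArray(2);   // color
glVertexAttribPointer(2, 4, GL_FLOAT, GL_FALSE, sizeof(Vertex), (void*)offsetof(Vertex, color));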
You won't need glEnable(GL_TEXTURE_2D) and glDisable(GL_TEXTURE_2D) anymore, because you do that within a shader: you bind textures and use them in the shader, and since you write the shader program you can make it act however you want it to.
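In practice that just means binding the texture to a texture unit and telling the shader which unit its sampler reads from (a sketch; program, tex and the uniform name uDiffuse are made up):
glUseProgram(program);
glActiveTexture(GL_TEXTURE0);
glBindTexture(GL_TEXTURE_2D, tex);
glUniform1i(glGetUniformLocation(program, "uDiffuse"), 0);   // sampler2D uDiffuse reads unit 0
// fragment shader side:  uniform sampler2D uDiffuse;  color = texture(uDiffuse, uv);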

Related

How to keep NSBitmapImageRep from creating lots of intermediate CGImages?

I have a generative art application which starts with a small set of points, grows them outwards, and checks the growth to make sure it doesn't intersect with anything. My first naive implementation was to do it all on the main UI thread with the expected consequences. As the size grows, there are more points to check, it slows down and eventually blocks the UI.
I did the obvious thing and moved the calculations to another thread so the UI could stay responsive. This helped, but only a little. I accomplished this by having an NSBitmapImageRep that I wrap an NSGraphicsContext around so I can draw into it. But I needed to ensure that I'm not trying to draw it to the screen on the main UI thread while I'm also drawing to it on the background thread. So I introduced a lock. The drawing can take a long time as the data gets larger, too, so even this was problematic.
My latest revision has 2 NSBitmapImageReps. One holds the most recently drawn version and is drawn to the screen whenever the view needs updating. The other is drawn to on the background thread. When the drawing on the background thread is done, it's copied to the other one. I do the copy by getting the base address of each and simply calling memcpy() to actually move the pixels from one to the other. (I tried swapping them rather than copying, but even though the drawing ends with a call to [-NSGraphicsContext flushContext], I was getting partially-drawn results drawn to the window.)
The calculation thread looks like this:
BOOL done = NO;
while (!done)
{
    self->model->lockBranches();
    self->model->iterate();
    done = (!self->model->moreToDivide()) || (!self->keepIterating);
    self->model->unlockBranches();
    [self drawIntoOffscreen];
    dispatch_async(dispatch_get_main_queue(), ^{
        self.needsDisplay = YES;
    });
}
This works well enough for keeping the UI responsive. However, every time I copy the drawn image into the blitting image, I call [-NSBitmapImageRep baseAddress]. Looking at a memory profile in instruments, each call to that function causes a CGImage to be created. Furthermore, that CGImage isn't released until the calculations finish, which can be several minutes. This causes memory to grow pretty large. I'm seeing around 3-4 Gigs of CGImages in my process, even though I never need more than 2 of them. After the calculations finish and the cache is emptied, my app's memory goes down to only 350-500 MB. I hadn't thought to use an autorelease pool in the calculation loop for this, but will give it a try.
It appears that the OS is caching the images it creates. However, it doesn't clear out the cache until the calculations are finished, so it grows without bound until then. Is there any way to keep this from happening?
Don't use -bitmapData and memcpy() to copy the image. Draw the one image into the other.
I often recommend that developers read the section "NSBitmapImageRep: CoreGraphics impedance matching and performance notes" from the 10.6 AppKit release notes:
NSBitmapImageRep: CoreGraphics impedance matching and performance notes
Release notes above detail core changes at the NSImage level for
SnowLeopard. There are also substantial changes at the
NSBitmapImageRep level, also for performance and to improve impedance
matching with CoreGraphics.
NSImage is a fairly abstract representation of an image. It's pretty
much just a thing-that-can-draw, though it's less abstract than NSView
in that it should not behave differently based on aspects of the context
it's drawn into except for quality decisions. That's kind of an opaque
statement, but it can be illustrated with an example: If you draw a
button into a 100x22 region vs a 22x22 region, you can expect the
button to stretch its middle but not its end caps. An image should not
behave that way (and if you try it, you'll probably break!). An image
should always linearly and uniformly scale to fill the rect in which
it's drawn, though it may choose representations and such to optimize
quality for that region. Similarly, all the image representations in
an NSImage should represent the same drawing. Don't pack some totally
different image in as a rep.
That digression past us, an NSBitmapImageRep is a much more concrete
object. An NSImage does not have pixels, an NSBitmapImageRep does. An
NSBitmapImageRep is a chunk of data together with pixel format
information and colorspace information that allows us to interpret the
data as a rectangular array of color values.
That's the same, pretty much, as a CGImage. In SnowLeopard an
NSBitmapImageRep is natively backed by a CGImageRef, as opposed to
directly a chunk of data. The CGImageRef really has the chunk of data.
While in Leopard an NSBitmapImageRep instantiated from a CGImage would
unpack and possibly process the data (which happens when reading from
a bitmap file format), in SnowLeopard we try hard to just hang onto
the original CGImage.
This has some performance consequences. Most are good! You should see
less encoding and decoding of bitmap data as CGImages. If you
initialize a NSImage from a JPEG file, then draw it in a PDF, you
should get a PDF of the same file size as the original JPEG. In
Leopard you'd see a PDF the size of the decompressed image. To take
another example, CoreGraphics caches, including uploads to the
graphics card, are tied to CGImage instances, so the more the same
instance can be used the better.
However: To some extent, the operations that are fast with
NSBitmapImageRep have changed. CGImages are not mutable,
NSBitmapImageRep is. If you modify an NSBitmapImageRep, internally it
will likely have to copy the data out of a CGImage, incorporate your
changes, and repack it as a new CGImage. So, basically, drawing
NSBitmapImageRep is fast, looking at or modifying its pixel data is
not. This was true in Leopard, but it's more true now.
The above steps do happen lazily: If you do something that causes
NSBitmapImageRep to copy data out of its backing CGImageRef (like call
bitmapData), the bitmap will not repack the data as a CGImageRef until
it is drawn or until it needs a CGImage for some other reason. So,
certainly accessing the data is not the end of the world, and is the
right thing to do in some circumstances, but in general you should be
thinking about drawing instead. If you think you want to work with
pixels, take a look at CoreImage instead - that's the API in our
system that is truly intended for pixel processing.
This coincides with safety. A problem we've seen with our SnowLeopard
changes is that apps are rather fond of hardcoding bitmap formats. An
NSBitmapImageRep could be 8, 32, or 128 bits per pixel, it could be
floating point or not, it could be premultiplied or not, it might or
might not have an alpha channel, etc. These aspects are specified with
bitmap properties, like -bitmapFormat. Unfortunately, if someone wants
to extract the bitmapData from an NSBitmapImageRep instance, they
typically just call bitmapData, treat the data as (say) premultiplied
32 bit per pixel RGBA, and if it seems to work, call it a day.
Now that NSBitmapImageRep is not processing data as much as it used
to, random bitmap image reps you may get ahold of may have different
formats than they used to. Some of those hardcoded formats might be
wrong.
The solution is not to try to handle the complete range of formats
that NSBitmapImageRep's data might be in, that's way too hard.
Instead, draw the bitmap into something whose format you know, then
look at that.
That looks like this:
NSBitmapImageRep *bitmapIGotFromAPIThatDidNotSpecifyFormat;
NSBitmapImageRep *bitmapWhoseFormatIKnow = [[NSBitmapImageRep alloc] initWithBitmapDataPlanes:NULL pixelsWide:width pixelsHigh:height
bitsPerSample:bps samplesPerPixel:spp hasAlpha:alpha isPlanar:isPlanar
colorSpaceName:colorSpaceName bitmapFormat:bitmapFormat bytesPerRow:rowBytes
bitsPerPixel:pixelBits];
[NSGraphicsContext saveGraphicsState];
[NSGraphicsContext setCurrentContext:[NSGraphicsContext graphicsContextWithBitmapImageRep:bitmapWhoseFormatIKnow]];
[bitmapIGotFromAPIThatDidNotSpecifyFormat draw];
[NSGraphicsContext restoreGraphicsState];
unsigned char *bitmapDataIUnderstand = [bitmapWhoseFormatIKnow bitmapData];
This produces no more copies of the data than just accessing
bitmapData of bitmapIGotFromAPIThatDidNotSpecifyFormat, since that
data would need to be copied out of a backing CGImage anyway. Also
note that this doesn't depend on the source drawing being a bitmap.
This is a way to get pixels in a known format for any drawing, or just
to get a bitmap. This is a much better way to get a bitmap than
calling -TIFFRepresentation, for example. It's also better than
locking focus on an NSImage and using -[NSBitmapImageRep
initWithFocusedViewRect:].
So, to sum up: (1) Drawing is fast. Playing with pixels is not. (2) If
you think you need to play with pixels, (a) consider if there's a way
to do it with drawing or (b) look into CoreImage. (3) If you still
want to get at the pixels, draw into a bitmap whose format you know
and look at those pixels.
In fact, it's a good idea to start at the earlier section with a similar title — "NSImage, CGImage, and CoreGraphics impedance matching" — and read through to the later section.
By the way, there's a good chance that swapping the image reps would work, but you just weren't synchronizing them properly. You would have to show the code where both reps were used for us to know for sure.

How Might I organize vertex data in WebGL for a frame-by-frame (very specific) animated program?

I have been working on an animated graphics project with very specific requirements, and after quite a bit of searching and test coding, I have figured that I could take several approaches, but the Khronos and MDN documentation I have been reading coupled with other posts I have seen here don't answer all of my questions regarding my particular project. In the meantime, I have written short test programs (setting infrastructure for testing).
Firstly, I should describe the project:
The main object drawn to the screen is a simple quad surrounded by a black outline (LINE_LOOP or LINES will do, probably, though I have had issues with z-fighting...that will be left for another question). When the user interacts with the program, exactly one new quad is created and immediately drawn, but for a set amount of time its vertices move around until the quad moves to its final destination. (Note that translations won't do.) Random black lines are also drawn, and sometimes those lines also move around.
Once one of the quads reaches its final spot, it never moves again.
A new quad is always atop old quads (closer to the screen). That means that I need to layer the quads and lines from oldest to newest.
This also means that it would probably be best to assign z-values to each quad and line, even if the graphics are in pixel coordinates and use an orthographic matrix. Would everyone agree with this?
Given these parameters, I have a few options with varying levels of complexity:
1> Take the object-oriented approach and just assign a buffer to each quad, and the same goes for the random lines. This means creating and destroying buffers every frame for the one shape that is moving. I truthfully think that this is a terrible idea that might only work in a higher-level library that does heavy optimization underneath. This approach also doesn't take advantage of the fact that almost every quad will stay the same.
[vertices0] ... , [verticesN]
Draw x N (many draws for many small-size buffers)
2> Assign a z-value to each quad, outline, and line (as mentioned above). Allocate a huge vertex buffer and element buffer to store all permanently-in-their-final-positions quads. Resize only in the very unlikely case someone interacts for long enough. Create a second tiny buffer to store the one temporary moving quad and use bufferSubData every frame. When the quad reaches its destination, bufferSubData it into the large buffer and overwrite the small buffer upon creation of the next quad...all on the same frame. The main questions I have here are: is it possible (safe?) to use bufferSubData and draw it on the same frame? Also, would I use DYNAMIC_DRAW on both buffers even though the larger one would see fewer updates?
[permanent vertices ... | uninitialized (keep a count)]
bufferSubData -> [tempVerticesForOneQuad]
Draw 2x
3> Still create the large and small buffers, but instead of using bufferSubData every frame, create a second shader program and add an attribute for the new/moving quad that explicitly sets the vertex positions for the animation (I would pass vertex index attributes). Only draw with the small buffer when the quad is moving. For the frame when the quad reaches its destination, draw both large and small buffer, but then bufferSubData the final coordinates into the large permanent buffer to be used in the next frame.
switchToShaderProgramA();
[permanent vertices...| uninitialized (keep a count)]
switchToShaderProgramB();
[temp vertices] <- shader B accepts indices for each vertex so we can do all animation in the vertex shader
---last frame of movement arrives : bufferSubData into the permanent vertices buffer for when the the next quad is created
I get the sense that the third option might be the best, but I would like to learn whether there are other factors I did not consider; for example, whether my assumption holds that a program switch, additional attributes, and vertex-shader manipulation would be faster than just substituting the buffer values as in 2>. The advantage of approach 3> (I think) is that I can defer the buffer substitution to a time when nothing needs to be drawn.
Still, I am not sure how to handle the randomly-appearing lines. I can't take the "single quad vertex buffer" approach since the number of lines cannot be predicted. Might I also allocate a large buffer for the moving lines? Those also stay after the quad is finished moving, though I don't think I could use the vertex-shader trick because there would be too many attributes to set (as opposed to the 4 for the one quad). I suppose that I could create a large "permanent line data" buffer first, but what to do during the animation is tricky because the lines move. Maybe bufferSubData() + draw on the same frame is not terrible? Or it could be. This is where I need advice.
I understand that this question might not be too specific code-wise, but I don't believe that I would be allowed to show the core of the program. All I have is the typical WebGL boilerplate ready.
I am looking forward to hearing people's thoughts on how I might proceed and whether there are any trade-offs I might have missed when considering the three options above.
Thank you in advance, and please feel free to ask any additional questions if clarification is necessary.
Honestly, for what you're describing, it doesn't sound to me like it matters which you choose. On modern hardware, drawing a few hundred quads and a few thousand lines each frame would not really tax the hardware much.
Having said that, I agree that approach 1 seems very inefficient. Approach 2 sounds perfectly fine. You can safely draw a buffer on the same frame that you uploaded the data. I don't think it matters much whether you use DYNAMIC_DRAW or STATIC_DRAW for the buffer. I tend to think of dynamic buffers as being something you're updating every frame. If you only update it every few seconds or less, then static is fine. Approach 3 is also fine. Between 2 and 3, I'd say do whichever is easier for you to understand and program.
Likewise, for the lines, I would use a separate buffer. It sounds like that one changes per frame, so I would use DYNAMIC_DRAW for that. Allocating a single large buffer for it and performing a glBufferSubData() per frame is probably a fine strategy. As always, trying it and profiling it will tell you for sure.
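A sketch of that pattern, written with the GL C function names for brevity (the WebGL calls are the same apart from the gl. prefix and the sizes being implied by typed arrays; MAX_QUADS, VERTEX_BYTES, movingQuadVertices and quadCount are all made-up placeholders):
// One-time setup: reserve a large buffer for settled quads and a tiny one for the moving quad.
glBindBuffer(GL_ARRAY_BUFFER, permanentVbo);
glBufferData(GL_ARRAY_BUFFER, MAX_QUADS * 4 * VERTEX_BYTES, NULL, GL_DYNAMIC_DRAW);
glBindBuffer(GL_ARRAY_BUFFER, movingVbo);
glBufferData(GL_ARRAY_BUFFER, 4 * VERTEX_BYTES, NULL, GL_DYNAMIC_DRAW);

// Each frame while a quad is animating: update the small buffer, then draw both buffers.
glBindBuffer(GL_ARRAY_BUFFER, movingVbo);
glBufferSubData(GL_ARRAY_BUFFER, 0, 4 * VERTEX_BYTES, movingQuadVertices);
// ... set attribute pointers and draw permanentVbo (quadCount quads), then movingVbo ...

// When the quad settles: append its vertices to the permanent buffer and bump the count.
glBindBuffer(GL_ARRAY_BUFFER, permanentVbo);
glBufferSubData(GL_ARRAY_BUFFER, quadCount * 4 * VERTEX_BYTES, 4 * VERTEX_BYTES, movingQuadVertices);
quadCount++;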

Multithreaded shading pass Vulkan

I'm currently implementing a basic deferred renderer with multithreading in Vulkan. Since my G-Buffer should have the same resolution as the final image I want to do it in a single render-pass with multiple sub-passes, according to this presentation, on slide 44 (page 138). It says:
vkCmdBeginCommandBuffer
vkCmdBeginRenderPass
vkCmdExecuteCommands
vkCmdNextSubpass
vkCmdExecuteCommands
vkCmdEndRenderPass
vkCmdEndCommandBuffer
I get that in the first sub-pass, you iterate the scene graph and record one secondary command buffer for each entity/mesh. What I don't get is how you are supposed to do the shading pass with secondary command buffers. Do you somehow split the screen into parts and render each part in a separate thread, or just record one secondary command buffer for the entire second sub-pass?
As you said, you may need to multithread command buffer recording for the G-buffer-building subpass. For the shading pass, however, it depends on how you are doing things; in my opinion you do not need to multithread your shading subpass. You must, however, take into consideration that you can have a "by region" dependency.
So, I encourage you to proceed this way.
Before beginning your render pass, use a compute shader to splat all your lights onto the screen (so you end up with a kind of array of quads).
By splatting I mean the following: take a point light, for example; the idea is to compute the quad in screen space affected by the light. That gives you 4 vertices (representing a quad) which you put into an SSBO and can use as a vertex buffer in the shading subpass.
Now you begin the render pass.
Multithread the scene graph rendering if needed, and do your vkCmdExecuteCommands();
NextSubpass
Use the "array of quads" you create from the earlier compute shader (do not forget a VK_SUBPASS_EXTERNAL dependency).
NextSubpass and so on
However, you said
you iterate the scene graph and record one secondary commandbuffer for each entity/mesh.
I am not sure I really understand what you meant, but if you intend to have one secondary command buffer per mesh, I really advise you to change your approach: you should use batching. Let's say you have 64 000 different meshes to draw. You could, for example, create 64 command buffers (dispatched across 4 threads), each of which draws 1000 meshes. (The numbers are picked arbitrarily, so profile your application.)
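A rough sketch of that batching idea (heavily simplified; batches, secondaryCmd, recordDraw, the Mesh type and the render pass objects are assumed/made up, and each thread should record from its own command pool):
// Worker thread: record one secondary command buffer for one batch of meshes.
VkCommandBufferInheritanceInfo inherit{};
inherit.sType       = VK_STRUCTURE_TYPE_COMMAND_BUFFER_INHERITANCE_INFO;
inherit.renderPass  = renderPass;
inherit.subpass     = 0;               // the G-buffer subpass
inherit.framebuffer = framebuffer;

VkCommandBufferBeginInfo begin{};
begin.sType            = VK_STRUCTURE_TYPE_COMMAND_BUFFER_BEGIN_INFO;
begin.flags            = VK_COMMAND_BUFFER_USAGE_RENDER_PASS_CONTINUE_BIT;
begin.pInheritanceInfo = &inherit;

vkBeginCommandBuffer(secondaryCmd[batch], &begin);
for (const Mesh& mesh : batches[batch])          // e.g. ~1000 meshes per batch, not one buffer per mesh
    recordDraw(secondaryCmd[batch], mesh);       // bind buffers/descriptors, vkCmdDrawIndexed, ...
vkEndCommandBuffer(secondaryCmd[batch]);

// Main thread, inside a render pass begun with VK_SUBPASS_CONTENTS_SECONDARY_COMMAND_BUFFERS:
vkCmdExecuteCommands(primaryCmd, batchCount, secondaryCmd);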
So, to answer your question for the shading subpass: I would not use secondary command buffers, or only very few (one per kind of light: punctual, directional).
What I don't get is how you are supposed to do the shading pass with secondary command buffers.
The shading pass (presumably the second subpass) would typically take the G-buffers created by the first subpass as input attachments. It would then draw a screen-sized quad, using the data from the G-buffers plus a set of lights (or whatever your deferred shader tries to defer).
The presentation you link tries to hint at this structure style starting at page 13 (marked "Page 107").
The first step would be to get it working; use e.g. this SW example. Then the next step of optimizing it into a single render pass should be easier.
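For reference, the two-subpass structure with input attachments and the "by region" dependency mentioned above could look roughly like this (a sketch only; attachment descriptions, depth, and the surrounding VkRenderPassCreateInfo are elided, and the attachment indices are made up):
// Subpass 0 writes the G-buffer; subpass 1 reads it as input attachments.
VkAttachmentReference gbufColor[] = { {1, VK_IMAGE_LAYOUT_COLOR_ATTACHMENT_OPTIMAL},
                                      {2, VK_IMAGE_LAYOUT_COLOR_ATTACHMENT_OPTIMAL} };
VkAttachmentReference gbufInput[] = { {1, VK_IMAGE_LAYOUT_SHADER_READ_ONLY_OPTIMAL},
                                      {2, VK_IMAGE_LAYOUT_SHADER_READ_ONLY_OPTIMAL} };
VkAttachmentReference swapchainColor = {0, VK_IMAGE_LAYOUT_COLOR_ATTACHMENT_OPTIMAL};

VkSubpassDescription subpasses[2]{};
subpasses[0].pipelineBindPoint    = VK_PIPELINE_BIND_POINT_GRAPHICS;
subpasses[0].colorAttachmentCount = 2;
subpasses[0].pColorAttachments    = gbufColor;
subpasses[1].pipelineBindPoint    = VK_PIPELINE_BIND_POINT_GRAPHICS;
subpasses[1].inputAttachmentCount = 2;
subpasses[1].pInputAttachments    = gbufInput;
subpasses[1].colorAttachmentCount = 1;
subpasses[1].pColorAttachments    = &swapchainColor;

VkSubpassDependency dep{};
dep.srcSubpass      = 0;
dep.dstSubpass      = 1;
dep.srcStageMask    = VK_PIPELINE_STAGE_COLOR_ATTACHMENT_OUTPUT_BIT;
dep.dstStageMask    = VK_PIPELINE_STAGE_FRAGMENT_SHADER_BIT;
dep.srcAccessMask   = VK_ACCESS_COLOR_ATTACHMENT_WRITE_BIT;
dep.dstAccessMask   = VK_ACCESS_INPUT_ATTACHMENT_READ_BIT;
dep.dependencyFlags = VK_DEPENDENCY_BY_REGION_BIT;    // the "by region" dependency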

OpenGL ES 2.0: Efficient Rendering of Static and Dynamic Vertex Data

I am writing an iOS/Android game and looking for the most performant way to render my vertex data with OpenGL ES 2.0. I have two different kinds of data: dynamic data whose attributes change every frame, for example the player or animated background objects, and static data such as the static background or the terrain. I have googled a lot since yesterday, but I could not find a clear and unique answer to the question of what the best way to render such data is.
There are basically three options for rendering such data (if I am not missing one; if so, feel free to correct me):
Vertex Arrays Only:
Just fill your vertex arrays every frame on the CPU (including the dynamic data).
Vertex Buffer Objects Only:
Allocate a VBO on the GPU with GL_DYNAMIC_DRAW in which both the dynamic and the static data are stored. The dynamic data is then updated every frame via glBufferSubData.
Use both:
Static data is stored in and rendered with a VBO, and the dynamic data is rendered with a vertex array. With this option, we need two rendering passes: one for rendering the VBO and one for rendering the vertex array.
Since the first option does not exploit the immutability of the static data and since the third option requires two rendering passes, my guess is that I should go with the second option. However, I am absolutely not sure about this and I hope you can clarify my confusion.
Allocate two Vertex Buffer Objects: one with the hint GL_DYNAMIC_DRAW that will be updated frequently, and a second one for the immutable data with the hint GL_STATIC_DRAW. According to the API documentation, GL_STATIC_DRAW should be used for data that "will be modified once and used many times"; just what you need.
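A minimal sketch of that setup (the names and sizes are placeholders):
// One-time setup
GLuint staticVbo, dynamicVbo;
glGenBuffers(1, &staticVbo);
glBindBuffer(GL_ARRAY_BUFFER, staticVbo);
glBufferData(GL_ARRAY_BUFFER, staticSize, staticVertices, GL_STATIC_DRAW);   // terrain, static background

glGenBuffers(1, &dynamicVbo);
glBindBuffer(GL_ARRAY_BUFFER, dynamicVbo);
glBufferData(GL_ARRAY_BUFFER, dynamicSize, NULL, GL_DYNAMIC_DRAW);           // player, animated objects

// Per frame
glBindBuffer(GL_ARRAY_BUFFER, dynamicVbo);
glBufferSubData(GL_ARRAY_BUFFER, 0, dynamicSize, dynamicVertices);           // upload this frame's data
// ... set attribute pointers and issue one draw call per buffer ...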
Speaking of two rendering passes here is probably a misuse of the term: what you do is render your scene with two separate drawing commands. Since drawing commands run asynchronously, you should not experience any performance hit by doing so.
A second rendering pass, on the other hand, is when you render the entire scene twice (see for example here) with different settings, or when you do some image processing effects on outputs of previous rendering passes.

Does using glBindAttribLocation improve performance?

My understanding is that glBindAttribLocation allows you to custom set a handle to an attribute (before linking a shader program), which you can later use when rendering with glVertexAttribPointer.
But you don't have to use it, and may instead just rely on OpenGL assigning whatever handle it so chooses in its infinite wisdom. However, you would then need to query OpenGL to find out this handle by using glGetAttribLocation at some point before rendering with glVertexAttribPointer.
Now you could use glGetAttribLocation each time you render, which would seem wasteful since you can just use glGetAttribLocation once after building your program, then store the handle.
So essentially, you can store this handle by either using glBindAttribLocation or by using glGetAttribLocation so is there any difference performance-wise and what are the pros and cons of one over the other?
I cannot speak much about the direct performance difference, but it should be irrelevant anyway: no matter whether you use glBindAttribLocation or glGetAttribLocation, you do it at initialization time (and even then, calling glGetAttribLocation shouldn't hurt that much).
But the main difference and advantage of an explicit glBindAttribLocation over letting GL decide is that it allows you to establish your own attribute semantics and keep them consistent for each and every shader.
Say you have a whole bunch of objects and a whole bunch of shaders. Each shader has some notion of a position attribute (and normal, color, ...); likewise, each object has attribute data for positions, normals, and so on. Now with glBindAttribLocation you can bind your position attribute to location 0 in each and every different shader. So when drawing your objects with different shaders, they can all use a single vertex format (i.e. the way you call glVertexAttribPointer for the individual attributes, and the individual enable calls).
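For example (a sketch; the attribute names, stride and offsets are just for illustration):
// Before linking each program, pin the same locations for the same semantics:
glBindAttribLocation(program, 0, "position");
glBindAttribLocation(program, 1, "normal");
glBindAttribLocation(program, 2, "texcoord");
glLinkProgram(program);

// Now every object can use one and the same vertex format, regardless of which shader draws it:
glVertexAttribPointer(0, 3, GL_FLOAT, GL_FALSE, stride, posOffset);
glVertexAttribPointer(1, 3, GL_FLOAT, GL_FALSE, stride, normalOffset);
glVertexAttribPointer(2, 2, GL_FLOAT, GL_FALSE, stride, uvOffset);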
On the other hand, glGetAttribLocation doesn't give you any guarantees about which attributes get which indices (maybe one shader has some additional attribute and the compiler thinks it's a good idea to reorder them, who knows). So in this case you have a different vertex format (glVertexAttribPointer call) for each object and each shader.
This is even more important when using Vertex Array Objects (which encapsulate all the attribute state, especially the glVertexAttribPointer and glEnableVertexAttribArray calls). In this case you usually don't need (and don't want) to call glVertexAttribPointer each time you draw an object with another shader.
So the bottom line is: always use glBindAttribLocation. At best (in a large application) it saves you many object and shader management issues and many unnecessary glVertexAttribPointer calls each frame (and that can likely be a performance gain), and at least (in a very small application) it is good practice and keeps you open and flexible for extensions. As a side note, in desktop GL 3+ (or with the ARB_explicit_attrib_location extension) you can even assign attribute locations directly in the shader without the need for any API call.
