I need to render some CPU-generated images in Direct3D 9, and I'm not sure of the best way to get the texture data onto the graphics card, as there seem to be a number of approaches.
My usage path goes along the following lines each frame:
Render a bunch of stuff with the textures
Update a few parts of the texture (which may have been used by the previous renders)
Render some more stuff with the texture
Update another part of the texture
and so on
I've thought of a couple of ways to do this, but I'm not sure which one to go with. I considered benchmarking each method, but I have no way to know whether any results I get are representative of hardware in general or only of my hardware.
Which pool is best for a texture for this task?
What's the best way to update this texture?
1. Call LockRect and UnlockRect for each region I need to update
2. Call LockRect and UnlockRect for the entire texture
3. Call LockRect and UnlockRect for the entire texture with D3DLOCK_DISCARD and copy in a bitmap from RAM
4. Create a completely new texture each time I need to "update it"
5. Use 1, 2 or 3 to update a surface in D3DPOOL_SYSTEMMEM, then UpdateSurface to update level 0 of my texture from this surface
6. Same as 5 but specify a RECT to cover the entire area I need
7. Same as 5 but make multiple calls, one for each region I updated
8. Probably yet another way to do this I haven't thought of yet...
It should be noted that the areas I'm updating are usually fairly small compared to the size of the entire texture, e.g. the texture may be 1024*1024 and I might want to update 5 or so 64*64 regions of it.
If you need to update multiple areas, you should lock the whole texture and use the D3DLOCK_NO_DIRTY_UPDATE flag, then for each area call AddDirtyRect before unlocking.
This of course all depends on the size of the texture etc.; for a small texture it may be more efficient to copy the whole thing from RAM.
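Whichever lock strategy you pick, the CPU-side copy looks the same: rows in the locked memory are Pitch bytes apart, which is not necessarily width * 4. Here is a minimal sketch of that copy in plain C++, assuming 32-bit pixels. CopyRegion32 is a hypothetical helper, not a Direct3D call, and dstBits/dstPitch stand in for the pBits and Pitch fields of the D3DLOCKED_RECT that LockRect fills in.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical helper (not part of Direct3D): copies a w x h block of
// tightly packed 32-bit pixels into a destination whose rows are
// dstPitch bytes apart, starting at pixel (dstX, dstY).
void CopyRegion32(void* dstBits, int dstPitch,
                  int dstX, int dstY,
                  const std::uint32_t* src, int w, int h)
{
    assert(dstBits != nullptr && src != nullptr);
    assert(dstPitch >= (dstX + w) * 4);  // region must fit inside one row

    auto* base = static_cast<std::uint8_t*>(dstBits);
    for (int row = 0; row < h; ++row) {
        // Step by the pitch, never by w * 4: the driver may pad each row.
        std::uint8_t* dstRow = base + (dstY + row) * dstPitch + dstX * 4;
        std::memcpy(dstRow, src + row * w, static_cast<std::size_t>(w) * 4);
    }
}
```

If you lock only a sub-rectangle, pBits already points at its top-left corner, so pass dstX = dstY = 0. If you lock the whole texture, pass the region's origin and call the helper once per updated region.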
D3DPOOL_DEFAULT
D3DUSAGE_DYNAMIC
call LockRect and UnlockRect for each region you need to update
--> This is the fastest!
Benchmark will follow...
I have a generative art application which starts with a small set of points, grows them outwards, and checks the growth to make sure it doesn't intersect with anything. My first naive implementation was to do it all on the main UI thread with the expected consequences. As the size grows, there are more points to check, it slows down and eventually blocks the UI.
I did the obvious thing and moved the calculations to another thread so the UI could stay responsive. This helped, but only a little. I accomplished this by having an NSBitmapImageRep that I wrap an NSGraphicsContext around so I can draw into it. But I needed to ensure that I'm not trying to draw it to the screen on the main UI thread while I'm also drawing to it on the background thread. So I introduced a lock. The drawing can take a long time as the data gets larger, too, so even this was problematic.
My latest revision has 2 NSBitmapImageReps. One holds the most recently drawn version and is drawn to the screen whenever the view needs updating. The other is drawn to on the background thread. When the drawing on the background thread is done, it's copied to the other one. I do the copy by getting the base address of each and simply calling memcpy() to actually move the pixels from one to the other. (I tried swapping them rather than copying, but even though the drawing ends with a call to [-NSGraphicsContext flushContext], I was getting partially-drawn results drawn to the window.)
The calculation thread looks like this:
    BOOL done = NO;
    while (!done)
    {
        self->model->lockBranches();
        self->model->iterate();
        done = (!self->model->moreToDivide()) || (!self->keepIterating);
        self->model->unlockBranches();
        [self drawIntoOffscreen];
        dispatch_async(dispatch_get_main_queue(), ^{
            self.needsDisplay = YES;
        });
    }
This works well enough for keeping the UI responsive. However, every time I copy the drawn image into the blitting image, I call [-NSBitmapImageRep baseAddress]. Looking at a memory profile in Instruments, each call to that function causes a CGImage to be created. Furthermore, that CGImage isn't released until the calculations finish, which can be several minutes. This causes memory to grow pretty large. I'm seeing around 3-4 GB of CGImages in my process, even though I never need more than 2 of them. After the calculations finish and the cache is emptied, my app's memory goes down to only 350-500 MB. I hadn't thought to use an autorelease pool in the calculation loop for this, but will give it a try.
It appears that the OS is caching the images it creates. However, it doesn't clear out the cache until the calculations are finished, so it grows without bound until then. Is there any way to keep this from happening?
Don't use -bitmapData and memcpy() to copy the image. Draw the one image into the other.
I often recommend that developers read the section "NSBitmapImageRep: CoreGraphics impedance matching and performance notes" from the 10.6 AppKit release notes:
NSBitmapImageRep: CoreGraphics impedance matching and performance notes
Release notes above detail core changes at the NSImage level for
SnowLeopard. There are also substantial changes at the
NSBitmapImageRep level, also for performance and to improve impedance
matching with CoreGraphics.
NSImage is a fairly abstract representation of an image. It's pretty
much just a thing-that-can-draw, though it's less abstract than NSView
in that it should not behave differently based aspects of the context
it's drawn into except for quality decisions. That's kind of an opaque
statement, but it can be illustrated with an example: If you draw a
button into a 100x22 region vs a 22x22 region, you can expect the
button to stretch its middle but not its end caps. An image should not
behave that way (and if you try it, you'll probably break!). An image
should always linearly and uniformly scale to fill the rect in which
its drawn, though it may choose representations and such to optimize
quality for that region. Similarly, all the image representations in
an NSImage should represent the same drawing. Don't pack some totally
different image in as a rep.
That digression past us, an NSBitmapImageRep is a much more concrete
object. An NSImage does not have pixels, an NSBitmapImageRep does. An
NSBitmapImageRep is a chunk of data together with pixel format
information and colorspace information that allows us to interpret the
data as a rectangular array of color values.
That's the same, pretty much, as a CGImage. In SnowLeopard an
NSBitmapImageRep is natively backed by a CGImageRef, as opposed to
directly a chunk of data. The CGImageRef really has the chunk of data.
While in Leopard an NSBitmapImageRep instantiated from a CGImage would
unpack and possibly process the data (which happens when reading from
a bitmap file format), in SnowLeopard we try hard to just hang onto
the original CGImage.
This has some performance consequences. Most are good! You should see
less encoding and decoding of bitmap data as CGImages. If you
initialize a NSImage from a JPEG file, then draw it in a PDF, you
should get a PDF of the same file size as the original JPEG. In
Leopard you'd see a PDF the size of the decompressed image. To take
another example, CoreGraphics caches, including uploads to the
graphics card, are tied to CGImage instances, so the more the same
instance can be used the better.
However: To some extent, the operations that are fast with
NSBitmapImageRep have changed. CGImages are not mutable,
NSBitmapImageRep is. If you modify an NSBitmapImageRep, internally it
will likely have to copy the data out of a CGImage, incorporate your
changes, and repack it as a new CGImage. So, basically, drawing
NSBitmapImageRep is fast, looking at or modifying its pixel data is
not. This was true in Leopard, but it's more true now.
The above steps do happen lazily: If you do something that causes
NSBitmapImageRep to copy data out of its backing CGImageRef (like call
bitmapData), the bitmap will not repack the data as a CGImageRef until
it is drawn or until it needs a CGImage for some other reason. So,
certainly accessing the data is not the end of the world, and is the
right thing to do in some circumstances, but in general you should be
thinking about drawing instead. If you think you want to work with
pixels, take a look at CoreImage instead - that's the API in our
system that is truly intended for pixel processing.
This coincides with safety. A problem we've seen with our SnowLeopard
changes is that apps are rather fond of hardcoding bitmap formats. An
NSBitmapImageRep could be 8, 32, or 128 bits per pixel, it could be
floating point or not, it could be premultiplied or not, it might or
might not have an alpha channel, etc. These aspects are specified with
bitmap properties, like -bitmapFormat. Unfortunately, if someone wants
to extract the bitmapData from an NSBitmapImageRep instance, they
typically just call bitmapData, treat the data as (say) premultiplied
32 bit per pixel RGBA, and if it seems to work, call it a day.
Now that NSBitmapImageRep is not processing data as much as it used
to, random bitmap image reps you may get ahold of may have different
formats than they used to. Some of those hardcoded formats might be
wrong.
The solution is not to try to handle the complete range of formats
that NSBitmapImageRep's data might be in, that's way too hard.
Instead, draw the bitmap into something whose format you know, then
look at that.
That looks like this:
NSBitmapImageRep *bitmapIGotFromAPIThatDidNotSpecifyFormat;
NSBitmapImageRep *bitmapWhoseFormatIKnow = [[NSBitmapImageRep alloc] initWithBitmapDataPlanes:NULL pixelsWide:width pixelsHigh:height
    bitsPerSample:bps samplesPerPixel:spp hasAlpha:alpha isPlanar:isPlanar
    colorSpaceName:colorSpaceName bitmapFormat:bitmapFormat bytesPerRow:rowBytes
    bitsPerPixel:pixelBits];
[NSGraphicsContext saveGraphicsState];
// Make the known-format bitmap the current drawing destination.
[NSGraphicsContext setCurrentContext:[NSGraphicsContext graphicsContextWithBitmapImageRep:bitmapWhoseFormatIKnow]];
[bitmapIGotFromAPIThatDidNotSpecifyFormat draw];
[NSGraphicsContext restoreGraphicsState];
unsigned char *bitmapDataIUnderstand = [bitmapWhoseFormatIKnow bitmapData];
This produces no more copies of the data than just accessing
bitmapData of bitmapIGotFromAPIThatDidNotSpecifyFormat, since that
data would need to be copied out of a backing CGImage anyway. Also
note that this doesn't depend on the source drawing being a bitmap.
This is a way to get pixels in a known format for any drawing, or just
to get a bitmap. This is a much better way to get a bitmap than
calling -TIFFRepresentation, for example. It's also better than
locking focus on an NSImage and using -[NSBitmapImageRep
initWithFocusedViewRect:].
So, to sum up: (1) Drawing is fast. Playing with pixels is not. (2) If
you think you need to play with pixels, (a) consider if there's a way
to do it with drawing or (b) look into CoreImage. (3) If you still
want to get at the pixels, draw into a bitmap whose format you know
and look at those pixels.
In fact, it's a good idea to start at the earlier section with a similar title — "NSImage, CGImage, and CoreGraphics impedance matching" — and read through to the later section.
By the way, there's a good chance that swapping the image reps would work, but you just weren't synchronizing them properly. You would have to show the code where both reps were used for us to know for sure.
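To illustrate what "synchronizing them properly" could look like, here is a minimal sketch in plain C++ rather than AppKit. DoubleBuffer is a hypothetical class, and its two pixel vectors stand in for the two NSBitmapImageReps. The assumption is that the background thread is the only writer of the back buffer, and that the UI thread only touches the front buffer while holding the lock.

```cpp
#include <cstddef>
#include <cstdint>
#include <mutex>
#include <utility>
#include <vector>

// Hypothetical stand-in for the two bitmaps: "front" is what the UI
// draws, "back" is what the calculation thread draws into.
class DoubleBuffer {
public:
    explicit DoubleBuffer(std::size_t pixelCount)
        : front_(pixelCount, 0), back_(pixelCount, 0) {}

    // Background thread only: draw here without taking the lock.
    std::vector<std::uint32_t>& back() { return back_; }

    // Background thread: publish a finished frame. Swapping two vectors
    // exchanges their storage, so no pixels are copied.
    void publish() {
        std::lock_guard<std::mutex> lock(mutex_);
        std::swap(front_, back_);
    }

    // UI thread: run the draw callback on the front buffer while holding
    // the lock, so a swap cannot happen in the middle of drawing.
    template <typename Fn>
    void withFront(Fn&& draw) {
        std::lock_guard<std::mutex> lock(mutex_);
        draw(front_);
    }

private:
    std::mutex mutex_;
    std::vector<std::uint32_t> front_;
    std::vector<std::uint32_t> back_;
};
```

The lock is held only for the swap and for the blit, never for the slow drawing. One consequence of swapping instead of copying is that the back buffer holds the frame before last after each publish, so the background thread has to redraw it completely (or copy the front buffer back first).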
I have been working on an animated graphics project with very specific requirements, and after quite a bit of searching and test coding, I have figured that I could take several approaches, but the Khronos and MDN documentation I have been reading coupled with other posts I have seen here don't answer all of my questions regarding my particular project. In the meantime, I have written short test programs (setting infrastructure for testing).
Firstly, I should describe the project:
The main object drawn to the screen is a simple quad surrounded by a black outline (LINE_LOOP or LINES will do, probably, though I have had issues with z-fighting...that will be left for another question). When the user interacts with the program, exactly one new quad is created and immediately drawn, but for a set amount of time its vertices move around until the quad moves to its final destination. (Note that translations won't do.) Random black lines are also drawn, and sometimes those lines also move around.
Once one of the quads reaches its final spot, it never moves again.
A new quad is always atop old quads (closer to the screen). That means that I need to layer the quads and lines from oldest to newest.
*this also means that it would probably be best to assign z-values to each quad and line, even if the graphics are in pixel coordinates and use an orthographic matrix. Would everyone agree with this?
Given these parameters, I have a few options with varying levels of complexity:
1> Take the object-oriented approach and just assign a buffer to each quad, and the same goes for the random lines. --creation and destruction of buffers every frame for the one shape that is moving. I truthfully think that this is a terrible idea that might only work in a higher level library that does heavy optimization underneath. This approach also doesn't take advantage of the fact that almost every quad will stay the same.
[vertices0] ... , [verticesN]
Draw x N (many draws for many small-size buffers)
2> Assign a z-value to each quad, outline, and line (as mentioned above). Allocate a huge vertex buffer and element buffer to store all permanently-in-their-final-positions quads. Resize only in the very unlikely case someone interacts for long enough. Create a second tiny buffer to store the one temporary moving quad and use bufferSubData every frame. When the quad reaches its destination, bufferSubData it into the large buffer and overwrite the small buffer upon creation of the next quad...all on the same frame. The main questions I have here are: is it possible (safe?) to use bufferSubData and draw it on the same frame? Also, would I use DYNAMIC_DRAW on both buffers even though the larger one would see fewer updates?
[permanent vertices ... | uninitialized (keep a count)]
bufferSubData -> [tempVerticesForOneQuad]
Draw 2x
3> Still create the large and small buffers, but instead of using bufferSubData every frame, create a second shader program and add an attribute for the new/moving quad that explicitly sets the vertex positions for the animation (I would pass vertex index attributes). Only draw with the small buffer when the quad is moving. For the frame when the quad reaches its destination, draw both large and small buffer, but then bufferSubData the final coordinates into the large permanent buffer to be used in the next frame.
switchToShaderProgramA();
[permanent vertices...| uninitialized (keep a count)]
switchToShaderProgramB();
[temp vertices] <- shader B accepts indices for each vertex so we can do all animation in the vertex shader
---last frame of movement arrives : bufferSubData into the permanent vertices buffer for when the next quad is created
I get the sense that the third option might be the best, but I would like to learn whether there are some other factors that I did not consider. For example, my assumption that a program switch, additional attributes, and vertex shader manipulation would be faster than just substituting the buffer values as in 2>. The advantage of approach 3> (I think) is that I can defer the buffer substitution to a time when nothing needs to be drawn.
I am still not sure how to work with the randomly-appearing lines. I can't take the "single quad vertex buffer" approach since the number of lines cannot be predicted. Might I also allocate a large buffer for the moving lines? Those also stay after the quad is finished moving, though I don't think that I could use the vertex shader trick because there would be too many attributes to set (as opposed to the 4 for the one quad). I suppose that I could create a large "permanent line data" buffer first, but what to do during the animation is tricky because the lines move. Maybe bufferSubData() + draw on the same frame is not terrible? Or it could be. This is where I need advice.
I understand that this question might not be too specific code-wise, but I don't believe that I would be allowed to show the core of the program. All I have is the typical WebGL boilerplate ready.
I am looking forward to hearing people's thoughts on how I might proceed and whether there are any trade-offs I might have missed when considering the three options above.
Thank you in advance, and please feel free to ask any additional questions if clarification is necessary.
Honestly, for what you're describing, it doesn't sound to me like it matters which you choose. On modern hardware, drawing a few hundred quads and a few thousand lines each frame would not really tax the hardware much.
Having said that, I agree that approach 1 seems very inefficient. Approach 2 sounds perfectly fine. You can safely draw a buffer on the same frame that you uploaded the data. I don't think it matters much whether you use DYNAMIC_DRAW or STATIC_DRAW for the buffer. I tend to think of dynamic buffers as being something you're updating every frame. If you only update it every few seconds or less, then static is fine. Approach 3 is also fine. Between 2 and 3, I'd say do whichever is easier for you to understand and program.
Likewise, for the lines, I would use a separate buffer. It sounds like that one changes per frame, so I would use DYNAMIC_DRAW for that. Allocating a single large buffer for it and performing a glBufferSubData() per frame is probably a fine strategy. As always, trying it and profiling it will tell you for sure.
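For the bookkeeping in approach 2, here is a rough sketch of the CPU side in plain C++ (the same logic ports directly to JavaScript). PermanentQuads and Vec3 are hypothetical names, and the fixed capacity is an assumption. The point is that committing a finished quad only needs a running count, and that count gives you the byte offset to pass as the destination offset of bufferSubData on the large buffer.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Vec3 { float x, y, z; };

constexpr std::size_t kVertsPerQuad = 4;

// Hypothetical CPU-side mirror of the large "permanent" vertex buffer.
class PermanentQuads {
public:
    explicit PermanentQuads(std::size_t maxQuads)
        : verts_(maxQuads * kVertsPerQuad), count_(0) {}

    // Stores a finished quad in the next free slot and returns the byte
    // offset at which the same four vertices belong in the GPU buffer.
    std::size_t commit(const Vec3 (&quad)[kVertsPerQuad]) {
        assert((count_ + 1) * kVertsPerQuad <= verts_.size());
        const std::size_t first = count_ * kVertsPerQuad;
        for (std::size_t i = 0; i < kVertsPerQuad; ++i) {
            verts_[first + i] = quad[i];
        }
        ++count_;
        return first * sizeof(Vec3);
    }

    // Number of quads to draw from the permanent buffer this frame.
    std::size_t quadCount() const { return count_; }

private:
    std::vector<Vec3> verts_;
    std::size_t count_;
};
```

Each frame you would draw quadCount() quads from the large buffer and, while a quad is still moving, one more from the small buffer. Only the frame on which a quad settles touches the large buffer, and then only its own four vertices.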
I'm currently implementing a basic deferred renderer with multithreading in Vulkan. Since my G-Buffer should have the same resolution as the final image I want to do it in a single render-pass with multiple sub-passes, according to this presentation, on slide 44 (page 138). It says:
vkCmdBeginCommandBuffer
vkCmdBeginRenderPass
vkCmdExecuteCommands
vkCmdNextSubpass
vkCmdExecuteCommands
vkCmdEndRenderPass
vkCmdEndCommandBuffer
I get that in the first sub-pass, you iterate the scene graph and record one secondary commandbuffer for each entity/mesh. What I don't get is how you are supposed to do the shading pass with secondary command buffers. Do you somehow split the screen into parts and render each part in a separate thread or just record one secondary commandbuffer for the entire second sub-pass?
As you said, multithreading the command-buffer recording makes sense for the subpass that builds the G-buffer. For the shading pass, it depends on how you do the shading, but in my opinion you do not need to multithread your shading subpasses. You should, however, take into account that you can declare a "by region" dependency (VK_DEPENDENCY_BY_REGION_BIT) between the subpasses.
So, I encourage you to proceed this way.
Before beginning your render pass, use a compute shader to splat all your lights on the screen (this gives you a kind of array of "quads").
By splatting I mean this kind of thing. You have a point light (for example), the idea is to compute the quad in screen space affected by the light. With that you have 4 vertices (that represents a quad) that you put into a SSBO and you can use it as a vertex Buffer in the shading subpass.
Now you begin the render pass.
Multithread the scene graph rendering if needed, and do your vkCmdExecuteCommands().
NextSubpass
Use the "array of quads" you create from the earlier compute shader (do not forget a VK_SUBPASS_EXTERNAL dependency).
NextSubpass and so on
However, you said
you iterate the scene graph and record one secondary commandbuffer for each entity/mesh.
I am not sure I really understand what you meant, but if you intend to have one secondary command buffer per mesh, I really advise you to change your approach. You should use batching. Let's say you have 64 000 different meshes to draw. You could, for example, create 64 command buffers (that you dispatch on 4 threads), with each command buffer drawing 1000 meshes. (The numbers are picked arbitrarily, so profile your application.)
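The batching itself is just index arithmetic, independent of Vulkan. A small sketch, with hypothetical names (Batch, SplitIntoBatches): each batch is a contiguous range of meshes that one secondary command buffer would record, and the remainder is spread over the first batches so no batch is more than one mesh larger than another.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One batch = the range of meshes recorded into one secondary command buffer.
struct Batch {
    std::size_t firstMesh;
    std::size_t meshCount;
};

// Splits meshCount meshes into at most batchCount contiguous batches.
// Empty batches are never produced, so fewer batches come back when
// there are fewer meshes than requested batches.
std::vector<Batch> SplitIntoBatches(std::size_t meshCount, std::size_t batchCount)
{
    assert(batchCount > 0);
    std::vector<Batch> batches;
    const std::size_t base = meshCount / batchCount;
    const std::size_t extra = meshCount % batchCount;

    std::size_t first = 0;
    for (std::size_t i = 0; i < batchCount && first < meshCount; ++i) {
        const std::size_t n = base + (i < extra ? 1 : 0);
        batches.push_back({first, n});
        first += n;
    }
    return batches;
}
```

With the numbers above, SplitIntoBatches(64000, 64) yields 64 batches of 1000 meshes each. Worker thread t of 4 could then record batches t, t + 4, t + 8 and so on, each thread using its own command pool.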
So, to answer your question about the shading subpass: I would not use secondary command buffers there, or only very few (one per kind of light: punctual, directional).
What I don't get is how you are supposed to do the shading pass with secondary command buffers.
The shading pass (presumably the second subpass) would take the G-buffers created by the first subpass as input attachments. Then it would draw a screen-sized quad using data from the G-buffers plus a set of lights (or whatever your deferred shader tries to defer).
The presentation you link tries to hint at this structure style starting at page 13 (marked "Page 107").
The first step would be to get it working. Use e.g. this SW example. Then the next step of optimizing it into a single render pass should be easier.
I have an application which renders many filled polygons with OpenGL, in 2D. Filling is done by tessellation, but performance is not optimal. 1900 polygons made up of 122000 vertices (that is, about 64 vertices per polygon) are displayed in about 3 seconds.
Apparently, the CPU is not the bottleneck, as if I replace calls to gluTessVertex by calls to glColor - just to test where is the bottleneck, performance is doubled.
I have the same problem with loading many small textures.
Now, what are the options to improve the performance? It seems that most of the time is spent in the geometry subsystem. Rendering is fast enough.
I already have a worker thread which does the load (so tessellation, texture binding) in one context, and another thread which does the draw in another context. The two contexts share objects via wglShareLists and it works like a charm.
Can I have a third thread in a third context which would also handle tessellation for half of the polygons? Has anyone tried that? Is it safe? Any example of sharing objects between three contexts?
Forgot to say, I have an ATI Radeon HD 4550 graphics card, suppose it can handle more than 39kB/s of data.
Increase Performance
Sounds like you're using the old fixed-function pipeline.
If you're unsure of what that is, well, the following functions are a part of the fixed-function pipeline.
glBegin()
glEnd()
glVertex*()
glTexCoord*()
glNormal*()
glColor*()
etc.
Those functions are old and render geometry in immediate mode. That means that each time you call the above functions, that geometry gets sent to the GPU. By doing that many times per frame, you can easily make the FPS drop well below 60 just by rendering simple things.
Instead you need to use buffers, or to be more precise VAOs with/or VBOs (and IBOs).
A VBO, or Vertex Buffer Object, is a buffer which stores vertices that you can then render. This is much faster and better to use than glBegin() and glEnd(). When you create a VBO you supply it with vertices, and they only need to be sent to the GPU once. That's basically why VBOs are fast: the data is already on the GPU and can be rendered with a single draw call instead of many function calls.
The reason I said "with/or" is because in the newer versions you need to create a VAO which then would use a VBO, where before you could simply render the VBOs.
Tessellation
There are multiple ways to do tessellation and things which look like/would give the effect of tessellation.
For instance, you could simply render different models according to the required LOD (Level of Detail): when you're up close to an object, you render the model with all its details, which probably has a high vertex count. Then the further away you are from the model, you render another version of that model which has fewer vertices and therefore less detail. You can't really do that on something like terrain, though, and you definitely shouldn't do it on something like dynamic and/or procedurally generated terrain.
You can also do actual geometry tessellation and you would do that through a Shader. Since tessellation is a really huge topic I will provide you with 2 urls which both explain and have code on them.
Both of these articles use modern OpenGL (4.0 and above).
http://prideout.net/blog/?p=48
http://antongerdelan.net/opengl/tessellation.html
Texturing
Generating and binding textures are still the same.
Instead of using gluBuild2DMipmaps() you can use glGenerateMipmap(GL_TEXTURE_2D); it was added in OpenGL version 3.0'ish if I remember correctly.
Again, you can (and should) replace all your glBegin() - glEnd() calls (and everything in between) with VAOs and VBOs. You can store everything you want inside a buffer: vertices, texture coordinates, normals, colors, etc. You can store these in separate buffers, or you can store them inside a single buffer, usually called an Interleaved Buffer or Interleaved VBO.
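To make "interleaved" concrete, here is a small CPU-side sketch with no OpenGL calls in it. The Vertex layout and the Interleave function are hypothetical, picked only for illustration: position, texture coordinate and color for one vertex sit next to each other in memory, and the resulting array is what you would hand to glBufferData.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical interleaved layout: all attributes of one vertex are adjacent.
struct Vertex {
    float position[3];      // byte offset 0
    float texCoord[2];      // byte offset 12
    std::uint8_t color[4];  // byte offset 20
};

// Packs three separate attribute arrays into one interleaved array.
std::vector<Vertex> Interleave(const std::vector<float>& positions,       // 3 per vertex
                               const std::vector<float>& texCoords,       // 2 per vertex
                               const std::vector<std::uint8_t>& colors)   // 4 per vertex
{
    const std::size_t n = positions.size() / 3;
    assert(positions.size() == n * 3);
    assert(texCoords.size() == n * 2);
    assert(colors.size() == n * 4);

    std::vector<Vertex> out(n);
    for (std::size_t i = 0; i < n; ++i) {
        for (std::size_t c = 0; c < 3; ++c) out[i].position[c] = positions[i * 3 + c];
        for (std::size_t c = 0; c < 2; ++c) out[i].texCoord[c] = texCoords[i * 2 + c];
        for (std::size_t c = 0; c < 4; ++c) out[i].color[c] = colors[i * 4 + c];
    }
    return out;
}
```

When describing this buffer with glVertexAttribPointer, the stride for every attribute is sizeof(Vertex) and the pointer argument is the attribute's byte offset, e.g. offsetof(Vertex, texCoord).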
You wouldn't be needing glEnable(GL_TEXTURE_2D) and glDisable(GL_TEXTURE_2D) anymore, because you do that within a Shader, you bind textures and use them in a Shader, and since you create the Shader Program you can make it act however you want it to.
I am using Direct3D to display a number of I-sections used in steel construction. There could be hundreds of instances of these I-sections all over my scene.
I could do this two ways:
Using method A, I have fewer surfaces. However, with backface culling turned on, the surfaces will be visible from only one side. If backface culling is turned off, then the flanges (horizontal plates) and web (vertical plate) may be rendered in the wrong order.
Method B seems correct (and I could keep backface culling turned on), but in my model the thickness of plates in the I-section is of no importance and I would like to avoid having to create a separate triangle strip for each side of the plates.
Is there a better solution? Is there a way to switch off backface culling for only certain calls of DrawIndexedPrimitives? I would also like a platform-neutral answer to this, if there is one.
First off, backface culling doesn't have anything to do with the order in which objects are rendered. Other than that, I'd go for approach B for no particular reason other than that it'll probably look better. Also, this object probably isn't more than a handful of triangles; having hundreds in a scene shouldn't be an issue. If it is, try looking into hardware instancing.
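In OpenGL, face culling is render state that you can change between draw calls (not per triangle), so you can switch it off just for the I-sections:
glDisable(GL_CULL_FACE);   // draw both sides of the plates
// ... draw the I-sections here ...
glEnable(GL_CULL_FACE);    // restore culling for everything else
glCullFace(GL_BACK);       // or GL_FRONT
Direct3D 9 has the same thing: set the D3DRS_CULLMODE render state to D3DCULL_NONE with SetRenderState before the draw call, and restore it afterwards (D3DCULL_CCW is the default).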
If your I-sections don't change that often, load all the sections into one big vertex/index buffer and draw them with a single call. That's the most performant way to draw things, and the graphics card will do a fast job even if you push half a million triangles to it.
Yes, this requires that you duplicate the vertex data for all sections, but that's how D3D9 is intended to be used.
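Building that one big buffer is mostly a matter of rebasing indices. A minimal CPU-side sketch, with hypothetical types (Vec3, Mesh) and under the assumption that instances differ only by translation: every copy of the section gets its own vertices, and its indices are shifted by the number of vertices already in the merged buffer.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

struct Vec3 { float x, y, z; };

struct Mesh {
    std::vector<Vec3> vertices;
    std::vector<std::uint32_t> indices;
};

// Appends one translated copy of `section` per entry in `positions`
// and rebases each copy's indices so they point at its own vertices.
Mesh MergeInstances(const Mesh& section, const std::vector<Vec3>& positions)
{
    Mesh merged;
    merged.vertices.reserve(section.vertices.size() * positions.size());
    merged.indices.reserve(section.indices.size() * positions.size());

    for (const Vec3& p : positions) {
        const auto base = static_cast<std::uint32_t>(merged.vertices.size());
        for (const Vec3& v : section.vertices) {
            merged.vertices.push_back({v.x + p.x, v.y + p.y, v.z + p.z});
        }
        for (std::uint32_t i : section.indices) {
            assert(i < section.vertices.size());
            merged.indices.push_back(base + i);
        }
    }
    return merged;
}
```

The merged arrays are what you would copy into the vertex and index buffers once, then draw with a single call. The sketch uses 32-bit indices, which is an assumption: with 16-bit indices (D3DFMT_INDEX16) one buffer can address at most 65536 vertices, so you would either split into a few buffers or check that the device supports D3DFMT_INDEX32.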
I would go with A: at the distance from which you would be viewing these sections, drawing all the extra triangles needed for B would be a waste of processing power.
Also I would simply fire them at a z-buffer and allow that to sort it all out.
If it gets too slow then I would start looking at optimizing, but even consumer graphics cards can draw millions of polygons per second.