Fastest way to deal with many sprites in bevy-engine

Fastest way to deal with many sprites in bevy-engine - rust

I am building a Cellular Automata visualization "game" with Rust and the BevyEngine. Currently, when initializing the world, I spawn a sprite for every cell. Within each update the sprites color is changed according to wether it is alive or dead.
for (grid_pos, mut color_mat) in query.iter_mut() {
let cell_state = world.0[grid_pos.1][grid_pos.0];
match cell_state {
0 => *color_mat = materials.dead_cell.clone(),
1 => *color_mat = materials.alive_cell.clone(),
_ => (),
}
}
The thing is, when dealing with a larger and larger map, the number of sprites gets very high. So I wondered if it might be faster when I only spawn a sprite in the position of a living cell and remove it when the cell dies.
So my question is: Is it faster if I spawn all a sprite for every grid position OR is the better way to only spawn a sprite when a cell is alive and remove it when the cell dies.

I'm not familiar with Bevy's performance characteristics, but it is generally the case that to get good performance for drawing a cellular automaton, you don't want to be storing "sprite" objects for every cell, but an array of just the cell states, and drawing them in some kind of efficient batch form.
The state of of a cellular automaton is basically an image (possibly one with rather chunky pixels). So, treat it that way: copy the cell states into a texture and then draw the texture. If you want the cells to appear fancier than little squares, program the fragment shader to do that. (For example, you can read the CA-state texture and use it to decide which of several other textures to draw within the bounds of one square.)
This may or may not be necessary to get acceptable performance for your use case, but it will certainly be faster than managing individual sprites.

Related

Ways of drawing vertex ranges in Direct3D

Suppose I have a one big shader program and want only specific ranges of vertices/triangles rendered.
What are some performant ways of doing this? Which one looks most promising?
I came up with 3 methods, are there some more?
Batching draw calls Draw(position, count) using command lists. (Opengl has glMultiDrawArrays.)
If the ranges are known ahead of time, we don't need to worry about the time spent constructing them. However, since some ranges change unpredictably, it is probably unrealistic to keep command lists for all possibilities.
This method obviously reduces the time spent constructing the draw calls on CPU side and I assume that these calls already tell the GPU that it does not need to do any state changes.
Call Draw on the whole buffer and keep updating a boolean per-vertex float buffer which would just multiply the output positions by 0/1 in inactive/active ranges.
The benefit here is only one draw call. However, we need to update the buffer and it seems like the buffer needs to be locked while we update it...
Powerset defined in a constant buffer. For n ranges, we can use one n-bit constant mask compared against a static n-bit per_vertex_mask. Vertex is visible if mask & per_vertex_mask != 0.
Constant buffers are probably cheaper to update than a whole vertex buffer. The number of ranges may however be too large for this method.

Efficient 2D rendering with Glium

I'm using Glium to do rendering for an emulator I'm writing. I've pieced together something that works (based on this example) but I suspect that it's pretty inefficient. Here's the relevant function:
fn update_screen(display: &Display, screen: &Rc<RefCell<NesScreen>>) {
let target = display.draw();
// Write screen buffer
let borrowed_scr = screen.borrow();
let mut buf = vec![0_u8; 256 * 240 * 3];
buf.clone_from_slice(&borrowed_scr.screen_buffer[..]);
let screen = RawImage2d::from_raw_rgb_reversed(buf, SCREEN_DIMENSIONS);
glium::Texture2d::new(display, screen)
.unwrap()
.as_surface()
.fill(&target, MagnifySamplerFilter::Nearest);
target.finish().unwrap();
}
At a high level, this is what I'm doing:
Borrow NesScreen which contains the screen buffer, which is an array.
Clone the screen buffer into a vector
Create a texture from the vector data and render it
My suspicion is that cloning the entire screen buffer via clone_from_slice is really inefficient. The RawImage2d::from_raw_rgb_reversed function takes ownership of the vector passed into it, so I'm not sure how to do this in a way that avoids the clone.
So, two questions:
Is this actually inefficient? I don't have enough experience rendering stuff to know intuitively.
If so, is there a more efficient way to do this? I've scoured Glium quite a bit but there isn't much specific to 2D rendering.

This won't be a very good answer, but maybe a few things here could help you.
First of all: is this really inefficient? That's really hard to say, especially the OpenGL part, as OpenGL performance depends a lot on when synchronization is required/requested.
As for the cloning of the screen buffer: you are merely copying 180kb, which is not too much. I quickly benchmarked it on my machine and cloning a 180kb vector takes around 5µs, which is really not a lot.
Note that you can create a RawImage2d without using a method, because all fields are public. This means that you can avoid the simple 5µs clone if you create a reversed vector yourself. However, reversing the vector with the method glium uses is a lot slower than just cloning the vector; on my machine it takes 170µs for a vector of the same length. This is probably still tolerable if you just want to achieve 60fps = 17ms per frame, but still not very nice.
You could think about using the correct row ordering in your original array to avoid this problem. OR you could, instead of directly copying the texture to the framebuffer, just draw a fullscreen quad (one vertex for each screen corner) with the texture on it. Sure, then you need a mesh, a shader and all that stuff, but you could just "reverse" the image by tweaking the texture coordinates.
Lastly, I unfortunately don't know a lot about the time the GPU takes to execute the OpenGL commands. I'd guess that it's not optimal because OpenGL doesn't have a lot of room to schedule your commands, but has to execute them right away (forced synchronization). But maybe that's not avoidable in your case.

UAV counter indices used across multiple shaders?

I've been trying to implement a Compute Shader based particle system.
I have a compute shader which builds a structured buffer of particles, using a UAV with the D3D11_BUFFER_UAV_FLAG_COUNTER flag.
When I add to this buffer, I check if this particle has any complex behaviours, which I want to filter out and perform in a separate compute shader. As an example, if the particle wants to perform collision detection, I add its index to another structured buffer, also with the D3D11_BUFFER_UAV_FLAG_COUNTER flag.
I then run a second compute shader, which processes all the indices, and applies collision detection to those particles.
However, in the second compute shader, I'd estimate that about 5% of the indices are wrong - they belong to other particles, which don't support collision detection.
Here's the compute shader code that perfroms the list building:
// append to destination buffer
uint dstIndex = g_dstParticles.IncrementCounter();
g_dstParticles[ dstIndex ] = particle;
// add to behaviour lists
if ( params.flags & EMITTER_FLAG_COLLISION )
{
uint behaviourIndex = g_behaviourCollisionIndices.IncrementCounter();
g_behaviourCollisionIndices[ behaviourIndex ] = dstIndex;
}
If I split out the "add to behaviour lists" bit into a separate compute shader, and run it after the particle lists are built, everything works perfectly. However I think I shouldn't need to do this - it's a waste of bandwidth going through all the particles again.
I suspect that IncrementCounter is actually not guaranteed to return a unique index into the UAV, and that there is some clever optimisation going on that means the index is only valid inside the compute shader it is used in. And thus my attempt to pass it to the second compute shader is not valid.
Can anyone give any concrete answers to what's going on here? And if there's a way for me to keep the filtering inside the same compute shader as my core update?
Thanks!

IncrementCounter is an atomic operation and so will (driver/hardware bugs notwithstanding) return a unique value to each thread that calls it.
Have you thought about using Append/Consume buffers for this, as it's what they were designed for? The first pass simply appends the complex collision particles to an AppendStructuredBuffer and the second pass consumes from the same buffer but using a ConsumeStructuredBuffer view instead. The second run of compute will need to use DispatchIndirect so you only run as many thread groups as necessary for the number in the list (something the CPU won't know).
The usual recommendations apply though, have you tried the D3D11 Debug Layer and running it on the reference device to be sure it isn't a driver issue?

Re-positioning a Rigid Body in Bullet Physics

I am writing a character animation rendering engine that uses Bullet Physics as a physics simulation engine.
A sequence will start out with no model on the screen, then an animation will be assigned to that model, the model will be moved to frame 0 of the animation, and the engine will begin rendering the model with the animation.
What is the correct way to re-position the rigid bodies on the character model when it is initialized at frame 0?
Currently I am using this code, which is called immediately after the animation is assigned to the model and the bones are moved to the frame 0 position:
_world->removeRigidBody(_body);
bool k = (_type == Kinematics);
_body->setCollisionFlags(_body->getCollisionFlags() & ~btCollisionObject::CF_NO_CONTACT_RESPONSE);
btTransform tr = BulletPhysics::ConvertD3DXMatrix(&(_bone->getCombinedTrans()));
tr *= _trans;
_body->setCenterOfMassTransform(tr);
_body->clearForces();
_body->setLinearVelocity(btVector3(0,0,0));
_body->setAngularVelocity(btVector3(0,0,0));
_world->addRigidBody(_body, _groupID, _groupMask);
The issue is that sometimes this works, and other times not. For an example, take a skirt of a model. Sometimes it will show up in the natural position, other times slightly misaligned and it will fall into place, and other times it shows up completely clipped through the body, as if collision was turned off and some force pushed it in that direction. This does make sense most of the time, because in the test animation I am using the model's initial position is in the center of the screen, but the animation starts off the left side of the screen. Does anyone know how to solve this?
I know the bones on the skirt are not the problem, because I turned off physics and forced it to manually update the bone positions each frame, and everything was in the correct positions throughout the entire animation.
EDIT: I also have constraints, might that be what's causing this?

Here is my reposition method that does exactly this.
void LimbBt::reposition(btVector3 position,btVector3 orientation) {
btTransform initialTransform;
initialTransform.setOrigin(position);
initialTransform.setRotation(orientation);
mBody->setWorldTransform(initialTransform);
mMotionState->setWorldTransform(initialTransform);
}
The motion state mMotionState is the motion state you created for the btRigidBody in the beginning. Just add your clearForces() and velocities to it to stop the body from moving on from the new position as if it went through a portal. That should do it. It works nicely with me here.
Edit: The constraints will adapt if you reposition all rigidbodies correctly. For that purpose, it is easy to calculate the relative position and reposition the whole constrained rigidbody construct according to that. If you do it incorrectly, you will get severe twitching, as the constraints will try to adjust you construct numerically, causing high forces if the constraint gaps are large.
Edit2: Another issue is that if you need deterministic behavior (every time you reset your bodies, they should fall exactly the same), then you will have to kill your old dynamicsWorld, recreate it and add all the bodies again. The world stores some information about the bodies that just can not be cleared for now. This might change in the future as bullet4 is going to support deterministic resets. But for now, if you do experiments with deterministic resets, you need to drop the world and recreate it.
source: discussion with Erwin Coumans, the developer of Bullet Physics.

I can't tell you what causes the unusual outcome when moving rigid bodies but I can definitely sympathize!
There are three things you'll need to do in order to solve this:
Convert your rigid bodies to kinematic ones
Adjust the World Transform of the bodies motion state and NOT the rigid body
Convert the kinematic body back to a rigid body

A short tested code snippet effectively teleporting a rigid body by updating its motion state to its new position and orientation, plus nullifying all velocities and forces acting upon it.
void teleport(btVector3 position, btQuaternion& orientation) const {
btTransform transform;
transform.setIdentity();
transform.setOrigin(position);
transform.setRotation(orientation);
m_rigidBodyVehicle->setWorldTransform(transform);
m_rigidBodyVehicle->getMotionState()->setWorldTransform(transform);
m_rigidBodyVehicle->setLinearVelocity(btVector3(0.0f, 0.0f, 0.0f));
m_rigidBodyVehicle->setAngularVelocity(btVector3(0.0f, 0.0f, 0.0f));
m_rigidBodyVehicle->clearForces();
}

How would I have to imagine pixel-based rendering in Haskell?

Imagine an imperative rendering engine that blits sprites to a bitmap that later gets displayed. This heavily relies on the ability to efficiently mutate individual pixels in said bitmap. How would I do such a thing an a language without side effects? I guess a completely different data structure is called for?

You can convert any algorithm that uses mutable state into an algorithm that "strings" the state along with it. Haskell provides a way of doing this such that it still feels like imperative programming with the state Monad.
Although, it seems to me that the basic blit operation could be done in a more functional style. You are basically combining two bitmaps to produce a new bitmap via pixel by pixel operation. That sounds very functional to me.
High quality imperative code is often faster than good functional code, but if you are willing to give up a little speed you can normally create very nice architectures in a pure functional style

Haskell has side effects, and you should use them whenever they're appropriate. A high-speed blit routine that's going to be in your inner loop (and therefore is performance-critical) is certainly one place that mutation is appropriate, so use it! You have a couple of options:
Roll your own in Haskell, using ST(U)Array or IO(U)Array. Not recommended.
Roll your own in C, and call it with the FFI. Not recommended.
Use one of the many graphics toolkits that offers this kind of operation already, and has hundreds of programmer hours spent on making a good interface with high performance, such as Gtk or OpenGL. Highly recommended.
Enjoy!

A natural functional way of representing an image is by using the index function:
Image :: (Int,Int) -> Color
With this representation, blitting an area from one image to another would be achieved with
blit area a b = \(x,y) -> if (x,y) `isInsideOf` area then a (x,y) else b (x,y)
If translation or another transformation is required, it can be directly applied to the coordinates:
translate (dx,dy) image = \(x,y) -> b (x+dx,y+dy)
This representation gives you natural way of working with image points. You can, for example, easily work with non-rectangular areas, and do tricks like making image interpolation as separate function instead of being part of your usual image scaling algorithms:
quadraticInterpolation :: ((Int,Int) -> Color) -> ((Double,Double) -> Color)
The performance might suffer in some cases, such as when you blit multiple images into one and then do calculations with the result. This results in a chain of tests for each pixel for each successive calculation. However, by applying memoization, we can temporarily render the functional representation into an array and transform that back to it's index function, thus eliminating the performance hit for the successive operations.
Note that the memoization can also be used to introduce parallelism to the process.

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string