Efficient 2D rendering with Glium

I'm using Glium to do rendering for an emulator I'm writing. I've pieced together something that works (based on this example) but I suspect that it's pretty inefficient. Here's the relevant function:
fn update_screen(display: &Display, screen: &Rc<RefCell<NesScreen>>) {
    let target = display.draw();

    // Write screen buffer
    let borrowed_scr = screen.borrow();
    let mut buf = vec![0_u8; 256 * 240 * 3];
    buf.clone_from_slice(&borrowed_scr.screen_buffer[..]);
    let screen = RawImage2d::from_raw_rgb_reversed(buf, SCREEN_DIMENSIONS);

    glium::Texture2d::new(display, screen)
        .unwrap()
        .as_surface()
        .fill(&target, MagnifySamplerFilter::Nearest);

    target.finish().unwrap();
}
At a high level, this is what I'm doing:
1. Borrow NesScreen, which contains the screen buffer (an array).
2. Clone the screen buffer into a vector.
3. Create a texture from the vector data and render it.
My suspicion is that cloning the entire screen buffer via clone_from_slice is really inefficient. The RawImage2d::from_raw_rgb_reversed function takes ownership of the vector passed into it, so I'm not sure how to do this in a way that avoids the clone.
So, two questions:
1. Is this actually inefficient? I don't have enough experience with rendering to know intuitively.
2. If so, is there a more efficient way to do it? I've scoured Glium quite a bit, but there isn't much specific to 2D rendering.

This won't be a very good answer, but maybe a few things here could help you.
First of all: is this really inefficient? That's hard to say, especially for the OpenGL part, as OpenGL performance depends a lot on when synchronization is required or requested.
As for the cloning of the screen buffer: you are merely copying about 180 KB (256 × 240 × 3 bytes), which is not much. I quickly benchmarked it on my machine, and cloning a 180 KB vector takes around 5µs, which is really not a lot.
Note that you can create a RawImage2d without going through a constructor function, because all of its fields are public. This means you can avoid the 5µs clone by creating a reversed vector yourself. However, reversing the vector the way glium does is a lot slower than just cloning it; on my machine it takes 170µs for a vector of the same length. That is probably still tolerable if you just want to hit 60fps = 17ms per frame, but it's still not very nice.
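For illustration, a minimal sketch of building one by hand from a borrowed slice. This assumes the emulator writes the buffer bottom-to-top (OpenGL's row order), so that neither the clone nor the reversal is needed:

use std::borrow::Cow;
use glium::texture::{ClientFormat, RawImage2d};

// Borrow the screen buffer directly instead of cloning it.
let raw = RawImage2d {
    data: Cow::Borrowed(&borrowed_scr.screen_buffer[..]),
    width: 256,
    height: 240,
    format: ClientFormat::U8U8U8,
};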
You could avoid the reversal altogether by using the correct row ordering in your original array. Or you could, instead of copying the texture directly to the framebuffer, just draw a fullscreen quad (one vertex for each screen corner) with the texture on it. Sure, then you need a mesh, a shader and all that stuff, but you could "reverse" the image just by tweaking the texture coordinates, as sketched below.
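A sketch of the vertex data for such a quad (shader and draw call omitted; implement_vertex! is glium's macro for generating the vertex bindings):

#[derive(Copy, Clone)]
struct Vertex {
    position: [f32; 2],
    tex_coords: [f32; 2],
}
implement_vertex!(Vertex, position, tex_coords);

// A triangle strip covering the whole screen; the v coordinate is flipped so
// the texture comes out "reversed" without touching the pixel data.
let quad = [
    Vertex { position: [-1.0, -1.0], tex_coords: [0.0, 1.0] },
    Vertex { position: [ 1.0, -1.0], tex_coords: [1.0, 1.0] },
    Vertex { position: [-1.0,  1.0], tex_coords: [0.0, 0.0] },
    Vertex { position: [ 1.0,  1.0], tex_coords: [1.0, 0.0] },
];
let vertex_buffer = glium::VertexBuffer::new(display, &quad).unwrap();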
Lastly, I unfortunately don't know a lot about how long the GPU takes to execute the OpenGL commands. I'd guess it's not optimal, because OpenGL doesn't have much room to schedule your commands and has to execute them right away (forced synchronization). But maybe that's not avoidable in your case.

Related

Fastest way to deal with many sprites in bevy-engine

I am building a Cellular Automata visualization "game" with Rust and the BevyEngine. Currently, when initializing the world, I spawn a sprite for every cell. On each update, the sprite's color is changed according to whether the cell is alive or dead.
for (grid_pos, mut color_mat) in query.iter_mut() {
    let cell_state = world.0[grid_pos.1][grid_pos.0];
    match cell_state {
        0 => *color_mat = materials.dead_cell.clone(),
        1 => *color_mat = materials.alive_cell.clone(),
        _ => (),
    }
}
The thing is, when dealing with a larger and larger map, the number of sprites gets very high. So I wondered if it might be faster to only spawn a sprite at the position of a living cell and remove it when the cell dies.
So my question is: is it faster to spawn a sprite for every grid position, or is it better to only spawn a sprite when a cell is alive and remove it when the cell dies?
I'm not familiar with Bevy's performance characteristics, but it is generally the case that to get good performance for drawing a cellular automaton, you don't want to be storing "sprite" objects for every cell, but an array of just the cell states, and drawing them in some kind of efficient batch form.
The state of a cellular automaton is basically an image (possibly one with rather chunky pixels). So, treat it that way: copy the cell states into a texture and then draw the texture. If you want the cells to appear fancier than little squares, program the fragment shader to do that. (For example, you can read the CA-state texture and use it to decide which of several other textures to draw within the bounds of one square.)
This may or may not be necessary to get acceptable performance for your use case, but it will certainly be faster than managing individual sprites.
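I'm not familiar enough with Bevy's texture API to give exact calls, but the packing step itself is plain Rust. A sketch (grid_to_rgba is my own name; I'm assuming the nested Vec of u8 cell states from your question):

// Pack the grid of cell states into an RGBA8 pixel buffer, one pixel per cell.
// Upload the result as a texture each frame instead of mutating sprites.
fn grid_to_rgba(world: &[Vec<u8>]) -> Vec<u8> {
    let mut pixels = Vec::with_capacity(world.len() * world[0].len() * 4);
    for row in world {
        for &cell in row {
            let shade = if cell == 1 { 0xFF } else { 0x00 };
            pixels.extend_from_slice(&[shade, shade, shade, 0xFF]);
        }
    }
    pixels
}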

How do I rotate an object so that it's always facing the mouse position?

I'm using ggez to make a game with some friends, and I'm trying to have our character rotate to face the pointer at all times. I know so far that I need an angle value (f32) in radians, and I think I can use atan2 to get it, but I just don't get the behavior that I want.
This is the code I have: (btw, move_data is a struct holding our player character's values, such as position, velocity, angle and rotation speed).
let m = mouse::position(ctx);
move_data.angle = ((m.y - move_data.position.y).atan2(move_data.position.x - m.x)) * (consts::PI / 2.0) as f32;
I think I'm close, as this already rotates the character, but only in a sort of "incomplete" way: the player character (pc) can mostly only face the upper-left corner, when I move the mouse there. If the pointer is to the right of and/or below the pc, it rotates only slowly and slightly, and stops facing the pointer. I don't know if this description makes sense.
I think the problem is that I'm not entirely sure what atan2 is doing in the first place (I only remember some basic trigonometry), and I am also not sure if I'm using it correctly, so I don't exactly know what my code is doing. (Here is the documentation I used for atan2: https://doc.rust-lang.org/std/primitive.f64.html#method.atan2)
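If I'm reading the docs right, the usual pattern seems to be atan2(dy, dx) on the vector from the player to the mouse, with no extra scaling, something like the sketch below, though I haven't gotten a variant of it to behave yet:

let m = mouse::position(ctx);
// atan2(dy, dx): angle of the player-to-mouse vector, measured from the
// positive x-axis (in screen coordinates, y grows downward).
let dx = m.x - move_data.position.x;
let dy = m.y - move_data.position.y;
move_data.angle = dy.atan2(dx); // already in radians, no scaling needed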
I've gotten only this far after much trial and error: Googling as much as I can (mostly Unity tutorials showed up when looking for algorithms to base my code on) and asking in the unofficial Rust community Discord server, but nothing so far has worked.
I also had this code earlier, but couldn't find how to make it work either.
let m = mouse::position(ctx); // Type Point2
let mouse_pos = Vector2::new(m.x, m.y); // Transformed to Vector2 to be read by Matrix
move_data.angle = Matrix::angle(&mouse_pos, &move_data.position);

Haskell IdleCallback too slow

I just started designing some graphics in haskell. I want to create an animated picture with a rotating sphere, so I created an IdleCallback function to constantly update the angle value:
idle :: IORef GLfloat -> IdleCallback
idle angle = do
  a <- get angle
  angle $= a + 1
  postRedisplay Nothing
I'm adding 1 to the angle each time because I want my sphere to rotate smoothly, rather than jump from here to there. The problem is that now it rotates too slowly. Is there a way to keep the rotation smooth and make it faster?
Thanks a lot!
There's not a lot to go on here. I don't see an explicit delay anywhere, so I'm guessing it's slow just because of how long it takes to update?
It also doesn't look explicitly recursive, so it seems like the problem is outside the scope of this snippet.
Also I don't know which libraries you may be using.
In general, though, that IORef makes me unhappy.
While global variables may be common in other languages, IORefs in Haskell have their place but are often a bad sign, and even in another language I don't think I'd do this with a global variable.
If you want to do time-updating things in Haskell, one "common" approach is to use a Functional Reactive Programming library.
They are built to have chains of functions that trigger off of a signal coming from outside, modifying the state of something, which eventually renders an output.
I've used them in the past for (simple) games, and in your case you could construct a system that is fed a clock signal 24 times per second, or whatever, and uses that to update the counter and yield a new image to blit.
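Even without FRP, the core fix is to scale the increment by elapsed time instead of adding a constant per callback. Assuming GLUT (which IdleCallback and postRedisplay suggest), a sketch with an extra IORef to remember the previous timestamp:

idle :: IORef GLfloat -> IORef Int -> IdleCallback
idle angle lastMs = do
  t0 <- get lastMs
  t1 <- get elapsedTime                  -- milliseconds since GLUT started
  lastMs $= t1
  let dt = fromIntegral (t1 - t0) / 1000 -- seconds since the last idle call
  angle $~! (+ 90 * dt)                  -- rotate at 90 degrees per second
  postRedisplay Nothing

That way the rotation speed is fixed in wall-clock terms, no matter how often the idle callback fires.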
My answer is kind of vague, but the question is a little vague too, so hopefully I've at least given you something to look into.

UAV counter indices used across multiple shaders?

I've been trying to implement a Compute Shader based particle system.
I have a compute shader which builds a structured buffer of particles, using a UAV with the D3D11_BUFFER_UAV_FLAG_COUNTER flag.
When I add to this buffer, I check if this particle has any complex behaviours, which I want to filter out and perform in a separate compute shader. As an example, if the particle wants to perform collision detection, I add its index to another structured buffer, also with the D3D11_BUFFER_UAV_FLAG_COUNTER flag.
I then run a second compute shader, which processes all the indices, and applies collision detection to those particles.
However, in the second compute shader, I'd estimate that about 5% of the indices are wrong - they belong to other particles, which don't support collision detection.
Here's the compute shader code that performs the list building:
// append to destination buffer
uint dstIndex = g_dstParticles.IncrementCounter();
g_dstParticles[ dstIndex ] = particle;

// add to behaviour lists
if ( params.flags & EMITTER_FLAG_COLLISION )
{
    uint behaviourIndex = g_behaviourCollisionIndices.IncrementCounter();
    g_behaviourCollisionIndices[ behaviourIndex ] = dstIndex;
}
If I split out the "add to behaviour lists" bit into a separate compute shader, and run it after the particle lists are built, everything works perfectly. However I think I shouldn't need to do this - it's a waste of bandwidth going through all the particles again.
I suspect that IncrementCounter is actually not guaranteed to return a unique index into the UAV, and that there is some clever optimisation going on that means the index is only valid inside the compute shader it is used in. And thus my attempt to pass it to the second compute shader is not valid.
Can anyone give a concrete answer to what's going on here, and whether there's a way for me to keep the filtering inside the same compute shader as my core update?
Thanks!
IncrementCounter is an atomic operation and so will (driver/hardware bugs notwithstanding) return a unique value to each thread that calls it.
Have you thought about using Append/Consume buffers for this, as it's what they were designed for? The first pass simply appends the complex collision particles to an AppendStructuredBuffer and the second pass consumes from the same buffer but using a ConsumeStructuredBuffer view instead. The second run of compute will need to use DispatchIndirect so you only run as many thread groups as necessary for the number in the list (something the CPU won't know).
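A sketch of how that might look on the HLSL side (buffer names are mine):

// First pass: append the index of each particle that needs collision handling.
AppendStructuredBuffer<uint> g_collisionIndices;

if ( params.flags & EMITTER_FLAG_COLLISION )
{
    g_collisionIndices.Append( dstIndex );
}

// Second pass: bind the same resource with a consume view and pull indices out.
// On the CPU, CopyStructureCount fills the indirect-args buffer that
// DispatchIndirect reads, so only as many threads as needed are launched.
ConsumeStructuredBuffer<uint> g_collisionIndicesConsume;

uint particleIndex = g_collisionIndicesConsume.Consume();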
The usual recommendations apply though, have you tried the D3D11 Debug Layer and running it on the reference device to be sure it isn't a driver issue?

How would I have to imagine pixel-based rendering in Haskell?

Imagine an imperative rendering engine that blits sprites to a bitmap that later gets displayed. This heavily relies on the ability to efficiently mutate individual pixels in said bitmap. How would I do such a thing in a language without side effects? I guess a completely different data structure is called for?
You can convert any algorithm that uses mutable state into an algorithm that "strings" the state along with it. Haskell provides a way of doing this such that it still feels like imperative programming: the State monad.
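For instance, a toy sketch of the idea with Control.Monad.State (the Bitmap type and names are mine, just to illustrate the state threading):

import Control.Monad.State
import qualified Data.Map as M

type Bitmap = M.Map (Int, Int) Int  -- toy bitmap: coordinate -> color

-- "Mutating" a pixel is just an update of the threaded state.
setPixel :: (Int, Int) -> Int -> State Bitmap ()
setPixel xy c = modify (M.insert xy c)

drawDot :: State Bitmap ()
drawDot = do
  setPixel (1, 1) 0xFF0000
  setPixel (1, 2) 0xFF0000

-- execState drawDot M.empty yields a bitmap with the two pixels set.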
Although, it seems to me that the basic blit operation could be done in a more functional style: you are basically combining two bitmaps to produce a new bitmap via a pixel-by-pixel operation. That sounds very functional to me.
High-quality imperative code is often faster than good functional code, but if you are willing to give up a little speed, you can normally create very nice architectures in a pure functional style.
Haskell has side effects, and you should use them whenever they're appropriate. A high-speed blit routine that's going to be in your inner loop (and therefore is performance-critical) is certainly one place that mutation is appropriate, so use it! You have a couple of options:
Roll your own in Haskell, using ST(U)Array or IO(U)Array. Not recommended.
Roll your own in C, and call it with the FFI. Not recommended.
Use one of the many graphics toolkits that offers this kind of operation already, and has hundreds of programmer hours spent on making a good interface with high performance, such as Gtk or OpenGL. Highly recommended.
Enjoy!
A natural functional way of representing an image is by using the index function:
type Image = (Int, Int) -> Color
With this representation, blitting an area from one image to another would be achieved with
blit area a b = \(x,y) -> if (x,y) `isInsideOf` area then a (x,y) else b (x,y)
If translation or another transformation is required, it can be directly applied to the coordinates:
translate (dx,dy) image = \(x,y) -> image (x+dx, y+dy)
This representation gives you a natural way of working with image points. You can, for example, easily work with non-rectangular areas, and do tricks like making image interpolation a separate function instead of part of your usual image scaling algorithms:
quadraticInterpolation :: ((Int,Int) -> Color) -> ((Double,Double) -> Color)
The performance might suffer in some cases, such as when you blit multiple images into one and then do calculations with the result. This results in a chain of tests for each pixel for each successive calculation. However, by applying memoization, we can temporarily render the functional representation into an array and transform that back to its index function, thus eliminating the performance hit for the successive operations.
Note that the memoization can also be used to introduce parallelism to the process.
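A sketch of that memoization step, using Data.Array and the Image type from above (memoImage is my own name):

import Data.Array

-- Render the functional image into an array once, then serve all further
-- lookups from the array instead of re-running the chain of per-pixel tests.
memoImage :: (Int, Int) -> Image -> Image
memoImage (w, h) img = \(x, y) -> arr ! (x, y)
  where
    arr = listArray ((0, 0), (w - 1, h - 1))
                    [ img (x, y) | x <- [0 .. w - 1], y <- [0 .. h - 1] ]

Applying memoImage after a batch of blits collapses the chained tests into a single array lookup per pixel for all later operations.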
