Threading Box2D with pthreads - multithreading

So I'm essentially trying to implement an AIR Native Extension that does the physics simulation in C with interfaces through ActionScript.
I've gone through quite a few iterations, which I'll list below for interest's sake, and I'm at what I think could be my final attempt at getting this working in a more performant way.
Ultimately I'm looking for help in how I should be setting up a threading environment for running the simulation of Box2D on a separate thread and then polling for state in AS3.
Methods:
Brute Force:
In this method I simply call into C from AS3 and tell it to create a world and pass it some boxes to add to this world. Every frame in AS3, I call into C to tell the world to Step, then loop through all the bodies in the World, get their position and rotation, convert them to actionscript objects and put them in an actionscript array and then send that back to AS3. Once there I loop through the returning array and assign those position and rotation values to my sprites so they visually update.
The results are actually quite decent with about 116 boxes being added before the framerate suffers. This is compared to 30 boxes in a pure AS3 implementation. Note that these stats are in Debug mode. In release mode, they both make it to about 120 boxes. There is little difference between the AS3 implementation and the Native Extension implementation.
ByteArray Sharing
In order to improve performance I decided it would be a good idea to try and limit the amount of data being marshalled across C and AS3. ANEs support sharing a byte array's memory space, so I would send the ByteArray created in AS3 to C and have C simply update the ByteArray. This saves us from having to construct AS3 objects in C and pass them back. Every frame, AS3 simply needs to iterate through its ByteArray, see what C has written into it, and then assign those values to the sprites to set the visual state.
The results here are sadly about the same. Improvements are only marginal.
Direct Object Setting From C
Another thing ANEs are capable of is setting the property of an object that lives in AS3. In this sense I aimed to eliminate the overhead of passing back data to AS3, the looping through the bodies to collect data in C and the looping through in AS3 to assign the values. I directly modified the Box2D code so that when its values were changed it would write the new x, y, rotation values directly on the corresponding Sprite.
The results are amazing at very low amounts of objects since the call to set these properties is well under a millisecond. The problem is that this scales linearly and around 90 or so objects, the overhead is too severe and things start to slow down.
Threading
At this point I was a bit stumped. There's overhead in marshalling data, there's a cost in C for iterating and constructing the data to return and there's a cost in AS3 for iterating to assign values to the sprites.
Obviously there needs to be a trade-off so my current solution is the best I can come up with for now.
On the AS3 side you call into C to create your world, call in to add a box to that world, and call in to tell C you want a refresh of your data. When boxes are created in AS3 they get a unique id and they are stored in a dictionary with the key being the id.
On the C side, the world is created and a new pthread is spawned to do the Step, essentially simulating the world on another thread. After it steps, it assembles all the data and writes it into a double array. Then it does so again and again; it basically simulates forever on its own thread.
When we call in to C to add a new box, I need to create a new box and add it to that world. Since the world is Stepping this could cause problems which means I need to use mutexes I'm pretty sure.
Same thing when we call to get the values refreshed in AIR, I'll want to do a memcpy from the array of doubles into my AS3 bytearray and then loop through the bytearray to set the values on the visual.
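For reference, the locking described above can be sketched with a plain pthread mutex instead of a hand-rolled one. This is only a sketch under assumptions: the Snapshot type, the function names and the fixed MAX_BODIES capacity are hypothetical, and the AIR/FRE calls that would fill the ByteArray are left out.

```c
#include <assert.h>
#include <pthread.h>
#include <string.h>

#define MAX_BODIES 256
#define DOUBLES_PER_BODY 3          /* x, y, rotation */
#define MAX_DOUBLES (MAX_BODIES * DOUBLES_PER_BODY)

/* Hypothetical shared state: the simulation thread publishes into
 * `latest`; the call coming from AS3 copies it out under the same lock. */
typedef struct {
    pthread_mutex_t lock;
    double latest[MAX_DOUBLES];
    size_t count;                   /* number of doubles currently valid */
} Snapshot;

void snapshot_init(Snapshot *s) {
    assert(s != NULL);
    pthread_mutex_init(&s->lock, NULL);
    s->count = 0;
}

/* Called by the simulation thread after each Step. */
void snapshot_publish(Snapshot *s, const double *src, size_t count) {
    if (count > MAX_DOUBLES)
        count = MAX_DOUBLES;        /* drop anything beyond capacity */
    pthread_mutex_lock(&s->lock);
    memcpy(s->latest, src, count * sizeof(double));
    s->count = count;
    pthread_mutex_unlock(&s->lock);
}

/* Called from the main thread when AS3 asks for a refresh.
 * Returns how many doubles were copied into dst. */
size_t snapshot_read(Snapshot *s, double *dst, size_t capacity) {
    pthread_mutex_lock(&s->lock);
    size_t n = s->count < capacity ? s->count : capacity;
    memcpy(dst, s->latest, n * sizeof(double));
    pthread_mutex_unlock(&s->lock);
    return n;
}
```

The lock is held only for the duration of a memcpy, so neither thread waits long. Adding a box while the world is stepping would need a second lock shared between the Step call and the add call; that part is not shown here.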
The mutexes were giving me trouble so I basically implemented my own which you can see below... and laugh at :)
However it does work, just not as fast as I would like it to. Around 90 boxes we slow down again.
Anyone have any thoughts or pointers? It'd be greatly appreciated!
C Code
The parser was acting up so I've pasted it here:
http://pastebin.com/eBQGuGJX
AS3 Code
Same thing with the parser. I've only included the relevant method dealing with every frame in AS3.
http://pastebin.com/R1Qs2Tyt

I had forgotten I had this question. Fortunately I have figured it out.
The idea of using mutexes etc. was over-engineered in the first place and unnecessary.
Since we're running in Flash, everything runs in the main thread. This means that for each "frame" Flash will natively handle any media, then run the client code we have written, then render to the screen, and finally do any garbage collection if necessary.
I don't actually need to have the physics sim simulating forever, I simply need to have it be one step ahead of my client code.
So what happens now is that when the client calls into the ANE to set up the world, it creates a new thread that simulates the world and returns immediately back to Flash. Flash will continue to do its work of executing the rest of the client code and then rendering and then GC.
Then on each frame in Flash we can simply call into the ANE to retrieve the results. If the simulation thread hasn't finished, we wait via a join, then extract the values and return them to Flash, making sure to spawn another thread for the next step before returning, of course.
In this way we are maximizing our efficiency since the simulation is happening while Flash is busy doing other things we don't have control over (like rendering and GC).
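A minimal sketch of that "one step ahead" pattern follows. It is not the actual ANE code: the Sim struct is a stand-in for the Box2D world (just a position and velocity per body), and the names sim_begin_step and sim_collect are made up for illustration. The point is that the world is only touched by the main thread after a join, so no mutex is needed.

```c
#include <assert.h>
#include <pthread.h>
#include <string.h>

#define BODY_COUNT 4

/* Stand-in for the physics world: one x position and velocity per body. */
typedef struct {
    double x[BODY_COUNT];
    double vx[BODY_COUNT];
    double dt;            /* seconds simulated per step */
    pthread_t worker;
    int in_flight;        /* 1 while a step thread is started and not yet joined */
} Sim;

/* Thread body: advance the world by exactly one step, then exit. */
static void *step_once(void *arg) {
    Sim *sim = arg;
    for (int i = 0; i < BODY_COUNT; i++)
        sim->x[i] += sim->vx[i] * sim->dt;
    return NULL;
}

/* Start the next step on its own thread and return immediately.
 * Returns 0 on success, -1 if a step is already pending or creation fails. */
int sim_begin_step(Sim *sim) {
    assert(sim != NULL);
    if (sim->in_flight)
        return -1;
    if (pthread_create(&sim->worker, NULL, step_once, sim) != 0)
        return -1;
    sim->in_flight = 1;
    return 0;
}

/* Called once per frame: wait for the pending step (if any), copy the
 * results out, then start the following step before returning. */
int sim_collect(Sim *sim, double *out) {
    if (sim->in_flight) {
        pthread_join(sim->worker, NULL);
        sim->in_flight = 0;
    }
    memcpy(out, sim->x, sizeof sim->x);
    return sim_begin_step(sim);
}
```

Under this scheme, anything that mutates the world (adding a box, for example) would have to happen between the join and the next sim_begin_step, while no step thread is running.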
The good news is that performance almost doubles with this approach. Going from approx 90 boxes in a synchronous pure AS3 implementation to approx 170 boxes in a threaded ANE approach.
The bottleneck eventually becomes the iteration through the data coming back from the ANE and assigning those values to the Display Objects.
I hope this helps someone else who was looking for something similar. I'll be giving a talk about it at FITC Toronto at the end of April so there may be more information and material I can post then.
http://www.fitc.ca/events/presentations/presentation.cfm?event=124&presentation_id=1973

Related

Godot - How Are Scenes Handled Outside of the Viewport?

In the Godot Engine, I am wondering what happens when objects/scenes leave the viewport? For example: I am trying to make a large map with lots of scenes/entities (such as multiple moving enemies, as well as resource nodes). I am trying to figure out the best way to handle the entities that no longer need to be loaded in memory.
My initial thought was that every tile that is moved to, check the "map" array that holds all the tiles and load the new ones off the screen a little, and vice versa for the ones that will disappear. I assume this is horrible practice. I also thought of having "regions" that once entered, could load upcoming sections - but that also gets super complicated.
I noticed that Godot is already handling part of this problem. As an example, when an object emitting particles leaves the viewport, it stops emitting particles.
Overall, performance-wise, having multiple instances shouldn't be a problem, but if you have a lot of entities, you may want to execute code only when they are in the viewport.
For instance:
if is_in_viewport:
    do everything
else:
    do nothing but exist
For that purpose the VisibilityNotifier2D class may be useful.

F# update list in multiple threads at same time

I am new to F# so maybe the solution can be clear to someone, but I can not find it.
Imagine a game world of world chunks (similar to Minecraft), but for more players.
In theory, a language like C++, Java, or C# can modify multiple chunks of the world at the same time. Two or more players try to place or remove a block in different chunks, and all these actions can change the state of the world without affecting each other as long as no more than one action is happening in each chunk. Serializing only happens when multiple players perform a modification in the same chunk.
My understanding of F# is that I need to serialize these actions at a global level, so no two actions can happen at the same time in the entire world, because the update function needs the actual world state plus the update parameters (like add/remove block) and returns a new world state.
For that example the world state contains chunk list.
Is there a way to do world update in parallel?
Can the world state be stored differently to allow update to multiple chunks at same time?
It sounds like you need to ensure that each chunk has one action run at a time. You can protect pieces of state by storing them inside mailbox processors (often referred to as just "agents"). You can send agents several messages from several threads. They will be queued and processed one at a time.
There is a detailed discussion of this here: https://fsharpforfunandprofit.com/posts/concurrency-actor-model/
Firstly, I don't think this really adds any technical detail to a previous answer, so if you like their solution you should go ahead and mark that as the answer. However, I hope this gives some extra context...
Underlying your problem is the question of how consistent do you require the state of your world to be in order to make decisions about modifying chunks.
Consider a world where I have two chunks, let's call them A and B. Consider the use case where I want to add or remove a block from chunk A. The all important question is:
Do I need to know about the blocks in chunk B in order to validate, and then perform the addition / removal of a block from chunk A.
For example, if I only have a finite number of blocks in my world, I may well need this information to validate that I can actually add a block without going over my limit. The key here is that my "consistency boundary" is my entire world: in order to perform the addition of a new block to chunk A I need consistent information about everything in my world. It is no good if halfway through my decision making another thread jumps in and adds a block to chunk B. If this is a requirement then you have no option, even in the C#/C++ case: you need to lock down access to your world so only one such action can be performed at any one time.
From the way you phrase the question, I suspect this is not the case. In which case we need to examine exactly what your consistency requirements are. A weaker requirement is that if I'm adding blocks to chunk A, I at least have to have consistent information about the number (and position) of blocks in chunk A. In the C#/C++ case this would mean having to put locks around access to individual "chunk data", but not the whole world.
A simple way of modelling this in F# would be (using the suggestion in this answer):
open FSharp.Core

type ChunkMessage =
    | AddBlock
    | RemoveBlock

type MyWorld =
    {
        Blocks : List<MailboxProcessor<ChunkMessage>>
    }
Note that MyWorld is immutable, but each MailboxProcessor encapsulates state which can only change through processing one message at a time.
The implementation of Blocks doesn't have to be a list of MailboxProcessors; you could use a thread-safe collection of objects that expose thread-safe methods, but the use of them here as suggested by The Quick Brown Fox leads to a particularly nice programming model.
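For comparison, the lock-based alternative mentioned above for the C#/C++ case (one lock per chunk rather than one for the whole world) might look roughly like this in C with pthreads. The World and Chunk types, the fixed CHUNK_COUNT and the bare block counter are all assumptions made for the sake of a short example.

```c
#include <assert.h>
#include <pthread.h>

#define CHUNK_COUNT 4

/* Each chunk carries its own lock, so edits to different chunks
 * never wait on each other. */
typedef struct {
    pthread_mutex_t lock;
    int block_count;
} Chunk;

typedef struct {
    Chunk chunks[CHUNK_COUNT];
} World;

void world_init(World *w) {
    assert(w != NULL);
    for (int i = 0; i < CHUNK_COUNT; i++) {
        pthread_mutex_init(&w->chunks[i].lock, NULL);
        w->chunks[i].block_count = 0;
    }
}

/* Returns 0 on success, -1 if the chunk index is out of range. */
int chunk_add_block(World *w, int chunk) {
    if (chunk < 0 || chunk >= CHUNK_COUNT)
        return -1;
    Chunk *c = &w->chunks[chunk];
    pthread_mutex_lock(&c->lock);
    c->block_count++;
    pthread_mutex_unlock(&c->lock);
    return 0;
}

/* Returns 0 on success, -1 if the index is invalid or the chunk is empty. */
int chunk_remove_block(World *w, int chunk) {
    if (chunk < 0 || chunk >= CHUNK_COUNT)
        return -1;
    Chunk *c = &w->chunks[chunk];
    int result = -1;
    pthread_mutex_lock(&c->lock);
    if (c->block_count > 0) {
        c->block_count--;
        result = 0;
    }
    pthread_mutex_unlock(&c->lock);
    return result;
}

/* Hypothetical worker used to exercise the locks from several threads. */
typedef struct {
    World *world;
    int chunk;
    int times;
} AddJob;

void *add_job_run(void *arg) {
    AddJob *job = arg;
    for (int i = 0; i < job->times; i++)
        chunk_add_block(job->world, job->chunk);
    return NULL;
}
```

Two threads running add_job_run against the same chunk are serialized by that chunk's mutex, while a third thread working on another chunk proceeds independently. This covers only the weaker consistency requirement; a rule that spans the whole world would still need a global lock.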

What is the right way of building a long scroll list of button widgets in Motif?

I'm dealing with an old Motif application that needs to load and display a long list of entries (around 1500). It creates and manages an instance of xmFormWidgetClass via XtVaCreateManagedWidget() and then it stuffs it with a bunch of linear hierarchies xmFrameWidgetClass->xmFormWidgetClass->xmFormWidgetClass->xmPushButtonWidgetClass. Each PushButton contains a multi-line label. When this thing is being populated, it takes a lot of CPU, which it spends doing some geometry calculations inside the X/Motif libraries. The pace at which new buttons are added degrades very quickly. It looks like there is an O(N) algorithm being used inside of XtVaCreateManagedWidget().
The things get much much better if I do XtUnrealizeWidget() on the original instance of the xmFormWidgetClass. Entries are being added at almost constant speed but then I cannot find a way to display the whole thing that I built. XtRealizeWidget() for the original instance of the xmFormWidgetClass does not render it in the window.
What am I doing wrong? Is there a way to populate the hierarchy and then calculate the geometry and render it to the screen at once?
Redesigning the application is an option, but it is a last-resort type of option.
Any advice that keeps me within Motif libraries will be highly appreciated!
Regards,
/Sergey
Try calling XtManageChild after XtRealizeWidget.
Try creating all widgets unmanaged and place them on a WidgetList, then call XtManageChildren(). Please see the following reference
http://www.s-and-b.su/syshlp/motif_guide/MotifProgGuide/Making_Widgets_Visible.html
Every time an individual widget is managed, the parent's change_managed procedure is called.
XtManageChildren calls the change_managed procedure only once. This may help.

Message Passing Arbitrary Object Graphs?

I'm looking to parallelize some code across a Beowulf cluster, such that the CPUs involved don't share address space. I want to parallelize a function call in the outer loop. The function calls do not have any "important" side effects (though they do use a random number generator, allocate memory, etc.).
I've looked at libs like MPI and the problem I see is that they seem to make it very non-trivial to pass complex object graphs between nodes. The input to my function is a this pointer that points to a very complex object graph. The return type of my function is another complex object graph.
At a language-agnostic level (I'm working in the D programming language, and I'm almost sure no canned solution is available here, but I'm willing to create one), is there a "typical" way that passing complex state across nodes is dealt with? Ideally, I want the details of how the state is copied to be completely abstracted away and for the calls to look almost like normal function calls. I don't care that copying this much state over a network isn't particularly efficient, as the level of parallelism in question is so coarse-grained that it probably won't matter.
Edit: If there is no easy way to pass complex state, then how is message passing typically used? It seems to me like anything involving copying data over a network requires coarse grained parallelism, yet coarse grained parallelism usually requires passing complex state so that a lot of work can be done in one work unit.
I do a fair bit of MPI programming but I don't know of any typical way of passing complex state (as you describe it) between processes. Here's how I've been thinking about your problem, it probably matches your own thinking ...
I surmise that your complex object graphs are represented, in memory, by blocks of data and pointers to other blocks of data -- a usual sort of implementation of a graph. How best can you move one of these COGs (to coin an abbreviation) from the address space of one process to the address space of another? To the extent that a pointer is a memory address, a pointer in one address space is no use in another address space, so you will have to translate it into some neutral form for transport (I think?).
To send a COG, therefore, it has to be put into some form from which the receiving process can build, in its own address space, a local version of the graph with the pointers pointing to local memory addresses. Do you ever write these COGs to file? If so, you already have a form in which one could be transported. I hate to suggest it, but you could even use files to communicate between processes -- and that might be easier to handle than the combination of D and MPI. Your choice!
If you don't have a file form for the COGs, can you easily represent them as adjacency matrices or lists? In other words, work out your own representation for transport?
I'll be very surprised (but pleased to learn) if you can pass a COG between processes without transforming it from pointer-based to some more static structure such as arrays or records.
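As a rough illustration of such a transformation, here is a sketch in C that turns a small pointer-based graph into an array of records whose links are array indices, and rebuilds a pointer-based copy from that array. The Node and FlatNode layouts (one value, at most two outgoing links) and the function names are hypothetical, and no MPI calls are shown; the FlatNode array is simply the kind of contiguous, pointer-free data that could be sent as a message.

```c
#include <assert.h>
#include <stdlib.h>

#define NO_CHILD (-1)

/* Pointer-based form, only meaningful inside one address space. */
typedef struct Node {
    double value;
    struct Node *left;
    struct Node *right;
} Node;

/* Pointer-free form for transport: links are indices into the array. */
typedef struct {
    double value;
    int left;
    int right;
} FlatNode;

/* Index of n within nodes[0..count), or NO_CHILD if absent or NULL. */
static int index_of(Node *const *nodes, int count, const Node *n) {
    for (int i = 0; i < count; i++)
        if (nodes[i] == n)
            return i;
    return NO_CHILD;
}

/* The caller supplies every node of the graph in `nodes`; out[i]
 * describes nodes[i]. Linear search keeps the sketch short (O(n^2)). */
void graph_flatten(Node *const *nodes, int count, FlatNode *out) {
    assert(nodes != NULL && out != NULL);
    for (int i = 0; i < count; i++) {
        out[i].value = nodes[i]->value;
        out[i].left  = index_of(nodes, count, nodes[i]->left);
        out[i].right = index_of(nodes, count, nodes[i]->right);
    }
}

/* Rebuild in the receiver's address space. Returns a malloc'd array in
 * which element i corresponds to flat[i]; the caller frees it.
 * Returns NULL if allocation fails. */
Node *graph_rebuild(const FlatNode *flat, int count) {
    Node *nodes = malloc((size_t)count * sizeof *nodes);
    if (nodes == NULL)
        return NULL;
    for (int i = 0; i < count; i++) {
        nodes[i].value = flat[i].value;
        nodes[i].left  = flat[i].left  == NO_CHILD ? NULL : &nodes[flat[i].left];
        nodes[i].right = flat[i].right == NO_CHILD ? NULL : &nodes[flat[i].right];
    }
    return nodes;
}
```

Because links are stored as indices, shared nodes and cycles survive the round trip. A real object graph with many node types would need one such record layout per type, which is where most of the work would lie.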
Edit, in response to OP's edit. MPI does provide easy ways to pass complex state around, provided that the complex state is represented as values, not pointers. You can pass complex state around in either the intrinsic or customised MPI datatypes; as one of the other answers shows you, these are flexible and capable. If your program does not keep the complex state in a form that MPI custom datatypes can handle, you'll have to write functions to pack/unpack to a message-friendly representation. If you can do that, then your message calls will look (for most purposes) like function calls.
As to the issues surrounding complex state and the graininess of parallelism, I'm not sure I quite follow you. We (include yourself in this sweeping generalisation if you want, or not) typically resort to MPI programming because we can't get enough performance out of a single processor, we know that we'll pay a penalty in terms of computation delayed by waiting for communication, we work hard to minimise that penalty, but in the end we accept the penalty as the cost of parallelisation. Certainly some jobs are too small or too short to benefit from parallelisation, but a lot of what we (parallel computationalists that is) do is just too big and too long-running to avoid parallelisation.
You can do marvelous things with custom MPI datatypes. I'm currently working on a project where several MPI processes are tracking particles in a piece of virtual space, and when particles cross over from one process' territory into another one's, their data (position/speed/size/etc) has to be sent over the network.
The way I achieved this is the following:
1) All processes share an MPI Struct datatype for a single particle that contains all its relevant attributes, and their displacement in memory compared to the base address of the particle object.
2) On sending, the process iterates over whatever data structure it stores the particles in, notes down the memory address of each one that needs to be sent, and then builds a Hindexed datatype where each block is 1 long (of the above mentioned particle datatype) and starts at the memory addresses previously noted down. Sending 1 object of the resulting type will send all the necessary data over the network, in a type safe manner.
3) On the receiving end, things are slightly trickier. The receiving process first inserts "blank" particles into its own data structure: "blank" means that all the attributes that will be received from the other process are initialized to some default value. The memory addresses of the freshly inserted particles are noted down, and a datatype similar to that of the sender is created from these addresses. Receiving the sender's message as a single object of this type will automatically unpack all the data into all the right places, again, in a type safe manner.
This example is simpler in the sense that there are no relationships between particles (as there would be between nodes of a graph), but you could transmit that data in a similar way.
If the above description is not clear, I can post the C++ code that implements it.
I'm not sure I understand the question correctly so forgive me if my answer is off. From what I understand you want to send non-POD datatypes using MPI.
A library that can do this is Boost.MPI. It uses a serialization library to send even very complex data structures. There is a catch though: you will have to provide code to serialize the data yourself if you use complicated structures that Boost.Serialization does not already know about.
I believe message passing is typically used to transmit POD datatypes.
I'm not allowed to post more links so here is what I wanted to include:
Explanation of POD: www.fnal.gov/docs/working-groups/fpcltf/Pkg/ISOcxx/doc/POD.html
Serialization Library: www.boost.org/libs/serialization/doc
It depends on the organization of your data. If you use pointers or automatic memory inside your objects, it will be difficult. If you can organize your objects to be contiguous in memory, you have two choices: send the memory as bytes and cast it back to the object type on the receiver, or define an MPI derived type for your object. If however you use inheritance, things will become complicated due to how objects are laid out in memory.
I do not know your problem, but maybe you can take a look at ARMCI if you manage memory manually.

Best approach for game animation?

I have a course exercise in OpenGL to write a game with simple animation of a few objects.
While discussing our design options with my partner, we realized we have two major choices for how the animation is going to work. Either:
Set a timer for a constant interval, say 30 msec; when the timer fires, calculate where objects should be and draw the frame. Or:
Don't use a timer, just a normal loop that runs all the time, and in each iteration check how much time has passed, calculate where the objects should be according to the interval and draw the frame.
What should generally be the preferred approach? Does anyone have concrete experience with either approach?
Render and compute as fast as you can to get the maximum frame rate (as capped by the vertical sync).
Don't use a timer, they're not reliable < 50-100 ms on Windows. Check how much time has passed. (Usually, you need both delta t and an absolute value, depending on if your animation is physics or keyframe based.)
Also, if you want to be stable, use an upper/lower bound on your time-step, to go into slow-motion if a frame takes a few secs to render (disc access by another process?) or skip an update if you get two of them within say 10 ms.
Update
(Since this is a rather popular answer)
I usually prefer having a fixed time-step, as it makes everything more stable. Most physics engines are pretty robust against varying time, but other things, like particle systems or various simpler animations or even game logic, are easier to tune when having everything run in a fixed time step.
Update2
(Since I got 10 upvotes ;)
For further stability over long periods of running (>4 hours), you probably want to make sure you're not using floats/doubles to compute large time differences, since you lose precision doing so and your game's animations/physics will suffer. Use fixed point (or 64-bit microsecond-based) integers instead.
For the hairy details, I recommend reading A matter of precision by Tom Forsyth.
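The two suggestions above (bound the time-step, and keep time in 64-bit integers) can be sketched together in C. The 10 ms lower bound comes from the text; the 100 ms upper bound and all names are assumptions. The timestamps would typically come from a monotonic clock and are passed in here as plain integers to keep the example portable.

```c
#include <assert.h>
#include <stdint.h>

#define MIN_STEP_US 10000    /* skip the update if less than 10 ms elapsed */
#define MAX_STEP_US 100000   /* assumed cap: longer frames run in slow motion */

/* Convert a seconds + nanoseconds reading to whole microseconds. */
int64_t to_us(int64_t sec, int64_t nsec) {
    assert(nsec >= 0 && nsec < 1000000000);
    return sec * 1000000 + nsec / 1000;
}

/* Given the previous and current timestamps in microseconds, return the
 * step to simulate, or 0 if this update should be skipped. */
int64_t clamp_step_us(int64_t prev_us, int64_t curr_us) {
    int64_t dt = curr_us - prev_us;
    if (dt < MIN_STEP_US)
        return 0;
    if (dt > MAX_STEP_US)
        return MAX_STEP_US;
    return dt;
}
```

Subtracting two 64-bit microsecond counts stays exact no matter how long the game has been running; only the small resulting step needs converting to floating point for the physics or animation code.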
Read this page about game loops.
In short, set a timer:
Update the state of the game at a fixed frequency (something like every 25 ms = 1s/40fps). That includes the properties of the game objects, the input, the physics, the AI, etc. I call that the Model and the Controller. The need for a fixed update rate comes from the problems that may appear on too slow or too fast hardware (read the article). Some physics engines also prefer to update at a fixed frequency.
Update the frame (the graphics) of the game as fast as possible. That would be the View. That way you'll provide a smooth game. You can also enable vsync so the display will wait for the graphics card (usually it's 60 fps).
So each iteration of the loop, you check if you should update the model/controller. If it's late, update until they are up to date. Then, update the frame once and continue your loop.
The tricky part is that because of the different update rates, on fast hardware the view will update several times before the model and controller. Therefore you should interpolate the position of your game objects depending on "where they would be if the game state would have been updated". It's really not that difficult.
You may have to maintain 2 different data structures : one for the model and one for the view. For instance you could have a scene graph for your model and a BSP tree for your view.
The second would be my preferred approach, because timers are often not as accurate as you're probably thinking and have all the latency and overhead of the event handling system. Accounting for the time interval will give your animations a much more consistent look and be robust if/when your frame rate dips.
Having said that, if your animation is based on a physics simulation (eg rigid body or ragdoll animation), then having a fixed update interval for your physics can greatly simplify the implementation.
Option 2 is by far preferred. It will scale nicely across differently performing hardware.
The book "Game Programming Gems 1" had a chapter that covers exactly what you need:
Frame Rate Independent Linear Interpolation
Use the second method. Did a game for my senior project and from experience, there is no guarantee that your logic will be done processing when the timer wants to fire.
I would be tempted to use the loop, since it will render as fast as possible (i.e. immediately after your physics computations are done). This will probably be more robust if you run into any slow-down in computation, which would cause timer firings to start queueing up. However, in case of such a slow-down you may have to put a cap on the time step computed between updates, since your physics engine may go unstable with too large a jump in time.
I'd suggest setting the system up to work on a "delta" that's passed in from outside.
When I did this, inside the animation format I based everything on real time values. The delta I passed in was 1 / 30 seconds, but it could be anything. With this system you can get either your first or second option, depending on whether you pass in a fixed delta or if you pass in the amount of time that has passed since the last frame.
As for which is better, it depends on your game and your requirements. Ideally all of your systems would be based around the same delta so that your physics match your animations. If your game drops frames at all and if all of your systems work with a variable delta, I'd suggest the variable delta is the better of the two solutions for end user experience.

Resources