performance of wglGetProcAddress, should calls be cached? - linux

There is a simple code inside a function (from a legacy library) used in init phase - for instance in a Texture Loader module.
loadTexture() {
// ...
gl_func = wglGetProcAddress(...)
// ...
gl_func()
}
Should I worry about the cost of wglGetProcAddresscall? Or maybe it is so fast that no caching mechanism is needed? Or maybe WGL caches such calls for the process?
What about other similar functions from GLX and Apple? Should I worry or not about them as well?

wglGetProcAddress will at least do some string comparisons so it's not free. The big issue is that your code will be ugly if you insert wglGetProcAddress every time you use a gl function.
It's best if you use a generator that puts all the ugly wglGetProcAddress in a separate file. For example using glux or glloadgen.

Related

What is the proper way to get continuously processed data from a thread?

The following functions and fields are part of the same class in a Visual Studio DLL. Data is continuously being read and processed using the run function on a thread. However, getPoints is being accessed in a Qt app on a QTimer. I don't wan't to miss a single processed vector, because it seems it could be skipping leading to jumpy data. What's the safest way to get the points to the updated version?
If possible I'd like an answer that uses the C++ standard library as I've been exploring mutex-es, but it still seems to lead to jumpy data.
vector<float> points;
// std::mutex ioMutex;
// function running on a thread
void run(){
while(running){
//ioMutex.lock()
vector<byte> data = ReadData()
points = processData(data);
//ioMutex.unlock()
}
}
vector<float> getPoints(){
return points;
}
I believe there is a mistake in your code. The while loop will consume all the process activity and will not allow proper functionality of other functions. In Qt, in such continuous loops, usually it is a good habit to use the following because it actually gives other process time to access the event buffer properly. If this dll is written in Qt, please add the following within the while loop
QCoreApplication::processEvents();
The safest (and probably easiest) way to deliver your points-data to the main thread is by calling qApp->postEvent() with an object of a custom QEvent-subclass that contains your vector<float> as a member-variable.
That will cause the event(QEvent *) method of (whatever Qt object you specified as the first argument to postEvent()) to be called from inside the main/GUI thread, and so you can override that method to read the vector<float> out of the QEvent-subclassed object and update the GUI with that data.

Groovy: Is for..in significantly faster than .each?

I'm curious if for..in should be preferred to .each for performance reasons.
For .. in is part of the standard language flow control.
Instead each calls a closure so with extra overhead.
.each {...} is syntax sugar equivalent to the method call .each({...})
Moreover, due to the fact that it is a closure, inside each code block you can't use break and continue statements to control the loop.
http://kunaldabir.blogspot.it/2011/07/groovy-performance-iterating-with.html
Updated benchmark Java 1.8.0_45 Groovy 2.4.3:
3327981 each {}
320949 for(){
Here is another benchmark with 100000 iterations:
lines = (1..100000)
// with list.each {}
start = System.nanoTime()
lines.each { line->
line++;
}
println System.nanoTime() - start
// with loop over list
start = System.nanoTime()
for (line in lines){
line++;
}
println System.nanoTime() - start
results:
261062715 each{}
64518703 for(){}
Let's have a theoretical look at things in terms of what calls are done dynamic and what calls are done more directly with Java logic (I will call those static calls).
In case of for-in, Groovy operates on an Iterator, to get it, we have one dynamic call to iterator(). If I am not mistaken, the hasNext and next calls are done using normal Java method call logic. Thus for each iteration we have here 2 static calls only. Depending on the benchmark it has to be noted that that first iterator() call can cause serious initialization times, since that may init the meta class system and that takes a moment.
In case of each we have the dynamic call to each itself, as well as the object creation for the open block (instance of Closure). each(Closure) will then call iterator() as well, but uncached... well all one-time cost. During the loop hasNext and next are done using Java logic that makes 2 static calls. The call into Closure instance is done with java standard logic for the method call, which then will invoke doCall using a dynamic call.
To sum it up, per iteration for-in uses only 2 static calls, while each has 3 static and 1 dynamic call. The dynamic call is much slower than multiple static calls and much more difficult to optimize for the JVM, thus dominating the timing. As a result of this each should always be slower, as long as the open block requires the dynamic call.
Because of the complicated logic for Closure#call it is difficult to optimize the dynamic call away. And that is annoying, because it is not really needed and will be removed as soon as we find a workaround. Should we ever succeed in that, each might still be slower, but it is a much more difficult thing, since bytecode sizes and invocation profiles play a role here. In theory they could be equal then (ignoring the init time), but the JVM has much more work to do. Of course the same applies for for examples stream based lambda processing in Java8,

How do I Yield() to another thread in a Win8 C++/Xaml app?

Note: I'm using C++, not C#.
I have a bit of code that does some computation, and several bits of code that use the result. The bits that use the result are already in tasks, but the original computation is not -- it's actually in the callstack of the main thread's App::App() initialization.
Back in the olden days, I'd use:
while (!computationIsFinished())
std::this_thread::yield(); // or the like, depending on API
Yet this doesn't seem to exist for Windows Store apps (aka WinRT, pka Metro-style). I can't use a continuation because the bits that use the results are unconnected to where the original computation takes place -- in addition to that computation not being a task anyway.
Searching found Concurrency::Context::Yield(), but Context appears not to exist for Windows Store apps.
So... say I'm in a task on the background thread. How do I yield? Especially, how do I yield in a while loop?
First of all, doing expensive computations in a constructor is not usually a good idea. Even less so when it's the "App" class. Also, doing heavy work in the main (ASTA) thread is pretty much forbidden in the WinRT model.
You can use concurrency::task_completion_event<T> to interface code that isn't task-oriented with other pieces of dependent work.
E.g. in the long serial piece of code:
...
task_completion_event<ComputationResult> tce;
task<ComputationResult> computationTask(tce);
// This task is now tied to the completion event.
// Pass it along to interested parties.
try
{
auto result = DoExpensiveComputations();
// Successfully complete the task.
tce.set(result);
}
catch(...)
{
// On failure, propagate the exception to continuations.
tce.set_exception(std::current_exception());
}
...
Should work well, but again, I recommend breaking out the computation into a task of its own, and would probably start by not doing it during construction... surely an anti-pattern for a responsive UI. :)
Qt simply uses Sleep(0) in their WinRT yield implementation.

Multithreaded Game Loop Rendering/Updating (boost-asio)

So I have a single-threaded game engine class, which has separate functions for input, update and rendering, and I've just started learning to use the wonderful boost library (asio and thread components). And I was thinking of separating my update and render functions into separate threads (and perhaps separate the input and update functions from each other as well). Of course these functions will sometimes access the same locations in memory, so I decided to use boost/thread's strand functionality to prevent them from executing at the same time.
Right now my main game loop looks like this:
void SDLEngine::Start()
{
int update_time=0;
quit=false;
while(!quit)
{
update_time=SDL_GetTicks();
DoInput();//get user input and alter data based on it
DoUpdate();//update game data once per loop
if(!minimized)
DoRender();//render graphics to screen
update_time=SDL_GetTicks()-update_time;
SDL_Delay(max(0,target_time-update_time));//insert delay to run at desired FPS
}
}
If I used separate threads it would look something like this:
void SDLEngine::Start()
{
boost::asio::io_service io;
boost::asio::strand strand_;
boost::asio::deadline_timer input(io,boost::posix_time::milliseconds(16));
boost::asio::deadline_timer update(io,boost::posix_time::milliseconds(16));
boost::asio::deadline_timer render(io,boost::posix_time::milliseconds(16));
//
input.async_wait(strand_.wrap(boost::bind(&SDLEngine::DoInput,this)));
update.async_wait(strand_.wrap(boost::bind(&SDLEngine::DoUpdate,this)));
render.async_wait(strand_.wrap(boost::bind(&SDLEngine::DoRender,this)));
//
io.run();
}
So as you can see, before the loop went: Input->Update->Render->Delay->Repeat
Each one was run one after the other. If I used multithreading I would have to use strands so that updates and rendering wouldn't be run at the same time. So, is it still worth it to use multithreading here? They would still basically be running one at a time in separate cores. I basically have no experience in multithreaded applications so any help is appreciated.
Oh, and another thing: I'm using OpenGL for rendering. Would multithreading like this affect the way OpenGL renders in any way?
You are using same strand for all handlers, so there is no multithreading at all. Also, your deadline_timer is in scope of Start() and you do not pass it anywhere. In this case you will not able to restart it from the handler (note its not "interval" timer, its just a "one-call timer").
I see no point in this "revamp" since you are not getting any benefit from asio and/or threads at all in this example.
These methods (input, update, render) are too big and they do many things, you cannot call them without blocking. Its hard to say precisely because i dont know whats the game and how it works, but I'd prefer to do following steps:
Try to revamp network i/o so its become fully async
Try to use all CPU cores
About what you have tried: i think its possible if you search your code for actions that really can run in parallel right now. For example: if you calculate for each NPC something that is not depending on other characters you can io_service.post() each to make use all threads that running io_service.run() at the moment. So your program stay singlethreaded, but you can use, say, 7 other threads on some "big" operations

Is this a safe version of double-checked locking?

Slightly modified version of canonical broken double-checked locking from Wikipedia:
class Foo {
private Helper helper = null;
public Helper getHelper() {
if (helper == null) {
synchronized(this) {
if (helper == null) {
// Create new Helper instance and store reference on
// stack so other threads can't see it.
Helper myHelper = new Helper();
// Atomically publish this instance.
atomicSet(helper, myHelper);
}
}
}
return helper;
}
}
Does simply making the publishing of the newly created Helper instance atomic make this double checked locking idiom safe, assuming that the underlying atomic ops library works properly? I realize that in Java, one could just use volatile, but even though the example is in pseudo-Java, this is supposed to be a language-agnostic question.
See also:
Double checked locking Article
It entirely depends on the exact memory model of your platform/language.
My rule of thumb: just don't do it. Lock-free (or reduced lock, in this case) programming is hard and shouldn't be attempted unless you're a threading ninja. You should only even contemplate it when you've got profiling proof that you really need it, and in that case you get the absolute best and most recent book on threading for that particular platform and see if it can help you.
I don't think you can answer the question in a language-agnostic fashion without getting away from code completely. It all depends on how synchronized and atomicSet work in your pseudocode.
The answer is language dependent - it comes down to the guarantees provided by atomicSet().
If the construction of myHelper can be spread out after the atomicSet() then it doesn't matter how the variable is assigned to the shared state.
i.e.
// Create new Helper instance and store reference on
// stack so other threads can't see it.
Helper myHelper = new Helper(); // ALLOCATE MEMORY HERE BUT DON'T INITIALISE
// Atomically publish this instance.
atomicSet(helper, myHelper); // ATOMICALLY POINT UNINITIALISED MEMORY from helper
// other thread gets run at this time and tries to use helper object
// AT THE PROGRAMS LEISURE INITIALISE Helper object.
If this is allowed by the language then the double checking will not work.
Using volatile would not prevent a multiple instantiations - however using the synchronize will prevent multiple instances being created. However with your code it is possible that helper is returned before it has been setup (thread 'A' instantiates it, but before it is setup thread 'B' comes along, helper is non-null and so returns it straight away. To fix that problem, remove the first if (helper == null).
Most likely it is broken, because the problem of a partially constructed object is not addressed.
To all the people worried about a partially constructed object:
As far as I understand, the problem of partially constructed objects is only a problem within constructors. In other words, within a constructor, if an object references itself (including it's subclass) or it's members, then there are possible issues with partial construction. Otherwise, when a constructor returns, the class is fully constructed.
I think you are confusing partial construction with the different problem of how the compiler optimizes the writes. The compiler can choose to A) allocate the memory for the new Helper object, B) write the address to myHelper (the local stack variable), and then C) invoke any constructor initialization. Anytime after point B and before point C, accessing myHelper would be a problem.
It is this compiler optimization of the writes, not partial construction that the cited papers are concerned with. In the original single-check lock solution, optimized writes can allow multiple threads to see the member variable between points B and C. This implementation avoids the write optimization issue by using a local stack variable.
The main scope of the cited papers is to describe the various problems with the double-check lock solution. However, unless the atomicSet method is also synchronizing against the Foo class, this solution is not a double-check lock solution. It is using multiple locks.
I would say this all comes down to the implementation of the atomic assignment function. The function needs to be truly atomic, it needs to guarantee that processor local memory caches are synchronized, and it needs to do all this at a lower cost than simply always synchronizing the getHelper method.
Based on the cited paper, in Java, it is unlikely to meet all these requirements. Also, something that should be very clear from the paper is that Java's memory model changes frequently. It adapts as better understanding of caching, garbage collection, etc. evolve, as well as adapting to changes in the underlying real processor architecture that the VM runs on.
As a rule of thumb, if you optimize your Java code in a way that depends on the underlying implementation, as opposed to the API, you run the risk of having broken code in the next release of the JVM. (Although, sometimes you will have no choice.)
dsimcha:
If your atomicSet method is real, then I would try sending your question to Doug Lea (along with your atomicSet implementation). I have a feeling he's the kind of guy that would answer. I'm guessing that for Java he will tell you that it's cheaper to always synchronize and to look to optimize somewhere else.

Resources