Using callgrind/kcachegrind to get per-thread statistics

I'd like to be able to see how "expensive" each thread in my application is using callgrind. I profiled with the --separate-threads=yes option, which gives you a callgrind file for the whole app and then one per thread.
This is useful for viewing the profile of any given thread, but what I really want is just a sorted list of CPU time from each thread so I can see which threads are the biggest hogs.

Valgrind/Callgrind doesn't offer this directly, and neither does kcachegrind, though I think it would be a good improvement. Their mailing lists might have more answers.
A working but really tedious workaround is to use --separate-threads=no and change your code so that each thread uses a different function or class name. Depending on your code's complexity, that could be the answer (using 1computeData(), 2computeData(), ..)

Just open multiple profiles in kcachegrind at the same time.
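A sorted list can also be approximated outside kcachegrind by reading the per-thread output files. The sketch below is not a callgrind feature; it assumes the per-thread files are named callgrind.out.<pid>-<tid> and contain a "totals:" (or "summary:") line whose first number is the first event listed under "events:" (normally Ir, instructions executed, which is a proxy for CPU time rather than wall-clock time). The function names are made up for illustration.

```python
from pathlib import Path


def total_cost(profile_text):
    """Return the first event count from the 'totals:' line.

    Falls back to 'summary:' and then to 0 if neither line is present.
    """
    for key in ("totals:", "summary:"):
        for line in profile_text.splitlines():
            if line.startswith(key):
                return int(line.split()[1])
    return 0


def rank_threads(directory="."):
    """Sort per-thread profiles (assumed name: callgrind.out.<pid>-<tid>) by cost."""
    costs = {}
    for path in Path(directory).glob("callgrind.out.*-*"):
        costs[path.name] = total_cost(path.read_text(errors="replace"))
    return sorted(costs.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    for name, cost in rank_threads():
        print(f"{cost:>15}  {name}")
```

Run it in the directory that holds the profiles; the most expensive thread is printed first.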

Related

Simple Qt threading mechanism with progress?

I want to look for files with given extensions recursively from a given root directory and to display the number of files currently found in my GUI.
Since this kind of processing may be long, the GUI may be blocked.
I could just wait for the end of the processing and get the file count, but I am learning Qt (PyQt), so I see this as a training.
So I have read Qt doc:
When to Use Alternatives to Threads, and I don't think it's for me.
Then I read:
Choosing an Appropriate Approach, and I think my solution is the first one:
Run a new linear function within another thread, optionally with progress updates during the run
But in this case you have 3 choices:
Qt provides different solutions:
-Place the function in a reimplementation of QThread::run() and start the QThread. Emit signals to update progress. OR
-Place the function in a reimplementation of QRunnable::run() and add the QRunnable to a QThreadPool. Write to a thread-safe variable to update progress. OR
-Run the function using QtConcurrent::run(). Write to a thread-safe variable to update progress.
Could you tell me how to choose the best one?
I have read some "solutions" but I'd like to understand why you should use one methodology instead of another one.
And also since I am looking for files, I may have a directory in which many files would match the search criteria. So it would mean lots of interruptions. Is there something special to keep in mind regarding this?
Thank you!

From what I know (hopefully more can chime in).
QThread offers support with signal interaction. For example, you'd be able to stop your concurrent function with a signal. Not sure how you'd do that with the other options, if at all.
Things to keep in mind: widgets all have to live in the main thread, but can communicate with other threads via signals & slots.
Another quick thread on the topic w/ some decent bullet-points.
https://qt-project.org/forums/viewthread/50165/
Best of luck on your project, and welcome to Qt!
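To make the shape of a worker with progress updates concrete, here is a rough sketch using only Python's standard threading module as a stand-in for QThread. It is not Qt code: the FileCounter class and its batch parameter are invented for illustration. In PyQt the _publish step is where a QThread subclass would emit a progress signal instead of writing to a lock-protected counter. Publishing in batches is one way to limit the "lots of interruptions" worry from the question.

```python
import os
import threading


class FileCounter(threading.Thread):
    """Counts files with the given extensions; the count is readable from any thread."""

    def __init__(self, root, extensions, batch=100):
        super().__init__(daemon=True)
        self.root = root
        self.extensions = tuple(extensions)
        self.batch = batch  # publish progress once this many matches are pending
        self._count_lock = threading.Lock()
        self._count = 0
        self._stop_requested = threading.Event()

    @property
    def count(self):
        with self._count_lock:
            return self._count

    def stop(self):
        """Ask the worker to finish early (checked once per directory)."""
        self._stop_requested.set()

    def run(self):
        pending = 0
        for _dirpath, _dirnames, filenames in os.walk(self.root):
            if self._stop_requested.is_set():
                break
            pending += sum(name.endswith(self.extensions) for name in filenames)
            if pending >= self.batch:
                self._publish(pending)
                pending = 0
        self._publish(pending)

    def _publish(self, n):
        # In Qt, this is where a progress signal would be emitted instead.
        with self._count_lock:
            self._count += n
```

The GUI side would start the worker and read worker.count from a timer, which keeps the update rate independent of how many files match.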

What are the benefits of coroutines?

I've been learning some lua for game development. I heard about coroutines in other languages but really came up on them in lua. I just don't really understand how useful they are, I heard a lot of talk how it can be a way to do multi-threaded things but aren't they run in order? So what benefit would there be from normal functions that also run in order? I'm just not getting how different they are from functions except that they can pause and let another run for a second. Seems like the use case scenarios wouldn't be that huge to me.
Anyone care to shed some light as to why someone would benefit from them?
Especially insight from a game programming perspective would be nice^^

OK, think in terms of game development.
Let's say you're doing a cutscene or perhaps a tutorial. Either way, what you have is an ordered sequence of commands sent to some number of entities. An entity moves to a location, talks to a guy, then walks elsewhere. And so forth. Some commands cannot start until others have finished.
Now look back at how your game works. Every frame, it must process AI, collision tests, animation, rendering, and sound, among possibly other things. You can only think every frame. So how do you put this kind of code in, where you have to wait for some action to complete before doing the next one?
If you built a system in C++, what you would have is something that ran before the AI. It would have a sequence of commands to process. Some of those commands would be instantaneous, like "tell entity X to go here" or "spawn entity Y here." Others would have to wait, such as "tell entity Z to go here and don't process any more commands until it has gone here." The command processor would have to be called every frame, and it would have to understand complex conditions like "entity is at location" and so forth.
In Lua, it would look like this:
local entityX = game:GetEntity("entityX");
entityX:GoToLocation(locX);
local entityY = game:SpawnEntity("entityY", locY);
local entityZ = game:GetEntity("entityZ");
entityZ:GoToLocation(locZ);
-- Yield once per frame until entityZ has arrived.
repeat
    coroutine.yield();
until (entityZ:isAtLocation(locZ));
return;
On the C++ side, you would resume this script once per frame until it is done. Once it returns, you know that the cutscene is over, so you can return control to the user.
Look at how simple that Lua logic is. It does exactly what it says it does. It's clear, obvious, and therefore very difficult to get wrong.
The power of coroutines is in being able to partially accomplish some task, wait for a condition to become true, then move on to the next task.
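For readers without a Lua host to hand, the same resume-once-per-frame pattern can be tried with a Python generator. This is only an analogue of the script above: the Entity class, its one-unit-per-frame movement, and run_until_done are hypothetical stand-ins for the game engine.

```python
class Entity:
    """Hypothetical game entity that walks one unit per frame towards its target."""

    def __init__(self, position=0):
        self.position = position
        self.target = position

    def go_to(self, target):
        self.target = target

    def update(self):
        if self.position < self.target:
            self.position += 1
        elif self.position > self.target:
            self.position -= 1

    def is_at(self, location):
        return self.position == location


def cutscene(entity_z, loc_z):
    """Script side: issue a command, then yield each frame until it completes."""
    entity_z.go_to(loc_z)
    while True:
        yield  # hand control back to the engine until the next frame
        if entity_z.is_at(loc_z):
            break


def run_until_done(script, entities):
    """Engine side: resume the script once per frame; return the frames simulated."""
    frames = 0
    while True:
        try:
            next(script)
        except StopIteration:
            return frames
        for entity in entities:
            entity.update()
        frames += 1
```

For example, run_until_done(cutscene(z, 3), [z]) with z = Entity(0) resumes the script each frame and returns once z has reached location 3.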

Coroutines in a game:
Easy to use, Easy to screw up when used in many places.
Just be careful and not use it in many places.
Don't make your Entire AI code dependent on Coroutines.
Coroutines are good for making a quick fix when a state is introduced which did not exist before.
This is exactly what java does. Sleep() and Wait()
Both functions are the best ways to make it impossible to debug your game.
If I were you I would completely avoid any code which has to use a Wait() function like a Coroutine does.
OpenGL API is something you should take note of. It never uses a wait() function but instead uses a clean state machine which knows exactly what state what object is at.
If you use coroutines you end up with so many stateless pieces of code that it most surely will be overwhelming to debug.
Coroutines are good when you are making an application like Text Editor ..bank application .. server ..database etc (not a game).
Bad when you are making a game where anything can happen at any point of time, you need to have states.
So, in my view coroutines are a bad way of programming and an excuse to write small stateless code.
But that's just me.

It's more like a religion. Some people believe in coroutines, some don't. The use case, the implementation and the environment together determine whether there is a benefit or not.
Don't trust benchmarks which try to prove that coroutines on a multicore CPU are faster than a loop in a single thread: it would be a shame if it were slower!
If this runs later on some hardware where all cores are always under load, it will turn out to be slower - oops...
So there is no benefit per se.
Sometimes it's convenient to use. But if you end up with tons of coroutines yielding and states that went out of scope you'll curse coroutines. But at least it isn't the coroutines framework, it's still you.

We use them on a project I am working on. The main benefit for us is that sometimes with asynchronous code, there are points where it is important that certain parts are run in order because of some dependencies. If you use coroutines, you can force one process to wait for another process to complete. They aren't the only way to do this, but they can be a lot simpler than some other methods.

I'm just not getting how different they are from functions except that
they can pause and let another run for a second.
That's a pretty important property. I worked on a game engine which used them for timing. For example, we had an engine that ran at 10 ticks a second, and you could WaitTicks(x) to wait x number of ticks, and in the user layer, you could run WaitFrames(x) to wait x frames.
Even professional native concurrency libraries use the same kind of yielding behaviour.
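The tick-based waiting described above can be imitated in a few lines with Python generators. The names wait_ticks, blink, and run are illustrative only; they are not the API of the engine mentioned, just a sketch of how a WaitTicks(x) style helper might be layered on top of yielding.

```python
def wait_ticks(n):
    """Sub-generator: suspend the calling script for n ticks."""
    for _ in range(n):
        yield


def blink(log):
    """Example script: reads top to bottom, but spreads over several ticks."""
    log.append("on")
    yield from wait_ticks(3)
    log.append("off")
    yield from wait_ticks(2)
    log.append("done")


def run(script, ticks):
    """Advance the script by `ticks` ticks; return False once it has finished."""
    for _ in range(ticks):
        try:
            next(script)
        except StopIteration:
            return False
    return True
```

The script keeps its local state between ticks without any explicit state machine, which is the property the answer above calls important.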

Lots of good examples for game developers. I'll give another in the application extension space. Consider the scenario where the application has an engine that can run a user's routines in Lua while doing the core functionality in C. If the user needs to wait for the engine to get to a specific state (e.g. waiting for data to be received), you either have to:
-multi-thread the C program to run Lua in a separate thread and add in locking and synchronization methods,
-abend the Lua routine and retry from the beginning with a state passed to the function to skip anything, lest you rerun some code that should only be run once, or
-yield the Lua routine and resume it once the state has been reached in C
The third option is the easiest for me to implement, avoiding the need to handle multi-threading on multiple platforms. It also allows the user's code to run unmodified, appearing as if the function they called took a long time.

Automatically adjusting process priorities under Linux

I'm trying to write a program that automatically sets process priorities based on a configuration file (basically path - priority pairs).
I thought the best solution would be a kernel module that replaces the execve() system call. Too bad, the system call table isn't exported in kernel versions > 2.6.0, so it's not possible to replace system calls without really ugly hacks.
I do not want to do the following:
-Replace binaries with shell scripts, that start and renice the binaries.
-Patch/recompile my stock Ubuntu kernel
-Do ugly hacks like reading kernel executable memory and guessing the syscall table location
-Polling of running processes
I really want to be:
-Able to control the priority of any process based on its executable path, and a configuration file. Rules apply to any user.
Does anyone of you have any ideas on how to complete this task?

If you've settled for a polling solution, most of the features you want to implement already exist in the Automatic Nice Daemon. You can configure nice levels for processes based on process name, user and group. It's even possible to adjust process priorities dynamically based on how much CPU time it has used so far.

Sometimes polling is a necessity, and even more optimal in the end -- believe it or not. It depends on a lot of variables.
If the polling overhead is low enough, polling is far preferable to the added complexity, cost, and RISK of developing your own style of kernel hooks to get notified of the changes you need. That said, when hooks or notification events are available, or can be easily injected, they should certainly be used if the situation calls for it.
This is classic programmer 'perfection' thinking. As engineers, we strive for perfection. This is the real world though and sometimes compromises must be made. Ironically, the more perfect solution may be the less efficient one in some cases.
I develop a similar 'process and process priority optimization automation' tool for Windows called Process Lasso (not an advertisement, it's free). I had a similar choice to make and have a hybrid solution in place. Kernel mode hooks are available for certain process related events in Windows (creation and destruction), but they not only aren't exposed at user mode, but also aren't helpful at monitoring other process metrics. I don't think any OS is going to natively inform you of any change to any process metric. The overhead for that many different hooks might be much greater than simple polling.
Lastly, considering the HIGH frequency of process changes, it may be better to handle all changes at once (polling at interval) vs. notification events/hooks, which may have to be processed many more times per second.
You are RIGHT to stay away from scripts. Why? Because they are slow(er). Of course, the Linux scheduler does a fairly good job at handling CPU bound threads by downgrading their priority and rewarding (upgrading) the priority of I/O bound threads -- so even under high load a script should be responsive, I guess.

There's another point of attack you might consider: replace the system's dynamic linker with a modified one which applies your logic. (See this paper for some nice examples of what's possible from the largely neglected art of linker hacking).
Where this approach will have problems is with purely statically linked binaries. I doubt there's much on a modern system which actually doesn't link something dynamically (things like busybox-static being the obvious exceptions, although you might regard the ability to get a minimal shell outside of your controls as a feature when it all goes horribly wrong), so this may not be a big deal. On the other hand, if the priority policies are intended to bring some order to an overloaded shared multi-user system then you might see smart users preparing static-linked versions of apps to avoid linker-imposed priorities.

Sure, just iterate through /proc/nnn/exe to get the pathname of the running image. Only use the ones with slashes, the others are kernel procs.
Check to see if you have already processed that one, otherwise look up the new priority in your configuration file and use renice(8) to tweak its priority.
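A minimal sketch of that polling loop, assuming the configuration has already been loaded into a dict mapping executable paths to nice values. It calls os.setpriority directly rather than spawning renice(8); the function names and the injectable setprio parameter are illustrative. Note that lowering a nice value normally requires root, and that remembering PIDs in a set does not account for PID reuse.

```python
import os


def running_executables(proc_root="/proc"):
    """Yield (pid, path) for each process whose exe link can be read."""
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            path = os.readlink(os.path.join(proc_root, entry, "exe"))
        except OSError:
            continue  # kernel thread, exited process, or no permission
        if "/" in path:
            yield int(entry), path


def apply_priorities(rules, seen, proc_root="/proc", setprio=os.setpriority):
    """Renice every not-yet-seen process whose executable path appears in `rules`."""
    changed = []
    for pid, path in running_executables(proc_root):
        if pid in seen:
            continue
        seen.add(pid)
        if path in rules:
            try:
                setprio(os.PRIO_PROCESS, pid, rules[path])
            except OSError:
                continue  # process vanished or insufficient privileges
            changed.append((pid, path))
    return changed
```

A daemon would call apply_priorities(rules, seen) in a loop with a short sleep between passes, keeping the same seen set across calls.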

If you want to do it as a kernel module then you could look into making your own binary loader. See the following kernel source files for examples:
$KERNEL_SOURCE/fs/binfmt_elf.c
$KERNEL_SOURCE/fs/binfmt_misc.c
$KERNEL_SOURCE/fs/binfmt_script.c
They can give you a first idea where to start.
You could just modify the ELF loader to check for an additional section in ELF files and when found use its content for changing scheduling priorities. You then would not even need to manage separate configuration files, but simply add a new section to every ELF executable you want to manage this way and you are done. See objcopy/objdump of the binutils tools for how to add new sections to ELF files.

Does anyone of you have any ideas on how to complete this task?
As an idea, consider using apparmor in complain-mode. That would log certain messages to syslog, which you could listen to.

If the processes in question are started by executing an executable file with a known path, you can use the inotify mechanism to watch for events on that file. Executing it will trigger an IN_OPEN and an IN_ACCESS event.
Unfortunately, this won't tell you which process caused the event to trigger, but you can then check which /proc/*/exe are a symlink to the executable file in question and renice the process id in question.
E.g. here is a crude implementation in Perl using Linux::Inotify2 (which, on Ubuntu, is provided by the liblinux-inotify2-perl package):
perl -MLinux::Inotify2 -e '
    use warnings;
    use strict;
    my $x = shift(@ARGV);    # path of the executable to watch
    my $w = new Linux::Inotify2;
    $w->watch($x, IN_ACCESS, sub
    {
        # Find every process currently running $x and run the command on its pid.
        for (glob("/proc/*/exe"))
        {
            if (-r $_ && readlink($_) eq $x && m#^/proc/(\d+)/#)
            {
                system(@ARGV, $1)
            }
        }
    });
    1 while $w->poll
' /bin/ls renice
You can of course save the Perl code to a file, say onexecuting, prepend a first line #!/usr/bin/env perl, make the file executable, put it on your $PATH, and from then on use onexecuting /bin/ls renice.
Then you can use this utility as a basis for implementing various policies for renicing executables. (or doing other things).

Tracing pthread scheduling

What I want to do is create some kind of graph detailing the execution of (two) threads in Linux. I don't need to see what the threads do, just when they are scheduled and for how long, a time line basically.
I've spent the last few hours searching the internet for a way to trace the scheduling of pthreads. Unfortunately, the two projects I found require either kernel recompilation (LTTng) or glibc patching (NPTL Trace Tool), neither of which I can do (large, centrally managed system, on which I have no sudo rights).
Is there any other way to do something like this or will I have to resort to finding a laptop on which I can patch/recompile whatever I want?
Best regards
PS: I would have linked to both projects, but the site doesn't allow me (reputation < 10). The first search result on Google for the project names is the correct one though.

Superuser privileges are not needed to build an instrumented glibc / libpthread.so. The ptt_trace program that is part of NPTL Trace Tool will run your program using the instrumented library.

Maybe something like Intel's VTune?

There is also a tool called pthreadw (on sourceforge)
It's a wrapper library which intercepts calls to the usual functions of the pthread library, and reports stats, like typical times spent playing with locks, condition variables, etc...
It is not currently able to export traces, only textual summary reports.

How do I get the current state of a thread (e.g. blocking, suspended, running, etc..) in win32?

I couldn't find a documented API that yields this information.
A friend suggested I use NtQuerySystemInformation. After looking it up, the information is there (see SYSTEM_THREAD) but it is undocumented, and not very elegant: I get the information for all threads in the system.
Do you know of a more elegant, preferably documented API to do this?

There is no other way than using NtQuerySystemInformation.
It is true that this could be less complicated, but Microsoft does not provide a simpler API.
I posted a working class here that is very elegant to use:
How to get thread state (e.g. suspended), memory + CPU usage, start time, priority, etc
