How does [branch] get executed in HLSL? - multithreading

The [branch] attribute can mark an if statement in HLSL to make it execute only one branch instead of all branches and discarding the results like when using [flatten].
My question is how can this actually work, when a branch diverges withing a warp/wavefront? As far as I know, in this case all threads must execute all branches taken by any of the threads in the warp (like when using [flatten]) which is consequence of the fact, that they are all within the same SIMD block and must execute the same instruction.

Since GeForce series 6xx GPUs do actually support branching, though in limited form and with performance cost. The [branch] and [flatten] tags are just hints to the compiler to prefer one or the other if supported and possible. It basically depends on hardware and on the driver, so different hardware or different driver versions might in the end determine a different execution from what you specified with the tag.
You can find more info online, for example check this link

Related

What is the meaning of using COMM_WORLD or COMM_SELF to instantiate a TS, DMDA, Vec, etc

I'm looking at several examples from PETSc and petsc4py and looking at the PDF user manual of PETSc. The manual states:
For those not familiar with MPI, acommunicatoris a way of indicating a collection of processes that will be involved together in a calculation or communication. Communicators have the variable type MPI_Comm. In most cases users can employ the communicator PETSC_COMM_WORLD to indicate all processes in a given run and PETSC_COMM_SELF to indicate a single process.
I believe I understand that statement, but I'm unsure of the real consequences of actually using these communicators are. I'm unsure of what really happens when you do TSCreate(PETSC_COMM_WORLD,...) vs TSCreate(PETSC_COMM_SELF,...) or likewise for a distributed array. If you created a DMDA with PETSC_COMM_SELF, does this maybe mean that the DM object won't really be distributed across multiple processes? Or if you create a TS with PETSC_COMM_SELF and a DM with PETSC_COMM_WORLD, does this mean the solver can't actually access ghost nodes? Does it effect the results of DMCreateLocalVector and DMCreateGlobalVector?
The communicator for a solver decides which processes participate in the solver operations. For example, a TS with PETSC_COMM_SELF would run independently on each process, whereas one with PETSC_COMM_WORLD would evolve a single system across all processes. If you are using a DM with the solver, the communicators must be congruent.

How to extend GHC's Thread State Object

I'd like to add two extra fields of type StgWord32 to the thread state object (TSO). Based on the information I found on the GHC-Wiki and from looking at the source code, I have extended the struct in /includes/rts/storage/TSO.h and changed the program that creates different offsets (creating DerivedConstants.h). The compiler, the rts, and a simple application re-compile, but at the end of the execution (in hs_exit_) the garbage collector complains:
internal error: scavenge_stack: weird activation record found on stack: 45
I guess it has to to with cmm and/or the STG implementation details (the offsets are generated since the structs are not visible at cmm level, correct me if I'm wrong). Is the order of fields significant? Have I missed a file that should be changed?
I use a debug build of the compiler and RTS and a rather dated ghc 6.12.3 on a 64bit architecture. Any hints to relevant documentation and comments
on the difference between ghc 6 and 7 regarding TSO handling are welcome, too.
The error that you are getting comes from: ghc/rts/sm/Scav.c. Specifically at line 1917:
default:
barf("scavenge_stack: weird activation record found on stack: %d", (int)(info->i.type));
It looks like you need to also modify ClosureTypes.h, which you can find in ghc/includes/rts/storage. This file seems to contain the different kinds of headers that can appear in a heap object. I've also run into some strange bootstrapping errors, where if I try to rebuild using the stage-1 compiler, I get the error you mentioned, but if I do a clean build, then it compiles just fine.
A workaround that turned out good enough for me was to introduce a separate data structure for each Capability that would hold the additional information for each lightweight thread. I have used a HashTable (see rts/Hash.h and .c) mapping from thread id to the custom info struct. The entries were added when the threads were created from sparks (in schduleActiveteSpark).
Timing the creation, insertion, lookup and destruction of the entries and the table showed negligible overhead for small programs. The main overhead results from the actual usage of the information and should ideally be kept outside of the innermost scheduler loop. For the THREADED_RTS build one needs to ensure that other Capabilities don't access tables that are not their own (or use a mutex if such access is required, which is potential source of additional overhead).

there is __cond_lock(x,c) define in compiler.h file , but no __cond_unlock(x,c) define?

In complier.h, there is a macro define as below:
# define __cond_lock(x,c) ((c) ? ({ __acquire(x); 1; }) : 0)
But here I have a question, that is, where there is a __cond_lock definition, but does not define the corresponding __cond_unlock, then the variable on the release, how to keep consistent between __cond_lock and __cond_unlock?
And I checked the definition of function spin_trylock (), and it is used __cond_lock, but which also used a _spin_trylock function.in _spin_trylock function, after a few calls, it will use to __acquire function in this case, the equivalent of an operation, it carried out two calculations would lead Sparse detection warning message appears, after I wrote the code for an experiment to test my judgment, is indeed a warning message will appear, if I wrote it twice unlock instruction, there is no alarm information, but this is inconsistent as program running.
Protecting critical sections using locking is up to the programmer. That means, if you hold a lock to protect a critical reason, you've must have to release the lock when you're finished.
There are various types of locking primitives inside Linux kernel like. spinlock(), spinlock_irq(), spin_trylock(). They have their own purposes. Now, spin_trylock() using __cond_lock inside of it, it's because to make sure, whether that particular lock is available for locking or it's been already taken. Take a look at few examples of how spin_trylock or __cond_lock is being used. For ex. at kernel/sched/fair.c::rebalance_domain (https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/kernel/sched/fair.c?id=d8dfad3876e4386666b759da3c833d62fb8b2267#n5574) see how the balancing is used, it's been using spin_trylock() to hold the lock and while releasing doing it conditionally. Another example could be found at kernel/posix-timers.c, lock_timer() macro. If you closely look at the uses of lock_timer() you'll find how __cond_lock is being used inside kernel and hopefully your confusion will disappear.
In other words, __cond_lock is used to hold a lock conditionally and not being used directly. It's possible to check a particular lock before releasing the lock and this what has been done so far.

Specifying runtime behavior of a program

Are there any languages/extensions that allow the programmer to define general runtime behavior of a program during specific code segments?
Some garbage-collected languages let you modify the behavior of the GC at runtime. Like in lua, the collectgarbage function lets you do this. So, for example, you can stop the GC when you want to be sure that CPU resources aren't used in garbage collection for a critical section of code (after which you start the GC again).
I'm looking for a general way to specify intended behavior of the program without resorting to specifying specific GC tweaks. I'm interested even in an on-paper sort of specification method (ie something a programmer would code toward, but not program syntax that would actually implement that behavior). The point would be that this could be used to specify critical sections of code that shouldn't be interrupted (latency dependent activity) or other intended attributes of certain codepaths (maximum time between an output and an input or two outputs, average running time, etc).
For example, this syntax might describe that maximum time latencyDependentStuff should take is 5 milliseconds:
requireMaxTime(5) {
latencyDependentStuff();
}
Has anyone seen anything like this anywhere before?

Automatically adjusting process priorities under Linux

I'm trying to write a program that automatically sets process priorities based on a configuration file (basically path - priority pairs).
I thought the best solution would be a kernel module that replaces the execve() system call. Too bad, the system call table isn't exported in kernel versions > 2.6.0, so it's not possible to replace system calls without really ugly hacks.
I do not want to do the following:
-Replace binaries with shell scripts, that start and renice the binaries.
-Patch/recompile my stock Ubuntu kernel
-Do ugly hacks like reading kernel executable memory and guessing the syscall table location
-Polling of running processes
I really want to be:
-Able to control the priority of any process based on it's executable path, and a configuration file. Rules apply to any user.
Does anyone of you have any ideas on how to complete this task?
If you've settled for a polling solution, most of the features you want to implement already exist in the Automatic Nice Daemon. You can configure nice levels for processes based on process name, user and group. It's even possible to adjust process priorities dynamically based on how much CPU time it has used so far.
Sometimes polling is a necessity, and even more optimal in the end -- believe it or not. It depends on a lot of variables.
If the polling overhead is low-enough, it far exceeds the added complexity, cost, and RISK of developing your own style kernel hooks to get notified of the changes you need. That said, when hooks or notification events are available, or can be easily injected, they should certainly be used if the situation calls.
This is classic programmer 'perfection' thinking. As engineers, we strive for perfection. This is the real world though and sometimes compromises must be made. Ironically, the more perfect solution may be the less efficient one in some cases.
I develop a similar 'process and process priority optimization automation' tool for Windows called Process Lasso (not an advertisement, its free). I had a similar choice to make and have a hybrid solution in place. Kernel mode hooks are available for certain process related events in Windows (creation and destruction), but they not only aren't exposed at user mode, but also aren't helpful at monitoring other process metrics. I don't think any OS is going to natively inform you of any change to any process metric. The overhead for that many different hooks might be much greater than simple polling.
Lastly, considering the HIGH frequency of process changes, it may be better to handle all changes at once (polling at interval) vs. notification events/hooks, which may have to be processed many more times per second.
You are RIGHT to stay away from scripts. Why? Because they are slow(er). Of course, the linux scheduler does a fairly good job at handling CPU bound threads by downgrading their priority and rewarding (upgrading) the priority of I/O bound threads -- so even in high loads a script should be responsive I guess.
There's another point of attack you might consider: replace the system's dynamic linker with a modified one which applies your logic. (See this paper for some nice examples of what's possible from the largely neglected art of linker hacking).
Where this approach will have problems is with purely statically linked binaries. I doubt there's much on a modern system which actually doesn't link something dynamically (things like busybox-static being the obvious exceptions, although you might regard the ability to get a minimal shell outside of your controls as a feature when it all goes horribly wrong), so this may not be a big deal. On the other hand, if the priority policies are intended to bring some order to an overloaded shared multi-user system then you might see smart users preparing static-linked versions of apps to avoid linker-imposed priorities.
Sure, just iterate through /proc/nnn/exe to get the pathname of the running image. Only use the ones with slashes, the others are kernel procs.
Check to see if you have already processed that one, otherwise look up the new priority in your configuration file and use renice(8) to tweak its priority.
If you want to do it as a kernel module then you could look into making your own binary loader. See the following kernel source files for examples:
$KERNEL_SOURCE/fs/binfmt_elf.c
$KERNEL_SOURCE/fs/binfmt_misc.c
$KERNEL_SOURCE/fs/binfmt_script.c
They can give you a first idea where to start.
You could just modify the ELF loader to check for an additional section in ELF files and when found use its content for changing scheduling priorities. You then would not even need to manage separate configuration files, but simply add a new section to every ELF executable you want to manage this way and you are done. See objcopy/objdump of the binutils tools for how to add new sections to ELF files.
Does anyone of you have any ideas on how to complete this task?
As an idea, consider using apparmor in complain-mode. That would log certain messages to syslog, which you could listen to.
If the processes in question are started by executing an executable file with a known path, you can use the inotify mechanism to watch for events on that file. Executing it will trigger an I_OPEN and an I_ACCESS event.
Unfortunately, this won't tell you which process caused the event to trigger, but you can then check which /proc/*/exe are a symlink to the executable file in question and renice the process id in question.
E.g. here is a crude implementation in Perl using Linux::Inotify2 (which, on Ubuntu, is provided by the liblinux-inotify2-perl package):
perl -MLinux::Inotify2 -e '
use warnings;
use strict;
my $x = shift(#ARGV);
my $w = new Linux::Inotify2;
$w->watch($x, IN_ACCESS, sub
{
for (glob("/proc/*/exe"))
{
if (-r $_ && readlink($_) eq $x && m#^/proc/(\d+)/#)
{
system(#ARGV, $1)
}
}
});
1 while $w->poll
' /bin/ls renice
You can of course save the Perl code to a file, say onexecuting, prepend a first line #!/usr/bin/env perl, make the file executable, put it on your $PATH, and from then on use onexecuting /bin/ls renice.
Then you can use this utility as a basis for implementing various policies for renicing executables. (or doing other things).

Resources