Is Entrance into a Windows Critical Section an atomic operation? - multithreading

I wrote an FFI for critical sections, and I wrote a test for it in Haxe.
Tests run in order defined (public functions are tests)
This test test_critical_section will intermittently hang and fail:
1 var criticalSection:CriticalSection;
2
3 #if master
4 public function test_init_critical_section() {
5 return assert(attempt({
6 criticalSection = synch.SynchLib.critical_section_init(SPIN_COUNT);
7 trace('criticalSection: $criticalSection');
8 }));
9 }
10 var criticalValue = 0;
11 var done = 0;
12 var numThreads = 50;
13 function work_in_critical_section(ID:Int, a:AssertionBuffer) {
14 sys.thread.Thread.create(() -> {
15 inline function threadMsg(msg:String)
16 trace('Thread ID $ID: $msg');
17
18
19 threadMsg("Attempting to enter critical section");
20 criticalSection.critical_section_enter();
21 threadMsg("Entering critical section. Doing work.");
22 Sys.sleep(Std.random(100)/500); // simulate work in section
23 criticalValue+= 10;
24 done++;
25 a.assert(criticalValue == done * 10);
26 threadMsg("Leaving critical section. Work done. done: " + done);
27 criticalSection.critical_section_leave();
28 if (done == numThreads) {
29 a.assert(criticalValue == numThreads * 10);
30 a.done();
31
32 }
33 });
34 }
35 @:timeout(30000)
36 public function test_critical_section() {
37 var a = new AssertionBuffer();
38 for (i in 0...numThreads)
39 work_in_critical_section(i, a);
40 return a;
41 }
But when I add Sys.sleep(ID/5); just before entrance into the critical section (on the blank line 18), the test passes every single time (with any number of threads). Without it, the test fails randomly (more often with a higher number of threads).
My conclusion from this test is that entrance to a critical section is not atomic, and multiple threads simultaneously attempting to enter may leave the critical section in an undefined state (leading to undefined/hanging behavior).
Is this the right conclusion or am I simply mis-using critical sections (and thus, the test needs to be re-written)? And if it is the right conclusion.. does this not mean that entrance into the critical section needs its own atomic locking/synchronization mechanism..? (and further, if that is the case.. what is the point of critical sections, why would I not just use whatever that atomic synchronization mechanism is?)
To me, this seems problematic, for example, consider 10 threads meet at a synchronization barrier (with a capacity of 10), and then all 10 need to proceed through a critical section immediately after the 10th thread arrives, does that mean I'd have to synchronize/serialize access to the critical section entrance method (for instance, by sleeping such as to ensure only one thread attempts to enter the section at a given tick, as done to fix the failing test above)?
The FFI is written on top of synchapi.h (see EnterCriticalSection).

You read done outside the critical section. That is a race condition. If you want to look at the value of done, you need to do it before you leave the critical section.
You might see a write to done from another thread, triggering the assert before the write to criticalValue is visible to the thread that saw the write to done.
If the critical section protects criticalValue and done, then it is an error to access either of them without being in the critical section unless you are sure every thread that might access them has terminated. Your code violates this rule.
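A minimal sketch of the rule above, in C++ rather than Haxe. It assumes std::mutex as a stand-in for the Windows CRITICAL_SECTION, and run_workers and Result are hypothetical names used only for illustration. Each thread decides whether it was the last one while it still holds the lock, and only uses that local snapshot after leaving:

```cpp
#include <atomic>
#include <mutex>
#include <thread>
#include <vector>

struct Result {
    int finishers;      // threads that concluded "I was the last one"
    int criticalValue;
    int done;
};

Result run_workers(int numThreads) {
    std::mutex section;            // stand-in for the CRITICAL_SECTION
    int criticalValue = 0;         // protected by section
    int done = 0;                  // protected by section
    std::atomic<int> finishers{0};

    std::vector<std::thread> threads;
    for (int id = 0; id < numThreads; ++id) {
        threads.emplace_back([&] {
            bool iAmLast = false;
            {
                std::lock_guard<std::mutex> guard(section);
                criticalValue += 10;
                ++done;
                // Snapshot the shared state while the lock is still held.
                iAmLast = (done == numThreads);
            }
            // Outside the lock, only the local snapshot is consulted.
            if (iAmLast)
                finishers.fetch_add(1);
        });
    }
    for (auto& t : threads) t.join();
    // All workers have terminated, so unlocked reads are safe here.
    return {finishers.load(), criticalValue, done};
}
```

Calling run_workers(50) should report exactly one finisher, because only one thread can observe done == numThreads while it is inside the section.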

Related

How many promises can Perl 6 keep?

That's a bit of a glib title, but in playing around with Promises I wanted to see how far I could stretch the idea. In this program, I make it so I can specify how many promises I want to make.
The default value in the thread scheduler is 16 threads (rakudo/ThreadPoolScheduler.pm)
If I specify more than that number, the program hangs but I don't get a warning (say, like "Too many threads").
If I set RAKUDO_MAX_THREADS, I can stop the program hanging but eventually there is too much thread competition to run.
I have two questions, really.
How would a program know how many more threads it can make? That's slightly more than the number of promises, for what that's worth.
How would I know how many threads I should allow, even if I can make more?
This is Rakudo 2017.01 on my puny Macbook Air with 4 cores:
my $threads = @*ARGS[0] // %*ENV<RAKUDO_MAX_THREADS> // 1;
put "There are $threads threads";
my $channel = Channel.new;
# start some promises
my @promises;
for 1 .. $threads {
    @promises.push: start {
        react {
            whenever $channel -> $i {
                say "Thread {$*THREAD.id} got $i";
            }
        }
    }
}
put "Done making threads";
for ^100 { $channel.send( $_ ) }
put "Done sending";
$channel.close;
await |@promises;
put "Done!";
This isn't actually about Promise per se, but rather about the thread pool scheduler. A Promise itself is just a synchronization construct. The start construct actually does two things:
Ensures a fresh $_, $/, and $! inside of the block
Calls Promise.start with that block
And Promise.start also does two things:
Creates and returns a Promise
Schedules the code in the block to be run on the thread pool, and arranges that successful completion keeps the Promise and an exception breaks the Promise.
It's not only possible, but also relatively common, to have Promise objects that aren't backed by code on the thread pool. Promise.in, Promise.anyof and Promise.allof factories don't immediately schedule anything, and there are all kinds of uses of a Promise that involve doing Promise.new and then calling keep or break later on. So I can easily create and await on 1000 Promises:
my @p = Promise.new xx 1000;
start { sleep 1; .keep for @p };
await @p;
say 'done' # completes, no trouble
Similarly, a Promise is not the only thing that can schedule code on the ThreadPoolScheduler. The many things that return Supply (like intervals, file watching, asynchronous sockets, asynchronous processes) all schedule their callbacks there too. It's possible to throw code there fire-and-forget style by doing $*SCHEDULER.cue: { ... } (though often you care about the result, or any errors, so it's not especially common).
The current Perl 6 thread pool scheduler has a configurable but enforced upper limit, which defaults to 16 threads. If you create a situation where all 16 are occupied but unable to make progress, and the only thing that can make progress is stuck in the work queue, then deadlock will occur. This is nothing unique to Perl 6 thread pool; any bounded pool will be vulnerable to this (and any unbounded pool will be vulnerable to using up all resources and getting the process killed :-)).
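The same situation can be sketched outside Perl 6. The following C++ is only an illustration under assumptions: TinyPool, Gate and releaser_ran are hypothetical names, and this is not how the Rakudo scheduler is implemented. Every worker of a fixed-size pool is occupied by a job waiting on a gate, while the only job that would open the gate sits in the queue behind them. A timeout and an outside "rescue" are used so the demonstration terminates instead of hanging:

```cpp
#include <chrono>
#include <condition_variable>
#include <cstddef>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// One-shot gate: waiters block until release() is called (idempotent).
struct Gate {
    std::mutex m;
    std::condition_variable cv;
    bool open = false;
    void release() {
        { std::lock_guard<std::mutex> lk(m); open = true; }
        cv.notify_all();
    }
    void wait() {
        std::unique_lock<std::mutex> lk(m);
        cv.wait(lk, [this] { return open; });
    }
    bool wait_for_ms(int ms) {
        std::unique_lock<std::mutex> lk(m);
        return cv.wait_for(lk, std::chrono::milliseconds(ms), [this] { return open; });
    }
};

// Hypothetical fixed-size pool with a FIFO work queue.
class TinyPool {
public:
    explicit TinyPool(std::size_t n) {
        for (std::size_t i = 0; i < n; ++i)
            workers_.emplace_back([this] { run(); });
    }
    ~TinyPool() {
        { std::lock_guard<std::mutex> lk(m_); stopping_ = true; }
        cv_.notify_all();
        for (auto& w : workers_) w.join();
    }
    void cue(std::function<void()> job) {
        { std::lock_guard<std::mutex> lk(m_); jobs_.push(std::move(job)); }
        cv_.notify_one();
    }
private:
    void run() {
        for (;;) {
            std::function<void()> job;
            {
                std::unique_lock<std::mutex> lk(m_);
                cv_.wait(lk, [this] { return stopping_ || !jobs_.empty(); });
                if (jobs_.empty()) return;   // stopping and queue drained
                job = std::move(jobs_.front());
                jobs_.pop();
            }
            job();
        }
    }
    std::mutex m_;
    std::condition_variable cv_;
    std::queue<std::function<void()>> jobs_;
    std::vector<std::thread> workers_;
    bool stopping_ = false;
};

// Returns true if the queued "releaser" job managed to run within 200 ms.
bool releaser_ran(std::size_t pool_size, std::size_t blockers) {
    Gate gate;   // what the blocking jobs wait on
    Gate ran;    // opened once the releaser job actually runs
    bool ok;
    {
        TinyPool pool(pool_size);
        for (std::size_t i = 0; i < blockers; ++i)
            pool.cue([&gate] { gate.wait(); });
        pool.cue([&ran, &gate] { ran.release(); gate.release(); });
        ok = ran.wait_for_ms(200);
        gate.release();   // rescue from outside the pool so the demo can exit
    }                     // the pool destructor drains the queue and joins
    return ok;
}
```

With releaser_ran(2, 2) both workers are stuck and the releaser never gets a thread; with one more worker, releaser_ran(3, 2), it runs and everything completes.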
As mentioned in another post, Perl 6.d will make await and react non-blocking constructs; this has always been the plan, but there were insufficient development resources to realize it in time for Perl 6.c. The use v6.d.PREVIEW pragma provides early access to this feature. (Also, fair warning, it's a work in progress.) The upshot of this is that an await or react on a thread owned by the thread pool will pause the execution of the scheduled code (for those curious, by taking a continuation) and allow the thread to get on with further work. The resumption of the code will be scheduled when the awaited thing completes, or the react block gets done. Note that this means you can be on a different OS thread before and after the await or react in 6.d. (Most Perl 6 users will not need to care about this. It's mostly relevant for those writing bindings to C libraries, or doing other systems-y stuff. And a good C library binding will make it so users of the binding don't have to care.)
The upcoming 6.d change doesn't eliminate the possibility of exhausting the thread pool, but it will mean that a bunch of the ways you can do so in 6.c will no longer be of concern (of note, writing recursive divide-and-conquer things that await the results of the divided parts, or having thousands of active react blocks launched with start react { ... }).
Looking forward, the thread pool scheduler itself will also become smarter. What follows is speculation, though given I'll likely be the one implementing the changes it's probably the best speculation on offer. :-) The thread pool will start following the progress being made, and use it to dynamically tune the pool size. This will include noticing that no progress is being made and, combined with the observation that the work queues contain items, adding threads to try and resolve the deadlock - at the cost of memory overhead of added threads. Today the thread pool conservatively tends to spawn up to its maximum size anyway, even if this is not a particularly optimal choice; most likely some kind of hill-climbing algorithm will be used to try and settle on an optimal number instead. Once that happens, the default max_threads can be raised substantially, so that more programs will - at the cost of a bunch of memory overhead - be able to complete, but most will run with just a handful of threads.
Quick fix, add use v6.d.PREVIEW; on the first line.
This fixes a number of thread exhaustion issues.
I added a few other changes like $*SCHEDULER.max_threads, and adding the Promise “id” so that it is easy to see that the Thread id doesn't necessarily correlate with a given Promise.
#! /usr/bin/env perl6
use v6.d.PREVIEW; # <--
my $threads = @*ARGS[0] // $*SCHEDULER.max_threads;
put "There are $threads threads";
my $channel = Channel.new;
# start some promises
my @promises;
for 1 .. $threads {
    @promises.push: start {
        react {
            whenever $channel -> $i {
                say "Thread $*THREAD.id() ($_) got $i";
            }
        }
    }
}
put "Done making threads";
for ^100 { $channel.send( $_ ) }
put "Done sending";
$channel.close;
await @promises;
put "Done!";
There are 16 threads
Done making threads
Thread 4 (14) got 0
Thread 4 (14) got 1
Thread 8 (8) got 3
Thread 10 (6) got 4
Thread 6 (1) got 5
Thread 16 (5) got 2
Thread 3 (16) got 7
Thread 7 (8) got 8
Thread 7 (9) got 9
Thread 5 (3) got 6
Thread 3 (6) got 10
Thread 11 (2) got 11
Thread 14 (5) got 12
Thread 4 (16) got 13
Thread 16 (15) got 14 # <<
Thread 13 (11) got 15
Thread 4 (15) got 16 # <<
Thread 4 (15) got 17 # <<
Thread 4 (15) got 18 # <<
Thread 11 (15) got 19 # <<
Thread 13 (15) got 20 # <<
Thread 3 (15) got 21 # <<
Thread 9 (13) got 22
Thread 18 (15) got 23 # <<
Thread 18 (15) got 24 # <<
Thread 8 (13) got 25
Thread 7 (15) got 26 # <<
Thread 3 (15) got 27 # <<
Thread 7 (15) got 28 # <<
Thread 8 (15) got 29 # <<
Thread 13 (13) got 30
Thread 14 (13) got 31
Thread 8 (13) got 32
Thread 6 (13) got 33
Thread 9 (15) got 34 # <<
Thread 13 (15) got 35 # <<
Thread 9 (15) got 36 # <<
Thread 16 (15) got 37 # <<
Thread 3 (15) got 38 # <<
Thread 18 (13) got 39
Thread 3 (15) got 40 # <<
Thread 7 (14) got 41
Thread 12 (15) got 42 # <<
Thread 15 (15) got 43 # <<
Thread 4 (1) got 44
Thread 11 (1) got 45
Thread 7 (15) got 46 # <<
Thread 8 (15) got 47 # <<
Thread 7 (15) got 48 # <<
Thread 17 (15) got 49 # <<
Thread 10 (10) got 50
Thread 10 (15) got 51 # <<
Thread 11 (14) got 52
Thread 6 (8) got 53
Thread 5 (13) got 54
Thread 11 (15) got 55 # <<
Thread 11 (13) got 56
Thread 3 (13) got 57
Thread 7 (13) got 58
Thread 16 (16) got 59
Thread 5 (15) got 60 # <<
Thread 5 (15) got 61 # <<
Thread 6 (15) got 62 # <<
Thread 5 (15) got 63 # <<
Thread 5 (15) got 64 # <<
Thread 17 (11) got 65
Thread 15 (15) got 66 # <<
Thread 17 (15) got 67 # <<
Thread 11 (13) got 68
Thread 10 (15) got 69 # <<
Thread 3 (15) got 70 # <<
Thread 11 (15) got 71 # <<
Thread 6 (15) got 72 # <<
Thread 16 (13) got 73
Thread 6 (13) got 74
Thread 17 (15) got 75 # <<
Thread 4 (13) got 76
Thread 8 (13) got 77
Thread 12 (15) got 78 # <<
Thread 6 (11) got 79
Thread 3 (15) got 80 # <<
Thread 11 (13) got 81
Thread 7 (13) got 82
Thread 4 (15) got 83 # <<
Thread 7 (15) got 84 # <<
Thread 7 (15) got 85 # <<
Thread 10 (15) got 86 # <<
Thread 7 (15) got 87 # <<
Thread 12 (13) got 88
Thread 3 (13) got 89
Thread 18 (13) got 90
Thread 6 (13) got 91
Thread 18 (13) got 92
Thread 15 (15) got 93 # <<
Thread 16 (15) got 94 # <<
Thread 12 (15) got 95 # <<
Thread 17 (15) got 96 # <<
Thread 11 (13) got 97
Thread 15 (16) got 98
Thread 18 (7) got 99
Done sending
Done!

Fetch and Add description wrong?

I am trying to understand how to use fetch and add (atomic operation) in a lock implementation.
I came across this article in Wikipedia, I found it duplicated in at least one other place. The implementation does not make sense and looks to me to have a bug or more in it. Of course I could be missing a subtle point and not really understanding what is being described.
From https://en.wikipedia.org/wiki/Fetch-and-add
<< atomic >>
function FetchAndAdd(address location, int inc) {
    int value := *location
    *location := value + inc
    return value
}
record locktype {
    int ticketnumber
    int turn
}
procedure LockInit( locktype* lock ) {
    lock.ticketnumber := 0
    lock.turn := 0
}
procedure Lock( locktype* lock ) {
    int myturn := FetchAndIncrement( &lock.ticketnumber ) //must be atomic, since many threads might ask for a lock at the same time
    while lock.turn ≠ myturn
        skip // spin until lock is acquired
}
procedure UnLock( locktype* lock ) {
    FetchAndIncrement( &lock.turn ) //this need not be atomic, since only the possessor of the lock will execute this
}
According to the article they first do LockInit. FetchAndIncrement calls FetchAndAdd with inc set to 1.
If this does not contain a bug I do not understand how it could possibly work.
The first thread to access it will get it:
lock.ticketnumber = 1
lock.turn = 0.
Let's say 5 more accesses to the lock happen before it is released.
lock.ticketnumber = 6
lock.turn = 0
First thread releases the lock.
lock.ticketnumber = 6
lock.turn = 1
Next thread comes in and the status would be
lock.ticketnumber = 7
lock.turn = 1
And the returned value: myturn = 6 (lock.ticketnumber before the faa).
In this case the:
while lock.turn ≠ myturn
can never be true.
Is there a bug in this illustration or am I missing something?
If there is a bug in this implementation what would fix it?
Thanx
Julian
Dang it, I see it now. I found it referring to a general description of the algorithm and then I looked at it more closely.
When a thread calls Lock, it takes a ticket once and then spins on the value it got back; for some reason I was thinking it kept calling that function.
While it spins, it waits for the other threads to increment turn until turn eventually equals myturn.
Sorry for wasting your time.
https://en.wikipedia.org/wiki/Ticket_lock
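For anyone else who trips over the same thing, here is a rough C++ sketch of the ticket lock described above, using std::atomic. The names TicketLock, next_ticket_, now_serving_ and count_with_ticket_lock are my own, not from the Wikipedia article. The key point is that fetch_add runs exactly once per lock() call, and the loop then spins on the ticket it returned:

```cpp
#include <atomic>
#include <thread>
#include <vector>

class TicketLock {
public:
    void lock() {
        // Take a ticket exactly once; fetch_add returns the value
        // held before the increment.
        const unsigned my_turn =
            next_ticket_.fetch_add(1, std::memory_order_relaxed);
        // Spin on that same ticket until it is being served.
        while (now_serving_.load(std::memory_order_acquire) != my_turn)
            std::this_thread::yield();
    }
    void unlock() {
        // Only the current holder writes now_serving_, so a plain
        // load followed by a store is enough here.
        const unsigned next =
            now_serving_.load(std::memory_order_relaxed) + 1;
        now_serving_.store(next, std::memory_order_release);
    }
private:
    std::atomic<unsigned> next_ticket_{0};   // "ticketnumber" in the pseudocode
    std::atomic<unsigned> now_serving_{0};   // "turn" in the pseudocode
};

// Increments a plain counter from several threads under the lock.
long count_with_ticket_lock(int threads, int iterations) {
    TicketLock lock;
    long counter = 0;   // protected by lock
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([&] {
            for (int i = 0; i < iterations; ++i) {
                lock.lock();
                ++counter;
                lock.unlock();
            }
        });
    }
    for (auto& th : pool) th.join();
    return counter;
}
```

In the scenario from the question, the thread that receives ticket 6 simply keeps spinning while turn goes 1, 2, 3, 4, 5 and proceeds once turn reaches 6.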

Atomic compare exchange. Is it correct?

I want to atomically add 1 to a counter under certain conditions, but I'm not sure if following is correct in a threaded environment:
void UpdateCounterAndLastSessionIfMoreThan60Seconds() const {
    auto currentTime = timeProvider->GetCurrentTime();
    auto currentLastSession = lastSession.load();
    bool shouldIncrement = (currentTime - currentLastSession >= 1 * 60);
    if (shouldIncrement) {
        auto isUpdate = lastSession.compare_exchange_strong(currentLastSession, currentTime);
        if (isUpdate)
            changes.fetch_add(1);
    }
}
private:
    std::shared_ptr<Time> timeProvider;
    mutable std::atomic<time_t> lastSession;
    mutable std::atomic<uint32_t> changes;
I don't want to increment changes multiple times if 2 threads simultaneously evaluate to shouldIncrement = true and isUpdate = true also (only one should increment changes in that case)
I'm no C++ expert, but it looks to me like you've got a race condition between the evaluation of "isUpdate" and the call to "fetch_add(1)".
So I think the answer to your question "Is this thread safe?" is "No, it is not".
It is at least a bit iffy, as the following scenario shows:
First thread 1 does these:
auto currentTime = timeProvider->GetCurrentTime();
auto currentLastSession = lastSession.load();
bool shouldIncrement = (currentTime - currentLastSession >= 1 * 60);
Then thread 2 executes the same three statements, but its currentTime is later than thread 1's.
Then thread 1 continues and updates lastSession with its time, which is earlier than thread 2's time.
Then thread 2 gets its turn but fails to update lastSession, because thread 1 has already changed the value.
So the end result is that lastSession is outdated, because thread 2 failed to update it to the latest value. This might not matter in all cases, and the situation might be fixed very soon after, but it's an ugly corner case which might break some assumptions somewhere, if not with the current code then after some later changes.
Another thing to note is that lastSession and changes are not atomically in sync. Other threads will occasionally see a changed lastSession with the changes counter not yet incremented for that change. Again, this might not matter, but it's easy to forget something like this and accidentally write code which assumes they are in sync.
I'm not immediately sure if you can make this 100% safe with just atomics. Wrap it in a mutex instead.
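A rough sketch of what the mutex version could look like. SessionTracker, Update, Changes and concurrent_updates are hypothetical names, and the current time is passed in as a parameter instead of coming from the question's timeProvider. Both fields are only touched while the lock is held, so other threads never see one updated without the other:

```cpp
#include <cstdint>
#include <ctime>
#include <mutex>
#include <thread>
#include <vector>

class SessionTracker {
public:
    // Returns true if this call started a new session.
    bool Update(std::time_t currentTime) {
        std::lock_guard<std::mutex> guard(mutex_);
        if (currentTime - lastSession_ < 60)
            return false;
        // Both writes happen under the same lock, so they stay in sync.
        lastSession_ = currentTime;
        ++changes_;
        return true;
    }
    std::uint32_t Changes() const {
        std::lock_guard<std::mutex> guard(mutex_);
        return changes_;
    }
private:
    mutable std::mutex mutex_;
    std::time_t lastSession_ = 0;    // protected by mutex_
    std::uint32_t changes_ = 0;      // protected by mutex_
};

// Several threads report the same timestamp at once; returns the
// resulting value of the counter.
std::uint32_t concurrent_updates(int threads, std::time_t now) {
    SessionTracker tracker;
    std::vector<std::thread> pool;
    for (int i = 0; i < threads; ++i)
        pool.emplace_back([&tracker, now] { tracker.Update(now); });
    for (auto& t : pool) t.join();
    return tracker.Changes();
}
```

If the ordering of timestamps between threads matters, one option would be to read the clock inside the locked region as well, at the cost of holding the lock slightly longer.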

high resource usage program stalls/crashes linux

I have a program that reads about 1000 images and creates a statistical summary of their contents. Each image is processed in its own thread using OpenMP, and I have the thread limit set to match my number of processors.
Until about two weeks ago, the program ran fine. Now, however, if I run the program more than once, my system slows down and eventually freezes up.
In order to troubleshoot, I wrote the simple code listed below that emulates what my program is doing. This code will freeze my system, just as my original program does, after trying to read only a few files at line 35.
I ran the program, successively reverting to an earlier kernel after each failure, and found that it fails with all 3.6 kernels up to version 3.6.8.
However, when I go back to kernel 3.5.6, it works.
test.cc:
1 #include <cstdio>
2 #include <iostream>
3 #include <vector>
4 #include <unistd.h>
5
6 using namespace std;
7
8 int main ()
9 {
10 // number of files
11 const size_t N = 1000;
12 // total system memory
13 const size_t MEM = sysconf (_SC_PHYS_PAGES) * sysconf (_SC_PAGE_SIZE);
14 // file size
15 const size_t SZ = MEM/N;
16
17 // create temp filenames
18 vector<string> fn (N);
19 for (size_t i = 0; i < fn.size (); ++i)
20 fn[i] = string (tmpnam (NULL));
21
22 // write a bunch of files to disk
23 for (size_t i = 0; i < fn.size (); ++i)
24 {
25 vector<char> a (SZ);
26 FILE *fp = fopen (fn[i].c_str (), "wb");
27 fwrite (&a[0], a.size (), 1, fp);
28 clog << fn[i] << " written" << endl;
29 }
30
31 // read a bunch of files from disk
32 #pragma omp parallel for
33 for (size_t i = 0; i < fn.size (); ++i)
34 {
35 vector<char> a (SZ);
36 FILE *fp = fopen (fn[i].c_str (), "rb");
37 fread (&a[0], a.size (), 1, fp);
38 clog << fn[i] << " read" << endl;
39 }
40
41 return 0;
42 }
Makefile:
a:
	g++ -fopenmp -Wall -o test -g test.cc
	./test
My question is: What is different about kernel 3.6 that would cause this program to fail, but does not cause it to fail in version 3.5?
Without going through the code, if you want to set some limits to your processes, have a look at cgroups for limiting resource usage.
As for the freezing - you are trying to read/write GBs of data to disk at once. Given the speeds of ~100MB/s of today's hard-drives, I would expect a freeze at the time the kernel decides to flush the caches to the disk - which will probably occur as soon as you try to read a reasonably sized chunk of data from the disk under memory pressure (since you allocated lots of memory, the space for caches is limited).
You can try to mmap() the files or change kernel I/O scheduler.
I haven't looked deeply at your code, but I noticed some bad practices (at least, I think they are):
First, the critical section inside the OpenMP loop. That is a synchronization point, and putting it in every iteration sounds problematic to me. Since each thread must be sure no other one has entered it, the overhead that synchronization introduces probably increases with the number of threads.
Second: I am not very used to C++, but I guess that every time vector<char> a (SZ) is executed, memory is allocated (and freed at the end of the block). Excuse me if I am wrong. Since you know the value of SZ beforehand, you would do better to allocate a vector<vector<char> > with as many elements as threads before the parallel region. Then, in the parallel region, each thread would access its own vector<char>.
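The second suggestion could look roughly like the sketch below. It is only an illustration: process_all, fake_read, max_threads and thread_index are hypothetical helpers, and fake_read stands in for actually reading file i. The buffers are allocated once, before the parallel region, and each thread reuses its own. The _OPENMP guards let the same code build serially when -fopenmp is not given:

```cpp
#include <cstddef>
#include <vector>
#ifdef _OPENMP
#include <omp.h>
#endif

// Upper bound on the number of threads a parallel region may use.
static int max_threads() {
#ifdef _OPENMP
    return omp_get_max_threads();
#else
    return 1;
#endif
}

// Index of the calling thread within the current parallel region.
static int thread_index() {
#ifdef _OPENMP
    return omp_get_thread_num();
#else
    return 0;
#endif
}

// Stand-in for "read file i into buf": fills the buffer with a
// byte derived from i.
static void fake_read(std::vector<char>& buf, long i) {
    for (char& c : buf) c = static_cast<char>(i % 7);
}

long process_all(long n_items, std::size_t sz) {
    // One buffer per thread, allocated once before the parallel region.
    std::vector<std::vector<char> > buffers(max_threads(),
                                            std::vector<char>(sz));
    long total = 0;
    #pragma omp parallel for reduction(+:total)
    for (long i = 0; i < n_items; ++i) {
        std::vector<char>& a = buffers[thread_index()];  // this thread's buffer
        fake_read(a, i);
        total += a[0];   // trivial per-item "summary" (assumes sz > 0)
    }
    return total;
}
```

Whether this helps with the freeze is a separate matter; it only removes the repeated allocation and deallocation inside the loop.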

Choppy SDL+OpenGL animation when vsync is on

Uint32 prev = SDL_GetTicks();
while ( true )
{
    Draw();
    Uint32 now = SDL_GetTicks();
    Uint32 delta = now - prev;
    printf( "%u\n" , delta );
    Update( delta / 1000.0f );
    prev = now;
    ProcessEvents();
}
The application is a simple moving square. My loop looks like that and when vsync is on the whole thing just runs quite smoothly; turning it off instead causes some kind of jumps of the animation. I've inserted some prints and here's what I've found:
[...]
16
15
16
66 #
2 #
0 #
0 #
16
16
21
[...]
I know there are several issues with this kind of loop but none of them seem to apply to this simple example (am I wrong?). What causes this behavior and how can I overcome it?
I'm using an ATI card on a Linux system, but I'm expecting a portable explanation/solution.
It seems that the cause was a missing glFinish() call. I've read that calls to that function are useless in most cases. Well, maybe I'm misunderstanding some fundamental concepts, but it worked for me, and now the Draw() function ends with:
[...]
glFinish();
SDL_GL_SwapBuffers();
}
