I'm interested in causing a failure in the readers/writers semaphore solution, with writers priority.
In the following code, taken from Wikipedia:
READER
P(mutex_3);
P(r);
P(mutex_1);
readcount := readcount + 1;
if readcount = 1 then P(w);
V(mutex_1);
V(r);
V(mutex_3);
reading is performed
P(mutex_1);
readcount := readcount - 1;
if readcount = 0 then V(w);
V(mutex_1);
...there is a binary semaphore mutex_3, which limits number of threads trying to get access to r semaphore, so that writers have priority.
I tried removing that mutex, expecting writers to starve, but didn't succeed.
I wrote a program in Java in which each thread waits a fixed amount of time twice: during and after reading/writing. I created one writer and 8 readers and set the wait to 1 ms for all of them. I tried to create a situation in which the r semaphore is constantly contended by one writer and many readers. None of this caused the failure I expected.
Am I doing something wrong? How can I cause writer starvation?
In this problem, taken from Wikipedia:
int readcount, writecount; (initial value = 0)
semaphore mutex_1, mutex_2, mutex_3, w, r ; (initial value = 1)
READER
P(mutex_3);
P(r);
P(mutex_1);
readcount := readcount + 1;
if readcount = 1 then P(w);
V(mutex_1);
V(r);
V(mutex_3);
reading is performed
P(mutex_1);
readcount := readcount - 1;
if readcount = 0 then V(w);
V(mutex_1);
WRITER
P(mutex_2);
writecount := writecount + 1;
if writecount = 1 then P(r);
V(mutex_2);
P(w);
writing is performed
V(w);
P(mutex_2);
writecount := writecount - 1;
if writecount = 0 then V(r);
V(mutex_2);
It will be hard to starve Readers in practice, but in theory they could starve.
The thing is that Writers have priority over Readers, so if Writers keep arriving all the time, the Readers will wait forever.
Hope it helps!
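For experimenting with this, here is a minimal Go sketch of the pseudocode above. It is a direct transcription under some assumptions: the binary semaphores are modeled as buffered channels of capacity 1, and the names rwLock and stress are made up for this example. The stress function only counts how often a writer overlaps with a reader or another writer; it does not measure starvation.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// sem models a binary semaphore as a channel holding at most one token
// (an assumption of this sketch, not part of the original pseudocode).
type sem chan struct{}

func newSem() sem { s := make(sem, 1); s <- struct{}{}; return s }
func (s sem) P()  { <-s }
func (s sem) V()  { s <- struct{}{} }

// rwLock mirrors the variables of the pseudocode; the type name is hypothetical.
type rwLock struct {
	mutex1, mutex2, mutex3, w, r sem
	readcount, writecount        int
}

func newRWLock() *rwLock {
	return &rwLock{mutex1: newSem(), mutex2: newSem(), mutex3: newSem(), w: newSem(), r: newSem()}
}

func (l *rwLock) RLock() {
	l.mutex3.P()
	l.r.P()
	l.mutex1.P()
	l.readcount++
	if l.readcount == 1 {
		l.w.P() // first reader locks out writers
	}
	l.mutex1.V()
	l.r.V()
	l.mutex3.V()
}

func (l *rwLock) RUnlock() {
	l.mutex1.P()
	l.readcount--
	if l.readcount == 0 {
		l.w.V() // last reader lets writers in
	}
	l.mutex1.V()
}

func (l *rwLock) Lock() {
	l.mutex2.P()
	l.writecount++
	if l.writecount == 1 {
		l.r.P() // first writer blocks new readers
	}
	l.mutex2.V()
	l.w.P()
}

func (l *rwLock) Unlock() {
	l.w.V()
	l.mutex2.P()
	l.writecount--
	if l.writecount == 0 {
		l.r.V() // last writer readmits readers
	}
	l.mutex2.V()
}

// stress returns how many times mutual exclusion was observed to be violated.
func stress(readers, writers, rounds int) int64 {
	l := newRWLock()
	var activeReaders, activeWriters, violations int64
	var wg sync.WaitGroup
	for i := 0; i < readers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for j := 0; j < rounds; j++ {
				l.RLock()
				atomic.AddInt64(&activeReaders, 1)
				if atomic.LoadInt64(&activeWriters) != 0 {
					atomic.AddInt64(&violations, 1)
				}
				atomic.AddInt64(&activeReaders, -1)
				l.RUnlock()
			}
		}()
	}
	for i := 0; i < writers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for j := 0; j < rounds; j++ {
				l.Lock()
				if atomic.AddInt64(&activeWriters, 1) != 1 || atomic.LoadInt64(&activeReaders) != 0 {
					atomic.AddInt64(&violations, 1)
				}
				atomic.AddInt64(&activeWriters, -1)
				l.Unlock()
			}
		}()
	}
	wg.Wait()
	return atomic.LoadInt64(&violations)
}

func main() {
	fmt.Println("violations:", stress(8, 1, 1000))
}
```

To look for starvation, one could record per-goroutine wait times around RLock and Lock while many writers run back to back; that measurement is left out here to keep the sketch short.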
semaphore mutex = 1;
semaphore barrier = 0;
int count = 0;
void barrier-done() {
wait(mutex);
count++;
if (count < N ) {
post(mutex);
wait(barrier);
}
else {
post(mutex);
count = 0;
for (int i = 1; i < N; i++) {
post(barrier);
}
}
}
Does anyone know what the problem with this code is? I'm trying to implement a barrier.
Assuming N is the number of threads you are expecting to wait for the barrier.
For example, with N=10, threads 1 to 9 will find the if condition true and will wait on the barrier.
The 10th thread calling this will find the condition false, because 10 < 10 does not hold.
So it will go ahead and post barrier 9 times.
I am not sure of the exact behavior you want to achieve, but this is what I understood from your code. You may need to tweak the if condition a bit.
I had the same issue. The problem is that you can't use a minus sign in a function name such as "barrier-done"; after fixing that bug, the code will be correct.
There are n threads. I'm trying to implement a function (pseudo code) which will directly block if it's called by a thread. Every thread will be blocked and the function will stop blocking threads if it was called by more than n/2 threads. If more than n/2 threads called the function, the function will no longer block other threads and will immediately return instead.
I did it like this but I'm not sure if I did the last part correctly where the function will immediately return if more than n/2 threads called it? :S
(Pseudocode is highly appreciated because then I have a better chance to understand it! :) )
int n = total amount of threads
sem waiter = 0
sem mutex = 1
int counter = 0
function void barrier()
int x
P(mutex)
if counter > n / 2 then
V(mutex)
for x = 0; x <= n / 2; x++;
V(waiter)
end for
end if
else
counter++
V(mutex)
P(waiter)
end else
end function
What you describe is a non-resetting barrier. Pthreads has a barrier implementation, but it is of the resetting variety.
To implement what you're after with pthreads, you will want a mutex plus a condition variable, and a shared counter. A thread entering the function locks the mutex and checks the counter. If not enough other threads have yet arrived then it waits on the CV, otherwise it broadcasts to it to wake all the waiting threads. If you wish, you can make it just the thread that tips the scale that broadcasts. Example:
struct my_barrier {
pthread_mutex_t barrier_mutex;
pthread_cond_t barrier_cv;
int threads_to_await;
};
void barrier(struct my_barrier *b) {
pthread_mutex_lock(&b->barrier_mutex);
if (b->threads_to_await > 0) {
if (--b->threads_to_await == 0) {
pthread_cond_broadcast(&b->barrier_cv);
} else {
do {
pthread_cond_wait(&b->barrier_cv, &b->barrier_mutex);
} while (b->threads_to_await);
}
}
pthread_mutex_unlock(&b->barrier_mutex);
}
Update: pseudocode
Or since a pseudocode representation is important to you, here's the same thing in a pseudocode language similar to the one used in the question:
int n = total amount of threads
mutex m
condition_variable cv
int to_wait_for = n / 2
function void barrier()
lock(mutex)
if to_wait_for == 1 then
to_wait_for = 0
broadcast(cv)
else if to_wait_for > 1 then
to_wait_for = to_wait_for - 1
wait(cv)
end if
unlock(mutex)
end function
That's slightly higher-level than your pseudocode, in that it does not assume that the mutex is implemented as a semaphore. (And with pthreads, which you tagged, you would need a pthreads mutex, not a semaphore, to go with a pthreads condition variable.) It also omits the details of the real C code that deal with spurious wakeups from waiting on the condition variable and with initializing the mutex and cv. Also, it presents the variables as if they are all globals -- such a function can be implemented that way in practice, but it is poor form.
Note also that it assumes pthreads semantics for the condition variable: that waiting on the cv will temporarily release the mutex, allowing other threads to lock it, but that a thread that waits on the cv will reacquire the mutex before itself proceeding past the wait.
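The same mutex-plus-condition-variable idea can be sketched in Go, whose sync.Cond has the same release-and-reacquire semantics on Wait. This is only an illustration: the names Latch, NewLatch and runLatch are invented for the example, and the threshold n/2 + 1 is my reading of "more than n/2 threads" from the question.

```go
package main

import (
	"fmt"
	"sync"
)

// Latch is a hypothetical name for a non-resetting barrier.
type Latch struct {
	mu      sync.Mutex
	cv      *sync.Cond
	toAwait int // arrivals still needed before the barrier opens
}

func NewLatch(threshold int) *Latch {
	l := &Latch{toAwait: threshold}
	l.cv = sync.NewCond(&l.mu)
	return l
}

// Wait blocks until `threshold` callers have arrived.
// Once the barrier has opened, every later call returns immediately.
func (l *Latch) Wait() {
	l.mu.Lock()
	defer l.mu.Unlock()
	if l.toAwait == 0 {
		return // barrier already open
	}
	l.toAwait--
	if l.toAwait == 0 {
		l.cv.Broadcast() // this caller tips the scale and wakes everyone
		return
	}
	// Loop to guard against wakeups that happen before the barrier opens.
	for l.toAwait > 0 {
		l.cv.Wait()
	}
}

// runLatch starts n goroutines that all call Wait and reports how many got through.
func runLatch(n int) int {
	l := NewLatch(n/2 + 1) // assumed threshold: "more than n/2"
	var wg sync.WaitGroup
	var mu sync.Mutex
	passed := 0
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			l.Wait()
			mu.Lock()
			passed++
			mu.Unlock()
		}()
	}
	wg.Wait()
	return passed
}

func main() {
	fmt.Println(runLatch(10)) // prints 10
}
```

Keeping the counter and the condition variable inside a struct avoids the global variables that the pseudocode uses for brevity.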
A few assumptions I am making within my answer:
P(...) is analogous to sem_wait(...)
V(...) is analogous to sem_post(...)
the barrier cannot be reset
I'm not sure if I did the last part correctly where the function will immediately return if more than n/2 threads called it
The pseudocode should work fine for the most part, but the early return/exit conditions could be significantly improved upon.
Some concerns (but nothing major):
The first time the condition counter > n / 2 is met, the waiter semaphore is signaled (i.e. V(...)) (n / 2) + 1 times (since it is from 0 to n / 2 inclusive), instead of n / 2 (which is also the value of counter at that moment).
Every subsequent invocation after counter > n / 2 is first met will also signal (i.e. V(...)) the waiter semaphore another (n / 2) + 1 times. Instead, it should early return and not re-signal.
These can be resolved with a few minor tweaks.
int n = total count of threads
sem mutex = 1;
sem waiter = 0;
int counter = 0;
bool released = FALSE;
function void barrier() {
P(mutex)
// instead of the `released` flag, could be replaced with the condition `counter > n / 2 + 1`
if released then
// ensure the mutex is released prior to returning
V(mutex)
return
end if
if counter > n / 2 then
// more than n/2 threads have tried to wait, mark barrier as released
released = TRUE
// mutex can be released at this point, as any thread acquiring `mutex` afterwards will see that `released` is TRUE and return early
V(mutex)
// release all blocked threads; counter is guaranteed to never be incremented again
int x
for x = 0; x < counter; x++
V(waiter)
end for
else
counter++
V(mutex)
P(waiter)
end if
}
I am trying to understand how to use fetch and add (atomic operation) in a lock implementation.
I came across this article in Wikipedia, I found it duplicated in at least one other place. The implementation does not make sense and looks to me to have a bug or more in it. Of course I could be missing a subtle point and not really understanding what is being described.
From https://en.wikipedia.org/wiki/Fetch-and-add
<< atomic >>
function FetchAndAdd(address location, int inc) {
int value := *location
*location := value + inc
return value
}
record locktype {
int ticketnumber
int turn
}
procedure LockInit( locktype* lock ) {
lock.ticketnumber := 0
lock.turn := 0
}
procedure Lock( locktype* lock ) {
int myturn := FetchAndIncrement( &lock.ticketnumber ) //must be atomic, since many threads might ask for a lock at the same time
while lock.turn ≠ myturn
skip // spin until lock is acquired
}
procedure UnLock( locktype* lock ) {
FetchAndIncrement( &lock.turn ) //this need not be atomic, since only the possessor of the lock will execute this
}
According to the article they first do LockInit. FetchAndIncrement calls FetchAndAdd with inc set to 1.
If this does not contain a bug I do not understand how it could possibly work.
The first thread to access it will get it:
lock.ticketnumber = 1
lock.turn = 0.
Let's say 5 more accesses to the lock happen before it is released.
lock.ticketnumber = 6
lock.turn = 0
First thread releases the lock.
lock.ticketnumber = 6
lock.turn = 1
Next thread comes in and the status would be
lock.ticketnumber = 7
lock.turn = 1
And the returned value: myturn = 6 (lock.ticketnumber before the faa).
In this case the:
while lock.turn ≠ myturn
can never be true.
Is there a bug in this illustration or am I missing something?
If there is a bug in this implementation what would fix it?
Thanx
Julian
Dang it, I see it now. I found it referring to a general description of the algorithm and then I looked at it more closely.
When a thread calls Lock it spins waiting on the value it got back, for some reason I was thinking it kept calling that function.
While it spins, it waits until other threads increment turn and turn eventually equals its own myturn.
Sorry for wasting your time.
https://en.wikipedia.org/wiki/Ticket_lock
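To make the mechanism concrete, here is a small Go sketch of a ticket lock. It is an illustration, not the Wikipedia code: the names TicketLock and countWithTicketLock are invented, and because Go's atomic.AddUint32 returns the new value, the sketch subtracts 1 to get fetch-and-add behavior. Each caller takes its ticket exactly once and then only reads turn while spinning.

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
	"sync/atomic"
)

// TicketLock is a hypothetical name; the fields mirror ticketnumber and turn.
type TicketLock struct {
	ticket uint32 // next ticket to hand out
	turn   uint32 // ticket currently being served
}

func (l *TicketLock) Lock() {
	// Take a ticket once. AddUint32 returns the new value, so subtract 1
	// to obtain the value before the increment (fetch-and-add).
	my := atomic.AddUint32(&l.ticket, 1) - 1
	// Spin until our number comes up; yield so other goroutines can run.
	for atomic.LoadUint32(&l.turn) != my {
		runtime.Gosched()
	}
}

func (l *TicketLock) Unlock() {
	// Only the lock holder gets here; an atomic add keeps the store visible
	// to the spinning readers under Go's memory model.
	atomic.AddUint32(&l.turn, 1)
}

// countWithTicketLock increments a plain int under the lock from several goroutines.
func countWithTicketLock(goroutines, perGoroutine int) int {
	var l TicketLock
	var wg sync.WaitGroup
	counter := 0
	for g := 0; g < goroutines; g++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := 0; i < perGoroutine; i++ {
				l.Lock()
				counter++
				l.Unlock()
			}
		}()
	}
	wg.Wait()
	return counter
}

func main() {
	fmt.Println(countWithTicketLock(6, 1000)) // prints 6000
}
```

In the scenario from the question, the six callers hold tickets 0 through 5; the caller with ticket 5 simply keeps spinning until turn has been incremented five times.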
I have a fairly simple Go program designed to compute random Fibonacci numbers to test some strange behavior I observed in a worker pool I wrote. When I allocate one thread, the program finishes in 1.78s. When I allocate 4, it finishes in 9.88s.
The code is as follows:
var workerWG sync.WaitGroup
func worker(fibNum chan int) {
for {
var tgt = <-fibNum
workerWG.Add(1)
var a, b float64 = 0, 1
for i := 0; i < tgt; i++ {
a, b = a+b, a
}
workerWG.Done()
}
}
func main() {
rand.Seed(time.Now().UnixNano())
runtime.GOMAXPROCS(1) // LINE IN QUESTION
var fibNum = make(chan int)
for i := 0; i < 4; i++ {
go worker(fibNum)
}
for i := 0; i < 500000; i++ {
fibNum <- rand.Intn(1000)
}
workerWG.Wait()
}
If I replace runtime.GOMAXPROCS(1) with 4, the program takes four times as long to run.
What's going on here? Why does adding more available threads to a worker pool slow the entire pool down?
My personal theory is that it has to do with the processing time of the worker being less than the overhead of thread management, but I'm not sure. My reservation is caused by the following test:
When I replace the worker function with the following code:
for {
<-fibNum
time.Sleep(500 * time.Millisecond)
}
both one available thread and four available threads take the same amount of time.
I revised your program to look like the following:
package main
import (
"math/rand"
"runtime"
"sync"
"time"
)
var workerWG sync.WaitGroup
func worker(fibNum chan int) {
for tgt := range fibNum {
var a, b float64 = 0, 1
for i := 0; i < tgt; i++ {
a, b = a+b, a
}
}
workerWG.Done()
}
func main() {
rand.Seed(time.Now().UnixNano())
runtime.GOMAXPROCS(1) // LINE IN QUESTION
var fibNum = make(chan int)
for i := 0; i < 4; i++ {
go worker(fibNum)
workerWG.Add(1)
}
for i := 0; i < 500000; i++ {
fibNum <- rand.Intn(100000)
}
close(fibNum)
workerWG.Wait()
}
I cleaned up the wait group usage.
I changed rand.Intn(1000) to rand.Intn(100000)
On my machine that produces:
$ time go run threading.go (GOMAXPROCS=1)
real 0m20.934s
user 0m20.932s
sys 0m0.012s
$ time go run threading.go (GOMAXPROCS=8)
real 0m10.634s
user 0m44.184s
sys 0m1.928s
This means that in your original code, the work performed was negligible compared to the synchronization (channel read/write). The slowdown came from having to synchronize across several threads instead of one while performing only a very small amount of work in between.
In essence, synchronization is expensive compared to calculating Fibonacci numbers up to 1000. This is why people tend to discourage micro-benchmarks. Upping that number gives a better perspective. But an even better idea is to benchmark actual work being done, i.e. including IO, syscalls, processing, crunching, writing output, formatting, etc.
Edit: As an experiment, I upped the number of workers to 8 with GOMAXPROCS set to 8 and the result was:
$ time go run threading.go
real 0m4.971s
user 0m35.692s
sys 0m0.044s
The code written by @thwd is correct and idiomatic Go.
Your code was being serialized due to the atomic nature of sync.WaitGroup. Both workerWG.Add(1) and workerWG.Done() will block until they're able to atomically update the internal counter.
Since the workload is between 0 and 1000 loop iterations, running on a single core was enough to keep contention on the waitgroup counter to a minimum.
On multiple cores, the processor spends a lot of time spinning to resolve colliding waitgroup calls. Add to that the fact that the waitgroup counter is kept on one core, and you now have added communication between cores (taking up even more cycles).
A couple hints for simplifying code:
For a small, fixed number of goroutines, a completion channel (chan struct{} to avoid allocations) is cheaper to use.
Use closing the send channel as a kill signal for the goroutines and have them signal that they've exited (waitgroup or channel). Then close the completion channel to free them up for the GC.
If you need a waitgroup, aggressively minimize the number of calls to it. Those calls must be internally serialized, so extra calls force added synchronization.
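A rough sketch of these hints, assuming a fixed pool size: closing the work channel acts as the kill signal, and each worker reports its exit once on a chan struct{} instead of touching a waitgroup per job. The names runPool, work and done are made up for this example.

```go
package main

import "fmt"

// fib runs the same loop as the worker in the question.
func fib(n int) float64 {
	a, b := 0.0, 1.0
	for i := 0; i < n; i++ {
		a, b = a+b, a
	}
	return a
}

// runPool processes `jobs` inputs on `workers` goroutines and returns
// how many inputs were handled in total.
func runPool(workers, jobs int) int {
	work := make(chan int)
	done := make(chan struct{}) // completion channel, one signal per worker
	counts := make([]int, workers)

	for w := 0; w < workers; w++ {
		go func(id int) {
			// The loop ends when main closes the work channel.
			for n := range work {
				_ = fib(n)
				counts[id]++ // each worker writes only its own slot
			}
			done <- struct{}{} // signal exit exactly once
		}(w)
	}

	for i := 0; i < jobs; i++ {
		work <- i % 1000
	}
	close(work) // kill signal for all workers

	// Wait for every worker to report that it has exited.
	for w := 0; w < workers; w++ {
		<-done
	}

	total := 0
	for _, c := range counts {
		total += c
	}
	return total
}

func main() {
	fmt.Println(runPool(4, 10000)) // prints 10000
}
```

Synchronization here happens once per job on the work channel and once per worker on the done channel, so there is no shared counter that every job has to update.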
Your main computation routine in worker does not allow the scheduler to run.
Calling the scheduler manually like
for i := 0; i < tgt; i++ {
a, b = a+b, a
if i%300 == 0 {
runtime.Gosched()
}
}
Reduces wall clock by 30% when switching from one to two threads.
Such artificial microbenchmarks are really hard to get right.
I wrote a Caesar cipher app in Windows Forms (C++/CLI) with dynamically linked libraries (in C++ and in ASM) containing my algorithms for the model (enciphering and deciphering). That part of my app is working.
There is also multithreading, driven from Windows Forms. The user can choose the number of threads (1-64). If they choose 2, the message to encipher (decipher) is divided into two substrings, which are distributed between two threads. I want to execute these threads in parallel and thereby reduce the execution time.
When the user pushes the encipher or decipher button, the enciphered or deciphered text is displayed along with the execution times of the C++ and ASM functions. Everything works, but the times for more than 1 thread aren't smaller; they are bigger.
Here is some code:
/*Function which concats string for substrings to threads*/
array<String^>^ ThreadEncipherFuncCpp(int nThreads, string str2){
//Array of threads
array<String^>^ arrayOfThreads = gcnew array <String^>(nThreads);
//Holds the n-th part of the message to process
string loopSubstring;
//Length of a substring in the message
int numberOfSubstring = str2.length() / nThreads;
int isModulo = str2.length() % nThreads;
array<Thread^>^ xThread = gcnew array < Thread^ >(nThreads);
for (int i = 0; i < nThreads; i++)
{
if (i == 0 && numberOfSubstring != 0)
loopSubstring = str2.substr(0, numberOfSubstring);
else if ((i == nThreads - 1) && numberOfSubstring != 0){
if (isModulo != 0)
loopSubstring = str2.substr(numberOfSubstring*i, numberOfSubstring + isModulo);
else
loopSubstring = str2.substr(numberOfSubstring*i, numberOfSubstring);
}
else if (numberOfSubstring == 0){
loopSubstring = str2.substr(0, isModulo);
i = nThreads - 1;
}
else
loopSubstring = str2.substr(numberOfSubstring*i, numberOfSubstring);
ThreadExample::inputString = gcnew String(loopSubstring.c_str());
xThread[i] = gcnew Thread(gcnew ThreadStart(&ThreadExample::ThreadEncipher));
xThread[i]->Start();
xThread[i]->Join();
arrayOfThreads[i] = ThreadExample::outputString;
}
return arrayOfThreads;
}}
Here is a fragment which is responsible for the calculation of the time for C++:
/*****************C++***************/
auto start = chrono::high_resolution_clock::now();
array<String^>^ arrayOfThreads = ThreadEncipherFuncCpp(nThreads, str2);
auto elapsed = chrono::high_resolution_clock::now() - start;
long long milliseconds = chrono::duration_cast<std::chrono::microseconds>(elapsed).count();
double micro = milliseconds;
this->label4->Text = Convert::ToString(micro + " microseconds");
String^ str3;
String^ str4;
str4 = str3->Concat(arrayOfThreads);
this->textBox2->Text = str4;
/**********************************/
An example run:
For input data: "Some example text. Some example text2."
Program will display: "Vrph hadpsoh whaw. Vrph hadpsoh whaw2."
Times of execution for 1 thread:
C++ time: 31231us.
Asm time: 31212us.
Times of execution for 2 threads:
C++ time: 62488us.
Asm time: 62505us.
Times of execution for 4 threads:
C++ time: 140254us.
Asm time: 124587us.
Times of execution for 32 threads:
C++ time: 1002548us.
Asm time: 1000020us.
How can I solve this problem?
I need to keep this program structure; this is an academic project.
My CPU has 4 cores.
The reason it's not going any faster is because you aren't letting your threads run in parallel.
xThread[i] = gcnew Thread(gcnew ThreadStart(&ThreadExample::ThreadEncipher));
xThread[i]->Start();
xThread[i]->Join();
These three lines create the thread, start it running, and then wait for it to finish. You're not getting any parallelism here, you're just adding the overhead of spawning & waiting for threads.
If you want to have a speedup from multithreading, the way to do it is to start all the threads at once, let them all run, and then collect up the results.
In this case, I'd make it so that ThreadEncipher (which you haven't shown us the source of, so I'm making assumptions) takes a parameter, which is used as an array index. Instead of having ThreadEncipher read from inputString and write to outputString, have it read from & write to one index of an array. That way, each thread can read & write at the same time. After you've spawned all these threads, then you can wait for all of them to finish, and you can either process the output array, or since array<String^>^ is already your return type, just return it as-is.
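To show the shape of "start everything, then wait, then collect", here is a sketch in Go rather than C++/CLI. Everything in it is an assumption for illustration: the caesar function stands in for ThreadEncipher (whose source wasn't shown), the shift of 3 is taken from the example output in the question, and the input is treated as ASCII. Each goroutine writes only to its own index of the result slice.

```go
package main

import (
	"fmt"
	"strings"
	"sync"
)

// caesar shifts ASCII letters by `shift` (assumed 0 <= shift < 26) and
// leaves all other bytes unchanged. It stands in for ThreadEncipher.
func caesar(s string, shift int) string {
	out := []byte(s)
	for i, c := range out {
		switch {
		case c >= 'a' && c <= 'z':
			out[i] = 'a' + (c-'a'+byte(shift))%26
		case c >= 'A' && c <= 'Z':
			out[i] = 'A' + (c-'A'+byte(shift))%26
		}
	}
	return string(out)
}

// encipherParallel splits msg into nThreads chunks, enciphers them
// concurrently, and joins the results in order.
func encipherParallel(msg string, nThreads, shift int) string {
	parts := make([]string, nThreads)
	var wg sync.WaitGroup
	total := len(msg)

	// Phase 1: start all workers without waiting for any of them.
	for i := 0; i < nThreads; i++ {
		start := total * i / nThreads
		end := total * (i + 1) / nThreads
		wg.Add(1)
		go func(idx int, chunk string) {
			defer wg.Done()
			parts[idx] = caesar(chunk, shift) // each worker owns one slot
		}(i, msg[start:end])
	}

	// Phase 2: wait for all of them, then collect the output.
	wg.Wait()
	return strings.Join(parts, "")
}

func main() {
	fmt.Println(encipherParallel("Some example text. Some example text2.", 4, 3))
	// prints "Vrph hadpsoh whaw. Vrph hadpsoh whaw2."
}
```

In the C++/CLI version the equivalent change would be one loop that calls Start() on every thread, followed by a second loop that calls Join() on every thread. For a message this short, thread startup will likely still dominate the measured time.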
Other thoughts:
You've got a mix of unmanaged and managed objects here. It will be better if you pick one and stick with it. Since you're in C++/CLI, I'd recommend that you stick with the managed objects. I'd stop using std::string, and use System::String^ exclusively.
Since your CPU has 4 cores, you're not going to get any speedup by using more than 4 threads. Don't be surprised when 32 threads take longer than 4, because you're doing 8x the string manipulation, and you've got 32 threads fighting over 4 processor cores.
Your string splitting code is more complex than it needs to be. You've got five different cases in there; I'd have to sit down and think about it for a while to be sure it's correct. Try this:
// assumes str2 is a System::String^
int totalLen = str2->Length;
for (int i = 0; i < nThreads; i++)
{
int startIndex = totalLen * i / nThreads;
int endIndex = totalLen * (i+1) / nThreads;
int substrLen = endIndex - startIndex;
String^ substr = str2->Substring(startIndex, substrLen);
...
}