AnyLogic: stop the delay for specific agents in a Delay block

In my process, I have a delay block with unlimited capacity. The agents in the delay block are only freed if stopDelay() is called.
If an event occurs, I want to stop the delay for some of the agents stored in this delay block.
However, I only want to free the x agents (x can vary) that have been in the block the longest.
Is there a function/trick for that, or do I have to compare getElapsedTime(Agent agent) for all agents in the delay block manually?
Thanks a lot in advance.

I would recommend replacing the Delay block with a Wait block. You can also give it unlimited capacity. Instead of using stopDelay(), use free().
Your code would look something like this (where x is the number of agents you want to free):
int x = 3; // number of agents to free; set as needed
for (int i = 0; i < x; i++) {
    // Free the agent at index 0 each time: agents in the Wait block are
    // stored in arrival order, so index 0 is the longest-waiting one, and
    // as each agent leaves, the rest shift down. Using get(i) here would
    // skip every other agent as the indices shift.
    wait.free(wait.get(0));
}
Also, make sure to replace wait in the code with your Wait block's actual name.

Related

Why is the race detector not detecting this race condition?

I am currently learning the Go programming language and I am now experimenting with the atomic package.
In this example, I am spawning a number of Goroutines that all need to increment a package level variable. There are several methods to avoid race conditions but for now I want to solve this using the atomic package.
When running the following code on my Windows PC (go run main.go) the results are not what I expect them to be (I expect the final result to be 1000). The final number is somewhere between 900 and 1000. When running the code in the Go Playground the value is 1000.
Here is the code I am using: https://play.golang.org/p/8gW-AsKGzwq
package main

import (
    "fmt"
    "sync"
    "sync/atomic"
)

var counter int64
var wg sync.WaitGroup

func main() {
    num := 1000
    wg.Add(num)
    for i := 0; i < num; i++ {
        go func() {
            v := atomic.LoadInt64(&counter)
            v++
            atomic.StoreInt64(&counter, v)
            // atomic.AddInt64(&counter, 1)
            // fmt.Println(v)
            wg.Done()
        }()
    }
    wg.Wait()
    fmt.Println("final", counter)
}
go run main.go
final 931
go run main.go
final 960
go run main.go
final 918
I would have expected the race detector to give an error, but it doesn't:
go run -race main.go
final 1000
And it outputs the correct value (1000).
I am using go version go1.12.7 windows/amd64 (latest version at this moment)
My questions:
Why is the race detector not reporting an error, and why am I seeing different values when running the code without the race detector?
My theory for why the Load/Store combination is not working is that the two atomic calls are not atomic as a whole. In that case I should be using the atomic.AddInt64 method, is that right?
Any help would be greatly appreciated :)
There is nothing racy in your code, so that's why the race detector detects nothing. Your counter variable is always accessed via the atomic package from the launched goroutines and not directly.
The reason why you sometimes get 1000 and sometimes less is the number of active threads that run goroutines: GOMAXPROCS. On the Go Playground it's 1, so at any time you have one active goroutine (so basically your app is executed sequentially, without any parallelism). And the current goroutine scheduler does not park running goroutines arbitrarily.
On your local machine you probably have a multicore CPU, and GOMAXPROCS defaults to the number of available logical CPUs, so GOMAXPROCS is greater than 1, so you have multiple goroutines running parallel (truly parallel, not just concurrent).
See this fragment:
v := atomic.LoadInt64(&counter)
v++
atomic.StoreInt64(&counter, v)
You load counter's value and assign it to v, you increment v, and you store back the value of the incremented v. What happens if 2 parallel goroutines do this at the same time? Let's say both load the value 100. Both increment their local copy: 101. Both write back 101, even though the counter should now be 102.
Yes, the proper way to increment counters atomically would be to use atomic.AddInt64() like this:
for i := 0; i < num; i++ {
go func() {
atomic.AddInt64(&counter, 1)
wg.Done()
}()
}
This way you'll always get 1000, no matter what GOMAXPROCS is.

Peterson's solution with single variable

do {
    turn = j;        // = (1 - i)
    while (turn == j)
        ;            // busy wait
    // critical section
    turn = j;        // exit section
} while (true);
Can Peterson's algorithm work with just the turn variable? Why is the flag variable required?
It can enforce mutual exclusion, but with this approach the Progress condition is not satisfied.
Clearly, process j will only move forward (i.e., get out of the busy wait at the while) if some other process changes the turn variable to j's own id.
Thus the progress of process j is entirely in the hands of process i.
For example, suppose the other process is busy in its non-critical section above the critical section, or has been killed, or is deadlocked. Then this poor process j keeps waiting forever, even though the critical section is free.
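For contrast, here is a minimal C sketch of the full algorithm with both flag and turn (assuming a sequentially consistent memory model; on real hardware you would need atomics or fences):
int flag[2] = {0, 0};   /* flag[i] == 1: process i wants to enter */
int turn = 0;           /* whose turn it is to yield */

void enter_critical(int i)
{
    int j = 1 - i;
    flag[i] = 1;                    /* declare interest */
    turn = j;                       /* politely let the other go first */
    while (flag[j] && turn == j)
        ;                           /* wait only while the other is interested */
}

void exit_critical(int i)
{
    flag[i] = 0;   /* withdraw interest; a waiting peer may now proceed */
}
If process j never tries to enter, flag[j] stays 0 and process i passes straight through - exactly the case that hangs the turn-only version.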

Reusable Barrier Algorithm

I'm looking into the Reusable Barrier algorithm from the book "The Little Book Of Semaphores" (archived here).
The puzzle is on page 31 (Basic Synchronization Patterns/Reusable Barrier), and I have come up with a 'solution' (or not) which differs from the solution from the book (a two-phase barrier).
This is my 'code' for each thread:
# n = 4 threads running
# semaphore: max count n, initialized to 0
# mutex: unowned

start:
    mutex.wait()
    counter = counter + 1
    if counter = n:
        semaphore.signal(4)   # add 4 at once
        counter = 0
    mutex.release()

    semaphore.wait()
    # critical section
    semaphore.release()
    goto start
This does seem to work; I've even inserted different sleep timers into different sections of the threads, and they still wait for all the threads to arrive before continuing each and every loop. Am I missing something? Is there a condition under which this will fail?
I've implemented this using the Windows library Semaphore and Mutex functions.
Update:
Thank you to starblue for the answer. It turns out that if, for whatever reason, a thread is slow between mutex.release() and semaphore.wait(), any thread that arrives at semaphore.wait() after a full loop will be able to go through again, since one of the n signals will still be unused.
And after putting a Sleep command in thread number 3, I got a result where one can see that thread 3 missed a turn the first time (thread 1 having done 2 turns) and then caught up on the second turn (which was in fact its 1st turn).
Thanks again to everyone for the input.
One thread could run several times through the barrier while some other thread doesn't run at all.
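For reference, here is a minimal C sketch of the book's two-phase (turnstile) barrier using POSIX semaphores - the names and the pthread wrapper are mine, not the book's. The second turnstile is what prevents a fast thread from looping around and consuming a signal meant for a slower one:
#include <pthread.h>
#include <semaphore.h>

#define N 4   /* number of participating threads */

static int count = 0;
static pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;
static sem_t turnstile1, turnstile2;

void barrier_init(void)
{
    sem_init(&turnstile1, 0, 0);
    sem_init(&turnstile2, 0, 0);
}

void barrier_wait(void)
{
    /* Phase 1: the last thread to arrive opens the first turnstile. */
    pthread_mutex_lock(&mutex);
    if (++count == N)
        for (int i = 0; i < N; i++)
            sem_post(&turnstile1);
    pthread_mutex_unlock(&mutex);
    sem_wait(&turnstile1);

    /* Phase 2: the last thread to leave opens the second turnstile.
       A fast thread that loops back blocks at turnstile1 until every
       thread has passed through here, so no signal can be "stolen". */
    pthread_mutex_lock(&mutex);
    if (--count == 0)
        for (int i = 0; i < N; i++)
            sem_post(&turnstile2);
    pthread_mutex_unlock(&mutex);
    sem_wait(&turnstile2);
}
With the single-phase version above, a fast thread that loops back can immediately consume one of the leftover signals, which is exactly the failure starblue describes.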

pthreads: If I increment a global from two different threads, can there be sync issues?

Suppose I have two threads A and B that are both incrementing a global variable "count". Each thread runs a for loop like this one:
for (int i = 0; i < 1000; i++)
    count++; // alternatively, count = count + 1;
i.e. each thread increments count 1000 times, and let's say count starts at 0. Can there be sync issues in this case? Or will count correctly equal 2000 when the execution is finished? I guess since the statement "count = count + 1" may break down into TWO assembly instructions, there is potential for the other thread to be swapped in between these two instructions? Not sure. What do you think?
Yes there can be sync issues in this case. You need to either protect the count variable with a mutex, or use a (usually platform specific) atomic operation.
Example using pthread mutexes
static pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;

for (int i = 0; i < 1000; i++) {
    pthread_mutex_lock(&mutex);
    count++;
    pthread_mutex_unlock(&mutex);
}
Using atomic ops
There is a prior discussion of platform specific atomic ops here:
UNIX Portable Atomic Operations
If you only need to support GCC, this approach is straightforward. If you're supporting other compilers, you'll probably have to make some per-platform decisions.
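For completeness, here is the same loop using C11 atomics, which are portable across compilers that provide <stdatomic.h> (a sketch; C11 postdates much of this discussion):
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

static atomic_long count;   /* zero-initialized */

static void *worker(void *arg)
{
    (void)arg;
    for (int i = 0; i < 1000; i++)
        atomic_fetch_add(&count, 1);   /* one indivisible read-modify-write */
    return NULL;
}

int main(void)
{
    pthread_t a, b;
    pthread_create(&a, NULL, worker, NULL);
    pthread_create(&b, NULL, worker, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    printf("count = %ld\n", atomic_load(&count));   /* always prints 2000 */
    return 0;
}
This way count always ends up at 2000, regardless of how the threads are scheduled.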
Count clearly needs to be protected with a mutex or other synchronization mechanism.
At a fundamental level, the count++ statement breaks down to:
load count into register
increment register
store count from register
A context switch could occur before/after any of those steps, leading to situations like:
Thread 1: load count into register A (value = 0)
Thread 2: load count into register B (value = 0)
Thread 1: increment register A (value = 1)
Thread 1: store count from register A (value = 1)
Thread 2: increment register B (value = 1)
Thread 2: store count from register B (value = 1)
As you can see, both threads completed one iteration of the loop, but the net result is that count was only incremented once.
You probably would also want to make count volatile to force loads & stores to go to memory, since a good optimizer would likely keep count in a register unless otherwise told.
Also, I would suggest that if this is all the work that's going to be done in your threads, performance will dramatically drop from all the mutex locking/unlocking required to keep it consistent. Threads should have much bigger work units to perform.
Yes, there can be sync problems.
As an example of the possible issues, there is no guarantee that an increment itself is an atomic operation.
In other words, if one thread reads the value for increment then gets swapped out, the other thread could come in and change it, then the first thread will write back the wrong value:
+-----+
| 0 | Value stored in memory (0).
+-----+
| 0 | Thread 1 reads value into register (r1 = 0).
+-----+
| 0 | Thread 2 reads value into register (r2 = 0).
+-----+
| 1 | Thread 2 increments r2 and writes back.
+-----+
| 1 | Thread 1 increments r1 and writes back.
+-----+
So you can see that, even though both threads have tried to increment the value, it's only increased by one.
This is just one of the possible problems. It may also be that the write itself is not atomic and one thread may update only part of the value before being swapped out.
If you have atomic operations that are guaranteed to work in your implementation, you can use them. Otherwise, use mutexes. That's what pthreads provides for synchronisation (and guarantees to work), so it is the safest approach.
I guess since the statement "count = count + 1" may break down into TWO assembly instructions, there is potential for the other thread to be swapped in between these two instructions? Not sure. What do you think?
Don't think like this. You're writing C code and pthreads code. You don't have to ever think about assembly code to know how your code will behave.
The pthreads standard does not define the behavior when one thread accesses an object while another thread is, or might be, modifying it. So unless you're writing platform-specific code, you should assume this code can do anything -- even crash.
The obvious pthreads fix is to use mutexes. If your platform has atomic operations, you can use those.
I strongly urge you not to delve into detailed discussions about how it might fail or what the assembly code might look like. Regardless of what you might or might not think compilers or CPUs might do, the behavior of the code is undefined. And it's too easy to convince yourself you've covered every way you can think of that it might fail and then you miss one and it fails.

What can make a program run slower when using more threads?

This question is about the same program I previously asked about. To recap, I have a program with a loop structure like this:
for (int i1 = 0; i1 < N; i1++)
    for (int i2 = 0; i2 < N; i2++)
        for (int i3 = 0; i3 < N; i3++)
            for (int i4 = 0; i4 < N; i4++)
                histogram[bin_index(i1, i2, i3, i4)] += 1;
bin_index is a completely deterministic function of its arguments which, for purposes of this question, does not use or change any shared state - in other words, it is manifestly reentrant.
I first wrote this program to use a single thread. Then I converted it to use multiple threads, such that thread n runs all iterations of the outer loop where i1 % nthreads == n. So the function that runs in each thread looks like
for (int i1 = n; i1 < N; i1 += nthreads)
    for (int i2 = 0; i2 < N; i2++)
        for (int i3 = 0; i3 < N; i3++)
            for (int i4 = 0; i4 < N; i4++)
                thread_local_histogram[bin_index(i1, i2, i3, i4)] += 1;
and all the thread_local_histograms are added up in the main thread at the end.
Here's the strange thing: when I run the program with just 1 thread for some particular size of the calculation, it takes about 6 seconds. When I run it with 2 or 3 threads, doing exactly the same calculation, it takes about 9 seconds. Why is that? I would expect that using 2 threads would be faster than 1 thread since I have a dual-core CPU. The program does not use any mutexes or other synchronization primitives so two threads should be able to run in parallel.
For reference: typical output from time (this is on Linux) for one thread:
real 0m5.968s
user 0m5.856s
sys 0m0.064s
and two threads:
real 0m9.128s
user 0m10.129s
sys 0m6.576s
The code is at http://static.ellipsix.net/ext-tmp/distintegral.ccs
P.S. I know there are libraries designed for exactly this kind of thing that probably could have better performance, but that's what my last question was about so I don't need to hear those suggestions again. (Plus I wanted to use pthreads as a learning experience.)
To avoid further comments on this: when I wrote my reply, the questioner hadn't posted a link to his source yet, so I could not tailor my reply to his specific issues. I was only answering the general question of what "can" cause such an issue; I never said that this will necessarily apply to his case. When he posted a link to his source, I wrote another reply that focuses exactly on his very issue (which is caused by the use of the random() function, as I explained in my other reply). However, since the question of this post is still "What can make a program run slower when using more threads?" and not "What makes my very specific application run slower?", I've seen no need to change my rather general reply either (general question -> general response, specific question -> specific response).
1) Cache Poisoning
All threads access the same array, which is a block of memory. Each core has its own cache to speed up memory access. Since the threads don't just read from the array but also change its content, the content is actually changed in the cache only, not in real memory (at least not immediately). The problem is that the other thread on the other core may have overlapping parts of memory cached. If core 1 now changes a value in its cache, it must tell core 2 that this value has just changed. It does so by invalidating the cache content on core 2, and core 2 then needs to re-read the data from memory, which slows processing down.
Cache poisoning can only happen on multi-core or multi-CPU machines. If you just have one CPU with one core, this is no problem. So to find out if that is your issue or not, just disable one core (most OSes will allow you to do that) and repeat the test. If it is now almost equally fast, that was your problem.
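One way to check whether this is what is happening in your own code is to compare padded and unpadded per-thread counters. A minimal sketch (the 64-byte line size and all names here are assumptions):
#include <pthread.h>
#include <stdio.h>

#define NTHREADS 2
#define ITERATIONS 100000000L

/* One counter per thread, padded so no two counters share a cache line.
   64 bytes is a typical x86 cache-line size (an assumption). */
struct padded_counter {
    long value;
    char pad[64 - sizeof(long)];
};

static struct padded_counter counters[NTHREADS];

static void *worker(void *arg)
{
    int id = *(int *)arg;
    for (long i = 0; i < ITERATIONS; i++)
        counters[id].value++;   /* each core writes only its own line */
    return NULL;
}

int main(void)
{
    pthread_t threads[NTHREADS];
    int ids[NTHREADS];
    long total = 0;

    for (int i = 0; i < NTHREADS; i++) {
        ids[i] = i;
        pthread_create(&threads[i], NULL, worker, &ids[i]);
    }
    for (int i = 0; i < NTHREADS; i++) {
        pthread_join(threads[i], NULL);
        total += counters[i].value;
    }
    printf("total = %ld\n", total);
    return 0;
}
If you drop the pad member and the run gets noticeably slower, the threads were bouncing a shared cache line back and forth.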
2) Preventing Memory Bursts
Memory is read fastest if read sequentially in bursts, just like when files are read from a hard disk. Addressing a certain point in memory is actually awfully slow (just like the "seek time" on a hard disk), even if your PC has the best memory on the market. However, once this point has been addressed, sequential reads are fast. The first access goes by sending a row index and a column index, always with waiting times in between before the first data can be accessed. Once this data is there, the CPU starts bursting. While the data is still on the way, it already sends the request for the next burst. As long as it keeps the burst up (by always sending "next line please" requests), the RAM will continue to pump out data as fast as it can (and this is actually quite fast!).
Bursting only works if data is read sequentially, and only if the memory addresses grow upwards (AFAIK you cannot burst from high to low addresses). If two threads now run at the same time and both keep reading/writing memory, but from completely different memory addresses, then each time thread 2 needs to read/write data, it must interrupt a possible burst of thread 1, and the other way round. This issue gets worse if you have even more threads, and it is also an issue on a system with only one single-core CPU.
BTW, running more threads than you have cores will never make your process any faster (as you mentioned 3 threads); it will rather slow it down (thread context switches have side effects that reduce processing throughput) - unless you run more threads because some threads are sleeping or blocking on certain events and thus cannot actively process any data. In that case it may make sense to run more threads than you have cores.
Everything I said so far in my other reply still holds true in general, as your question was what "can" cause this... however, now that I've seen your actual code, my first bet would be that your usage of the random() function slows everything down. Why?
See, random() keeps a global variable in memory that stores the last random value calculated. Each time you call random() (and you are calling it twice within a single function) it reads the value of this global variable, performs a calculation (which is not so fast; random() alone is a slow function) and writes the result back there before returning it. This global variable is not per thread; it is shared among all threads. So what I wrote regarding cache poisoning applies here all the time (even though you avoided it for the array by having separate arrays per thread; this was very clever of you!). This value is constantly invalidated in the cache of either core and must be re-fetched from memory. However, if you only have a single thread, nothing like that happens; the variable never leaves the cache after it has been initially read, since it is permanently accessed again and again.
To make things even worse, glibc has a thread-safe version of random() - I just verified that by looking at the source. While this seems to be a good idea in theory, it means that each random() call will cause a mutex to be locked, memory to be accessed, and a mutex to be unlocked. Thus two threads calling random() at exactly the same moment will cause one thread to be blocked for a couple of CPU cycles. This is implementation specific, though, as AFAIK it is not required that random() is thread safe. Most standard lib functions are not required to be thread-safe, since the C standard is not even aware of the concept of threads in the first place. When the threads are not calling it at the same moment, the mutex will have no influence on speed (as even a single-threaded app must lock/unlock the mutex), but then cache poisoning will apply again.
You could pre-build an array with random numbers for every thread, containing as many random numbers as each thread needs. Create it in the main thread before spawning the threads and add a reference to it to the structure pointer you hand over to every thread. Then get the random numbers from there.
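A sketch of that idea (all names here are hypothetical, to be adapted to whatever struct you already pass to pthread_create):
#include <stdlib.h>

/* Hypothetical per-thread argument struct: the main thread fills
   'randoms' before spawning, so worker threads never call random(). */
typedef struct {
    long *randoms;   /* numbers reserved for this thread */
    size_t count;    /* how many were pre-generated */
    size_t next;     /* index of the next unused number */
} thread_args;

static void fill_randoms(thread_args *args, size_t count)
{
    args->randoms = malloc(count * sizeof *args->randoms);
    args->count = count;
    args->next = 0;
    for (size_t i = 0; i < count; i++)
        args->randoms[i] = random();   /* called only from the main thread */
}

static long next_random(thread_args *args)
{
    return args->randoms[args->next++];   /* no shared state, no mutex */
}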
Or just implement your own random number generator, if you don't need the "best" random numbers on the planet - one that works with per-thread memory for holding its state. It might even be faster than the system's built-in generator.
If a Linux-only solution works for you, you can use random_r. It allows you to pass the state with every call; just use a unique state object per thread. However, this function is a glibc extension and is most likely not supported by other platforms (it is part of neither the C standard nor the POSIX standard, AFAIK - it does not exist on Mac OS X, for example, and may not exist on Solaris or FreeBSD either).
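A minimal sketch of the random_r route (glibc only; each thread owns one of these states, the wrapper names are mine):
#define _GNU_SOURCE
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Per-thread RNG state: allocate one per thread and never share it. */
typedef struct {
    struct random_data data;
    char statebuf[64];
} thread_rng;

static void thread_rng_init(thread_rng *rng, unsigned int seed)
{
    /* glibc requires the random_data struct to be zeroed before initstate_r. */
    memset(&rng->data, 0, sizeof rng->data);
    initstate_r(seed, rng->statebuf, sizeof rng->statebuf, &rng->data);
}

static int32_t thread_rng_next(thread_rng *rng)
{
    int32_t value;
    random_r(&rng->data, &value);
    return value;
}
Each thread seeds its own thread_rng once (for example with the time plus its thread index) and then never touches any shared RNG state.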
Creating your own random number generator is actually not that hard. If you need real random numbers, you shouldn't be using random() in the first place; random() only creates pseudo-random numbers (numbers that look random, but are predictable if you know the generator's internal state). Here's the code for one that produces good uint32 random numbers:
static uint32_t getRandom(uint32_t *m_z, uint32_t *m_w)
{
    *m_z = 36969 * (*m_z & 65535) + (*m_z >> 16);
    *m_w = 18000 * (*m_w & 65535) + (*m_w >> 16);
    return (*m_z << 16) + *m_w;
}
It's important to "seed" m_z and m_w in a proper way somehow, otherwise the results are not random at all. The seed value itself should already be random, but here you could use the system random number generator.
uint32_t m_z = random();
uint32_t m_w = random();
uint32_t nextRandom;

for (...) {
    nextRandom = getRandom(&m_z, &m_w);
    // ...
}
This way every thread only needs to call random() twice and then uses your own generator. BTW, if you need double randoms (that are between 0 and 1), the function above can be easily wrapped for that:
static double getRandomDouble(uint32_t *m_z, uint32_t *m_w)
{
    // The magic number below is 1/(2^32 + 2).
    // The result is strictly between 0 and 1.
    // Cast to double before adding 1, so a value of UINT32_MAX
    // does not wrap around to 0.
    return ((double)getRandom(m_z, m_w) + 1) * 2.328306435454494e-10;
}
Try to make this change in your code and let me know how the benchmark results are :-)
You are seeing cache line bouncing. I'm really surprised that you don't get wrong results, due to race conditions on the histogram buckets.
One possibility is that the time taken to create the threads exceeds the savings gained by using threads. I would think that N is not very large if the elapsed time is only 6 seconds for an O(N^4) operation.
There's also no guarantee that multiple threads will run on different cores or CPUs. I'm not sure what the default thread affinity is with Linux - it may be that both threads run on a single core which would negate the benefits of a CPU-intensive piece of code such as this.
This article details default thread affinity and how to change your code to ensure threads run on specific cores.
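On Linux with glibc, pinning can be done with pthread_setaffinity_np; a minimal sketch (the helper name is mine):
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

/* Pin the calling thread to the given core (glibc extension, Linux only). */
static int pin_to_core(int core)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    return pthread_setaffinity_np(pthread_self(), sizeof set, &set);
}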
Even though the threads don't access the same elements of the array at the same time, the whole array may sit in a few memory pages. When one core/processor writes to such a page, the corresponding cache content has to be invalidated on all other processors.
Avoid having many threads working over the same memory space. Allocate separate data for each thread to work upon, then join them together when the calculation finishes.
Off the top of my head:
Context switches
Resource contention
CPU contention (if they aren't getting split to multiple CPUs).
Cache thrashing
David,
Are you sure you are running a kernel that supports multiple processors? If only one processor is utilized in your system, spawning additional CPU-intensive threads will slow down your program.
And are you sure that thread support on your system actually utilizes multiple processors? Does top, for example, show that both cores of your processor are utilized when you run your program?
