For the measurements below I used x86_64 GNU/Linux with kernel 4.4.0-109-generic #132-Ubuntu SMP, running on an AMD FX(tm)-8150 Eight-Core Processor (which has a 64-byte cache-line size).
The full source code, which does not depend on other libraries, can be obtained here: https://github.com/CarloWood/ai-threadsafe-testsuite/blob/master/src/condition_variable_test.cxx
Just compile with:
g++ -pthread -std=c++11 -O3 condition_variable_test.cxx
What I really tried to measure here is how long a call to notify_one() takes when one or more threads are actually waiting, relative to how long it takes when no thread is waiting on the condition_variable used.
To my astonishment I found that both cases are in the microsecond range: when 1 thread is waiting it takes about 14 to 20 microseconds; when no thread is waiting it apparently takes less, but still at least 1 microsecond.
In other words: suppose that in a producer/consumer scenario you let the consumer call wait() every time there is nothing to do, and you call notify_one() every time a producer writes something new to the queue, assuming that the implementation of std::condition_variable is smart enough not to spend much time when no thread is waiting in the first place. Then, oh horror, your application will become a lot slower than with the code that I wrote to TEST how long a call to notify_one() takes when a thread is waiting!
It seems that the guard I used is a must to speed up such scenarios, and that confuses me: why on earth isn't it already part of std::condition_variable?
The code in question is, instead of doing:
// Producer thread:
add_something_to_queue();
cv.notify_one();
// Consumer thread:
if (queue.empty())
{
    std::unique_lock<std::mutex> lk(m);
    cv.wait(lk);
}
You can gain a speedup of a factor of 1000 by doing:
// Producer thread:
add_something_to_queue();
int waiting;
while ((waiting = s_idle.load(std::memory_order_relaxed)) > 0)
{
    if (!s_idle.compare_exchange_weak(waiting, waiting - 1, std::memory_order_relaxed, std::memory_order_relaxed))
        continue;
    std::unique_lock<std::mutex> lk(m);
    cv.notify_one();
    break;
}
// Consumer thread:
if (queue.empty())
{
    std::unique_lock<std::mutex> lk(m);
    s_idle.fetch_add(1, std::memory_order_relaxed);
    cv.wait(lk);
}
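For reference, the standard predicate form of the consumer re-checks the condition under the lock, which guards against spurious wakeups and against a notification arriving between the empty() check and the wait(). A minimal sketch, reusing m, cv and queue from above:

// Consumer thread, predicate form (sketch, not the benchmarked code):
std::unique_lock<std::mutex> lk(m);
cv.wait(lk, [&]{ return !queue.empty(); }); // predicate re-checked under the lock on every wakeup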
Am I making some horrible mistake here? Or are my findings correct?
Edit:
I forgot to add the output of the benchmark program (DIRECT=0):
All started!
Thread 1 statistics: avg: 1.9ns, min: 1.8ns, max: 2ns, stddev: 0.039ns
The average time spend on calling notify_one() (726141 calls) was: 17995.5 - 21070.1 ns.
Thread 1 finished.
Thread Thread Thread 5 finished.
8 finished.
7 finished.
Thread 6 finished.
Thread 3 statistics: avg: 1.9ns, min: 1.7ns, max: 2.1ns, stddev: 0.088ns
The average time spend on calling notify_one() (726143 calls) was: 17207.3 - 22278.5 ns.
Thread 3 finished.
Thread 2 statistics: avg: 1.9ns, min: 1.8ns, max: 2ns, stddev: 0.055ns
The average time spend on calling notify_one() (726143 calls) was: 17910.1 - 21626.5 ns.
Thread 2 finished.
Thread 4 statistics: avg: 1.9ns, min: 1.6ns, max: 2ns, stddev: 0.092ns
The average time spend on calling notify_one() (726143 calls) was: 17337.5 - 22567.8 ns.
Thread 4 finished.
All finished!
And with DIRECT=1:
All started!
Thread 4 statistics: avg: 1.2e+03ns, min: 4.9e+02ns, max: 1.4e+03ns, stddev: 2.5e+02ns
The average time spend on calling notify_one() (0 calls) was: 1156.49 ns.
Thread 4 finished.
Thread 5 finished.
Thread 8 finished.
Thread 7 finished.
Thread 6 finished.
Thread 3 statistics: avg: 1.2e+03ns, min: 5.9e+02ns, max: 1.5e+03ns, stddev: 2.4e+02ns
The average time spend on calling notify_one() (0 calls) was: 1164.52 ns.
Thread 3 finished.
Thread 2 statistics: avg: 1.2e+03ns, min: 1.6e+02ns, max: 1.4e+03ns, stddev: 2.9e+02ns
The average time spend on calling notify_one() (0 calls) was: 1166.93 ns.
Thread 2 finished.
Thread 1 statistics: avg: 1.2e+03ns, min: 95ns, max: 1.4e+03ns, stddev: 3.2e+02ns
The average time spend on calling notify_one() (0 calls) was: 1167.81 ns.
Thread 1 finished.
All finished!
The '0 calls' in the latter output are actually around 20000000 calls.
As stated in https://lwn.net/Articles/308545/, hrtimer callbacks run in hard interrupt context with IRQs disabled.
But what about SMP?
Can a second callback for another hrtimer run on another core while a first callback is already running, or do they exclude each other on all cores, so that no locking is needed between them?
edit:
When a handler for a "regular" hardware IRQ (let's call it X) is running on a core, all IRQs are disabled on that core only, but IRQ X is disabled on the whole system, so two handlers for X never run concurrently.
How do hrtimer interrupts behave in this regard?
Do they all share the same quasi IRQ, or is there one IRQ per hrtimer?
edit:
Did some experiments with two timers A and B:
// starting timer A to fire as fast as possible...
A_ktime = ktime_set(0, 1); // 1 ns
hrtimer_start( &A, A_ktime, HRTIMER_MODE_REL );
// starting timer B to fire 10 us later
B_ktime = ktime_set(0, 10000); // 10 us
hrtimer_start( &B, B_ktime, HRTIMER_MODE_REL );
Put some printks into the callbacks, and a huge delay into the one for timer A:
// fired after 1 ns
enum hrtimer_restart A(struct hrtimer *timer)
{
    int i;
    printk("timer A: %lu\n", jiffies);
    for (i = 0; i < 10000; i++) { // delay 10 seconds (1000 jiffies with HZ 100)
        udelay(1000);
    }
    printk("end timer A: %lu\n", jiffies);
    return HRTIMER_NORESTART;
}
// fired after 10 us
enum hrtimer_restart B(struct hrtimer *timer)
{
    printk("timer B: %lu\n", jiffies);
    return HRTIMER_NORESTART;
}
The result was reproducibly something like:
[ 6.217393] timer A: 4294937914
[ 16.220352] end timer A: 4294938914
[ 16.224059] timer B: 4294938915
Timer B's callback ran 1000 jiffies after the start of timer A's callback, even though timer B was set up to fire less than one jiffy after timer A.
Driving this further and increasing the delay to 70 seconds, I got 7000 jiffies between the start of timer A's callback and timer B's callback.
[ 6.218258] timer A: 4294937914
[ 76.220058] end timer A: 4294944914
[ 76.224192] timer B: 4294944915
edit:
Locking is probably required, because hrtimers just get enqueued on any CPU. If two of them happen to be enqueued on the same CPU, they may delay each other, but there is no guarantee of that.
from hrtimer.h:
* On SMP it is possible to have a "callback function running and enqueued"
* status. It happens for example when a posix timer expired and the callback
* queued a signal. Between dropping the lock which protects the posix timer
* and reacquiring the base lock of the hrtimer, another CPU can deliver the
* signal and rearm the timer.
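So if two callbacks touch shared state, guarding it explicitly seems necessary. A minimal sketch of what I mean (the counter and lock are made up for illustration; since the callbacks run in hard interrupt context with IRQs off, a plain spin_lock() inside them should suffice, whereas process context touching the same data would need spin_lock_irqsave()):

#include <linux/hrtimer.h>
#include <linux/spinlock.h>

static DEFINE_SPINLOCK(shared_lock);
static unsigned long shared_count; /* state touched by both callbacks */

enum hrtimer_restart A(struct hrtimer *timer)
{
    spin_lock(&shared_lock); /* B may run concurrently on another core */
    shared_count++;
    spin_unlock(&shared_lock);
    return HRTIMER_NORESTART;
}

enum hrtimer_restart B(struct hrtimer *timer)
{
    spin_lock(&shared_lock);
    shared_count++;
    spin_unlock(&shared_lock);
    return HRTIMER_NORESTART;
}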
We have the following kernel parameters:
sysctl_sched_min_granularity = 0.75 ms
sysctl_sched_latency = 6 ms
sched_nr_latency = 8
As I understand it (I don't know if correctly), the parameter sysctl_sched_latency says that all tasks in the runqueue should be executed within 6 ms.
So if a task arrives at time X, it should be executed at least once no later than X + 6 ms.
The function check_preempt_tick is executed periodically by task_tick_fair.
At the beginning of this function it checks whether delta_exec is greater than ideal_runtime:
ideal_runtime = sched_slice(cfs_rq, curr);
delta_exec = curr->sum_exec_runtime - curr->prev_sum_exec_runtime;
if (delta_exec > ideal_runtime)
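For context, ideal_runtime comes from sched_slice(), which hands each task a weight-proportional share of a scheduling period. A simplified sketch of that logic for equal-priority tasks (not the literal kernel source):

/* The period is sysctl_sched_latency while at most sched_nr_latency
 * tasks are runnable; with more tasks it stretches so that every task
 * still gets at least sysctl_sched_min_granularity. */
u64 period = (nr_running > sched_nr_latency)
                 ? nr_running * sysctl_sched_min_granularity
                 : sysctl_sched_latency;
/* With equal priorities each task gets an equal share of the period,
 * e.g. 6 ms / 4 tasks = 1.5 ms, as in the example below. */
u64 ideal_runtime = period / nr_running;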
For instance: if we have 4 tasks of the same priority, ideal_runtime should be 1.5 ms for the first task.
Let's assume this task executes for its 1.5 ms and then exits, so we now have 3 tasks in the queue.
The calculation for task2 is then: with 3 tasks in the queue, ideal_runtime should be 2 ms. Task2 runs for 2 ms and exits.
Again: task3 calculates its slice, and with 2 tasks in the queue it should run for 3 ms.
So in summary, task4 is only executed after 1.5 ms (task1) + 2 ms (task2) + 3 ms (task3) = 6.5 ms, and sysctl_sched_latency (which is 6 ms) is exceeded.
How does CFS ensure that all tasks in the queue are executed within sysctl_sched_latency, when the queue can change dynamically all the time?
Thanks,
My understanding of GPars actors may be off, so please correct me if I'm wrong. I have a Groovy app that polls a web service for jobs. When one or more jobs are found it sends each job to a DynamicDispatchActor I've created, and the job is handled. The jobs are completely self-contained and don't need to return anything to the main thread. When multiple jobs come in at once I'd like them to be processed in parallel, but no matter what configuration I try, the actor processes them first-in, first-out.
To give a code example:
def poolGroup = new DefaultPGroup(new DefaultPool(true, 5))
def actor = poolGroup.messageHandler {
    when { Integer msg ->
        println("I'm number ${msg} on thread ${Thread.currentThread().name}")
        Thread.sleep(1000)
    }
}
def integers = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
integers.each {
    actor << it
}
This prints out:
I'm number 1 on thread Actor Thread 31
I'm number 2 on thread Actor Thread 31
I'm number 3 on thread Actor Thread 31
I'm number 4 on thread Actor Thread 31
I'm number 5 on thread Actor Thread 31
I'm number 6 on thread Actor Thread 31
I'm number 7 on thread Actor Thread 31
I'm number 8 on thread Actor Thread 31
I'm number 9 on thread Actor Thread 31
I'm number 10 on thread Actor Thread 31
With a slight pause between each printout. Also notice that each printout happens from the same Actor/thread.
What I'd like to see here is the first 5 numbers printed out instantly, because the thread pool is set to 5, and then the next 5 numbers as those threads free up. Am I completely off base here?
To make it run as you expect, there are a few changes to make:
import groovyx.gpars.group.DefaultPGroup
import groovyx.gpars.scheduler.DefaultPool
def poolGroup = new DefaultPGroup(new DefaultPool(true, 5))
def closure = {
    when { Integer msg ->
        println("I'm number ${msg} on thread ${Thread.currentThread().name}")
        Thread.sleep(1000)
        stop()
    }
}
def integers = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
def actors = integers.collect { poolGroup.messageHandler(closure) << it }
actors*.join()
Full gist file: https://gist.github.com/wololock/7f1348e04f68710e42d2
Then the output will be:
I'm number 5 on thread Actor Thread 5
I'm number 4 on thread Actor Thread 4
I'm number 1 on thread Actor Thread 1
I'm number 3 on thread Actor Thread 3
I'm number 2 on thread Actor Thread 2
I'm number 6 on thread Actor Thread 3
I'm number 9 on thread Actor Thread 4
I'm number 7 on thread Actor Thread 2
I'm number 8 on thread Actor Thread 5
I'm number 10 on thread Actor Thread 1
Now let's take a look at what changed. First of all, in your previous example you worked with a single actor only. You defined poolGroup correctly, but then you created a single actor and shifted all the computation to this single instance. To run those computations in parallel you have to rely on poolGroup and only send the input to some message handler - the pool group will handle actor creation and lifecycle management. This is what we do in:
def actors = integers.collect { poolGroup.messageHandler(closure) << it }
It creates a collection of actors, each started with its given input. The pool group takes care that the specified pool size is not exceeded. Then you have to join each actor, and this can be done using Groovy's magic: actors*.join(). Thanks to that, the application waits to terminate until all actors have stopped their computation. That's why we had to add the stop() call to the when closure of the message handler's body - without it, the program won't terminate, because the pool group does not know that the actors have done their job; they might be waiting, e.g., for another message.
Alternative solution
We can also consider alternative solution that uses GPars parallelized iterations:
import groovyx.gpars.GParsPool
// This example is a dummy, but let's assume that this processor is
// a stateless component shared between threads.
class Processor {
    void process(int number) {
        println "${Thread.currentThread().name} starting with number ${number}"
        Thread.sleep(1000)
    }
}
def integers = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
Processor processor = new Processor()
GParsPool.withPool 5, {
    integers.eachParallel { processor.process(it) }
}
In this example you have a stateless Processor component, and the computations are parallelized using one instance of that stateless Processor with multiple input values.
I've tried to figure out the case you mentioned in the comment, but I'm not sure whether a single actor can process multiple messages at a time. Statelessness of an actor means only that it does not change its internal state while processing a message and must not store any other information in actor scope. It would be great if someone could correct me if my reasoning is not right :)
I hope this will help you. Best!
I am quite new to Perl, especially Perl threads.
I want to accomplish:
Have 5 threads that will enqueue data (random numbers) into a Thread::Queue.
Have 3 threads that will dequeue data from the Thread::Queue.
The complete code that I wrote in order to achieve the above mission:
#!/usr/bin/perl -w
use strict;
use threads;
use Thread::Queue;
my $queue = new Thread::Queue();
our @Enquing_threads;
our @Dequeuing_threads;

sub buildQueue
{
    my $TotalEntry = 1000;
    while ($TotalEntry-- > 0)
    {
        my $query = rand(10000);
        $queue->enqueue($query);
        print "Enque thread with TID " . threads->tid . " got $query,";
        print "Queue Size: " . $queue->pending . "\n";
    }
}

sub process_Queue
{
    my $query;
    while ($query = $queue->dequeue)
    {
        print "Dequeue thread with TID " . threads->tid . " got $query\n";
    }
}

push @Enquing_threads, threads->create(\&buildQueue) for 1..5;
push @Dequeuing_threads, threads->create(\&process_Queue) for 1..3;
Issues that I am facing:
The threads are not running as concurrently as expected.
The entire program exits abnormally with the following console output:
Perl exited with active threads:
8 running and unjoined
0 finished and unjoined
0 running and detached
Enque thread with TID 5 got 6646.13585023883,Queue Size: 595
Enque thread with TID 1 got 3573.84104215917,Queue Size: 595
Any help on code-optimization is appreciated.
This behaviour is to be expected: When the main thread exits, all other threads exit as well. If you don't care, you can $thread->detach them. Otherwise, you have to manually $thread->join them, which we'll do.
The $thread->join waits for the thread to complete, and fetches the return value (threads can return values just like subroutines, although the context (list/void/scalar) has to be fixed at spawn time).
We will detach the threads that enqueue data:
threads->create(\&buildQueue)->detach for 1..5;
Now for the dequeueing threads, we put them into a lexical variable (why are you using globals?), so that we can join them later:
my @dequeue_threads = map threads->create(\&process_queue), 1 .. 3;
Then wait for them to complete:
$_->join for @dequeue_threads;
We know that the detached threads will finish execution before the program exits, because the only way for the dequeueing threads to exit is to exhaust the queue.
Except for one and a half bugs. You see, there is a difference between an empty queue and a finished queue. If the queue is just empty, the dequeueing threads will block on $queue->dequeue until they get some input. The traditional solution is to dequeue while the value they get is defined. We can break the loop by supplying as many undef values in the queue as there are threads reading from the queue. More modern versions of Thread::Queue have an end method that makes dequeue return undef for all subsequent calls.
The problem is when to end the queue. We should do this after all enqueueing threads have exited. Which means we should wait for them manually. Sigh.
my @enqueueing = map threads->create(\&enqueue), 1..5;
my @dequeueing = map threads->create(\&dequeue), 1..3;
$_->join for @enqueueing;
$queue->enqueue(undef) for 1..3;
$_->join for @dequeueing;
And in sub dequeue: while (defined( my $item = $queue->dequeue )) { ... }.
Using the defined test fixes another bug: rand can return zero, although this is quite unlikely and will slip through most tests. The contract of rand is that it returns a pseudo-random floating point number between zero (inclusive) and some upper bound (exclusive): a number from the interval [0, x). The bound defaults to 1.
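To make the failure mode concrete, here is a hypothetical pair of loops (not part of the program below): a plain truth test treats a legitimate 0 the same as the undef sentinel, while the defined test does not:

while (my $item = $queue->dequeue) { ... }            # stops early if $item happens to be 0
while (defined( my $item = $queue->dequeue )) { ... } # stops only at the undef sentinel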
If you don't want to join the enqueueing threads manually, you could use a semaphore to signal completion. A semaphore is a multithreading primitive that can be incremented and decremented, but not below zero. If a decrement operation would drop the count below zero, the call blocks until another thread raises the count again. If the start count is 1, a semaphore can be used as a flag to block resources.
We can also start with a negative value, 1 - $NUM_THREADS, and have each thread increment the value, so that it can only be decremented again once all threads have exited.
use threads; # make a habit of importing `threads` as the first thing
use strict; use warnings;
use feature 'say';
use Thread::Queue;
use Thread::Semaphore;

use constant {
    NUM_ENQUEUE_THREADS => 5, # it's good to fix the thread counts early
    NUM_DEQUEUE_THREADS => 3,
};

sub enqueue {
    my ($out_queue, $finished_semaphore) = @_;
    my $tid = threads->tid;
    # iterate over ranges instead of using the while($maxval --> 0) idiom
    for (1 .. 1000) {
        $out_queue->enqueue(my $val = rand 10_000);
        say "Thread $tid enqueued $val";
    }
    $finished_semaphore->up;
    # try a non-blocking decrement. Returns true only for the last thread exiting.
    if ($finished_semaphore->down_nb) {
        $out_queue->end; # for sufficiently modern versions of Thread::Queue
        # $out_queue->enqueue(undef) for 1 .. NUM_DEQUEUE_THREADS;
    }
}

sub dequeue {
    my ($in_queue) = @_;
    my $tid = threads->tid;
    while (defined( my $item = $in_queue->dequeue )) {
        say "thread $tid dequeued $item";
    }
}

# create the queue and the semaphore
my $queue = Thread::Queue->new;
my $enqueuers_ended_semaphore = Thread::Semaphore->new(1 - NUM_ENQUEUE_THREADS);

# kick off the enqueueing threads -- they handle themselves
threads->create(\&enqueue, $queue, $enqueuers_ended_semaphore)->detach for 1 .. NUM_ENQUEUE_THREADS;

# start and join the dequeueing threads
my @dequeuers = map threads->create(\&dequeue, $queue), 1 .. NUM_DEQUEUE_THREADS;
$_->join for @dequeuers;
Don't be surprised if the threads do not seem to run in parallel but sequentially: this task (enqueueing a random number) is very fast, and is not well suited for multithreading (enqueueing is more expensive than creating a random number).
Here is a sample run where each enqueuer only creates two values:
Thread 1 enqueued 6.39390993005694
Thread 1 enqueued 0.337993319585337
Thread 2 enqueued 4.34504733960242
Thread 2 enqueued 2.89158054485114
Thread 3 enqueued 9.4947585773571
Thread 3 enqueued 3.17079715055542
Thread 4 enqueued 8.86408863197179
Thread 5 enqueued 5.13654995317669
Thread 5 enqueued 4.2210886147538
Thread 4 enqueued 6.94064174636395
thread 6 dequeued 6.39390993005694
thread 6 dequeued 0.337993319585337
thread 6 dequeued 4.34504733960242
thread 6 dequeued 2.89158054485114
thread 6 dequeued 9.4947585773571
thread 6 dequeued 3.17079715055542
thread 6 dequeued 8.86408863197179
thread 6 dequeued 5.13654995317669
thread 6 dequeued 4.2210886147538
thread 6 dequeued 6.94064174636395
You can see that thread 5 managed to enqueue a few things before thread 4 finished. Threads 7 and 8 don't get to dequeue anything; thread 6 is too fast. Also, all enqueuers finished before the dequeuers were spawned (for such a small number of inputs).
I have a queue of 1000 work items and an n-proc machine (assume n = 4). The main thread spawns n (= 4) worker threads at a time (250 outer iterations) and waits for all threads to complete before processing the next n (= 4) items, until the entire queue is processed:
for (i = 0 to queue.Length / numprocs)
{
    for (j = 0 to numprocs)
    {
        CreateThread(WorkerThread, WorkItem)
    }
    WaitForMultipleObjects(threadHandle[])
}
The work done by each worker thread is not homogeneous. Therefore, in one batch (of n), if thread 1 spends 1000 s doing work and the other 3 threads only 1 s, the above design is inefficient, because after 1 second the other 3 processors are idling. Besides, there is no pooling: 1000 distinct threads are being created.
How do I use the NT thread pool (I am not familiar enough with it, hence the long-winded question) and QueueUserWorkItem to achieve the above? The following constraints should hold:
The main thread requires that all work items are processed before it can proceed, so I would think that a wait-all-like construct as above is required.
I want to create only as many threads as there are processors (i.e. not 1000 threads at a time).
Also, I don't want to create 1000 distinct events, pass them to the worker threads, and wait on all of the events, whether using the QueueUserWorkItem API or otherwise.
The existing code is in C++. I prefer C++ because I don't know C#.
I suspect that the above is a very common pattern and was looking for input from you folks.
I'm not a C++ programmer, so I'll give you some halfway pseudocode for it:
tcount = 0
maxproc = 4

while queue_item = queue.get_next() # depends on the implementation of queue;
                                    # may well be: for i = 0; i < queue.length; i++
    while tcount == maxproc
        wait 0.1 seconds # or some other interval that isn't as CPU intensive
                         # as continuously running the loop
    tcount += 1 # must be atomic (reading the value and writing the new
                # one must happen consecutively, without interruption from
                # other threads). In C++, InterlockedIncrement or a
                # std::atomic counter would handle that.
    new thread(worker, queue_item)

function worker(item)
    # ...do stuff with item here...
    tcount -= 1 # must be atomic
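For completeness, here is one way the pseudocode above could translate into actual Win32 calls: a minimal sketch (the WorkBatch type and names are mine, not an existing API) that uses QueueUserWorkItem plus a single event and an interlocked counter. The pool sizes itself to the machine, and the last completing item signals the event, so the main thread gets its wait-all without 1000 threads or 1000 events:

#include <windows.h>

struct WorkBatch {
    LONG   remaining; // work items still running
    HANDLE done;      // manual-reset event, signalled by the last item
};

static DWORD WINAPI WorkerCallback(LPVOID p)
{
    WorkBatch* batch = static_cast<WorkBatch*>(p);
    // ...process one work item here (real code would also carry per-item data)...
    if (InterlockedDecrement(&batch->remaining) == 0)
        SetEvent(batch->done); // last one out wakes the main thread
    return 0;
}

int main()
{
    const LONG numItems = 1000;
    WorkBatch batch = { numItems, CreateEvent(NULL, TRUE, FALSE, NULL) };

    for (LONG i = 0; i < numItems; ++i)
        QueueUserWorkItem(WorkerCallback, &batch, WT_EXECUTEDEFAULT);

    WaitForSingleObject(batch.done, INFINITE); // wait-all via one event
    CloseHandle(batch.done);
    return 0;
}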