Multithreaded performance (with Rust)

I have been running the following experiment to test multi-threaded performance in Rust.
The code below does the following:
STEP 1: Generate 50 million random (key, value) pairs on the main thread.
STEP 2: Insert the 50 million pairs into a HashMap. This processing step is repeated simultaneously on count threads; every thread has its own HashMap.
use rand::Rng;
use std::collections::HashMap;
use std::sync::Arc;
use std::thread;
use std::time::Instant;

fn generate_values(count: usize) -> Vec<([u8; 3], u8)> {
    let mut generator = rand::thread_rng();
    let mut values = Vec::new();
    for _ in 0..count {
        let key = generator.gen::<[u8; 3]>();
        let value = generator.gen::<u8>();
        values.push((key, value));
    }
    values
}

fn process_values(values: &Arc<Vec<([u8; 3], u8)>>, count: usize) {
    let mut handles = Vec::new();
    for _ in 0..count {
        let values = Arc::clone(values);
        handles.push(thread::spawn(move || {
            // Each thread builds its own HashMap; the input is shared read-only.
            let mut map = HashMap::new();
            for (key, value) in values.iter() {
                map.insert(key, value);
            }
        }));
    }
    for handle in handles {
        handle.join().unwrap();
    }
}

fn main() {
    let values = Arc::new(generate_values(50_000_000));
    println!("processing values...");
    for count in 1..=16 {
        let start = Instant::now();
        process_values(&values, count);
        println!(
            "processing values, repeated over {} thread(s), took {:?}",
            count,
            start.elapsed()
        );
    }
}
I am running the code on a dedicated server with one AMD Ryzen 7 3700X 8-Core processor and 64 GB of RAM, running Ubuntu 20.04. Nothing else is running on the server.
I would have expected that running the code on 1 to 8 threads would take roughly the same amount of time, but running it on 8 threads took about 23% more time (19.59 s) than running it on 1 thread (15.97 s):
processing values, repeated over 1 thread(s), took 15.970677367s
processing values, repeated over 2 thread(s), took 15.936398062s
processing values, repeated over 3 thread(s), took 16.497970587s
processing values, repeated over 4 thread(s), took 17.233953355s
processing values, repeated over 5 thread(s), took 17.233057743s
processing values, repeated over 6 thread(s), took 18.223844841s
processing values, repeated over 7 thread(s), took 19.094954912s
processing values, repeated over 8 thread(s), took 19.592578442s
processing values, repeated over 9 thread(s), took 21.152438731s
processing values, repeated over 10 thread(s), took 22.881476672s
processing values, repeated over 11 thread(s), took 22.97713133s
processing values, repeated over 12 thread(s), took 23.841287249s
processing values, repeated over 13 thread(s), took 24.713425745s
processing values, repeated over 14 thread(s), took 24.979827585s
processing values, repeated over 15 thread(s), took 25.78961309s
processing values, repeated over 16 thread(s), took 26.511473666s
Then I thought it had something to do with hyper-threading, so I disabled simultaneous multi-threading:
echo off > /sys/devices/system/cpu/smt/control
These are the results without hyper-threading:
processing values, repeated over 1 thread(s), took 15.906120824s
processing values, repeated over 2 thread(s), took 15.927443081s
processing values, repeated over 3 thread(s), took 16.701871709s
processing values, repeated over 4 thread(s), took 16.73429606s
processing values, repeated over 5 thread(s), took 17.785883476s
processing values, repeated over 6 thread(s), took 18.171144237s
processing values, repeated over 7 thread(s), took 18.871619003s
processing values, repeated over 8 thread(s), took 19.439770035s
processing values, repeated over 9 thread(s), took 22.937699259s
processing values, repeated over 10 thread(s), took 25.164055752s
processing values, repeated over 11 thread(s), took 29.44375459s
processing values, repeated over 12 thread(s), took 30.436276538s
processing values, repeated over 13 thread(s), took 33.775704733s
processing values, repeated over 14 thread(s), took 35.962573012s
processing values, repeated over 15 thread(s), took 38.04670196s
processing values, repeated over 16 thread(s), took 40.535251291s
Still the same unexpected 22-23% performance decrease when going from 1 to 8 threads.
The scaling from 8 to 16 threads, on the other hand, is consistent and as expected: it takes about twice as long to run the code on 16 threads as on 8.
[Graph: relative time taken to run the code on count threads versus 1 thread]
Is such a performance decrease of 22-23% expected when repeating the code over 8 threads versus 1 thread on an 8-core processor?
In other words, what explains the performance decrease?
The code is run in release mode with "cargo run --release".

Two possibilities:
Even without lock contention, memory bandwidth and caches are shared resources, and running more code in parallel is constrained by them. You can get some clues with perf stat, e.g. whether there are more cache misses as the thread count grows.
Modern processors run at a higher frequency when only a few cores are active. For the Ryzen 7 3700X, the CPU boosts up to 4.4 GHz in turbo mode, while the base clock rate is only 3.6 GHz. Note that 4.4 / 3.6 ≈ 1.22, which matches the observed ~23% slowdown from 1 to 8 threads quite closely.

Related

Calculating speed up and efficiency of a parallel program

Let's say the least time taken by a parallel program is 25 ms on 32 threads, and it takes 400 ms on 1 thread. How can I find the speedup and efficiency of the program?
speedup = 400 / 25 = 16
efficiency = speedup / no. of threads = 16 / 32 = 0.5 = 50 %
Technically, if we calculate the efficiency for 1 thread, it comes out to 100 %.
Even when I calculated the efficiency for other thread counts that took more time, their efficiency was higher than that of the "optimum" 32-thread run.
So are my calculations correct? And how can I tell which is the most efficient?
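For concreteness, here is a minimal sketch of the two formulas in Rust, using the numbers from the question (the function names are just illustrative):

fn speedup(t1: f64, tn: f64) -> f64 {
    // S(n) = T(1) / T(n)
    t1 / tn
}

fn efficiency(t1: f64, tn: f64, n: u32) -> f64 {
    // E(n) = S(n) / n
    speedup(t1, tn) / n as f64
}

fn main() {
    let (t1, t32) = (400.0, 25.0); // milliseconds, from the question
    println!("speedup = {}", speedup(t1, t32)); // 16
    println!("efficiency = {:.0} %", efficiency(t1, t32, 32) * 100.0); // 50 %
}

By these definitions, 1 thread is always exactly 100 % efficient, so the fastest configuration and the most efficient one are usually different questions.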

What is the need to divide a list's sys.getsizeof() by 8 or 4 (depending on the machine) after subtracting 64 or 36?

I am trying to find the capacity of a list with a function. One step involves subtracting 64 (on my machine) from the list's sys.getsizeof() value and then dividing by 8 to get the capacity. What does this capacity value mean?
I tried reading the Python docs for the sys.getsizeof() method, but they couldn't answer my doubts.
import sys

def disp(l1):
    print("Capacity", (sys.getsizeof(l1) - 64) // 8)  # What does this line mean, especially the //8 part?
    print("Length", len(l1))

mariya_list = []
mariya_list.append("Sugar is very sweet and it can be used for cooking sweets and also used in beverages ")
mariya_list.append("Choco")
mariya_list.append("bike")
disp(mariya_list)
print(mariya_list)
mariya_list.append("lemon")
print(mariya_list)
disp(mariya_list)
mariya_list.insert(1, "leomon Tea")
print(mariya_list)
disp(mariya_list)
Output:
Capacity 4
Length 1
['Choco']
['Choco', 'lemon']
Capacity 4
Length 2
['Choco', 'leomon Tea', 'lemon']
Capacity 4
Length 3
This is the output. I am unable to understand what Capacity 4 means here. Why does it repeat the same value 4 even after subsequent additions of elements?
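The "capacity" here is the number of element slots CPython has allocated for the list: the 64 bytes are the size of the empty list object on that build, and each slot holds an 8-byte pointer, so (getsizeof - 64) // 8 counts allocated slots. Lists over-allocate so repeated appends stay cheap, which is why the value stays at 4 until the buffer has to grow. The same capacity-versus-length distinction exists in Rust, where Vec exposes it directly; a side-by-side analogue (not CPython's actual allocator):

fn main() {
    let mut v: Vec<&str> = Vec::new();
    for item in ["Choco", "leomon Tea", "lemon", "bike", "sugar"] {
        v.push(item);
        // capacity() only jumps when the backing buffer is reallocated,
        // just as getsizeof() only jumps when CPython over-allocates.
        println!("len = {}, capacity = {}", v.len(), v.capacity());
    }
}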

difference between counting packets and counting the total number of bytes in the packets

I'm reading perfbook. In chapter 5.2, the book gives some examples of statistical counters that can solve the network-packet counting problem.
Quick Quiz 5.2: Network-packet counting problem. Suppose that you need
to collect statistics on the number of networking packets (or total
number of bytes) transmitted and/or received. Packets might be
transmitted or received by any CPU on the system. Suppose further that
this large machine is capable of handling a million packets per
second, and that there is a system-monitoring package that reads out
the count every five seconds. How would you implement this statistical
counter?
There is one QuickQuiz that asks about the difference between counting packets and counting the total number of bytes in the packets.
I can't understand the answer; after reading it, I still don't know the difference.
In the example in the "To see this" paragraph, if we change the numbers 3 and 5 to 1, what difference does it make?
Please help me understand it.
QuickQuiz5.26: What fundamental difference is there between counting
packets and counting the total number of bytes in the packets, given
that the packets vary in size?
Answer: When counting packets, the
counter is only incremented by the value one. On the other hand, when
counting bytes, the counter might be incremented by largish numbers.
Why does this matter? Because in the increment-by-one case, the value
returned will be exact in the sense that the counter must necessarily
have taken on that value at some point in time, even if it is
impossible to say precisely when that point occurred. In contrast,
when counting bytes, two different threads might return values that are
inconsistent with any global ordering of operations.
To see this, suppose that thread 0 adds the value three to its counter,
thread 1 adds the value five to its counter, and threads 2 and 3 sum the
counters. If the system is “weakly ordered” or if the compiler uses
aggressive optimizations, thread 2 might find the sum to be three and
thread 3 might find the sum to be five. The only possible global orders
of the sequence of values of the counter are 0,3,8 and 0,5,8, and
neither order is consistent with the results obtained.
If you missed this one, you are not alone. Michael Scott used this
question to stump Paul E. McKenney during Paul's Ph.D. defense.
I may be wrong, but I presume the idea behind it is the following. Suppose there are 2 separate processes which collect their own counters, to be summed up for a total value. Now suppose some events occur simultaneously in both processes: for example, a packet of size 10 comes to the first process and a packet of size 20 comes to the second at the same time, and after some period of time a packet of size 30 comes to the first process at the same time as a packet of size 60 comes to the second. So here is the sequence of events:
Time point#1 Time point#2
Process1: 10 30
Process2: 20 60
Now let's build the set of possible total counter states after time points #1 and #2 for a weakly ordered system, assuming the previous total value was 0:
Time point#1
0 + 10 (process 1 wins) = 10
0 + 20 (process 2 wins) = 20
0 + 10 + 20 = 30
Time point#2
10 + 30 = 40 (process 1 wins)
10 + 60 = 70 (process 2 wins)
20 + 30 = 50 (process 1 wins)
20 + 60 = 80 (process 2 wins)
30 + 30 = 60 (process 1 wins)
30 + 60 = 90 (process 2 wins)
30 + 30 + 60 = 120
Now, given that some time may pass between time point #1 and time point #2, let's assess which values reflect the real state of the system. All states after time point #1 can be treated as valid, since there was some precise moment in time when the total received size was 10, 20 or 30 (we ignore the fact that the final value may not be the current one; at least it is a value that was actual at some moment of the system's operation). For the possible states after time point #2 the picture is different: the system has never been in states 40, 50, 70 or 80, yet we risk getting these values after the second collection.
Now let's take a look at the situation from the number of packets perspective. Our matrix of events is:
Time point#1 Time point#2
Process1: 1 1
Process2: 1 1
The possible total states:
Time point#1
0 + 1 (process 1 wins) = 1
0 + 1 (process 2 wins) = 1
0 + 1 + 1 = 2
Time point#2
1 + 1 (process 1 wins) = 2
1 + 1 (process 2 wins) = 2
2 + 1 (process 1 wins) = 3
2 + 1 (process 2 wins) = 3
2 + 2 = 4
In this case, every possible value (1, 2, 3, 4) reflects a state the system definitely was in at some point in time.
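To make the setup concrete, here is a minimal sketch in Rust of the statistical counter the quiz describes, assuming one counter slot per worker thread rather than per CPU; the monitor simply sums the slots, which is exactly where the byte-counting anomaly above can appear:

use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;
use std::thread;

fn main() {
    let n_workers: usize = 4;
    let counters: Arc<Vec<AtomicU64>> =
        Arc::new((0..n_workers).map(|_| AtomicU64::new(0)).collect());

    let mut handles = Vec::new();
    for id in 0..n_workers {
        let counters = Arc::clone(&counters);
        handles.push(thread::spawn(move || {
            for _ in 0..1_000_000 {
                // Each worker only touches its own slot, so there is no
                // contention. Counting packets increments by 1; counting
                // bytes would increment by the packet size, which is where
                // the "no consistent global order" issue arises.
                counters[id].fetch_add(1, Ordering::Relaxed);
            }
        }));
    }
    for h in handles {
        h.join().unwrap();
    }

    // The reader sums the slots: approximate while updates are in flight,
    // exact once the workers are done.
    let total: u64 = counters.iter().map(|c| c.load(Ordering::Relaxed)).sum();
    println!("total packets: {total}");
}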

Multithreading - Calculations between all pairs in a set

I have n elements (e.g. A, B, C and D) and need to do calculations between all of those.
Calculation 1 = A with B
Calculation 2 = A with C
Calculation 3 = A with D
Calculation 4 = B with C
Calculation 5 = B with D
Calculation 6 = C with D
In reality there are more than 1000 elements and I want to parallelise the process.
Note that I can't access an element from 2 threads simultaneously. This for example makes it impossible to do Calculation 1 and Calculation 2 at the same time because they both use the element A.
Edit: I could access an element from 2 threads, but it makes everything very slow if I just split up the calculations and depend on locks for thread safety.
Is there already a distribution algorithm for this kind of problem?
It seems like a lot of people must have had the same problem already, but I couldn't find anything on the great internet. ;)
Single thread example code:
for (int i = 0; i < elementCount; i++)
{
    for (int j = i + 1; j < elementCount; j++)
    {
        Calculate(element[i], element[j]);
    }
}
You can apply the round-robin tournament algorithm, which enumerates all possible pairs (N*(N-1)/2 results).
All set elements (players) form two rows; each column is a pair in the current round. The first element stays fixed, while the others are shifted cyclically.
So you can run up to N/2 threads to get results for the first set of pairs, then reorder the indexes and continue.
Excerpt from wiki:
The circle method is the standard algorithm to create a schedule for a round-robin tournament. All competitors are assigned to numbers, and then paired in the first round:
Round 1. (1 plays 14, 2 plays 13, ... )
1 2 3 4 5 6 7
14 13 12 11 10 9 8
then fix one of the competitors in the first or last column of the table (number one in this example) and rotate the others clockwise one position
Round 2. (1 plays 13, 14 plays 12, ... )
1 14 2 3 4 5 6
13 12 11 10 9 8 7
Round 3. (1 plays 12, 13 plays 11, ... )
1 13 14 2 3 4 5
12 11 10 9 8 7 6
until you end up almost back at the initial position
Round 13. (1 plays 2, 3 plays 14, ... )
1 3 4 5 6 7 8
2 14 13 12 11 10 9
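A minimal sketch of the circle method in Rust, assuming an even number of elements (for odd N the usual trick is to add a dummy element and skip its pairs); every round is a set of disjoint pairs, so all pairs within one round can be dispatched to up to N/2 threads, with a synchronization point between rounds:

fn round_robin_rounds(n: usize) -> Vec<Vec<(usize, usize)>> {
    assert!(n % 2 == 0, "use a dummy element for odd n");
    let mut ring: Vec<usize> = (1..n).collect(); // everyone except player 0
    let mut rounds = Vec::new();
    for _ in 0..n - 1 {
        // Player 0 is fixed; the others pair up across the two rows.
        let mut round = vec![(0, ring[0])];
        for k in 1..n / 2 {
            round.push((ring[k], ring[n - 1 - k]));
        }
        rounds.push(round);
        ring.rotate_right(1); // rotate the others one position
    }
    rounds
}

fn main() {
    for (i, round) in round_robin_rounds(6).iter().enumerate() {
        println!("round {}: {:?}", i + 1, round);
    }
}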
It is simple enough to prove that there is no way to distribute your calculations so that collisions never occur (that is, unless you manually order the computations and place round boundaries, as @Mbo suggests), meaning there is no distribution amongst multiple threads that will never have to lock.
Proof:
Given your requirement, any computation involving data object A must happen on a given thread T (that is the only way to make sure you never lock on A).
It follows that thread T has to handle at least one pair containing each of the other objects (B, C, D) of the input list.
By the same basic requirement, T then also has to handle everything object-B related. And C. And D. So everything.
Therefore, only T can work.
QED. There is no possible parallelization that will never lock.
Way around #1: map/reduce
That said... this is a typical case of divide and conquer. You are right that simple additions can require critical-section locks, even though the order of execution does not matter. That is because your critical operation (addition) has a nice property, associativity: A+(B+C) = (A+B)+C, on top of being commutative.
In other words, this operation is a candidate for a (parallel-friendly) reduce operation.
So the key here is probably :
Emit a stream of all interesting pairs
Map each pair to one or more partial results
Group each partial result by its master object (A, B, C)
Reduce each group by combining the partial results
A sample (pseudo) code:
static class Data { int i = 0; }
static class Pair { Data d1; Data d2; Pair(Data d1, Data d2) { this.d1 = d1; this.d2 = d2; } }
static class PartialComputation {
    Data d; int sum;
    PartialComputation(Data d, int sum) { this.d = d; this.sum = sum; }
}

Data[] data = ...

// Emit a stream of all interesting pairs (i < j)
Stream<Pair> allPairs = IntStream.range(0, data.length - 1).boxed()
    .flatMap(i -> IntStream.range(i + 1, data.length)
        .mapToObj(j -> new Pair(data[i], data[j])));

// Map each pair to partial results keyed by the original data object,
// then group by that object and reduce each group by summing, in parallel
Map<Data, Integer> sums = allPairs.parallel()
    .flatMap(pair -> Stream.of(
        new PartialComputation(pair.d1, pair.d1.i + pair.d2.i),
        new PartialComputation(pair.d2, pair.d1.i + pair.d2.i)))
    .collect(Collectors.groupingByConcurrent(
        pc -> pc.d,
        Collectors.summingInt(pc -> pc.sum)));
Way around #2: trust the implementations
Fact is, uncontended locks in Java have gotten cheaper. On top of that, pure locking sometimes has better alternatives, like the atomic types in Java (e.g. AtomicLong if you are summing stuff), which use CAS instead of locking and can be faster (google for it; I usually refer to the Java Concurrency in Practice book for hard numbers).
The fact is, if you have 1000 to 10k different elements (which translates to millions of pairs) and, say, 8 CPUs, the contention (the probability that at least 2 of your 8 threads are processing the same element at the same time) is pretty low. I would rather measure it first-hand than say upfront "I cannot afford the locks", especially if the operation can be implemented using atomic types.
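For illustration, here is the Rust analogue of the AtomicLong idea (a sketch, not a benchmark): every thread adds into one shared counter with a lock-free read-modify-write instead of taking a mutex:

use std::sync::atomic::{AtomicI64, Ordering};
use std::thread;

fn main() {
    static SUM: AtomicI64 = AtomicI64::new(0);
    thread::scope(|s| {
        for _ in 0..8 {
            s.spawn(|| {
                for i in 0..1_000_i64 {
                    // Atomic add: no mutex, hardware CAS / locked add instead.
                    SUM.fetch_add(i, Ordering::Relaxed);
                }
            });
        }
    });
    // 8 threads * (0 + 1 + ... + 999) = 8 * 499_500 = 3_996_000
    println!("sum = {}", SUM.load(Ordering::Relaxed));
}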

parallel loop over pairs

What is the best way in C++11 to perform a pairwise computation on multiple threads? What I mean is: I have a vector of elements and want to compute a function for each pair of distinct elements. The caveat is that I cannot use the same element in multiple threads at the same time; the elements have states that evolve during the computation, and the computation relies on that.
An easy way would be to group the pairs by offsets.
If v is a vector, then the pairs of elements N apart (mod v.size()) can be split into two collections of pairs, each of which contains no overlaps within itself.
Examine a 10 element vector 0 1 2 3 4 5 6 7 8 9. The pairs 1 apart are:
0 1, 1 2, 2 3, 3 4, 4 5, 5 6, 6 7, 7 8, 8 9, 9 0
if you split these by "parity" into two collections we get:
0 1, 2 3, 4 5, 6 7, 8 9
1 2, 3 4, 5 6, 7 8, 9 0
You can work, in parallel, on each of the above collections. When the collection is finished, sync up, then work on the next collection.
Similar tricks work for 2 apart.
0 2, 1 3, 4 6, 5 7
2 4, 3 5, 6 8, 7 9
with leftovers:
8 0, 9 1
For every offset from 1 to n/2 there are 2 "collections" and leftovers.
Here is offset of 4:
0 4, 1 5, 2 6, 3 7
4 8, 5 9, 6 0, 7 1
and leftovers
8 2, 9 3
(I naively think the size of the leftovers is the vector size mod twice the offset.)
Calculating these collections (and the leftovers) isn't hard; arranging to queue up threads and get the right tasks efficiently in the right threads is harder.
There are N choose 2, i.e. (n^2-n)/2, pairs. This split gives you about 1.5·n collections and leftover sets, each of size at most n/2, with full parallelism within each collection.
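Here is a sketch of that batching (in Rust, to match the rest of this page; batches_for_offset is a hypothetical helper name). For a given offset d, the pairs split into the two block-parity collections plus the wrap-around leftovers, and no index appears twice inside any one batch:

// Split the pairs (i, i + d mod n) into three conflict-free batches:
// "even" blocks, "odd" blocks, and pairs that wrap around the end.
// Assumes 1 <= d <= n/2; for d == n/2 each pair would be emitted twice,
// so take only i < n/2 in that case.
fn batches_for_offset(n: usize, d: usize) -> [Vec<(usize, usize)>; 3] {
    let mut even = Vec::new();
    let mut odd = Vec::new();
    let mut leftovers = Vec::new();
    for i in 0..n {
        let j = (i + d) % n;
        if j < i {
            leftovers.push((j, i)); // wrapped pair, lower index first
        } else if (i / d) % 2 == 0 {
            even.push((i, j));
        } else {
            odd.push((i, j));
        }
    }
    [even, odd, leftovers]
}

fn main() {
    // Reproduces the offset-2 example above:
    // [(0,2),(1,3),(4,6),(5,7)], [(2,4),(3,5),(6,8),(7,9)], [(0,8),(1,9)]
    for batch in batches_for_offset(10, 2) {
        println!("{:?}", batch);
    }
}

Each batch can then be handed to a pool of threads, with a sync point between batches.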
If you have a situation where some elements are far more expensive than others, and thus waiting for each collection to finish idles threads too much, you could add fine-grained synchronization.
Maintain a vector of atomic bools, and use it to indicate that an element is currently being processed. Always "lock" (set to true, checking that it was false before you set it) the lower-indexed element of the pair before the upper one.
If you manage to lock both, process away, then clear them both.
If you fail, remember the task for later and work on other tasks. When you have too many tasks queued, wait on a condition variable, trying to check-and-set the atomic bool you want to lock in the spin lambda.
Periodically kick the condition variable when you clear the locks; how often you do this will depend on profiling. You can mayhap kick without acquiring the mutex (but you must sometimes acquire the mutex after clearing the bools, to deal with a race condition that could starve a thread).
Queue the tasks in the order indicated by the collection system above, as that reduces the likelihood of threads colliding. With this system, work can still progress even if one task is falling behind.
It adds complexity and synchronization, which could easily make it slower than the pure collection/cohort approach.
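A minimal sketch of that fine-grained locking idea in Rust (the names try_lock_pair/unlock_pair are just illustrative; the condition-variable queuing is left out):

use std::sync::atomic::{AtomicBool, Ordering};

// Try to claim both elements of a pair, always taking the lower index
// first so two threads cannot each hold one half of the same pair.
fn try_lock_pair(busy: &[AtomicBool], i: usize, j: usize) -> bool {
    let (lo, hi) = if i < j { (i, j) } else { (j, i) };
    if busy[lo]
        .compare_exchange(false, true, Ordering::Acquire, Ordering::Relaxed)
        .is_err()
    {
        return false; // someone else holds the lower element
    }
    if busy[hi]
        .compare_exchange(false, true, Ordering::Acquire, Ordering::Relaxed)
        .is_err()
    {
        busy[lo].store(false, Ordering::Release); // roll back, retry later
        return false;
    }
    true // caller processes (i, j), then calls unlock_pair
}

fn unlock_pair(busy: &[AtomicBool], i: usize, j: usize) {
    busy[i].store(false, Ordering::Release);
    busy[j].store(false, Ordering::Release);
}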
