Let's say I have the following benchmark to test how log::info! impacts the performance of the code.
Side note: the log crate in Rust is only a facade, and in order to produce output it requires a compatible logger implementation. Since no implementation is present in the code below, I wanted to see whether the compiler could optimize the logging call out.
Here is the code:
use std::collections::HashSet;

use criterion::{criterion_group, criterion_main, Criterion};

fn f<const LOG: bool>() -> usize {
    let mut x = 0;
    let mut hs = HashSet::<i32>::new();
    for i in 0..10000 {
        x += i;
        if hs.contains(&x) {
            x *= 2;
        } else {
            x *= 3;
        }
        x = x % 1000;
        hs.insert(x);
        if LOG {
            log::info!("{}: {}", i, hs.len());
        }
    }
    hs.len()
}

fn criterion_benchmark(c: &mut Criterion) {
    c.bench_function("without", |b| b.iter(|| f::<false>()));
    c.bench_function("with", |b| b.iter(|| f::<true>()));
}

criterion_group!(benches, criterion_benchmark);
criterion_main!(benches);
Running
cargo clean
cargo bench
Produces
without time: [265.71 µs 272.44 µs 280.77 µs]
Found 15 outliers among 100 measurements (15.00%)
6 (6.00%) high mild
9 (9.00%) high severe
with time: [276.52 µs 277.55 µs 278.80 µs]
Found 12 outliers among 100 measurements (12.00%)
3 (3.00%) high mild
9 (9.00%) high severe
Then, running the benchmark again
cargo bench
Produces
without time: [261.80 µs 263.03 µs 264.49 µs]
change: [-7.5920% -3.7254% +0.1192%] (p = 0.06 > 0.05)
No change in performance detected.
Found 12 outliers among 100 measurements (12.00%)
6 (6.00%) high mild
6 (6.00%) high severe
with time: [264.31 µs 265.26 µs 266.32 µs]
change: [-9.9053% -6.0486% -2.2028%] (p = 0.00 < 0.05)
Performance has improved.
Found 8 outliers among 100 measurements (8.00%)
3 (3.00%) high mild
5 (5.00%) high severe
Running the benchmark a third time
cargo bench
Produces
without time: [251.02 µs 251.39 µs 251.83 µs]
change: [-7.6265% -4.5399% -1.4715%] (p = 0.00 < 0.05)
Performance has improved.
Found 11 outliers among 100 measurements (11.00%)
3 (3.00%) high mild
8 (8.00%) high severe
with time: [251.56 µs 251.94 µs 252.38 µs]
change: [-8.3006% -5.3281% -2.1746%] (p = 0.00 < 0.05)
Performance has improved.
Found 12 outliers among 100 measurements (12.00%)
2 (2.00%) high mild
10 (10.00%) high severe
Why does each consecutive run show a performance improvement? The code and the environment are identical.
The Criterion.rs user guide answers this question fairly well.
Typically this happens because the benchmark environments aren't quite the same. There are a lot of factors that can influence benchmarks. Other processes might be using the CPU or memory. Battery-powered devices often have power-saving modes that clock down the CPU (and these sometimes appear in desktops as well). If your benchmarks are run inside a VM, there might be other VMs on the same physical machine competing for resources.
However, sometimes this happens even with no change. It's important to remember that Criterion.rs detects regressions and improvements statistically. There is always a chance that you randomly get unusually fast or slow samples, enough that Criterion.rs detects it as a change even though no change has occurred. In very large benchmark suites you might expect to see several of these spurious detections each time you run the benchmarks.
Unfortunately, this is a fundamental trade-off in statistics. In order to decrease the rate of false detections, you must also decrease the sensitivity to small changes. Conversely, to increase the sensitivity to small changes, you must also increase the chance of false detections. Criterion.rs has default settings that strike a generally-good balance between the two, but you can adjust the settings to suit your needs.
One can configure the benchmark group's significance level and sample size, as described in Advanced Configuration, to tune this trade-off between sensitivity and false detections:
fn bench(c: &mut Criterion) {
    let mut group = c.benchmark_group("sample-size-example");
    // Configure Criterion.rs to detect smaller differences and increase sample size to improve
    // precision and counteract the resulting noise.
    group.significance_level(0.1).sample_size(500);
    group.bench_function("my-function", |b| b.iter(|| my_function()));
    group.finish();
}
So I've written the following function to demonstrate the issue I'm seeing with thread::sleep accuracy:
use std::{thread, time};

const TARGET_FPS: u64 = 60;

fn main() {
    let mut frames = 0;
    let target_ft = time::Duration::from_micros(1000000 / TARGET_FPS);
    println!("target frame time: {:?}", target_ft);
    let mut time_slept = time::Duration::from_micros(0);
    let start = time::Instant::now();
    loop {
        let frame_time = time::Instant::now();
        frames += 1;
        if frames == 60 {
            break;
        }
        if let Some(i) = target_ft.checked_sub(frame_time.elapsed()) {
            time_slept += i;
            thread::sleep(i)
        }
    }
    println!("time elapsed: {:?}", start.elapsed());
    println!("time slept: {:?}", time_slept);
}
The idea of the function is to execute 60 cycles at 60 fps and then exit, printing the time elapsed and the total time spent sleeping during the loop. Ideally, since I'm executing 60 cycles at 60 fps with no real calculations happening in between, it should take about one second to execute and spend basically the entire second sleeping. But instead, when I run it, it returns:
target frame time: 16.666ms
time elapsed: 1.8262798s
time slept: 983.2533ms
As you can see, even though it was only told to sleep for a total of 983 ms, the 60 cycles took nearly 2 seconds to complete. Because of this nearly 50% inaccuracy, a loop told to run at 60 fps instead runs at only about 34 fps.
The docs say: "The thread may sleep longer than the duration specified due to scheduling specifics or platform-dependent functionality. It will never sleep less." But is this really just from that? Am I doing something wrong?
I switched to using spin_sleep::sleep(i) from https://crates.io/crates/spin_sleep and it seems to have fixed it. I guess it must just be Windows timer inaccuracy then... still strange that thread::sleep on Windows would be that far off for something as simple as a game loop.
I am trying to write an MPI application to speed up a math algorithm on a computer cluster. Before that, I am doing some benchmarking, but the first results are not as good as expected.
The test application shows linear speedup with 4 cores, but 5 and 6 cores do not speed it up any further. I am testing on an Odroid N2 platform, which has 6 cores; nproc says there are 6 cores available.
Am I missing some kind of configuration? Or is my code not prepared well enough (it is based on one of the basic MPI examples)?
Is there any response time or synchronization overhead that should be considered?
Here are some measurements from my MPI-based application. I measured the total calculation time for one function.
1 core 0.838052sec
2 core 0.438483sec
3 core 0.405501sec
4 core 0.416391sec
5 core 0.514472sec
6 core 0.435128sec
12 core (4 core from 3 N2 boards) 0.06867sec
18 core (6 core from 3 N2 boards) 0.152759sec
I did a benchmark on a Raspberry Pi 4 with 4 cores:
1 core 1.51 sec
2 core 0.75 sec
3 core 0.69 sec
4 core 0.67 sec
And this is my benchmark application:
int MyFun(int *array, int num_elements, int j)
{
    int result_overall = 0;
    for (int i = 0; i < num_elements; i++)
    {
        result_overall += array[i] / 1000;
    }
    return result_overall;
}

int compute_sum(int* sub_sums, int num_of_cpu)
{
    int sum = 0;
    for (int i = 0; i < num_of_cpu; i++)
    {
        sum += sub_sums[i];
    }
    return sum;
}

// measuring performance from main(): num_elements_per_proc is equal to 604800
if (world_rank == 0)
{
    startTime = std::chrono::high_resolution_clock::now();
}
// Compute the sum of your subset
int sub_sum = 0;
for (int j = 0; j < 1000; j++)
{
    sub_sum += MyFun(sub_intArray, num_elements_per_proc, world_rank);
}
MPI_Allgather(&sub_sum, 1, MPI_INT, sub_sums, 1, MPI_INT, MPI_COMM_WORLD);
int total_sum = compute_sum(sub_sums, num_of_cpu);
if (world_rank == 0)
{
    elapsedTime = std::chrono::high_resolution_clock::now() - startTime;
    timer = elapsedTime.count();
}
I build it with the -O3 optimization level.
UPDATE:
new measures:
60480 samples, MyFun called 100000 times:
1.47 -> 0.74 -> 0.48 -> 0.36
6048 samples, MyFun called 1000000 times:
1.43 -> 0.7 -> 0.47 -> 0.35
6048 samples, MyFun called 10000000 times:
14.43 -> 7.08 -> 4.72 -> 3.59
UPDATE2:
By the way, when I list the CPU info in Linux I get output that looks odd (the screenshot is not reproduced here). Is this normal?
The quad-core A73 cluster is not shown, and it says there are two sockets with 3 cores each.
Here is the CPU utilization reported by sar (likewise not reproduced): it seems like all of the cores are utilized.
I also created some plots of the speedup. It seems like calculating on float instead of int helps a bit, but cores 5 and 6 still do not help much, and I think memory bandwidth is okay. Is this normal behavior when utilizing all CPUs equally on a big.LITTLE architecture?
In my application I experience a dramatic slowdown due to concurrent usage of timegm (at the moment around 800 times slower). I uploaded my test code here: http://cpp.sh/363nv
void* testThread(void* args)
{
    time_t tt;
    tm t;
    time(&tt);
    localtime_r(&tt, &t);
    for (int i = 0; i < 5000; i++)
    {
        timegm(&t);
    }
    return nullptr;
}
The performance drop is caused by a timezone lock. Any idea how to bypass this lock? Depending on whether or not I am misusing timegm(), I will file a report with Apple. Thanks for any help.
Here are my results:
macOS 10.13.6 (Macbook Pro 2.5 GHz Intel Core i7):
5000 calls of 'timegm' with 1 thread(s) took 0.03634 seconds
5000 calls of 'timegm' with 8 thread(s) took 24.84911 seconds
For comparison, here are the results from an online C++ compiler:
Linux:
5000 calls of 'timegm' with 1 thread(s) took 0.00453 seconds
5000 calls of 'timegm' with 8 thread(s) took 0.00588 seconds
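One way around the lock (my own suggestion, not part of the question) is to avoid timegm() altogether: the input is already UTC, so the conversion to seconds since the epoch can be done with plain arithmetic and no shared state. Below is a minimal C sketch using the well-known days-from-civil algorithm; my_timegm and days_from_civil are names I made up, and the sketch assumes the struct tm fields are already normalized:

/* Lock-free replacement for timegm(): converts a UTC struct tm to seconds
 * since the epoch by pure arithmetic, touching no timezone data. */
#include <stdint.h>
#include <time.h>

/* Days since 1970-01-01 for a proleptic Gregorian date. */
static int64_t days_from_civil(int64_t y, unsigned m, unsigned d)
{
    y -= m <= 2;                      /* treat Jan/Feb as part of the previous (March-based) year */
    const int64_t era = (y >= 0 ? y : y - 399) / 400;
    const unsigned yoe = (unsigned)(y - era * 400);                       /* [0, 399] */
    const unsigned doy = (153 * (m + (m > 2 ? -3 : 9)) + 2) / 5 + d - 1;  /* [0, 365] */
    const unsigned doe = yoe * 365 + yoe / 4 - yoe / 100 + doy;           /* [0, 146096] */
    return era * 146097 + (int64_t)doe - 719468;
}

static time_t my_timegm(const struct tm *t)
{
    const int64_t days = days_from_civil(t->tm_year + 1900, t->tm_mon + 1, t->tm_mday);
    return (time_t)(days * 86400 + t->tm_hour * 3600 + t->tm_min * 60 + t->tm_sec);
}

Each thread can call my_timegm on its own struct tm, so nothing is shared; the trade-off is that, unlike timegm, this does not normalize out-of-range fields.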
Why is the speed of Node.js array shift/push operations not linear in the size of the array? There is a dramatic knee at 87370 that completely crushes the system.
Try this, first with 87369 elements in q, then with 87370. (Or, on a 64-bit system, try 85983 and 85984.) For me, the former runs in 0.05 seconds; the latter, in 80 seconds: 1600 times slower. (Observed on 32-bit Debian Linux with node v0.10.29.)
q = [];
// preload the queue with some data
for (i=0; i<87369; i++) q.push({});
// fetch oldest waiting item and push new item
for (i=0; i<100000; i++) {
    q.shift();
    q.push({});
    if (i%10000 === 0) process.stdout.write(".");
}
64-bit Debian Linux with v0.10.29 crawls starting at 85984, running in 0.06 / 56 seconds. Node v0.11.13 has similar breakpoints, but at different array sizes.
Shift is a very slow operation for arrays in general, since you need to move all the elements, but V8 is able to use a trick to perform it fast when the array contents fit in a page (1 MB).
Empty arrays start with 4 slots, and as you keep pushing, V8 resizes the backing store using the formula 1.5 * (old length + 1) + 16.
var j = 4;
while (j < 87369) {
    j = (j + 1) + Math.floor(j / 2) + 16;
    console.log(j);
}
Prints:
23
51
93
156
251
393
606
926
1406
2126
3206
4826
7256
10901
16368
24569
36870
55322
83000
124517
So your array's backing store actually ends up at 124517 slots, which makes it too large.
You can preallocate your array to just the right size, and it should be able to use the fast shift again:
var q = new Array(87369); // Fits in a page so fast shift is possible
// preload the queue with some data
for (i=0; i<87369; i++) q[i] = {};
If you need a queue larger than that, use the right data structure; a sketch of one option follows.
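A hedged illustration of what such a data structure can look like (my own sketch, not from the answer above): a fixed-capacity ring buffer gives O(1) enqueue and dequeue no matter how large the queue is. It is written in C here for concreteness, but the same layout ports directly to JavaScript with a plain array plus head/length indices; all names are made up:

#include <stdlib.h>

/* Minimal fixed-capacity ring buffer: O(1) enqueue and dequeue. */
typedef struct {
    void **items;
    size_t cap, head, len;
} queue_t;

static queue_t *queue_new(size_t cap)
{
    queue_t *q = malloc(sizeof *q);
    q->items = malloc(cap * sizeof *q->items);
    q->cap = cap;
    q->head = 0;
    q->len = 0;
    return q;
}

/* O(1): write at (head + len) % cap; nothing is ever moved. */
static int enqueue(queue_t *q, void *item)
{
    if (q->len == q->cap)
        return -1;                 /* full; a growable version would reallocate here */
    q->items[(q->head + q->len) % q->cap] = item;
    q->len++;
    return 0;
}

/* O(1): advance the head index instead of shifting the remaining elements. */
static void *dequeue(queue_t *q)
{
    if (q->len == 0)
        return NULL;
    void *item = q->items[q->head];
    q->head = (q->head + 1) % q->cap;
    q->len--;
    return item;
}

int main(void)
{
    queue_t *q = queue_new(87370);          /* the size no longer matters for speed */
    for (int i = 0; i < 87370; i++)
        enqueue(q, NULL);
    for (int i = 0; i < 100000; i++) {      /* same workload as the benchmark above */
        dequeue(q);
        enqueue(q, NULL);
    }
    free(q->items);
    free(q);
    return 0;
}

Dequeue only bumps the head index, which is the same pointer-bump trick V8 can no longer apply once the array ends up outside a single page.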
I started digging into the V8 sources, but I still don't understand it.
I instrumented deps/v8/src/builtins.cc:MoveElements (called from Builtin_ArrayShift, which implements the shift with a memmove), and it clearly shows the slowdown: only about 1000 shifts per second, because each one takes roughly 1 ms:
AR: at 1417982255.050970: MoveElements sec = 0.000809
AR: at 1417982255.052314: MoveElements sec = 0.001341
AR: at 1417982255.053542: MoveElements sec = 0.001224
AR: at 1417982255.054360: MoveElements sec = 0.000815
AR: at 1417982255.055684: MoveElements sec = 0.001321
AR: at 1417982255.056501: MoveElements sec = 0.000814
Of that, the memmove is only 0.000040 seconds; the bulk is heap->RecordWrites (deps/v8/src/heap-inl.h):
void Heap::RecordWrites(Address address, int start, int len) {
  if (!InNewSpace(address)) {
    for (int i = 0; i < len; i++) {
      store_buffer_.Mark(address + start + i * kPointerSize);
    }
  }
}
and Mark is defined as (store-buffer-inl.h):
void StoreBuffer::Mark(Address addr) {
  ASSERT(!heap_->cell_space()->Contains(addr));
  ASSERT(!heap_->code_space()->Contains(addr));
  Address* top = reinterpret_cast<Address*>(heap_->store_buffer_top());
  *top++ = addr;
  heap_->public_set_store_buffer_top(top);
  if ((reinterpret_cast<uintptr_t>(top) & kStoreBufferOverflowBit) != 0) {
    ASSERT(top == limit_);
    Compact();
  } else {
    ASSERT(top < limit_);
  }
}
When the code is running slow, there are runs of shift/push ops followed by runs of 5-6 calls to Compact() for every MoveElements. When it's running fast, MoveElements isn't called except a handful of times at the end, with just a single compaction when it finishes.
I'm guessing memory compaction might be thrashing, but it's not falling into place for me yet.
Edit: forget that last edit about output buffering artifacts, I was filtering duplicates.
This bug had been reported to Google, who closed it without studying the issue:
https://code.google.com/p/v8/issues/detail?id=3059
When shifting out and calling tasks (functions) from a queue (array)
the GC(?) is stalling for an inordinate length of time.
114467 shifts is OK
114468 shifts is problematic, symptoms occur
The response:
The GC has nothing to do with this, and nothing is stalling either.
Array.shift() is an expensive operation, as it requires all array
elements to be moved. For most areas of the heap, V8 has implemented a
special trick to hide this cost: it simply bumps the pointer to the
beginning of the object by one, effectively cutting off the first
element. However, when an array is so large that it must be placed in
"large object space", this trick cannot be applied as object starts
must be aligned, so on every .shift() operation all elements must
actually be moved in memory.
I'm not sure there's a whole lot we can do about this. If you want a
"Queue" object in JavaScript with guaranteed O(1) complexity for
.enqueue() and .dequeue() operations, you may want to implement your
own.
Edit: I just caught the subtle "all elements must be moved" part -- is RecordWrites not GC but an actual element copy then? The memmove of the array contents is 0.04 milliseconds. The RecordWrites loop is 96% of the 1.1 ms runtime.
Edit: if "aligned" means the first object must be at first address, that's what memmove does. What is RecordWrites?
I implemented a small program in C to calculate PI using a Monte Carlo method (mainly out of personal interest and for training). After having implemented the basic code structure, I added a command-line option allowing the calculations to be executed threaded.
I expected major speed-ups, but I was disappointed. The command-line synopsis should be clear. The final number of iterations made to approximate PI is the product of the -iterations and -threads values passed via the command line. Leaving -threads blank defaults it to 1 thread, resulting in execution in the main thread.
The tests below are tested with 80 Million iterations in total.
On Windows 7 64Bit (Intel Core2Duo Machine):
Compiled using Cygwin GCC 4.5.3: gcc-4 pi.c -o pi.exe -O3
On Ubuntu/Linaro 12.04 (8Core AMD):
Compiled using GCC 4.6.3: gcc pi.c -lm -lpthread -O3 -o pi
Performance
On Windows, the threaded version is a few milliseconds faster than the un-threaded one. I expected better performance, to be honest. On Linux, ew! What the heck? Why does it take even 2000% longer? Of course this depends a lot on the implementation, so here it goes. An excerpt from after the command-line argument parsing is done and the calculation is started:
// Begin computation.
clock_t t_start, t_delta;
double pi = 0;

if (args.threads == 1) {
    t_start = clock();
    pi = pi_mc(args.iterations);
    t_delta = clock() - t_start;
}
else {
    pthread_t* threads = malloc(sizeof(pthread_t) * args.threads);
    if (!threads) {
        return alloc_failed();
    }
    struct PIThreadData* values = malloc(sizeof(struct PIThreadData) * args.threads);
    if (!values) {
        free(threads);
        return alloc_failed();
    }
    t_start = clock();
    for (i = 0; i < args.threads; i++) {
        values[i].iterations = args.iterations;
        values[i].out = 0.0;
        pthread_create(threads + i, NULL, pi_mc_threaded, values + i);
    }
    for (i = 0; i < args.threads; i++) {
        pthread_join(threads[i], NULL);
        pi += values[i].out;
    }
    t_delta = clock() - t_start;
    free(threads);
    threads = NULL;
    free(values);
    values = NULL;
    pi /= (double) args.threads;
}
While pi_mc_threaded() is implemented as:
struct PIThreadData {
    int iterations;
    double out;
};

void* pi_mc_threaded(void* ptr) {
    struct PIThreadData* data = ptr;
    data->out = pi_mc(data->iterations);
    return NULL;
}
You can find the full source code at http://pastebin.com/jptBTgwr.
Question
Why is this? Why this extreme difference on Linux? I expected the time taken to calculate to drop to 3/4 of the original time at the very least. It is of course possible that I simply made wrong use of the pthread library. A clarification on how to do it correctly in this case would be very nice.
The problem is that in glibc's implementation, rand() calls __random(), which looks like this:
long int
__random ()
{
  int32_t retval;

  __libc_lock_lock (lock);

  (void) __random_r (&unsafe_state, &retval);

  __libc_lock_unlock (lock);

  return retval;
}
It takes a lock around each call to __random_r(), the function that does the actual work.
Thus, as soon as you have more than one thread using rand(), you make each thread wait for the other(s) on almost every call to rand(). Directly using random_r() with your own buffers in each thread should be much faster.
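A hedged sketch of that suggestion (glibc-specific; initstate_r, random_r, and struct random_data are the real glibc names, everything else here is made up for illustration): give each thread its own generator state, so no lock is ever shared.

#define _GNU_SOURCE            /* random_r/initstate_r are GNU extensions */
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define THREADS          8
#define CALLS_PER_THREAD 1000000

struct thread_rng {
    struct random_data data;   /* per-thread generator state */
    char state[64];            /* state buffer owned by this thread */
};

static void *worker(void *arg)
{
    struct thread_rng rng;
    memset(&rng, 0, sizeof rng);                 /* random_data must start zeroed */
    initstate_r((unsigned)(uintptr_t)arg,        /* per-thread seed */
                rng.state, sizeof rng.state, &rng.data);

    long long sum = 0;
    for (int i = 0; i < CALLS_PER_THREAD; i++) {
        int32_t r;
        random_r(&rng.data, &r);                 /* no global lock taken */
        sum += r;
    }
    return (void *)(intptr_t)(sum & 0x7fffffff);
}

int main(void)
{
    pthread_t tids[THREADS];
    for (int i = 0; i < THREADS; i++)
        pthread_create(&tids[i], NULL, worker, (void *)(uintptr_t)(i + 1));
    for (int i = 0; i < THREADS; i++)
        pthread_join(tids[i], NULL);
    puts("done");
    return 0;
}

Build with something like gcc -O3 -pthread; because each thread owns its state buffer, the threads never contend on the generator the way they do through rand().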
Performance and threading is a black art. The answer depends on the specifics of the compiler and libraries used for threading, how well the kernel handles it, and so on. Basically, if your libraries for *nix are not efficient at switching, moving objects around, etc., threading will in fact be slower. This is one of the reasons a lot of us doing threaded work now use the JVM or JVM-like languages: we can trust the JVM runtime's behavior, whose overall speed may vary with the platform but is consistent on that platform. In addition, you may have some hidden wait/race conditions that you uncovered just due to timing and that may not show up on Windows.
If you are in a position to change your language, consider Scala or D. Scala is the actor-driven successor to Java, and D the successor to C. Both languages show their roots: if you can write in C, D should be no problem. Both languages, however, implement the actor model: no more thread pools, no more race conditions, etc.
For comparison, I just tried your app on Windows Vista, compiled with Borland C++, and the 2-thread version performed nearly twice as fast as the single-threaded one.
pi.exe -iterations 20000000 -stats -threads 1
3.141167
Number of iterations: 20000000
Method: Monte Carlo
Evaluation time: 12.511000 sec
Threads: Main
pi.exe -iterations 10000000 -stats -threads 2
3.142397
Number of iterations: 20000000
Method: Monte Carlo
Evaluation time: 6.584000 sec
Threads: 2
That's compiled against the thread-safe run-time library. Using the single thread library, both versions run at twice their thread-safe speed.
pi.exe -iterations 20000000 -stats -threads 1
3.141167
Number of iterations: 20000000
Method: Monte Carlo
Evaluation time: 6.458000 sec
Threads: Main
pi.exe -iterations 10000000 -stats -threads 2
3.141314
Number of iterations: 20000000
Method: Monte Carlo
Evaluation time: 3.978000 sec
Threads: 2
So the 2-thread version is still twice as fast, but the 1-thread version built against the single-threaded library is actually faster than the 2-thread version on the thread-safe library.
Looking at Borland's rand implementation, they use thread-local storage for the seed in the thread-safe implementation, so it's not going to have the same negative impact on threaded code as glibc's lock, but the thread-safe implementation will obviously be slower than the single-threaded one.
The bottom line, though, is that your compiler's rand implementation is probably the main performance issue in both cases.
Update
I've just tried replacing your rand_01 calls with inline implementations of Borland's rand function using a local variable for the seed, and the results are consistently twice as fast in the 2-thread case.
The updated code looks like this:
#define MULTIPLIER 0x015a4e35L
#define INCREMENT  1

double pi_mc(int iterations) {
    unsigned seed = 1;
    long long inner = 0;
    long long outer = 0;
    int i;
    for (i = 0; i < iterations; i++) {
        seed = MULTIPLIER * seed + INCREMENT;
        double x = ((int)(seed >> 16) & 0x7fff) / (double) RAND_MAX;
        seed = MULTIPLIER * seed + INCREMENT;
        double y = ((int)(seed >> 16) & 0x7fff) / (double) RAND_MAX;
        double d = sqrt(pow(x, 2.0) + pow(y, 2.0));
        if (d <= 1.0) {
            inner++;
        }
        else {
            outer++;
        }
    }
    return ((double) inner / (double) iterations) * 4;
}
I don't know how good that is as rand implementations go, but it's worth at least trying on Linux to see whether it makes a difference to the performance.