Linux io_submit latency large - aio

I have found a very strange problem with io_submit latency.
If I write a loop that invokes io_submit 5 times, like the following:
for (int i = 0; i < 5; i++) {
    gettimeofday(&start, NULL);
    io_submit(...);
    gettimeofday(&end, NULL);
}
the latency of every io_submit call is very small except for the first one:
io_submit cost: 9 us
io_submit cost: 2 us
io_submit cost: 2 us
io_submit cost: 3 us
But if I sleep after every io_submit call, like the following:
for (int i = 0; i < 5; i++) {
    gettimeofday(&start, NULL);
    io_submit(...);
    gettimeofday(&end, NULL);
    sleep(1);
}
the latency of every io_submit call is large:
io_submit cost: 9 us
io_submit cost: 8 us
io_submit cost: 9 us
io_submit cost: 7 us
The block device is an NVMe SSD. I have tried to use blktrace, but it seems to have some problems with NVMe: only the 'Q' and 'A' events are caught, which is not enough to figure out this question.
I have also tried to use SystemTap to trace some points in io_submit's code path, but that raises io_submit's latency to nearly 50 us, which hides the difference between the two situations above.
Does anyone know why this happens, or have advice on how to figure it out?
NEW PROGRESS:
Using SystemTap, I found that the increase in latency comes from many places along the code path, not from one place.
Two things seem related.
First, CPU cache misses caused by context switches: the sleep case leads to more context switches.
Second, the code path allocates and then frees memory. When running without sleep, the memory freed in one iteration can be reused by the next. With sleep, the memory just freed may be taken by other threads in the meantime.
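To compare the two cases side by side, a small harness along these lines may help. It is only a sketch in Python: the standard library has no io_submit wrapper, so `os.write` on a temporary file is used as a stand-in system call, and `time_writes` and `pause_s` are names made up for this example. It also uses a monotonic clock, since gettimeofday reports wall-clock time and can jump if the clock is adjusted.

```python
import os
import tempfile
import time


def time_writes(count, pause_s=0.0):
    """Return the latency in microseconds of `count` small writes."""
    latencies_us = []
    with tempfile.TemporaryFile() as f:
        fd = f.fileno()
        payload = b"x" * 4096
        for _ in range(count):
            start = time.perf_counter_ns()
            os.write(fd, payload)      # stand-in for io_submit(...)
            end = time.perf_counter_ns()
            latencies_us.append((end - start) / 1000.0)
            if pause_s:
                time.sleep(pause_s)    # gives up the CPU between calls
    return latencies_us


if __name__ == "__main__":
    print("back-to-back:", time_writes(5))
    print("with sleep:  ", time_writes(5, pause_s=0.1))
```

The absolute numbers will differ from the io_submit figures above; the point is only to run both loops with the same timing code so the gap between them can be compared.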

Related

How should I spawn threads for parallel computation?

Today, I got into multi-threading. Since it's a new concept, I thought I could begin to learn by translating a simple iteration to a parallelized one. But, I think I got stuck before I even began.
Initially, my loop looked something like this:
let stuff: Vec<u8> = items.into_iter().map(|item| {
    some_item_worker(&item)
}).collect();
I had put a reasonably large amount of stuff into items and it took about 0.05 seconds to finish the computation. So, I was really excited to see the time reduction once I successfully implemented multi-threading!
When I used threads, I got into trouble, probably due to my bad reasoning.
use std::thread;
let threads: Vec<_> = items.into_iter().map(|item| {
    thread::spawn(move || {
        some_item_worker(&item)
    })
}).collect(); // yeah, this is followed by another iter() that unwraps the values
I have a quad-core CPU, which means that I can run only up to 4 threads concurrently. I guessed that it worked this way: once the iterator starts, threads are spawned. Whenever a thread ends, another thread begins, so that at any given time, 4 threads run concurrently.
The result was that it took (after some re-runs) ~0.2 seconds to finish the same computation. Clearly, there's no parallel computing going on here. I don't know why the time increased by 4 times, but I'm sure that I've misunderstood something.
Since this isn't the right way, how should I go about modifying the program so that the threads execute concurrently?
EDIT:
I'm sorry, I was wrong about that ~0.2 seconds. I woke up and tried it again, when I noticed that the usual iteration ran for 2 seconds. It turned out that some process had been consuming the memory wildly. When I rebooted my system and tried the threaded iteration again, it ran for about 0.07 seconds. Here are some timings for each run.
Actual iteration (first one):
0.0553760528564 seconds
0.0539519786835 seconds
0.0564560890198 seconds
Threaded one:
0.0734670162201 seconds
0.0727820396423 seconds
0.0719120502472 seconds
I agree that the threads are indeed running concurrently, but it seems to take another 20 ms to finish the job. My actual goal was to use my processor to run the threads in parallel and finish the job sooner. Is this going to be complicated? What should I do to make those threads run in parallel, not just concurrently?
"I have a quad-core CPU, which means that I can run only up to 4 threads concurrently."
Only 4 may be running concurrently, but you can certainly create more than 4...
"whenever a thread ends, another thread begins, so that at any given time, 4 threads run concurrently (it was just a guess)."
Whenever you have a guess, you should create an experiment to figure out if your guess is correct. Here's one:
use std::{thread, time::Duration};
fn main() {
    let threads: Vec<_> = (0..500)
        .map(|i| {
            thread::spawn(move || {
                println!("Thread #{i} started!");
                thread::sleep(Duration::from_millis(500));
                println!("Thread #{i} finished!");
            })
        })
        .collect();
    for handle in threads {
        handle.join().unwrap();
    }
}
If you run this, you will see that "Thread #N started!" is printed out 500 times, followed by 500 "Thread #N finished!" lines.
"Clearly, there's no parallel computing going on here"
Unfortunately, your question isn't fleshed out enough for us to tell why your time went up. In the example I've provided, it takes a little less than 600 ms, so it's clearly not happening in serial!
Creating a thread has a cost. If the cost of the computation inside the thread is small enough, it'll be dwarfed by the cost of the threads or the inefficiencies caused by the threads.
For example, spawning 10 million threads to double 10 million u8s will probably not be worth it. Vectorizing it would probably yield better performance.
That said, you still might be able to get some improvement by parallelizing cheap tasks. But you want fewer threads: either a thread pool with a small number of threads (so only a few threads exist at any given point and there is less CPU contention), or something like Rayon, which is more sophisticated under the hood but has quite a simple API.
use rayon::prelude::*;

// Notice `.par_iter()` turns it into a parallel iterator
let stuff: Vec<u8> = items.par_iter().map(|item| {
    some_item_worker(item)
}).collect();
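The thread pool idea can be sketched roughly as follows. This is an illustration in Python rather than Rust, and `some_item_worker`, `process_chunk` and `parallel_map` are hypothetical stand-ins: the items are split into one chunk per worker, so only a handful of threads are ever created and each one does a meaningful amount of work.

```python
from concurrent.futures import ThreadPoolExecutor
import os


def some_item_worker(item):
    """Hypothetical stand-in for the per-item work in the question."""
    return (item * 2) % 256


def process_chunk(chunk):
    return [some_item_worker(item) for item in chunk]


def parallel_map(items, workers=None):
    workers = workers or os.cpu_count() or 1
    chunk_size = max(1, -(-len(items) // workers))   # ceiling division
    chunks = [items[i:i + chunk_size] for i in range(0, len(items), chunk_size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() yields results in the same order as the input chunks
        results = list(pool.map(process_chunk, chunks))
    return [value for chunk in results for value in chunk]
```

One caveat specific to this sketch: in CPython, threads do not speed up CPU-bound pure-Python code because of the global interpreter lock. `ProcessPoolExecutor` offers the same interface if real parallelism is needed; the chunking pattern is the part that carries over.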

FPS drops when using .NET's ThreadPool

I have asked about this before, but didn't provide code because I didn't have an easy way to do so. However now I've started a new project in Unity and tried to replicate the behaviour without all the unnecessary baggage attached.
So this is my current setup:
using System.Threading;
using UnityEngine;

public class Main : MonoBehaviour
{
    public GameObject calculatorPrefab;
    void Start ()
    {
        for (int i = 0; i < 10000; i++)
        {
            Instantiate(calculatorPrefab);
        }
    }
}
public class Calculator : MonoBehaviour
{
    void Start ()
    {
        ThreadPool.QueueUserWorkItem(DoCalculations);
    }
    void DoCalculations(object o)
    {
        // Just doing some pointless calculations so the thread actually has something to do.
        float result = 0;
        for (int i = 0; i < 1000; i++)
        {
            // Note that the loop count doesn't seem to matter at all, other than taking longer.
            for (int i2 = 0; i2 < 1000; i2++)
            {
                result = i * i2 * Mathf.Sqrt(i * i2 + 59);
            }
        }
    }
}
Both scripts are attached to GameObjects. The 'Main' script is on a GameObject that's placed in the scene and is supposed to create a bunch of other GameObjects at start-up, which in turn queue some random calculations on the ThreadPool. Obviously this produces a fairly big CPU spike at start-up, but that's not the problem. The problem is that the main thread seems to be blocked by this. In other words, it produces horrible fps. Why is that? Isn't it supposed to run in the background? Isn't the whole point of this to avoid making the main thread unresponsive?
I'm really struggling to figure out what I'm doing wrong, because as far as I see it, it doesn't get much simpler than this.
On the first frame you instantiate 10000 prefabs. That is quite a load for a single frame. On the second frame you queue 10000 work items on the thread pool. That is a lot of queued work, and I am sure you are running into some upfront initialization costs.
The background task is not that complex. I use background tasks for really long-running operations, for instance web calls and long-running calculations. I don't think your task really fits. In other words, the upfront cost exceeds the cost of running your calculation.
Try using a coroutine instead to break up your calculations and instantiations. I think that is a better solution for this particular background task.
Edit: I ran some tests per the comments below.
10k instantiates took a median of 104 milliseconds. The editor had a poor framerate and used about 15% of my i7 CPU capacity.
10k QueueUserWorkItem calls took a median of 23 milliseconds. The editor locked up for multiple seconds. My CPU sat at a wonderful 99% of capacity.
Conclusion
Queuing the work has some cost, but not a lot. The problems are mainly with your instantiate. That, and why are we queuing 1000 worker threads for such a simple calculation?
I see the following problems with your code.
- You are creating far too many background jobs at once, 10,000 to be precise. .NET won't run them all concurrently, but it is still perhaps not the best way to go. On my machine (8 logical cores) the initial max workers via ThreadPool.GetMaxThreads() was 1023.
- Each job is rather complex. Calculating Sqrt is not cheap, so it is no wonder it takes so long.
- Unity has methods for updating and methods for drawing. The problem here is that your jobs are ongoing and thus drag down everything, including updating, drawing, and everything in between, rather than the computation just happening during Update().
Taking your code and just running it in a stand-alone .NET app, it took 15 seconds to complete, maxing out all 8 of my cores.
However, changing
result = i * i2 * Mathf.Sqrt(i * i2 + 59)
...to:
result = i * i2 * i * i2 + 59;
...also maxed out all of my 8 cores as before but this time took 6 seconds.
You might ask, "well, you took away the sqrt, what is your point?" My point is that I don't believe you realise how intensive a call to Sqrt is, particularly given this statement:
"And it's still terrible. I even reduced the amount of objects being created from 10000 to 100, while increasing the loop count so it still takes a while. No real difference"
Furthermore, scheduling so many jobs, regardless of tech, purely to update game objects won't scale. Game designers update in batches.
My suggestion:
Design tip
Generally, when a lot of calculations must be performed for many objects, instead of doing them all in one frame, group them and spread them out over time. So for 10,000 objects, maybe use a batch size of 1000 or 100? Source: Cities: Skylines.
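A minimal sketch of that batching idea, written in Python purely for illustration (in Unity this would live in a MonoBehaviour, with `on_frame` playing the role of Update()). The names `make_batches`, `BatchedUpdater` and `on_frame` are invented for this example.

```python
def make_batches(objects, batch_size):
    """Split `objects` into consecutive batches of at most `batch_size`."""
    return [objects[i:i + batch_size] for i in range(0, len(objects), batch_size)]


class BatchedUpdater:
    """Updates one batch per frame, cycling through all objects over time."""

    def __init__(self, objects, batch_size, update_fn):
        self.batches = make_batches(objects, batch_size)
        self.update_fn = update_fn
        self.next_batch = 0

    def on_frame(self):
        """Call once per frame; returns how many objects were updated."""
        if not self.batches:
            return 0
        batch = self.batches[self.next_batch]
        for obj in batch:
            self.update_fn(obj)
        # Advance to the following batch, wrapping around at the end.
        self.next_batch = (self.next_batch + 1) % len(self.batches)
        return len(batch)
```

With 10,000 objects and a batch size of 1000, every object is touched once every 10 frames, and no single frame pays for more than 1000 updates.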
Tell me more
Game Engine Architecture

Speed-up from multi-threading

I have a highly parallelizable problem. Hundreds of separate problems need to be solved by the same function. The problems each take an average of perhaps 120 ms (0.12 s) on a single core, but there is substantial variation, and some extreme and rare ones may take 10 times as long. Each problem needs memory, but this is allocated ahead of time. The problems do not need disk I/O, and they do not pass back and forth any variables once they are running. They do access different parts (array elements) of the same global struct, though.
I have C++ code, based on someone else's code, that works. (The global array of structs is not shown.) It runs 20 problems (for instance) and then returns. I think 20 is enough to even out the variability on 4 cores. I see the execution time flattening out from about 10 already.
There is a Win32 and an OpenMP version, and they behave almost identically in terms of execution time. I run the program on a 4-core Windows system. I include some OpenMP code below since it is shorter. (I changed names etc. to make it more generic and I may have made mistakes -- it won't compile stand-alone.)
The speed-up over the single-threaded version flattens out at about a factor of 2.3. So if it takes 230 seconds single-threaded, it takes 100 s multi-threaded. I am surprised that the speed-up is not a lot closer to 4, the number of cores.
Am I right to be disappointed?
Is there anything I can do to get closer to my theoretical expectation?
#include <omp.h>

int split_bigtask(Inputs * inputs, Outputs * results)
{
    for (int k = 0; k < MAXNO; k++)
        results->solved[k].value = 0;
    #pragma omp parallel shared(inputs, results)
    {
        #pragma omp for schedule(dynamic)
        for (int k = 0; k < inputs->no; k++)
        {
            // res is declared inside the loop so each thread has its own copy
            int res = bigtask(inputs->values[k],
                              results->solved[k],
                              omp_get_thread_num()
                              );
            (void)res;
        }
    }
    return TRUE;
}
1. I assume that there is no synchronization done within bigtask() (obvious, but I'd still check it first).
2. It's possible that you run into a "dirty cache" problem (often called false sharing): if you manipulate data that is close together (e.g. in the same cache line!) from multiple cores, each manipulation will mark the cache line as dirty, which means that the processor needs to signal this to all other processors, which in turn involves synchronization again.
3. You create too many threads. Allocating a thread is quite an overhead, so creating one thread for each core is a lot more efficient than creating 5 threads per core.
I personally would assume that you have case 2 ("Big Global Array").
Solutions to the problem (if it's indeed case 2):
- Write the results to a local array which is merged into the "Big Global Array" by the main thread after the end of the work
- Split the global array into several smaller arrays (and give each thread one of these arrays)
- Ensure that the records within the structure align on cache-line boundaries (this is a bit of a hack, since cache-line boundaries may change for future processors)
- You may want to try to create a local copy of the array for each thread (at least for the results)
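The first suggestion (local results, merged by the main thread) has roughly the following shape. This is a structural sketch in Python only; Python does not expose cache lines, so it cannot demonstrate the hardware effect itself, and `bigtask`, `solve_slice` and `solve_all` are hypothetical names. In the C++ program the private list would be a per-thread array.

```python
from concurrent.futures import ThreadPoolExecutor


def bigtask(value):
    """Hypothetical stand-in for the real per-problem solver."""
    return value * value


def solve_slice(values, start, stop):
    # Each worker fills its own private list; nothing shared is written here.
    return start, [bigtask(v) for v in values[start:stop]]


def solve_all(values, workers=4):
    results = [0] * len(values)
    step = max(1, -(-len(values) // workers))   # ceiling division
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [
            pool.submit(solve_slice, values, s, min(s + step, len(values)))
            for s in range(0, len(values), step)
        ]
        for future in futures:
            start, local = future.result()
            # The shared array is only written here, on the main thread.
            results[start:start + len(local)] = local
    return results
```

Note that this sketch hands out fixed slices, whereas the question uses schedule(dynamic). With run times varying by a factor of 10, dynamic assignment may balance the load better, and the same local-buffer idea still applies to it.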

Excessive Linux Latency

Do you think a latency of 50 ms is normal on a Linux system?
I have a program with many threads; one thread controls the movement of an object using a motor and photocells.
I have done many things to get minimum latency, but I always get 50 ms, which causes a position error in the object.
Things I did:
- nice value set to -20
- Thread priority of the photocell control thread: SCHED_FIFO, 99
- Kernel configuration: CONFIG_PREEMPT=y
- mlockall (MCL_CURRENT | MCL_FUTURE);
Many times, I lose 50 ms waiting for a photocell. I think the problem is not another of my threads, but a process in the kernel.
Is it possible to reduce this latency? Is it possible to find out what is taking these extra 50 ms?
The thread that controls the photocells makes many "read" calls. Can this cause problems?
/**********/
The situation now is:
There is only one thread running an infinite empty loop that only reads the time at the start and at the end of the loop.
No access to disk, no access to GPIO, no serial ports, nothing.
The loop takes 50 milliseconds much of the time.
I have not set CPU affinity; my processor has only one core.
I have been running tests in my program.
This is the code in the main function, executed before the program starts the threads, that causes the 50 ms latency:
struct sched_param lsPrio;
lsPrio.sched_priority = 1;
if (sched_setscheduler (0, SCHED_FIFO, &lsPrio) != 0)
    printf ("FALLO sched_set\n");
If I comment out these lines, the latency drops to about 1 ms.
Why do these lines cause latency?
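A small measurement sketch for narrowing this down, in Python for brevity. It spins in an empty loop and records the largest gap between two iterations, and it reads the kernel's real-time throttling settings. The throttling angle is an assumption, not something confirmed in this thread: with the common defaults of 1000000 for sched_rt_period_us and 950000 for sched_rt_runtime_us, a spinning SCHED_FIFO task can be paused for up to 50 ms in every second, which would match the number reported. `worst_gap_us` and `read_rt_throttle` are names invented here.

```python
import time


def worst_gap_us(duration_s=1.0):
    """Spin for `duration_s` and return the largest gap between iterations in microseconds."""
    deadline = time.monotonic_ns() + int(duration_s * 1e9)
    previous = time.monotonic_ns()
    worst = 0
    while previous < deadline:
        now = time.monotonic_ns()
        worst = max(worst, now - previous)
        previous = now
    return worst / 1000.0


def read_rt_throttle():
    """Return (period_us, runtime_us) from /proc, or None if unavailable."""
    try:
        with open("/proc/sys/kernel/sched_rt_period_us") as f:
            period = int(f.read())
        with open("/proc/sys/kernel/sched_rt_runtime_us") as f:
            runtime = int(f.read())
    except (OSError, ValueError):
        return None
    return period, runtime


if __name__ == "__main__":
    print("RT throttle (period_us, runtime_us):", read_rt_throttle())
    print("worst gap: %.1f us" % worst_gap_us(0.2))
```

Running the script once normally and once under SCHED_FIFO (for example with `chrt -f 1`, which needs root) for a couple of seconds should show whether the large gap only appears with the real-time policy.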

Can I easily write a program to make use of Intel's quad-core or i7 chip if only 1 thread is used?

I wonder: if my program has only 1 thread, can I write it so that a quad-core or i7 can actually make use of the different cores? Usually when I write programs on a quad-core computer, the CPU usage only goes to about 25%, and the work seems to be divided among the 4 cores, as the Task Manager shows. (The programs I write are usually in Ruby, Python, or PHP, so they may not be very optimized.)
Update: what if I write it in C or C++ instead, like
for (i = 0; i < 100000000; i++) {
    a = i * 2;
    b = i + 1;
    if (a == ... || b == ...) { ... }
}
and then use the highest optimization level of the compiler? Can the compiler make the multiplication happen on one core and the addition on a different core, and therefore make 2 cores work at the same time? Isn't that a fairly easy optimization to use 2 cores?
No. You need to use threads to execute multiple paths concurrently on multiple CPUs (be they real or virtual). Execution of one thread is inherently bound to one CPU, as this maintains the "happens before" relationship between statements, which is central to how programs work.
First, unless multiple threads are created in the program, there is only a single thread of execution in that program.
Seeing 25% of CPU resources being used for the program is an indication that a single core out of four is being utilized at 100%, while the other cores are not being used. If all cores were used, then it would be theoretically possible for the process to hog 100% of the CPU resources.
As a side note, the graphs shown in Task Manager in Windows are the CPU utilization by all processes running at the time, not only for one process.
Secondly, the code you present could be split into code which can execute on two separate threads in order to execute on two cores. I am guessing that you want to show that a and b are independent of each other, and they only depend on i. With that type of situation, separating the inside of the for loop like the following could allow multi-threaded operation which could lead to increased performance:
// Process this in one thread:
for (int i = 0; i < 1000; i++) {
    a = i * 2;
}
// Process this in another thread:
for (int i = 0; i < 1000; i++) {
    b = i + 1;
}
However, things become tricky if at some point the results from the two separate threads need to be evaluated together, as seems to be implied by the if statement later on:
for (i = 0; i < 1000; i++) {
    // manipulate "a" and "b"
    if (a == ... || b == ...) { ... }
}
This would require the a and b values, which reside in separate threads (executing on separate processors), to be looked up, which is a serious headache.
There is no real guarantee that the i values of the two threads are the same at the same time (after all, multiplication and addition will probably take different amounts of time to execute), and that means that one thread may need to wait for the other so the i values get in sync before comparing the a and b that correspond to the same i. Or do we make a third thread for value comparison and synchronization of the two threads? In either case, the complexity is starting to build up very quickly, so I think we can agree that we're starting to see a serious mess arising -- sharing state between threads can be very tricky.
Therefore, the code example you provide is only partially parallelizable without much effort. As soon as there is a need to compare the two variables, separating the two operations becomes very difficult very quickly.
A couple of rules of thumb when it comes to concurrent programming:
When there are tasks which can be broken down into parts which involve processing of data that is completely independent of other data and its results (states), then parallelizing can be very easy.
For example, two functions which calculate a value from an input (in pseudocode):
f(x) = { return 2x }
g(x) = { return x+1 }
These two functions don't rely on each other, so they can be executed in parallel without any pain. Also, as there is no state to share or handle between calculations, even if there were multiple values of x that needed to be calculated, those can be split up further:
x = [1, 2, 3, 4]
foreach t in x:
    runInThread(f(t))
foreach t in x:
    runInThread(g(t))
Now, in this example, we can have 8 separate threads performing calculations. Not having side effects can be very good thing for concurrent programming.
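The pseudocode above maps fairly directly onto a thread pool. Here is a minimal Python sketch of it; `run_all` is a name made up for this example, and the pool size of 8 simply mirrors the 8 independent calculations.

```python
from concurrent.futures import ThreadPoolExecutor


def f(x):
    return 2 * x


def g(x):
    return x + 1


def run_all(xs):
    """Submit every f(x) and g(x) as an independent task and collect the results."""
    with ThreadPoolExecutor(max_workers=8) as pool:
        f_futures = [pool.submit(f, x) for x in xs]
        g_futures = [pool.submit(g, x) for x in xs]
        return ([fut.result() for fut in f_futures],
                [fut.result() for fut in g_futures])


if __name__ == "__main__":
    print(run_all([1, 2, 3, 4]))   # ([2, 4, 6, 8], [2, 3, 4, 5])
```

Because neither function touches shared state, no locks are needed and the tasks can finish in any order; the results are simply read back in submission order.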
However, as soon as there is dependency on data and results from other calculations (which also means there are side effects), parallelization becomes extremely difficult. In many cases, these types of problems will have to be performed in serial as they await results from other calculations to be returned.
Perhaps the question comes down to: why can't compilers figure out which parts can be automatically parallelized and perform those optimizations? I'm not an expert on compilers so I can't say, but there is an article on automatic parallelization at Wikipedia which may have some information.
I know Intel chips very well.
Per your code, "if (a == ... || b == ...)" is a barrier; apart from that, the processor cores will execute all the code in parallel regardless of what kind of optimization the compiler has done. That only requires that the compiler is not a very "stupid" one. It means the capability comes from the hardware itself, not from software. So threaded programming or OpenMP is not necessary in such cases, though they will help improve parallel computing. Note that this doesn't mean Hyper-Threading, just normal multi-core processor functionality.
Please google "processor pipeline multi port parallel" to learn more.
Here I'd like to give a classic example that multi-core, multi-channel IMC platforms (e.g. the Intel Nehalem family, such as Core i7) could execute in parallel, with no extra software optimization needed.
char buffer0[64];
char buffer1[64];
char buffer2[64];
char buffer[192];
int i;
for (i = 0; i < 64; i++) {
    *(buffer + i) = *(buffer0 + i);
    *(buffer + 64 + i) = *(buffer1 + i);
    *(buffer + 128 + i) = *(buffer2 + i);
}
Why? 3 reasons.
1. Core i7 has a triple-channel IMC. Its bus width is 192 bits, 64 bits per channel, and the memory address space is interleaved among the channels on a per-cache-line basis. The cache-line length is 64 bytes. So basically buffer0 is on channel 0, buffer1 is on channel 1 and buffer2 is on channel 2, while buffer[192] is interleaved evenly among the 3 channels, 64 bytes per channel. The IMC supports loading or storing data from or to multiple channels concurrently. That's a multi-channel MC burst with maximum throughput. In the following description, I'll only talk about 64 bytes per channel, say with BL x8 (Burst Length 8, 8 x 8 = 64 bytes = one cache line) per channel.
2. buffer0..2 and buffer are contiguous in the memory space (on a specific page both virtually and physically, in stack memory). When the code runs, buffer0, 1, 2 and buffer are loaded into the processor cache, 6 cache lines in total. So once the "for(){}" code above starts executing, accessing memory is not necessary at all because all the data is in the L3 cache, a non-core part which is shared by all cores. We'll not talk about L1/L2 here. In this case every core could pick the data up and then compute on it independently; the only requirement is that the OS supports MP and that task stealing is allowed, say runtime scheduling and affinity sharing.
3. There are no dependencies among buffer0, 1, 2 and buffer, so there are no execution stalls or barriers. E.g. executing *(buffer + 64 + i) = *(buffer1 + i) doesn't need to wait for *(buffer + i) = *(buffer0 + i) to finish.
The most important and difficult point, though, is "task stealing, runtime scheduling and affinity sharing". That's because for a given task there is only one task execution context, and it would have to be shared by all cores to perform parallel execution. Anyone who understands this point is among the top experts in the world. I'm looking for such an expert to work with me on my open source project and be responsible for work related to parallel computing and the latest HPC architectures.
Note that in the example code above you could also use SIMD instructions such as movntdq/a, which bypass the processor cache and write to memory directly. It's a very good idea too when performing software-level optimization, though accessing memory is extremely expensive: for example, accessing the L1 cache may take only 1 cycle, but accessing memory takes 142 cycles on older x86 chips.
Please visit http://effocore.googlecode.com and http://effogpled.googlecode.com for the details.
Implicit parallelism is probably what you are looking for.
If your application code is single-threaded, multiple processors/cores will only be used if:
- the libraries you use are using multiple threads (perhaps hiding this usage behind a simple interface)
- your application spawns other processes to perform some part of its operation
Ruby, Python and PHP applications can all be written to use multiple threads, however.
A single-threaded program will only use one core. The operating system might well decide to shift the program between cores from time to time, according to some rules to balance the load etc. So you will see only 25% usage overall and all four cores working, but only one at a time.
The only way to use multiple cores without using multithreading is to use multiple programs.
In your example above, one program could handle 0-24999999, the next 25000000-49999999, and so on. Set all four of them off at the same time, and they will use all four cores.
Usually you would be better off writing a (single) multithreaded program.
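The "several programs, one range each" idea might be sketched like this in Python, using worker processes instead of four separately launched programs. The condition inside `count_matches` is a made-up placeholder, since the question elides the real one, and `split_range` and `run_in_processes` are hypothetical helper names.

```python
from concurrent.futures import ProcessPoolExecutor


def split_range(total, parts):
    """Split range(total) into `parts` contiguous (start, stop) pairs."""
    base, extra = divmod(total, parts)
    bounds, start = [], 0
    for k in range(parts):
        stop = start + base + (1 if k < extra else 0)
        bounds.append((start, stop))
        start = stop
    return bounds


def count_matches(bounds):
    """Placeholder loop body: count the i values for which the test holds."""
    start, stop = bounds
    return sum(
        1 for i in range(start, stop)
        if (i * 2) % 7 == 0 or (i + 1) % 11 == 0   # placeholder condition
    )


def run_in_processes(total, parts=4):
    """Run one worker process per sub-range and add up the partial counts."""
    with ProcessPoolExecutor(max_workers=parts) as pool:
        return sum(pool.map(count_matches, split_range(total, parts)))
```

Calling `run_in_processes(100_000_000)` from under an `if __name__ == "__main__":` guard would give each of the four processes a quarter of the range, the first being 0 to 24999999.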
With C/C++ you can use OpenMP. It's C code with pragmas like
#pragma omp parallel for
for(..) {
    ...
}
to say that this for loop will run in parallel.
This is one easy way to parallelize something, but at some point you will have to understand how parallel programs execute, and you will be exposed to parallel programming bugs.
If you want to parallelize the selection of the "i" values for which your statement if (a == ... || b == ...) evaluates to true, then you can do this with PLINQ (in .NET 4.0):
//note the "AsParallel"; that's it, multicore support.
var query = from i in Enumerable.Range(0, 100000000).AsParallel()
            where (i % 2 == 1 && i >= 10) //your condition
            select i;
//while iterating, the query is evaluated in parallel!
//Result will probably never be in order (eg. 13, 11, 17, 15, 19..)
foreach (var selected in query)
{
    //not parallel here!
}
If, instead, you want to parallelize operations, you will be able to do:
Parallel.For(0, 100000000, i =>
{
    if (i > 10) //your condition here
        DoWork(i); //Thread-safe operation
});
Since you are talking about 'task manager', you appear to be running on Windows. However, if you are running a webserver on there with multiple processes (for Ruby or PHP with fcgi or Apache pre-forking, and to a lesser extent other Apache workers), then they would tend to spread out across the cores.
If only a single program without threading is running, then no, no significant advantage will come from that: you're only running one thing at a time, other than OS-driven background processes.
