How to finely control the scheduling of rayon? - multithreading

In this question, I found a way to fill Python's multiprocessing pool (test4). Then I recalled that rayon implements parallelism in iterator style, so I tried to implement the same logic in rayon. Here is the code:
use rayon::prelude::*;
use std::sync::{Arc, Mutex};
use std::thread;
use std::time::{Duration, Instant};

fn main() {
    let now = Instant::now();
    let table = Arc::new(Mutex::new(vec![b"| ".repeat(20); 8]));
    let write_table = |msg: &[u8]| {
        let mut table = table.lock().unwrap();
        let thread_idx = rayon::current_thread_index().unwrap();
        let time_idx = now.elapsed().as_secs() as usize * 2 + 1;
        table[thread_idx][time_idx..time_idx + msg.len()].clone_from_slice(msg);
    };
    rayon::ThreadPoolBuilder::new()
        .num_threads(8)
        .build_global()
        .unwrap();
    (0i32..10)
        .into_par_iter()
        .map(|_| {
            write_table(b"b b b");
            thread::sleep(Duration::from_secs(3));
            (0i32..3).into_par_iter()
        })
        .flatten()
        .map(|_| {
            write_table(b"s");
            thread::sleep(Duration::from_secs(1));
        })
        .collect::<Vec<_>>();
    println!("##### total: {}s #####", now.elapsed().as_secs());
    println!(
        "{}",
        table
            .lock()
            .unwrap()
            .iter()
            .map(|r| std::str::from_utf8(r).unwrap())
            .collect::<Vec<_>>()
            .join("\n")
    );
}
But it turns out that rayon by default is lazier than I thought. Here is the output of the code:
##### total: 10s #####
|b b b|s|s|s|b b b|s| | | | | | | | | |
|b b b|s|s|s| | | |s| | | | | | | | | |
|b b b|s|s|s| | | | | | | | | | | | | |
|b b b|s|s|s| | | |s| | | | | | | | | |
|b b b|s|s|s|b b b|s| | | | | | | | | |
|b b b|s|s|s| | | |s| | | | | | | | | |
|b b b|s|s|s| | | |s| | | | | | | | | |
|b b b|s|s|s| | | | | | | | | | | | | |
You can see that some big tasks are scheduled after some small tasks. As a result, the thread pool is not fully utilized. So how can I finely control rayon's scheduling to fully utilize the pool?
Update
From comment:
The procedure in Python's process pool is that, when imap is called, it starts executing in the background. If we call next on the iterator, we block until the result is returned. This requires an additional queue to store the results, and the execution should be async (not sure). I was wondering if we can easily achieve this in Rayon.
After some experimenting, I found that Rayon actually provides some convenient methods to help me implement this, like for_each_with and par_bridge. I finally got this version: Rust Playground. But it is unstable. Sometimes it gets a better result like below:
##### total: 9s #####
|b b b|b b b|s|s| | | | | | | | | | | |
|b b b|s|s|s|s|s|s| | | | | | | | | | |
|b b b|s|s|s|s|s| | | | | | | | | | | |
|b b b|s|s|s|s|s| | | | | | | | | | | |
|b b b|s|s|s|s|s| | | | | | | | | | | |
|b b b|b b b|s|s|s| | | | | | | | | | |
| | | | | | |s|s| | | | | | | | | | | |
|b b b|b b b|s|s| | | | | | | | | | | |
Sometimes it gets worse. So I guess this may be an antipattern in Rayon?

Rayon's scheduling strategy is known as “work stealing”. The principle is that tasks specify points where they can run in parallel; if we look at the provided interface rayon::join(), we can see that at least one way to do this is to specify two functions that are candidates for running in parallel. By default, these two closures run in sequence. But if a thread from Rayon's thread pool doesn't have any work to do, it will look for the second closure of such a pair and “steal” it to run on that free thread.
This strategy has many advantages, mainly in that there is no overhead of communication between threads (and moving the data to be worked on to another core) except when this would enable additional parallelism.
In your particular use case, this happens to produce a poor utilization pattern, because each b task fans out to many s tasks so it's optimal to complete all the b tasks ASAP, but rayon prefers to finish all the s tasks on the same thread as their b task unless there's a free thread, which there isn't.
Unfortunately, I don't think it's possible to get Rayon's parallel iteration to perform scheduling well for your use case. Your case benefits from doing as much of the b task-starting as possible and disregarding the potential sequentially-runnable s activity after it, which is the opposite of what Rayon assumes is desirable.
However, we can step outside of the iterator interface and explicitly spawn parallel tasks, then feed their outputs through a channel to a parallel bridge. This way, the s tasks are not considered candidates to execute sequentially after their parent b task.
let (b_out, s_in) = crossbeam::channel::unbounded();
let mut final_output: Vec<()> = vec![];
rayon::scope(|scope| {
    for _ in 0i32..10 {
        let b_out = b_out.clone();
        scope.spawn(move |_| {
            write_table(b"b b b");
            thread::sleep(Duration::from_secs(3));
            b_out.send((0i32..3).into_par_iter()).unwrap();
        });
    }
    drop(b_out); // needed to ensure the channel is closed when it has no more items
    final_output = s_in
        .into_iter()
        .par_bridge()
        .flatten()
        .map(|_| {
            write_table(b"s");
            thread::sleep(Duration::from_secs(1));
        })
        .collect();
});
This wastes one of the threads, probably because it is spending most of its time blocking on s_in.into_iter().next(). This could be avoided by either:
- using spawn() for the s tasks too, instead of a channel, assuming you don't actually need to collect any outputs (or can use a channel for those outputs), or
- creating a thread pool with 1 extra thread (which will spend most of its time blocking rather than competing for CPU time).
However, it's still faster than the scheduling you started with (9s instead of 10s).
If you don't actually have any work that benefits from further subdivision than you've shown, you might want to consider avoiding Rayon entirely and running your own thread pool. Rayon is a great generic tool but if your workload has a known and simple shape, you should be able to outperform it with custom code.
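As a rough illustration of the "run your own thread pool" idea, here is a minimal sketch that uses only the standard library. It is not taken from the answer above: the names run_pool, Task and Shared are hypothetical, and the scheduling rule (idle workers always take a queued b task before any s task) is simply one assumption about what suits this workload.

```rust
use std::collections::VecDeque;
use std::sync::{Arc, Condvar, Mutex};
use std::thread;
use std::time::Duration;

/// The two kinds of work from the question (hypothetical names).
enum Task {
    Big,   // a "b" task: runs for 3 units, then fans out into small tasks
    Small, // an "s" task: runs for 1 unit
}

/// Queue state shared by all workers.
struct Shared {
    big: VecDeque<Task>,
    small: VecDeque<Task>,
    outstanding: usize, // tasks that are queued or currently running
    big_done: usize,
    small_done: usize,
}

/// Runs `big_tasks` big tasks, each fanning out into `fanout` small tasks,
/// on `workers` threads. Returns (big tasks completed, small tasks completed).
pub fn run_pool(workers: usize, big_tasks: usize, fanout: usize, unit: Duration) -> (usize, usize) {
    let shared = Arc::new((
        Mutex::new(Shared {
            big: (0..big_tasks).map(|_| Task::Big).collect(),
            small: VecDeque::new(),
            outstanding: big_tasks,
            big_done: 0,
            small_done: 0,
        }),
        Condvar::new(),
    ));

    let handles: Vec<_> = (0..workers)
        .map(|_| {
            let shared = Arc::clone(&shared);
            thread::spawn(move || loop {
                let (lock, cvar) = &*shared;
                // Pick the next task: big tasks first, then small ones.
                let task = {
                    let mut state = lock.lock().unwrap();
                    loop {
                        if let Some(t) = state.big.pop_front() {
                            break Some(t);
                        }
                        if let Some(t) = state.small.pop_front() {
                            break Some(t);
                        }
                        if state.outstanding == 0 {
                            break None; // all work is finished
                        }
                        // Nothing queued, but running tasks may still add work.
                        state = cvar.wait(state).unwrap();
                    }
                };
                let task = match task {
                    Some(t) => t,
                    None => return,
                };
                // Do the work without holding the lock, then record the result.
                match task {
                    Task::Big => {
                        thread::sleep(unit * 3);
                        let mut state = lock.lock().unwrap();
                        state.big_done += 1;
                        for _ in 0..fanout {
                            state.small.push_back(Task::Small);
                        }
                        state.outstanding += fanout;
                        state.outstanding -= 1;
                    }
                    Task::Small => {
                        thread::sleep(unit);
                        let mut state = lock.lock().unwrap();
                        state.small_done += 1;
                        state.outstanding -= 1;
                    }
                }
                // Wake idle workers: new tasks may be queued, or everything is done.
                cvar.notify_all();
            })
        })
        .collect();

    for handle in handles {
        handle.join().unwrap();
    }
    let state = shared.0.lock().unwrap();
    (state.big_done, state.small_done)
}
```

Calling run_pool(8, 10, 3, Duration::from_secs(1)) mirrors the workload in the question. Because no worker blocks on a channel and the b tasks are never queued behind s tasks, an idealized schedule for that call needs about 8 units of time, though real timings will depend on the machine.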

Related

Moving a split Vim window to the other half of the screen

Let's say we have 3 buffers (A, B, C) open in Vim arranged as follows
-----------------------------------------
|                  |                    |
|                  |                    |
|                  |                    |
|        A         |                    |
|                  |                    |
|                  |                    |
|------------------|         B          |
|                  |                    |
|                  |                    |
|        C         |                    |
|                  |                    |
|                  |                    |
-----------------------------------------
and we want to rearrange it as
-----------------------------------------
|                  |                    |
|                  |                    |
|                  |                    |
|                  |         B          |
|                  |                    |
|                  |                    |
|        A         |--------------------|
|                  |                    |
|                  |                    |
|                  |         C          |
|                  |                    |
|                  |                    |
-----------------------------------------
I know I can do this by closing C and reopening it after splitting B. Is there a simple way to do this where I don't have to close buffers and I can rearrange the windows directly?
You wouldn't "close" the buffer C, only the window that displays it.
Vim has dedicated normal mode commands for:
- switching a window and the next one in a row or column,
- rotating the whole window layout,
- pushing a window to the far top, far right, far bottom, and far left,

but it doesn't have one for moving a window to an arbitrary point. So, assuming the window you want to move has the focus, the command would look like this:
:q|winc w|sp c
which is not too shabby. You might be able to find a plugin that provides the level of control you are after on https://www.vim.org.

Multithreaded Shortest Job First Scheduling Algorithm

I'm familiar with the Shortest Process Next scheduling algorithm (SJF), which is a non-preemptive algorithm. But this algorithm handles only one process at a time, the one with the smallest burst time. Can it be modified to run the two shortest processes at a time?
So for the example mentioned here:
5
A 0 3
B 2 6
C 4 4
D 6 5
E 8 2
The first line denotes the total number of processes.
The subsequent lines denote the Process ID, Arrival Time, and Burst Time.
SJF scheduling with 2 processes at a time will work as follows:
Time | A | B | C | D | E | IDLE |
---------------------------------
0    | O |   |   |   |   |  1   |
1    | O |   |   |   |   |  1   |
2    | X | O |   |   |   |      |
3    |   | O |   |   |   |  1   |
4    |   | O | O |   |   |      |
5    |   | O | O |   |   |      |
6    |   | O | O |   |   |      |
7    |   | X | X |   |   |      |
8    |   |   |   | O | O |      |
9    |   |   |   | O | X |      |
10   |   |   |   | O |   |  1   |
11   |   |   |   | O |   |  1   |
12   |   |   |   | X |   |  1   |
Here,
O: Process scheduled
X: Process completed
Idle denotes how many processors are currently idle. For this case, there are 2 processors.
It can be observed that at time t=4, there are 2 processes scheduled instead of 1.
Why not use an rb-tree and add the tasks to the tree keyed by their burst time?
The task with the least burst time (the shortest job) is the leftmost node in the tree. So when you pick your next task, you just remove the leftmost node from the tree.
When the task completes, you take the next task from the rb-tree.
When the task blocks, you put it on a separate structure and take the next task from the tree. You also update its burst time. Once it unblocks, you reinsert it into the rb-tree.
When a task yields, you update its burst time, reinsert it into the rb-tree, and then take the next task from the tree.
And you can have multiple processors taking tasks from the rb-tree; each will pick the one with the lowest burst time.
The Linux CFS uses a similar mechanism.
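The idea of several processors pulling the shortest job from an ordered tree can be sketched as a small simulation. This is only an illustration under stated assumptions: Rust's standard library has no red-black tree, so a BTreeSet stands in for it (it offers the same "remove the smallest element" operation), the names sjf_multi and Slot are hypothetical, and blocking or yielding tasks are left out, so only the non-preemptive case from the question is covered.

```rust
use std::collections::BTreeSet;

/// One scheduling decision: which process ran, and when (hypothetical type).
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Slot {
    pub id: String,
    pub start: u32,
    pub finish: u32,
}

/// Non-preemptive shortest-job-first on `cpus` processors.
/// Each process is given as (id, arrival time, burst time).
pub fn sjf_multi(processes: &[(&str, u32, u32)], cpus: usize) -> Vec<Slot> {
    // Indices of the processes, ordered by arrival time.
    let mut by_arrival: Vec<usize> = (0..processes.len()).collect();
    by_arrival.sort_by_key(|&i| processes[i].1);

    // Ready set ordered by (burst, arrival, index): the first element is the shortest job.
    let mut ready: BTreeSet<(u32, u32, usize)> = BTreeSet::new();
    let mut free_at = vec![0u32; cpus]; // time at which each processor becomes idle
    let mut next = 0; // next entry of `by_arrival` that has not been admitted yet
    let mut time = 0u32;
    let mut out = Vec::new();

    while out.len() < processes.len() {
        // Admit every process that has arrived by now.
        while next < by_arrival.len() && processes[by_arrival[next]].1 <= time {
            let i = by_arrival[next];
            ready.insert((processes[i].2, processes[i].1, i));
            next += 1;
        }
        // Every idle processor takes the leftmost (shortest) ready job.
        while let Some(cpu) = free_at.iter().position(|&t| t <= time) {
            let Some((burst, _, i)) = ready.pop_first() else { break };
            free_at[cpu] = time + burst;
            out.push(Slot {
                id: processes[i].0.to_string(),
                start: time,
                finish: time + burst,
            });
        }
        // Jump to the next event: an arrival or a processor becoming idle.
        let next_arrival = by_arrival.get(next).map(|&i| processes[i].1);
        let next_idle = free_at.iter().copied().filter(|&t| t > time).min();
        time = match next_arrival.into_iter().chain(next_idle).min() {
            Some(t) => t,
            None => break, // nothing left that could change the state
        };
    }
    out
}
```

For the five processes in the question and two processors, this yields A from 0 to 3, B from 2 to 8, C from 4 to 8, then E from 8 to 10 and D from 8 to 13, which agrees with the table above (a process marked X at tick t finishes at time t + 1).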

How to perform a series of steps in a single thread, with an async flow in spring-integration?

I currently have a spring-integration (v4.3.24) flow that looks like the following:
           |
           | list of
           | filepaths
      +----v---+
      |splitter|
      +----+---+
           | filepath
           |
+----------v----------+
|sftp-outbound-gateway|
|        "get"        |
+----------+----------+
           | file
+---------------------+
|     +----v----+     |
|     |decryptor|     |
|     +----+----+     |
|          |          |
|    +-----v------+   |   set of transformers
|    |decompressor|   |   (with routers before them
|    +-----+------+   |   because some steps are optional)
|          |          |   that process the file;
|       +--v--+       |   call this "FileProcessor"
|       | ... |       |
|       +--+--+       |
+---------------------+
           |
      +----v----+
      |save file|
      | to disk |
      +----+----+
           |
All of the channels above are DirectChannels - Yup, I know this is a poor structure. This was working fine for files in small numbers. But now, I have to deal with thousands of files which need to go through the same flow - benchmarks reveal that this takes ~ 1 day to finish processing. So, I'm planning to introduce some parallel processing to this flow. I want to modify my flow to achieve something like this:
                                    |
                                    |
                         +----------v----------+
                         |sftp-outbound-gateway|
                         |       "mget"        |
                         +----------+----------+
                                    | list of files
                                    |
                               +----v---+
                               |splitter|
                               +----+---+
        one thread              one | thread          ...
           +------------------------+---------------+--+--+--+--+
           | file                   | file             |  |  |  |
+---------------------+  +---------------------+
|     +----v----+     |  |     +----v----+     |
|     |decryptor|     |  |     |decryptor|     |
|     +----+----+     |  |     +----+----+     |
|          |          |  |          |          |
|    +-----v------+   |  |    +-----v------+   |   ...
|    |decompressor|   |  |    |decompressor|   |
|    +-----+------+   |  |    +-----+------+   |
|          |          |  |          |          |
|       +--v--+       |  |       +--v--+       |
|       | ... |       |  |       | ... |       |
|       +--+--+       |  |       +--+--+       |
+---------------------+  +---------------------+
           |                        |
      +----v----+              +----v----+
      |save file|              |save file|
      | to disk |              | to disk |
      +----+----+              +----+----+
           |                        |
           |                        |
For parallel processing, I output the files from the splitter onto an ExecutorChannel with a ThreadPoolTaskExecutor.
Some of the questions that I have:
1. I want all of the "FileProcessor" steps for one file to happen on the same thread, while multiple files are processed in parallel. How can I achieve this?
   I saw from this answer that an ExecutorChannel to MessageHandlerChain flow would offer such functionality. But some of the steps inside "FileProcessor" are optional (using selector-expression with routers to skip some of the steps), which rules out using a MessageHandlerChain. I can rig up a couple of MessageHandlerChains with Filters inside, but this more or less becomes the approach mentioned in #2.
2. If #1 cannot be achieved, will changing all of the channel types starting from the splitter, from DirectChannel to ExecutorChannel, help in introducing some parallelism? If yes, should I create a new TaskExecutor for each channel, or can I reuse one TaskExecutor bean for all channels (I cannot set scope="prototype" on a TaskExecutor bean)?
3. In your opinion, which approach (#1 or #2) is better? Why?
4. If I perform global error handling, like the approach mentioned here, will the other files continue to process even if one file errors out?
It will work as you need if you use an ExecutorChannel as the input to the decryptor and leave all the rest as direct channels; the remaining flow does not have to be a chain, and each component will run on one of the executor's threads.
You will need to be sure all your downstream components are thread-safe.
Error handling should remain as is; each sub flow is independent.

Resort key-Value combination

The following example just shows the pattern; my data is much bigger.
I have a table like
| Variable | String |
|:---------|-------:|
| V1 | Hello |
| V2 | little |
| V3 | World |
I have another table where different arrangements are defined
| Arrangement1 | Arrangement2 |
|:-------------|-------------:|
| V3 | V2 |
| V2 | V1 |
| V1 | V3 |
My output depending on the asked Arrangement (e.g. Arrangement1) should be
| Variable | Value |
|:---------|------:|
| V3 | World |
| V2 | little|
| V1 | Hello |
So far I have tried an approach with .find and an array, but I think there might be an easier way (maybe with a dictionary?). Does anyone have an idea that performs well?
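The dictionary idea floated in the question can be sketched as follows. This is only an illustration in Rust, which is an assumption since the question does not say which language or tool is in use; the function name rearrange is hypothetical. The table is loaded into a hash map once, so each variable in the arrangement is then found in roughly constant time instead of by a repeated search.

```rust
use std::collections::HashMap;

/// Returns the (variable, string) rows in the order given by `arrangement`.
/// Variables that do not occur in `table` are skipped.
pub fn rearrange(table: &[(&str, &str)], arrangement: &[&str]) -> Vec<(String, String)> {
    // Build the dictionary once: variable name -> string.
    let lookup: HashMap<&str, &str> = table.iter().copied().collect();
    arrangement
        .iter()
        .filter_map(|&var| lookup.get(var).map(|&s| (var.to_string(), s.to_string())))
        .collect()
}
```

With the example table and Arrangement1 (V3, V2, V1), this returns the rows (V3, World), (V2, little), (V1, Hello).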

Write a command to increase or decrease the number of vertical splits

I usually have my Vim screen split into two vertical windows, each of which may be further horizontally split. Sometimes, I want to add or delete a vertical window. Is there a way to detect how many top-level vertical splits there are and add or remove vsplits as necessary?
For example, suppose my screen looks like this:
+--------+--------+
|        |        |
|        |        |
+--------+        |
|        |        |
|        |        |
|        +--------+
|        |        |
+--------+--------+
I want :Columns 1 to give me
+--------+
|        |
|        |
+--------+
|        |
|        |
|        |
|        |
+--------+
by closing the two right-most windows.
I want :Columns 2 to do nothing, detecting that two columns are already open.
And I want :Columns 3 to give me
+--------+--------+--------+
|        |        |        |
|        |        |        |
+--------+        |        |
|        |        |        |
|        |        |        |
|        +--------+        |
|        |        |        |
+--------+--------+--------+
I am fine if the function ignores vertical splits within horizontal splits. For example, if I had
+--------+
|        |
|        |
+---+----+
|   |    |
|   |    |
|   |    |
|   |    |
+---+----+
and I ran :Columns 2, I would get
+--------+--------+
|        |        |
|        |        |
+---+----+        |
|   |    |        |
|   |    |        |
|   |    |        |
|   |    |        |
+---+----+--------+
There is indeed a way, but it is involved; the first step is to count the currently-open vertical windows, and I don’t know of any built-in function that facilitates this. The working approach I found to it is basically to start at the first window (the top of the first — if not the entirety of the first — vertical split), and to then, using wincmd l, move to the next window to the right for as long as wincmd l moves to a new window, adding each to a count of open vertical windows including the first one. (I think this is what Gary Fixler referred to in the comments on the question.)
I started trying to write the code for posting here, and it grew to become larger than any function I would want to put in my ~/.vimrc, so I ended up turning it into a plugin which takes the above approach and provides the :Columns command; see Columcille (on vim.org at http://www.vim.org/scripts/script.php?script_id=4742.) The plugin also provides a command for similarly managing horizontal split windows: :Rows divides the current column (or the main window, if there are no open vertical splits) into the specified number of “rows.”
