Currently, I have the following Rust toy program:
use rayon::prelude::*;
use std::{env, thread, time};

/// Sleeps for 1 second n times in parallel using rayon
fn rayon_sleep(n: usize) {
    let millis = vec![0; n];
    millis
        .par_iter()
        .for_each(|_| thread::sleep(time::Duration::from_millis(1000)));
}

fn main() {
    let args: Vec<String> = env::args().collect();
    let n = args[1].parse::<usize>().unwrap();
    let now = time::Instant::now();
    rayon_sleep(n);
    println!("rayon: {:?}", now.elapsed());
}
Basically, my program accepts one input argument n. Then, I sleep for 1 second n times. The program executes the sleep tasks in parallel using rayon.
However, this is not exactly what I want. As far as I know, thread::sleep sleeps according to wall-clock time, whereas I would like to keep a virtual CPU busy for 1 second of CPU time.
Is there any way to do this?
EDIT
I would like to make this point clear: I don't mind if the OS preempts the tasks. However, if this happens, then I don't want to consider the time the task spends in the ready/waiting queue.
EDIT
This is a simple, illustrative example of what I need to do. In reality, I have to develop a benchmark for a crate that allows defining and simulating models using the DEVS formalism. The benchmark aims to compare DEVS-compliant libraries with each other, and it explicitly says that the models must spend a fixed, known amount of CPU time. That is why I need to make sure of that. Thus, I cannot use a simple busy loop nor simply sleep.
I followed Sven Marnach's suggestions and implemented the following function:
use cpu_time::ThreadTime;
use rayon::prelude::*;
use std::{env, time};

/// Keeps a thread busy for 1 second of CPU time, n times in parallel using rayon
fn rayon_sleep(n: usize) {
    let millis = vec![0; n];
    millis.par_iter().for_each(|_| {
        let duration = time::Duration::from_millis(1000);
        let mut x: u32 = 0;
        let now = ThreadTime::now(); // get current thread time
        while now.elapsed() < duration { // active sleep
            std::hint::black_box(&mut x); // to avoid compiler optimizations
            x = x.wrapping_add(1);
        }
    });
}

fn main() {
    let args: Vec<String> = env::args().collect();
    let n = args[1].parse::<usize>().unwrap();
    let now = time::Instant::now();
    rayon_sleep(n);
    println!("rayon: {:?}", now.elapsed());
}
If I set n to 8, it takes about 2 seconds. I'd expect better performance (1 second, as I have 8 vCPUs), but I guess that the overhead comes from the OS scheduling policy.
Related
I have the following toy Rust program:
use rayon::prelude::*;
use std::{env, thread, time};

/// Sleeps for 1 second n times, sequentially
fn seq_sleep(n: usize) {
    for _ in 0..n {
        thread::sleep(time::Duration::from_millis(1000));
    }
}

/// Launches n threads that each sleep for 1 second
fn thread_sleep(n: usize) {
    let mut handles = Vec::new();
    for _ in 0..n {
        handles.push(thread::spawn(|| {
            thread::sleep(time::Duration::from_millis(1000))
        }));
    }
    for handle in handles {
        handle.join().unwrap();
    }
}

/// Sleeps for 1 second n times in parallel using rayon
fn rayon_sleep(n: usize) {
    let millis = vec![0; n];
    millis
        .par_iter()
        .for_each(|_| thread::sleep(time::Duration::from_millis(1000)));
}

fn main() {
    let args: Vec<String> = env::args().collect();
    let n = args[1].parse::<usize>().unwrap();

    let now = time::Instant::now();
    seq_sleep(n);
    println!("sequential: {:?}", now.elapsed());

    let now = time::Instant::now();
    thread_sleep(n);
    println!("thread: {:?}", now.elapsed());

    let now = time::Instant::now();
    rayon_sleep(n);
    println!("rayon: {:?}", now.elapsed());
}
Basically, I want to compare the degree of parallelism of i) sequential code, ii) basic threads, and iii) rayon. To do so, my program accepts one input parameter n and, depending on the method, it sleeps for 1 second n times.
For n = 8, I get the following output:
sequential: 8.016809707s
thread: 1.006029845s
rayon: 1.004957395s
So far so good. However, for n = 9, I get the following output:
sequential: 9.012422104s
thread: 1.003085005s
rayon: 2.011378713s
The sequential and basic thread versions make sense to me. However, I expected rayon to take 1 second. My machine has 4 cores with hyper-threading. This leads me to think that rayon internally limits the number of parallel threads according to the number of cores/threads that the machine supports. Is this correct?
Yes:
rayon::ThreadPoolBuilder::build_global():
Initializes the global thread pool. This initialization is optional. If you do not call this function, the thread pool will be automatically initialized with the default configuration.
rayon::ThreadPoolBuilder::num_threads():
If num_threads is 0, or you do not call this function, then the Rayon runtime will select the number of threads automatically. At present, this is based on the RAYON_NUM_THREADS environment variable (if set), or the number of logical CPUs (otherwise). In the future, however, the default behavior may change to dynamically add or remove threads as needed.
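To see why 9 one-second sleeps take about 2 seconds, here is a minimal sketch of the arithmetic. It assumes a fixed pool in which each worker thread runs one blocking task at a time; the helper name expected_rounds is made up for illustration and is not part of rayon.

```rust
/// Number of "rounds" needed when `tasks` equally long blocking tasks
/// are spread over a fixed pool of `workers` threads (ceiling division).
/// Hypothetical helper for illustration; not part of rayon.
fn expected_rounds(tasks: usize, workers: usize) -> usize {
    assert!(workers > 0, "a pool needs at least one worker");
    (tasks + workers - 1) / workers
}

fn main() {
    // 4 cores with hyper-threading give 8 logical CPUs, hence 8 workers by default.
    let workers = 8;
    for tasks in [8, 9, 16, 17] {
        println!("{} tasks -> {} round(s)", tasks, expected_rounds(tasks, workers));
    }
}
```

With rayon itself, the pool size can be changed through the RAYON_NUM_THREADS environment variable or ThreadPoolBuilder::num_threads(), as the documentation quoted above describes.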
I have a long string stored in a variable in Rust. I often remove some characters from its front with a drain method and use the value returned from it:
my_str.drain(0..i).collect::<String>();
The problem is that draining from this string is done really often in the program, and it's slowing it down a lot (it takes ~99.6% of the runtime). This is a very expensive operation, since every time, the entire remaining string has to be moved left in memory.
I do not drain from the end of the string at all (which should be much faster), just from the front.
How can I make this more efficient? Is there some alternative to String that uses a different memory layout and would be better for this use case?
If you can't use slices because of the lifetimes, you could use a type that provides shared ownership, like SharedString from the shared-string crate or Str from the bytes-utils crate. The former looks more fully featured, but both provide methods that can take the prefix from a string in O(1) because the original data is never moved.
As stated by @Jmb, keeping the original string intact and working with slices is certainly a big win.
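As a minimal sketch of the slice approach, assuming the original String can stay alive and unmodified while it is being consumed, a &str "cursor" can stand in for the draining. The helper name take_prefix is hypothetical.

```rust
/// Splits the first `n` bytes off the front of `rest` and returns them.
/// Only the slice bounds change; no bytes are copied or shifted.
/// Panics if `n` is out of range or not on a UTF-8 character boundary.
fn take_prefix<'a>(rest: &mut &'a str, n: usize) -> &'a str {
    let (head, tail) = rest.split_at(n);
    *rest = tail;
    head
}

fn main() {
    let my_str = String::from("hello world");
    let mut rest: &str = &my_str; // cursor into the untouched original
    let first = take_prefix(&mut rest, 5);
    println!("{:?} then {:?}", first, rest);
}
```

If an owned String is really needed for a prefix, calling to_owned() on the returned slice copies only those few bytes rather than shifting the whole remainder.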
I can't tell from the question what the context and usage of these strings are, but this quick and dirty benchmark shows a substantial difference in performance.
This benchmark is flawed because there is a useless clone() at each repetition, there is no warm-up, there is no black-box for the result, there are no statistics... but it just gives an idea.
use std::time::Instant;

fn with_drain(mut my_str: String) -> usize {
    let mut total = 0;
    'work: loop {
        for &i in [1, 2, 3, 4, 5].iter().cycle() {
            if my_str.len() < i {
                break 'work;
            }
            let s = my_str.drain(0..i).collect::<String>();
            total += s.len();
        }
    }
    total
}

fn with_slice(my_str: String) -> usize {
    let mut total = 0;
    let mut pos = 0;
    'work: loop {
        for &i in [1, 2, 3, 4, 5].iter().cycle() {
            let next_pos = pos + i;
            // `<` (not `<=`) so that a chunk ending exactly at the end of
            // the string is still consumed, matching with_drain
            if my_str.len() < next_pos {
                break 'work;
            }
            let s = &my_str[pos..next_pos];
            pos = next_pos;
            total += s.len();
        }
    }
    total
}

fn main() {
    let my_str = "I have a long string stored in a variable in Rust.
I often remove some characters from its front with a drain method and use the value returned from it:
my_str.drain(0..i).collect::<String>();
The problem is, that draining from this string is done really often in the program and it's slowing it down a lot (it takes ~99.6% of runtime). This is a very expensive operation, since every time, the entire string has to be moved left in the memory.
I do not drain from the end of the string at all (which should be much faster), just from the front.
How can I make this more efficient? Is there some alternative to String, that uses a different memory layout, which would be better for this use case?
".to_owned();
    let repeat = 1_000_000;

    let instant = Instant::now();
    for _ in 0..repeat {
        let _ = with_drain(my_str.clone());
    }
    let drain_duration = instant.elapsed();

    let instant = Instant::now();
    for _ in 0..repeat {
        let _ = with_slice(my_str.clone());
    }
    let slice_duration = instant.elapsed();

    println!("{:?} {:?}", drain_duration, slice_duration);
}
/*
$ cargo run --release
Finished release [optimized] target(s) in 0.00s
Running `target/release/prog`
5.017018957s 310.466253ms
*/
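For reference, a slightly less naive timing helper could address some of the flaws listed above (warm-up, black-boxing the result, a median instead of a single total). This is only a sketch using the standard library; the names bench and median are made up here, and a dedicated benchmarking crate would still be the more rigorous choice.

```rust
use std::hint::black_box;
use std::time::{Duration, Instant};

/// Returns the median of the samples (the upper median for even counts).
fn median(samples: &mut [Duration]) -> Duration {
    assert!(!samples.is_empty(), "need at least one sample");
    samples.sort();
    samples[samples.len() / 2]
}

/// Calls `f` a few times untimed (warm-up), then times `runs` calls
/// individually and returns the median duration.
fn bench<T>(runs: usize, mut f: impl FnMut() -> T) -> Duration {
    for _ in 0..3 {
        black_box(f()); // warm-up, result discarded
    }
    let mut samples: Vec<Duration> = (0..runs)
        .map(|_| {
            let start = Instant::now();
            black_box(f()); // keep the result observable to the optimizer
            start.elapsed()
        })
        .collect();
    median(&mut samples)
}

fn main() {
    let text = "some input ".repeat(100);
    let t = bench(101, || text.bytes().filter(|b| *b == b' ').count());
    println!("median: {:?}", t);
}
```

Passing a closure that borrows the input, instead of cloning the string inside the timed region, also keeps the clone() cost out of the measurement.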
As proposed by @SUTerliakov, using VecDeque<char> is much more effective than String in this case, with either the pop_front method or the drain method (when draining from the front, of course).
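A minimal sketch of that idea, using only the standard library (the helper name take_front is hypothetical):

```rust
use std::collections::VecDeque;

/// Removes up to `n` characters from the front and returns them as a String.
/// A VecDeque is a ring buffer, so removing from the front should not
/// require shifting the remaining elements.
fn take_front(buf: &mut VecDeque<char>, n: usize) -> String {
    let n = n.min(buf.len()); // clamp so that drain never panics
    buf.drain(..n).collect()
}

fn main() {
    let mut buf: VecDeque<char> = "hello world".chars().collect();
    let first = take_front(&mut buf, 5);
    let rest: String = buf.iter().collect();
    println!("{:?} then {:?}", first, rest);
}
```

One trade-off to keep in mind: each char occupies 4 bytes, so for mostly ASCII text this uses roughly four times the memory of a UTF-8 String.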
So I want to get the distance in cm from my sensor. I already did this in Arduino C on an Arduino-compatible board, and now I want to do it on an STM32. Below is my code (leaving out the conversion of pulse length to distance, since the measured delta time is already constant at this point).
#![deny(unsafe_code)]
#![allow(clippy::empty_loop)]
#![no_main]
#![no_std]

use panic_halt as _; // panic handler

use cortex_m_rt::{entry, interrupt};
use stm32f4xx_hal as hal;
use crate::hal::{pac, prelude::*};
use stm32f4xx_hal::delay::Delay;
use rtt_target::{rtt_init_print, rprintln};
use stm32f4xx_hal::timer::{Counter, Timer, SysCounter, CounterUs};
use cortex_m::peripheral::SYST;
use stm32f4xx_hal::time::Hertz;
use core::fmt::Debug;
use stm32f4xx_hal::pac::TIM2;
use core::pin::Pin;

fn dbg<T: Debug>(d: T, tag: &str) -> T {
    rprintln!("{} {:?}", tag, d);
    d
}

fn waste(c_us: &CounterUs<TIM2>, us: u32) {
    let ts1 = c_us.now().ticks();
    while (c_us.now().ticks() - ts1) < us {}
}

fn waste_until<T>(
    c_us: &CounterUs<TIM2>,
    predicate: fn(_: &T) -> bool,
    dt: &T,
    us: u32,
) -> u32 {
    let ts1 = c_us.now().ticks();
    while (c_us.now().ticks() - ts1) < us && !predicate(dt) {}
    return c_us.now().ticks() - ts1;
}

#[entry]
fn main() -> ! {
    if let (Some(dp), Some(cp)) = (
        pac::Peripherals::take(),
        cortex_m::peripheral::Peripherals::take(),
    ) {
        rtt_init_print!();
        let gpioa = dp.GPIOA.split();
        let mut trig = gpioa.pa3.into_push_pull_output();
        let mut echo = gpioa.pa4.into_pull_up_input();
        let rcc = dp.RCC.constrain();
        let clocks = rcc.cfgr.freeze();
        let mut counter = Timer::new(dp.TIM2, &clocks).counter_us();
        counter.start(1_000_000_u32.micros()).unwrap();
        loop {
            trig.set_low();
            waste(&counter, 2);
            trig.set_high();
            waste(&counter, 10);
            trig.set_low();
            let _ = waste_until(&counter, |c| c.is_high(), &echo, 1000);
            let pulse_duration = waste_until(&counter, |c| c.is_low(), &echo, 1000);
            rprintln!("{}", pulse_duration);
        }
    }
    loop {}
}
I know that, at this point, the code does not stop evaluating the data when waste_until times out, but given that there is an object less than 10 cm from the sensor (which has a range of up to 2 meters), this shouldn't cause issues.
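As a rough sanity check on the timeout values, here is a sketch of the usual echo-time arithmetic. It assumes an ultrasonic sensor that reports distance as the width of the echo pulse, and a speed of sound of about 343 m/s (dry air near 20 °C); the function names are made up for illustration.

```rust
/// Assumed speed of sound in air: about 343 m/s, i.e. 343 mm per millisecond.
const SOUND_MM_PER_MS: u32 = 343;

/// Converts an echo pulse width in microseconds to a one-way distance in mm.
fn pulse_us_to_mm(pulse_us: u32) -> u32 {
    // distance = time * speed / 2, because the sound travels there and back
    (pulse_us as u64 * SOUND_MM_PER_MS as u64 / 2000) as u32
}

/// Expected echo pulse width in microseconds for an object `mm` away.
fn mm_to_pulse_us(mm: u32) -> u32 {
    (mm as u64 * 2000 / SOUND_MM_PER_MS as u64) as u32
}

fn main() {
    println!("10 cm -> about {} us", mm_to_pulse_us(100));
    println!("2 m   -> about {} us", mm_to_pulse_us(2000));
    println!("1000 us timeout -> about {} mm", pulse_us_to_mm(1000));
}
```

Under these assumptions, an object at 10 cm gives a pulse of roughly 583 us, which fits inside a 1000 us timeout, while the full 2 m range would need a timeout of more than 11 ms.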
There are a few things I don't completely understand, which I assume might be the cause of this behavior.
First of all, I'm not sure whether hardware timers wrap around on their own or have to be reset manually. (I used my waste function with a half-second delay and managed to make a seemingly OK blinky program, so I hope I got it right.)
I'm not sure whether I have to configure the maximum sampling frequency of TIM2, as in theory I could do it with the sysclock, but I didn't find a way to do it with TIM2. I also assumed that it wouldn't let me create a CounterUs without a minimum valid sample rate.
I'm not sure whether ticks() map one-to-one to microseconds (I only assumed so because it seemed logical that CounterUs would do that).
I'm not sure what problems might occur if the timer wraps around mid-wait and the delta time becomes negative (which, for a u32, just overflows).
When it comes to pull_up_input and pull_down_input, does pull_up mean that the pin is normally pulled high and has to go low to trigger a logical one, or that it has to be pulled high to get a logical one? (It is also not very clear whether the is_low() and is_high() methods refer to the electrical state of the pin or to its logical value.)
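Regarding the wrap-around concern, a wrap-safe way to compute the elapsed ticks could look like the sketch below. It assumes the counter counts from 0 up to some period and then restarts at 0, with at most one wrap between the two readings; the period value is a placeholder, not something taken from the HAL documentation. Note that a plain `-` on u32 panics on underflow in debug builds and silently wraps in release builds.

```rust
/// Ticks elapsed from `start` to `now` for a counter that counts
/// 0..period and then restarts at 0 (at most one wrap in between).
/// `period` is an assumption about how the timer was configured.
fn elapsed_ticks(start: u32, now: u32, period: u32) -> u32 {
    if now >= start {
        now - start
    } else {
        // the counter wrapped once between the two readings
        period - start + now
    }
}

fn main() {
    let period = 1_000_000; // hypothetical: counter restarts every 1_000_000 ticks
    println!("{}", elapsed_ticks(999_995, 5, period));
    // For a free-running counter that uses the full u32 range,
    // wrapping_sub gives the same kind of result directly:
    println!("{}", 5u32.wrapping_sub(u32::MAX - 4));
}
```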
I spent quite some time on this, but sadly to no avail so far. Hopefully someone can tell me whether one of the things above is wrong and indeed causes the issue or, if it is something I haven't considered, help me see it.
(The value I'm getting is 1000-1001.)
From one of the comments I found out about pull-down and pull-up resistors and watched a couple of YouTube videos on the matter. I'm not sure if this is correct, but from what I've found it seems that I in fact need a pull_down_input for the echo pin. So I replaced it, and the value I'm getting is still constant, but it's 1 now.
Now that makes some sense, since I assume that the 1000 originated from the timeout value in my waste_until calls. But getting 1 is a bit more confusing; I mean, it cannot be faster than 1 us, right?
So after experimenting some more, I've ended up with this version of the code:
#![deny(unsafe_code)]
#![allow(clippy::empty_loop)]
#![no_main]
#![no_std]

use panic_halt as _; // panic handler

use cortex_m_rt::{entry, interrupt};
use stm32f4xx_hal as hal;
use crate::hal::{pac, prelude::*};
use stm32f4xx_hal::delay::Delay;
use rtt_target::{rtt_init_print, rprintln};
use stm32f4xx_hal::timer::{Counter, Timer, SysCounter, CounterUs};
use cortex_m::peripheral::SYST;
use stm32f4xx_hal::time::Hertz;
use core::fmt::Debug;
use stm32f4xx_hal::pac::TIM2;
use core::pin::Pin;
use cortex_m::asm::nop;

fn dbg<T: Debug>(d: T, tag: &str) -> T {
    rprintln!("{} {:?}", tag, d);
    d
}

fn waste(c_us: &CounterUs<TIM2>, us: u32) {
    let ts1 = c_us.now().ticks();
    while (c_us.now().ticks() - ts1) < us {}
}

fn waste_until<T>(
    c_us: &CounterUs<TIM2>,
    predicate: fn(_: &T) -> bool,
    dt: &T,
    us: u32,
) -> Option<u32> {
    let ts1 = c_us.now().ticks();
    while (c_us.now().ticks() - ts1) < us && !predicate(dt) {}
    if predicate(dt) { Some(c_us.now().ticks() - ts1) } else { None }
}

#[entry]
fn main() -> ! {
    if let (Some(dp), Some(cp)) = (
        pac::Peripherals::take(),
        cortex_m::peripheral::Peripherals::take(),
    ) {
        rtt_init_print!();
        let gpioa = dp.GPIOA.split();
        let mut trig = gpioa.pa4.into_push_pull_output();
        let mut echo = gpioa.pa5.into_pull_down_input();
        let rcc = dp.RCC.constrain();
        let clocks = rcc.cfgr.freeze();
        let mut counter = Timer::new(dp.TIM2, &clocks).counter_us();
        counter.start(1_000_000_u32.micros()).unwrap();
        loop {
            // starting pulse
            trig.set_low();
            waste(&counter, 2);
            trig.set_high();
            waste(&counter, 10);
            trig.set_low();
            // ending pulse
            // starting echo read
            if let Some(_) = waste_until(&counter, |c| c.is_high(), &echo, 1_000_000) { // if didn't timeout
                if let Some(pulse_duration) = waste_until(&counter, |c| c.is_low(), &echo, 1_000_000) { // if didn't timeout
                    rprintln!("{}", pulse_duration);
                } else {
                    rprintln!("no falling edge");
                }
            } else {
                rprintln!("no rising edge");
            }
            // end echo read
        }
    }
    loop {}
}
Here it became clear that the pattern was in fact this: the first 1-3 readings output the same value (so far I've seen 1, 21 and 41), and then it keeps timing out in the outer if.
I tried changing the IO pins because I suspected that my poor soldering job was to blame, and I also inspected the pins with a multimeter; they seem to be fine.
I'm not entirely sure, but I think that, given that the sensor has a recommended VCC of 5 volts and the stlink-2 provides 3.3 volts to the board, the sensor may perform worse (but once again, the target object is at most 5 cm away).
Here are the images of my breadboard, just in case I missed something.
I can only see examples using cycles of clock to schedule tasks in Real Time for the Masses (RTFM):
#[init(schedule = [foo])]
fn init() {
    schedule.foo(Instant::now() + PERIOD.cycles()).unwrap();
}
I can't find a variable containing the clock speed, the source code of RTFM is mostly syntax-tree manipulation that is inaccessible to a beginner, and I struggle to find uses of this API on GitHub. How do I relate cycles to seconds?
I found something:
fn hertz_to_cycles(sysclock: Hertz, hertz: Hertz) -> Duration {
    return (sysclock.0 / hertz.0).cycles();
}

#[init(schedule = [toggle])]
unsafe fn init() {
    let mut rcc = device.RCC.constrain();
    let mut flash = device.FLASH.constrain();
    let clocks = rcc.cfgr.freeze(&mut flash.acr);
    let sysclock = clocks.sysclk();
    let period = hertz_to_cycles(sysclock, 2.hz());
    schedule.toggle(Instant::now() + period).unwrap();
}
I hope it gets the ball rolling for a serious answer.
Basically, the scheduler is based on the DWT (Data Watchpoint and Trace) unit, and that thing has to be clocked at the core speed, so that is where I took the clock frequency from.
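Stripped of the HAL types, the conversion is plain integer arithmetic on the core clock frequency. The sketch below uses bare integers and made-up function names; it assumes, as described above, that the cycle counter ticks once per core clock cycle.

```rust
/// Core clock cycles in one period of an event that fires `rate_hz`
/// times per second, given the core clock frequency in Hz.
fn cycles_per_period(sysclk_hz: u32, rate_hz: u32) -> u32 {
    sysclk_hz / rate_hz
}

/// Core clock cycles that elapse in `millis` milliseconds.
/// Uses u64 so that the intermediate product cannot overflow.
fn cycles_for_millis(sysclk_hz: u32, millis: u32) -> u64 {
    sysclk_hz as u64 * millis as u64 / 1000
}

fn main() {
    let sysclk_hz = 8_000_000; // hypothetical 8 MHz core clock
    println!("2 Hz   -> {} cycles", cycles_per_period(sysclk_hz, 2));
    println!("250 ms -> {} cycles", cycles_for_millis(sysclk_hz, 250));
}
```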
No matter how many times I run the program, it always shows the numbers in the same order:
use std::sync::mpsc::channel;
use std::thread;

fn main() {
    let (tx, rx) = channel();
    for i in 0..10 {
        let tx = tx.clone();
        thread::spawn(move || {
            tx.send(i).unwrap();
        });
    }
    for _ in 0..10 {
        println!("{}", rx.recv().unwrap());
    }
}
Code on the playground. The output is:
6
7
8
5
9
4
3
2
1
0
If I rebuild the project, the sequence will change. Is the sequence decided at compile time?
What order would you expect them to be in? For what it's worth, on my machine I ran the same binary twice and got slightly different results.
Ultimately, this comes down to how your operating system decides to schedule threads. You create 10 new threads and then ask the OS to run each of them when convenient. A hypothetical thread scheduler might look like this:
for thread in threads {
    if thread.runnable() {
        thread.run_for_a_time_slice();
    }
}
Where threads stores the threads in the order they were created. It's unlikely that any OS would be this naïve, but it shows the idea.
In your case, every thread is ready to run immediately and is very short, so it can run all the way to completion before its time slice is up.
Additionally, there might be some fairness being applied to the lock that guards the channel. Perhaps it always lets the first of multiple competing threads submit a value. Unfortunately, the implementation of channels is reasonably complex, so I can't immediately say if that's the case or not.
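Since the arrival order depends on scheduling, code that needs a stable order has to impose one itself. One possible sketch, assuming it is acceptable to wait for all values before using them (the function name collect_sorted is made up):

```rust
use std::sync::mpsc::channel;
use std::thread;

/// Spawns `n` threads that each send their index, then returns the
/// received values sorted, so the result no longer depends on scheduling.
fn collect_sorted(n: u32) -> Vec<u32> {
    let (tx, rx) = channel();
    for i in 0..n {
        let tx = tx.clone();
        thread::spawn(move || {
            tx.send(i).unwrap();
        });
    }
    drop(tx); // drop the original sender so the receiver iterator can end
    let mut values: Vec<u32> = rx.iter().collect();
    values.sort_unstable();
    values
}

fn main() {
    println!("{:?}", collect_sorted(10));
}
```

Here rx.iter() keeps yielding values until every sender has been dropped, which is why the original tx is dropped explicitly after the threads are spawned.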