Is Rust able to optimize local heap allocations? - rust

When writing relatively realtime code, generally heap allocations in the main execution loop are avoided. So in my experience you allocate all the memory your program needs in an initialization step, and then pass the memory around as needed. A toy example in C might look something like the following:
#include <stdlib.h>

#define LEN 100

void not_realtime() {
    int *v = malloc(LEN * sizeof *v);
    for (int i = 0; i < LEN; i++) {
        v[i] = 1;
    }
    free(v);
}

void realtime(int *v, int len) {
    for (int i = 0; i < len; i++) {
        v[i] = 1;
    }
}

int main(int argc, char **argv) {
    not_realtime();
    int *v = malloc(LEN * sizeof *v);
    realtime(v, LEN);
    free(v);
}
And I believe roughly the equivalent in Rust:
fn possibly_realtime() {
    let mut v = vec![0; 100];
    for i in 0..v.len() {
        v[i] = 1;
    }
}

fn realtime(v: &mut Vec<i32>) {
    for i in 0..v.len() {
        v[i] = 1;
    }
}

fn main() {
    possibly_realtime();
    let mut v: Vec<i32> = vec![0; 100];
    realtime(&mut v);
}
What I'm wondering is: is Rust able to optimize possibly_realtime such that the local heap allocation of v only occurs once and is reused on subsequent calls to possibly_realtime? I'm guessing not but maybe there's some magic that makes it possible.

To investigate this, it is useful to add #[inline(never)] to your function, then view the LLVM IR on the playground.
Rust 1.54
This is not optimized. Here's an excerpt:
; playground::possibly_realtime
; Function Attrs: noinline nonlazybind uwtable
define internal fastcc void @_ZN10playground17possibly_realtime17h2ab726cd567363f3E() unnamed_addr #0 personality i32 (i32, i32, i64, %"unwind::libunwind::_Unwind_Exception"*, %"unwind::libunwind::_Unwind_Context"*)* @rust_eh_personality {
start:
  %0 = tail call i8* @__rust_alloc_zeroed(i64 400, i64 4) #9, !noalias !8
  %1 = icmp eq i8* %0, null
  br i1 %1, label %bb20.i.i.i.i, label %vector.body
Every time that possibly_realtime is called, memory is allocated via __rust_alloc_zeroed.
Slightly before Rust 1.0
This is not optimized. Here's an excerpt:
; Function Attrs: noinline uwtable
define internal fastcc void @_ZN17possibly_realtime20h1a3a159dd4b50685eaaE() unnamed_addr #0 {
entry-block:
  %0 = tail call i8* @je_mallocx(i64 400, i32 0), !noalias !0
  %1 = icmp eq i8* %0, null
  br i1 %1, label %then-block-255-.i.i, label %normal-return2.i
Every time that possibly_realtime is called, memory is allocated via je_mallocx.
Editorial
Reusing a buffer is a great way to leak secure information, and I'd encourage you to avoid it as much as possible. I'm sure you are already familiar with these problems, but I want to make sure that future searchers make a note.
I also doubt that this "optimization" will be added to Rust, especially not without explicit opt-in by the programmer. There needs to be somewhere that the pointer to the allocated memory could be stored, but there really isn't anywhere. That means it would need to be a global or thread-local variable! Rust can run in environments without threads, but a global variable would still preclude recursive calls to this method. All in all, I think that passing the buffer into the method is much more explicit about what will happen.
I also assume that your example uses a Vec with a fixed size for demo purposes, but if you truly know the size at compile time, a fixed-size array could be a better choice.
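The leak this editorial warns about, as an added example that is not part of the original answer: two callers share one fixed-size frame, and the flawed encoder hands the second caller the tail of the first caller's data. The frame size, function names and sample payloads are constructed for illustration.

```rust
const FRAME: usize = 16;

/// Flawed: copies the payload into the reused frame and returns the whole
/// frame, so bytes past `payload.len()` still hold the previous caller's data.
fn encode_reused(frame: &mut [u8; FRAME], payload: &[u8]) -> Option<[u8; FRAME]> {
    if payload.len() > FRAME {
        return None;
    }
    frame[..payload.len()].copy_from_slice(payload);
    Some(*frame)
}

/// Hardened: the buffer is still reused, but the tail is overwritten before
/// the frame leaves the function.
fn encode_cleared(frame: &mut [u8; FRAME], payload: &[u8]) -> Option<[u8; FRAME]> {
    if payload.len() > FRAME {
        return None;
    }
    let (head, tail) = frame.split_at_mut(payload.len());
    head.copy_from_slice(payload);
    tail.fill(0);
    Some(*frame)
}

type Encoder = fn(&mut [u8; FRAME], &[u8]) -> Option<[u8; FRAME]>;

/// Two callers share one frame; returns what the second caller receives
/// after its own 4-byte payload.
fn second_callers_tail(encode: Encoder) -> Vec<u8> {
    let mut frame = [0u8; FRAME];
    // Constructed sample data standing in for the first caller's secret.
    encode(&mut frame, b"token=hunter2!!").expect("fits in frame");
    let out = encode(&mut frame, b"ping").expect("fits in frame");
    out[4..].to_vec()
}

fn main() {
    let leaked = second_callers_tail(encode_reused);
    println!("reused : {:?}", String::from_utf8_lossy(&leaked));
    println!("cleared: {:?}", second_callers_tail(encode_cleared));
}
```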

As of 2021, Rust is capable of optimizing out heap allocation and inlining vtable method calls (playground):
fn old_adder(a: f64) -> Box<dyn Fn(f64) -> f64> {
    Box::new(move |x| a + x)
}

#[inline(never)]
fn test() {
    let adder = old_adder(1.);
    assert_eq!(adder(1.), 2.);
}

fn main() {
    test();
}

Related

What is the Rust equivalent of a char buffer / ASCII string on the stack?

I'm trying to find the Rust equivalent of having an ASCII string buffer on the stack to have the same efficiency as plain C code has.
Here is an example of what I mean, with a simplified toy exercise:
The goal is to generate a random-content and random-length ASCII string that is at most 50 characters long. Thus I keep a char buffer on the stack that is used to iteratively construct the string. Once finished, the string is copied onto the heap with the just-right malloc size and returned to the user.
#include <stdint.h>
#include <stdlib.h>
#include <time.h>
#include <string.h>
#include <stdio.h>

#define ASCII_PRINTABLE_FIRST ' '
#define ASCII_PRINTABLE_AMOUNT 95
#define MAX_LEN 50
#define MAX_LEN_WITH_TERM (MAX_LEN + 1)

char* generate_string(void) {
    char buffer[MAX_LEN_WITH_TERM];
    srand((unsigned) time(NULL));
    // Generate random string length
    const int len = rand() % MAX_LEN_WITH_TERM;
    int i;
    for (i = 0; i < len; i++) {
        // Fill with random ASCII printable character
        buffer[i] = (char)
            ((rand() % ASCII_PRINTABLE_AMOUNT) + ASCII_PRINTABLE_FIRST);
    }
    buffer[i] = '\0';
    return strdup(buffer);
}

int main(void) {
    printf("Generated string: %s\n", generate_string());
    return 0;
}
What I explored so far:
Using a buffer String::with_capacity(50) or BytesMut, but that allocates the buffer on the heap, which I would like to avoid. Sure, it's premature optimisation, but as an optimisation exercise let's imagine me calling generate_string() a billion times. That is a billion malloc calls to allocate the buffer. I don't want to use static memory.
Using an array of chars on the stack, but it consumes 4x the space for just ASCII characters.
What are your suggestions?
EDIT:
Yes, it leaks memory. That's not the point of my question, unless you want much longer snippets of code.
Yes, it has insecure random characters. That's not the point of my question.
Why would I allocate the buffer on the heap once per generate_string() call? To make the function self contained, stateless and without static memory. It does not require a pre-allocated buffer externally.
You can generate a random length u8 array (stored on the stack) and only allocate memory on the heap when you convert it to a String using the from_utf8 method. Example:
use rand::prelude::*;

const MAX_LEN: usize = 50;
const ASCII_START: u8 = 32;
const ASCII_END: u8 = 127;

fn generate_string() -> String {
    let mut buffer = [0; MAX_LEN];
    let mut rng = rand::thread_rng();
    let buffer_len = rng.gen_range(0, MAX_LEN);
    for i in 0..buffer_len {
        buffer[i] = rng.gen_range(ASCII_START, ASCII_END);
    }
    String::from_utf8((&buffer[0..buffer_len]).to_vec()).unwrap()
}

fn main() {
    for _ in 0..5 {
        dbg!(generate_string());
    }
}
playground
The Rust type that is equivalent to C's char is u8, so the equivalent to a char buffer on the stack is an u8 array.
let mut buf = [0u8; 20];
for i in 0..20 {
    buf[i] = b'a' + i as u8;
}
To obtain a &str slice that points into the stack buffer, you can use std::str::from_utf8, which performs a UTF-8 check and returns the pointer if it is valid UTF-8.
fn takes_a_string(a: &str) {
    println!("{}", a);
}

fn main() {
    let mut buf = [0u8; 20];
    for i in 0..20 {
        buf[i] = b'a' + i as u8;
    }
    // This calls takes_a_string with a reference to the stack buffer.
    takes_a_string(std::str::from_utf8(&buf).unwrap());
}
abcdefghijklmnopqrst

How to add state to a function in Rust

Rust has anonymous closures with state. Can I do the same with a named function?
(invalid pseudocode)
fn counting_function()->i32 {
    let mut static counter = 0;
    counter = counter + 1;
    return counter.clone();
}
I understand I can use structs and functions/traits to do this. And I understand that iterators are the proper way to do it. But leaving aside structs with traits and iterators, can I do this without passing any burden (of initializing a structure) to the caller?
This is a thread safe variant using an atomic:
use std::sync::atomic::{AtomicUsize, Ordering};

fn counting_function() -> usize {
    static COUNTER: AtomicUsize = AtomicUsize::new(0);
    let result = COUNTER.fetch_add(1, Ordering::Relaxed);
    result
}
But it's actually a code smell I'd say.
Your pseudocode almost works as is. To work with the static mut variable, you'll need to mark the accessing and modifying parts of your code as unsafe as these operations are not threadsafe.
fn counting_function() -> u32 {
    static mut counter: u32 = 0;
    let retval = unsafe { counter };
    unsafe {
        counter += 1;
    }
    retval
}

How do I pass disjoint slices from a vector to different threads?

I am new to Rust, and struggling to deal with all those wrapper types in Rust. I am trying to write code that is semantically equal to the following C code. The code tries to create a big table for book keeping, but will divide the big table so that every thread will only access their local small slices of that table. The big table will not be accessed unless other threads quit and no longer access their own slice.
#include <stdio.h>
#include <stdlib.h>   /* malloc */
#include <pthread.h>

void* write_slice(void* arg) {
    int* slice = (int*) arg;
    int i;
    for (i = 0; i < 10; i++)
        slice[i] = i;
    return NULL;
}

int main()
{
    int* table = (int*) malloc(100 * sizeof(int));
    int* slice[10];
    int i;
    for (i = 0; i < 10; i++) {
        slice[i] = table + i * 10;
    }
    // create pthread for each slice
    pthread_t p[10];
    for (i = 0; i < 10; i++)
        pthread_create(&p[i], NULL, write_slice, slice[i]);
    for (i = 0; i < 10; i++)
        pthread_join(p[i], NULL);
    for (i = 0; i < 100; i++)
        printf("%d,", table[i]);
}
How do I use Rust's types and ownership to achieve this?
Let's start with the code:
// cargo-deps: crossbeam="0.7.3"
extern crate crossbeam;

const CHUNKS: usize = 10;
const CHUNK_SIZE: usize = 10;

fn main() {
    let mut table = [0; CHUNKS * CHUNK_SIZE];
    // Scoped threads allow the compiler to prove that no threads will outlive
    // table (which would be bad).
    let _ = crossbeam::scope(|scope| {
        // Chop `table` into disjoint sub-slices.
        for slice in table.chunks_mut(CHUNK_SIZE) {
            // Spawn a thread operating on that subslice.
            scope.spawn(move |_| write_slice(slice));
        }
        // `crossbeam::scope` ensures that *all* spawned threads join before
        // returning control back from this closure.
    });
    // At this point, all threads have joined, and we have exclusive access to
    // `table` again. Huzzah for 100% safe multi-threaded stack mutation!
    println!("{:?}", &table[..]);
}

fn write_slice(slice: &mut [i32]) {
    for (i, e) in slice.iter_mut().enumerate() {
        *e = i as i32;
    }
}
One thing to note is that this needs the crossbeam crate. Rust used to have a similar "scoped" construct, but a soundness hole was found right before 1.0, so it was deprecated with no time to replace it. crossbeam is basically the replacement.
What Rust lets you do here is express the idea that, whatever the code does, none of the threads created within the call to crossbeam::scoped will survive that scope. As such, anything borrowed from outside that scope will live longer than the threads. Thus, the threads can freely access those borrows without having to worry about things like, say, a thread outliving the stack frame that table is defined by and scribbling over the stack.
So this should do more or less the same thing as the C code, though without that nagging worry that you might have missed something. :)
Finally, here's the same thing using scoped_threadpool instead. The only real practical difference is that this allows us to control how many threads are used.
// cargo-deps: scoped_threadpool="0.1.6"
extern crate scoped_threadpool;

const CHUNKS: usize = 10;
const CHUNK_SIZE: usize = 10;

fn main() {
    let mut table = [0; CHUNKS * CHUNK_SIZE];
    let mut pool = scoped_threadpool::Pool::new(CHUNKS as u32);
    pool.scoped(|scope| {
        for slice in table.chunks_mut(CHUNK_SIZE) {
            scope.execute(move || write_slice(slice));
        }
    });
    println!("{:?}", &table[..]);
}

fn write_slice(slice: &mut [i32]) {
    for (i, e) in slice.iter_mut().enumerate() {
        *e = i as i32;
    }
}

Thread-safe mutable non-owning pointer in Rust?

I'm trying to parallelize an algorithm I have. This is a sketch of how I would write it in C++:
void thread_func(std::vector<int>& results, int threadid) {
    results[threadid] = threadid;
}

std::vector<int> foo() {
    std::vector<int> results(4);
    for(int i = 0; i < 4; i++)
    {
        spawn_thread(thread_func, results, i);
    }
    join_threads();
    return results;
}
The point here is that each thread has a reference to a shared, mutable object that it does not own. It seems like this is difficult to do in Rust. Should I try to cobble it together in terms of (and I'm guessing here) Mutex, Cell and &mut, or is there a better pattern I should follow?
The proper way is to use Arc<Mutex<...>> or, for example, Arc<RWLock<...>>. Arc is a shared ownership-based concurrency-safe pointer to immutable data, and Mutex/RWLock introduce synchronized internal mutability. Your code then would look like this:
use std::sync::{Arc, Mutex};
use std::thread;

fn thread_func(results: Arc<Mutex<Vec<i32>>>, thread_id: i32) {
    let mut results = results.lock().unwrap();
    results[thread_id as usize] = thread_id;
}

fn foo() -> Arc<Mutex<Vec<i32>>> {
    let results = Arc::new(Mutex::new(vec![0; 4]));
    let guards: Vec<_> = (0..4).map(|i| {
        let results = results.clone();
        thread::spawn(move || thread_func(results, i))
    }).collect();
    for guard in guards {
        guard.join();
    }
    results
}
This unfortunately requires you to return Arc<Mutex<Vec<i32>>> from the function because there is no way to "unwrap" the value. An alternative is to clone the vector before returning.
However, using a crate like scoped_threadpool (whose approach could only be recently made sound; something like it will probably make it into the standard library instead of the now deprecated thread::scoped() function, which is unsafe) it can be done in a much nicer way:
extern crate scoped_threadpool;
use scoped_threadpool::Pool;

fn thread_func(result: &mut i32, thread_id: i32) {
    *result = thread_id;
}

fn foo() -> Vec<i32> {
    // `mut` is required because `iter_mut()` borrows the vector mutably.
    let mut results = vec![0; 4];
    let mut pool = Pool::new(4);
    pool.scoped(|scope| {
        for (i, e) in results.iter_mut().enumerate() {
            scope.execute(move || thread_func(e, i as i32));
        }
    });
    results
}
If your thread_func needs to access the whole vector, however, you can't get away without synchronization, so you would need a Mutex, and you would still get the unwrapping problem:
extern crate scoped_threadpool;
use std::sync::Mutex;
use scoped_threadpool::Pool;

fn thread_func(results: &Mutex<Vec<i32>>, thread_id: i32) {
    let mut results = results.lock().unwrap();
    results[thread_id as usize] = thread_id;
}

fn foo() -> Vec<i32> {
    let results = Mutex::new(vec![0; 4]);
    let mut pool = Pool::new(4);
    pool.scoped(|scope| {
        // Share one reference; a `move` closure would otherwise take the Mutex itself.
        let results = &results;
        for i in 0..4 {
            scope.execute(move || thread_func(results, i));
        }
    });
    results.lock().unwrap().clone()
}
But at least you don't need any Arcs here. Also, the execute() method is unsafe if you use a stable compiler because it does not have a corresponding fix to make it safe. It is safe on all compiler versions greater than 1.4.0, according to its build script.
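Both this answer and the crossbeam answer above rely on external crates for scoped threads; the added example below (not part of either original answer) shows the same two patterns with std::thread::scope, which has been in the standard library since Rust 1.63. Function names are illustrative, chunk is assumed to be non-zero, and Mutex::into_inner is used to move the vector out once the threads have joined.

```rust
use std::sync::Mutex;
use std::thread;

/// Disjoint case: each thread gets its own `&mut [i32]`, so no lock is needed.
fn fill_disjoint(table: &mut [i32], chunk: usize) {
    thread::scope(|s| {
        for slice in table.chunks_mut(chunk) {
            s.spawn(move || {
                for (i, e) in slice.iter_mut().enumerate() {
                    *e = i as i32;
                }
            });
        }
    }); // every spawned thread has joined here
}

/// Shared case: any thread may touch any index, so access goes through a
/// Mutex that is borrowed by the threads rather than reference-counted.
fn fill_shared(n: usize) -> Vec<i32> {
    let results = Mutex::new(vec![0; n]);
    thread::scope(|s| {
        for id in 0..n {
            let results = &results;
            s.spawn(move || {
                results.lock().unwrap()[id] = id as i32;
            });
        }
    });
    // No thread holds the borrow any more, so the Vec can be moved out.
    results.into_inner().unwrap()
}

fn main() {
    let mut table = [0; 100];
    fill_disjoint(&mut table, 10);
    println!("{:?}", &table[..]);
    println!("{:?}", fill_shared(4));
}
```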

More convenient way to work with strings in winapi calls

I'm looking for a more convenient way to work with std::String in winapi calls in Rust.
Using rust v 0.12.0-nightly with winapi 0.1.22 and user32-sys 0.1.1.
Now I'm using something like this:
use std::mem;
use winapi;
use user32;

pub fn get_window_title(handle: i32) -> String {
    let mut v: Vec<u16> = Vec::new();
    v.reserve(255);
    let mut p = v.as_mut_ptr();
    let len = v.len();
    let cap = v.capacity();
    let mut read_len = 0;
    unsafe {
        mem::forget(v);
        read_len = user32::GetWindowTextW(handle as winapi::HWND, p, 255);
        if read_len > 0 {
            return String::from_utf16_lossy(Vec::from_raw_parts(p, read_len as usize, cap).as_slice());
        } else {
            return "".to_string();
        }
    }
}
I think that this vector-based memory allocation is rather bizarre. So I'm looking for an easier way to cast LPCWSTR to std::String.
In your situation, you always want a maximum of 255 bytes, so you can use an array instead of a vector. This reduces the entire boilerplate to a mem::uninitialized() call, an as_mut_ptr() call and a slicing operation.
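unsafe {
    // Note: mem::uninitialized() is deprecated in current Rust in favour of MaybeUninit.
    let mut v: [u16; 255] = mem::uninitialized();
    let read_len = user32::GetWindowTextW(
        handle as winapi::HWND,
        v.as_mut_ptr(),
        255,
    );
    // GetWindowTextW returns a C int; slice indices must be usize.
    String::from_utf16_lossy(&v[0..read_len as usize])
}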
In case you wanted to use a Vec, there's an easier way than to destroy the vec and re-create it. You can write to the Vec's content directly and let Rust handle everything else.
let mut v: Vec<u16> = Vec::with_capacity(255);
unsafe {
    let read_len = user32::GetWindowTextW(
        handle as winapi::HWND,
        v.as_mut_ptr(),
        v.capacity() as i32, // the API takes a C int
    );
    v.set_len(read_len as usize); // this is undefined behavior if read_len > v.capacity()
    String::from_utf16_lossy(&v)
}
As a side-note, it is idiomatic in Rust to not use return on the last statement in a function, but to simply let the expression stand there without a semicolon. In your original code, the final if-expression could be written as
if read_len > 0 {
    String::from_utf16_lossy(Vec::from_raw_parts(p, read_len as usize, cap).as_slice())
} else {
    "".to_string()
}
but I removed the entire condition from my samples, as it is unnecessary to handle 0 read characters differently from n characters.
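An added example, not from the original answer, of guarding the set_len call flagged in the code comment above: the count returned by the foreign function is validated before the Vec's length is changed. fill_ok, fill_lies and fill_fails are hypothetical stand-ins shaped like GetWindowTextW so that the block runs on any platform.

```rust
use std::convert::TryFrom;

/// Hypothetical callee shape: writes at most `max` UTF-16 units to the
/// pointer and returns the count it claims to have written.
type FillFn = unsafe fn(*mut u16, i32) -> i32;

/// Lets `fill` write into a Vec's spare capacity and trusts the reported
/// length only after checking it against the buffer size.
fn read_wide(fill: FillFn, cap: usize) -> Result<String, &'static str> {
    let mut v: Vec<u16> = Vec::with_capacity(cap);
    // The callee takes a signed int; refuse sizes it cannot express.
    let max = i32::try_from(cap).map_err(|_| "capacity exceeds i32")?;
    let n = unsafe { fill(v.as_mut_ptr(), max) };
    // Negative or oversized counts must never reach set_len.
    let n = usize::try_from(n).map_err(|_| "negative length")?;
    if n > cap {
        return Err("length exceeds buffer");
    }
    // SAFETY: n <= cap <= v.capacity(), and `fill` reported n units written.
    unsafe { v.set_len(n) };
    Ok(String::from_utf16_lossy(&v))
}

/// Constructed well-behaved callee: writes "hi" and reports 2.
unsafe fn fill_ok(out: *mut u16, max: i32) -> i32 {
    let text = [0x68u16, 0x69];
    let n = text.len().min(usize::try_from(max).unwrap_or(0));
    unsafe { std::ptr::copy_nonoverlapping(text.as_ptr(), out, n) };
    n as i32
}

/// Constructed misbehaving callee: writes nothing but claims `max + 1` units.
unsafe fn fill_lies(_out: *mut u16, max: i32) -> i32 {
    max + 1
}

/// Constructed failing callee: reports an error as a negative count.
unsafe fn fill_fails(_out: *mut u16, _max: i32) -> i32 {
    -1
}

fn main() {
    println!("{:?}", read_wide(fill_ok, 255));
    println!("{:?}", read_wide(fill_lies, 8));
    println!("{:?}", read_wide(fill_fails, 8));
}
```

The check keeps the Vec from exposing uninitialised memory; it cannot undo an overrun that a faulty callee has already performed.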
