How to build a barrier with Rust inline asm?

In GCC, we can use asm volatile("":::"memory");
But I can't find an option similar to "memory" in the documentation of Rust's inline asm.
Is there any way to do that?

In Rust, memory clobbering is the default. You use options(nomem) to opt out of it.
For example:
pub unsafe fn no_nomem() {
    std::arch::asm!("");
}

pub unsafe fn nomem() {
    std::arch::asm!("", options(nomem));
}
LLVM IR:
define void @_ZN7example8no_nomem17h95b023e6c43118daE() unnamed_addr #0 !dbg !5 {
  call void asm sideeffect alignstack inteldialect "", "~{dirflag},~{fpsr},~{flags},~{memory}"(), !dbg !10, !srcloc !11
  br label %bb1, !dbg !10

bb1:                                              ; preds = %start
  ret void, !dbg !12
}

define void @_ZN7example5nomem17hc75cf2d808290004E() unnamed_addr #0 !dbg !13 {
  call void asm sideeffect alignstack inteldialect "", "~{dirflag},~{fpsr},~{flags}"() #1, !dbg !14, !srcloc !15
  br label %bb1, !dbg !14

bb1:                                              ; preds = %start
  ret void, !dbg !16
}
The function without nomem emits a ~{memory} barrier.
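If all you need is a compiler-level barrier, the standard library also exposes this directly without inline asm; a minimal sketch using std::sync::atomic::compiler_fence:

use std::sync::atomic::{compiler_fence, Ordering};

fn barrier() {
    // Prevents the compiler from reordering memory accesses across this
    // point, like an empty asm!("") without options(nomem). It emits no
    // CPU instruction; for a hardware barrier use std::sync::atomic::fence.
    compiler_fence(Ordering::SeqCst);
}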

Related

Is there a way to init a non-trivial static std::collections::HashMap without making it static mut?

In this code, A does not need to be static mut, but the compiler forces B to be static mut:
use std::collections::HashMap;
use std::iter::FromIterator;

static A: [u32; 21] = [
    0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20,
];
static mut B: Option<HashMap<u32, String>> = None;

fn init_tables() {
    let hm = HashMap::<u32, String>::from_iter(A.iter().map(|&i| (i, (i + 10u32).to_string())));
    unsafe {
        B = Some(hm);
    }
}

fn main() {
    init_tables();
    println!("{:?} len: {}", A, A.len());
    unsafe {
        println!("{:?}", B);
    }
}
This is the only way I have found to get close to what I actually want: a global, immutable HashMap to be used by several functions, without littering all my code with unsafe blocks.
I know that a global variable is a bad idea for multi-threaded applications, but mine is single threaded, so why should I pay the price for an eventuality which will never arise?
Since I use rustc directly and not cargo, I don't want the "help" of extern crates like lazy_static. I tried to decipher what the macro in that crate does, but to no avail.
I also tried to write this with thread_local() and a RefCell, but I had trouble using A to initialize B in that version.
In more general terms, the question could be "How to get stuff into the initvars section of a program in Rust?"
If you can show me how to initialize B directly (without a function like init_tables()), your answer is probably right.
If a function like init_tables() is inevitable, is there a trick like an accessor function to reduce the unsafe litter in my program?
How to get stuff into the initvars section of a program in Rust?
It turns out that rustc puts static data in the .rodata section and static mut data in the .data section of the generated binary:
#[no_mangle]
static DATA: std::ops::Range<u32> = 0..20;

fn main() { DATA.len(); }
$ rustc static.rs
$ objdump -t -j .rodata static
static: file format elf64-x86-64
SYMBOL TABLE:
0000000000025000 l d .rodata 0000000000000000 .rodata
0000000000025490 l O .rodata 0000000000000039 str.0
0000000000026a70 l O .rodata 0000000000000400 elf_crc32.crc32_table
0000000000026870 l O .rodata 0000000000000200 elf_zlib_default_dist_table
0000000000026590 l O .rodata 00000000000002e0 elf_zlib_default_table
0000000000025060 g O .rodata 0000000000000008 DATA
0000000000027f2c g O .rodata 0000000000000100 _ZN4core3str15UTF8_CHAR_WIDTH17h6f9f810be98aa5f2E
So changing from static mut to static at the source level significantly changes the generated binary. The .rodata section is read-only, and trying to write to it will segfault the program.
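As a quick illustration (a deliberately broken sketch; writing through a cast like this is undefined behavior):

static DATA: u32 = 42;

fn main() {
    unsafe {
        // DATA lives in read-only .rodata, so this write traps at
        // runtime (typically SIGSEGV).
        *(&DATA as *const u32 as *mut u32) = 7;
    }
}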
If init_tables() is of the judgement day category (inevitable)
It is probably inevitable. Since the default .rodata linkage won't work, one has to control it directly:
use std::collections::HashMap;
use std::iter::FromIterator;

static A: std::ops::Range<u32> = 0..20;

#[link_section = ".bss"]
static B: Option<HashMap<u32, String>> = None;

fn init_tables() {
    let data = HashMap::from_iter(A.clone().map(|i| (i, (i + 10).to_string())));
    unsafe {
        // Cast away the const: this only "works" because #[link_section]
        // placed B in writable .bss; it is still UB by the book.
        let b: *mut Option<HashMap<u32, String>> = &B as *const _ as *mut _;
        (&mut *b).replace(data);
    }
}

fn main() {
    init_tables();
    println!("{:?} len: {}", A, A.len());
    println!("{:#?} 5 => {:?}", B, B.as_ref().unwrap().get(&5));
}
I don't want the "help" of extern crates like lazy_static
Actually, lazy_static isn't that complicated. It makes some clever use of the Deref trait. Here is a much simplified standalone version, and it is more ergonomic than the first example:
use std::collections::HashMap;
use std::iter::FromIterator;
use std::ops::Deref;
use std::sync::Once;

static A: std::ops::Range<u32> = 0..20;
static B: BImpl = BImpl;

struct BImpl;

impl Deref for BImpl {
    type Target = HashMap<u32, String>;

    #[inline(always)]
    fn deref(&self) -> &Self::Target {
        static LAZY: (Option<HashMap<u32, String>>, Once) = (None, Once::new());
        LAZY.1.call_once(|| unsafe {
            let x: *mut Option<Self::Target> = &LAZY.0 as *const _ as *mut _;
            (&mut *x).replace(init_tables());
        });
        LAZY.0.as_ref().unwrap()
    }
}

fn init_tables() -> HashMap<u32, String> {
    HashMap::from_iter(A.clone().map(|i| (i, (i + 10).to_string())))
}

fn main() {
    println!("{:?} len: {}", A, A.len());
    println!("{:#?} 5 => {:?}", *B, B.get(&5));
}
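For what it's worth, on newer toolchains (Rust 1.70+) std::sync::OnceLock gives the same lazy, immutable global with no unsafe code and no external crates; a minimal sketch:

use std::collections::HashMap;
use std::sync::OnceLock;

static A: std::ops::Range<u32> = 0..20;
// Initialized exactly once, on first access.
static B: OnceLock<HashMap<u32, String>> = OnceLock::new();

fn b() -> &'static HashMap<u32, String> {
    B.get_or_init(|| A.clone().map(|i| (i, (i + 10).to_string())).collect())
}

fn main() {
    println!("{:?} len: {}", A, A.len());
    println!("5 => {:?}", b().get(&5));
}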

How to use read-only borrowed Rust data by multiple Java threads?

I have a struct Foo and a FooRef that holds references to data from Foo:

struct Foo { /* ... */ }
struct FooRef<'foo> { /* ... */ }

impl Foo {
    pub fn create_ref<'a>(&'a self) -> FooRef<'a> { /* ... */ }
}
Foo itself cannot be used directly in the logic; I need FooRef. Creating a FooRef requires a lot of computation, so I do it once, just after creating the Foo instance. FooRef is immutable; it is only used for reading data.
Multiple threads need to access this FooRef instance. How can I implement this? The calling threads are Java threads, and this will be used with JNI. That prevents using a scoped threadpool, for example.
Another complication is that I sometimes have to refresh the Foo instance to load new data into it. Then I need to recreate the FooRef instance as well.
How can this be achieved thread-safely and memory-safely? I tried messing around with pointers and RwLock, but that resulted in a memory leak (the memory usage kept growing on each reload). I am a Java developer who is a newbie with pointers.
The data in Foo is mostly text, about 250 MB of it. FooRef is mostly strs and structs of strs borrowed from Foo.
My Java usage explanation
I use two long variables in a Java class to store pointers to Foo and FooRef. I use a static ReentrantReadWriteLock to guard these pointers.
If the data in Foo needs to be updated, I acquire the write lock, drop FooRef, update Foo, create a new FooRef, and update the ref pointer on the Java side.
If I need to read the data (i.e. when I am not updating Foo), I acquire a read lock and use the FooRef.
The memory leak is visible only when multiple Java threads are calling this code.
Rust:
use jni::objects::{JClass, JString};
use jni::sys::{jlong, jstring};
use jni::JNIEnv;
use std::collections::HashMap;

macro_rules! foo_mut_ptr {
    ($env: expr, $class: expr) => {
        $env.get_field(*$class, "ptr", "J")
            .ok()
            .and_then(|j| j.j().ok())
            .and_then(|ptr| {
                if ptr == 0 {
                    None
                } else {
                    Some(ptr as *mut Foo)
                }
            })
    };
}

macro_rules! foo_ref_mut_ptr {
    ($env: expr, $class: expr) => {
        $env.get_field(*$class, "ptrRef", "J")
            .ok()
            .and_then(|j| j.j().ok())
            .and_then(|ptr| {
                if ptr == 0 {
                    None
                } else {
                    Some(ptr as *mut FooRef)
                }
            })
    };
}

macro_rules! foo_mut {
    ($env: expr, $class: expr) => {
        foo_mut_ptr!($env, $class).map(|ptr| &mut *ptr)
    };
}

macro_rules! foo_ref {
    ($env: expr, $class: expr) => {
        foo_ref_mut_ptr!($env, $class).map(|ptr| &*ptr)
    };
}

#[allow(non_snake_case)]
#[no_mangle]
pub unsafe extern "system" fn Java_test_App_create(_env: JNIEnv, _class: JClass) -> jlong {
    Box::into_raw(Box::new(Foo::default())) as jlong
}

#[allow(non_snake_case)]
#[no_mangle]
pub unsafe extern "system" fn Java_test_App_createRef(env: JNIEnv, class: JClass) -> jlong {
    let foo = foo_mut!(env, class).expect("createRef was called on uninitialized Data");
    let foo_ref = foo.create_ref();
    Box::into_raw(Box::new(foo_ref)) as jlong
}

#[allow(non_snake_case)]
#[no_mangle]
pub unsafe extern "system" fn Java_test_App_reload(env: JNIEnv, class: JClass) {
    let foo = foo_mut!(env, class).expect("foo must be initialized");
    *foo = Foo {
        data: vec!["hello".to_owned(); 1024 * 1024],
    };
}

#[allow(non_snake_case)]
#[no_mangle]
pub unsafe extern "system" fn Java_test_App_destroy(env: JNIEnv, class: JClass) {
    drop_ptr(foo_ref_mut_ptr!(env, class));
    drop_ptr(foo_mut_ptr!(env, class));
}

#[allow(non_snake_case)]
#[no_mangle]
pub unsafe extern "system" fn Java_test_App_destroyRef(env: JNIEnv, class: JClass) {
    drop_ptr(foo_ref_mut_ptr!(env, class));
}

unsafe fn drop_ptr<T>(ptr: Option<*mut T>) {
    if let Some(ptr) = ptr {
        let _foo = Box::from_raw(ptr);
        // foo drops here
    }
}

#[derive(Default)]
struct Foo {
    data: Vec<String>,
}

#[derive(Default)]
struct FooRef<'a> {
    data: HashMap<&'a str, Vec<&'a str>>,
}

impl Foo {
    fn create_ref(&self) -> FooRef {
        let mut data = HashMap::new();
        for s in &self.data {
            let s = &s[..];
            data.insert(s, vec![s]);
        }
        FooRef { data }
    }
}
Java:
package test;

import java.util.concurrent.locks.ReentrantReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock.ReadLock;
import java.util.concurrent.locks.ReentrantReadWriteLock.WriteLock;

public class App implements AutoCloseable {
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    private final ReadLock readLock = lock.readLock();
    private final WriteLock writeLock = lock.writeLock();
    private volatile long ptr;
    private volatile long ptrRef;
    private volatile boolean reload;

    static {
        System.loadLibrary("foo");
    }

    public static void main(String[] args) throws InterruptedException {
        try (App app = new App()) {
            for (int i = 0; i < 20; i++) {
                new Thread(() -> {
                    while (true) {
                        app.tryReload();
                    }
                }).start();
            }
            while (true) {
                app.setReload();
            }
        }
    }

    public App() {
        this.ptr = this.create();
    }

    public void setReload() {
        writeLock.lock();
        try {
            reload = true;
        } finally {
            writeLock.unlock();
        }
    }

    public void tryReload() {
        readLock.lock();
        debug("Got read lock");
        if (reload) {
            debug("Cache is expired");
            readLock.unlock();
            debug("Released read lock coz expired");
            writeLock.lock();
            debug("Got write lock");
            try {
                if (reload) {
                    fullReload();
                }
                readLock.lock();
                debug("Got read lock inside write");
            } finally {
                writeLock.unlock();
                debug("Released write lock");
            }
        }
        readLock.unlock();
        debug("Released read lock");
    }

    private void fullReload() {
        destroyRef();
        debug("Dropped ref");
        debug("Reloading");
        reload();
        debug("Reloading completed");
        updateRef();
        debug("Created ref");
        reload = false;
    }

    private void updateRef() {
        this.ptrRef = this.createRef();
    }

    private native void reload();

    private native long create();

    private native long createRef();

    private native void destroy();

    private native void destroyRef();

    @Override
    public void close() {
        writeLock.lock();
        try {
            this.destroy();
            this.ptrRef = 0;
            this.ptr = 0;
        } finally {
            writeLock.unlock();
        }
    }

    private static void debug(String s) {
        System.out.printf("%10s : %s%n", Thread.currentThread().getName(), s);
    }
}
The problem I thought was a memory leak wasn't actually a memory leak. The issue was that the allocator was using thread-local arenas: whichever thread reloaded the 250 MB of data kept the allocated space in its arena instead of returning it to the system. This issue is not specific to JNI; it also happens in pure safe Rust code. See Why multiple threads using too much memory when holding Mutex.
The number of arenas defaults to 8 * cpu count (64 in my case). This can be overridden with the MALLOC_ARENA_MAX environment variable.
I resolved the issue by setting MALLOC_ARENA_MAX to 1, so the approach I took is fine; it was just a platform-specific issue.
The issue occurred only on Ubuntu under WSL. I also tried the same code, without any tweaking, on Windows 10, and it worked without any issues.
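For example, on glibc-based Linux you can cap the arena count when launching the JVM (the library path and class name below match the snippets above and are for illustration):

$ MALLOC_ARENA_MAX=1 java -Djava.library.path=. test.App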

How to execute raw instructions from a memory buffer in Rust?

I'm attempting to make a buffer of memory executable, then execute it in Rust. I've gotten all the way until I need to cast the raw executable bytes as code/instructions. You can see a working example in C below.
Extra details:
Rust 1.34
Linux
CC 8.2.1
#include <string.h>
#include <sys/mman.h>

unsigned char code[] = {
    0x55,                         // push %rbp
    0x48, 0x89, 0xe5,             // mov  %rsp,%rbp
    0xb8, 0x37, 0x00, 0x00, 0x00, // mov  $0x37,%eax
    0xc9,                         // leaveq
    0xc3,                         // retq
};

void reflect(const unsigned char *code, size_t len) {
    /* copy code to executable buffer; len is passed explicitly because
       sizeof on the pointer parameter would only give the pointer size */
    void *buf = mmap(0, len, PROT_READ | PROT_WRITE | PROT_EXEC,
                     MAP_PRIVATE | MAP_ANON, -1, 0);
    memcpy(buf, code, len);
    ((void (*)(void))buf)();
}
extern crate libc;
extern crate mmap;

use mmap::{MapOption, MemoryMap};

unsafe fn reflect(instructions: &[u8]) {
    let map = MemoryMap::new(
        instructions.len(),
        &[
            MapOption::MapAddr(0 as *mut u8),
            MapOption::MapOffset(0),
            MapOption::MapFd(-1),
            MapOption::MapReadable,
            MapOption::MapWritable,
            MapOption::MapExecutable,
            MapOption::MapNonStandardFlags(libc::MAP_ANON),
            MapOption::MapNonStandardFlags(libc::MAP_PRIVATE),
        ],
    )
    .unwrap();
    std::ptr::copy(instructions.as_ptr(), map.data(), instructions.len());
    // How to cast into extern "C" fn() ?
}
Use mem::transmute to cast a raw pointer to a function pointer type.
use std::mem;
let func: unsafe extern "C" fn() = mem::transmute(map.data());
func();
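For reference, here is a fuller sketch of the same idea using the libc crate directly instead of the mmap crate (it assumes the 11-byte x86-64 stub from the question, which returns 0x37 in eax; error handling is kept minimal):

unsafe fn reflect(instructions: &[u8]) -> u32 {
    // Map an anonymous read/write/execute page.
    let buf = libc::mmap(
        std::ptr::null_mut(),
        instructions.len(),
        libc::PROT_READ | libc::PROT_WRITE | libc::PROT_EXEC,
        libc::MAP_PRIVATE | libc::MAP_ANON,
        -1,
        0,
    );
    assert_ne!(buf, libc::MAP_FAILED, "mmap failed");
    std::ptr::copy_nonoverlapping(instructions.as_ptr(), buf as *mut u8, instructions.len());
    // Cast the buffer to a function pointer and call it.
    let func: unsafe extern "C" fn() -> u32 = std::mem::transmute(buf);
    let ret = func();
    libc::munmap(buf, instructions.len());
    ret
}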

Does Rust optimize passing temporary structures by value?

Let's say I have a vector of structures in Rust. Structures are quite big. When I want to insert a new one, I write the code like this:
my_vec.push(MyStruct {field1: value1, field2: value2, ... });
The push definition is
fn push(&mut self, value: T)
which means the value is passed by value. I wonder whether Rust creates a temporary object and then copies it into push, or whether it optimizes the code so that no temporary object is created and copied.
Let's see. This program:
struct LotsOfBytes {
    bytes: [u8; 1024],
}

#[inline(never)]
fn consume(mut lob: LotsOfBytes) {}

fn main() {
    let lob = LotsOfBytes { bytes: [0; 1024] };
    consume(lob);
}
Compiles to the following LLVM IR code:
%LotsOfBytes = type { [1024 x i8] }

; Function Attrs: noinline nounwind uwtable
define internal fastcc void @_ZN7consume20hf098deecafa4b74bkaaE(%LotsOfBytes* noalias nocapture dereferenceable(1024)) unnamed_addr #0 {
entry-block:
  %1 = getelementptr inbounds %LotsOfBytes* %0, i64 0, i32 0, i64 0
  tail call void @llvm.lifetime.end(i64 1024, i8* %1)
  ret void
}

; Function Attrs: nounwind uwtable
define internal void @_ZN4main20hf3cbebd3154c5390qaaE() unnamed_addr #2 {
entry-block:
  %lob = alloca %LotsOfBytes, align 8
  %lob1 = getelementptr inbounds %LotsOfBytes* %lob, i64 0, i32 0, i64 0
  %arg = alloca %LotsOfBytes, align 8
  %0 = getelementptr inbounds %LotsOfBytes* %lob, i64 0, i32 0, i64 0
  call void @llvm.lifetime.start(i64 1024, i8* %0)
  call void @llvm.memset.p0i8.i64(i8* %lob1, i8 0, i64 1024, i32 8, i1 false)
  %1 = getelementptr inbounds %LotsOfBytes* %arg, i64 0, i32 0, i64 0
  call void @llvm.lifetime.start(i64 1024, i8* %1)
  call void @llvm.memcpy.p0i8.p0i8.i64(i8* %1, i8* %0, i64 1024, i32 8, i1 false)
  call fastcc void @_ZN7consume20hf098deecafa4b74bkaaE(%LotsOfBytes* noalias nocapture dereferenceable(1024) %arg)
  call void @llvm.lifetime.end(i64 1024, i8* %1)
  call void @llvm.lifetime.end(i64 1024, i8* %0)
  ret void
}
This line is interesting in particular:
call fastcc void @_ZN7consume20hf098deecafa4b74bkaaE(%LotsOfBytes* noalias nocapture dereferenceable(1024) %arg)
If I understand correctly, this means that consume is called with a pointer to LotsOfBytes, so yes, rustc optimizes passing big structures by value.
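To inspect this yourself, you can ask rustc to dump the IR directly (these are standard rustc flags; the output lands next to the source as a .ll file):

$ rustc -O --emit=llvm-ir main.rs
$ cat main.ll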

Is Rust able to optimize local heap allocations?

When writing relatively realtime code, generally heap allocations in the main execution loop are avoided. So in my experience you allocate all the memory your program needs in an initialization step, and then pass the memory around as needed. A toy example in C might look something like the following:
#include <stdlib.h>

#define LEN 100

void not_realtime() {
    int *v = malloc(LEN * sizeof *v);
    for (int i = 0; i < LEN; i++) {
        v[i] = 1;
    }
    free(v);
}

void realtime(int *v, int len) {
    for (int i = 0; i < len; i++) {
        v[i] = 1;
    }
}

int main(int argc, char **argv) {
    not_realtime();
    int *v = malloc(LEN * sizeof *v);
    realtime(v, LEN);
    free(v);
}
And I believe roughly the equivalent in Rust:
fn possibly_realtime() {
    let mut v = vec![0; 100];
    for i in 0..v.len() {
        v[i] = 1;
    }
}

fn realtime(v: &mut Vec<i32>) {
    for i in 0..v.len() {
        v[i] = 1;
    }
}

fn main() {
    possibly_realtime();
    let mut v: Vec<i32> = vec![0; 100];
    realtime(&mut v);
}
What I'm wondering is: is Rust able to optimize possibly_realtime such that the local heap allocation of v only occurs once and is reused on subsequent calls to possibly_realtime? I'm guessing not but maybe there's some magic that makes it possible.
To investigate this, it is useful to add #[inline(never)] to your function, then view the LLVM IR on the playground.
Rust 1.54
This is not optimized. Here's an excerpt:
; playground::possibly_realtime
; Function Attrs: noinline nonlazybind uwtable
define internal fastcc void @_ZN10playground17possibly_realtime17h2ab726cd567363f3E() unnamed_addr #0 personality i32 (i32, i32, i64, %"unwind::libunwind::_Unwind_Exception"*, %"unwind::libunwind::_Unwind_Context"*)* @rust_eh_personality {
start:
  %0 = tail call i8* @__rust_alloc_zeroed(i64 400, i64 4) #9, !noalias !8
  %1 = icmp eq i8* %0, null
  br i1 %1, label %bb20.i.i.i.i, label %vector.body
Every time that possibly_realtime is called, memory is allocated via __rust_alloc_zeroed.
Slightly before Rust 1.0
This is not optimized. Here's an excerpt:
; Function Attrs: noinline uwtable
define internal fastcc void @_ZN17possibly_realtime20h1a3a159dd4b50685eaaE() unnamed_addr #0 {
entry-block:
  %0 = tail call i8* @je_mallocx(i64 400, i32 0), !noalias !0
  %1 = icmp eq i8* %0, null
  br i1 %1, label %then-block-255-.i.i, label %normal-return2.i
Every time that possibly_realtime is called, memory is allocated via je_mallocx.
Editorial
Reusing a buffer is a great way to leak secure information, and I'd encourage you to avoid it as much as possible. I'm sure you are already familiar with these problems, but I want to make sure that future searchers make a note.
I also doubt that this "optimization" will be added to Rust, especially not without explicit opt-in by the programmer. There needs to be somewhere to store the pointer to the allocated memory, but there really isn't anywhere. That means it would need to be a global or thread-local variable! Rust can run in environments without threads, but a global variable would still preclude recursive calls to this method. All in all, I think that passing the buffer into the method is much more explicit about what will happen. A hand-written sketch of what such an opt-in would look like follows.
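A hypothetical sketch of the thread-local variant (names are illustrative; note that it would panic on recursive calls, which is exactly the problem described above):

use std::cell::RefCell;

thread_local! {
    // One buffer per thread, reused across calls.
    static BUF: RefCell<Vec<i32>> = RefCell::new(Vec::new());
}

fn possibly_realtime() {
    BUF.with(|buf| {
        let mut v = buf.borrow_mut();
        v.clear();
        v.resize(100, 1); // no reallocation after the first call
    });
}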
I also assume that your example uses a Vec with a fixed size for demo purposes, but if you truly know the size at compile time, a fixed-size array could be a better choice; see the sketch below.
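A sketch of that fixed-size-array alternative, which avoids heap allocation entirely:

fn realtime(v: &mut [i32; 100]) {
    for x in v.iter_mut() {
        *x = 1;
    }
}

fn main() {
    let mut v = [0i32; 100]; // stack-allocated, no heap involved
    realtime(&mut v);
}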
As of 2021, Rust is capable of optimizing out heap allocation and inlining vtable method calls (playground):
fn old_adder(a: f64) -> Box<dyn Fn(f64) -> f64> {
    Box::new(move |x| a + x)
}

#[inline(never)]
fn test() {
    let adder = old_adder(1.);
    assert_eq!(adder(1.), 2.);
}

fn main() {
    test();
}
