The code below uses ~150 MB with a single thread, but several GB with 100 threads:
use std::sync::{Arc, Mutex};
use std::thread;

fn main() {
    let f = Arc::new(Mutex::new(Foo::new("hello")));
    let mut threads = vec![];
    for i in 0..100 {
        let f = f.clone();
        let t = thread::spawn(move || loop {
            let mut locked = f.lock().unwrap();
            *locked = Foo::new("hello");
            drop(locked);
            println!("{} reloaded", i);
            thread::yield_now();
        });
        threads.push(t);
    }
    threads.into_iter().for_each(|h| h.join().unwrap());
}
pub struct Foo {
    _data: Vec<String>,
}

impl Foo {
    fn new(s: &str) -> Foo {
        Foo {
            _data: vec![s.to_owned(); 1024 * 1024],
        }
    }
}
While holding the MutexGuard, a thread has exclusive access, so the new Foo should be allocated and the old value dropped at that point. It therefore doesn't make sense to me that this much memory is used when running on multiple threads.
Can anyone explain why this code uses this much memory?
Similar code in Java keeps memory at ~200 MB even with 1000 threads.
import java.util.ArrayList;
import java.util.List;

public class Foo {
    private List<String> data;

    public static void main(String[] args) {
        Foo f = new Foo();
        for (int i = 0; i < 1000; i++) {
            int n = i;
            new Thread(() -> {
                while (true) {
                    f.update();
                    System.gc();
                    System.out.println(n + " updated");
                }
            }).start();
        }
    }

    public synchronized void update() {
        data = new ArrayList<>(1024 * 1024);
        for (int i = 0; i < 1024 * 1024; i++) {
            data.add(new String("hello"));
        }
    }
}
So the problem was the large number of glibc malloc arenas; every arena has a cache of preallocated memory. A simple way to check this is to run the binary with MALLOC_ARENA_MAX=2, but the final solution depends on your usage pattern; there are a lot of variables for tuning glibc's allocator: http://man7.org/linux/man-pages/man3/mallopt.3.html .
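As a minimal sketch (Linux/glibc only, using the mallopt(3) interface linked above), the arena limit can also be set programmatically, as long as it runs before any worker threads start allocating; M_ARENA_MAX is -8 in glibc's <malloc.h>:

const M_ARENA_MAX: i32 = -8; // from glibc's <malloc.h>

extern "C" {
    fn mallopt(param: i32, value: i32) -> i32; // glibc-specific, returns 1 on success
}

fn main() {
    // Limit glibc to two arenas before spawning any threads.
    let ok = unsafe { mallopt(M_ARENA_MAX, 2) };
    assert_eq!(ok, 1);
    // ... spawn the 100 worker threads as in the question ...
}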
The Java virtual machine is also affected by malloc's arenas. In my experience it is sometimes worthwhile to configure the number of arenas to prevent huge memory usage of the JVM inside Docker.
Related
How to get a Mutex-protected value in a coroutine from non-coroutine threads in Rust, without using mpsc.
struct S1 {
    value: Arc<tokio::sync::Mutex<f64>>,
}

impl S1 {
    fn start(&self) {
        let v = self.value.clone();
        RT.spawn(async move {
            loop {
                let mut lk = v.lock().await;
                *lk += 1.0;
                drop(lk);
                // sleep 1s
            }
        });
        // sleep 5s
        let lk = self.value.lock(); // `start` is not an async function, so you can't call `self.value.lock().await` here
    }
}
You should use tokio::sync::Mutex::blocking_lock() for locking a tokio mutex outside of an async context.
fn start(&self) {
    let v = self.value.clone();
    RT.spawn(async move {
        loop {
            let mut lk = v.lock().await;
            *lk += 1.0;
            drop(lk);
            // sleep 1s
        }
    });
    // sleep 5s
    let lk = self.value.blocking_lock();
    // ...
}
However, note that you should prefer std::sync::Mutex as long as you don't need to hold the lock across .await points.
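As a minimal sketch of that advice (assuming a tokio runtime), the std guard is locked, mutated, and dropped within a single poll, never held across an .await:

use std::sync::{Arc, Mutex};
use std::time::Duration;

#[tokio::main]
async fn main() {
    let value = Arc::new(Mutex::new(0.0_f64));
    let v = value.clone();
    tokio::spawn(async move {
        loop {
            {
                let mut lk = v.lock().unwrap();
                *lk += 1.0;
            } // std guard dropped here, before the .await below
            tokio::time::sleep(Duration::from_secs(1)).await;
        }
    });
    tokio::time::sleep(Duration::from_secs(5)).await;
    println!("{}", value.lock().unwrap());
}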
I have a struct Foo and a struct FooRef that holds references to data from Foo:
struct Foo { /* ... */ }
struct FooRef<'foo> { /* ... */ }

impl Foo {
    pub fn create_ref<'a>(&'a self) -> FooRef<'a> { /* ... */ }
}
Foo cannot be used directly in the logic; I need FooRef. Creating FooRef requires lots of computation, so I do it once, just after creating the Foo instance. FooRef is immutable; it's only used for reading data.
Multiple threads need to access this FooRef instance. How can I implement this? The calling threads are Java threads and this will be used with JNI. That prevents using a scoped threadpool, for example.
Another complication is that I sometimes have to refresh the Foo instance to load new data into it; I then also need to recreate the FooRef instance.
How can this be achieved thread-safely and memory-safely? I tried messing around with pointers and RwLock, but that resulted in a memory leak (the memory usage kept growing on each reload). I am a Java developer who is a newbie to pointers.
The data in Foo is mostly text, about 250 MB. FooRef is mostly strs and structs of strs borrowed from Foo.
My Java usage explanation
I use two long variables in a Java class to store pointers to Foo and FooRef. I use a static ReentrantReadWriteLock to guard these pointers.
If the data need to be updated in Foo, I acquire a write lock, drop FooRef, update Foo, create a new FooRef and update the ref pointer in Java.
If I need to read the data (i.e. when I am not updating Foo), I acquire a read lock and use the FooRef.
The memory leak is visible only when multiple Java threads are calling this code.
Rust:
use jni::objects::{JClass, JString};
use jni::sys::{jlong, jstring};
use jni::JNIEnv;
use std::collections::HashMap;

macro_rules! foo_mut_ptr {
    ($env: expr, $class: expr) => {
        $env.get_field(*$class, "ptr", "J")
            .ok()
            .and_then(|j| j.j().ok())
            .and_then(|ptr| {
                if ptr == 0 {
                    None
                } else {
                    Some(ptr as *mut Foo)
                }
            })
    };
}

macro_rules! foo_ref_mut_ptr {
    ($env: expr, $class: expr) => {
        $env.get_field(*$class, "ptrRef", "J")
            .ok()
            .and_then(|j| j.j().ok())
            .and_then(|ptr| {
                if ptr == 0 {
                    None
                } else {
                    Some(ptr as *mut FooRef)
                }
            })
    };
}

macro_rules! foo_mut {
    ($env: expr, $class: expr) => {
        foo_mut_ptr!($env, $class).map(|ptr| &mut *ptr)
    };
}

macro_rules! foo_ref {
    ($env: expr, $class: expr) => {
        foo_ref_mut_ptr!($env, $class).map(|ptr| &*ptr)
    };
}

#[allow(non_snake_case)]
#[no_mangle]
pub unsafe extern "system" fn Java_test_App_create(_env: JNIEnv, _class: JClass) -> jlong {
    Box::into_raw(Box::new(Foo::default())) as jlong
}

#[allow(non_snake_case)]
#[no_mangle]
pub unsafe extern "system" fn Java_test_App_createRef(env: JNIEnv, class: JClass) -> jlong {
    let foo = foo_mut!(env, class).expect("createRef was called on uninitialized Data");
    let foo_ref = foo.create_ref();
    Box::into_raw(Box::new(foo_ref)) as jlong
}

#[allow(non_snake_case)]
#[no_mangle]
pub unsafe extern "system" fn Java_test_App_reload(env: JNIEnv, class: JClass) {
    let foo = foo_mut!(env, class).expect("foo must be initialized");
    *foo = Foo {
        data: vec!["hello".to_owned(); 1024 * 1024],
    };
}

#[allow(non_snake_case)]
#[no_mangle]
pub unsafe extern "system" fn Java_test_App_destroy(env: JNIEnv, class: JClass) {
    drop_ptr(foo_ref_mut_ptr!(env, class));
    drop_ptr(foo_mut_ptr!(env, class));
}

#[allow(non_snake_case)]
#[no_mangle]
pub unsafe extern "system" fn Java_test_App_destroyRef(env: JNIEnv, class: JClass) {
    drop_ptr(foo_ref_mut_ptr!(env, class));
}

unsafe fn drop_ptr<T>(ptr: Option<*mut T>) {
    if let Some(ptr) = ptr {
        let _foo = Box::from_raw(ptr);
        // foo drops here
    }
}

#[derive(Default)]
struct Foo {
    data: Vec<String>,
}

#[derive(Default)]
struct FooRef<'a> {
    data: HashMap<&'a str, Vec<&'a str>>,
}

impl Foo {
    fn create_ref(&self) -> FooRef {
        let mut data = HashMap::new();
        for s in &self.data {
            let s = &s[..];
            data.insert(s, vec![s]);
        }
        FooRef { data }
    }
}
Java:
package test;

import java.util.concurrent.locks.ReentrantReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock.ReadLock;
import java.util.concurrent.locks.ReentrantReadWriteLock.WriteLock;

public class App implements AutoCloseable {
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    private final ReadLock readLock = lock.readLock();
    private final WriteLock writeLock = lock.writeLock();
    private volatile long ptr;
    private volatile long ptrRef;
    private volatile boolean reload;

    static {
        System.loadLibrary("foo");
    }

    public static void main(String[] args) throws InterruptedException {
        try (App app = new App()) {
            for (int i = 0; i < 20; i++) {
                new Thread(() -> {
                    while (true) {
                        app.tryReload();
                    }
                }).start();
            }
            while (true) {
                app.setReload();
            }
        }
    }

    public App() {
        this.ptr = this.create();
    }

    public void setReload() {
        writeLock.lock();
        try {
            reload = true;
        } finally {
            writeLock.unlock();
        }
    }

    public void tryReload() {
        readLock.lock();
        debug("Got read lock");
        if (reload) {
            debug("Cache is expired");
            readLock.unlock();
            debug("Released read lock coz expired");
            writeLock.lock();
            debug("Got write lock");
            try {
                if (reload) {
                    fullReload();
                }
                readLock.lock();
                debug("Got read lock inside write");
            } finally {
                writeLock.unlock();
                debug("Released write lock");
            }
        }
        readLock.unlock();
        debug("Released read lock");
    }

    private void fullReload() {
        destroyRef();
        debug("Dropped ref");
        debug("Reloading");
        reload();
        debug("Reloading completed");
        updateRef();
        debug("Created ref");
        reload = false;
    }

    private void updateRef() {
        this.ptrRef = this.createRef();
    }

    private native void reload();
    private native long create();
    private native long createRef();
    private native void destroy();
    private native void destroyRef();

    @Override
    public void close() {
        writeLock.lock();
        try {
            this.destroy();
            this.ptrRef = 0;
            this.ptr = 0;
        } finally {
            writeLock.unlock();
        }
    }

    private static void debug(String s) {
        System.out.printf("%10s : %s%n", Thread.currentThread().getName(), s);
    }
}
What I thought was a memory leak wasn't actually a memory leak. The issue was that the allocator was using thread-local arenas, so whichever thread reloaded the 250 MB of data left the allocated space in its arena instead of returning it to the system. This issue was not specific to JNI; it also happens in pure safe Rust code. See Why multiple threads using too much memory when holding Mutex.
The number of arenas defaults to 8 * CPU count (64 in my case). This setting can be overridden with the MALLOC_ARENA_MAX env variable.
I resolved this issue by setting the MALLOC_ARENA_MAX env variable to 1. So the approach I took is fine; it was just a platform-specific issue.
This issue occurred only on Ubuntu under WSL. I also ran the same code, without any tweaking, on Windows 10, and it worked without any issues.
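Since glibc reads MALLOC_ARENA_MAX at startup, before main runs, it has to be in the environment when the process launches. As a minimal sketch (with ./app as a placeholder binary name), a wrapper can set it for a child process:

use std::process::Command;

fn main() -> std::io::Result<()> {
    // glibc reads MALLOC_ARENA_MAX during startup, so it must already be
    // set in the child's environment before its first allocation.
    let status = Command::new("./app") // placeholder binary name
        .env("MALLOC_ARENA_MAX", "1")
        .status()?;
    println!("exited with {}", status);
    Ok(())
}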
While I was learning Rust, a friend asked me to see what kind of performance I could get out of Rust for generating the first 1 million prime numbers, both single-threaded and multi-threaded. After trying several implementations, I'm just stumped. Here is the kind of performance I'm seeing:
rust_primes --threads 8 --verbose --count 1000000
Options { verbose: true, count: 1000000, threads: 8 }
Non-concurrent using while (15485863): 2.814 seconds.
Concurrent using mutexes (15485863): 876.561 seconds.
Concurrent using channels (15485863): 798.217 seconds.
Without overloading the question with too much code, here are the methods responsible for each of the benchmarks:
fn non_concurrent(options: &Options) {
    let mut count = 0;
    let mut current = 0;
    let ts = Instant::now();
    while count < options.count {
        if is_prime(current) {
            count += 1;
        }
        current += 1;
    }
    let d = ts.elapsed();
    println!("Non-concurrent using while ({}): {}.{} seconds.", current - 1, d.as_secs(), d.subsec_nanos() / 1_000_000);
}
fn concurrent_mutex(options: &Options) {
    let count = Arc::new(Mutex::new(0));
    let highest = Arc::new(Mutex::new(0));
    let mut cc = 0;
    let mut current = 0;
    let ts = Instant::now();
    while cc < options.count {
        let mut handles = vec![];
        for x in current..(current + options.threads) {
            let count = Arc::clone(&count);
            let highest = Arc::clone(&highest);
            let handle = thread::spawn(move || {
                if is_prime(x) {
                    let mut c = count.lock().unwrap();
                    let mut h = highest.lock().unwrap();
                    *c += 1;
                    if x > *h {
                        *h = x;
                    }
                }
            });
            handles.push(handle);
        }
        for handle in handles {
            handle.join().unwrap();
        }
        cc = *count.lock().unwrap();
        current += options.threads;
    }
    let d = ts.elapsed();
    println!("Concurrent using mutexes ({}): {}.{} seconds.", *highest.lock().unwrap(), d.as_secs(), d.subsec_nanos() / 1_000_000);
}
fn concurrent_channel(options: &Options) {
    let mut count = 0;
    let mut current = 0;
    let mut highest = 0;
    let ts = Instant::now();
    while count < options.count {
        let (tx, rx) = mpsc::channel();
        for x in current..(current + options.threads) {
            let txc = mpsc::Sender::clone(&tx);
            thread::spawn(move || {
                if is_prime(x) {
                    txc.send(x).unwrap();
                }
            });
        }
        drop(tx);
        for message in rx {
            count += 1;
            if message > highest && count <= options.count {
                highest = message;
            }
        }
        current += options.threads;
    }
    let d = ts.elapsed();
    println!("Concurrent using channels ({}): {}.{} seconds.", highest, d.as_secs(), d.subsec_nanos() / 1_000_000);
}
Am I doing something wrong, or is this normal performance with the 1:1 threading that comes in the standard library?
Here is an MCVE that shows the same problem. Here I didn't limit the number of threads started at once like I did in the code above. The point is that threading seems to have a very significant overhead, unless I'm doing something horribly wrong.
use std::thread;
use std::time::Instant;
use std::sync::{Mutex, Arc};
use std::time::Duration;

fn main() {
    let iterations = 100_000;
    non_threaded(iterations);
    threaded(iterations);
}

fn threaded(iterations: u32) {
    let tx = Instant::now();
    let counter = Arc::new(Mutex::new(0));
    let mut handles = vec![];
    for _ in 0..iterations {
        let counter = Arc::clone(&counter);
        let handle = thread::spawn(move || {
            let mut num = counter.lock().unwrap();
            *num = test(*num);
        });
        handles.push(handle);
    }
    for handle in handles {
        handle.join().unwrap();
    }
    let d = tx.elapsed();
    println!("Threaded in {}.", dur_to_string(d));
}

fn non_threaded(iterations: u32) {
    let tx = Instant::now();
    let mut _q = 0;
    for x in 0..iterations {
        _q = test(x + 1);
    }
    let d = tx.elapsed();
    println!("Non-threaded in {}.", dur_to_string(d));
}

fn dur_to_string(d: Duration) -> String {
    let mut s = d.as_secs().to_string();
    s.push_str(".");
    s.push_str(&(d.subsec_nanos() / 1_000_000).to_string());
    s
}

fn test(x: u32) -> u32 {
    x
}
Here are the results of this on my machine:
Non-threaded in 0.9.
Threaded in 5.785.
threading seems to have a very significant overhead
It's not the general concept of "threading", it's the concept of creating and destroying lots of threads.
By default in Rust 1.22.1, each spawned thread allocates 2MiB of memory to use as stack space. In the worst case, your MCVE could allocate ~200GiB of RAM. In reality, this is unlikely to happen as some threads will exit, memory will be reused, etc. I only saw it use ~400MiB.
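If thread creation itself is unavoidable, the default stack size can be reduced per thread. A minimal sketch (the 64 KiB figure is an arbitrary example, not a recommendation):

use std::thread;

fn main() {
    // Request a smaller stack than the 2 MiB default for this thread.
    let handle = thread::Builder::new()
        .stack_size(64 * 1024)
        .spawn(|| {
            // small workload
        })
        .unwrap();
    handle.join().unwrap();
}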
On top of that, there is overhead involved with inter-thread communication (Mutex, channels, Atomic*) compared to intra-thread variables. Some kind of locking needs to be performed to ensure that all threads see the same data. "Embarrassingly parallel" algorithms tend to not have a lot of communication required. There are also different amounts of time required for different communication primitives. Atomic variables tend to be faster than others in many cases, but aren't as widely usable.
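As a minimal sketch of that difference, a shared counter can use an AtomicU32 instead of a Mutex, avoiding the lock entirely for a simple increment:

use std::sync::atomic::{AtomicU32, Ordering};
use std::sync::Arc;
use std::thread;

fn main() {
    let counter = Arc::new(AtomicU32::new(0));
    let handles: Vec<_> = (0..8)
        .map(|_| {
            let counter = Arc::clone(&counter);
            thread::spawn(move || {
                for _ in 0..1_000 {
                    // No lock needed; the hardware handles the contention.
                    counter.fetch_add(1, Ordering::Relaxed);
                }
            })
        })
        .collect();
    for h in handles {
        h.join().unwrap();
    }
    assert_eq!(counter.load(Ordering::Relaxed), 8_000);
}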
Then there are compiler optimizations to account for. Non-threaded code is way easier to optimize than threaded code. For example, running your code in release mode shows:
Non-threaded in 0.0.
Threaded in 142.775.
That's right, the non-threaded code took no time. The compiler can see through the code, realizes that nothing actually happens, and removes it all. I don't know how you got 5 seconds for the threaded code as opposed to the 2+ minutes I saw.
Switching to a threadpool will reduce a lot of the unneeded creation of threads. We can also use a threadpool that provides scoped threads, which allows us to avoid the Arc as well:
extern crate scoped_threadpool;

use scoped_threadpool::Pool;

fn threaded(iterations: u32) {
    let tx = Instant::now();
    let counter = Mutex::new(0);
    let mut pool = Pool::new(8);
    pool.scoped(|scope| {
        for _ in 0..iterations {
            scope.execute(|| {
                let mut num = counter.lock().unwrap();
                *num = test(*num);
            });
        }
    });
    let d = tx.elapsed();
    println!("Threaded in {}.", dur_to_string(d));
}
Non-threaded in 0.0.
Threaded in 0.675.
As with most pieces of programming, it's crucial to understand the tools you have and to use them appropriately.
I am new to Rust and struggling with all the wrapper types. I am trying to write code that is semantically equivalent to the following C code. It creates a big bookkeeping table but divides it so that every thread only accesses its own small slice of the table. The big table is not accessed again until the other threads have quit and no longer touch their slices.
#include <stdio.h>
#include <stdlib.h> /* for malloc */
#include <pthread.h>

void* write_slice(void* arg) {
    int* slice = (int*) arg;
    int i;
    for (i = 0; i < 10; i++)
        slice[i] = i;
    return NULL;
}

int main()
{
    int* table = (int*) malloc(100 * sizeof(int));
    int* slice[10];
    int i;
    for (i = 0; i < 10; i++) {
        slice[i] = table + i * 10;
    }
    // create a pthread for each slice
    pthread_t p[10];
    for (i = 0; i < 10; i++)
        pthread_create(&p[i], NULL, write_slice, slice[i]);
    for (i = 0; i < 10; i++)
        pthread_join(p[i], NULL);
    for (i = 0; i < 100; i++)
        printf("%d,", table[i]);
    free(table);
}
How do I use Rust's types and ownership to achieve this?
Let's start with the code:
// cargo-deps: crossbeam="0.7.3"
extern crate crossbeam;

const CHUNKS: usize = 10;
const CHUNK_SIZE: usize = 10;

fn main() {
    let mut table = [0; CHUNKS * CHUNK_SIZE];

    // Scoped threads allow the compiler to prove that no threads will outlive
    // table (which would be bad).
    let _ = crossbeam::scope(|scope| {
        // Chop `table` into disjoint sub-slices.
        for slice in table.chunks_mut(CHUNK_SIZE) {
            // Spawn a thread operating on that subslice.
            scope.spawn(move |_| write_slice(slice));
        }
        // `crossbeam::scope` ensures that *all* spawned threads join before
        // returning control back from this closure.
    });

    // At this point, all threads have joined, and we have exclusive access to
    // `table` again. Huzzah for 100% safe multi-threaded stack mutation!
    println!("{:?}", &table[..]);
}

fn write_slice(slice: &mut [i32]) {
    for (i, e) in slice.iter_mut().enumerate() {
        *e = i as i32;
    }
}
One thing to note is that this needs the crossbeam crate. Rust used to have a similar "scoped" construct, but a soundness hole was found right before 1.0, so it was deprecated with no time to replace it. crossbeam is basically the replacement.
What Rust lets you do here is express the idea that, whatever the code does, none of the threads created within the call to crossbeam::scope will survive that scope. As such, anything borrowed from outside that scope will live longer than the threads. Thus, the threads can freely access those borrows without having to worry about things like, say, a thread outliving the stack frame that defines table and scribbling over the stack.
So this should do more or less the same thing as the C code, though without that nagging worry that you might have missed something. :)
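For what it's worth, scoped threads later returned to the standard library as std::thread::scope (stabilized in Rust 1.63); a minimal sketch of the same pattern without any external crate:

fn main() {
    let mut table = [0i32; 100];
    std::thread::scope(|scope| {
        for slice in table.chunks_mut(10) {
            // Each thread gets a disjoint &mut sub-slice of `table`.
            scope.spawn(move || {
                for (i, e) in slice.iter_mut().enumerate() {
                    *e = i as i32;
                }
            });
        }
        // All spawned threads are joined before the scope returns.
    });
    println!("{:?}", &table[..]);
}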
Finally, here's the same thing using scoped_threadpool instead. The only real practical difference is that this allows us to control how many threads are used.
// cargo-deps: scoped_threadpool="0.1.6"
extern crate scoped_threadpool;

const CHUNKS: usize = 10;
const CHUNK_SIZE: usize = 10;

fn main() {
    let mut table = [0; CHUNKS * CHUNK_SIZE];
    let mut pool = scoped_threadpool::Pool::new(CHUNKS as u32);

    pool.scoped(|scope| {
        for slice in table.chunks_mut(CHUNK_SIZE) {
            scope.execute(move || write_slice(slice));
        }
    });

    println!("{:?}", &table[..]);
}

fn write_slice(slice: &mut [i32]) {
    for (i, e) in slice.iter_mut().enumerate() {
        *e = i as i32;
    }
}
When writing relatively realtime code, heap allocations in the main execution loop are generally avoided. So, in my experience, you allocate all the memory your program needs in an initialization step and then pass the memory around as needed. A toy example in C might look like the following:
#include <stdlib.h>

#define LEN 100

void not_realtime() {
    int *v = malloc(LEN * sizeof *v);
    for (int i = 0; i < LEN; i++) {
        v[i] = 1;
    }
    free(v);
}

void realtime(int *v, int len) {
    for (int i = 0; i < len; i++) {
        v[i] = 1;
    }
}

int main(int argc, char **argv) {
    not_realtime();

    int *v = malloc(LEN * sizeof *v);
    realtime(v, LEN);
    free(v);
}
And I believe roughly the equivalent in Rust:
fn possibly_realtime() {
    let mut v = vec![0; 100];
    for i in 0..v.len() {
        v[i] = 1;
    }
}

fn realtime(v: &mut Vec<i32>) {
    for i in 0..v.len() {
        v[i] = 1;
    }
}

fn main() {
    possibly_realtime();
    let mut v: Vec<i32> = vec![0; 100];
    realtime(&mut v);
}
What I'm wondering is: is Rust able to optimize possibly_realtime such that the local heap allocation of v only occurs once and is reused on subsequent calls to possibly_realtime? I'm guessing not but maybe there's some magic that makes it possible.
To investigate this, it is useful to add #[inline(never)] to your function, then view the LLVM IR on the playground.
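For instance, the setup used for the checks below is just the original function with the attribute applied:

// Prevent inlining so the allocation in possibly_realtime stays visible
// as a distinct function in the emitted LLVM IR.
#[inline(never)]
fn possibly_realtime() {
    let mut v = vec![0; 100];
    for i in 0..v.len() {
        v[i] = 1;
    }
}

fn main() {
    possibly_realtime();
}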
Rust 1.54
This is not optimized. Here's an excerpt:
; playground::possibly_realtime
; Function Attrs: noinline nonlazybind uwtable
define internal fastcc void @_ZN10playground17possibly_realtime17h2ab726cd567363f3E() unnamed_addr #0 personality i32 (i32, i32, i64, %"unwind::libunwind::_Unwind_Exception"*, %"unwind::libunwind::_Unwind_Context"*)* @rust_eh_personality {
start:
  %0 = tail call i8* @__rust_alloc_zeroed(i64 400, i64 4) #9, !noalias !8
  %1 = icmp eq i8* %0, null
  br i1 %1, label %bb20.i.i.i.i, label %vector.body
Every time that possibly_realtime is called, memory is allocated via __rust_alloc_zeroed.
Slightly before Rust 1.0
This is not optimized. Here's an excerpt:
; Function Attrs: noinline uwtable
define internal fastcc void @_ZN17possibly_realtime20h1a3a159dd4b50685eaaE() unnamed_addr #0 {
entry-block:
  %0 = tail call i8* @je_mallocx(i64 400, i32 0), !noalias !0
  %1 = icmp eq i8* %0, null
  br i1 %1, label %then-block-255-.i.i, label %normal-return2.i
Every time that possibly_realtime is called, memory is allocated via je_mallocx.
Editorial
Reusing a buffer is a great way to leak secure information, and I'd encourage you to avoid it as much as possible. I'm sure you are already familiar with these problems, but I want to make sure that future searchers make a note.
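As a minimal sketch of the mitigation (assuming the third-party zeroize crate, which is commonly used for this), a reused buffer can be wiped before it is handed back out:

use zeroize::Zeroize; // third-party crate, an assumption here

fn main() {
    let mut buf = [42u8; 16];
    // ... buffer was used for something sensitive ...
    buf.zeroize(); // wipe the contents before the buffer is reused
    assert!(buf.iter().all(|&b| b == 0));
}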
I also doubt that this "optimization" will be added to Rust, especially not without explicit opt-in by the programmer. There needs to be somewhere that the pointer to the allocated memory could be stored, but there really isn't anywhere. That means it would need to be a global or thread-local variable! Rust can run in environments without threads, but a global variable would still preclude recursive calls to this method. All in all, I think that passing the buffer into the method is much more explicit about what will happen.
I also assume that your example uses a Vec with a fixed size for demo purposes, but if you truly know the size at compile time, a fixed-size array could be a better choice.
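A minimal sketch of that alternative, with the length known at compile time so the buffer needs no heap allocation at all:

fn realtime(v: &mut [i32; 100]) {
    for e in v.iter_mut() {
        *e = 1;
    }
}

fn main() {
    // The buffer lives in main's stack frame; no allocator involved.
    let mut v = [0i32; 100];
    realtime(&mut v);
}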
As of 2021, Rust is capable of optimizing out heap allocation and inlining vtable method calls (playground):
fn old_adder(a: f64) -> Box<dyn Fn(f64) -> f64> {
    Box::new(move |x| a + x)
}

#[inline(never)]
fn test() {
    let adder = old_adder(1.);
    assert_eq!(adder(1.), 2.);
}

fn main() {
    test();
}