Size of Rust Hashmap [closed] - rust

My Rust program is consuming too much memory, causing the Linux Out-Of-Memory Killer to be invoked.
The program takes a file as input and performs multiple operations, including finding duplicated (group of) lines. Hashmaps are used to perform these operations, but their size becomes significant if the file is very large.
What is the recommended way to handle this issue? How can I dynamically get the size of a hashmap in Rust? Is there a simple way to know the maximum size not to exceed?
pub struct Key {
    pub code: String,
    pub num: String,
    pub ref: String,
    pub date: String,
}
pub struct Data {
    pub d: FixedI64<U16>,
    pub c: FixedI64<U16>,
}
...
let mut map: HashMap<Key, Data> = HashMap::new();
for (i, line) in reader.lines().skip(1).enumerate() {
    let line = line.unwrap_or_default();
    insert_in_hashmap(&line, &mut map);
}
...

You can check how many elements the HashMap has room for with the capacity method, and you can try reserving more memory with try_reserve. However, the table itself is likely not the problem: you are probably storing Strings in this HashMap, and inside the map each String is only 3 words in size (pointer, length and capacity), while its contents live in a separate heap allocation. You should therefore track how much memory these Strings are using, which will be much more difficult.
There is no universal solution for handling OOM errors. If your files are too big to fit in memory, you can't really do anything about it. Unfortunately, fallible allocation in Rust is still a work in progress, so there are not many options for trying to reserve memory (although most standard collections have try_reserve methods).
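To make the two points above concrete, here is a minimal sketch of a rough memory estimate for such a map. It is only an approximation under stated assumptions: `approx_heap_bytes` and `try_grow` are hypothetical helper names, `Data` uses plain `i64` fields as a stand-in for the question's `FixedI64<U16>`, and the `ref` field is renamed `reference` because `ref` is a reserved word. The estimate ignores the table's control bytes and allocator overhead, so treat it as a lower bound.

```rust
use std::collections::HashMap;
use std::mem::size_of;

// `ref` is a reserved word in Rust, so this sketch names the field `reference`.
#[derive(Hash, PartialEq, Eq)]
pub struct Key {
    pub code: String,
    pub num: String,
    pub reference: String,
    pub date: String,
}

// Stand-in for the question's FixedI64<U16> fields (assumption: 8 bytes each).
pub struct Data {
    pub d: i64,
    pub c: i64,
}

/// Rough lower bound on the heap bytes held by the map: one (Key, Data) slot
/// per unit of capacity, plus the buffers owned by the key Strings.
/// Control bytes and allocator overhead are not counted.
pub fn approx_heap_bytes(map: &HashMap<Key, Data>) -> usize {
    let table = map.capacity() * size_of::<(Key, Data)>();
    let strings: usize = map
        .keys()
        .map(|k| {
            k.code.capacity() + k.num.capacity() + k.reference.capacity() + k.date.capacity()
        })
        .sum();
    table + strings
}

/// Ask for room for `additional` more entries and report whether the
/// reservation succeeded, instead of aborting on allocation failure.
pub fn try_grow(map: &mut HashMap<Key, Data>, additional: usize) -> bool {
    map.try_reserve(additional).is_ok()
}
```

Calling `approx_heap_bytes` every few thousand insertions would give a running figure to compare against a budget of your choosing. Note that on Linux with memory overcommit enabled, `try_reserve` can succeed even when the memory is not really available, so it does not by itself protect against the OOM killer.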


Is it appropriate to mark functions unsafe if the user's inputs can break invariants? [closed]

It's common in Rust to enforce invariants about the state of types by exposing a limited API. However, invariants can easily be broken if checking them and returning an error is difficult or prohibitively computationally expensive.
For example, imagine we have a function that takes an integer and, for whatever reason, fails if that integer is a prime number. It may be too expensive to check that the input is not prime. Is it appropriate to mark such a function unsafe as a way of saying, "This method could potentially operate in an unexpected way if the input is prime, but we trust you and don't waste effort checking"?
If such an input could cause obvious undefined behavior (UB), then I imagine the answer is yes. But if it's not immediately clear whether it could cause UB, I am uncertain whether such an API should be complicated with an unsafe attribute.
If a broken invariant could cause Undefined Behaviour then the convention would usually be to check the invariant in the "obvious" usage of the API, but also provide a less convenient unsafe variant where the user can take responsibility for the checks themselves.
The unsafe variants of functions usually have _unchecked as a suffix. For example:
/// Will panic if `p` is prime
pub fn do_stuff(p: u64) {
    assert!(!is_prime(p), "The argument must not be prime");
    // SAFETY: the assertion above guarantees that `p` is not prime.
    unsafe { do_stuff_unchecked(p) }
}
/// # Safety
/// It is Undefined Behaviour if `p` is prime
pub unsafe fn do_stuff_unchecked(p: u64) {
    todo!()
}
In a lot of cases it's more appropriate to return a Result rather than panicking. I used panicking in the example to reinforce that unsafe is about memory safety only.
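As a sketch of the Result-returning variant mentioned above: the names `try_do_stuff` and `PrimeError` are hypothetical, `is_prime` is a naive trial-division helper written only for illustration, and the body of `do_stuff_unchecked` is a placeholder (it halves its argument) standing in for whatever the real function would do.

```rust
/// Error returned when the argument violates the "not prime" invariant.
#[derive(Debug, PartialEq, Eq)]
pub struct PrimeError(pub u64);

/// Naive trial division, for illustration only.
fn is_prime(n: u64) -> bool {
    if n < 2 {
        return false;
    }
    let mut d = 2;
    // `d <= n / d` is the overflow-free form of `d * d <= n`.
    while d <= n / d {
        if n % d == 0 {
            return false;
        }
        d += 1;
    }
    true
}

/// Checked entry point: reports a prime argument as an error instead of panicking.
pub fn try_do_stuff(p: u64) -> Result<u64, PrimeError> {
    if is_prime(p) {
        return Err(PrimeError(p));
    }
    // SAFETY: `p` was just verified not to be prime.
    Ok(unsafe { do_stuff_unchecked(p) })
}

/// # Safety
/// The caller must guarantee that `p` is not prime.
pub unsafe fn do_stuff_unchecked(p: u64) -> u64 {
    // Placeholder body; a real implementation would rely on the invariant here.
    p / 2
}
```

Callers who have already established the invariant some cheaper way can call `do_stuff_unchecked` directly and skip the primality test.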
See Also
Lots more examples in Rust std

Does Rust's borrow checker really mean that I should re-structure my program?

So I've read Why can't I store a value and a reference to that value in the same struct? and I understand why my naive approach to this was not working, but I'm still very unclear how to better handle my situation.
I have a program I wanted to structure like follows (details omitted because I can't make this compile anyway):
use std::sync::Mutex;
struct Species{
    index : usize,
    population : Mutex<usize>
}
struct Simulation<'a>{
    species : Vec<Species>,
    grid : Vec<&'a Species>
}
impl<'a> Simulation<'a>{
    pub fn new() -> Self {...} //I don't think there's any way to implement this
    pub fn run(&self) {...}
}
The idea is that I create a vector of Species (which won't change for the lifetime of Simulation, except in specific mutex-guarded fields) and then a grid representing which species live where, which will change freely. This implementation won't work, at least not any way I've been able to figure out. As I understand it, the issue is that pretty much however I make my new method, the moment it returns, all of the references in grid would become invalid as Simulation and therefore Simulation.species are moved to another location in the stack. Even if I could prove to the compiler that species and its contents would continue to exist, they actually won't be in the same place. Right?
I've looked into various ways around this, such as making species as an Arc on the heap or using usizes instead of references and implementing my own lookup function into the species vector, but these seem slower, messier or worse. What I'm starting to think is that I need to really re-structure my code to look something like this (details filled in with placeholders because now it actually runs):
use std::sync::Mutex;
struct Species{
    index : usize,
    population : Mutex<usize>
}
struct Simulation<'a>{
    species : &'a Vec<Species>, //Now just holds a reference rather than data
    grid : Vec<&'a Species>
}
impl<'a> Simulation<'a>{
    pub fn new(species : &'a Vec <Species>) -> Self { //has to be given pre-created species
        let grid = vec!(species.first().unwrap(); 10);
        Self{species, grid}
    }
    pub fn run(&self) {
        let mut population = self.grid[0].population.lock().unwrap();
        println!("Population: {}", population);
        *population += 1;
    }
}
pub fn top_level(){
    let species = vec![Species{index: 0, population : Mutex::new(0_)}];
    let simulation = Simulation::new(&species);
    simulation.run();
}
As far as I can tell this runs fine, and ticks off all the ideal boxes:
grid uses simple references with minimal boilerplate for me
these references are checked at compile time with minimal overhead for the system
Safety is guaranteed by the compiler (unlike a custom map based approach)
But, this feels very weird to me: the two-step initialization process of creating owned memory and then references can't be abstracted any way that I can see, which feels like I'm exposing an implementation detail to the calling function. top_level has to also be responsible for establishing any other functions or (scoped) threads to run the simulation, call draw/gui functions, etc. If I need multiple levels of references, I believe I will need to add additional initialization steps to that level.
So, my question is just "Am I doing this right?". While I can't exactly prove this is wrong, I feel like I'm losing a lot of near-universal abstraction of the call structure. Is there really no way to return species and simulation as a pair at the end (with some one-off update to make all references point to the "forever home" of the data).
Phrasing my problem a second way: I do not like that I cannot have a function with a signature of ()-> Simulation, when I can have a pair of function calls that have that same effect. I want to be able to encapsulate the creation of this simulation. I feel like the fact that this approach cannot do so indicates I'm doing something wrong, and that there may be a more idiomatic approach I'm missing.
I've looked into various ways around this, such as making species as an Arc on the heap or using usizes instead of references and implementing my own lookup function into the species vector, but these seem slower, messier or worse.
Don't assume that, test it. I once had a self-referential (using ouroboros) structure much like yours, with a vector of things and a vector of references to them. I tried rewriting it to use indices instead of references, and it was faster.
Rc/Arc is also an option worth trying out — note that there is only an extra cost to the reference counting when an Arc is cloned or dropped. Arc<Species> doesn't cost any more to dereference than &Species, and you can always get an &Species from an Arc<Species>. So the reference counting only matters if and when you're changing which Species is in an element of Grid.
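A minimal sketch of the Arc variant described above, assuming the same Species and Simulation shapes as in the question. The `handles_to_first_species` method is a hypothetical helper added only to make the reference counting visible, and `run` returns the new population rather than printing it.

```rust
use std::sync::{Arc, Mutex};

pub struct Species {
    pub index: usize,
    pub population: Mutex<usize>,
}

pub struct Simulation {
    species: Vec<Arc<Species>>,
    grid: Vec<Arc<Species>>,
}

impl Simulation {
    /// No lifetime parameter and no caller-supplied storage is needed.
    pub fn new() -> Self {
        let species = vec![Arc::new(Species { index: 0, population: Mutex::new(0) })];
        // Cloning an Arc only bumps a reference count; the Species is not copied.
        let grid = vec![Arc::clone(&species[0]); 10];
        Self { species, grid }
    }

    /// Increment the population of the species in cell 0 and return the new value.
    pub fn run(&self) -> usize {
        // Dereferencing Arc<Species> works like dereferencing &Species.
        let mut population = self.grid[0].population.lock().unwrap();
        *population += 1;
        *population
    }

    /// Number of Arc handles currently pointing at the first species.
    pub fn handles_to_first_species(&self) -> usize {
        Arc::strong_count(&self.species[0])
    }
}
```

With this layout `Simulation::new()` has the `() -> Simulation` signature the question asks for, and the reference count is only touched when a grid cell is reassigned or the simulation is dropped.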
If you're owning a Vec of objects, then want to also keep track of references to particular objects in that Vec, a usize index is almost always the simplest design. It might feel like extra boilerplate to you now, but it's a hell of a lot better than properly dealing with keeping pointers in check in a self-referential struct (as somebody who's made this mistake in C++ more than I should have, trust me). Rust's rules are saving you from some real headaches, just not ones that are obvious to you necessarily.
If you want to get fancy and feel like a raw usize is too arbitrary, then I recommend you look at slotmap. For a simple SlotMap, internally it's not much more than an array of values, so iteration is fast and storage is efficient. But it gives you generational indices (slotmap calls these "keys") to the values: each value is embellished with a "generation" and each index also internally keeps hold of its generation. You can therefore safely remove and replace items in the Vec without your references suddenly pointing at a different object. It's really cool.
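The plain usize-index design can be sketched as follows, using only the standard library. The `species_at` lookup is a hypothetical helper name, and `run` returns the new population instead of printing it so the behaviour is easy to check.

```rust
use std::sync::Mutex;

pub struct Species {
    pub index: usize,
    pub population: Mutex<usize>,
}

pub struct Simulation {
    species: Vec<Species>,
    // Each cell stores an index into `species` rather than a reference.
    grid: Vec<usize>,
}

impl Simulation {
    /// Owns both vectors, so construction can be fully encapsulated.
    pub fn new() -> Self {
        let species = vec![Species { index: 0, population: Mutex::new(0) }];
        let grid = vec![0; 10];
        Self { species, grid }
    }

    /// Resolve a grid cell to the species living there.
    fn species_at(&self, cell: usize) -> &Species {
        &self.species[self.grid[cell]]
    }

    /// Increment the population of the species in cell 0 and return the new value.
    pub fn run(&self) -> usize {
        let mut population = self.species_at(0).population.lock().unwrap();
        *population += 1;
        *population
    }
}
```

Because the struct holds no references into itself, it can be moved and returned freely. The cost is one bounds-checked lookup per access, which is worth measuring rather than assuming to be slow.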

How could I avoid cloning the entirety of a large struct to send to a thread when only parts are needed? [closed]

My use case:
I have a large complex struct.
I want to take a snapshot of this struct to send it to a thread to do some calculation.
Many large fields within this struct are not necessary for calculation.
Many fields within the struct are partially required (a field may be a struct and only a few parameters from this struct are required).
At the moment I simply call .clone() and pass a clone of the entire struct to the thread.
It is difficult to give a good example, but this is a summary of my current method:
use tokio::task;
fn main() {
    let complex_struct = ComplexStruct::new(...);
    // some extra async stuff I don't think is 100% necessary to this question
    let future = async_function(complex_struct.clone()); // Ideally not cloning whole struct
    // some extra async stuff I don't think is 100% necessary to this question
}
fn async_function(complex_struct:ComplexStruct) -> task::JoinHandle<_> {
    task::spawn_blocking(move || {
        // bunch of work, then return something
    })
}
My current working idea is to have a separate struct such as ThreadData which is instantiated with ThreadData::new(ComplexStruct) and effectively clones the required fields. I then pass ThreadData to the thread instead.
What is the best solution to this problem?
I think you've answered your own question. 😁 If you're just looking for validation, I believe a refactor to only the needed parts is a good idea. You may find ways to simplify your code, but the performance boost seems to be your reasoning. We can't see benchmarks on this, but perhaps you want to track that.
This part is just my opinion, but instead of ThreadData::new(), you could do ThreadData::from(), or better yet, impl From<ComplexStruct> for ThreadData {}. If it only has one purpose then it doesn't matter, but if ThreadData will ever be used in a different context, I like to keep the "new"/"from" functions available for a general instance. Otherwise I eventually have Struct::from_this(), Struct::from_that(), or Struct::from_some_random_input(). 😋
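A minimal sketch of the From-based snapshot, under several assumptions: the fields of ComplexStruct and the nested Settings struct are invented for illustration, std::thread is used instead of tokio so the example has no dependencies, and the conversion is implemented for `&ComplexStruct` (rather than the owned `ComplexStruct` suggested above) so that the original stays usable after the snapshot is taken.

```rust
use std::thread;

// Hypothetical layout: one large field the worker never needs,
// one nested struct that is only partially needed, one field needed in full.
pub struct Settings {
    pub scale: f64,
    pub label: String,
}

pub struct ComplexStruct {
    pub big_buffer: Vec<u8>,
    pub settings: Settings,
    pub samples: Vec<f64>,
}

/// Only what the worker thread actually reads.
pub struct ThreadData {
    pub scale: f64,
    pub samples: Vec<f64>,
}

impl From<&ComplexStruct> for ThreadData {
    fn from(full: &ComplexStruct) -> Self {
        Self {
            // Copy a single parameter out of the nested struct.
            scale: full.settings.scale,
            // Clone only the field the calculation needs.
            samples: full.samples.clone(),
        }
    }
}

/// Snapshot the needed fields, then move the snapshot into the thread.
pub fn spawn_calculation(full: &ComplexStruct) -> thread::JoinHandle<f64> {
    let data = ThreadData::from(full);
    thread::spawn(move || data.samples.iter().sum::<f64>() * data.scale)
}
```

The large `big_buffer` is never cloned here. If even the `samples` clone proves too costly, wrapping that field in an Arc inside ComplexStruct would be one further option to benchmark.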

Take a mutable slice from a vector with a different length?

I'm trying to use a vector as a buffer like this. I'm wondering if there is a way to take a slice from the vector without growing it, so that code like this works:
fn example(r: Read) {
    let buffer: Vec<u8> = Vec::with_capacity(1024);
    let slice: &mut[u8] = &mut buffer;
    r.read(slice); // doesn't work since the slice has length zero
}
If one was to just take a slice of the capacity, you would have a slice of uninitialized data. This is unsafe.
You can do this with Vec's set_len. However:
This is unsafe. Reading the data is memory safe, but a vector of another type, or misuse of set_len, may not be. Overflow checking and proper cleanup is important.
This could well be a significant security flaw.
If you are using non-primitive types, you need to consider panic safety.
The standard library has a policy against allowing reads to uninitialized memory, even if memory-safe.
The basic way of doing this is
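unsafe {
    // Expose the whole allocation as a (still uninitialized) slice.
    buffer.set_len(buffer.capacity());
    let new_len = r.read(&mut buffer[..])?;
    // Shrink back to the number of bytes actually read.
    buffer.set_len(cmp::min(buffer.len(), new_len));
}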
The cmp::min is needed if you don't totally trust read's implementation, since incorrect output can result in too-large set length.
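For comparison, here is a sketch of a fully safe alternative to the set_len approach: it zero-initializes the buffer up front and truncates it afterwards. This is a different technique from the one above, paying for the zeroing in exchange for needing no unsafe code. The function name `read_chunk` is hypothetical.

```rust
use std::io::{self, Read};

/// Read at most `capacity` bytes from `r` into a freshly allocated buffer.
pub fn read_chunk<R: Read>(r: &mut R, capacity: usize) -> io::Result<Vec<u8>> {
    // Zero-filled, so the slice handed to `read` has length `capacity`.
    let mut buffer = vec![0u8; capacity];
    let n = r.read(&mut buffer)?;
    // Drop the unused tail; `truncate` is a no-op if `n` exceeds the length.
    buffer.truncate(n);
    Ok(buffer)
}
```

Since `truncate` never grows the vector, a misbehaving `read` implementation that reports too many bytes cannot make the buffer expose memory it did not write.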
I want to add three details to @Veedrac's answer.
1 - This is the relevant code that creates the slices from Vec<T>, and in both cases the slice goes from the start of the vector up to self.len:
fn deref(&self) -> &[T]
fn deref_mut(&mut self) -> &mut [T]
2 - There is an optimization in the TcpStream to avoid zeroing the memory, much like @Veedrac's alternative, but it is only used for read_to_end and not in read. And the BufReader that does zeroing starts with small allocations for performance reasons.
3 - The slice has no information about the vector size, so the vector's length must be updated anyway; otherwise the buffer will be larger than the data and the zeros will be used.
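Point 1 can be checked with a small sketch: a slice obtained from a Vec covers only its length, not its capacity. Both function names below are hypothetical.

```rust
/// Returns the length of the slice borrowed from a fresh `with_capacity`
/// vector, and whether the vector really has the requested capacity.
pub fn fresh_vec_slice_len(capacity: usize) -> (usize, bool) {
    let mut buffer: Vec<u8> = Vec::with_capacity(capacity);
    let slice_len = {
        // This coercion goes through DerefMut, which stops at `len`.
        let slice: &mut [u8] = &mut buffer;
        slice.len()
    };
    (slice_len, buffer.capacity() >= capacity)
}

/// After `resize`, the length (and therefore the slice) covers the whole buffer.
pub fn len_after_resize(capacity: usize) -> usize {
    let mut buffer: Vec<u8> = Vec::with_capacity(capacity);
    buffer.resize(capacity, 0);
    let slice: &mut [u8] = &mut buffer;
    slice.len()
}
```

This is why the `r.read(slice)` call in the question sees a zero-length buffer even though 1024 bytes were allocated.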

Inefficient instance construction?

Here is a simple struct
pub struct Point {
    x: uint,
    y: uint
}
impl Point {
    pub fn new() -> Point {
        Point{x: 0u, y: 0u}
    }
}
fn main() {
    let p = box Point::new();
}
My understanding of how the constructor function works is as follows. The new() function creates an instance of Point in its local stack and returns it. Data from this instance is shallow copied into the heap memory created by box. The pointer to the heap memory is then assigned to the variable p.
Is my understanding correct? Do two separate memory regions get initialized to create one instance? This seems to be an inefficient way to initialize an instance compared to C++, where we get to directly write to the memory of the instance from the constructor.
From a relevant guide:
You may think that this gives us terrible performance: return a value and then immediately box it up ?! Isn't this pattern the worst of both worlds? Rust is smarter than that. There is no copy in this code. main allocates enough room for the box, passes a pointer to that memory into foo as x, and then foo writes the value straight into the Box.
This is important enough that it bears repeating: pointers are not for optimizing returning values from your code. Allow the caller to choose how they want to use your output.
While this talks about boxing the value, I believe the mechanism is general enough, and not specific to boxes.
Just to expand a bit on @Shepmaster's answer:
Rust (and LLVM) supports RVO, or return value optimization, where if a return value is used in a context like box, Rust is smart enough to generate code that uses some sort of out pointer to avoid the copy by writing the return value directly into its usage site. box is one of the major uses of RVO, but it can be used for other types and situations as well.
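For readers on current Rust, the question's code can be sketched as below. The `uint` type and the `box` keyword come from pre-1.0 Rust; `usize` and `Box::new` are the present-day equivalents. The helper name `boxed_origin` is hypothetical, and whether the intermediate copy is actually elided is left to the optimizer rather than guaranteed by the language.

```rust
pub struct Point {
    pub x: usize,
    pub y: usize,
}

impl Point {
    /// Returns the point by value; the caller decides where it lives.
    pub fn new() -> Point {
        Point { x: 0, y: 0 }
    }
}

/// The caller chooses heap allocation by wrapping the returned value in a Box.
pub fn boxed_origin() -> Box<Point> {
    Box::new(Point::new())
}
```

The constructor stays oblivious to where the value ends up, which is the design point the quoted guide makes: return values plainly and let the caller pick the storage.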
