Shared ownership of an str between a HashMap and a Vec

I come from a Java/C#/JavaScript background and I am trying to implement a Dictionary that assigns each passed string an id that never changes. The dictionary should also be able to return a string given its id. This makes it possible to store data with many repeated strings far more efficiently in the file system, because only the ids are stored instead of the full strings.
I thought that a struct with a HashMap and a Vec would do but it turned out to be more complicated than that.
I started with the usage of &str as a key for HashMap and an item of Vec like in the following sample. The value of HashMap serves as an index into Vec.
pub struct Dictionary<'a> {
values_map: HashMap<&'a str, u32>,
keys_map: Vec<&'a str>
}
impl<'a> Dictionary<'a> {
pub fn put_and_get_key(&mut self, value: &'a str) -> u32 {
match self.values_map.get_mut(value) {
None => {
let id_usize = self.keys_map.len();
let id = id_usize as u32;
self.keys_map.push(value);
self.values_map.insert(value, id);
id
},
Some(&mut id) => id
}
}
}
This works just fine until it turns out that the strs need to be stored somewhere, preferably in this same struct as well. I tried to store a Box<str> in the Vec and &'a str in the HashMap.
pub struct Dictionary<'a> {
values_map: HashMap<&'a str, u32>,
keys_map: Vec<Box<str>>
}
The borrow checker did not allow this, of course, because it would have permitted a dangling pointer in the HashMap when an item is removed from the Vec (or, in fact, sometimes when another item is added to the Vec, but that is off-topic here).
I understood that I either need to write unsafe code or use some form of shared ownership, the simplest kind of which seems to be an Rc. The usage of Rc<Box<str>> looks like introducing double indirection but there seems to be no simple way to construct an Rc<str> at the moment.
pub struct Dictionary {
values_map: HashMap<Rc<Box<str>>, u32>,
keys_map: Vec<Rc<Box<str>>>
}
impl Dictionary {
pub fn put_and_get_key(&mut self, value: &str) -> u32 {
match self.values_map.get_mut(value) {
None => {
let id_usize = self.keys_map.len();
let id = id_usize as u32;
let value_to_store = Rc::new(value.to_owned().into_boxed_str());
self.keys_map.push(value_to_store);
self.values_map.insert(value_to_store, id);
id
},
Some(&mut id) => id
}
}
}
Everything seems fine with regard to ownership semantics, but the code above does not compile because the HashMap now expects an Rc, not an &str:
error[E0277]: the trait bound `std::rc::Rc<Box<str>>: std::borrow::Borrow<str>` is not satisfied
--> src/file_structure/sample_dictionary.rs:14:31
|
14 | match self.values_map.get_mut(value) {
| ^^^^^^^ the trait `std::borrow::Borrow<str>` is not implemented for `std::rc::Rc<Box<str>>`
|
= help: the following implementations were found:
= help: <std::rc::Rc<T> as std::borrow::Borrow<T>>
Questions:
Is there a way to construct an Rc<str>?
Which other structures, methods or approaches could help to resolve this problem? Essentially, I need a way to efficiently store two maps, string-by-id and id-by-string, and be able to retrieve an id by &str, i.e. without any excessive allocations.

Is there a way to construct an Rc<str>?
Annoyingly, not that I know of. Rc::new requires a Sized argument, and I am not sure whether it is an actual limitation, or just something which was forgotten.
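On newer toolchains this does work: since Rust 1.21 the standard library provides `From<&str>` for `Rc<str>`, and `Rc<str>` implements `Borrow<str>`, so the map can be queried with a plain `&str`. A minimal sketch of the dictionary on that basis (the `get_value` helper is an illustrative addition, not something from the question):

```rust
use std::collections::HashMap;
use std::rc::Rc;

#[derive(Default)]
pub struct Dictionary {
    values_map: HashMap<Rc<str>, u32>,
    keys_map: Vec<Rc<str>>,
}

impl Dictionary {
    /// Returns the id for `value`, assigning a new one the first time it is seen.
    pub fn put_and_get_key(&mut self, value: &str) -> u32 {
        // `Rc<str>: Borrow<str>`, so this lookup does not allocate.
        if let Some(&id) = self.values_map.get(value) {
            return id;
        }
        let id = self.keys_map.len() as u32;
        // A single allocation holds the reference counts and the string bytes.
        let shared: Rc<str> = Rc::from(value);
        self.keys_map.push(Rc::clone(&shared));
        self.values_map.insert(shared, id);
        id
    }

    /// Hypothetical helper: looks a string up by its id.
    pub fn get_value(&self, id: u32) -> Option<&str> {
        self.keys_map.get(id as usize).map(|s| &**s)
    }
}

fn main() {
    let mut dict = Dictionary::default();
    let a = dict.put_and_get_key("alpha");
    let b = dict.put_and_get_key("beta");
    assert_eq!(dict.put_and_get_key("alpha"), a);
    assert_ne!(a, b);
    assert_eq!(dict.get_value(b), Some("beta"));
}
```

Both collections share one heap allocation per distinct string; only the reference count is touched when the `Rc` is cloned.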
Which other structures, methods or approaches could help to resolve this problem?
If you look at the signature of get you'll notice:
fn get<Q: ?Sized>(&self, k: &Q) -> Option<&V>
where K: Borrow<Q>, Q: Hash + Eq
As a result, you could search by &str if K implements Borrow<str>.
String implements Borrow<str>, so the simplest solution is to use String as the key. Sure, it means you'll actually have two Strings instead of one... but it's simple. Certainly, a String is simpler to use than a Box<str> (although it uses 8 more bytes).
If you want to shave off this cost, you can use a custom structure instead:
#[derive(Clone, Debug)]
struct RcStr(Rc<String>);
And then implement Borrow<str> for it. You'll then have 2 allocations per key (1 for Rc and 1 for String). Depending on the size of your String, it might consume less or more memory.
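A minimal sketch of that wrapper, assuming the derived `Hash` and `Eq` are acceptable (they delegate to `String`, which hashes and compares exactly like `str`, as the `Borrow` contract requires). The `ids` and `values` names are only for illustration:

```rust
use std::borrow::Borrow;
use std::collections::HashMap;
use std::rc::Rc;

// Derived `Hash`/`Eq` forward to the inner `String`, so they agree with `str`.
#[derive(Clone, Debug, PartialEq, Eq, Hash)]
struct RcStr(Rc<String>);

impl Borrow<str> for RcStr {
    fn borrow(&self) -> &str {
        self.0.as_str()
    }
}

fn main() {
    let key = RcStr(Rc::new(String::from("hello")));
    let mut ids: HashMap<RcStr, u32> = HashMap::new();
    let mut values: Vec<RcStr> = Vec::new();

    // Cloning bumps the reference count; the string itself is not copied.
    values.push(key.clone());
    ids.insert(key, 0);

    // Lookup by plain `&str` goes through the `Borrow<str>` impl.
    assert_eq!(ids.get("hello"), Some(&0));
    assert_eq!(Rc::strong_count(&values[0].0), 2);
}
```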
If you wish to go further (why not?), here are some ideas:
implement your own reference-counted string, in a single heap-allocation,
use a single arena for the slice inserted in the Dictionary,
...

Related

Struct property as key and the struct itself as value in HashMaps - Rust [duplicate]

This question already has answers here:
Why can't I store a value and a reference to that value in the same struct?
The following is a snippet of more complicated code. The idea is to load a SQL table and build a hashmap with one of the table struct's fields as the key and the struct itself as the value (implementation details are not important since the code works fine if I clone the String; however, the Strings in the DB can be arbitrarily long and cloning can be expensive).
The following code will fail with
error[E0382]: use of partially moved value: `foo`
--> src/main.rs:24:35
|
24 | foo_hashmap.insert(foo.a, foo);
| ----- ^^^ value used here after partial move
| |
| value partially moved here
|
= note: partial move occurs because `foo.a` has type `String`, which does not implement the `Copy` trait
For more information about this error, try `rustc --explain E0382`.
use std::collections::HashMap;
struct Foo {
a: String,
b: String,
}
fn main() {
let foo_1 = Foo {
a: "bar".to_string(),
b: "bar".to_string(),
};
let foo_2 = Foo {
a: "bar".to_string(),
b: "bar".to_string(),
};
let foo_vec = vec![foo_1, foo_2];
let mut foo_hashmap = HashMap::new();
foo_vec.into_iter().for_each(|foo| {
foo_hashmap.insert(foo.a, foo); // foo.a.clone() will make this compile
});
}
The struct Foo cannot implement Copy since its fields are String. I tried wrapping foo.a with Rc::new(RefCell::new()) but then ran into the missing Hash implementation for RefCell<String>, so I'm currently unsure whether to use something else for the struct fields (would Cow work?) or to handle that logic within the for_each loop.
There are at least two problems here. First, the resulting HashMap<K, V> would be a self-referential struct, as the K borrows from V; there are many questions and answers on SO about the pitfalls of this. Second, even if you could construct such a HashMap, you'd easily break the guarantees provided by HashMap, which lets you modify V while assuming that K always stays constant: there is no way to get a &mut K from a HashMap, but you can get a &mut V; if K is actually a reference into V, one could easily modify K through V (by mutating Foo.a) and break the map.
One possibility is to change Foo.a from a String to a Rc<str>, which you can clone with minimal runtime cost in order to put the value both in the K and into V. As Rc<str> is Borrow<str>, you can still look up values in the map by means of &str. This still has the - theoretical - downside that you can break the map by getting a &mut Foo from the map and std::mem::swap the a, which makes it impossible to look up the correct value from its keys; but you'd have to do that deliberately.
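A minimal sketch of this first option, assuming `Foo.a` can be changed to `Rc<str>` and that the map is single-threaded (otherwise `Arc<str>` would be the analogous choice):

```rust
use std::collections::HashMap;
use std::rc::Rc;

#[derive(Debug)]
struct Foo {
    a: Rc<str>,
    b: String,
}

fn main() {
    let foo_vec = vec![
        Foo { a: Rc::from("foo"), b: "bar".to_string() },
        Foo { a: Rc::from("foobar"), b: "barfoo".to_string() },
    ];

    let mut foo_hashmap: HashMap<Rc<str>, Foo> = HashMap::new();
    for foo in foo_vec {
        // Cloning an `Rc` only increments a counter; the string data is shared.
        foo_hashmap.insert(Rc::clone(&foo.a), foo);
    }

    // `Rc<str>: Borrow<str>`, so a `&str` works as the lookup key.
    assert_eq!(foo_hashmap.get("foobar").unwrap().b, "barfoo");
    assert_eq!(foo_hashmap["foo"].b, "bar");
}
```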
Another option is to actually use a HashSet instead of a HashMap, and use a newtype for Foo which behaves like a Foo.a. You'd have to implement PartialEq, Eq, Hash (and Borrow<str> for good measure) like this:
use std::collections::HashSet;
#[derive(Debug)]
struct Foo {
a: String,
b: String,
}
/// A newtype for `Foo` which behaves like a `str`
#[derive(Debug)]
struct FooEntry(Foo);
/// `FooEntry` compares to other `FooEntry` only via `.a`
impl PartialEq<FooEntry> for FooEntry {
fn eq(&self, other: &FooEntry) -> bool {
self.0.a == other.0.a
}
}
impl Eq for FooEntry {}
/// It also hashes the same way as a `Foo.a`
impl std::hash::Hash for FooEntry {
fn hash<H>(&self, hasher: &mut H)
where
H: std::hash::Hasher,
{
self.0.a.hash(hasher);
}
}
/// Due to the above, we can implement `Borrow`, so now we can look up
/// a `FooEntry` in the Set using &str
impl std::borrow::Borrow<str> for FooEntry {
fn borrow(&self) -> &str {
&self.0.a
}
}
fn main() {
let foo_1 = Foo {
a: "foo".to_string(),
b: "bar".to_string(),
};
let foo_2 = Foo {
a: "foobar".to_string(),
b: "barfoo".to_string(),
};
let foo_vec = vec![foo_1, foo_2];
let mut foo_hashmap = HashSet::new();
foo_vec.into_iter().for_each(|foo| {
foo_hashmap.insert(FooEntry(foo));
});
// Look up `Foo` using &str as keys...
println!("{:?}", foo_hashmap.get("foo").unwrap().0);
println!("{:?}", foo_hashmap.get("foobar").unwrap().0);
}
Notice that HashSet provides no way to get a &mut FooEntry due to the reasons described above. You'd have to use RefCell (and read what the docs of HashSet have to say about this).
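A rough sketch of the RefCell route, assuming only `b` ever needs to change and `a` stays fixed while the entry is in the set (this variant of `Foo` with a `RefCell` field is an illustration, not the definition from the question):

```rust
use std::cell::RefCell;
use std::collections::HashSet;
use std::hash::{Hash, Hasher};

struct Foo {
    a: String,          // the key: never mutated while in the set
    b: RefCell<String>, // payload: mutable through a shared reference
}

struct FooEntry(Foo);

// Equality and hashing look only at `a`, so changing `b` cannot break the set.
impl PartialEq for FooEntry {
    fn eq(&self, other: &Self) -> bool {
        self.0.a == other.0.a
    }
}
impl Eq for FooEntry {}
impl Hash for FooEntry {
    fn hash<H: Hasher>(&self, hasher: &mut H) {
        self.0.a.hash(hasher);
    }
}
impl std::borrow::Borrow<str> for FooEntry {
    fn borrow(&self) -> &str {
        &self.0.a
    }
}

fn main() {
    let mut set = HashSet::new();
    set.insert(FooEntry(Foo {
        a: "foo".to_string(),
        b: RefCell::new("bar".to_string()),
    }));

    // `get` hands out `&FooEntry` only, but `RefCell` still allows mutating `b`.
    let entry = set.get("foo").unwrap();
    entry.0.b.borrow_mut().push_str("baz");

    assert_eq!(set.get("foo").unwrap().0.b.borrow().as_str(), "barbaz");
}
```

The borrow rules for `b` are then checked at run time, so overlapping `borrow_mut` calls would panic rather than fail to compile.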
The third option is to simply clone() the foo.a as you described. Given the above, this is probably the simplest solution. If using an Rc<str> doesn't bother you for other reasons, this would be my choice.
Sidenote: If you don't need to modify a and/or b, a Box<str> instead of String is smaller by one machine word.

How to implement Borrow<T> when T wraps borrowed data?

Given the following definitions from How to borrow a field for serialization but create it during deserialization?:
#[derive(Serialize)]
struct SerializeThing<'a> {
small_header: (u64, u64, u64),
big_body: &'a str,
}
#[derive(Deserialize)]
struct DeserializeThing {
small_header: (u64, u64, u64),
big_body: String,
}
How do I implement the Borrow trait so as to store the owned data naturally in (e.g.) HashMaps and query them by either them or by their borrowed counterparts? The closest thing that appears possible is as follows:
impl DeserializeThing {
fn as_serialize(&self) -> SerializeThing<'_> {
let DeserializeThing { small_header, big_body } = self;
let small_header = *small_header;
let big_body = big_body.as_str();
SerializeThing { small_header, big_body }
}
}
which is not quite sufficient.
You can't.
Borrow::borrow must return a reference, and there is no way you can get a reference to a SerializeThing from a reference to a DeserializeThing as these types are simply not ABI compatible.
If performance is important and you can't pay for the construction of DeserializeThing instances/allocation of strings, then you could use hashbrown::HashMap instead of std::collections::HashMap.
hashbrown is the library that the standard library uses for its own HashMap implementation, but it has some more useful methods.
One which would be useful for you now is the raw entry API (and mutable raw entry).
In particular, it allows you to get a map entry from its hash and a matching function:
pub fn from_hash<F>(self, hash: u64, is_match: F) -> Option<(&'a K, &'a V)> where
F: FnMut(&K) -> bool
Access an entry by hash.
Since you can implement Hash for both DeserializeThing and SerializeThing so that equal values produce the same hash, this API would be simple to use in your case.
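The hashing half of that idea can be sketched with the standard library alone. The serde derives are left out here, `hash_of` is a hypothetical helper, and `DefaultHasher` stands in for whatever hasher the map actually uses (with a real map the hash would have to come from the map's own `BuildHasher`):

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

struct SerializeThing<'a> {
    small_header: (u64, u64, u64),
    big_body: &'a str,
}

struct DeserializeThing {
    small_header: (u64, u64, u64),
    big_body: String,
}

// Both impls feed the same fields, in the same order, into the hasher.
impl<'a> Hash for SerializeThing<'a> {
    fn hash<H: Hasher>(&self, state: &mut H) {
        self.small_header.hash(state);
        self.big_body.hash(state);
    }
}

impl Hash for DeserializeThing {
    fn hash<H: Hasher>(&self, state: &mut H) {
        self.small_header.hash(state);
        // `String` hashes exactly like the `str` it holds.
        self.big_body.as_str().hash(state);
    }
}

/// Hypothetical helper: hashes a value with the std default hasher.
fn hash_of<T: Hash>(value: &T) -> u64 {
    let mut hasher = DefaultHasher::new();
    value.hash(&mut hasher);
    hasher.finish()
}

fn main() {
    let owned = DeserializeThing {
        small_header: (1, 2, 3),
        big_body: "body".to_string(),
    };
    let borrowed = SerializeThing {
        small_header: (1, 2, 3),
        big_body: "body",
    };
    // Equal contents give equal hashes, so the borrowed form can drive a lookup.
    assert_eq!(hash_of(&owned), hash_of(&borrowed));
}
```

The `is_match` closure would then compare `small_header` and `big_body` field by field between the stored key and the borrowed query.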

What's the idiomatic way to make a lookup table which uses field of the item as the key?

I have a collection of Foo.
struct Foo {
k: String,
v: String,
}
I want a HashMap which has the key &foo.k and the value foo.
Apparently, it is not possible without redesigning Foo by introducing Rc or clone/copy the k.
fn t1() {
let foo = Foo { k: "k".to_string(), v: "v".to_string() };
let mut a: HashMap<&str, Foo> = HashMap::new();
a.insert(&foo.k, foo); // Error
}
There seems to be a workaround by abusing get() from HashSet (Playground):
use std::collections::{HashMap, HashSet};
use std::hash::{Hash, Hasher, BuildHasher};
use std::collections::hash_map::Entry::*;
struct Foo {
k: String,
v: String,
}
impl PartialEq for Foo {
fn eq(&self, other: &Self) -> bool { self.k == other.k }
}
impl Eq for Foo {}
impl Hash for Foo {
fn hash<H: Hasher>(&self, h: &mut H) { self.k.hash(h); }
}
impl ::std::borrow::Borrow<str> for Foo {
fn borrow(&self) -> &str {
self.k.as_str()
}
}
fn t2() {
let foo = Foo { k: "k".to_string(), v: "v".to_string() };
let mut a: HashSet<Foo> = HashSet::new();
a.insert(foo);
let bar = Foo { k: "k".to_string(), v: "v".to_string() };
let foo = a.get("k").unwrap();
println!("{}", foo.v);
}
This is pretty tedious. What if a Foo has multiple fields and different collections of Foo to key on different fields?
Apparently, it is not possible without redesigning Foo by introducing Rc or clone/copy the k.
That's correct, it is not possible to have HashMap<&K, V> where the key points to some component of the value.
The HashMap owns the key and the value, conceptually storing both in big vectors. When a new value is added to the HashMap, these existing values might need to be moved around due to hash collisions or the vectors might need to be reallocated to hold more items. Both of these operations would invalidate the address of any existing key, leaving it pointing at invalid memory. This would break Rust's safety guarantees, thus it is disallowed.
Read Why can't I store a value and a reference to that value in the same struct? for a thorough discussion.
Additionally, trentcl points out that HashMap::get_mut would give you a mutable reference to the value and, through it, to the data the key refers to, which would allow you to change the key without the map knowing. As the documentation states:
It is a logic error for a key to be modified in such a way that the key's hash, as determined by the Hash trait, or its equality, as determined by the Eq trait, changes while it is in the map.
Workarounds include:
Remove the key from the struct and store it separately. Instead of HashMap<&K, V> where V is (K, Data), store HashMap<K, Data>. You can return a struct which glues references to the key and value together (example)
Share ownership of the key using Rc (example)
Create duplicate keys using Clone or Copy.
Use a HashSet as you have done, enhanced by Sebastian Redl's suggestion. A HashSet<K> is actually just a HashMap<K, ()>, so this works by transferring all ownership to the key.
You can introduce a wrapper type for the item stored in the set.
struct FooByK(Foo);
Then implement the various traits needed for the set for this struct instead. This lets you choose a different wrapper type if you need a set that indexes by a different member.
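A minimal sketch of two such wrappers over the same `Foo`. `FooByV` is a hypothetical second wrapper added for illustration, and the `clone()` in `main` is only there so both sets can own a copy in this demo:

```rust
use std::borrow::Borrow;
use std::collections::HashSet;
use std::hash::{Hash, Hasher};

#[derive(Clone)]
struct Foo {
    k: String,
    v: String,
}

/// Indexes a `Foo` by its `k` field.
struct FooByK(Foo);

impl PartialEq for FooByK {
    fn eq(&self, other: &Self) -> bool { self.0.k == other.0.k }
}
impl Eq for FooByK {}
impl Hash for FooByK {
    fn hash<H: Hasher>(&self, h: &mut H) { self.0.k.hash(h); }
}
impl Borrow<str> for FooByK {
    fn borrow(&self) -> &str { &self.0.k }
}

/// Hypothetical second wrapper: the same `Foo`, indexed by `v`.
struct FooByV(Foo);

impl PartialEq for FooByV {
    fn eq(&self, other: &Self) -> bool { self.0.v == other.0.v }
}
impl Eq for FooByV {}
impl Hash for FooByV {
    fn hash<H: Hasher>(&self, h: &mut H) { self.0.v.hash(h); }
}
impl Borrow<str> for FooByV {
    fn borrow(&self) -> &str { &self.0.v }
}

fn main() {
    let foo = Foo { k: "key".to_string(), v: "value".to_string() };
    let mut by_k = HashSet::new();
    let mut by_v = HashSet::new();
    by_k.insert(FooByK(foo.clone()));
    by_v.insert(FooByV(foo));

    assert_eq!(by_k.get("key").unwrap().0.v, "value");
    assert_eq!(by_v.get("value").unwrap().0.k, "key");
}
```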

Vec<MyTrait> without N heap allocations?

I'm trying to port some C++ code to Rust. It composes a virtual (.mp4) file from a few kinds of slices (string reference, lazy-evaluated string reference, part of a physical file) and serves HTTP requests based on the result. (If you're curious, see Mp4File which takes advantage of the FileSlice interface and its concrete implementations in http.h.)
Here's the problem: I want to require as few heap allocations as possible. Let's say I have a few implementations of resource::Slice that I can hopefully figure out on my own. Then I want to make the one that composes them all:
pub trait Slice : Send + Sync {
/// Returns the length of the slice in bytes.
fn len(&self) -> u64;
/// Writes bytes indicated by `range` to `out.`
fn write_to(&self, range: &ByteRange,
out: &mut dyn io::Write) -> io::Result<()>;
}
// (used below)
struct SliceInfo<'a> {
range: ByteRange,
slice: &'a dyn Slice,
}
/// A `Slice` composed of other `Slice`s.
pub struct Slices<'a> {
len: u64,
slices: Vec<SliceInfo<'a>>,
}
impl<'a> Slices<'a> {
pub fn new() -> Slices<'a> { ... }
pub fn append(&mut self, slice: &'a dyn resource::Slice) { ... }
}
impl<'a> Slice for Slices<'a> { ... }
and use them to append lots and lots of slices with as few heap allocations as possible. Simplified, something like this:
struct ThingUsedWithinMp4Resource {
slice_a: resource::LazySlice,
slice_b: resource::LazySlice,
slice_c: resource::LazySlice,
slice_d: resource::FileSlice,
}
struct Mp4Resource {
slice_a: resource::StringSlice,
slice_b: resource::LazySlice,
slice_c: resource::StringSlice,
slice_d: resource::LazySlice,
things: Vec<ThingUsedWithinMp4Resource>,
slices: resource::Slices
}
impl Mp4Resource {
fn new() {
let mut f = Mp4Resource{slice_a: ...,
slice_b: ...,
slice_c: ...,
slices: resource::Slices::new()};
// ...fill `things` with hundreds of things...
slices.append(&f.slice_a);
for thing in f.things { slices.append(&thing.slice_a); }
slices.append(&f.slice_b);
for thing in f.things { slices.append(&thing.slice_b); }
slices.append(&f.slice_c);
for thing in f.things { slices.append(&thing.slice_c); }
slices.append(&f.slice_d);
for thing in f.things { slices.append(&thing.slice_d); }
f;
}
}
but this isn't working. The append lines cause errors "f.slice_* does not live long enough", "reference must be valid for the lifetime 'a as defined on the block at ...", "...but borrowed value is only valid for the block suffix following statement". I think this is similar to this question about the self-referencing struct. That's basically what this is, with more indirection. And apparently it's impossible.
So what can I do instead?
I think I'd be happy to give ownership to the resource::Slices in append, but I can't put a resource::Slice in the SliceInfo used in Vec<SliceInfo> because resource::Slice is a trait, and traits are unsized. I could do a Box<resource::Slice> instead but that means a separate heap allocation for each slice. I'd like to avoid that. (There can be thousands of slices per Mp4Resource.)
I'm thinking of doing an enum, something like:
enum BasicSlice {
String(StringSlice),
Lazy(LazySlice),
File(FileSlice)
}
and using that in the SliceInfo. I think I can make this work. But it definitely limits the utility of my resource::Slices class. I want to allow it to be used easily in situations I didn't anticipate, preferably without having to define a new enum each time.
Any other options?
You can add a User variant to your BasicSlice enum, which takes a Box<SliceInfo>. This way only the specialized case of users will take the extra allocation, while the normal path is optimized.
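A rough sketch of that layout, assuming the boxed payload is a trait object (`Box<dyn Slice>`). The trait here is a simplified stand-in without the `ByteRange` argument, and `Repeated` is a hypothetical user-defined slice:

```rust
use std::io::{self, Write};

/// Simplified stand-in for the question's trait.
trait Slice {
    fn len(&self) -> u64;
    fn write_to(&self, out: &mut dyn Write) -> io::Result<()>;
}

/// Stand-in for one of the built-in slice kinds.
struct StringSlice(&'static str);

impl Slice for StringSlice {
    fn len(&self) -> u64 { self.0.len() as u64 }
    fn write_to(&self, out: &mut dyn Write) -> io::Result<()> {
        out.write_all(self.0.as_bytes())
    }
}

enum BasicSlice {
    /// Built-in kinds are stored inline: no allocation per slice.
    String(StringSlice),
    /// Anything else pays for one box.
    User(Box<dyn Slice>),
}

impl Slice for BasicSlice {
    fn len(&self) -> u64 {
        match self {
            BasicSlice::String(s) => s.len(),
            BasicSlice::User(s) => s.len(),
        }
    }
    fn write_to(&self, out: &mut dyn Write) -> io::Result<()> {
        match self {
            BasicSlice::String(s) => s.write_to(out),
            BasicSlice::User(s) => s.write_to(out),
        }
    }
}

/// Hypothetical slice type defined outside the enum's crate.
struct Repeated { byte: u8, count: u64 }

impl Slice for Repeated {
    fn len(&self) -> u64 { self.count }
    fn write_to(&self, out: &mut dyn Write) -> io::Result<()> {
        for _ in 0..self.count {
            out.write_all(&[self.byte])?;
        }
        Ok(())
    }
}

fn main() {
    let slices = vec![
        BasicSlice::String(StringSlice("abc")),
        BasicSlice::User(Box::new(Repeated { byte: b'x', count: 2 })),
    ];
    let total: u64 = slices.iter().map(|s| s.len()).sum();
    let mut out = Vec::new();
    for s in &slices {
        s.write_to(&mut out).unwrap();
    }
    assert_eq!(total, 5);
    assert_eq!(out, b"abcxx");
}
```

The `Vec<BasicSlice>` is one contiguous allocation; only the `User` entries add a box each.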

How can I specify an iterator over traits?

I've been trying to make a websocket client, but one that has tons of options! I thought of using a builder style since the configuration can be stored in a nice way:
let client = Client::new()
.options(5)
.stuff(true)
// now users can store the config before calling build
.build();
I am having trouble creating a function that takes in a list of strings. Of course I have a few options:
fn strings(self, list: &[&str]) -> Self;
fn strings(self, list: Vec<String>) -> Self;
fn strings(self, list: &[&String]) -> Self;
// etc...
I would like to accept generously so I would like to accept &String, &str, and hopefully keys in a HashMap (since this might be used with a large routing table) so I thought I would accept an iterator over items that implement Borrow<str> like so:
fn strings<'p, P, Sp>(self, list: P) -> Self
where P: Iterator<Item = &'p Sp>,
Sp: Borrow<str> + 'p;
A full example is available here.
This was great until I needed to add another optional list of strings (extensions) to the builder.
This meant that if I created a builder without specifying both lists of strings that the compiler would complain that it couldn't infer the type of the Builder, which makes sense. The only reason this is not OK is that both these fields are optional so the user might never know the type of a field it hasn't yet set.
Does anyone have any ideas on how to specify an iterator over traits? Then I wouldn't have to specify the type fully at compile time. Or maybe just a better way to do this entirely?
A pragmatic solution is to simply discard the concrete types and introduce some indirection. We can Box the trait objects and store them as a known type:
use std::borrow::Borrow;
struct Builder {
strings: Option<Box<dyn Iterator<Item = Box<dyn Borrow<str>>>>>,
}
impl Builder {
fn new() -> Self {
Builder { strings: None }
}
fn strings<I>(mut self, iter: I) -> Self
where I: IntoIterator + 'static,
I::Item: Borrow<str> + 'static,
{
let i = iter.into_iter().map(|x| Box::new(x) as Box<dyn Borrow<str>>);
self.strings = Some(Box::new(i));
self
}
fn build(self) -> String {
match self.strings {
Some(iter) => {
let mut s = String::new();
for i in iter {
s.push_str((*i).borrow());
}
s
},
None => format!("No strings here!"),
}
}
}
fn main() {
let s =
Builder::new()
.strings(vec!["a", "b"])
.build();
println!("{}", s);
}
Here we convert the input iterator to a boxed iterator of boxed things that implement Borrow. We have to do some gyrations to convert the specific concrete type we have into a conceptually higher level type but that is still concrete.
This remainder doesn't directly answer your question about an iterator of traits, but it provides an alternate solution that I would use.
You have to pick between something that might be a bit more optimal but has a worse user experience, and something that might be a bit suboptimal but has a nicer user experience.
You are currently storing the iterator in the builder struct:
struct Builder<I>
where I: Iterator
{
things: Option<I>,
}
This requires that the concrete type of I be known in order to instantiate a Builder. Specifically, the size of that type needs to be known in order to allocate enough space. There is no way around this; if you want to store a generic type, you need to know what type it is.
For the same reasons, you cannot have this standalone statement:
let foo = None;
How much space needs to be allocated for foo? You cannot know until you know what type the Some might hold.
The way I would go would be to not add type parameters for the struct, but have them on the function. This means that the struct has to have a fixed type to store the values. In your example, a String is a good fit:
struct Builder {
strings: Vec<String>,
}
impl Builder {
fn strings<I>(mut self, iter: I) -> Self
where I: IntoIterator,
I::Item: Into<String>,
{
self.strings.extend(iter.into_iter().map(Into::into));
self
}
}
A Vec has very compact storage (it only takes 3 machine-sized values), and doesn't allocate any heap memory when it is empty. For that reason, I wouldn't wrap it in an Option unless you needed to tell 0 items from the absence of a provided value.
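To illustrate, here is a sketch of that builder with a second optional list. The `extensions` method, the `Client` struct and `build` are assumptions made for the example; the point is that the builder type stays the same whether or not either list is set, and that each call accepts a different item type:

```rust
use std::collections::HashMap;

#[derive(Default)]
struct Builder {
    strings: Vec<String>,
    extensions: Vec<String>,
}

/// Hypothetical build product, just enough to inspect the result.
struct Client {
    strings: Vec<String>,
    extensions: Vec<String>,
}

impl Builder {
    fn new() -> Self {
        Builder::default()
    }

    fn strings<I>(mut self, iter: I) -> Self
    where I: IntoIterator,
          I::Item: Into<String>,
    {
        self.strings.extend(iter.into_iter().map(Into::into));
        self
    }

    fn extensions<I>(mut self, iter: I) -> Self
    where I: IntoIterator,
          I::Item: Into<String>,
    {
        self.extensions.extend(iter.into_iter().map(Into::into));
        self
    }

    fn build(self) -> Client {
        Client { strings: self.strings, extensions: self.extensions }
    }
}

fn main() {
    let mut routes: HashMap<String, u32> = HashMap::new();
    routes.insert("/home".to_string(), 1);

    // Neither list is set, and there is no type left to infer.
    let bare = Builder::new().build();
    assert!(bare.strings.is_empty() && bare.extensions.is_empty());

    let client = Builder::new()
        .strings(vec!["a", "b"])                // items are &str
        .strings(routes.keys())                 // items are &String
        .extensions(vec![String::from("gzip")]) // items are String
        .build();
    assert_eq!(client.strings, ["a", "b", "/home"]);
    assert_eq!(client.extensions, ["gzip"]);
}
```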
If you are just appending each value to one big string, you might as well do that in the strings method. That depends on your application.
You mention that you might be providing a large amount of data, but I'm not sure that holding the iterator until the build call will really help. You are going to pay the cost sooner or later.
If you are going to reuse the builder, then it depends on what is expensive. If iterating is expensive, then doing it once and reusing that for each build call will be more efficient. If holding onto the memory is expensive, then you don't want to have multiple builders or built items around concurrently. Since the builder will transfer ownership of the memory to the new item, there shouldn't be any waste here.
