Is Rust hash map storing the same key twice? - rust

I am working through the Rust language book and have reached the hash map section.
At the end of the section there is a quiz, which asks what the output of the following code would be.
use std::collections::HashMap;
fn main() {
    let mut h: HashMap<char, Vec<usize>> = HashMap::new();
    for (i, c) in "hello!".chars().enumerate() {
        h.entry(c).or_insert(Vec::new()).push(i);
    }
    let mut sum = 0;
    for i in h.get(&'l').unwrap() {
        sum += *i;
    }
    println!("{}", sum);
}
The answer is 5, and I confirmed this by running the code. The explanation given is the following:
This program stores a vector of indexes for each occurrence of a given letter into a hashmap. Then it sums all the indexes for the letter 'l', which occurs at indexes 2 and 3 in the string "hello!".
Now when I look at the section on adding a new key and updating a value based on the old one, it seems that a Rust hash map, by definition, only allows unique keys.
However, when I run the above program with an extra print statement in the for loop:
for i in h.get(&'l').unwrap() {
    println!("{}", i);
    sum += *i;
}
It seems there are two entries for the character l. As I understand it, when the first for loop iterates over the string hello! and encounters the second l, it should overwrite the index value of l from 2 to 3, instead of making another entry with key l and value 3.
So the question is, is Rust allowing duplicate keys? Somehow this is not making sense.
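A small sketch may help show what the map actually holds. It assumes the same "hello!" input as the quiz, and index_map is a hypothetical helper name, not something from the book. The map ends up with one entry per distinct character. The single 'l' entry holds a Vec with two indexes, so the loop in the question walks the elements of that one vector rather than two map entries.

```rust
use std::collections::HashMap;

/// Builds the same map as the quiz code: each character maps to the
/// list of positions at which it occurs in `s`.
/// (`index_map` is a hypothetical name used only for this sketch.)
fn index_map(s: &str) -> HashMap<char, Vec<usize>> {
    let mut h: HashMap<char, Vec<usize>> = HashMap::new();
    for (i, c) in s.chars().enumerate() {
        // entry(c) finds the slot for key c; or_insert_with only creates
        // a new Vec when the key is absent. Either way we get a mutable
        // reference to the one Vec stored under c and push onto it.
        h.entry(c).or_insert_with(Vec::new).push(i);
    }
    h
}
```

For index_map("hello!") the map has five keys (h, e, l, o, !), and the value stored under 'l' is the vector [2, 3]. The key is unique; only its value grows.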

Related

What is & doing in a rust for in loop? [duplicate]

Trying to understand how & works in a rust for..in loop...
For example let's say we have something simple like a find largest value function which takes a slice of i32's and returns the largest value.
fn largest(list: &[i32]) -> i32 {
    let mut largest = list[0];
    for item in list {
        if *item > largest {
            largest = *item;
        }
    }
    largest
}
In the scenario given above, item will be an &i32, which makes sense to me. We borrow a slice of i32's, and as a result item is also a reference to the individual element in the slice. At this point we can dereference item with *, which is how I assume a pointer-based language would work.
But now if we alter this slightly below...
fn largest(list: &[i32]) -> i32 {
    let mut largest = list[0];
    for &item in list {
        if item > largest {
            largest = item;
        }
    }
    largest
}
If we put an & in front of item, this changes item within the for..in into an i32... Why? In my mind this is completely counterintuitive to how I would have imagined it to work. This to me says, "Give me an address/reference to item"... which in itself would already be a reference. So then how does item get dereferenced? Is this just a quirk of Rust, or am I fundamentally missing something here?
All variable assignments in Rust, including loop variables in for loops and function arguments, are assigned using pattern matching. The value that is being assigned is matched against the target pattern, and Rust tries to fill in the "blanks", i.e. the target variable names, in a way that substituting the values makes the pattern match the value. Let's look at a few examples.
let x = 5;
This is the simplest case. Obviously, substituting x with 5 makes both sides match.
if let Some(x) = Some(5) {}
Here, x will also become 5, since substituting that value into the pattern will make both sides identical.
let &x = &5;
Again, the two sides match when setting x to 5.
if let (Some(&x), &Some(y)) = (Some(&5), &Some(6)) {}
This assignment results in x = 5 and y = 6, since substituting these values into the pattern makes both sides match.
Let's apply this to your for loop. In each loop iteration, the pattern after for is matched against the next value returned by the iterator. We are iterating an &[i32], and the item type of the resulting iterator is &i32, so the iterator yields a &i32 in each iteration. This reference is matched against the pattern &item. Applying what we have seen above, this means item becomes an i32.
Note that assigning a value of a type that does not have the Copy marker trait will move that value into the new variable. All examples above use integers, which are Copy, so the value is copied instead.
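As a minimal sketch of the rule above, the three functions below sum a slice using different loop patterns. The function names are made up for this illustration. In each one the comment states what the loop variable binds to.

```rust
/// The loop variable is a plain name, so it binds the `&i32` the iterator yields.
fn sum_explicit_deref(list: &[i32]) -> i32 {
    let mut total = 0;
    for item in list {
        // item: &i32, so it has to be dereferenced by hand
        total += *item;
    }
    total
}

/// The pattern `&item` is matched against each `&i32`, so `item` is the `i32` behind it.
fn sum_ref_pattern(list: &[i32]) -> i32 {
    let mut total = 0;
    for &item in list {
        // item: i32 (copied out of the reference, because i32 is Copy)
        total += item;
    }
    total
}

/// Alternative: make the iterator itself yield `i32` values.
fn sum_copied(list: &[i32]) -> i32 {
    let mut total = 0;
    for item in list.iter().copied() {
        // item: i32
        total += item;
    }
    total
}
```

All three return the same result for the same input. The `&item` form only compiles when the element type is Copy, since otherwise the pattern would have to move the value out of a shared reference.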
There is no magic here, pure logic. Consider this example:
let a = 1;
let b = &a; // b is a reference to a
let &c = &a; // c is a copy of value a
You can read the third line of the example above as "Assign reference to a to a reference to c". This basically creates a virtual variable "reference to c", assigns to it the value &a and then dereferences it to get the value of c.
let a = 1;
let ref_c = &a;
let c = *ref_c;
// Working backwards through these assignments (conceptual steps, not all valid Rust), you get:
let &c = &a;
let &(*ref_c) = &a;
let ref_c = &a; // which is exactly what it was
The same occurs with the for .. in syntax. The iterator yields references (item_ref), but you bind each one to the pattern &item, which means that item has the element type rather than the reference type.
for item_ref in list {
    let item = *item_ref;
    ...
}
// we see that item_ref == &item, so above is same as
for &item in list {
    ...
}

Rust - how to find n-th most frequent element in a collection

I can't imagine this hasn't been asked before, but I have searched everywhere and could not find the answer.
I have an iterable, which contains duplicate elements. I want to count number of times each element occurs in this iterable and return n-th most frequent one.
I have working code which does exactly that, but I really doubt it's the optimal way to achieve this.
use std::collections::{BinaryHeap, HashMap};
// returns n-th most frequent element in collection
pub fn most_frequent<T: std::hash::Hash + std::cmp::Eq + std::cmp::Ord>(array: &[T], n: u32) -> &T {
    // initialize empty hashmap
    let mut map = HashMap::new();
    // count occurrences of each element in iterable and save as (value, count) in hashmap
    for value in array {
        // taken from https://doc.rust-lang.org/std/collections/struct.HashMap.html#method.entry
        // not exactly sure how this works
        let counter = map.entry(value).or_insert(0);
        *counter += 1;
    }
    // determine highest frequency of some element in the collection
    let mut heap: BinaryHeap<_> = map.values().collect();
    let mut max = heap.pop().unwrap();
    // get n-th largest value
    for _i in 1..n {
        max = heap.pop().unwrap();
    }
    // find that element (get key from value in hashmap)
    // taken from https://stackoverflow.com/questions/59401720/how-do-i-find-the-key-for-a-value-in-a-hashmap
    map.iter()
        .find_map(|(key, &val)| if val == *max { Some(key) } else { None })
        .unwrap()
}
Are there any better ways or more optimal std methods to achieve what I want? Or maybe there are some community made crates that I could use.
Your implementation has a time complexity of Ω(n log n), where n is the length of the array. The optimal solution to this problem has a complexity of Ω(n log k) for retrieving the k-th most frequent element. The usual implementation of this optimal solution indeed involves a binary heap, but not in the way you used it.
Here's a suggested implementation of the common algorithm:
use std::cmp::{Eq, Ord, Reverse};
use std::collections::{BinaryHeap, HashMap};
use std::hash::Hash;
pub fn most_frequent<T>(array: &[T], k: usize) -> Vec<(usize, &T)>
where
    T: Hash + Eq + Ord,
{
    let mut map = HashMap::new();
    for x in array {
        *map.entry(x).or_default() += 1;
    }
    let mut heap = BinaryHeap::with_capacity(k + 1);
    for (x, count) in map.into_iter() {
        heap.push(Reverse((count, x)));
        if heap.len() > k {
            heap.pop();
        }
    }
    heap.into_sorted_vec().into_iter().map(|r| r.0).collect()
}
I changed the prototype of the function to return a vector of the k most frequent elements together with their counts, since this is what you need to keep track of anyway. If you only want the k-th most frequent element, you can take it from the result with [k - 1].1.
The algorithm itself first builds a map of element counts the same way your code does – I just wrote it in a more concise form.
Next, we build a BinaryHeap for the most frequent elements. After each iteration, this heap contains at most k elements, which are the most frequent ones seen so far. If there are more than k elements in the heap, we drop the least frequent element. Since we always drop the least frequent element seen so far, the heap always retains the k most frequent elements seen so far. We need to use the Reverse wrapper to get a min heap, as described in the documentation of BinaryHeap.
Finally, we collect the results into a vector. The into_sorted_vec() function basically does this job for us, but we still want to unwrap the items from their Reverse wrapper – that wrapper is an implementation detail of our function and should not be returned to the caller.
(In Rust Nightly, we could also use the into_iter_sorted() method, saving one vector allocation.)
The code in this answer makes sure the heap is essentially limited to k elements, so an insertion to the heap has a complexity of Ω(log k). In your code, you push all elements from the array to the heap at once, without limiting the size of the heap, so you end up with a complexity of Ω(log n) for insertions. You essentially use the binary heap to sort a list of counts. Which works, but it's certainly neither the easiest nor the fastest way to achieve that, so there is little justification for going that route.
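For comparison, here is a minimal sketch of the plain sort-based route, for cases where simplicity matters more than the bounded-heap complexity discussed above. It is not the heap algorithm from this answer. The name kth_most_frequent, the Option return type, and the tie-breaking rule are assumptions made for this illustration.

```rust
use std::collections::HashMap;
use std::hash::Hash;

/// Returns the k-th most frequent element (k = 1 is the most frequent),
/// or None if k is 0 or there are fewer than k distinct elements.
/// Hypothetical helper: sorts all (count, element) pairs instead of
/// keeping a bounded heap.
fn kth_most_frequent<T: Hash + Eq + Ord>(array: &[T], k: usize) -> Option<&T> {
    // Count occurrences of each distinct element.
    let mut counts: HashMap<&T, usize> = HashMap::new();
    for x in array {
        *counts.entry(x).or_default() += 1;
    }
    // Move the pairs into a vector as (count, element).
    let mut pairs: Vec<(usize, &T)> = counts.into_iter().map(|(x, c)| (c, x)).collect();
    // Sort in descending order: by count first, then by element to break ties.
    pairs.sort_unstable_by(|a, b| b.cmp(a));
    // checked_sub turns k = 0 into None instead of underflowing.
    pairs.get(k.checked_sub(1)?).map(|&(_, x)| x)
}
```

Sorting costs O(m log m) for m distinct elements, which is fine for small inputs but loses the log k bound of the bounded heap.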

How to find a string of multiple occurences in a vector?

I have a vector of strings, and I want to find a string that occurs more than once. I've tried this but it didn't work.
let strings = vec!["Rust", "Rest", "Rust"]; // I want to find "Rust" in this case
let val = strings
    .into_iter()
    .find(|x| o.into_iter().filter(|y| x == y).count() >= 2)
    // sorry o ^ here is supposed to be strings
    .unwrap();
There are two issues in your code:
o doesn't exist. I assume you meant to use strings instead.
into_iter takes ownership of the value, so once you have called into_iter on strings (or o), you can't call it again. You should use plain iter instead.
Here's a fixed version:
let strings = vec!["Rust", "Rest", "Rust"]; // I want to find "Rust" in this case
let val = strings
    .iter()
    .find(|x| strings.iter().filter(|y| x == y).count() >= 2)
    .unwrap();
Note however that this is pretty slow. Depending on your requirements, there are more efficient alternatives:
Sort the strings array first. Then you only need to look at the next item to see if it is duplicated instead of needing to go through the whole array over and over. Advantage: no extra memory used. Drawback: you lose the original order.
Use an auxiliary variable to store the values you've already seen and/or count the number of occurrences of each string. This may be a HashSet, BTreeSet, HashMap or BTreeMap. See @Netwave's answer. Advantage: doesn't change the input array. Drawback: uses memory to keep track of the duplicates.
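The auxiliary-set alternative can be sketched as follows, assuming a HashSet and a hypothetical helper named first_duplicate. It makes a single pass and stops at the first string that has been seen before.

```rust
use std::collections::HashSet;

/// Returns the first string that repeats an earlier one, if any.
/// (`first_duplicate` is a hypothetical name for this sketch.)
fn first_duplicate<'a>(strings: &[&'a str]) -> Option<&'a str> {
    let mut seen = HashSet::new();
    // HashSet::insert returns false when the value was already present,
    // so the first element for which it returns false is a duplicate.
    strings.iter().copied().find(|s| !seen.insert(*s))
}
```

With the input from the question, first_duplicate(&["Rust", "Rest", "Rust"]) yields Some("Rust"). It returns None when every string is unique, so no unwrap is needed at the call site.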
You can count the appearances in O(n) with a hash table like:
use std::collections::HashMap;
fn main() {
    let strings = vec!["Rust", "Rest", "Rust"];
    // maps each string to its number of occurrences
    let mut sorted_data: HashMap<&str, u32> = HashMap::new();
    strings.iter().for_each(|item| {
        if !sorted_data.contains_key(item) {
            sorted_data.insert(item, 0);
        }
        *sorted_data.get_mut(item).unwrap() += 1;
    });
    println!("{:?}", sorted_data);
}
Then just use the one with the biggest count, for example with reduce (which was called fold_first while it was still unstable):
let result = sorted_data.iter().reduce(|(k1, v1), (k2, v2)| { if v2 > v1 {(k2, v2)} else {(k1, v1)}}).unwrap();

Sort HashMap data by value

I want to sort HashMap data by value in Rust (e.g., when counting character frequency in a string).
The Python equivalent of what I’m trying to do is:
count = {}
for c in text:
    count[c] = count.get(c, 0) + 1
sorted_data = sorted(count.items(), key=lambda item: -item[1])
print('Most frequent character in text:', sorted_data[0][0])
My corresponding Rust code looks like this:
// Count the frequency of each letter
let mut count: HashMap<char, u32> = HashMap::new();
for c in text.to_lowercase().chars() {
    *count.entry(c).or_insert(0) += 1;
}
// Get a list of the most frequently used characters, sorted by
// field 1 (the count) in descending order:
let mut count_vec: Vec<(&char, &u32)> = count.iter().collect();
count_vec.sort_by(|a, b| b.1.cmp(a.1));
println!("Most frequent character in text: {}", count_vec[0].0);
Is this idiomatic Rust? Can I construct the count_vec in a way so that it would consume the HashMap's data and own it (e.g., using map())? Would this be more idiomatic?
Is this idiomatic Rust?
There's nothing particularly unidiomatic, except possibly for the unnecessary full type constraint on count_vec; you could just use
let mut count_vec: Vec<_> = count.iter().collect();
It's not difficult from context to work out what the full type of count_vec is. You could also omit the type constraint for count entirely, but then you'd have to play shenanigans with your integer literals to have the correct value type inferred. That is to say, an explicit annotation is eminently reasonable in this case.
The other borderline change you could make if you feel like it would be to use |a, b| a.1.cmp(b.1).reverse() for the sort closure. The Ordering::reverse method just reverses the result so that less-than becomes greater-than, and vice versa. This makes it slightly more obvious that you meant what you wrote, as opposed to accidentally transposing two letters.
Can I construct the count_vec in a way so that it would consume the HashMap's data and own it?
Not in any meaningful way. Just because HashMap is using memory doesn't mean that memory is in any way compatible with Vec. You could use count.into_iter() to consume the HashMap and move the elements out (as opposed to iterating over pointers), but since both char and u32 are trivially copyable, this doesn't really gain you anything.
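For completeness, this is roughly what the consuming variant looks like. It is a sketch that wraps the question's counting code in a hypothetical function named sorted_counts. The tie-break on the character is an addition, so that the order does not depend on HashMap iteration order.

```rust
use std::collections::HashMap;

/// Counts lowercase characters and returns owned (char, count) pairs,
/// most frequent first. (`sorted_counts` is a hypothetical name.)
fn sorted_counts(text: &str) -> Vec<(char, u32)> {
    let mut count: HashMap<char, u32> = HashMap::new();
    for c in text.to_lowercase().chars() {
        *count.entry(c).or_insert(0) += 1;
    }
    // into_iter() consumes the map and yields owned (char, u32) pairs
    // instead of references into the map.
    let mut count_vec: Vec<(char, u32)> = count.into_iter().collect();
    // Descending by count; ties are broken by the character itself so the
    // result is deterministic.
    count_vec.sort_by(|a, b| b.1.cmp(&a.1).then(a.0.cmp(&b.0)));
    count_vec
}
```

The only visible difference from the borrowing version is the element type, (char, u32) instead of (&char, &u32). The vector still needs its own allocation either way.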
This could be another way to address the matter without the need of an intermediary vector.
// Count the frequency of each letter
let mut count: HashMap<char, u32> = HashMap::new();
for c in text.to_lowercase().chars() {
    *count.entry(c).or_insert(0) += 1;
}
let top_char = count.iter().max_by(|a, b| a.1.cmp(&b.1)).unwrap();
println!("Most frequent character in text: {}", top_char.0);
Use BTreeMap for sorted data
BTreeMap keeps its elements sorted by key, so swapping each key with its value and collecting the pairs into a BTreeMap
let count_b: BTreeMap<&u32,&char> = count.iter().map(|(k,v)| (v,k)).collect();
should give you a map sorted by character frequency.
Characters that share a frequency will be lost, though, because each count can appear as a key only once. If you only want the most frequent character, that does not matter.
You can get the result using
println!("Most frequent character in text: {}", count_b.last_key_value().unwrap().1);
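If the characters that share a frequency need to be kept, one possible variation is to map each count to a vector of characters. This is only a sketch under that assumption. The function name chars_by_frequency is hypothetical, and sorting each vector is an extra step added to make the output deterministic.

```rust
use std::collections::{BTreeMap, HashMap};

/// Groups characters by how often they occur; iterating the result goes
/// from the lowest count to the highest.
/// (`chars_by_frequency` is a hypothetical name for this sketch.)
fn chars_by_frequency(text: &str) -> BTreeMap<u32, Vec<char>> {
    let mut count: HashMap<char, u32> = HashMap::new();
    for c in text.chars() {
        *count.entry(c).or_insert(0) += 1;
    }
    let mut by_freq: BTreeMap<u32, Vec<char>> = BTreeMap::new();
    for (c, n) in count {
        // Characters with the same count share one key and one Vec.
        by_freq.entry(n).or_default().push(c);
    }
    // HashMap iteration order is arbitrary, so sort each group.
    for chars in by_freq.values_mut() {
        chars.sort_unstable();
    }
    by_freq
}
```

The last entry of the returned map then holds every character tied for the highest frequency, rather than just one of them.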

How do I sort a map by order of insertion?

I have tried using HashMap and BTreeMap for this, but neither has worked:
use std::collections::{BTreeMap, HashMap};
fn main() {
    let mut btreemap = BTreeMap::new();
    println!("BTreeMap");
    btreemap.insert("Z", "1");
    btreemap.insert("T", "2");
    btreemap.insert("R", "3");
    btreemap.insert("P", "4");
    btreemap.insert("K", "5");
    btreemap.insert("W", "6");
    btreemap.insert("G", "7");
    btreemap.insert("C", "8");
    btreemap.insert("A", "9");
    btreemap.insert("D", "0");
    for (key, value) in btreemap {
        println!("{} {}", key, value);
    }
    println!("Hash Map");
    let mut hashmap = HashMap::new();
    hashmap.insert("Z", "1");
    hashmap.insert("T", "2");
    hashmap.insert("R", "3");
    hashmap.insert("P", "4");
    hashmap.insert("K", "5");
    hashmap.insert("W", "6");
    hashmap.insert("G", "7");
    hashmap.insert("C", "8");
    hashmap.insert("A", "9");
    hashmap.insert("D", "0");
    for (key, value) in hashmap {
        println!("{} {}", key, value);
    }
}
When I run this via the Rust playground, I get a result that is not sorted by order of insertion; BTreeMap appears to be ordered alphabetically (prints A C D G K P R T W Z, along with the numbers) and HashMap seems to be ordered randomly (prints Z A C D R P T G W K).
I've looked through the Rust standard library documentation and I don't see any other maps.
The default collections do not track order of insertion. If you wish to sort by that, you will need to either find a different collection that does track it, or track it yourself.
None of the standard library collections maintain insertion order. You can instead use IndexMap from the indexmap crate, which preserves insertion order as long as you don't call remove.
use indexmap::indexmap;
let map = indexmap! {
    "Z" => 1,
    "T" => 2,
    "R" => 3,
    "P" => 4,
    "K" => 5,
    "W" => 6,
};
for (k, v) in map {
    println!("{}: {}", k, v);
}
// Z: 1
// T: 2
// R: 3
// P: 4
// K: 5
// W: 6
It accomplishes this by storing a hash table where the iteration order of the key-value pairs is independent of the hash values of the keys. This means that lookups may be slower than with the standard HashMap, but iteration and removal are very fast.
Associative containers (containers that map a key to a value) usually use one of two strategies to be able to look-up a key efficiently:
either they sort the keys according to some comparison operation
or they hash the keys according to some hashing operation
Here, you have the two archetypes: BTreeMap sorts the keys and HashMap hashes them.
If you solely wish to track the order of insertion, then an associative container is the wrong choice of container; what you wish for is a sequence container such as std::vec::Vec: always push the items at the end, and you can iterate over them in the order they were inserted.
Note: I advise writing a wrapper to prevent unwanted insertions anywhere else.
If, on the other hand, you want to have an associative container which also tracks insertion order, then what you are asking for does not exist yet in Rust as far as I know.
In C++, the go-to solution is called Boost.MultiIndex which allows you to create a container which you can query in multiple different ways; it is a quite complex piece of software, as you can see yourself if you browse its source. It might come to Rust, in time, but if you need something now you will have to hand-roll your own solution I fear; you can use the Boost code as a lean-to, although from experience it can be hard to read/understand.
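A hand-rolled solution along these lines could be sketched as follows, assuming removal is not needed. The type name InsertionOrderedMap and its methods are hypothetical. A Vec keeps the pairs in insertion order, and a HashMap maps each key to its position in that Vec, so lookups stay fast.

```rust
use std::collections::HashMap;
use std::hash::Hash;

/// Hypothetical wrapper: `entries` keeps pairs in insertion order and
/// `index` maps each key to its position in `entries`.
struct InsertionOrderedMap<K, V> {
    entries: Vec<(K, V)>,
    index: HashMap<K, usize>,
}

impl<K: Hash + Eq + Clone, V> InsertionOrderedMap<K, V> {
    fn new() -> Self {
        Self { entries: Vec::new(), index: HashMap::new() }
    }

    /// Inserts a pair. An existing key keeps its original position and
    /// the previous value is returned.
    fn insert(&mut self, key: K, value: V) -> Option<V> {
        if let Some(&pos) = self.index.get(&key) {
            return Some(std::mem::replace(&mut self.entries[pos].1, value));
        }
        // The key is cloned because it is stored in both containers.
        self.index.insert(key.clone(), self.entries.len());
        self.entries.push((key, value));
        None
    }

    /// Looks a value up by key through the position index.
    fn get(&self, key: &K) -> Option<&V> {
        self.index.get(key).map(|&pos| &self.entries[pos].1)
    }

    /// Iterates over the pairs in the order they were first inserted.
    fn iter(&self) -> impl Iterator<Item = (&K, &V)> {
        self.entries.iter().map(|(k, v)| (k, v))
    }
}
```

Keeping the fields private behind insert is what prevents pushes at arbitrary positions. Supporting removal would require updating the stored positions, which is left out of this sketch.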

Resources