I want to sort HashMap data by value in Rust (e.g., when counting character frequency in a string).
The Python equivalent of what I’m trying to do is:
count = {}
for c in text:
    count[c] = count.get(c, 0) + 1
sorted_data = sorted(count.items(), key=lambda item: -item[1])
print('Most frequent character in text:', sorted_data[0][0])
My corresponding Rust code looks like this:
// Count the frequency of each letter
let mut count: HashMap<char, u32> = HashMap::new();
for c in text.to_lowercase().chars() {
*count.entry(c).or_insert(0) += 1;
}
// Get a list of (character, count) pairs, sorted by field 1 (the count)
// in descending order, so the most frequent character comes first:
let mut count_vec: Vec<(&char, &u32)> = count.iter().collect();
count_vec.sort_by(|a, b| b.1.cmp(a.1));
println!("Most frequent character in text: {}", count_vec[0].0);
Is this idiomatic Rust? Can I construct count_vec in a way that consumes the HashMap's data and owns it (e.g., using map())? Would this be more idiomatic?
Is this idiomatic Rust?
There's nothing particularly unidiomatic, except possibly for the unnecessary full type constraint on count_vec; you could just use
let mut count_vec: Vec<_> = count.iter().collect();
It's not difficult from context to work out what the full type of count_vec is. You could also omit the type constraint for count entirely, but then you'd have to play shenanigans with your integer literals to have the correct value type inferred. That is to say, an explicit annotation is eminently reasonable in this case.
The other borderline change you could make if you feel like it would be to use |a, b| a.1.cmp(b.1).reverse() for the sort closure. The Ordering::reverse method just reverses the result so that less-than becomes greater-than, and vice versa. This makes it slightly more obvious that you meant what you wrote, as opposed to accidentally transposing two letters.
Can I construct count_vec in a way that consumes the HashMap's data and owns it?
Not in any meaningful way. Just because HashMap is using memory doesn't mean that memory is in any way compatible with Vec. You could use count.into_iter() to consume the HashMap and move the elements out (as opposed to iterating over pointers), but since both char and u32 are trivially copyable, this doesn't really gain you anything.
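If an owned vector is wanted anyway, the into_iter() route mentioned above could be sketched roughly as follows. The function name sorted_counts is made up for illustration, and breaking ties by character is an added assumption so that the order is deterministic; the original code leaves ties in arbitrary order.

```rust
use std::collections::HashMap;

/// Hypothetical helper: counts characters and returns owned (char, count)
/// pairs, most frequent first.
fn sorted_counts(text: &str) -> Vec<(char, u32)> {
    let mut count: HashMap<char, u32> = HashMap::new();
    for c in text.to_lowercase().chars() {
        *count.entry(c).or_insert(0) += 1;
    }
    // into_iter() consumes the map and yields owned (char, u32) pairs.
    let mut count_vec: Vec<(char, u32)> = count.into_iter().collect();
    // Descending by count; ties broken by character (an added assumption).
    count_vec.sort_by(|a, b| b.1.cmp(&a.1).then(a.0.cmp(&b.0)));
    count_vec
}
```

The first element of the returned vector is then the most frequent character together with its count.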
This could be another way to address the matter without the need of an intermediary vector.
// Count the frequency of each letter
let mut count: HashMap<char, u32> = HashMap::new();
for c in text.to_lowercase().chars() {
*count.entry(c).or_insert(0) += 1;
}
let top_char = count.iter().max_by(|a, b| a.1.cmp(&b.1)).unwrap();
println!("Most frequent character in text: {}", top_char.0);
Use BTreeMap for sorted data
BTreeMap sorts its elements by key by default, therefore exchanging the place of your key and value and putting them into a BTreeMap
let count_b: BTreeMap<&u32,&char> = count.iter().map(|(k,v)| (v,k)).collect();
should give you a sorted map according to character frequency.
Characters that share a frequency will overwrite one another, though, so some of them will be lost. If you only want the most frequent character, that does not matter.
You can get the result using
println!("Most frequent character in text: {}", count_b.last_key_value().unwrap().1);
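If losing characters with equal frequency is a problem, one possible variation is to map each count to a list of characters instead of a single one. This is only a sketch; the name chars_by_frequency and the sorting of each group are assumptions added here, not part of the answer above.

```rust
use std::collections::{BTreeMap, HashMap};

/// Hypothetical helper: groups characters by how often they occur.
fn chars_by_frequency(text: &str) -> BTreeMap<u32, Vec<char>> {
    let mut count: HashMap<char, u32> = HashMap::new();
    for c in text.to_lowercase().chars() {
        *count.entry(c).or_insert(0) += 1;
    }
    // Key: frequency, value: every character with that frequency.
    let mut by_freq: BTreeMap<u32, Vec<char>> = BTreeMap::new();
    for (c, n) in count {
        by_freq.entry(n).or_default().push(c);
    }
    // HashMap iteration order is unspecified, so sort each group.
    for group in by_freq.values_mut() {
        group.sort_unstable();
    }
    by_freq
}
```

Calling last_key_value() on the result then yields the highest frequency together with all characters that reach it.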
I can't imagine this hasn't been asked before, but I have searched everywhere and could not find the answer.
I have an iterable, which contains duplicate elements. I want to count number of times each element occurs in this iterable and return n-th most frequent one.
I have working code which does exactly that, but I really doubt it's the optimal way to achieve this.
use std::collections::{BinaryHeap, HashMap};
// returns n-th most frequent element in collection
pub fn most_frequent<T: std::hash::Hash + std::cmp::Eq + std::cmp::Ord>(array: &[T], n: u32) -> &T {
// initialize empty hashmap
let mut map = HashMap::new();
// count occurrences of each element in the iterable and save as (value, count) in the hashmap
for value in array {
// taken from https://doc.rust-lang.org/std/collections/struct.HashMap.html#method.entry
// not exactly sure how this works
let counter = map.entry(value).or_insert(0);
*counter += 1;
}
// determine highest frequency of some element in the collection
let mut heap: BinaryHeap<_> = map.values().collect();
let mut max = heap.pop().unwrap();
// get n-th largest value
for _i in 1..n {
max = heap.pop().unwrap();
}
// find that element (get key from value in hashmap)
// taken from https://stackoverflow.com/questions/59401720/how-do-i-find-the-key-for-a-value-in-a-hashmap
map.iter()
.find_map(|(key, &val)| if val == *max { Some(key) } else { None })
.unwrap()
}
Are there any better ways or more optimal std methods to achieve what I want? Or maybe there are some community made crates that I could use.
Your implementation has a time complexity of Ω(n log n), where n is the length of the array. The optimal solution to this problem has a complexity of Ω(n log k) for retrieving the k-th most frequent element. The usual implementation of this optimal solution indeed involves a binary heap, but not in the way you used it.
Here's a suggested implementation of the common algorithm:
use std::cmp::{Eq, Ord, Reverse};
use std::collections::{BinaryHeap, HashMap};
use std::hash::Hash;
pub fn most_frequent<T>(array: &[T], k: usize) -> Vec<(usize, &T)>
where
T: Hash + Eq + Ord,
{
let mut map = HashMap::new();
for x in array {
*map.entry(x).or_default() += 1;
}
let mut heap = BinaryHeap::with_capacity(k + 1);
for (x, count) in map.into_iter() {
heap.push(Reverse((count, x)));
if heap.len() > k {
heap.pop();
}
}
heap.into_sorted_vec().into_iter().map(|r| r.0).collect()
}
I changed the prototype of the function to return a vector of the k most frequent elements together with their counts, since this is what you need to keep track of anyway. If you only want the k-th most frequent element, you can index the result with [k - 1].1.
The algorithm itself first builds a map of element counts the same way your code does – I just wrote it in a more concise form.
Next, we build a BinaryHeap for the most frequent elements. After each iteration, this heap contains at most k elements, which are the most frequent ones seen so far. If there are more than k elements in the heap, we drop the least frequent element. Since we always drop the least frequent element seen so far, the heap always retains the k most frequent elements seen so far. We need to use the Reverse wrapper to get a min heap, as described in the documentation of BinaryHeap.
Finally, we collect the results into a vector. The into_sorted_vec() function basically does this job for us, but we still want to unwrap the items from the Reverse wrapper, since that wrapper is an implementation detail of our function and should not be returned to the caller.
(In Rust Nightly, we could also use the into_iter_sorted() method, saving one vector allocation.)
The code in this answer makes sure the heap is essentially limited to k elements, so an insertion to the heap has a complexity of Ω(log k). In your code, you push all elements from the array to the heap at once, without limiting the size of the heap, so you end up with a complexity of Ω(log n) for insertions. You essentially use the binary heap to sort a list of counts. Which works, but it's certainly neither the easiest nor the fastest way to achieve that, so there is little justification for going that route.
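If the log k bound is not needed and simplicity matters more, plainly sorting the (element, count) pairs is probably the easiest route. The following is only a sketch of that simpler alternative, not the heap algorithm above; the name kth_most_frequent and the Option return type are choices made here for illustration.

```rust
use std::cmp::Reverse;
use std::collections::HashMap;
use std::hash::Hash;

/// Hypothetical helper: returns the k-th most frequent element (1-based),
/// or None if k is 0 or there are fewer than k distinct elements.
fn kth_most_frequent<T: Hash + Eq>(array: &[T], k: usize) -> Option<&T> {
    let mut map: HashMap<&T, usize> = HashMap::new();
    for x in array {
        *map.entry(x).or_default() += 1;
    }
    let mut counts: Vec<(&T, usize)> = map.into_iter().collect();
    // Sort by descending count; elements with equal counts end up in
    // unspecified relative order.
    counts.sort_unstable_by_key(|&(_, count)| Reverse(count));
    counts.get(k.checked_sub(1)?).map(|&(x, _)| x)
}
```

This sorts all distinct elements, so it costs more than the bounded heap when k is small, but it avoids the unwrap() calls and cannot panic on an out-of-range k.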
I have a vector of strings, and I want to find a string that occurs more than once. I've tried this, but it didn't work.
let strings = vec!["Rust", "Rest", "Rust"]; // I want to find "Rust" in this case
let val = strings
.into_iter()
.find(|x| o.into_iter().filter(|y| x == y).count() >= 2)
// sorry o ^ here is supposed to be strings
.unwrap();
There are two issues in your code:
o doesn't exist. I assume you meant to use strings instead.
into_iter takes ownership of the value, so once you have called into_iter on strings (or o), you can't call it again. You should use plain iter instead.
Here's a fixed version:
let strings = vec!["Rust", "Rest", "Rust"]; // I want to find "Rust" in this case
let val = strings
.iter()
.find(|x| strings.iter().filter(|y| x == y).count() >= 2)
.unwrap();
Note however that this is pretty slow. Depending on your requirements, there are more efficient alternatives:
Sort the strings array first. Then you only need to look at the next item to see if it is duplicated instead of needing to go through the whole array over and over. Advantage: no extra memory used. Drawback: you lose the original order.
Use an auxiliary variable to store the values you've already seen and/or count the number of occurrences of each string. This may be a HashSet, BTreeSet, HashMap or BTreeMap. See @Netwave's answer. Advantage: doesn't change the input array. Drawback: uses memory to keep track of the duplicates.
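The HashSet variant of the second option could look roughly like the sketch below. The function name first_duplicate is invented for illustration. Note that it returns the string whose repeat is encountered first while scanning, which can differ from the find-based version above when several strings are duplicated.

```rust
use std::collections::HashSet;

/// Hypothetical helper: returns the first string that has already been seen
/// earlier in the slice, scanning left to right.
fn first_duplicate<'a>(strings: &[&'a str]) -> Option<&'a str> {
    let mut seen = HashSet::new();
    // HashSet::insert returns false when the value was already present.
    strings.iter().copied().find(|s| !seen.insert(*s))
}
```

This makes a single pass over the input and returns None instead of panicking when there is no duplicate.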
You can count the appearances in O(n) with a tree or table like:
use std::collections::HashMap;
fn main() {
    let strings = vec!["Rust", "Rest", "Rust"];
    // Maps each string to the number of times it appears.
    let mut sorted_data: HashMap<&str, u32> = HashMap::new();
    strings.iter().for_each(|item| {
        if !sorted_data.contains_key(item) {
            sorted_data.insert(item, 0);
        }
        *sorted_data.get_mut(item).unwrap() += 1;
    });
    println!("{:?}", sorted_data);
}
Then just use the one with the biggest count, for example with Iterator::reduce (named fold_first while it was still unstable):
let result = sorted_data.iter().reduce(|(k1, v1), (k2, v2)| { if v2 > v1 {(k2, v2)} else {(k1, v1)}}).unwrap();
I was expecting a Vec::insert_slice(index, slice) method — a solution for strings (String::insert_str()) does exist.
I know about Vec::insert(), but that inserts only one element at a time, not a slice. Alternatively, when the prepended slice is a Vec one can append to it instead, but this does not generalize. The idiomatic solution probably uses Vec::splice(), but using iterators as in the example makes me scratch my head.
Secondly, the whole concept of prepending has seemingly been exorcised from the docs. There isn't a single mention. I would appreciate comments as to why. Note that relatively obscure methods like Vec::swap_remove() do exist.
My typical use case consists of indexed byte strings.
String::insert_str makes use of the fact that a string is essentially a Vec<u8>. It reallocates the underlying buffer, moves all the initial bytes to the end, then adds the new bytes to the beginning.
This is not generally safe and cannot be directly added to Vec because during the copy the Vec is no longer in a valid state: there are "holes" in the data.
This doesn't matter for String because the data is u8 and u8 doesn't implement Drop. There's no such guarantee for an arbitrary T in a Vec, but if you are very careful to track your state and clean up properly, you can do the same thing — this is what splice does!
the whole concept of prepending has seemingly been exorcised
I'd suppose this is because prepending to a Vec is a poor idea from a performance standpoint. If you need to do it, the naïve case is straight-forward:
fn prepend<T>(v: Vec<T>, s: &[T]) -> Vec<T>
where
T: Clone,
{
let mut tmp: Vec<_> = s.to_owned();
tmp.extend(v);
tmp
}
This has a bit higher memory usage as we need to have enough space for two copies of v.
The splice method accepts an iterator of new values and a range of values to replace. In this case, we don't want to replace anything, so we give an empty range of the index we want to insert at. We also need to convert the slice into an iterator of the appropriate type:
let s = &[1, 2, 3];
let mut v = vec![4, 5];
v.splice(0..0, s.iter().cloned());
splice's implementation is non-trivial, but it efficiently does the tracking we need. After removing a chunk of values, it then reuses that chunk of memory for the new values. It also moves the tail of the vector around (maybe a few times, depending on the input iterator). The Drop implementation of Splice ensures that things will always be in a valid state.
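For what it's worth, the splice call can be wrapped into something resembling the insert_slice method the question was hoping for. This is a sketch only: insert_slice is a hypothetical free function, not a std API.

```rust
/// Hypothetical helper, not part of std: inserts a copy of `s` at `index`.
/// Panics if `index > v.len()`, like `Vec::splice` itself.
fn insert_slice<T: Clone>(v: &mut Vec<T>, index: usize, s: &[T]) {
    // An empty range removes nothing, so this is a pure insertion.
    // The returned iterator of removed items is dropped immediately.
    v.splice(index..index, s.iter().cloned());
}
```

Prepending is then just the special case insert_slice(&mut v, 0, s).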
I'm more surprised that VecDeque doesn't support it, as it's designed to be more efficient about modifying both the head and tail of the data.
Taking into consideration what Shepmaster said, you could implement a function prepending a slice with Copyable elements to a Vec just like String::insert_str() does in the following way:
use std::ptr;
unsafe fn prepend_slice<T: Copy>(vec: &mut Vec<T>, slice: &[T]) {
let len = vec.len();
let amt = slice.len();
vec.reserve(amt);
ptr::copy(vec.as_ptr(),
vec.as_mut_ptr().offset((amt) as isize),
len);
ptr::copy(slice.as_ptr(),
vec.as_mut_ptr(),
amt);
vec.set_len(len + amt);
}
fn main() {
let mut v = vec![4, 5, 6];
unsafe { prepend_slice(&mut v, &[1, 2, 3]) }
assert_eq!(&v, &[1, 2, 3, 4, 5, 6]);
}
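For Copy elements the same shifting can probably be done without unsafe, using slice methods from the standard library. The sketch below assumes that temporarily filling the new tail with a throwaway value is acceptable; prepend_slice_safe is a made-up name.

```rust
/// Hypothetical safe counterpart for `Copy` elements.
fn prepend_slice_safe<T: Copy>(vec: &mut Vec<T>, slice: &[T]) {
    // Nothing to do for an empty slice; otherwise reuse its first element
    // as a throwaway fill value so no `Default` bound is needed.
    if let Some(&fill) = slice.first() {
        let len = vec.len();
        let amt = slice.len();
        // Grow by `amt` placeholder elements at the end.
        vec.resize(len + amt, fill);
        // Shift the old contents right by `amt` (overlap is handled).
        vec.copy_within(0..len, amt);
        // Overwrite the front with the new elements.
        vec[..amt].copy_from_slice(slice);
    }
}
```

The price compared to the unsafe version is one extra write pass over the amt placeholder elements.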
Being fairly new to Rust, I was wondering how to create a HashMap with a default value for a key. For example, having a default value of 0 for any key inserted in the HashMap.
In Rust, I know this creates an empty HashMap:
let mut mymap: HashMap<char, usize> = HashMap::new();
I am looking to maintain a counter for a set of keys, for which one way to go about it seems to be:
for ch in "AABCCDDD".chars() {
mymap.insert(ch, 0);
}
Is there a way to do it in a much better way in Rust, maybe something equivalent to what Ruby provides:
mymap = Hash.new(0)
mymap["b"] = 1
mymap["a"] # 0
Answering the problem you have...
I am looking to maintain a counter for a set of keys.
Then you want to look at How to lookup from and insert into a HashMap efficiently?. Hint: *map.entry(key).or_insert(0) += 1
Answering the question you asked...
How does one create a HashMap with a default value in Rust?
No, HashMaps do not have a place to store a default. Doing so would cause every user of that data structure to allocate space to store it, which would be a waste. You'd also have to handle the case where there is no appropriate default, or when a default cannot be easily created.
Instead, you can look up a value using HashMap::get and provide a default if it's missing using Option::unwrap_or:
use std::collections::HashMap;
fn main() {
let mut map: HashMap<char, usize> = HashMap::new();
map.insert('a', 42);
let a = map.get(&'a').cloned().unwrap_or(0);
let b = map.get(&'b').cloned().unwrap_or(0);
println!("{}, {}", a, b); // 42, 0
}
If unwrap_or doesn't work for your case, there are several similar functions that might:
Option::unwrap_or_else
Option::map_or
Option::map_or_else
Of course, you are welcome to wrap this in a function or a data structure to provide a nicer API.
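As an illustration of such a wrapper, here is a minimal sketch of a counter whose missing keys read as 0. The Counter type and its method names are invented for this example and do not exist in the standard library.

```rust
use std::collections::HashMap;
use std::hash::Hash;

/// Hypothetical wrapper: a counter whose missing keys read as 0.
struct Counter<K> {
    counts: HashMap<K, usize>,
}

impl<K: Hash + Eq> Counter<K> {
    fn new() -> Self {
        Counter { counts: HashMap::new() }
    }

    /// Increments the count for `key`, inserting it at 0 first if needed.
    fn add(&mut self, key: K) {
        *self.counts.entry(key).or_insert(0) += 1;
    }

    /// Returns the count for `key`, or 0 if absent; never mutates the map.
    fn get(&self, key: &K) -> usize {
        self.counts.get(key).copied().unwrap_or(0)
    }
}
```

Because get takes &self, looking up a missing key never modifies the map.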
ArtemGr brings up an interesting point:
in C++ there's a notion of a map inserting a default value when a key is accessed. That always seemed a bit leaky though: what if the type doesn't have a default? Rust is less demanding on the mapped types and more explicit about the presence (or absence) of a key.
Rust adds an additional wrinkle to this. Actually inserting a value would require that simply getting a value can also change the HashMap. This would invalidate any existing references to values in the HashMap, as a reallocation might be required. Thus you'd no longer be able to get references to two values at the same time! That would be very restrictive.
What about using entry to get an element from the HashMap, and then modifying it?
From the docs:
fn entry(&mut self, key: K) -> Entry<K, V>
Gets the given key's corresponding entry in the map for in-place
manipulation.
example
use std::collections::HashMap;
let mut letters = HashMap::new();
for ch in "a short treatise on fungi".chars() {
let counter = letters.entry(ch).or_insert(0);
*counter += 1;
}
assert_eq!(letters[&'s'], 2);
assert_eq!(letters[&'t'], 3);
assert_eq!(letters[&'u'], 1);
assert_eq!(letters.get(&'y'), None);
.or_insert() and .or_insert_with()
Adding to the existing example for .entry().or_insert(), I wanted to mention that if the default value passed to .or_insert() is dynamically generated, it's better to use .or_insert_with().
Using .or_insert_with() as below, the default value is not generated if the key already exists. It only gets created when necessary.
for v in 0..s.len() {
components.entry(unions.get_root(v))
.or_insert_with(|| vec![]) // vec only created if needed.
.push(v);
}
In the snippet below, the default vector passed to .or_insert() is generated on every call. If the key exists, a vector is being created and then disposed of, which can be wasteful.
components.entry(unions.get_root(v))
.or_insert(vec![]) // vec always created.
.push(v);
So for fixed values that don't have much creation overhead, use .or_insert(), and for values that have appreciable creation overhead, use .or_insert_with().
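Since the snippets above depend on names that are not shown (components, unions, s), here is a self-contained sketch of the same pattern. Grouping words by length is just an example task chosen here, and group_by_len is a made-up name.

```rust
use std::collections::HashMap;

/// Hypothetical example: groups words by their length in bytes.
fn group_by_len<'a>(words: &[&'a str]) -> HashMap<usize, Vec<&'a str>> {
    let mut groups: HashMap<usize, Vec<&'a str>> = HashMap::new();
    for &w in words {
        groups
            .entry(w.len())
            .or_insert_with(Vec::new) // called only when the key is new
            .push(w);
    }
    groups
}
```

When the value type implements Default, .or_default() should behave the same as .or_insert_with(Vec::new) here and reads a little shorter.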
One way to start a map with initial values is to construct it from a vector of tuples. For instance:
let map = vec![("field1".to_string(), value1), ("field2".to_string(), value2)].into_iter().collect::<HashMap<_, _>>();
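On newer compilers (Rust 1.56 and later, as far as I know), HashMap::from accepts an array of tuples directly, which avoids the intermediate vector. A small sketch with placeholder keys and values:

```rust
use std::collections::HashMap;

/// Hypothetical example data; the keys and values are placeholders.
fn initial_map() -> HashMap<String, i32> {
    HashMap::from([
        ("field1".to_string(), 1),
        ("field2".to_string(), 2),
    ])
}
```

If the same key appears more than once in the array, later entries overwrite earlier ones, just as with collect().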
I have a vector data with a size unknown at compile time. I want to create a new vector of exactly that size. These variants don't work:
let size = data.len();
let mut try1: Vec<u32> = vec![0 .. size]; //ah, you need compile-time constant
let mut try2: Vec<u32> = Vec::new(size); //ah, there is no constructors with arguments
I'm a bit frustrated: there is no information in the Rust API docs, book, reference or rustbyexample.com about how to do such a simple, basic task with a vector.
This solution works, but I don't think it is a good one. It is strange to generate the elements one by one, and I don't need any particular element values:
let mut temp: Vec<u32> = range(0u32, data.len() as u32).collect();
The recommended way of doing this is in fact to form an iterator and collect it to a vector. What you want is not precisely clear, however; if you want [0, 1, 2, …, size - 1], you would create a range and collect it to a vector:
let x = (0..size).collect::<Vec<_>>();
(range(0, size) is now written (0..size); the range function has since been removed.)
If you wish a vector of zeroes, you would instead write it thus:
let x = std::iter::repeat(0).take(size).collect::<Vec<_>>();
If you merely want to preallocate the appropriate amount of space but not push values onto the vector, Vec::with_capacity(capacity) is what you want.
You should also consider whether you need it to be a vector or whether you can work directly with the iterator.
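In current Rust, the vec! macro also accepts a runtime length in its vec![elem; n] form, which is likely the shortest way to get a filled vector. The helper names below are made up for illustration.

```rust
/// Hypothetical helper: a zero-filled vector as long as `data`.
fn zeros_like(data: &[u32]) -> Vec<u32> {
    // The length in vec![elem; n] may be a runtime value.
    vec![0; data.len()]
}

/// Hypothetical helper: an empty vector with room for `data.len()` items.
fn empty_with_room(data: &[u32]) -> Vec<u32> {
    Vec::with_capacity(data.len())
}
```

The first helper returns a vector whose len() equals data.len(); the second returns one with len() of 0 and only the capacity reserved.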
You can use Vec::with_capacity() constructor followed by an unsafe set_len() call:
let n = 128;
let mut v: Vec<u32> = Vec::with_capacity(n);
unsafe { v.set_len(n); }
v[12] = 64; // won't panic
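This way the vector will "extend" over the uninitialized memory. If you're going to use it as a buffer this can work, as long as the type of elements is Copy (primitives are ok, but it will break horribly if the type has a destructor) and every element is written before it is read. Be aware that the documented safety contract of set_len requires the elements up to the new length to be initialized already, so strictly speaking the buffer should be filled before the call.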