Use char as &str in HashMap - string

I want to create a HashMap that maps words (a Vec of &str) and the letters of those words to each other. For example, vec!["ab", "bc", "abc"] will be converted to the following HashMap:
{
    // Letters are keys; the words that contain them are values
    "a" => ["ab", "abc"],
    "b" => ["ab", "bc", "abc"],
    "c" => ["bc", "abc"],
    // Words are keys; the letters they contain are values
    "ab" => ["a", "b"],
    "bc" => ["b", "c"],
    "abc" => ["a", "b", "c"],
}
I tried this code:
let words = vec!["ab", "bc", "abc"];
let mut map: HashMap<&str, Vec<&str>> = HashMap::new();
for word in words.iter() {
    for letter in word.chars() {
        map.entry(letter).or_default().push(word);
        map.entry(word).or_default().push(letter);
    }
}
but there is a problem: letter is of type char but I need a &str because map accepts only &strs as keys. I also tried to convert letter to a &str:
for word in words.iter() {
    for letter in word.chars() {
        let letter = letter.to_string();
        // no changes
but this creates a new String whose lifetime is shorter than map's. In other words, letter is dropped at the end of each iteration of the nested for loop, and I get a compiler error.
How can I use a char in HashMap which accepts only &strs as keys?

I would separate the map into two:
One is char -> Vec<&str>. It maps letters to a list of words.
The second one would be &str -> Vec<char>, but I do not know if you really need it: Why not just iterate over the chars of a &str directly?
Storing a Vec<char> essentially just doubles the amount of memory you use, unless the Vec<char> is e.g. sorted in a particular order (which may or may not be necessary).
If you really want to keep them in one map, I think it is easier to have a HashMap<String, Vec<String>>.
The problem seems to be that chars gives you one char after another, but what you actually want is one &str after another, where each &str covers a single character. I did not find anything like that in the docs for &str, but maybe there is something somewhere that iterates like this.
You could work around it using matches:
let words = vec!["ab", "bc", "abc"];
let mut map: HashMap<&str, Vec<&str>> = HashMap::new();
for word in words.iter() {
    for letter in word.matches(|_| true) { // iterates over &strs that each cover one single character
        map.entry(letter).or_default().push(word);
        map.entry(word).or_default().push(letter);
    }
}
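A minimal sketch of the two-map split suggested at the start of this answer, assuming the letter-to-words direction is the one worth storing. The names letters_to_words and letter_slices are hypothetical and do not come from the question. letter_slices shows another way to get one-character &str slices that borrow from the original word, using char_indices and len_utf8:

```rust
use std::collections::HashMap;

/// Hypothetical helper: maps each letter to the words that contain it.
/// A word is pushed once per occurrence of the letter, as in the code above.
pub fn letters_to_words<'a>(words: &[&'a str]) -> HashMap<char, Vec<&'a str>> {
    let mut map: HashMap<char, Vec<&'a str>> = HashMap::new();
    for &word in words {
        for letter in word.chars() {
            map.entry(letter).or_default().push(word);
        }
    }
    map
}

/// Hypothetical helper: yields one `&str` per `char`, each borrowing from `word`,
/// so nothing is allocated and the slices live as long as `word` does.
pub fn letter_slices(word: &str) -> impl Iterator<Item = &str> {
    word.char_indices()
        .map(move |(start, c)| &word[start..start + c.len_utf8()])
}
```

Because the slices produced by letter_slices borrow from the word itself, they can be stored as keys in a HashMap<&str, Vec<&str>> without the lifetime problem described in the question.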

Related

How do I get a substring of a String object using a character position range?

Say I have a struct Foo that owns a string:
struct Foo {
    owned_string: String,
}
I want to implement some methods on this struct that return substrings from the owned String. For efficiency reasons, I don't want to allocate any new memory for this, I just want the return values to point to the original String.
Let's say I know the substring I want, it's characters 10 through 15.
I can't just slice it like self.owned_string[10..16], since that would give me bytes, not characters.
I can take the characters and collect them into a new String object, like self.owned_string.chars().skip(9).take(6).collect::<String>(), but that creates a new String object. String objects own their strings (AFAIK), so presumably new memory was allocated for this, which is not what I want.
How do I create string slices that reference a substring of a String object, but using character positions? (Without allocating any new memory)
You can use char_indices() then slice the string according to the positions the iterator gives you:
let mut iter = s.char_indices();
let (start, _) = iter.nth(10).unwrap();
let (end, _) = iter.nth(5).unwrap();
let slice = &s[start..end];
However, note that as mentioned in the documentation of chars():
It’s important to remember that char represents a Unicode Scalar Value, and might not match your idea of what a ‘character’ is. Iteration over grapheme clusters may be what you actually want. This functionality is not provided by Rust’s standard library, check crates.io instead.
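One caveat with the snippet above: when the requested end is exactly the end of the string there is no character at that position, so nth returns None and unwrap panics. A sketch of a helper that handles this by also treating s.len() as a boundary. char_slice is a hypothetical name, and the range is assumed to be 0-based with an exclusive end:

```rust
/// Hypothetical helper: returns the slice covering chars `start..end`
/// (0-based, end exclusive), or `None` if the range is out of bounds.
pub fn char_slice(s: &str, start: usize, end: usize) -> Option<&str> {
    if start > end {
        return None;
    }
    // Byte offset of every char boundary, including the one at the very end.
    let mut boundaries = s
        .char_indices()
        .map(|(i, _)| i)
        .chain(std::iter::once(s.len()));
    let from = boundaries.nth(start)?;
    // `nth` already consumed boundaries 0..=start, so count from there.
    let to = if end == start {
        from
    } else {
        boundaries.nth(end - start - 1)?
    };
    Some(&s[from..to])
}
```

Like the original, this walks the string once and allocates nothing; it still works on code points, not grapheme clusters.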
@ChayimFriedman's answer is of course correct; I just wanted to contribute a more telling example:
fn print_string(s: &str) {
    println!("String: {}", s);
}
fn main() {
    let s: String = "🤣😄😁😆😅".to_string();
    let mut iter = s.char_indices();
    // Retrieve the byte position of the char at index 1.
    let (start, _) = iter.nth(1).unwrap();
    // The iterator has now consumed chars 0 and 1, so the next char it
    // yields is at index 2 (what `.next()` or `.nth(0)` would return).
    // `nth(2)` therefore skips chars 2 and 3 and returns the byte
    // position of the char at index 4.
    let (end, _) = iter.nth(2).unwrap();
    // Gives you a &str, which is exactly what you want:
    // a reference to a substring, with no allocation.
    let substring = &s[start..end];
    print_string(&s);
    print_string(substring);
}
String: 🤣😄😁😆😅
String: 😄😁😆
I've done it with smileys because smileys are definitely multi-byte Unicode characters.
As @ChayimFriedman already noted, the reason we have to iterate through char_indices is that Unicode characters are variably sized in UTF-8. They can be anywhere from 1 to 4 bytes long, so the only way to find out where the character boundaries are is to actually read the string up to the character we want.

How to iterate prefixes and suffixes of str or String in rust?

I have a string: "abcd" and I want to:
Iterate its prefixes from shortest to longest:
"", "a", "ab", "abc", "abcd"
Iterate its prefixes from longest to shortest:
"abcd", "abc", "ab", "a", ""
Iterate its suffixes from shortest to longest:
"", "d", "cd", "bcd", "abcd"
Iterate its suffixes from longest to shortest:
"abcd", "bcd", "cd", "d", ""
Strings are more complicated than one might expect
To match human intuition you usually want to treat a string as a sequence of 0 or more grapheme clusters.
A grapheme cluster is a sequence of 1 or more Unicode code points.
In the UTF-8 encoding a code point is represented as a sequence of 1, 2, 3 or 4 bytes.
Both String and str in Rust use UTF-8 to represent strings, and indexes are byte offsets.
Slicing through the middle of a code point makes no sense and would produce garbage data. Rust chooses to panic instead:
#[cfg(test)]
mod tests {
    #[test]
    #[should_panic(expected = "byte index 2 is not a char boundary; it is inside '\\u{306}' (bytes 1..3) of `y̆`")]
    fn bad_index() {
        let y = "y̆";
        &y[2..];
    }
}
To work at the code point level, Rust has:
str.chars()
str.char_indices()
str.is_char_boundary()
Further reading: https://doc.rust-lang.org/book/ch08-02-strings.html
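A short sketch of how these methods fit together, using the same two-code-point string as the panic test ('y' followed by the combining breve U+0306). snap_to_boundary and code_points are hypothetical helper names, not standard library functions:

```rust
/// Hypothetical helper: moves `index` back to the nearest char boundary
/// at or before it, so that slicing at the result cannot panic.
pub fn snap_to_boundary(s: &str, index: usize) -> usize {
    let mut i = index.min(s.len());
    // Offset 0 is always a boundary, so this loop terminates.
    while !s.is_char_boundary(i) {
        i -= 1;
    }
    i
}

/// Hypothetical helper: every code point with the byte offset where it starts.
pub fn code_points(s: &str) -> Vec<(usize, char)> {
    s.char_indices().collect()
}
```

For "y\u{306}", byte offset 2 lies inside the second code point, so snap_to_boundary moves it back to offset 1, where slicing is valid.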
A solution
Warning: this code works at the code point level and is grapheme cluster oblivious.
From shortest to longest:
use core::iter;
pub fn prefixes(s: &str) -> impl Iterator<Item = &str> + DoubleEndedIterator {
    s.char_indices()
        .map(move |(pos, _)| &s[..pos])
        .chain(iter::once(s))
}
pub fn suffixes(s: &str) -> impl Iterator<Item = &str> + DoubleEndedIterator {
    s.char_indices()
        .map(move |(pos, _)| &s[pos..])
        .chain(iter::once(""))
        .rev()
}
In reverse:
prefixes(s).rev()
suffixes(s).rev()
See also: How to iterate prefixes or suffixes of vec or slice in rust?

How to find a string of multiple occurrences in a vector?

I have a vector of strings, and I want to find a string that occurs more than once. I've tried this, but it didn't work.
let strings = vec!["Rust", "Rest", "Rust"]; // I want to find "Rust" in this case
let val = strings
    .into_iter()
    .find(|x| o.into_iter().filter(|y| x == y).count() >= 2)
    // sorry o ^ here is supposed to be strings
    .unwrap();
There are two issues in your code:
o doesn't exist. I assume you meant to use strings instead.
into_iter takes ownership of the value, so once you have called into_iter on strings (or o), you can't call it again. You should use plain iter instead.
Here's a fixed version:
let strings = vec!["Rust", "Rest", "Rust"]; // I want to find "Rust" in this case
let val = strings
    .iter()
    .find(|x| strings.iter().filter(|y| x == y).count() >= 2)
    .unwrap();
Note however that this is pretty slow. Depending on your requirements, there are more efficient alternatives:
Sort the strings array first. Then you only need to look at the next item to see if it is duplicated instead of needing to go through the whole array over and over. Advantage: no extra memory used. Drawback: you lose the original order.
Use an auxiliary variable to store the values you've already seen and/or count the number of occurrences of each string. This may be a HashSet, BTreeSet, HashMap or BTreeMap. See @Netwave's answer. Advantage: doesn't change the input array. Drawback: uses memory to keep track of the duplicates.
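A sketch of the second alternative with a HashSet, assuming it is enough to return the first string that repeats. first_duplicate is a hypothetical name. It relies on HashSet::insert returning false when the value was already present:

```rust
use std::collections::HashSet;

/// Hypothetical helper: returns the first item that has already been seen
/// earlier in the slice, or `None` if every item is unique.
pub fn first_duplicate<'a>(items: &[&'a str]) -> Option<&'a str> {
    let mut seen = HashSet::new();
    // `insert` returns false for a value that is already in the set.
    items.iter().copied().find(|item| !seen.insert(*item))
}
```

This makes a single pass over the input and leaves its order untouched, at the cost of the extra set.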
You can count the occurrences in O(n) with a hash table (or in O(n log n) with a tree) like this:
use std::collections::HashMap;

fn main() {
    let strings = vec!["Rust", "Rest", "Rust"];
    // Maps each string to the number of times it occurs.
    let mut counts: HashMap<&str, u32> = HashMap::new();
    for &item in &strings {
        *counts.entry(item).or_insert(0) += 1;
    }
    println!("{:?}", counts);
}
Then just use the entry with the biggest count, for example with reduce (called fold_first before it was stabilized):
let result = counts.iter().reduce(|(k1, v1), (k2, v2)| if v2 > v1 { (k2, v2) } else { (k1, v1) }).unwrap();

How can I append a char or &str to a String without first converting it to String?

I am attempting to write a lexer for fun, however something keeps bothering me.
let mut chars: Vec<char> = Vec::new();
let mut contents = String::new();
let mut tokens: Vec<&String> = Vec::new();
let mut append = String::new();
//--snip--
for _char in chars {
    append += &_char.to_string();
    append = append.trim().to_string();
    if append.contains("print") {
        println!("print found at: \n{}", append);
        append = "".to_string();
    }
}
Any time I want to do something as simple as append a &str to a String I have to convert it using .to_string, String::from(), .to_owned, etc.
Is there something I am doing wrong, so that I don't have to constantly do this, or is this the primary way of appending?
If you're trying to do something with a type, check the documentation. From the documentation for String:
push: "Appends the given char to the end of this String."
push_str: "Appends a given string slice onto the end of this String."
It's important to understand the differences between String and &str, and why different methods accept and return each of them.
A &str or &mut str are usually preferred in function arguments and return types. That's because they are just pointers to data so nothing needs to be copied or moved when they are passed around.
A String is returned when a function needs to do some new allocation, while &str and &mut str are slices into an existing String. Even though &mut str is mutable, you can't mutate it in a way that increases its length because that would require additional allocation.
The trim function is able to return a &str slice because that doesn't involve mutating the original string - a trimmed string is just a substring, which a slice perfectly describes. But sometimes that isn't possible; for example, a function that pads a string with an extra character would have to return a String because it would be allocating new memory.
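To illustrate the distinction, a small sketch with two hypothetical functions. trimmed can hand back a slice of its input, while pad_left has to build a new String because the result is longer than anything that already exists in memory. Width is counted in chars here, which is an assumption made for the example:

```rust
/// Returns a view into `s`; no allocation is needed.
pub fn trimmed(s: &str) -> &str {
    s.trim()
}

/// Hypothetical helper: left-pads `s` with `fill` up to `width` chars.
/// The result does not exist anywhere yet, so it must be an owned String.
pub fn pad_left(s: &str, width: usize, fill: char) -> String {
    let len = s.chars().count();
    let mut out = String::with_capacity(s.len() + width.saturating_sub(len) * fill.len_utf8());
    // The range is empty when `s` is already at least `width` chars long.
    for _ in len..width {
        out.push(fill);
    }
    out.push_str(s);
    out
}
```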
You can reduce the number of type conversions in your code by choosing different methods:
for c in chars {
    append.push(c); // instead of: append += &_char.to_string();
    append = append.trim().to_string();
    if append.contains("print") {
        println!("print found at: \n{}", append);
        append.clear(); // instead of: append = "".to_string();
    }
}
There isn't a trim_in_place method on String in the standard library, so re-assigning the trimmed copy the way you have done is the simplest way.
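If the extra allocation matters, the same effect can be had in place with truncate and drain, both of which exist on String. A sketch, where trim_in_place is a hypothetical helper name and not a standard method:

```rust
/// Hypothetical helper: removes leading and trailing whitespace from `s`
/// without allocating a new String.
pub fn trim_in_place(s: &mut String) {
    // Cut off trailing whitespace first; `trim_end` ends on a char boundary.
    let end = s.trim_end().len();
    s.truncate(end);
    // Number of leading whitespace bytes, then shift the rest to the front.
    let start = s.len() - s.trim_start().len();
    s.drain(..start);
}
```

In the loop above this would replace the line append = append.trim().to_string(); with trim_in_place(&mut append);.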

How to iterate through characters in a string in Rust to match words?

I'd like to iterate through a sentence to extract out simple words from the string. Here's what I have so far, trying to make the parse function first match world in the input string:
fn parse(input: String) -> String {
    let mut val = String::new();
    for c in input.chars() {
        if c == "w".to_string() {
            // guessing I have to test one character at a time
            val.push_str(c.to_str());
        }
    }
    return val;
}
fn main() {
    let s = "Hello world!".to_string();
    println!("{}", parse(s)); // should say "world"
}
What is the correct way to iterate through the characters in a string to match patterns in Rust (such as for a basic parser)?
Checking for words in a string is easy with the str::contains method.
As for writing a parser itself, I don't think it's any different in Rust than other languages. You have to create some sort of state machine.
For example, you could check out serialize::json. I also wrote a CSV parser that uses a buffer with a convenient read_char method. The advantage of using this approach is that you don't need to load the whole input into memory at once.
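To make the state-machine idea concrete, here is a sketch that assumes a "word" is a run of alphanumeric characters. extract_words is a hypothetical helper with two states (inside a word, outside a word) driven by char_indices; has_word then checks for a whole-word match, which str::contains alone does not guarantee:

```rust
/// Hypothetical helper: splits `input` into runs of alphanumeric characters.
/// `word_start` is the state: `Some(offset)` while inside a word, `None` otherwise.
pub fn extract_words(input: &str) -> Vec<&str> {
    let mut words = Vec::new();
    let mut word_start: Option<usize> = None;
    for (i, c) in input.char_indices() {
        match (word_start, c.is_alphanumeric()) {
            // Outside a word and we hit a word character: a word begins.
            (None, true) => word_start = Some(i),
            // Inside a word and we hit a separator: the word ends here.
            (Some(start), false) => {
                words.push(&input[start..i]);
                word_start = None;
            }
            // Otherwise the state does not change.
            _ => {}
        }
    }
    // Flush a word that runs to the end of the input.
    if let Some(start) = word_start {
        words.push(&input[start..]);
    }
    words
}

/// Hypothetical helper: true if `word` appears as a whole word in `input`.
pub fn has_word(input: &str, word: &str) -> bool {
    extract_words(input).iter().any(|w| *w == word)
}
```

For the input from the question, extract_words("Hello world!") yields "Hello" and "world". Note that "Hello worlds".contains("world") is true even though "world" is not a separate word there, which is where the state machine earns its keep.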

Resources