Convert String to Vec<char> at compile time for pattern matching - rust

I'm writing a parser in Rust and I'm creating tokens from a Vec<char>. Currently, my code looks like
match &source[..] {
['l', 'e', 't', ..] => ...,
['t', 'r', 'u', 'e', ..] => ...,
_ => ...
}
Obviously this is a lot more verbose than I'd like, and not easy to read. Is there any way I can convert "let" to ['l', 'e', 't'] at compile time (with a macro or const function) in order to pattern match on it like this?

I don't think that you can do that with the macros from the Rust standard library, but you could write your own macro:
use proc_macro::{TokenStream, TokenTree, Group, Delimiter, Punct, Literal, Spacing};
use syn::{parse_macro_input, LitStr};
#[proc_macro]
pub fn charize(input: TokenStream) -> TokenStream {
// some stuff for later
let comma_token = TokenTree::Punct(Punct::new(',', Spacing::Alone));
let rest_token_iterator = std::iter::once(TokenTree::Punct(Punct::new('.', Spacing::Joint))).chain(std::iter::once(TokenTree::Punct(Punct::new('.', Spacing::Alone))));
let string_to_charize: String = parse_macro_input!(input as LitStr).value();
let char_tokens_iterator = string_to_charize.chars().map(|char| TokenTree::Literal(Literal::character(char)));
// if you are on nightly, Iterator::intersperse() is much cleaner than this (https://doc.rust-lang.org/std/iter/trait.Iterator.html#method.intersperse)
let char_tokens_interspersed_iterator = char_tokens_iterator.map(|token| [comma_token.clone(), token]).flatten().skip(1);
let char_tokens_interspersed_with_rest_iterator = char_tokens_interspersed_iterator.chain(std::iter::once(comma_token.clone())).chain(rest_token_iterator);
std::iter::once(TokenTree::Group(Group::new(Delimiter::Bracket, char_tokens_interspersed_with_rest_iterator.collect()))).collect()
}
Macro in action:
match &['d', 'e', 'm', 'o', 'n', 's', 't', 'r', 'a', 't', 'i', 'o', 'n'][..] {
charize!("doesn't match") => println!("Does not match"),
charize!("demo") => println!("It works"),
charize!("also doesn't match") => println!("Does not match"),
_ => panic!("Does not match")
}
Note that this is a procedural macro and as such must live in a proc_macro crate.
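If a separate proc-macro crate is more machinery than the parser warrants, a match guard is a plainer alternative. The following is only a minimal sketch and not the macro approach above: `starts_with_str`, `Token`, and `next_token` are hypothetical names introduced for illustration, and the prefix comparison happens at run time rather than at compile time.

```rust
/// Hypothetical helper: true when `source` begins with the characters of `prefix`.
fn starts_with_str(source: &[char], prefix: &str) -> bool {
    let mut rest = source.iter();
    // `all` stops at the first mismatch, or when `source` runs out of chars.
    prefix.chars().all(|expected| rest.next() == Some(&expected))
}

/// Hypothetical token type, for illustration only.
#[derive(Debug, PartialEq)]
enum Token {
    Let,
    True,
    Unknown,
}

fn next_token(source: &[char]) -> Token {
    match source {
        s if starts_with_str(s, "let") => Token::Let,
        s if starts_with_str(s, "true") => Token::True,
        _ => Token::Unknown,
    }
}

fn main() {
    let source: Vec<char> = "let x = true;".chars().collect();
    println!("{:?}", next_token(&source));
    println!("{:?}", next_token(&source[8..]));
}
```

The keywords stay readable as string literals, at the cost of losing the exhaustiveness and overlap checks the compiler performs on real slice patterns.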


How to find most similar string using n-grams

I am trying to use n-grams to find the most similar string for each string in a list. Currently I have this array of strings:
let arr = [
"Bilbo Baggins",
"Gandalf",
"Thorin",
"Balin",
"Kili",
"Fili",
"John",
"Frodo Baggins",
];
Using the following code I create the bigrams for each string and store them in a vector:
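// Needs `use std::ops::Rem;` and, for the unstable `Iterator::array_chunks`,
// a nightly compiler with `#![feature(iter_array_chunks)]`.
let arr = [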
"Bilbo Baggins",
"Gandalf",
"Thorin",
"Balin",
"Kili",
"Fili",
"John",
"Frodo Baggins",
]
.iter()
.map(|elem|
elem
.len()
.rem(2)
.ne(&0)
.then_some(format!("{elem} "))
.unwrap_or(elem.to_string())
)
.map(|elem| elem.chars().array_chunks().collect::<Vec<[char; 2]>>())
.collect::<Vec<_>>();
Output:
[['B', 'i'], ['l', 'b'], ['o', ' '], ['B', 'a'], ['g', 'g'], ['i', 'n'], ['s', ' ']]
[['G', 'a'], ['n', 'd'], ['a', 'l'], ['f', ' ']]
[['T', 'h'], ['o', 'r'], ['i', 'n']]
[['B', 'a'], ['l', 'i'], ['n', ' ']]
[['K', 'i'], ['l', 'i']]
[['F', 'i'], ['l', 'i']]
[['J', 'o'], ['h', 'n']]
[['F', 'r'], ['o', 'd'], ['o', ' '], ['B', 'a'], ['g', 'g'], ['i', 'n'], ['s', ' ']]
The question is: how can I apply some sort of set logic to these vectors of bigrams to find the most similar string for each of the strings, and get roughly the following output?
'Bilbo Baggins' most similar string: 'Frodo Baggins'
'Gandalf' most similar string: None
'Thorin' most similar string: 'Balin'
'Balin' most similar string: 'Thorin'
'Kili' most similar string: 'Fili'
'Fili' most similar string: 'Kili'
'John' most similar string: None
'Frodo Baggins' most similar string: 'Bilbo Baggins'
There are many different algorithms for measuring how far apart two strings are. The one you are looking for with your bigrams is probably cosine similarity. It turns the bigram counts of each string into a vector and computes a single value that represents how similar the two vectors (and so the two strings) are. It seems to favour matching to longer strings because there are more groups of characters repeated between them.
Here is an example of finding the closest names by cosine similarity:
use std::collections::HashSet;
const SIM_THRESHOLD: f32 = 0.22;
fn main() {
let arr = [
"Bilbo Baggins",
"Gandalf",
"Thorin",
"Balin",
"Kili",
"Fili",
"John",
"Frodo Baggins",
];
for name in arr {
let mut closest = (None, -1.0);
for other in arr.iter().filter(|&e| *e != name) {
let sim = str_diff(name, other);
if sim > closest.1 && sim > SIM_THRESHOLD {
closest = (Some(*other), sim);
}
}
println!(
"\"{}\" most similar string: {}",
name,
if let Some(name) = closest.0 {
format!("\"{}\"", name)
} else {
"None".to_string()
}
);
}
}
/// Returns the bigram cosine similarity between two strings. A `1` means the
/// strings are identical, while a `0` means they are completely different.
/// Returns NaN if both strings are fewer than 2 characters.
/// Assumes ASCII input: `ngram` slices by byte index and can panic when an
/// index falls inside a multi-byte character.
pub fn str_diff(a: &str, b: &str) -> f32 {
cos_sim(&ngram(&a, &b, 2))
}
// Returns the term frequency of `n` consecutive characters between two strings.
// The order of the terms is not guaranteed, but will always be consistent
// between the two returned vectors (order could be guaranteed with a BTreeSet,
// but that is slower).
fn ngram(s1: &str, s2: &str, n: usize) -> (Vec<u32>, Vec<u32>) {
let mut grams = HashSet::<&str>::new();
for i in 0..((s1.len() + 1).saturating_sub(n)) {
grams.insert(&s1[i..(i + n)]);
}
for i in 0..((s2.len() + 1).saturating_sub(n)) {
grams.insert(&s2[i..(i + n)]);
}
let mut q1 = Vec::new();
let mut q2 = Vec::new();
for i in grams {
q1.push(s1.matches(i).count() as u32);
q2.push(s2.matches(i).count() as u32);
}
(q1, q2)
}
// Returns the dot product of two slices of equal length. Returns an `Err` if
// the slices are not of equal length.
fn dot_prod(a: &[u32], b: &[u32]) -> Result<u32, &'static str> {
if a.len() != b.len() {
return Err("Slices must be of equal length");
}
// multiply element-wise and sum, without building a temporary Vec
Ok(a.iter().zip(b).map(|(x, y)| x * y).sum())
}
// Returns the cosine similarity between two vectors of equal length.
// `S_c(A, B) = (A · B) / (||A|| ||B||)`
fn cos_sim((a, b): &(Vec<u32>, Vec<u32>)) -> f32 {
if a.len() != b.len() {
return f32::NAN;
}
let a_mag = (dot_prod(a, a).unwrap() as f32).sqrt();
let b_mag = (dot_prod(b, b).unwrap() as f32).sqrt();
// use `clamp` to constrain floating point errors within 0..=1
(dot_prod(a, b).unwrap() as f32 / (a_mag * b_mag)).clamp(0.0, 1.0)
}
"Bilbo Baggins" most similar string: "Frodo Baggins"
"Gandalf" most similar string: None
"Thorin" most similar string: "Balin"
"Balin" most similar string: "Bilbo Baggins"
"Kili" most similar string: "Fili"
"Fili" most similar string: "Kili"
"John" most similar string: None
"Frodo Baggins" most similar string: "Bilbo Baggins"
The Balin result doesn't look quite right, because cosine similarity doesn't take string length into account. Another popular method is to find the Levenshtein distance (I used the Wagner-Fischer algorithm to compute it), which is the number of insertions, deletions, or substitutions needed to transform one string into another:
use std::cmp::min;
fn main() {
let arr = [
"Bilbo Baggins",
"Gandalf",
"Thorin",
"Balin",
"Kili",
"Fili",
"John",
"Frodo Baggins",
];
for name in arr {
let mut closest = (None, usize::MAX);
for other in arr.iter().filter(|&e| *e != name) {
let dist = distance(name, *other);
if dist < closest.1 && dist < min(name.len(), other.len()) {
closest = (Some(*other), dist);
}
}
println!(
"\"{}\" most similar string: {}",
name,
if let Some(name) = closest.0 {
format!("\"{}\"", name)
} else {
"None".to_string()
}
);
}
}
/// Calculates the Levenshtein Distance between 2 strings
fn distance(a: &str, b: &str) -> usize {
let a = a.chars().collect::<Vec<char>>();
let b = b.chars().collect::<Vec<char>>();
let mut d = vec![vec![0; b.len() + 1]; a.len() + 1];
for i in 1..=a.len() {
d[i][0] = i;
}
for j in 1..=b.len() {
d[0][j] = j;
}
for j in 1..=b.len() {
for i in 1..=a.len() {
// a substitution costs nothing when the characters already match
let cost = if a[i - 1] == b[j - 1] { 0 } else { 1 };
d[i][j] = min(min(d[i - 1][j] + 1, d[i][j - 1] + 1), d[i - 1][j - 1] + cost);
}
}
d[a.len()][b.len()]
}
"Bilbo Baggins" most similar string: "Frodo Baggins"
"Gandalf" most similar string: None
"Thorin" most similar string: "Balin"
"Balin" most similar string: "Kili"
"Kili" most similar string: "Fili"
"Fili" most similar string: "Kili"
"John" most similar string: None
"Frodo Baggins" most similar string: "Bilbo Baggins"
Balin still isn't quite what you want, since it takes the fewest modifications to get to Kili, but it definitely looks closer. Hopefully this helps lead you to where you want to go, but you may need to use a combination of algorithms, or find one that weights the beginnings and ends of words differently, if you want Balin to match Thorin.
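Since the question asks specifically about set logic, one further option is Jaccard similarity over sets of bigrams: the size of the intersection divided by the size of the union. This is a different measure from the two above, shown only as a minimal sketch. The names `bigrams` and `jaccard` are hypothetical, and it uses overlapping bigrams (via `windows(2)`) rather than the padded, non-overlapping chunks from the question.

```rust
use std::collections::HashSet;

/// Collects the set of overlapping character bigrams of `s`.
fn bigrams(s: &str) -> HashSet<(char, char)> {
    let chars: Vec<char> = s.chars().collect();
    chars.windows(2).map(|w| (w[0], w[1])).collect()
}

/// Jaccard similarity |A ∩ B| / |A ∪ B| of the two bigram sets.
/// Returns 0.0 when neither string has any bigram.
fn jaccard(a: &str, b: &str) -> f32 {
    let (a, b) = (bigrams(a), bigrams(b));
    let union = a.union(&b).count();
    if union == 0 {
        return 0.0;
    }
    a.intersection(&b).count() as f32 / union as f32
}

fn main() {
    let names = ["Bilbo Baggins", "Kili", "Fili", "John", "Frodo Baggins"];
    for name in names {
        for other in names.iter().filter(|&&o| o != name) {
            println!("{name} / {other}: {}", jaccard(name, other));
        }
    }
}
```

As with the other measures, a threshold would still be needed to report `None` for names that share no meaningful bigrams with any other name.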

Why use .iter() in a for loop | Rust [duplicate]

This question already has answers here:
What is the difference between iter and into_iter?
In these two examples is there any benefit in using .iter() in a for loop?
let chars = ['g', 'd', 'k', 'k', 'n'];
for i in chars {
println!("{}", i);
}
let chars = ['g', 'd', 'k', 'k', 'n'];
for i in chars.iter() {
println!("{}", i);
}
for i in array is interpreted by the compiler as for i in array.into_iter().
This means that you are iterating over elements of type char, and the array is copied (as an array is Copy if its elements are also Copy).
On the other hand, for i in array.iter() only borrows the array and iterates over elements of type &char, avoiding the copy.
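The difference can be sketched as follows. The function names `collect_by_value` and `collect_by_ref` are hypothetical, and the `Vec<String>` part is an added illustration of what changes when the element type is not `Copy`.

```rust
/// Iterates the array by value; the loop items are `char`.
fn collect_by_value(chars: [char; 5]) -> String {
    let mut out = String::new();
    for c in chars {
        out.push(c);
    }
    out
}

/// Borrows the array; the loop items are `&char`.
fn collect_by_ref(chars: &[char; 5]) -> String {
    let mut out = String::new();
    for c in chars.iter() {
        out.push(*c);
    }
    out
}

fn main() {
    let chars = ['g', 'd', 'k', 'k', 'n'];
    let a = collect_by_value(chars); // `chars` is copied, not moved
    let b = collect_by_ref(&chars); // so it is still usable here
    assert_eq!(a, b);

    // With a non-Copy collection the choice matters more.
    let words = vec![String::from("iter"), String::from("into_iter")];
    let total: usize = words.iter().map(|w| w.len()).sum(); // only borrows
    for w in words {
        // `w` is an owned `String`; `words` has been moved into the loop.
        println!("{w}");
    }
    // `words` can no longer be used at this point.
    println!("{total}");
}
```

For a small `Copy` array like the one in the question, either form is fine; `.iter()` matters when the collection must remain usable and its elements are not `Copy`.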

Efficient ways to build new Strings in Rust

I have just recently started learning Rust and have been messing around with some code. I wanted to create a simple function that removes vowels from a String and returns a new String. The code below functions, but I was concerned if this truly was a valid, typical approach in this language or if I'm missing something...
// remove vowels by building a String using .contains() on a vowel array
fn remove_vowels(s: String) -> String {
let mut no_vowels: String = String::new();
for c in s.chars() {
if !['a', 'e', 'i', 'o', 'u'].contains(&c) {
no_vowels += &c.to_string();
}
}
return no_vowels;
}
First, using to_string() to construct a new String and then using & to borrow just seemed off. Is there a simpler way to append characters to a String, or is this the only way to go? Or should I rewrite this entirely and iterate through the inputted String using a loop by its length, not by a character array?
Also, I have been informed that it's quite popular in Rust to not use the return statement but to instead let the last expression return the value from the function. Is my return statement required here, or is there a cleaner way to return that value that follows convention?
If you consume the original String as your example does, you can remove the vowels in-place using retain(), which will avoid allocating a new string:
fn remove_vowels(mut s: String) -> String {
s.retain(|c| !['a', 'e', 'i', 'o', 'u'].contains(&c));
s
}
Side note: you may want to consider uppercase vowels as well.
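On the question of appending a character, `String::push` takes a `char` directly, so no temporary `String` is needed. A minimal sketch that also treats uppercase vowels as vowels, as suggested in the side note, could look like this; the helper `is_vowel` is a hypothetical name and only the ASCII vowels are considered.

```rust
/// Hypothetical helper: true for ASCII vowels in either case.
fn is_vowel(c: char) -> bool {
    matches!(c.to_ascii_lowercase(), 'a' | 'e' | 'i' | 'o' | 'u')
}

fn remove_vowels(s: &str) -> String {
    // The result is never longer than the input, so reserve that much up front.
    let mut no_vowels = String::with_capacity(s.len());
    for c in s.chars() {
        if !is_vowel(c) {
            no_vowels.push(c); // appends a single `char`
        }
    }
    no_vowels // the last expression is the return value; no `return` needed
}

fn main() {
    println!("{}", remove_vowels("Hello World"));
}
```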
You can use collect on an iterator of characters to create a String. You can filter out the characters you don't want using filter.
// remove vowels by building a String using .contains() on a vowel array
fn remove_vowels(s: &str) -> String {
s.chars()
.filter(|c| !['a', 'e', 'i', 'o', 'u'].contains(c))
.collect()
}
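If this is in a performance-critical region, then since you know the characters you're removing are single bytes in UTF-8, it is OK to remove them directly from the bytes instead: every byte of a multi-byte UTF-8 sequence has its high bit set, so an ASCII byte can never be part of another character. This means you can write something like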
fn remove_vowels(s: &str) -> String {
String::from_utf8(
s.bytes()
.filter(|c| ![b'a', b'e', b'i', b'o', b'u'].contains(c))
.collect()
).unwrap()
}
which may be more efficient.
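The claim that dropping ASCII bytes leaves the rest of the string intact can be checked on input that mixes ASCII and multi-byte characters. This is only a small sketch; the name `remove_ascii_vowels` is hypothetical, and it uses `expect` in place of `unwrap` to document why the conversion should not fail.

```rust
fn remove_ascii_vowels(s: &str) -> String {
    let kept: Vec<u8> = s
        .bytes()
        .filter(|b| !b"aeiou".contains(b))
        .collect();
    // Only whole single-byte characters were removed, so the remaining
    // bytes still form valid UTF-8.
    String::from_utf8(kept).expect("removing ASCII bytes keeps UTF-8 valid")
}

fn main() {
    // 'ï' and 'é' are two bytes each and are left untouched.
    assert_eq!(remove_ascii_vowels("naïve café"), "nïv cfé");
    println!("{}", remove_ascii_vowels("naïve café"));
}
```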

Creating a sliding window iterator of slices of chars from a String

I am looking for the best way to go from String to Windows<T> using the windows function provided for slices.
I understand how to use windows this way:
fn main() {
let tst = ['a', 'b', 'c', 'd', 'e', 'f', 'g'];
let mut windows = tst.windows(3);
// prints ['a', 'b', 'c']
println!("{:?}", windows.next().unwrap());
// prints ['b', 'c', 'd']
println!("{:?}", windows.next().unwrap());
// etc...
}
But I am a bit lost when working this problem:
fn main() {
let tst = String::from("abcdefg");
let inter = ? //somehow create slice of character from tst
let mut windows = inter.windows(3);
// prints ['a', 'b', 'c']
println!("{:?}", windows.next().unwrap());
// prints ['b', 'c', 'd']
println!("{:?}", windows.next().unwrap());
// etc...
}
Essentially, I am looking for how to convert a string into a char slice that I can use the window method with.
The problem that you are facing is that String is really represented as something like a Vec<u8> under the hood, with some APIs to let you access chars. In UTF-8 the representation of a code point can be anything from 1 to 4 bytes, and they are all compacted together for space-efficiency.
The only slice you could get directly of an entire String, without copying everything, would be a &[u8], but you wouldn't know if the bytes corresponded to whole or just parts of code points.
The char type corresponds exactly to a code point, and therefore has a size of 4 bytes, so that it can accommodate any possible value. So, if you build a slice of char by copying from a String, the result could be up to 4 times larger.
To avoid making a potentially large, temporary memory allocation, you should consider a more lazy approach – iterate through the String, making slices at exactly the char boundaries. Something like this:
fn char_windows<'a>(src: &'a str, win_size: usize) -> impl Iterator<Item = &'a str> {
src.char_indices()
.flat_map(move |(from, _)| {
src[from ..].char_indices()
// the last char of the window; assumes win_size >= 1
.nth(win_size - 1)
.map(|(to, c)| {
&src[from .. from + to + c.len_utf8()]
})
})
}
This will give you an iterator where the items are &str, each with 3 chars:
let windows = char_windows(&tst, 3);
for win in windows {
println!("{:?}", win);
}
The nice thing about this approach is that it hasn't done any copying at all - each &str produced by the iterator is still a slice into the original source String.
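Both properties, windows cut at char boundaries and no copying, can be checked on a string with multi-byte characters. The sketch below repeats `char_windows` so that it is self-contained (written with `nth`, which is equivalent to `skip(..).next()`); the sample text is an arbitrary choice, and `win_size` is assumed to be at least 1.

```rust
fn char_windows<'a>(src: &'a str, win_size: usize) -> impl Iterator<Item = &'a str> {
    src.char_indices().flat_map(move |(from, _)| {
        src[from..]
            .char_indices()
            .nth(win_size - 1) // last char of the window; assumes win_size >= 1
            .map(|(to, c)| &src[from..from + to + c.len_utf8()])
    })
}

fn main() {
    // Three 3-byte characters followed by one 1-byte character.
    let text = "日本語x";
    let windows: Vec<&str> = char_windows(text, 3).collect();
    assert_eq!(windows, ["日本語", "本語x"]);

    // The first window starts at the same address as `text`: it is a
    // sub-slice of the original string, not a copy.
    assert_eq!(windows[0].as_ptr(), text.as_ptr());
    println!("{:?}", windows);
}
```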
All of that complexity is because Rust strings are always UTF-8 encoded. If you absolutely know that your input string doesn't contain any multi-byte characters, you can treat it as ASCII bytes, and taking slices becomes easy:
let tst = String::from("abcdefg");
let inter = tst.as_bytes();
let mut windows = inter.windows(3);
However, you now have slices of bytes, and you'll need to turn them back into strings to do anything with them:
for win in windows {
println!("{:?}", String::from_utf8_lossy(win));
}
This solution will work for your purpose.
fn main() {
let tst = String::from("abcdefg");
let inter = tst.chars().collect::<Vec<char>>();
let mut windows = inter.windows(3);
// prints ['a', 'b', 'c']
println!("{:?}", windows.next().unwrap());
// prints ['b', 'c', 'd']
println!("{:?}", windows.next().unwrap());
// etc...
println!("{:?}", windows.next().unwrap());
}
A String can iterate over its chars, but it's not a slice of chars, so you have to collect them into a Vec<char>, which then coerces into a slice.
You can use itertools to walk over windows of any iterator, up to a width of 4 in the version used here (0.7.8):
extern crate itertools; // 0.7.8
use itertools::Itertools;
fn main() {
let input = "日本語";
for (a, b) in input.chars().tuple_windows() {
println!("{}, {}", a, b);
}
}
See also:
Are there equivalents to slice::chunks/windows for iterators to loop over pairs, triplets etc?

How do I convert from a char array [char; N] to a string slice &str?

Given a fixed-length char array such as:
let s: [char; 5] = ['h', 'e', 'l', 'l', 'o'];
How do I obtain a &str?
You can't without some allocation, which means you will end up with a String.
let s2: String = s.iter().collect();
The problem is that strings in Rust are not collections of chars, they are UTF-8, which is an encoding without a fixed size per character.
For example, the array in this case would take 5 × 32 bits, for a total of 20 bytes. The data of the string would take 5 bytes total (although there are also 3 pointer-sized values, so the overall String takes more memory in this case).
We start with the array and call slice::iter, which yields values of type &char. We then use Iterator::collect to convert the Iterator<Item = &char> into a String. This uses the iterator's size_hint to pre-allocate space in the String, reducing the need for extra allocations.
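The size figures above can be checked with a few assertions. This is a minimal sketch: `chars_to_string` is a hypothetical helper name, and the second array with multi-byte characters is an added illustration that the UTF-8 length depends on the characters themselves.

```rust
use std::mem::size_of_val;

/// Hypothetical helper: builds an owned `String` from a slice of chars.
fn chars_to_string(chars: &[char]) -> String {
    chars.iter().collect()
}

fn main() {
    let s: [char; 5] = ['h', 'e', 'l', 'l', 'o'];
    let s2 = chars_to_string(&s);

    assert_eq!(size_of_val(&s), 20); // 5 chars at 4 bytes each
    assert_eq!(s2.len(), 5); // 5 bytes of UTF-8 data

    // Once the String exists, a `&str` is just a borrow of it.
    let view: &str = &s2;
    assert_eq!(view, "hello");

    // Non-ASCII characters need more than one byte each in UTF-8.
    let wide = ['日', '本'];
    assert_eq!(size_of_val(&wide), 8);
    assert_eq!(chars_to_string(&wide).len(), 6);
}
```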
Another quick one-liner I didn't see above:
let whatever_char_array = ['h', 'e', 'l', 'l', 'o'];
let string_from_char_array = String::from_iter(whatever_char_array);
Note:
This feature (iterating over an array) was introduced recently. I tried looking for the exact rustc version, but could not...
I will give you a very simple functional solution but it's not the best one. You can learn some basics:
let s: [char; 5] = ['h', 'e', 'l', 'l', 'o'];
let mut str = String::new();
for x in &s {
str.push(*x);
}
println!("{}", str);
Before the variable names you can put an underscore if you want to keep the signature, but in this simple example it is not necessary. The program starts by creating an empty mutable String so you can add elements (chars) to the String. Then we make a for loop over the s array by taking its reference. We add each element to the initial string. At the end you can return your string or just print it.
