How to find most similar string using n-grams

I am trying to use n-grams to find the most similar string for each string within a list. Currently I have this list of strings:
let arr = [
"Bilbo Baggins",
"Gandalf",
"Thorin",
"Balin",
"Kili",
"Fili",
"John",
"Frodo Baggins",
];
Using the following code I create the bigrams for each string and store them in a vector:
let arr = [
"Bilbo Baggins",
"Gandalf",
"Thorin",
"Balin",
"Kili",
"Fili",
"John",
"Frodo Baggins",
]
.iter()
.map(|elem|
elem
.len()
.rem(2)
.ne(&0)
.then_some(format!("{elem} "))
.unwrap_or(elem.to_string())
)
.map(|elem| elem.chars().array_chunks().collect::<Vec<[char; 2]>>())
.collect::<Vec<_>>();
Output:
[['B', 'i'], ['l', 'b'], ['o', ' '], ['B', 'a'], ['g', 'g'], ['i', 'n'], ['s', ' ']]
[['G', 'a'], ['n', 'd'], ['a', 'l'], ['f', ' ']]
[['T', 'h'], ['o', 'r'], ['i', 'n']]
[['B', 'a'], ['l', 'i'], ['n', ' ']]
[['K', 'i'], ['l', 'i']]
[['F', 'i'], ['l', 'i']]
[['J', 'o'], ['h', 'n']]
[['F', 'r'], ['o', 'd'], ['o', ' '], ['B', 'a'], ['g', 'g'], ['i', 'n'], ['s', ' ']]
The question is: how can I apply some sort of set logic to these vectors of bigrams to find the most similar string for each of the strings, and get something like the following output?
'Bilbo Baggins' most similar string: 'Frodo Baggins'
'Gandalf' most similar string: None
'Thorin' most similar string: 'Balin'
'Balin' most similar string: 'Thorin'
'Kili' most similar string: 'Fili'
'Fili' most similar string: 'Kili'
'John' most similar string: None
'Frodo Baggins' most similar string: 'Bilbo Baggins'

There are many different algorithms for calculating distances between strings. The one you are looking for with your bigrams is probably cosine similarity. It can be used to iterate over your bigrams and compute a value that represents the similarity between two vectors (or strings, in this case). It seems to favour matching to longer strings, because they have more groups of characters that can repeat between them.
Here is an example of finding the closest names by cosine similarity:
use std::collections::HashSet;
const SIM_THRESHOLD: f32 = 0.22;
fn main() {
let arr = [
"Bilbo Baggins",
"Gandalf",
"Thorin",
"Balin",
"Kili",
"Fili",
"John",
"Frodo Baggins",
];
for name in arr {
let mut closest = (None, -1.0);
for other in arr.iter().filter(|&e| *e != name) {
let sim = str_diff(name, other);
if sim > closest.1 && sim > SIM_THRESHOLD {
closest = (Some(*other), sim);
}
}
println!(
"\"{}\" most similar string: {}",
name,
if let Some(name) = closest.0 {
format!("\"{}\"", name)
} else {
"None".to_string()
}
);
}
}
/// Returns the bigram cosine similarity between two strings. A `1` means the
/// strings are identical, while a `0` means they are completely different.
/// Returns NaN if either string is shorter than 2 characters, because its
/// term-frequency vector then has zero magnitude.
pub fn str_diff(a: &str, b: &str) -> f32 {
cos_sim(&ngram(&a, &b, 2))
}
// Returns the term frequency of `n` consecutive characters between two strings.
// The order of the terms is not guaranteed, but will always be consistent
// between the two returned vectors (order could be guaranteed with a BTreeSet,
// but that is slower).
// Note: the strings are sliced by byte offset, so non-ASCII input will panic.
fn ngram(s1: &str, s2: &str, n: usize) -> (Vec<u32>, Vec<u32>) {
let mut grams = HashSet::<&str>::new();
for i in 0..((s1.len() + 1).saturating_sub(n)) {
grams.insert(&s1[i..(i + n)]);
}
for i in 0..((s2.len() + 1).saturating_sub(n)) {
grams.insert(&s2[i..(i + n)]);
}
let mut q1 = Vec::new();
let mut q2 = Vec::new();
for i in grams {
q1.push(s1.matches(i).count() as u32);
q2.push(s2.matches(i).count() as u32);
}
(q1, q2)
}
// Returns the dot product of two slices of equal length. Returns an `Err` if
// the slices are not of equal length.
fn dot_prod(a: &[u32], b: &[u32]) -> Result<u32, &'static str> {
if a.len() != b.len() {
return Err("Slices must be of equal length");
}
// Multiply element-wise and sum, without allocating a temporary vector.
Ok(a.iter().zip(b).map(|(x, y)| x * y).sum())
}
// Returns the cosine similarity between two vectors of equal length.
// `S_c(A, B) = (A · B) / (||A|| ||B||)`
fn cos_sim((a, b): &(Vec<u32>, Vec<u32>)) -> f32 {
if a.len() != b.len() {
return f32::NAN;
}
let a_mag = (dot_prod(a, a).unwrap() as f32).sqrt();
let b_mag = (dot_prod(b, b).unwrap() as f32).sqrt();
// use `clamp` to constrain floating point errors within 0..=1
(dot_prod(a, b).unwrap() as f32 / (a_mag * b_mag)).clamp(0.0, 1.0)
}
"Bilbo Baggins" most similar string: "Frodo Baggins"
"Gandalf" most similar string: None
"Thorin" most similar string: "Balin"
"Balin" most similar string: "Bilbo Baggins"
"Kili" most similar string: "Fili"
"Fili" most similar string: "Kili"
"John" most similar string: None
"Frodo Baggins" most similar string: "Bilbo Baggins"
The result for Balin doesn't look quite right, because cosine similarity doesn't take string length into account: Balin shares two bigrams (Ba and in) with the much longer Bilbo Baggins, but only one (in) with Thorin. Another popular method is the Levenshtein distance (I used the Wagner-Fischer algorithm to compute it), which is the number of insertions, deletions, or substitutions needed to transform one string into another:
use std::cmp::min;
fn main() {
let arr = [
"Bilbo Baggins",
"Gandalf",
"Thorin",
"Balin",
"Kili",
"Fili",
"John",
"Frodo Baggins",
];
for name in arr {
let mut closest = (None, usize::MAX);
for other in arr.iter().filter(|&e| *e != name) {
let dist = distance(name, *other);
if dist < closest.1 && dist < min(name.len(), other.len()) {
closest = (Some(*other), dist);
}
}
println!(
"\"{}\" most similar string: {}",
name,
if let Some(name) = closest.0 {
format!("\"{}\"", name)
} else {
"None".to_string()
}
);
}
}
/// Calculates the Levenshtein Distance between 2 strings
fn distance(a: &str, b: &str) -> usize {
let a = a.chars().collect::<Vec<char>>();
let b = b.chars().collect::<Vec<char>>();
let mut d = vec![vec![0; b.len() + 1]; a.len() + 1];
for i in 1..=a.len() {
d[i][0] = i;
}
for j in 1..=b.len() {
d[0][j] = j;
}
for j in 1..=b.len() {
for i in 1..=a.len() {
// A substitution costs nothing when the characters already match.
let cost = if a[i - 1] == b[j - 1] { 0 } else { 1 };
d[i][j] = min(min(d[i - 1][j] + 1, d[i][j - 1] + 1), d[i - 1][j - 1] + cost);
}
}
d[a.len()][b.len()]
}
"Bilbo Baggins" most similar string: "Frodo Baggins"
"Gandalf" most similar string: None
"Thorin" most similar string: "Balin"
"Balin" most similar string: "Kili"
"Kili" most similar string: "Fili"
"Fili" most similar string: "Kili"
"John" most similar string: None
"Frodo Baggins" most similar string: "Bilbo Baggins"
Balin still isn't quite what you want, since it takes the fewest modifications to get to Kili, but it definitely looks closer. Hopefully this leads you closer to where you want to go, but you may need to use a combination of algorithms, or find one that weights the beginnings and ends of words differently, if you want Balin to match Thorin.
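Since the question asks about set logic specifically, one more option is the Jaccard index over the sets of overlapping character bigrams: the size of the intersection divided by the size of the union. The sketch below is only an illustration and not part of the answer above; the names `bigrams`, `jaccard` and `most_similar` and the `threshold` value of 0.1 are made up for this example. It does not by itself make Balin match Thorin either, since the two share only the bigram "in".

```rust
use std::collections::HashSet;

/// Collects the set of overlapping character bigrams of `s`.
fn bigrams(s: &str) -> HashSet<(char, char)> {
    s.chars().zip(s.chars().skip(1)).collect()
}

/// Jaccard index of the two bigram sets: |A ∩ B| / |A ∪ B|.
/// Returns 0.0 when neither string has any bigrams.
fn jaccard(a: &str, b: &str) -> f32 {
    let (a, b) = (bigrams(a), bigrams(b));
    let union = a.union(&b).count();
    if union == 0 {
        return 0.0;
    }
    a.intersection(&b).count() as f32 / union as f32
}

/// Returns the candidate (other than `name` itself) with the highest score
/// strictly above `threshold`, or `None` if there is no such candidate.
fn most_similar<'a>(name: &str, candidates: &[&'a str], threshold: f32) -> Option<&'a str> {
    let mut best = None;
    let mut best_score = threshold;
    for &other in candidates.iter().filter(|&&c| c != name) {
        let score = jaccard(name, other);
        if score > best_score {
            best = Some(other);
            best_score = score;
        }
    }
    best
}

fn main() {
    let arr = [
        "Bilbo Baggins", "Gandalf", "Thorin", "Balin",
        "Kili", "Fili", "John", "Frodo Baggins",
    ];
    for name in arr {
        // The threshold of 0.1 is an arbitrary choice for this sketch.
        println!("{:?} most similar string: {:?}", name, most_similar(name, &arr, 0.1));
    }
}
```

Because the bigrams are built from `chars()` rather than byte slices, this version also accepts non-ASCII names.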

Related

How do I get the cartesian product of 2 vectors by using Iterator?

I have 2 Vecs:
let x = vec!['1', '2', '3'];
let y = vec!['a', 'b', 'c'];
Now I want to use an iterator to make a new vec like this: ['1a', '1b', '1c', '2a', '2b', '2c', '3a', '3b', '3c']. How can I do this?

The easiest way would be to use the cartesian product macro available in the itertools crate:
use itertools::iproduct; // 0.10.1
fn main() {
let x = vec!['1', '2', '3'];
let y = vec!['a', 'b', 'c'];
let product: Vec<String> = iproduct!(x, y)
.map(|(a, b)| format!("{}{}", a, b))
.collect();
println!("{:?}", product);
}

Here is how to do it with vanilla Rust iterators:
fn main() {
let x = vec!['1', '2', '3'];
let y = vec!['a', 'b', 'c'];
let product: Vec<String> = x
.iter()
.map(|&item_x| y
.iter()
.map(move |&item_y| [item_x, item_y]
.iter()
.collect()
)
)
.flatten()
.collect();
println!("{:?}", product);
}
Explanation
The easiest way to construct a String from two chars is to collect an iterator over the chars:
let string: String = [item_x, item_y].iter().collect();
For each item in x we iterate over y and construct such a string.
x.iter().map(|&item_x| y.iter().map(move |&item_y| ...));
We use pattern matching to get values in the map closures rather than references. Because of that, and the fact that char implements the Copy trait, we can move item_x into the inner closure, resolving any lifetime issues.
As a result of the code above we get an iterator over iterators over Strings. To flatten that iterator, we use the flatten method (who would think?). Then we collect the flat iterator into the resulting Vec.
Playground: https://play.rust-lang.org/?version=stable&mode=debug&edition=2018&gist=bf2987ed96303a0db0f629884492011e
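The `map` followed by `flatten` can also be written as a single `flat_map` call. A minimal equivalent sketch, assuming the same two-char strings are wanted (the function name `product_strings` is made up for this example):

```rust
/// Builds every two-character string "xy" with x taken from `xs` and y from `ys`.
fn product_strings(xs: &[char], ys: &[char]) -> Vec<String> {
    xs.iter()
        // `flat_map` maps each x to an inner iterator and flattens the result.
        .flat_map(|&x| ys.iter().map(move |&y| [x, y].iter().collect::<String>()))
        .collect()
}

fn main() {
    let x = vec!['1', '2', '3'];
    let y = vec!['a', 'b', 'c'];
    println!("{:?}", product_strings(&x, &y));
}
```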

The existing answers make sense if your goal is to get the Cartesian product of two iterators. If you've got vectors or slices already though (like in the original question) you can do a little better:
fn main() {
let x = vec!['1', '2', '3'];
let y = vec!['a', 'b', 'c'];
let result: Vec<String> = product(&x, &y)
.map(|(a, b)| format!("{}{}", a, b))
.collect();
println!("{:?}", result)
}
fn product<'a: 'c, 'b: 'c, 'c, T>(
xs: &'a [T],
ys: &'b [T],
) -> impl Iterator<Item = (&'a T, &'b T)> + 'c {
xs.iter().flat_map(move |x| std::iter::repeat(x).zip(ys))
}
Any iterator based solution will necessarily require storing the full contents of the iterator somewhere, but if you already have the data in a vector or array you can use the known size to only store the indices instead.
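One way to read that last remark is that, with slices, a single running counter is enough to address every pair. The following is a sketch of that reading, not code from the answer; `product_by_index` is a hypothetical name, and both slices are assumed to share one lifetime to keep the signature short.

```rust
/// Yields every pair (x, y), deriving both positions from one counter
/// instead of nesting iterators.
fn product_by_index<'a, T>(
    xs: &'a [T],
    ys: &'a [T],
) -> impl Iterator<Item = (&'a T, &'a T)> + 'a {
    // If `ys` is empty the range is empty, so the division below never runs.
    (0..xs.len() * ys.len()).map(move |k| (&xs[k / ys.len()], &ys[k % ys.len()]))
}

fn main() {
    let x = vec!['1', '2', '3'];
    let y = vec!['a', 'b', 'c'];
    let result: Vec<String> = product_by_index(&x, &y)
        .map(|(a, b)| format!("{}{}", a, b))
        .collect();
    println!("{:?}", result);
}
```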

Move items from vec to other by indexes in Rust

I have a vector items of items and a vector idxs of indexes. How can I get a vector picked, filled by moving all the values indexed by idxs out of items at once?
For example:
let mut items: Vec<char> = vec!['a', 'b', 'c', 'd', 'e', 'f'];
let idxs: Vec<usize> = vec![3, 4, 1];
let picked = pick(&mut items, &idxs);
// items should be: ['a', 'c', 'f']
// picked should be: ['d', 'e', 'b']
I can make it with:
let mut picked: Vec<char> = Vec::new();
let placeholder = 'z';
for idx in idxs {
items.insert(idx, placeholder); // insert any placeholder value of type T for keeping order
let item = items.remove(idx + 1);
picked.push(item);
}
items.retain(|item| *item != placeholder);
But I think this is overkill. Keeping a placeholder value for each different type is also complicated, and in my case I have to avoid it.
Is there a more idiomatic way to do that?

Here are two algorithms for the problem.
The following algorithm is O(n + m), where n is the length of items and m is the length of idxs. That is the best possible asymptotic run time assuming that items must stay in its original order, since all elements may then need to be moved to compact them after the removals.
fn pick<T>(items: &mut Vec<T>, idxs: &[usize]) -> Vec<T> {
// Move the items into a vector of Option<T> we can remove items from
// without reordering.
let mut opt_items: Vec<Option<T>> = items.drain(..).map(Some).collect();
// Take the items.
let picked: Vec<T> = idxs
.into_iter()
.map(|&i| opt_items[i].take().expect("duplicate index"))
.collect();
// Put the unpicked items back.
items.extend(opt_items.into_iter().filter_map(|opt| opt));
picked
}
fn main() {
let mut items: Vec<char> = vec!['a', 'b', 'c', 'd', 'e', 'f'];
let idxs: Vec<usize> = vec![3, 4, 1];
let picked = pick(&mut items, &idxs);
dbg!(picked, items);
}
This algorithm is instead O(m log m) (where m is the length of idxs). The price for this is that it reorders the un-picked elements of items.
fn pick<T>(items: &mut Vec<T>, idxs: &[usize]) -> Vec<T> {
// Second element is the index into `idxs`.
let mut sorted_idxs: Vec<(usize, usize)> =
idxs.iter().copied().enumerate().map(|(ii, i)| (i, ii)).collect();
sorted_idxs.sort();
// Set up random-access output storage.
let mut output: Vec<Option<T>> = Vec::new();
output.resize_with(idxs.len(), || None);
// Take the items, in reverse sorted order.
// Reverse order ensures that `swap_remove` won't move any item we want.
for (i, ii) in sorted_idxs.into_iter().rev() {
output[ii] = Some(items.swap_remove(i));
}
// Unwrap the temporary `Option`s.
output.into_iter().map(Option::unwrap).collect()
}
Both of these algorithms could be optimized by using unsafe code to work with uninitialized/moved memory instead of using vectors of Option. The second algorithm would then need a check for duplicate indices to be safe.

If idxs is unsorted and order matters, and if you can't use a placeholder, then you can move the items like this:
let mut picked: Vec<char> = Vec::new();
let mut idxs = idxs.clone(); // Not required if you are allowed to mutate the original idxs.
for i in 0..idxs.len() {
picked.push(items.remove(idxs[i]));
// Removing an item shifts every later item one position to the left.
for j in i + 1..idxs.len() {
if idxs[j] > idxs[i] {
idxs[j] -= 1;
}
}
}
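For reference, the same index-shifting idea can be wrapped in a generic function with the signature the question asks for. This is only a sketch: `pick_shifting` is a made-up name, it panics on out-of-range indices, and it does not detect duplicate indices. Each `Vec::remove` shifts the tail of the vector, so it is better suited to short index lists.

```rust
/// Removes the elements at `idxs` from `items` and returns them in `idxs` order.
/// Panics if an index is out of bounds; duplicate indices are not detected.
fn pick_shifting<T>(items: &mut Vec<T>, idxs: &[usize]) -> Vec<T> {
    let mut idxs = idxs.to_vec();
    let mut picked = Vec::with_capacity(idxs.len());
    for i in 0..idxs.len() {
        let removed_at = idxs[i];
        picked.push(items.remove(removed_at));
        // Everything after `removed_at` moved one slot to the left, so adjust
        // the indices that have not been processed yet.
        for later in &mut idxs[i + 1..] {
            if *later > removed_at {
                *later -= 1;
            }
        }
    }
    picked
}

fn main() {
    let mut items = vec!['a', 'b', 'c', 'd', 'e', 'f'];
    let picked = pick_shifting(&mut items, &[3, 4, 1]);
    println!("picked: {:?}, items: {:?}", picked, items);
}
```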

Expected &str found char with rust?

I am getting this error
expected &str, found char
For this code
// Expected output
// -------
// h exists
// c exists
fn main() {
let list = ["c","h","p","u"];
let s = "Hot and Cold".to_string();
let mut v: Vec<String> = Vec::new();
for i in s.split(" ") {
let c = i.chars().nth(0).unwrap().to_lowercase().nth(0).unwrap();
println!("{}", c);
if list.contains(&c) {
println!("{} exists", c);
}
}
}
How do I solve this?

Change list from an array of &strs to an array of chars:
let list = ['c', 'h', 'p', 'u'];
Double-quotes "" create string literals, while single-quotes '' create character literals. See Literal Expressions in the Rust reference.
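With that change, the check from the question compiles because both sides of `contains` are now `char`. A small sketch of the same logic as a helper function (the name `matching_initials` is made up for this example):

```rust
/// Returns the lowercased first letter of each word in `s` that appears in `list`.
fn matching_initials(list: &[char], s: &str) -> Vec<char> {
    s.split_whitespace()
        // First character of each word; `split_whitespace` never yields empty words.
        .filter_map(|word| word.chars().next())
        // `to_lowercase` returns an iterator, so take its first char.
        .filter_map(|c| c.to_lowercase().next())
        .filter(|c| list.contains(c))
        .collect()
}

fn main() {
    let list = ['c', 'h', 'p', 'u'];
    for c in matching_initials(&list, "Hot and Cold") {
        println!("{} exists", c);
    }
}
```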

I'm assuming you want list to be an array of chars, not an array of &strs. In that case, try changing
let list = ["c","h","p","u"];
to
let list = ['c','h','p','u'];
and it should work

Creating a sliding window iterator of slices of chars from a String

I am looking for the best way to go from String to Windows<T> using the windows function provided for slices.
I understand how to use windows this way:
fn main() {
let tst = ['a', 'b', 'c', 'd', 'e', 'f', 'g'];
let mut windows = tst.windows(3);
// prints ['a', 'b', 'c']
println!("{:?}", windows.next().unwrap());
// prints ['b', 'c', 'd']
println!("{:?}", windows.next().unwrap());
// etc...
}
But I am a bit lost when working this problem:
fn main() {
let tst = String::from("abcdefg");
let inter = ? //somehow create slice of character from tst
let mut windows = inter.windows(3);
// prints ['a', 'b', 'c']
println!("{:?}", windows.next().unwrap());
// prints ['b', 'c', 'd']
println!("{:?}", windows.next().unwrap());
// etc...
}
Essentially, I am looking for how to convert a string into a char slice that I can use the window method with.

The problem that you are facing is that String is really represented as something like a Vec<u8> under the hood, with some APIs to let you access chars. In UTF-8 the representation of a code point can be anything from 1 to 4 bytes, and they are all compacted together for space-efficiency.
The only slice you could get directly of an entire String, without copying everything, would be a &[u8], but you wouldn't know if the bytes corresponded to whole or just parts of code points.
The char type corresponds exactly to a code point, and therefore has a size of 4 bytes, so that it can accommodate any possible value. So, if you build a slice of char by copying from a String, the result could be up to 4 times larger.
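To make the size difference concrete, this small sketch compares the UTF-8 byte length of a string with the space the same text would take as a slice of `char`. The helper name `utf8_vs_char_bytes` is made up for this example.

```rust
use std::mem::size_of;

/// Returns (bytes used by the UTF-8 text, bytes the same text needs as `[char]`).
fn utf8_vs_char_bytes(s: &str) -> (usize, usize) {
    (s.len(), s.chars().count() * size_of::<char>())
}

fn main() {
    for s in ["abcdefg", "日本語"] {
        let (utf8, chars) = utf8_vs_char_bytes(s);
        println!("{:?}: {} bytes as UTF-8, {} bytes as chars", s, utf8, chars);
    }
}
```

For the ASCII string the `char` form is four times larger (28 bytes against 7); for the three-byte characters in the second string the gap is smaller (12 bytes against 9).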
To avoid making a potentially large, temporary memory allocation, you should consider a more lazy approach – iterate through the String, making slices at exactly the char boundaries. Something like this:
fn char_windows<'a>(src: &'a str, win_size: usize) -> impl Iterator<Item = &'a str> {
src.char_indices()
.flat_map(move |(from, _)| {
src[from ..].char_indices()
.nth(win_size - 1)
.map(|(to, c)| {
&src[from .. from + to + c.len_utf8()]
})
})
}
This will give you an iterator where the items are &str, each with 3 chars:
let windows = char_windows(&tst, 3);
for win in windows {
println!("{:?}", win);
}
The nice thing about this approach is that it hasn't done any copying at all - each &str produced by the iterator is still a slice into the original source String.
All of that complexity is because Rust strings are always UTF-8 encoded. If you absolutely know that your input string doesn't contain any multi-byte characters, you can treat it as ASCII bytes, and taking slices becomes easy:
let tst = String::from("abcdefg");
let inter = tst.as_bytes();
let mut windows = inter.windows(3);
However, you now have slices of bytes, and you'll need to turn them back into strings to do anything with them:
for win in windows {
println!("{:?}", String::from_utf8_lossy(win));
}

This solution will work for your purpose:
fn main() {
let tst = String::from("abcdefg");
let inter = tst.chars().collect::<Vec<char>>();
let mut windows = inter.windows(3);
// prints ['a', 'b', 'c']
println!("{:?}", windows.next().unwrap());
// prints ['b', 'c', 'd']
println!("{:?}", windows.next().unwrap());
// etc...
println!("{:?}", windows.next().unwrap());
}
String can iterate over its chars, but it's not a slice, so you have to collect it into a vec, which then coerces into a slice.

You can use itertools to walk over windows of any iterator, up to a width of 4:
extern crate itertools; // 0.7.8
use itertools::Itertools;
fn main() {
let input = "日本語";
for (a, b) in input.chars().tuple_windows() {
println!("{}, {}", a, b);
}
}
See also:
Are there equivalents to slice::chunks/windows for iterators to loop over pairs, triplets etc?
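For pairs specifically, the standard library is enough: zipping the character iterator with a copy of itself that skips one element yields the same adjacent pairs without an extra crate. A minimal sketch (the function name `char_pairs` is made up for this example):

```rust
/// Returns each pair of adjacent characters in `s`, in order.
fn char_pairs(s: &str) -> Vec<(char, char)> {
    // The second iterator is one character ahead of the first.
    s.chars().zip(s.chars().skip(1)).collect()
}

fn main() {
    for (a, b) in char_pairs("日本語") {
        println!("{}, {}", a, b);
    }
}
```

This walks the string twice, which is usually fine for short inputs; wider windows need one more `zip` per extra element.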

How can I randomly select one element from a vector or array?

I have a vector where the element is a (String, String). How can I randomly pick one of these elements?

You want the rand crate, specifically the choose method.
use rand::seq::SliceRandom; // 0.7.2
fn main() {
let vs = vec![0, 1, 2, 3, 4];
println!("{:?}", vs.choose(&mut rand::thread_rng()));
}

Using choose_multiple:
use rand::seq::SliceRandom; // 0.7.2
fn main() {
let samples = vec!["hi", "this", "is", "a", "test!"];
let sample: Vec<_> = samples
.choose_multiple(&mut rand::thread_rng(), 1)
.collect();
println!("{:?}", sample);
}

Another choice for weighted sampling that is already included in the rand crate is WeightedIndex, which has an example:
use rand::prelude::*;
use rand::distributions::WeightedIndex;
let choices = ['a', 'b', 'c'];
let weights = [2, 1, 1];
let dist = WeightedIndex::new(&weights).unwrap();
let mut rng = thread_rng();
for _ in 0..100 {
// 50% chance to print 'a', 25% chance to print 'b', 25% chance to print 'c'
println!("{}", choices[dist.sample(&mut rng)]);
}
let items = [('a', 0), ('b', 3), ('c', 7)];
let dist2 = WeightedIndex::new(items.iter().map(|item| item.1)).unwrap();
for _ in 0..100 {
// 0% chance to print 'a', 30% chance to print 'b', 70% chance to print 'c'
println!("{}", items[dist2.sample(&mut rng)].0);
}

If you want to choose more than one element then the random_choice crate may be right for you:
extern crate random_choice;
use self::random_choice::random_choice;
fn main() {
let mut samples = vec!["hi", "this", "is", "a", "test!"];
let weights: Vec<f64> = vec![5.6, 7.8, 9.7, 1.1, 2.0];
let number_choices = 100;
let choices = random_choice().random_choice_f64(&samples, &weights, number_choices);
for choice in choices {
print!("{}, ", choice);
}
}

If you also want to remove the chosen element, here's one way to do that (using the rand crate):
let mut vec = vec![0,1,2,3,4,5,6,7,8,9];
let index = (rand::random::<f32>() * vec.len() as f32).floor() as usize;
let value = vec.remove( index );
println!("index: {} value: {}", index, value);
println!("{:?}", vec);
remove(index) removes the value at index (shifting all the elements after it to the left) and then returns the value that was at index.
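If the order of the remaining elements does not matter, `Vec::swap_remove` avoids that shifting by moving the last element into the vacated slot. The sketch below uses a fixed index so that it is deterministic; in practice the index would come from the random draw. The wrapper name `take_unordered` is made up for this example.

```rust
/// Removes and returns the element at `index` without preserving order:
/// the last element is moved into the vacated slot.
/// Panics if `index` is out of bounds.
fn take_unordered<T>(items: &mut Vec<T>, index: usize) -> T {
    items.swap_remove(index)
}

fn main() {
    let mut v = vec![0, 1, 2, 3, 4];
    let value = take_unordered(&mut v, 1);
    println!("value: {} remaining: {:?}", value, v);
}
```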

Another way of getting a random value is to pick a random index with rng.gen_range and index into the vector. This also avoids holding on to the reference that the vec.choose() method returns, since indexing a Vec<&str> copies the element out.
use rand::Rng; // brings gen_range into scope; the range argument form needs rand 0.8
fn main() {
let mut rng = rand::thread_rng();
let my_strings: Vec<&str> = vec!["a", "b", "c"];
let random_string_index: usize = rng.gen_range(0..my_strings.len());
let string = my_strings[random_string_index];
println!("{:?}", string);
}
