I have a vector of strings, and I want to find a string that occurs more than once. I've tried this, but it didn't work:
let strings = vec!["Rust", "Rest", "Rust"]; // I want to find "Rust" in this case
let val = strings
.into_iter()
.find(|x| o.into_iter().filter(|y| x == y).count() >= 2)
// sorry o ^ here is supposed to be strings
.unwrap();
There are two issues in your code:
o doesn't exist. I assume you meant to use strings instead.
into_iter takes ownership of the value, so once you have called into_iter on strings (or o), you can't call it again. You should use plain iter instead.
Here's a fixed version:
let strings = vec!["Rust", "Rest", "Rust"]; // I want to find "Rust" in this case
let val = strings
.iter()
.find(|x| strings.iter().filter(|y| x == y).count() >= 2)
.unwrap();
Note, however, that this is pretty slow (quadratic in the number of strings). Depending on your requirements, there are more efficient alternatives:
Sort the strings array first. Then you only need to look at the next item to see if it is duplicated instead of needing to go through the whole array over and over. Advantage: no extra memory used. Drawback: you lose the original order.
Use an auxiliary variable to store the values you've already seen and/or count the number of occurrences of each string. This may be a HashSet, BTreeSet, HashMap or BTreeMap. See #Netwave's answer. Advantage: doesn't change the input array. Drawback: uses memory to keep track of the duplicates.
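A minimal sketch of the first (sort-based) alternative, assuming you are free to consume and reorder the vector:

```rust
fn main() {
    let mut strings = vec!["Rust", "Rest", "Rust"];
    // Sorting puts equal elements next to each other, so a duplicate
    // shows up as two adjacent equal items.
    strings.sort();
    let val = strings
        .windows(2)
        .find(|w| w[0] == w[1])
        .map(|w| w[0]);
    println!("{:?}", val); // Some("Rust")
}
```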
You can count the appearances in O(n) with a tree or table like:
use std::collections::HashMap;

fn main() {
let strings = vec!["Rust", "Rest", "Rust"];
let mut sorted_data : HashMap<&str, u32> = HashMap::new();
strings.iter().for_each(|item| {
if !sorted_data.contains_key(item) {
sorted_data.insert(item, 0);
}
*sorted_data.get_mut(item).unwrap() += 1;
});
println!("{:?}", sorted_data);
}
Then just use the entry with the biggest value, for example with the new fold_first (since stabilized as Iterator::reduce):
let result = sorted_data.iter().fold_first(|(k1, v1), (k2, v2)| { if v2 > v1 {(k2, v2)} else {(k1, v1)}}).unwrap();
Playground
Related
I am going through the Rust language book and was working through the hash map section.
At the end of the section there is a quiz which asks what the output of the following code would be.
use std::collections::HashMap;
fn main() {
let mut h: HashMap<char, Vec<usize>> = HashMap::new();
for (i, c) in "hello!".chars().enumerate() {
h.entry(c).or_insert(Vec::new()).push(i);
}
let mut sum = 0;
for i in h.get(&'l').unwrap() {
sum += *i;
}
println!("{}", sum);
}
The answer is 5, and I confirmed it by executing the code. The explanation given is the following:
This program stores a vector of indexes for each occurrence of a given letter into a hashmap. Then it sums all the indexes for the letter 'l', which occurs at indexes 2 and 3 in the string "hello!".
Now, when I look at the section on adding a new key and updating a value based on the old one, it seems Rust only allows unique keys in a hash map, as per the definition.
However, when I run the above program with an extra print statement in the for loop:
for i in h.get(&'l').unwrap() {
println!("{}", i);
sum += *i;
}
it seems there are two entries for the character l. As per my understanding, when iterating over the string hello!, once the first for loop encounters the second l it should update and overwrite the index value for l from 2 to 3, instead of making another entry with key l and value 3.
So the question is, is Rust allowing duplicate keys? Somehow this is not making sense.
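A small sketch (not part of the quiz) that makes the entry semantics visible: the key 'l' stays unique, and it is the Vec stored under it that grows.

```rust
use std::collections::HashMap;

fn main() {
    let mut h: HashMap<char, Vec<usize>> = HashMap::new();
    for (i, c) in "hello!".chars().enumerate() {
        // `entry` returns the existing entry for `c` if there is one;
        // `or_insert` only inserts the empty Vec when the key is absent.
        // Either way, `push` appends to the single Vec stored under `c`.
        h.entry(c).or_insert(Vec::new()).push(i);
    }
    // There is exactly one key 'l'; its value holds both indices.
    println!("{:?}", h.get(&'l')); // Some([2, 3])
}
```

So the map never holds a duplicate key; the two printed values come from iterating over the one Vec stored under 'l'.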
By jointed I mean:
let substring = "CNC";
And the string:
let s = "CNCNC";
In my version "jointed" would mean that there are 2 such substrings present.
What is the best way of doing that in Rust? I can think of a few but then it's basically ugly C.
I have something like this:
fn find_a_string(s: &String, sub_string: &String) -> u32 {
s.matches(sub_string).count() as u32
}
But that returns 1, because matches() only finds disjoint (non-overlapping) substrings.
What's the best way to do that in Rust?
There is probably a better algorithm. Here I just move a window the size of the substring we are looking for over the input string and check whether that window equals the substring:
fn main() {
let string = "aaaa";
let substring = "aa";
let substrings = string
.as_bytes()
.windows(substring.len())
.filter(|&w| w == substring.as_bytes())
.count();
println!("{}", substrings);
}
The approach of iterating over all windows is perfectly serviceable when your needle/haystack is small. And indeed, it might even be the preferred solution for small needles/haystacks, since a theoretically optimal solution is a fair bit more complicated. But it can get quite a bit slower as the lengths grow.
While Aho-Corasick is more well known for its support for searching multiple patterns simultaneously, it can be used with a single pattern to find overlapping matches in linear time. (In this case, it looks a lot like Knuth-Morris-Pratt.)
The aho-corasick crate can do this:
use aho_corasick::AhoCorasick;
fn main() {
let haystack = "CNCNC";
let needle = "CNC";
let matcher = AhoCorasick::new(&[needle]);
for m in matcher.find_overlapping_iter(haystack) {
let (s, e) = (m.start(), m.end());
println!("({:?}, {:?}): {:?}", s, e, &haystack[s..e]);
}
}
Output:
(0, 3): "CNC"
(2, 5): "CNC"
Playground: https://play.rust-lang.org/?version=stable&mode=debug&edition=2018&gist=ab6c547b1700bbbc4a29a99adcaceabe
I am making a program that brute forces a password by parallelization. At the moment the password to crack is already available as plain text, I'm just attempting to brute force it anyway.
I have a function called generate_char_array() which, based on an integer seed, performs a base conversion and returns a u8 slice of characters to try and check. This goes through the alphabet first for 1-character strings, then 2, and so on.
let found_string_index = (0..1e12 as u64).into_par_iter().find_any(|i| {
let mut array = [0u8; 20];
let bytes = generate_char_array(*i, &mut array);
return &password_bytes == &bytes;
});
With the found string index (or seed integer rather), I can generate the found string.
The problem is that Rayon parallelizes this by splitting the arbitrarily large integer range into thread_count-sized slices (e.g. for 4 threads, 0..2.5e11, 2.5e11..5e11, etc.). This is not good, because the end of the range corresponds to arbitrarily long passwords (10+ characters), whereas most passwords (including the fixed "zzzzz" I tend to try) are much shorter. As a result, the first thread does all the work, while the rest waste time testing overly long passwords and synchronizing, and the whole thing actually ends up slower than single-threaded performance.
How could I instead split the arbitrary big range (doesn't have to have an end actually) into chunks of ranges and have each thread find within chunks? That would make the workers in different threads actually useful.
This goes through the alphabet first for 1 character strings, then 2
You wish to impose some sequencing on your data processing, but the whole point of Rayon is to go in parallel.
Instead, use regular iterators to sequentially go up in length and then use parallel iterators inside a specific length to quickly process all of the values of that length.
Since you haven't provided enough code for a runnable example, I've made this rough approximation to show the general shape of such a solution:
extern crate rayon;
use rayon::iter::{IntoParallelRefIterator, ParallelIterator};
use std::ops::RangeInclusive;
type Seed = u8;
const LENGTHS: RangeInclusive<usize> = 1..=3;
const SEEDS: RangeInclusive<Seed> = 0..=std::u8::MAX;
fn find<F>(test_password: F) -> Option<(usize, Seed)>
where
F: Fn(usize, Seed) -> bool + Sync,
{
// Rayon doesn't support RangeInclusive yet
let seeds: Vec<_> = SEEDS.collect();
// Step 1-by-1 through the lengths, sequentially
LENGTHS.flat_map(|length| {
// In parallel, investigate every value in this length
// This doesn't do that, but it shows how the parallelization
// would be introduced
seeds
.par_iter()
.find_any(|&&seed| test_password(length, seed))
.map(|&seed| (length, seed))
}).next()
}
fn main() {
let pass = find(|l, s| {
println!("{}, {}", l, s);
// Actually generate and check the password based on the search criteria
l == 3 && s == 250
});
println!("Found password length and seed: {:?}", pass);
}
This can "waste" a little time at the end of each length as the parallel threads spin down one-by-one before spinning back up for the next length, but that seems unlikely to be a primary concern.
This is a version of what I suggested in my comment.
The main loop is parallel and is only over the first byte of each attempt. For each first byte, do the full brute force search for the remainder.
let matched_bytes = (0 .. 0xFFu8).into_par_iter().filter_map(|n| {
let mut array = [0u8; 8];
// the first digit is always the same in this run
array[0] = n;
// The highest byte is 0 because it's provided from the outer loop
(0 ..= 0x0FFFFFFFFFFFFFFF as u64).into_iter().filter_map(|i| {
// pass a slice so that the first byte is not affected
generate_char_array(i, &mut array[1 .. 8]);
if &password_bytes[..] == &array[0 .. password_bytes.len()] {
Some(array.clone())
} else {
None
}
}).next()
}).find_any(|_| true);
println!("found = {:?}", matched_bytes);
Even for a brute-force method, this is probably still highly inefficient!
If Rayon splits the slices as you described, then apply simple math to balance the password lengths:
let found_string_index = (0..max_val as u64).into_par_iter().find_any(|i| {
let mut array = [0u8; 20];
let v = i / span + (i % span) * num_cpu;
let bytes = generate_char_array(v, &mut array);
return &password_bytes == &bytes;
});
The span value depends on the number of CPUs (the number of threads used by Rayon), in your case:
let num_cpu = 4;
let span = 2.5e11 as u64;
let max_val = span * num_cpu;
Note the performance of this approach is highly dependent on how Rayon performs the split of the sequence on parallel threads. Verify that it works as you reported in the question.
I want to sort HashMap data by value in Rust (e.g., when counting character frequency in a string).
The Python equivalent of what I’m trying to do is:
count = {}
for c in text:
count[c] = count.get(c, 0) + 1
sorted_data = sorted(count.items(), key=lambda item: -item[1])
print('Most frequent character in text:', sorted_data[0][0])
My corresponding Rust code looks like this:
// Count the frequency of each letter
let mut count: HashMap<char, u32> = HashMap::new();
for c in text.to_lowercase().chars() {
*count.entry(c).or_insert(0) += 1;
}
// Get a sorted (by field 0 ("count") in reversed order) list of the
// most frequently used characters:
let mut count_vec: Vec<(&char, &u32)> = count.iter().collect();
count_vec.sort_by(|a, b| b.1.cmp(a.1));
println!("Most frequent character in text: {}", count_vec[0].0);
Is this idiomatic Rust? Can I construct count_vec in a way that consumes the HashMap's data and owns it (e.g., using map())? Would this be more idiomatic?
Is this idiomatic Rust?
There's nothing particularly unidiomatic, except possibly for the unnecessary full type constraint on count_vec; you could just use
let mut count_vec: Vec<_> = count.iter().collect();
It's not difficult from context to work out what the full type of count_vec is. You could also omit the type constraint for count entirely, but then you'd have to play shenanigans with your integer literals to have the correct value type inferred. That is to say, an explicit annotation is eminently reasonable in this case.
The other borderline change you could make if you feel like it would be to use |a, b| a.1.cmp(b.1).reverse() for the sort closure. The Ordering::reverse method just reverses the result so that less-than becomes greater-than, and vice versa. This makes it slightly more obvious that you meant what you wrote, as opposed to accidentally transposing two letters.
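A minimal sketch of that variant, using owned pairs for brevity (the original code compares references instead):

```rust
fn main() {
    let mut pairs = vec![('a', 3u32), ('b', 7), ('c', 5)];
    // Compare the counts as usual, then flip the Ordering so the
    // sort comes out descending; the intent reads more explicitly
    // than transposing the two operands.
    pairs.sort_by(|a, b| a.1.cmp(&b.1).reverse());
    println!("{:?}", pairs); // [('b', 7), ('c', 5), ('a', 3)]
}
```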
Can I construct the count_vec in a way so that it would consume the HashMaps data and owns it?
Not in any meaningful way. Just because HashMap is using memory doesn't mean that memory is in any way compatible with Vec. You could use count.into_iter() to consume the HashMap and move the elements out (as opposed to iterating over pointers), but since both char and u32 are trivially copyable, this doesn't really gain you anything.
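For completeness, a sketch of what the into_iter variant looks like (with a hand-filled map instead of the original text input):

```rust
use std::collections::HashMap;

fn main() {
    let mut count: HashMap<char, u32> = HashMap::new();
    count.insert('a', 2);
    count.insert('b', 1);
    // `into_iter` consumes the map and yields owned (char, u32) pairs,
    // so the resulting Vec owns its elements outright.
    let mut count_vec: Vec<(char, u32)> = count.into_iter().collect();
    count_vec.sort_by(|a, b| b.1.cmp(&a.1));
    println!("{:?}", count_vec); // [('a', 2), ('b', 1)]
}
```

As noted above, since char and u32 are Copy, this buys nothing over iterating by reference; it only matters for value types that are expensive or impossible to clone.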
This could be another way to address the matter without the need for an intermediate vector:
// Count the frequency of each letter
let mut count: HashMap<char, u32> = HashMap::new();
for c in text.to_lowercase().chars() {
*count.entry(c).or_insert(0) += 1;
}
let top_char = count.iter().max_by(|a, b| a.1.cmp(&b.1)).unwrap();
println!("Most frequent character in text: {}", top_char.0);
use BTreeMap for sorted data
A BTreeMap keeps its entries sorted by key, so swapping the places of your key and value and collecting them into a BTreeMap
let count_b: BTreeMap<&u32,&char> = count.iter().map(|(k,v)| (v,k)).collect();
should give you a sorted map according to character frequency.
Characters with the same frequency will overwrite each other, though, so some are lost. But if you only want the most frequent character, that does not matter.
You can get the result using
println!("Most frequent character in text: {}", count_b.last_key_value().unwrap().1);
I was wondering if it is possible to use .collect() on an iterator to grab items at a specific index. For example if I start with a string, I would normally do:
let line = "Some line of text for example";
let l = line.split(" ");
let lvec: Vec<&str> = l.collect();
let text = &lvec[3];
But what would be nice is something like:
let text: &str = l.collect(index=(3));
No, it's not; however, you can easily filter before you collect, which in practice achieves the same effect.
If you wish to filter by index, you need to add the index in and then strip it afterwards:
enumerate (to add the index to the element)
filter based on this index
map to strip the index from the element
Or in code:
fn main() {
let line = "Some line of text for example";
let l = line.split(" ")
.enumerate()
.filter(|&(i, _)| i == 3 )
.map(|(_, e)| e);
let lvec: Vec<&str> = l.collect();
let text = &lvec[0];
println!("{}", text);
}
If you only wish to get a single index (and thus element), then using nth is much easier. It returns an Option<&str> here, which you need to take care of:
fn main() {
let line = "Some line of text for example";
let text = line.split(" ").nth(3).unwrap();
println!("{}", text);
}
If you have an arbitrary predicate but wish only the first element that matches, then collecting into a Vec is inefficient: it consumes the whole iterator (no laziness) and potentially allocates a lot of memory that is not needed at all.
You are thus better off simply asking for the first element using the next method of the iterator, which returns an Option<&str> here:
fn main() {
let line = "Some line of text for example";
let text = line.split(" ")
.enumerate()
.filter(|&(i, _)| i % 7 == 3 )
.map(|(_, e)| e)
.next()
.unwrap();
println!("{}", text);
}
If you want to select part of the result, by index, you may also use skip and take before collecting, but I guess you have enough alternatives presented here already.
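A quick sketch of that skip/take combination, picking out words 1 through 3:

```rust
fn main() {
    let line = "Some line of text for example";
    // Skip the first word, then keep the next three: indexes 1..=3.
    let middle: Vec<&str> = line.split(" ").skip(1).take(3).collect();
    println!("{:?}", middle); // ["line", "of", "text"]
}
```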
There is a nth function on Iterator that does this:
let text = line.split(" ").nth(3).unwrap();
No; you can use skip and next, though:
let line = "Some line of text for example";
let l = line.split(" ");
let text = l.skip(3).next();
Note that this results in text being an Option<&str>, as there's no guarantee that the sequence actually has at least four elements.
Addendum: using nth is definitely shorter, though I prefer to be explicit about the fact that accessing the nth element of an iterator necessarily consumes all the elements before it.
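One way to handle the resulting Option without panicking is to supply a fallback instead of calling unwrap (the "&lt;missing&gt;" placeholder here is just an illustrative choice):

```rust
fn main() {
    let line = "Some line of text for example";
    // `nth` (like `skip` + `next`) yields an Option; fall back to a
    // default instead of unwrapping when the index may be out of range.
    let text = line.split(" ").nth(30).unwrap_or("<missing>");
    println!("{}", text); // <missing>
}
```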
For anyone who may be interested: you can do loads of cool things with iterators (thanks Matthieu M). For example, to get multiple 'words' from a string according to their index, you can use filter along with logical or (||) to test for multiple indexes:
let line = "FCC2CCMACXX:4:1105:10758:14389# 81 chrM 1 32 10S90M = 16151 16062";
let words: Vec<&str> = line.split(" ")
.enumerate()
.filter(|&(i, _)| i==1 || i==3 || i==6 )
.map(|(_, e) | e)
.collect();