Searching for a String in a Vec<String> in Rust

I'm writing a program that interprets a language.
I need to search for a string (not known at compile time) in a Vec.
fn get_name_index(name: &String, array: &Vec<String>) -> usize {
match array.binary_search(name) {
Ok(index) => index,
Err(_) => {
eprintln!("Error : variable {:?} not found in name array", name);
std::process::exit(1)
}
}
}
This happens multiple times during execution, but at the moment, the array.binary_search() function does not return the right answer.
I searched for the error, but my array is what it should be (printing each element, or examining with gdb: the same), and the error is still there.
Is there any other way to search for a String in a Vec<String>? Or is there an error in my code?
Thanks

First, a few issues: data must be sorted before using a binary search. A binary search is a fast search algorithm (O(log n), i.e. it scales with the logarithm of the size of the container), much faster than a linear search (O(n), i.e. it scales linearly with the size of the container). However, for a single lookup, any speed improvement from a binary search is dwarfed by the overhead of sorting the container (O(n log n)).
Single Search
Therefore, the best approach depends on how often you search your container. If you are only going to check it a few times, you should use a linear search, as follows:
fn get_name_index(name: &String, array: &Vec<String>) -> Option<usize> {
array.iter().position(|x| x == name)
}
Repeated Searches
If you are going to repeatedly call get_name_index, you should use a binary search (or possibly even better, below):
// Sort the array before using
array.sort_unstable();
// Repeatedly call this function
fn get_name_index(name: &String, array: &Vec<String>) -> Option<usize> {
match array.binary_search(name) {
Ok(index) => Some(index),
Err(_) => None,
}
}
However, this may be suboptimal for some cases. A few considerations: a HashSet may be faster for certain sets of data (O(1) complexity at its best). However, this is slightly misleading, since all the characters of the name must be processed on each lookup for a HashSet, while generally only a few characters must be compared to determine whether to jump left or right for a binary search. For data that is highly uniform and mostly differs by a few characters at the end, a HashSet might be better; otherwise, I'd generally recommend using binary_search on the vector.
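Sorting changes every index, which may matter if the positions in the name array are meaningful elsewhere in the interpreter. One possible alternative, sketched here as an illustration rather than as part of the answer above, is to build a hash map from name to original position once and then look names up in it. The helper names `build_name_index` and `lookup_name` are hypothetical.

```rust
use std::collections::HashMap;

/// Builds a lookup table from each name to its position in `names`.
/// If a name occurs more than once, the first position is kept.
pub fn build_name_index(names: &[String]) -> HashMap<&str, usize> {
    let mut index = HashMap::with_capacity(names.len());
    for (i, name) in names.iter().enumerate() {
        index.entry(name.as_str()).or_insert(i);
    }
    index
}

/// Returns the original position of `name`, or `None` if it is unknown.
pub fn lookup_name(index: &HashMap<&str, usize>, name: &str) -> Option<usize> {
    index.get(name).copied()
}
```

The vector stays unsorted, so the returned index matches the order in which the names were stored. The caller decides what to do with `None` instead of the lookup function exiting the process.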

As mcarton said, the vector needs to be sorted before you can do a binary search. Here's an example:
let mut v = vec![String::from("_res"), String::from("b"), String::from("a")];
println!("{:?}", &v);
v.sort_unstable();
println!("{:?}", &v);
I tried this with your code and it found "a" in the second position. Without the call to sort_unstable() it failed to find "a".

Related

Is there any difference between these two ways to get lowercase `Vec<char>`

Is there any difference between these two ways to get a lowercase Vec<char>? This version iterates over chars, converts them and collects the results:
fn lower(s: &str) -> Vec<char> {
s.chars().flat_map(|c| c.to_lowercase()).collect()
}
and this version first converts to a String and then collects the chars of that:
fn lower_via_string(s: &str) -> Vec<char> {
s.to_lowercase().chars().collect()
}
A short look at the code for str::to_lowercase immediately revealed a counterexample: It appears that Σ at the end of words receives special treatment from str::to_lowercase, which chars()-then-char::to_lowercase can't give, so the results differ on "xΣ ".
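The counterexample can be checked directly. This small sketch repeats the two functions from the question and compares them on the input "xΣ " mentioned above:

```rust
/// Lowercases each `char` on its own; no surrounding context is available.
pub fn lower(s: &str) -> Vec<char> {
    s.chars().flat_map(|c| c.to_lowercase()).collect()
}

/// Lowercases the whole string first, so context-sensitive rules can apply.
pub fn lower_via_string(s: &str) -> Vec<char> {
    s.to_lowercase().chars().collect()
}
```

For "xΣ ", `lower` produces the ordinary small sigma 'σ', while `lower_via_string` produces the final form 'ς', because the Σ follows a cased letter and is not followed by one.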
Before looking at the code of str::to_lowercase, I thought: Well, it should be really easy to find a counterexample with a fuzzer. I messed up the setup at first and it didn't find anything, but I have now been able to get it right, so I'll add it for completeness:
cargo new theyrenotequal
cd theyrenotequal
cargo fuzz init
cat >fuzz/fuzz_targets/fuzz_target_1.rs
#![no_main]
use libfuzzer_sys::fuzz_target;
fuzz_target!(|data: &str| {
if data.to_lowercase().chars().collect::<Vec<_>>()
!= data
.chars()
.flat_map(|c| c.to_lowercase())
.collect::<Vec<_>>()
{
panic!("Fuxxed: {}", data)
}
});
cargo fuzz run fuzz_target_1 -Zbuild-std
That spat out "AΣ#ӮѮ" after 8 million iterations.
To complete the answer of @Caesar: in case the behavioral difference doesn't matter, there is still a performance difference.
String::to_lowercase() allocates a new String and fills it with the characters. char::to_lowercase() only does that on the fly. So the former is expected to be much slower. I don't see why there couldn't be a version of String::to_lowercase() that returns an iterator and avoids the penalty of the allocation; it just hasn't been done yet.

Most efficient way to keep collection of string references

What is the most efficient way to keep a collection of references to strings in Rust?
Specifically, I have the following as the beginning of some code to parse command line arguments (option parsing to be added):
let args: Vec<String> = env::args().collect();
let mut files: Vec<&String> = Vec::new();
let mut i = 1;
while i < args.len() {
let arg = &args[i];
i += 1;
if arg.as_bytes()[0] != b'-' {
files.push(arg);
continue;
}
}
args is as recommended in https://doc.rust-lang.org/book/ch12-01-accepting-command-line-arguments.html declared as Vec<String>. As I understand it, that means new strings are constructed, which is mildly surprising; I would've expected that the command line arguments already exist in memory, and it would only be necessary to make a vector of references to the existing strings. But the compiler seems to concur that it needs to be Vec<String>.
It would seem inefficient to do the same for files; there is surely no need for further copying. Instead, I have declared it as Vec<&String>, which as I understand it, means only creating a vector of references to the existing strings, which is optimal. (Not that it makes a measurable performance difference for command line arguments, but I want to figure this out now, so I can get it right later when dealing with much larger data.)
Where I am slightly confused is that Rust seems to frequently recommend str over String, and indeed the compiler is happy to have files hold either str or &str.
My best guess right now is that str, being an object that refers to a slice of a string, is most efficient when you want to keep a reference to just part of the string, but when you know you want the whole string, it is better to skip the overhead of creating a slice object, and just keep &String.
Is the above correct, or am I missing something?
args is as recommended in https://doc.rust-lang.org/book/ch12-01-accepting-command-line-arguments.html declared as Vec<String>. As I understand it, that means new strings are constructed, which is mildly surprising; I would've expected that the command line arguments already exist in memory
The command-line arguments do exist in memory, but:
they are not String; they are not even guaranteed to be UTF-8
they are not in a Vec layout
Fundamentally there isn't even any prescription as to their storage, all you know is they're C strings (nul-terminated) and you get an array of pointers to those, whose last element is a null pointer.
Which is why args is an iterator of String: it will lazily decode and validate each argument as you request it, in fact you can check its source code:
pub fn args() -> Args {
Args { inner: args_os() }
}
#[stable(feature = "env", since = "1.0.0")]
impl Iterator for Args {
type Item = String;
fn next(&mut self) -> Option<String> {
self.inner.next().map(|s| s.into_string().unwrap())
}
fn size_hint(&self) -> (usize, Option<usize>) {
self.inner.size_hint()
}
}
Now I couldn't tell you why args_os yields OsString rather than OsStr, I would assume portability of some sort (e.g. some platforms might not guarantee the args data lives for the entirety of the program).
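Since `args()` unwraps the conversion, it panics on an argument that is not valid Unicode. A minimal sketch of doing the conversion yourself from `args_os()` follows; the helper name `decode_args` is hypothetical, and it takes any iterator of `OsString` so it can be exercised without real command-line input.

```rust
use std::ffi::OsString;

/// Converts each argument to a `String` where possible.
/// An argument that is not valid Unicode is returned unchanged as `Err`.
pub fn decode_args<I>(args: I) -> Vec<Result<String, OsString>>
where
    I: IntoIterator<Item = OsString>,
{
    args.into_iter().map(OsString::into_string).collect()
}
```

In a program this would be called as `decode_args(std::env::args_os())`, leaving the caller free to report or skip the arguments that failed to decode.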
My best guess right now is that str, being an object that refers to a slice of a string, is most efficient when you want to keep a reference to just part of the string, but when you know you want the whole string, it is better to skip the overhead of creating a slice object, and just keep &String.
Is the above correct, or am I missing something?
&String exists only for regularity (in the sense that it's a natural outgrowth of shared references and String existing concurrently), it's not actually useful: an &String only lets you access readonly / immutable methods of String, all of which are really provided by str aside from capacity() (which is rarely useful) and a handful of methods duplicated from str to String (I assume for efficiency) like len or is_empty.
&str is also generally more efficient than &String: while its size is 2 words (pointer, length) rather than one (pointer), it points directly to the relevant data rather than pointing to a pointer to the relevant data (and requiring a dereference to access the length property). As such, &String is rarely considered useful and clippy will warn against it by default (also &Vec as &[] is usually better for the same reason).
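Applied to the code in the question, this suggests collecting `&str` rather than `&String`. The following is a sketch under that assumption; `file_args` and `reference_sizes` are hypothetical names, and the filtering rule (anything not starting with '-') is taken from the question.

```rust
use std::mem::size_of;

/// Returns the arguments that are not options, borrowing from `args`.
/// The first element (the program name) is skipped, as in the question.
pub fn file_args(args: &[String]) -> Vec<&str> {
    args.iter()
        .skip(1)
        .map(String::as_str)
        // starts_with also handles an empty argument, where indexing byte 0 would panic
        .filter(|arg| !arg.starts_with('-'))
        .collect()
}

/// Sizes in bytes of `&String` (one pointer) and `&str` (pointer plus length).
pub fn reference_sizes() -> (usize, usize) {
    (size_of::<&String>(), size_of::<&str>())
}
```

No string data is copied in either case; the vector only holds references into `args`.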

Why does `format_args!()` ignore truncation? How to truncate without allocation then?

Why does this work?
fn main() {
println!("{:.3}", "this is just a test");
}
prints => thi
While this doesn't?
fn main() {
println!("{:.3}", format_args!("this is just a test"));
}
prints => this is just a test
For a little more context, I’m interested in the reasoning behind it, and a way to do it without any allocations.
I'm developing a terminal game in Rust, where I have a write! which shows some statistics about the rendering and game loop, and that text can be quite long. Now that I read the terminal size and adjust its output accordingly, I need to truncate that output, but without any allocations.
I thought I was super clever when I refactored this:
write!(
stdout,
"{} ({} {} {}) {}",
...
)
into this:
write!(
stdout,
"{:.10}", // simulate only 10 cols in terminal.
format_args!(
"{} ({} {} {}) {}",
...
)
)
How unfortunate, it doesn’t work… How to do that without allocating a String?
For one thing, not every type obeys all formatting arguments:
println!("{:.3}", 1024);
1024
Second, format_args! serves as the backbone for all of the std::fmt utilities. From the docs on format_args:
This macro functions by taking a formatting string literal containing {} for each additional argument passed. format_args! prepares the additional parameters to ensure the output can be interpreted as a string and canonicalizes the arguments into a single type. Any value that implements the Display trait can be passed to format_args!, as can any Debug implementation be passed to a {:?} within the formatting string.
This macro produces a value of type fmt::Arguments. This value can be passed to the macros within std::fmt for performing useful redirection. All other formatting macros (format!, write!, println!, etc) are proxied through this one. format_args!, unlike its derived macros, avoids heap allocations.
You can use the fmt::Arguments value that format_args! returns in Debug and Display contexts as seen below. The example also shows that Debug and Display format to the same thing: the interpolated format string in format_args!.
let debug = format!("{:?}", format_args!("{} foo {:?}", 1, 2));
let display = format!("{}", format_args!("{} foo {:?}", 1, 2));
assert_eq!("1 foo 2", display);
assert_eq!(display, debug);
Looking at the source for impl Display for Arguments, it just ignores any formatting parameters. I couldn't find this explicitly documented anywhere, but I can think of a couple reasons for this:
The arguments are already considered formatted. If you really want to format a formatted string, use format! instead.
Since it's used internally for multiple purposes, it's probably better to keep this part simple; it's already doing the format heavy-lifting. Attempting to make the thing responsible for formatting arguments itself accept formatting parameters sounds needlessly complicated.
I'd really like to truncate some output without allocating any Strings, would you know how to do it?
You can write to a fixed-size buffer:
use std::io::{Write, ErrorKind, Result};
use std::fmt::Arguments;
fn print_limited(args: Arguments<'_>) -> Result<()> {
const BUF_SIZE: usize = 3;
let mut buf = [0u8; BUF_SIZE];
let mut buf_writer = &mut buf[..];
let written = match buf_writer.write_fmt(args) {
// successfully wrote into the buffer, determine amount written
Ok(_) => BUF_SIZE - buf_writer.len(),
// a "failed to write whole buffer" error occurred meaning there was
// more to write than there was space for, return entire size.
Err(error) if error.kind() == ErrorKind::WriteZero => BUF_SIZE,
// something else went wrong
Err(error) => return Err(error),
};
// Pick a way to print `&buf[..written]`
println!("{}", std::str::from_utf8(&buf[..written]).unwrap());
Ok(())
}
fn main() {
print_limited(format_args!("this is just a test")).unwrap();
print_limited(format_args!("{}", 123)).unwrap();
print_limited(format_args!("{}", 'a')).unwrap();
}
thi
123
a
This was actually more involved than I originally thought. There might be a cleaner way to do this.
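One possibly cleaner variant, offered as a sketch rather than a definitive solution, is a small adapter that implements `std::fmt::Write`, forwards at most a given number of characters to an underlying `std::io::Write`, and drops the rest. The names `Truncate` and `write_truncated` are hypothetical. Unlike the fixed byte buffer, it cuts on character boundaries and needs no intermediate buffer.

```rust
use std::fmt;
use std::io;

/// Forwards at most `remaining` characters to `inner`; the rest is discarded.
pub struct Truncate<W: io::Write> {
    inner: W,
    remaining: usize,
}

impl<W: io::Write> fmt::Write for Truncate<W> {
    fn write_str(&mut self, s: &str) -> fmt::Result {
        // Byte offset just past the last character that still fits.
        let end = match s.char_indices().nth(self.remaining) {
            Some((idx, _)) => idx,
            None => s.len(),
        };
        let kept = &s[..end];
        self.remaining -= kept.chars().count();
        // fmt::Error carries no detail, so the io::Error is dropped here.
        self.inner.write_all(kept.as_bytes()).map_err(|_| fmt::Error)
    }
}

/// Formats `args` into `out`, keeping only the first `max_chars` characters.
pub fn write_truncated<W: io::Write>(
    out: W,
    max_chars: usize,
    args: fmt::Arguments<'_>,
) -> fmt::Result {
    let mut truncating = Truncate { inner: out, remaining: max_chars };
    fmt::Write::write_fmt(&mut truncating, args)
}
```

A call would look like `write_truncated(&mut stdout, 10, format_args!("{} ({} {} {}) {}", ...))`. Note that this counts characters, not terminal columns, so wide or combining characters may still overflow the intended width.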
The std::fmt documentation describes the precision parameter as follows:
For non-numeric types, this can be considered a "maximum width". If the resulting string is longer than this width, then it is truncated down to this many characters and that truncated value is emitted with proper fill, alignment and width if those parameters are set.
For integral types, this is ignored.
For floating-point types, this indicates how many digits after the decimal point should be printed.
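The three rules quoted above can be seen side by side in a short sketch (the function name `precision_examples` is only for illustration):

```rust
/// Applies the same `.3` precision to a string, an integer and a float.
pub fn precision_examples() -> [String; 3] {
    [
        // Non-numeric: precision acts as a maximum width, so the text is truncated.
        format!("{:.3}", "this is just a test"),
        // Integer: precision is ignored.
        format!("{:.3}", 1024),
        // Float: precision is the number of digits after the decimal point.
        format!("{:.3}", 1.5_f64),
    ]
}
```

The results are "thi", "1024" and "1.500" respectively.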
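Also, the return type of format_args! is std::fmt::Arguments, which is not a String, even though it looks like a string.
If you want the same printed output, I think the following code will work:
// as_str() was unstable when this was written (stable since Rust 1.52);
// it returns None if arguments must be formatted at runtime
println!("{:.3}", format_args!("this is just a test").as_str().unwrap());
// to_string() allocates a String
println!("{:.3}", format_args!("this is just a test").to_string().as_str());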

Rust - how to find n-th most frequent element in a collection

I can't imagine this hasn't been asked before, but I have searched everywhere and could not find the answer.
I have an iterable, which contains duplicate elements. I want to count number of times each element occurs in this iterable and return n-th most frequent one.
I have working code which does exactly that, but I really doubt it's the best way to achieve this.
use std::collections::{BinaryHeap, HashMap};
// returns n-th most frequent element in collection
pub fn most_frequent<T: std::hash::Hash + std::cmp::Eq + std::cmp::Ord>(array: &[T], n: u32) -> &T {
// initialize empty hashmap
let mut map = HashMap::new();
// count occurrences of each element in iterable and save as (value,count) in hashmap
for value in array {
// taken from https://doc.rust-lang.org/std/collections/struct.HashMap.html#method.entry
// not exactly sure how this works
let counter = map.entry(value).or_insert(0);
*counter += 1;
}
// determine highest frequency of some element in the collection
let mut heap: BinaryHeap<_> = map.values().collect();
let mut max = heap.pop().unwrap();
// get n-th largest value
for _i in 1..n {
max = heap.pop().unwrap();
}
// find that element (get key from value in hashmap)
// taken from https://stackoverflow.com/questions/59401720/how-do-i-find-the-key-for-a-value-in-a-hashmap
map.iter()
.find_map(|(key, &val)| if val == *max { Some(key) } else { None })
.unwrap()
}
Are there any better ways or more optimal std methods to achieve what I want? Or maybe there are some community made crates that I could use.
Your implementation has a time complexity of O(n log n), where n is the length of the array. The optimal solution to this problem has a complexity of O(n log k) for retrieving the k-th most frequent element. The usual implementation of this optimal solution indeed involves a binary heap, but not in the way you used it.
Here's a suggested implementation of the common algorithm:
use std::cmp::{Eq, Ord, Reverse};
use std::collections::{BinaryHeap, HashMap};
use std::hash::Hash;
pub fn most_frequent<T>(array: &[T], k: usize) -> Vec<(usize, &T)>
where
T: Hash + Eq + Ord,
{
let mut map = HashMap::new();
for x in array {
*map.entry(x).or_default() += 1;
}
let mut heap = BinaryHeap::with_capacity(k + 1);
for (x, count) in map.into_iter() {
heap.push(Reverse((count, x)));
if heap.len() > k {
heap.pop();
}
}
heap.into_sorted_vec().into_iter().map(|r| r.0).collect()
}
I changed the prototype of the function to return a vector of the k most frequent elements together with their counts, since this is what you need to keep track of anyway. If you only want the k-th most frequent element, you can index the result with [k - 1].1, since each entry is a (count, element) tuple.
The algorithm itself first builds a map of element counts the same way your code does – I just wrote it in a more concise form.
Next, we build a BinaryHeap for the most frequent elements. After each iteration, this heap contains at most k elements, which are the most frequent ones seen so far. If there are more than k elements in the heap, we drop the least frequent element. Since we always drop the least frequent element seen so far, the heap always retains the k most frequent elements seen so far. We need to use the Reverse wrapper to get a min heap, as described in the documentation of BinaryHeap.
Finally, we collect the results into a vector. The into_sorted_vec() function basically does this job for us, but we still want to unwrap the items from the Reverse wrapper: that wrapper is an implementation detail of our function and should not be returned to the caller.
(In Rust Nightly, we could also use the into_iter_sorted() method, saving one vector allocation.)
The code in this answer makes sure the heap is essentially limited to k elements, so an insertion to the heap has a complexity of O(log k). In your code, you push all elements from the array to the heap at once, without limiting the size of the heap, so you end up with a complexity of O(log n) for insertions. You essentially use the binary heap to sort a list of counts. That works, but it's certainly neither the easiest nor the fastest way to achieve that, so there is little justification for going that route.
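If only the k-th most frequent element is needed, as in the original question, a thin wrapper over the function above is enough. The sketch below repeats `most_frequent` so it is self-contained, with an explicit type on the map; `kth_most_frequent` is a hypothetical name, and it returns `None` rather than panicking when k is 0 or larger than the number of distinct elements.

```rust
use std::cmp::Reverse;
use std::collections::{BinaryHeap, HashMap};
use std::hash::Hash;

/// Returns up to `k` (count, element) pairs, most frequent first.
pub fn most_frequent<T>(array: &[T], k: usize) -> Vec<(usize, &T)>
where
    T: Hash + Eq + Ord,
{
    let mut map: HashMap<&T, usize> = HashMap::new();
    for x in array {
        *map.entry(x).or_default() += 1;
    }
    // Min-heap (via Reverse) that never holds more than k + 1 entries.
    let mut heap = BinaryHeap::with_capacity(k + 1);
    for (x, count) in map {
        heap.push(Reverse((count, x)));
        if heap.len() > k {
            heap.pop(); // drop the least frequent entry seen so far
        }
    }
    heap.into_sorted_vec().into_iter().map(|r| r.0).collect()
}

/// Returns the k-th most frequent element (1-based), if there is one.
pub fn kth_most_frequent<T>(array: &[T], k: usize) -> Option<&T>
where
    T: Hash + Eq + Ord,
{
    let idx = k.checked_sub(1)?;
    let top = most_frequent(array, k);
    top.get(idx).map(|&(_, x)| x)
}
```

Elements with equal counts are ordered by their `Ord` implementation, because the heap compares (count, element) tuples.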

Getting query string from Window object in WebAssembly in Rust

Context: I am learning Rust & WebAssembly and as a practice exercise I have a project that paints stuff in HTML Canvas from Rust code. I want to get the query string from the web request and from there the code can decide which drawing function to call.
I wrote this function to just return the query string with the leading ? removed:
fn decode_request(window: web_sys::Window) -> std::string::String {
let document = window.document().expect("no global window exist");
let location = document.location().expect("no location exists");
let raw_search = location.search().expect("no search exists");
let search_str = raw_search.trim_start_matches("?");
format!("{}", search_str)
}
It does work, but it seems amazingly verbose given how much simpler it would be in some of the other languages I have used.
Is there an easier way to do this? Or is the verbosity just the price you pay for safety in Rust and I should just get used to it?
Edit per answer from @IInspectable:
I tried the chaining approach and I get an error of:
temporary value dropped while borrowed
creates a temporary which is freed while still in use
note: consider using a `let` binding to create a longer lived value rustc(E0716)
It would be nice to understand that better; I am still getting the niceties of ownership through my head. The code is now:
fn decode_request(window: Window) -> std::string::String {
let location = window.location();
let search_str = location.search().expect("no search exists");
let search_str = search_str.trim_start_matches('?');
search_str.to_owned()
}
which is certainly an improvement.
This question is really about API design rather than its effects on the implementation. The implementation turned out to be fairly verbose mostly due to the contract chosen: Either produce a value, or die. There's nothing inherently wrong with this contract. A client calling into this function will never observe invalid data, so this is perfectly safe.
This may not be the best option for library code, though. Library code usually lacks context, and cannot make a good call on whether any given error condition is fatal or not. That's a question client code is in a far better position to answer.
Before moving on to explore alternatives, let's rewrite the original code in a more compact fashion, by chaining the calls together, without explicitly assigning each result to a variable:
fn decode_request(window: web_sys::Window) -> std::string::String {
window
.location()
.search().expect("no search exists")
.trim_start_matches('?')
.to_owned()
}
I'm not familiar with the web_sys crate, so there is a bit of guesswork involved, namely the assumption that window.location() returns the same value as the document()'s location(). Apart from chaining calls, the code presented makes two more changes:
trim_start_matches() is passed a character literal in place of a string literal. This produces optimal code without relying on the compiler's optimizer to figure out that a string of length 1 is attempting to search for a single character.
The return value is constructed by calling to_owned(). The format! macro adds overhead, and eventually calls to_string(). While that would exhibit the same behavior in this case, using the semantically more accurate to_owned() function helps you catch errors at compile time (e.g. if you accidentally returned 42.to_string()).
Alternatives
A more natural way to implement this function is to have it return either a value representing the query string, or no value at all. Rust provides the Option type to conveniently model this:
fn decode_request(window: web_sys::Window) -> Option<String> {
match window
.location()
.search() {
Ok(s) => Some(s.trim_start_matches('?').to_owned()),
_ => None,
}
}
This allows a client of the function to make decisions, depending on whether the function returns Some(s) or None. This maps all error conditions into a None value.
If it is desirable to convey the reason for failure back to the caller, the decode_request function can choose to return a Result value instead, e.g. Result<String, wasm_bindgen::JsValue>. In doing so, an implementation can take advantage of the ? operator, to propagate errors to the caller in a compact way:
fn decode_request(window: web_sys::Window) -> Result<String, wasm_bindgen::JsValue> {
Ok(window
.location()
.search()?
.trim_start_matches('?')
.to_owned())
}
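Once the query string is available, the question's goal of choosing a drawing function from it can be handled in plain Rust, independent of web_sys. The following is a minimal sketch; `query_param` is a hypothetical helper, it accepts the string with or without the leading '?', and it does no percent-decoding.

```rust
/// Returns the value of the first `key=value` pair in `search` whose key matches.
/// Pairs without an '=' are ignored.
pub fn query_param<'a>(search: &'a str, key: &str) -> Option<&'a str> {
    // Accept both "?a=1&b=2" and "a=1&b=2".
    let query = search.strip_prefix('?').unwrap_or(search);
    query
        .split('&')
        .filter_map(|pair| pair.split_once('='))
        .find(|(k, _)| *k == key)
        .map(|(_, v)| v)
}
```

The caller could then match on something like `query_param(&search, "shape")` to select what to draw, falling back to a default on `None`.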
