Creating an owned iterator with lines - rust

I am learning Rust by following the Rust Book and I am currently trying to modify the project in Chapter 12, but I can't understand why my code is not working.
The function in question is the search function:
fn search(query: &str, contents: String) -> Vec<String> {
contents.lines().filter(|line| line.contains(query)).collect()
}
which is supposed to get contents of a file as a string and return a collection of the lines in the file containing query. In this form, it throws the error "a value of type std::vec::Vec<std::string::String> cannot be built from an iterator over elements of type &str".
I think that the error comes from the use of lines since it doesn't take ownership of contents. My question is if there is a better way to do this or if there is a similar method to lines that does take ownership.

As a Rust learner, it is important to know the difference between strings (String) and string slices (&str), and how the two types interact.
The lines() method is defined on str and returns an iterator over &str slices that borrow from the original text.
Calling lines() on a String goes through the same method (String dereferences to str), so it also yields &str items rather than Strings. That is why the items cannot be collected into a Vec<String> directly.
This means your output will be of type Vec<&str>.
Because the function takes two references, the compiler cannot tell which one the returned slices borrow from, so you need an explicit lifetime. Your example would then look like this:
fn search<'a>(query: &str, contents: &'a str) -> Vec<&'a str> {
contents.lines().filter(|line| line.contains(query)).collect()
}
fn main() {
println!("found: {:?}",search("foo", "the foot\nof the\nfool"));
}
However if you want the vector to contain strings, you can use the to_owned() function to convert a &str into a String:
fn search(query: &str, contents: &str) -> Vec<String> {
contents.lines().map(|line| line.to_owned()).filter(|line| line.contains(query)).collect()
}
fn main() {
println!("{:?}",search("foo", "the foot\nof the\nfool"));
}
However, this is inefficient because it allocates a String for every line, including the lines that are filtered out afterwards, so it is better to map last:
fn search(query: &str, contents: &str) -> Vec<String> {
contents.lines().filter(|line| line.contains(query)).map(|line| line.to_owned()).collect()
}
fn main() {
println!("{:?}",search("foo", "the foot\nof the\nfool"));
}
Or with contents of type String, but I think this doesn't make much sense:
fn search(query: &str, contents: String) -> Vec<String> {
contents.lines().map(|line| line.to_owned()).filter(|line| line.contains(query)).collect()
}
fn main() {
println!("{:?}",search("foo", "the foot\nof the\nfool".to_owned()));
}
Explanation: Passing contents as a String isn't very useful here. The search function would own it, but it is not mutable, so you can't turn it into the search result in place. The result is also a vector, and a single owned String cannot be split into several owned Strings without allocating new ones.
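If you really do want an iterator that takes ownership of contents, one option is a rough sketch like the following. It assumes that wrapping the String in std::io::Cursor and using BufRead::lines is acceptable; that method yields io::Result<String>, so every item is an owned line. This is only an illustration, not the approach from the Rust Book.

```rust
use std::io::{BufRead, Cursor};

// Hypothetical variant of `search` that consumes `contents`.
fn search(query: &str, contents: String) -> Vec<String> {
    // Cursor<String> implements BufRead, and BufRead::lines yields
    // io::Result<String>, so each line is already an owned String.
    Cursor::new(contents)
        .lines()
        // Reading from memory should not fail; stop at the first error anyway.
        .map_while(Result::ok)
        .filter(|line| line.contains(query))
        .collect()
}

fn main() {
    let found = search("foo", String::from("the foot\nof the\nfool"));
    println!("{:?}", found);
}
```

Note that this still allocates a String for every line, including the ones that get filtered out, so the borrowed version that filters first and calls to_owned() last is probably the cheaper choice.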
P.S.: I'm also relatively new to Rust, so feel free to comment or edit my post if I missed something.

Related

Rust lifetime scoping in structs

So, I'm working on porting a string tokenizer that I wrote in Python over to Rust, and I've run into an issue I can't seem to get past with lifetimes and structs.
So, the process is basically:
Get an array of files
Convert each file to a Vec<String> of tokens
Use a Counter and UniCase to get counts of individual instances of tokens from each vec
Save that count in a struct, along with some other data
(Future) do some processing on the set of Structs to accumulate the total data along side the per-file data
struct Corpus<'a> {
words: Counter<UniCase<&'a String>>,
parts: Vec<CorpusPart<'a>>
}
pub struct CorpusPart<'a> {
percent_of_total: f32,
word_count: usize,
words: Counter<UniCase<&'a String>>
}
fn process_file(entry: &DirEntry) -> CorpusPart {
let mut contents = read_to_string(entry.path())
.expect("Could not load contents.");
let tokens = tokenize(&mut contents);
let counted_words = collect(&tokens);
CorpusPart {
percent_of_total: 0.0,
word_count: tokens.len(),
words: counted_words
}
}
pub fn tokenize(normalized: &mut String) -> Vec<String> {
// snip ...
}
pub fn collect(results: &Vec<String>) -> Counter<UniCase<&'_ String>> {
results.iter()
.map(|w| UniCase::new(w))
.collect::<Counter<_>>()
}
However, when I try to return CorpusPart it complains that it is trying to reference a local variable tokens. How can/should I deal with this? I tried adding lifetime annotations, but couldn't figure it out...
Essentially, I no longer need the Vec<String>, but I do need some of the Strings that were in it for the counter.
Any help is appreciated, thank you!
The issue here is that you are throwing away the Vec<String> but still referencing the elements inside it. If you no longer need the Vec<String> but still require some of its contents, you have to transfer ownership to something else.
I assume you want Corpus and CorpusPart to both point to the same Strings, so you are not duplicating Strings needlessly. If that is the case, either Corpus or CorpusPart must own the Strings, so that the one that doesn't own them references the Strings owned by the other. (It sounds more complicated than it actually is.)
I will assume CorpusPart owns the Strings, and Corpus just points to them:
use std::fs::DirEntry;
use std::fs::read_to_string;
pub struct UniCase<a> {
test: a
}
impl<a> UniCase<a> {
fn new(item: a) -> UniCase<a> {
UniCase {
test: item
}
}
}
type Counter<a> = Vec<a>;
struct Corpus<'a> {
words: Counter<UniCase<&'a String>>, // Will reference the strings in CorpusPart (I assume you implemented this elsewhere)
parts: Vec<CorpusPart>
}
pub struct CorpusPart {
percent_of_total: f32,
word_count: usize,
words: Counter<UniCase<String>> // Has ownership of the strings
}
fn process_file(entry: &DirEntry) -> CorpusPart {
let mut contents = read_to_string(entry.path())
.expect("Could not load contents.");
let tokens = tokenize(&mut contents);
let length = tokens.len(); // Cache the length, as tokens will no longer be valid once passed to collect
let counted_words = collect(tokens);
CorpusPart {
percent_of_total: 0.0,
word_count: length,
words: counted_words
}
}
pub fn tokenize(normalized: &mut String) -> Vec<String> {
Vec::new()
}
pub fn collect(results: Vec<String>) -> Counter<UniCase<String>> {
results.into_iter() // Use into_iter() to consume the Vec that is passed in, and take ownership of the internal items
.map(|w| UniCase::new(w))
.collect::<Counter<_>>()
}
I aliased Counter<a> to Vec<a>, as I don't know what Counter you are using.
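To illustrate the "one side owns, the other borrows" idea, here is a minimal sketch that uses only the standard library. It assumes a HashMap<String, usize> as a stand-in for Counter and ignores the case-insensitive comparison that UniCase provides. The names collect and totals are hypothetical helpers, not part of any crate.

```rust
use std::collections::HashMap;

// Stand-in for CorpusPart: owns its words and their counts.
pub struct CorpusPart {
    words: HashMap<String, usize>,
}

// Hypothetical helper: consume the tokens and count them.
pub fn collect(tokens: Vec<String>) -> CorpusPart {
    let mut words = HashMap::new();
    for token in tokens {
        *words.entry(token).or_insert(0) += 1;
    }
    CorpusPart { words }
}

// Hypothetical helper: corpus-wide totals that only borrow from the parts.
pub fn totals<'a>(parts: &'a [CorpusPart]) -> HashMap<&'a str, usize> {
    let mut total = HashMap::new();
    for part in parts {
        for (word, count) in &part.words {
            *total.entry(word.as_str()).or_insert(0) += count;
        }
    }
    total
}

fn main() {
    let parts = vec![
        collect(vec!["a".to_string(), "b".to_string(), "a".to_string()]),
        collect(vec!["b".to_string()]),
    ];
    let total = totals(&parts);
    println!("a = {}, b = {}", total["a"], total["b"]);
}
```

The borrowed totals are kept outside the struct that owns the parts. Storing both in one struct would make it self-referential, which tends to run into trouble with the borrow checker.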

Simple implementation to get iterator of first word of each line of input

I needed an iterator that streams the first alphabetic word of each line of an implementation of Read. This iterator:
Returns an error if reading the input failed
Returns an iterator of strings, each representing an alphabetic word
ignores empty strings or first words containing characters other than [a-zA-Z]
I eventually ended up with the following implementation:
fn get_first_words<'a>(r: &'a mut impl Read) -> impl Iterator<Item = Result<String>> + 'a {
BufReader::new(r).lines().filter_map(|rline| {
match rline.map(|line| {
line.split_whitespace()
.next()
.filter(|word| word.chars().all(char::is_alphabetic))
.map(&str::to_string)
}) {
Err(e) => Some(Err(e)),
Ok(Some(w)) => Some(Ok(w)),
Ok(None) => None,
}
})
}
This works fine but was more complex than I had expected. There are nested iterators in this implementation, and there was some type juggling in order to keep Result as the wrapping type while filtering on the contained values.
Could this have been written more simply, with less nested logic and with less type juggling?
You can replace your match expression with Result::transpose(). I would also suggest splitting out the function that returns the first word to make the code more readable. Finally, you don't need to accept &'a mut impl Read; simply accepting impl Read instead will work as well, since there is a forwarding implementation that implements Read for &mut impl Read. Together, the simplified code could look like this:
fn first_word(s: String) -> Option<String> {
s.split_whitespace()
.next()
.filter(|word| word.chars().all(char::is_alphabetic))
.map(From::from)
}
fn get_first_words(r: impl Read) -> impl Iterator<Item = Result<String>> {
BufReader::new(r)
.lines()
.filter_map(|line| line.map(first_word).transpose())
}
Edit: Using impl Read instead of &mut impl Read will result in mutable references being moved into the function rather than being implicitly reborrowed, so maybe it's not a good idea after all, since it will be confusing to remember to explicitly reborrow them where necessary.
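For reference, here is a self-contained version of the simplified code with the imports it needs and a small driver. The in-memory byte slice used as input is an assumption for illustration only. Result::transpose turns Ok(Some(w)) into Some(Ok(w)), Ok(None) into None, and Err(e) into Some(Err(e)), which is exactly what the original match did by hand.

```rust
use std::io::{self, BufRead, BufReader, Read};

fn first_word(s: String) -> Option<String> {
    s.split_whitespace()
        .next()
        .filter(|word| word.chars().all(char::is_alphabetic))
        .map(From::from)
}

fn get_first_words(r: impl Read) -> impl Iterator<Item = io::Result<String>> {
    BufReader::new(r)
        .lines()
        // io::Result<Option<String>> becomes Option<io::Result<String>>,
        // so filter_map drops the Ok(None) cases and keeps errors.
        .filter_map(|line| line.map(first_word).transpose())
}

fn main() -> io::Result<()> {
    let mut input: &[u8] = b"hello world\n\n123 abc\nfoo bar\n";
    // Passing `&mut input` keeps `input` itself usable afterwards.
    let words = get_first_words(&mut input).collect::<io::Result<Vec<_>>>()?;
    println!("{:?}", words);
    Ok(())
}
```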

Proper way to return a new string in Rust

I just spent a week reading the Rust Book, and now I'm working on my first program, which returns the filepath to the system wallpaper:
pub fn get_wallpaper() -> &str {
let output = Command::new("gsettings");
// irrelevant code
if let Ok(message) = String::from_utf8(output.stdout) {
return message;
} else {
return "";
}
}
I'm getting the error expected lifetime parameter on &str and I know Rust wants an input &str which will be returned as the output because any &str I create inside the function will be cleaned up immediately after the function ends.
I know I can sidestep the issue by returning a String instead of a &str, and many answers to similar questions have said as much. But I can also seemingly do this:
fn main() {
println!("message: {}", hello_string(""));
}
fn hello_string(x: &str) -> &str {
return "hello world";
}
to get a &str out of my function. Can someone explain to me why this is bad and why I should never do it? Or maybe it's not bad and okay in certain situations?
You cannot return a &str if you've allocated the String in the function. There's further discussion about why, as well as the fact that it's not limited to strings. That makes your choice much easier: return the String.
Quoting the question: "Strings are heap-allocated and built to be mutable."
Strings are heap-allocated because they have an unknown length. Since that allocation is solely owned by the String, that's what grants the ability to mutate the string.
Quoting the question: "My function just returns a filepath for reference purposes, and I'd rather leave it up to the caller to decide if they need a heap-stored mutable string."
This isn't possible. Your function has performed an allocation. If you don't return the allocation to the caller, then the value must be deallocated to prevent memory leaks. If it was returned after deallocation, that would be an invalid reference, leading to memory safety violations.
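Returning the allocation to the caller can be sketched as follows. The helper name wallpaper_from_stdout is hypothetical, and its argument stands in for the bytes of output.stdout from the question, so that the example runs without calling gsettings.

```rust
// Hypothetical helper: `stdout` stands in for the bytes of `output.stdout`.
fn wallpaper_from_stdout(stdout: Vec<u8>) -> String {
    // On success, from_utf8 takes over the Vec's buffer; on invalid UTF-8
    // we fall back to an empty String, as the question's code intended.
    String::from_utf8(stdout).unwrap_or_default()
}

fn main() {
    let path = wallpaper_from_stdout(b"/usr/share/backgrounds/example.png".to_vec());
    // The caller owns the String and can still borrow it as a &str.
    let as_slice: &str = &path;
    println!("wallpaper: {}", as_slice);
}
```

The caller can borrow the returned String as a &str whenever a slice is enough, so returning String does not force anyone to mutate it.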
Quoting the question: But I can also seemingly do this:
fn hello_string(x: &str) -> &str {
return "hello world";
}
to get a &str out of my function. Can someone explain to me why this is bad and why I should never do it? Or maybe it's not bad and okay in certain situations?
It's not bad, it just doesn't really allow you to do what you want in your original case. That "hello world" is a &'static str, a string slice that's been stored inside the code of the program itself. It has a fixed length and is known to live longer than main.
The signature fn hello_string(x: &str) -> &str can be expanded to fn hello_string<'a>(x: &'a str) -> &'a str. This indicates that the resulting string slice must have the same lifetime as the input string. A static string will outlive any lifetime, so that's valid to substitute.
This would be useful for a function where the result is based on the input string only:
fn long_string(x: &str) -> &str {
if x.len() > 10 {
"too long"
} else {
x
}
}
However, in your case, the function owns the String. If you attempted to return a reference to a String, completely unrelated to the input string:
fn hello_string(x: &str) -> &str {
&String::from("hello world")
}
You'll run into the common error message "borrowed value does not live long enough". That's because the borrowed value only lives until the end of the function, not as long as the input string slice. You can't "trick" the compiler (or if you can, that's a major bug).
If you want to return a &str in Rust and the compiler cannot infer its lifetime, you have to add a generic lifetime parameter. Example:
fn hello_string<'life>() -> &'life str {
return "hello world";
}
or,
fn hello_string<'life>(a: &'life str, b: &'life str) -> &'life str {
return "hello world";
}
There are three lifetime elision rules:
Each parameter that is a reference gets its own lifetime parameter.
If there is exactly one input lifetime parameter, that lifetime is assigned to all output lifetime parameters.
If there are multiple input lifetime parameters, but one of them is &self or &mut self, the lifetime of self is assigned to all output lifetime parameters.
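The three rules can be illustrated with a small sketch. The names first_word, Greeter and longest are made up for this example; the comments show what each elided signature should expand to.

```rust
// Rules 1 and 2: there is one input reference, so its lifetime is used for the output.
// Equivalent to: fn first_word<'a>(s: &'a str) -> &'a str
fn first_word(s: &str) -> &str {
    s.split_whitespace().next().unwrap_or("")
}

struct Greeter {
    name: String,
}

impl Greeter {
    // Rule 3: &self is present, so the output borrows from self, not from `_prefix`.
    // Equivalent to: fn name<'a, 'b>(&'a self, _prefix: &'b str) -> &'a str
    fn name(&self, _prefix: &str) -> &str {
        &self.name
    }
}

// No rule settles the output here: two input lifetimes and no self,
// so the lifetime has to be written out by hand.
fn longest<'a>(a: &'a str, b: &'a str) -> &'a str {
    if a.len() >= b.len() { a } else { b }
}

fn main() {
    let greeter = Greeter { name: String::from("Ann") };
    println!("{}", first_word("hello world"));
    println!("{}", greeter.name("Dr."));
    println!("{}", longest("ab", "abc"));
}
```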

Is there a way to avoid cloning when converting a PathBuf to a String?

I need to simply (and dangerously - error handling omitted for brevity) get the current executable name. I made it work, but my function converts a &str to String only to call as_str() on it later for pattern matching.
fn binary_name() -> String {
std::env::current_exe().unwrap().file_name().unwrap().to_str().unwrap().to_string()
}
As I understand it, std::env::current_exe() gives me ownership of the PathBuf which I could transfer by returning it. As it stands, I borrow it to convert it to &str. From there, the only way to return the string is to clone it before the PathBuf is dropped.
Is there any way to avoid this &OsStr -> &str -> String -> &str cycle?
Quoting the question: "Is there a way to avoid cloning when converting a PathBuf to a String?"
Absolutely. However, that's not what you are doing. You are taking a part of the PathBuf via file_name and converting that. You cannot take ownership of a part of a string.
If you weren't taking a subset, then converting an entire PathBuf can be done by converting to an OsString and then to a String. Here, I ignore the specific errors and just return success or failure:
use std::path::PathBuf;
fn exe_name() -> Option<String> {
std::env::current_exe()
.ok()
.map(PathBuf::into_os_string)
.and_then(|exe| exe.into_string().ok())
}
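The difference between converting the whole path and converting only a part of it can be sketched with a fixed example path instead of current_exe(). The helper names whole_path and file_name_owned are hypothetical.

```rust
use std::path::{Path, PathBuf};

// Consumes the PathBuf; the conversions take `self` by value,
// so the owned buffer is handed along rather than cloned here.
fn whole_path(path: PathBuf) -> Option<String> {
    path.into_os_string().into_string().ok()
}

// Only part of the path is wanted, so the borrowed &str has to be
// copied into a new String to get an owned value.
fn file_name_owned(path: &Path) -> Option<String> {
    path.file_name()?.to_str().map(str::to_owned)
}

fn main() {
    let path = PathBuf::from("/usr/bin/cat");
    println!("{:?}", file_name_owned(&path));
    println!("{:?}", whole_path(path));
}
```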
Quoting the question: "Is there any way to avoid this &OsStr -> &str -> String -> &str cycle?"
No, because you are creating the String (or OsString or PathBuf, whichever holds ownership depending on the variant of code) inside your method. Check out Return local String as a slice (&str) for why you cannot return a reference to a stack-allocated item (including a string).
As stated in that Q&A, if you want to have references, the thing owning the data has to outlive the references:
use std::env;
use std::path::Path;
use std::ffi::OsStr;
fn binary_name(path: &Path) -> Option<&str> {
path.file_name().and_then(OsStr::to_str)
}
fn main() {
let exe = env::current_exe().ok();
match exe.as_ref().and_then(|e| binary_name(e)) {
Some("cat") => println!("Called as cat"),
Some("dog") => println!("Called as dog"),
Some(other) => println!("Why did you call me {}?", other),
None => println!("Not able to figure out what I was called as"),
}
}
Your original code can be written to not crash on errors easily enough:
fn binary_name() -> Option<String> {
let exe = std::env::current_exe();
exe.ok()
.as_ref()
.and_then(|p| p.file_name())
.and_then(|s| s.to_str())
.map(String::from)
}

Why is it discouraged to accept a reference &String, &Vec, or &Box as a function argument?

I wrote some Rust code that takes a &String as an argument:
fn awesome_greeting(name: &String) {
println!("Wow, you are awesome, {}!", name);
}
I've also written code that takes in a reference to a Vec or Box:
fn total_price(prices: &Vec<i32>) -> i32 {
prices.iter().sum()
}
fn is_even(value: &Box<i32>) -> bool {
**value % 2 == 0
}
However, I received some feedback that doing it like this isn't a good idea. Why not?
TL;DR: One can instead use &str, &[T] or &T to allow for more generic code.
One of the main reasons to use a String or a Vec is because they allow increasing or decreasing the capacity. However, when you accept an immutable reference, you cannot use any of those interesting methods on the Vec or String.
Accepting a &String, &Vec or &Box also requires the argument to be allocated on the heap before you can call the function. Accepting a &str allows a string literal (saved in the program data) and accepting a &[T] or &T allows a stack-allocated array or variable. Unnecessary allocation is a performance loss. This is usually exposed right away when you try to call these methods in a test or a main method:
awesome_greeting(&String::from("Anna"));
total_price(&vec![42, 13, 1337]);
is_even(&Box::new(42));
Another performance consideration is that &String, &Vec and &Box introduce an unnecessary layer of indirection as you have to dereference the &String to get a String and then perform a second dereference to end up at &str.
Instead, you should accept a string slice (&str), a slice (&[T]), or just a reference (&T). A &String, &Vec<T> or &Box<T> will be automatically coerced (via deref coercion) to a &str, &[T] or &T, respectively.
fn awesome_greeting(name: &str) {
println!("Wow, you are awesome, {}!", name);
}
fn total_price(prices: &[i32]) -> i32 {
prices.iter().sum()
}
fn is_even(value: &i32) -> bool {
*value % 2 == 0
}
Now you can call these methods with a broader set of types. For example, awesome_greeting can be called with a string literal ("Anna") or an allocated String. total_price can be called with a reference to an array (&[1, 2, 3]) or an allocated Vec.
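As a quick check of that claim, the following sketch repeats the three functions and calls each one with both an unallocated and a heap-allocated argument. The calls in main are illustrative only.

```rust
fn awesome_greeting(name: &str) {
    println!("Wow, you are awesome, {}!", name);
}

fn total_price(prices: &[i32]) -> i32 {
    prices.iter().sum()
}

fn is_even(value: &i32) -> bool {
    *value % 2 == 0
}

fn main() {
    // String literal and owned String both work (&String coerces to &str).
    awesome_greeting("Anna");
    awesome_greeting(&String::from("Anna"));

    // Array reference and Vec reference both coerce to &[i32].
    println!("{}", total_price(&[1, 2, 3]));
    println!("{}", total_price(&vec![42, 13, 1337]));

    // Plain reference and &Box<i32> both end up as &i32.
    println!("{}", is_even(&42));
    println!("{}", is_even(&Box::new(42)));
}
```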
If you'd like to add or remove items from the String or Vec<T>, you can take a mutable reference (&mut String or &mut Vec<T>):
fn add_greeting_target(greeting: &mut String) {
greeting.push_str("world!");
}
fn add_candy_prices(prices: &mut Vec<i32>) {
prices.push(5);
prices.push(25);
}
Specifically for slices, you can also accept a &mut [T] or &mut str. This allows you to mutate a specific value inside the slice, but you cannot change the number of items inside the slice (which means it's very restricted for strings):
fn reset_first_price(prices: &mut [i32]) {
prices[0] = 0;
}
fn lowercase_first_ascii_character(s: &mut str) {
if let Some(f) = s.get_mut(0..1) {
f.make_ascii_lowercase();
}
}
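A short usage sketch for these two functions follows; the inputs are chosen for illustration. The string whose first character is not ASCII shows why &mut str is so restricted: get_mut(0..1) returns None when byte 1 is not a character boundary, so the string is left unchanged.

```rust
fn reset_first_price(prices: &mut [i32]) {
    // Indexing panics on an empty slice.
    prices[0] = 0;
}

fn lowercase_first_ascii_character(s: &mut str) {
    // get_mut returns None if 0..1 does not fall on char boundaries.
    if let Some(f) = s.get_mut(0..1) {
        f.make_ascii_lowercase();
    }
}

fn main() {
    let mut array = [5, 25];
    let mut vector = vec![7, 8, 9];
    reset_first_price(&mut array); // &mut [i32; 2] coerces to &mut [i32]
    reset_first_price(&mut vector); // &mut Vec<i32> coerces to &mut [i32]
    println!("{:?} {:?}", array, vector);

    let mut ascii = String::from("Hello");
    let mut accented = String::from("\u{c9}lan"); // starts with a two-byte character
    lowercase_first_ascii_character(&mut ascii);
    lowercase_first_ascii_character(&mut accented);
    println!("{} {}", ascii, accented);
}
```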
In addition to Shepmaster's answer, another reason to accept a &str (and similarly &[T] etc) is because of all of the other types besides String and &str that also satisfy Deref<Target = str>. One of the most notable examples is Cow<str>, which lets you be very flexible about whether you are dealing with owned or borrowed data.
If you have:
fn awesome_greeting(name: &String) {
println!("Wow, you are awesome, {}!", name);
}
But you need to call it with a Cow<str>, you'll have to do this:
let c: Cow<str> = Cow::from("hello");
// Allocate an owned String from a str reference and then makes a reference to it anyway!
awesome_greeting(&c.to_string());
When you change the argument type to &str, you can use Cow seamlessly, without any unnecessary allocation, just like with String:
let c: Cow<str> = Cow::from("hello");
// Just pass the same reference along
awesome_greeting(&c);
let c: Cow<str> = Cow::from(String::from("hello"));
// Pass a reference to the owned string that you already have
awesome_greeting(&c);
Accepting &str makes calling your function more uniform and convenient, and the "easiest" way is now also the most efficient. These examples will also work with Cow<[T]> etc.
The recommendation is to use &str over &String because a &String coerces to a &str, so a &str parameter can be used for both owned strings and string slices, but not the other way around:
use std::borrow::Cow;
fn greeting_one(name: &String) {
println!("Wow, you are awesome, {}!", name);
}
fn greeting_two(name: &str) {
println!("Wow, you are awesome, {}!", name);
}
fn main() {
let s1 = "John Doe".to_string();
let s2 = "Jenny Doe";
let s3 = Cow::Borrowed("Sally Doe");
let s4 = Cow::Owned("Sally Doe".to_string());
greeting_one(&s1);
// greeting_one(&s2); // Does not compile
// greeting_one(&s3); // Does not compile
greeting_one(&s4);
greeting_two(&s1);
greeting_two(s2);
greeting_two(&s3);
greeting_two(&s4);
}
Using vectors to manipulate text is never a good idea and does not even deserve discussion, because you will lose all the sanity checks and performance optimizations. The String type uses a vector internally anyway. Remember, Rust uses UTF-8 for strings for storage efficiency. If you use a vector, you have to repeat all the hard work. Other than that, borrowing vectors or boxed values should be OK.
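A small sketch of the kind of check that a raw byte vector does not give you: String::from_utf8 validates the bytes, while a Vec<u8> accepts anything. The helper names bytes_to_text and byte_and_char_len are made up for this example.

```rust
// Hypothetical helper: only valid UTF-8 becomes a String.
fn bytes_to_text(bytes: Vec<u8>) -> Option<String> {
    String::from_utf8(bytes).ok()
}

// Hypothetical helper: length in bytes versus length in characters.
fn byte_and_char_len(s: &str) -> (usize, usize) {
    (s.len(), s.chars().count())
}

fn main() {
    println!("{:?}", bytes_to_text(vec![0x68, 0x69])); // valid UTF-8
    println!("{:?}", bytes_to_text(vec![0xff])); // not valid UTF-8
    println!("{:?}", byte_and_char_len("\u{e9}")); // one character, two bytes
}
```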
Because those types can be coerced, a function that takes them accepts fewer argument types than it could:
1- A reference to a String can be coerced to a str slice. For example, create a function:
fn count_vowels(words: &String) -> usize {
words.chars().filter(|c| matches!(*c, 'a' | 'e' | 'i' | 'o' | 'u')).count()
}
If you pass a &str, it will not be accepted:
let name = "yilmaz".to_string();
println!("{}", count_vowels(&name));
// this is not allowed because the argument should be a &String but we are passing a &str
// println!("{}", count_vowels("yilmaz"));
But if that function accepts &str instead
// words: &str
fn count_vowels(words: &str) -> usize { ... }
we can pass both types to the function:
let name = "yilmaz".to_string();
println!("{}", count_vowels(&name));
println!("{}", count_vowels("yilmaz"));
With this, our function can accept more types.
2- Similarly, a reference to a Box, &Box<T>, will be coerced to a reference to the value inside the Box, &T. For example:
fn length(name: &Box<&str>) {
println!("length {}", name.len())
}
This accepts only the &Box<&str> type:
let boxed_str = Box::new("Hello");
length(&boxed_str);
// expected reference `&Box<&str>` found reference `&'static str`
// length("hello")
If the parameter type is &str instead, we can pass both types.
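A minimal sketch of that last point, with length changed to take &str and to return the length rather than print it (a small variation made for this example):

```rust
// Variation of `length` that takes a string slice.
fn length(name: &str) -> usize {
    name.len()
}

fn main() {
    let boxed_str: Box<&str> = Box::new("Hello");
    // &Box<&str> is dereferenced step by step until it matches &str.
    println!("length {}", length(&boxed_str));
    // A plain string literal works as well.
    println!("length {}", length("hello"));
}
```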
3- A similar relationship exists between a reference to a Vec and a reference to an array; both can be coerced to a slice:
fn square(nums:&Vec<i32>){
for num in nums{
println!("square of {} is {}",num,num*num)
}
}
fn main(){
let nums=vec![1,2,3,4,5];
let nums_array=[1,2,3,4,5];
// only &Vec<i32> is accepted
square(&nums);
// mismatched types: expected reference `&Vec<i32>`, found reference `&[{integer}; 5]`
//square(&nums_array)
}
This will work for both types:
fn square(nums: &[i32]) { .. }
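Filled in, the slice version might look like the sketch below. The function is named squares here and returns the squared values instead of printing inside the loop, which is a small variation made so the result is easy to inspect.

```rust
// Slice-based variation: works for Vec, arrays and sub-slices.
fn squares(nums: &[i32]) -> Vec<i32> {
    nums.iter().map(|n| n * n).collect()
}

fn main() {
    let nums = vec![1, 2, 3, 4, 5];
    let nums_array = [1, 2, 3, 4, 5];
    println!("{:?}", squares(&nums)); // &Vec<i32> coerces to &[i32]
    println!("{:?}", squares(&nums_array)); // &[i32; 5] coerces to &[i32]
    println!("{:?}", squares(&nums[1..3])); // a sub-slice is accepted too
}
```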
