Creating word iterator from line iterator - rust

I have a string iterator lines that I get from stdin with
use std::io::{self, BufRead};
let mut stdin = io::stdin();
let lines = stdin.lock().lines().map(|l| l.unwrap());
The lines iterator yields values of type String, not &str. I want to create an iterator that iterates over the input words instead of lines. It seems like this should be doable but my naive attempt does not work:
let words = lines.flat_map(|l| l.split_whitespace());
The compiler tells me that l is being dropped while still borrowed, which makes sense:
error[E0597]: `l` does not live long enough
--> src/lib.rs:6:36
|
6 | let words = lines.flat_map(|l| l.split_whitespace());
| ^ - `l` dropped here while still borrowed
| |
| borrowed value does not live long enough
7 | }
| - borrowed value needs to live until here
Is there some other clean way that accomplishes this?

In your example code, lines is an iterator over the lines read in from the reader you have obtained from stdin. As you say, it returns String instances, but you are not storing them anywhere.
split_whitespace is actually a method of str (String reaches it through Deref), and it is defined like this:
pub fn split_whitespace(&self) -> SplitWhitespace
So, it takes a reference to a string - it does not consume the string. It returns an iterator that yields string slices &str - which reference portions of the string, but don't own it.
In fact, as soon as the closure you passed to flat_map returns, nothing owns the String l any more, so it is dropped. That would leave the &str values yielded by words dangling, hence the error.
One solution is to collect the lines into a vector, like this:
let lines: Vec<String> = stdin.lock().lines().map(|l| l.unwrap()).collect();
let words = lines.iter().flat_map(|l| l.split_whitespace());
The String instances are kept in the Vec<String>, which can live on so that the &str yielded by words have something to refer to.
If there were a lot of lines, and you did not want to keep them all in memory, you might prefer to do it a line at a time:
let lines = stdin.lock().lines().map(|l| l.unwrap());
let words = lines.flat_map(|l| {
l.split_whitespace()
.map(|s| s.to_owned())
.collect::<Vec<String>>()
.into_iter()
});
Here the words of each line are collected into a Vec, a line at a time. The trade-off is lower overall memory consumption against the overhead of constructing a Vec<String> for each line and copying each word into it.
You might have been hoping for a zero-copy implementation that consumes the Strings that lines produces. I think that would be possible by writing a split_whitespace()-like function that takes ownership of the String and returns an iterator that owns it.
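A rough sketch of what such an owning iterator could look like. OwnedWords, owned_words and words are hypothetical names I made up here, not standard-library items. Note one limitation: the standard Iterator trait cannot hand out &str values borrowed from the iterator itself, so this version still copies each word into a String. What it avoids is the intermediate Vec per line.

```rust
/// Hypothetical helper: owns one line and yields its words as owned Strings.
struct OwnedWords {
    line: String,
    pos: usize, // byte offset where the unread remainder starts
}

/// Hypothetical constructor: like `str::split_whitespace`, but takes ownership.
fn owned_words(line: String) -> OwnedWords {
    OwnedWords { line, pos: 0 }
}

impl Iterator for OwnedWords {
    type Item = String;

    fn next(&mut self) -> Option<String> {
        // Skip leading whitespace in the unread remainder.
        let rest = self.line[self.pos..].trim_start();
        if rest.is_empty() {
            self.pos = self.line.len();
            return None;
        }
        let start = self.line.len() - rest.len();
        // The word ends at the next whitespace character, or at the end of the line.
        let len = rest.find(char::is_whitespace).unwrap_or(rest.len());
        self.pos = start + len;
        Some(self.line[start..start + len].to_owned())
    }
}

/// Turns an iterator of owned lines into an iterator of owned words.
fn words<I: Iterator<Item = String>>(lines: I) -> impl Iterator<Item = String> {
    lines.flat_map(owned_words)
}
```

With the lines iterator from the question, this would be called as words(lines).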

Related

Get elements from Vector of tab delimited Strings

I have a vector of Strings as in the example below, and for every element in that vector, I want to get the second and third items. I don't know if I should be collecting a &str or String, but I haven't gotten to that part because this does not compile.
Everything is "fine" until I add the slicing [1..]
let elements: Vec<&str> = vec!["foo\tbar\tbaz", "ffoo\tbbar\tbbaz"]
.iter()
.map(|rec| rec.rsplit('\t').collect::<Vec<_>>()[1..])
.collect();
It complains because
the size for values of type `[&str]` cannot be known at compilation time
the trait `std::marker::Sized` is not implemented for `[&str]` (rustc E0277)
As the compiler tells you, the slicing is broken because in Rust a slicing expression returns, well, the slice itself, whose size is unknown at compile time (hence the compiler complaining that it's unsized).
That's why you normally reference the slice e.g.
&thing[1..]
unless it's a context where it doesn't matter, or you immediately convert the slice to a vector or array.
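For illustration, here is a small sketch of both options. The function names tail and tail_owned are made up for this example: the first borrows a slice from data the caller owns, the second copies the slice into a new Vec so that nothing borrows the temporary.

```rust
/// Borrowing: fine because `fields` is owned by the caller and outlives the slice.
/// Panics if `fields` is empty, like any out-of-range slice.
fn tail<'a, 'b>(fields: &'a [&'b str]) -> &'a [&'b str] {
    &fields[1..]
}

/// Owning: copy the tail into a new Vec before the temporary Vec is dropped.
fn tail_owned(record: &str) -> Vec<&str> {
    let fields: Vec<&str> = record.split('\t').collect();
    fields[1..].to_vec()
}
```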
However, that would not work here, because a slice is a "borrowing" structure: it doesn't own anything. It borrows the Vec being created inside the map, which means you'll get a borrowing error, because the Vec will be destroyed at the end of the callback, and thus the slice would be referencing invalid memory:
error[E0515]: cannot return value referencing temporary value
--> src/main.rs:5:16
|
5 | .map(|rec| &rec.rsplit('\t').collect::<Vec<_>>()[1..])
| ^------------------------------------^^^^^
| ||
| |temporary value created here
| returns a value referencing data owned by the current function
A first step is to trim the iterator before collecting the Vec, using Iterator::skip:
let elements: Vec<&str> = my_vec
.iter()
.map(|rec| rec.rsplit('\t').skip(1).collect::<Vec<_>>())
.collect();
However this means you now have an Iterator<Item=Vec<&str>>, which doesn't collect to a Vec<&str>.
You could always Iterator::flatten the inner vecs, but in reality they're completely unnecessary: you can just Iterator::flat_map each original string into a stream of strings which automatically get folded into the parent:
https://play.rust-lang.org/?version=stable&mode=debug&edition=2021&gist=f2c33c1b6a30224202357dc4bd5c1d19
let my_vec = vec!["foo\tbar\tbaz", "ffoo\tbbar\tbbaz"];
let elements: Vec<&str> = my_vec
.iter()
.flat_map(|rec| rec.rsplit('\t').skip(1))
.collect();
dbg!(elements);
By the by, the code you're showing doesn't match the description. You say:
for every element in that vector, I want to get the second and third items
but since you're using rsplit, what you're getting is the second and first items: rsplit iterates from the end, hence the r for reverse.
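To make the difference concrete, here is a small sketch. The helper names second_and_third and with_rsplit are made up for illustration, and take(2) is my addition so that the first version stops after the third field.

```rust
/// Second and third tab-separated fields of every record, in order.
fn second_and_third<'a>(records: &[&'a str]) -> Vec<&'a str> {
    records
        .iter()
        .flat_map(|rec| rec.split('\t').skip(1).take(2))
        .collect()
}

/// The same chain with `rsplit`, which walks each record from the end.
fn with_rsplit<'a>(records: &[&'a str]) -> Vec<&'a str> {
    records
        .iter()
        .flat_map(|rec| rec.rsplit('\t').skip(1))
        .collect()
}
```

For the two records in the question, the first function gives bar, baz, bbar, bbaz, while the second gives bar, foo, bbar, ffoo.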

Confused about ownership in situations involving lines and map

fn problem() -> Vec<&'static str> {
let my_string = String::from("First Line\nSecond Line");
my_string.lines().collect()
}
This fails with the compilation error:
|
7 | my_string.lines().collect()
| ---------^^^^^^^^^^^^^^^^^^
| |
| returns a value referencing data owned by the current function
| `my_string` is borrowed here
I understand what this error means - it's to stop you returning a reference to a value which has gone out of scope. Having looked at the type signatures of the functions involved, it appears that the problem is with the lines method, which borrows the string it's called on. But why does this matter? I'm iterating over the lines of the string in order to get a vector of the parts, and what I'm returning is this "new" vector, not anything that would (illegally) directly reference my_string.
(I'm aware I could fix this particular example very easily by just using the string literal rather than converting to an "owned" string with String::from. This is a toy example to reproduce the problem - in my "real" code the string variable is read from a file, so I obviously can't use a literal.)
What's even more mysterious to me is that the following variation on the function, which to me ought to suffer from the same problem, works fine:
fn this_is_ok() -> Vec<i32> {
let my_string = String::from("1\n2\n3\n4");
my_string.lines().map(|n| n.parse().unwrap()).collect()
}
The reason can't be map doing some magic, because this also fails:
fn also_fails() -> Vec<&'static str> {
let my_string = String::from("First Line\nSecond Line");
my_string.lines().map(|s| s).collect()
}
I've been playing about for quite a while, trying various different functions inside the map - and some pass and some fail, and I've honestly no idea what the difference is. And all this is making me realise that I have very little handle on how Rust's ownership/borrowing rules work in non-trivial cases, even though I thought I at least understood the basics. So if someone could give me a relatively clear and comprehensive guide to what is going on in all these examples, and how it might be possible to fix those which fail, in some straightforward way, I would be extremely grateful!
The key is in the type of the value yielded by lines: &str. In order to avoid unnecessary clones, lines yields slices that borrow from the string it's called on, and when you collect them into a Vec, that Vec's elements are simply references into your string. So when your function exits and the string is dropped, the references inside the Vec would be left dangling, which the compiler refuses to allow. Remember, &str is a borrowed string, and String is an owned string.
The parsing works because you take those &strs and parse each one into an i32, so the data is transferred to a new value and you no longer need a reference to the original string.
To fix your problem, simply use str::to_owned to convert each element into a String:
fn problem() -> Vec<String> {
let my_string = String::from("First Line\nSecond Line");
my_string.lines().map(|v| v.to_owned()).collect()
}
It should be noted that to_string also works, and that to_owned is actually part of the ToOwned trait, so it is useful for other borrowed types as well.
For references to sized values (str is unsized so this doesn't apply), such as an Iterator<Item = &i32>, you can simply use Iterator::cloned to clone every element so they are no longer references.
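As a quick illustration of cloned (evens is a made-up function name, not something from the question):

```rust
/// Returns owned copies of the even numbers, so the result does not borrow `numbers`.
fn evens(numbers: &[i32]) -> Vec<i32> {
    numbers
        .iter() // Iterator<Item = &i32>
        .filter(|n| **n % 2 == 0)
        .cloned() // Iterator<Item = i32>
        .collect()
}
```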
An alternative solution would be to take the string as an argument, by reference, so that it, and therefore references into it, can outlive the function:
fn problem(my_string: &str) -> Vec<&str> {
my_string.lines().collect()
}
The problem here is that this line:
let my_string = String::from("First Line\nSecond Line");
copies the string data to a buffer allocated on the heap (so no longer 'static). Then lines returns references to that heap-allocated buffer.
Note that &str also implements a lines method, so you don't need to copy the string data to the heap, you can use your string directly:
fn problem() -> Vec<&'static str> {
let my_string = "First Line\nSecond Line";
my_string.lines().collect()
}
which avoids all unnecessary allocations and copying.

Unwrapping a skipped Chars iterator

Many iterator methods in Rust generate iterators wrapped up in iterators. One such case is the skip method, which skips the given number of elements and yields the remaining ones, wrapped in the Skip struct that implements the Iterator trait.
I would like to read a file line by line, and sometimes skip the n first characters of a line. I figured that using Iterator.skip would work, but now I'm stuck figuring out how I can actually unwrap the yielded Chars iterator so I could materialize the remaining &str with chars.as_str().
What is the idiomatic way of unwrapping an iterator in rust? The call chain
let line: &String = ...;
let remaining = line.chars().skip(n).as_str().trim();
raises the error
error[E0599]: no method named `as_str` found for struct `std::iter::Skip<std::str::Chars<'_>>` in the current scope
--> src/parser/directive_parsers.rs:367:63
|
367 | let option_val = line.chars().skip(option_val_indent).as_str().trim();
| ^^^^^^ method not found in `std::iter::Skip<std::str::Chars<'_>>`
error: aborting due to previous error
You can retrieve the start byte index of the nth character using the nth() method on the char_indices() iterator on the string. Once you have this byte index, you can use it to get a subslice of the original string:
let line = "This is a line.";
let index = line.char_indices().nth(n).unwrap().0;
let remaining = &line[index..];
Rather than iterate over chars, you can use char_indices to find the exact point at which to take a slice from the string, ensuring that you don't index into the middle of a multi-byte character. This will save on an allocation for each line in the iterator:
input
.iter()
.map(|line| {
let n = 2; // get n from somewhere?
let (index, _) = line.char_indices().nth(n).unwrap();// better error handling
&line[index..]
})
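Regarding the "better error handling" comment: one possible way to avoid the unwrap is sketched below. skip_chars is a hypothetical helper name, and returning an empty string for lines shorter than n is my assumption about what you'd want. Note that the unwrap version panics even when the line has exactly n characters.

```rust
/// Returns `line` without its first `n` characters, or "" if the line has at most `n`.
fn skip_chars(line: &str, n: usize) -> &str {
    match line.char_indices().nth(n) {
        // `index` is the byte offset of the n-th character, always a char boundary.
        Some((index, _)) => &line[index..],
        None => "",
    }
}
```

The result still borrows from line, so you can chain .trim() on it just like in the question.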

Immutable access in rust

I am new to Rust, coming from Python, where I have used the functional style extensively.
What I am trying to do is to take in a string (slice) (or any iterable) and iterate with a reference to the current index and the next index. Here is my attempt:
fn main() {
// intentionally immutable, this should not change
let x = "this is a
multiline string
with more
then 3 lines.";
// initialize multiple (mutable) iterators over the slice
let mut lineiter = x.chars();
let mut afteriter = x.chars();
// to have some reason to do this
afteriter.skip(1);
// zip them together, comparing the current line with the next line
let mut zipped = lineiter.zip(afteriter);
for (char1, char2) in zipped {
println!("{:?} {:?}", char1, char2);
}
}
I think it should be possible to get different iterators that have different positions in the slice but are referring to the same parts of memory without having to copy the string, but the error I get is as follows:
error[E0382]: use of moved value: `afteriter`
--> /home/alex/Documents/projects/simple-game-solver/src/src.rs:15:35
|
10 | let afteriter = x.chars();
| --------- move occurs because `afteriter` has type `std::str::Chars<'_>`, which does not implement the `Copy` trait
11 | // to have some reason to do this
12 | afteriter.skip(1);
| --------- value moved here
...
15 | let mut zipped = lineiter.zip(afteriter);
| ^^^^^^^^^ value used here after move
I also get a warning telling me that zipped does not need to be mutable.
Is it possible to instantiate multiple iterators over a single variable and if so how can it be done?
You ask: "Is it possible to instantiate multiple iterators over a single variable and if so how can it be done?"
If you check the signature and documentation for Iterator::skip:
fn skip(self, n: usize) -> Skip<Self>
Creates an iterator that skips the first n elements.
After they have been consumed, the rest of the elements are yielded. Rather than overriding this method directly, instead override the nth method.
You can see that it takes self by value (consumes the input iterator) and returns a new iterator. This is not a method which consumes the first n elements of the iterator in-place, it's one which converts the existing iterator into one which skips the first n elements.
So instead of:
let mut afteriter = x.chars();
afteriter.skip(1);
you just write:
let mut afteriter = x.chars().skip(1);
As for: "I also get a warning telling me that zipped does not need to be mutable."
That's because Rust's for loop uses the IntoIterator trait, which moves the iterable into the loop. It's not creating a mutable reference, it's just consuming whatever the RHS is.
Therefore it doesn't care what the mutability of the variable is. You do need mut if you iterate explicitly, or if you call some other "terminal" method (e.g. nth or try_fold or all), or if you want to iterate on the mutable reference (that's mostly useful for collections though), but not to hand off iterators to some other combinator method, or to a for loop.
A for loop takes self, if you will. Just as for_each does, in fact.
Thanks to @Stargateur for giving me the solution. The .skip(1) takes ownership of afteriter and returns a new iterator that skips the first element. What was happening before was that ownership was moved into .skip and the result discarded, so the variable could not be used anymore (I am pretty sure).
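Putting it together, a minimal sketch of the "current and next" iteration from the question (adjacent_pairs is a made-up name). Both iterators only borrow the string, so nothing is copied:

```rust
/// Pairs every character with the one that follows it.
fn adjacent_pairs(text: &str) -> impl Iterator<Item = (char, char)> + '_ {
    // Two independent iterators over the same borrowed data;
    // the second one is created already skipping its first element.
    text.chars().zip(text.chars().skip(1))
}
```

zip stops as soon as the shorter iterator runs out, so a string of n characters gives n - 1 pairs.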

How do I create an iterator of lines from a file that have been split into pieces?

I have a file that I need to read line-by-line and break into two sentences separated by a "=". I am trying to use iterators, but I can't find how to use it properly within split. The documentation says that std::str::Split implements the trait, but I'm still clueless how to use it.
use std::{
fs::File,
io::{prelude::*, BufReader},
};
fn example(path: &str) {
for line in BufReader::new(File::open(path).expect("Failed at opening file.")).lines() {
let words = line.unwrap().split("="); //need to make this an iterable
}
}
How can I use a trait I know is already implemented into something like split?
As @Mateen commented, split already returns an iterator. To fix the lifetime problems, save the value returned by unwrap() into a variable before calling split.
I'll try to explain the lifetime issue here.
First it really helps to look at the function signatures.
pub fn unwrap(self) -> T
pub fn split<'a, P: Pattern<'a>>(&'a self, pat: P) -> Split<'a, P>
unwrap is pretty simple: it takes ownership of self and returns the inner value.
split looks scary, but it's not too difficult. 'a is just a name for the lifetime, and it states how long the return value can be used for. In this case it means that both input arguments must live at least as long as the return value.
//            Takes by reference, no ownership change
//                               v
pub fn split<'a, P: Pattern<'a>>(&'a self, pat: P) -> Split<'a, P>
//           ^              ^     ^                         ^
//           |              |--|--|                         |
// This just declares a name.  |                            |
//                             |                            |
//   Both of these values must last longer than ------------|
This is because split doesn't copy any of the string, it just points to the positions in the original string where the splits take place. If the original string were dropped for some reason, the Split would point to invalid data.
A value's lifetime (unless ownership is passed to something else) lasts until it goes out of scope: at the closing } of its block if it is bound to a name (e.g. with let), or at the end of the statement (the ;) if it is an unnamed temporary.
That's why there is a lifetime problem in your code:
for line in std::io::BufReader::new(std::fs::File::open(path).expect("Failed at opening file.")).lines() {
let words = line
.unwrap() // <--- Unwrap consumes `line`, `line` can not be used after calling unwrap(),
.split("=") // Passed unwrap()'s output to split as a reference
; //<-- end of line, unwrap()'s output is dropped due to it not being saved to a variable, the result of split now points to nothing, so the compiler complains.
}
Solutions
Saving the return value of unwrap()
for line in std::io::BufReader::new(std::fs::File::open("abc").expect("Failed at opening file.")).lines() {
let words = line.unwrap();
let words_split = words.split("=");
} // <--- `words`'s lifetime ends here, but there are no lifetime issues since `words_split` also ends here.
You can rename words_split to words to shadow the original variable if you don't want to clutter variable names. This doesn't cause an issue either, since shadowed variables are not dropped immediately, but at the end of their original scope.
Or
Rather than having an iterator of &str items, all of which are just fancy pointers into the original string, you can copy each slice out to its own String, removing the reliance on keeping the original string in scope.
There is almost certainly no reason to do this in your case, since copying each slice takes more processing power and more memory, but Rust gives you this control.
let words = line
.unwrap()
.split("=")
.map(|piece|
piece.to_owned() // <--- This copies all the characters in the str into its own String.
).collect::<Vec<String>>()
; // <--- unwrap()'s output dropped here, but it doesn't matter since the pieces no longer point to the original line string.
let words_iterator = words.iter();
collect gives you the error "cannot infer type" if you don't state what you want to collect into. Either use the turbofish syntax above, or annotate the type on words, i.e. let words: Vec<String> = ...
You have to call collect because map is lazy: it doesn't do anything until the iterator is consumed. That's out of the scope of this answer, though.
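For completeness, here is one way the whole thing could be wrapped up, as a sketch rather than the definitive version. split_lines is a hypothetical name, splitn(2, '=') is my choice so that only the first "=" separates the two halves, and skipping lines without "=" and trimming the pieces are assumptions about what you want.

```rust
use std::io::BufRead;

/// Splits every line of `reader` at the first '=' into an owned (left, right) pair.
/// Lines without '=' are skipped; I/O errors panic to keep the sketch short.
fn split_lines<R: BufRead>(reader: R) -> Vec<(String, String)> {
    reader
        .lines()
        .map(|line| line.expect("failed to read line"))
        .filter_map(|line| {
            // `line` is a named binding, so the slices below may borrow from it
            // until they are copied into owned Strings.
            let mut pieces = line.splitn(2, '=');
            let left = pieces.next()?.trim().to_owned();
            let right = pieces.next()?.trim().to_owned();
            Some((left, right))
        })
        .collect() // nothing above runs until collect drives the iterator
}
```

With the code from the question you would pass it BufReader::new(File::open(path).expect("Failed at opening file.")).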

Resources