Iterating over lines in a file and looking for substring from a vec! in rust


I'm writing a project in which a struct System can be constructed from a data file.
In the data file, some lines contain keywords that indicate values to be read either inside the line or in the N lines that follow (separated from the keyword line by a blank line).
I would like to have a vec! containing the keywords (statically known at compile time), check if the line returned by the iterator contains the keyword and do the appropriate operations.
Now my code looks like this:
impl System {
fn read_data<P>(filename: P) -> io::Result<io::Lines<io::BufReader<File>>> where P: AsRef<Path> {
let file = File::open(filename)?;
let f = BufReader::new(file);
Ok(f.lines())
}
...
pub fn new_from_data<P>(dataname: P) -> System where P: AsRef<Path> {
let keywd = vec!["atoms", "atom types".into(),
"Atoms".into()];
let mut sys = System::new();
if let Ok(mut lines) = System::read_data(dataname) {
while let Some(line) = lines.next() {
for k in keywd {
let split: Vec<&str> = line.unwrap().split(" ").collect();
if split.contains(k) {
match k {
"atoms" => sys.natoms = split[0].parse().unwrap(),
"atom types" => sys.ntypes = split[0].parse().unwrap(),
"Atoms" => {
lines.next();
// assumes fields are: atom-ID molecule-ID atom-type q x y z
for _ in 1..=sys.natoms {
let atline = lines.next().unwrap().unwrap();
let data: Vec<&str> = atline.split(" ").collect();
let atid: i32 = data[0].parse().unwrap();
let molid: i32 = data[1].parse().unwrap();
let atype: i32 = data[2].parse().unwrap();
let charge: f32 = data[3].parse().unwrap();
let x: f32 = data[4].parse().unwrap();
let y: f32 = data[5].parse().unwrap();
let z: f32 = data[6].parse().unwrap();
let at = Atom::new(atid, molid, atype, charge, x, y, z);
sys.atoms.push(at);
};
},
_ => (),
}
}
}
}
}
sys
}
}
I'm very unsure on two points:
I don't know if I treated the line by line reading of the file in an idiomatic way as I tinkered some examples from the book and Rust by example. But returning an iterator makes me wonder when and how unwrap the results. For example, when calling the iterator inside the while loop do I have to unwrap twice like in let atline = lines.next().unwrap().unwrap();? I think that the compiler does not complain yet because of the 1st error it encounters which is
I cannot wrap my head around the type to give to the value k, as I get a typical:
error[E0308]: mismatched types
--> src/system/system.rs:65:39
|
65 | if split.contains(k) {
| ^ expected `&str`, found `str`
|
= note: expected reference `&&str`
found reference `&str`
error: aborting due to previous error
How are we supposed to declare the substring and compare it to the strings I put in keywd? I tried to dereference k in contains, tell it to look at &keywd etc. but I just feel I'm wasting my time by not properly addressing the problem. Thanks in advance, any help is indeed appreciated.

Let's go through the issues one by one, in the order they appear in the code.
First, you need to borrow keywd in the for loop, i.e. &keywd. Otherwise keywd is moved into the for loop during the first iteration of the while loop and is no longer available for the next one, which is why the compiler complains.
for k in &keywd {
let split: Vec<&str> = line.unwrap().split(" ").collect();
Next, when you call .unwrap() on line, that's the same problem. That causes the inner Ok value to get moved out of the Result. Instead you can do line.as_ref().unwrap() as then you get a reference to the inner Ok value and aren't consuming the line Result.
Alternatively, you can .filter_map(Result::ok) on your lines, to avoid (.as_ref()).unwrap() altogether. Be aware that this silently skips any line that failed to read.
You can add that directly to read_data and even simplify the return type using impl ....
fn read_data<P>(filename: P) -> io::Result<impl Iterator<Item = String>>
where
P: AsRef<Path>,
{
let file = File::open(filename)?;
let f = BufReader::new(file);
Ok(f.lines().filter_map(Result::ok))
}
Note that you're splitting line for every keywd, which is needless. So you can move that outside of your for loop as well.
All in all, it ends up looking like this:
if let Ok(mut lines) = read_data("test.txt") {
while let Some(line) = lines.next() {
let split: Vec<&str> = line.split(" ").collect();
for k in &keywd {
if split.contains(k) {
...
Given that we borrowed &keywd, then we don't need to change k to &k, as now k is already &&str.
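The borrowing pattern above can be sketched on in-memory lines. This is a minimal illustration, not the asker's actual parser: the function name find_keywords is hypothetical, and it uses a substring search (str::contains) instead of comparing space-separated tokens, on the assumption that a multi-word keyword such as "atom types" should still match (it can never equal a single token produced by split(" ")).

```rust
/// Returns (line index, keyword) for every keyword found as a substring of a line.
/// `find_keywords` is a hypothetical helper used only for illustration.
fn find_keywords<'k>(lines: &[String], keywords: &[&'k str]) -> Vec<(usize, &'k str)> {
    let mut hits = Vec::new();
    for (i, line) in lines.iter().enumerate() {
        // Iterating a borrowed slice leaves `keywords` usable for the next line.
        for k in keywords {
            // `k` is `&&str` here; dereference once to get a `&str` pattern.
            if line.contains(*k) {
                hits.push((i, *k));
            }
        }
    }
    hits
}
```

The search is case-sensitive, so "atoms" and "Atoms" are treated as different keywords, as in the original match arms.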

Related

Access value after it has been borrowed

I have the following function. It is given a file. It should return a random line from the file as a string.
fn get_word(word_list: File) -> String {
let reader = BufReader::new(word_list);
let lines = reader.lines();
let word_count = lines.count();
let y: usize = thread_rng().gen_range(0, word_count - 1);
let element = lines.nth(y);
match element {
Some(x) => println!("Result: {}", x.unwrap()),
None => println!("Error with nth"),
}
let word = String::new(""); // Once the error is gone. I would create the string.
return word;
}
But I keep getting this error:
93 | let lines = reader.lines();
| ----- move occurs because `lines` has type `std::io::Lines<BufReader<File>>`, which does not implement the `Copy` trait
94 | let word_count = lines.count();
| ------- `lines` moved due to this method call
...
99 | let element = lines.nth(y);
| ^^^^^^^^^^^^ value borrowed here after move
|
I am new to Rust and have been learning by trial and error. I don't know how to access the data after I have called the count function. If there is another method to accomplish what I want, I would gladly welcome it.

The .count() method consumes the iterator. From the documentation
Consumes the iterator, counting the number of iterations and returning it.
This method will call next repeatedly until None is encountered, returning the number of times it saw Some. Note that next has to be called at least once even if the iterator does not have any elements.
In other words, it reads the file content and discards it. If you want to get the Nth line, then you have to re-read the file using another iterator instance.
If your file is small, you can save the read lines in a vector:
let lines = reader.lines().collect::<Result<Vec<String>, _>>().unwrap(); // lines() yields io::Result<String>
Then the length of the vector is the number of lines and you can avoid re-reading the file, but if it's a large file you may end up crashing with an "out of memory" error. In that case you should re-read the file content, or use a better strategy such as indexing where the line breaks are, so you can jump straight to the wanted line without having to re-read a lot of data.
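As a sketch of the "collect into a vector" approach that reports read errors instead of panicking: io::Lines yields io::Result<String>, and collecting into io::Result<Vec<String>> stops at the first error. The helper name read_all_lines is an assumption made for this example.

```rust
use std::io::{self, BufRead};

/// Reads every line into memory, stopping at the first I/O error.
/// `read_all_lines` is a hypothetical helper name.
fn read_all_lines<R: BufRead>(reader: R) -> io::Result<Vec<String>> {
    // Each item is an `io::Result<String>`; collecting into
    // `io::Result<Vec<String>>` short-circuits on the first `Err`.
    reader.lines().collect()
}
```

Any BufRead works as input, so the function can be called with BufReader::new(file) or, for experiments, with std::io::Cursor::new("a\nb\n").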

The value returned by lines is an iterator, which reads the file sequentially. To count the number of lines, the iterator is consumed: self is taken by value; ownership is transferred into the count() function. So you can't rewind and then request the nth line.
The easiest solution is to read all the lines into a vector:
let lines = reader.lines().collect::<Result<Vec<String>, _>>().unwrap(); // lines() yields io::Result<String>
let word_count = lines.len();
let y: usize = thread_rng().gen_range(0, word_count); // the upper bound is exclusive
let word = lines[y].clone();
return word;
Notice the clone call: you can't simply write return lines[y]; because indexing only gives access to the element in place, and a String cannot be moved out of a vector that way. Returning a clone of the string avoids this.
(to_owned or even to_string would also work. You can also avoid a copy by using swap_remove; I'm not sure there is a more elegant way to move one element from a vector and discard the rest.)

Note that counting the lines and then selecting one of them requires you to either rewind the iterator and go through it twice (once to count and once to select), or to store everything in memory first (e.g. with .collect::<Vec<_>>). Selecting a random line from the list can however be done in a single pass by randomly choosing on each line whether to keep the currently selected line or replace it with the latest read line:
fn get_word(word_list: File) -> String {
    let reader = BufReader::new(word_list);
    let mut lines = reader.lines();
    // Start with the first line selected (this panics if the file is empty).
    let mut selected = lines.next().unwrap();
    let mut count = 1; // number of lines seen so far
    for l in lines {
        count += 1;
        // Replace the current pick with probability 1/count.
        if thread_rng().gen_range(0, count) == 0 {
            selected = l;
        }
    }
    match selected {
        Ok(x) => return x,
        Err(_) => {
            print!("Error get_word");
            return String::new();
        }
    }
}
Or of course the simplest way is to just use choose:
fn get_word(word_list: File) -> String {
    use rand::seq::IteratorRandom;
    let reader = BufReader::new(word_list);
    // lines is a method, and choose takes the RNG by mutable reference.
    match reader.lines().choose(&mut thread_rng()) {
        Some(Ok(x)) => return x,
        _ => {
            print!("Error get_word");
            return String::new();
        }
    }
}
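The single-pass selection described above can also be sketched without any external crate by passing the random source in as a closure. Everything named here is hypothetical: pick_line is an illustrative helper, and rand_below(n) stands in for whatever RNG call returns a value in 0..n.

```rust
use std::io::{self, BufRead};

/// Picks one line uniformly at random in a single pass over the reader.
/// `rand_below(n)` is a hypothetical stand-in for an RNG call that
/// returns a value in `0..n`.
fn pick_line<R, F>(reader: R, mut rand_below: F) -> io::Result<Option<String>>
where
    R: BufRead,
    F: FnMut(usize) -> usize,
{
    let mut selected = None;
    for (i, line) in reader.lines().enumerate() {
        let line = line?; // propagate I/O errors to the caller
        // The (i + 1)-th line replaces the current pick with probability
        // 1 / (i + 1), so the first line is always taken initially.
        if rand_below(i + 1) == 0 {
            selected = Some(line);
        }
    }
    Ok(selected)
}
```

Returning Option makes the empty-file case explicit instead of panicking, and injecting the closure keeps the function deterministic under test.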

In order to solve this problem I used the solution given of using .collect::<Vec<String>> but the whole solution needs a little more work. At least in my case.
First: .lines returns an iterator whose items are of type Result<std::string::String, std::io::Error>.
Second: To access the value of this vector I have to borrow it with &.
Here the working function:
fn get_word(word_list: File) -> String {
let reader = BufReader::new(word_list);
let lines = reader.lines().collect::<Vec<_>>();
let word_count = lines.len();
let y: usize = thread_rng().gen_range(0, word_count); // the upper bound is exclusive
match &lines[y] {
Ok(x) => return x.to_string(),
Err(_) => {
print!("Error get_word");
return String::new();
}
}
}

Creating struct with values from function parameter Vec<String> and returning Vec<struct> to caller

The purpose of my program is to read questions/answers from a file (line by line), and create several structs from it, put into a Vec for further processing.
I have a rather long piece of code, which I tried to separate into several functions (full version on Playground; hopefully is valid link).
I suppose I'm not understanding a lot about borrowing, lifetimes and other things. Apart from that, the given examples from all around I've seen, I'm not able to adapt to my given problems.
Trying to remodel my struct fields from &str to String didn't change anything. Neither did creating Vec<Question> within get_question_list.
Function of concern is as follows:
fn get_question_list<'a>(mut questions: Vec<Question<'a>>, lines: Vec<String>) -> Vec<Question<'a>> {
let count = lines.len();
for i in (0..count).step_by(2) {
let q: &str = lines.get(i).unwrap();
let a: &str = lines.get(i + 1).unwrap();
questions.push(Question::new(q, a));
}
questions
}
This code fails with the compiler as following (excerpt):
error[E0597]: `lines` does not live long enough
--> src/main.rs:126:23
|
119 | fn get_question_list<'a>(mut questions: Vec<Question<'a>>, lines: Vec<String>) -> Vec<Question<'a>> {
| -- lifetime `'a` defined here
...
126 | let a: &str = lines.get(i + 1).unwrap();
| ^^^^^ borrowed value does not live long enough
127 |
128 | questions.push(Question::new(q, a));
| ----------------------------------- argument requires that `lines` is borrowed for `'a`
...
163 | }
| - `lines` dropped here while still borrowed
Call to get_question_list is around:
let lines: Vec<String> = content.split("\n").map(|s| s.to_string()).collect();
let counter = lines.len();
if counter % 2 != 0 {
return Err("Found lines in quiz file are not even (one question or answer is missing.).");
}
questions = get_question_list(questions, lines);
Ok(questions)

The issue is that your Questions are supposed to borrow something (hence the lifetime annotation), but lines gets moved into the function, so when you create a new question from a line, it's borrowing function-local data, which is going to be destroyed at the end of the function. As a consequence, the questions you're creating can't escape the function creating them.
Now what you could do is not move the lines into the function: lines: &[String] would have the lines be owned by the caller, which would "fix" get_question_list.
However the exact same problem exists in read_questions_from_file, and there it can not be resolved: the lines are read from a file, and thus are necessarily local to the function (unless you move the lines-reading to main and read_questions_from_file only borrows them as well).
Therefore the simplest proper fix is to change Question to own its data:
struct Question {
question: String,
answer: String
}
This way the question itself keeps its data alive, and the issue goes away.
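A minimal sketch of the owning variant, assuming alternating question/answer lines. The body of get_question_list here is an illustration rather than the asker's original code, and it builds the struct directly instead of going through Question::new.

```rust
#[derive(Debug, PartialEq)]
struct Question {
    question: String,
    answer: String,
}

/// Builds questions from alternating question/answer lines.
/// The caller keeps ownership of `lines`; each `Question` owns its copies.
fn get_question_list(lines: &[String]) -> Vec<Question> {
    lines
        .chunks_exact(2) // a trailing unpaired line is ignored here
        .map(|pair| Question {
            question: pair[0].clone(),
            answer: pair[1].clone(),
        })
        .collect()
}
```

Because each Question owns its strings, the returned vector stays valid after the lines it was built from have been dropped.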
We can improve things further though, I think:
First, we can strip out the entire mess around newlines by using String::lines, it will handle cross-platform linebreaks, and will strip them.
It also seems rather odd that get_question_list takes a vector by value only to append to it and immediately return it. A more intuitive interface would be to either:
take the "output vector" by &mut so the caller can pre-size or reuse it across multiple loads, which doesn't really seem useful in this case
or create the output vector internally, which seems like the most sensible case here
Here is what I would consider a more pleasing version: https://play.rust-lang.org/?version=stable&mode=debug&edition=2018&gist=c0d440d67654b92c75d136eba2bba0c1
fn read_questions_from_file(filename: &str) -> Result<Vec<Question>, Box<dyn Error>> {
let file_content = read_file(filename)?;
let lines: Vec<_> = file_content.lines().collect();
if lines.len() % 2 != 0 {
return Err(Box::new(OddLines));
}
let mut questions = Vec::with_capacity(lines.len() / 2);
for chunk in lines.chunks(2) {
if let [q, a] = chunk {
questions.push(Question::new(q.to_string(), a.to_string()))
} else {
unreachable!("Odd lines should already have been checked");
}
}
Ok(questions)
}
Note that I inlined / removed get_question_list as I don't think it pulls its weight at this point, and it's both trivial and very specific.
Here is a variant which works similarly but with different tradeoffs: https://play.rust-lang.org/?version=stable&mode=debug&edition=2018&gist=3b8f95aef5bcae904545617749086dbc
fn read_questions_from_file(filename: &str) -> Result<Vec<Question>, Box<dyn Error>> {
let file_content = read_file(filename)?;
let mut lines = file_content.lines();
let mut questions = Vec::new();
while let Some(q) = lines.next() {
let a = lines.next().ok_or(OddLines)?;
questions.push(Question::new(q.to_string(), a.to_string()));
}
Ok(questions)
}
It avoids collecting the lines into a Vec, but as a result it has to process the file to the end before it knows that the file is suitable, and it can't preallocate the questions vector.
At this point, because we do not care for lines being a Vec anymore, we could operate on a BufRead and strip out read_file as well:
fn read_questions_from_file(filename: &str) -> Result<Vec<Question>, Box<dyn Error>> {
let file_content = BufReader::new(File::open(filename)?);
let mut lines = file_content.lines();
let mut questions = Vec::new();
while let Some(q) = lines.next() {
let a = lines.next().ok_or(OddLines)?;
questions.push(Question::new(q?, a?));
}
Ok(questions)
}
The extra ? are because while str::Lines yields &str, io::Lines yields Result<String, io::Error>: IO errors are reported lazily when a read is attempted, meaning every line-read could report a failure if read_to_string would have failed.
OTOH since io::Lines returns a Result<String, ...> we can use q and a directly without needing to convert them to String.
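The snippets above use an OddLines error whose definition is not shown. The following sketch fills that in under stated assumptions: the OddLines type, its message, and the read_pairs helper (which returns plain tuples instead of Question values) are all hypothetical and only illustrate how the two kinds of ? interact.

```rust
use std::error::Error;
use std::fmt;
use std::io::BufRead;

/// Hypothetical error type standing in for the `OddLines` used above.
#[derive(Debug)]
struct OddLines;

impl fmt::Display for OddLines {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "a question has no matching answer line")
    }
}

impl Error for OddLines {}

/// Reads (question, answer) pairs from any buffered reader.
fn read_pairs<R: BufRead>(reader: R) -> Result<Vec<(String, String)>, Box<dyn Error>> {
    let mut lines = reader.lines();
    let mut pairs = Vec::new();
    while let Some(q) = lines.next() {
        // `ok_or` turns a missing answer line into `OddLines`;
        // the `?` on `q` and `a` below surfaces I/O errors.
        let a = lines.next().ok_or(OddLines)?;
        pairs.push((q?, a?));
    }
    Ok(pairs)
}
```

Both error kinds convert into Box<dyn Error> through ?, which is why the function can mix a custom error with io::Error.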

Split string only once in Rust

I want to split a string by a separator only once and put it into a tuple. I tried doing
fn splitOnce(in_string: &str) -> (&str, &str) {
let mut splitter = in_string.split(':');
let first = splitter.next().unwrap();
let second = splitter.fold("".to_string(), |a, b| a + b);
(first, &second)
}
but I keep getting told that second does not live long enough. I guess it's saying that because splitter only exists inside the function block but I'm not really sure how to address that. How do I coerce second into existing beyond the function block? Or is there a better way to split a string only once?

You are looking for str::splitn:
fn split_once(in_string: &str) -> (&str, &str) {
let mut splitter = in_string.splitn(2, ':');
let first = splitter.next().unwrap();
let second = splitter.next().unwrap();
(first, second)
}
fn main() {
let (a, b) = split_once("hello:world:earth");
println!("{} --- {}", a, b)
}
Note that Rust uses snake_case.
I guess it's saying that because splitter only exists inside the function block
Nope, it's because you've created a String and are trying to return a reference to it; you cannot do that. second is what doesn't live long enough.
How do I coerce second into existing beyond the function block?
You don't. This is a fundamental aspect of Rust. If something needs to live for a certain amount of time, you just have to make it exist for that long. In this case, as in the linked question, you'd return the String:
fn split_once(in_string: &str) -> (&str, String) {
let mut splitter = in_string.split(':');
let first = splitter.next().unwrap();
let second = splitter.fold("".to_string(), |a, b| a + b);
(first, second)
}
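The splitn version above panics on input without a separator, because the second next() returns None. A hedged variant that reports this case through Option instead might look like the sketch below; the name split_once_opt is made up for this example.

```rust
/// Splits at the first ':' and returns `None` when there is no separator.
/// `split_once_opt` is a hypothetical name chosen for this sketch.
fn split_once_opt(in_string: &str) -> Option<(&str, &str)> {
    let mut splitter = in_string.splitn(2, ':');
    let first = splitter.next()?;
    // With no ':' in the input, `splitn` yields a single item and this is `None`.
    let second = splitter.next()?;
    Some((first, second))
}
```

Both halves are slices of the input, so nothing is allocated and no lifetime problem arises.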

str::split_once is now built in (stable since Rust 1.52).
Doc examples:
assert_eq!("cfg".split_once('='), None);
assert_eq!("cfg=".split_once('='), Some(("cfg", "")));
assert_eq!("cfg=foo".split_once('='), Some(("cfg", "foo")));
assert_eq!("cfg=foo=bar".split_once('='), Some(("cfg", "foo=bar")));

How do I reuse the SplitWhitespace iterator?

I've got a piece of code which is supposed to check if two sentences are "too similar", as defined by a heuristic made clearest by the code.
fn too_similar(thing1: &String, thing2: &String) -> bool {
let split1 = thing1.split_whitespace();
let split2 = thing2.split_whitespace();
let mut matches = 0;
for s1 in split1 {
for s2 in split2 {
if s1.eq(s2) {
matches = matches + 1;
break;
}
}
}
let longer_length =
if thing1.len() > thing2.len() {
thing1.len()
} else {
thing2.len()
};
matches > longer_length / 2
}
However, I'm getting the following compilation error:
error[E0382]: use of moved value: `split2`
--> src/main.rs:7:19
|
7 | for s2 in split2 {
| ^^^^^^ value moved here in previous iteration of loop
|
= note: move occurs because `split2` has type `std::str::SplitWhitespace<'_>`, which does not implement the `Copy` trait
I'm not sure why split2 is getting moved in the first place, but what's the Rust way of writing this function?

split2 is getting moved because iterating with for consumes the iterator and since the type does not implement Copy, Rust isn't copying it implicitly.
You can fix this by creating a new iterator inside the first for:
let split1 = thing1.split_whitespace();
let mut matches = 0;
for s1 in split1 {
for s2 in thing2.split_whitespace() {
if s1.eq(s2) {
matches = matches + 1;
break;
}
}
}
...
You can also rewrite the matches counting loop using some higher order functions available in the Iterator trait:
let matches = thing1.split_whitespace()
.flat_map(|c1| thing2.split_whitespace().filter(move |&c2| c1 == c2))
.count();
longer_length can also be written as:
let longer_length = std::cmp::max(thing1.len(), thing2.len());
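One difference is worth noting: the flat_map version counts every matching pair, so repeated words are counted several times, whereas the original loop stops at the first hit for each word of thing1. A sketch that keeps the loop's behaviour, with the hypothetical helper name count_matches, could use any:

```rust
/// Counts the words of `thing1` that occur at least once in `thing2`.
/// `count_matches` is a hypothetical helper name.
fn count_matches(thing1: &str, thing2: &str) -> usize {
    thing1
        .split_whitespace()
        // `any` stops at the first hit, like the `break` in the loop version.
        .filter(|s1| thing2.split_whitespace().any(|s2| *s1 == s2))
        .count()
}
```

For the inputs "a a" and "a a a" this returns 2, matching the nested loop with break, while the pairwise count would be 6.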

There are possibly some better ways to do the word comparison.
If the phrases are long, then iterating over thing2's words for every word in thing1 is not very efficient. If you don't have to worry about words which appear more than once, then HashSet may help, and boils the iteration down to something like:
let words1: HashSet<&str> = thing1.split_whitespace().collect();
let words2: HashSet<&str> = thing2.split_whitespace().collect();
let matches = words1.intersection(&words2).count();
If you do care about repeated words you probably need a HashMap, and something like:
let mut words_hash1: HashMap<&str, usize> = HashMap::new();
for word in thing1.split_whitespace() {
*words_hash1.entry(word).or_insert(0) += 1;
}
let matches2: usize = thing2.split_whitespace()
.map(|s| words_hash1.get(s).cloned().unwrap_or(0))
.sum();

Indexing a Vec of Box Error

I was playing around with the Vec struct and ran into some interesting errors and behavior that I can't quite understand. Consider the following code.
fn main() {
let v = vec![box 1i];
let f = v[0];
}
When evaluated in the rust playpen, the code produces the following errors:
<anon>:3:13: 3:17 error: cannot move out of dereference (dereference is implicit, due to indexing)
<anon>:3 let f = v[0];
^~~~
<anon>:3:9: 3:10 note: attempting to move value to here (to prevent the move, use `ref f` or `ref mut f` to capture value by reference)
<anon>:3 let f = v[0];
^
error: aborting due to previous error
playpen: application terminated with error code 101
Program ended.
My understanding of Vec's index method is that it returns references to the values in a Vec, so I don't understand what moves or implicit dereferences are happening.
Also, when I change the f variable to an underscore, as below, no errors are produced!
fn main() {
let v = vec![box 1i];
let _ = v[0];
}
I was hoping someone could explain the errors I was getting and why they go away when switching f to _.

No idea which syntax sugar v[0] implements, but it is trying to move the value instead of getting a reference.
But if you call .index(), it works and gives you a reference with the same lifetime of the vector:
fn main() {
let v = vec![box 1i];
let f = v.index(&0);
println!("{}", f);
}
The second example works because as the value is being discarded, it doesn't try to move it.
EDIT:
The desugar for v[0] is *v.index(&0) (from: https://github.com/rust-lang/rust/blob/fb72c4767fa423649feeb197b50385c1fa0a6fd5/src/librustc/middle/trans/expr.rs#L467 ).
fn main() {
let a = vec!(1i);
let b = a[0] == *a.index(&0);
println!("{}" , b);
}
true

In your code, let f = v[0]; assigns f by value (as said in the error message, it is implicitly dereferencing) : the compiler tries to copy or move v[0] into f. v[0] being a box, it cannot be copied, thus it should be moved like in this situation :
let a = box 1i;
let b = a;
// a cannot be used anymore, it has been moved
// following line would not compile
println!("{}", a);
But values cannot be moved out of the vector via indexing, as it is a reference that is returned.
Concerning _, this code :
fn main() {
let v = vec![box 1i];
let _ = v[0];
println!("{}", _);
}
produces this error :
<anon>:4:20: 4:21 error: unexpected token: `_`
<anon>:4 println!("{}", _);
^
_ is not a variable name but a special pattern in Rust that tells the compiler you don't care about the value, so it doesn't try to copy or move anything.
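The box 1i syntax in this thread comes from a pre-1.0 compiler. In current stable Rust the same distinction between borrowing and moving can be sketched with Box::new; the helper names first_ref and take_first are hypothetical and only serve the illustration.

```rust
/// Borrows the first boxed value without moving it out of the vector.
fn first_ref(v: &[Box<i32>]) -> &i32 {
    // `v[0]` alone would try to move the Box out; `&*` reborrows the inner i32.
    &*v[0]
}

/// Takes ownership of the first box by removing it from the vector.
fn take_first(v: &mut Vec<Box<i32>>) -> Box<i32> {
    // O(1) removal: the last element is moved into slot 0.
    v.swap_remove(0)
}
```

Indexing can only hand out access in place, so getting ownership of an element requires a method that actually removes it from the vector, such as swap_remove or remove.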

You can get your original function to work by taking a reference to v[0]:
fn main() {
let v = vec![box 1i];
let f = &v[0]; // notice the &
println!("{}",f);
}
I don't know why the underscore silences your error. It should probably raise an error since the underscore alone is an invalid variable name (I think). Attempting to print it yields an error:
fn main() {
let v = vec![box 1i];
let _ = &v[0];
println!("{}",_);
}
Output:
<anon>:4:19: 4:20 error: unexpected token: `_`
<anon>:4 println!("{}",_);
The underscore is used to silence unused variable warnings (for example the compiler will yell at you if you define some_var and never use it, but won't if you define _some_var and never use it). It is also used as a fallback in a match statement to match anything that did not match other paths:
fn main() {
let v = vec![box 1i];
let f = &v[0];
match **f {
3i => println!("{}",f),
_ => println!("nothing here")
};
}
Someone smarter than me should comment on if the underscore is a valid variable name. Honestly I think the compiler shouldn't allow it.
