Creating a vector of strings using the new std::fs::File - string

Porting my code from old_io to the new std::io
let path = Path::new("src/wordslist/english.txt");
let display = path.display();
let mut file = match File::open(&path) {
// The `desc` field of `IoError` is a string that describes the error
Err(why) => panic!("couldn't open {}: {}", display,
Error::description(&why)),
Ok(file) => file,
};
let mut s = String::new();
match file.read_to_string(&mut s) {
Err(why) => panic!("couldn't read {}: {}", display,
Error::description(&why)),
Ok(s) => s,
};
let words: Vec<_> = s.words().collect();
So this works but requires me to have a mutable string s to read the file contents, and then use words().collect() to gather them into a vector.
Is there a way to read the contents of a file to a vector using something like words() WITHOUT reading it to the mutable buffer string first? My thought is that this would be more performant in situations where the collect() call might happen at a later point, or after a words().map(something).

Your approach has a problem. .words() operates on an &str (string slice) which needs a parent String to refer to. Your example works fine because the Vec produced by s.words().collect() resides in the same scope as s, so it won't outlive the source string. But if you want to move it somewhere else, you'll need to end up with a Vec<String> instead of a Vec<&str>, which I'm assuming you already want if you're concerned about intermediate buffers.
You do have some options. Here are two that I can think of.
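You can iterate over the characters of the file and split on whitespace (note that this relies on Read::chars(), an unstable API that has since been removed from the standard library):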
// `.peekable()` gives us `.is_empty()` for an `Iterator`
// `.chars()` yields a `Result<char, CharsError>` which needs to be dealt with
let mut chars = file.chars().map(Result::unwrap).peekable();
let mut words: Vec<String> = Vec::new();
while !chars.is_empty() {
// This needs a type hint because it can't rely on info
// from the following `if` block
let word: String = chars.take_while(|ch| !ch.is_whitespace()).collect();
// We'll have an empty string if there's more than one
// whitespace character between words
// (more than one because the first is eaten
// by the last iteration of `.take_while()`)
if !word.is_empty() {
words.push(word);
}
}
You can wrap the File object in a std::io::BufReader and read it line-by-line with the .lines() iterator:
let mut reader = BufReader::new(file);
let mut words = Vec::new();
// `.lines()` yields `Result<String, io::Error>` so we have to handle that.
// (it will not yield an EOF error, this is for abnormal errors during reading)
for line in reader.lines().map(Result::unwrap) {
words.extend(line.words().map(String::from_str));
}
// Or alternately (this may not work due to lifetime errors in `flat_map()`)
let words: Vec<_> = reader.lines().map(Result::unwrap)
.flat_map(|line| line.words().map(String::from_str))
.collect();
It's up to you to decide which of the two solutions you prefer. The former is probably more efficient but maybe less intuitive. The latter is easier to read, especially the for-loop version, but allocates intermediate buffers.
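For readers on current stable Rust, where str::words() has been renamed to split_whitespace() and String::from takes the place of String::from_str, the BufReader approach might look roughly like the sketch below. The helper name read_words is hypothetical, and propagating errors with ? instead of unwrapping is a choice made for this sketch, not something taken from the answer above.

```rust
use std::io::{self, BufRead};

/// Collects every whitespace-separated word from `reader` into owned Strings.
/// `read_words` is a hypothetical helper name used only for this sketch.
fn read_words<R: BufRead>(reader: R) -> io::Result<Vec<String>> {
    let mut words = Vec::new();
    for line in reader.lines() {
        // Each `line` is an io::Result<String>; `?` forwards abnormal read errors.
        let line = line?;
        // `split_whitespace()` yields &str slices borrowed from `line`,
        // so each one is copied into an owned String before `line` is dropped.
        words.extend(line.split_whitespace().map(String::from));
    }
    Ok(words)
}
```

With a real file this would be called as read_words(BufReader::new(File::open(path)?)). The per-line String is still an intermediate buffer, but only one line is held in memory at a time.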

Related

Populating a Hashmap with a vector of string slices in rust

I've been pulling my hair out with this one.
I apologize in advance if it's a poorly worded question.
So, I have a HashMap in the outer scope and want to populate it with string slices.
// Hashmap declaration.
let mut words: std::collections::HashMap< &str, std::vec::Vec<&str> > = std::collections::HashMap::new();
for file_name in ["conjunctions", "nouns", "verbs"].iter() { // Reading some file.
let file_content = std::fs::read_to_string(format!("../wordlists/{file_name}.txt"));
let mut fc = match file_content {
Ok(file_content) => file_content,
Err(_) => panic!("Failed to read the file: ../wordlists/{file_name}.txt"),
};
let mut wordlist_vec: Vec<&str> = fc.split("\n").collect();
words.insert( file_name, wordlist_vec );
}
println!("{:?}", words["conjunctions"]);
// Using it outside the above scope throws an error: fc was dropped while still borrowed.
So basically, my question is, how can I use the hash map outside the scope for the loop above?
I think the issues emanate from using string slices (split returns slices, I guess) but I'm not too sure.
You simply need to use an owned String instead of &strs.
let mut words: HashMap<String, Vec<String>> = HashMap::new();
// ...
// We use map to change the elements of the iterator to owned Strings.
let mut wordlist_vec: Vec<String> = fc.split("\n").map(String::from).collect();
words.insert(file_name.to_string(), wordlist_vec);
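Put together, the owned-String version could look something like the sketch below. To keep it runnable without the word list files, the file contents are passed in as plain text; the function name build_wordlists and its sources parameter are hypothetical stand-ins for the loop in the question.

```rust
use std::collections::HashMap;

/// Builds a map from list name to the owned lines of that list.
/// `sources` pairs a list name with the text that would have been read
/// from its file (an assumption made so the sketch needs no files on disk).
fn build_wordlists(sources: &[(&str, &str)]) -> HashMap<String, Vec<String>> {
    let mut words: HashMap<String, Vec<String>> = HashMap::new();
    for &(file_name, content) in sources {
        // `fc` plays the role of the String returned by read_to_string.
        let fc = content.to_string();
        // Copy every slice into an owned String so nothing borrows from `fc`.
        let wordlist_vec: Vec<String> = fc.split('\n').map(String::from).collect();
        words.insert(file_name.to_string(), wordlist_vec);
        // `fc` is dropped here, which is fine: the map owns its own copies.
    }
    words
}
```

The returned map can be used freely after the loop, for example words["nouns"], because neither its keys nor its values refer back to the temporary buffer.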

Save command line argument to variable and use a default if it is missing or invalid

The idea here is simple but I have tried three different ways with different errors each time: read in a string as an argument, but if the string is invalid or the string isn't provided, use a default.
Can this be done using Result to detect a valid string or a panic?
The basic structure I expect:
use std::env;
use std::io;
fn main() {
let args: Vec<String> = args().collect();
let word: Result<String, Error> = &args[1].expect("Valid string");
let word: String = match word {
Ok(word) = word,
Err(_) = "World",
}
println!("Hello, {}", word);
}
So, there are a lot of issues in your code.
First and foremost, in a match statement, you do not use =, you use =>.
Additionally, your match statement returns something, which makes it not an executing block, but rather a returning block (those are not the official terms; officially, the match is being used as an expression). That means that your block's result is bound to a variable with let, and a let statement must end with a semicolon.
So your match statement would become:
let word: String = match word {
Ok(word) => word,
Err(_) => ...,
};
Next, when you do use std::env, you do not import all of the functions from it into your namespace. All you do is create an alias, so that the compiler turns env::<something> into std::env::<something> automatically.
Therefore, this needs to be changed:
let args: Vec<String> = env::args().collect();
The same problem exists in your next line. What is Error? Well, what you actually mean is io::Error, which is also not imported, for the reason stated above. You might be wondering now why Result does not need to be imported. Well, it is because the Rust team has decided on a certain set of types, traits, and functions (the prelude) that are automatically imported into every module. Error is not one of them.
let word: Result<String, io::Error> = ...;
The next part is wrong twice (or even thrice).
First of all, the operation [x] does not return a Result, it returns the value and panics if it is out-of-bounds.
Now, even if it was a result, this line would still be wrong. Why? Because you expect(...) the result. That would turn any Result into its value.
Now, what you are looking for is the .get(index) operation. It tries to get a value and if it fails, it returns None, so it returns an option. What is an option? It is like a result, but there is no error value. It must be noted that get() returns the option filled with a reference to the string.
The line should look something like this:
let word: Option<&String> = args.get(1);
Now you have two options to handle default values, but before we come to that, I need to tell you why your error value is wrong.
In Rust, there are two kinds of Strings.
There is `&str`, which you can create like this:
let a: &str = "Hello, World!";
These are immutable, borrowed string slices. A literal like the one above is stored in the program's read-only data, not on the heap. So you cannot just create a new one with arbitrary values on the fly.
On the other hand, we have mutable and heap-allocated Strings.
let mut a: String = String::new();
a.push_str("Hello, World!");
// Or...
let b: String = String::from("Hello, World");
You store your arguments as a String, but in your match statement, you try to return a &str.
So, there are two ways to handle your error:
let word: Option<&String> = args.get(1);
let word: String = match word {
Some(word) => word.to_string(),
None => String::from("World"),
};
If you do not want to allocate that second string, you can also use
let word: Option<&String> = args.get(1);
let word: &str = match word {
Some(word) => word.as_str(),
None => "World",
};
The second option, unwrap_or
let args: Vec<String> = env::args().collect();
let default = &String::from("World");
let word: &String = args.get(1).unwrap_or(default);
println!("Hello, {}", word);
requires you to bind the default value to a variable, which is a bit ugly, but it does what your match statement above does in fewer lines.
This works too:
let word: &str = args.get(1).unwrap_or(default);
So this is my favourite version of your program above:
use std::env;
fn main() {
let args: Vec<String> = env::args().collect();
let default = &String::from("World");
let word: &str = args.get(1).unwrap_or(default);
println!("Hello, {}", word);
}
But this one works too:
use std::env;
fn main() {
let args: Vec<String> = env::args().collect();
let word: Option<&String> = args.get(1); // Index 0 is the program name
let word: &str = match word {
Some(word) => word.as_str(),
None => "World",
};
println!("Hello, {}", word);
}
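If you would rather not collect the arguments into a Vec at all, a variation on the same idea is to take the argument straight from the iterator. This is only a sketch: pick_word and greeting are hypothetical names, and treating an empty string as "invalid" is an assumption, since the question does not say what invalid means.

```rust
use std::env;

/// Returns the first real argument, or "World" if it is missing or empty.
/// Treating an empty string as invalid is an assumption made for this sketch.
fn pick_word<I: Iterator<Item = String>>(mut args: I) -> String {
    args.nth(1) // Index 0 is the program name, so the first argument is at 1
        .filter(|word| !word.is_empty())
        .unwrap_or_else(|| String::from("World"))
}

/// Builds the greeting from the real command line arguments.
fn greeting() -> String {
    format!("Hello, {}", pick_word(env::args()))
}
```

Taking any Iterator<Item = String> rather than calling env::args() inside pick_word makes the default handling easy to check without launching a process. Here unwrap_or_else only allocates the default String when it is actually needed.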

String concatenation in rust

I am trying to get a &str and &str to concatenate in a for loop with the intention of using the new combined string after a number of parts have been added to it. A general layout of the for loop can be seen below but I am having a lot of trouble combining strings due to numerous errors.
for line in reader.lines() {
let split_line = line.unwrap().split(",");
let mut edited_line = "";
for word in split_line {
if !word.contains("substring") {
let test_string = [edited_line, word].join(",");
edited_line = &test_string;
}
}
let _ = writeln!(outfile, "{}", edited_line).expect("Unable to write to file");
}
First error:
error[E0716]: temporary value dropped while borrowed
Comes when running the above.
Second error:
error[E0308]: mismatched types expected &str, found struct std::string::String
happens when you remove the & from test_string when it is assigned to edited_line
Note: format! and concat! macros both also give error 2.
It seems that if I get error 2 and fix it by converting the std::string::String to a &str, I get an error stating that the variable doesn't live long enough.
How am I supposed to go about building a string of many parts?
Note that Rust has two string types, String and &str (actually, there are more, but that's irrelevant here).
String is an owned string and can grow and shrink dynamically.
&str is a borrowed string and is immutable.
Calling [edited_line, word].join(",") creates a new String, which is allocated on the heap. edited_line = &test_string then borrows the String and implicitly converts it to a &str.
The problem is that its memory is freed as soon as the owner (test_string) goes out of scope, but the borrow lives longer than test_string. This is fundamentally impossible in Rust, since it would otherwise be a use-after-free bug.
The correct and most efficient way to do this is to create an empty String outside of the loop and only append to it in the loop:
let mut edited_line = String::new();
for word in split_line {
if !word.contains("substring") {
edited_line.push(',');
edited_line.push_str(word);
}
}
Note that the resulting string will start with a comma, which might not be desired. To avoid it, you can write
let mut edited_line = String::new();
for word in split_line {
if !word.contains("substring") {
if !edited_line.is_empty() {
edited_line.push(',');
}
edited_line.push_str(word);
}
}
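As a self-contained variant of the loop above, here is a sketch that wraps it in a function. The name drop_fields and its needle parameter are hypothetical. It also uses a boolean flag instead of the is_empty() check, which is a deliberate change: with is_empty(), a kept field that is itself empty at the start of the line would silently lose its separator.

```rust
/// Removes every comma-separated field that contains `needle`
/// and joins the remaining fields with commas again.
fn drop_fields(line: &str, needle: &str) -> String {
    // The result can never be longer than the input, so reserve that much once.
    let mut edited_line = String::with_capacity(line.len());
    let mut first = true;
    for word in line.split(',') {
        if word.contains(needle) {
            continue;
        }
        // Only put a comma between kept fields, never before the first one.
        if !first {
            edited_line.push(',');
        }
        edited_line.push_str(word);
        first = false;
    }
    edited_line
}
```

For example, drop_fields("a,substring1,b", "substring") keeps only the fields a and b and joins them with a single comma.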
This could be done more elegantly with the itertools crate, which provides a join method for iterators:
use itertools::Itertools;
let edited_line: String = line
.unwrap()
.split(",")
.filter(|word| !word.contains("substring"))
.join(",");
let mut edited_line = ""; makes edited_line a &str with a static lifetime.
To actually make edited_line a string, either append .to_owned(), or use String::new():
let mut edited_line = String::new();
// Or
let mut edited_line = "".to_owned();
See What are the differences between Rust's `String` and `str`? if you are unfamiliar with the differences.
Most importantly for your case, you can't extend a &str, but you can extend a String.
Once you have switched edited_line to a String, using the method of setting edited_line to [edited_line, word].join(","); works:
for line in reader.lines() {
let line = line.unwrap(); // Bind the String so it outlives the slices borrowed from it
let split_line = line.split(",");
let mut edited_line = String::new();
for word in split_line {
if !word.contains("substring") {
let test_string = [edited_line.as_str(), word].join(","); // Added .as_str() to edited_line
edited_line = test_string; // Removed the & here
}
}
let _ = writeln!(outfile, "{}", edited_line).expect("Unable to write to file");
}
However, this is neither very efficient nor ergonomic. It also has the (probably unintended) result of prepending each line with a ,.
Here is an alternative that uses only one String instance:
for line in reader.lines() {
let line = line.unwrap(); // Bind the String so it outlives the slices borrowed from it
let split_line = line.split(",");
let mut edited_line = String::new();
for word in split_line {
if !word.contains("substring") {
edited_line.push(',');
edited_line.push_str(word);
}
}
let _ = writeln!(outfile, "{}", edited_line).expect("Unable to write to file");
}
This still prepends the , character before each line however. You can probably fix that by checking if edited_line is not empty before pushing the ,.
The third option is to change the for loop into an iterator:
for line in reader.lines() {
let edited_line = line.unwrap().split(",") // `line` is an io::Result<String>, so unwrap it first
.filter(|word| !word.contains("substring"))
.collect::<Vec<&str>>() // Collecting allows us to use the join function.
.join(",");
let _ = writeln!(outfile, "{}", edited_line).expect("Unable to write to file");
}
This way we can use the join function as intended, neatly eliminating the initial , at the start of each line.
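For completeness, here is a sketch of the iterator version as a function that can be run against in-memory data. The name filter_lines and its needle parameter are hypothetical, and returning io::Result instead of calling unwrap/expect is a choice made for this sketch.

```rust
use std::io::{self, BufRead, Write};

/// Copies `reader` to `out` line by line, dropping every comma-separated
/// field that contains `needle`.
fn filter_lines<R: BufRead, W: Write>(reader: R, mut out: W, needle: &str) -> io::Result<()> {
    for line in reader.lines() {
        // Bind the String first so the &str slices below can borrow from it.
        let line = line?;
        let edited_line = line
            .split(',')
            .filter(|word| !word.contains(needle))
            .collect::<Vec<&str>>() // Collecting allows us to use join.
            .join(",");
        writeln!(out, "{}", edited_line)?;
    }
    Ok(())
}
```

Because the function is generic over BufRead and Write, the same code works with a BufReader around a File and with a byte slice and a Vec<u8>.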
PS: If you have trouble knowing what type each variable has, I suggest using an IDE like Intellij-rust, which shows type hints for each variable as you write them.

Does Read::read guarantee to append data and not overwrite any existing one?

I'm working on an SMTP library that reads lines over the network using a buffered reader.
I want a nice, safe way to read data from the network, without depending on Rust internals to make sure the code works as expected. Specifically, I'm wondering if the Read trait guarantees that data read with Read::read is appended to the buffer passed as an argument rather than overwriting the buffer entirely.
At the moment, I use a Range to make sure existing data is not overwritten without depending on Rust internals.
However, given that Rust used to have a nice way to do what I want, I'm wondering if the current code can be improved, possibly removing the unsafe blocks too.
No, it does not guarantee that:
use std::io::prelude::*;
use std::str;
fn main() {
let mut source1 = "hello, world!".as_bytes();
let mut source2 = "moo".as_bytes();
let mut dest = [0; 128];
source1.read(&mut dest).unwrap();
source2.read(&mut dest).unwrap();
let s = str::from_utf8(&dest[..16]).unwrap();
println!("{:?}", s)
}
This prints
"moolo, world!\u{0}\u{0}\u{0}"
Specifically, it cannot do what you want, based purely on the type signature:
fn read(&mut self, buf: &mut [u8]) -> Result<usize>;
All that the read method has access to is your mutable slice - there's nowhere to store information like "how far in the buffer you are". Furthermore, you aren't allowed to "extend" a mutable slice with more elements - you are only allowed to mutate the values within the slice.
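Since read cannot remember a position for you, the caller has to do the bookkeeping. One way to sketch that, assuming a fixed-size destination buffer, is to track how many bytes are filled and hand read only the unused tail of the slice. The helper name read_appending is hypothetical.

```rust
use std::io::{self, Read};

/// Reads once from `source` into the unused tail of `dest`.
/// `filled` is the number of bytes already in use; the new fill level is returned.
fn read_appending<R: Read>(source: &mut R, dest: &mut [u8], filled: usize) -> io::Result<usize> {
    // Slicing with `filled..` means `read` can only see, and only write to,
    // the part of the buffer that holds no data yet.
    let n = source.read(&mut dest[filled..])?;
    Ok(filled + n)
}
```

Run against the two sources from the example above, the second read now lands after the first instead of on top of it. If the destination is a Vec<u8> rather than a fixed slice, Read::read_to_end appends to the vector instead of overwriting it.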
For your particular case, you may want to look at BufRead::read_until. Here's a barely-tested example:
use std::io::{BufRead,BufReader};
use std::str;
fn main() {
let source1 = "header 1\r\nheader 2\r\n".as_bytes();
let mut reader = BufReader::new(source1);
let mut buf = vec![];
buf.reserve(128); // Maybe more efficient?
loop {
match reader.read_until(b'\n', &mut buf) {
Ok(0) => break,
Ok(_) => {},
Err(_) => panic!("Handle errors"),
}
if buf.len() < 2 { continue }
if buf[buf.len() - 2] == b'\r' {
{
let s = str::from_utf8(&buf).unwrap();
println!("Got a header {:?}", s);
}
buf.clear();
}
}
}

Extend lifetime of a variable for thread

I am reading a string from a file, splitting it by lines into a vector and then I want to do something with the extracted lines in separate threads. Like this:
use std::fs::File;
use std::io::prelude::*;
use std::thread;
fn main() {
match File::open("data") {
Ok(mut result) => {
let mut s = String::new();
result.read_to_string(&mut s);
let k : Vec<_> = s.split("\n").collect();
for line in k {
thread::spawn(move || {
println!("nL: {:?}", line);
});
}
}
Err(err) => {
println!("Error {:?}",err);
}
}
}
Of course this throws an error, because s will go out of scope before the threads are started:
`s` does not live long enough
main.rs:9 let k : Vec<_> = s.split("\n").collect();
^
What can I do now? I've tried many things like Box or Arc, but I couldn't get it working. I somehow need to create a copy of s which also lives in the threads. But how do I do that?
The problem, fundamentally, is that line is a borrowed slice into s. There's really nothing you can do here, since there's no way to guarantee that each line will not outlive s itself.
Also, just to be clear: there is absolutely no way in Rust to "extend the lifetime of a variable". It simply cannot be done.
The simplest way around this is to go from line being borrowed to owned. Like so:
use std::thread;
fn main() {
let mut s: String = "One\nTwo\nThree\n".into();
let k : Vec<String> = s.split("\n").map(|s| s.into()).collect();
for line in k {
thread::spawn(move || {
println!("nL: {:?}", line);
});
}
}
The .map(|s| s.into()) converts from &str to String. Since a String owns its contents, it can be safely moved into each thread's closure, and will live independently of the thread that created it.
Note: scoped threads would let each thread borrow from s directly. When this answer was written that API was nightly-only; std::thread::scope has been stable since Rust 1.63.
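In current Rust, std::thread::scope (stable since 1.63) guarantees that every spawned thread is joined before the scope returns, so the threads may borrow from s without any copying. Here is a minimal sketch; the function line_lengths is a hypothetical example that returns each line's length instead of printing it, so the result can be checked.

```rust
use std::thread;

/// Spawns one scoped thread per line of `s` and returns each line's length.
fn line_lengths(s: &str) -> Vec<usize> {
    thread::scope(|scope| {
        // Every thread borrows its `line` directly from `s`; no String is cloned.
        let handles: Vec<_> = s
            .lines()
            .map(|line| scope.spawn(move || line.len()))
            .collect();
        // Join in spawn order so the results line up with the input lines.
        handles
            .into_iter()
            .map(|handle| handle.join().unwrap())
            .collect()
    })
}
```

The borrow checker accepts this because the scope cannot end while any thread is still running, which is exactly the guarantee that plain thread::spawn cannot give.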
