Read file character-by-character in Rust


Is there an idiomatic way to process a file one character at a time in Rust?
This seems to be roughly what I'm after:
let mut f = io::BufReader::new(try!(fs::File::open("input.txt")));
for c in f.chars() {
println!("Character: {}", c.unwrap());
}
But Read::chars is still unstable as of Rust v1.6.0.
I considered using Read::read_to_string, but the file may be large and I don't want to read it all into memory.

Let's compare 4 approaches.
1. Read::chars
You could copy the implementation of Read::chars, but it is marked unstable with the note
the semantics of a partial read/write of where errors happen is currently unclear and may change
so some care must be taken. Even so, this seems to be the best approach.
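For illustration, here is a minimal sketch of what a hand-rolled replacement might look like. It is not the standard library's implementation: read_char is a hypothetical helper name, and the sketch assumes the input is UTF-8 and reports anything else as an InvalidData error. It reads the lead byte, derives the sequence length from it, and then reads the continuation bytes, so at most four bytes are held at a time.

```rust
use std::io::{self, Read};

/// Hypothetical helper (not part of std): reads one UTF-8 encoded char.
/// Returns Ok(None) at end of input.
fn read_char<R: Read>(reader: &mut R) -> io::Result<Option<char>> {
    let mut buf = [0u8; 4];
    // Read the lead byte; zero bytes read means end of input.
    if reader.read(&mut buf[..1])? == 0 {
        return Ok(None);
    }
    // The lead byte determines how many bytes the sequence occupies.
    let len = match buf[0] {
        0x00..=0x7F => 1,
        0xC0..=0xDF => 2,
        0xE0..=0xEF => 3,
        0xF0..=0xF7 => 4,
        _ => {
            return Err(io::Error::new(
                io::ErrorKind::InvalidData,
                "invalid UTF-8 lead byte",
            ))
        }
    };
    // Read the continuation bytes (none for ASCII).
    reader.read_exact(&mut buf[1..len])?;
    // from_utf8 performs the full validation of the sequence.
    let s = std::str::from_utf8(&buf[..len])
        .map_err(|e| io::Error::new(io::ErrorKind::InvalidData, e))?;
    Ok(s.chars().next())
}
```

Because every call issues one or two small reads, it is worth wrapping the File in a BufReader before calling this in a loop.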
2. flat_map
The flat_map alternative does not compile:
use std::io::{BufRead, BufReader};
use std::fs::File;
pub fn main() {
let mut f = BufReader::new(File::open("input.txt").expect("open failed"));
for c in f.lines().flat_map(|l| l.expect("lines failed").chars()) {
println!("Character: {}", c);
}
}
The problem is that chars borrows from the string, but the String returned by l.expect("lines failed") lives only inside the closure, so the compiler gives the error "borrowed value does not live long enough".
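One way to make the flat_map version compile, sketched here under the assumption that an extra allocation per line is acceptable, is to have the closure return something owned instead of a borrowing iterator. chars_of is a hypothetical name. Note that lines() strips the line terminators, so they do not appear in the output.

```rust
use std::io::BufRead;

/// Hypothetical helper: yields every char of every line of `reader`.
/// Each line's chars are collected into an owned Vec<char>, so nothing
/// borrows from the String that is dropped at the end of the closure.
fn chars_of<R: BufRead>(reader: R) -> impl Iterator<Item = char> {
    reader.lines().flat_map(|l| {
        let line = l.expect("lines failed");
        line.chars().collect::<Vec<char>>()
    })
}
```

This trades the borrow error for one Vec<char> per line on top of the String that lines() already allocates, so it is no more memory efficient than the nested for loop.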
3. Nested for
This code
use std::io::{BufRead, BufReader};
use std::fs::File;
pub fn main() {
let mut f = BufReader::new(File::open("input.txt").expect("open failed"));
for line in f.lines() {
for c in line.expect("lines failed").chars() {
println!("Character: {}", c);
}
}
}
works, but it allocates a new String for each line. Besides, if the input file contains no line breaks, the whole file is loaded into memory.
4. BufRead::read_until
A memory-efficient alternative to approach 3 is to use BufRead::read_until and reuse a single buffer for every line:
use std::io::{BufRead, BufReader};
use std::fs::File;
pub fn main() {
let mut f = BufReader::new(File::open("input.txt").expect("open failed"));
let mut buf = Vec::<u8>::new();
while f.read_until(b'\n', &mut buf).expect("read_until failed") != 0 {
// this moves the ownership of the read data to s
// there is no allocation
let s = String::from_utf8(buf).expect("from_utf8 failed");
for c in s.chars() {
println!("Character: {}", c);
}
// this returns the ownership of the read data to buf
// there is no allocation
buf = s.into_bytes();
buf.clear();
}
}
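A closely related sketch, assuming it is acceptable to stop with an error on invalid UTF-8: BufRead::read_line appends to a String that you supply, so one String can be cleared and reused without moving ownership back and forth. for_each_char is a hypothetical name. As with approach 4, a file without line breaks is still read into memory in one piece.

```rust
use std::io::{self, BufRead};

/// Hypothetical helper: calls `f` for every char read from `reader`,
/// reusing a single String as the line buffer.
fn for_each_char<R: BufRead>(mut reader: R, mut f: impl FnMut(char)) -> io::Result<()> {
    let mut line = String::new();
    // read_line returns Ok(0) at end of input.
    while reader.read_line(&mut line)? != 0 {
        for c in line.chars() {
            f(c);
        }
        // Keep the allocation, drop the contents.
        line.clear();
    }
    Ok(())
}
```

Unlike lines(), read_line keeps the trailing newline, so the callback sees every character of the input.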

I cannot use lines() because my file could be a single line that is gigabytes in size. This is an improvement on malbarbo's recommendation of copying Read::chars from an old version of Rust: the utf8-chars crate already adds .chars() to BufRead for you.
Inspecting their repository, it doesn't look like they load more than 4 bytes at a time.
Your code will look the same as it did before Rust removed Read::chars:
use std::io::stdin;
use utf8_chars::BufReadCharsExt;
fn main() {
for c in stdin().lock().chars().map(|x| x.unwrap()) {
println!("{}", c);
}
}
Add the following to your Cargo.toml:
[dependencies]
utf8-chars = "1.0.0"

There are two solutions that make sense here.
First, you could copy the implementation of Read::chars() and use it; that would make it completely trivial to move your code over to the standard library implementation if/when it stabilizes.
On the other hand, you could simply iterate line by line (using f.lines()) and then use line.chars() on each line to get the chars. This is a little more hacky, but it will definitely work.
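If you only wanted one loop, you could use flat_map() with a lambda like |line| line.chars(), although chars() borrows from line, so in practice the closure has to return something owned, such as the collected characters.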

Related

Writing a Vec<String> to files using std::fs::write

I'm writing a program that handles a vector which is a combination of numbers and letters (hence Vec<String>). I sort it with the .sort() method and am now trying to write it to a file.
Here strvec is my sorted vector, which I'm trying to write using std::fs::write:
println!("Save results to file?");
let to_save: String = read!();
match to_save.as_str() {
"y" => {
println!("Enter filename");
let filename: String = read!();
let pwd = current_dir().into();
write("/home/user/dl/results", strvec);
Rust tells me "the trait AsRef<[u8]> is not implemented for Vec<String>". I've also tried using &strvec.
How do I avoid this/fix it?

When it comes to writing objects to a file, you might want to consider serialization. The most common library for this in Rust is serde. However, in this example, where you want to store a vector of Strings and don't need the file to be human-readable (in exchange you get a small file size), you can also use bincode:
use std::fs;
use bincode;
fn main() {
let v = vec![String::from("aaa"), String::from("bbb")];
let encoded_v = bincode::serialize(&v).expect("Could not encode vector");
fs::write("file", encoded_v).expect("Could not write file");
let read_v = fs::read("file").expect("Could not read file");
let decoded_v: Vec<String> = bincode::deserialize(&read_v).expect("Could not decode vector");
println!("{:?}", decoded_v);
}
Remember to add bincode = "1.3.3" under [dependencies] in Cargo.toml.
EDIT:
Actually, a String can be written to a file directly, so a simple join() should do:
use std::fs;
fn main() {
let v = vec![
String::from("aaa"),
String::from("bbb"),
String::from("ccc")];
fs::write("file", v.join("\n")).expect("Could not write file");
}

Rust can't write anything besides a &[u8] to a file. There are too many different ways in which data can be interpreted before it gets flattened, so you need to handle all of that ahead of time. For a Vec<String>, it's pretty simple: you can use concat to squish everything down into a single String, which can be interpreted as a &[u8] because of its AsRef<[u8]> trait impl.
Another option would be to use join, in case you wanted to add some sort of delimiter between your strings, like a space, comma, or something.
fn main() {
let strvec = vec![
"hello".to_string(),
"world".to_string(),
];
// "helloworld"
std::fs::write("/tmp/example", strvec.concat()).expect("failed to write to file");
// "hello world"
std::fs::write("/tmp/example", strvec.join(" ")).expect("failed to write to file");
}

You can't get a &[u8] from a Vec<String> without copying since a slice must refer to a contiguous sequence of items. Each String will have its own allocation on the heap somewhere, so while each individual String can be converted to a &[u8], you can't convert the whole vector to a single &[u8].
While you can .collect() the vector into a single String and then get a &[u8] from that, this does some unnecessary copying. Consider instead just iterating the Strings and writing each one to the file. With this helper, it's no more complex than using std::fs::write():
use std::path::Path;
use std::fs::File;
use std::io::Write;
fn write_each(
path: impl AsRef<Path>,
items: impl IntoIterator<Item=impl AsRef<[u8]>>,
) -> std::io::Result<()> {
let mut file = File::create(path)?;
for i in items {
file.write_all(i.as_ref())?;
}
// Surface any I/O errors that could otherwise be swallowed when
// the file is closed implicitly by being dropped.
file.sync_all()
}
The bound impl IntoIterator<Item=impl AsRef<[u8]>> is satisfied by both Vec<String> and by &Vec<String>, so you can call this as either write_each("path/to/output", strvec) (to consume the vector) or write_each("path/to/output", &strvec) (if you need to hold on to the vector for later).
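A variant of the same idea, sketched under the assumption that you want one item per line: taking any Write instead of a path lets the caller choose between a File, a BufWriter, or an in-memory buffer for tests. write_lines is a hypothetical name.

```rust
use std::io::{self, Write};

/// Hypothetical helper: writes each item followed by a newline.
fn write_lines<W: Write>(
    mut out: W,
    items: impl IntoIterator<Item = impl AsRef<[u8]>>,
) -> io::Result<()> {
    for item in items {
        out.write_all(item.as_ref())?;
        out.write_all(b"\n")?;
    }
    // Push any buffered bytes to the underlying writer.
    out.flush()
}

// Possible call sites (paths are placeholders):
//   write_lines(std::io::BufWriter::new(std::fs::File::create("out.txt")?), &strvec)?;
//   write_lines(&mut Vec::new(), &strvec)?;
```

Wrapping the File in a BufWriter is worthwhile here because the loop otherwise issues two small writes per item.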

How to capture the content of stdout/stderr when I cannot change the code that prints?

I have a function foo that can't be modified and contains println! and eprintln! code in it.
fn foo() {
println!("hello");
}
After I call the function, I have to test what it printed so I want to capture the stdout/stderr into a variable.

I strongly recommend against doing this, but if you are using nightly and don't mind using a feature that seems unlikely to ever be stabilized, you can directly capture stdout and stderr using hidden functionality of the standard library:
#![feature(internal_output_capture)]
use std::sync::Arc;
fn foo() {
println!("hello");
eprintln!("world");
}
fn main() {
std::io::set_output_capture(Some(Default::default()));
foo();
let captured = std::io::set_output_capture(None);
let captured = captured.unwrap();
let captured = Arc::try_unwrap(captured).unwrap();
let captured = captured.into_inner().unwrap();
let captured = String::from_utf8(captured).unwrap();
assert_eq!(captured, "hello\nworld\n");
}
It's very rare that a function "cannot be changed", so I'd encourage you to do so and use dependency injection instead. For example, if you are able to edit foo but do not want to change its signature, move all the code to a new function with generics which you can test directly:
use std::io::{self, Write};
fn foo() {
foo_inner(io::stdout(), io::stderr()).unwrap()
}
fn foo_inner(mut out: impl Write, mut err: impl Write) -> io::Result<()> {
writeln!(out, "hello")?;
writeln!(err, "world")?;
Ok(())
}
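With that split in place, a test can pass in-memory buffers instead of the real streams. The following is a minimal sketch: foo_inner is repeated from above so the block stands alone, and capture_foo is a hypothetical helper name.

```rust
use std::io::{self, Write};

fn foo_inner(mut out: impl Write, mut err: impl Write) -> io::Result<()> {
    writeln!(out, "hello")?;
    writeln!(err, "world")?;
    Ok(())
}

/// Hypothetical helper: runs foo_inner against two Vec<u8> buffers and
/// returns what was written to "stdout" and "stderr" as Strings.
fn capture_foo() -> io::Result<(String, String)> {
    let mut out = Vec::<u8>::new();
    let mut err = Vec::<u8>::new();
    // &mut Vec<u8> implements Write, so no real stream is involved.
    foo_inner(&mut out, &mut err)?;
    let to_string = |bytes: Vec<u8>| {
        String::from_utf8(bytes).map_err(|e| io::Error::new(io::ErrorKind::InvalidData, e))
    };
    Ok((to_string(out)?, to_string(err)?))
}
```

Unlike the nightly-only capture, this keeps the two streams separate and works on stable Rust.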
See also:
How can I test stdin and stdout?
How to take ownership of T from Arc<Mutex<T>>?
How do I convert a Vector of bytes (u8) to a string?

I'm not sure whether this works on Windows, but it should work on Unix-like systems. You need to replace the file descriptor with something you can read later, which I don't think is easy to do by hand.
I would suggest using stdio_override, which already does that for you using files. You can redirect the output, execute the function, and then read the file's contents.
From the example:
use stdio_override::StdoutOverride;
use std::fs;
let file_name = "./test.txt";
let guard = StdoutOverride::override_file(file_name)?;
println!("Isan to Stdout!");
let contents = fs::read_to_string(file_name)?;
assert_eq!("Isan to Stdout!\n", contents);
drop(guard);
println!("Outside!");
The library also supports anything that implements AsRawFd, through the override_raw call, which suggests that it will probably only work on Unix.
Otherwise, you can check the implementation to see how it is done internally; maybe you could pass a writer instead of a file somehow.

Shadow println!:
use std::{fs::File, io::Write, mem::MaybeUninit, sync::Mutex};
static mut FILE: MaybeUninit<Mutex<File>> = MaybeUninit::uninit();
macro_rules! println {
($($tt:tt)*) => {{
unsafe { writeln!(&mut FILE.assume_init_mut().lock().unwrap(), $($tt)*).unwrap(); }
}}
}
fn foo() {
println!("hello");
}
fn main() {
unsafe {
FILE.write(Mutex::new(File::create("out").unwrap()));
}
foo();
}

Is there a simpler way to pass a BufReader to a function?

To read the bytes of a PNG file, I want to create a function called read_8_bytes which will read the next 8 bytes in the file each time it's called.
fn main(){
let png = File::open("test.png").expect("1");
let mut png_reader = BufReader::new(png);
let mut byteBuffer: Vec<u8> = vec![0;8];
png_reader.read_exact(&mut byteBuffer).expect("2");
}
This works fine and if I keep calling read_exact from main I can read the next 8 bytes. I tried to create a function to do this and the solution just seems needlessly complicated. I'm wondering if there is a better way.
I thought I have to pass the BufReader to the function, but due to how Rust works this makes things complicated and I end up working out I need to do something like:
fn read_eight_bytes<R: BufRead>(fd: &mut R)
This compiles but I'm not happy because I don't understand why this needed to be done and seems complex. Is there a simple way of having a function I can pass a file descriptor type thing to and have it store the position like in C without having to do this?

Looking at your question, I think you are trying to say that you are confused as to why the <R: BufRead> is necessary or furthermore why this even works.
In your example, this generic is not strictly necessary. One could implement the function you describe like so:
use std::{fs, io};
fn main() -> io::Result<()> {
let mut file = fs::File::open("./path/to/file")?;
let bytes = read_eight_bytes(&mut file)?;
println!("{:?}", bytes);
Ok(())
}
fn read_eight_bytes(file: &mut fs::File) -> io::Result<[u8; 8]> {
use io::Read;
let mut bytes = [0; 8];
file.read_exact(&mut bytes)?;
Ok(bytes)
}
This is perfectly valid and hopefully should make sense.
But then, why does fn read_eight_bytes<R: BufRead>(file: &mut R) -> [u8; 8] work? First of all, I assume you understand the following concepts:
Generics
Traits
Given an understanding of the above concepts, you should know that this syntax means that the function read_eight_bytes is a generic function with a generic type named R. You should then also understand that the generic has a trait bound, requiring the type R to implement BufRead. And that this function takes a parameter which is a mutable reference to the variable file, which is of the type R.
Now taking a look at the definition of BufRead: we see that it contains several functions. But surprisingly there is no read_exact function! Why does a function like this compile?
use std::{fs, io};
use io::BufRead;
fn main() -> io::Result<()> {
let file = fs::File::open("./path/to/file")?;
let mut reader = io::BufReader::new(file);
let bytes = read_eight_bytes(&mut reader)?;
println!("{:?}", bytes);
Ok(())
}
fn read_eight_bytes<R: BufRead>(reader: &mut R) -> io::Result<[u8; 8]> {
let mut bytes = [0; 8];
reader.read_exact(&mut bytes)?;
Ok(bytes)
}
Note: I have altered the return type to io::Result<...>. This is considered to be a better practice compared to unwrapping every Result.
I have also changed the function call to use a BufReader because BufReader implements BufRead whilst File does not. I will cover the difference a little further below.
The reason this works is that Read is a supertrait of BufRead. This means that any type that implements BufRead must also implement Read, and thus it must have the read_exact function!
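The same mechanism can be seen in isolation with a small sketch. All names here (Named, Greeter, User, describe) are made up for illustration and have nothing to do with std::io.

```rust
/// The supertrait: every Greeter must also be Named.
trait Named {
    fn name(&self) -> String;
}

/// `Greeter: Named` declares Named as a supertrait of Greeter.
trait Greeter: Named {
    fn greet(&self) -> String {
        // Methods of the supertrait are available on self.
        format!("Hello, {}!", self.name())
    }
}

struct User;

impl Named for User {
    fn name(&self) -> String {
        "user".to_string()
    }
}

// This impl only compiles because User also implements Named.
impl Greeter for User {}

/// The bound mentions only Greeter, yet `name` from Named can be called,
/// just as read_exact can be called on an `R: BufRead`.
fn describe<T: Greeter>(item: &T) -> String {
    format!("{} says: {}", item.name(), item.greet())
}
```

Calling describe(&User) produces "user says: Hello, user!".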
Given our function never requires the functions on BufRead we could change the trait bound to only require Read:
use std::{fs, io};
use io::Read;
fn main() -> io::Result<()> {
let file = fs::File::open("./path/to/file")?;
let mut reader = io::BufReader::new(file);
let bytes = read_eight_bytes(&mut reader)?;
println!("{:?}", bytes);
Ok(())
}
fn read_eight_bytes<R: Read>(reader: &mut R) -> io::Result<[u8; 8]> {
let mut bytes = [0; 8];
reader.read_exact(&mut bytes)?;
Ok(bytes)
}
Now here is something interesting about this change. The read_eight_bytes function can now be called in (at least) two different ways:
use std::{fs, io};
use io::Read;
fn main() -> io::Result<()> {
let mut file = fs::File::open("./path/to/file")?;
let bytes = read_eight_bytes(&mut file)?;
println!("{:?}", bytes);
let file = fs::File::open("./path/to/file")?;
let mut reader = io::BufReader::new(file);
let bytes = read_eight_bytes(&mut reader)?;
println!("{:?}", bytes);
Ok(())
}
fn read_eight_bytes<R: Read>(reader: &mut R) -> io::Result<[u8; 8]> {
let mut bytes = [0; 8];
reader.read_exact(&mut bytes)?;
Ok(bytes)
}
Why is this? This is because both File and BufReader implement the Read trait. And thus can both be used with the read_eight_bytes function!
So then why would someone want to use either File or BufReader over the other?
Well the BufReader documentation explains this:
The BufReader struct adds buffering to any reader.
It can be excessively inefficient to work directly with a Read
instance. For example, every call to read on TcpStream results in a
system call. A BufReader performs large, infrequent reads on the
underlying Read and maintains an in-memory buffer of the results.
BufReader can improve the speed of programs that make small and
repeated read calls to the same file or network socket. It does not
help when reading very large amounts at once, or reading just one or a
few times. It also provides no advantage when reading from a source
that is already in memory, like a Vec.
Now, remember how before we wrote this function just for the File type? The primary reason why one would want to write it with generics would be such that a caller can make the choice presented above. This is common practice in libraries where such a choice really does matter. However, generics come at the cost of increased compile times (when used excessively) and increased code complexity.
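A side benefit of the generic version, shown here as a small sketch: because &[u8] also implements Read, the function can be exercised without touching the file system, and the reader itself keeps track of the position, much like a C file descriptor would. first_two_chunks is a hypothetical helper name.

```rust
use std::io::{self, Read};

fn read_eight_bytes<R: Read>(reader: &mut R) -> io::Result<[u8; 8]> {
    let mut bytes = [0; 8];
    reader.read_exact(&mut bytes)?;
    Ok(bytes)
}

/// Hypothetical helper: reads two consecutive 8-byte chunks.
/// The second call continues where the first one stopped because the
/// position lives inside the reader, not inside the function.
fn first_two_chunks<R: Read>(reader: &mut R) -> io::Result<([u8; 8], [u8; 8])> {
    let first = read_eight_bytes(reader)?;
    let second = read_eight_bytes(reader)?;
    Ok((first, second))
}
```

When fewer than 8 bytes remain, read_exact returns an error of kind UnexpectedEof, which the caller can use to detect the end of the data.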

Idiomatic way of mimicking Python's input function in Rust

I have three different versions of a function that mimics the input function from Python.
use std::io::{self, BufRead, BufReader, Write};
// Adapted from https://docs.rs/python-input/0.8.0/src/python_input/lib.rs.html#13-23
fn input_1(prompt: &str) -> io::Result<String> {
print!("{}", prompt);
io::stdout().flush()?;
let mut buffer = String::new();
io::stdin().read_line(&mut buffer)?;
Ok(buffer.trim_end().to_string())
}
// https://www.reddit.com/r/rust/comments/6qn3y0/store_user_inputs_in_rust/
fn input_2(prompt: &str) -> io::Result<String> {
print!("{}", prompt);
io::stdout().flush()?;
BufReader::new(io::stdin())
.lines()
.next()
.ok_or_else(|| io::Error::new(io::ErrorKind::Other, "Cannot read stdin"))
.and_then(|inner| inner)
}
// tranzystorek user on Discord (edited for future reference)
fn input_3(prompt: &str) -> io::Result<String> {
print!("{}", prompt);
std::io::stdout().flush()?;
BufReader::new(std::io::stdin().lock())
.lines()
.take(1)
.collect()
}
fn main() {
let name = input_1("What's your name? ").unwrap();
println!("Hello, {}!", name);
let name = input_2("What's your name? ").unwrap();
println!("Hello, {}!", name);
let name = input_3("What's your name? ").unwrap();
println!("Hello, {}!", name);
}
But they seem to be very different approaches and I don't know if there's any advantage to using one over the other. From what I've read, having a function like Python's input is not as simple as it seems, which is why there's none in the standard library.
What problems could I face using any of the versions written above? Is there another, more idiomatic, way of writing this input function? (2018 edition)
Also, here: How can I read a single line from stdin? some of the answers use the lock() method but I don't get its purpose.
I'm learning Rust coming from Python.

This is a question of style mostly - both methods are acceptable. Most of the Rustaceans I know would probably favour the second approach, as it's more functional in style but it really doesn't matter in this case.
The key change I'd make is use of the lock method in your second example.
To understand the lock method, consider the following scenario: if your application were multithreaded and two threads attempted to read from stdin at the same time, what would happen?
The lock ensures that only one thread can access stdin at a time. You always access stdin through a lock. In fact, if you look at the implementation of Stdin::read_line (the method you call in the first example), you'll see it's this very simple one-liner:
self.lock().read_line(buf)
So even when you aren't explicitly calling lock it's still being used behind the scenes.
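Secondly, .next() blocks until data has been entered, so it will not return None during normal interactive use and you can use .unwrap() here rather than .ok_or/.and_then. Be aware that it does return None at end of input (for example when stdin is closed or redirected from an empty file), in which case .unwrap() panics.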
Lastly you missed out the .trim_end() that you had in input_1 ;).
fn input_2(prompt: &str) -> io::Result<String> {
print!("{}", prompt);
io::stdout().flush()?;
io::stdin()
.lock()
.lines()
.next()
.unwrap()
.map(|x| x.trim_end().to_owned())
}
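If you would rather not panic when input ends, one possible sketch is to make the function generic over the reader and writer and to report end of input as an error, which is roughly what Python does by raising EOFError. input_from is a hypothetical name, and the choice of ErrorKind::UnexpectedEof is an assumption of this sketch.

```rust
use std::io::{self, BufRead, Write};

/// Hypothetical helper: prompts on `out`, then reads one line from `reader`.
fn input_from<R: BufRead, W: Write>(mut reader: R, mut out: W, prompt: &str) -> io::Result<String> {
    write!(out, "{}", prompt)?;
    // Make sure the prompt is visible before blocking on input.
    out.flush()?;
    let mut line = String::new();
    // read_line returns Ok(0) only at end of input.
    if reader.read_line(&mut line)? == 0 {
        return Err(io::Error::new(
            io::ErrorKind::UnexpectedEof,
            "end of input while reading a line",
        ));
    }
    Ok(line.trim_end().to_owned())
}

// Possible call site for the real streams:
//   let name = input_from(io::stdin().lock(), io::stdout(), "What's your name? ")?;
```

The generic signature also makes the function testable with a byte slice as input and a Vec<u8> as output.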

Borrow checker doesn't realize that `clear` drops reference to local variable

The following code reads space-delimited records from stdin, and writes comma-delimited records to stdout. Even with optimized builds it's rather slow (about twice as slow as using, say, awk).
use std::io::BufRead;
fn main() {
let stdin = std::io::stdin();
for line in stdin.lock().lines().map(|x| x.unwrap()) {
let fields: Vec<_> = line.split(' ').collect();
println!("{}", fields.join(","));
}
}
One obvious improvement would be to use itertools to join without allocating a vector (the collect call causes an allocation). However, I tried a different approach:
fn main() {
let stdin = std::io::stdin();
let mut cache = Vec::<&str>::new();
for line in stdin.lock().lines().map(|x| x.unwrap()) {
cache.extend(line.split(' '));
println!("{}", cache.join(","));
cache.clear();
}
}
This version tries to reuse the same vector over and over. Unfortunately, the compiler complains:
error: `line` does not live long enough
--> src/main.rs:7:22
|
7 | cache.extend(line.split(' '));
| ^^^^
|
note: reference must be valid for the block suffix following statement 1 at 5:39...
--> src/main.rs:5:40
|
5 | let mut cache = Vec::<&str>::new();
| ^
note: ...but borrowed value is only valid for the for at 6:4
--> src/main.rs:6:5
|
6 | for line in stdin.lock().lines().map(|x| x.unwrap()) {
| ^
error: aborting due to previous error
Which of course makes sense: the line variable is only alive in the body of the for loop, whereas cache keeps a pointer into it across iterations. But that error still looks spurious to me: since the cache is cleared after each iteration, no reference to line can be kept, right?
How can I tell the borrow checker about this?

The only way to do this is to use transmute to change the Vec<&'a str> into a Vec<&'b str>. transmute is unsafe and Rust will not raise an error if you forget the call to clear here. You might want to extend the unsafe block up to after the call to clear to make it clear (no pun intended) where the code returns to "safe land".
use std::io::BufRead;
use std::mem;
fn main() {
let stdin = std::io::stdin();
let mut cache = Vec::<&str>::new();
for line in stdin.lock().lines().map(|x| x.unwrap()) {
let cache: &mut Vec<&str> = unsafe { mem::transmute(&mut cache) };
cache.extend(line.split(' '));
println!("{}", cache.join(","));
cache.clear();
}
}

In this case Rust doesn't know what you're trying to do. Unfortunately, .clear() does not affect how .extend() is checked.
The cache is a "vector of strings that live as long as the main function", but in extend() calls you're appending "strings that live only as long as one loop iteration", so that's a type mismatch. The call to .clear() doesn't change the types.
Usually such limited-time uses are expressed by making a long-lived opaque object that enables access to its memory by borrowing a temporary object with the right lifetime, like RefCell.borrow() gives a temporary Ref object. Implementation of that would be a bit involved and would require unsafe methods for recycling Vec's internal memory.
In this case an alternative solution could be to avoid these allocations altogether (.join() allocates too) and stream the output using the Peekable iterator wrapper:
for line in stdin.lock().lines().map(|x| x.unwrap()) {
let mut fields = line.split(' ').peekable();
while let Some(field) = fields.next() {
print!("{}", field);
if fields.peek().is_some() {
print!(",");
}
}
print!("\n");
}
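The same streaming idea can be written against any Write, which is a sketch under the assumption that you are willing to pass the output handle around. That makes it possible to lock stdout once, wrap it in a BufWriter, and avoid a separate print! call per field. write_joined is a hypothetical name.

```rust
use std::io::{self, Write};

/// Hypothetical helper: writes the space-separated fields of `line`
/// to `out`, separated by commas and terminated by a newline.
fn write_joined<W: Write>(out: &mut W, line: &str) -> io::Result<()> {
    let mut first = true;
    for field in line.split(' ') {
        // Emit the separator before every field except the first.
        if !first {
            out.write_all(b",")?;
        }
        out.write_all(field.as_bytes())?;
        first = false;
    }
    out.write_all(b"\n")
}

// Possible call site:
//   let stdout = io::stdout();
//   let mut out = io::BufWriter::new(stdout.lock());
//   for line in stdin.lock().lines() { write_joined(&mut out, &line?)?; }
```

No vector of &str is needed, so the lifetime problem from the question never arises.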
BTW: Francis' answer with transmute is good too. You can use unsafe to say you know what you're doing and override the lifetime check.

Itertools has .format() for the purpose of lazy formatting, which skips allocating a string too.
use std::io::BufRead;
use itertools::Itertools;
fn main() {
let stdin = std::io::stdin();
for line in stdin.lock().lines().map(|x| x.unwrap()) {
println!("{}", line.split(' ').format(","));
}
}
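A digression: something like this is a “safe abstraction”, in the smallest sense, of the transmute solution in another answer here:
use std::mem::transmute;
fn repurpose<'a, T: ?Sized>(mut v: Vec<&T>) -> Vec<&'a T> {
v.clear();
// The vector is empty here, so no reference with the old lifetime remains.
unsafe {
transmute(v)
}
}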

Another approach is to refrain from storing references altogether, and to store indices instead. This trick can also be useful in other data structure contexts, so this might be a nice opportunity to try it out.
use std::io::BufRead;
fn main() {
let stdin = std::io::stdin();
let mut cache = Vec::new();
for line in stdin.lock().lines().map(|x| x.unwrap()) {
cache.push(0);
cache.extend(line.match_indices(' ').map(|x| x.0 + 1));
// cache now contains the indices where new words start
// do something with this information
for i in 0..(cache.len() - 1) {
print!("{},", &line[cache[i]..(cache[i + 1] - 1)]);
}
println!("{}", &line[*cache.last().unwrap()..]);
cache.clear();
}
}
Though you made the remark yourself in the question, I feel the need to point out that there are more elegant methods to do this using iterators, that might avoid the allocation of a vector altogether.
The approach above was inspired by a similar question here, and becomes more useful if you need to do something more complicated than printing.

Elaborating on Francis's answer about using transmute(), this could be safely abstracted, I think, with this simple function:
pub fn zombie_vec<'a, 'b, T: ?Sized>(mut data: Vec<&'a T>) -> Vec<&'b T> {
data.clear();
unsafe {
std::mem::transmute(data)
}
}
Using this, the original code would be:
fn main() {
let stdin = std::io::stdin();
let mut cache0 = Vec::<&str>::new();
for line in stdin.lock().lines().map(|x| x.unwrap()) {
let mut cache = cache0; // into the loop
cache.extend(line.split(' '));
println!("{}", cache.join(","));
cache0 = zombie_vec(cache); // out of the loop
}
}
You need to move the outer vector into every loop iteration and restore it before the iteration finishes, while safely erasing the local lifetime.

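The safe solution is to use .drain(..) instead of .clear(), where .. is the "full range". It returns an iterator, so the drained elements can be processed in a loop. It is also available for other collections (String, HashMap, etc.). Note that the example below compiles because its lines are string literals that outlive the vector; drain does not change the lifetime recorded in the vector's type, so it does not by itself help when each line is a String owned by a single loop iteration.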
fn main() {
let mut cache = Vec::<&str>::new();
for line in ["first line allocates for", "second"].iter() {
println!("Size and capacity: {}/{}", cache.len(), cache.capacity());
cache.extend(line.split(' '));
println!(" {}", cache.join(","));
cache.drain(..);
}
}
