Fastest idiomatic I/O routine in Rust for programming contests?

My question has been partially answered, so I've revised it in response to things I've learned from comments and additional experiments.
In summary, I want a fast I/O routine for programming contests, in which problems are solved with a single file and no external crates. It should read in a sequence of whitespace-separated tokens from a BufRead (either stdin or a file). The tokens may be integers, floats or ASCII words, separated by spaces and newlines, so it seems I should support FromStr types generically. A small minority of problems are interactive, meaning not all of the input is available initially, but it always comes in complete lines.
For context, here's the discussion that led me to post here. Someone wrote very fast custom code to parse integers directly from the &[u8] output of BufRead::fill_buf(), but it's not generic in FromStr.
Here is my best solution so far (emphasis on the Scanner struct):
use std::io::{self, prelude::*};
fn solve<B: BufRead, W: Write>(mut scan: Scanner<B>, mut w: W) {
let n = scan.token();
let mut a = Vec::with_capacity(n);
let mut b = Vec::with_capacity(n);
for _ in 0..n {
a.push(scan.token::<i64>());
b.push(scan.token::<i64>());
}
let mut order: Vec<_> = (0..n).collect();
order.sort_by_key(|&i| b[i] - a[i]);
let ans: i64 = order
.into_iter()
.enumerate()
.map(|(i, x)| a[x] * i as i64 + b[x] * (n - 1 - i) as i64)
.sum();
writeln!(w, "{}", ans);
}
fn main() {
let stdin = io::stdin();
let stdout = io::stdout();
let reader = Scanner::new(stdin.lock());
let writer = io::BufWriter::new(stdout.lock());
solve(reader, writer);
}
pub struct Scanner<B> {
reader: B,
buf_str: String,
buf_iter: std::str::SplitWhitespace<'static>,
}
impl<B: BufRead> Scanner<B> {
pub fn new(reader: B) -> Self {
Self {
reader,
buf_str: String::new(),
buf_iter: "".split_whitespace(),
}
}
pub fn token<T: std::str::FromStr>(&mut self) -> T {
loop {
if let Some(token) = self.buf_iter.next() {
return token.parse().ok().expect("Failed parse");
}
self.buf_str.clear();
self.reader
.read_line(&mut self.buf_str)
.expect("Failed read");
self.buf_iter = unsafe { std::mem::transmute(self.buf_str.split_whitespace()) };
}
}
}
By avoiding unnecessary allocations, this Scanner is quite fast. If we didn't care about unsafety, it could be made even faster: instead of doing read_line() into a String, do read_until(b'\n') into a Vec<u8>, followed by str::from_utf8_unchecked().
However, I'd also like to know what's the fastest safe solution. Is there a clever way to tell Rust that what my Scanner implementation does is actually safe, eliminating the mem::transmute? Intuitively, it seems we should think of the SplitWhitespace object as owning the buffer until it's effectively dropped after it returns None.
All else being equal, I'd like a "nice" idiomatic standard library solution, as I'm trying to present Rust to others who do programming contests.
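For comparison, here is a minimal sketch of a fully safe variant. It is not claimed to be the fastest: it allocates one String per token, which is the cost of dropping the transmute. The name SafeScanner and the end-of-input assertion are assumptions of this sketch and are not part of the code above.

```rust
use std::io::BufRead;
use std::str::FromStr;

/// Safe variant of the `Scanner` idea: the tokens of the current line are
/// stored as owned `String`s, so no borrow of the line buffer is kept alive.
/// (Hypothetical name; trades one allocation per token for safety.)
pub struct SafeScanner<B> {
    reader: B,
    // Tokens of the current line in reverse order, so `pop()` yields the next one.
    pending: Vec<String>,
}

impl<B: BufRead> SafeScanner<B> {
    pub fn new(reader: B) -> Self {
        Self { reader, pending: Vec::new() }
    }

    /// Returns the next token parsed as `T`.
    /// Panics at end of input or if the token does not parse.
    pub fn token<T: FromStr>(&mut self) -> T {
        loop {
            if let Some(token) = self.pending.pop() {
                return token.parse().ok().expect("Failed parse");
            }
            let mut line = String::new();
            let bytes_read = self.reader.read_line(&mut line).expect("Failed read");
            // Assumption of this sketch: running out of input is a bug in the caller.
            assert!(bytes_read > 0, "Unexpected end of input");
            self.pending = line.split_whitespace().rev().map(String::from).collect();
        }
    }
}
```

Because input is still consumed one line at a time, this should behave the same way on interactive problems; only the per-token allocation differs.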

I'm so glad you asked, as I solved this exact problem in my LibCodeJam rust implementation. Specifically, reading raw tokens from a BufRead is handled by the TokensReader type as well as some tiny related helpers.
Here's the relevant excerpt. The basic idea is to scan the BufRead::fill_buf buffer for whitespace and copy non-whitespace characters into a local buffer, which is reused between token calls. Once a whitespace character is found, or the stream ends, the local buffer is interpreted as UTF-8 and returned as an &str.
#[derive(Debug)]
pub enum LoadError {
Io(io::Error),
Utf8Error(Utf8Error),
OutOfTokens,
}
/// TokenBuffer is a reusable buffer into which tokens are
/// read, one by one. It is cleared but not deallocated
/// between tokens.
#[derive(Debug)]
struct TokenBuffer(Vec<u8>);
impl TokenBuffer {
/// Clear the buffer and start reading a new token
fn lock(&mut self) -> TokenBufferLock {
self.0.clear();
TokenBufferLock(&mut self.0)
}
}
/// TokenBufferLock is a helper type that helps manage the lifecycle
/// of reading a new token, then interpreting it as UTF-8.
// Default cannot be derived here: &mut Vec<u8> does not implement Default.
#[derive(Debug)]
struct TokenBufferLock<'a>(&'a mut Vec<u8>);
impl<'a> TokenBufferLock<'a> {
/// Add some bytes to a token
fn extend(&mut self, chunk: &[u8]) {
self.0.extend(chunk)
}
/// Complete the token and attempt to interpret it as UTF-8
fn complete(self) -> Result<&'a str, LoadError> {
from_utf8(self.0).map_err(LoadError::Utf8Error)
}
}
pub struct TokensReader<R: io::BufRead> {
reader: R,
token: TokenBuffer,
}
impl<R: io::BufRead> Tokens for TokensReader<R> {
fn next_raw(&mut self) -> Result<&str, LoadError> {
use std::io::ErrorKind::Interrupted;
// Clear leading whitespace
loop {
match self.reader.fill_buf() {
Err(ref err) if err.kind() == Interrupted => continue,
Err(err) => return Err(LoadError::Io(err)),
Ok([]) => return Err(LoadError::OutOfTokens),
// Got some content; scan for the next non-whitespace character
Ok(buf) => match buf.iter().position(|byte| !byte.is_ascii_whitespace()) {
Some(i) => {
self.reader.consume(i);
break;
}
None => {
// Copy the length out first so the borrow of the buffer ends before consume().
let len = buf.len();
self.reader.consume(len);
}
},
};
}
// If we reach this point, there is definitely a non-empty token ready to be read.
let mut token_buf = self.token.lock();
loop {
match self.reader.fill_buf() {
Err(ref err) if err.kind() == Interrupted => continue,
Err(err) => return Err(LoadError::Io(err)),
Ok([]) => return token_buf.complete(),
// Got some content; scan for the next whitespace character
Ok(buf) => match buf.iter().position(u8::is_ascii_whitespace) {
Some(i) => {
token_buf.extend(&buf[..i]);
self.reader.consume(i + 1);
return token_buf.complete();
}
None => {
token_buf.extend(buf);
// Copy the length out first so the borrow of the buffer ends before consume().
let len = buf.len();
self.reader.consume(len);
}
},
}
}
}
}
This implementation doesn't parse strings into FromStr types (that's handled separately), but it does correctly accumulate bytes, split them into whitespace-separated tokens, and interpret those tokens as UTF-8. It assumes that only ASCII whitespace is used to separate tokens.
It's worth noting that FromStr cannot be used directly on the fill_buf buffer, because there's no guarantee that a token doesn't straddle the boundary between two fill_buf calls, and there's no way to force a BufRead to read more bytes until the existing buffer is fully consumed. I'm assuming it's pretty obvious that once you have an Ok(&str), you can perform FromStr on it at your leisure.
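To make the boundary problem concrete, here is a small sketch that records the chunks fill_buf exposes when the internal buffer is smaller than a token. The helper names fill_buf_chunks and demo_chunks are made up for this illustration, and the 4-byte capacity is chosen only to force the split.

```rust
use std::io::{BufRead, BufReader};

/// Collects the successive chunks that `fill_buf` exposes, consuming each one fully.
fn fill_buf_chunks<R: BufRead>(mut reader: R) -> Vec<Vec<u8>> {
    let mut chunks = Vec::new();
    loop {
        // Copy the chunk so the borrow of `reader` ends before `consume`.
        let chunk = reader.fill_buf().expect("read failed").to_vec();
        if chunk.is_empty() {
            return chunks; // end of input
        }
        reader.consume(chunk.len());
        chunks.push(chunk);
    }
}

/// Demonstration input: the token "12345" does not fit into a 4-byte buffer.
fn demo_chunks() -> Vec<Vec<u8>> {
    fill_buf_chunks(BufReader::with_capacity(4, "12345 678".as_bytes()))
}
```

With this input the chunks come out as "1234", "5 67" and "8", so neither "12345" nor "678" is ever visible in a single fill_buf slice.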
This implementation is not 0-copy, but it is (amortized) 0-allocation, and it minimizes unnecessary copying or buffering. It uses a single persistent buffer that is only resized if it's too small for a single token, and it reuses this buffer between tokens. Bytes are copied into this buffer directly from the input BufRead buffer, without extra intermediary copying.

Related

Make iterator of nested iterators

How could I pack the following code into a single iterator?
use std::io::{BufRead, BufReader};
use std::fs::File;
let file = BufReader::new(File::open("sample.txt").expect("Unable to open file"));
for line in file.lines() {
for ch in line.expect("Unable to read line").chars() {
println!("Character: {}", ch);
}
}
Naively, I’d like to have something like (I skipped unwraps)
let lines = file.lines().next();
Reader {
line: lines,
char: next().chars()
}
and iterate over Reader.char till hitting None, then refreshing Reader.line to a new line and Reader.char to the first character of the line. This doesn't seem to be possible though because Reader.char depends on the temporary variable.
Please notice that the question is about nested iterators, reading text files is used as an example.
You can use the flat_map() iterator utility to create a new iterator that can produce any number of items for each item in the iterator it's called on.
In this case, that's complicated by the fact that lines() returns an iterator of Results, so the Err case must be handled.
There's also the issue that .chars() references the original string to avoid an additional allocation, so you have to collect the characters into another iterable container.
Solving both issues results in this mess:
fn example() -> impl Iterator<Item=Result<char, std::io::Error>> {
let file = BufReader::new(File::open("sample.txt").expect("Unable to open file"));
file.lines().flat_map(|line| match line {
Err(e) => vec![Err(e)],
Ok(line) => line.chars().map(Ok).collect(),
})
}
If String gave us an into_chars() method we could avoid collect() here, but then we'd have differently-typed iterators and would need to use either Box<dyn Iterator> or something like either::Either.
Since you already use .expect() here, you can simplify a bit by using .expect() within the closure to avoid handling the Err case:
fn example() -> impl Iterator<Item=char> {
let file = BufReader::new(File::open("sample.txt").expect("Unable to open file"));
file.lines().flat_map(|line|
line.expect("Unable to read line").chars().collect::<Vec<_>>()
)
}
In the general case, flat_map() is usually quite easy. You just need to be mindful of whether you are iterating owned vs borrowed values; both cases have some sharp corners. In this case, iterating over owned String values makes using .chars() problematic. If we could iterate over borrowed str slices we wouldn't have to .collect().
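As a sketch of the borrowed case, assuming the lines already live in a slice that outlives the iterator (the function name chars_of is made up for this example):

```rust
/// Flattens borrowed lines into their characters. `chars()` borrows from the
/// `&str` items, which outlive the returned iterator, so no `collect()` is needed.
fn chars_of<'a>(lines: &'a [&'a str]) -> impl Iterator<Item = char> + 'a {
    lines.iter().flat_map(|line| line.chars())
}
```

Here the closure can return Chars directly because the string data is owned by the caller rather than by the closure's argument.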
Drawing on the answer from @cdhowie and this answer that suggests using IntoIter to get an iterator of owned chars, I was able to come up with this solution that is the closest to what I expected:
use std::fs::File;
use std::io;
use std::io::{BufRead, BufReader, Lines};
use std::vec::IntoIter;
struct Reader {
lines: Lines<BufReader<File>>,
iter: IntoIter<char>,
}
impl Reader {
fn new(filename: &str) -> Self {
let file = BufReader::new(File::open(filename).expect("Unable to open file"));
let mut lines = file.lines();
let iter = Reader::char_iter(lines.next().expect("Unable to read file"));
Reader { lines, iter }
}
fn char_iter(line: io::Result<String>) -> IntoIter<char> {
line.unwrap().chars().collect::<Vec<_>>().into_iter()
}
}
impl Iterator for Reader {
type Item = char;
fn next(&mut self) -> Option<Self::Item> {
match self.iter.next() {
None => {
self.iter = match self.lines.next() {
None => return None,
Some(line) => Reader::char_iter(line),
};
Some('\n')
}
Some(val) => Some(val),
}
}
}
It works as expected:
let reader = Reader::new("src/main.rs");
for ch in reader {
print!("{}", ch);
}

Read binary file in units of f64 in Rust

Assuming you have a binary file example.bin and you want to read that file in units of f64, i.e. the first 8 bytes give a float, the next 8 bytes give a number, etc. (assuming you know the endianness), how can this be done in Rust?
I know that one can use std::fs::read("example.bin") to get a Vec<u8> of the data, but then you have to do quite a bit of "gymnastics" to convert each group of 8 bytes to an f64, i.e.
use std::convert::TryInto;
fn eight_bytes_to_array(barry: &[u8]) -> &[u8; 8] {
barry.try_into().expect("slice with incorrect length")
}
let file_content = std::fs::read("example.bin").expect("Could not read file!");
let nr = eight_bytes_to_array(&file_content[0..8]);
let nr = f64::from_be_bytes(*nr);
I saw this post, but it's from 2015 and a lot of changes have happened in Rust since then, so I was wondering if there is a better/faster way these days?
This example omits proper error handling and does not check for the case where the file size is not a multiple of 8 bytes.
use std::fs::File;
use std::io::{BufReader, Read};
fn main() {
// Using BufReader because files in std are unbuffered by default,
// and reading 8 bytes at a time without buffering is a really bad idea.
let mut input = BufReader::new(
File::open("floats.bin")
.expect("Failed to open file")
);
let mut floats = Vec::new();
loop {
use std::io::ErrorKind;
// You may use 8 instead of `size_of` but size_of is less error-prone.
let mut buffer = [0u8; std::mem::size_of::<f64>()];
// Using read_exact because `read` may return less
// than 8 bytes even if there are bytes in the file.
// This, however, prevents us from handling cases
// when file size cannot be divided by 8.
let res = input.read_exact(&mut buffer);
match res {
// We detect if we read until the end.
// If there were some excess bytes after last read, they are lost.
Err(error) if error.kind() == ErrorKind::UnexpectedEof => break,
// Add more cases of errors you want to handle.
_ => {}
}
// You should do better error-handling probably.
// This simply panics.
res.expect("Unexpected error during read");
// Use `from_be_bytes` if the numbers in the file are big-endian
let f = f64::from_le_bytes(buffer);
floats.push(f);
}
}
I would create a generic iterator that returns f64 for flexibility and reusability.
use std::{fs, io};
struct F64Reader<R: io::BufRead> {
inner: R,
}
impl<R: io::BufRead> F64Reader<R> {
pub fn new(inner: R) -> Self {
Self{
inner
}
}
}
impl<R: io::BufRead> Iterator for F64Reader<R> {
type Item = f64;
fn next(&mut self) -> Option<Self::Item> {
let mut buff: [u8; 8] = [0;8];
self.inner.read_exact(&mut buff).ok()?;
Some(f64::from_be_bytes(buff))
}
}
This means that if the file is large, you can loop through the values without storing them all in memory:
let input = fs::File::open("example.bin")?;
for f in F64Reader::new(io::BufReader::new(input)) {
println!("{}", f)
}
Or if you want all the values you can collect them
let input = fs::File::open("example.bin")?;
let values : Vec<f64> = F64Reader::new(io::BufReader::new(input)).collect();
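If the whole file is already in memory as a Vec<u8> (as in the question), another option is to let chunks_exact do the slicing. This is only a sketch: the function name decode_f64_be is made up, big-endian data is assumed, and incomplete trailing bytes are silently dropped.

```rust
use std::convert::TryInto;

/// Decodes big-endian `f64` values from a byte slice.
/// Trailing bytes that do not form a complete 8-byte group are ignored.
fn decode_f64_be(bytes: &[u8]) -> Vec<f64> {
    bytes
        .chunks_exact(std::mem::size_of::<f64>())
        // Every chunk is exactly 8 bytes long, so the conversion cannot fail.
        .map(|chunk| f64::from_be_bytes(chunk.try_into().expect("chunk is 8 bytes")))
        .collect()
}
```

It could be called as decode_f64_be(&std::fs::read("example.bin")?). For little-endian files, swap in f64::from_le_bytes.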

What is an idiomatic way to fill a user-supplied buffer when reading bytes?

Read::read returns the number of bytes that it actually read, which can be less than the requested buffer. In many cases, it is acceptable to make multiple calls to read in order to completely fill the buffer.
I have this code, but it seems pretty ungainly:
use std::io::{self, Read};
fn read_complete<R>(mut rdr: R, buf: &mut [u8]) -> io::Result<()>
where R: Read
{
let mut total_read = 0;
loop {
let window = &mut buf[total_read..];
let bytes_read = try!(rdr.read(window));
// Completely filled the buffer
if window.len() == bytes_read {
return Ok(());
}
// Unable to read anything
if bytes_read == 0 {
return Err(io::Error::new(io::ErrorKind::Other, "Unable to read complete buffer"));
}
// Partial read, continue
total_read += bytes_read;
}
}
fn main() {}
Is there a function in the standard library that will abstract this work away for me?
This answer applies to versions of Rust before 1.6.0
Not as far as I know.
Looking at the byteorder crate's source, there's a read_full method defined there, too:
fn read_full<R: io::Read + ?Sized>(rdr: &mut R, buf: &mut [u8]) -> Result<()> {
let mut nread = 0usize;
while nread < buf.len() {
match rdr.read(&mut buf[nread..]) {
Ok(0) => return Err(Error::UnexpectedEOF),
Ok(n) => nread += n,
Err(ref e) if e.kind() == io::ErrorKind::Interrupted => {},
Err(e) => return Err(From::from(e))
}
}
Ok(())
}
Note that this deals with interrupted IO operations.
There's also a proposed RFC, that was submitted several months ago, went to final comment period, then changed enough that it was taken out of final comment period and is waiting for another go-around.
It turns out that this is unexpectedly complicated. :P
Since the RFC mentioned in the other answer is accepted, implemented, and available in Rust 1.6.0, you can just use the Read::read_exact() method:
try!(r.read_exact(&mut buf))
Or, using the ? operator introduced in Rust 1.13.0:
r.read_exact(&mut buf)?
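A small sketch of how this looks in practice, including what happens when the input is too short. The helper read_u32_be is a made-up name for this illustration.

```rust
use std::io::{self, Read};

/// Reads exactly four bytes and interprets them as a big-endian `u32`.
/// If the reader ends early, `read_exact` reports `ErrorKind::UnexpectedEof`.
fn read_u32_be<R: Read>(mut reader: R) -> io::Result<u32> {
    let mut buf = [0u8; 4];
    reader.read_exact(&mut buf)?;
    Ok(u32::from_be_bytes(buf))
}
```

Note that when read_exact fails, the contents of the buffer are unspecified, so they should not be used after an error.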

Does Read::read guarantee to append data and not overwrite any existing one?

I'm working on an SMTP library that reads lines over the network using a buffered reader.
I want a nice, safe way to read data from the network, without depending on Rust internals to make sure the code works as expected. Specifically, I'm wondering if the Read trait guarantees that data read with Read::read is appended to the buffer passed as an argument rather than overwriting the buffer entirely.
At the moment, I use a Range to make sure existing data is not overwritten without depending on Rust internals.
However, given that Rust used to have a nice way to do what I want, I'm wondering if the current code can be improved, possibly removing the unsafe blocks too.
No, it does not guarantee that:
use std::io::prelude::*;
use std::str;
fn main() {
let mut source1 = "hello, world!".as_bytes();
let mut source2 = "moo".as_bytes();
let mut dest = [0; 128];
source1.read(&mut dest).unwrap();
source2.read(&mut dest).unwrap();
let s = str::from_utf8(&dest[..16]).unwrap();
println!("{:?}", s)
}
This prints
"moolo, world!\u{0}\u{0}\u{0}"
Specifically, it cannot do what you want, based purely on the type signature:
fn read(&mut self, buf: &mut [u8]) -> Result<usize>;
All that the read method has access to is your mutable slice - there's nowhere to store information like "how far in the buffer you are". Furthermore, you aren't allowed to "extend" a mutable slice with more elements - you are only allowed to mutate the values within the slice.
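If appending is what you want, the caller has to track the fill level and hand read only the unused tail of the buffer. A minimal sketch (the helper name read_appending is made up for this example):

```rust
use std::io::{self, Read};

/// Reads once from `source` into the unused tail of `dest`, starting at
/// index `filled`, and returns the new number of filled bytes.
fn read_appending<R: Read>(source: &mut R, dest: &mut [u8], filled: usize) -> io::Result<usize> {
    // `read` can only write into the slice it is given, so earlier bytes are untouched.
    let bytes_read = source.read(&mut dest[filled..])?;
    Ok(filled + bytes_read)
}
```

Applied to the two sources from the example above, starting at 0 and feeding the returned value into the second call, the buffer ends up holding "hello, world!moo" in its first 16 bytes.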
For your particular case, you may want to look at BufRead::read_until. Here's a barely-tested example:
use std::io::{BufRead,BufReader};
use std::str;
fn main() {
let source1 = "header 1\r\nheader 2\r\n".as_bytes();
let mut reader = BufReader::new(source1);
let mut buf = vec![];
buf.reserve(128); // Maybe more efficient?
loop {
match reader.read_until(b'\n', &mut buf) {
Ok(0) => break,
Ok(_) => {},
Err(_) => panic!("Handle errors"),
}
if buf.len() < 2 { continue }
if buf[buf.len() - 2] == b'\r' {
{
let s = str::from_utf8(&buf).unwrap();
println!("Got a header {:?}", s);
}
buf.clear();
}
}
}

Reading Bytes From a Reader

I'm writing something to process stdin in blocks of bytes, but can't seem to work out a simple way to do it (though I suspect there is one).
fn run() -> int {
// Doesn't compile: types differ
let mut buffer = [0, ..100];
loop {
let block = match stdio::stdin().read(buffer) {
Ok(bytes_read) => buffer.slice_to(bytes_read),
// This captures the Err from the end of the file,
// but also actual errors while reading from stdin.
Err(message) => return 0
};
process(block).unwrap();
}
}
fn process(block: &[u8]) -> Result<(), IoError> {
// do things
}
My questions:
What's the "standard" way to do this? (I've been trying/hoping to use and_then()/or_else())
How can I differentiate between the Err(IoError) from end of the file, and the Err that's actually an error?
The previously accepted answer was outdated (Rust v1.0). EOF is no longer considered an error. You can do it like this:
use std::io::{self, Read};
fn main() {
let mut buffer = [0; 100];
while let Ok(bytes_read) = io::stdin().read(&mut buffer) {
if bytes_read == 0 { break; }
process(&buffer[..bytes_read]).unwrap();
}
}
fn process(block: &[u8]) -> Result<(), io::Error> {
Ok(()) // do things
}
Note that this may not result in the expected behavior: read doesn't have to fill the buffer, but may return with any number of bytes read. In the case of stdin the read implementation returns every time a newline is detected (pressing enter in terminal).
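The while let loop above also stops silently on a real I/O error. If you want to tell the cases apart, one possible sketch is an explicit match; the function name for_each_block and its closure parameter are made up for this illustration.

```rust
use std::io::{self, ErrorKind, Read};

/// Reads `reader` in blocks of up to 100 bytes and passes each block to `process`.
/// Returns the total number of bytes seen, or the first real error.
fn for_each_block<R, F>(mut reader: R, mut process: F) -> io::Result<usize>
where
    R: Read,
    F: FnMut(&[u8]) -> io::Result<()>,
{
    let mut buffer = [0u8; 100];
    let mut total = 0;
    loop {
        match reader.read(&mut buffer) {
            Ok(0) => return Ok(total), // end of input, not an error
            Ok(n) => {
                process(&buffer[..n])?;
                total += n;
            }
            Err(e) if e.kind() == ErrorKind::Interrupted => continue, // retry
            Err(e) => return Err(e), // an actual error
        }
    }
}
```

For stdin, it could be called as for_each_block(io::stdin().lock(), |block| process(block)).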
Rust API documentation states that:
Note that end-of-file is considered an error, and can be inspected for
in the error's kind field.
The IoError struct looks like this:
pub struct IoError {
pub kind: IoErrorKind,
pub desc: &'static str,
pub detail: Option<String>,
}
The list of all kinds is at http://doc.rust-lang.org/std/io/enum.IoErrorKind.html
You can match it like this:
match stdio::stdin().read(buffer) {
Ok(_) => println!("ok"),
Err(io::IoError{kind:io::EndOfFile, ..}) => println!("end of file"),
_ => println!("error")
}
