I've been working on some code that reads data from a Read type (the input) in chunks and does some processing on each chunk. The issue is that the final chunk needs to be processed with a different function. As far as I can tell, there are a couple of ways to detect EOF from a Read, but none of them feel particularly ergonomic for this case. I'm looking for a more idiomatic solution.
My current approach is to maintain two buffers, so that the previous read's result is still available when the next read returns zero bytes, which indicates EOF here since the buffer has non-zero length:
use std::io::{Read, Result};
const BUF_SIZE: usize = 0x1000;
fn process_stream<I: Read>(mut input: I) -> Result<()> {
// Stores a chunk of input to be processed
let mut buf = [0; BUF_SIZE];
let mut prev_buf = [0; BUF_SIZE];
let mut prev_read = input.read(&mut prev_buf)?;
loop {
let bytes_read = input.read(&mut buf)?;
if bytes_read == 0 {
break;
}
// Some function which processes the contents of a chunk
process_chunk(&prev_buf[..prev_read]);
prev_read = bytes_read;
prev_buf.copy_from_slice(&buf[..]);
}
// Some function used to process the final chunk differently from all other chunks
process_final_chunk(&prev_buf[..prev_read]);
Ok(())
}
This strikes me as a very ugly way to do this; I shouldn't need two buffers here.
An alternative I can think of would be to impose Seek on input and use input.read_exact(). I could then check for an UnexpectedEof ErrorKind to determine that we've hit the end of input, and seek backwards to read the final chunk again (the seek and re-read are necessary here because the contents of the buffer are undefined in the case of an UnexpectedEof error). But this doesn't seem idiomatic at all: encountering an error, seeking back, and reading again just to detect that we're at the end of a file is very strange.
My ideal solution would be something like this, using an imaginary input.feof() function that returns true if the last input.read() call reached EOF, like the feof function in C's standard library:
fn process_stream<I: Read>(mut input: I) -> Result<()> {
// Stores a chunk of input to be processed
let mut buf = [0; BUF_SIZE];
let mut bytes_read = 0;
loop {
bytes_read = input.read(&mut buf)?;
if input.feof() {
break;
}
process_chunk(&buf[..bytes_read]);
}
process_final_chunk(&buf[..bytes_read]);
Ok(())
}
Can anyone suggest a way to implement this that is more idiomatic? Thanks!
When read of std::io::Read returns Ok(n), not only does that mean that the buffer buf has been filled in with n bytes of data from this source, but it also indicates that the bytes from index n onward are left untouched. With this in mind, you actually don't need a prev_buf at all, because when n is 0, all bytes of the buffer are left untouched (leaving them as the bytes of the previous read).
prog-fh's solution below is the one to go with for this kind of processing, because it only hands off full chunks to process_chunk. With read potentially returning any value between 0 and BUF_SIZE, this is needed. For more info, see this part of the documentation for read:
It is not an error if the returned value n is smaller than the buffer size, even when the reader is not at the end of the stream yet. This may happen for example because fewer bytes are actually available right now (e. g. being close to end-of-file) or because read() was interrupted by a signal.
However, I advise that you think about what should happen when you get an Ok(0) from read that does not represent end of file forever. See this part:
If n is 0, then it can indicate one of two scenarios:
This reader has reached its “end of file” and will likely no longer be able to produce bytes. Note that this does not mean that the reader will always no longer be able to produce bytes.
So if you were to get a sequence of reads that returned Ok(BUF_SIZE), Ok(BUF_SIZE), Ok(0), Ok(BUF_SIZE) (which is entirely possible, it just represents a hitch in the IO), would you want the last Ok(BUF_SIZE) not to be considered a read chunk? If you treat Ok(0) as EOF forever, that may be a mistake here.
The only way to reliably determine what should be considered the last chunk is to have the expected length (in bytes, not number of chunks) sent beforehand as part of the protocol. Given a variable expected_len, you could then determine the start index of the last chunk as expected_len - expected_len % BUF_SIZE, with the end index just being expected_len itself, as sketched below.
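A minimal sketch of that arithmetic (expected_len, and how you handle an exact multiple of BUF_SIZE, are assumptions about your protocol):

const BUF_SIZE: usize = 0x1000;

// expected_len is assumed to be supplied by the protocol before the data.
fn last_chunk_range(expected_len: usize) -> (usize, usize) {
    // Start of the final, possibly partial, chunk. Note that if expected_len
    // is an exact multiple of BUF_SIZE this yields an empty final chunk,
    // which you may want to special-case.
    let start = expected_len - expected_len % BUF_SIZE;
    (start, expected_len)
}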
Since you consider read_exact() a possible solution, we can assume that a non-final chunk contains exactly BUF_SIZE bytes.
Then why not just read as much as we can to fill such a buffer and process it with one function, and, when filling it completely is no longer possible (because EOF is reached), process the incomplete last chunk with the other function?
Note that feof() in C does not guess that EOF will be reached on the next read attempt; it just reports the EOF flag that could have been set during the previous read attempt.
Thus, for EOF to be set and feof() to report it, a read attempt returning 0 must have been encountered first (as in the example below).
use std::fs::File;
use std::io::{Read, Result};
const BUF_SIZE: usize = 0x1000;
fn process_chunk(bytes: &[u8]) {
println!("process_chunk {}", bytes.len());
}
fn process_final_chunk(bytes: &[u8]) {
println!("process_final_chunk {}", bytes.len());
}
fn process_stream<I: Read>(mut input: I) -> Result<()> {
// Stores a chunk of input to be processed
let mut buf = [0; BUF_SIZE];
    loop {
        let mut bytes_read = 0;
        // Fill the buffer completely, tolerating short reads:
        // keep going until BUF_SIZE bytes have arrived or read() returns 0 (EOF)
        while bytes_read < BUF_SIZE {
            let r = input.read(&mut buf[bytes_read..])?;
            if r == 0 {
                break;
            }
            bytes_read += r;
        }
if bytes_read == BUF_SIZE {
process_chunk(&buf);
} else {
process_final_chunk(&buf[..bytes_read]);
break;
}
}
Ok(())
}
fn main() {
let file = File::open("data.bin").unwrap();
process_stream(file).unwrap();
}
/*
$ dd if=/dev/random of=data.bin bs=1024 count=10
$ cargo run
process_chunk 4096
process_chunk 4096
process_final_chunk 2048
$ dd if=/dev/random of=data.bin bs=1024 count=8
$ cargo run
process_chunk 4096
process_chunk 4096
process_final_chunk 0
*/
I have a Read instance (in this case, a file). I want to read at most some number of bytes N, meaning that if the file has length greater than N I read exactly N bytes, but if the file length is less than N, I read to the end of the file.
I can't use read_exact, because that might return UnexpectedEof, which means I don't know what size to truncate the buffer to. I also don't want to just use a single read call, since that is OS-dependent and may read less than N.
I tried writing this, using Read::take:
const N: usize = 4096;
// Pretend this is a 20 byte file
let bytes = vec![3; 20];
let read = std::io::Cursor::new(&bytes);
let mut buf = vec![0; N];
let n = read.take(N as u64).read_to_end(&mut buf).unwrap();
buf.truncate(n);
assert_eq!(buf, bytes);
I would expect buf to be equal to bytes after the read_to_end call, but the assertion fails because buf ends up being only zeroes. The buffer does end up being the correct length, however.
read_to_end() appends to the vector, so it expects an empty one; you are providing it with one that is already full of zeros. To fix your issue, rewrite your code using Vec::with_capacity, which preallocates but does not fill the vector.
const N: usize = 4096;
let bytes = vec![3; 20];
let read = std::io::Cursor::new(&bytes);
// Use vec::with_capacity() to allocate without filling the vec
let mut buf = Vec::with_capacity(N);
let n = read.take(N as u64).read_to_end(&mut buf).unwrap();
buf.truncate(n);
assert_eq!(buf, bytes);
You should use std::io::Read::read:
use std::io::Read;
fn main() {
const N: usize = 4096;
// Pretend this is a 20 byte file
let bytes = vec![3; 20];
let mut read = std::io::Cursor::new(&bytes);
let mut buf = vec![0; N];
let n = read.read(&mut buf).unwrap();
    buf.truncate(n);
assert_eq!(buf, bytes);
}
Playground
Note however that a single read may return less than the full file size. In practice this shouldn't be an issue for small files, since file I/O is done in blocks (usually 4 KiB) and interruptions tend to happen on block boundaries, so a file smaller than one block is typically read in full.
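If you would rather not rely on that, a short read loop removes the assumption entirely (a sketch; read_up_to is a hypothetical name, not a std API):

use std::io::Read;

// Keep reading until the buffer is full or the reader reports Ok(0),
// so a short read cannot silently truncate the result.
fn read_up_to<R: Read>(mut read: R, buf: &mut [u8]) -> std::io::Result<usize> {
    let mut filled = 0;
    while filled < buf.len() {
        let n = read.read(&mut buf[filled..])?;
        if n == 0 {
            break; // end of file before the buffer was full
        }
        filled += n;
    }
    Ok(filled)
}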
I have this C code that writes a command to a file descriptor. My issue is that I can't reproduce the same behaviour in Rust, because apparently .write() doesn't take the same parameters as C's write(). The function is the following:
static void set_address(int32_t fd, uint64_t start_address, uint64_t len){
uint64_t command[3];
int64_t bytes;
command[0] = SET_ADDRESS_AREA;
command[1] = start_address;
command[2] = len;
bytes = write(fd, command, (ssize_t)LEN_SET_ADDRESS_AREA);
if (bytes != LEN_SET_ADDRESS_AREA)
{
printf("\nError\n");
exit(-1);
}
}
So my code is:
for (i,val) in ref_memory.iter().enumerate().step_by(DIM_SLICE){
let mut start_address = val ;
let p = std::ptr::addr_of!(start_address);
println!("the address index of val is {:?}",p);
let mut command = (SET_ADDRESS_AREA,start_address,DIM_SLICE);
let file_buffer = File::create(_path);
let bytes_written = file_buffer.unwrap().write(command);
}
}
Writing this
let bytes_written = file_buffer.unwrap().write(command);
I get the error:
Mismatched types: expected reference &[u8] and found tuple (u8, &u8, u8)
Should I create a struct to pass just one reference of type &u8?
Alternatively, is there a crate that offers this feature?
It's not clear why you've diverged so much from the C code when converting it to Rust. Why the for loop? Why the addr_of? Why create the file in the function when the original clearly already has the file descriptor? Why create a tuple instead of an array?
The direct conversion is mostly straightforward.
fn set_address(file: &mut File, start_address: u64, len: u64) {
let command: [u64; 3] = [
SET_ADDRESS_AREA,
start_address,
len
];
let bytes = file.write(bytemuck::cast_slice(&command)).unwrap();
if bytes != LEN_SET_ADDRESS_AREA {
println!("Error");
std::process::exit(-1);
}
}
The only tricky part here is my use of the bytemuck crate to convert a [u64] into a [u8]. You can do without it, but it is a bit more annoying.
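For reference, a sketch of one way to do it without the crate, using u64::to_ne_bytes to keep the C code's native byte order (SET_ADDRESS_AREA and LEN_SET_ADDRESS_AREA are stand-ins for the question's C macros):

use std::fs::File;
use std::io::Write;

// Assumed constants standing in for the C macros; adjust to the real protocol.
const SET_ADDRESS_AREA: u64 = 1;
const LEN_SET_ADDRESS_AREA: usize = 24; // 3 * size_of::<u64>()

fn set_address(file: &mut File, start_address: u64, len: u64) -> std::io::Result<()> {
    let command = [SET_ADDRESS_AREA, start_address, len];
    // Flatten the three u64s into bytes, native-endian like the C write().
    let mut bytes = [0u8; LEN_SET_ADDRESS_AREA];
    for (dst, src) in bytes.chunks_exact_mut(8).zip(command) {
        dst.copy_from_slice(&src.to_ne_bytes());
    }
    file.write_all(&bytes)
}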
Here is a full example on the playground that includes the above and two other methods.
Should I create a struct to pass just one reference of type &u8?
You don't need to create anything. write(2) takes 3 parameters because it needs an fd, a pointer to a buffer to write, and an amount of data to write.
In Rust, the fd is the object on which the method is called (file_buffer), and a slice (&[u8]) has a length so it provides both the "data to write" buffer and the amount of data to write.
What you should do is either just write ref_memory directly (it's not clear why you even split it across multiple writes, especially since you write them all to the same file anyway), or use chunks to split the input buffer into sub-slices, which you can then write directly, as in the sketch below.
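A sketch of the chunks-based variant, with ref_memory and the slice size assumed from the question:

use std::fs::File;
use std::io::Write;

// Write the buffer in fixed-size sub-slices.
fn write_in_slices(file: &mut File, ref_memory: &[u8], dim_slice: usize) -> std::io::Result<()> {
    for chunk in ref_memory.chunks(dim_slice) {
        // write_all retries on partial writes, unlike a bare write()
        file.write_all(chunk)?;
    }
    Ok(())
}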
let p = std::ptr::addr_of!(start_address);
That makes absolutely no sense. That's a raw pointer to the start_address local variable, which is a copy of val.
Your C code is also... not right. Partial writes are perfectly valid and legal; there are lots of reasons why they might happen (e.g. hardware drivers, network buffer sizes, ...). A write(2) error is signalled by a return value of -1, following which you'd generally read errno or use perror(3) or strerror(3) to report why the call failed.
I want to write a Rust program that takes everything in stdin and copies it to stdout. So far I have this
use std::io::{self, Read};

fn main() {
let mut stdin: io::Stdin = io::stdin();
let mut stdout: io::Stdout = io::stdout();
let mut buffer: [u8; 1_000_000] = [0; 1_000_000];
let mut n_bytes_read: usize = 0;
let mut uninitialized: bool = true;
while uninitialized || n_bytes_read > 0
{
n_bytes_read = stdin.read(&mut buffer).expect("Could not read from STDIN.");
uninitialized = false;
}
}
I'm copying everything into a buffer of size one million so as not to blow up the memory if someone feeds my program a 3 gigabyte file. So now I want to copy this to stdout, but the only primitive write operation I can find is stdout.write(&mut buffer) - but this writes the whole buffer! I would need a way to write a specific number of bytes, like stdout.write_only(&mut buffer, n_bytes_read).
I'd like to do this in the most basic way possible, with a minimum of standard library imports.
If all you wanted to do was copy from stdin to stdout without using much memory, just use std::io::copy. It streams the data from a reader to a writer.
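A minimal sketch of that:

use std::io;

fn main() -> io::Result<()> {
    // Streams stdin to stdout in chunks; no manual buffer management needed.
    io::copy(&mut io::stdin(), &mut io::stdout())?;
    Ok(())
}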
If your goal is to write part of a buffer, then take a slice of that buffer and pass that to write:
stdout.write(&buffer[0..n_bytes_read]);
A slice does not copy the data so you will not use any more memory.
Note however that write may not write everything you have asked - it returns the number of bytes actually written. If you use write_all it will write the whole slice.
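Putting both points together, a sketch of the full copy loop using read and write_all:

use std::io::{self, Read, Write};

fn main() -> io::Result<()> {
    let mut stdin = io::stdin();
    let mut stdout = io::stdout();
    let mut buffer = [0u8; 1_000_000];
    loop {
        let n = stdin.read(&mut buffer)?;
        if n == 0 {
            break; // EOF on stdin
        }
        // Slice down to the bytes actually read; write_all handles short writes.
        stdout.write_all(&buffer[..n])?;
    }
    Ok(())
}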
My plan is to write a simple method which does exactly what std::cin >> from the C++ standard library does:
use std::io::BufRead;
pub fn input<T: std::str::FromStr>(handle: &std::io::Stdin) -> Result<T, T::Err> {
let mut x = String::new();
let mut guard = handle.lock();
loop {
let mut trimmed = false;
let available = guard.fill_buf().unwrap();
let l = match available.iter().position(|&b| !(b as char).is_whitespace()) {
Some(i) => {
trimmed = true;
i
}
None => available.len(),
};
guard.consume(l);
if trimmed {
break;
}
}
let available = guard.fill_buf().unwrap();
let l = match available.iter().position(|&b| (b as char).is_whitespace()) {
Some(i) => i,
None => available.len(),
};
x.push_str(std::str::from_utf8(&available[..l]).unwrap());
guard.consume(l);
T::from_str(&x)
}
The loop is meant to trim away all the whitespace before valid input begins. The match block outside the loop is where the length of the valid input (that is, before trailing whitespace begins or EOF is reached) is calculated.
Here is an example using the above method.
let handle = std::io::stdin();
let x: i32 = input(&handle).unwrap();
println!("x: {}", x);
let y: String = input(&handle).unwrap();
println!("y: {}", y);
In the few simple tests I tried, the method works as intended. However, when I use this in online programming judges like the one on Codeforces, I get complaints that the program sometimes stays idle or that the wrong input has been taken, among other issues, which leads me to suspect that I missed a corner case or something like that. This usually happens when the input is a few hundred lines long.
What input is going to break the method? What is the correction?
After a lot of experimentation, I noticed a lag when reading each input, which added up as the number of inputs increased. The function doesn't make use of a buffer; it accesses the stream every time it needs to fill a variable, which is slow, hence the lag.
Lesson learnt: always use a buffer with a good capacity.
However, the idleness issue still persisted, until I replaced the fill_buf/consume pairs with something like read_line or read_to_string.
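A sketch of that approach (one common replacement, not necessarily the poster's exact fix): read all of stdin once with read_to_string, then hand out whitespace-separated tokens.

use std::io::Read;

fn main() {
    // Read all of stdin up front, then parse tokens from the in-memory string,
    // mirroring the example usage from the question.
    let mut input = String::new();
    std::io::stdin().read_to_string(&mut input).unwrap();
    let mut tokens = input.split_whitespace();
    let x: i32 = tokens.next().unwrap().parse().unwrap();
    println!("x: {}", x);
    let y: &str = tokens.next().unwrap();
    println!("y: {}", y);
}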
I have something that is Read; currently it's a File. I want to read a number of bytes from it that is only known at runtime (length prefix in a binary data structure).
So I tried this:
let mut vec = Vec::with_capacity(length);
let count = file.read(vec.as_mut_slice()).unwrap();
but count is zero because vec.as_mut_slice().len() is zero as well.
[0u8;length] of course doesn't work because the size must be known at compile time.
I wanted to do
let mut vec = Vec::with_capacity(length);
let count = file.take(length).read_to_end(vec).unwrap();
but take's receiver parameter is a T and I only have &mut T (and I'm not really sure why it's needed anyway).
I guess I can replace File with BufReader and dance around with fill_buf and consume which sounds complicated enough but I still wonder: Have I overlooked something?
Like the Iterator adaptors, the IO adaptors take self by value to be as efficient as possible. Also like the Iterator adaptors, a mutable reference to a Read is also a Read.
To solve your problem, you just need Read::by_ref:
use std::io::Read;
use std::fs::File;
fn main() {
let mut file = File::open("/etc/hosts").unwrap();
let length = 5;
let mut vec = Vec::with_capacity(length);
file.by_ref().take(length as u64).read_to_end(&mut vec).unwrap();
let mut the_rest = Vec::new();
file.read_to_end(&mut the_rest).unwrap();
}
1. Fill-this-vector version
Your first solution is close to working. You identified the problem but did not try to solve it! The problem is that whatever the capacity of the vector, it is still empty (vec.len() == 0). Instead, you could actually fill it with zeroed elements, such as:
let mut vec = vec![0u8; length];
The following full code works:
use std::fs::File;
use std::io::Read;
fn main() {
let mut file = File::open("/usr/share/dict/words").unwrap();
let length: usize = 100;
let mut vec = vec![0u8; length];
let count = file.read(vec.as_mut_slice()).unwrap();
println!("read {} bytes.", count);
println!("vec = {:?}", vec);
}
Of course, you still have to check whether count == length, and read more data into the buffer if that's not the case, as in the sketch below.
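A sketch of that follow-up read loop, continuing the example above:

use std::fs::File;
use std::io::Read;

fn main() {
    let mut file = File::open("/usr/share/dict/words").unwrap();
    let length: usize = 100;
    let mut vec = vec![0u8; length];
    // Keep reading until `length` bytes have arrived or EOF is hit,
    // instead of trusting a single read() call.
    let mut count = 0;
    while count < length {
        let n = file.read(&mut vec[count..]).unwrap();
        if n == 0 {
            break; // end of file before `length` bytes
        }
        count += n;
    }
    vec.truncate(count);
    println!("read {} bytes.", count);
}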
2. Iterator version
Your second solution is better because you won't have to check how many bytes have been read, and you won't have to re-read in case count != length. You need to use the bytes() function of the Read trait (implemented by File). This transforms the file into a stream (i.e. an iterator). Because errors can still happen, you don't get an Iterator<Item = u8> but an Iterator<Item = Result<u8, std::io::Error>>. Hence you need to deal with failures explicitly within the iterator. We're going to use unwrap() here for simplicity:
use std::fs::File;
use std::io::Read;
fn main() {
let file = File::open("/usr/share/dict/words").unwrap();
let length: usize = 100;
let vec: Vec<u8> = file
.bytes()
.take(length)
.map(|r: Result<u8, _>| r.unwrap()) // or deal explicitly with failure!
.collect();
println!("vec = {:?}", vec);
}
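One caveat: bytes() issues one underlying read() call per byte, so on a raw File it is usually worth wrapping the file in a BufReader first, for example:

use std::fs::File;
use std::io::{BufReader, Read};

fn main() {
    // Buffer the file so bytes() doesn't cost a syscall per byte
    let file = BufReader::new(File::open("/usr/share/dict/words").unwrap());
    let length: usize = 100;
    let vec: Vec<u8> = file
        .bytes()
        .take(length)
        .map(|r| r.unwrap())
        .collect();
    println!("vec = {:?}", vec);
}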
You can always use a bit of unsafe to create a vector of uninitialized memory. With primitive types this is commonly done like so, though see the caveat below:
let mut v: Vec<u8> = Vec::with_capacity(length);
unsafe { v.set_len(length); }
let count = file.read(v.as_mut_slice()).unwrap();
This way, v.len() is set to its capacity while all bytes in it remain uninitialized (likely zeros, but possibly some garbage), so you avoid the cost of zeroing the memory. Be aware, however, that the standard library's documentation now warns that passing an uninitialized buffer to read can lead to undefined behavior, so this trick is only defensible when you fully trust the Read implementation.
Note that the read() method on Read is not guaranteed to fill the whole slice. It is possible for it to return with a number of bytes less than the slice length. There are several RFCs on adding methods to fill this gap, for example, this one.