How does one stream data from a reader to a writer in Rust?
My end goal is actually to write out some gzipped data in a streaming fashion. It seems like what I am missing is a function to iterate over data from a reader and write it out to a file.
This task would be easy to accomplish with read_to_string, etc. But my requirement is to stream the data to keep memory usage down. I have not been able to find a simple way to do this that doesn't make lots of buffer allocations.
use std::io;
use std::io::prelude::*;
use std::io::{BufReader};
use std::fs::File;
use flate2::read::{GzEncoder};
use flate2::{Compression};
pub fn gzipped<R: Read>(file: String, stream: R) -> io::Result<()> {
let file = File::create(file)?;
let gz = BufReader::new(GzEncoder::new(stream, Compression::Default));
read_write(gz, file)
}
pub fn read_write<R: BufRead, W: Write>(mut r: R, mut w: W) -> io::Result<()> {
// ?
}
Your read_write function sounds exactly like io::copy. So this would be
pub fn gzipped<R: Read>(file: String, stream: R) -> io::Result<u64> {
let mut file = File::create(file)?;
let mut gz = BufReader::new(GzEncoder::new(stream, Compression::Default));
io::copy(&mut gz, &mut file)
}
The only difference is that io::copy takes mutable references, and returns Result<u64>.
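For intuition, the loop that io::copy saves you from writing can be sketched by hand. This is only an illustration: the 8 KiB buffer size is an arbitrary assumption, and the sketch accepts any Read (not just BufRead) and returns the byte count, as io::copy does.

```rust
use std::io::{self, ErrorKind, Read, Write};

/// Hand-rolled sketch of a streaming copy: one fixed buffer is reused
/// for every read, so memory use stays constant.
pub fn read_write<R: Read, W: Write>(mut r: R, mut w: W) -> io::Result<u64> {
    let mut buf = [0u8; 8 * 1024]; // size chosen arbitrarily for this sketch
    let mut total = 0u64;
    loop {
        let n = match r.read(&mut buf) {
            Ok(0) => return Ok(total), // end of input
            Ok(n) => n,
            Err(e) if e.kind() == ErrorKind::Interrupted => continue,
            Err(e) => return Err(e),
        };
        w.write_all(&buf[..n])?;
        total += n as u64;
    }
}

fn main() -> io::Result<()> {
    let input: &[u8] = b"some bytes to stream";
    let mut output = Vec::new();
    let copied = read_write(input, &mut output)?;
    println!("copied {} bytes", copied);
    Ok(())
}
```

In real code, io::copy is usually the better choice, since it is maintained alongside the standard library.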
In Python it is possible to write
from io import StringIO
with StringIO("some text...") as stream:
    for line in stream:
        # work with the data
        process_line(line)
Is there a way I can do the same thing, treat some string as a file object, and apply Read trait to it?
Yes, you can use std::io::Cursor:
use std::io::{Read, Cursor};
fn use_read_trait(s: String, buff: &mut [u8]) -> usize {
let mut c = Cursor::new(s);
c.read(buff).unwrap()
}
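To mirror the line-by-line loop from the Python snippet, a Cursor can also be used through the BufRead trait. The following is a minimal sketch; collect_lines is a hypothetical helper name introduced only for illustration.

```rust
use std::io::{BufRead, Cursor};

/// Iterate over the lines of an in-memory string through the `BufRead` API.
fn collect_lines(text: &str) -> Vec<String> {
    let stream = Cursor::new(text);
    stream
        .lines()
        .map(|line| line.expect("reading from memory should not fail"))
        .collect()
}

fn main() {
    for line in collect_lines("first line\nsecond line\n") {
        // work with the data
        println!("{}", line);
    }
}
```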
Background (Skippable)
On Linux, the file /var/run/utmp contains several utmp structures, each in raw binary format, stored one after another. A utmp struct is relatively large (384 bytes on my machine). I am trying to read this file as raw data and then implement checks after the fact that the data makes sense. I'm not new to Rust, but this is my first real experience with the unsafe side of things.
Problem Statement
I have a file that contains several C struct utmp records. In Rust, I would like to read the entire file into a Vec<libc::utmpx>. More specifically, given a reader open to this file, how could I read one struct utmp?
What I have so far
Below are three different implementations of read_raw, which accepts a reader and returns a RawEntry (my alias for struct utmp). Which method is most correct? I am trying to write code that is as performant as possible, and I am worried that read_raw0 might be slower than the others if it involves memcpy calls. What is the best/fastest way to accomplish this behavior?
use std::io::Read;
use libc::utmpx as RawEntry;
const RawEntrySize: usize = std::mem::size_of::<RawEntry>();
type RawEntryBuffer = [u8; RawEntrySize];
/// Read a raw utmpx struct
// After testing, this method doesn't work
pub fn read_raw0<R: Read>(reader: &mut R) -> RawEntry {
let mut entry: RawEntry = unsafe { std::mem::zeroed() };
unsafe {
let mut entry_buf = std::mem::transmute::<RawEntry, RawEntryBuffer>(entry);
reader.read_exact(&mut entry_buf[..]);
}
return entry;
}
/// Read a raw utmpx struct
pub fn read_raw1<R: Read>(reader: &mut R) -> RawEntry {
// Worried this could cause alignment issues, or maybe it's okay
// because transmute copies
let mut buffer: RawEntryBuffer = [0; RawEntrySize];
reader.read_exact(&mut buffer[..]);
let entry = unsafe {
std::mem::transmute::<RawEntryBuffer, RawEntry>(buffer)
};
return entry;
}
/// Read a raw utmpx struct
pub fn read_raw2<R: Read>(reader: &mut R) -> RawEntry {
let mut entry: RawEntry = unsafe { std::mem::zeroed() };
unsafe {
let entry_ptr = std::mem::transmute::<&mut RawEntry, *mut u8>(&mut entry);
let entry_slice = std::slice::from_raw_parts_mut(entry_ptr, RawEntrySize);
reader.read_exact(entry_slice);
}
return entry;
}
Note: After more testing, it appears read_raw0 doesn't work. I believe this is because transmute creates a new buffer instead of referencing the struct.
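The behavior described in the note can be reproduced in isolation. The sketch below uses a small hypothetical struct, Pair, in place of utmpx; it shows that transmute takes its argument by value, so writes to the resulting byte array never reach the original.

```rust
use std::mem::transmute;

/// Hypothetical stand-in for a plain C struct such as `utmpx`.
#[repr(C)]
#[derive(Clone, Copy, Debug, PartialEq)]
struct Pair {
    a: u16,
    b: u16,
}

/// Returns the untouched original together with the modified byte copy.
fn transmute_copies() -> (Pair, [u8; 4]) {
    let original = Pair { a: 0, b: 0 };
    // `transmute` consumes a copy of `original`, so `bytes` is a separate array.
    let mut bytes: [u8; 4] = unsafe { transmute::<Pair, [u8; 4]>(original) };
    bytes[0] = 0xFF;
    (original, bytes)
}

fn main() {
    let (original, bytes) = transmute_copies();
    println!("{:?} {:?}", original, bytes);
}
```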
This is what I came up with, which I imagine should be about as fast as it gets for reading a single entry. It follows the spirit of your last function (read_raw2), but avoids the transmute: converting &mut T to *mut u8 can be done with two casts, t as *mut T as *mut u8. It also uses MaybeUninit instead of zeroed to be a bit more explicit; the assembly is likely the same once optimized. Lastly, the function is unsafe either way, so we may as well mark it as such and do away with the unsafe blocks.
use std::io::{self, Read};
use std::slice::from_raw_parts_mut;
use std::mem::{MaybeUninit, size_of};
pub unsafe fn read_raw_struct<R: Read, T: Sized>(src: &mut R) -> io::Result<T> {
let mut buffer = MaybeUninit::uninit();
let buffer_slice = from_raw_parts_mut(buffer.as_mut_ptr() as *mut u8, size_of::<T>());
src.read_exact(buffer_slice)?;
Ok(buffer.assume_init())
}
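As a usage sketch, the same idea can be exercised on an in-memory buffer. Record is a hypothetical two-field struct standing in for libc::utmpx, and read_records is an illustrative helper, not part of any library. This variant starts from MaybeUninit::zeroed() rather than uninit(), a conservative choice made here so that the byte slice never covers uninitialized memory. The caller must still guarantee that every bit pattern is a valid T.

```rust
use std::io::{self, Cursor, Read};
use std::mem::{size_of, MaybeUninit};
use std::slice::from_raw_parts_mut;

/// Hypothetical fixed-layout record standing in for `libc::utmpx`.
#[repr(C)]
#[derive(Clone, Copy, Debug, PartialEq)]
struct Record {
    id: u32,
    value: u32,
}

/// Safety: `T` must be valid for every possible bit pattern.
unsafe fn read_raw_struct<R: Read, T>(src: &mut R) -> io::Result<T> {
    let mut buffer = MaybeUninit::<T>::zeroed();
    let bytes =
        unsafe { from_raw_parts_mut(buffer.as_mut_ptr() as *mut u8, size_of::<T>()) };
    src.read_exact(bytes)?;
    Ok(unsafe { buffer.assume_init() })
}

/// Read as many whole records as the byte buffer contains.
fn read_records(bytes: &[u8]) -> io::Result<Vec<Record>> {
    let count = bytes.len() / size_of::<Record>();
    let mut reader = Cursor::new(bytes);
    let mut out = Vec::with_capacity(count);
    for _ in 0..count {
        // Sound here because `Record` contains only integers and no padding.
        out.push(unsafe { read_raw_struct::<_, Record>(&mut reader)? });
    }
    Ok(out)
}

fn main() -> io::Result<()> {
    let mut raw = Vec::new();
    raw.extend_from_slice(&1u32.to_ne_bytes());
    raw.extend_from_slice(&10u32.to_ne_bytes());
    println!("{:?}", read_records(&raw)?);
    Ok(())
}
```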
To read the bytes of a PNG file, I want to create a function called read_8_bytes which will read the next 8 bytes in the file each time it's called.
use std::fs::File;
use std::io::{BufReader, Read};
fn main() {
let png = File::open("test.png").expect("1");
let mut png_reader = BufReader::new(png);
let mut byte_buffer: Vec<u8> = vec![0; 8];
png_reader.read_exact(&mut byte_buffer).expect("2");
}
This works fine and if I keep calling read_exact from main I can read the next 8 bytes. I tried to create a function to do this and the solution just seems needlessly complicated. I'm wondering if there is a better way.
I thought I have to pass the BufReader to the function, but due to how Rust works this makes things complicated and I end up working out I need to do something like:
fn read_eight_bytes<R: BufRead>(fd: &mut R)
This compiles, but I'm not happy with it: I don't understand why it is needed, and it seems complex. Is there a simple way to write a function that takes a file-descriptor-like value and keeps track of the position, as in C, without having to do this?
Looking at your question, I think you are trying to say that you are confused as to why the <R: BufRead> is necessary or furthermore why this even works.
In your example, this generic is not strictly necessary. One could implement the function you describe like so:
use std::{fs, io};
fn main() -> io::Result<()> {
let mut file = fs::File::open("./path/to/file")?;
let bytes = read_eight_bytes(&mut file)?;
println!("{:?}", bytes);
Ok(())
}
fn read_eight_bytes(file: &mut fs::File) -> io::Result<[u8; 8]> {
use io::Read;
let mut bytes = [0; 8];
file.read_exact(&mut bytes)?;
Ok(bytes)
}
This is perfectly valid and hopefully should make sense.
But then, why does fn read_eight_bytes<R: BufRead>(file: &mut R) -> [u8; 8] work? First of all, I assume you understand the following concepts:
Generics
Traits
Given an understanding of the above concepts, you should know that this syntax means that the function read_eight_bytes is a generic function with a generic type named R. You should then also understand that the generic has a trait bound, requiring the type R to implement BufRead. And that this function takes a parameter which is a mutable reference to the variable file, which is of the type R.
Now taking a look at the definition of BufRead: we see that it contains several functions. But surprisingly there is no read_exact function! Why does a function like this compile?
use std::{fs, io};
use io::BufRead;
fn main() -> io::Result<()> {
let file = fs::File::open("./path/to/file")?;
let mut reader = io::BufReader::new(file);
let bytes = read_eight_bytes(&mut reader)?;
println!("{:?}", bytes);
Ok(())
}
fn read_eight_bytes<R: BufRead>(reader: &mut R) -> io::Result<[u8; 8]> {
let mut bytes = [0; 8];
reader.read_exact(&mut bytes)?;
Ok(bytes)
}
Note: I have altered the return type to io::Result<...>. This is considered better practice than unwrapping every Result.
I have also changed the function call to use a BufReader because BufReader implements BufRead whilst File does not. I will cover the difference a little further below.
The reason this works is that Read is a supertrait of BufRead. This means that any type that implements BufRead must also implement Read, and thus it must have the read_exact function!
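The supertrait relationship can be seen in miniature with two made-up traits. Source, PeekSource and Bytes below are hypothetical names that only mirror the shape of Read and BufRead; they are not part of the standard library.

```rust
/// Hypothetical trait playing the role of `Read`.
trait Source {
    fn next_byte(&mut self) -> Option<u8>;
}

/// Hypothetical trait playing the role of `BufRead`: it names `Source`
/// as a supertrait, so every `PeekSource` is also a `Source`.
trait PeekSource: Source {
    fn peek_byte(&self) -> Option<u8>;
}

struct Bytes {
    data: Vec<u8>,
    pos: usize,
}

impl Source for Bytes {
    fn next_byte(&mut self) -> Option<u8> {
        let byte = self.data.get(self.pos).copied();
        if byte.is_some() {
            self.pos += 1;
        }
        byte
    }
}

impl PeekSource for Bytes {
    fn peek_byte(&self) -> Option<u8> {
        self.data.get(self.pos).copied()
    }
}

/// Bounded only by `PeekSource`, yet free to call the `Source` method.
fn take_two<S: PeekSource>(src: &mut S) -> (Option<u8>, Option<u8>) {
    (src.next_byte(), src.next_byte())
}

fn main() {
    let mut bytes = Bytes { data: vec![1, 2, 3], pos: 0 };
    println!("{:?}", take_two(&mut bytes));
}
```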
Given our function never requires the functions on BufRead we could change the trait bound to only require Read:
use std::{fs, io};
use io::Read;
fn main() -> io::Result<()> {
let file = fs::File::open("./path/to/file")?;
let mut reader = io::BufReader::new(file);
let bytes = read_eight_bytes(&mut reader)?;
println!("{:?}", bytes);
Ok(())
}
fn read_eight_bytes<R: Read>(reader: &mut R) -> io::Result<[u8; 8]> {
let mut bytes = [0; 8];
reader.read_exact(&mut bytes)?;
Ok(bytes)
}
Now here is something interesting about this change. The read_eight_bytes function can now be called in (at least) two different ways:
use std::{fs, io};
use io::Read;
fn main() -> io::Result<()> {
let mut file = fs::File::open("./path/to/file")?;
let bytes = read_eight_bytes(&mut file)?;
println!("{:?}", bytes);
let file = fs::File::open("./path/to/file")?;
let mut reader = io::BufReader::new(file);
let bytes = read_eight_bytes(&mut reader)?;
println!("{:?}", bytes);
Ok(())
}
fn read_eight_bytes<R: Read>(reader: &mut R) -> io::Result<[u8; 8]> {
let mut bytes = [0; 8];
reader.read_exact(&mut bytes)?;
Ok(bytes)
}
Why is this? Because both File and BufReader implement the Read trait, and thus both can be used with the read_eight_bytes function!
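The same generic function also accepts in-memory readers, which makes it easy to see that the reader itself remembers the position between calls, much like a C file handle. This is a small sketch using made-up data in place of a real file.

```rust
use std::io::{self, Cursor, Read};

fn read_eight_bytes<R: Read>(reader: &mut R) -> io::Result<[u8; 8]> {
    let mut bytes = [0; 8];
    reader.read_exact(&mut bytes)?;
    Ok(bytes)
}

fn main() -> io::Result<()> {
    // In-memory stand-in for a file's contents (hypothetical data).
    let data: Vec<u8> = (0u8..16).collect();

    // A `Cursor` tracks its own position, so each call continues
    // where the previous one stopped.
    let mut cursor = Cursor::new(data.clone());
    println!("{:?}", read_eight_bytes(&mut cursor)?);
    println!("{:?}", read_eight_bytes(&mut cursor)?);

    // A byte slice is a reader too; reading shrinks it from the front.
    let mut slice: &[u8] = &data;
    println!("{:?}", read_eight_bytes(&mut slice)?);
    println!("{} bytes left", slice.len());
    Ok(())
}
```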
So then why would someone want to use either File or BufReader over the other?
Well the BufReader documentation explains this:
The BufReader struct adds buffering to any reader.
It can be excessively inefficient to work directly with a Read
instance. For example, every call to read on TcpStream results in a
system call. A BufReader performs large, infrequent reads on the
underlying Read and maintains an in-memory buffer of the results.
BufReader can improve the speed of programs that make small and
repeated read calls to the same file or network socket. It does not
help when reading very large amounts at once, or reading just one or a
few times. It also provides no advantage when reading from a source
that is already in memory, like a Vec.
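The effect described in the quoted documentation can be observed with a reader that counts how often it is called. CountingReader is a hypothetical wrapper written for this sketch; with a real file or socket, each counted call would correspond to a system call.

```rust
use std::io::{self, BufReader, Read};

/// Hypothetical wrapper that counts how often the underlying reader is hit.
struct CountingReader<R> {
    inner: R,
    calls: usize,
}

impl<R: Read> Read for CountingReader<R> {
    fn read(&mut self, buf: &mut [u8]) -> io::Result<usize> {
        self.calls += 1;
        self.inner.read(buf)
    }
}

/// Read `n` bytes one at a time; the reader must hold at least `n` bytes.
fn read_one_byte_at_a_time<R: Read>(reader: &mut R, n: usize) -> io::Result<()> {
    let mut byte = [0u8; 1];
    for _ in 0..n {
        reader.read_exact(&mut byte)?;
    }
    Ok(())
}

/// Returns (calls without buffering, calls with a `BufReader` in front).
fn count_calls() -> io::Result<(usize, usize)> {
    let data = vec![0u8; 1024];

    let mut direct = CountingReader { inner: data.as_slice(), calls: 0 };
    read_one_byte_at_a_time(&mut direct, 1024)?;

    let mut buffered = BufReader::new(CountingReader { inner: data.as_slice(), calls: 0 });
    read_one_byte_at_a_time(&mut buffered, 1024)?;

    Ok((direct.calls, buffered.get_ref().calls))
}

fn main() -> io::Result<()> {
    let (direct, buffered) = count_calls()?;
    println!("direct: {} calls, buffered: {} calls", direct, buffered);
    Ok(())
}
```

The unbuffered reader is called once per byte, while the buffered one needs far fewer calls because BufReader fetches a large block up front.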
Now, remember how before we wrote this function just for the File type? The primary reason why one would want to write it with generics would be such that a caller can make the choice presented above. This is common practice in libraries where such a choice really does matter. However, generics come at the cost of increased compile times (when used excessively) and increased code complexity.
I'm downloading an XZ file with hyper, and I would like to save it to disk in decompressed form by extracting as much as possible from each incoming Chunk and writing results to disk immediately, as opposed to first downloading the entire file and then decompressing.
There is the xz2 crate that implements the XZ format. However, its XzDecoder does not seem to support a Python-like decompressobj model, where a caller repeatedly feeds partial input and gets partial output.
Instead, XzDecoder receives input bytes via a Read parameter, and I'm not sure how to glue these two things together. Is there a way to feed a Response to XzDecoder?
The only clue I found so far is this issue, which contains a reference to a private ReadableChunks type, which I could in theory replicate in my code - but maybe there is an easier way?
XzDecoder does not seem to support a Python-like decompressobj model, where a caller repeatedly feeds partial input and gets partial output
There's xz2::stream::Stream, which does exactly what you want. This is very rough, untested code that needs proper error handling, but I hope you get the idea:
fn process(body: hyper::body::Body) {
let mut decoder = xz2::stream::Stream::new_stream_decoder(1000, 0).unwrap();
body.for_each(|chunk| {
let mut buf: Vec<u8> = Vec::new();
if let Ok(_) = decoder.process_vec(&chunk, &mut buf, xz2::stream::Action::Run) {
// write buf to disk
}
Ok(())
}).wait().unwrap();
}
Based on @Laney's answer, I came up with the following working code:
extern crate failure;
extern crate hyper;
extern crate tokio;
extern crate xz2;
use std::fs::File;
use std::io::Write;
use std::u64;
use failure::Error;
use futures::future::done;
use futures::stream::Stream;
use hyper::{Body, Chunk, Response};
use hyper::rt::Future;
use hyper_tls::HttpsConnector;
use tokio::runtime::Runtime;
fn decode_chunk(file: &mut File, xz: &mut xz2::stream::Stream, chunk: &Chunk)
-> Result<(), Error> {
let end = xz.total_in() as usize + chunk.len();
let mut buf = Vec::with_capacity(8192);
while (xz.total_in() as usize) < end {
buf.clear();
xz.process_vec(
&chunk[chunk.len() - (end - xz.total_in() as usize)..],
&mut buf,
xz2::stream::Action::Run)?;
file.write_all(&buf)?;
}
Ok(())
}
fn decode_response(mut file: File, response: Response<Body>)
-> impl Future<Item=(), Error=Error> {
done(xz2::stream::Stream::new_stream_decoder(u64::MAX, 0)
.map_err(Error::from))
.and_then(|mut xz| response
.into_body()
.map_err(Error::from)
.for_each(move |chunk| done(
decode_chunk(&mut file, &mut xz, &chunk))))
}
fn main() -> Result<(), Error> {
let client = hyper::Client::builder().build::<_, hyper::Body>(
HttpsConnector::new(1)?);
let file = File::create("hello-2.7.tar")?;
let mut runtime = Runtime::new()?;
runtime.block_on(client
.get("https://ftp.gnu.org/gnu/hello/hello-2.7.tar.xz".parse()?)
.map_err(Error::from)
.and_then(|response| decode_response(file, response)))?;
runtime.shutdown_now();
Ok(())
}
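For completeness, the other route mentioned in the question, an adapter that presents a sequence of chunks as a Read, can be sketched with the standard library alone. ChunkReader is a hypothetical type written for illustration; it is synchronous, is not hyper's private ReadableChunks, and assumes the chunks arrive through an ordinary iterator.

```rust
use std::io::{self, Read};

/// Hypothetical adapter: exposes an iterator of byte chunks as a `Read`.
struct ChunkReader<I: Iterator<Item = Vec<u8>>> {
    chunks: I,
    current: Vec<u8>,
    pos: usize,
}

impl<I: Iterator<Item = Vec<u8>>> ChunkReader<I> {
    fn new(chunks: I) -> Self {
        ChunkReader { chunks, current: Vec::new(), pos: 0 }
    }
}

impl<I: Iterator<Item = Vec<u8>>> Read for ChunkReader<I> {
    fn read(&mut self, buf: &mut [u8]) -> io::Result<usize> {
        // Move on to the next non-empty chunk once the current one is used up.
        while self.pos == self.current.len() {
            match self.chunks.next() {
                Some(chunk) => {
                    self.current = chunk;
                    self.pos = 0;
                }
                None => return Ok(0), // no more chunks: end of stream
            }
        }
        let n = buf.len().min(self.current.len() - self.pos);
        buf[..n].copy_from_slice(&self.current[self.pos..self.pos + n]);
        self.pos += n;
        Ok(n)
    }
}

fn main() -> io::Result<()> {
    let chunks = vec![b"first ".to_vec(), b"second".to_vec()];
    let mut reader = ChunkReader::new(chunks.into_iter());
    let mut text = String::new();
    reader.read_to_string(&mut text)?;
    println!("{}", text);
    Ok(())
}
```

A reader like this could be handed to any decoder that takes a Read, at the cost of blocking while it waits for the next chunk.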
I need a completely in-memory object that I can give to BufReader and BufWriter. Something like Python's StringIO. I want to write to and read from such an object using methods ordinarily used with Files.
Is there a way to do this using the standard library?
In fact there is a way: Cursor<T>!
(please also read Shepmaster's answer on why often it's even easier)
In the documentation you can see that there are the following impls:
impl<T> Seek for Cursor<T> where T: AsRef<[u8]>
impl<T> Read for Cursor<T> where T: AsRef<[u8]>
impl Write for Cursor<Vec<u8>>
impl<T> AsRef<[T]> for Vec<T>
From this you can see that you can use the type Cursor<Vec<u8>> just as an ordinary file, because Read, Write and Seek are implemented for that type!
A small example:
use std::io::{Cursor, Read, Seek, SeekFrom, Write};
fn main() {
// Create fake "file"
let mut c = Cursor::new(Vec::new());
// Write into the "file" and seek to the beginning
c.write_all(&[1, 2, 3, 4, 5]).unwrap();
c.seek(SeekFrom::Start(0)).unwrap();
// Read the "file's" contents into a vector
let mut out = Vec::new();
c.read_to_end(&mut out).unwrap();
println!("{:?}", out);
}
For a more useful example, check the documentation linked above.
You don't need a Cursor most of the time.
object that I can give to BufReader and BufWriter
BufReader requires a value that implements Read:
impl<R: Read> BufReader<R> {
pub fn new(inner: R) -> BufReader<R>
}
BufWriter requires a value that implements Write:
impl<W: Write> BufWriter<W> {
pub fn new(inner: W) -> BufWriter<W> {}
}
If you view the implementors of Read you will find impl<'a> Read for &'a [u8].
If you view the implementors of Write, you will find impl Write for Vec<u8>.
use std::io::{Read, Write};
fn main() {
// Create fake "file"
let mut file = Vec::new();
// Write into the "file"
file.write_all(&[1, 2, 3, 4, 5]).unwrap();
// Read the "file's" contents into a new vector
let mut out = Vec::new();
let mut c = file.as_slice();
c.read_to_end(&mut out).unwrap();
println!("{:?}", out);
}
Writing to a Vec will always append to the end. We also take a slice to the Vec that we can update. Each read of c will advance the slice further and further until it is empty.
The main differences from Cursor:
Cannot seek the data, so you cannot easily re-read data
Cannot write to anywhere but the end
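The two differences listed above can be illustrated side by side. This is a minimal sketch; the function names are made up for the example.

```rust
use std::io::{Cursor, Seek, SeekFrom, Write};

/// A `Cursor` can seek back and overwrite bytes in the middle.
fn overwrite_with_cursor() -> Vec<u8> {
    let mut cursor = Cursor::new(Vec::<u8>::new());
    cursor.write_all(b"hello world").unwrap();
    cursor.seek(SeekFrom::Start(6)).unwrap();
    cursor.write_all(b"Rust!").unwrap();
    cursor.into_inner()
}

/// A plain `Vec<u8>` has no position: every write lands at the end.
fn append_with_vec() -> Vec<u8> {
    let mut file = Vec::new();
    file.write_all(b"hello world").unwrap();
    file.write_all(b"Rust!").unwrap();
    file
}

fn main() {
    println!("{}", String::from_utf8_lossy(&overwrite_with_cursor()));
    println!("{}", String::from_utf8_lossy(&append_with_vec()));
}
```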
If you want to use BufReader with an in-memory String, you can use the as_bytes() method:
use std::io::BufRead;
use std::io::BufReader;
use std::io::Read;
fn read_buff<R: Read>(mut buffer: BufReader<R>) {
let mut data = String::new();
let _ = buffer.read_line(&mut data);
println!("read_buff got {}", data);
}
fn main() {
read_buff(BufReader::new("Potato!".as_bytes()));
}
This prints read_buff got Potato!. There is no need to use a cursor for this case.
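When a function does not specifically require a BufReader, the wrapper can be skipped as well, because a byte slice implements BufRead on its own. The helper below, first_two_lines, is a hypothetical name used only for this sketch.

```rust
use std::io::BufRead;

/// Read two lines straight from the bytes of a string, without `BufReader`.
fn first_two_lines(text: &str) -> (String, String) {
    let mut input: &[u8] = text.as_bytes();
    let mut first = String::new();
    let mut second = String::new();
    // Each call consumes one line (newline included) from the front of the slice.
    input.read_line(&mut first).unwrap();
    input.read_line(&mut second).unwrap();
    (first, second)
}

fn main() {
    let (first, second) = first_two_lines("Potato!\nTomato!\n");
    print!("{}{}", first, second);
}
```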
To use an in-memory String with BufWriter, you can use the as_mut_vec method. Unfortunately it is unsafe and I have not found any other way. I don't like the Cursor approach since it consumes the vector and I have not found a way yet to use the Cursor together with BufWriter.
use std::io::BufWriter;
use std::io::Write;
pub fn write_something<W: Write>(mut buf: BufWriter<W>) {
buf.write_all("potato".as_bytes()).unwrap();
}
#[cfg(test)]
mod tests {
use super::*;
use std::io::{BufWriter};
#[test]
fn testing_bufwriter_and_string() {
let mut s = String::new();
write_something(unsafe { BufWriter::new(s.as_mut_vec()) });
assert_eq!("potato", &s);
}
}
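One possible way to avoid the unsafe call, offered here as a sketch rather than the only option, is to let the BufWriter borrow a Vec<u8> and convert the bytes to a String afterwards. This relies on &mut Vec<u8> implementing Write; write_to_string is a hypothetical helper name.

```rust
use std::io::{self, BufWriter, Write};

pub fn write_something<W: Write>(mut buf: BufWriter<W>) -> io::Result<()> {
    buf.write_all(b"potato")?;
    // Flush explicitly so that write errors are not lost when `buf` is dropped.
    buf.flush()
}

/// Collect the written bytes in a `Vec<u8>` and validate them as UTF-8.
fn write_to_string() -> String {
    let mut bytes = Vec::new();
    // The `BufWriter` only borrows `bytes`, so the vector is usable afterwards.
    write_something(BufWriter::new(&mut bytes)).expect("writing to a Vec should not fail");
    String::from_utf8(bytes).expect("output should be valid UTF-8")
}

fn main() {
    println!("{}", write_to_string());
}
```

The conversion through String::from_utf8 checks the bytes once at the end, which is the guarantee that as_mut_vec leaves to the caller.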