I'm downloading an XZ file with hyper, and I would like to save it to disk in decompressed form by extracting as much as possible from each incoming Chunk and writing results to disk immediately, as opposed to first downloading the entire file and then decompressing.
There is the xz2 crate that implements the XZ format. However, its XzDecoder does not seem to support a Python-like decompressobj model, where a caller repeatedly feeds partial input and gets partial output.
Instead, XzDecoder receives input bytes via a Read parameter, and I'm not sure how to glue these two things together. Is there a way to feed a Response to XzDecoder?
The only clue I found so far is this issue, which contains a reference to a private ReadableChunks type, which I could in theory replicate in my code - but maybe there is an easier way?
XzDecoder does not seem to support a Python-like decompressobj model, where a caller repeatedly feeds partial input and gets partial output
There's xz2::stream::Stream, which does exactly what you want. The code below is very rough and untested and needs proper error handling, but I hope you'll get the idea:
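use futures::{Future, Stream};
fn process(body: hyper::body::Body) {
    // u64::MAX effectively disables the decoder's memory limit; a 1000-byte limit is too small to decode anything.
    let mut decoder = xz2::stream::Stream::new_stream_decoder(u64::MAX, 0).unwrap();
    body.for_each(|chunk| {
        // process_vec only writes into the Vec's spare capacity, so reserve some up front.
        let mut buf: Vec<u8> = Vec::with_capacity(8192);
        if let Ok(_) = decoder.process_vec(&chunk, &mut buf, xz2::stream::Action::Run) {
            // write buf to disk; a complete version loops until the whole chunk is consumed
        }
        Ok(())
    }).wait().unwrap();
}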
Based on @Laney's answer, I came up with the following working code:
extern crate failure;
extern crate futures;
extern crate hyper;
extern crate hyper_tls;
extern crate tokio;
extern crate xz2;
use std::fs::File;
use std::io::Write;
use std::u64;
use failure::Error;
use futures::future::done;
use futures::stream::Stream;
use hyper::{Body, Chunk, Response};
use hyper::rt::Future;
use hyper_tls::HttpsConnector;
use tokio::runtime::Runtime;
fn decode_chunk(file: &mut File, xz: &mut xz2::stream::Stream, chunk: &Chunk)
                -> Result<(), Error> {
    let end = xz.total_in() as usize + chunk.len();
    let mut buf = Vec::with_capacity(8192);
    while (xz.total_in() as usize) < end {
        buf.clear();
        xz.process_vec(
            &chunk[chunk.len() - (end - xz.total_in() as usize)..],
            &mut buf,
            xz2::stream::Action::Run)?;
        file.write_all(&buf)?;
    }
    Ok(())
}
fn decode_response(mut file: File, response: Response<Body>)
                   -> impl Future<Item=(), Error=Error> {
    done(xz2::stream::Stream::new_stream_decoder(u64::MAX, 0)
         .map_err(Error::from))
        .and_then(|mut xz| response
                  .into_body()
                  .map_err(Error::from)
                  .for_each(move |chunk| done(
                      decode_chunk(&mut file, &mut xz, &chunk))))
}
fn main() -> Result<(), Error> {
    let client = hyper::Client::builder().build::<_, hyper::Body>(
        HttpsConnector::new(1)?);
    let file = File::create("hello-2.7.tar")?;
    let mut runtime = Runtime::new()?;
    runtime.block_on(client
        .get("https://ftp.gnu.org/gnu/hello/hello-2.7.tar.xz".parse()?)
        .map_err(Error::from)
        .and_then(|response| decode_response(file, response)))?;
    runtime.shutdown_now();
    Ok(())
}
Background (Skippable)
On Linux, the file /var/run/utmp contains several utmp structures, each in raw binary format, following each other in the file. A utmp structure is relatively large (384 bytes on my machine). I am trying to read this file into its raw data, and then implement checks after the fact that the data makes sense. I'm not new to Rust, but this is my first real experience with the unsafe side of things.
Problem Statement
I have a file that contains several C struct utmps (docs). In Rust, I would like to read the entire file into a Vec<libc::utmpx>. More specifically, given a reader open to this file, how could I read one struct utmp?
What I have so far
Below are three different implementations of read_raw, which accepts a reader and returns a RawEntry (my alias for struct utmp). Which method is most correct? I am trying to write code that is as performant as possible, and I am worried that read_raw0 might be slower than the others if it involves memcpys. What is the best/fastest way to accomplish this behavior?
use std::io::Read;
use libc::utmpx as RawEntry;
const RawEntrySize: usize = std::mem::size_of::<RawEntry>();
type RawEntryBuffer = [u8; RawEntrySize];
/// Read a raw utmpx struct
// After testing, this method doesn't work
pub fn read_raw0<R: Read>(reader: &mut R) -> RawEntry {
let mut entry: RawEntry = unsafe { std::mem::zeroed() };
unsafe {
let mut entry_buf = std::mem::transmute::<RawEntry, RawEntryBuffer>(entry);
reader.read_exact(&mut entry_buf[..]);
}
return entry;
}
/// Read a raw utmpx struct
pub fn read_raw1<R: Read>(reader: &mut R) -> RawEntry {
// Worried this could cause alignment issues, or maybe it's okay
// because transmute copies
let mut buffer: RawEntryBuffer = [0; RawEntrySize];
reader.read_exact(&mut buffer[..]);
let entry = unsafe {
std::mem::transmute::<RawEntryBuffer, RawEntry>(buffer)
};
return entry;
}
/// Read a raw utmpx struct
pub fn read_raw2<R: Read>(reader: &mut R) -> RawEntry {
let mut entry: RawEntry = unsafe { std::mem::zeroed() };
unsafe {
let entry_ptr = std::mem::transmute::<&mut RawEntry, *mut u8>(&mut entry);
let entry_slice = std::slice::from_raw_parts_mut(entry_ptr, RawEntrySize);
reader.read_exact(entry_slice);
}
return entry;
}
Note: After more testing, it appears read_raw0 doesn't work. I believe this is because transmute creates a new buffer instead of referencing the struct.
This is what I came up with; I imagine it is about as fast as reading a single entry gets. It follows the spirit of your last attempt (read_raw2) but avoids the transmute: converting &mut T to *mut u8 can be done with two casts, t as *mut T as *mut u8. It also uses MaybeUninit instead of zeroed to be a bit more explicit (the assembly is likely the same once optimized). Lastly, the function is unsafe either way, so we may as well mark it as such and do away with the unsafe blocks.
use std::io::{self, Read};
use std::slice::from_raw_parts_mut;
use std::mem::{MaybeUninit, size_of};
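/// Safety: `T` must be valid for every possible bit pattern (plain C data such as utmpx).
pub unsafe fn read_raw_struct<R: Read, T: Sized>(src: &mut R) -> io::Result<T> {
    let mut buffer = MaybeUninit::<T>::uninit();
    // View the struct's storage as bytes so read_exact can fill it directly.
    let buffer_slice = from_raw_parts_mut(buffer.as_mut_ptr() as *mut u8, size_of::<T>());
    src.read_exact(buffer_slice)?;
    Ok(buffer.assume_init())
}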
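A minimal usage sketch of the function above, under a few assumptions: `Pair` is a hypothetical `#[repr(C)]` struct standing in for `libc::utmpx`, `read_all_pairs` is a hypothetical helper, and an in-memory `Cursor` stands in for the file. The function is repeated so the block compiles on its own, with one deliberate change: it starts from `MaybeUninit::zeroed()` rather than `uninit()`, so the byte slice handed to the reader never covers uninitialized memory.

```rust
use std::io::{self, Cursor, Read};
use std::mem::{size_of, MaybeUninit};
use std::slice::from_raw_parts_mut;

/// Hypothetical stand-in for `libc::utmpx`: plain integers, C layout, no padding.
#[repr(C)]
#[derive(Debug, PartialEq)]
pub struct Pair {
    pub a: u32,
    pub b: u32,
}

/// Variant of the function above (zeroed instead of uninit storage).
/// Safety: `T` must be valid for every possible bit pattern.
pub unsafe fn read_raw_struct<R: Read, T: Sized>(src: &mut R) -> io::Result<T> {
    let mut buffer = MaybeUninit::<T>::zeroed();
    let buffer_slice = from_raw_parts_mut(buffer.as_mut_ptr() as *mut u8, size_of::<T>());
    src.read_exact(buffer_slice)?;
    Ok(buffer.assume_init())
}

/// Reads entries until the reader runs out of data.
/// A trailing partial entry also ends the loop and is dropped.
pub fn read_all_pairs<R: Read>(src: &mut R) -> io::Result<Vec<Pair>> {
    let mut out = Vec::new();
    loop {
        match unsafe { read_raw_struct::<R, Pair>(src) } {
            Ok(p) => out.push(p),
            Err(e) if e.kind() == io::ErrorKind::UnexpectedEof => return Ok(out),
            Err(e) => return Err(e),
        }
    }
}
```

For the real file, the same loop would be called with a `File` (ideally wrapped in a `BufReader`) and `libc::utmpx` in place of `Pair`.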
How to iterate over a gzipped file which contains a single text file (csv)?
Searching crates.io I found flate2 which has the following code example for decompression:
extern crate flate2;
use std::io::prelude::*;
use flate2::read::GzDecoder;
fn main() {
let mut d = GzDecoder::new("...".as_bytes()).unwrap();
let mut s = String::new();
d.read_to_string(&mut s).unwrap();
println!("{}", s);
}
How to stream a gzip csv file?
For streaming I/O operations, Rust has the Read and Write traits. To iterate over input by lines you usually want the BufRead trait, which you can always get by wrapping a Read implementation in BufReader::new.
flate2 already operates with these traits; GzDecoder implements Read, and GzDecoder::new takes anything that implements Read.
Example decoding stdin (doesn't work well on playground of course):
extern crate flate2;
use std::io;
use std::io::prelude::*;
use flate2::read::GzDecoder;
fn main() {
let stdin = io::stdin();
let stdin = stdin.lock(); // or just open any normal file
let d = GzDecoder::new(stdin).expect("couldn't decode gzip stream");
for line in io::BufReader::new(d).lines() {
println!("{}", line.unwrap());
}
}
You can then decode your lines with your usual ("without gzip") logic; perhaps make it generic by taking any input implementing BufRead.
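As a rough sketch of that last suggestion, the per-line logic can live in a function that accepts any `BufRead`. The name `parse_rows` and the naive comma splitting are assumptions made for illustration; quoted or escaped fields would need a real CSV parser.

```rust
use std::io::{self, BufRead};

/// Hypothetical per-line logic: splits each non-empty line on commas.
/// Quoting and escaping are not handled.
pub fn parse_rows<B: BufRead>(input: B) -> io::Result<Vec<Vec<String>>> {
    let mut rows = Vec::new();
    for line in input.lines() {
        let line = line?; // propagate I/O and UTF-8 errors
        if line.is_empty() {
            continue;
        }
        rows.push(line.split(',').map(|f| f.trim().to_string()).collect());
    }
    Ok(rows)
}
```

The function does not know or care where the bytes come from: it can be handed the BufReader-wrapped GzDecoder from the example above, a locked stdin, or a plain `&[u8]` in a unit test.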
I'm using the fdpass crate to send file descriptors from one process to another over a Unix socket (I don't care about compatibility, Unix only is fine).
Using mio I manage to listen for events on those file descriptors:
let fd = fdpass::recv_fd(&mut client, vec!(0u8)).unwrap();
let efd = EventedFd(&fd.into_raw_fd());
poll.register(&efd, Token(0), Ready::readable(), PollOpt::level()).unwrap();
That works perfectly fine, but I'd like to use a BufReader to read that file descriptor line by line. I've been trying, unsuccessfully, to figure out a way to use from_raw_fd() on something that can be wrapped in a BufReader. It seems to exist only for things like files or network streams. The only other candidate is Stdio, which does not implement Read, which BufReader requires.
Any suggestions as to how I could get a BufReader from a raw fd without making mio unsafe to use?
By the way, the file descriptors are not files (although they might be at some point), so I can't use File. Right now I'm just sending the client's stdin as a raw fd through fdpass.
Unfortunately, FromRawFd is only implemented for a handful of structs. You need to know beforehand what kind of "file" you want to read, or you risk undefined behavior (because Rust assumes that the FD is of a type that it isn't).
You can, however, implement your own struct that does nothing but read, which is fine for any file descriptor. This can be done with a call to read(2).
use libc;
use std::ffi::OsStr;
use std::io::{Error, Read, Result};
use std::os::unix::ffi::OsStrExt;
use std::os::unix::io::{FromRawFd, RawFd};
pub struct RawFdReader {
fd: RawFd,
}
impl FromRawFd for RawFdReader {
unsafe fn from_raw_fd(fd: RawFd) -> Self {
Self { fd }
}
}
impl Read for RawFdReader {
fn read(&mut self, buf: &mut [u8]) -> Result<usize> {
assert!(buf.len() <= isize::max_value() as usize);
match unsafe { libc::read(self.fd, buf.as_mut_ptr() as _, buf.len()) } {
x if x < 0 => Err(Error::last_os_error()),
x => Ok(x as usize),
}
}
}
fn main() -> Result<()> {
let mut reader = unsafe { RawFdReader::from_raw_fd(0) };
let mut buffer = vec![0; 10];
let len = reader.read(&mut buffer)?;
println!("{:?}", OsStr::from_bytes(&buffer[..len]));
Ok(())
}
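Since the question asked for line-by-line reading, here is a small sketch of the remaining step: anything that implements Read, including a reader like the RawFdReader above, can be wrapped in a BufReader. The helper name `read_lines` is hypothetical, and a byte slice is used as a stand-in reader so the block needs nothing beyond the standard library.

```rust
use std::io::{self, BufRead, BufReader, Read};

/// Hypothetical helper: buffers any `Read` implementor and collects its lines.
/// The first I/O or UTF-8 error stops the collection and is returned.
pub fn read_lines<R: Read>(reader: R) -> io::Result<Vec<String>> {
    BufReader::new(reader).lines().collect()
}
```

With a non-blocking descriptor registered in mio, a read may fail with `WouldBlock`; a loop driven by readiness events would presumably call `read_line` on a long-lived BufReader instead of collecting everything at once.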
I am trying to write the contents of an HTTP Response to a file.
extern crate reqwest;
use std::io::Write;
use std::fs::File;
fn main() {
let mut resp = reqwest::get("https://www.rust-lang.org").unwrap();
assert!(resp.status().is_success());
// Write contents to disk.
let mut f = File::create("download_file").expect("Unable to create file");
f.write_all(resp.bytes());
}
But I get the following compile error:
error[E0308]: mismatched types
--> src/main.rs:12:17
|
12 | f.write_all(resp.bytes());
| ^^^^^^^^^^^^ expected &[u8], found struct `std::io::Bytes`
|
= note: expected type `&[u8]`
found type `std::io::Bytes<reqwest::Response>`
You cannot. Checking the docs for io::Bytes, there are no appropriate methods. That's because io::Bytes is an iterator that returns things byte-by-byte so there may not even be a single underlying slice of data.
If you only had io::Bytes, you would need to collect the iterator into a Vec:
let data: Result<Vec<_>, _> = resp.bytes().collect();
let data = data.expect("Unable to read data");
f.write_all(&data).expect("Unable to write data");
However, in most cases you have access to the type that implements Read, so you could instead use Read::read_to_end:
let mut data = Vec::new();
resp.read_to_end(&mut data).expect("Unable to read data");
f.write_all(&data).expect("Unable to write data");
In this specific case, you can use io::copy to directly copy from the Response to the file because Response implements io::Read and File implements io::Write:
extern crate reqwest;
use std::io;
use std::fs::File;
fn main() {
let mut resp = reqwest::get("https://www.rust-lang.org").unwrap();
assert!(resp.status().is_success());
// Write contents to disk.
let mut f = File::create("download_file").expect("Unable to create file");
io::copy(&mut resp, &mut f).expect("Unable to copy data");
}
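Because io::copy only needs a Read source and a Write sink, the same pattern can be exercised without any network access. The sketch below uses a hypothetical helper, `save_body`, with a byte slice standing in for the response and a Vec standing in for the file.

```rust
use std::io::{self, Read, Write};

/// Hypothetical helper: streams everything from `src` into `dst`
/// and returns the number of bytes copied.
pub fn save_body<R: Read, W: Write>(mut src: R, mut dst: W) -> io::Result<u64> {
    let copied = io::copy(&mut src, &mut dst)?;
    dst.flush()?; // make sure buffered writers hand their data on
    Ok(copied)
}
```

io::copy moves the data through a fixed-size internal buffer, so the whole body never has to sit in memory at once, unlike the read_to_end variant.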
I need a completely in-memory object that I can give to BufReader and BufWriter. Something like Python's StringIO. I want to write to and read from such an object using methods ordinarily used with Files.
Is there a way to do this using the standard library?
In fact there is a way: Cursor<T>!
(please also read Shepmaster's answer on why often it's even easier)
In the documentation you can see that there are the following impls:
impl<T> Seek for Cursor<T> where T: AsRef<[u8]>
impl<T> Read for Cursor<T> where T: AsRef<[u8]>
impl Write for Cursor<Vec<u8>>
impl<T> AsRef<[T]> for Vec<T>
From this you can see that you can use the type Cursor<Vec<u8>> just as an ordinary file, because Read, Write and Seek are implemented for that type!
Little example (Playground):
use std::io::{Cursor, Read, Seek, SeekFrom, Write};
fn main() {
    // Create fake "file"
    let mut c = Cursor::new(Vec::new());
    // Write into the "file" and seek to the beginning
    c.write_all(&[1, 2, 3, 4, 5]).unwrap();
    c.seek(SeekFrom::Start(0)).unwrap();
    // Read the "file's" contents into a vector
    let mut out = Vec::new();
    c.read_to_end(&mut out).unwrap();
    println!("{:?}", out);
}
For a more useful example, check the documentation linked above.
You don't need a Cursor most of the time.
object that I can give to BufReader and BufWriter
BufReader requires a value that implements Read:
impl<R: Read> BufReader<R> {
pub fn new(inner: R) -> BufReader<R>
}
BufWriter requires a value that implements Write:
impl<W: Write> BufWriter<W> {
    pub fn new(inner: W) -> BufWriter<W>
}
If you view the implementors of Read you will find impl<'a> Read for &'a [u8].
If you view the implementors of Write, you will find impl Write for Vec<u8>.
use std::io::{Read, Write};
fn main() {
// Create fake "file"
let mut file = Vec::new();
// Write into the "file"
file.write_all(&[1, 2, 3, 4, 5]).unwrap();
// Read the "file's" contents into a new vector
let mut out = Vec::new();
let mut c = file.as_slice();
c.read_to_end(&mut out).unwrap();
println!("{:?}", out);
}
Writing to a Vec always appends to the end. For reading, we take a slice of the Vec that we can update: each read of c advances the slice further and further until it is empty.
The main differences from Cursor:
Cannot seek the data, so you cannot easily re-read data
Cannot write to anywhere but the end
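The two differences can be seen side by side in a short sketch. Both function names are hypothetical; the first uses a Cursor to overwrite bytes in the middle and re-read from the start, the second uses the plain Vec and slice approach.

```rust
use std::io::{self, Cursor, Read, Seek, SeekFrom, Write};

/// Cursor: overwrite bytes in the middle, then rewind and read everything again.
pub fn overwrite_and_reread() -> io::Result<Vec<u8>> {
    let mut c = Cursor::new(vec![1u8, 2, 3, 4, 5]);
    c.seek(SeekFrom::Start(1))?;
    c.write_all(&[9, 9])?; // replaces the bytes at positions 1 and 2
    c.seek(SeekFrom::Start(0))?; // rewind so the data can be read again
    let mut out = Vec::new();
    c.read_to_end(&mut out)?;
    Ok(out)
}

/// Vec and slice: writes append, and reading consumes the slice.
/// Returns the bytes read and how many bytes are left in the slice afterwards.
pub fn append_and_read_once() -> io::Result<(Vec<u8>, usize)> {
    let mut file = vec![1u8, 2, 3];
    file.write_all(&[4, 5])?; // always lands at the end
    let mut reader = file.as_slice();
    let mut out = Vec::new();
    reader.read_to_end(&mut out)?;
    Ok((out, reader.len()))
}
```

Re-reading in the second case is still possible by taking a fresh slice with `file.as_slice()`, but there is no way to move backwards within an existing one.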
If you want to use BufReader with an in-memory String, you can use the as_bytes() method:
use std::io::BufRead;
use std::io::BufReader;
use std::io::Read;
fn read_buff<R: Read>(mut buffer: BufReader<R>) {
let mut data = String::new();
let _ = buffer.read_line(&mut data);
println!("read_buff got {}", data);
}
fn main() {
read_buff(BufReader::new("Potato!".as_bytes()));
}
This prints read_buff got Potato!. There is no need to use a cursor for this case.
To use an in-memory String with BufWriter, you can use the as_mut_vec method. Unfortunately it is unsafe and I have not found any other way. I don't like the Cursor approach since it consumes the vector and I have not found a way yet to use the Cursor together with BufWriter.
use std::io::BufWriter;
use std::io::Write;
pub fn write_something<W: Write>(mut buf: BufWriter<W>) {
    // write_all retries until every byte is accepted; plain write may stop early.
    buf.write_all("potato".as_bytes()).unwrap();
}
#[cfg(test)]
mod tests {
    use super::*;
    use std::io::BufWriter;
    #[test]
    fn testing_bufwriter_and_string() {
        let mut s = String::new();
        // The BufWriter is flushed when it is dropped at the end of write_something.
        write_something(unsafe { BufWriter::new(s.as_mut_vec()) });
        assert_eq!("potato", &s);
    }
}
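One possible way to avoid the unsafe call, sketched under the assumption that collecting the bytes first and validating them afterwards is acceptable: `&mut Vec<u8>` implements Write, so a BufWriter can borrow a byte vector, and the result can be converted with String::from_utf8. The helper name `write_to_string` is hypothetical, and `write_something` is a variant of the function above that returns its error.

```rust
use std::io::{self, BufWriter, Write};

/// Variant of the function above that reports errors instead of ignoring them.
pub fn write_something<W: Write>(mut buf: BufWriter<W>) -> io::Result<()> {
    buf.write_all("potato".as_bytes())?;
    buf.flush() // flush explicitly so write errors are not lost on drop
}

/// Hypothetical helper: writes through a BufWriter into a borrowed Vec,
/// then checks that the bytes are valid UTF-8.
pub fn write_to_string() -> io::Result<String> {
    let mut bytes: Vec<u8> = Vec::new();
    write_something(BufWriter::new(&mut bytes))?;
    String::from_utf8(bytes).map_err(|e| io::Error::new(io::ErrorKind::InvalidData, e))
}
```

The vector is only borrowed, so it is not consumed. The same borrowing trick should also work with a cursor, since `Cursor<&mut Vec<u8>>` implements Write as well.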