How to iterate over a gzipped file which contains a single text file (csv)?
Searching crates.io I found flate2 which has the following code example for decompression:
extern crate flate2;
use std::io::prelude::*;
use flate2::read::GzDecoder;
fn main() {
let mut d = GzDecoder::new("...".as_bytes()); // flate2 1.0+: new() returns the decoder directly, not a Result
let mut s = String::new();
d.read_to_string(&mut s).unwrap();
println!("{}", s);
}
How to stream a gzip csv file?
For stream I/O operations Rust has the Read and Write traits. To iterate over input by lines you usually want the BufRead trait, which you can always get by wrapping a Read implementation in BufReader::new.
flate2 already operates with these traits; GzDecoder implements Read, and GzDecoder::new takes anything that implements Read.
Example decoding stdin (doesn't work well on playground of course):
extern crate flate2;
use std::io;
use std::io::prelude::*;
use flate2::read::GzDecoder;
fn main() {
let stdin = io::stdin();
let stdin = stdin.lock(); // or just open any normal file
let d = GzDecoder::new(stdin); // flate2 1.0+: returns the decoder directly; decode errors surface when reading
for line in io::BufReader::new(d).lines() {
println!("{}", line.unwrap());
}
}
You can then decode your lines with your usual ("without gzip") logic; perhaps make it generic by taking any input implementing BufRead.
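A minimal sketch of the "make it generic" idea, using only the standard library. The function name read_records and the naive comma split are assumptions for illustration, not part of flate2 or of any CSV crate. Because the function only asks for BufRead, it could be handed io::BufReader::new(GzDecoder::new(file)) just as well as the in-memory byte slice used here.

```rust
use std::io::{self, BufRead};

/// Splits each line on commas and returns the fields as owned strings.
/// Hypothetical helper: accepts any `BufRead`, compressed or not.
fn read_records<R: BufRead>(input: R) -> io::Result<Vec<Vec<String>>> {
    let mut records = Vec::new();
    for line in input.lines() {
        let line = line?;
        // Naive split: no quoting or escaping, unlike a real CSV parser.
        records.push(line.split(',').map(|field| field.trim().to_string()).collect());
    }
    Ok(records)
}

fn main() -> io::Result<()> {
    // A byte slice implements `BufRead`; it stands in for a decoded gzip stream.
    let data = "a,b,c\n1,2,3\n";
    for record in read_records(data.as_bytes())? {
        println!("{:?}", record);
    }
    Ok(())
}
```

Keeping the parsing logic behind a BufRead bound also makes it easy to unit test without any compressed fixture files.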
Related
In C, I can use rewind to go back to the start, but I didn't find a similar way in Rust.
I want to open an existing file, move the file pointer back to the start, and write new words over the old ones.
But right now I can only write after the last line of the original file, and I don't know how to move the file pointer.
I know that the libc crate has libc::rewind, but how do I use it? Or are there other ways?
Use seek.
use std::io::{self, Seek, SeekFrom};
use std::fs::File;
fn main() -> io::Result<()> {
let mut file = File::open("foo.bar")?;
file.seek(SeekFrom::Start(0))?;
Ok(())
}
You can use rewind(). It is a convenience wrapper around seek(SeekFrom::Start(0)):
use std::io::{self, Seek};
use std::fs::File;
fn main() -> io::Result<()> {
let mut file = File::open("foo.bar")?;
file.rewind()?;
Ok(())
}
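For the overwrite case described in the question, a rough sketch might look like the following. Note that File::open gives a read-only handle, so this version uses OpenOptions to request write access. The function name replace_contents and the temporary file name are made up for the example, and the set_len call is an assumption about what "cover the old one" should mean when the new text is shorter than the old.

```rust
use std::fs::OpenOptions;
use std::io::{self, Read, Seek, Write};
use std::path::Path;

/// Reads the old contents, rewinds, and overwrites the file with `data`.
/// Returns the old contents.
fn replace_contents(path: &Path, data: &[u8]) -> io::Result<String> {
    // `File::open` is read-only, so ask for read and write access explicitly.
    let mut file = OpenOptions::new().read(true).write(true).open(path)?;
    let mut old = String::new();
    file.read_to_string(&mut old)?; // the cursor is now at the end of the file
    file.rewind()?; // same as file.seek(SeekFrom::Start(0))
    file.write_all(data)?;
    // Drop leftover bytes in case the old contents were longer.
    file.set_len(data.len() as u64)?;
    Ok(old)
}

fn main() -> io::Result<()> {
    let path = std::env::temp_dir().join("rewind_example.txt"); // hypothetical file
    std::fs::write(&path, "old contents that are longer")?;
    let old = replace_contents(&path, b"new words")?;
    println!("old: {}", old);
    println!("new: {}", std::fs::read_to_string(&path)?);
    Ok(())
}
```

Without the rewind, write_all would append after the bytes that were just read, which matches the behaviour the question describes.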
I want to write some code that can read bytes from:
stdin
files
a string
TCP
and maybe others. What is the best way to do this in Rust?
I thought the std::io::Read trait was the way to go, but it seems to be lacking implementations for string at least (I just needed this particular one for testing - maybe I can use something else)?
You may find it helpful to use the impl Read for &[u8] implementation to read bytes from a string. As the type indicates, you first have to convert your string into a slice of bytes. Here is a short, dumb example.
use std::fs::File;
use std::io::{Error, Read, BufReader};
fn whoo<T: Read>(mut readable: T) {
let mut buffer = [0; 10];
readable.read(&mut buffer).expect("panic");
println!("{:?}", buffer);
}
fn main() -> Result<(), Error> {
whoo("hello there".as_bytes());
whoo("".as_bytes());
let dict = File::open("/usr/share/dict/words")?;
let reader = BufReader::new(dict);
whoo(reader);
Ok(())
}
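As a complement, here is a small sketch that also covers owned data through std::io::Cursor, which implements Read for anything that can be viewed as a byte slice. The helper name count_bytes is invented for the example. The same generic function should accept io::stdin().lock(), a File, or a TcpStream, since all of them implement Read.

```rust
use std::io::{self, Cursor, Read};

/// Drains any `Read` source and returns the number of bytes it produced.
fn count_bytes<R: Read>(mut source: R) -> io::Result<usize> {
    let mut buffer = Vec::new();
    source.read_to_end(&mut buffer)
}

fn main() -> io::Result<()> {
    // Borrowed string data: `&[u8]` implements `Read`.
    println!("{}", count_bytes("hello there".as_bytes())?);
    // Owned data: `Cursor` wraps a `Vec<u8>` and keeps track of the read position.
    let owned = String::from("hello").into_bytes();
    println!("{}", count_bytes(Cursor::new(owned))?);
    Ok(())
}
```

Cursor is handy in tests when the reader has to own its data, for example when it is returned from a function or moved into a thread.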
I'm downloading an XZ file with hyper, and I would like to save it to disk in decompressed form by extracting as much as possible from each incoming Chunk and writing results to disk immediately, as opposed to first downloading the entire file and then decompressing.
There is the xz2 crate that implements the XZ format. However, its XzDecoder does not seem to support a Python-like decompressobj model, where a caller repeatedly feeds partial input and gets partial output.
Instead, XzDecoder receives input bytes via a Read parameter, and I'm not sure how to glue these two things together. Is there a way to feed a Response to XzDecoder?
The only clue I found so far is this issue, which contains a reference to a private ReadableChunks type, which I could in theory replicate in my code - but maybe there is an easier way?
XzDecoder does not seem to support a Python-like decompressobj model, where a caller repeatedly feeds partial input and gets partial output
There's xz2::stream::Stream, which does exactly what you want. This is very rough, untested code that needs proper error handling, but I hope you'll get the idea:
fn process(body: hyper::body::Body) {
let mut decoder = xz2::stream::Stream::new_stream_decoder(1000, 0).unwrap();
body.for_each(|chunk| {
let mut buf: Vec<u8> = Vec::new();
if let Ok(_) = decoder.process_vec(&chunk, &mut buf, Action::Run) {
// write buf to disk
}
Ok(())
}).wait().unwrap();
}
Based on @Laney's answer, I came up with the following working code:
extern crate failure;
extern crate hyper;
extern crate tokio;
extern crate xz2;
use std::fs::File;
use std::io::Write;
use std::u64;
use failure::Error;
use futures::future::done;
use futures::stream::Stream;
use hyper::{Body, Chunk, Response};
use hyper::rt::Future;
use hyper_tls::HttpsConnector;
use tokio::runtime::Runtime;
fn decode_chunk(file: &mut File, xz: &mut xz2::stream::Stream, chunk: &Chunk)
-> Result<(), Error> {
let end = xz.total_in() as usize + chunk.len();
let mut buf = Vec::with_capacity(8192);
while (xz.total_in() as usize) < end {
buf.clear();
xz.process_vec(
&chunk[chunk.len() - (end - xz.total_in() as usize)..],
&mut buf,
xz2::stream::Action::Run)?;
file.write_all(&buf)?;
}
Ok(())
}
fn decode_response(mut file: File, response: Response<Body>)
-> impl Future<Item=(), Error=Error> {
done(xz2::stream::Stream::new_stream_decoder(u64::MAX, 0)
.map_err(Error::from))
.and_then(|mut xz| response
.into_body()
.map_err(Error::from)
.for_each(move |chunk| done(
decode_chunk(&mut file, &mut xz, &chunk))))
}
fn main() -> Result<(), Error> {
let client = hyper::Client::builder().build::<_, hyper::Body>(
HttpsConnector::new(1)?);
let file = File::create("hello-2.7.tar")?;
let mut runtime = Runtime::new()?;
runtime.block_on(client
.get("https://ftp.gnu.org/gnu/hello/hello-2.7.tar.xz".parse()?)
.map_err(Error::from)
.and_then(|response| decode_response(file, response)))?;
runtime.shutdown_now();
Ok(())
}
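For comparison, the other route mentioned in the question (replicating something like the private ReadableChunks type) could be sketched with the standard library alone. ChunkReader below is a hypothetical name, and it is a blocking adapter: it assumes the chunks arrive through a synchronous iterator, for instance the receiving end of a std::sync::mpsc channel fed by the download task. Wiring that up to hyper is not shown and is left as an assumption.

```rust
use std::io::{self, Read};

/// Hypothetical adapter: exposes an iterator of byte chunks as a blocking `Read`.
struct ChunkReader<I> {
    chunks: I,
    current: Vec<u8>,
    pos: usize,
}

impl<I: Iterator<Item = Vec<u8>>> ChunkReader<I> {
    fn new(chunks: I) -> Self {
        ChunkReader { chunks, current: Vec::new(), pos: 0 }
    }
}

impl<I: Iterator<Item = Vec<u8>>> Read for ChunkReader<I> {
    fn read(&mut self, out: &mut [u8]) -> io::Result<usize> {
        // Skip exhausted (or empty) chunks until some data is available.
        while self.pos == self.current.len() {
            match self.chunks.next() {
                Some(chunk) => {
                    self.current = chunk;
                    self.pos = 0;
                }
                None => return Ok(0), // end of stream
            }
        }
        // Copy as much of the current chunk as fits into the caller's buffer.
        let n = out.len().min(self.current.len() - self.pos);
        out[..n].copy_from_slice(&self.current[self.pos..self.pos + n]);
        self.pos += n;
        Ok(n)
    }
}

fn main() -> io::Result<()> {
    let chunks = vec![b"hel".to_vec(), Vec::new(), b"lo".to_vec()];
    let mut reader = ChunkReader::new(chunks.into_iter());
    let mut text = String::new();
    reader.read_to_string(&mut text)?;
    println!("{}", text);
    Ok(())
}
```

A reader like this could then be passed to a Read-based decoder such as xz2::read::XzDecoder, at the cost of blocking a thread while waiting for the next chunk. The Stream-based code above avoids that trade-off.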
I've poked the serde-yaml and yaml-rust crates a bit, but I haven't seen any examples.
serde-yaml's documentation has the following 4 functions:
from_reader — Deserialize an instance of type T from an IO stream of YAML.
from_slice — Deserialize an instance of type T from bytes of YAML text.
from_str — Deserialize an instance of type T from a string of YAML text.
from_value — Interpret a serde_yaml::Value as an instance of type T.
Using from_reader as an example:
use serde_yaml; // 0.8.7
fn main() -> Result<(), Box<dyn std::error::Error>> {
let f = std::fs::File::open("something.yaml")?;
let d: String = serde_yaml::from_reader(f)?;
println!("Read YAML string: {}", d);
Ok(())
}
something.yaml:
"I am YAML"
You can deserialize into the looser-typed Value if you don't know your format (String in this example), but be sure to read the Serde guide for full details of how to do type-directed serialization and deserialization instead.
See also:
How do I parse a JSON File?
Deserializing TOML into vector of enum with values
In general, using any Serde format is pretty much the same as all the rest.
This example uses the yaml_rust crate
use std::fs::File;
use std::io::prelude::*;
use yaml_rust::yaml::{Hash, Yaml};
use yaml_rust::YamlLoader;
fn main() {
println!("Hello, Yaml");
let file = "./etc/my_yaml_file.yaml";
load_file(file);
}
fn load_file(file: &str) {
let mut file = File::open(file).expect("Unable to open file");
let mut contents = String::new();
file.read_to_string(&mut contents)
.expect("Unable to read file");
let docs = YamlLoader::load_from_str(&contents).unwrap();
// iterate / process doc[s] ..
}
The answer from Shepmaster is great if you want to do it properly. Here's a complete example to get started with.
data["foo"]["bar"].as_str() returns an Option<&str>.
use anyhow::{anyhow, Result}; // the anyhow crate provides Result and the anyhow! macro used below
fn example() -> Result<String> {
let f = std::fs::File::open("something.yaml")?;
let data: serde_yaml::Value = serde_yaml::from_reader(f)?;
data["foo"]["bar"]
.as_str()
.map(|s| s.to_string())
.ok_or_else(|| anyhow!("Could not find key foo.bar in something.yaml"))
}
Is there an idiomatic way to process a file one character at a time in Rust?
This seems to be roughly what I'm after:
let mut f = io::BufReader::new(try!(fs::File::open("input.txt")));
for c in f.chars() {
println!("Character: {}", c.unwrap());
}
But Read::chars is still unstable as of Rust v1.6.0.
I considered using Read::read_to_string, but the file may be large and I don't want to read it all into memory.
Let's compare 4 approaches.
1. Read::chars
You could copy the Read::chars implementation, but it is marked unstable with
the semantics of a partial read/write of where errors happen is currently unclear and may change
so some care must be taken. Anyway, this seems to be the best approach.
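What such a hand-rolled decoder might look like is sketched below. This is not the standard library's Read::chars code. It is a simplified stand-in written for illustration, and the name for_each_char is made up. It reads one byte at a time through a buffered reader, works out the sequence length from the leading byte, and leaves validation to std::str::from_utf8, so it never holds more than four bytes of pending input.

```rust
use std::io::{self, BufRead, Read};

/// Decodes UTF-8 one character at a time and passes each `char` to `f`.
/// Requires `BufRead` so that the byte-at-a-time reads stay cheap.
fn for_each_char<R: BufRead>(reader: R, mut f: impl FnMut(char)) -> io::Result<()> {
    let mut bytes = reader.bytes();
    while let Some(first) = bytes.next() {
        let first = first?;
        // The leading byte determines how many bytes the character occupies.
        let width = match first {
            0x00..=0x7F => 1,
            0xC0..=0xDF => 2,
            0xE0..=0xEF => 3,
            0xF0..=0xF7 => 4,
            _ => return Err(io::Error::new(io::ErrorKind::InvalidData, "invalid UTF-8 lead byte")),
        };
        let mut buf = [first, 0, 0, 0];
        for slot in &mut buf[1..width] {
            *slot = bytes
                .next()
                .ok_or_else(|| io::Error::new(io::ErrorKind::UnexpectedEof, "truncated UTF-8 sequence"))??;
        }
        // `from_utf8` rejects bad continuation bytes, overlong forms and surrogates.
        let s = std::str::from_utf8(&buf[..width])
            .map_err(|e| io::Error::new(io::ErrorKind::InvalidData, e))?;
        f(s.chars().next().expect("a valid sequence holds exactly one char"));
    }
    Ok(())
}

fn main() -> io::Result<()> {
    // A byte slice stands in for `BufReader::new(File::open("input.txt")?)`.
    for_each_char("héllo".as_bytes(), |c| println!("Character: {}", c))
}
```

Unlike the line-based approaches, memory use here does not depend on where the line breaks are, which matters for files without any.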
2. flat_map
The flat_map alternative does not compile:
use std::io::{BufRead, BufReader};
use std::fs::File;
pub fn main() {
let mut f = BufReader::new(File::open("input.txt").expect("open failed"));
for c in f.lines().flat_map(|l| l.expect("lines failed").chars()) {
println!("Character: {}", c);
}
}
The problem is that chars borrows from the string, but l.expect("lines failed") lives only inside the closure, so the compiler gives the error borrowed value does not live long enough.
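One way to get a compiling flat_map version is to hand the iterator owned data, for example by collecting each line's characters into a Vec inside the closure. This is only a sketch (chars_of is an invented name), and it trades the borrow error for an extra allocation per line, so it shares the memory drawbacks of the line-based approaches.

```rust
use std::io::BufRead;

/// Yields every character of every line (line terminators are stripped by `lines`).
fn chars_of<R: BufRead>(reader: R) -> impl Iterator<Item = char> {
    reader.lines().flat_map(|line| {
        // Collecting into a Vec gives flat_map owned data that outlives the closure.
        line.expect("lines failed").chars().collect::<Vec<char>>()
    })
}

fn main() {
    // A byte slice stands in for `BufReader::new(File::open("input.txt")?)`.
    for c in chars_of("ab\ncd\n".as_bytes()) {
        println!("Character: {}", c);
    }
}
```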
3. Nested for
This code
use std::io::{BufRead, BufReader};
use std::fs::File;
pub fn main() {
let mut f = BufReader::new(File::open("input.txt").expect("open failed"));
for line in f.lines() {
for c in line.expect("lines failed").chars() {
println!("Character: {}", c);
}
}
}
works, but it allocates a new string for each line. Besides, if there is no line break in the input file, the whole file is loaded into memory.
4. BufRead::read_until
A memory-efficient alternative to approach 3 is to use BufRead::read_until with a single buffer that is reused for every line:
use std::io::{BufRead, BufReader};
use std::fs::File;
pub fn main() {
let mut f = BufReader::new(File::open("input.txt").expect("open failed"));
let mut buf = Vec::<u8>::new();
while f.read_until(b'\n', &mut buf).expect("read_until failed") != 0 {
// this moves the ownership of the read data to s
// there is no allocation
let s = String::from_utf8(buf).expect("from_utf8 failed");
for c in s.chars() {
println!("Character: {}", c);
}
// this returns the ownership of the read data to buf
// there is no allocation
buf = s.into_bytes();
buf.clear();
}
}
I cannot use lines() because my file could be a single line that is gigabytes in size. This is an improvement on @malbarbo's recommendation of copying Read::chars from an old version of Rust. The utf8-chars crate already adds .chars() to BufRead for you.
Inspecting their repository, it doesn't look like they load more than 4 bytes at a time.
Your code will look the same as it did before Rust removed Read::chars:
use std::io::stdin;
use utf8_chars::BufReadCharsExt;
fn main() {
for c in stdin().lock().chars().map(|x| x.unwrap()) {
println!("{}", c);
}
}
Add the following to your Cargo.toml:
[dependencies]
utf8-chars = "1.0.0"
There are two solutions that make sense here.
First, you could copy the implementation of Read::chars() and use it; that would make it completely trivial to move your code over to the standard library implementation if/when it stabilizes.
On the other hand, you could simply iterate line by line (using f.lines()) and then use line.chars() on each line to get the chars. This is a little more hacky, but it will definitely work.
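If you only want one loop, you could use flat_map() with a closure. Note that something as simple as |line| line.chars() will not compile as written: each item from lines() is a Result that has to be unwrapped first, and chars() borrows from a string that is dropped when the closure returns. Collecting the characters into an owned Vec inside the closure avoids that.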