Reading ZIP file in Rust causes data owned by the current function - rust

I'm new to Rust and likely have a huge knowledge gap. Basically, I'm hoping to create a utility function that would accept a regular text file or a ZIP file and return a BufRead so the caller can start processing it line by line. It works well for non-ZIP files, but I don't understand how to achieve the same for ZIP files. The ZIP files will only contain a single file within the archive, which is why I'm only processing the first file in the ZipArchive.
I'm running into the following error.
error[E0515]: cannot return value referencing local variable `archive_contents`
--> src/file_reader.rs:30:9
|
27 | let archive_file: zip::read::ZipFile = archive_contents.by_index(0).unwrap();
| ---------------- `archive_contents` is borrowed here
...
30 | Ok(Box::new(BufReader::with_capacity(128 * 1024, archive_file)))
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ returns a value referencing data owned by the current function
It seems the archive_contents is preventing the BufRead object from returning to the caller. I'm just not sure how to work around this.
file_reader.rs
use std::ffi::OsStr;
use std::fs::File;
use std::io::BufRead;
use std::io::BufReader;
use std::path::Path;
pub struct FileReader {
pub file_reader: Result<Box<BufRead>, &'static str>,
}
pub fn file_reader(filename: &str) -> Result<Box<BufRead>, &'static str> {
let path = Path::new(filename);
let file = match File::open(&path) {
Ok(file) => file,
Err(why) => panic!(
"ERROR: Could not open file, {}: {}",
path.display(),
why.to_string()
),
};
if path.extension() == Some(OsStr::new("zip")) {
// Processing ZIP file.
let mut archive_contents: zip::read::ZipArchive<std::fs::File> =
zip::ZipArchive::new(file).unwrap();
let archive_file: zip::read::ZipFile = archive_contents.by_index(0).unwrap();
// ERRORS: returns a value referencing data owned by the current function
Ok(Box::new(BufReader::with_capacity(128 * 1024, archive_file)))
} else {
// Processing non-ZIP file.
Ok(Box::new(BufReader::with_capacity(128 * 1024, file)))
}
}
main.rs
mod file_reader;
use std::io::BufRead;
fn main() {
let mut files: Vec<String> = Vec::new();
files.push("/tmp/text_file.txt".to_string());
files.push("/tmp/zip_file.zip".to_string());
for f in files {
let mut fr = match file_reader::file_reader(&f) {
Ok(fr) => fr,
Err(e) => panic!("Error reading file."),
};
fr.lines().for_each(|l| match l {
Ok(l) => {
println!("{}", l);
}
Err(e) => {
println!("ERROR: Failed to read line:\n {}", e);
}
});
}
}
Any help is greatly appreciated!

It seems the archive_contents is preventing the BufRead object from returning to the caller. I'm just not sure how to work around this.
You have to restructure the code somehow. The issue here is that, well, the archive data is part of the archive. So unlike file, archive_file is not an independent item; it is rather a pointer of sorts into the archive itself. This means the archive needs to live longer than archive_file for this code to be correct.
In a GC'd language this isn't an issue, archive_file has a reference to archive and will keep it alive however long it needs. Not so for Rust.
A simple way to fix this would be to just copy the data out of archive_file and into an owned buffer you can return to the caller. Another option might be to return a wrapper for (archive_contents, item_index), which would delegate the reading (this might be somewhat tricky though). Yet another would be to not have file_reader at all.

Thanks to @Masklinn for the direction! Here's the working solution using their suggestion.
file_reader.rs
use std::ffi::OsStr;
use std::fs::File;
use std::io::BufRead;
use std::io::BufReader;
use std::io::Cursor;
use std::io::Error;
use std::io::Read;
use std::path::Path;
use zip::read::ZipArchive;
pub fn file_reader(filename: &str) -> Result<Box<dyn BufRead>, Error> {
let path = Path::new(filename);
let file = match File::open(&path) {
Ok(file) => file,
Err(why) => return Err(why),
};
if path.extension() == Some(OsStr::new("zip")) {
let mut archive_contents = ZipArchive::new(file)?;
let mut archive_file = archive_contents.by_index(0)?;
// Read the contents of the file into a vec.
let mut data = Vec::new();
archive_file.read_to_end(&mut data)?;
// Wrap vec in a std::io::Cursor.
let cursor = Cursor::new(data);
Ok(Box::new(cursor))
} else {
// Processing non-ZIP file.
Ok(Box::new(BufReader::with_capacity(128 * 1024, file)))
}
}

While the solution you have settled on does work, it has a few disadvantages. One is that when you read from a zip file, you have to read the contents of the file you want to process into memory before proceeding, which might be impractical for a large file. Another is that you have to heap allocate the BufReader in either case.
Another possibly more idiomatic solution is to restructure your code, such that the BufReader does not need to be returned from the function at all - rather, structure your code so that it has a function that opens the file, which in turn calls a function that processes the file:
use std::ffi::OsStr;
use std::fs::File;
use std::io::BufRead;
use std::io::BufReader;
use std::path::Path;
pub fn process_file(filename: &str) -> Result<usize, String> {
let path = Path::new(filename);
let file = match File::open(&path) {
Ok(file) => file,
Err(why) => return Err(format!(
"ERROR: Could not open file, {}: {}",
path.display(),
why.to_string()
)),
};
if path.extension() == Some(OsStr::new("zip")) {
// Handling a zip file
let mut archive_contents = zip::ZipArchive::new(file).unwrap();
let mut buf_reader = BufReader::with_capacity(128 * 1024, archive_contents.by_index(0).unwrap());
process_reader(&mut buf_reader)
} else {
// Handling a plain file.
process_reader(&mut BufReader::with_capacity(128 * 1024, file))
}
}
pub fn process_reader(reader: &mut dyn BufRead) -> Result<usize, String> {
// Example, just count the number of lines
return Ok(reader.lines().count());
}
fn main() {
let mut files: Vec<String> = Vec::new();
files.push("/tmp/text_file.txt".to_string());
files.push("/tmp/zip_file.zip".to_string());
for f in files {
match process_file(&f) {
Ok(count) => println!("File {} Count: {}", &f, count),
Err(e) => println!("Error reading file: {}", e),
};
}
}
This way, you don't need any Boxes and you don't need to read the file into memory before processing it.
A drawback to this solution would be if you had multiple functions that need to be able to read from zip files. One way to handle that would be to define process_file to take a callback function to do the processing. First you would change the definition of process_file to be:
pub fn process_file<C>(filename: &str, process_reader: C) -> Result<usize, String>
where C: FnOnce(&mut dyn BufRead)->Result<usize, String>
The rest of the function body can be left unchanged. Now, process_reader can be passed into the function, like this:
process_file(&f, count_lines)
where count_lines would be the original simple function to count the lines, for instance.
This would also allow you to pass in a closure:
process_file(&f, |reader| Ok(reader.lines().count()))
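Putting the pieces together, the callback version might look like the sketch below. To stay free of external crates it handles only the plain-file branch (the zip branch would hand its reader to the callback in the same way). The name with_reader and the generic return type T are illustrative choices, not part of the answer above.

```rust
use std::fs::File;
use std::io::{self, BufRead, BufReader};
use std::path::Path;

/// Opens `path`, wraps it in a `BufReader`, and hands the reader to `process`.
/// The reader never leaves this function, so nothing borrowed has to outlive it.
pub fn with_reader<P, C, T>(path: P, process: C) -> io::Result<T>
where
    P: AsRef<Path>,
    C: FnOnce(&mut dyn BufRead) -> io::Result<T>,
{
    let file = File::open(path)?;
    let mut reader = BufReader::with_capacity(128 * 1024, file);
    process(&mut reader)
}

/// Example callback: count the lines, propagating the first read error.
pub fn count_lines(reader: &mut dyn BufRead) -> io::Result<usize> {
    let mut count = 0;
    for line in reader.lines() {
        line?;
        count += 1;
    }
    Ok(count)
}
```

A call such as with_reader("/tmp/text_file.txt", count_lines) then returns the line count or the underlying io::Error, and a closure can be passed in place of count_lines.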

Related

Make iterator of nested iterators

How could I pack the following code into a single iterator?
use std::io::{BufRead, BufReader};
use std::fs::File;
let file = BufReader::new(File::open("sample.txt").expect("Unable to open file"));
for line in file.lines() {
for ch in line.expect("Unable to read line").chars() {
println!("Character: {}", ch);
}
}
Naively, I’d like to have something like (I skipped unwraps)
let lines = file.lines().next();
Reader {
line: lines,
char: next().chars()
}
and iterate over Reader.char till hitting None, then refreshing Reader.line to a new line and Reader.char to the first character of the line. This doesn't seem to be possible though because Reader.char depends on the temporary variable.
Please notice that the question is about nested iterators, reading text files is used as an example.
You can use the flat_map() iterator utility to create a new iterator that can produce any number of items for each item in the iterator it's called on.
In this case, that's complicated by the fact that lines() returns an iterator of Results, so the Err case must be handled.
There's also the issue that .chars() references the original string to avoid an additional allocation, so you have to collect the characters into another iterable container.
Solving both issues results in this mess:
fn example() -> impl Iterator<Item=Result<char, std::io::Error>> {
let file = BufReader::new(File::open("sample.txt").expect("Unable to open file"));
file.lines().flat_map(|line| match line {
Err(e) => vec![Err(e)],
Ok(line) => line.chars().map(Ok).collect(),
})
}
If String gave us an into_chars() method we could avoid collect() here, but then we'd have differently-typed iterators and would need to use either Box<dyn Iterator> or something like either::Either.
Since you already use .expect() here, you can simplify a bit by using .expect() within the closure to avoid handling the Err case:
fn example() -> impl Iterator<Item=char> {
let file = BufReader::new(File::open("sample.txt").expect("Unable to open file"));
file.lines().flat_map(|line|
line.expect("Unable to read line").chars().collect::<Vec<_>>()
)
}
In the general case, flat_map() is usually quite easy. You just need to be mindful of whether you are iterating owned vs borrowed values; both cases have some sharp corners. In this case, iterating over owned String values makes using .chars() problematic. If we could iterate over borrowed str slices we wouldn't have to .collect().
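To illustrate the owned-versus-borrowed point: when the text is already in memory as a single string (an assumption made for this sketch, for example after std::fs::read_to_string), str::lines() yields borrowed &str slices, so chars() can be flat-mapped directly with no intermediate Vec. The function name chars_of is made up for the example.

```rust
/// Flattens the lines of an in-memory text into a single iterator of chars.
/// Each `&str` line borrows from `text`, so the `Chars` iterators stay valid
/// for as long as `text` does and nothing has to be collected.
/// Line terminators are dropped, as with `BufRead::lines`.
pub fn chars_of(text: &str) -> impl Iterator<Item = char> + '_ {
    text.lines().flat_map(|line| line.chars())
}
```

The trade-off is that the whole input must be held in memory, whereas the BufReader versions above stream it line by line.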
Drawing on the answer from @cdhowie and this answer that suggests using IntoIter to get an iterator of owned chars, I was able to come up with this solution that is the closest to what I expected:
use std::fs::File;
use std::io;
use std::io::{BufRead, BufReader, Lines};
use std::vec::IntoIter;
struct Reader {
lines: Lines<BufReader<File>>,
iter: IntoIter<char>,
}
impl Reader {
fn new(filename: &str) -> Self {
let file = BufReader::new(File::open(filename).expect("Unable to open file"));
let mut lines = file.lines();
let iter = Reader::char_iter(lines.next().expect("Unable to read file"));
Reader { lines, iter }
}
fn char_iter(line: io::Result<String>) -> IntoIter<char> {
line.unwrap().chars().collect::<Vec<_>>().into_iter()
}
}
impl Iterator for Reader {
type Item = char;
fn next(&mut self) -> Option<Self::Item> {
match self.iter.next() {
None => {
self.iter = match self.lines.next() {
None => return None,
Some(line) => Reader::char_iter(line),
};
Some('\n')
}
Some(val) => Some(val),
}
}
}
It works as expected:
let reader = Reader::new("src/main.rs");
for ch in reader {
print!("{}", ch);
}
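The same idea can be written generically over any BufRead, which also makes it easy to try without a file on disk. The sketch below is a variation rather than the code above: the name CharReader is hypothetical, read errors are surfaced as Err items instead of panicking, empty input yields nothing, and a '\n' is emitted after every line (including the last), which differs slightly from the version above.

```rust
use std::io::{self, BufRead, Lines};
use std::vec::IntoIter;

/// Iterates over the characters of every line produced by a `BufRead`.
pub struct CharReader<R: BufRead> {
    lines: Lines<R>,
    current: IntoIter<char>,
}

impl<R: BufRead> CharReader<R> {
    pub fn new(reader: R) -> Self {
        // Start with an empty character buffer; the first call to `next`
        // pulls the first line.
        CharReader { lines: reader.lines(), current: Vec::new().into_iter() }
    }
}

impl<R: BufRead> Iterator for CharReader<R> {
    type Item = io::Result<char>;

    fn next(&mut self) -> Option<Self::Item> {
        loop {
            if let Some(ch) = self.current.next() {
                return Some(Ok(ch));
            }
            // Current line exhausted: fetch the next one, or stop at end of input.
            match self.lines.next()? {
                Ok(line) => {
                    let mut chars: Vec<char> = line.chars().collect();
                    chars.push('\n'); // `lines()` strips the terminator
                    self.current = chars.into_iter();
                }
                Err(e) => return Some(Err(e)),
            }
        }
    }
}
```

Wrapping a file is then CharReader::new(BufReader::new(File::open(path)?)), and an in-memory std::io::Cursor works for experiments.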

Read lines in Rust with match result

I'm pretty new to Rust and trying to implement some basic stuff.
One of the samples from the docs is about reading lines from a text file: https://doc.rust-lang.org/rust-by-example/std_misc/file/read_lines.html
The sample code works (obviously...), but when I modified it to add some error handling, the compiler complained about the return type I declared:
// the result after -> is not valid...
pub fn read_lines<P>(filename: P) -> std::result::Result<std::io::Lines<std::io::BufReader<std::fs::File>>, std::io::Error>
where P: AsRef<Path>, {
let file = File::open(filename);
let file = match file {
Ok(file) => io::BufReader::new(file).lines(),
Err(error) => panic!("Problem opening the file: {:?}", error),
};
}
// this is fine
pub fn read_lines2<P>(filename: P) -> io::Result<io::Lines<io::BufReader<File>>>
where P: AsRef<Path>, {
let file = File::open(filename)?;
Ok(io::BufReader::new(file).lines())
}
I've tried different suggestions from IntelliSense but no luck.
Any idea how to define the return type when there is an Ok/Err?
If I understand the intent of your code correctly, here is a more canonical version:
use std::fs::File;
use std::io::{self, prelude::*};
use std::path::Path;
pub fn read_lines(filename: &Path) -> io::Lines<io::BufReader<File>> {
let file = File::open(filename).expect("Problem opening the file");
io::BufReader::new(file).lines()
}
If you want the caller to handle any errors, return something like io::Result<T> (an alias for std::result::Result<T, io::Error>), indicating a possible failure. However, if you decide to handle the error inside the function using something like panic!() or Result::expect(), then there is no need to return a Result to the caller since the function only returns to the caller if no error occurred.
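For the first option, where the caller handles the error, the two sides might look like the following sketch. The helper name first_line and the choice to return Ok(None) for an empty file are assumptions made for illustration.

```rust
use std::fs::File;
use std::io::{self, BufRead, BufReader};
use std::path::Path;

/// Returns the line iterator, or the `io::Error` from opening the file.
pub fn read_lines<P: AsRef<Path>>(filename: P) -> io::Result<io::Lines<BufReader<File>>> {
    let file = File::open(filename)?; // `?` returns early with the error
    Ok(BufReader::new(file).lines())
}

/// Caller side: propagates failures instead of panicking.
/// `Ok(None)` means the file was opened but contains no lines.
pub fn first_line<P: AsRef<Path>>(filename: P) -> io::Result<Option<String>> {
    let mut lines = read_lines(filename)?;
    // Turn Option<io::Result<String>> into io::Result<Option<String>>.
    lines.next().transpose()
}
```

The caller can then match on the result, log the error, or forward it with another `?`, which is not possible once the function has panicked.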
Is this what you were looking for?
pub fn read_lines<P>(filename: P) -> std::result::Result<std::io::Lines<std::io::BufReader<std::fs::File>>, std::io::Error>
where P: AsRef<Path>, {
let file = File::open(filename);
match file {
Ok(file) => Ok(io::BufReader::new(file).lines()),
Err(error) => panic!("Problem opening the file: {:?}", error),
}
}
I removed the let file = and ; to enable implicit return, and added an Ok() wrapper around the happy path.

How do I create a single stream from a vector of Files?

I'm trying to read multiple files and I'd like to create a stream from a vector of filepaths. I've been fighting with the compiler for some time and I'm not sure how to make this work:
fn formatted_tags_stream(
args: &[&str],
files: &Vec<PathBuf>,
) -> Result<impl Iterator<Item = String>> {
// Transform list of paths into a list of files
let files: Vec<File> = files.into_iter().map(|f| File::open(f)).flatten().collect();
// Here is where I'm stuck :/
let stream =
files
.into_iter()
.skip(1)
.fold(BufReader::new(files[0]), |acc, mut f| acc.chain(f));
Ok(BufReader::new(stream).lines().filter_map(|line| {
line.ok().and_then(|tag| {
if let Ok(tag) = serde_json::from_str::<TagInfo>(&tag) {
Some(tag.format())
} else {
None
}
})
}))
}
If you use fold, the closure must always return the same type as the accumulator. Written the way you have it, however, the closure would take a type F and return a Chain<F, ...> in the first pass. The next pass would need to take a Chain<F, ...> and return a Chain<Chain<F, ...>, ...>, leading to "different types per iteration". This cannot work, as Rust wants to know the exact type, and the type has to stay the same.
However, you could type-erase the thing and hide it behind a pointer (i.e. a Box, using "trait objects"). See here (I made some minor adjustments to make it compile):
use std::path::PathBuf;
use std::fs::File;
use std::io::BufReader;
use std::io::Read;
use std::io::BufRead;
fn formatted_tags_stream(
args: &[&str],
files: &Vec<PathBuf>,
) -> Result<impl Iterator<Item = String>, ()> {
// Transform list of paths into a list of files
let files: Vec<File> = files.into_iter().map(|f| File::open(f)).flatten().collect();
// Here is where I'm stuck :/
let bufreader = Box::new(std::io::empty()) as Box<dyn Read>;
let stream =
files
.into_iter()
.fold(bufreader, |acc, f| Box::new(acc.chain(f)) as Box<dyn Read>);
Ok(BufReader::new(stream).lines().filter_map(|line| {
line.ok().and_then(|tag| {
if let Ok(_tag) = tag.parse::<usize>() {
Some(tag)
} else {
None
}
})
}))
}
Note that using a Box<dyn Read> incurs some runtime overhead, as it leads to dynamic dispatch.
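If each file can be treated as its own sequence of lines, an alternative that avoids the boxed trait object is to flat-map over the readers instead of chaining them. This is a sketch under that assumption, with a made-up function name all_lines. It differs from byte-level chaining when a file lacks a trailing newline, because chaining would join that file's last line with the next file's first line.

```rust
use std::io::{BufRead, BufReader, Read};

/// Yields the lines of each reader in turn.
/// Reading from a given reader stops at its first I/O error
/// (an assumption of this sketch; the error itself is discarded).
pub fn all_lines<R: Read>(readers: Vec<R>) -> impl Iterator<Item = String> {
    readers
        .into_iter()
        .flat_map(|r| BufReader::new(r).lines().map_while(Result::ok))
}
```

With a Vec<File> as input this replaces the fold, and the per-line parsing can be added with a further filter_map as in the code above.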

Does Read::read guarantee to append data and not overwrite any existing one?

I'm working on an SMTP library that reads lines over the network using a buffered reader.
I want a nice, safe way to read data from the network, without depending on Rust internals to make sure the code works as expected. Specifically, I'm wondering if the Read trait guarantees that data read with Read::read is appended to the buffer passed as an argument rather than overwriting the buffer entirely.
At the moment, I use a Range to make sure existing data is not overwritten without depending on Rust internals.
However, given that Rust used to have a nice way to do what I want, I'm wondering if the current code can be improved, possibly removing the unsafe blocks too.
No, it does not guarantee that:
use std::io::prelude::*;
use std::str;
fn main() {
let mut source1 = "hello, world!".as_bytes();
let mut source2 = "moo".as_bytes();
let mut dest = [0; 128];
source1.read(&mut dest).unwrap();
source2.read(&mut dest).unwrap();
let s = str::from_utf8(&dest[..16]).unwrap();
println!("{:?}", s)
}
This prints
"moolo, world!\u{0}\u{0}\u{0}"
Specifically, it cannot do what you want, based purely on the type signature:
fn read(&mut self, buf: &mut [u8]) -> Result<usize>;
All that the read method has access to is your mutable slice - there's nowhere to store information like "how far in the buffer you are". Furthermore, you aren't allowed to "extend" a mutable slice with more elements - you are only allowed to mutate the values within the slice.
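To get append-like behaviour with a fixed buffer, the caller has to track how many bytes are already filled and pass only the unfilled tail to read. A minimal sketch follows; the helper name read_append and the filled counter are illustrative. When a growable buffer is acceptable, Read::read_to_end appends to a Vec<u8> instead.

```rust
use std::io::{self, Read};

/// Reads once from `source` into the unused tail of `buf`, leaving the first
/// `filled` bytes untouched. Returns the new number of filled bytes.
/// Panics if `filled` is greater than `buf.len()`.
pub fn read_append<R: Read>(source: &mut R, buf: &mut [u8], filled: usize) -> io::Result<usize> {
    // Only the sub-slice after `filled` is visible to `read`,
    // so the earlier bytes cannot be overwritten.
    let n = source.read(&mut buf[filled..])?;
    Ok(filled + n)
}
```

Calling it twice with the running filled value places the second read directly after the first instead of on top of it.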
For your particular case, you may want to look at BufRead::read_until. Here's a barely-tested example:
use std::io::{BufRead,BufReader};
use std::str;
fn main() {
let source1 = "header 1\r\nheader 2\r\n".as_bytes();
let mut reader = BufReader::new(source1);
let mut buf = vec![];
buf.reserve(128); // Maybe more efficient?
loop {
match reader.read_until(b'\n', &mut buf) {
Ok(0) => break,
Ok(_) => {},
Err(_) => panic!("Handle errors"),
}
if buf.len() < 2 { continue }
if buf[buf.len() - 2] == b'\r' {
{
let s = str::from_utf8(&buf).unwrap();
println!("Got a header {:?}", s);
}
buf.clear();
}
}
}

Delegating the creation of data structures

I’m very new to Rust. While trying out small things, I have written the following code. It simply scans files (given as arguments) for a specific string (“Started “) and prints out the matching lines:
use std::os;
use std::io::BufferedReader;
use std::io::File;
fn main() {
for target in os::args().iter() {
scan_file(target);
}
}
fn scan_file(path_str: &String) {
let path = Path::new(path_str.as_bytes());
let file = File::open(&path);
let mut reader = BufferedReader::new(file);
for line in reader.lines() {
match line {
Ok(s) => {
if s.as_slice().contains("Started ") {
print!("{}", s);
}
}
Err(_) => return,
}
}
}
My question is: how can I refactor the function scan_file so that it looks something like this (or similar enough)?:
fn scan_file(path_str: &String) {
for line in each_line_in_file_with_path(path_str) {
match line {
Ok(s) => {
if s.as_slice().contains("Started ") {
print!("{}", s);
}
}
Err(_) => return,
}
}
}
In this new version of the function, the three variable declarations are gone. Instead, the function each_line_in_file_with_path is expected to handle all the “turn a path into lines”, returning an iterator.
I’ve tried a number of things unsuccessfully, always due to variables going out of scope too early for my needs. I understand the problems I have (I think), but can’t find anywhere a good explanation of how this should be handled.
It is not possible to implement a working each_line_in_file_with_path function — at least, not without adding some overhead and unsafe code.
Let's look at the values involved and their types. First is path, of type Path (either posix::Path or windows::Path). The constructors for these types receive a BytesContainer by value, therefore they take ownership of it. No issues here.
Next is file, of type IoResult<File>. File::open() clones the path it receives, so again, no issues here.
Next is reader, of type BufferedReader<IoResult<File>>. Just like Path, the constructor for BufferedReader takes its argument by value and takes ownership of it.
The problem is with reader.lines(). This value is of type Lines<'r, T: 'r>. As the type signature suggests, this struct contains a borrowed reference. The signature of lines shows the relationship between the loaner and the borrower:
fn lines<'r>(&'r mut self) -> Lines<'r, Self>
How do we define each_line_in_file_with_path now? each_line_in_file_with_path cannot return a Lines directly. You probably tried writing the function like this:
fn each_line_in_file_with_path<'a, T>(path: &T) -> Lines<'a, BufferedReader<IoResult<File>>>
where T: BytesContainer {
let path = Path::new(path);
let file = File::open(&path);
let reader = BufferedReader::new(file);
reader.lines()
}
This gives a compilation error:
main.rs:46:5: 46:11 error: `reader` does not live long enough
main.rs:46 reader.lines()
^~~~~~
main.rs:42:33: 47:2 note: reference must be valid for the lifetime 'a as defined on the block at 42:32...
main.rs:42 where T: BytesContainer {
main.rs:43 let path = Path::new(path);
main.rs:44 let file = File::open(&path);
main.rs:45 let reader = BufferedReader::new(file);
main.rs:46 reader.lines()
main.rs:47 }
main.rs:42:33: 47:2 note: ...but borrowed value is only valid for the block at 42:32
main.rs:42 where T: BytesContainer {
main.rs:43 let path = Path::new(path);
main.rs:44 let file = File::open(&path);
main.rs:45 let reader = BufferedReader::new(file);
main.rs:46 reader.lines()
main.rs:47 }
error: aborting due to previous error
That's because we're trying to return a Lines that refers to a BufferedReader that ceases to exist when the function returns (the Lines would contain a dangling pointer).
Now, one might think, “I'll just return the BufferedReader along with the Lines”.
struct LinesInFileIterator<'a> {
reader: BufferedReader<IoResult<File>>,
lines: Lines<'a, BufferedReader<IoResult<File>>>
}
impl<'a> Iterator<IoResult<String>> for LinesInFileIterator<'a> {
fn next(&mut self) -> Option<IoResult<String>> {
self.lines.next()
}
}
fn each_line_in_file_with_path<'a, T>(path: &T) -> LinesInFileIterator<'a>
where T: BytesContainer {
let path = Path::new(path);
let file = File::open(&path);
let reader = BufferedReader::new(file);
LinesInFileIterator {
reader: reader,
lines: reader.lines()
}
}
This doesn't work either:
main.rs:46:16: 46:22 error: `reader` does not live long enough
main.rs:46 lines: reader.lines()
^~~~~~
main.rs:40:33: 48:2 note: reference must be valid for the lifetime 'a as defined on the block at 40:32...
main.rs:40 where T: BytesContainer {
main.rs:41 let path = Path::new(path);
main.rs:42 let file = File::open(&path);
main.rs:43 let reader = BufferedReader::new(file);
main.rs:44 LinesInFileIterator {
main.rs:45 reader: reader,
...
main.rs:40:33: 48:2 note: ...but borrowed value is only valid for the block at 40:32
main.rs:40 where T: BytesContainer {
main.rs:41 let path = Path::new(path);
main.rs:42 let file = File::open(&path);
main.rs:43 let reader = BufferedReader::new(file);
main.rs:44 LinesInFileIterator {
main.rs:45 reader: reader,
...
main.rs:46:16: 46:22 error: use of moved value: `reader`
main.rs:46 lines: reader.lines()
^~~~~~
main.rs:45:17: 45:23 note: `reader` moved here because it has type `std::io::buffered::BufferedReader<core::result::Result<std::io::fs::File, std::io::IoError>>`, which is non-copyable
main.rs:45 reader: reader,
^~~~~~
error: aborting due to 2 previous errors
Basically, we can't have a struct that contains a borrowed reference that points to another member of the struct, because when the struct is moved, the reference would become invalid.
There are two solutions:
1. Make a function that returns a BufferedReader from a file path, and call .lines() on it in your for loop.
2. Make a function that accepts a closure that receives each line:
fn main() {
for target in os::args().iter() {
scan_file(target.as_slice());
}
}
fn for_each_line_in_file_with_path_do(path: &str, action: |IoResult<String>|) {
let path = Path::new(path.as_bytes());
let file = File::open(&path);
let mut reader = BufferedReader::new(file);
for line in reader.lines() {
action(line);
}
}
fn scan_file(path_str: &str) {
for_each_line_in_file_with_path_do(path_str, |line| {
match line {
Ok(s) => {
if s.as_slice().contains("Started ") {
print!("{}", s);
}
}
Err(_) => return,
}
});
}
You won't be able to do it without some boilerplate. You need to have some source of data, and because iterators return their data in chunks, they either have to contain the data or have a reference into some other source of this data (this also includes iterators which return data from an external source, e.g. lines in a file).
However, because you want to "encapsulate" your iterator in a function call, this iterator cannot be of the second kind, i.e. it cannot contain references, because all references it could contain would point into this function call's stack frame. Consequently, the iterator's source can only be contained in the iterator itself.
And this is the boilerplate problem - in general there is no such iterator in the standard library. You will need to create it yourself. In this particular case, though, you can get away without implementing Iterator trait manually. You only need to create some simple structural wrapper:
use std::os;
use std::io::{BufferedReader, File, Lines};
fn main() {
for target in os::args().iter() {
scan_file(target.as_slice());
}
}
struct FileLines {
source: BufferedReader<File>
}
impl FileLines {
fn new(path_str: &str) -> FileLines {
let path = Path::new(path_str.as_bytes());
let file = File::open(&path).unwrap();
let reader = BufferedReader::new(file);
FileLines { source: reader }
}
fn lines(&mut self) -> Lines<BufferedReader<File>> {
self.source.lines()
}
}
fn scan_file(path_str: &str) {
for line in FileLines::new(path_str).lines() {
match line {
Ok(s) => {
if s.as_slice().contains("Started ") {
print!("{}", s);
}
}
Err(_) => return,
}
}
}
(I also changed &String to &str because it is more idiomatic and general)
The FileLines structure owns the data and encapsulates all of the complex logic in its constructor. Then its lines() method just returns an iterator into its internals. This is a rather common pattern in Rust, and usually you will be able to find the main owner of your data and build your program around it with methods which return iterators/references into this owner.
This is not exactly what you wanted (there are two function calls in for loop initializer - new() and lines()), but I believe that for all practical purposes they have the same expressiveness and usability.
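The answers above were written for a pre-1.0 version of Rust, where lines() borrowed the reader. In current Rust, BufRead::lines takes the reader by value, so the returned Lines owns it and the function the question asks for can be written directly. The sketch below keeps the question's function name; returning io::Result instead of panicking, and having scan_file collect the matching lines rather than print them, are assumptions made for the example.

```rust
use std::fs::File;
use std::io::{self, BufRead, BufReader, Lines};
use std::path::Path;

/// Turns a path into an iterator of lines. The `Lines` value owns the
/// `BufReader`, which owns the `File`, so nothing is borrowed from this scope.
pub fn each_line_in_file_with_path<P: AsRef<Path>>(
    path: P,
) -> io::Result<Lines<BufReader<File>>> {
    let file = File::open(path)?;
    Ok(BufReader::new(file).lines())
}

/// Collects the lines containing "Started " (without their line terminators).
pub fn scan_file(path: &str) -> io::Result<Vec<String>> {
    let mut matches = Vec::new();
    for line in each_line_in_file_with_path(path)? {
        let line = line?;
        if line.contains("Started ") {
            matches.push(line);
        }
    }
    Ok(matches)
}
```

This gives the single-call for loop from the question, with the three local declarations moved into the helper.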

Resources