Read a file (not UTF-8) line by line? - Rust

Is it possible to read a file line by line if it is not UTF-8 encoded, using std::io::File and std::io::BufReader?
I looked at std::io::Lines and it returns Result<String>, so
I worry: do I have to implement my own BufReader that does the same but returns Vec<u8> instead, or can I reuse std::io::BufReader in some way?

You do not have to re-implement BufReader itself; it provides exactly the method you need for your use case, read_until:
fn read_until(&mut self, byte: u8, buf: &mut Vec<u8>) -> Result<usize>
You supply your own Vec<u8>, and the content of the file will be appended to it until byte is encountered (0x0A being LF).
There are several potential gotchas:
the buffer may end not only with a LF byte, but with a CR LF sequence,
it is up to you to clear buf between subsequent calls.
A simple loop calling reader.read_until(b'\n', &mut buffer), stopping when it returns Ok(0) (end of file), should let you read your file easily enough; see the sketch below.
You may consider implementing a std::io::Lines equivalent that converts from the source encoding to UTF-8 to provide a nicer API, though it will have a performance cost.
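A minimal sketch of that loop (the path /tmp/file.txt is just a placeholder):
use std::fs::File;
use std::io::{BufRead, BufReader};

fn main() -> std::io::Result<()> {
    let file = File::open("/tmp/file.txt")?; // placeholder path
    let mut reader = BufReader::new(file);
    let mut buffer: Vec<u8> = Vec::new();
    // read_until returns Ok(0) at end of file, which ends the loop.
    while reader.read_until(b'\n', &mut buffer)? > 0 {
        // Handle both LF and CR LF line endings.
        while buffer.last() == Some(&b'\n') || buffer.last() == Some(&b'\r') {
            buffer.pop();
        }
        // `buffer` now holds one line as raw bytes in the file's encoding.
        println!("read a line of {} bytes", buffer.len());
        buffer.clear(); // clearing between calls is up to the caller
    }
    Ok(())
}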

To read files (UTF-8 or not) line by line, I use:
/*
// Cargo.toml:
[dependencies]
encoding_rs = "0.8"
encoding_rs_io = "0.1.7"
...
*/
use encoding_rs::WINDOWS_1252;
use encoding_rs_io::DecodeReaderBytesBuilder;
use std::{
    fs,
    io::{BufRead, BufReader, Read},
};
fn main() {
    let file_path: &str = "/tmp/file.txt";
    let buffer: Box<dyn BufRead> = read_file(file_path);
    for (index, result_vec_bytes) in buffer.split(b'\n').enumerate() {
        let line_number: usize = index + 1;
        let line_utf8: String = get_string_utf8(result_vec_bytes, line_number);
        println!("{line_utf8}");
    }
}
Such that:
fn read_file(file_path: &str) -> Box<dyn BufRead> {
    let file = match fs::File::open(file_path) {
        Ok(f) => f,
        Err(why) => panic!("Problem opening the file: \"{file_path}\"\n{why:?}"),
    };
    Box::new(BufReader::new(file))
}
And
fn get_string_utf8(result_vec_bytes: Result<Vec<u8>, std::io::Error>, line_number: usize) -> String {
    let vec_bytes: Vec<u8> = match result_vec_bytes {
        Ok(values) => values,
        Err(why) => panic!("Failed to read line nº {line_number}: {why}"),
    };
    // from_utf8() checks that the bytes are valid UTF-8
    let line_utf8: String = match std::str::from_utf8(&vec_bytes) {
        Ok(str) => str.to_string(),
        Err(_) => {
            let mut data = DecodeReaderBytesBuilder::new()
                .encoding(Some(WINDOWS_1252))
                .build(vec_bytes.as_slice());
            let mut buffer = String::new();
            let _number_of_bytes = match data.read_to_string(&mut buffer) {
                Ok(num) => num,
                Err(why) => {
                    eprintln!("Problem reading data from file in buffer!");
                    eprintln!("Line nº {line_number}");
                    eprintln!("Used encoding type: WINDOWS_1252.");
                    eprintln!("Try another encoding type!");
                    panic!("Failed to convert data to UTF-8!: {why}")
                },
            };
            buffer
        }
    };
    // remove the Windows line ending: "\r\n"
    line_utf8.trim_end_matches('\r').to_string()
}

Related

How to read a text file in Rust and read multiple values per line

So basically, I have a text file with the following syntax:
String int
String int
String int
I have an idea of how to read the values if there is only one entry per line, but if there are multiple, I do not know how to do it.
In Java, I would do something simple with while and Scanner, but in Rust I have no clue.
I am fairly new to Rust so please help me.
Thanks for your help in advance
Solution
Here is my modified solution based on @netwave's code:
use std::fs;
use std::io::{BufRead, BufReader, Error};

fn main() -> Result<(), Error> {
    // requires the sscanf crate as a dependency
    let buff_reader = BufReader::new(fs::File::open("data.txt")?);
    for line in buff_reader.lines() {
        let parsed = sscanf::scanf!(line?, "{} {}", String, i32);
        println!("{:?}\n", parsed);
    }
    Ok(())
}
You can use the BufRead trait, which has a read_line method. You can also use lines().
To do so, the easiest option is to wrap the File instance in a BufReader:
use std::fs;
use std::io::{BufRead, BufReader};
...
let mut buff_reader = BufReader::new(fs::File::open(path)?);
loop {
    let mut buff = String::new();
    if buff_reader.read_line(&mut buff)? == 0 {
        break; // read_line returns Ok(0) at end of file
    }
    println!("{}", buff);
}
Playground
Once you have each line, you can use the sscanf crate to parse it into the types you need:
let parsed = sscanf::scanf!(buff, "{} {}", String, i32);
Based on: https://doc.rust-lang.org/rust-by-example/std_misc/file/read_lines.html
For a data.txt containing:
str1 100
str2 200
str3 300
use std::fs::File;
use std::io::{self, BufRead};
use std::path::Path;

fn main() {
    // File hosts must exist in current path before this produces output
    if let Ok(lines) = read_lines("./data.txt") {
        // Consumes the iterator, returns an (Optional) String
        for line in lines {
            if let Ok(data) = line {
                let values: Vec<&str> = data.split(' ').collect();
                match values.len() {
                    2 => {
                        let strdata = values[0].parse::<String>();
                        let intdata = values[1].parse::<i32>();
                        println!("Got: {:?} {:?}", strdata, intdata);
                    },
                    _ => panic!("Invalid input line {}", data),
                };
            }
        }
    }
}

// The output is wrapped in a Result to allow matching on errors.
// Returns an Iterator to the Reader of the lines of the file.
fn read_lines<P>(filename: P) -> io::Result<io::Lines<io::BufReader<File>>>
where P: AsRef<Path> {
    let file = File::open(filename)?;
    Ok(io::BufReader::new(file).lines())
}
Outputs:
Got: Ok("str1") Ok(100)
Got: Ok("str2") Ok(200)
Got: Ok("str3") Ok(300)

How to use actix field stream by two consumers?

I have an actix web service and would like to parse the contents of a multipart field while streaming with async-gcode, and in addition store the contents, e.g. in a database.
However, I have no clue how to feed the stream into the Parser and at the same time collect the bytes into a Vec<u8> or a String.
The first problem I face is that field is a stream of actix::web::Bytes and not of u8.
#[post("/upload")]
pub async fn upload_job(
mut payload: Multipart,
) -> Result<HttpResponse, Error> {
let mut contents : Vec<u8> = Vec::new();
while let Ok(Some(mut field)) = payload.try_next().await {
let content_disp = field.content_disposition().unwrap();
match content_disp.get_name().unwrap() {
"file" => {
while let Some(chunk) = field.next().await {
contents.append(&mut chunk.unwrap().to_vec());
// already parse the contents
// and additionally store contents somewhere
}
}
_ => (),
}
}
Ok(HttpResponse::Ok().finish())
}
Any hint or suggestion is very much appreciated.
One of the options is to wrap field in a struct and implement the Stream trait for it.
use actix_web::{HttpRequest, HttpResponse, Error};
use futures_util::stream::Stream;
use std::pin::Pin;
use actix_multipart::{Multipart, Field};
use futures::stream::{self, StreamExt};
use futures_util::TryStreamExt;
use std::task::{Context, Poll};
use async_gcode::{Parser, Error as PError};
use bytes::BytesMut;
use std::cell::RefCell;
use std::rc::Rc;

pub struct Wrapper {
    field: Field,
    // Shared with the handler, so the collected bytes remain accessible
    // after the wrapper has been consumed by the parser.
    buffer: Rc<RefCell<BytesMut>>,
    index: usize,
}

impl Wrapper {
    pub fn new(field: Field, buffer: Rc<RefCell<BytesMut>>) -> Self {
        buffer.borrow_mut().truncate(0);
        Wrapper {
            field,
            buffer,
            index: 0,
        }
    }
}
impl Stream for Wrapper {
    type Item = Result<u8, PError>;
    fn poll_next(
        mut self: Pin<&mut Self>,
        cx: &mut Context<'_>,
    ) -> Poll<Option<Result<u8, PError>>> {
        loop {
            // Hand out the next buffered byte, if any.
            if self.index < self.buffer.borrow().len() {
                let b = self.buffer.borrow()[self.index];
                self.index += 1;
                return Poll::Ready(Some(Ok(b)));
            }
            // Buffer exhausted: refill it from the field.
            match Pin::new(&mut self.field).poll_next(cx) {
                Poll::Ready(Some(Ok(chunk))) => self.buffer.borrow_mut().extend_from_slice(&chunk),
                Poll::Pending => return Poll::Pending,
                Poll::Ready(None) => return Poll::Ready(None),
                Poll::Ready(Some(Err(_))) => {
                    return Poll::Ready(Some(Err(PError::BadNumberFormat/* ??? */)))
                }
            }
        }
    }
}
#[post("/upload")]
pub async fn upload_job(
mut payload: Multipart,
) -> Result<HttpResponse, Error> {
while let Ok(Some(field)) = payload.try_next().await {
let content_disp = field.content_disposition().unwrap();
match content_disp.get_name().unwrap() {
"file" => {
let mut contents: RefCell<BytesMut> = RefCell::new(BytesMut::new());
let mut w = Wrapper::new(field, contents.clone());
let mut p = Parser::new(w);
while let Some(res) = p.next().await {
// Do something with results
};
// Do something with the buffer
let a = contents.get_mut()[0];
}
_ => (),
}
}
Ok(HttpResponse::Ok().finish())
}
Copying the Bytes from the Field won't be necessary once
Bytes::try_unsplit is implemented. (https://github.com/tokio-rs/bytes/issues/287)
The answer from dmitryvm (thanks for your effort) showed me that there are actually two problems: first, flattening the Bytes into u8's, and second, "splitting" the stream into a buffer for later storage and into the async-gcode parser.
This shows how I solved it:
#[post("/upload")]
pub async fn upload_job(
mut payload: Multipart,
) -> Result<HttpResponse, Error> {
let mut contents : Vec<u8> = Vec::new();
while let Ok(Some(mut field)) = payload.try_next().await {
let content_disp = field.content_disposition().unwrap();
match content_disp.get_name().unwrap() {
"file" => {
let field_stream = field
.map_err(|_| async_gcode::Error::BadNumberFormat) // Translate error
.map_ok(|y| { // Translate Bytes into stream with Vec<u8>
contents.extend_from_slice(&y); // Copy and store for later usage
stream::iter(y).map(Result::<_, async_gcode::Error>::Ok)
})
.try_flatten(); // Flatten the streams of u8's
let mut parser = Parser::new(field_stream);
while let Some(gcode) = parser.next().await {
// Process result from parser
}
}
_ => (),
}
}
Ok(HttpResponse::Ok().finish())
}

How do I convert from &alloc::string::String to a string literal?

The title says it all, really. I need to convert from &alloc::string::String to a string literal (&str, I think), according to the error I am getting when trying to write to a file. How do I convert what I have into that?
The overall goal here is to read from one file and append to another, line by line.
Full code:
use std::{
    fs::File,
    io::{self, BufRead, BufReader},
    fs::OpenOptions,
    fs::write,
    any::type_name,
    path::Path,
    io::Write,
};

fn type_of<T>(_: T) -> &'static str {
    type_name::<T>()
}

fn main() {
    let inpath = Path::new("tool_output.txt");
    let outpath = Path::new("test_output.txt");
    let indisplay = inpath.display();
    let outdisplay = outpath.display();
    let mut infile = match File::open(&inpath) {
        Err(why) => panic!("couldn't open {}: {}", indisplay, why),
        Ok(infile) => infile,
    };
    let mut outfile = match OpenOptions::new().write(true).append(true).open(&outpath) {
        Err(why) => panic!("couldn't open {}: {}", outdisplay, why),
        Ok(outfile) => outfile,
    };
    let reader = BufReader::new(infile);
    for line in reader.lines() {
        let format_line = String::from(line.unwrap()); // <- I thought this would fix the error, but it didn't.
        println!("Type = {}", type_of(&format_line));
        let _ = writeln!(outfile, &format_line).expect("Unable to write to file"); // <- this is currently causing the error.
        //write("test_output.txt", line.unwrap()).expect("Unable to write to file");
    }
}
error:
error: format argument must be a string literal
--> text_edit.rs:36:28
|
36 | let _ = writeln!(outfile, format_line).expect("Unable to write to file");
| ^^^^^^^^^^^
|
A string literal is what it says: a literal, so "literal" is a string literal. To use the writeln! macro to write a string, you have to do writeln!(outfile, "{}", line), where "{}" is the format string literal. If you've ever used the println! macro, writeln! is basically that, except you specify which stream to write to.
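Applied to the loop from the question, a minimal corrected sketch (same file names as the question) would be:
use std::fs::{File, OpenOptions};
use std::io::{BufRead, BufReader, Write};

fn main() {
    let infile = File::open("tool_output.txt").expect("couldn't open input");
    let mut outfile = OpenOptions::new()
        .append(true)
        .open("test_output.txt")
        .expect("couldn't open output");
    for line in BufReader::new(infile).lines() {
        let format_line = line.expect("couldn't read line");
        // "{}" is the string literal the macro wants; the String is passed as an argument.
        writeln!(outfile, "{}", format_line).expect("Unable to write to file");
    }
}
As for the title question: you never need to turn a String into a literal. A &String coerces to &str wherever one is expected (or use format_line.as_str()); the error was only about the macro's format argument.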

Reading ZIP file in Rust causes data owned by the current function

I'm new to Rust and likely have a huge knowledge gap. Basically, I'm hoping to create a utility function that would accept a regular text file or a ZIP file and return a BufRead that the caller can process line by line. It works well for non-ZIP files, but I don't understand how to achieve the same for ZIP files. The ZIP files will only contain a single file within the archive, which is why I'm only processing the first file in the ZipArchive.
I'm running into the following error.
error[E0515]: cannot return value referencing local variable `archive_contents`
--> src/file_reader.rs:30:9
|
27 | let archive_file: zip::read::ZipFile = archive_contents.by_index(0).unwrap();
| ---------------- `archive_contents` is borrowed here
...
30 | Ok(Box::new(BufReader::with_capacity(128 * 1024, archive_file)))
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ returns a value referencing data owned by the current function
It seems the archive_contents is preventing the BufRead object from returning to the caller. I'm just not sure how to work around this.
file_reader.rs
use std::ffi::OsStr;
use std::fs::File;
use std::io::BufRead;
use std::io::BufReader;
use std::path::Path;

pub struct FileReader {
    pub file_reader: Result<Box<BufRead>, &'static str>,
}

pub fn file_reader(filename: &str) -> Result<Box<BufRead>, &'static str> {
    let path = Path::new(filename);
    let file = match File::open(&path) {
        Ok(file) => file,
        Err(why) => panic!(
            "ERROR: Could not open file, {}: {}",
            path.display(),
            why.to_string()
        ),
    };
    if path.extension() == Some(OsStr::new("zip")) {
        // Processing ZIP file.
        let mut archive_contents: zip::read::ZipArchive<std::fs::File> =
            zip::ZipArchive::new(file).unwrap();
        let archive_file: zip::read::ZipFile = archive_contents.by_index(0).unwrap();
        // ERRORS: returns a value referencing data owned by the current function
        Ok(Box::new(BufReader::with_capacity(128 * 1024, archive_file)))
    } else {
        // Processing non-ZIP file.
        Ok(Box::new(BufReader::with_capacity(128 * 1024, file)))
    }
}
main.rs
mod file_reader;
use std::io::BufRead;

fn main() {
    let mut files: Vec<String> = Vec::new();
    files.push("/tmp/text_file.txt".to_string());
    files.push("/tmp/zip_file.zip".to_string());
    for f in files {
        let mut fr = match file_reader::file_reader(&f) {
            Ok(fr) => fr,
            Err(e) => panic!("Error reading file."),
        };
        fr.lines().for_each(|l| match l {
            Ok(l) => {
                println!("{}", l);
            }
            Err(e) => {
                println!("ERROR: Failed to read line:\n {}", e);
            }
        });
    }
}
Any help is greatly appreciated!
It seems the archive_contents is preventing the BufRead object from returning to the caller. I'm just not sure how to work around this.
You have to restructure the code somehow. The issue here is that the archive data is part of the archive: unlike file, archive_file is not an independent item, it is rather a pointer of sorts into the archive itself. This means the archive needs to live longer than archive_file for this code to be correct.
In a GC'd language this isn't an issue: archive_file holds a reference to archive and will keep it alive as long as it needs. Not so for Rust.
A simple way to fix this is to copy the data out of archive_file into an owned buffer you can return to the caller. Another option might be to return a wrapper around (archive_contents, item_index), which would delegate the reading (this might be somewhat tricky, though). Yet another would be to not have file_reader at all.
Thanks to @Masklinn for the direction! Here's the working solution using their suggestion.
file_reader.rs
use std::ffi::OsStr;
use std::fs::File;
use std::io::BufRead;
use std::io::BufReader;
use std::io::Cursor;
use std::io::Error;
use std::io::Read;
use std::path::Path;
use zip::read::ZipArchive;

pub fn file_reader(filename: &str) -> Result<Box<dyn BufRead>, Error> {
    let path = Path::new(filename);
    let file = match File::open(&path) {
        Ok(file) => file,
        Err(why) => return Err(why),
    };
    if path.extension() == Some(OsStr::new("zip")) {
        let mut archive_contents = ZipArchive::new(file)?;
        let mut archive_file = archive_contents.by_index(0)?;
        // Read the contents of the file into a vec.
        let mut data = Vec::new();
        archive_file.read_to_end(&mut data)?;
        // Wrap the vec in a std::io::Cursor, which implements BufRead.
        let cursor = Cursor::new(data);
        Ok(Box::new(cursor))
    } else {
        // Processing non-ZIP file.
        Ok(Box::new(BufReader::with_capacity(128 * 1024, file)))
    }
}
While the solution you have settled on does work, it has a few disadvantages. One is that when you read from a zip file, you have to read the contents of the file you want to process into memory before proceeding, which might be impractical for a large file. Another is that you have to heap-allocate the BufReader in either case.
A possibly more idiomatic solution is to restructure your code so that the BufReader does not need to be returned from the function at all: have a function that opens the file, which in turn calls a function that processes it:
use std::ffi::OsStr;
use std::fs::File;
use std::io::BufRead;
use std::io::BufReader;
use std::path::Path;

pub fn process_file(filename: &str) -> Result<usize, String> {
    let path = Path::new(filename);
    let file = match File::open(&path) {
        Ok(file) => file,
        Err(why) => return Err(format!(
            "ERROR: Could not open file, {}: {}",
            path.display(),
            why.to_string()
        )),
    };
    if path.extension() == Some(OsStr::new("zip")) {
        // Handling a zip file
        let mut archive_contents = zip::ZipArchive::new(file).unwrap();
        let mut buf_reader =
            BufReader::with_capacity(128 * 1024, archive_contents.by_index(0).unwrap());
        process_reader(&mut buf_reader)
    } else {
        // Handling a plain file.
        process_reader(&mut BufReader::with_capacity(128 * 1024, file))
    }
}

pub fn process_reader(reader: &mut dyn BufRead) -> Result<usize, String> {
    // Example: just count the number of lines
    Ok(reader.lines().count())
}

fn main() {
    let mut files: Vec<String> = Vec::new();
    files.push("/tmp/text_file.txt".to_string());
    files.push("/tmp/zip_file.zip".to_string());
    for f in files {
        match process_file(&f) {
            Ok(count) => println!("File {} Count: {}", &f, count),
            Err(e) => println!("Error reading file: {}", e),
        };
    }
}
This way, you don't need any Boxes and you don't need to read the file into memory before processing it.
A drawback to this solution would arise if you had multiple functions that need to read from zip files. One way to handle that is to have process_file take a callback function to do the processing. First, change the definition of process_file to:
pub fn process_file<C>(filename: &str, process_reader: C) -> Result<usize, String>
where
    C: FnOnce(&mut dyn BufRead) -> Result<usize, String>
The rest of the function body can be left unchanged. Now process_reader can be passed into the function like this:
process_file(&f, count_lines)
where count_lines would be the original simple function to count the lines, for instance.
This would also allow you to pass in a closure:
process_file(&f, |reader| Ok(reader.lines().count()))
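For reference, a standalone count_lines matching that bound could be as small as this sketch:
use std::io::BufRead;

// Matches the FnOnce(&mut dyn BufRead) -> Result<usize, String> bound above.
fn count_lines(reader: &mut dyn BufRead) -> Result<usize, String> {
    Ok(reader.lines().count())
}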

Reading Bytes From a Reader

I'm writing something to process stdin in blocks of bytes, but can't seem to work out a simple way to do it (though I suspect there is one).
fn run() -> int {
    // Doesn't compile: types differ
    let mut buffer = [0, ..100];
    loop {
        let block = match stdio::stdin().read(buffer) {
            Ok(bytes_read) => buffer.slice_to(bytes_read),
            // This captures the Err from the end of the file,
            // but also actual errors while reading from stdin.
            Err(message) => return 0
        };
        process(block).unwrap();
    }
}

fn process(block: &[u8]) -> Result<(), IoError> {
    // do things
}
My questions:
What's the "standard" way to do this? (I've been trying/hoping to use and_then()/or_else())
How can I differentiate between the Err(IoError) from end of the file, and the Err that's actually an error?
The previously accepted answer is outdated as of Rust 1.0: EOF is no longer considered an error. You can do it like this:
use std::io::{self, Read};

fn main() {
    let mut buffer = [0; 100];
    while let Ok(bytes_read) = io::stdin().read(&mut buffer) {
        if bytes_read == 0 { break; }
        process(&buffer[..bytes_read]).unwrap();
    }
}

fn process(block: &[u8]) -> Result<(), io::Error> {
    Ok(()) // do things
}
Note that this may not give the behavior you expect: read does not have to fill the buffer and may return any number of bytes read. In the case of stdin, the read implementation returns every time a newline is detected (i.e. when you press Enter in a terminal).
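If you need full fixed-size blocks, the usual pattern is to loop over read until the buffer is full or end of file is reached. A minimal sketch (the helper name read_block is made up here):
use std::io::{self, Read};

// Fill `buf` as far as possible, looping over short reads.
// Returns the number of bytes read; less than buf.len() only at end of file.
fn read_block<R: Read>(reader: &mut R, buf: &mut [u8]) -> io::Result<usize> {
    let mut filled = 0;
    while filled < buf.len() {
        match reader.read(&mut buf[filled..]) {
            Ok(0) => break, // end of file
            Ok(n) => filled += n,
            Err(ref e) if e.kind() == io::ErrorKind::Interrupted => continue,
            Err(e) => return Err(e),
        }
    }
    Ok(filled)
}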
The Rust API documentation (pre-1.0) stated that:
Note that end-of-file is considered an error, and can be inspected for
in the error's kind field.
The IoError struct looked like this:
pub struct IoError {
    pub kind: IoErrorKind,
    pub desc: &'static str,
    pub detail: Option<String>,
}
The list of all kinds is at http://doc.rust-lang.org/std/io/enum.IoErrorKind.html
You can match it like this:
match stdio::stdin().read(buffer) {
    Ok(_) => println!("ok"),
    Err(io::IoError { kind: io::EndOfFile, .. }) => println!("end of file"),
    _ => println!("error")
}
