Streaming version for Rust (nom 7) multiline parser

I'm trying to learn nom, but I'm also new to Rust.
I have a text file with one word on each line.
I want to split it into two files: valid ASCII (without control codes) and everything else.
extern crate nom;

use std::fs::File;
use std::io::{prelude::*, BufReader};

fn main() -> std::io::Result<()> {
    let file = File::open("/words.txt")?;
    let reader = BufReader::new(file);
    for line in reader.lines().filter_map(|result| result.ok()) {
        parse(line);
    }
    Ok(())
}

fn parse(line: String) {
    for c in line.chars() {
        if c.is_ascii_control() || !c.is_ascii() {
            println!("C> {}", line);
            return;
        }
    }
    if !line.is_empty() {
        println!("A{}> {}", line.len(), line);
    }
}
But the input file is too large for in-memory processing, so I should use nom's streaming functionality.
How do I modify this code to combine a streaming buffer of limited capacity (1000 chars) with a line_ending check?

Related

How to read a text file in Rust and read multiple values per line

So basically, I have a text file with the following syntax:
String int
String int
String int
I have an idea of how to read the values if there is only one entry per line, but if there are multiple, I do not know how to do it.
In Java, I would do something simple with while and Scanner, but in Rust I have no clue.
I am fairly new to Rust, so please help me.
Thanks for your help in advance.
Solution
Here is my modified version of @netwave's code:
use std::fs;
use std::io::{BufRead, BufReader, Error};

fn main() -> Result<(), Error> {
    let buff_reader = BufReader::new(fs::File::open("data.txt")?);
    for line in buff_reader.lines() {
        let parsed = sscanf::scanf!(line?, "{} {}", String, i32);
        println!("{:?}\n", parsed);
    }
    Ok(())
}
You can use the BufRead trait, which has a read_line method; you can also use lines.
To do so, the easiest option is to wrap the File instance in a BufReader:
use std::fs;
use std::io::{BufRead, BufReader};
...
let mut buff_reader = BufReader::new(fs::File::open(path)?);
loop {
    let mut buff = String::new();
    if buff_reader.read_line(&mut buff)? == 0 {
        break; // read_line returns Ok(0) at EOF
    }
    print!("{}", buff);
}
Once you have each line, you can use the sscanf crate to parse it into the types you need:
let parsed = sscanf::scanf!(buff, "{} {}", String, i32);
Based on: https://doc.rust-lang.org/rust-by-example/std_misc/file/read_lines.html
Given data.txt containing:
str1 100
str2 200
str3 300
use std::fs::File;
use std::io::{self, BufRead};
use std::path::Path;

fn main() {
    // File hosts must exist in current path before this produces output
    if let Ok(lines) = read_lines("./data.txt") {
        // Consumes the iterator, returns an (Optional) String
        for line in lines {
            if let Ok(data) = line {
                let values: Vec<&str> = data.split(' ').collect();
                match values.len() {
                    2 => {
                        let strdata = values[0].parse::<String>();
                        let intdata = values[1].parse::<i32>();
                        println!("Got: {:?} {:?}", strdata, intdata);
                    }
                    _ => panic!("Invalid input line {}", data),
                };
            }
        }
    }
}

// The output is wrapped in a Result to allow matching on errors.
// Returns an Iterator to the Reader of the lines of the file.
fn read_lines<P>(filename: P) -> io::Result<io::Lines<io::BufReader<File>>>
where
    P: AsRef<Path>,
{
    let file = File::open(filename)?;
    Ok(io::BufReader::new(file).lines())
}
Outputs:
Got: Ok("str1") Ok(100)
Got: Ok("str2") Ok(200)
Got: Ok("str3") Ok(300)
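For the record, the same "String int" pairs can be parsed without the sscanf dependency. This is an illustrative stdlib-only sketch (the helper name parse_pair is mine), using str::split_once:

```rust
// Parse one "String int" line; None signals a malformed line.
fn parse_pair(line: &str) -> Option<(String, i32)> {
    let (name, num) = line.split_once(' ')?;
    Some((name.to_string(), num.trim().parse().ok()?))
}
```

For example, `parse_pair("str1 100")` yields `Some(("str1".to_string(), 100))`, while a line without exactly one space-separated integer yields `None`.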

How to do simple math with a list of numbers from a file and print out the result in Rust?

use std::fs::File;
use std::io::prelude::*;
use std::io::BufReader;
use std::iter::Iterator;

fn main() -> std::io::Result<()> {
    let file = File::open("input")?; // file is input
    let mut buf_reader = BufReader::new(file);
    let mut contents = String::new();
    buf_reader.read_to_string(&mut contents)?;
    for i in contents.parse::<i32>() {
        let i = i / 2;
        println!("{}", i);
    }
    Ok(())
}
list of numbers:
50951
69212
119076
124303
95335
65069
109778
113786
124821
103423
128775
111918
138158
141455
92800
50908
107279
77352
129442
60097
84670
143682
104335
105729
87948
59542
81481
147508
str::parse::<i32> can only parse a single number at a time, so you will need to split the text first and then parse each number one by one. For example, if you have one number per line and no extra whitespace, you can use BufRead::lines to process the text line by line:
use std::fs::File;
use std::io::{BufRead, BufReader};

fn main() -> std::io::Result<()> {
    let file = File::open("input")?; // file is input
    let buf_reader = BufReader::new(file);
    for line in buf_reader.lines() {
        let value = line?
            .parse::<i32>()
            .expect("Not able to parse: content is malformed!");
        println!("{}", value / 2);
    }
    Ok(())
}
As an extra bonus this avoids reading the whole file into memory, which can be important if the file is big.
For tiny examples like this, I'd read the entire string at once, then split it up on lines.
use std::fs;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let contents = fs::read_to_string("input")?;
    for line in contents.trim().lines() {
        let i: i32 = line.trim().parse()?;
        let i = i / 2;
        println!("{}", i);
    }
    Ok(())
}
See also:
What's the de-facto way of reading and writing files in Rust 1.x?
For tightly-controlled examples like this, I'd ignore errors occurring while parsing:
use std::fs;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let contents = fs::read_to_string("input")?;
    for i in contents.trim().lines().flat_map(|l| l.trim().parse::<i32>()) {
        let i = i / 2;
        println!("{}", i);
    }
    Ok(())
}
See also:
Why does `Option` support `IntoIterator`?
For fixed-input examples like this, I'd avoid opening the file at runtime at all, pushing the error to compile-time:
fn main() -> Result<(), Box<dyn std::error::Error>> {
    let contents = include_str!("../input");
    for i in contents.trim().lines().flat_map(|l| l.trim().parse::<i32>()) {
        let i = i / 2;
        println!("{}", i);
    }
    Ok(())
}
See also:
Is there a good way to include external resource data into Rust source code?
If I wanted to handle failures to parse but treat the iterator as if errors were impossible, I'd use Itertools::process_results:
use itertools; // 0.8.2

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let contents = include_str!("../input");
    let numbers = contents.trim().lines().map(|l| l.trim().parse::<i32>());
    let sum = itertools::process_results(numbers, |i| i.sum::<i32>());
    println!("{:?}", sum);
    Ok(())
}
See also:
How do I perform iterator computations over iterators of Results without collecting to a temporary vector?
How do I stop iteration and return an error when Iterator::map returns a Result::Err?
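A stdlib-only variant of the process_results idea is worth sketching here: collecting an iterator of Results into Result<Vec<_>, _> short-circuits on the first error, so the happy path can treat errors as impossible. The function name sum_lines is an illustrative assumption:

```rust
use std::num::ParseIntError;

// Sum the numbers in `input`, stopping at the first unparsable line.
fn sum_lines(input: &str) -> Result<i32, ParseIntError> {
    let numbers: Vec<i32> = input
        .trim()
        .lines()
        .map(|l| l.trim().parse::<i32>())
        .collect::<Result<_, _>>()?; // short-circuits on the first Err
    Ok(numbers.iter().sum())
}
```

Unlike process_results, this buffers the parsed values in a Vec, which is fine for inputs of this size.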

How do I read OS-compatible strings from stdin?

I'm trying to write a Rust program that gets a separated list of filenames on stdin.
On Windows, I might invoke it from a cmd window with something like:
dir /b /s | findstr .*,v$ | rust-prog -n
On Unix I'd use something like:
find . -name '*,v' -print0 | rust-prog -0
I'm having trouble converting what I receive on stdin into something usable by std::path::Path. As I understand it, to get something that compiles on both Windows and Unix, I'm going to need conditional compilation, and std::os::windows::ffi or std::os::unix::ffi as appropriate.
Furthermore, it seems that on Windows I'll need kernel32::MultiByteToWideChar with the current code page to create something usable by std::os::windows::ffi::OsStrExt.
Is there an easier way to do this? Does what I'm suggesting even seem workable?
As an example, it's easy to convert a string to a path, so I tried to use the string handling functions of stdin:
use std::io::{self, Read};

fn main() {
    let mut buffer = String::new();
    match io::stdin().read_line(&mut buffer) {
        Ok(_) => println!("{}", buffer),
        Err(error) => println!("error: {}", error),
    }
}
On Windows, if I have a directory with a single file called ¿.txt (that's 0xbf) and pipe the name into stdin, I get: error: stream did not contain valid UTF-8.
Here's a reasonable-looking version for Windows: convert the console-supplied string to a wide string using Win32 API functions, then wrap it in an OsString using OsString::from_wide.
I'm not convinced it uses the correct code page yet. dir seems to use the OEM code page, so maybe that should be the default. There's also a distinction between the input and output code pages of a console.
In my Cargo.toml
[dependencies]
winapi = "0.2"
kernel32-sys = "0.2.2"
Code to read a list of filenames piped through stdin on Windows as per the question.
extern crate kernel32;
extern crate winapi;

use std::io::{self, Read};
use std::ptr;
use std::fs::metadata;
use std::ffi::OsString;
use std::os::windows::ffi::OsStringExt;

/// Convert Windows console input to a wide string that can
/// be used by OS functions
fn wide_from_console_string(bytes: &[u8]) -> Vec<u16> {
    assert!(bytes.len() < std::i32::MAX as usize);
    let mut wide;
    let mut len;
    unsafe {
        let cp = kernel32::GetConsoleCP();
        // First call computes the required length, second call converts.
        len = kernel32::MultiByteToWideChar(
            cp,
            0,
            bytes.as_ptr() as *const i8,
            bytes.len() as i32,
            ptr::null_mut(),
            0,
        );
        wide = Vec::with_capacity(len as usize);
        len = kernel32::MultiByteToWideChar(
            cp,
            0,
            bytes.as_ptr() as *const i8,
            bytes.len() as i32,
            wide.as_mut_ptr(),
            len,
        );
        wide.set_len(len as usize);
    }
    wide
}

/// Extract paths from a list supplied as a CR LF
/// separated wide string.
/// Would use a generic split on substring if it existed
fn paths_from_wide(wide: &[u16]) -> Vec<OsString> {
    let mut r = Vec::new();
    let mut start = 0;
    let mut i = start;
    let len = wide.len() - 1;
    while i < len {
        if wide[i] == 13 && wide[i + 1] == 10 {
            if i > start {
                r.push(OsString::from_wide(&wide[start..i]));
            }
            start = i + 2;
            i += 2;
        } else {
            i += 1;
        }
    }
    if i > start {
        r.push(OsString::from_wide(&wide[start..i]));
    }
    r
}

fn main() {
    let mut bytes = Vec::new();
    if let Ok(_) = io::stdin().read_to_end(&mut bytes) {
        let pathlist = wide_from_console_string(&bytes[..]);
        let paths = paths_from_wide(&pathlist[..]);
        for path in paths {
            match metadata(&path) {
                Ok(stat) => println!("{:?} is_file: {}", &path, stat.is_file()),
                Err(e) => println!("Error: {:?} for {:?}", e, &path),
            }
        }
    }
}
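For comparison, the Unix side of the question needs no character conversion at all, since OsStr is raw bytes there. A sketch (the helper name paths_from_nul_bytes is mine) that splits the NUL-separated output of find -print0:

```rust
// Unix only: wrap raw stdin bytes in OsString without any decoding.
#[cfg(unix)]
fn paths_from_nul_bytes(bytes: &[u8]) -> Vec<std::ffi::OsString> {
    use std::os::unix::ffi::OsStrExt;
    bytes
        .split(|&b| b == 0) // `find -print0` separates names with NUL
        .filter(|name| !name.is_empty())
        .map(|name| std::ffi::OsStr::from_bytes(name).to_os_string())
        .collect()
}
```

Together with the Windows version above, this would sit behind #[cfg] so each platform compiles only its own decoding path.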

How can I read a file line-by-line, eliminate duplicates, then write back to the same file?

I want to read a file, eliminate all duplicates and write the rest back into the file, like a duplicate cleaner.
I'm using a Vec because a normal array has a fixed size but my .txt is flexible (am I doing this right?).
Read, collect lines into a Vec, and delete duplicates:
Missing: writing back to the file.
use std::io;

fn main() {
    let path = Path::new("test.txt");
    let mut file = io::BufferedReader::new(io::File::open(&path, R));
    let mut lines: Vec<String> = file.lines().map(|x| x.unwrap()).collect();
    // dedup() deletes all duplicates if sort() is called before
    lines.sort();
    lines.dedup();
    for e in lines.iter() {
        print!("{}", e.as_slice());
    }
}
Read + write to file (untested, but should work I guess).
Missing: collecting the lines into a Vec, because it doesn't seem to work without a BufferedReader (or I'm doing something else wrong, also a good chance).
use std::io;

fn main() {
    let path = Path::new("test.txt");
    let mut file = match io::File::open_mode(&path, io::Open, io::ReadWrite) {
        Ok(f) => f,
        Err(e) => panic!("file error: {}", e),
    };
    let mut lines: Vec<String> = file.lines().map(|x| x.unwrap()).collect();
    lines.sort();
    // dedup() deletes all duplicates if sort() is called before
    lines.dedup();
    for e in lines.iter() {
        file.write("{}", e);
    }
}
So .... how do I get those 2 together? :)
Ultimately, you are going to run into a problem: you are trying to write to the same file you are reading from. In this case, it's safe because you read the entire file first, so you don't need it afterwards. However, if you did try to write to the file, you'd see that a file opened for reading doesn't allow writing! Here's code that demonstrates that:
use std::{
    fs::File,
    io::{BufRead, BufReader, Write},
};

fn main() {
    let mut file = File::open("test.txt").expect("file error");
    let reader = BufReader::new(&mut file);
    let mut lines: Vec<_> = reader
        .lines()
        .map(|l| l.expect("Couldn't read a line"))
        .collect();
    lines.sort();
    lines.dedup();
    for line in lines {
        file.write_all(line.as_bytes())
            .expect("Couldn't write to file");
    }
}
Here's the output:
% cat test.txt
a
a
b
a
% cargo run
thread 'main' panicked at 'Couldn't write to file: Os { code: 9, kind: Other, message: "Bad file descriptor" }', src/main.rs:12:9
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
You could open the file for both reading and writing:
use std::{
    fs::OpenOptions,
    io::{BufRead, BufReader, Write},
};

fn main() {
    let mut file = OpenOptions::new()
        .read(true)
        .write(true)
        .open("test.txt")
        .expect("file error");
    // Remaining code unchanged
}
But then you'd see that (a) the output is appended and (b) all the newlines are lost on the new lines because BufRead doesn't include them.
We could reset the file pointer back to the beginning, but then you'd probably leave trailing bytes at the end (deduplicating is likely to write fewer bytes than were read). It's easier to just reopen the file for writing, which truncates it. Also, let's use a set data structure to do the deduplication for us!
use std::{
    collections::BTreeSet,
    fs::File,
    io::{BufRead, BufReader, Write},
};

fn main() {
    let file = File::open("test.txt").expect("file error");
    let reader = BufReader::new(file);
    let lines: BTreeSet<_> = reader
        .lines()
        .map(|l| l.expect("Couldn't read a line"))
        .collect();
    let mut file = File::create("test.txt").expect("file error");
    for line in lines {
        file.write_all(line.as_bytes())
            .expect("Couldn't write to file");
        file.write_all(b"\n").expect("Couldn't write to file");
    }
}
And the output:
% cat test.txt
a
a
b
a
a
b
a
b
% cargo run
% cat test.txt
a
b
The less-efficient but shorter solution is to read the entire file as one string and use str::lines:
use std::{
    collections::BTreeSet,
    fs::{self, File},
    io::Write,
};

fn main() {
    let contents = fs::read_to_string("test.txt").expect("can't read");
    let lines: BTreeSet<_> = contents.lines().collect();
    let mut file = File::create("test.txt").expect("can't create");
    for line in lines {
        writeln!(file, "{}", line).expect("can't write");
    }
}
See also:
What's the de-facto way of reading and writing files in Rust 1.x?
What is the best variant for appending a new line in a text file?
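Note that a BTreeSet also sorts the lines. If the original order should be preserved instead, a sketch using HashSet::insert as a seen-filter works (the helper name dedup_preserving_order is mine):

```rust
use std::collections::HashSet;

// Keep the first occurrence of each line, in original order.
// `HashSet::insert` returns false for values already seen.
fn dedup_preserving_order(lines: impl Iterator<Item = String>) -> Vec<String> {
    let mut seen = HashSet::new();
    lines.filter(|line| seen.insert(line.clone())).collect()
}
```

This trades the clone per line for order preservation; for large files an interned or hashed key would avoid the copies.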

How to combine reading a file line by line and iterating over each character in each line?

I started from this code, which just reads every line in a file, and which works well:
use std::io::{BufRead, BufReader};
use std::fs::File;

fn main() {
    let file = File::open("chry.fa").expect("cannot open file");
    let file = BufReader::new(file);
    for line in file.lines() {
        print!("{}", line.unwrap());
    }
}
... but then I tried to also loop over each character in each line, something like this:
use std::io::{BufRead, BufReader};
use std::fs::File;

fn main() {
    let file = File::open("chry.fa").expect("cannot open file");
    let file = BufReader::new(file);
    for line in file.lines() {
        for c in line.chars() {
            print!("{}", c.unwrap());
        }
    }
}
... but it turns out that this innermost for loop is not correct. I get the following error message:
error[E0599]: no method named `chars` found for type `std::result::Result<std::string::String, std::io::Error>` in the current scope
--> src/main.rs:8:23
|
8 | for c in line.chars() {
| ^^^^^
You need to handle the potential error that could arise from each I/O operation, represented by an io::Result, which can contain either the requested data or an error. There are different ways to handle errors.
One way is to just ignore them and read whatever data we can get.
The code shows how this can be done:
use std::io::{BufRead, BufReader};
use std::fs::File;
fn main() {
let file = File::open("chry.fa").expect("cannot open file");
let file = BufReader::new(file);
for line in file.lines().filter_map(|result| result.ok()) {
for c in line.chars() {
print!("{}", c);
}
}
}
The key points: file.lines() is an iterator that yields io::Result<String>. In the filter_map, we convert each io::Result into an Option and filter out any occurrences of None. We're then left with just plain lines (i.e. Strings).
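Alternatively, instead of discarding errors, they can be propagated with the ? operator. A stdlib sketch (the function name chars_of is mine) that works over any BufRead source:

```rust
use std::io::BufRead;

// Collect every character of every line, propagating the first
// I/O error with `?` instead of silently dropping it.
fn chars_of<R: BufRead>(reader: R) -> std::io::Result<Vec<char>> {
    let mut out = Vec::new();
    for line in reader.lines() {
        for c in line?.chars() {
            out.push(c);
        }
    }
    Ok(out)
}
```

In the original program, the same pattern would mean changing main to return std::io::Result<()> and writing `for c in line?.chars()` inside the loop.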

Resources