Rust: read last x lines from file

Currently, I'm using this function in my code:
use std::fs::File;
use std::io::{BufRead, BufReader};
use std::path::Path;

fn lines_from_file(filename: impl AsRef<Path>) -> Vec<String> {
    let file = File::open(filename).expect("no such file");
    let buf = BufReader::new(file);
    buf.lines().map(|l| l.expect("Could not parse line")).collect()
}
How can I safely read only the last x lines of the file?

The tail crate claims to provide an efficient way to read the final n lines of a file via its BackwardsReader struct, and it looks fairly easy to use. I can't vouch for its efficiency (it appears to perform progressively larger reads while seeking further and further back in the file, which is slightly suboptimal relative to an optimized memory-map-based solution), but it's an easy all-in-one package, and the inefficiency likely won't matter in 99% of use cases.

To avoid loading whole files into memory (because the files were quite large), I chose to use the rev_buf_reader crate and take only x elements from its iterator:
use rev_buf_reader::RevBufReader;
use std::fs::File;
use std::io::BufRead;

fn lines_from_file(file: &File, limit: usize) -> Vec<String> {
    let buf = RevBufReader::new(file);
    buf.lines().take(limit).map(|l| l.expect("Could not parse line")).collect()
}
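For what it's worth, a small usage sketch (the app.log path is just a placeholder): RevBufReader yields lines starting from the end of the file, so the collected Vec holds the last lines in reverse order, and you may want to reverse it to restore file order.
use rev_buf_reader::RevBufReader;
use std::fs::File;
use std::io::BufRead;

fn main() {
    let file = File::open("app.log").expect("no such file"); // placeholder path
    let buf = RevBufReader::new(&file);
    // The iterator starts at the end of the file, so these are the
    // last 10 lines, last line first.
    let mut last: Vec<String> = buf
        .lines()
        .take(10)
        .map(|l| l.expect("Could not parse line"))
        .collect();
    last.reverse(); // restore original file order
    for line in &last {
        println!("{}", line);
    }
}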

Related

What is the most efficient way of taking a number of integer user inputs and storing them in a Vec<i32>?

I was trying to use Rust for competitive programming, and I was wondering what the most efficient way of storing user input in a Vec is. I have come up with a method, but I am afraid that it is slow and redundant.
Here is my code:
use std::io;

fn main() {
    let mut input = String::new();
    io::stdin().read_line(&mut input).expect("cant read line");
    let input: Vec<&str> = input.split(" ").collect();
    let input: Vec<String> = input.iter().map(|x| x.to_string()).collect();
    let input: Vec<i32> = input.iter().map(|x| x.trim().parse().unwrap()).collect();
    println!("{:?}", input);
}
PS: I am new to Rust.
I see these ways of improving the performance of the code (a sketch combining some of them follows this list):
Although not really relevant for std::io::stdin(), std::io::BufReader can have a great effect when reading e.g. from std::fs::File. Buffer capacity can also matter.
Using a locked stdin: let si = std::io::stdin(); let si = si.lock();
Avoiding allocations by keeping vectors around and using extend (from the Extend trait) instead of collect, if the code reads multiple lines (unlike the sample you posted in the question).
Maybe avoiding temporary vectors altogether and just chaining Iterator operations together, or using a loop like for token in input.split(...) { ... }. It may affect performance either way - you need to experiment to find out.
Avoiding to_string() and just storing references into the input buffer (which can also be parse()d into i32). Note that this may invite the famous Rust borrowing and lifetime complexity.
Maybe finding some fast SIMD-enhanced string-to-int parser instead of libstd's parse().
Maybe streaming the results to the algorithm instead of collecting everything into a Vec first. This can be beneficial especially if multiple threads can be used. For performance, you would still likely need to send the data in chunks, not one i32 at a time.
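A minimal sketch combining the first three points (locked stdin, a reused line buffer, and extend instead of collect), assuming whitespace-separated integers on each line:
use std::io::{self, BufRead};

fn main() {
    let stdin = io::stdin();
    let mut stdin = stdin.lock(); // locked stdin: no per-call locking overhead
    let mut line = String::new();
    let mut numbers: Vec<i32> = Vec::new();
    loop {
        line.clear(); // reuse the same buffer for every line
        if stdin.read_line(&mut line).expect("read failed") == 0 {
            break; // EOF
        }
        // Extend the existing vector instead of collecting into a new one.
        numbers.extend(line.split_whitespace().map(|x| x.parse::<i32>().unwrap()));
    }
    println!("{:?}", numbers);
}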
Yeah, there are some changes you can make that will make your code more concise, simpler, and faster.
A better version:
use std::io;

fn main() {
    let mut input = String::new();
    io::stdin().read_line(&mut input).unwrap();
    let input: Vec<i32> = input.split_whitespace().map(|x| x.parse().unwrap()).collect();
    println!("{:?}", input);
}
Explanation
input.split_whitespace() returns an iterator over the pieces of the string separated by any kind of whitespace, including line breaks. This saves the work of splitting on a single space with input.split(" ") and then iterating again to call .trim() on each string slice to remove surrounding whitespace.
(You can also check out input.split_ascii_whitespace() if you want to restrict the split to ASCII whitespace.)
There was no need for input.iter().map(|x| x.to_string()).collect(), since you can call .trim() and .parse() directly on a string slice.
This saves time both at runtime and while coding, since .collect() is called only once and the input is iterated over only once.

How do I create a streaming parser in nom?

I've created a few non-trivial parsers in nom, so I'm pretty familiar with it at this point. All the parsers I've created until now always provide the entire input slice to the parser.
I'd like to create a streaming parser, which I assume means that I can continue to feed bytes into the parser until it is complete. I've had a hard time finding any documentation or examples that illustrate this, and I also question my assumption of what a "streaming parser" is.
My questions are:
Is my understanding of what a streaming parser is correct?
If so, are there any good examples of a parser using this technique?
nom parsers neither maintain a buffer to feed more data into, nor do they maintain "state" where they previously needed more bytes.
But if you take a look at the IResult type, you'll see that a parser can return a partial result or indicate that it needs more data; a minimal sketch of the resulting feed-more-data loop appears below.
There seem to be some structures provided to handle streaming: I think you are supposed to create a Consumer from a parser using the consumer_from_parser! macro, implement a Producer for your data source, and call run until it returns None (and start again when you have more data). Examples and docs seem to be mostly missing so far - see bottom of https://github.com/Geal/nom :)
Also it looks like most functions and macros in nom are not documented well (or at all) regarding their behavior when hitting the end of the input. For example take_until! returns Incomplete if the input isn't long enough to contain the substr to look for, but returns an error if the input is long enough but doesn't contain substr.
Also nom mostly uses either &[u8] or &str for input; you can't signal an actual "end of stream" through these types. You could implement your own input type (related traits: nom::{AsBytes,Compare,FindSubstring,FindToken,InputIter,InputLength,InputTake,Offset,ParseTo,Slice}) to add a "reached end of stream" flag, but the nom provided macros and functions won't be able to interpret it.
All in all I'd recommend splitting streamed input through some other means into chunks you can handle with simple non-streaming parsers (maybe even use synom instead of nom).
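To illustrate that Incomplete case: a small sketch using nom's streaming combinators (this assumes the nom 7 API; the hard-coded input chunks stand in for a real data source such as a socket or file):
use nom::bytes::streaming::take;
use nom::{Err, IResult};

// A streaming parser: asks for more input instead of failing at end of data.
fn parse4(input: &[u8]) -> IResult<&[u8], &[u8]> {
    take(4usize)(input)
}

fn main() {
    let mut buf: Vec<u8> = b"ab".to_vec();
    loop {
        let needs_more = match parse4(&buf) {
            Ok((_rest, chunk)) => {
                println!("parsed chunk: {:?}", chunk);
                false
            }
            // Not enough bytes yet: keep the buffer and feed in more data.
            Err(Err::Incomplete(_needed)) => true,
            Err(e) => {
                eprintln!("parse error: {:?}", e);
                false
            }
        };
        if !needs_more {
            break;
        }
        buf.extend_from_slice(b"cd"); // in real code: read the next bytes from the source
    }
}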
Here is a minimal working example. As @Stefan wrote, "I'd recommend splitting streamed input through some other means into chunks you can handle".
What somewhat works (and I'd be glad for suggestions on how to improve it) is to combine the File::bytes() iterator with Iterator::take to read only as many bytes as necessary and pass them to nom::bytes::streaming::take.
let reader = file.bytes();
let buf = reader.take(length).collect::<B>()?;
let (_input, chunk) = take(length)(&*buf)...;
The complete function can look like this:
/// Parse the first handful of bytes and return the bytes interpreted as UTF-8
fn parse_first_bytes(file: std::fs::File, length: usize) -> Result<String> {
    type B = std::result::Result<Vec<u8>, std::io::Error>;
    let reader = file.bytes();
    let buf = reader.take(length).collect::<B>()?;
    let (_input, chunk) = take(length)(&*buf)
        .finish()
        .map_err(|nom::error::Error { input: _, code: _ }| eyre!("..."))?;
    let s = String::from_utf8_lossy(chunk);
    Ok(s.to_string())
}
Here is the rest of main for an implementation similar to the Unix head command.
use color_eyre::Result;
use eyre::eyre;
use nom::{bytes::streaming::take, Finish};
use std::{fs::File, io::Read, path::PathBuf};
use structopt::StructOpt;

#[derive(Debug, StructOpt)]
#[structopt(about = "A minimal example of parsing a file only partially.
This implements the POSIX 'head' utility.")]
struct Args {
    /// Input File
    #[structopt(parse(from_os_str))]
    input: PathBuf,
    /// Number of bytes to consume
    #[structopt(short = "c", default_value = "32")]
    num_bytes: usize,
}

fn main() -> Result<()> {
    let args = Args::from_args();
    let file = File::open(args.input)?;
    let head = parse_first_bytes(file, args.num_bytes)?;
    println!("{}", head);
    Ok(())
}

Vec<PathBuf> to &[&Path] without allocations?

I have a function that returns Vec<PathBuf> and a function that accepts &[&Path], basically like this:
use std::path::{Path, PathBuf};

fn f(paths: &[&Path]) {
}

fn main() {
    let a: Vec<PathBuf> = vec![PathBuf::from("/tmp/a.txt"), PathBuf::from("/tmp/b.txt")];
    f(&a[..]);
}
Is it possible to convert Vec<PathBuf> to &[&Path] without memory allocations?
If not, how should I change f's signature to accept slices of both Path and PathBuf?
Is it possible to convert Vec<PathBuf> to &[&Path] without memory allocations?
No, as answered by How do I write a function that takes both owned and non-owned string collections?; a PathBuf and a Path have different memory layouts (the answer uses String and str; the concepts are the same).
How should I change f's signature to accept slices of both Path and PathBuf?
Again as suggested in How do I write a function that takes both owned and non-owned string collections?, use AsRef:
use std::path::{Path, PathBuf};

fn f<P>(paths: &[P])
where
    P: AsRef<Path>,
{
}

fn main() {
    let a = vec![PathBuf::from("/tmp/a.txt")];
    let b = vec![Path::new("/tmp/b.txt")];
    f(&a);
    f(&b);
}
This requires no additional heap allocation.
To pass around a slice, you have to also have the original data held somewhere. To have a &[&Path], then this needs to be pointing into something like a Vec<&Path>. But you don't have one of those, you have a Vec<PathBuf>.
To get this to work with your existing signatures, you can make a temporary Vec<&Path> and then take a slice of it.
fn f(paths: &[&Path]) {
}

fn main() {
    let a: Vec<PathBuf> = vec![PathBuf::from("/tmp/a.txt"), PathBuf::from("/tmp/b.txt")];
    let paths: Vec<&Path> = a.iter().map(PathBuf::as_path).collect();
    f(&paths[..]);
}
Even though this creates a new Vec, its buffer holds just one pointer per path - it doesn't have to copy any of the actual path data.
It's not possible to just cast without any allocation, because their layouts in memory differ.
Vec<PathBuf> stores the PathBuf values inline in its buffer, while [&Path] stores pointers to the path data (it's roughly similar to Vec<&PathBuf>).
You need to create a new vector to hold the pointers. If the size is known at compile time, you could use a stack-allocated array instead (see the sketch below); otherwise map + collect is needed.
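A minimal sketch of the stack-allocated-array variant for a size known at compile time, reusing f and a from the snippets above:
use std::path::{Path, PathBuf};

fn f(paths: &[&Path]) {
    println!("{} paths", paths.len());
}

fn main() {
    let a: Vec<PathBuf> = vec![PathBuf::from("/tmp/a.txt"), PathBuf::from("/tmp/b.txt")];
    // Exactly two paths, so a fixed-size array on the stack is enough:
    let arr: [&Path; 2] = [a[0].as_path(), a[1].as_path()];
    f(&arr); // arrays coerce to slices
}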

Convert image to bytes and then write to new file

Editor's note: This question's code is from a version of Rust prior to 1.0; BufferedReader, BufferedWriter, and std::os::args no longer exist in that form.
I'm trying to take an image that has been converted into a vector of bytes and write those bytes to a new file. The first part is working, and my code compiles, but the new file that is created ends up empty (nothing is written to it). What am I missing?
Is there a cleaner way to convert Vec<u8> into &[u8] so that it can be written? The way I'm currently doing it seems kind of ridiculous...
use std::os;
use std::io::BufferedReader;
use std::io::File;
use std::io::BufferedWriter;

fn get_file_buffer(path_str: String) -> Vec<u8> {
    let path = Path::new(path_str.as_bytes());
    let file = File::open(&path);
    let mut reader = BufferedReader::new(file);
    match reader.read_to_end() {
        Ok(x) => x,
        Err(_) => vec![0],
    }
}

fn main() {
    let file = get_file_buffer(os::args()[1].clone());
    let mut new_file = File::create(&Path::new("foo.png")).unwrap();
    let mut writer = BufferedWriter::new(new_file);
    writer.write(String::from_utf8(file).unwrap().as_bytes()).unwrap();
    writer.flush().unwrap();
}
Given a Vec<T>, you can get a &[T] out of it in two ways:
Take a reference to a dereference of it, i.e. &*file; this works because Vec<T> implements Deref<Target = [T]>, so *file is effectively of type [T] (though you can't use *file on its own without immediately borrowing it).
Call the as_slice() method.
As the BufWriter docs say, “the buffer will be written out when the writer is dropped”, so that writer.flush().unwrap() is not strictly necessary, serving only to make handling of errors explicit.
But as for the behaviour you describe, that I mostly do not observe. So long as you do not encounter any I/O errors, the version not using the String dance will work fine, while the String-dance version will panic if the input data is not valid UTF-8 (which, if you're dealing with images, it probably won't be). String::from_utf8 returns an Err in such cases, and so unwrapping it panics.
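For comparison, a modern (post-1.0) sketch of the same task, assuming an input file input.png: a Vec<u8> coerces to &[u8] via deref, so no UTF-8 round-trip is needed at all.
use std::fs::{self, File};
use std::io::Write;

fn main() -> std::io::Result<()> {
    let bytes: Vec<u8> = fs::read("input.png")?; // whole file as raw bytes
    let mut out = File::create("foo.png")?;
    out.write_all(&bytes)?; // &Vec<u8> coerces to &[u8]; no String detour
    Ok(())
}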

Is this the right way to read lines from file and split them into words in Rust?

Editor's note: This code example is from a version of Rust prior to 1.0 and is not syntactically valid Rust 1.0 code. Updated versions of this code produce different errors, but the answers still contain valuable information.
I've implemented the following method to return the words from a file in a two-dimensional data structure:
fn read_terms() -> Vec<Vec<String>> {
    let path = Path::new("terms.txt");
    let mut file = BufferedReader::new(File::open(&path));
    return file.lines().map(|x| x.unwrap().as_slice().words().map(|x| x.to_string()).collect()).collect();
}
Is this the right, idiomatic and efficient way in Rust? I'm wondering if collect() needs to be called so often and whether it's necessary to call to_string() here to allocate memory. Maybe the return type should be defined differently to be more idiomatic and efficient?
There is a shorter and more readable way of getting words from a text file.
use std::io::{BufRead, BufReader};
use std::fs::File;

fn main() {
    let reader = BufReader::new(File::open("file.txt").expect("Cannot open file.txt"));
    for line in reader.lines() {
        for word in line.unwrap().split_whitespace() {
            println!("word '{}'", word);
        }
    }
}
You could instead read the entire file as a single String and then build a structure of references that point to the words inside:
use std::io::{self, Read};
use std::fs::File;

fn filename_to_string(s: &str) -> io::Result<String> {
    let mut file = File::open(s)?;
    let mut s = String::new();
    file.read_to_string(&mut s)?;
    Ok(s)
}

fn words_by_line<'a>(s: &'a str) -> Vec<Vec<&'a str>> {
    s.lines().map(|line| {
        line.split_whitespace().collect()
    }).collect()
}

fn example_use() {
    let whole_file = filename_to_string("terms.txt").unwrap();
    let wbyl = words_by_line(&whole_file);
    println!("{:?}", wbyl)
}
This will read the file with less overhead because it can slurp it into a single buffer, whereas reading lines with BufReader implies a lot of copying and allocating: first into the buffer inside BufReader, then into a newly allocated String for each line, and then into a newly allocated String for each word. It will also use less memory, because the single large String and the vectors of references are more compact than many individual Strings.
A drawback is that you can't directly return the structure of references, because it can't outlive the stack frame that holds the single large String. In example_use above, we have to bind the large String with a let in order to call words_by_line. It is possible to get around this with unsafe code, wrapping the String and references in a private struct, but that is much more complicated. (A safe alternative is sketched below.)
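A safe alternative, for what it's worth: return the owning String itself inside a struct and rebuild the borrowed structure on demand. The FileWords name is just for illustration:
use std::fs;
use std::io;

// Owns the file contents; borrowers come and go.
struct FileWords {
    text: String,
}

impl FileWords {
    fn from_file(path: &str) -> io::Result<Self> {
        Ok(FileWords { text: fs::read_to_string(path)? })
    }

    // The references borrow from self, so they can't outlive the owner.
    fn words_by_line(&self) -> Vec<Vec<&str>> {
        self.text
            .lines()
            .map(|line| line.split_whitespace().collect())
            .collect()
    }
}

fn main() {
    let owner = FileWords::from_file("terms.txt").unwrap();
    println!("{:?}", owner.words_by_line());
}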
