Rust and Gzipped files

I'm a Python and Golang dev and have recently started learning Rust. My current project involves processing hundreds of gzipped log files, each with hundreds of thousands of JSON entries, one JSON per line. My initial attempts were surprisingly slow. Investigating this, I noticed that Python 3 performs significantly faster than the Rust implementation, even when compiled in release mode. Am I doing something wrong?
Below is my Rust implementation:
use std::io::{BufRead, BufReader};
use std::fs::File;
use libflate::gzip::Decoder;

fn main() {
    let path = "/path/to/input.json.gz";
    process_file(path);
}

fn process_file(path: &str) {
    let x = BufReader::new(Decoder::new(File::open(path).unwrap()).unwrap())
        .lines()
        .count();
    println!("Found {} events", x);
}
Here is the significantly faster Python code that does the same thing:
import gzip

def main():
    path = "/path/to/input.json.gz"
    process_file(path)

def process_file(path):
    with gzip.open(path) as fp:
        count = 0
        for _ in fp:
            count += 1
    print(f"Found {count} events")

if __name__ == "__main__":
    main()
Thank you for reading and making it this far.

For maximum performance try using the flate2 crate with the zlib-ng backend.
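As an illustration (not a benchmarked claim), here is a minimal sketch of the same line count using flate2's MultiGzDecoder, which also keeps decoding across concatenated gzip members. The zlib-ng backend is selected through a Cargo feature, e.g. something like flate2 = { version = "1", default-features = false, features = ["zlib-ng"] } (check the flate2 README for the exact feature name on your version). Wrapping the File in a BufReader before decoding also avoids many small reads of the compressed stream; the path below is a placeholder.
use std::fs::File;
use std::io::{BufRead, BufReader};
use flate2::read::MultiGzDecoder;

fn process_file(path: &str) {
    // Buffer the compressed bytes, decode them, then buffer the decoded
    // bytes so that .lines() works on buffered, decompressed data.
    let file = BufReader::new(File::open(path).unwrap());
    let decoder = MultiGzDecoder::new(file);
    let count = BufReader::new(decoder).lines().count();
    println!("Found {} events", count);
}

fn main() {
    process_file("/path/to/input.json.gz");
}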

Related

Rust read last x lines from file

Currently, I'm using this function in my code:
use std::fs::File;
use std::io::{BufRead, BufReader};
use std::path::Path;

fn lines_from_file(filename: impl AsRef<Path>) -> Vec<String> {
    let file = File::open(filename).expect("no such file");
    let buf = BufReader::new(file);
    buf.lines().map(|l| l.expect("Could not parse line")).collect()
}
How can I safely read only the last x lines of the file?
The tail crate claims to provide an efficient means of reading the final n lines from a file via its BackwardsReader struct, and it looks fairly easy to use. I can't swear to its efficiency (it looks like it performs progressively larger reads while seeking further and further back in the file, which is slightly suboptimal relative to an optimized memory-map-based solution), but it's an easy all-in-one package, and the inefficiencies likely won't matter in 99% of use cases.
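To make the seek-backwards idea concrete, here is a rough standard-library-only sketch (an illustration of the technique, not the tail crate's actual implementation): read fixed-size chunks from the end of the file until enough newlines have been seen, then keep only the trailing lines.
use std::fs::File;
use std::io::{Read, Seek, SeekFrom};

/// Sketch: returns up to `n` trailing lines of `file`, in file order.
/// Assumes UTF-8-ish content; the chunk size is arbitrary.
fn last_n_lines(file: &mut File, n: usize) -> std::io::Result<Vec<String>> {
    const CHUNK: u64 = 8192;
    let len = file.seek(SeekFrom::End(0))?;
    let mut buf: Vec<u8> = Vec::new();
    let mut pos = len;

    while pos > 0 {
        let step = CHUNK.min(pos);
        pos -= step;
        file.seek(SeekFrom::Start(pos))?;
        let mut chunk = vec![0u8; step as usize];
        file.read_exact(&mut chunk)?;
        // Prepend the new chunk so `buf` always holds the tail of the file.
        chunk.extend_from_slice(&buf);
        buf = chunk;
        // Stop once more than `n` newlines are buffered.
        if buf.iter().filter(|&&b| b == b'\n').count() > n {
            break;
        }
    }

    let text = String::from_utf8_lossy(&buf);
    let mut lines: Vec<String> = text.lines().rev().take(n).map(str::to_owned).collect();
    lines.reverse(); // lines() was walked from the end, so restore file order
    Ok(lines)
}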
To avoid loading whole files into memory (because the files were quite large), I chose to use the rev_buf_reader crate and take only x elements from its iterator:
use rev_buf_reader::RevBufReader;
use std::fs::File;
use std::io::BufRead;

fn lines_from_file(file: &File, limit: usize) -> Vec<String> {
    let buf = RevBufReader::new(file);
    buf.lines().take(limit).map(|l| l.expect("Could not parse line")).collect()
}
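Note that RevBufReader yields lines starting from the end of the file, so the vector above holds the last lines newest-first; a hypothetical caller (the path below is a placeholder) can reverse it when file order matters.
fn main() {
    let file = std::fs::File::open("app.log").expect("no such file"); // placeholder path
    let mut last_ten = lines_from_file(&file, 10); // newest line first
    last_ten.reverse(); // restore file order
    for line in &last_ten {
        println!("{}", line);
    }
}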

How to get size of uncompressed gzip file using flate2

I want the uncompressed size of a gzipped file using rust crate flate2.
How do I ask a GzDecoder or MultiGzDecoder for the uncompressed file size without reading the entire file?
I'm hoping for a simple function call like header.filesz():
use std::io::prelude::*;
use std::fs::File;
extern crate flate2;
use flate2::read::GzDecoder;

fn main() {
    let file1 = File::open("file1.gz").unwrap();
    let mut decoder = GzDecoder::new(file1);
    let header = decoder.header().unwrap();
    let filesz = header.filesz(); // hypothetical method -- this is what I'm hoping exists
}
Not possible reliably. A gzip member only stores the uncompressed size modulo 2^32 in its last four bytes (the ISIZE field), so that value is wrong for data over 4 GiB and for multi-member archives, and flate2's GzHeader does not expose it anyway. To get the real size, you need to read (decompress) the whole file.
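As a sketch of that approach (assuming flate2's MultiGzDecoder and a placeholder filename), you can stream the file through the decoder and count the decompressed bytes with io::copy into io::sink:
use flate2::read::MultiGzDecoder;
use std::fs::File;
use std::io::{self, BufReader};

fn main() -> io::Result<()> {
    let file = BufReader::new(File::open("file1.gz")?);
    let mut decoder = MultiGzDecoder::new(file);
    // io::copy returns the number of bytes written, i.e. the uncompressed size.
    let uncompressed = io::copy(&mut decoder, &mut io::sink())?;
    println!("uncompressed size: {} bytes", uncompressed);
    Ok(())
}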

Libssh2 stops reading a large file

I need to read some files via SSH, but when I loop over the file to read it... it stops.
In my Cargo.toml I have this dependency:
[dependencies]
ssh2 = "0.8"
It depends on libssh2.
My real code is more complex, but I managed to reproduce the problem with this quick-and-dirty example:
use std::fs::File;
use std::io::Read;
use std::io::Write;
#[allow(unused_imports)]
use std::io::Error;
use std::net::TcpStream;
use ssh2::Session;
use std::path::Path;
use std::io::prelude::*;

fn main() {
    println!("Parto!");
    let tcp = TcpStream::connect("myhost:22").unwrap();
    let mut sess = Session::new().unwrap();
    sess.set_tcp_stream(tcp);
    sess.handshake().unwrap();
    sess.userauth_password("user", "password").unwrap();
    assert!(sess.authenticated());

    let (mut remote_file, stat) = sess.scp_recv(Path::new("/home/pi/dbeaver.exe")).unwrap();
    println!("remote file size: {}", stat.size());

    let mut buf: Vec<u8> = vec![0; 1000];
    loop {
        let read_bytes = remote_file.read(&mut buf).unwrap();
        println!("Read bytes: {}.", read_bytes);
        if read_bytes < 1000 {
            break;
        }
    } // ending loop
}
I tested on both Windows 10 and Debian Linux. The problem always shows up once the file is more than a few KB: this read:
let read_bytes = remote_file.read(&mut buf).unwrap();
returns fewer bytes than the buffer size, even though the end of the file has not been reached.
Tested with both binary and ASCII files.
There are no errors; it just stops reading. The odd thing: it sometimes stops at 8 MB, sometimes at 16 KB, with the same file and host involved...
Where do I need to dig, or what do I need to check?
Quoting from the documentation for the io::Read trait (which ssh2 implements):
It is not an error if the returned value n is smaller than the buffer size, even when the reader is not at the end of the stream yet. This may happen for example because fewer bytes are actually available right now (e. g. being close to end-of-file) or because read() was interrupted by a signal.
Since you are reading remotely, this may simply mean that the remaining bytes are still in transit in the network and not yet available on the target computer.
If you are sure that there are enough incoming data to fill the buffer, you can use read_exact, but note that this will return an error if the available data is shorter than the buffer.
Edit: or even better, use read_to_end which doesn't require knowing the size beforehand (I don't know why I couldn't see it yesterday when I wrote the original answer).
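For illustration, here is a rough sketch of both options as free functions over any Read implementor (so the remote_file from the question could be passed in); the names are illustrative.
use std::io::Read;

// A short read is not end-of-stream; only a return value of 0 means
// there is nothing left to read.
fn read_all<R: Read>(reader: &mut R) -> std::io::Result<Vec<u8>> {
    let mut data = Vec::new();
    let mut buf = [0u8; 1000];
    loop {
        let n = reader.read(&mut buf)?;
        if n == 0 {
            break; // end of stream
        }
        data.extend_from_slice(&buf[..n]);
    }
    Ok(data)
}

// Or let the standard library do the looping:
fn read_all_simple<R: Read>(reader: &mut R) -> std::io::Result<Vec<u8>> {
    let mut data = Vec::new();
    reader.read_to_end(&mut data)?;
    Ok(data)
}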

Parsing 40MB file noticeably slower than equivalent Pascal code [duplicate]

This question already has an answer here:
Why is my Rust program slower than the equivalent Java program? (1 answer)
Closed 2 years ago.
use std::fs::File;
use std::io::Read;

fn main() {
    let mut f = File::open("binary_file_path").expect("no file found");
    let mut buf = vec![0u8; 15000 * 707 * 4];
    f.read(&mut buf).expect("Something went berserk");
    let result: Vec<_> = buf
        .chunks(2)
        .map(|chunk| i16::from_le_bytes([chunk[0], chunk[1]]))
        .collect();
}
I want to read a binary file. The last line takes around 15s. I'd expect it to only take a fraction of a second. How can I optimise it?
Your code looks like the compiler should be able to optimise it decently. Make sure that you compile it in release mode using cargo build --release. Converting 40MB of data to native endianness should only take a fraction of a second.
You can simplify the code and save some unnecessary copying by using the byteorder crate. It defines an extension trait for all implementors of Read, which allows you to call read_i16_into() directly on the file object.
use byteorder::{LittleEndian, ReadBytesExt};
use std::fs::File;

fn main() {
    let mut f = File::open("binary_file_path").expect("no file found");
    // 15000 * 707 * 4 input bytes become 15000 * 707 * 2 i16 values.
    let mut result = vec![0i16; 15000 * 707 * 2];
    f.read_i16_into::<LittleEndian>(&mut result).unwrap();
}
cargo build --release improved the performance

Is this the right way to read lines from file and split them into words in Rust?

Editor's note: This code example is from a version of Rust prior to 1.0 and is not syntactically valid Rust 1.0 code. Updated versions of this code produce different errors, but the answers still contain valuable information.
I've implemented the following method to return me the words from a file in a 2 dimensional data structure:
fn read_terms() -> Vec<Vec<String>> {
    let path = Path::new("terms.txt");
    let mut file = BufferedReader::new(File::open(&path));
    return file.lines().map(|x| x.unwrap().as_slice().words().map(|x| x.to_string()).collect()).collect();
}
Is this the right, idiomatic and efficient way in Rust? I'm wondering if collect() needs to be called so often and whether it's necessary to call to_string() here to allocate memory. Maybe the return type should be defined differently to be more idiomatic and efficient?
There is a shorter and more readable way of getting words from a text file.
use std::io::{BufRead, BufReader};
use std::fs::File;

fn main() {
    let reader = BufReader::new(File::open("file.txt").expect("Cannot open file.txt"));
    for line in reader.lines() {
        for word in line.unwrap().split_whitespace() {
            println!("word '{}'", word);
        }
    }
}
You could instead read the entire file as a single String and then build a structure of references that points to the words inside:
use std::io::{self, Read};
use std::fs::File;

fn filename_to_string(s: &str) -> io::Result<String> {
    let mut file = File::open(s)?;
    let mut s = String::new();
    file.read_to_string(&mut s)?;
    Ok(s)
}

fn words_by_line<'a>(s: &'a str) -> Vec<Vec<&'a str>> {
    s.lines().map(|line| {
        line.split_whitespace().collect()
    }).collect()
}

fn example_use() {
    let whole_file = filename_to_string("terms.txt").unwrap();
    let wbyl = words_by_line(&whole_file);
    println!("{:?}", wbyl)
}
This will read the file with less overhead because it can slurp it into a single buffer, whereas reading lines with BufReader implies a lot of copying and allocating: first into the buffer inside BufReader, then into a newly allocated String for each line, and then into a newly allocated String for each word. It will also use less memory, because the single large String and the vectors of references are more compact than many individual Strings.
A drawback is that you can't directly return the structure of references, because it can't live past the stack frame that holds the single large String. In example_use above, we have to put the large String into a let in order to call words_by_line. It is possible to get around this with unsafe code and wrapping the String and references in a private struct, but that is much more complicated.
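As a rough sketch of one safe alternative (not the unsafe wrapper alluded to above), a type can own the String and hand out the borrowed word structure through a method, so the borrow is tied to the owner rather than to a local variable; the names here are illustrative.
use std::io;

struct Terms {
    text: String,
}

impl Terms {
    fn from_file(path: &str) -> io::Result<Terms> {
        Ok(Terms { text: std::fs::read_to_string(path)? })
    }

    // The returned references borrow from `self`, so they can never
    // outlive the owned text.
    fn words_by_line(&self) -> Vec<Vec<&str>> {
        self.text
            .lines()
            .map(|line| line.split_whitespace().collect())
            .collect()
    }
}

fn example() {
    let terms = Terms::from_file("terms.txt").unwrap();
    let words = terms.words_by_line();
    println!("{:?}", words);
}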
