Parsing 40MB file noticeably slower than equivalent Pascal code [duplicate] - rust

This question already has an answer here:
Why is my Rust program slower than the equivalent Java program?
(1 answer)
Closed 2 years ago.

use std::fs::File;
use std::io::Read;

fn main() {
    let mut f = File::open("binary_file_path").expect("no file found");
    let mut buf = vec![0u8; 15000 * 707 * 4];
    f.read(&mut buf).expect("Something went berserk");
    let result: Vec<_> = buf
        .chunks(2)
        .map(|chunk| i16::from_le_bytes([chunk[0], chunk[1]]))
        .collect();
}
I want to read a binary file. The last line takes around 15s. I'd expect it to only take a fraction of a second. How can I optimise it?

Your code looks like something the compiler should be able to optimise decently. Make sure that you compile in release mode using cargo build --release. Converting 40 MB of data to native endianness should only take a fraction of a second.
You can simplify the code and avoid an unnecessary copy by using the byteorder crate. It defines an extension trait for all implementors of Read, which lets you call read_i16_into() directly on the file object.
use byteorder::{LittleEndian, ReadBytesExt};
use std::fs::File;

fn main() {
    let mut f = File::open("binary_file_path").expect("no file found");
    // 15000 * 707 * 4 bytes of input hold 15000 * 707 * 2 i16 values.
    let mut result = vec![0i16; 15000 * 707 * 2];
    f.read_i16_into::<LittleEndian>(&mut result).unwrap();
}

cargo build --release improved the performance

Related

Rust read last x lines from file

Currently, I'm using this function in my code:
use std::fs::File;
use std::io::{BufRead, BufReader};
use std::path::Path;

fn lines_from_file(filename: impl AsRef<Path>) -> Vec<String> {
    let file = File::open(filename).expect("no such file");
    let buf = BufReader::new(file);
    buf.lines()
        .map(|l| l.expect("Could not parse line"))
        .collect()
}
How can I safely read the last x lines only in the file?
The tail crate claims to provide an efficient means of reading the final n lines from a file by means of the BackwardsReader struct, and looks fairly easy to use. I can't swear to its efficiency (it looks like it performs progressively larger reads seeking further and further back in the file, which is slightly suboptimal relative to an optimized memory map-based solution), but it's an easy all-in-one package and the inefficiencies likely won't matter in 99% of all use cases.
To avoid loading whole files into memory (the files were quite large), I chose to use the rev_buf_reader crate and take only x elements from the iterator:
use rev_buf_reader::RevBufReader;
use std::fs::File;
use std::io::BufRead;

fn lines_from_file(file: &File, limit: usize) -> Vec<String> {
    let buf = RevBufReader::new(file);
    buf.lines()
        .take(limit)
        .map(|l| l.expect("Could not parse line"))
        .collect()
}

Rust: initialize a static variable/reference in a lib?

I'm new to Rust. I'm trying to create a static variable DATA of type Vec<u8> in a library so that it is initialized during compilation of the lib. I then include the lib in the main code, hoping to use DATA directly without calling init_data() again. Here's what I've tried:
my_lib.rs:

use lazy_static::lazy_static;

pub fn init_data() -> Vec<u8> {
    // some expensive calculations
}

lazy_static! {
    // supposed to call init_data() only once, during compilation
    pub static ref DATA: Vec<u8> = init_data();
}
main.rs:

use my_lib::DATA;

fn main() {
    call1(&DATA); // use DATA here without calling init_data()
    call2(&DATA);
}
But it turned out that init_data() is still called in the main.rs. What's wrong with this code?
Update: as Ivan C pointed out, lazy_static is not run at compile time. So, what's the right choice for 'pre-loading' the data?
There are two problems here: the choice of type, and performing the allocation.
It is not possible to construct a Vec, a Box, or any other type that requires heap allocation at compile time, because the heap allocator and the heap do not yet exist at that point. Instead, you must use a reference type, which can point to data allocated in the binary rather than in the run-time heap, or an array without any reference (if the data is not too large).
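For example, a static slice whose bytes live in the compiled binary itself needs no heap at all (the contents here are just a stand-in):

// The bytes are baked into the binary's read-only data; nothing is allocated at run time.
static DATA: &'static [u8] = &[0xDE, 0xAD, 0xBE, 0xEF];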
Next, we need a way to perform the computation. Theoretically, the cleanest option is constant evaluation — straightforwardly executing parts of your code at compile time.
static DATA: &'static [u8] = {
    // code goes here
};
However, in current stable Rust versions (1.58.1 as I'm writing this), constant evaluation is very limited, because you cannot do anything that looks like dropping a value, or use any function belonging to a trait. It can still do some things, mostly integer arithmetic or constructing other "almost literal" data; for example:
const N: usize = 10;

static FIRST_N_FIBONACCI: &'static [u32; N] = &{
    let mut array = [0; N];
    array[1] = 1;
    let mut i = 2;
    while i < array.len() {
        array[i] = array[i - 1] + array[i - 2];
        i += 1;
    }
    array
};

fn main() {
    dbg!(FIRST_N_FIBONACCI);
}
If your computation cannot be expressed using const evaluation, then you will need to perform it another way:
Procedural macros are effectively compiler plugins, and they can perform arbitrary computation, but their output is generated Rust syntax. So, a procedural macro could produce an array literal with the precomputed data.
The main limitation of procedural macros is that they must be defined in dedicated crates (so if your project is one library crate, it would now be two instead).
Build scripts are ordinary Rust code which can compile or generate files used by the main compilation. They don't interact with the compiler, but are run by Cargo before compilation starts.
(Unlike const evaluation, neither build scripts nor proc macros can use the types or constants defined in the crate being built: they can read its source code, but they run too early to use its items in their own code.)
In your case, because you want to precompute some [u8] data, I think the simplest approach would be to add a build script which writes the data to a file, after which your normal code can embed this data from the file using include_bytes!.
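A minimal sketch of that approach (the file name data.bin and the expensive_data() helper are invented for illustration):

build.rs:

use std::env;
use std::fs;
use std::path::Path;

// Stand-in for the real precomputation.
fn expensive_data() -> Vec<u8> {
    (0u8..=255).rev().collect()
}

fn main() {
    // Cargo sets OUT_DIR; write the precomputed bytes there.
    let out_dir = env::var("OUT_DIR").unwrap();
    fs::write(Path::new(&out_dir).join("data.bin"), expensive_data()).unwrap();
    println!("cargo:rerun-if-changed=build.rs");
}

my_lib.rs:

// Embed the precomputed file into the binary at compile time.
pub static DATA: &[u8] = include_bytes!(concat!(env!("OUT_DIR"), "/data.bin"));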

Rust and Gzipped files

I'm a Python and Golang dev and have recently started learning Rust. My current project involves processing hundreds of gzipped log files, each with hundreds of thousands of JSON entries, one JSON object per line. My initial attempts were surprisingly slow: investigating, I noticed that Python 3 performs significantly faster than my Rust implementation, even when compiled in release mode. Am I doing something wrong?
Below is my Rust implementation:
use std::fs::File;
use std::io::{BufRead, BufReader};
use libflate::gzip::Decoder;

fn main() {
    let path = "/path/to/input.json.gz";
    process_file(path);
}

fn process_file(path: &str) {
    let x = BufReader::new(Decoder::new(File::open(path).unwrap()).unwrap())
        .lines()
        .count();
    println!("Found {} events", x);
}
Here is the significantly faster Python code that does the same thing:
import gzip

def main():
    path = "/path/to/input.json.gz"
    process_file(path)

def process_file(path):
    with gzip.open(path) as fp:
        count = 0
        for _ in fp:
            count += 1
        print(f"Found {count} events")
Thank you for reading and making it this far.
For maximum performance try using the flate2 crate with the zlib-ng backend.
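For illustration, a minimal sketch of the same count using flate2; this assumes a Cargo.toml entry along the lines of flate2 = { version = "1", default-features = false, features = ["zlib-ng"] }:

use flate2::read::GzDecoder;
use std::fs::File;
use std::io::{BufRead, BufReader};

fn process_file(path: &str) {
    // GzDecoder wraps any Read; BufReader keeps the line iteration cheap.
    let file = File::open(path).unwrap();
    let count = BufReader::new(GzDecoder::new(file)).lines().count();
    println!("Found {} events", count);
}

fn main() {
    process_file("/path/to/input.json.gz");
}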

How to skip N bytes with Read without allocation? [duplicate]

This question already has an answer here:
How to advance through data from the std::io::Read trait when Seek isn't implemented?
(1 answer)
Closed 3 years ago.
I would like to skip an arbitrary number of bytes when working with a Read instance without doing any allocations. After skipping, I need to continue reading the data that follows.
The number of bytes is not known at compile time, so I cannot create a fixed-size array. Read also has no skip method, so it seems I need to read into something. I do not want to use BufReader and allocate unnecessary buffers, and I do not want to read byte-by-byte, as that is inefficient.
Any other options?
Your best bet is to also require Seek:
use std::io::{self, Read, Seek, SeekFrom};

fn example(mut r: impl Read + Seek) -> io::Result<String> {
    r.seek(SeekFrom::Current(5))?;
    let mut s = String::new();
    r.take(5).read_to_string(&mut s)?;
    Ok(s)
}

#[test]
fn it_works() -> io::Result<()> {
    use std::io::Cursor;

    let s = example(Cursor::new("abcdefghijklmnop"))?;
    assert_eq!("fghij", s);
    Ok(())
}
If you cannot use Seek, then see How to advance through data from the std::io::Read trait when Seek isn't implemented?
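For reference, a sketch of one Seek-free approach: copy the bytes into io::sink(), which discards them. io::copy uses a small internal buffer, so nothing is allocated on the heap:

use std::io::{self, Read};

// Discard the next `n` bytes from `r` without any heap allocation.
fn skip(r: &mut impl Read, n: u64) -> io::Result<()> {
    let copied = io::copy(&mut r.take(n), &mut io::sink())?;
    if copied == n {
        Ok(())
    } else {
        Err(io::Error::new(io::ErrorKind::UnexpectedEof, "not enough bytes to skip"))
    }
}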
See also:
How to idiomatically / efficiently pipe data from Read+Seek to Write?
How to create an in-memory object that can be used as a Reader, Writer, or Seek in Rust?

Rust 0.9 -- Reading a file?

Here's what I'm trying to do: open all the command line arguments as (binary) files and read bytes from them. The constantly changing syntax is not conducive to googling, but this is what I've figured out so far:
use std::io::{File, result};
use std::path::Path;
use std::os;

fn main() {
    let args = os::args();
    let mut iter = args.iter().skip(1); // skip the program name
    for file_name in iter {
        println(*file_name);
        let path = &Path::new(*file_name);
        let file = File::open(path);
    }
}
Here's the issue:
test.rs:44:31: 44:41 error: cannot move out of dereference of & pointer
test.rs:44 let path = &Path::new(*file_name);
I've hit a brick wall here because while I'm fine with pointers in C, my understanding of the different pointer types in rust is practically non-existent. What can I do here?
Try &Path::new(file_name.as_slice())
Unfortunately, due to the trait argument that Path::new() takes, if you pass it a ~str or ~[u8] it will try to consume that type directly, and that's what you're passing with *file_name. But you can't move out of a pointer dereference in Rust, which is why you're getting the error.
By using file_name.as_slice() instead (which is equivalent, in this case, to (*file_name).as_slice(), except that Rust does the dereference for you), the ~str is converted to a &str, which can be passed to Path::new() without a problem.
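For anyone reading this on a modern (post-1.0) Rust, the ownership problem disappears, because std::env::args() yields owned Strings; a rough sketch of the same program today:

use std::env;
use std::fs::File;

fn main() {
    // args() yields owned Strings, so there is nothing to move out of a reference.
    for file_name in env::args().skip(1) {
        println!("{}", file_name);
        let _file = File::open(&file_name).expect("could not open file");
    }
}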
