How to skip N bytes with Read without allocation? [duplicate] - rust

This question already has an answer here:
How to advance through data from the std::io::Read trait when Seek isn't implemented?
(1 answer)
Closed 3 years ago.
I would like to skip an arbitrary number of bytes when working with a Read instance without doing any allocations. After skipping, I need to continue reading the data that follows.
The number of bytes is not known at compile time, so I cannot create a fixed array. Read also has no skip so I need to read into something, it seems. I do not want to use BufReader and allocate unnecessary buffers and I do not want to read byte-by-byte as this is inefficient.
Any other options?

Your best bet is to also require Seek:
use std::io::{self, Read, Seek, SeekFrom};
fn example(mut r: impl Read + Seek) -> io::Result<String> {
    r.seek(SeekFrom::Current(5))?;
    let mut s = String::new();
    r.take(5).read_to_string(&mut s)?;
    Ok(s)
}
#[test]
fn it_works() -> io::Result<()> {
    use std::io::Cursor;
    let s = example(Cursor::new("abcdefghijklmnop"))?;
    assert_eq!("fghij", s);
    Ok(())
}
If you cannot use Seek, then see How to advance through data from the std::io::Read trait when Seek isn't implemented?
See also:
How to idiomatically / efficiently pipe data from Read+Seek to Write?
How to create an in-memory object that can be used as a Reader, Writer, or Seek in Rust?
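When Seek is not available, one option that stays allocation-free is to read into a small fixed-size stack buffer and throw the bytes away. The sketch below is only an illustration: the helper name skip_bytes and the 4096-byte scratch size are assumptions, not something taken from the answers above.

```rust
use std::io::{self, Read};

/// Hypothetical helper: discards up to `n` bytes from `r` using a fixed
/// stack buffer, so no heap allocation is involved.
/// Returns the number of bytes actually skipped (less than `n` at EOF).
fn skip_bytes<R: Read>(r: &mut R, mut n: u64) -> io::Result<u64> {
    let mut scratch = [0u8; 4096]; // arbitrary size, lives on the stack
    let mut skipped = 0u64;
    while n > 0 {
        // Never ask for more than what is left to skip.
        let want = n.min(scratch.len() as u64) as usize;
        match r.read(&mut scratch[..want]) {
            Ok(0) => break, // end of input
            Ok(got) => {
                n -= got as u64;
                skipped += got as u64;
            }
            // A read interrupted by a signal can simply be retried.
            Err(e) if e.kind() == io::ErrorKind::Interrupted => continue,
            Err(e) => return Err(e),
        }
    }
    Ok(skipped)
}
```

After the call, the reader is positioned right after the skipped bytes and can be used as usual. A shorter alternative is std::io::copy(&mut r.by_ref().take(n), &mut std::io::sink()), though whether that avoids heap allocation depends on the standard library's internal implementation.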

Related

Rust read last x lines from file

Currently, I'm using this function in my code:
use std::fs::File;
use std::io::{BufRead, BufReader};
use std::path::Path;
fn lines_from_file(filename: impl AsRef<Path>) -> Vec<String> {
    let file = File::open(filename).expect("no such file");
    let buf = BufReader::new(file);
    buf.lines().map(|l| l.expect("Could not parse line")).collect()
}
How can I safely read the last x lines only in the file?
The tail crate claims to provide an efficient means of reading the final n lines from a file by means of the BackwardsReader struct, and looks fairly easy to use. I can't swear to its efficiency (it looks like it performs progressively larger reads seeking further and further back in the file, which is slightly suboptimal relative to an optimized memory map-based solution), but it's an easy all-in-one package and the inefficiencies likely won't matter in 99% of all use cases.
To avoid storing files in memory (because the files were quite large), I chose to use the rev_buf_reader crate and to take only x elements from the iterator. Note that the iterator starts from the end of the file, so the returned Vec holds the last line first.
fn lines_from_file(file: &File, limit: usize) -> Vec<String> {
    let buf = RevBufReader::new(file);
    buf.lines().take(limit).map(|l| l.expect("Could not parse line")).collect()
}
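If adding a crate is not an option, the same idea can be sketched with the standard library alone: seek to the end, read fixed-size chunks backwards until enough newlines have been seen, then keep the last lines. The helper name last_lines and the 4096-byte chunk size are assumptions made for this sketch, and invalid UTF-8 is replaced lossily rather than reported.

```rust
use std::io::{self, Read, Seek, SeekFrom};

/// Hypothetical helper: returns the last `n` lines of `src`, in file order.
fn last_lines<R: Read + Seek>(src: &mut R, n: usize) -> io::Result<Vec<String>> {
    const CHUNK: u64 = 4096; // arbitrary chunk size
    let mut pos = src.seek(SeekFrom::End(0))?;
    let mut tail: Vec<u8> = Vec::new();
    // Walk backwards until the tail holds more than `n` newlines
    // (so the first, possibly partial, line can be dropped) or we hit the start.
    while pos > 0 && tail.iter().filter(|&&b| b == b'\n').count() <= n {
        let step = CHUNK.min(pos);
        pos -= step;
        src.seek(SeekFrom::Start(pos))?;
        let mut chunk = vec![0u8; step as usize];
        src.read_exact(&mut chunk)?;
        // Prepend the new chunk to what has been collected so far.
        chunk.extend_from_slice(&tail);
        tail = chunk;
    }
    let text = String::from_utf8_lossy(&tail);
    let lines: Vec<&str> = text.lines().collect();
    let start = lines.len().saturating_sub(n);
    Ok(lines[start..].iter().map(|s| s.to_string()).collect())
}
```

Only the tail of the file is held in memory, so the cost depends on the length of the last n lines rather than on the file size. A File opened with File::open can be passed as `&mut file`.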

Convert struct to byte array and read this via implementing `std::io::Read` trait

I am trying to implement std::io::Read trait for a struct.
My objective is to convert obj to byte array and read it through the implementation of Read trait.
Following is the code I have written so far.
use chrono::{DateTime, Utc};
use std::io::Error;
use std::io::Read;
use std::vec::Vec;
use std::str;
use super::{Chain, Transaction};
// The struct I need to convert to byte array and add the Read impl.
#[derive(Debug)]
pub struct Block {
    index: u64,
    timestamp: DateTime<Utc>,
    transactions: Vec<Transaction>,
    proof: i64,
    previous_hash: String,
}
// The Read trait implementation for Block
impl Read for Block {
    fn read(&mut self, buf: &mut [u8]) -> std::result::Result<usize, Error> {
        let bytes: &[u8] = unsafe { any_as_u8_slice(&self) };
        buf.clone_from_slice(bytes);
        Ok(bytes.len())
    }
}
// Function that converts to byte array. (found on stackoverflow)
unsafe fn any_as_u8_slice<T: Sized>(p: &T) -> &[u8] {
    ::std::slice::from_raw_parts((p as *const T) as *const u8, ::std::mem::size_of::<T>())
}
I get an error when I execute the code this way.
let mut buffer: Vec<u8> = Vec::new();
let result = block.read(buffer.as_mut());
ERROR
thread 'main' panicked at 'destination and source slices have
different lengths',
/Users/harsh/.rustup/toolchains/stable-x86_64-apple-darwin/lib/rustlib/src/rust/library/core/src/slice/mod.rs:2554:9
I am new to Rust, trying to learn by porting another program in Rust.
How do I copy a &[u8] into another &mut [u8] that is backed by a Vec (that is, how do I fix the Read impl for Block)?
And is there a better way to convert the object to a byte array and return it from the Read implementation?
There are a few different problems here:
What you're trying to do here won't be sound in general. Rust structs might include padding bytes or otherwise uninitialized bytes, which means that reading them through a [u8] is undefined behavior. The name for what you're trying to do here is a transmute, and transmutes are famously very difficult to do correctly.
It's not clear to me why specifically you're doing this in terms of the Read trait. Read is generally for I/O devices, like files or stdin, or in-memory buffers that behave like I/O devices. Even if we assume that a direct, in-place transmute to a byte slice is appropriate here, it would make more sense to just have a method on Block resembling fn as_byte_slice(&self) -> &[u8] { ... }.
Even if we set aside the above issues, it's still not clear to me that you're going to get the outcome you expect. Transmuting a struct to a byte array will convert only the raw bytes in the struct, which will work fine for primitive types like u64, but for types like Vec<T> and String will only return the underlying pointer to the allocated storage (plus length and capacity), not the contents.
I'm guessing that what you actually want to have happen here is that all of the contents of the Block, including the list of transactions and the previous_hash, be converted into the byte slice. This is called serialization, and the de facto way to do it in Rust is with serde. Serde is an abstract library that connects types (like Vec and your own Block) to data formats like JSON and bincode.
In your question, you've said you want to "convert the [object] to a byte array". This is a bit nonspecific; it's likely that there is actually a specific data format into which you're supposed to convert this Block. Your specific application will likely describe which specific data format is in use, and you can then look into whether there already exists a serde Serializer for that data format, or whether you'll need to write your own.
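To make the difference between transmuting and serializing concrete, here is a minimal std-only sketch. It is not serde: the encoding is hand-written purely for illustration, the Block is a simplified stand-in (timestamp and transactions are left out), and the method names to_bytes and reader are assumptions. It also sidesteps the original panic, which comes from clone_from_slice requiring both slices to have the same length while an empty Vec yields a zero-length slice.

```rust
use std::io::{Cursor, Read};

// Simplified stand-in for the question's Block
// (timestamp and transactions omitted to keep the sketch std-only).
struct Block {
    index: u64,
    proof: i64,
    previous_hash: String,
}

impl Block {
    /// Hypothetical hand-rolled encoding: fixed-width little-endian
    /// integers, then a length-prefixed UTF-8 string.
    fn to_bytes(&self) -> Vec<u8> {
        let mut out = Vec::new();
        out.extend_from_slice(&self.index.to_le_bytes());
        out.extend_from_slice(&self.proof.to_le_bytes());
        out.extend_from_slice(&(self.previous_hash.len() as u64).to_le_bytes());
        // The string contents are copied, not the pointer to them.
        out.extend_from_slice(self.previous_hash.as_bytes());
        out
    }

    /// Cursor<Vec<u8>> already implements Read and tracks how much has
    /// been consumed, so no custom Read impl is needed.
    fn reader(&self) -> Cursor<Vec<u8>> {
        Cursor::new(self.to_bytes())
    }
}
```

A caller can then use any Read method on the result, for example block.reader().read_to_end(&mut buffer), and a short destination buffer simply receives a partial read instead of causing a panic.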

Parsing 40MB file noticeably slower than equivalent Pascal code [duplicate]

This question already has an answer here:
Why is my Rust program slower than the equivalent Java program?
(1 answer)
Closed 2 years ago.
use std::fs::File;
use std::io::Read;
fn main() {
    let mut f = File::open("binary_file_path").expect("no file found");
    let mut buf = vec![0u8;15000*707*4];
    f.read(&mut buf).expect("Something went berserk");
    let result: Vec<_> = buf.chunks(2).map(|chunk| i16::from_le_bytes([chunk[0],chunk[1]])).collect();
}
I want to read a binary file. The last line takes around 15s. I'd expect it to only take a fraction of a second. How can I optimise it?
Your code looks like the compiler should be able to optimise it decently. Make sure that you compile it in release mode using cargo build --release. Converting 40MB of data to native endianness should only take a fraction of a second.
You can simplify the code and save some unnecessary copying by using the byteorder crate. It defines an extension trait for all implementors of Read, which allows you to directly call read_i16_into() on the file object.
use byteorder::{LittleEndian, ReadBytesExt};
use std::fs::File;
let mut f = File::open("binary_file_path").expect("no file found");
let mut result = vec![0i16; 15000 * 707 * 2];
f.read_i16_into::<LittleEndian>(&mut result).unwrap();
cargo build --release improved the performance
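For comparison, the same conversion can be sketched without an extra crate. The helper name read_i16_le is an assumption; the sketch uses read_exact, which keeps reading until the buffer is full, whereas a single read call (as in the question) is allowed to return after filling only part of it.

```rust
use std::io::{self, Read};

/// Hypothetical helper: reads exactly `count` little-endian i16 values.
fn read_i16_le<R: Read>(r: &mut R, count: usize) -> io::Result<Vec<i16>> {
    let mut bytes = vec![0u8; count * 2];
    // Fails with UnexpectedEof if the input is shorter than requested.
    r.read_exact(&mut bytes)?;
    Ok(bytes
        .chunks_exact(2)
        .map(|pair| i16::from_le_bytes([pair[0], pair[1]]))
        .collect())
}
```

As with the original code, this only performs well in a release build; in a debug build the per-element closure is not optimized.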

Cast vector of i8 to vector of u8 in Rust? [duplicate]

This question already has answers here:
How do I convert a Vec<T> to a Vec<U> without copying the vector?
(2 answers)
Closed 3 years ago.
Is there a better way to cast Vec<i8> to Vec<u8> in Rust than these two?
1. creating a copy by mapping and casting every entry
2. using std::mem::transmute
Option (1) is slow, and for option (2) the docs say "transmute should be the absolute last resort".
A bit of background maybe: I'm getting a Vec<i8> from the unsafe gl::GetShaderInfoLog() call and want to create a string from this vector of chars by using String::from_utf8().
The other answers provide excellent solutions for the underlying problem of creating a string from Vec<i8>. To answer the question as posed, creating a Vec<u8> from data in a Vec<i8> can be done without copying or transmuting the vector. As pointed out by @trentcl, transmuting the vector directly constitutes undefined behavior because Vec is allowed to have a different layout for different types.
The correct (though still requiring the use of unsafe) way to transfer a vector's data without copying it is:
obtain the *mut i8 pointer to the data in the vector, along with its length and capacity
leak the original vector to prevent it from freeing the data
use Vec::from_raw_parts to build a new vector, giving it the pointer cast to *mut u8 - this is the unsafe part, because we are vouching that the pointer contains valid and initialized data, and that it is not in use by other objects, and so on.
This is not UB because the new Vec is given the pointer of the correct type from the start. Code (playground):
fn vec_i8_into_u8(v: Vec<i8>) -> Vec<u8> {
    // ideally we'd use Vec::into_raw_parts, but it's unstable,
    // so we have to do it manually:
    // first, make sure v's destructor doesn't free the data
    // it thinks it owns when it goes out of scope
    let mut v = std::mem::ManuallyDrop::new(v);
    // then, pick apart the existing Vec
    let p = v.as_mut_ptr();
    let len = v.len();
    let cap = v.capacity();
    // finally, adopt the data into a new Vec
    unsafe { Vec::from_raw_parts(p as *mut u8, len, cap) }
}
fn main() {
    let v = vec![-1i8, 2, 3];
    assert!(vec_i8_into_u8(v) == vec![255u8, 2, 3]);
}
transmute on a Vec is always, 100% wrong, causing undefined behavior, because the layout of Vec is not specified. However, as the page you linked also mentions, you can use raw pointers and Vec::from_raw_parts to perform this correctly. user4815162342's answer shows how.
(std::mem::transmute is the only item in the Rust standard library whose documentation consists mostly of suggestions for how not to use it. Take that how you will.)
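However, in this case, from_raw_parts is also unnecessary. The best way to deal with C strings in Rust is with the wrappers in std::ffi, CStr and CString. There may be better ways to work this into your real code, but here's one way you could use CStr to borrow the log buffer as a &str. The buffer is a Vec<u8> so that it can be handed to CStr directly; only the pointer is cast to c_char for the FFI call:
use std::ffi::CStr;
use std::os::raw::c_char;
const BUF_SIZE: usize = 1000;
let mut info_log: Vec<u8> = vec![0; BUF_SIZE];
let mut len: gl::types::GLsizei = 0;
unsafe {
    // `shader` is the shader object id obtained elsewhere
    gl::GetShaderInfoLog(shader, BUF_SIZE as gl::types::GLsizei, &mut len, info_log.as_mut_ptr() as *mut c_char);
}
// `len` excludes the nul terminator, so include one more byte in the slice
let log = CStr::from_bytes_with_nul(&info_log[..len as usize + 1])
    .expect("Slice must be nul terminated and contain no nul bytes")
    .to_str()
    .expect("Slice must be valid UTF-8 text");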
Notice there is no unsafe code except to call the FFI function; you could also use with_capacity + set_len (as in wasmup's answer) to skip initializing the Vec to 1000 zeros, and use from_bytes_with_nul_unchecked to skip checking the validity of the returned string.
See this:
fn get_compilation_log(&self) -> String {
    let mut len = 0;
    unsafe { gl::GetShaderiv(self.id, gl::INFO_LOG_LENGTH, &mut len) };
    assert!(len > 0);
    let mut buf = Vec::with_capacity(len as usize);
    let buf_ptr = buf.as_mut_ptr() as *mut gl::types::GLchar;
    unsafe {
        gl::GetShaderInfoLog(self.id, len, std::ptr::null_mut(), buf_ptr);
        buf.set_len(len as usize);
    };
    match String::from_utf8(buf) {
        Ok(log) => log,
        Err(vec) => panic!("Could not convert compilation log from buffer: {}", vec),
    }
}
See ffi:
// from_ptr is unsafe: strz_ptr must point to a valid nul-terminated string
let s = unsafe { CStr::from_ptr(strz_ptr) }.to_str().unwrap();
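A self-contained sketch of the CStr::from_ptr route, wrapped so that the unsafe call is guarded by a check on the buffer. The helper name c_buf_to_string is an assumption for illustration; it works on a plain slice of c_char and does not depend on the gl crate.

```rust
use std::ffi::CStr;
use std::os::raw::c_char;

/// Hypothetical helper: copies a nul-terminated C string out of `buf`.
/// Returns None if there is no nul terminator or the text is not UTF-8.
fn c_buf_to_string(buf: &[c_char]) -> Option<String> {
    // Refuse buffers without a terminator so from_ptr cannot scan past the slice.
    if !buf.contains(&0) {
        return None;
    }
    // SAFETY: `buf` is a live slice that contains a nul byte, so the scan
    // performed by from_ptr ends inside it.
    let c_str = unsafe { CStr::from_ptr(buf.as_ptr()) };
    c_str.to_str().ok().map(str::to_owned)
}
```

Because the function takes &[c_char], it accepts a Vec<i8> or Vec<u8> buffer depending on the platform's definition of c_char, without any cast of the vector itself.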

Is this the right way to read lines from file and split them into words in Rust?

Editor's note: This code example is from a version of Rust prior to 1.0 and is not syntactically valid Rust 1.0 code. Updated versions of this code produce different errors, but the answers still contain valuable information.
I've implemented the following method to return me the words from a file in a 2 dimensional data structure:
fn read_terms() -> Vec<Vec<String>> {
    let path = Path::new("terms.txt");
    let mut file = BufferedReader::new(File::open(&path));
    return file.lines().map(|x| x.unwrap().as_slice().words().map(|x| x.to_string()).collect()).collect();
}
Is this the right, idiomatic and efficient way in Rust? I'm wondering if collect() needs to be called so often and whether it's necessary to call to_string() here to allocate memory. Maybe the return type should be defined differently to be more idiomatic and efficient?
There is a shorter and more readable way of getting words from a text file.
use std::io::{BufRead, BufReader};
use std::fs::File;
let reader = BufReader::new(File::open("file.txt").expect("Cannot open file.txt"));
for line in reader.lines() {
    for word in line.unwrap().split_whitespace() {
        println!("word '{}'", word);
    }
}
You could instead read the entire file as a single String and then build a structure of references that points to the words inside:
use std::io::{self, Read};
use std::fs::File;
fn filename_to_string(s: &str) -> io::Result<String> {
    let mut file = File::open(s)?;
    let mut s = String::new();
    file.read_to_string(&mut s)?;
    Ok(s)
}
fn words_by_line<'a>(s: &'a str) -> Vec<Vec<&'a str>> {
    s.lines().map(|line| {
        line.split_whitespace().collect()
    }).collect()
}
fn example_use() {
    let whole_file = filename_to_string("terms.txt").unwrap();
    let wbyl = words_by_line(&whole_file);
    println!("{:?}", wbyl)
}
This will read the file with less overhead because it can slurp it into a single buffer, whereas reading lines with BufReader implies a lot of copying and allocating: first into the buffer inside BufReader, then into a newly allocated String for each line, and then into a newly allocated String for each word. It will also use less memory, because the single large String and vectors of references are more compact than many individual Strings.
A drawback is that you can't directly return the structure of references, because it can't live past the stack frame that holds the single large String. In example_use above, we have to put the large String into a let in order to call words_by_line. It is possible to get around this with unsafe code and wrapping the String and references in a private struct, but that is much more complicated.
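One safe way around that drawback, different from the unsafe wrapper mentioned above, is to store byte ranges instead of references, so the owning struct can be returned freely. The type name Words and its methods are assumptions made for this sketch.

```rust
use std::ops::Range;

/// Hypothetical owner type: keeps the whole text and, for each line,
/// the byte range of every word inside it.
struct Words {
    text: String,
    spans: Vec<Vec<Range<usize>>>,
}

impl Words {
    fn new(text: String) -> Self {
        let base = text.as_ptr() as usize;
        let spans: Vec<Vec<Range<usize>>> = text
            .lines()
            .map(|line| {
                line.split_whitespace()
                    .map(|w| {
                        // Each word is a subslice of `text`, so its offset is
                        // the distance between the two start addresses.
                        let start = w.as_ptr() as usize - base;
                        start..start + w.len()
                    })
                    .collect()
            })
            .collect();
        Words { text, spans }
    }

    /// Looks up a word by line and position; the &str borrows from self.
    fn word(&self, line: usize, idx: usize) -> Option<&str> {
        let range = self.spans.get(line)?.get(idx)?;
        Some(&self.text[range.clone()])
    }
}
```

The trade-off is that each lookup re-slices the text and each range takes two usize values, which is the same size as a &str, so the memory profile stays close to the reference-based version.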

Resources