I'm using a copy of the stdlib's RawVec to build my own data structure. I'd like to read a chunk of a file directly into my data structure (no extra copies). RawVec has a *const u8 as its underlying storage and I'd like to read the file directly into that.
// Goal:
// Takes a file, a pointer to read bytes into, and the number of free bytes at that pointer,
// and returns the number of bytes read
fn read_into_ptr(file: &mut File, ptr: *mut u8, free_space: usize) -> usize {
    // read up to free_space bytes into ptr
    todo!()
}
// What I have now requires an extra copy. First I read into my read buffer,
// then copy into the destination where I actually want the data.
// How can I remove this extra copy?
fn read_into_ptr(file: &mut File, ptr: *mut u8, read_buf: &mut [u8; 4096]) -> usize {
    let num_bytes = file.read(read_buf).unwrap();
    unsafe {
        ptr::copy_nonoverlapping(...)
    }
    num_bytes
}
Create a slice from the pointer, and read into it:
let slice = unsafe { std::slice::from_raw_parts_mut(ptr, free_space) };
file.read(slice).unwrap()
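As a rough sketch, the two lines above could be wrapped into the function from the question. Making it generic over Read (instead of taking a File) and returning io::Result are assumptions made here so the sketch can be exercised with an in-memory reader; the safety contract in the doc comment is also my wording, not the original answer's.

```rust
use std::io::{self, Read};

/// Reads up to `free_space` bytes from `reader` directly into the memory at `ptr`.
///
/// # Safety
/// `ptr` must be valid for reads and writes of `free_space` bytes, those bytes
/// must already be initialized, and nothing else may access them during the call.
unsafe fn read_into_ptr<R: Read>(
    reader: &mut R,
    ptr: *mut u8,
    free_space: usize,
) -> io::Result<usize> {
    // SAFETY: the caller guarantees the requirements listed above.
    let slice = unsafe { std::slice::from_raw_parts_mut(ptr, free_space) };
    // The reader writes straight into the destination; no intermediate buffer.
    reader.read(slice)
}
```

A File works as the reader because File implements Read. Note that read may return fewer bytes than free_space, so callers that need the region filled would loop or use read_exact.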
Background (Skippable)
On Linux, the file /var/run/utmp contains several utmp structures, each in raw binary format, following each other in the file. utmp itself is relatively large (384 bytes on my machine). I am trying to read this file as raw data, and then implement checks after the fact that the data makes sense. I'm not new to Rust, but this is my first real experience with the unsafe side of things.
Problem Statement
I have a file that contains several C struct utmps (docs). In Rust, I would like to read the entire file into a Vec<libc::utmpx>. More specifically, given a reader open to this file, how could I read one struct utmp?
What I have so far
Below are three different implementations of read_raw, each of which accepts a reader and returns a RawEntry (my alias for struct utmp). Which method is most correct? I am trying to write code that is as performant as possible, and I am worried that read_raw0 might be slower than the others if it involves memcpys. What is the best/fastest way to accomplish this?
use std::io::Read;
use libc::utmpx as RawEntry;
const RawEntrySize: usize = std::mem::size_of::<RawEntry>();
type RawEntryBuffer = [u8; RawEntrySize];
/// Read a raw utmpx struct
// After testing, this method doesn't work
pub fn read_raw0<R: Read>(reader: &mut R) -> RawEntry {
    let mut entry: RawEntry = unsafe { std::mem::zeroed() };
    unsafe {
        let mut entry_buf = std::mem::transmute::<RawEntry, RawEntryBuffer>(entry);
        reader.read_exact(&mut entry_buf[..]);
    }
    return entry;
}
/// Read a raw utmpx struct
pub fn read_raw1<R: Read>(reader: &mut R) -> RawEntry {
    // Worried this could cause alignment issues, or maybe it's okay
    // because transmute copies
    let mut buffer: RawEntryBuffer = [0; RawEntrySize];
    reader.read_exact(&mut buffer[..]);
    let entry = unsafe {
        std::mem::transmute::<RawEntryBuffer, RawEntry>(buffer)
    };
    return entry;
}
/// Read a raw utmpx struct
pub fn read_raw2<R: Read>(reader: &mut R) -> RawEntry {
    let mut entry: RawEntry = unsafe { std::mem::zeroed() };
    unsafe {
        let entry_ptr = std::mem::transmute::<&mut RawEntry, *mut u8>(&mut entry);
        let entry_slice = std::slice::from_raw_parts_mut(entry_ptr, RawEntrySize);
        reader.read_exact(entry_slice);
    }
    return entry;
}
Note: After more testing, it appears read_raw0 doesn't work. I believe this is because transmute creates a new buffer instead of referencing the struct.
This is what I came up with, which I imagine is about as fast as it gets for reading a single entry. It follows the spirit of your last version (read_raw2), but avoids the transmute: converting &mut T to *mut u8 can be done with two casts, t as *mut T as *mut u8. It also uses MaybeUninit instead of zeroed to be a bit more explicit (the assembly is likely the same once optimized). Lastly, the function is unsafe either way, so we may as well mark it as such and do away with the unsafe blocks.
use std::io::{self, Read};
use std::slice::from_raw_parts_mut;
use std::mem::{MaybeUninit, size_of};
pub unsafe fn read_raw_struct<R: Read, T: Sized>(src: &mut R) -> io::Result<T> {
    let mut buffer = MaybeUninit::uninit();
    let buffer_slice = from_raw_parts_mut(buffer.as_mut_ptr() as *mut u8, size_of::<T>());
    src.read_exact(buffer_slice)?;
    Ok(buffer.assume_init())
}
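To illustrate how a function of this shape might be called, here is a small variant with a usage target. The Record type and the name read_raw_zeroed are hypothetical and exist only for this example. The variant zero-fills the buffer (MaybeUninit::zeroed) so that the byte slice handed to read_exact never covers uninitialized memory; that is a conservative choice on my part, not something the answer above requires.

```rust
use std::io::{self, Read};
use std::mem::{size_of, MaybeUninit};
use std::slice::from_raw_parts_mut;

/// Hypothetical plain-data record, used only for this illustration.
#[repr(C)]
#[derive(Debug, PartialEq)]
struct Record {
    id: u32,
    flags: u32,
}

/// Reads `size_of::<T>()` bytes from `src` and reinterprets them as a `T`.
///
/// # Safety
/// Every possible bit pattern of `size_of::<T>()` bytes must be a valid `T`
/// (true for structs made only of integers, false for e.g. `bool` or references).
unsafe fn read_raw_zeroed<R: Read, T>(src: &mut R) -> io::Result<T> {
    let mut buffer = MaybeUninit::<T>::zeroed();
    // SAFETY: the buffer is zero-filled, so all size_of::<T>() bytes are
    // initialized, and u8 has no alignment requirement.
    let bytes = unsafe { from_raw_parts_mut(buffer.as_mut_ptr() as *mut u8, size_of::<T>()) };
    src.read_exact(bytes)?;
    // SAFETY: the caller guarantees any bit pattern is a valid T.
    Ok(unsafe { buffer.assume_init() })
}
```

Calling it looks like `let rec: Record = unsafe { read_raw_zeroed(&mut reader)? };`. The bytes are interpreted in native endianness, so this only round-trips data written on a machine with the same layout.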
I need to allocate a buffer for reading from a File, but this buffer must be aligned to the size of the cache line (64 bytes). I am looking for a function somewhat like this for Vec:
pub fn with_capacity_and_alignment(capacity: usize, alignment: u8) -> Vec<T>
which would give me the 64-byte alignment that I need. This obviously doesn't exist, but there might be some equivalents (i.e. "hacks") that I don't know about.
So, when I use this function (which will give me the desired alignment), I could write this code safely:
#[repr(C)]
struct Header {
    magic: u32,
    some_data1: u32,
    some_data2: u64,
}
let cache_line_size = 64; // bytes
let mut buffer: Vec<u8> = Vec::<u8>::with_capacity_and_alignment(some_size, cache_line_size);
match file.read_to_end(&mut buffer) {
    Ok(_) => {
        let header: Header = {
            // and since the buffer is aligned to 64 bytes, I won't get any SEGFAULT
            unsafe { transmute(buffer[0..(size_of::<Header>())]) }
        };
    }
}
and not get any panics because of alignment issues (like launching an instruction).
You can enforce the alignment of a type to a certain size using #[repr(align(...))]. We also use repr(C) to ensure that this type has the same memory layout as an array of bytes.
You can then create a vector of the aligned type and transform it to a vector of appropriate type:
use std::mem;
#[repr(C, align(64))]
struct AlignToSixtyFour([u8; 64]);
unsafe fn aligned_vec(n_bytes: usize) -> Vec<u8> {
    // Lazy math to ensure we always have enough.
    let n_units = (n_bytes / mem::size_of::<AlignToSixtyFour>()) + 1;
    let mut aligned: Vec<AlignToSixtyFour> = Vec::with_capacity(n_units);
    let ptr = aligned.as_mut_ptr();
    let len_units = aligned.len();
    let cap_units = aligned.capacity();
    mem::forget(aligned);
    Vec::from_raw_parts(
        ptr as *mut u8,
        len_units * mem::size_of::<AlignToSixtyFour>(),
        cap_units * mem::size_of::<AlignToSixtyFour>(),
    )
}
There are no guarantees that the Vec<u8> will remain aligned if you reallocate the data. This means that you cannot reallocate so you will need to know how big to allocate up front.
The function is unsafe for the same reason. When the vector is dropped, the memory must be back in its original form (a Vec<AlignToSixtyFour>) so that it is deallocated with the layout it was allocated with, but this function cannot control that.
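One way to sidestep both hazards (reallocation and dropping with the wrong layout) might be to own the allocation directly through std::alloc, so the same Layout is used to allocate and to free. The AlignedBuf type below is a hypothetical sketch under that assumption, not part of the answer above: it is fixed-size, zero-initialized, and cannot grow.

```rust
use std::alloc::{alloc_zeroed, dealloc, handle_alloc_error, Layout};

/// Hypothetical fixed-size, zero-initialized byte buffer with a chosen alignment.
struct AlignedBuf {
    ptr: *mut u8,
    layout: Layout,
}

impl AlignedBuf {
    /// `align` must be a power of two and `len` must be non-zero.
    fn new(len: usize, align: usize) -> Self {
        assert!(len > 0, "zero-sized allocations are not supported in this sketch");
        let layout = Layout::from_size_align(len, align).expect("invalid size or alignment");
        // SAFETY: the layout has a non-zero size (checked above).
        let ptr = unsafe { alloc_zeroed(layout) };
        if ptr.is_null() {
            handle_alloc_error(layout);
        }
        Self { ptr, layout }
    }

    /// The whole buffer as a mutable byte slice, e.g. as a target for `Read::read`.
    fn as_mut_slice(&mut self) -> &mut [u8] {
        // SAFETY: ptr points to layout.size() initialized bytes owned by self.
        unsafe { std::slice::from_raw_parts_mut(self.ptr, self.layout.size()) }
    }
}

impl Drop for AlignedBuf {
    fn drop(&mut self) {
        // SAFETY: same pointer and same layout that were used to allocate.
        unsafe { dealloc(self.ptr, self.layout) }
    }
}
```

Because the buffer only ever hands out slices, callers cannot trigger a reallocation, and the Drop impl always frees with the original layout.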
Thanks to BurntSushi5 for corrections and additions.
See also:
How can I align a struct to a specifed byte boundary?
Align struct to cache lines in Rust
How do I convert a Vec<T> to a Vec<U> without copying the vector?
Because of the limitations and unsafety above, another potential idea would be to allocate a big-enough buffer (maybe with some wiggle room), and then use align_to to get a properly aligned chunk. You could use the same AlignToSixtyFour type as above, and then convert the &[AlignToSixtyFour] into a &[u8] with similar logic.
This technique could be used to give out (optionally mutable) slices that are aligned. Since they are slices, you don't have to worry about the user reallocating or dropping them. This would allow you to wrap it up in a nicer type.
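A minimal sketch of the align_to idea, reusing the AlignToSixtyFour type from earlier. The function name aligned_window is hypothetical, and the caller is assumed to over-allocate the backing buffer by at least two alignment units so that an aligned window of the requested size fits.

```rust
use std::mem::size_of;
use std::slice::from_raw_parts_mut;

#[repr(C, align(64))]
struct AlignToSixtyFour([u8; 64]);

/// Returns a 64-byte-aligned window of exactly `n_bytes` inside `backing`,
/// or `None` if the aligned part of `backing` is too small.
fn aligned_window(backing: &mut [u8], n_bytes: usize) -> Option<&mut [u8]> {
    // SAFETY: AlignToSixtyFour wraps [u8; 64], so any bit pattern is valid.
    let (_prefix, middle, _suffix) = unsafe { backing.align_to_mut::<AlignToSixtyFour>() };
    let available = middle.len() * size_of::<AlignToSixtyFour>();
    if available < n_bytes {
        return None;
    }
    // SAFETY: `middle` covers `available` initialized bytes borrowed from
    // `backing`, and u8 has an alignment of 1.
    let bytes = unsafe { from_raw_parts_mut(middle.as_mut_ptr() as *mut u8, available) };
    Some(&mut bytes[..n_bytes])
}
```

For example, `let mut backing = vec![0u8; n + 128];` followed by `aligned_window(&mut backing, n)` yields a slice whose start address is a multiple of 64. The backing Vec keeps its normal layout, so dropping it is unproblematic.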
All that being said, I think that relying on alignment here is inappropriate for your actual goal of reading a struct from a file. Simply read the bytes (u32, u32, u64) and build the struct:
use byteorder::{LittleEndian, ReadBytesExt}; // 1.3.4
use std::{fs::File, io};
#[derive(Debug)]
struct Header {
    magic: u32,
    some_data1: u32,
    some_data2: u64,
}
impl Header {
    fn from_reader(mut reader: impl io::Read) -> Result<Self, Box<dyn std::error::Error>> {
        let magic = reader.read_u32::<LittleEndian>()?;
        let some_data1 = reader.read_u32::<LittleEndian>()?;
        let some_data2 = reader.read_u64::<LittleEndian>()?;
        Ok(Self {
            magic,
            some_data1,
            some_data2,
        })
    }
}
fn main() -> Result<(), Box<dyn std::error::Error>> {
    let mut f = File::open("/etc/hosts")?;
    let header = Header::from_reader(&mut f)?;
    println!("{:?}", header);
    Ok(())
}
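If pulling in the byteorder crate is undesirable, the same field-by-field approach can be sketched with only the standard library, using read_exact and from_le_bytes. The helper names read_u32_le and read_u64_le are hypothetical, and little-endian is assumed as in the example above.

```rust
use std::io::{self, Read};

#[derive(Debug, PartialEq)]
struct Header {
    magic: u32,
    some_data1: u32,
    some_data2: u64,
}

/// Reads four bytes and decodes them as a little-endian u32.
fn read_u32_le(reader: &mut impl Read) -> io::Result<u32> {
    let mut bytes = [0u8; 4];
    reader.read_exact(&mut bytes)?;
    Ok(u32::from_le_bytes(bytes))
}

/// Reads eight bytes and decodes them as a little-endian u64.
fn read_u64_le(reader: &mut impl Read) -> io::Result<u64> {
    let mut bytes = [0u8; 8];
    reader.read_exact(&mut bytes)?;
    Ok(u64::from_le_bytes(bytes))
}

impl Header {
    fn from_reader(mut reader: impl Read) -> io::Result<Self> {
        // Struct fields are evaluated in the order written, matching the file layout.
        Ok(Self {
            magic: read_u32_le(&mut reader)?,
            some_data1: read_u32_le(&mut reader)?,
            some_data2: read_u64_le(&mut reader)?,
        })
    }
}
```

No unsafe code and no alignment requirements are involved: each field is decoded from a small stack array, and truncated input surfaces as an io::Error from read_exact.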
See also:
How to read a struct from a file in Rust?
Is this the most natural way to read structs from a binary file?
Can I take a byte array and deserialize it into a struct?
Transmuting u8 buffer to struct in Rust
Can I somehow get an array from std::ptr::read?
I'd like to do something close to:
let mut v: Vec<u8> = ...
let view = &some_struct as *const _ as *const u8;
v.write(&std::ptr::read<[u8, ..30]>(view));
Which is not valid in this form (can't use the array signature).
If you want to obtain a slice from a raw pointer, use std::slice::from_raw_parts():
let slice = unsafe { std::slice::from_raw_parts(some_pointer, count_of_items) };
If you want to obtain a mutable slice from a raw pointer, use std::slice::from_raw_parts_mut():
let slice = unsafe { std::slice::from_raw_parts_mut(some_pointer, count_of_items) };
Are you sure you want read()? Without special care it will cause disaster on structs with destructors. Also, read() does not read a value of some specified type from a pointer to bytes; it reads exactly one value of the type behind the pointer (e.g. if it is *const u8 then read() will read one byte) and returns it.
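To make the point about the pointee type concrete, here is a small sketch. The function names first_byte and first_four are hypothetical. Reading through a *const u8 yields one byte; casting the pointer to *const [u8; 4] first makes read() return a whole array, which is fine alignment-wise because byte arrays have an alignment of 1.

```rust
use std::ptr;

/// Reads exactly one `u8`, because the pointee type is `u8`.
///
/// # Safety
/// `p` must be valid for reading one byte.
unsafe fn first_byte(p: *const u8) -> u8 {
    unsafe { ptr::read(p) }
}

/// Reads four bytes as an array by changing the pointee type before reading.
///
/// # Safety
/// `p` must be valid for reading four bytes.
unsafe fn first_four(p: *const u8) -> [u8; 4] {
    unsafe { ptr::read(p as *const [u8; 4]) }
}
```

Both functions copy plain bytes, so the destructor concern does not arise here; it applies when read() duplicates a value of a type that owns resources.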
If you only want to write byte contents of a structure into a vector, you can obtain a slice from the raw pointer:
use std::mem;
use std::io::Write;
struct SomeStruct {
    a: i32,
}
fn main() {
    let some_struct = SomeStruct { a: 32 };
    let mut v: Vec<u8> = Vec::new();
    // View the struct as a pointer to its first byte.
    let view = &some_struct as *const _ as *const u8;
    let slice = unsafe { std::slice::from_raw_parts(view, mem::size_of::<SomeStruct>()) };
    // write_all keeps writing until the whole slice has been consumed.
    v.write_all(slice).expect("Unable to write");
    println!("{:?}", v);
}
This makes your code platform-dependent and even compiler-dependent: if you use types of variable size (e.g. isize/usize) in your struct, or if you don't use #[repr(C)], the data you wrote into the vector is likely to be read as garbage on another machine (and, as far as I remember, even #[repr(C)] may not fully solve this problem in some cases).
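A portable alternative, sketched here under the assumption that a fixed little-endian encoding is acceptable, is to serialize each field explicitly instead of viewing the struct's memory. The method names write_le and read_le are hypothetical.

```rust
struct SomeStruct {
    a: i32,
}

impl SomeStruct {
    /// Appends the field as four little-endian bytes, independent of the host.
    fn write_le(&self, out: &mut Vec<u8>) {
        out.extend_from_slice(&self.a.to_le_bytes());
    }

    /// Rebuilds the struct from four little-endian bytes.
    fn read_le(bytes: [u8; 4]) -> Self {
        Self { a: i32::from_le_bytes(bytes) }
    }
}
```

This needs no unsafe code, and the encoded bytes do not depend on struct layout, padding, or the endianness of the machine that wrote them.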
I want to read a UTF-8 string from a file with a known offset and size, so I wrote:
use std::fs::File;
use std::io::{Read, Seek, SeekFrom};
use std::str::from_utf8;
fn test(file: &mut File, offset: u64, length: usize) -> Result<String, String> {
    file.seek(SeekFrom::Start(offset)).map_err(|err| err.to_string())?;
    let mut buffer = Vec::<u8>::with_capacity(length);
    buffer.resize(length, 0_u8);
    file.read_exact(&mut buffer).map_err(|err| err.to_string())?;
    let utf8_s = from_utf8(&buffer).map_err(|_| "invalid utf-8 data in data".to_string())?;
    Ok(String::from(utf8_s))
}
In my code I dislike two things:
1. I initialize the Vec with 0, but this is useless, because on the next line I call file.read_exact. Can I allocate memory on the heap without initializing it?
2. I create the Vec on the heap and at the end I allocate memory again for the String: I allocate the same amount of memory and copy from one location to the other. Is it possible to implement this function with length memory requirements, not 2 * length?
Rust has no concept of "write-only" memory, so the only way to avoid initialising the Vec would be with unsafe code. Unless you can prove this is an actual performance problem for your program, just leave it as-is.
You could just use String::from_utf8 instead, which does the conversion in-place.
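A minimal sketch of that suggestion follows. The name read_string_at and the generic Read + Seek bound are assumptions made here so the function also works with in-memory readers; a &mut File can be passed unchanged. The zero-initialization is kept, as recommended above.

```rust
use std::io::{Read, Seek, SeekFrom};

fn read_string_at<R: Read + Seek>(
    src: &mut R,
    offset: u64,
    length: usize,
) -> Result<String, String> {
    src.seek(SeekFrom::Start(offset)).map_err(|err| err.to_string())?;
    // One allocation of `length` bytes, zero-filled.
    let mut buffer = vec![0u8; length];
    src.read_exact(&mut buffer).map_err(|err| err.to_string())?;
    // String::from_utf8 takes ownership of the Vec and reuses its allocation,
    // so no second buffer of `length` bytes is created.
    String::from_utf8(buffer).map_err(|_| "invalid utf-8 data".to_string())
}
```

Validation still scans the bytes once, but the data itself is not copied a second time.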
You can use the read_to_string() function in combination with take() to read exactly the requested number of bytes:
fn test(file: &mut File, offset: u64, length: usize) -> Result<String, String> {
    file.seek(SeekFrom::Start(offset)).map_err(|err| err.to_string())?;
    let mut res = String::new(); // or String::with_capacity(length)
    let n = file
        .take(length as u64)
        .read_to_string(&mut res)
        .map_err(|err| err.to_string())?;
    if n != length {
        return Err("wrong num bytes".to_string());
    }
    Ok(res)
}