Read an arbitrary number of bytes from type implementing Read

I have something that is Read; currently it's a File. I want to read a number of bytes from it that is only known at runtime (length prefix in a binary data structure).
So I tried this:
let mut vec = Vec::with_capacity(length);
let count = file.read(vec.as_mut_slice()).unwrap();
but count is zero because vec.as_mut_slice().len() is zero as well.
[0u8;length] of course doesn't work because the size must be known at compile time.
I wanted to do
let mut vec = Vec::with_capacity(length);
let count = file.take(length).read_to_end(vec).unwrap();
but take's receiver parameter is a T and I only have &mut T (and I'm not really sure why it's needed anyway).
I guess I can replace File with BufReader and dance around with fill_buf and consume which sounds complicated enough but I still wonder: Have I overlooked something?

Like the Iterator adaptors, the IO adaptors take self by value to be as efficient as possible. Also like the Iterator adaptors, a mutable reference to a Read is also a Read.
To solve your problem, you just need Read::by_ref:
use std::io::Read;
use std::fs::File;
fn main() {
let mut file = File::open("/etc/hosts").unwrap();
let length = 5;
let mut vec = Vec::with_capacity(length);
file.by_ref().take(length as u64).read_to_end(&mut vec).unwrap();
let mut the_rest = Vec::new();
file.read_to_end(&mut the_rest).unwrap();
}
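For the length-prefixed case described in the question, the same by_ref trick also works when all you hold is a &mut reference to the reader. The following is only a minimal sketch: the helper name read_record and the one-byte length prefix are assumptions made for illustration, not something taken from the question's actual format.

```rust
use std::io::{self, Cursor, Read};

/// Hypothetical helper: reads a one-byte length prefix, then that many payload bytes.
fn read_record<R: Read>(reader: &mut R) -> io::Result<Vec<u8>> {
    // Assumed wire format: a single u8 giving the payload length.
    let mut prefix = [0u8; 1];
    reader.read_exact(&mut prefix)?;
    let length = prefix[0] as usize;

    // `by_ref()` reborrows the reader, so `take` consumes only the borrow.
    let mut payload = Vec::with_capacity(length);
    let read = reader.by_ref().take(length as u64).read_to_end(&mut payload)?;
    if read < length {
        return Err(io::Error::new(io::ErrorKind::UnexpectedEof, "record truncated"));
    }
    Ok(payload)
}

fn main() -> io::Result<()> {
    // Two records back to back: "abc" and "xy".
    let mut input = Cursor::new(vec![3, b'a', b'b', b'c', 2, b'x', b'y']);
    let first = read_record(&mut input)?;
    let second = read_record(&mut input)?;
    println!("{:?} {:?}", first, second);
    Ok(())
}
```

Because the reader is only borrowed, it can keep being used for the next record after each call.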

1. Fill-this-vector version
Your first solution is close to working. You identified the problem but did not try to solve it! The problem is that, whatever the capacity of the vector, it is still empty (vec.len() == 0). Instead, you could actually fill it with zeroed elements, such as:
let mut vec = vec![0u8; length];
The following full code works:
// `as_mut_slice()` is stable since Rust 1.7; as of 2015-07-19 it needed `#![feature(convert)]`
use std::fs::File;
use std::io::Read;
fn main() {
let mut file = File::open("/usr/share/dict/words").unwrap();
let length: usize = 100;
let mut vec = vec![0u8; length];
let count = file.read(vec.as_mut_slice()).unwrap();
println!("read {} bytes.", count);
println!("vec = {:?}", vec);
}
Of course, you still have to check whether count == length, and read more data into the buffer if that's not the case.
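In current Rust the "check count and read more" loop does not have to be written by hand, since Read::read_exact does it. Here is a minimal sketch; the helper name read_n is made up for illustration, and a byte slice stands in for the file.

```rust
use std::io::{self, Read};

/// Hypothetical helper: reads exactly `length` bytes or fails.
fn read_n<R: Read>(mut reader: R, length: usize) -> io::Result<Vec<u8>> {
    let mut buf = vec![0u8; length];
    // `read_exact` keeps calling `read` until the buffer is full and
    // returns an `UnexpectedEof` error if the input ends first.
    reader.read_exact(&mut buf)?;
    Ok(buf)
}

fn main() {
    // A byte slice implements `Read`, which stands in for a file here.
    let data: &[u8] = b"hello world";
    let head = read_n(data, 5).unwrap();
    println!("{:?}", head);
    // Asking for more bytes than are available is reported as an error.
    assert!(read_n(data, 100).is_err());
}
```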
2. Iterator version
Your second solution is better because you won't have to check how many bytes have been read, and you won't have to re-read in case count != length. You need to use the bytes() function on the Read trait (implemented by File). This transforms the file into a stream (i.e. an iterator). Because errors can still happen, you don't get an Iterator<Item = u8> but an Iterator<Item = io::Result<u8>>. Hence you need to deal with failures explicitly within the iterator. We're going to use unwrap() here for simplicity:
use std::fs::File;
use std::io::Read;
fn main() {
let file = File::open("/usr/share/dict/words").unwrap();
let length: usize = 100;
let vec: Vec<u8> = file
.bytes()
.take(length)
.map(|r: Result<u8, _>| r.unwrap()) // or deal explicitly with failure!
.collect();
println!("vec = {:?}", vec);
}
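If you would rather deal with the failure than unwrap, the iterator of results can be collected straight into an io::Result. This is a sketch under that assumption; first_bytes is a hypothetical name, and a byte slice stands in for the file.

```rust
use std::io::{self, Read};

/// Hypothetical helper: collects up to `length` bytes, stopping at the first error.
fn first_bytes<R: Read>(reader: R, length: usize) -> io::Result<Vec<u8>> {
    // Collecting an iterator of `io::Result<u8>` into `io::Result<Vec<u8>>`
    // yields `Err` as soon as one item fails.
    reader.bytes().take(length).collect()
}

fn main() -> io::Result<()> {
    let data: &[u8] = b"hello world";
    let head = first_bytes(data, 5)?;
    println!("{:?}", head);
    Ok(())
}
```

Note that this returns fewer than length bytes if the input is shorter. On a real File it is worth wrapping the reader in a BufReader first, since bytes() otherwise issues one small read per byte.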

You can use a bit of unsafe to create a vector of uninitialized memory and skip zeroing it:
let mut vec: Vec<u8> = Vec::with_capacity(length);
unsafe { vec.set_len(length); }
let count = file.read(vec.as_mut_slice()).unwrap();
This way, vec.len() will be set to its capacity, and all bytes in it will be uninitialized (likely zeros, but possibly some garbage). Be careful, though: the documentation for Read::read states that calling it with an uninitialized buffer is not safe and can lead to undefined behavior, because an implementation may read from the buffer before writing to it. Unless profiling shows that the zeroing matters, prefer vec![0u8; length].
Note that the read() method on Read is not guaranteed to fill the whole slice. It may return after reading fewer bytes than the slice length. When this was written there were several RFCs on adding methods to fill this gap; Read::read_exact, stable since Rust 1.6, now does so.

Related

Copy a slice of i32 pixels into an [u8] slice

How to copy a row of pixels in an i32 slice into an existing slice of pixels in an [u8] slice ?
Both slices are in the same memory layout (i.e. RGBA) but I don't know the unsafe syntax to copy one efficiently into the other. In C it would just be a memcpy().

You can flat_map the byte representation of each i32 into a Vec<u8>:
fn main() {
let pixels: &[i32] = &[-16776961, 16711935, 65535, -1];
let bytes: Vec<u8> = pixels
.iter()
.flat_map(|e| e.to_ne_bytes())
.collect();
println!("{bytes:?}");
}
There are different ways to handle the endianness of the system. I used to_ne_bytes to preserve the native order, but there are also to_le_bytes and to_be_bytes if the byte order needs to be controlled.
Alternatively, if you know the size of your pixel buffer ahead of time, you can use an unsafe transmute:
const BUF_LEN: usize = 4; // this is your buffer length
fn main() {
let pixels: [i32; BUF_LEN] = [-16776961, 16711935, 65535, -1];
let bytes = unsafe {
std::mem::transmute::<[i32; BUF_LEN], [u8; BUF_LEN * 4]>(pixels)
};
println!("{bytes:?}");
}

Assuming that you in fact do not need any byte reordering, the bytemuck library is the tool to use here, as it allows you to write the i32 to u8 reinterpretation without needing to consider safety (because bytemuck has checked it for you).
Specifically, bytemuck::cast_slice() will allow converting &[i32] to &[u8].
(In general, the function may panic if there is an alignment or size problem, but there never can be such a problem when converting to u8 or any other one-byte type.)
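To illustrate what such a reinterpretation involves without pulling in a dependency, here is a std-only sketch that copies a row into an existing byte buffer, which is the memcpy-style operation the question asks about. It is not bytemuck's implementation; the helper names pixels_as_bytes and copy_row are hypothetical, and the bytes come out in native order.

```rust
/// Hypothetical helper: views a slice of `i32` pixels as native-endian bytes.
fn pixels_as_bytes(pixels: &[i32]) -> &[u8] {
    // SAFETY: `u8` has alignment 1, `i32` has no padding bytes, and the
    // byte length is exactly the size of the source slice.
    unsafe {
        std::slice::from_raw_parts(
            pixels.as_ptr() as *const u8,
            std::mem::size_of_val(pixels),
        )
    }
}

/// Hypothetical helper: copies one row of pixels into an existing byte buffer.
/// Panics if `dst` is shorter than the row.
fn copy_row(dst: &mut [u8], row: &[i32]) {
    let src = pixels_as_bytes(row);
    dst[..src.len()].copy_from_slice(src);
}

fn main() {
    let row: [i32; 2] = [-1, 0];
    let mut out = [7u8; 8];
    copy_row(&mut out, &row);
    println!("{:?}", out);
}
```

copy_from_slice on byte slices typically compiles down to a plain memory copy, so the only unsafe part is the slice reinterpretation, which is what bytemuck::cast_slice would otherwise check for you.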

Build a Hashset from a lines iterator

I don't understand why this doesn't work:
use std::collections::HashSet;
let test = "foo\nbar\n";
let hashset: HashSet<_> = test
.lines()
.collect::<Result<HashSet<_>, _>>()
.unwrap()
I get this error:
a value of type Result<HashSet<_>, _> cannot be built from an iterator over elements of type &str
I tried to use an intermediary Vec but I didn't succeed either. I understand the error but I don't know how to elegantly fix this
This works but isn't the fastest solution:
use std::collections::HashSet;
let test = "foo\nbar\n";
let mut hashset = HashSet::new();
for word in test.lines() {
hashset.insert(word.to_string());
}

The lines() method cannot fail, as it operates over a &str, therefore you should collect to a HashSet<&str>.
See https://doc.rust-lang.org/std/primitive.str.html#method.lines.
For example:
let test = "foo\nbar\n";
let hashset: HashSet<&str> = test
.lines()
.collect();
Your confusion here seems to come from the fact that there's a similar lines method that operates on BufRead which can fail due to operating on files, or other I/O based sources.
See https://doc.rust-lang.org/std/io/trait.BufRead.html#method.lines.
Apart from this difference, BufRead::lines also differs in that it yields owned Strings (each wrapped in an io::Result) instead of borrowed &str.
If you want to create a HashSet which owns its contents, you can modify your code like this:
let test = "foo\nbar\n";
let hashset: HashSet<String> = test
.lines()
.map(String::from)
.collect();
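For contrast, here is a sketch of the case where collecting into a Result is the right thing to do, namely when the lines come from a BufRead source. The helper name unique_lines is hypothetical, and a byte slice stands in for a file.

```rust
use std::collections::HashSet;
use std::io::{self, BufRead};

/// Hypothetical helper: collects the distinct lines of any buffered reader.
fn unique_lines<R: BufRead>(reader: R) -> io::Result<HashSet<String>> {
    // `BufRead::lines` yields `io::Result<String>`, so collecting into a
    // `Result` is valid here and stops at the first I/O error.
    reader.lines().collect()
}

fn main() -> io::Result<()> {
    // A byte slice implements `BufRead` and stands in for a file.
    let set = unique_lines("foo\nbar\nfoo\n".as_bytes())?;
    println!("{} distinct lines", set.len());
    Ok(())
}
```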

Cast vector of i8 to vector of u8 in Rust? [duplicate]

Is there a better way to cast Vec<i8> to Vec<u8> in Rust than these two?
1. creating a copy by mapping and casting every entry
2. using std::mem::transmute
Option (1) is slow, and for option (2) the docs say "transmute should be the absolute last resort".
A bit of background maybe: I'm getting a Vec<i8> from the unsafe gl::GetShaderInfoLog() call and want to create a string from this vector of chars by using String::from_utf8().

The other answers provide excellent solutions for the underlying problem of creating a string from Vec<i8>. To answer the question as posed, creating a Vec<u8> from data in a Vec<i8> can be done without copying or transmuting the vector. As pointed out by @trentcl, transmuting the vector directly constitutes undefined behavior because Vec is allowed to have a different layout for different types.
The correct (though still requiring the use of unsafe) way to transfer a vector's data without copying it is:
1. obtain the *mut i8 pointer to the data in the vector, along with its length and capacity
2. leak the original vector to prevent it from freeing the data
3. use Vec::from_raw_parts to build a new vector, giving it the pointer cast to *mut u8 - this is the unsafe part, because we are vouching that the pointer contains valid and initialized data, and that it is not in use by other objects, and so on.
This is not UB because the new Vec is given the pointer of the correct type from the start. Code:
fn vec_i8_into_u8(v: Vec<i8>) -> Vec<u8> {
// ideally we'd use Vec::into_raw_parts, but it's unstable,
// so we have to do it manually:
// first, make sure v's destructor doesn't free the data
// it thinks it owns when it goes out of scope
let mut v = std::mem::ManuallyDrop::new(v);
// then, pick apart the existing Vec
let p = v.as_mut_ptr();
let len = v.len();
let cap = v.capacity();
// finally, adopt the data into a new Vec
unsafe { Vec::from_raw_parts(p as *mut u8, len, cap) }
}
fn main() {
let v = vec![-1i8, 2, 3];
assert!(vec_i8_into_u8(v) == vec![255u8, 2, 3]);
}

transmute on a Vec is always, 100% wrong, causing undefined behavior, because the layout of Vec is not specified. However, as the page you linked also mentions, you can use raw pointers and Vec::from_raw_parts to perform this correctly. user4815162342's answer shows how.
(std::mem::transmute is the only item in the Rust standard library whose documentation consists mostly of suggestions for how not to use it. Take that how you will.)
However, in this case, from_raw_parts is also unnecessary. The best way to deal with C strings in Rust is with the wrappers in std::ffi, CStr and CString. There may be better ways to work this into your real code, but here's one way you could use CStr to borrow the log buffer as a &str, keeping the buffer as bytes and casting only the pointer passed to the FFI call:
use std::ffi::CStr;
const BUF_SIZE: usize = 1000;
let mut info_log: Vec<u8> = vec![0; BUF_SIZE];
let mut len: gl::types::GLsizei = 0;
unsafe {
gl::GetShaderInfoLog(shader, BUF_SIZE as gl::types::GLsizei, &mut len, info_log.as_mut_ptr() as *mut gl::types::GLchar);
}
// `len` excludes the nul terminator, so include one more byte
let log = CStr::from_bytes_with_nul(&info_log[..len as usize + 1])
.expect("Slice must be nul terminated and contain no interior nul bytes")
.to_str()
.expect("Slice must be valid UTF-8 text");
Notice there is no unsafe code except to call the FFI function; you could also use with_capacity + set_len (as in wasmup's answer) to skip initializing the Vec to 1000 zeros, and use from_bytes_with_nul_unchecked to skip checking the validity of the returned string.
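The CStr part can be tried on its own without any OpenGL context. The sketch below fakes the buffer that a C API would fill; log_from_buffer is a hypothetical helper name, and it returns None instead of panicking when the buffer is not a valid nul-terminated UTF-8 string.

```rust
use std::ffi::CStr;

/// Hypothetical helper: borrows the text of a nul-terminated log buffer.
/// `len` is the number of bytes written, excluding the terminating nul.
fn log_from_buffer(buf: &[u8], len: usize) -> Option<&str> {
    // Include the terminator in the slice handed to `CStr`.
    let with_nul = buf.get(..=len)?;
    let cstr = CStr::from_bytes_with_nul(with_nul).ok()?;
    cstr.to_str().ok()
}

fn main() {
    // Simulate what a C API would leave in a zero-initialized buffer.
    let mut buf = vec![0u8; 16];
    buf[..5].copy_from_slice(b"error");
    println!("{:?}", log_from_buffer(&buf, 5));
}
```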

See this:
fn get_compilation_log(&self) -> String {
let mut len = 0;
unsafe { gl::GetShaderiv(self.id, gl::INFO_LOG_LENGTH, &mut len) };
assert!(len > 0);
let mut buf = Vec::with_capacity(len as usize);
let buf_ptr = buf.as_mut_ptr() as *mut gl::types::GLchar;
unsafe {
gl::GetShaderInfoLog(self.id, len, std::ptr::null_mut(), buf_ptr);
buf.set_len(len as usize);
};
match String::from_utf8(buf) {
Ok(log) => log,
Err(vec) => panic!("Could not convert compilation log from buffer: {}", vec),
}
}
See also std::ffi (CStr::from_ptr is an unsafe function):
let s = unsafe { CStr::from_ptr(strz_ptr) }.to_str().unwrap();

Convert image to bytes and then write to new file

I'm trying to take an image that is converted into a vector of bytes and write those bytes to a new file. The first part is working, and my code is compiling, but the new file that is created ends up empty (nothing is written to it). What am I missing?
Is there a cleaner way to convert Vec<u8> into &[u8] so that it can be written? The way I'm currently doing it seems kind of ridiculous...
use std::os;
use std::io::BufferedReader;
use std::io::File;
use std::io::BufferedWriter;
fn get_file_buffer(path_str: String) -> Vec<u8> {
let path = Path::new(path_str.as_bytes());
let file = File::open(&path);
let mut reader = BufferedReader::new(file);
match reader.read_to_end() {
Ok(x) => x,
Err(_) => vec![0],
}
}
fn main() {
let file = get_file_buffer(os::args()[1].clone());
let mut new_file = File::create(&Path::new("foo.png")).unwrap();
let mut writer = BufferedWriter::new(new_file);
writer.write(String::from_utf8(file).unwrap().as_bytes()).unwrap();
writer.flush().unwrap();
}

Given a Vec<T>, you can get a &[T] out of it in two ways:
Take a reference to a dereference of it, i.e. &*file; this works because Vec<T> implements Deref<Target = [T]>, so *file is effectively of type [T] (though using it without borrowing it, i.e. a bare *file, is not legal).
Call the as_slice() method.
As the BufWriter docs say, “the buffer will be written out when the writer is dropped”, so that writer.flush().unwrap() is not strictly necessary, serving only to make handling of errors explicit.
But as for the behaviour you describe, I mostly do not observe it. So long as you do not encounter any I/O errors, the version not using the String dance will work fine, while with the String dance it will panic if the input data is not legal UTF-8 (which, if you’re dealing with images, it probably won’t be). String::from_utf8 returns an Err in such cases, and so unwrapping it panics.
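Since the question's code predates Rust 1.0, here is a sketch of the same read-then-write flow against the current std API, without the String round trip. The helper name copy_bytes and the temp-file paths are assumptions for illustration only.

```rust
use std::fs::File;
use std::io::{self, BufReader, BufWriter, Read, Write};
use std::path::Path;

/// Hypothetical helper: copies `src` to `dst` through an in-memory buffer
/// and returns the number of bytes written.
fn copy_bytes(src: &Path, dst: &Path) -> io::Result<usize> {
    let mut bytes = Vec::new();
    BufReader::new(File::open(src)?).read_to_end(&mut bytes)?;

    let mut writer = BufWriter::new(File::create(dst)?);
    // `&bytes` is a `&Vec<u8>`, which coerces to `&[u8]`; no String involved.
    writer.write_all(&bytes)?;
    // Flush explicitly so a write error is reported instead of lost on drop.
    writer.flush()?;
    Ok(bytes.len())
}

fn main() -> io::Result<()> {
    // Assumed paths for illustration: scratch files in the temp directory.
    let dir = std::env::temp_dir();
    let (src, dst) = (dir.join("copy_bytes_in.bin"), dir.join("copy_bytes_out.bin"));
    std::fs::write(&src, [0xFFu8, 0xD8, 0x00, 0x80])?; // not valid UTF-8
    let n = copy_bytes(&src, &dst)?;
    println!("copied {} bytes", n);
    Ok(())
}
```

For a straight copy, std::fs::read followed by std::fs::write, or std::fs::copy, would be shorter still; the buffered version is shown because it mirrors the structure of the original code.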

Is this the right way to read lines from file and split them into words in Rust?

Editor's note: This code example is from a version of Rust prior to 1.0 and is not syntactically valid Rust 1.0 code. Updated versions of this code produce different errors, but the answers still contain valuable information.
I've implemented the following method to return me the words from a file in a 2 dimensional data structure:
fn read_terms() -> Vec<Vec<String>> {
let path = Path::new("terms.txt");
let mut file = BufferedReader::new(File::open(&path));
return file.lines().map(|x| x.unwrap().as_slice().words().map(|x| x.to_string()).collect()).collect();
}
Is this the right, idiomatic and efficient way in Rust? I'm wondering if collect() needs to be called so often and whether it's necessary to call to_string() here to allocate memory. Maybe the return type should be defined differently to be more idiomatic and efficient?

There is a shorter and more readable way of getting words from a text file.
use std::io::{BufRead, BufReader};
use std::fs::File;
let reader = BufReader::new(File::open("file.txt").expect("Cannot open file.txt"));
for line in reader.lines() {
for word in line.unwrap().split_whitespace() {
println!("word '{}'", word);
}
}
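If you do want the two-dimensional structure from the question rather than printing each word, the same building blocks can be collected. This is a sketch of one possible modern counterpart to read_terms; taking any BufRead instead of a fixed file path is an assumption made so the example runs without a file.

```rust
use std::io::{self, BufRead};

/// Hypothetical modern counterpart of the question's `read_terms`.
fn read_terms<R: BufRead>(reader: R) -> io::Result<Vec<Vec<String>>> {
    reader
        .lines()
        .map(|line| {
            // Propagate I/O errors instead of unwrapping each line.
            line.map(|l| l.split_whitespace().map(String::from).collect::<Vec<String>>())
        })
        .collect()
}

fn main() -> io::Result<()> {
    // A byte slice stands in for `BufReader::new(File::open(...)?)`.
    let terms = read_terms("alpha beta\ngamma\n".as_bytes())?;
    println!("{:?}", terms);
    Ok(())
}
```

The String::from per word is needed here because each line's String is dropped at the end of the closure, so the words cannot borrow from it.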

You could instead read the entire file as a single String and then build a structure of references that points to the words inside:
use std::io::{self, Read};
use std::fs::File;
fn filename_to_string(s: &str) -> io::Result<String> {
let mut file = File::open(s)?;
let mut s = String::new();
file.read_to_string(&mut s)?;
Ok(s)
}
fn words_by_line<'a>(s: &'a str) -> Vec<Vec<&'a str>> {
s.lines().map(|line| {
line.split_whitespace().collect()
}).collect()
}
fn example_use() {
let whole_file = filename_to_string("terms.txt").unwrap();
let wbyl = words_by_line(&whole_file);
println!("{:?}", wbyl)
}
This will read the file with less overhead because it can slurp it into a single buffer, whereas reading lines with BufReader implies a lot of copying and allocating, first into the buffer inside BufReader, then into a newly allocated String for each line, and then into a newly allocated String for each word. It will also use less memory, because the single large String and vectors of references are more compact than many individual Strings.
A drawback is that you can't directly return the structure of references, because it can't live past the stack frame that holds the single large String. In example_use above, we have to put the large String into a let in order to call words_by_line. It is possible to get around this with unsafe code and wrapping the String and references in a private struct, but that is much more complicated.
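A safe alternative to that unsafe struct, swapped in here as a different technique, is to store byte ranges into the String instead of references, so that the owner and the index can be returned together. The sketch below assumes hypothetical names Terms, new and word.

```rust
use std::ops::Range;

/// Hypothetical owner of the text plus the byte range of every word.
struct Terms {
    text: String,
    lines: Vec<Vec<Range<usize>>>,
}

impl Terms {
    fn new(text: String) -> Self {
        let base = text.as_ptr() as usize;
        let lines: Vec<Vec<Range<usize>>> = text
            .lines()
            .map(|line| {
                line.split_whitespace()
                    .map(|word| {
                        // Each word is a subslice of `text`, so its offset is
                        // the distance between the two start pointers.
                        let start = word.as_ptr() as usize - base;
                        start..start + word.len()
                    })
                    .collect()
            })
            .collect();
        Terms { text, lines }
    }

    /// Returns word `index` of line `line`, if it exists.
    fn word(&self, line: usize, index: usize) -> Option<&str> {
        let range = self.lines.get(line)?.get(index)?.clone();
        Some(&self.text[range])
    }
}

fn main() {
    let terms = Terms::new(String::from("alpha beta\ngamma"));
    println!("{:?}", terms.word(0, 1));
}
```

This keeps the single-allocation benefit for the text while letting a function return the whole Terms value, at the cost of one slice operation per lookup.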
