How should I read the contents of a file respecting endianess?

How should I read the contents of a file respecting endianess? - io

I can see that in Rust I can read a file to a byte array with:
File::open(&Path::new("fid")).read_to_end();
I can also read just one u32 in either big endian or little endian format with:
File::open(&Path::new("fid")).read_be_u32();
File::open(&Path::new("fid")).read_le_u32();
but as far as I can see i'm going to have to do something like this (simplified):
let path = Path::new("fid");
let mut file = File::open(&path);
let mut v = vec![];
for n in range(1u64, path.stat().unwrap().size/4u64){
v.push(if big {
file.read_be_u32()
} else {
file.read_le_u32()
});
}
But that's ugly as hell and I'm just wondering if there's a nicer way to do this.
Ok so the if in the loop was a big part of what was ugly so I hoisted that as suggested, the new version is as follows:
let path = Path::new("fid");
let mut file = File::open(&path);
let mut v = vec![];
let fun = if big {
||->IoResult<u32>{file.read_be_u32()}
} else {
||->IoResult<u32>{file.read_le_u32()}
};
for n in range(1u64, path.stat().unwrap().size/4u64){
v.push(fun());
}
Learned about range_step and using _ as an index, so now I'm left with:
let path = Path::new("fid");
let mut file = File::open(&path);
let mut v = vec![];
let fun = if big {
||->IoResult<u32>{file.read_be_u32()}
} else {
||->IoResult<u32>{file.read_le_u32()}
};
for _ in range_step(0u64, path.stat().unwrap().size,4u64){
v.push(fun().unwrap());
}
Any more advice? This is already looking much better.

This solution reads the whole file into a buffer, then creates a view of the buffer as words, then maps those words into a vector, converting endianness. collect() avoids all the reallocations of growing a mutable vector. You could also mmap the file rather than reading it into a buffer.
use std::io::File;
use std::num::{Int, Num};
fn from_bytes<'a, T: Num>(buf: &'a [u8]) -> &'a [T] {
unsafe {
std::mem::transmute(std::raw::Slice {
data: buf.as_ptr(),
len: buf.len() / std::mem::size_of::<T>()
})
}
}
fn main() {
let buf = File::open(&Path::new("fid")).read_to_end().unwrap();
let words: &[u32] = from_bytes(buf.as_slice());
let big = true;
let v: Vec<u32> = words.iter().map(if big {
|&n| { Int::from_be(n) }
} else {
|&n| { Int::from_le(n) }
}).collect();
println!("{}", v);
}

Related

How do I avoid obfuscating logic in a `loop`?

Trying to respect Rust safety rules leads me to write code that is, in this case, less clear than the alternative.
It's marginal, but must be a very common pattern, so I wonder if there's any better way.
The following example doesn't compile:
async fn query_all_items() -> Vec<u32> {
let mut items = vec![];
let limit = 10;
loop {
let response = getResponse().await;
// response is moved here
items.extend(response);
// can't do this, response is moved above
if response.len() < limit {
break;
}
}
items
}
In order to satisfy Rust safety rules, we can pre-compute the break condition:
async fn query_all_items() -> Vec<u32> {
let mut items = vec![];
let limit = 10;
loop {
let response = getResponse().await;
let should_break = response.len() < limit;
// response is moved here
items.extend(response);
// meh
if should_break {
break;
}
}
items
}
Is there any other way?

I agree with Daniel's point that this should be a while rather than a loop, though I'd move the logic to the while rather than creating a boolean:
let mut len = limit;
while len >= limit {
let response = queryItems(limit).await?;
len = response.len();
items.extend(response);
}

Not that you should do this, but an async stream version is possible. However a plain old loop is much easier to read.
use futures::{future, stream, StreamExt}; // 0.3.19
use rand::{
distributions::{Distribution, Uniform},
rngs::ThreadRng,
};
use std::sync::{Arc, Mutex};
use tokio; // 1.15.0
async fn get_response(rng: Arc<Mutex<ThreadRng>>) -> Vec<u32> {
let mut rng = rng.lock().unwrap();
let range = Uniform::from(0..100);
let len_u32 = range.sample(&mut *rng);
let len_usize = usize::try_from(len_u32).unwrap();
vec![len_u32; len_usize]
}
async fn query_all_items() -> Vec<u32> {
let rng = Arc::new(Mutex::new(ThreadRng::default()));
stream::iter(0..)
.then(|_| async { get_response(Arc::clone(&rng)).await })
.take_while(|v| future::ready(v.len() >= 10))
.collect::<Vec<_>>()
.await
.into_iter()
.flatten()
.collect()
}
#[tokio::main]
async fn main() {
// [46, 46, 46, ..., 78, 78, 78], or whatever random list you get
println!("{:?}", query_all_items().await);
}

I would do this in a while loop since the while will surface the flag more easily.
fn query_all_items () -> Vec<Item> {
let items = vec![];
let limit = 10;
let mut limit_reached = false;
while limit_reached {
let response = queryItems(limit).await?;
limit_reached = response.len() >= limit;
items.extend(response);
}
items
}

Without context it's hard to advise ideal code. I would do:
fn my_body_is_ready() -> Vec<u32> {
let mut acc = vec![];
let min = 10;
loop {
let foo = vec![42];
if foo.len() < min {
acc.extend(foo);
break acc;
} else {
acc.extend(foo);
}
}
}

How can I duplicate the first and last elements of a vector?

I would like to take a vector of characters and duplicate the first letter and the last one.
The only way I managed to do that is with this ugly code:
fn repeat_ends(s: &Vec<char>) -> Vec<char> {
let mut result: Vec<char> = Vec::new();
let first = s.first().unwrap();
let last = s.last().unwrap();
result.push(*first);
result.append(&mut s.clone());
result.push(*last);
result
}
fn main() {
let test: Vec<char> = String::from("Hello world !").chars().collect();
println!("{:?}", repeat_ends(&test)); // "HHello world !!"
}
What would be a better way to do it?

I am not sure if it is "better" but one way is using slice patterns:
fn repeat_ends(s: &Vec<char>) -> Vec<char> {
match s[..] {
[first, .. , last ] => {
let mut out = Vec::with_capacity(s.len() + 2);
out.push(first);
out.extend(s);
out.push(last);
out
},
_ => panic!("whatever"), // or s.clone()
}
}
If it can be mutable:
fn repeat_ends(s: &mut Vec<char>) {
if let [first, .. , last ] = s[..] {
s.insert(0, first);
s.push(last);
}
}

If it's ok to mutate the original vector, this does the job:
fn repeat_ends(s: &mut Vec<char>) {
let first = *s.first().unwrap();
s.insert(0, first);
let last = *s.last().unwrap();
s.push(last);
}
fn main() {
let mut test: Vec<char> = String::from("Hello world !").chars().collect();
repeat_ends(&mut test);
println!("{}", test.into_iter().collect::<String>()); // "HHello world !!"
}
Vec::insert:
Inserts an element at position index within the vector, shifting all elements after it to the right.
This means the function repeat_ends would be O(n) with n being the number of characters in the vector. I'm not sure if there is a more efficient method if you need to use a vector, but I'd be curious to hear it if there is.

Rust - Multiple Calls to Iterator Methods

I have this following rust code:
fn tokenize(line: &str) -> Vec<&str> {
let mut tokens = Vec::new();
let mut chars = line.char_indices();
for (i, c) in chars {
match c {
'"' => {
if let Some(pos) = chars.position(|(_, x)| x == '"') {
tokens.push(&line[i..=i+pos]);
} else {
// Not a complete string
}
}
// Other options...
}
}
tokens
}
I am trying to elegantly extract a string surrounded by double quotes from the line, but since chars.position takes a mutable reference and chars is moved into the for loop, I get a compilation error - "value borrowed after move". The compiler suggests borrowing chars in the for loop but this doesn't work because an immutable reference is not an iterator (and a mutable one would cause the original problem where I can't borrow mutably again for position).
I feel like there should be a simple solution to this.
Is there an idiomatic way to do this or do I need to regress to appending characters one by one?

Because a for loop will take ownership of chars (because it calls .into_iter() on it) you can instead manually iterate through chars using a while loop:
fn tokenize(line: &str) -> Vec<&str> {
let mut tokens = Vec::new();
let mut chars = line.char_indices();
while let Some((i, c)) = chars.next() {
match c {
'"' => {
if let Some(pos) = chars.position(|(_, x)| x == '"') {
tokens.push(&line[i..=i+pos]);
} else {
// Not a complete string
}
}
// Other options...
}
}
}

It works if you just desugar the for-loop:
fn tokenize(line: &str) -> Vec<&str> {
let mut tokens = Vec::new();
let mut chars = line.char_indices();
while let Some((i, c)) = chars.next() {
match c {
'"' => {
if let Some(pos) = chars.position(|(_, x)| x == '"') {
tokens.push(&line[i..=i+pos]);
} else {
// Not a complete string
}
},
_ => {},
}
}
tokens
}
The normal for-loop prevents additional modification of the iterator because this usually leads to surprising and hard-to-read code. Doing it as a while-loop has no such protection.
If all you want to do is find quoted strings, I would not, however, go with an iterator at all here.
fn tokenize(line: &str) -> Vec<&str> {
let mut tokens = Vec::new();
let mut line = line;
while let Some(pos) = line.find('"') {
line = &line[(pos+1)..];
if let Some(end) = line.find('"') {
tokens.push(&line[..end]);
line = &line[(end+1)..];
} else {
// Not a complete string
}
}
tokens
}

Issues with Rust timelines and ownerships

I am trying to create a hashmap by reading a file. Below is the code that I have written. The twist is that I need to persist subset_description till the next iteration so that I can store it in the hasmap and then finally return the hashmap.
fn myfunction(filename: &Path) -> io::Result<HashMap<&str, &str>> {
let mut SIF = HashMap::new();
let file = File::open(filename).unwrap();
let mut subset_description = "";
for line in BufReader::new(file).lines() {
let thisline = line?;
let line_split: Vec<&str> = thisline.split("=").collect();
subset_description = if thisline.starts_with("a") {
let subset_description = line_split[1].trim();
subset_description
} else {
""
};
let subset_ids = if thisline.starts_with("b") {
let subset_ids = line_split[1].split(",");
let subset_ids = subset_ids.map(|s| s.trim());
subset_ids.collect()
} else {
Vec::new()
};
for k in subset_ids {
SIF.insert(k, subset_description);
println!("");
}
if thisline.starts_with("!dataset_table_begin") {
break;
}
}
Ok(SIF)
}
I am getting the below error and not able to resolve this
error[E0515]: cannot return value referencing local variable `thisline`
--> src/main.rs:73:5
|
51 | let line_split: Vec<&str> = thisline.split("=").collect();
| -------- `thisline` is borrowed here
...
73 | Ok(SIF)
| ^^^^^^^ returns a value referencing data owned by the current function

The problem lies within the guarantees the Rust makes on your behalf. The root of the problem can be seen as following. You are reading a file and manipulating it's content into a HashMap, and you are trying to return reference to the the data you read. But by returning a reference you would need to guarantee, that the strings in the file wont be changed later on, which you naturally can not do.
In Rust terms you keep trying to return references to local variables, which get dropped at the end of the function, which would efficiently leave you with dangling pointers. Here is the changes I made, even though they may not be most efficient, they do compile.
fn myfunction(filename: &Path) -> io::Result<HashMap<String, String>> {
let mut SIF = HashMap::new();
let file = File::open(filename).unwrap();
let mut subset_description = "";
for line in BufReader::new(file).lines() {
let thisline = line?;
let line_split: Vec<String> = thisline.split("=").map(|s| s.to_string()).collect();
subset_description = if thisline.starts_with("a") {
let subset_description = line_split[1].trim();
subset_description
} else {
""
};
let subset_ids = if thisline.starts_with("b") {
let subset_ids = line_split[1].split(",");
let subset_ids = subset_ids.map(|s| s.trim());
subset_ids.map(|s| s.to_string()).collect()
} else {
Vec::new()
};
for k in subset_ids {
SIF.insert(k, subset_description.to_string());
println!("");
}
if thisline.starts_with("!dataset_table_begin") {
break;
}
}
Ok(SIF)
}
As you can see, now you give away the ownership of strings in return value. This is achieved by modifying the return type and using to_string() function, to give away the ownership of local strings to HashMap.
There is an argument that to_string() is slow, so you can explore the use of into or to_owned(), but as I am not proficient with those constructs I can not assist you in optimization.

Using the same iterator multiple times in Rust

Editor's note: This code example is from a version of Rust prior to 1.0 when many iterators implemented Copy. Updated versions of this code produce a different errors, but the answers still contain valuable information.
I'm trying to write a function to split a string into clumps of letters and numbers; for example, "test123test" would turn into [ "test", "123", "test" ]. Here's my attempt so far:
pub fn split(input: &str) -> Vec<String> {
let mut bits: Vec<String> = vec![];
let mut iter = input.chars().peekable();
loop {
match iter.peek() {
None => return bits,
Some(c) => if c.is_digit() {
bits.push(iter.take_while(|c| c.is_digit()).collect());
} else {
bits.push(iter.take_while(|c| !c.is_digit()).collect());
}
}
}
return bits;
}
However, this doesn't work, looping forever. It seems that it is using a clone of iter each time I call take_while, starting from the same position over and over again. I would like it to use the same iter each time, advancing the same iterator over all the each_times. Is this possible?

As you identified, each take_while call is duplicating iter, since take_while takes self and the Peekable chars iterator is Copy. (Only true before Rust 1.0 — editor)
You want to be modifying the iterator each time, that is, for take_while to be operating on an &mut to your iterator. Which is exactly what the .by_ref adaptor is for:
pub fn split(input: &str) -> Vec<String> {
let mut bits: Vec<String> = vec![];
let mut iter = input.chars().peekable();
loop {
match iter.peek().map(|c| *c) {
None => return bits,
Some(c) => if c.is_digit(10) {
bits.push(iter.by_ref().take_while(|c| c.is_digit(10)).collect());
} else {
bits.push(iter.by_ref().take_while(|c| !c.is_digit(10)).collect());
},
}
}
}
fn main() {
println!("{:?}", split("123abc456def"))
}
Prints
["123", "bc", "56", "ef"]
However, I imagine this is not correct.
I would actually recommend writing this as a normal for loop, using the char_indices iterator:
pub fn split(input: &str) -> Vec<String> {
let mut bits: Vec<String> = vec![];
if input.is_empty() {
return bits;
}
let mut is_digit = input.chars().next().unwrap().is_digit(10);
let mut start = 0;
for (i, c) in input.char_indices() {
let this_is_digit = c.is_digit(10);
if is_digit != this_is_digit {
bits.push(input[start..i].to_string());
is_digit = this_is_digit;
start = i;
}
}
bits.push(input[start..].to_string());
bits
}
This form also allows for doing this with much fewer allocations (that is, the Strings are not required), because each returned value is just a slice into the input, and we can use lifetimes to state this:
pub fn split<'a>(input: &'a str) -> Vec<&'a str> {
let mut bits = vec![];
if input.is_empty() {
return bits;
}
let mut is_digit = input.chars().next().unwrap().is_digit(10);
let mut start = 0;
for (i, c) in input.char_indices() {
let this_is_digit = c.is_digit(10);
if is_digit != this_is_digit {
bits.push(&input[start..i]);
is_digit = this_is_digit;
start = i;
}
}
bits.push(&input[start..]);
bits
}
All that changed was the type signature, removing the Vec<String> type hint and the .to_string calls.
One could even write an iterator like this, to avoid having to allocate the Vec. Something like fn split<'a>(input: &'a str) -> Splits<'a> { /* construct a Splits */ } where Splits is a struct that implements Iterator<&'a str>.

take_while takes self by value: it consumes the iterator. Before Rust 1.0 it also was unfortunately able to be implicitly copied, leading to the surprising behaviour that you are observing.
You cannot use take_while for what you are wanting for these reasons. You will need to manually unroll your take_while invocations.
Here is one of many possible ways of dealing with this:
pub fn split(input: &str) -> Vec<String> {
let mut bits: Vec<String> = vec![];
let mut iter = input.chars().peekable();
loop {
let seeking_digits = match iter.peek() {
None => return bits,
Some(c) => c.is_digit(10),
};
if seeking_digits {
bits.push(take_while(&mut iter, |c| c.is_digit(10)));
} else {
bits.push(take_while(&mut iter, |c| !c.is_digit(10)));
}
}
}
fn take_while<I, F>(iter: &mut std::iter::Peekable<I>, predicate: F) -> String
where
I: Iterator<Item = char>,
F: Fn(&char) -> bool,
{
let mut out = String::new();
loop {
match iter.peek() {
Some(c) if predicate(c) => out.push(*c),
_ => return out,
}
let _ = iter.next();
}
}
fn main() {
println!("{:?}", split("test123test"));
}
This yields a solution with two levels of looping; another valid approach would be to model it as a state machine one level deep only. Ask if you aren’t sure what I mean and I’ll demonstrate.

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string

How should I read the contents of a file respecting endianess? - io

Related

How do I avoid obfuscating logic in a `loop`?

How can I duplicate the first and last elements of a vector?

Rust - Multiple Calls to Iterator Methods

Issues with Rust timelines and ownerships

Using the same iterator multiple times in Rust

Categories

Resources