Formatting a byte slice in Rust

Using Rust, I want to take a slice of bytes from a Vec and display them as hex on the console. I can make this work using the itertools format function and println!, but I cannot figure out how it works. Here is the code, simplified:
use itertools::Itertools;
// Create a vec of bytes
let mut buf = vec![0; 1024];
// ... fill the vec with some data, doesn't matter how, I'm reading from a socket ...
// Create a slice into the vec
let bytes = &buf[..5];
// Print the slice, using format from itertools, example output could be: 30 27 02 01 00
println!("{:02x}", bytes.iter().format(" "));
(As an aside, I realize I can use the much simpler itertools join function, but in this case I don't want the default 0x## style formatting, as it is somewhat bulky.)
How on earth does this work under the covers? I know itertools' format is creating a Format struct, and I can see the source code at https://github.com/rust-itertools/itertools/blob/master/src/format.rs , but I am none the wiser. I suspect the answer has to do with macro_rules! impl_format, but that is just about where my head explodes.
Can some Rust expert explain the magic? I hate to blindly copy-paste code without a clue. Am I abusing itertools? Maybe there is a better, simpler way to go about this.

I suspect the answer has to do with "macro_rules! impl_format" but that is just about where my head explodes.
The impl_format! macro is used to implement the various formatting traits.
impl_format! {
    Display Debug UpperExp LowerExp UpperHex LowerHex Octal Binary Pointer
}
The author has chosen to write a macro because the implementations all look the same. The way repetitions work in macros means that macros can be very helpful even when they are used only once (here, we could do the same by invoking the macro once for each trait, but that's not true in general).
Let's expand the implementation of LowerHex for Format and look at it:
impl<'a, I> fmt::LowerHex for Format<'a, I>
where
    I: Iterator,
    I::Item: fmt::LowerHex,
{
    fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result {
        self.format(f, fmt::LowerHex::fmt)
    }
}
The fmt method calls another method, format, defined in the same module.
impl<'a, I> Format<'a, I>
where
    I: Iterator,
{
    fn format<F>(&self, f: &mut fmt::Formatter, mut cb: F) -> fmt::Result
    where
        F: FnMut(&I::Item, &mut fmt::Formatter) -> fmt::Result,
    {
        let mut iter = match self.inner.borrow_mut().take() {
            Some(t) => t,
            None => panic!("Format: was already formatted once"),
        };
        if let Some(fst) = iter.next() {
            cb(&fst, f)?;
            for elt in iter {
                if self.sep.len() > 0 {
                    f.write_str(self.sep)?;
                }
                cb(&elt, f)?;
            }
        }
        Ok(())
    }
}
format takes two arguments: the formatter (f) and a formatting function (cb, for callback). The formatting function here is fmt::LowerHex::fmt, the fmt method from the LowerHex trait. How does the compiler figure out which LowerHex implementation to use? It's inferred from format's type signature: the type of cb is F, and F must implement FnMut(&I::Item, &mut fmt::Formatter) -> fmt::Result. Notice the type of the first argument: &I::Item (I is the type of the iterator that was passed to format). LowerHex::fmt's signature is:
fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result;
For any type Self that implements LowerHex, this function will implement FnMut(&Self, &mut fmt::Formatter) -> fmt::Result. Thus, the compiler infers that Self == I::Item.
One important thing to note here is that the formatting attributes (e.g. the 02 in your formatting string) are stored in the Formatter. Implementations of e.g. LowerHex will use methods such as Formatter::width to retrieve an attribute. The trick here is that the same formatter is used to format multiple values (with the same attributes).
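To see the whole mechanism in one place, here is a minimal std-only sketch of the same trick (HexJoin is a made-up name for illustration, not itertools' actual type): a wrapper whose LowerHex impl formats every element with the same Formatter, so the 02 width applies to each byte.

```rust
use std::cell::RefCell;
use std::fmt;

// A stripped-down Format: an iterator plus a separator. RefCell<Option<I>>
// lets fmt (which only receives &self) take the iterator out exactly once,
// just like itertools' Format does.
struct HexJoin<I> {
    sep: &'static str,
    inner: RefCell<Option<I>>,
}

impl<I> fmt::LowerHex for HexJoin<I>
where
    I: Iterator,
    I::Item: fmt::LowerHex,
{
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        let mut iter = self.inner.borrow_mut().take().expect("already formatted");
        if let Some(first) = iter.next() {
            // Function-call syntax, mirroring cb(&fst, f) inside format.
            fmt::LowerHex::fmt(&first, f)?;
            for item in iter {
                f.write_str(self.sep)?;
                // The same Formatter is reused, so {:02x}'s width and
                // zero-fill apply to every element.
                fmt::LowerHex::fmt(&item, f)?;
            }
        }
        Ok(())
    }
}

fn main() {
    let bytes = [0x30u8, 0x27, 0x02, 0x01, 0x00];
    let joined = HexJoin { sep: " ", inner: RefCell::new(Some(bytes.iter())) };
    println!("{:02x}", joined); // prints: 30 27 02 01 00
}
```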
In Rust, methods can be called in two ways: using method syntax and using function syntax. These two functions are equivalent:
use std::fmt;

pub fn method_syntax(f: &mut fmt::Formatter) -> fmt::Result {
    use fmt::LowerHex;
    let x = 42u8;
    x.fmt(f)
}

pub fn function_syntax(f: &mut fmt::Formatter) -> fmt::Result {
    let x = 42u8;
    fmt::LowerHex::fmt(&x, f)
}
When format is called with fmt::LowerHex::fmt, cb refers to that function. format must use function-call syntax internally, because there's no guarantee that the callback is even a method!
am I abusing itertools
Not at all; in fact, this is precisely how format is meant to be used.
maybe there is a better, simpler way to go about this
There are simpler ways, sure, but using format is very efficient because it doesn't allocate dynamic memory.
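For comparison, a sketch of one of those simpler ways using only std; unlike format, it heap-allocates a String per element plus the joined result:

```rust
fn main() {
    let bytes = [0x30u8, 0x27, 0x02, 0x01, 0x00];
    // Each byte becomes its own String before everything is joined.
    let hex = bytes
        .iter()
        .map(|b| format!("{:02x}", b))
        .collect::<Vec<_>>()
        .join(" ");
    println!("{}", hex); // prints: 30 27 02 01 00
}
```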

Related

Create new Vec<T> from a vector Vec<T> and data T

I want to be able to create a new vector c: Vec<T> from a: Vec<T> and b: T, where c is equal to b appended to a, without using mutables. Say I have this code:
fn concat_vec<T>(a: Vec<T>, b: T) -> Vec<T> {
    return Vec::<T>::new(a, b);
}
I am aware this does not work, because Vec::new() has no input parameters. What I am wondering is whether there is a function or macro in std::vec that can initialize a vector from another vector plus a set of additional values to append onto it. I know a function can be created very easily to do this; I am just wondering if functionality exists in the standard library to work with vectors without them being mutable in the first place.
Note that you can use the tap crate to push b to a in an expression that evaluates to a:
use tap::Tap;
fn concat_vec<T>(a: Vec<T>, b: T) -> Vec<T> {
    a.tap_mut(|a| a.push(b))
}
This combines the performance benefits of push() (in comparison to building a brand new vector) with the elegance of having a short expression.
Vec::from_iter should work if you're willing to use iterators.
pub fn concat_vec<T>(a: Vec<T>, b: T) -> Vec<T> {
    Vec::from_iter(a.into_iter().chain(std::iter::once(b)))
}
Edit: As far as I have seen, the most common way to work with vectors, slices, and arrays is through the Iterator trait and the functionality it provides.
Also, on the topic of speed: the iterator approach above is slightly faster if you avoid cloning. On my computer, the iterator version took about 1.571µs per call, whereas cloning and pushing to a vector took 1.580µs per call (tested by running the functions 300 000 times on a debug build).
The universal form of the inefficient solution should look like this:
fn concat_vec<T>(a: Vec<T>, b: T) -> Vec<T> {
    Vec::from_iter(a.into_iter().chain(std::iter::once(b)))
}
…but it's strongly recommended to use push. Note that since the vector is taken by value, no Clone bound (and no copying) is needed:
fn concat_vec<T>(mut a: Vec<T>, b: T) -> Vec<T> {
    a.push(b);
    a
}

`fold` values into a HashMap

After reading the article Learning Programming Concepts by Jumping in at the Deep End, I can't seem to understand how exactly fold() is working in this context. Mainly, how does fold() know to grab the word variable from split()?
Here's the example:
use std::collections::HashMap;
fn count_words(text: &str) -> HashMap<&str, usize> {
    text.split(' ').fold(HashMap::new(), |mut map, word| {
        *map.entry(word).or_insert(0) += 1;
        map
    })
}
Playground
Rust docs say:
fold() takes two arguments: an initial value, and a closure with two arguments: an ‘accumulator’, and an element. The closure returns the value that the accumulator should have for the next iteration.
Iterator - fold
So I get that mut map is the accumulator, and I get that split() returns an iterator and therefore fold() is iterating over those values, but how does fold know to grab that value? It's being implicitly passed, but I can't seem to wrap my head around this. How is that being mapped to the word variable?
Not sure if I have the right mental model for this...
Thanks!
but how does fold know to grab that value?
fold() is a method on the iterator. That means that it has access to self which is the actual iterator, so it can call self.next() to get the next item (in this case the word, since self is of type Split, so its next() does get the next word). You could imagine fold() being implemented with the following pseudocode:
fn fold<B, F>(mut self, init: B, mut f: F) -> B
where
    Self: Sized,
    F: FnMut(B, Self::Item) -> B,
{
    let mut accum = init;
    while let Some(x) = self.next() {
        accum = f(accum, x);
    }
    accum
}
Ok, the above is not pseudocode, it's the actual implementation.
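Specializing that implementation to the question's count_words makes it visible where word comes from; this hand-desugared version behaves identically:

```rust
use std::collections::HashMap;

// count_words with the fold unrolled: `map` is the accumulator and each
// `word` is what next() pulled out of the Split iterator.
fn count_words(text: &str) -> HashMap<&str, usize> {
    let mut iter = text.split(' ');
    let mut map = HashMap::new(); // fold's `init`
    // fold's loop: bind next()'s item to `word`, then run the closure body.
    while let Some(word) = iter.next() {
        *map.entry(word).or_insert(0) += 1;
    }
    map // the final accumulator
}

fn main() {
    let counts = count_words("the quick the lazy the");
    println!("{}", counts["the"]); // prints: 3
}
```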

What does this Rust Closure argument syntax mean?

I modified code found on the internet to create a function that obtains the statistical mode of any Hashable type that implements Eq, but I do not understand some of the syntax. Here is the function:
use std::hash::Hash;
use std::collections::HashMap;
pub fn mode<'a, I, T>(items: I) -> &'a T
where
    I: IntoIterator<Item = &'a T>,
    T: Hash + Clone + Eq,
{
    let mut occurrences: HashMap<&T, usize> = HashMap::new();
    for value in items.into_iter() {
        *occurrences.entry(value).or_insert(0) += 1;
    }
    occurrences
        .into_iter()
        .max_by_key(|&(_, count)| count)
        .map(|(val, _)| val)
        .expect("Cannot compute the mode of zero items")
}
(I think requiring Clone may be overkill.)
The syntax I do not understand is in the closure passed to max_by_key:
|&(_, count)| count
What is the &(_, count) doing? I gather the underscore means I can ignore that parameter. Is this some sort of destructuring of a tuple in a parameter list? Does this make count take the reference of the tuple's second item?
.max_by_key(|&(_, count)| count) is equivalent to .max_by_key(f) where f is this:
fn f<T>(t: &(T, usize)) -> usize {
    (*t).1
}
f() could also be written using pattern matching, like this:
fn f2<T>(&(_, count): &(T, usize)) -> usize {
    count
}
And f2() is much closer to the first closure you're asking about.
The second closure, |(val, _)| val, is essentially the same, except there is no reference to complicate matters.
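Since the question came from a mode function, here is a self-contained version that exercises exactly that pattern; it also drops the Clone bound, which (as the asker suspected) is not needed:

```rust
use std::collections::HashMap;
use std::hash::Hash;

// The question's mode(), restated so this snippet compiles on its own,
// with the unnecessary Clone bound removed.
pub fn mode<'a, I, T>(items: I) -> &'a T
where
    I: IntoIterator<Item = &'a T>,
    T: Hash + Eq,
{
    let mut occurrences: HashMap<&T, usize> = HashMap::new();
    for value in items {
        *occurrences.entry(value).or_insert(0) += 1;
    }
    occurrences
        .into_iter()
        // into_iter() yields (&T, usize); max_by_key hands the closure a
        // reference to each item, so the pattern &(_, count) strips that
        // reference, ignores the key, and copies out the usize.
        .max_by_key(|&(_, count)| count)
        .map(|(val, _)| val)
        .expect("Cannot compute the mode of zero items")
}

fn main() {
    let items = [1, 2, 2, 3, 2];
    println!("{}", mode(&items)); // prints: 2
}
```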

How do I interpret the signature of read_until and what is AsyncRead + BufRead in Tokio?

I'm trying to understand asynchronous I/O in Rust. The following code is based on a snippet from Katharina Fey's Jan 2019 talk, which works for me:
use futures::future::Future;
use std::io::BufReader;
use tokio::io::*;
fn main() {
    let reader = BufReader::new(tokio::io::stdin());
    let buffer = Vec::new();
    println!("Type something:");
    let fut = tokio::io::read_until(reader, b'\n', buffer)
        .and_then(move |(stdin, buffer)| {
            tokio::io::stdout()
                .write_all(&buffer)
                .map_err(|e| panic!(e))
        })
        .map_err(|e| panic!(e));
    tokio::run(fut);
}
Before finding that code, I attempted to figure it out from the read_until documentation.
How do I interpret the signature of read_until to use it in a code sample like the one above?
pub fn read_until<A>(a: A, byte: u8, buf: Vec<u8>) -> ReadUntil<A>
where
    A: AsyncRead + BufRead,
Specifically, how can I know from reading the documentation, what are the parameters passed into the and_then closure and the expected result?
Parameters to and_then
Unfortunately the standard layout of the Rust documentation makes futures quite hard to follow.
Starting from the read_until documentation you linked, I can see that it returns ReadUntil<A>. I'll click on that to go to the ReadUntil documentation.
This return value is described as:
A future which can be used to easily read the contents of a stream into a vector until the delimiter is reached.
I would expect it to implement the Future trait — and I can see that it does. I would also assume that the Item that the future resolves to is some sort of vector, but I don't know exactly what, so I keep digging:
First I look under "Trait implementations" and find impl<A> Future for ReadUntil<A>
I click the [+] expander
Finally I see the associated type Item = (A, Vec<u8>). This means it's a Future that's going to return a pair of values: the A, so it is presumably giving me back the original reader that I passed in, plus a vector of bytes.
When the future resolves to this tuple, I want to attach some additional processing with and_then. This is part of the Future trait, so I can scroll down further to find that function.
fn and_then<F, B>(self, f: F) -> AndThen<Self, B, F>
where
    F: FnOnce(Self::Item) -> B,
    B: IntoFuture<Error = Self::Error>,
    Self: Sized,
The function and_then is documented as taking two parameters, but self is passed implicitly by the compiler when using dot syntax to chain functions, which tells us that we can write read_until(A, '\n', buffer).and_then(...). The second parameter in the documentation, f: F, becomes the first argument passed to and_then in our code.
I can see that f is a closure because the type F is shown as FnOnce(Self::Item) -> B (which, if I click through, links to the closures chapter of the Rust book).
The closure f that is passed in takes Self::Item as its parameter. I just found out that Item is (A, Vec<u8>), so I expect to write something like .and_then(|(reader, buffer)| { /* ... */ }).
AsyncRead + BufRead
This puts constraints on the type of reader that can be read from. The BufReader we created implements BufRead.
Helpfully, Tokio provides an implementation of AsyncRead for BufReader, so we don't have to worry about it; we can just go ahead and use the BufReader.

How does the type deduction work in this Docopt example?

Take a look at this code using the docopt library:
const USAGE: &'static str = "...something...";

#[derive(Deserialize)]
struct Args {
    flag: bool,
}

type Result<T> = result::Result<T, Box<error::Error + Send + Sync>>;

fn main() {
    let mut args: Args = Docopt::new(USAGE)
        .and_then(|d| d.deserialize())
        .unwrap_or_else(|e| e.exit());
}
If you look at the expression to the right of the equals sign, you'll see that it doesn't mention the Args struct anywhere. How does the compiler deduce the type of this expression? Can type information flow in the opposite direction (from the initialization target to the initializer expression) in Rust?
"How does it work?" might be too big of a question for Stack Overflow but (along with other languages like Scala and Haskell) Rust's type system is based on the Hindley-Milner type system, albeit with many modifications and extensions.
Simplifying greatly, the idea is to treat each unknown type as a variable, and define the relationships between types as a series of constraints, which can then be solved by an algorithm. In some ways it's similar to simultaneous equations you may have solved in algebra at school.
Type inference is a feature of Rust (and other languages in the extended Hindley-Milner family) that is exploited pervasively in idiomatic code to:
reduce the noise of type annotations
improve maintainability by not hard-coding types in multiple places (DRY)
Rust's type inference is powerful and, as you say, can flow both ways. To use Vec<T> as a simpler and more familiar example, any of these are valid:
let vec = vec![1_i32];
let vec = Vec::<i32>::new();
let vec: Vec<i32> = Vec::new();
The type can even be inferred just based on how a type is later used:
let mut vec = Vec::new();
// later...
vec.push(1_i32);
Another nice example is picking the correct string parser, based on the expected type:
let num: f32 = "100".parse().unwrap();
let num: i128 = "100".parse().unwrap();
let address: SocketAddr = "127.0.0.1:8080".parse().unwrap();
So what about your original example?
Docopt::new returns a Result<Docopt, Error>, which will be Err(Error) if the supplied options can't be parsed as arguments. At this point, there is no knowledge of whether the arguments are valid, just that they are well-formed.
Next, and_then has the following signature:
pub fn and_then<U, F>(self, op: F) -> Result<U, E>
where
    F: FnOnce(T) -> Result<U, E>,
The variable self has type Result<T, E> where T is Docopt and E is Error, deduced from step 1. U is still unknown, even after you supply the closure |d| d.deserialize().
But we know that T is Docopt, so deserialize is Docopt::deserialize, which has the signature:
fn deserialize<'a, 'de: 'a, D>(&'a self) -> Result<D, Error>
where
    D: Deserialize<'de>,
The variable self has type Docopt. D is still unknown, but we know it is the same type as U from step 2.
Result::unwrap_or_else has the signature:
fn unwrap_or_else<F>(self, op: F) -> T
where
    F: FnOnce(E) -> T,
The variable self has type Result<T, Error>. But we know that T is the same as U and D from the previous step.
We then assign to a variable of type Args, so T from the previous step is Args, which means that the D in step 3 (and U from step 2) is also Args.
The compiler can now deduce that when you wrote deserialize you meant the method <Args as Deserialize>::deserialize, which was derived automatically with the #[derive(Deserialize)] attribute.
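The same backwards flow can be reproduced with nothing but std; a hypothetical analogue of the Docopt chain (not its real API):

```rust
fn main() {
    // The annotation on `n` flows backwards: unwrap_or_else fixes T,
    // and_then's U must match it, and that selects parse::<i32> inside
    // the closure, just as Args selects <Args as Deserialize>::deserialize.
    let n: i32 = Ok::<&str, String>("42")
        .and_then(|s| s.parse().map_err(|e| format!("{}", e)))
        .unwrap_or_else(|_| 0);
    println!("{}", n); // prints: 42
}
```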
