How to apply a series of iterators to data? - rust

I'm tinkering with Rust by building some basic genetics functionality, e.g. read a file with a DNA sequence, transcribe it to RNA, translate it to an amino acid sequence, etc.
I'd like each of these transformations to accept and return iterators. That way I can string them together (like dna.transcribe().translate()...) and only collect when necessary, so the compiler can optimize the entire chain of transformations. I'm a data scientist coming from Scala/Spark, so this pattern makes a lot of sense to me, but I'm not sure how to implement it in Rust.
I've read this article about returning iterators but the final solution seems to be to use trait objects (with possibly large performance impact), or to hand roll iterators with associated structs (which allows me to return an iterator, yes, but I don't see how it would allow me to write a transformation that also accepts an iterator).
Any general architectural advice here?
(FYI, my code so far is available here, but I feel like I'm not using Rust idiomatically because (a) I still can't quite get it to compile and (b) this pattern of lazily chaining operations has led to unexpectedly complex and messy code that only works on Rust nightly.)

Iterator adaptors are meant for operations that can't easily be expressed otherwise. Your two examples, .transcribe() and .translate(), given your explanation of them, could be simplified to the following:
dna
    .map(|x| x.transcribe())
    .map(|x| x.translate())
// or
dna
    .map(|x| x.transcribe().translate())
However, if you are intent on designing your own iterator, the following should work:
struct Transcriber<I: Iterator<Item = Dna>> {
    inner: I,
}
impl<I: Iterator<Item = Dna>> Iterator for Transcriber<I> {
    type Item = TranscribedDna;
    fn next(&mut self) -> Option<Self::Item> {
        // Delegate to the wrapped iterator; calling `self.next()` here would recurse forever
        self.inner.next().map(|x| x.transcribe())
    }
}
// Extension trait to add the `.transcribe` method to existing iterators.
// `Sized` is required because `transcribe` takes `self` by value and stores it in `Transcriber<Self>`.
trait TranscribeIteratorExt: Iterator<Item = Dna> + Sized {
    fn transcribe(self) -> Transcriber<Self>;
}
impl<I: Iterator<Item = Dna>> TranscribeIteratorExt for I {
    fn transcribe(self) -> Transcriber<Self> {
        Transcriber { inner: self }
    }
}
Then you can use:
dna
    .transcribe() // an iterator yielding TranscribedDna
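On stable Rust, functions can also accept and return iterators directly with impl Trait, which avoids both trait objects and hand-written structs. The following is only a minimal sketch under stated assumptions: nucleotides are plain chars rather than dedicated Dna types, and the names transcribe, complement and pipeline are hypothetical stand-ins for the real transformations.

```rust
// Hypothetical stand-ins: nucleotides are plain `char`s, not dedicated types.

/// DNA -> RNA: replace thymine (T) with uracil (U), lazily.
fn transcribe(dna: impl Iterator<Item = char>) -> impl Iterator<Item = char> {
    dna.map(|base| if base == 'T' { 'U' } else { base })
}

/// RNA -> complementary RNA strand, lazily. Unknown symbols pass through unchanged.
fn complement(rna: impl Iterator<Item = char>) -> impl Iterator<Item = char> {
    rna.map(|base| match base {
        'A' => 'U',
        'U' => 'A',
        'C' => 'G',
        'G' => 'C',
        other => other,
    })
}

/// Chains both stages; no work happens until `collect` drives the iterator.
fn pipeline(dna: &str) -> String {
    complement(transcribe(dna.chars())).collect()
}
```

Each stage takes any iterator of the right item type and returns an opaque iterator, so stages can be composed in any order and the whole chain is monomorphized into a single loop when it is finally collected.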

Related

Why does the compiler prevent me from using push on a Vec created using collect()?

The following compiles:
pub fn build_proverb(list: &[&str]) -> String {
    if list.is_empty() {
        return String::new();
    }
    let mut result = (0..list.len() - 1)
        .map(|i| format!("For want of a {} the {} was lost.", list[i], list[i + 1]))
        .collect::<Vec<String>>();
    result.push(format!("And all for the want of a {}.", list[0]));
    result.join("\n")
}
The following does not (see Playground):
pub fn build_proverb(list: &[&str]) -> String {
    if list.is_empty() {
        return String::new();
    }
    let mut result = (0..list.len() - 1)
        .map(|i| format!("For want of a {} the {} was lost.", list[i], list[i + 1]))
        .collect::<Vec<String>>()
        .push(format!("And all for the want of a {}.", list[0]))
        .join("\n");
    result
}
The compiler tells me
error[E0599]: no method named `join` found for type `()` in the current scope
 --> src/lib.rs:9:10
  |
9 |         .join("\n");
  |          ^^^^
I get the same type of error if I try to compose just with push.
What I would expect is that collect returns B, aka Vec<String>. Vec is not (), and Vec of course has the methods I want to include in the list of composed functions.
Why can't I compose these functions? The explanation might include describing the "magic" of terminating the expression after collect() to get the compiler to instantiate the Vec in a way that does not happen when I compose with push etc.
If you read the documentation for Vec::push and look at the signature of the method, you will learn that it does not return the Vec:
pub fn push(&mut self, value: T)
Since there is no explicit return type, the return type is the unit type (). There is no method called join on (). You will need to write your code in multiple lines.
See also:
What is the purpose of the unit type in Rust?
I'd write this more functionally:
use itertools::Itertools; // 0.8.0
pub fn build_proverb(list: &[&str]) -> String {
    let last = list
        .get(0)
        .map(|d| format!("And all for the want of a {}.", d));
    list.windows(2)
        .map(|d| format!("For want of a {} the {} was lost.", d[0], d[1]))
        .chain(last)
        .join("\n")
}
fn main() {
    println!("{}", build_proverb(&["nail", "shoe"]));
}
See also:
What's an idiomatic way to print an iterator separated by spaces in Rust?
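The same chained shape is possible without the itertools dependency. This is a minimal sketch using only the standard library: the final line is chained onto the iterator as an Option (which implements IntoIterator), and the result is collected into a Vec before calling the slice join method.

```rust
pub fn build_proverb(list: &[&str]) -> String {
    // `None` for an empty list, so the chain below contributes nothing in that case.
    let last = list
        .first()
        .map(|d| format!("And all for the want of a {}.", d));
    list.windows(2)
        .map(|d| format!("For want of a {} the {} was lost.", d[0], d[1]))
        .chain(last) // append the closing line while still in iterator form
        .collect::<Vec<String>>()
        .join("\n")
}
```

Nothing in this chain returns (), so every step has a value for the next method to act on, and the empty-list check is no longer needed.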
Thank you to everyone for the useful interactions. Everything stated in the previous answer is correct. There is also a bigger picture that I'm seeing as I learn Rust.
Coming from Haskell (with C training years ago), I bumped into the OO method-chaining approach, which uses a pointer to chain between method calls and has no need for pure functions (this is what I was doing with let mut result = ..., which was then required in order to change the value of the Vec with result.push(...)). The more general observation, I believe, is that in OO it is fine to return unit because method chaining does not require a return value.
The custom code below defines push as a trait method; it takes the same inputs as the "OO" push but returns the updated self. As a side comment, this makes the function pure (the output depends only on the input); in practice, it means that push defined this way enables the FP-style composition of functions that I had come to expect as the norm (fair enough, I thought at first, given how much Rust borrows from Haskell).
What I was trying to accomplish, and what is at the heart of the question, is captured by the code that @Stargateur, @E_net4 and @Shepmaster put forward. With only the smallest edits, it is as follows:
(see playground)
pub fn build_proverb(list: &[&str]) -> String {
    if list.is_empty() {
        return String::new();
    }
    list.windows(2)
        .map(|d| format!("For want of a {} the {} was lost.", d[0], d[1]))
        .collect::<Vec<_>>()
        .push(format!("And all for the want of a {}.", list[0]))
        .join("\n")
}
The solution requires that I define push as a trait method that returns self, of type Vec in this instance.
trait MyPush<T> {
    fn push(self, x: T) -> Vec<T>;
}
impl<T> MyPush<T> for Vec<T> {
    fn push(mut self, x: T) -> Vec<T> {
        Vec::push(&mut self, x);
        self
    }
}
Final observation: in surveying many of the Rust traits, I could not find a trait function that returns () (modulo e.g. Write, whose methods return Result<()>).
This contrasts with what I learned here to expect from struct and enum methods. Both trait functions and the OO methods have access to self and thus have each been described as "methods", but there seems to be an inherent difference worth noting: OO methods use a reference to enable sequentially changing self, while FP traits (if you will) use function composition that relies on "pure", state-changing functions to accomplish the same (:: (self, newValue) -> self).
Perhaps as an aside, where Haskell achieves referential transparency in this situation by creating a new copy (modulo behind the scenes optimizations), Rust seems to accomplish something similar in the custom trait code by managing ownership (transferred to the trait function, and handed back by returning self).
A final piece to the "composing functions" puzzle: For composition to work, the output of one function needs to have the type required for the input of the next function. join worked both when I was passing it a value, and when I was passing it a reference (true with types that implement IntoIterator). So join seems to have the capacity to work in both the method chaining and function composition styles of programming.
Is this distinction between OO methods that don't rely on a return value and traits generally true in Rust? It seems to be the case "here and there". Case in point: in contrast to push, where the line is clear, join seems to be on its way to being part of the standard library defined as both a method for SliceConcatExt and a trait function for SliceConcatExt (see the Rust source and the rust-lang issue discussion). The next question: would unifying the approaches in the standard library be consistent with the Rust design philosophy (pay only for what you use, safe, performant, expressive and a joy to use)?
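On the observation above that no trait function seems to return (): the standard library does have some, so the split is not strictly inherent methods versus trait methods. A small sketch, using Extend::extend as one example (the function name extend_returns_unit is hypothetical, for illustration only):

```rust
fn extend_returns_unit() -> Vec<i32> {
    let mut v = vec![1, 2];
    // `extend` comes from the `Extend` trait, takes `&mut self`, and returns `()`,
    // so it cannot be chained any more than `Vec::push` can.
    let unit: () = v.extend(vec![3, 4]);
    let _ = unit;
    v
}
```

Whether a method returns () appears to depend on whether it takes self by mutable reference or by value, not on whether it is declared in a trait.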

Why does Vec<T> expect &T as the argument to binary_search?

This question is predicated on the assumption that a friendly/ergonomic API in Rust should prefer a reference to a type Q where T: Borrow<Q> instead of expecting &T directly. In my experience working with the API of other collection types like HashMap, this definitely seems to be the case. That said...
How come the binary_search method on Vec<T> is not defined that way? Currently, on stable, the binary_search implementation is as follows:
pub fn binary_search(&self, x: &T) -> Result<usize, usize>
where
    T: Ord,
{
    self.binary_search_by(|p| p.cmp(x))
}
It seems like the following would be a better implementation:
pub fn binary_search_modified<Q>(&self, x: &Q) -> Result<usize, usize>
where
    T: Borrow<Q>,
    Q: Ord + ?Sized,
{
    self.binary_search_by(|p| p.borrow().cmp(x))
}
A comparison of the two APIs above:
let mut v: Vec<String> = Vec::new();
v.push("A".into());
v.push("B".into());
v.push("D".into());
let _ = v.binary_search("C"); // Compilation error!
let _ = v.binary_search(&String::from("C")); // Fine: allocate and convert to the exact type, I guess
let _ = v.binary_search_modified("C"); // Far nicer API, does the same thing
let _ = v.binary_search_modified(&String::from("C")); // Backwards compatible
As a more general question, what considerations go into deciding if a method should accept &T or &Q ... where T: Borrow<Q>?
You are right that binary_search() and a few other methods like contains() could be generalized to accept a reference to any type Q that T can be borrowed as, but unfortunately Rust 1.0 was released with the less general signature. And while it looks like using Borrow is strictly more general, attempts to implement that change broke type inference in too many cases.
There are countless GitHub issues, PRs and forum discussions on this topic. If you want to follow the history of the attempts to fix this, I suggest starting at the PR that finally reverted the attempts to make binary_search() more general and working your way backwards.
Regarding your more general question, my advice would be the same as for any API design question: think about the use cases. Using an additional type parameter makes the code somewhat more complex, and the documentation and compiler errors become less obvious. For methods on a trait, the type parameter will render the trait unusable for trait objects. So if you can think of convincing use cases for the more general version using Borrow, go for it, but in the absence of a convincing use case it's better to avoid it.
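With the signature as it stands, a caller can still search a Vec<String> by &str without allocating, by passing the comparison explicitly to binary_search_by. A minimal sketch (the helper name find is hypothetical):

```rust
/// Looks up `key` in a sorted slice of `String`s without building a `String` for the key.
fn find(sorted: &[String], key: &str) -> Result<usize, usize> {
    // Compare each probed element as a `&str` against the key.
    sorted.binary_search_by(|probe| probe.as_str().cmp(key))
}
```

As with binary_search, Ok holds the index of a match and Err holds the index where the key could be inserted to keep the slice sorted.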

How to properly pass Iterators to a function in Rust

I want to pass iterators to a function, which then computes some value from them.
I am not sure what a robust signature for such a function would look like.
Let's say I want to iterate over f64 values.
You can find the code in the playground: https://play.rust-lang.org/?version=stable&mode=debug&edition=2018&gist=c614429c541f337adb102c14518cf39e
My first attempt was
fn dot(a: impl std::iter::Iterator<Item = f64>, b: impl std::iter::Iterator<Item = f64>) -> f64 {
    a.zip(b).map(|(x, y)| x * y).sum()
}
This fails to compile if we try to iterate over slices.
So you can do
fn dot<'a>(a: impl std::iter::Iterator<Item = &'a f64>, b: impl std::iter::Iterator<Item = &'a f64>) -> f64 {
    a.zip(b).map(|(x, y)| x * y).sum()
}
This fails to compile if I try to iterate over mapped ranges.
(Why does the compiler require the lifetime parameters here?)
So I tried to accept references and non-references generically:
pub fn dot<T: Borrow<f64>, U: Borrow<f64>>(a: impl std::iter::Iterator<Item = T>, b: impl std::iter::Iterator<Item = U>) -> f64 {
    a.zip(b).map(|(x, y)| x.borrow() * y.borrow()).sum()
}
This works with all combinations I tried, but it is quite verbose and I don't really understand every aspect of it.
Are there more cases?
What would be the best practice of solving this problem?
There is no right way to write a function that can accept Iterators, but there are some general principles that we can apply to make your function general and easy to use.
Write functions that accept impl IntoIterator<...>. Because all Iterators implement IntoIterator, this is strictly more general than a function that accepts only impl Iterator<...>.
Borrow<T> is the right way to abstract over T and &T.
When trait bounds get verbose, it's often easier to read if you write them in where clauses instead of in-line.
With those in mind, here's how I would probably write dot:
use std::borrow::Borrow;
fn dot<I, J>(a: I, b: J) -> f64
where
    I: IntoIterator,
    J: IntoIterator,
    I::Item: Borrow<f64>,
    J::Item: Borrow<f64>,
{
    a.into_iter()
        .zip(b)
        .map(|(x, y)| x.borrow() * y.borrow())
        .sum()
}
However, I also agree with TobiP64's answer in that this level of generality may not be necessary in every case. This dot is nice because it can accept a wide range of arguments, so you can call dot(&some_vec, some_iterator) and it just works. It's optimized for readability at the call site. On the other hand, if you find the Borrow trait complicates the definition too much, there's nothing wrong with optimizing for readability at the definition, and forcing the caller to add a .iter().copied() sometimes. The only thing I would definitely change about the first dot function is to replace Iterator with IntoIterator.
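To make the trade-off concrete, here is a sketch that puts both variants side by side. The generic dot is repeated from the answer so the example is self-contained; dot_simple and call_sites are hypothetical names for the "first version with IntoIterator" and a pair of example calls.

```rust
use std::borrow::Borrow;

// The generic version from the answer, repeated here for self-containment.
fn dot<I, J>(a: I, b: J) -> f64
where
    I: IntoIterator,
    J: IntoIterator,
    I::Item: Borrow<f64>,
    J::Item: Borrow<f64>,
{
    a.into_iter()
        .zip(b)
        .map(|(x, y)| x.borrow() * y.borrow())
        .sum()
}

// Hypothetical name: the first attempt with only `Iterator` swapped for `IntoIterator`.
fn dot_simple(a: impl IntoIterator<Item = f64>, b: impl IntoIterator<Item = f64>) -> f64 {
    a.into_iter().zip(b).map(|(x, y)| x * y).sum()
}

fn call_sites() -> (f64, f64) {
    let v: Vec<f64> = vec![1.0, 2.0, 3.0];
    let w: [f64; 3] = [4.0, 5.0, 6.0];
    // Generic version: a borrowed Vec (items are &f64) mixes with a mapped range (items are f64).
    let generic = dot(&v, (4_i32..7).map(f64::from));
    // Simple version: the caller converts &f64 items to f64 with `.copied()`.
    let simple = dot_simple(v.iter().copied(), w);
    (generic, simple)
}
```

The generic version moves the conversion burden into the signature; the simple version keeps the signature readable and asks for one adaptor at the call site.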
You can iterate over slices with the first dot implementation like this:
dot([0.0, 1.0, 2.0].iter().cloned(), [0.0, 1.0, 2.0].iter().cloned());
(https://doc.rust-lang.org/std/iter/trait.Iterator.html#method.cloned)
or
dot([0.0, 1.0, 2.0].iter().copied(), [0.0, 1.0, 2.0].iter().copied());
(https://doc.rust-lang.org/std/iter/trait.Iterator.html#method.copied)
Why does the compiler require the lifetime parameters here?
As far as I know, every reference in Rust has a lifetime, but the compiler can infer it only in simple cases. In this case, however, the compiler is not yet smart enough, so you need to tell it how long the references yielded by the iterator live.
Are there more cases?
You can always use iterator methods, like the solution above, to get an iterator over f64, so you don't have to deal with lifetimes or generics.
What would be the best practice of solving this problem?
I would recommend the first version (and thus leaving it to the caller to transform the iterator into an Iterator<Item = f64>), simply because it's the most readable.

How can I zip together an unknown-at-compile-time number of iterators?

I have a number of Rust iterators specified by user input that I would like to iterate through in lockstep.
This sounds like a job for something like Iterator::zip, except that I may need more than two iterators zipped together. I looked at itertools::multizip and itertools::izip, but those both require that the number of iterators to be zipped be known at compile time. For my task the number of iterators to be zipped together depends on user input, and thus cannot be known at compile time.
I was hoping for something like Python's zip function which takes an iterable of iterables. I imagine the function signature might look like:
fn manyzip<T>(iterators: Vec<T>) -> ManyZip<T>
where
    T: Iterator
The question "How can I zip more than two iterators?" only covers the situation where the number of iterators is known at compile time.
I can solve my particular problem using indices and such, but it just feels like there ought to be a better way.
Implement your own iterator that iterates over the input iterators and collects them:
struct Multizip<T>(Vec<T>);
impl<T> Iterator for Multizip<T>
where
    T: Iterator,
{
    type Item = Vec<T::Item>;
    fn next(&mut self) -> Option<Self::Item> {
        self.0.iter_mut().map(Iterator::next).collect()
    }
}
fn main() {
    let mz = Multizip(vec![1..=2, 10..=20, 100..=200]);
    for v in mz {
        println!("{:?}", v);
    }
}
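Two details may matter when the iterators really do come from user input. First, a Vec<T> needs one element type, so iterators of different concrete types have to be boxed. Second, collecting an empty sequence of Options yields Some(vec![]), so a Multizip over zero iterators would never end. The sketch below is one way to handle both, under stated assumptions: the is_empty guard is an added design choice, and zip_ranges is a hypothetical helper that builds the iterators from runtime data.

```rust
struct Multizip<T>(Vec<T>);

impl<T: Iterator> Iterator for Multizip<T> {
    type Item = Vec<T::Item>;

    fn next(&mut self) -> Option<Self::Item> {
        // Assumption: with no inner iterators, stop instead of yielding empty rows forever.
        if self.0.is_empty() {
            return None;
        }
        // Collecting `Option`s into `Option<Vec<_>>` stops at the first exhausted iterator.
        self.0.iter_mut().map(|it| it.next()).collect()
    }
}

/// Hypothetical helper: one iterator per entry in `ends`, alternating ascending and
/// descending ranges so the concrete iterator types differ and must be boxed.
fn zip_ranges(ends: &[u32]) -> Vec<Vec<u32>> {
    let iters: Vec<Box<dyn Iterator<Item = u32>>> = ends
        .iter()
        .enumerate()
        .map(|(i, &end)| -> Box<dyn Iterator<Item = u32>> {
            if i % 2 == 0 {
                Box::new(0..end)
            } else {
                Box::new((0..end).rev())
            }
        })
        .collect();
    Multizip(iters).collect()
}
```

Like Iterator::zip, this stops at the shortest input; iterators that come before the exhausted one will already have been advanced once more when that happens.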
