How to get a zero Array2 copying dimension from another Array2 - rust

I'm just learning Rust. I'm using ndarray and need to construct a zero matrix with the same dimensions as another matrix. I tried:
fn make_0(matrix: Array2<i32>) -> Array2<i32> {
    Array2::zeros(matrix.shape())
}
But this does not compile:
error[E0271]: type mismatch resolving `<&[usize] as ShapeBuilder>::Dim == Dim<[usize; 2]>`
--> src/lib.rs:62:9
|
62 | Array2::zeros(matrix.shape())
| ^^^^^^^^^^^^^ expected array `[usize; 2]`, found struct `IxDynImpl`
|
= note: expected struct `Dim<[usize; 2]>`
found struct `Dim<IxDynImpl>`
I can solve it with
fn make_0(matrix: Array2<i32>) -> Array2<i32> {
    Array2::zeros((matrix.shape()[0], matrix.shape()[1]))
}
But I guess there's something better and I'm lost with types here.

The documentation for ArrayBase::shape() recommends using .raw_dim() instead:
Note that you probably don’t want to use this to create an array of the same shape as another array because creating an array with e.g. Array::zeros() using a shape of type &[usize] results in a dynamic-dimensional array. If you want to create an array that has the same shape and dimensionality as another array, use .raw_dim() instead:
// To get the same dimension type, use `.raw_dim()` instead:
let c = Array::zeros(a.raw_dim());
assert_eq!(a, c);
So you probably want to do something like this:
fn make_0(matrix: Array2<i32>) -> Array2<i32> {
    Array2::zeros(matrix.raw_dim())
}


Converting a Utf8 Series into a Series of List<Utf8> via a custom function in Rust polars

I have a Utf8 column in my DataFrame, and from that I want to create a column of List<Utf8>.
In particular, for each row I take the text of an HTML document, use soup to parse out all the <p> paragraph elements, and store the text of each separate paragraph as a Vec<String> or Vec<&str>. I have this as a standalone function:
fn parse_paragraph(s: &str) -> Vec<&str> {
    let soup = Soup::new(s);
    soup.tag("p").find_all().iter().map(|&p| p.text()).collect()
}
In trying to adapt the few available examples of applying custom functions in Rust polars, I can't seem to get the conversion to compile.
Take this minimal example, which uses a simpler string-to-vec-of-strings function and borrows from the Iterators example in the documentation:
use polars::prelude::*;
fn vector_split(text: &str) -> Vec<&str> {
    text.split(' ').collect()
}
fn vector_split_series(s: &Series) -> PolarsResult<Series> {
    let output : Series = s.utf8()
        .expect("Text data")
        .into_iter()
        .map(|t| t.map(vector_split))
        .collect();
    Ok(output)
}
fn main() {
    let df = df! [
        "text" => ["a cat on the mat", "a bat on the hat", "a gnat on the rat"]
    ].unwrap();
    df.clone().lazy()
        .select([
            col("text").apply(|s| vector_split_series(&s), GetOutput::default())
                .alias("words")
        ])
        .collect();
}
(Note: I know there is an in-built split function for utf8 Series, but I needed a simpler example than parsing HTML)
I get the following error from cargo check:
error[E0277]: a value of type `polars::prelude::Series` cannot be built from an iterator over elements of type `Option<Vec<&str>>`
--> src/main.rs:11:27
|
11 | let output : Series = s.utf8()
| ___________________________^
12 | | .expect("Text data")
13 | | .into_iter()
14 | | .map(|t| t.map(vector_split))
| |_____________________________________^ value of type `polars::prelude::Series` cannot be built from `std::iter::Iterator<Item=Option<Vec<&str>>>`
15 | .collect();
| ------- required by a bound introduced by this call
|
= help: the trait `FromIterator<Option<Vec<&str>>>` is not implemented for `polars::prelude::Series`
= help: the following other types implement trait `FromIterator<A>`:
<polars::prelude::Series as FromIterator<&'a bool>>
<polars::prelude::Series as FromIterator<&'a f32>>
<polars::prelude::Series as FromIterator<&'a f64>>
<polars::prelude::Series as FromIterator<&'a i32>>
<polars::prelude::Series as FromIterator<&'a i64>>
<polars::prelude::Series as FromIterator<&'a str>>
<polars::prelude::Series as FromIterator<&'a u32>>
<polars::prelude::Series as FromIterator<&'a u64>>
and 15 others
note: required by a bound in `std::iter::Iterator::collect`
What is the correct idiom for this kind of procedure? Is there a simpler way to apply a function?
For future seekers, I will explain the general solution and then the specific code to make the example work. I'll also point out some gotchas for this specific example.
Explanation
If you need to use a custom function instead of the convenient Expr expressions, the core of it is a function that converts the Series of the input column into a Series backed by a ChunkedArray of the correct output type. This function is what you pass to map or apply (the code below uses apply) in the select statement in main. The type of the ChunkedArray is the type you provide as GetOutput.
The code inside vector_split_series in the question works for conversion functions of standard numeric types, or List of numeric types. It does not work automatically for Lists of Utf8 strings, for example, as they are treated specially for ChunkedArrays. This is for performance reasons. You need to build up the Series explicitly, via the correct type builder.
In the question's case, we need to use a ListUtf8ChunkedBuilder which will create a ChunkedArray of List<Utf8>.
So in general, the question's code works for conversion outputs that are numeric or Lists of numerics. But for lists of strings, you need to use a ListUtf8ChunkedBuilder.
Correct code
The correct code for the question's example looks like this:
use polars::prelude::*;
fn vector_split(text: &str) -> Vec<String> {
    text.split(' ').map(|x| x.to_owned()).collect()
}
fn vector_split_series(s: Series) -> PolarsResult<Series> {
    let ca = s.utf8()?;
    let mut builder = ListUtf8ChunkedBuilder::new("words", s.len(), ca.get_values_size());
    ca.into_iter()
        .for_each(|opt_s| match opt_s {
            None => builder.append_null(),
            Some(s) => {
                builder.append_series(
                    &Series::new("words", vector_split(s).into_iter())
                )
            }
        });
    Ok(builder.finish().into_series())
}
fn main() {
    let df = df! [
        "text" => ["a cat on the mat", "a bat on the hat", "a gnat on the rat"]
    ].unwrap();
    let df2 = df.clone().lazy()
        .select([
            col("text")
                .apply(|s| vector_split_series(s), GetOutput::from_type(DataType::List(Box::new(DataType::Utf8))))
                // Can instead use default if the compiler can determine the types
                //.apply(|s| vector_split_series(s), GetOutput::default())
                .alias("words")
        ])
        .collect()
        .unwrap();
    println!("{:?}", df2);
}
The core is in vector_split_series. It has that signature so that it can be passed to map or apply.
The match statement is required because Series can have null entries, and to preserve the length of the Series, you need to pass nulls through. We use the builder here so it appends the appropriate null.
For non-null entries the builder needs to append Series. Normally you can append_from_iter, but there is (as of polars 0.26.1) no implementation of FromIterator for Iterator<Item=Vec<T>>. So you need to convert the collection into an iterator on values, and that iterator into a new Series.
Once the larger ChunkedArray (of type ListUtf8ChunkedArray) is built, you can convert it into a PolarsResult<Series> to return to map.
Gotcha
In the above example, vector_split can return Vec<String> or Vec<&str>. This is because split creates its iterator of &str in a nice way.
If you are using something more complicated, like my original example of extracting text via Soup queries, and it outputs iterators of &str, the references may be considered owned by a temporary, and you will then get errors about returning references to temporaries.
This is why in the working code, I pass Vec<String> back to the builder, even though it is not strictly required.
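The borrowing problem described in this gotcha can be reproduced without Soup or polars. The following is a minimal std-only sketch: `words_owned` and its lowercasing step are hypothetical stand-ins for a parser that builds an intermediate value which any returned &str would borrow from.

```rust
// Hypothetical stand-in for a parser that builds an intermediate value.
// Returning Vec<&str> here would not compile, because the slices would
// borrow from `lowered`, which is dropped when the function returns.
fn words_owned(text: &str) -> Vec<String> {
    let lowered = text.to_lowercase(); // local value owned by this function
    lowered
        .split(' ')
        .map(str::to_owned) // copy each word out of the local value
        .collect()
}
```

Returning owned Strings costs one allocation per word, but it removes the lifetime link between the result and the intermediate value.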

Why does size for value types matter when returning a value vs a reference? [duplicate]

I'm trying to manipulate a string derived from a function parameter and then return the result of that manipulation:
fn main() {
    let a: [u8; 3] = [0, 1, 2];
    for i in a.iter() {
        println!("{}", choose("abc", *i));
    }
}
fn choose(s: &str, pad: u8) -> String {
    let c = match pad {
        0 => ["000000000000000", s].join("")[s.len()..],
        1 => [s, "000000000000000"].join("")[..16],
        _ => ["00", s, "0000000000000"].join("")[..16],
    };
    c.to_string()
}
On building, I get this error:
error[E0277]: the trait bound `str: std::marker::Sized` is not satisfied
--> src\main.rs:9:9
|
9 | let c = match pad {
| ^ `str` does not have a constant size known at compile-time
|
= help: the trait `std::marker::Sized` is not implemented for `str`
= note: all local variables must have a statically known size
What's wrong here, and what's the simplest way to fix it?
TL;DR Don't use str, use &str. The reference is important.
The issue can be simplified to this:
fn main() {
    let demo = "demo"[..];
}
You are attempting to slice a &str (but the same would happen for a String, &[T], Vec<T>, etc.), but have not taken a reference to the result. This means that the type of demo would be str. To fix it, add an &:
let demo = &"demo"[..];
In your broader example, you are also running into the fact that you are creating an allocated String inside of the match statement (via join) and then attempting to return a reference to it. This is disallowed because the String will be dropped at the end of the match, invalidating any references. In another language, this could lead to memory unsafety.
One potential fix is to store the created String for the duration of the function, preventing its deallocation until after the new string is created:
fn choose(s: &str, pad: u8) -> String {
    let tmp;
    match pad {
        0 => {
            tmp = ["000000000000000", s].join("");
            &tmp[s.len()..]
        }
        1 => {
            tmp = [s, "000000000000000"].join("");
            &tmp[..16]
        }
        _ => {
            tmp = ["00", s, "0000000000000"].join("");
            &tmp[..16]
        }
    }.to_string()
}
Editorially, there are probably more efficient ways of writing this function. The formatting machinery has options for padding strings. You might even be able to just truncate the string returned from join without creating a new one.
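As a rough sketch of the padding idea, the function could lean on the fill and alignment options of format!. This assumes ASCII input of 1 to 14 bytes, where it produces the same strings as the slicing version; for longer input it differs, because format! pads but never truncates.

```rust
// Sketch only: matches the slicing version for ASCII input of 1 to 14 bytes.
fn choose(s: &str, pad: u8) -> String {
    match pad {
        // Left-pad with '0' to a width of 15 characters.
        0 => format!("{:0>15}", s),
        // Right-pad with '0' to a width of 16 characters.
        1 => format!("{:0<16}", s),
        // Prefix "00", then right-pad the rest with '0' to 14 characters.
        _ => format!("00{:0<14}", s),
    }
}
```

No temporary String has to outlive the match here, since each arm returns an owned String directly.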
What it means is harder to explain succinctly. Rust has a number of types that are unsized. The most prevalent ones are str and [T]. Contrast these types to how you normally see them used: &str or &[T]. You might even see them as Box<str> or Arc<[T]>. The commonality is that they are always used behind a reference of some kind.
Because these types don't have a size, they cannot be stored in a variable on the stack — the compiler wouldn't know how much stack space to reserve for them! That's the essence of the error message.
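A small illustration of the point about unsized types, with hypothetical helper names: the pointer types that wrap str do have a known size (an address plus a length on current compilers), which is why they can live in local variables while a bare str cannot.

```rust
use std::mem::size_of;

// `str` itself has no compile-time size, so it only appears behind a pointer.
// Each of these pointer types carries the data address plus a length.
fn str_pointer_sizes() -> (usize, usize) {
    (size_of::<&str>(), size_of::<Box<str>>())
}

// Slicing yields the unsized place `str`; the leading `&` turns it into `&str`.
fn tail(s: &str, from: usize) -> &str {
    &s[from..]
}
```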
See also:
What is the return type of the indexing operation?
Return local String as a slice (&str)
Why your first FizzBuzz implementation may not work

Error with non matching type: Option<&[u8]> and Option<&[u8;32]>

I'm playing with some Rust code, and I've reached a point where I have a problem converting Option<&[u8; 32]> to Option<&[u8]>.
A (very) simplified example:
pub type Foo = [u8; 32];
fn fun_one(inp: Option<&[u8]>) {
    println!("{:?}", inp);
}
fn fun_two(x: Option<&Foo>) {
    fun_one(x)
}
fn main() {
    let x = [11u8; 32];
    fun_two(Some(&x));
}
error[E0308]: mismatched types
--> src/main.rs:8:13
|
8 | fun_one(x)
| ^ expected slice `[u8]`, found array `[u8; 32]`
|
= note: expected enum `Option<&[u8]>`
found enum `Option<&[u8; 32]>`
A slice isn't just a pointer over an array. It's both the pointer to the data and a length (see Arrays and Slices) as it refers to only a part of the array. This is why the types aren't compatible.
What you want here is a slice over the whole array, which you get with the .. full range expression: slice = &array[..].
Since you have an Option, you can conditionally apply this transformation using map.
Combining all this:
fn fun_two(x: Option<&Foo>) {
    fun_one(x.map(|a| &a[..]))
}
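The same conversion can be written with the array method as_slice, which some readers find clearer than &a[..]. This is a minimal sketch; `slice_len` and `array_len` are hypothetical names that return a length instead of printing, so the result is easy to check.

```rust
type Foo = [u8; 32];

// Takes an optional slice, like fun_one in the question.
fn slice_len(inp: Option<&[u8]>) -> usize {
    inp.map_or(0, |s| s.len())
}

// Takes an optional array reference and converts it inside the Option.
fn array_len(x: Option<&Foo>) -> usize {
    // `as_slice` borrows the whole array as `&[u8]`, same as `&a[..]`.
    slice_len(x.map(|a| a.as_slice()))
}
```

The map call is needed because the automatic array-to-slice coercion applies to a bare &[u8; 32], not to one wrapped in an Option.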

How to access the element at variable index of a tuple?

I'm writing a function to read vectors from stdin, and here is what I have so far:
fn read_vector() -> (i64, i64, i64) {
    let mut vec = (0, 0, 0);
    let mut value = String::new();
    for i in 0..3 {
        io::stdin().read_line(&mut value).expect("Failed to read line");
        vec.i = value.trim().parse().expect("Failed to read number!"); // error!
    }
}
However, the annotated line contains an error:
error: no field `i` on type `({integer}, {integer}, {integer})`
--> src/main.rs:13:13
|
13 | vec.i = value.trim().parse().expect("Failed to read number!");
| ^
Reading the documentation entry doesn't reveal any get, or similar function.
So, is there any way to get the ith value of a tuple?
There isn't a way built in the language, because variable indexing on a heterogeneous type like a tuple makes it impossible for the compiler to infer the type of the expression.
You could use a macro that unrolls a for loop with variable indexing for a tuple if it is really, really necessary though.
If you are going to be using homogeneous tuples that require variable indexing, why not just use a fixed-length array?
So, is there any way to get the ith value of vec?
No, there isn't. Since tuples can contain elements of different types, an expression like this wouldn't have a statically-known type in general.
You could consider using an array instead of a tuple.
While there are no built-in methods to extract the i-th value for non-constant i, there exist crates like tuple to implement dynamic indexing of a homogeneous tuple.
extern crate tuple;
...
*vec.get_mut(i).unwrap() = value.trim().parse().expect("!");
(But, as @fjh mentioned, it is far better to operate on an array [i64; 3] instead of a tuple (i64, i64, i64).)
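The array approach suggested in the answers might look like the following sketch. `parse_vector` is a hypothetical helper that takes the lines as an iterator instead of reading stdin, so it can be exercised without user input.

```rust
// Hypothetical helper: fills a fixed-length array from three text lines.
// Returns None if a line is missing or does not parse as i64.
fn parse_vector<'a, I>(lines: I) -> Option<[i64; 3]>
where
    I: IntoIterator<Item = &'a str>,
{
    let mut lines = lines.into_iter();
    let mut vec = [0i64; 3];
    for i in 0..3 {
        // Arrays support indexing with a runtime value; tuples do not.
        vec[i] = lines.next()?.trim().parse().ok()?;
    }
    Some(vec)
}
```

If a tuple is still needed at the call site, the array can be destructured afterwards with `let [x, y, z] = vec;`.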

How to access a slice from pre-defined 'Range' in a struct?

Accessing a slice is straightforward using slice syntax: slice = &vector[i..j]
In the case where the range is stored however, from what I can tell you can't do:
struct StructWithRange {
    range: std::ops::Range<usize>,
}
fn test_slice(s: &StructWithRange, vector: &Vec<i32>) {
    let slice = &vector[s.range];
    println!("{:?}", slice); // prints [2, 3]
}
fn main() {
    let vector = vec![1,2,3,4,5];
    let s = StructWithRange {
        range: 1..3
    };
    test_slice(&s, &vector);
}
This gives the error:
error[E0507]: cannot move out of borrowed content
--> src/main.rs:6:25
|
6 | let slice = &vector[s.range];
| ^ cannot move out of borrowed content
Is there a way to get the slice from a range without expanding it, e.g. as vector[s.range.start..s.range.end]?
If a usize in a struct can be used for an index lookup, why can't a Range<usize> be used in the same way?
Since Index is a trait requiring the following function:
fn index(&self, index: Idx) -> &Self::Output
It consumes/moves the value used for indexing (index). In your case you are attempting to index the slice using a Range from a borrowed struct, but since you are only passing a reference and the range doesn't implement Copy, this fails.
You can fix it by e.g. changing the definition of test_slice to consume StructWithRange or clone()ing the s.range in the index.
The error message occurs because Range does not implement Copy and Index consumes its index.
It can be solved by adding a call to .clone(): &vector[s.range.clone()].
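A minimal sketch of the clone() fix, reusing the struct from the question. `slice_of` is a hypothetical name; it returns the slice rather than printing it.

```rust
use std::ops::Range;

struct StructWithRange {
    range: Range<usize>,
}

// Cloning the Range copies its two usize bounds; the struct keeps its own copy,
// so indexing can consume the clone while `s` stays borrowed.
fn slice_of<'a>(s: &StructWithRange, vector: &'a [i32]) -> &'a [i32] {
    &vector[s.range.clone()]
}
```

Taking &[i32] instead of &Vec<i32> lets the same function accept arrays and vectors alike.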
The source code links to the rejected proposal to add Copy to Range in the case where its parameter is Copy.
The rejection reason is:
These don't have it because they're iterators.
The choice of removing Copy impls instead of adjusted for loop desugaring or linting was made to prevent this problematic case:
let mut iter = 0..n;
for i in iter { if i > 2 { break; } }
iter.collect()
Here iter is actually not mutated, but copied. for i in &mut iter is required to mutate the iterator.
We could switch to linting against using an iterator variable after it was copied by a for loop, but there was no decision towards that.
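With Range not being Copy, the borrowing form mentioned in the quote behaves as follows. This is a small sketch with a hypothetical function name; it shows that iterating through &mut advances the original range in place.

```rust
// Returns what is left in the range after breaking out of a `for` loop
// that borrowed it mutably.
fn remaining_after_break(n: u32) -> Vec<u32> {
    let mut iter = 0..n;
    for i in &mut iter {
        if i > 2 {
            break;
        }
    }
    // `iter` was advanced in place, so only the unconsumed tail remains.
    iter.collect()
}
```

Writing `for i in iter` instead would move the range into the loop, and the later `iter.collect()` would be rejected by the compiler rather than silently operating on a copy.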
