How to get date element when iterating over DateChunked in polars rust

I want to iterate over a DateChunked using map and work with each date element, but iterating gives me i32 values. How can I get the dates?
Compiling the code below fails.
use polars::export::chrono::NaiveDate;
use polars::prelude::*;
fn main() {
    let df = df![
        "date_series" => [NaiveDate::from_ymd(2020, 1, 1), NaiveDate::from_ymd(2020, 1, 2), NaiveDate::from_ymd(2020, 1, 3)]
    ].unwrap();
    let a: DateChunked = df["date_series"].date().unwrap().into_iter().map(|d| {
        match d {
            Some(d) => Some(d),
            None => None
        }
    }).collect();
}
The compiler complains:
}).collect();
| ^^^^^^^ value of type `Logical<DateType, Int32Type>` cannot be built from `std::iter::Iterator<Item=Option<i32>>`

Related

Polars with_context filter

I'm having a hard time understanding how to use the lazy with_context. The docs say
This allows expressions to also access columns from DataFrames that are not part of this one.
I want to filter a column based on a column from another frame, but I get errors stating that the column from the other frame does not exist. Not sure what I'm doing wrong here, as it seems to fit the description of what the with_context docs describe.
main.rs
use polars::prelude::*;
fn main() {
    let df0 = df! {
        "id" => [1, 2, 3],
        "name" => ["foo", "bar", "baz"],
    }
    .unwrap()
    .lazy();
    let other_df = df! {
        "other_id" => [1,2,2,1],
        "name" => ["w", "x", "y", "z"],
    }
    .unwrap()
    .lazy();
    let lf = df0.with_context(&[other_df]);
    let res = lf
        .filter(col("id").is_in(col("other_id")))
        .collect()
        .unwrap();
    println!("{:?}", res);
}
Cargo.toml
[dependencies]
polars = {git = "https://github.com/pola-rs/polars", branch = "master", features = ["lazy", "is_in"]}
Edit:
If I use select instead of filter, I don't get an error.
let res = lf
    .select(&[col("id").is_in(col("other_id"))])
    .collect()
    .unwrap();

How do I serialize Polars DataFrame Row/HashMap of `AnyValue` into JSON?

I have a row of a polars dataframe created using iterators reading a parquet file from this method: Iterate over rows polars rust
I have constructed a HashMap that represents an individual row and I would like to now convert that row into JSON.
This is what my code looks like so far:
use polars::prelude::*;
use std::iter::zip;
use std::{fs::File, collections::HashMap};
fn main() -> anyhow::Result<()> {
    let file = File::open("0.parquet").unwrap();
    let mut df = ParquetReader::new(file).finish()?;
    dbg!(df.schema());
    let fields = df.fields();
    let columns: Vec<&String> = fields.iter().map(|x| x.name()).collect();
    df.as_single_chunk_par();
    let mut iters = df.iter().map(|s| s.iter()).collect::<Vec<_>>();
    for _ in 0..df.height() {
        let mut row = HashMap::new();
        for (column, iter) in zip(&columns, &mut iters) {
            let value = iter.next().expect("should have as many iterations as rows");
            row.insert(column, value);
        }
        dbg!(&row);
        let json = serde_json::to_string(&row).unwrap();
        dbg!(json);
        break;
    }
    Ok(())
}
And I have the following feature flags enabled: ["parquet", "serde", "dtype-u8", "dtype-i8", "dtype-date", "dtype-datetime"].
I am running into the following error at the serde_json::to_string(&row).unwrap() line:
thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: Error("the enum variant AnyValue::Datetime cannot be serialized", line: 0, column: 0)', src/main.rs:47:48
I am also unable to implement my own serializer for AnyValue::Datetime, because only traits defined in the current crate can be implemented for types defined outside of the crate.
What's the best way to serialize this row into JSON?
I was able to resolve this error by using a match statement over value to change it from a Datetime to an Int64.
let value = match value {
    AnyValue::Datetime(value, TimeUnit::Milliseconds, _) => AnyValue::Int64(value),
    x => x
};
row.insert(column, value);
The root cause is that the impl Serialize block does not handle the Datetime variant: https://docs.rs/polars-core/0.24.0/src/polars_core/datatypes/mod.rs.html#298
Although this code now works, it outputs data that looks like:
{'myintcolumn': {'Int64': 22342342343},
'mylistoclumn': {'List': {'datatype': 'Int32', 'name': '', 'values': []}},
'mystrcolumn': {'Utf8': 'lorem ipsum lorem ipsum'}
So you will likely need to customize the serialization here regardless of the data type.
Update: To get the JSON without all of the inner nesting, I had to write a gnarly match statement:
use polars::prelude::*;
use std::iter::zip;
use std::{fs::File, collections::HashMap};
use serde_json::json;
fn main() -> anyhow::Result<()> {
    let file = File::open("0.parquet").unwrap();
    let mut df = ParquetReader::new(file).finish()?;
    dbg!(df.schema());
    let fields = df.fields();
    let columns: Vec<&String> = fields.iter().map(|x| x.name()).collect();
    df.as_single_chunk_par();
    let mut iters = df.iter().map(|s| s.iter()).collect::<Vec<_>>();
    for _ in 0..df.height() {
        let mut row = HashMap::new();
        for (column, iter) in zip(&columns, &mut iters) {
            let value = iter.next().expect("should have as many iterations as rows");
            // Map each AnyValue variant to a plain JSON value (no variant-name nesting)
            let value = match value {
                AnyValue::Null => json!(Option::<String>::None),
                AnyValue::Int64(val) => json!(val),
                AnyValue::Int32(val) => json!(val),
                AnyValue::Int8(val) => json!(val),
                AnyValue::Float32(val) => json!(val),
                AnyValue::Float64(val) => json!(val),
                AnyValue::Utf8(val) => json!(val),
                AnyValue::List(val) => {
                    match val.dtype() {
                        DataType::Int32 => { let vec: Vec<Option<_>> = val.i32().unwrap().into_iter().collect(); json!(vec) },
                        DataType::Float32 => { let vec: Vec<Option<_>> = val.f32().unwrap().into_iter().collect(); json!(vec) },
                        DataType::Utf8 => { let vec: Vec<Option<_>> = val.utf8().unwrap().into_iter().collect(); json!(vec) },
                        DataType::UInt8 => { let vec: Vec<Option<_>> = val.u8().unwrap().into_iter().collect(); json!(vec) },
                        x => panic!("unable to parse list column: {} with value: {} and type: {:?}", column, x, x.inner_dtype())
                    }
                },
                AnyValue::Datetime(val, TimeUnit::Milliseconds, _) => json!(val),
                x => panic!("unable to parse column: {} with value: {}", column, x)
            };
            row.insert(*column as &str, value);
        }
        let json = serde_json::to_string(&row).unwrap();
        dbg!(json);
        break;
    }
    Ok(())
}

rust macro, how to control the different vec have the same length?

Code first:
use std::collections::HashMap;
macro_rules! arr{
    ([$($t:expr=>[$($c:expr),*]),*]) => {
        vec![
            $({
                let mut m = HashMap::new();
                m.insert($t, vec![$($c),*]);
                m
            }),*
        ]
    };
}
fn main() {
    let a = arr!([
        "A"=>[1,2,3],
        "B"=>[3,4]
    ]);
    println!("{:?}", a);
    //print: [{"A": [1, 2, 3]}, {"B": [3, 4]}]
}
The macro above generates a Vec containing several HashMaps, and each HashMap value is itself a Vec:
{"A": [1, 2, 3]} => vec value length: 3,
{"B": [3, 4]} => vec value length: 2,
I want the value Vecs in all of the HashMaps to have the same length.
How can I enforce this in the macro?
You can change the macro so that it creates a block (second set of {} encapsulating the macro definition) that you can set helper variables in and do a second pass over your vector, resizing anything that is smaller than the largest array.
In this case I've resized the arrays with the default value of the type to keep it simple. You may wish to wrap the data in Some().
This:
use std::cmp;
use std::collections::HashMap;
use std::default::Default;
macro_rules! arr{
    ([$($t:expr=>[$($c:expr),*]),*]) => {{
        let mut max = 0;
        let mut result = vec![
            $({
                let mut m = HashMap::new();
                m.insert($t, vec![$($c),*]);
                // Simply unwrap here as we know we inserted at this key above
                max = cmp::max(max, m.get($t).unwrap().len());
                m
            }),*
        ];
        // Second pass: pad every shorter Vec with default values up to the longest length
        for m in result.iter_mut() {
            for v in m.values_mut() {
                if v.len() < max {
                    v.resize_with(max, Default::default);
                }
            }
        }
        result
    }};
}
fn main() {
    let a = arr!([
        "A"=>[1,2,3],
        "B"=>[3,4]
    ]);
    println!("{:?}", a);
    //print: [{"A": [1, 2, 3]}, {"B": [3, 4, 0]}]
}
Yields:
[{"A": [1, 2, 3]}, {"B": [3, 4, 0]}]
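The suggestion to wrap the data in Some() could look roughly like the sketch below. It is written as a plain function rather than a macro to keep it short. pad_columns is a hypothetical helper name that does not appear in the answer, and the sketch assumes string-literal keys as in the example above.

```rust
use std::collections::HashMap;

// Hypothetical helper (not from the original answer): wraps every element in
// Some and pads shorter columns with None up to the longest column length.
fn pad_columns<T>(
    rows: Vec<HashMap<&'static str, Vec<T>>>,
) -> Vec<HashMap<&'static str, Vec<Option<T>>>> {
    // Longest value Vec across all maps; 0 when there is no data at all.
    let max = rows
        .iter()
        .flat_map(|m| m.values())
        .map(|v| v.len())
        .max()
        .unwrap_or(0);

    rows.into_iter()
        .map(|m| {
            m.into_iter()
                .map(|(key, values)| {
                    let mut padded: Vec<Option<T>> = values.into_iter().map(Some).collect();
                    // max >= padded.len(), so this only ever appends None.
                    padded.resize_with(max, || None);
                    (key, padded)
                })
                .collect()
        })
        .collect()
}
```

Padding with None keeps a missing entry distinguishable from a real 0, which the Default-based version cannot do.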

how to convert Option<&u8> to u8

I want to convert an Option<&u8> to a u8 so that I can print it.
My code:
fn main() {
    let v : Vec<u8> = vec![1, 2, 3, 4, 5];
    let out_of_range = &v[100];
    let out_of_range = v.get(100);
    match out_of_range{
        Some(&u8) => println!("data out of range: {}", out_of_range),
        None => println!("bruh"),
    }
}
Your match statement needs the introduction of a binding, not a type (the &u8 you used was not expected here).
Here Some(val) matches with something which is an Option<&u8>, thus val is a binding to the embedded &u8 (if not None, of course).
The example explicitly dereferences val as an illustration, to highlight that val is not a u8 but a reference; the dereference is not required for the operation that follows (printing).
fn main() {
    let v: Vec<u8> = vec![1, 2, 3, 4, 5];
    for idx in [2, 20, 4] {
        let at_index = v.get(idx);
        match at_index {
            Some(val) => {
                // val has type &u8
                let copy_of_val = *val; // not required, just for the example
                println!("at {} --> {}", idx, copy_of_val);
            }
            None => println!("no value at {}", idx),
        }
    }
}
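When an owned u8 is really needed, not just something printable, the standard Option::copied and Option::unwrap_or methods can do the conversion without a match. The following is a minimal sketch; byte_or and describe are hypothetical helper names chosen for illustration, and the fallback value is an assumption about what "out of range" should produce.

```rust
/// Returns the byte at `idx`, or `fallback` when the index is out of range.
/// `copied()` turns Option<&u8> into Option<u8>.
fn byte_or(v: &[u8], idx: usize, fallback: u8) -> u8 {
    v.get(idx).copied().unwrap_or(fallback)
}

/// Formats the lookup result without ever indexing out of bounds.
fn describe(v: &[u8], idx: usize) -> String {
    match v.get(idx) {
        // `val` is a &u8 here; formatting works through the reference.
        Some(val) => format!("at {} --> {}", idx, val),
        None => format!("no value at {}", idx),
    }
}
```

Note that v.get(idx) never panics, whereas &v[100] on a five-element vector panics at runtime before any match is reached.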

How can I group consecutive integers in a vector in Rust?

I have a Vec<i64> and I want to know all the groups of integers that are consecutive. As an example:
let v = vec![1, 2, 3, 5, 6, 7, 9, 10];
I'm expecting something like this or similar:
[[1, 2, 3], [5, 6, 7], [9, 10]];
The representation (a vector of vectors, tuples, or something else) doesn't really matter, as long as I get several grouped lists of consecutive numbers.
At first glance, it seems like I'll need to use itertools and the group_by function, but I have no idea how...
You can indeed use group_by for this, but you might not really want to. Here's what I would probably write instead:
fn consecutive_slices(data: &[i64]) -> Vec<&[i64]> {
    let mut slice_start = 0;
    let mut result = Vec::new();
    for i in 1..data.len() {
        // A gap between neighbours closes the current run
        if data[i - 1] + 1 != data[i] {
            result.push(&data[slice_start..i]);
            slice_start = i;
        }
    }
    // Push the final run, unless the input was empty
    if !data.is_empty() {
        result.push(&data[slice_start..]);
    }
    result
}
This is similar in principle to eXodiquas' answer, but instead of accumulating a Vec<Vec<i64>>, I use the indices to accumulate a Vec of slice references that refer to the original data. (This question explains why I made consecutive_slices take &[T].)
It's also possible to do the same thing without allocating a Vec, by returning an iterator; however, I like the above version better. Here's the zero-allocation version I came up with:
fn consecutive_slices(data: &[i64]) -> impl Iterator<Item = &[i64]> {
    let mut slice_start = 0;
    (1..=data.len()).flat_map(move |i| {
        if i == data.len() || data[i - 1] + 1 != data[i] {
            let begin = slice_start;
            slice_start = i;
            Some(&data[begin..i])
        } else {
            None
        }
    })
}
It's not as readable as a for loop, but it doesn't need to allocate a Vec for the return value, so this version is more flexible.
Here's a "more functional" version using group_by:
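use itertools::Itertools;
fn consecutive_slices(data: &[i64]) -> Vec<Vec<i64>> {
    // Subtract in i64: casting data[i] to usize would break for negative
    // values or for values smaller than their index
    (&(0..data.len()).group_by(|&i| data[i] - i as i64))
        .into_iter()
        .map(|(_, group)| group.map(|i| data[i]).collect())
        .collect()
}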
The idea is to make a key function for group_by that takes the difference between each element and its index in the slice. Consecutive elements will have the same key because indices increase by 1 each time. One reason I don't like this version is that it's quite difficult to get slices of the original data structure; you almost have to create a Vec<Vec<i64>> (hence the two collects). The other reason is that I find it harder to read.
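To make the key idea concrete, here is a small std-only sketch that computes the key (element minus index) for every position. group_keys is a hypothetical helper used purely for illustration and is not part of the answer's code.

```rust
// Hypothetical helper: the grouping key for each position is the element
// minus its index. Within a run of consecutive integers both grow by 1 per
// step, so the key stays constant; a gap in the data changes the key.
fn group_keys(data: &[i64]) -> Vec<i64> {
    data.iter()
        .enumerate()
        .map(|(i, &x)| x - i as i64)
        .collect()
}
```

For the input [1, 2, 3, 5, 6, 7, 9, 10] the keys are [1, 1, 1, 2, 2, 2, 3, 3], so grouping adjacent equal keys recovers the three runs.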
However, when I first wrote my preferred version (the first one, with the for loop), it had a bug (now fixed), while the other two versions were correct from the start. So there may be merit to writing denser code with functional abstractions, even if there is some hit to readability and/or performance.
let v = vec![1, 2, 3, 5, 6, 7, 9, 10];
let mut res = Vec::new();
let mut prev = v[0];
let mut sub_v = Vec::new();
sub_v.push(prev);
for i in 1..v.len() {
    if v[i] == prev + 1 {
        sub_v.push(v[i]);
        prev = v[i];
    } else {
        res.push(sub_v.clone());
        sub_v.clear();
        sub_v.push(v[i]);
        prev = v[i];
    }
}
res.push(sub_v);
This should solve your problem.
It iterates over the given vector and checks whether the current i64 (i32 in my case) is exactly one greater than the previous value; if so, the value is pushed into a vector (sub_v). Once the run breaks, sub_v is pushed into the result vector and a new run is started. Repeat.
But I guess you wanted something functional?
Another possible solution, that uses std only, could be:
fn consecutive_slices(v: &[i64]) -> Vec<Vec<i64>> {
    let t: Vec<Vec<i64>> = v
        .into_iter()
        .chain([*v.last().unwrap_or(&-1)].iter())
        .scan(Vec::new(), |s, &e| {
            match s.last() {
                None => { s.push(e); Some((false, Vec::new())) },
                Some(&p) if p == e - 1 => { s.push(e); Some((false, Vec::new()))},
                Some(&p) if p != e - 1 => {let o = s.clone(); *s = vec![e]; Some((true, o))},
                _ => None,
            }
        })
        .filter_map(|(n, v)| {
            match n {
                true => Some(v.clone()),
                false => None,
            }
        })
        .collect();
    t
}
The chain appends a copy of the last element as a sentinel: it is never consecutive to itself, so it forces the final group to be emitted.
I like the answers above but you could also use peekable() to tell if the next value is different.
https://doc.rust-lang.org/stable/std/iter/struct.Peekable.html
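A minimal sketch of the peekable() idea, using only std, might look like this. consecutive_groups is a hypothetical function name, and returning owned Vecs instead of slices is a simplification made for the example.

```rust
// Hypothetical sketch: peek at the next value to decide whether the current
// run continues or should be closed.
fn consecutive_groups(data: &[i64]) -> Vec<Vec<i64>> {
    let mut result = Vec::new();
    let mut current = Vec::new();
    let mut iter = data.iter().copied().peekable();

    while let Some(x) = iter.next() {
        current.push(x);
        // The run continues only if there is a next value and it equals x + 1.
        // checked_add avoids an overflow panic when x is i64::MAX.
        let continues = iter
            .peek()
            .map_or(false, |&next| x.checked_add(1) == Some(next));
        if !continues {
            // Move the finished run into the result and start a fresh one.
            result.push(std::mem::take(&mut current));
        }
    }
    result
}
```

Because the end of input is handled by peek() returning None, no extra push is needed after the loop, and an empty input yields an empty result.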
I would probably use a fold for this?
That's because I'm very much a functional programmer.
Obviously mutating the accumulator is weird :P but this works too and represents another way of thinking about it.
This is basically a recursive solution and can be modified easily to use immutable datastructures.
https://play.rust-lang.org/?version=stable&mode=debug&edition=2021&gist=43b9e3613c16cb988da58f08724471a4
fn main() {
    let v = vec![1, 2, 3, 5, 6, 7, 9, 10];
    let mut res: Vec<Vec<i32>> = vec![];
    let (last_group, _): (Vec<i32>, Option<i32>) = v
        .iter()
        .fold((vec![], None), |(mut cur_group, last), x| {
            match last {
                None => {
                    cur_group.push(*x);
                    (cur_group, Some(*x))
                }
                Some(last) => {
                    if x - last == 1 {
                        cur_group.push(*x);
                        (cur_group, Some(*x))
                    } else {
                        res.push(cur_group);
                        (vec![*x], Some(*x))
                    }
                }
            }
        });
    res.push(last_group);
    println!("{:?}", res);
}
