Polars with_context filter - rust-polars

I'm having a hard time understanding how to use the lazy with_context. The docs say:
"This allows expressions to also access columns from DataFrames that are not part of this one."
I want to filter a column based on a column from another frame, but I get errors stating that the column from the other frame does not exist. Not sure what I'm doing wrong here, as it seems to fit the description of what the with_context docs describe.
main.rs
use polars::prelude::*;
fn main() {
let df0 = df! {
"id" => [1, 2, 3],
"name" => ["foo", "bar", "baz"],
}
.unwrap()
.lazy();
let other_df = df! {
"other_id" => [1,2,2,1],
"name" => ["w", "x", "y", "z"],
}
.unwrap()
.lazy();
let lf = df0.with_context(&[other_df]);
let res = lf
.filter(col("id").is_in(col("other_id")))
.collect()
.unwrap();
println!("{:?}", res);
}
Cargo.toml
[dependencies]
polars = {git = "https://github.com/pola-rs/polars", branch = "master", features = ["lazy", "is_in"]}
Edit:
If I use select instead of filter, I don't get an error:
let res = lf
.select(&[col("id").is_in(col("other_id"))])
.collect()
.unwrap();

Related

How do I serialize Polars DataFrame Row/HashMap of `AnyValue` into JSON?

I have a row of a polars dataframe created using iterators reading a parquet file from this method: Iterate over rows polars rust
I have constructed a HashMap that represents an individual row and I would like to now convert that row into JSON.
This is what my code looks like so far:
use polars::prelude::*;
use std::iter::zip;
use std::{fs::File, collections::HashMap};
fn main() -> anyhow::Result<()> {
let file = File::open("0.parquet").unwrap();
let mut df = ParquetReader::new(file).finish()?;
dbg!(df.schema());
let fields = df.fields();
let columns: Vec<&String> = fields.iter().map(|x| x.name()).collect();
df.as_single_chunk_par();
let mut iters = df.iter().map(|s| s.iter()).collect::<Vec<_>>();
for _ in 0..df.height() {
let mut row = HashMap::new();
for (column, iter) in zip(&columns, &mut iters) {
let value = iter.next().expect("should have as many iterations as rows");
row.insert(column, value);
}
dbg!(&row);
let json = serde_json::to_string(&row).unwrap();
dbg!(json);
break;
}
Ok(())
}
And I have the following feature flags enabled: ["parquet", "serde", "dtype-u8", "dtype-i8", "dtype-date", "dtype-datetime"].
I am running into the following error at the serde_json::to_string(&row).unwrap() line:
thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: Error("the enum variant AnyValue::Datetime cannot be serialized", line: 0, column: 0)', src/main.rs:47:48
I am also unable to implement my own Serialize for AnyValue::Datetime, because only traits defined in the current crate can be implemented for types defined outside of the crate.
What's the best way to serialize this row into JSON?
I was able to resolve this error by using a match statement over value to change it from a Datetime to an Int64.
let value = match value {
AnyValue::Datetime(value, TimeUnit::Milliseconds, _) => AnyValue::Int64(value),
x => x
};
row.insert(column, value);
The root cause is that the impl Serialize block does not handle the Datetime variant: https://docs.rs/polars-core/0.24.0/src/polars_core/datatypes/mod.rs.html#298
Although this code now works, it outputs data that looks like:
{'myintcolumn': {'Int64': 22342342343},
'mylistoclumn': {'List': {'datatype': 'Int32', 'name': '', 'values': []}},
'mystrcolumn': {'Utf8': 'lorem ipsum lorem ipsum'}
So you will likely need to customize the serialization here regardless of the data type.
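On the orphan-rule error mentioned in the question: the usual workaround is a local newtype wrapper, because a trait from another crate can be implemented for a type you define yourself. The sketch below uses only the standard library, with Duration and Display standing in for AnyValue and Serialize. The names AsMillis and render_millis are made up for illustration and are not part of any library.

```rust
use std::fmt;
use std::time::Duration;

/// Local wrapper type (hypothetical name). Because `AsMillis` is defined in
/// this crate, implementing the foreign trait `Display` for it is allowed,
/// even though `Display` for `Duration` directly would be rejected.
struct AsMillis(Duration);

impl fmt::Display for AsMillis {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        // Delegate to the wrapped value and choose the representation here.
        write!(f, "{}", self.0.as_millis())
    }
}

/// Formats a duration as a whole number of milliseconds.
fn render_millis(d: Duration) -> String {
    AsMillis(d).to_string()
}
```

The same shape should carry over to a wrapper around a row value with a hand-written Serialize impl, though that variant is not shown or tested here.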
Update: To get the JSON without all of the inner nesting, I had to write a gnarly match statement:
use polars::prelude::*;
use std::iter::zip;
use std::{fs::File, collections::HashMap};
use serde_json::json;
fn main() -> anyhow::Result<()> {
let file = File::open("0.parquet").unwrap();
let mut df = ParquetReader::new(file).finish()?;
dbg!(df.schema());
let fields = df.fields();
let columns: Vec<&String> = fields.iter().map(|x| x.name()).collect();
df.as_single_chunk_par();
let mut iters = df.iter().map(|s| s.iter()).collect::<Vec<_>>();
for _ in 0..df.height() {
let mut row = HashMap::new();
for (column, iter) in zip(&columns, &mut iters) {
let value = iter.next().expect("should have as many iterations as rows");
let value = match value {
AnyValue::Null => json!(Option::<String>::None),
AnyValue::Int64(val) => json!(val),
AnyValue::Int32(val) => json!(val),
AnyValue::Int8(val) => json!(val),
AnyValue::Float32(val) => json!(val),
AnyValue::Float64(val) => json!(val),
AnyValue::Utf8(val) => json!(val),
AnyValue::List(val) => {
match val.dtype() {
DataType::Int32 => ({let vec: Vec<Option<_>> = val.i32().unwrap().into_iter().collect(); json!(vec)}),
DataType::Float32 => ({let vec: Vec<Option<_>> = val.f32().unwrap().into_iter().collect(); json!(vec)}),
DataType::Utf8 => ({let vec: Vec<Option<_>> = val.utf8().unwrap().into_iter().collect(); json!(vec)}),
DataType::UInt8 => ({let vec: Vec<Option<_>> = val.u8().unwrap().into_iter().collect(); json!(vec)}),
x => panic!("unable to parse list column: {} with value: {} and type: {:?}", column, x, x.inner_dtype())
}
},
AnyValue::Datetime(val, TimeUnit::Milliseconds, _) => json!(val),
x => panic!("unable to parse column: {} with value: {}", column, x)
};
row.insert(*column as &str, value);
}
let json = serde_json::to_string(&row).unwrap();
dbg!(json);
break;
}
Ok(())
}
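The flattening idea in the match above can be shown without Polars or serde. This is a minimal sketch under stated assumptions: Cell is a hypothetical stand-in for AnyValue (it is not a Polars type), and to_json and escape_json are illustrative helpers. Handling lists recursively avoids one arm per inner dtype.

```rust
/// Hypothetical stand-in for a dataframe cell; NOT a Polars type.
#[derive(Debug, Clone, PartialEq)]
enum Cell {
    Null,
    Int(i64),
    Float(f64),
    Str(String),
    /// Milliseconds since the Unix epoch, emitted as a plain number.
    DatetimeMs(i64),
    List(Vec<Cell>),
}

/// Quotes a string and escapes the characters JSON requires.
fn escape_json(s: &str) -> String {
    let mut out = String::with_capacity(s.len() + 2);
    out.push('"');
    for c in s.chars() {
        match c {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            '\n' => out.push_str("\\n"),
            '\r' => out.push_str("\\r"),
            '\t' => out.push_str("\\t"),
            c if (c as u32) < 0x20 => out.push_str(&format!("\\u{:04x}", c as u32)),
            c => out.push(c),
        }
    }
    out.push('"');
    out
}

/// Renders a cell as bare JSON, without wrapping it in its variant name.
fn to_json(cell: &Cell) -> String {
    match cell {
        Cell::Null => "null".to_string(),
        Cell::Int(v) | Cell::DatetimeMs(v) => v.to_string(),
        // JSON has no NaN or infinity, so non-finite floats become null.
        Cell::Float(v) if v.is_finite() => v.to_string(),
        Cell::Float(_) => "null".to_string(),
        Cell::Str(s) => escape_json(s),
        // Lists recurse, so nested lists and mixed nulls need no extra arms.
        Cell::List(items) => {
            let inner: Vec<String> = items.iter().map(to_json).collect();
            format!("[{}]", inner.join(","))
        }
    }
}
```

In real code serde_json's json! macro, as used above, takes care of escaping; the hand-rolled escaping here only keeps the sketch dependency-free.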

How to properly apply a MAP function

Need some help with a map function.
I want to take a DataType::Date and store the corresponding weekday as a string column.
I have it working starting with string -> date-type -> string (case #1).
What I am looking for is date-type -> string (case #2).
Here is the working code for the first case ... any suggestions on how to get this to work for my second case?
My challenge with this stems from my lack of proper understanding of how map is supposed to work in this instance.
use chrono::NaiveDate;
use polars::prelude::*;
fn main() {
let days = df!("column_1" => &["Tuesday"],
"column_2" => &["1900-01-02"]);
let options = StrpTimeOptions {
date_dtype: DataType::Date,
fmt: Some("%Y-%m-%d".into()),
strict: false,
exact: true,
};
// convert column_2-string into dtype(date) and put into new column "date"
let days = days
.unwrap()
.lazy()
.with_column(col("column_2").alias("date").str().strptime(options));
let o = GetOutput::from_type(DataType::Utf8);
fn str_to_weekday(str_val: Series) -> Result<Series> {
let x = str_val
.utf8()
.unwrap()
.into_iter()
// your actual custom function would be in this map
.map(|opt_date: Option<&str>| {
opt_date.map(|date: &str| {
// for DEBUG purpose only:
println! {"Date-String: {:?}", date};
NaiveDate::parse_from_str(date, "%Y-%m-%d")
.unwrap()
.format("%A")
.to_string()
})
})
.collect::<Utf8Chunked>();
Ok(x.into_series())
}
// column_2 to weekday-string ... into new column "weekday"
let days = days
.with_column(col("column_2").alias("weekday").apply(str_to_weekday, o))
.collect()
.unwrap()
.lazy();
println!("{:?}", days.clone().collect());
}
Got it to work :)
With a simplified approach ...
use polars::prelude::*;
fn main() {
let days = df!("column_1" => &["Tuesday"],
"column_2" => &["1900-01-02"]);
let options = StrpTimeOptions {
date_dtype: DataType::Date,
fmt: Some("%Y-%m-%d".into()),
strict: false,
exact: true,
};
// convert column_2-string into dtype(date) and put into new column "date"
let days = days
.unwrap()
.lazy()
.with_column(col("column_2").alias("date").str().strptime(options));
println!("{:?}", days.clone().collect());
let o = GetOutput::from_type(DataType::Utf8);
let days = days.with_column(
col("date")
.alias("weekday")
.map(|x| Ok(x.strftime("%A").unwrap()), o),
);
println!("{:?}", days.collect());
}
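For intuition about what the date -> weekday mapping computes, here is a dependency-free sketch. It assumes the date is available as a count of days since 1970-01-01 (the physical i32 representation behind the Date dtype); weekday_name and weekday_names are hypothetical helper names, not Polars functions.

```rust
/// Weekday name for a date given as days since 1970-01-01, which was a Thursday.
fn weekday_name(days_since_epoch: i32) -> &'static str {
    const NAMES: [&str; 7] = [
        "Thursday", "Friday", "Saturday", "Sunday", "Monday", "Tuesday", "Wednesday",
    ];
    // rem_euclid keeps the index in 0..7 for dates before the epoch as well.
    NAMES[days_since_epoch.rem_euclid(7) as usize]
}

/// Applies `weekday_name` element-wise, passing missing values through as None.
fn weekday_names(days: &[Option<i32>]) -> Vec<Option<&'static str>> {
    days.iter().map(|d| d.map(weekday_name)).collect()
}
```

The date 1900-01-02 from the example is day -25566, which maps to "Tuesday" and matches column_1.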

LazyFrame: How to do string manipulation on values in a single column

I want to change all string values in a LazyFrame-Column.
e.g. from "alles ok" ==> to "ALLES OK"
I see that a series has a function to do it:
polars.internals.series.StringNameSpace.to_uppercase
Q: What is the proper way to apply a string (or Date) manipulation on just one column in a LazyFrame?
Q: Do I need to extract the column I want to work on as a series and re-integrate it?
I can do math on elements of a column and put the result in a new column e.g.:
df.with_column((col("b") ** 2).alias("b_squared")).collect()
but strings?
Ok, after some digging I was able to take a string-column of a LazyFrame and convert it to dtype(datetime).
I also found a code snippet to apply a "len" function to the first column and add the result into a new column:
use polars::prelude::*;
fn main() {
let df: Result<DataFrame> = df!("column_1" => &["Tuesday"],
"column_2" => &["1900-01-02"]);
let options = StrpTimeOptions {
date_dtype: DataType::Datetime(TimeUnit::Milliseconds, None),
fmt: Some("%Y-%m-%d".into()),
strict: false,
exact: true,
};
// in-place convert string into dtype(datetime)
let days = df
.unwrap()
.lazy()
.with_column(col("column_2").str().strptime(options));
// ### courtesy of Alex Moore-Niemi:
let o = GetOutput::from_type(DataType::UInt32);
fn str_to_len(str_val: Series) -> Result<Series> {
let x = str_val
.utf8()
.unwrap()
.into_iter()
// your actual custom function would be in this map
.map(|opt_name: Option<&str>| opt_name.map(|name: &str| name.len() as u32))
.collect::<UInt32Chunked>();
Ok(x.into_series())
}
// ###
// add new column with length of string in column_1
let days = days
.with_column(col("column_1").alias("new_column").apply(str_to_len, o))
.collect()
.unwrap();
let o = GetOutput::from_type(DataType::Utf8);
fn str_to_uppercase(str_val: Series) -> Result<Series> {
let x = str_val
.utf8()
.unwrap()
.into_iter()
// your actual custom function would be in this map
.map(|opt_name: Option<&str>| opt_name.map(|name: &str| name.to_uppercase()))
.collect::<Utf8Chunked>();
Ok(x.into_series())
}
// column_1 to UPPERCASE ... in-place
let days = days
.lazy()
.with_column(col("column_1").apply(str_to_uppercase, o))
.collect()
.unwrap();
println!("{}", days);
}
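The core of str_to_uppercase is the nested map: the outer map walks the values and the inner Option::map transforms only the values that are present, so nulls stay nulls. A minimal standard-library sketch of that pattern, with a slice of Option<&str> standing in for the string column (uppercase_all is a hypothetical name):

```rust
/// Uppercases every present value and leaves missing values as None.
fn uppercase_all(values: &[Option<&str>]) -> Vec<Option<String>> {
    values
        .iter()
        // Outer map: one step per element. Inner map: only runs for Some.
        .map(|opt| opt.map(|s| s.to_uppercase()))
        .collect()
}
```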

How to get date element when iterating over DateChunked in polars rust

I want to iterate over a DateChunked using map and work with the date elements. But iterating over it gives me i32 values. How can I get dates instead?
The code below fails to compile.
use polars::export::chrono::{NaiveDate};
use polars::prelude::*;
fn main() {
let df = df! [
"date_series" => [NaiveDate::from_ymd(2020, 1, 1), NaiveDate::from_ymd(2020, 1, 2), NaiveDate::from_ymd(2020, 1, 3)]
].unwrap();
let a: DateChunked = df["date_series"].date().unwrap().into_iter().map(|d| {
match d {
Some(d) => Some(d),
None => None
}
}).collect();
}
The compiler reports:
}).collect();
| ^^^^^^^ value of type `Logical<DateType, Int32Type>` cannot be built from `std::iter::Iterator<Item=Option<i32>>`
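The i32 values produced by the iterator are days since 1970-01-01, so each one can be turned into a calendar date by hand. The sketch below uses the well-known days-to-civil-date arithmetic for the proleptic Gregorian calendar and only the standard library; civil_from_days is a hypothetical helper name, not a Polars or chrono function.

```rust
/// Converts days since 1970-01-01 into (year, month, day).
fn civil_from_days(days_since_epoch: i32) -> (i32, u32, u32) {
    // Shift the origin to 0000-03-01 so that leap days fall at the end of a year.
    let z = i64::from(days_since_epoch) + 719_468;
    let era = z.div_euclid(146_097); // 400-year cycles
    let doe = z.rem_euclid(146_097); // day of era, 0..=146096
    let yoe = (doe - doe / 1_460 + doe / 36_524 - doe / 146_096) / 365; // 0..=399
    let doy = doe - (365 * yoe + yoe / 4 - yoe / 100); // day of March-based year
    let mp = (5 * doy + 2) / 153; // March-based month, 0..=11
    let day = doy - (153 * mp + 2) / 5 + 1;
    let month = if mp < 10 { mp + 3 } else { mp - 9 };
    // January and February belong to the following civil year.
    let year = yoe + era * 400 + if month <= 2 { 1 } else { 0 };
    (year as i32, month as u32, day as u32)
}
```

With this, Some(d) in the closure could be mapped to civil_from_days(d). Collecting back into a DateChunked is a separate problem that this sketch does not address.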

Does Rust have an equivalent to Python's dictionary comprehension syntax?

How would one translate the following Python, in which several files are read and their contents are used as values to a dictionary (with filename as key), to Rust?
countries = {region: open("{}.txt".format(region)).read() for region in ["canada", "usa", "mexico"]}
My attempt is shown below, but I was wondering if a one-line, idiomatic solution is possible.
use std::{
fs::File,
io::{prelude::*, BufReader},
path::Path,
collections::HashMap,
};
macro_rules! map(
{ $($key:expr => $value:expr),+ } => {
{
let mut m = HashMap::new();
$(
m.insert($key, $value);
)+
m
}
};
);
fn lines_from_file<P>(filename: P) -> Vec<String>
where
P: AsRef<Path>,
{
let file = File::open(filename).expect("no such file");
let buf = BufReader::new(file);
buf.lines()
.map(|l| l.expect("Could not parse line"))
.collect()
}
fn main() {
let _countries = map!{ "canada" => lines_from_file("canada.txt"),
"usa" => lines_from_file("usa.txt"),
"mexico" => lines_from_file("mexico.txt") };
}
Rust's iterators have map/filter/collect methods which are enough to do anything Python's comprehensions can. You can create a HashMap with collect on an iterator of pairs, but collect can return various types of collections, so you may have to specify the type you want.
For example,
use std::collections::HashMap;
fn main() {
println!(
"{:?}",
(1..5).map(|i| (i + i, i * i)).collect::<HashMap<_, _>>()
);
}
is roughly equivalent to the Python
print({i+i: i*i for i in range(1, 5)})
But translated very literally, it's actually closer to
from builtins import dict
def main():
print("{!r}".format(dict(map(lambda i: (i+i, i*i), range(1, 5)))))
if __name__ == "__main__":
main()
not that you would ever say it that way in Python.
Python's comprehensions are just sugar for a for loop and accumulator. Rust has macros--you can make any sugar you want.
Take this simple Python example,
print({i+i: i*i for i in range(1, 5)})
You could easily re-write this as a loop and accumulator:
map = {}
for i in range(1, 5):
map[i+i] = i*i
print(map)
You could do it basically the same way in Rust.
use std::collections::HashMap;
fn main() {
let mut hm = HashMap::new();
for i in 1..5 {
hm.insert(i + i, i * i);
}
println!("{:?}", hm);
}
You can use a macro to do the rewriting to this form for you.
use std::collections::HashMap;
macro_rules! hashcomp {
($name:ident = $k:expr => $v:expr; for $i:ident in $itr:expr) => {
let mut $name = HashMap::new();
for $i in $itr {
$name.insert($k, $v);
}
};
}
When you use it, the resulting code is much more compact. And this choice of separator tokens makes it resemble the Python.
fn main() {
hashcomp!(hm = i+i => i*i; for i in 1..5);
println!("{:?}", hm);
}
This is just a basic example that can handle a single loop. Python's comprehensions also can have filters and additional loops, but a more advanced macro could probably do that too.
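As a rough sketch of that idea, here is a variant with an optional trailing filter clause. It differs from the macro above in that it evaluates to the map rather than declaring a variable, and the "; if" separator is an arbitrary choice. The wrapper functions all_squares and even_squares exist only to show usage.

```rust
use std::collections::HashMap;

macro_rules! hashcomp {
    // Plain form: key => value; for var in iterator
    ($k:expr => $v:expr; for $i:ident in $itr:expr) => {{
        let mut m = HashMap::new();
        for $i in $itr {
            m.insert($k, $v);
        }
        m
    }};
    // Filtered form: same as above with a trailing `; if condition`
    ($k:expr => $v:expr; for $i:ident in $itr:expr; if $cond:expr) => {{
        let mut m = HashMap::new();
        for $i in $itr {
            if $cond {
                m.insert($k, $v);
            }
        }
        m
    }};
}

/// Python: {i+i: i*i for i in range(1, 5)}
fn all_squares() -> HashMap<i32, i32> {
    hashcomp!(i + i => i * i; for i in 1..5)
}

/// Python: {i+i: i*i for i in range(1, 5) if i % 2 == 0}
fn even_squares() -> HashMap<i32, i32> {
    hashcomp!(i + i => i * i; for i in 1..5; if i % 2 == 0)
}
```

Nested loops would need further rules or a recursive macro, which this sketch does not attempt.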
Without using your own macros I think the closest to
countries = {region: open("{}.txt".format(region)).read() for region in ["canada", "usa", "mexico"]}
in Rust would be
let countries: HashMap<_, _> = ["canada", "usa", "mexico"].iter().map(|&c| {(c,read_to_string(c.to_owned() + ".txt").expect("Error reading file"),)}).collect();
but running a formatter makes it more readable:
let countries: HashMap<_, _> = ["canada", "usa", "mexico"]
.iter()
.map(|&c| {
(
c,
read_to_string(c.to_owned() + ".txt").expect("Error reading file"),
)
})
.collect();
A few notes:
To map over an array or vector, you first need to turn it into an iterator, thus iter().map(...).
To transform an iterator back into a concrete data structure, e.g. a HashMap (dict), use .collect(). This is both the advantage and the pain of Rust: it is very strict about types and performs no unexpected conversions.
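Because collect() is driven by the target type, it can also build a Result wrapping the map, which stops at the first error instead of panicking the way expect does. A minimal sketch that uses number parsing in place of file reading, so it runs without any files on disk; parse_counts is a hypothetical helper name.

```rust
use std::collections::HashMap;
use std::num::ParseIntError;

/// Builds a map from (name, text) pairs, failing on the first unparsable number.
fn parse_counts(pairs: &[(&str, &str)]) -> Result<HashMap<String, u64>, ParseIntError> {
    pairs
        .iter()
        // Each item is a Result<(key, value), error>.
        .map(|&(name, raw)| raw.trim().parse::<u64>().map(|n| (name.to_string(), n)))
        // The return type tells collect to gather Ok items into a HashMap,
        // or to return the first Err it encounters.
        .collect()
}
```

The same shape should work with read_to_string and std::io::Error if the file-reading version ought to return an error rather than panic.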
A complete test program:
use std::collections::HashMap;
use std::fs::{read_to_string, File};
use std::io::Write;
fn create_files() -> std::io::Result<()> {
let regios = [
("canada", "Ottawa"),
("usa", "Washington"),
("mexico", "Mexico city"),
];
for (country, capital) in regios {
let mut file = File::create(country.to_owned() + ".txt")?;
file.write_fmt(format_args!("The capital of {} is {}", country, capital))?;
}
Ok(())
}
fn create_hashmap() -> HashMap<&'static str, String> {
let countries = ["canada", "usa", "mexico"]
.iter()
.map(|&c| {
(
c,
read_to_string(c.to_owned() + ".txt").expect("Error reading file"),
)
})
.collect();
countries
}
fn main() -> std::io::Result<()> {
println!("Hello, world!");
create_files().expect("Failed to create files");
let countries = create_hashmap();
{
println!("{:#?}", countries);
}
std::io::Result::Ok(())
}
Note that specifying the type of countries is not needed here, because the return type of create_hashmap() is defined.
