Rust Polars: How to get the row count of a DataFrame?

I want to filter a Polars DataFrame and then get the number of rows.
What I'm doing now seems to work but feels so wrong:
let item_count = item_df
    .lazy()
    .filter(not(col("status").is_in(lit(filter))))
    .collect()?
    .shape().0;
In a subsequent DataFrame operation I need to use this in a division:
.with_column(
    col("count")
        .div(lit(item_count as f64))
        .mul(lit(100.0))
        .alias("percentage"),
);
This is for a tiny dataset (tens of rows), so I'm not worried about performance, but I'd like to learn what the best way would be.

While there doesn't seem to be a predefined method on LazyFrame, you can use polars expressions:
use polars::prelude::*;
let df = df!["a" => [1, 2], "b" => [3, 4]].unwrap();
dbg!(df.lazy().select([count()]).collect().unwrap());
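If you do collect the filtered frame anyway, DataFrame::height() also gives the row count directly (it is the same value as .shape().0). Below is a minimal sketch applying that to the question's filter; the helper name filtered_row_count is illustrative, and filter is assumed to be a Series of status values to exclude, as in the question:

use polars::prelude::*;

// Hedged sketch: collect the filtered frame, then read its height.
// `filter` is assumed to be a Series of status values to exclude.
fn filtered_row_count(item_df: DataFrame, filter: Series) -> Result<usize, PolarsError> {
    let filtered = item_df
        .lazy()
        .filter(not(col("status").is_in(lit(filter))))
        .collect()?;
    Ok(filtered.height())
}

The select([count()]) approach above stays fully lazy, which matters more once the data is no longer tiny.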

Related

Polars Rust categorical type for string filtering

I have several questions regarding performance and categoricals:
https://docs.rs/polars/latest/polars/docs/performance/index.html
The docs suggest benefits for filtering with strings, but they don't provide any filtering examples. They use a try_apply() method to convert a column to categoricals, but how do you convert back to utf8? And what kind of filtering is possible? Is contains() supported? (I usually use str().contains("search_str").)
I tried a simple example but it resulted in an error:
let mut df = df![
    "utf8_column" => ["foo", "bar", "ham"]
]?;
let df2 = df.try_apply("utf8_column", |s| s.categorical().cloned())?;
println!("{}", &df2);
Error: Data types don't match: Series of dtype: Utf8 != Categorical
So, how does it work and how useful is it for string filtering?

How to create a column with the lengths of strings from a different column in Polars Rust?

I'm trying to replicate one of the Polars Python examples in Rust but seem to have hit a wall. In the Python docs there is an example which creates a new column with the lengths of the strings from another column. So for example, column B will contain the lengths of all the strings in column A.
The example code looks like this:
import polars as pl
df = pl.DataFrame({"shakespeare": "All that glitters is not gold".split(" ")})
df = df.with_column(pl.col("shakespeare").str.lengths().alias("letter_count"))
As you can see, it uses the str namespace to access the lengths() function, but trying the same in the Rust version does not work:
use polars::prelude::*;

// This will throw the following error:
// no method named `lengths` found for struct `StringNameSpace` in the current scope
fn print_length_strings_in_column() -> () {
    let df = generate_df().expect("error");
    let new_df = df
        .lazy()
        .with_column(col("vendor_id").str().lengths().alias("vendor_id_length"))
        .collect();
}
Cargo.toml:
[dependencies]
polars = {version = "0.22.8", features = ["strings", "lazy"]}
I checked the docs and it seems like the Rust version of Polars does not implement the lengths() function. There is the str_lengths function in the Utf8NameSpace but it's not entirely clear to me how to use this.
I feel like I'm missing something very simple here, but I don't see it. How would I go about tackling this issue?
Thanks!
You have to use the apply function and cast the Series to a Utf8 chunked array, which has a str_lengths() method:
https://docs.rs/polars/0.22.8/polars/chunked_array/struct.ChunkedArray.html
use polars::prelude::*;

let s = Series::new("vendor_id", &["Ant", "no", "how", "Ant", "mans"]);
let df = DataFrame::new(vec![s]).unwrap();
let res = df
    .lazy()
    .with_column(
        col("vendor_id")
            .apply(
                // Cast the Series to a Utf8 chunked array and take the per-string lengths.
                |srs| Ok(srs.utf8()?.str_lengths().into_series()),
                // str_lengths() produces a UInt32 column.
                GetOutput::from_type(DataType::UInt32),
            )
            .alias("vendor_id_length"),
    )
    .collect();
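For completeness, here is a hedged eager sketch of the same idea, using the Utf8 chunked array from the linked ChunkedArray docs directly (the variable names are illustrative):

use polars::prelude::*;

// Get the column as a Utf8 chunked array and compute the per-string lengths eagerly.
let s = Series::new("vendor_id", &["Ant", "no", "how", "Ant", "mans"]);
let vendor_id_length = s.utf8().unwrap().str_lengths().into_series();
// vendor_id_length now holds [3, 2, 3, 3, 4].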

Why do I need to define variables in this example while sorting multiple lists at once in Python 3?

Simple problem.
I've got these lists:
alpha = [5,10,1,2]
beta = [1,5,2]
gamma = [5,2,87,100,1]
I thought I could sort them all with:
map(lambda x: list.sort(x),[alpha,beta,gamma])
Which doesn't work.
What is working though is:
a,b,c = map(lambda x: list.sort(x),[alpha,beta,gamma])
Can someone explain why I need to define a, b, and c for this code to work?
Because map() is lazy (since Python 3). It returns a generator-like object that is only evaluated when its contents are asked for, for instance because you want to assign its individual elements to variables.
E.g., this also forces it to evaluate:
>>> list(map(lambda x: list.sort(x),[alpha,beta,gamma]))
But using map is a bit archaic; list comprehensions and generator expressions exist and are almost always more idiomatic. Also, list.sort(x) is a bit of an odd way to write x.sort() that may or may not work; avoid it.
[x.sort() for x in [alpha, beta, gamma]]
works as you expected.
But if you aren't interested in the result, building a list isn't relevant. What you really want is a simple for loop:
for x in [alpha, beta, gamma]:
    x.sort()
Which is perfectly Pythonic, except that I maybe like this one even better in this fixed, simple case:
alpha.sort()
beta.sort()
gamma.sort()
can't get more explicit than that.
It's because list.sort() returns None; it sorts the list in place.
If you want to sort the lists and get sorted copies back, do this:
a, b, c = (sorted(l) for l in [alpha, beta, gamma])
In [10]: alpha.sort()
In [11]: sorted(alpha)
Out[11]: [1, 2, 5, 10]

How do I split an RDD into two or more RDDs?

I'm looking for a way to split an RDD into two or more RDDs. The closest I've seen is Scala Spark: Split collection into several RDD? which is still a single RDD.
If you're familiar with SAS, something like this:
data work.split1, work.split2;
    set work.preSplit;
    if (condition1)
        output work.split1
    else if (condition2)
        output work.split2
run;
which resulted in two distinct data sets. It would have to be immediately persisted to get the results I intend...
It is not possible to yield multiple RDDs from a single transformation*. If you want to split an RDD you have to apply a filter for each split condition. For example:
def even(x): return x % 2 == 0
def odd(x): return not even(x)
rdd = sc.parallelize(range(20))
rdd_odd, rdd_even = (rdd.filter(f) for f in (odd, even))
If you have only a binary condition and computation is expensive you may prefer something like this:
kv_rdd = rdd.map(lambda x: (x, odd(x)))
kv_rdd.cache()
rdd_odd = kv_rdd.filter(lambda kv: kv[1]).keys()
rdd_even = kv_rdd.filter(lambda kv: not kv[1]).keys()
It means only a single predicate computation per element, but it requires an additional pass over all the data.
It is important to note that as long as the input RDD is properly cached and there are no additional assumptions regarding data distribution, there is no significant difference in time complexity between a repeated filter and a for-loop with nested if-else.
With N elements and M conditions, the number of operations you have to perform is clearly proportional to N times M. In the case of the for-loop it should be closer to (N + MN) / 2, and the repeated filter is exactly NM, but at the end of the day it is nothing other than O(NM). You can see my discussion** with Jason Lenderman to read about some pros and cons.
At a very high level you should consider two things:
Spark transformations are lazy; until you execute an action, your RDD is not materialized.
Why does it matter? Going back to my example:
rdd_odd, rdd_even = (rdd.filter(f) for f in (odd, even))
If later I decide that I need only rdd_odd then there is no reason to materialize rdd_even.
If you take a look at your SAS example, to compute work.split2 you have to materialize both the input data and work.split1.
RDDs provide a declarative API. When you use filter or map it is completely up to the Spark engine how the operation is performed. As long as the functions passed to transformations are side-effect free, this creates multiple possibilities to optimize the whole pipeline.
At the end of the day this case is not special enough to justify its own transformation.
This map-with-filter pattern is actually used in core Spark. See my answer to How does Spark's RDD.randomSplit actually split the RDD and the relevant part of the randomSplit method.
If the only goal is to achieve a split on the input, it is possible to use the partitionBy clause of DataFrameWriter with the text output format:
def makePairs(row: T): (String, String) = ???

data
  .map(makePairs).toDF("key", "value")
  .write.partitionBy("key").format("text").save(...)
* There are only 3 basic types of transformations in Spark:
RDD[T] => RDD[T]
RDD[T] => RDD[U]
(RDD[T], RDD[U]) => RDD[W]
where T, U, W can be either atomic types or products / tuples (K, V). Any other operation has to be expressed using some combination of the above. You can check the original RDD paper for more details.
** https://chat.stackoverflow.com/rooms/91928/discussion-between-zero323-and-jason-lenderman
*** See also Scala Spark: Split collection into several RDD?
As other posters mentioned above, there is no single, native RDD transform that splits RDDs, but here are some "multiplex" operations that can efficiently emulate a wide variety of "splitting" on RDDs, without reading multiple times:
http://silex.freevariable.com/latest/api/#com.redhat.et.silex.rdd.multiplex.MuxRDDFunctions
Some methods specific to random splitting:
http://silex.freevariable.com/latest/api/#com.redhat.et.silex.sample.split.SplitSampleRDDFunctions
The methods are available from the open source silex project:
https://github.com/willb/silex
A blog post explaining how they work:
http://erikerlandson.github.io/blog/2016/02/08/efficient-multiplexing-for-spark-rdds/
def muxPartitions[U: ClassTag](n: Int, f: (Int, Iterator[T]) => Seq[U],
    persist: StorageLevel): Seq[RDD[U]] = {
  val mux = self.mapPartitionsWithIndex { case (id, itr) =>
    Iterator.single(f(id, itr))
  }.persist(persist)
  Vector.tabulate(n) { j => mux.mapPartitions { itr => Iterator.single(itr.next()(j)) } }
}

def flatMuxPartitions[U: ClassTag](n: Int, f: (Int, Iterator[T]) => Seq[TraversableOnce[U]],
    persist: StorageLevel): Seq[RDD[U]] = {
  val mux = self.mapPartitionsWithIndex { case (id, itr) =>
    Iterator.single(f(id, itr))
  }.persist(persist)
  Vector.tabulate(n) { j => mux.mapPartitions { itr => itr.next()(j).toIterator } }
}
As mentioned elsewhere, these methods do involve a trade-off of memory for speed, because they operate by computing entire partition results "eagerly" instead of "lazily." Therefore, it is possible for these methods to run into memory problems on large partitions, where more traditional lazy transforms will not.
One way is to use a custom partitioner to partition the data depending upon your filter condition. This can be achieved by extending Partitioner and implementing something similar to the RangePartitioner.
mapPartitions can then be used to construct multiple RDDs from the partitioned RDD without reading all the data.
val filtered = partitioned.mapPartitions { iter =>
  new Iterator[Int]() {
    override def hasNext: Boolean = {
      // Only yield elements from the partitions we want to keep;
      // the remaining partitions come back empty.
      if (rangeOfPartitionsToKeep.contains(TaskContext.get().partitionId)) {
        iter.hasNext
      } else {
        false
      }
    }
    override def next(): Int = iter.next()
  }
}
Just be aware that the number of partitions in the filtered RDDs will be the same as the number in the partitioned RDD, so a coalesce should be used to reduce this down and remove the empty partitions.
If you split an RDD using the randomSplit API call, you get back an array of RDDs.
If you want 5 RDDs returned, pass in 5 weight values.
e.g.
val sourceRDD = sc.parallelize(1 to 100, 4)
val seedValue = 5
val splitRDD = sourceRDD.randomSplit(Array(1.0,1.0,1.0,1.0,1.0), seedValue)
splitRDD(1).collect()
res7: Array[Int] = Array(1, 6, 11, 12, 20, 29, 40, 62, 64, 75, 77, 83, 94, 96, 100)

How can I co-sort two Vecs based on the values in one of the Vecs?

I have two Vecs that correspond to a list of feature vectors and their corresponding class labels, and I'd like to co-sort them by the class labels.
However, Rust's sort_by operates on a slice rather than being a generic function over a trait (or similar), and the closure only gets the elements to be compared rather than their indices, so I can't sneakily hack the sort to reorder both vectors in parallel.
I've considered the solution:
let mut both: Vec<_> = data.iter().zip(labels.iter()).collect();
both.sort_by( blah blah );
// Now split them back into two vectors
I'd prefer not to allocate a whole new vector to do this every time because the size of the data can be extremely large.
I can always implement my own sort, of course, but if there's a builtin way to do this it would be much better.
I just wrote a crate "permutation" that allows you to do this :)
let names = vec!["Bob", "Steve", "Jane"];
let salary = vec![10, 5, 15];
let permutation = permutation::sort(&salary);
let ordered_names = permutation.apply_slice(&names);
let ordered_salaries = permutation.apply_slice(&salary);
assert!(ordered_names == vec!["Steve", "Bob", "Jane"]);
assert!(ordered_salaries == vec![5, 10, 15]);
It likely will support this in a single function call in the future.
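If you'd rather not pull in a dependency, the same idea can be sketched with the standard library alone: sort a vector of indices by the labels, then use it to reorder both vectors. The co_sort helper below is illustrative, not part of the crate, and it still allocates new Vecs for the reordered output:

// Sort a list of indices by the labels, then apply that permutation to both vectors.
fn co_sort<T: Clone, L: Ord + Clone>(data: &[T], labels: &[L]) -> (Vec<T>, Vec<L>) {
    let mut idx: Vec<usize> = (0..labels.len()).collect();
    // Compare the labels the indices point at.
    idx.sort_by(|&a, &b| labels[a].cmp(&labels[b]));
    let sorted_data = idx.iter().map(|&i| data[i].clone()).collect();
    let sorted_labels = idx.iter().map(|&i| labels[i].clone()).collect();
    (sorted_data, sorted_labels)
}

fn main() {
    let names = vec!["Bob", "Steve", "Jane"];
    let salary = vec![10, 5, 15];
    let (ordered_names, ordered_salaries) = co_sort(&names, &salary);
    assert_eq!(ordered_names, vec!["Steve", "Bob", "Jane"]);
    assert_eq!(ordered_salaries, vec![5, 10, 15]);
}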
