Can anyone provide an example of polars as_struct().apply() - Rust

I am trying to add an array column to an existing DataFrame. The input is something like this:
| 1 | 3 |
| 2 | 4 |
and the output should be:
| 1 | 3 | [?, ?, ?, ?] |
| 2 | 4 | [?, ?, ?, ?] |
The values of the array will be populated by a custom function.
I tried to implement something like this:
let df = df![
    "a" => [1, 2],
    "b" => [3, 4]
]?;
let lf: LazyFrame = df.lazy().select([
    as_struct(&[col("a"), col("b")]).apply(
        somefn,
        GetOutput::from_type(DataType::List(Box::new(DataType::Float32))),
    ),
]);
I don't know how to implement this somefn.
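For reference, here is a minimal sketch of what somefn could look like: it receives the struct column as a single Series, unpacks the fields, and builds one Float32 list per row. It assumes the columns are Int32 (the default for integer literals in df!) and uses [a, b, a+b, a*b] purely as a placeholder for the real custom computation; depending on the polars version, apply may expect the closure to return PolarsResult<Option<Series>> instead of PolarsResult<Series>, in which case wrap the result in Some(...).

use polars::prelude::*;

// Sketch: turn the struct column produced by as_struct(&[col("a"), col("b")])
// into a List(Float32) column with one small Series per row.
fn somefn(s: Series) -> PolarsResult<Series> {
    // the incoming Series is the struct built by as_struct
    let ca = s.struct_()?;
    let s_a = ca.field_by_name("a")?;
    let s_b = ca.field_by_name("b")?;
    let a = s_a.i32()?;
    let b = s_b.i32()?;

    let out: ListChunked = a
        .into_iter()
        .zip(b.into_iter())
        .map(|(a, b)| match (a, b) {
            (Some(a), Some(b)) => {
                // placeholder for the custom per-row computation
                let row: Vec<f32> = vec![a as f32, b as f32, (a + b) as f32, (a * b) as f32];
                Some(Series::new("", row))
            }
            _ => None, // propagate nulls
        })
        .collect();

    Ok(out.into_series())
}

Note that select returns only the expressions listed, so the result of the snippet above would contain just the new list column; to also keep a and b as in the desired output, either add col("a") and col("b") to the select or use with_column instead, and then collect() the LazyFrame.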

Related

Splitting tuples in a column of a dataframe [duplicate]

I have a dataframe df containing tuples in column A, like below:
+----+--------------+
| ID | A            |
+----+--------------+
| 0  | (1, [a])     |
| 1  | (2, [a,b])   |
| 2  | (3, [c,a,b]) |
+----+--------------+
I want to split the tuples in the above df so that the new dataframe df looks like below:
+----+---+---------+
| ID | A | B       |
+----+---+---------+
| 0  | 1 | [a]     |
| 1  | 2 | [a,b]   |
| 2  | 3 | [c,a,b] |
+----+---+---------+
So, how can I split the tuples in the above dataframe df?
Use the str accessor; make sure to create B first to avoid losing the data in A:
df['B'] = df['A'].str[1]
df['A'] = df['A'].str[0]
Alternatively:
df[['A', 'B']] = pd.DataFrame(df['A'].to_list(), columns=['A', 'B'])
Output:
ID A B
0 0 1 [a]
1 1 2 [a, b]
2 2 3 [c, a, b]

Spark: Join two dataframes on an array type column

I have a simple use case: I have two dataframes, df1 and df2, and I am looking for an efficient way to join them.
df1: Contains my main dataframe (billions of records)
+--------+------------+-------------+
| doc_id | doc_name   | doc_type_id |
+--------+------------+-------------+
| 1      | doc_name_1 | [1,4]       |
| 2      | doc_name_2 | [3,2,6]     |
+--------+------------+-------------+
df2: Contains the labels of doc types (40,000 records); as it's small, I am broadcasting it.
+-------------+---------------+
| doc_type_id | doc_type_name |
+-------------+---------------+
| 1           | doc_type_1    |
| 2           | doc_type_2    |
| 3           | doc_type_3    |
| 4           | doc_type_4    |
| 5           | doc_type_5    |
| 6           | doc_type_6    |
+-------------+---------------+
I would like to join these two dataframes to produce something like this:
+--------+------------+-------------+-------------------------------------------+
| doc_id | doc_name   | doc_type_id | doc_type_name                             |
+--------+------------+-------------+-------------------------------------------+
| 1      | doc_name_1 | [1,4]       | ["doc_type_1","doc_type_4"]               |
| 2      | doc_name_2 | [3,2,6]     | ["doc_type_3","doc_type_2","doc_type_6"]  |
+--------+------------+-------------+-------------------------------------------+
Thanks
We can use array_contains + groupBy + collect_list functions for this case.
Example:
val df1=Seq(("1","doc_name_1",Seq(1,4)),("2","doc_name_2",Seq(3,2,6))).toDF("doc_id","doc_name","doc_type_id")
val df2=Seq(("1","doc_type_1"),("2","doc_type_2"),("3","doc_type_3"),("4","doc_type_4"),("5","doc_type_5"),("6","doc_type_6")).toDF("doc_type_id","doc_type_name")
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
df1.createOrReplaceTempView("tbl")
df2.createOrReplaceTempView("tbl2")
spark.sql("select a.doc_id,a.doc_name,a.doc_type_id,collect_list(b.doc_type_name) doc_type_name from tbl a join tbl2 b on array_contains(a.doc_type_id,int(b.doc_type_id)) = TRUE group by a.doc_id,a.doc_name,a.doc_type_id").show(false)
//+------+----------+-----------+------------------------------------+
//|doc_id|doc_name |doc_type_id|doc_type_name |
//+------+----------+-----------+------------------------------------+
//|2 |doc_name_2|[3, 2, 6] |[doc_type_2, doc_type_3, doc_type_6]|
//|1 |doc_name_1|[1, 4] |[doc_type_1, doc_type_4] |
//+------+----------+-----------+------------------------------------+
Another way to achieve this is by using explode + join + collect_list:
val df3 = df1.withColumn("arr", explode(col("doc_type_id")))
df3.join(df2, df2.col("doc_type_id") === df3.col("arr"), "inner").
  groupBy(df3.col("doc_id"), df3.col("doc_type_id"), df3.col("doc_name")).
  agg(collect_list(df2.col("doc_type_name")).alias("doc_type_name")).
  show(false)
//+------+-----------+----------+------------------------------------+
//|doc_id|doc_type_id|doc_name |doc_type_name |
//+------+-----------+----------+------------------------------------+
//|1 |[1, 4] |doc_name_1|[doc_type_1, doc_type_4] |
//|2 |[3, 2, 6] |doc_name_2|[doc_type_2, doc_type_3, doc_type_6]|
//+------+-----------+----------+------------------------------------+

Using groupby in pandas to filter a dataframe using count and column value

I am trying to clean my dataframe using the groupby function. I have ID and event_type as my columns. I want a new dataframe where, if an ID occurs in only one row, that row's event_type must be "a"; otherwise that row should be deleted.
The data looks like this (event_type can be "a" or "b"):
+-----+------------+
| ID  | event_type |
+-----+------------+
| xyz | a          |
| pqr | b          |
| xyz | b          |
| rst | a          |
+-----+------------+
Output:
Since the ID "pqr" occurs only once (its count is 1) and its event_type is not "a", the dataframe should become the following:
+-----+------------+
| ID  | event_type |
+-----+------------+
| xyz | a          |
| xyz | b          |
| rst | a          |
+-----+------------+
You can use your logic within a groupby:
import pandas as pd

df = pd.DataFrame({"ID": ['xyz', 'pqr', 'xyz', 'rst'],
                   "event_type": ['a', 'b', 'b', 'a']})
What you are asking for is this:
df.groupby("ID")\
  .apply(lambda x: not (len(x) == 1 and
                        "a" not in x["event_type"].values))
as you can check by printing it. Finally, to apply this filter you just run:
df = df.groupby("ID")\
       .filter(lambda x: not (len(x) == 1 and
                              "a" not in x["event_type"].values))\
       .reset_index(drop=True)

PySpark: How to fillna values in dataframe for specific columns?

I have the following sample DataFrame:
a    | b    | c    |
1    | 2    | 4    |
0    | null | null |
null | 3    | 4    |
And I want to replace null values only in the first two columns, "a" and "b":
a | b | c    |
1 | 2 | 4    |
0 | 0 | null |
0 | 3 | 4    |
Here is the code to create sample dataframe:
rdd = sc.parallelize([(1,2,4), (0,None,None), (None,3,4)])
df2 = sqlContext.createDataFrame(rdd, ["a", "b", "c"])
I know how to replace all null values using:
df2 = df2.fillna(0)
And when I try this, I lose the third column:
df2 = df2.select(df2.columns[0:1]).fillna(0)
df.fillna(0, subset=['a', 'b'])
There is a parameter named subset to choose the columns, unless your Spark version is lower than 1.3.1.
Use a dictionary to fill values of certain columns:
df.fillna( { 'a':0, 'b':0 } )

How to explode columns?

After:
val df = Seq((1, Vector(2, 3, 4)), (1, Vector(2, 3, 4))).toDF("Col1", "Col2")
I have this DataFrame in Apache Spark:
+------+-----------+
| Col1 | Col2      |
+------+-----------+
| 1    | [2, 3, 4] |
| 1    | [2, 3, 4] |
+------+-----------+
How do I convert this into:
+------+------+------+------+
| Col1 | Col2 | Col3 | Col4 |
+------+------+------+------+
| 1    | 2    | 3    | 4    |
| 1    | 2    | 3    | 4    |
+------+------+------+------+
A solution that doesn't convert to and from RDD:
df.select($"Col1", $"Col2"(0) as "Col2", $"Col2"(1) as "Col3", $"Col2"(2) as "Col4")
Or arguably nicer:
val nElements = 3
df.select(($"Col1" +: Range(0, nElements).map(idx => $"Col2"(idx) as "Col" + (idx + 2)):_*))
The size of a Spark array column is not fixed; you could, for instance, have:
+----+------------+
|Col1|        Col2|
+----+------------+
|   1|   [2, 3, 4]|
|   1|[2, 3, 4, 5]|
+----+------------+
So there is no general way to get the number of columns and create them. If you know the size is always the same, you can set nElements like this:
val nElements = df.select("Col2").first.getList(0).size
Just to give the PySpark version of sgvd's answer: if the array column is in Col2, then this select statement will move the first nElements of each array in Col2 to their own columns:
from pyspark.sql import functions as F
df.select([F.col('Col2').getItem(i) for i in range(nElements)])
Just to add on to sgvd's solution: if the size is not always the same, you can set nElements like this:
val nElements = df.select(size('Col2).as("Col2_count"))
  .select(max("Col2_count"))
  .first.getInt(0)
You can use a map:
df.map {
  case Row(col1: Int, col2: mutable.WrappedArray[Int]) => (col1, col2(0), col2(1), col2(2))
}.toDF("Col1", "Col2", "Col3", "Col4").show()
If you are working with SparkR, you can find my answer here, where you don't need to use explode but you do need SparkR::dapply and stringr::str_split_fixed.
