Group by id and concatenate many arrays of arrays in ClickHouse / Spark SQL / PySpark - apache-spark

Given the table A below:
id units
1 [1,1,1]
2 [3,0,0]
1 [5,3,7]
3 [2,5,2]
2 [3,2,6]
I would like to query something like:
select id, array_append(units) from A group by id
And get the following result:
id units
1 [1,1,1,5,3,7]
2 [3,0,0,3,2,6]
3 [2,5,2]

For ClickHouse, use the groupArray function with the -Array combinator:
SELECT
    id,
    groupArrayArray(units)
FROM
(
    SELECT
        data.1 AS id,
        data.2 AS units
    FROM
    (
        SELECT arrayJoin([(1, [1, 1, 1]), (2, [3, 0, 0]), (1, [5, 3, 7]), (3, [2, 5, 2]), (2, [3, 2, 6])]) AS data
    )
)
GROUP BY id
/*
┌─id─┬─groupArrayArray(units)─┐
│ 1 │ [1,1,1,5,3,7] │
│ 2 │ [3,0,0,3,2,6] │
│ 3 │ [2,5,2] │
└────┴────────────────────────┘
*/
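A brief note on the combinator (not part of the original answer): in ClickHouse, the -Array suffix makes an aggregate function take array arguments and aggregate across their elements, so groupArrayArray(units) collects every element of every units array in the group, which is exactly the flattened concatenation shown above.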

For PySpark, the list flattening needs to happen after the grouping; you can combine flatten with collect_list:
from pyspark.sql import functions as func

df.groupby('id').agg(func.flatten(func.collect_list('units')).alias('units')).show(10, False)
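A minimal end-to-end sketch (assuming a SparkSession named spark and the sample data from the question); the same flatten(collect_list(units)) expression also works in Spark SQL:

from pyspark.sql import SparkSession
from pyspark.sql import functions as func

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [(1, [1, 1, 1]), (2, [3, 0, 0]), (1, [5, 3, 7]), (3, [2, 5, 2]), (2, [3, 2, 6])],
    ['id', 'units'],
)

# group, collect the arrays per id, then flatten the array of arrays
df.groupby('id').agg(func.flatten(func.collect_list('units')).alias('units')).show(10, False)

# the Spark SQL equivalent
df.createOrReplaceTempView('A')
spark.sql('SELECT id, flatten(collect_list(units)) AS units FROM A GROUP BY id').show()

Note that collect_list does not guarantee element order after a shuffle, so the concatenation order within each result array may vary.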

Related

Pairwise permutation of two lists in Python

I have a list with 10 numerical values. I want to return all possible combinations of this list such that each element can take the value +/- its element value.
The approach I had in mind was to take a binary variable which takes values from 0 to 1023. A 1 bit in this variable corresponds to negative d[i] and a 0 bit to positive d[i].
e.g. bin(8) = 0000001000 implies that d7 will take the value -d7 and the rest will be positive. Repeat this for all values from 0 to 1023 to get all combinations.
For example, if D = [d1,d2,...d10], we will have 1024 (2^10) combinations such that:
D1 = [-d1,d2,d3,....d10]
D2 = [-d1,-d2,d3,....d10]
D3 = [d1,-d2,d3,....d10] ...
D1024 = [-d1,-d2,-d3,....-d10]
Thank You!
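For reference, here is a minimal sketch of the bitmask approach described in the question (assuming, as the example implies, that a set bit negates the corresponding element):

d = list(range(1, 11))  # ten sample values standing in for d1..d10

combos = []
for mask in range(2 ** len(d)):
    # bit i of mask decides the sign of d[i]: set -> negated, clear -> positive
    combos.append([-x if (mask >> i) & 1 else x for i, x in enumerate(d)])

print(len(combos))  # 1024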
You can just use the built-in itertools.product to generate all combinations of positive and negative values:
from itertools import product

inputs = list(range(10))  # [0, 1, 2, ..., 9]
outputs = []
choices = [(x, -x) for x in inputs]  # each element can keep its sign or be negated
for item in product(*choices):
    outputs.append(item)
print(outputs[:3])
print(len(outputs))
# [(0, 1, 2, 3, 4, 5, 6, 7, 8, 9), (0, 1, 2, 3, 4, 5, 6, 7, 8, -9), (0, 1, 2, 3, 4, 5, 6, 7, -8, 9)]
# 1024
In a compressed form:
outputs = [item for item in product(*[(x,-x) for x in inputs])]
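Since the comprehension does no filtering or transformation, it can be reduced to a plain list call:

outputs = list(product(*[(x, -x) for x in inputs]))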

How to use with_column method to create a calculated column in Polars Rust?

I was trying to create a new computed column based on an existing column in a Polars Rust DataFrame. There is a PySpark-like with_column method available for that, but the API documentation has no example. Here is an example DataFrame:
use polars::prelude::*;

fn example() {
    let df = df!["foo" => ["A", "A", "B", "B", "C"],
        "val1" => [1, 2, 2, 4, 2],
        "val2" => [1, 2, 2, 4, 2],
    ]
    .unwrap();
    // new column: ration = val1 / val2
    // df.with_column(...)
    println!("{}", df);
}

fn main() {
    example()
}
I want to create a ration column that calculates the ratio between val1 and val2, but there is no example available in the API documentation. There is also another issue: the with_column method might need a col type to wrap the columns, like in PySpark, but polars::prelude::* does not bring a col type into scope. Maybe some features need to be enabled in the Cargo file.
I am using the latest version of Polars, 0.22.8.
Does anyone know how to do it?
Your initial idea works with the lazy API:
# Cargo.toml
# ...
[dependencies]
polars = { version = "0.22.8", features = ["lazy"] }

// src/main.rs
use polars::prelude::*;

fn example() -> DataFrame {
    let df = df!["foo" => ["A", "A", "B", "B", "C"],
        "val1" => [1, 2, 2, 4, 2],
        "val2" => [1, 2, 2, 4, 2],
    ]
    .unwrap();

    df.lazy()
        .with_column((col("val1") / col("val2")).alias("ration"))
        .collect()
        .unwrap()
}

fn main() {
    let df = example();
    println!("{:?}", df);
}
Output:
shape: (5, 4)
┌─────┬──────┬──────┬────────┐
│ foo ┆ val1 ┆ val2 ┆ ration │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ i32 ┆ i32 ┆ i32 │
╞═════╪══════╪══════╪════════╡
│ A ┆ 1 ┆ 1 ┆ 1 │
├╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ A ┆ 2 ┆ 2 ┆ 1 │
├╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ B ┆ 2 ┆ 2 ┆ 1 │
├╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ B ┆ 4 ┆ 4 ┆ 1 │
├╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ C ┆ 2 ┆ 2 ┆ 1 │
└─────┴──────┴──────┴────────┘
I have found a nasty way of doing it; I believe there must be a better way. After reading part of the API docs I learned that the argument of df.with_column() must be a type that implements the IntoSeries trait, and elsewhere the docs say that Series implements that trait. So I first created a new Series and then passed it to the function, which adds the new column:
use polars::prelude::*;

fn example() -> DataFrame {
    let mut df = df!["foo" => ["A", "A", "B", "B", "C"],
        "val1" => [1, 2, 2, 4, 2],
        "val2" => [1, 2, 2, 4, 2],
    ].unwrap();

    // build the new column as a Series, then append it to the DataFrame
    let ration = Series::new(
        "ration",
        df.column("val1").unwrap() / df.column("val2").unwrap(),
    );
    let _ = df.with_column(ration).unwrap();
    df
}

fn main() {
    let df = example();
    println!("{}", df);
}
Result:
shape: (5, 4)
┌─────┬──────┬──────┬────────┐
│ foo ┆ val1 ┆ val2 ┆ ration │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ i32 ┆ i32 ┆ i32 │
╞═════╪══════╪══════╪════════╡
│ A ┆ 1 ┆ 1 ┆ 1 │
├╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ A ┆ 2 ┆ 2 ┆ 1 │
├╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ B ┆ 2 ┆ 2 ┆ 1 │
├╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ B ┆ 4 ┆ 4 ┆ 1 │
├╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ C ┆ 2 ┆ 2 ┆ 1 │
└─────┴──────┴──────┴────────┘
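A side note on the outputs above (not part of the original answers): the sample data has val1 equal to val2 in every row, so each ration value is exactly 1; and since the result dtype shown is i32, non-integer ratios would be truncated, so casting one of the columns to a float type first would be needed for fractional ratios.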

How to check if positional indexes in a list exist in a section of the corresponding DataFrame

I have a DataFrame like this:
Date A B C
2021-08-20 1 2 3
2021-08-21 2 3 4
2021-08-22 3 4 5
2021-08-23 4 5 6
2021-08-24 7 8 9
2021-08-25 10 11 12
2021-08-26 11 12 13
2021-08-28 12 13 14
My "target" section is dates from 2021-08-21 to 2021-08-24.
Now I have a list of positional indices:
A = [0, 1, 3, 4, 6, 7]
What I'm trying to do is create a new list of indices that correspond to the indices only in my target section, and then find the total number of elements in the new list.
Target answer:
new_list = [1, 3, 4]
print(len(new_list))
3
I've tried this so far:
new_list = []
df_range = df.loc['2021-08-21':'2021-08-24']
for data_idx in A:
    if data_idx == df_range.iloc[data_idx]:
        new_list.append(data_idx)
print(len(new_list))
But I get IndexErrors (single positional indexer is out-of-bounds) or KeyErrors (for a similar attempt). I believe the errors occur when the program tries to locate indexes outside of this range?
Thank you in advance and sorry if anything is confusing. I know there should be an easy way to do this but I just can't figure it out.
IIUC:
A = [0, 1, 3, 4, 6, 7]
df["tmp"] = range(len(df))
x = df.loc["2021-08-21":"2021-08-24"]
print(x.loc[x["tmp"].isin(A), "tmp"].to_list())
Prints:
[1, 3, 4]
If 'Date' is in the index of the dataframe and its dtype is a DatetimeIndex, then we can use pd.Index.get_indexer and set operations to find the intersection.
# Copy dataframe from question above
df = pd.read_clipboard(index_col=[0])
df.index = pd.to_datetime(df.index)
idx = df.index.get_indexer(pd.date_range('2021-08-21', '2021-08-24', freq='D'))
A = [0, 1, 3, 4, 6, 7]
overlap = set(A) & set(idx)
print(f'{overlap=} and {len(overlap)=}')
Output:
overlap={1, 3, 4} and len(overlap)=3
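One caveat worth noting (not part of the original answer): get_indexer returns -1 for dates that are missing from the index, so if the target range may contain absent dates, filter those sentinels out first:

idx = idx[idx >= 0]  # get_indexer marks missing dates with -1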
If I understood the question correctly, you want a list with the indexes corresponding to your df_range? If so, these two approaches are commonly used for that:
new_list = []
df_range = df.loc['2021-08-21':'2021-08-24']

for i, v in enumerate(df_range.index):  # enumerate rows, not columns
    new_list.append(i)

for i in range(len(df_range)):
    new_list.append(i)

How to let pandas groupby add a count column for each group after applying list aggregations?

I have a pandas DataFrame:
df = pd.DataFrame({"col_1": ["apple", "banana", "apple", "banana", "banana"],
                   "col_2": [1, 4, 8, 8, 6],
                   "col_3": [56, 4, 22, 1, 5]})
on which I apply a groupby operation that aggregates multiple columns into a list, using:
df = df.groupby(['col_1'])[["col_2", "col_3"]].agg(list)
Now I want to additionally add a column that for each resulting group adds the number of elements in that group. The result should look like this:
{"col_1": ["apple", "banana"],
"col_2": [[1, 8], [4, 8, 6]],
"col_3": [[56, 22], [4, 1, 5]]
"count": [2, 3]}
I tried the following from reading other Stack Overflow posts:
# option 1
df = df.groupby(['col_1'])[["col_2", "col_3"]].agg(list).size()
# option 2
df = df.groupby(['col_1'])[["col_2", "col_3"]].agg(list, "count")
# option 3
df = df.groupby(['col_1'])[["col_2", "col_3"]].agg(list).agg("count")
But all gave either incorrect results (option 3) or an error (options 1 and 2).
How to solve this?
We can try named aggregation:
d = {c:(c, list) for c in ('col_2', 'col_3')}
df.groupby('col_1').agg(**{**d, 'count': ('col_2', 'size')})
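For readability, here is the same named aggregation written out explicitly (a sketch using the column names from the question):

df.groupby('col_1').agg(
    col_2=('col_2', list),
    col_3=('col_3', list),
    count=('col_2', 'size'),
)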
Or we can separately calculate the size of each group, then join it with the dataframe that contains the columns aggregated as lists
g = df.groupby('col_1')
g[['col_2', 'col_3']].agg(list).join(g.size().rename('count'))
col_2 col_3 count
col_1
apple [1, 8] [56, 22] 2
banana [4, 8, 6] [4, 1, 5] 3
Just adding another approach to solve the problem:
g = df.groupby('col_1')
g.agg({'col_2': lambda s: list(s),
       'col_3': lambda s: list(s)}).reset_index().join(
    g['col_2'].transform('count').rename('count'))
Output
col_1 col_2 col_3 count
0 apple [1, 8] [56, 22] 2
1 banana [4, 8, 6] [4, 1, 5] 3

Easiest way to get Pandas rolling window of values

I have a dataset. I want a window of 5 values. Does pandas have a native function that will give me a rolling window of 5 values until there are no longer 5 values that it can use? I want these to be rows.
I also want the new label to be the middle of the 5 values.
Input DataFrame
first label
0 1 0
1 2 1
2 3 2
3 4 3
4 5 4
5 6 5
Output DataFrame desired:
first label
0 [1, 2, 3, 4, 5] 2
1 [2, 3, 4, 5, 6] 3
I have tried using the .rolling function and haven't been successful.
You can use strides; for the label, take the position of the middle value of each window and select it with NumPy indexing:
import numpy as np
import pandas as pd

def rolling_window(a, window):
    shape = a.shape[:-1] + (a.shape[-1] - window + 1, window)
    strides = a.strides + (a.strides[-1],)
    return np.lib.stride_tricks.as_strided(a, shape=shape, strides=strides)

a = rolling_window(df['first'].to_numpy(), 5)
print(a)
[[1 2 3 4 5]
 [2 3 4 5 6]]

# get positions of the middle value
i = rolling_window(np.arange(len(df)), 5)[:, 2]
print(i)
[2 3]

df = pd.DataFrame({'first': a.tolist(),
                   'label': df['label'].to_numpy()[i]})
print(df)
             first  label
0  [1, 2, 3, 4, 5]      2
1  [2, 3, 4, 5, 6]      3
You can optimize the code further by computing the strided positions only once:
def rolling_window(a, window):
    shape = a.shape[:-1] + (a.shape[-1] - window + 1, window)
    strides = a.strides + (a.strides[-1],)
    return np.lib.stride_tricks.as_strided(a, shape=shape, strides=strides)

# get positions
idx = rolling_window(np.arange(len(df)), 5)
print(idx)
[[0 1 2 3 4]
 [1 2 3 4 5]]

df = pd.DataFrame({'first': df['first'].to_numpy()[idx].tolist(),
                   'label': df['label'].to_numpy()[idx][:, 2]})
print(df)
             first  label
0  [1, 2, 3, 4, 5]      2
1  [2, 3, 4, 5, 6]      3
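On NumPy 1.20+, np.lib.stride_tricks.sliding_window_view is a safer drop-in for the hand-rolled strides helper above; a minimal sketch assuming the same input DataFrame:

import numpy as np
import pandas as pd

# one row per window of 5 consecutive values
windows = np.lib.stride_tricks.sliding_window_view(df['first'].to_numpy(), 5)
out = pd.DataFrame({'first': windows.tolist(),
                    'label': df['label'].to_numpy()[2:len(df) - 2]})  # middle labels
print(out)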
An alternative, more of a hack; I don't think pandas has a native function for what you want.
Convert the dataframe to NumPy, transpose it, and pull out the labels and arrays using a list comprehension:
M = df.to_numpy().T
outcome = [(M[0, i:5 + i],       # the window of 5 values
            M[1][(5 + i) // 2])  # the label at the middle of the window
           for i in range(0, M.shape[1])
           if 5 + i <= M.shape[1]]
print(outcome)
[(array([1, 2, 3, 4, 5]), 2), (array([2, 3, 4, 5, 6]), 3)]

pd.DataFrame(outcome, columns=['first', 'label'])
             first  label
0  [1, 2, 3, 4, 5]      2
1  [2, 3, 4, 5, 6]      3
