Complex conditional aggregation in Pandas - python-3.x

In this table, I want to find the average number of days between actions for each user.
What I mean is: I want to group by user_id, subtract each date from the date directly before it (in days) within each user, and then find the average of these gaps for each user (the average number of no-action days per user).
+---------+-----------+----------------------+
| User_ID | Action_ID | Action_At            |
+---------+-----------+----------------------+
| 1       | 11        | 2019-01-31T23:00:37Z |
| 2       | 12        | 2019-01-31T23:11:12Z |
| 3       | 13        | 2019-01-31T23:14:53Z |
| 1       | 14        | 2019-02-01T00:00:30Z |
| 2       | 15        | 2019-02-01T00:01:03Z |
| 3       | 16        | 2019-02-01T00:02:32Z |
| 1       | 17        | 2019-02-06T11:30:28Z |
| 2       | 18        | 2019-02-06T11:30:28Z |
| 3       | 19        | 2019-02-07T09:09:16Z |
| 1       | 20        | 2019-02-11T15:37:24Z |
| 2       | 21        | 2019-02-18T10:02:07Z |
| 3       | 22        | 2019-02-26T12:01:31Z |
+---------+-----------+----------------------+

You can do it like this (and next time, please provide the data so that it is easy to help you; it took me much longer to enter the data than to get to the solution):
import pandas as pd

df = pd.DataFrame({'User_ID': [1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3],
                   'Action_ID': [11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22],
                   'Action_At': ['2019-01-31T23:00:37Z', '2019-01-31T23:11:12Z', '2019-01-31T23:14:53Z',
                                 '2019-02-01T00:00:30Z', '2019-02-01T00:01:03Z', '2019-02-01T00:02:32Z',
                                 '2019-02-06T11:30:28Z', '2019-02-06T11:30:28Z', '2019-02-07T09:09:16Z',
                                 '2019-02-11T15:37:24Z', '2019-02-18T10:02:07Z', '2019-02-26T12:01:31Z']})
df.Action_At = pd.to_datetime(df.Action_At)
df.groupby('User_ID').apply(lambda x: (x.Action_At - x.Action_At.shift()).mean())
## User_ID
## 1 3 days 13:32:15.666666
## 2 5 days 19:36:58.333333
## 3 8 days 12:15:32.666666
## dtype: timedelta64[ns]
Or, if you want the solution in days:
df.groupby('User_ID').apply(lambda x: (x.Action_At - x.Action_At.shift()).dt.days.mean())
## User_ID
## 1 3.333333
## 2 5.333333
## 3 8.333333
## dtype: float64
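For reference, the same per-user averages can also be computed without apply, by taking consecutive differences with diff() (a minimal sketch, assuming the df built above; note that total_seconds()/86400 keeps fractional days, matching the timedelta result, whereas .dt.days above truncates to whole days):
df = df.sort_values(['User_ID', 'Action_At'])
gaps = df.groupby('User_ID')['Action_At'].diff()                        # timedelta to the previous action per user
avg_gap_days = gaps.dt.total_seconds().div(86400).groupby(df['User_ID']).mean()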

Related

Quarter to date growth

I have some daily data in a df, which goes back as far as 1st January 2020. It looks similar to the below but with many id1s on each day.
| yyyy_mm_dd | id1 | id2 | cost |
|------------|-----|------|-------|
| 2020-01-01 | 23 | 7253 | 5003 |
| 2020-01-01 | 23 | 7743 | 30340 |
| 2020-01-02 | 23 | 7253 | 450 |
| 2020-01-02 | 23 | 7743 | 4500 |
| ... | ... | ... | ... |
| 2021-01-01 | 23 | 7253 | 5675 |
| 2021-01-01 | 23 | 134 | 1030 |
| 2021-01-01 | 23 | 3445 | 564 |
| 2021-01-01 | 23 | 4534 | 345 |
| ... | ... | ... | ... |
I have grouped and calculated the summed cost like so:
grouped_quarterly = (
    df
    .withColumn('year_quarter', (F.year(F.col('yyyy_mm_dd')) * 100 + F.quarter(F.col('yyyy_mm_dd'))))
    .groupby('id1', 'year_quarter')
    .agg(
        F.sum('cost').alias('cost')
    )
)
I am able to then successfully make a quarter over quarter comparison like so:
w = Window.partitionBy(F.col('id1'), F.expr('substring(string(year_quarter), -2)')).orderBy('year_quarter')
growth = (
    grouped_quarterly
    .withColumn('prev_value', F.lag(F.col('cost')).over(w))
    .withColumn('diff', F.when(F.isnull(F.col('cost') - F.col('prev_value')), 0).otherwise(F.col('cost') - F.col('prev_value')))
).where(F.col('year_quarter') >= 202101)
I would like to modify this to be quarter to date instead of quarter over quarter. For example, the above would compare April 1st 2020 - June 30th 2020 with April 1st 2021 - April 15th 2021 (or whatever the maximum date in df is).
Instead, I would prefer to compare April 1st 2020 - April 15th 2020 with April 1st 2021 - April 15th 2021.
Is it possible to ensure only the same periods are compared within year_quarter?
Edit: Adding sample output:
grouped_quarterly.where(F.col('id1') == 222).sort('year_quarter').show(10,False)
| id1 | year_quarter | cost |
|-----|--------------|-------|
| 222 | 202001 | 49428 |
| 222 | 202002 | 43292 |
| 222 | 202003 | 73928 |
| 222 | 202004 | 12028 |
| 222 | 202101 | 19382 |
| 222 | 202102 | 4282 |
growth.where(F.col('id1') == 222).sort('year_quarter').show(10,False)
| id1 | year_quarter | cost | prev_value | diff | growth |
|-----|--------------|-------|------------|--------|--------|
| 222 | 202101 | 52494 | 49428 | 3066 | 6.20 |
| 222 | 202102 | 4282 | 43292 | -39010 | -90.10 |
The growth calculation from the window is being done correctly. However, since 202102 is in progress, it gets compared to the full 202002. The comparison for 202101 works perfectly as both year_quarters are complete.
Is there any way to ensure that, for incomplete quarters, the window function only compares the same period within the year_quarter against the previous year? I hope the sample data makes my question a bit clearer.
The idea is to split the task into two parts:
Calculate the growth for the complete quarters; this logic is taken over unchanged from the question.
Calculate the growth for the currently running quarter.
First generate some additional test data for 2019Q2, 2020Q2 and 2021Q2:
data = [('2019-04-01', 23, 1), ('2019-04-01', 23, 2), ('2019-04-02', 23, 3), ('2019-04-15', 23, 4),
        ('2019-04-16', 23, 5), ('2019-04-17', 23, 6), ('2019-05-01', 23, 7), ('2019-06-30', 23, 8),
        ('2019-07-01', 23, 9), ('2020-01-01', 23, 5003), ('2020-01-01', 23, 30340), ('2020-01-02', 23, 450),
        ('2020-01-02', 23, 4500), ('2020-04-01', 23, 10), ('2020-04-01', 23, 20), ('2020-04-02', 23, 30),
        ('2020-04-15', 23, 40), ('2020-04-16', 23, 50), ('2020-04-17', 23, 60), ('2020-05-01', 23, 70),
        ('2020-06-30', 23, 80), ('2020-07-01', 23, 90), ('2021-01-01', 23, 5675), ('2021-01-01', 23, 1030),
        ('2021-01-01', 23, 564), ('2021-01-01', 23, 345), ('2021-04-01', 23, -10), ('2021-04-01', 23, -20),
        ('2021-04-02', 23, -30), ('2021-04-15', 23, -40)]
Calculate the year_quarter column and cache the result:
df = spark.createDataFrame(data=data, schema=["yyyy_mm_dd", "id1", "cost"]) \
    .withColumn("yyyy_mm_dd", F.to_date("yyyy_mm_dd", "yyyy-MM-dd")) \
    .withColumn('year_quarter', (F.year(F.col('yyyy_mm_dd')) * 100 + F.quarter(F.col('yyyy_mm_dd')))) \
    .cache()
Get the maximum date and its corresponding quarter:
max_row = df.selectExpr("max(yyyy_mm_dd)", "max_by(year_quarter, yyyy_mm_dd)").head()
cur_date, cur_quarter = max_row[0], max_row[1]
It is not strictly necessary to set cur_date to the maximum date of the data. Instead, cur_date and cur_quarter could also be set manually.
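For example, to pin the comparison to a fixed reference point instead of the data's maximum (illustrative values only, matching the test data above):
cur_date, cur_quarter = '2021-04-15', 202102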
For all quarters but the current one apply the logic given in the question:
w = Window.partitionBy(F.col('id1'), F.expr('substring(string(year_quarter), -2)')).orderBy('year_quarter')
df_full_quarters = df.filter(f"year_quarter <> {cur_quarter}") \
    .groupby('id1', 'year_quarter') \
    .agg(F.sum('cost').alias('cost')) \
    .withColumn('prev_value', F.lag(F.col('cost')).over(w))
For the current quarter filter out all dates in the previous year that should be ignored:
df_cur_quarter = df.filter(f"year_quarter = {cur_quarter} or (year_quarter = {cur_quarter - 100} and add_months(yyyy_mm_dd, 12) <= '{cur_date}')") \
    .groupby('id1', 'year_quarter') \
    .agg(F.sum('cost').alias('cost')) \
    .withColumn('prev_value', F.lag(F.col('cost')).over(w)) \
    .filter(f"year_quarter = {cur_quarter}")
Finally union the two parts and calculate the diff column:
growth = df_full_quarters.union(df_cur_quarter) \
    .withColumn('diff', F.when(F.isnull(F.col('cost') - F.col('prev_value')), 0).otherwise(F.col('cost') - F.col('prev_value'))) \
    .orderBy("id1", "year_quarter")
The result will be:
+---+------------+-----+----------+------+
|id1|year_quarter| cost|prev_value| diff|
+---+------------+-----+----------+------+
| 23| 201902| 36| null| 0|
| 23| 201903| 9| null| 0|
| 23| 202001|40293| null| 0|
| 23| 202002| 360| 36| 324|
| 23| 202003| 90| 9| 81|
| 23| 202101| 7614| 40293|-32679|
| 23| 202102| -100| 100| -200|
+---+------------+-----+----------+------+
In this example, for the comparison of 2021Q2 with the previous year the sum for 2020Q2 is given as 100, but the actual value for the full 2020Q2 is 360.
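If you also want the percentage growth column shown in the question's expected output, one possible definition (a sketch; rows without a prev_value come out as null here) is:
growth = growth.withColumn('growth', F.round(F.col('diff') / F.col('prev_value') * 100, 2))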
If you want a quarter-to-date comparison year over year while the quarter is incomplete, another option is to aggregate by day of month, e.g. dayofmonth(col("input")).alias("dayofmonth"), when the quarter being compared against lines up with the current month of the current year, perhaps combined with a conditional aggregation such as .agg(when(col("date_column") <condition>, <expression>)).
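One way to read that idea as code (a minimal sketch, not the answer's exact method; it assumes Spark 2.3+ for date_trunc and reuses df, cur_quarter and the groupby/window logic from the answer above): trim the previous year's quarter to the same day-of-quarter before aggregating.
# Days elapsed since the start of each row's quarter.
day_of_quarter = F.datediff(F.col('yyyy_mm_dd'), F.date_trunc('quarter', F.col('yyyy_mm_dd')))
# Latest day-of-quarter observed in the current, incomplete quarter.
cutoff = df.filter(F.col('year_quarter') == cur_quarter) \
    .agg(F.max(day_of_quarter)).first()[0]
# Drop the tail of the previous year's quarter, then apply the usual groupby/window from above.
df_qtd = df.filter((F.col('year_quarter') != cur_quarter - 100) | (day_of_quarter <= cutoff))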

Extracting indexes from a max pool over uniform data

I'm trying to find max points in a 2D tensor for a given kernel size, but I'm having issues with a special case where all the values are uniform. For example, given the following input, I would like to mark each point as a max point:
+---+---+---+---+
| 5 | 5 | 5 | 5 |
+---+---+---+---+
| 5 | 5 | 5 | 5 |
+---+---+---+---+
| 5 | 5 | 5 | 5 |
+---+---+---+---+
| 5 | 5 | 5 | 5 |
+---+---+---+---+
If I run torch.nn.functional.max_pool2d with kernel_size=3, stride=1, and padding=1, I get the following indices:
+---+---+---+----+
| 0 | 0 | 1 | 2 |
+---+---+---+----+
| 0 | 0 | 1 | 2 |
+---+---+---+----+
| 4 | 4 | 5 | 6 |
+---+---+---+----+
| 8 | 8 | 9 | 10 |
+---+---+---+----+
What changes do I need to make to instead obtain the following indices?
+----+----+----+----+
| 1 | 2 | 3 | 4 |
+----+----+----+----+
| 5 | 6 | 7 | 8 |
+----+----+----+----+
| 9 | 10 | 11 | 12 |
+----+----+----+----+
| 13 | 14 | 15 | 16 |
+----+----+----+----+
You can do the following:
a = torch.ones(4,4)
indices = (a == torch.max(a).item()).nonzero()
What this does is return a [16, 2] sized tensor with the 2D coordinates of the max value(s), i.e. [0,0], [0,1], ..., [3,3]. The torch.max part should be easy to understand; nonzero() takes the boolean tensor given by (a == torch.max(a).item()), treats False as 0, and returns the indices of the non-zero entries. Hope this helps!
If you want the indices in 2D shape, @ccl has already given you the answer, but for 1D indices you can first make x 1D using torch.flatten, then get the indices with torch.nonzero, and finally reshape back to the original shape.
x = torch.ones(4,4) * 5
(x.flatten() == x.flatten().max()).nonzero().reshape(x.shape) + 1
tensor([[ 1, 2, 3, 4],
[ 5, 6, 7, 8],
[ 9, 10, 11, 12],
[13, 14, 15, 16]])
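For completeness, if you want to keep the kernel-based notion of a max point rather than comparing against the global maximum, one common trick (a sketch, not taken from either answer) is to compare the tensor with its own max-pooled version; every position equal to its local maximum is marked, which for uniform data is every position:
import torch
import torch.nn.functional as F

x = torch.full((1, 1, 4, 4), 5.0)                      # batch and channel dims for max_pool2d
pooled = F.max_pool2d(x, kernel_size=3, stride=1, padding=1)
is_max = (x == pooled)                                 # True wherever a point equals its local max
flat_idx = is_max.flatten().nonzero().flatten() + 1    # 1-based flat indices, here 1..16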

Python selecting different number of rows for each group of a multilevel index

I have a data frame with a multilevel index. I would like to sort this data frame based on a specific column and extract the first n rows for each group of the first index, but n is different for each group.
For example:
| Index1 | Index2 | Sort_In_descending_order | How_manyRows_toChoose |
|--------|--------|--------------------------|-----------------------|
| 1      | 20     | 3                        | 2                     |
|        | 40     | 2                        | 2                     |
|        | 10     | 1                        | 2                     |
| 2      | 20     | 2                        | 1                     |
|        | 50     | 1                        | 1                     |
The result should look like this:
| Index1 | Index2 | Sort_In_descending_order | How_manyRows_toChoose |
|--------|--------|--------------------------|-----------------------|
| 1      | 20     | 3                        | 2                     |
|        | 40     | 2                        | 2                     |
| 2      | 20     | 2                        | 1                     |
I got this far:
df.groupby(level=[0,1]).sum().sort_values(['Index1','Sort_In_descending_order'], ascending=False).groupby('Index1').head(2)
However, the .head(2) picks 2 elements of each group regardless of the number in the column "How_manyRows_toChoose".
Some piece of code would be great!
Thank you!
Use a lambda function in GroupBy.apply with head, and add the parameter group_keys=False to avoid duplicated index values:
# original code
df = (df.groupby(level=[0,1])
        .sum()
        .sort_values(['Index1','Sort_In_descending_order'], ascending=False))
df = (df.groupby('Index1', group_keys=False)
        .apply(lambda x: x.head(x['How_manyRows_toChoose'].iat[0])))
print (df)
               Sort_In_descending_order  How_manyRows_toChoose
Index1 Index2
1      20                             3                      2
       40                             2                      2
2      20                             2                      1
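A vectorized alternative to the apply step (a sketch, applied to the sorted df from the first block above, and assuming its first index level is named Index1): number the rows within each group with cumcount and keep only those below the per-row limit.
n = df.groupby(level='Index1').cumcount()        # 0, 1, 2, ... within each Index1 group
df_top = df[n < df['How_manyRows_toChoose']]     # keeps the first How_manyRows_toChoose rows per group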

pandas rearranging experiment data

So I've looked at some other posts, but they didn't quite help. I'm not new to Python, but I'm relatively new to pandas, and this has me stumped as to how to accomplish it in any manner that's not horribly inefficient. The data sets I've got are a little bit large and have some extraneous columns of data that I don't need; I've got them loaded as dataframes, but they basically look like this:
+---------+---------+--------+-------+
| Subject | Week | Test | Value |
+---------+---------+--------+-------+
| 1 | Week 4 | Test 1 | 4 |
| 1 | Week 8 | Test 1 | 7 |
| 1 | Week 12 | Test 1 | 3 |
| 1 | Week 4 | Test 2 | 6 |
| 1 | Week 8 | Test 2 | 3 |
| 1 | Week 12 | Test 2 | 9 |
| 2 | Week 4 | Test 1 | 1 |
| 2 | Week 8 | Test 1 | 4 |
| 2 | Week 12 | Test 1 | 2 |
| 2 | Week 4 | Test 2 | 8 |
| 2 | Week 8 | Test 2 | 1 |
| 2 | Week 12 | Test 2 | 3 |
+---------+---------+--------+-------+
I want to rearrange the dataframes so that they look like this:
+---------+---------+--------+--------+
| Subject | Week | Test 1 | Test 2 |
+---------+---------+--------+--------+
| 1 | Week 4 | 4 | 6 |
| 1 | Week 8 | 7 | 3 |
| 1 | Week 12 | 3 | 9 |
| 2 | Week 4 | 1 | 8 |
| 2 | Week 8 | 4 | 1 |
| 2 | Week 12 | 2 | 3 |
+---------+---------+--------+--------+
If anyone has any ideas on how I can make this happen, I'd greatly appreciate it, and thank you in advance for your time!
Edit: After trying the solution provided by @HarvIpan, this is the output I'm getting:
  Subject     Week  Test_Test 1  Test_Test 2
0       1  Week 12            5            0
1       1   Week 4            5            0
2       1   Week 8           11            0
3       2  Week 12            0           12
4       2   Week 4            0           14
5       2   Week 8            0            4
Try using df.pivot_table.
You should be able to get the desired outcome with:
df.pivot_table(index=['Subject','Week'], columns='Test', values='Value')
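If you also want Subject and Week back as ordinary columns, as in the desired output, a small follow-up (a sketch based on the call above) is:
out = df.pivot_table(index=['Subject', 'Week'], columns='Test', values='Value').reset_index()
out.columns.name = None   # drop the 'Test' label the pivot leaves on the column axis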
You need to get dummy variables for the column Test with pd.get_dummies(df[['Test']], 'Test') and multiply them by Value before concatenating them back to your original df. Then group by Subject and Week and sum.
pd.concat([df.drop(['Test', 'Value'], axis=1),
           pd.get_dummies(df[['Test']], 'Test').mul(df['Value'], axis=0)],
          axis=1).groupby(['Subject', 'Week']).sum().reset_index()
Output:
   Subject     Week  Test_Test 1  Test_Test 2
0        1  Week 12            3            9
1        1   Week 4            4            6
2        1   Week 8            7            3
3        2  Week 12            2            3
4        2   Week 4            1            8
5        2   Week 8            4            1

I need formula in column name "FEBRUARY"

I have a set of data as below.
SHEET 1
+----+----------+-------+-------+
|            JANUARY            |
+----+----------+-------+-------+
| ID | NAME     | COUNT | PRICE |
+----+----------+-------+-------+
| 1  | ALFRED   | 11    | 150   |
| 2  | ARIS     | 22    | 120   |
| 3  | JOHN     | 33    | 170   |
| 4  | CHRIS    | 22    | 190   |
| 5  | JOE      | 55    | 120   |
| 6  | ACE      | 11    | 200   |
+----+----------+-------+-------+
SHEET 2
+----+----------+-------+-------+
| ID | NAME     | COUNT | PRICE |
+----+----------+-------+-------+
| 1  | CHRIS    | 13    | 123   |
| 2  | ACE      | 26    | 165   |
| 3  | JOE      | 39    | 178   |
| 4  | ALFRED   | 21    | 198   |
| 5  | JOHN     | 58    | 112   |
| 6  | ARIS     | 11    | 200   |
+----+----------+-------+-------+
The RESULT should look like this in Sheet 1:
+----+----------+-------+-------++-------+-------+
|            JANUARY            ||   FEBRUARY    |
+----+----------+-------+-------++-------+-------+
| ID | NAME     | COUNT | PRICE || COUNT | PRICE |
+----+----------+-------+-------++-------+-------+
| 1  | ALFRED   | 11    | 150   || 21    | 198   |
| 2  | ARIS     | 22    | 120   || 11    | 200   |
| 3  | JOHN     | 33    | 170   || 58    | 112   |
| 4  | CHRIS    | 22    | 190   || 13    | 123   |
| 5  | JOE      | 55    | 120   || 39    | 178   |
| 6  | ACE      | 11    | 200   || 26    | 165   |
+----+----------+-------+-------++-------+-------+
I need a formula for the columns under "FEBRUARY". This formula will find each name's match in Sheet 2.
BUILD SAMPLE DATA
create table table1(
id int,
id_entry varchar(10),
tag int,
tag2 int
)
create table table2(
id int,
name varchar(50),
lastname varchar(50),
age int,
tel int
)
insert into table1
select 1, 'A1', 11, 12 union all
select 2, 'C2', 22, 13 union all
select 3, 'S5', 33, 14 union all
select 4, 'C2', 22, 13 union all
select 5, 'B6', 55, 16 union all
select 6, 'A1', 11, 12
insert into table2
select 1, 'ALFRED', 'DAVE', 21, 555 union all
select 2, 'FRED', 'SMITH', 22, 666 union all
select 3, 'MANNY', 'PAC', 23, 777 union all
select 4, 'FRED', 'DAVE', 22, 666 union all
select 5, 'JOHN', 'SMITH', 25, 999 union all
select 6, 'ALFRED', 'DAVE', 21, 555
SOLUTION
;with cte as(
select
t1.id_entry,
t1.tag,
t1.tag2,
t2.name,
t2.lastname,
t2.age,
t2.tel,
cc = count(*) over(partition by t1.id_entry),
rn = row_number() over(partition by t1.id_entry order by t2.lastname desc)
from table1 t1
inner join table2 t2
on t2.id = t1.id
)
select
id_entry,
tag,
tag2,
name,
lastname,
age,
tel
from cte
where
cc > 1
and rn = 1
DROP SAMPLE DATA
drop table table1
drop table table2
Try this
SELECT T2.ID_ENTRY, T1.TAG, T1.TAG2, T2.Name, T2.LastName, T2.Age, T2.Tel
FROM Table1 T1
INNER JOIN Table2 T2
ON T1.ID = T2.ID
GROUP BY T2.ID_ENTRY, T1.TAG, T1.TAG2, T2.Name, T2.LastName, T2.Age, T2.Tel
HAVING Count(T2.ID_ENTRY) > 1
