How to group by value for certain time period - python-3.x

I have a DataFrame like the one below:
Item  Date       Count
a     6/1/2018   1
b     6/1/2018   2
c     6/1/2018   3
a     12/1/2018  3
b     12/1/2018  4
c     12/1/2018  1
a     1/1/2019   2
b     1/1/2019   3
c     1/1/2019   2
I would like to get the sum of Count per Item within the specified duration, from 7/1/2018 to 6/1/2019. For this case, the expected output is:
Item  TotalCount
a     5
b     7
c     3

We can use query with Series.between and chain that with GroupBy.sum:
df.query('Date.between("07-01-2018", "06-01-2019")').groupby('Item')['Count'].sum()
Output
Item
a 5
b 7
c 3
Name: Count, dtype: int64
To match your exact output, use reset_index:
df.query('Date.between("07-01-2018", "06-01-2019")').groupby('Item')['Count'].sum()\
.reset_index(name='TotalCount')
Output
Item TotalCount
0 a 5
1 b 7
2 c 3
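For a self-contained run of this answer, the frame from the question can be rebuilt and Date parsed with pd.to_datetime first (the format is assumed to be month/day/year), since between only compares chronologically on real datetimes; the boolean-mask line below is the plain equivalent of the query string above:
import pandas as pd

df = pd.DataFrame({
    'Item': ['a', 'b', 'c'] * 3,
    'Date': ['6/1/2018'] * 3 + ['12/1/2018'] * 3 + ['1/1/2019'] * 3,
    'Count': [1, 2, 3, 3, 4, 1, 2, 3, 2],
})
df['Date'] = pd.to_datetime(df['Date'], format='%m/%d/%Y')  # make the comparison chronological

# plain boolean-mask equivalent of the query string above
out = (df[df['Date'].between('2018-07-01', '2019-06-01')]
         .groupby('Item')['Count'].sum()
         .reset_index(name='TotalCount'))
print (out)
#   Item  TotalCount
# 0    a           5
# 1    b           7
# 2    c           3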

Here is one with .loc[] using lambda:
# df.Date = pd.to_datetime(df.Date)  # needed if Date is not already datetime
(df.loc[lambda x: x.Date.between("07-01-2018", "06-01-2019")]
.groupby('Item',as_index=False)['Count'].sum())
Item Count
0 a 5
1 b 7
2 c 3
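A small follow-up sketch (my own, assuming Date is still the raw string column): the lambda form is handy because the whole filter can sit inside one method chain, with x referring to the intermediate frame, e.g. parsing Date inline with assign:
out = (df.assign(Date=lambda x: pd.to_datetime(x['Date'], format='%m/%d/%Y'))
         .loc[lambda x: x['Date'].between('2018-07-01', '2019-06-01')]
         .groupby('Item', as_index=False)['Count'].sum())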

Related

Excel Determine Ascending Rank Based on Each Date

Here is the raw data:
Date       Name  Score
25/2/2021  A     10
25/2/2021  B     8
25/2/2021  C     8
25/2/2021  D     4
25/2/2021  E     1
24/2/2021  A     0
24/2/2021  B     20
24/2/2021  C     7
24/2/2021  D     10
24/2/2021  E     4
I would love to assign consecutive ranks (preferably in ascending order) to the students within each date, as follows:
Date       Name  Score  Rank
25/2/2021  A     10     1
25/2/2021  B     8      2
25/2/2021  C     8      2
25/2/2021  D     4      3
25/2/2021  E     1      4
24/2/2021  A     0      5
24/2/2021  B     20     1
24/2/2021  C     7      3
24/2/2021  D     10     2
24/2/2021  E     4      4
I've tried a customised rank function, but it's hard to produce this result. How could I do that? Thanks in advance!
You can try the formula below with Excel 365. It will also work on unsorted data.
=XMATCH(C2,SORT(FILTER($C$2:$C$11,$A$2:$A$11=A2),1,-1))
In D2 use:
=COUNTIFS(A$2:A$11,A2,C$2:C$11,">"&C2)+1
EDIT: Based on your comment, try:
Formula in D2:
=SUM(--(UNIQUE(FILTER(C$2:C$11,A$2:A$11=A2))>C2))+1
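Since this digest is python-3.x oriented, a rough pandas counterpart may be useful for comparison. This is only a sketch of mine, not part of the Excel answer; it assumes the dense ranking shown in the expected output, where tied scores share a rank and no rank number is skipped:
import pandas as pd

df = pd.DataFrame({
    'Date':  ['25/2/2021'] * 5 + ['24/2/2021'] * 5,
    'Name':  list('ABCDE') * 2,
    'Score': [10, 8, 8, 4, 1, 0, 20, 7, 10, 4],
})
# highest score per date gets rank 1; ties share the same rank (dense)
df['Rank'] = (df.groupby('Date')['Score']
                .rank(method='dense', ascending=False)
                .astype(int))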

Sum of all rows based on specific column values

I have a df like this:
Index  Parameters  A  B  C  D  E
1      Apple       1  2  3  4  5
2      Banana      2  4  5  3  5
3      Potato      3  5  3  2  1
4      Tomato      1  1  1  1  1
5      Pear        4  5  5  4  3
I want to add up all the rows whose Parameters value is "Apple", "Banana" or "Pear".
Output:
Index  Parameters  A  B   C   D   E
1      Apple       1  2   3   4   5
2      Banana      2  4   5   3   5
3      Potato      3  5   3   2   1
4      Tomato      1  1   1   1   1
5      Pear        4  5   5   4   3
6      Total       7  11  13  11  13
My Effort:
df.loc[:, 'Total'] = df.sum(axis=1) -- works, but it sums everything and I want only specific rows
I tried selecting by index (1, 2 and 5 in my case), but in my original df the index can vary from time to time, so I rejected that solution.
I saw various answers on SO, but none of them solved my problem.
The first idea is to create the index from the Parameters column, select the rows to sum, and finally convert the index back to a column:
L = ["Apple" , "Banana" , "Pear"]
df = df.set_index('Parameters')
df.loc['Total'] = df.loc[L].sum()
df = df.reset_index()
print (df)
Parameters A B C D E
0 Apple 1 2 3 4 5
1 Banana 2 4 5 3 5
2 Potato 3 5 3 2 1
3 Tomato 1 1 1 1 1
4 Pear 4 5 5 4 3
5 Total 7 11 13 11 13
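For reference, a self-contained sketch of this first approach, rebuilding the question's frame (with Parameters as a regular column and a default RangeIndex, which is an assumption about the original data):
import pandas as pd

df = pd.DataFrame({
    'Parameters': ['Apple', 'Banana', 'Potato', 'Tomato', 'Pear'],
    'A': [1, 2, 3, 1, 4],
    'B': [2, 4, 5, 1, 5],
    'C': [3, 5, 3, 1, 5],
    'D': [4, 3, 2, 1, 4],
    'E': [5, 5, 1, 1, 3],
})

L = ['Apple', 'Banana', 'Pear']
df = df.set_index('Parameters')
df.loc['Total'] = df.loc[L].sum()   # sum only the selected fruit rows
df = df.reset_index()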
Or append a new row that sums the rows filtered by membership with Series.isin, then overwrite the Parameters value of that new row with 'Total':
last = len(df)
df.loc[last] = df[df['Parameters'].isin(L)].sum()
df.loc[last, 'Parameters'] = 'Total'
print (df)
Parameters A B C D E
0 Apple 1 2 3 4 5
1 Banana 2 4 5 3 5
2 Potato 3 5 3 2 1
3 Tomato 1 1 1 1 1
4 Pear 4 5 5 4 3
5 Total 7 11 13 11 13
Another similar solution filters all columns except the first and prepends the 'Total' label as a one-element list:
df.loc[len(df)] = ['Total'] + df.iloc[df['Parameters'].isin(L).values, 1:].sum().tolist()

sumproduct in different columns between dates

I'm trying to sum between two dates across columns. I have a start date input in Sheet1!F1 and an end date input in Sheet1!F2, and I need to multiply column B by column E.
I can do =SUMPRODUCT(Sheet1!B2:B14,Sheet1!E2:E14), which results in 48 based on the example table below. However, I need to include date parameters so I can choose between the dates 2/1/15 and 6/1/15, which should result in 20.
A           B       C       D       E
Date        Value1  Value2  Value3  Value4
1/1/2015    1       2       3       4
2/1/2015    1       2       3       4
3/1/2015    1       2       3       4
4/1/2015    1       2       3       4
5/1/2015    1       2       3       4
6/1/2015    1       2       3       4
7/1/2015    1       2       3       4
8/1/2015    1       2       3       4
9/1/2015    1       2       3       4
10/1/2015   1       2       3       4
11/1/2015   1       2       3       4
12/1/2015   1       2       3       4
Try,
=SUMPRODUCT((Sheet1!A2:A14>=Sheet1!F1)*(Sheet1!A2:A14<=Sheet1!F2)*Sheet1!B2:B14*Sheet1!E2:E14)
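Outside the Excel thread, a quick pandas sketch of the same boolean-times-values idea (the column names Value1/Value4 and the month-start dates are assumptions for illustration):
import pandas as pd

df = pd.DataFrame({
    'Date':   pd.date_range('2015-01-01', periods=12, freq='MS'),  # 1/1/2015 .. 12/1/2015
    'Value1': [1] * 12,
    'Value4': [4] * 12,
})

mask = df['Date'].between('2015-02-01', '2015-06-01')
total = (df.loc[mask, 'Value1'] * df.loc[mask, 'Value4']).sum()
print (total)  # 20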

Pandas: How to extract only the latest date in a pivot table dataframe

How do I create a new dataframe that keeps only the latest 'txn_date' (as the index) for each 'day', based on the pivot table in the picture?
Thank you
import pandas as pd

d1 = pd.to_datetime(['2016-06-25'] * 2 + ['2016-06-28'] * 4)
df = pd.DataFrame({'txn_date': pd.date_range('2012-03-05 10:20:03', periods=6),
                   'B': [4, 5, 4, 5, 5, 4],
                   'C': [7, 8, 9, 4, 2, 3],
                   'D': [1, 3, 5, 7, 1, 0],
                   'E': [5, 3, 6, 9, 2, 4],
                   'day': d1}).set_index(['day', 'txn_date'])
print (df)
B C D E
day txn_date
2016-06-25 2012-03-05 10:20:03 4 7 1 5
2012-03-06 10:20:03 5 8 3 3
2016-06-28 2012-03-07 10:20:03 4 9 5 6
2012-03-08 10:20:03 5 4 7 9
2012-03-09 10:20:03 5 2 1 2
2012-03-10 10:20:03 4 3 0 4
1.
I think you first need sort_index if necessary, then groupby by the day level and aggregate with last:
df1 = df.sort_index().reset_index(level=1).groupby(level='day').last()
print (df1)
txn_date B C D E
day
2016-06-25 2012-03-06 10:20:03 5 8 3 3
2016-06-28 2012-03-10 10:20:03 4 3 0 4
2.
Filter by boolean indexing with duplicated:
#if necessary
df = df.sort_index()
df2 = df[~df.index.get_level_values('day').duplicated(keep='last')]
print(df2)
B C D E
day txn_date
2016-06-25 2012-03-06 10:20:03 5 8 3 3
2016-06-28 2012-03-10 10:20:03 4 3 0 4
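Both variants rely on the rows being sorted by txn_date within each day, which is what the sort_index calls guarantee. For what it's worth, an equivalent one-liner (not from the original answer) that keeps the full MultiIndex is GroupBy.tail, assuming the frame is already sorted as above; it produces the same two rows as df2:
df3 = df.sort_index().groupby(level='day').tail(1)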

Pandas Conditionally Combine (and sum) Rows

Given the following data frame:
import pandas as pd
df = pd.DataFrame({'A': ['A', 'A', 'A', 'B', 'B', 'B'],
                   'B': [1, 1, 2, 1, 1, 1],
                   'C': [2, 4, 6, 3, 5, 7]})
df
A B C
0 A 1 2
1 A 1 4
2 A 2 6
3 B 1 3
4 B 1 5
5 B 1 7
Wherever there are duplicate rows in columns 'A' and 'B', I'd like to combine those rows and sum the values under column 'C', like this:
A B C
0 A 1 6
2 A 2 6
3 B 1 15
So far, I can at least identify the duplicates like this:
df['Dup']=df.duplicated(['A','B'],keep=False)
Thanks in advance!
Use groupby() and sum():
In [94]: df.groupby(['A','B']).sum().reset_index()
Out[94]:
A B C
0 A 1 6
1 A 2 6
2 B 1 15
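If you'd rather skip the reset_index step, as_index=False returns the grouping keys as regular columns directly (either way the result is renumbered from 0, so the original row labels 0, 2 and 3 shown in the expected output are not preserved):
df.groupby(['A', 'B'], as_index=False)['C'].sum()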
