Summing a year's worth of data that spans two years (pandas, python-3.x)

I have a DataFrame that contains data similar to this:
Name Date A B C
John 19/04/2018 10 11 8
John 20/04/2018 9 7 9
John 21/04/2018 22 15 22
… … … … …
John 16/04/2019 8 8 9
John 17/04/2019 10 11 18
John 18/04/2019 8 9 11
Rich 19/04/2018 18 7 6
… … … … …
Rich 18/04/2019 19 11 17
The data can start on any day and contains at least 365 days of data, sometimes more. What I want to end up with is a DataFrame like this:
Name Date Sum
John April 356
John May 276
John June 209
Rich April 452
I need to sum up all of the months to get a year's worth of data (April - March), but I need to be able to handle taking part of April's total (in this example) from 2018 and part from 2019. I would also like to shift the days so that the weekdays follow on in sequence, so rather than:
John 16/04/2019 8 8 9 Tuesday
John 17/04/2019 10 11 18 Wednesday
John 18/04/2019 8 9 11 Thursday
John 19/04/2019 10 11 8 Thursday (was 19/04/2018)
John 20/04/2019 9 7 9 Friday (was 20/04/2018)
It becomes
John 16/04/2019 8 8 9 Tuesday
John 17/04/2019 10 11 18 Wednesday
John 18/04/2019 8 9 11 Thursday
John 19/04/2019 9 7 9 Friday (was 20/04/2018)
This shift would happen prior to summing to get the final DataFrame. Is this possible?
Additional information requested in comments
Here is a link to the initial data set https://github.com/stottp/exampledata/blob/master/SOExample.csv and the required output would be:
Name Month Total
John March 11634
John April 11470
John May 11757
John June 10968
John July 11682
John August 11631
John September 11085
John October 11924
John November 11593
John December 11714
John January 11320
John February 10167
Rich March 11594
Rich April 12383
Rich May 12506
Rich June 11112
Rich July 11636
Rich August 11303
Rich September 10667
Rich October 10992
Rich November 11721
Rich December 11627
Rich January 11669
Rich February 10335

Let's see if I understood correctly. If you want to sum, I suppose you mean sum the values of columns ['A', 'B', 'C'] for each day and get the total value monthly.
If that's right, the first thing to do is parse the ['Date'] column as day-first dates and set it as the index so that the data frame is easier to work with:
import pandas as pd
df['Date'] = pd.to_datetime(df['Date'], dayfirst=True)
df = df.set_index('Date')
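Next, you will want to compute the monthly sum by re-sampling your data frame (from days to months) whilst summing the values of ['A', 'B', 'C']. Keep the result as a separate Series, because assigning a monthly result back to the daily-indexed frame would leave most rows as NaN:
# 'M' is the month-end alias; pandas 2.2+ prefers 'ME'
monthly_sum = df[['A', 'B', 'C']].resample('M').sum().sum(axis=1)
monthly_sum.head()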
Out[37]:
Date
2012-11-30 1956265
2012-12-31 2972076
2013-01-31 2972565
2013-02-28 2696121
2013-03-31 2970687
Freq: M, dtype: int64
The last part, about squashing February of 2018 and 2019 together as if they were a single month, might be done with something like:
df.loc['2019-02'].merge(df.loc['2018-02'], how='outer', on=['Date', 'A', 'B', 'C'])
Test this last step and see if it works for you.
Cheers
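For the per-name monthly totals asked for in the question, one possible sketch is to group on Name and calendar month, so that the partial April from 2018 and the partial April from 2019 land in the same bucket. The sample rows and the Total and Month column names below are made up for illustration, and the approach assumes each person has exactly one year of data (a longer span would need trimming first).

```python
import pandas as pd

# Hypothetical sample in the question's layout (day-first dates).
df = pd.DataFrame({
    "Name": ["John", "John", "John", "Rich", "Rich"],
    "Date": ["19/04/2018", "20/05/2018", "18/04/2019", "19/04/2018", "18/04/2019"],
    "A": [10, 9, 8, 18, 19],
    "B": [11, 7, 9, 7, 11],
    "C": [8, 9, 11, 6, 17],
})

# Parse the dates explicitly as day/month/year.
df["Date"] = pd.to_datetime(df["Date"], format="%d/%m/%Y")

# Row-wise total of A, B and C, plus the calendar month name.
df["Total"] = df[["A", "B", "C"]].sum(axis=1)
df["Month"] = df["Date"].dt.month_name()

# Group on Name and month name only, so April 2018 and April 2019
# are summed together; sort=False keeps first-appearance order.
monthly = (
    df.groupby(["Name", "Month"], sort=False)["Total"]
      .sum()
      .reset_index()
)
print(monthly)
```

Grouping on the month name rather than on a year-month period is what merges the two partial Aprils into one row.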

Related

Filter and display all duplicated rows based on multiple columns in Pandas [duplicate]

Given a dataset as follows:
name month year
0 Joe December 2017
1 James January 2018
2 Bob April 2018
3 Joe December 2017
4 Jack February 2018
5 Jack April 2018
I need to filter and display all duplicated rows based on columns month and year in Pandas.
With the code below, I get:
df = df[df.duplicated(subset = ['month', 'year'])]
df = df.sort_values(by=['name', 'month', 'year'], ascending = False)
Out:
name month year
3 Joe December 2017
5 Jack April 2018
But I want the result as follows:
name month year
0 Joe December 2017
1 Joe December 2017
2 Bob April 2018
3 Jack April 2018
How could I do that in Pandas?
The following code works by adding keep=False, which marks every row in a duplicated group rather than only the repeats:
df = df[df.duplicated(subset = ['month', 'year'], keep = False)]
df = df.sort_values(by=['name', 'month', 'year'], ascending = False)
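As a self-contained check, here is a small sketch that rebuilds the example data and applies keep=False. The sort on year and month and the reset_index call are my additions, chosen to reproduce the row order and the 0..3 index shown in the desired output; the variable name dupes is arbitrary.

```python
import pandas as pd

# The example data from the question.
df = pd.DataFrame({
    "name": ["Joe", "James", "Bob", "Joe", "Jack", "Jack"],
    "month": ["December", "January", "April", "December", "February", "April"],
    "year": [2017, 2018, 2018, 2017, 2018, 2018],
})

# keep=False flags every row whose (month, year) pair occurs more than once.
dupes = df[df.duplicated(subset=["month", "year"], keep=False)]

# Group the duplicates together and renumber the rows from 0.
dupes = dupes.sort_values(by=["year", "month"]).reset_index(drop=True)
print(dupes)
```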

Difference between value from multiple sheets

I would like to find the difference between the values for matching serial numbers across multiple Excel sheets.
Sheet 1
June July
B 10 20
A 50 90
Sheet 2
June July
A 6 3
C 5 9
B 10 5
Sheet 3 (results)
June July
A 44 87
B 0 15
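If doing this outside Excel is an option, one way to sketch it in pandas is to index each sheet by serial number, keep only the serial numbers present in both, and subtract. The frames below are typed in by hand from the example; in practice they could be loaded with pandas.read_excel, and the variable names are only illustrative.

```python
import pandas as pd

# Sheet 1 and Sheet 2 from the example, indexed by serial number.
sheet1 = pd.DataFrame({"June": [10, 50], "July": [20, 90]}, index=["B", "A"])
sheet2 = pd.DataFrame({"June": [6, 5, 10], "July": [3, 9, 5]}, index=["A", "C", "B"])

# Serial numbers that appear in both sheets, in sorted order.
common = sheet1.index.intersection(sheet2.index).sort_values()

# Subtract Sheet 2 from Sheet 1 for the matched serial numbers only.
result = sheet1.loc[common] - sheet2.loc[common]
print(result)
```

Serial number C is dropped because it has no counterpart in Sheet 1.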

Dynamically Lookup Value with Between - Excel

I have a chronological list of Product, Year, Month, Profit (like below).
Summary Table
Product Year Month Profit
TV 2018 1 10
TV 2018 2 20
TV 2018 3 30
TV 2018 4 50
TV 2018 5 35
TV 2018 6 60
TV 2018 7 90
Heater 2018 1 20
Heater 2018 2 3
Heater 2018 3 8
Heater 2018 4 4
Heater 2018 5 6
Heater 2018 6 11
Heater 2018 7 1
What I want to do is look up another sheet that lists all of the price changes by month and year, as the table below shows.
Sale Price
Product Year Month Price
TV 2018 1 $1,000.00
TV 2018 4 $800.00
TV 2018 7 $950.00
Heater 2018 1 $20.00
Heater 2018 2 $60.00
Heater 2018 5 $45.00
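So, for example, for TV with Month = 2 and Year = 2018, I want it to pull in $1,000 as part of my profit calculation.
To get the correct price, use the formula below. It appears to assume that the Sale Price table sits in G1:J7 (Product, Year, Month, Price) and that the Summary Table has Product, Year and Month in columns A to C. It returns the price from the last row whose product and year match and whose month is no later than the current one: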
=INDEX(J:J,AGGREGATE(14,6,ROW($I$2:$I$7)/(($G$2:$G$7=A2)*($H$2:$H$7=B2)*($I$2:$I$7<=C2)),1))
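The same "latest price change on or before this month" logic can be sketched in pandas with merge_asof, which may help when checking the formula's results. This is only an illustration under the assumption that the lookup stays within one product and year, as in the formula; the frames hold a subset of the example rows and the variable names are made up.

```python
import pandas as pd

# A few rows of the Summary Table.
summary = pd.DataFrame({
    "Product": ["TV", "TV", "TV", "Heater", "Heater"],
    "Year": [2018, 2018, 2018, 2018, 2018],
    "Month": [2, 4, 7, 1, 6],
    "Profit": [20, 50, 90, 20, 11],
})

# The Sale Price table (one row per price change).
prices = pd.DataFrame({
    "Product": ["TV", "TV", "TV", "Heater", "Heater", "Heater"],
    "Year": [2018, 2018, 2018, 2018, 2018, 2018],
    "Month": [1, 4, 7, 1, 2, 5],
    "Price": [1000.0, 800.0, 950.0, 20.0, 60.0, 45.0],
})

# merge_asof needs both frames sorted on the "on" key. For each summary
# row it picks the last price row with the same Product and Year whose
# Month is less than or equal to the summary row's Month.
merged = pd.merge_asof(
    summary.sort_values("Month"),
    prices.sort_values("Month"),
    on="Month",
    by=["Product", "Year"],
)
print(merged)
```

For TV in month 2 this pulls in the month 1 price, matching the example in the question.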

Excel indexmatch, vlookup

I have a holiday calendar for several years in one table. Can anyone help with how to arrange this data by week and show each holiday against its week? I want to reference this data in other worksheets, so arranging it this way will help me use formulae on other sheets. I want the data to be: column A having week numbers, column B showing the holiday for year 1, column C showing the holiday for year 2, etc.
Fiscal Week
2015 2014 2013 2012
Valentine's Day 2 2 2 3
President's Day 3 3 3 4
St. Patrick's Day 7 7 7 7
Easter 10 12 9 11
Mother's Day 15 15 15 16
Memorial Day 17 17 17 18
Flag Day 20 19 19 20
Father's Day 21 20 20 21
Independence Day 22 22 22 23
Labor Day 32 31 31 32
Columbus Day 37 37 37 37
Thanksgiving 43 43 43 43
Christmas 47 47 47 48
New Year's Day 48 48 48 49
ML King Day 51 51 51 52
It's not too clear what year 1 is, so I'm going to assume that's 2015, and year 2 is 2014, etc.
Here's how you could set it up, if I understand correctly. Use this index/match formula (pseudo-formula):
=Iferror(Index([holiday names range],match([week number],[2015's week numbers in your table],0)),"")
In the cell next to the week numbers (with the week numbers in column H), the actual formula is:
=IFERROR(INDEX($A$3:$A$17,MATCH($H3,B$3:B$17,0)),"")
You can then drag the formula across, and the match range (B$3:B$17) will "slide over" to the next year's column as you drag.
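For comparison, the same week-by-year layout can be sketched in pandas by reshaping the table from "holiday by year" to "week by year". This is just an illustration of what the INDEX/MATCH grid produces, using the first three holidays from the table; the names long_form and by_week are arbitrary.

```python
import pandas as pd

# First three rows of the holiday table: fiscal week per holiday and year.
holidays = pd.DataFrame(
    {"2015": [2, 3, 7], "2014": [2, 3, 7], "2013": [2, 3, 7], "2012": [3, 4, 7]},
    index=["Valentine's Day", "President's Day", "St. Patrick's Day"],
)

# Long form: one row per (Holiday, Year, Week).
long_form = (
    holidays.rename_axis("Holiday")
    .reset_index()
    .melt(id_vars="Holiday", var_name="Year", value_name="Week")
)

# Wide form: week numbers down the side, years across, holiday names inside.
# Weeks with no holiday in a given year come out as NaN.
by_week = long_form.pivot(index="Week", columns="Year", values="Holiday")
print(by_week)
```

The NaN cells correspond to the empty strings that IFERROR returns in the spreadsheet version.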

PowerPivot Cohort Analysis

I'm trying to do cohort analysis using Excel's PowerPivot. I have a table recording which users have purchased which products in which months, e.g.
UserID Product Date Quantity
1 Ham Mar 15 2
1 Cheese Jan 15 7
2 Ham Mar 15 8
3 Fish Mar 15 2
2 Cheese Apr 15 8
I want to use a calculated field to filter for a cohort of users who purchased a given product in a given month but be able to analyse all their purchases.
E.g. cohort Ham, Mar 15
--> Users 1, 2
UserID Product Date Quantity
1 Ham Mar 15 2
1 Cheese Jan 15 7
2 Ham Mar 15 8
2 Cheese Apr 15 8
I know this could be done easily using SQL but I am working with colleagues who prefer to use Excel over Access/Some SQL interface.
Thank you
Create a calculated column like this:
=if([UserID]&SlicerValue=[UserID]&[Product],[UserID])
where Ham would be selected from a slicer created from a table of unique products.
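To make the intended result concrete, here is a rough pandas sketch of the cohort filter itself: find the users who bought the chosen product in the chosen month, then keep all of their purchases. It is not PowerPivot code, only an illustration of the logic, and the variable names are invented.

```python
import pandas as pd

# The purchases table from the question.
purchases = pd.DataFrame({
    "UserID": [1, 1, 2, 3, 2],
    "Product": ["Ham", "Cheese", "Ham", "Fish", "Cheese"],
    "Date": ["Mar 15", "Jan 15", "Mar 15", "Mar 15", "Apr 15"],
    "Quantity": [2, 7, 8, 2, 8],
})

# Rows that define the cohort: Ham bought in Mar 15.
in_cohort = (purchases["Product"] == "Ham") & (purchases["Date"] == "Mar 15")
cohort_users = purchases.loc[in_cohort, "UserID"].unique()

# Every purchase made by a cohort member, whatever the product or month.
cohort_purchases = purchases[purchases["UserID"].isin(cohort_users)]
print(cohort_purchases)
```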
