Running sums from one column conditional on values in another column - python-3.x

I've got a dataset containing data values associated with times (amongst other categories), and I'd like to add an accumulated value column - that is, the sum of all values up to and including the time. So, taking something like this:
ID YEAR VALUE
0 A 2018 144
1 B 2018 147
2 C 2018 164
3 D 2018 167
4 A 2019 167
5 B 2019 109
6 C 2019 183
7 D 2019 121
8 A 2020 136
9 B 2020 187
10 C 2020 170
11 D 2020 188
and adding a column like this:
ID YEAR VALUE CUMULATIVE_VALUE
0 A 2018 144 144
1 B 2018 147 147
2 C 2018 164 164
3 D 2018 167 167
4 A 2019 167 311
5 B 2019 109 256
6 C 2019 183 347
7 D 2019 121 288
8 A 2020 136 447
9 B 2020 187 443
10 C 2020 170 517
11 D 2020 188 476
Where, e.g., in row 7 the CUMULATIVE_VALUE is the sum of the two VALUE entries for ID "D" in years 2018 and 2019 (and not 2020).
I've looked at cumsum() but can't see how I could use it in this specific case, so the best I've come up with is this:
import numpy as np
import pandas as pd

np.random.seed(0)
ids = ["A", "B", "C", "D"]
years = [2018, 2019, 2020]
df = pd.DataFrame({"ID": np.tile(ids, 3),
                   "YEAR": np.repeat(years, 4),
                   "VALUE": np.random.randint(100, 200, 12)})
print(df)

df["CUMULATIVE_VALUE"] = None
for id in ids:
    for year in years:
        df.loc[(df.ID == id) & (df.YEAR == year), "CUMULATIVE_VALUE"] = \
            df[(df.ID == id) & (df.YEAR <= year)].VALUE.sum()
print(df)
but I'm sure there must be a better and more efficient way of doing it. Anyone?

You can use groupby to group by ID and then take the cumulative sum of VALUE with cumsum:
df['CUMULATIVE_VALUE'] = df.groupby('ID').VALUE.cumsum()
ID YEAR VALUE CUMULATIVE_VALUE
0 A 2018 144 144
1 B 2018 147 147
2 C 2018 164 164
3 D 2018 167 167
4 A 2019 167 311
5 B 2019 109 256
6 C 2019 183 347
7 D 2019 121 288
8 A 2020 136 447
9 B 2020 187 443
10 C 2020 170 517
11 D 2020 188 476
In case the years are not already sorted, sort first and then take the cumulative sum:
df = df.sort_values(['ID','YEAR']).reset_index(drop=True)
df['cumsum'] = df.groupby('ID')['VALUE'].cumsum()
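For reference, here is a minimal end-to-end sketch of that approach on the sample frame from the question (same seed and column names); the last line spot-checks row 7, which should be 167 + 121 = 288:

import numpy as np
import pandas as pd

np.random.seed(0)
df = pd.DataFrame({"ID": np.tile(["A", "B", "C", "D"], 3),
                   "YEAR": np.repeat([2018, 2019, 2020], 4),
                   "VALUE": np.random.randint(100, 200, 12)})

# Sort so the running sum accumulates in time order within each ID,
# then take the cumulative sum per ID group.
out = df.sort_values(["ID", "YEAR"])
out["CUMULATIVE_VALUE"] = out.groupby("ID")["VALUE"].cumsum()
out = out.sort_index()  # restore the original row order

print(out.loc[7])  # ID "D" in 2019: its 2018 value plus its 2019 value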

Related

How to update column based on conditions and previous row is not equal to the same condition

How to identify a Winner week when the previous row does not meet the same condition.
A week is classified as a "Winner" when [Weekly_Counts] is greater than [Winner_Num] and the previous week is not a Winner.
Here a copy of the final data set:
Year ISOweeknum Weekly_Counts NumOfWeeks Yearly_Count WeeklyAverage Winner_Num
0 2017 9 1561 44 12100 275 330
1 2017 10 1001 44 12100 275 330
2 2017 11 451 44 12100 275 330
3 2017 12 513 44 12100 275 330
4 2017 13 431 44 12100 275 330
... ... ... ... ... ... ... ...
232 2021 32 136 36 4212 117 140
233 2021 33 84 36 4212 117 140
234 2021 34 95 36 4212 117 140
235 2021 35 120 36 4212 117 140
236 2021 53 77 36 4212 117 140
I've tried using this code, but I'm not getting the desired results:
new_df3['Winner_Results'] = 0
for i in range(len(new_df3) - 1):
    if (new_df3['Weekly_Votes_Counts'].iloc[i] > new_df3['Winner_Num'].iloc[i]) & (new_df3['Weekly_Votes_Counts'].iloc[i+1] > new_df3['Winner_Num'].iloc[i+1]):
        new_df3['Winner_Results'].iloc[i] = 'Not Winner'
    else:
        new_df3['Winner_Results'].iloc[i] = 'Winner'
.py:6: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
new_df3['Winner_Results'].iloc[i] = 'Winner'
The Expected Result:
[![Excel Example][1]][1]
[1]: https://i.stack.imgur.com/BEcuX.png
Here's a way to get the result in your question:
import numpy as np

df['Counts_Gt_Num'] = df.Weekly_Counts > df.Winner_Num
df['Cumsum'] = df['Counts_Gt_Num'].cumsum()
df.loc[(~df['Counts_Gt_Num']) & df['Counts_Gt_Num'].shift(), 'Subtract'] = df['Cumsum']
df['Is_Winner'] = (df['Cumsum'] - df['Subtract'].ffill().fillna(0)).mod(2)
df['Winner_Results'] = np.where(df['Is_Winner'], 'Winner', 'Not Winner')
df = df.drop(columns=['Counts_Gt_Num', 'Cumsum', 'Subtract', 'Is_Winner'])
Output:
Year ISOweeknum Weekly_Counts NumOfWeeks Yearly_Count WeeklyAverage Winner_Num Winner_Results
0 2017 9 1561 44 12100 275 330 Winner
1 2017 10 1001 44 12100 275 330 Not Winner
2 2017 11 451 44 12100 275 330 Winner
3 2017 12 513 44 12100 275 330 Not Winner
4 2017 13 431 44 12100 275 330 Winner
5 2017 14 371 44 12100 275 330 Not Winner
6 2017 15 361 44 12100 275 330 Winner
7 2017 16 336 44 12100 275 330 Not Winner
8 2017 17 332 44 12100 275 330 Winner
9 2017 18 124 44 12100 275 330 Not Winner
10 2017 19 142 44 12100 275 330 Not Winner
11 2017 20 290 44 12100 275 330 Not Winner
12 2017 21 116 44 12100 275 330 Not Winner
13 2017 22 142 44 12100 275 330 Not Winner
14 2017 23 132 44 12100 275 330 Not Winner
15 2017 24 69 44 12100 275 330 Not Winner
16 2017 25 124 44 12100 275 330 Not Winner
17 2017 26 136 44 12100 275 330 Not Winner
18 2017 27 63 44 12100 275 330 Not Winner
Explanation:
mark rows with counts greater than num with a boolean in a new column Counts_Gt_Num
put the cumulative sum of that boolean in a new column Cumsum
create a new column Subtract by copying Cumsum for rows where Counts_Gt_Num is False but was True in the preceding row, and forward fill the NaN rows using ffill()
create the Is_Winner column by selecting as winners the rows at an even offset (0, 2, 4, ...) within each streak of non-zero values in Cumsum - Subtract
create Winner_Results by assigning the desired win/no-win label based on Is_Winner
drop the intermediate columns.
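For illustration only, a minimal runnable sketch of those steps on a small made-up frame (the Weekly_Counts and Winner_Num values below are hypothetical, not taken from the question; shift(fill_value=False) is used so the first row compares cleanly):

import numpy as np
import pandas as pd

# Hypothetical toy data: weekly counts and a fixed threshold
df = pd.DataFrame({'Weekly_Counts': [400, 500, 450, 480, 100, 120, 390, 410],
                   'Winner_Num': 330})

df['Counts_Gt_Num'] = df.Weekly_Counts > df.Winner_Num   # weeks over the threshold
df['Cumsum'] = df['Counts_Gt_Num'].cumsum()              # running count of such weeks
prev_over = df['Counts_Gt_Num'].shift(fill_value=False)  # was the previous week over?
# First row after a streak ends: remember the cumsum so it can be subtracted off
df.loc[(~df['Counts_Gt_Num']) & prev_over, 'Subtract'] = df['Cumsum']
df['Is_Winner'] = (df['Cumsum'] - df['Subtract'].ffill().fillna(0)).mod(2)
df['Winner_Results'] = np.where(df['Is_Winner'], 'Winner', 'Not Winner')
print(df[['Weekly_Counts', 'Winner_Results']])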

Convert a string date column with ordinal day numbers, abbreviated month names, and years to %Y-%m-%d

Given the following df, whose string date column uses an ordinal number for the day, an abbreviated month name, and a four-digit year:
date oil gas
0 1st Oct 2021 428 99
1 10th Sep 2021 401 101
2 2nd Oct 2020 189 74
3 10th Jan 2020 659 119
4 1st Nov 2019 691 130
5 30th Aug 2019 742 162
6 10th May 2019 805 183
7 24th Aug 2018 860 182
8 1st Sep 2017 759 183
9 10th Mar 2017 617 151
10 10th Feb 2017 591 149
11 22nd Apr 2016 343 88
12 10th Apr 2015 760 225
13 23rd Jan 2015 1317 316
I'm wondering how we could parse the date column to the standard %Y-%m-%d format.
My ideas so far: 1. strip the ordinal indicators ('st', 'nd', 'rd', 'th') from the day while keeping the day number, using re; 2. convert the abbreviated month name to a number (not sure %b alone will work); 3. finally convert the result to %Y-%m-%d.
This code may be useful for the first step:
df['date'] = df['date'].str.replace(r"(?<=\d)(st|nd|rd|th)", "", regex=True)
References:
https://metacpan.org/release/DROLSKY/DateTime-Locale-0.46/view/lib/DateTime/Locale/en_US.pm#Months
pd.to_datetime already handles this case if you don't specify the format parameter:
>>> pd.to_datetime(df['date'])
0 2021-10-01
1 2021-09-10
2 2020-10-02
3 2020-01-10
4 2019-11-01
5 2019-08-30
6 2019-05-10
7 2018-08-24
8 2017-09-01
9 2017-03-10
10 2017-02-10
11 2016-04-22
12 2015-04-10
13 2015-01-23
Name: date, dtype: datetime64[ns]
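If you do want the explicit three-step route from the question, here is a sketch (just one way to do it, with a small hypothetical frame standing in for the real one): strip the ordinal suffixes, parse with a %d %b %Y format, and optionally format back to a %Y-%m-%d string.

import pandas as pd

# Hypothetical stand-in for the 'date' column from the question
df = pd.DataFrame({'date': ['1st Oct 2021', '10th Sep 2021', '23rd Jan 2015']})

# 1. Strip the ordinal suffixes that follow the day number
cleaned = df['date'].str.replace(r'(?<=\d)(st|nd|rd|th)', '', regex=True)

# 2./3. Parse with an explicit format; %b matches the abbreviated month name
df['date'] = pd.to_datetime(cleaned, format='%d %b %Y')

# Optional: render back to a %Y-%m-%d string instead of a datetime64 column
df['date_str'] = df['date'].dt.strftime('%Y-%m-%d')
print(df)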

How to Rank using Rank formula basis different criteria in excel

I have below table
Month LoB Score Rank
Jan A 1
Jan B 2
Feb B 1
Feb B 2
Jan A 2
Mar C 1
Feb A 3
Jan A 3
Mar C 2
Mar A 1
Mar C 3
I want to rank the scores by Month and LoB. For example, in Jan for LoB A, whatever is highest gets Rank 1; similarly, in Jan for LoB B, whatever is highest gets Rank 1.
I understand that the INDEX and ROW formulas are to be used in conjunction with RANK.EQ, but I am unable to put it together at all. I would appreciate any help on this.
Thank you
Assuming Row1 is the header row and actual data lies in the range A2:C11, then try this...
In D2
=SUMPRODUCT(($A$2:$A$11=A2)*($B$2:$B$11=B2)*($C$2:$C$11>C2))+1
and copy it down.
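As a cross-check of the logic (not part of the Excel answer), a pandas sketch of the same within-group ranking, assuming columns named Month, LoB and Score:

import pandas as pd

# Hypothetical frame mirroring the first few rows of the table
df = pd.DataFrame({'Month': ['Jan', 'Jan', 'Feb', 'Feb', 'Jan'],
                   'LoB':   ['A',   'B',   'B',   'B',   'A'],
                   'Score': [1,     2,     1,     2,     2]})

# Rank within each (Month, LoB) group, highest Score first; method='min'
# matches the SUMPRODUCT idea of "count of strictly greater scores + 1"
df['Rank'] = (df.groupby(['Month', 'LoB'])['Score']
                .rank(method='min', ascending=False)
                .astype(int))
print(df)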
Good morning. Using the RANK formula =RANK(K5,K5:K34), the positions don't come out as I expect.
Marks obtained   Position   (total marks: 350)
290 29
346 9 (a student who obtained 346 marks is shown in 9th position, but should be 4th)
250 30
343 20
345 13
342 21
334 26
346 9
345 13
346 9
346 9
348 5
350 1
349 3
335 24
345 13
335 24
348 5
339 22
295 28
350 1
345 13
348 5
344 18
345 13
338 23
347 2
349 3
297 27

sumproduct using different criteria

I have the Excel table below and I would like to calculate the total per company, per department, per year. I used:
=SUMPRODUCT(--($A$2:$A$9=A12),--($B$2:$B$9=B12)*$C$2:$F$9)
but it doesn't seem to work.
A B C D E F
1 COMPANY DEPART. QUARTER 1 QUARTER 2 QUARTER 3 QUARTER 4
2 AB PRO 123 223 3354 556
3 CD PIV 222 235 223 568
4 CD PRO 236 254 184 223
5 AB STA 254 221 96 265
6 EF PIV 254 112 485 256
7 CD STA 558 185 996 231
8 GH PRO 548 696 698 895
9 AB PRO 148 254 318 229
10
11 TOTAL PER COMPANY PER DEPARTMENT PER YEAR:
12 AB PRO =
Assuming that in Row 12, Col A = AB and Col B = PRO, then:
=SUMPRODUCT((A2:A9=A12)*(B2:B9=B12) *C2:F9)
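As with the ranking question above, a quick pandas cross-check of the same calculation (not part of the Excel answer; column names taken from the table):

import pandas as pd

# Hypothetical frame mirroring the table above
df = pd.DataFrame({'COMPANY': ['AB', 'CD', 'CD', 'AB', 'EF', 'CD', 'GH', 'AB'],
                   'DEPART.': ['PRO', 'PIV', 'PRO', 'STA', 'PIV', 'STA', 'PRO', 'PRO'],
                   'QUARTER 1': [123, 222, 236, 254, 254, 558, 548, 148],
                   'QUARTER 2': [223, 235, 254, 221, 112, 185, 696, 254],
                   'QUARTER 3': [3354, 223, 184, 96, 485, 996, 698, 318],
                   'QUARTER 4': [556, 568, 223, 265, 256, 231, 895, 229]})

# Filter on the two criteria, then sum across the four quarter columns
mask = (df['COMPANY'] == 'AB') & (df['DEPART.'] == 'PRO')
total = df.loc[mask, ['QUARTER 1', 'QUARTER 2', 'QUARTER 3', 'QUARTER 4']].sum().sum()
print(total)  # rows 2 and 9 of the table: 4256 + 949 = 5205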

R: Reversing the data in a time series object

I figured out a way to backcast (i.e. predict the past) with a time series. Now I'm just struggling with the programming in R.
I would like to reverse the time series data so that I can forecast the past. How do I do this?
Say the original time series object looks like this:
Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
2008 116 99 115 101 112 120 120 110 143 136 147 142
2009 117 114 133 134 139 147 147 131 125 143 136 129
I want it to look like this for the 'backcasting':
Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
2008 129 136 143 125 131 147 147 139 134 133 114 117
2009 142 147 136 143 110 120 120 112 101 115 99 116
Note, I didn't forget to change the years - I am basically mirroring/reversing the data and keeping the years, then going to forecast.
I hope this can be done in R? Or should I export and do it in Excel somehow?
Try this:
tt <- ts(1:24, start = 2008, freq = 12)
tt[] <- rev(tt)
ADDED: This also works and does not modify tt:
replace(tt, TRUE, rev(tt))
You can just coerce the matrix to a vector, reverse it, and make it a matrix again. Here's an example:
mat <- matrix(seq(24),nrow=2,byrow=TRUE)
> mat
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12]
[1,] 1 2 3 4 5 6 7 8 9 10 11 12
[2,] 13 14 15 16 17 18 19 20 21 22 23 24
> matrix( rev(mat), nrow=nrow(mat) )
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12]
[1,] 24 23 22 21 20 19 18 17 16 15 14 13
[2,] 12 11 10 9 8 7 6 5 4 3 2 1
I found this post by Hyndman at http://www.r-bloggers.com/backcasting-in-r/ and am basically pasting in his solution, which in my opinion provides a complete answer to your question.
library(forecast)
x <- WWWusage
h <- 20
f <- frequency(x)
# Reverse time
revx <- ts(rev(x), frequency=f)
# Forecast
fc <- forecast(auto.arima(revx), h)
plot(fc)
# Reverse time again
fc$mean <- ts(rev(fc$mean),end=tsp(x)[1] - 1/f, frequency=f)
fc$upper <- fc$upper[h:1,]
fc$lower <- fc$lower[h:1,]
fc$x <- x
# Plot result
plot(fc, xlim=c(tsp(x)[1]-h/f, tsp(x)[2]))
