Sum of next n rows in Python

I have a dataframe grouped at the product/store/day_id level. Say it looks like the below; I need to create a column with a rolling sum.
prod store day_id visits
111 123 1 2
111 123 2 3
111 123 3 1
111 123 4 0
111 123 5 1
111 123 6 0
111 123 7 1
111 123 8 1
111 123 9 2
I need to create a dataframe as below:
prod store day_id visits rolling_4_sum cond
111 123 1 2 6 1
111 123 2 3 5 1
111 123 3 1 2 1
111 123 4 0 2 1
111 123 5 1 4 0
111 123 6 0 4 0
111 123 7 1 NA 0
111 123 8 1 NA 0
111 123 9 2 NA 0
I am looking to create a cond column that iteratively checks a condition: say, if rolling_4_sum is greater than 5, then make the next 4 rows 1; else do nothing, i.e. even if the condition is not met, retain whatever was already filled in before. Do this check for each row up to the 7th row.
How can I achieve this using Python? I am trying
d1['rolling_4_sum'] = d1.groupby(['prod', 'store']).visits.rolling(4).sum()
but I am getting an error.
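For reference, the error here is most likely an index-alignment problem: groupby(...).rolling(...) returns a Series carrying a (prod, store, original-row) MultiIndex, which cannot be assigned straight back into the frame. A hedged sketch of the usual fix, dropping the group levels so the result realigns with d1's index:
d1['rolling_4_sum'] = (d1.groupby(['prod', 'store']).visits
                         .rolling(4).sum()
                         .reset_index(level=[0, 1], drop=True))
This still labels each sum at the right edge of its window; the answer below handles the forward-looking placement.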

Rolling sums can be formed with the rolling method, using a boxcar window:
df['rolling_4_sum'] = df.visits.rolling(4, win_type='boxcar', center=True).sum().shift(-2)
The shift by -2 is because you apparently want the sums to be placed at the left edge of the window.
Next, the condition about the rolling sums, here taken as less than 7:
df['cond'] = 0
for k in range(1, 5):
    df.loc[df.rolling_4_sum.shift(k) < 7, 'cond'] = 1
A new column is inserted and filled with 0; then for each k = 1, 2, 3, 4, look k steps back: if the sum k steps back is less than 7, set the condition to 1.
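Putting it together as a runnable sketch on the sample data (my own assembly, not part of the original answer): plain rolling(4).sum().shift(-3) should give the same left-edge labelling without the scipy dependency that win_type='boxcar' requires, and using the question's > 5 threshold with shifts k = 0..3 (so the triggering row itself is included) reproduces the desired cond column:

import pandas as pd

df = pd.DataFrame({
    'prod': [111] * 9,
    'store': [123] * 9,
    'day_id': range(1, 10),
    'visits': [2, 3, 1, 0, 1, 0, 1, 1, 2],
})

# Sum of the current row and the next 3: label at the left edge of the window.
df['rolling_4_sum'] = df.visits.rolling(4).sum().shift(-3)

# Mark the triggering row and the 3 rows after it whenever the sum exceeds 5.
df['cond'] = 0
for k in range(4):
    df.loc[df.rolling_4_sum.shift(k) > 5, 'cond'] = 1

print(df)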

Related

Sum whenever another column changes

I have a df with VENDOR, INVOICE and AMOUNT. I want to create a column called ITEM, which starts at 1, and when the invoice number changes it will change to 2 and so on.
I tried using cumsum, but it isn't actually working, and it makes sense that it doesn't: the way I wrote the code, it sums 1 within the same invoice and starts over when the invoice changes.
data = pd.read_csv('data.csv')
data['ITEM_drop'] = 1
s = data['INVOICE'].ne(data['INVOICE'].shift()).cumsum()
data['ITEM'] = data.groupby(s)['ITEM_drop'].cumsum()
Output:
VENDOR INVOICE AMOUNT ITEM_drop ITEM
A 123 10 1 1
A 123 12 1 2
A 456 44 1 1
A 456 5 1 2
A 456 10 1 3
B 999 7 1 1
B 999 1 1 2
And what I want is:
VENDOR INVOICE AMOUNT ITEM_drop ITEM
A 123 10 1 1
A 123 12 1 1
A 456 44 1 2
A 456 5 1 2
A 456 10 1 2
B 999 7 1 3
B 999 1 1 3
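No answer is shown above, but a hedged sketch: the helper series s in the question's own code is already the desired counter, since the cumulative sum of the change points numbers each run of equal invoices 1, 2, 3, ... consecutively:

import pandas as pd

data = pd.DataFrame({
    'VENDOR': ['A', 'A', 'A', 'A', 'A', 'B', 'B'],
    'INVOICE': [123, 123, 456, 456, 456, 999, 999],
    'AMOUNT': [10, 12, 44, 5, 10, 7, 1],
})

# True wherever INVOICE differs from the previous row; the cumulative
# sum of those Trues numbers the runs of equal invoices consecutively.
data['ITEM'] = data['INVOICE'].ne(data['INVOICE'].shift()).cumsum()
print(data)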

Excel - Lookup the group based on value range per segment

I have a table like below.
segmentnum group 1 group 2 group 3 group 4
1 0 12 33 66
2 0 3 10 26
3 0 422 1433 3330
And a table like below.
vol segmentnum
0 1
58 1
66 1
48 1
9 2
13 2
7 2
10 3
1500 3
I'd like to add a column that tells me which group the vol for a given segmentnum belongs to, such that:
Group 1 = from the group 1 value up to, but not including, the group 2 value
Group 2 = from the group 2 value up to, but not including, the group 3 value
Group 3 = from the group 3 value up to, and including, the group 4 value
Desired result:
vol segmentnum group
0 1 1
58 1 3
66 1 3
48 1 3
9 2 2
13 2 3
7 2 2
10 3 3
1500 3 3
The original answer referred to an accompanying screenshot (not reproduced here); assuming vol is in G2, segmentnum is in H2, and the thresholds table occupies A1:E4, put this in I2 and drag down:
=MATCH(G2, INDEX(B$2:E$4, MATCH(H2, A$2:A$4, 0), 0))
The inner MATCH finds the thresholds row for the given segmentnum, INDEX returns that whole row, and the outer MATCH, with its match type omitted (defaulting to 1), returns the position of the largest threshold less than or equal to vol, which is the group number. While these results differ from yours, I believe they are correct.
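For readers who would rather do this range lookup in pandas, a hedged sketch of the same "largest threshold less than or equal" logic (the frame names thresholds and vols are my own):

import numpy as np
import pandas as pd

# Thresholds table, indexed by segmentnum.
thresholds = pd.DataFrame(
    {'group 1': [0, 0, 0], 'group 2': [12, 3, 422],
     'group 3': [33, 10, 1433], 'group 4': [66, 26, 3330]},
    index=[1, 2, 3])

vols = pd.DataFrame({'vol': [0, 58, 66, 48, 9, 13, 7, 10, 1500],
                     'segmentnum': [1, 1, 1, 1, 2, 2, 2, 3, 3]})

# The insertion point of vol in the segment's sorted thresholds is the
# position of the largest threshold <= vol, mirroring MATCH's default.
vols['group'] = [
    int(np.searchsorted(thresholds.loc[seg].to_numpy(), v, side='right'))
    for v, seg in zip(vols.vol, vols.segmentnum)
]
print(vols)

Like the MATCH formula, this agrees with the answer's results rather than with the question's desired table.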

Taking all duplicate values in column as single value in pandas

My current dataframe is:
Name term Grade
0 A 1 35
1 A 2 40
2 B 1 50
3 B 2 45
I want to get a dataframe as:
Name term Grade
0 A 1 35
2 40
1 B 1 50
2 45
Is it possible to get my expected output? If yes, how can I do it?
Use duplicated for a boolean mask with numpy.where:
import numpy as np

mask = df['Name'].duplicated()
# more general:
# mask = df['Name'].ne(df['Name'].shift()).cumsum().duplicated()
df['Name'] = np.where(mask, '', df['Name'])
print (df)
Name term Grade
0 A 1 35
1 2 40
2 B 1 50
3 2 45
The difference between the masks can be seen in a changed DataFrame:
print (df)
Name term Grade
0 A 1 35
1 A 2 40
2 B 1 50
3 B 2 45
4 A 4 43
5 A 3 46
If the same value forms multiple separate consecutive runs, like the two A groups here, the general solution is needed:
mask = df['Name'].ne(df['Name'].shift()).cumsum().duplicated()
df['Name'] = np.where(mask, '', df['Name'])
print (df)
Name term Grade
0 A 1 35
1 2 40
2 B 1 50
3 2 45
4 A 4 43
5 3 46
With the simple duplicated mask, by contrast, the second A run is blanked as well:
mask = df['Name'].duplicated()
df['Name'] = np.where(mask, '', df['Name'])
print (df)
Name term Grade
0 A 1 35
1 2 40
2 B 1 50
3 2 45
4 4 43
5 3 46
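As an aside, not part of the original answer: if the blanking is purely cosmetic, applying set_index to the original frame gives a similar display for free, since pandas sparsifies repeated outer index labels when printing:

print (df.set_index(['Name', 'term']))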

Excel auto increment id by 01 when cell changes value

I have a file with three columns: ID, which is blank; NAME, which has some names; and PARENT_ID, which stores the parent ID of the name.
What I want to do is, in the ID column, take the parent ID and append a two-digit number that increments by 01. For example, we have 10 cats with parent ID 1: the ID column should take the parent ID "1" and append "01" for the first cat, "02" for the second cat, and so on, so that each cat gets an auto-incrementing ID: 101, 102, ... 110.
Then the dogs start, so it takes their parent ID, which is "2", and again appends incrementing values for each dog: 201, 202, etc.
Then the fish: 301, 302.
Here is an example of what I am trying to do.
ID NAME PARENT_ID
101 cat 1
102 cat1 1
103 cat2 1
104 cat3 1
105 cat4 1
106 cat5 1
107 cat6 1
108 cat7 1
109 cat8 1
110 cat9 1
111 cat10 1
201 dog 2
202 dog1 2
203 dog2 2
204 dog3 2
205 dog4 2
206 dog5 2
301 fish 3
302 fish 3
The NAME column is not of concern; I just placed it there so you can understand the example better.
I am not familiar with Visual Basic, and I tried to accomplish this with formulas, but with no luck.
Thank you for any help.
Put this in A2 and copy/drag down:
=IF(C2<>C1,C2*100+1,A1+1)
Paste the below formula in A2 and drag it down:
=C2&RIGHT("00"&COUNTIF($C$1:C2,C2),2)
If any parent ID has more than 99 records, widen the counter to three digits:
=C2&RIGHT("00"&COUNTIF($C$1:C2,C2),3)
Not a VBA approach, but a formula approach -- If this is something like what you're looking for:
Row A B C D E F
1 ID NAME PARENT_ID RunningName RunningID NewID
2 101 cat 1 cat 0 cat
3 102 cat1 1 cat 1 cat01
4 103 cat2 1 cat 2 cat02
5 104 cat3 1 cat 3 cat03
6 105 cat4 1 cat 4 cat04
7 106 cat5 1 cat 5 cat05
8 107 cat6 1 cat 6 cat06
9 108 cat7 1 cat 7 cat07
10 109 cat8 1 cat 8 cat08
11 110 cat9 1 cat 9 cat09
12 111 cat10 1 cat 10 cat10
13 201 dog 2 dog 0 dog
14 202 dog1 2 dog 1 dog01
15 203 dog2 2 dog 2 dog02
16 204 dog3 2 dog 3 dog03
17 205 dog4 2 dog 4 dog04
18 206 dog5 2 dog 5 dog05
19 301 fish 3 fish 0 fish
20 302 fish 3 fish 1 fish01
...then I used the following formulas:
D2: =if(a2="","",if(sum(C2)<>sum(C1),trim(B2),trim(D1)))
E2: =if(a2="","",if(sum(C2)<>sum(C1),0,sum(E1)+1))
F2: =if(a2="","",trim(D2)&if(sum(E2)=0,"",text(E2,"00")))
I then replicated those cells down the column as far as I cared to go. You can make the "Running" columns a very light grey text color so as to render them non-distracting to the user.
Hopefully this can help inspire you to craft a solution that works for you.
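For completeness, the same numbering is a one-liner in pandas (a hedged sketch of my own, assuming a frame with NAME and PARENT_ID columns as in the question):

import pandas as pd

df = pd.DataFrame({
    'NAME': ['cat', 'cat1', 'cat2', 'dog', 'dog1', 'fish'],
    'PARENT_ID': [1, 1, 1, 2, 2, 3],
})

# cumcount() numbers the rows within each PARENT_ID group starting at 0;
# adding 1 and combining with PARENT_ID * 100 yields 101, 102, ..., 201, ...
df['ID'] = df['PARENT_ID'] * 100 + df.groupby('PARENT_ID').cumcount() + 1
print(df)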

Using Pandas filtering non-numeric data from two columns of a Dataframe

I'm loading a Pandas dataframe which has many data types (loaded from Excel). Two particular columns should be floats, but occasionally a researcher entered a random comment like "not measured." I need to drop any rows where the value in either of those two columns is not a number, while preserving the non-numeric data in the other columns. A simple use case looks like this (the real table has several thousand rows...)
import pandas as pd
df = pd.DataFrame(dict(A = pd.Series([1,2,3,4,5]), B = pd.Series([96,33,45,'',8]), C = pd.Series([12,'Not measured',15,66,42]), D = pd.Series(['apples', 'oranges', 'peaches', 'plums', 'pears'])))
Which results in this data table:
A B C D
0 1 96 12 apples
1 2 33 Not measured oranges
2 3 45 15 peaches
3 4 66 plums
4 5 8 42 pears
I'm not clear how to get to this table:
A B C D
0 1 96 12 apples
2 3 45 15 peaches
4 5 8 42 pears
I tried dropna, but the types are "object" since there are non-numeric entries.
I can't convert the values to floats without either converting the whole table, or doing one series at a time which loses the relationship to the other data in the row. Perhaps there is something simple I'm not understanding?
You can first create a subset with columns B and C, apply to_numeric, and check that all values are notnull. Then use boolean indexing:
print(df[['B','C']].apply(pd.to_numeric, errors='coerce').notnull().all(axis=1))
0 True
1 False
2 True
3 False
4 True
dtype: bool
print(df[df[['B','C']].apply(pd.to_numeric, errors='coerce').notnull().all(axis=1)])
A B C D
0 1 96 12 apples
2 3 45 15 peaches
4 5 8 42 pears
The next solution uses str.isdigit with isnull and xor (^):
print(df['B'].str.isdigit().isnull() ^ df['C'].str.isdigit().notnull())
0 True
1 False
2 True
3 False
4 True
dtype: bool
print(df[df['B'].str.isdigit().isnull() ^ df['C'].str.isdigit().notnull()])
A B C D
0 1 96 12 apples
2 3 45 15 peaches
4 5 8 42 pears
But the solution with to_numeric combined with isnull and notnull is fastest:
print(df[pd.to_numeric(df['B'], errors='coerce').notnull()
         ^ pd.to_numeric(df['C'], errors='coerce').isnull()])
A B C D
0 1 96 12 apples
2 3 45 15 peaches
4 5 8 42 pears
Timings:
#len(df) = 5k
df = pd.concat([df]*1000).reset_index(drop=True)
In [611]: %timeit df[pd.to_numeric(df['B'], errors='coerce').notnull() ^ pd.to_numeric(df['C'], errors='coerce').isnull()]
1000 loops, best of 3: 1.88 ms per loop
In [612]: %timeit df[df['B'].str.isdigit().isnull() ^ df['C'].str.isdigit().notnull()]
100 loops, best of 3: 16.1 ms per loop
In [613]: %timeit df[df[['B','C']].apply(pd.to_numeric, errors='coerce').notnull().all(axis=1)]
The slowest run took 4.28 times longer than the fastest. This could mean that an intermediate result is being cached
100 loops, best of 3: 3.49 ms per loop
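One caution, mine rather than the answerer's: the ^ versions only work here because no row is non-numeric in both B and C at once; a row bad in both columns would slip through the xor. The robust general form is an AND of two notnull masks:

print(df[pd.to_numeric(df['B'], errors='coerce').notnull()
         & pd.to_numeric(df['C'], errors='coerce').notnull()])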
