Dataframe: Computed row based on cell above and cell on the left - python-3.x

I have a dataframe with a bunch of integer values. I then compute the column totals and append it as a new row to the dataframe. So far so good.
Now I want to append another computed row where the value of each cell is the sum of cell above and the cell on the left. You can see what I mean below:
----------------------------------------------------------------
|250000 |0 |145000 |145000 |220000 |165000 |145000 |145000 |
----------------------------------------------------------------
|250000 |250000 |395000 |540000 |760000 |925000 |1070000|1215000 |
----------------------------------------------------------------
How can this be done?

I think you need Series.cumsum with select last row (total row) by DataFrame.iloc:
df = pd.DataFrame({
'B':[4,5,4],
'C':[7,8,9],
'D':[1,3,5],
'E':[5,3,6],
})
df.loc['sum'] = df.sum()
df.loc['cumsum'] = df.iloc[-1].cumsum()
#if need only cumsum row
#df.loc['cumsum'] = df.sum().cumsum()
print (df)
B C D E
0 4 7 1 5
1 5 8 3 3
2 4 9 5 6
sum 13 24 9 14
cumsum 13 37 46 60

Related

How to filter a dataframe using a cumulative sum of a column as parameter

I have this df:
df=pd.DataFrame({'Name':['John','Mike','Lucy','Mary','Andy'],
'Age':[10,23,13,12,15],
'%':[20,20,10,25,25]})
I want to filter this df by taking from row 0 to row n until the sum of column % = 50
I don't want to sort the % column or the df, I just need to get it's first row where % column sums 50
The output is:
filtered=pd.DataFrame({'Name':['John','Mike','Lucy'],'Age':[10,23,13],'%':[20,20,10]})
cumsum, boolean index and slice using the loc or iloc accessor
df.iloc[:(df['%'].cumsum()==50).idxmax()+1,:]
Name Age %
0 John 10 20
1 Mike 23 20
2 Lucy 13 10

Python create a column based on the values of each row of another column

I have a pandas dataframe as below:
import pandas as pd
df = pd.DataFrame({'ORDER':["A", "A", "A", "B", "B","B"], 'GROUP': ["A_2018_1B1", "A_2018_1B1", "A_2018_1M1", "B_2018_I000_1C1", "B_2018_I000_1B1", "B_2018_I000_1C1H"], 'VAL':[1,3,8,5,8,10]})
df
ORDER GROUP VAL
0 A A_2018_1B1 1
1 A A_2018_1B1H 3
2 A A_2018_1M1 8
3 B B_2018_I000_1C1 5
4 B B_2018_I000_1B1 8
5 B B_2018_I000_1C1H 10
I want to create a column "CAL" as sum of 'VAL' where GROUP name is same for all the rows expect H character in the end. So, for example, 'VAL' column for 1st two rows will be added because the only difference between the 'GROUP' is 2nd row has H in the last. Row 3 will remain as it is, Row 4 and 6 will get added and Row 5 will remain same.
My expected output
ORDER GROUP VAL CAL
0 A A_2018_1B1 1 4
1 A A_2018_1B1H 3 4
2 A A_2018_1M1 8 8
3 B B_2018_I000_1C1 5 15
4 B B_2018_I000_1B1 8 8
5 B B_2018_I000_1C1H 10 15
Try with replace then transform
df.groupby(df.GROUP.str.replace('H','')).VAL.transform('sum')
0 4
1 4
2 8
3 15
4 8
5 15
Name: VAL, dtype: int64
df['CAL'] = df.groupby(df.GROUP.str.replace('H','')).VAL.transform('sum')

How can I add previous column values to to get new value in Excel?

I am working on graph and in need data in below format. I have data in COL A. I need to calculate COL B values as in below picture.
What is the formula for obtaining this in excel?
You can do with cumsum and shift:
# sample data
df = pd.DataFrame({'COL A': np.arange(11)})
df['COL B'] = df['COL A'].shift(fill_value=0).cumsum()
Output:
COL A COL B
0 0 0
1 1 0
2 2 1
3 3 3
4 4 6
5 5 10
6 6 15
7 7 21
8 8 28
9 9 36
10 10 45
Use simple MS technique.
You can use the formula (A3*A2)/2 for COL2

How to compare and iterate over certain rows in column while creating output as new column in dataframe?

I am wanting to backtest a trading strategy.
The data I have is OHLC (open,high,low, close) for a financial product, that is formatted into a dataframe with 300 rows (each row is 1 day) like so:
datetime O H L C
2020-03-24 1 2 3 4
2020-03-23 5 6 7 8
2020-03-22 9 1 2 3
2020-03-21 9 2 2 3
2020-03-20 9 3 2 3
2020-03-19 9 4 2 3
2020-03-18 9 5 2 3
What I want to do is, starting on the date closet to current date, in this case row with 2020-03-24:
1. take the number in column `L`
2. compare if the number in column `L` is at any point greater than the values in column `L` for the previous two days.
3. Create and fill in new column if value from 1 is greater than value in interation.
4. Repeat steps 1, 2, & 3 but take the number in column `L` that was not into included in the iteration.
Example:
1. Starting on row `2020-03-24`, take value `3`
2. Is `3` at any point greater than `7` or `2` for rows starting with `2020-03-23` and `2020-03-22`?
3. YES,assign `TRUE` to column `comparison` in df for row starting with `2020-03-24`
4. Repeat, starting on row `2020-03-21`, take value `2` in column `L`
4a. Is `2` at any point greater than values in rows `2020-03-20` or `2020-03-19`?
4b. NO, assign `FALSE` to column `comparison` in df for row starting with `2020-03-21`.
New df looks like this:
datetime O H L C Comparison
2020-03-24 1 2 3 4 TRUE
2020-03-23 5 6 7 8
2020-03-22 9 1 2 3
2020-03-21 9 2 2 3 FALSE
2020-03-20 9 3 2 3
2020-03-19 9 4 2 3
2020-03-18 9 5 2 3
The only way I know how to do this is with a FOR loop, but that doesnt work on iterating and comparing only certain subsets like so:
for i in df['L']:
if df['L'] >
You need a combination of rolling() and shift():
df.index = pd.to_datetime(df.index)
df.sort_index(inplace=True, ascending=False)
df['Comparison'] = False
df['Comparison'] = df.loc[:, 'L'] > df.loc[:, 'L'].rolling(window=2).min().shift(-2)
With rolling() you get the minimum of the last two days, shift() moves it to the right row.

if specific value/string occurs in the entire dataframe I want to sum its index values

i have a dataframe in which I need to find a specific image name in the entire dataframe and sum its index values every time they are found. SO my data frame looks like:
c 1 2 3 4
g
0 180731-1-61.jpg 180731-1-61.jpg 180731-1-61.jpg 180731-1-61.jpg
1 1209270004-2.jpg 180609-2-31.jpg 1209270004-2.jpg 1209270004-2.jpg
2 1209270004-1.jpg 180414-2-38.jpg 180707-1-31.jpg 1209050002-1.jpg
3 1708260004-1.jpg 1209270004-2.jpg 180609-2-31.jpg 1209270004-1.jpg
4 1108220001-5.jpg 1209270004-1.jpg 1108220001-5.jpg 1108220001-2.jpg
I need to find the 1209270004-2.jpg in entire dataframe. And as it is found at index 1 and 3 I want to add the index values so it should be
1+3+1+1=6.
I tried the code:
img_fname = '1209270004-2.jpg'
df2 = df1[df1.eq(img_fname).any(1)]
sum = int(np.sum(df2.index.values))
print(sum)
I am getting the answer of sum 4 i.e 1+3=4. But it should be 6.
If the string occurence is only once or twice or thrice or four times like for eg 180707-1-31 is in column 3. then the sum should be 45+45+3+45 = 138. Which signifies that if the string is not present in the dataframe take vallue as 45 instead the index value.
You can multiple boolean mask by index values and then sum:
img_fname = '1209270004-1.jpg'
s = df1.eq(img_fname).mul(df1.index.to_series(), 0).sum()
print (s)
1 2
2 4
3 0
4 3
dtype: int64
out = np.where(s == 0, 45, s).sum()
print (out)
54
If dataset does not have many columns, this can also work with your original question
df1 = pd.DataFrame({"A":["aa","ab", "cd", "ab", "aa"], "B":["ab","ab", "ab", "aa", "ab"]})
s = 0
for i in df1.columns:
s= s+ sum(df1.index[df1.loc[:,i] == "ab"].tolist())
Input :
A B
0 aa ab
1 ab ab
2 cd ab
3 ab aa
4 aa ab
Output :11
Based on second requirement:

Resources