How to compare and iterate over certain rows in a column while creating output as a new column in a dataframe? - python-3.x

I want to backtest a trading strategy.
The data I have is OHLC (open, high, low, close) for a financial product, formatted into a dataframe with 300 rows (each row is 1 day) like so:
datetime O H L C
2020-03-24 1 2 3 4
2020-03-23 5 6 7 8
2020-03-22 9 1 2 3
2020-03-21 9 2 2 3
2020-03-20 9 3 2 3
2020-03-19 9 4 2 3
2020-03-18 9 5 2 3
What I want to do is, starting on the date closest to the current date, in this case the row with 2020-03-24:
1. Take the number in column `L`.
2. Check whether that number is at any point greater than the values in column `L` for the previous two days.
3. Create and fill in a new column recording whether the value from step 1 is greater than a value in the iteration.
4. Repeat steps 1, 2, and 3, but take the next number in column `L` that was not included in the previous iteration.
Example:
1. Starting on row `2020-03-24`, take value `3`
2. Is `3` at any point greater than `7` or `2` for rows starting with `2020-03-23` and `2020-03-22`?
3. YES, assign `TRUE` to column `comparison` in df for the row starting with `2020-03-24`
4. Repeat, starting on row `2020-03-21`: take value `2` in column `L`
4a. Is `2` at any point greater than the values in rows `2020-03-20` or `2020-03-19`?
4b. NO, assign `FALSE` to column `comparison` in df for row starting with `2020-03-21`.
New df looks like this:
datetime O H L C Comparison
2020-03-24 1 2 3 4 TRUE
2020-03-23 5 6 7 8
2020-03-22 9 1 2 3
2020-03-21 9 2 2 3 FALSE
2020-03-20 9 3 2 3
2020-03-19 9 4 2 3
2020-03-18 9 5 2 3
The only way I know how to do this is with a for loop, but that doesn't work for iterating over and comparing only certain subsets, like so:
for i in df['L']:
if df['L'] >

You need a combination of rolling() and shift():
df.index = pd.to_datetime(df.index)
df.sort_index(inplace=True, ascending=False)
df['Comparison'] = False
df['Comparison'] = df.loc[:, 'L'] > df.loc[:, 'L'].rolling(window=2).min().shift(-2)
With rolling() you get the minimum over a two-day window, and shift(-2) aligns that minimum with the row two days later, so the comparison checks whether `L` is greater than at least one of the previous two days' values.
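A quick end-to-end check on the sample data (a self-contained sketch; the frame construction below is assumed, since the original post does not show it):

import pandas as pd

data = {'O': [1, 5, 9, 9, 9, 9, 9],
        'H': [2, 6, 1, 2, 3, 4, 5],
        'L': [3, 7, 2, 2, 2, 2, 2],
        'C': [4, 8, 3, 3, 3, 3, 3]}
dates = ['2020-03-24', '2020-03-23', '2020-03-22', '2020-03-21',
         '2020-03-20', '2020-03-19', '2020-03-18']
df = pd.DataFrame(data, index=pd.to_datetime(dates))
df.sort_index(inplace=True, ascending=False)

# minimum of the previous two days, aligned with the current row
df['Comparison'] = df['L'] > df['L'].rolling(window=2).min().shift(-2)
print(df)

Note that this evaluates the comparison for every row (the last two rows get False because there are not two earlier days to compare against), rather than only every third row as in the worked example.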

Related

Dynamically updating row values based on a condition in pandas

I am running a simulation test where I want to dynamically change some values present in the rows of each column based on a certain set of conditions.
The Problem Statement
My dataset has 400 rows, and my first test case is to update 5% of the rows in each column, so 5% of 400 = 20 rows need to be updated.
These 20 rows should only be updated for the top 5 categories present in my dataset, so 4 rows per category need to be updated.
My dataframe looks like this:
A B C D Category
1 10 3 4 X
4 9 6 9 Y
9 3 7 10 XX
10 1 9 7 YY
10 1 9 7 ZZ
10 1 9 7 YZZ
10 1 9 7 YZZ
10 1 9 7 YYYY
......400 rows
The conditions are:
While updating the rows, I want to make sure that the 20 rows (5% of the overall dataset) are updated only where the top 5 categories are encountered. In my case the top 5 categories are X, Y, XX, YY and ZZ. Values in these rows should be updated to 7 wherever the previous value was 1, 2, 3, 4, 5 or 6.
The resultant dataframe should look like this:
A B C D Category
7 10 7 7 X
7 9 7 9 Y
9 7 7 10 XX
10 7 9 7 YY
10 7 9 7 ZZ
10 1 9 7 YZZ
10 1 9 7 YZZ
10 1 9 7 YYYY
......400 rows
In the resultant dataframe, the categories outside the top 5 (in this case YZZ and YYYY) are not affected. I can't show every updated row, but in the dataframe above, for example, 2 rows in column A whose previous value was <= 6 have been updated to 7, and the other selected rows are likewise updated to 7 wherever the condition is met.
How can I achieve this?
You can try the following logic:
# get only desired Categories
m = df['Category'].isin(['X', 'Y', 'XX', 'YY', 'ZZ'])
# select 20 random rows from the above
idx = df[m].sample(n=20).index
# replace values between 1 and 6 (inclusive) with 7, on the numeric columns only
num_cols = df.columns.drop('Category')
df.loc[idx, num_cols] = df.loc[idx, num_cols].mask(df.loc[idx, num_cols].ge(1) & df.loc[idx, num_cols].le(6), 7)
If you would rather have 4 rows per Category, use this variant for the random sampling:
idx = df[m].groupby('Category').sample(n=4).index
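A self-contained run of the above, with a randomly generated 400-row frame standing in for the real data (the seed and value ranges are assumptions for illustration only):

import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
df = pd.DataFrame(rng.integers(1, 11, size=(400, 4)), columns=list('ABCD'))
df['Category'] = rng.choice(['X', 'Y', 'XX', 'YY', 'ZZ', 'YZZ', 'YYYY'], size=400)

# get only desired Categories
m = df['Category'].isin(['X', 'Y', 'XX', 'YY', 'ZZ'])
# select 4 random rows per Category (20 rows in total)
idx = df[m].groupby('Category').sample(n=4, random_state=42).index
# replace values between 1 and 6 (inclusive) with 7, numeric columns only
num_cols = df.columns.drop('Category')
df.loc[idx, num_cols] = df.loc[idx, num_cols].mask(df.loc[idx, num_cols].ge(1) & df.loc[idx, num_cols].le(6), 7)
print(df.loc[idx])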

Excel: find row in a table by condition

I have two tables with the same column names but different data.
Table1:
A B C D
1 3 4 5 OK
2 6 7 8
3 9 8 7
Table2:
A B C D
1 9 8 7
2 1 2 8
3 3 4 5
I want to write a formula in column D of Table2 that copies the D-column value from the Table1 row whose values match. (I want to find the same row in the other table and take its D-column value.)
I could use SUMIFS(Table1!$D:$D, Table1!$A:$A, "="&A1, Table1!$B:$B, "="&B1, Table1!$C:$C, "="&C1), but all the rows are unique and I have string values in $D:$D - I don't need a SUM exactly, I need only the one string value.
Is there a function to find a column value by a row condition?
The result I want:
A B C D
1 9 8 7
2 1 2 8
3 3 4 5 OK
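The same row-matching lookup, expressed in pandas for reference (a sketch assuming the two tables are loaded as DataFrames table1 and table2): match on all three key columns and copy D across with a left merge.

import pandas as pd

table1 = pd.DataFrame({'A': [3, 6, 9], 'B': [4, 7, 8], 'C': [5, 8, 7], 'D': ['OK', '', '']})
table2 = pd.DataFrame({'A': [9, 1, 3], 'B': [8, 2, 4], 'C': [7, 8, 5]})

# each table2 row picks up D from the table1 row whose A, B and C all match
result = table2.merge(table1, on=['A', 'B', 'C'], how='left')
print(result)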

Python create a column based on the values of each row of another column

I have a pandas dataframe as below:
import pandas as pd
df = pd.DataFrame({'ORDER': ["A", "A", "A", "B", "B", "B"], 'GROUP': ["A_2018_1B1", "A_2018_1B1H", "A_2018_1M1", "B_2018_I000_1C1", "B_2018_I000_1B1", "B_2018_I000_1C1H"], 'VAL': [1, 3, 8, 5, 8, 10]})
df
ORDER GROUP VAL
0 A A_2018_1B1 1
1 A A_2018_1B1H 3
2 A A_2018_1M1 8
3 B B_2018_I000_1C1 5
4 B B_2018_I000_1B1 8
5 B B_2018_I000_1C1H 10
I want to create a column "CAL" as the sum of 'VAL' over rows whose GROUP name is the same except for an H character at the end. So, for example, the 'VAL' values of the first two rows will be added because the only difference between their 'GROUP' values is that the 2nd row ends in H. Row 3 will remain as it is, rows 4 and 6 will get added, and row 5 will remain the same.
My expected output
ORDER GROUP VAL CAL
0 A A_2018_1B1 1 4
1 A A_2018_1B1H 3 4
2 A A_2018_1M1 8 8
3 B B_2018_I000_1C1 5 15
4 B B_2018_I000_1B1 8 8
5 B B_2018_I000_1C1H 10 15
Try with replace, then transform:
df.groupby(df.GROUP.str.replace('H','')).VAL.transform('sum')
0 4
1 4
2 8
3 15
4 8
5 15
Name: VAL, dtype: int64
df['CAL'] = df.groupby(df.GROUP.str.replace('H','')).VAL.transform('sum')
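Note that replace('H', '') strips every H in the group name, not just a trailing one. If group names may contain an H elsewhere, an anchored regex is the safer variant (an adjustment not shown in the original answer):

df['CAL'] = df.groupby(df.GROUP.str.replace(r'H$', '', regex=True)).VAL.transform('sum')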

pandas selecting rows whose sum equals a value in another column

Hi guys, I have a DataFrame where I want to first group rows by a column and then find any rows that sum up to a given value in another column.
A B c
XCD 1 5
FFF 12 2
VB 3 6
XCD 8 5
AAA 2 7
AAA 5 7
XCD 4 5
VB 6 6
VB 3 6
FFF 2 2
For each unique entry in column A, say XCD, the value in column c is always the same and represents the total sum needed for that entry. To illustrate what I need, see the final data frame below.
A B c
XCD 1 5
XCD 4 5
FFF 2 2
VB 6 6
AAA 2 7
AAA 5 7
The algorithm should select rows whose values in column B sum up to the value in column c. It can select a single row, as long as that row alone sums to the number in column c, but we only take the first combination that sums to column c and leave out the rest, giving a new data frame.
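One way to approach this is a brute-force search per group (a sketch, not from the original thread: it tries combinations of rows, smallest first, and keeps the first one whose B values sum to c):

from itertools import combinations

import pandas as pd

df = pd.DataFrame({'A': ['XCD', 'FFF', 'VB', 'XCD', 'AAA', 'AAA', 'XCD', 'VB', 'VB', 'FFF'],
                   'B': [1, 12, 3, 8, 2, 5, 4, 6, 3, 2],
                   'c': [5, 2, 6, 5, 7, 7, 5, 6, 6, 2]})

kept = []
for _, group in df.groupby('A', sort=False):
    target = group['c'].iloc[0]          # c is constant within each group
    found = False
    for size in range(1, len(group) + 1):
        for combo in combinations(group.index, size):
            if group.loc[list(combo), 'B'].sum() == target:
                kept.append(group.loc[list(combo)])   # keep only the first match
                found = True
                break
        if found:
            break

result = pd.concat(kept)
print(result)

This reproduces the final frame shown above, but the combination search is exponential in group size, so it is only practical for small groups.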

How to join multiple columns in one column in Excel (I want a formula)

How do I join multiple columns into one column?
TABLE 1 TABLE 2 TABLE 3
1 2 5
2 4 3
3 5 3
4 5 1
I want to get:
1
2
3
4
2
4
5
5
5
3
3
1
If your data is laid out as above, with the three columns in A, B and C, enter the formula in the first row of any other column and drag it down until there are no values left over:
=IF(ROW()<=COUNTA(A:A),INDEX(A:A,ROW()),IF(ROW()<=COUNTA(A:B),INDEX(B:B,ROW()-COUNTA(A:A)),IF(ROW()>COUNTA(A:C),"",INDEX(C:C,ROW()-COUNTA(A:B)))))
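The same stacking expressed in pandas, for reference (a sketch assuming the three columns sit in a DataFrame df with those headers):

import pandas as pd

df = pd.DataFrame({'TABLE 1': [1, 2, 3, 4], 'TABLE 2': [2, 4, 5, 5], 'TABLE 3': [5, 3, 3, 1]})
stacked = pd.concat([df['TABLE 1'], df['TABLE 2'], df['TABLE 3']], ignore_index=True)
print(stacked.tolist())  # [1, 2, 3, 4, 2, 4, 5, 5, 5, 3, 3, 1]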
