I have a dataframe like this
import pandas as pd
import numpy as np
raw_data = {'Country': ['UK','UK','UK','UK','UK','UK','UK','UK','UK','UK','UK','UK','UK','UK','UK','UK','UK','UK','UK','UK','UK'],
            'Product': ['A','A','A','A','B','B','B','B','B','B','B','B','C','C','C','D','D','D','D','D','D'],
            'Week': [1,2,3,4,1,2,3,4,5,6,7,8,1,2,3,1,2,3,4,5,6],
            'val': [5,4,3,1,5,6,7,8,9,10,11,12,5,5,5,5,6,7,8,9,10]}
df2 = pd.DataFrame(raw_data, columns=['Country', 'Product', 'Week', 'val'])
print(df2)
and a mapping dataframe:
mapping = pd.DataFrame({'Product':['A','C'],'Product1':['B','D']}, columns = ['Product','Product1'])
I want to compare products as per the mapping: product A's data should be matched against product B's data. The logic is: product A has 4 records, so product B should also contribute 4 records, and those 4 records should come from the week numbers before and after product A's last week number, including the last week number itself. So one week before week 4 (i.e. week 3), week 4 itself, and the two weeks after week 4 (i.e. weeks 5 and 6).
Similarly, product C has 3 records, so product D should also contribute 3 records from the weeks before and after product C's last week number. Product C's last week is week 3, so product D's records will be weeks 2, 3 and 4.
The wanted data frame is shown below; I want to remove those yellow (highlighted) records.
Define the following function, which selects rows from df for the products named in the current row of mapping:
def selRows(row, df):
    # Rows of the "source" product (e.g. A)
    rows_1 = df[df.Product == row.Product]
    nr_1 = rows_1.index.size
    lastWk_1 = rows_1.Week.iat[-1]
    # Rows of the mapped product (e.g. B): weeks from lastWk_1 - 1 onwards,
    # limited to the same number of rows as the source product
    rows_2 = df[df.Product.eq(row.Product1) & df.Week.ge(lastWk_1 - 1)].iloc[:nr_1]
    return pd.concat([rows_1, rows_2])
Then call it the following way:
result = pd.concat([selRows(row, grp)
                    for _, grp in df2.groupby(['Country'])
                    for _, row in mapping.iterrows()])
The list comprehension above creates a list of DataFrames - the results of calling selRows on:
each group of rows from df2, one group per country (the outer loop),
each row from mapping (the inner loop).
Then concat concatenates all of them into a single DataFrame.
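If the original row order of df2 is preferred, the concatenated result can be re-sorted by its index afterwards (a small optional follow-up, assuming the result variable from the call above):
# Optional: restore the original row order and a clean 0..n-1 index
result = result.sort_index().reset_index(drop=True)
print(result)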
Solution: first create a mapped column using the mapping DataFrame, then build dictionaries holding, per (Country, Product) group, the last (maximal) week value and the number of records:
df2['mapp'] = df2['Product'].map(mapping.set_index('Product1')['Product'])
df1 = df2.groupby(['Country','Product'])['Week'].agg(['max','size'])
# subtract 1 to get the week just before the last week
dprev = df1['max'].sub(1).to_dict()
dlen = df1['size'].to_dict()
print(dlen)
{('UK', 'A'): 4, ('UK', 'B'): 8, ('UK', 'C'): 3, ('UK', 'D'): 6}
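For reference, dprev (derived from the same groupby on the sample data) maps each group to the week just before its last week:
print(dprev)
{('UK', 'A'): 3, ('UK', 'B'): 7, ('UK', 'C'): 2, ('UK', 'D'): 5}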
Then map the (Country, mapp) pairs with Series.map against the dprev dictionary and keep only rows whose Week is at least that value; then trim each group to the required length (from the second dictionary) with DataFrame.head:
df3 = (df2[df2[['Country','mapp']].apply(tuple, axis=1).map(dprev) <= df2['Week']]
         .groupby(['Country','mapp'])
         .apply(lambda x: x.head(dlen.get(x.name))))
print(df3)
                Country Product  Week  val mapp
Country mapp
UK      A    6       UK       B     3    7    A
             7       UK       B     4    8    A
             8       UK       B     5    9    A
             9       UK       B     6   10    A
        C    16      UK       D     2    6    C
             17      UK       D     3    7    C
             18      UK       D     4    8    C
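On recent pandas versions the same filtering can also be written with group_keys=False, so that df3 keeps a flat index instead of the extra (Country, mapp) levels (a hedged variant, reusing the same dprev and dlen dictionaries):
# Variant: group_keys=False keeps the original flat index on df3
df3 = (df2[df2[['Country','mapp']].apply(tuple, axis=1).map(dprev) <= df2['Week']]
         .groupby(['Country','mapp'], group_keys=False)
         .apply(lambda x: x.head(dlen.get(x.name))))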
Then keep the original rows whose Product is not matched by mapping['Product1'], append the new df3 and sort:
# DataFrame.append was removed in pandas 2.x, so pd.concat is used instead
df = (pd.concat([df2[~df2['Product'].isin(mapping['Product1'])], df3],
                ignore_index=True)
        .sort_values(['Country','Product'])
        .drop('mapp', axis=1))
print(df)
Country Product Week val
0 UK A 1 5
1 UK A 2 4
2 UK A 3 3
3 UK A 4 1
7 UK B 3 7
8 UK B 4 8
9 UK B 5 9
10 UK B 6 10
4 UK C 1 5
5 UK C 2 5
6 UK C 3 5
11 UK D 2 6
12 UK D 3 7
13 UK D 4 8
I have the following DataFrame:
Segments Airline_pct_tesco Airline_pct_asda food_pct_tesco food_pct_asda Airline_diff food_diff
A 1 2 4 2 -1 2
B 2 2 4 4 0 0
c 10 5 12 10 5 2
I want to convert it to this format:
Segments Category Asda% Tesco% Diff%
A Airline 2 1 -1
b Food 4 4 0
c Airline 5 10 5
A Food 2 4 2
(only partially shown). Note: Category is the column name without the '_pct_tesco', '_diff' or '_pct_asda' suffix.
I am unsure how to go about this - I have tried transform, but I don't know how to get it into a form that is easy for any user to use. I am doing this in pandas and am not sure how to even begin! The Asda% values come from the '_pct_asda' columns, and likewise for the Diff% and Tesco% columns.
Let's try set_index to keep the Segments column, then use str.extract on the column names with MultiIndex.from_frame to build a MultiIndex from the part before each suffix and the suffix itself, then stack to go to long form.
new_df = df.set_index('Segments')

# Define the allowed suffixes here
suffixes = ['_pct_asda', '_pct_tesco', '_diff']

# Split each column name into (Category, suffix)
new_df.columns = pd.MultiIndex.from_frame(
    new_df.columns.str.extract(rf'(.*?)({"|".join(suffixes)})'),
    names=['Category', None]
)
new_df = new_df.stack(0)
new_df:
_diff _pct_asda _pct_tesco
Segments Category
A Airline -1 2 1
food 2 2 4
B Airline 0 2 2
food 0 4 4
c Airline 5 5 10
food 2 10 12
To get cleaner output, add reset_index + rename to fix the column names and index, and re-order the columns.
new_df = new_df.reset_index().rename(columns={
    '_pct_asda': 'Asda%',
    '_pct_tesco': 'Tesco%',
    '_diff': 'Diff%'
})[['Segments', 'Category', 'Asda%', 'Tesco%', 'Diff%']]
new_df:
Segments Category Asda% Tesco% Diff%
0 A Airline 2 1 -1
1 A food 2 4 2
2 B Airline 2 2 0
3 B food 4 4 0
4 c Airline 5 10 5
5 c food 10 12 2
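An alternative route to the same result, if a fully long intermediate shape is preferred, is melt followed by str.extract and a pivot (a hedged sketch, reusing the suffixes list defined above):
# Melt to long form, split the variable names, then pivot the suffixes back out
long_df = df.melt(id_vars='Segments', var_name='var', value_name='val')
parts = long_df['var'].str.extract(rf'(?P<Category>.*?)(?P<metric>{"|".join(suffixes)})')
long_df = long_df.join(parts)
out = (long_df.pivot_table(index=['Segments', 'Category'], columns='metric', values='val')
              .reset_index()
              .rename(columns={'_pct_asda': 'Asda%', '_pct_tesco': 'Tesco%', '_diff': 'Diff%'})
              [['Segments', 'Category', 'Asda%', 'Tesco%', 'Diff%']])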
I have a pandas dataframe as below:
import pandas as pd
df = pd.DataFrame({'ORDER': ["A", "A", "A", "B", "B", "B"],
                   'GROUP': ["A_2018_1B1", "A_2018_1B1H", "A_2018_1M1",
                             "B_2018_I000_1C1", "B_2018_I000_1B1", "B_2018_I000_1C1H"],
                   'VAL': [1, 3, 8, 5, 8, 10]})
df
ORDER GROUP VAL
0 A A_2018_1B1 1
1 A A_2018_1B1H 3
2 A A_2018_1M1 8
3 B B_2018_I000_1C1 5
4 B B_2018_I000_1B1 8
5 B B_2018_I000_1C1H 10
I want to create a column "CAL" as the sum of 'VAL' over rows whose GROUP name is the same except for an 'H' character at the end. So, for example, the 'VAL' of the first two rows will be added because the only difference between their 'GROUP' values is that the 2nd row ends with an H. Row 3 will remain as it is, rows 4 and 6 will be added together, and row 5 will remain the same.
My expected output
ORDER GROUP VAL CAL
0 A A_2018_1B1 1 4
1 A A_2018_1B1H 3 4
2 A A_2018_1M1 8 8
3 B B_2018_I000_1C1 5 15
4 B B_2018_I000_1B1 8 8
5 B B_2018_I000_1C1H 10 15
Try str.replace to build the grouping key, then transform:
df.groupby(df.GROUP.str.replace('H','')).VAL.transform('sum')
0 4
1 4
2 8
3 15
4 8
5 15
Name: VAL, dtype: int64
df['CAL'] = df.groupby(df.GROUP.str.replace('H','')).VAL.transform('sum')
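If a GROUP name could contain an 'H' somewhere other than at the end, anchoring the replacement to the end of the string keeps the grouping key safe (a hedged variant of the same idea):
# Strip only a trailing 'H' when building the grouping key
key = df.GROUP.str.replace('H$', '', regex=True)
df['CAL'] = df.groupby(key).VAL.transform('sum')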
I want to backtest a trading strategy.
The data I have is OHLC (open, high, low, close) for a financial product, formatted into a dataframe with 300 rows (each row is 1 day) like so:
datetime O H L C
2020-03-24 1 2 3 4
2020-03-23 5 6 7 8
2020-03-22 9 1 2 3
2020-03-21 9 2 2 3
2020-03-20 9 3 2 3
2020-03-19 9 4 2 3
2020-03-18 9 5 2 3
What I want to do is, starting on the date closest to the current date, in this case the row with 2020-03-24:
1. take the number in column `L`
2. compare whether the number in column `L` is at any point greater than the values in column `L` for the previous two days.
3. Create and fill in a new column indicating whether the value from step 1 is greater than either value in the iteration.
4. Repeat steps 1, 2 & 3, but take the next number in column `L` that was not yet included in the iteration.
Example:
1. Starting on row `2020-03-24`, take value `3`
2. Is `3` at any point greater than `7` or `2` for rows starting with `2020-03-23` and `2020-03-22`?
3. YES, assign `TRUE` to column `comparison` in df for row starting with `2020-03-24`
4. Repeat, starting on row `2020-03-21`, take value `2` in column `L`
4a. Is `2` at any point greater than values in rows `2020-03-20` or `2020-03-19`?
4b. NO, assign `FALSE` to column `comparison` in df for row starting with `2020-03-21`.
New df looks like this:
datetime O H L C Comparison
2020-03-24 1 2 3 4 TRUE
2020-03-23 5 6 7 8
2020-03-22 9 1 2 3
2020-03-21 9 2 2 3 FALSE
2020-03-20 9 3 2 3
2020-03-19 9 4 2 3
2020-03-18 9 5 2 3
The only way I know how to do this is with a FOR loop, but that doesn't work for iterating over and comparing only certain subsets, like so:
for i in df['L']:
    if df['L'] >
You need a combination of rolling() and shift():
df.index = pd.to_datetime(df.index)
df.sort_index(inplace=True, ascending=False)
df['Comparison'] = False
df['Comparison'] = df.loc[:, 'L'] > df.loc[:, 'L'].rolling(window=2).min().shift(-2)
With rolling() you get the minimum of two consecutive rows, and shift(-2) aligns that minimum so each row is compared against the minimum of the two days before it.
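If, as in the expected output, the flag should only be filled in for every third row (the row that starts each non-overlapping block of three days), the comparison can be masked afterwards (a hedged addition, assuming numpy is imported and the newest-first ordering above):
import numpy as np

# Keep the flag only on every third row; the remaining rows stay empty (NaN)
comparison = df['L'] > df['L'].rolling(window=2).min().shift(-2)
df['Comparison'] = comparison.where(np.arange(len(df)) % 3 == 0)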
How can I drop a whole (city, district) group if the date value 2018/11/1 does not exist in it, in the following dataframe:
city district date value
0 a c 2018/9/1 12
1 a c 2018/10/1 4
2 a c 2018/11/1 5
3 b d 2018/9/1 3
4 b d 2018/10/1 7
The expected result will like this:
city district date value
0 a c 2018/9/1 12
1 a c 2018/10/1 4
2 a c 2018/11/1 5
Thank you!
Create a helper column with DataFrame.assign that compares against the datetime, test whether at least one value per group is True with GroupBy.transform('any'), and filter by boolean indexing:
mask = (df.assign(new=df['date'].eq('2018/11/1'))
          .groupby(['city','district'])['new'].transform('any'))
df = df[mask]
print(df)
city district date value
0 a c 2018/9/1 12
1 a c 2018/10/1 4
2 a c 2018/11/1 5
If there is an error because of missing values in the mask, one possible idea is to replace missing values in the columns used for grouping:
mask = (df.assign(new=df['date'].eq('2018/11/1'),
                  city=df['city'].fillna(-1),
                  district=df['district'].fillna(-1))
          .groupby(['city','district'])['new'].transform('any'))
df = df[mask]
print(df)
city district date value
0 a c 2018/9/1 12
1 a c 2018/10/1 4
2 a c 2018/11/1 5
Another idea is to add possibly missing index values with reindex and also replace missing values with False:
mask = (df.assign(new=df['date'].eq('2018/11/1'))
          .groupby(['city','district'])['new'].transform('any'))
df = df[mask.reindex(df.index, fill_value=False).fillna(False)]
print(df)
city district date value
0 a c 2018/9/1 12
1 a c 2018/10/1 4
2 a c 2018/11/1 5
There's a special GroupBy.filter() method for this. Assuming date is already datetime:
filter_date = pd.Timestamp('2018-11-01').date()
df = df.groupby(['city', 'district']).filter(lambda x: (x['date'].dt.date == filter_date).any())
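If date is still stored as text (as in the sample above), it would need parsing first so the .dt accessor works (a small hedged preliminary step):
# Parse the date strings into real datetimes before filtering
df['date'] = pd.to_datetime(df['date'])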
I'm trying to combine multiple data frames in pandas, and I want the new dataframe to contain, in each cell, the maximum of the corresponding elements across the various dataframes. All of the dataframes have the same row and column labels. How can I do this?
Example:
df1 = Date A B C
1/1/15 3 5 1
2/1/15 2 4 7
df2 = Date A B C
1/1/15 7 2 2
2/1/15 1 5 4
I'd like the result to look like this.
df = Date A B C
1/1/15 7 5 2
2/1/15 2 5 7
You can use np.where to return an array of the values that satisfy your boolean condition; this can then be used to construct a df:
In [5]:
vals = np.where(df1 > df2, df1, df2)
vals
Out[5]:
array([['1/1/15', 7, 5, 2],
['2/1/15', 2, 5, 7]], dtype=object)
In [6]:
pd.DataFrame(vals, columns = df1.columns)
Out[6]:
Date A B C
0 1/1/15 7 5 2
1 2/1/15 2 5 7
I don't know if Date is a column or index but the end result will be the same.
EDIT
Actually just use np.maximum:
In [8]:
np.maximum(df1,df2)
Out[8]:
Date A B C
0 1/1/15 7 5 2
1 2/1/15 2 5 7
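If Date is a regular column that should stay out of the element-wise comparison, it can be moved to the index first (a hedged variation on the same np.maximum idea):
# Compare only the numeric columns, aligning the two frames on Date
out = np.maximum(df1.set_index('Date'), df2.set_index('Date')).reset_index()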