How to create new columns based on specific conditions? - python-3.x
I have a multi-index dataframe. The index levels are an ID and a date. The 3 columns I have are cost, revenue, and expenditure. I want to create 3 new columns based on certain conditions.
1) The first new column would be based on this condition: for the 3 most recent dates per ID, if the cost column decreases consistently, label the new row values 'NEG'; if not, label them 'no'.
2) The second new column would be based on this condition: for the 3 most recent dates per ID, if the revenue column decreases consistently, label the new row values 'NEG'; if not, label them 'no'.
3) The third new column would be based on this condition: for the 3 most recent dates per ID, if the expenditure column increases consistently, label the new row value 'POS', or if it stays the same, label the new row value 'STABLE'.
import pandas as pd

idx = pd.MultiIndex.from_product([['001', '002', '003', '004'],
                                  ['2017-06-30', '2017-12-31', '2018-06-30', '2018-12-31', '2019-06-30']],
                                 names=['ID', 'Date'])
col = ['Cost', 'Revenue', 'Expenditure']
dict2 = {'Cost': [12, 6, -2, -10, -16, -10, 14, 12, 6, 7, 4, 2, 1, 4, -4, 5, 7, 9, 8, 1],
         'Revenue': [14, 13, 2, 1, -6, -10, 14, 12, 6, 7, 4, 2, 1, 4, -4, 5, 7, 9, 18, 91],
         'Expenditure': [17, 196, 20, 1, -6, -10, 14, 12, 6, 7, 4, 2, 1, 4, -4, 5, 7, 9, 18, 18]}
df = pd.DataFrame(dict2, idx, col)
I have tried creating a function and then applying it to my DataFrame, but I keep getting errors.
The solution I want to end up with would look like this:
idx = pd.MultiIndex.from_product([['001', '002', '003', '004'],
                                  ['2017-06-30', '2017-12-31', '2018-06-30', '2018-12-31', '2019-06-30']],
                                 names=['ID', 'Date'])
col = ['Cost', 'Revenue', 'Expenditure', 'Cost Outlook', 'Revenue Outlook', 'Expenditure Outlook']
dict3 = {'Cost': [12, 6, -2, -10, -16,
                  -10, 14, 12, 6, 7,
                  4, 2, 1, 4, -4,
                  5, 7, 9, 8, 1],
         'Cost Outlook': ['no', 'no', 'NEG', 'NEG', 'NEG',
                          'no', 'no', 'no', 'NEG', 'NEG',
                          'no', 'no', 'NEG', 'no', 'no',
                          'no', 'no', 'no', 'no', 'NEG'],
         'Revenue': [14, 13, 2, 1, -6,
                     -10, 14, 12, 6, 7,
                     4, 2, 1, 4, -4,
                     5, 7, 9, 18, 91],
         'Revenue Outlook': ['no', 'no', 'NEG', 'NEG', 'NEG',
                             'no', 'no', 'no', 'NEG', 'NEG',
                             'no', 'no', 'NEG', 'no', 'no',
                             'no', 'no', 'no', 'no', 'no'],
         'Expenditure': [17, 196, 1220, 1220, -6,
                         -10, 14, 120, 126, 129,
                         4, 2, 1, 4, -4,
                         5, 7, 9, 18, 18],
         'Expenditure Outlook': ['no', 'no', 'POS', 'POS', 'no',
                                 'no', 'no', 'POS', 'POS', 'POS',
                                 'no', 'no', 'no', 'no', 'no',
                                 'no', 'no', 'POS', 'POS', 'STABLE']}
df_new = pd.DataFrame(dict3, idx, col)
Here's what I would do:
import numpy as np

# update Cost and Revenue Outlooks together
# because they share the same condition
for col in ['Cost', 'Revenue']:
    outlook = f'{col} Outlook'
    # True where the value fell vs. the previous date within each ID
    df[outlook] = df.groupby('ID')[col].diff().lt(0)
    # 'NEG' only where the two most recent changes were both decreases
    df[outlook] = np.where(df.groupby('ID')[outlook].rolling(2).sum().eq(2),
                           'NEG', 'no')
# update Expenditure Outlook
col = 'Expenditure'
outlook = f'{col} Outlook'
s = df.groupby('ID')[col].diff()
df[outlook] = np.select(
    (s.eq(0).groupby(level=0).rolling(2).sum().eq(2),
     s.gt(0).groupby(level=0).rolling(2).sum().eq(2)),
    ('STABLE', 'POS'), 'no')
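The core trick in both blocks is the same: flag a per-row condition, then require it on two consecutive changes with rolling(2).sum().eq(2). A minimal sketch of the mechanics on a toy Series (toy values, not the question's frame):

```python
import pandas as pd

vals = pd.Series([5, 4, 3, 7, 6])
dec = vals.diff().lt(0)               # True where the value dropped vs. the previous row
streak = dec.rolling(2).sum().eq(2)   # True only where the last two steps both dropped
print(streak.tolist())                # [False, False, True, False, False]
```

Only the third row gets flagged, because it is the only one preceded by two consecutive decreases.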
See if this does the job:
is_descending = lambda a: np.all(a[:-1] > a[1:])
is_ascending = lambda a: np.all(a[:-1] <= a[1:])

df1 = df.reset_index()
# raw=True passes plain numpy arrays to the lambdas, which rely on positional slicing
df1["CostOutlook"] = (df1.groupby("ID").Cost.rolling(3)
                         .apply(is_descending, raw=True).fillna(0)
                         .apply(lambda x: "NEG" if x > 0 else "no").to_list())
df1["RevenueOutlook"] = (df1.groupby("ID").Revenue.rolling(3)
                            .apply(is_descending, raw=True).fillna(0)
                            .apply(lambda x: "NEG" if x > 0 else "no").to_list())
df1["ExpenditureOutlook"] = (df1.groupby("ID").Expenditure.rolling(3)
                                .apply(is_ascending, raw=True).fillna(0)
                                .apply(lambda x: "POS" if x > 0 else "no").to_list())
df1 = df1.set_index(["ID", "Date"])
Note: The requirement for "STABLE" is not handled.
Edit:
Here is an alternative solution:
is_descending = lambda a: np.all(a[:-1] > a[1:])

def is_ascending(a):
    if np.all(a[:-1] <= a[1:]):
        if a[-1] == a[-2]:
            return 2
        return 1
    return 0

for col in ['Cost', 'Revenue']:
    outlook = (df[col].unstack(level="ID").rolling(3)
                 .apply(is_descending, raw=True).fillna(0)
                 .replace({0.0: "no", 1.0: "NEG"})
                 .unstack().rename(f"{col} outlook"))
    df = df.join(outlook)

col = "Expenditure"
outlook = (df[col].unstack(level="ID").rolling(3)
             .apply(is_ascending, raw=True).fillna(0)
             .replace({0.0: "no", 1.0: "POS", 2.0: "STABLE"})
             .unstack().rename(f"{col} outlook"))
df = df.join(outlook)
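The unstack/rolling/unstack round-trip above can be checked on a toy frame (toy labels and values are assumptions, not the question's data): IDs become columns, the rolling window runs down the dates, and the final unstack restores the (ID, Date) index for the join.

```python
import numpy as np
import pandas as pd

tidx = pd.MultiIndex.from_product([['001', '002'], ['d1', 'd2', 'd3']],
                                  names=['ID', 'Date'])
toy = pd.DataFrame({'Cost': [3, 2, 1, 1, 2, 3]}, index=tidx)

is_descending = lambda a: np.all(a[:-1] > a[1:])
wide = toy['Cost'].unstack(level='ID')                  # one column per ID
flags = wide.rolling(3).apply(is_descending, raw=True)  # 1.0 where the last 3 values strictly fall
outlook = (flags.fillna(0).replace({0.0: 'no', 1.0: 'NEG'})
                .unstack().rename('Cost outlook'))
print(outlook.loc[('001', 'd3')])  # NEG
```

ID '001' falls for three straight dates and gets 'NEG' on its last row, while '002' rises and stays 'no'.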
Related
How to assign a specific value from another column in pandas in a given time frame?
I want to create a rolling forecast for the following 12 months; the results for the month and entry must become part of the dataframe as well (later it will be written out to Excel as part of a bigger dataframe). The entries for the new dataframe need to be calculated based on the criteria that the forecasted month lies between start_date and start_date + duration and is also in the range of the forecasted 12 months. If these are met, the value from duration should be written there. To do this I imagine that I have to use a numpy.where(), however I can not wrap my head around it. I came across "Use lambda with pandas to calculate a new column conditional on existing column", but after some trying I came to the conclusion that this can not be the whole truth for my case.

import numpy as np
import pandas as pd
import datetime as dt

months = ["Jan", "Feb", "Mrz", "Apr", "Mai", "Jun",
          "Jul", "Aug", "Sep", "Okt", "Nov", "Dez"]
cur_month = dt.date.today().month - 1
cur_year = dt.date.today().year
d = {'start_date': ['2020-12-23', '2021-02-08', '2021-06-11', '2022-01-07'],
     'duration': [12, 6, 8, 3],
     'effort': [0.3, 0.5, 1.2, 0.1]}
df = pd.DataFrame(data=d)

i = 0
while i < 12:
    # this creates the header rows for the 12 month period
    next_month = months[(cur_month + i) % len(months)]
    # here goes the calculation/condition I am stuck with...
    df[next_month] = np.where(...)
    i += 1
So I came up with this and it seems to work. I also added some logic for weighting, for the cases where a project starts some time during the month, so we get a more accurate effort number.

d = {"id": [1, 2, 3, 4],
     "start_date": ['2020-12-23', '2021-02-08', '2021-06-11', '2022-01-07'],
     "duration": [12, 6, 8, 3],
     "effort": [0.3, 0.5, 1.2, 0.1]}
df = pd.DataFrame(data=d)
df["EndDates"] = df["start_date"].dt.to_period("M") + df["duration"]

i = 0
forecast = pd.Series(pd.period_range(today, freq="M", periods=12))
while i < 12:
    next_month = months[(cur_month + i) % len(months)]
    df[next_month] = ""
    for index, row in df.iterrows():
        df_tmp = df.loc[df['id'] == int(row['id'])]
        if not df_tmp.empty and pd.notna(df_tmp["start_date"].item()):
            if df_tmp["start_date"].item().to_period("M") <= forecast[i] <= df_tmp["EndDates"].item():
                # For the current month let's calculate with the remaining value
                if i == 0:
                    act_enddate = monthrange(today.year, today.month)[1]
                    weighter = 1 - (int(today.day) / int(act_enddate))
                    df.at[index, next_month] = round(df_tmp['effort'].values[0] * weighter, ndigits=2)
                # If it is the first entry for the oppty, how many FTEs will be needed
                # for the first month of the assignment
                elif df_tmp["start_date"].item().to_period("M") == forecast[i]:
                    first_day = df_tmp["start_date"].item().day
                    if first_day != 1:
                        months_enddate = monthrange(forecast[i].year, forecast[i].month)[1]
                        weighter = 1 - (int(first_day) / int(months_enddate))
                        df.at[index, next_month] = round(df_tmp['effort'].values[0] * weighter, ndigits=2)
                    else:
                        df.at[index, next_month] = df_tmp['effort'].values[0]
                # How many FTEs are needed for the last month of the assignment
                elif df_tmp["EndDates"].item() == forecast[i]:
                    end_day = df_tmp["start_date"].item().day
                    if end_day != 1:
                        months_enddate = monthrange(forecast[i].year, forecast[i].month)[1]
                        weighter = int(end_day) / int(months_enddate)
                        df.at[index, next_month] = round(df_tmp['Umrechnung in FTEs'].values[0] * weighter, ndigits=2)
                    else:
                        continue
                else:
                    df.at[index, next_month] = df_tmp['effort'].values[0]
    i += 1
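For the original np.where question, the per-month condition can also be expressed without the row loop. A hedged sketch on toy data (column names follow the question; one forecasted month shown; period arithmetic stands in for the date comparisons):

```python
import numpy as np
import pandas as pd

fc = pd.DataFrame({'start_date': pd.to_datetime(['2020-12-23', '2021-02-08']),
                   'duration': [12, 6],
                   'effort': [0.3, 0.5]})
start = fc['start_date'].dt.to_period('M')   # project start, as a monthly period
end = start + fc['duration']                 # start + duration, in months

month = pd.Period('2021-03', freq='M')       # one forecasted month
# effort where the month falls inside the project window, NaN otherwise
fc[str(month)] = np.where((start <= month) & (month <= end), fc['effort'], np.nan)
print(fc['2021-03'].tolist())  # [0.3, 0.5]
```

Looping `month` over the 12 forecast periods would fill all the monthly columns the same way.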
What do I do if ValueError: x and y must have same first dimension, but have shapes (32,) and (31, 5)?
csv_data = pd.read_csv("master.csv")
df = pd.DataFrame(csv_data, columns=['year', 'suicides/100k pop', 'age', 'country', 'sex'])
us_rates = df['country'].values == 'United States'
df_us_rates = df.loc[us_rates]
teen_rates = df_us_rates['age'].values == '15-24 years'
df_teen_rates = df_us_rates.loc[teen_rates]
boy_rates = df_teen_rates['sex'].values == 'male'
df_boy_rates = df_teen_rates.loc[boy_rates]
girl_rates = df_teen_rates['sex'].values == 'female'
df_girls_rates = df_teen_rates.loc[girl_rates]
years = csv_data['year']
no_dups = []
print(df_teen_rates)
for year in years:
    if year not in no_dups:
        no_dups.append(year)
plt.plot(no_dups, df_boy_rates)
plt.show()
You are trying to plot no_dups, which is a 1D list of 32 values, against df_boy_rates, which is a 2D dataframe with 31 rows and 5 columns. Assuming that the column you're interested in is 'suicides/100k pop', modify your code like this:

df_boy_rates = df_teen_rates.loc[boy_rates, 'suicides/100k pop']

Also, you should check why there is one more element in no_dups.
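On that last point, the manual dedup loop can be replaced with an order-preserving pandas one-liner, which also makes the length easy to inspect (a sketch on toy values, not the CSV):

```python
import pandas as pd

years = pd.Series([1999, 2000, 2000, 2001])
no_dups = years.drop_duplicates().tolist()   # keeps the first occurrence of each year
print(no_dups)       # [1999, 2000, 2001]
print(len(no_dups))  # 3 -> compare against len(df_boy_rates) before plotting
```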
Format certain rows after writing to excel file
I have some code which compares two Excel files and determines any new rows (new_rows) added or any rows which were deleted (dropped_rows). It then uses xlsxwriter to write this to an Excel sheet. The part of the code I am having trouble with is supposed to iterate through the rows and, if the row was a new or dropped row, format it a certain way. For whatever reason this part of the code isn't working correctly and is being ignored. I've tried a whole host of different syntax to make this work, but no luck.

UPDATE: After some more trial and error, the issue seems to be caused by the index column. It is a Case Number column and the values have a prefix like "Case_123, Case_456, Case_789, etc.". This seems to be the root of the issue, but I'm not sure how to solve for it.

grey_fmt = workbook.add_format({'font_color': '#E0E0E0'})
highlight_fmt = workbook.add_format({'font_color': '#FF0000', 'bg_color': '#B1B3B3'})
new_fmt = workbook.add_format({'font_color': '#32CD32', 'bold': True})

# set format over range
## highlight changed cells
worksheet.conditional_format('A1:J10000', {'type': 'text',
                                           'criteria': 'containing',
                                           'value': '→',
                                           'format': highlight_fmt})

# highlight new/changed rows
for row in range(dfDiff.shape[0]):
    if row + 1 in newRows:
        worksheet.set_row(row + 1, 15, new_fmt)
    if row + 1 in droppedRows:
        worksheet.set_row(row + 1, 15, grey_fmt)

The last part (# highlight new/changed rows) is the bit that is not working. The conditional format portion works fine.
the rest of the code:

import pandas as pd
from pathlib import Path

def excel_diff(path_OLD, path_NEW, index_col):
    df_OLD = pd.read_excel(path_OLD, index_col=index_col).fillna(0)
    df_NEW = pd.read_excel(path_NEW, index_col=index_col).fillna(0)

    # Perform Diff
    dfDiff = df_NEW.copy()
    droppedRows = []
    newRows = []
    cols_OLD = df_OLD.columns
    cols_NEW = df_NEW.columns
    sharedCols = list(set(cols_OLD).intersection(cols_NEW))
    for row in dfDiff.index:
        if (row in df_OLD.index) and (row in df_NEW.index):
            for col in sharedCols:
                value_OLD = df_OLD.loc[row, col]
                value_NEW = df_NEW.loc[row, col]
                if value_OLD == value_NEW:
                    dfDiff.loc[row, col] = df_NEW.loc[row, col]
                else:
                    dfDiff.loc[row, col] = ('{}→{}').format(value_OLD, value_NEW)
        else:
            newRows.append(row)
    for row in df_OLD.index:
        if row not in df_NEW.index:
            droppedRows.append(row)
            dfDiff = dfDiff.append(df_OLD.loc[row, :])
    dfDiff = dfDiff.sort_index().fillna('')
    print(dfDiff)
    print('\nNew Rows: {}'.format(newRows))
    print('Dropped Rows: {}'.format(droppedRows))

    # Save output and format
    fname = '{} vs {}.xlsx'.format(path_OLD.stem, path_NEW.stem)
    writer = pd.ExcelWriter(fname, engine='xlsxwriter')
    dfDiff.to_excel(writer, sheet_name='DIFF', index=True)
    df_NEW.to_excel(writer, sheet_name=path_NEW.stem, index=True)
    df_OLD.to_excel(writer, sheet_name=path_OLD.stem, index=True)

    # get xlsxwriter objects
    workbook = writer.book
    worksheet = writer.sheets['DIFF']
    worksheet.hide_gridlines(2)
    worksheet.set_default_row(15)

    # define formats
    date_fmt = workbook.add_format({'align': 'center', 'num_format': 'yyyy-mm-dd'})
    center_fmt = workbook.add_format({'align': 'center'})
    number_fmt = workbook.add_format({'align': 'center', 'num_format': '#,##0.00'})
    cur_fmt = workbook.add_format({'align': 'center', 'num_format': '$#,##0.00'})
    perc_fmt = workbook.add_format({'align': 'center', 'num_format': '0%'})
    grey_fmt = workbook.add_format({'font_color': '#E0E0E0'})
    highlight_fmt = workbook.add_format({'font_color': '#FF0000', 'bg_color': '#B1B3B3'})
    new_fmt = workbook.add_format({'font_color': '#32CD32', 'bold': True})

    # set format over range
    ## highlight changed cells
    worksheet.conditional_format('A1:J10000', {'type': 'text',
                                               'criteria': 'containing',
                                               'value': '→',
                                               'format': highlight_fmt})

    # highlight new/changed rows
    for row in range(dfDiff.shape[0]):
        if row + 1 in newRows:
            worksheet.set_row(row + 1, 15, new_fmt)
        if row + 1 in droppedRows:
            worksheet.set_row(row + 1, 15, grey_fmt)

    # save
    writer.save()
    print('\nDone.\n')

def main():
    path_OLD = Path('file1.xlsx')
    path_NEW = Path('file2.xlsx')
    # get index col from data
    df = pd.read_excel(path_NEW)
    index_col = df.columns[0]
    print('\nIndex column: {}\n'.format(index_col))
    excel_diff(path_OLD, path_NEW, index_col)

if __name__ == '__main__':
    main()
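Given the UPDATE, the likely mismatch is that newRows/droppedRows hold index labels like "Case_456", while the loop compares them against integer row positions, so `row + 1 in newRows` is never True. A hedged sketch of the label-to-position mapping (toy dfDiff stands in for the real diff):

```python
import pandas as pd

dfDiff = pd.DataFrame({'A': [1, 2, 3]},
                      index=['Case_123', 'Case_456', 'Case_789'])
newRows = ['Case_456']

# Translate each label into its sheet row: 0-based position in the sorted diff,
# plus 1 to skip the header row written by to_excel
rows_to_format = [dfDiff.index.get_loc(label) + 1 for label in newRows]
print(rows_to_format)  # [2]
```

Each resulting position could then be passed to `worksheet.set_row(pos, 15, new_fmt)` in place of the original loop.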
Moving Unique Count Calculation Pandas DataFrame
I am defining a function that is applied to every row in my DataFrame and counts unique codes in the column "Code" for every id in the set. The code I have works, but it is incredibly slow and I am using a large data set. I am looking for a different approach to speed up the operation.

from datetime import timedelta as td
import pandas as pd

df['Trailing_12M'] = df['Date'] - td(365)  # current date - 1 year as new column

def Unique_Count(row):
    """Creating a new df for each id and returning unique count to every row in original df"""
    temp1 = np.array(df['ID'] == row['ID'])
    temp2 = np.array(df['Date'] <= row['Date'])
    temp3 = np.array(df['Date'] >= row['Trailing_12M'])
    temp4 = np.array(temp1 & temp2 & temp3)
    df_Unique_Code_Count = np.array(df[temp4].Code.nunique())
    return df_Unique_Code_Count

df['Unique_Code_Count'] = df.apply(Unique_Count, axis=1)
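No answer is shown for this one. One hedged alternative (a sketch on toy data; assumes Date is a sortable datetime and uses pandas' time-based rolling windows) replaces the per-row scan with a trailing 365-day window computed once per ID group:

```python
import pandas as pd

data = pd.DataFrame({
    'ID':   ['A', 'A', 'A', 'B'],
    'Date': pd.to_datetime(['2020-01-01', '2020-06-01', '2021-07-01', '2020-01-01']),
    'Code': [1, 2, 1, 3],
}).sort_values(['ID', 'Date']).set_index('Date')

# Trailing-365-day unique Code count, computed group-wise instead of row-by-row
data['Unique_Code_Count'] = (
    data.groupby('ID')['Code']
        .transform(lambda s: s.rolling('365D').apply(lambda w: w.nunique(), raw=False))
)
print(data['Unique_Code_Count'].tolist())  # [1.0, 2.0, 1.0, 1.0]
```

The third row of ID A drops back to 1 because its 2020-06-01 observation falls outside the trailing 365-day window.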
How do I remove outliers using multiple columns in pandas?
Out of my entire dataframe I have two columns, price and quantity. These both contain outliers. How can I remove the outliers in both these columns such that the dataframe returned excludes outliers from both? I can apply it to one column but am not sure how to apply it to both. I've tried the below:

def make_mask(df, column):
    standardized = (df[column] - df[column].mean()) / df[column].std()
    return standardized.abs() >= 2

def filter_outliers(df, columns):
    print(columns)
    masks = (make_mask(df, column) for column in columns)
    print(masks)
    full_mask = np.logical_or.reduce(masks)
    print(full_mask)
    return df[full_mask]

outliersremoved_df = filter_outliers(df, ['price', 'qty'])

I have used this, but I can only apply it to one column at a time:

def remove_outlier(df_in, col_name):
    q1 = df_in[col_name].quantile(0.25)
    q3 = df_in[col_name].quantile(0.75)
    iqr = q3 - q1  # Interquartile range
    fence_low = q1 - 1.5 * iqr
    fence_high = q3 + 1.5 * iqr
    df_out = df_in.loc[(df_in[col_name] > fence_low) & (df_in[col_name] < fence_high)]
    return df_out

Error with the top two functions: ValueError: too many values to unpack (expected 1)
Please use the below function, which applies to all the columns you have in df:

def cap_data(df):
    for col in df.columns:
        print("capping the ", col)
        if ((df[col].dtype == 'float64') | (df[col].dtype == 'int64')):
            percentiles = df[col].quantile([0.01, 0.99]).values
            # use .loc to avoid chained-assignment pitfalls
            df.loc[df[col] <= percentiles[0], col] = percentiles[0]
            df.loc[df[col] >= percentiles[1], col] = percentiles[1]
        else:
            df[col] = df[col]
    return df

final_df = cap_data(df)
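If rows should be dropped rather than capped, the asker's remove_outlier can be generalized to several columns by AND-ing one fence mask per column (a sketch on toy data; fences follow the question's 1.5*IQR version, though `between` is inclusive where the original used strict inequalities):

```python
import pandas as pd

def remove_outliers(df_in, col_names):
    # Keep only rows that sit inside the 1.5*IQR fences of every listed column
    mask = pd.Series(True, index=df_in.index)
    for col in col_names:
        q1, q3 = df_in[col].quantile([0.25, 0.75])
        iqr = q3 - q1
        mask &= df_in[col].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
    return df_in[mask]

sample = pd.DataFrame({'price': [10, 11, 12, 11, 1000],
                       'qty':   [1, 2, 2, 3, 100]})
out = remove_outliers(sample, ['price', 'qty'])
print(len(out))  # 4 rows survive; the 1000/100 row is dropped
```

Combining masks with `&=` also sidesteps the ValueError from the generator-based `np.logical_or.reduce` attempt, since each mask is an aligned boolean Series.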