Trying to group and find margins of pandas dataframe based on multiple columns. Keep getting IndexError - python-3.x

I am trying to calculate the margins between two values based on 2 other columns.
def calcMargin(data):
    marginsData = data[data.groupby('ID')['Status'].transform(lambda x: all(x != 'Tie'))]  # Taking out all inquiries with ties.

    def difference(df):  # Subtracts 'Accepted' from lowest price
        if len(df) <= 1:
            return pd.NA
        winner = df.loc[(df['Status'] == 'Accepted'), 'Price']
        df = df[df.Status != 'Accepted']
        return min(df['Price']) - winner

    winningMargins = marginsData.groupby('ID').agg(difference(marginsData)).dropna()
    winningMargins.columns = ['Margin']
    winners = marginsData.loc[(marginsData.Status == 'Accepted'), :]
    winners = winners.join(winningMargins, on='ID')
    winnersMargins = winners[['Name', 'Margin']].groupby('Name').sum().reset_index()
To explain a bit further: I am trying to find the difference between two prices. One is the price on the row where the second column holds the value "Accepted". The other is the lowest price among the remaining rows once the "Accepted" row is removed; I then take the difference between the two. All of this is done per group, grouping by a third column, the ID column. Finally, I try to attach the resulting margin to the winner's 'Name' in the fourth column.
I keep getting the error IndexError: index 25 is out of bounds for axis 0 with size 0. I'm not 100% sure how to fix this, or whether my code is correct.
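A possible cause, offered as a hedged guess rather than a confirmed fix: .agg(difference(marginsData)) calls difference once on the whole frame instead of passing the function to be applied per group, so the Series lookups inside it can come back empty. A per-group groupby(...).apply(...) sketch, reusing the column names ID, Status, Price and Name from the question (everything else here is an assumption, not the original code), might look like this:

import pandas as pd

def calc_margin(data):
    # Drop every ID that contains at least one 'Tie' row.
    margins_data = data[data.groupby('ID')['Status'].transform(lambda x: (x != 'Tie').all())]

    def difference(df):
        # Needs one 'Accepted' row plus at least one competing row.
        accepted = df.loc[df['Status'] == 'Accepted', 'Price']
        others = df.loc[df['Status'] != 'Accepted', 'Price']
        if accepted.empty or others.empty:
            return pd.NA
        return others.min() - accepted.iloc[0]

    # Apply per ID so each group collapses to a single margin value.
    winning_margins = margins_data.groupby('ID').apply(difference).dropna().rename('Margin')

    winners = margins_data.loc[margins_data['Status'] == 'Accepted']
    winners = winners.join(winning_margins, on='ID')
    return winners[['Name', 'Margin']].groupby('Name').sum().reset_index()

The key change is passing difference to apply instead of calling it up front; whether that also resolves the IndexError depends on the actual data.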

Related

Pandas: Compare row with all other rows by multiple conditions

I want to compare all the rows (one by one) with all the other rows in the following extract of my dataframe.
Idx ECTRL ID Latitude Longitude
0 186858227 53.617750 30.866759
1 186858229 40.569012 35.138237
2 186858235 38.915970 38.782447
3 186858295 39.737594 37.005481
4 186858299 48.287601 15.487567
I want to extract "ECTRL ID"-Combinations (e.g. 186858235, 186858295), where the differences of longitude and latitude are both less than 2.
e.g.:
df.iloc[2]["Latitude"] - df.iloc[3]["Latitude"] <= 2
If it's true, I want to return it as a tuple and append it to a list.
(186858235, 186858295)
It works with a loop, but it's pretty slow:
l = []
for idx, row in data.iterrows():
    for j, row2 in data.iterrows():
        if np.absolute(row['Longitude'] - row2['Longitude']) < 0.05 and np.absolute(row['Latitude'] - row2['Latitude']) < 0.05 and row["ECTRL ID"] != row2["ECTRL ID"]:
            tup = (row["ECTRL ID"], row2["ECTRL ID"])
            l.append(tup)
Is there any way to make this faster with the built-in pandas functions? I have not found a way without looping.
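Not from the original thread, but a hedged sketch of one vectorized approach: build every row pair with a cross merge (pandas >= 1.2) and filter with boolean masks. The column names follow the question, the 0.05 threshold is copied from the loop above, and the quadratic size of the cross join is the memory trade-off.

import pandas as pd

# Pair every row with every other row, suffixing the duplicated column names.
pairs = data.merge(data, how='cross', suffixes=('_a', '_b'))

mask = (
    (pairs['ECTRL ID_a'] != pairs['ECTRL ID_b'])
    & ((pairs['Longitude_a'] - pairs['Longitude_b']).abs() < 0.05)
    & ((pairs['Latitude_a'] - pairs['Latitude_b']).abs() < 0.05)
)

l = list(zip(pairs.loc[mask, 'ECTRL ID_a'], pairs.loc[mask, 'ECTRL ID_b']))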

Calculate percentage of grouped values

I have a Pandas dataframe that looks like:
I calculated the number of Win, Lost and Draw for each year, and now it looks like:
My goal is to calculate the percentage of each score, grouped by year, so that it becomes like:
But I'm stuck here.
I looked at this thread but was not able to apply it to my df.
Any thoughts?
Here is quite a simple method I wrote for this task:
Just do as follows:
create a dataframe of the total score within each year:
total_score = df.groupby('year')['score'].sum().reset_index(name = 'total_score_each_year')
merge the original and the new dataframe into a single dataframe:
df = df.merge(total_score, on = 'year')
calculate the percents:
df['percentages'] = 100 * (df['score'] / df['total_score_each_year'])
That's it, I hope it helps :)
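A small end-to-end illustration of those three steps, using a toy DataFrame invented here purely for demonstration (the year/score column names follow the answer; the data itself is made up):

import pandas as pd

df = pd.DataFrame({
    'year':   [2019, 2019, 2019, 2020, 2020, 2020],
    'result': ['Win', 'Lost', 'Draw', 'Win', 'Lost', 'Draw'],
    'score':  [10, 6, 4, 8, 8, 4],
})

total_score = df.groupby('year')['score'].sum().reset_index(name='total_score_each_year')
df = df.merge(total_score, on='year')
df['percentages'] = 100 * (df['score'] / df['total_score_each_year'])
# The 2019 rows come out as 50, 30 and 20 percent; the 2020 rows as 40, 40 and 20.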
You could try using df.iat[row, column].
It would look something like this:
percentages = []
# Rows are assumed to come in blocks of three per year: draws, losses, wins.
for i in range(0, len(df), 3):
    draws = df.iat[i, 2]
    losses = df.iat[i + 1, 2]
    wins = df.iat[i + 2, 2]
    nbr_of_games = draws + losses + wins
    percentages.append(draws * 100 / nbr_of_games)   # Calculate percentage of draws
    percentages.append(losses * 100 / nbr_of_games)  # Calculate percentage of losses
    percentages.append(wins * 100 / nbr_of_games)    # Calculate percentage of wins
df["percentage"] = percentages
This may not be the fastest way to do it, but I hope it helps!
Similar to @panter's answer, but in only one line and without creating any additional DataFrame:
df['percentage'] = df.merge(df.groupby('year').score.sum(), on='year', how='left').apply(
    lambda x: x.score_x * 100 / x.score_y, axis=1
)
In detail:
df.groupby('year').score.sum() creates a Series with the sum of the score per year.
df.merge creates a DataFrame equal to the original df, but with the column score renamed to score_x and an additional column score_y, which holds the sum of all the scores for the year of each row; the how='left' keeps only the rows of the left DataFrame, i.e., df.
.apply computes the corresponding percentage for each row, using score_x and score_y (mind the axis=1 option, which applies the lambda row by row).
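For reference, and not part of either answer above, the same result can usually be had without a merge or a row-wise apply: groupby().transform('sum') returns the yearly total aligned to the original index, so the division stays vectorized.

# transform('sum') broadcasts each year's total back onto its rows.
df['percentage'] = 100 * df['score'] / df.groupby('year')['score'].transform('sum')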

Normalize Column Values by Monthly Averages with added Group dimension

Initial Note
I already got this running, but it takes a very long time to execute. My DataFrame is around 500MB large. I am hoping to hear some feedback on how to execute this as quickly as possible.
Problem Statement
I want to normalize the DataFrame columns by the mean of the column's values during each month. An added complexity is that I have a column named group which denotes a different sensor in which the parameter (column) was measured. Therefore, the analysis needs to iterate around group and each month.
DF example
X Y Z group
2019-02-01 09:30:07 1 2 1 'grp1'
2019-02-01 09:30:23 2 4 3 'grp2'
2019-02-01 09:30:38 3 6 5 'grp1'
...
Code (Functional, but slow)
This is the code that I used. Coding annotations provide descriptions of most lines. I recognize that the three for loops are causing this runtime issue, but I do not have the foresight to see a way around them. Does anyone know a way to speed this up?
# Get mean monthly values for each group
mean_per_month_unit = process_df.groupby('group').resample('M', how='mean')
# Store the monthly dates created in last line into a list called month_dates
month_dates = mean_per_month_unit.index.get_level_values(1)
# Place date on multiIndex columns. future note: use df[DATE, COL_NAME][UNIT] to access mean value
mean_per_month_unit = mean_per_month_unit.unstack().swaplevel(0, 1, 1).sort_index(axis=1)
divide_df = pd.DataFrame().reindex_like(df)
process_cols.remove('group')
for grp in group_list:
    print(grp)
    # Iterate through month
    for mnth in month_dates:
        # Make mask where month and group
        mask = (df.index.month == mnth.month) & (df['group'] == grp)
        for col in process_cols:
            # Set values of divide_df
            divide_df.iloc[mask.tolist(), divide_df.columns.get_loc(col)] = mean_per_month_unit[mnth, col][grp]
# Divide process_df with divide_df
final_df = process_df / divide_df.values
EDIT: Example data
Here is the data in CSV format.
EDIT2: Current code (according to current answer)
def normalize_df(df):
    df['month'] = df.index.month
    print(df['month'])
    df['year'] = df.index.year
    print(df['year'])

    def find_norm(x, df_col_list):  # x is a row in dataframe, col_list is the list of columns to normalize
        agg = df.groupby(by=['group', 'month', 'year'], as_index=True).mean()
        print("###################", x.name, x['month'])
        for column in df_col_list:  # iterate over col list, find mean from aggregations, and divide the value by
            print(column)
            mean_col = agg.loc[(x['group'], x['month'], x['year']), column]
            print(mean_col)
            col_name = "norm" + str(column)
            x[col_name] = x[column] / mean_col  # norm
        return x

    normalize_cols = df.columns.tolist()
    normalize_cols.remove('group')
    #normalize_cols.remove('mode')
    df2 = df.apply(find_norm, df_col_list = normalize_cols, axis=1)
The code runs perfectly for one iteration and then it fails with the error:
KeyError: ('month', 'occurred at index 2019-02-01 11:30:17')
As I said, it runs correctly once, but then it goes over the same row again and fails. According to the df.apply() documentation, the first row is always run twice. I'm just not sure why it fails on the second pass.
Assuming that the requirement is to normalize each column by its per-group, per-month mean, here is another approach:
Create new columns - month and year from the index. df.index.month can be used for this provided the index is of type DatetimeIndex
type(df.index) # df is the original dataframe
#pandas.core.indexes.datetimes.DatetimeIndex
df['month'] = df.index.month
df['year'] = df.index.year # added year assuming the grouping occurs per grp per month per year. No need to add this column if year is not to be considered.
Now, group over (grp, month, year) and aggregate to find mean of every column. (Added year assuming the grouping occurs per grp per month per year. No need to add this column if year is not to be considered.)
agg = df.groupby(by=['grp', 'month', 'year'], as_index=True).mean()
Use a function to calculate the normalized values and use apply() over the original dataframe
def find_norm(x, df_col_list): # x is a row in dataframe, col_list is the list of columns to normalize
    for column in df_col_list: # iterate over col list, find mean from aggregations, and divide the value by the mean.
        mean_col = agg.loc[(str(x['grp']), x['month'], x['year']), column]
        col_name = "norm" + str(column)
        x[col_name] = x[column] / mean_col # norm
    return x
df2 = df.apply(find_norm, df_col_list = ['A','B','C'], axis=1)
#df2 will now have 3 additional columns - normA, normB, normC
df2:
A B C grp month year normA normB normC
2019-02-01 09:30:07 1 2 3 1 2 2019 0.666667 0.8 1.5
2019-03-02 09:30:07 2 3 4 1 3 2019 1.000000 1.0 1.0
2019-02-01 09:40:07 2 3 1 2 2 2019 1.000000 1.0 1.0
2019-02-01 09:38:07 2 3 1 1 2 2019 1.333333 1.2 0.5
Alternatively, for step 3, one can join the agg and df dataframes and find the norm.
Hope this helps!
Here is how the code would look:
# Step 1
df['month'] = df.index.month
df['year'] = df.index.year # added year assuming the grouping occurs
# Step 2
agg = df.groupby(by=['grp', 'month', 'year'], as_index=True).mean()
# Step 3
def find_norm(x, df_col_list): # x is a row in dataframe, col_list is the list of columns to normalize
    for column in df_col_list: # iterate over col list, find mean from aggregations, and divide the value by the mean.
        mean_col = agg.loc[(str(x['grp']), x['month'], x['year']), column]
        col_name = "norm" + str(column)
        x[col_name] = x[column] / mean_col # norm
    return x
df2 = df.apply(find_norm, df_col_list = ['A','B','C'], axis=1)
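As a rough sketch of the join-based alternative mentioned above for step 3, under the same assumed column names (A, B, C, grp) and untested against the original data:

# Merge the per-(grp, month, year) means back onto the rows, then divide column-wise
# instead of applying find_norm row by row.
means = agg.reset_index()
merged = df.reset_index().merge(means, on=['grp', 'month', 'year'], suffixes=('', '_mean'))
for column in ['A', 'B', 'C']:
    merged['norm' + column] = merged[column] / merged[column + '_mean']
# The original timestamps survive as the former index column and can be restored with set_index.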

Code optimisation - comparing two datetime columns by month and creating a new column too slow

I am trying to create a new column in a Pandas dataframe. If the other two date columns in my dataframe share the same month, then this new column should have 1 as its value, otherwise 0. Also, I need to check that the ids match another list of ids that I saved previously elsewhere, and mark only those with 1. I have some code, but it is useless since I am dealing with almost a billion rows.
my_list_of_ids = df[df.bool_column == 1].id.values

def my_func(date1, date2):
    for id_ in df.id:
        if id_ in my_list_of_ids:
            if date1.month == date2.month:
                my_var = 1
            else:
                my_var = 0
        else:
            my_var = 0
    return my_var

df["new_column"] = df.progress_apply(lambda x: my_func(x['date1'], x['date2']), axis=1)
I've been waiting for 30 minutes and it's still at 0%. Any help is appreciated.
UPDATE (adding an example):
id | date1 | date2 | bool_column | new_column |
id1 2019-02-13 2019-04-11 1 0
id1 2019-03-15 2019-04-11 0 0
id1 2019-04-23 2019-04-11 0 1
id2 2019-08-22 2019-08-11 1 1
id2 ....
id3 2019-09-01 2019-09-30 1 1
.
.
.
What I need to do is save the ids that are 1 in my bool_column; then I loop through all of the ids in my dataframe and check whether they are in the previously created list (= 1). Then I want to compare the month and year of the date1 and date2 columns and, if they are the same, create new_column with a value of 1 where they match, otherwise 0.
The pandas way to do this is
mask = ((df['date1'].dt.month == df['date2'].dt.month) & (df['id'].isin(my_list_of_ids)))
df['new_column'] = mask.replace({False: 0, True: 1})
Since you have a large dataset, this will take time, but it should be faster than using apply.
The best way to deal with the month match is to use vectorization in pandas and do this:
new_column = (df.date1.dt.month == df.date2.dt.month).astype(int)
That is, avoid using apply() over the DataFrame (which will probably be iterative) and take advantage of the underlying numpy vectorization. The gateway to such functionality is almost always in families of Series functions and properties, like the dt family for dates, str family for strings, and so forth.
Luckily, you have pre-computed the id_list membership in your bool_column, so to add membership as a criterion, just do this:
new_column = ((df.date1.dt.month == df.date2.dt.month) & df.bool_column).astype(int)
Once again, the & of two Series takes advantage of vectorization. You stay inside boolean space till the end, then cast to int with astype(int). Reviewing your code, it occurs to me that the iterative checking of your id_list may be the real performance hit here, even more so than the DataFrame.apply(). Whatever you do, avoid at all costs iterating your id_list at each row, since you already have a vector denoting membership in your bool_column.
By the way, I believe there's a tiny error in your example data: the new_column value for your third row should be 0, since your bool_column value there is 0.

Returning Multiple Columns from FuzzyWuzzy token_set_ratio

I am attempting to perform some fuzzy matching across two datasets containing lots of addresses.
I am iterating through a list of addresses in df and finding the 'most matching' address out of another:
for index, row in df.iterrows():
    test_address = df.Full_Address[row]
    first_comp = fuzz.token_set_ratio(df3.Full_Address, test_address)
Taking the row output returns me the full address from df, but I can't come up with a way to return the corresponding 'matched' address from df3.
Can anyone give a pointer please?
df ~ 18k rows
df3 ~ 2.5M rows
This obviously presents limitations.
I have tried using np.meshgrid to create a list of value pairs, getting the ratio for each pair and then selecting rows greater than the threshold.
I also tried the following, but with the dataset size it takes an age:
matched_names = []
for row1 in df.index:
    name1 = df.get_value(row1, "Full_Address")
    for row2 in df3.index:
        name2 = df3.get_value(row2, "Full_Address")
        matched_token = fuzz.token_set_ratio(name1, name2)
        if matched_token > 80:
            matched_names.append([name1, name2, matched_token])
print(matched_names)
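Not part of the original post, but a hedged sketch of one common pattern: fuzzywuzzy's process.extractOne returns the best-scoring choice together with its score, which makes it straightforward to keep the matched df3 address alongside the df address. With roughly 18k x 2.5M comparisons this is still expensive; the point here is only how to recover the matched value. The rapidfuzz package offers a near drop-in, considerably faster version of the same API.

from fuzzywuzzy import fuzz, process

choices = df3['Full_Address'].tolist()

matched = []
for address in df['Full_Address']:
    # extractOne returns a (best_match, score) tuple when choices is a list.
    best_match, score = process.extractOne(address, choices, scorer=fuzz.token_set_ratio)
    if score > 80:
        matched.append((address, best_match, score))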
