I have a Pandas dataframe that looks like:
I calculated the number of Win, Lost and Draw for each year, and now it looks like:
My goal is to calculate the percentage of each score group by year. To become like:
But I am stuck here.
I looked in this thread but was not able to apply it on my df.
Any thoughts?
Here is quite a simple method I wrote for this task:
Just do as follows:
create a dataframe of the total score within each year:
total_score = df.groupby('year')['score'].sum().reset_index(name = 'total_score_each_year')
merge the original and the new dataframe into a single dataframe:
df = df.merge(total_score, on = 'year')
calculate the percents:
df['percentages'] = 100 * (df['score'] / df['total_score_each_year'])
That's it, I hope it helps :)
You could try using : df.iat[row, column]
it would look something like this:
percentages = []
for i in range(0, len(df), 3):  # step through the rows one year (three rows) at a time
    draws = df.iat[i, 2]
    losses = df.iat[i + 1, 2]
    wins = df.iat[i + 2, 2]
    nbr_of_games = draws + losses + wins
    percentages.append(draws * 100 / nbr_of_games)   # Calculate percentage of draws
    percentages.append(losses * 100 / nbr_of_games)  # Calculate percentage of losses
    percentages.append(wins * 100 / nbr_of_games)    # Calculate percentage of wins
df["percentage"] = percentages
This may not be the fastest way to do it but i hope it helps !
Similar to @panter's answer, but in only one line and without creating any additional DataFrame:
df['percentage'] = df.merge(df.groupby('year').score.sum(), on='year', how='left').apply(
lambda x: x.score_x * 100 / x.score_y, axis=1
)
In detail:
df.groupby('year').score.sum() creates a Series with the sum of the score per year, indexed by year.
df.merge creates a DataFrame equal to the original df, but with the column score renamed to score_x and an additional column score_y that holds the total score for the year of each row; how='left' keeps every row of the left DataFrame, i.e., df.
.apply computes the corresponding percentage for each row, using score_x and score_y (mind the axis=1 option, which applies the lambda row by row).
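The merge-based answers above can also be written without any merge, using groupby(...).transform('sum') to broadcast each year's total back onto the rows. This is only a minimal sketch: the column name result and the sample numbers are made up for illustration, while year and score follow the answers above.

```python
import pandas as pd

# Hypothetical sample data: one row per (year, result) with its count in 'score'.
df = pd.DataFrame({
    "year": [2019, 2019, 2019, 2020, 2020, 2020],
    "result": ["Draw", "Lost", "Win", "Draw", "Lost", "Win"],
    "score": [2, 3, 5, 1, 1, 2],
})

# transform('sum') returns a Series aligned with df that holds the yearly
# total on every row, so the division is a plain element-wise operation.
yearly_total = df.groupby("year")["score"].transform("sum")
df["percentage"] = 100 * df["score"] / yearly_total

print(df)
```

Because transform keeps the original index, this also works when df does not have a default RangeIndex.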
I have daily gridded rainfall data with dimensions (time: 14245, lon: 40, lat: 20). I've calculated 2-, 3-, 5- and 7-day accumulated rainfall and their respective 90th percentiles at every grid point in my data domain. I've set my condition using DataArray.where(condition, drop=True) to find when the daily rainfall amount exceeds the threshold, as shown in the code below. My current working code is here:
import numpy as np
import pandas as pd
import xarray as xr
#=== reading in the data ===
data_path = '/home/wilson/Documents/PH_D/GPCC/GPCC/GPCC_daily_1982-2020.nc'
data = xr.open_dataset(data_path)
#=== computing 2, 3, 5 and 7-days accumulated rainfall amount ===
data[['precip_2d']] = np.around(data.precip.rolling(time=2).sum(),decimals=2)
data[['precip_3d']] = np.around(data.precip.rolling(time=3).sum(),decimals=2)
data[['precip_5d']] = np.around(data.precip.rolling(time=5).sum(),decimals=2)
data[['precip_7d']] = np.around(data.precip.rolling(time=7).sum(),decimals=2)
#=== Computing 10% largest at each grid point (per grid cell) this is 90th percentile ===
data[['accum_2d_90p']] = np.around(data.precip_2d.quantile(0.9, dim='time'), decimals=2)
data[['accum_3d_90p']] = np.around(data.precip_3d.quantile(0.9, dim='time'), decimals=2)
data[['accum_5d_90p']] = np.around(data.precip_5d.quantile(0.9, dim='time'), decimals=2)
data[['accum_7d_90p']] = np.around(data.precip_7d.quantile(0.9, dim='time'), decimals=2)
#=== locating extreme events, i.e., when daily precip greater than 90th percentile of each of the accumulated rainfall amount ===
data[['extreme_2d']] = data['precip'].where(data['precip'] > data['accum_2d_90p'], drop=True)
data[['extreme_3d']] = data['precip'].where(data['precip'] > data['accum_3d_90p'], drop=True)
data[['extreme_5d']] = data['precip'].where(data['precip'] > data['accum_5d_90p'], drop=True)
data[['extreme_7d']] = data['precip'].where(data['precip'] > data['accum_7d_90p'], drop=True)
My problem now is how to count the number of grid cells/points within my domain where the condition is true on a particular date and using the result of the count to rank the date in descending order.
Expected result should look like a table that can be saved as txt file. For example: cells_count is a variable that contain desired result, when print(cells_count) gives
Date        Number of grid cells/points
1992-07-01  432
1983-09-23  407
2009-08-12  388
ok based on your comments it sounds like what you'd like is to get a list of dates in your time coordinate based on a global summary statistic of 3D (time, lat, lon) data. I'll use the condition (data['precip'] > data['accum_2d_90p']) as an example.
I'd definitely recommend reducing the dimensionality of your condition first, before working with the dates, because working with ragged 3D datetime arrays is a real pain. So since you mention wanting the count of pixels satisfying the criteria, you can simply do this:
global_pixel_count = (
(data['precip'] > data['accum_2d_90p']).sum(dim=("lat", "lon"))
)
Now, you have a 1-D array, global_pixel_count, indexed by time only. You can get the ranked order of dates from this using xr.DataArray.argsort:
sorted_date_positions = global_pixel_count.argsort()
sorted_dates = global_pixel_count.time.isel(
time=sorted_date_positions.values
)
This will return a DataArray of dates which sort the array in ascending order. You could reverse this to get the dates in descending order with sorted_date_positions.values[::-1] and you could select all the pixel counts in descending order with:
global_pixel_count.isel(time=sorted_date_positions.values[::-1])
you could also index your entire array with this indexer:
data.isel(time=sorted_date_positions.values[::-1])
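To get from the pixel counts to the table of dates requested in the question, one option is to convert the 1-D count to a pandas Series and sort it. The following is a rough sketch on a tiny synthetic grid; the array values, the fixed threshold of 5.0, and the output file name are assumptions made up for illustration, not taken from the GPCC data.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Synthetic stand-in for data['precip']: 4 days on a 2 x 2 grid.
time = pd.date_range("2000-01-01", periods=4)
precip = xr.DataArray(
    np.array([
        [[6.0, 1.0], [1.0, 1.0]],   # 1 cell above the threshold
        [[6.0, 7.0], [8.0, 1.0]],   # 3 cells above the threshold
        [[1.0, 1.0], [1.0, 1.0]],   # 0 cells above the threshold
        [[9.0, 9.0], [9.0, 9.0]],   # 4 cells above the threshold
    ]),
    dims=("time", "lat", "lon"),
    coords={"time": time, "lat": [0.0, 1.0], "lon": [10.0, 11.0]},
    name="precip",
)

# Stand-in for data['accum_2d_90p']: one threshold per grid cell.
threshold = xr.DataArray(
    np.full((2, 2), 5.0),
    dims=("lat", "lon"),
    coords={"lat": [0.0, 1.0], "lon": [10.0, 11.0]},
)

# Count the cells satisfying the condition on each date.
cells_count = (precip > threshold).sum(dim=("lat", "lon")).rename("cells_count")

# Convert to a pandas Series indexed by date and rank in descending order.
table = cells_count.to_series().sort_values(ascending=False)
print(table)

# The ranked table could then be written to a text file, for example:
# table.to_csv("cells_count.txt", sep="\t")
```

Going through pandas for the final step keeps the sorting and file output simple, since the result is only one value per date.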
I am trying to calculate the margins between two values based on 2 other columns.
def calcMargin(data):
    marginsData = data[data.groupby('ID')['Status'].transform(lambda x: all(x != 'Tie'))] # Taking out all inquiries with ties.
    def difference(df): # Subtracts 'Accepted' from lowest price
        if len(df) <=1:
            return pd.NA
        winner = df.loc[(df['Status'] == 'Accepted'), 'Price']
        df = df[df.Status != 'Accepted']
        return min(df['Price']) - winner
    winningMargins = marginsData.groupby('ID').agg(difference(marginsData)).dropna()
    winningMargins.columns = ['Margin']
    winners = marginsData.loc[(marginsData.Status == 'Accepted'), :]
    winners = winners.join(winningMargins, on = 'ID')
    winnersMargins = winners[['Name', 'Margin']].groupby('Name').sum().reset_index()
To explain a bit further, I am trying to find the difference between two prices. One of them is wherever the "Accepted" value is in the second column. The other price is whatever is the lowest price after the "Accepted" row is extracted, then taking the difference between the two. But this is based on grouping by a third column, the ID column. Then, trying to attach the margin to the winner, 'Name', in the fourth column.
I keep getting the error -- IndexError: index 25 is out of bounds for axis 0 with size 0. Not 100% sure how to fix this, or if my code is correct.
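One likely culprit is that .agg(difference(marginsData)) calls difference once on the whole frame and hands its result to agg, instead of passing the function itself. A minimal sketch that sidesteps the per-group function entirely is shown below. It assumes each ID has at most one 'Accepted' row and that the columns are named ID, Name, Status and Price as in the question; the function name winning_margins and the sample data are made up for illustration.

```python
import pandas as pd

def winning_margins(data):
    """Sum, per winner, the gap between the lowest losing price and the accepted price."""
    # Drop every ID that contains at least one tie.
    tie_ids = data.loc[data["Status"] == "Tie", "ID"].unique()
    no_ties = data[~data["ID"].isin(tie_ids)]

    # Accepted row per ID (assumed unique), and lowest price among the other rows.
    accepted = no_ties[no_ties["Status"] == "Accepted"].set_index("ID")
    lowest_other = no_ties[no_ties["Status"] != "Accepted"].groupby("ID")["Price"].min()

    # Subtraction aligns on ID; IDs without a competing price become NaN and are dropped.
    margin = (lowest_other - accepted["Price"]).dropna().rename("Margin")

    # Attach each margin to the winner's name and total it per name.
    winners = accepted[["Name"]].join(margin, how="inner")
    return winners.groupby("Name", as_index=False)["Margin"].sum()

# Hypothetical sample data.
data = pd.DataFrame({
    "ID":     [1, 1, 1, 2, 2, 3, 3, 4, 5, 5],
    "Name":   ["A", "B", "C", "B", "A", "A", "C", "A", "A", "C"],
    "Status": ["Accepted", "Rejected", "Rejected", "Accepted", "Rejected",
               "Tie", "Tie", "Accepted", "Accepted", "Rejected"],
    "Price":  [100, 110, 120, 50, 55, 70, 70, 30, 200, 230],
})

result = winning_margins(data)
print(result)
```

In the sample, ID 3 is excluded because of the tie and ID 4 is dropped because there is no competing price to compare against.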
Initial Note
I already got this running, but it takes a very long time to execute. My DataFrame is around 500MB large. I am hoping to hear some feedback on how to execute this as quickly as possible.
Problem Statement
I want to normalize the DataFrame columns by the mean of the column's values during each month. An added complexity is that I have a column named group which denotes a different sensor in which the parameter (column) was measured. Therefore, the analysis needs to iterate around group and each month.
DF example
                     X  Y  Z  group
2019-02-01 09:30:07  1  2  1  'grp1'
2019-02-01 09:30:23  2  4  3  'grp2'
2019-02-01 09:30:38  3  6  5  'grp1'
...
Code (Functional, but slow)
This is the code that I used. Coding annotations provide descriptions of most lines. I recognize that the three for loops are causing this runtime issue, but I do not have the foresight to see a way around it. Does anyone know any
# Get mean monthly values for each group
mean_per_month_unit = process_df.groupby('group').resample('M').mean()
# Store the monthly dates created in last line into a list called month_dates
month_dates = mean_per_month_unit.index.get_level_values(1)
# Place date on multiIndex columns. future note: use df[DATE, COL_NAME][UNIT] to access mean value
mean_per_month_unit = mean_per_month_unit.unstack().swaplevel(0,1,1).sort_index(axis=1)
divide_df = pd.DataFrame().reindex_like(df)
process_cols.remove('group')
for grp in group_list:
    print(grp)
    # Iterate through month
    for mnth in month_dates:
        # Make mask where month and group
        mask = (df.index.month == mnth.month) & (df['group'] == grp)
        for col in process_cols:
            # Set values of divide_df
            divide_df.iloc[mask.tolist(), divide_df.columns.get_loc(col)] = mean_per_month_unit[mnth, col][grp]
# Divide process_df with divide_df
final_df = process_df / divide_df.values
EDIT: Example data
Here is the data in CSV format.
EDIT2: Current code (according to current answer)
def normalize_df(df):
    df['month'] = df.index.month
    print(df['month'])
    df['year'] = df.index.year
    print(df['year'])
    def find_norm(x, df_col_list): # x is a row in dataframe, col_list is the list of columns to normalize
        agg = df.groupby(by=['group', 'month', 'year'], as_index=True).mean()
        print("###################", x.name, x['month'])
        for column in df_col_list: # iterate over col list, find mean from aggregations, and divide the value by
            print(column)
            mean_col = agg.loc[(x['group'], x['month'], x['year']), column]
            print(mean_col)
            col_name = "norm" + str(column)
            x[col_name] = x[column] / mean_col # norm
        return x
    normalize_cols = df.columns.tolist()
    normalize_cols.remove('group')
    #normalize_cols.remove('mode')
    df2 = df.apply(find_norm, df_col_list = normalize_cols, axis=1)
The code runs perfectly for one iteration and then it fails with the error:
KeyError: ('month', 'occurred at index 2019-02-01 11:30:17')
As I said, it runs correctly once. However, it iterates over the same row again and then fails. I see according to df.apply() documentation that the first row always runs twice. I'm just not sure why this fails on the second time through.
Assuming that the requirement is to normalize each column by its mean per group and month, here is another approach:
Create new columns - month and year from the index. df.index.month can be used for this provided the index is of type DatetimeIndex
type(df.index) # df is the original dataframe
#pandas.core.indexes.datetimes.DatetimeIndex
df['month'] = df.index.month
df['year'] = df.index.year # added year assuming the grouping occurs per grp per month per year. No need to add this column if year is not to be considered.
Now, group over (grp, month, year) and aggregate to find mean of every column. (Added year assuming the grouping occurs per grp per month per year. No need to add this column if year is not to be considered.)
agg = df.groupby(by=['grp', 'month', 'year'], as_index=True).mean()
Use a function to calculate the normalized values and use apply() over the original dataframe
def find_norm(x, df_col_list): # x is a row in dataframe, col_list is the list of columns to normalize
    for column in df_col_list: # iterate over col list, find mean from aggregations, and divide the value by the mean.
        mean_col = agg.loc[(str(x['grp']), x['month'], x['year']), column]
        col_name = "norm" + str(column)
        x[col_name] = x[column] / mean_col # norm
    return x
df2 = df.apply(find_norm, df_col_list = ['A','B','C'], axis=1)
#df2 will now have 3 additional columns - normA, normB, normC
df2:
                     A  B  C  grp  month  year     normA  normB  normC
2019-02-01 09:30:07  1  2  3    1      2  2019  0.666667    0.8    1.5
2019-03-02 09:30:07  2  3  4    1      3  2019  1.000000    1.0    1.0
2019-02-01 09:40:07  2  3  1    2      2  2019  1.000000    1.0    1.0
2019-02-01 09:38:07  2  3  1    1      2  2019  1.333333    1.2    0.5
Alternatively, for step 3, one can join the agg and df dataframes and find the norm.
Hope this helps!
Here is what the code would look like:
# Step 1
df['month'] = df.index.month
df['year'] = df.index.year # added year assuming the grouping occurs
# Step 2
agg = df.groupby(by=['grp', 'month', 'year'], as_index=True).mean()
# Step 3
def find_norm(x, df_col_list): # x is a row in dataframe, col_list is the list of columns to normalize
    for column in df_col_list: # iterate over col list, find mean from aggregations, and divide the value by the mean.
        mean_col = agg.loc[(str(x['grp']), x['month'], x['year']), column]
        col_name = "norm" + str(column)
        x[col_name] = x[column] / mean_col # norm
    return x
df2 = df.apply(find_norm, df_col_list = ['A','B','C'], axis=1)
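Since the original concern was run time, it may be worth noting that the row-wise apply can usually be replaced by groupby(...).transform('mean'), which computes the per-group monthly means and broadcasts them back onto the rows in one vectorized step. The sketch below is an assumption-laden illustration, not the answer's method: the function name normalize_by_monthly_mean, the sample values, and the choice to group by year as well as month are all hypothetical.

```python
import pandas as pd

def normalize_by_monthly_mean(df, group_col="group"):
    """Divide each value column by its mean within the same group, year and month."""
    # Only the measurement columns are normalized; helper columns are never added to df.
    value_cols = [c for c in df.columns if c != group_col]

    # Group keys are passed as arrays, so no 'month'/'year' columns are needed.
    keys = [df[group_col].to_numpy(), df.index.year, df.index.month]

    # transform('mean') returns a frame shaped like df[value_cols],
    # holding the matching group mean on every row.
    monthly_mean = df.groupby(keys)[value_cols].transform("mean")
    return df[value_cols] / monthly_mean

# Hypothetical sample in the shape of the question's DF example.
idx = pd.to_datetime([
    "2019-02-01 09:30:07",
    "2019-02-01 09:30:23",
    "2019-02-01 09:30:38",
    "2019-03-01 09:30:07",
])
df = pd.DataFrame(
    {"X": [1, 2, 3, 4], "Y": [2, 4, 6, 8], "Z": [1, 3, 5, 2],
     "group": ["grp1", "grp2", "grp1", "grp1"]},
    index=idx,
)

normalized = normalize_by_monthly_mean(df)
print(normalized)
```

Because the list of columns to normalize is built before any helper column exists, this also avoids looking up 'month' or 'year' as if they were measurement columns.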
Is there a way to create/generate a Pandas DataFrame from scratch, such that each record follows a specific mathematical function?
Background: In Financial Mathematics, very basic financial-derivatives (e.g. calls and puts) have closed-form pricing formulas (e.g. Black Scholes). These pricing formulas can be called stochastic functions (because they involve a random term)
I'm trying to create a Monte Carlo simulation of a stock price (and subsequently an option payoff and price based on the stock price). I need, say, 1000 paths (rows) and 100 time-steps (columns). I want to "initiate" a dataframe that is 1000 by 100 and follows a stochastic equation.
# Pseudo-code
MonteCarloDF = DataFrame(rows=1000, columns=100, customFunc=TRUE,
appliedBy='by column',
FUNC={s0=321;
s_i=prev*exp(r-q*sqrt(sigma))*T +
(etc)*NormDist(rnd())*sqr(deltaT)}
)
Column 0 in every row would be 321, and each subsequent column would be figured out based on the FUNC above.
This is an example of something similar done in VBA
Function MonteCarlo_Vanilla_call(S, K, r, q, vol, T, N)
    sum = 0
    payoff = 0
    For i = 1 To N
        S_T = S * Exp((r - q - 0.5 * vol ^ 2) * T + vol * Sqr(T) * Application.NormSInv(Rnd()))
        payoff = Application.Max(S_T - K, 0)
        sum = sum + payoff
    Next i
    MonteCarlo_Vanilla_call = Exp(-r * T) * sum / N
End Function
Every passed in variable is a constant.
In my case, I want each next column in the same row to be just like S_T in the VBA code. That's really the only line that matters. I want to apply a function like S_T = S * Exp((r - q - 0.5 * vol ^ 2) * T + vol * Sqr(T) * Application.NormSInv(Rnd())). Each S_T is the next column in the same row. There are N columns making one simulation. I will have, for example, 1000 simulations.
321 | 322.125 | 323.277 | ... | column 100 value
321 | 320.704 | 319.839 | ... | column 100 value
321 | 321.471 | 318.456 | ... | column 100 value
...
row 1000| etc | etc | ... | value (1000,100)
IIUC, you could create your own function to generate a DataFrame.
Within the function iterate using .iloc[:, -1] to use the last created column.
We'll also use numpy.random.randn to generate an array of normally distributed random values.
You may need to adjust the default values of your variables, but the idea would be something like:
Function
import pandas as pd
import numpy as np
from math import exp, sqrt
def monte_carlo_df(nrows,
                   ncols,
                   col_1_val,
                   r=0.03,
                   q=0.5,
                   sigma=0.002,
                   T=1.0002,
                   deltaT=0.002):
    """Returns stochastic monte carlo DataFrame"""
    # Create first column
    df = pd.DataFrame({'s0': [col_1_val] * nrows})
    # Create subsequent columns
    for i in range(1, ncols):
        df[f's{i}'] = (df.iloc[:, -1] * exp(r - q * sqrt(sigma)) * T
                       + (np.random.randn(nrows) * sqrt(deltaT)))
    return df
Usage example
df = monte_carlo_df(nrows=1000, ncols=100, col_1_val=321)
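If the per-step update is meant to be the S_T formula from the VBA snippet in the question, applied over a small step dt instead of the full T, the column loop can be avoided by drawing all the random numbers at once and taking a cumulative sum of the log-increments. The following is a sketch under that assumption; the function name simulate_gbm_paths, the parameter values, and the use of dt = T / n_steps are illustrative choices, not part of the answer above.

```python
import numpy as np
import pandas as pd

def simulate_gbm_paths(s0, r, q, vol, T, n_steps, n_paths, seed=None):
    """Return a DataFrame with one simulated price path per row."""
    rng = np.random.default_rng(seed)
    dt = T / n_steps

    # One standard normal draw per path and per time step.
    z = rng.standard_normal((n_paths, n_steps))

    # Log-return of each step: (r - q - 0.5 * vol^2) * dt + vol * sqrt(dt) * Z.
    log_increments = (r - q - 0.5 * vol**2) * dt + vol * np.sqrt(dt) * z

    # Cumulative log-return along each row, with a leading zero for the start price.
    log_paths = np.hstack([np.zeros((n_paths, 1)), np.cumsum(log_increments, axis=1)])

    prices = s0 * np.exp(log_paths)
    return pd.DataFrame(prices, columns=[f"s{i}" for i in range(n_steps + 1)])

# 1000 paths with 100 columns: the start price plus 99 steps.
paths = simulate_gbm_paths(s0=321.0, r=0.03, q=0.01, vol=0.2, T=1.0,
                           n_steps=99, n_paths=1000, seed=42)
print(paths.shape)
```

Building the whole array in NumPy and wrapping it in a DataFrame once also avoids inserting columns one at a time.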
To me your problem is a specific version of the following one: Pandas calculations based on other rows. Since you can pivot it shouldn't matter if we are talking rows or columns.
There is also a question relating to calculations using columns: Pandas complex calculation based on other columns which has a good suggestion of using a rolling window (rolling function) or using shift function: Calculate the percentage increase or decrease based on the previous column value of the same row in pandas dataframe
Speed considerations of similar calculations (or numpy vs pandas discussion): Numpy, Pandas: what is the fastest way to calculate dataset row value basing on previous N values?
To sum it all up - it seems that your question is somewhat of a duplicate.
I have time series data with a column which can take a value A, B, or C.
An example of my data looks like this:
date,category
2017-01-01,A
2017-01-15,B
2017-01-20,A
2017-02-02,C
2017-02-03,A
2017-02-05,C
2017-02-08,C
I want to group my data by month and store both the sum of the count of A and the count of B in column a_or_b_count and the count of C in c_count.
I've tried several things, but the closest I've been able to do is to preprocess the data with the following function:
def preprocess(df):
    # Remove everything more granular than day by splitting the stringified version of the date.
    df['date'] = pd.to_datetime(df['date'].apply(lambda t: t.replace('\ufeff', '')), format="%Y-%m-%d")
    # Set the time column as the index and drop redundant time column now that time is indexed. Do this op in-place.
    df = df.set_index(df.date)
    df.drop('date', inplace=True, axis=1)
    # Group all events by (year, month) and count category by values.
    counted_events = df.groupby([(df.index.year), (df.index.month)], as_index=True).category.value_counts()
    counted_events.index.names = ["year", "month", "category"]
    return counted_events
which gives me the following:
year  month  category
2017  1      A           2
             B           1
      2      C           3
             A           1
The process to sum up all A's and B's would be quite manual since category becomes a part of the index in this case.
I'm an absolute pandas menace, so I'm likely making this much harder than it actually is. Can anyone give tips for how to achieve this grouping in pandas?
I tried this, so I'm posting it, though I like @Scott Boston's solution better, as I combined the A and B values earlier.
df.date = pd.to_datetime(df.date, format = '%Y-%m-%d')
df.loc[(df.category == 'A')|(df.category == 'B'), 'category'] = 'AB'
new_df = df.groupby([df.date.dt.year,df.date.dt.month]).category.value_counts().unstack().fillna(0)
new_df.columns = ['a_or_b_count', 'c_count']
new_df.index.names = ['Year', 'Month']
            a_or_b_count  c_count
Year Month
2017 1               3.0      0.0
     2               1.0      3.0
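A variation that leaves the category column untouched and yields integer counts is to build two boolean columns and sum them per month. This is a sketch using the sample data from the question; the index level names year and month are my own choice.

```python
from io import StringIO

import pandas as pd

# Sample data from the question.
csv = """date,category
2017-01-01,A
2017-01-15,B
2017-01-20,A
2017-02-02,C
2017-02-03,A
2017-02-05,C
2017-02-08,C
"""
df = pd.read_csv(StringIO(csv), parse_dates=["date"])

# Boolean flags: summing them per group counts the matching rows.
flags = df.assign(
    a_or_b_count=df["category"].isin(["A", "B"]),
    c_count=df["category"].eq("C"),
)

counts = flags.groupby(
    [df["date"].dt.year.rename("year"), df["date"].dt.month.rename("month")]
)[["a_or_b_count", "c_count"]].sum()

print(counts)
```

Months in which a category never occurs simply get a count of 0, so no fillna step is needed.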