Generate a DataFrame that follows a mathematical function for each column / row - python-3.x

Is there a way to create/generate a Pandas DataFrame from scratch, such that each record follows a specific mathematical function?
Background: In financial mathematics, very basic financial derivatives (e.g. calls and puts) have closed-form pricing formulas (e.g. Black-Scholes). These pricing formulas can be called stochastic functions (because they involve a random term).
I'm trying to create a Monte Carlo simulation of a stock price (and subsequently an option payoff and price based on the stock price). I need, say, 1000 paths (rows) and 100 time-steps (columns). I want to "initiate" a DataFrame that is 1000 by 100 and follows a stochastic equation.
# Pseudo-code
MonteCarloDF = DataFrame(rows=1000, columns=100, customFunc=TRUE,
appliedBy='by column',
FUNC={s0=321;
s_i=prev*exp(r-q*sqrt(sigma))*T +
(etc)*NormDist(rnd())*sqr(deltaT)}
)
Column 0 in every row would be 321, and each subsequent column would be figured out based on the FUNC above.
This is an example of something similar done in VBA
Function MonteCarlo_Vanilla_call(S, K, r, q, vol, T, N)
    sum = 0
    payoff = 0
    For i = 1 To N
        S_T = S * Exp((r - q - 0.5 * vol ^ 2) * T + vol * Sqr(T) * Application.NormSInv(Rnd()))
        payoff = Application.Max(S_T - K, 0)
        sum = sum + payoff
    Next i
    MonteCarlo_Vanilla_call = Exp(-r * T) * sum / N
End Function
Every passed in variable is a constant.
In my case, I want each next column in the same row to be computed just like S_T in the VBA code. That's really the only line that matters. I want to apply a function like S_T = S * Exp((r - q - 0.5 * vol ^ 2) * T + vol * Sqr(T) * Application.NormSInv(Rnd())). Each S_T is the next column in the same row. There are N columns making up one simulation. I will have, for example, 1000 simulations.
321 | 322.125 | 323.277 | ... | column 100 value
321 | 320.704 | 319.839 | ... | column 100 value
321 | 321.471 | 318.456 | ... | column 100 value
...
row 1000| etc | etc | ... | value (1000,100)

IIUC, you could create your own function to generate a DataFrame.
Within the function iterate using .iloc[:, -1] to use the last created column.
We'll also use numpy.random.randn to generate an array of normally distributed random values.
You may need to adjust the default values of your variables, but the idea would be something like:
Function
import pandas as pd
import numpy as np
from math import exp, sqrt
def monte_carlo_df(nrows,
                   ncols,
                   col_1_val,
                   r=0.03,
                   q=0.5,
                   sigma=0.002,
                   T=1.0002,
                   deltaT=0.002):
    """Returns stochastic monte carlo DataFrame"""
    # Create first column
    df = pd.DataFrame({'s0': [col_1_val] * nrows})
    # Create subsequent columns
    for i in range(1, ncols):
        df[f's{i}'] = (df.iloc[:, -1] * exp(r - q * sqrt(sigma)) * T
                       + (np.random.randn(nrows) * sqrt(deltaT)))
    return df
Usage example
df = monte_carlo_df(nrows=1000, ncols=100, col_1_val=321)
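If the step you actually want is the one from the VBA snippet, S * Exp((r - q - 0.5 * vol ^ 2) * T + vol * Sqr(T) * Z), applied once per time-step, the whole table can also be built without a Python loop by accumulating log-returns along each row. This is only a minimal sketch under assumptions: the function name simulate_gbm_paths, the default parameter values, and the choice dt = T / (n_steps - 1) are illustrative and do not come from the question.

```python
import numpy as np
import pandas as pd


def simulate_gbm_paths(n_paths, n_steps, s0, r=0.03, q=0.0, vol=0.2, T=1.0, seed=None):
    """Rows are simulated paths, columns are time-steps (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    dt = T / (n_steps - 1)  # assumption: column 0 is t = 0, last column is t = T
    # One standard normal shock per path and per step after the first column
    z = rng.standard_normal((n_paths, n_steps - 1))
    # Log-return of each step: (r - q - 0.5 * vol^2) * dt + vol * sqrt(dt) * Z
    log_steps = (r - q - 0.5 * vol**2) * dt + vol * np.sqrt(dt) * z
    # Cumulative sum along each row gives log(S_t / S_0); prepend 0 for column 0
    log_paths = np.concatenate(
        [np.zeros((n_paths, 1)), np.cumsum(log_steps, axis=1)], axis=1
    )
    columns = [f"s{i}" for i in range(n_steps)]
    return pd.DataFrame(s0 * np.exp(log_paths), columns=columns)


paths = simulate_gbm_paths(n_paths=1000, n_steps=100, s0=321, seed=42)
```

Every row starts at 321 and each later column depends only on the previous one, so the last column plays the role of S_T in the VBA function; a call payoff could then be taken from paths.iloc[:, -1].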

To me your problem is a specific version of the following one: Pandas calculations based on other rows. Since you can pivot it shouldn't matter if we are talking rows or columns.
There is also a question relating to calculations using columns: Pandas complex calculation based on other columns which has a good suggestion of using a rolling window (rolling function) or using shift function: Calculate the percentage increase or decrease based on the previous column value of the same row in pandas dataframe
Speed considerations of similar calculations (or numpy vs pandas discussion): Numpy, Pandas: what is the fastest way to calculate dataset row value basing on previous N values?
To sum it all up - it seems that your question is somewhat of a duplicate.

Related

Pandas - Get the first n rows if a column has a specific value

I have a DataFrame that has 5 columns including User and MP.
I need to extract a sample of n rows for each User, n being a percentage based on User (if User has 1000 entries and n is 5, select the first 50 rows and go to the next User). After that I have to add all the samples to a new DataFrame. Also, if a User has multiple values in the column MP, the sample should be split across them: for example, if the user has 2 values in the column MP, select 2.5% for one value and 2.5% for the other.
Somehow my logic isn't that good (I started with the first step, without adding the logic for multiple MPs):
df = pd.read_excel("Results/fullData.xlsx")
dfSample = pd.DataFrame()
uniqueValues = df['User'].unique()
print(uniqueValues)
n = 5
for u in uniqueValues:
    sm = df["User"].str.count(u).sum()
    print(sm)
    for u in df['User']:
        sample = df.head(int(sm*(n/100)))
        #print(sample)
        dfSample.append(sample)
print(dfSample)
dfSample.to_excel('testFinal.xlsx')
Check the example below. It is intentionally verbose for clarity. The column that solves the problem is "ROW_PERC". You can filter on it according to the share of rows (50% or 25%, for example) required for each USR/MP.
import pandas as pd
df = pd.DataFrame({'USR': [1, 1, 1, 1, 2, 2, 2, 2],
                   'MP': ['A', 'A', 'A', 'A', 'B', 'B', 'A', 'A'],
                   'COL1': [1, 2, 3, 4, 5, 6, 7, 8]})
# Rank COL1 within each USR/MP group
df['USR_MP_RANK'] = df.groupby(['USR', 'MP'])['COL1'].rank()
# Largest rank in each group, repeated on every row of the group
df['USR_MP_RANK_MAX'] = df.groupby(['USR', 'MP'])['USR_MP_RANK'].transform('max')
# Relative position of the row inside its group, between 0 and 1
df['ROW_PERC'] = df['USR_MP_RANK'] / df['USR_MP_RANK_MAX']
df
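As a complement to the rank-based column, the "first n percent of each group" selection can be sketched with a per-group row counter. This is an illustration under assumptions: the toy data reuses the USR/MP/COL1 layout from the example, and the value pct = 50 is arbitrary.

```python
import pandas as pd

# Toy data in the same shape as the example above
df = pd.DataFrame({
    "USR": [1, 1, 1, 1, 2, 2, 2, 2],
    "MP": ["A", "A", "A", "A", "B", "B", "A", "A"],
    "COL1": [1, 2, 3, 4, 5, 6, 7, 8],
})

pct = 50  # assumption: keep the first 50% of the rows of each USR/MP group

grouped = df.groupby(["USR", "MP"])
position = grouped.cumcount()                    # 0-based position of each row in its group
group_size = grouped["COL1"].transform("size")   # number of rows in the row's group

# Keep a row when its position falls inside the first pct percent of its group
sample = df[position < group_size * pct / 100]
```

Because the percentage is applied inside every USR/MP group, a user with two MP values contributes the same overall share of rows, split across the two values.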

How to Perform operation in each columns in Pandas [duplicate]

I have this dataset and I want to compute the percentage each cell represents within its year, that is, divide each value by the sum of the values for that year: (value / sum(1960)) * 100. How can I get this value for each column and each row?
If I'm understanding correctly, you want the equivalent of 5 / sum([1, 2, 3, 4, 5]) * 100. If that's the case, then you could do the following:
subset_cols = df.columns
perc_df = df[subset_cols] / df[subset_cols].sum(axis=0) * 100
Using axis=0 will apply the function to each column, whereas axis=1 will apply to each row.
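A small worked example may make the difference between the two axes concrete. The data below is hypothetical (the original dataset is not shown), and the variable names col_perc and row_perc are only for illustration.

```python
import pandas as pd

# Hypothetical data: rows are categories, columns are years
df = pd.DataFrame(
    {"1960": [1, 2, 3, 4], "1961": [10, 10, 20, 60]},
    index=["a", "b", "c", "d"],
)

# Share of each cell in its column total: every column sums to 100
col_perc = df / df.sum(axis=0) * 100

# Share of each cell in its row total: every row sums to 100.
# div(..., axis=0) aligns the row totals with the row index.
row_perc = df.div(df.sum(axis=1), axis=0) * 100
```

The plain division works for column totals because pandas aligns the Series returned by sum(axis=0) with the column labels; row totals need the explicit div with axis=0.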
To convert column values into percentages, this is the simplest way:
# Use df[1960] instead if the column labels are integers rather than strings
df['1960_percentages'] = 100 * df['1960'] / df['1960'].sum()
Repeat similarly for other columns.
Note: This creates a new column in your dataframe, keeping the original data intact. If you would just like to replace the values, do the following:
df['1960'] /= (df['1960'].sum() / 100)
Edit: To do the same for multiple columns at once:
cols = df.columns  # or a list of the columns to apply this over
df[cols] /= (df[cols].sum(axis=0) / 100)

Calculate percentage of grouped values

I have a Pandas dataframe that looks like:
I calculated the number of Win, Lost and Draw for each year, and now it looks like:
My goal is to calculate the percentage of each score group by year. To become like:
But I'm stuck here.
I looked in this thread but was not able to apply it on my df.
Any thoughts?
Here is quite a simple method I wrote for this task:
Just do as follows:
create a dataframe of the total score within each year:
total_score = df.groupby('year')['score'].sum().reset_index(name = 'total_score_each_year')
merge the original and the new dataframe into a single dataframe:
df = df.merge(total_score, on = 'year')
calculate the percents:
df['percentages'] = 100 * (df['score'] / df['total_score_each_year'])
That's it, I hope it helps :)
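The same result can also be reached without the intermediate merge by using groupby(...).transform, which returns the yearly total aligned to every original row. The sketch below uses made-up data, since the original dataframe is not shown; the column name result is an assumption.

```python
import pandas as pd

# Hypothetical data in the shape described in the question
df = pd.DataFrame({
    "year": [2000, 2000, 2000, 2001, 2001, 2001],
    "result": ["Draw", "Lost", "Win", "Draw", "Lost", "Win"],
    "score": [2, 3, 5, 1, 1, 2],
})

# transform('sum') repeats each year's total on all rows of that year
yearly_total = df.groupby("year")["score"].transform("sum")
df["percentages"] = 100 * df["score"] / yearly_total
```

Here the percentages come out as 20, 30, 50 for 2000 and 25, 25, 50 for 2001.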
You could try using df.iat[row, column].
It would look something like this:
percentages = []
# Assumes three rows per year, ordered draw, loss, win, with the counts in column 2
for i in range(0, len(df), 3):
    draws = df.iat[i, 2]
    losses = df.iat[i + 1, 2]
    wins = df.iat[i + 2, 2]
    nbr_of_games = draws + losses + wins
    percentages.append(draws * 100 / nbr_of_games)   # percentage of draws
    percentages.append(losses * 100 / nbr_of_games)  # percentage of losses
    percentages.append(wins * 100 / nbr_of_games)    # percentage of wins
df["percentage"] = percentages
This may not be the fastest way to do it, but I hope it helps!
Similar to @panter's answer, but in only one line and without creating any additional DataFrame:
df['percentage'] = df.merge(df.groupby('year').score.sum(), on='year', how='left').apply(
    lambda x: x.score_x * 100 / x.score_y, axis=1
)
In detail:
df.groupby('year').score.sum() creates a Series with the sum of the score per year.
df.merge creates a DataFrame equal to the original df, but with the column score renamed to score_x and an additional column score_y that holds the sum of all the scores for the year of each row; how='left' keeps only the rows of the left DataFrame, i.e., df.
.apply computes the corresponding percentage for each row, using score_x and score_y (mind the axis=1 option, which applies the lambda row by row).

How to calculate a new column from a large Dataframe using Dictionary in a custom function?

I have a dataframe df with 700 million rows and three columns in the following format
key_x key_y num
0 1 1 111.111
1 1 2 222.222
2 1 3 333.333
:
I have a dictionary dict where all values in key_x and key_y are stored as keys
I need to create a new column such that, for each row in df
df['result'] = df['num'] /( dict[key_x] * dict[key_y])
My current approach is to vectorize as follows:
def find_res(key_x, key_y, num):
    return num / (dict[key_x] * dict[key_y])
df["result"] = np.vectorize(find_res)(df["key_x"], df["key_y"], df["num"])
However, this approach is too slow. I have around 500 GB of RAM, so memory is not an issue. Is there a more efficient method to perform the same operation?
You can use map:
df['result'] = df['num'] / (df['key_x'].map(your_dict) * df['key_y'].map(your_dict) )
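To see the map approach end to end, here is a small self-contained sketch. The dictionary values are invented for illustration, and your_dict stands in for the lookup dictionary from the question.

```python
import pandas as pd

# Toy data in the question's layout
df = pd.DataFrame({
    "key_x": [1, 1, 1],
    "key_y": [1, 2, 3],
    "num": [111.111, 222.222, 333.333],
})
your_dict = {1: 1.0, 2: 2.0, 3: 4.0}  # hypothetical lookup values

# Series.map translates a whole column through the dictionary at once,
# instead of calling a Python function once per row
df["result"] = df["num"] / (df["key_x"].map(your_dict) * df["key_y"].map(your_dict))
```

The NumPy documentation describes np.vectorize as a convenience wrapper that is essentially a for loop, which is likely why the original version is slow; keys missing from the dictionary would show up as NaN in the mapped columns.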

Calculate precision and recall based on values in two columns of a python pandas dataframe?

I have a dataframe in the following format:
Column 1 (Expected Output) | Column 2 (Actual Output)
[2,10,5,266,8] | [7,2,9,266]
[4,89,34,453] | [4,22,34,453]
I would like to find the number of items in the actual output that were expected. For example, for row 1, only 2 and 266 were in both the expected and actual output, which means that precision = 2/5 and recall = 2/5.
Since I have over 500 rows, I would like to find some sort of formula to find the precision and recall for each row.
Setting up your df like this:
df = pd.DataFrame({"Col1": [[2,10,5,266,8],[4,89,34,453]],
                   "Col2": [[7,2,9,266],[4,22,34,453]]})
You can find the matching values with:
df["matches"] = [set(df.loc[r, "Col1"]) & set(df.loc[r, "Col2"]) for r in range(len(df))]
from which you can calculate precision and recall.
But be warned that your example takes no account of the ordering of the elements in the expected output and actual output lists, and this solution will fall down if this is important, and also if there are duplicates of any values in the "Expected Output" list.
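Going one step further, the two metrics can be computed per row from the same set intersection. This is a sketch under the usual definitions, which are an assumption about what the asker intends: precision is matches divided by the number of distinct actual items, and recall is matches divided by the number of distinct expected items. Under those definitions the first row gives a precision of 2/4 rather than 2/5. The helper name precision_recall is hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    "Col1": [[2, 10, 5, 266, 8], [4, 89, 34, 453]],  # expected output
    "Col2": [[7, 2, 9, 266], [4, 22, 34, 453]],      # actual output
})


def precision_recall(row):
    """Set-based precision and recall for one row (ignores order and duplicates)."""
    expected, actual = set(row["Col1"]), set(row["Col2"])
    hits = len(expected & actual)
    # Share of actual items that were expected
    precision = hits / len(actual) if actual else 0.0
    # Share of expected items that were actually produced
    recall = hits / len(expected) if expected else 0.0
    return pd.Series({"precision": precision, "recall": recall})


df[["precision", "recall"]] = df.apply(precision_recall, axis=1)
```

As with the matches column, this treats each list as a set, so ordering and repeated values are not taken into account.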

Resources