Creating a new column into a dataframe based on conditions - python-3.x

For the dataframe df :
dummy_data1 = {'category': ['White', 'Black', 'Hispanic','White'],
'Pop':['75','85','90','100'],'White_ratio':[0.6,0.4,0.7,0.35],'Black_ratio':[0.3,0.2,0.1,0.45], 'Hispanic_ratio':[0.1,0.4,0.2,0.20] }
df = pd.DataFrame(dummy_data1, columns = ['category', 'Pop','White_ratio', 'Black_ratio', 'Hispanic_ratio'])
I want to add a new column to this data frame,'pop_n', by first checking the category, and then multiplying the value in 'Pop' by the corresponding ratio value in the columns. For the first row,
the category is 'White' so it should multiply 75 with 0.60 and put 45 in pop_n column.
I thought about writing something like :
df['pop_n']= (df['Pop']*df['White_ratio']).where(df['category']=='W')
this works but just for one category.
I will appreciate any helps with this.
Thanks.

Using DataFrame.filter and DataFrame.lookup:
First we use filter to get the columns with ratio in the name. Then split and keep the first word before the underscore only.
Finally we use lookup to match the category values to these columns.
# df['Pop'] = df['Pop'].astype(int)
df2 = df.filter(like='ratio').rename(columns=lambda x: x.split('_')[0])
df['pop_n'] = df2.lookup(df.index, df['category']) * df['Pop']
category Pop White_ratio Black_ratio Hispanic_ratio pop_n
0 White 75 0.60 0.30 0.1 45.0
1 Black 85 0.40 0.20 0.4 17.0
2 Hispanic 90 0.70 0.10 0.2 18.0
3 White 100 0.35 0.45 0.2 35.0

Locate the columns that have underscores in their names:
to_rename = {x: x.split("_")[0] for x in df if "_" in x}
Find the matching factors:
stack = df.rename(columns=to_rename)\
.set_index('category').stack()
factors = stack[map(lambda x: x[0]==x[1], stack.index)]\
.reset_index(drop=True)
Multiply the original data by the factors:
df['pop_n'] = df['Pop'].astype(int) * factors
# category Pop White_ratio Black_ratio Hispanic_ratio pop_n
#0 White 75 0.60 0.30 0.1 45
#1 Black 85 0.40 0.20 0.4 17
#2 Hispanic 90 0.70 0.10 0.2 18
#3 White 100 0.35 0.45 0.2 35

Related

How to get the column name of a dataframe from values in a numpy array

I have a df with 15 columns:
df.columns:
0 class
1 name
2 location
3 income
4 edu_level
--
14 marital_status
after some transformations I got an numpy.ndarray with shape (15,3) named loads:
0.52 0.33 0.09
0.20 0.53 0.23
0.60 0.28 0.23
0.13 0.45 0.41
0.49 0.9
so on so on so on
So, 3 columns with 15 values.
What I need to do:
I want to get the df column name of the values from the first column of loads that are greater then .50
For this example, the columns of df related to the first column of loadswith values higher than 0.5 should return:
0 Class
2 Location
Same for the second column of loads, should return:
1 name
3 income
4 edu_level
and the same logic to the 3rd column of loads.
I managed to get the numparray loads they way I need it but I am having a bad time with this last part. I know I can simple manually pick the columns but this will be a hard task when df has more than 15 features.
Can anyone help me, please?
given your threshold you can create a boolean array in order to filter df.columns:
threshold = .5
for j in range(loads.shape[1]):
print(df.columms[loads[:,j]>threshold])

Pandas - interpolate over values in index

I have the following Pandas dataframe:
a0 a1 a2 a3
0.2 0.46 15.85 124.06 -380.04
0.4 0.21 28.20 -53.17 87.97
0.6 1.10 -5.55 167.76 -417.72
0.8 0.82 6.11 16.90 -70.86
1.0 1.00 0.00 0.00 0.00
Which is made by:
import pandas as pd
df = pd.DataFrame(data={'a0': [0.46,0.21,1.10,0.82,1],
'a1': [15.85,28.20,-5.55,6.11,0],
'a2': [124.06,-53.17,167.76,16.90,0],
'a3': [-380.04,87.97,-417.72,-70.86,0]},
index=pd.Series(['0.2', '0.4', '0.6','0.8','1.0']))
a0,a1,a2,a3 are polynomial coefficients from a fit y= a0 + a1x + a2x^2 + a3*x^3.
5 fits have been made for 5 ratios Ht/H, these ratios are on the indices.
I want to return values for a0.. a3 for specified Ht/H ratio.
For example, if I specify Ht/H= 0.9, I want to get a0= 0.91, a1= 3.05,a2= 8.45,a3= -35.43.
First I notice that your index is currently strings, and you want numeric for interpolation. So do:
df.index = pd.to_numeric(df.index)
Let's try reindex:
s = 0.9
# create new index that includes the new value
new_idx = np.unique(list(df.index) + [s])
df.reindex(new_idx).interpolate('index').loc[s]
Output:
a0 0.910
a1 3.055
a2 8.450
a3 -35.430
Name: 0.9, dtype: float64

Python LIfe Expectancy

Trying to use panda to calculate life expectanc with complex equations.
Multiply or divide column by column is not difficult to do.
My data is
A b
1 0.99 1000
2 0.95 =0.99*1000=990
3 0.93 = 0.95*990
Field A is populated and field be has only the 1000
Field b (b2) = A1*b1
Tried shift function, got result for b2 only and the rest zeros any help please thanks mazin
IIUC, if you're starting with:
>>> df
A b
0 0.99 1000.0
1 0.95 NaN
2 0.93 NaN
Then you can do:
df.loc[df.b.isnull(),'b'] = (df.A.cumprod()*1000).shift()
>>> df
A b
0 0.99 1000.0
1 0.95 990.0
2 0.93 940.5
Or more generally:
df['b'] = (df.A.cumprod()*df.b.iloc[0]).shift().fillna(df.b.iloc[0])

pandas, how to get close price from returns?

I'm trying to convert from returns to a price index to simulate close prices for the ffn library, but without success.
import pandas as pd
times = pd.to_datetime(pd.Series(['2014-07-4',
'2014-07-15','2014-08-25','2014-08-25','2014-09-10','2014-09-15']))
strategypercentage = [0.01, 0.02, -0.03, 0.04,0.5,-0.3]
df = pd.DataFrame({'llt_return': strategypercentage}, index=times)
df['llt_close']=1
df['llt_close']=df['llt_close'].shift(1)*(1+df['llt_return'])
df.head(10)
llt_return llt_close
2014-07-04 0.01 NaN
2014-07-15 0.02 1.02
2014-08-25 -0.03 0.97
2014-08-25 0.04 1.04
2014-09-10 0.50 1.50
2014-09-15 -0.30 0.70
How can I make this correct?
You can use the cumulative product of return-relatives.
A return-relative is one-plus that day's return.
>>> start = 1.0
>>> df['llt_close'] = start * (1 + df['llt_return']).cumprod()
>>> df
llt_return llt_close
2014-07-04 0.01 1.0100
2014-07-15 0.02 1.0302
2014-08-25 -0.03 0.9993
2014-08-25 0.04 1.0393
2014-09-10 0.50 1.5589
2014-09-15 -0.30 1.0912
This assumes the price index starts at start on the close of the trading day prior to 2014-07-04.
On 7-04, you have a 1% return and the price index closes at 1 * (1 + .01) = 1.01.
On 7-15, return was 2%; close price will be 1.01 * (1 + .02) = 1.0302.
Granted, this is not completely realistic given you're forming a price indexing from irregular-frequency data (missing dates), but hopefully this answers your question.

Pandas rolling mean don't change numbers to NaN in DataFrame

I'm working with a pandas DataFrame which looks like this:
(**N.B - the offset is set as the index of the DataFrame)
offset X Y Z
0 -0.140137 -1.924316 -0.426758
10 -2.789123 -1.111212 -0.416016
20 -0.133789 -1.923828 -4.408691
30 -0.101112 -1.457891 -0.425781
40 -0.126465 -1.926758 -0.414062
50 -0.137207 -1.916992 -0.404297
60 -0.130371 -3.784591 -0.987654
70 -0.125000 -1.918457 -0.403809
80 -0.123456 -1.917480 -0.413574
90 -0.126465 -1.926758 -0.333554
I have applied the rolling mean with window size = 5, to the data frame using the following code.
I need to keep this window size = 5 and I need values for the whole dataframe for all of the offset values (no NaNs).
df = df.rolling(center=False, window=5).mean()
Which gives me:
offset X Y Z
0.0 NaN NaN NaN
10.0 NaN NaN NaN
20.0 NaN NaN NaN
30.0 NaN NaN NaN
40.0 -0.658125 -1.668801 -1.218262
50.0 -0.657539 -1.667336 -1.213769
60.0 -0.125789 -2.202012 -1.328097
70.0 -0.124031 -2.200938 -0.527121
80.0 -0.128500 -2.292856 -0.524679
90.0 -0.128500 -2.292856 -0.508578
I would like the DataFrame to be able to keep the first values that are NaN unchanged and have the the rest of the values as the result of the rolling mean. Is there a simple way that I would be able to do this? Thanks
i.e.
offset X Y Z
0.0 -0.140137 -1.924316 -0.426758
10.0 -2.789123 -1.111212 -0.416016
20.0 -0.133789 -1.923828 -4.408691
30.0 -0.101112 -1.457891 -0.425781
40.0 -0.658125 -1.668801 -1.218262
50.0 -0.657539 -1.667336 -1.213769
60.0 -0.125789 -2.202012 -1.328097
70.0 -0.124031 -2.200938 -0.527121
80.0 -0.128500 -2.292856 -0.524679
90.0 -0.128500 -2.292856 -0.508578
You can fill with the original df:
df.rolling(center=False, window=5).mean().fillna(df)
Out:
X Y Z
offset
0 -0.140137 -1.924316 -0.426758
10 -2.789123 -1.111212 -0.416016
20 -0.133789 -1.923828 -4.408691
30 -0.101112 -1.457891 -0.425781
40 -0.658125 -1.668801 -1.218262
50 -0.657539 -1.667336 -1.213769
60 -0.125789 -2.202012 -1.328097
70 -0.124031 -2.200938 -0.527121
80 -0.128500 -2.292856 -0.524679
90 -0.128500 -2.292856 -0.508578
There is also an argument, min_periods that you can use. If you pass min_periods=1 then it will take the first value as it is, second value as the mean of the first two etc. It might make more sense in some cases.
df.rolling(center=False, window=5, min_periods=1).mean()
Out:
X Y Z
offset
0 -0.140137 -1.924316 -0.426758
10 -1.464630 -1.517764 -0.421387
20 -1.021016 -1.653119 -1.750488
30 -0.791040 -1.604312 -1.419311
40 -0.658125 -1.668801 -1.218262
50 -0.657539 -1.667336 -1.213769
60 -0.125789 -2.202012 -1.328097
70 -0.124031 -2.200938 -0.527121
80 -0.128500 -2.292856 -0.524679
90 -0.128500 -2.292856 -0.508578
Assuming you don't have other rows with all NaN's, you can identify which rows have all NaN's in your rolling_df, and replace them with the corresponding rows from the original. Example:
df=pd.DataFrame(np.random.rand(13,5))
df_rolling=df.rolling(center=False,window=5).mean()
#identify which rows are all NaN
idx = df_rolling.index[df_rolling.isnull().all(1)]
#replace those rows with the original data
df_rolling.loc[idx,:]=df.loc[idx,:]

Resources