Python Life Expectancy - python-3.x

Trying to use pandas to calculate life expectancy with complex equations.
Multiplying or dividing column by column is not difficult to do.
My data is
A b
1 0.99 1000
2 0.95 = 0.99*1000 = 990
3 0.93 = 0.95*990
Field A is populated and field b has only the 1000.
Field b (b2) = A1*b1
Tried the shift function, but got a result for b2 only and the rest zeros. Any help please? Thanks, Mazin

IIUC, if you're starting with:
>>> df
A b
0 0.99 1000.0
1 0.95 NaN
2 0.93 NaN
Then you can do:
df.loc[df.b.isnull(),'b'] = (df.A.cumprod()*1000).shift()
>>> df
A b
0 0.99 1000.0
1 0.95 990.0
2 0.93 940.5
Or more generally:
df['b'] = (df.A.cumprod()*df.b.iloc[0]).shift().fillna(df.b.iloc[0])
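For reference, here is a minimal self-contained sketch of the general form above, assuming the three-row frame from the question. The variable name start is only for illustration.

```python
import numpy as np
import pandas as pd

# Frame from the question: A is fully populated, b holds only the starting value.
df = pd.DataFrame({"A": [0.99, 0.95, 0.93], "b": [1000.0, np.nan, np.nan]})

start = df["b"].iloc[0]  # illustrative name for the initial population

# The running product of A scaled by the start gives each *next* row's b,
# so shift it down one row and put the starting value back in the first row.
df["b"] = (df["A"].cumprod() * start).shift().fillna(start)

print(df)
```

Each row of b ends up being the previous row's b times the previous row's A, which is the recurrence described in the question.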

Related

How to get the column name of a dataframe from values in a numpy array

I have a df with 15 columns:
df.columns:
0 class
1 name
2 location
3 income
4 edu_level
--
14 marital_status
After some transformations I got a numpy.ndarray with shape (15,3) named loads:
0.52 0.33 0.09
0.20 0.53 0.23
0.60 0.28 0.23
0.13 0.45 0.41
0.49 0.9
so on so on so on
So, 3 columns with 15 values.
What I need to do:
I want to get the df column names for the values in the first column of loads that are greater than .50
For this example, the columns of df related to the first column of loads with values higher than 0.5 should return:
0 Class
2 Location
Same for the second column of loads, should return:
1 name
3 income
4 edu_level
and the same logic to the 3rd column of loads.
I managed to get the numpy array loads the way I need it, but I am having a hard time with this last part. I know I can simply pick the columns manually, but this will be a hard task when df has more than 15 features.
Can anyone help me, please?
Given your threshold, you can create a boolean array to filter df.columns:
threshold = .5
for j in range(loads.shape[1]):
    print(df.columns[loads[:, j] > threshold])
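A small runnable sketch of the same idea, collecting the names per column of loads instead of printing them. The five column names come from the question, but the loads values here are made up for illustration (the question's array is only partly shown), and selected is a hypothetical name.

```python
import numpy as np
import pandas as pd

# First five column names from the question, held as an Index like df.columns.
columns = pd.Index(["class", "name", "location", "income", "edu_level"])

# Hypothetical loadings: one row per df column, one column per component.
loads = np.array([
    [0.52, 0.33, 0.09],
    [0.20, 0.53, 0.23],
    [0.60, 0.28, 0.23],
    [0.13, 0.55, 0.41],
    [0.49, 0.90, 0.70],
])

threshold = 0.5

# For each column j of loads, the boolean mask loads[:, j] > threshold has one
# entry per df column, so it can index the column names directly.
selected = {j: list(columns[loads[:, j] > threshold]) for j in range(loads.shape[1])}

for j, names in selected.items():
    print(j, names)
```

This relies on the rows of loads being in the same order as df.columns.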

Adding values to a new column in Pandas depending on values in an existing column

I have a pandas dataframe as follows:
Name Age City Country percentage
a Jack 34 Sydney Australia 0.23
b Riti 30 Delhi India 0.45
c Vikas 31 Mumbai India 0.55
d Neelu 32 Bangalore India 0.73
e John 16 New York US 0.91
f Mike 17 las vegas US 0.78
I am planning to add one more column called bucket whose definition depends on the percentage column as follows:
less than 0.25 = 1
between 0.25 and 0.5 = 2
between 0.5 and 0.75 = 3
greater than 0.75 = 4
I tried the conditions and choices approach with np.select as follows:
conditions = [(df_obj['percentage'] < .25),
(df_obj['percentage'] >=.25 & df_obj['percentage'] < .5),
(df_obj['percentage'] >=.5 & df_obj['percentage'] < .75),
(df_obj['percentage'] >= .75)]
choices = [1,2,3,4]
df_obj['bucket'] = np.select(conditions, choices)
However, this gives me a random error as follows in the line where I create the conditions:
TypeError: Cannot perform 'rand_' with a dtyped [float64] array and scalar of type [bool]
A quick fix to your code is that you need more parentheses, for example:
((df_obj['percentage'] >=.25) & (df_obj['percentage'] < .5) )
^ ^ ^ ^
However, I think it's cleaner with pd.cut:
pd.cut(df['percentage'], bins=[0,0.25, 0.5, 0.75, 1],
include_lowest=True, right=False,
labels=[1,2,3,4])
Or since your buckets are linear:
df['bucket'] = (df['percentage']//0.25).add(1).astype(int)
Output
Name Age City Country percentage bucket
a Jack 34 Sydney Australia 0.23 1
b Riti 30 Delhi India 0.45 2
c Vikas 31 Mumbai India 0.55 3
d Neelu 32 Bangalore India 0.73 3
e John 16 New York US 0.91 4
f Mike 17 las vegas US 0.78 4
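As a quick check, here is a sketch that runs the parenthesised np.select version next to a pd.cut version on the percentages from the question. The column names bucket_select and bucket_cut are only for illustration, and the last bin edge is np.inf here (an assumption, so that a value of exactly 1.0 would still land in bucket 4).

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"percentage": [0.23, 0.45, 0.55, 0.73, 0.91, 0.78]},
    index=list("abcdef"),
)
p = df["percentage"]

# & binds tighter than >= and <, so every comparison needs its own parentheses.
conditions = [
    p < 0.25,
    (p >= 0.25) & (p < 0.5),
    (p >= 0.5) & (p < 0.75),
    p >= 0.75,
]
df["bucket_select"] = np.select(conditions, [1, 2, 3, 4])

# right=False makes the bins closed on the left: [0, 0.25), [0.25, 0.5), ...
df["bucket_cut"] = pd.cut(
    p, bins=[0, 0.25, 0.5, 0.75, np.inf], right=False, labels=[1, 2, 3, 4]
).astype(int)

print(df)
```

Both columns should agree for values of 0 or more. Negative percentages would fall outside the pd.cut bins, so they are not handled by this sketch.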
I think the easiest/most readable way to do this is to use the apply function:
def percentage_to_bucket(percentage):
    if percentage < .25:
        return 1
    elif percentage >= .25 and percentage < .5:
        return 2
    elif percentage >= .5 and percentage < .75:
        return 3
    else:
        return 4
df["bucket"] = df["percentage"].apply(percentage_to_bucket)
Pandas apply will take each value of a given column and apply the passed function to this value, returning a pandas series with the results, which you can then assign to your new column.

Limiting float value in for loop Python

I'm working on a banking project, where my team asks me to limit all float values to 2 decimal places.
My dataSet.head()
Goal: To find max of all stocks comparatively
My present output:
Bank Ticker
BAC 54.900002
C 564.099976
GS 247.919998
JPM 70.080002
MS 89.300003
WFC 58.520000
dtype: float64
My expected output:
Bank Ticker
BAC 54.90
C 564.10
GS 247.91
JPM 70.08
MS 89.30
WFC 58.52
dtype: float64
Please help me with this!
You make a wrong use of "{:.2f}" in your print statement, you should use .format() to format your float.
You can use print("{:.2f}".format(some float)) to print a float with 2 decimals as explained here.
You could use pandas.Series.round method
I've got a toy DataFrame df:
l1 c1 c2
l2 a b c a b c
0 0.066667 0.666667 6.666667 0.0002 0.002 0.02
1 0.133333 1.333333 13.333333 0.0004 0.004 0.04
2 0.200000 2.000000 20.000000 0.0006 0.006 0.06
3 0.266667 2.666667 26.666667 0.0008 0.008 0.08
df.xs('c', axis=1, level='l2').max().round(2)
Results into this:
l1
c1 26.67
c2 0.08
dtype: float64
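To try this without the original data, here is a sketch that rebuilds a frame like the toy one above. The way the values are generated is my reconstruction from the printed table, not necessarily how the original df was made.

```python
import numpy as np
import pandas as pd

# Two-level column index with the level names l1 and l2 from the toy frame.
cols = pd.MultiIndex.from_product([["c1", "c2"], ["a", "b", "c"]], names=["l1", "l2"])

steps = np.arange(1, 5)
data = np.column_stack([
    steps / 15, steps / 1.5, steps / 0.15,         # c1: a, b, c
    steps * 0.0002, steps * 0.002, steps * 0.02,   # c2: a, b, c
])
df = pd.DataFrame(data, columns=cols)

# Take the 'c' sub-column of every l1 group, then the column-wise max,
# rounded to 2 decimal places.
res = df.xs("c", axis=1, level="l2").max().round(2)
print(res)
```

Note that round(2) changes the stored values, while a format string such as "{:.2f}" only changes how they are displayed.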
I guess in your case
res = bank_stocks.xs('Close', axis=1, level='Stock Info').max().round(2)
would result in a Series res indexed by tickers, with the index named Bank Ticker and the desired values rounded to 2 decimal places.
According to this answer you can then print it with
with pd.option_context('display.max_rows', None, 'display.max_columns', None):
    print(res)
I am not a very advanced Python programmer, but I think this should work:
"{:.2f}".format(x)  # x is the max value
Note that this rounds rather than truncates, so it will print 564.099976 as 564.10.
This works for me, is reliable, and needs no loop:
bank_stocks.xs(key='Close',axis=1,level='Stock Info').max().round(2)
O(n^2) is now O(n)
This is not a clean solution, but you can multiply the max stock price by 100, then do a floor division by 1, and then divide by 100.
This would solve your problems.
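A short sketch of the difference between rounding and the multiply/floor/divide trick, using three of the prices from the question. The names prices, rounded and truncated are only for illustration.

```python
import numpy as np
import pandas as pd

prices = pd.Series({"C": 564.099976, "GS": 247.919998, "JPM": 70.080002})

# Series.round goes to the nearest value at 2 decimal places.
rounded = prices.round(2)

# Multiply by 100, floor-divide by 1, divide by 100: this drops the extra
# digits instead of rounding them (assuming positive prices).
truncated = (prices * 100) // 1 / 100

print(pd.DataFrame({"rounded": rounded, "truncated": truncated}))
```

The expected output in the question mixes both behaviours (564.10 looks rounded, 247.91 looks truncated), so it is worth deciding which one the team actually wants. Floating-point representation can also make the truncation step surprising for values that sit very close to a 2-decimal boundary.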

Creating a new column into a dataframe based on conditions

For the dataframe df :
dummy_data1 = {'category': ['White', 'Black', 'Hispanic','White'],
'Pop':['75','85','90','100'],'White_ratio':[0.6,0.4,0.7,0.35],'Black_ratio':[0.3,0.2,0.1,0.45], 'Hispanic_ratio':[0.1,0.4,0.2,0.20] }
df = pd.DataFrame(dummy_data1, columns = ['category', 'Pop','White_ratio', 'Black_ratio', 'Hispanic_ratio'])
I want to add a new column to this data frame,'pop_n', by first checking the category, and then multiplying the value in 'Pop' by the corresponding ratio value in the columns. For the first row,
the category is 'White' so it should multiply 75 with 0.60 and put 45 in pop_n column.
I thought about writing something like :
df['pop_n']= (df['Pop']*df['White_ratio']).where(df['category']=='W')
this works but just for one category.
I would appreciate any help with this.
Thanks.
Using DataFrame.filter and DataFrame.lookup (note that lookup was deprecated in pandas 1.2 and removed in pandas 2.0, so this needs an older version):
First we use filter to get the columns with ratio in the name. Then we split the names and keep only the first word before the underscore.
Finally we use lookup to match the category values to these columns.
# 'Pop' holds strings in the example data, so convert it before multiplying
df['Pop'] = df['Pop'].astype(int)
df2 = df.filter(like='ratio').rename(columns=lambda x: x.split('_')[0])
df['pop_n'] = df2.lookup(df.index, df['category']) * df['Pop']
category Pop White_ratio Black_ratio Hispanic_ratio pop_n
0 White 75 0.60 0.30 0.1 45.0
1 Black 85 0.40 0.20 0.4 17.0
2 Hispanic 90 0.70 0.10 0.2 18.0
3 White 100 0.35 0.45 0.2 35.0
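On pandas versions where DataFrame.lookup is no longer available, the same row-by-row pick can be sketched with NumPy integer indexing. This is one possible replacement rather than the original answer's method, and the names ratios and col_idx are only for illustration.

```python
import numpy as np
import pandas as pd

dummy_data1 = {
    "category": ["White", "Black", "Hispanic", "White"],
    "Pop": ["75", "85", "90", "100"],
    "White_ratio": [0.6, 0.4, 0.7, 0.35],
    "Black_ratio": [0.3, 0.2, 0.1, 0.45],
    "Hispanic_ratio": [0.1, 0.4, 0.2, 0.20],
}
df = pd.DataFrame(dummy_data1)
df["Pop"] = df["Pop"].astype(int)  # Pop is stored as strings in the example

# Keep the ratio columns and rename them to match the category values.
ratios = df.filter(like="ratio").rename(columns=lambda x: x.split("_")[0])

# Position of each row's category among the ratio columns.
col_idx = ratios.columns.get_indexer(df["category"])

# Pick one ratio per row: row i, column col_idx[i].
picked = ratios.to_numpy()[np.arange(len(df)), col_idx]
df["pop_n"] = picked * df["Pop"]

print(df)
```

get_indexer returns -1 for a category with no matching column, which would silently pick the last column here, so checking (col_idx >= 0).all() first is a sensible guard on real data.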
Locate the columns that have underscores in their names:
to_rename = {x: x.split("_")[0] for x in df if "_" in x}
Find the matching factors:
stack = df.rename(columns=to_rename)\
    .set_index('category').stack()
# In Python 3, map returns an iterator, so build the boolean mask as a list
factors = stack[[x[0] == x[1] for x in stack.index]]\
    .reset_index(drop=True)
Multiply the original data by the factors:
df['pop_n'] = df['Pop'].astype(int) * factors
# category Pop White_ratio Black_ratio Hispanic_ratio pop_n
#0 White 75 0.60 0.30 0.1 45
#1 Black 85 0.40 0.20 0.4 17
#2 Hispanic 90 0.70 0.10 0.2 18
#3 White 100 0.35 0.45 0.2 35

Pandas print missing value column names and count only

I am using the following code to print the missing value count and the column names.
# Look for missing data so it can be handled accordingly
def find_missing(data):
    # number of missing values per column
    count_missing = data.isnull().sum().values
    # total records
    total = data.shape[0]
    # fraction of records that are missing
    ratio_missing = count_missing/total
    # return a dataframe to show: feature name, # of missing and ratio of missing
    return pd.DataFrame(data={'missing_count':count_missing, 'missing_ratio':ratio_missing},
                        index=data.columns.values)
find_missing(data_final).head(5)
What I want to do is to only print those columns where there is a missing value as I have a huge data set of about 150 columns.
The data set looks like this
A B C D
123 ABC X Y
123 ABC X Y
NaN ABC NaN NaN
123 ABC NaN NaN
245 ABC NaN NaN
345 ABC NaN NaN
In the output I would just want to see :
missing_count missing_ratio
C 4 0.66
D 4 0.66
and not the columns A and B as there are no missing values there
Use DataFrame.isna with DataFrame.sum
to count by columns. We can also use DataFrame.isnull instead of DataFrame.isna.
new_df = (df.isna()
.sum()
.to_frame('missing_count')
.assign(missing_ratio = lambda x: x['missing_count']/len(df))
.loc[df.isna().any()] )
print(new_df)
We can also use pd.concat instead of DataFrame.assign
count = df.isna().sum()
new_df = (pd.concat([count.rename('missing_count'),
count.div(len(df))
.rename('missing_ratio')],axis = 1)
.loc[count.ne(0)])
Output
missing_count missing_ratio
A 1 0.166667
C 4 0.666667
D 4 0.666667
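For a self-contained check, here is a sketch that rebuilds the sample data from the question and keeps only the columns with at least one missing value. It uses a plain DataFrame constructor rather than either chain above, and summary is an illustrative name.

```python
import numpy as np
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    "A": [123, 123, np.nan, 123, 245, 345],
    "B": ["ABC"] * 6,
    "C": ["X", "X"] + [np.nan] * 4,
    "D": ["Y", "Y"] + [np.nan] * 4,
})

# Missing values per column, and the same as a fraction of all rows.
count = df.isna().sum()
summary = pd.DataFrame({
    "missing_count": count,
    "missing_ratio": count / len(df),
})

# Keep only the columns that actually have something missing.
summary = summary[summary["missing_count"] > 0]
print(summary)
```

Column A shows up because it does contain one NaN in the sample data, even though the question's desired output lists only C and D.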
IIUC, we can assign the missing and total counts to two variables, do some basic math, and assign the results back to a df.
a = df.isnull().sum(axis=0)
b = np.round(df.isnull().sum(axis=0) / df.fillna(0).count(axis=0),2)
missing_df = pd.DataFrame({'missing_vals' : a,
'missing_ratio' : b})
print(missing_df)
missing_vals missing_ratio
A 1 0.17
B 0 0.00
C 4 0.67
D 4 0.67
You can filter out columns that don't have any missing values:
missing_df = missing_df[missing_df.missing_vals.ne(0)]
print(missing_df)
missing_vals missing_ratio
A 1 0.17
C 4 0.67
D 4 0.67
You can also use concat:
s = df.isnull().sum()
result = pd.concat([s, s/len(df)], axis=1)
result.columns = ["missing_count","missing_ratio"]
print (result)
missing_count missing_ratio
A 1 0.166667
B 0 0.000000
C 4 0.666667
D 4 0.666667

Resources