How do I One Hot encode mixed strings and number cell values in pandas? - python-3.x

I have a datasets in which i need to One hot encode composition mixture of different materials.
The columns of my dataset looks like this:
id Composition
0 ZrB2 - 5% B4C
1 HfB2 - 15% SiC - 3% WC
2 HfB2 - 15% SiC
enter image description here
I need to put it in this format:
0)
ZrB2 95
HfB2 0
SiC 0
B4C 5
WC 0
1)
ZrB2 0
HfB2 82
SiC 15
B4C 0
WC 3
2)
ZrB2 0
HfB2 85
SiC 15
B4C 0
WC 0
WB 0
enter image description here

This is not hot encoding as such but parsing a list of strings into constituent parts
each component is delimited by " - "
each component is made up of two parts percentage and column name. Build re that matches each of these constituent parts
from list/dict comprehension put it into a dataframe
complete logic for calculating %age of column where it was not defined
data = ['ZrB2 - 5% B4C', 'HfB2 - 15% SiC - 3% WC', 'HfB2 - 15% SiC']
dfhc = pd.DataFrame({"Composition":data})
# build a list of dict, where dict is of form {'ZrB2': -1, 'B4C': '5'}
# where no %age, default to -1 to be calculated later
parse1 = [{tt[1]:tt[0].replace("% ","") if len(tt[0])>0 else -1
for t in r
# parse out token and percentage, exclude empty tuples (default behaviour of re.findall())
for tt in [x for x in re.findall("([0-9]*[%]?[ ]?)([A-Z,a-z,0-9]*)",t) if x!=("","")]
}
# each column is delimited by " - "
for r in [re.split(" - ",r) for r in dfhc["Composition"].values]
]
df = pd.DataFrame(parse1)
# dtype is important for sum() to work
df = df.astype({c:np.float64 for c in df.columns})
# where %age was not known and defaulted to -1 set it to 100 - sum of other cols
for c in df.columns:
mask = df[df[c]==-1].index
df.loc[mask, c] = 100 - df.loc[mask, [cc for cc in df.columns if cc!=c]].sum(axis=1)
print(f"{dfhc.to_string(index=False)}\n\n{df.to_string(index=False)}\n\n{parse1}")
output
Composition
ZrB2 - 5% B4C
HfB2 - 15% SiC - 3% WC
HfB2 - 15% SiC
ZrB2 B4C HfB2 SiC WC
95.0 5.0 NaN NaN NaN
NaN NaN 82.0 15.0 3.0
NaN NaN 85.0 15.0 NaN
[{'ZrB2': -1, 'B4C': '5'}, {'HfB2': -1, 'SiC': '15', 'WC': '3'}, {'HfB2': -1, 'SiC': '15'}]

Related

Convert different units of a column in pandas

I'm working on a Kaggle project. Below is my CSV file column:
total_sqft
1056
1112
34.46Sq. Meter
4125Perch
1015 - 1540
34.46
10Sq. Yards
10Acres
10Guntha
10Grounds
The column is of type object. First I want to convert all the values to float then update the string 1015 - 1540 with its average value and finally convert the units to square feet. I've tried different StackOverflow solutions but none of them seems to work. Any help would be appreciated.
Expected Output:
total_sqft
1056.00
1112.00
370.307
1123031.25
1277.5
34.46
90.00
435600
10890
24003.5
1 square meter = 10.764 * square foot
1 perch = 272.25 * square foot
1 square yards = 9 * square foot
1 acres = 43560 * square foot
1 guntha = 1089 * square foot
1 grounds = 2400.35 * square foot
First extract numeric values by Series.str.extractall, convert to floats and get averages:
df['avg'] = (df['total_sqft'].str.extractall(r'(\d+\.*\d*)')
.astype(float)
.groupby(level=0)
.mean())
print (df)
total_sqft avg
0 1056 1056.00
1 1112 1112.00
2 34.46Sq. Meter 34.46
3 4125Perch 4125.00
4 1015 - 1540 1277.50
5 34.46 34.46
then need more information for convert to square feet.
EDIT: Create dictionary for match units, extract them from column and use map, last multiple columns:
d = {'Sq. Meter': 10.764, 'Perch':272.25, 'Sq. Yards':9,
'Acres':43560,'Guntha':1089,'Grounds':2400.35}
df['avg'] = (df['total_sqft'].str.extractall(r'(\d+\.*\d*)')
.astype(float)
.groupby(level=0)
.mean())
df['unit'] = df['total_sqft'].str.extract(f'({"|".join(d)})', expand=False)
df['map'] = df['unit'].map(d).fillna(1)
df['total_sqft'] = df['avg'].mul(df['map'])
print (df)
total_sqft avg unit map
0 1.056000e+03 1056.00 NaN 1.000
1 1.112000e+03 1112.00 NaN 1.000
2 3.709274e+02 34.46 Sq. Meter 10.764
3 1.123031e+06 4125.00 Perch 272.250
4 1.277500e+03 1277.50 NaN 1.000
5 3.446000e+01 34.46 NaN 1.000
6 9.000000e+01 10.00 Sq. Yards 9.000
7 4.356000e+05 10.00 Acres 43560.000
8 1.089000e+04 10.00 Guntha 1089.000
9 2.400350e+04 10.00 Grounds 2400.350

How to compare values in a data frame in pandas [duplicate]

I am trying to calculate the biggest difference between summer gold medal counts and winter gold medal counts relative to their total gold medal count. The problem is that I need to consider only countries that have won at least 1 gold medal in both summer and winter.
Gold: Count of summer gold medals
Gold.1: Count of winter gold medals
Gold.2: Total Gold
This a sample of my data:
Gold Gold.1 Gold.2 ID diff gold %
Afghanistan 0 0 0 AFG NaN
Algeria 5 0 5 ALG 1.000000
Argentina 18 0 18 ARG 1.000000
Armenia 1 0 1 ARM 1.000000
Australasia 3 0 3 ANZ 1.000000
Australia 139 5 144 AUS 0.930556
Austria 18 59 77 AUT 0.532468
Azerbaijan 6 0 6 AZE 1.000000
Bahamas 5 0 5 BAH 1.000000
Bahrain 0 0 0 BRN NaN
Barbados 0 0 0 BAR NaN
Belarus 12 6 18 BLR 0.333333
This is the code that I have but it is giving the wrong answer:
def answer():
Gold_Y = df2[(df2['Gold'] > 1) | (df2['Gold.1'] > 1)]
df2['difference'] = (df2['Gold']-df2['Gold.1']).abs()/df2['Gold.2']
return df2['diff gold %'].idxmax()
answer()
Try this code after subbing in the correct (your) function and variable names. I'm new to Python, but I think the issue was that you had to use the same variable in Line 4 (df1['difference']), and just add the method (.idxmax()) to the end. I don't think you need the first line of code for the function, either, as you don't use the local variable (Gold_Y). FYI - I don't think we're working with the same dataset.
def answer_three():
df1['difference'] = (df1['Gold']-df1['Gold.1']).abs()/df1['Gold.2']
return df1['difference'].idxmax()
answer_three()
def answer_three():
atleast_one_gold = df[(df['Gold']>1) & (df['Gold.1']> 1)]
return ((atleast_one_gold['Gold'] - atleast_one_gold['Gold.1'])/atleast_one_gold['Gold.2']).idxmax()
answer_three()
def answer_three():
_df = df[(df['Gold'] > 0) & (df['Gold.1'] > 0)]
return ((_df['Gold'] - _df['Gold.1']) / _df['Gold.2']).argmax() answer_three()
This looks like a question from the programming assignment of courser course -
"Introduction to Data Science in Python"
Having said that if you are not cheating "maybe" the bug is here:
Gold_Y = df2[(df2['Gold'] > 1) | (df2['Gold.1'] > 1)]
You should use the & operator. The | operator means you have countries that have won Gold in either the Summer or Winter olympics.
You should not get a NaN in your diff gold.
def answer_three():
diff=df['Gold']-df['Gold.1']
relativegold = diff.abs()/df['Gold.2']
df['relativegold']=relativegold
x = df[(df['Gold.1']>0) &(df['Gold']>0) ]
return x['relativegold'].idxmax(axis=0)
answer_three()
I an pretty new to python or programming as a whole.
So my solution would be the most novice ever!
I love to create variables; so you'll see a lot in the solution.
def answer_three:
a = df.loc[df['Gold'] > 0,'Gold']
#Boolean masking that only prints the value of Gold that matches the condition as stated in the question; in this case countries who had at least one Gold medal in the summer seasons olympics.
b = df.loc[df['Gold.1'] > 0, 'Gold.1']
#Same comment as above but 'Gold.1' is Gold medals in the winter seasons
dif = abs(a-b)
#returns the abs value of the difference between a and b.
dif.dropna()
#drops all 'Nan' values in the column.
tots = a + b
#i only realised that this step wasn't essential because the data frame had already summed it up in the column 'Gold.2'
tots.dropna()
result = dif.dropna()/tots.dropna()
returns result.idxmax
# returns the index value of the max result
def answer_two():
df2=pd.Series.max(df['Gold']-df['Gold.1'])
df2=df[df['Gold']-df['Gold.1']==df2]
return df2.index[0]
answer_two()
def answer_three():
return ((df[(df['Gold']>0) & (df['Gold.1']>0 )]['Gold'] - df[(df['Gold']>0) & (df['Gold.1']>0 )]['Gold.1'])/df[(df['Gold']>0) & (df['Gold.1']>0 )]['Gold.2']).argmax()

Python - Calculating standard deviation (row level) of dataframe columns

I have created a Pandas Dataframe and am able to determine the standard deviation of one or more columns of this dataframe (column level). I need to determine the standard deviation for all the rows of a particular column. Below are the commands that I have tried so far
# Will determine the standard deviation of all the numerical columns by default.
inp_df.std()
salary 8.194421e-01
num_months 3.690081e+05
no_of_hours 2.518869e+02
# Same as above command. Performs the standard deviation at the column level.
inp_df.std(axis = 0)
# Determines the standard deviation over only the salary column of the dataframe.
inp_df[['salary']].std()
salary 8.194421e-01
# Determines Standard Deviation for every row present in the dataframe. But it
# does this for the entire row and it will output values in a single column.
# One std value for each row.
inp_df.std(axis=1)
0 4.374107e+12
1 4.377543e+12
2 4.374026e+12
3 4.374046e+12
4 4.374112e+12
5 4.373926e+12
When I execute the below command I am getting "NaN" for all the records. Is there a way to resolve this?
# Trying to determine standard deviation only for the "salary" column at the
# row level.
inp_df[['salary']].std(axis = 1)
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
It is expected, because if checking DataFrame.std:
Normalized by N-1 by default. This can be changed using the ddof argument
If you have one element, you're doing a division by 0. So if you have one column and want the sample standard deviation over columns, get all the missing values.
Sample:
inp_df = pd.DataFrame({'salary':[10,20,30],
'num_months':[1,2,3],
'no_of_hours':[2,5,6]})
print (inp_df)
salary num_months no_of_hours
0 10 1 2
1 20 2 5
2 30 3 6
Select one column by one [] for Series:
print (inp_df['salary'])
0 10
1 20
2 30
Name: salary, dtype: int64
Get std of Series - get a scalar:
print (inp_df['salary'].std())
10.0
Select one column by double [] for one column DataFrame:
print (inp_df[['salary']])
salary
0 10
1 20
2 30
Get std of DataFrame per index (default value) - get one element Series:
print (inp_df[['salary']].std())
#same like
#print (inp_df[['salary']].std(axis=0))
salary 10.0
dtype: float64
Get std of DataFrame per columns (axis=1) - get all NaNs:
print (inp_df[['salary']].std(axis = 1))
0 NaN
1 NaN
2 NaN
dtype: float64
If changed default ddof=1 to ddof=0:
print (inp_df[['salary']].std(axis = 1, ddof=0))
0 0.0
1 0.0
2 0.0
dtype: float64
If you want std by two or more columns:
#select 2 columns
print (inp_df[['salary', 'num_months']])
salary num_months
0 10 1
1 20 2
2 30 3
#std by index
print (inp_df[['salary','num_months']].std())
salary 10.0
num_months 1.0
dtype: float64
#std by columns
print (inp_df[['salary','no_of_hours']].std(axis = 1))
0 5.656854
1 10.606602
2 16.970563
dtype: float64

pandas create a flag when merging two dataframes

I have two df - df_a and df_b,
# df_a
number cur
1000 USD
2000 USD
3000 USD
# df_b
number amount deletion
1000 0.0 L
1000 10.0 X
1000 10.0 X
2000 20.0 X
2000 20.0 X
3000 0.0 L
3000 0.0 L
I want to left merge df_a with df_b,
df_a = df_a.merge(df_b.loc[df_b.deletion != 'L'], how='left', on='number')
df_a.fillna(value={'amount':0}, inplace=True)
but also create a flag called deleted in the result df_a, that has three possible values - full, partial and none;
full - if all rows associated with a particular number value, have deletion = L;
partial - if some rows associated with a particular number value, have deletion = L;
none - no rows associated with a particular number value, have deletion = L;
Also when doing the merge, rows from df_b with deletion = L should not be considered; so the result looks like,
number amount deletion deleted cur
1000 10.0 X partial USD
1000 10.0 X partial USD
2000 20.0 X none USD
2000 20.0 X none USD
3000 0.0 NaN full USD
I am wondering how to achieve that.
Idea is compare deletion column and aggregate all and
any, create helper dictionary and last map for new column:
g = df_b['deletion'].eq('L').groupby(df_b['number'])
m1 = g.any()
m2 = g.all()
d1 = dict.fromkeys(m1.index[m1 & ~m2], 'partial')
d2 = dict.fromkeys(m2.index[m2], 'full')
#join dictionries together
d = {**d1, **d2}
print (d)
{1000: 'partial', 3000: 'full'}
df = df_a.merge(df_b.loc[df_b.deletion != 'L'], how='left', on='number')
df['deleted'] = df['number'].map(d).fillna('none')
print (df)
number cur amount deletion deleted
0 1000 USD 10.0 X partial
1 1000 USD 10.0 X partial
2 2000 USD 20.0 X none
3 2000 USD 20.0 X none
4 3000 USD NaN NaN full
For specify column none, if want create dictionary for it:
d1 = dict.fromkeys(m1.index[m1 & ~m2], 'partial')
d2 = dict.fromkeys(m2.index[m2], 'full')
d3 = dict.fromkeys(m2.index[~m1], 'none')
d = {**d1, **d2, **d3}
print (d)
{1000: 'partial', 3000: 'full', 2000: 'none'}
df = df_a.merge(df_b.loc[df_b.deletion != 'L'], how='left', on='number')
df['deleted'] = df['number'].map(d)
print (df)
number cur amount deletion deleted
0 1000 USD 10.0 X partial
1 1000 USD 10.0 X partial
2 2000 USD 20.0 X none
3 2000 USD 20.0 X none
4 3000 USD NaN NaN full

How to check the value is postive or negative and insert a new value in a column of a data frame

I am working with pandas and python and I have a task where, I need to check
GDP diff sign
Quarter
1999q4 12323.3 NaN None
2000q1 12359.1 35.8 None
2000q2 12592.5 233.4 None
2000q3 12607.7 15.2 None
2000q4 12679.3 71.6 None
Let the above dataframe be df and when I do
if df.iloc[2]['diff'] > 0:
df.iloc[2]['sign'] = "Positive"
The value is not getting updated on the dataframe. Is there something where I'm doing wrong. Its a direct assignment like how we do df['something'] = 'some value' and by doing this it will insert than value into df under that column. But when i do the above where i need to determine positive or negative, it is still showing as None when I do
df.iloc[2]['sign']
I tried using apply with lambdas, but couldn't get what I wanted.
Some help would be appreciated
Thank you.
You can use double numpy.where first with filtering NaN values by isnull and then by condition df['diff'] > 0:
df.sign = np.where(df['diff'].isnull(), np.nan,
np.where(df['diff'] > 0, 'Positive', 'Negative'))
print (df)
Quarter GDP diff sign
0 1999q4 12323.3 NaN NaN
1 2000q1 12359.1 35.8 Positive
2 2000q2 12592.5 233.4 Positive
3 2000q3 12607.7 15.2 Positive
4 2000q4 12679.3 -71.6 Negative
because if use only df['diff'] > 0 get Negative for NaN values:
df.sign = np.where(df['diff'] > 0, 'Positive', 'Negative')
print (df)
Quarter GDP diff sign
0 1999q4 12323.3 NaN Negative
1 2000q1 12359.1 35.8 Positive
2 2000q2 12592.5 233.4 Positive
3 2000q3 12607.7 15.2 Positive
4 2000q4 12679.3 -71.6 Negative
I'd create a categorical column
d = df['diff']
sign = np.where(d < 0, 'Negative',
np.where(d == 0, 'UnChanged',
np.where(d > 0, 'Positive', np.nan)))
df['sign'] = pd.Categorical(sign,
categories=['Negative', 'UnChanged', 'Positive'],
ordered=True)
df

Resources