Fill NaN if values in another column are identical - python-3.x

I have the following dataframe:
Out[117]: mydata
author email ri oi
0 X1 NaN NaN 0000-0001-8437-498X
1 X2 NaN NaN NaN
2 X3 ab#ma.com K-5448-2012 0000-0001-8437-498X
3 X4 ab2#ma.com NaN 0000-0001-8437-498X
4 X5 ab#ma.com NaN 0000-0001-8437-498X
where column ri represents an author's ResearcherID, and oi the ORCID. One author may have more than one email address, so column email contains duplicates.
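For reference, a minimal sketch that rebuilds this frame (values transcribed from the printout above):
import numpy as np
import pandas as pd

mydata = pd.DataFrame({
    'author': ['X1', 'X2', 'X3', 'X4', 'X5'],
    'email': [np.nan, np.nan, 'ab#ma.com', 'ab2#ma.com', 'ab#ma.com'],
    'ri': [np.nan, np.nan, 'K-5448-2012', np.nan, np.nan],
    'oi': ['0000-0001-8437-498X', np.nan, '0000-0001-8437-498X',
           '0000-0001-8437-498X', '0000-0001-8437-498X'],
})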
First, I'm trying to fill the NaNs in ri with a non-NaN value from ri wherever the corresponding rows share the same value in oi. The result I want is:
author email ri oi
0 X1 NaN K-5448-2012 0000-0001-8437-498X
1 X2 NaN NaN NaN
2 X3 ab#ma.com K-5448-2012 0000-0001-8437-498X
3 X4 ab2#ma.com K-5448-2012 0000-0001-8437-498X
4 X5 ab#ma.com K-5448-2012 0000-0001-8437-498X
Second, I want to merge the email addresses and use the merged value to fill the NaNs in column email wherever the values in ri (or oi) are identical. I want to get a dataframe like the following one:
author email ri oi
0 X1 ab#ma.com;ab2#ma.com K-5448-2012 0000-0001-8437-498X
1 X2 NaN NaN NaN
2 X3 ab#ma.com;ab2#ma.com K-5448-2012 0000-0001-8437-498X
3 X4 ab#ma.com;ab2#ma.com K-5448-2012 0000-0001-8437-498X
4 X5 ab#ma.com;ab2#ma.com K-5448-2012 0000-0001-8437-498X
I've tried the following code:
final_df = pd.DataFrame()
na_df = mydata[mydata.oi.isna()]
for i in set(mydata.oi.dropna()):
    fill_df = mydata[mydata.oi == i].copy()
    fill_df['ri'] = fill_df['ri'].ffill()
    fill_df['ri'] = fill_df['ri'].bfill()
    final_df = pd.concat([final_df, fill_df])
final_df = pd.concat([final_df, na_df])
This returned what I want for the first step, but is there a more elegant way to approach it? Furthermore, how do I get the merged value in email and then use that merged value as the input when filling the NaNs?

Try two transforms, one for each column: on ri, use 'first'; on email, use a combination of dropna, unique, and join. Because transform returns a result aligned to the original index, rows whose oi is NaN (dropped before grouping) are simply left untouched by the assignments.
g = df.dropna(subset=['oi']).groupby('oi')
# broadcast the first non-null ri back to every row of its oi group
df['ri'] = g.ri.transform('first')
# merge the unique non-null emails of each oi group into one string
df['email'] = g.email.transform(lambda x: ';'.join(x.dropna().unique()))
Out[79]:
author email ri oi
0 X1 ab#ma.com;ab2#ma.com K-5448-2012 0000-0001-8437-498X
1 X2 NaN NaN NaN
2 X3 ab#ma.com;ab2#ma.com K-5448-2012 0000-0001-8437-498X
3 X4 ab#ma.com;ab2#ma.com K-5448-2012 0000-0001-8437-498X
4 X5 ab#ma.com;ab2#ma.com K-5448-2012 0000-0001-8437-498X

Related

How to eliminate 3 letter words or 4 letter words from a column of a dataframe

I have a dataframe as below:
import pandas as pd
import dask.dataframe as dd
a = {'b':['category','categorical','cater pillar','coming and going','bat','No Data','calling','cal'],
'c':['strd1','strd2','strd3', 'strd4','strd5','strd6','strd7', 'strd8']
}
df11 = pd.DataFrame(a,index=['x1','x2','x3','x4','x5','x6','x7','x8'])
I want to replace the values whose string length is three with NaN.
I expect the result to be:
b c
category strd1
categorical strd2
cater pillar strd3
coming and going strd4
NaN strd5
No Data strd6
calling strd7
NaN strd8
Use Series.str.len() to get the length of each string in the series, compare it with Series.eq(), and then use df.loc[] to assign np.nan to b where the condition matches:
import numpy as np
df11.loc[df11.b.str.len().eq(3), 'b'] = np.nan
b c
x1 category strd1
x2 categorical strd2
x3 cater pillar strd3
x4 coming and going strd4
x5 NaN strd5
x6 No Data strd6
x7 calling strd7
x8 NaN strd8
Use str.len to get the length of each string and then conditionally replace the ones whose length equals 3 with NaN using np.where:
df11['b'] = np.where(df11['b'].str.len().eq(3), np.nan, df11['b'])
b c
x1 category strd1
x2 categorical strd2
x3 cater pillar strd3
x4 coming and going strd4
x5 NaN strd5
x6 No Data strd6
x7 calling strd7
x8 NaN strd8
Maybe check mask:
df11['b'] = df11['b'].mask(df11['b'].str.len() <= 3)
df11
Out[16]:
b c
x1 category strd1
x2 categorical strd2
x3 cater pillar strd3
x4 coming and going strd4
x5 NaN strd5
x6 No Data strd6
x7 calling strd7
x8 NaN strd8
You could use a where conditional:
df11['b'] = df11['b'].where(df11.b.map(len) != 3, np.nan)
Something like:
for i, ele in enumerate(df11['b']):
    if len(ele) == 3:
        df11.loc[df11.index[i], 'b'] = np.nan
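One caveat worth noting (my addition, not in the original answers): the last two approaches call len() directly on each value, so they raise a TypeError if b already contains NaN; the str.len()-based answers handle that case, because .str.len() simply returns NaN for missing values instead of raising.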

Replace column values based on partial string match from another dataframe python pandas

I need to update some cell values based on keys from a different dataframe. The keys are always unique strings, but the second dataframe may or may not contain some extra text at the beginning or at the end of the key (not necessarily separated by " ").
Frame:
Keys Values
x1 1
x2 0
x3 0
x4 0
x5 1
Correction:
Name Values
SS x1 1
x2 AA 1
x4 1
Expected output Frame:
Keys Values
x1 1
x2 1
x3 0
x4 1
x5 1
I am using the following:
frame.loc[frame['Keys'].isin(correction['Name']), ['Values']] = correction['Values']
The problem is that isin returns True only on an exact match (as far as I know), which works for only about 30% of my data.
First, extract the key values with a regex pattern built from Frame['Keys'] joined by | (regex OR):
pat = '|'.join(Frame['Keys'])
Correction['Name'] = Correction['Name'].str.extract('(' + pat + ')', expand=False)
#remove non matched rows filled by NaNs
Correction = Correction.dropna(subset=['Name'])
print (Correction)
Name Values
0 x1 1
1 x2 1
2 x4 1
Then create a dictionary from Correction and map it over Frame['Keys'] with Series.map:
d = dict(zip(Correction['Name'], Correction['Values']))
Frame['Values'] = Frame['Keys'].map(d).fillna(Frame['Values']).astype(int)
print (Frame)
Keys Values
0 x1 1
1 x2 1
2 x3 0
3 x4 1
4 x5 1
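One caveat (my addition, not part of the original answer): str.extract treats the pattern as a regular expression, so if the keys can contain regex metacharacters, escape them when building the pattern:
import re
# escape each key so characters like '.', '+' or '(' are matched literally
pat = '|'.join(re.escape(x) for x in Frame['Keys'])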

How to change a value of a cell that contains nan to another specific value?

I have a dataframe that contains nan values in a particular column. While iterating through the rows, if it comes across nan (using the isnan() method) then I need to change it to some other value (since I have some conditions). I tried using replace() and fillna() with the limit parameter, but they modify the whole column when they come across the first nan value. Is there any method I can use to assign a value to a specific nan rather than changing all the values of a column?
Example: the dataframe looks like it:
points sundar cate king varun vicky john charlie target_class
1 x2 5 'cat' 4 10 3 2 1 NaN
2 x3 3 'cat' 1 2 3 1 1 NaN
3 x4 6 'lion' 8 4 3 7 1 NaN
4 x5 4 'lion' 1 1 3 1 1 NaN
5 x6 8 'cat' 10 10 9 7 1 0.0
and I have a list like
a = [1.0, 0.0]
and I expect the result to be like
points sundar cate king varun vicky john charlie target_class
1 x2 5 'cat' 4 10 3 2 1 1.0
2 x3 3 'cat' 1 2 3 1 1 1.0
3 x4 6 'lion' 8 4 3 7 1 1.0
4 x5 4 'lion' 1 1 3 1 1 0.0
5 x6 8 'cat' 10 10 9 7 1 0.0
I want to change the target_class values based on some conditions and assign them values from the above list.
I believe you need to replace the NaN values with 1 only for the indexes specified in the list idx:
mask = df['target_class'].isnull()
idx = [1,2,3]
df.loc[mask, 'target_class'] = df[mask].index.isin(idx).astype(int)
print (df)
points sundar cate king varun vicky john charlie target_class
1 x2 5 'cat' 4 10 3 2 1 1.0
2 x3 3 'cat' 1 2 3 1 1 1.0
3 x4 6 'lion' 8 4 3 7 1 1.0
4 x5 4 'lion' 1 1 3 1 1 0.0
5 x6 8 'cat' 10 10 9 7 1 0.0
Or:
idx = [1,2,3]
s = pd.Series(df.index.isin(idx).astype(int), index=df.index)
df['target_class'] = df['target_class'].fillna(s)
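(fillna with a Series aligns on the index, so each NaN in target_class is filled with the s value at the same label.)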
EDIT:
From the comments, the solution is to assign values by index and column label with DataFrame.loc:
df2.loc['x2', 'target_class'] = list1[0]
I suppose your conditions for imputing the nan values do not depend on the number of them in a column. In the code below I store all the imputation rules in one function that receives as parameters the entire row (containing the nan) and the column you are investigating. If you also need the whole dataframe for the imputation rules, just pass it through the replace_nan function. In the example I impute the col element with the mean of the other columns.
import pandas as pd
import numpy as np
def replace_nan(row, col):
    row[col] = row.drop(col).mean()
    return row
df = pd.DataFrame(np.random.rand(5,3), columns = ['col1', 'col2', 'col3'])
col_to_impute = 'col1'
df.loc[[1, 3], col_to_impute] = np.nan
df = df.apply(lambda x: replace_nan(x, col_to_impute) if np.isnan(x[col_to_impute]) else x, axis=1)
The only thing you have to do is make the right assignment, that is, assign only in the rows that contain nulls.
Example dataset:
,event_id,type,timestamp,label
0,asd12e,click,12322232,0.0
1,asj123,click,212312312,0.0
2,asd321,touch,12312323,0.0
3,asdas3,click,33332233,
4,sdsaa3,touch,33211333,
Note: the last two rows contain nulls in column 'label'. Then, we load the dataset:
df = pd.read_csv('dataset.csv')
Now, we build the appropriate condition:
cond = df['label'].isnull()
Now, we make the assignment over these rows (I don't know your assignment logic, so I simply assign 1 to the NaNs):
df.loc[cond, 'label'] = 1
There are other, more targeted approaches; the fillna() method could also be used, but you would have to provide the assignment logic.
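A minimal sketch of that fillna() route (assuming, hypothetically, that every remaining NaN in 'label' should become 1):
# fill only the missing labels, leaving the existing 0.0 values intact
df['label'] = df['label'].fillna(1)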

Create frame with both single and multi-level columns and add data to it

I have 2 dataframes like this
frame1=pd.DataFrame(columns=['A','B','C'])
a=['d1','d2','d3']
b=['d4','d5']
tups=([('T1',x) for x in a]+
[('T2',x) for x in b])
cols=pd.MultiIndex.from_tuples(tups,names=['Trial','Data'])
frame2=pd.DataFrame(columns=cols)
My goal is to have both DataFrames in one, and then add some rows of data. The resulting DataFrame would be like
Trial A B C T1 T2
Data d1 d2 d3 d4 d5
0 1 2 3 4 5 6 7 8
1 ...
...
That could somehow be achieved if I did
frame2['A']=1
frame2['B']=2
frame2['C']=3
But this is not a clean solution, and I can't create the frame and then add data, for I would be required to at least insert the first row manually.
I tried
frame3=frame1.join(frame2)
>> A B C (T1, d1) (T1, d2) (T1, d3) (T2, d4) (T2, d5)
This, I think, is not a multi-level column index.
My second trial
tup2=([('A1',),('A2',),('A3',)]+[('T1',x) for x in a]+
[('T2',x) for x in b])
cols2=pd.MultiIndex.from_tuples(tup2,names=['Trial','Data'])
data=[1,2,3,4,5,6,7,8]
frame20=pd.DataFrame(data,index=cols2).T
Trial A1 A2 A3 T1 T2
Data NaN NaN NaN d1 d2 d3 d4 d5
0 1 2 3 4 5 6 7 8
This one works fine when querying it, e.g. frame20.loc[0,'A1'][0], but if for example I do
frame20['Peter']=1234
>Trial A1 A2 A3 T1 T2 Peter
Data NaN NaN NaN d1 d2 d3 d4 d5
0 1 2 3 4 5 6 7 8 1234
where column 'Peter' is what I want the columns to look like, as opposed to, for example, A1, which is what I actually get.
My third trial
tup3=(['A','B','C']+[('T1',x) for x in a]+
[('T2',x) for x in b])
cols3=pd.MultiIndex.from_tuples(tup3,names=['Trial','Data'])
frame21=pd.DataFrame(data,index=cols3).T
returned exactly the same as the second one.
So, what I'm looking for, is a way to do
pd.DataFrame(rows_of_data,index=alfa).T #or
pd.DataFrame(rows_of_data,columns=beta)
where either alfa or beta are in a correct format.
Also, as a bonus, let's say I finally came up with a way to do
finalframe=pd.DataFrame(columns=beta)
How do I have to use concat,append or join so I can add a random row of data such as data=[1,2,3,4,5,6,7,8] to my empty but perfectly created finalframe?
You want to add a level to frame1 with empty strings
pandas.MultiIndex.from_tuples
idx = pd.MultiIndex.from_tuples([(c, '') for c in frame1])
f1 = frame1.set_axis(idx, axis=1)
frame3 = pd.concat([f1, frame2], axis=1)
frame3.reindex([0, 1])
A B C T1 T2
d1 d2 d3 d4 d5
0 NaN NaN NaN NaN NaN NaN NaN NaN
1 NaN NaN NaN NaN NaN NaN NaN NaN
pandas.concat
frame3 = pd.concat([
pd.concat([frame1], keys=[''], axis=1).swaplevel(0, 1, 1),
frame2], axis=1)
frame3.reindex([0, 1])
A B C T1 T2
d1 d2 d3 d4 d5
0 NaN NaN NaN NaN NaN NaN NaN NaN
1 NaN NaN NaN NaN NaN NaN NaN NaN
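As for the bonus question, a minimal sketch (assuming the frame3 built above, which has exactly eight columns): a full row of data can be assigned by label with .loc, which enlarges the empty frame:
# assigns one value per column, in column order
frame3.loc[0] = [1, 2, 3, 4, 5, 6, 7, 8]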

How to display cross-tabs of all interactions?

I have a dataset which (in simplified form) looks like this:
import pandas as pd
df = pd.DataFrame({"target":[20,30,40], "x1":[1,0,1], "x2":[0,1,1], "x3":[0,0,1]})
And I want to find the average value of target for all possible two-variable (x_i, x_j) interactions. So the output should look like this:
How would I go about doing this in Pandas?
You can use pivot_table and, to add the missing combinations, reindex by a MultiIndex created with from_product:
df = df.pivot_table(index='x1',columns=['x2','x3'], values='target')
mux = pd.MultiIndex.from_product(df.columns.levels, names=df.columns.names)
df = df.reindex(columns=mux)
print (df)
x2 0 1
x3 0 1 0 1
x1
0 NaN NaN 30.0 NaN
1 20.0 NaN NaN 40.0
If you want to replace the NaNs with 0:
df = df.pivot_table(index='x1',columns=['x2','x3'], values='target', fill_value=0)
mux = pd.MultiIndex.from_product(df.columns.levels, names=df.columns.names)
df = df.reindex(columns=mux, fill_value=0)
print (df)
x2 0 1
x3 0 1 0 1
x1
0 0 0 30 0
1 20 0 0 40
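If you literally want a separate cross-tab for every pair of variables, a sketch of that reading of "all possible two-variable interactions" (applied to the original df, before it is overwritten above):
from itertools import combinations

for xi, xj in combinations(['x1', 'x2', 'x3'], 2):
    # mean of target for each combination of the pair's values
    print(df.pivot_table(index=xi, columns=xj, values='target'))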
