Replace column values based on partial string match from another dataframe python pandas - python-3.x

I need to update some cell values, based on keys from a different dataframe. The keys are always unique strings, but the second dataframe may or may not contain some extra text at the beginning or at the end of the key. (not necessarily separated by " ")
Frame:
Keys Values
x1 1
x2 0
x3 0
x4 0
x5 1
Correction:
Name Values
SS x1 1
x2 AA 1
x4 1
Expected output Frame:
Keys Values
x1 1
x2 1
x3 0
x4 1
x5 1
I am using the following:
frame.loc[frame['Keys'].isin(correction['Keys']), ['Values']] = correction['Values']
The problem is that isin returns True only on an exact match (as far as I know), which works for only about 30% of my data.

First extract the key from each Correction['Name'] with a regex built by joining Frame['Keys'] with | (regex OR):
pat = '|'.join(x for x in Frame['Keys'])
Correction['Name'] = Correction['Name'].str.extract('('+ pat + ')', expand=False)
# remove rows where no key matched (extract returns NaN for them)
Correction = Correction.dropna(subset=['Name'])
print (Correction)
Name Values
0 x1 1
1 x2 1
2 x4 1
Then build a dictionary from Correction and map it onto Frame['Keys'], keeping the original value where no correction exists:
d = dict(zip(Correction['Name'], Correction['Values']))
Frame['Values'] = Frame['Keys'].map(d).fillna(Frame['Values']).astype(int)
print (Frame)
Keys Values
0 x1 1
1 x2 1
2 x3 0
3 x4 1
4 x5 1
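If the keys can contain regex metacharacters (such as . or +), or one key is a prefix of another (x1 and x10), the joined pattern may not match what was intended. A minimal sketch of one way to guard against that, assuming the same Frame and Correction layout as above; re.escape and the length-based sort are additions, not part of the original answer:

```python
import re
import pandas as pd

Frame = pd.DataFrame({'Keys': ['x1', 'x2', 'x3', 'x4', 'x5'],
                      'Values': [1, 0, 0, 0, 1]})
Correction = pd.DataFrame({'Name': ['SS x1', 'x2 AA', 'x4'],
                           'Values': [1, 1, 1]})

# Escape each key so it is matched literally; put longer keys first
# so that a key like 'x10' would be tried before 'x1'.
keys = sorted(Frame['Keys'], key=len, reverse=True)
pat = '(' + '|'.join(re.escape(k) for k in keys) + ')'

# Extract the key from each name; names containing no key become NaN and are dropped.
matched = Correction.assign(Name=Correction['Name'].str.extract(pat, expand=False))
matched = matched.dropna(subset=['Name'])

# Map corrected values onto Frame, keeping the old value where nothing matched.
d = dict(zip(matched['Name'], matched['Values']))
Frame['Values'] = Frame['Keys'].map(d).fillna(Frame['Values']).astype(int)
print(Frame)
```

Using assign here leaves the original Correction frame untouched, which may be convenient if the raw names are needed later.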

Related

Fill NaN if values in another column are identical

I have the following dataframe:
Out[117]: mydata
author email ri oi
0 X1 NaN NaN 0000-0001-8437-498X
1 X2 NaN NaN NaN
2 X3 ab#ma.com K-5448-2012 0000-0001-8437-498X
3 X4 ab2#ma.com NaN 0000-0001-8437-498X
4 X5 ab#ma.com NaN 0000-0001-8437-498X
where column ri holds an author's ResearcherID and oi the ORCID. One author may have more than one email address, so column email has duplicates.
First, I want to fill the NaN values in ri with the non-NaN ri value found in rows that share the same oi. The result I want is:
author email ri oi
0 X1 NaN K-5448-2012 0000-0001-8437-498X
1 X2 NaN NaN NaN
2 X3 ab#ma.com K-5448-2012 0000-0001-8437-498X
3 X4 ab2#ma.com K-5448-2012 0000-0001-8437-498X
4 X5 ab#ma.com K-5448-2012 0000-0001-8437-498X
Second, I want to merge the emails of rows whose ri (or oi) values are identical and use the merged value to fill column email. I want to get a dataframe like the following one:
author email ri oi
0 X1 ab#ma.com;ab2#ma.com K-5448-2012 0000-0001-8437-498X
1 X2 NaN NaN NaN
2 X3 ab#ma.com;ab2#ma.com K-5448-2012 0000-0001-8437-498X
3 X4 ab#ma.com;ab2#ma.com K-5448-2012 0000-0001-8437-498X
4 X5 ab#ma.com;ab2#ma.com K-5448-2012 0000-0001-8437-498X
I've tried the following code:
final_df = pd.DataFrame()
na_df = mydata[mydata.oi.isna()]
for i in set(mydata.oi.dropna()):
    # copy the slice so the assignment below does not touch a view of mydata
    fill_df = mydata[mydata.oi == i].copy()
    # propagate the known ri forwards, then backwards, within this oi group
    fill_df['ri'] = fill_df['ri'].ffill().bfill()
    final_df = pd.concat([final_df, fill_df])
# append the rows that have no oi at all
final_df = pd.concat([final_df, na_df])
This code returned what I want for the first step, but is there a more elegant way to approach this? Furthermore, how can I get the merged value in email and then use it to fill the NaN values?
Try two transforms on a groupby over oi, one for each column. On ri, use first. On email, use a combination of dropna, unique, and join:
g = df.dropna(subset=['oi']).groupby('oi')
df['ri'] = g.ri.transform('first')
df['email'] = g.email.transform(lambda x: ';'.join(x.dropna().unique()))
Out[79]:
author email ri oi
0 X1 ab#ma.com;ab2#ma.com K-5448-2012 0000-0001-8437-498X
1 X2 NaN NaN NaN
2 X3 ab#ma.com;ab2#ma.com K-5448-2012 0000-0001-8437-498X
3 X4 ab#ma.com;ab2#ma.com K-5448-2012 0000-0001-8437-498X
4 X5 ab#ma.com;ab2#ma.com K-5448-2012 0000-0001-8437-498X
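Assigning the transform result directly leaves NaN in ri and email on every row whose oi is missing, even if that row had a value of its own. A hedged variant that only fills gaps, rebuilt on the sample data from the question (the frame is called mydata as in the question; the fillna and reindex steps are additions to the answer):

```python
import numpy as np
import pandas as pd

mydata = pd.DataFrame({
    'author': ['X1', 'X2', 'X3', 'X4', 'X5'],
    'email': [np.nan, np.nan, 'ab#ma.com', 'ab2#ma.com', 'ab#ma.com'],
    'ri': [np.nan, np.nan, 'K-5448-2012', np.nan, np.nan],
    'oi': ['0000-0001-8437-498X', np.nan, '0000-0001-8437-498X',
           '0000-0001-8437-498X', '0000-0001-8437-498X'],
})

g = mydata.dropna(subset=['oi']).groupby('oi')

# First non-null ri per ORCID, used only to fill gaps in the existing column.
mydata['ri'] = mydata['ri'].fillna(g['ri'].transform('first'))

# Merged e-mail string per ORCID; rows without an ORCID keep their own value.
merged = g['email'].transform(lambda s: ';'.join(s.dropna().unique()))
mydata['email'] = merged.reindex(mydata.index).fillna(mydata['email'])
print(mydata)
```

Note that a group with no email at all would get an empty string from join, so that case may need separate handling.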

EXCEL: Average Rank of list with sumproduct or rank.avg?

e.g. I have a list
A     B    C    D    E..
Name  avg  sec  sec
x1    ?    3    2
x2         5    1
x3         7    3
..
Is it possible to find the average rank across columns C..D with SUMPRODUCT or RANK.AVG?
Player x1 is in place 1 in column C and place 2 in column D, so the average is 1.5:
x1 1.5
x2 1.5
x3 3
You can try below formula
=(RANK.EQ(C2,$C$2:$C$4,1)+RANK.EQ(D2,$D$2:$D$4,1))/2
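As a cross-check outside Excel, the same average can be sketched in pandas. This assumes the three sample rows from the question; the column names C and D simply mirror the spreadsheet columns:

```python
import pandas as pd

df = pd.DataFrame({'Name': ['x1', 'x2', 'x3'],
                   'C': [3, 5, 7],
                   'D': [2, 1, 3]})

# method='min' gives tied values the best shared rank, like RANK.EQ;
# ascending=True corresponds to the trailing 1 in the Excel formula.
ranks = df[['C', 'D']].rank(method='min', ascending=True)

# Average the per-column ranks row by row.
df['avg'] = ranks.mean(axis=1)
print(df)
```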

Adding and subtract inbetween row inputs and value equal to the first column next row using pandas

Assume I have a dataset with three inputs:
x1 x2 x3
0 a b c
1 d e f
2 g h i
3 j k l
4 m n o
5 p q r
6 s t u
:
:
0, 1, 2, 3 are times, and x1, x2, x3 are measured inputs. x1 is measured every hour; x2 and x3 are measured at different times. For each row, I want to add x2 to x1 and subtract x3, and the result should equal the x1 value at the next time step.
So here what I want to do is:
x1 x2 x3 y
0 a b c a+b-c=d
1 d e f d+e-f=g
2 g h i g+h-i=j
3 j k l j+k-l=m
4 m n o m+n-o=p
5 p q r p+q-r=s
6 s t u s+t-u=v
:
:
Here with my actual data according to my csv file:
X1 x2 x3 y
0 63 0 0 63+0-0=63
60(min) 63 0 2 63+0-2 =104
120 104 11 0 104+11-0=93
180 93 0 50 93+0-50=177
240 177 0 2 177+0-2=133
300 133 0 0 133+0-0=next value of x1
I tried the shift method, but it did not do exactly what I want. I tried another method that ran, but the result was not what I expected. Here is the code.
Code :
data = pd.read_csv('data6.csv')
i=0
j=1
while j < len(data):
    j=data['x1'][i] - data['x2'][i] + data['x3'][i]
    i+=1
    j!=i
    print(j)
This works, but it prints only one value:
63
In my csv file this is the second value of x1.
I want this calculation to run continuously over all rows and produce the values shown above.
Can anyone help me solve this problem?
Try:
>>> df['y'] = df['x1'] + '+' + df['x2'] + '-' + df['x3'] + '!=' + df.shift(-1)['x1']
>>> df
x1 x2 x3 y
0 a b c a+b-c!=d
1 d e f d+e-f!=g
2 g h i g+h-i!=j
3 j k l j+k-l!=m
4 m n o m+n-o!=p
5 p q r p+q-r!=s
6 s t u NaN
>>>
I found the answer with your help. Thank you very much, @Adam.Er8, @U10-Forward and @anky_91.
Here is my code:
df = pd.DataFrame(data)
df['y'] = 0
y_col = df.columns.get_loc('y')
for i in range(len(df) - 1):  # iterate over every row except the last
    # the result of x1 + x2 - x3 for row i is the x1 value measured in row i+1
    df.iloc[i, y_col] = df['x1'].iloc[i + 1]
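The loop above copies the next row's x1 into y, which a single shift can also do. A minimal sketch using the sample values from the question; the minute-based index is an assumption about how the CSV is laid out:

```python
import pandas as pd

df = pd.DataFrame({'x1': [63, 63, 104, 93, 177, 133],
                   'x2': [0, 0, 11, 0, 0, 0],
                   'x3': [0, 2, 0, 50, 2, 0]},
                  index=[0, 60, 120, 180, 240, 300])  # minutes (assumed)

# shift(-1) moves every x1 value up one row, so y holds the next measurement.
# The last row has no successor and becomes NaN.
df['y'] = df['x1'].shift(-1)
print(df)
```

Unlike the loop, this leaves NaN rather than 0 in the last row, which makes the missing next value explicit.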

Populate cells based on x by y cell value

I'm trying to populate cells based on values from two different cells.
The values in each column need to run from 0 to n-1, where n is the input for that column, and each value is repeated according to the input in the other cell.
For example, I have input:
x y
2 5
Output should be:
x should have 0 and 1; each repeated five times
y should have 0, 1, 2, 3, 4; each repeated twice
x1 y1
0 0
0 1
0 2
0 3
0 4
1 0
1 1
1 2
1 3
1 4
I used:
=IF(ROW()<=C2+1,K2-1,"")
and
=IF(ROW()<=d2+1,K2-1,"")
but it is not repeating and I only see:
x y
0 0
1 1
__ 2
__ 3
__ 4
(C2 and D2 are where values for x and y are, K is the number of items.)
Are there any suggestions on how I can do this?
Enter these in row 2 and copy down as far as needed:
=IF(ROW()<=1+C$2*D$2,INT((ROW()-2)/D$2),"")
and
=IF(ROW()<=1+C$2*D$2,MOD(ROW()-2,D$2),"")
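The two formulas are the integer quotient and the remainder of a zero-based row counter divided by the y input. A small Python sketch of the same arithmetic, assuming x=2 and y=5 as in the example (the variable names are illustrative only):

```python
x, y = 2, 5  # the inputs held in C2 and D2

# k plays the role of ROW()-2 and runs from 0 to x*y - 1.
# divmod(k, y) returns (k // y, k % y), i.e. the INT(...) and MOD(...) parts.
rows = [divmod(k, y) for k in range(x * y)]

for x1, y1 in rows:
    print(x1, y1)
```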

Selecting data from multiple dataframes

My workbook Rule.xlsx has the following data.
sheet1:
group ordercode quantity
0 A 1
B 3
1 C 1
E 2
D 1
Sheet 2:
group ordercode quantity
0 x 1
y 3
1 x 1
y 2
z 1
I have created the dataframes using the method below.
df1 = data.parse('sheet1')
df2 = data.parse('sheet2')
My desired result is a sequence built from these two dataframes.
df3:
group ordercode quantity
0 A 1
B 3
0 x 1
y 3
1 C 1
E 2
D 1
1 x 1
y 2
z 1
That is, each group is taken first from df1 and then from df2.
I would like to know how I can print the data by selecting a group number (e.g. group 0, group 1, etc.).
Any suggestions?
After some comments, the solution is:
# read every sheet into a dict of DataFrames keyed by sheet name
dfs = pd.read_excel('Rule.xlsx', sheet_name=None)
# order in which the sheets should be interleaved
order = 'SWC_1380_81,SWC_1382,SWC_1390,SWC_1391,SWM_1380_81'.split(',')
# for each sheet, forward-fill the blank group cells and add a helper column g with the sheet position
L = [dfs[x].ffill().assign(g=i) for i, x in enumerate(order)]
# join everything, sort by group and sheet position, then remove the helper column
df = pd.concat(L).sort_values(['group','g']).drop(columns='g')
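To try the idea without the workbook, here is a sketch with two in-memory frames standing in for the sheets. The dict dfs mimics what read_excel returns with sheet_name=None, and the extra row helper column is an assumption added to keep the within-sheet order explicit:

```python
import numpy as np
import pandas as pd

# Stand-ins for the two sheets; blank group cells are read as NaN.
dfs = {
    'sheet1': pd.DataFrame({'group': [0, np.nan, 1, np.nan, np.nan],
                            'ordercode': ['A', 'B', 'C', 'E', 'D'],
                            'quantity': [1, 3, 1, 2, 1]}),
    'sheet2': pd.DataFrame({'group': [0, np.nan, 1, np.nan, np.nan],
                            'ordercode': ['x', 'y', 'x', 'y', 'z'],
                            'quantity': [1, 3, 1, 2, 1]}),
}
order = ['sheet1', 'sheet2']

# Forward-fill the group numbers and tag each frame with its sheet position.
L = [dfs[name].ffill().assign(g=i) for i, name in enumerate(order)]

# Sort by group, then sheet, then original row number; drop the helpers afterwards.
df = (pd.concat(L)
        .rename_axis('row').reset_index()
        .sort_values(['group', 'g', 'row'])
        .drop(columns=['g', 'row'])
        .reset_index(drop=True))

# Print the rows of a single group by filtering on its number.
print(df[df['group'] == 0])
```

Iterating with df.groupby('group') would be another way to print one group at a time.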
