How to replace a pandas column value based on another dataframe's columns - python-3.x

I have two pandas dataframes, as below.
df1:
col1 col2 col3
aa b c
aa d c
bb d t
bb b g
cc e c
dd g c
and the second dataframe (df2):
col1 col2
aa b
cc e
bb d
I want to change the value of col3 in df1 to 'cc' wherever the (col1, col2) pair also appears in the second dataframe, like below:
col1 col2 col3
aa b cc
aa d c
bb d cc
bb b g
cc e cc
dd g c
In short, I want to match columns (col1, col2) of the second dataframe against columns (col1, col2) of the first dataframe, and change col3 of the first dataframe where they match.

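Use DataFrame.merge with a left join and the indicator parameter to create a helper column, compare it with 'both' using Series.eq (equivalent to ==), and finally set the values with DataFrame.loc. This relies on df1 having the default RangeIndex and df2 having no duplicate (col1, col2) pairs, so that the mask lines up with df1 row for row: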
m = df1.merge(df2, on=['col1','col2'],indicator=True, how='left')['_merge'].eq('both')
df1.loc[m, 'col3'] = 'cc'
print (df1)
col1 col2 col3
0 aa b cc
1 aa d c
2 bb d cc
3 bb b g
4 cc e cc
5 dd g c
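The mask produced by the merge carries a fresh 0..n-1 index, so it only lines up with df1 when df1 has the default index. A minimal sketch, assuming df1 may carry an arbitrary index and df2 may contain repeated pairs (the index values 10..15 are made up for illustration), applies the mask by position instead:

```python
import pandas as pd

df1 = pd.DataFrame(
    {
        "col1": ["aa", "aa", "bb", "bb", "cc", "dd"],
        "col2": ["b", "d", "d", "b", "e", "g"],
        "col3": ["c", "c", "t", "g", "c", "c"],
    },
    index=[10, 11, 12, 13, 14, 15],  # hypothetical non-default index
)
df2 = pd.DataFrame({"col1": ["aa", "cc", "bb"], "col2": ["b", "e", "d"]})

# Dropping duplicate pairs in df2 keeps the left join at one row per df1 row.
merged = df1.merge(
    df2.drop_duplicates(), on=["col1", "col2"], how="left", indicator=True
)

# Convert to a NumPy array so the mask is positional, not index-aligned.
mask = merged["_merge"].eq("both").to_numpy()
df1.loc[mask, "col3"] = "cc"
print(df1)
```

The positional mask is valid because a left join keeps the rows of df1 in their original order.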

You can use pd.concat and drop_duplicates after assigning the value for 'col3' on df2. Note that the matched rows move to the top, so the original row order of df1 is not preserved:
df = pd.concat([df2.assign(col3='cc'), df1]).drop_duplicates(['col1','col2']).reset_index(drop=True)
df
Output:
col1 col2 col3
0 aa b cc
1 cc e cc
2 bb d cc
3 aa d c
4 bb b g
5 dd g c

Related

How to find the intersection of a pair of columns in pandas dataframe with pairs in any order?

I have the dataframe below:
col1 col2
a b
b a
c d
d c
e d
The desired output is the unique pairs from the two columns, regardless of order:
col1 col2
a b
c d
e d
Convert values to frozenset and then filter by DataFrame.duplicated in boolean indexing:
df = df[~df[['col1','col2']].apply(frozenset, axis=1).duplicated()]
print (df)
col1 col2
0 a b
2 c d
4 e d
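Or sort the values within each row with np.sort and remove duplicates with DataFrame.drop_duplicates. Note that this reorders the values inside each row, so the last row becomes d e rather than e d: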
df = pd.DataFrame(np.sort(df[['col1','col2']]), columns=['col1','col2']).drop_duplicates()
print (df)
col1 col2
0 a b
2 c d
4 d e
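If the original orientation of each pair matters (e d rather than d e), the two ideas can be combined: sort only to build a duplicate mask, then filter the untouched frame. This is a sketch under that assumption; the names sorted_pairs and dup are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"col1": list("abcde"), "col2": list("badcd")})

# Sort each row so (a, b) and (b, a) become the same pair.
sorted_pairs = np.sort(df[["col1", "col2"]].to_numpy(), axis=1)

# Flag every repeat of an already-seen pair (first occurrence is kept).
dup = pd.DataFrame(sorted_pairs).duplicated().to_numpy()

# Filter the original frame, so the values are not reordered.
result = df[~dup]
print(result)
```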

get the minimum of a column value based on other columns value

I am trying to fetch whole rows based on the minimum value of a column, with conditions.
Df :
colA colB colC
A B 2
B C 3
C D 4
D E 5
E A 2
A A 0
B B 0
C C 0
D D 0
E E 0
I am trying to fetch, in the fastest way, the whole rows where colC is the minimum integer, but only among rows where colA and colB are not equal.
output:
A B 2
E A 2
You can first filter out the rows where the two columns are equal, then sort and take the lowest 2 values:
df1 = df[df['colA'].ne(df['colB'])].sort_values('colC').head(2)
And for all the other rows, drop the selected rows from the original by index:
df2 = df.drop(df1.index)
print (df1)
colA colB colC
0 A B 2
4 E A 2
print (df2)
colA colB colC
1 B C 3
2 C D 4
3 D E 5
5 A A 0
6 B B 0
7 C C 0
8 D D 0
9 E E 0
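The head(2) call hard-codes how many rows tie for the minimum. A sketch that does not assume the count in advance compares colC against the minimum of the filtered rows instead of sorting; the names candidates, lowest and rest are illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "colA": list("ABCDEABCDE"),
        "colB": list("BCDEAABCDE"),
        "colC": [2, 3, 4, 5, 2, 0, 0, 0, 0, 0],
    }
)

# Keep only rows where colA and colB differ.
candidates = df[df["colA"].ne(df["colB"])]

# Select every row that ties for the minimum colC among those candidates.
lowest = candidates[candidates["colC"].eq(candidates["colC"].min())]

# Everything else, taken from the original frame by index.
rest = df.drop(lowest.index)
print(lowest)
```

This avoids the sort altogether, which may help on large frames, and it returns all tied rows whether there are two or twenty.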

Fill Null values in Data-Frame with Column names

I have a data-frame with 55 columns and 2 million rows with a mix of categorical and numeric fields. There are null/NA values in the data-set. I want to fill the null values with the column names.
The data-set I have is:
A B C D .....
1 na na 3 .....
na 3 4 na .....
........................
The output I am trying to get is:
A B C D .....
1 B C 3 .....
A 3 4 D .....
........................
I am trying to use :
df.fillna(method='ffill')
Is there another way?
Python:3.6.5
Use DataFrame.fillna with columns converted to Series by Index.to_series:
df = df.fillna(df.columns.to_series())
print (df)
A B C D
0 1 B C 3
1 A 3 4 D
EDIT: If the DataFrame contains categorical columns, select those columns and add the missing values (the column names) as new categories with cat.add_categories:
for c in df.select_dtypes('category'):
    df[c] = df[c].cat.add_categories(c)
df = df.fillna(df.columns.to_series())
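To see both steps working together, here is a small self-contained sketch. The sample frame is invented for illustration (column C is made categorical on purpose), and it assumes that filling a numeric column with a string is acceptable, which turns that column into object dtype.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "A": [1, np.nan],
        "B": [np.nan, 3],
        "C": pd.Categorical([None, "x"]),  # hypothetical categorical column
        "D": [3, np.nan],
    }
)

# A categorical column only accepts values that are registered categories,
# so the column name has to be added as a category before filling.
for c in df.select_dtypes("category"):
    df[c] = df[c].cat.add_categories(c)

# The Series maps each column label to itself, so every column is
# filled with its own name.
out = df.fillna(df.columns.to_series())
print(out)
```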

Merge 2 Different Data Frames - Python 3.6

I want to merge 2 tables, and the blanks should be filled with the first table's row.
DF1:
Col1 Col2 Col3
A B C
DF2:
Col6 Col8
1 2
3 4
5 6
7 8
9 10
I am expecting result as below:
Col1 Col2 Col3 Col6 Col8
A B C 1 2
A B C 3 4
A B C 5 6
A B C 7 8
A B C 9 10
Use assign, but then it is necessary to change the order of the columns:
df = df2.assign(**df1.iloc[0])[df1.columns.append(df2.columns)]
print (df)
Col1 Col2 Col3 Col6 Col8
0 A B C 1 2
1 A B C 3 4
2 A B C 5 6
3 A B C 7 8
4 A B C 9 10
Or concat and replace NaNs by forward filling with ffill:
df = pd.concat([df1, df2], axis=1).ffill()
print (df)
Col1 Col2 Col3 Col6 Col8
0 A B C 1 2
1 A B C 3 4
2 A B C 5 6
3 A B C 7 8
4 A B C 9 10
You can merge both dataframes by index with an outer join and forward fill the data:
df2.merge(df1, left_index=True, right_index=True, how='outer').ffill()
Out:
Col6 Col8 Col1 Col2 Col3
0 1 2 A B C
1 3 4 A B C
2 5 6 A B C
3 7 8 A B C
4 9 10 A B C
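Since every row of DF2 should be paired with the single row of DF1, this can also be written as a cross join. This is a different technique from the answers above, and the sketch assumes pandas 1.2 or later, where merge accepts how='cross'.

```python
import pandas as pd

df1 = pd.DataFrame({"Col1": ["A"], "Col2": ["B"], "Col3": ["C"]})
df2 = pd.DataFrame({"Col6": [1, 3, 5, 7, 9], "Col8": [2, 4, 6, 8, 10]})

# A cross join pairs each row of df1 with each row of df2; with a
# one-row df1 this repeats that row next to every row of df2.
df = df1.merge(df2, how="cross")
print(df)
```

The columns come out in the expected order (DF1 first, then DF2) without reindexing, and no forward fill is needed.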

handling of unstructured data in pandas

I'm trying to read an unstructured csv file using pandas read_csv(). The problem is that some of the files have rows with extra columns, as shown in the sample input below.
sample input:
col0,col1,col2
a,b,c
a,b,c,d
a,b,c
a,b,c,d
While handling these kinds of files, the program throws a ParserError:
ParserError: Error tokenizing data. C error: Expected 3 fields in line 3, saw 4
sample output :
col0 | col1 | col2 | col3
a | b | c | NaN
a | b | c | d
a | b | c | NaN
a | b | c | d
I don't want to skip the offending lines with the error_bad_lines=False parameter of pandas read_csv().
Any kind of help will be highly appreciated.
One possible solution is to preprocess the file first, find the maximum number of separators, and set the names parameter by range:
path_csv = 'file.csv'
with open(path_csv) as f:
    lines = f.readlines()
num = max(l.count(',') for l in lines) + 1
print (num)
4
df = pd.read_csv(path_csv, names=range(num))
print (df)
0 1 2 3
0 col0 col1 col2 NaN
1 a b c NaN
2 a b c d
3 a b c NaN
4 a b c d
Similarly, if the header is not important, it is possible to remove it:
df = pd.read_csv(path_csv, names=range(num), skiprows=1)
print (df)
0 1 2 3
0 a b c NaN
1 a b c d
2 a b c NaN
3 a b c d
Another, more dynamic solution is to add the missing values to the header:
path_csv = 'file.csv'
with open(path_csv) as f:
    lines = f.readlines()
#get header to list
header = [x.strip() for x in lines[0].split(',')]
#get max number of separator
max_num = max(l.count(',') for l in lines)
#add missing header values by range
if len(header) < max_num + 1:
    header = header + list(range(max_num-len(header) + 1))
print (header)
['col0', 'col1', 'col2', 0]
df = pd.read_csv(path_csv, names=header, skiprows=1)
print (df)
col0 col1 col2 0
0 a b c NaN
1 a b c d
2 a b c NaN
3 a b c d
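Counting commas miscounts fields when a quoted value itself contains a comma. A sketch that sidesteps this uses the standard csv module to measure row widths; it reads from an in-memory string so it runs as-is, and the colN naming of the extra columns is an assumption chosen to match the desired output in the question.

```python
import csv
import io

import pandas as pd

raw = "col0,col1,col2\na,b,c\na,b,c,d\na,b,c\na,b,c,d\n"

# Parse once with the csv module to find the widest row.
rows = list(csv.reader(io.StringIO(raw)))
width = max(len(r) for r in rows)

# Extend the header with generated names (assumed naming: col3, col4, ...).
header = rows[0] + [f"col{i}" for i in range(len(rows[0]), width)]

# Skip the original header line and let pandas pad short rows with NaN.
df = pd.read_csv(io.StringIO(raw), names=header, skiprows=1)
print(df)
```

For a real file, replace io.StringIO(raw) with an opened file handle for the first pass and the file path for read_csv.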
