Identifying duplicates and arranging corresponding data in their respective columns using Python - python-3.x

I have a CSV file that contains multiple duplicate entries. What I'm trying to do is gather those duplicate entries and arrange their corresponding fields.
My table:

Column A    Column B    Column C    Column D
1004.       1
1004.       3
1004.       2
What I need:

Column A    Column B    Column C    Column D
1004.       1           3           2
How do I solve this in Python?
I tried identifying the duplicates but don't know what to do next.
import pandas
df = pandas.read_csv(csv_file, names=fields, index_col=False)
df = df[df.duplicated([column1], keep=False)]  # keep every row that is duplicated on column1
df.to_csv(csv_file2, index=False)

You can use a lambda function. First, we group by Column A (col_a below):
import numpy as np
import pandas as pd

df = pd.DataFrame(data={'col_a': [1004, 1004, 1004, 1005],
                        'col_b': [1, 3, 2, 2],
                        'col_c': ['', '', '', ''],
                        'col_d': ['', '', '', '']})
df = df.replace(r'^\s*$', np.nan, regex=True)  # replace empty cells with NaN
dfx = df.groupby('col_a').agg(list)
print(dfx)
           col_b            col_c            col_d
col_a
1004   [1, 3, 2]  [nan, nan, nan]  [nan, nan, nan]
1005         [2]            [nan]            [nan]
If you have several columns, you can replace the NaN values according to the index of the values in the list:
dfx['col_c']=dfx['col_b'].apply(lambda x: x[1] if len(x) > 1 else np.nan) #get the second value from col_b
dfx['col_d']=dfx['col_b'].apply(lambda x: x[2] if len(x) > 2 else np.nan) #get the third value from col_b
dfx['col_b']=dfx['col_b'].apply(lambda x: x[0] if len(x) > 0 else np.nan) #finally replace col_b with its first value (col_b must be replaced last)
Note: if we replace col_b first, we will lose the list. This is why we are replacing col_b last.
If there are many columns, we can use a for loop:
loop_cols = [*dfx.columns[1:]]  # the list of columns except col_b
list_index = 1
for i in loop_cols:
    # take the nth value from the col_b list, if it exists
    dfx[i] = dfx['col_b'].apply(lambda x: x[list_index] if len(x) > list_index else np.nan)
    list_index += 1

# finally, replace col_b with its first value
dfx['col_b'] = dfx['col_b'].apply(lambda x: x[0] if len(x) > 0 else np.nan)
Output:

       col_b  col_c  col_d
col_a
1004       1    3.0    2.0
1005       2    NaN    NaN
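
As an aside, the replace-col_b-last ordering issue can be sidestepped by expanding the aggregated lists into columns in one step. A minimal sketch on the same toy data (the hardcoded column list assumes at most three duplicates per key):

import numpy as np
import pandas as pd

df = pd.DataFrame({'col_a': [1004, 1004, 1004, 1005],
                   'col_b': [1, 3, 2, 2]})

# Aggregate the duplicates into lists, then expand the ragged lists
# into positional columns in one step; shorter lists are NaN-padded.
grouped = df.groupby('col_a')['col_b'].agg(list)
wide = pd.DataFrame(grouped.tolist(), index=grouped.index,
                    columns=['col_b', 'col_c', 'col_d'])
print(wide)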

Related

Python code for Multiple IF() and VLOOKUP() in Excel [duplicate]

If df['col'] = 'a','b','c' and df2['col'] = 'a123','b456','d789', how do I create df2['is_contained'] = 'a','b','no_match', where the df['col'] value is returned if it is found within the df2['col'] value, and 'no_match' is returned if no match is found? Also, I don't expect there to be multiple matches, but in the unlikely case there are, I'd want to return a string like 'Multiple Matches'.
With this toy data set, we want to add a new column to df2 which will contain no_match for the first three rows, and the last row will contain the value 'd' because that row's col value (the letter 'a') appears in df1.
import numpy as np
import pandas as pd
df1 = pd.DataFrame({'col': ['a', 'b', 'c', 'd']})
df2 = pd.DataFrame({'col': ['a123','b456','d789', 'a']})
In other words, values from df1 should be used to populate this new column in df2 only when a row's df2['col'] value appears somewhere in df1['col'].
In [2]: df1
Out[2]:
col
0 a
1 b
2 c
3 d
In [3]: df2
Out[3]:
col
0 a123
1 b456
2 d789
3 a
If this is the right way to understand your question, then you can do this with pandas isin:
In [4]: df2.col.isin(df1.col)
Out[4]:
0 False
1 False
2 False
3 True
Name: col, dtype: bool
This evaluates to True only when a value in df2.col is also in df1.col.
Then you can use np.where, which is more or less the same as ifelse in R, if you are familiar with R at all.
In [5]: np.where(df2.col.isin(df1.col), df1.col, 'NO_MATCH')
Out[5]: array(['NO_MATCH', 'NO_MATCH', 'NO_MATCH', 'd'], dtype=object)
For rows where a df2.col value appears in df1.col, the value from df1.col will be returned for the given row index. In cases where the df2.col value is not a member of df1.col, the default 'NO_MATCH' value will be used. Note that this relies on df1 and df2 being index-aligned: it picks the df1 value at the same row position, not the matched value itself.
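
If you want the matched value itself rather than relying on that index alignment, a minimal sketch using Series.where (same df1/df2 as above; with exact membership the matched value is simply the df2 value):

# Keep the df2 value where it appears in df1.col; otherwise 'NO_MATCH'.
# No index alignment between df1 and df2 is required here.
df2['is_contained'] = df2.col.where(df2.col.isin(df1.col), 'NO_MATCH')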
You must first guarantee that the indexes match. To simplify, I'll show this as if the columns were in the same dataframe. The trick is to use the apply method along the columns axis:
df = pd.DataFrame({'col1': ['a', 'b', 'c', 'd'],
                   'col2': ['a123', 'b456', 'd789', 'a']})
df['contained'] = df.apply(lambda x: x.col1 in x.col2, axis=1)
df
col1 col2 contained
0 a a123 True
1 b b456 True
2 c d789 False
3 d a False
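
Row-wise apply can be slow on large frames; an equivalent list comprehension over the two columns is usually faster. A sketch, assuming the same df:

# Same per-row substring test as the apply version above.
df['contained'] = [c1 in c2 for c1, c2 in zip(df.col1, df.col2)]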
In pandas 0.13, you can use str.extract:
In [11]: df1 = pd.DataFrame({'col': ['a', 'b', 'c']})
In [12]: df2 = pd.DataFrame({'col': ['d23','b456','a789']})
In [13]: df2.col.str.extract('(%s)' % '|'.join(df1.col))
Out[13]:
0 NaN
1 b
2 a
Name: col, dtype: object
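
None of the answers above handle the 'Multiple Matches' case mentioned in the question. A minimal sketch with str.findall, using the df1/df2 from the str.extract example plus an extra 'ab' row as a hypothetical multi-match case:

import pandas as pd

df1 = pd.DataFrame({'col': ['a', 'b', 'c']})
df2 = pd.DataFrame({'col': ['a123', 'b456', 'd789', 'ab']})

# Find every df1 value contained in each df2 string.
pattern = '(%s)' % '|'.join(df1.col)
matches = df2.col.str.findall(pattern)

def label(found):
    # none -> 'no_match', one -> the matched value, several -> 'Multiple Matches'
    if len(found) == 0:
        return 'no_match'
    if len(found) == 1:
        return found[0]
    return 'Multiple Matches'

df2['is_contained'] = matches.apply(label)
print(df2)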

Pandas select preferred value from one of two columns to make a new column

I have a Pandas DataFrame with two columns of "complementary" data. For any given row, there are 3 possibilities:
1) Column A has a non-null value, and column B has a null value, NaN, that I want to replace with the non-null value from column A.
2) Column A has a null value, NaN, that I want to replace with the non-null value from column B.
3) Both columns A and B have null values, NaN, which means I'll keep NaN as the value for that row.
Here's a simplified version of my DataFrame:
df1 = pd.DataFrame({'A': ['keep1', np.nan, np.nan, 'keep4', np.nan],
                    'B': [np.nan, 'keep2', np.nan, np.nan, np.nan]})
I was thinking that as an intermediate step, I'd create a new column C with the entries I need:
df2 = pd.DataFrame({'A': ['keep1', np.nan, np.nan, 'keep4', np.nan],
                    'B': [np.nan, 'keep2', np.nan, np.nan, np.nan],
                    'C': ['keep1', 'keep2', np.nan, 'keep4', np.nan]})
Then I'd drop columns A and B:
df_final = df2.drop(['A', 'B'], axis=1)
My actual DataFrame has hundreds of rows, and I've tried several approaches (boolean filters, looping through the DataFrame using iterrows, using DataFrame.where()) without success. I'd think this would be a simple problem, but I'm not seeing it. Any help is appreciated.
Thanks
You can use combine_first() to fill the gaps in A from B:
df1['C'] = df1['A'].combine_first(df1['B'])
#0 keep1
#1 keep2
#2 NaN
#3 keep4
#4 NaN
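
An equivalent alternative, for what it's worth, is to back-fill across the two columns and keep the first. A sketch assuming the same df1:

# Back-fill along axis=1: for each row, A stays if present, otherwise
# B's value moves into it; then take that first column.
df1['C'] = df1[['A', 'B']].bfill(axis=1)['A']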
Use Series.fillna to replace missing values in A with the corresponding values from B:
df1['C'] = df1.A.fillna(df1.B)
print (df1)
A B C
0 keep1 NaN keep1
1 NaN keep2 keep2
2 NaN NaN NaN
3 keep4 NaN keep4
4 NaN NaN NaN
To avoid the drop, it is possible to use DataFrame.pop to extract the columns (note that pop removes them from df1 in place):
df1['C'] = df1.pop('A').fillna(df1.pop('B'))
print (df1)
C
0 keep1
1 keep2
2 NaN
3 keep4
4 NaN

Combine data from two columns into one without affecting the data values

I have two columns in a data frame. I want to combine those columns into a single column.
df = pd.DataFrame({'a': [500, 200, 13, 47], 'b':['$', '€', .586,.02]})
df
Out:
     a      b
0  500      $
1  200      €
2   13  0.586
3   47   0.02
I want to merge these two columns without affecting the data.
Expected output:
df
Out:
a
0 500$
1 200€
2 13.586
3 47.02
Please help me with this...
I tried the solutions below, but they do not work for me:
df.b=np.where(df.b,df.b,df.a)
df.loc[df['b'] == '', 'b'] = df['a']
The first solution works by converting both columns to strings and joining them with +, and finally converting the Series to a one-column DataFrame. But it only works if the numbers in column b are all less than 1:
df1 = df.astype(str)
df = (df1.a + df1.b.str.replace(r'^0', '', regex=True)).to_frame('a')  # strip the leading 0 that astype(str) gives values like 0.586
print (df)
a
0 500$
1 200€
2 13.586
3 47.02
Or, if you want mixed values (numeric for the last two rows and strings for the first two), use:
m = df.b.apply(lambda x: isinstance(x, str))  # rows where b holds a string
df.loc[m, 'a'] = df.loc[m, 'a'].astype(str) + df.b  # concatenate the string rows
df.loc[~m, 'a'] += df.pop('b')  # add the numeric rows (13 + 0.586 = 13.586)
print (df)
a
0 500$
1 200€
2 13.586
3 47.02
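
A caveat on the mixed-value variant: column a ends up with object dtype, holding strings for the first two rows and floats for the last two. A quick check:

# Inspect the per-row Python types left in column a.
print(df.a.map(type))
# 0      <class 'str'>
# 1      <class 'str'>
# 2    <class 'float'>
# 3    <class 'float'>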

How to select columns based on criteria?

I have the following dataframe:
d2 = {('CAR', 'ALPHA'): pd.Series(['A22', 'A23', 'A24', 'A25'], index=[2, 3, 4, 5]),
      ('CAR', 'BETA'): pd.Series(['B22', 'B23', 'B24', 'B25'], index=[2, 3, 4, 5]),
      ('MOTOR', 'SOLO'): pd.Series(['S22', 'S23', 'S24', 'S25'], index=[2, 3, 4, 5])}
db = pd.DataFrame(data=d2)
In the columns that have 'CAR' in level 0 of the column MultiIndex, I would like to delete all the values and set them to NA after a given row index, e.g. 4.
I am trying to use .loc but I would like the results to be saved in the same dataframe.
The second thing I would like to do is set the values of the columns whose level-0 index is different from 'CAR' to NA after another row index, e.g. 3.
Use slicers for the first task; for the second, compare the level-0 values obtained with MultiIndex.get_level_values:
idx = pd.IndexSlice
db.loc[4:, idx['CAR', :]] = np.nan
db.loc[3:, db.columns.get_level_values(0) != 'CAR'] = 'AAA'
Or:
mask = db.columns.get_level_values(0) == 'CAR'
db.loc[4:, mask] = np.nan
db.loc[3:, ~mask] = 'AAA'
print(db)
    CAR       MOTOR
  ALPHA BETA  SOLO
2   A22  B22   S22
3   A23  B23   AAA
4   NaN  NaN   AAA
5   NaN  NaN   AAA
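
A simpler route that may also work here: .loc accepts a bare level-0 label for MultiIndex columns, so the slicer is not strictly needed. A sketch, assuming 'MOTOR' is the only other level-0 value:

import numpy as np

# db.loc[4:, 'CAR'] addresses every column under the 'CAR' level.
db.loc[4:, 'CAR'] = np.nan
db.loc[3:, 'MOTOR'] = 'AAA'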

Splitting dictionary/list into Separate Columns

I have a movie dataset saved for revenue prediction. The genres column of this dataset holds a dictionary, or a list of two or more dictionaries, per row. This is not the actual DataFrame, but it is similar:
df = pd.DataFrame({'a':[1,2,3], 'b':[{'c':1}, [{'c':4},{'d':3}], [{'c':5, 'd':6},{'c':7, 'd':8}]]})
This is the output:
a b
0 1 {'c': 1}
1 2 [{'c': 4}, {'d': 3}]
2 3 [{'c': 5, 'd': 6}, {'c': 7, 'd': 8}]
I need to split this column into separate columns.
How can I do that? I used the apply(pd.Series) method; this is what I'm getting as output:

                  0                 1    c
0               NaN               NaN  1.0
1          {'c': 4}          {'d': 3}  NaN
2  {'c': 5, 'd': 6}  {'c': 7, 'd': 8}  NaN

But I want it like this, if possible:
a c d
0 1 1 NaN
1 2 4 3
2 3 5,7 6,8
I do not know if it is possible to achieve what you want by using apply(pd.Series), because you have mixed types in your 'b' column: dictionaries and lists of dictionaries. Maybe it is; I'm not sure.
However, this is how I would do it.
First, loop over your column to build a set with all the new column names: that is, the keys of the dictionaries.
Then use apply with a custom function to extract the value for each new column.
Notice that the values in these new columns are strings; this is needed because you want to concatenate values with a comma in cases like your row #2.
newcols = set()
for el in df['b']:
    if isinstance(el, dict):
        newcols.update(el.keys())
    elif isinstance(el, list):
        for i in el:
            newcols.update(i.keys())

def extractvalues(x, col):
    # a single dict: take the value (or NaN); a list of dicts:
    # join the values with commas, skipping missing keys
    if isinstance(x['b'], dict):
        return x['b'].get(col, np.nan)
    elif isinstance(x['b'], list):
        return ','.join(str(i.get(col, '')) for i in x['b']).strip(',')

for nc in newcols:
    df[nc] = df.apply(lambda r: extractvalues(r, nc), axis=1)

df.drop('b', axis=1, inplace=True)
Your dataframe is now:
a c d
0 1 1 NaN
1 2 4 3
2 3 5,7 6,8
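
On pandas 0.25 or newer, a sketch of a more built-in route using Series.explode (assuming the same df; the str(int(v)) conversion assumes the dict values are whole numbers):

import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3],
                   'b': [{'c': 1}, [{'c': 4}, {'d': 3}],
                         [{'c': 5, 'd': 6}, {'c': 7, 'd': 8}]]})

# Wrap bare dicts in a list so every row is uniform, explode to one
# dict per row, expand the dicts into columns, then join the values
# back per original row with commas.
s = df['b'].apply(lambda x: x if isinstance(x, list) else [x]).explode()
expanded = pd.DataFrame(s.tolist(), index=s.index)
joined = expanded.groupby(level=0).agg(
    lambda col: ','.join(str(int(v)) for v in col.dropna()) or np.nan)
out = df[['a']].join(joined)
print(out)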
