Looking for an analogue to pd.DataFrame.drop_duplicates() where order does not matter - python-3.x

I would like to do something similar to dropping the duplicate rows of a DataFrame, but with column order not mattering. That is, the function should consider a row consisting of the entries 'a', 'b' to be identical to a row consisting of the entries 'b', 'a'. For example, given
df = pd.DataFrame([['a', 'b'], ['c', 'd'], ['a', 'b'], ['b', 'a']])
0 1
0 a b
1 c d
2 a b
3 b a
I would like to obtain:
0 1
0 a b
1 c d
where the preference is for efficiency, as I run this on a huge dataset within a groupby operation.

Call np.sort first, and then drop duplicates.
df[:] = np.sort(df.values, axis=1)
df.drop_duplicates()
0 1
0 a b
1 c d
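A self-contained sketch of the same idea (my own assembly of the two steps above, assuming every column holds comparable values such as strings):
import numpy as np
import pandas as pd

df = pd.DataFrame([['a', 'b'], ['c', 'd'], ['a', 'b'], ['b', 'a']])
# sort the values within each row so that column order no longer matters,
# then drop the rows that have become identical
df[:] = np.sort(df.values, axis=1)
print(df.drop_duplicates())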

Related

Python code for Multiple IF() and VLOOKUP() in Excel [duplicate]

If df['col'] = 'a', 'b', 'c' and df2['col'] = 'a123', 'b456', 'd789', how do I create df2['is_contained'] = 'a', 'b', 'no_match', where the df['col'] value is returned whenever it is found within a value of df2['col'], and 'no_match' is returned when no match is found? Also, I don't expect there to be multiple matches, but in the unlikely case there are, I'd want to return a string like 'Multiple Matches'.
With this toy data set, we want to add a new column to df2 that will contain no_match for the first three rows, while the last row will get the value 'd' because that row's col value (the letter 'a') appears in df1.
import numpy as np
import pandas as pd
df1 = pd.DataFrame({'col': ['a', 'b', 'c', 'd']})
df2 = pd.DataFrame({'col': ['a123','b456','d789', 'a']})
In other words, values from df1 should be used to populate this new column in df2 only when a row's df2['col'] value appears somewhere in df1['col'].
In [2]: df1
Out[2]:
col
0 a
1 b
2 c
3 d
In [3]: df2
Out[3]:
col
0 a123
1 b456
2 d789
3 a
If this is the right way to understand your question, then you can do this with pandas isin:
In [4]: df2.col.isin(df1.col)
Out[4]:
0 False
1 False
2 False
3 True
Name: col, dtype: bool
This evaluates to True only when a value in df2.col is also in df1.col.
Then you can use np.where, which is more or less the same as ifelse in R, if you are familiar with R at all.
In [5]: np.where(df2.col.isin(df1.col), df1.col, 'NO_MATCH')
Out[5]:
0 NO_MATCH
1 NO_MATCH
2 NO_MATCH
3 d
Name: col, dtype: object
For rows where a df2.col value appears in df1.col, the value from df1.col will be returned for the given row index. In cases where the df2.col value is not a member of df1.col, the default 'NO_MATCH' value will be used.
You must first guarantee that the indexes match. To simplify, I'll show it as if the columns were in the same dataframe. The trick is to use the apply method along the columns axis (axis=1):
df = pd.DataFrame({'col1': ['a', 'b', 'c', 'd'],
                   'col2': ['a123', 'b456', 'd789', 'a']})
df['contained'] = df.apply(lambda x: x.col1 in x.col2, axis=1)
df
col1 col2 contained
0 a a123 True
1 b b456 True
2 c d789 False
3 d a False
In pandas 0.13+, you can use str.extract:
In [11]: df1 = pd.DataFrame({'col': ['a', 'b', 'c']})
In [12]: df2 = pd.DataFrame({'col': ['d23','b456','a789']})
In [13]: df2.col.str.extract('(%s)' % '|'.join(df1.col))
Out[13]:
0 NaN
1 b
2 a
Name: col, dtype: object
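A hedged follow-up sketch (my own combination of the str.extract call above with fillna, assuming a reasonably recent pandas for expand=False) that returns 'no_match' instead of NaN when nothing is found:
import pandas as pd

df1 = pd.DataFrame({'col': ['a', 'b', 'c']})
df2 = pd.DataFrame({'col': ['d23', 'b456', 'a789']})
# build an alternation pattern from df1's values, extract the first match,
# and fall back to 'no_match' where no value of df1.col was found
pattern = '(%s)' % '|'.join(df1['col'])
df2['is_contained'] = df2['col'].str.extract(pattern, expand=False).fillna('no_match')
print(df2)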

Unique values across columns row-wise in pandas with missing values

I have a dataframe like
import pandas as pd
import numpy as np
df = pd.DataFrame({"Col1": ['A', np.nan, 'B', 'B', 'C'],
"Col2": ['A', 'B', 'B', 'A', 'C'],
"Col3": ['A', 'B', 'C', 'A', 'C']})
I want to get the unique combinations across columns for each row and create a new column with those values, excluding the missing values.
The code I have right now to do this is
def handle_missing(s):
    return np.unique(s[s.notnull()])

def unique_across_rows(data):
    unique_vals = data.apply(handle_missing, axis=1)
    # numpy unique sorts the values automatically
    merged_vals = unique_vals.apply(lambda x: x[0] if len(x) == 1 else '_'.join(x))
    return merged_vals

df['Combos'] = unique_across_rows(df)
This returns the expected output:
Col1 Col2 Col3 Combos
0 A A A A
1 NaN B B B
2 B B C B_C
3 B A A A_B
4 C C C C
It seems to me that there should be a more vectorized approach that exists within Pandas to do this: how could I do that?
You can try a simple list comprehension which might be more efficient for larger dataframes:
df['combos'] = ['_'.join(sorted(k for k in set(v) if pd.notnull(k))) for v in df.values]
Or you can wrap the above list comprehension in a more readable function:
def combos():
    for v in df.values:
        unique = set(filter(pd.notnull, v))
        yield '_'.join(sorted(unique))

df['combos'] = list(combos())
Col1 Col2 Col3 combos
0 A A A A
1 NaN B B B
2 B B C B_C
3 B A A A_B
4 C C C C
You can also use agg/apply on axis=1 like below:
df['Combos'] = df.agg(lambda x: '_'.join(sorted(x.dropna().unique())), axis=1)
print(df)
Col1 Col2 Col3 Combos
0 A A A A
1 NaN B B B
2 B B C B_C
3 B A A A_B
4 C C C C
Try the following (the explanation is in the inline comments):
df['Combos'] = (df.stack()              # this removes NaN values
                .sort_values()          # so we get A_B instead of B_A in the 3rd row
                .groupby(level=0)       # group by the original index
                .agg(lambda x: '_'.join(x.unique()))  # join the unique values
                )
Output:
Col1 Col2 Col3 Combos
0 A A A A
1 NaN B B B
2 B B C B_C
3 B A A A_B
4 C C C C
Fill the NaN values with a string placeholder '-'. Create a unique array from the [col1, col2, col3] list and remove the placeholder, then join the unique array values with '-'.
import pandas as pd
import numpy as np
def unique(list1):
    if '-' in list1:
        list1.remove('-')
    x = np.array(list1)
    return np.unique(x)

df = pd.DataFrame({"Col1": ['A', np.nan, 'B', 'B', 'C'],
                   "Col2": ['A', 'B', 'B', 'A', 'C'],
                   "Col3": ['A', 'B', 'C', 'A', 'C']}).fillna('-')
s = "-"
for key, row in df.iterrows():
    df.loc[key, 'combos'] = s.join(unique([row.Col1, row.Col2, row.Col3]))
print(df.head())

fill a new column in a pandas dataframe from the value of another dataframe [duplicate]

This question already has an answer here:
Adding A Specific Column from a Pandas Dataframe to Another Pandas Dataframe
(1 answer)
Closed 4 years ago.
I have two dataframes:
df1 = pd.DataFrame(data={'col1': ['a', 'b', 'a', 'a', 'b', 'h'], 'col2': ['c', 'c', 'd', 'd', 'c', 'i'], 'col3': [1, 2, 3, 4, 5, 1]})
col1 col2 col3
0 a c 1
1 b c 2
2 a d 3
3 a d 4
4 b c 5
5 h i 1
df2 = pd.DataFrame(data={'col1': ['a', 'b', 'a', 'f'], 'col2': ['c', 'c', 'd', 'k'], 'col3': [12, 23, 45, 78]})
col1 col2 col3
0 a c 12
1 b c 23
2 a d 45
3 f k 78
and I'd like to build a new column in the first one according to the values of col1 and col2 that can be found in the second one. That is, this new one:
pd.DataFrame(data={'col1': ['a', 'b', 'a', 'a', 'b', 'h'], 'col2': ['c', 'c', 'd', 'd', 'c', 'i'], 'col3': [1, 2, 3, 4, 5, 1], 'col4': [12, 23, 45, 45, 23, float('nan')]})
col1 col2 col3 col4
0 a c 1 12
1 b c 2 23
2 a d 3 45
3 a d 4 45
4 b c 5 23
5 h i 1 NaN
How can I do that?
Thanks for your attention :)
Edit: it has been advised to look for the answer in this thread, Adding A Specific Column from a Pandas Dataframe to Another Pandas Dataframe, but it is not the same question.
Here, not only does the ID not exist as such, since it is split across col1 and col2, but above all, although it is unique in the second dataframe, it is not unique in the first one. This is why I think that neither a merge nor a join can be the answer to this.
Edit2: In addition, some col1/col2 pairs of df1 may not be present in df2, in which case NaN is expected in col4, and some col1/col2 pairs of df2 may not be needed in df1. To illustrate these cases, I added some rows in both df1 and df2 to show how it could look in the worst-case scenario.
You could also use map, like:
In [130]: cols = ['col1', 'col2']
In [131]: df1['col4'] = df1.set_index(cols).index.map(df2.set_index(cols)['col3'])
In [132]: df1
Out[132]:
col1 col2 col3 col4
0 a c 1 12
1 b c 2 23
2 a d 3 45
3 a d 4 45
4 b c 5 23
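A self-contained sketch of the same map-based lookup, rebuilt here (my own assembly) with the extra unmatched rows from Edit2, so the missing (h, i) pair ends up as NaN in col4:
import pandas as pd

df1 = pd.DataFrame({'col1': ['a', 'b', 'a', 'a', 'b', 'h'],
                    'col2': ['c', 'c', 'd', 'd', 'c', 'i'],
                    'col3': [1, 2, 3, 4, 5, 1]})
df2 = pd.DataFrame({'col1': ['a', 'b', 'a', 'f'],
                    'col2': ['c', 'c', 'd', 'k'],
                    'col3': [12, 23, 45, 78]})
cols = ['col1', 'col2']
# look up each (col1, col2) pair of df1 in df2; pairs absent from df2 become NaN
df1['col4'] = df1.set_index(cols).index.map(df2.set_index(cols)['col3'])
print(df1)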

Python: Read and assign names to sheets in Excel

I have an Excel file with multiple sheets named a, b, c, d, e, ..., z.
I can read sheet a using the following code:
xl=pd.ExcelFile(r'path.xlsx')
a=xl.parse('a')
How can I assign the sheet names a, b, c, ..., z as dataframe names so that I can easily call them later?
You can create a dict of DataFrames with a dict comprehension:
xl = pd.ExcelFile(r'path.xlsx')
print (xl.parse('a'))
a b
0 1 2
print (xl.sheet_names)
['a', 'b', 'c']
dfs = {sheet: xl.parse(sheet) for sheet in xl.sheet_names}
print (dfs)
{'a': a b
0 1 2, 'b': b c
0 5 5, 'c': r t
0 7 8}
print (dfs['a'])
a b
0 1 2
print (dfs['b'])
b c
0 5 5
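A hedged alternative sketch (not from the answer above): pd.read_excel with sheet_name=None reads every sheet in one call and returns a dict keyed by sheet name; the path is the same assumed 'path.xlsx' as above:
import pandas as pd

# sheet_name=None loads all sheets at once into a dict of DataFrames
dfs = pd.read_excel(r'path.xlsx', sheet_name=None)
print(dfs['a'])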

Filter data iteratively in Python data frame

I'm wondering about existing pandas functionality that I might not have been able to find so far.
Basically, I have a data frame with various columns. I'd like to select specific rows depending on the values of certain columns (FYI: I was interested in the value of column D, which had several parameters described in A-C).
E.g., I want to know which row(s) have A==1 & B==2 & C==5?
df
A B C D
0 1 2 4 a
1 1 2 5 b
2 1 3 4 c
df_result
1 1 2 5 b
So far I have been able to basically reduce this:
import pandas as pd
df = pd.DataFrame({'A': [1, 1, 1],
                   'B': [2, 2, 3],
                   'C': [4, 5, 4],
                   'D': ['a', 'b', 'c']})
df_A = df[df['A'] == 1]
df_B = df_A[df_A['B'] == 2]
df_C = df_B[df_B['C'] == 5]
To this:
parameter = [['A', 1],
             ['B', 2],
             ['C', 5]]
df_filtered = df
for x, y in parameter:
df_filtered = df_filtered[df_filtered[x] == y]
which yielded the same results. But I wonder if there's another way, maybe without a loop, in one line?
You could use the query() method to filter the data, and construct the filter expression from the parameters, like:
In [288]: df.query(' and '.join(['{0}=={1}'.format(x[0], x[1]) for x in parameter]))
Out[288]:
A B C D
1 1 2 5 b
Details
In [296]: df
Out[296]:
A B C D
0 1 2 4 a
1 1 2 5 b
2 1 3 4 c
In [297]: query = ' and '.join(['{0}=={1}'.format(x[0], x[1]) for x in parameter])
In [298]: query
Out[298]: 'A==1 and B==2 and C==5'
In [299]: df.query(query)
Out[299]:
A B C D
1 1 2 5 b
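A hedged side note (my own addition, not from the answer): if any of the target values were strings, the format above would produce an invalid expression such as D==b; formatting the value with !r (repr) quotes strings while leaving numbers untouched:
# {1!r} leaves numbers as-is and wraps strings in quotes, so the query stays valid
query = ' and '.join(['{0}=={1!r}'.format(x[0], x[1]) for x in parameter])
df_filtered = df.query(query)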
Just for information, in case others are interested, I would have done it this way:
import numpy as np
matched = np.all([df[vn] == vv for vn, vv in parameter], axis=0)
df_filtered = df[matched]
But I like the query function better, now that I have seen it, @John Galt.
