Remove duplicate couple values

My question is similar to Pandas: remove reverse duplicates from dataframe but I have an additional requirement. I need to maintain row value pairs.
For example:
I have data where column A corresponds to column C and column B corresponds to column D.
import pandas as pd
# Initial data frame
data = pd.DataFrame({'A': [0, 10, 11, 21, 22, 35, 5, 50],
                     'B': [50, 22, 35, 5, 10, 11, 21, 0],
                     'C': ["a", "b", "r", "x", "c", "w", "z", "y"],
                     'D': ["y", "c", "w", "z", "b", "r", "x", "a"]})
data
# A B C D
#0 0 50 a y
#1 10 22 b c
#2 11 35 r w
#3 21 5 x z
#4 22 10 c b
#5 35 11 w r
#6 5 21 z x
#7 50 0 y a
I would like to remove duplicates that exist in columns A and B but I need to preserve their corresponding letter value in columns C and D.
I have a solution below, but is there a more elegant way of doing this?
# Desired data frame
new_data = pd.DataFrame()
# Concat numbers and corresponding letters
new_data['AC'] = data['A'].astype(str) + ',' + data['C']
new_data['BD'] = data['B'].astype(str) + ',' + data['D']
# Drop duplicates regardless of order; result_type='expand' keeps a DataFrame
# (a plain list return would give a Series of lists that can't be de-duplicated)
new_data = new_data.apply(lambda r: sorted(r), axis=1, result_type='expand').drop_duplicates()
# Recreate dataframe (pd.DataFrame.from_items was removed in pandas 1.0)
new_data = new_data.reset_index(drop=True)
new_data = pd.concat([new_data.iloc[:, 0].str.split(',', expand=True),
                      new_data.iloc[:, 1].str.split(',', expand=True)], axis=1)
new_data.columns = ['A', 'B', 'C', 'D']
new_data
# A B C D
#0 0 a 50 y
#1 10 b 22 c
#2 11 r 35 w
#3 21 x 5 z
EDIT: technically the output should look like this:
new_data.columns=['A', 'C', 'B', 'D']
new_data
# A C B D
#0 0 a 50 y
#1 10 b 22 c
#2 11 r 35 w
#3 21 x 5 z

I think that you can do this with stack, drop_duplicates and unstack:
data.set_index(['A','B']).stack().drop_duplicates().unstack().reset_index()
A B C D
0 0 50 a y
1 10 22 b c
2 11 35 r w
3 21 5 x z
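To see why this works, here is the intermediate stacked Series (my illustration on the sample data). Each letter occurs exactly once across the first four index pairs, so drop_duplicates keeps those entries and discards the mirrored rows before unstack rebuilds the C/D columns; note that it compares the letter values only, so this relies on the letters being unique in the frame.
data.set_index(['A', 'B']).stack()
# A   B
# 0   50  C    a
#         D    y
# 10  22  C    b
#         D    c
# 11  35  C    r
#         D    w
# 21  5   C    x
#         D    z
# 22  10  C    c    <- 'c' already seen, dropped (and likewise for rows 4-7)
# ...
# dtype: object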

Create two additional columns containing the row-wise sorted data from columns A and B:
import numpy as np

columns = ['A', 'B']
df = pd.concat([data, pd.DataFrame(np.sort(data[columns], axis=1))], axis=1)
Drop duplicates using the sorted data, then select the original columns:
df.drop_duplicates(df.columns.difference(data.columns))[data.columns]
output:
A B C D
0 0 50 a y
1 10 22 b c
2 11 35 r w
3 21 5 x z
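For reference, the intermediate frame after the concat looks like this (my illustration); the row-wise sorted pair lands in the integer-named columns 0 and 1, which is exactly what columns.difference picks out as the de-duplication subset:
df.head(4)
#     A   B  C  D   0   1
# 0   0  50  a  y   0  50
# 1  10  22  b  c  10  22
# 2  11  35  r  w  11  35
# 3  21   5  x  z   5  21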

Based on the link you provided:
newdf = data[['A','B']].apply(lambda r: pd.Series(sorted(r), index=['A','B']), axis=1).drop_duplicates()
newdf['C']=newdf.A.map(dict(zip(data.A,data.C)))
newdf['D']=newdf.B.map(dict(zip(data.B,data.D)))
newdf
Out[138]:
A B C D
0 0 50 a y
1 10 22 b c
2 11 35 r w
3 5 21 z x
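One caveat worth adding (my note, not part of the original answer): the dict(zip(...)) lookups assume the values in A and in B are unique keys; a repeated number would silently keep only the last letter it was zipped with. A quick guard:
# hypothetical sanity check before building the lookup dicts
assert data['A'].is_unique and data['B'].is_unique, 'A/B values must be unique'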

Related

Pandas DataFrame: Same operation on multiple sets of columns

I want to do the same operation on multiple sets of columns of a DataFrame.
Since "for-loops" are frowned upon I'm searching for a decent alternative.
An example:
df = pd.DataFrame({
    'a': [1, 11, 111],
    'b': [222, 22, 2],
    'a_x': [10, 80, 30],
    'b_x': [20, 20, 60],
})
This is a simple for-loop approach. It's short and quite readable.
cols = ['a', 'b']
for col in cols:
    df[f'{col}_res'] = df[[col, f'{col}_x']].min(axis=1)
a b a_x b_x a_res b_res
0 1 222 10 20 1 20
1 11 22 80 20 11 20
2 111 2 30 60 30 2
This is an alternative (w/o for-loop), but I feel that the additional complexity is not really for the better.
cols = ['a', 'b']
def res_df(df, col, name):
    res = pd.Series(
        df[[col, f'{col}_x']].min(axis=1), index=df.index, name=name)
    return res
res = [res_df(df, col, f'{col}_res') for col in cols]
df = pd.concat([df, pd.concat(res, axis=1)], axis=1)
Does anyone have a better/more pythonic solution?
Thanks!
UPDATE 1
Inspired by the proposal from mozway I find the following solution quite appealing.
Imho it's short, readable and generic, since the particular operation can be swapped into a function and the list comprehension applies the function to the given sets of columns.
import numpy as np

def operation(s1, s2):
    # fill in any operation on two pandas Series
    # e.g. res = s1 * s2 / (s1 + s2)
    res = np.minimum(s1, s2)
    return res
df = df.join(
    [operation(df[f'{col}'], df[f'{col}_x']).rename(f'{col}_res') for col in cols]
)
You can use numpy.minimum after setting the arrays to identical column names:
cols = ['a', 'b']
cols2 = [f'{x}_x' for x in cols]
df = df.join(np.minimum(df[cols],
                        df[cols2].set_axis(cols, axis=1))
               .add_suffix('_res'))
output:
a b a_x b_x a_res b_res
0 1 222 10 20 1 20
1 11 22 80 20 11 20
2 111 2 30 60 30 2
or, using rename as suggested in the other answer:
cols = ['a', 'b']
cols2 = {f'{x}_x': x for x in cols}
df = df.join(np.minimum(df[cols],
                        df[list(cols2)].rename(columns=cols2))
               .add_suffix('_res'))
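The set_axis/rename step matters because, as I understand pandas' ufunc handling, a NumPy ufunc applied to two DataFrames aligns them on their labels first; with mismatched column names you get the union of columns filled with NaN. A quick sketch of the failure mode:
np.minimum(df[['a', 'b']], df[['a_x', 'b_x']])
#     a  a_x   b  b_x
# 0 NaN  NaN NaN  NaN
# 1 NaN  NaN NaN  NaN
# 2 NaN  NaN NaN  NaN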
One idea is to rename the column names via a dictionary, select the columns with the list cols, and then group by the column names, aggregating with min, sum, max, or a custom function:
cols = ['a', 'b']
suffix = '_x'
d = {f'{x}{suffix}':x for x in cols}
print (d)
{'a_x': 'a', 'b_x': 'b'}
print (df.rename(columns=d)[cols])
a a b b
0 1 10 222 20
1 11 80 22 20
2 111 30 2 60
df1 = df.rename(columns=d)[cols].groupby(axis=1,level=0).min().add_suffix('_res')
print (df1)
a_res b_res
0 1 20
1 11 20
2 30 2
Last, add to the original DataFrame:
df = df.join(df1)
print (df)
a b a_x b_x a_res b_res
0 1 222 10 20 1 20
1 11 22 80 20 11 20
2 111 2 30 60 30 2
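A side note of mine: groupby(axis=1) is deprecated in recent pandas versions, so on newer releases a transpose-based equivalent might look like this sketch:
df1 = (df.rename(columns=d)[cols]
         .T.groupby(level=0).min()
         .T.add_suffix('_res'))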

Can I apply vectorization here? Or should I think about this differently?

To put it simply, I have rows of activity that happen in a given month of the year. I want to append additional rows of inactivity in between this activity, while resetting the month values into a sequence. For example, if I have months 2, 5, 7, I need to map these to 1, 4, 7, while my inactive months happen in 2, 3, 5, and 6. So, I would have to add four rows with this inactivity.
I've done this with dictionaries and for-loops, but I know this is not efficient, especially when I move this to thousands of rows of data to process. Any suggestions on how to optimize this? Do I need to think about the data format differently? I've had a suggestion to build lists and then move those to the dataframe at the end, but I don't see a huge gain there. I don't know enough NumPy to figure out how to do this with vectorization, since that's super fast and it would be awesome to learn something new. Below is my code with the steps I took:
df = pd.DataFrame({'col1': ['A', 'A', 'B', 'B', 'B', 'C', 'C'],
                   'col2': ['X', 'Y', 'X', 'Y', 'Z', 'Y', 'Y'],
                   'col3': [1, 8, 2, 5, 7, 6, 7]})
Output:
col1 col2 col3
0 A X 1
1 A Y 8
2 B X 2
3 B Y 5
4 B Z 7
5 C Y 6
6 C Y 7
I'm creating dictionaries to handle this in for-loops:
df1 = df.groupby('col1')['col3'].apply(list).to_dict()
df2 = df.groupby('col1')['col2'].apply(list).to_dict()
max_num = max(df.col3)
Output:
{'A': [1, 8], 'B': [2, 5, 7], 'C': [6, 7]}
{'A': ['X', 'Y'], 'B': ['X', 'Y', 'Z'], 'C': ['Y', 'Y']}
8
And now I'm adding those rows using my dictionaries by creating a new data frame:
df_new = pd.DataFrame({'col1': [], 'col2': [], 'col3': []})
# note: DataFrame.append was removed in pandas 2.0; this is the slow
# row-by-row approach the question wants to replace
for key in df1.keys():
    k = 1
    if list(df1[key])[-1] - list(df1[key])[0] + 1 < max_num:
        for i in range(list(df1[key])[0], list(df1[key])[-1] + 1):
            if i in df1[key]:
                df_new = df_new.append({'col1': key,
                                        'col2': list(df2[key])[list(df1[key]).index(i)],
                                        'col3': str(k)}, ignore_index=True)
            else:
                df_new = df_new.append({'col1': key, 'col2': 'N', 'col3': str(k)},
                                       ignore_index=True)
            k += 1
        df_new = df_new.append({'col1': key, 'col2': 'E', 'col3': str(k)},
                               ignore_index=True)
    else:
        for i in range(list(df1[key])[0], list(df1[key])[-1] + 1):
            if i in df1[key]:
                df_new = df_new.append({'col1': key,
                                        'col2': list(df2[key])[list(df1[key]).index(i)],
                                        'col3': str(k)}, ignore_index=True)
            else:
                df_new = df_new.append({'col1': key, 'col2': 'N', 'col3': str(k)},
                                       ignore_index=True)
            k += 1
Output:
col1 col2 col3
0 A X 1
1 A N 2
2 A N 3
3 A N 4
4 A N 5
5 A N 6
6 A N 7
7 A Y 8
8 B X 1
9 B N 2
10 B N 3
11 B Y 4
12 B N 5
13 B Z 6
14 B E 7
15 C Y 1
16 C Y 2
17 C E 3
And then I pivot to the form I want:
df_pivot = df_new.pivot(index='col1', columns='col3', values='col2')
Output:
col3 1 2 3 4 5 6 7 8
col1
A X N N N N N N Y
B X N N Y N Z E NaN
C Y Y E NaN NaN NaN NaN NaN
Thanks for the help.
We can replace the steps of creating and using dictionaries by the statement below, which utilizes reindex to place the additional values N and E without explicit loops.
df_new = df.set_index('col3')\
           .groupby('col1')\
           .apply(lambda dg:
               dg.drop(columns='col1')
                 .reindex(range(dg.index.min(), dg.index.max() + 1), fill_value='N')
                 .reindex(range(dg.index.min(), min(max_num, dg.index.max() + 1) + 1),
                          fill_value='E')
                 .set_index(pd.RangeIndex(1, min(max_num, dg.index.max() - dg.index.min() + 1 + 1) + 1,
                                          name='col3'))
           )\
           .reset_index()
After this, you can apply your pivot statement as it is.
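To make the lambda easier to follow, here is a sketch (mine) of what the two reindex calls do for a single group, col1 == 'B' (active months 2, 5, 7, with max_num = 8):
dg = df[df['col1'] == 'B'].set_index('col3')[['col2']]
step1 = dg.reindex(range(dg.index.min(), dg.index.max() + 1), fill_value='N')
# months 2-7 -> X N N Y N Z   (the gaps become 'N')
step2 = step1.reindex(range(dg.index.min(), min(max_num, dg.index.max() + 1) + 1),
                      fill_value='E')
# month 8 appended as 'E', because the group ends before max_num
The final set_index(pd.RangeIndex(...)) then relabels those months as 1..7, which is the resequencing the question asks for.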

Pandas: apply list of functions on columns, one function per column

Setting: for a dataframe with 10 columns I have a list of 10 functions which I wish to apply in a function1(column1), function2(column2), ..., function10(column10) fashion. I have looked into pandas.DataFrame.apply and pandas.DataFrame.transform but they seem to broadcast and apply each function on all possible columns.
IIUC, with zip and a for loop:
Example
def function1(x):
    return x + 1

def function2(x):
    return x * 2

def function3(x):
    return x**2
df = pd.DataFrame({'A': [1, 2, 3], 'B': [1, 2, 3], 'C': [1, 2, 3]})
functions = [function1, function2, function3]
print(df)
# A B C
# 0 1 1 1
# 1 2 2 2
# 2 3 3 3
for col, func in zip(df, functions):
    df[col] = df[col].apply(func)
print(df)
# A B C
# 0 2 2 1
# 1 3 4 4
# 2 4 6 9
You could do something like:
# list containing functions
fun_list = []
# assume df is your dataframe
for i, fun in enumerate(fun_list):
df.iloc[:,i] = fun(df.iloc[:,i])
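For instance, reusing the three functions and the frame from the answer above (my example):
fun_list = [function1, function2, function3]
df = pd.DataFrame({'A': [1, 2, 3], 'B': [1, 2, 3], 'C': [1, 2, 3]})
for i, fun in enumerate(fun_list):
    df.iloc[:, i] = fun(df.iloc[:, i])
print(df)
#    A  B  C
# 0  2  2  1
# 1  3  4  4
# 2  4  6  9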
You can map your N functions across the rows by using a lambda that builds a Series from your operations; check the following code:
import pandas as pd
matrix = [(22, 34, 23), (33, 31, 11), (44, 16, 21), (55, 32, 22), (66, 33, 27),
          (77, 35, 11)]
df = pd.DataFrame(matrix, columns=list('xyz'), index=list('abcdef'))
Will produce:
x y z
a 22 34 23
b 33 31 11
c 44 16 21
d 55 32 22
e 66 33 27
f 77 35 11
and then:
res_df = df.apply(lambda row: pd.Series([row[0] + 1, row[1] + 2, row[2] + 3]), axis=1)
will give you:
0 1 2
a 23 36 26
b 34 33 14
c 45 18 24
d 56 34 25
e 67 35 30
f 78 37 14
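Note that the result loses the original column names (they become 0, 1, 2); passing an index to the Series keeps them, a small tweak of the same idea:
res_df = df.apply(lambda row: pd.Series([row[0] + 1, row[1] + 2, row[2] + 3],
                                        index=df.columns), axis=1)
#     x   y   z
# a  23  36  26
# b  34  33  14
# ...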
You can simply apply to a specific column:
df['x'] = df['x'].apply(lambda x: x * 2)
Similar to @Chris Adams's answer, but makes a copy of the dataframe using a dictionary comprehension and zip.
def function1(x):
    return x + 1

def function2(x):
    return x * 2

def function3(x):
    return x**2
df = pd.DataFrame({'A': [1, 2, 3], 'B': [1, 2, 3], 'C': [1, 2, 3]})
functions = [function1, function2, function3]
print(df)
# A B C
# 0 1 1 1
# 1 2 2 2
# 2 3 3 3
df_2 = pd.DataFrame({col: func(df[col]) for col, func in zip(df, functions)})
print(df_2)
# A B C
# 0 2 2 1
# 1 3 4 4
# 2 4 6 9
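Worth noting (my observation): unlike the apply-based loops above, the dictionary comprehension hands each whole column to its function, so every call is vectorized per column; it works because +, * and ** all operate elementwise on a Series:
# function2 receives a whole Series, not a scalar
print(function2(df['B']))
# 0    2
# 1    4
# 2    6
# Name: B, dtype: int64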

Merge two columns into one keeping hierarchical structure using pandas or excel writer

I need to collapse two columns into one, preserving the hierarchical structure of the rest, using either pandas alone or pandas with an Excel writer. I need to transform this:
df = pd.DataFrame({'A': [ 'p', 'p', 'q'], 'B': ['x', 'y', 'z'], 'C': [1, 2, 3]})
df
A B C
0 p x 1
1 p y 2
2 q z 3
To this:
   A  C
0  p
1  x  1
2  y  2
3  q
4  z  3
UPD.
Thank you for your help. I edited my question and added more details.
It seems you need:
df1 = df.stack().drop_duplicates().reset_index(drop=True).to_frame(name='A')
print (df1)
A
0 p
1 x
2 y
3 q
4 z
Detail:
print (df.stack())
0 A p
B x
1 A p
B y
2 A q
B z
dtype: object
print (df.stack().drop_duplicates())
0 A p
B x
1 B y
2 A q
B z
dtype: object
Or, if you need to remove duplicates only in the first column, you can replace them with NaNs and let stack remove those rows:
df = pd.DataFrame({'A': [ 'p', 'p', 'q'], 'B': ['x', 'z', 'z']})
print (df)
A B
0 p x
1 p z
2 q z
df['A'] = df['A'].mask(df['A'].duplicated())
df = df.stack().reset_index(drop=True).to_frame(name='A')
print (df)
A
0 p
1 x
2 z
3 q
4 z
Detail:
df['A'] = df['A'].mask(df['A'].duplicated())
print (df)
A B
0 p x
1 NaN z
2 q z
EDIT:
df1 = (df.set_index('C')
         .stack()
         .reset_index(name='A')
         .drop(columns='level_1')
         .drop_duplicates('A')[['A', 'C']])
df1['C'] = df1['C'].mask(df1['A'].isin(df['A']), '')
print (df1)
A C
0 p
1 x 1
3 y 2
4 q
5 z 3
Use stack as mentioned above.
Alternatively,
In [5443]: _, idx = np.unique(df, return_index=True)
In [5444]: pd.DataFrame({'A': df.values.flatten()[np.sort(idx)]})
Out[5444]:
A
0 p
1 x
2 y
3 q
4 z
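For context (my note): np.unique returns its values sorted, so the return_index / np.sort(idx) combination is what restores first-occurrence document order. A self-contained version of the snippet, assuming the two-column frame from the question:
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': ['p', 'p', 'q'], 'B': ['x', 'y', 'z']})
_, idx = np.unique(df, return_index=True)   # indices into the row-major flattened values
out = pd.DataFrame({'A': df.values.flatten()[np.sort(idx)]})
print(out)
#    A
# 0  p
# 1  x
# 2  y
# 3  q
# 4  z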

return pandas dataframe column with substrings of another column

I have a pandas dataframe that looks like df and I want to add a column so it looks like df2.
import pandas as pd
df = pd.DataFrame({'Alternative': ['a_x_17MAR2016_Collectedran30dom',
                                   'b_17MAR2016_CollectedStuff',
                                   'c_z_k_17MAR2016_Collectedan3dom'],
                   'Values': [34, 65, 7]})
df2 = pd.DataFrame({'Alternative': ['a_x_17MAR2016_Collectedran30dom',
                                    'b_17MAR2016_CollectedStuff',
                                    'c_z_k_17MAR2016_Collectedan3dom'],
                    'Values': [34, 65, 7],
                    'Alts': ['a x 17MAR2016', 'b 17MAR2016', 'c z k 17MAR2016']})
df
Out[4]:
Alternative Values
0 a_x_17MAR2016_Collectedran30dom 34
1 b_17MAR2016_CollectedStuff 65
2 c_z_k_17MAR2016_Collectedan3dom 7
df2
Out[5]:
Alternative Alts Values
0 a_x_17MAR2016_Collectedran30dom a x 17MAR2016 34
1 b_17MAR2016_CollectedStuff b 17MAR2016 65
2 c_z_k_17MAR2016_Collectedan3dom c z k 17MAR2016 7
In other words, I have a string of varying length that I can separate on an underscore delimiter. I want to split it, then recombine it delimited by spaces, but drop the segment containing the substring 'Collected' and everything after it.
I can locate the index of the string containing the substring 'Collected' in an individual list, as I found here, and then combine the other strings, but I cannot seem to do it in a very 'pythonic' way across the whole dataframe.
Thanks in advance
I believe this technically answers the question; it matches the desired output as long as the date segment itself never contains the word 'Collected':
df.Alternative.str.replace('_[^_]*Collected.*', '', regex=True).str.replace('_', ' ')
Output
0 a x 17MAR2016
1 b 17MAR2016
2 c z k 17MAR2016
Use str.split:
alts = df.Alternative.str.split('_').str[:-1].str.join(' ')
df.insert(1, 'Alts', alts)
df
import re
x = df.Alternative.apply(lambda x : re.sub("_Collected.*","",x))
# x
#0 a_x_17MAR2016
#1 b_17MAR2016
#2 c_z_k_17MAR2016
y = x.str.split("_")
#0 [a, x, 17MAR2016]
#1 [b, 17MAR2016]
#2 [c, z, k, 17MAR2016]
df['newcol'] = y.apply(lambda z: ' '.join(z))
# Alternative Values newcol
#0 a_x_17MAR2016_Collectedran30dom 34 a x 17MAR2016
#1 b_17MAR2016_CollectedStuff 65 b 17MAR2016
#2 c_z_k_17MAR2016_Collectedan3dom 7 c z k 17MAR2016
All in one line:
import re
df['newcol'] = df.Alternative.apply(lambda x : re.sub("_Collected.*","",x)).str.split("_").apply(lambda z: ' '.join(z))
# Alternative Values newcol
#0 a_x_17MAR2016_Collectedran30dom 34 a x 17MAR2016
#1 b_17MAR2016_CollectedStuff 65 b 17MAR2016
#2 c_z_k_17MAR2016_Collectedan3dom 7 c z k 17MAR2016
