Assigning a single key to all similar products/rows in a data frame based on product description and one other key - python-3.x

Based on three key columns, uniqueid, uniqueid2 and uniqueid3, I need to generate a column new_key that tags all associated products/rows with a single key.
```python
df = pd.DataFrame({
    'uniqueid':  {0: 'a', 1: 'b', 2: 'b', 3: 'c', 4: 'd', 5: 'd', 6: 'e',
                  7: 'e', 8: 'g', 9: 'g', 10: 'h', 11: 'l', 12: 'm'},
    'uniqueid2': {0: 'a', 1: 'b', 2: 'b', 3: 'c', 4: 'd', 5: 'd', 6: 'e',
                  7: 'e', 8: 'g', 9: 'g', 10: 'h', 11: 'l', 12: 'l'},
    'uniqueid3': {0: 'z', 1: 'y', 2: 'x', 3: 'y', 4: 'x', 5: 'v', 6: 'x',
                  7: 'u', 8: 'h', 9: 'i', 10: 'k', 11: 'k', 12: 'n'},
})
```
This is the data I have, based on the columns uniqueid, uniqueid2 and uniqueid3; I need to create new_key as shown in the expected output. In this dummy data, all rows except the first belong to the same product, based on the associations in the first two columns. I am unsure how to proceed; any quick help would be appreciated.
Expected output: https://i.stack.imgur.com/yAl56.png

This will give you the correct output, but I'm not sure it's exactly how you want to generate the new_key column. This solution checks uniqueid2 to see whether all values are unique within each uniqueid group as well as across the entire uniqueid2 column.
import pandas as pd
import numpy as np
df = pd.DataFrame({'uniqueid':  {0: 'a', 1: 'b', 2: 'b', 3: 'c', 4: 'd', 5: 'd',
                                 6: 'e', 7: 'e', 8: 'g', 9: 'g', 10: 'h', 11: 'l'},
                   'uniqueid2': {0: 'z', 1: 'y', 2: 'x', 3: 'y', 4: 'x', 5: 'v',
                                 6: 'x', 7: 'u', 8: 'h', 9: 'i', 10: 'k', 11: 'k'}})
# m1: True where the uniqueid2 value occurs exactly once in the whole column
df['m1'] = df.groupby('uniqueid2')['uniqueid2'].transform('count') == 1
# m2: number of rows in each uniqueid group whose uniqueid2 is globally unique
df['m2'] = df.groupby('uniqueid')['m1'].transform('sum')
# m3: size of each uniqueid group
df['m3'] = df.groupby('uniqueid')['uniqueid2'].transform('size')
# m4: True where the uniqueid value occurs exactly once in the whole column
df['m4'] = df.groupby('uniqueid')['uniqueid'].transform('count') == 1
df['new_key'] = np.where((df['m2'] == df['m3']) | df['m4'], df['uniqueid'], 'b')
df
Out[13]:
   uniqueid uniqueid2     m1  m2  m3     m4 new_key
0         a         z   True   1   1   True       a
1         b         y  False   0   2  False       b
2         b         x  False   0   2  False       b
3         c         y  False   0   1   True       c
4         d         x  False   1   2  False       b
5         d         v   True   1   2  False       b
6         e         x  False   1   2  False       b
7         e         u   True   1   2  False       b
8         g         h   True   2   2  False       g
9         g         i   True   2   2  False       g
10        h         k  False   0   1   True       h
11        l         k  False   0   1   True       l
I kept m1 through m4 so that you can see the progression of the logic. You can drop these columns with:
df = df.drop(['m1', 'm2', 'm3', 'm4'], axis=1)
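
If you would rather not materialize the helper columns at all, the same masks can be kept as plain Series (a minimal sketch of the identical logic; note that the hardcoded 'b' fallback carries over from the answer above as-is):

```python
m1 = df.groupby('uniqueid2')['uniqueid2'].transform('count') == 1  # singleton uniqueid2
m2 = m1.groupby(df['uniqueid']).transform('sum')                   # singletons per uniqueid group
m3 = df.groupby('uniqueid')['uniqueid2'].transform('size')         # group size
m4 = df.groupby('uniqueid')['uniqueid'].transform('count') == 1    # singleton uniqueid
df['new_key'] = np.where((m2 == m3) | m4, df['uniqueid'], 'b')
```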

This looks like a networkx problem; let's try:
import networkx as nx
G = nx.Graph()
# get the first uniqueid seen for each uniqueid2 value
s = df.groupby('uniqueid2')['uniqueid'].transform('first')
# get connected components from uniqueid and the variable s above
G.add_edges_from(df[['uniqueid']].assign(k=s).to_numpy().tolist())
cc = list(nx.connected_components(G))
# [{'a'}, {'b', 'c', 'd', 'e'}, {'g'}, {'h', 'l'}]
idx = [dict.fromkeys(y, x) for x, y in enumerate(cc)]
d = {k: v for d in idx for k, v in d.items()}
df['new_key'] = s.groupby(s.map(d)).transform('first')
print(df)
   uniqueid uniqueid2 new_key
0         a         z       a
1         b         y       b
2         b         x       b
3         c         y       b
4         d         x       b
5         d         v       b
6         e         x       b
7         e         u       b
8         g         h       g
9         g         i       g
10        h         k       h
11        l         k       h
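
Since every uniqueid is a node in G, the idx/d bookkeeping can be collapsed into one node-to-representative dict (a sketch; the representative here is the lexicographically smallest member of each component, which happens to reproduce the labels above, but any consistent choice works):

```python
# map each node to a representative of its connected component
rep = {node: min(comp) for comp in nx.connected_components(G) for node in comp}
df['new_key'] = df['uniqueid'].map(rep)
```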

Related

How to create a directed weighted edgelist from a pandas dataframe with weights in two columns?

I have the following pandas DataFrame (df):
>>> import pandas as pd
>>> df = pd.DataFrame([
...     ['A', 'B', '1'],
...     ['A', 'B', '2'],
...     ['B', 'A', '41'],
...     ['A', 'C', '11'],
...     ['C', 'B', '3'],
...     ['B', 'D', '4'],
...     ['D', 'B', '51']
... ], columns=('station_i', 'station_j', 'UID'))
I used
>>> df2 = df.groupby(by=['station_i', 'station_j']).size().to_frame(name='counts_ij').reset_index()
to obtain the dataframe df2:
>>> print(df2)
station_i station_j counts_ij
0 A B 2
1 A C 1
2 B A 1
3 B D 1
4 C B 1
5 D B 1
Now I would like to obtain the dataframe df3, built as shown below, where pairs with the same values but reversed are dropped and counted in an extra column:
>>> print(df3)
station_i station_j counts_ij counts_ji
0 A B 2 1
1 A C 1 0
2 C B 1 0
3 B D 1 1
I would really appreciate some suggestions.
import pandas as pd
import numpy as np
# find the indices that are reverse duplicates
dupe = pd.concat([
    np.maximum(df2.station_i, df2.station_j),
    np.minimum(df2.station_i, df2.station_j)
], axis=1).duplicated()
df2[dupe]
station_i station_j counts_ij
2 B A 1
5 D B 1
df2[~dupe]
station_i station_j counts_ij
0 A B 2
1 A C 1
3 B D 1
4 C B 1
# split by dupe, reverse the duplicated stations and merge with the non-duplicates
df2[~dupe].merge(
    df2[dupe].rename(
        columns={'station_i': 'station_j', 'station_j': 'station_i', 'counts_ij': 'counts_ji'}
    ), how='left'
).fillna(0).astype({'counts_ji': int})
station_i station_j counts_ij counts_ji
0 A B 2 1
1 A C 1 0
2 B D 1 1
3 C B 1 0
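
A hedged alternative is to canonicalize each pair lexicographically and pivot on the direction. Note that rows such as (C, B) then come out keyed as (B, C) with the counts swapped, unlike the merge above, which keeps the first-seen orientation:

```python
import numpy as np

# unordered-pair key plus the direction of the original row
key = np.where(df2.station_i < df2.station_j,
               df2.station_i + '|' + df2.station_j,
               df2.station_j + '|' + df2.station_i)
direction = np.where(df2.station_i < df2.station_j, 'counts_ij', 'counts_ji')
df3 = (df2.assign(pair=key, direction=direction)
          .pivot_table(index='pair', columns='direction',
                       values='counts_ij', aggfunc='sum', fill_value=0))
```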

Add count of elements for each group within a list in pandas

I have a list and a dataframe such as:
The_list = ["A", "B", "D"]
Groups Values
G1 A
G1 B
G1 C
G1 D
G2 A
G2 B
G2 A
G3 A
G3 D
G4 Z
G4 D
G4 E
G4 C
And I would like to add, for each group, the number of Values elements that are within The_list, putting this number in a New_column.
Here I should then get:
Groups Values New_column
G1 A 3
G1 B 3
G1 C 3
G1 D 3
G2 A 2
G2 B 2
G2 A 2
G3 A 1
G3 D 1
G4 Z 0
G4 D 0
G4 E 0
G4 C 0
Thanks a lot for your help
Here is the table in dict format if it helps:
{'Groups': {0: 'G1', 1: 'G1', 2: 'G1', 3: 'G1', 4: 'G2', 5: 'G2', 6: 'G2', 7: 'G3', 8: 'G3', 9: 'G4', 10: 'G4', 11: 'G4', 12: 'G4'}, 'Values': {0: 'A', 1: 'B', 2: 'C', 3: 'D', 4: 'A', 5: 'B', 6: 'A', 7: 'A', 8: 'D', 9: 'Z', 10: 'D', 11: 'E', 12: 'C'}}
In your case, do a transform after the isin check:
df['new'] = df['Values'].isin(The_list).groupby(df['Groups']).transform('sum')
Out[37]:
0 3
1 3
2 3
3 3
4 3
5 3
6 3
7 2
8 2
9 1
10 1
11 1
12 1
Name: Values, dtype: int64
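
If you prefer an explicit aggregate-then-map version, the following sketch produces the same column (it counts every matching row, so duplicates within a group each count once):

```python
# aggregate once per group, then broadcast back with map
counts = df['Values'].isin(The_list).groupby(df['Groups']).sum()
df['New_column'] = df['Groups'].map(counts)
```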

Unique values across columns row-wise in pandas with missing values

I have a dataframe like
import pandas as pd
import numpy as np
df = pd.DataFrame({"Col1": ['A', np.nan, 'B', 'B', 'C'],
"Col2": ['A', 'B', 'B', 'A', 'C'],
"Col3": ['A', 'B', 'C', 'A', 'C']})
I want to get the unique combinations across columns for each row and create a new column with those values, excluding the missing values.
The code I have right now to do this is
def handle_missing(s):
    return np.unique(s[s.notnull()])

def unique_across_rows(data):
    unique_vals = data.apply(handle_missing, axis=1)
    # numpy unique sorts the values automatically
    merged_vals = unique_vals.apply(lambda x: x[0] if len(x) == 1 else '_'.join(x))
    return merged_vals
df['Combos'] = unique_across_rows(df)
This returns the expected output:
Col1 Col2 Col3 Combos
0 A A A A
1 NaN B B B
2 B B C B_C
3 B A A A_B
4 C C C C
It seems to me that there should be a more vectorized approach that exists within Pandas to do this: how could I do that?
You can try a simple list comprehension which might be more efficient for larger dataframes:
df['combos'] = ['_'.join(sorted(k for k in set(v) if pd.notnull(k))) for v in df.values]
Or you can wrap the above list comprehension in a more readable function:
def combos():
    for v in df.values:
        unique = set(filter(pd.notnull, v))
        yield '_'.join(sorted(unique))

df['combos'] = list(combos())
Col1 Col2 Col3 combos
0 A A A A
1 NaN B B B
2 B B C B_C
3 B A A A_B
4 C C C C
You can also use agg/apply on axis=1 like below:
df['Combos'] = df.agg(lambda x: '_'.join(sorted(x.dropna().unique())), axis=1)
print(df)
Col1 Col2 Col3 Combos
0 A A A A
1 NaN B B B
2 B B C B_C
3 B A A A_B
4 C C C C
Try this (the explanation is in the inline comments):
df['Combos'] = (df.stack()  # this removes NaN values
                  .sort_values()  # so we have A_B instead of B_A in row 3
                  .groupby(level=0)  # group by the original index
                  .agg(lambda x: '_'.join(x.unique()))  # join the unique values
               )
Output:
Col1 Col2 Col3 Combos
0 A A A A
1 NaN B B B
2 B B C B_C
3 B A A A_B
4 C C C C
Fill the NaN with a string placeholder '-'. Create a unique array from the [Col1, Col2, Col3] list and remove the placeholder. Join the unique array values with '-':
import pandas as pd
import numpy as np
def unique(list1):
    if '-' in list1:
        list1.remove('-')
    x = np.array(list1)
    return np.unique(x)

df = pd.DataFrame({"Col1": ['A', np.nan, 'B', 'B', 'C'],
                   "Col2": ['A', 'B', 'B', 'A', 'C'],
                   "Col3": ['A', 'B', 'C', 'A', 'C']}).fillna('-')
s = "-"
for key, row in df.iterrows():
    df.loc[key, 'combos'] = s.join(unique([row.Col1, row.Col2, row.Col3]))
print(df.head())
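
The same placeholder idea also works without iterrows, reusing the list-comprehension pattern from the first answer (a minimal sketch on the '-'-filled frame):

```python
# build each row's combo directly, discarding the '-' placeholder
df['combos'] = ['-'.join(sorted(set(v) - {'-'}))
                for v in df[['Col1', 'Col2', 'Col3']].values]
```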

Python, pandas dataframe, groupby column and known in advance values

Consider this example:
>>> import pandas as pd
>>> df = pd.DataFrame(
...     [
...         ['X', 'R', 1],
...         ['X', 'G', 2],
...         ['X', 'R', 1],
...         ['X', 'B', 3],
...         ['X', 'R', 2],
...         ['X', 'B', 2],
...         ['X', 'G', 1],
...     ],
...     columns=['client', 'status', 'cnt']
... )
>>> df
client status cnt
0 X R 1
1 X G 2
2 X R 1
3 X B 3
4 X R 2
5 X B 2
6 X G 1
>>>
>>> df_gb = df.groupby(['client', 'status']).cnt.sum().unstack()
>>> df_gb
status B G R
client
X 5 3 4
>>>
>>> def color(row):
...     if 'R' in row:
...         red = row['R']
...     else:
...         red = 0
...     if 'B' in row:
...         blue = row['B']
...     else:
...         blue = 0
...     if 'G' in row:
...         green = row['G']
...     else:
...         green = 0
...     if red > 0:
...         return 'red'
...     elif blue > 0 and (red + green) == 0:
...         return 'blue'
...     elif green > 0 and (red + blue) == 0:
...         return 'green'
...     else:
...         return 'orange'
...
>>> df_gb.apply(color, axis=1)
client
X red
dtype: object
>>>
What this code does is group by in order to get the counts of each category (red, green, blue). Then apply is used to implement the logic that determines the color of each client (in this case there is only one).
The problem is that the groupby result can contain any combination of RGB values. For example, I might have R and G columns but not B, or just an R column, or none of the RGB columns at all. Because of that, in the apply function I had to introduce if statements for each column so that I have a count for each color whether or not it appears in the groupby result.
Do I have any other option to enforce the logic of the color function, using something other than apply in such an (ugly) way?
For example, in this case I know in advance that I need counts for exactly three categories: R, G and B. I need something like "group by column and these three values".
Can I group the dataframe by these three categories (series, dict, function?) and always get zero or a sum for all three categories, whether or not they exist in the group?
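
One direct way to guarantee columns for all three categories, regardless of which statuses actually occur, is to reindex the unstacked result (a minimal sketch; absent categories are filled with 0):

```python
# force columns R, G, B to exist even if a status never appears in the data
df_gb = (df.groupby(['client', 'status']).cnt.sum()
           .unstack(fill_value=0)
           .reindex(columns=['R', 'G', 'B'], fill_value=0))
```

The answer below reaches the same guarantee with fill_value and then goes further with a lookup table.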
Use:
# changed data for more combinations
df = pd.DataFrame(
    [
        ['W', 'R', 1],
        ['X', 'G', 2],
        ['Y', 'R', 1],
        ['Y', 'B', 3],
        ['Z', 'R', 2],
        ['Z', 'B', 2],
        ['Z', 'G', 1],
    ],
    columns=['client', 'status', 'cnt']
)
print (df)
client status cnt
0 W R 1
1 X G 2
2 Y R 1
3 Y B 3
4 Z R 2
5 Z B 2
6 Z G 1
Then the fill_value=0 parameter is added to replace non-matched (missing) values with 0:
df_gb = df.groupby(['client', 'status']).cnt.sum().unstack(fill_value=0)
# alternative
df_gb = df.pivot_table(index='client',
                       columns='status',
                       values='cnt',
                       aggfunc='sum',
                       fill_value=0)
print (df_gb)
status B G R
client
W 0 0 1
X 0 2 0
Y 3 0 1
Z 2 1 2
Instead of a function, a helper DataFrame is created with all combinations of 0 and 1, and a new column is added for the output:
from itertools import product
df1 = pd.DataFrame(product([0, 1], repeat=3), columns=['R', 'G', 'B'])
# change the colors as needed
df1['output'] = ['no','blue','green','color2','red','red1','red2','all']
print (df1)
R G B output
0 0 0 0 no
1 0 0 1 blue
2 0 1 0 green
3 0 1 1 color2
4 1 0 0 red
5 1 0 1 red1
6 1 1 0 red2
7 1 1 1 all
Then DataFrame.clip is used to cap values above 1 at 1:
print (df_gb.clip(upper=1))
status  B  G  R
client
W       0  0  1
X       0  1  0
Y       1  0  1
Z       1  1  1
Finally, DataFrame.merge is used to add the new output column; there is no on parameter, so the join uses the intersection of columns in both DataFrames, here R, G, B:
df2 = df_gb.clip(upper=1).merge(df1)
print (df2)
B G R output
0 0 0 1 red
1 0 1 0 green
2 1 0 1 red1
3 1 1 1 all
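
Note that merge returns a fresh RangeIndex, so the client labels are lost in the join on columns; resetting the index first keeps them as a column (a sketch):

```python
# keep the client labels through the merge
df2 = df_gb.clip(upper=1).reset_index().merge(df1)
```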

Change values in pandas dataframe based on values in certain columns

How can I convert this first dataframe to the one below it? Based on different scenarios of the first three columns matching, I want to change the values in the rest of the columns.
import pandas as pd
df = pd.DataFrame([['foo', 'foo', 'bar', 'a', 'b', 'c', 'd'],
                   ['bar', 'foo', 'bar', 'a', 'b', 'c', 'd'],
                   ['spa', 'foo', 'bar', 'a', 'b', 'c', 'd']],
                  columns=['col1', 'col2', 'col3', 's1', 's2', 's3', 's4'])
col1 col2 col3 s1 s2 s3 s4
0 foo foo bar a b c d
1 bar foo bar a b c d
2 spa foo bar a b c d
If col1 = col2, I want to change all a's to 2, all b's and c's to 1, and all d's to 0. This is row 1 in my example df.
If col1 = col3, I want to change all a's to 0, all b's and c's to 1, and all d's to 2. This is row 2 in my example df.
If col1 != col2/col3, I want to delete the row and add 1 to a counter so I have a total of deleted rows. This is row 3 in my example df.
So my final dataframe would look like this, with counter = 1:
df = pd.DataFrame([['foo', 'foo', 'bar', '2', '1', '1', '0'],
                   ['bar', 'foo', 'bar', '0', '1', '1', '2']],
                  columns=['col1', 'col2', 'col3', 's1', 's2', 's3', 's4'])
col1 col2 col3 s1 s2 s3 s4
0 foo foo bar 2 1 1 0
1 bar foo bar 0 1 1 2
I was reading that using df.iterrows is slow and so there must be a way to do this on the whole df at once, but my original idea was:
for row in df.iterrows:
    if (row["col1"] == row["col2"]):
        df.replace(to_replace=['a'], value='2', inplace=True)
        df.replace(to_replace=['b', 'c'], value='1', inplace=True)
        df.replace(to_replace=['d'], value='0', inplace=True)
    elif (row["col1"] == row["col3"]):
        df.replace(to_replace=['a'], value='0', inplace=True)
        df.replace(to_replace=['b', 'c'], value='1', inplace=True)
        df.replace(to_replace=['d'], value='2', inplace=True)
    else:
        (delete row, add 1 to counter)
The original df is massive, so speed is important to me. I'm hoping it's possible to do the conversions on the whole dataframe without iterrows. Even if it's not possible, I could use help getting the syntax right for the iterrows.
You can remove rows by boolean indexing first:
m1 = df["col1"] == df["col2"]
m2 = df["col1"] == df["col3"]
m = m1 | m2
Get the number of removed rows by inverting the combined mask m with ~ and summing:
counter = (~m).sum()
print (counter)
1
df = df[m].copy()
print (df)
col1 col2 col3 s1 s2 s3 s4
0 foo foo bar a b c d
1 bar foo bar a b c d
and then replace using a dictionary, selected by the condition:
d1 = {'a': 2, 'b': 1, 'c': 1, 'd': 0}
d2 = {'a': 0, 'b': 1, 'c': 1, 'd': 2}
m1 = df["col1"] == df["col2"]
# replace in all columns except col1-col3
cols = df.columns.difference(['col1', 'col2', 'col3'])
df.loc[m1, cols] = df.loc[m1, cols].replace(d1)
df.loc[~m1, cols] = df.loc[~m1, cols].replace(d2)
print (df)
col1 col2 col3 s1 s2 s3 s4
0 foo foo bar 2 1 1 0
1 bar foo bar 0 1 1 2
Timings:
In [138]: %timeit (jez(df))
872 ms ± 6.94 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [139]: %timeit (hb(df))
1.33 s ± 9.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Setup:
np.random.seed(456)
a = ['foo', 'bar', 'spa']
b = list('abcd')
N = 100000
df1 = pd.DataFrame(np.random.choice(a, size=(N, 3))).rename(columns=lambda x: 'col{}'.format(x + 1))
df2 = pd.DataFrame(np.random.choice(b, size=(N, 20))).rename(columns=lambda x: 's{}'.format(x + 1))
df = df1.join(df2)
#print (df.head())

def jez(df):
    m1 = df["col1"] == df["col2"]
    m2 = df["col1"] == df["col3"]
    m = m1 | m2
    counter = (~m).sum()
    df = df[m].copy()
    d1 = {'a': 2, 'b': 1, 'c': 1, 'd': 0}
    d2 = {'a': 0, 'b': 1, 'c': 1, 'd': 2}
    m1 = df["col1"] == df["col2"]
    cols = df.columns.difference(['col1', 'col2', 'col3'])
    df.loc[m1, cols] = df.loc[m1, cols].replace(d1)
    df.loc[~m1, cols] = df.loc[~m1, cols].replace(d2)
    return df

def hb(df):
    counter = 0
    df[df.col1 == df.col2] = df[df.col1 == df.col2].replace(['a', 'b', 'c', 'd'], [2, 1, 1, 0])
    df[df.col1 == df.col3] = df[df.col1 == df.col3].replace(['a', 'b', 'c', 'd'], [0, 1, 1, 2])
    index_drop = df[((df.col1 != df.col3) & (df.col1 != df.col2))].index
    counter = counter + len(index_drop)
    df = df.drop(index_drop)
    return df
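
DataFrame.replace scans for every key, and since the s-columns here only ever hold 'a' through 'd', a per-column Series.map may be faster (a hedged, untimed sketch; jez_map is a hypothetical variant of jez above, and map is only safe because every value is a key of the dict):

```python
# hypothetical variant of jez(): Series.map instead of DataFrame.replace
def jez_map(df):
    m = (df["col1"] == df["col2"]) | (df["col1"] == df["col3"])
    df = df[m].copy()
    m1 = df["col1"] == df["col2"]
    d1 = {'a': 2, 'b': 1, 'c': 1, 'd': 0}
    d2 = {'a': 0, 'b': 1, 'c': 1, 'd': 2}
    cols = df.columns.difference(['col1', 'col2', 'col3'])
    df.loc[m1, cols] = df.loc[m1, cols].apply(lambda s: s.map(d1))
    df.loc[~m1, cols] = df.loc[~m1, cols].apply(lambda s: s.map(d2))
    return df
```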
You can use:
import pandas as pd
df = pd.DataFrame([['foo', 'foo', 'bar', 'a', 'b', 'c', 'd'],
                   ['bar', 'foo', 'bar', 'a', 'b', 'c', 'd'],
                   ['spa', 'foo', 'bar', 'a', 'b', 'c', 'd']],
                  columns=['col1', 'col2', 'col3', 's1', 's2', 's3', 's4'])
counter = 0
# replace in rows where col1 matches col2, then where col1 matches col3
df[df.col1 == df.col2] = df[df.col1 == df.col2].replace(['a', 'b', 'c', 'd'], [2, 1, 1, 0])
df[df.col1 == df.col3] = df[df.col1 == df.col3].replace(['a', 'b', 'c', 'd'], [0, 1, 1, 2])
# drop rows where col1 matches neither, counting them
index_drop = df[((df.col1 != df.col3) & (df.col1 != df.col2))].index
counter = counter + len(index_drop)
df = df.drop(index_drop)
print(df)
print(counter)
Output:
col1 col2 col3 s1 s2 s3 s4
0 foo foo bar 2 1 1 0
1 bar foo bar 0 1 1 2
1 # counter
