Finding unique ids in lines of dataframe - python-3.x

Input: a dataframe with more than 50k rows.
Expected result: find unique ids based on multiple columns.
For example, given this dataframe:
id par1 par2 par3
1 a 1 AA
2 b 2 AB
3 c 3 AC
4 a 4 AD
5 d 3 AE
6 e 5 AD
7 d 1 AF
The logic is: if any rows share a common parameter value, they belong to the same unique id. The result should look like the following, built up by iterations:
First by par1:
id par1 par2 par3 uniq_id
1 a 1 AA 1
2 b 2 AB 2
3 c 3 AC 3
4 a 4 AD 1
5 d 3 AE 4
6 e 5 AD 5
7 d 1 AF 4
Then by par2:
id par1 par2 par3 uniq_id
1 a 1 AA 1
2 b 2 AB 2
3 c 3 AC 3
4 a 4 AD 1
5 d 3 AE 3
6 e 5 AD 5
7 d 1 AF 1
Then by par3:
id par1 par2 par3 uniq_id
1 a 1 AA 1
2 b 2 AB 2
3 c 3 AC 3
4 a 4 AD 1
5 d 3 AE 3
6 e 5 AD 1
7 d 1 AF 1
Then it should be checked whether there are still any mismatches:
e.g. id=5 and id=3 should also get uniq_id = 1, because id=7 is uniq_id=1 and id=7 shares par1 with id=5, and because of that id=3 also changes.
I hope it is clear what I am trying to explain.
At the moment my only working solution is to build multiple for loops and compare values manually, but since there are lots of observations it can take forever to execute.
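For reference, here is a minimal reconstruction of the example frame from the table above, so the snippets below can be run directly (the dtypes are assumptions):
import pandas as pd

df = pd.DataFrame({
    'id':   [1, 2, 3, 4, 5, 6, 7],
    'par1': ['a', 'b', 'c', 'a', 'd', 'e', 'd'],
    'par2': [1, 2, 3, 4, 3, 5, 1],
    'par3': ['AA', 'AB', 'AC', 'AD', 'AE', 'AD', 'AF'],
})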

Use factorize first and then Series.map with DataFrame.drop_duplicates:
df['uniq_id'] = pd.factorize(df['par1'])[0] + 1  # initial ids from par1
df['uniq_id'] = df['par2'].map(df.drop_duplicates('par2').set_index('par2')['uniq_id'])  # propagate via par2
df['uniq_id'] = df['par3'].map(df.drop_duplicates('par3').set_index('par3')['uniq_id'])  # propagate via par3
print (df)
id par1 par2 par3 uniq_id
0 1 a 1 AA 1
1 2 b 2 AB 2
2 3 c 3 AC 3
3 4 a 4 AD 1
4 5 d 3 AE 3
5 6 e 5 AD 1
6 7 d 1 AF 1
If there are more columns, it is possible to create a loop:
df['uniq_id'] = pd.factorize(df['par1'])[0] + 1
for col in ['par2', 'par3']:
    df['uniq_id'] = df[col].map(df.drop_duplicates(col).set_index(col)['uniq_id'])
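Note that this column-by-column mapping can still leave the mismatches the question describes, because an id fixed by a later column is not propagated back through earlier ones. One way to guarantee full transitive propagation is to treat row ids and parameter values as nodes of a graph and take connected components. A sketch, assuming networkx is available (the resulting labels are arbitrary integers, not necessarily the 1, 2, 3 ordering above):
import networkx as nx

G = nx.Graph()
for col in ['par1', 'par2', 'par3']:
    # prefix values with the column name so equal values from different
    # columns are not merged by accident
    G.add_edges_from(zip(df['id'], col + '_' + df[col].astype(str)))

# every row id in the same component gets the same uniq_id
comp = {node: i
        for i, nodes in enumerate(nx.connected_components(G), 1)
        for node in nodes}
df['uniq_id'] = df['id'].map(comp)
For the example data this puts every row except id=2 into a single component, which matches the fully propagated result the question describes.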

Related

pandas transform one row into multiple rows

I have a dataframe as below:
ID list
1 a, b, c
2 a, s
3 NA
5 f, j, l
I need to break each item in the list column (string) into an independent row, as below:
ID item
1 a
1 b
1 c
2 a
2 s
3 NA
5 f
5 j
5 l
Thanks.
Use str.split to separate your items then explode:
print (df.assign(list=df["list"].str.split(", ")).explode("list"))
ID list
0 1 a
0 1 b
0 1 c
1 2 a
1 2 s
2 3 NaN
3 5 f
3 5 j
3 5 l
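Note that explode keeps the repeated original index (the three 0 rows above); if a clean 0..n index is preferred, chain a reset_index(drop=True):
print (df.assign(list=df["list"].str.split(", ")).explode("list").reset_index(drop=True))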
A beginner's approach: just another way of doing the same thing, using pd.DataFrame.stack:
df['list'] = df['list'].map(lambda x: str(x).split(','))  # split each string; NaN becomes the string 'nan'
dfOut = pd.DataFrame(df['list'].values.tolist())          # one column per list element, padded with None
dfOut.index = df['ID']                                    # keep the original IDs
dfOut = dfOut.stack().reset_index()                       # stack element columns into rows
del dfOut['level_1']                                      # drop the helper level created by stack
dfOut.rename(columns={0: 'list'}, inplace=True)
Output:
ID list
0 1 a
1 1 b
2 1 c
3 2 a
4 2 s
5 3 nan
6 5 f
7 5 j
8 5 l

Should I stack, pivot, or groupby?

I'm still learning how to work with dataframes and I can't manage this... I have a dataframe like this:
A B C D1 D2 D3
1 2 3 5 6 7
I need it to look like:
A B C DA D
1 2 3 D1 5
1 2 3 D2 6
1 2 3 D3 7
I know I should use something like groupby but I still can't find good documentation.
This is wide_to_long:
ydf = pd.wide_to_long(df, 'D', i=['A','B','C'], j='DA').reset_index()
ydf
A B C DA D
0 1 2 3 1 5
1 1 2 3 2 6
2 1 2 3 3 7
Use melt:
df.melt(['A','B','C'], var_name='DA', value_name='D')
Output:
A B C DA D
0 1 2 3 D1 5
1 1 2 3 D2 6
2 1 2 3 D3 7
Use set_index and stack:
df.set_index(['A','B','C']).stack().reset_index()
Output:
A B C level_3 0
0 1 2 3 D1 5
1 1 2 3 D2 6
2 1 2 3 D3 7
And you can do housekeeping by renaming the column headers etc.:
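For example (level_3 and 0 are the default names reset_index gives the unnamed index level and values column):
df.set_index(['A','B','C']).stack().reset_index().rename(columns={'level_3': 'DA', 0: 'D'})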

Sum of all rows based on specific column values

I have a df like this:
Index Parameters A B C D E
1 Apple 1 2 3 4 5
2 Banana 2 4 5 3 5
3 Potato 3 5 3 2 1
4 Tomato 1 1 1 1 1
5 Pear 4 5 5 4 3
I want to add up all the rows whose Parameters value is "Apple", "Banana", or "Pear".
Output:
Index Parameters A B C D E
1 Apple 1 2 3 4 5
2 Banana 2 4 5 3 5
3 Potato 3 5 3 2 1
4 Tomato 1 1 1 1 1
5 Pear 4 5 5 4 3
6 Total 7 11 13 11 13
My effort:
df.loc[:, 'Total'] = df.sum(axis=1)  # works, but sums everything; I want specific rows only
I tried selecting by index (1, 2 and 5 in my case), but in my original df the index can vary from time to time, so I rejected that solution.
I saw various answers on SO but none of them could solve my problem!
The first idea is to create an index from the Parameters column, select the rows to sum, and finally convert the index back to a column:
L = ["Apple" , "Banana" , "Pear"]
df = df.set_index('Parameters')
df.loc['Total'] = df.loc[L].sum()
df = df.reset_index()
print (df)
Parameters A B C D E
0 Apple 1 2 3 4 5
1 Banana 2 4 5 3 5
2 Potato 3 5 3 2 1
3 Tomato 1 1 1 1 1
4 Pear 4 5 5 4 3
5 Total 7 11 13 11 13
Or add a new row from the rows filtered by membership with Series.isin, then overwrite the last added Parameters value with 'Total':
last = len(df)
df.loc[last] = df[df['Parameters'].isin(L)].sum()
df.loc[last, 'Parameters'] = 'Total'
print (df)
Parameters A B C D E
0 Apple 1 2 3 4 5
1 Banana 2 4 5 3 5
2 Potato 3 5 3 2 1
3 Tomato 1 1 1 1 1
4 Pear 4 5 5 4 3
5 Total 7 11 13 11 13
Another similar solution filters all columns except the first and prepends the 'Total' label in a one-element list:
df.loc[len(df)] = ['Total'] + df.iloc[df['Parameters'].isin(L).values, 1:].sum().tolist()

Create a new column with the minimum of other columns on same row

I have the following DataFrame
Input:
A B C D E
2 3 4 5 6
1 1 2 3 2
2 3 4 5 6
I want to add a new column that has the minimum of A, B and C for that row
Output:
A B C D E Goal
2 3 4 5 6 2
1 1 2 3 2 1
2 3 4 5 6 2
I have tried to use
df = df[['A','B','C']].min()
but I get errors about hashing lists, and I also think this will be the min of the whole column; I only want the min of the row for those specific columns.
How can I best accomplish this?
Use min along the columns with axis=1.
An inline solution that produces a copy and doesn't alter the original:
df.assign(Goal=lambda d: d[['A', 'B', 'C']].min(1))
A B C D E Goal
0 2 3 4 5 6 2
1 1 1 2 3 2 1
2 2 3 4 5 6 2
The same answer put differently, adding the column to the existing dataframe:
new = df[['A', 'B', 'C']].min(axis=1)
df['Goal'] = new
df
A B C D E Goal
0 2 3 4 5 6 2
1 1 1 2 3 2 1
2 2 3 4 5 6 2
Add axis = 1 to your min
df['Goal'] = df[['A','B','C']].min(axis = 1)
You have to define the axis across which you are applying the min function, which would be 1 (columns):
df['ABC_row_min'] = df[['A', 'B', 'C']].min(axis = 1)

pandas moving aggregate string

import pandas as pd
from io import StringIO  # Python 3: StringIO lives in the io module

df = pd.read_csv(StringIO('''id months state
1 1 C
1 2 3
1 3 6
1 4 9
2 1 C
2 2 C
2 3 3
2 4 6
2 5 9
2 6 9
2 7 9
2 8 C
'''), sep=r'\s+')  # whitespace-separated, matching the pasted data
I want to create a column showing the cumulative state of the state column, by id.
id months state result
1 1 C C
1 2 3 C3
1 3 6 C36
1 4 9 C369
2 1 C C
2 2 C CC
2 3 3 CC3
2 4 6 CC36
2 5 9 CC369
2 6 9 CC3699
2 7 9 CC36999
2 8 C CC36999C
Basically a cumulative concatenation of a string column. What is the best way to do it?
So long as the dtype is str, you can do the following:
In [17]:
df['result']=df.groupby('id')['state'].apply(lambda x: x.cumsum())
df
Out[17]:
id months state result
0 1 1 C C
1 1 2 3 C3
2 1 3 6 C36
3 1 4 9 C369
4 2 1 C C
5 2 2 C CC
6 2 3 3 CC3
7 2 4 6 CC36
8 2 5 9 CC369
9 2 6 9 CC3699
10 2 7 9 CC36999
11 2 8 C CC36999C
Essentially we group by the 'id' column and then apply a lambda that returns the cumsum. Because the values are strings, cumsum performs a cumulative concatenation, and the result is a Series whose index is aligned to the original df, so you can add it as a column.
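An equivalent way to write this (a sketch, same result) uses transform, which always returns a result aligned to the original index:
df['result'] = df.groupby('id')['state'].transform(lambda x: x.cumsum())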
