merging varying number of rows by multiple conditions in python - python-3.x

Problem: merging varying number of rows by multiple conditions
Here is a stylized example of what the dataset looks like:
"index" "connector" "type" "q_text" "a_text" "varx" ...
1 1111 1 aa NA xx
2 9999 2 NA tt NA
3 1111 2 NA uu NA
4 9999 1 bb NA yy
5 9999 1 cc NA zz
Goal: what the dataset should look like
"index" "connector" "type" "type.1" "q_text" "q_text.1" "a_text" "a_text.1" "varx" "varx.1" ...
1 1111 1 2 aa NA NA uu xx NA
2 9999 1 2 bb NA NA tt yy NA
3 9999 1 2 cc NA NA tt zz NA
Logic: Column "type" has either value 1 or 2. Multiple rows can have value 1, but only one row with the same value in "connector" has value 2.
If rows have the same value in "connector", then merge the row of "type"=2 with the rows of "type"=1. Because multiple rows of "type"=1 can share the same "connector" value, duplicate the corresponding "type"=2 row and merge it with every "type"=1 row that has that "connector" value.
My results: Not all rows are merged, because multiple rows of "type"=1 are associated with a unique row of "type"=2.
Most similar questions are answered using SQL, which I cannot use here.
df2 = df.copy()
df.index.astype(str)
df2.index.astype(str)
pd.merge(df,df2, how='left', on='connector',right_index=True, left_index=True)
df3 = pd.merge(df.set_index('connector'),df2.set_index('connector'), right_index=True, left_index=True).reset_index()
dfNew = df.merge(df2, how='left', left_on=['connector'], right_on = ['connector'])
Can I achieve my goal with groupby()?
Solution by #victor__von__doom
if __name__ == '__main__':
    df = df.groupby('connector', sort=True).apply(lambda c: list(zip(*c.values[:,2:].tolist()))).reset_index(name='merged')
    df[['here', 'are', 'all', 'columns', 'except', 'for', 'the', 'connector', 'column']] = pd.DataFrame(df.merged.tolist())
    df = df.drop(['merged'], axis=1)

First off, it is really messy to just keep concatenating new columns onto your original DataFrame when rows are merged, especially when the number of columns is very large. Furthermore, if you end up merging 3 rows for 1 connector value and 4 rows for another (for example), the only way to include all values is to make empty columns for some rows, which is never a good idea. Instead, I've made it so that the merged rows get combined into tuples, which can then be parsed efficiently while keeping the size of your DataFrame manageable:
import numpy as np
import pandas as pd
if __name__ == '__main__':
    data = np.array([[1,2,3,4,5], [1111,9999,1111,9999,9999],
                     [1,2,2,1,1], ['aa', 'NA', 'NA', 'bb', 'cc'],
                     ['NA', 'tt', 'uu', 'NA', 'NA'],
                     ['xx', 'NA', 'NA', 'yy', 'zz']])
    df = pd.DataFrame(data.T, columns = ["index", "connector",
                                         "type", "q_text", "a_text", "varx"])
    df = df.groupby("connector", sort=True).apply(lambda c: list(zip(*c.values[:,2:].tolist()))).reset_index(name='merged')
    df[["type", "q_text", "a_text", "varx"]] = pd.DataFrame(df.merged.tolist())
    df = df.drop(['merged'], axis=1)
The final DataFrame looks like:
connector type q_text a_text varx ...
0 1111 (1, 2) (aa, NA) (NA, uu) (xx, NA) ...
1 9999 (2, 1, 1) (NA, bb, cc) (tt, NA, NA) (NA, yy, zz) ...
Which is much more compact and readable.
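If the wide layout from the question is still wanted, a self-merge on "connector" is one possible route. The following is only a sketch under the assumption stated in the question (at most one "type"=2 row per connector value); the names questions, answers and wide are made up for illustration, and the "index" column is left out for brevity.

```python
import pandas as pd

# Sample data from the question, with real missing values instead of "NA" strings.
df = pd.DataFrame({
    "connector": [1111, 9999, 1111, 9999, 9999],
    "type": [1, 2, 2, 1, 1],
    "q_text": ["aa", None, None, "bb", "cc"],
    "a_text": [None, "tt", "uu", None, None],
    "varx": ["xx", None, None, "yy", "zz"],
})

# Split the frame by type (hypothetical helper names).
questions = df[df["type"] == 1]
answers = df[df["type"] == 2]

# A left merge repeats the single type=2 row for every type=1 row that
# shares its connector value. validate raises if a connector value
# has more than one type=2 row.
wide = questions.merge(
    answers,
    on="connector",
    how="left",
    suffixes=("", ".1"),
    validate="many_to_one",
)
print(wide)
```

Each "type"=1 row ends up with the columns of its "type"=2 partner in the ".1" columns, so the number of new columns stays fixed no matter how many "type"=1 rows share a connector value.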

Related

Sort values of a data frame with duplicate values

I have a dataframe with a format like this:
d = {'col1': ['PC', 'PO', 'PC', 'XY', 'XY', 'AB', 'AB', 'PC', 'PO'], 'col2':
[1,2,3,4,5,6,7,8,9]}
df = pd.DataFrame(data=d)
df.sort_values(by = 'col1')
This sorts the rows alphabetically by col1.
I want to sort the rows by col1 in a custom order instead, keeping the duplicates.
Any idea?
Thanks in advance!
You can create an order beforehand and then sort values as below.
order = ['PO','XY','AB','PC']
df['col1'] = pd.CategoricalIndex(df['col1'], ordered=True, categories=order)
df = df.sort_values(by = 'col1')
df
col1 col2
1 PO 2
8 PO 9
3 XY 4
4 XY 5
5 AB 6
6 AB 7
0 PC 1
2 PC 3
7 PC 8
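If you would rather not convert col1 to a categorical, a sort key may also do the job. This is a sketch that assumes pandas 1.1 or later (for the key argument of sort_values); the rank dictionary is a made-up helper that maps each value to its position in the desired order.

```python
import pandas as pd

df = pd.DataFrame({
    "col1": ["PC", "PO", "PC", "XY", "XY", "AB", "AB", "PC", "PO"],
    "col2": [1, 2, 3, 4, 5, 6, 7, 8, 9],
})

order = ["PO", "XY", "AB", "PC"]
# Hypothetical helper: position of every value in the desired order.
rank = {value: position for position, value in enumerate(order)}

# The key is applied to col1 only for sorting; the column itself is unchanged.
# kind="stable" keeps duplicates in their original relative order.
result = df.sort_values("col1", key=lambda s: s.map(rank), kind="stable")
print(result)
```

Values that are missing from order map to NaN and are sorted to the end.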

merging varying number of rows and columns by multiple conditions in python

Updated problem: Why are a_date, a_par, a_cons, a_ment and a_le not merged? They are appended as columns without values, although they have values in the original dataset.
Here is what the dataset looks like:
connector type q_text a_text var1 var2
1 1111 1 aa None xx ps
2 9999 2 None tt jjjj pppp
3 1111 2 None uu None oo
4 9999 1 bb None yy Rt
5 9999 1 cc None zz tR
Goal: what the dataset should look like
connector q_text a_text var1 var1.1 var2 var2.1
1 1111 aa uu xx None ps oo
2 9999 bb tt yy jjjj Rt pppp
3 9999 cc tt zz jjjj tR pppp
Logic: Column type has either value 1 or 2. Multiple rows can have value 1, but only one row (with the same value in connector) has value 2.
Here are the main merging rules:
Merge every row of type=1 with its corresponding (same connector) type=2 row.
Since multiple rows of type=1 can have the same connector value, I don't want to merge only one row of type=1 but all of them, each with the sole type=2 row.
Since some columns (e.g. a_text) follow left-join logic, values can be overwritten without adding an extra column.
Since the var1 and var2 values cannot be merged by left-join logic, because they are not mutually exclusive within a connector value, I want extra columns (var1.1, var2.1) for those values (jjjj, pppp).
In summary (speaking only of rows that have the same connector value): if q_text is None in a row, I first want to use that row's a_text value (tt and uu in the table above) to replace the a_text values of the corresponding rows, and secondly want to append some other values (var1 and var2) of that same row as new columns.
Also, there are rows with a unique connector value that will not be matched. I want to keep those rows.
I only want to drop the type=2 rows that get merged with their corresponding type=1 row(s). In other words, I don't want to keep the type=2 rows that have a match and get merged into their corresponding type=1 rows. I want to keep all other rows.
The solution by #victor__von__doom to "merging varying number of rows by multiple conditions in python" was given when I originally wanted to keep all of the type=2 column values.
Code I used (it merged Perso, q_text and a_text):
df.loc[df['type'] == 2, 'a_date'] = df['q_date']
df.loc[df['type'] == 2, 'a_par'] = df['par']
df.loc[df['type'] == 2, 'a_cons'] = df['cons']
df.loc[df['type'] == 2, 'a_ment'] = df['pret']
df.loc[df['type'] == 2, 'a_le'] = df['q_le']
my_cols = ['Perso', 'q_text','a_text', 'a_le', 'q_le', 'q_date', 'par', 'cons', 'pret', 'q_le', 'a_date','a_par', 'a_cons', 'a_ment', 'a_le']
df[my_cols] = df.sort_values(['connector','type']).groupby('connector')[my_cols].transform(lambda x: x.bfill())
df.dropna(subset=['a_text', 'Perso'],inplace=True)
df.reset_index(drop=True,inplace=True)
Data: This is a representation of the core dataset. Unfortunately, I cannot share the actual data due to privacy laws.
Perso | ID | per | q_le | a_le | pret | par | form | q_date | name | IO_ID | part | area | q_text | a_text | country | cons | dig | connector | type
J Ws | 1-1/4/2001-11-12/1 | 1999-2009 | None | 4325 | 'Mi, h', 'd' | Cew | Thre | 2001-11-12 | None | 345 | rede | s — H | None | wr ede | Terd e | e r | 2001-11-12.1.g9 | 999999999 | 2
S ts | 9-3/6/2003-10-14/1 | 1994-2004 | None | 23 | 'sd, h' | d-g | Thre | 2003-10-14 | None | 34555 | The | l? I | None | Tre | Thr ede | re | 2001-04-16.1.a9 | 333333333 | 2
On d | 6-1/6/2005-09-03/1 | 1992-2006 | None | 434 | 'uu h' | d-g | Thre | 2005-09-03 | None | 7313 | Thde | l? I | None | T e | Th rede | dre | 2001-08-07.1.e4 | 111111111 | 2
None | 3-4/4/2000-07-07/1 | 1992-2006 | 1223 | None | 'uu h' | dfs | Thre | 2000-07-07 | Th r | 7413 | Thde | Tddde | Thd de | None | Thre de |  | 2001-07-06.1.j3 | 111111111 | 1
None | 2-1/6/2001-11-12/1 | 1999-2009 | 1444 | None | 'Mi, h', 'd' | d-g | Thre | 2001-11-12 | T rj | 7431 | Thde | l? I | Th dde | None | Thr ede |  | 2001-11-12.1.s7 | 999999999 | 1
None | 1-6/4/2007-11-01/1 | 1993-2010 | 2353 | None | None | d-g | Thre | 2007-11-01 | Thrj | 444 | Thed | l. I | Tgg gg | None | Thre de | we e | 2001-06-11.1.g9 | 654982984 | 1
EDIT v2 with additional columns
This version ensures the values in the additional columns are not impacted.
c = ['connector','type','q_text','a_text','var1','var2','cumsum','country','others']
d = [[1111, 1, 'aa', None, 'xx', 'ps', 0, 'US', 'other values'],
[9999, 2, None, 'tt', 'jjjj', 'pppp', 0, 'UK', 'no values'],
[1111, 2, None, 'uu', None, 'oo', 1, 'US', 'some values'],
[9999, 1, 'bb', None, 'yy', 'Rt', 1, 'UK', 'more values'],
[9999, 1, 'cc', None, 'zz', 'tR', 2, 'UK', 'less values']]
import pandas as pd
pd.set_option('display.max_columns', None)
df = pd.DataFrame(d,columns=c)
print (df)
df.loc[df['type'] == 2, 'var1.1'] = df['var1']
df.loc[df['type'] == 2, 'var2.1'] = df['var2']
my_cols = ['q_text','a_text','var1','var2','var1.1','var2.1']
df[my_cols] = df.sort_values(['connector','type']).groupby('connector')[my_cols].transform(lambda x: x.bfill())
df.dropna(subset=['q_text'],inplace=True)
df.reset_index(drop=True,inplace=True)
print (df)
Original DataFrame:
connector type q_text a_text var1 var2 cumsum country others
0 1111 1 aa None xx ps 0 US other values
1 9999 2 None tt jjjj pppp 0 UK no values
2 1111 2 None uu None oo 1 US some values
3 9999 1 bb None yy Rt 1 UK more values
4 9999 1 cc None zz tR 2 UK less values
Updated DataFrame
connector type q_text a_text var1 var2 cumsum country others var1.1 var2.1
0 1111 1 aa uu xx ps 0 US other values None oo
1 9999 1 bb tt yy Rt 1 UK more values jjjj pppp
2 9999 1 cc tt zz tR 2 UK less values jjjj pppp
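The rules in the question also ask to keep rows whose connector value has no partner. A merge-based variant might cover that case more directly. This is a sketch, not a drop-in replacement: the rows with connector 5555 and 7777 are hypothetical additions to show the unmatched cases, and questions, answers, extra and unmatched are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({
    # 5555 (type=1 only) and 7777 (type=2 only) are hypothetical unmatched rows.
    "connector": [1111, 9999, 1111, 9999, 9999, 5555, 7777],
    "type": [1, 2, 2, 1, 1, 1, 2],
    "q_text": ["aa", None, None, "bb", "cc", "dd", None],
    "a_text": [None, "tt", "uu", None, None, None, "vv"],
    "var1": ["xx", "jjjj", None, "yy", "zz", "ww", "kk"],
    "var2": ["ps", "pppp", "oo", "Rt", "tR", "qq", "ll"],
})

questions = df[df["type"] == 1]
answers = df[df["type"] == 2]

# Values taken from the type=2 row: a_text replaces the empty a_text of the
# type=1 rows, var1/var2 go into new ".1" columns.
extra = answers[["connector", "a_text", "var1", "var2"]].rename(
    columns={"var1": "var1.1", "var2": "var2.1"}
)

# how="left" keeps type=1 rows that have no type=2 partner.
merged = questions.drop(columns="a_text").merge(extra, on="connector", how="left")

# type=2 rows whose connector never occurs among the type=1 rows are kept as is.
unmatched = answers[~answers["connector"].isin(questions["connector"])]

result = pd.concat([merged, unmatched], ignore_index=True)
print(result)
```

Only the type=2 rows that found a partner disappear from the result; everything else survives, with missing values where there was nothing to merge.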

Combine data from two columns into one without affecting the data values

I have two columns in a data frame. I want to combine those columns into a single column.
df = pd.DataFrame({'a': [500, 200, 13, 47], 'b':['$', '€', .586,.02]})
df
Out:
a b
0 500 $
1 200 €
2 13 .586
3 47 .02
I want to merge those two columns without affecting the data.
Expected output:
df
Out:
a
0 500$
1 200€
2 13.586
3 47.02
Please help me with this.
I tried the solutions below, but they do not work for me:
df.b=np.where(df.b,df.b,df.a)
df.loc[df['b'] == '', 'b'] = df['a']
The first solution converts both columns to strings, joins them with +, and finally converts the Series to a one-column DataFrame. It only works if the numbers in column b are less than 1:
df1 = df.astype(str)
# regex=True is needed in pandas 2.0+, where str.replace treats the pattern literally by default
df = (df1.a + df1.b.str.replace(r'^0', '', regex=True)).to_frame('a')
print (df)
a
0 500$
1 200€
2 13.586
3 47.02
Or, if you want mixed values (numeric for the last 2 rows and strings for the first 2 rows), use:
m = df.b.apply(lambda x: isinstance(x, str))
df.loc[m, 'a'] = df.loc[m, 'a'].astype(str) + df.b
df.loc[~m, 'a'] += df.pop('b')
print (df)
a
0 500$
1 200€
2 13.586
3 47.02
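A row-wise helper is another way to build the strings without relying on a regular expression. This is a sketch under the same assumption as above (the numbers in column b are fractions below 1); as_suffix is a made-up helper name.

```python
import pandas as pd

df = pd.DataFrame({"a": [500, 200, 13, 47], "b": ["$", "€", 0.586, 0.02]})


def as_suffix(value):
    """Return the text to append: strings unchanged, numbers without the leading zero."""
    if isinstance(value, str):
        return value
    # str(0.586) is "0.586"; dropping the leading "0" leaves ".586".
    return str(value).lstrip("0")


combined = df["a"].astype(str) + df["b"].map(as_suffix)
print(combined.tolist())
```

The result is a Series of strings, so convert it back with a numeric conversion only for the rows where that makes sense.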

How to select columns based on criteria?

I have the following dataframe:
d2 = {('CAR','ALPHA'): pd.Series(['A22', 'A23', 'A24', 'A25'],index=[2, 3, 4, 5]),
('CAR','BETA'): pd.Series(['B22', 'B23', 'B24', 'B25'],index=[2, 3, 4, 5]),
('MOTOR','SOLO'): pd.Series(['S22', 'S23', 'S24', 'S25'], index=[2, 3, 4, 5])}
db= pd.DataFrame(data=d2)
In the columns that have 'CAR' in level 0 of the column MultiIndex, I would like to delete all values from a given row index onward (e.g. 4) and set them to NA.
I am trying to use .loc, but I would like the results to be saved in the same dataframe.
The second thing I would like to do is set the values of the columns whose level 0 is different from 'CAR' to NA from a given row index onward (e.g. 3).
Use slicers for the first task; for the second, compare the level values returned by MultiIndex.get_level_values:
idx = pd.IndexSlice
db.loc[4:, idx['CAR', :]] = np.nan
db.loc[3:, db.columns.get_level_values(0) != 'CAR'] = 'AAA'
Or:
mask = db.columns.get_level_values(0) == 'CAR'
db.loc[4:, mask] = np.nan
db.loc[3:, ~mask] = 'AAA'
print(db)
CAR MOTOR
ALPHA BETA SOLO
2 A22 B22 S22
3 A23 B23 AAA
4 NaN NaN AAA
5 NaN NaN AAA
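For completeness, here is a self-contained sketch of the mask variant that sets both groups of columns to NaN, as the question asked, instead of the 'AAA' placeholder used above for visibility. The name is_car is chosen only for illustration.

```python
import numpy as np
import pandas as pd

idx = [2, 3, 4, 5]
db = pd.DataFrame({
    ("CAR", "ALPHA"): pd.Series(["A22", "A23", "A24", "A25"], index=idx),
    ("CAR", "BETA"): pd.Series(["B22", "B23", "B24", "B25"], index=idx),
    ("MOTOR", "SOLO"): pd.Series(["S22", "S23", "S24", "S25"], index=idx),
})

# Boolean array, True for columns whose level-0 label is 'CAR'.
is_car = db.columns.get_level_values(0) == "CAR"

# .loc slices by label and includes the start label, and it assigns in place.
db.loc[4:, is_car] = np.nan    # 'CAR' columns: rows labelled 4 and later
db.loc[3:, ~is_car] = np.nan   # all other columns: rows labelled 3 and later
print(db)
```

Because the assignment goes through .loc on db itself, no copy is needed to keep the result in the same dataframe.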

How to convert column into row?

Assume I have two rows where the values are the same for most of the columns, but not for all. I would like to group these two rows into one wherever the values are the same, and where the values differ, create an extra column named after the original column with a suffix (e.g. 'd1' for column 'd').
Step 1: Assuming columns 'a','b','c' have the same value in both rows and columns 'd','e','f' have different values, I group by 'a','b','c' and then unstack 'd','e','f'.
Step 2: Then I drop the levels and rename the columns to 'a','b','c','d','d1','e','e1','f','f1'.
But in my actual case I have 500+ columns and a million rows, and I don't know how to extend this to 500+ columns, given constraints like:
1) I don't know which columns will have the same values
2) I don't know which columns will have different values that need to be converted into new columns after grouping by the columns that have the same value
df.groupby(['a','b','c'])[['d','e','f']].apply(lambda x:pd.DataFrame(x.values)).unstack().reset_index()
df.columns = df.columns.droplevel()
df.columns = ['a','b','c','d','d1','e','e1','f','f1']
To be more clear, the code below creates the sample dataframe and the expected output:
df = pd.DataFrame({'Cust_id':[100,100, 101,101,102,103,104,104], 'gender':['M', 'M', 'F','F','M','F','F','F'], 'Date':['01/01/2019', '02/01/2019','01/01/2019',
'01/01/2019','03/01/2019','04/01/2019','03/01/2019','03/01/2019'],
'Product': ['a','a','b','c','d','d', 'e','e']})
expected_output = pd.DataFrame({'Cust_id':[100, 101,102,103,104], 'gender':['M', 'F','M','F','F'], 'Date':['01/01/2019','01/01/2019','03/01/2019','04/01/2019', '03/01/2019'], 'Date1': ['02/01/2019', 'NA','NA','NA','NA']
, 'Product': ['a', 'b', 'd', 'd','e'], 'Product1':['NA', 'c','NA','NA','NA' ]})
You can do the following to get expected_output from df:
s = df.groupby('Cust_id').cumcount().astype(str).replace('0', '')
df1 = df.pivot_table(index=['Cust_id', 'gender'], columns=s, values=['Date', 'Product'], aggfunc='first')
df1.columns = df1.columns.map(''.join)
Out[57]:
Date Date1 Product Product1
Cust_id gender
100 M 01/01/2019 02/01/2019 a a
101 F 01/01/2019 01/01/2019 b c
102 M 03/01/2019 NaN d NaN
103 F 04/01/2019 NaN d NaN
104 F 03/01/2019 03/01/2019 e e
Next, replace the duplicated values with NA:
df_expected = df1.where(df1.ne(df1.shift(axis=1)), 'NA').reset_index()
Out[72]:
Cust_id gender Date Date1 Product Product1
0 100 M 01/01/2019 02/01/2019 a NA
1 101 F 01/01/2019 NA b c
2 102 M 03/01/2019 NA d NA
3 103 F 04/01/2019 NA d NA
4 104 F 03/01/2019 NA e NA
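When it is not known in advance which columns vary, the same idea can be applied to every non-key column at once and the uninformative columns dropped afterwards. The following is a sketch on a shortened version of the sample data; it assumes that the grouping key (Cust_id here) is known, and it has not been tuned for a million rows or 500+ columns.

```python
import pandas as pd

df = pd.DataFrame({
    "Cust_id": [100, 100, 101, 101, 102],
    "gender": ["M", "M", "F", "F", "M"],
    "Product": ["a", "a", "b", "c", "d"],
})

key = "Cust_id"  # assumption: the grouping key is known
n = df.groupby(key).cumcount()  # 0 for the first row of a key, 1 for the second, ...

# One row per key; columns become (original column, position) pairs.
wide = df.set_index([key, n]).unstack()
first = wide.xs(0, axis=1, level=1)  # values from the first row of every key

for col, i in list(wide.columns):
    if i == 0:
        continue
    # A later value carries no information if it is missing or repeats the first one.
    repeats = wide[(col, i)].isna() | (wide[(col, i)] == first[col])
    if repeats.all():
        wide = wide.drop(columns=[(col, i)])        # column never differs: drop it
    else:
        wide[(col, i)] = wide[(col, i)].mask(repeats)  # blank out the repeats

# Flatten the names: position 0 keeps the plain name, later ones get a suffix.
wide.columns = [f"{col}{i}" if i else col for col, i in wide.columns]
wide = wide.reset_index()
print(wide)
```

Here gender never differs within a customer, so only Product gets an extra column.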
You can try this code - it could be a little cleaner but I think it does the job
df = pd.DataFrame({'a':[100, 100], 'b':['tue', 'tue'], 'c':['yes', 'yes'],
                   'd':['ok', 'not ok'], 'e':['ok', 'maybe'], 'f':[55, 66]})
df_transformed = pd.DataFrame()
for column in df.columns:
    col_vals = df.groupby(column)['b'].count().index.values
    for ix, col_val in enumerate(col_vals):
        temp_df = pd.DataFrame({column + str(ix) : [col_val]})
        df_transformed = pd.concat([df_transformed, temp_df], axis = 1)
Output for df_transformed
