Merge two columns into one, keeping hierarchical structure, using pandas or ExcelWriter

I need to collapse two columns into one, preserving the hierarchical structure of the rest, using either pandas alone or pandas with ExcelWriter. I need to transform this:
df = pd.DataFrame({'A': [ 'p', 'p', 'q'], 'B': ['x', 'y', 'z'], 'C': [1, 2, 3]})
df
A B C
0 p x 1
1 p y 2
2 q z 3
To this:
A C
0 p
1 x 1
2 y 2
3 q
4 z 3
UPDATE: Thank you for your help. I edited my question and added more details.

It seems you need:
df1 = df.stack().drop_duplicates().reset_index(drop=True).to_frame(name='A')
print (df1)
A
0 p
1 x
2 y
3 q
4 z
Detail:
print (df.stack())
0 A p
B x
1 A p
B y
2 A q
B z
dtype: object
print (df.stack().drop_duplicates())
0 A p
B x
1 B y
2 A q
B z
dtype: object
Or, if duplicates should be removed only in the first column, you can replace them with NaN; stack then drops those rows:
df = pd.DataFrame({'A': [ 'p', 'p', 'q'], 'B': ['x', 'z', 'z']})
print (df)
A B
0 p x
1 p z
2 q z
df['A'] = df['A'].mask(df['A'].duplicated())
df = df.stack().reset_index(drop=True).to_frame(name='A')
print (df)
A
0 p
1 x
2 z
3 q
4 z
Detail:
df['A'] = df['A'].mask(df['A'].duplicated())
print (df)
A B
0 p x
1 NaN z
2 q z
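As an aside, duplicated blanks every repeat of a value anywhere in the column. If only consecutive repeats should be blanked (an assumption about the desired behavior, applied to the two-column frame before stacking), comparing with shift is a common variant:
#blank out a value only when it equals the value directly above it
df['A'] = df['A'].mask(df['A'].eq(df['A'].shift()))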
EDIT: (using the original df with columns A, B, C)
df1 = (df.set_index('C')
         .stack()
         .reset_index(name='A')
         .drop(columns='level_1')
         .drop_duplicates('A')[['A','C']])
df1['C'] = df1['C'].mask(df1['A'].isin(df['A']), '')
print (df1)
A C
0 p
1 x 1
3 y 2
4 q
5 z 3
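If the row labels should be contiguous like the desired output (0 through 4), a final reset_index (a small addition, not part of the original answer) finishes it:
#renumber the rows 0..n-1 to match the desired output exactly
df1 = df1.reset_index(drop=True)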

Use stack as mentioned above.
Alternatively, np.unique with return_index=True keeps the first occurrence of each value (note this operates on the string columns only and produces just the collapsed column, not C):
In [5443]: _, idx = np.unique(df, return_index=True)
In [5444]: pd.DataFrame({'A': df.values.flatten()[np.sort(idx)]})
Out[5444]:
A
0 p
1 x
2 y
3 q
4 z
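For reference, here is why the order comes out right (a sketch on the two string columns, since np.unique cannot sort mixed strings and numbers): return_index gives each value's first position in the flattened row-major array, and sorting those positions restores first-occurrence order.
import numpy as np
df_ab = df[['A', 'B']]
flat = df_ab.values.flatten()
#flat is ['p', 'x', 'p', 'y', 'q', 'z'] in row-major order
_, idx = np.unique(flat, return_index=True)
print (flat[np.sort(idx)])
#['p' 'x' 'y' 'q' 'z'] -- first-occurrence order, duplicates dropped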


Can I apply vectorization here? Or should I think about this differently?

To put it simply, I have rows of activity that happen in a given month of the year. I want to append additional rows of inactivity in between this activity, while resetting the month values into a sequence. For example, if I have months 2, 5, 7, I need to map these to 1, 4, 7, while my inactive months happen in 2, 3, 5, and 6. So I would have to add four rows for this inactivity.

I've done this with dictionaries and for-loops, but I know this is not efficient, especially when I scale it up to thousands of rows of data. Any suggestions on how to optimize this? Do I need to think about the data format differently? I've had a suggestion to build lists and move them into the dataframe at the end, but I don't see a huge gain there. I don't know enough NumPy to figure out how to do this with vectorization; since that's super fast, it would be awesome to learn something new. Below is my code with the steps I took:
df = pd.DataFrame({'col1': ['A','A', 'B','B','B','C','C'], 'col2': ['X','Y','X','Y','Z','Y','Y'], 'col3': [1, 8, 2, 5, 7, 6, 7]})
Output:
col1 col2 col3
0 A X 1
1 A Y 8
2 B X 2
3 B Y 5
4 B Z 7
5 C Y 6
6 C Y 7
I'm creating dictionaries to handle this in for-loops:
df1 = df.groupby('col1')['col3'].apply(list).to_dict()
df2 = df.groupby('col1')['col2'].apply(list).to_dict()
max_num = max(df.col3)
Output:
{'A': [1, 8], 'B': [2, 5, 7], 'C': [6, 7]}
{'A': ['X', 'Y'], 'B': ['X', 'Y', 'Z'], 'C': ['Y', 'Y']}
8
And now I'm adding those rows using my dictionaries by creating a new data frame:
df_new = pd.DataFrame({'col1': [], 'col2': [], 'col3': []})
for key in df1.keys():
    k = 1
    if list(df1[key])[-1] - list(df1[key])[0] + 1 < max_num:
        for i in list(range(list(df1[key])[0], list(df1[key])[-1] + 1, 1)):
            if i in df1[key]:
                df_new = df_new.append({'col1': key, 'col2': list(df2[key])[list(df1[key]).index(i)], 'col3': str(k)}, ignore_index=True)
            else:
                df_new = df_new.append({'col1': key, 'col2': 'N', 'col3': str(k)}, ignore_index=True)
            k += 1
        df_new = df_new.append({'col1': key, 'col2': 'E', 'col3': str(k)}, ignore_index=True)
    else:
        for i in list(range(list(df1[key])[0], list(df1[key])[-1] + 1, 1)):
            if i in df1[key]:
                df_new = df_new.append({'col1': key, 'col2': list(df2[key])[list(df1[key]).index(i)], 'col3': str(k)}, ignore_index=True)
            else:
                df_new = df_new.append({'col1': key, 'col2': 'N', 'col3': str(k)}, ignore_index=True)
            k += 1
Output:
col1 col2 col3
0 A X 1
1 A N 2
2 A N 3
3 A N 4
4 A N 5
5 A N 6
6 A N 7
7 A Y 8
8 B X 1
9 B N 2
10 B N 3
11 B Y 4
12 B N 5
13 B Z 6
14 B E 7
15 C Y 1
16 C Y 2
17 C E 3
And then I pivot to the form I want it:
df_pivot = df_new.pivot(index='col1', columns='col3', values='col2')
Output:
col3 1 2 3 4 5 6 7 8
col1
A X N N N N N N Y
B X N N Y N Z E NaN
C Y Y E NaN NaN NaN NaN NaN
Thanks for the help.
We can replace the dictionary-building steps with the single statement below, which uses reindex to place the additional N and E values without explicit loops.
df_new = df.set_index('col3')\
           .groupby('col1')\
           .apply(lambda dg:
                  dg.drop(columns='col1')
                    .reindex(range(dg.index.min(), dg.index.max()+1), fill_value='N')
                    .reindex(range(dg.index.min(), min(max_num, dg.index.max()+1)+1), fill_value='E')
                    .set_index(pd.RangeIndex(1, min(max_num, dg.index.max()-dg.index.min()+1+1)+1, name='col3'))
                  )\
           .reset_index()
After this, you can apply your pivot statement as it is.
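To see what the two reindex calls do, here is group 'B' worked in isolation (a sketch; dg stands for that group's sub-frame indexed by col3, with max_num = 8 as above):
#group 'B' has activity in months 2, 5 and 7
dg = df[df['col1'] == 'B'].set_index('col3')
#the first reindex fills the gaps between the min and max month with 'N' (inactive)
step1 = dg.drop(columns='col1').reindex(range(2, 8), fill_value='N')
#col2 over months 2..7 -> X, N, N, Y, N, Z
#the second reindex appends one 'E' row because the group ends before max_num
step2 = step1.reindex(range(2, 9), fill_value='E')
#col2 over months 2..8 -> X, N, N, Y, N, Z, E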

Python, pandas dataframe, groupby column and known in advance values

Consider this example:
>>> import pandas as pd
>>> df = pd.DataFrame(
... [
... ['X', 'R', 1],
... ['X', 'G', 2],
... ['X', 'R', 1],
... ['X', 'B', 3],
... ['X', 'R', 2],
... ['X', 'B', 2],
... ['X', 'G', 1],
... ],
... columns=['client', 'status', 'cnt']
... )
>>> df
client status cnt
0 X R 1
1 X G 2
2 X R 1
3 X B 3
4 X R 2
5 X B 2
6 X G 1
>>>
>>> df_gb = df.groupby(['client', 'status']).cnt.sum().unstack()
>>> df_gb
status B G R
client
X 5 3 4
>>>
>>> def color(row):
... if 'R' in row:
... red = row['R']
... else:
... red = 0
... if 'B' in row:
... blue = row['B']
... else:
... blue = 0
... if 'G' in row:
... green = row['G']
... else:
... green = 0
... if red > 0:
... return 'red'
... elif blue > 0 and (red + green) == 0:
... return 'blue'
... elif green > 0 and (red + blue) == 0:
... return 'green'
... else:
... return 'orange'
...
>>> df_gb.apply(color, axis=1)
client
X red
dtype: object
>>>
What this code does is a groupby in order to get the count for each category (red, green, blue).
Then apply is used to implement the logic that determines the color of each client (in this case there is only one).
The problem is that the groupby result can contain any combination of the RGB values.
For example, I can have the R and G columns but not B, or just the R column, or none of the RGB columns at all.
Because of that, in the apply function I had to introduce if statements for each column, so that I have a count for every color whether or not it appears in the groupby object.
Do I have any other option to enforce the logic from the color function, using something other than apply in such an (ugly) way?
For example, in this case I know in advance that I need counts for exactly three categories: R, G and B. I need something like grouping by a column and these three known values.
Can I group the dataframe by these three categories (series, dict, function?) and always get zero or a sum for all three, no matter whether they exist in the group or not?
Use:
#changed data for more combinations
df = pd.DataFrame(
    [
        ['W', 'R', 1],
        ['X', 'G', 2],
        ['Y', 'R', 1],
        ['Y', 'B', 3],
        ['Z', 'R', 2],
        ['Z', 'B', 2],
        ['Z', 'G', 1],
    ],
    columns=['client', 'status', 'cnt']
)
print (df)
client status cnt
0 W R 1
1 X G 2
2 Y R 1
3 Y B 3
4 Z R 2
5 Z B 2
6 Z G 1
Then the fill_value=0 parameter is added to unstack, to replace non-matched (missing) values with 0:
df_gb = df.groupby(['client', 'status']).cnt.sum().unstack(fill_value=0)
#alternative
df_gb = df.pivot_table(index='client',
                       columns='status',
                       values='cnt',
                       aggfunc='sum',
                       fill_value=0)
print (df_gb)
status B G R
client
W 0 0 1
X 0 2 0
Y 3 0 1
Z 2 1 2
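If a status can be missing from the data entirely, unstack cannot create its column at all; a reindex over the known categories (a small addition on top of the answer, not in the original) guarantees all three columns exist:
#guarantee the B, G, R columns exist even when a status never occurs in the data
df_gb = df_gb.reindex(columns=['B', 'G', 'R'], fill_value=0)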
Instead of a function, a helper DataFrame is created with all combinations of 0 and 1, plus a new column for the output:
from itertools import product
df1 = pd.DataFrame(product([0,1], repeat=3), columns=['R','G','B'])
#change colors like need
df1['output'] = ['no','blue','green','color2','red','red1','red2','all']
print (df1)
R G B output
0 0 0 0 no
1 0 0 1 blue
2 0 1 0 green
3 0 1 1 color2
4 1 0 0 red
5 1 0 1 red1
6 1 1 0 red2
7 1 1 1 all
Then DataFrame.clip is used to replace values above 1 with 1:
print (df_gb.clip(upper=1))
status  B  G  R
client
W       0  0  1
X       0  1  0
Y       1  0  1
Z       1  1  1
And last, DataFrame.merge is used to add the new output column; there is no on parameter, so the join uses the intersection of columns in both DataFrames, here R, G, B:
df2 = df_gb.clip(upper=1).merge(df1)
print (df2)
B G R output
0 0 0 1 red
1 0 1 0 green
2 1 0 1 red1
3 1 1 1 all
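Note that merge drops the client index, because the join happens only on the shared R, G, B columns. If the client labels are needed in the result, one variation (an addition, not part of the original answer) is to reset the index first:
#keep the client labels by turning the index into a regular column before merging
df3 = df_gb.clip(upper=1).reset_index().merge(df1)
print (df3)
  client  B  G  R output
0      W  0  0  1    red
1      X  0  1  0  green
2      Y  1  0  1   red1
3      Z  1  1  1    all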

Pandas dataframe merge by function on column names

Say I have two dataframes.
df_A has columns A__a, B__b, C (shape 5,3).
df_B has columns A_a, B_b, D (shape 4,3).
How can I unify them (without having to iterate over all columns) to get one df with columns A, B (shape 9,2)? Meaning A__a and A_a should be unified into the same column.
I'd like to use merge while applying the function lambda x: x.replace("_","") to the column names. Is that possible?
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randint(0,5,size=(5, 3)), columns=['A__a', 'B__b', 'C'])
df:
A__a B__b C
0 3 0 2
1 0 3 4
2 0 4 4
3 4 2 1
4 3 4 3
df2:
df2 = pd.DataFrame(np.random.randint(0,4,size=(4, 3)), columns=['A__a', 'B__b', 'D'])
A__a B__b D
0 3 2 0
1 3 1 1
2 0 2 0
3 3 2 0
df3 = pd.concat([df, df2], join='inner', ignore_index=True)
df_final = df3.rename(lambda x: str(x).split("__")[0],axis='columns')
df_final
df_final:
A B
0 3 0
1 0 3
2 0 4
3 4 2
4 3 4
5 3 2
6 3 1
7 0 2
8 3 2
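An aside (not stated in the original): join='inner' keeps only the columns shared by both frames, which is what drops C and D before the rename:
print (pd.concat([df, df2], join='inner', ignore_index=True).columns.tolist())
#['A__a', 'B__b']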
A simple concatenation will do (assuming the columns have already been renamed to A and B):
pd.concat([df_A, df_B], join='outer')[['A', 'B']].copy()
or
pd.concat([df_A, df_B], join='inner')
Alternatively, you can merge the DataFrames using how='outer':
import pandas as pd
import numpy as np
df_A = pd.DataFrame(np.random.randint(10,size=(5,3)), columns=['A','B','C'])
df_B = pd.DataFrame(np.random.randint(10,size=(4,3)), columns=['A','B','D'])
print(df_A.shape,df_B.shape)
#(5, 3) (4, 3)
new_df = df_A.merge(df_B , how= 'outer', on = ['A','B'])[['A','B']]
print(new_df.shape)
#(9,2)
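If the columns can't be renamed by hand in advance, the same merge works with the question's lambda applied on the fly (a sketch; note that replace("_", "") turns A__a and A_a both into Aa, so the shared keys come out as Aa and Bb, not A and B):
clean = lambda x: str(x).replace("_", "")
df_A = pd.DataFrame(np.random.randint(10, size=(5, 3)), columns=['A__a', 'B__b', 'C'])
df_B = pd.DataFrame(np.random.randint(10, size=(4, 3)), columns=['A_a', 'B_b', 'D'])
new_df = (df_A.rename(columns=clean)
              .merge(df_B.rename(columns=clean), how='outer', on=['Aa', 'Bb'])[['Aa', 'Bb']])
print(new_df.shape)
#(9, 2) as long as no (Aa, Bb) pair happens to repeat across the two frames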
If you can't change the names of the columns in advance and you want to use lambda x: x.replace("_",""), this is a way (rename applies the function to each column label):
df = pd.concat([df1.rename(columns=lambda x: str(x).replace("_","")),
                df2.rename(columns=lambda x: str(x).replace("_",""))],
               join='inner', ignore_index=True)
Example:
d1 = {'A__a' : ('A', 'B', 'C', 'D', 'E') , 'B__b' : ('a', 'b', 'c', 'd', 'e') ,'C': (1,2,3,4,5)}
df1 = pd.DataFrame(d1)
A__a B__b C
0 A a 1
1 B b 2
2 C c 3
3 D d 4
4 E e 5
d2 = {'A_a' : ('B', 'C', 'D','G') , 'B_b' : ('l','m','n','o') ,'D': (6,7,8,9)}
df2=pd.DataFrame(d2)
A_a B_b D
0 B l 6
1 C m 7
2 D n 8
3 G o 9
Output:
Aa Bb
0 A a
1 B b
2 C c
3 D d
4 E e
5 B l
6 C m
7 D n
8 G o
Alternatively, with explicit renames:
df = pd.concat([df1.rename(columns={'A__a':'A', 'B__b':'B'}), df2.rename(columns={'A_a':'A', 'B_b':'B'})], join='inner', ignore_index=True)

pandas dataframe concatenate strings from a subset of columns and put them into a list

I tried to retrieve strings from a subset of columns of a DataFrame, concatenate the strings into one string per row, and then put these into a list:
# row_subset is a sub-DataFrame of some DataFrame
sub_columns = ['A', 'B', 'C']
string_list = [""] * row_subset.shape[0]
for x in range(0, row_subset.shape[0]):
    for y in range(0, len(sub_columns)):
        string_list[x] += str(row_subset[sub_columns[y]].iloc[x])
so the result is like,
['row 0 string concatenation','row 1 concatenation','row 2 concatenation','row3 concatenation']
I am wondering, what is the best way to do this more efficiently?
I think you need to select the subset of columns with [] first and then use sum, or, if you need a separator, join:
df = pd.DataFrame({'A':list('abcdef'),
                   'B':list('qwerty'),
                   'C':list('fertuj'),
                   'D':[1,3,5,7,1,0],
                   'E':[5,3,6,9,2,4],
                   'F':list('aaabbb')})
print (df)
A B C D E F
0 a q f 1 5 a
1 b w e 3 3 a
2 c e r 5 6 a
3 d r t 7 9 b
4 e t u 1 2 b
5 f y j 0 4 b
sub_columns = ['A', 'B', 'C']
print (df[sub_columns].sum(axis=1).tolist())
['aqf', 'bwe', 'cer', 'drt', 'etu', 'fyj']
print (df[sub_columns].apply(' '.join, axis=1).tolist())
['a q f', 'b w e', 'c e r', 'd r t', 'e t u', 'f y j']
A very similar NumPy solution:
print (df[sub_columns].values.sum(axis=1).tolist())
['aqf', 'bwe', 'cer', 'drt', 'etu', 'fyj']
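One caveat (general pandas behavior, not from the answer above): sum concatenates here only because every subset column holds strings. If a numeric column is included, cast to str first:
#mixing in the numeric columns D and E: cast everything to str before joining
print (df[['A', 'D', 'E']].astype(str).apply(''.join, axis=1).tolist())
['a15', 'b33', 'c56', 'd79', 'e12', 'f04']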

How to get Series list elements vertically in pandas

I need to get a transpose, or columnar representation, for a list built from a Series in pandas. Below is the code snippet I have used to form lists from the Series:
series1.index.values.tolist()
series1.values.tolist()
It gives the lists below as output:
['A', 'B'....'Z'] , [4424180.0, 7463.0.....,34]
Required output:
'A' 4424180
'B' 7463
You need reset_index, optionally with rename_axis:
series1 = pd.Series([4424180.0, 7463.0,34], index=['A', 'B', 'Z'])
print (series1)
A 4424180.0
B 7463.0
Z 34.0
dtype: float64
df = series1.rename_axis('a').reset_index(name='b')
print (df)
a b
0 A 4424180.0
1 B 7463.0
2 Z 34.0
df = series1.reset_index()
df.columns = ['a','b']
print (df)
a b
0 A 4424180.0
1 B 7463.0
2 Z 34.0
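If plain Python pairs are wanted instead of a DataFrame (an assumption about the desired output), Series.items gives them directly:
print (list(series1.items()))
[('A', 4424180.0), ('B', 7463.0), ('Z', 34.0)]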
