Create a new column in a dataframe, based on Groupby and values in a separate column - python-3.x

I have a df like so:
df = pd.DataFrame({'Info': ['A', 'B', 'C', 'D', 'E'], 'Section': ['1', '1', '2', '2', '3']})
I want to be able to create a new column, like 'Unique_Info', like so:
df = pd.DataFrame({'Info': ['A', 'B', 'C', 'D', 'E'],
                   'Section': ['1', '1', '2', '2', '3'],
                   'Unique_Info': [['A', 'B'], ['A', 'B'], ['C', 'D'], ['C', 'D'], ['E']]})
So each row gets a list of all unique values from the Info column belonging to that row's Section; for Section=1 that list is ['A', 'B'].
I assume groupby is the most convenient way, and I've used the following:
df['Unique_Info'] = df.groupby('Section').agg({'Info':'unique'})
Any ideas where I'm going wrong?

df.groupby().agg returns a Series with a different index (the Section labels), so assigning it directly misaligns with the original rows. You should use map to assign it back to your dataframe:
s = df.groupby('Section')['Info'].agg('unique')
df['Unique_Info'] = df['Section'].map(s)
Output:
  Info Section Unique_Info
0    A       1      [A, B]
1    B       1      [A, B]
2    C       2      [C, D]
3    D       2      [C, D]
4    E       3         [E]
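For context, here is a minimal sketch (same toy frame as above, my own illustration, not part of the answer) of why the original assignment misbehaves: the aggregated result is indexed by the Section labels, not by the frame's 0..4 row labels, so a plain assignment cannot align.
import pandas as pd

df = pd.DataFrame({'Info': ['A', 'B', 'C', 'D', 'E'],
                   'Section': ['1', '1', '2', '2', '3']})

s = df.groupby('Section')['Info'].agg('unique')
print(s.index)  # Index(['1', '2', '3'], dtype='object', name='Section')

# map() looks each row's Section up in s, realigning the lists to the rows
df['Unique_Info'] = df['Section'].map(s)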

Use df.merge with a groupby aggregation:
In [1531]: grp = df.groupby('Section')['Info'].agg(list).reset_index()
In [1535]: df.merge(grp, on='Section').rename(columns={'Info_y': 'unique'})
Out[1535]:
  Info_x Section  unique
0      A       1  [A, B]
1      B       1  [A, B]
2      C       2  [C, D]
3      D       2  [C, D]
4      E       3     [E]
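A possible refinement (my own sketch, not part of the answer): pass suffixes to the merge so the original Info column keeps its name instead of becoming Info_x.
grp = df.groupby('Section')['Info'].agg(list).reset_index()
# left frame keeps plain 'Info'; only the right-hand column gets a suffix
out = (df.merge(grp, on='Section', suffixes=('', '_unique'))
         .rename(columns={'Info_unique': 'Unique_Info'}))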

Related

Unique values across columns row-wise in pandas with missing values

I have a dataframe like
import pandas as pd
import numpy as np
df = pd.DataFrame({"Col1": ['A', np.nan, 'B', 'B', 'C'],
"Col2": ['A', 'B', 'B', 'A', 'C'],
"Col3": ['A', 'B', 'C', 'A', 'C']})
I want to get the unique combinations across columns for each row and create a new column with those values, excluding the missing values.
The code I have right now to do this is
def handle_missing(s):
    return np.unique(s[s.notnull()])

def unique_across_rows(data):
    unique_vals = data.apply(handle_missing, axis=1)
    # numpy unique sorts the values automatically
    merged_vals = unique_vals.apply(lambda x: x[0] if len(x) == 1 else '_'.join(x))
    return merged_vals

df['Combos'] = unique_across_rows(df)
This returns the expected output:
  Col1 Col2 Col3 Combos
0    A    A    A      A
1  NaN    B    B      B
2    B    B    C    B_C
3    B    A    A    A_B
4    C    C    C      C
It seems to me that there should be a more vectorized approach that exists within Pandas to do this: how could I do that?
You can try a simple list comprehension which might be more efficient for larger dataframes:
df['combos'] = ['_'.join(sorted(k for k in set(v) if pd.notnull(k))) for v in df.values]
Or you can wrap the above list comprehension in a more readable function:
def combos():
    for v in df.values:
        unique = set(filter(pd.notnull, v))
        yield '_'.join(sorted(unique))

df['combos'] = list(combos())
  Col1 Col2 Col3 combos
0    A    A    A      A
1  NaN    B    B      B
2    B    B    C    B_C
3    B    A    A    A_B
4    C    C    C      C
You can also use agg/apply on axis=1 like below:
df['Combos'] = df.agg(lambda x: '_'.join(sorted(x.dropna().unique())), axis=1)
print(df)
  Col1 Col2 Col3 Combos
0    A    A    A      A
1  NaN    B    B      B
2    B    B    C    B_C
3    B    A    A    A_B
4    C    C    C      C
Try the following (explanation in the inline comments):
df['Combos'] = (df.stack()           # this removes NaN values
                  .sort_values()     # so we have A_B instead of B_A in 3rd row
                  .groupby(level=0)  # group by original index
                  .agg(lambda x: '_'.join(x.unique()))  # join the unique values
                )
Output:
  Col1 Col2 Col3 Combos
0    A    A    A      A
1  NaN    B    B      B
2    B    B    C    B_C
3    B    A    A    A_B
4    C    C    C      C
Fill the NaNs with a string placeholder '-'. Create a unique array from the Col1/Col2/Col3 values and remove the placeholder, then join the unique array values with '-'.
import pandas as pd
import numpy as np
def unique(list1):
    if '-' in list1:
        list1.remove('-')
    x = np.array(list1)
    return np.unique(x)

df = pd.DataFrame({"Col1": ['A', np.nan, 'B', 'B', 'C'],
                   "Col2": ['A', 'B', 'B', 'A', 'C'],
                   "Col3": ['A', 'B', 'C', 'A', 'C']}).fillna('-')
s = "-"
for key, row in df.iterrows():
    df.loc[key, 'combos'] = s.join(unique([row.Col1, row.Col2, row.Col3]))
print(df.head())

Filling NaNs when using pd.merge

I have two data frames and I want to merge them on common columns as seen below. There is also a new column in the second data frame.
dummy_data1 = {'id': ['1', '2', '3', '4'],
               'name': ['A', 'C', 'E', 'G'],
               'year': ['2012', '2012', '2012', '2012']}
df1 = pd.DataFrame(dummy_data1, columns=['id', 'name', 'year'])
dummy_data2 = {'id': ['1', '2', '3', '7'],
               'name': ['A', 'C', 'E', 'P'],
               'ADDRESS': ['X', 'Y', 'Z', 'P'],
               'year': ['2013', '2013', '2013', '2013']}
df2 = pd.DataFrame(dummy_data2, columns=['id', 'name', 'ADDRESS', 'year'])
when I merge these two data frames with
df_merge = pd.merge(df1, df2, on=['name','id','year'],how='outer')
I get NaNs for some rows because of the newly added column, as expected:
  id name  year ADDRESS
0  1    A  2012     NaN
1  2    C  2012     NaN
2  3    E  2012     NaN
3  4    G  2012     NaN
4  1    A  2013       X
5  2    C  2013       Y
6  3    E  2013       Z
7  7    P  2013       P
My question is about the NaNs: is there a way to just repeat the data when it is available for that id in the other data frame? So for index 0 it would bring 'X' instead of NaN, for index 1 'Y', and so forth. I just want to assume that ADDRESS for different years doesn't change.
Thanks!
I would suggest pandas merge_ordered with a backward fill.
merge_ordered works on sorted data, so sort the frames first if needed; in your case they already are sorted.
pd.merge_ordered(df1, df2).bfill()
  id name  year ADDRESS
0  1    A  2012       X
1  1    A  2013       X
2  2    C  2012       Y
3  2    C  2013       Y
4  3    E  2012       Z
5  3    E  2013       Z
6  4    G  2012       P
7  7    P  2013       P
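One caveat (my observation, not from the answer): a plain bfill fills across id boundaries, which is why index 6 above shows id 4 (name G) picking up address 'P' from id 7. If that matters, a group-safe sketch is to fill within each id after an ordinary outer merge, assuming ADDRESS is constant per id:
df_merge = pd.merge(df1, df2, on=['name', 'id', 'year'], how='outer')
# forward- and back-fill ADDRESS only within each id group
df_merge['ADDRESS'] = (df_merge.groupby('id')['ADDRESS']
                               .transform(lambda s: s.ffill().bfill()))
Here id 4 simply keeps NaN, since neither frame has an address for it.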

Add a row to pandas dataframe based on dictionary

Here is my example dataframe row:
A B C D E
I have a dictionary formatted like:
{'foo': ['A', 'B', 'C'], 'bar': ['D', 'E']}
I would like to add a row above my original dataframe so my new dataframe is:
foo foo foo bar bar
A B C D E
I think maybe the df.map function should be able to do it, but I've tried it and can't seem to get the syntax right.
I believe you want to set the column names from the first row of the DataFrame, using a dict with map:
d = {'foo': ['A', 'B', 'C'], 'bar': ['D', 'E']}
# swap keys with values
d1 = {k: oldk for oldk, oldv in d.items() for k in oldv}
print (d1)
{'E': 'bar', 'A': 'foo', 'D': 'bar', 'B': 'foo', 'C': 'foo'}
df = pd.DataFrame([list('ABCDE')])
df.columns = df.iloc[0].map(d1).values
print (df)
  foo foo foo bar bar
0   A   B   C   D   E
If you instead need the mapped labels as the first row of a one-row DataFrame:
df = pd.DataFrame([list('ABCDE')])
df.loc[-1] = df.iloc[0].map(d1)
df = df.sort_index().reset_index(drop=True)
print (df)
     0    1    2    3    4
0  foo  foo  foo  bar  bar
1    A    B    C    D    E
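If you would rather keep both labels in the header instead of storing one of them as a data row, a MultiIndex variant (my own sketch, not from the answer) could look like:
df = pd.DataFrame([list('ABCDE')])
# two-level header: group label on top, original letter underneath
df.columns = pd.MultiIndex.from_arrays([df.iloc[0].map(d1).tolist(),
                                        df.iloc[0].tolist()])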

Looking for an analogue to pd.DataFrame.drop_duplicates() where order does not matter

I would like something similar to dropping the duplicates of a DataFrame, but where the columns' order does not matter. What I mean is that the function should consider a row consisting of the entries 'a', 'b' to be identical to a row consisting of the entries 'b', 'a'. For example, given
df = pd.DataFrame([['a', 'b'], ['c', 'd'], ['a', 'b'], ['b', 'a']])
   0  1
0  a  b
1  c  d
2  a  b
3  b  a
I would like to obtain:
   0  1
0  a  b
1  c  d
where the preference is for efficiency, as I run this on a huge dataset within a groupby operation.
Call np.sort first, and then drop duplicates.
df[:] = np.sort(df.values, axis=1)
df.drop_duplicates()
   0  1
0  a  b
1  c  d
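Note that df[:] = np.sort(...) overwrites the original values in place. If the original rows must stay untouched, a non-mutating sketch is to sort a copy of the values only to build the duplicate mask:
sorted_vals = pd.DataFrame(np.sort(df.to_numpy(), axis=1), index=df.index)
out = df[~sorted_vals.duplicated()]  # keeps rows 0 and 1 with original values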

Pandas Get All Values from Multiindex levels

Given the following pivot table:
df = pd.DataFrame({'A': ['a', 'a', 'a', 'a', 'a', 'b', 'b', 'b', 'b'],
                   'B': ['x', 'y', 'z', 'x', 'y', 'z', 'x', 'y', 'z'],
                   'C': ['a', 'b', 'a', 'b', 'a', 'b', 'a', 'b', 'a'],
                   'D': [7, 5, 3, 4, 1, 6, 5, 3, 1]})
table = pd.pivot_table(df, index=['A', 'B', 'C'], aggfunc='sum')
table
       D
A B C
a x a  7
    b  4
  y a  1
    b  5
  z a  3
b x a  5
  y b  3
  z a  1
    b  6
I'd like to access each value of 'C' (level 2) as a list to use for plotting.
I'd like to do the same for 'A' and 'B' (levels 0 and 1), in a way that preserves the spacing between groups so I can use those lists as well; I'm ultimately trying to use them to build a plot with grouped axis labels.
Thanks in advance!
You can use get_level_values to get the index values at a specific level from a multi-index:
In [127]:
table.index.get_level_values('C')
Out[127]:
Index(['a', 'b', 'a', 'b', 'a', 'a', 'b', 'a', 'b'], dtype='object', name='C')
In [128]:
table.index.get_level_values('B')
Out[128]:
Index(['x', 'x', 'y', 'y', 'z', 'x', 'y', 'z', 'z'], dtype='object', name='B')
In [129]:
table.index.get_level_values('A')
Out[129]:
Index(['a', 'a', 'a', 'a', 'a', 'b', 'b', 'b', 'b'], dtype='object', name='A')
get_level_values accepts either an int (the level position) or a level label.
Note that for the higher levels the values are repeated to match the index length at the lowest level; the sparse display just hides the repetition.
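Since the goal is plotting, a small usage sketch: the level values convert directly to plain Python lists, which most plotting APIs accept.
a_labels = table.index.get_level_values('A').tolist()
b_labels = table.index.get_level_values('B').tolist()
c_labels = table.index.get_level_values('C').tolist()
heights = table['D'].tolist()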
