How to keep only the max row per group in a groupby table? - python-3.x

I made a summarized table like the one below using the pandas groupby function:
   I       II
A  apple   3
   banana  4
B  dog     1
   cat     2
C  seoul   9
   tokyo   5
I want to keep only the row with the maximum value in column II within each category.
For example, in category A I want to keep only the banana row, because it has the max value in column II.
The result table I want looks like this:
   I       II
A  banana  4
B  cat     2
C  seoul   9
Thanks.

The DataFrame I used:
import pandas as pd
df=pd.DataFrame({'II': {('A', 'apple'): 3,
                        ('A', 'banana'): 4,
                        ('B', 'dog'): 1,
                        ('B', 'cat'): 2,
                        ('C', 'seoul'): 9,
                        ('C', 'tokyo'): 5}})
Try sort_values(), reset_index() and drop_duplicates():
out=(df.sort_values('II',ascending=False)
       .reset_index()
       .drop_duplicates('level_0')
       .set_index('level_0')
       .rename_axis(index=None)
       .rename(columns={'level_1':'I'}))
OR
out=(df.reset_index()
       .sort_values('II',ascending=False)
       .groupby('level_0')
       .first()
       .rename(columns={'level_1':'I'})
       .rename_axis(index=None))
Output of out for the first approach (the groupby version returns the same rows, ordered A, B, C):
   I       II
C  seoul   9
A  banana  4
B  cat     2
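Another common way to express "keep the row with the largest II in each group" is groupby(...).idxmax() followed by .loc. The sketch below assumes the index has been flattened into ordinary columns; the column names group and I are illustrative and not taken from the frame above. On ties, idxmax keeps the first row that reaches the maximum.

```python
import pandas as pd

# Flat version of the example data; 'group' and 'I' are assumed column names
df = pd.DataFrame({
    'group': ['A', 'A', 'B', 'B', 'C', 'C'],
    'I': ['apple', 'banana', 'dog', 'cat', 'seoul', 'tokyo'],
    'II': [3, 4, 1, 2, 9, 5],
})

# Index label of the row holding the max of II within each group
best_rows = df.groupby('group')['II'].idxmax()

# Select those rows and use the group as the index
out = df.loc[best_rows].set_index('group')
print(out)
```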

I am not sure this is the most elegant solution, but it works if you want to start from a groupby object.
import pandas as pd
# Creating the dummy DataFrame
d = {
    'Letter': ['A', 'A', 'B', 'B', 'C', 'C'],
    'Word': ['apple', 'banana', 'dog', 'cat', 'seoul', 'tokyo'],
    'II': [3, 4, 1, 2, 9, 5]
}
df = pd.DataFrame(data=d)
# max of II per Letter, keeping Letter as a column so it can serve as a merge key
df_max = df.groupby('Letter', as_index=False)['II'].max()
# merge the "Word" column back into df_max; joining on both keys avoids
# matching an equal II value that belongs to a different Letter
df_max = df_max.merge(df, how='left', on=['Letter', 'II'])
You could then reorder the columns if you need them to be in a specific order.

Related

Extract all rows of an specific columns with respect to another column in pandas DataFrame() [duplicate]

I have a pandas data frame df like:
a b
A 1
A 2
B 5
B 5
B 4
C 6
I want to group by the first column and get second column as lists in rows:
A [1,2]
B [5,5,4]
C [6]
Is it possible to do something like this using pandas groupby?
You can do this using groupby to group on the column of interest and then apply list to every group:
In [1]: df = pd.DataFrame( {'a':['A','A','B','B','B','C'], 'b':[1,2,5,5,4,6]})
df
Out[1]:
a b
0 A 1
1 A 2
2 B 5
3 B 5
4 B 4
5 C 6
In [2]: df.groupby('a')['b'].apply(list)
Out[2]:
a
A [1, 2]
B [5, 5, 4]
C [6]
Name: b, dtype: object
In [3]: df1 = df.groupby('a')['b'].apply(list).reset_index(name='new')
df1
Out[3]:
a new
0 A [1, 2]
1 B [5, 5, 4]
2 C [6]
A handy way to achieve this would be:
df.groupby('a').agg({'b':lambda x: list(x)})
Look into writing Custom Aggregations: https://www.kaggle.com/akshaysehgal/how-to-group-by-aggregate-using-py
If performance is important go down to numpy level:
import numpy as np
df = pd.DataFrame({'a': np.random.randint(0, 60, 600), 'b': [1, 2, 5, 5, 4, 6]*100})
def f(df):
    keys, values = df.sort_values('a').values.T
    ukeys, index = np.unique(keys, True)
    arrays = np.split(values, index[1:])
    df2 = pd.DataFrame({'a':ukeys, 'b':[list(a) for a in arrays]})
    return df2
Tests:
In [301]: %timeit f(df)
1000 loops, best of 3: 1.64 ms per loop
In [302]: %timeit df.groupby('a')['b'].apply(list)
100 loops, best of 3: 5.26 ms per loop
To solve this for several columns of a dataframe:
In [5]: df = pd.DataFrame( {'a':['A','A','B','B','B','C'], 'b':[1,2,5,5,4,6],'c'
...: :[3,3,3,4,4,4]})
In [6]: df
Out[6]:
a b c
0 A 1 3
1 A 2 3
2 B 5 3
3 B 5 4
4 B 4 4
5 C 6 4
In [7]: df.groupby('a').agg(lambda x: list(x))
Out[7]:
b c
a
A [1, 2] [3, 3]
B [5, 5, 4] [3, 4, 4]
C [6] [4]
This answer was inspired from Anamika Modi's answer. Thank you!
Use any of the following groupby and agg recipes.
# Setup
df = pd.DataFrame({
'a': ['A', 'A', 'B', 'B', 'B', 'C'],
'b': [1, 2, 5, 5, 4, 6],
'c': ['x', 'y', 'z', 'x', 'y', 'z']
})
df
a b c
0 A 1 x
1 A 2 y
2 B 5 z
3 B 5 x
4 B 4 y
5 C 6 z
To aggregate multiple columns as lists, use any of the following:
df.groupby('a').agg(list)
df.groupby('a').agg(pd.Series.tolist)
b c
a
A [1, 2] [x, y]
B [5, 5, 4] [z, x, y]
C [6] [z]
To group-listify a single column only, convert the groupby to a SeriesGroupBy object, then call SeriesGroupBy.agg. Use,
df.groupby('a').agg({'b': list}) # 4.42 ms
df.groupby('a')['b'].agg(list) # 2.76 ms - faster
a
A [1, 2]
B [5, 5, 4]
C [6]
Name: b, dtype: object
As you were saying the groupby method of a pd.DataFrame object can do the job.
Example
L = ['A','A','B','B','B','C']
N = [1,2,5,5,4,6]
import pandas as pd
df = pd.DataFrame(zip(L,N),columns = list('LN'))
groups = df.groupby(df.L)
groups.groups
{'A': [0, 1], 'B': [2, 3, 4], 'C': [5]}
which gives an index-wise description of the groups.
To get elements of single groups, you can do, for instance
groups.get_group('A')
L N
0 A 1
1 A 2
groups.get_group('B')
L N
2 B 5
3 B 5
4 B 4
It is time to use agg instead of apply.
Given
df = pd.DataFrame( {'a':['A','A','B','B','B','C'], 'b':[1,2,5,5,4,6], 'c': [1,2,5,5,4,6]})
If you want multiple columns stacked into lists, the result is a pd.DataFrame:
df.groupby('a')[['b', 'c']].agg(list)
# or
df.groupby('a').agg(list)
If you want a single column as lists, the result is a pd.Series:
df.groupby('a')['b'].agg(list)
# or
df.groupby('a')['b'].apply(list)
Note: producing a pd.DataFrame is about 10x slower than producing a pd.Series when you aggregate only a single column, so use the DataFrame form only in the multi-column case.
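To make the Series versus DataFrame distinction concrete, here is a minimal sketch that runs both forms on the same frame and checks that they hold the same lists. It does not measure speed; the variable names as_series and as_frame are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({'a': ['A', 'A', 'B', 'B', 'B', 'C'],
                   'b': [1, 2, 5, 5, 4, 6],
                   'c': [1, 2, 5, 5, 4, 6]})

# Selecting one column with ['b'] gives a SeriesGroupBy, so agg returns a Series
as_series = df.groupby('a')['b'].agg(list)

# Selecting with [['b']] keeps a DataFrameGroupBy, so agg returns a DataFrame
as_frame = df.groupby('a')[['b']].agg(list)

print(type(as_series).__name__)
print(type(as_frame).__name__)
print(as_series.tolist() == as_frame['b'].tolist())
```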
Just a supplement: pandas.pivot_table is more general and seems more convenient:
"""data"""
df = pd.DataFrame( {'a':['A','A','B','B','B','C'],
'b':[1,2,5,5,4,6],
'c':[1,2,1,1,1,6]})
print(df)
a b c
0 A 1 1
1 A 2 2
2 B 5 1
3 B 5 1
4 B 4 1
5 C 6 6
"""pivot_table"""
pt = pd.pivot_table(df,
values=['b', 'c'],
index='a',
aggfunc={'b': list,
'c': set})
print(pt)
b c
a
A [1, 2] {1, 2}
B [5, 5, 4] {1}
C [6] {6}
If looking for a unique list while grouping multiple columns this could probably help:
df.groupby('a').agg(lambda x: list(set(x))).reset_index()
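One caveat with list(set(x)) is that a set does not preserve the order in which values appeared. If that order matters, a possible variant (a sketch, not part of the answer above) is to deduplicate with dict.fromkeys, which keeps first-seen order:

```python
import pandas as pd

df = pd.DataFrame({'a': ['A', 'A', 'B', 'B', 'B', 'C'],
                   'b': [1, 2, 5, 5, 4, 6],
                   'c': [1, 2, 1, 1, 1, 6]})

# dict.fromkeys drops duplicates while keeping the order of first appearance
unique_lists = (df.groupby('a')
                  .agg(lambda x: list(dict.fromkeys(x)))
                  .reset_index())
print(unique_lists)
```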
Building upon @B.M's answer, here is a more general version, updated to work with newer library versions (numpy 1.19.2, pandas 1.2.1).
This solution can also deal with multi-indices.
However, it is not heavily tested, so use with caution.
If performance is important go down to numpy level:
import pandas as pd
import numpy as np
np.random.seed(0)
df = pd.DataFrame({'a': np.random.randint(0, 10, 90), 'b': [1,2,3]*30, 'c':list('abcefghij')*10, 'd': list('hij')*30})
def f_multi(df,col_names):
    if not isinstance(col_names,list):
        col_names = [col_names]
    values = df.sort_values(col_names).values.T
    col_idcs = [df.columns.get_loc(cn) for cn in col_names]
    other_col_names = [name for idx, name in enumerate(df.columns) if idx not in col_idcs]
    other_col_idcs = [df.columns.get_loc(cn) for cn in other_col_names]
    # split df into indexing columns (=keys) and data columns (=vals)
    keys = values[col_idcs,:]
    vals = values[other_col_idcs,:]
    # list of tuples of key pairs
    multikeys = list(zip(*keys))
    # remember unique key pairs and their indices
    ukeys, index = np.unique(multikeys, return_index=True, axis=0)
    # split data columns according to those indices
    arrays = np.split(vals, index[1:], axis=1)
    # resulting list of subarrays has same number of subarrays as unique key pairs
    # each subarray has the following shape:
    #    rows = number of non-grouped data columns
    #    cols = number of data points grouped into that unique key pair
    # prepare multi index
    idx = pd.MultiIndex.from_arrays(ukeys.T, names=col_names)
    list_agg_vals = dict()
    for tup in zip(*arrays, other_col_names):
        col_vals = tup[:-1] # first entries are the subarrays from above
        col_name = tup[-1]  # last entry is data-column name
        list_agg_vals[col_name] = col_vals
    df2 = pd.DataFrame(data=list_agg_vals, index=idx)
    return df2
Tests:
In [227]: %timeit f_multi(df, ['a','d'])
2.54 ms ± 64.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [228]: %timeit df.groupby(['a','d']).agg(list)
4.56 ms ± 61.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Results:
for the random seed 0 one would get:
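The easiest way I have found to achieve the same thing, at least for one column, is similar to Anamika's answer but uses the tuple (named aggregation) syntax for the aggregate function. Note that 'unique' returns the distinct values of each group as an array rather than the full list.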
df.groupby('a').agg(b=('b','unique'), c=('c','unique'))
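The same tuple syntax also accepts a callable such as list, which keeps duplicates. The following sketch puts both side by side; the output column names b_list and b_unique are illustrative choices.

```python
import pandas as pd

df = pd.DataFrame({'a': ['A', 'A', 'B', 'B', 'B', 'C'],
                   'b': [1, 2, 5, 5, 4, 6]})

# Named aggregation: new_column=(source_column, function)
out = df.groupby('a').agg(
    b_list=('b', list),        # every value, duplicates included
    b_unique=('b', 'unique'),  # distinct values only, as an array
)
print(out)
```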
Let us use df.groupby with a dict comprehension and the Series constructor:
pd.Series({x : y.b.tolist() for x , y in df.groupby('a')})
Out[664]:
A [1, 2]
B [5, 5, 4]
C [6]
dtype: object
Here I have grouped elements with "|" as a separator
import pandas as pd
df = pd.read_csv('input.csv')
df
Out[1]:
Area Keywords
0 A 1
1 A 2
2 B 5
3 B 5
4 B 4
5 C 6
df.dropna(inplace = True)
df['Area']=df['Area'].apply(lambda x:x.lower().strip())
print(df.columns)
# str.join needs strings, so cast the values before joining them with "|"
df_op = df.groupby('Area').agg({"Keywords":lambda x : "|".join(x.astype(str))})
df_op.to_csv('output.csv')
Out[2]:
df_op
Area Keywords
A [1| 2]
B [5| 5| 4]
C [6]
Answer based on @EdChum's comment on his answer. The comment is:
groupby is notoriously slow and memory hungry, what you could do is sort by column A, then find the idxmin and idxmax (probably store this in a dict) and use this to slice your dataframe would be faster I think
Let's first create a DataFrame with 500k categories in the first column and 20 million rows in total, as mentioned in the question.
df = pd.DataFrame(columns=['a', 'b'])
df['a'] = (np.random.randint(low=0, high=500000, size=(20000000,))).astype(str)
df['b'] = list(range(20000000))
print(df.shape)
df.head()
# Sort data by first column
df.sort_values(by=['a'], ascending=True, inplace=True)
df.reset_index(drop=True, inplace=True)
# Create a temp column
df['temp_idx'] = list(range(df.shape[0]))
# Take all values of b in a separate list
all_values_b = list(df.b.values)
print(len(all_values_b))
# For each category in column a, find min and max indexes
gp_df = df.groupby(['a']).agg({'temp_idx': ['min', 'max']})
gp_df.reset_index(inplace=True)
gp_df.columns = ['a', 'temp_idx_min', 'temp_idx_max']
# Now create final list_b column, using min and max indexes for each category of a and filtering list of b.
gp_df['list_b'] = gp_df[['temp_idx_min', 'temp_idx_max']].apply(lambda x: all_values_b[x['temp_idx_min']:x['temp_idx_max']+1], axis=1)
print(gp_df.shape)
gp_df.head()
The above code takes about 2 minutes for 20 million rows and 500k categories in the first column.
Sorting takes O(n log n) time, which is the most time-consuming operation in the solutions suggested above.
For a simple solution (a single column), pd.Series.to_list works and can be considered more efficient, unless you are considering other frameworks.
e.g.
import pandas as pd
from string import ascii_lowercase
import random
def generate_string(case=4):
    return ''.join([random.choice(ascii_lowercase) for _ in range(case)])
df = pd.DataFrame({'num_val':[random.randint(0,100) for _ in range(20000000)],'string_val':[generate_string() for _ in range(20000000)]})
%timeit df.groupby('string_val').agg({'num_val':pd.Series.to_list})
For 20 million records it takes about 17.2 seconds, compared to apply(list), which takes about 19.2 seconds, and the lambda function, which takes about 20.6 seconds.
Just to add to the previous answers: in my case, I wanted the list along with other functions like min and max. The way to do that is:
df = pd.DataFrame({
'a':['A','A','B','B','B','C'],
'b':[1,2,5,5,4,6]
})
df=df.groupby('a').agg({
'b':['min', 'max',lambda x: list(x)]
})
#then flattening and renaming if necessary
df.columns = df.columns.to_flat_index()
df.rename(columns={('b', 'min'): 'b_min', ('b', 'max'): 'b_max', ('b', '<lambda_0>'): 'b_list'},inplace=True)
It's a bit old, but I was directed here. Is there any way to group by multiple different columns? For example, to go from this:
"column1", "column2", "column3"
"foo", "val1", 3
"foo", "val2", 0
"foo", "val2", 3
"bar", "other", 99
to this:
"column1", "column2", "column3"
"foo", "val1", [ 3 ]
"foo", "val2", [ 0, 3 ]
"bar", "other", [ 99 ]

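For the follow-up about grouping by several columns, one possible approach is to pass a list of column names to groupby and aggregate the remaining column with list. This is a sketch built on the sample rows from that follow-up; sort=False is an optional choice that keeps the groups in order of first appearance.

```python
import pandas as pd

df = pd.DataFrame({
    'column1': ['foo', 'foo', 'foo', 'bar'],
    'column2': ['val1', 'val2', 'val2', 'other'],
    'column3': [3, 0, 3, 99],
})

# Group on both key columns and collect column3 into a list per key pair
out = (df.groupby(['column1', 'column2'], sort=False)['column3']
         .agg(list)
         .reset_index())
print(out)
```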

How to select rows and columns that meet criteria from a list

Let's say I've got a pandas dataframe that looks like:
df1 = pd.DataFrame({"Item ID":["A", "B", "C", "D", "E"], "Value1":[1, 2, 3, 4, 0],
"Value2":[4, 5, 1, 8, 7], "Value3":[3, 8, 1, 2, 0],"Value4":[4, 5, 7, 9, 4]})
print(df1)
Item_ID Value1 Value2 Value3 Value4
0 A 1 4 3 4
1 B 2 5 8 5
2 C 3 1 1 7
3 D 4 8 2 9
4 E 0 7 0 4
Now I've got a second dataframe that looks like:
df2 = pd.DataFrame({"Item ID":["A", "C", "D"], "Value5":[4, 5, 7]})
print(df2)
Item_ID Value5
0 A 4
1 C 5
2 D 7
What I want to do is find where the Item IDs match between my two data frames, and then add the "Value5" values to the matching rows, but ONLY in columns Value1 and Value2 of df1 (these columns could change every iteration, so they need to be held in a variable).
My output should show:
4 added to Row A, columns "Value1" and "Value2"
5 added to Row C, columns "Value1" and "Value2"
7 added to Row D, columns "Value1" and "Value2"
Item_ID Value1 Value2 Value3 Value4
0 A 5 8 3 4
1 B 2 5 8 5
2 C 8 6 1 7
3 D 11 15 2 9
4 E 0 7 0 4
Of course my data is many thousand rows long. I can do it using a for loop, but this is taking way too long. I want to be able to vectorize this in some way. Any ideas?
This is what I ended up doing based on @sammywemmy's suggestions:
#Columns that should receive Value5, held in a variable so they can change
names = ['Value1', 'Value2']
#Left-merge df1 and df2 on 'Item ID' so every row of df1 is kept
merged = df1.merge(df2, on='Item ID', how='left')
#Rows without a match get NaN in Value5; treat that as adding 0
merged['Value5'] = merged['Value5'].fillna(0).astype(int)
for name in names:
    #using assign and **, we can bring in variable names with assign.
    #Then add our Value5 column
    merged = merged.assign(**{name : lambda x, name=name : x[name] + x['Value5']})
#Only keep all the columns before and including 'Value4'
df1 = merged.loc[:,:'Value4']
Try this:
#set 'Item ID' as the index
df1 = df1.set_index('Item ID')
df2 = df2.set_index('Item ID')
#create list of columns that you are interested in
list_of_cols = ['Value1','Value2']
#create two separate dataframes
#unselected will not contain the columns you want to add
unselected = df1.drop(list_of_cols,axis=1)
#this will contain the columns you wish to add
selected = df1.filter(list_of_cols)
#reindex df2 so it has the same indices as df1
#then convert to a series
#fill the null values with 0
A = df2.reindex(index=selected.index,fill_value=0).loc[:,'Value5']
#add the series A to selected
selected = selected.add(A,axis='index')
#combine selected and unselected into one dataframe
result = pd.concat([unselected,selected],axis=1)
#this part is optional: it restores the original column order
#the assumption here is that the columns are named Value1, Value2, ...
#so sorting by the last character gives 1 < 2 < 3
#if your columns are not actually named like that,
#a different sort key has to be used
#alternatively, before the calculations,
#you could create a mapping of the columns to numbers
#that gives you a sorting mechanism to
#restore your dataframe after the calculations are complete
columns = sorted(result.columns,key = lambda x : x[-1])
#reindex back to the way it was
result = result.reindex(columns,axis='columns')
print(result)
Value1 Value2 Value3 Value4
Item ID
A 5 8 3 4
B 2 5 8 5
C 8 6 1 7
D 11 15 2 9
E 0 7 0 4
Alternative solution, using python's built-in dictionaries:
#create dictionaries
dict1 = (df1
#create temporary column
#and set as index
.assign(temp=df1['Item ID'])
.set_index('temp')
.to_dict('index')
)
dict2 = (df2
.assign(temp=df2['Item ID'])
.set_index('temp')
.to_dict('index')
)
list_of_cols = ['Value1','Value2']
intersected_keys = dict1.keys() & dict2.keys()
key_value_pair = [(key,col) for key in intersected_keys
for col in list_of_cols ]
#check for keys that are in both dict1 and 2
#loop through dict 1 and add values from dict2
#can be optimized with a dict comprehension
#leaving as is for better clarity IMHO
for key, val in key_value_pair:
    dict1[key][val] = dict1[key][val] + dict2[key]['Value5']
#print(dict1)
{'A': {'Item ID': 'A', 'Value1': 5, 'Value2': 8, 'Value3': 3, 'Value4': 4},
'B': {'Item ID': 'B', 'Value1': 2, 'Value2': 5, 'Value3': 8, 'Value4': 5},
'C': {'Item ID': 'C', 'Value1': 8, 'Value2': 6, 'Value3': 1, 'Value4': 7},
'D': {'Item ID': 'D', 'Value1': 11, 'Value2': 15, 'Value3': 2, 'Value4': 9},
'E': {'Item ID': 'E', 'Value1': 0, 'Value2': 7, 'Value3': 0, 'Value4': 4}}
#create dataframe
pd.DataFrame.from_dict(dict1,orient='index').reset_index(drop=True)
Item ID Value1 Value2 Value3 Value4
0 A 5 8 3 4
1 B 2 5 8 5
2 C 8 6 1 7
3 D 11 15 2 9
4 E 0 7 0 4
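A shorter vectorized variant of the same idea is to look up Value5 for every row of df1 with Series.map and add it to the chosen columns in one step. This is a sketch under two assumptions: 'Item ID' is unique in df2, and IDs missing from df2 should add 0. The names cols, increment and result are illustrative.

```python
import pandas as pd

df1 = pd.DataFrame({"Item ID": ["A", "B", "C", "D", "E"],
                    "Value1": [1, 2, 3, 4, 0],
                    "Value2": [4, 5, 1, 8, 7],
                    "Value3": [3, 8, 1, 2, 0],
                    "Value4": [4, 5, 7, 9, 4]})
df2 = pd.DataFrame({"Item ID": ["A", "C", "D"], "Value5": [4, 5, 7]})

# Columns to update, held in a variable so they can change per iteration
cols = ["Value1", "Value2"]

# Value5 looked up per row of df1 by Item ID; unmatched IDs become 0
increment = (df1["Item ID"]
             .map(df2.set_index("Item ID")["Value5"])
             .fillna(0)
             .astype(int))

# Add the per-row increment to every selected column (aligned on the row index)
result = df1.copy()
result[cols] = result[cols].add(increment, axis=0)
print(result)
```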

Splitting dictionary/list into Separate Columns

I have a movie dataset saved for revenue prediction. However, the genres column of this dataset holds dictionaries, and some rows hold a list of two or more dictionaries. The DataFrame looks similar to this (it is not the actual data):
df = pd.DataFrame({'a':[1,2,3], 'b':[{'c':1}, [{'c':4},{'d':3}], [{'c':5, 'd':6},{'c':7, 'd':8}]]})
This is the output:
a b
0 1 {'c': 1}
1 2 [{'c': 4}, {'d': 3}]
2 3 [{'c': 5, 'd': 6}, {'c': 7, 'd': 8}]
I need to split this column into separate columns.
How can I do that? I used the apply(pd.Series) method, and this is the output I am getting:
0 1 c
0 NaN NaN 1.0
1 {'c': 4} {'d': 3} NaN
2 {'c': 5, 'd': 6} {'c': 5, 'd': 6} NaN
but I want like this if possible:
a c d
0 1 1 NaN
1 2 4 3
2 3 5,7 6,8
I do not know whether it is possible to achieve what you want with apply(pd.Series), because your 'b' column has mixed types: dictionaries and lists of dictionaries. Maybe it is; I am not sure.
However, this is how I would do it.
First, loop over your column to build a set with all the new column names, that is, the keys of the dictionaries.
Then you can use apply with a custom function to extract the value for each column.
Note that list-valued rows produce strings in the new columns, which is needed because you want to join cases like your row #2 with a comma.
import numpy as np

# collect every key that appears in any dictionary of column 'b'
newcols = set()
for el in df['b']:
    if isinstance(el, dict):
        newcols.update(el.keys())
    elif isinstance(el, list):
        for i in el:
            newcols.update(i.keys())

def extractvalues(x, col):
    # single dict: return the value, or NaN if the key is missing
    if isinstance(x['b'], dict):
        return x['b'].get(col, np.nan)
    # list of dicts: join the values for this key with commas
    elif isinstance(x['b'], list):
        return ','.join(str(i.get(col, '')) for i in x['b']).strip(',')

for nc in newcols:
    df[nc] = df.apply(lambda r: extractvalues(r, nc), axis=1)
df.drop('b', axis=1, inplace=True)
Your dataframe is now:
a c d
0 1 1 NaN
1 2 4 3
2 3 5,7 6,8

using pandas to combine rows based on value in column [duplicate]

I have a pandas data frame df like:
a b
A 1
A 2
B 5
B 5
B 4
C 6
I want to group by the first column and get second column as lists in rows:
A [1,2]
B [5,5,4]
C [6]
Is it possible to do something like this using pandas groupby?
You can do this using groupby to group on the column of interest and then apply list to every group:
In [1]: df = pd.DataFrame( {'a':['A','A','B','B','B','C'], 'b':[1,2,5,5,4,6]})
df
Out[1]:
a b
0 A 1
1 A 2
2 B 5
3 B 5
4 B 4
5 C 6
In [2]: df.groupby('a')['b'].apply(list)
Out[2]:
a
A [1, 2]
B [5, 5, 4]
C [6]
Name: b, dtype: object
In [3]: df1 = df.groupby('a')['b'].apply(list).reset_index(name='new')
df1
Out[3]:
a new
0 A [1, 2]
1 B [5, 5, 4]
2 C [6]
A handy way to achieve this would be:
df.groupby('a').agg({'b':lambda x: list(x)})
Look into writing Custom Aggregations: https://www.kaggle.com/akshaysehgal/how-to-group-by-aggregate-using-py
If performance is important go down to numpy level:
import numpy as np
df = pd.DataFrame({'a': np.random.randint(0, 60, 600), 'b': [1, 2, 5, 5, 4, 6]*100})
def f(df):
keys, values = df.sort_values('a').values.T
ukeys, index = np.unique(keys, True)
arrays = np.split(values, index[1:])
df2 = pd.DataFrame({'a':ukeys, 'b':[list(a) for a in arrays]})
return df2
Tests:
In [301]: %timeit f(df)
1000 loops, best of 3: 1.64 ms per loop
In [302]: %timeit df.groupby('a')['b'].apply(list)
100 loops, best of 3: 5.26 ms per loop
To solve this for several columns of a dataframe:
In [5]: df = pd.DataFrame( {'a':['A','A','B','B','B','C'], 'b':[1,2,5,5,4,6],'c'
...: :[3,3,3,4,4,4]})
In [6]: df
Out[6]:
a b c
0 A 1 3
1 A 2 3
2 B 5 3
3 B 5 4
4 B 4 4
5 C 6 4
In [7]: df.groupby('a').agg(lambda x: list(x))
Out[7]:
b c
a
A [1, 2] [3, 3]
B [5, 5, 4] [3, 4, 4]
C [6] [4]
This answer was inspired from Anamika Modi's answer. Thank you!
Use any of the following groupby and agg recipes.
# Setup
df = pd.DataFrame({
'a': ['A', 'A', 'B', 'B', 'B', 'C'],
'b': [1, 2, 5, 5, 4, 6],
'c': ['x', 'y', 'z', 'x', 'y', 'z']
})
df
a b c
0 A 1 x
1 A 2 y
2 B 5 z
3 B 5 x
4 B 4 y
5 C 6 z
To aggregate multiple columns as lists, use any of the following:
df.groupby('a').agg(list)
df.groupby('a').agg(pd.Series.tolist)
b c
a
A [1, 2] [x, y]
B [5, 5, 4] [z, x, y]
C [6] [z]
To group-listify a single column only, convert the groupby to a SeriesGroupBy object, then call SeriesGroupBy.agg. Use,
df.groupby('a').agg({'b': list}) # 4.42 ms
df.groupby('a')['b'].agg(list) # 2.76 ms - faster
a
A [1, 2]
B [5, 5, 4]
C [6]
Name: b, dtype: object
As you were saying the groupby method of a pd.DataFrame object can do the job.
Example
L = ['A','A','B','B','B','C']
N = [1,2,5,5,4,6]
import pandas as pd
df = pd.DataFrame(zip(L,N),columns = list('LN'))
groups = df.groupby(df.L)
groups.groups
{'A': [0, 1], 'B': [2, 3, 4], 'C': [5]}
which gives and index-wise description of the groups.
To get elements of single groups, you can do, for instance
groups.get_group('A')
L N
0 A 1
1 A 2
groups.get_group('B')
L N
2 B 5
3 B 5
4 B 4
It is time to use agg instead of apply .
When
df = pd.DataFrame( {'a':['A','A','B','B','B','C'], 'b':[1,2,5,5,4,6], 'c': [1,2,5,5,4,6]})
If you want multiple columns stack into list , result in pd.DataFrame
df.groupby('a')[['b', 'c']].agg(list)
# or
df.groupby('a').agg(list)
If you want single column in list, result in ps.Series
df.groupby('a')['b'].agg(list)
#or
df.groupby('a')['b'].apply(list)
Note, result in pd.DataFrame is about 10x slower than result in ps.Series when you only aggregate single column, use it in multicolumns case .
Just a suplement. pandas.pivot_table is much more universal and seems more convenient:
"""data"""
df = pd.DataFrame( {'a':['A','A','B','B','B','C'],
'b':[1,2,5,5,4,6],
'c':[1,2,1,1,1,6]})
print(df)
a b c
0 A 1 1
1 A 2 2
2 B 5 1
3 B 5 1
4 B 4 1
5 C 6 6
"""pivot_table"""
pt = pd.pivot_table(df,
values=['b', 'c'],
index='a',
aggfunc={'b': list,
'c': set})
print(pt)
b c
a
A [1, 2] {1, 2}
B [5, 5, 4] {1}
C [6] {6}
If looking for a unique list while grouping multiple columns this could probably help:
df.groupby('a').agg(lambda x: list(set(x))).reset_index()
Building upon @B.M's answer, here is a more general version, updated to work with newer library versions (numpy version 1.19.2, pandas version 1.2.1).
This solution can also deal with multi-indices. However, it is not heavily tested, so use it with caution.
If performance is important, go down to the numpy level:
import pandas as pd
import numpy as np
np.random.seed(0)
df = pd.DataFrame({'a': np.random.randint(0, 10, 90), 'b': [1,2,3]*30, 'c':list('abcefghij')*10, 'd': list('hij')*30})
def f_multi(df,col_names):
    if not isinstance(col_names,list):
        col_names = [col_names]
    values = df.sort_values(col_names).values.T
    col_idcs = [df.columns.get_loc(cn) for cn in col_names]
    other_col_names = [name for idx, name in enumerate(df.columns) if idx not in col_idcs]
    other_col_idcs = [df.columns.get_loc(cn) for cn in other_col_names]
    # split df into indexing columns (=keys) and data columns (=vals)
    keys = values[col_idcs,:]
    vals = values[other_col_idcs,:]
    # list of tuples of key pairs
    multikeys = list(zip(*keys))
    # remember unique key pairs and their indices
    ukeys, index = np.unique(multikeys, return_index=True, axis=0)
    # split data columns according to those indices
    arrays = np.split(vals, index[1:], axis=1)
    # resulting list of subarrays has same number of subarrays as unique key pairs
    # each subarray has the following shape:
    # rows = number of non-grouped data columns
    # cols = number of data points grouped into that unique key pair
    # prepare multi index
    idx = pd.MultiIndex.from_arrays(ukeys.T, names=col_names)
    list_agg_vals = dict()
    for tup in zip(*arrays, other_col_names):
        col_vals = tup[:-1] # first entries are the subarrays from above
        col_name = tup[-1] # last entry is data-column name
        list_agg_vals[col_name] = col_vals
    df2 = pd.DataFrame(data=list_agg_vals, index=idx)
    return df2
Tests:
In [227]: %timeit f_multi(df, ['a','d'])
2.54 ms ± 64.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [228]: %timeit df.groupby(['a','d']).agg(list)
4.56 ms ± 61.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Results:
for the random seed 0 one would get:
The easiest way I have found to achieve the same thing, at least for one column, is similar to Anamika's answer but uses the tuple syntax for the aggregate function.
df.groupby('a').agg(b=('b','unique'), c=('c','unique'))
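Note that 'unique' drops duplicates and returns NumPy arrays rather than Python lists. If plain lists are needed, one possible follow-up step (a sketch; the column c and the name out_lists are assumptions for illustration) is to map tolist over each cell:

```python
import pandas as pd

df = pd.DataFrame({'a': ['A', 'A', 'B', 'B', 'B', 'C'],
                   'b': [1, 2, 5, 5, 4, 6],
                   'c': [1, 2, 1, 1, 1, 6]})

out = df.groupby('a').agg(b=('b', 'unique'), c=('c', 'unique'))

# Each cell holds a NumPy array; convert every cell to a plain list.
out_lists = out.apply(lambda col: col.map(lambda arr: arr.tolist()))
print(out_lists.loc['B', 'b'])  # [5, 4]
```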
Let us use df.groupby with list and the Series constructor
pd.Series({x : y.b.tolist() for x , y in df.groupby('a')})
Out[664]:
A [1, 2]
B [5, 5, 4]
C [6]
dtype: object
Here I have grouped elements with "|" as a separator
import pandas as pd
df = pd.read_csv('input.csv')
df
Out[1]:
Area Keywords
0 A 1
1 A 2
2 B 5
3 B 5
4 B 4
5 C 6
df.dropna(inplace = True)
df['Area']=df['Area'].apply(lambda x:x.lower().strip())
print(df.columns)
# cast to str so that join also works when Keywords holds numbers
df_op = df.groupby('Area').agg({"Keywords":lambda x : "|".join(x.astype(str))})
df_op.to_csv('output.csv')
Out[2]:
df_op
Area Keywords
A [1| 2]
B [5| 5| 4]
C [6]
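A self-contained sketch of the same separator idea, with the CSV replaced by an inline frame (the data is assumed from the table shown above). Since str.join only accepts strings, the numeric column is cast first:

```python
import pandas as pd

df = pd.DataFrame({'Area': ['A', 'A', 'B', 'B', 'B', 'C'],
                   'Keywords': [1, 2, 5, 5, 4, 6]})

# str.join needs strings, so cast the numeric column before joining.
joined = df.groupby('Area')['Keywords'].agg(lambda s: '|'.join(s.astype(str)))
print(joined.to_dict())  # {'A': '1|2', 'B': '5|5|4', 'C': '6'}
```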
This answer is based on @EdChum's comment on his answer. The comment is:
groupby is notoriously slow and memory hungry, what you could do is sort by column A, then find the idxmin and idxmax (probably store this in a dict) and use this to slice your dataframe would be faster I think
Let's first create a dataframe with 500k categories in the first column and 20 million rows in total, as mentioned in the question.
df = pd.DataFrame(columns=['a', 'b'])
df['a'] = (np.random.randint(low=0, high=500000, size=(20000000,))).astype(str)
df['b'] = list(range(20000000))
print(df.shape)
df.head()
# Sort data by first column
df.sort_values(by=['a'], ascending=True, inplace=True)
df.reset_index(drop=True, inplace=True)
# Create a temp column
df['temp_idx'] = list(range(df.shape[0]))
# Take all values of b in a separate list
all_values_b = list(df.b.values)
print(len(all_values_b))
# For each category in column a, find min and max indexes
gp_df = df.groupby(['a']).agg({'temp_idx': [np.min, np.max]})
gp_df.reset_index(inplace=True)
gp_df.columns = ['a', 'temp_idx_min', 'temp_idx_max']
# Now create final list_b column, using min and max indexes for each category of a and filtering list of b.
gp_df['list_b'] = gp_df[['temp_idx_min', 'temp_idx_max']].apply(lambda x: all_values_b[x[0]:x[1]+1], axis=1)
print(gp_df.shape)
gp_df.head()
The above code takes about 2 minutes for 20 million rows and 500k categories in the first column.
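The sort-and-slice idea above can also be sketched on a small frame without the temporary index column. This is an illustrative variant under the assumption that the data fits in memory; it finds the group boundaries with NumPy instead of a second groupby:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': ['B', 'A', 'B', 'C', 'A', 'B'],
                   'b': [5, 1, 5, 6, 2, 4]})

# A stable sort keeps the original order of b within each category.
sorted_df = df.sort_values('a', kind='mergesort')
keys = sorted_df['a'].to_numpy()
vals = sorted_df['b'].to_numpy()

# Positions where the key changes mark the start of each group.
starts = np.flatnonzero(np.r_[True, keys[1:] != keys[:-1]])

# Slice the value array at those boundaries.
chunks = np.split(vals, starts[1:])

result = pd.Series([chunk.tolist() for chunk in chunks], index=keys[starts])
print(result.to_dict())  # {'A': [1, 2], 'B': [5, 5, 4], 'C': [6]}
```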
Sorting takes O(n log n) time, which makes it the most time-consuming operation in the solutions suggested above.
For a simple case (a single column), pd.Series.to_list works and can be considered more efficient, unless you are considering other frameworks.
For example:
import pandas as pd
from string import ascii_lowercase
import random
def generate_string(case=4):
    return ''.join([random.choice(ascii_lowercase) for _ in range(case)])
df = pd.DataFrame({'num_val':[random.randint(0,100) for _ in range(20000000)],'string_val':[generate_string() for _ in range(20000000)]})
%timeit df.groupby('string_val').agg({'num_val':pd.Series.to_list})
For 20 million records it takes about 17.2 seconds, compared to apply(list), which takes about 19.2 seconds, and a lambda function, which takes about 20.6 seconds.
Just to add to the previous answers: in my case, I wanted the list along with other functions like min and max. The way to do that is:
df = pd.DataFrame({
'a':['A','A','B','B','B','C'],
'b':[1,2,5,5,4,6]
})
df=df.groupby('a').agg({
'b':['min', 'max',lambda x: list(x)]
})
#then flattening and renaming if necessary
df.columns = df.columns.to_flat_index()
df.rename(columns={('b', 'min'): 'b_min', ('b', 'max'): 'b_max', ('b', '<lambda_0>'): 'b_list'},inplace=True)
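An alternative that may avoid the flatten-and-rename step is named aggregation (available since pandas 0.25), where each output column is named up front. A minimal sketch, with the output names b_min, b_max and b_list carried over from the rename above:

```python
import pandas as pd

df = pd.DataFrame({'a': ['A', 'A', 'B', 'B', 'B', 'C'],
                   'b': [1, 2, 5, 5, 4, 6]})

# Each keyword is an output column name; the tuple is (input column, function).
summary = df.groupby('a').agg(
    b_min=('b', 'min'),
    b_max=('b', 'max'),
    b_list=('b', list),
)
print(summary['b_list'].tolist())  # [[1, 2], [5, 5, 4], [6]]
```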
It's a bit old, but I was directed here. Is there any way to group by multiple different columns?
"column1", "column2", "column3"
"foo", "val1", 3
"foo", "val2", 0
"foo", "val2", 3
"bar", "other", 99
to this:
"column1", "column2", "column3"
"foo", "val1", [ 3 ]
"foo", "val2", [ 0, 3 ]
"bar", "other", [ 99 ]
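One way this could be done, sketched under the assumption that the data is already in a DataFrame with those three column names, is to pass a list of keys to groupby and listify the remaining column:

```python
import pandas as pd

df = pd.DataFrame({
    'column1': ['foo', 'foo', 'foo', 'bar'],
    'column2': ['val1', 'val2', 'val2', 'other'],
    'column3': [3, 0, 3, 99],
})

# sort=False keeps the groups in order of first appearance.
out = (df.groupby(['column1', 'column2'], sort=False)['column3']
         .agg(list)
         .reset_index())
print(out)
```

reset_index turns the two grouping keys back into ordinary columns, which gives the three-column layout shown above.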
