Python: How to use value_counts() inside .agg function in pandas? - python-3.x

Input dataframe df looks like:
item    row
Apple   12
Apple   12
Apple   13
Orange  13
Orange  14
Lemon   14
Output dataframe needs to be:
item    unique_row  nunique_row  count
Apple   {12,13}     2            {2,1}
Orange  {13,14}     2            {1,1}
Lemon   {14}        1            {1}
Tried code:
df.groupby('item', as_index=False)['row'].agg({'unique_row': lambda x: set(x),
                                               'nunique_row': lambda x: len(set(x))})
I am not sure how to add a condition inside .agg to generate the 'count' column; 'count' should hold the number of occurrences of each unique row value.
Any help will be appreciated. Thank You!

Solution
s = df.value_counts()
g = s.reset_index(name='count').groupby('item')
g.agg(list).join(g.size().rename('nunique_row'))
Working
Compute the group size per (item, row) pair using value_counts
Group the resulting counts by item
Aggregate with list to get the unique rows and their corresponding counts
Take the group size to get the number of unique rows (see the sketch below)
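A minimal sketch of the intermediate objects for the example input (assumes pandas >= 1.1 for DataFrame.value_counts; the tie order among the counts of 1 may differ):
import pandas as pd

# Example input reconstructed from the question
df = pd.DataFrame({'item': ['Apple', 'Apple', 'Apple', 'Orange', 'Orange', 'Lemon'],
                   'row':  [12, 12, 13, 13, 14, 14]})

s = df.value_counts()             # size of each (item, row) pair, e.g. (Apple, 12) -> 2
t = s.reset_index(name='count')   # columns: item, row, count -- one row per unique pair
g = t.groupby('item')
out = g.agg(list).join(g.size().rename('nunique_row'))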
Result
             row   count  nunique_row
item
Apple   [12, 13]  [2, 1]            2
Lemon       [14]     [1]            1
Orange  [13, 14]  [1, 1]            2

You need to convert to list or set:
(df.groupby('item', as_index=False)['row']
   .agg({'unique_row': lambda x: list(x.unique()),
         'nunique_row': lambda x: len(set(x)),
         'count': lambda x: list(x.value_counts(sort=False)),  # or set(x.value_counts())
         })
)
output:
     item unique_row  nunique_row   count
0   Apple   [12, 13]            2  [2, 1]
1   Lemon       [14]            1     [1]
2  Orange   [13, 14]            2  [1, 1]
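If your pandas version rejects the dict form above (renaming through a dict passed to .agg has been deprecated in newer releases), named aggregation expresses the same thing; a sketch, assuming pandas 0.25+:
out = (df.groupby('item', as_index=False)
         .agg(unique_row=('row', lambda x: list(x.unique())),
              nunique_row=('row', 'nunique'),
              count=('row', lambda x: list(x.value_counts(sort=False)))))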

Related

Compare value in a dataframe to multiple columns of another dataframe to get a list of lists where entries match in an efficient way

I have two pandas dataframes and I want to find all entries of the second dataframe where a specific value occurs.
As an example:
df1:
   NID
0    1
1    2
2    3
3    4
4    5
df2:
   EID  N1  N2  N3  N4
0    1   1   2  13  12
1    2   2   3  14  13
2    3   3   4  15  14
3    4   4   5  16  15
4    5   5   6  17  16
5    6   6   7  18  17
6    7   7   8  19  18
7    8   8   9  20  19
8    9   9  10  21  20
9   10  10  11  22  21
Now, what I basically want is a list of lists with the EID values (from df2) where the NID values (from df1) occur in any of the columns N1, N2, N3, N4.
The solution would be:
sol = [[1], [1, 2], [2, 3], [3, 4], [4, 5]]
The desired solution explained:
The solution has 5 entries (len(sol) == 5) since there are 5 entries in df1.
The first entry in sol is [1] because the value NID = 1 appears in the columns N1, N2, N3, N4 only for EID = 1 in df2.
The second entry in sol refers to the value NID = 2 (of df1) and has length 2 because NID = 2 can be found in column N1 (for EID = 2) and in column N2 (for EID = 1). Therefore, the second entry in the solution is [1, 2], and so on.
What I tried so far is looping over each element in df1 and then over each element in df2 to see if the NID is in any of the columns N1, N2, N3, N4. This works, but for huge dataframes (each df can have up to several thousand entries) it becomes extremely time-consuming.
Therefore I was looking for a much more efficient solution.
My code as implemented:
Input data:
import pandas as pd
df1 = pd.DataFrame({'NID':[1,2,3,4,5]})
df2 = pd.DataFrame({'EID': [1,2,3,4,5,6,7,8,9,10],
                    'N1': [1,2,3,4,5,6,7,8,9,10],
                    'N2': [2,3,4,5,6,7,8,9,10,11],
                    'N3': [13,14,15,16,17,18,19,20,21,22],
                    'N4': [12,13,14,15,16,17,18,19,20,21]})
Solution acquired using looping:
sol = []
for idx, node in df1.iterrows():
    x = []
    for idx2, elem in df2.iterrows():
        if node['NID'] == elem['N1']:
            x.append(elem['EID'])
        if node['NID'] == elem['N2']:
            x.append(elem['EID'])
        if node['NID'] == elem['N3']:
            x.append(elem['EID'])
        if node['NID'] == elem['N4']:
            x.append(elem['EID'])
    sol.append(x)
print(sol)
If anyone has a solution where I do not have to loop, I would be very happy. Maybe using a NumPy function or something like cKDTree, but unfortunately I have no idea how to solve this in a faster way.
Thank you in advance!
You can reshape with melt, filter with loc, and aggregate with groupby.agg(list). Then reindex and convert with tolist:
out = (df2
       .melt('EID')                                   # reshape to long form
       # filter the values that are in df1['NID']
       .loc[lambda d: d['value'].isin(df1['NID'])]
       # aggregate as list
       .groupby('value')['EID'].agg(list)
       # ensure all original NID are present in order
       # and convert to list
       .reindex(df1['NID']).tolist()
       )
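To make the reshape concrete, a short sketch of the intermediate frames for the example data (the variable names here are illustrative):
melted = df2.melt('EID')                              # long form: columns EID, variable, value
kept = melted.loc[melted['value'].isin(df1['NID'])]   # only the NID values 1..5 survive
# kept holds nine rows: column N1 contributes EIDs 1..5 for values 1..5,
# and N2 contributes EIDs 1..4 for values 2..5; N3 and N4 contain no NID values.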
Alternative with stack:
df3 = df2.set_index('EID')
out = (df3
       .where(df3.isin(df1['NID'].tolist())).stack()
       .reset_index(name='group')
       .groupby('group')['EID'].agg(list)
       .reindex(df1['NID']).tolist()
       )
Output:
[[1], [2, 1], [3, 2], [4, 3], [5, 4]]
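The sublists contain the right EIDs but in melt order rather than ascending order; if the exact ordering of the expected sol matters, a final pass restores it:
out = [sorted(lst) for lst in out]
# [[1], [1, 2], [2, 3], [3, 4], [4, 5]]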

Python: Fetch value from dataframe list with min value from another column list of same dataframe

Input dataframe:
Flow    row      count
Apple   [45,46]  [2,1]
Orange  [13,14]  [1,5]
I need to find the min value in each list of column 'count' and fetch the corresponding value from the 'row' column.
Expected output:
Flow    row  count
Apple   46   1
Orange  13   1
A possible solution (the part .astype('int') may be unnecessary in your case):
df['row'] = list(df.explode(['row', 'count']).reset_index().groupby('flow')
                 .apply(lambda x: x['row'][x['count'].astype('int').idxmin()]))
df['count'] = df['count'].map(min)
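To see what idxmin selects from, here is a sketch of the exploded intermediate for the example input (multi-column explode needs pandas >= 1.3; the lowercase names follow the answer's output):
import pandas as pd

df = pd.DataFrame({'flow': ['apple', 'orange'],
                   'row': [[45, 46], [13, 14]],
                   'count': [[2, 1], [1, 5]]})

exploded = df.explode(['row', 'count'])
#      flow row count
# 0   apple  45     2
# 0   apple  46     1
# 1  orange  13     1
# 1  orange  14     5
# Within each flow group, idxmin on 'count' returns the index label of the
# smallest count, and that label is then used to look up the matching 'row' value.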
A shorter solution than my previous one, based on sorted with key:
df.assign(row=df.apply(
              lambda x: sorted(x['row'], key=lambda z: x['count'][x['row'].index(z)])[0],
              axis=1),
          count=df['count'].map(min))
Output:
     flow  row  count
0   apple   46      1
1  orange   13      1
In Python, lists have an index method that returns the position of a given value. By combining it with min, you can get the desired result.
df['min_index'] = df['count'].apply(lambda x: x.index(min(x)))
df[['row_res','count_res']] = [[row[j],count[j]] for row, count, j in zip(df['row'], df['count'], df['min_index'])]
     Flow       row   count  min_index  row_res  count_res
0   Apple  [45, 46]  [2, 1]          1       46          1
1  Orange  [13, 14]  [1, 5]          0       13          1

Pandas convert column where every cell is list of strings to list of integers

I have a dataframe with a column that holds lists of numbers as strings:
C1  C2  l
1   3   ['5','9','1']
7   1   ['7','1','6']
What is the best way to convert it to lists of ints?
C1  C2  l
1   3   [5,9,1]
7   1   [7,1,6]
Thanks
You can try
df['l'] = df['l'].apply(lambda lst: list(map(int, lst)))
print(df)
   C1  C2          l
0   1   3  [5, 9, 1]
1   7   1  [7, 1, 6]
Pandas DataFrames are not designed to work with nested structures such as lists, so there is no vectorized method for this task.
You need to loop. The most efficient way is a list comprehension (apply would also work, but far less efficiently).
df['l'] = [[int(x) for x in l] for l in df['l']]
NB: there is no validation here; anything that cannot be converted to an integer will raise an error (a guarded sketch follows the output below).
Output:
   C1  C2          l
0   1   3  [5, 9, 1]
1   7   1  [7, 1, 6]
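If some entries might not be clean integer strings, a guarded variant (a sketch; the helper name is illustrative) converts what it can and leaves None for the rest:
def to_int_or_none(x):
    # Return int(x) when possible, otherwise None, so a single bad value
    # does not abort the whole conversion.
    try:
        return int(x)
    except (TypeError, ValueError):
        return None

df['l'] = [[to_int_or_none(x) for x in lst] for lst in df['l']]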

How to get duplicated values in a data frame when the column is a list?

Good morning!
I have a data frame with several columns. One of these columns, data, contains lists. Below I show a little example (id is just an example with random information):
df =
  id       data
0   a  [1, 2, 3]
1   h  [3, 2, 1]
2  bf  [1, 2, 3]
What I want is to get the rows with duplicated values in column data; in this example, I should get rows 0 and 2, because their values in column data are the same (the list [1, 2, 3]). However, this can't be achieved with df.duplicated(subset=['data']) because list is an unhashable type.
I know it could be done by taking two rows and comparing data directly, but my real data frame can have 1000 rows or more, so I can't compare them one by one.
I hope someone knows how to do this!
Thank you very much in advance!
IIUC, we can create a new DataFrame from df['data'] and then check with DataFrame.duplicated.
You can use:
m = pd.DataFrame(df['data'].tolist()).duplicated(keep=False)
df.loc[m]
   id       data
0   a  [1, 2, 3]
2  bf  [1, 2, 3]
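An alternative sketch that avoids building an intermediate DataFrame: convert each list to a tuple (which is hashable) and reuse duplicated directly, assuming the list elements are themselves hashable:
m = df['data'].apply(tuple).duplicated(keep=False)
df.loc[m]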
Expanding on Quang's comment:
Try
In [2]: elements = [(1,2,3), (3,2,1), (1,2,3)]
   ...: df = pd.DataFrame.from_records(elements)
   ...: df
Out[2]:
   0  1  2
0  1  2  3
1  3  2  1
2  1  2  3
In [3]: # Add a new column of tuples
   ...: df["new"] = df.apply(lambda x: tuple(x), axis=1)
   ...: df
Out[3]:
   0  1  2        new
0  1  2  3  (1, 2, 3)
1  3  2  1  (3, 2, 1)
2  1  2  3  (1, 2, 3)
In [4]: # Remove duplicate rows (keeping the first one)
   ...: df.drop_duplicates(subset="new", keep="first", inplace=True)
   ...: df
Out[4]:
   0  1  2        new
0  1  2  3  (1, 2, 3)
1  3  2  1  (3, 2, 1)
In [5]: # Remove the new column if not required
   ...: df.drop("new", axis=1, inplace=True)
   ...: df
Out[5]:
   0  1  2
0  1  2  3
1  3  2  1

pandas data frame: efficiently remove duplicates and keep records with largest int value

I have a data frame with two columns, NAME and VALUE, where NAME contains duplicates and VALUE contains ints. I would like to efficiently drop duplicate records of column NAME while keeping the record with the largest VALUE. I figured out how to do it with two steps, sort and drop duplicates, but I am new to pandas and am curious whether there is a more efficient way to achieve this with the query function?
import pandas
import io
import json
input = """
KEY VALUE
apple 0
apple 1
apple 2
bannana 0
bannana 1
bannana 2
pear 0
pear 1
pear 2
pear 3
orange 0
orange 1
orange 2
orange 3
orange 4
"""
df = pandas.read_csv(io.StringIO(input), delim_whitespace=True, header=0)
df[['KEY','VALUE']].sort_values(by=['VALUE']).drop_duplicates(subset='KEY', keep='last')
dicty = dict(zip(df['KEY'], df['VALUE']))
print(json.dumps(dicty, indent=4))
running this yields the expected output:
{
    "apple": 2,
    "bannana": 2,
    "pear": 3,
    "orange": 4
}
Is there a more efficient way to achieve this transformation with pandas?
df = pandas.read_csv(io.StringIO(input), delim_whitespace=True, header=0)
df.groupby('KEY')['VALUE'].max()
If you need the output as a dictionary, just add to_dict():
df.groupby('KEY')['VALUE'].max().to_dict()
Also you can try:
[*df.groupby('KEY',sort=False).last().to_dict().values()][0]
{'apple': 2, 'bannana': 2, 'pear': 3, 'orange': 4}
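For comparison, the sort-then-drop route from the question also works once its result is actually kept; a minimal sketch (whether it beats groupby depends on the data):
# Keep the largest VALUE per KEY: sort ascending, then drop earlier duplicates.
dedup = (df.sort_values('VALUE')
           .drop_duplicates(subset='KEY', keep='last'))
dicty = dict(zip(dedup['KEY'], dedup['VALUE']))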
