Filter dataframe by minimum number of values in groups - python-3.x

I have the following dataframe structure:
#----------------------------------------------------------#
import numpy as np
import pandas as pd

# Generate dataframe mock example.
# define categorical column.
grps = pd.DataFrame(['a', 'a', 'a', 'b', 'b', 'b'])
# generate dataframe 1.
df1 = pd.DataFrame([[3, 4, 6, 8, 10, 4],
                    [5, 7, 2, 8, 9, 6],
                    [5, 3, 4, 8, 4, 6]]).transpose()
# introduce nan into dataframe 1.
for col in df1.columns:
    df1.loc[df1.sample(frac=0.1).index, col] = np.nan
# generate dataframe 2.
df2 = pd.DataFrame([[3, 4, 6, 8, 10, 4],
                    [5, 7, 2, 8, 9, 6],
                    [5, 3, 4, 8, 4, 6]]).transpose()
# concatenate categorical column and dataframes.
df = pd.concat([grps, df1, df2], axis = 1)
# Assign column headers.
df.columns = ['Groups', 1, 2, 3, 4, 5, 6]
# Set index as group column.
df = df.set_index('Groups')
# Generate stacked dataframe structure.
test_stack_df = df.stack(dropna = False).reset_index()
# Change column names.
test_stack_df = test_stack_df.rename(columns = {'level_1': 'IDs',
                                                0: 'Values'})
#----------------------------------------------------------#
Original dataframe - 'df' before stacking:
Groups 1 2 3 4 5 6
a 3 5 5 3 5 5
a nan nan 3 4 7 3
a 6 2 nan 6 2 4
b 8 8 8 8 8 8
b 10 9 4 10 9 4
b 4 6 6 4 6 6
I would like to filter the columns so that only those with at least 3 valid (non-NaN) values in each group ('a' and 'b') are kept. The final output should contain only columns 4, 5 and 6.
I am currently using the following method:
# Function to define boolean series.
def filter_vals(test_stack_df, orig_df):
    # Reset index.
    df_idx_reset = orig_df.reset_index()
    # Generate list with size of each 'Group'.
    grp_num = df_idx_reset['Groups'].value_counts().to_list()
    # Data series for each 'Group'.
    expt_class_1 = test_stack_df.head(grp_num[0])
    expt_class_2 = test_stack_df.tail(grp_num[1])
    # Check if both 'Groups' contain at least 3 values per 'ID'.
    # Parentheses are needed because '&' binds more tightly than '>='.
    valid_IDs = (len(expt_class_1['Values'].value_counts()) >= 3) & \
                (len(expt_class_2['Values'].value_counts()) >= 3)
    # Return 'true' or 'false'
    return valid_IDs
# Apply function to dataframe to generate boolean series.
bool_series = test_stack_df.groupby('IDs').apply(filter_vals, df)
# Transpose original dataframe.
df_T = df.transpose()
# Filter by boolean series & transpose again.
df_filtered = df_T[bool_series].transpose()
I could achieve this with minimal fuss by applying the pandas.DataFrame.dropna() method with a threshold (thresh) of 6. However, this won't account for different-sized groups or allow me to specify the minimum number of values per group, which the current code does.
For larger dataframes (e.g. 4000+ columns), the code is a little slow: it takes ~20 secs to complete the filtering process. I have tried alternative methods that access the original dataframe directly using groupby & transform but can't get anything to work.
Is there a simpler and faster method? Thanks for your time!
EDIT: 03/05/2020 (15:58) - just spotted something that wasn't clear in the function above. Still works but have clarified variable names. Sorry for the confusion!

This will do the trick for you:
df.notna().groupby(level='Groups').sum().ge(3).all(axis=0)
Outputs:
1 False
2 False
3 False
4 True
5 True
6 True
dtype: bool
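To go from that boolean Series to the filtered frame, one option is to use it as a column mask with .loc. The following is a minimal sketch: it rebuilds the example with the NaN positions fixed as in the printed table (the original sampling is random), and the name min_valid is an assumption introduced here for illustration.

```python
import numpy as np
import pandas as pd

# Example frame with NaN positions copied from the printed table above.
df = pd.DataFrame(
    {
        1: [3, np.nan, 6, 8, 10, 4],
        2: [5, np.nan, 2, 8, 9, 6],
        3: [5, 3, np.nan, 8, 4, 6],
        4: [3, 4, 6, 8, 10, 4],
        5: [5, 7, 2, 8, 9, 6],
        6: [5, 3, 4, 8, 4, 6],
    },
    index=pd.Index(list("aaabbb"), name="Groups"),
)

min_valid = 3  # hypothetical parameter: minimum non-NaN values per group

# Count non-NaN values per group and per column.
counts = df.notna().groupby(level="Groups").sum()
# A column is kept only if every group reaches the minimum.
mask = counts.ge(min_valid).all(axis=0)
# Select the qualifying columns from the original frame.
df_filtered = df.loc[:, mask]
print(df_filtered)
```

Because the counts are computed per group, this works for groups of different sizes, and min_valid can be changed without touching the rest.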

Related

Compare value in a dataframe to multiple columns of another dataframe to get a list of lists where entries match in an efficient way

I have two pandas dataframes and I want to find all entries of the second dataframe where a specific value occurs.
As an example:
df1:
NID
0 1
1 2
2 3
3 4
4 5
df2:
EID N1 N2 N3 N4
0 1 1 2 13 12
1 2 2 3 14 13
2 3 3 4 15 14
3 4 4 5 16 15
4 5 5 6 17 16
5 6 6 7 18 17
6 7 7 8 19 18
7 8 8 9 20 19
8 9 9 10 21 20
9 10 10 11 22 21
What I basically want is a list of lists with the values of EID (from df2) where the values of NID (from df1) occur in any of the columns N1, N2, N3, N4:
Solution would be:
sol = [[1], [1, 2], [2, 3], [3, 4], [4, 5]]
The desired solution explained:
The solution has 5 entries (len(sol) == 5) since I have 5 entries in df1.
The first entry in sol is 1 because the value NID = 1 only appears in the columns N1,N2,N3,N4 for EID=1 in df2.
The second entry in sol refers to the value NID=2 (of df1) and has the length 2 because NID=2 can be found in column N1 (for EID=2) and in column N2 (for EID=1). Therefore, the second entry in the solution is [1,2] and so on.
What I tried so far is looping for each element in df1 and then looping for each element in df2 to see if NID is in any of the columns N1,N2,N3,N4. This solution works but for huge dataframes (each df can have up to some thousand entries) this solution becomes extremely time-consuming.
Therefore I was looking for a much more efficient solution.
My code as implemented:
Input data:
import pandas as pd
df1 = pd.DataFrame({'NID':[1,2,3,4,5]})
df2 = pd.DataFrame({'EID':[1,2,3,4,5,6,7,8,9,10],
'N1':[1,2,3,4,5,6,7,8,9,10],
'N2':[2,3,4,5,6,7,8,9,10,11],
'N3':[13,14,15,16,17,18,19,20,21,22],
'N4':[12,13,14,15,16,17,18,19,20,21]})
Solution acquired using looping:
sol = []
for idx, node in df1.iterrows():
    x = []
    for idx2, elem in df2.iterrows():
        if node['NID'] == elem['N1']:
            x.append(elem['EID'])
        if node['NID'] == elem['N2']:
            x.append(elem['EID'])
        if node['NID'] == elem['N3']:
            x.append(elem['EID'])
        if node['NID'] == elem['N4']:
            x.append(elem['EID'])
    sol.append(x)
print(sol)
If anyone has a solution where I do not have to loop, I would be very happy. Maybe using a numpy function or something like cKDTrees but unfortunately I have no idea on how to get this problem solved in a faster way.
Thank you in advance!
You can reshape with melt, filter with loc, and groupby.agg as list. Then reindex and convert tolist:
out = (df2
    .melt('EID')  # reshape to long form
    # filter the values that are in df1['NID']
    .loc[lambda d: d['value'].isin(df1['NID'])]
    # aggregate as list
    .groupby('value')['EID'].agg(list)
    # ensure all original NID are present in order
    # and convert to list
    .reindex(df1['NID']).tolist()
)
Alternative with stack:
df3 = df2.set_index('EID')
out = (df3
    .where(df3.isin(df1['NID'].tolist())).stack()
    .reset_index(name='group')
    .groupby('group')['EID'].agg(list)
    .reindex(df1['NID']).tolist()
)
Output:
[[1], [2, 1], [3, 2], [4, 3], [5, 4]]
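This output differs from the requested sol only in the order inside each inner list. As a hedged sketch, the melt approach can be wrapped so that each list is sorted and an NID that never occurs yields an empty list (after reindex such an entry would otherwise be NaN). The function name eids_per_nid is hypothetical and not part of the answer above.

```python
import pandas as pd

df1 = pd.DataFrame({'NID': [1, 2, 3, 4, 5]})
df2 = pd.DataFrame({'EID': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
                    'N1': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
                    'N2': [2, 3, 4, 5, 6, 7, 8, 9, 10, 11],
                    'N3': [13, 14, 15, 16, 17, 18, 19, 20, 21, 22],
                    'N4': [12, 13, 14, 15, 16, 17, 18, 19, 20, 21]})


def eids_per_nid(df1, df2):
    """Return, for each NID in df1, the sorted EIDs of df2 rows containing it."""
    grouped = (
        df2.melt('EID')                                   # long form: EID, variable, value
        .loc[lambda d: d['value'].isin(df1['NID'])]       # keep only values that are NIDs
        .groupby('value')['EID'].agg(list)                # one list of EIDs per value
        .reindex(df1['NID'])                              # df1 order; missing NIDs become NaN
    )
    # Sort each list; replace the NaN placeholder of unmatched NIDs with [].
    return [sorted(v) if isinstance(v, list) else [] for v in grouped]


sol = eids_per_nid(df1, df2)
print(sol)
```

Sorting is only needed if the order inside each list matters; dropping the sorted call keeps the order produced by melt.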

Filter simultaneously by different values of rows Pandas

I have a huge dataframe with product_id values and their property_id values. Note that each property of a product is stored in its own row. I need to filter simultaneously by several different property_id values for each product_id. Is there any way to do this fast?
out_df
product_id property_id
0 3588 1
1 3588 2
2 3588 5
3 3589 1
4 3589 3
5 3589 5
6 3590 1
7 3590 2
8 3590 5
For example, I want to filter each product_id by two properties that are assigned in different rows, something like out_df.loc[(out_df['property_id'] == 1) & (out_df['property_id'] == 2)], except that this condition can never be true for a single row.
I need something like that, but evaluated across all rows of each product_id at once.
I know that it can be done via groupby into lists
3587 [2, 1, 5]
3588 [1, 3, 5]
3590 [1, 2, 5]
and finding intersections inside lists.
gp_df.apply(lambda r: {1, 2} < (set(r['property_id'])), axis=1)
But it takes time, whereas common Pandas filtering is heavily optimized for speed (I believe it uses some clever forward and inverted indexes internally, as search engines like ElasticSearch, Sphinx etc. do).
Expected output: the products that have both properties 1 and 2.
3587 [2, 1, 5]
3590 [1, 2, 5]
Since this is just as much a performance as a functional question, I would go with an intersection approach like this:
from time import time

import pandas as pd

df = pd.DataFrame({'product_id': [3588, 3588, 3588, 3589, 3589, 3589, 3590, 3590, 3590],
                   'property_id': [1, 2, 5, 1, 3, 5, 1, 2, 5]})
df = df.set_index(['property_id'])
print("The full DataFrame:")
print(df)
start = time()
for i in range(1000):
    # look up the product_ids for each property via the index
    s1 = df.loc[(1), 'product_id']
    s2 = df.loc[(2), 'product_id']
    s_done = pd.Series(list(set(s1).intersection(set(s2))))
print("Overlapping product_id's")
print(time()-start)
Iterating the lookup 1000 times takes 0.93 seconds on my ThinkPad T450s. I took the liberty of testing @jezrael's two suggestions and they come in at 2.11 and 2.00 seconds. The groupby approach is more elegant from a software engineering point of view, though.
Depending on the size of your data set and the importance of performance, you can also switch to more simple datatypes, like classic dictionaries and gain further speed.
Jupyter Notebook can be found here: pandas_fast_lookup_using_intersection.ipynb
Do you mean something like this?
result = out_df.loc[out_df['property_id'].isin([1,2]), :]
If you want you can then drop duplicates based on product_id...
The simplest is to use GroupBy.transform and compare sets:
s = {1, 2}
a = df[df.groupby('product_id')['property_id'].transform(lambda r: s < set(r))]
print (a)
product_id property_id
0 3588 1
1 3588 2
2 3588 5
6 3590 1
7 3590 2
8 3590 5
Another solution is to keep only the rows whose values are in the set, removing duplicates first:
df1 = df[df['property_id'].isin(s) & ~df.duplicated(['product_id', 'property_id'])]
Then it is necessary to check that the length of each group is the same as the length of the set:
f, u = df1['product_id'].factorize()
ids = df1.loc[np.bincount(f)[f] == len(s), 'product_id'].unique()
Finally, filter all rows whose product_id meets the condition:
a = df[df['product_id'].isin(ids)]
print (a)
product_id property_id
0 3588 1
1 3588 2
2 3588 5
6 3590 1
7 3590 2
8 3590 5
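Note that s < set(r) is a proper-subset test, so a product whose properties are exactly {1, 2} would be dropped. As a hedged variant of the second approach, the sketch below counts distinct matching properties per product with nunique instead of factorize and bincount. The extra product 3591 and the names required, hits and ids are assumptions added here for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    # 3591 is a made-up product holding exactly the required properties.
    'product_id': [3588, 3588, 3588, 3589, 3589, 3589, 3590, 3590, 3590, 3591, 3591],
    'property_id': [1, 2, 5, 1, 3, 5, 1, 2, 5, 1, 2],
})

required = {1, 2}

# Keep only rows with a required property, then count the distinct
# required properties found for each product.
hits = (
    df[df['property_id'].isin(required)]
    .groupby('product_id')['property_id']
    .nunique()
)
# A product qualifies when every required property was found.
ids = hits[hits == len(required)].index
result = df[df['product_id'].isin(ids)]
print(result)
```

Using nunique makes the explicit duplicate removal unnecessary, because repeated (product_id, property_id) rows are counted once.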

Sum and collapse two rows in pandas if two values are equal (order does not matter)

I am analyzing a dataset that has an Origin ID (Column A), a Destination ID (Column B), and how many trips have happened between them (Column Count). Now I want to sum the A-B trips with the B-A trips. This sum is the total number of trips between A and B.
Here is what my data looks like (it is not necessarily ordered in the same way):
In [1]: group_station = pd.DataFrame([[1, 2, 100], [2, 1, 200], [4, 6, 5] , [6, 4, 10], [1, 4, 70]], columns=['A', 'B', 'Count'])
Out[2]:
A B Count
0 1 2 100
1 2 1 200
2 4 6 5
3 6 4 10
4 1 4 70
And I want the following output:
A B C
0 1 2 300
1 4 6 15
4 1 4 70
I have tried groupby and setting the index to both variables with no success. Right now I am doing a very inefficient double loop, that is too slow for the size of my dataset.
If it helps this is the code for the double loop (I removed some efficiency modifications to make it more clear):
# group_station is the dataframe, assumed here to be indexed by (A, B),
# so each item from iterrows() is ((A, B), row Series holding Count)
collapsed_group_station = np.zeros((len(group_station), 3))
for i, row in enumerate(group_station.iterrows()):
    start_id = row[0][0]
    end_id = row[0][1]
    count = row[1][0]
    for check_row in group_station.iterrows():
        check_start_id = check_row[0][0]
        check_end_id = check_row[0][1]
        check_count = check_row[1][0]
        if start_id == check_end_id and end_id == check_start_id:
            collapsed_group_station[i][0] = start_id
            collapsed_group_station[i][1] = end_id
            collapsed_group_station[i][2] = count + check_count
            break
I have ideas of how to make this code more efficient, but I wanted to know if there is a way of doing it without looping.
You can use np.sort with groupby.sum(). Sorting each (A, B) pair row-wise makes A-B and B-A trips share the same key:
import numpy as np; import pandas as pd
group_station[['A','B']]=np.sort(group_station[['A','B']],axis=1)
group_station.groupby(['A','B'],as_index=False).Count.sum()
Out[175]:
A B Count
0 1 2 300
1 1 4 70
2 4 6 15
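The answer above overwrites columns A and B in place. If the original orientation of each trip should be preserved, a possible sketch is to sort the pairs into a separate array and group on a copy; the names pairs and collapsed are assumptions introduced here.

```python
import numpy as np
import pandas as pd

group_station = pd.DataFrame(
    [[1, 2, 100], [2, 1, 200], [4, 6, 5], [6, 4, 10], [1, 4, 70]],
    columns=['A', 'B', 'Count'],
)

# Sort each (A, B) pair so that 1-2 and 2-1 become the same key (1, 2).
pairs = np.sort(group_station[['A', 'B']].to_numpy(), axis=1)

# assign() returns a new frame, so group_station itself is left untouched.
collapsed = (
    group_station.assign(A=pairs[:, 0], B=pairs[:, 1])
    .groupby(['A', 'B'], as_index=False)['Count']
    .sum()
)
print(collapsed)
```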

pandas: how to derive values for a new column based on another column

I have a dataframe with a column in which each value is a list. I want to derive a new column that only considers lists whose size is greater than 1 and assigns a unique integer to the corresponding row as an id.
A sample dataframe is like,
document_no_list cluster_id
[1,2,3] 1
[4,5,6,7] 2
[8] nan
[9,10] 3
column cluster_id only considers the 1st, 2nd and 4th row, each of which has a size greater than 1, and assigns a unique integer id to its corresponding cell in the column.
I am wondering how to do that in pandas.
We can use np.random.choice to draw unique random values and .loc for the assignment, i.e.
import numpy as np
import pandas as pd
df = pd.DataFrame({'document_no_list' :[[1,2,3],[4,5,6,7],[8],[9,10]]})
x = df['document_no_list'].apply(len) > 1
df.loc[x,'Cluster'] = np.random.choice(range(len(df)),x.sum(),replace=False)
Output :
document_no_list Cluster
0 [1, 2, 3] 2.0
1 [4, 5, 6, 7] 1.0
2 [8] NaN
3 [9, 10] 3.0
If you want continuous numbers then you can use
df.loc[x,'Cluster'] = np.arange(x.sum())+1
document_no_list Cluster
0 [1, 2, 3] 1.0
1 [4, 5, 6, 7] 2.0
2 [8] NaN
3 [9, 10] 3.0
Hope it helps
Create a boolean column based on condition and apply cumsum() on rows with 1's
df['cluster_id'] = df['document_no_list'].apply(lambda x: len(x)> 1).astype(int)
df.loc[df['cluster_id'] == 1, 'cluster_id'] = df.loc[df['cluster_id'] == 1, 'cluster_id'].cumsum()
document_no_list cluster_id
0 [1, 2, 3] 1
1 [4, 5, 6, 7] 2
2 [8] 0
3 [9, 10] 3
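Both answers leave either floats with NaN or a 0 in the rows that should have no id. If integer ids with a missing marker are preferred, one possible sketch combines the cumsum idea with pandas' nullable Int64 dtype; the variable name mask is an assumption introduced here.

```python
import pandas as pd

df = pd.DataFrame({'document_no_list': [[1, 2, 3], [4, 5, 6, 7], [8], [9, 10]]})

# True for rows whose list has more than one element.
mask = df['document_no_list'].map(len) > 1

# Running count of qualifying rows gives consecutive ids 1, 2, 3, ...
# where() blanks out the non-qualifying rows, and the nullable Int64
# dtype keeps the remaining ids as integers next to the missing values.
df['cluster_id'] = mask.cumsum().where(mask).astype('Int64')
print(df)
```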

Why is order of data items reversed while creating a pandas series?

I am new to python and pandas so please bear with me. I tried searching the answer everywhere but couldn't find it. Here's my question:
This is my input code:
list = [1, 2, 3, 1, 2, 3]
s = pd.Series([1, 2, 3, 10, 20, 30], list)
The output is:
1 1
2 2
3 3
1 10
2 20
3 30
dtype: int64
Now, my question is why the "list" is coming before the first list specified while creating the series? I tried running the same code multiple times to check if the series creation is orderless. Any help would be highly appreciated.
Python Version:
Python 3.6.0
Pandas Version:
'0.19.2'
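I think you omitted the index keyword: the second positional argument of pd.Series is index, which supplies the labels shown in the first column of the output. Written out explicitly, the Series construction is:
# don't use list as a variable name, because it shadows the Python built-in list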
L = [1, 2, 3, 1, 2, 3]
s = pd.Series(data=[1, 2, 3, 10, 20, 30], index=L)
print (s)
1 1
2 2
3 3
1 10
2 20
3 30
dtype: int64
You can also check Series documentation.
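As a small check of this explanation, the sketch below builds the Series both positionally and with keywords and inspects the labels and the values separately. The variable names labels, values and same are chosen here for illustration.

```python
import pandas as pd

labels = [1, 2, 3, 1, 2, 3]
values = [1, 2, 3, 10, 20, 30]

# Positional form from the question: first argument is data, second is index.
s = pd.Series(values, labels)
# Equivalent keyword form.
same = pd.Series(data=values, index=labels)

print(s.index.tolist())  # the labels, printed as the left-hand column
print(s.tolist())        # the data, printed as the right-hand column
print(s.equals(same))    # both constructions give the same Series
```

So nothing is reversed: the left-hand column of the printed Series is always the index, and the data appears to its right.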
