Combine MultiIndex dataframe with Int64Index dataframe - python-3.x

I have one dataframe whose index is a MultiIndex:
result =
MultiIndex([(1, 'HK_MN001'),
            (2, 'HK_MN001'),
            (3, 'HK_MN002'),
            (4, 'HK_MN003'),
            (5, 'HK_MN004'),
            (6, 'HK_MN005'),
            (7, 'HK_MN005'),
            (8, 'HK_MN005')],
           names=['ID1', 'ID2'])
Another dataframe, photo_df, has this index:
Int64Index([1, 2, 3, 4, 5, 6, 7, 8], dtype='int64', name='ID1')
I want to concatenate both dataframes, but it gives me an error.
Code:
result = pd.concat([result, photo_df], axis=1, sort=False)
Error:
NotImplementedError: Can only union MultiIndex with MultiIndex or Index of tuples, try mi.to_flat_index().union(other) instead.
Result dataframe: (not reproduced here)
The photo_df dataframe is:
PhotoID raw_photo
0 1 HK_MN001_DSC_2160_161014Ushio.JPG
1 2 HK_MN001_DSC_2308_161014Ushio.JPG
2 3 HK_MN002_DSC_2327_161014Ushio.JPG
3 4 HK_MN003_DSC_1474_181015Ushio.jpg
4 5 HK_MN004_DSC_1491_181015Ushio.jpg
5 6 HK_MN005_DSC_1506_181015Ushio.JPG
6 7 HK_MN005_DSC_1527_181015Ushio.JPG
7 8 HK_MN005_DSC_1528_181015Ushio.jpg
Required output dataframe (not reproduced here): if possible, drop the ID1 index level.

I think you need to create a MultiIndex in both DataFrames:
photo_df = photo_df.set_index('PhotoID', drop=False)
photo_df.columns = pd.MultiIndex.from_product([photo_df.columns, ['']])
print(photo_df)
PhotoID raw_photo
PhotoID
1 1 HK_MN001_DSC_2160_161014Ushio.JPG
2 2 HK_MN001_DSC_2308_161014Ushio.JPG
3 3 HK_MN002_DSC_2327_161014Ushio.JPG
4 4 HK_MN003_DSC_1474_181015Ushio.jpg
5 5 HK_MN004_DSC_1491_181015Ushio.jpg
6 6 HK_MN005_DSC_1506_181015Ushio.JPG
7 7 HK_MN005_DSC_1527_181015Ushio.JPG
8 8 HK_MN005_DSC_1528_181015Ushio.jpg
# after reset_index the second level, ID2, becomes a regular column
result = pd.concat([result.reset_index(level=1), photo_df], axis=1, sort=False)
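A hedged alternative sketch, starting from the original result and photo_df shown in the question (before the reindexing above): flatten the MultiIndex with reset_index and merge on ID1, which also makes it easy to drop ID1 afterwards.
# Alternative sketch: flatten the MultiIndex and merge on the ID1 key.
# Assumes result still has the (ID1, ID2) MultiIndex and photo_df still
# has PhotoID as a regular column, as shown in the question.
merged = (result.reset_index()
                .merge(photo_df, left_on='ID1', right_on='PhotoID', how='left')
                .drop(columns='ID1'))   # drop ID1 as requested
print(merged)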

Related

Compare value in a dataframe to multiple columns of another dataframe to get a list of lists where entries match in an efficient way

I have two pandas dataframes and I want to find all entries of the second dataframe where a specific value occurs.
As an example:
df1:
NID
0 1
1 2
2 3
3 4
4 5
df2:
EID N1 N2 N3 N4
0 1 1 2 13 12
1 2 2 3 14 13
2 3 3 4 15 14
3 4 4 5 16 15
4 5 5 6 17 16
5 6 6 7 18 17
6 7 7 8 19 18
7 8 8 9 20 19
8 9 9 10 21 20
9 10 10 11 22 21
Now, what I basically want is a list of lists with the EID values (from df2) where the NID values (from df1) occur in any of the columns N1, N2, N3, N4:
Solution would be:
sol = [[1], [1, 2], [2, 3], [3, 4], [4, 5]]
The desired solution explained:
The solution has 5 entries (len(sol) == 5) since I have 5 entries in df1.
The first entry in sol is [1] because the value NID = 1 only appears in the columns N1, N2, N3, N4 for EID = 1 in df2.
The second entry in sol refers to the value NID = 2 (of df1) and has length 2, because NID = 2 can be found in column N1 (for EID = 2) and in column N2 (for EID = 1). Therefore, the second entry in the solution is [1, 2], and so on.
What I tried so far is looping over each element in df1 and then looping over each element in df2 to see if NID is in any of the columns N1, N2, N3, N4. This solution works, but for huge dataframes (each df can have up to several thousand entries) it becomes extremely time-consuming.
Therefore I was looking for a much more efficient solution.
My code as implemented:
Input data:
import pandas as pd
df1 = pd.DataFrame({'NID': [1, 2, 3, 4, 5]})
df2 = pd.DataFrame({'EID': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
                    'N1': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
                    'N2': [2, 3, 4, 5, 6, 7, 8, 9, 10, 11],
                    'N3': [13, 14, 15, 16, 17, 18, 19, 20, 21, 22],
                    'N4': [12, 13, 14, 15, 16, 17, 18, 19, 20, 21]})
Solution acquired using looping:
sol = []
for idx, node in df1.iterrows():
    x = []
    for idx2, elem in df2.iterrows():
        if node['NID'] == elem['N1']:
            x.append(elem['EID'])
        if node['NID'] == elem['N2']:
            x.append(elem['EID'])
        if node['NID'] == elem['N3']:
            x.append(elem['EID'])
        if node['NID'] == elem['N4']:
            x.append(elem['EID'])
    sol.append(x)
print(sol)
If anyone has a solution where I do not have to loop, I would be very happy. Maybe using a NumPy function or something like cKDTree, but unfortunately I have no idea how to solve this problem in a faster way.
Thank you in advance!
You can reshape with melt, filter with loc, and aggregate as lists with groupby.agg, then reindex and convert with tolist:
out = (df2
       .melt('EID')                                 # reshape to long form
       # filter the values that are in df1['NID']
       .loc[lambda d: d['value'].isin(df1['NID'])]
       # aggregate as list
       .groupby('value')['EID'].agg(list)
       # ensure all original NID are present in order
       # and convert to list
       .reindex(df1['NID']).tolist()
)
Alternative with stack:
df3 = df2.set_index('EID')
out = (df3
       .where(df3.isin(df1['NID'].tolist())).stack()
       .reset_index(name='group')
       .groupby('group')['EID'].agg(list)
       .reindex(df1['NID']).tolist()
)
Output:
[[1], [2, 1], [3, 2], [4, 3], [5, 4]]
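Note that the sublists keep the order in which melt/stack encounter the matches. If each sublist should be sorted as in the question's expected output, the aggregation can sort each group; a small follow-up sketch based on the melt solution above:
out = (df2
       .melt('EID')
       .loc[lambda d: d['value'].isin(df1['NID'])]
       .groupby('value')['EID'].agg(lambda s: sorted(s))   # sort EIDs inside each group
       .reindex(df1['NID']).tolist()
)
print(out)   # [[1], [1, 2], [2, 3], [3, 4], [4, 5]]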

pandas expand dataframe column with tuples, into multiple columns and rows

I have a data frame where one column contains elements that are lists of several tuples. I want to turn each tuple into its own columns and create a new row for each tuple. This code shows what I mean and the solution I came up with:
import numpy as np
import pandas as pd
a = pd.DataFrame(data=[['a', 'b', [(1, 2, 3), (6, 7, 8)]],
                       ['c', 'd', [(10, 20, 30)]]], columns=['one', 'two', 'three'])
df2 = pd.DataFrame(columns=['one', 'two', 'A', 'B', 'C'])
print(a)
for index, item in a.iterrows():
    for xtup in item.three:
        temp = pd.Series(item)
        temp['A'] = xtup[0]
        temp['B'] = xtup[1]
        temp['C'] = xtup[2]
        temp = temp.drop('three')
        df2 = df2.append(temp)
print(df2)
The output is:
one two three
0 a b [(1, 2, 3), (6, 7, 8)]
1 c d [(10, 20, 30)]
one two A B C
0 a b 1 2 3
0 a b 6 7 8
1 c d 10 20 30
Unfortunately, my solution takes 2 hours to run on 55,000 rows! Is there a more efficient way to do this?
We can first explode the column, then expand each tuple into its own columns:
a = a.explode('three')
a = pd.concat([a, pd.DataFrame(a.pop('three').tolist(), index=a.index)], axis=1)
one two 0 1 2
0 a b 1 2 3
0 a b 6 7 8
1 c d 10 20 30
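If the expanded columns should carry the A/B/C names from the desired output, a small follow-up sketch is to rename them afterwards:
# Rename the numeric columns produced by the expansion to A, B, C.
a = a.rename(columns={0: 'A', 1: 'B', 2: 'C'})
print(a)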

Filter dataframe by minimum number of values in groups

I have the following dataframe structure:
#----------------------------------------------------------#
# Generate dataframe mock example.
import numpy as np
import pandas as pd
# define categorical column.
grps = pd.DataFrame(['a', 'a', 'a', 'b', 'b', 'b'])
# generate dataframe 1.
df1 = pd.DataFrame([[3, 4, 6, 8, 10, 4],
                    [5, 7, 2, 8, 9, 6],
                    [5, 3, 4, 8, 4, 6]]).transpose()
# introduce nan into dataframe 1.
for col in df1.columns:
    df1.loc[df1.sample(frac=0.1).index, col] = np.nan
# generate dataframe 2.
df2 = pd.DataFrame([[3, 4, 6, 8, 10, 4],
                    [5, 7, 2, 8, 9, 6],
                    [5, 3, 4, 8, 4, 6]]).transpose()
# concatenate categorical column and dataframes.
df = pd.concat([grps, df1, df2], axis=1)
# Assign column headers.
df.columns = ['Groups', 1, 2, 3, 4, 5, 6]
# Set index as group column.
df = df.set_index('Groups')
# Generate stacked dataframe structure.
test_stack_df = df.stack(dropna=False).reset_index()
# Change column names.
test_stack_df = test_stack_df.rename(columns={'level_1': 'IDs',
                                              0: 'Values'})
#----------------------------------------------------------#
Original dataframe - 'df' before stacking:
Groups 1 2 3 4 5 6
a 3 5 5 3 5 5
a nan nan 3 4 7 3
a 6 2 nan 6 2 4
b 8 8 8 8 8 8
b 10 9 4 10 9 4
b 4 6 6 4 6 6
I would like to filter the columns such that there are at least 3 valid (non-NaN) values in each group, 'a' and 'b'. The final output should contain only columns 4, 5, 6.
I am currently using the following method:
# Function to define boolean series.
def filter_vals(test_stack_df, orig_df):
    # Reset index.
    df_idx_reset = orig_df.reset_index()
    # Generate list with size of each 'Group'.
    grp_num = pd.value_counts(df_idx_reset['Groups']).to_list()
    # Data series for each 'Group'.
    expt_class_1 = test_stack_df.head(grp_num[0])
    expt_class_2 = test_stack_df.tail(grp_num[1])
    # Check if both 'Groups' contain at least 3 values per 'ID'.
    valid_IDs = len(expt_class_1['Values'].value_counts()) >= 3 & \
                len(expt_class_2['Values'].value_counts()) >= 3
    # Return 'true' or 'false'.
    return valid_IDs

# Apply function to dataframe to generate boolean series.
bool_series = test_stack_df.groupby('IDs').apply(filter_vals, df)
# Transpose original dataframe.
df_T = df.transpose()
# Filter by boolean series & transpose again.
df_filtered = df_T[bool_series].transpose()
I could achieve this with minimal fuss by applying the pandas.DataFrame.dropna() method with a threshold of 6. However, this won't account for different group sizes or allow me to specify the minimum number of values, which the current code does.
For larger dataframes, i.e. 4000+ columns, the code is a little slow, taking ~20 seconds to complete the filtering process. I have tried alternative methods that access the original dataframe directly using groupby & transform, but can't get anything to work.
Is there a simpler and faster method? Thanks for your time!
EDIT: 03/05/2020 (15:58) - just spotted something that wasn't clear in the function above. Still works but have clarified variable names. Sorry for the confusion!
This will do the trick for you:
df.notna().groupby(level='Groups').sum(axis=0).ge(3).all(axis=0)
Outputs:
1 False
2 False
3 False
4 True
5 True
6 True
dtype: bool
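To actually apply the filter, the boolean series can be used to select the qualifying columns directly; a minimal follow-up sketch using the df from the question:
# Keep only the columns with at least 3 non-NaN values in every group.
mask = df.notna().groupby(level='Groups').sum().ge(3).all()
df_filtered = df.loc[:, mask]   # expected to keep columns 4, 5 and 6
print(df_filtered)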

How to get duplicated values in a data frame when the column is a list?

Good morning!
I have a data frame with several columns. One of these columns, data, has lists as content. Below I show a little example (id is just an example with random information):
df =
id data
0 a [1, 2, 3]
1 h [3, 2, 1]
2 bf [1, 2, 3]
What I want is to get the rows with duplicated values in column data; I mean, in this example, I should get rows 0 and 2, because the values in their data column are the same (the list [1, 2, 3]). However, this can't be achieved with df.duplicated(subset=['data']) because list is an unhashable type.
I know that it can be done by taking two rows and comparing data directly, but my real data frame can have 1000 rows or more, so I can't compare them one by one.
I hope someone knows how to do this!
Thank you very much in advance!
IIUC, we can create a new DataFrame from df['data'] and then check with DataFrame.duplicated.
You can use:
m = pd.DataFrame(df['data'].tolist()).duplicated(keep=False)
df.loc[m]
id data
0 a [1, 2, 3]
2 bf [1, 2, 3]
Expanding on Quang's comment:
Try
In [2]: elements = [(1,2,3), (3,2,1), (1,2,3)]
...: df = pd.DataFrame.from_records(elements)
...: df
Out[2]:
0 1 2
0 1 2 3
1 3 2 1
2 1 2 3
In [3]: # Add a new column of tuples
...: df["new"] = df.apply(lambda x: tuple(x), axis=1)
...: df
Out[3]:
0 1 2 new
0 1 2 3 (1, 2, 3)
1 3 2 1 (3, 2, 1)
2 1 2 3 (1, 2, 3)
In [4]: # Remove duplicate rows (Keeping the first one)
...: df.drop_duplicates(subset="new", keep="first", inplace=True)
...: df
Out[4]:
0 1 2 new
0 1 2 3 (1, 2, 3)
1 3 2 1 (3, 2, 1)
In [5]: # Remove the new column if not required
...: df.drop("new", axis=1, inplace=True)
...: df
Out[5]:
0 1 2
0 1 2 3
1 3 2 1
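A related one-liner sketch on the original df from the question (id/data columns): convert each list to a tuple on the fly and feed that to duplicated.
# Mark every row whose list appears more than once and keep all of them.
dupes = df[df['data'].apply(tuple).duplicated(keep=False)]
print(dupes)   # rows 0 and 2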

Combining dataframes in pandas and populating with maximum values

I'm trying to combine multiple data frames in pandas and I want the new dataframe to contain the maximum element within the various dataframes. All of the dataframes have the same row and column labels. How can I do this?
Example:
df1 = Date A B C
1/1/15 3 5 1
2/1/15 2 4 7
df2 = Date A B C
1/1/15 7 2 2
2/1/15 1 5 4
I'd like the result to look like this.
df = Date A B C
1/1/15 7 5 2
2/1/15 2 5 7
You can use np.where to return an array of the values that satisfy your boolean condition; this can then be used to construct a df:
In [5]:
vals = np.where(df1 > df2, df1, df2)
vals
Out[5]:
array([['1/1/15', 7, 5, 2],
['2/1/15', 2, 5, 7]], dtype=object)
In [6]:
pd.DataFrame(vals, columns = df1.columns)
Out[6]:
Date A B C
0 1/1/15 7 5 2
1 2/1/15 2 5 7
I don't know if Date is a column or index but the end result will be the same.
EDIT
Actually just use np.maximum:
In [8]:
np.maximum(df1,df2)
Out[8]:
Date A B C
0 1/1/15 7 5 2
1 2/1/15 2 5 7
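If more than two frames need to be combined, a hedged alternative sketch is to concatenate them and take the group-wise maximum per Date (assuming Date is a regular column in every frame):
# Stack all frames and take the column-wise maximum within each Date.
df_all = pd.concat([df1, df2])        # add further frames to the list as needed
df_max = df_all.groupby('Date', as_index=False).max()
print(df_max)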
