Pandas: How to aggregate by range inclusion? - python-3.x

I have a dataframe with a "range" column and some value columns:
In [1]: df = pd.DataFrame({
    "range": [[1, 2], [[1, 2], [6, 11]], [4, 5],
              [[1, 3], [5, 7], [9, 11]], [9, 10], [[5, 6], [9, 11]]],
    "A": range(1, 7),
    "B": range(6, 0, -1),
})
Out[1]:
                       range  A  B
0                     [1, 2]  1  6
1          [[1, 2], [6, 11]]  2  5
2                     [4, 5]  3  4
3  [[1, 3], [5, 7], [9, 11]]  4  3
4                    [9, 10]  5  2
5          [[5, 6], [9, 11]]  6  1
For every row I need to check whether its range is entirely included (with all of its parts) in the range of another row; if it is, the other columns (A and B) should be summed up, keeping the longer range. The rows are arbitrarily ordered.
The detailed steps for the example dataframe would be: row 0 is entirely included in rows 1 and 3; rows 1, 2 and 3 are not entirely included in any other row's range; and row 4 is included in rows 1, 3 and 5, but because row 5 is itself included in row 3, row 4 should only be merged into row 3 once.
Hence my output dataframe would be:
Out[2]:
                       range   A   B
0          [[1, 2], [6, 11]]   8  13
1                     [4, 5]   3   4
2  [[1, 3], [5, 7], [9, 11]]  16  12
I thought about sorting the rows first in order to put the longest ranges at the top so it would be easier and more efficient to merge the ranges, but unfortunately I have no idea how to perform this in pandas...
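No answer is shown for this question, so here is a rough sketch of one possible approach (my own, only checked against the sample data above): call a row "maximal" if its range is not entirely included in any other row's range, then give each maximal row the summed A and B of every row whose range it entirely contains. Counting each contained row once per maximal row also reproduces the row-4/row-5 behaviour described above. The helper names (as_intervals, included) are made up for this sketch:
def as_intervals(r):
    # normalize a single [lo, hi] pair to a list of pairs
    return r if isinstance(r[0], list) else [r]

def included(inner, outer):
    # every part of `inner` must lie inside some part of `outer`
    return all(any(o_lo <= i_lo and i_hi <= o_hi for o_lo, o_hi in outer)
               for i_lo, i_hi in inner)

ivs = df["range"].map(as_intervals)

# rows whose range is not included in any other row's range
maximal = [i for i in df.index
           if not any(j != i and included(ivs[i], ivs[j]) for j in df.index)]

rows = []
for i in maximal:
    # sum this row together with every row entirely included in it
    contained = [j for j in df.index if j != i and included(ivs[j], ivs[i])]
    rows.append({"range": df.at[i, "range"],
                 "A": df.loc[[i] + contained, "A"].sum(),
                 "B": df.loc[[i] + contained, "B"].sum()})
result = pd.DataFrame(rows)
For the sample data this yields the three expected rows. Note that the pairwise inclusion test is O(n²), so a large frame may need something smarter, such as an interval index.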

Related

Smallest difference from every row in a dataframe

import pandas as pd

A = [1, 3, 7]
B = [6, 4, 8]
C = [2, 2, 8]
datetime = ['2022-01-01', '2022-01-02', '2022-01-03']
df1 = pd.DataFrame({'DATETIME': datetime, 'A': A, 'B': B, 'C': C})
df1.set_index('DATETIME', inplace=True)
df1

A = [1, 3, 7, 6, 8]
B = [3, 8, 10, 5, 8]
C = [5, 7, 9, 6, 5]
datetime = ['2022-03-01', '2022-03-02', '2022-03-03', '2022-03-04', '2022-03-05']
df2 = pd.DataFrame({'DATETIME': datetime, 'A': A, 'B': B, 'C': C})
df2.set_index('DATETIME', inplace=True)
df2
I want to compare every row of df1 with every row of df2 and, for each row of df1, output the date of the closest df2 row. Take the first row of df1 (2022-01-01), where A=1, B=6 and C=2. Comparing it to df2's 2022-03-01, where A=1, B=3 and C=5, gives differences of 1-1=0, 6-3=3 and |2-5|=3, for a total difference of 0+3+3=6. Comparing 2022-01-01 to the rest of df2 shows that 2022-03-01 has the lowest total difference, so I would like that date recorded next to the row in df1.
I'm assuming that you want the lowest total absolute difference.
The fastest way is probably to convert the DataFrames to numpy arrays, and use numpy broadcasting to efficiently perform the computations.
# for each row of df1 get the (positional) index of the df2 row corresponding to the lowest total absolute difference
min_idx = abs(df1.to_numpy()[:,None] - df2.to_numpy()).sum(axis=-1).argmin(axis=1)
df1['min_diff_date'] = df2.index[min_idx]
Output:
>>> df1
            A  B  C min_diff_date
DATETIME
2022-01-01  1  6  2    2022-03-01
2022-01-02  3  4  2    2022-03-01
2022-01-03  7  8  8    2022-03-03
Steps:
# Each 'block' corresponds to the absolute difference between a row of df1 and all the rows of df2
>>> abs(df1.to_numpy()[:,None] - df2.to_numpy())
array([[[0, 3, 3],
        [2, 2, 5],
        [6, 4, 7],
        [5, 1, 4],
        [7, 2, 3]],

       [[2, 1, 3],
        [0, 4, 5],
        [4, 6, 7],
        [3, 1, 4],
        [5, 4, 3]],

       [[6, 5, 3],
        [4, 0, 1],
        [0, 2, 1],
        [1, 3, 2],
        [1, 0, 3]]])
# sum the absolute differences over the columns of each block
>>> abs(df1.to_numpy()[:,None] - df2.to_numpy()).sum(-1)
array([[ 6,  9, 17, 10, 12],
       [ 6,  9, 17,  8, 12],
       [14,  5,  3,  6,  4]])
# for each row of the previous array get the column index of the lowest value
>>> abs(df1.to_numpy()[:,None] - df2.to_numpy()).sum(-1).argmin(1)
array([0, 0, 2])
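For reference, the summed absolute difference is just the Manhattan (cityblock) distance, so the same result can be had without writing the broadcasting by hand; a sketch, assuming scipy is available:
from scipy.spatial.distance import cdist

# cityblock == sum of absolute differences; same min_idx as the broadcasting approach
min_idx = cdist(df1.to_numpy(), df2.to_numpy(), metric='cityblock').argmin(axis=1)
df1['min_diff_date'] = df2.index[min_idx]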

Is there any function to create pairing of values from columns in pandas

I have to pair up consecutive values in a particular column, e.g. 3 2 2 4 2 2 becomes [3,2] [2,2] [2,4] [4,2] [2,2], across the whole data set.
Expected output:
[[3, 2], [2, 2], [2, 4], [4, 2], [2, 2]], with every pair of a row going into its own column, like Pair 1, Pair 2, Pair 3, ...
content = pd.read_csv('temp2.csv')
df = pd.DataFrame(content, columns=['V2', 'V3', 'V4', 'V5', 'V6', 'V7'])

def get_pairs(x):
    arr = x.split(' ')
    return list(map(list, zip(arr, arr[1:])))

df['pairs'] = df.applymap(get_pairs)
df
IIUC, you can use list comprehension and zip:
# Setup
df = pd.DataFrame([3, 2, 2, 4, 2, 2], columns=['col1'])
[[x, y] for x, y in zip(df.loc[:, 'col1'], df.loc[1:, 'col1'])]
or alternatively using map and list constructor:
list(map(list, zip(df.loc[:, 'col1'], df.loc[1:, 'col1'])))
[out]
[[3, 2], [2, 2], [2, 4], [4, 2], [2, 2]]
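On Python 3.10+ the same consecutive pairing is also available in the standard library via itertools.pairwise; a minimal sketch:
from itertools import pairwise  # Python 3.10+

[list(p) for p in pairwise(df['col1'])]
# [[3, 2], [2, 2], [2, 4], [4, 2], [2, 2]]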
Or if your data is structured as whitespace-separated strings, you can apply your own function to the column:
# Setup
df = pd.DataFrame(['3 2 2 4 2 2', '1 2 3 4 5 6'], columns=['col1'])
#           col1
# 0  3 2 2 4 2 2
# 1  1 2 3 4 5 6

def get_pairs(x):
    # split the string and pair each value with its successor
    arr = [int(i) for i in x.split()]
    return list(map(list, zip(arr, arr[1:])))

df['pairs'] = df['col1'].apply(get_pairs)
[out]
          col1                                      pairs
0  3 2 2 4 2 2  [[3, 2], [2, 2], [2, 4], [4, 2], [2, 2]]
1  1 2 3 4 5 6  [[1, 2], [2, 3], [3, 4], [4, 5], [5, 6]]
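The question also asks for each pair in its own column (Pair 1, Pair 2, ...). One possible follow-up step, assuming every row yields the same number of pairs (pairs_df is a name made up here):
# spread each row's list of pairs into separate 'Pair i' columns
pairs_df = pd.DataFrame(df['pairs'].tolist(), index=df.index)
pairs_df.columns = [f'Pair {i + 1}' for i in pairs_df.columns]
df = df.join(pairs_df)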

Groupby arrays in a pandas dataframe

Consider a dataframe with numpy arrays as entries for lat/lon:
      lat        lon  min  max
[1, 2, 3]  [4, 5, 6]   10   90
[1, 2, 3]  [4, 5, 6]   80  120
[7, 8, 9]  [4, 5, 6]   10   20
[7, 8, 9]  [4, 5, 6]   30   40
How can I group the dataset by unique lat/lon combinations when the entries are numpy arrays? The goal is to check whether the min/max ranges intersect within each unique lat/lon combination and, if so, combine them into a single row with a new min/max. The result should look like this:
      lat        lon  min  max
[1, 2, 3]  [4, 5, 6]   10  120
[7, 8, 9]  [4, 5, 6]   10   20
[7, 8, 9]  [4, 5, 6]   30   40
What I have tried so far is:
grouped = sectors.groupby(['lat', 'lon'])
But I cannot access the groups in grouped. Iterating over them raises TypeError: unhashable type: 'numpy.ndarray':
for name, group in grouped:
    print(name)
    print(group)
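No answer is shown for this one either; a possible workaround (a sketch, under the assumption that exact array equality defines a group): numpy arrays are unhashable, but tuples are, so group on tuple versions of the key columns and run a classic interval merge inside each group. The names merge_ranges and keys are made up for this sketch:
import numpy as np
import pandas as pd

# sample data matching the question's table
sectors = pd.DataFrame({
    'lat': [np.array([1, 2, 3]), np.array([1, 2, 3]),
            np.array([7, 8, 9]), np.array([7, 8, 9])],
    'lon': [np.array([4, 5, 6])] * 4,
    'min': [10, 80, 10, 30],
    'max': [90, 120, 20, 40],
})

# group on hashable tuple versions of the array-valued key columns
keys = [sectors['lat'].map(tuple), sectors['lon'].map(tuple)]

def merge_ranges(group):
    # classic interval merge: sort by 'min' and extend the last merged
    # row while the next row's 'min' falls inside it
    merged = []
    for _, row in group.sort_values('min').iterrows():
        if merged and row['min'] <= merged[-1]['max']:
            merged[-1]['max'] = max(merged[-1]['max'], row['max'])
        else:
            merged.append(row.to_dict())
    return pd.DataFrame(merged)

result = pd.concat(
    (merge_ranges(g) for _, g in sectors.groupby(keys)), ignore_index=True
)
For the sample data this merges the intersecting 10-90 and 80-120 rows into 10-120 and leaves the two non-intersecting [7, 8, 9] rows untouched.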

pandas how to derive values for a new column based on another column

I have a dataframe with a column in which each value is a list. Now I want to derive a new column that only considers lists whose size is greater than 1 and assigns a unique integer to the corresponding row as an id.
A sample dataframe looks like:
document_no_list  cluster_id
[1, 2, 3]                  1
[4, 5, 6, 7]               2
[8]                      nan
[9, 10]                    3
The cluster_id column only considers the 1st, 2nd and 4th rows, each of which holds a list of size greater than 1, and assigns a unique integer id to the corresponding cell.
I am wondering how to do that in pandas.
We can use np.random.choice to draw unique random values, with .loc for the assignment, i.e.
import numpy as np
import pandas as pd

df = pd.DataFrame({'document_no_list': [[1, 2, 3], [4, 5, 6, 7], [8], [9, 10]]})
x = df['document_no_list'].apply(len) > 1
df.loc[x, 'Cluster'] = np.random.choice(range(len(df)), x.sum(), replace=False)
Output:
  document_no_list  Cluster
0        [1, 2, 3]      2.0
1     [4, 5, 6, 7]      1.0
2              [8]      NaN
3          [9, 10]      3.0
If you want consecutive numbers instead, you can use
df.loc[x, 'Cluster'] = np.arange(x.sum()) + 1
  document_no_list  Cluster
0        [1, 2, 3]      1.0
1     [4, 5, 6, 7]      2.0
2              [8]      NaN
3          [9, 10]      3.0
Hope it helps
Alternatively, create a boolean column based on the condition and apply cumsum() to the rows with 1's:
df['cluster_id'] = df['document_no_list'].apply(lambda x: len(x) > 1).astype(int)
df.loc[df['cluster_id'] == 1, 'cluster_id'] = df.loc[df['cluster_id'] == 1, 'cluster_id'].cumsum()
  document_no_list  cluster_id
0        [1, 2, 3]           1
1     [4, 5, 6, 7]           2
2              [8]           0
3          [9, 10]           3
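For reference, the NaN-preserving output shown in the question can also be produced with cumsum() and where(); a minimal sketch:
mask = df['document_no_list'].str.len() > 1
df['cluster_id'] = mask.cumsum().where(mask)  # 1.0, 2.0, NaN, 3.0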

Find all combinations by columns

I have an n-rows by m-columns matrix and want to find all combinations, taking one value from each column. For example:
2 5 6 9
5 2 8 3
1 1 9 4
2 5 3 9
my program will print
2-5-6-9
2-5-6-3
2-5-6-4
2-5-6-9
2-5-8-9
2-5-8-3...
I can't hard-code m nested for loops. How can that be done?
Use recursion. For each position, specify which values can appear there (the values of the corresponding column), and write a recursive function that takes the list of numbers chosen for the positions filled so far. Each call iterates over the possibilities for the next position.
Python implementation:
def C(choose_numbers, possibilities):
    if len(choose_numbers) >= len(possibilities):
        print('-'.join(map(str, choose_numbers)))  # all positions filled: format output
    else:
        for i in possibilities[len(choose_numbers)]:
            C(choose_numbers + [i], possibilities)

# possibilities per position: the columns of the example matrix
c = [[2, 5, 1, 2], [5, 2, 1, 5], [6, 8, 9, 3], [9, 3, 4, 9]]
C([], c)
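For comparison, the standard library computes the same cartesian product directly; an equivalent using itertools.product:
from itertools import product

c = [[2, 5, 1, 2], [5, 2, 1, 5], [6, 8, 9, 3], [9, 3, 4, 9]]
for combo in product(*c):
    print('-'.join(map(str, combo)))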
