Smallest difference from every row in a dataframe - python-3.x

import pandas as pd

A = [1,3,7]
B = [6,4,8]
C = [2, 2, 8]
datetime = ['2022-01-01', '2022-01-02', '2022-01-03']
df1 = pd.DataFrame({'DATETIME':datetime,'A':A,'B':B, 'C':C })
df1.set_index('DATETIME', inplace = True)
df1
A = [1,3,7,6, 8]
B = [3,8,10,5, 8]
C = [5, 7, 9, 6, 5]
datetime = ['2022-03-01', '2022-03-02', '2022-03-03', '2022-03-04', '2022-03-05']
df2 = pd.DataFrame({'DATETIME':datetime,'A':A,'B':B, 'C':C })
df2.set_index('DATETIME', inplace = True)
df2
I want to compare every row of df1 with every row of df2 and, for each row in df1, output the date of the df2 row with the smallest total difference. Take the first row in df1 (2022-01-01), where A = 1, B = 6, and C = 2. Comparing it with the 2022-03-01 row of df2, where A = 1, B = 3, and C = 5, gives absolute differences of |1-1| = 0, |6-3| = 3, and |2-5| = 3, for a total difference of 0+3+3 = 6. Comparing 2022-01-01 with the remaining rows of df2 shows that 2022-03-01 has the lowest total difference, so I would like that date added to df1.

I'm assuming that you want the lowest total absolute difference.
The fastest way is probably to convert the DataFrames to numpy arrays, and use numpy broadcasting to efficiently perform the computations.
# for each row of df1 get the (positional) index of the df2 row corresponding to the lowest total absolute difference
min_idx = abs(df1.to_numpy()[:,None] - df2.to_numpy()).sum(axis=-1).argmin(axis=1)
df1['min_diff_date'] = df2.index[min_idx]
Output:
>>> df1
            A  B  C min_diff_date
DATETIME
2022-01-01  1  6  2    2022-03-01
2022-01-02  3  4  2    2022-03-01
2022-01-03  7  8  8    2022-03-03
Steps:
# Each 'block' corresponds to the absolute difference between a row of df1 and all the rows of df2
>>> abs(df1.to_numpy()[:,None] - df2.to_numpy())
array([[[0, 3, 3],
        [2, 2, 5],
        [6, 4, 7],
        [5, 1, 4],
        [7, 2, 3]],

       [[2, 1, 3],
        [0, 4, 5],
        [4, 6, 7],
        [3, 1, 4],
        [5, 4, 3]],

       [[6, 5, 3],
        [4, 0, 1],
        [0, 2, 1],
        [1, 3, 2],
        [1, 0, 3]]])
# sum the absolute differences over the columns of each block
>>> abs(df1.to_numpy()[:,None] - df2.to_numpy()).sum(-1)
array([[ 6,  9, 17, 10, 12],
       [ 6,  9, 17,  8, 12],
       [14,  5,  3,  6,  4]])
# for each row of the previous array get the column index of the lowest value
>>> abs(df1.to_numpy()[:,None] - df2.to_numpy()).sum(-1).argmin(1)
array([0, 0, 2])
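The broadcasting steps above can also be sketched as a self-contained script. This is only a minimal illustration that rebuilds the question's data; the intermediate names `a`, `b`, `diff`, `total` and `closest` are introduced here to make the array shapes visible and are not part of the original answer.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(
    {"A": [1, 3, 7], "B": [6, 4, 8], "C": [2, 2, 8]},
    index=pd.Index(["2022-01-01", "2022-01-02", "2022-01-03"], name="DATETIME"),
)
df2 = pd.DataFrame(
    {"A": [1, 3, 7, 6, 8], "B": [3, 8, 10, 5, 8], "C": [5, 7, 9, 6, 5]},
    index=pd.Index(
        ["2022-03-01", "2022-03-02", "2022-03-03", "2022-03-04", "2022-03-05"],
        name="DATETIME",
    ),
)

a = df1.to_numpy()[:, None, :]  # shape (3, 1, 3): one df1 row per block
b = df2.to_numpy()[None, :, :]  # shape (1, 5, 3): all df2 rows
diff = np.abs(a - b)            # broadcast to shape (3, 5, 3)
total = diff.sum(axis=-1)       # shape (3, 5): total difference per row pair
min_idx = total.argmin(axis=1)  # position of the closest df2 row per df1 row

closest = pd.Series(df2.index[min_idx], index=df1.index, name="min_diff_date")
print(closest)
```

Note that the intermediate array holds len(df1) * len(df2) * (number of columns) values, so for very large frames it may be worth processing df1 in chunks.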

Related

Function on column from dictionary

I have a df like this:
df = pd.DataFrame({'A': [3, 1, 2, 3],
                   'B': [5, 6, 7, 8]})
   A  B
0  3  5
1  1  6
2  2  7
3  3  8
And I have a dictionary like this:
{'A': 1, 'B': 2}
Is there a simple way to perform a function (e.g. divide) on the df values based on the values from the dictionary?
For example, all values in column A are divided by 1, and all values in column B are divided by 2.
Dividing by the dictionary works for me, because the keys of the dict match the column names:
d = {'A': 1, 'B': 2}
df1 = df.div(d)
Or:
df1 = df / d
print(df1)
     A    B
0  3.0  2.5
1  1.0  3.0
2  2.0  3.5
3  3.0  4.0
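The label matching can be made explicit by wrapping the dictionary in a Series. The following is a minimal sketch of that idea, not a replacement for the answer above; the names `divisors`, `divided` and `multiplied` are hypothetical and chosen only for this example.

```python
import pandas as pd

df = pd.DataFrame({"A": [3, 1, 2, 3], "B": [5, 6, 7, 8]})
divisors = pd.Series({"A": 1, "B": 2})

# Each Series label is matched to the column with the same name.
divided = df.div(divisors, axis="columns")

# The other arithmetic methods align in the same way, e.g. multiplication.
multiplied = df.mul(divisors, axis="columns")

print(divided)
print(multiplied)
```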
If you want to do it with a for loop, you can try this:
df = pd.DataFrame({'A': [3, 1, 2, 3, 4],
                   'B': [5, 6, 7, 8, 9]})
# named `divisors` rather than `dict` to avoid shadowing the built-in
divisors = {'A': 1, 'B': 2}
final_dict = {}
for col in df.columns:
    if col in divisors:
        # divide every value of the column by its divisor
        final_dict[col] = [i / divisors[col] for i in df[col]]
df = pd.DataFrame(final_dict)

How to Right Align Print a Python Integer List

I have three lists like:
l1 = [1, 2, 3]
l2 = [4, 5, 6]
l3 = [7, 8, 9]
I want to print these in following manner:
Fruits Quantity
Mango 1, 2, 3
Banana 4, 5, 6
Strawberry 7, 8, 9
How can I do the right alignment of the numbers in the list in python 3?
It seems an easy task. I have read many formatting tutorials online and looked in stack overflow answers but couldn't find one which can be used in my case. Maybe because I'm a total beginner in python so couldn't understand how to apply those in my situation.
Instead of trying to align the quantity lists, you can pad their descriptive names (i.e. the fruits) to a fixed width:
fruits = {"Mango": [1, 2, 3],
          "Banana": [4, 5, 6],
          "Strawberry": [7, 8, 9]}
for fruit, quantity in fruits.items():
    print(f"{fruit:15}", ", ".join(str(i) for i in quantity))
Mango           1, 2, 3
Banana          4, 5, 6
Strawberry      7, 8, 9
If you want to work with your l1-l3 lists:
l1 = [1, 2, 3]
l2 = [4, 5, 6]
l3 = [7, 8, 9]
l4 = ["Mango", "Banana", "Strawberry"]
print(f"{'Fruits':15}Quantity")
for key, value in zip(l4, [l1, l2, l3]):
    print(f"{key:15}{', '.join(str(num) for num in value)}")
OUTPUT:
Fruits         Quantity
Mango          1, 2, 3
Banana         4, 5, 6
Strawberry     7, 8, 9
Using zip() along with ljust(), which returns the string left-justified in a string of the specified length:
headers = ["Fruits", "Quantity"]
l1 = [1, 2, 3]
l2 = [4, 5, 6]
l3 = [7, 8, 9]
l4 = ["Mango", "Banana", "Strawberry"]
print(''.ljust(15).join(head for head in headers))
for fruit, quantity in zip(l4, [l1,l2,l3]):
    print(fruit.ljust(20), ', '.join([str(quan) for quan in quantity]))
OUTPUT:
Fruits               Quantity
Mango                1, 2, 3
Banana               4, 5, 6
Strawberry           7, 8, 9
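The answers above left-align the fruit names. If the numbers themselves should be right-aligned, as the question asks, one option is the `>` alignment flag of the format specification. This is a minimal sketch; the column widths 12 and 10 are arbitrary assumptions, and the `lines` list exists only so the result is easy to inspect.

```python
fruits = {
    "Mango": [1, 2, 3],
    "Banana": [4, 5, 6],
    "Strawberry": [7, 8, 9],
}

# "<12" left-aligns the name in 12 characters,
# ">10" right-aligns the joined numbers in 10 characters.
lines = [f"{'Fruits':<12}{'Quantity':>10}"]
for fruit, quantity in fruits.items():
    numbers = ", ".join(str(n) for n in quantity)
    lines.append(f"{fruit:<12}{numbers:>10}")

print("\n".join(lines))
```

Because every line is padded to the same total width, the last digit of each list ends in the same column even when the lists have different lengths.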

Pandas: How to aggregate by range inclusion?

I have a dataframe with a "range" column and some value columns:
In [1]: df = pd.DataFrame({
            "range": [[1,2], [[1,2], [6,11]], [4,5], [[1,3], [5,7], [9, 11]], [9,10], [[5,6], [9,11]]],
            "A": range(1, 7),
            "B": range(6, 0, -1)
        })
Out[1]:
                       range  A  B
0                     [1, 2]  1  6
1          [[1, 2], [6, 11]]  2  5
2                     [4, 5]  3  4
3  [[1, 3], [5, 7], [9, 11]]  4  3
4                    [9, 10]  5  2
5          [[5, 6], [9, 11]]  6  1
For every row I need to check whether its range is entirely included (with all of its parts) in the range of another row and, if so, add its other columns (A and B) to that row, keeping the longer range. The rows are arbitrarily ordered.
The detailed steps for the example dataframe would be: row 0 is entirely included in rows 1 and 3. Rows 1, 2 and 3 are not entirely included in any other row. Row 4 is included in rows 1, 3 and 5, but because row 5 is itself included in row 3, row 4 should only be merged into row 3 once.
Hence my output dataframe would be:
Out[2]:
                       range   A   B
0          [[1, 2], [6, 11]]   8  13
1                     [4, 5]   3   4
2  [[1, 3], [5, 7], [9, 11]]  16  12
I thought about sorting the rows first in order to put the longest ranges at the top so it would be easier and more efficient to merge the ranges, but unfortunately I have no idea how to perform this in pandas...
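One possible way to approach the merging described above is sketched below. It is a plain-Python pass over the rows rather than a vectorised pandas solution, and it compares every pair of rows, so it is only a starting point for small frames. The helpers `as_parts` and `is_included` and the `containers` list are hypothetical names introduced for this sketch. It also assumes that no two rows have identical ranges; with the rule used here, such rows would absorb each other and both disappear.

```python
import pandas as pd

df = pd.DataFrame({
    "range": [[1, 2], [[1, 2], [6, 11]], [4, 5],
              [[1, 3], [5, 7], [9, 11]], [9, 10], [[5, 6], [9, 11]]],
    "A": range(1, 7),
    "B": range(6, 0, -1),
})


def as_parts(r):
    """Return a range as a list of [start, end] parts (flat ranges get wrapped)."""
    return r if isinstance(r[0], list) else [r]


def is_included(inner, outer):
    """True if every part of `inner` lies inside some part of `outer`."""
    return all(
        any(o_start <= i_start and i_end <= o_end for o_start, o_end in outer)
        for i_start, i_end in inner
    )


parts = [as_parts(r) for r in df["range"]]
n = len(parts)

# containers[i] lists the rows whose range entirely includes row i's range.
containers = [
    [j for j in range(n) if j != i and is_included(parts[i], parts[j])]
    for i in range(n)
]

rows = []
for k in range(n):
    if containers[k]:
        continue  # row k is absorbed by a longer range, so it is not kept
    # Row k plus every row it includes; each row is counted once per kept row.
    members = [k] + [i for i in range(n) if k in containers[i]]
    rows.append({
        "range": df.at[k, "range"],
        "A": int(df.loc[members, "A"].sum()),
        "B": int(df.loc[members, "B"].sum()),
    })

result = pd.DataFrame(rows)
print(result)
```

Because each absorbed row is added directly to every kept row that includes it, row 4 contributes to row 3 only once, even though it is also included in row 5.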

Is there any function to create pairing of values from columns in pandas

I have to pair up consecutive values in a particular column, e.g. turn 3 2 2 4 2 2 into [3,2][2,2][2,4][4,2][2,2], for the whole data set.
Expected output:
[[3, 2], [2, 2], [2, 4], [4, 2], [2, 2]], with every pair in a separate column such as Pair 1, Pair 2, Pair 3, ...
content = pd.read_csv('temp2.csv')
df = pd.DataFrame(content, columns=['V2','V3','V4','V5','V6','V7'])
def get_pairs(x):
    arr = x.split(' ')
    return list(map(list, zip(arr,arr[1:])))
df['pairs'] = df.applymap(get_pairs)
df
IIUC, you can use list comprehension and zip:
# Setup
df = pd.DataFrame([3, 2, 2, 4, 2, 2], columns=['col1'])
[[x, y] for x, y in zip(df.loc[:, 'col1'], df.loc[1:, 'col1'])]
or alternatively using map and list constructor:
list(map(list, zip(df.loc[:, 'col1'], df.loc[1:, 'col1'])))
[out]
[[3, 2], [2, 2], [2, 4], [4, 2], [2, 2]]
Or, if your data is structured as space-separated strings, you could apply your own function to the column:
# Setup
df = pd.DataFrame(['3 2 2 4 2 2', '1 2 3 4 5 6'], columns=['col1'])
#           col1
# 0  3 2 2 4 2 2
# 1  1 2 3 4 5 6
def get_pairs(x):
    arr = x.split(' ')
    return list(map(list, zip(arr, arr[1:])))
# Series.apply calls get_pairs once for every cell of col1
df['pairs'] = df['col1'].apply(get_pairs)
[out]
          col1                                     pairs
0  3 2 2 4 2 2  [[3, 2], [2, 2], [2, 4], [4, 2], [2, 2]]
1  1 2 3 4 5 6  [[1, 2], [2, 3], [3, 4], [4, 5], [5, 6]]
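The question also asks for every pair in its own column (Pair 1, Pair 2, ...). A minimal sketch of that step follows, assuming space-separated strings as in the setup above and assuming every row yields the same number of pairs; the names `pair_lists`, `n_pairs` and `pairs` are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({"col1": ["3 2 2 4 2 2", "1 2 3 4 5 6"]})


def get_pairs(text):
    """Split a space-separated string and pair each value with its successor."""
    values = [int(v) for v in text.split()]
    return [list(pair) for pair in zip(values, values[1:])]


pair_lists = [get_pairs(text) for text in df["col1"]]
n_pairs = len(pair_lists[0])  # assumes every row yields the same number of pairs

# Build one column per pair position: "Pair 1", "Pair 2", ...
pairs = pd.DataFrame(
    {f"Pair {i + 1}": [row[i] for row in pair_lists] for i in range(n_pairs)},
    index=df.index,
)
print(pairs)
```

Unlike the answer's `get_pairs`, this version converts the values to integers, so the pairs contain numbers rather than strings.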

Groupby arrays in a pandas dataframe

Consider a dataframe with numpy arrays as entries for lat/lon:
lat        lon        min  max
[1, 2, 3]  [4, 5, 6]   10   90
[1, 2, 3]  [4, 5, 6]   80  120
[7, 8, 9]  [4, 5, 6]   10   20
[7, 8, 9]  [4, 5, 6]   30   40
How can I group the dataset by unique lat/lon combinations when the entries are numpy arrays? The goal is to check if min/max ranges intersect for unique lat/lon combinations and then combine them to a single row with new min/max. The result should look like this:
lat        lon        min  max
[1, 2, 3]  [4, 5, 6]   10  120
[7, 8, 9]  [4, 5, 6]   10   20
[7, 8, 9]  [4, 5, 6]   30   40
What I have tried so far is:
grouped = sectors.groupby(['lat', 'lon'])
But I cannot access the groups in grouped. The following raises an error (TypeError: unhashable type: 'numpy.ndarray'):
for name, group in grouped:
    print(name)
    print(group)
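The error arises because NumPy arrays are not hashable, while tuples are. One way to sidestep it is sketched below. Note that this sketch does not use groupby at all: it collects the rows in a plain dictionary keyed on tuple versions of the arrays and then merges intersecting min/max ranges within each key. The frame `sectors` is rebuilt from the question's table; the names `groups`, `merge_intervals` and `rows` are hypothetical, and treating ranges that merely touch (one max equal to the next min) as intersecting is an assumption.

```python
import numpy as np
import pandas as pd

sectors = pd.DataFrame({
    "lat": [np.array([1, 2, 3]), np.array([1, 2, 3]),
            np.array([7, 8, 9]), np.array([7, 8, 9])],
    "lon": [np.array([4, 5, 6]) for _ in range(4)],
    "min": [10, 80, 10, 30],
    "max": [90, 120, 20, 40],
})

# Collect the (min, max) ranges per unique lat/lon combination.
groups = {}
for lat, lon, lo, hi in zip(sectors["lat"], sectors["lon"],
                            sectors["min"], sectors["max"]):
    key = (tuple(lat.tolist()), tuple(lon.tolist()))  # hashable stand-in
    groups.setdefault(key, []).append((int(lo), int(hi)))


def merge_intervals(intervals):
    """Merge intersecting (lo, hi) intervals; touching intervals are merged too."""
    merged = []
    for lo, hi in sorted(intervals):
        if merged and lo <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], hi)
        else:
            merged.append([lo, hi])
    return merged


rows = [
    (lat, lon, lo, hi)
    for (lat, lon), intervals in groups.items()
    for lo, hi in merge_intervals(intervals)
]

# Convert the tuple keys back to arrays for the result frame.
result = pd.DataFrame({
    "lat": [np.array(lat) for lat, _, _, _ in rows],
    "lon": [np.array(lon) for _, lon, _, _ in rows],
    "min": [lo for _, _, lo, _ in rows],
    "max": [hi for _, _, _, hi in rows],
})
print(result)
```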
