grouping coordinates within a distance to each other - python-3.x

I have written this code, which works, but it takes a very long time (~8 hours) to finish execution.
I'm wondering if it can be optimized to execute quicker.
The aim is to group a lot of items, (x, y, z) coordinates, based on their distance to one another. For example:
I would like to group them with a tolerance of +-0.5 in x, +-0.5 in y and +-0.5 in z; the output for the data below would then be [(0,3),(1),(2,4)...].
x y z
0 1000.1 20.2 93.1
1 647.7 91.7 87.7
2 941.2 44.3 50.6
3 1000.3 20.3 92.9
4 941.6 44.1 50.6
...
What I have done (and which works) is described below.
It compares the first row of data_frame with the 2nd, 3rd, 4th, ... until the end; for each row, if |x1 - x2| < 0.5 and |y1 - y2| < 0.5 and |z1 - z2| < 0.5, the index is added to a list, group. If it doesn't match, it compares the next row until reaching the end of the loop.
After each pass the indexes that matched (stored in group) are added to another list, groups, as a set, then removed from the original index list, a; then the next a[0] is compared, and so on.
groups = []
group = []
data = [(x,y,z),(x,y,z),(etc)] # > 50,000 entries
data_frame = pd.DataFrame(data, columns=['x','y','z'])
a = list(range(len(data_frame)))
threshold = 0.5
for j in range(len(a) - 1):
    if len(a) > 0:
        group.append(a[0])
        for ii in range(a[0], len(data_frame)):
            # abs() so the +-0.5 tolerance applies in both directions
            if (abs(data_frame.loc[a[0], 'x'] - data_frame.loc[ii, 'x']) < threshold
                    and abs(data_frame.loc[a[0], 'y'] - data_frame.loc[ii, 'y']) < threshold
                    and abs(data_frame.loc[a[0], 'z'] - data_frame.loc[ii, 'z']) < threshold):
                group.append(ii)
        groups.append(set(group))
        for iii in group:
            if iii in a:
                a.remove(iii)
        group = []
    else:
        break
which returns something like this, for example:
groups = [{0}, {1, 69}, {2, 70}, {3, 67}, {4}, {5}, {6}, {7, 9}, {8}, {10}, {11}, {12}, {13}, {14, 73}, {15}, {16}, {17, 21, 74}, {18, 20}, {19}, {22, 23}]
I have made many edits to this question as it was not very clear; hopefully it makes sense now.
Below is an attempt using better, O(N log N) logic, which is much faster but doesn't return the correct answer. I have used the same +-0.5 for x, y and z.
Edit:
test_list = [(i,x,y,z), ... , (i,x,y,z)]
df3 = sorted(test_list, key=lambda x: x[1])
result = []
while df3:
    if len(df3) > 1:  # added this because it was crashing at the end of the loop
        a = df3.pop(0)
        alist = [a[0]]
        while (abs(a[1] - df3[0][1]) < 0.5 and abs(a[2] - df3[0][2]) < 0.5 and abs(a[3] - df3[0][3]) < 0.5):
            alist.append(df3.pop(0)[0])
            if df3:
                continue
            else:
                break
        result.append(alist)
    else:
        result.append(df3.pop(0)[0])
        break

Since you are comparing each data point with every other one, your implementation has a worst-case time complexity of O(N²). A better way is to sort first.
import random
df = [i for i in range(100)]
random.shuffle(df)
df2 = [(i,x) for i,x in enumerate(df)]
df3 = sorted(df2,key=lambda x: x[1])
df3
[(31, 0), (24, 1), (83, 2)......
Assuming you now want to group numbers that are within +5/-5 of each other into one list, you can slice the sorted numbers into lists based on a condition.
result = []
while df3:
    a = df3.pop(0)
    alist = [a[0]]
    # stop when df3 is exhausted or the next value falls out of range
    while df3 and a[1] + 5 >= df3[0][1]:
        alist.append(df3.pop(0)[0])
    result.append(alist)
result
[[31, 24, 83, 58, 82, 35], [0, 65, 77, 41, 67, 56].......
Sorting takes O(N log N) and the grouping pass is linear, so this is much faster than O(N²).
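That said, for the 3-D case a sort on x alone is not enough, because points adjacent after sorting can still differ in y or z. A spatial index handles all three axes at once; below is a minimal sketch (an addition, not part of the original answer) using scipy.spatial.cKDTree with the Chebyshev metric (p=inf), which finds every pair of points whose coordinates differ by at most 0.5 on each axis and then merges the pairs into groups:
import numpy as np
from scipy.spatial import cKDTree

points = np.asarray(data)  # shape (N, 3): the (x, y, z) tuples from the question
tree = cKDTree(points)
# all index pairs whose coordinates differ by at most 0.5 on every axis
pairs = tree.query_pairs(r=0.5, p=np.inf)

# merge the pairs into connected groups with a small union-find
parent = list(range(len(points)))
def find(i):
    while parent[i] != i:
        parent[i] = parent[parent[i]]  # path compression
        i = parent[i]
    return i
for i, j in pairs:
    parent[find(i)] = find(j)

members = {}
for i in range(len(points)):
    members.setdefault(find(i), set()).add(i)
groups = list(members.values())  # e.g. [{0, 3}, {1}, {2, 4}, ...]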

Related

Multiply each row of an array with coefficients in list - Python

I am very new to Python and need help. This is the problem statement:
I want to calculate the value of each of the three houses by multiplying the rows of the array X (each row representing one house) with the coefficients in list c, so for the first house: Price = (66x3000) + (5x200) + (15x-50) + (2x5000) + (500x100) = 258,250.
Do not use numpy.
Print the price of the three houses.
This is what I have so far:
# input values for three houses:
# - size [m^2],
# - size of the sauna [m^2],
# - distance to water [m],
# - number of indoor bathrooms,
# - proximity of neighbors [m]
X = [[66, 5, 15, 2, 500],
     [21, 3, 50, 1, 100],
     [120, 15, 5, 2, 1200]]

# coefficient values
c = [3000, 200, -50, 5000, 100]

def predict(X, c):
    price = 0
    for i in range(len(X)):
        for j in range(len(X[i])):
            price += c[j] * X[i][j]
        print(price)

predict(X, c)
The output is
258250
334350
827100
The program adds the values of the 2nd and 3rd houses to the previous result, rather than printing each house's value separately. How can I fix this?
Many thanks!
Move the line
price = 0
into the outer for loop, so the total is reset for each house:
def predict(X, c):
    for i in range(len(X)):
        price = 0
        for j in range(len(X[i])):
            price += c[j] * X[i][j]
        print(price)
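Equivalently, and still without numpy, zip keeps each house's total local by construction; a small sketch of that variant (not part of the original answer):
def predict(X, c):
    for row in X:  # one house per row
        # pair each feature with its coefficient and sum the products
        print(sum(coef * value for coef, value in zip(c, row)))

predict(X, c)  # prints 258250, 76100, 492750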

How to filter time series if data exists at least every 6 hours?

I'd like to verify that there is data at least once every 6 hours per ID, and filter out the IDs that do not meet this criterion.
Essentially a filter: "if an ID's data is not present at least every 6h, drop the ID from the dataframe".
I tried to adapt the method I use for filtering to one observation per day, but I'm having trouble adapting the code.
# add day column from datetime index
df['1D'] = df.index.day
# reset index
daily = df.reset_index()
# count per ID per day; the result is a per-ID, per-day count of entries
a = daily.groupby(['1D', 'id']).size()
# filter by right join
filtered = a.merge(df, on='id', how='right')
I cannot figure out how to adapt this to the following 6-hour periods within each day: 00:01-06:00, 06:01-12:00, 12:01-18:00, 18:01-24:00.
Groupby ID, then integer-divide the hour by 6 and count the unique values. In your case the count should be greater than or equal to 4, because there are four 6-hour bins in 24 hours and a fully covered day has 4 unique bins, i.e.
Bins = 4
00:01-06:00
06:01-12:00
12:01-18:00
18:01-24:00
Code
mask = df.groupby('id')['date'].transform(lambda x: (x.dt.hour // 6).nunique() >= 4)
df = df[mask]
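As a quick check, here is a toy run of that mask (the df layout is assumed from the snippet above; only id 1 covers all four 6-hour bins):
import pandas as pd

df = pd.DataFrame({
    'id':   [1, 1, 1, 1, 2],
    'date': pd.to_datetime(['2020-01-01 01:00', '2020-01-01 07:00',
                            '2020-01-01 13:00', '2020-01-01 19:00',
                            '2020-01-01 01:00']),
})
mask = df.groupby('id')['date'].transform(lambda x: (x.dt.hour // 6).nunique() >= 4)
print(df[mask])  # only the rows with id 1 remain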
I propose to use pivot_table with resample, which allows changing to arbitrary frequencies. Please see the comments for further explanation.
from datetime import datetime
import pandas as pd

# build test data. I need a dummy column to use pivot_table later.
# Any column with numerical values will suffice.
data = [[datetime(2020, 1, 1, 1), 1, 1],
        [datetime(2020, 1, 1, 6), 1, 1],
        [datetime(2020, 1, 1, 12), 1, 1],
        [datetime(2020, 1, 1, 18), 1, 1],
        [datetime(2020, 1, 1, 1), 2, 1],
        ]
df = pd.DataFrame.from_records(data=data, columns=['date', 'id', 'dummy'])
df = df.set_index('date')

# We need a helper dataframe df_tmp.
# Transform id entries to columns. Resample with 6h = 360 minutes = 360T.
# Take mean() because it will produce NaN values for empty bins.
# WARNING: it will only work if at least one id has observations for every 6h.
df_tmp = pd.pivot_table(df, columns='id', index=df.index).resample('360T').mean()

# Drop the column MultiIndex and drop all columns with NaN values
df_tmp.columns = df_tmp.columns.get_level_values(1)
df_tmp.dropna(axis=1, inplace=True)

# Filter values in the original dataframe where the id has data in every 6h bin
mask_id = df.id.isin(df_tmp.columns.to_list())
df = df[mask_id]
I kept your requirements on the timestamps, but I believe you will want to use the commented lines in my solution instead.
import pandas as pd

period = pd.to_datetime(['2020-01-01 00:01:00', '2020-01-01 06:00:00'])
# period = pd.to_datetime(['2020-01-01 00:00:00', '2020-01-01 06:00:00'])
shift = pd.to_timedelta(['6H', '6H'])
id_with_data = set(df['ID'])
for k in range(4):  # four 6-hour windows per day (00:01 --> 24:00)
    period_mask = (period[0] <= df.index) & (df.index <= period[1])
    # period_mask = (period[0] <= df.index) & (df.index < period[1])
    present_ids = set(df.loc[period_mask, 'ID'])
    id_with_data = id_with_data.intersection(present_ids)
    period += shift
df = df.loc[df['ID'].isin(list(id_with_data))]

Improving While-Loop with Numpy

I have been given the following three variables:
start = 30 # starting value
end = 60 # ending value
slice_size = 6 # value difference per tuple
start and end are row numbers of an array. My goal is to create an array/list of tuples, where each tuple includes as many items as slice_size defines. A little example: if start and end have the above values, the first four tuples would be:
[[30,35],[36,41],[42,47],[48,53],...].
But now comes the clue: the first value of the next tuple does not start at the previous first value + slice_size, but rather at the previous first value + slice_size/2. So I want something like this:
[[30,35],[33,38],[36,41],[39,44],...].
This list of tuples goes on until end is reached, or right before it is reached, i.e. until <= end. The last value of the list is not allowed to pass the value of end. The value of slice_size must of course always be an even number for this to work properly.
My nooby attempt uses a while loop:
condition = 0
i = 0
list = []
half_slice = int(slice_size / 2)
while condition <= end:
    list.append([start + half_slice * i, start + (slice_size - 1) + i * half_slice])
    condition = start + (slice_size - 1) + i * half_slice
    i += 1
The thing is, it works. However, I know this is complete rubbish and I want to improve my skills. Do you have a suggestion how to do it in a couple of lines of code?
You must not use list as a variable name, as it shadows the built-in list type:
import numpy as np

start = 30 # starting value
end = 60 # ending value
slice_size = 6 # value difference per tuple

# integer-divide the step to keep the values ints; the stop bound of the
# first arange keeps the last pair from passing end
l = [[int(i), int(j)] for i, j in zip(np.arange(start, end - slice_size + 2, slice_size // 2),
                                      np.arange(start + slice_size - 1, end + 1, slice_size // 2))]
print(l)
Output:
[[30, 35], [33, 38], [36, 41], [39, 44], [42, 47], [45, 50], [48, 53], [51, 56], [54, 59]]
1) Do NOT use list as a variable name; it shadows the built-in list type.
2) Not a NumPy solution, but you can use a list comprehension:
start = 30 #starting value
end = 60 #ending value
slice_size = 6 # value difference per tuple
result = [[current, current + slice_size - 1] for current in range(start, end - slice_size + 2, slice_size // 2)]
print(result)
Output:
[[30, 35], [33, 38], [36, 41], [39, 44], [42, 47], [45, 50], [48, 53], [51, 56], [54, 59]]
This will work for an odd number slice_size as well.
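If a vectorised NumPy version is still wanted, the same bounds translate directly; a minimal sketch (assuming the same start/end/slice_size values as above):
import numpy as np

starts = np.arange(start, end - slice_size + 2, slice_size // 2)
pairs = np.column_stack((starts, starts + slice_size - 1))
print(pairs.tolist())
# [[30, 35], [33, 38], [36, 41], [39, 44], [42, 47], [45, 50], [48, 53], [51, 56], [54, 59]]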

Filter array by tolerance to the last value

I'm using Python 3.7.
I have an array like this:
L1 = [1,2,3,-10,8,12,300,17]
Now I want to filter the values (the -10 and the 300 are not okay).
The values in the array may differ, but they are always counting up or down.
Does Python 3 have a built-in function for that?
The result should look like this:
L1 = [1,2,3,8,12,17]
Thank you!
Edit from comments:
I want to keep each element only if it is within a certain distance (a tolerance of 10, for example) of the one before it.
Your array is a list. You can use built-in functions:
L1 = [1,2,3,-10,8,12,300,17]
min_val = min(L1)  # -10
max_val = max(L1)  # 300
p = list(filter(lambda x: min_val < x < max_val, L1))  # all x other than -10 and 300
print(p)  # [1, 2, 3, 8, 12, 17]
Docs:
min()
max()
filter()
If you want an incremental filter instead, go through your list of data points and decide whether to keep each one based on the last kept value:
delta = 10
result = [L1[0]]  # always keep the first element
last = L1[0]
for elem in L1[1:]:
    # keep elem only if it is within +-delta of the last kept value
    if last - delta < elem < last + delta:
        result.append(elem)
        last = elem
print(result)  # [1, 2, 3, 8, 12, 17]

New column based on a row with conditions in Pandas

I'm trying to do an operation with DataFrames, but I'm not sure how to solve the problem using built-in Pandas operations (my current code is based on a for loop, so I'm trying to build a more elegant solution).
Given the following DataFrame columns,
original_df = [o1, o2, o3, o4]
weights_df = [w1, w2, w3, w4]
conditions_df = [c1, c2, c3, c4]
I need to build a new column on original_df based on the division o1/w1, but depending on the value of c1, which takes the values "+" or "-", I may instead need to compute -o1/w1.
What I have so far is:
original_df['newcolumn'] = original_df / weights_df
which of course divides the two terms without applying the condition. I'm trying to do it with the map and apply functions, but I'm not sure how to bring the third column into the function.
import pandas as pd

original_df = [100, 200, 300, 400]
weights_df = [10, 20, 30, 40]
conditions_df = [1, 2, 3, 4]
df = pd.DataFrame({'x': original_df, 'y': weights_df, 'z': conditions_df})

def div(x, y, z):
    if z > 2:
        return float(x / y)
    else:
        return float(-1 * x / y)

df['new_feature'] = df.apply(lambda p: div(p['x'], p['y'], p['z']), axis=1)
This is one way of solving it. If your conditions_df contains '+'/'-' instead, you can change the condition in div(x, y, z) accordingly.
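For instance, a possible sketch of that variant (assuming '-' marks the rows whose ratio should be negated):
def div(x, y, z):
    # negate the ratio when the condition column holds '-'
    return x / y if z == '+' else -x / y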
You can use numpy.where for a mask by condition:
import numpy as np

# data from lisa's answer
# df = pd.DataFrame({'x': original_df, 'y': weights_df, 'z': conditions_df})
df['new_feature'] = df['x'] / df['y'] * np.where(df['z'] > 2, 1, -1)
print(df)
     x   y  z  new_feature
0  100  10  1        -10.0
1  200  20  2        -10.0
2  300  30  3         10.0
3  400  40  4         10.0
Timings:
#4k rows
df = pd.concat([df]*1000).reset_index(drop=True)
#lisa answer
In [95]: %timeit df['new_feature1'] = df.apply(lambda p: div(p['x'], p['y'], p['z']), axis=1)
10 loops, best of 3: 123 ms per loop
In [96]: %timeit df['new_feature2'] = df['x'] / df['y'] * np.where(df['z'] > 2, 1, -1)
1000 loops, best of 3: 595 µs per loop
