How do I eliminate elements in the same position in two lists but just filtering one list? - python-3.x

I have two lists, e.g.:
a = [0.0 , 30.1, 0.0, 10.1]
b = [1000, 9830, 100, 1023]
I want to remove from list "a" the elements equal to 0.0, and remove from list "b" the elements at the same positions as those 0.0 elements in "a".
I know I can do this by saving the indices of the 0.0 elements in a list and then deleting them from list "b". Is there something more efficient? I want to apply the method to very large datasets.
Thanks

Using NumPy is one of the most efficient ways. If you can use this library, NumPy boolean indexing handles it without explicit loops and is much faster. Convert the lists to arrays, filter, and convert back with .tolist() if needed:
import numpy as np

a = np.array(a)
b = np.array(b)
mask = a != 0  # positions where "a" is nonzero
a_ = a[mask]
# [30.1 10.1]
b_ = b[mask]
# [9830 1023]

This should do the trick:
a = [0.0 , 30.1, 0.0, 10.1]
b = [1000, 9830, 100, 1023]
#####################################################
assert len(a) == len(b)
b = [b[index] if value != 0 else "" for index, value in enumerate(a)]
b = list(filter(lambda x: x != "", b))
#####################################################
print(b) # prints [9830, 1023]
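
If you want to stay in pure Python, a single pass over both lists with zip() avoids building the intermediate "" placeholders; a minimal sketch, not from the answers above:
a = [0.0, 30.1, 0.0, 10.1]
b = [1000, 9830, 100, 1023]

# keep only the pairs whose "a" element is nonzero, in one pass
filtered = [(x, y) for x, y in zip(a, b) if x != 0.0]
a_new = [x for x, _ in filtered]
b_new = [y for _, y in filtered]
print(b_new)  # [9830, 1023]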


Can anyone tell me what is wrong here in this loop?

vl = [40.08, 36.6, 41.0, 35.2, 41.0]
indices = []
for x in vl:
    if x == max(vl):
        indices.append(vl.index(x))
print(indices)
Here the max element 41.0 is present twice in the list, at indexes 2 and 4, so both indexes should be appended to indices. I am getting the output [2, 2] instead of [2, 4]. Can anyone tell me what is wrong with this code?
You get [2, 2] because index() always returns the index of the first match.
I think you want to check out enumerate() here as it will give you the indices you are looking for in addition to the values.
Via List Comprehension:
vl = [40.08, 36.6, 41.0, 35.2, 41.0]
indices = [index for index, value in enumerate(vl) if value == max(vl)]
print(indices)
Via Traditional Loop:
vl = [40.08, 36.6, 41.0, 35.2, 41.0]
indices = []
for index, value in enumerate(vl):
    if value == max(vl):
        indices.append(index)
print(indices)
Both should give you:
[2, 4]
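
One performance note: both versions above re-evaluate max(vl) on every iteration, which makes them O(n^2). Hoisting the call out of the loop is a simple improvement; a minimal sketch:
vl = [40.08, 36.6, 41.0, 35.2, 41.0]
max_value = max(vl)  # compute the maximum once instead of once per element
indices = [index for index, value in enumerate(vl) if value == max_value]
print(indices)  # [2, 4]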

Replace specific column values with pd.NA

I am working on a data set that contains longitude and latitude values.
I converted those values to clusters using DBSCAN.
Then I plotted the clusters just as a sanity check.
I get this:
The point at (0, 0) is obviously an issue.
So I ran this code to capture which row(s) are a problem.
a = df3.loc[(df3['latitude'] < 0.01) & (df3['longitude'] < 0.01)].index
print(a) # 1812 rows with 0.0 longitude and -2e-08 latitude
I have 1812 rows with missing data all represented as 0.0 longitude and -2e-08 latitude in the source file.
I am debating some imputation strategies, but first I want to replace the 0.0 and -2e-08 values
with pd.NA or np.nan so that I can then use fillna() with whatever I ultimately decide to do.
I have tried both:
df3.replace((df3['longitude'] == 0.0), pd.NA, inplace=True)
df3.replace((df3['latitude'] == -2e-08), pd.NA, inplace=True)
print(df3['longitude'].value_counts(dropna=False), '\n')
print(df3['latitude'].value_counts(dropna=False))
and
df3.replace((df3['longitude'] < 0.01), pd.NA, inplace=True)
df3.replace((df3['latitude'] < 0.01), pd.NA, inplace=True)
print(df3['longitude'].value_counts(dropna=False), '\n')
print(df3['latitude'].value_counts(dropna=False))
In both cases the existing values remain in place, i.e., the desired substitution with pd.NA
does not occur.
What is the correct procedure to replace the unwanted 1812 values in both the latitude and longitude columns with pd.NA or np.nan? I simply plan to impute something to replace the null values afterwards.
Try this one out:
df3['longitude'] = df3['longitude'].apply(lambda x: np.nan if x == 0.0 else x)
df3['latitude'] = df3['latitude'].apply(lambda x: np.nan if x == -2e-08 else x)
print(df3['longitude'].value_counts(dropna=False), '\n')
print(df3['latitude'].value_counts(dropna=False))
With an example:
import numpy as np
import pandas as pd

a = [1, 2, 0.0, -2e-08]
b = [1, 2, 0.0, -2e-08]
df = pd.DataFrame(zip(a, b))
df.columns = ['lat', 'long']
df.long = df.long.apply(lambda x: np.nan if x == 0.0 else x)
df.lat = df.lat.apply(lambda x: np.nan if x == -2e-08 else x)
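
As an aside, the element-wise apply() can be avoided entirely: Series.mask() replaces values where a boolean condition holds (with NaN by default) in a vectorized way. A minimal sketch on the same df3 columns, assuming the setup above:
df3['longitude'] = df3['longitude'].mask(df3['longitude'] == 0.0)    # 0.0 becomes NaN
df3['latitude'] = df3['latitude'].mask(df3['latitude'] == -2e-08)    # -2e-08 becomes NaN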

How can I replace nan with 1 under a specific condition?

In my program, I have some arrays and I am using a simple formula to calculate a value. The code I am using:
from itertools import combinations
import numpy as np
res = [
    np.array([[12.99632095], [29.60571445], [-1.85595153], [68.78926787], [2.75185088], [2.75204384]]),
    np.array([[15.66458062], [0], [-3.75927882], [0], [2.30128711], [197.45459974]]),
    np.array([[10.66458062], [0], [0], [-2.65954113], [-2.30128711], [197.45459974]]),
]
def cal():
    pairs = combinations(res, 2)
    results = []
    for pair in pairs:
        r = np.concatenate(pair, axis=1)
        r1 = r[:, 0]
        r2 = r[:, 1]
        sign = np.sign(r1 * r2)
        result = np.multiply(sign, np.min(np.abs(r), axis=1) / np.max(np.abs(r), axis=1))
        results.append(result)
    return results
The output I am getting is
RuntimeWarning: invalid value encountered in true_divide
result = np.multiply(sign, np.min(np.abs(r), axis=1) / np.max(np.abs(r), axis=1))
[array([0.82966287, 0.        , 0.49369882, 0.        , 0.83626883, 0.0139376 ]),
 array([ 0.82058458,  0.        ,  0.        , -0.03866215, -0.83626883,  0.0139376 ]),
 array([ 0.68080856,         nan,  0.        ,  0.        , -1.        ,  1.        ])]
Here, I am getting nan in the third output array. I understand I got nan because of 0/0.
The size of the arrays and the positions of the zeros are not fixed, so I want to change the code so that whenever 0/0 occurs, I save 1 instead of nan.
Could you tell me how can I handle this nan?
One possible solution:
import itertools as it

def cal():
    pairs = it.combinations(res, 2)
    rv = []
    for pair in pairs:
        r = np.concatenate(pair, axis=1)
        sign = np.sign(np.prod(r, axis=1))
        t1 = np.min(np.abs(r), axis=1)
        t2 = np.max(np.abs(r), axis=1)
        ratio = np.full(shape=r.shape[0], fill_value=1.)
        np.divide(t1, t2, out=ratio, where=np.not_equal(t2, 0.))
        wrk = np.full(shape=r.shape[0], fill_value=1.)
        np.multiply(sign, ratio, out=wrk, where=np.not_equal(t2, 0.))
        rv.append(wrk)
    return rv
Instead of np.sign(r1 * r2) I used np.sign(np.prod(r, axis=1)).
The trick for getting a default value instead of NaN is to create an array
pre-filled with that default and call np.divide with the out and where
arguments, so the division only happens where the divisor is nonzero.
The last step, np.multiply, uses the same where condition.
To test this code and pretty print the result, you can run:
with np.printoptions(formatter={'float': '{: 9.5f}'.format}):
result = cal()
for tbl in result:
print(tbl)
The result is:
[ 0.82966 0.00000 0.49370 0.00000 0.83627 0.01394]
[ 0.82058 0.00000 0.00000 -0.03866 -0.83627 0.01394]
[ 0.68081 1.00000 0.00000 0.00000 -1.00000 1.00000]
As you can see, the last row contains two 1.0 values, corresponding to the positions where the second and third source arrays hold identical values (including the 0/0 case).
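
A shorter alternative (my own sketch, not the answer above, and the function name cal_nan_to_num is hypothetical) is to let 0/0 produce nan and substitute afterwards with np.nan_to_num, which accepts a nan= replacement value on numpy >= 1.17:
def cal_nan_to_num():
    results = []
    for pair in it.combinations(res, 2):
        r = np.concatenate(pair, axis=1)
        sign = np.sign(np.prod(r, axis=1))
        with np.errstate(invalid='ignore'):  # silence the 0/0 RuntimeWarning
            ratio = np.min(np.abs(r), axis=1) / np.max(np.abs(r), axis=1)
        # 0/0 yields nan; replace it with the desired default of 1
        results.append(np.nan_to_num(sign * ratio, nan=1.0))
    return results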

Starting with sklearn's NearestNeighbors output, how do I remove results where the record is its own nearest neighbor?

I want to use sklearn's NearestNeighbors() model to do some data analysis.
In my use case, I want to grab the N nearest neighbors and put them back into a pandas dataframe to evaluate the similarity of different records.
However, the results include the original record. In my case, that isn't useful; I want the nearest different records.
Example:
xtest = np.array([[1, 1, 1], [1, 1, 1], [1, .8, 1], [.8, 1, 1]])
nn = NearestNeighbors(n_neighbors=2)
nn.fit(xtest)
distances, indices = nn.kneighbors(xtest)
returns:
(array([[0. , 0. ],
        [0. , 0. ],
        [0. , 0.2],
        [0. , 0.2]]),
 array([[0, 1],
        [0, 1],
        [2, 1],
        [3, 1]], dtype=int64))
In the above arrays the cells at indices (0,0), (1,1), (2, 0) and (3,0) are unimportant.
My goal is to manipulate this output so that I can create the following columns in pandas:
"NearestNeighbor1" - the index of the nearest record other than itself
"NearestNeighbor1_dist" -the distance of the nearest record other than itself even if the distance is zero.
"NearestNeighbor2" - the index of the next nearest record other than itself
"NearestNeighbor2_dist" -the distance of the nearest record other than itself even if the distance is zero.
In the event of a tie, I don't care which record comes first (as long as it isn't itself).
from sklearn.neighbors import NearestNeighbors
import numpy as np
import pandas as pd
xtest = np.array([[1,1,1], [1,1,1], [1,.8,1], [.8,1,1]])
nn = NearestNeighbors(n_neighbors=3)
nn.fit(xtest)
distances, indices = nn.kneighbors(xtest)
df_ind = pd.DataFrame(data=indices)
df_ind = df_ind.apply(func=lambda x: [y for y in x if y != x.name], axis=1, result_type='expand')
df = pd.DataFrame({'NearestNeighbor1': df_ind.iloc[:, 0],
                   'NearestNeighbor1_dist': distances[:, 1],
                   'NearestNeighbor2': df_ind.iloc[:, 1],
                   'NearestNeighbor2_dist': distances[:, 2]})
print(df)
Output:
NearestNeighbor1 NearestNeighbor1_dist NearestNeighbor2 NearestNeighbor2_dist
0 1 0.0 2 0.2
1 0 0.0 2 0.2
2 1 0.2 0 0.2
3 1 0.2 0 0.2
This solution works for an arbitrary number of neighbors, although I wonder if there is a more elegant solution using numpy (see the masking sketch after the code below).
N_NBRS = 4
nbrs = NearestNeighbors(n_neighbors=N_NBRS + 1, algorithm='brute')
nbrs.fit(X)
dist_n, ix_n = nbrs.kneighbors(X)
replacement = []
for row_idx, row in enumerate(ix_n):
    new_row = [val for val in row if val != row_idx]
    new_row = new_row[:N_NBRS]  # truncate in the event of many exact matches
    replacement.append(new_row)
ix_n2 = np.array(replacement)
dist_n2 = dist_n[:, 1:]
results = X.copy()
for col_idx in range(N_NBRS):
    results[f'Neighbor{col_idx + 1}'] = ix_n2[:, col_idx]
    results[f'Neighbor{col_idx + 1}_dist'] = dist_n2[:, col_idx]
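
On the "more elegant numpy" question: a boolean mask can drop each record's own index without the Python loop. A minimal sketch using ix_n and dist_n from the code above; the fallback for rows where self never appears (many exact duplicates) is my own assumption:
row_ids = np.arange(ix_n.shape[0])[:, None]
mask = ix_n != row_ids                      # True wherever the neighbor is not the record itself
mask[mask.all(axis=1), -1] = False          # self absent: drop the farthest neighbor instead
ix_n2 = ix_n[mask].reshape(ix_n.shape[0], -1)     # every row now keeps exactly N_NBRS entries
dist_n2 = dist_n[mask].reshape(dist_n.shape[0], -1)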

How can I compare two arrays with different sizes but with some floats that are approximate? [Python3]

How can I compare two arrays with different sizes but with some floats that are approximate? For example:
# I have two arrays
a = np.array( [-2.83, -2.54, ..., 0.05, ..., 2.54, 2.83] )
b = np.array( [-3.0, -2.9, -2.8, ..., -0.1, 0.0, 0.1, ..., 2.9, 3.0] )
# wherein len( b ) > len( a )
What I need is the index where (considering those two values from both lists)
math.isclose( -2.54, -2.5, rel_tol=1e-1) == True
The answer that I need is something like
list_of_index_of_b = [1, 5, ..., -2]
Here list_of_index_of_b is a list with the "coordinates" where that specific element of b is approximately equal to some element of a. Not all elements of a have an approximate match in b. Also:
len(list_of_index_of_b) == len(a)
You can use broadcasting. This creates an array of the pairwise differences between every element in a and b which you can then check against the specified tolerance.
Of course this is computationally inefficient from a complexity standpoint, since you construct an array of size |a|*|b| and compare every pairwise difference against the tolerance, even if one of the differences is already small enough. That said, if one of |a| or |b| is relatively small, this approach can be quite fast, since it is pure numpy and requires no loops.
a = np.array([1, 5, 6, 7])
b = np.array([1.1, 2, 3, 4.8, 4.9, 5, 8])
rtol = 0.15
diff = np.abs(a - b[:, None])              # pairwise |a_j - b_i|, shape (len(b), len(a))
mask2d = diff / np.abs(b[:, None]) < rtol  # relative to b; b[:, None] keeps the broadcast aligned
mask = np.any(mask2d, axis=1)
This can be combined into a single line:
indices = np.where(np.any(np.abs(a - b[:, None]) / np.abs(b[:, None]) < rtol, axis=1))
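
For completeness, numpy's own np.isclose broadcasts the same way and mirrors the math.isclose call in the question (its tolerance is relative to the second argument); a minimal sketch with the example data above:
import numpy as np

a = np.array([1, 5, 6, 7])
b = np.array([1.1, 2, 3, 4.8, 4.9, 5, 8])

# pairwise check |b_i - a_j| <= atol + rtol * |a_j| via broadcasting
mask = np.isclose(b[:, None], a, rtol=0.15).any(axis=1)
indices = np.flatnonzero(mask)  # indices of b with an approximate match in a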
