argsort on numpy.array as generator - python-3.x

I am new to Python, so maybe I am doing something wrong. Let me first explain what I want.
I have a huge 1-D numpy.array and I need the indices of its n smallest values for a later computation. Of course I could just do something like ind = numpy.argsort(hugearray)[:n].
The problem is that I don't know beforehand how many indices I need: my computation is iterative and fetches indices one by one until there are enough.
I also want a lazy argsort, to avoid creating a whole new array of argsorted values and to prevent unnecessary searching, so I thought of a generator. But I honestly don't know how to do that with a numpy.array.
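UPD: based on hpaulj's answer, I tried to create a generator:
import numpy as np

def gargsort(arr):
    arr = arr.copy()  # work on a copy so the caller's array is untouched
    for i in range(len(arr)):
        k = np.argmin(arr)
        # overwrite the current minimum with the dtype's largest value (integer arrays only)
        arr[k] = np.iinfo(arr.dtype).max
        yield k
Maybe it's possible to do this better?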

Here's an iterative approach that appears to be faster than argsort, provided n isn't too large:
In [135]: arr = np.arange(200000)
In [136]: np.random.shuffle(arr)
In [137]: def foo(arr):
     ...:     arr = arr.copy()
     ...:     alist = []
     ...:     for i in range(10):
     ...:         k = np.argmin(arr)
     ...:         alist.append(k)
     ...:         arr[k] = 200000
     ...:     return alist
     ...:
In [138]: foo(arr)
Out[138]: [176806, 180397, 139992, 151809, 59931, 59866, 130026, 191357, 84166, 130359]
In [139]: np.argsort(arr)[:10]
Out[139]:
array([176806, 180397, 139992, 151809, 59931, 59866, 130026, 191357,
84166, 130359], dtype=int32)
In [140]: timeit np.argsort(arr)[:10]
100 loops, best of 3: 15.8 ms per loop
In [141]: timeit foo(arr)
1000 loops, best of 3: 1.69 ms per loop
(I'll comment later if needed).
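For the lazy, generator-style argsort the question asks about, one possible sketch (not taken from the answer above) is a binary heap of (value, index) pairs: building the heap is linear in the array size, and each further index then costs O(log n) instead of a full argmin scan. The name lazy_argsort is made up for illustration, and the approach assumes that the extra memory for a Python list of pairs is acceptable and that the array contains no NaN values.

```python
import heapq
from itertools import islice

import numpy as np


def lazy_argsort(arr):
    """Yield indices of `arr` in ascending order of value, one at a time.

    Hypothetical helper; ties are broken by the smaller index.
    """
    # Pair every value with its index; heapify runs in O(n).
    heap = list(zip(np.asarray(arr).tolist(), range(len(arr))))
    heapq.heapify(heap)
    while heap:
        _, idx = heapq.heappop(heap)  # O(log n) per yielded index
        yield idx


arr = np.array([5, 1, 4, 1, 3])
gen = lazy_argsort(arr)
first_two = list(islice(gen, 2))  # fetch two indices now
next_one = next(gen)              # fetch one more later, on demand
```

Because the generator keeps its own heap, the caller can stop consuming it at any point without paying for the rest of the sort.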

Related

Calculate similarity of 1-row dataframe and a large dataframe with the same columns in Python?

I have a very large dataframe (millions of rows) and every time I am getting a 1-row dataframe with the same columns.
For example:
df = pd.DataFrame({'a': [1,2,3], 'b': [2,3,-1], 'c': [-1,0.4,31]})
input = pd.DataFrame([[11, -0.44, 4]], columns=list('abc'))
I would like to calculate cosine similarity between the input and the whole df.
I am using the following:
from scipy.spatial.distance import cosine
df.apply(lambda row: 1 - cosine(row, input), axis=1)
But it's a bit slow. Tried with swifter package, and it seems to run faster.
Please advise what is the best practice for such a task, do it like this or change to another method?
I usually don't do matrix manipulation with DataFrame but with numpy.array. So I will first convert them
df_npy = df.values
input_npy = input.values
And then I don't want to use scipy.spatial.distance.cosine so I will take care of the calculation myself, which is to first normalize each of the vectors
df_npy = df_npy / np.linalg.norm(df_npy, axis=1, keepdims=True)
input_npy = input_npy / np.linalg.norm(input_npy, axis=1, keepdims=True)
And then matrix multiply them together:
df_npy @ input_npy.T
which will give you
array([[0.213],
[0.524],
[0.431]])
The reason I don't want to use scipy.spatial.distance.cosine is that it handles only one pair of vectors at a time, whereas the approach shown here handles all rows at once.
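The same computation can be wrapped in a small NumPy function. This is a minimal sketch using the sample data from the question; the name cosine_to_all is hypothetical, and it assumes no row is all zeros (a zero row would cause a division by zero).

```python
import numpy as np


def cosine_to_all(matrix, vector):
    """Cosine similarity between every row of `matrix` and `vector`."""
    matrix = np.asarray(matrix, dtype=float)
    vector = np.asarray(vector, dtype=float).ravel()
    # Dot product of each row with the vector, divided by the product of norms.
    dots = matrix @ vector
    norms = np.linalg.norm(matrix, axis=1) * np.linalg.norm(vector)
    return dots / norms


# Same values as df and input in the question, as plain arrays.
rows = np.array([[1, 2, -1], [2, 3, 0.4], [3, -1, 31]])
query = np.array([11, -0.44, 4])
sims = cosine_to_all(rows, query)
```

With a DataFrame, passing df.values and input.values should work the same way, since the function flattens the query to 1-D.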

Using a subset of classes in ImageNet

I'm aware that subsets of ImageNet exist, however they don't fulfill my requirement. I want 50 classes at their native ImageNet resolutions.
To this end, I used torch.utils.data.dataset.Subset to select specific classes from ImageNet. However, it turns out, class labels/indices must be greater than 0 and less than num_classes.
Since ImageNet contains 1000 classes, the idx of my selected classes quickly goes over 50. How can I reassign the class indices and do so in a way that allows for evaluation later down the road as well?
Is there a more elegant way to select a subset?
I am not sure I understood your conclusions about labels being greater than zero and less than num_classes. The torch.utils.data.Subset helper takes a torch.utils.data.Dataset and a sequence of indices; these are the indices of the data points from the Dataset that you would like to keep in the subset. They have nothing to do with the classes the data points belong to.
Here's how I would approach this:
Load your dataset through torchvision.datasets (custom datasets would work the same way). Here I will demonstrate it with FashionMNIST since ImageNet's data is not made available directly through torchvision's API.
>>> ds = torchvision.datasets.FashionMNIST('.')
>>> len(ds)
60000
Define the classes you want to select for the subset dataset. And retrieve all indices from the main dataset which correspond to these classes:
>>> targets = [1, 3, 5, 9]
>>> indices = [i for i, label in enumerate(ds.targets) if label in targets]
You have your subset:
>>> ds_subset = Subset(ds, indices)
>>> len(ds_subset)
24000
At this point, you can use a dictionary to remap your labels using targets, so that each original class index maps to its position in targets:
>>> remap = {x: i for i, x in enumerate(targets)}
>>> remap
{1: 0, 3: 1, 5: 2, 9: 3}
For example:
>>> x, y = ds_subset[10]
>>> y, remap[y] # old_label, new_label
(1, 0)
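To apply such a remapping automatically, and to keep the inverse mapping around for evaluation, one option is a thin wrapper around the dataset. The sketch below is library-agnostic: RemappedSubset and its attributes are hypothetical names, the toy list stands in for a real dataset, and it assumes the dataset returns (sample, label) pairs and that labels can be converted with int().

```python
class RemappedSubset:
    """Hypothetical map-style wrapper: keeps only samples whose label is in
    `targets` and relabels them 0..len(targets)-1."""

    def __init__(self, dataset, labels, targets):
        self.dataset = dataset
        # original class id -> new contiguous id
        self.to_new = {old: new for new, old in enumerate(targets)}
        # new id -> original class id, kept for evaluation and reporting
        self.to_old = {new: old for old, new in self.to_new.items()}
        # positions in the full dataset that belong to a selected class
        self.indices = [i for i, y in enumerate(labels) if int(y) in self.to_new]

    def __len__(self):
        return len(self.indices)

    def __getitem__(self, i):
        x, y = self.dataset[self.indices[i]]
        return x, self.to_new[int(y)]


# Toy stand-in for a real dataset: a list of (sample, label) pairs.
toy = [("a", 1), ("b", 7), ("c", 9), ("d", 3), ("e", 1)]
subset = RemappedSubset(toy, labels=[y for _, y in toy], targets=[1, 3, 5, 9])
```

Since the wrapper only implements __len__ and __getitem__, it should be usable wherever a map-style dataset is expected; to_old recovers the original class index from a prediction.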

Speed up getting distance between two lat and lon

I have two DataFrames containing Lat and Lon. I want to find the distance from each (Lat, Lon) pair in one to ALL (Lat, Lon) pairs in the other DataFrame and get the minimum. I am using the geopy package. The code is as follows:
from geopy import distance
import numpy as np
distanceMiles = []
count = 0
for id1, row1 in df1.iterrows():
    target = (row1["LAT"], row1["LON"])
    count = count + 1
    print(count)
    for id2, row2 in df2.iterrows():
        point = (row2["LAT"], row2["LON"])
        distanceMiles.append(distance.distance(target, point).miles)
    closestPoint = np.argmin(distanceMiles)
    distanceMiles = []
The problem is that df1 has 168K rows and df2 has 1200 rows. How do I make it faster?
geopy.distance.distance uses the geodesic algorithm by default, which is more accurate but rather slow. If you can trade accuracy for speed, you can use great_circle, which is ~20 times faster:
In [4]: %%timeit
...: distance.distance(newport_ri, cleveland_oh).miles
...:
236 µs ± 1.67 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [5]: %%timeit
...: distance.great_circle(newport_ri, cleveland_oh).miles
...:
13.4 µs ± 94.5 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
Also you may use multiprocessing to parallelize the computation:
from multiprocessing import Pool
from geopy import distance
import numpy as np
def compute(points):
    target, point = points
    return distance.great_circle(target, point).miles
with Pool() as pool:
    for id1, row1 in df1.iterrows():
        target = (row1["LAT"], row1["LON"])
        distanceMiles = pool.map(
            compute,
            (
                (target, (row2["LAT"], row2["LON"]))
                for id2, row2 in df2.iterrows()
            )
        )
        closestPoint = np.argmin(distanceMiles)
Leaving this here in case anyone needs it in the future:
If you need only the minimum distance, then you don't have to bruteforce all the pairs. There are some data structures that can help you solve this in O(n*log(n)) time complexity, which is way faster than the bruteforce method.
For example, you can use a generalized KNearestNeighbors (with k=1) algorithm to do exactly that, given that you pay attention to your points being on a sphere, not a plane. See this SO answer for an example implementation using sklearn.
There seem to be a few libraries that solve this too, like sknni and GriSPy.
Here's also another question that talks a bit about the theory.
This should run much faster if you utilize itertools instead of explicit for loops. Inline comments should help you understand what's happening at each step.
import numpy as np
import pandas as pd
import itertools
from geopy import distance
#Creating 2 sample dataframes with 10 and 5 rows of lat, long columns respectively
df1 = pd.DataFrame({'LAT':np.random.random(10,), 'LON':np.random.random(10,)})
df2 = pd.DataFrame({'LAT':np.random.random(5,), 'LON':np.random.random(5,)})
#Zip the 2 columns to get (lat, lon) tuples for target in df1 and point in df2
target = list(zip(df1['LAT'], df1['LON']))
point = list(zip(df2['LAT'], df2['LON']))
#Product function in itertools does a cross product between the 2 iterables
#You should get things of the form ( ( lat, lon), (lat, lon) ) where 1st is target, second is point. Feel free to change the order if needed
product = list(itertools.product(target, point))
#starmap(function, parameters) maps the distance function to the list of tuples. Later you can use i.miles for conversion
geo_dist = [i.miles for i in itertools.starmap(distance.distance, product)]
len(geo_dist)
50
geo_dist = [42.430772028845716,
44.29982320107605,
25.88823239877388,
23.877570442142783,
29.9351451072828,
...]
Finally,
If you are working with a massive dataset, then I would recommend using multiprocessing library to map the itertools.starmap to different cores and asynchronously compute the distance values. Python Multiprocessing library now supports starmap.
If you need to check all the pairs by brute force, I think the following approach is the best you can do.
Looping directly on columns is usually slightly faster than iterrows, and the vectorized approach replacing the inner loop saves time too.
count = 0
for lat1, lon1 in zip(df1["LAT"], df1["LON"]):
    target = (lat1, lon1)
    count = count + 1
    # print(count)  # printing is also time-expensive
    # distance from the current df1 point to every row of df2
    df2['dist'] = df2.apply(lambda row: distance.distance(target, (row['LAT'], row['LON'])).miles, axis=1)
    closestpoint = df2['dist'].min()  # if you want the minimum distance
    closestpoint = df2['dist'].idxmin()  # if you want the position (index) of the minimum
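If great-circle accuracy is acceptable, the brute-force search can also be vectorized in NumPy with the haversine formula instead of calling geopy once per pair. The following is only a sketch under stated assumptions: nearest_haversine, its chunk parameter and the EARTH_RADIUS_MILES constant are illustrative choices that do not come from geopy, and the results will differ slightly from geopy's geodesic distances. The rows of the first table are processed in chunks so that the full pairwise distance matrix never has to fit in memory at once.

```python
import numpy as np

EARTH_RADIUS_MILES = 3958.8  # assumed mean Earth radius, for illustration


def nearest_haversine(lat1, lon1, lat2, lon2, chunk=10_000):
    """For each point in set 1, return (index of nearest point in set 2,
    great-circle distance to it in miles). Inputs are in degrees."""
    lat1, lon1, lat2, lon2 = (
        np.radians(np.asarray(a, dtype=float)) for a in (lat1, lon1, lat2, lon2)
    )
    nearest = np.empty(lat1.size, dtype=np.intp)
    miles = np.empty(lat1.size)
    for start in range(0, lat1.size, chunk):
        sl = slice(start, start + chunk)
        # Broadcast a (chunk, 1) column against a (1, m) row -> (chunk, m).
        dlat = lat2[None, :] - lat1[sl, None]
        dlon = lon2[None, :] - lon1[sl, None]
        a = (np.sin(dlat / 2) ** 2
             + np.cos(lat1[sl, None]) * np.cos(lat2[None, :]) * np.sin(dlon / 2) ** 2)
        # Clip guards against tiny floating-point overshoot before arcsin.
        d = 2 * EARTH_RADIUS_MILES * np.arcsin(np.sqrt(np.clip(a, 0.0, 1.0)))
        nearest[sl] = d.argmin(axis=1)
        miles[sl] = d.min(axis=1)
    return nearest, miles


idx, dist = nearest_haversine([0.0, 10.0], [0.0, 10.0],
                              [9.0, 1.0, 50.0], [9.0, 1.0, 50.0])
```

With the DataFrames from the question, the call would presumably look like nearest_haversine(df1["LAT"], df1["LON"], df2["LAT"], df2["LON"]).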

Subtracting the mean from the values of an array with Python [duplicate]

I try to subtract the mean of each row of a matrix in numpy using broadcasting but I get an error. Any idea why?
Here is the code:
from numpy import *
X = random.rand(5, 10)
Y = X - X.mean(axis = 1)
Error:
ValueError: operands could not be broadcast together with shapes (5,10) (5,)
Thanks!
The mean method is a reduction operation, meaning it converts a 1-d collection of numbers to a single number. When you apply a reduction to an n-dimensional array along an axis, numpy collapses that dimension to the reduced value, resulting in an (n-1)-dimensional array. In your case, since X has shape (5, 10), and you performed a reduction along axis 1, you end up with an array with shape (5,):
In [8]: m = X.mean(axis=1)
In [9]: m.shape
Out[9]: (5,)
When you try to subtract this result from X, you are trying to subtract an array with shape (5,) from an array with shape (5, 10). These shapes are not compatible for broadcasting. (Take a look at the description of broadcasting in the User Guide.)
For broadcasting to work the way you want, the result of the mean operation should be an array with shape (5, 1) (to be compatible with the shape (5, 10)). In recent versions of numpy, the reduction operations, including mean, have an argument called keepdims that tells the function to not collapse the reduced dimension. Instead, a trivial dimension with length 1 is kept:
In [10]: m = X.mean(axis=1, keepdims=True)
In [11]: m.shape
Out[11]: (5, 1)
With older versions of numpy, you can use reshape to restore the collapsed dimension:
In [12]: m = X.mean(axis=1).reshape(-1, 1)
In [13]: m.shape
Out[13]: (5, 1)
So, depending on your version of numpy, you can do this:
Y = X - X.mean(axis=1, keepdims=True)
or this:
Y = X - X.mean(axis=1).reshape(-1, 1)
If you are looking for performance, you can also consider using np.einsum that is supposedly faster than actually using np.sum or np.mean. Thus, the desired output could be obtained like so -
X - np.einsum('ij->i',X)[:,None]/X.shape[1]
Please note that the [:,None] part is similar to keepdims to keep the dimensions of it same as that of the input array. This could also be used in broadcasting.
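As a quick sanity check, the three variants can be compared on a small random matrix. This is a minimal sketch; the seeded generator and the variable names are arbitrary choices made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((5, 10))

a = X - X.mean(axis=1, keepdims=True)                # mean has shape (5, 1)
b = X - X.mean(axis=1).reshape(-1, 1)                # (5,) reshaped to (5, 1)
c = X - np.einsum('ij->i', X)[:, None] / X.shape[1]  # row sums / row length

same = np.allclose(a, b) and np.allclose(a, c)
row_means = a.mean(axis=1)  # each entry should be ~0 after centering
```

All three rely on the same broadcasting rule: a (5, 1) array stretches along axis 1 to match (5, 10).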
Runtime tests
1) Comparing just the mean calculation -
In [47]: X = np.random.rand(500, 1000)
In [48]: %timeit X.mean(axis=1, keepdims=True)
1000 loops, best of 3: 1.5 ms per loop
In [49]: %timeit X.mean(axis=1).reshape(-1, 1)
1000 loops, best of 3: 1.52 ms per loop
In [50]: %timeit np.einsum('ij->i',X)[:,None]/X.shape[1]
1000 loops, best of 3: 832 µs per loop
2) Comparing entire calculation -
In [52]: X = np.random.rand(500, 1000)
In [53]: %timeit X - X.mean(axis=1, keepdims=True)
100 loops, best of 3: 6.56 ms per loop
In [54]: %timeit X - X.mean(axis=1).reshape(-1, 1)
100 loops, best of 3: 6.54 ms per loop
In [55]: %timeit X - np.einsum('ij->i',X)[:,None]/X.shape[1]
100 loops, best of 3: 6.18 ms per loop

Convert list of numpy.float64 to float in Python quickly

What is the fastest way of converting a list of elements of type numpy.float64 to type float? I am currently using the straightforward for loop iteration in conjunction with float().
I came across this post: Converting numpy dtypes to native python types, however my question isn't one of how to convert types in python but rather more specifically how to best convert an entire list of one type to another in the quickest manner possible in python (i.e. in this specific case numpy.float64 to float). I was hoping for some secret python machinery that I hadn't come across that could do it all at once :)
The tolist() method should do what you want. If you have a numpy array, just call tolist():
In [17]: a
Out[17]:
array([ 0. , 0.14285714, 0.28571429, 0.42857143, 0.57142857,
0.71428571, 0.85714286, 1. , 1.14285714, 1.28571429,
1.42857143, 1.57142857, 1.71428571, 1.85714286, 2. ])
In [18]: a.dtype
Out[18]: dtype('float64')
In [19]: b = a.tolist()
In [20]: b
Out[20]:
[0.0,
0.14285714285714285,
0.2857142857142857,
0.42857142857142855,
0.5714285714285714,
0.7142857142857142,
0.8571428571428571,
1.0,
1.1428571428571428,
1.2857142857142856,
1.4285714285714284,
1.5714285714285714,
1.7142857142857142,
1.857142857142857,
2.0]
In [21]: type(b)
Out[21]: list
In [22]: type(b[0])
Out[22]: float
If, in fact, you really have a Python list of numpy.float64 objects, then @Alexander's answer is great, or you could convert the list to an array and then use the tolist() method. E.g.
In [46]: c
Out[46]:
[0.0,
0.33333333333333331,
0.66666666666666663,
1.0,
1.3333333333333333,
1.6666666666666665,
2.0]
In [47]: type(c)
Out[47]: list
In [48]: type(c[0])
Out[48]: numpy.float64
@Alexander's suggestion, a list comprehension:
In [49]: [float(v) for v in c]
Out[49]:
[0.0,
0.3333333333333333,
0.6666666666666666,
1.0,
1.3333333333333333,
1.6666666666666665,
2.0]
Or, convert to an array and then use the tolist() method.
In [50]: np.array(c).tolist()
Out[50]:
[0.0,
0.3333333333333333,
0.6666666666666666,
1.0,
1.3333333333333333,
1.6666666666666665,
2.0]
If you are concerned with the speed, here's a comparison. The input, x, is a python list of numpy.float64 objects:
In [8]: type(x)
Out[8]: list
In [9]: len(x)
Out[9]: 1000
In [10]: type(x[0])
Out[10]: numpy.float64
Timing for the list comprehension:
In [11]: %timeit list1 = [float(v) for v in x]
10000 loops, best of 3: 109 µs per loop
Timing for conversion to numpy array and then tolist():
In [12]: %timeit list2 = np.array(x).tolist()
10000 loops, best of 3: 70.5 µs per loop
So it is faster to convert the list to an array and then call tolist().
You could use a list comprehension:
floats = [float(np_float) for np_float in np_float_list]
So out of the possible solutions I've come across (big thanks to Warren Weckesser and Alexander for pointing out all of the best possible approaches) I ran my current method and that presented by Alexander to give a simple comparison for runtimes (the two choices come as a result of the fact that I have a true list of elements of numpy.float64 and wish to convert them to float speedily):
2 approaches covered: list comprehension and basic for loop iteration
First here's the code:
import time
import numpy
list1 = []
for i in range(0,1000):
    list1.append(numpy.float64(i))
list2 = []
t_init = time.time()
for num in list1:
    list2.append(float(num))
t_1 = time.time()
list2 = [float(np_float) for np_float in list1]
t_2 = time.time()
print("t1 run time: {}".format(t_1-t_init))
print("t2 run time: {}".format(t_2-t_1))
I ran four times to give a quick set of results:
>>> run 1
t1 run time: 0.000179290771484375
t2 run time: 0.0001533031463623047
Python 3.4.0
>>> run 2
t1 run time: 0.00018739700317382812
t2 run time: 0.0001518726348876953
Python 3.4.0
>>> run 3
t1 run time: 0.00017976760864257812
t2 run time: 0.0001513957977294922
Python 3.4.0
>>> run 4
t1 run time: 0.0002455711364746094
t2 run time: 0.00015997886657714844
Python 3.4.0
Of these two approaches, the list comprehension was consistently faster for converting a true list of numpy.float64 to float.
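Single time.time() measurements of sub-millisecond code are noisy, so a repeated measurement with the standard timeit module may give a steadier comparison of all three approaches mentioned on this page. This is only a sketch: the function names, list size and repeat counts are arbitrary choices, and the ranking can vary with the Python and numpy versions.

```python
import timeit

import numpy as np

x = [np.float64(i) for i in range(1000)]  # a true list of numpy.float64


def with_loop(values):
    out = []
    for v in values:
        out.append(float(v))
    return out


def with_comprehension(values):
    return [float(v) for v in values]


def with_tolist(values):
    return np.array(values).tolist()


candidates = (with_loop, with_comprehension, with_tolist)
results = {}
for fn in candidates:
    # Best of 5 repeats of 200 calls each, reported as seconds per call.
    best = min(timeit.repeat(lambda: fn(x), number=200, repeat=5))
    results[fn.__name__] = best / 200
    print(f"{fn.__name__}: {results[fn.__name__] * 1e6:.1f} µs per call")
```

Taking the minimum over several repeats reduces the influence of other processes on the measurement.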
