np.where issue above a certain value (#Numpy) - python-3.x

I'm facing two issues in the following snippet using np.where (looking for indexes where A[:, 0] matches values in B):
a NumPy error when n is above a certain value (see the warning below)
it is quite slow
DeprecationWarning: elementwise comparison failed; this will raise an error in the future.
So I'm wondering what I'm missing and/or misunderstanding, how to fix it, and how to speed up the code. This is a basic example I've made to mimic my code, but in reality I'm dealing with arrays with (dozens of) millions of rows.
Thanks for your support
Paul
import numpy as np
import time
n=100_000 # with n=10_000 it is OK but quite slow
m=2_000_000
#matrix A
# A=np.random.random ((n, 4))
A = np.arange(1, 4*n+1, dtype=np.uint64).reshape((n, 4), order='F')
#Matrix B
B=np.random.randint(1, m+1, size=(m), dtype=np.uint64)
B=np.unique(B) # duplicate values are generally generated, so the real size remains lower than m
# use of np.where
t0=time.time()
ind=np.where(A[:, 0].reshape(-1, 1) == B)
# ind2=np.where(B == A[:, 0].reshape(-1, 1))
t1=time.time()
print(f"duration={t1-t0}")

In your current implementation, A[:, 0] is just
np.arange(1, n+1, dtype=np.uint64)
And if you are interested only in row indexes where A[:, 0] is in B, then you can get them like this:
first_col_of_A = A[:, 0]
row_indices = np.where(np.isin(first_col_of_A, B))[0]
If you then want to select the rows of A with these indices, you don't even have to convert the boolean mask to index locations. You can just select the rows with the boolean mask: A[np.isin(first_col_of_A, B)]
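For example, a minimal sketch of that approach, assuming the same A and B construction as in the question (np.isin avoids the huge broadcasted comparison that triggers the warning):

import numpy as np

n, m = 100_000, 2_000_000
A = np.arange(1, 4*n + 1, dtype=np.uint64).reshape((n, 4), order='F')
B = np.unique(np.random.randint(1, m + 1, size=m, dtype=np.uint64))

mask = np.isin(A[:, 0], B)        # boolean mask, one entry per row of A
row_indices = np.where(mask)[0]   # positions where the first column appears in B
matching_rows = A[mask]           # or select the rows directly with the mask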
There are better ways to select random elements from an array. For example, you could use numpy.random.Generator.choice with replace=False. Also, Numpy: Get random set of rows from 2D array.
I feel there is almost certainly a better way to do the whole thing that you are trying to do with these index locations.
I recommend you study the Numpy User Guide and the Pandas User Guide to see what cool things are available there.
Honestly, with your current implementation you don't even need the first column of A at all: A[:, 0] holds the values 1..n, so the value v sits in row v-1 and the row indices can be read off B directly. Here:
row_indices = B[B <= n] - 1
row_indices.sort()
print(row_indices)

Related

Python multiprocessing same function with different arguments

I have a program that is designed to calculate an order parameter from a coarse-grained molecular system. In the system I have different beads, which represent different parts of a molecule. Each of these beads has xyz-coordinates that represent its place in the system. The program works, but it is very slow since I have to calculate the number of beads of type i around beads of type j within a certain cutoff distance.
Function to calculate Euclidean distance between bead a and b:
def distance_ab(a, b):
    n_beads = 0
    for i in range(len(a)):
        for j in range(len(b)):
            # Euclidean distance
            dist = np.sqrt(np.sum((a[i] - b[j]) ** 2, axis=0))
            if dist <= 1.0 and dist > 0.0:  # cut-off distance
                n_beads += 1
    return n_beads
So I decided to speed up the distance calculations between different beads by using the Python multiprocessing library. But for some reason I cannot get multiprocessing to work for repeating the same distance calculation function with different parameters (xyz-data of the beads). Multiprocessing returns a list of numbers, when the idea is to return only one number (the number of beads within a certain cut-off distance). What am I doing wrong, and could someone help me understand where the problem is?
The part where I am trying to use multiprocessing:
with multiprocessing.Pool(os.cpu_count()) as pool:
    # go through a certain number of molecular simulation frames (e.g. 100 frames)
    for i in range(frames):
        # Calculating euclidean distances between different types of beads
        # for each frame
        a_b = pool.starmap(calculate_distances, zip(bead_a_array, bead_b_array))
        a_c = pool.starmap(calculate_distances, zip(bead_a_array, bead_c_array))
When you zip together your bead arrays, you are creating an iterable of tuples that overall has the same length as the shorter of the two arrays.
>>> A=[1,2,3]
>>> B=[4,5,6,7]
>>> res=zip(A,B)
>>> list(res)
[(1, 4), (2, 5), (3, 6)]
Looking at the documentation for starmap:
Like map() except that the elements of the iterable are expected to be iterables that are unpacked as arguments.
Hence an iterable of [(1,2), (3, 4)] results in [func(1,2), func(3,4)].
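A tiny, hypothetical illustration of that behaviour (func and the argument tuples are made up):

from multiprocessing import Pool

def func(x, y):
    return x + y

if __name__ == "__main__":
    with Pool(2) as pool:
        # each tuple is unpacked into func(x, y)
        print(pool.starmap(func, [(1, 2), (3, 4)]))  # [3, 7]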
So your starmaps are actually just passing a pair of elements (one from each array) to your function and returning the result for each of these pairs. If you want to use multiprocessing to determine, say, the number of b and c beads within the cutoff distance of a at the same time, you would need to do something like this:
import itertools as it

all_arr = [bead_a_array, bead_b_array, bead_c_array]
with multiprocessing.Pool(os.cpu_count()) as pool:
    a_counts = pool.starmap(distance_ab, it.combinations(all_arr, 2))
Here, instead of passing individual elements of each array to the function, it now passes each whole array into your function, and it will compute the counts of b and c within the threshold of a (and c within the threshold of b) simultaneously. The it.combinations(all_arr, 2) selects unique pairs of arrays to pass to your function.
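For reference, here is a rough, self-contained sketch of that pattern. The array sizes and random coordinates are made-up stand-ins for the real bead data, and distance_ab is the function from the question:

import itertools as it
import multiprocessing
import os

import numpy as np

def distance_ab(a, b):
    # count pairs of beads within the 1.0 cut-off (same logic as in the question)
    n_beads = 0
    for i in range(len(a)):
        for j in range(len(b)):
            dist = np.sqrt(np.sum((a[i] - b[j]) ** 2, axis=0))
            if dist <= 1.0 and dist > 0.0:
                n_beads += 1
    return n_beads

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # dummy xyz coordinates standing in for bead types a, b and c
    bead_a_array = rng.random((50, 3))
    bead_b_array = rng.random((60, 3))
    bead_c_array = rng.random((40, 3))

    all_arr = [bead_a_array, bead_b_array, bead_c_array]
    with multiprocessing.Pool(os.cpu_count()) as pool:
        # one task per pair of arrays: (a, b), (a, c), (b, c)
        counts = pool.starmap(distance_ab, it.combinations(all_arr, 2))
    print(counts)  # three counts, one per pair of bead types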

NumPy: how to calculate variance along each row of a 2D array using np.var and by hand (i.e., not using np.var; calculating each term explicitly)?

I am using Python to import data from large files. There are three columns corresponding to x, y, z data. Each row represents a time at which the data were collected. For example:
importedData = [[1, 2, 3],   # <-- this row: x, y, and z data at time 0
                [4, 5, 6],
                [7, 8, 9]]
I want to calculate the variance for each time (row). As far as I know, one way to do this is as follows (if this is not correct, I would appreciate a heads-up):
varPerTimestep = np.var(importedData, axis=1)
Here's my problem. To convince a coworker it works, I would next like to do the same thing, but avoid using np.var. This means solving:
Var(S) = ⟨S⋅S⟩ − ⟨S⟩⟨S⟩   # ⟨⋅⟩ = average over the row entries (x, y, z)
I'm an intermittent Python user and just can't figure out how to do this for each row. I found a suggestion online but don't know how to adapt the code below so that it applies to each row:
def variance(data, ddof=0):
    n = len(data)
    mean = sum(data) / n
    return sum((x - mean) ** 2 for x in data) / (n - ddof)
I have tried various things. For example, putting the function in a loop where I first attempted just getting a row average:
for row in importedData:
    mean_test = np.mean(importedData, axis=1)
    print(mean_test)
This gives me an error I can't figure out:
Traceback (most recent call last):
File "<string>", line 13, in <module>
TypeError: list indices must be integers or slices, not tuple
I also tried this and get no output because I seem to be stuck in a loop:
n = len(importedData[0,:]) # Trying to get the length of each row.
mean = mean(importedData[0,:]) # Likewise trying to get the mean of each row.
deviations = [(x - mean) ** 2 for x in importedData]
variance = sum(deviations) / n
If anyone could please point me in the right direction, I would be grateful.
Well you could do something like this to make things more explicit:
import numpy as np
importedData = np.arange(1,10).reshape(3,3)
# Get means for each row
means = [row.mean() for row in importedData]
# Calculate squared errors
squared_errors = [(row-mean)**2 for row, mean in zip(importedData, means)]
# Calculate "mean for each row of squared errors" (aka the variance)
variances = [row.mean() for row in squared_errors]
# Sanity check
print(variances)
print(importedData.var(1))
# [0.6666666666666666, 0.6666666666666666, 0.6666666666666666]
# [0.66666667 0.66666667 0.66666667]
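If the goal is to convince a coworker without using any NumPy statistics at all, a minimal sketch that applies the explicit variance() function quoted in the question to each row gives the same numbers (assuming ddof=0, which is also np.var's default):

import numpy as np

importedData = np.arange(1, 10).reshape(3, 3)

def variance(data, ddof=0):
    n = len(data)
    mean = sum(data) / n
    return sum((x - mean) ** 2 for x in data) / (n - ddof)

# apply the explicit formula to each row (each time step)
byHand = [variance(row) for row in importedData]
print(byHand)                         # [0.666..., 0.666..., 0.666...]
print(np.var(importedData, axis=1))   # matches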

TypeError: append() missing 1 required positional argument: 'values'

I have a variable 'x_data' sized 360x190, and I am trying to select particular rows of data.
x_data_train = []
x_data_train = np.append([x_data_train,
                          x_data[0:20,:],
                          x_data[46:65,:],
                          x_data[91:110,:],
                          x_data[136:155,:],
                          x_data[181:200,:],
                          x_data[226:245,:],
                          x_data[271:290,:],
                          x_data[316:335,:]], axis=0)
I get the following error:
TypeError: append() missing 1 required positional argument: 'values'
Where did I go wrong?
If I am using
x_data_train = []
x_data_train.append(x_data[0:20,:])
x_data_train.append(x_data[46:65,:])
x_data_train.append(x_data[91:110,:])
x_data_train.append(x_data[136:155,:])
x_data_train.append(x_data[181:200,:])
x_data_train.append(x_data[226:245,:])
x_data_train.append(x_data[271:290,:])
x_data_train.append(x_data[316:335,:])
the size of the output is 8 instead of 160 rows.
Update:
In matlab, I will load the text file and x_data will be variable having 360 rows and 190 columns.
If I want to select 1 to 20 , 46 to 65, ... rows of data , I simply give
x_data_train = xdata([1:20,46:65,91:110,136:155,181:200,226:245,271:290,316:335], :);
the resulting x_data_train will be the array of my desired.
How can I do that in Python? The approach above results in an array of 8 subsets of about 20*192 each, but I want it to be one 160*192 array.
Short version: the most idiomatic and fastest way to do what you want in python is this (assuming x_data is a numpy array):
x_data_train = np.vstack([x_data[0:20,:],
                          x_data[46:65,:],
                          x_data[91:110,:],
                          x_data[136:155,:],
                          x_data[181:200,:],
                          x_data[226:245,:],
                          x_data[271:290,:],
                          x_data[316:335,:]])
This can be shortened (but made very slightly slower) by doing:
xdata[np.r_[0:20,46:65,91:110,136:155,181:200,226:245,271:290,316:335], :]
For your case where you have a lot of indices I think it helps readability, but in cases where there are fewer indices I would use the first approach.
Long version:
There are several different issues at play here.
First, in python, [] makes a list, not an array like in MATLAB. Lists are more like 1D cell arrays. They can hold any data type, including other lists, but they cannot have multiple dimensions. The equivalent of MATLAB matrices in Python are numpy arrays, which are created using np.array.
Second, [x, y] in Python always creates a list where the first element is x and the second element is y. In MATLAB [x, y] can do one of several completely different things depending on what x and y are. In your case, you want to concatenate. In Python, you need to explicitly concatenate. For two lists, there are several ways to do that. The simplest is using x += y, which modifies x in-place by putting the contents of y at the end. You can combine multiple lists by doing something like x += y + z + w. If you want to keep x unchanged, you can assign to a new variable using something like z = x + y. Finally, you can use x.extend(y), which is roughly equivalent to x += y but works with some data types besides lists.
For numpy arrays, you need to use a slightly different approach. While Python lists can be modified in-place, strictly speaking neither MATLAB matrices nor numpy arrays can be. MATLAB pretends to allow this, but it is really creating a new matrix behind-the-scenes (which is why you get a warning if you try to resize a matrix in a loop). Numpy requires you to be more explicit about creating a new array. The simplest approach is to use np.hstack, which concatenates two arrays horizontally (or np.vstack or np.dstack for vertical and depth concatenation, respectively). So you could do z = np.hstack([v, w, x, y]). There is an np.append function in numpy, but it almost never works in practice so don't use it (it requires careful memory management that is more trouble than it is worth).
Third, what append does is to create one new element in the target list, and put whatever variable append is called with in that element. So if you do x.append([1,2,3]), it adds one new element to the end of list x containing the list [1,2,3]. In MATLAB terms it would be more like x = [x, {{1,2,3}}], where x is a cell array.
Fourth, Python makes heavy use of "methods", which are basically functions attached to data (it is a bit more complicated than that in practice, but those complexities aren't really relevant here). Recent versions of MATLAB have added them as well, but they aren't really integrated into MATLAB data types like they are in Python. So where in MATLAB you would usually use sum(x), for numpy arrays you would use x.sum(). In this case, assuming you were doing appending (which you aren't), you wouldn't use np.append(x, y), you would use x.append(y).
Finally, in MATLAB x:y creates a matrix of values from x to y. In Python, however, it creates a "slice", which doesn't actually contain all the values and so can be processed much more quickly by lists and numpy arrays. However, you can't really work with multiple slices like you do in your example (nor does it make sense to because slices in numpy don't make copies like they do in MATLAB, while using multiple indexes does make a copy). You can get something close to what you have in MATLAB using np.r_, which creates a numpy array based on indexes and slices. So to reproduce your example in numpy, where xdata is a numpy array, you can do xdata[np.r_[1:20,46:65,91:110,136:155,181:200,226:245,271:290,316:335], :]
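As a small, hypothetical illustration (a 10x4 toy array instead of the 360x190 one), np.r_ indexing and np.vstack of individual slices give the same result:

import numpy as np

x_data = np.arange(40).reshape(10, 4)   # toy stand-in for the real x_data

stacked = np.vstack([x_data[0:2, :], x_data[5:7, :]])
fancy = x_data[np.r_[0:2, 5:7], :]

print(np.array_equal(stacked, fancy))   # True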
More information on x_data and np might be needed to solve this but...
First: You're creating 2 copies of the same list: np and x_data_train
Second: Your indexes on x_data are strange
Third: You're passing 3 objects to append() when it only accepts 2.
I'm pretty sure revisiting your indexes on x_data will be where you solve the current error, but it will result in another error related to passing 2 values to append.
And I'm also sure you want
x_data_train.append(object)
not
x_data_train = np.append(object)
and you may actually want
x_data_train.extend([objects])
More on append vs extend here: append vs. extend

Efficient way to extend a numpy array with values that do not already exist in that array

I'm very new to Python/NumPy. I want to store values in a numpy array that are solutions of a simple combination problem.
In this I have two given values x and y and a bound, with x<=bound and y<=bound. I need to find all integer solutions ax+by that satisfy ax+by<=bound, with "a" and "b" both non-negative integers. So I'm doing this by iterating over all feasible inputs for "a" and "b" and extending my array with the solutions.
The problem is that I need each solution to appear only once. Like in my code example below, for x=3, y=5 and bound=20, the solution 15 would be a result of a*3+b*5 for (a,b)=(5,0) and also for (a,b)=(0,3). I do not want the redundancy. The best way I have come up with until now is to check in an if-block whether the computed solution is not already stored, and only then add that value to my array.
Is there a more efficient way to do this, other than checking the entire existing array in every single iteration step? Like a function other than np.append which automatically only stores values that do not already exist?
Or is there a way to first store all computed solutions, but then return only an array with non-redundant values? (And would that be more efficient?)
PS: I'm working with very large bounds and my array needs to store a few thousand values.
import numpy as np
x=3
y=5
bound=20
arr=[] # empty array at first
for a in range(int(bound/x)+1):
    for b in range(int((bound-a*x)/y)+1):
        feasible_combination=a*x+b*y
        if feasible_combination not in arr[:]: # no need for redundancy
            arr=np.append(arr,feasible_combination)
arr=np.sort(arr)
print(arr)
Here are two faster versions: one using a set and the other doing almost all the work in NumPy:
def combinations_set(x, y, bound):
    combs = set()
    for a in range(int(bound/x)+1):
        for b in range(int((bound-a*x)/y)+1):
            combs.add(a*x+b*y)
    return np.sort(list(combs))

def combinations_ix(x, y, bound):
    a, b = np.ix_(*[np.arange(int(bound/_)+1) for _ in (x, y)])
    combs = a*x + b*y
    return np.sort(np.unique(combs[combs <= bound]))
For the following sample, combinations_ix gives a 1607-fold speedup on my machine compared to the original code, but combinations_set is pretty competitive (and should use less memory than combinations_ix):
In [58]: %timeit combinations(200, 234, 100000) # original code as function
1 loops, best of 3: 8.23 s per loop
In [59]: %timeit combinations_set(200, 234, 100000)
100 loops, best of 3: 16.1 ms per loop
In [60]: %timeit combinations_ix(200, 234, 100000)
100 loops, best of 3: 5.12 ms per loop
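As a quick check (not a benchmark), both functions should agree on the small example from the question, x=3, y=5, bound=20:

print(combinations_set(3, 5, 20))
print(combinations_ix(3, 5, 20))
# both should print: [ 0  3  5  6  8  9 10 11 12 13 14 15 16 17 18 19 20]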

numpy number array to strings with trailing zeros removed

Question: is my method of converting a numpy array of numbers to a numpy array of strings, with a specific number of decimal places AND trailing zeros removed, the 'best' way?
import numpy as np

x = np.array([1.12345, 1.2, 0.1, 0, 1.230000])
print(np.core.defchararray.rstrip(np.char.mod('%.4f', x), '0'))
outputs:
['1.1235' '1.2' '0.1' '0.' '1.23']
which is the desired result. (I am OK with the rounding issue)
Both 'rstrip' and 'mod' are numpy functions, which means this is fast, but is there a way to accomplish this with ONE built-in numpy function? (i.e., does 'mod' have an option that I couldn't find?) It would save the overhead of returning copies twice, which for very large arrays is slow-ish.
thanks!
Thanks to Warren Weckesser for providing valuable comments. Credit to him.
I converted my code to use:
formatter = '%d'
if num_type == 'float':
    formatter = '%%.%df' % decimals

np.savetxt(out, arr, fmt=formatter)
where out is a file handle to which I had already written my headers. Alternatively, I could also use the header= argument in np.savetxt. I have no clue how I didn't see those options in the documentation.
For a numpy array 1300 by 1300, creating the line by line output as I did before (using np.core.defchararray.rstrip(np.char.mod('%.4f', x), '0')) took ~1.7 seconds and using np.savetxt takes 0.48 seconds.
So np.savetxt is a cleaner, more readable, and faster solution.
Note:
I did try:
np.savetxt(out, arr, fmt='%.4g')
in an effort to not have a switch based on number type but it did not work as I had hoped.
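For reference, a minimal sketch of the savetxt approach described above, writing to an in-memory buffer just to show the output; the arr, decimals and num_type values are made-up stand-ins for the real data:

import io
import numpy as np

arr = np.array([[1.12345, 1.2, 0.1], [0.0, 1.23, 2.5]])
decimals = 4
num_type = 'float'

formatter = '%d'
if num_type == 'float':
    formatter = '%%.%df' % decimals

out = io.StringIO()
np.savetxt(out, arr, fmt=formatter, header='x y z')
print(out.getvalue())
# a "# x y z" header line followed by the rows formatted with 4 decimals
# (note: unlike the rstrip approach, trailing zeros are kept)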
