Value and Index of MAX in each column of matrix - python-3.x

I'm aware of:
id,value = max(enumerate(trans_p), key=operator.itemgetter(1))
I'm trying to find something equivalent for matrices, where I'm looking for the value and row index of the max for each column of the matrix
so the function could take in any matrix, such as:
np.array([[0,0,1],[2,0,0],[5,0,0]])
and return two vectors: a vector of row numbers where the max is found, and the max values themselves - for each column. I'm trying to avoid a for-loop! Ideally the function returns two values, like that:
rowIdVect, maxVect = ...........
where the values for the example matrix above would be:
[2,0,0] #rowIdVect
[5,0,1] #maxVect
I can do this in two steps:
idVect = np.argmax( myMat , axis=0)
maxVect = np.max( trans_probs_mat, axis=0)
But is there a syntax that would perform both at the same time? Note: I'm trying to improve run times.

You can use the index to find the corresponding values:
In [201]: arr=np.array([[0,0,1],[2,0,0],[5,0,0]])
In [202]: idx=np.argmax(arr, axis=0)
In [203]: np.max(arr, axis=0)
Out[203]: array([5, 0, 1])
In [204]: arr[idx,np.arange(3)]
Out[204]: array([5, 0, 1])
Is this worth it? I doubt if the use of argmax and/or max is a bottleneck in your calculations. But feel free to time test with realistic data.

Related

Python multiprocessing same function with different arguments

I have a program that is designed to calculate order parameter from coarse-grained molecular system. In the system the I have different beads, which represents different parts of molecule. Each of these beads have a xyz-coordinates that represent their place in the system. The program works, but it is very slow since I have to calculate the number of beads type i around beads type j within a certain cutoff distance.
Function to calculate Euclidean distance between bead a and b:
def distance_ab(a, b):
n_beads = 0
for i in range(len(a)):
for j in range(len(b)):
# Euclidean distance
dist = np.sqrt(np.sum((a[i] - b[j])** 2, axis=0))
if dist <= 1.0 and dist > 0.0: # cut-off distance
n_beads += 1
return n_beads
So I decided to fasten the process of calculating the distance between different beads by using python multiprocessing library. But for some reason I can not get the multiprocessing to work for repeating the same distance calculation function with different parameters (xyz-data of beads). Multiprocessing returns a list of some numbers, when the idea is to return only one number (the number of beads in certain cut-off distance). What I do wrong and could someone help me to understand where the problem is?
The part where I am trying to use multiprocessing:
with multiprocessing.Pool(os.cpu_count()) as pool:
# go through certain number of molecular simulation frames (e.g. 100 frames)
for i in range(frames)):
# Calculating euclidean distances between different types of beads
# for each frame
a_b = pool.starmap(calculate_distances, zip(bead_a_array, bead_b_array))
a_c = pool.starmap(calculate_distances, zip(bead_a_array, bead_c_array))
When you zip together your bead arrays, you are creating an iterable of tuples that overall has the same length as the shorter of the two arrays.
>>> A=[1,2,3]
>>> B=[4,5,6,7]
>>> res=zip(A,B)
>>> list(res)
[(1, 4), (2, 5), (3, 6)]
Looking at the documentation for starmap:
Like map() except that the elements of the iterable are expected to be iterables that are unpacked as arguments.
Hence an iterable of [(1,2), (3, 4)] results in [func(1,2), func(3,4)].
So your starmaps are actually just passing a pair of elements (one from each array) to your function and returning the result for each of these pairs. If you want to use multiprocessing to determine, say, the number of b and c beads within the cutoff distance of a at the same time, you would need to do something like this:
import itertools as it
all_arr=[bead_a_array,bead_b_array,bead_c_array]
with multiprocessing.Pool(os.cpu_count()) as pool:
a_counts=pool.starmap(distance_ab, it.combinations(all_arr,2))
Here, instead of passing individual elements of each array to the function, it now passes each array into your function and it will compute the counts of b and c within the threshold of a (and c with the threshold of b) simultaneously. The it.combinations(all_arr,2) selects unique pairs of arrays to pass to your function.

kth-value per row in pytorch?

Given
import torch
A = torch.rand(9).view((3,3)) # tensor([[0.7455, 0.7736, 0.1772],\n[0.6646, 0.4191, 0.6602],\n[0.0818, 0.8079, 0.6424]])
k = torch.tensor([0,1,0])
A.kthvalue_vectoriezed(k) -> [0.1772,0.6602,0.0818]
Meaning I would like to operate on each column with a different k.
Not kthvalue nor topk offers such API.
Is there a vectorized way around that?
Remark - kth value is not the value in the kth index, but the kth smallest element. Pytorch docs
torch.kthvalue(input, k, dim=None, keepdim=False, out=None) -> (Tensor, LongTensor)
Returns a namedtuple (values, indices) where values is the k th smallest element of each row of the input tensor in the given dimension dim. And indices is the index location of each element found.
Assuming you don't need indices into original matrix (if you do, just use fancy indexing for the second return value as well) you could simply sort the values (by last index by default) and return appropriate values like so:
def kth_smallest(tensor, indices):
tensor_sorted, _ = torch.sort(tensor)
return tensor_sorted[torch.arange(len(indices)), indices]
And this test case gives you your desired values:
tensor = torch.tensor(
[[0.7455, 0.7736, 0.1772], [0.6646, 0.4191, 0.6602], [0.0818, 0.8079, 0.6424]]
)
print(kth_smallest(tensor, [0, 1, 0])) # -> [0.1772,0.6602,0.0818]

How can I separate an array of numbers into two clusters and return two subsets of corresponding indexes?

I have an array of scalar numbers, pm, and a list of indexes, idx, so pm[idx] is a subset of pm. How can I separate pm[idx] into two clusters (according to the Euclidean distance) and obtain two sets of corresponding indexes (ideally using scikit-learn)?
For example,
pm = array([0,1,2,3,4,100,105])
idx = [0,2,3,5,6]
How can I obtain the idx1 = [0,2,3] and idx2 = [5,6]?
basically you want to filter your data pm which can be easily done with your idx array. You can cluster your filtered data to obtain two groups.
Partition based clustering algorithms such as k-Means or SingleLink can be perfectly applied. In scikit-learn you could use /sklearn.cluster.AgglomerativeClustering.
As those clustering algorithms expects your data to have the features in columns and the instances as rows your need to reshape your data.
From the resulting cluster labels you can then create separate index arrays using a list comprehension. (didn't found a numpy function that does the same)
Your solution could look like the following:
cluster_algorithm = AgglomerativeClustering(n_clusters=2)
labels = cluster_algorithm.fit_predict(np.expand_dims(pm[idx], axis=-1))
print(labels)
>>> [1 1 1 0 0]
idx_labels = [np.where(labels == e)[0] for e in set(labels)]
idx_labels # [array([3, 4], dtype=int64), array([0, 1, 2], dtype=int64)]

dividing random samples to subgroups using python

So I had this statistics homework and I wanted to do it with python and numpy.
The question started with making of 1000 random samples which follow normal distribution.
random_sample=np.random.randn(1000)
Then it wanted to divided these numbers to some subgroups . for example suppose we divide them to five subgroups.first subgroup is random numbers in range of (-5,-3)and it goes on to the last subgroup (3,5).
Is there anyway to do it using numpy (or anything else)?
And If it's possible I want it to work when the number of subgroups are changed.
You can get subgroup indices using numpy.digitize:
random_sample = 5 * np.random.randn(10)
random_sample
# -> array([-3.99645573, 0.44242061, 8.65191515, -1.62643622, 1.40187879,
# 5.31503683, -4.73614766, 2.00544974, -6.35537813, -7.2970433 ])
indices = np.digitize(random_sample, (-3,-1,1,3))
indices
# -> array([0, 2, 4, 1, 3, 4, 0, 3, 0, 0])
If you sort your random_sample, then you can divide this array by finding the indices of the "breakpoint" values — the values closest to the ranges you define, like -3, -5. The code would be something like:
import numpy as np
my_range = [-5,-3,-1,1,3,5] # example of ranges
random_sample = np.random.randn(1000)
hist = np.sort(random_sample)
# argmin() will find index where absolute difference is closest to zero
idx = [np.abs(hist-i).argmin() for i in my_range]
groups=[hist[idx[i]:idx[i+1]] for i in range(len(idx)-1)]
Now groups is a list where each element is an array with all random values within your defined ranges.

Filling up an output vector with stata loop

When you take centiles of a variable in Stata, for eg.
*set directory
cd"C:\Etc\Etc Etc\"
*open data file
use "dataset.dta",clear
*get centiles
centile var1, centile(1,5(5)95,99)
is there some way to record the resulting centile table to excel? The centile values are stored in r(c_#), where # indicates the centile at which you want the data. But I need a vector of the values at all the centiles, more or less as it appears in the output window.
I have attempted to use foreach loop to get the centiles into a vector, as follows:
*Create column of centiles
foreach i in r(centiles) {
xx[1,`i']=r(c_`i')
}
without success.
Thanks
EDIT:
I've since found this to work:
matrix X = 0,0
forvalues i=1/21 {
matrix X = `i',round(r(c_`i'),.001)\ X
}
Only inconveniences are 1) I have to include a a first row of 0,0 in the output, which I will then subsequently drop. 2) In this case I have 21 centiles, but it would be nice to automate the number of centiles in case I want to change it, for example something like this:
forvalues i=1/r(n_cent) {
matrix X = `i',round(r(c_`i'),.001)\ X
}
But the "i=1/r(n_cent)" is invalid syntax. Any advice as to how I might overcome these two inconveniences would be much appreciated.
Thanks
You can use the following syntax.
Load some data and compute the percentiles.
sysuse auto, clear
centile price, centile(1,5(5)95,99)
The matrix that is supposed to contain the results has to be initialized. This matrix is called X. It has as many rows as there are centiles requested via the centile command. It has two columns. At this stage, the matrix is populated with zeroes.
matrix X = J(`=wordcount("`r(centiles)'")', 2, 0)
The following loop is stepping through the results of the centile command and is replacing the zeroes in matrix X with the appropriate results. The first column of the matrix contains the number of the centile (1, 5, 10, ...) and the second column contains the result
forvalues i = 1 / `=wordcount("`r(centiles)'")' {
local cent: word `i' of `r(centiles)'
matrix X[`i', 1] = `cent'
matrix X[`i', 2] = r(c_`i')
}
Print the results:
matrix list X
If you are using round(), you are likely doing something wrong. There are few reasons to deliberately lose precision in the data; you can always display as many digits as you like using format this way or another (either applied to the data, or as an option of list or matrix list).
I wrote epctile command that returns percentiles as an estimation command, i.e., in the e(b) vector. This can be usable immediately; findit epctile to download.
You can modify your proposal as follows:
local thenumlist 1, 5(5)95, 99
centile variable, centile(`thenumlist')
forvalues i=1/`=r(n_cent)' {
matrix X = nullmat(X) \ r(c_`i')
}
numlist "`thenumlist'"
matrix rownames X = `r(numlist)'
matrix list X, format(%9.3f)

Resources