Dividing random samples into subgroups using Python - python-3.x

So I had this statistics homework and I wanted to do it with Python and numpy.
The question starts with generating 1000 random samples that follow a normal distribution:
random_sample = np.random.randn(1000)
Then it asks to divide these numbers into some subgroups. For example, suppose we divide them into five subgroups: the first subgroup is the random numbers in the range (-5, -3), and so on up to the last subgroup, (3, 5).
Is there any way to do this using numpy (or anything else)?
And if possible, I'd like it to keep working when the number of subgroups changes.

You can get subgroup indices using numpy.digitize:
random_sample = 5 * np.random.randn(10)
random_sample
# -> array([-3.99645573, 0.44242061, 8.65191515, -1.62643622, 1.40187879,
# 5.31503683, -4.73614766, 2.00544974, -6.35537813, -7.2970433 ])
indices = np.digitize(random_sample, (-3,-1,1,3))
indices
# -> array([0, 2, 4, 1, 3, 4, 0, 3, 0, 0])
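The indices map each sample to a bin, so grouping is one comprehension away. A minimal sketch (digitize with n edges yields indices 0 through n, i.e. n+1 groups, so this adapts automatically when you change the edges):

import numpy as np

random_sample = 5 * np.random.randn(10)
bins = (-3, -1, 1, 3)
indices = np.digitize(random_sample, bins)
# one array per bin, including the open-ended bins below -3 and above 3
groups = [random_sample[indices == i] for i in range(len(bins) + 1)]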

If you sort your random_sample, then you can divide the sorted array by finding the indices of the "breakpoint" values, i.e. the values closest to the range edges you define, like -5 and -3. The code would be something like:
import numpy as np

my_range = [-5, -3, -1, 1, 3, 5]  # example range edges
random_sample = np.random.randn(1000)
hist = np.sort(random_sample)
# argmin() finds the index where the absolute difference is smallest,
# i.e. the sorted value closest to each range edge
idx = [np.abs(hist - i).argmin() for i in my_range]
groups = [hist[idx[i]:idx[i + 1]] for i in range(len(idx) - 1)]
Now groups is a list where each element is an array of all the random values within the corresponding range.
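For a quick check you can print how many samples fell into each range, reusing my_range and groups from the snippet above:

for lo, hi, g in zip(my_range, my_range[1:], groups):
    print(f"({lo}, {hi}): {len(g)} samples")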

Related

Generate numpy matrix with unique range for each element

I'm trying to generate random matrices. However, each element of the random matrix has a different range, so I want to generate a random matrix where each element is drawn from its own range. So far I've been able to generate matrices with unique column ranges:
c1 = np.random.uniform(low=2, high=1000, size=(15,1))
c2 = np.random.uniform(low=0.001, high=100, size=(15,1))
c3 = np.random.uniform(low=30, high=10000, size=(15,1))
c4 = np.random.uniform(low=1, high=25, size=(15,1))
mtx = np.concatenate((c1,c2,c3,c4), axis=1)
Now the low and high for the rows in mtx are also quite different. How can I generate such a random matrix where each row element also has a unique range, and not just the columns?
Something like this would probably work:
low = np.array([2, 0.001, 30, 1])
high = np.array([1000, 100, 10000, 25])
l = 15
# note: np.random.rand takes the dimensions as separate arguments, not a tuple
mtx = np.random.rand(l, low.size) * (high - low)[None, :] + low[None, :]
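As a quick sanity check (a small, self-contained sketch repeating the setup above), broadcasting also makes it easy to verify that every column stays inside its bounds:

import numpy as np

low = np.array([2, 0.001, 30, 1])
high = np.array([1000, 100, 10000, 25])
mtx = np.random.rand(15, low.size) * (high - low)[None, :] + low[None, :]

# each column's values must lie in [low, high)
assert np.all((mtx >= low[None, :]) & (mtx < high[None, :]))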
I think what you need to do to achieve what you want is the following:
1. Specify the low and high for each column and each row.
2. For each element, work out the range it can be sampled from: the highest low and the lowest high of the two ranges imposed by its row and its column.
3. Sample each element separately (from a uniform distribution) with that element's low and high.
Now each element in each row will certainly be within the row's limits, and the same goes for elements in a column.
You should be careful, though, not to select mutually exclusive ranges for rows and columns.
That said, here is some code that does this (with comments):
import numpy as np
from numpy.random import randint

n_rows = 15
n_cols = 4

# here I make random highs and lows for each row and column
# these are lists of tuples like this: [(39, 620), (83, 123), (67, 243), (77, 901)]
# where each tuple contains the low and high for the column (or row).
ranges_rows = [(randint(0, 100), randint(101, 1001)) for _ in range(n_rows)]
ranges_cols = [(randint(0, 100), randint(101, 1001)) for _ in range(n_cols)]

# make an empty matrix
mtx = np.empty((n_rows, n_cols))

# fill in the matrix
for x in range(n_rows):
    for y in range(n_cols):
        # get the specified low and high for both the column and row of the element
        row_low, row_high = ranges_rows[x]
        col_low, col_high = ranges_cols[y]
        # the low and high for each element should be within range of both the
        # row and column restrictions
        elem_low = max(row_low, col_low)
        elem_high = min(row_high, col_high)
        # get the element within the range
        rand_elem = np.random.uniform(low=elem_low, high=elem_high)
        # put it in its right place in the matrix
        mtx[x, y] = rand_elem
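If the double loop ever becomes a bottleneck, the same per-element bounds can be computed with broadcasting instead. A sketch reusing ranges_rows and ranges_cols from above (np.random.uniform broadcasts array-valued low/high arguments):

row_lows, row_highs = np.array(ranges_rows).T  # each of shape (n_rows,)
col_lows, col_highs = np.array(ranges_cols).T  # each of shape (n_cols,)

# highest low and lowest high of each (row, column) pair
elem_low = np.maximum(row_lows[:, None], col_lows[None, :])
elem_high = np.minimum(row_highs[:, None], col_highs[None, :])

mtx = np.random.uniform(low=elem_low, high=elem_high)  # shape (n_rows, n_cols)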

How can I separate an array of numbers into two clusters and return two subsets of corresponding indexes?

I have an array of scalar numbers, pm, and a list of indexes, idx, so pm[idx] is a subset of pm. How can I separate pm[idx] into two clusters (according to the Euclidean distance) and obtain two sets of corresponding indexes (ideally using scikit-learn)?
For example,
pm = array([0,1,2,3,4,100,105])
idx = [0,2,3,5,6]
How can I obtain the idx1 = [0,2,3] and idx2 = [5,6]?
Basically you want to filter your data pm, which can easily be done with your idx array. You can then cluster the filtered data to obtain two groups.
Partition-based clustering algorithms such as k-means or single-link can be applied here. In scikit-learn you could use sklearn.cluster.AgglomerativeClustering.
As those clustering algorithms expect the features in columns and the instances in rows, you need to reshape your data.
From the resulting cluster labels you can then create separate index arrays with a list comprehension (I didn't find a numpy function that does the same).
Your solution could look like the following:
import numpy as np
from sklearn.cluster import AgglomerativeClustering

cluster_algorithm = AgglomerativeClustering(n_clusters=2)
labels = cluster_algorithm.fit_predict(np.expand_dims(pm[idx], axis=-1))
print(labels)
>>> [1 1 1 0 0]
idx_labels = [np.where(labels == e)[0] for e in set(labels)]
idx_labels  # [array([3, 4], dtype=int64), array([0, 1, 2], dtype=int64)]
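To get the idx1/idx2 from the question, i.e. indices into pm rather than positions within pm[idx], you can index idx itself with the labels. A short sketch using the labels from above (note the cluster numbering is arbitrary):

idx_arr = np.asarray(idx)
idx1, idx2 = (idx_arr[labels == e] for e in set(labels))
# e.g. idx1 -> array([5, 6]), idx2 -> array([0, 2, 3])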

How to divide a list into k sublists in all the possible ways

Suppose we have:
size of array = 5
pairs= 3
array= 1 2 3 4 5
We need to divide it into all the possible sublists, as:
[(1,2,3),(4),(5)]
[(1),(2,3,4),(5)]
[(1),(2),(3,4,5)]
[(1,2),(3,4),(5)]
[(1),(2,3),(4,5)]
[(1,2),(3),(4,5)]
Suppose instead:
size of array = 5
pairs= 2
array= 1 2 3 4 5
We need to divide it into all the possible sublists, as:
[(1,2,3,4),(5)]
[(1),(2,3,4,5)]
[(1,2),(3,4,5)]
[(1,2,3),(4,5)]
The code I have tried:
l1 = [1, 2, 3, 4, 5]
from itertools import permutations
l2 = permutations(l1)
l3 = [[sum([x[0], x[1]]), sum([x[2], x[3]]), x[4]] for x in l2]
max_arr = []
for arr in l3:
    max_arr.append(max(arr))
print(min(max_arr))
To generate all list partitions, you can make a list of parts-1 ones and size-parts zeros. (Note I used parts rather than pairs as a more suitable name.)
Then generate the distinct permutations of this list (for example, with itertools), and for every permutation split the initial list after the indices of the 1s. (Note there are C(size-1, parts-1) such permutations.)
For example, the result [0,1,1,0] corresponds to the partition [(1,2),(3),(4,5)]: the 1s at indices 1 and 2 mean "split after the 2nd and 3rd items".
This is an application of the "stars and bars" principle.
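A minimal sketch of this idea, using itertools.permutations with a set to drop duplicate flag orderings:

from itertools import permutations

l = [1, 2, 3, 4, 5]
parts = 3

# parts-1 ones (cut markers) and size-parts zeros, one flag per gap between items
flags = [1] * (parts - 1) + [0] * (len(l) - parts)
for flag_perm in set(permutations(flags)):
    cuts = [i + 1 for i, f in enumerate(flag_perm) if f]
    bounds = [0] + cuts + [len(l)]
    print([tuple(l[a:b]) for a, b in zip(bounds, bounds[1:])])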
MBo's good solution may be improved a little by generating combinations of list splitting indexes instead of permutations of separation flags. Then if we're lazy we can pass those indexes directly to numpy.split.
import itertools
import numpy

l = [1, 2, 3, 4, 5]
parts = 3
for comb in itertools.combinations(range(1, len(l)), parts - 1):
    print(numpy.split(l, comb))
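If you'd rather avoid the numpy arrays that numpy.split returns, the same combinations can drive plain list slicing:

import itertools

l = [1, 2, 3, 4, 5]
parts = 3
for comb in itertools.combinations(range(1, len(l)), parts - 1):
    bounds = (0,) + comb + (len(l),)
    print([l[a:b] for a, b in zip(bounds, bounds[1:])])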

Using numpy to vectorize subtraction of array with scalar (via another array) without using double for-loop

Suppose one wanted to use numpy to vectorize array subtractions. As an example, consider the following setup (code below): I am computing the euclidean distance between some (x, y) points and a given centroid. The reason for this question is that the example code below works for exactly 2 dimensions (x and y), but I would like to generalize this operation to N dimensions in order to adapt my k-means algorithm. The code below only computes the error given specified centroids.
import numpy as np
np.random.seed(10) ## for reproducibility
x = np.random.normal(40, 10, 10)
y = np.random.normal(50, 10, 10)
data = np.array([x, y])
centroids = np.array([[25, 75], [45, 55], [20, 80], [40, 60]])
k = len(centroids)
print("\nDATA:\n{}\n\n{} CENTROIDS:\n{}\n".format(data, k, centroids))
partials = np.array([[(data[i] - centroid[i])**2 for i in range(len(data))] for centroid in centroids])
res = np.sqrt(np.sum(partials))
print("\nPARTIAL DISTANCES:\n{}\n\nTOTAL DISTANCE:\n{}\n".format(partials, res))
Running the code above produces the following output:
DATA:
[[53.31586504 47.15278974 24.54599708 39.9161615 46.21335974 32.79914439
42.65511586 41.08548526 40.04291431 38.25399789]
[54.3302619 62.03037374 40.34934329 60.28274078 52.2863013 54.45137613
38.63397788 51.35136878 64.84537002 39.20195114]]
4 CENTROIDS:
[[25 75]
[45 55]
[20 80]
[40 60]]
PARTIAL DISTANCES:
[[[8.01788213e+02 4.90746093e+02 2.06118652e-01 2.22491874e+02
4.50006631e+02 6.08266533e+01 3.11703116e+02 2.58742836e+02
2.26289271e+02 1.75668460e+02]
[4.27238073e+02 1.68211205e+02 1.20066801e+03 2.16597719e+02
5.15912109e+02 4.22245943e+02 1.32248756e+03 5.59257758e+02
1.03116510e+02 1.28150030e+03]]
[[6.91536114e+01 4.63450368e+00 4.18366235e+02 2.58454139e+01
1.47224186e+00 1.48860878e+02 5.49848164e+00 1.53234257e+01
2.45726985e+01 4.55085444e+01]
[4.48549123e-01 4.94261549e+01 2.14641742e+02 2.79073501e+01
7.36416063e+00 3.00988153e-01 2.67846680e+02 1.33125097e+01
9.69313108e+01 2.49578348e+02]]
[[1.10994686e+03 7.37273991e+02 2.06660894e+01 3.96653489e+02
6.87140229e+02 1.63818097e+02 5.13254274e+02 4.44597689e+02
4.01718414e+02 3.33208439e+02]
[6.58935454e+02 3.22907468e+02 1.57217458e+03 3.88770311e+02
7.68049096e+02 6.52732182e+02 1.71114779e+03 8.20744071e+02
2.29662810e+02 1.66448079e+03]]
[[1.77312262e+02 5.11624011e+01 2.38826206e+02 7.02889396e-03
3.86058392e+01 5.18523215e+01 7.04964021e+00 1.17827824e+00
1.84163795e-03 3.04852335e+00]
[3.21459301e+01 4.12241752e+00 3.86148309e+02 7.99423486e-02
5.95011476e+01 3.07872269e+01 4.56506901e+02 7.47988219e+01
2.34776106e+01 4.32558836e+02]]]
TOTAL DISTANCE:
163.00230640508593
I am using a nested double for-loop in this code. I noticed numpy.subtract does not have an axis kwarg. I was thinking I could numpy.tile the centroids to perform the subtraction, but this seems inefficient for large N, especially if many iterations are needed to converge. Is there a different way to vectorize this operation?
You can use expand_dims to create the missing axis:
partials = (data.T - np.expand_dims(centroids, axis=1))**2
That way data.T has shape (10,2) and you subtract from it an array with shape (4,1,2) so the subtraction is broadcast across the second axis of this array.
You can also do this by adding an extra axis on the end of centroids and not transposing data:
partials = (data - centroids[:,:,np.newaxis])**2
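Either variant gives partials with one leading axis per centroid; finishing the computation is then a reduction over the coordinate axis. A short sketch for the second form, where partials has shape (k, n_dims, n_points):

# per-centroid, per-point euclidean distances, shape (k, n_points)
dists = np.sqrt(partials.sum(axis=1))

# the single total from the question
res = np.sqrt(partials.sum())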

Value and Index of MAX in each column of matrix

I'm aware of:
id,value = max(enumerate(trans_p), key=operator.itemgetter(1))
I'm trying to find something equivalent for matrices, where I'm looking for the value and row index of the max for each column of the matrix
so the function could take in any matrix, such as:
np.array([[0,0,1],[2,0,0],[5,0,0]])
and return two vectors: a vector of the row numbers where the max is found, and the max values themselves, for each column. I'm trying to avoid a for-loop! Ideally the function returns two values, like this:
rowIdVect, maxVect = ...........
where the values for the example matrix above would be:
[2,0,0] #rowIdVect
[5,0,1] #maxVect
I can do this in two steps:
idVect = np.argmax(myMat, axis=0)
maxVect = np.max(myMat, axis=0)
But is there a syntax that would perform both at the same time? Note: I'm trying to improve run times.
You can use the index to find the corresponding values:
In [201]: arr=np.array([[0,0,1],[2,0,0],[5,0,0]])
In [202]: idx=np.argmax(arr, axis=0)
In [203]: np.max(arr, axis=0)
Out[203]: array([5, 0, 1])
In [204]: arr[idx,np.arange(3)]
Out[204]: array([5, 0, 1])
Is this worth it? I doubt if the use of argmax and/or max is a bottleneck in your calculations. But feel free to time test with realistic data.
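As a side note, on numpy 1.15+ the same fancy indexing can be written with np.take_along_axis, which avoids building the np.arange column index by hand (a convenience rather than a speedup):

import numpy as np

arr = np.array([[0, 0, 1], [2, 0, 0], [5, 0, 0]])
idx = np.argmax(arr, axis=0)

# indices must have the same ndim as arr, hence idx[None, :]
maxVect = np.take_along_axis(arr, idx[None, :], axis=0)[0]
# idx -> array([2, 0, 0]), maxVect -> array([5, 0, 1])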
