Generate numpy matrix with unique range for each element - python-3.x

I'm trying to generate random matrices. However, each element of the matrix has its own range, so I want to generate a random matrix in which each element is drawn from within that element's range. So far I've been able to generate matrices with a distinct range per column:
c1 = np.random.uniform(low=2, high=1000, size=(15,1))
c2 = np.random.uniform(low=0.001, high=100, size=(15,1))
c3 = np.random.uniform(low=30, high=10000, size=(15,1))
c4 = np.random.uniform(low=1, high=25, size=(15,1))
mtx = np.concatenate((c1,c2,c3,c4), axis=1)
Now the lows and highs for the rows of mtx also differ quite a bit. How can I generate such a random matrix where each row element, and not just each column, has its own range?

Something like this would probably work:
low = np.array([ 2, 0.001, 30, 1])
high = np.array([1000, 100, 10000, 25])
l = 15
# scale uniform samples in [0, 1) into each column's [low, high) interval
mtx = np.random.rand(l, *low.shape) * (high - low)[None, :] + low[None, :]

I think what you need to do to achieve what you want is the following:
Specify the low and high for each column and each row
Check for each element what the range is that it can be sampled from (that means the highest low and the lowest high of the two ranges imposed by its row and its column)
Sample each element separately (from a uniform distribution) with the element's specified high and low.
Now each element in each row will certainly be within the row's limits and the same would go for elements in a column.
You should be careful though not to select mutually exclusive ranges in rows and columns.
That said, here is some code that does this (with comments):
import numpy as np
from numpy.random import randint
n_rows = 15
n_cols = 4
# here I make random highs and lows for each row and column
# these are lists of tuples like this: [(39, 620), (83, 123), (67, 243), (77, 901)]
# where each tuple contains the low and high for the column (or row).
ranges_rows = [ (randint(0,100), randint(101, 1001)) for _ in range(n_rows) ]
ranges_cols = [ (randint(0,100), randint(101, 1001)) for _ in range(n_cols) ]
# make an empty matrix
mtx = np.empty((n_rows, n_cols))
# fill in the matrix
for x in range(n_rows):
    for y in range(n_cols):
        # get the specified low and high for both the column and row of the element
        row_low, row_high = ranges_rows[x]
        col_low, col_high = ranges_cols[y]
        # the low and high for each element should be within range of both the
        # row and column restrictions
        elem_low = max([row_low, col_low])
        elem_high = min([row_high, col_high])
        # get the element within the range
        rand_elem = np.random.uniform(low=elem_low, high=elem_high)
        # put it in its right place in the matrix
        mtx[x,y] = rand_elem
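For larger matrices, the same idea can be written without the nested Python loops by building per-element lows and highs and sampling in one call; np.random.uniform broadcasts array-valued low and high arguments. A minimal sketch of that variant, reusing ranges_rows and ranges_cols from above (not part of the original answer):
row_low, row_high = np.array(ranges_rows).T   # each of shape (n_rows,)
col_low, col_high = np.array(ranges_cols).T   # each of shape (n_cols,)
# tightest interval allowed by both the row and the column of each element
elem_low = np.maximum(row_low[:, None], col_low[None, :])    # shape (n_rows, n_cols)
elem_high = np.minimum(row_high[:, None], col_high[None, :])
# uniform sampling with element-wise bounds
mtx = np.random.uniform(low=elem_low, high=elem_high)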

Related

Efficient solution of dataframe range filtering based on another ranges

I tried the following code to find the ranges of one dataframe that are not within the ranges of another dataframe. However, it takes more than a day to run on the large files because the last two for-loops compare every row against every row. Each of my 24 dataframes has around 10^8 rows. Is there any efficient alternative to the following approach?
Please refer to this thread for a better understanding of my I/O: Return the range of a dataframe not within a range of another dataframe
My approach:
I created tuple pairs from (df1['first.start'], df1['first.end']) and (df2['first.start'], df2['first.end']) initially, in order to apply the range() function. After that, I checked whether each df1 range is within any of the df2 ranges. The edge case here was df1['first.start'] = df1['first.end']. I collected the filtered indices from the iterations and then passed them into df1.
df2_lst=[]
for i,j in zip(temp_df2['first.start'], temp_df2['first.end']):
    df2_lst.append(i)
    df2_lst.append(j)

df1_lst=[]
for i,j in zip(df1['first.start'], df1['first.end']):
    df1_lst.append(i)
    df1_lst.append(j)

def range_subset(range1, range2):
    """Whether range1 is a subset of range2."""
    if not range1:
        return True   # empty range is a subset of anything
    if not range2:
        return False  # non-empty range can't be a subset of empty range
    if len(range1) > 1 and range1.step % range2.step:
        return False  # must have a single value or integer multiple step
    return range1.start in range2 and range1[-1] in range2

##### FUNCTION FOR CREATING CHUNKS OF LISTS ####
def chunks(lst, n):
    """Yield successive n-sized chunks from lst."""
    for i in range(0, len(lst), n):
        yield lst[i], lst[i+1]

df1_lst2 = list(chunks(df1_lst,2))
df2_lst2 = list(chunks(df2_lst,2))

indices=[]
for idx,i in enumerate(df1_lst2): #main list
    x,y = i
    for j in df2_lst2: #filter list
        m,n = j
        if((x!=y) & (range_subset(range(x,y), range(m,n)))): #checking if the main list exists in the filter range or not
            indices.append(idx) #collecting the filtered indices
df1.iloc[indices]
If n and m are the numbers of rows in df1 and df2, any algorithm needs to make at least n * m comparisons to check every range in df1 against every range in df2. The problem with your code as posted is that (a) it has too many intermediate steps and (b) it uses slow Python loops. If you switch to numpy broadcasting, which uses highly optimized C loops under the hood, it will be a lot faster.
The downside of numpy broadcasting is memory: it creates a comparison matrix of n * m bytes, and at that size your problem may run your computer out of memory. We can mitigate that by chunking df1 to trade performance for lower memory usage.
import numpy as np
import pandas as pd

# Sample data
def random_dataframe(size):
    a = np.random.randint(1, 100, 2*size).cumsum()
    return pd.DataFrame({
        'first.start': a[::2],
        'first.end': a[1::2]
    })

n, m = 10_000_000, 1000
np.random.seed(42)
df1 = random_dataframe(n)
df2 = random_dataframe(m)

# ---------------------------
# Prepare the start and end times of df2 for comparison.
# [:, None] raises the array by one dimension, which is necessary
# for array broadcasting.
s2 = df2['first.start'].to_numpy()[:, None]
e2 = df2['first.end'].to_numpy()[:, None]

# A chunk_size that is too small or too big will lower performance.
# Experiment to find a sweet spot.
chunk_size = 100_000
offset = 0
mask = []
while offset < len(df1):
    s1 = df1['first.start'].to_numpy()[offset:offset+chunk_size]
    e1 = df1['first.end'].to_numpy()[offset:offset+chunk_size]
    mask.append(
        ((s2 <= s1) & (s1 <= e2) & (s2 <= e1) & (e1 <= e2)).any(axis=0)
    )
    offset += chunk_size
mask = np.hstack(mask)
# ---------------------------
# If memory is not a concern, use the following code. However, this
# may run slower than the chunking approach due to increased size of
# the array broadcasting operation. Profile your code to find out.
s2 = df2['first.start'].to_numpy()[:, None]
e2 = df2['first.end'].to_numpy()[:, None]
s1 = df1['first.start'].to_numpy()
e1 = df1['first.end'].to_numpy()
mask = ((s2 <= s1) & (s1 <= e2) & (s2 <= e1) & (e1 <= e2)).any(axis=0)
The chunking code took 30s on my computer. To access the result:
df1[mask] # ranges in df1 that are completely surrounded by a range in df2
df1[~mask] # ranges in df1 that are NOT completely surrounded by any range in df2
By tweaking the comparison, you can check for overlapping ranges too.
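For example, reusing s1, e1, s2 and e2 from the non-chunked block above, the standard interval-overlap test (two ranges overlap when each starts before the other ends) would look roughly like this; it is a sketch, not code from the original answer:
# ranges in df1 that overlap at least one range in df2
overlap_mask = ((s2 <= e1) & (s1 <= e2)).any(axis=0)
df1[overlap_mask]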

What's a potentially better algorithm to solve this python nested for loop than the one I'm using?

I have a nested loop that has to loop through a huge amount of data.
Assume a data frame of random values with 1,000,000 rows, where each row has an X, Y location in 2D space. A window of length 10 moves through the 1M data rows one by one until all the calculations are done.
Explaining what the code is supposed to do:
Each row represents a coordinate in the X-Y plane.
r_test contains the diameters of the different circles of investigation in our 2D (X-Y) plane.
For each window of 10 points/rows, and for every single diameter in r_test, we compare the distance between every point and the remaining 9 points, and if the value is less than R we add 2 to H. Then we calculate H/(N**5) and store it in c_10 at the index corresponding to that diameter of investigation.
For the first 10 points, once the loop has gone through all the diameters in r_test, we read the slope of the fitted line and save it to S_wind[ii]. The first 9 data points therefore have no value calculated for them, which is why they are set to np.inf, to be distinguished later.
Then the window moves one point down the rows and the process repeats until S_wind is complete.
What's a potentially better algorithm for this, in Python 3.x, than the one I'm using?
Many thanks in advance!
import numpy as np
import pandas as pd

#### generating input data frame
df = pd.DataFrame(data = np.random.randint(2000, 6000, (1000000, 2)))
df.columns= ['X','Y']

####====creating upper and lower bound for the diameter of the investigation circles
x_range = max(df['X']) - min(df['X'])
y_range = max(df['Y']) - min(df['Y'])
R = max(x_range, y_range)/20
d = 2
N = 10 #### Number of points in each window
#r1 = 2*R*(1/N)**(1/d)
#r2 = (R)/(1+d)
#r_test = np.arange(r1, r2, 0.05)
##===avoiding generation of empty r_test
r1 = 80
r2 = 800
r_test = np.arange(r1, r2, 5)

S_wind = np.zeros(len(df['X'])) + np.inf
for ii in range(10, len(df['X'])): #### maybe the code run slower because of using len() function instead of a number
    c_10 = np.zeros(len(r_test)) + np.inf
    H = 0
    C = 0
    N = 10 ##### maybe I should also remove this
    for ind in range(len(r_test)):
        for i in range(ii-10, ii):
            for j in range(ii-10, ii):
                dd = r_test[ind] - np.sqrt((df['X'][i] - df['X'][j])**2 + (df['Y'][i] - df['Y'][j])**2)
                if dd > 0:
                    H += 1
        c_10[ind] = (H/(N**2))
    S_wind[ii] = np.polyfit(np.log10(r_test), np.log10(c_10), 1)[0]
You can use numpy broadcasting to eliminate all of the inner loops. I'm not sure if there's an easy way to get rid of the outermost loop, but the others are not too hard to avoid.
The inner loops are comparing ten 2D points against each other in pairs. That's just dying for using a 10x10x2 numpy array:
# replacing the `for ind` loop and its contents:
points = np.hstack((np.asarray(df['X'])[ii-10:ii, None], np.asarray(df['Y'])[ii-10:ii, None]))
differences = np.subtract(points[None, :, :], points[:, None, :]) # broadcast to 10x10x2
squared_distances = (differences * differences).sum(axis=2)
within_range = squared_distances[None,:,:] < (r_test*r_test)[:, None, None] # compare squares
c_10 = within_range.sum(axis=(1,2)).cumsum() * 2 / (N**2)
S_wind[ii] = np.polyfit(np.log10(r_test), np.log10(c_10), 1)[0] # this is unchanged...
I'm not very pandas savvy, so there's probably a better way to get the X and Y values into a single 2-dimensional numpy array. You generated the random data in the format that I'd find most useful, then converted into something less immediately useful for numeric operations!
Note that this code matches the output of your loop code. I'm not sure that's actually doing what you want it to do, as there are several slightly strange things in your current code. For example, you may not want the cumsum in my code, which corresponds to only re-initializing H to zero in the outermost loop. If you don't want the matches for smaller values of r_test to be counted again for the larger values, you can skip that sum (or equivalently, move the H = 0 line to between the for ind and the for i loops in your original code).
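On the pandas side, one common way to get the two columns into a single (n, 2) numpy array is to take a column subset and call to_numpy(); this is just a side note, not part of the original answer:
# equivalent construction of `points` for the current window
points = df[['X', 'Y']].to_numpy()[ii-10:ii]   # shape (10, 2)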

Find n smallest values in a list of tensors

I am trying to find the indices of the n smallest values in a list of tensors in pytorch. Since these tensors might contain many non-unique values, I cannot simply compute percentiles to obtain the indices. The ordering of non-unique values does not matter however.
I came up with the following solution but am wondering if there is a more elegant way of doing it:
import torch
n = 10
tensor_list = [torch.randn(10, 10), torch.zeros(20, 20), torch.ones(30, 10)]
all_sorted, all_sorted_idx = torch.sort(torch.cat([t.view(-1) for t in tensor_list]))
cum_num_elements = torch.cumsum(torch.tensor([t.numel() for t in tensor_list]), dim=0)
cum_num_elements = torch.cat([torch.tensor([0]), cum_num_elements])
split_indeces_lt = [all_sorted_idx[:n] < cum_num_elements[i + 1] for i, _ in enumerate(cum_num_elements[1:])]
split_indeces_ge = [all_sorted_idx[:n] >= cum_num_elements[i] for i, _ in enumerate(cum_num_elements[:-1])]
split_indeces = [all_sorted_idx[:n][torch.logical_and(lt, ge)] - c for lt, ge, c in zip(split_indeces_lt, split_indeces_ge, cum_num_elements[:-1])]
n_smallest = [t.view(-1)[idx] for t, idx in zip(tensor_list, split_indeces)]
Ideally a solution would pick a random subset of the non-unique values instead of picking the entries of the first tensor of the list.
PyTorch does provide a more elegant (I think) way to do it, with torch.unique_consecutive (see the documentation).
I'm going to work on a single tensor rather than a list of tensors because, as you did yourself, it is just a cat away. Unraveling the indices afterward is not hard either.
# We want to find the n=3 min values and positions in t
n = 3
t = torch.tensor([1,2,3,2,0,1,4,3,2])
# To get a random occurrence, we create a random permutation
randomizer = torch.randperm(len(t))
# first, we sort t, and get the indices
sorted_t, idx_t = t[randomizer].sort()
# small util function to extract only the n smallest values and positions
head = lambda v,w : (v[:n], w[:n])
# use unique_consecutive to remove duplicates
uniques_t, counts_t = head(*torch.unique_consecutive(sorted_t, return_counts=True))
# counts_t.cumsum gives us the position of the unique values in sorted_t
uniq_idx_t = torch.cat([torch.tensor([0]), counts_t.cumsum(0)[:-1]], 0)
# And now, we have the positions of uniques_t values in t :
final_idx_t = randomizer[idx_t[uniq_idx_t]]
print(uniques_t, final_idx_t)
#>>> tensor([0,1,2]), tensor([4,0,1])
#>>> tensor([0,1,2]), tensor([4,5,8])
#>>> tensor([0,1,2]), tensor([4,0,8])
EDIT : I think the added permutation solves your need-random-occurrence problem
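For completeness, a small sketch of that unraveling step: mapping flat indices in the concatenated tensor back to a (tensor, position) pair in the original list. The variable names are illustrative and this is not part of the original answer:
sizes = torch.tensor([t.numel() for t in tensor_list])      # tensor_list from the question
boundaries = sizes.cumsum(0)                                # cumulative element counts
offsets = torch.cat([torch.tensor([0]), boundaries[:-1]])
flat_idx = torch.tensor([5, 150, 450])                      # example flat indices into the concatenation
which = torch.bucketize(flat_idx, boundaries, right=True)   # which tensor each index falls into
within = flat_idx - offsets[which]                          # position inside that tensor's flattened view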

Rearranging a 3-D array using indices from sorting?

I have a 3-D array of random numbers of size [channels = 3, height = 10, width = 10].
Then I sorted it using torch.sort along the columns (dim=1) and obtained the indices as well.
Now I would like to return to the original matrix using these indices. I currently use for loops to do this (without considering the batches). The code is:
import torch
torch.manual_seed(1)
ch = 3
h = 10
w = 10
inp_unf = torch.randn(ch,h,w)
inp_sort, indices = torch.sort(inp_unf,1)
resort = torch.zeros(inp_sort.shape)
for i in range(ch):
    for j in range(inp_sort.shape[1]):
        for k in range(inp_sort.shape[2]):
            temp = inp_sort[i,j,k]
            resort[i,indices[i,j,k],k] = temp
I would like it to be vectorized, considering batches as well, i.e. an input size of [batch, channel, height, width].
Using Tensor.scatter_()
You can directly scatter the sorted tensor back into its original state using the indices provided by sort():
torch.zeros(ch,h,w).scatter_(dim=1, index=indices, src=inp_sort)
The intuition is based on the previous answer below. As scatter() is basically the reverse of gather(), inp_reunf = inp_sort.gather(dim=1, index=reverse_indices) is the same as inp_reunf.scatter_(dim=1, index=indices, src=inp_sort):
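Since the question also asks about a [batch, channel, height, width] input, the same scatter_ call should carry over directly when the sort is done along the height dimension (dim=2). A short sketch of that batched case (not from the original answer):
b, ch, h, w = 5, 3, 10, 10
inp_unf = torch.randn(b, ch, h, w)
inp_sort, indices = inp_unf.sort(dim=2)
inp_reunf = torch.zeros(b, ch, h, w).scatter_(dim=2, index=indices, src=inp_sort)
print(torch.equal(inp_unf, inp_reunf))  # True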
Previous answer
Note: while correct, this is probably less performant, as it calls the sort() operation a 2nd time.
You need to obtain the sorting "reverse indices", which can be done by "sorting the indices returned by sort()".
In other words, given x_sort, indices = x.sort(), you have x[indices] -> x_sort ; while what you want is reverse_indices such that x_sort[reverse_indices] -> x.
This can be obtained as follows: _, reverse_indices = indices.sort().
import torch
torch.manual_seed(1)
ch, h, w = 3, 10, 10
inp_unf = torch.randn(ch,h,w)
inp_sort, indices = inp_unf.sort(dim=1)
_, reverse_indices = indices.sort(dim=1)
inp_reunf = inp_sort.gather(dim=1, index=reverse_indices)
print(torch.equal(inp_unf, inp_reunf))
# True

Return value from DataFrame at maximum of a function

For narrow-band processing I want the complex pressure at the peak frequency bin. To find the peak frequency bin I use the frequency with the highest absolute value, within a small range of frequencies.
I have come up with the following code, borrowing heavily from
Use idxmax for indexing in pandas
This seems bulky to me, and hard to generalize. Ideally I hope to be able to make fBins into an array and return many frequencies at once. It's OK to make maxAbsIndex into a list, but I can't see the next step.
import numpy as np
import pandas as pd
# Construct fake frequency data on multiple channels
np.random.seed(0)
numF = 1000
f = np.arange(numF) / (numF * 2)
y = np.random.randn(numF, 2) + 1j * np.random.randn(numF, 2)
# Put time series into a DataFrame, indexed by frequency
yFrame = pd.DataFrame(y, index = f)
fBins = 0.1
tol = 0.01
# Find the index of the maximum absolute value within a given frequency window
absMaxIndex = yFrame[(fBins - tol) : (fBins + tol)].abs().idxmax()
# Return the value at this index
value = [yFrame.loc[items[1], items[0]] for items in absMaxIndex.items()]
print(value)
value should contain the complex values
[(-2.0946030712061448-1.0585718976053677j), (-2.7396771671895563+0.79204149842297422j)]
which have the largest absolute value in yFrame between 0.09 and 0.11 Hz for each channel.
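No answer is included here, but as a rough sketch, the same idxmax lookup can simply be applied per bin to generalize fBins to an array, assuming one tolerance for all bins (a hypothetical helper, not from the original post, using .loc/.items in place of the deprecated accessors):
def peak_values(frame, fBin, tol):
    # frequency index of the maximum absolute value around fBin, per channel
    absMaxIndex = frame[(fBin - tol):(fBin + tol)].abs().idxmax()
    # complex value at that frequency for each channel
    return [frame.loc[freq, col] for col, freq in absMaxIndex.items()]

fBins = np.array([0.1, 0.2])   # hypothetical array of centre frequencies
values = [peak_values(yFrame, fBin, tol) for fBin in fBins]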
