get n smallest values of multidimensional xarray.DataArray - python-3.x

I'm currently working with some weather data that I have as netcdf files which I can easily read with pythons xarray library
I would now like to get the n smallest values of my DataArray which has 3 dimensions (longitude, latitude and time)
When I have a DataArray dr, I can just do dr.min(), maybe specify an axis and then I get the minimum, but when I want to get also the second smallest or even a variable amount of smallest values, it seems not to be as simple
What I currently do is:
with xr.open_dataset(path) as ds:
dr = ds[selection]
dr = dr.values.reshape(dr.values.size)
dr.sort()
n_smallest = dr[0:n]
which seems a bit complicated to me compared to the simple .min() I have to type for the smallest value
I actually want to get the times to the respective smallest values which I do for the smallest with:
dr.where(dr[selection] == dr[selection].min(), drop=True)[time].values
so is there a better way of getting the n smallest values? or maybe even a simple way to get the times for the n smallest values?
maybe there is a way to reduce the 3D DataArray along the longitude and latitude axis to the respective smallest values?

I just figured out there really is a reduce function for DataArray that allows me to reduce along longitude and latitude and as I don't reduce the time dimension, I can just use the sortby function and get the DataArray with min values for each day with their respective times:
with xr.open_dataset(path) as ds:
dr = ds[selection]
dr = dr.reduce(np.min,dim=[longitude,latitude])
dr.sortby(dr)
which is obviously not shorter than my original code, but perfectly satisfies my demands

Related

Question(s) regarding computational intensity, prediction of time required to produce a result

Introduction
I have written code to give me a set of numbers in '36 by q' format ( 1<= q <= 36), subject to following conditions:
Each row must use numbers from 1 to 36.
No number must repeat itself in a column.
Method
The first row is generated randomly. Each number in the coming row is checked for the above conditions. If a number fails to satisfy one of the given conditions, it doesn't get picked again fot that specific place in that specific row. If it runs out of acceptable values, it starts over again.
Problem
Unlike for low q values (say 15 which takes less than a second to compute), the main objective is q=36. It has been more than 24hrs since it started to run for q=36 on my PC.
Questions
Can I predict the time required by it using the data I have from lower q values? How?
Is there any better algorithm to perform this in less time?
How can I calculate the average number of cycles it requires? (using combinatorics or otherwise).
Can I predict the time required by it using the data I have from lower q values? How?
Usually, you should be able to determine the running time of your algorithm in terms of input. Refer to big O notation.
If I understood your question correctly, you shouldn't spend hours computing a 36x36 matrix satisfying your conditions. Most probably you are stuck in the infinite loop or something. It would be more clear of you could share code snippet.
Is there any better algorithm to perform this in less time?
Well, I tried to do what you described and it works in O(q) (assuming that number of rows is constant).
import random
def rotate(arr):
return arr[-1:] + arr[:-1]
y = set([i for i in range(1, 37)])
n = 36
q = 36
res = []
i = 0
while i < n:
x = []
for j in range(q):
if y:
el = random.choice(list(y))
y.remove(el)
x.append(el)
res.append(x)
for j in range(q-1):
x = rotate(x)
res.append(x)
i += 1
i += 1
Basically, I choose random numbers from the set of {1..36} for the i+q th row, then rotate the row q times and assigned these rotated rows to the next q rows.
This guarantees both conditions you have mentioned.
How can I calculate the average number of cycles it requires?( Using combinatorics or otherwise).
I you cannot calculate the computation time in terms of input (code is too complex), then fitting to curve seems to be right.
Or you could create an ML model with iterations as data and time for each iteration as label and perform linear regression. But that seems to be overkill in your example.
Graph q vs time
Fit a curve,
Extrapolate to q = 36.
You might want to also graph q vs log(time) as that may give an easier fitted curve.

Exclude indices from Pytorch tensor

I have an MxN table of distances between two distinct sets of points of size M <= N. I would like to find associate to each point of the first set M points in the second set in the following way.
Suppose that the shortest of all pairwise distances is between the i0 point of the first set and the j0 of the second. Then we attribute point i0 of the first set to j0 in the second. For the second pair, I have to find i1 != i0 and j1 != j0 such that the distance is minimal among remaining non-paired points.
I figure that I could do the first step by using torch.min function that will deliver me both minimal value as well as its 2d index in the matrix. But for the next steps I'll need to each time exclude a row a colunm, while keeping their original indices.
In other words, if I have a 3x4 matrix, and my first element is (1,2), I would like to be left with a 2x3 matrix with indices 0,2 and 0,1,3. So that, if my second desired element position in the original matrix is, say (2,3) I will be given (2,3) as a result of performing torch.max on the matrix with excluded row and column, rather than (1,2) again.
P.S. I could reach my goal by replacing the values in row and column I'd like to exclude by, say, positive infinities, but I think the question is still worth asking.

Fast algorithms to approximate distance between two strings

I am working on a project that requires to calculate minimum distance between two strings. The maximum length of each string can be 10,000(m) and we have around 50,000(n) strings. I need to find distance between each pair of strings. I also have a weight matrix that contains the the weight for each character pairs. Example, weight between (a,a) = (a,b) = 0.
Just iterating over all pair of string takes O(n^2) time. I have seen algorithms that takes O(m) time for finding distance. Then, the overall time complexity becomes O(n^2*m). Are there any algorithms which can do better than this using some pre-processing? It's actually the same problem as auto correct.
Do we have some algorithms that stores all the strings in a data structure and then we query the approximate distance between two strings from the data structure? Constructing the data structure can take O(n^2) and query processing should be done in less than O(m).
s1 = abcca, s2 = bdbbe
If we follow the above weighted matrix and calculate Euclidean distance between the two:
sqrt(0^2 + 9^2 + 9^2 + 9^2 + 342^2)
Context: I need to cluster time series and I have converted the time series to SAX representation with around 10,000 points. In order to cluster, I need to define a distance matrix. So, i need to calculate distance between two strings in an efficient way.
Note: All strings are of same length and the alphabet size is 5.
https://web.stanford.edu/class/cs124/lec/med.pdf
http://stevehanov.ca/blog/index.php?id=114

Coin Change Algorithm - DP with 1D array

I came across a solution to the Coin Change problem here : Coin Change. Here I was able to understand the first recursive method, the second method which uses DP with a 2D array. But am not able to understand the logic behind the third solution.
As far as I have thought, the last method works for problems in which the sequence of coins used in coin change is considered. Am I correct? Can anyone please explain me if I am wrong.
Well I figured it out myself!
This can be easily proved using induction. Let table[k] denote the ways change can be given for a total of k. Now the algorithm consists of two loops, one which is controlled by i and iterates through the array containing all the different coins and the other is the j controlled loop which for a given i, updates all the values of elements in array table. Now consider for a fixed i we have calculated the number of ways change can be given for all values from 1 to n and these values are stored in table from table[1] to table[n]. When the i controlled loop iterates for i+1, the value in table[j] for an arbitrary j is incremented by table[j-S[i + 1]] which is nothing but the ways we can create j using at least one coin with value S[i + 1] (the array which stores coin values). Thus the total value in table[j] equals the number of ways we can create a change with coins of value S[1]....S[i] (this was already stored before) and the value table[j-S[i + 1]]. This is same as the optimal substructure of the problem used in the recursive algorithm.
int arr[size];
memset(arr,0,sizeof(size));
int n;
cin>>n;
int sum;
cin>>sum;
int a[size];
fi(i,n)
cin>>a[i];
arr[0]=1;
fi(i,n)
for(int j=arr[i]; j<=n; j++)
a[j]+=a[j-arr[i]];
cout<<arr[n];
The array arr is initialised as 0 so as to show that the number of ways a sum of ican be represented is zero(that is not initialised). However, the number of ways in which a sum of 0 can be represented is 1 (zero way).
Further, we take each coin and start initialising each position in the array starting from the coin denomination.
a[j]+=a[j-arr[i]] means that we are basically incrementing the possible ways to represent the sum jby the previous number of ways, required (j-arr[i]).
In the end, we output the a[n]

Querying distance on 2 dimensions in a 3 dimension spatial data set

I have a case requirement where I need to find the nearest N suppliers who supply specific product types. The types in an int range of 0..1048575, which represents an hierarchy. Any supplier could have multiple points since they could supply multiple product types.
I could store the long & lat in PostGIS and the types in an indexed int array column, and query against both with a limit of N. However, I don't believe this will be efficient as I am not sure that PostgreSQL will use both indexes.
Another idea I have is to store the types in a third "vertical" dimension. This would create stacked "vertical" shape segments at the long & lat shape of each supplier. To query, I would get the nearest long & lat intersects that simply intersect the desired type on the third dimension.
Is this possible with PostGIS using say 3DM? In other words, can I have it calculate nearest neighbor using only the long and lat, but use all 3 dimensions for the intersection?

Resources