Find the closest distance between every galaxy in the data and create pairs based on closest distance between them - python-3.x

My task is to pair up galaxies that are closest together from a large list of galaxies. I have the RA, DEC and Z of each, and a formula to work out the distance between each one from the data given. However, I can't work out an efficient method of iterating over the whole list to find the distance between EACH galaxy and EVERY other galaxy in the list, with the intention of then matching each galaxy with its nearest neighbour.
The data has been imported in the following way:
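from astropy.io import fits  # assumption: fits comes from astropy
hdulist = fits.open("documents/RADECMASSmatch.fits")
data = hdulist[1].data  # assumption: the table is stored in the first extension
CATAID = data['CATAID_1']
Xpos_DEIMOS_1 = data['Xpos_DEIMOS_1']
z = data['Z_1']
RA = data['RA']
DEC = data['DEC']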
I have tried something like:
radiff = []
for i in range(0,n):
    for j in range(i+1,n):
        radiff.append(abs(RA[i]-RA[j]))
to initially work out difference in RA and DEC between every galaxy, which does actually work but I feel like there must be a better way.
A friend suggested something along the lines of:
# one (RA, DEC, Z) tuple per galaxy
galaxy_coords = list(zip(data['RA'], data['DEC'], data['Z']))
separation_matrix = np.zeros((len(galaxy_coords), len(galaxy_coords)))
done = []
for i, coords1 in enumerate(galaxy_coords):
    for j, coords2 in enumerate(galaxy_coords):
        if (j, i) in done:
            # distance is symmetric, so reuse the value already computed
            separation_matrix[i, j] += separation_matrix[j, i]
            continue
        separation = your_formula(coords1, coords2)
        separation_matrix[i, j] += separation
        done.append((i, j))
But I don't really understand this so can't readily apply it. I've tried but it yields nothing useful.
Any help with this would be much appreciated, thanks

Your friend's code seems to be generating a 2D array of the distances between each pair, and taking advantage of the symmetry (distance(x,y) = distance(y,x)). It would be slightly better if it used itertools to generate combinations, and assigned your_formula(coords1, coords2) to separation_matrix[i,j] and separation_matrix[j,i] within the same iteration, rather than having separate iterations for both i,j and j,i.
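As a rough sketch of that suggestion, the pairs can be generated with itertools.combinations so that each distance is computed once and written to both [i][j] and [j][i]. The function names here are made up for illustration, and math.dist is only a stand-in for your own distance formula:

```python
from itertools import combinations
import math


def separation_matrix(coords, distance):
    """Build a symmetric matrix of pairwise distances, computing each pair once."""
    n = len(coords)
    matrix = [[0.0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        d = distance(coords[i], coords[j])
        matrix[i][j] = d
        matrix[j][i] = d  # symmetry: distance(x, y) == distance(y, x)
    return matrix


def nearest_neighbours(matrix):
    """For each row i, return the index j != i with the smallest distance."""
    return [
        min((j for j in range(len(row)) if j != i), key=row.__getitem__)
        for i, row in enumerate(matrix)
    ]


# Placeholder data and placeholder metric (plain Euclidean distance).
points = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (5.0, 0.0, 0.0)]
sep = separation_matrix(points, math.dist)
pairs = nearest_neighbours(sep)
```

This is still quadratic in the number of galaxies, so it only halves the work compared with the double loop.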
Even better would probably be this package, which uses a tree-based algorithm: https://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.spatial.KDTree.html . It seems to be focused on rectilinear coordinates, but converting your coordinates to rectilinear ones first should be doable in linear time.
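A minimal sketch of the tree-based approach, assuming you have already turned each redshift into a radial distance (the dist argument below is hypothetical, and how you derive it from Z depends on your own formula). RA and DEC are assumed to be in degrees:

```python
import numpy as np
from scipy.spatial import KDTree


def nearest_neighbours_kdtree(ra_deg, dec_deg, dist):
    """Return (neighbour index, neighbour distance) for every galaxy."""
    ra = np.radians(np.asarray(ra_deg, dtype=float))
    dec = np.radians(np.asarray(dec_deg, dtype=float))
    dist = np.asarray(dist, dtype=float)
    # Spherical -> Cartesian conversion, linear in the number of galaxies.
    xyz = np.column_stack((
        dist * np.cos(dec) * np.cos(ra),
        dist * np.cos(dec) * np.sin(ra),
        dist * np.sin(dec),
    ))
    tree = KDTree(xyz)
    # k=2 because the closest hit for each point is the point itself
    # (this assumes no two galaxies share exactly the same position).
    distances, indices = tree.query(xyz, k=2)
    return indices[:, 1], distances[:, 1]


# Placeholder data: three galaxies at unit distance on the equator.
idx, sep = nearest_neighbours_kdtree([0.0, 1.0, 180.0], [0.0, 0.0, 0.0], [1.0, 1.0, 1.0])
```

Note that nearest-neighbour matching is not symmetric: galaxy A's nearest neighbour may itself have a different nearest neighbour, so you still need to decide how to turn the result into pairs.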

Related

Valid Sudoku: How to decrease runtime

Problem is to check whether the given 2D array represents a valid Sudoku or not. Given below are the conditions required
Each row must contain the digits 1-9 without repetition.
Each column must contain the digits 1-9 without repetition.
Each of the 9 3x3 sub-boxes of the grid must contain the digits 1-9 without repetition.
Here is the code I wrote for this. Please give me tips on how to make it faster and reduce the runtime, and tell me whether using dictionaries is slowing my program down.
from typing import List

def isValidSudoku(self, boards: List[List[str]]) -> bool:
    r = {}  # digit+column keys seen so far
    a = {}  # digit+box keys seen so far
    for i in range(len(boards)):
        c = {}  # digits seen in the current row
        for j in range(len(boards[i])):
            if boards[i][j] != '.':
                x,y = r.get(boards[i][j]+f'{j}',0),c.get(boards[i][j],0)
                u,v = (i+3)//3,(j+3)//3
                z = a.get(boards[i][j]+f'{u}{v}',0)
                if (x==0 and y==0 and z==0):
                    r[boards[i][j]+f'{j}'] = x+1
                    c[boards[i][j]] = y+1
                    a[boards[i][j]+f'{u}{v}'] = z+1
                else:
                    return False
    return True
Simply optimizing assignment without rethinking your algorithm limits your overall efficiency by a lot. When you make a choice you generally take a long time before discovering a contradiction.
Instead of representing, "Here are the values that I have figured out", try to represent, "Here are the values that I have left to try in each spot." And now your fundamental operation is, "Eliminate this value from this spot." (Remember, getting it down to 1 propagates to eliminating the value from all of its peers, potentially recursively.)
Assignment is now "Eliminate all values but this one from this spot."
And now your fundamental search operation is, "Find the square with the fewest remaining possibilities (more than 1). Try each possibility in turn."
This may feel heavyweight. But the immediate propagation of constraints results in very quickly discovering constraints on the rest of the solution, which is far faster than having to do exponential amounts of reasoning before finding the logical contradiction in your partial solution so far.
I recommend doing this yourself. But https://norvig.com/sudoku.html has full working code that you can look at if needed.
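To make the eliminate / assign / search idea concrete, here is a compact sketch in the spirit of that approach. The names and the 81-character grid format (digits for givens, '0' or '.' for blanks) are choices made for this illustration, not a definitive implementation:

```python
def cross(a, b):
    return [x + y for x in a for y in b]


DIGITS = "123456789"
ROWS = "ABCDEFGHI"
SQUARES = cross(ROWS, DIGITS)
UNIT_LIST = (
    [cross(ROWS, c) for c in DIGITS]
    + [cross(r, DIGITS) for r in ROWS]
    + [cross(rs, cs) for rs in ("ABC", "DEF", "GHI") for cs in ("123", "456", "789")]
)
UNITS = {s: [u for u in UNIT_LIST if s in u] for s in SQUARES}
PEERS = {s: set(sum(UNITS[s], [])) - {s} for s in SQUARES}


def eliminate(values, s, d):
    """Remove digit d from square s and propagate; return None on contradiction."""
    if d not in values[s]:
        return values  # already eliminated
    values[s] = values[s].replace(d, "")
    if not values[s]:
        return None  # no candidates left
    if len(values[s]) == 1:
        # Down to one value: eliminate it from every peer.
        last = values[s]
        for p in PEERS[s]:
            if eliminate(values, p, last) is None:
                return None
    for unit in UNITS[s]:
        # If d now fits in only one square of a unit, put it there.
        places = [sq for sq in unit if d in values[sq]]
        if not places:
            return None
        if len(places) == 1 and assign(values, places[0], d) is None:
            return None
    return values


def assign(values, s, d):
    """Assignment is 'eliminate all values but d from square s'."""
    for other in values[s].replace(d, ""):
        if eliminate(values, s, other) is None:
            return None
    return values


def search(values):
    """Pick the square with the fewest candidates (> 1) and try each in turn."""
    if values is None:
        return None
    if all(len(values[s]) == 1 for s in SQUARES):
        return values
    _, s = min((len(values[s]), s) for s in SQUARES if len(values[s]) > 1)
    for d in values[s]:
        result = search(assign(values.copy(), s, d))
        if result is not None:
            return result
    return None


def solve(grid):
    values = {s: DIGITS for s in SQUARES}
    for s, ch in zip(SQUARES, grid):
        if ch in DIGITS and assign(values, s, ch) is None:
            return None
    return search(values)
```

Calling solve with an 81-character string returns a dictionary mapping each square name (such as "A1") to its digit, or None if the grid is contradictory.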

How to create a multi-diagonal square matrix in Theano?

Is there a better way to create a multi-diagonal square matrix in theano than the following,
A = theano.tensor.nlinalg.AllocDiag(offset=0)(x)
A += theano.tensor.nlinalg.AllocDiag(offset=1)(x[:-1])
A += theano.tensor.nlinalg.AllocDiag(offset=-1)(x[1:])
where x is the vector I want on the diagonals? Each time I call AllocDiag()(), a new Apply node is created, which causes memory issues and inefficiencies.
I'm hoping there is a way similar to scipy where a list of vectors can be passed into the function with a corresponding list of offsets, see https://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.sparse.diags.html.
Any assistance is much appreciated.
One way that doesn't require AllocDiag()() is to use theano.tensor.set_subtensor() with A[range(n),range(n)] to obtain the diagonal indexes, where A is an n*n matrix. Something like the following (with theano.tensor imported as tt):
A = tt.set_subtensor(A0[range(n),range(n)], x)
A = tt.set_subtensor(A[range(n-1),range(1,n)], x[:-1])
A = tt.set_subtensor(A[range(1,n),range(n-1)], x[1:])
where A0 is the initial matrix, for example, a matrix of zeros.
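The index pattern is easier to check outside Theano. The following is a NumPy analogue (not Theano code) that fills the same three diagonals with the same index pairs; the function name is made up for illustration:

```python
import numpy as np


def tridiagonal_from_vector(x):
    """Place x on the main diagonal, x[:-1] above it and x[1:] below it."""
    x = np.asarray(x, dtype=float)
    n = x.shape[0]
    A = np.zeros((n, n))
    A[np.arange(n), np.arange(n)] = x             # main diagonal
    A[np.arange(n - 1), np.arange(1, n)] = x[:-1]  # offset +1
    A[np.arange(1, n), np.arange(n - 1)] = x[1:]   # offset -1
    return A


A = tridiagonal_from_vector([1.0, 2.0, 3.0])
```

Each (row indexes, column indexes) pair here corresponds to one set_subtensor call above.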

PageRank with custom initial scores

I am trying to implement a simple algorithm that will calculate PageRank on a directed network generated and handled with NetworkX. However, I'd like to add a simple change: rather than having the initial PageRank for each node be equal to 1/n, where n is the number of nodes in the graph, I want each node to have rank 1.
So far I have tried checking out the official documentations on PageRank, but I found nothing that seems to help. Apparently the 'personalization' parameter is of no use either. I tried using nstart, but to no avail. The code currently looks like this:
import networkx as nx
D=nx.DiGraph()
D.add_weighted_edges_from([('1','2',0.5),('1','3',0.5)])
nst = {n: 1 for n in D.nodes}
print(nx.pagerank(D, alpha = 0.95, nstart=nst))
At the moment, the ranks given to each node at the end of the calculation still sum up to 1, while they should sum up to 3.
Is such a thing even feasible to begin with? Should I look elsewhere to implement such an algorithm? Could there be problems with convergence if such a change is applied? Thanks in advance.
PageRank in networkx has a parameter nstart:
nstart (dictionary, optional) – Starting value of PageRank iteration for each node.
Here is source code for this:
# Choose fixed starting vector if not given
if nstart is None:
    x = dict.fromkeys(W, 1.0 / N)
else:
    # Normalized nstart vector
    s = float(sum(nstart.values()))
    x = dict((k, v / s) for k, v in nstart.items())
You can just specify nstart in your code, like this:
nst = {n: 1 for n in G.nodes}
pr = nx.pagerank(G, nstart=nst)
Edit 1: The modern PageRank algorithm forcibly normalizes the start vector (you can see this in the code above). The whole algorithm is based on that normalization, and if you force the nstart values to be 1 rather than 1/N, it will break, because the convergence condition will never be met (the error increases with each iteration). If you want to use 1 as the starting value, as in the original PageRank algorithm:
In the original form of PageRank, the sum of PageRank over all pages was the total number of pages on the web at that time, so each page in this example would have an initial value of 1.
then you should implement the whole algorithm manually, because that original form is deprecated.
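A minimal hand-written sketch of such an unnormalized variant might look like the following. It is not part of networkx; the function name, the dict-of-dicts graph format and the choice to spread the rank of dangling nodes evenly over all nodes are all assumptions made for this illustration:

```python
def pagerank_sum_to_n(out_links, damping=0.85, tol=1e-10, max_iter=1000):
    """Power iteration where every node starts at rank 1 and the ranks sum to N.

    out_links maps each node to a dict {target: weight}; every node must
    appear as a key, even if it has no outgoing edges.
    """
    nodes = list(out_links)
    n = len(nodes)
    rank = {v: 1.0 for v in nodes}
    for _ in range(max_iter):
        # Assumption: dangling nodes share their rank equally among all nodes.
        dangling = sum(rank[v] for v in nodes if not out_links[v])
        new = {v: (1.0 - damping) + damping * dangling / n for v in nodes}
        for v in nodes:
            total_weight = sum(out_links[v].values())
            for target, weight in out_links[v].items():
                new[target] += damping * rank[v] * weight / total_weight
        err = sum(abs(new[v] - rank[v]) for v in nodes)
        rank = new
        if err < n * tol:
            break
    return rank


# The graph from the question: '1' points to '2' and '3' with weight 0.5 each.
ranks = pagerank_sum_to_n({"1": {"2": 0.5, "3": 0.5}, "2": {}, "3": {}})
```

With this update rule the total rank stays at N in every iteration, so the result is the usual normalized PageRank scaled by N under the same dangling-node convention.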

Simultaneous Subset sums

I am dealing with a problem which is a variant of a subset-sum problem, and I am hoping that the additional constraint could make it easier to solve than the classical subset-sum problem. I have searched for a problem with this constraint but I have been unable to find a good example with an appropriate algorithm either on StackOverflow or through googling elsewhere.
The problem:
Assume you have two lists of positive numbers A1,A2,A3... and B1,B2,B3... with the same number of elements N. There are two sums Sa and Sb. The problem is to find the simultaneous set Q where |sum (A{Q}) - Sa| <= epsilon and |sum (B{Q}) - Sb| <= epsilon. So, if Q is {1, 5, 7} then A1 + A5 + A7 - Sa <= epsilon and B1 + B5 + B7 - Sb <= epsilon. Epsilon is an arbitrarily small positive constant.
Now, I could solve this as two completely separate subset sum problems, but removing the simultaneity constraint results in the possibility of erroneous solutions (where Qa != Qb). I also suspect that the additional constraint should make this problem easier than the two NP complete problems. I would like to solve an instance with 18+ elements in both lists of numbers, and most subset-sum algorithms have a long run time with this number of elements. I have investigated the pseudo-polynomial run time dynamic programming algorithm, but this has the problems that a) the speed relies on a short bit-depth of the list of numbers (which does not necessarily apply to my instance) and b) it does not take into account the simultaneity constraint.
Any advice on how to use the simultaneity constraint to reduce the run time? Is there a dynamic programming approach I could use to take into account this constraint?
If I understand your description of the problem correctly (I'm confused about why you have the distance symbols around "sum (A{Q}) - Sa" and "sum (B{Q}) - Sb"; they don't seem to fit the rest of the explanation), then it is NP-hard.
You can see this by making a reduction from Subset sum (SUB) to Simultaneous subset sum (SIMSUB).
If you have a SUB problem consisting of a set X = {x1,x2,...,xn} and a target called t, and you have an algorithm that solves SIMSUB when given two sets A = {a1,a2,...,an} and B = {b1,b2,...,bn}, two integers Sa and Sb and a value for epsilon, then we can solve SUB like this:
Let A = X and let B be a set of length n consisting of only 0's. Set Sa = t, Sb = 0 and epsilon = 0. You can now run the SIMSUB algorithm on this problem and get the solution to your SUB problem.
This shows that SIMSUB is at least as hard as SUB and is therefore NP-hard.
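The reduction can be sketched in code. The brute-force SIMSUB solver below is only a stand-in for "an algorithm that solves SIMSUB" (it tries every index set, so it is exponential); the function names are made up for illustration:

```python
from itertools import combinations


def simsub_bruteforce(a, b, sa, sb, eps):
    """Return an index tuple Q with |sum(a[Q]) - sa| <= eps and
    |sum(b[Q]) - sb| <= eps, or None if no such Q exists."""
    n = len(a)
    for size in range(n + 1):
        for q in combinations(range(n), size):
            if (abs(sum(a[i] for i in q) - sa) <= eps
                    and abs(sum(b[i] for i in q) - sb) <= eps):
                return q
    return None


def subset_sum_via_simsub(x, t):
    """Solve SUB by reducing it to SIMSUB: A = X, B = all zeros, Sb = 0, eps = 0."""
    return simsub_bruteforce(x, [0] * len(x), t, 0, 0)


numbers = [3, 34, 4, 12, 5, 2]
found = subset_sum_via_simsub(numbers, 9)
missing = subset_sum_via_simsub(numbers, 30)
```

Here found is a tuple of indexes into numbers whose elements add up to 9, and missing is None because no subset adds up to 30.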

MATLAB: fastest way to do a root-mean-squared error between a vector and array of vectors

I have a question regarding the fastest way to compute the RMSE between a single vector and an array of vectors. Specifically, I have a vector A representing a point and would like to find the index in a list B of points that A is closest to. Right now I am using:
tempmat = bsxfun(@minus,A,B);
tempmat1 = sqrt(sum(tempmat.^2,2));
index = find(tempmat1 == min(tempmat1));
This takes about 0.058 seconds to calculate the index. Is there a faster way of doing this in MATLAB? I am performing this calculation literally millions of times.
Many thanks for reading,
Joe
tempmat = bsxfun(@minus,A,B);
tempmat1 = sum(tempmat.^2,2);
[m,index] = min(tempmat1);
m = sqrt(m); %# optional, only if you need the actual numerical value
This avoids calculating sqrt on the whole array, since the minimum of the squared differences will have the same index. It also uses the second output of min to avoid the second pass of find.
You'll probably find that
tempmat = A - B(ones(1, size(A,1)), :)
is faster than the bsxfun version, unless size(A,1) is exceptionally large.
This assumes that A is your array and B is your vector. The RSS calculation implies that you have row vectors.
Also, I presume you know that you're calculating the RSS not RMS.
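For comparison, here is the same idea (compare squared distances and take the square root only of the winner) as a small plain-Python sketch. It is an illustration of the technique rather than a MATLAB replacement, the function name is made up, and the returned index is 0-based:

```python
import math


def closest_index(a, points):
    """Return (index, squared distance) of the point closest to a."""
    best_index, best_sq = -1, float("inf")
    for i, p in enumerate(points):
        # Squared distance only: sqrt is monotonic, so the argmin is unchanged.
        sq = sum((ai - pi) ** 2 for ai, pi in zip(a, p))
        if sq < best_sq:
            best_index, best_sq = i, sq
    return best_index, best_sq


index, sq = closest_index([1.0, 1.0], [[0.0, 0.0], [1.0, 2.0], [5.0, 5.0]])
distance = math.sqrt(sq)  # optional, only if the actual value is needed
```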
