Clarification on the formulation of the Traveling Salesman Problem

I have been doing research on the traveling salesman problem, and I have a question about how it is formulated. Or this might be a question about the classification or naming of sub-problems or variations of the problem.
In the traveling salesman problem, are the cities placed in a space and the distances between the cities measured to form a graph with weighted connections, or can the weights on the edges be chosen arbitrarily, even though they might make it impossible to lay the cities out on a map?
If one of those is considered the standard traveling salesman problem, is there a name for the other one?

TSP can be defined in a lot of ways. You're describing the symmetric Euclidean TSP, where the weights correspond to the actual distances between the nodes, so a tour has the same length whether you travel it clockwise or counter-clockwise. As suggested by Phpdna, the triangle inequality is satisfied.
However, that's not the standard definition of the TSP. In fact, this IS the sub-problem or special case. The general problem can have any weight between each pair of nodes, and it doesn't have to be a Euclidean distance.
For example, if you were trying to formulate the shortest tour by the cost of travel rather than distance, you'd have the cost of travel between cities as the weight between the vertices... that could be anything. City A might be closest to city B on a Euclidean map, but the cost of travel from A to B might be phenomenally greater than from A to C to B for whatever reason. This is the general scenario. But either way, they're both NP-hard.
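To make the distinction concrete, here is a small sketch (Python, with a made-up asymmetric weight matrix that violates the triangle inequality) of the general formulation; a brute-force solver does not care whether the weights come from a map:
from itertools import permutations

# Hypothetical, arbitrary weight matrix: w[i][j] is the cost of going from city i
# to city j. It is neither symmetric (w[0][1] != w[1][0]) nor metric
# (w[0][1] > w[0][2] + w[2][1]).
w = [
    [0, 9, 2, 4],
    [3, 0, 6, 1],
    [5, 1, 0, 7],
    [8, 2, 3, 0],
]

def brute_force_tsp(w):
    # Try every tour that starts and ends at city 0; fine for tiny instances only.
    n = len(w)
    best_cost, best_tour = float("inf"), None
    for perm in permutations(range(1, n)):
        tour = (0,) + perm + (0,)
        cost = sum(w[tour[i]][tour[i + 1]] for i in range(n))
        if cost < best_cost:
            best_cost, best_tour = cost, tour
    return best_cost, best_tour

print(brute_force_tsp(w))   # the optimal tour for this particular matrix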

In the metric TSP the weights satisfy the triangle inequality, but if you have one-way streets or obstacles like mountains, canyons and so on, it's not the metric TSP.

Related

What type of model should I use?

I am trying to assess the influence of sex (nominal), altitude (nominal) and latitude (nominal) on corrected wing size (continuous; residual of wing size by body mass) of an animal species. I considered altitude as a nominal factor given the fact that this particular species is mainly distributed at the extremes (low and high) of steep elevational gradients in my study area. I also considered latitude as a nominal fixed factor given the fact that I have sampled individuals only at three main latitudinal levels (north, center and south).
I have been suggested to use Linear Mixed Model for this analysis. Specifically, considering sex, altitude, latitude, sex:latitude, sex:altitude, and altitude:latitude as fixed factors, and collection site (nominal) as the random effect. The latter given the clustered distribution of the collection sites.
However, I noticed that even though the corrected wing size follows a normal distribution, it violates the assumption of homoscedasticity among some altitudinal/latitudinal groups. I tried to use a non-parametric equivalent of factorial ANOVA (ARTool), but I cannot make it run because it does not allow cases of missing data and it requires assessing all possible fixed factors and their interactions. I would appreciate any advice on what type of model I can use given the design of my data and which software/package I can use to perform the analysis.
Thanks in advance for your kind attention.
Regards,
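As a starting point (not a fix for the heteroscedasticity issue), here is a minimal sketch of the suggested mixed model in Python with statsmodels; the data file and column names (wing, sex, altitude, latitude, site) are hypothetical placeholders for your data:
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical data set; one row per individual.
df = pd.read_csv("wing_data.csv")   # columns: wing, sex, altitude, latitude, site

# Fixed effects: all main effects and all two-way interactions
# (sex*altitude expands to sex + altitude + sex:altitude, and so on).
# Random effect: a random intercept for collection site.
model = smf.mixedlm(
    "wing ~ sex * altitude + sex * latitude + altitude * latitude",
    data=df,
    groups=df["site"],
)
result = model.fit()
print(result.summary())
Note that this sketch does not itself relax the homoscedasticity assumption; it only shows the structure of the model that was suggested to you.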

Flipping a three-sided coin

I have two related questions on population statistics. I'm not a statistician, but would appreciate pointers to learn more.
I have a process that results from flipping a three-sided coin (results: A, B, C), and I compute the statistic t=(A-C)/(A+B+C). In my problem, I have a set that randomly divides itself into sets X and Y, maybe uniformly, maybe not. I compute t for X and Y. I want to know whether the difference I observe in those two t values is likely due to chance or not.
Now if this were a simple binomial distribution (i.e., I'm just counting who ends up in X or Y), I'd know what to do: I compute n=|X|+|Y|, σ=sqrt(np(1-p)) (and I assume p=.5), and then I compare to the normal distribution. So, for example, if I observed |X|=45 and |Y|=55, I'd say σ=5, and so I'd expect a deviation of at most this size (one σ) from the mean μ=50 by chance 68.27% of the time. Alternately, I expect a greater deviation from the mean 31.73% of the time.
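As a quick check of that arithmetic (a sketch assuming SciPy is available):
from math import sqrt
from scipy.stats import norm

n, p = 100, 0.5                            # |X| + |Y| = 100, assumed p = 0.5
mu, sigma = n * p, sqrt(n * p * (1 - p))   # mean 50, standard deviation 5
z = abs(45 - mu) / sigma                   # observing |X| = 45 is 1 sigma away
within = norm.cdf(z) - norm.cdf(-z)        # probability of a deviation at most this large
print(within, 1 - within)                  # ~0.6827 and ~0.3173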
There's an intermediate problem, which also interests me and which I think may help me understand the main problem, where I measure some property of members of A and B. Let's say 25% in A measure positive and 66% in B measure positive. (A and B aren't the same cardinality -- the selection process isn't uniform.) I would like to know if I expect this difference by chance.
As a first draft, I computed t as though it were measuring coin flips, but I'm pretty sure that's not actually right.
Any pointers on what the correct way to model this is?
First problem
For the three-sided coin problem, have a look at the multinomial distribution. It's the distribution to use for a "binomial"-type problem with more than 2 outcomes.
Here is the example from Wikipedia (https://en.wikipedia.org/wiki/Multinomial_distribution):
Suppose that in a three-way election for a large country, candidate A received 20% of the votes, candidate B received 30% of the votes, and candidate C received 50% of the votes. If six voters are selected randomly, what is the probability that there will be exactly one supporter for candidate A, two supporters for candidate B and three supporters for candidate C in the sample?
Note: Since we’re assuming that the voting population is large, it is reasonable and permissible to think of the probabilities as unchanging once a voter is selected for the sample. Technically speaking this is sampling without replacement, so the correct distribution is the multivariate hypergeometric distribution, but the distributions converge as the population grows large.
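For instance, the quoted example works out like this (a sketch using SciPy):
from scipy.stats import multinomial

# P(1 supporter of A, 2 of B, 3 of C) in a sample of 6 with p = (0.2, 0.3, 0.5):
# 6!/(1!*2!*3!) * 0.2^1 * 0.3^2 * 0.5^3 = 0.135
print(multinomial.pmf([1, 2, 3], n=6, p=[0.2, 0.3, 0.5]))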
Second problem
The second problem seems to be a problem for cross-tabs. Use the "Chi-squared test for association" to test whether there is a significant association between your variables, and use the "standardized residuals" of your cross-tab to identify which of the associations is more likely to occur and which is less likely.
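A minimal sketch with SciPy, assuming you have the raw counts in a contingency table (the numbers below are made up just to illustrate the calls):
import numpy as np
from scipy.stats import chi2_contingency

# Rows: groups A and B; columns: measured positive, measured negative.
table = np.array([[25, 75],
                  [66, 34]])
chi2, p_value, dof, expected = chi2_contingency(table)
print(chi2, p_value)

# Pearson residuals, one per cell; large absolute values point to the cells
# that drive the association.
print((table - expected) / np.sqrt(expected))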

Cosine similarity alternative for tf-idf (triangle inequality)

I am trying to use tf-idf to cluster similar documents. One of the major drawbacks of my system is that it uses cosine similarity to decide which vectors should be grouped together.
The problem is that cosine similarity does not satisfy the triangle inequality. Because in my case I cannot have the same vector in multiple clusters, I have to merge every cluster that has an element in common, which can cause two documents to be grouped together even if they're not similar to each other.
Is there another way of measuring the similarity of two documents so that:
Vectors score as very similar based on their direction, regardless of their magnitude
The measure satisfies the triangle inequality: if A is similar to B and B is similar to C, then A is also similar to C
Not sure if it can help you. Have a look at the TS-SS method in this paper. It covers some drawbacks of cosine similarity and Euclidean distance (ED), which helps to identify similarity among vectors with higher accuracy. The higher accuracy helps you to understand which documents are highly similar and can be grouped together. The paper shows why TS-SS can help you with that.
Cosine is squared Euclidean on normalized data.
So simply L2 normalize your vectors to unit length, and use Euclidean.
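A small sketch of that relationship with NumPy: for unit-length vectors, ||a - b||^2 = 2 - 2*cos(a, b), so after L2 normalization Euclidean distance (a proper metric that satisfies the triangle inequality) preserves the cosine ordering:
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 0.5, 1.0])

a_hat = a / np.linalg.norm(a)   # L2-normalize to unit length
b_hat = b / np.linalg.norm(b)

cos_sim = a_hat @ b_hat
sq_euclid = np.sum((a_hat - b_hat) ** 2)
print(sq_euclid, 2 - 2 * cos_sim)   # the two values agree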

Dijkstra on 2D grid?

There are N points on a 2D grid (x,y). I need to find the shortest path, from point A to point B, but I can only travel from one point to another and I can't travel between two points if the distance between them is farther than a distance D. I thought it might be solved by using some kind of modified Dijkstra's algorithm, but I'm not sure how, because I've never implemented it before, just studied it on Wiki.
Well, Dijkstra finds shortest paths in graphs. So just consider the grid points to be nodes in a graph with edges between each node S and all other nodes T such that dist(S, T) <= D. You don't have to actually construct the graph because the edges are easily determined as needed by Dijkstra. Just check all nodes in a square of half-width D around S. An S-T edge exists iff (Sx - Tx)^2 + (Sy - Ty)^2 <= D^2.
The Wiki explanation is sufficient for this.
Dijkstra's algorithm takes three inputs: the graph, the starting node, and the ending node.
To construct the graph, just do something like this:
# Connect two points whenever they are at most D apart.
adj = {i: [] for i in range(len(points))}
for i in range(len(points)):
    for j in range(i + 1, len(points)):
        if dist(points[i], points[j]) <= D:
            adj[i].append(j)
            adj[j].append(i)
After constructing the graph, run Dijkstra.
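A minimal sketch of the Dijkstra step itself, reusing the adj lists built above; edge weights are the point-to-point distances (math.dist stands in for the dist helper):
import heapq
import math

def dijkstra(points, adj, start, goal):
    best = {start: 0.0}
    heap = [(0.0, start)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == goal:
            return d
        if d > best.get(u, float("inf")):
            continue                      # stale heap entry
        for v in adj[u]:
            nd = d + math.dist(points[u], points[v])
            if nd < best.get(v, float("inf")):
                best[v] = nd
                heapq.heappush(heap, (nd, v))
    return None                           # goal is unreachable with hops of length <= D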
The subtlety of a question like this lies in a critical definition - what is the measure of distance in your grid?
There are many different shortest path problems and solutions, and they are studied throughout mathematics. They are each characterised by the 'topology' of the area being searched. Consider a few distinct topologies with their own solutions:
A one sided piece of paper
Suppose your grid represents coordinates on a piece of paper - the shortest path is easy to find, as it is simply a straight line between those points.
The surface of the moon
If your grid represents locations on the moon in terms of latitude and longitude, the shortest path is an arc along the moon's surface - If you drove "in a straight line" between two points on the moon, you would be travelling in an arc, because of the moon's curvature.
Road Intersections
If you want to find the distance between two intersections in a grid of roads, where the traffic on each road has a different speed, and you can only travel along the roads, then you can find the shortest path using Dijkstra's algorithm.
One way road intersections
A slight variation of the above - we only need to consider roads in one direction. There might not be any paths in this case.
Summary
To give a good solution, we need to understand the topology of your grid. If the distance is given by Pythagoras's theorem, then that indicates Euclidean geometry (like in the piece of paper example), so the solution is a straight line.
Is it possible you mean that you can travel between any two points if they are closer than D - like flying a plane between airports, for example?
EDIT: I didn't see your comment because you didn't use #. In your case your grid is like the airports a plane can fly between. The shortest path is found using Dijkstra's algorithm - the immediate neighbours of a point are all points closer than D. Find them, represent it all as a graph, and use Dijkstra's algorithm.
I would suggest using the formula for the distance between 2 points, i.e. sqrt((x2-x1)^2+(y2-y1)^2). This distance is always the shortest between 2 points.

Calculating the distance between each pair of a set of points

So I'm working on simulating a large number of n-dimensional particles, and I need to know the distance between every pair of points. Allowing for some error, and given that the distance isn't relevant at all if it exceeds some threshold, are there any good ways to accomplish this? I'm pretty sure if I want dist(A,C) and already know dist(A,B) and dist(B,C) I can bound it by [|dist(A,B)-dist(B,C)| , dist(A,B)+dist(B,C)], and then store the results in a sorted array, but I'd like to not reinvent the wheel if there's something better.
I don't think the number of dimensions should greatly affect the logic, but maybe for some solutions it will. Thanks in advance.
If the problem were simply about calculating the distances between all pairs, then it would be an O(n^2) problem without any chance for a better solution. However, you are saying that if the distance is greater than some threshold D, then you are not interested in it. This opens up opportunities for a better algorithm.
For example, in 2D case you can use the sweep-line technique. Sort your points lexicographically, first by y then by x. Then sweep the plane with a stripe of width D, bottom to top. As that stripe moves across the plane new points will enter the stripe through its top edge and exit it through its bottom edge. Active points (i.e. points currently inside the stripe) should be kept in some incrementally modifiable linear data structure sorted by their x coordinate.
Now, every time a new point enters the stripe, you have to check the currently active points to the left and to the right no farther than D (measured along the x axis). That's all.
The purpose of this algorithm (as is typically the case with the sweep-line approach) is to push the practical complexity away from O(n^2) and towards O(m), where m is the number of interactions we are actually interested in. Of course, the worst-case performance will still be O(n^2).
The above applies to the 2-dimensional case. For the n-dimensional case I'd say you'll be better off with a different technique. Some sort of space partitioning should work well here, i.e. exploit the fact that if the distance between partitions is known to be greater than D, then there's no reason to consider the specific points in these partitions against each other.
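A rough sketch of the 2D stripe idea above, assuming points are (x, y) tuples; a plain sorted list stands in for a fancier incrementally modifiable structure:
import bisect
import math

def close_pairs_sweep(points, D):
    pts = sorted(points, key=lambda p: (p[1], p[0]))   # sweep bottom to top
    active = []                                        # points inside the stripe, sorted by x
    pairs = []
    for x, y in pts:
        # Drop points that have fallen more than D below the sweep line.
        active = [(ax, ay) for ax, ay in active if y - ay <= D]
        # Only candidates within D along the x axis need a real distance check.
        xs = [ax for ax, _ in active]
        lo = bisect.bisect_left(xs, x - D)
        hi = bisect.bisect_right(xs, x + D)
        for ax, ay in active[lo:hi]:
            if math.hypot(ax - x, ay - y) <= D:
                pairs.append(((ax, ay), (x, y)))
        bisect.insort(active, (x, y))
    return pairs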
If the distance beyond a certain threshold is not relevant, and this threshold is not too large, there are common techniques to make this more efficient: limit the search for neighbouring points using space-partitioning data structures. Possible options are:
Binning.
Trees: quadtrees(2d), kd-trees.
Binning with spatial hashing.
Also, since the distance from point A to point B is the same as the distance from point B to point A, each distance should only be computed once. Thus, you should use the following loop:
for i in range(n):
    for j in range(i + 1, n):
        d = distance(points[i], points[j])   # distance() is whatever metric you use
Combining these two techniques is very common for n-body simulation for example, where you have particles affect each other if they are close enough. Here are some fun examples of that in 2d: http://forum.openframeworks.cc/index.php?topic=2860.0
Here's an explanation of binning (and hashing): http://www.cs.cornell.edu/~bindel/class/cs5220-f11/notes/spatial.pdf
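Here is a rough sketch that combines the two ideas in 2D, assuming points are (x, y) tuples: hash each point into a square cell of side D, then only compare points in the same or adjacent cells (the same idea extends to n dimensions with one grid index per coordinate):
import math
from collections import defaultdict

def close_pairs_binned(points, D):
    cells = defaultdict(list)
    for p in points:
        cells[(int(p[0] // D), int(p[1] // D))].append(p)
    pairs = []
    for (cx, cy), bucket in cells.items():
        # A neighbour within D can only live in this cell or one of the 8 around it.
        for dx in (-1, 0, 1):
            for dy in (-1, 0, 1):
                for q in cells.get((cx + dx, cy + dy), ()):
                    for p in bucket:
                        if p < q and math.dist(p, q) <= D:
                            pairs.append((p, q))
    return pairs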
