Does the optimum solution of a TSP remain the optimum even after skipping a few cities? - traveling-salesman

Let's say that I know the globally optimal solution to a standard 100-city Travelling Salesman Problem. Now, let's say that the salesman wants to skip over 5 of the cities. Does the TSP have to be re-solved? Will the sequence of cities obtained by simply deleting those cities from the previous optimal solution be the global optimum for the new 95-city TSP?

Updated: Replaced counterexample with Euclidean instance.
Great question.
No, if you remove some cities, the original sequence of cities does not necessarily remain optimal. Here is a counterexample:
The node coordinates are:
0 0
4 0
4 2
2.6 3
10 3
4 4
4 6
0 6
Here is the optimal tour (tour figure omitted).
Now suppose we don't need to visit node 5. If we just "close up" the original tour, the resulting tour has a cost of 21.94, while the optimal tour for the reduced instance has a cost of 21.44 (figures omitted).
(If you want to remove 5 cities instead of 1, just put all 5 cities close together all the way on the right.)
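To make the example reproducible, here is a small brute-force sketch (mine, not part of the original answer); it assumes the eight nodes are numbered 1-8 in the order listed above, so node 5 is the point at (10, 3).

from itertools import permutations
from math import dist

# Coordinates as listed above; assuming nodes are numbered 1-8 in that order,
# "node 5" is the point (10, 3) far to the right.
pts = [(0, 0), (4, 0), (4, 2), (2.6, 3), (10, 3), (4, 4), (4, 6), (0, 6)]

def tour_length(tour):
    # Sum of Euclidean edge lengths of the closed tour.
    return sum(dist(pts[a], pts[b]) for a, b in zip(tour, tour[1:] + tour[:1]))

def brute_force(nodes):
    # Fix the first node to remove rotational symmetry, then try all orders.
    first, *rest = nodes
    return min(((first,) + p for p in permutations(rest)), key=tour_length)

best8 = brute_force(range(len(pts)))
closed_up = tuple(i for i in best8 if i != 4)  # drop node 5, keep the order
best7 = brute_force([i for i in range(len(pts)) if i != 4])

print(tour_length(best8))      # length of the optimal 8-city tour
print(tour_length(closed_up))  # ~21.94, the "closed up" tour quoted above
print(tour_length(best7))      # ~21.44, the re-solved optimum quoted above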

Related

What is the probability of more than 100 people arriving at the station in 3 hours, if interarrival times are exponential with a mean of 2 minutes?

So, I got this problem:
"People arrive at the bus station with exponentially distributed interarrival times. You know that the mean of the distribution is 2 minutes. What is the probability that more than 100 people will arrive in 3 hours?"
I figured out that we have to calculate the probability of the observed mean interarrival time being under 1.8 minutes, but I don't really know how to solve this.
Is it something to do with confidence intervals?
So basically, to get 100 customers in 3 hours, the arrival spacing needs to be 1.8 minutes per customer. Using the cumulative distribution function of the exponential distribution, F(t) = 1 - e^(-λt):
Here λ = 0.5 (per minute) and t = 1.8. As we are looking for more than 100 customers within 3 hours, the integral of the density runs from 0 to 1.8.
This gives 1 - e^(-0.5*1.8) ≈ 0.5934, your answer.
You can refer to this link to get a hold on the theory and a few examples.
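For reference, a quick sketch of the arithmetic in the answer above (simply evaluating the exponential CDF with λ = 0.5 per minute and t = 1.8 minutes):

from math import exp

lam = 0.5  # arrival rate per minute (mean interarrival time = 2 minutes)
t = 1.8    # minutes per customer needed to reach 100 arrivals in 3 hours

# Exponential CDF: F(t) = 1 - exp(-lambda * t)
p = 1 - exp(-lam * t)
print(round(p, 4))  # 0.5934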

xgboost: handling of missing values for split candidate search

In section 3.4 of their article, the authors explain how they handle missing values when searching for the best candidate split during tree growing. Specifically, they learn a default direction for nodes whose splitting feature has missing values in the current instance set. At prediction time, if the prediction path goes through such a node and the feature value is missing, the default direction is followed.
However, the prediction phase would break down when the feature value is missing and the node does not have a default direction (and this can occur in many scenarios). In other words, how do they associate a default direction with all nodes, even those whose splitting feature had no missing values in the active instance set at training time?
xgboost always assigns a default direction for missing values at each split, even if none are present in training. The default is the "yes" direction of the split criterion; it is learned from the data only when missing values are present in training.
From the author (link):
This can be observed with the following code:
require(xgboost)
data(agaricus.train, package = 'xgboost')
sum(is.na(agaricus.train$data))
## [1] 0
bst <- xgboost(data = agaricus.train$data,
               label = agaricus.train$label,
               max.depth = 4,
               eta = .01,
               nround = 100,
               nthread = 2,
               objective = "binary:logistic")
dt <- xgb.model.dt.tree(model = bst)  ## records all the splits
> head(dt)
ID Feature Split Yes No Missing Quality Cover Tree Yes.Feature Yes.Cover Yes.Quality
1: 0-0 28 -1.00136e-05 0-1 0-2 0-1 4000.5300000 1628.25 0 55 924.50 1158.2100000
2: 0-1 55 -1.00136e-05 0-3 0-4 0-3 1158.2100000 924.50 0 7 679.75 13.9060000
3: 0-10 Leaf NA NA NA NA -0.0198104 104.50 0 NA NA NA
4: 0-11 7 -1.00136e-05 0-15 0-16 0-15 13.9060000 679.75 0 Leaf 763.00 0.0195026
5: 0-12 38 -1.00136e-05 0-17 0-18 0-17 28.7763000 10.75 0 Leaf 678.75 -0.0199117
6: 0-13 Leaf NA NA NA NA 0.0195026 763.00 0 NA NA NA
No.Feature No.Cover No.Quality
1: Leaf 104.50 -0.0198104
2: 38 10.75 28.7763000
3: NA NA NA
4: Leaf 9.50 -0.0180952
5: Leaf 1.00 0.0100000
6: NA NA NA
> all(dt$Missing == dt$Yes,na.rm = T)
[1] TRUE
source code
https://github.com/tqchen/xgboost/blob/8130778742cbdfa406b62de85b0c4e80b9788821/src/tree/model.h#L542
My understanding of the algorithm is that a default direction is assigned probabilistically based on the distribution of the training data if no missing data is available at training time, i.e. just go in the direction with the majority of samples in the training set. In practice I'd say it's a bad idea to have missing data in your data set. Generally, the model will perform better if the data scientist cleans the data set up in a smart way before training the GBM algorithm; for example, replace all NAs with the mean/median value, or impute each value by finding the K nearest neighbors and averaging their values for that feature, as sketched below.
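As a rough sketch of those two clean-up strategies (median replacement and k-nearest-neighbour imputation), here is one way to do it in Python with scikit-learn; the toy matrix and parameters are illustrative, not from the original answer:

import numpy as np
from sklearn.impute import SimpleImputer, KNNImputer

# Toy feature matrix with a couple of missing entries.
X = np.array([[1.0, 2.0],
              [np.nan, 3.0],
              [7.0, 6.0],
              [4.0, np.nan]])

# Strategy 1: replace each NaN with the column median.
X_median = SimpleImputer(strategy="median").fit_transform(X)

# Strategy 2: replace each NaN with the average of its k nearest neighbours.
X_knn = KNNImputer(n_neighbors=2).fit_transform(X)

print(X_median)
print(X_knn)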
I'm also wondering why data would be missing at test time and not at training time. That seems to imply the distribution of your data is evolving over time. An algorithm that can be trained incrementally as new data becomes available, like a neural net, may do better in your use case. Or you could always build a specialist model. For example, let's say the missing feature is credit score, because some people may not allow you to access their credit. Why not train one model using credit and one without it? The model trained without credit may be able to recover much of the lift credit was providing by using other correlated features.
Thank you for sharing your thoughts, @Josiah. Yes, I totally agree that it is better to avoid missing data in the dataset, but sometimes replacing them is not the optimal solution. In addition, if we have a learning algorithm such as GBM that can cope with them, why not give it a try? The scenario I'm thinking about is when some features have few missing values (<10%) or even fewer.
Regarding the second point, the scenario I have in mind is the following: the tree has already been grown to some depth, so the instance set is no longer the full one. For a new node, the best candidate split is found to be on a feature f that originally contains some missing values, but has none in the current instance set, so no default branch is learned there. So even if f contains missing values in the training dataset, this node doesn't have a default branch, and a test instance falling here would be stuck.
Maybe you are right and the default branch will be the one with more examples if no missing values are present. I'll try to reach out to the authors and post the reply here, if any.

How to compare means of two sets when one set is a subset of the other and the sample sizes are not equal

I have two sets containing citation counts for some publications. One of the sets is a subset of the other; that is, the subset contains some of the exact citation counts appearing in the other set, e.g.:
Set1  Set2 (Subset)
50    50
24    24
12    -
5     5
4     4
43    43
2     -
2     -
1     -
1     -
I want to decide whether the numbers in the subset are good enough to represent Set1. On this matter:
I intended to apply a Student's t-test, but I am not sure how to apply it. The sets are dependent, so I cannot use the unpaired t-test, which requires both samples to come from independent populations. On the other hand, the paired t-test does not look suitable either, since it requires equal sample sizes.
In case of an outlier, should I remove it? To me that does not seem logical, since it is not really an outlier: it is simply a publication that is cited a lot, so it belongs to the same sample. How should I deal with such cases? If I do not remove it, it makes the variance very large, which affects the statistical tests. Is it a good idea to use the median instead of the mean, since citation distributions generally tend to be highly skewed?
How could I remedy this issue?

Levenshtein cost settings

I've been asked to guess the user's intention when part of the expected data is missing. For example, if I'm looking to get "very well" or "not very well" but I get only "not" instead, then I should flag it as "not very well".
The Levenshtein distance between "not" and "very well" is 9, and the distance between "not" and "not very well" is 10. I think I'm actually trying to drive a screw with a wrench, but our team has already agreed to use Levenshtein distance for this case.
Given the problem above, is there any way I can make this work by changing the insertion, replacement and deletion costs?
P.S. I'm not looking for a hack for this particular example. I want something that generally works as expected and outputs a better result in these cases also.
With insertion/deletion cost 1 and replacement cost 2, the distance between "not" and "very well" is actually 12. The alignment is:
------not
very well
So there are 6 insertions with a total cost of 6 (cost 1 for each insertion), and 3 replacements with a total cost of 6 (cost 2 for each replacement). The total cost is 12.
The distance between "not" and "not very well" is 10. The alignment is:
not----------
not very well
This involves only 10 insertions. So you can choose "not very well" as the best match.
The cost and alignment can be computed with htql for Python:
import htql
a=htql.Align()
a.align('not', 'very well')
# (12.0, ['------not', 'very well'])
a.align('not', 'not very well')
# (10.0, ['not----------', 'not very well'])
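If you would rather not depend on htql, here is a minimal sketch of the same idea: a plain dynamic-programming edit distance with configurable costs (insertion/deletion cost 1, replacement cost 2), which reproduces the numbers above.

def weighted_levenshtein(a, b, ins=1, delete=1, sub=2):
    # Standard dynamic-programming edit distance with configurable costs.
    prev = [j * ins for j in range(len(b) + 1)]
    for i, ca in enumerate(a, start=1):
        cur = [i * delete]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + delete,                        # delete ca
                           cur[j - 1] + ins,                        # insert cb
                           prev[j - 1] + (0 if ca == cb else sub))) # match / replace
        prev = cur
    return prev[-1]

print(weighted_levenshtein('not', 'very well'))      # 12
print(weighted_levenshtein('not', 'not very well'))  # 10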

How can I measure separability between different numbers of instances of one feature vector?

How can I measure separability between different numbers of instances of one feature vector?
For example, the main vector is V = [1 1 2 3 4 5 7 8 10 100 1000 99 999 54] and different combinations with different sample lengths are
t1=[1 1 2 3 99 1000] or t2=[1 10 1000] or t3=[2 3 4 10 100 99 999 54]
Which one is more separable and more informative?
If I feed them to a GMM, the vectors with fewer samples get a higher probability, which does not seem fair.
train=[1 2 1 2 1 2 100 101 102 99 100 101 1000 1001 999 1003];
No_of_Iterations=10;
No_of_Clusters=3;
[mm,vv,ww]=gaussmix(train,[],No_of_Iterations,No_of_Clusters);
test1=[1 1 1 2 2 2 100 100 100 101 1000 1000 1000];
test2=[1 1 2 2 100 99 1000 999];
test3=[1 100 1000];
[lp,rp,kh,kp]=gaussmixp(test1,mm,vv,ww);
sum(lp)
[lp,rp,kh,kp]=gaussmixp(test2,mm,vv,ww);
sum(lp)
[lp,rp,kh,kp]=gaussmixp(test3,mm,vv,ww);
sum(lp)
The results are as follows:
ans =
-8.0912e+05
ans =
-8.1782e+05
ans =
-5.0381e+05
I would really appreciate it if you could help me.
How can I measure separability between different numbers of instances of one feature vector?
The notion of "separability" is not strict. If the data are linearly separable, one could define the size of the margin as the "separability", but for data that are not linearly separable there is no definite answer even to the question "how easy is it to separate this data", as it is a heavily model-dependent question: the answer will be completely different if you want to separate the data with an SVM with some particular kernel than if you want to use a decision tree, etc. There are many possible probabilistic, geometric and statistical approaches to such analysis, but this is not a Q&A-sized problem; it is a hard and long process of data analysis, performed by skilled researchers.
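To make the margin idea concrete, here is a small sketch (my own illustration, not part of the original answer) that scores a linearly separable toy set by its SVM margin width:

import numpy as np
from sklearn.svm import SVC

# Toy 1-D data in the spirit of the question's vectors (my own values, not the OP's):
# two well-separated groups.
X = np.array([1, 2, 3, 99, 100, 101], dtype=float).reshape(-1, 1)
y = np.array([0, 0, 0, 1, 1, 1])

# Fit a (nearly) hard-margin linear SVM; the margin width 2 / ||w|| is one
# possible "separability" score for linearly separable data.
clf = SVC(kernel="linear", C=1e6).fit(X, y)
print(2.0 / np.linalg.norm(clf.coef_))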
Which one is more separable and more informative?
That depends on the exact definitions of separability and informativeness. This is not a question that can be answered in Q&A fashion; it is a research topic, not an issue to solve.
If I feed them to a GMM, the vectors with fewer samples get a higher probability, which does not seem fair.
You have already asked a question about this and received an answer showing why it is "fair".
You can try asking on http://stats.stackexchange.com, but you will likely hear a similar answer: that "it depends" and that it is impossible to answer such a question "by hand".
