How can I measure separability between different numbers of instances of one feature vector? - statistics

How can I measure separability between different numbers of instances of one feature vector?
For example, the main vector is V=[1 1 2 3 4 5 7 8 10 100 1000 99 999 54], and different combinations with different sample lengths are
t1=[1 1 2 3 99 1000] or t2=[1 10 1000] or t3=[2 3 4 10 100 99 999 54]
Which one is more separable and more informative?
If I put them into a GMM, the vectors with fewer samples get a better probability, which is not fair.
train=[1 2 1 2 1 2 100 101 102 99 100 101 1000 1001 999 1003];
No_of_Iterations=10;
No_of_Clusters=3;
% fit a 3-component GMM to the training data (VOICEBOX toolbox)
[mm,vv,ww]=gaussmix(train,[],No_of_Iterations,No_of_Clusters);
test1=[1 1 1 2 2 2 100 100 100 101 1000 1000 1000];
test2=[1 1 2 2 100 99 1000 999];
test3=[1 100 1000];
% lp holds the log probability of each test sample under the model
[lp,rp,kh,kp]=gaussmixp(test1,mm,vv,ww);
sum(lp)
[lp,rp,kh,kp]=gaussmixp(test2,mm,vv,ww);
sum(lp)
[lp,rp,kh,kp]=gaussmixp(test3,mm,vv,ww);
sum(lp)
The results are as follows:
ans = -8.0912e+05
ans = -8.1782e+05
ans = -5.0381e+05
I would really appreciate it if you could help me.

How can I measure separability between different number of instance of one feature vector ?
The notion of "separability" is not strict. If the data is linearly separable, one could define the size of the margin as the "separability", but for data that is not linearly separable there is no definite answer even to the question "how easy is it to separate this data?", because it is a heavily model-dependent question: the answer will be completely different if you want to separate it with an SVM with some particular kernel than if you want to use a decision tree, etc. There are many possible probabilistic, geometric and statistical approaches to such analysis, but this is not something a Q&A site can settle; it is a hard, long-lasting process of data analysis performed by skilled researchers.
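That said, if you only want a rough geometric proxy in one dimension, Fisher's criterion (between-class scatter divided by within-class scatter) is a common starting point. Below is a minimal sketch in Python/NumPy; the two subsets passed to it are made-up slices of the example vector V, not anything the question prescribes:

```python
import numpy as np

def fisher_ratio(a, b):
    # between-class scatter over within-class scatter for two 1-D samples;
    # larger values mean the two groups are easier to tell apart
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    between = (a.mean() - b.mean()) ** 2
    within = a.var() + b.var()
    return between / within if within > 0 else float("inf")

# hypothetical subsets of V: one pair far apart, one pair on the same scale
far = fisher_ratio([1, 1, 2, 3], [999, 1000])   # well separated -> large ratio
near = fisher_ratio([1, 1, 2, 3], [4, 5])       # overlapping scales -> small ratio
print(far, near)
```

This only captures separation of means relative to spread; it says nothing about how an SVM or a decision tree would fare, which is exactly the model-dependence mentioned above.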
Which one is more separable and more informative?
Depends on the exact definitions of separability and informativeness. This is not a question that can be answered in Q&A fashion; it is a research topic, not an issue to solve.
If I put it in GMM, the vectors with fewer samples have better probability, which is not fair.
You have already asked a question about this and received an answer showing why it is "fair".
You can try asking on http://stats.stackexchange.com, but you will likely hear a similar answer: that "it depends", and that it is impossible to answer such a question "by hand".
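One concrete thing you can do about the length effect: the total log-likelihood sum(lp) scales with the number of samples, so compare the mean log-likelihood per sample instead. A minimal Python/NumPy sketch; the mixture parameters below are hypothetical stand-ins for what gaussmix would return on the training data above:

```python
import numpy as np

def gmm_logprob(x, means, variances, weights):
    # per-sample log p(x) under a 1-D Gaussian mixture (log-sum-exp for stability)
    x = np.asarray(x, dtype=float)[:, None]
    log_comp = (np.log(np.asarray(weights, dtype=float))
                - 0.5 * np.log(2 * np.pi * np.asarray(variances, dtype=float))
                - (x - np.asarray(means, dtype=float)) ** 2
                / (2 * np.asarray(variances, dtype=float)))
    m = log_comp.max(axis=1, keepdims=True)
    return (m + np.log(np.exp(log_comp - m).sum(axis=1, keepdims=True))).ravel()

# hypothetical parameters: one component per training cluster
means, variances, weights = [1.5, 100.5, 1000.75], [0.5, 1.0, 2.5], [0.375, 0.375, 0.25]

test1 = [1, 1, 1, 2, 2, 2, 100, 100, 100, 101, 1000, 1000, 1000]
test3 = [1, 100, 1000]

lp1 = gmm_logprob(test1, means, variances, weights)
lp3 = gmm_logprob(test3, means, variances, weights)
print(lp1.sum(), lp3.sum())    # totals: the shorter vector "wins" just by length
print(lp1.mean(), lp3.mean())  # per-sample means are directly comparable
```

In the MATLAB code above, the equivalent normalization is simply sum(lp)/length(test1) instead of sum(lp).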

Related

Most appropriate test for identifying if one group has significantly higher variance in its samples, than another?

I am conducting analyses on fMRI (neural) data. There are two brain regions of focus, region1 and region2. For each region, there are 5 subjects worth of neural data (5 samples). Each sample is shaped as 72 x 79 x 95 x 79 x 4, or collapsible to 79 x 95 x 79 x 288.
I want to establish whether there is higher variance between subjects in region1 than there is between subjects in region2.
What statistical test would you recommend to assess this?
Many thanks in advance!
Since it's fMRI data, let us assume your samples are independent between subjects and normally distributed; equality of variances is then the null hypothesis under test.
If the regions are independent, that is, data for region 1 was collected from 5 subjects and data for region 2 was collected from 5 unrelated subjects, then the canonical answer is to use an F-test for equality of variances (link1, link2).
If the regions are dependent, where the data was collected in 5 subjects only, and then region1 and region2 extracted from the same 5 subjects, then you have a few options. A popular solution is the Pitman-Morgan test for variance in paired samples. Various modifications have been proposed, of which Wilcox (2015) does a good job at controlling for Type I error.
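As a concrete illustration of the independent-regions case, here is a minimal sketch of the two-sided F-test in Python with scipy.stats. The per-subject values are made-up numbers; in practice you would first reduce each subject's 4-D array to a scalar summary per region before testing:

```python
import numpy as np
from scipy import stats

def f_test_var(x, y):
    # two-sided F-test for equality of variances of two independent,
    # normally distributed samples
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    f = x.var(ddof=1) / y.var(ddof=1)          # ratio of sample variances
    df1, df2 = len(x) - 1, len(y) - 1
    p = 2.0 * min(stats.f.sf(f, df1, df2), stats.f.cdf(f, df1, df2))
    return f, min(p, 1.0)

# hypothetical per-subject summaries (5 subjects per region)
region1 = [4.1, 9.8, 1.2, 12.5, 0.3]   # spread out -> large variance
region2 = [5.0, 5.2, 4.9, 5.1, 5.0]    # tightly clustered -> small variance
f, p = f_test_var(region1, region2)
print(f, p)
```

Keep in mind the F-test is quite sensitive to non-normality; if that assumption is shaky, a robust alternative such as Levene's test is often preferred.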

Does the optimum solution of a TSP remain the optimum even after skipping a few cities?

Let's say that I know the global optimum solution to a 100-city standard Travelling Salesman Problem. Now, let's say that the salesman wants to skip 5 of the cities. Does the TSP have to be re-solved? Will the sequence of cities obtained by simply deleting those cities from the previous optimum solution be the global optimum for the new 95-city TSP?
Updated: Replaced counterexample with Euclidean instance.
Great question.
No, if you remove some cities, the original sequence of cities does not remain optimal. Here is a counterexample:
The node coordinates are:
0 0
4 0
4 2
2.6 3
10 3
4 4
4 6
0 6
Here is the optimal tour:
Now suppose we don't need to visit node 5. If we just "close up" the original tour, the resulting tour has a cost of 21.94:
But the optimal tour has a cost of 21.44:
(If you want to remove 5 cities instead of 1, just put all 5 cities close together all the way on the right.)
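Since the tour figures are not reproduced here, the quoted costs can be checked numerically. The sketch below assumes the original 8-city optimal tour visited the nodes in the order 1-2-3-5-6-7-8-4 (0-indexed: 0, 1, 2, 4, 5, 6, 7, 3); deleting node 5 and closing the gap gives the 21.94 tour, and brute force over the remaining 7 cities recovers the 21.44 optimum:

```python
import itertools
import math

# node coordinates from the counterexample above (node 5 is (10, 3))
pts = [(0, 0), (4, 0), (4, 2), (2.6, 3), (10, 3), (4, 4), (4, 6), (0, 6)]

def tour_len(order):
    # total length of the closed tour visiting pts in the given order
    return sum(math.dist(pts[a], pts[b])
               for a, b in zip(order, order[1:] + [order[0]]))

# assumed original order with node 5 (index 4) deleted and the gap closed
closed_up = [0, 1, 2, 5, 6, 7, 3]
print(round(tour_len(closed_up), 2))  # 21.94, the "close up" cost quoted above

# brute-force optimum over the 7 remaining cities (fix city 0, permute the rest)
rest = [1, 2, 3, 5, 6, 7]
best = min(tour_len([0, *p]) for p in itertools.permutations(rest))
print(round(best, 2))  # 21.44, the optimal cost quoted above
```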

How to compare means of two sets when one set is a subset of another and the sample sizes are not equal

I have two sets containing citation counts for some publications. One set is a subset of the other; that is, the subset contains some of the exact citation counts appearing in the other set. e.g.
Set1   Set2 (subset)
50     50
24     24
12     -
5      5
4      4
43     43
2      -
2      -
1      -
1      -
So I want to decide whether the numbers from the subset are good enough to represent Set1. On this matter:
I intended to apply Student's t-test, but I could not be sure how to apply it. The sets are dependent, so I could not use an unpaired t-test, which requires that both sets come from independent populations. On the other hand, a paired t-test also does not look suitable, since it requires equal sample sizes.
In case of an outlier, should I remove it? To me that is not logical, since it is not really an outlier: a publication was simply cited quite a lot, so it belongs to the same sample. How should I deal with such cases? If I do not remove it, it makes the variance very large, affecting statistical tests... Is it a good idea to use the median instead of the mean, since citation distributions generally tend to be highly skewed?
How could I remedy this issue?

Which Multivariate Statistical Test / Algorithm for Testing Statistical Significance

I'm looking for a mathematical algorithm to prove significance in multivariate testing.
E.g., let's take a website test having 3 headlines, 2 images, and 2 buttons. This results in 3 x 2 x 2 = 12 variations:
h1-i1-b1, h2-i1-b1, h3-i1-b1,
h1-i2-b1, h2-i2-b1, h3-i2-b1,
h1-i1-b2, h2-i1-b2, h3-i1-b2,
h1-i2-b2, h2-i2-b2, h3-i2-b2.
The hypothesis is that one variation is better than the others.
I'd like to know with what significance one of the variations is the winner, and how long I have to wait before I can be sure that I statistically have a winner, or at least have an indicator of how sure I can be that one variation is the winner.
So basically, I'd like to get a probability for each variation telling me whether it is the winner or not. As the test runs longer, some variations drop in probability and the winner's probability increases.
Which algorithm would you use? What's the formula?
Are there any libs for this?
You can use a chi-square test. Your null hypothesis is that all outcomes are equally likely; when you plug in the measured counts for each of the 12 outcomes, you get out a number telling you the probability of getting a set of 12 counts as extreme (i.e. as far away from equally distributed) as this. If the probability is sufficiently small (typically < 5% or < 1%), you conclude that the null hypothesis was wrong.
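A minimal sketch in Python using scipy.stats.chisquare; the conversion counts for the 12 variations are made-up numbers for illustration:

```python
from scipy import stats

# hypothetical conversion counts for the 12 variations, in the order listed above
observed = [30, 25, 28, 41, 33, 29, 27, 35, 26, 31, 55, 30]

# null hypothesis: every variation is equally likely to convert,
# so the expected count is sum(observed) / 12 for each cell
chi2, p = stats.chisquare(observed)
print(chi2, p)  # reject the null at the 5% level when p < 0.05
```

Note that a significant chi-square only says the 12 counts are not uniform; it does not by itself tell you which variation is the winner. Pairwise follow-up tests (with a multiple-comparison correction) or a Bayesian approach would be needed for that.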

How do I perform a Mixed model analysis on my data in SPSS? [closed]

It's difficult to tell what is being asked here. This question is ambiguous, vague, incomplete, overly broad, or rhetorical and cannot be reasonably answered in its current form. For help clarifying this question so that it can be reopened, visit the help center.
Closed 9 years ago.
In my thesis I'm trying to discover which factors influence the CSR (corporate social responsibility, GSE_RAW) behavior of companies. Two groups of possible factors / variables have been identified: company-specific and country-specific.
First, company-specific variables are (among others)
MKT_AVG_LN: the market value of the company
SIGN: the number of CSR treaties the company has signed
INCID: the number of reported CSR incidents the company has been involved in
Second, each of the 4,000 companies in the dataset is headquartered in one of 35 countries. For each country, I have gathered some country-specific data, among others:
LAW_FAM: the legal family the countries' legal system stems from (either French, English, Scandinavian, or German)
LAW_SR: relative protection the countries' company law gives to shareholders (for instance, in case of company default)
LAW_LE: the relative effectiveness of the countries' legal system (higher value means more effective, thus for instance less corrupted)
COM_CLA: a measurement for the intensity of internal market competition
GCI_505: measurement for the quality of primary education
GCI_701: measurement for the quality of secondary education
HOF_PDI: power distance (higher value means more hierarchical society)
HOF_LTO: country time orientation (higher means more long-term orientation)
DEP_AVG: the countries' GDP per capita
CON_AVG: the countries' average inflation over the 2008-2010 timeframe
In order to make an analysis on this data, I "raised" the country-level data to the company-level. For instance, if Belgium has a COM_CLA value of 23, then all Belgian companies in the dataset have their COM_CLA value set to 23. The variable LAW_FAM is split up into 4 dummy variables (LAW_FRA, LAW_SCA, LAW_ENG, LAW_GER), giving each company a 1 for one of these dummies.
This all results in a dataset like this:
COMPANY MKT_AVG_LN .. INCID .. LAW_FRA LAW_SCA .. LAW_SR LAW_LE COM_CLA .. etc
------------------------------------------------------------------------------
1 1.54 55 0 1 34 65 53
2 1.44 16 0 1 34 65 53
3 0.11 2 0 1 34 65 53
4 0.38 12 1 0 18 40 27
5 1.98 114 1 0 18 40 27
. . . . . . . .
. . . . . . . .
4,000 0.87 9 0 1 5 14 18
Here, companies 1 to 3 are from the same country A, and 4 and 5 from country B.
First, I tried analyzing using OLS, but the model seemed very "unstable", as shown below. The first model has an r-squared of .516:
Adding only two variables changes many of the betas and significance levels, as well as the r-squared (.591). Of course the r-squared increases when variables are added, but this is quite an increase from .516:
Eventually, it was suggested in another post that I should not use OLS here but mixed models, because of the categorical country-level data. However, I am confused as to how to perform this in SPSS. The examples I found online are not comparable to mine, so I don't know what to fill in, among other things, in the below mixed-model dialogue:
Could somebody using SPSS please explain to me how to perform this analysis so that I may arrive at a regression model (CSR = b1*MKT_AVG_LN + b2*SIGN + ... + b13*CON_AVG), so that I can conclude whether CSR is determined by company features or country features (or by neither, or both)?
I believe I have to insert the company-level variables as covariates and the country-level variables as factors. Is this correct? Second, I am unsure what to do with the LAW_SCA to LAW_ENG dummy variables.
Any help is greatly appreciated!
