AIC values in model comparison - statistics

I was comparing two models using the AIC. However, I realized that both AIC values are too small (-4752.66, and the other is close to that). I was wondering if that is normal or I did something wrong while calculating it.

Its ok to have negative aic values! (Ref-https://stats.stackexchange.com/questions/84076/negative-values-for-aic-in-general-mixed-model)
As the values are too close to each other, the one with smaller delta is your choice. If both deltas are smaller than two, look for evidence ratio . Evidence ratio smaller than 2.7 indicates that booth models are good ->(serch for "comparing models" at : http://theses.ulaval.ca/archimede/fichiers/21842/apa.html#d0e5831).
IF this is your case, you can use model averaging->(Symonds, Matthew RE, and Adnan Moussalli. "A brief guide to model selection, multimodel inference and model averaging in behavioural ecology using Akaike’s information criterion." Behavioral Ecology and Sociobiology 65.1 (2011): 13-21.)

Related

How do I analyze the change in the relationship between two variables?

I'm working on a simple project in which I'm trying to describe the relationship between two positively correlated variables and determine if that relationship is changing over time, and if so, to what degree. I feel like this is something people probably do pretty often, but maybe I'm just not using the correct terminology because google isn't helping me very much.
I've plotted the variables on a scatter plot and know how to determine the correlation coefficient and plot a linear regression. I thought this may be a good first step because the linear regression tells me what I can expect y to be for a given x value. This means I can quantify how "far away" each data point is from the regression line (I think this is called the squared error?). Now I'd like to see what the error looks like for each data point over time. For example, if I have 100 data points and the most recent 20 are much farther away from where the regression line/function says it should be, maybe I could say that the relationship between the variables is showing signs of changing? Does that make any sense at all or am I way off base?
I have a suspicion that there is a much simpler way to do this and/or that I'm going about it in the wrong way. I'd appreciate any guidance you can offer!
I can suggest two strands of literature that study changing relationships over time. Typing these names into google should provide you with a large number of references so I'll stick to more concise descriptions.
(1) Structural break modelling. As the name suggest, this assumes that there has been a sudden change in parameters (e.g. a correlation coefficient). This is applicable if there has been a policy change, change in measurement device, etc. The estimation approach is indeed very close to the procedure you suggest. Namely, you would estimate the squared error (or some other measure of fit) on the full sample and the two sub-samples (before and after break). If the gains in fit are large when dividing the sample, then you would favour the model with the break and use different coefficients before and after the structural change.
(2) Time-varying coefficient models. This approach is more subtle as coefficients will now evolve more slowly over time. These changes can originate from the time evolution of some observed variables or they can be modeled through some unobserved latent process. In the latter case the estimation typically involves the use of state-space models (and thus the Kalman filter or some more advanced filtering techniques).
I hope this helps!

Under what conditions can two classes have different average values, yet be indistinguishable to a SVM?

I am asking because I have observed sometimes in neuroimaging that a brain region might have different average activation between two experimental conditions, but sometimes an SVM classifier somehow can't distinguish the patterns of activation between the two conditions.
My intuition is that this might happen in cases where the within-class variance is far greater than the between-class variance. For example, suppose we have two classes, A and B, and that for simplicity our data consists just of integers (rather than vectors). Let the data falling under class A be 0,0,0,0,0,10,10,10,10,10. Let the data falling under class B be 1,1,1,1,1,11,11,11,11,11. Here, A and B are clearly different on average, yet there's no decision boundary that would allow A and B to be distinguished. I believe this logic would hold even if our data consisted of vectors, rather than integers.
Is this a special case of some broader range of cases where an SVM would fail to distinguish two classes that are different on average? Is it possible to delineate the precise conditions under which an SVM classifier would fail to distinguish two classes that differ on average?
EDIT: Assume a linear SVM.
As described in the comments - there are no such conditions because SVM will separate data just fine (I am not talking about any generalisation here, just separating training data). For the rest of the answer I am assuming there are no two identical points with different labels.
Non-linear case
For a kernel case, using something like RBF kernel, SVM will always perfectly separate any training set, given that C is big enough.
Linear case
If data is linearly separable then again - with big enough C it will separate data just fine. If data is not linearly separable, cranking up C as much as possible will lead to smaller and smaller training error (of course it will not get 0 since data is not linearly separable).
In particular for the data you provided kernelized SVM will get 100%, and any linear model will get 50%, but it has nothing to do with means being different or variances relations - it is simply a dataset where any linear separator has at most 50% accuracy, literally every decision point, thus it has nothing to do with SVM. In particular it will separate them "in the middle", meaning that the decision point will be somewhere around "5".

Information Retrieval: How to combine different word results when using tf-idf?

Let's say I have a user search query which looks like:
"the happy bunny"
I have already computed tf-idf and have something like this (following are made up example values) for each document in which I am searching (of coures the idf is always the same):
tf idf score
the 0.06 1 0.06 * 1 = 0.06
happy 0.002 20 0.002 * 20 = 0.04
bunny 0.0005 60 0.0005 * 60 = 0.03
I have two questions with what to do next.
Firstly, the still has the highest score, even though it is adjusted for rarity by idf, still it's not exactly important - do you think I should square the idf values to weight in terms of rare words, or would this give bad results? Otherwise I'm worried that the is getting equal importance to happy and bunny, and it should be obvious that bunny is the most important word in the search. As long as rare always equals important then it would be always a good idea to weight in terms of rarity, but if that is not always the case then doing so could really mess up the results.
Secondly and more importantly: what is the best/preferred method for combining the scores for each word together to give each document a single score that represents how well it reflects the entire search query? I was thinking of adding them, but it has become apparent that that is going to give higher priority to a document containing 10,000 happy but only 1 bunny instead of another document with 500 happy and 500 bunny (which would be a better match).
First, make sure that you are computing the correct TF-IDF values. As others have pointed they do not look right. TF is relative to specific documents, and we often do not need to compute them for queries (since raw term frequency is almost always 1 in queries). There are different types of TF functions to pick from (check the Wikipedia page on tf-idf, it has a good coverage). Log Normalisation is common and the most efficient scheme, since it saves an extra disk access to get the respective document's total frequency maxF that is needed for something like Double Normalisation. When you are dealing with large volumes of documents this can be expensive, especially if you can't bring these into memory. A bit of insight on inverted files can go a long way in understanding some of the underlying complexities. Log normalisation is efficient and is a non-linear function, therefore better than raw frequency.
Once you are certain on your weighting scheme, then you may want to consider a stop list to get rid of very common/noisy words. These do not contribute to the rank of documents. It is generally recommended to use a stop list of high frequency, very common words. Do a search and you will find many available, including the one that Lucene uses.
The remaining lies on your ranking strategy and that will depend on your implementation/model. The vector space model (VSM) is simple and readily available with libraries like Lucene, Lemur, etc. VSM computes the Dot product or scalar of the weights of common terms between the query and a document. Term weights are normalised via vector length normalisation (which solves your second question), and the result of applying the model is a value between 0 and 1. This is also justified/interpreted as the Cosine of the angle between two vectors in a planar graph, or the Euclidean distance divided by the Euclidean vector length of two vectors.
One of the earliest comprehensive studies on weighting schemes and ranking with VSM is an article by Salton (pdf) and is a good read if you are interested in Information Retrieval. A bit outdated perhaps (notice how log normalisation is not mentioned in the article).
Your best read I believe is the book Introduction to Information Retrieval by Christopher Manning. It will take you through everything that you need to know, from indexing to ranking schemes, etc. A bit lacking on ranking models (does not cover some of the more complex probabilistic approaches).
You should reconsider your TF and IDF values, they do not look correct. The TF value is usually just how often the word occurs, so if the word "the" appeared 20 times it's tf value would be 20. A word like "the" should have a very low IDF value (possibly around 4 decimal places, 0.000...).
You could use stop word removal if word like the are not necessary, they would be removed rather than just given a low score.
A vector space model could be used for this.
can you compute tf-idf for amalgamated terms? That is, you first generate a sentiment that considers each of its component as equal before treating the sentiment as a single term for which you now compute the tf-idf

Systematic threshold for cosine similarity with TF-IDF weights

I am running an analysis of several thousand (e.g., 10,000) text documents. I have computed TF-IDF weights and have a matrix with pairwise cosine similarities. I want to treat the documents as a graph to analyze various properties (e.g., the path length separating groups of documents) and to visualize the connections as a network.
The problem is that there are too many similarities. Most are too small to be meaningful. I see many people dealing with this problem by dropping all similarities below a particular threshold, e.g., similarities below 0.5.
However, 0.5 (or 0.6, or 0.7, etc.) is an arbitrary threshold, and I'm looking for techniques that are more objective or systematic to get rid of tiny similarities.
I'm open to many different strategies. For example, is there a different alternative to tf-idf that would make most of the small similarities 0? Other methods to keep only significant similarities?
In short, take the average cosine value of an initial clustering or even all of the initial sentences and accept or reject clusters based on something akin to the following.
One way to look at the problem is to try and develop a score based on a distance from the mean similarity (1.5 standard deviations (86th percentile if the data were normal) tends to mark an outlier with 3 (99.9th percentile) being an extreme outlier), taking the high end for good measure. I cannot remember where, but this idea has had traction in other forums and formed the basis for my similarity.
Keep in mind that the data is not likely to be normally distributed.
average(cosine_similarities)+alpha*standard_deviation(cosine_similarities)
In order to obtain alpha, you could use the Wu Palmer score or another score as described by NLTK. Strong similarities with Wu Palmer should lead to a larger range of acceptance while lower Wu Palmer scores should lead to a more strict acceptance. Therefore, taking 1-Wu Palmer score would be adviseable. You can even use this method for LSA or LDA groups. To be even more strict and take things close to 1.5 or more standard deviations, you could even try 1+Wu Palmer (the cream of the crop), re-find the ultimate K,find the new score, cluster, and repeat.
Beware though, this would mean finding the Wu Palmer of all relevant words and is quite a large computational problem. Also, 10000 documents is peanuts compared to most algorithms. The smallest I have seen for tweets was 15,000 and the 20 news groups set was 20,000 documents. I am pretty sure Alchemy API uses something akin to the 20 news groups set. They definitely use senti-wordnet.
The basic equation is not really mine so feel free to dig around for it.
Another thing to keep in mind is that the calculation is time intensive. It may be a good idea to use a student t value for estimating the expected value/mean wu-palmer score of SOV pairings and especially good if you try to take the entire sentence. Commons Math3 for java/scala includes the distribution as does scipy for python and R should already have something as well.
Xbar +/- tsub(alpha/2)*sample_std/sqrt(sample_size)
Note: There is another option with this weight. You could use an algorithm that adds or subtracts from this threshold until achieving the best result. This would likely not be related solely to the cosine importance but possibly to an inflection point or gap as with Tibshirani's gap statistic.

Obtaining the Standard Error of Weighted Data in SPSS

I'm trying to find confidence intervals for the means of various variables in a database using SPSS, and I've run into a spot of trouble.
The data is weighted, because each of the people who was surveyed represents a different portion of the overall population. For example, one young man in our sample might represent 28000 young men in the general population. The problem is that SPSS seems to think that the young man's database entries each represent 28000 measurements when they actually just represent one, and this makes SPSS think we have much more data than we actually do. As a result SPSS is giving very very low standard error estimates and very very narrow confidence intervals.
I've tried fixing this by dividing every weight value by the mean weight. This gives plausible figures and an average weight of 1, but I'm not sure the resulting numbers are actually correct.
Is my approach sound? If not, what should I try?
I've been using the Explore command to find mean and standard error (among other things), in case it matters.
You do need to scale weights to the actual sample size, but only the procedures in the Complex Samples option are designed to account for sampling weights properly. The regular weight variable in Statistics is treated as a frequency weight.

Resources