How to interpret IRStatisticsImpl data in Mahout - statistics

I want to read the IRStatisticsImpl data but have some problems.
My result is:
IRStatisticsImpl[precision:0.04285714285714287,recall:0.04275534441805227,fallOut:0.0018668022652391654,nDCG:0.04447353132522083,reach:0.997624703087886]
Does it mean that I got only 4% good recommendations (precision) and about the same level of bad recommendations (recall)?
What should the numbers look like at best - precision at 1.0 and recall at 0.0?

Well, by definition:
Precision represents how many of the results in your result set are correct.
Recall represents how many of the correct elements in the test set actually end up in your result set.
To be perfect, precision and recall should both be at 100%. What counts as a good value has to be judged according to your domain.
For example, if you have a bucket with good and bad mushrooms, you should aim at 100% precision no matter how low your recall is. Because precision is critical for your health, you can afford to leave behind a lot of good mushrooms; the important thing is not to eat the bad ones.
You could pick one good mushroom and get 100% precision, but if there were four good mushrooms in your bucket, your recall is only 25%.
Ideally, precision and recall both at 100% means that every mushroom in your result set is good and that every good mushroom made it into your result set, with none left behind.
So the same values can mean different things depending on your domain.
Sadly, your results look quite poor: you are getting many false positives and too many false negatives.
Take a look here.
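To make the definitions concrete, here is a minimal Python sketch (not Mahout code; the ir_stats helper and the numbers are made up) of how precision, recall and fall-out come out of the true/false positive counts for one user's recommendation list:

def ir_stats(recommended, relevant, catalog_size):
    # recommended/relevant are sets of item ids; catalog_size is the total
    # number of candidate items (hypothetical helper, for illustration only)
    tp = len(recommended & relevant)        # relevant items we recommended
    fp = len(recommended - relevant)        # recommended but not relevant
    fn = len(relevant - recommended)        # relevant but not recommended
    tn = catalog_size - tp - fp - fn        # everything else
    precision = tp / len(recommended) if recommended else 0.0
    recall = tp / len(relevant) if relevant else 0.0
    fall_out = fp / (fp + tn) if (fp + tn) else 0.0
    return precision, recall, fall_out

# 1 hit in 10 recommendations, 4 relevant items, catalog of 1000 items
print(ir_stats(set(range(1, 11)), {1, 20, 30, 40}, 1000))
# -> precision 0.10, recall 0.25, fall-out roughly 0.009

Your numbers read the same way: about 4% of what you recommend is relevant, you recover about 4% of what is relevant, and fall-out stays low simply because the catalogue of non-relevant items is large.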

Related

Would the width of a simulated 95% confidence interval for a k<2 proportion in a multinomial sample be narrow, wide, or the same compared to the individual CI?

A categorical response variable can take on K different values.
Apparently a narrow confidence interval implies that there is a smaller chance of obtaining an observation within that interval, and therefore our accuracy is higher. Also, a 95% confidence interval is narrower than a 99% confidence interval, which is wider.
Would this be a correct explanation? If not, what would be the best reason?
Sort of.
Recall that CIs are used to estimate the real mean of a population, based on some sample data from that population.
It is true that a 95% CI is smaller (narrower) than a 99% one.
This is directly tied to your alpha value (significance level): an alpha of 5% means you run a theoretical 5% risk that your CI does not contain the actual/real mean value.
Thus, if we want to minimize this risk and increase significance, we could set alpha to 1%, and the CI becomes a (100-alpha)% CI. Just like you observed, this 99% interval is much larger, because we want a lower risk, so it has to cover a broader range of values, making it less likely that we miss the real mean.
Like I mentioned, this risk is theoretical and is pretty much: "What is the chance that the data (and the CI we built from it) just happened, by sheer chance, to be so off that the real mean lies outside of our CI?"
So it is only a probability calculation based on how much data you had and how widely spread it was.
And that is why the width differs between them. A 95% CI indicates that you have chosen your significance level (alpha) to be 5%.
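As a quick illustration (a sketch with a made-up sample size and standard deviation, using the normal approximation), you can see how the critical value, and hence the interval width, grows as alpha shrinks:

from scipy.stats import norm

n, s = 100, 2.0                    # hypothetical sample size and sample standard deviation
se = s / n ** 0.5                  # standard error of the mean

for conf in (0.95, 0.99):
    alpha = 1 - conf
    z = norm.ppf(1 - alpha / 2)    # critical value: about 1.96 for 95%, about 2.58 for 99%
    print(f"{conf:.0%} CI half-width: {z * se:.3f}")
# The 99% interval is wider because the smaller alpha demands more coverage.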

Need for Bonferroni correction in A/B testing

I am a newbie in the field of data science. I came across the following statements:
The more metrics we choose in our A/B testing, the higher the chance of getting a significant difference by chance.
To eliminate this problem we use the Bonferroni correction method.
What does the first statement mean? How does it increase the chance of getting false positives? And how does the Bonferroni correction method help us here?
With a p-value threshold of 0.05 (a commonly used level of statistical significance), you will get a false positive result 5% of the time. Thus if your analysis contains one test, your chance of a false positive is 5%. If you have two tests, you have a 5% risk on the first AND a 5% risk on the second, and so on.
So for each additional test, your overall (family-wise) risk of at least one false positive increases. Since you want to keep the total risk at 0.05, you either set a stricter level of statistical significance (a smaller p-value threshold) or use a statistical method that corrects for multiple comparisons. The Bonferroni correction is one such method.
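A small sketch of both points, using made-up p-values and statsmodels' multipletests for the Bonferroni step:

from statsmodels.stats.multitest import multipletests

alpha = 0.05
pvals = [0.003, 0.02, 0.04, 0.2, 0.5]          # made-up p-values from 5 metrics
m = len(pvals)

fwer = 1 - (1 - alpha) ** m                    # chance of at least one false positive across m independent tests
print(f"family-wise error rate with {m} tests: {fwer:.2f}")   # about 0.23

reject, p_adj, _, _ = multipletests(pvals, alpha=alpha, method="bonferroni")
print(list(zip(pvals, p_adj, reject)))
# Bonferroni compares each p-value against alpha/m (equivalently multiplies p by m),
# so only 0.003 still counts as significant here; 0.02 and 0.04 no longer do.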

Obtaining the Standard Error of Weighted Data in SPSS

I'm trying to find confidence intervals for the means of various variables in a database using SPSS, and I've run into a spot of trouble.
The data is weighted, because each person surveyed represents a different portion of the overall population. For example, one young man in our sample might represent 28000 young men in the general population. The problem is that SPSS treats that young man's database entries as if they each represented 28000 measurements when they actually represent just one, which makes SPSS think we have far more data than we actually do. As a result SPSS gives very low standard error estimates and very narrow confidence intervals.
I've tried fixing this by dividing every weight value by the mean weight. This gives plausible figures and an average weight of 1, but I'm not sure the resulting numbers are actually correct.
Is my approach sound? If not, what should I try?
I've been using the Explore command to find mean and standard error (among other things), in case it matters.
You do need to scale weights to the actual sample size, but only the procedures in the Complex Samples option are designed to account for sampling weights properly. The regular weight variable in SPSS Statistics is treated as a frequency weight.
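Outside SPSS, here is a rough sketch of what a sampling-weight-aware standard error looks like: a first-order approximation that assumes independent observations and ignores stratification and clustering (which the Complex Samples procedures would handle properly), with made-up numbers:

import numpy as np

x = np.array([70.0, 82.0, 65.0, 90.0, 78.0])       # made-up measurements
w = np.array([28000, 15000, 30000, 12000, 22000])  # made-up sampling weights

mean_w = np.sum(w * x) / np.sum(w)                 # weighted mean
se_w = np.sqrt(np.sum(w**2 * (x - mean_w)**2)) / np.sum(w)   # linearised SE approximation
print(mean_w, se_w)

# Rescaling the weights (e.g. dividing by the mean weight) leaves both numbers
# unchanged here, because the scale factor cancels; the scale only matters when the
# software treats the weights as frequencies and uses their sum as the sample size,
# which is why dividing by the mean weight gives you plausible figures in SPSS.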

Does stemming harm precision in text classification?

I have read stemming harms precision but improves recall in text classification. How does that happen? When you stem you increase the number of matches between the query and the sample documents right?
It's always the same: if you raise recall, you're generalising, and because of that you lose precision. Stemming merges words together.
On the one hand, words which ought to be merged together (such as "adhere" and "adhesion") may remain distinct after stemming; on the other, words which are really distinct may be wrongly conflated (e.g., "experiment" and "experience"). These are known as understemming errors and overstemming errors respectively.
Overstemming lowers precision and understemming lowers recall. So, since no stemming at all means no overstemming errors but the maximum of understemming errors, you get high precision but low recall there.
By the way, precision means how many of the 'documents' you found are the ones you were looking for. Recall means how many of all the correct 'documents' you actually received.
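For instance, NLTK's Porter stemmer reproduces both error types from the example above (a small sketch; exact outputs depend on the stemmer you use):

from nltk.stem import PorterStemmer

stem = PorterStemmer().stem
print(stem("experiment"), stem("experience"))   # both reduce to 'experi' -> overstemming (wrongly conflated)
print(stem("adhere"), stem("adhesion"))         # 'adher' vs 'adhes' -> understemming (kept distinct)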
From the wikipedia entry on Query_expansion:
By stemming a user-entered term, more documents are matched, as the alternate word forms for a user entered term are matched as well, increasing the total recall. This comes at the expense of reducing the precision. By expanding a search query to search for the synonyms of a user entered term, the recall is also increased at the expense of precision. This is due to the nature of the equation of how precision is calculated, in that a larger recall implicitly causes a decrease in precision, given that factors of recall are part of the denominator. It is also inferred that a larger recall negatively impacts overall search result quality, given that many users do not want more results to comb through, regardless of the precision.

What are the efficient and accurate algorithms to exclude outliers from a set of data?

I have a set of 200 data rows (i.e., a small dataset). I want to carry out some statistical analysis, but before that I want to exclude outliers.
What are the potential algorithms for this purpose? Accuracy is a concern.
I am very new to statistics, so I need help with very basic algorithms.
Overall, the thing that makes a question like this hard is that there is no rigorous definition of an outlier. I would actually recommend against using a certain number of standard deviations as the cutoff for the following reasons:
A few outliers can have a huge impact on your estimate of standard deviation, as standard deviation is not a robust statistic.
The interpretation of standard deviation depends hugely on the distribution of your data. If your data is normally distributed then 3 standard deviations is a lot, but if it's, for example, log-normally distributed, then 3 standard deviations is not a lot.
There are a few good ways to proceed:
Keep all the data, and just use robust statistics (median instead of mean, Wilcoxon test instead of T-test, etc.). Probably good if your dataset is large.
Trim or Winsorize your data. Trimming means removing the top and bottom x%. Winsorizing means setting the values in the bottom and top x% to the xth and (100-x)th percentile values respectively.
If you have a small dataset, you could just plot your data and examine it manually for implausible values.
If your data looks reasonably close to normally distributed (no heavy tails and roughly symmetric), then use the median absolute deviation instead of the standard deviation as your test statistic and filter to 3 or 4 median absolute deviations away from the median (see the sketch below).
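Here is a minimal sketch of that MAD filter in Python; the 3.5 cutoff and the data are just illustrative:

import numpy as np

def mad_filter(x, cutoff=3.5):
    x = np.asarray(x, dtype=float)
    med = np.median(x)
    mad = np.median(np.abs(x - med))
    # 1.4826 rescales the MAD so it estimates the standard deviation for normal data
    score = np.abs(x - med) / (1.4826 * mad)
    return x[score <= cutoff]

data = np.array([9.8, 10.1, 10.0, 9.9, 10.2, 35.0])   # made-up data with one obvious outlier
print(mad_filter(data))                                # 35.0 is dropped, the rest are kept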
Start by plotting the leverage of the outliers and then go for some good ol' interocular trauma (aka look at the scatterplot).
Lots of statistical packages have outlier/residual diagnostics, but I prefer Cook's D. You can calculate it by hand if you'd like using this formula from mtsu.edu (original link is dead, this is sourced from archive.org).
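If you'd rather not compute it by hand, most regression packages will do it for you; for example, a sketch with statsmodels on made-up data:

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
x = rng.normal(size=50)
y = 2 * x + rng.normal(size=50)
y[0] += 15                                    # plant one grossly influential point

fit = sm.OLS(y, sm.add_constant(x)).fit()
cooks_d, _ = fit.get_influence().cooks_distance
print(np.argsort(cooks_d)[-3:])               # indices of the most influential observations
# A common rough flag is cooks_d > 4 / n.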
You may have heard the expression 'six sigma'.
This refers to plus and minus 3 sigma (ie, standard deviations) around the mean.
Anything outside the 'six sigma' range could be treated as an outlier.
On reflection, I think 'six sigma' is too wide.
This article describes how it amounts to "3.4 defective parts per million opportunities."
It seems like a pretty stringent requirement for certification purposes. Only you can decide if it suits you.
Depending on your data and its meaning, you might want to look into RANSAC (random sample consensus). This is widely used in computer vision, and generally gives excellent results when fitting data that contains lots of outliers to a model.
It's also very simple to conceptualize and explain. On the other hand, it's non-deterministic, which may cause problems depending on the application.
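For example, a minimal sketch with scikit-learn's RANSACRegressor on made-up data with 10% contamination:

import numpy as np
from sklearn.linear_model import RANSACRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = 3 * X.ravel() + rng.normal(scale=0.5, size=200)
y[:20] += 30                                  # contaminate 10% of the points

ransac = RANSACRegressor().fit(X, y)
print(ransac.estimator_.coef_)                # slope recovered despite the outliers
print(int((~ransac.inlier_mask_).sum()))      # number of points flagged as outliers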
Compute the standard deviation on the set, and exclude everything outside of the first, second or third standard deviation.
Here is how I would go about it in SQL Server
The query below will get the average weight from a fictional Scale table holding a single weigh-in for each person while not permitting those who are overly fat or thin to throw off the more realistic average:
select w.Gender, Avg(w.Weight) as AvgWeight
from ScaleData w
join (
    select d.Gender,
           Avg(d.Weight) as AvgWeight,
           2 * STDEVP(d.Weight) as StdDeviation
    from ScaleData d
    group by d.Gender
) d
    on w.Gender = d.Gender
   and w.Weight between d.AvgWeight - d.StdDeviation
                    and d.AvgWeight + d.StdDeviation
group by w.Gender
There may be a better way to go about this, but it works and works well. If you have come across a more efficient solution, I'd love to hear about it.
NOTE: the query above drops roughly the most extreme 5% of values (about 2.5% in each tail, assuming roughly normal data) for the purpose of the average. You can adjust how much is removed by changing the 2* in 2*STDEVP, as per: http://en.wikipedia.org/wiki/Standard_deviation
If you just want to analyse the data, say to compute the correlation with another variable, it's ok to exclude outliers. But if you want to model / predict, it is not always best to exclude them straight away.
Try treating them with methods such as capping, or, if you suspect the outliers contain information or a pattern, replace them with missing values and then model/predict them. I have written some examples of how you can go about this here using R.
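The linked examples are in R; here is the same capping idea as a Python sketch (made-up values) using scipy's winsorize:

import numpy as np
from scipy.stats.mstats import winsorize

rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(50, 5, size=18), [250.0, -100.0]])  # two planted extremes
capped = winsorize(data, limits=[0.05, 0.05])    # cap the lowest and highest 5% of values
print(data.max(), np.asarray(capped).max())      # 250.0 is pulled back to the next-largest value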
