Why is some portion of statistics not used in data science?

I have learned statistics, including mean, median, mode and various tests such as the Z test, F test and chi-square. But when participating in difficult numeric prediction challenges on Kaggle and other platforms, I hardly see anyone using statistical tests like z, F or chi-square, or normalization of the data; all we use are boxplots, bar plots and the like to look at the mean, median, mode, etc.
My question is: where are these tests an integral part of data science, and for what sort of problems are they mainly designed - research-based ones?
What portion of statistics should ideally be used in a data science problem, and why is only some portion used if all of statistics is supposed to be a must for data science?
I am asking about tests and other statistics, not about the algorithms.

You're most likely to see statistical hypothesis testing in data science if you're looking at something like A/B testing, where your goal is to determine whether there is a reliable difference between two samples and the size of that difference.
Kaggle competitions specifically are supervised learning problems rather than hypothesis testing, which is why you don't see people using things like chi-squared. (Which makes sense: if you have ten people do hypothesis testing on the same dataset, they should all get pretty much the same answer, which would make for a pretty uninteresting competition.)
Personally, I think it's good to be familiar with both statistical hypothesis testing and machine-learning techniques, since they have different uses. Hope that helps! :)
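To make the A/B-testing case concrete, here is a minimal sketch of a two-proportion z-test using statsmodels; the conversion counts and visitor numbers are entirely made up:

```python
from statsmodels.stats.proportion import proportions_ztest

# Hypothetical A/B data: conversions and visitors for variants A and B.
conversions = [310, 352]
visitors = [4000, 4010]

z_stat, p_value = proportions_ztest(count=conversions, nobs=visitors)
print(f"z = {z_stat:.3f}, p = {p_value:.4f}")
# A small p-value suggests a reliable difference between the two samples; the
# difference in observed rates (352/4010 - 310/4000) gives its size.
```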

Every problem in data science requires a different approach, so generic statistics might not always apply. There will be problems where some statistics simply aren't needed.

Related

Dynamic Topic Modeling with Gensim / which code?

I want to use Dynamic Topic Modeling by Blei et al. (http://www.cs.columbia.edu/~blei/papers/BleiLafferty2006a.pdf) for a large corpus of nearly 3800 patent documents.
Does anybody have experience using the DTM in the gensim package?
I identified two models:
models.ldaseqmodel – Dynamic Topic Modeling in Python Link
models.wrappers.dtmmodel – Dynamic Topic Models (DTM) Link
Which one did you use, or if you used both, which one is "better"? In other words, which one did/do you prefer?
Both packages work fine, and are pretty much functionally identical. Which one you might want to use depends on your use case. There are small differences in the functions each model comes with, and small differences in the naming, which might be a little confusing, but for most DTM use cases, it does not matter very much which you pick.
Are the model outputs identical?
Not exactly. They are, however, very, very close to being identical (98%+) - I believe most of the differences come from slightly different handling of the probabilities in the generative process. So far, I've not yet come across a case where a difference in the sixth or seventh digit after the decimal point has any significant meaning. Interpreting the topics your model finds matters much more than one version finding a higher topic loading for some word by 0.00002.
The big difference between the two models: dtmmodel is a python wrapper for the original C++ implementation from blei-lab, which means python will run the binaries, while ldaseqmodel is fully written in python.
Why use dtmmodel?
the C++ code is faster than the python implementation
supports the Document Influence Model from Gerrish/Blei 2010 (potentially interesting for your research; see this paper for an implementation)
Why use ldaseqmodel?
easier to install (simple import statement vs downloading binaries)
can use sstats from a pretrained LDA model - useful with LdaMulticore
easier to understand the workings of the code
I mostly use ldaseqmodel, but that's for convenience. Native DIM support would be great to have, though.
What should you do?
Try each of them out, say, on a small sample set and see what the models return. 3800 documents isn't a huge corpus (assuming the patents aren't hundreds of pages each), and I assume that after preprocessing (removing stopwords, images and metadata) your dictionary won't be too large either (lots of standard phrases and legalese in patents, I'd assume). Pick the one that works best for you or has the capabilities you need.
A full analysis might take hours anyway; if you let your code run overnight there is little practical difference. After all, do you care whether it finishes at 3am or 5am? If runtime is critical, I would assume dtmmodel will be more useful.
For implementation examples, you might want to take a look at these notebooks: ldaseqmodel and dtmmodel
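If it helps, here is a minimal ldaseqmodel sketch on a toy corpus; the documents, topic count and time slices are placeholders, and note that (as far as I recall) the wrappers module, and with it dtmmodel, was removed in gensim 4.x, so check your version:

```python
from gensim import corpora
from gensim.models import LdaSeqModel

# Toy documents standing in for preprocessed patent texts (placeholders).
texts = [["patent", "claim", "device"], ["method", "claim", "system"],
         ["device", "network", "system"], ["network", "method", "patent"]]
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(t) for t in texts]

# time_slice lists how many documents fall into each time period, in corpus
# order: here 2 documents in the first period and 2 in the second.
ldaseq = LdaSeqModel(corpus=corpus, id2word=dictionary,
                     time_slice=[2, 2], num_topics=2, passes=5)
print(ldaseq.print_topics(time=0))  # topic-word distributions for the first period
```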

Is incomplete testing versus exhaustive analysis an apples-to-oranges comparison?

I often hear arguments like this: A disadvantage of traditional testing is that it is incomplete whereas Alloy analysis is exhaustive and complete (within a bound). But, the first is talking about software, the second is talking about models. Isn't it an apples-to-oranges comparison?
Update: I was wrong. The comparison is not this: testing code versus analyzing models. That is an apples-to-oranges comparison. Instead, the comparisons are these:
Testing models versus analysis of models.
Testing code versus analysis of code.
Those are apples-to-apples comparisons.
So, whether the artifact is a model or code, you can compare two kinds of analysis: testing, which corresponds to drawing a relatively small number of cases randomly, without a bound on the size, versus small scope analysis, which involves all cases within a small bound.
Thanks to Daniel Jackson for clearing up my misunderstanding.
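To make the contrast concrete, here is a toy sketch in Python; the property being checked and the scope bound are illustrative only:

```python
import itertools
import random

def prop(xs):
    # Property under check: sorting twice gives the same result as sorting once.
    return sorted(sorted(xs)) == sorted(xs)

# Testing: a relatively small number of randomly drawn cases, no bound on size.
for _ in range(100):
    xs = [random.randint(-10**6, 10**6) for _ in range(random.randint(0, 50))]
    assert prop(xs)

# Small-scope analysis: *all* lists over the values {0, 1, 2} up to length 3.
for length in range(4):
    for xs in itertools.product([0, 1, 2], repeat=length):
        assert prop(list(xs))
```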
First, when Alloy was invented, the only existing tools for analyzing models in data-rich languages such as Z and VDM that were not proof-based used scenarios to test the model. Each scenario was constructed by the user, so the approach suffered from the cost of creating the scenarios and the low coverage of their small number.
Second, Alloy has been used to find bugs in code: see the PhD theses by Mandana Vaziri, Mana Taghdiri, Greg Dennis, Juan Pablo Galeotti and others. In all of these, bugs were found that evaded conventional tests.
Third, it's worth noting that bounded-exhaustive forms of testing are becoming viable. Sarfraz Khurshid was a pioneer in this work with his thesis on generating test cases, initially in a tool called TestEra based on Alloy, and later (with Darko Marinov et al) in a tool called Korat that traded a more directed solving method for less declarative constraints.

What statistical distribution is used to benchmark an algorithm?

I have benchmarked my algorithm; it ran 1000 times. Now I have all the time values, and at this point it would be interesting to know the mean, standard deviation and median. The problem is that I don't know which statistics are correct to use to estimate these parameters. I'm not sure about using the normal distribution.
Learn about statistics. There are lots of books, guides, papers and introductions out there (1, 2, 3, 4).
There are also lots of libraries which implement standard statistical methods:
Java Commons Math,
C++ Libs,
and there are certainly lots of others for the language you use...
And also one last hint: for a quick (initial) result I often use Excel and its chart functions. It supports some statistical methods with which you can play around a bit to see in which direction you may continue.
That really depends on what distribution your workload experiences, so you cannot answer this generically.
But there is a trick: go one step further and do several iterations, each consisting of N calls, and compute, say, the average time/throughput for the entire iteration. Then, for a large N and consistent workload behavior across the calls, the iteration scores are subject to the Central Limit Theorem, which makes them approximately normally distributed.
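A minimal sketch of that trick, using only the standard library and a made-up workload in place of the real benchmarked call:

```python
import statistics
import time

def workload():
    # Stand-in for the call being benchmarked (hypothetical).
    sum(i * i for i in range(10_000))

def time_once():
    start = time.perf_counter()
    workload()
    return time.perf_counter() - start

raw = [time_once() for _ in range(1000)]
print("mean:", statistics.mean(raw),
      "median:", statistics.median(raw),
      "stdev:", statistics.stdev(raw))

# The "trick": group the 1000 calls into iterations of N = 50 and average each
# iteration; for large N these per-iteration means tend toward a normal
# distribution by the Central Limit Theorem.
N = 50
iteration_means = [statistics.mean(raw[i:i + N]) for i in range(0, len(raw), N)]
print("per-iteration means:", iteration_means)
```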

Software to Tune/Calibrate Properties for Heuristic Algorithms

Today I read that there is a piece of software called WinCalibra (scroll a bit down) which can take a text file with properties as input.
This program can then optimize the input properties based on the output values of your algorithm. See this paper or the user documentation for more information (see link above; sadly doc is a zipped exe).
Do you know other software which can do the same which runs under Linux? (preferable Open Source)
EDIT: Since I need this for a Java application: should I invest my research in Java libraries like gaul or watchmaker? The problem is that I don't want to roll my own solution, nor do I have time to do so. Do you have pointers to out-of-the-box applications like Calibra? (Internet searches weren't successful; I only found libraries.)
I decided to give away the bounty (otherwise no one would have a benefit), although I didn't find a satisfactory solution :-( (an out-of-the-box application).
Some kind of (Metropolis algorithm-like) probability-selected random walk is a possibility in this instance. Perhaps with simulated annealing to improve the final selection. Though the timing parameters you've supplied are not optimal for getting a really great result this way.
It works like this:
You start at some point. Use your existing data to pick one that looks promising (like the highest value you've got). Set o to the output value at this point.
You propose a randomly selected step in the input space, assign the output value there to n.
Accept the step (that is update the working position) if 1) n>o or 2) the new value is lower, but a random number on [0,1) is less than f(n/o) for some monotonically increasing f() with range and domain on [0,1).
Repeat steps 2 and 3 as long as you can afford, collecting statistics at each step.
Finally compute the result. In your case an average of all points is probably sufficient.
Important frill: This approach has trouble if the space has many local maxima with deep dips between them, unless the step size is big enough to get past the dips; but big steps make the whole thing slow to converge. To fix this you do two things:
Do simulated annealing (start with a large step size and gradually reduce it, thus allowing the walker to move between local maxima early on, but trapping it in one region later to accumulate precise results).
Use several (many if you can afford it) independent walkers so that they can get trapped in different local maxima. The more you use, and the bigger the difference in output values, the more likely you are to get the best maxima.
This is not necessary if you know that you only have one, big, broad, nicely behaved local extreme.
Finally, the selection of f(). You can just use f(x) = x, but you'll get optimal convergence if you use f(x) = exp(-(1/x)).
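Putting the steps above together, here is a compact sketch for a single one-dimensional parameter; score() is a stand-in for the tuned algorithm's output and the schedule constants are arbitrary:

```python
import math
import random

def score(x):
    # Stand-in for the output value of the algorithm being tuned (hypothetical).
    return math.exp(-(x - 3.0) ** 2) + 0.3 * math.exp(-(x + 2.0) ** 2)

def f(ratio):
    # Acceptance function with range and domain on [0, 1): f(x) = exp(-(1/x)).
    return math.exp(-1.0 / ratio)

x = 0.0                # start at a point that looks promising
o = score(x)
step = 4.0             # large initial step size; annealing shrinks it
visited = []

for _ in range(10_000):
    candidate = x + random.uniform(-step, step)   # propose a random step
    n = score(candidate)
    if n > o or random.random() < f(n / o):       # accept better, or worse with prob f(n/o)
        x, o = candidate, n
    visited.append(x)
    step *= 0.9995                                # simulated annealing schedule

# Final result: an average of all visited points, as suggested above.
print("estimated best parameter:", sum(visited) / len(visited))
```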
Again, you don't have enough time for a great many steps (though if you have multiple computers, you can run separate instances to get the multiple walkers effect, which will help), so you might be better off with some kind of deterministic approach. But that is not a subject I know enough about to offer any advice.
There is a lot of genetic-algorithm-based software that can do exactly that. I wrote a PhD about it a decade or two ago.
A Google search for Genetic Algorithms Linux shows a load of starting points.
Intrigued by the question, I did a bit of poking around, trying to get a better understanding of the nature of CALIBRA, its standing in academic circles and the existence of similar software or projects in the Open Source and Linux world.
Please be kind (and, please, edit directly or suggest edits) for the likely instances where my assertions are incomplete, inexact or even flat-out incorrect. While I work in related fields, I'm by no means an Operational Research (OR) authority!
The [algorithm] parameter tuning problem is relatively well defined, typically framed as a solution search problem whereby the combination of all possible parameter values constitutes a solution space, and the parameter tuning logic's aim is to "navigate" [portions of] this space in search of an optimal (or locally optimal) set of parameters.
The optimality of a given solution is measured in various ways and such metrics help direct the search. In the case of the Parameter Tuning problem, the validity of a given solution is measured, directly or through a function, from the output of the algorithm [i.e. the algorithm being tuned not the algorithm of the tuning logic!].
Framed as a search problem, the discipline of Algorithm Parameter Tuning doesn't differ significantly from other Solution Search problems where the solution space is defined by something other than the parameters to a given algorithm. But because it works on algorithms which are in themselves solutions of sorts, this discipline is sometimes referred to as Metaheuristics or Metasearch. (A metaheuristics approach can be applied to various algorithms.)
Certainly there are many specific features of the parameter tuning problem compared to other optimization applications, but with regard to the solution searching per se, the approaches and problems are generally the same.
Indeed, while well defined, the search problem is generally still broadly unsolved, and is the object of active research in very many different directions, for many different domains. Various approaches offer mixed success depending on the specific conditions and requirements of the domain, and this vibrant and diverse mix of academic research and practical applications is a common trait to Metaheuristics and to Optimization at large.
So... back to CALIBRA...
By its own authors' admission, CALIBRA has several limitations:
Limit of 5 parameters, maximum
Requirement of a range of values for [some of ?] the parameters
Works better when the parameters are relatively independent (but... wait, when that is the case, isn't the whole search problem much easier ;-) )
CALIBRA is based on a combination of approaches, which are repeated in a sequence. A mix of guided search and local optimization.
The paper where CALIBRA was presented is dated 2006. Since then, there's been relatively few references to this paper and to CALIBRA at large. Its two authors have since published several other papers in various disciplines related to Operational Research (OR).
This may be indicative that CALIBRA hasn't been perceived as a breakthrough.
State of the art in that area ("parameter tuning", "algorithm configuration") is the SPOT package in R. You can connect external fitness functions using a language of your choice. It is really powerful.
I am working on adapters for e.g. C++ and Java that simplify the experimental setup, which requires some getting used to in SPOT. The project goes under the name InPUT, and a first version of the tuning part will be up soon.

How do I calculate these statistics?

I'm writing an app to help facilitate some research, and part of this involves doing some statistical calculations. Right now, the researchers are using a program called SPSS. Part of the output that they care about looks like this:
They're really only concerned about the F and Sig. values. My problem is that I have no background in statistics, and I can't figure out what the tests are called, or how to calculate them.
I thought the F value might be the result of the F-test, but after following the steps given on Wikipedia, I got a result that was different from what SPSS gives.
This website might help you out a bit more. Also this one.
I'm working from a fairly rusty memory of a statistics course, but here goes nothing:
When you're doing analysis of variance (ANOVA), you actually calculate the F statistic as the ratio of the mean-square variance "between the groups" to the mean-square variance "within the groups". The second link above seems pretty good for this calculation.
This makes the F statistic measure exactly how powerful your model is, because the "between the groups" variance is explanatory power, and "within the groups" variance is random error. High F implies a highly significant model.
As in many statistical operations, you back-determine Sig. using the F statistic. Here's where your Wikipedia information comes in slightly handy. What you want to do is - using the degrees of freedom given to you by SPSS - find the proper P value at which an F table will give you the F statistic you calculated. The P value where this happens [F(table) = F(calculated)] is the significance.
Conceptually, a lower significance value shows a very strong ability to reject the null hypothesis (which for these purposes means to determine your model has explanatory power).
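As a concrete illustration of the F ratio and its Sig. value for the simple one-way, single-response case (which, as noted further down, is only a special case of the SPSS output in question), here is a short scipy sketch on invented measurements:

```python
from scipy import stats

# Invented measurements for three groups.
group_a = [23.1, 25.4, 24.9, 26.0, 23.8]
group_b = [27.2, 28.1, 26.5, 29.0, 27.7]
group_c = [24.0, 24.8, 25.1, 23.5, 24.4]

f_stat, p_value = stats.f_oneway(group_a, group_b, group_c)
print(f"F = {f_stat:.3f}, Sig. = {p_value:.4f}")

# Equivalently, Sig. is the upper-tail probability of the F distribution at
# f_stat, with (k - 1, N - k) = (2, 12) degrees of freedom here.
print(stats.f.sf(f_stat, 2, 12))
```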
Sorry to any math folks if any of this is wrong. I'll be checking back to make edits!!!
Good luck to you. Stats is fun, just maybe not this part. =)
I assume from your question that your research colleagues want to automate the process by which certain statistical analyses are performed (i.e., they want to batch process data sets). You have two options:
1) SPSS is now scriptable through python (as of version 15) - go to spss.com and search for python. You can write python scripts to automate data analyses and extract key values from pivot tables, and then process the answers any way you like. This has the virtue of allowing an exact comparison between the results from your python script and the hand-calculated efforts in SPSS of your collaborators. Thus you won't have to really know any statistics to do this work (which is a key advantage)
2) You could do this in R, a free statistics environment, which could probably be scripted. This has the disadvantage that you will have to learn statistics to ensure that you are doing it correctly.
Statistics is hard :-). After a year of reading and re-reading books and papers, I can only say with confidence that I understand the very basics of it.
You might wish to investigate ready-made libraries for whichever programming language you are using, because there are many gotchas in math in general and statistics in particular (rounding errors being an obvious example).
As an example you could take a look at the R project, which is both an interactive environment and a library which you can use from your C++ code, distributed under the GPL (ie if you are using it only internally and publishing only the results, you don't need to open your code).
In short: don't do this by hand, link/use existing software. And sain_grocen's answer is incorrect. :(
These are all tests for significance of parameter estimates that are typically used in Multivariate response Multiple Regressions. These would not be simple things to do outside of a statistical programming environment. I would suggest either getting the output from a pre-existing statistical program, or using one that you can link to and use that code.
I'm afraid that the first answer (sain_grocen's) will lead you down the wrong path. His explanation is likely of a special case of what you are actually dealing with. The ANOVA explained in his links is for a single-variate response, in a balanced design. These aren't the F statistics you are seeing. The names in your output (Pillai's Trace, Hotelling's Trace, ...) are some of the available multivariate versions. They have F distributions under certain assumptions. I can't explain a textbook's worth of material here, so I would advise you to start by looking at
"Applied Multivariate Statistical Analysis" by Johnson and Wichern
Can you explain more why SPSS itself isn't a fine solution to the problem? Is it that it generates pivot tables as output that are hard to manipulate? Is it the cost of the program?
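For what it's worth, if the Python route is taken, the multivariate tests SPSS reports (Pillai's Trace, Wilks' lambda, Hotelling-Lawley trace, Roy's greatest root) can also be obtained from statsmodels; this sketch uses invented data and column names:

```python
import pandas as pd
from statsmodels.multivariate.manova import MANOVA

# Invented data: a three-level factor and two response variables.
df = pd.DataFrame({
    "group": ["a"] * 5 + ["b"] * 5 + ["c"] * 5,
    "y1": [2.1, 2.4, 1.9, 2.3, 2.0, 3.1, 3.4, 3.0, 3.3, 3.2, 2.6, 2.5, 2.8, 2.7, 2.9],
    "y2": [5.0, 5.2, 4.8, 5.1, 4.9, 6.1, 6.0, 6.3, 5.9, 6.2, 5.5, 5.4, 5.6, 5.7, 5.3],
})

manova = MANOVA.from_formula("y1 + y2 ~ group", data=df)
print(manova.mv_test())  # prints Wilks' lambda, Pillai's trace, etc. with F and Pr > F
```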
F-statistics can arise from any number of particular tests. The F is just a distribution (loosely: a description of the "frequencies" of groups of values), like a Normal (Gaussian) or Uniform. In general they arise from ratios of variances. Opinion: many statisticians (myself included) find F-based tests to be unstable (jargon: non-robust).
The particular output statistics (Pillai's trace, etc.) suggest that the original analysis is a MANOVA example, which as other posters describe is a complicated, and hard to get right procedure.
I'm guessing also that, based on the MANOVA and the use of SPSS, this is a psychology or sociology project... if not, please enlighten. It might be that other, simpler models would actually be easier to understand and more repeatable. Consult your local university statistical consulting group, if you have one.
Good luck!
Here's an explanation of MANOVA output, from a very good site on statistics and on SPSS:
Output with explanation:
http://faculty.chass.ncsu.edu/garson/PA765/manospss.htm
How and why to do MANOVA or multivariate GLM:
(same path as above, but terminating in '/manova.htm')
Writing software from scratch to calculate these outputs would be both lengthy and difficult; there are lots of numerical problems and matrix inversions to do.
As Henry said, use Python scripts, or R. I'd suggest working with somebody who knows SPSS if scripting.
In addition, SPSS itself is capable of exporting the output tables to files using something called OMS.
A script within SPSS can do this.
Find out who in your research group knows SPSS and work with them.
