I was digging in the scipy code for the Wilcoxon signed-rank test (stats.wilcoxon) and I found that scipy computes the sum of the ranks for the positive differences and, separately, for the negative ones. Then it picks the smaller of the two and uses that as W.
That is substantially different from the test explanation on Wikipedia, or on other sites (http://vassarstats.net/textbook/ch12a.html), where W is the sum of the signed ranks of all the differences.
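To make the two definitions concrete, here is a small sketch with made-up differences (illustration only, not scipy's actual code):

import numpy as np
from scipy.stats import rankdata

# made-up paired differences (zeros already removed)
d = np.array([1.5, -0.8, 2.1, -3.0, 0.9, 1.2])

ranks = rankdata(np.abs(d))      # rank the absolute differences
w_plus = ranks[d > 0].sum()      # rank sum of the positive differences
w_minus = ranks[d < 0].sum()     # rank sum of the negative differences

W_scipy = min(w_plus, w_minus)   # the statistic scipy reports
W_signed = w_plus - w_minus      # the signed-rank sum from the other sources

# since w_plus + w_minus == n*(n+1)/2, each statistic determines the other,
# so a two-sided test comes out the same either way
print(W_scipy, W_signed)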
Is this approach valid?
For many statistical functions, there are alternative definitions with different properties. Sometimes one is more useful, sometimes another. To find out why the author or authors of scipy chose the definition you found, you'll have to ask them. My advice is to look at the revision history of stats.wilcoxon (in Git, or whatever SciPy uses) or ask on the SciPy developers mailing list. Hope this helps.
I need to solve a linear program (LP) with a mix of binary and continuous variables (MILP).
The data I use for the constraints and objective come from pandas data frame manipulations, so they are in matrix (or, I should probably say, numpy array) format, and the variables to solve for sometimes number in the hundreds or thousands.
I know how to use matrices and vectors to set up and solve MILPs in R, but as I coded part of the data handling in Python, I wanted to include the LP part in the Python script.
So I looked for MILP solvers for python, and I found this post:
Python Mixed Integer Linear Programming
and this one:
https://realpython.com/linear-programming-python/#what-is-mixed-integer-linear-programming
Apart from the fact that many links in the first post no longer work, my impression after browsing the docs of some of the suggested solvers was that the LP functions in scipy, which take matrices and vectors as input, cannot solve MILPs (see also this post). On the other hand, interfaces like pulp, which do solve MILPs, seem to have their own very particular input format, where one has to specify the equations one by one and name each variable. That would be very inconvenient, as I already have a simple numpy array where each column is a variable and each row is a vector of constraint coefficients.
Does anyone know of a Python interface to a MILP solver that takes matrices and vectors as input, instead of requiring everything to be written out as explicit equations?
Thanks!
PS I should probably mention that I am looking for interfaces to non-commercial solvers.
EDIT: adding my current conclusion, after following the advice from users, for future record and in case it helps someone else.
In short: as pulp does the job and comes with a (by all accounts) very good LP solver (CBC), I gave in and rewrote my problem as a set of equations and inequalities instead of using the matrix format.
It seems to work, and it's indeed fast, at least for the problems I have tried so far.
My advice to anyone trying to use it on large problems with many variables would be: look up LpAffineExpression; it's very useful for putting together complicated constraints. But first you need to define your variables using LpVariable. A variable can also be a list of variables (that's what stumped me initially).
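For illustration, here is roughly the pattern I mean, with made-up coefficients (the Integer category and the <= direction are just examples):

import numpy as np
import pulp

A = np.array([[1.0, 2.0, 0.5],      # one row per constraint
              [0.0, 1.0, 3.0]])
b = np.array([10.0, 6.0])
c = np.array([1.0, 4.0, 2.0])       # objective coefficients

prob = pulp.LpProblem("from_matrices", pulp.LpMaximize)
x = [pulp.LpVariable(f"x{j}", lowBound=0, cat="Integer")
     for j in range(A.shape[1])]

# LpAffineExpression takes (variable, coefficient) pairs,
# so each matrix row becomes one constraint
prob += pulp.LpAffineExpression(list(zip(x, c)))             # objective
for i in range(A.shape[0]):
    prob += pulp.LpAffineExpression(list(zip(x, A[i]))) <= b[i]

prob.solve()                         # uses the bundled CBC solver by default
print([v.value() for v in x])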
Before trying pulp, I tried cvxopt, which is easy to install, but it's true, as the users below said, that it has its own bizarre input format; and GLPK is unfortunately not a very fast solver.
While I was still trying to stick to my matrix input format, I tried cvxpy. On paper this looks very good, and in fact it would allow me to write the problem as a sum-of-squares minimization, which is what my original problem was (I had turned it into an LP by replacing the sum of squares with a sum of absolute values, and converting that to an LP by known techniques - see the PS below).
I got all the way to writing my objective and constraints in matrix format, and then the solver failed... in fact even the example problem published in the cvxpy documentation failed with the same error. Big waste of time.
Anyway, lesson learned: no point fighting: sometimes you just have to do what python wants you to do ;)
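PS For the record, the "known technique" for the absolute values is the standard reformulation with auxiliary variables: to minimize sum_i |A_i x - b_i|, introduce t >= 0 with A x - b <= t and b - A x <= t, then minimize sum(t). A minimal sketch with made-up data:

import numpy as np
from scipy.optimize import linprog

A = np.array([[1.0, 2.0],            # made-up data
              [3.0, -1.0],
              [0.5, 0.5]])
b = np.array([1.0, 2.0, 0.5])
m, n = A.shape

# variables are [x, t]; minimize sum(t)
c = np.concatenate([np.zeros(n), np.ones(m)])
A_ub = np.block([[A, -np.eye(m)],    #  A@x - t <= b
                 [-A, -np.eye(m)]])  # -A@x - t <= -b
b_ub = np.concatenate([b, -b])

res = linprog(c, A_ub=A_ub, b_ub=b_ub,
              bounds=[(None, None)] * n + [(0, None)] * m,
              method="highs")
x = res.x[:n]                        # the L1-optimal solution
print(x, res.fun)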
Some specifics are missing, and there's no real definition of how closely the accepted input must match what you already have, but Google's OR-Tools mention a method to define problems for the various solvers they wrap through arrays. This may be workable for you.
For hundreds or thousands of variables this is probably not an option, but I have added an answer to the scipy.optimize question with a toy implementation of branch-and-bound on top of linprog, which takes its arguments as arrays. If your problem is highly constrained, it might give some answers in a day or two :).
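To give the flavor of that approach (this is not the linked answer, just a bare-bones sketch of the same idea; tolerances and names are arbitrary):

import numpy as np
from scipy.optimize import linprog

def bnb(c, A_ub, b_ub, int_idx, best=(np.inf, None)):
    """Toy branch-and-bound: min c@x s.t. A_ub@x <= b_ub, x >= 0,
    with x[i] integer for i in int_idx. Purely illustrative."""
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, method="highs")
    if not res.success or res.fun >= best[0]:
        return best                           # infeasible, or pruned by bound
    frac = [(i, res.x[i]) for i in int_idx
            if abs(res.x[i] - round(res.x[i])) > 1e-6]
    if not frac:
        return (res.fun, res.x)               # LP optimum is already integral
    i, xi = frac[0]                           # branch on first fractional var
    row = np.zeros(len(c)); row[i] = 1.0
    # left branch: x[i] <= floor(xi)
    best = bnb(c, np.vstack([A_ub, row]), np.append(b_ub, np.floor(xi)),
               int_idx, best)
    # right branch: x[i] >= ceil(xi), written as -x[i] <= -ceil(xi)
    best = bnb(c, np.vstack([A_ub, -row]), np.append(b_ub, -np.ceil(xi)),
               int_idx, best)
    return best

value, x = bnb(np.array([-1.0, -2.0]),
               np.array([[3.0, 4.0]]), np.array([10.0]), int_idx=[0, 1])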
For future reference:
Make sure you have at least scipy 1.9.3 (pip install "scipy>=1.9.3") and use from scipy.optimize import milp, as explained here. It uses the HiGHS solvers under the hood (the same ones linprog selects via method='highs').
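A minimal sketch with made-up numbers, showing that it takes exactly the matrix-and-vector format asked for:

import numpy as np
from scipy.optimize import milp, LinearConstraint, Bounds

c = np.array([-1.0, -2.0])            # minimize c @ x
A = np.array([[1.0, 1.0],
              [2.0, 0.5]])
b = np.array([3.0, 2.0])              # A @ x <= b

res = milp(c,
           constraints=LinearConstraint(A, -np.inf, b),
           integrality=np.array([1, 0]),             # x[0] integer, x[1] continuous
           bounds=Bounds(lb=[0, 0], ub=[1, np.inf])) # x[0] binary via 0..1 bounds
print(res.x, res.fun)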
I want to use Dynamic Topic Modeling by Blei et al. (http://www.cs.columbia.edu/~blei/papers/BleiLafferty2006a.pdf) for a large corpus of nearly 3800 patent documents.
Does anybody have experience in using DTM in the gensim package?
I identified two models:
models.ldaseqmodel – Dynamic Topic Modeling in Python
models.wrappers.dtmmodel – Dynamic Topic Models (DTM)
Which one did you use, or if you used both, which one is "better"? In other words, which one did/do you prefer?
Both packages work fine, and are pretty much functionally identical. Which one you might want to use depends on your use case. There are small differences in the functions each model comes with, and small differences in the naming, which might be a little confusing, but for most DTM use cases, it does not matter very much which you pick.
Are the model outputs identical?
Not exactly. They are, however, very close to identical (98%+) - I believe most of the differences come from slightly different handling of the probabilities in the generative process. So far, I've not come across a case where a difference in the sixth or seventh digit after the decimal point had any significant meaning. Interpreting the topics your model finds matters much more than one version finding a higher topic loading for some word by 0.00002.
The big difference between the two models: dtmmodel is a python wrapper for the original C++ implementation from blei-lab, which means python will run the binaries, while ldaseqmodel is fully written in python.
Why use dtmmodel?
the C++ code is faster than the python implementation
supports the Document Influence Model from Gerrish/Blei 2010 (potentially interesting for your research; see this paper for an implementation)
Why use ldaseqmodel?
easier to install (simple import statement vs downloading binaries)
can use sstats from a pretrained LDA model - useful with LdaMulticore
easier to understand the workings of the code
I mostly use ldaseqmodel, but that's for convenience. Native DIM support would be great to have, though.
What should you do?
Try each of them out, say, on a small sample set and see what the models return. 3800 documents isn't a huge corpus (assuming the patents aren't hundreds of pages each), and I assume that after preprocessing (removing stopwords, images and metadata) your dictionary won't be too large either (lots of standard phrases and legalese in patents, I'd assume). Pick the one that works best for you or has the capabilities you need.
A full analysis might take hours anyway; if you let your code run overnight, there is little practical difference - after all, do you care whether it finishes at 3am or 5am? If runtime is critical, I would assume dtmmodel will be more useful.
For implementation examples, you might want to take a look at these notebooks: ldaseqmodel and dtmmodel
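And for a very quick start, a minimal ldaseqmodel sketch (toy stand-in documents; with real patents you would pass your preprocessed token lists, ordered by time):

from gensim.corpora import Dictionary
from gensim.models import LdaSeqModel

# toy stand-in for the patent corpus, already sorted by time
docs = [["battery", "cell", "anode"],
        ["sensor", "signal", "battery"],
        ["anode", "cathode", "cell"],
        ["signal", "filter", "sensor"]]

dictionary = Dictionary(docs)
corpus = [dictionary.doc2bow(doc) for doc in docs]

# time_slice = number of documents in each period (here 2 + 2)
model = LdaSeqModel(corpus=corpus, id2word=dictionary,
                    time_slice=[2, 2], num_topics=2)

print(model.print_topics(time=0))   # topics in the first time slice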
I am having a hard time optimizing a program that relies on ad's conjugateGradientDescent function for most of its work.
Basically, my code is a translation of an old paper's code, written in Matlab and C. I have not measured it, but that code runs at several iterations per second. Mine is on the order of minutes per iteration ...
The code is available in these repositories:
https://github.com/fhaust/aer
https://github.com/fhaust/aer-utils
The code in question can be run by following these commands:
$ cd aer-utils
$ cabal sandbox init
$ cabal sandbox add-source ../aer
$ cabal run learngabors
Using GHCs profiling facilities I have confirmed that the descent is in fact the part that is taking most of the time:
(interactive version here: https://dl.dropboxusercontent.com/u/2359191/learngabors.svg)
The runtime statistics (+RTS -s) tell me that productivity is quite low:
Productivity 33.6% of total user, 33.6% of total elapsed
From what I have gathered there are two things that might lead to higher performance:
Unboxing: currently I use a custom matrix implementation (in src/Data/SimpleMat.hs). This was the only way I could get ad to work with matrices (see: How to do automatic differentiation on hmatrix?). My guess is that a matrix type like newtype Mat w h a = Mat (Unboxed.Vector a) would achieve better performance due to unboxing and fusion. I found some code that has ad instances for unboxed vectors, but so far I haven't been able to use these with the conjugateGradientDescent function.
Matrix derivatives: in an email I just can't find at the moment, Edward mentions that it would be better to use Forward instances for matrix types instead of having matrices filled with Forward instances. I have a faint idea how to achieve that, but have yet to figure out how I'd implement it in terms of ad's type classes.
This is probably a question that is too wide to be answered on SO, so if you are willing to help me out here, feel free to contact me on Github.
You are running into pretty much the worst-case scenario for the current ad library here.
FWIW - you won't be able to use the existing ad classes/types with "matrix/vector ad". It'd be a fairly large engineering effort; see https://github.com/ekmett/ad/issues/2
As for why you can't unbox: conjugateGradient requires the ability to use Kahn mode or two levels of forward mode on your functions. The former precludes it from working with unboxed vectors, as the data types carry syntax trees, and can't be unboxed. For various technical reasons I haven't figured out how to make it work with a fixed sized 'tape' like the standard Reverse mode.
I think the "right" answer here is for us to sit down and figure out how to get matrix/vector AD right and integrated into the package, but I confess I'm timesliced a bit too thinly right now to give it the attention it deserves.
If you get a chance to swing by #haskell-lens on irc.freenode.net, I'd be happy to talk about designs in this space and offer advice. Alex Lang has also been working on ad a lot and is often present there; he may have ideas.
I had a brief exposure to CP and MiniZinc but I'm no expert.
I have a CP model, which I cannot post here ATM, implemented in MiniZinc.
I need to generate all the feasible solutions for the problem. We expect there to be just a "few", say more than 100 but fewer than 1000.
I tried to solve the model with the -a flag passed to minizinc ver. 1.6, but I noticed that a lot of the solutions being printed are identical.
Here they refer to "projection". In another paper I read they used some "backtracking mechanism".
It's still not clear to me.
My questions then are:
what is the best way to generate only unique solutions from a CP model?
Is there a standard mechanism implemented in CP libraries like SCIP or Gecode? Does it have a common name?
Is it computationally efficient?
does minizinc support that? How do I access that feature?
Normally, CP systems give you only distinct solutions. I suspect that you have decision variables that are not printed (i.e., not in the output section); if the values of those variables were included, the solutions would be unique.
In the question you linked to (this recent discussion), it is mentioned that Gecode's FlatZinc solver (at least the SVN version) now generates distinct solutions given a subset of the decision variables in the output section. Other FlatZinc solvers don't seem to have this feature.
If that doesn't answer your questions, please give more details of the model and example of the output (including the output section).
I'm writing an app to help facilitate some research, and part of this involves doing some statistical calculations. Right now, the researchers are using a program called SPSS. Part of the output that they care about looks like this:
They're really only concerned about the F and Sig. values. My problem is that I have no background in statistics, and I can't figure out what the tests are called, or how to calculate them.
I thought the F value might be the result of the F-test, but after following the steps given on Wikipedia, I got a result that was different from what SPSS gives.
This website might help you out a bit more. Also this one.
I'm working from a fairly rusty memory of a statistics course, but here goes nothing:
When you're doing analysis of variance (ANOVA), you calculate the F statistic as the ratio of the mean-square variance "between the groups" to the mean-square variance "within the groups". The second link above seems pretty good for this calculation.
This makes the F statistic a measure of exactly how powerful your model is, because the "between the groups" variance is explanatory power and the "within the groups" variance is random error. A high F implies a highly significant model.
As in many statistical operations, you back-determine Sig. using the F statistic. Here's where your Wikipedia information comes in slightly handy. What you want to do is - using the degrees of freedom given to you by SPSS - find the proper P value at which an F table will give you the F statistic you calculated. The P value where this happens [F(table) = F(calculated)] is the significance.
Conceptually, a lower significance value shows a very strong ability to reject the null hypothesis (which for these purposes means to determine your model has explanatory power).
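Putting the calculation above into code (made-up numbers; scipy is used only to look up the F distribution and to cross-check) - note this is the one-way ANOVA case:

import numpy as np
from scipy import stats

# made-up measurements for three groups
groups = [np.array([4.1, 5.0, 5.8, 4.6]),
          np.array([6.2, 7.1, 6.8, 7.5]),
          np.array([5.5, 5.9, 6.3, 5.1])]

k = len(groups)                             # number of groups
n = sum(len(g) for g in groups)             # total observations
grand_mean = np.concatenate(groups).mean()

ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

F = (ss_between / (k - 1)) / (ss_within / (n - k))
sig = stats.f.sf(F, k - 1, n - k)           # upper-tail p-value: the "Sig."

print(F, sig)
print(stats.f_oneway(*groups))              # should agree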
Sorry to any math folks if any of this is wrong. I'll be checking back to make edits!!!
Good luck to you. Stats is fun, just maybe not this part. =)
I assume from your question that your research colleagues want to automate the process by which certain statistical analyses are performed (i.e., they want to batch process data sets). You have two options:
1) SPSS is now scriptable through Python (as of version 15) - go to spss.com and search for python. You can write Python scripts to automate data analyses and extract key values from pivot tables, and then process the answers any way you like. This has the virtue of allowing an exact comparison between the results from your Python script and the hand-calculated efforts in SPSS of your collaborators. Thus you won't really have to know any statistics to do this work (which is a key advantage). A minimal sketch follows after option 2.
2) You could do this in R, a free statistics environment, which could probably be scripted. This has the disadvantage that you will have to learn statistics to ensure that you are doing it correctly.
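For option 1, a rough sketch of the idea - this only runs inside SPSS's Python integration, and the file path and GLM command are placeholders for whatever analysis your collaborators actually run:

import spss   # available inside SPSS's Python integration, not standalone

# placeholder dataset and analysis - substitute the real SPSS syntax
spss.Submit("""
GET FILE='C:/data/study.sav'.
GLM dv1 dv2 BY group.
""")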
Statistics is hard :-). After a year of reading and re-reading books and papers, I can only say with confidence that I understand the very basics of it.
You might wish to investigate ready-made libraries for whichever programming language you are using, because there are many gotchas in math in general and statistics in particular (rounding errors being an obvious example).
As an example, you could take a look at the R project, which is both an interactive environment and a library you can use from your C++ code, distributed under the GPL (i.e., if you are using it only internally and publishing only the results, you don't need to open your code).
In short: don't do this by hand, link/use existing software. And sain_grocen's answer is incorrect. :(
These are all tests for the significance of parameter estimates that are typically used in multivariate-response multiple regressions. These would not be simple things to do outside of a statistical programming environment. I would suggest either getting the output from a pre-existing statistical program, or using one that you can link to and calling its code.
I'm afraid that the first answer (sain_grocen's) will lead you down the wrong path. His explanation is likely of a special case of what you are actually dealing with. The ANOVA explained in his links is for a single-variate response in a balanced design. These aren't the F statistics you are seeing. The names in your output (Pillai's Trace, Hotelling's Trace, ...) are some of the available multivariate versions. They have F distributions under certain assumptions. I can't explain a textbook's worth of material here; I would advise you to start by looking at
"Applied Multivariate Statistical Analysis" by Johnson and Wichern
Can you explain more why SPSS itself isn't a fine solution to the problem? Is it that it generates pivot tables as output that are hard to manipulate? Is it the cost of the program?
F-statistics can arise from any number of particular tests. The F is just a distribution (loosely: a description of the "frequencies" of groups of values), like a Normal (Gaussian), or Uniform. In general they arise from ratios of variances. Opinion: many statisticians (myself included), find F-based tests to be unstable (jargon: non-robust).
The particular output statistics (Pillai's trace, etc.) suggest that the original analysis is a MANOVA example, which, as other posters describe, is a complicated and hard-to-get-right procedure.
I'm guessing also, based on the MANOVA and the use of SPSS, that this is a psychology or sociology project... if not, please enlighten us. It might be that other, simpler models would actually be easier to understand and more repeatable. Consult your local university's statistical consulting group, if you have one.
Good luck!
Here's an explanation of MANOVA output, from a very good site on statistics and on SPSS:
Output with explanation:
http://faculty.chass.ncsu.edu/garson/PA765/manospss.htm
How and why to do MANOVA or multivariate GLM:
(same path as above, but terminating in '/manova.htm')
Writing software from scratch to calculate these outputs would be both lengthy and difficult; there are lots of numerical problems and matrix inversions to do.
As Henry said, use Python scripts, or R. If you go the scripting route, I'd suggest working with somebody who knows SPSS.
In addition, SPSS itself is capable of exporting the output tables to files using something called OMS.
A script within SPSS can do this.
Find out who in your research group knows SPSS and work with them.