When using this code, I've noticed that it converges unbelievably quickly (small
fraction of one second), even when the model and/or the data is very large. I
suspect that in some cases I am not getting anything close to the best solution,
but this is hard to prove. It would be nice to have the option for some type of
global optimizer such as the basin hopping algorithm, even if this consumed 100
to 1,000 times as much CPU. Does anyone have any thoughts on this subject?
This is a very complex question and this answer might be incomplete, but should give you some hints (as your question also indicates some knowledge gaps):
(1) First, I disagree with the desire for "some type of global optimizer such as the basin hopping algorithm, even if this consumed 100 to 1,000 times as much CPU". This does not help in most cases (in the ML world): the differences are subtle, and the optimization error will often be negligible compared to the other errors (model power; empirical risk).
Read "Stochastic Gradient Descent Tricks" (Bottou) for an overview (and the error components!)
He even gives a very important reason to use fast approximate algorithms (not necessarily a good fit in your case if 1000x training-time is not a problem): approximate optimization can achieve better expected risk because more training examples can be processed during the allowed time
(2) Basin-hopping is one of those highly heuristic tools of global optimization (looking for global minima instead of local minima) without any guarantees at all (touching NP-hardness and co.). It's the last algorithm you want to use here (see point (3))!
(3) The problem of logistic-regression is a convex optimization problem!
Every local minimum is a global minimum, which follows from convexity (I'm ignoring details like strict convexity and uniqueness of the solution)!
Therefore you will always use something tuned for convex-optimization! And never Basin-hopping!
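To make point (3) concrete, here is a minimal sketch (not sklearn's implementation) that runs plain gradient descent on an L2-regularized logistic loss from several random starting points. The synthetic data, the step size and the penalty strength `lam` are all assumptions chosen for illustration. Because the objective is strictly convex, every start should end at the same weights.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (assumption: purely illustrative, not from the question).
n, d = 200, 3
X = rng.normal(size=(n, d))
w_true = np.array([1.5, -2.0, 0.5])
y = np.where(rng.random(n) < 1.0 / (1.0 + np.exp(-X @ w_true)), 1.0, -1.0)

lam = 0.1  # L2 penalty strength; makes the objective strictly convex


def loss(w):
    # Mean logistic loss log(1 + exp(-y * Xw)) plus the L2 penalty.
    return np.mean(np.logaddexp(0.0, -y * (X @ w))) + 0.5 * lam * (w @ w)


def grad(w):
    margins = y * (X @ w)
    weights = 1.0 / (1.0 + np.exp(margins))  # sigmoid(-margin)
    return -(X * (y * weights)[:, None]).mean(axis=0) + lam * w


def gradient_descent(w0, step=0.5, iters=5000):
    w = w0.copy()
    for _ in range(iters):
        w -= step * grad(w)
    return w


# Five very different starting points.
starts = [rng.normal(scale=5.0, size=d) for _ in range(5)]
solutions = np.array([gradient_descent(w0) for w0 in starts])

# Largest coordinate-wise disagreement between any run and the first run.
spread = np.max(np.abs(solutions - solutions[0]))
print("largest disagreement between runs:", spread)
```

If the runs disagreed noticeably, that would point to too few iterations or a bad step size, not to a second local minimum.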
(4) There are different solvers and each supports different variants of problems (different regularization and co.). We don't know exactly what you are optimizing, but of course these solvers work differently with regard to convergence:
Take the following comments with a grain of salt:
liblinear: sklearn's docs describe this as a coordinate-descent algorithm, which means convergence is highly dependent on the data
whether accurate convergence is achieved depends solely on the exact implementation (liblinear is high-quality)
as it's a first-order method I would call the general accuracy medium
sag/saga: seem to have a better convergence theory (I did not check it much), but again: convergence depends on your data, as mentioned in sklearn's docs, and whether solutions are accurate depends heavily on the implementation details
as these are first-order methods: general accuracy medium
newton-cg: an inexact Newton method
in general much more robust in terms of convergence as line-searches replace heuristics or constant learning-rates (LS costly in first-order opt)
second-order method with inexact-core: expected accuracy: medium-high
lbfgs: quasi-Newton method
again in general much more robust in terms of convergence like newton-cg
second-order method: expected accuracy: medium-high
Of course second-order methods are hurt more by large-scale data (even complexity-wise) and, as mentioned, not all solvers support every logreg optimization problem supported in sklearn.
I hope you get the idea how complex this question is (because of highly complex solver-internals).
Most important things:
LogReg is convex -> use solvers tuned for unconstrained convex optimization
If you want medium-high accuracy: use those second-order based methods available and do many iterations (it's a parameter)
If you want high accuracy: use second-order based methods which are even more conservative/careful (no: hessian-approx; inverse-hessian-approx; truncating...):
e.g. any off-the-shelf solver from convex optimization
Open-source: cvxopt, ecos and co.
Commercial: Mosek
(but you need to formulate the model yourself in their frameworks or some wrapper; probably some examples for classic logistic-regression available)
As expected: some methods will get very slow with much data.
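One rough way to check whether a quick fit is already close to the optimum is to tighten `tol`, raise `max_iter`, and compare two of the solvers discussed above. The sketch below assumes scikit-learn is installed and uses a synthetic dataset from `make_classification` as a stand-in for your data; the particular `tol` and `max_iter` values are arbitrary choices, not recommendations.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data (assumption: replace with your own X, y).
X, y = make_classification(n_samples=2000, n_features=10, random_state=0)

coefs = {}
for solver in ("lbfgs", "newton-cg"):
    # Tight tolerance and a generous iteration budget, same L2 penalty for both.
    clf = LogisticRegression(solver=solver, C=1.0, tol=1e-10, max_iter=10_000)
    clf.fit(X, y)
    coefs[solver] = clf.coef_.ravel()

# If both solvers reached the optimum of the same convex problem,
# their coefficients should agree up to numerical tolerance.
max_diff = np.max(np.abs(coefs["lbfgs"] - coefs["newton-cg"]))
print("largest coefficient difference:", max_diff)
```

liblinear is left out here because it penalizes the intercept as well, so it solves a slightly different problem unless `fit_intercept=False`.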
I'm working on a simple project in which I'm trying to describe the relationship between two positively correlated variables and determine whether that relationship is changing over time, and if so, to what degree. I feel like this is something people probably do pretty often, but maybe I'm just not using the correct terminology, because Google isn't helping me very much.
I've plotted the variables on a scatter plot and know how to determine the correlation coefficient and plot a linear regression. I thought this may be a good first step because the linear regression tells me what I can expect y to be for a given x value. This means I can quantify how "far away" each data point is from the regression line (I think this is called the squared error?). Now I'd like to see what the error looks like for each data point over time. For example, if I have 100 data points and the most recent 20 are much farther away from where the regression line/function says it should be, maybe I could say that the relationship between the variables is showing signs of changing? Does that make any sense at all or am I way off base?
I have a suspicion that there is a much simpler way to do this and/or that I'm going about it in the wrong way. I'd appreciate any guidance you can offer!
I can suggest two strands of literature that study changing relationships over time. Typing these names into Google should provide you with a large number of references, so I'll stick to concise descriptions.
(1) Structural break modelling. As the name suggests, this assumes that there has been a sudden change in parameters (e.g. a correlation coefficient). This is applicable if there has been a policy change, a change in measurement device, etc. The estimation approach is indeed very close to the procedure you suggest. Namely, you would estimate the squared error (or some other measure of fit) on the full sample and on the two sub-samples (before and after the break). If the gains in fit are large when dividing the sample, then you would favour the model with the break and use different coefficients before and after the structural change.
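As a rough sketch of the comparison in (1): fit one line to the full sample, fit separate lines before and after a candidate break, and compare the sums of squared residuals. The ratio computed below is the classic Chow F statistic. The data are synthetic, and the break position (index 80) is assumed known here, which it often is not in practice.

```python
import numpy as np


def ssr_of_line(x, y):
    """Fit y = a + b*x by least squares; return the sum of squared residuals."""
    A = np.column_stack([np.ones_like(x), x])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    return float(resid @ resid)


def chow_f(x, y, break_idx, k=2):
    """Chow F statistic for a break at break_idx (k = parameters per line)."""
    ssr_full = ssr_of_line(x, y)
    ssr_split = (ssr_of_line(x[:break_idx], y[:break_idx])
                 + ssr_of_line(x[break_idx:], y[break_idx:]))
    n = len(y)
    return ((ssr_full - ssr_split) / k) / (ssr_split / (n - 2 * k))


# Synthetic example (assumption): 100 points in time order.
rng = np.random.default_rng(1)
x = rng.uniform(0, 10, 100)
noise = rng.normal(0, 1, 100)

y_stable = 1.0 + 2.0 * x + noise                 # same relationship throughout
y_break = y_stable.copy()
y_break[80:] = 1.0 + 3.0 * x[80:] + noise[80:]   # slope changes for the last 20

f_stable = chow_f(x, y_stable, 80)
f_break = chow_f(x, y_break, 80)
print("F without break:", f_stable, " F with break:", f_break)
```

A large F (relative to an F distribution with k and n - 2k degrees of freedom) favours the model with the break.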
(2) Time-varying coefficient models. This approach is more subtle as coefficients will now evolve more slowly over time. These changes can originate from the time evolution of some observed variables or they can be modeled through some unobserved latent process. In the latter case the estimation typically involves the use of state-space models (and thus the Kalman filter or some more advanced filtering techniques).
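For (2), a minimal state-space sketch: a regression with no intercept whose slope follows a random walk, filtered with a scalar Kalman filter. The function name `kalman_slope`, the synthetic data and the two noise variances are assumptions for illustration; in a real application the variances would usually be estimated (e.g. by maximum likelihood) instead of being fixed by hand.

```python
import numpy as np


def kalman_slope(x, y, obs_var, state_var, beta0=0.0, p0=1e3):
    """Filter the slope of y_t = beta_t * x_t + noise, beta_t a random walk."""
    beta, p = beta0, p0            # current slope estimate and its variance
    path = np.empty(len(y))
    for t in range(len(y)):
        p = p + state_var                    # predict: random walk adds uncertainty
        innovation = y[t] - x[t] * beta      # one-step-ahead forecast error
        s = x[t] * p * x[t] + obs_var        # variance of that forecast error
        gain = p * x[t] / s                  # Kalman gain
        beta = beta + gain * innovation      # update the slope estimate
        p = (1.0 - gain * x[t]) * p          # update its variance
        path[t] = beta
    return path


# Synthetic example (assumption): the true slope drifts from 1 to 2.
rng = np.random.default_rng(0)
n = 500
true_beta = np.linspace(1.0, 2.0, n)
x = rng.uniform(1.0, 3.0, n)
y = true_beta * x + rng.normal(0.0, 0.5, n)

beta_path = kalman_slope(x, y, obs_var=0.25, state_var=1e-4)
print("filtered slope at t=100:", beta_path[100], " at the end:", beta_path[-1])
```

Plotting `beta_path` against time gives a direct picture of how the relationship evolves.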
I hope this helps!
Despite going through lots of similar questions related to this, I still cannot understand why some algorithms are susceptible to feature scaling while others are not.
So far I have found that SVM and K-means are susceptible to feature scaling, while Linear Regression and Decision Trees are not. Can somebody please explain why, either in general or in relation to these 4 algorithms?
As I am a beginner, please explain this in layman's terms.
One reason I can think of off-hand is that SVM and K-means, at least with a basic configuration, use an L2 distance metric. An L1 or L2 distance metric between two points will give different results if you double delta-x or delta-y, for example.
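A tiny made-up example of the distance point: the nearest neighbour of a query point changes when one feature is rescaled, which is exactly the kind of change that distance-based methods react to.

```python
import numpy as np

query = np.array([0.0, 0.0])
points = np.array([[1.0, 0.0],    # point A
                   [0.0, 2.0]])   # point B


def nearest(query, points):
    """Index of the point with the smallest Euclidean (L2) distance to query."""
    dists = np.linalg.norm(points - query, axis=1)
    return int(np.argmin(dists))


before = nearest(query, points)                 # A is closer: distance 1 vs 2

scale = np.array([10.0, 1.0])                   # rescale feature 0 only
after = nearest(query * scale, points * scale)  # B is closer: distance 10 vs 2

print("nearest before scaling:", before, " after scaling:", after)
```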
With Linear Regression, you fit a linear transform to best describe the data by effectively transforming the coordinate system before taking a measurement. Since the optimal model is the same no matter the coordinate system of the data, pretty much by definition, your result will be invariant to any linear transform including feature scaling.
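This can be checked numerically for ordinary least squares. The sketch below (synthetic data, plain NumPy, hypothetical helper names `ols_fit` and `ols_predict`) fits the same model on raw and on rescaled features: the coefficients change to compensate for the scaling, but the predictions do not. Note the assumption of no regularization; a ridge or lasso penalty would break this invariance.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
y = 3.0 + X @ np.array([2.0, -1.0]) + rng.normal(scale=0.1, size=50)


def ols_fit(X, y):
    """Ordinary least squares with an intercept; returns [intercept, slopes...]."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef


def ols_predict(coef, X):
    return coef[0] + X @ coef[1:]


scale = np.array([1000.0, 0.01])    # very different units per feature

coef_raw = ols_fit(X, y)
coef_scaled = ols_fit(X * scale, y)

pred_raw = ols_predict(coef_raw, X)
pred_scaled = ols_predict(coef_scaled, X * scale)

print("max prediction difference:", np.max(np.abs(pred_raw - pred_scaled)))
```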
With Decision Trees, you typically look for rules of the form x < N, where the only detail that matters is how many items pass or fail the given threshold test; you pass this into your entropy function. Because this rule format does not depend on the scale of a dimension (there is no continuous distance metric), we again have invariance.
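As an illustration, here is a hand-rolled single split (a "stump") that uses Gini impurity in place of entropy; it is a sketch, not any library's tree implementation, and the data are synthetic. Rescaling the features moves the threshold but leaves the chosen feature and the resulting partition of the items unchanged.

```python
import numpy as np


def gini(labels):
    """Gini impurity of a 0/1 label array (0.0 for an empty array)."""
    if len(labels) == 0:
        return 0.0
    p = np.mean(labels)
    return 2.0 * p * (1.0 - p)


def best_split(X, y):
    """Return (feature index, left-side mask) of the split with lowest weighted Gini."""
    best_score, best_feature, best_mask = np.inf, None, None
    for j in range(X.shape[1]):
        values = np.unique(X[:, j])
        thresholds = (values[:-1] + values[1:]) / 2.0   # midpoints between values
        for t in thresholds:
            left = X[:, j] < t
            score = left.mean() * gini(y[left]) + (~left).mean() * gini(y[~left])
            if score < best_score:
                best_score, best_feature, best_mask = score, j, left
    return best_feature, best_mask


# Synthetic data (assumption): the label depends only on feature 0.
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 2))
y = (X[:, 0] > 0.3).astype(int)

scale = np.array([1000.0, 0.001])
feat_raw, mask_raw = best_split(X, y)
feat_scaled, mask_scaled = best_split(X * scale, y)

print("same feature:", feat_raw == feat_scaled,
      " same partition:", np.array_equal(mask_raw, mask_scaled))
```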
Somewhat different reasons for each, but I hope that helps.
Good afternoon,
I know that the traditional independent t-test assumes homoscedasticity (i.e., equal variances across groups) and normality of the residuals.
They are usually checked by using Levene's test for homogeneity of variances, and the Shapiro-Wilk test and Q-Q plots for the normality assumption.
Which statistical assumptions do I have to check with the Bayesian independent t-test? How may I check them in R with coda and rjags?
For whichever test you want to run, find the formula and plug in using the posterior draws of the parameters you have, such as the variance parameter and any regression coefficients that the formula requires. Iterating the formula over the posterior draws will give you a range of values for the test statistic from which you can take the mean to get an average value and the sd to get a standard deviation (uncertainty estimate).
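The question asks about R, but the idea is language-independent, so here is a Python sketch. The arrays `mu1`, `mu2` and `sigma` are simulated stand-ins for posterior draws (an assumption; in practice they would be the sampled chains from your model), and the statistic computed per draw is a standardized mean difference, chosen only as an example of "plugging draws into a formula".

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for MCMC output (assumption): one value per posterior draw.
n_draws = 4000
mu1 = rng.normal(10.0, 0.3, n_draws)            # group 1 mean
mu2 = rng.normal(9.0, 0.3, n_draws)             # group 2 mean
sigma = np.abs(rng.normal(2.0, 0.1, n_draws))   # common standard deviation


def summarize(stat_draws):
    """Posterior mean, sd and central 95% interval of a per-draw statistic."""
    lo, hi = np.percentile(stat_draws, [2.5, 97.5])
    return {
        "mean": float(np.mean(stat_draws)),
        "sd": float(np.std(stat_draws, ddof=1)),
        "interval95": (float(lo), float(hi)),
    }


# Apply the formula to every draw, then summarize the resulting values.
effect_size = (mu1 - mu2) / sigma
summary = summarize(effect_size)
print(summary)
```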
And boom, you're done.
There might be non-parametric Bayesian t-tests. But commonly, Bayesian t-tests are parametric, and as such they assume equality of the relevant population variances. If you can obtain a t-value from a regular t-test (of the appropriate type, from any software package you're comfortable with) and check the variances with Levene's test (do not treat this as a dependable test; remember that it relies on a p-value), then you can do a Bayesian t-test. But remember that the Bayesian t-test requires a conventional model of the observations (the likelihood) and an appropriate prior for the parameter of interest.
It is highly recommended that t-tests be re-parameterized in terms of effect sizes (especially standardized mean difference effect sizes). That is, you focus on the Bayesian estimation of the effect size arising from the t-test rather than on other parameters of the t-test. If you opt to estimate the effect size from a t-test, then a very easy-to-use, free, online Bayesian t-test software is THIS ONE HERE (probably one of the most user-friendly packages available; note that this software uses a Cauchy prior for the effect size arising from any type of t-test).
Finally, since you want to do a Bayesian t-test, I would suggest focusing your attention on picking an appropriate/defensible/meaningful prior rather than on Levene's test. No test can really show whether the sample data came from two populations (in your case) with equal variances unless data is plentiful. Note that whether the sample data came from populations with equal variances is itself an inferential (Bayesian or non-Bayesian) question.
I have a catalog of 900 applications.
I need to determine how their reliability is distributed as a whole (i.e., is it normal?).
I can measure the reliability of an individual application.
How can I determine the reliability of the group as a whole without measuring each one?
That's a pretty open-ended question! Overall, distribution fitting can be quite challenging and works best with large samples (hundreds or even thousands). It's generally better to pick a modeling distribution based on known characteristics of the process you're attempting to model than to try purely empirical fitting.
If you're going to go empirical, for a start you could take a random sample, measure the reliability scores (whatever you're using for that) of your sample, sort them, and plot them vs normal quantiles. If they fall along a relatively straight line the normal distribution is a plausible model, and you can estimate sample mean and variance to parameterize it. You can apply the same idea of plotting vs quantiles from other proposed distributions to see if they are plausible as well.
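A minimal sketch of the quantile comparison, using only NumPy and the standard library. The helper name `normal_qq`, the plotting positions (i - 0.5)/n, and the two synthetic samples are assumptions; the correlation of the Q-Q points is used here as a crude stand-in for eyeballing how straight the line is.

```python
import numpy as np
from statistics import NormalDist


def normal_qq(sample):
    """Return (theoretical normal quantiles, sorted sample, their correlation)."""
    ordered = np.sort(np.asarray(sample, dtype=float))
    n = len(ordered)
    probs = (np.arange(1, n + 1) - 0.5) / n              # plotting positions
    theo = np.array([NormalDist().inv_cdf(p) for p in probs])
    r = np.corrcoef(theo, ordered)[0, 1]                 # near 1 means near-straight
    return theo, ordered, r


# Synthetic "reliability scores" (assumption): one normal, one skewed sample.
rng = np.random.default_rng(0)
normal_scores = rng.normal(loc=0.9, scale=0.03, size=500)
skewed_scores = rng.exponential(scale=1.0, size=500)

_, _, r_normal = normal_qq(normal_scores)
_, _, r_skewed = normal_qq(skewed_scores)
print("Q-Q correlation, normal sample:", r_normal, " skewed sample:", r_skewed)
```

Plotting the first two returned arrays against each other gives the Q-Q plot itself; curvature at either end is the tail behaviour to watch for.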
Watch out for behavior in the tails, in particular. Pretty much by definition the tails occur rarely and may be under-represented in your sample. Like all things statistical, the larger the sample size you can draw on the better your results will be.
I'd also add that my prior belief would be that a normal distribution wouldn't be a great fit. Your reliability scores probably fall on a bounded range and tend to fall more towards one side or the other of that range. If they tend to the high range, I'd predict that they get lopped off at the end of the range and have a long tail to the low side, and vice versa if they tend to the low range.
What's the relationship between the Monte-Carlo Method and Evolutionary Algorithms? On the face of it they seem to be unrelated simulation methods used to solve complex problems. Which kinds of problems is each best suited for? Can they solve the same set of problems? What is the relationship between the two (if there is one)?
"Monte Carlo" is, in my experience, a heavily overloaded term. People seem to use it for any technique that uses a random number generator: global optimization, scenario analysis (Google "Excel Monte Carlo simulation"), stochastic integration (the Pi calculation that everybody uses to demonstrate MC). I believe, because you mentioned evolutionary algorithms in your question, that you are talking about Monte Carlo techniques for mathematical optimization: you have some sort of fitness function with several input parameters and you want to minimize (or maximize) that function.
If your function is well behaved (there is a single, global minimum that you will arrive at no matter which inputs you start with) then you are best off using a deterministic minimization technique such as the conjugate gradient method. Many machine learning classification techniques involve finding parameters that minimize the least squares error for a hyperplane with respect to a training set. The function that is being minimized in this case is a smooth, well-behaved paraboloid in n-dimensional space. Calculate the gradient and roll downhill. Easy peasy.
If, however, your input parameters are discrete (or if your fitness function has discontinuities) then it is no longer possible to calculate gradients accurately. This can happen if your fitness function is calculated using tabular data for one or more variables (if variable X is less than 0.5 use this table, else use that table). Alternatively, you may have a program that you got from NASA that is made up of 20 modules written by different teams that you run as a batch job. You supply it with input and it spits out a number (think black box). Depending on the input parameters that you start with you may end up in a false minimum. Global optimization techniques attempt to address these types of problems.
Evolutionary Algorithms form one class of global optimization techniques. Global optimization techniques typically involve some sort of "hill climbing" (accepting a configuration with a higher (worse) fitness function). This hill climbing typically involves some randomness/stochastic-ness/monte-carlo-ness. In general, these techniques are more likely to accept less optimal configurations early on and, as the optimization progresses, they are less likely to accept inferior configurations.
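The "accept worse configurations early, rarely later" behaviour can be sketched with a simulated-annealing-style acceptance rule. Everything specific below is an assumption for illustration: the 1-D test function `f` (which has two minima), the geometric cooling schedule, the step size and the seed.

```python
import math
import random


def f(x):
    # Made-up fitness function with a local and a global minimum.
    return x**4 - 3 * x**2 + x


def anneal(f, x0, steps=5000, t_start=2.0, t_end=1e-3, step_size=0.5, seed=0):
    rng = random.Random(seed)
    x, fx = x0, f(x0)
    best_x, best_f = x, fx
    for i in range(steps):
        # Temperature decays geometrically from t_start to t_end.
        t = t_start * (t_end / t_start) ** (i / (steps - 1))
        cand = x + rng.uniform(-step_size, step_size)
        fc = f(cand)
        # Always accept improvements; accept a worse candidate with
        # probability exp(-(fc - fx) / t), which shrinks as t drops.
        if fc <= fx or rng.random() < math.exp(-(fc - fx) / t):
            x, fx = cand, fc
            if fx < best_f:
                best_x, best_f = x, fx
    return best_x, best_f


best_x, best_f = anneal(f, x0=2.0)
print("best x:", best_x, " best f:", best_f)
```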
Evolutionary algorithms are loosely based on evolutionary analogies. Simulated annealing is based upon analogies to annealing in metals. Particle swarm techniques are also inspired by biological systems. In all cases you should compare results to a simple random (a.k.a. "monte carlo") sampling of configurations...this will often yield equivalent results.
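The random-sampling baseline is only a few lines. This sketch uses a made-up 1-D function `f` with two minima and an assumed search interval of [-3, 3]; whatever fancier method you try should at least beat the value this finds for the same number of function evaluations.

```python
import numpy as np


def f(x):
    # Made-up fitness function with a local and a global minimum.
    return x**4 - 3 * x**2 + x


rng = np.random.default_rng(0)

# Draw configurations uniformly at random and keep the best one.
samples = rng.uniform(-3.0, 3.0, 1000)
values = f(samples)
best_idx = int(np.argmin(values))
baseline_x, baseline_f = samples[best_idx], values[best_idx]

print("baseline x:", baseline_x, " baseline f:", baseline_f)
```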
My advice is to start off using a deterministic gradient-based technique since they generally require far fewer function evaluations than stochastic/monte-carlo techniques. When you hear hoof steps think horses not zebras. Run the optimization from several different starting points and, unless you are dealing with a particularly nasty problem, you should end up with roughly the same minimum. If not, then you might have zebras and should consider using a global optimization method.
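A minimal sketch of the multi-start check, using hand-written gradient descent on two made-up 1-D functions (the step size, iteration count and starting points are assumptions). For the convex function every start lands on the same minimum ("horses"); for the bumpy one the starts split between two minima ("zebras").

```python
import numpy as np


def gradient_descent(grad, x0, step=0.01, iters=5000):
    """Plain fixed-step gradient descent in one dimension."""
    x = float(x0)
    for _ in range(iters):
        x -= step * grad(x)
    return x


def multistart(grad, starts):
    return np.array([gradient_descent(grad, s) for s in starts])


starts = [-2.0, -1.0, 0.5, 2.0]

# Convex case: f(x) = (x - 1)^2, gradient 2(x - 1).
convex_minima = multistart(lambda x: 2.0 * (x - 1.0), starts)

# Non-convex case: f(x) = x^4 - 3x^2 + x, gradient 4x^3 - 6x + 1.
bumpy_minima = multistart(lambda x: 4.0 * x**3 - 6.0 * x + 1.0, starts)

convex_spread = convex_minima.max() - convex_minima.min()
bumpy_spread = bumpy_minima.max() - bumpy_minima.min()
print("spread of minima, convex:", convex_spread, " non-convex:", bumpy_spread)
```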
Well, I think "Monte Carlo methods" is the general name for methods which
use random numbers in order to solve optimization problems. In this sense,
even evolutionary algorithms are a type of Monte Carlo method if they
use random numbers (and in fact they do).
Other Monte Carlo methods are: Metropolis, Wang-Landau, parallel tempering, etc.
OTOH, Evolutionary methods use 'techniques' borrowed from nature such as
mutation, cross-over, etc.