Is there a way to include a spatial regularization penalty in the cost functions of scikit-learn's clustering algorithms?
More specifically, I am working with neuroscience brain data, where every voxel has a spatially inherited dependency based on its proximity to other voxels. Using 2-class Gaussian mixture learning, I would like to obtain, for each voxel, a probability score of being labeled '1' vs '0' (based on 30-ish samples). However, this task is pointless if I cannot include a regularization based on the neighborhood, as voxels are not completely independent.
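For reference, here is the unregularized version of what I'm doing, as a minimal sketch with placeholder data (array shapes are assumed; no spatial penalty is involved):

import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 30))  # placeholder: n_voxels rows, ~30 samples per voxel

gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
p_one = gmm.predict_proba(X)[:, 1]  # per-voxel probability of label '1', no spatial term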
TL;DR How can the Pearson correlation coefficient between ground truth labels and cosine similarity scores evaluate the performance of a sentence embedding model? A positive/negative linear relationship between the two doesn't necessarily indicate that a model is accurate, just that the two move together, which doesn't seem to me like a good way to evaluate a sentence embedding model.
I'm training a model to be able to tell if two questions are similar or not. I first continue pre-training using MLM (masked language modeling) and finally fine-tune on the STS dataset. For fine-tuning, I'm using this example python file https://github.com/UKPLab/sentence-transformers/blob/master/examples/training/sts/training_stsbenchmark.py. At the end of the file, it says to "load the stored model and evaluate its performance on STS benchmark dataset", and it uses this file to evaluate the performance of the model https://github.com/UKPLab/sentence-transformers/blob/master/sentence_transformers/evaluation/EmbeddingSimilarityEvaluator.py.
The second file has a few metrics for evaluation (cosine similarity being one of them), and it uses the Pearson correlation coefficient and Spearman correlation coefficient for each metric to evaluate the performance of the model. What I'm not understanding is: how does calculating the relationship (correlation coefficient) between the ground truth labels and the cosine similarity contribute to measuring the performance of the model? Even if the two have similar movement patterns, i.e. a high correlation coefficient, that doesn't mean the model is performing well, does it?
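To make the setup concrete, here is roughly the computation I understand the evaluator to perform, as a sketch on stand-in data (not the library's actual code; embedding dimension and scores are made up):

import numpy as np
from scipy.stats import pearsonr, spearmanr

rng = np.random.default_rng(0)
emb1 = rng.normal(size=(100, 384))  # embeddings of sentence 1 of each pair (dim assumed)
emb2 = rng.normal(size=(100, 384))  # embeddings of sentence 2
gold = rng.uniform(0, 5, size=100)  # human similarity labels, e.g. the STS 0-5 scale

# cosine similarity per pair, then correlation with the gold scores
cos = np.sum(emb1 * emb2, axis=1) / (
    np.linalg.norm(emb1, axis=1) * np.linalg.norm(emb2, axis=1))
print("Pearson:", pearsonr(gold, cos)[0])
print("Spearman:", spearmanr(gold, cos)[0])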
There are lots of posts here about the "Perfect Separation Error" in statsmodels when running a logistic regression. But I'm not doing logistic regression. I'm doing GLM with frequency weights and a Gaussian distribution. So basically OLS.
All of my independent variables are categorical with lots of categories. So it's a high-dimensional, binary-coded feature set.
But I'm very frequently getting the PerfectSeparationError from statsmodels.
I'm running many, many models. I think I'm getting this error when my data is too thin for that many variables. However, with freq_weights, in theory I actually have many more observations than the dataframe holds, because each row should be counted freq times.
Any guidance on how to proceed?
reg = sm.GLM(dep, Indies, freq_weights=freq)
res = reg.fit()  # raises PerfectSeparationError
Error: <class 'statsmodels.tools.sm_exceptions.PerfectSeparationError'>
The check is for perfect prediction and is applied independently of the family.
Currently, there is no workaround when using IRLS. Using scipy optimizers, e.g. method="bfgs", avoids the perfect prediction/separation check.
https://github.com/statsmodels/statsmodels/issues/2680
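For illustration, a minimal sketch of the workaround on stand-in data with the question's structure (variable names and sizes assumed):

import numpy as np
import statsmodels.api as sm

# stand-in data: categorical dummies as regressors, plus frequency weights
rng = np.random.default_rng(0)
indies = sm.add_constant((rng.random((200, 5)) > 0.5).astype(float))
dep = indies @ np.ones(6) + rng.normal(size=200)
freq = rng.integers(1, 10, size=200).astype(float)

model = sm.GLM(dep, indies, family=sm.families.Gaussian(), freq_weights=freq)
res = model.fit(method="bfgs")  # scipy optimizer: skips the perfect prediction/separation check
print(res.params)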
Perfect separation is only defined for the binary case, i.e. family binomial in GLM, and could be extended to other discrete models.
However, there can be other problems with inference if the residual variance is zero, i.e. we have a perfect fit.
Here is an issue about perfect prediction in OLS:
https://github.com/statsmodels/statsmodels/issues/1459
I am using sklearn's random forest module to predict values based on 50 different dimensions. When I increase the number of dimensions to 150, the accuracy of the model decreases dramatically. I would expect more features to only make the model more accurate, but additional features tend to make it less accurate.
I suspect that splitting might only be done across one dimension, which means that features that are actually more important get less attention when building trees. Could this be the reason?
Yes, the additional features you added might not have good predictive power, and since random forest takes a random subset of features to build each individual tree, the original 50 features may have been missed. To test this hypothesis, you can plot variable importances using sklearn, as sketched below.
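For example, a minimal sketch with synthetic data (sizes assumed to mimic your setup: 150 features, 50 of them informative):

import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=500, n_features=150, n_informative=50, random_state=0)
rf = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)

# rank features by importance; purely noisy ones should fall to the bottom
order = np.argsort(rf.feature_importances_)[::-1]
for i in order[:10]:
    print(f"feature {i}: importance {rf.feature_importances_[i]:.4f}")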
Your model is overfitting the data.
From Wikipedia:
An overfitted model is a statistical model that contains more parameters than can be justified by the data.
There are plenty of illustrations of overfitting; for instance, this 2D plot (https://qph.fs.quoracdn.net/main-qimg-412c8556aacf7e25b86bba63e9e67ac6-c) represents the different functions that would be learned for a binary classification task. Because the function on the right has too many parameters, it learns wrong patterns in the data that don't generalize properly.
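A quick numeric sketch of the same idea (synthetic data): a polynomial with too many parameters fits the training points but fails between them:

import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 15)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.size)  # noisy training data

x_test = np.linspace(0, 1, 200)
truth = np.sin(2 * np.pi * x_test)
for degree in (3, 14):  # modest model vs. one parameter per training point
    poly = np.polynomial.Polynomial.fit(x, y, degree)
    print(f"degree {degree:2d}: test MSE {np.mean((poly(x_test) - truth) ** 2):.3f}")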
I am working on a project that aims at predicting a highly autocorrelated time series. LSTMs seem ideal for my purpose. However, does anyone know how I can incorporate multiple strong autocorrelations into my prediction network? I.e., there is a very strong yearly correlation and a seasonal correlation; how can I include this information in the LSTM network?
Thank you sincerely.
If there is autocorrelation, the correlation is linear (not nonlinear), because common autocorrelation tests measure linear correlation. An LSTM is able to capture these linear correlations by default; it does not matter how many linear correlations are in the time series, the LSTM will capture them. A problem could be the length of memory: an LSTM has a memory of between 200 and 500 timesteps ( https://machinelearningmastery.com/handle-long-sequences-long-short-term-memory-recurrent-neural-networks/ ), so if the long-term linear correlations in the time series span positions further apart than this, the LSTM will not be able to capture them because it lacks the memory (not physical computer memory, but the memory in the structure of LSTMs).
So simply build the LSTM model in Keras and let it predict, as Upasana Mittal said in the comments; cf. http://www.business-science.io/timeseries-analysis/2018/04/18/keras-lstm-sunspots-time-series-prediction.html
Updated answer, because there is not enough space in the comments. In http://www.business-science.io/timeseries-analysis/2018/04/18/keras-lstm-sunspots-time-series-prediction.html, a lagged time series is used to determine the ACF; this is objective, since otherwise it would be impossible to determine the ACF:
First, we need to review the Autocorrelation Function (ACF), which is
the correlation between the time series of interest and lagged versions
of itself. The acf() function from the stats library returns the ACF
values for each lag as a plot. However, we’d like to get the ACF
values as data so we can investigate the underlying data. To do so,
we’ll create a custom function, tidy_acf(), to return the ACF values
in a tidy tibble.
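For Python users, a rough analogue of the quoted R idea (the tidy_acf name is borrowed from the quote; this is a sketch, not the article's code):

import numpy as np
import pandas as pd
from statsmodels.tsa.stattools import acf

def tidy_acf(series, n_lags=48):
    # return the ACF values as data rather than a plot, mirroring the quoted idea
    values = acf(series, nlags=n_lags, fft=True)
    return pd.DataFrame({"lag": np.arange(n_lags + 1), "acf": values})

# demo on a toy seasonal signal with period 12
print(tidy_acf(np.sin(np.arange(300) * 2 * np.pi / 12), n_lags=24))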
There is no use of a specially lagged time series as input, and using the history of the system, i.e. past system states, to predict future system states is also an objective ansatz and essential to any RNN.
So the way of proceeding in http://www.business-science.io/timeseries-analysis/2018/04/18/keras-lstm-sunspots-time-series-prediction.html is objective.
Another point you might mean is stateful mode. It is vital that you use it, because only in stateful mode are the samples not shuffled, which increases accuracy. Stateless neural nets work on probability distributions, and shuffling a probability distribution does not change it (permutation invariance); stateful neural nets include the sequential ordering of the data, so shuffling changes the result (search the net for 'shuffling multifractal data'):
In normal (or “stateless”) mode, Keras shuffles the samples, and the
dependencies between the time series and the lagged version of itself
are lost. However, when run in “stateful” mode, we can often get high
accuracy results by leveraging the autocorrelations present in the
time series.
LSTMs by definition use a time series and a lagged version of the time series (timesteps,...), so this is also an objective ansatz.
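A minimal stateful sketch in Keras (shapes assumed), matching the quote's point that the samples must not be shuffled:

from tensorflow import keras

batch_size, timesteps, n_features = 1, 12, 1  # assumed dimensions
model = keras.Sequential([
    keras.Input(batch_shape=(batch_size, timesteps, n_features)),
    keras.layers.LSTM(32, stateful=True),  # state carries across batches
    keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")
# model.fit(X, y, batch_size=batch_size, shuffle=False, epochs=10)  # no shuffling
# model.reset_states()  # reset between epochs / independent sequences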
If you want to dig deeper into the matter and go beyond the linear correlations captured by the ACF, you should learn about nonlinear dynamical systems (chaos theory, fractality, multifractality), because they involve nonlinear correlations: the lag plot of a time series of a nonlinear dynamical system in its chaotic state always exhibits the system's specific kind of nonlinearity. The lag plot of the logistic map in its chaotic region shows a parabola; the lag plot of a cubic nonlinear map shows a cubic curve; and so on. RNNs are only capable of modeling/approximating, with high accuracy, systems whose lag plot shows a sufficiently simple structure (circles, spirals, lemniscates, cubic curves, quadratic curves, ...). For example, it is impossible for a neural net to approximate the sequence of prime gaps, because the lag plot of the prime gaps is structured too complexly (although it shows a clear pattern for lag = 1 when the sequential ordering is neglected).
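You can check the logistic-map claim yourself with a few lines (a sketch):

import numpy as np
import matplotlib.pyplot as plt

# iterate the logistic map in its chaotic regime: x[t+1] = 4 * x[t] * (1 - x[t])
r, n = 4.0, 2000
x = np.empty(n)
x[0] = 0.2
for t in range(n - 1):
    x[t + 1] = r * x[t] * (1 - x[t])

plt.scatter(x[:-1], x[1:], s=2)  # lag plot: x[t] vs x[t+1] traces a parabola
plt.xlabel("x[t]")
plt.ylabel("x[t+1]")
plt.title("Logistic map lag plot")
plt.show()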
I would like to fit a regression model to probabilities. I am aware that linear regression is often used for this purpose, but I have several probabilities at or near 0.0 and 1.0 and would like to fit a regression model where the output is constrained to lie between 0.0 and 1.0. I want to be able to specify a regularization norm and strength for the model and ideally do this in python (but an R implementation would be helpful as well). All the logistic regression packages I've found seem to be only suited for classification whereas this is a regression problem (albeit one where I want to use the logit link function). I use scikits-learn for my classification and regression needs so if this regression model can be implemented in scikits-learn, that would be fantastic (it seemed to me that this is not possible), but I'd be happy about any solution in python and/or R.
The question has two issues, penalized estimation and fractional or proportions data as dependent variable. I worked on each separately but never tried the combination.
Penalization
Statsmodels has had L1-regularized Logit and other discrete models, like Poisson, for some time. In recent months there has been a lot of effort to support more penalization, but it is not in statsmodels yet. Elastic net for linear models and generalized linear models (GLM) is in a pull request and will be merged soon. More penalized GLMs, like L2 penalization for GAMs and splines, or SCAD penalization, will follow over the next months based on pull requests that still need work.
Two examples of the current L1 fit_regularized for Logit are here:
Difference in SGD classifier results and statsmodels results for logistic with l1 and https://github.com/statsmodels/statsmodels/blob/master/statsmodels/examples/l1_demo/short_demo.py
Note that the penalization weight alpha can be a vector, with zeros for coefficients that should not be penalized, such as the constant.
http://www.statsmodels.org/dev/generated/statsmodels.discrete.discrete_model.Logit.fit_regularized.html
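A small sketch of this interface on synthetic data, including the vector-valued alpha mentioned above:

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
X = sm.add_constant(rng.normal(size=(200, 5)))
y = (X @ np.r_[0.5, 1.0, -1.0, 0.0, 0.0, 0.0] + rng.normal(size=200) > 0).astype(float)

alpha = np.full(X.shape[1], 1.0)
alpha[0] = 0.0  # zero: leave the constant unpenalized
res = sm.Logit(y, X).fit_regularized(method="l1", alpha=alpha)
print(res.params)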
Fractional models
Binary and binomial models in statsmodels do not impose that the dependent variable is binary and work as long as the dependent variable is in the [0,1] interval.
Fractions or proportions can be estimated with Logit as a quasi-maximum likelihood estimator. The estimates are consistent if the mean function (a logistic, cumulative normal, or similar link function) is correctly specified, but we should use a robust sandwich covariance for proper inference. Robust standard errors can be obtained in statsmodels through the fit keyword cov_type='HC0'.
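A hedged sketch of this on synthetic fractional data (Logit accepts a dependent variable anywhere in [0, 1], and cov_type='HC0' requests the sandwich standard errors):

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
X = sm.add_constant(rng.normal(size=(300, 2)))
y = rng.uniform(0, 1, size=300)  # fractional outcome in [0, 1], not binary

res = sm.Logit(y, X).fit(cov_type="HC0")  # QMLE with robust sandwich covariance
print(res.summary())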
The best documentation is Stata's, http://www.stata.com/manuals14/rfracreg.pdf, and the references therein. I went through those references before Stata had fracreg, and the approach works correctly with at least Logit and Probit, which were my test cases. (I can't find my scripts or test cases right now.)
The bad news for inference is that robust covariance matrices have not been added to fit_regularized, so the correct sandwich covariance is not directly available. The standard covariance matrix and standard errors of the parameter estimates are derived under the assumption that the model, i.e. the likelihood function, is correctly specified, which will not be the case if the data are fractions and not binary.
Besides using Quasi-Maximum Likelihood with binary models, it is also possible to use a likelihood that is defined for fractional data in (0, 1). A popular model is Beta regression, which is also waiting in a pull request for statsmodels and is expected to be merged within the next months.