I am currently fitting a set of 16 models in JAGS. I have a function in jags that calculates the log of the probability of each value of the outcome variable and and another function that takes -2 * the sum of those log probabilities. I.e., I have a custom formula to calculate the deviance for each model. I wanted to check that my definition of the deviance was the same as what JAGS was using.
After running 5000 burnin and 5000 iterations, I obtained the following results:
Basically, for some models, the deviance was close but not exactly the same, and for other models (e.g., 7, 13, 16) it was vastly different.
Why is the deviance calculated using my custom formula different to that obtained using the automatic approach based on DIC?
After a bit of messing around, I think the following is the case.
First, the DIC deviance is obtained using a discrete set of samples to the main parameter estimates. Thus, estimates will differ between runs due to the inherent random aspects of MCMC estimation.
Assuming the chain length is long (several thousand) and burnin is adequate , then differences between the main samples and the DIC samples should be small if the model is converging and reasonably efficient. Thus, big differences are likely to co-occur with model convergence issues.
This all assumes that the original deviance calculation was programmed correctly.
Related
I am currently fitting a neural network to predict a continuous target from 1 to 10. However, the samples are not evenly distributed over the entire data set: samples with target ranging from 1-3 are quite underrepresented (only account for around 5% of the data). However, they are of big interest, since the low range of the target is kind of the critical range.
Is there any way to know how my model predicts these low range samples in particular? I know that when doing multiclass classification I can examine the recall to get a taste of how well the model performs on a certain class. For classification use cases I can also set the class weight parameter in Keras to account for class imbalances, but this is obviously not possible for regression.
Until now, I use typical metrics like MAE, MSE, RMSE and get satisfying results. I would however like to know how the model performs on the "critical" samples.
From my point of view, I would compare the test measurements (classification performance, MSE, RMSE) for the whole test step that corresponds to the whole range of values (1-10). Then, of course, I would do it separately to the specific range that you are considering critical (let's say between 1-3) and compare the divergence of the two populations. You can even perform some statistics about the significance of the difference between the two populations (Wilcoxon tests etc.).
Maybe this link could be useful for your comparisons. Since you can regression you can even compare for MSE and RMSE.
What you need to do is find identifiers for these critical samples. Often times row indices are used for this. Once you have predicted all of your samples, use those stored indices to find the critical samples in your predictions and run whatever automatic metric over those filtered samples. I hope this answers your question.
I recently came across LightFM while learning to train a recommender system. And so far what I know is that it utilizes loss functions which are logistic, BPR, WARP and k-OS WARP. I did not go through the math behind all these functions. Now what I am confused about is that how will I know that which loss function to use where?
From lightfm model documentation page:
logistic: useful when both positive (1) and negative (-1) interactions are present.
BPR: Bayesian Personalised Ranking 1 pairwise loss. Maximises the prediction difference between a positive example and a randomly chosen negative example. Useful when only positive interactions are present and optimising ROC AUC is desired.
WARP: Weighted Approximate-Rank Pairwise [2] loss. Maximises the rank of positive examples by repeatedly sampling negative examples until rank violating one is found. Useful when only positive interactions are present and optimising the top of the recommendation list (precision#k) is desired.
k-OS WARP: k-th order statistic loss [3]. A modification of WARP that uses the k-the positive example for any given user as a basis for pairwise updates.
Everything boils down to how your dataset is structured and what kind of user interacions you're looking at. Obviously one approach would be to include the loss function in your parameter grid when going through hyperparameter tuning (at least that's what I did) and check model accuracy. I find investingating why a given loss function performed better/worse on a dataset as a good learning exercise.
I am working on a project which aims at prediction of highly autocorrelated time series. LSTM seems very ideal for my purpose. However, does anyone know how I can incorporate multiple large autocorrelation into my prediction networks? i.e., there is a very strong yearly correlation, and seasonal correlation; how am I able to include these information into the LSTM network?
Thank you sincerely
if there is autocorrelation the correlation is linear ( not non-linear ) because common autocorrelation tests for linear correlation. Any LSTM is able to capture this linear correlations by default, it does not matter how many linear correlations are in the time series, the LSTM will capture it. A problem could be the length of memory , a LSTM has a memory between 200 and 500 timesteps ( https://machinelearningmastery.com/handle-long-sequences-long-short-term-memory-recurrent-neural-networks/ ), so if the long-term linear correlations are in the time series at positions more extent than this the LSTM will not be able to capture because it lacks the memory ( not physical computer memory, the memory in the structure of LSTMs )
So simply build the LSTM model in keras and let it predict,
as Upasana Mittal said in his comment, cf http://www.business-science.io/timeseries-analysis/2018/04/18/keras-lstm-sunspots-time-series-prediction.html
updated answer because there is not enough space in the comments. In http://www.business-science.io/timeseries-analysis/2018/04/18/keras-lstm-sunspots-time-series-prediction.html is used a lagged time series to determine ACF, this is objective else it would be impossible to determine ACF :
First, we need to review the Autocorrelation Function (ACF), which is
the correlation between the time series of interest in lagged versions
of itself. The acf() function from the stats library returns the ACF
values for each lag as a plot. However, we’d like to get the ACF
values as data so we can investigate the underlying data. To do so,
we’ll create a custom function, tidy_acf(), to return the ACF values
in a tidy tibble.
There is no use of a specially lagged time series as input and using the history of the system or past system states to predict the future system states i also an objective ansatz and essential in any RNN.
So the way of proceeding in http://www.business-science.io/timeseries-analysis/2018/04/18/keras-lstm-sunspots-time-series-prediction.html is objective.
Another point you could mean is the stateful mode however it is vital that you use it because only in stateful mode the samples are not shuffled and accuracy is increased. Stateless neural nets work on probability distributions and shuffling a probability distribution does not change it ( permutation invariance ), stateful neural nets include the sequential ordering of the data so shuffling changes the result, search net for 'shuffling multifractal data' :
In normal (or “stateless”) mode, Keras shuffles the samples, and the
dependencies between the time series and the lagged version of itself
are lost. However, when run in “stateful” mode, we can often get high
accuracy results by leveraging the autocorrelations present in the
time series.
LSTMs by definition use a time series and a lagged version of the time series (timesteps,...), so this is also an objective ansatz.
If you want to dig deeper into the matter, and go beyond linear correlations that are captured by ACF, you should learn about nonlinear dynamical systems ( chaos theory, fractality, multifractality ) because it involves nonlinear systems and nonlinear correlations, i.e. the lag plot of a time series of a nonlinear dynamical systems in its chaotic state always exhibits the species of nonlinearity. The lag plot of the Logistic Map in its chaotic region shows a parabola, the lag plot of a cubic nonlinear map shows a cubic curve,.... RNNs are only capable to model / approximate systems perfectly accurate whichs lag plot shows a sufficiently simple structure ( circles, spirals, lemniscates, cubic curves, quadratic curves , ... ), i.e. for a neural net it is impossible to approximate the sequence of the primegaps because the lag plot of the sequence of primegaps is structured to complex ( however it shows a clear pattern for lag = 1, when neglecting the sequential ordering )
I have a particular classification problem that I was able to improve using Python's abs() function. I am still somewhat new when it comes to machine learning, and I wanted to know if what I am doing is actually "allowed," so to speak, for improving a regression problem. The following line describes my method:
lr = linear_model.LinearRegression()
predicted = abs(cross_val_predict(lr, features, labels_postop_IS, cv=10))
I attempted this solution because linear regression can sometimes produce negative predictions values, even though my particular case, these predictions should never be negative, as they are a physical quantity.
Using the abs() function, my predictions produce a better fit for the data.
Is this allowed?
Why would it not be "allowed". I mean if you want to make certain statistical statements (like a 95% CI e.g.) you need to be careful. However, most ML practitioners do not care too much about underlying statistical assumptions and just want a blackbox model that can be evaluated based on accuracy or some other performance metric. So basically everything is allowed in ML, you just have to be careful not to overfit. Maybe a more sensible solution to your problem would be to use a function that truncates at 0 like f(x) = x if x > 0 else 0. This way larger negative values don't suddenly become large positive ones.
On a side note, you should probably try some other models as well with more parameters like a SVR with a non-linear kernel. The thing is obviously that a LR fits a line, and if this line is not parallel to your x-axis (thinking in the single variable case) it will inevitably lead to negative values at some point on the line. That's one reason for why it is often advised not to use LRs for predictions outside the "fitted" data.
A straight line y=a+bx will predict negative y for some x unless a>0 and b=0. Using logarithmic scale seems natural solution to fix this.
In the case of linear regression, there is no restriction on your outputs.
If your data is non-negative (as in your case the values are physical quantities and cannot be negative), you could model using a generalized linear model (GLM) with a log link function. This is known as Poisson regression and is helpful for modeling discrete non-negative counts such as the problem you described. The Poisson distribution is parameterized by a single value λ, which describes both the expected value and the variance of the distribution.
I cannot say your approach is wrong but a better way is to go towards the above method.
This results in an approach that you are attempting to fit a linear model to the log of your observations.
I have training data that falls into two classes, let's say Yes and No. The data represents three tasks, easy, medium and difficult. A person performs these tasks and is classified into one of the two classes as a result. Each task is classified independently and then the results are combined. I am using 3 independently trained SVM classifiers and then voting on the final result.
I am looking to provide a measure of confidence or probability associated with each classification. LIBSVM can provide a probability estimate along with the classification for each task (easy, medium and difficult, say Pe, Pm and Pd) but I am unsure of how best to combine these into an overall estimate for the final classification of the person (let's call it Pp).
My attempts so far have been along the lines of a simple average:
Pp = (Pe + Pm + Pd) / 3
An Inverse-variance weighted average (since each task is repeated a few times and sample variance (VARe, VARm and VARd) can be calculated - in which case Pe would be a simple average of all the easy samples):
Pp = (Pe/VARe + Pm/VARm + Pd/VARd) / (( 1/VARe ) + ( 1/VARm ) + ( 1/VARd ))
Or a multiplication (under the assumption that these events are independent, which I am unsure of since the underlying tasks are related):
Pp = Pe * Pm * Pd
The multiplication would provide a very low number, so it's unclear how to interpret that as an overall probability when the results of the voting are very clear.
Would any of these three options be the best or is there some other method / detail I'm overlooking?
Based on your comment, I will make the following suggestion. If you need to do this as an SVM (and because, as you say, you get better performance when you do it this way), take the output from your intermediate classifiers and feed them as features to your final classifier. Even better, switch to a multi-layer Neural Net where your inputs represent inputs to the intermediates, the (first) hidden layer represents outputs to the intermediate problem, and subsequent layer(s) represent the final decision you want. This way you get the benefit of an intermediate layer, but its output is optimised to help with the final prediction rather than for accuracy in its own right (which I assume you don't really care about).
The correct generative model for these tests likely looks something like the following:
Generate an intelligence/competence score i
For each test t: generate pass/fail according to p_t(pass | i)
This is simplified, but I think it should illustrate tht you have a latent variable i on which these tests depend (and there's also structure between them, since presumably p_easy(pass|i) > p_medium(pass|i) > p_hard(pass|i); you could potentially model this as a logistic regression with a continuous 'hardness' feature). I suspect what you're asking about is a way to do inference on some thresholding function of i, but you want to do it in a classification way rather than as a probabilistic model. That's fine, but without explicitly encoding the latent variable and the structure between the tests it's going to be hard (and no average of the probabilities will account for the missing structure).
I hope that helps---if I've made assumptions that aren't justified, please feel free to correct.