Autoregressive Generalised Additive Models (AR GAMs) and their interpretations

I'm modelling tree fruiting patterns using mgcv's bam(), and autoregressive AR(1) models have much better outcomes according to itsadug::compareML(). (bam with an AR(1) structure was chosen due to limitations associated with binomial data.) Further, the AR structure is backed up by biological theory. However, the best models under AR techniques often don't include terms that are included in the non-AR models. I understand this to be a common occurrence: the AR term explains much of the variance, leaving less for the remaining terms to explain.
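For reference, a generic sketch of what the AR(1) term adds (Gaussian notation purely for illustration; the fitted models here are binomial):

```latex
% A GAM with AR(1) errors: today's residual is rho times yesterday's plus
% white noise, so a large rho can absorb variance that the smooth terms
% f_j would otherwise be credited with explaining.
\begin{align*}
y_t &= f_1(x_{1t}) + f_2(x_{2t}) + \dots + \varepsilon_t \\
\varepsilon_t &= \rho\,\varepsilon_{t-1} + \eta_t, \qquad \eta_t \sim \mathcal{N}(0, \sigma^2)
\end{align*}
```

Setting the autocorrelation parameter to 0 recovers the ordinary independent-errors model with the same smooth terms.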
I've seen discussions on here warning that AR GAMs should be interpreted with care, and Gavin Simpson's AR GAM post (part 1) ends by hinting that there are some serious diagnostic criteria that should be considered, but part 2 never came out, and I'm struggling to find resources on interpretation. Much more common are simple introductory articles.
I guess the fundamental question is this: the two different types of model make different statements about the effects of a given predictor, so which should be believed?
If the non-AR model finds that month is a useful predictor, but the AR model finds it ultimately superfluous, does month have an effect? Is month relevant due to effects like light patterns, or just because of correlational structure? I guess this is a classic 'all models are wrong, but some are useful' situation.
This problem persists even within a predictor. My temperature:vpds tensor product spline will identify a particular region as increasing the probability in non-AR models, but the AR models suggest another region does so as well (in addition to the first).
I'm presently leaning towards including both sets of models in my paper, noting that the AR models provide better predictions while the non-AR models provide insight into the effects of variables. Even then, I wonder which is more useful: the model that best fits the data without any AR, or the non-AR version of the AR model (i.e., keeping the same predictors but setting the autocorrelation parameter to 0)? I'm leaning towards the former, because I feel strange about models that have almost no predictors.

Related

Evaluation of gensim Doc2Vec model for Recommendations

I have developed a pipeline to extract text from documents, preprocess the text, and train a gensim Doc2vec model on given documents. Given a document in my corpus, I would like to recommend other documents in the corpus.
I want to know how I can evaluate my model without having a pre-defined list of "good" recommendations. Any ideas?
One simple self-check that can be used to catch some big problems with a Doc2Vec model training pipeline – like gross misparameterizations, or insufficient data/epochs – is to re-infer vectors for the training texts (using .infer_vector(); a code sketch follows below), and check that generally:
the bulk-trained vector for the same text is "close to" the re-inferred vector – such as being its nearest neighbor, or one of the top neighbors, in a .most_similar() operation on the re-inferred vector
the overall lists of nearest neighbors (from .most_similar()) for the bulk-trained vector and the re-inferred vector are very similar.
They won't necessarily be identical, for reasons explained in Q11 & Q12 of the Gensim Project FAQ, but if they're wildly-different, then something foundational has gone wrong, like:
insufficient (in quantity or quality/form) training data
misparameterizations, like too few epochs or too-large (overfitting-prone) vectors for the quantity of data
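Here is a minimal, self-contained sketch of that self-check against a toy corpus (gensim 4.x API; the toy texts are placeholders, so substitute your own training data):

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Toy corpus purely for illustration; substitute your own training texts.
texts = [["tree", "fruit", "season"], ["rain", "forest", "canopy"],
         ["fruit", "ripening", "pattern"], ["canopy", "light", "gap"]] * 50
train_corpus = [TaggedDocument(words, [i]) for i, words in enumerate(texts)]

model = Doc2Vec(vector_size=50, min_count=1, epochs=40)
model.build_vocab(train_corpus)
model.train(train_corpus, total_examples=model.corpus_count, epochs=model.epochs)

# The self-check: re-infer each training text, then see whether its own
# bulk-trained vector appears among the top-10 neighbors of the re-inferred
# vector. Low hit-rates suggest something foundational has gone wrong.
hits = 0
for doc in train_corpus:
    inferred = model.infer_vector(doc.words)
    neighbors = model.dv.most_similar([inferred], topn=10)  # gensim 4.x API
    if doc.tags[0] in [tag for tag, _ in neighbors]:
        hits += 1
print(f"{hits}/{len(train_corpus)} texts rank their own vector in the top 10")
```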
Ultimately, though, the variety of data sources, intended uses, and possible dimensions of "recommendation-worthiness" mean that you need custom evaluations, based on your project's needs, usually from the intended audience (or your own ability to simulate/represent it).
In the original paper introducing the "Paragraph Vector" algorithm (what's inside the Doc2Vec class), and a followup evaluating it on Wikipedia & arXiv articles, several of the evaluations used triplets of documents, where 2 of the triplet were conjectured to be "necessarily similar" based on some preexisting system's groupings, and the 3rd randomly-chosen.
The algorithm's performance, and relative performance under different parameter choices, was scored based on how often it placed the 2 presumptively-related documents closer-together than the 3rd randomly-chosen document.
For example, one of the original paper's evaluations used brief search-engine-result snippets as documents, and considered any 2 documents that appeared as sibling top-10 results for the same query as presumptively related. Two of the followup paper's evaluations used the human-curated categories of Wikipedia or arXiv as signalling that articles of the same category should be presumptively related.
It's imperfect, but allowed the creation of large evaluation sets from already-existing systems/data, which generally pointed results in the same direction as human senses-of-relatedness.
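A minimal sketch of that triplet scoring in Python (hedged: `model` is assumed to be a trained gensim Doc2Vec, and `related_pairs`/`all_tags` are hypothetical inputs built from whatever preexisting grouping you can find):

```python
import random

def triplet_accuracy(model, related_pairs, all_tags, seed=42):
    """Fraction of triplets where two presumptively-related docs are closer
    to each other than to a randomly-chosen third doc.

    related_pairs: (tag_a, tag_b) pairs presumed similar by some preexisting
    system (e.g. same search query, same Wikipedia/arXiv category).
    all_tags: all document tags to draw the random third doc from.
    """
    rng = random.Random(seed)
    correct = total = 0
    for a, b in related_pairs:
        c = rng.choice(all_tags)
        while c in (a, b):          # ensure the third doc is a true outsider
            c = rng.choice(all_tags)
        # cosine similarity between bulk-trained doc-vectors (gensim 4.x)
        if model.dv.similarity(a, b) > model.dv.similarity(a, c):
            correct += 1
        total += 1
    return correct / total if total else 0.0
```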
Perhaps you can find a similar preexisting guide for your data. Or, as you perform ad-hoc checking, be sure to capture every judgement you make, so that it becomes, over time, a growing dataset of desirable pairings that are either (a) better than some other result that was co-presented; or (b) just "presumably good enough" that they should usually rank higher than other random 3rd documents. A large amount of imprecision in such desirability-data is tolerable: it evens out as the set of probe-pairings grows, and the power of being able to automate bulk quantitative evaluations (reusing old assessments against new parameters/models) drives far more overall improvement than any small glitches in the evaluations cost.

How to continue training Doc2Vec with a specific domain corpus after training with a generic corpus

I want to train a Doc2Vec model with a generic corpus and then continue training with a domain-specific corpus (I have read that this is a common strategy and I want to test the results).
I have all the documents, so I can build and tag the vocab at the beginning.
As I understand it, I should initially train all the epochs with the generic docs, and then repeat the epochs with the ad hoc docs. But this way, I cannot place all the docs in a corpus iterator and call train() once (as is recommended everywhere).
So, after building the global vocab, I have created two iterators, the first one for the generic docs and the second one for the ad hoc docs, and called train() twice.
Is this the best way, or is there a more appropriate one?
If it is, how should I manage alpha and min_alpha? Is it a good decision not to specify them in the train() calls and let train() manage them?
Best
Alberto
This is probably not a wise strategy, because:
the Python Gensim Doc2Vec class hasn't ever properly supported expanding its known vocabulary after a 1st single build_vocab() call. (Up through at least 3.8.3, such attempts typically cause a Segmentation Fault process crash.) Thus if there are words that are only in your domain-corpus, an initial typical initialization/training on the generic-corpus would leave them out of the model entirely. (You could work around this, with some atypical extra steps, but the other concerns below would remain.)
if there is truly an important contrast between the words/word-senses used in your generic corpus and the different words/word-senses used in your domain corpus, the influence of the words from the generic corpus may not be beneficial, diluting domain-relevant meanings
further, any followup training that just uses a subset of all documents (the domain corpus) will only be updating the vectors for that subset of words/word-senses, and the model's internal weights used for further unseen-document inference, in directions that make sense for the domain-corpus alone. Such later-trained vectors may be nudged arbitrarily far out of comparable alignment with other words not appearing in the domain-corpus, and earlier-trained vectors will find themselves no longer tuned in relation to the model's later-updated internal-weights. (Exactly how far will depend on the learning-rate alpha & epochs choices in the followup training, and how well that followup training optimizes model loss.)
If your domain dataset is sufficient, or can be grown with more domain data, it may not be necessary to mix in other training steps/data. But if you think you must try that, the best-grounded approach would be to shuffle all training data together, and train in one session where all words are known from the beginning, and all training examples are presented in balanced, interleaved fashion. (Or possibly, where some training texts considered extra-important are oversampled, but still mixed in with the variety of all available documents, in all epochs.)
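For concreteness, a sketch of that combined approach (gensim 4.x; `generic_docs` and `domain_docs` are hypothetical TaggedDocument lists, and the 3x oversampling factor is an arbitrary illustration):

```python
import random
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# `generic_docs` and `domain_docs` are hypothetical lists of TaggedDocument.
# Oversampling the domain docs (here 3x, an arbitrary choice) gives them
# extra weight while still interleaving them with the full variety of data.
combined = list(generic_docs) + list(domain_docs) * 3
random.shuffle(combined)

model = Doc2Vec(vector_size=100, epochs=20)  # parameters illustrative only
model.build_vocab(combined)                  # one vocabulary over *all* words
model.train(combined, total_examples=len(combined), epochs=model.epochs)
```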
If you see an authoritative source suggesting such a "train with one dataset, then another disjoint dataset" approach with the Doc2Vec algorithms, you should press them for more details on what they did to make that work: exact code steps, and the evaluations which showed an improvement. (It's not impossible that there's some way to manage all the issues! But I've seen many vague impressions that this separate-pretraining is straightforward or beneficial, and zero actual working writeups with code and evaluation metrics showing that it's working.)
Update with respect to the additional clarifications you provided at https://stackoverflow.com/a/64865886/130288:
Even with that context, my recommendation remains: don't do this segmenting of training into two batches. It's almost certain to degrade the model compared to a combined training.
I would be interested to see links to the "references in the literature" you allude to. They may be confused or talking about algorithms other than the Doc2Vec ("Paragraph Vectors") algorithm.
If there is any reason to give your domain docs more weight, a better-grounded way would be to oversample them in the combined corpus.
But by all means, test all these variants & publish the relative results. If you're exploring shaky hypotheses, I would ignore any advice from StackOverflow-like sources & just run all the variants that your reading of the literature suggests, to see which, if any, actually help.
You're right to recognize that the choice of alpha parameters is a murky area that could majorly influence what impact such add-on training has. There's no right answer, so you'll have to search for and reason out what might make sense. The inherent issues I've mentioned with such subset-followup-training could mean that even if you find benefits in some combos, they may be more a product of a lucky combination of data & arbitrary parameters than a generalizable practice.
And: your specific question "if it is better to set such values or not provide them at all" reduces to: "do you want to use the default values, or values set when the model was created, or not?"
Which values might be workable, if at all, for this unproven technique is something that'd need to be experimentally discovered. That is, if you wanted to have comparable (or publishable) results here, I think you'd have to justify from your own novel work some specific strategy for choosing good alpha/epochs and other parameters, rather than adopt any practice merely recommended in a StackOverflow answer.

Can I interpret doc2vec components?

I am solving a binary text classification problem with corporate filings. Using Doc2Vec embeddings of length 100 with LightGBM is producing great results. However, for this project it would be very valuable to approximate a thematic meaning for at least one of the components. Ideally, this would be a feature ranked with high importance by LightGBM explained anecdotally with a few examples.
Has anyone attempted this, or should interpretation be off the table for a high-dimensional model with this level of complexity?
The individual dimensions of a Doc2Vec representation should not be considered independent, interpretable features. They're only useful in concert with each other, and the exact directions aligned with individual coordinate-axes may not be strongly meaningful in any human-describable sense.
However, neighborhoods of the space may loosely fit describable themes, and certain directions (not specifically parallel with coordinate-axes) may loosely fit semantic themes.
But to characterize those, you might try to find the centroid points of groups-of-related-documents, or discovered clusters, and compare the relative distances/directions between those centroids.
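A rough sketch of that centroid approach (assuming a trained gensim 4.x Doc2Vec named `model`; k-means with k=8 is an arbitrary clustering choice):

```python
import numpy as np
from sklearn.cluster import KMeans

# Cluster the bulk-trained doc-vectors, then inspect each cluster's centroid
# via the documents nearest to it.
vectors = model.dv.vectors                    # (n_docs, vector_size) array
km = KMeans(n_clusters=8, n_init=10, random_state=0).fit(vectors)

for i, centroid in enumerate(km.cluster_centers_):
    # documents nearest each centroid hint at that cluster's loose "theme"
    centroid = centroid.astype(np.float32)    # match gensim's vector dtype
    print(f"cluster {i}:", model.dv.most_similar([centroid], topn=5))
```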

Latent Class Analysis Model Selection

When conducting Latent Class Analysis, the information criteria (e.g., AIC, BIC, aBIC) sometimes don't select the same model. Such is the case in a study of substance use patterns that I am conducting among 774 men who have sex with men. Figure 1 shows the fit criteria plotted for each number of latent classes. BIC and cAIC select the three-class model (see Figure 2). However, the aBIC selects a five-class model (see Figure 2).
How do you select a model solution under these circumstances? Is there a way to select variables or collapse variables down in order to optimize results?
It is never easy to select the number of classes for LCA, but there are some rules of thumb that I follow:
Based on Nylund, Asparouhov & Muthén (2007), you want to follow BIC and the bootstrap likelihood ratio test (BLRT). Even then, they seldom agree – BLRT will tell you to pick a model with more classes, while BIC will be more conservative and suggest fewer. But this is as close as you can get using statistical tests.
Rely on the available theory underlying your model. Look for potential discrepancies with your theoretical knowledge and try to deduce from the theory how many classes are to be expected. There is no golden rule, LCA is a good method, but without theory it is quite meaningless. If you have little theory, what you can do to double check your findings is to relate your latent variable to a distal outcome (covariate) about which you might have some theory and see if it works out. For example, you suspect that one of your latent classes will be dominated by one gender: associate your latent variable with gender and see.
Parsimony rule: simple models are preferred to complex ones (Collins & Lanza, 2010). If a simpler model does all the work, why choose a complex one?
In your case, I would start with a 3-class model, since it is suggested by BIC and parsimony. After finishing the analysis and interpreting the findings, I would re-run the model with 4/5 classes and see whether I reach substantially different findings – anything important or contradictory relative to the 3-class model would be worth reporting. If the extra classes just add complexity, but do not contradict or improve what I already know, I'd stick with the 3-class model.
Looking at the results, I think the 5-class model does not provide anything beyond the 3 classes. In the 3-class model, you have one class of extensive drug users (16%), moderate drug users dominated by cannabis, poppers, hallucinogens and cocaine (40%), and finally a class of light users dominated by alcohol and cannabis (44%). The 5-class model splits the first two groups into smaller sub-groups, but you have to decide whether these splits are important for your research – whether they make sense for your research question.
I would also recommend checking bivariate residuals. It is possible that the model misfit that is suggesting more classes is generated by a residual association between your indicators. If you can justify it theoretically (for example by finding some similarity between the indicators beyond the latent class), you can add the residual association and obtain a similarly good fit with the 3 class model.
One last point: avoid using AIC for LCA altogether – it is a very poorly performing index! Use cAIC, BIC and aBIC instead. AIC does not correct for sample size, which can be quite problematic with larger samples.
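For intuition behind that advice, compare the standard penalty terms (L is the maximized likelihood, k the number of free parameters, n the sample size): only the BIC-family criteria scale their penalty with n:

```latex
% Standard information-criterion definitions: AIC's penalty 2k ignores n,
% while BIC, aBIC and cAIC all scale their penalties with sample size.
\begin{align*}
\mathrm{AIC}  &= -2\log L + 2k \\
\mathrm{BIC}  &= -2\log L + k\log n \\
\mathrm{aBIC} &= -2\log L + k\log\!\left(\frac{n+2}{24}\right) \\
\mathrm{cAIC} &= -2\log L + k(\log n + 1)
\end{align*}
```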
Sources:
Collins, L. M., & Lanza, S. T. (2010). Latent class and latent transition analysis: With applications in the social, behavioral, and health sciences. New York: Wiley.
Nylund, K. L., Asparouhov, T., & Muthén, B. O. (2007). Deciding on the number of classes in latent class analysis and growth mixture modeling: A Monte Carlo simulation study. Structural Equation Modeling, 14(4), 535–569.

Bayesian statistics, machine learning: prior vs. hyperprior

Say I have a linear regression model:
p(t | x; w) = N(t; m, D)
Being Bayesian, I can put a Gaussian prior on parameter w.
However, I've realized that for some models we can put a Gaussian-Wishart hyperprior on the Gaussian, to be 'more' Bayesian. Is this correct? Are both of these models valid Bayesian models?
It seems to me that we can always add a hyperprior, a hyper-hyperprior, and so on, because the result will still be a valid probabilistic model.
I am wondering what the difference is between putting a prior on a parameter and putting a hyperprior on the prior. Are they both Bayesian?
Using a hyperprior is still "valid Bayesian", in the sense that this sort of hierarchical modeling comes naturally to Bayesian models, and just about any book/course on Bayesian modeling goes through the use of hyperpriors.
It's completely fine to use a Normal-Wishart as the prior (or hyperprior) of a Gaussian distribution. I guess it's, in some sense, even "more Bayesian" to do so, if it models the phenomenon at hand more accurately.
I'm not sure what you mean by "are they both Bayesian" when it comes to the difference between using a prior and a hyperprior. Bayesian hierarchical models with hyperpriors are still Bayesian models.
Using hyperpriors only makes sense in a hierarchical Bayesian model. In that case you would be looking at multiple groups and estimating a group-specific coefficient w_group based on group-specific priors, whose parameters are in turn drawn from a global hyperprior.
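As a sketch of what such a hierarchy looks like (notation illustrative, using the Normal-Wishart hyperprior mentioned in the question):

```latex
% Two-level hierarchy: each group g gets its own coefficient vector w_g,
% drawn from a shared Gaussian whose mean and precision get the hyperprior.
\begin{align*}
t_{ig} \mid x_{ig}, w_g &\sim \mathcal{N}(w_g^{\top} x_{ig},\ \sigma^2) \\
w_g \mid \mu, \Lambda &\sim \mathcal{N}(\mu,\ \Lambda^{-1})
  && \text{(prior on each group's coefficients)} \\
(\mu, \Lambda) &\sim \mathrm{NormalWishart}(\mu_0, \kappa_0, W_0, \nu_0)
  && \text{(hyperprior)}
\end{align*}
```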
If your prior and hyperprior reside on the same hierarchical level, which seems to be the case you are thinking about, then the effect on the results is the same as using a single prior with a wider standard deviation. Since the stacking still carries additional computational cost, it should be avoided.
There is a lot of statistical literature on how to pick non-informative priors; often the theoretically best solutions are improper distributions (their total integral is infinite), and there is a large risk of getting improper posteriors without well-defined means or even medians. So for practical purposes, picking wide normal distributions usually works best.
