Finding demographics for a model in R lavaan

I am currently in a pickle. I have taken a subsample from a sample (General Social Survey; GSS). The subsample has certain features that the rest of the data doesn't have.
Other features: the IV is a manifest variable; the three mediators are composite variables; the DV is a five-item latent variable. My subsample has 706 participants, while the entire GSS has over 2,400.
In my model I have several variables that work in concert with each other. Do the demographics have to be assessed in combination with all of the variables in play? In other words, do I need to find the demographics for the entire model, and if so, how would I find demographic information for this model?
Or is there a way to create a single variable representing the entire model, and then examine features such as socioeconomic status, age, and gender?

Related

How to implement a multi-state LSTM RNN in Keras

I have 1000 distinct users, and the dataset consists of these users' activities over the past year, totalling over 300K records. The inputs to the LSTM RNN are the feature vectors corresponding to these users. The user is also included, because behavior varies from person to person. The network should learn each user's behavior and be able to predict the next behavior based on that user's past information.
How do I maintain separate hidden states for each user within an LSTM RNN?
The following blog post is similar to my problem:
https://towardsdatascience.com/multi-state-lstms-for-categorical-features-66cc974df1dc
Update
My dataset looks like this (dataset screenshot omitted).
I transformed my dataset into a 3D NumPy array and reshaped it as (num_records, timesteps, n_features).
The questions are:
1) Is it necessary to encode the "user" attribute?
2) What is the correct batch size for this problem? Is it batch = 1000 (the number of distinct users)?
3) Do I need to include each user's data in every batch fed to the model?
Alternatively, please suggest the correct implementation for this problem.
This is handled automatically; you don't need to do anything special.
The LSTM layer will keep a state matrix with one entry per sample in your batch of users (otherwise it wouldn't be useful).
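For what it's worth, here is a minimal sketch of one common layout, assuming the (num_records, timesteps, n_features) shape described above. All sizes and layer widths here are illustrative guesses, not taken from the question; the user ID goes through a learned embedding instead of one-hot encoding, and the LSTM summarizes each user's sequence:
```python
# Minimal sketch (illustrative sizes): sequence input plus an embedded user ID.
from tensorflow.keras.layers import Input, LSTM, Dense, Embedding, Flatten, Concatenate
from tensorflow.keras.models import Model

n_users, timesteps, n_features = 1000, 12, 8   # assumed dimensions

seq_in = Input(shape=(timesteps, n_features))           # one activity sequence per record
user_in = Input(shape=(1,))                             # integer-encoded user ID
user_vec = Flatten()(Embedding(n_users, 16)(user_in))   # learned user representation

h = LSTM(64)(seq_in)                 # hidden state is tracked per sample in the batch
h = Concatenate()([h, user_vec])     # combine sequence summary with user identity
out = Dense(n_features, activation="sigmoid")(h)        # next-behavior prediction

model = Model([seq_in, user_in], out)
model.compile(optimizer="adam", loss="binary_crossentropy")
model.summary()
```
With this layout, the batch size becomes a free tuning choice rather than being tied to the 1000 users, because user identity travels with each sample through the embedding rather than through the LSTM's hidden state.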

Latent Class Analysis Model Selection

When conducting latent class analysis, the information criteria (e.g., AIC, BIC, aBIC) sometimes do not select the same model. Such is the case in a study of substance use patterns that I am conducting among 774 men who have sex with men. Figure 1 shows the fit criteria plotted for each number of latent classes. BIC and cAIC select the three-class model (see Figure 2); however, aBIC selects a five-class model (see Figure 2).
How do you select a model solution under these circumstances? Is there a way to select variables or collapse variables down in order to optimize results?
It is never easy to select the number of classes for LCA, but there are some rules of thumb that I follow:
Based on Nylund, Asparouhov & Muthén (2007), you want to follow BIC and the bootstrap likelihood ratio test (BLRT). Even then, they seldom agree: BLRT will tell you to pick a model with more classes, while BIC will be more conservative and suggest fewer. But this is as close as you can get using statistical tests.
Rely on the available theory underlying your model. Look for potential discrepancies with your theoretical knowledge and try to deduce from the theory how many classes are to be expected. There is no golden rule; LCA is a good method, but without theory it is quite meaningless. If you have little theory, you can double-check your findings by relating your latent variable to a distal outcome (covariate) about which you do have some theory and seeing if it works out. For example, if you suspect that one of your latent classes will be dominated by one gender, associate your latent variable with gender and see.
Parsimony rule: simple models are preferred to complex ones (Collins & Lanza, 2010). If a simpler model does all the work, why choose a complex one?
In your case, I would start with a three-class model, since it is suggested by BIC and parsimony. After finishing the analysis and interpreting the findings, I would re-run the model with four or five classes and see whether I reached substantially different findings: anything important or contradictory to the three-class results would be worth reporting. If the extra classes just add complexity without contradicting or improving what I already know, I'd stick with the three-class model.
Looking at the results, I think the five-class model does not provide anything beyond the three classes. In the three-class model, you have one class of extensive drug users (16%), moderate drug users dominated by cannabis, poppers, hallucinogens, and cocaine (40%), and finally a class of light users dominated by alcohol and cannabis (44%). The five-class model splits the first two groups into smaller sub-groups, but you have to decide whether these splits are important for your research, i.e. whether they make sense for your research question.
I would also recommend checking bivariate residuals. It is possible that the model misfit that is suggesting more classes is generated by a residual association between your indicators. If you can justify it theoretically (for example by finding some similarity between the indicators beyond the latent class), you can add the residual association and obtain a similarly good fit with the 3 class model.
One last point: avoid using AIC for LCA altogether, as it is a very poorly performing index. Use cAIC, BIC, and aBIC instead; AIC does not correct for sample size, which can be quite problematic with larger samples.
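To see why the criteria can disagree, it may help to compute them side by side from each model's log-likelihood. A small sketch using the standard definitions (AIC = -2LL + 2k; BIC = -2LL + k ln n; aBIC replaces n with (n + 2)/24 in the penalty; cAIC = -2LL + k(ln n + 1)); the log-likelihoods and parameter counts below are hypothetical, and only n = 774 comes from the question:
```python
# Sketch: compare information criteria across candidate class solutions.
import math

def criteria(log_lik, k, n):
    """Standard IC definitions for a model with given log-likelihood, k parameters, n cases."""
    return {
        "AIC":  -2 * log_lik + 2 * k,                       # no sample-size penalty
        "BIC":  -2 * log_lik + k * math.log(n),             # penalty grows with n
        "aBIC": -2 * log_lik + k * math.log((n + 2) / 24),  # sample-size-adjusted BIC
        "cAIC": -2 * log_lik + k * (math.log(n) + 1),       # consistent AIC
    }

n = 774  # sample size from the question
# Hypothetical (log-likelihood, parameter count) pairs for 2..5 class models
fits = {2: (-5210.4, 21), 3: (-5150.2, 32), 4: (-5138.9, 43), 5: (-5129.7, 54)}
for classes, (ll, k) in fits.items():
    print(classes, {name: round(val, 1) for name, val in criteria(ll, k, n).items()})
```
Because aBIC's penalty uses the much smaller (n + 2)/24 in place of n, it charges less per extra parameter and therefore tends to tolerate more classes than BIC or cAIC at moderate sample sizes.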
Sources:
Collins, L. M., & Lanza, S. T. (2010). Latent class and latent transition analysis: With applications in the social, behavioral, and health sciences. New York: Wiley.
Nylund, K. L., Asparouhov, T., & Muthén, B. O. (2007). Deciding on the number of classes in latent class analysis and growth mixture modeling: A Monte Carlo simulation study. Structural Equation Modeling, 14(4), 535–569.

Best practices for wit.ai training

I am developing an app that uses wit.ai as a service. Right now I am having problems training it. In my app I have 3 intents:
to call
to text
to send picture
Here are my training examples:
Call this number 072839485 and text this number 0623744758 and send picture to this number 0834952849.
Call this number 072839485, 0834952849 and 0623744758
In my first training example I labeled that sentence with all 3 intents, with 072839485 as phone_number with role to_call_phone_number, 0623744758 as phone_number with role to_text_phone_number, and 0834952849 as phone_number with role to_send_pic_phone_number.
In my second training example I labeled all 3 numbers as phone_number with the to_call_phone_number role.
After much training, wit.ai still outputs the wrong labels. Given a sentence like this:
Call this number 072637464, 07263485 and 0273847584
wit.ai says 072637464 is to_call_phone_number, but 07263485 and 0273847584 are to_send_pic_phone_number.
Am I not training it correctly? Can someone give me some suggestions about best practices for training wit.ai?
There aren't many best practices out there for wit.ai training at the moment, but with this particular example in mind I would recommend the following:
Pay attention to the type of the entity in addition to just the value. If you choose free-text or keyword, you'll get different responses from the wit engine. For example, in your training, if the number is a keyword, wit will associate that particular number with the intent/role rather than with its position in the sentence. This is probably the reason your training isn't working correctly.
One good practice is to train your bot first with specific examples that give it more information (such as the user providing the keyword 'photograph' along with a number), and then with general examples that apply to more cases (such as your second example).
Think about the user's perspective and what would seem natural to them, and work with those training examples first. Generate a list of possible training examples, label them from general to specific, and then train intents/roles/entities based on those examples rather than thinking about intents and roles first.
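When debugging mislabelled roles like this, it can also help to query the trained app directly and inspect how each number was resolved. Here is a rough sketch against wit.ai's HTTP /message endpoint; the token is a placeholder, and the exact response fields ("entities", "role", "value") may differ across API versions, so treat the parsing below as an assumption to verify against the current docs:
```python
# Sketch: ask the trained wit.ai app how it resolves each number in an utterance.
import requests

WIT_TOKEN = "YOUR_SERVER_ACCESS_TOKEN"  # placeholder, not a real token
utterance = "Call this number 072637464, 07263485 and 0273847584"

resp = requests.get(
    "https://api.wit.ai/message",
    params={"q": utterance},
    headers={"Authorization": f"Bearer {WIT_TOKEN}"},
)
resp.raise_for_status()

# Print every resolved entity with its role so mislabelled numbers stand out.
for name, matches in resp.json().get("entities", {}).items():
    for match in matches:
        print(name, match.get("role"), match.get("value"))
```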

Dataset for density-based clustering based on probability, and possible cluster validation methods

Can anyone help me find a dataset that has scores as attribute values and class labels (ground truth for cluster validation)? I want to find the probability of each data item and in turn use it for clustering.
The preferred attribute values are scores like user survey scores (1 = bad, 2 = satisfactory, 3 = good, 4 = very good) for each attribute. I prefer score values (say 1, 2, 3, 4) as attribute values because it is easy to calculate the probability of each attribute value from them.
I found some datasets in the UCI Repository, but not all of their attribute values were scores.
Most (if not all) clustering algorithms are density based.
There is plenty of survey literature on clustering algorithms that you should check; there are literally hundreds of density-based algorithms, including DBSCAN, OPTICS, DENCLUE, ...
However, I have the impression you are using the term "density based" differently from the literature. You seem to mean probability, not density.
Do not expect clustering to give you class labels. Classes are not clusters: classes can be inseparable, or a single class may consist of multiple clusters. The famous iris data set, for example, intuitively consists of only 2 clusters (but 3 classes).
For evaluation and all that, please check the existing questions and answers.
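As a quick illustration of the classes-versus-clusters point on iris (the eps and min_samples values here are arbitrary illustrative choices, not canonical ones):
```python
# Sketch: DBSCAN on iris typically finds fewer dense clusters than there are classes.
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
labels = DBSCAN(eps=0.9, min_samples=5).fit_predict(StandardScaler().fit_transform(X))

print("classes: ", len(np.unique(y)))         # 3 labelled species
print("clusters:", len(set(labels) - {-1}))   # typically 2 dense clusters (-1 = noise)
```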

Is it possible to supplement Naive Bayes text classification algorithm with author information?

I am working on a text classification project where I am trying to assign topic classifications to speeches from the Congressional Record.
Using topic codes from the Congressional Bills Project (http://congressionalbills.org/), I've tagged speeches that mention a specific bill as belonging to the topic of the bill. I'm using this as my "training set" for the model.
I have a "vanilla" Naive Bayes classifier working well enough, but I keep feeling I could get better accuracy out of the algorithm by incorporating information about the member of Congress who is making the speech (e.g., certain members are much more likely to talk about foreign policy than others).
One possibility would be to replace the prior in the NB classifier (usually defined as the proportion of documents in the training set that have the given classification) with a prior estimated from the speaker's previous speeches.
Is this worth pursuing? Are there existing approaches that have followed this same kind of logic? I'm a little bit familiar with the "author-topic models" that come out of Latent Dirichlet Allocation models, but I like the simplicity of the NB model.
There is no need to modify anything; simply add this information to your Naive Bayes and it will work just fine.
And, as was previously mentioned in the comments, do not change any priors: the prior probability is P(class), which has nothing to do with the actual features.
Just add another feature to your computations corresponding to the authorship, e.g. "author:AUTHOR", and train Naive Bayes as usual, i.e. estimate P(author:AUTHOR|class) for each class and AUTHOR, and use it later in your classification process. If your current representation is a bag of words, it is sufficient to add an "artificial" word of the form "author:AUTHOR" to it.
One other option would be to train an independent classifier for each AUTHOR, which would capture person-specific speech patterns. For example, one person may use the word "environment" only when talking about nature, while another simply likes to add the word to every speech ("Oh, in our local environment of ..."). Independent NBs would capture this kind of phenomenon.
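As a concrete sketch of the bag-of-words trick above (the speeches, authors, and topics here are toy stand-ins, not real data):
```python
# Sketch: append an artificial "author:NAME" token before vectorizing.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

speeches = ["we must fund the military", "healthcare costs keep rising"]
authors  = ["smith", "jones"]
topics   = ["defense", "health"]

docs = [f"{text} author:{author}" for text, author in zip(speeches, authors)]

vec = CountVectorizer(token_pattern=r"\S+")  # keep "author:smith" as a single token
X = vec.fit_transform(docs)
clf = MultinomialNB().fit(X, topics)

print(clf.predict(vec.transform(["the military budget author:smith"])))
```
The classifier then estimates P(author:NAME|class) alongside the word likelihoods, exactly as described above.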
