Does anyone know how to do a nested random-effects model in Python? When I use statsmodels' MixedLM, I get a singular matrix error.
The issue typically arises when you have a column whose values are 0 throughout.
So, check the variables you're passing into the mixedlm model and find out which ones are all zeros.
Hope it helps!
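For example, here is a quick way to spot all-zero columns before fitting (a minimal pandas sketch; `df` and the column names here are made up):

import pandas as pd

# Toy frame with one all-zero column, which would make the design matrix singular
df = pd.DataFrame({'y':  [1.0, 2.0, 3.0],
                   'x1': [0.5, 0.1, 0.9],
                   'x2': [0.0, 0.0, 0.0]})

zero_cols = df.columns[(df == 0).all()]
print(list(zero_cols))            # ['x2']

df = df.drop(columns=zero_cols)   # drop these before passing the data to mixedlm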
After training my dataset, which has a number of categorical columns, using fastai's tabular model, I wish to read out the entity embeddings and use them to map back to my original data values.
I can see the embedding weights. The number of inputs doesn't seem to match anything obvious, but maybe it is based on the unique categorical values in train_ds.
To get that map, I would like to get the self.categories dictionary from the Categorify transform class. Is there any way to get that from the data variable obtained by calling TabularList.from_df?
Or maybe someone can tell me a better way to get this map. I know the input df into TabularList.from_df() is not it, because the number of rows is wrong, most likely because df is split into train and valid subsets. But there is no easy way to obtain the train part of the TabularList to check just that part.
It's strange that I can't find any code example that shows this. Doesn't anyone else care to map the entity embedding values back to their original categorical values?
I found it.
It is in data.train_ds.inner_df.
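For anyone who wants the full mapping, here is a rough sketch of how I'd pull it together, assuming fastai v1 with `data` from TabularList.from_df and a trained `learn = tabular_learner(...)`. The attribute names (`learn.model.embeds`, `data.train_ds.x.cat_names`) and the convention that embedding row 0 is the '#na#' bucket are assumptions to verify against your version:

import pandas as pd

train_df = data.train_ds.inner_df  # the training split fastai actually used

# One embedding module per categorical column, in cat_names order (assumed)
for col, emb in zip(data.train_ds.x.cat_names, learn.model.embeds):
    cats = train_df[col].cat.categories          # Categorify makes these ordered categoricals
    weights = emb.weight.data.cpu().numpy()
    # Row 0 is assumed to be the '#na#'/unknown bucket, so offset by one
    mapping = pd.DataFrame(weights[1:len(cats) + 1], index=cats)
    print(col, mapping.shape)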
I am trying to code a predictive model and I found this code somewhere and would like to know what it means, please. Here it is: "X_train.reset_index(inplace = True)".
I think it would help if you provided more context for your question. But in the meantime: the line of code you have shown resets the index of the training dataset of whatever model you're working with (usually X denotes the data and y denotes the labels).
The dataset is a pandas DataFrame object, and the reset_index method replaces the DataFrame's current index with a default integer index, so each row is numbered 0, 1, 2, ... instead of keeping its old label; with inplace=True the DataFrame is modified in place. You can find more information in the documentation for this method:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.reset_index.html
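A quick toy example of what that call does:

import pandas as pd

X_train = pd.DataFrame({'feature': [10, 20, 30]}, index=[7, 3, 5])
X_train.reset_index(inplace=True)   # replaces the 7/3/5 index with 0, 1, 2
print(X_train)
#    index  feature
# 0      7       10
# 1      3       20
# 2      5       30
# The old index is kept as a new 'index' column; pass drop=True to discard it.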
Is there a way to get the tf and idf for the stop_words_ attribute of sklearn's TfidfVectorizer (not the stop_words parameter)?
They are already calculated during fitting, so the model should have these values, but has anyone ever used them? If not, I guess I have to hack the internal code and get them myself, correct?
[UPDATE]
For anyone who might end up on this question, as an update: what I ended up doing was hacking sklearn/feature_extraction/text.py and exporting the words and their values as tuples from the CountVectorizer class, rather than just the words.
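If anyone prefers not to patch the library, here is a sketch of an alternative I believe works: since stop_words_ holds the terms pruned by max_df/min_df/max_features, you can fit a second vectorizer restricted to exactly those terms to recover their tf and idf (the corpus here is made up):

from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the cat sat", "the dog sat", "the cat ran"]

vec = TfidfVectorizer(max_df=0.9)   # 'the' appears in every doc, so it gets pruned
vec.fit(docs)
pruned = sorted(vec.stop_words_)    # the terms dropped during fitting
print(pruned)                       # ['the']

# Re-fit on the pruned terms only to get their tf and idf
aux = TfidfVectorizer(vocabulary=pruned)
tf = aux.fit_transform(docs)        # tf-idf matrix restricted to the pruned terms
print(dict(zip(pruned, aux.idf_)))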
In my Theano program, I want to split a tensor matrix into two parts, with each of them making a different contribution to the error function. Can anyone tell me whether automatic differentiation supports this?
For example, for a tensor matrix variable M, I want to split it into M1 = M[:300,] and M2 = M[300:,], and then define the cost function as 0.5 * M1 * w + 0.8 * M2 * w. Is it still possible to get the gradient with T.grad(cost, w)?
More specifically, I want to construct an autoencoder in which different features have different weights in their contribution to the total cost.
Thanks to anyone who answers my question.
Theano supports this out of the box; there is nothing special you need to do. If Theano didn't support an operation in the graph, it would raise an error, but you won't get one here as long as there is no problem in the way you call it. Your current pseudo-code should work.
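Here is a minimal sketch of what the question describes, with the cost summed to a scalar since T.grad needs one; the shapes and the 300-row split point are just the ones from the question:

import numpy as np
import theano
import theano.tensor as T

M = T.matrix('M')
w = theano.shared(np.random.randn(5).astype(theano.config.floatX), name='w')

M1, M2 = M[:300], M[300:]   # slicing is differentiable in Theano

# Sum to a scalar so T.grad applies; weights 0.5/0.8 as in the question
cost = 0.5 * T.sum(T.dot(M1, w)) + 0.8 * T.sum(T.dot(M2, w))
grad_w = T.grad(cost, w)    # autodiff straight through the slices

f = theano.function([M], [cost, grad_w])
data = np.random.randn(400, 5).astype(theano.config.floatX)
print(f(data))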
I am now working on a document recommendation program and I am kind of stuck here.
For each document, I have a score assigned according to the user's actions. Then, when a new document comes in, I need to predict how much the user will like it and re-rank all the documents according to their scores. My solution is to use a threshold to divide those scores into "recommend" and "not recommend". Then Naive Bayes or another classification model can either give me a label or return the probability of that label (I am using the NLTK package to do the text analytics).
Am I on the right track? My question is: when I get that probability, how can I convert it into the score that I use for ranking? Or should I use logistic regression in scikit-learn instead?
Thanks!
It sounds like you are trying to force a ranking problem into a classification problem. What you really want to do is learn how to rank the documents given a "query".
I would suggest trying out something like the SVM-Rank algorithm. It takes as input a set of "recommended" and "not recommended" vectors and then learns how to rank them so that the recommended ones come first. There is also a simple Python tool in dlib you can use to do it; see here for an example: http://dlib.net/svm_rank.py.html, and the condensed sketch below.
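A condensed version of that dlib example (written from memory, so double-check it against the linked page):

import dlib

# Two-dimensional toy vectors: relevant ones have a big first coordinate
data = dlib.ranking_pair()
data.relevant.append(dlib.vector([1, 0]))
data.nonrelevant.append(dlib.vector([0, 1]))

trainer = dlib.svm_rank_trainer()
trainer.c = 10                      # usual SVM regularization trade-off
rank = trainer.train(data)

# A good ranker scores relevant vectors higher than nonrelevant ones
print(rank(data.relevant[0]), rank(data.nonrelevant[0]))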