fastai tabular model trained but can not find categorical mapping - pytorch

After training a fastai tabular model on a dataset with a number of categorical columns, I want to read out the entity embeddings and map them back to my original data values.
I can see the embedding weights. The number of inputs doesn't seem to match anything obvious, but it is presumably based on the unique categorical values in the train_ds.
To get that map, I would like to get the self.categories dictionary from the Categorify transform class. Is there any way to get that from the data variable obtained by calling TabularList.from_df?
Or maybe someone can tell me a better way to get this map. I know the df passed into TabularList.from_df() is not it, because the number of rows is wrong, most likely because df is split into train and valid subsets. But there is no easy way to obtain just the train part of the TabularList to check.
It's strange that I can't find any code example that shows this. Doesn't anyone else care about mapping the entity embedding values back to their original categorical values?

I found it.
It is in data.train_ds.inner_df.
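To map embedding rows back to values, a rough sketch of how I'd wire it up (fastai v1; attribute names such as inner_df, cat_names and learn.model.embeds are assumptions that may differ between versions, and data/learn are the DataBunch and Learner from the setup above):

    df_train = data.train_ds.inner_df              # processed training subset
    cat_names = data.train_ds.cat_names            # categorical column names

    embedding_maps = {}
    for i, col in enumerate(cat_names):
        categories = df_train[col].cat.categories  # position j -> original value
        weights = learn.model.embeds[i].weight     # embedding matrix for this column
        # fastai typically reserves row 0 for the unknown/#na# category, so the
        # original value categories[j] corresponds to embedding row j + 1
        embedding_maps[col] = {
            val: weights[j + 1].detach().cpu().numpy()
            for j, val in enumerate(categories)
        }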

Related

Tuned model with GroupKFold Cross-Validation requires Group parameter when Predicting

I tuned a RandomForest with GroupKFold (to prevent data leakage because some rows came from the same group).
I get a best-fit model, but when I go to make a prediction on the test data it says that it needs the group feature.
Does that make sense? It's odd that the group feature is coming up as one of the most important features as well.
I'm just wondering if there is something I could be doing wrong.
Thanks
A search on the scikit-learn Github repo does not reveal a single instance of the string "group feature" or "group_feature" or anything similar, so I will go ahead and assume you have in your data set a feature called "group" that the prediction model requires as input in order to produce an output.
Remember that a prediction model is basically a function that takes an input (the "predictor" variable) and returns an output (the "predicted" variable). If a variable called "group" was defined as input for your prediction model, then it makes sense that scikit-learn would request it.
Does the group appear as a column in the training set? If so, remove it and re-train. It looks like you are just using it to generate the splits, and if it isn't part of the input data you need at prediction time, it shouldn't be in the training set.
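A minimal sketch of that setup (the DataFrame df and the column names "group" and "target" are assumptions): use the group column only to build the folds and drop it from the predictors, so the fitted model never asks for it at prediction time.

    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import GroupKFold, cross_val_score

    groups = df["group"]                          # used only for splitting
    X = df.drop(columns=["group", "target"])      # the group column is NOT a predictor
    y = df["target"]

    cv = GroupKFold(n_splits=5)
    scores = cross_val_score(RandomForestClassifier(), X, y, cv=cv, groups=groups)

    # A model fitted on X alone can then predict on new data without any group column:
    # model.predict(X_test)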

I want to understand Python prediction model

I am trying to code a predictive model, and I found this code somewhere and wanted to know what it means: "X_train.reset_index(inplace = True)".
I think it would help if you provided more context to your question. But in the meanwhile, the line of code you have shown resets the row index of the training dataset of whatever model you're working with (usually X denotes the data and Y denotes the labels).
The dataset is a pandas DataFrame object, and the reset_index method replaces the existing index with a default numeric one, so each row is numbered instead of named; the old index is kept as a regular column unless drop=True is passed. You can find more information in the documentation for this method:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.reset_index.html
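A small self-contained illustration of what that line does:

    import pandas as pd

    X_train = pd.DataFrame({"feat": [10, 20, 30]}, index=["a", "b", "c"])
    X_train.reset_index(inplace=True)
    print(X_train)
    #   index  feat
    # 0     a    10
    # 1     b    20
    # 2     c    30

The old index ("a", "b", "c") becomes a regular column and the rows are renumbered 0, 1, 2; pass drop=True to discard the old index instead of keeping it as a column.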

Is there a way to extract predicted values, using which XGBoost calculates the train/eval errors (stored in evals_results)?

I am looking to gain a better understanding of how my model learns a particular dataset. I wanted to visualize the training and eval phases of learning by plotting the actual training/eval data alongside model predictions for the same.
I got the idea from observing some Matlab code, which allows the user to plot the above mentioned values. Unfortunately I no longer have access to the Matlab code and would like to recreate the same in Python.
Using the code below:
model = xgb.train(params, dtrain, evals=watchlist, evals_result=results, verbose_eval=False)
I can get a results dictionary which saves the training and eval rmse values as shown below:
{'eval': {'rmse': [0.557375, 0.504097, 0.449699, 0.404737, 0.364217, 0.327787, 0.295155, 0.266028, 0.235819, 0.212781]}, 'train': {'rmse': [0.405989, 0.370338, 0.337915, 0.308605, 0.281713, 0.257068, 0.234662, 0.214531, 0.195993, 0.179145]}}
While the output shows me the rmse values, I was wondering whether there is a way to get the predicted values, for both the training and the eval set, from which these rmse values are calculated.
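For reference, a hedged sketch of one way such per-round predictions could be recovered (it assumes the params, dtrain, deval and watchlist from the snippet above; iteration_range needs a recent xgboost, older versions use ntree_limit instead):

    import numpy as np
    import xgboost as xgb

    results = {}
    model = xgb.train(params, dtrain, num_boost_round=10,
                      evals=watchlist, evals_result=results, verbose_eval=False)

    for i in range(10):
        # predictions using only the first i + 1 boosting rounds
        train_pred = model.predict(dtrain, iteration_range=(0, i + 1))
        eval_pred = model.predict(deval, iteration_range=(0, i + 1))
        train_rmse = np.sqrt(np.mean((train_pred - dtrain.get_label()) ** 2))
        # train_rmse should match results['train']['rmse'][i]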

Efficient way to get best matching pairs given a similarity-outputting neural network?

I am trying to come up with a neural network that scores pairs of short texts (for example, a Stack Exchange title and body). Following the deep learning cookbook's example, the network would look basically like this:
We have our two inputs (title and body), embed them, then calculate the cosine similarity between the embeddings. The inputs of the model are [title, body], and the output is [sim].
Now I'd like to find the closest matching body for a given title. Is there a more efficient way of doing this that doesn't involve iterating over every possible (title, body) pair and calculating the corresponding similarity? For very large datasets this is just not feasible.
Any help is much appreciated!
It is indeed not very efficient to iterate over every possible data pair. Instead you could use your model to extract all the embeddings of your titles and text bodies and save them in a database (or simply a .npy file). So, you don't use your model to output a similarity score but instead use your model to output an embedding (from your embedding layer).
At inference time you can then use a library for efficient similarity search such as faiss. Given a title you would simply look up its embedding and search in the whole embedding space of all body embeddings to see which ones get the highest score. I have used this approach myself and been able to search 1M vectors in just 100 ms.
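A hedged sketch of that lookup with faiss (embed_bodies and embed_title stand in for your own model's embedding extraction and are hypothetical helpers):

    import faiss

    body_embs = embed_bodies(all_bodies).astype("float32")  # shape (n_bodies, dim)
    faiss.normalize_L2(body_embs)                  # normalise so inner product == cosine similarity

    index = faiss.IndexFlatIP(body_embs.shape[1])  # exact inner-product index
    index.add(body_embs)

    query = embed_title("my title").astype("float32").reshape(1, -1)
    faiss.normalize_L2(query)
    scores, ids = index.search(query, 5)           # the 5 closest bodies and their scores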

advanced feature extraction for cross-validation using sklearn

Given a sample dataset with 1000 samples of data, suppose I would like to preprocess the data in order to obtain 10000 rows of data, so each original row of data leads to 10 new samples. In addition, when training my model I would like to be able to perform cross validation as well.
The scoring function I have uses the original data to compute the score, so I would like cross-validation scoring to work on the original data as well rather than on the generated data. Since I am feeding the generated data to the trainer (I am using a RandomForestClassifier), I cannot rely on cross-validation to correctly split the data according to the original samples.
What I thought about doing:
Create a custom feature extractor to extract features to feed to the classifier.
Add the feature extractor to a pipeline and feed it to, say, GridSearchCV.
Implement a custom scorer which operates on the original data to score the model given a set of selected parameters.
Is there a better method for what I am trying to accomplish?
I am asking this in connection with a competition currently running on Kaggle.
Maybe you can use stratified cross-validation (e.g. Stratified K-Fold or Stratified Shuffle Split) on the expanded samples, using the original sample index as the stratification info, in combination with a custom score function that ignores the non-original samples during model evaluation.
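A hedged sketch of that idea (the array names X_expanded, y_expanded, orig_idx and is_original are assumptions about how the 10000 generated rows are organised; the manual loop plays the role of the custom scorer by keeping only the original samples in the evaluation):

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import StratifiedKFold

    # orig_idx[i]    : id (0..999) of the original sample that row i was generated from
    # is_original[i] : True for the rows that are the untouched original samples
    cv = StratifiedKFold(n_splits=10)
    scores = []
    for train_rows, test_rows in cv.split(X_expanded, orig_idx):
        clf = RandomForestClassifier().fit(X_expanded[train_rows], y_expanded[train_rows])
        keep = is_original[test_rows]              # evaluate on the original samples only
        pred = clf.predict(X_expanded[test_rows][keep])
        scores.append(accuracy_score(y_expanded[test_rows][keep], pred))
    print(np.mean(scores))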
