Vectorizer (Hashing, Count, etc.): reusing after fit_transform - scikit-learn

Suppose I have two features which are both text based; for example, say I'm trying to predict sports games, and I've got:
1) Excerpt from sports commentary (a body of text)
2) Excerpt from Internet fan predictions (also a body of text).
If I were to use a text vectorizer (say HashingVectorizer) on feature 1), with fit_transform(), would it be bad to use it again (fit_transform()) on feature 2, or should I create a new vectorizer for that? I'm just wondering whether reusing fit_transform() on multiple features with the same vectorizer might perhaps have bad side effects.

I would say it depends on whether or not you want reproducibility of the text-to-vector conversion step. For example, if you want to use the same classifier (or whatever) you made from the first data set, you need to reuse the vectorizer. If you fit a new one on a different data set, it will build a different vocabulary, i.e., pull out different tokens, and make the vectors differently. That might be what you want with a very different data set (if you're going to retrain). It could be that the second data set contains new words that are critical for predictions. Those would be missed if you reused the vectorizer.
By the way, the vectorizers can be pickled if you want to save them to disk. For an example, see the question "how to pickle customized vectorizer?".
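To make the difference concrete, here is a minimal sketch, assuming scikit-learn's CountVectorizer and a few made-up sentences standing in for the two text features (all variable names and data are illustrative, not from the question). It shows that calling fit_transform() a second time on the same object replaces the learned vocabulary, that transform() is the call to use on new data, and that a fitted vectorizer survives pickling. HashingVectorizer, by contrast, is stateless, so refitting it has no such side effect.

```python
import pickle

from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy data standing in for the two text features.
commentary = ["the home team dominated the first half", "a late goal decided the match"]
fan_predictions = ["home team wins easily", "expect a draw with one late goal"]

# One vectorizer per text feature: each learns its own vocabulary.
commentary_vec = CountVectorizer()
fans_vec = CountVectorizer()
X_commentary = commentary_vec.fit_transform(commentary)
X_fans = fans_vec.fit_transform(fan_predictions)

# Calling fit_transform() again on the same object discards the old vocabulary.
reused = CountVectorizer()
reused.fit_transform(commentary)
vocab_before = set(reused.vocabulary_)
reused.fit_transform(fan_predictions)
vocab_after = set(reused.vocabulary_)

# On new data, reuse the fitted vectorizer with transform(), not fit_transform().
X_new = commentary_vec.transform(["the match was decided in the first half"])

# A fitted vectorizer survives a pickle round trip.
restored = pickle.loads(pickle.dumps(commentary_vec))
```

Keeping one fitted vectorizer per text column means each column's vectors stay consistent between training and prediction.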


When doing pre-training of a transformer model, how can I add words to the vocabulary?

Given a DistilBERT trained language model for a given language, taken from the Huggingface hub, I want to pre-train the model on a specific domain, and I want to add new words that are:
definitely non existing in the original training set
and impossible to handle via word piece tokenization - basically you can think of these words as "codes" that are a normalized form of a named entity
Consider that:
I would like to avoid learning a new tokenizer: I am fine with adding the new words, and then letting the model learn their embeddings via pre-training
the number of the "words" is way larger than the number of "unused" tokens in the "stock" vocabulary
The only advice that I have found is the one reported here:
Append it to the end of the vocab, and write a script which generates a new checkpoint that is identical to the pre-trained checkpoint, but with a bigger vocab where the new embeddings are randomly initialized (for initialized we used tf.truncated_normal_initializer(stddev=0.02)). This will likely require mucking around with some tf.concat() and tf.assign() calls.
Do you think this is the only way to achieve my goal?
If yes, I do not have any idea of how to write this "script": does someone have some hints on how to proceed (sample code, documentation etc)?
As per my comment, I'm assuming that you go with a pre-trained checkpoint, if only to "avoid [learning] a new tokenizer."
Also, the solution works with PyTorch, which might be more suitable for such changes. I haven't checked Tensorflow (which is mentioned in one of your quotes), so no guarantees that this works across platforms.
To solve your problem, let us divide this into two sub-problems:
Adding the new tokens to the tokenizer, and
Re-sizing the token embedding matrix of the model accordingly.
The first can actually be achieved quite simply by using .add_tokens(). I'm referencing the slow tokenizer's implementation of it (because it's in Python), but from what I can see, this also exists for the faster Rust-based tokenizers.
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
# Will return an integer corresponding to the number of added tokens
# The input could also be a list of strings instead of a single string
num_new_tokens = tokenizer.add_tokens("dennlinger")
You can quickly verify that this worked by looking at the encoded input ids:
print(tokenizer("This is dennlinger."))
# 'input_ids': [101, 2023, 2003, 30522, 1012, 102]
The index 30522 now corresponds to the new token with my username, so we can check the first part. However, if we look at the function docstring of .add_tokens(), it also says:
Note, when adding new tokens to the vocabulary, you should make sure to also resize the token embedding matrix of the model so that its embedding matrix matches the tokenizer.
In order to do that, please use the PreTrainedModel.resize_token_embeddings method.
Looking at this particular function, the description is a bit confusing, but we can get a correctly resized matrix (with randomly initialized weights for new tokens), by simply passing the previous model size, plus the number of new tokens:
from transformers import AutoModel
model = AutoModel.from_pretrained("distilbert-base-uncased")
model.resize_token_embeddings(model.config.vocab_size + num_new_tokens)
# Test that everything worked correctly
model(**tokenizer("This is dennlinger", return_tensors="pt"))
EDIT: Notably, .resize_token_embeddings() also takes care of any associated weights; this means, if you are pre-training, it will also adjust the size of the language modeling head (which should have the same number of tokens), or fix tied weights that would be affected by an increased number of tokens.
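For intuition about what the resize amounts to, here is a rough NumPy sketch of the "append randomly initialized rows" idea from the quoted advice. It is not the library's implementation: grow_embeddings is a hypothetical helper, the toy matrix is made up, and clipping a normal draw at two standard deviations is only an approximation of a truncated normal initializer.

```python
import numpy as np

def grow_embeddings(embeddings, num_new_tokens, stddev=0.02, seed=0):
    """Return a copy of `embeddings` with `num_new_tokens` random rows appended."""
    rng = np.random.default_rng(seed)
    hidden_size = embeddings.shape[1]
    new_rows = rng.normal(0.0, stddev, size=(num_new_tokens, hidden_size))
    # Rough stand-in for a truncated normal: clip at two standard deviations.
    new_rows = np.clip(new_rows, -2 * stddev, 2 * stddev)
    # Existing rows are kept unchanged; only the appended rows are new.
    return np.concatenate([embeddings, new_rows.astype(embeddings.dtype)], axis=0)

# Hypothetical small "pre-trained" matrix: 5 tokens, hidden size 4.
old = np.arange(20, dtype=np.float32).reshape(5, 4)
new = grow_embeddings(old, num_new_tokens=3)
print(new.shape)
```

The rows for existing tokens are untouched, so only the appended rows have to be learned from scratch during the domain pre-training.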

Tuned model with GroupKFold Cross-Validation requires Group parameter when Predicting

I tuned a RandomForest with GroupKFold (to prevent data leakage because some rows came from the same group).
I get a best fit model, but when I go to make a prediction on the test data it says that it needs the group feature.
Does that make sense? It's odd that the group feature is coming up as one of the most important features as well.
I'm just wondering if there is something I could be doing wrong.
A search on the scikit-learn Github repo does not reveal a single instance of the string "group feature" or "group_feature" or anything similar, so I will go ahead and assume you have in your data set a feature called "group" that the prediction model requires as input in order to produce an output.
Remember that a prediction model is basically a function that takes an input (the "predictor" variable) and returns an output (the "predicted" variable). If a variable called "group" was defined as input for your prediction model, then it makes sense that scikit-learn would request it.
Does the group appear as a column on the training set? If so, remove it and re-train. It looks like you are just using it to generate splits. If it isn't a part of the input data you need to predict, it shouldn't be in the training set.
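A minimal sketch of that setup, assuming a GridSearchCV over a RandomForestClassifier and synthetic data (the arrays, grid, and sizes are all made up for illustration): the group ids are passed to fit() so the splitter can use them, but they are never a column of the feature matrix, so predict() needs only the real features.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, GroupKFold

rng = np.random.default_rng(0)

# Hypothetical data: 60 rows, 3 real features, 6 groups of 10 rows each.
X_train = rng.normal(size=(60, 3))
y_train = (X_train[:, 0] > 0).astype(int)
groups = np.repeat(np.arange(6), 10)

search = GridSearchCV(
    RandomForestClassifier(n_estimators=10, random_state=0),
    param_grid={"max_depth": [2, 4]},
    cv=GroupKFold(n_splits=3),
)
# The group ids go to the splitter only; they are not a column of X_train.
search.fit(X_train, y_train, groups=groups)

# Prediction needs only the real features.
X_test = rng.normal(size=(5, 3))
predictions = search.predict(X_test)
```

If the group column lives in the same DataFrame as the features, dropping it before fitting and passing it separately as groups gives the same arrangement.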

Training Doc2vec with new data

I have a doc2vec model trained on documents with labels. I'm trying to continue training my model with model.train(). The new data comes with new labels as well, but, when I train it on more documents, the new labels aren't being recorded... Does anyone know what my problem might be?
Gensim's Doc2Vec only learns its set of tags at the same time it learns the corpus vocabulary of unique words – during the first call to .build_vocab() on the original corpus.
When you train with additional examples that have either words or tags that aren't already known to the model, those words or tags are simply ignored.
(The .build_vocab(…, update=True) option that's available on Word2Vec to expand its vocabulary has never been fully applied to Doc2Vec, either with respect to tags or with respect to a longstanding crashing bug. So it's not supported on Doc2Vec.)
Note that if it is your aim to create document-vectors that assist in some downstream-classification task, you may not want to supply your known-labels as tags, or at least not as a document's only tag.
The tags you supply to Doc2Vec are the units for which it learns vectors. If you have a million text examples, but only 5 different labels, if you feed those million examples into training each with only the label as a tag, the model is only learning 5 doc-vectors. It is, essentially, like you're training on only 5 mega-documents (passed in in chunks) – and thus 'summarizing' each label down to a single point in vector-space, when it might be far more useful to think of a label as covering an irregularly-shaped "point cloud".
So, you might instead want to use document-IDs rather than labels. (Or, labels and document-IDs.) Then, use the many varied vectors from all individual documents – rather than single vectors per label – to train some downstream classifier or clusterer.
And in that case, the arrival of documents with new labels might not require a full Doc2Vec-retraining. Instead, if the new documents still get useful vectors from inference on the older Doc2Vec model, those per-doc vectors may reflect enough about the new label's documents that downstream classifiers can learn to recognize them.
Ultimately, though, if you acquire much more training data, reflecting all new vocabularies & word-senses, the safest approach is to retrain a Doc2Vec model from scratch, using all data. Simple incremental training, even if it had official support, risks pulling those words/tags that appear in new data arbitrarily out-of-comparable-alignment with words/tags that were only trained in the original dataset. It is the interleaved co-training, alongside all other examples equally, which pushes-and-pulls all vectors in a model into useful relative arrangements.

Label custom entities in Resume (NER)

How can I perform NER for a custom named entity? For example, I want to identify whether a particular word in a resume is a skill: if Java or C++ occurs in my text, I should be able to label it as a skill. I don't want to use spaCy with a custom corpus. I want to create the dataset myself, e.g.
words will be my features and the label (skill) will be my dependent variable.
What is the best approach to handle these kinds of problems?
The alternative to custom dictionaries and gazetteers is to create a dataset where you assign to each word the corresponding label. You can define a set of labels (e.g. {OTHER, SKILL}) and create a dataset with examples like:
program OTHER
Python SKILL
And with a large enough dataset you train a model to predict the corresponding label.
You can try to get a list of "coding language" synonyms (or the specific skills you are looking for) from word embeddings trained on your CV corpus and use this information to automatically label other corpora. I would say that the key point is to find a way to at least partially automate the labeling, otherwise you won't have enough examples to train the model on your custom NER task. Use tools that reduce the labeling effort.
As features you can also use word embeddings (or other typical NLP features like n-grams, POS tag, etc. depending on the model you are using)
Another option is to apply transfer learning from other NER/NLP models and finetune them on your CV labeled dataset.
I would put more effort in creating the right dataset and then test gradually more complex models selecting what best fit your needs.
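As a rough illustration of partially automated labeling, the sketch below produces (token, label) pairs in the format shown above from a small seed list. The seed list, the tokenizing regex, and the label_tokens helper are all assumptions made up for this example; in practice the seed terms might come from embedding neighbours, and the output would still need manual review.

```python
import re

# Hypothetical seed list; in practice it might come from embedding neighbours.
SKILL_TERMS = {"java", "c++", "python"}

def label_tokens(text):
    """Return (token, label) pairs using a seed list of skill terms."""
    # Keep '+' and '#' inside tokens so that "c++" and "c#" stay in one piece.
    tokens = re.findall(r"[A-Za-z0-9+#]+", text)
    return [(tok, "SKILL" if tok.lower() in SKILL_TERMS else "OTHER") for tok in tokens]

pairs = label_tokens("Wrote a program in Python, Java and C++.")
for token, label in pairs:
    print(token, label)
```

A list lookup like this misses unseen skills and ambiguous words, which is exactly what a model trained on the resulting dataset is meant to generalize beyond.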

adding and accessing auxiliary tf.Dataset attributes with Keras

I use a tf.py_func call to parse data (features, labels and sample_weights) from file to a tf.Dataset:
dataset = tf.data.Dataset.from_tensor_slices((filenames, labels, sample_weights))
dataset = dataset.map(
    lambda filename, label, sample_weight: tuple(tf.py_func(
        self._my_parse_function, [filename, label, sample_weight], [tf.float32, label.dtype, tf.float32])))
The data is variable-length 1-D sequences, so I also pad the sequences to a fixed length in my_parse_function.
I use tensorflow.python.keras.models.Sequential.fit to train on the data (it now accepts datasets as input, including datasets with sample_weights) and tensorflow.python.keras.models.Sequential.predict to predict outputs.
Once I have predictions I would like to do some post-processing to make sense of the outputs. For example, I'd like to truncate the padded data to the actual sequence length. Also, I'd like to know for sure which file the data came from, since I am not sure that ordering is guaranteed with dataset iterators, especially if batching is used (I do batch the dataset as well) or multi-GPU or multi-workers are involved (I hope to try the multi- scenarios). Even if order was 'guaranteed' this is a decent sanity check.
This information, filename (i.e, a string) and sequence length (i.e, an integer), is not currently conveniently accessible, so I'd like to add these two attributes to the dataset elements and be able to retrieve them during/after the call to predict.
What is the best approach to do this?
As a workaround, I store this auxiliary information in a 'global' dictionary in my_parse_fn, so it stores (and re-stores) on every iteration through the tf.Dataset. This is ok for now since there are only about 1000 examples in the training set, so storing 1000 strings and integers is not a problem. But if this auxiliary information were larger or the training set were larger, this approach would not be very scalable. In my case, the input data for each training example is significantly large, about 50MB in size, which is why reading a tf.Dataset from file (i.e., on every epoch) is important.
I still think that it would be helpful to be able to more conveniently extend a tf.Dataset with this information. Also, I noticed that when I add a field to a tf.Dataset, like dataset.tag, to identify, say, dataset.tag = 'training', dataset.tag = 'validation' or dataset.tag = 'test' sets, the field does not survive the iterations of training.
So again in this case I'm wondering how a tf.Dataset can be extended.
On the other question, it looks like the order of tf.Dataset elements is respected through iterations, so predictions, say, from tensorflow.python.keras.models.Sequential.predict(...) are ordered as the file ids were presented to my_parse_fn (at least batching respects this ordering, but I still don't know about whether a multi-GPU scenario would as well).
Thanks for any insights.
