I added a new entity called "orgName" to en_core_web_lg using https://spacy.io/usage/training#example-new-entity-type
All my training data (26k sentences) have the "orgName" labeled in them.
To deal with the catastrophic forgetting problem, I ran en_core_web_lg on those 26k raw sentences and added the ORG, PROD, FAC, etc. entities it predicted as labels. To avoid colliding entities, I created duplicates.
So, for a sentence A which was labeled by "orgName", I created a duplicate A2 which has ORG, PROD, FAC, etc. ending up with about 52k sentences.
I trained using 100 iterations.
Now, the problem is that when I test the model, even on the training sentences, it's not showing the ORG, PROD, FAC, etc. entities but only "orgName".
Where do you think the problem is?
In principle the way you're trying to solve the catastrophic forgetting problem, by retraining it on its old predictions, seems like a good approach to me.
However, if you have duplicate versions of the same sentence, annotated differently, and feed them to the NER classifier, you may confuse the model. The reason is that it doesn't just look at the positive examples, but also explicitly treats non-annotated words as negative cases.
So if you have "Bob lives in London", and you only annotate "London", then it will think Bob is surely not an NE. If then you have a second sentence where you annotate only Bob, it will "unlearn" that London is an NE, because now it's not annotated as such. So consistency really is important.
I would suggest implementing a more advanced algorithm to resolve the conflicts.
One option is to always just take the annotated entity with the longest Span. But if the Spans are often exactly the same, you may need to reconsider your label scheme. Which entities collide most often? I would assume ORG and orgName? Do you really need ORG? Perhaps the two can be "merged" as the same entity?
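As a minimal sketch of the longest-span idea, assuming each sentence's annotations are available as (start, end, label) character offsets: the function below merges the custom annotations and the pretrained model's predictions into one consistent set, so each sentence appears only once in the training data. The function name `merge_annotations` and the example sentence and offsets are made up for illustration.

```python
def merge_annotations(*annotation_sets):
    """Merge several (start, end, label) lists for ONE sentence.

    Overlapping spans are resolved by keeping the longest one. On equal
    length, the earlier start wins; for identical spans, the annotation
    set passed first wins.
    """
    candidates = [span for spans in annotation_sets for span in spans]
    # Stable sort: longest span first, then earliest start.
    candidates.sort(key=lambda s: (-(s[1] - s[0]), s[0]))

    taken = set()   # character positions already covered by a kept span
    merged = []
    for start, end, label in candidates:
        if any(i in taken for i in range(start, end)):
            continue  # overlaps a span we already kept
        taken.update(range(start, end))
        merged.append((start, end, label))
    return sorted(merged)


# Hypothetical example sentence and offsets.
text = "Acme Corp opened an office in London"
custom = [(0, 9, "orgName")]                    # hand-labeled entity
pretrained = [(0, 4, "ORG"), (30, 36, "GPE")]   # predictions of the old model

merged = merge_annotations(custom, pretrained)
print(merged)
```

Here the shorter ORG span is dropped because it overlaps the longer orgName span, while the non-conflicting GPE span is kept. If the annotations are already spaCy Span objects, `spacy.util.filter_spans` applies a similar longest-span-first rule.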
This is a question regarding training models on spaCy 3.x.
I couldn't find a good answer/solution on StackOverflow hence the query.
Suppose I am using an existing spaCy model, like the en model, and want to add my own entities to it and train it. Since I work in the biomedical domain, these are things like virus name, shape, length, temperature, temperature value, etc. I don't want to lose the entities already tagged by spaCy, like organization names, countries, etc.
All suggestions are appreciated.
Thanks
There are a few ways to do that.
The best way is to train your own model separately and then combine both models in one pipeline, with one before the other. See the double NER example project for an overview of that.
It's also possible to update the pretrained NER model, see this example project. However this isn't usually a good idea, and definitely not if you're adding completely different entities. You'll run into what's called "catastrophic forgetting", where even though you're technically updating the model, it ends up forgetting everything not represented in your current training data.
I want to train a Doc2Vec model with a generic corpus and then continue training with a domain-specific corpus (I have read that this is a common strategy and I want to test the results).
I have all the documents, so I can build and tag the vocab at the beginning.
As I understand, I should train initially all the epochs with the generic docs, and then repeat the epochs with the ad hoc docs. But, this way, I cannot place all the docs in a corpus iterator and call train() once (as it is recommended everywhere).
So, after building the global vocab, I have created two iterators, the first one for the generic docs and the second one for the ad hoc docs, and called train() twice.
Is this the best way, or is there a more appropriate way?
If it is the best, how should I manage alpha and min_alpha? Is it a good decision not to mention them in the train() calls and let train() manage them?
Best
Alberto
This is probably not a wise strategy, because:
• the Python Gensim Doc2Vec class hasn't ever properly supported expanding its known vocabulary after a 1st single build_vocab() call. (Up through at least 3.8.3, such attempts typically cause a Segmentation Fault process crash.) Thus if there are words that are only in your domain-corpus, an initial typical initialization/training on the generic-corpus would leave them out of the model entirely. (You could work around this, with some atypical extra steps, but the other concerns below would remain.)
• if there is truly an important contrast between the words/word-senses used in your generic and the different words/word-senses used in your domain corpus, influence of the words from the generic corpus may not be beneficial, diluting domain-relevant meanings
• further, any followup training that just uses a subset of all documents (the domain corpus) will only be updating the vectors for that subset of words/word-senses, and the model's internal weights used for further unseen-document inference, in directions that make sense for the domain-corpus alone. Such later-trained vectors may be nudged arbitrarily far out of comparable alignment with other words not appearing in the domain-corpus, and earlier-trained vectors will find themselves no longer tuned in relation to the model's later-updated internal-weights. (Exactly how far will depend on the learning-rate alpha & epochs choices in the followup training, and how well that followup training optimizes model loss.)
If your domain dataset is sufficient, or can be grown with more domain data, it may not be necessary to mix in other training steps/data. But if you think you must try that, the best-grounded approach would be to shuffle all training data together, and train in one session where all words are known from the beginning, and all training examples are presented in balanced, interleaved fashion. (Or possibly, where some training texts considered extra-important are oversampled, but still mixed in with the variety of all available documents, in all epochs.)
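A minimal sketch of that combined approach, using only the standard library: all documents go into one list, the domain documents are optionally repeated (oversampled), and the result is shuffled so every epoch sees an interleaved mix. The function name `build_combined_corpus`, the `domain_oversample` parameter, and the toy documents are assumptions for illustration; the (words, tag) tuples stand in for whatever tagged-document objects your Doc2Vec library expects.

```python
import random


def build_combined_corpus(generic_docs, domain_docs, domain_oversample=1, seed=0):
    """Return one shuffled training list containing both corpora.

    Each domain document is repeated `domain_oversample` times, so it
    carries more weight while still being interleaved with generic docs.
    """
    combined = list(generic_docs) + list(domain_docs) * domain_oversample
    random.Random(seed).shuffle(combined)  # fixed seed for reproducibility
    return combined


# Toy (words, tag) pairs, made up for illustration.
generic_docs = [
    (["the", "weather", "is", "nice"], "gen_0"),
    (["stocks", "fell", "today"], "gen_1"),
]
domain_docs = [(["virus", "capsid", "length"], "dom_0")]

corpus = build_combined_corpus(generic_docs, domain_docs, domain_oversample=3)
print(len(corpus))
```

The vocabulary would then be built once from this single list, and training would run once over it, so no word is introduced after the initial vocabulary scan.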
If you see an authoritative source suggesting such a "train with one dataset, then another disjoint dataset" approach with the Doc2Vec algorithms, you should press them for more details on what they did to make that work: exact code steps, and the evaluations which showed an improvement. (It's not impossible that there's some way to manage all the issues! But I've seen many vague impressions that this separate-pretraining is straightforward or beneficial, and zero actual working writeups with code and evaluation metrics showing that it's working.)
Update with respect to the additional clarifications you provided at https://stackoverflow.com/a/64865886/130288:
Even with that context, my recommendation remains: don't do this segmenting of training into two batches. It's almost certain to degrade the model compared to a combined training.
I would be interested to see links to the "references in the literature" you allude to. They may be confused or talking about algorithms other than the Doc2Vec ("Paragraph Vectors") algorithm.
If there is any reason to give your domain docs more weight, a better-grounded way would be to oversample them in the combined corpus.
But by all means, test all these variants & publish the relative results. If you're exploring shaky hypotheses, I would ignore any advice from StackOverflow-like sources & just run all the variants that your reading of the literature suggests, to see which, if any, actually help.
You're right to recognize that the choice of alpha parameters is a murky area that could majorly influence what impact such add-on training has. There's no right answer, so you'll have to search for and reason out what might make sense. The inherent issues I've mentioned with such subset-followup-training could make it so that even if you find benefits in some combos, they may be more a product of a lucky combination of data & arbitrary parameters than a generalizable practice.
And: your specific question "if it is better to set such values or not provide them at all" reduces to: "do you want to use the default values, or values set when the model was created, or not?"
Which values might be workable, if at all, for this unproven technique is something that'd need to be experimentally discovered. That is, if you wanted to have comparable (or publishable) results here, I think you'd have to justify from your own novel work some specific strategy for choosing good alpha/epochs and other parameters, rather than adopt any practice merely recommended in a StackOverflow answer.
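For intuition about why the alpha choices matter, here is a small sketch under a simplifying assumption: that within each train() call the learning rate decays linearly from a starting alpha to a minimum alpha, one step per epoch. The helper `linear_alpha_schedule` and the values 0.025 and 0.0001 are illustrative assumptions, not a description of any library's exact internals.

```python
def linear_alpha_schedule(start_alpha, end_alpha, epochs):
    """Learning rate at the start of each epoch, decaying linearly."""
    step = (start_alpha - end_alpha) / epochs
    return [start_alpha - step * i for i in range(epochs)]


# Two separate training calls that both use the same (assumed) defaults:
first_pass = linear_alpha_schedule(0.025, 0.0001, epochs=5)   # generic corpus
second_pass = linear_alpha_schedule(0.025, 0.0001, epochs=5)  # domain corpus

schedule = first_pass + second_pass
for epoch, alpha in enumerate(schedule):
    print(epoch, round(alpha, 5))
```

Under this assumption the rate jumps back up to its starting value when the second call begins, so the domain-only pass starts by making large updates again. Whether that is desirable, or whether a smaller starting value for the second call works better, is exactly the kind of thing that would have to be found experimentally.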
I have transcripts of phone calls with customers and agents. I'm trying to find promises which were made by an agent to a customer.
I already did punctuation restoration. But there are a lot of sentences that don't make any sense. I would like to remove them from the transcript. Most of them are just a set of unconnected words.
I wonder what approach is the best for this task?
My ideas are:
• Use tf idf and word2vec to create vectors from all sentences. After that we can do some kind of anomaly detection e.g. look for and delete vectors that are highly deviated from most other vectors.
• Spam filters. Maybe it is possible to apply spam filters to this task?
• Create some pattern of part-of-speech tags that a proper sentence must include. For example, any good sentence must include noun + verb. Or we can use, for example, dependency tags from spaCy.
Examples
Example of a sentence that I want to keep:
There's no charge once sent that you'll get a ups tracking number.
Example of a junk sentence:
Kinder pr just have to type it in again, clock drives bethel.
Another junk sentence:
Just so you have it on and said this is regarding that.
One thing I would try is to treat this as a classification problem (junk vs non-junk). You can train a model based on a labelled set (i.e. you need to label some subset of your dataset) and then classify the rest of the corpus.
You could use a pre-trained language model like BERT and fine-tune it with your labeled set, as in here (https://colab.research.google.com/github/google-research/bert/blob/master/predicting_movie_reviews_with_bert_on_tf_hub.ipynb).
The advantage of using a language model like this is that you don't have to worry too much about linguistic (pre-)processing, meaning you don't have to get the part-of-speech or syntactic structure.
Comments regarding your ideas:
Anomaly detection with tf-idf and word2vec: It depends on the proportion of junk sentences in your corpus. If they make up more than 15%, I would think that they might not be so anomalous. Also, I am assuming your junk sentences come from noisy automatic speech-to-text transcription. I am not sure to what extent parts of these junk sentences are correctly transcribed, or what effect the correctly transcribed portions might have on the extent of the anomaly.
If you mean pre-existing spam filters that are trained on spam email, I would guess that the spamminess of emails is quite different from the junkiness of your transcripts.
Use POS tags or syntactic structure to manually create rules for valid sentences:
This seems a bit tedious to me, and I am also not sure you would discover all the junk this way. For instance, in your junk examples, the syntactic structure does not strike me as too unusual, e.g. "clock drives bethel" might well receive quite a common tag sequence. The junkiness in this case comes from the meaning of the words.
I want to train two word2vec/GloVe models on different corpora and then compare the vectors of a single word. I know that it makes no sense to do so, as different models start at different random states, but what if we use pre-trained word vectors as the starting point? Can we assume that the two models will continue to build upon the pre-trained vectors by incorporating the respective domain-specific knowledge, and not go into completely different states?
Tried to find some research papers which discuss this problem, but couldn't find any.
Simply starting your models with pre-trained vectors would eliminate some of the randomness, but with each training epoch on your new corpora:
• there's still randomness introduced by negative-sampling (if using that default mode), by frequent-word downsampling (if using default values of the sample parameter in word2vec), and by the interplay of different threads
• each epoch with your new corpora will be pulling the word-vectors for present words to new, better positions for those corpora, but leaving pre-trained words that don't appear in them unmoved. The net movements over many epochs could move words arbitrarily far from where they started, in response to the whole-corpus-effects on all words.
So, doing so wouldn't necessarily achieve your goal in a reliable (or theoretically-defensible) way, though it might kinda-work – at least better than starting from purely random initialization – especially if your corpora are small and you do few training epochs. (That's usually a bad idea – you want big varied training data and enough passes for extra passes to make little incremental difference. But doing those things "wrong" could make your results look "better" in this scenario, where you don't want your training to change the original coordinate-space "too much". I wouldn't rely on such an approach.)
Especially if the words you need to compare are a small subset of the total vocabulary, a couple things you could consider:
• combine the corpora into one training corpus, shuffled together, but for those words you need to compare, replace them with corpora-specific tokens. For example, replace 'sugar' with 'sugar_c1' and 'sugar_c2' – leaving the vast majority of surrounding words to be the same tokens (and thus learn a single vector across the whole corpus). Then, the two variant tokens for the "same word" will learn different vectors, based on their differing contexts that still share many of the same tokens.
• using some "anchor set" of words that you know (or confidently conjecture) either do mean the same across both contexts, or should mean the same, train two models but learn a transformation between the two spaces based on those guide words. Then, when you apply that transformation to other words that weren't used to learn the transformation, they'll land in contrasting positions in each other's spaces, maybe achieving the comparison you need. This is a technique that's been used for language-to-language translation, and there's a helper class and example notebook included with the Python gensim library.
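The first idea, corpus-specific tokens, can be sketched with plain Python lists of tokens. The helper name `tag_target_words` and the toy sentences are made up for illustration; the combined list is what would then be fed to a single word2vec training run.

```python
import random


def tag_target_words(sentences, targets, suffix):
    """Append a corpus-specific suffix to each target word.

    All other tokens are left unchanged, so they share one vector
    across both corpora.
    """
    return [
        [f"{tok}_{suffix}" if tok in targets else tok for tok in sent]
        for sent in sentences
    ]


# Toy tokenized corpora, made up for illustration.
corpus_1 = [["add", "sugar", "to", "the", "tea"]]
corpus_2 = [["blood", "sugar", "was", "high"]]
targets = {"sugar"}  # the words whose vectors we want to compare

combined = (
    tag_target_words(corpus_1, targets, "c1")
    + tag_target_words(corpus_2, targets, "c2")
)
random.Random(0).shuffle(combined)  # interleave the two corpora
print(combined)
```

After training one model on `combined`, the vectors for 'sugar_c1' and 'sugar_c2' live in the same coordinate space and can be compared directly.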
There may be other, better approaches; these are just two quick ideas that might work without much change to existing libraries. A project like 'HistWords', which used word-vector training to try to track evolving changes in word-meaning over time, might also have ideas for usable techniques.
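To illustrate the anchor-word idea in the simplest possible setting, here is a toy sketch with made-up 2-D vectors. It fits only a rotation (a special case of a linear transformation) from the anchor words, then applies it to a non-anchor word. Real word vectors have hundreds of dimensions and a general linear map would normally be learned instead; all names and numbers here are assumptions for illustration.

```python
import math


def fit_rotation(anchors_a, anchors_b):
    """Least-squares 2-D rotation angle mapping space-A anchors onto space-B anchors."""
    pairs = list(zip(anchors_a, anchors_b))
    cross = sum(ax * by - ay * bx for (ax, ay), (bx, by) in pairs)
    dot = sum(ax * bx + ay * by for (ax, ay), (bx, by) in pairs)
    return math.atan2(cross, dot)


def rotate(vec, theta):
    """Rotate a 2-D vector counter-clockwise by theta radians."""
    x, y = vec
    return (x * math.cos(theta) - y * math.sin(theta),
            x * math.sin(theta) + y * math.cos(theta))


def cosine(u, v):
    """Cosine similarity of two 2-D vectors."""
    return (u[0] * v[0] + u[1] * v[1]) / (math.hypot(*u) * math.hypot(*v))


# Made-up vectors: space B is space A rotated by 90 degrees,
# except that "sugar" has also shifted in meaning.
space_a = {"water": (1.0, 0.0), "bread": (0.0, 1.0), "sugar": (0.6, 0.8)}
space_b = {"water": (0.0, 1.0), "bread": (-1.0, 0.0), "sugar": (0.0, 1.0)}

anchor_words = ["water", "bread"]  # assumed to mean the same in both corpora
theta = fit_rotation([space_a[w] for w in anchor_words],
                     [space_b[w] for w in anchor_words])

# Map A's "sugar" into B's space and compare it with B's own "sugar".
similarity = cosine(rotate(space_a["sugar"], theta), space_b["sugar"])
print(round(math.degrees(theta), 1), round(similarity, 3))
```

The anchors recover the 90-degree rotation between the two spaces, and the mapped 'sugar' vector is then only partly similar to its counterpart, which is the kind of contrast the comparison is meant to surface.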
I am working on a text classification project where I am trying to assign topic classifications to speeches from the Congressional Record.
Using topic codes from the Congressional Bills Project (http://congressionalbills.org/), I've tagged speeches that mention a specific bill as belonging to the topic of the bill. I'm using this as my "training set" for the model.
I have a "vanilla" Naive Bayes classifier working well-enough, but I keep feeling like I could get better accuracy out of the algorithm by incorporating information about the member of Congress who is making the speech (e.g. certain members are much more likely to talk about Foreign Policy than others).
One possibility would be to replace the prior in the NB classifier (usually defined as the proportion of documents in the training set that have the given classification) with a prior based on the speaker's previously observed speeches.
Is this worth pursuing? Are there existing approaches that have followed this same kind of logic? I'm a little bit familiar with the "author-topic models" that come out of Latent Dirichlet Allocation models, but I like the simplicity of the NB model.
There is no need to modify anything; simply add this information to your Naive Bayes and it will work just fine.
And, as was previously mentioned in the comments, do not change any priors: the prior probability is P(class), which has nothing to do with the actual features.
Just add another feature corresponding to the authorship to your computations, e.g. "author:AUTHOR", and train Naive Bayes as usual, i.e. estimate P(author:AUTHOR|class) for each class and AUTHOR and use it later on in your classification process. If your current representation is a bag of words, it is sufficient to add an "artificial" word of the form "author:AUTHOR" to it.
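A minimal sketch of the "artificial word" trick, using a small hand-written multinomial Naive Bayes with add-one smoothing so that it runs with the standard library only. The class, the helper `add_author_token`, the speaker names, and the toy speeches are all made up for illustration; with an existing NB implementation only the token-adding step would be needed.

```python
import math
from collections import Counter, defaultdict


def add_author_token(tokens, author):
    """Append an artificial 'author:NAME' word to a bag of words."""
    return tokens + [f"author:{author}"]


class NaiveBayes:
    """Tiny multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)        # used for the prior P(class)
        self.token_counts = defaultdict(Counter)   # per-class token counts
        for tokens, label in zip(docs, labels):
            self.token_counts[label].update(tokens)
        self.vocab = {t for c in self.token_counts.values() for t in c}
        return self

    def predict(self, tokens):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            score = math.log(n_docs / total_docs)  # prior stays P(class)
            denom = sum(self.token_counts[label].values()) + len(self.vocab)
            for tok in tokens:
                score += math.log((self.token_counts[label][tok] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy speeches with identical words; only the (hypothetical) speaker differs.
train = [
    (["budget", "vote"], "smith", "foreign_policy"),
    (["budget", "vote"], "jones", "health"),
]
docs = [add_author_token(words, author) for words, author, _ in train]
labels = [topic for _, _, topic in train]
model = NaiveBayes().fit(docs, labels)

print(model.predict(add_author_token(["budget", "vote"], "smith")))
print(model.predict(add_author_token(["budget", "vote"], "jones")))
```

Because the words are the same in both training speeches, the author token is the only thing that separates the two classes here, which shows how the extra feature enters through the likelihood while the prior is left untouched.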
One other option would be to train an independent classifier for each AUTHOR, which would capture person-specific types of speech. For example, one speaker may use the word "environment" a lot only when talking about "nature", while another simply likes to add this word to every speech ("Oh, in our local environment of ..."). Independent NBs would capture these kinds of phenomena.