Adding domain knowledge (custom features) to NER - python-3.x

I'm on an Ubuntu machine with Python 3.5.2 and spaCy 2.0. I'm training a blank Spanish model to recognize entities in resumes. For that I used custom word embeddings and I'm running a large entity annotation project. I was able to segment a resume and determine which section of the resume each segment belongs to using the word embeddings, and I want to use that knowledge to augment spaCy's NER (for example, an entity in the work experience section is more likely to be an organization than an educational institution). Looking through the documentation, I saw that there's a way to add custom attributes and/or compute them using pipeline components and extensions, but I was unable to tell whether the NER algorithm will use them as features by default or whether I need to add custom code for that.
Is there a built-in way to do this, or does it require custom behavior?
Thank you, and regards.
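For illustration, here is a minimal sketch (spaCy 2.0) of how the section information could be attached to tokens through a custom pipeline component and an extension attribute; the detect_section helper is hypothetical and stands in for the embedding-based segmenter. This only stores the attribute: whether the statistical NER consumes it as a feature is exactly the open question above, so out of the box it is only available to your own pre- and post-processing code.

# Sketch for spaCy 2.0: store the resume section on each token via a custom
# extension attribute set by a pipeline component. detect_section is a
# hypothetical placeholder for the embedding-based segmentation step.
import spacy
from spacy.tokens import Token

Token.set_extension("resume_section", default=None)

def detect_section(token):
    # Placeholder: return e.g. "work_experience" or "education" depending on
    # which resume segment the token falls into.
    return "work_experience"

def section_tagger(doc):
    for token in doc:
        token._.resume_section = detect_section(token)
    return doc

nlp = spacy.blank("es")
nlp.add_pipe(section_tagger, first=True)

doc = nlp("Trabajé en Acme S.A. como desarrollador.")
print([(t.text, t._.resume_section) for t in doc])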

Related

Label custom entities in Resume (NER)

How can I perform NER for a custom named entity? For example, I want to identify whether a particular word in a resume is a skill: if "Java" or "C++" occurs in my text, I should be able to label it as a skill. I don't want to use spaCy with a custom corpus. I want to create the dataset myself, e.g.
the words will be my features and the label (skill) will be my dependent variable.
What is the best approach to handle this kind of problem?
The alternative to custom dictionaries and gazetteers is to create a dataset where you assign to each word the corresponding label. You can define a set of labels (e.g. {OTHER, SKILL}) and create a dataset with examples like:
I OTHER
can OTHER
program OTHER
in OTHER
Python SKILL
. OTHER
And with a large enough dataset you train a model to predict the corresponding label.
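As a rough sketch (plain Python; the file name and label set are assumptions), this is one way to read such word/label pairs into sentences with character-offset spans, which is the shape most NER trainers expect:

# Sketch: read a whitespace-separated "word LABEL" file (one token per line,
# blank line between sentences) and build (text, entity_spans) pairs.
# The file name "skills.conll" and the SKILL label are just assumptions.
def read_token_label_file(path):
    examples, words, labels = [], [], []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:                      # blank line = sentence boundary
                if words:
                    examples.append(to_example(words, labels))
                    words, labels = [], []
                continue
            word, label = line.rsplit(None, 1)
            words.append(word)
            labels.append(label)
    if words:
        examples.append(to_example(words, labels))
    return examples

def to_example(words, labels):
    text, spans, offset = "", [], 0
    for word, label in zip(words, labels):
        start = offset
        text += word + " "
        offset = len(text)
        if label != "OTHER":                  # keep only entity tokens
            spans.append((start, start + len(word), label))
    return text.strip(), spans

# e.g. read_token_label_file("skills.conll")
# -> [("I can program in Python .", [(17, 23, "SKILL")]), ...]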
You can try to get a list of "coding language" synonyms (or of the specific skills you are looking for) from word embeddings trained on your CV corpus and use this information to automatically label other corpora. I would say the key point is to find a way to at least partially automate the labeling, otherwise you won't have enough examples to train the model on your custom NER task. Tools like https://prodi.gy/ can reduce the labeling effort.
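For example, a hedged sketch of that idea using gensim (the vector file, seed terms and the use of KeyedVectors are assumptions):

# Sketch: expand a small seed list of skills with nearest neighbours from
# word embeddings trained on the CV corpus, then use the expanded list to
# pre-label other text. Paths and seed terms are assumptions.
from gensim.models import KeyedVectors

vectors = KeyedVectors.load_word2vec_format("cv_vectors.txt")
seeds = ["python", "java", "c++"]

skills = set(seeds)
for seed in seeds:
    if seed in vectors:
        skills.update(w for w, _ in vectors.most_similar(seed, topn=10))

def pre_label(tokens):
    # Assign SKILL to tokens found in the expanded list, OTHER to the rest.
    return [(t, "SKILL" if t.lower() in skills else "OTHER") for t in tokens]

print(pre_label("I can program in Python .".split()))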
As features you can also use word embeddings (or other typical NLP features like n-grams, POS tags, etc., depending on the model you are using).
Another option is to apply transfer learning from other NER/NLP models and fine-tune them on your labeled CV dataset.
I would put more effort into creating the right dataset and then test gradually more complex models, selecting what best fits your needs.

Difference between Stanford CoreNLP and Stanford NER

What is the difference between using CoreNLP (https://stanfordnlp.github.io/CoreNLP/ner.html) and the standalone distribution Stanford NER (https://nlp.stanford.edu/software/CRF-NER.html) for doing Named Entity Recognition? I noticed that the standalone distribution comes with a GUI, but are there any other differences in terms of supported functionality?
I'm trying to decide which one to use for a commercial purpose. I'm working on English models only.
There's no difference in terms of what algorithm is run. I would suggest the full version since you can use the pipeline code. But both versions use the exact same code for the actual NER part.
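If the pipeline is driven from another language such as Python, one common setup is to run the CoreNLP server and call it over HTTP. A rough sketch, assuming a server already running locally on port 9000:

# Sketch: call a locally running CoreNLP server for NER over HTTP.
# Assumes the server was started separately, e.g.:
#   java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 9000
import json
import requests

def corenlp_ner(text, url="http://localhost:9000"):
    props = {"annotators": "tokenize,ssplit,pos,lemma,ner", "outputFormat": "json"}
    resp = requests.post(url, params={"properties": json.dumps(props)},
                         data=text.encode("utf-8"))
    resp.raise_for_status()
    doc = resp.json()
    return [(tok["word"], tok["ner"])
            for sentence in doc["sentences"]
            for tok in sentence["tokens"]]

print(corenlp_ner("Barack Obama was born in Hawaii."))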

opennlp sample training data for disease

I'm using OpenNLP for data classification. I could not find a TokenNameFinderModel for disease here. I know I can create my own model, but I was wondering whether there is any large sample training dataset available for disease.
You can easily create your own training dataset using the modelbuilder addon and follow the rules mentioned here to create a good NER model.
You can find some help on using the modelbuilder addon here.
Basically, you put all the text in one file and the NER entities in another. The addon searches for each entity and replaces it with the required tag, thereby producing the tagged data. It should be pretty easy to use this tool!
Hope this helps!
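To make the idea concrete, here is a rough sketch of the kind of output that process aims at: sentences with known disease mentions wrapped in OpenNLP's <START:disease> ... <END> markup, one sentence per line (the disease list and file names are assumptions, not part of the addon):

# Sketch: produce OpenNLP name-finder training data by wrapping known
# disease mentions in <START:disease> ... <END> markup, one sentence per line.
# The disease list and file names are assumptions.
import re

diseases = ["diabetes", "malaria", "tuberculosis"]
pattern = re.compile(r"\b(" + "|".join(map(re.escape, diseases)) + r")\b",
                     re.IGNORECASE)

def tag_sentence(sentence):
    return pattern.sub(r"<START:disease> \1 <END>", sentence)

with open("sentences.txt", encoding="utf-8") as src, \
     open("disease.train", "w", encoding="utf-8") as out:
    for line in src:
        out.write(tag_sentence(line.strip()) + "\n")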

Customizing my Own model in Stanford NER

Could I ask about Stanford NER? I'm trying to train my own model, to use it later for learning. According to the documentation, I have to add my own features in SeqClassifierFlags and add code for each feature in NERFeatureFactory.
My question is: I have my tokens with all features already extracted, and the last column represents the label. So, is there any way in Stanford NER to give it my tab-delimited file, which contains 30 columns (1 is the word, 28 are features, and 1 is the label), to train my own model without spending time on feature extraction?
Of course, in the testing phase, I will give it a file like the aforementioned one, but without the label, to predict the label.
Is this possible or not?
Many thanks in advance.
As explained in the FAQ page, the only way to customize the NER model is by inserting the data and specifying the features that you want to extract.
But wait ... you have the data, and you have already managed to extract the features, so I think you don't need the NER model, you need a classifier. I know this answer is pretty late, but maybe this classifier will be a good place to start.
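A minimal sketch of that classifier route, assuming a tab-delimited file with the word in the first column, 28 feature columns and the label last (the file name and the choice of scikit-learn are assumptions):

# Sketch: train a plain classifier on a tab-delimited file where the last
# column is the label and the preceding columns are precomputed features.
# The file name "tokens.tsv" and the use of scikit-learn are assumptions.
import csv
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

rows, labels = [], []
with open("tokens.tsv", encoding="utf-8") as f:
    for cols in csv.reader(f, delimiter="\t"):
        *features, label = cols
        rows.append({"f%d" % i: v for i, v in enumerate(features)})
        labels.append(label)

model = make_pipeline(DictVectorizer(), LogisticRegression())
model.fit(rows, labels)

# At test time, build rows the same way from the file without the label column:
# predicted = model.predict(test_rows)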

Clarification about opennlp algorithm

I'm trying to get started with OpenNLP. I need it to extract new organizations (startups) from news websites (for example, TechCrunch). I have a model with organizations, which I use to recognize organizations in publications (en-ner-organization). And here I have a question:
If there is a publication about a new startup that was born yesterday, will OpenNLP recognize it as an organization?
As far as I understand, no. Not until I train the model with this new startup, right?
If my assumptions are correct, the model partially consists of organization names, so if I want my model to recognize a new organization, I have to train it with its name.
Thanks
As far as I know, OpenNLP uses a statistical model to address named entity recognition: this means that, if OpenNLP has been properly trained with enough data, it should be able to recognize new startups (it's not a grep of known tokens over a file).
Of course metrics such as precision, recall and F1 are useful to determine the accuracy of the algorithm.
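For reference, those metrics are computed from the counts of correctly found, spurious and missed entities on a gold-annotated test set; a small worked sketch with placeholder counts:

# Sketch: precision, recall and F1 from true-positive, false-positive and
# false-negative counts obtained by comparing predicted entity spans against
# a gold-annotated test set.
def prf(true_positives, false_positives, false_negatives):
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# e.g. 80 correctly found organizations, 20 spurious, 40 missed:
print(prf(80, 20, 40))   # -> (0.8, 0.666..., 0.727...)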
