opennlp sample training data for disease - nlp

I'm using OpenNLP for data classification. I could not find TokenNameFinderModel for disease here. I know I can create my own model but I was wondering is there any large sample training data available for disease?

You can easily create your own training data-set using the modelbuilder addon and follow some rules as mentioned here to train create a good NER model.
you can find some help using modelbuilder addon here.
It is basically, you put all the information in a text file and the NER entities in another. The addon searches for a paticular entity and replace it with the required tag. Hence producing the tagged data. It must be pretty easy to use this tool!
Hope this helps!

Related

How to fine tune BERT on unlabeled data?

I want to fine tune BERT on a specific domain. I have texts of that domain in text files. How can I use these to fine tune BERT?
I am looking here currently.
My main objective is to get sentence embeddings using BERT.
The important distinction to make here is whether you want to fine-tune your model, or whether you want to expose it to additional pretraining.
The former is simply a way to train BERT to adapt to a specific supervised task, for which you generally need in the order of 1000 or more samples including labels.
Pretraining, on the other hand, is basically trying to help BERT better "understand" data from a certain domain, by basically continuing its unsupervised training objective ([MASK]ing specific words and trying to predict what word should be there), for which you do not need labeled data.
If your ultimate objective is sentence embeddings, however, I would strongly suggest you to have a look at Sentence Transformers, which is based on a slightly outdated version of Huggingface's transformers library, but primarily tries to generate high-quality embeddings. Note that there are ways to train with surrogate losses, where you try to emulate some form ofloss that is relevant for embeddings.
Edit: The author of Sentence-Transformers recently joined Huggingface, so I expect support to greatly improve over the upcoming months!
#dennlinger gave an exhaustive answer. Additional pretraining is also referred as "post-training", "domain adaptation" and "language modeling fine-tuning". here you will find an example how to do it.
But, since you want to have good sentence embeddings, you better use Sentence Transformers. Moreover, they provide fine-tuned models, which already capable of understanding semantic similarity between sentences. "Continue Training on Other Data" section is what you want to further fine-tune the model on your domain. You do have to prepare training dataset, according to one of available loss functions. E.g. ContrastLoss requires a pair of texts and a label, whether this pair is similar.
I believe transfer learning is useful to train the model on a specific domain. First you load the pretrained base model and freeze its weights, then you add another layer on top of the base model and train that layer based on your own training data. However, the data would need to be labelled.
Tensorflow has some useful guide on transfer learning.
You are talking about pre-training. Fine-tuning on unlabeled data is called pre-training and for getting started, you can take a look over here.

Detecting questions in text

I have a project where I need to analyze a text to extract some information if the user who post this text need help in something or not, I tried to use sentiment analysis but it didn't work as expected, my idea was to get the negative post and extract the main words in the post and suggest to him some articles about that subject, if there is another way that can help me please post it below and thanks.
for the dataset i useed, it was a dataset for sentiment analyze, but now I found that it's not working and I need a dataset use for this subject.
Please use the NLP methods before processing the sentiment analysis. Use the TFIDF, Word2Vector to create vectors on the given dataset. And them try the sentiment analysis. You may also need glove vector for the conducting analysis.
For this topic I found that this field in machine learning is called "Natural Language Questions" it's a field where machine learning models trained to detect questions in text and suggesting answer for them based on data set you are working with, check this article for more detail.

Training an NER classifier to recognise Author names

I want to use NER(CRF classifier) to identify Author names in a query. I trained NER following the method given in nlp.stanford.edu site using the training file:training-data.col. And tested using the file:testing-data.tsv.
The NER is tagging every input as Author, even the data that is tagged as non-Author in the training data. Can anyone tell me why NER is tagging the non-Authors in training data as Authors and how to train NER to identify Authors(I have the list of Author names to train).
Any suggestions for reference material on NER other than nlp.stanford.edu site will be helpful.
That's a very small piece of training data, so I'm not surprised that it made the wrong inferences. Since the only example it has seen of "Atal" is as Author, it's tagging "Atal" as such.
But more so, if you want to discriminate between people listed at the beginning as Author and people listed in the text as 0, Stanford NER is not going to do that. Stanford NER is intended to make long distance inferences about the named-entity tags of tokens in natural language text. In other words, it's doing the opposite of what you're trying to do.
You could probably do this with some simple pattern recognition---if your documents are formatted in a similar way, with the authors together, I would start with exploiting that. You could use the NER to tag the authors as PERSON, and then use that tag as a feature in your own tagging.

Customizing my Own model in Stanford NER

Could I ask about Stanford NER?? Actually, I'm trying to train my own model, to use it later for learning. According to the documentation, I have to add my own features in SeqClassifierFlags and add code for each Feature in NERFeatureFactory.
My questions is that, I have my tokens with all features extracted and Last column represents the label. So, is there any way in Stanford NER to give it my Tab-Delimeted file which contains 30 columns (1 is word , 28 are featurs, and 1 is label) to train my own model without spending time for extracting features???
Of course, in Testing phase, I will give it a file like the the aforementioned file without label to predict the label.
Is this possible or Not??
Many thanks in Advance
As explained in the FAQ page, the only way to customize the NER model is by inserting the data and specifying the features that you want to extract.
But, wait ... you have the data, and you have managed to extract the features, so I think you don't need the NER model, you need a classifier. I know this answer is pretty pretty late, but maybe this classifier will be a good place to start.

NLTK NER: Continuous Learning

I have been trying to use NER feature of NLTK. I want to extract such entities from the articles. I know that it can not be perfect in doing so but I wonder if there is human intervention in between to manually tag NEs, will it improve?
If yes, is it possible with present model in NLTK to continually train the model. (Semi-Supervised Training)
The plain vanilla NER chunker provided in nltk internally uses maximum entropy chunker trained on the ACE corpus. Hence it is not possible to identify dates or time, unless you train it with your own classifier and data(which is quite a meticulous job).
You could refer this link for performing he same.
Also, there is a module called timex in nltk_contrib which might help you with your needs.
If you are interested to perform the same in Java better look into Stanford SUTime, it is a part of Stanford CoreNLP.

Resources