NLP: Is a gazetteer a cheat?

In NLP there is the concept of a gazetteer, which can be quite useful for creating annotations. As far as I understand:
A gazetteer consists of a set of lists containing names of entities such as cities, organisations, days of the week, etc. These lists are used to find occurrences of these names in text, e.g. for the task of named entity recognition.
So it is essentially a lookup. Isn't this kind of a cheat? If we use a gazetteer for detecting named entities, then there is not much natural language processing going on. Ideally, I would want to detect named entities using NLP techniques. Otherwise, how is it any better than a regex pattern matcher?
Does that make sense?

Depends on how you built/use your gazetteer. If you are presenting experiments in a closed domain and you custom-picked your gazetteer, then yes, you are cheating.
If you are using some openly available gazetteer and performing experiments on a large dataset, or using it in an application in the wild where you don't control the input, then you are fine.
We found ourselves in a similar situation. We partition our dataset and use the training data to automatically build our gazetteers. As long as you report your methodology, you should not feel like you are cheating (let the reviewers complain).
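The approach above, building a gazetteer from the training split and then using it as a lookup, can be sketched in a few lines. The function names and the tiny training set below are purely illustrative.

```python
# Minimal sketch: build a gazetteer from annotated training data,
# then tag unseen text by lookup. Labels other than "O" mark entities.

def build_gazetteer(annotated_sentences):
    """Collect every token labeled as an entity in the training data."""
    gazetteer = set()
    for tokens, labels in annotated_sentences:
        for token, label in zip(tokens, labels):
            if label != "O":
                gazetteer.add(token.lower())
    return gazetteer

def tag_with_gazetteer(tokens, gazetteer):
    """Lookup-based tagging: mark any token found in the gazetteer."""
    return ["ENT" if t.lower() in gazetteer else "O" for t in tokens]

train = [
    (["I", "visited", "Paris"], ["O", "O", "LOC"]),
    (["Oslo", "is", "cold"], ["LOC", "O", "O"]),
]
gaz = build_gazetteer(train)
print(tag_with_gazetteer(["Flying", "to", "Oslo"], gaz))
# → ['O', 'O', 'ENT']
```

Note that the lookup generalizes only to names seen in training, which is exactly why reporting the train/test split methodology matters.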

Related

semantic similarity for mix of languages

I have a database of several thousands of utterances. Each record (utterance) is a text representing a problem description, which a user has submitted to a service desk. Sometimes also the service desk agent's response is included. The language is highly technical, and it contains three types of tokens:
words and phrases in Language 1 (e.g. English)
words and phrases in Language 2 (e.g. French, Norwegian, or Italian)
machine-generated output (e.g. listing of files using unix command ls -la)
These languages are densely mixed. I often see that in one conversation, a sentence in Language 1 is followed by one in Language 2, so it is impossible to divide the data into two separate sets corresponding to utterances in the two languages.
The task is to find similarities between the records (problem descriptions). The purpose of this exercise is to understand whether some bugs submitted by users are similar to each other.
Q: What is the standard way to proceed in such a situation?
In particular, the problem lies in the fact that the words come from two different corpora, while in addition some technical words (like filenames, OS paths, or application names) will not be found in any corpus.
I don't think there's a "standard way" - just things you could try.
You could look into word embeddings that are aligned between languages, so that similar words across multiple languages have similar vectors. Then ways of building a summary vector for a text from word vectors (like a simple average of all a text's words' vectors), or pairwise comparisons based on word vectors (like Word Mover's Distance), may still work with mixed-language texts (even mixes of languages within one text).
That a single text, presumably about a single (or closely related) set of issues, has mixed language may be a blessing rather than a curse: some classifiers/embeddings you train from such texts might then be able to learn the cross-language correlations of words with shared topics. You could also consider enhancing your texts with extra synthetic auto-translated text for any monolingual ranges, to ensure downstream embeddings/comparisons get closer to your ideal of language-obliviousness.
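The averaging idea above can be sketched as follows. The two-dimensional "embedding table" here is made up for illustration; in practice you would load cross-lingually aligned vectors (e.g. fastText aligned embeddings or MUSE).

```python
import math

# Sketch: represent each text as the average of its words' vectors,
# then compare texts with cosine similarity. The vectors are toy values
# standing in for real cross-lingually aligned embeddings.

EMB = {
    "printer":    [0.9, 0.1],
    "imprimante": [0.88, 0.12],  # French for "printer"
    "broken":     [0.2, 0.8],
    "cassée":     [0.22, 0.78],  # French for "broken"
}

def text_vector(tokens):
    """Average the vectors of in-vocabulary tokens; None if none found."""
    vecs = [EMB[t] for t in tokens if t in EMB]
    if not vecs:
        return None
    return [sum(dim) / len(vecs) for dim in zip(*vecs)]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

v1 = text_vector(["printer", "broken"])
v2 = text_vector(["imprimante", "cassée"])
print(round(cosine(v1, v2), 3))
# → 1.0 (the toy French and English texts average to the same vector)
```

With genuinely aligned embeddings the similarity would not be exactly 1, but mixed-language paraphrases should still score high.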
Thank you for the suggestions. After several experiments I developed a method which is simple and works pretty well. Rather than using existing corpora, I created my own corpus from all the utterances available in my multilingual database, without translating them. The database has 130,000 utterances, comprising 3.5 million words (in three languages: English, French, and Norwegian) and 150,000 unique words. Phrase similarity based on the meaning space constructed this way works surprisingly well. I have tested this method in production and the results are good. I also see a lot of room for improvement, and will continue to polish it. I also wrote an article, "An approach to categorize multi-lingual phrases", describing all the steps in more detail. Criticism or suggestions for improvement welcome.

Rule based named entity recognizer without parts of speech label or any other information

I'm working on a project where I am trying to build a named entity recognizer from texts. Basically, I want to build the NER and experiment with it in 3 different ways.
First, I want to build it using only segmented sentences and tokenized words. To clarify, I want to input only split/tokenized words into the system. Once again, the NER system is rule-based, so it can only use rules to conclude what is a named entity. The first NER will not have any chunk information or part-of-speech labels, just the tokenized words. Efficiency is not the concern here; rather, the concern lies in comparing how the 3 different NERs perform. (The one I am asking about is the first one.)
I thought about it for a while and could not come up with any rules or ideas for solving this problem. One naive approach would be to mark as a named entity every word that begins with an uppercase letter and does not follow a period.
Am I missing anything? Any heads up or guidelines would help.
Typically NER relies on preprocessing such as part-of-speech tagging (named entities are typically nouns), so not having this basic information makes the task more difficult and therefore more prone to error. There will be certain patterns that you could look for, such as the one you suggest (although what do you do with sentence-initial named entities?). You could add certain regular expression patterns with prepositions, e.g. (Title_case_token)+ of (the)? (Title_case_token)+ would match "Leader of the Free World" and "Prime Minister of the United Kingdom" (a similar pattern with "the" instead of "of the" would catch "Alexander the Great"). You might also want to consider patterns to match acronyms such as "S.N.C.F.", "IBM", "UN", etc. A first step is probably to look for some lexical resources (i.e. word lists) like country names, first names, etc., and build from there.
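The title-case pattern above can be written directly as a regular expression. This is a rough sketch of the idea, not a production pattern; note how it also demonstrates the sentence-initial caveat by swallowing the leading "The".

```python
import re

# Sketch of the pattern idea: one or more title-case tokens, with an
# optional "of (the)" tail joining a second title-case sequence.

PATTERN = re.compile(
    r"\b[A-Z][a-z]+(?:\s[A-Z][a-z]+)*"                      # Title_case run
    r"(?:\sof(?:\sthe)?\s[A-Z][a-z]+(?:\s[A-Z][a-z]+)*)?"   # optional "of (the)" tail
)

text = "The Prime Minister of the United Kingdom met the Leader of the Free World."
matches = PATTERN.findall(text)
print(matches)
# → ['The Prime Minister of the United Kingdom', 'Leader of the Free World']
# Note the sentence-initial "The" is (wrongly) captured — exactly the
# ambiguity the answer warns about.
```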
You could use spaCy (Python) or TokensRegex (Java) to do token-based matching (and not use the linguistic features they add to the tokens).

Techniques other than RegEx to discover 'intent' in sentences

I'm embarking on a project for a non-profit organization to help process and classify thousands of reports annually from their field workers/contractors the world over. I'm relatively new to NLP and as such wanted to seek the group's guidance on an approach to solve our problem.
I'll highlight the current process, and our challenges and would love your help on the best way to solve our problem.
Current process: Field officers submit reports from locally run projects in the form of best practices. These reports are then processed by a full-time team of curators who (i) ensure they adhere to a best-practice template and (ii) edit the documents to improve language/style/grammar.
Challenge: As the number of field workers has increased, the volume of reports being generated has grown, and our editors are now becoming the bottleneck.
Solution: We would like to automate the first step of our process, i.e., checking the document for compliance with the organizational best-practice template.
Basically, we need to ensure every report has 3 components namely:
1. States its purpose: What topic / problem does this best practice address?
2. Identifies Audience: Who is this for?
3. Highlights Relevance: What can the reader do after reading it?
Here's an example of a good report submission.
"This document introduces techniques for successfully applying best practices across developing countries. This study is intended to help low-income farmers identify a set of best practices for pricing agricultural products in places where there is no price transparency. By implementing these processes, farmers will be able to get better prices for their produce and raise their household incomes."
As of now, our approach has been to use RegEx and check for keywords. i.e., to check for compliance we use the following logic:
1. To check "states purpose", we do a regex match on 'purpose', 'intent'
2. To check "identifies audience", we do a regex match on 'identifies', 'is for'
3. To check "highlights relevance", we do a regex match on 'able to', 'allows', 'enables'
The current approach of regex seems very primitive and limited, so I wanted to ask the community if there is a better way to solve this problem using something like NLTK or CoreNLP.
Thanks in advance.
Interesting problem; I believe it is a genuine research problem! In natural language processing, there are a few techniques that learn and extract templates from text and can then use them as gold annotations to identify whether a document follows the template structure. Researchers have used this kind of system for automatic question answering (extract templates from questions and then answer them). Your case is more difficult, though, because you need to learn the structure from a report. From an NLP perspective your problem is harder to address (no simple NLP task matches your problem definition), but you may not need any fancy (complex) model to solve it.
You can start with simple document matching and compute a similarity score. If you have a large collection of positive examples (well-formatted and well-specified reports), you can construct a dictionary based on tf-idf weights, then check for the presence of the dictionary tokens. You can also think of this as a binary classification problem. There are good machine learning classifiers, such as SVM and logistic regression, which work well for text data. You can use Python and scikit-learn to build programs quickly, and they are pretty easy to use. For text pre-processing, you can use NLTK.
Since the reports will be generated by field workers and there are a few questions that the reports must answer (you mentioned 3 specific components), I guess simple keyword-matching techniques will be a good start for your research. You can gradually move in different directions based on your observations.
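The tf-idf plus binary classification idea can be sketched with scikit-learn, as the answer suggests. The tiny example reports and labels below are made up; a real system would train on a few hundred curated reports.

```python
# Sketch: tf-idf features feeding a logistic regression classifier that
# decides whether a report follows the template. Data is illustrative.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

reports = [
    "The purpose of this study is to help farmers price their produce.",
    "This guide is intended for field officers and enables better reporting.",
    "We went to the market on Tuesday.",
    "Lunch was served at noon.",
]
labels = [1, 1, 0, 0]  # 1 = follows the template, 0 = does not

model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(reports, labels)

new_report = "The purpose of this document is to help low-income farmers."
# With this much word overlap with the positives, the model should predict 1.
print(model.predict([new_report])[0])
```

With realistic data you would also hold out a test set and look at precision/recall per template component rather than a single yes/no label.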
This seems like a perfect scenario to apply some machine learning to your process.
First of all, the data annotation problem is covered; this is usually the most annoying part. Thankfully, you can rely on the curators, who can mark the specific sentences that specify audience, relevance, and purpose.
Train some models to identify these types of clauses. If all the classifiers fire for a certain document, it means that the document is properly formatted.
If errors are encountered, make sure to retrain the models with the specific examples.
If you don't give yourself hints about the format of the document, this is an open problem.
What you can do, though, is ask the people writing reports to conform to some format for the document, like having 3 parts, each of which has a pre-defined title, like so:
1. Purpose
Explains the purpose of the document in several paragraphs.
2. Topic / Problem
This addresses the foobar problem, also known as lorem ipsum filler text.
3. Take away
What can the reader do after reading it?
You parse this document from .doc format, for instance, and extract the three parts. Then you can run spell-checking, grammar, and text-complexity algorithms. Finally, you can extract, for instance, named entities (cf. Named Entity Recognition) and low TF-IDF words.
I've been trying to do something very similar with clinical trials, where most of the data is again written in natural language.
If you do not care about past data, and have control over what the field officers write, maybe you can have them provide these 3 extra fields in their reports, and you would be done.
Otherwise: CoreNLP and OpenNLP, the libraries I'm most familiar with, have some tools that can help you with part of the task. For example, if your regex pattern matches a word that starts with the prefix "inten", the actual word could be "intention", "intended", "intent", "intentionally", etc., and you wouldn't necessarily know if the word is a verb, a noun, an adjective, or an adverb. The POS taggers and parsers in these libraries can tell you the part of speech of the word, and maybe you only care about the verbs that start with "inten", or, more strictly, the verbs in the third person singular.
CoreNLP has another tool called OpenIE, which attempts to extract relations in a sentence. For example, given the following sentence
Born in a small town, she took the midnight train going anywhere
CoreNLP can extract the triple
she, took, midnight train
Combined with the POS tagger for example; you would also know that "she" is a personal pronoun and "took" is a past tense verb.
These libraries can accomplish many other tasks such as tokenization, sentence splitting, and named entity recognition and it would be up to you to combine all of these tools with your domain knowledge and creativity to come up with a solution that works for your case.
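The prefix-plus-POS filtering idea described above can be sketched as follows. The tiny POS lookup stands in for the output of a real tagger (such as the CoreNLP or OpenNLP taggers mentioned); the tag names follow the Penn Treebank convention.

```python
# Sketch: a prefix match alone ("inten") is ambiguous, so pair each
# candidate with its part-of-speech tag before deciding. The POS dict
# is a hypothetical stand-in for a real tagger's output.

POS = {
    "The": "DT", "intended": "VBN", "intention": "NN",
    "was": "VBD", "good": "JJ",
}

def prefix_matches(tokens, prefix, keep_pos):
    """Keep only prefix matches whose POS tag is in keep_pos."""
    return [t for t in tokens
            if t.lower().startswith(prefix) and POS.get(t) in keep_pos]

tokens = ["The", "intention", "was", "good"]
print(prefix_matches(tokens, "inten", {"NN"}))   # → ['intention']
print(prefix_matches(tokens, "inten", {"VBN"}))  # → []
```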

Focused Named Entity Recognition (NER)?

I want to recognize named entities in a specific field (e.g. baseball). I know there are tools available like StanfordNER, LingPipe, and AlchemyAPI, and I have done a little testing with them. But I want them to be field-specific, as I mentioned earlier. How is this possible?
One approach may be to
Use a general (non-domain specific) tool to detect people's names
Use a subject classifier to filter out texts that are not in the domain
If the total size of the data set is sufficient and the accuracy of the extractor and classifier is good enough, you can use the result to obtain a list of people's names that are closely related to the domain in question (e.g. by restricting the results to those mentioned significantly more often in domain-specific texts than in other texts).
In the case of baseball, this should be a fairly good way of getting a list of people related to baseball. It would, however, not be a good way to obtain a list of baseball players only. For the latter it would be necessary to analyse the precise context in which the names are mentioned and the things said about them; but perhaps that is not required.
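The frequency-ratio filter described above can be sketched in a few lines. The counts and the ratio threshold below are toy values for illustration.

```python
from collections import Counter

# Sketch: keep names mentioned much more often in domain (baseball)
# texts than in general texts. Counts and threshold are illustrative.

domain_counts = Counter({"Babe Ruth": 40, "John Smith": 3})
general_counts = Counter({"Babe Ruth": 5, "John Smith": 30})

def domain_specific(names, ratio=3.0):
    """Keep names whose domain/general mention ratio exceeds `ratio`."""
    keep = []
    for name in names:
        d = domain_counts[name]
        g = general_counts[name] or 1  # avoid division by zero
        if d / g >= ratio:
            keep.append(name)
    return keep

print(domain_specific(["Babe Ruth", "John Smith"]))
# → ['Babe Ruth']
```

In practice you would smooth the counts and pick the threshold on held-out data rather than hard-coding it.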
Edit: By subject classifier I mean the same as what other people might refer to simply as categorization, document classification, domain classification, or similar. Examples of ready-to-use tools include the classifier in Python-NLTK (see here for examples) and the one in LingPipe (see here).
Have a look at smile-ner.appspot.com, which covers 250+ categories. In particular, it covers a lot of persons/teams/clubs in sports. It may be useful for your purpose.

How to find references to dates in natural text?

What I want to do is to parse raw natural text and find all the phrases that describe dates.
I've got a fairly big corpus with all the references to dates marked up:
I met him <date>yesterday</date>.
Roger Zelazny was born <date>in 1937</date>
He'll have a hell of a hangover <date>tomorrow morning</date>
I don't want to interpret the date phrases, just locate them. The fact that they're dates is irrelevant (in real life they're not even dates, but I don't want to bore you with the details); basically, it's just an open-ended set of possible values. The grammar of the values themselves can be approximated as context-free; however, it is quite complicated to build manually, and with increasing complexity it gets increasingly hard to avoid false positives.
I know this is a bit of a long shot so I'm not expecting an out-of-the-box solution to exist out there, but what technology or research can I potentially use?
One of the generic approaches used in academia and in industry is based on Conditional Random Fields. Basically, it is a special probabilistic model: you train it first with your marked-up data, and then it can label certain types of entities in a given text.
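The training input for a CRF is typically a per-token feature dictionary paired with BIO labels derived from the markup. Here is a sketch of that preprocessing step; a library such as CRFsuite (or the sklearn-crfsuite wrapper) consumes exactly this kind of input. The feature names are illustrative.

```python
# Sketch: turn tokens into feature dicts and marked-up spans into BIO
# labels — the representation a CRF trainer would consume.

def token_features(tokens, i):
    """Features for token i: surface form, shape, and local context."""
    token = tokens[i]
    return {
        "word.lower": token.lower(),
        "is_title": token.istitle(),
        "is_digit": token.isdigit(),
        "prev": tokens[i - 1].lower() if i > 0 else "<s>",
        "next": tokens[i + 1].lower() if i < len(tokens) - 1 else "</s>",
    }

tokens = ["Roger", "Zelazny", "was", "born", "in", "1937"]
labels = ["O", "O", "O", "O", "B-DATE", "I-DATE"]  # from <date>in 1937</date>

features = [token_features(tokens, i) for i in range(len(tokens))]
print(features[5]["is_digit"], labels[5])
# → True I-DATE
```

The CRF then learns, for example, that a digit token preceded by "in" is likely inside a date span, which is exactly the kind of pattern hand-written grammars struggle to cover without false positives.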
You can even try one of the systems from Stanford Natural Language Processing Group: Stanford Named Entity Recognizer
When you download the tool, note there are several models, you need the last one:
Included with the Stanford NER are a 4 class model trained for CoNLL,
a 7 class model trained for MUC, and a 3 class model trained on both
data sets for the intersection of those class sets.
3 class Location, Person, Organization
4 class Location, Person, Organization, Misc
7 class Time, Location, Organization, Person, Money, Percent, Date
Update. You can actually try that tool online here. Select the muc.7class.distsim.crf.ser.gz classifier and try some text with dates. It doesn't seem to recognize "yesterday", but it recognizes "20th century", for example. In the end, this is a matter of CRF training.
Keep in mind that CRFs are rather slow to train and require human-annotated data, so doing it yourself is not easy. Read the answers to this question for another example of how people often do it in practice; it doesn't have much in common with current academic research.
