Token sequence labeling - nlp

I have a task that is half matching and half entity extraction: I want to label words that, in certain contexts, refer to a particular label. Named entity recognition would seem to be the way to go, but these words do not necessarily share any structure (they can be verbs, nouns, etc.). I could simply use a dictionary, but I would like to use context to label them. I am having trouble finding a solution to this problem. Can NER be used for this, or is it a completely different task?
To give an example, suppose I am interested in the category "customer acceptance". These could be possible sentences: "this is a fair amount of data!" and "this condition is not fair". I want my word extractor to find only the second 'fair'.
In other words, it is like a dictionary that takes context into account.
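One common way to frame this is as NER-style token labeling with a custom label, so the model learns from surrounding context which occurrences to tag. Below is a minimal sketch under that assumption, using spaCy 3.x; the label name, the two toy sentences, and the tiny training loop are purely illustrative, and a usable model would need many annotated examples.

    import random
    import spacy
    from spacy.training import Example

    nlp = spacy.blank("en")
    ner = nlp.add_pipe("ner")
    ner.add_label("CUSTOMER_ACCEPTANCE")          # hypothetical custom label

    # Character offsets 22-26 cover the second, in-context 'fair'.
    TRAIN_DATA = [
        ("this condition is not fair", {"entities": [(22, 26, "CUSTOMER_ACCEPTANCE")]}),
        ("this is a fair amount of data!", {"entities": []}),
    ]

    optimizer = nlp.initialize()
    for _ in range(30):                           # toy loop; real training needs far more data
        random.shuffle(TRAIN_DATA)
        for text, annotations in TRAIN_DATA:
            example = Example.from_dict(nlp.make_doc(text), annotations)
            nlp.update([example], sgd=optimizer)

    doc = nlp("this condition is not fair")
    print([(ent.text, ent.label_) for ent in doc.ents])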

Related

How to identify similar words using word2vec

Input: I have a set of words (N) and an input sentence.
Problem statement:
The sentence is dynamic; the user can give any sentence related to one business domain. We have to map the input sentence tokens to the set of words based on closeness.
For example, the same question can be asked with different words, and it is hard to maintain all the synonyms, so if we have a mechanism to find similar words we can map them easily.
1) A meeting scheduled by john
2) A meeting organized by john
The user can frame a sentence in different ways, as in the example above.
'scheduled' and 'organized' are very close.
The set N contains the word 'scheduled'; if the user gives a sentence like (2), I have to map 'organized' to 'scheduled'.
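One straightforward way to do this mapping, sketched here assuming gensim and a pretrained word2vec file (the path below is just a placeholder), is to pick for each sentence token the most similar word in N by cosine similarity:

    from gensim.models import KeyedVectors

    # Placeholder path to pretrained vectors; any word2vec-format file works.
    kv = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin", binary=True)

    N = {"scheduled", "meeting", "cancel"}        # the fixed word set
    sentence = "A meeting organized by john".lower().split()

    for token in sentence:
        if token not in kv:
            continue
        # In practice you would also require a minimum similarity threshold.
        best = max((w for w in N if w in kv), key=lambda w: kv.similarity(token, w))
        print(token, "->", best, round(kv.similarity(token, best), 3))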
Take a look at "Word Mover's Distance", a way to calculate differences between texts that's essentially based on "bags of word-vectors". It can be expensive to calculate, especially on longer texts, but generally identifies "similar" ranges-of-text better than a simple baseline like "average of all word-vectors".
Beyond that, some of the deep-neural-network methods of vectorizing text – BERT, ELMo, etc. – may do an even more effective job of placing such "similar intent, different words" texts into close positions in a shared coordinate space.
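As a hedged sketch of the Word Mover's Distance suggestion (again assuming gensim and a placeholder path to pretrained vectors; recent gensim versions may also need an optional solver package such as POT for wmdistance):

    from gensim.models import KeyedVectors

    kv = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin", binary=True)

    s1 = "a meeting scheduled by john".split()
    s2 = "a meeting organized by john".split()
    s3 = "the quarterly report is overdue".split()

    print(kv.wmdistance(s1, s2))   # expected to be small: near-synonymous sentences
    print(kv.wmdistance(s1, s3))   # expected to be larger: unrelated sentence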

Rule-based named entity recognizer without part-of-speech labels or any other information

I'm working on a project where I am trying to build a named entity recognizer from texts. Basically, I want to build and experiment with the NER in 3 different ways.
First, I want to build it using only segmented sentences -> tokenized words. To clarify, I want to input only split/tokenized words into the system. Once again, the NER system is rule-based, so it can only use rules to decide which words are named entities. This first NER will not have any chunk information or part-of-speech labels, just the tokenized words. Efficiency is not the concern here; the concern is comparing how the 3 different NERs perform. (The one I am asking about is the first one.)
I thought about it for a while and could not come up with any rules or ideas for solving this problem. One naive approach would be to treat every word that begins with an uppercase letter and does not follow a period as a named entity.
Am I missing anything? Any heads up or guidelines would help.
Typically NER relies on preprocessing such as part-of-speech tagging (named entities are typically nouns), so not having this basic information makes the task more difficult and therefore more prone to error. There will be certain patterns that you could look for, such as the one you suggest (although what do you do with sentence-initial named entities?). You could add certain regular-expression patterns with prepositions, e.g. (Title_case_token)+ of (the)? (Title_case_token)+ would match "Leader of the Free World" and "Prime Minister of the United Kingdom", and a similar pattern with "the" would match "Alexander the Great". You might also want to consider patterns to match acronyms such as "S.N.C.F.", "IBM", "UN", etc. A first step is probably to look for some lexical resources (i.e. word lists) like country names, first names, etc., and build from there.
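A rough illustration of how such surface patterns could look as Python regular expressions (a crude sketch, not a complete rule set; the patterns and example sentence are made up):

    import re

    TITLE = r"[A-Z][a-z]+"
    # One or more title-case tokens, optionally joined by "of (the)" or "the".
    MULTIWORD = rf"{TITLE}(?:\s+(?:of\s+(?:the\s+)?|the\s+)?{TITLE})*"
    # Dotted or undotted all-caps acronyms such as "S.N.C.F.", "IBM", "UN".
    ACRONYM = r"(?:[A-Z]\.){2,}|[A-Z]{2,}"

    pattern = re.compile(rf"{MULTIWORD}|{ACRONYM}")

    text = "The Prime Minister of the United Kingdom met Alexander the Great at the UN."
    for match in pattern.finditer(text):
        # Note how sentence-initial "The" gets swallowed -- the caveat mentioned above.
        print(match.group())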
You could use spaCy (Python) or TokensRegex (Java) to do token-based matching (and not use the linguistic features they add to the tokens).
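For the spaCy route, a minimal token-based matching sketch (assuming spaCy 3.x; it relies only on surface token attributes, with no POS tags or statistical model) might look like this:

    import spacy
    from spacy.matcher import Matcher

    nlp = spacy.blank("en")                        # tokenizer only, no trained components
    matcher = Matcher(nlp.vocab)

    # One or more consecutive title-cased tokens, keeping only the longest span.
    matcher.add("TITLECASE_SEQ", [[{"IS_TITLE": True, "OP": "+"}]], greedy="LONGEST")
    # All-caps tokens of length >= 2, e.g. "IBM", "UN".
    matcher.add("ACRONYM", [[{"IS_UPPER": True, "LENGTH": {">=": 2}}]])

    doc = nlp("The Prime Minister of the United Kingdom visited IBM.")
    for match_id, start, end in matcher(doc):
        # Lowercase "of the" splits the title-case run into two spans.
        print(nlp.vocab.strings[match_id], doc[start:end].text)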

How can I use Natural Language Processing to check if a paragraph contains predefined topics?

We have a system that allows users to answer a question as free text and we want to check whether their answer contains any of our predefined topics. These topics will be defined prior to answers being submitted.
We tried to use a method similar to spam detection, but that only gives a binary true/false, correct/incorrect decision. We need the response to say which of the predefined topics a piece of text contains. Is there an algorithm that would solve this problem?
You could try a "bag of words" representation for feature extraction and a multinomial naive Bayes classifier for classification.
This is described in more detail on this page: link.
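A minimal sketch of that suggestion (assuming scikit-learn; the training texts and topic labels are made-up placeholders), framed as multi-label classification so an answer can be tagged with any number of the predefined topics:

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.multiclass import OneVsRestClassifier
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import MultiLabelBinarizer

    # Tiny made-up training set: free-text answers and the topics they contain.
    train_texts = [
        "I accept the proposed terms and conditions",
        "the delivery arrived two weeks late",
        "terms are acceptable but shipping was slow",
    ]
    train_topics = [["acceptance"], ["delivery"], ["acceptance", "delivery"]]

    mlb = MultiLabelBinarizer()
    y = mlb.fit_transform(train_topics)

    model = make_pipeline(CountVectorizer(), OneVsRestClassifier(MultinomialNB()))
    model.fit(train_texts, y)

    pred = model.predict(["shipping took far too long"])
    print(mlb.inverse_transform(pred))   # e.g. [('delivery',)]; tiny data, so only illustrative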
You could also try explicit semantic analysis (ESA) [1][2]. Given a set of documents that represent concepts (in your case, your topics), you can train a model, and given any new sentence as input you get a ranked list of the closest concepts "evoked" by that sentence. Of course, this assumes you have a document with some text describing every concept you want to identify (which is why the most common choice is to use Wikipedia pages as concepts), but if that is the case you could give it a try. A much-simplified sketch of the ranked-concept idea follows the references below.
[1] https://en.wikipedia.org/wiki/Explicit_semantic_analysis
[2] http://www.cs.technion.ac.il/~gabr/papers/ijcai-2007-sim.pdf
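This is not a real ESA implementation, just a simplified sketch of the ranked-concept idea using TF-IDF and cosine similarity (assuming scikit-learn; the topic descriptions are short made-up placeholders, whereas ESA proper would use rich concept documents such as Wikipedia pages):

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    topics = {
        "delivery":   "shipping delivery courier parcel arrived late on time",
        "acceptance": "accept agree approve terms conditions satisfied happy",
        "pricing":    "price cost expensive cheap discount refund billing",
    }

    names = list(topics)
    vectorizer = TfidfVectorizer()
    topic_matrix = vectorizer.fit_transform(topics[n] for n in names)

    answer = ["the parcel showed up two weeks late and I want my money back"]
    scores = cosine_similarity(vectorizer.transform(answer), topic_matrix)[0]

    # Ranked list of the closest topics for this answer.
    for name, score in sorted(zip(names, scores), key=lambda pair: -pair[1]):
        print(f"{name}: {score:.3f}")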

Methods for extracting locations from text?

What are the recommended methods for extracting locations from free text?
One thing I can think of is using regex rules like "words ... in location", but are there better approaches than this?
I can also think of keeping a lookup hash table with names of countries and cities and comparing every token extracted from the text against that table.
Does anybody know of better approaches?
Edit: I'm trying to extract locations from tweet text, so the high volume of tweets might also affect my choice of method.
All rule-based approaches will fail (if your text is really "free"). That includes regex, context-free grammars, any kind of lookup... Believe me, I've been there before :-)
This problem is called Named Entity Recognition. Location is one of the 3 most studied classes (with Person and Organization). Stanford NLP has an open source Java implementation that is extremely powerful: http://nlp.stanford.edu/software/CRF-NER.shtml
You can easily find implementations in other programming languages.
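As a concrete illustration in Python (swapping in spaCy's pretrained statistical NER here rather than the Stanford Java tool; this assumes the en_core_web_sm model has been downloaded with python -m spacy download en_core_web_sm), pulling out location-like entities looks like this:

    import spacy

    nlp = spacy.load("en_core_web_sm")
    doc = nlp("Flooding hit New Orleans while I was flying from Paris to Tokyo.")

    # GPE, LOC and FAC are the labels spaCy uses for location-like entities.
    locations = [ent.text for ent in doc.ents if ent.label_ in ("GPE", "LOC", "FAC")]
    print(locations)   # e.g. ['New Orleans', 'Paris', 'Tokyo']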
Put all of your valid locations into a sorted list. If you are planning on comparing case-insensitively, make sure the case of your list is already normalized.
Then all you have to do is loop over the individual "words" in your input text and, at the start of each new word, start a new binary search in your location list. As soon as you find a non-match, you can skip the entire word and proceed to the next.
Possible problem: multi-word locations such as "New York", "3rd Street", "People's Republic of China". Perhaps all it takes, though, is to save the position of the first word whenever your bsearch leads you to a (possible!) multi-word result. Then, if the full comparison fails -- possibly several words later -- all you have to do is revert to the word following the one where you started.
As to what a "word" is: while you are preparing your location list, make a list of all characters that may appear inside locations. Only phrases that contain characters from this list can be considered a valid 'word'.
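A compact sketch of this sorted-list approach (plain Python with the bisect module; the location list is a tiny made-up sample):

    import bisect

    locations = sorted(s.lower() for s in [
        "New York", "3rd Street", "People's Republic of China", "Paris", "China",
    ])

    def find_locations(text):
        words = [w.strip(".,!?") for w in text.split()]
        found, i = [], 0
        while i < len(words):
            phrase, best, j = "", None, i
            while j < len(words):
                phrase = (phrase + " " + words[j].lower()).strip()
                k = bisect.bisect_left(locations, phrase)
                if k < len(locations) and locations[k] == phrase:
                    best = j + 1      # exact match so far; keep extending for longer locations
                if k == len(locations) or not locations[k].startswith(phrase):
                    break             # no entry starts this way: give up on this start word
                j += 1
            if best:
                found.append(" ".join(words[i:best]))
                i = best              # skip past the matched phrase
            else:
                i += 1                # revert and move on to the next word
        return found

    print(find_locations("He moved from New York to the People's Republic of China."))
    # -> ['New York', "People's Republic of China"]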
How fast are the tweets coming in? Is it the full Twitter firehose or some filtered queries?
A somewhat more sophisticated approach, similar to what you described, is to use an NLP tool that is integrated with a gazetteer.
Very few NLP tools will keep up with Twitter rates, and very few do well with Twitter because of all of the leetspeak. The NLP can be tuned for precision or recall, depending on your needs, to limit the number of lookups performed in the gazetteer.
I recommend looking at Rosoka (also Rosoka Cloud through Amazon AWS) and GeoGravy.

Finding how relevant a text is, given a whitelist and blacklist of words/phrases

This is a case of me wanting to search for something online but not knowing what it's called.
I have a collection of job descriptions in text files, some only a sentence or two long, most a paragraph or two. I want to write a script that, given a set of rules, will notify me when it finds a job description I would want.
For example, let's say I am looking for a job in PHP programming, but not a full-time position and not a designing position. So my "rule book" could be:
want: PHP
want: web programming
want: telecommuting
do not want: designing
do not want: full-time position
What is a method I could use to sort these files into a "pass" (descriptions that match what I'm looking for) and a "fail" (descriptions that are not relevant)? Some ideas I was considering:
Count the occurrences of the phrases from my "rule book" in the text file, and reject descriptions that contain words that I do not want. This doesn't always work, though: what if a description says "web designing not required"? Then my algorithm would say "that contains the word designing, so it is not relevant" when it really was relevant!
When searching the text for phrases that I do and do not want, count phrases within a certain Levenshtein distance as the same phrase. For example, "designing" and "design" should be treated the same way, as should misspellings of words, such as "programing".
I have a large collection of descriptions that I have looked through manually. Is there a way I could "teach" the program "these are examples of good descriptions, these are examples of bad ones"?
Does anyone know what this "filtering process" is called, and/or have any advice or methods on how I can accomplish this?
You basically have a text classification or document classification problem. This is a specific case of binary classification, which is itself a specific case of supervised learning. It's a well-studied problem and there are many tools to do it. Basically, you give a set of good documents and bad documents to a learning or training process, which finds words that correlate strongly with positive and negative documents and outputs a function capable of classifying unseen documents as positive or not. Naive Bayes is the simplest learning algorithm for this kind of task, and it will do a decent job. There are fancier algorithms like Logistic Regression and Support Vector Machines which will probably do somewhat better, but they are more complicated.
To determine which word variants are actually equivalent to each other, you want to do some kind of stemming. The Porter stemmer is a common choice here.
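A minimal sketch of that binary "pass"/"fail" classifier (assuming scikit-learn and NLTK; the training texts and labels are made-up placeholders), with Porter stemming so that variants like "design" and "designing" collapse to one feature:

    import re
    from nltk.stem import PorterStemmer
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline

    stemmer = PorterStemmer()

    def stem_tokens(text):
        # Lowercase, split into word tokens, and stem each one.
        return [stemmer.stem(tok) for tok in re.findall(r"[a-z0-9\-]+", text.lower())]

    train_texts = [
        "PHP web programming, telecommuting welcome",
        "full-time web designing position",
        "remote PHP developer, part-time web programming",
        "in-house graphic design, full-time",
    ]
    labels = [1, 0, 1, 0]   # 1 = "pass" (want), 0 = "fail" (do not want)

    model = make_pipeline(CountVectorizer(tokenizer=stem_tokens), MultinomialNB())
    model.fit(train_texts, labels)

    print(model.predict(["telecommuting PHP programming job"]))   # likely [1]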
