Google NLP entity analysis found more entities of the same word - google-cloud-nl

I am learning how to use the Natural Language API with Python and there are a couple of things I could not find in the documentation:
https://cloud.google.com/natural-language/docs/basics
If I call entity analysis in German on the HTML text of the following URL:
https://www.hahn-bestattungen.de
I get that the word 'Bestattungen' has 4 entities, but on the website it only appears 3 times (Ctrl+F).
I actually expected the response to have only one entity per word, grouping the information about the salience and the number of times the word appears in the text, but in the end there are several entities created out of the same word: one with a very high salience value and the rest with very low salience values. It happened as well with other URLs I tried. Why is that?
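Roughly, this is the kind of call I am making (a minimal sketch with the google-cloud-language client; fetching the HTML with requests is just for illustration, and field names may differ slightly between client versions):

```python
import requests
from google.cloud import language_v1

# Fetch the page HTML (any other way of obtaining the HTML works too)
html = requests.get("https://www.hahn-bestattungen.de").text

client = language_v1.LanguageServiceClient()
document = language_v1.Document(
    content=html,
    type_=language_v1.Document.Type.HTML,
    language="de",
)
response = client.analyze_entities(
    document=document,
    encoding_type=language_v1.EncodingType.UTF8,
)

# Several entities in the response can share the same name
for entity in response.entities:
    print(entity.name, entity.type_, round(entity.salience, 4), len(entity.mentions))
```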


Is there a way to have a reference term in addition to a label with Doccano?

Hi, I would like to know if we can have something like the following example in Doccano:
Let's say that we have a sentence like this: "MS is an IT company". I want to label some words in this sentence, for example MS (Microsoft). MS should be labelled as a Company (so imagine that I have an entity named Company), but I also want to say that MS stands for Microsoft.
Is there a way to do that with Doccano?
Thanks
Doccano supports:
Sequence Labelling, good for Named Entity Recognition (NER)
Text Classification, good e.g. for Sentiment Analysis
Sequence to Sequence, good for Machine Translation
What you're describing sounds a little like Entity Linking.
You can see from Doccano's roadmap in its docs that Entity Linking is part of the plans, but not yet available.
For now, I suggest framing this as an NER problem and having different entities for MS (Microsoft) and MS (other). If you have too many entities to choose from, the labelling could become complicated, but then you could break up the dataset into smaller entity-focused datasets. For example, you could take only the documents that contain MS and label its mentions as one of the few candidate entities.

rule-based information extraction from raw text

I have some phrases from the aviation communication domain, e.g.: "Metro tower, 4 Delta Tango Charlie, request climb to flight level 350, wind 220". In this case
"metro tower" = air traffic control tower name,
"four Delta Tango Charlie" = airplane call sign,
"request climb to flight level 350" = type of clearance request,
"350" = flight level,
"wind 220" = wind info
I need to separate and extract these values corresponding to the tag names mentioned above, to be used in later processing. As per my research I have figured out that this could be achieved using custom Named Entity Recognition classes and rules, but I am not sure this is the most efficient way to do it, since this is to be used in a chat application and the processing and response times have to be really quick. Please tell me if there are any other algorithms or techniques to do this.
The next problem is the "four Delta Tango Charlie" part, which consists of numbers and the phonetic alphabet (A=Alpha, B=Bravo, C=Charlie, P=Papa, etc.). What are the possible ways of creating a term dictionary for this alphabet and using the dictionary to extract the call sign from the raw text?
Please also tell me what sort of algorithms could be used to solve my problem.
Classical Named Entity Recognition (NER) is usually statistical (CRF, neural networks) and is trained on a large annotated corpus. If you do not have such a corpus, you cannot go this route. (Moreover, these are mostly not named entities but simply entities).
Instead I would simply search for the items on the list. With the parameters you mentioned you can use brute force, but since you mentioned it is an assignment, you should probably use something smarter.
You might want to compile all the items to search for into a finite state automaton (see the Aho-Corasick algorithm). The states can be tokens or simply letters.
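A minimal sketch of that idea, assuming the pyahocorasick package (the phrase list and tag names are only illustrative):

```python
import ahocorasick

# Phrases to search for, each mapped to a tag (illustrative values)
phrases = {
    "metro tower": "atc_tower_name",
    "request climb to flight level": "clearance_request",
    "wind": "wind_info",
}

automaton = ahocorasick.Automaton()
for phrase, tag in phrases.items():
    automaton.add_word(phrase, (tag, phrase))
automaton.make_automaton()

text = "metro tower, four delta tango charlie, request climb to flight level 350, wind 220"
for end_index, (tag, phrase) in automaton.iter(text):
    start_index = end_index - len(phrase) + 1
    print(tag, text[start_index:end_index + 1])
```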
Standardization of the phonetic alphabet depends on ambiguity (is Charlie always C? or can it be literally Charlie in some contexts). It can be done as a preprocessing step, postprocessing step or it can be compiled into the search algorithm (using a transducer instead of an automaton).
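For the call sign specifically, a plain dictionary over the tokens may already be enough as a preprocessing step; a rough sketch (the mapping excerpt and the tokenisation are simplified assumptions):

```python
# NATO phonetic alphabet (excerpt) and number words, mapped to characters
PHONETIC = {
    "alpha": "A", "bravo": "B", "charlie": "C", "delta": "D",
    "papa": "P", "tango": "T",
    "zero": "0", "one": "1", "two": "2", "three": "3", "four": "4",
}

def extract_call_sign(tokens):
    """Return the longest run of tokens that are phonetic letters or digits."""
    best, current = [], []
    for token in tokens:
        key = token.lower().strip(",.")
        if key in PHONETIC:
            current.append(PHONETIC[key])
        elif token.isdigit():
            current.append(token)
        else:
            if len(current) > len(best):
                best = current
            current = []
    return "".join(max(best, current, key=len))

print(extract_call_sign("Metro tower , four Delta Tango Charlie , request climb".split()))
# -> 4DTC
```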
You might also want to use token regexes in Stanford NLP. Or Apache Lucene.

Techniques other than RegEx to discover 'intent' in sentences

I'm embarking on a project for a non-profit organization to help process and classify thousands of reports submitted annually by their field workers / contractors the world over. I'm relatively new to NLP and as such wanted to seek the group's guidance on the approach to our problem.
I'll highlight the current process, and our challenges and would love your help on the best way to solve our problem.
Current process: Field officers submit reports from locally run projects in the form of best practices. These reports are then processed by a full-time team of curators who (i) ensure they adhere to a best-practice template and (ii) edit the documents to improve language/style/grammar.
Challenge: As the number of field workers has increased, the volume of reports being generated has grown, and our editors are now becoming the bottleneck.
Solution: We would like to automate the first step of our process, i.e., checking the document for compliance with the organizational best-practice template.
Basically, we need to ensure every report has 3 components, namely:
1. States its purpose: What topic / problem does this best practice address?
2. Identifies Audience: Who is this for?
3. Highlights Relevance: What can the reader do after reading it?
Here's an example of a good report submission.
"This document introduces techniques for successfully applying best practices across developing countries. This study is intended to help low-income farmers identify a set of best practices for pricing agricultural products in places where there is no price transparency. By implementing these processes, farmers will be able to get better prices for their produce and raise their household incomes."
As of now, our approach has been to use RegEx and check for keywords, i.e., to check for compliance we use the following logic:
1. To check "states purpose", we do a regex match on 'purpose', 'intent'
2. To check "identifies audience", we do a regex match on 'identifies', 'is for'
3. To check "highlights relevance", we do a regex match on 'able to', 'allows', 'enables'
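In code, the check is roughly this (a simplified sketch; the keyword lists are the ones above):

```python
import re

CHECKS = {
    "states_purpose": re.compile(r"\b(purpose|intent)\w*", re.IGNORECASE),
    "identifies_audience": re.compile(r"\b(identifies|is for)\b", re.IGNORECASE),
    "highlights_relevance": re.compile(r"\b(able to|allows|enables)\b", re.IGNORECASE),
}

def check_compliance(report_text):
    """Return which of the three required components were matched."""
    return {name: bool(pattern.search(report_text)) for name, pattern in CHECKS.items()}
```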
The current approach of RegEx seems very primitive and limited, so I wanted to ask the community if there is a better way to solve this problem using something like NLTK or CoreNLP.
Thanks in advance.
Interesting problem, I believe it's a genuine research problem! In natural language processing there are a few techniques that learn and extract templates from text and then use them as gold annotations to identify whether a document follows the template structure. Researchers have used this kind of system for automatic question answering (extract templates from questions and then answer them). But your case is more difficult, as you need to learn the structure from a report. Framed as an NLP problem it is hard to address (no simple NLP task matches your problem definition), and you may not need any fancy (complex) model to solve it.
You can start with simple document matching and computing a similarity score. If you have a large collection of positive examples (well formatted, compliant reports), you can construct a dictionary based on tf-idf weights and then check for the presence of the dictionary tokens. You can also think of this as a binary classification problem. There are good machine learning classifiers, such as SVM and logistic regression, that work well for text data. You can use Python and scikit-learn to build programs quickly, and they are pretty easy to use. For text pre-processing, you can use NLTK.
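A minimal sketch of the classification route with scikit-learn (the example reports and labels are placeholders; 1 = follows the template, 0 = does not):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Placeholder training data: report texts and whether they follow the template
reports = [
    "This document introduces techniques for applying best practices ...",
    "Random unstructured notes from the field ...",
]
labels = [1, 0]

model = make_pipeline(TfidfVectorizer(stop_words="english"), LogisticRegression())
model.fit(reports, labels)

print(model.predict(["This study is intended to help low-income farmers ..."]))
```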
Since the reports are generated by field workers and there are only a few questions they need to answer (you mentioned 3 specific components), I guess simple keyword matching techniques will be a good start. You can gradually move in different directions based on your observations.
This seems like a perfect scenario to apply some machine learning to your process.
First of all, the data annotation problem is covered; this is usually the most annoying problem. Thankfully, you can rely on the curators: they can mark the specific sentences that specify audience, relevance and purpose.
Train some models to identify these types of clauses. If all the classifiers fire for a certain document, it means that the document is properly formatted.
If errors are encountered, make sure to retrain the models with the specific examples.
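A sketch of that setup, with one binary classifier per component trained on curator-labelled sentences (all data here is a placeholder):

```python
import nltk  # requires: nltk.download('punkt')
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Placeholder curator annotations: positive and negative sentences per component
training = {
    "purpose":   (["This document introduces techniques for ...", "Some other sentence."], [1, 0]),
    "audience":  (["This study is intended to help low-income farmers.", "Some other sentence."], [1, 0]),
    "relevance": (["Farmers will be able to get better prices.", "Some other sentence."], [1, 0]),
}

classifiers = {}
for component, (sentences, labels) in training.items():
    clf = make_pipeline(TfidfVectorizer(), LinearSVC())
    clf.fit(sentences, labels)
    classifiers[component] = clf

def is_properly_formatted(report_text):
    """The document passes only if every component classifier fires on some sentence."""
    sentences = nltk.sent_tokenize(report_text)
    return all(
        any(clf.predict([s])[0] == 1 for s in sentences)
        for clf in classifiers.values()
    )
```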
If you don't give yourself any hints about the format of the document, this is an open problem.
What you can do, though, is ask the people writing the reports to conform to some format for the document, like having 3 parts, each of which has a pre-defined title, like so:
1. Purpose
Explains the purpose of the document in several paragraph.
2. Topic / Problem
This addresses the foobar problem, also known as lorem ipsum filler text.
3. Take away
What can the reader do after reading it?
You parse this document from .doc format, for instance, and extract the three parts. Then you can run spell checking, grammar and text complexity algorithms. And finally you can extract, for instance, Named Entities (cf. Named Entity Recognition) and low TF-IDF words.
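A rough sketch of the extraction step, assuming the report has already been converted to plain text and uses the three numbered titles above:

```python
import re

SECTION_TITLES = ["Purpose", "Topic / Problem", "Take away"]

def split_report(text):
    """Split a plain-text report into its three pre-defined sections."""
    # Match headings like "1. Purpose" at the start of a line
    pattern = r"^\s*\d+\.\s*(" + "|".join(re.escape(t) for t in SECTION_TITLES) + r")\s*$"
    parts = re.split(pattern, text, flags=re.MULTILINE)
    # re.split yields [preamble, title, body, title, body, ...]
    return {title: body.strip() for title, body in zip(parts[1::2], parts[2::2])}
```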
I've been trying to do something very similar with clinical trials, where most of the data is again written in natural language.
If you do not care about past data, and have control over what the field officers write, maybe you can have them provide these 3 extra fields in their reports, and you would be done.
Otherwise: CoreNLP and OpenNLP, the libraries that I'm most familiar with, have some tools that can help you with part of the task. For example, if your regex pattern matches a word that starts with the prefix "inten", the actual word could be "intention", "intended", "intent", "intentionally", etc., and you wouldn't necessarily know whether the word is a verb, a noun, an adjective or an adverb. The POS taggers and parsers in these libraries can tell you the part of speech (POS) of the word, and maybe you only care about the verbs that start with "inten", or more strictly, the verbs in the third person singular.
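To illustrate the idea (here with NLTK's POS tagger instead of CoreNLP or OpenNLP, just to keep the sketch short):

```python
import nltk  # requires: nltk.download('punkt'); nltk.download('averaged_perceptron_tagger')

sentence = "This study is intended to help low-income farmers."
tokens = nltk.word_tokenize(sentence)

# Keep only words starting with "inten" that the tagger marks as verbs (VB*)
hits = [(word, tag) for word, tag in nltk.pos_tag(tokens)
        if word.lower().startswith("inten") and tag.startswith("VB")]
print(hits)  # e.g. [('intended', 'VBN')]
```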
CoreNLP has another tool called OpenIE, which attempts to extract relations in a sentence. For example, given the following sentence
Born in a small town, she took the midnight train going anywhere
CoreNLP can extract the triple
she, took, midnight train
Combined with the POS tagger, for example, you would also know that "she" is a personal pronoun and "took" is a past-tense verb.
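One way to call OpenIE from Python is through the CoreNLP server's HTTP interface; a sketch, assuming a CoreNLP server is already running locally on port 9000:

```python
import json
import requests

text = "Born in a small town, she took the midnight train going anywhere."
params = {"properties": json.dumps({"annotators": "openie", "outputFormat": "json"})}

response = requests.post("http://localhost:9000/", params=params, data=text.encode("utf-8"))
for sentence in response.json()["sentences"]:
    for triple in sentence["openie"]:
        print(triple["subject"], "|", triple["relation"], "|", triple["object"])
```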
These libraries can accomplish many other tasks such as tokenization, sentence splitting and named entity recognition, and it would be up to you to combine all of these tools with your domain knowledge and creativity to come up with a solution that works for your case.

associated words

I am developing a program but am stuck on a particular hurdle. I need to find words associated with other words, e.g. "green" might be associated with "environment", "leaf", "earth", "wind", "electric", "hybrid", etc. All I can find is Google Sets. Is there any other resource that is better?
If you have a large text collection (say Wikipedia, Project Gutenberg) you can use co-occurrence scores to extract this kind of data. See e.g. Padó and Lapata and the references therein.
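A minimal sketch of the co-occurrence idea with a PMI-style score (the corpus and window size are illustrative; a real run would need a much larger collection):

```python
import math
from collections import Counter

corpus = [
    "green electric cars reduce the impact on the environment",
    "a green leaf fell to the earth",
    "wind power is a green source of energy",
]

window = 4
word_counts, pair_counts = Counter(), Counter()
for sentence in corpus:
    tokens = sentence.split()
    word_counts.update(tokens)
    for i, w in enumerate(tokens):
        for v in tokens[i + 1 : i + 1 + window]:
            pair_counts[tuple(sorted((w, v)))] += 1

total = sum(word_counts.values())

def pmi(w, v):
    """Rough pointwise mutual information of two words under the window model."""
    pair = pair_counts[tuple(sorted((w, v)))]
    if pair == 0:
        return float("-inf")
    return math.log(pair * total / (word_counts[w] * word_counts[v]))

# Words most associated with "green"
scores = {v: pmi("green", v) for v in word_counts if v != "green"}
print(sorted(scores, key=scores.get, reverse=True)[:5])
```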
I recently built a tool that mines this kind of association from Wikipedia database dumps by another method. It requires a lot of memory, though; other folks have tried to do the same using randomized methods.
If you're still looking for a resource of semantically related words, I've just recently developed an API that takes a query and returns semantically related words. It offers parts of speech, relationships to the query word, and a word similarity measurement.
https://kiingo.co/rapid-associations-api
Disclaimer: I'm the developer of this API.

blindly classifying new trends in incoming data

How do news outlets like Google News automatically classify and rank documents about emerging topics, like "Obama's 2011 budget"?
I've got a pile of articles tagged with baseball data like player names and their relevance to the article (thanks, OpenCalais), and would love to create a Google News-style interface that ranks and displays new posts as they come in, especially emerging topics. I suppose that a naive Bayes classifier could be trained with some static categories, but this doesn't really allow for tracking trends like "this player was just traded to this team, these other players were also involved."
No doubt Google News may use other tricks (or even a combination thereof), but one computationally cheap trick to infer topics from free text would exploit the NLP notion that a word gets its meaning only when connected to other words.
An algorithm capable of discovering new topic categories from multiple documents could be outlined as follows:
POS (part-of-speech) tag the text
We probably want to focus more on nouns and maybe even more so on named entities (such as Obama or New England)
Normalize the text
In particular replace inflected words by their common stem. Maybe even replace some adjectives by a corresponding Named Entity (ex: Parisian ==> Paris, legal ==> law)
Also, remove noise words and noise expressions.
Identify some words from a list of manually maintained "current / recurring hot words" (Superbowl, Elections, scandal...)
This can be used in subsequent steps to provide more weight to some N-grams
Enumerate all N-grams found in each document (where N is 1 up to, say, 4 or 5)
Be sure to count, separately, the number of occurrences of each N-gram within a given document and the number of documents which cite a given N-gram
The most frequently cited N-grams (i.e. the ones cited in the most documents) are probably the Topics (see the sketch after this outline)
Identify the existing topics (from a list of known topics)
[optionally] Manually review the new topics
This general recipe can also be altered to leverage other attributes of the documents and of the text therein. For example, the document origin (say cnn/sports vs. cnn/politics...) can be used to select domain-specific lexicons. As another example, the process can more or less heavily emphasize the words/expressions from the document title (or other areas of the text with particular mark-up).
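A minimal sketch of the N-gram counting steps in the outline above (the documents and stop-word list are placeholders, and the POS tagging / normalization steps are omitted):

```python
from collections import Counter

documents = [
    "obama unveils the 2011 budget proposal",
    "congress debates obama 2011 budget",
    "the superbowl halftime show draws a record audience",
]
STOPWORDS = {"the", "a", "of", "and", "to"}

def ngrams(tokens, max_n=4):
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            yield " ".join(tokens[i:i + n])

document_frequency = Counter()
for doc in documents:
    tokens = [t for t in doc.lower().split() if t not in STOPWORDS]
    # Count each N-gram once per document, i.e. the number of documents citing it
    document_frequency.update(set(ngrams(tokens)))

# The N-grams cited in the most documents are the topic candidates
print(document_frequency.most_common(5))
```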
The main algorithms behind Google News have been published in the academic literature by Google researchers:
Original paper.
Talk: Google News Personalization: Scalable Online Collaborative Filtering
Blog discussion.
