Problems with Google NL's Location Entities Recognition - google-cloud-nl

I've been playing with the Google Natural Language API, and in particular used its location recognition to extract locations from HN's "Who Is Hiring" page. If I pass a text like
Blockai | San Francisco, CA | CV/ML and Front-end Engineers -
https://blockai.com
(from https://news.ycombinator.com/item?id=12631335)
Then the NL API returns the following entities:
The problem is that "ML" and "CV" are recognized as locations, but they actually stand for "Machine Learning" and "Computer Vision" respectively. I guess the algorithm concludes that CV/ML are locations because they're close to other locations (San Francisco, CA) in the text.
I was wondering how I can recognize such "fake" locations in the API's output. I thought that maybe the "salience" value would help, but I am not sure what rule of thumb would be suitable. I even found that the API sometimes responds with salience values greater than 1, even though the docs say these values are "in the [0, 1.0] range", e.g.:
{
  "name": "San Francisco",
  "type": "LOCATION",
  "metadata": {
    "wikipedia_url": "http://en.wikipedia.org/wiki/San_Francisco"
  },
  "salience": 1.4515763148665428,
  "mentions": [ ]
},
Any help is highly appreciated!

Sometimes it's very tricky for the underlying algorithms to disambiguate entities, especially when there is not enough context. Salience does not help with this, because salience shows how central an entity is to the text, regardless of its type. In this particular case, you could potentially use the provided metadata (e.g. the Wikipedia URL) to further assess whether the entity is indeed a location.
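A minimal sketch of that metadata check, assuming the entities have already been parsed into Python dicts shaped like the JSON in the question. The "CV" and "ML" entries are hypothetical, since the question does not show the API output for them, and it is only an assumption that such false positives would come back without a wikipedia_url.

```python
def confirmed_locations(entities):
    """Keep LOCATION entities whose metadata carries a Wikipedia URL."""
    kept = []
    for entity in entities:
        if entity.get("type") != "LOCATION":
            continue
        url = entity.get("metadata", {}).get("wikipedia_url")
        if url:
            kept.append(entity["name"])
    return kept


entities = [
    {"name": "San Francisco", "type": "LOCATION",
     "metadata": {"wikipedia_url": "http://en.wikipedia.org/wiki/San_Francisco"}},
    # Hypothetical entries: the real output for "CV" and "ML" is not shown.
    {"name": "CV", "type": "LOCATION", "metadata": {}},
    {"name": "ML", "type": "LOCATION", "metadata": {}},
]

print(confirmed_locations(entities))
```

This is a heuristic only: a real location without a Wikipedia article would be dropped too, so it may be worth reviewing what gets filtered out.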

Related

Searching for known phrases in text using Azure Cognitive Services

I'm trying to ascertain the "right tool for the job" here, and I believe Cognitive Services can do this but without disappearing down an R&D rabbit-hole I thought I'd make sure I was tunnelling in the right direction first.
So, here is the brief:
I have a collection of known existing phrases which I want to look for, but these might be written in slightly different ways, be that grammar or language.
I want to be able to parse a (potentially large) volume of text to scan and look for those phrases so that I can identify them.
For example, my phrase could be "the event will be in person" but that also needs to identify different uses of language; for example "in-person event", "face to face event", or "on-site event" - as well as the various synonyms and variations you can get with such things.
LUIS initially appeared to be the go-to tool for this kind of thing, and includes the ability to write your own Features (aka Phrase Lists) to augment the model, but it isn't clear whether that would hit the brief - LUIS appears to be much more about "intent" and user interaction (for example building a chat Bot, or understanding intent from emails).
Text Analytics also seems a likely candidate, but again it seems more focused on identifying "entities" (such as people / places / organisations) rather than a natural language "phrase" - would this tool work if I was defining my own "Topics", or is that really just barking up the wrong tree?
... or is there actually something completely different that I should be looking at?
At this point - I'm really looking for a "which tool should I spend lots of time learning about".
Thanks all in advance - I appreciate this is a fairly open-ended requirement.
It seems your scenario aligns more with our Text Analytics service. I was going to recommend the Key Phrase Extraction API, which evaluates unstructured text and returns a list of key phrases. However, since you need to use a known (custom) phrase list, it may not be the solution you're looking for. We currently don't support custom key phrase extraction, but it's on our roadmap. If interested, we can connect you with the product team to learn more about your scenario.
Updated:
Please try custom NER capability.

Entity over-generalisation on Api.ai

We’ve been having a great deal of difficulty with chatbot entities over-generalising on Api.ai, i.e. returning values that have not been specified for that entity when using the “Define Synonyms” feature on custom entities, even when the “Allow automated expansion” flag is turned off.
Our key example is an entity we use for confirming a user choice called confirm_accept. We had an entry: “that’s it”, with synonyms: “thats it”, “that is it”, “that’s it thanks”, “thats it thanks”, “that is it thanks”. This entity value was being returned unexpectedly in expressions where just a stray “it” was appearing.
In general, we have seen a lot of inappropriate entity generalisation which seems to indicate there is some form of stop word removal and stemming/lemmatization going on during entity identification... and which can’t be turned off.
This returns poor entity classifications, making it difficult to create entities for which very precise values are important, e.g. where a single word or character can make a big difference in meaning. Our key use case involves a lot of address processing, so it is important we get back only values we have specified.
Types of over-generalisations we’ve seen include:
inappropriate identification of determiners (a, an, the, this, that, etc.) as part of entities: as in “it” returning “that’s it”
stemmed words: as in stray mentions of “driving”, returning “drive” (a valid street type entity)
inappropriate plural stems: a stray mention of “children” returning “child”, or a stray “will” returning “wills” (in our case “child” and “wills” are street name entities, so we don’t want “children” or “will” to be returned)
This is currently making it difficult to create a production quality chatbot using the Api.ai service.
Anyone had more luck at either getting a response from Api.ai or solving the over-generalisation problem?
Entities are meant to extract information from conversation:
API.AI's entities are meant to be used to extract data from conversational input, not to parse different phrases and parts of speech. Your examples (that’s it, thats it, that is it, that’s it thanks, thats it thanks, that is it thanks) all seem to indicate that the user's intent is to confirm that the last message from the API.AI agent was correct. For instances like these, it would be best to use these phrases as examples for an intent, or add them to an existing intent whose other examples indicate that the user is confirming the last response.
API.AI captures entity tenses and plurals automatically: To address your other concern (“driving” returning “drive”, “children” returning “child”, or “will” returning “wills”): API.AI intentionally captures different tenses and plurals of entities to provide a better experience for users who may not know the exact entities you've entered in your database. This allows users of your conversational app to have a natural conversation with your users and not require precise wording or

Parsing addresses with ambiguous data

I have data of phone numbers and village names collected from the villagers via forms. Because of various reasons the data is inaccurate or incomplete.
The idea is to validate these two data points before adding them to the database/store.
The phone numbers are being formatted programmatically and validated via an external API. (That gives me the service provider and province information).
The problem is with the addresses.
No standardized address line. Tons of ambiguity.
Numeric street names and door numbers exist.
Input string will sometimes contain an addressee.
Possible solutions I can think of
Reverse geocoding helps. But not very accurate when it comes to Indian context. The Google TOS also prohibits automated queries. (correct me if I'm wrong here)
Soundexing. Again not very accurate with Indian data.
I understand it's difficult to handle such highly unstructured data, but I'm looking for ways to achieve at least enough accuracy to map addresses to the nearest point of interest.
Queries
Given a village name from a villager who might misspell or abbreviate it, how do I get the correct official name of the village and its location?
Any possible ways to sanitize bad location/addresses or decode complex/poorly formed addresses?
Are there any machine learning solutions that can help so I can learn from every computation?(I have 0 knowledge on ML, do correct me if I'm wrong here.)
What you want is a geolocation system that works with informal text input. I have previously used a text-based geolocation model trained on Twitter data.
To solve your problem, you need training data in the form of:
informal_text village_name
If you have access to such data (e.g. using the addresses which can be geolocated) then you can train a text-based classifier that given a new informal address can predict where on the map it points to. In your case every village becomes a class label. You can use scikit-learn to train the classifier.
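As a rough illustration of "every village becomes a class label", here is a dependency-free sketch that classifies by cosine similarity over character trigram counts. This is a stand-in for the idea, not the scikit-learn setup itself; with scikit-learn you would more likely combine a character n-gram vectorizer with a linear classifier. The village names and training strings below are invented for the example.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(text, n=3):
    """Count character n-grams of a lowercased, space-padded string."""
    padded = f" {text.lower().strip()} "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items() if gram in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def train(examples):
    """Sum the n-gram counts of all training texts for each village label."""
    profiles = defaultdict(Counter)
    for text, village in examples:
        profiles[village].update(char_ngrams(text))
    return profiles


def predict(profiles, text):
    """Return the village whose n-gram profile is most similar to the text."""
    query = char_ngrams(text)
    return max(profiles, key=lambda village: cosine(query, profiles[village]))


# Invented (informal_text, village_name) pairs, for illustration only.
examples = [
    ("rampur near temple", "Rampur"),
    ("ram pur post office", "Rampur"),
    ("lakshmipur main road", "Lakshmipur"),
    ("laxmipur village", "Lakshmipur"),
]
profiles = train(examples)
print(predict(profiles, "rampoor temple st"))
print(predict(profiles, "laxmi pur road"))
```

Character n-grams are used rather than whole words because they tolerate spelling variants such as "rampoor" for "rampur"; real accuracy will depend entirely on how much labelled data you can collect per village.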

How to determine if a piece of text mentions a product

I'm new to natural language processing, so I apologize if my question is unclear. I have read a book or two on the subject and done general research on various libraries to figure out how I should be doing this, but I'm not confident yet that I know what to do.
I'm playing with an idea for an application, and part of it is trying to find product mentions in unstructured text (e.g. tweets, facebook posts, emails, websites, etc.) in real-time. I won't go into what the products are, but it can be assumed that they are known (stored in a file or database). Some examples:
"starting tomorrow, we have 5 boxes of #hersheys snickers available for $5 each - limit 1 pp" (snickers is the product from the hershey company [mentioned as "#hersheys"])
"Big news: 12-oz. bottles of Coke and Pepsi on sale starting Fri." (coca-cola is the product [aliased as "coke"] from coca-cola company and Pepsi is the product from the PepsiCo company)
"#OMG, i just bought my dream car. a mustang!!!!" (mustang is the product from Ford)
So basically, given a piece of text, query the text to see if it mentions a product and receive some indication (boolean or confidence number) that it does mention the product.
Some concerns I have are:
Missing products because of misspellings. I thought maybe i could use a string similarity check to catch these.
Product names that are also English words or things would get caught. Like mustang the horse versus mustang the car
Needing to keep a list of alternative names for products (e.g. "coke" for "coca-cola", etc.)
I don't really know where to start with this, but any help would be appreciated. I've already looked at NLTK and SciKit and didn't really glean how to do this from there. If you know of examples or papers that explain this, links would be helpful. I'm not tied to any language at this point. Java preferably, but Python and Scala are acceptable.
The answer that you chose is not really answering your question.
The best approach you can take is to use a Named Entity Recognizer (NER) and a POS tagger (grab NNP/NNPS, i.e. proper nouns). The database there might be missing some new brands like Lyft (Uber's rival), but without developing your own prop database, the Stanford tagger will solve half of your immediate needs.
If you have time, I would build a dictionary that contains every brand name and simply extract those names from the tweet strings.
http://www.namedevelopment.com/brand-names.html
If you know how to crawl, it's not a hard problem to solve.
It looks like your goal is to classify linguistic forms in a given text as references to semantic entities (which can be referred to by many different linguistic forms). You describe a number of subtasks which should be done in order to get good results, but they nevertheless are still independent tasks.
Misspellings
In order to deal with potential misspellings of words, you need to associate these possible misspellings to their canonical (i.e. correct) form.
Phonetic similarity: A common cause of "misspellings" is opacity in the relationship between a word's phonetic form (i.e. how it sounds) and its orthographic form (i.e. how it's spelled). Therefore, a good way to address this is to index terms phonetically so that e.g. innovashun is associated with innovation.
Form similarity: Additionally, you could do a string similarity check, but you may introduce a lot of noise into your results which you would have to address because many distinct words are in fact very similar (e.g. chic vs. chick). You could make this a bit smarter by first morphologically analyzing the word and then using a tree kernel instead.
Hand-made mappings: You can also simply make a list of common misspelling → canonical_form mappings. This would work well for "exceptions" not handled by the above methods.
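The sketch below combines two of the three ideas above: a hand-made mapping for known exceptions, then a string similarity fallback using Python's standard difflib module. The vocabulary, the mapping and the 0.8 cutoff are assumptions chosen for illustration. Phonetic indexing is left out because it needs a phonetic algorithm or an extra library.

```python
import difflib

# Invented canonical vocabulary and hand-made exception list.
CANONICAL = ["innovation", "chic", "chick", "mustang"]
KNOWN_MISSPELLINGS = {"innovashun": "innovation"}


def canonical_form(word, cutoff=0.8):
    """Map a word to its canonical form, or return None if nothing is close."""
    word = word.lower()
    if word in CANONICAL:
        return word
    # Hand-made mappings catch cases that string similarity would miss.
    if word in KNOWN_MISSPELLINGS:
        return KNOWN_MISSPELLINGS[word]
    # Fall back to the closest vocabulary entry above the similarity cutoff.
    matches = difflib.get_close_matches(word, CANONICAL, n=1, cutoff=cutoff)
    return matches[0] if matches else None


print(canonical_form("mustnag"))
print(canonical_form("innovashun"))
print(canonical_form("xyz"))
```

A lower cutoff catches more misspellings but also produces more of the noise described above, since short, similar words such as chic and chick sit very close to each other.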
Word-sense disambiguation
Mustang the car and Mustang the horse are the same form but refer to entirely different entities (or rather classes of entities, if you want to be pedantic). In fact, we ourselves as humans can't tell which one is meant unless we also know the word's context. One widely-used way of modelling this context is distributional lexical semantics: Defining a word's semantic similarity to another as the similarity of their lexical contexts, i.e. the words preceding and succeeding them in text.
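A toy illustration of using lexical context: it scores each sense by how many hand-picked cue words appear within a few tokens of the target word. The cue lists and the window size are invented. A distributional model would learn such associations from a corpus instead of using a fixed list.

```python
import re

# Invented cue words per sense; a real system would learn these from data.
SENSE_CUES = {
    "car": {"ford", "drive", "bought", "engine", "dealer", "car"},
    "horse": {"wild", "herd", "ride", "ranch", "horse", "gallop"},
}


def guess_sense(text, target="mustang", window=5):
    """Pick the sense whose cue words overlap most with the nearby tokens."""
    tokens = re.findall(r"[a-z']+", text.lower())
    if target not in tokens:
        return None
    position = tokens.index(target)
    before = tokens[max(0, position - window):position]
    after = tokens[position + 1:position + 1 + window]
    context = set(before + after)
    scores = {sense: len(cues & context) for sense, cues in SENSE_CUES.items()}
    best = max(scores, key=scores.get)
    # No cue word nearby means the context gives no evidence either way.
    return best if scores[best] > 0 else None


print(guess_sense("#OMG, i just bought my dream car. a mustang!!!!"))
print(guess_sense("a wild mustang herd crossed the ranch"))
```

Returning None when no cue word is found keeps the function from guessing, which matters for the "boolean or confidence number" requirement in the question.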
Linguistic aliases (synonyms)
As stated above, any given semantic entity can be referred to in a number of different ways: bathroom, washroom, restroom, toilet, water closet, WC, loo, little boys'/girls' room, throne room etc. For simple meanings referring to generic entities like this, they can often be considered to be variant spellings in the same way that "common misspellings" are and can be mapped to a "canonical" form with a list. For ambiguous references such as throne room, other metrics (such as lexical-distributional methods) can also be included in order to disambiguate the meaning, so that you don't relate e.g. I'm in the throne room just now! to The throne room of the Buckingham Palace is beautiful.
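For the simple case, the alias list can be a plain lookup table from each known surface form to a canonical product name. The table below is a hypothetical fragment built from the examples in the question, and the tokenizer is a deliberately simple assumption.

```python
import re

# Hypothetical alias table: surface form (lowercase) -> canonical product.
ALIASES = {
    "coke": "Coca-Cola",
    "coca-cola": "Coca-Cola",
    "pepsi": "Pepsi",
    "snickers": "Snickers",
    "mustang": "Mustang",
}


def find_products(text):
    """Return canonical product names mentioned in the text, in order."""
    # Tokens may contain hyphens and apostrophes; '#' and punctuation are dropped.
    tokens = re.findall(r"[a-z0-9][a-z0-9'-]*", text.lower())
    found = []
    for token in tokens:
        product = ALIASES.get(token)
        if product and product not in found:
            found.append(product)
    return found


print(find_products("Big news: 12-oz. bottles of Coke and Pepsi on sale starting Fri."))
```

A lookup like this cannot tell Mustang the car from mustang the horse, so ambiguous aliases still need the context-based check described under word-sense disambiguation.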
Conclusion
You have a lot of work to do in order to get where you want to go, but it's all interesting stuff and there are already good libraries available for doing most of these tasks.

Accurate algorithm for normalizing taxonomy terms?

I'm developing a shopping comparison website, and the project is at a very advanced stage. We index 50 million products daily using merchant feeds from various affiliate networks. Most of the problems I had are already solved, including the majority of the performance bottlenecks.
What is my problem: first of all, we are using apache solr with drupal, BUT this problem IS NOT specific to drupal or solr; if you do not have knowledge of them, it doesn't matter.
We receive product feeds from over 2000 different merchants, and those feeds are a mess. They have no specific pattern; each merchant sends the feeds the way they want. We have already solved many problems regarding this, but one remains: normalizing the taxonomy terms for the faceted browsing functionality.
Suppose that I have a "Narrow by Brands" browsing facet on my website. Now suppose that 100 merchants offer products from Microsoft. Now comes the problem. Some merchants put "Microsoft" in the "Brands" column of the data feed, others "Microsoft, Inc.", others "Microsoft Corporation", others "Products from Microsoft", etc. There is no specific pattern between merchants and, worse, some individual merchants are so sloppy that they have different strings for the same brand IN THE SAME DATA FEED.
We do not want all those different brands appearing in the navigation. We have a manual solution to the problem, where we manually map the imported brands to the "good" brands table ("Microsoft Corporation" -> "Microsoft", "Products from Microsoft" -> "Microsoft", etc.). We have something like 10,000 brands in the database, and this is doable. The problem arises with bigger sets like "Authors". When we import books into the system, there are over 800,000 authors with the same problem, and this is not doable by hand mapping. The problem is the same: "Tom Mike Apostol", "Tom M. Apostol", "Apostol, Tom M.", etc.
Does anybody know a good way to automatically solve this problem with an acceptable degree of accuracy (85%-95% accuracy)?
Thank you for the help!
One idea that comes to mind, although it's just a loose thought:
1. Convert names to initials (in your example: TMA). Treat '-' as a space, so e.g. Antoine de Saint-Exupéry would be ADSE. The problem here is how to treat ",", although its common usage is to put the surname before the forename, so just swapping positions should work (so A,TM would become TM,A; get rid of the comma: TMA).
2. Filter the authors in the database by those initials.
3. For each initial, if you have the whole name (Tom, Apostol), check whether it matches; otherwise (M.) consider it a match automatically.
4. If you want some tolerance, you can compare names using Levenshtein distance and tolerate some differences (there is an Oracle implementation).
5. Treat names that match as the same author. To find the whole name, for each initial (T, M, A) look through your filtered authors (after step 2) and try to find one that has the whole name (Mike) rather than just the initial (M.); if you can't find one, use the initial. Each of the examples you gave would therefore be converted to the same value, the full name (Tom Mike Apostol).
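The steps above can be sketched roughly as follows. This is one possible reading of the idea, not the answerer's actual code: the function names are made up, the Levenshtein tolerance is omitted, and canonical_name assumes its inputs have already been judged to be the same author.

```python
import re


def name_parts(raw):
    """Split a name into parts, moving a leading 'Surname,' to the end."""
    if "," in raw:
        surname, _, forenames = raw.partition(",")
        raw = f"{forenames} {surname}"
    # Hyphens are treated like spaces; trailing dots on initials are dropped.
    parts = re.split(r"[\s-]+", raw.strip())
    return [part.strip(".") for part in parts if part.strip(".")]


def initials(parts):
    """Build the initials key, e.g. ['Tom', 'M', 'Apostol'] -> 'TMA'."""
    return "".join(part[0].upper() for part in parts)


def same_author(a, b):
    """Initials must agree; a bare initial matches any name it could start."""
    if initials(a) != initials(b):
        return False
    return all(
        len(x) == 1 or len(y) == 1 or x.lower() == y.lower()
        for x, y in zip(a, b)
    )


def canonical_name(variants):
    """For each position, prefer the longest (fullest) form seen."""
    parsed = [name_parts(variant) for variant in variants]
    return " ".join(max(column, key=len) for column in zip(*parsed))


variants = ["Tom Mike Apostol", "Tom M. Apostol", "Apostol, Tom M."]
first = name_parts(variants[0])
print(all(same_author(first, name_parts(v)) for v in variants))
print(canonical_name(variants))
```

In a database setting, the initials key would typically be stored in an indexed column so that the filtering step stays cheap even with 800,000 authors.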
Things worth thinking about:
Include mappings for name synonyms (probably a few hundred records at most, like Thomas <-> Tom).
With this approach it is crucial to have valid initials (no M instead of N, etc.).
edit: I coded something like this some time ago, when I had to identify a person by their signature, ignoring scanning problems. People sometimes sign as Name S. Surname, or N.S., or just Name Surname (which is another thing you should maybe consider in the solution: allowing the algorithm to ignore the second name, although in your situation I guess it would be rather rare to omit someone's second name).

Resources