Is it possible to count how many times an entity has been mentioned in an article? For example, the following passage
ABC Company is one of the largest car manufacturers in the
world. It is also the largest
company in terms of annual production.
It is also the second largest exporter of luxury cars, after XYZ
company. Both ABC and XYZ
together produce over n% of total car
production in the country.
mentions ABC Company 4 times (twice by name and twice through the pronoun "It").
Yes, this is possible. It's a combination of
named-entity recognition (NER), which for English is practically a solved problem, and
coreference resolution, which is the subject of ongoing research (but give this package a try)
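For the NER half, here is a minimal sketch using spaCy (my choice of library; the answer above does not prescribe one). It only counts directly named ORG mentions; merging aliases such as "ABC" with "ABC Company" and resolving the two "It" pronouns back to the company would additionally require coreference resolution:

    # Count directly named organization mentions with spaCy NER.
    # Pronoun references ("It") still need a separate coreference resolver.
    import spacy
    from collections import Counter

    nlp = spacy.load("en_core_web_sm")  # assumes this model is installed

    text = ("ABC Company is one of the largest car manufacturers in the world. "
            "It is also the largest company in terms of annual production. "
            "It is also the second largest exporter of luxury cars, after XYZ company. "
            "Both ABC and XYZ together produce over n% of total car production in the country.")

    doc = nlp(text)
    counts = Counter(ent.text for ent in doc.ents if ent.label_ == "ORG")
    print(counts)  # per-surface-form counts; exact output depends on the model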
So, due to cultural differences, people in Hispanic countries have quite a number of surnames.
Taking someone else's surname isn't the norm; in most cases you just combine your surnames:
1st husband, 1st bride, 2nd husband, 2nd bride, 3rd husband, 3rd bride, 4th husband, 4th bride.
You have to add a second surname to get Spanish nationality, and some people just repeat their last name because they refuse to understand how culturally important this is in Spain.
Athletic Bilbao can get away with saying all of their players have Basque origins by tracing back their multiple surnames, and they have been known to do so, approaching foreign players with a Basque surname somewhere in that never-ending list to ask if they would be interested in joining.
This can be quite problematic in some cases but it makes it easy to differentiate people:
There can be an elevated number of Thomas Smiths in your city, but there are hardly ever two people named Thomas Smith who also share the same second surname in the same area.
Because of this, people in Hispanic countries are used to using at least two of their surnames, unless their name is unique enough.
On to my issue:
My Dialogflow agent asks people to identify themselves in order to provide some extra information to the business.
I have added multiple training examples with several surnames, and they are identified correctly during training, but in actual conversation the agent struggles with them, picking either the second surname alone or only the first surname as the person entity, never the full name.
Neither option is valid in a Hispanic country, where I would be using this solution.
Anything I can do to improve this?
Creating a custom entity for a person seems like an arduous task to me.
It is not vital, and I could do without this extra tidbit since I am already storing their email. It just seems like a basic thing that should be doable, and I struggle to believe I am the first person to face this issue.
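One workaround, purely as a hypothetical sketch rather than a Dialogflow feature: have your fulfillment webhook ignore the partially matched person entity and re-extract the full name from the raw utterance, e.g. as the longest run of capitalized tokens. The regex below is illustrative only and would need tuning for particles such as "de la":

    # Hypothetical post-processing: pull a multi-surname full name out of the
    # raw utterance instead of trusting the partially matched person entity.
    import re

    def extract_full_name(utterance):
        # Two or more consecutive capitalized words (accented letters allowed).
        pattern = r"(?:[A-ZÁÉÍÓÚÑ][a-záéíóúñ]+\s*){2,}"
        match = re.search(pattern, utterance)
        return match.group(0).strip() if match else None

    print(extract_full_name("Hola, soy María García López y quiero información"))
    # -> 'María García López'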
I am currently doing a project where we are trying to gauge explanatory answers submitted by users against a correct answer. I have come across APIs like Dandelion and ParallelDots, both of which are capable of checking how close two texts are to each other semantically.
These APIs are giving me favorable responses for questions like:
What is the distinction between debtor and creditor?
Answer1: A debtor is a person or enterprise that owes money to another
party. A creditor is a person, bank, or other enterprise that has
lent money or extended credit to another party.
Answer2: A debtor has a debt or legal obligation to pay an amount to
another person or entity, from whom goods were purchased or services
were obtained. A creditor may be a bank, supplier
Dandelion gave me a score of 81% and ParallelDots gave me 4.8/5 for the same answer. This is quite expected.
However, before I prepare a demo and plan to eventually use them in production, I am interested in understanding to some extent how these APIs are generating these scores.
Is it a tf-idf-based vector product over the stemmed, POS-tagged tokens?
PS: Not an expert in NLP
This question is very broad: semantic sentence similarity is an open issue in NLP and there are a variety of ways of performing this task, all of them being far from perfect at the current stage. As an example, just consider that:
Trump is the president of the United States
and
Trump has never been the president of the United States
have a semantic similarity of 5 according to ParallelDots. Now, according to your definition of similarity this may or may not be OK, but the point is that, depending on what you have to do with this similarity, it may not be fully suitable if you have specific requirements.
Anyway, as for the implementation, there's no single "standard" way of performing this, and there's a plethora of features that can be used: tf-idf (or equivalent), the syntactic structure of the sentence (i.e. its constituency or dependency parse tree), mentions of entities extracted from the text, etc., or, following the latest trends, a deep neural network which doesn't need any explicit features.
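Neither API documents its internals, but as a rough illustration of the tf-idf-style baseline the question guesses at (explicitly not what Dandelion or ParallelDots actually do), here is a sketch with scikit-learn:

    # Rough tf-idf cosine-similarity baseline between the two example answers.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    answer1 = ("A debtor is a person or enterprise that owes money to another party. "
               "A creditor is a person, bank, or other enterprise that has lent money "
               "or extended credit to another party.")
    answer2 = ("A debtor has a debt or legal obligation to pay an amount to another "
               "person or entity, from whom goods were purchased or services were "
               "obtained. A creditor may be a bank, supplier")

    vectors = TfidfVectorizer().fit_transform([answer1, answer2])
    print(cosine_similarity(vectors[0], vectors[1])[0][0])  # bag-of-words similarity

A bag-of-words score like this ignores negation entirely, which is exactly why a pair like the Trump sentences above can come out as highly similar.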
I'm new to natural language processing, so I apologize if my question is unclear. I have read a book or two on the subject and done general research on various libraries to figure out how I should be doing this, but I'm not yet confident that I know what to do.
I'm playing with an idea for an application, and part of it is trying to find product mentions in unstructured text (e.g. tweets, Facebook posts, emails, websites, etc.) in real time. I won't go into what the products are, but it can be assumed that they are known (stored in a file or database). Some examples:
"starting tomorrow, we have 5 boxes of #hersheys snickers available for $5 each - limit 1 pp" (snickers is the product from the hershey company [mentioned as "#hersheys"])
"Big news: 12-oz. bottles of Coke and Pepsi on sale starting Fri." (coca-cola is the product [aliased as "coke"] from coca-cola company and Pepsi is the product from the PepsiCo company)
"#OMG, i just bought my dream car. a mustang!!!!" (mustang is the product from Ford)
So basically, given a piece of text, query the text to see if it mentions a product and receive some indication (boolean or confidence number) that it does mention the product.
Some concerns I have are:
Missing products because of misspellings. I thought maybe I could use a string similarity check to catch these.
Product names that are also English words or common things would get caught, like mustang the horse versus Mustang the car.
Needing to keep a list of alternative names for products (e.g. "coke" for "Coca-Cola", etc.)
I don't really know where to start with this, but any help would be appreciated. I've already looked at NLTK and scikit-learn and didn't really glean how to do this from them. If you know of examples or papers that explain this, links would be helpful. I'm not tied to any particular language at this point: Java preferably, but Python and Scala are acceptable.
The answer that you chose does not really answer your question.
The best approach you can take is to use a named-entity recognizer (NER) and a POS tagger (grab NNP/NNPS tags, i.e. proper nouns). The underlying database might be missing some newer brands like Lyft (Uber's rival), but even without developing your own proprietary database, the Stanford tagger will solve half of your immediate needs.
If you have time, I would build a dictionary that contains every brand name and simply extract matches from the tweet strings.
http://www.namedevelopment.com/brand-names.html
If you know how to crawl, it's not a hard problem to solve.
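As a minimal sketch of that dictionary idea (the alias map below is made up for illustration; a real one would come from your product database or a crawled list such as the one linked above):

    # Dictionary lookup of known product/brand aliases in a piece of text.
    ALIASES = {
        "coke": "Coca-Cola",
        "coca-cola": "Coca-Cola",
        "pepsi": "Pepsi",
        "mustang": "Ford Mustang",
        "#hersheys": "Hershey",
        "snickers": "Snickers",
    }

    def find_products(text):
        tokens = text.lower().replace(",", " ").replace(".", " ").split()
        return {ALIASES[t] for t in tokens if t in ALIASES}

    print(find_products("Big news: 12-oz. bottles of Coke and Pepsi on sale starting Fri."))
    # -> {'Coca-Cola', 'Pepsi'} (set order may vary)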
It looks like your goal is to classify linguistic forms in a given text as references to semantic entities (which can be referred to by many different linguistic forms). You describe a number of subtasks which should be done in order to get good results, but they nevertheless are still independent tasks.
Misspellings
In order to deal with potential misspellings of words, you need to associate these possible misspellings to their canonical (i.e. correct) form.
Phonetic similarity: A common reason for "misspellings" is opacity in the relationship between the word's phonetic form (i.e. how it sounds) and its orthographic form (i.e. how it's spelled). Therefore, a good way to address this is to index terms phonetically so that e.g. innovashun is associated with innovation.
Form similarity: Additionally, you could do a string similarity check, but you may introduce a lot of noise into your results which you would then have to address, because many distinct words are in fact very similar (e.g. chic vs. chick). You could make this a bit smarter by first morphologically analyzing the word and then using a tree kernel instead. (A small sketch of this kind of mapping follows below.)
Hand-made mappings: You can also simply make a list of common misspelling → canonical_form mappings. This would work well for "exceptions" not handled by the above methods.
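A small sketch of the similarity-based mapping mentioned under "Form similarity", using only the standard library (a phonetic index such as Soundex or Metaphone, e.g. via the jellyfish package, could be layered on in the same way):

    # Map a possibly misspelled word to the closest canonical term.
    import difflib

    CANONICAL = ["innovation", "chocolate", "mustang", "snickers", "pepsi"]

    def canonicalize(word, cutoff=0.6):
        matches = difflib.get_close_matches(word.lower(), CANONICAL, n=1, cutoff=cutoff)
        return matches[0] if matches else None

    print(canonicalize("innovashun"))  # -> 'innovation'

The cutoff is exactly where the noise problem mentioned above shows up: set it too low and chic will happily map to chick.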
Word-sense disambiguation
Mustang the car and Mustang the horse are the same form but refer to entirely different entities (or rather classes of entities, if you want to be pedantic). In fact, we ourselves as humans can't tell which one is meant unless we also know the word's context. One widely-used way of modelling this context is distributional lexical semantics: Defining a word's semantic similarity to another as the similarity of their lexical contexts, i.e. the words preceding and succeeding them in text.
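A toy illustration of that idea, with hand-built "sense profiles" standing in for real distributional contexts (the profile words and the example are made up for illustration):

    # Score each sense profile by its word overlap with the mention's context.
    SENSE_PROFILES = {
        "mustang_car": {"ford", "car", "drive", "engine", "bought", "dealership", "v8"},
        "mustang_horse": {"horse", "wild", "ranch", "gallop", "herd", "saddle"},
    }

    def disambiguate(context_words):
        context = {w.lower().strip("#!.,") for w in context_words}
        scores = {sense: len(profile & context) for sense, profile in SENSE_PROFILES.items()}
        return max(scores, key=scores.get), scores

    print(disambiguate("OMG i just bought my dream car a mustang".split()))
    # -> ('mustang_car', {'mustang_car': 2, 'mustang_horse': 0})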
Linguistic aliases (synonyms)
As stated above, any given semantic entity can be referred to in a number of different ways: bathroom, washroom, restroom, toilet, water closet, WC, loo, little boys'/girls' room, throne room etc. For simple meanings referring to generic entities like this, they can often be considered to be variant spellings in the same way that "common misspellings" are and can be mapped to a "canonical" form with a list. For ambiguous references such as throne room, other metrics (such as lexical-distributional methods) can also be included in order to disambiguate the meaning, so that you don't relate e.g. I'm in the throne room just now! to The throne room of the Buckingham Palace is beautiful.
Conclusion
You have a lot of work to do in order to get where you want to go, but it's all interesting stuff and there are already good libraries available for doing most of these tasks.
Given "violence" as input would it be possible to come up with how violence construed by a person (e.g. physical violence, a book, an album, a musical group ..) as mentioned below in Ref #1.
Assuming the user meant an album, what would be the best way to look for Violence as an album in a set of tweets?
Is there a way to infer this via any of the NLP APIs, say OpenNLP?
Ref #1
violence/N1 - intentional harmful physical action.
violence/N2 - the property of being wild or turbulent.
Violence/N6 - a book from Neil L. Whitehead; nonfiction
Violence/N7 - an album by The Last Resort
Violence/N8 - Violence is the third album by the Washington-based Alternative metal music group Nothingface.
Violence/N9 - a musical group which produced the albums Eternal Nightmare and Nothing to Gain
Violence/N10 - a song by Aesthetic Perfection, Angel Witch, Arsenic, Beth Torbert, Brigada Flores Magon, etc on the albums A Natural Disaster, Adult Themes for Voice, I Bificus, Retribution, S.D.E., etc
Violence/N11 - an album by Bombardier, Dark Quarterer and Invisible Limits
Violence/N12 - a song by CharlElie Couture, EsprieM, Fraebbblarnir, Ian Hunter, Implant, etc on the albums All the Young Dudes, Broke, No Regrets, Power of Limits, Repercussions, etc
Violence/N18 - Violence: The Roleplaying Game of Egregious and Repulsive Bloodshed is a short, 32-page roleplaying game written by Greg Costikyan under the pseudonym "Designer X" and published by Hogshead Publishing as part of its New Style line of games.
Violence/N42 - Violence (1947) is an American drama film noir directed by Jack Bernhard.
Pure automatic inference is a little too hard in general for this problem.
Instead, we might use:
Resources like WordNet, or a semantics dictionary.
For languages other than English you can look at the EuroWordNet (non-free) dataset.
To get more meanings (e.g. for the album sense), we can process a well-managed resource like Wikipedia. Wikipedia has a lot of meta-information that is very useful for this kind of processing.
The reliability of the process is achieved by combining as many data sources as possible and processing them correctly with specialized programs.
As a last resort you may try hand processing/annotation. It is long and costly, but useful in an enterprise context where you only need a small part of a language.
No free lunch here.
If you're working on English NLP in Python, then you can try the WordNet API like this:

    from nltk.corpus import wordnet as wn

    query = 'violence'
    for ss in wn.synsets(query):
        # zero-padded offset + POS form the unique id of this sense
        print(query, str(ss.offset()).zfill(8) + '-' + ss.pos(), ss.definition())
If you're working on other human languages, maybe you can take a look at the open wordnets available from http://casta-net.jp/~kuribayashi/multi/
NOTE: the reason for str(ss.offset()).zfill(8)+'-'+ss.pos() is that it is used as the unique id for each sense of a specific word, and this id is consistent across the open wordnets for every language. The first 8 digits give the id and the character after the dash is the part of speech of the sense.
Check this out: Twitter Filtering Demo from Idilia. It does exactly what you want by first analyzing a piece of text to discover the meaning of its words and then filtering the texts that contain the sense that you are looking for. It's available as an API.
Disclaimer: I work for Idilia.
You can extract all the contexts "violence" occurs in (a context can be a whole document, or a window of, say, 50 words), then convert them into features (using, say, bag of words), then cluster these features. As clustering is unsupervised, you won't have names for the clusters, but you can label them with some typical context.
Then you need to see which cluster the "violence" in the query belongs to, either based on other words in the query which act as context, or by asking explicitly (Do you mean violence as in "...." or as in "...."?).
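A sketch of that recipe with scikit-learn (the toy contexts and the choice of k are illustrative; in practice the contexts would be extracted from your corpus):

    # Bag-of-words features over each context of "violence", clustered with k-means.
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.cluster import KMeans

    contexts = [
        "the report documents police violence against protesters",
        "domestic violence remains underreported in many countries",
        "the band released the album violence on an independent label",
        "violence is the third album by the alternative metal group",
    ]

    X = CountVectorizer(stop_words="english").fit_transform(contexts)
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
    print(labels)  # contexts sharing a cluster id are (hopefully) the same sense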
This will be incredibly difficult due to the fact that the proper noun uses of the word 'Violence' will be incredibly infrequent as a proportion of all words and their frequency distribution is likely highly skewed in some way. We run into these problems almost any time we want to do some form of Named Entity Disambiguation.
No tool I'm aware of will do this for you, so you will be building your own classifier. Using Wikipedia as a training resource as Mr K suggested is probably your best bet.
I'm developing a shopping comparison website, and the project is in a very advanced stage. We index 50 million products daily using merchant feeds from various affiliate networks. Most of the problems I had is already solved, including the majority of the performance bottlenecks.
What is my problem: first of all, we are using Apache Solr with Drupal, BUT this problem IS NOT specific to Drupal or Solr; if you have no knowledge of them, it doesn't matter.
We receive product feeds from over 2000 different merchants, and those feeds are a mess. They follow no specific pattern; each merchant sends the feeds the way they want. We have already solved many problems regarding this, but one remains: normalizing the taxonomy terms for the faceted browsing functionality.
Suppose that I have a "Narrow by Brands" browsing facet on my website, and suppose that 100 merchants offer products from Microsoft. Now comes the problem: some merchants put "Microsoft" in the "Brands" column of the data feed, others "Microsoft, Inc.", others "Microsoft Corporation", others "Products from Microsoft", etc. There is no consistent pattern between merchants and, worse, some individual merchants are so sloppy that they use different strings for the same brand IN THE SAME DATA FEED.
We do not want all those different brands appearing in the navigation. We have a manual solution to the problem where we map the imported brands to the "good" brands table by hand ("Microsoft Corporation" -> "Microsoft", "Products from Microsoft" -> "Microsoft", etc.). We have something like 10,000 brands in the database, so this is doable. The problem comes with bigger things like "Authors": when we import books into the system there are over 800,000 authors, we have the same problem, and hand mapping is no longer doable. The issue is the same: "Tom Mike Apostol", "Tom M. Apostol", "Apostol, Tom M.", etc.
Does anybody know a good way to automatically solve this problem with an acceptable degree of accuracy (85%-95%)?
Thank you for the help!
An idea that comes to my mind, although it's just a loose thought:
1. Convert names to initials (in your example: TMA). Treat '-' as a space, so e.g. Antoine de Saint-Exupéry would be ADSE. The problem here is how to treat ',': its common usage is to put the surname before the forename, so just swapping the parts should work (so "A, TM" becomes "TM, A"; drop the comma to get TMA).
2. Filter the authors in your database by those initials.
3. For each initial, if you have the whole name (Tom, Apostol), check whether it matches; otherwise (M.), consider it a match automatically.
4. If you want some tolerance, you can compare names with the Levenshtein distance and tolerate small differences (here you have an Oracle implementation).
5. Names that match are treated as the same author. To find the whole name, for each initial (T, M, A) you look up your filtered authors (after step 2) and try to find one that has not just an initial (M.) but a whole name (Mike); if you can't find one, keep the initial. This way, each of the examples you gave would be converted to the same value, the full name (Tom Mike Apostol).
Things that are worth thinking about:
Include mappings for name synonyms (most likely a few hundred records at most), like Thomas <-> Tom.
For this to work it is crucial to have valid initials (no M instead of N, etc.).
Edit: I coded something like this some time ago, when I had to identify a person by their signature, ignoring scanning problems; people sometimes sign with Name S. Surname, or N.S., or just Name Surname (which is another thing you should perhaps consider in your solution, allowing the algorithm to ignore a middle name, although in your situation it would be rather rare to omit someone's second name, I guess).
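A rough sketch of the initials-key idea from the numbered steps above (simplified: no Levenshtein tolerance and no synonym mapping):

    # Normalize "Surname, Forename" order, build an initials key, and group
    # name variants, keeping the fullest spelling as the canonical form.
    from collections import defaultdict

    def normalize(name):
        if "," in name:  # "Apostol, Tom M." -> "Tom M. Apostol"
            last, first = [p.strip() for p in name.split(",", 1)]
            name = f"{first} {last}"
        return name.replace("-", " ").split()

    def initials_key(parts):
        return "".join(p[0].upper() for p in parts)

    variants = ["Tom Mike Apostol", "Tom M. Apostol", "Apostol, Tom M."]

    groups = defaultdict(list)
    for v in variants:
        groups[initials_key(normalize(v))].append(v)

    for key, names in groups.items():
        print(key, "->", max(names, key=len))  # TMA -> Tom Mike Apostol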