How do semantic text comparison APIs work - nlp

I am currently working on a project where we try to gauge explanatory answers submitted by users against a correct answer. I have come across APIs like Dandelion and ParallelDots, both of which are capable of checking how semantically close two texts are to each other.
These APIs are giving me favorable responses for questions like:
What is the distinction between debtor and creditor?
Answer1: A debtor is a person or enterprise that owes money to another
party. A creditor is a person, bank, or other enterprise that has
lent money or extended credit to another party.
Answer2: A debtor has a debt or legal obligation to pay an amount to
another person or entity, from whom goods were purchased or services
were obtained. A creditor may be a bank, supplier
Dandelion gave me a score of 81% and ParallelDots gave me 4.8/5 for the same pair of answers. This is quite expected.
However, before I prepare a demo and plan to eventually use them in production, I am interested in understanding to some extent how these APIs are generating these scores.
Is it a tf-idf-based vector product over stemmed, POS-tagged tokens?
PS: Not an expert in NLP

This question is very broad: semantic sentence similarity is an open issue in NLP and there are a variety of ways of performing this task, all of them being far from perfect at the current stage. As an example, just consider that:
Trump is the president of the United States
and
Trump has never been the president of the United States
have a semantic similarity of 5 according to ParallelDots. Whether that is acceptable depends on your definition of similarity, but the point is that, depending on what you need the similarity for, these scores may not be fully suitable if you have specific requirements.
Anyway, as for the implementation, there's no single "standard" way of performing this, and there's a plethora of features that can be used: tf-idf (or equivalent), the syntactic structure of the sentence (i.e. a constituency or dependency parse tree), mentions of entities extracted from the text, etc., or, following the latest trends, a deep neural network which doesn't need any explicit features.
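As a rough illustration of the first of those features, here is a minimal tf-idf baseline (a sketch assuming scikit-learn is available; it is not what Dandelion or ParallelDots actually run, just the simplest bag-of-words approach):

# Minimal tf-idf cosine similarity baseline (scikit-learn assumed installed).
# This only illustrates the idea; commercial APIs use much richer features.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

answer1 = ("A debtor is a person or enterprise that owes money to another party. "
           "A creditor is a person, bank, or other enterprise that has lent money "
           "or extended credit to another party.")
answer2 = ("A debtor has a debt or legal obligation to pay an amount to another "
           "person or entity, from whom goods were purchased or services were "
           "obtained. A creditor may be a bank, supplier")

tfidf = TfidfVectorizer(stop_words="english").fit_transform([answer1, answer2])
score = cosine_similarity(tfidf[0], tfidf[1])[0, 0]
print(f"tf-idf cosine similarity: {score:.2f}")

A bag-of-words baseline like this would also rate the two Trump sentences above as very similar, since it ignores negation and word order, which is exactly the kind of limitation discussed above.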

Related

What kind of Multi-Criteria Decision Making method do I need for my problem?

I'm making an application to find the best products to buy based on several criteria; it can be called a decision support system.
some examples of the criteria I use are:
location: the closer the shipping location is to my city, the better. I have
determined a weight for location; my own city gets a weight of 100, and the
farther the shipping city is from my city, the smaller the weight.
the number of reviews a product has: more is better
rating value: the higher the rating, the better
price: the cheaper, the better
I was recommended a method called AHP. I have read about AHP, and although I think it is a good method, in my opinion it cannot fully meet my needs, because it does not take the nominal values of rating and price into account; it only weighs the importance of one criterion against another.
My questions are:
Given these criteria, which MCDM method should I use?
Can AHP actually accommodate my needs? If yes, how? Does it require Fuzzy-AHP? If so, I will start learning fuzzy logic and related topics.
Thanks for the question. AHP*1 is a method used in decision making (DM) to methodically assign weights to the different criteria. In order to score, rank and select the most desirable alternative, you need to complement AHP with another MCDM method that fulfils those tasks.
There are several methods for doing that. TOPSIS and ELECTRE, for instance, are commonly used for that purpose*2-3. I leave you links to papers and tutorials on those methods so you can understand how they work -- SEE RESOURCES.
In regard to using fuzzy logic in AHP: while there are several proposals for using FAHP*4, Saaty himself, the creator of AHP, states that this is redundant*5-7, since the scale on which criteria are assessed for weighting in AHP already operates with a fuzzy logic.
However, if your criteria are based on qualitative data, and you are therefore dealing with uncertainty and potentially incomplete information, you can use fuzzy numbers in TOPSIS for those variables. You can check the tutorials in the resources to understand how to apply those methods.
In recent years, some researchers have argued that fuzzy TOPSIS only considers the membership function (that is, how close an imprecise parameter is to reality) and ignores the non-membership and indeterminacy degrees*9-10, i.e. how false and how undeterminable that parameter is. Neutrosophic theory was mainly pioneered by Smarandache*11.
In response, neutrosophic TOPSIS is nowadays being used to deal with this uncertainty. I recommend reading the papers below to understand the concept.
In summary, I would personally recommend applying AHP together with fuzzy or neutrosophic TOPSIS to address your problem.
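To make the division of labour concrete, here is a minimal sketch of plain (crisp) TOPSIS over the criteria from the question; everything in it (alternatives, scores, weights) is invented purely for illustration, and the weights would normally come from AHP:

# Minimal "vanilla" TOPSIS sketch with invented data (numpy assumed installed).
# Rows: alternatives (products); columns: criteria
# [distance_km, num_reviews, rating, price]
import numpy as np

X = np.array([
    [ 10.0, 120, 4.5, 25.0],   # product A
    [250.0, 800, 4.2, 19.0],   # product B
    [ 60.0,  45, 4.8, 30.0],   # product C
], dtype=float)

weights = np.array([0.2, 0.2, 0.3, 0.3])          # e.g. derived from AHP
benefit = np.array([False, True, True, False])    # distance and price are "cost" criteria

# 1. Vector-normalize each column, then apply the weights
V = weights * (X / np.linalg.norm(X, axis=0))

# 2. Ideal best / worst per criterion (max for benefit criteria, min for cost, and vice versa)
ideal_best = np.where(benefit, V.max(axis=0), V.min(axis=0))
ideal_worst = np.where(benefit, V.min(axis=0), V.max(axis=0))

# 3. Distances to the ideal solutions and the closeness coefficient
d_best = np.linalg.norm(V - ideal_best, axis=1)
d_worst = np.linalg.norm(V - ideal_worst, axis=1)
closeness = d_worst / (d_best + d_worst)

for name, c in zip(["A", "B", "C"], closeness):
    print(f"product {name}: closeness = {c:.3f}")
# The alternative with the highest closeness coefficient is ranked first.

Fuzzy or neutrosophic TOPSIS replaces the crisp scores with fuzzy/neutrosophic numbers, but the overall ranking logic stays the same.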
Resources:
Manoj Mathew. Tutorial Youtube FAHP. Fuzzy Analytic Hierarchy Process (FAHP) - Using Geometric Mean. Retrieved from: https://www.youtube.com/watch?v=5k3Wz1AfVWs
Manoj Mathew. Tutorial Youtube FTOPSIS. Fuzzy TOPSIS. Retrieved from: https://www.youtube.com/watch?v=z188EQuWOGU
Manoj Mathew. TOPSIS - Technique for Order Preference by Similarity to Ideal Solution Retrieved from: https://www.youtube.com/watch?v=kfcN7MuYVeI
MCDM in R (MCDA package): https://www.rdocumentation.org/packages/MCDA/versions/0.0.19
MCDM in JavaScript (ELECTRE): https://www.npmjs.com/package/electre-js
MCDM in Python (AHP): https://github.com/pyAHP/pyAHP
REFERENCES:
1 Saaty, R. W. (1987). The analytic hierarchy process—what it is and how it is used. Mathematical Modelling, 9(3-5), 167. doi:10.1016/0270-0255(87)90473-8
2 Hwang, C. L., & Yoon, K. (1981). Methods for multiple attribute decision making. In Multiple attribute decision making (pp. 58-191). Springer, Berlin, Heidelberg.
3 Figueira, J., Mousseau, V., & Roy, B. (2005). ELECTRE methods. In Multiple criteria decision analysis: State of the art surveys (pp. 133-153). Springer, New York, NY.
4 Mardani, A., Nilashi, M., Zavadskas, E. K., Awang, S. R., Zare, H., & Jamal, N. M. (2018). Decision making methods based on fuzzy aggregation operators: Three decades review from 1986 to 2017. International Journal of Information Technology & Decision Making, 17(02), 391–466. doi:10.1142/s021962201830001x
5 Saaty, T. L. (1986). Axiomatic foundation of the Analytic Hierarchy Process. Management Science, 32(7), 841. doi:10.1287/mnsc.32.7.841
6 Saaty, R. W. (1987). The analytic hierarchy process—what it is and how it is used. Mathematical Modelling, 9(3-5), 167. doi:10.1016/0270-0255(87)90473-8
7 Aczél, J., & Saaty, T. L. (1983). Procedures for synthesizing ratio judgements. Journal of Mathematical Psychology, 27(1), 93–102. doi:10.1016/0022-2496(83)90028-7
8 Wang, Y. M., & Elhag, T. M. (2006). Fuzzy TOPSIS method based on alpha level sets with an application to bridge risk assessment. Expert Systems with Applications, 31(2), 309-319.
9 Zhang, Z., & Wu, C. (2014). A novel method for single-valued neutrosophic multi-criteria decision making with incomplete weight information. Neutrosophic Sets and Systems, 4, 35–49.
10 Biswas, P., Pramanik, S., & Giri, B. C. (2018). Neutrosophic TOPSIS with group decision making. Studies in Fuzziness and Soft Computing, 543–585. doi:10.1007/978-3-030-00045-5_21
11 Smarandache, F. (1998). A Unifying Field in Logics. Neutrosophy: Neutrosophic Probability, Set and Logic. American Research Press, Rehoboth.

Wiki-distance: distance between Wiki topics and categories?

Is there some [directional?] notion/implementation of distance between Wikipedia categories/pages?
For example consider: A) "Saint Louis University" B) "university"
Clearly "A" is a type of "B". How can you extract this from Wiki?
If you extract all the categories connect to A, you'd see that it gives
Category:1818 establishments in Missouri Territory
Category:Articles containing Latin-language text
Category:Association of Catholic Colleges and Universities
Category:Commons category with local link same as on Wikidata
Category:Coordinates on Wikidata
Category:Educational institutions established in 1818
Category:Instances of Infobox university using image size
Category:Jesuit universities and colleges in the United States
Category:Roman Catholic Archdiocese of St. Louis
Category:Roman Catholic universities and colleges in Missouri
and it does not contain anything that directly connects to B (https://en.wikipedia.org/wiki/University). But essentially, if you look further, you should be able to find a path between A and B, possibly over multiple hops. What are the popular ways of accomplishing this?
Some ideas/resources I collected. Will update this if I find more.
-- Using DBpedia: a knowledge base curated from Wikipedia. They provide a SPARQL endpoint to query this KB, but one has to simulate the desired similarity/distance behavior via their SPARQL interface. Some ideas are here and here, but they seem to be outdated.
-- Using UMBEL: http://umbel.org/, which is a knowledge graph of concepts. I think this knowledge graph is relatively small, but I suspect its precision is probably high. That said, I'm not sure how it relates to Wikipedia at all. They have an API for calculating a distance measure between any pair of their concepts (at the moment of writing this post, their similarity API is down, so it is not a feasible solution right now).
-- Using http://degreesofwikipedia.com/: I don't know the details of their algorithm, but they provide a distance between Wiki concepts, and it is directional. For example, this and this.
If you have the entire Wikipedia category taxonomy, then you can compute the distance (shortest path length) between two categories. If one category is an ancestor of the other, this is straightforward.
Otherwise you can find the Least Common Subsumer (LCS), which is defined as follows.
Least common subsumer of two concepts A and B is the most specific
concept which is an ancestor of both A and B.
Then compute the distance between them via LCS.
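As a minimal sketch of that idea, assuming you have already loaded the category taxonomy into a directed graph (for example with networkx); the tiny hand-built graph below is purely illustrative:

# Path-based distance over a category taxonomy via the least common subsumer.
# The toy graph is illustrative; in practice you would load the full Wikipedia
# category graph (e.g. from a dump or DBpedia's SKOS category data).
import networkx as nx

G = nx.DiGraph()  # edge X -> Y means "X belongs to (or is narrower than) category Y"
G.add_edges_from([
    ("Saint Louis University", "Jesuit universities and colleges in the United States"),
    ("Jesuit universities and colleges in the United States", "Universities and colleges in the United States"),
    ("Universities and colleges in the United States", "Universities and colleges"),
    ("University", "Universities and colleges"),
])

def taxonomy_distance(g, a, b):
    """LCS distance: the common ancestor minimising upward hops from a plus hops from b."""
    up_a = nx.single_source_shortest_path_length(g, a)  # a and all its ancestors
    up_b = nx.single_source_shortest_path_length(g, b)
    common = set(up_a) & set(up_b)
    if not common:
        return None, float("inf")
    lcs = min(common, key=lambda n: up_a[n] + up_b[n])
    return lcs, up_a[lcs] + up_b[lcs]

print(taxonomy_distance(G, "Saint Louis University", "University"))
# -> ('Universities and colleges', 4) in this toy graph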
I encourage you to go through similarity measures, where you will find state-of-the-art techniques to compute semantic similarity between words.
Resource: My project on extracting Wikipedia category/concept might help you.
One very good related example: computing semantic similarity between words using WordNet. WordNet organizes English words in a hierarchical fashion. See this WordNet-similarity-for-Java demo; it uses eight different state-of-the-art techniques to compute semantic similarity between words.
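If Python is an option, NLTK exposes WordNet's hierarchy directly, so the same path-based measures take only a few lines (a sketch; the WordNet corpus must be downloaded first):

# WordNet path similarity and lowest common hypernym with NLTK
# (run nltk.download('wordnet') once beforehand).
from nltk.corpus import wordnet as wn

university = wn.synset('university.n.01')
college = wn.synset('college.n.01')
horse = wn.synset('horse.n.01')

# The least common subsumer is called the "lowest common hypernym" in WordNet terms
print(university.lowest_common_hypernyms(college))
# Path-based similarity: higher means closer in the hierarchy
print(university.path_similarity(college))   # relatively high
print(university.path_similarity(horse))     # much lower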
You might be looking for the "is a" relationship: Q734774 (the Wikidata item for Saint Louis University) is a university, a building and a private not-for-profit educational institution. You can use SPARQL to query it:
is Saint Louis University a university?
how far is Saint Louis University removed from the concept of "university"? (although I doubt this would produce anything meaningful)
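As a sketch of the first query, assuming the public Wikidata SPARQL endpoint and that Q3918 is the "university" item (P31 is "instance of", P279 is "subclass of"):

# Ask Wikidata whether Saint Louis University (Q734774) is an instance of a
# (subclass of) university (Q3918). Q3918 as "university" is an assumption here;
# the public endpoint may rate-limit anonymous queries.
import requests

query = """
ASK { wd:Q734774 wdt:P31/wdt:P279* wd:Q3918 . }
"""
resp = requests.get(
    "https://query.wikidata.org/sparql",
    params={"query": query, "format": "json"},
    headers={"User-Agent": "wiki-distance-demo/0.1"},
)
print(resp.json()["boolean"])  # True if such a path exists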

How to determine if a piece of text mentions a product

I'm new to natural language processing, so I apologize if my question is unclear. I have read a book or two on the subject and done general research of various libraries to figure out how I should be doing this, but I'm not yet confident that I know what to do.
I'm playing with an idea for an application, and part of it is trying to find product mentions in unstructured text (e.g. tweets, Facebook posts, emails, websites, etc.) in real time. I won't go into what the products are, but it can be assumed that they are known (stored in a file or database). Some examples:
"starting tomorrow, we have 5 boxes of #hersheys snickers available for $5 each - limit 1 pp" (snickers is the product from the hershey company [mentioned as "#hersheys"])
"Big news: 12-oz. bottles of Coke and Pepsi on sale starting Fri." (coca-cola is the product [aliased as "coke"] from coca-cola company and Pepsi is the product from the PepsiCo company)
"#OMG, i just bought my dream car. a mustang!!!!" (mustang is the product from Ford)
So basically, given a piece of text, query the text to see if it mentions a product and receive some indication (boolean or confidence number) that it does mention the product.
Some concerns I have are:
Missing products because of misspellings. I thought maybe I could use a string similarity check to catch these.
Product names that are also ordinary English words would get caught, like mustang the horse versus Mustang the car.
Needing to keep a list of alternative names for products (e.g. "coke" for "coca-cola", etc.)
I don't really know where to start with this, but any help would be appreciated. I've already looked at NLTK and scikit-learn and didn't really glean how to do this from there. If you know of examples or papers that explain this, links would be helpful. I'm not tied to any particular language at this point: Java preferably, but Python and Scala are acceptable.
The answer that you chose does not really answer your question.
The best approach you can take is to use a named entity recognizer (NER) and a POS tagger (grab NNP/NNPS, i.e. proper nouns). The model may be missing some new brands like Lyft (Uber's rival), but without developing your own proprietary database, the Stanford tagger will solve half of your immediate needs.
If you have time, I would build a dictionary that contains every brand name and simply match it against the tweet strings.
http://www.namedevelopment.com/brand-names.html
If you know how to crawl, it's not a hard problem to solve.
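A rough sketch of combining the two suggestions with NLTK's default tagger (the product list and alias map below are invented for illustration; in lower-cased tweets the tagger will often miss the NNP tag, so in practice you would also want a plain dictionary fallback):

# Grab proper nouns (NNP/NNPS) with NLTK's default tagger and check them
# against a hand-built product dictionary with an alias map.
# Requires nltk.download('punkt') and nltk.download('averaged_perceptron_tagger').
import nltk

PRODUCTS = {"snickers", "pepsi", "coca-cola", "mustang"}   # invented example list
ALIASES = {"coke": "coca-cola"}                            # alternative names -> canonical

def mentioned_products(text):
    tagged = nltk.pos_tag(nltk.word_tokenize(text))
    found = set()
    for word, tag in tagged:
        w = word.lower().lstrip("#")       # "#hersheys" -> "hersheys"
        w = ALIASES.get(w, w)
        if tag in ("NNP", "NNPS") and w in PRODUCTS:
            found.add(w)
    return found

print(mentioned_products("Big news: 12-oz. bottles of Coke and Pepsi on sale starting Fri."))
# -> {'coca-cola', 'pepsi'} (assuming the tagger marks both as proper nouns)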
It looks like your goal is to classify linguistic forms in a given text as references to semantic entities (which can be referred to by many different linguistic forms). You describe a number of subtasks which should be done in order to get good results, but they nevertheless are still independent tasks.
Misspellings
In order to deal with potential misspellings of words, you need to associate these possible misspellings to their canonical (i.e. correct) form.
Phonetic similarity: many "misspellings" arise from the opaque relationship between a word's phonetic form (i.e. how it sounds) and its orthographic form (i.e. how it's spelled). Therefore, a good way to address this is to index terms phonetically, so that e.g. innovashun is associated with innovation.
Form similarity: Additionally, you could do a string similarity check, but you may introduce a lot of noise into your results which you would have to address because many distinct words are in fact very similar (e.g. chic vs. chick). You could make this a bit smarter by first morphologically analyzing the word and then using a tree kernel instead.
Hand-made mappings: You can also simply make a list of common misspelling → canonical_form mappings. This would work well for "exceptions" not handled by the above methods (see the sketch after this list).
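A toy sketch of the last two ideas, using only a hand-made mapping plus a standard-library string-similarity check (the product names and misspellings are invented; a phonetic index such as Soundex or Metaphone could be layered on top):

# Map noisy spellings to canonical product names: explicit exception list first,
# then fall back to Ratcliff/Obershelp string similarity from difflib.
import difflib

CANONICAL = ["snickers", "pepsi", "coca-cola", "mustang"]   # invented examples
HAND_MADE = {"coke": "coca-cola", "cocacola": "coca-cola"}  # misspelling/alias -> canonical

def canonicalize(word):
    w = word.lower()
    if w in HAND_MADE:
        return HAND_MADE[w]
    if w in CANONICAL:
        return w
    close = difflib.get_close_matches(w, CANONICAL, n=1, cutoff=0.8)
    return close[0] if close else None

print(canonicalize("snikkers"))   # -> 'snickers'
print(canonicalize("coke"))       # -> 'coca-cola'
print(canonicalize("chic"))       # -> None (the cutoff keeps unrelated words out)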
Word-sense disambiguation
Mustang the car and Mustang the horse are the same form but refer to entirely different entities (or rather classes of entities, if you want to be pedantic). In fact, we ourselves as humans can't tell which one is meant unless we also know the word's context. One widely-used way of modelling this context is distributional lexical semantics: Defining a word's semantic similarity to another as the similarity of their lexical contexts, i.e. the words preceding and succeeding them in text.
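A toy sketch of that idea follows; the sentences, target word and window size are invented purely for illustration, and real systems use large corpora or pretrained embeddings (e.g. word2vec) instead:

# Approximate a word's "meaning" by counts of its surrounding words, and compare
# two usages by the cosine of their context vectors.
from collections import Counter
import math

car_contexts = ["i bought a mustang car yesterday",
                "my dream car is a mustang"]
horse_contexts = ["a wild mustang is a free roaming horse",
                  "the mustang galloped across the prairie"]

def context_vector(sentences, target="mustang", window=3):
    counts = Counter()
    for s in sentences:
        toks = s.split()
        for i, t in enumerate(toks):
            if t == target:
                counts.update(toks[max(0, i - window):i] + toks[i + 1:i + 1 + window])
    return counts

def cosine(a, b):
    dot = sum(a[k] * b[k] for k in a if k in b)
    return dot / (math.sqrt(sum(v * v for v in a.values())) *
                  math.sqrt(sum(v * v for v in b.values())))

tweet = context_vector(["omg i just bought my dream car a mustang"])
print(cosine(tweet, context_vector(car_contexts)))    # higher: car-like context
print(cosine(tweet, context_vector(horse_contexts)))  # lower: horse-like context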
Linguistic aliases (synonyms)
As stated above, any given semantic entity can be referred to in a number of different ways: bathroom, washroom, restroom, toilet, water closet, WC, loo, little boys'/girls' room, throne room etc. For simple meanings referring to generic entities like this, they can often be considered to be variant spellings in the same way that "common misspellings" are and can be mapped to a "canonical" form with a list. For ambiguous references such as throne room, other metrics (such as lexical-distributional methods) can also be included in order to disambiguate the meaning, so that you don't relate e.g. I'm in the throne room just now! to The throne room of the Buckingham Palace is beautiful.
Conclusion
You have a lot of work to do in order to get where you want to go, but it's all interesting stuff and there are already good libraries available for doing most of these tasks.

Identifying the context of word in sentence

I created a classifier to classify the class of nouns, adjectives, and named entities in a given sentence. I used a large Wikipedia dataset for classification.
Like :
Where Abraham Lincoln was born?
So the classifier will give this sort of result (word - class):
Where - question
Abraham Lincoln - Person, Movie, Book (because the classifier finds Abraham Lincoln in all these categories)
born - time
When Titanic was released?
when - question
Titanic - Song, Movie, Vehicle, Game (Titanic is classified in all these categories)
Is there any way to identify the exact context of a word?
Please note:
Word sense disambiguation would not help here, because there might not be any nearby words in the sentence that can help.
The Lesk algorithm with WordNet or synsets does not help either, because for a word like "bank" the Lesk algorithm behaves like this:
======== TESTING simple_lesk ===========
TESTING simple_lesk() ...
Context: I went to the bank to deposit my money
Sense: Synset('depository_financial_institution.n.01')
Definition: a financial institution that accepts deposits and channels the money into lending activities
TESTING simple_lesk() with POS ...
Context: The river bank was full of dead fishes
Sense: Synset('bank.n.01')
Definition: sloping land (especially the slope beside a body of water)
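(Output like the above can be produced with a few lines, assuming pywsd's simple_lesk interface:)

# Sketch reproducing the quoted output, assuming pywsd's simple_lesk interface
# (pip install pywsd; it builds on NLTK's WordNet).
from pywsd.lesk import simple_lesk

sense = simple_lesk("I went to the bank to deposit my money", "bank")
print("Sense:", sense)
print("Definition:", sense.definition())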
Here, for the word "bank", it suggests a financial institution and sloping land. In my case I already get such predictions: for "Titanic" it can be a movie or a game.
I want to know whether there is any other approach, apart from the Lesk algorithm, baseline algorithms, or traditional word sense disambiguation, which can help me identify which class is correct for a particular keyword.
Titanic -
Thanks for using the pywsd examples. With regard to WSD, there are many other variants, and I'm coding them myself in my free time. So if you want to see it improve, do join me in coding the open source tool =)
Meanwhile, you will find the following technologies more relevant to your task:
Knowledge base population (http://www.nist.gov/tac/2014/KBP/) where tokens/segments of text are assigned an entity and the task is to link them or to solve a simplified question and answer task.
Knowledge representation (http://groups.csail.mit.edu/medg/ftp/psz/k-rep.html)
Knowledge extraction (https://en.wikipedia.org/wiki/Knowledge_extraction)
The above technologies usually include several sub-tasks, such as:
Wikification (http://nlp.cs.rpi.edu/kbp/2014/elreading.html)
Entity linking
Slot filling (http://surdeanu.info/kbp2014/def.php)
Essentially you're asking for a tool that is an NP-complete AI system for language/text processing, so I don't really think such a tool exists as of yet. Maybe it's IBM Watson.
If you're looking for a field to look into, the field is out there, but if you're looking for tools, wikification tools are most probably the closest to what you need. (http://nlp.cs.rpi.edu/paper/WikificationProposal.pdf)

Word Map for Emotions

I am looking for a resource similar to WordNet. However, I want to be able to look up the positive/negative connotation of a word. For example:
bribe - negative
offer - positive
I'm curious as to whether anyone has run across any tool like this in AI/NLP research, or even in linguistics.
UPDATE:
For the curious, the accepted answer below put me on the right track towards what I needed. Wikipedia listed several different resources. The two I would recommend (because of ease of use/free use for a small number of API calls) are AlchemyAPI and Lymbix. I decided to go with AlchemyAPI, since people affiliated with academic institutions (like myself) and non-profits can get even more API calls per day if they just email the company.
Start looking up topics on 'sentiment analysis': http://en.wikipedia.org/wiki/Sentiment_analysis
There are some vocabulary compilations regarding affect, a.k.a. dictionaries of affect, such as the Affective Norms for English Words (ANEW) or the Dictionary of Affect in Language (DAL). They provide a dimensional representation of affect (valence, activation and control) that may be of use in a sentiment analysis scenario (detection of positive/negative connotation). In this sense, EmoLib works with the former by default, but may easily be extended with a more specific lexicon to tackle particular needs (for example, EmoLib provides an additional neutral label that is more appropriate than the positive/negative tag set alone in a text-to-speech synthesis setting).
There is also SentiWordNet, which gives you positive, negative and objective scores for each WordNet synset.
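If you end up working in Python, NLTK ships a SentiWordNet corpus reader, so a quick lookup takes only a few lines (a sketch; the naive first-sense choice below is an assumption, and the relevant corpora must be downloaded first):

# Minimal SentiWordNet lookup with NLTK
# (run nltk.download('sentiwordnet') and nltk.download('wordnet') once).
from nltk.corpus import sentiwordnet as swn

for word in ("bribe", "offer"):
    senses = list(swn.senti_synsets(word))
    if senses:
        s = senses[0]  # naive: just take the first sense
        print(word, "pos:", s.pos_score(), "neg:", s.neg_score(), "obj:", s.obj_score())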
However, you should be aware that the positive and negative connotation of a term often depends on the context in which it is used. A great introduction to this topic is the book Opinion mining and sentiment analysis by Bo Pang and Lillian Lee, which is available online for free.
