What is the difference between identity coreference and appositive coreference?
In the following passage, for example:
Mohammad traveled to Washington last week. He was on leave of absence. The 30-year-old man stayed in a hotel overlooking the National Mall.
As I understand it, there is an identity coreference between "Mohammad" and "he". Is there an appositive coreference between "he" and "the 30-year-old man", or between "Mohammad" and "the 30-year-old man"?
An appositive is a noun or noun phrase that renames another noun right beside it. An appositive can be a long or short combination of words. Let's look at a short example:
This impressive detective, Sherlock Holmes, always solves all kinds of problems.
Since neither "he" nor "Mohammad" is renamed by a noun phrase sitting right beside it, there is no appositive coreference between "he" and "the 30-year-old man" or between "Mohammad" and "the 30-year-old man"; that link is an identity coreference instead.
Let's say we have two sentences:
Jacob is going to watch a movie with Justin.
He will be back by 10 pm.
How does Stanford NLP identify that "he" refers to Jacob and not Justin?
This is called coreference resolution and is a well-studied problem in NLP. As such, there are many possible ways to do it. Stack Overflow is not the right venue for a literature review, but here are a few links to get you started:
http://www-labs.iro.umontreal.ca/~felipe/IFT6010-Hiver2015/Presentations/Abbas-Coreference.pdf
https://nlp.stanford.edu/projects/coref.shtml and links therein
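For a concrete starting point, here is a minimal sketch of running CoreNLP's coreference annotator through the stanza CoreNLPClient wrapper; it assumes Stanford CoreNLP is installed locally (with CORENLP_HOME set), and whether "He" actually gets linked to Jacob depends on the coreference model used:

```python
from stanza.server import CoreNLPClient

text = "Jacob is going to watch a movie with Justin. He will be back by 10 pm."

# Starts a local CoreNLP server with the coref annotator and shuts it down on exit.
with CoreNLPClient(
    annotators=["tokenize", "ssplit", "pos", "lemma", "ner", "parse", "coref"],
    timeout=60000,
    memory="4G",
) as client:
    ann = client.annotate(text)
    for chain in ann.corefChain:
        mentions = []
        for m in chain.mention:
            tokens = ann.sentence[m.sentenceIndex].token[m.beginIndex:m.endIndex]
            mentions.append(" ".join(t.word for t in tokens))
        print(mentions)  # e.g. ['Jacob', 'He'] if the model links them
```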
What is the etymology of the JJ tag denoting the POS for adjectives? I am unable to find any references online. There are several resources listing all the tags, but none describing the reason for the names.
It may be impossible to get an official answer. JJ has been used since the Brown corpus, and appears without comment in publications going back to at least 1981 (just after publication of the 1979 Form C "revised and amplified" edition).
Per this record of the corpus, the main publication by the authors accompanying Form C is the manual, available here. It contains the list of tags, with plenty of explanation of how words are classified but none of how the tag names were chosen.
After reviewing Role of the Brown Corpus in the History of Corpus Linguistics (Olga Kholkovskaia, 2017), I agree that the authors generally focused on the massive compilation and tagging method over commentary. The 1967 classic "Computational Analysis of Present-Day American English" is mostly frequency tables, with no instance of "adjective" or JJ in it. Thus, I found no publications where lead authors Francis and Kučera discuss their choice of tags, and both passed away in the 2000s.
This limits us to speculation. The authors had 82 tags that needed to be short, memorable (the tagging process was partly manual), and able to take various appended modifiers without creating confusion. Vowels are fairly useless for this, with every part of speech in the table containing at least one. Verb (VB) and noun (NN) go by first-and-last letters, while others use initialisms (coordinating conjunction CC, foreign word FW), syllable initialisms (modal MD, predeterminer PDT), first letters (possessive POS), or arbitrary associations (interjection UH).
Adjective's JJ is odd in using a letter absent from the word itself, and it lacks the intuitive mnemonic of UH, possessive P$, or plural S, but it is hardly the strangest tag choice, even in the reduced Penn Treebank table. Perhaps someone wanted to match NN's style and doubled the first relatively uncommon letter in "adjective". Any more detailed answer may only be possible by finding unpublished notes or still-living colleagues.
I am currently doing a project where we are trying to gauge explanatory answers submitted by users against a correct answer. I have come across APIs like dandelion and paralleldots, both of which are capable of checking how close 2 texts are to each other semantically.
These APIs are giving me favorable responses for questions like:
What is the distinction between debtor and creditor?
Answer1: A debtor is a person or enterprise that owes money to another
party. A creditor is a person, bank, or other enterprise that has
lent money or extended credit to another party.
Answer2: A debtor has a debt or legal obligation to pay an amount to
another person or entity, from whom goods were purchased or services
were obtained. A creditor may be a bank, supplier
Dandelion gave me a score of 81% and paralleldots gave me 4.8/5 for the same answer. This is quite expected.
However, before I prepare a demo and plan to eventually use them in production, I am interested in understanding to some extent how these APIs are generating these scores.
Is it a tf-idf-based vector product of the stemmed POSes?
PS: Not an expert in NLP
This question is very broad: semantic sentence similarity is an open issue in NLP and there are a variety of ways of performing this task, all of them being far from perfect at the current stage. As an example, just consider that:
Trump is the president of the United States
and
Trump has never been the president of the United States
have a semantic similarity of 5 according to paralleldots. Whether that counts as correct depends on your definition of similarity, but the point is that, depending on what you need the similarity for, an off-the-shelf score may not be fully suitable if you have specific requirements.
Anyway, as for the implementation, there's no single "standard" way of doing this, and there's a plethora of features that can be used: tf-idf (or equivalent), the syntactic structure of the sentence (i.e. its constituency or dependency parse tree), mentions of entities extracted from the text, etc., or, following the latest trends, a deep neural network which doesn't need any explicit features.
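To make the first of those features concrete, here is a minimal tf-idf baseline using scikit-learn; this is only an illustration of that one feature, not what dandelion or paralleldots actually do:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

answer1 = ("A debtor is a person or enterprise that owes money to another party. "
           "A creditor is a person, bank, or other enterprise that has lent money "
           "or extended credit to another party.")
answer2 = ("A debtor has a debt or legal obligation to pay an amount to another "
           "person or entity, from whom goods were purchased or services were "
           "obtained. A creditor may be a bank, supplier")

# Bag-of-words tf-idf vectors followed by cosine similarity; real services
# almost certainly add much more on top of (or instead of) this.
vectors = TfidfVectorizer(stop_words="english").fit_transform([answer1, answer2])
print(cosine_similarity(vectors[0], vectors[1])[0, 0])
```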
Is there a [directional?] notion/implementation of distance between Wikipedia categories/pages?
For example consider: A) "Saint Louis University" B) "university"
Clearly "A" is a type of "B". How can you extract this from Wiki?
If you extract all the categories connected to A, you get:
Category:1818 establishments in Missouri Territory
Category:Articles containing Latin-language text
Category:Association of Catholic Colleges and Universities
Category:Commons category with local link same as on Wikidata
Category:Coordinates on Wikidata
Category:Educational institutions established in 1818
Category:Instances of Infobox university using image size
Category:Jesuit universities and colleges in the United States
Category:Roman Catholic Archdiocese of St. Louis
Category:Roman Catholic universities and colleges in Missouri
and it does not contain anything that would directly connect to B (https://en.wikipedia.org/wiki/University). But if you look further, you should be able to find a path between A and B, possibly over multiple hops. What are the popular ways of accomplishing this?
Some ideas/resources I collected. Will update this if I find more.
-- Using DBPedia: a knowledge base curated from Wikipedia. They provide a SPARQL endpoint to query this KB, but one has to simulate the desired similarity/distance behavior via that SPARQL interface (see the sketch after this list). Some ideas are here and here, but they seem to be outdated.
-- Using UMBEL: http://umbel.org/ which is a knowledge graph of concepts. I think the size of this knowledge graph is relatively small, but I suspect its precision is probably high. That being said, I'm not sure how this relates to Wikipedia at all. They have an API for calculating the distance measure between any pair of their concepts (at the moment of writing this post, their similarity API is down, so it is not a feasible solution at the moment).
-- Using http://degreesofwikipedia.com/ I don't know the details of their algorithm, but they provide a distance between Wiki concepts, and it is directional. For example this and this.
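As a sketch of the DBpedia route mentioned above: category memberships are exposed as dct:subject triples and the category hierarchy as skos:broader, so a multi-hop connection can be probed with a property-path query (the target category below is an assumption for illustration, and the public endpoint may time out on long paths):

```python
import requests

# Does Saint_Louis_University reach the (assumed) category "Universities_and_colleges"
# through one dct:subject hop and zero or more skos:broader hops?
query = """
PREFIX dct: <http://purl.org/dc/terms/>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
ASK {
  <http://dbpedia.org/resource/Saint_Louis_University>
      dct:subject/skos:broader* <http://dbpedia.org/resource/Category:Universities_and_colleges> .
}
"""
resp = requests.get(
    "https://dbpedia.org/sparql",
    params={"query": query, "format": "application/sparql-results+json"},
)
print(resp.json()["boolean"])
```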
If you have the entire Wikipedia category taxonomy, then you can compute the distance (shortest path length) between two categories. If one category is the ancestor of the other, it is straightforward.
Otherwise you can find the Least Common Subsumer which is defined as follows.
Least common subsumer of two concepts A and B is the most specific
concept which is an ancestor of both A and B.
Then compute the distance between the two categories via the LCS (the sum of each category's distance to the LCS).
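A minimal sketch of that computation over a toy, hand-made fragment of the category graph (the edges below are assumptions for illustration; real edges would come from a category dump or DBpedia's skos:broader links):

```python
import networkx as nx

# Toy child -> parent edges; not real Wikipedia data.
edges = [
    ("Saint Louis University", "Jesuit universities and colleges in the United States"),
    ("Jesuit universities and colleges in the United States", "Universities and colleges"),
    ("University", "Universities and colleges"),
]
g = nx.DiGraph(edges)

def least_common_subsumer(graph, a, b):
    """Most specific node reachable from both a and b, plus the summed path length."""
    # Edges point child -> parent, so the nodes reachable from a are its ancestors.
    ancestors_a = nx.descendants(graph, a) | {a}
    ancestors_b = nx.descendants(graph, b) | {b}
    common = ancestors_a & ancestors_b
    if not common:
        return None, float("inf")
    dist = lambda c: (nx.shortest_path_length(graph, a, c)
                      + nx.shortest_path_length(graph, b, c))
    best = min(common, key=dist)
    return best, dist(best)

print(least_common_subsumer(g, "Saint Louis University", "University"))
# -> ('Universities and colleges', 3)
```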
I encourage you to go through similarity measures, where you will find state-of-the-art techniques to compute semantic similarity between words.
Resource: My project on extracting Wikipedia category/concept might help you.
One very good related example:
Compute semantic similarity between words using WordNet, which organizes English words in a hierarchical fashion. See this WordNet similarity for Java demo. It uses eight different state-of-the-art techniques to compute semantic similarity between words.
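If you prefer Python to the Java demo, NLTK exposes the same WordNet hierarchy; a quick sketch of a path-based similarity and the lowest common hypernym (the LCS) between two word senses:

```python
from nltk.corpus import wordnet as wn  # requires a one-time nltk.download("wordnet")

university = wn.synset("university.n.01")
institution = wn.synset("institution.n.01")

# Similarity derived from the shortest path between the senses in the hypernym graph
print(university.path_similarity(institution))
# Their least common subsumer in the WordNet hierarchy
print(university.lowest_common_hypernyms(institution))
```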
You might be looking for the "is a" relationship: Q734774 (the Wikidata item for Saint Louis University) is a university, a building and a private not-for-profit educational institution. You can use SPARQL to query it:
is Saint Louis University a university?
how far is Saint Louis University removed from the concept of "university"? (although I doubt this would produce anything meaningful)
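For the first of those questions, a minimal sketch against the Wikidata SPARQL endpoint; Q3918 is assumed here to be the Wikidata item for "university", and P31/P279* follows "instance of" plus any number of "subclass of" hops:

```python
import requests

# wd: and wdt: are predefined prefixes on the Wikidata query service.
query = """
ASK {
  wd:Q734774 wdt:P31/wdt:P279* wd:Q3918 .
}
"""
resp = requests.get(
    "https://query.wikidata.org/sparql",
    params={"query": query, "format": "json"},
    headers={"User-Agent": "is-a-check/0.1 (example)"},
)
print(resp.json()["boolean"])  # True if an "is a" path to university exists
```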
Is it possible to count how many times an entity has been mentioned in an article? For example
ABC Company is one of the largest car manufacturers in the
world. It is also the largest
company in terms of annual production.
It is also the second largest exporter of luxury cars, after XYZ
company. Both ABC and XYZ
together produces over n% of total car
production in the country.
mentions ABC company 4 times.
Yes, this is possible. It's a combination of
named-entity recognition (NER), which for English is practically a solved problem, and
coreference resolution, which is the subject of ongoing research (but give this package a try)
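As a sketch of the first half only, using spaCy's off-the-shelf NER to count surface mentions of each organization (the exact spans depend on the model, and folding in pronoun mentions like "It" would additionally need the coreference step; en_core_web_sm is assumed to be installed):

```python
from collections import Counter
import spacy

nlp = spacy.load("en_core_web_sm")

text = (
    "ABC Company is one of the largest car manufacturers in the world. "
    "It is also the largest company in terms of annual production. "
    "It is also the second largest exporter of luxury cars, after XYZ company. "
    "Both ABC and XYZ together produce over n% of total car production in the country."
)

doc = nlp(text)
# Count every entity span the model tags as an organization
counts = Counter(ent.text for ent in doc.ents if ent.label_ == "ORG")
print(counts)
```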