Find all references to a supplied noun in StanfordNLP

I'm trying to parse some text to find all references to a particular item. So, for example, if my item was The Bridge on the River Kwai and I passed it this text, I'd like it to find every mention of the film: the title itself, plus later references such as "The film" and "The movie".
The Bridge on the River Kwai is a 1957 British-American epic war film
directed by David Lean and starring William Holden, Jack Hawkins, Alec
Guinness, and Sessue Hayakawa. The film is a work of fiction, but
borrows the construction of the Burma Railway in 1942–1943 for its
historical setting. The movie was filmed in Ceylon (now Sri Lanka).
The bridge in the film was near Kitulgala.
So far my attempt has been to go through all the mentions attached to each CorefChain and loop through those hunting for my target string. If I find the target string, I add the whole CorefChain, as I think this means the other items in that CorefChain also refer to the same thing.
List<CorefChain> gotRefs = new ArrayList<CorefChain>();
String pQuery = "The Bridge on the River Kwai";
for (CorefChain cc : document.get(CorefCoreAnnotations.CorefChainAnnotation.class).values()) {
    List<CorefChain.CorefMention> corefMentions = cc.getMentionsInTextualOrder();
    boolean addedChain = false;
    for (CorefChain.CorefMention cm : corefMentions) {
        if ((!addedChain) &&
            (pQuery.equals(cm.mentionSpan))) {
            gotRefs.add(cc);
            addedChain = true;
        }
    }
}
I then loop through this second list of CorefChains, re-retrieve the mentions for each chain and step through them, printing which sentence contains each likely mention of my item.
for (CorefChain gr : gotRefs) {
    List<CorefChain.CorefMention> corefMentionsUsing = gr.getMentionsInTextualOrder();
    for (CorefChain.CorefMention cm : corefMentionsUsing) {
        System.out.println("Got reference to " + cm.mentionSpan + " in sentence #" + cm.sentNum);
    }
}
It finds some of my references, but not that many, and it produces a lot of false positives. As is probably entirely apparent from reading this, I don't really know the first thing about NLP - am I going about this entirely the wrong way? Is there a StanfordNLP parser that will already do some of what I'm after? Should I be training a model in some way?

I think a problem with your example is that you are looking for references to a movie title, and there isn't support in Stanford CoreNLP for recognizing movie titles, book titles, etc...
If you look at this example:
"Joe bought a laptop. He is happy with it."
You will notice that it connects:
"Joe" -> "He"
and
"a laptop" -> "it"
Coreference is an active research area and even the best system can only really be expected to produce an F1 of around 60.0 on general text, meaning it will often make errors.
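For reference, here is a minimal sketch of running the deterministic coreference pipeline on that example and printing every chain it finds (package names are from CoreNLP 3.x; they may differ in other versions):

import java.util.Properties;
import edu.stanford.nlp.dcoref.CorefChain;
import edu.stanford.nlp.dcoref.CorefCoreAnnotations;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;

public class CorefDemo {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.setProperty("annotators", "tokenize, ssplit, pos, lemma, ner, parse, dcoref");
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

        Annotation doc = new Annotation("Joe bought a laptop. He is happy with it.");
        pipeline.annotate(doc);

        // Each CorefChain groups the mentions the system believes co-refer,
        // e.g. "Joe" -> "He" and "a laptop" -> "it".
        for (CorefChain cc : doc.get(CorefCoreAnnotations.CorefChainAnnotation.class).values()) {
            for (CorefChain.CorefMention cm : cc.getMentionsInTextualOrder()) {
                System.out.println("chain " + cc.getChainID() + ": \"" + cm.mentionSpan
                        + "\" in sentence " + cm.sentNum);
            }
        }
    }
}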

Related

making link between a person name and pronoun in GATE

Is it possible to make a link between a person name and its PRP in GATE? E.g. I have the document "Maria Sharapova is a tennis player from Russia. She participates in international tennis tournaments. She is known for winning Wimbledon, US Open and Australian Open titles, as well as for her looks, decibel-breaking grunts and commercial savvy - all of which made her the world's highest-paid female athlete." I want to annotate "She" as "Maria Sharapova". I have written the following JAPE rule, which identifies a pattern having a PRP after a person name:
Phase: Simple
Input: Lookup Token Split
Options: control = appelt
Rule:joblookup
(
({Lookup.majorType == person_first}|
{Lookup.majorType == person_full})
({Token.kind==word})+
{Split.kind==internal}
{Token.category==PRP}
):sameathlete
-->
:sameathlete.sameAthlete1 = {kind="athlete", rule="same-athlete"}
How can I make the annotation that "She" means we are talking about the same person whose name is mentioned one or two sentences before?
Please help.
Have you tried the Coreference PR for GATE?

Train model using Named entity

I am looking at Stanford CoreNLP, using the Named Entity Recognizer. I have different kinds of input text and I need to tag them with my own entities, so I started training my own model, but it doesn't seem to be working.
For example, my input text string is "Book of 49 Magazine Articles on Toyota Land Cruiser 1956-1987 Gold Portfolio http://t.co/EqxmY1VmLg http://t.co/F0Vefuoj9Q"
I went through the examples to train my own model, looking only for the words that I am interested in.
My jane-austen-emma-ch1.tsv looks like this
Toyota PERS
Land Cruiser PERS
From the above input text I am only interested in those two terms: one is
Toyota and the other is Land Cruiser.
The austen.prop looks like this:
trainFile = jane-austen-emma-ch1.tsv
serializeTo = ner-model.ser.gz
map = word=0,answer=1
useClassFeature=true
useWord=true
useNGrams=true
noMidNGrams=true
useDisjunctive=true
maxNGramLeng=6
usePrev=true
useNext=true
useSequences=true
usePrevSequences=true
maxLeft=1
useTypeSeqs=true
useTypeSeqs2=true
useTypeySequences=true
wordShape=chris2useLC
Run the following command to generate the ner-model.ser.gz file
java -cp stanford-corenlp-3.4.1.jar edu.stanford.nlp.ie.crf.CRFClassifier -prop austen.prop
public static void main(String[] args) {
    String serializedClassifier = "edu/stanford/nlp/models/ner/english.muc.7class.distsim.crf.ser.gz";
    String serializedClassifier2 = "C:/standford-ner/ner-model.ser.gz";
    try {
        NERClassifierCombiner classifier = new NERClassifierCombiner(false, false,
                serializedClassifier2, serializedClassifier);
        String ss = "Book of 49 Magazine Articles on Toyota Land Cruiser 1956-1987 Gold Portfolio http://t.co/EqxmY1VmLg http://t.co/F0Vefuoj9Q";
        System.out.println("---");
        List<List<CoreLabel>> out = classifier.classify(ss);
        for (List<CoreLabel> sentence : out) {
            for (CoreLabel word : sentence) {
                System.out.print(word.word() + '/' + word.get(AnswerAnnotation.class) + ' ');
            }
            System.out.println();
        }
    } catch (Exception e) {
        e.printStackTrace();
    }
}
Here is the output I am getting
Book/PERS of/PERS 49/O Magazine/PERS Articles/PERS on/O Toyota/PERS Land/PERS Cruiser/PERS 1956-1987/PERS Gold/O Portfolio/PERS http://t.co/EqxmY1VmLg/PERS http://t.co/F0Vefuoj9Q/PERS
which I think is wrong. I am looking for Toyota/PERS and Land Cruiser/PERS (which is a multi-word value).
Thanks for the help. Any help is really appreciated.
I believe you should also put examples of O (non-entity) tokens in your trainFile. As you gave it, the trainFile is just too simple for the learning to be done; it needs both O and PERS examples so it doesn't annotate everything as PERS. You're not teaching it about your not-of-interest tokens. Say, like this:
Toyota PERS
of O
Portfolio O
49 O
and so on.
Also, for phrase-level recognition you should look into regexner, where you can have patterns (patterns work well for this case). I'm working on this with the API and I have the following code:
Properties props = new Properties();
props.put("annotators", "tokenize, ssplit, pos, lemma, ner, regexner");
props.put("regexner.mapping", customLocationFilename);
with the following customLocationFileName:
Make Believe Town figure of speech ORGANIZATION
( /Hello/ [{ ner:PERSON }]+ ) salut PERSON
Bachelor of (Arts|Laws|Science|Engineering) DEGREE
( /University/ /of/ [{ ner:LOCATION }] ) SCHOOL
and the text: "Hello Mary Keller was born on 4th of July and took a Bachelor of Science. Partial invoice (€100,000, so roughly 40%) for the consignment C27655 we shipped on 15th August to University of London from the Make Believe Town depot. INV2345 is for the balance. Customer contact (Sigourney Weaver) says they will pay this on the usual credit terms (30 days)."
The output I get
Hello Mary Keller is a salut
4th of July is a DATE
Bachelor of Science is a DEGREE
$ 100,000 is a MONEY
40 % is a PERCENT
15th August is a DATE
University of London is a ORGANIZATION
Make Believe Town is a figure of speech
Sigourney Weaver is a PERSON
30 days is a DURATION
For more info on how to do this you can look at the example that got me going.
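For completeness, here is a minimal sketch of wiring those properties into a pipeline and printing each token's entity tag (the mapping file name below is a stand-in for customLocationFilename):

import java.util.Properties;
import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.util.CoreMap;

public class RegexNerDemo {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("annotators", "tokenize, ssplit, pos, lemma, ner, regexner");
        props.put("regexner.mapping", "customLocations.txt"); // stand-in for customLocationFilename

        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
        Annotation doc = new Annotation("Hello Mary Keller was born on 4th of July and took a Bachelor of Science.");
        pipeline.annotate(doc);

        // Print every token with the entity tag assigned by ner/regexner.
        for (CoreMap sentence : doc.get(CoreAnnotations.SentencesAnnotation.class)) {
            for (CoreLabel token : sentence.get(CoreAnnotations.TokensAnnotation.class)) {
                System.out.println(token.word() + "\t"
                        + token.get(CoreAnnotations.NamedEntityTagAnnotation.class));
            }
        }
    }
}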
The NERClassifier* is word level, that is, it labels words, not phrases. Given that, the classifier seems to be performing fine. If you want, you can hyphenate words that form phrases. So in your labeled examples and in your test examples, you would make "Land Cruiser" to "Land_Cruiser".

Extract Noun phrase using stanford NLP

I am trying to find the Theme/Noun phrase from a sentence using Stanford NLP
For example, for the sentence "the white tiger" I would love to get the
Theme/Noun phrase: "white tiger".
For this I used the POS tagger. The result I am getting is "tiger", which is not correct. The sample code I ran is:
public static void main(String[] args) throws IOException {
    Properties props = new Properties();
    props.setProperty("annotators", "tokenize,ssplit,parse");
    StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
    Annotation annotation = new Annotation("the white tiger");
    pipeline.annotate(annotation);
    List<CoreMap> sentences = annotation
            .get(CoreAnnotations.SentencesAnnotation.class);
    System.out.println("the size of the sentences is......"
            + sentences.size());
    for (CoreMap sentence : sentences) {
        System.out.println("the sentence is..." + sentence.toString());
        Tree tree = sentence.get(TreeAnnotation.class);
        PrintWriter out = new PrintWriter(System.out);
        out.println("The first sentence parsed is:");
        tree.pennPrint(out);
        System.out.println("does it come here.....1111");
        TregexPattern pattern = TregexPattern.compile("#NP");
        TregexMatcher matcher = pattern.matcher(tree);
        while (matcher.find()) {
            Tree match = matcher.getMatch();
            List<Tree> leaves1 = match.getChildrenAsList();
            StringBuilder stringbuilder = new StringBuilder();
            for (Tree tree1 : leaves1) {
                String val = tree1.label().value();
                if (val.equals("NN") || val.equals("NNS")
                        || val.equals("NNP") || val.equals("NNPS")) {
                    Tree nn[] = tree1.children();
                    String ss = Sentence.listToString(nn[0].yield());
                    stringbuilder.append(ss).append(" ");
                }
            }
            System.out.println("the final stringbuilder is ...."
                    + stringbuilder);
        }
    }
}
Any help is really appreciated. Any other thoughts on how to achieve this?
It looks like you're descending the parse tree looking for NN.*.
"white" is a JJ (an adjective), which won't be included when searching for NN.*.
You should take a close look at the Stanford Dependencies Manual and decide what part of speech tags encompass what you're looking for. You should also look at real linguistic data to try to figure out what matters in the task you're trying to complete. What about:
the tiger [with the black one] [who was white]
Simply traversing the tree in that case will give you tiger black white. Exclude PP's? Then you lose lots of good info:
the tiger [with white fur]
I'm not sure what you're trying to accomplish, but make sure what you're trying to do is restricted in the right way.
You ought to polish up on your basic syntax as well. "the white tiger" is what linguists call a Noun Phrase or NP. You'd be hard pressed for a linguist to call an NP a sentence. There are also often many NPs inside a sentence; sometimes, they're even embedded inside one another. The Stanford Dependencies Manual is a good start. As in the name, the Stanford Dependencies are based on the idea of dependency grammar, though there are other approaches that bring different insights to the table.
Learning what linguists know about the structure of sentences could help you significantly in getting at what you're trying to extract or--as happens often--realizing that what you're trying to extract is too difficult and that you need to find a new route to a solution.
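That said, if all that is wanted for the simple example is the surface text of each noun phrase, a minimal variation on the question's loop (a sketch that reuses the same tree and Tregex setup) is to print the whole yield of every matched NP instead of only its NN* children:

// Reuses 'tree' from the question's pipeline; prints the full text under each NP.
TregexPattern npPattern = TregexPattern.compile("NP");
TregexMatcher npMatcher = npPattern.matcher(tree);
while (npMatcher.find()) {
    Tree np = npMatcher.getMatch();
    // yield() returns every leaf under the NP, so "the white tiger" comes back whole;
    // strip a leading determiner (DT) here if only "white tiger" is wanted.
    System.out.println(Sentence.listToString(np.yield()));
}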

Configuring Solr for Suggestive/Predictive Auto Complete Search

We are working on integrating Solr 3.6 into an eCommerce site. We have indexed the data and search is performing really well.
We are having some difficulty figuring out how to do predictive search / auto-complete search suggestions, and we are also interested in best practices for implementing this feature.
Our goal is to offer predictive search similar to http://www.amazon.com/, but we don't know how to implement it with Solr. More specifically, I want to understand how to build those suggestion terms from Solr, or whether it is managed by something external to Solr. How should the dictionary be built for offering these kinds of suggestions? Moreover, for some fields, search should offer to search within a category. Try typing "xper" into Amazon's search box, and you will notice that apart from xperia, xperia s, xperia p, it also lists xperia s in Cell phones & accessories, which is a category.
With a custom dictionary this would be difficult to manage, or maybe we just don't know how to do it correctly. We are looking to you to guide us on how best to utilize Solr to achieve this kind of suggestive search.
I would suggest a couple of blog posts:
This one shows you a really nice, complete solution which works well but requires some additional work, and uses a dedicated Lucene index (Solr core) for that specific purpose.
I used the highlight approach because the facet.prefix one is too heavy for a big index, and the other ones had little or unclear documentation (I'm a stupid programmer).
So let's suppose the user has just typed "aaa bbb ccc"
Our autocomplete function (java/javascript) will call solr using the following params
q="aaa bbb"~100 ...base query, all the typed words except the last
fq=ccc* ...suggest word filter using last typed word
hl=true
hl.q=ccc* ...highlight word will be the one to suggest
fl=NONE ...return empty docs in result tag
hl.pre=### ...escape chars to locate highlight word in the response
hl.post=### ...see above
You can also control the number of suggestions with the 'rows' and 'hl.fragsize' parameters.
The highlighted words in each document will be the right candidates for the suggestion with the "aaa bbb" string.
More suggestion words are the ones before/after the highlighted words and, of course, you can implement more filters to extract valid words, avoid duplicates, and limit suggestions.
If interested, I can send you some examples...
EDITED: Some further details about the approach
The portion of the example I give assumes the 'autocomplete' mechanism provided by jQuery: we invoke a JSP (or a servlet) inside a web application, passing the words just typed by the user as the request param 'q'.
This is the code of the JSP:
ByteArrayInputStream is=null; // Used to manage Solr response
try{
StringBuffer queryUrl=new StringBuffer("putHereTheUrlOfSolrServer");
queryUrl.append("/select?wt=xml");
String typedWords=request.getParameter("q");
String base="";
if(typedWords.indexOf(" ")<=0) {
// No space typed by user: the 'easy case'
queryUrl.append("&q=text:");
queryUrl.append(URLEncoder.encode(typedWords+"*", "UTF-8"));
queryUrl.append("&hl.q=text:"+URLEncoder.encode(typedWords+"*", "UTF-8"));
} else {
// Space chars present
// we split the search in base phrase and last typed word
base=typedWords.substring(0,typedWords.lastIndexOf(" "));
queryUrl.append("&q=text:");
if(base.indexOf(" ")>0)
queryUrl.append("\""+URLEncoder.encode(base, "UTF-8")+"\"~1000");
else
queryUrl.append(URLEncoder.encode(base, "UTF-8"));
typedWords=typedWords.substring(typedWords.lastIndexOf(" ")+1);
queryUrl.append("&fq=text:"+URLEncoder.encode(typedWords+"*", "UTF-8"));
queryUrl.append("&hl.q=text:"+URLEncoder.encode(typedWords+"*", "UTF-8"));
}
// The additional parameters to control the solr response
queryUrl.append("&rows="+suggestPageSize); // Number of results returned, a parameter to control the number of suggestions
queryUrl.append("&fl=A_FIELD_NAME_THAT_DOES_NOT_EXIST"); // Interested only in highlights section, Solr return a 'light' answer
queryUrl.append("&start=0"); // Use only first page of results
queryUrl.append("&hl=true"); // Enable highlights feature
queryUrl.append("&hl.simple.pre=***"); // Use *** as 'highlight border'
queryUrl.append("&hl.simple.post=***"); // Use *** as 'highlight border'
queryUrl.append("&hl.fragsize="+suggestFragSize); // Another parameter to control the number of suggestions
queryUrl.append("&hl.fl=content,title"); // Look for result only in some fields
queryUrl.append("&facet=false"); // Disable facets
/* Omitted section: use a new URL(queryUrl.toString()) to get the solr response inside a byte array */
is=new ByteArrayInputStream(solrResponseByteArray);
DocumentBuilderFactory dbFactory = DocumentBuilderFactory.newInstance();
DocumentBuilder dBuilder = dbFactory.newDocumentBuilder();
Document doc = dBuilder.parse(is);
XPathFactory xPathfactory = XPathFactory.newInstance();
XPath xpath = xPathfactory.newXPath();
XPathExpression expr = xpath.compile("//response/lst[@name=\"highlighting\"]/lst/arr[@name=\"content\"]/str");
NodeList valueList = (NodeList) expr.evaluate(doc, XPathConstants.NODESET);
Vector<String> suggestions=new Vector<String>();
for (int j = 0; j < valueList.getLength(); ++j) {
Element value = (Element) valueList.item(j);
String[] result=value.getTextContent().split("\\*\\*\\*");
for(int k=0;k<result.length;k++){
String suggestedWord=result[k].toLowerCase();
if((k%2)!=0){
//Highlighted words management
if(suggestedWord.length()>=typedWords.length() && !suggestions.contains(suggestedWord)) // presumably: keep suggestions at least as long as the typed prefix
suggestions.add(suggestedWord);
}else{
/* Words before/after highlighted words
we can put these words inside another vector
and use them if not enough suggestions */
}
}
}
/* Finally we build a Json Answer to be managed by our jquery function */
out.print(request.getParameter("json.wrf")+"({ \"suggestions\" : [");
boolean firstSugg=true;
for(String suggestionW:suggestions) {
out.print((firstSugg?" ":" ,"));
out.print("{ \"suggest\" : \"");
if(base.length()>0) {
out.print(base);
out.print(" ");
}
out.print(suggestionW+"\" }");
firstSugg=false;
}
out.print(" ]})");
}catch (Exception x) {
System.err.println("Exception during main process: " + x);
x.printStackTrace();
}finally{
//Gracefully close streams//
try{is.close();}catch(Exception x){;}
}
Hope this is helpful,
Nik
This might help you out. I am trying to do the same:
http://solr.pl/en/2010/10/18/solr-and-autocomplete-part-1/

Algorithm to find articles with similar text

I have many articles in a database (with title,text), I'm looking for an algorithm to find the X most similar articles, something like Stack Overflow's "Related Questions" when you ask a question.
I tried googling for this but only found pages about other "similar text" issues, something like comparing every article with all the others and storing a similarity somewhere. SO does this in "real time" on text that I just typed.
How?
Edit distance isn't a likely candidate, as it would be spelling/word-order dependent, and much more computationally expensive than Will is leading you to believe, considering the size and number of the documents you'd actually be interested in searching.
Something like Lucene is the way to go. You index all your documents, and then when you want to find documents similar to a given document, you turn your given document into a query, and search the index. Internally Lucene will be using tf-idf and an inverted index to make the whole process take an amount of time proportional to the number of documents that could possibly match, not the total number of documents in the collection.
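As a rough sketch of what that can look like with Lucene's MoreLikeThis helper (the index path, field names and exact package locations are assumptions and vary with the Lucene version):

import java.io.StringReader;
import java.nio.file.Paths;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.queries.mlt.MoreLikeThis;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.store.FSDirectory;

public class SimilarArticles {
    public static void main(String[] args) throws Exception {
        // Assumes the articles were indexed beforehand with a stored "title" field
        // and an analyzed "body" field.
        DirectoryReader reader = DirectoryReader.open(FSDirectory.open(Paths.get("article-index")));
        IndexSearcher searcher = new IndexSearcher(reader);

        MoreLikeThis mlt = new MoreLikeThis(reader);
        mlt.setAnalyzer(new StandardAnalyzer());
        mlt.setFieldNames(new String[] { "body" });
        mlt.setMinTermFreq(1);
        mlt.setMinDocFreq(2);

        // Turn the given article's text into a query and search the index with it.
        String givenArticleText = "text of the article you want similar articles for";
        Query query = mlt.like("body", new StringReader(givenArticleText));

        for (ScoreDoc hit : searcher.search(query, 10).scoreDocs) {
            System.out.println(searcher.doc(hit.doc).get("title") + "  score=" + hit.score);
        }
        reader.close();
    }
}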
It depends upon your definition of similar.
The edit-distance algorithm is the standard algorithm for (Latin language) dictionary suggestions, and can work on whole texts. Two texts are similar if they have basically the same words (eh, letters) in the same order. So the following two book reviews would be fairly similar:
1) "This is a great book"
2) "These are not great books"
(The number of letters to remove, insert, delete or alter to turn (2) into (1) is termed the 'edit distance'.)
To implement this you would want to visit every review programmatically. This is perhaps not as costly as it sounds, and if it is too costly you could do the comparisons as a background task and store the n most similar in a database field itself.
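For reference, a plain sketch of that edit distance (Levenshtein) computed with the usual dynamic-programming table:

public class EditDistance {
    // Number of single-character insertions, deletions and substitutions
    // needed to turn a into b.
    static int editDistance(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i;
        for (int j = 0; j <= b.length(); j++) d[0][j] = j;
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                                   d[i - 1][j - 1] + cost);
            }
        }
        return d[a.length()][b.length()];
    }

    public static void main(String[] args) {
        // A small distance relative to the text lengths suggests the two reviews are similar.
        System.out.println(editDistance("This is a great book", "These are not great books"));
    }
}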
Another approach is to understand something of the structure of (Latin) languages. If you strip short (non-capitalised or quoted) words and assign weights to words (or prefixes) that are common or unique, you can do a Bayesian-esque comparison. The two following book reviews might be simplified and found to be similar:
3) "The french revolution was blah blah War and Peace blah blah France." -> France/French(2) Revolution(1) War(1) Peace(1) (note that a dictionary has been used to combine France and French)
4) "This book is blah blah a revolution in french cuisine." -> France(1) Revolution(1)
To implement this you would want to identify the 'keywords' in a review when it is created/updated, and to find similar reviews use these keywords in the where-clause of a query (ideally 'full text' searching if the database supports it), perhaps with post-processing of the result set to score the candidates found.
Books also have categories - are thrillers set in France similar to historical studies of France, and so on? Metadata beyond title and text might be useful for keeping results relevant.
The tutorial at this link sounds like it may be what you need. It is easy to follow and works very well.
His algorithm rewards both common substrings and a common ordering of those substrings and so should pick out similar titles quite nicely.
I suggest to index your articles using Apache Lucene, a high-performance, full-featured text search engine library written entirely in Java. It is a technology suitable for nearly any application that requires full-text search, especially cross-platform. Once indexed, you could easily find related articles.
One common algorithm used is the Self-Organizing Map.
It is a type of neural network that will automatically categorize your articles. Then you can simply find the location of the current article in the map, and all articles near it are related. The important part of the algorithm is how you vector-quantize your input. There are several ways to do this with text: you can hash your document/title, you can count words and use that as an n-dimensional vector, etc. Hope that helps, although I may have opened up a Pandora's box for you of an endless journey in AI.
SO does the comparison only on the title, not on the body text of the question, so only on rather short strings.
You can use their algorithm (no idea what it looks like) on the article title and the keywords.
If you have more cpu time to burn, also on the abstracts of your articles.
Seconding the Lucene suggestion for full-text, but note that java is not a requirement; a .NET port is available. Also see the main Lucene page for links to other projects, including Lucy, a C port.
Maybe what you're looking for is something that does paraphrase detection. I only have cursory knowledge of this, but paraphrasing is a natural language processing concept for determining whether two passages of text actually mean the same thing, even though they may use entirely different words.
Unfortunately I don't know of any tools that let you do this (although I'd be interested in finding one).
If you are looking for words that sound alike, you could convert them to Soundex and match the Soundex codes ... that worked for me.
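A small sketch of that idea, assuming Apache Commons Codec is on the classpath:

import org.apache.commons.codec.language.Soundex;

public class SoundexDemo {
    public static void main(String[] args) throws Exception {
        Soundex soundex = new Soundex();
        // Words that sound alike map to the same four-character code.
        System.out.println(soundex.encode("Smith")); // S530
        System.out.println(soundex.encode("Smyth")); // S530
        // difference() returns 0-4; 4 means the two encodings match completely.
        System.out.println(soundex.difference("Robert", "Rupert")); // 4
    }
}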
I tried several methods and none worked really well. One can get a relatively satisfactory result like this:
First: get a Google SimHash code for every paragraph of every text and store it in the database.
Second: build an index on the SimHash codes.
Third: process the text to be compared as above, get its SimHash code and search all texts by the SimHash index within a Hamming distance of, say, 5-10. Then compare similarity with term vectors.
This may work for big data.
Given a sample text, this program lists the repository texts sorted by similarity: simple implementation of bag of words in C++. The algorithm is linear in the total length of the sample text and the repository texts. Plus the program is multi-threaded to process repository texts in parallel.
Here is the core algorithm:
class Statistics {
std::unordered_map<std::string, int64_t> _counts;
int64_t _totWords;
void process(std::string& token);
public:
explicit Statistics(const std::string& text);
double Dist(const Statistics& fellow) const;
bool IsEmpty() const { return _totWords == 0; }
};
namespace {
const std::string gPunctStr = ".,;:!?";
const std::unordered_set<char> gPunctSet(gPunctStr.begin(), gPunctStr.end());
}
Statistics::Statistics(const std::string& text) : _totWords(0) { // initialize the count before processing
std::string lastToken;
for (size_t i = 0; i < text.size(); i++) {
int ch = static_cast<uint8_t>(text[i]);
if (!isspace(ch)) {
lastToken.push_back(tolower(ch));
continue;
}
process(lastToken);
}
process(lastToken);
}
void Statistics::process(std::string& token) {
do {
if (token.size() == 0) {
break;
}
if (gPunctSet.find(token.back()) != gPunctSet.end()) {
token.pop_back();
}
} while (false);
if (token.size() != 0) {
auto it = _counts.find(token);
if (it == _counts.end()) {
_counts.emplace(token, 1);
}
else {
it->second++;
}
_totWords++;
token.clear();
}
}
double Statistics::Dist(const Statistics& fellow) const {
double sum = 0;
for (const auto& wordInfo : _counts) {
const std::string wordText = wordInfo.first;
const double freq = double(wordInfo.second) / _totWords;
auto it = fellow._counts.find(wordText);
double fellowFreq;
if (it == fellow._counts.end()) {
fellowFreq = 0;
}
else {
fellowFreq = double(it->second) / fellow._totWords;
}
const double d = freq - fellowFreq;
sum += d * d;
}
return std::sqrt(sum);
}
You can use the following:
MinHash/LSH https://en.wikipedia.org/wiki/MinHash
(also see the MinHash chapter of http://infolab.stanford.edu/~ullman/mmds/book.pdf, and http://ann-benchmarks.com/ for the state of the art)
collaborative filtering if you have info of users interaction with articles (clicks/likes/views): https://en.wikipedia.org/wiki/Collaborative_filtering
word2vec or similar embeddings to compare articles in 'semantic' vector space: https://en.wikipedia.org/wiki/Word2vec
Latent semantic analysis: https://en.wikipedia.org/wiki/Latent_semantic_analysis
Use Bag-of-words and apply some distance measure, like Jaccard coefficient to compute set similarity https://en.wikipedia.org/wiki/Jaccard_index, https://en.wikipedia.org/wiki/Bag-of-words_model
The link in #alex77's answer points to the Sorensen-Dice Coefficient, which was independently discovered by the author of that article - the article is very well written and well worth reading.
I have ended up using this coefficient for my own needs. However, the original coefficient can yield erroneous results when dealing with
three letter word pairs which contain one misspelling, e.g. [and,amd] and
three letter word pairs which are anagrams e.g. [and,dan]
In the first case Dice erroneously reports a coefficient of zero whilst in the second case the coefficient turns up as 0.5 which is misleadingly high.
An improvement has been suggested which in its essence consists of taking the first and the last character of the word and creating an additional bigram.
In my view the improvement is only really required for 3 letter words - in longer words the other bigrams have a buffering effect that covers up the problem.
My code that implements this improvement is given below.
function wordPairCount(word)
{
var i,rslt = [],len = word.length - 1;
for(i=0;i < len;i++) rslt.push(word.substr(i,2));
if (2 == len) rslt.push(word[0] + word[len]);
return rslt;
}
function pairCount(arr)
{
var i,rslt = [];
arr = arr.toLowerCase().split(' ');
for(i=0;i < arr.length;i++) rslt = rslt.concat(wordPairCount(arr[i]));
return rslt;
}
function commonCount(a,b)
{
var t;
if (b.length > a.length) t = b, b = a, a = t;
t = a.filter(function (e){return b.indexOf(e) > -1;});
return t.length;
}
function myDice(a,b)
{
var aPairs = pairCount(a),
bPairs = pairCount(b);
return 2*commonCount(aPairs,bPairs)/(aPairs.length + bPairs.length);
}
$('#rslt1').text(myDice('WEB Applications','PHP Web Application'));
$('#rslt2').text(myDice('And','Dan'));
$('#rslt3').text(myDice('and','aMd'));
$('#rslt4').text(myDice('abracadabra','abracabadra'));
*{font-family:arial;}
table
{
width:80%;
margin:auto;
border:1px solid silver;
}
thead > tr > td
{
font-weight:bold;
text-align:center;
background-color:aqua;
}
<script src="https://ajax.googleapis.com/ajax/libs/jquery/2.0.0/jquery.min.js"></script>
<table>
<thead>
<tr>
<td>Phrase 1</td>
<td>Phrase 2</td>
<td>Dice</td>
</tr>
</thead>
<tbody>
<tr>
<td>WEB Applications</td>
<td>PHP Web Application</td>
<td id='rslt1'></td>
</tr>
<tr>
<td>And</td>
<td>Dan</td>
<td id='rslt2'></td>
</tr>
<tr>
<td>and</td>
<td>aMd</td>
<td id='rslt3'></td>
</tr>
<tr>
<td>abracadabra</td>
<td>abracabadra</td>
<td id='rslt4'></td>
</tr>
</tbody>
</table>
Note the deliberate misspelling in the last example: abracadabra vs abracabadra. Even though no extra bigram correction is applied the coefficient reported is 0.9. With the correction it would have been 0.91.
You can use a SQL Server full-text index to get the smart comparison. I believe that SO is using an AJAX call that does a query to return the similar questions.
What technologies are you using?
The simplest and fastest way to compare similarity among abstracts is probably by utilizing the set concept. First convert the abstract texts into sets of words, then check how much each pair of sets overlaps. Python's set feature comes in very handy for performing this task. You would be surprised to see how well this method compares to the "similar/related papers" options out there provided by GScholar, ADS, WOS or Scopus.
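The same set-overlap idea expressed in Java, as a sketch (Jaccard similarity between the word sets of two abstracts):

import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class AbstractOverlap {
    // Jaccard similarity: |intersection| / |union| of the two word sets.
    static double jaccard(String a, String b) {
        Set<String> wordsA = new HashSet<>(Arrays.asList(a.toLowerCase().split("\\W+")));
        Set<String> wordsB = new HashSet<>(Arrays.asList(b.toLowerCase().split("\\W+")));
        Set<String> intersection = new HashSet<>(wordsA);
        intersection.retainAll(wordsB);
        Set<String> union = new HashSet<>(wordsA);
        union.addAll(wordsB);
        return union.isEmpty() ? 0.0 : (double) intersection.size() / union.size();
    }

    public static void main(String[] args) {
        System.out.println(jaccard("This is a great book", "These are not great books"));
    }
}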
