Elasticsearch on long search phrases

I want to use my Elasticsearch index to mine data out of long search phrases.
These phrases are multiple sentences, usually around 100 words long.
Is Elasticsearch any good at this? Is there something I should do to the sentences first, for example keyword extraction, before launching a search query?

Related

Difference between Sentences and a Document in Stanford OpenNLP?

Let us say we have an article that we want to annotate. If we input the text as one really long Sentence as opposed to a Document, does Stanford do anything differently when annotating that one long Sentence, as opposed to looping through every Sentence in the Document and combining all of the results?
EDIT: I ran a test and it seems like the two approaches return two different NER sets. I might be just doing it wrong, but it's certainly super interesting and I'm curious as to why this happens.
To confirm: you mean Stanford CoreNLP (as opposed to Apache OpenNLP), right?
The main difference in the CoreNLP Simple API between a Sentence and a Document is tokenization. A Sentence will force the entire text to be considered as a single sentence, even if it has punctuation. A Document will first tokenize the text into a list of sentences, and then annotate each sentence.
Note that for annotators like the constituency parser, very long sentences will take prohibitively long to annotate. Also, note that coreference only works on documents, not sentences.

How to produce a bag of words depending on relevance across corpus

I understand that TF-IDF (term frequency–inverse document frequency) is the solution here. But the TF in TF-IDF is specific to a single document only. I need to produce a bag of words that are relevant to the WHOLE corpus. Am I doing this wrong, or is there an alternative?
You may be able to do this if you compute the IDF on a different corpus. A general corpus containing newswire text may be suitable. Then you can treat your own corpus as a single document to count the TF. You will also need a strategy for words that are present in your corpus but not in the external corpus, as they won't have an IDF value. Finally, you can rank the words in your corpus according to their TF-IDF.
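The approach above can be sketched in pure Python. This is a minimal illustration, not a production implementation: the function names and the fallback strategy (assigning unseen words the maximum IDF from the external corpus) are my own assumptions, and real use would need proper tokenization and smoothing.

```python
import math
from collections import Counter

def idf_from_corpus(corpus_docs):
    """Compute IDF per term from an external reference corpus (list of strings)."""
    n = len(corpus_docs)
    df = Counter()
    for doc in corpus_docs:
        df.update(set(doc.split()))  # document frequency: one count per doc
    return {term: math.log(n / count) for term, count in df.items()}

def rank_terms(own_corpus_text, external_corpus, default_idf=None):
    """Treat your whole corpus as one document for TF; score with external IDF."""
    tokens = own_corpus_text.split()
    tf = Counter(tokens)
    idf = idf_from_corpus(external_corpus)
    if default_idf is None:
        # Assumed fallback: words missing from the external corpus are treated
        # as maximally rare there.
        default_idf = max(idf.values(), default=1.0)
    scores = {t: (count / len(tokens)) * idf.get(t, default_idf)
              for t, count in tf.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```

Words that appear in every external document (like "the") get an IDF of zero and sink to the bottom of the ranking, while corpus-specific terms rise to the top.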

What is the best way to split a sentence for a keyword extraction task?

I'm doing keyword extraction using TF-IDF on a large number of documents. Currently I'm splitting each sentence into n-grams, more particularly tri-grams. However, this is not the best way to split each sentence into its constituent keywords. For example, a noun phrase like 'triple heart bypass' may not always get detected as one term.
The other alternative for chunking each sentence into its constituent elements looks to be part-of-speech tagging and chunking in OpenNLP. In this approach a phrase like 'triple heart bypass' always gets extracted as a whole, but the downside is that in TF-IDF the frequency of the extracted terms (phrases) drops dramatically.
Does anyone have any suggestions on either of these two approaches, or any other ideas to improve the quality of the keywords?
Two questions to ask first:
What is the goal of your application? This impacts the tokenization rules and defines the quality of your keywords.
What type of documents do you have? Chunking is not the same for forum data as for news article data.
You can implement a boundary recognizer yourself, or use a statistical model as in OpenNLP.
The typical pipeline is to first tokenize as simply as possible, apply stop-word removal (language-dependent), and then, if needed, POS-tagging-based filtering (but this is a costly operation).
Other options: java.text.BreakIterator, com.ibm.icu.text.BreakIterator, com.ibm.icu.text.RuleBasedBreakIterator...
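A lightweight compromise between raw n-grams and full POS chunking is to generate word n-grams but reject candidates that start or end with a stop word, so that 'triple heart bypass' survives while fragments like 'a triple' are dropped. The sketch below assumes a tiny illustrative stop list; the function name and the strip-based tokenizer are my own simplifications.

```python
from collections import Counter

# Illustrative stop-word subset; a real system would use a full language-specific list.
STOPWORDS = {"the", "a", "an", "of", "and", "was", "with"}

def candidate_phrases(sentence, max_n=3):
    """Generate 1..max_n word n-grams that neither start nor end with a stop word."""
    tokens = [t.lower().strip(".,;!?") for t in sentence.split()]
    phrases = []
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            gram = tokens[i:i + n]
            if gram[0] in STOPWORDS or gram[-1] in STOPWORDS:
                continue  # drop fragments like 'a triple' or 'bypass and'
            phrases.append(" ".join(gram))
    return phrases
```

This keeps the higher term frequencies of the n-gram approach (every sub-phrase is still counted) while filtering out the most obviously malformed candidates.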

How to find similarity of sentence?

How to find the semantic similarity between any two given sentences?
Eg:
what movies did ron howard direct?
movies directed by ron howard.
I know it's a hard problem, but I would like to ask the views of experts.
I don't know how to use parts of speech to achieve this.
http://nlp.stanford.edu:8080/parser/index.jsp
It's a broad problem. I would personally go for cosine similarity.
You need to convert your sentences into vectors. For converting a sentence into a vector you can consider several signals, like the number of occurrences, word order, synonyms, etc. Then take the cosine distance as mentioned here.
You can also explore Elasticsearch for finding associated words. You can create custom analyzers, stemmers, tokenizers, filters (like synonyms), etc., which can be very helpful in finding similar sentences. Elasticsearch also provides a More Like This query, which finds similar documents using TF-IDF scores.
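The simplest version of the cosine-similarity suggestion, with bag-of-words counts as the vectors, can be sketched in a few lines of pure Python. This ignores word order and synonyms entirely, so treat it as a baseline rather than a full solution.

```python
import math
from collections import Counter

def cosine_similarity(s1, s2):
    """Bag-of-words cosine similarity between two sentences, in [0, 1]."""
    v1, v2 = Counter(s1.lower().split()), Counter(s2.lower().split())
    dot = sum(v1[t] * v2[t] for t in v1)           # shared-term products
    norm1 = math.sqrt(sum(c * c for c in v1.values()))
    norm2 = math.sqrt(sum(c * c for c in v2.values()))
    return dot / (norm1 * norm2) if norm1 and norm2 else 0.0
```

On the example pair from the question, "what movies did ron howard direct" and "movies directed by ron howard" share only three surface tokens (movies, ron, howard), so the score lands around 0.55; mapping "direct"/"directed" to a common stem or synonym would push it higher.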

Clustering phrases around a theme

I have encountered a very unusual problem. I have a set of phrases (noun phrases) extracted from a large corpus of documents. These phrases are between 2 and 3 words in length. There is a need to cluster them, because the number of extracted phrases is very large and showing them as a simple list might not be useful for the user.
We are thinking of very simple ways of clustering these. Is there a quick tool/software/method that I could use to cluster them so that all phrases inside a cluster belong to a particular theme/topic, if I keep the number of topics fixed initially? I don't have any training set or any other clusters that I can use as a training set.
Topic classification is not an easy problem.
The conventional methods used to classify long documents (hundreds of words) are usually based on frequent words and are not suitable for very short texts. I believe your problem is somewhat similar to tweet classification.
Two very interesting papers are:
Discovering Context: Classifying Tweets through a Semantic Transform Based on Wikipedia
(presented at HCI International 2011)
Eddi: Interactive Topic-based Browsing of Social Status Streams (presented at UIST'10)
If you want to include knowledge about the world so that, e.g., cat and dog will be clustered together, you can use WordNet's domains hierarchy.
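As a crude surface-level baseline (much weaker than the Wikipedia- or WordNet-based approaches above, and purely my own illustration), short phrases can be grouped by their most corpus-frequent word, on the assumption that a word shared by many phrases acts as a rough "theme":

```python
from collections import Counter, defaultdict

def cluster_by_head_word(phrases):
    """Naively cluster phrases under their most corpus-frequent word."""
    word_freq = Counter(w for p in phrases for w in p.lower().split())
    clusters = defaultdict(list)
    for p in phrases:
        # The phrase's most globally frequent word becomes its cluster label.
        head = max(p.lower().split(), key=lambda w: word_freq[w])
        clusters[head].append(p)
    return dict(clusters)
```

This only groups phrases with literal word overlap; clustering 'cat food' with 'dog food' works, but 'cat' with 'dog' requires the world knowledge (e.g. WordNet domains) mentioned above.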