Lucene - proximity to start of field text? - search

I have a search field "title". Within it, I want to say "things that match nearer the start of the title should be scored higher in the search results".
e.g.
Title: "The quick brown fox jumps over the lazy dog"
Title: "the lazy dogs were under the jumping quick brown fox"
Title: "The lazy brown fox jumps over the quick dog"
Title: "The brown fox made quick jumps over the sleazy dog"
If I search for "quick", I want the first result to be ranked top, and the 4th result to be ranked 2nd.
Is this possible within lucene? I'm using Lucene.NET / Version_29 if it makes any difference.

During indexing, along with every term, you can store the position of its first occurrence in the payload of the corresponding term. During retrieval, you can use a modified similarity function that takes the stored position of a term into account in addition to the term weights.
A related SO question is here
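To make the idea concrete outside of Lucene, here is a minimal Python sketch of the kind of position-aware adjustment such a similarity could apply; the position_boosted_score helper and the 1/(1+position) boost are illustrative assumptions, not Lucene API.

# Illustration only: combine a base relevance score with a boost that decays
# as the first occurrence of the query term moves away from the start of the
# title. Inside Lucene the position would come from the term's payload in a
# custom similarity; here it is computed directly from the text.
def position_boosted_score(title, term, base_score=1.0):
    tokens = title.lower().split()
    if term.lower() not in tokens:
        return 0.0
    first_pos = tokens.index(term.lower())   # 0 = first word of the title
    return base_score / (1.0 + first_pos)    # earlier match -> bigger boost

titles = [
    "The quick brown fox jumps over the lazy dog",
    "the lazy dogs were under the jumping quick brown fox",
    "The lazy brown fox jumps over the quick dog",
    "The brown fox made quick jumps over the sleazy dog",
]
for t in sorted(titles, key=lambda t: -position_boosted_score(t, "quick")):
    print(round(position_boosted_score(t, "quick"), 3), t)

With the four example titles above, this ranks the first title on top and the fourth second.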

Related

SpaCy doc.similarity limitations

I'm building an information retrieval tool that receives a user's request and returns the most similar label in the corpus.
With spaCy's vanilla similarity, I run into the following limitation:
request = nlp("cute cat")
label1 = nlp("cute dog")
label2 = nlp("lovable cat")
print(request.similarity(label1))
print(request.similarity(label2))
# Returns
# 0.9046133562567831
# 0.8776915657921017
In this case I would like the "cat" label to have a higher similarity, because the request is about a (cute/lovable/...) cat.
Also, "ugly cat" should have a lower score than "cute dog".
I'm thinking of overriding spaCy's similarity so that doc.similarity is a weighted sum of the similarity between nouns and the similarity between adjectives, with the former having the higher weight.
Do you think this would be a good idea? Do you know of better ways or tools for this?
Also, labels are not always that simple. I'm thinking of using dependency parsing to handle labels like "cute dog in a garden" (I'm inventing them). Here dog and garden are both nouns, but dog is the 'main' one.
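As a rough sketch of that weighted-sum idea (assuming a spaCy model that ships word vectors, e.g. en_core_web_md, and arbitrary 0.7/0.3 weights), you could compare nouns with nouns and adjectives with adjectives and combine the two:

# Sketch of a POS-weighted similarity: compare nouns with nouns and adjectives
# with adjectives, then combine the two with (arbitrary) weights.
import numpy as np
import spacy

nlp = spacy.load("en_core_web_md")   # any model that ships word vectors

def pos_vector(doc, pos):
    vecs = [t.vector for t in doc if t.pos_ == pos and t.has_vector]
    return np.mean(vecs, axis=0) if vecs else None

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def weighted_similarity(a, b, noun_weight=0.7, adj_weight=0.3):
    score, used = 0.0, 0.0
    for pos, w in (("NOUN", noun_weight), ("ADJ", adj_weight)):
        va, vb = pos_vector(a, pos), pos_vector(b, pos)
        if va is not None and vb is not None:
            score += w * cosine(va, vb)
            used += w
    return score / used if used else a.similarity(b)   # fall back to vanilla spaCy

request = nlp("cute cat")
print(weighted_similarity(request, nlp("cute dog")))      # noun match weighted 0.7
print(weighted_similarity(request, nlp("lovable cat")))   # adjective match weighted 0.3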

Algorithm For Determining Sentence Subject Similarity

I'm looking to generate an algorithm that can determine the similarity of a series of sentences. Specifically, given a starter sentence, I want to determine if the following sentence is a suitable addition.
For example, take the following:
My dog loves to drink water.
All is good, this is just the first sentence.
The dog hates cats.
All is good, both sentences reference dogs.
It enjoys walks on the beach.
All is good, "it" is neutral enough to be an appropriate continuation.
Pizza is great with pineapple on top.
This would not be a suitable addition, as the sentence does not build on to the "narrative" created by the first three sentences.
To outline the project a bit: I've created a library that generates Markov text chains based on the input text. That text is then corrected grammatically to produce viable sentences. I now want to string these sentences together to create coherent paragraphs.
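One possible baseline for that check, sketched under the assumption that sharing a noun with the earlier sentences, or referring back with a pronoun, is enough to count as "building on the narrative"; the builds_on helper is invented for illustration and does no real coreference resolution.

# Rough baseline: accept a candidate sentence if it mentions a noun already
# seen in the paragraph, or refers back with a pronoun. Only meant to
# illustrate one possible "does it build on the narrative" check.
import spacy

nlp = spacy.load("en_core_web_sm")
PRONOUNS = {"it", "he", "she", "they", "this", "that"}

def builds_on(paragraph_sents, candidate):
    seen_nouns = {t.lemma_.lower() for s in paragraph_sents
                  for t in nlp(s) if t.pos_ in ("NOUN", "PROPN")}
    cand = nlp(candidate)
    cand_nouns = {t.lemma_.lower() for t in cand if t.pos_ in ("NOUN", "PROPN")}
    has_pronoun = any(t.text.lower() in PRONOUNS for t in cand)
    return bool(cand_nouns & seen_nouns) or has_pronoun

story = ["My dog loves to drink water.", "The dog hates cats."]
print(builds_on(story, "It enjoys walks on the beach."))          # True
print(builds_on(story, "Pizza is great with pineapple on top."))  # False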

Penn tree Bank tagset for NLTK

I am using NLTK as part of my project, and I have implemented the Viterbi algorithm for tagging (although I am aware that a ready-made tagger is available).
I have used the following snippet in my code:
from nltk.data import load
from nltk.corpus import brown
tagdict = load('help/tagsets/brown_tagset.pickle')        # Brown tag descriptions
taglist = list(tagdict.keys())                            # the Brown tags themselves
tag_sequence_corpus = brown.tagged_sents(tagset='brown')
The first two lines extract the keys of the Brown tagset dictionary, i.e. the list of tags available in the Brown tagset.
The argument tagset='brown' in the third line returns the Brown corpus tagged with the Brown corpus's own tagset.
Is there any way to set the tagset argument to the Penn Treebank tagset? The motivation is that the Penn Treebank tagset has some 36-45 tags, which makes it feasible to implement the Viterbi algorithm (the complexity being O(n*|S|^3), where n is the length of the sentence and |S| is the size of the tagset), while the Brown corpus has some ~226 tags (which increases the computation time). And the universal tagset is prone to word-sense ambiguity.
If the PTB tagset is not available, can anyone suggest an alternative to it (apart from Brown/universal)?
Thank you.
The last sentence in your question indicates that you're aware of the universal tagset: It only has about 10 POS tags, because they need to be broad enough for other tagsets to be mapped to them. The Penn Treebank tagset has a many-to-many relationship to Brown, so no (reliable) automatic mapping is possible.
What you can do is use one of the corpora that are already tagged with the Penn Treebank tagset. The NLTK's sample of the treebank corpus is only 1/10th the size of Brown (100,000 words), but it might be enough for your purposes.
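For example, a minimal sketch of switching to that sample (it already ships with PTB tags, so no tagset mapping is needed; the download calls assume the data packages are not installed yet):

# Minimal switch to NLTK's Penn Treebank sample, which already uses PTB tags.
import nltk
from nltk.corpus import treebank
from nltk.data import load

nltk.download("treebank")   # the 100,000-word PTB sample
nltk.download("tagsets")    # tag documentation pickles

tagdict = load("help/tagsets/upenn_tagset.pickle")   # PTB tag descriptions
taglist = list(tagdict.keys())                       # the PTB tags themselves
tag_sequence_corpus = treebank.tagged_sents()        # PTB-tagged sentences

print(len(taglist), len(tag_sequence_corpus))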
Alternately, you can simplify the Brown corpus yourself: if you only keep the first part of compound tags like VBN-TL-HL or PPS+HVD, the 472 complex tags are reduced to 71. If that's still too many, inspect the tag definitions and collapse them further manually, e.g. by merging NN and NNS (singular and plural nouns).
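A rough sketch of that simplification, keeping only the part of each tag before the first '-' or '+' (the guard for pure punctuation tags like "--" is my own addition):

# Collapse Brown's compound tags (e.g. VBN-TL-HL, PPS+HVD) to their first part.
from nltk.corpus import brown

def simplify(tag):
    base = tag.split("-")[0].split("+")[0]
    return base or tag    # keep punctuation tags like "--" intact

simplified = [[(word, simplify(tag)) for word, tag in sent]
              for sent in brown.tagged_sents()]
print(len({tag for sent in simplified for _, tag in sent}))   # far fewer distinct tags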

Lexicon-based text analysis. Any algorithm out there that does probabilistic category assignment?

I'm using a lexicon-based approach to text analysis. Basically I have a long list of words marked with whether they are positive/negative/angry/sad/happy etc. I match the words in the text I want to analyze to the words in the lexicon in order to help me determine if my text is positive/negative/angry/sad/happy etc.
But the length of the texts I want to analyze vary. Most of them are under 100 words, but consider the following example:
John is happy. (1 word in the category 'happy' giving a score of 33% for happy)
John told Mary yesterday that he was happy. (12.5% happy)
So comparing across sentences, my first sentence seems more 'happy' than my second sentence simply because it is shorter, which gives a disproportionate percentage to the single word 'happy'.
Is there an algorithm or way of calculation you can think of that would allow me to make a fairer comparison, perhaps by taking into account the length of the sentence?
As many have pointed out, you have to go down to the syntactic tree, something similar to this work.
Also, consider this:
John told Mary yesterday that he was happy.
John told Mary yesterday that she was happy.
The second one says nothing about John's happiness, but a naive algorithm would quickly be confused. So in addition to syntactic parsing, pronouns have to be linked to their subjects. In particular, that means the algorithm should know that John is "he" and Mary is "she".
Ignoring the issue of negation raised by HappyTimeGopher, you can simply divide the number of happy words in the sentence by the length of the sentence. You get:
John is happy. (1 word in the category 'happy' / 3 words in sentence = score of 33% for happy)
John told Mary yesterday that he was happy. (1/8 = 12.5% happy)
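A minimal sketch of that length-normalised count, using a toy three-word lexicon purely for illustration:

# Toy length-normalised lexicon score: happy words divided by total words.
import re

HAPPY_WORDS = {"happy", "glad", "joyful"}   # stand-in for a real lexicon

def happy_score(sentence):
    words = re.findall(r"[a-z']+", sentence.lower())
    hits = sum(1 for w in words if w in HAPPY_WORDS)
    return hits / len(words) if words else 0.0

print(happy_score("John is happy."))                               # 1/3 ~ 0.33
print(happy_score("John told Mary yesterday that he was happy."))  # 1/8 = 0.125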
Keep in mind that word-list-based approaches will only go so far. What should the score be for "I was happy with the food, but the waiter was horrible"? Consider using a more sophisticated system; the papers below are a good place to start your research:
Choi, Y., & Cardie, C. (2008). Learning with compositional semantics as structural inference for subsentential sentiment analysis.
Moilanen, K., & Pulman, S. (2009). Multi-entity sentiment scoring.
Pang, B., & Lee, L. (2008). Opinion Mining and Sentiment Analysis.
Pang, B., Lee, L., & Vaithyanathan, S. (2002). Thumbs up?: sentiment classification using machine learning techniques.
Turney, P. D., & Littman, M. L. (2003). Measuring praise and criticism: Inference of semantic orientation from association.

Split sentence in words with most weight

I'm working on a game where I need to find the biggest weight for a specific sentence.
Suppose I have the sentence "the quick brown fox" and assume only single words with their defined weight: "the" -> 10, "quick" -> 5, "brown" -> 3, "fox" -> 8
In this case the problem is trivial, as the solution consists of adding each word's weight.
Now assume we also add double words, so besides the above words, we also have "the quick" -> 5, "quick brown" -> 10, "brown fox" -> 1
I'd like to know which combination of single and double words provides the biggest weight; in this case it would be "the", "quick brown", "fox".
My question is: besides the obvious brute-force approach, is there any other way to obtain a solution? Needless to say, I'm looking for some optimal way to achieve this for larger sentences.
Thank you.
You can look at Integer Linear Programming (ILP) libraries like lp_solve. In this case, you will want to maximize the score, so your objective function will contain the weights. Then you can subject it to constraints, such as not being allowed to select "quick brown" and "brown" at the same time.
This approach was used for word alignment in this paper; your problem is much simpler, but you can browse through the paper to get an idea of how ILP was used. There are probably algorithms other than ILP that can be used to solve this optimally, but ILP can solve it optimally and efficiently for small problems.
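A small sketch of that formulation using PuLP rather than lp_solve (swapped in only because it is easy to show in Python; the modelling is the same): one binary variable per candidate phrase occurrence, maximise the total weight, and require every word position to be covered exactly once.

# ILP sketch with PuLP: overlapping phrases cannot both be chosen because each
# word position must be covered by exactly one selected phrase.
from pulp import LpProblem, LpVariable, LpMaximize, lpSum, LpBinary

sentence = ["the", "quick", "brown", "fox"]
weights = {("the",): 10, ("quick",): 5, ("brown",): 3, ("fox",): 8,
           ("the", "quick"): 5, ("quick", "brown"): 10, ("brown", "fox"): 1}

# Enumerate the spans (start, end, weight) where a weighted phrase occurs.
spans = []
for phrase, w in weights.items():
    n = len(phrase)
    for i in range(len(sentence) - n + 1):
        if tuple(sentence[i:i + n]) == phrase:
            spans.append((i, i + n, w))

prob = LpProblem("segmentation", LpMaximize)
x = [LpVariable(f"x{k}", cat=LpBinary) for k in range(len(spans))]
prob += lpSum(w * x[k] for k, (_, _, w) in enumerate(spans))               # objective
for pos in range(len(sentence)):                                           # cover each word once
    prob += lpSum(x[k] for k, (s, e, _) in enumerate(spans) if s <= pos < e) == 1
prob.solve()

chosen = sorted((s, e) for k, (s, e, _) in enumerate(spans) if x[k].value() == 1)
print([sentence[s:e] for s, e in chosen])   # [['the'], ['quick', 'brown'], ['fox']]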
"the" -> 10, "quick" -> 5, "brown" -> 3, "fox" -> 8
Say for the above individual words , I shall take an array
[10,5,3,8] for words 0,1,2,3
Traverse through the list and get if the combination of two scores is less than the combined score
for example
10+5 >5 the + quick >the quick
5+3 < 10 quick brown > quick + brown . Mark This
and so on
While marking the combined solution mark them along continuous ranges .
for example
if words scores are
words = [1,2,5,3,1,4,6,2,6,8] and [4,6,9,7,8,2,9,1,2]
marked ranges (inclusive of both ends)
are [0,1],[2,5],[6,7]
Pseudocode is given below, where a range's length is the number of marked pairs it contains, and alternate_score(i) is the sum of the pair scores at positions i, i+2, i+4, ... up to the end of the range:
traverse positions 0 to word_length - 1:
    if the position is not inside any marked range:
        add word_score[position] to the overall sum
    else, handling the whole marked range [lower_end, higher_end] once:
        if the range has 1 marked pair:
            add pair_score[lower_end]
        else if the range has 2 marked pairs:
            add max(pair_score[lower_end] + word_score[higher_end],
                    word_score[lower_end] + pair_score[lower_end + 1])
        else if the range has an odd number (> 2) of marked pairs:
            add max(alternate_score(lower_end),
                    word_score[lower_end] + word_score[higher_end] + alternate_score(lower_end + 1))
        else:   # even number (> 2) of marked pairs
            add max(alternate_score(lower_end) + word_score[higher_end],
                    word_score[lower_end] + alternate_score(lower_end + 1))
This feels like a dynamic programming question.
I can imagine the k words of the sentence placed beside each other with a light bulb in between each pair of adjacent words (i.e. k-1 light bulbs in total). If a light bulb is switched on, the words adjoining it are part of a single phrase, and if it's off, they are not. So any configuration of these light bulbs indicates a possible combination of weights. Of course, many configurations are not even possible because we don't have scores for the phrases they require. So k-1 light bulbs mean there are at most 2^(k-1) possible answers for us to go through.
Rather than brute forcing it, we can recognize that there are parts of each computation that we can reuse for other computations, so for (The)(quick)(brown fox ... lazy dog) and (The quick)(brown fox ... lazy dog), we can compute the maximum score for (brown fox ... lazy dog) only once, memoize it and re-use it without doing any extra work the next time we see it.
Before we even start, we should first get rid of the light bulbs that can have only one possible value (suppose we did not have the phrase 'brown fox' or any bigger phrase containing it; then the light bulb between 'brown' and 'fox' would always have to be turned off). Each removed bulb halves the solution space.
If w1, w2, w3 are the words, then the bulbs would be w1w2, w2w3, w3w4, etc. So
Optimal(w1w2 w2w3 w3w4 ...) = max(Optimal(w2w3 w3w4 ...) given w1w2 is on, Optimal(w2w3 w3w4 ...) given w1w2 is off)
(Caveat: if we reach a state with no possible solution, we just return MIN_INT and things should work out.)
We can solve the problem like this, but we can probably save even more time if we're clever about the order in which we approach the bulbs. Maybe attacking the center bulbs first would help; I am not sure about this part.
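A memoised sketch of that recurrence, phrased over word positions rather than bulbs (the weights dictionary reuses the example from the question; longer phrases would slot into the same loop):

# Memoised DP over suffixes: best(i) is the best score for words[i:], obtained
# by taking some phrase that starts at word i and recursing on what follows.
from functools import lru_cache

words = ("the", "quick", "brown", "fox")
weights = {("the",): 10, ("quick",): 5, ("brown",): 3, ("fox",): 8,
           ("the", "quick"): 5, ("quick", "brown"): 10, ("brown", "fox"): 1}

@lru_cache(maxsize=None)
def best(i):
    if i == len(words):
        return 0
    options = []
    for j in range(i + 1, len(words) + 1):        # candidate phrase words[i:j]
        w = weights.get(words[i:j])
        if w is not None:
            options.append(w + best(j))
    return max(options) if options else float("-inf")   # no way to cover word i

print(best(0))   # 28, i.e. "the" + "quick brown" + "fox"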
