Is it really possible to filter out low-scoring results from Solr?
This would be very helpful, even though the documentation advises against it.
Let's say I am searching across four fields:
field 1
field 2
field 3
field 4
and then merge all of these into another field and apply a soundex filter:
field 5 soundex
It sometimes happens that a word is found only in the fifth field, based on soundex, and then the score is usually very, very low; in that case it is safe to discard the result.
How can I do this in a simple way? Is it even possible?
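(For reference, the workaround usually suggested for this is Solr's function-range filter applied to the query score, along these lines, where the 0.5 cutoff is purely illustrative and would need tuning:)
fq={!frange l=0.5}query($q)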
Related question:
For example, the search query is "rank" and there are a few results:
Row #1: ranking good
Row #2: rank greater
Row #3: my custom rank
But I would like:
Row #1: rank greater
Row #2: ranking good
Row #3: my custom rank
I am using:
expr('sum((10*(101-IF(min_hit_pos<100,min_hit_pos,100))+exact_hit)*user_weight)*1000+bm25')
Another approach, using LCS, is not optimal for me, because there are blend_chars and "multi-part" gets more weight than "multi" or "part" separately:
expr('sum((4*lcs+(101-IF(min_hit_pos<100,min_hit_pos,100))+exact_hit)*user_weight)*1000+bm25')
Sphinx 2.1.2
Thanks for the help
You can't do it within ranking alone, because the ranker doesn't know anything about the length of words.
To Sphinx, words are just numbers (either a CRC or a lookup-table ID, depending on dict), so it doesn't know how long a word is.
The only thing I can think of that might help is index_exact_words, which affects ranking on exact word matches (so "rank" will rank higher than "ranking", as it is an exact match).
(To clarify: exact_hit just means that all the query words match the field words; it knows nothing about morphology or prefix matching, so "ranking" vs. "rank" is still an exact_hit, as they are the same word after stemming.)
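For reference, a minimal sphinx.conf sketch of that setting, assuming the index already uses stemming (only the relevant lines are shown):
index my_index
{
    # ... source, path, and other existing settings ...
    morphology        = stem_en   # stemming must be enabled for exact forms to differ
    index_exact_words = 1         # also index the original, unstemmed word forms
}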
It is quite an in-depth Excel sheet (to me), so here is a link to it: https://dl.dropboxusercontent.com/u/19122839/Movies.xlsm
On the Filters sheet, I have a search feature. This allows you to put in different genres, years, etc., and it will pull up results.
The genre part does not seem to be working correctly for some reason.
In the movie_genres sheet, there are Genre Equals and Genre Count columns that seem to mark the information correctly, but on the Movies sheet the Matches Genre column does not. I use this formula:
=INDEX(Genres[Genre Count],MATCH(Movies[[#This Row],[ID]],Genres[ID],0))
To me, this should pull the Genre Count, but when there is more than one genre (I used Blank Check as an example), it doesn't mark it as a 1. How can I correct this?
For example, if you add Comedy as a second genre, it pulls up more results than if you only have Family. I think I just need a fresh pair of eyes on this; it is probably something dumb, but any help would be great.
I believe I need the INDEX/MATCH formula in Movies[Matches Genre] to work whenever there is a 1 in Genres[Genre Count] for that ID; it only seems to work when the 1 is in the first instance of the ID.
EDIT: I have added a COUNT feature to better explain what I am talking about. With only Family as a genre, it shows 10 results, but when you add Comedy as a second genre, you get 40 results. This number should never go up as you add genres.
Perhaps try using SUMIF like this (MATCH only ever returns the first row with a given ID, so INDEX/MATCH ignores any later genre rows for the same movie, whereas SUMIF aggregates across all of them):
=SUMIF(Genres[ID],[[#This Row],[ID]],Genres[Genre Count])
If one movie might have several 1s, but you want 1 as the maximum, then change it to
=IF(SUMIF(Genres[ID],[[#This Row],[ID]],Genres[Genre Count])>0,1,0)
I want to know the best way to rank sentences based on similarity across a set of documents.
For example, let's say:
1. There are 5 documents.
2. Each document contains many sentences.
3. Let's take Document 1 as the primary, i.e. the output will contain sentences from this document.
4. The output should be a list of sentences ranked such that the first-ranked sentence is the most similar across all 5 documents, then the 2nd, then the 3rd...
Thanks in advance.
I'll cover the basics of textual document matching...
Most document similarity measures work on a word basis, rather than sentence structure. The first step is usually stemming: words are reduced to their root form, so that different forms of similar words, e.g. "swimming" and "swims", match.
Additionally, you may wish to filter the words you match to avoid noise. In particular, you may wish to ignore occurrences of "the" and "a". In fact, there are a lot of conjunctions and pronouns you may wish to omit, so usually you will have a long list of such words; this is called a "stop list".
Furthermore, there may be bad words you wish to avoid matching, such as swear words or racial slurs. So you may have another exclusion list with such words in it, a "bad list".
So now you can count similar words in documents. The question becomes how to measure total document similarity. You need a score function that takes the shared words as input and returns a value of "similarity". Such a function should give a high value if the same word appears multiple times in both documents, and it should weight matches by overall word frequency so that matches on uncommon words carry more statistical weight.
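Here is a minimal sketch of these steps, assuming a deliberately naive stemmer and a tiny stop list (a real system would use a proper stemmer such as Porter's, and corpus_freq here is assumed to map each stem to the number of documents containing it, with corpus_size the total document count):
import math
import re
from collections import Counter

# Tiny illustrative stop list; real stop lists are much longer.
STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to", "in"}

def stem(word):
    # Deliberately naive stemmer, for illustration only.
    if word.endswith("ing") and len(word) > 5:
        word = word[:-3]
        if len(word) > 2 and word[-1] == word[-2]:
            word = word[:-1]  # "swimm" -> "swim"
    elif word.endswith("s") and len(word) > 3:
        word = word[:-1]      # "swims" -> "swim"
    return word

def tokenize(text):
    words = re.findall(r"[a-z]+", text.lower())
    return [stem(w) for w in words if w not in STOP_WORDS]

def similarity(doc_a, doc_b, corpus_freq, corpus_size):
    # Score shared stems, giving rarer words more statistical weight (IDF-style).
    counts_a, counts_b = Counter(tokenize(doc_a)), Counter(tokenize(doc_b))
    score = 0.0
    for word in counts_a.keys() & counts_b.keys():
        rarity = math.log(corpus_size / (1 + corpus_freq.get(word, 0)))
        score += min(counts_a[word], counts_b[word]) * rarity
    return score
With document counts as the corpus statistics, this is essentially a crude TF-IDF overlap.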
Apache Lucene is an open-source search engine written in Java that provides practical detail about these steps. For example, here is the information about how it weights query similarity:
http://lucene.apache.org/java/2_9_0/api/all/org/apache/lucene/search/Similarity.html
Lucene combines Boolean model (BM) of Information Retrieval with Vector Space Model (VSM) of Information Retrieval - documents "approved" by BM are scored by VSM.
All of this is really just about matching words in documents. You did specify matching sentences, but for most purposes matching words is more useful, since there is a huge variety of sentence structures that can mean essentially the same thing. The most useful similarity information is in the words themselves. I've talked about document matching, but for your purposes a sentence is just a very small document.
Now, as an aside, if you don't care about the actual nouns and verbs in the sentence and only care about its grammatical composition, you need a different approach...
First you need a link grammar parser to interpret the language and build a data structure (usually a tree) that represents the sentence. Then you have to perform inexact graph matching. This is a hard problem, but there are algorithms to do this on trees in polynomial time.
As a starting point, you can compute the soundex code of each word and then compare documents based on soundex frequencies.
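A minimal sketch of the classic soundex code, slightly simplified and assuming non-empty alphabetic words (tested library implementations exist, e.g. Python's jellyfish package):
def soundex(word):
    # Classic soundex: first letter plus three digits.
    mapping = {}
    for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                           ("L", "4"), ("MN", "5"), ("R", "6")):
        for ch in letters:
            mapping[ch] = digit
    word = word.upper()
    code = word[0]
    prev = mapping.get(word[0], "")
    for ch in word[1:]:
        digit = mapping.get(ch, "")
        if digit and digit != prev:
            code += digit
        if ch not in "HW":  # H and W do not reset the previous digit
            prev = digit
    return (code + "000")[:4]

# Documents can then be compared by the frequencies of their words' codes, e.g.:
# Counter(soundex(w) for w in re.findall(r"[A-Za-z]+", text))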
Tim's overview is very nice. I'd just like to add that for your specific use case, you might want to treat the sentences from Doc 1 as documents themselves and compare their similarity to each of the four remaining documents. This might give you a quick aggregate similarity measure per sentence without forcing you to go down the route of syntax parsing, etc.
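A rough sketch of that idea, reusing the hypothetical similarity() function from the earlier sketch (splitting sentences on periods is an obvious simplification):
def rank_sentences(primary_doc, other_docs, corpus_freq, corpus_size):
    # Treat each sentence of the primary document as a tiny document and
    # aggregate its similarity against every other document.
    sentences = [s.strip() for s in primary_doc.split(".") if s.strip()]
    scored = [(sum(similarity(s, d, corpus_freq, corpus_size) for d in other_docs), s)
              for s in sentences]
    return [s for _, s in sorted(scored, reverse=True)]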
I have a number-extraction problem.
I want to get all matches that don't have a certain number in them, e.g.:
125501874, 125001873
Any number that has 55 at position 2 is not to be considered.
The first digit's range is 0 to 9 and the second's is 1 to 9, so the real range is [01-99]
(we cannot have 00 as the first two digits).
With Lucene I wanted to add NOT field:[01-99]55*
But it doesn't seem to work. Is there an easy way to find ??55* and disregard it in a Search("NOT field:[01-99]55*")?
Thanks, Lucene gurus.
Lucene can do this very efficiently if one creates an "index-only" field containing just the third and fourth digits. The complete value can be "stored" (or stored and indexed, if other queries use the whole number) in the original field.
Update: a follow-up comment asked, "Is [there] a way to create a temporary index on only the second digit?"
Using a ParallelReader "vertically partitions" the fields of an index. One partition could hold the current index, with its fields, while the other is a temporary index with the new field, possibly stored in a RAMDirectory.
Assuming the number is "stored" in the original index, iterate over each document in the original index, retrieve the stored field, parse out the key digits, and add a Document to the temporary index with the new field. As the ParallelReader documentation states, it is imperative that the document numbers match in both indexes.
Thank you erickson, your solution using a ParallelReader is probably the best, if only I could use temporary indexes; since we cache the search queries, we will need those indexes later.
But like you said, it's better to start with an index on the relevant digits straight away.
I have another solution.
NOT field:0?55*
NOT field:1?55*
...
NOT field:9?55*
It is efficient enough for the search I'm doing, and it bypasses the leading-wildcard limitation. I wouldn't use this if there were more digits to check or if they were farther from the start.
I'm now testing this on a million rows, and it's efficient enough for our needs.
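For reference, a small sketch that generates the ten clauses programmatically instead of typing them out, assuming the final query string is assembled in code and the clauses are appended to an existing positive query:
def with_exclusions(positive_query):
    # Append one exclusion clause per possible leading digit (0-9).
    exclusions = " ".join(f"NOT field:{d}?55*" for d in range(10))
    return f"({positive_query}) {exclusions}"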
I need to write an algorithm that returns the closest match for a contact based on the name and address entered by the user. Both of these are troubling, since there are so many ways to enter a company name and address, for instance:
Company A, 123 Any Street Suite 200, Anytown, AK 99012
Comp. A, 123 Any St., Suite 200, Anytown, AK 99012
CA, 123 Any Street Ste 200, Anytown, AK 99012
I have looked at using Levenshtein distance on the name, but that doesn't seem like a great tool, since the user could abbreviate the name. I am looking for something that matches on the most information possible.
My initial attempt was to limit the results first by the first 5 digits of the postal code and then try to filter down to one based on other information, but there must be a more standard approach to getting this done. I am working in .NET but will look at any code you can provide to get an idea of how to accomplish this.
I don't know exactly how this is accomplished, but all major delivery companies (FedEx, USPS, UPS) seem to have a way of matching an address you input against their database and transforming it into a normalized form. As I've seen this happen on multiple websites (Amazon comes to mind), I assume there is an API for this functionality, but I don't know where to look for it or whether it is suitable for your purposes.
Just a thought though.
EDIT: I found the USPS API
I have solved this problem with a combination of address normalization, Metaphone, and Levenshtein distance. You will need to separate the name from the address, since they have different characteristics. Here are the steps:
1) Narrow down your list of matches by using the (first six characters of the) zip code. Basically, you will need to calculate the Levenshtein distance of the two strings and select the ones that have a distance of 1 or 2 at most. You can potentially precompute a table of zip codes and their "Levenshtein neighbors" if you really need to speed up the search.
http://en.wikipedia.org/wiki/Levenshtein_distance
2) Convert all the address abbreviations to a standard format using the list of official prefix and suffix abbreviations from the USPS. This will help make sure your results for the next step are more uniform:
https://www.usps.com/send/official-abbreviations.htm
3) Convert the address to a short code using the Metaphone algorithm. This will get rid of most common spelling mistakes. Just make sure your implementation eliminates all non-word characters, passes numbers through intact, and handles multiple words (each word separated by a single space):
http://en.wikipedia.org/wiki/Metaphone
4) Once you have the Metaphone results, compare the address strings using the Levenshtein distance, and calculate a percentage-of-change score by dividing the result by the number of characters in the longer string.
5) Repeat steps 3 and 4, but now use the names instead of the addresses.
6) Compute the score for each entry using this formula: (Weight for address * Address score) + (Weight for name * Name score). Pick your weights based on what is more important; I would start with .9 for the address (since the address is more specific) and .1 for the name, but the weights may depend on your application. Pick the entry with the lowest score. If the score is too high (say, over .15), you may declare that there are no matches.
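A rough sketch tying steps 2-6 together, assuming the third-party jellyfish library for the Metaphone and Levenshtein pieces (any equivalent implementation works), with a deliberately tiny abbreviation table standing in for the full USPS list:
import re
import jellyfish  # assumed third-party library providing metaphone() and levenshtein_distance()

# Tiny illustrative subset of the USPS abbreviation list (step 2).
USPS_ABBREVIATIONS = {"street": "st", "avenue": "ave", "suite": "ste", "boulevard": "blvd"}

def normalize_address(address):
    # Step 2: lowercase, drop punctuation, standardize abbreviations.
    words = re.sub(r"[^\w\s]", " ", address.lower()).split()
    return " ".join(USPS_ABBREVIATIONS.get(w, w) for w in words)

def metaphone_words(text):
    # Step 3: strip non-word characters, Metaphone each word, pass numbers through intact.
    words = re.sub(r"[^\w\s]", " ", text.lower()).split()
    return " ".join(w if w.isdigit() else jellyfish.metaphone(w) for w in words)

def change_score(a, b):
    # Step 4: Levenshtein distance as a fraction of the longer string's length.
    return jellyfish.levenshtein_distance(a, b) / max(len(a), len(b), 1)

def match_score(name_a, addr_a, name_b, addr_b, weight_addr=0.9, weight_name=0.1):
    # Steps 5-6: weighted combination; lower scores mean better matches.
    addr = change_score(metaphone_words(normalize_address(addr_a)),
                        metaphone_words(normalize_address(addr_b)))
    name = change_score(metaphone_words(name_a), metaphone_words(name_b))
    return weight_addr * addr + weight_name * name
The best candidate is then simply the entry that minimizes match_score, and anything scoring above roughly .15 can be reported as no match.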
I think filtering based on the zip code first would be easiest, since finding it is fairly unambiguous. From there you can probably extract the city and street. I'm not sure how you would go about finding the name, but matching it against the address seems feasible if you already have a database of (name, address) pairs.
Dun & Bradstreet does this. They charge money because it's really hard. There's no "standard" solution; it's mostly a painful choice between a service like D&B and rolling your own.
As a start, I'd probably do a word-indexed search. That would mean two stages:
Offline stage: generate an index of all the addresses by their keywords. For example, "Company", "A", and "123" would all become keywords for the address you provided above. You could do some stemming, which would mean that for words like "street" you'd also add "st" to the index.
Online stage: the user gives you a search query. Break the query down into its keywords, find all possible matches of each keyword in the database, and tally the number of matched keywords for each address. Then sort the results by the number of matched keywords. This can be done quite quickly if there aren't too many matches, as it's just a few sorted-list merges and increments, followed by a final sort.
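A minimal sketch of the two stages (the synonym table is a tiny stand-in for real stemming):
from collections import Counter, defaultdict

# Tiny stand-in for real stemming of address words.
SYNONYMS = {"street": "st", "avenue": "ave", "suite": "ste"}

def keywords(address):
    words = address.lower().replace(",", " ").split()
    return {SYNONYMS.get(w, w) for w in words}

def build_index(addresses):
    # Offline stage: map each keyword to the set of address IDs containing it.
    index = defaultdict(set)
    for addr_id, address in enumerate(addresses):
        for word in keywords(address):
            index[word].add(addr_id)
    return index

def search(index, query):
    # Online stage: tally matched keywords per address; best matches first.
    tally = Counter()
    for word in keywords(query):
        for addr_id in index.get(word, ()):
            tally[addr_id] += 1
    return [addr_id for addr_id, _ in tally.most_common()]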
Given that you know the domain of your problem, you could specialise the algorithm to use knowledge about the domain - for example the zip code filtering mentioned before.
Also, just to enable me to give you a better answer: are you using an SQL database at all? I ask because I would store the keyword index in the SQL database, and then the query to search by keyword becomes quite easy, since the database does all the work.
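A sketch of what that might look like; the table and column names are hypothetical:
-- Hypothetical keyword-index table, populated in the offline stage.
CREATE TABLE address_keywords (
    keyword    VARCHAR(50),
    address_id INT
);

-- Online stage: rank addresses by how many query keywords they match.
SELECT address_id, COUNT(*) AS matched
FROM address_keywords
WHERE keyword IN ('company', 'a', '123', 'st')
GROUP BY address_id
ORDER BY matched DESC;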
Maybe instead of using Levenshtein on the name only, it could be useful applied to the entire string representation of a contact. For instance, the distance from your first example to the second is 7, and to the third 9. Considering the strings have lengths 54, 50, and 45, this seems to be a relatively useful and quite simple similarity measure.
This is what I would do. I am not aware of algorithms, so I just use what makes sense.
I am assuming that the person provides a name, street address, city name, state name, and zipcode.
If the zipcode is provided as 9 digits, or has a hyphen, I would strip it down to 5 digits. I would then search the database for all of the addresses that have that zipcode. [query 1]
Then I would compare the state letters with the ones from the database; if they don't match, I would tell the user. The same goes for the city name.
From what I understand, a street name does not contain numbers; only a house on a street has a number in it. Furthermore, the house number usually comes at the beginning, unless it is an apartment or suite number.
So I would use a regex to find the numbers and the space or comma next to them, then find the position of the first word that does not have a period (.) or end in a comma. That gives me part of the street name, so I could compare it against the rows fetched earlier, or change the query to match the street name with LIKE %streetName% (a rough sketch of this step follows at the end of this answer).
I am guessing the database has the beginning and ending house numbers for each block, so I would check that street's row to see whether the provided house number is on that street.
By now you would know the correct data to show, and you could look up in a different table which name is associated with that house number. I am not sure why you want to compare the name; the only use for name comparison would be finding people whose address was not provided. For ways of comparing strings, see: Similar String algorithm
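A rough sketch of the regex step described above (the pattern is illustrative, not exhaustive):
import re

def split_address(street_line):
    # Pull the leading house number; everything up to the first comma after it
    # is treated as the street part (suite/apartment details often follow a comma).
    match = re.match(r"\s*(\d+)[\s,]+([^,]+)", street_line)
    if not match:
        return None, street_line.strip()
    return match.group(1), match.group(2).strip()

# split_address("123 Any Street Suite 200, Anytown, AK 99012")
# -> ("123", "Any Street Suite 200")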
If you can reliably figure out the general structure of each address (perhaps using the suggestions in the other answers), your best bet would be to run the data through a USPS-certified (meaning the results are reliable, accurate, and conform to federal standards) address verification service.
@RyanDelucchi, it is a fun problem, but only once you've solved it. So, @SteveBering, I would recommend submitting your list of contacts to a list-processing service, which will flag duplicates based on the address, according to USPS guidelines.
Since I work in the address verification field, I would suggest SmartyStreets (which I work for), since it will deliver the most value for your specific need; however, there are a few CASS-certified vendors who will do basically the same thing.