I have this number extracting problem.
I want to get all matches that don't have a certain number in them.
ex : 125501874, 125001873
Every number that has 55 at position 2 (i.e. the third and fourth digits) is not to be considered.
The first digit's range is 0 to 9 and the second's is 1-9, so the real range is [01-99]
(we cannot have 00 as the first two number)
With Lucene I wanted to add NOT field:[01-99]55*
But it doesn't seem to work. Is there an easy way to find ??55* and disregard it in a Search("NOT field:[01-99]55*")?
Thank you Lucene guru
Lucene can do this very efficiently if one creates an "index-only" field with only the third and fourth digits in it. The complete value can be "stored" (or stored and indexed if other queries use the whole number) in the original field.
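For illustration, here is a minimal sketch of that idea (Lucene 3.x-style API; the field names "number" and "digits34" and the already-open IndexWriter "writer" are placeholders of mine, not anything from the original setup):
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.MatchAllDocsQuery;
import org.apache.lucene.search.TermQuery;
// At index time: store the full number, and additionally index only the
// 3rd/4th digits so "55 at position 2" becomes a plain term match.
String number = "125501874";
Document doc = new Document();
doc.add(new Field("number", number, Field.Store.YES, Field.Index.NOT_ANALYZED));
doc.add(new Field("digits34", number.substring(2, 4), Field.Store.NO, Field.Index.NOT_ANALYZED));
writer.addDocument(doc);
// At query time: match everything except documents whose digits34 is "55".
BooleanQuery query = new BooleanQuery();
query.add(new MatchAllDocsQuery(), BooleanClause.Occur.MUST);
query.add(new TermQuery(new Term("digits34", "55")), BooleanClause.Occur.MUST_NOT);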
Update: A followup comment asked, "Is [there] a way to create a temporary index on only the second digit?"
Using a ParallelReader "vertically partitions" the fields of an index. One partition could hold the current index, with its fields, while the other is a temporary index with the new field, possibly stored in a RAMDirectory.
Assuming the number is "stored" in the original index, iterate over each document in the original index, retrieve the stored field, parse out the key digits, and add a Document to the temporary index with the new field. As the ParallelReader documentation states, it is imperative that the document numbers match in both indexes.
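A rough sketch of that procedure, again assuming a Lucene 3.x-style API and that the full number is stored in a field called "number" (the usual org.apache.lucene imports are omitted; this also assumes the original index has no deleted documents, since deletions would break the doc-number alignment):
IndexReader original = IndexReader.open(originalDirectory);
RAMDirectory tempDir = new RAMDirectory();
IndexWriter tempWriter = new IndexWriter(tempDir,
    new IndexWriterConfig(Version.LUCENE_36, new KeywordAnalyzer()));
// Walk the original index in document order so the temporary index
// gets exactly the same document numbers.
for (int i = 0; i < original.maxDoc(); i++) {
    String number = original.document(i).get("number");
    Document d = new Document();
    d.add(new Field("digits34", number.substring(2, 4),
        Field.Store.NO, Field.Index.NOT_ANALYZED));
    tempWriter.addDocument(d);
}
tempWriter.close();
// Vertically partition: the original fields and the new field act as one index.
ParallelReader parallel = new ParallelReader();
parallel.add(original);
parallel.add(IndexReader.open(tempDir));
IndexSearcher searcher = new IndexSearcher(parallel);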
Thank you erickson. Your solution using ParallelReader is probably the best, if only I could use temporary indexes; because we cache the search queries, we will need them later.
But like you said before, better to start with an index on the relevant digits straightaway.
I have another solution.
NOT field:0?55*
NOT field:1?55*
...
NOT field:9?55*
It is efficient enough for the search I'm doing, and it bypasses the first-character wildcard limitation. I wouldn't use it if there were more digits to check or if they were farther from the start.
Now I'm testing this on a million rows and it's pretty efficient for our needs.
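Programmatically, the same workaround can be built as a single BooleanQuery with one MUST_NOT wildcard clause per leading digit (a sketch; the field name "field" is a placeholder):
BooleanQuery query = new BooleanQuery();
query.add(new MatchAllDocsQuery(), BooleanClause.Occur.MUST);
for (char first = '0'; first <= '9'; first++) {
    // One clause per leading digit avoids a query that starts with a wildcard.
    query.add(new WildcardQuery(new Term("field", first + "?55*")),
              BooleanClause.Occur.MUST_NOT);
}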
Within excel I have a list of artists, songs, edition.
This list contains over 15000 records.
The problem is the list does contain some "duplicate" records. I say "duplicate" because they aren't a complete match. Some might have a few typos, and I'd like to fix this up and remove those records.
So for example some records:
ABBA - Mamma Mia - Party
ABBA - Mama Mia! - Official
Each dash indicates a separate column (so 3 columns A, B, C are filled in)
How would I mark them as duplicates within Excel?
I've found out about the Fuzzy Lookup tool, but I'm working on a Mac and since it's not available for Mac, I'm stuck.
Any regex magic or VBA script that can help me out?
It'd also be alright to see how similar the rows are (say 80% similar).
One of the common methods for fuzzy text matching is the Levenshtein (distance) algorithm. Several nice implementations of this exist here:
https://stackoverflow.com/a/4243652/1278553
From there, you can use the function directly in your spreadsheet to find similarities between instances:
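For reference, the algorithm itself is just a small dynamic-programming routine; here is a minimal Java sketch of it (the linked answer has the VBA version you would actually paste into Excel):
// Classic two-row dynamic-programming Levenshtein distance.
static int levenshtein(String a, String b) {
    int[] prev = new int[b.length() + 1];
    int[] curr = new int[b.length() + 1];
    for (int j = 0; j <= b.length(); j++) prev[j] = j;        // distance from the empty prefix
    for (int i = 1; i <= a.length(); i++) {
        curr[0] = i;
        for (int j = 1; j <= b.length(); j++) {
            int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
            curr[j] = Math.min(Math.min(curr[j - 1] + 1,      // insertion
                                        prev[j] + 1),         // deletion
                               prev[j - 1] + cost);           // substitution
        }
        int[] tmp = prev; prev = curr; curr = tmp;            // reuse the two rows
    }
    return prev[b.length()];
}
// levenshtein("Mamma Mia", "Mama Mia!") == 2, so those two rows are near-duplicates.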
You didn't ask, but a database would be really nice here. The reason is you can do a cartesian join (one of the very few valid uses for this) and compare every single record against every other record. For example:
select
s1.group, s2.group, s1.song, s2.song,
levenshtein (s1.group, s2.group) as group_match,
levenshtein (s1.song, s2.song) as song_match
from
songs s1
cross join songs s2
order by
group_match, song_match
Yes, this would be a very costly query, depending on the number of records (in your example, 225,000,000 rows), but it would bubble the most likely duplicates / matches to the top. Not only that, but you can incorporate "reasonable" joins to eliminate obvious mismatches, for example limiting it to cases where the group matches, nearly matches, begins with the same letter, etc., or pre-filtering out groups where the Levenshtein distance is greater than x.
You could use an array formula to indicate the duplicates, and you could modify the below to show the row numbers. This checks the rows beneath the entry for any possible 80% dupes, where 80% is taken left to right, not as a total comparison. My data is in A1:A15000.
=IF(NOT(ISERROR(FIND(MID($A1,1,INT(LEN($A1)*0.8)),$A2:$A$15000))),1,0)
This way will also look back up the list, to indicate the ones found
=SUM(IF(ISERROR(FIND(MID($A2,1,INT(LEN($A1)*0.8)),$A3:$A$15000,1)),0,1))+SUM(IF(ISERROR(FIND(MID($A2,1,INT(LEN($A2)*0.8)),$A$1:$A1,1)),0,1))
For the first entry, i.e. row 1, only the first part of the formula applies, and the last row will only need the last part, after the +.
Try this worksheet function in your loop:
=COUNTIF(Range,"*yourtexttofind*")
I've adapted this solution from a couple of years ago:
=LOOKUP(2^15,FIND(Keywords,A2),Categories)
I use this for searching within a description field for keywords in a named list, in order to return a corresponding category from an adjacent named list.
However I do not understand the significance of 2^15. Can someone explain?
Also it's unclear in what order the search operates. If two keyword options were "check" and "deposit," and they were assigned to different categories, but both appeared in the same description field cell, how do I know which will be found first? Is it placement in the string, or order in the list?
2^15 is simply an arbitrarily large number which LOOKUP attempts to find; when it can't find it, it settles on the largest value below it.
Effectively, your formula takes each entry in Keywords and attempts to find it in A2. Each keyword that actually occurs in A2 produces a number (its position) instead of an error. LOOKUP then searches that array of results for 2^15, can't find it, and falls back to the last numeric (non-error) value, returning the corresponding entry from Categories. This seems like a weird way of doing it; it is likely a holdover from pre-2007 Excel, as LOOKUP is now generally kept around only for backwards compatibility. Also, using 1 instead of 2^15 worked for a couple of simple cases that I tried when writing this up.
What is the use of a HashMap when we can associate only one value with one key? We could directly search for that value instead of searching for the key, right? If not, then please explain.
Key1---2->5->8->2;
Key2---->15->14;
Key3---45->15->10;
If it were like this, we could search for values using a key with a smaller number of iterations.
It comes in handy when you know the key of the value you need. If this is the case, the lookup time is constant (it doesn't vary with the size of the collection). Yes, you can search an array, but you'd have to iterate over it, which leads to a linear delay (the larger the array, the longer it takes to find the needed value).
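A quick illustration (the map contents are made up):
import java.util.HashMap;
import java.util.Map;
Map<String, Integer> ages = new HashMap<>();
ages.put("alice", 31);
ages.put("bob", 27);
// Known key: one hash computation, constant time regardless of how many entries exist.
int bobsAge = ages.get("bob");
// Plain arrays: you have to walk the entries until you hit the one you want.
String[] keys = {"alice", "bob"};
int[] values = {31, 27};
int found = -1;
for (int i = 0; i < keys.length; i++) {          // linear in the array length
    if (keys[i].equals("bob")) { found = values[i]; break; }
}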
I'm using levenshtein distance to retrieve similar strings from a list. At the moment the list has just a few thousand items, but we'll need to support at least 100k items.
I'm trying to make this more efficient, and one technique I came up with was to calculate the Levenshtein distance only on strings that are of similar length. I thought about also filtering on the initial character, i.e. if the string to search starts with b then I'll run the calculation only on the strings that start with b. But I'm not sure I can assume this will work all the time.
I was wondering if you all have a better way of getting this done?
Thanks
One way to go would be to hope that a match with small edit distance would have within it a short exact match. If you assume this, then, given the string ABCDEF, retrieve all strings containing ABC, BCD, CDE, or DEF, and compute their edit distances. You may even find that the best match among these is so close that any closer match must have a short match inside it, so you would have found it already. You would have to accept that if you are unlucky you may miss some good matches, or be forced to go through all the possibilities one by one.
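A sketch of that first approach (storedStrings is a placeholder for your list of items): index every stored string under each of its 3-character substrings, then compute the edit distance only for candidates that share at least one of those substrings with the query.
import java.util.*;
// Build once: map each 3-gram to the stored strings that contain it.
Map<String, Set<String>> byTrigram = new HashMap<>();
for (String s : storedStrings) {
    for (int i = 0; i + 3 <= s.length(); i++) {
        byTrigram.computeIfAbsent(s.substring(i, i + 3), k -> new HashSet<>()).add(s);
    }
}
// Per query: collect candidates sharing ABC, BCD, CDE or DEF, then run the
// expensive Levenshtein computation only on that (much smaller) candidate set.
String query = "ABCDEF";
Set<String> candidates = new HashSet<>();
for (int i = 0; i + 3 <= query.length(); i++) {
    candidates.addAll(byTrigram.getOrDefault(query.substring(i, i + 3),
                                             Collections.emptySet()));
}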
As an alternative to building a database of substrings, you could build a suffix array (http://en.wikipedia.org/wiki/Suffix_array) and LCP array from a string obtained by concatenating all the stored strings, separating them with a marker character not otherwise used. This takes time and space linear in the input size. You would then search for exact matches by looking for strings in the suffix array starting with ABCDEF, BCDEF, CDEF, and DEF.
It's my first post here, so please bear with me :-).
Problem Background:
I've multiple text files of the form:
<ticker>,<date>,<open>,<high>,<low>,<close>,<vol>
A,20120904 0926,37.14,37.14,37.14,37.14,693
.
.
.
ZZ,20120904 1602,1.6,1.6,1.6,1.6,11771
As you might have guessed, it's stock ticks. When I load it into MATLAB, it creates a structure with an array (of the numerical values) and a cell array (for the strings), which is fine at this point as I can work with it.
Problem:
I'd like to find the most efficient way to search the array for a specific symbol (~70K lines). While it's easy to do a naive or halving (binary) search, I don't think these approaches are very useful for multiple files and/or multiple searches to extract the beginning and end indices of a given symbol/string.
I've looked into past posts here and read about Rabin-Karp, Bitap and hash tables, but I'm not sure any of them fully answers my needs.
So far, I've been leaning towards running through the cell array once and creating a hash table for each letter (i.e. 'A', 'B', etc.) and then running a naive search, or anything else you might suggest :-). The reason for hashing is that I might use the same file to look up different stock symbols, so I think running through it once and labeling letters will reduce the complexity in the long run.
What are your thoughts on the matter? Am I in the right direction?
I'm using MATLAB, by the way.
Thank you
You can store all your tickers in a struct array, with each column being a field. Assuming you have non-empty values, you can do the following:
tickers = [S.tickers];
dates = [S.date];
You can easily run queries to get the indices you want from your struct array S. You can go further and index tickers by ticker name, by creating an index with the ticker names as keys.
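For illustration only, here is that indexing idea sketched in Java (in MATLAB you could do the equivalent with containers.Map); tickerColumn is a hypothetical list holding the ~70K symbols in row order:
import java.util.*;
// Build the index once: ticker symbol -> row numbers where it appears.
Map<String, List<Integer>> rowsByTicker = new HashMap<>();
for (int row = 0; row < tickerColumn.size(); row++) {
    rowsByTicker.computeIfAbsent(tickerColumn.get(row), k -> new ArrayList<>()).add(row);
}
// Every later lookup is a constant-time map access instead of a fresh scan,
// which is what makes the one-time pass worthwhile for repeated symbol queries.
List<Integer> rowsForA = rowsByTicker.getOrDefault("A", Collections.emptyList());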