I have a problem with my search.
I want to search over a range of sizes, but it does not work.
I use the query Size:[0 TO 1000], expecting it to return the files with a size between 0 and 1000.
Is it possible to do a search of this type in Lucene?
Thank you!
When searching on a field with a range of numbers, you first have to make sure the field is of a numeric type when indexing a document and adding the size field.
While searching (if you have trouble with the query syntax), you can create a RangeQuery programmatically and combine multiple ranges with a BooleanQuery; a sketch of both steps follows.
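Here is a minimal sketch using the Lucene 3.x numeric API (NumericField when indexing, NumericRangeQuery, the numeric counterpart of RangeQuery, when searching); the field name "size" and the variable fileSize are placeholders:

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.NumericField;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.NumericRangeQuery;

// Index time: add the size as a numeric field, not a plain string.
Document doc = new Document();
doc.add(new NumericField("size", Field.Store.YES, true).setLongValue(fileSize));

// Search time: a range query for sizes between 0 and 1000, inclusive.
NumericRangeQuery<Long> sizeRange =
    NumericRangeQuery.newLongRange("size", 0L, 1000L, true, true);

// Several ranges can be OR-ed together in a BooleanQuery.
BooleanQuery combined = new BooleanQuery();
combined.add(sizeRange, BooleanClause.Occur.SHOULD);
combined.add(NumericRangeQuery.newLongRange("size", 5000L, 10000L, true, true),
    BooleanClause.Occur.SHOULD);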
I have a table from which I would like to select the smallest picture frame that could be used based on the size values; basically, return the smallest frame that would fit the image.
So far I have a vertical array formula that can select the smallest frame that fits the size requirements, but I have one column that I want to stay static, i.e. another match that would only give results from the selection with the same type ID.
My current formula is as follows:
= INDEX($A$2:$A$16,MATCH(4,MMULT((I2:L2<=$B$2:$E$16)+0,{1;1;1;1}),0))
At the minute I am just referencing the type as another lookup, but I would like it to only attempt to match rows with the corresponding type; currently, if the required size is larger than what is available within the correct type, it will select a different type that has that size available.
I've tried to show what I mean in the screenshot. I want it to only pick up type 1, but it is selecting type 3 because the MMULT is seeing that that is the only one that would fit.
Help is much appreciated!
Thanks!
If the frame sizes to be looked up are in ascending order, you could use something like this:
=INDEX($A$2:$A$4,MIN(IF((($B$2:$B$4>=F2)*($C$2:$C$4>=G2)*($D$2:$D$4>=H2)),ROW($C$2:$C$4)))-1,1)
based on a data layout with the frame names in A2:A4, the size columns in B2:D4, and the lookup sizes in F2:H2.
Ended up using a load of nested IF statements to section off the types, to make it simpler to code.
Thanks anyway peeps!
It's my first post here, so please bear with me :-).
Problem Background:
I have multiple text files of the form:
<ticker>,<date>,<open>,<high>,<low>,<close>,<vol>
A,20120904 0926,37.14,37.14,37.14,37.14,693
.
.
.
ZZ,20120904 1602,1.6,1.6,1.6,1.6,11771
As you might have guessed, it's stock ticks. When I load a file into MATLAB, it creates a structure with an array (of the numerical values) and a cell array (for the strings), which is fine at this point, as I can work with it.
Problem:
I'd like to find the most efficient way to search the array for a specific symbol (~70K lines). While it's easy to do a naive linear search or a binary search, I don't think these approaches are very useful across multiple files and/or multiple searches for extracting the beginning and end indices of a given symbol/string.
I've looked into past posts here and read about Rabin-Karp, Bitap and hash tables, but I'm not sure any of them fully answers my needs.
So far, I've been leaning towards running through the cell array once and creating a hash table for each letter (i.e. 'A', 'B', etc.) and then running a naive search, or anything else you might suggest :-). The reason for hashing is that I might use the same file to look up different stock symbols, so I think running through it once and labeling the letters will reduce the complexity in the long run.
What are your thoughts on the matter? Am I in the right direction?
I'm using MATLAB, btw.
Thank you
You can store all your tickers in a struct array, with each column becoming a field. Assuming you have non-empty values, you can do the following:
tickers = {S.tickers};  % use a cell array for string fields like the ticker
dates = [S.date];       % square-bracket concatenation works for numeric fields
You can easily run queries to get the indices you want from your struct array S. You can go further and index tickers by ticker name, for example by creating a map (such as a containers.Map) with the ticker names as keys.
Suppose we want to search for a query like "t1 t2 t3" (where t1, t2, t3 must appear consecutively) in an inverted index structure.
Which of these approaches should we take?
1. First search for the term "t1" and find all documents that contain "t1"; do the same for "t2" and then "t3". Then find the documents in which the positions of "t1", "t2" and "t3" are next to each other.
2. First search for the term "t1" and find all documents that contain "t1"; then, within the documents we found, search for "t2"; and within the result of that, find the documents that contain "t3".
I have a full inverted index. I want to know which of the approaches above is more efficient, (1) or (2)?
Thanks a lot.
As the Wikipedia entry well explains:
"There are two main variants of inverted indexes: A record-level inverted index (or inverted file index, or just inverted file) contains a list of references to documents for each word. A word-level inverted index (or full inverted index, or inverted list) additionally contains the positions of each word within a document. The latter form offers more functionality (like phrase searches), but needs more time and space to be created."
Since you don't tell us which variant you have, we can't really answer your question precisely, but thinking about each possibility will help.
To open and search documents is typically a costly operation, unless your documents are unusually small, so you want to minimize that -- and option (2) doesn't really minimize it. If you have an inverted list, with option (1) you won't even need to open any document; if you only have an inverted file, you'll inevitably need to open documents and scan them (since you otherwise lack information to confirm word adjacency) -- but at least with option (1) you minimize the number of documents you have to open and scan (only those in the intersection of the lists of documents containing each word).
So, in either case, option (1) is more promising (unless your documents are peculiarly small).
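Since you say you have a full (word-level) inverted index, here is a minimal sketch of option (1), assuming for illustration that the index is an in-memory map from term to (document id -> sorted positions):

import java.util.*;

class PhraseSearch {
    // term -> (docId -> sorted positions of that term in the document)
    static Set<Integer> phraseDocs(Map<String, Map<Integer, List<Integer>>> index,
                                   String... terms) {
        // Intersect the document lists first, so only documents containing
        // every term are inspected (the heart of option 1).
        Set<Integer> candidates = new HashSet<>(index.get(terms[0]).keySet());
        for (int i = 1; i < terms.length; i++)
            candidates.retainAll(index.get(terms[i]).keySet());
        // Then verify adjacency using the stored positions.
        Set<Integer> result = new TreeSet<>();
        for (int doc : candidates) {
            for (int p : index.get(terms[0]).get(doc)) {
                boolean adjacent = true;
                for (int i = 1; i < terms.length; i++)
                    if (!index.get(terms[i]).get(doc).contains(p + i)) {
                        adjacent = false;
                        break;
                    }
                if (adjacent) { result.add(doc); break; }
            }
        }
        return result;
    }
}

(In production, the sorted position lists would be merged with a linear walk or binary search rather than List.contains.)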
Can anyone suggest the best way to get the number of occurrences (hits) of a word per document in Lucene?
Lucene uses a field-based, rather than document-based, index.
In order to get term counts per document:
Iterate over documents using IndexReader.document() and isDeleted().
In document d, iterate over fields using Document.getFields().
For each field f, get terms using getTermFreqVector().
Go over the term vector and sum frequencies per terms.
The sum of term frequencies per field will give you the document's term frequency vector.
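A minimal sketch of those steps for a single field (Lucene 3.x API; the field name "contents" and the variable "directory" are placeholders, and term vectors must have been enabled when indexing):

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.TermFreqVector;

IndexReader reader = IndexReader.open(directory);
for (int doc = 0; doc < reader.maxDoc(); doc++) {
    if (reader.isDeleted(doc)) continue;        // step 1: skip deleted docs
    TermFreqVector tfv = reader.getTermFreqVector(doc, "contents");
    if (tfv == null) continue;                  // no term vector was stored
    String[] terms = tfv.getTerms();            // step 3: the terms...
    int[] freqs = tfv.getTermFrequencies();     // ...and their counts
    for (int i = 0; i < terms.length; i++)      // step 4: report per term
        System.out.println("doc " + doc + ": " + terms[i] + " x " + freqs[i]);
}

To cover every field (step 2), repeat the getTermFreqVector call for each field name returned by Document.getFields().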
SpanTermQuery.getSpans will give an enumeration of docs and where the terms appear. The docs are sorted, so you can just count the number of times each doc appears, ignoring the position info.
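A sketch of that approach (Lucene 3.x API; the field "contents", the term "word", and the open IndexReader "reader" are placeholders):

import org.apache.lucene.index.Term;
import org.apache.lucene.search.spans.SpanTermQuery;
import org.apache.lucene.search.spans.Spans;

SpanTermQuery query = new SpanTermQuery(new Term("contents", "word"));
Spans spans = query.getSpans(reader);
int currentDoc = -1, count = 0;
while (spans.next()) {
    if (spans.doc() != currentDoc) {            // docs arrive sorted
        if (currentDoc != -1)
            System.out.println("doc " + currentDoc + ": " + count);
        currentDoc = spans.doc();
        count = 0;
    }
    count++;                                    // one span per occurrence
}
if (currentDoc != -1)
    System.out.println("doc " + currentDoc + ": " + count);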
I have this number-extraction problem.
I want to get all matches that don't have a certain digit pair in them,
e.g. 125501874, 125001873.
Every number that has 55 right after the first two digits is not to be considered.
The first digit's range is 0 to 9 and the second's is 1 to 9, so the real range is [01-99] (we cannot have 00 as the first two digits).
With Lucene I wanted to add NOT field:[01-99]55*
But it doesn't seem to work. Is there an easy way to find ??55* and disregard it in a search ("NOT field:[01-99]55*")?
Thank you, Lucene gurus!
Lucene can do this very efficiently if one creates an "index-only" field with only the third and fourth digits in it. The complete value can be "stored" (or stored and indexed if other queries use the whole number) in the original field.
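A sketch of what that indexing could look like (classic Lucene API; the field names "number" and "digits34" are hypothetical):

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

String number = "125501874";                    // example value
Document doc = new Document();
// The complete value is stored (and here also indexed) in the original field.
doc.add(new Field("number", number, Field.Store.YES, Field.Index.NOT_ANALYZED));
// The index-only field holds just the third and fourth digits.
doc.add(new Field("digits34", number.substring(2, 4),
        Field.Store.NO, Field.Index.NOT_ANALYZED));

Excluding the 55 case then becomes a cheap negated term query, e.g. -digits34:55.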
Update: A followup comment asked, "Is [there] a way to create a temporary index on only the second digit?"
Using a ParallelReader "vertically partitions" the fields of an index. One partition could hold the current index, with its fields, while the other is a temporary index with the new field, possibly stored in a RAMDirectory.
Assuming the number is "stored" in the original index, iterate over each document in the original index, retrieve the stored field, parse out the key digits, and add a Document to the temporary index with the new field. As the ParallelReader documentation states, it is imperative that the document numbers match in both indexes.
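A sketch of that setup (classic Lucene API; "originalDir", "number", and "digits34" are placeholders, and the original index is assumed to have no deleted documents):

import org.apache.lucene.analysis.KeywordAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.ParallelReader;
import org.apache.lucene.store.RAMDirectory;

// Build the temporary index in memory, one document per original document,
// in exactly the same order so the document numbers line up.
RAMDirectory tempDir = new RAMDirectory();
IndexWriter writer = new IndexWriter(tempDir, new KeywordAnalyzer(),
        IndexWriter.MaxFieldLength.UNLIMITED);
IndexReader original = IndexReader.open(originalDir);
for (int i = 0; i < original.maxDoc(); i++) {
    String number = original.document(i).get("number");  // the stored value
    Document d = new Document();
    d.add(new Field("digits34", number.substring(2, 4),
            Field.Store.NO, Field.Index.NOT_ANALYZED));
    writer.addDocument(d);
}
writer.close();

// Stitch the two indexes together; search over "parallel" as usual.
ParallelReader parallel = new ParallelReader();
parallel.add(original);
parallel.add(IndexReader.open(tempDir));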
Thank you erickson. Your ParallelReader solution is probably the best, if only I could use temporary indexes; because we cache the search queries, we will need them later.
But like you said, it's better to start with an index on the relevant digits straight away.
I have another solution.
NOT field:0?55*
NOT field:1?55*
...
NOT field:9?55*
It is efficient enough for the search I'm doing, and it bypasses the leading-wildcard limitation. I wouldn't use this if there were more digits to check or if they were farther from the start.
I'm now testing it on a million rows and it's pretty efficient for our needs.
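For reference, the same workaround can be built programmatically (classic Lucene API; the field name "field" is a placeholder). Each wildcard pattern starts with a literal digit, which is what sidesteps the leading-wildcard restriction:

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.MatchAllDocsQuery;
import org.apache.lucene.search.WildcardQuery;

BooleanQuery query = new BooleanQuery();
// A purely negative BooleanQuery matches nothing, so start from all docs.
query.add(new MatchAllDocsQuery(), BooleanClause.Occur.MUST);
for (char first = '0'; first <= '9'; first++) {
    query.add(new WildcardQuery(new Term("field", first + "?55*")),
            BooleanClause.Occur.MUST_NOT);
}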