I don't know what Lucene's ReaderUtil.subReader does. Does anybody know?
See the class definition here: ReaderUtil.
Is it used to read each segment separately?
A Lucene index is divided into segments; each segment holds a chunk of the index. The subreaders are the actual readers that work directly on a single segment (one segment => one segment reader), while the IndexReader that clients use is an aggregate implementation that delegates the actual work to its subreaders.
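In recent Lucene versions (5+) the same idea is exposed through IndexReader.leaves() and ReaderUtil.subIndex, which roughly corresponds to what ReaderUtil.subReader did in older releases. A minimal sketch, with the index path as a placeholder:

    import java.nio.file.Paths;
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.index.LeafReaderContext;
    import org.apache.lucene.index.ReaderUtil;
    import org.apache.lucene.store.FSDirectory;

    public class SubReaderDemo {
        public static void main(String[] args) throws Exception {
            // "/path/to/index" is a placeholder for a real index directory.
            try (DirectoryReader reader = DirectoryReader.open(
                    FSDirectory.open(Paths.get("/path/to/index")))) {
                // The top-level reader aggregates one leaf (segment) reader per segment.
                for (LeafReaderContext leaf : reader.leaves()) {
                    System.out.println("segment at docBase=" + leaf.docBase
                            + " holds " + leaf.reader().maxDoc() + " docs");
                }
                // Map a top-level doc ID to the segment (subreader) that contains it.
                int segment = ReaderUtil.subIndex(0, reader.leaves());
                System.out.println("doc 0 lives in segment " + segment);
            }
        }
    }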
Is it possible to have a big array like arr[200000] as an output of the top function in Vivado HLS?
Yes BUT:
which kind of "type" are the elements of the array? int? char? a single bit?
which kind of interface do you want to use? if you want pass all the elements at the same time, the operation may be impossible because maybe you don't have enough space on the fpga. If you are using a streaming or serial interface you can do this.
Normally you don't have this kind of limitation, but you should evaluate, case by case, what the best solution is for the hardware that you have.
Large arrays are generally implemented in BRAM, while LUTs are used for small arrays, so you have to consider whether you have enough BRAM resources available.
If your application allows using a FIFO, AXI-Stream, or AXI-Full with bursts, you might consider using one of them to transfer the data without holding the whole array on the PL; a buffer that holds a small chunk of the array at a time may be enough.
So it depends both on your board and your algorithm.
In SQL, an index is typically some kind of balanced tree (ordered nodes that point to the real table rows), which makes searching possible in O(log n). Walking such a tree actually is the searching process.
Now Lucene uses an inverted index with term frequencies: It stores for each term how often it occurs in which documents. This is easy to understand. But this doesn't explain how a search on such an index is actually performed.
The search string is analyzed and split into terms the same way, of course, and then "the index is searched" for these terms to find the documents containing them – but how? Is the Lucene index itself also ordered and organized in some tree-like way so that O(log n) is possible? Or is walking the Lucene index at search time actually linear, i.e. O(n)?
There is no simple answer to this: first, because the internal format has improved from release to release, and second, because with the advent of Lucene 4, configurable codecs were introduced that serve as an abstraction between the logical format and the actual physical format.
An index is made up of shards and replicas, each of them being a Lucene index itself. A Lucene index is in turn made up of multiple segments, where each segment is again a Lucene index. A segment is read-only and made up of multiple artefacts which can be held in the file system or in RAM.
What's in a Lucene index from Adrien Grand is an excellent presentation on the organisation of a Lucene index. This blog article from Michael McCandless and this blog article from Elastic are about codecs introduced with Lucene 4.
So querying a Lucene index actually means querying multiple segments in parallel, making use of a specific codec. A codec can represent a structure in the file system or in RAM, typically optimized/compressed for a particular need. The internal format can be anything from a tree to a hash map to a finite state machine, just to name a few. As soon as you use wildcard characters ("?" or "*") in your search query, this automatically results in a more or less deep traversal of your index.
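To make the lookup path concrete, here is a minimal term-query sketch against the post-Lucene-4 API (index path and field name are placeholders); the term is looked up in each segment's terms dictionary rather than found by scanning documents linearly:

    import java.nio.file.Paths;
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.ScoreDoc;
    import org.apache.lucene.search.TermQuery;
    import org.apache.lucene.search.TopDocs;
    import org.apache.lucene.store.FSDirectory;

    public class TermLookupDemo {
        public static void main(String[] args) throws Exception {
            try (DirectoryReader reader = DirectoryReader.open(
                    FSDirectory.open(Paths.get("/path/to/index")))) {
                IndexSearcher searcher = new IndexSearcher(reader);
                // A TermQuery consults the terms dictionary (e.g. an FST-based
                // codec) once per segment instead of scanning all documents.
                TopDocs hits = searcher.search(
                        new TermQuery(new Term("body", "lucene")), 10);
                for (ScoreDoc hit : hits.scoreDocs) {
                    System.out.println("doc=" + hit.doc + " score=" + hit.score);
                }
            }
        }
    }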
Is there a space efficient data structure that can help answer the following question:
Assume I have a database of a large number of strings (in the
millions). I need to be able to answer quickly if a given string is
a substring of one of these strings in the database.
Note that it's not even necessary in this case to tell which string it is a substring of, just that it's a substring of one.
As clarification, the ideal is to keep the data as small as possible, but query speed is really the most important issue. The minimum requirement is being able to hold the query data structure in RAM.
The right way to go about this is to avoid using your Java application to answer the question. If you solve the problem in Java, your app is guaranteed to read the entire table, and this is in addition to logic you will have to run on each record.
A better strategy would be to use your database to answer the question. Consider the following SQL query (assuming your database is some SQL flavor):
SELECT COUNT(*) FROM your_table WHERE column LIKE '%substring%'
This query will return the number of rows where 'column' contains some 'substring'. You can issue a JDBC call from your Java application. As a general rule, you should leave the heavy database lifting to your RDBMS; it was created for that.
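For example, a minimal JDBC sketch (connection URL, table, and column names are placeholders for your own schema; the pattern is bound as a parameter rather than concatenated, to avoid SQL injection):

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;

    public class SubstringCountDemo {
        public static void main(String[] args) throws Exception {
            try (Connection conn = DriverManager.getConnection(
                    "jdbc:mysql://localhost/your_db", "user", "password");
                 PreparedStatement stmt = conn.prepareStatement(
                    "SELECT COUNT(*) FROM your_table WHERE column_name LIKE ?")) {
                // Bind the wildcard pattern as a parameter.
                stmt.setString(1, "%substring%");
                try (ResultSet rs = stmt.executeQuery()) {
                    rs.next();
                    System.out.println("matching rows: " + rs.getLong(1));
                }
            }
        }
    }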
I am giving a hat tip to this SO post which was the basis for my response: http://www.stackoverflow.com/questions/4122193/how-to-search-for-rows-containing-a-substring
Strings are highly compact structures, so for regular English text it is unlikely that you will find any other kind of structure that is more space-efficient than strings. You can play various tricks with bits to make each character occupy less space in memory (at the expense of supporting other languages), but the savings there will only be linear.
However, if your strings have a very low degree of variation (a very high level of repetition), then you might be able to save space by constructing a tree in which each node corresponds to a letter. Each path of nodes in the tree then forms a possible word, as follows:
[c]---[a]-+--[t]
          |
          +--[r]
So, the above tree encodes the following words: cat, car. Of course this will only result in savings if you have a huge number of mostly similar strings.
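For illustration, here is a minimal Java sketch of such a letter tree (a trie). Note that a plain trie answers whole-word membership; for the original substring question you would need a suffix-oriented variant, such as a suffix tree built over the strings:

    import java.util.HashMap;
    import java.util.Map;

    // Shared prefixes such as "ca" in "cat"/"car" are stored only once.
    class Trie {
        private static final class Node {
            final Map<Character, Node> children = new HashMap<>();
            boolean isWord;
        }

        private final Node root = new Node();

        void insert(String word) {
            Node node = root;
            for (char c : word.toCharArray()) {
                node = node.children.computeIfAbsent(c, k -> new Node());
            }
            node.isWord = true;
        }

        boolean contains(String word) {
            Node node = root;
            for (char c : word.toCharArray()) {
                node = node.children.get(c);
                if (node == null) return false;
            }
            return node.isWord;
        }

        public static void main(String[] args) {
            Trie trie = new Trie();
            trie.insert("cat");
            trie.insert("car");
            System.out.println(trie.contains("car")); // true
            System.out.println(trie.contains("ca"));  // false (prefix only)
        }
    }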
Does anybody know if Riak Search has the ability to generate excerpts with highlight points in them, similar to what Lucene does?
Riak Search doesn't expose this functionality out of the box, but with a little work you can create a rough approximation.
Riak Search allows you to feed search results into a MapReduce job. If you do this, then your Map or Reduce function will also get a list of token positions in the document that matched the query (this is exposed as keydata, http://www.basho.com/search.php?q=keydata). Using these positions, you can write code to mark up the document or excerpt portions of text.
I think this functionality will hardly ever be implemented in Riak, since its philosophy implies that it doesn't care what exactly is stored in the values, and therefore does not process them in any meaningful way beyond providing some metadata like indices.
I've been wondering about this for some time. In CouchDB we have some fairly long IDs, e.g.:
"000ab56cb24aef9b817ac98d55695c6a"
Now if we're searching for this item and going through the tree structure created by the view, it seems a simple integer as an ID would be much faster. If we used 64-bit integers, it would be a simple CMP followed by a JMP (assuming the Erlang code were JIT-compiled, but you get my point).
For strings, I assume we generate a hash of the ID or something, but at some point we have to do a character compare on all 32 characters... won't that affect performance?
The short answer is, yes, of course it will affect performance, because the key length will directly impact the time it takes to walk down the tree.
It also affects storage: longer keys take more space, and space takes time.
However, the nuance you are missing is that while Couch CAN (and does) allocate new IDs for you, it is not required to. It will be more than happy to accept your own IDs rather than generate its own. So, if the key length bothers you, you are free to use shorter keys.
However, given the "JSON" nature of Couch, it's pretty much a "text"-based database. There isn't a lot of binary data stored in a normal Couch instance (attachments notwithstanding, though even those I think are stored in Base64; I may be wrong).
So, while a 64-bit key would be the most efficient, the simple fact is that Couch is designed to work with any key, and "any key" is most readily expressed as text.
Finally, truth be told, the cost of the key compare is dwarfed by the disk I/O fetch times, and the JSON marshaling of data (especially on writes). Any real gain achieved by converting to such a system would likely have no "real world" impact on overall performance.
If you want to really speed up the Couch key system, code the key routine to pack the key into 64-bit longs and compare those (like you said). 8 bytes of text are the same size as a 64-bit long int. That would give you, in theory, an 8x performance boost on key compares. Whether Erlang can generate such code, I can't say.
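As an illustration of that idea only (not how CouchDB actually compares keys), here is a Java sketch that packs a key into 64-bit blocks and compares block-wise; unsigned long comparison preserves the byte-wise lexicographic order of ASCII keys:

    import java.nio.ByteBuffer;
    import java.nio.charset.StandardCharsets;

    public class BlockCompareDemo {
        // Pack a key into 64-bit blocks, zero-padding the last block.
        static long[] toBlocks(String key) {
            byte[] bytes = key.getBytes(StandardCharsets.US_ASCII);
            int nBlocks = (bytes.length + 7) / 8;
            ByteBuffer buf = ByteBuffer.allocate(nBlocks * 8); // zero-filled
            buf.put(bytes);
            buf.rewind();
            long[] blocks = new long[nBlocks];
            for (int i = 0; i < nBlocks; i++) blocks[i] = buf.getLong();
            return blocks;
        }

        // Compare 8 bytes at a time instead of char by char;
        // assumes equal-length keys, as with Couch-style UUIDs.
        static int compareBlocks(long[] a, long[] b) {
            for (int i = 0; i < a.length; i++) {
                int cmp = Long.compareUnsigned(a[i], b[i]);
                if (cmp != 0) return cmp;
            }
            return 0;
        }

        public static void main(String[] args) {
            long[] a = toBlocks("000ab56cb24aef9b817ac98d55695c6a");
            long[] b = toBlocks("000ab56cb24aef9b817ac98d55695c6b");
            System.out.println(compareBlocks(a, b)); // negative: a sorts before b
        }
    }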
From the book CouchDB: The Definitive Guide:

I need to draw a picture of this at some point, but the reason is: if you think of the idealized B-tree, when you use UUIDs you might be hitting any number of root nodes in that tree, so with the append-only nature you have to write each of those nodes and everything above it in the tree. But if you use monotonically increasing IDs, then you're invalidating the same path down the right-hand side of the tree, thus minimizing the number of nodes that need to be rewritten. It would be just the same for monotonically decreasing IDs. And it should technically work if your updates can be guaranteed to hit one or two nodes in the inside of the tree, though that's much harder to prove.
So sequential IDs offer a performance benefit; however, you must remember that this isn't maintainable when you have more than one database, as the IDs will collide.