Text search in NoSQL with mapreduce

I am working on an app that requires searching a large list of titles. Ideally I would like to use NoSQL, but it seems that text search across the whole database is not as good as in SQL databases (please correct me if I am wrong).
In any case, I do want to optimize the speed of searches. A normal search might be fast enough, but I want a responsive live-search AND fuzzy search. Therefore I can only think of two approaches:
Load the whole list of titles in memory and index it as a trie or prefix tree (see the sketch after this list).
Implement some type of trie algorithm with a mapreduce function. This would be the preferred solution, but I am not sure whether it can be done, or whether the disk-space cost would outweigh the benefits.
Any ideas? Also, I am not sure whether fuzzy search is better implemented with a trie or with a B+ tree.
Since the titles are unique, should I just use the full title as the ID?

To do this efficiently, you'll have to index your text by words.
In other words, the object foo entitled MapReduce: Simplified Data Processing on Large Clusters will be mapped to the following keys:
MapReduce: Simplified Data Processing on Large Clusters,
Simplified Data Processing on Large Clusters,
Data Processing on Large Clusters,
Processing on Large Clusters,
on Large Clusters,
Large Clusters,
Clusters.
If the text is too long, you can truncate keys to a given number of characters (say 24).
Here is a code sample for CouchDB:
function map(o) {
  // Emit one key per word of the title: the lowercased suffix of the title
  // starting at that word, truncated to SIZE characters.
  const SIZE = 24;
  function format(text, begin) {
    return text.substr(begin, SIZE).toLowerCase();
  }
  const WORD_MATCHER = /\S+/g;
  var match;
  while ((match = WORD_MATCHER.exec(o.title))) {
    var begin = match.index;
    emit(format(o.title, begin), {position: begin});
  }
}
Then if you ask for keys between data process and data processZ, you will get:
{"key": "data processing on large clusters", "id": "foo", "value":{"position": 22}}

Related

How does Cassandra store variable data types like text

My assumption is that Cassandra stores fixed-length data in a column family, e.g. a column family with id (bigint), age (int), description (text), picture (blob). Now description and picture have no length limit. How does Cassandra store them? Does it externalize them through an ID -> location indirection?
For example, it looks like relational databases use a pointer to the actual location of large text values. See how it is done
Also, it looks like MySQL recommends char over varchar for better performance, I guess simply because there is no need for an "id lookup". See: mysql char vs varchar
Cassandra stores individual cells (column values) in its on-disk files ("sstables") as a 32-bit length followed by the data bytes. So string values do not need to have a fixed size, nor are they stored as pointers to other locations - the complete string appears as-is inside the data file.
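As a rough illustration of that layout (not Cassandra's actual serialization code), a length-prefixed cell could be written like this:

// Illustration only: a cell stored as a 32-bit length followed by the raw
// bytes, inline in the file rather than as a pointer elsewhere.
function encodeCell(value) {
  var data = Buffer.from(value, 'utf8');
  var header = Buffer.alloc(4);
  header.writeInt32BE(data.length, 0);   // 32-bit length prefix (hence the 2GB cap)
  return Buffer.concat([header, data]);  // value bytes follow immediately
}
// encodeCell('hello').length === 9   (4-byte length + 5 data bytes)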
The 32-bit length limit means that each "text" or "blob" value is limited to 2GB in length, but in practice you shouldn't use anything even close to that - the Cassandra documentation suggests you shouldn't go above 1MB. There are several problems with having very large values:
Because values are not stored as pointers to some other storage, but rather stored inline in the sstable files, these large strings get copied around every time sstable files get rewritten, namely during compaction. It would be more efficient to keep the huge string on disk in a separate file and just copy around pointers to it - but Cassandra doesn't do this.
The Cassandra query language (CQL) does not have any mechanism for storing or retrieving a partial cell. So if you have a 2GB string, you have to retrieve it entirely - there is no way to "page" through it, nor a way to write it incrementally.
In Scylla, large cells will result in large latency spikes because Scylla will handle the very large cell atomically and not context-switch to do other work. In Cassandra this problem will be less pronounced but will still likely cause problems (the thread stuck on the large cell will monopolize the CPU until preempted by the operating system).

Data modeling of Cassandra for node-based use cases

I have a CQL table which has 2 columns:
{
    long minuteTimeStamp -> only the minute part of the epoch time; seconds are ignored
    String data -> some data
}
I have a 5-node Cassandra cluster and I want to distribute each minute's data uniformly across all 5 nodes. So if one minute's data is ~10k records, each node should receive ~2k records.
I also want to consume each minute's data in parallel, meaning 5 different readers, one reading from each node.
One solution I came up with is to keep one more column in the table:
{
    long minuteTimeStamp
    int shardIdx
    String data
    partition key: (minuteTimeStamp, shardIdx)
}
While writing the data, I would round-robin over shardIdx. But since Cassandra uses vnodes, it is possible that (min0, 0) goes to node0 and (min0, 1) also goes to node0, because that token might also belong to node0. This way I could create hotspots, and it would also hamper reads: of the 5 parallel readers that are supposed to read one per node, more than one might land on the same node.
How can we design our partition key so that data is uniformly distributed, without writing a custom partitioner?
There's no need to make the data distribution more complex by sharding.
The default Murmur3Partitioner will distribute your data evenly across nodes as you approach hundreds of thousands of partitions.
If your use case really is going to hotspot on a single partition, that's more an inherent problem with your use case/access pattern, but it's rare in practice unless you have a super-node issue, for example in a social graph where Taylor Swift or Barack Obama has millions more followers than everyone else. Cheers!

which is faster: views or allDocs with Array.filter?

I was wondering about the performance differences between dedicated views in CouchDB/PouchDB vs. simply retrieving allDocs and filtering them with Array.prototype.filter later on.
Let's say we want to get 5,000 todo docs stored in a database.
// Method 1: get all tasks with a dedicated view "todos"
// in CouchDB
function (doc) {
  if (doc.type == "todo") {
    emit(doc._id);
  }
}
// on Frontend
var tasks = (await db.query('myDesignDoc/todos', {include_docs: true})).rows;
// Method 2: get allDocs, and then filter via Array.filter
var tasks = (await db.allDocs({include_docs: true})).rows;
tasks = tasks.filter(task => {return task.doc.type == 'todo'});
What's better? What are the pros and cons of each of the 2 methods?
The use of the view will scale better. But which is "faster" will depend on so many factors that you will need to benchmark for your particular case on your hardware, network and data.
For the "all_docs" case, you will effectively be transferring the entire database to the client, so network speed will be a large factor here as the database grows. If you do this as you have, by putting all the documents in an array and then filtering, you're going to hit memory usage limits at some point - you really need to process the results as a stream. This approach is O(N), where N is the number of documents in the database.
For the "view" case, a B-Tree index is used to find the range of matching documents. Only the matching documents are sent to the client, so the savings in network time and memory depend on the proportion of matching documents from all documents. Time complexity is O(log(N) + M) where N is the total number of documents and M is the number of matching documents.
If N is large and M is small then this approach should be favoured. As M approaches N, both approaches are pretty much the same. If M and N are unknown or highly variable, use a view.
You should consider one other thing - do you need the entire document returned? If you need only a few fields from large documents then views can return just those fields, reducing network and memory usage further.
Mango queries may also be of interest instead of views for this sort of query. You can create an index over the "type" field if the dataset size warrants it, but it's not mandatory.
Personally, I'd use a Mango query and add the index if/when necessary.
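For completeness, a sketch of that Mango approach with PouchDB's find plugin (pouchdb-find, bundled with recent PouchDB builds), assuming the same "type" field:

// Create an index over "type" once, then query with a selector.
await db.createIndex({index: {fields: ['type']}});

var result = await db.find({
  selector: {type: 'todo'}
  // fields: ['_id', 'title']   // optionally return only the fields you need
});
var tasks = result.docs;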

Streaming big data while sorting

I have huge data and as a result I cannot hold all of it in memory; I always get out-of-memory errors. Obviously one of the solutions would be streaming in Node.js, but (as far as I know) streaming is not possible with sorting, which is one of the operations I apply to my data. Is there an algorithm, maybe a divide-and-conquer algorithm, that I can use to combine streaming and sorting?
You can stream the data using Kinesis and use the Kinesis Client Library, or subscribe a Lambda function to your Kinesis stream and incrementally maintain sorted materialized views. Where you store your sorted materialized views and how you divide your data will depend on your application. If you cannot store the entire sorted materialized views, you could have rolling views. If your data is time-series, or has some other natural order, you could divide the range of your ordered attribute into chunks. Then you could have, for example, 1-day or 1-hour sorted chunks of your data. In other words, choose the sorted subdivision that allows you to keep the information in memory as needed.
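If you want to do this yourself rather than through Kinesis, the classic pattern is an external merge sort: sort each chunk that fits in memory, write the sorted chunks out, then stream-merge them. Here is a minimal sketch of the merge step over pre-sorted arrays (in a real system these would be file or readable streams):

// Merge phase of an external merge sort: each chunk is already sorted, so we
// repeatedly emit the smallest head element across all chunks.
function* mergeSortedChunks(chunks, compare) {
  var heads = chunks.map(function () { return 0; });   // one cursor per chunk
  while (true) {
    var best = -1;
    for (var i = 0; i < chunks.length; i++) {
      if (heads[i] >= chunks[i].length) continue;       // this chunk is exhausted
      if (best === -1 || compare(chunks[i][heads[i]], chunks[best][heads[best]]) < 0) {
        best = i;
      }
    }
    if (best === -1) return;                            // all chunks exhausted
    yield chunks[best][heads[best]++];
  }
}

// [...mergeSortedChunks([[1, 4, 9], [2, 3, 10], [5]], (a, b) => a - b)]
// -> [1, 2, 3, 4, 5, 9, 10]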

Significant terms causes a CircuitBreakingException

I've got a mid-size Elasticsearch index (1.46T or ~1e8 docs). It's running on 4 servers which each have 64GB of RAM, split evenly between Elasticsearch and the OS (for caching).
I want to try out the new "Significant terms" aggregation so I fired off the following query...
{
  "query": {
    "ids": {
      "type": "document",
      "values": [
        "xCN4T1ABZRSj6lsB3p2IMTffv9-4ztzn1R11P_NwTTc"
      ]
    }
  },
  "aggregations": {
    "Keywords": {
      "significant_terms": {
        "field": "Body"
      }
    }
  },
  "size": 0
}
Which should compare the body of the document specified with the rest of the index and find terms significant to the document that are not common in the index.
Unfortunately, this invariably results in a
ElasticsearchException[org.elasticsearch.common.breaker.CircuitBreakingException: Data too large, data would be larger than limit of [25741911654] bytes];
nested: UncheckedExecutionException[org.elasticsearch.common.breaker.CircuitBreakingException: Data too large, data would be larger than limit of [25741911654] bytes];
nested: CircuitBreakingException[Data too large, data would be larger than limit of [25741911654] bytes];
after a minute or two and seems to imply I haven't got enough memory.
The elastic servers in question are actually VMs, so I shut down other VMs and gave each elastic instance 96GB and each OS another 96GB.
The same problem occurred (different numbers, took longer). I haven't got hardware to hand with more than 192GB of memory available so can't go higher.
Are aggregations not meant for use against the index as a whole? Am I making a mistake with regards to the query format?
There is a warning in the documentation for this aggregation about RAM use on free-text fields for very large indices [1]. On large indices it works OK for lower-cardinality fields with a smaller vocabulary (e.g. hashtags), but the combination of many free-text terms and many docs is a memory hog. You could look at specifying a filter on the loading of the FieldData cache [2] for the Body field to trim the long tail of low-frequency terms (e.g. doc frequency < 2), which would reduce RAM overheads.
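As a sketch of that fielddata frequency filter (Elasticsearch 1.x-era "string" mapping, using the official Node client; the index name "docs" is an assumption):

// Limit the fielddata cache for "Body" to terms occurring in at least 2 docs
// per segment, trimming the long tail of rare terms.
var elasticsearch = require('elasticsearch');
var client = new elasticsearch.Client({host: 'localhost:9200'});

client.indices.putMapping({
  index: 'docs',
  type: 'document',
  body: {
    document: {
      properties: {
        Body: {
          type: 'string',
          fielddata: {
            filter: {
              frequency: {min: 2}   // ignore terms with doc frequency < 2
            }
          }
        }
      }
    }
  }
}, function (err, resp) {
  if (err) throw err;
  console.log(resp);
});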
I have used a variation of this algorithm before where only a sample of the top-matching docs were analysed for significant terms and this approach requires less RAM as only the top N docs are read from disk and tokenised (using TermVectors or an Analyzer). However, for now the implementation in Elasticsearch relies on a FieldData cache and looks up terms for ALL matching docs.
One more thing - when you say you want to "compare the body of the document specified", note that the usual mode of operation is to compare a set of documents against the background, not just one. All analysis is based on doc-frequency counts, so with a sample set of just one doc all terms will have a foreground frequency of 1, meaning you have less evidence to reinforce any analysis.
