Elasticsearch unsorted size/from pagination while indexing

I am using from/size pagination to iterate over a large, unsorted query result set while concurrently indexing documents that are not part of the query result set. Ignoring the fact that scroll/scan would be a more efficient solution for my scenario, can I expect consistent results?
I understand that if I were concurrently indexing documents that were part of the result set I should expect duplicate and missing results. In this scenario I am indexing documents that are not part of the result set and I am not sure if the inconsistent results I am getting are expected behavior due to this paging strategy.
I am using Elasticsearch version 1.2.2. I have verified that the construction of the queries is consistent with the documentation.
{
  "from" : 0, "size" : 50000,
  "query" : {
    "term" : { "user" : "kimchy" }
  }
}
-
{
  "from" : 50000, "size" : 50000,
  "query" : {
    "term" : { "user" : "kimchy" }
  }
}
The correct number of documents is always returned (about 2.6 million), but most of the time a small number of results (about 10) are duplicates that appear in place of the correct documents.

Deep pagination can't be expected to be safe, as it is usually executed across multiple shards (https://www.elastic.co/guide/en/elasticsearch/guide/current/pagination.html). So even when you are not indexing at the same time (which would definitely break your pagination), segment merges are done in the background from time to time. When that happens you may lose a document and get a duplicate in its place.
So: do scroll/scan.
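A sketch of what that looks like on 1.x (the index name and page size are placeholders; with search_type=scan the size applies per shard):
GET /my_index/_search?search_type=scan&scroll=1m
{
  "query" : {
    "term" : { "user" : "kimchy" }
  },
  "size" : 1000
}
Then keep pulling batches until no more hits come back, passing the scroll_id from each previous response:
GET /_search/scroll?scroll=1m&scroll_id=<scroll_id from the previous response>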

The issue of inconsistent results can be resolved using scroll/scan pagination instead of from/size pagination.
I do not know for sure whether my usage of from/size paging is supported, but the getting started documentation seems to suggest that it is. This may indicate a bug in the from/size paging of Elasticsearch 1.2.2, though I have not done the testing necessary to identify or verify it.

Related

Design a counting measure for the number of elements in a MongoDB array property in cube.js

I'm using cube.js with MongoDB through the MongoDB Connector for BI and the MongoBI Driver, and so far so good. I'd like to have a numerical cube.js measure that counts the number of elements in a nested MongoDB array-of-objects property. Something like:
{
  "nested": {
    "arrayPropertyName": [
      {
        "name": "Leatha Bauch",
        "email": "Leatha.Bauch76#hotmail.com"
      },
      {
        "name": "Pedro Hermiston",
        "email": "Pedro76#hotmail.com"
      }
    ]
  }
}
I wasn't able to figure that out looking at the docs and I was wondering if that is even possible.
I tried with type: count:
MyNestedArrayPropertyCounter: {
  sql: `${CUBE}.\`nested.arrayPropertyName\``,
  type: `count`,
  format: `number`,
},
but I'm getting
Error: Error: Unknown column 'nested.arrayPropertyName' in 'field list'
Any help/advice is really appreciated. Thanks
The MongoDB Connector for BI treats nested arrays as separate relational tables. See https://www.mongodb.com/blog/post/introducing-the-mongodb-connector-for-bi-20
That's why you get the unknown column error: the array is not part of the parent document's table.
So my guess is that you have to build a schema on the nested array and then build a count measure with a dimension on the parent object id.
Hope it helps.
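Edit: a rough sketch of what that schema could look like (not tested; the table and column names below are assumptions about what mongosqld generates for this collection, so check them with SHOW TABLES / DESCRIBE through a SQL client first):
// cube.js schema sketch: count elements of the flattened nested-array table,
// with a dimension on the parent document's _id so counts can be grouped per document
cube(`NestedArrayItems`, {
  // assumed name of the child table mongosqld creates for nested.arrayPropertyName
  sql: `SELECT * FROM mydb.mycollection_nested_arrayPropertyName`,

  measures: {
    count: {
      // one flattened row per array element, so this count is the array length per parent
      type: `count`,
    },
  },

  dimensions: {
    parentId: {
      // assumed: the child table carries the parent document's _id column
      sql: `${CUBE}._id`,
      type: `string`,
    },
  },
});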
I followed Michael Parshin's advice, and here are my findings and the outcomes of working around the problem:
1. LEFT JOIN approach with cube.js joins. I found it painfully slow, and most of the time it ended in a timeout, even when the query was run through command-line SQL clients.
2. Launch mongosqld with the --prejoin flag. That was a better option, since mongosqld automatically adds the master table's columns/properties to the secondary tables, enabling you to conveniently query cube.js measures without joining a secondary cube.
3. Wrote a mongo script that fetches, iterates, precalculates and persists the nested.arrayPropertyName count in a separate property of the collection documents (see the sketch at the end of this answer).
Conclusion
Leaving out option 1, option 3 significantly outperforms option 2: typically less than a second versus more than 20 seconds on my local machine. I compared both options with the same measure and different timeDimension ranges and granularities.
Most probably I'll incorporate array count precalculation into mongo document back-end persisting logic.
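For reference, the precalculation in option 3 can be as simple as a script along these lines (mongo shell sketch; the collection name and the name of the counter property are made up):
// iterate documents that have the nested array and persist its length
// in a separate numeric property that cube.js can then use directly
db.getCollection('mycollection').find({ 'nested.arrayPropertyName': { $exists: true } }).forEach(function (doc) {
  db.getCollection('mycollection').updateOne(
    { _id: doc._id },
    { $set: { 'nested.arrayPropertyNameCount': doc.nested.arrayPropertyName.length } }
  );
});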

Search records by value or conditional value in NoSQL with 100 million records

We are looking for a NoSQL database where we can store more than 100 million records with many fields in the value, like sets in Redis.
The database should also be searchable by value. We checked Redis, but it does not offer any option to search by value. We have millions of records, we update some fields of a record, and then we need to fetch the batch of records that have not been updated since a specific time.
Running a query over all records and then checking which ones have not been updated since that time takes too long, because in this setup we are updating 100-200 records per minute and then fetching a batch of records based on a value.
So Redis will not work here. We have the option to store the data in MongoDB, but we are looking for a key-value database that supports this kind of search-by-value feature.
{
  "_id" : ObjectId("5ac72e522188c962d024d0cd"),
  "itemId" : 11.0,
  "url" : "http://www.testurl.com",
  "failed" : 0.0,
  "proxyProvider" : "Test",
  "isLocked" : false,
  "syncDurationInMinute" : 60.0,
  "lastUpdatedTimeUTC" : "",
  "nextUpdateTimeUTC" : "",
  "targetCountry" : "US",
  "requestContentType" : "JSON",
  "group" : "US"
}
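For illustration, the lookup we need is essentially the following (mongo shell sketch; the collection name is made up, and it assumes lastUpdatedTimeUTC holds an actual timestamp rather than the empty string shown above):
// one-time: an index on the update timestamp, since the collection has ~100 million records
db.getCollection('items').createIndex({ lastUpdatedTimeUTC: 1 });
// fetch the records that have not been updated in the last hour
var cutoff = new Date(Date.now() - 60 * 60 * 1000);
db.getCollection('items').find({ lastUpdatedTimeUTC: { $lt: cutoff } });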
In Aerospike, you can use predicate filtering to find records that have not been updated since a point in time, and return only the metadata of that record, which includes the record digest (its unique identifier). You can process the matched digests and do whatever update you need to do. This type of predicate filter is very fast because it only has to look at the primary index entry, which is kept in memory. See the examples in the Java client's repo.
You would not need to use a secondary index here, because you want to scan all the records in a namespace (or set of that namespace) and just check the 'last-update-time' piece of metadata of each record. Since you'll be returning just the record's digest (unique ID) and not any of its actual data, this scan will never need to read anything from SSD. It'll be very fast and lightweight on the results (again, only metadata is sent back). In the client you'll iterate the result set, build a list of IDs and then act on those with a subsequent write.

How do I resolve RequestRateTooLargeException on Azure Search when indexing a DocumentDB source?

I have a DocumentDB instance with about 4,000 documents. I just configured Azure Search to search and index it. This worked fine at first. Yesterday I updated the documents and indexed fields along with one UDF to index a complex field. Now the indexer is reporting that DocumentDB is reporting RequestRateTooLargeException. The docs on that error suggest throttling calls but it seems like Search would need to do that. Is there a workaround?
Azure Search uses the DocumentDB client SDK, which retries internally with an appropriate timeout when it encounters a RequestRateTooLarge error. However, this only works if there are no other clients using the same DocumentDB collection concurrently. Check whether you have other concurrent users of the collection; if so, consider adding capacity to the collection.
This could also happen because, due to some other issue with the data, the DocumentDB indexer isn't able to make forward progress: it will then retry on the same data and may encounter the same data problem again, akin to a poison message. If you observe that a specific document (or a small number of documents) causes indexing problems, you can choose to ignore them. I'm pasting an excerpt from the documentation we're about to publish:
Tolerating occasional indexing failures
By default, an Azure Search indexer stops indexing as soon as even a single document fails to be indexed. Depending on your scenario, you can choose to tolerate some failures (for example, if you repeatedly re-index your entire datasource). Azure Search provides two indexer parameters to fine-tune this behavior:
maxFailedItems: The number of items that can fail indexing before an indexer execution is considered a failure. Default is 0.
maxFailedItemsPerBatch: The number of items that can fail indexing in a single batch before an indexer execution is considered a failure. Default is 0.
You can change these values at any time by specifying one or both of these parameters when creating or updating your indexer:
PUT https://service.search.windows.net/indexers/myindexer?api-version=[api-version]
Content-Type: application/json
api-key: [admin key]
{
  "dataSourceName" : "mydatasource",
  "targetIndexName" : "myindex",
  "parameters" : { "maxFailedItems" : 10, "maxFailedItemsPerBatch" : 5 }
}
Even if you choose to tolerate some failures, information about which documents failed is returned by the Get Indexer Status API.
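The status check itself is just another REST call, along these lines:
GET https://service.search.windows.net/indexers/myindexer/status?api-version=[api-version]
api-key: [admin key]
The response includes the recent execution history with per-document errors, so you can see which documents were skipped.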

Significant terms causes a CircuitBreakingException

I've got a mid-size Elasticsearch index (1.46T or ~1e8 docs). It's running on 4 servers which each have 64GB of RAM, split evenly between Elasticsearch and the OS (for caching).
I want to try out the new "Significant terms" aggregation so I fired off the following query...
{
  "query": {
    "ids": {
      "type": "document",
      "values": [
        "xCN4T1ABZRSj6lsB3p2IMTffv9-4ztzn1R11P_NwTTc"
      ]
    }
  },
  "aggregations": {
    "Keywords": {
      "significant_terms": {
        "field": "Body"
      }
    }
  },
  "size": 0
}
Which should compare the body of the document specified with the rest of the index and find terms significant to the document that are not common in the index.
Unfortunately, this invariably results in a
ElasticsearchException[org.elasticsearch.common.breaker.CircuitBreakingException: Data too large, data would be larger than limit of [25741911654] bytes];
nested: UncheckedExecutionException[org.elasticsearch.common.breaker.CircuitBreakingException: Data too large, data would be larger than limit of [25741911654] bytes];
nested: CircuitBreakingException[Data too large, data would be larger than limit of [25741911654] bytes];
after a minute or two and seems to imply I haven't got enough memory.
The elastic servers in question are actually VMs, so I shut down other VMs and gave each elastic instance 96GB and each OS another 96GB.
The same problem occurred (different numbers, took longer). I haven't got hardware to hand with more than 192GB of memory available so can't go higher.
Are aggregations not meant for use against the index as a whole? Am I making a mistake with regards to the query format?
There is a warning on the documentation for this aggregation about RAM use on free-text fields for very large indices [1]. On large indices it works OK for lower-cardinality fields with a smaller vocabulary (e.g. hashtags) but the combination of many free-text terms and many docs is a memory-hog. You could look at specifying a filter on the loading of FieldData cache [2] for the Body field to trim the long-tail of low-frequency terms (e.g. doc frequency <2) which would reduce RAM overheads.
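For example, a frequency filter on the Body field's fielddata could look roughly like this in the mapping (a sketch: the type name is taken from your query, the min/min_segment_size values are just starting points to tune, and since the filter is applied when fielddata is loaded you may need to clear the fielddata cache for it to take effect):
{
  "mappings": {
    "document": {
      "properties": {
        "Body": {
          "type": "string",
          "fielddata": {
            "filter": {
              "frequency": {
                "min": 2,
                "min_segment_size": 500
              }
            }
          }
        }
      }
    }
  }
}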
I have used a variation of this algorithm before where only a sample of the top-matching docs were analysed for significant terms and this approach requires less RAM as only the top N docs are read from disk and tokenised (using TermVectors or an Analyzer). However, for now the implementation in Elasticsearch relies on a FieldData cache and looks up terms for ALL matching docs.
One more thing - when you say you want to "compare the body of the document specified" note that the usual mode of operation is to compare a set of documents against the background, not just one. All analysis is based on doc frequency counts so with a sample set of just one doc all terms will have the foreground frequency of 1 meaning you have less evidence to reinforce any analysis.

How to sort and then apply a limit filter in Elasticsearch

I want to sort the results according to a specific field AND THEN apply the limit filter.
A simple, common example would be a SQL query like select name from users order by name limit 100, which returns the sorted results limited to 100 rows.
I tried the same in Elasticsearch, however that is not what happens: it first applies the limit AND THEN sorts, which gives me undesired results. Can anyone point out what I am doing wrong? This is the query I am currently using.
{
  "sort": [
    "full_name"
  ],
  "filter": {
    "limit": {
      "value": 100
    }
  }
}
How about using "size" in ElasticSearch? http://www.elasticsearch.org/guide/reference/api/search/from-size.html
The limit filter works on each shard first and then results are merged on one of the nodes. If you have more than one shard, that can produce interesting results. I think, as Gabbar already mentioned, what you are looking for is the "size" parameter.
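For example, a request along these lines returns the first 100 documents of the full sorted result (a sketch; match_all stands in for whatever query you actually run):
{
  "query": {
    "match_all": {}
  },
  "sort": [
    "full_name"
  ],
  "from": 0,
  "size": 100
}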
