Inconsistent results when sorting documents by _doc - apache-spark

I want to fetch elasticsearch hits using the sort+search_after paging mechanism.
The elasticsearch documentation states:
_doc has no real use-case besides being the most efficient sort order. So if you don’t care about the order in which documents are returned, then you should sort by _doc. This especially helps when scrolling.
However, when performing the same query multiple times, I get different results. More specifically, the first hit alternates randomly between two different hits, where the returned sort field is 0 for one hit, and some specific number for the other.
This obviously breaks the paging as it relies on the value returned in sorting to be later fed into sort_after for the next query.
No data is being written to the index while I am querying it, so this is not because of refreshes.
My questions are therefore:
Is it wrong to sort by _doc for paging? Seems the results I get are inconsistent.
How does sorting by _doc work internally? The documentation is lacking in this regard as it simply states the sort is performed by "index order".
The data was written to the index in parallel using Spark. I thought the problem might have been the parallel write combined with the "index order" sorting, however I did not manage to replicate this behavior with other indicies which were also written to in Spark.
es 7, index contains 2 shards, one primary and one replica
cheers.

The reason this happened is that the index consists of 2 shards. One primary and one replica. The documents were not indexed in the same order. Thus, the order of the results depends on the shard they were returned from. This is fine when using scrolling because Elasticsearch keeps an inner state of the results, but not with paging, which is stateless.

Related

Best way to Fetch N rows in ScyllaDB

I have a use case in which I utilize ScyllaDB to limit users' actions in the past 24h. Let's say the user is only allowed to make an order 3 times in the last 24h. I am using ScyllaDB's ttl and making a count on the number of records in the table to achieve this. I am also using https://github.com/spaolacci/murmur3 to get the hash for the partition key.
However, I would like to know what is the most efficient way to query the table. So I have a few queries in which I'd like to understand better and compare the behavior(please correct me if any of my statement is wrong):
using count()
count() will implement a full-scan query, meaning that it may query more than necessary records into the table.
SELECT COUNT(1) FROM orders WHERE hash_id=? AND user_id=?;
using limit
limit will only limit the number of records being returned to the client. Meaning it will still query all records that match its predicates but only limit the ones returned.
SELECT user_id FROM orders WHERE hash_id=? AND user_id=? LIMIT ?;
using paging
I'm a bit new to this, but if I read the docs correctly it should only query the up until it received the first N records without having to query the whole table. So if I limit the page size to a number of records I want to fetch and only query the first page, would it work correctly? and will it have a consistent result?
docs: https://java-driver.docs.scylladb.com/stable/manual/core/paging/index.html
my query is still using limit, but utilizing the driver to achieve this with https://github.com/gocql/gocql
iter := conn.Query(
"SELECT user_id FROM orders WHERE hash_id=? AND user_id=? LIMIT ?",
hashID,
userID,3
).PageSize(3).PageState(nil).Iter()
Please let me know if my analysis was correct and which method would be best to choose
Your client should always use paging - otherwise you risk adding pressure to the query coordinator, which may introduce latency and memory fragmentation. If you use the Scylla Monitoring stack (and you should if you don't!), refer to the CQL Optimization dashboard and - more specifically - to the Paged Queries panel.
Now, to your question. It seems to be that your example is a bit minimalist for what you are actually wanting to achieve and - even then - should it not be, we have to consider such set-up at scale. Eg: There may be a tenant allowed which is allowed to place 3 orders within a day, but another tenant allowed to place 1 million orders within a week?
If the above assumption is correct - and with the options at hand you have given - you are better off using LIMIT with paging. The reason is because there are some particular problems with the description you've given at hand:
First, you want to retrieve N amount of records within a particular time-frame, but your queries don't specify such time-frame
Second, either COUNT or LIMIT will initiate a partition scan, and it is not clear how a hash_id + user_id combination can be done to determine the number of records within a time-frame.
Of course, it may be that I am wrong, but I'd like to suggest different some approaches which may be or not applicable for you and your use case.
Consider a timestamp component part of the clustering key. This will allow you to avoid full partition scans, with queries such as:
SELECT something FROM orders WHERE hash_id=? AND user_id=? AND ts >= ? AND ts < ?;
If the above is not applicable, then perhaps a Counter Table would suffice your needs? You could simply increment a counter after an order is placed, and - after - query the counter table as in:
SELECT count FROM counter_table WHERE hash_id=? AND user_id=? AND date=?;
I hope that helps!
I have a few points I want to add to what Felipe wrote already:
First, you don't need to hash the partition key yourself. You can use anything you want for the partition key, even consecutive numbers, the partition key doesn't need to be random-looking. Scylla will internally hash the partition key on its own to improve the load balancing. You don't need to know or care which hashing algorithm ScyllaDB uses, but interestingly, it's a variant of murmur3 too (which is not identical to the one you used - it's a modified algorithm originally picked by the Cassandra developers).
Second, you should know - and decide whether you care - that the limit you are trying to enforce is not a hard limit when faced with concurrent operations: Imagine that the given partition already has two records - and now two concurrent record addition requests come in. Both can check that there are just two records, decide it's fine to add the third - and then when both add their record - and you end up with four records. You'll need to decide whether this is fine for you that a user can get in 4 requests in a day if they are lucky, or it's a disaster. Note that theoretically you can get even more than 4 - if the user managest to send N requests at exactly the same time, they may be able to get 2+N records in the database (but in the usual case, they won't manage to get many superflous records). If you'll want 3 to be a hard limit, you'll probably needs to change your solution - perhaps to one based on LWT and not use TTL.
Third, I want to note that there is not an important performance difference between COUNT and LIMIT when you know a-priori that there will only be up to 3 (or perhaps, as explained above, 4 or some other similarly small number) results. If you assume that the SELECT only yields three or less results, and it can never be a thousand results, then it doesn't really matter if you just retrieve them or count them - you should just do whichever is convenient for you. In any case, I think that paging is not a good solution your need. For such short results and you can just use the default page size and you'll never reach it anyway, and also paging hints the server that you will likely continue reading on the next page - and it caches the buffers it needs to do that - while in this case you know that you'll never continue after the first three results. So in short, don't use any special paging setup here - just use the default page size (which is 1MB) and it will never be reached anyway.

Sorting enormous dataset

I have an enormous dataset (over 300 million documents). It is a system for archiving data and rollback capability.
The rollback capability is a cursor which iterates trough the whole dataset and performs few post requests to some external end points, it's a simple piece of code.
The data being iterated over needs to be send ordered by the timestamp (filed in the document). The DB was down for some time, so backup DB was used, but has received older data which has been archived manually, and later all was merged with the main DB.
Older data breaks the order. I need to sort this dataset, but the problem is the size; there is not enough RAM available to perform this operation at once. How I can achieve this sorting?
PS: The documents do not contain any indexed fields.
There's no way to do an efficient sort without an index. If you had an index on the date field then things would already be sorted (in a sense), so getting things in a desired order is very cheap (after the overhead of the index).
The only way to sort all entries without an index is to fetch the field you want to sort for every single document and sort them all in memory.
The only good options I see are to either create an index on the date field (by far the best option) or increase the RAM on the database (expensive and not scalable).
Note: since you have a large number of documents it's possible that even your index wouldn't be super scalable -- in that case you'd need to look into sharding the database.

AWS DynamoDB count query results without retrieving

I would like to check how many entries are in a DynamoDB table that matches a query without retrieving the actual entries, using boto3.
I want to run a machine learning job on data from DynamoDB table. The data I'm training on is a data that answers a query, not the entire table. I want to run the job only if I have enough data to train on.
Therefore, I want to check if I want to check that I have enough entries that match the query.
It is worth mentioning that the DynamoDB table I'm querying is really big, therefore actual retrieving is no option unless I actually want to run the job.
I know that I can use boto3.dynamodb.describe_table() to get how many entries there are in the entire table, but as I mentioned earlier, I want to know only how many entries match a query.
Any ideas?
This was asked and answered in the past, see How to get item count from DynamoDB?
Basically, you need to use the "Select" parameter to tell DynamoDB to only count the query's results, instead of retrieving them.
As usual in DynamoDB, this is truncated by paging: if the result set (not the count - the actual full results) is larger than 1 MB, then only the first 1 MB is retrieved, and the items in it counted, and you get back this partial count. If you're only interested in checking whether you have "enough" results - this may even be better for you - because you don't want to pay for reading a gigabyte of data just to check if the data is there. You can even ask for a smaller page, to read less - depending on what you consider enough data.
Just remember that you'll pay Amazon not by the amount of data returned (just one integer, the count) but by the amount of data read from disk. Using such counts excessively may lead to surprising large costs.

Mongoose Results' Default Order [duplicate]

When we run a Mongo find() query without any sort order specified, what does the database internally use to sort the results?
According to the documentation on the mongo website:
When executing a find() with no parameters, the database returns
objects in forward natural order.
For standard tables, natural order is not particularly useful because,
although the order is often close to insertion order, it is not
guaranteed to be. However, for Capped Collections, natural order is
guaranteed to be the insertion order. This can be very useful.
However for standard collections (non capped collections), what field is used to sort the results?
Is it the _id field or something else?
Edit:
Basically, I guess what I am trying to get at is that if I execute the following search query:
db.collection.find({"x":y}).skip(10000).limit(1000);
At two different points in time: t1 and t2, will I get different result sets:
When there have been no additional writes between t1 & t2?
When there have been new writes between t1 & t2?
There are new indexes that have been added between t1 & t2?
I have run some tests on a temp database and the results I have gotten are the same (Yes) for all the 3 cases - but I wanted to be sure and I am certain that my test cases weren't very thorough.
What is the default sort order when none is specified?
The default internal sort order (or natural order) is an undefined implementation detail. Maintaining order is extra overhead for storage engines and MongoDB's API does not mandate predictability outside of an explicit sort() or the special case of fixed-sized capped collections which have associated usage restrictions. For typical workloads it is desirable for the storage engine to try to reuse available preallocated space and make decisions about how to most efficiently store data on disk and in memory.
Without any query criteria, results will be returned by the storage engine in natural order (aka in the order they are found). Result order may coincide with insertion order but this behaviour is not guaranteed and cannot be relied on (aside from capped collections).
Some examples that may affect storage (natural) order:
WiredTiger uses a different representation of documents on disk versus the in-memory cache, so natural ordering may change based on internal data structures.
The original MMAPv1 storage engine (removed in MongoDB 4.2) allocates record space for documents based on padding rules. If a document outgrows the currently allocated record space, the document location (and natural ordering) will be affected. New documents can also be inserted in storage marked available for reuse due to deleted or moved documents.
Replication uses an idempotent oplog format to apply write operations consistently across replica set members. Each replica set member maintains local data files that can vary in natural order, but will have the same data outcome when oplog updates are applied.
What if an index is used?
If an index is used, documents will be returned in the order they are found (which does necessarily match insertion order or I/O order). If more than one index is used then the order depends internally on which index first identified the document during the de-duplication process.
If you want a predictable sort order you must include an explicit sort() with your query and have unique values for your sort key.
How do capped collections maintain insertion order?
The implementation exception noted for natural order in capped collections is enforced by their special usage restrictions: documents are stored in insertion order but existing document size cannot be increased and documents cannot be explicitly deleted. Ordering is part of the capped collection design that ensures the oldest documents "age out" first.
It is returned in the stored order (order in the file), but it is not guaranteed to be that they are in the inserted order. They are not sorted by the _id field. Sometimes it can be look like it is sorted by the insertion order but it can change in another request. It is not reliable.

Mongodb document insertion order

I have a mongodb collection for tracking user audit data. So essentially this will be many millions of documents.
Audits are tracked by loginID (user) and their activities on items. example: userA modified 'item#13' on date/time.
Case: I need to query with filters based on user and item. That's Simple. This returns many thousands of documents per item. I need to list them by latest date/time (descending order).
Problem: How can I insert new documents to the top of the stack? (like a capped collection) or Is it possible to find records from the bottom of the stack? (reverse order). I do NOT like the idea of find and sorting because when dealing with thousand and millions of documents sorting is a bottleneck.
Any solutions?
Stack: mongodb, node.js, mongoose.
Thanks!
the top of the stack?
you're implying there is a stack, but there isn't - there's a tree, or more precisely, a B-Tree.
I do NOT like the idea of find and sorting
So you want to sort without sorting? That doesn't seem to make much sense. Stacks are essentially in-memory data structures, they don't work well on disks because they require huge contiguous blocks (in fact, huge stacks don't even work well in memory, and growing stacks requires copying the entire data set, that would hardly work
sorting is a bottleneck
It shouldn't be, at least not for data that is stored closely together (data locality). Sorting is an O(m log n) operation, and since the _id field already encodes a timestamp, you already have a field that you can sort on. m is relatively small, so I don't see the problem here. Have you even tried that? With MongoDB 3.0, index intersectioning has become more powerful, you might not even need _id in the compound index.
On my machine, getting the top items from a large collection, filtered by an index takes 1ms ("executionTimeMillis" : 1) if the data is in RAM. The sheer network overhead will be in the same league, even on localhost. I created the data with a simple network creation tool I built and queried it from the mongo console.
I have encountered the same problem. My solution is to create another additional collection which maintain top 10 records. The good point is that you can query quickly. The bad point is you need update additional collection.
I found this which inspired me. I implemented my solution with ruby + mongoid.
My solution:
collection definition
class TrainingTopRecord
include Mongoid::Document
field :training_records, :type=>Array
belongs_to :training
index({training_id: 1}, {unique: true, drop_dups: true})
end
maintain process.
if t.training_top_records == nil
training_top_records = TrainingTopRecord.create! training_id: t.id
else
training_top_records = t.training_top_records
end
training_top_records.training_records = [] if training_top_records.training_records == nil
top_10_records = training_top_records.training_records
top_10_records.push({
'id' => r.id,
'return' => r.return
})
top_10_records.sort_by! {|record| -record['return']}
#limit training_records' size to 10
top_10_records.slice! 10, top_10_records.length - 10
training_top_records.save
MongoDb's ObjectId is structured in a way that has natural ordering.
This means the last inserted item is fetched last.
You can override that by using: db.collectionName.find().sort({ $natural: -1 }) during a fetch.
Filters can then follow.
You will not need to create any additional indices since this works on _id, which is indexed by default.
This is possibly the only efficient way you can achieve what you want.

Resources