Here is a piece of code that initializes a TableBatchOperation designed to retrieve two rows in a single batch:
TableBatchOperation batch = new TableBatchOperation();
batch.Add(TableOperation.Retrieve("somePartition", "rowKey1"));
batch.Add(TableOperation.Retrieve("somePartition", "rowKey2"));
//second call throws an ArgumentException:
//"A batch transaction with a retrieve operation cannot contain
//any other operation"
As mentioned, an exception is thrown, and retrieving N rows in a single batch does not seem to be supported.
This is a big deal to me, as I need to retrieve about 50 rows per request. This issue is as much about performance as it is about cost. As you may know, Azure Table Storage pricing is based on the number of transactions, which means that 50 retrieve operations are 50 times more expensive than a single batch operation.
Have I missed something?
Side note
I'm using the new Azure Storage api 2.0.
I've noticed that this question has never been raised on the web. Might this constraint have been added recently?
edit
I found a related question here: Very Slow on Azure Table Storage Query on PartitionKey/RowKey List.
It seems that using TableQuery with "or" on row keys results in a full table scan.
There's really a serious issue here...
When designing your Partition Key (PK) and Row Key (RK) scheme in Azure Table Storage (ATS), your primary consideration should be how you're going to retrieve the data. As you've said, each query you run costs money and, more importantly, time, so you need to get all of the data back in one efficient query. The efficient queries that you can run on ATS are of these types:
Exact PK and RK
Exact PK, RK range
PK Range
PK Range, RK range
Based on your comments I'm guessing you've got some data that is similar to this:
PK      RK      Data
Guid1   A       {Data:{...}, RelatedRows: [{PK:"Guid2", RK:"B"}, {PK:"Guid3", RK:"C"}]}
Guid2   B       {Data:{...}, RelatedRows: [{PK:"Guid1", RK:"A"}]}
Guid3   C       {Data:{...}, RelatedRows: [{PK:"Guid1", RK:"A"}]}
and you've retrieved the data at Guid1, and now you need to load Guid2 and Guid3. I'm also presuming that these rows have no common denominator like they're all for the same user. With this in mind I'd create an extra "index table" which could look like this:
PK        RK        Data
Guid1-A   Guid2-B   {Data:{...}}
Guid1-A   Guid3-C   {Data:{...}}
Guid2-B   Guid1-A   {Data:{...}}
Guid3-C   Guid1-A   {Data:{...}}
Here the PK is the combined PK and RK of the parent, and the RK is the combined PK and RK of the child row. You can then run a query which says return all rows with PK="Guid1-A" and you will get all related data with just one call (or two calls overall). The biggest overhead this creates is in your writes: now when you write a row you also have to write rows for each of the related rows, and you have to make sure that the data is kept up to date (this may not be an issue for you if this is a write-once kind of scenario).
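For illustration, here's a minimal sketch of what writing and querying such an index table might look like with the 2.0 .NET client (the entity class, table name and storage account are made up for the example; swap in your own connection string):
// Hypothetical index entity: PK = "parentPK-parentRK", RK = "childPK-childRK".
public class RelatedRowIndexEntity : TableEntity
{
    public RelatedRowIndexEntity() { }
    public RelatedRowIndexEntity(string parentKey, string childKey)
        : base(parentKey, childKey) { }
    public string Data { get; set; }
}

// Placeholder account; use CloudStorageAccount.Parse(yourConnectionString) in real code.
var account = CloudStorageAccount.DevelopmentStorageAccount;
var indexTable = account.CreateCloudTableClient().GetTableReference("relatedRowIndex");
indexTable.CreateIfNotExists();

// Write: whenever you write a row, also write one index row per related row.
indexTable.Execute(TableOperation.InsertOrReplace(
    new RelatedRowIndexEntity("Guid1-A", "Guid2-B") { Data = "{...}" }));
indexTable.Execute(TableOperation.InsertOrReplace(
    new RelatedRowIndexEntity("Guid1-A", "Guid3-C") { Data = "{...}" }));

// Read: a single query returns every row related to Guid1-A.
var relatedQuery = new TableQuery<RelatedRowIndexEntity>().Where(
    TableQuery.GenerateFilterCondition("PartitionKey", QueryComparisons.Equal, "Guid1-A"));
var related = indexTable.ExecuteQuery(relatedQuery).ToList();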
If any of my assumptions are wrong or if you have some example data I can update this answer with more relevant examples.
Try something like this:
TableQuery<DynamicTableEntity> query = new TableQuery<DynamicTableEntity>()
    .Where(TableQuery.CombineFilters(
        TableQuery.GenerateFilterCondition("PartitionKey", QueryComparisons.Equal, "partition1"),
        TableOperators.And,
        TableQuery.CombineFilters(
            TableQuery.GenerateFilterCondition("RowKey", QueryComparisons.Equal, "row1"),
            TableOperators.Or,
            TableQuery.GenerateFilterCondition("RowKey", QueryComparisons.Equal, "row2"))));
I know that this is an old question, but as Azure STILL does not support secondary indexes, it seems it will be relevant for some time.
I was hitting the same type of problem. In my scenario, I needed to look up hundreds of items within the same partition, where there are millions of rows (imagine a GUID as the row key). I tested a couple of options for looking up 10,000 rows:
(PK && RK)
(PK && RK1) || (PK & RK2) || ...
PK && (RK1 || RK2 || ... )
I was using the Async API, with a maximum 10 degrees of parallelism (max 10 outstanding requests). I also tested a couple of different batch sizes (10 rows, 50, 100).
Test Batch Size API calls Elapsed (sec)
(PK && RK) 1 10000 95.76
(PK && RK1) || (PK && RK2) 10 1000 25.94
(PK && RK1) || (PK && RK2) 50 200 18.35
(PK && RK1) || (PK && RK2) 100 100 17.38
PK && (RK1 || RK2 || … ) 10 1000 24.55
PK && (RK1 || RK2 || … ) 50 200 14.90
PK && (RK1 || RK2 || … ) 100 100 13.43
NB: These are all within the same partition - just multiple rowkeys.
I would have been happy to just reduce the number of API calls. But as an added benefit, the elapsed time is also significantly less, saving on compute costs (at least on my end!).
Not too surprisingly, the batches of 100 rows delivered the best elapsed performance. There are obviously other performance considerations, especially network usage (#1 hardly uses the network at all, for example, whereas the others push it much harder).
EDIT
Be careful when querying for many row keys. There is (of course) a URL length limitation on the query. If you exceed the length, the query will still succeed because the service cannot tell that the URL was truncated. In our case, we limited the combined query length to about 2500 characters (URL encoded!).
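For what it's worth, here is a rough sketch (not production code) of how the PK && (RK1 || RK2 || ...) filters can be built in chunks while capping the filter length; the 2500-character cap is just the value that happened to work for us, not an official limit, and the method needs the 2.0 table namespaces and System.Collections.Generic:
// Splits a list of row keys into filters of the form
// (PartitionKey eq pk) and ((RowKey eq rk1) or (RowKey eq rk2) or ...),
// keeping each filter under roughly maxFilterLength characters.
static IEnumerable<string> BuildRowKeyFilters(
    string partitionKey, IEnumerable<string> rowKeys, int maxFilterLength = 2500)
{
    string pkFilter = TableQuery.GenerateFilterCondition(
        "PartitionKey", QueryComparisons.Equal, partitionKey);
    string current = null;
    foreach (var rk in rowKeys)
    {
        string rkFilter = TableQuery.GenerateFilterCondition(
            "RowKey", QueryComparisons.Equal, rk);
        string candidate = current == null
            ? rkFilter
            : TableQuery.CombineFilters(current, TableOperators.Or, rkFilter);
        if (current != null && pkFilter.Length + candidate.Length > maxFilterLength)
        {
            // Current chunk is full; emit it and start a new chunk with this row key.
            yield return TableQuery.CombineFilters(pkFilter, TableOperators.And, current);
            current = rkFilter;
        }
        else
        {
            current = candidate;
        }
    }
    if (current != null)
        yield return TableQuery.CombineFilters(pkFilter, TableOperators.And, current);
}
Each resulting filter string can then be wrapped in its own TableQuery and the queries run in parallel, as described above.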
Batch "Get" operations are not supported by Azure Table Storage. Supported operations are: Add, Delete, Update, and Merge. You would need to execute queries as separate requests. For faster processing, you may want to execute these queries in parallel.
Your best bet is to create a Linq/OData select query... that will fetch what you're looking for.
For better performance you should make one query per partition and run those queries simultaneously.
I haven't tested this personally, but think it would work.
How many entities do you have per partition? With a single query operation you can pull back up to 1,000 records. Then you could do your row key filtering on the in-memory set and only pay for one operation.
Another option is to do a Row Key range query to retrieve part of a partition in one operation. Essentially you specify an upper and lower bound for the row keys to return, rather than an entire partition.
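Something like this (an untested sketch with the 2.0 client; the partition and key values are placeholders) is what that row key range query could look like:
// All entities in "somePartition" with "rowKey10" <= RowKey < "rowKey20".
var rangeFilter = TableQuery.CombineFilters(
    TableQuery.GenerateFilterCondition("PartitionKey", QueryComparisons.Equal, "somePartition"),
    TableOperators.And,
    TableQuery.CombineFilters(
        TableQuery.GenerateFilterCondition("RowKey", QueryComparisons.GreaterThanOrEqual, "rowKey10"),
        TableOperators.And,
        TableQuery.GenerateFilterCondition("RowKey", QueryComparisons.LessThan, "rowKey20")));
var rangeQuery = new TableQuery<DynamicTableEntity>().Where(rangeFilter);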
Okay, so the best-case scenario for a batch retrieve operation is a table query. A less optimal situation would require parallel retrieve operations.
Depending on your PK/RK design, you can take a list of (PK, RK) pairs and figure out the smallest/most efficient set of retrieve/query operations you need to perform. You then fetch all of these in parallel and sort out the exact answer client side.
IMAO, it was a design miss by Microsoft to add the Retrieve method to the TableBatchOperation class because it conveys semantics not supported by the table storage API.
Right now, I'm not in the mood to write something super efficient, so I'm just gonna leave this super simple solution here.
var retrieveTasks = new List<Task<TableResult>>();
foreach (var item in list)
{
retrieveTasks.Add(table.ExecuteAsync(TableOperation.Retrieve(item.pk, item.rk)));
}
var retrieveResults = new List<TableResult>();
foreach (var retrieveTask in retrieveTasks)
{
retrieveResults.Add(await retrieveTask);
}
This asynchronous block of code will fetch the entities in list in parallel and store the results in retrieveResults, preserving the order. If you have contiguous ranges of entities that you need to fetch, you can improve this by using a range query.
There's a sweet spot (which you'll have to find by testing) where it's probably faster/cheaper to query more entities than you need for a specific batch retrieve and then discard the results you don't need.
If you have a small partition you might benefit from a query like so:
where pk=partition1 and (rk=rk1 or rk=rk2 or rk=rk3)
If the lexicographic (i.e. sort order) distance between your keys is great, you might want to fetch them in parallel. For example, if you store the alphabet in table storage, fetching a and z, which are far apart, is best done with parallel retrieve operations, while fetching a, b and c, which are close together, is best done with a query. Fetching a, b, c and z would benefit from a hybrid approach.
If you know all this up front, you can compute the best thing to do given a set of PKs and RKs. The more you know about how the underlying data is sorted, the better your results will be. I'd advise against a one-size-fits-all approach here; instead, try to apply what you learn from these different query patterns to solve your problem.
Related
I have a use case in which I utilize ScyllaDB to limit users' actions in the past 24h. Let's say the user is only allowed to make an order 3 times in the last 24h. I am using ScyllaDB's ttl and making a count on the number of records in the table to achieve this. I am also using https://github.com/spaolacci/murmur3 to get the hash for the partition key.
However, I would like to know what the most efficient way to query the table is. I have a few queries whose behavior I'd like to understand better and compare (please correct me if any of my statements are wrong):
using count()
count() will perform a full-scan query, meaning that it may read more records from the table than necessary.
SELECT COUNT(1) FROM orders WHERE hash_id=? AND user_id=?;
using limit
limit only limits the number of records returned to the client, meaning it will still query all records that match its predicates but cap the ones returned.
SELECT user_id FROM orders WHERE hash_id=? AND user_id=? LIMIT ?;
using paging
I'm a bit new to this, but if I read the docs correctly, it should only query up until it receives the first N records, without having to query the whole table. So if I set the page size to the number of records I want to fetch and only query the first page, would it work correctly? And will it have a consistent result?
docs: https://java-driver.docs.scylladb.com/stable/manual/core/paging/index.html
My query is still using limit, but I'm utilizing the driver (https://github.com/gocql/gocql) to achieve this:
iter := conn.Query(
    "SELECT user_id FROM orders WHERE hash_id=? AND user_id=? LIMIT ?",
    hashID,
    userID,
    3,
).PageSize(3).PageState(nil).Iter()
Please let me know if my analysis is correct and which method would be best to choose.
Your client should always use paging - otherwise you risk adding pressure to the query coordinator, which may introduce latency and memory fragmentation. If you use the Scylla Monitoring stack (and you should if you don't!), refer to the CQL Optimization dashboard and - more specifically - to the Paged Queries panel.
Now, to your question. It seems that your example is a bit minimalist for what you actually want to achieve, and even if it isn't, we have to consider such a set-up at scale. For example: there may be one tenant allowed to place 3 orders within a day, but another tenant allowed to place 1 million orders within a week.
If the above assumption is correct - and given the options at hand - you are better off using LIMIT with paging. The reason is that there are some particular problems with the description you've given:
First, you want to retrieve N records within a particular time-frame, but your queries don't specify such a time-frame.
Second, either COUNT or LIMIT will initiate a partition scan, and it is not clear how a hash_id + user_id combination alone can determine the number of records within a time-frame.
Of course, I may be wrong, but I'd like to suggest some different approaches which may or may not be applicable to you and your use case.
Consider making a timestamp component part of the clustering key. This will allow you to avoid full partition scans, with queries such as:
SELECT something FROM orders WHERE hash_id=? AND user_id=? AND ts >= ? AND ts < ?;
If the above is not applicable, then perhaps a counter table would suit your needs. You could simply increment a counter after an order is placed and then query the counter table, as in:
SELECT count FROM counter_table WHERE hash_id=? AND user_id=? AND date=?;
I hope that helps!
I have a few points I want to add to what Felipe wrote already:
First, you don't need to hash the partition key yourself. You can use anything you want for the partition key, even consecutive numbers, the partition key doesn't need to be random-looking. Scylla will internally hash the partition key on its own to improve the load balancing. You don't need to know or care which hashing algorithm ScyllaDB uses, but interestingly, it's a variant of murmur3 too (which is not identical to the one you used - it's a modified algorithm originally picked by the Cassandra developers).
Second, you should know - and decide whether you care - that the limit you are trying to enforce is not a hard limit in the face of concurrent operations: imagine that the given partition already has two records, and now two concurrent record-addition requests come in. Both can check that there are just two records, decide it's fine to add a third - and when both add their record, you end up with four records. You'll need to decide whether it's fine for you that a lucky user can get in 4 requests in a day, or whether it's a disaster. Note that theoretically you can get even more than 4: if the user manages to send N requests at exactly the same time, they may be able to get 2+N records into the database (though in the usual case they won't manage to sneak in many superfluous records). If you want 3 to be a hard limit, you'll probably need to change your solution - perhaps to one based on LWT - and not use TTL.
Third, I want to note that there is no important performance difference between COUNT and LIMIT when you know a priori that there will only be up to 3 (or, as explained above, 4 or some other similarly small number of) results. If the SELECT can only yield three or fewer results, and can never yield a thousand, it doesn't really matter whether you just retrieve them or count them - do whichever is more convenient for you. In any case, I think paging is not a good solution for your need. For such short results you can just use the default page size and you'll never reach it anyway; moreover, paging hints to the server that you will likely continue reading on the next page - so it caches the buffers it needs to do that - while in this case you know you'll never continue after the first three results. So in short, don't use any special paging setup here - just use the default page size (which is 1MB) and it will never be reached anyway.
I would like to optimise my Azure Cosmos DB SQL API queries for consumed RUs (in part in order to reduce the frequency of 429 responses).
Specifically I thought that including the partition key in WHERE clauses would decrease consumed RUs (e.g. I read https://learn.microsoft.com/en-us/azure/cosmos-db/optimize-cost-queries and https://learn.microsoft.com/en-us/azure/cosmos-db/partitioning-overview which made me think this).
However, when I run
SELECT TOP 1 *
FROM c
WHERE c.Field = "some value"
AND c.PartitionKeyField = "1234"
ORDER BY c.TimeStampField DESC
It consumes 6 RUs.
Whereas without the partition key, e.g.
SELECT TOP 1 *
FROM c
WHERE c.Field = "some value"
ORDER BY c.TimeStampField DESC
It consumes 5.76 RUs - i.e. cheaper.
(whilst there is some variation in the above numbers depending on the exact document selected, the second query is always cheaper, and I have tested against both the smallest and largest partitions.)
My database currently has around 400,000 documents and 29 partitions (both are expected to grow). Largest partition has around 150,000 documents (unlikely to grow further than this).
The above results indicate to me that I should not pass the partition key in the WHERE clause for this query. Could someone please explain why this is, as from the documentation I thought the opposite should be true?
There might be a few reasons, and it depends on which index the query engine decides to use, or whether there is an index at all.
First thing I can say is that there is likely not much data in this container because queries without a partition key get progressively more expensive the larger the container, especially when they span physical partitions.
The first query could be more expensive if there is no index on the partition key and the engine did a scan on it after filtering by c.Field.
It could also be more expensive depending on whether there is a composite index and whether it used it.
Really though, you cannot take query metrics from small containers and extrapolate. The only way to measure is to put enough data into the container. Also, the amounts here are so small that they're not worth optimizing over. I would put the amount of data into this container that you expect to have in production and re-run your queries.
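As a rough sketch of how you might re-measure once the container has realistic data (this assumes the v3 .NET SDK inside an async method; the connection string, database, container and property names are placeholders), the actual RU charge is reported on every query response:
CosmosClient client = new CosmosClient("<connection-string>");          // placeholder
Container container = client.GetContainer("myDatabase", "myContainer"); // placeholder names

QueryDefinition queryDef = new QueryDefinition(
    "SELECT TOP 1 * FROM c WHERE c.Field = @v AND c.PartitionKeyField = @pk ORDER BY c.TimeStampField DESC")
    .WithParameter("@v", "some value")
    .WithParameter("@pk", "1234");

double totalCharge = 0;
FeedIterator<dynamic> iterator = container.GetItemQueryIterator<dynamic>(queryDef);
while (iterator.HasMoreResults)
{
    FeedResponse<dynamic> page = await iterator.ReadNextAsync();
    totalCharge += page.RequestCharge; // RUs actually billed for this page
}
Console.WriteLine("Total RU charge: {0}", totalCharge);
Running both variants of the query this way against production-sized data gives a much more meaningful comparison than the 5.76 vs 6 RU figures above.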
Lastly, with regards to measuring and optimizing, the Pareto principle applies. You'll go nuts chasing down every optimization. Find your high-concurrency queries and focus on those.
Hope this is helpful.
I am trying to understand if Google Datastore can fit my needs.
I have a lot of entities and I need to perform a sum on a certain property.
Basically, I would like to be able to do select sum(value1) from entity1 where [some filter], where entity1 is an entity that keeps track of some sort of data in its field/property value1.
I know these kinds of functions are not available in Datastore since it's not a relational database, so the most immediate solution would be to perform a select and then calculate the sum on the result set in the application. So I would have something like (using Node.js, but I don't care about the language):
query = client.query(kind='Task')
query.add_filter('done', '=', False)
results = list(query.fetch())

total = 0
for v in results:
    total += v['value']
The problem is that I have thousands of records, so results may be like 300 000 records.
What's the best way to do this without getting a bottleneck?
You can store a total sum in a separate entity. No matter how frequently users request it, you can return it within milliseconds.
When an entity which is included in the total changes, you change the total entity. For example, if a property changes from 300 to 500, you increase the total by 200. This way your total is always accurate.
If updates are very frequent, you can implement these updates to the total as tasks (Task Queue API) in order to prevent race conditions. These tasks will be executed very quickly, so your users will get a very "fresh" total every time they ask.
Maybe the best way to maintain a count on Google Datastore is the official solution: sharded counters.
I have a situation where I need to create a SAS token based on a range of PartitionKeys and RowKeys both.
To be more precise, my PK is based on the ticks of a timestamp (there is a partition for every 10-minute range). My RK is based on some string.
I'm trying to call storage from a browser and get data for a range of PKs (based on some time range) and, within those PKs, for a range of RKs, i.e.:
PK > 100000000 && PK < 200000000 && RK > "aaa" && RK <"mmm"
When I create the token, the response from storage returns the correct partitions, but entities for all RKs.
var sas = table.GetSharedAccessSignature(new SharedAccessTablePolicy
{
Permissions = SharedAccessTablePermissions.Query,
SharedAccessExpiryTime = DateTime.UtcNow.Add(period)
}, null, startPk, startRk, endPk, endRk);
Any ideas how to make the call follow only the provided RK range, without me having to filter out unnecessary entities on the client?
@GauravMantri pointed me to a helpful article: http://blogs.msdn.com/b/windowsazurestorage/archive/2012/06/12/introducing-table-sas-shared-access-signature-queue-sas-and-update-to-blob-sas.aspx
What I was trying to do is not supported. The PK/RK range defines a single continuous range from the start PK/RK to the end PK/RK, rather than acting as a filter query as I had thought.
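For anyone landing here with the .NET client, one (untested) workaround sketch is to keep the SAS scoped to the PK range and put the RowKey predicate into the query filter itself, so the narrowing happens server-side rather than on the client; the key values below are the example ones from the question, and sas and table are the variables from the snippet above:
// Build a client that authenticates with the SAS token only.
var sasCreds = new StorageCredentials(sas);
var sasTable = new CloudTable(table.Uri, sasCreds);

// Express the RK range (and the PK range) as a query filter.
string filter = TableQuery.CombineFilters(
    TableQuery.CombineFilters(
        TableQuery.GenerateFilterCondition("PartitionKey", QueryComparisons.GreaterThan, "100000000"),
        TableOperators.And,
        TableQuery.GenerateFilterCondition("PartitionKey", QueryComparisons.LessThan, "200000000")),
    TableOperators.And,
    TableQuery.CombineFilters(
        TableQuery.GenerateFilterCondition("RowKey", QueryComparisons.GreaterThan, "aaa"),
        TableOperators.And,
        TableQuery.GenerateFilterCondition("RowKey", QueryComparisons.LessThan, "mmm")));

var results = sasTable.ExecuteQuery(new TableQuery<DynamicTableEntity>().Where(filter)).ToList();
Note that string comparisons on PartitionKey are lexicographic, so tick-based keys need to be zero-padded to a fixed length for this to behave as a numeric range.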
Note that your partition key pattern may result in bad performance because of the "Append Only" pattern that you are setting up: https://azure.microsoft.com/en-us/documentation/articles/storage-performance-checklist/#subheading28
Azure Storage learns your usage pattern and adjusts the partition distribution adaptively according to the load. So if your load is spread across a number of partition keys, it can split those partitions onto different servers internally to balance your load. However, if your load is all on one partition, and that partition changes periodically (like it does with the append-only pattern), then the adaptive load-balancing logic becomes ineffective. To avoid this, you should avoid using dates or date-times as your partition key, if your query patterns allow it.
I have a bunch of primary keys - tens of thousands - and I want to retrieve their associated table entities. All row keys are empty strings. The best way I know of doing this is to query them one by one asynchronously. It seems fast, but ideally I would like to bunch a few entities together in a single transaction. Playing with the new Storage Client, I have the following code failing:
var sample = GetSampleIds(); //10000 pks
var account = GetStorageAccount();
var tableClient = account.CreateCloudTableClient();
var table = tableClient.GetTableReference("myTable");
//I'm trying to get first and second pk in a single request.
var keyA = sample[0];
var keyB = sample[1];
var filterA = TableQuery.GenerateFilterCondition("PartitionKey", QueryComparisons.Equal, keyA);
var filterB = TableQuery.GenerateFilterCondition("PartitionKey", QueryComparisons.Equal, keyB);
//filterAB = "(PartitionKey eq 'keyA') or (PartitionKey eq 'keyB')"
var filterAB = TableQuery.CombineFilters(filterA, TableOperators.Or, filterB);
var query = new TableQuery<TweetEntity>().Where(filterAB);
//Does something weird. I thought it might be fetching a range at one point.
//Whatever it does it doesn't return. Expected the following line to get an array of 2 items.
table.ExecuteQuery(query).ToArray();
// replacing filterAB in query with either filterA or filterB works as expected
Examples always show CombineFilters working on PK and then RK, but this is of no use to me. I'm assuming that this is not possible.
Question
Is it possible to bundle together entities by PK? I know the maximum filter length is 15, but even 2 is a potential improvement when you are fetching 10,000 items. Also, where is the manual? I can't find proper documentation anywhere. For example, the MSDN page for CombineFilters is a basic shell wrapping less information than IntelliSense provides.
tl;dr: sounds like you need to rethink your partitioning strategy. Unique, non-sequential IDs are not good PKs when you commonly have to query or work on many. More:
Partition Keys are not meant to be 'primary' keys really. They are more thought of as grouped, closely related sets of data that you want to work with. You can group by id, date, etc. PKs are used to scale the system - in theory, you could have 1 partition server per PK working on your data.
To your question: you won't get very good performance doing what you are doing. In fact, OR queries are non-optimized and will require a full table scan (bad). So, instead of doing PK = "foo" OR PK = "bar", you really should be doing 2 queries (in parallel) as that will get you much better performance.
Back to your core issue: if you are using a unique identifier for a particular entity and describing that as a PK, then it also means you are not able to work on more than 1 entity at a time. In order to work on entit(ies), you really need a common partition key. Can you think of a better one that describes your entities? Does date/time work? Some other common attribute? Those tend to be good partition keys. The only other thing you can do is what is called partition ranging - where your queries tend to be ranged on partition keys. An example of this is date-time partition keys. You can use file ticks to describe your partition and end up with sequential tick values as PKs. Your query can then use > and < to specify a range (no OR). Those can be more optimized, but you will potentially still get a ton of continuation tokens.
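As a rough sketch of the "2 queries in parallel" suggestion (reusing the table, TweetEntity, keyA and keyB names from the question, and assuming .NET 4.5's Task.Run is available):
var queryA = new TableQuery<TweetEntity>().Where(
    TableQuery.GenerateFilterCondition("PartitionKey", QueryComparisons.Equal, keyA));
var queryB = new TableQuery<TweetEntity>().Where(
    TableQuery.GenerateFilterCondition("PartitionKey", QueryComparisons.Equal, keyB));

// Run both partition queries concurrently and merge the results client side.
var taskA = Task.Run(() => table.ExecuteQuery(queryA).ToArray());
var taskB = Task.Run(() => table.ExecuteQuery(queryB).ToArray());
Task.WaitAll(taskA, taskB);
var entities = taskA.Result.Concat(taskB.Result).ToArray();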
As dunnry has mentioned in his reply, the problem with this approach is that OR queries are horribly slow. I got my problem to work without the storage client (at this point, I'm not sure what's wrong with it; let's say it's a bug, maybe), but getting the 2 entities separately without the OR query turns out to be much(!) faster than getting them with the OR query.