How many transactions are fired for retrieving 1,200 entities from Azure Storage Tables, keeping continuation tokens in mind?
I have read that "Windows Azure Tables returns up to a maximum of 1000 entities in a single request and returns a continuation token when more results (the remaining 200 entities) are available." See http://blog.smarx.com/posts/windows-azure-tables-expect-continuation-tokens-seriously.
Because Azure charges on the basis of the number of transactions we perform against the cloud, I just want to know: how many transactions will be executed for a single request that returns, say, 1,200 entities (rows), with a continuation token after the 1,000th entity (row)?
How many transactions will be executed for a single request that returns, say, 1,200 entities (rows), with a continuation token after the 1,000th entity (row)?
It actually depends. As the documentation states, Windows Azure Tables returns up to a maximum of 1,000 entities in a single request. In your case that means the minimum number of transactions would be 2, but the maximum could be as high as 1,200. It all depends on how your data is partitioned and on the load on your storage account. The more partitions you have, the more likely it is that each request returns fewer entities, and therefore the more transactions you need. Server-side request execution time also needs to be taken into consideration: if execution takes longer than the allotted time, the service returns partial data along with a continuation token.
Based on the documentation here (http://msdn.microsoft.com/en-us/library/windowsazure/dd179421.aspx), you can expect a continuation token if one or more of the following conditions are true:
If the number of entities to be returned exceeds 1000.
If the server timeout interval is exceeded.
If the query crosses the partition boundary.
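To make the transaction count concrete, here is a minimal sketch using the Microsoft.WindowsAzure.Storage .NET client (the connection string and table name are placeholders). Each pass through the loop is one request to the Table service, and therefore one billable transaction; in the best case, 1,200 entities come back as two segments and the counter ends at 2.

    using System;
    using System.Collections.Generic;
    using Microsoft.WindowsAzure.Storage;
    using Microsoft.WindowsAzure.Storage.Table;

    // Sketch: retrieve all entities segment by segment and count the
    // billable requests. Connection string and table name are placeholders.
    CloudTable table = CloudStorageAccount
        .Parse("<storage-connection-string>")
        .CreateCloudTableClient()
        .GetTableReference("customers");

    TableQuery<DynamicTableEntity> query = new TableQuery<DynamicTableEntity>();
    TableContinuationToken token = null;
    var entities = new List<DynamicTableEntity>();
    int transactions = 0;

    do
    {
        // One segment = one request to the Table service = one billable transaction.
        TableQuerySegment<DynamicTableEntity> segment =
            table.ExecuteQuerySegmented(query, token);

        entities.AddRange(segment.Results);
        token = segment.ContinuationToken;   // null once the last segment arrives
        transactions++;
    } while (token != null);

    Console.WriteLine($"{entities.Count} entities in {transactions} transactions");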
Related
We have a Cosmos DB container with about 1M records containing information about customers. The partition key for the container is customerId, which holds a unique GUID reference for the customer. I have read the partitioning and scaling documentation, which suggests that our choice of key is appropriate; however, if we want to query this data using a field such as DOB or address, the query will be a cross-partition query and will essentially send the same query to every record in the container before returning.
The query stats in Data Explorer suggest that a query on customer address will return the first 200 documents at a cost of 36.9 RUs, but I was under the impression that this would be far higher given the number of records the query would be sent to. Are these query stats accurate?
It is likely that we will want to extend our app to query on multiple non-partition-key data elements, so are we best off replicating the customer identity and the searchable data element into another container, using the desired searchable data element as the partition key? We could then return the identities of all customers who match the query. This essentially turns the query into an in-partition query and should prevent the additional expenditure?
Our current production database has 4,000 RU/s (max throughput, shared), so there appears to be adequate provision for cross-partition queries. Would I be wasting my time building out a change feed to maintain a partitioned representation of the data to support in-partition queries over cross-partition queries?
To get an accurate estimate of query cost, you need to do the measurement on a container that has a realistic amount of data in it. For example, if I have a container with 5,000 RU/s and 5 GB of data, my cross-partition query will be fairly inexpensive because it only runs on a single physical partition.
If I ran that same query on a container with 100,000 RU/s, I would have more than 10 physical partitions and the query would report a much greater RU charge, because it has to execute across all 10 physical partitions. (Note: one physical partition holds a maximum of 10,000 RU/s or 50 GB of storage.)
It is impossible to say at what amount of RU/s and storage you will begin to see a more realistic number for RU charges, and I also don't know how much throughput or storage you need. If the workload is small, then maybe you only need 10K RU/s and less than 50 GB of storage. It's only when you need to go beyond that that you should first scale out and then measure your query's RU charge.
To get accurate query measurements, you need to have a container with the throughput and amount of data you would expect to have in production.
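As a sketch of how to take that measurement with the v3 .NET SDK (the account, database, container, and field names below are placeholders), sum the RequestCharge reported on each page of the query:

    using System;
    using Microsoft.Azure.Cosmos;

    // Sketch: run a cross-partition query and add up the RU charge reported
    // for every page. Account, database, container and field names are placeholders.
    CosmosClient client = new CosmosClient("https://<account>.documents.azure.com:443/", "<key>");
    Container container = client.GetContainer("mydb", "customers");

    QueryDefinition query = new QueryDefinition(
        "SELECT c.id FROM c WHERE c.address.city = @city")
        .WithParameter("@city", "London");

    double totalRu = 0;
    int docs = 0;

    using FeedIterator<dynamic> iterator = container.GetItemQueryIterator<dynamic>(
        query,
        requestOptions: new QueryRequestOptions { MaxItemCount = 200 });

    while (iterator.HasMoreResults)
    {
        FeedResponse<dynamic> page = await iterator.ReadNextAsync();
        docs += page.Count;
        totalRu += page.RequestCharge;   // RU consumed by this page
    }

    Console.WriteLine($"{docs} documents returned for {totalRu} RU");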
You don't necessarily need to be afraid of cross-partition queries in Cosmos DB. Yes, single-partition queries are faster, but if you need to query "find any customers matching X", then a cross-partition query is naturally required (unless you really want the hassle of duplicating the info elsewhere in an optimized form).
The cross-partition query will not be sent to "each document" as long as you have good indexes within the partitions. Just make sure every query has a predicate on a field that is:
indexed
with good-enough data cardinality
.. and limit the number of returned docs, either through the business model or by forcing it (TOP N). That way your RU charge should be more or less bounded from above.
36 RU per 200 returned docs does not sound too bad as long as the query is not run too many times per second. But if in doubt, test with the predicted data volume and fire off some realistic queries.
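For illustration (the dateOfBirth field is just a placeholder for your schema), a bounded, parameterized query on an indexed field is the shape that keeps the per-execution cost predictable:

    using Microsoft.Azure.Cosmos;

    // Sketch only: bound the result set with TOP and filter on an indexed,
    // reasonably selective field. The dateOfBirth property is a placeholder.
    QueryDefinition boundedQuery = new QueryDefinition(
        "SELECT TOP 200 c.id FROM c WHERE c.dateOfBirth = @dob")
        .WithParameter("@dob", "1980-01-01");

Run it with GetItemQueryIterator and watch RequestCharge on each page to confirm the cost stays roughly flat as the data grows.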
Recently I have been using DynamoDB to build my service. I use provisioned mode for my DynamoDB table.
To test how DynamoDB reacts, I set both the Read Capacity Units and Write Capacity Units to only 1. I then inserted 20 items, which account for about 27 KB, into my table, and called the Scan operation with the ReturnConsumedCapacity parameter. Testing it with Postman, the result shows that the Scan consumes 2.5 capacity units!
Why does DynamoDB not reject my request? I assigned only 1 to both the RCU and WCU! Doesn't that mean it should only be able to read at most 4 KB of data in one second?
(Screenshot of the Postman result showing the consumed capacity.)
Reference: https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/HowItWorks.ReadWriteCapacityMode.html#HowItWorks.ProvisionedThroughput.Manual
One read request unit represents one strongly consistent read request, or two eventually consistent read requests, for an item up to 4 KB in size. Transactional read requests require 2 read request units to perform one read for items up to 4 KB. If you need to read an item that is larger than 4 KB, DynamoDB needs additional read request units. The total number of read request units required depends on the item size, and whether you want an eventually consistent or strongly consistent read. For example, if your item size is 8 KB, you require 2 read request units to sustain one strongly consistent read, 1 read request unit if you choose eventually consistent reads, or 4 read request units for a transactional read request.
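For what it's worth, the same check the question does through Postman can be done with the AWS SDK for .NET; the ConsumedCapacity in the response is where the 2.5 figure comes from. A sketch (the table name is a placeholder):

    using System;
    using Amazon.DynamoDBv2;
    using Amazon.DynamoDBv2.Model;

    // Sketch: ask DynamoDB to report the capacity a Scan consumed.
    // "my-table" is a placeholder; credentials come from the default chain.
    var client = new AmazonDynamoDBClient();

    ScanResponse response = await client.ScanAsync(new ScanRequest
    {
        TableName = "my-table",
        ReturnConsumedCapacity = ReturnConsumedCapacity.TOTAL
    });

    // A Scan is eventually consistent by default, so it consumes 0.5 RCU
    // per 4 KB actually read, rounded up to the next 4 KB boundary.
    Console.WriteLine($"Items: {response.Count}, " +
                      $"consumed: {response.ConsumedCapacity.CapacityUnits} RCU");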
It is sometimes said that when using Azure Tables there is effectively a third key partitioning the data: the table name itself.
I noticed when executing a segmented query that the TableContinuationToken has a NextTableName property. What is the purpose of this property? It could be useful if a query could span multiple tables.
It's used for segmented queries when the full result set can't be returned in a single response.
Quoting https://learn.microsoft.com/en-us/rest/api/storageservices/query-tables:
A query against the Table service may return a maximum of 1,000 tables at one time and may execute for a maximum of five seconds. If the result set contains more than 1,000 tables, if the query did not complete within five seconds, or if the query crosses the partition boundary, the response includes a custom header containing the x-ms-continuation-NextTableName continuation token. The continuation token may be used to construct a subsequent request for the next page of data. For more information about continuation tokens, see Query Timeout and Pagination.
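In other words, NextTableName is what the client fills in when you page through the tables in the account (the Query Tables operation) rather than through entities. A minimal sketch with the Microsoft.WindowsAzure.Storage client (the connection string is a placeholder):

    using System;
    using Microsoft.WindowsAzure.Storage;
    using Microsoft.WindowsAzure.Storage.Table;

    // Sketch: page through the tables in a storage account. For this
    // operation the token carries NextTableName rather than
    // NextPartitionKey/NextRowKey. Connection string is a placeholder.
    CloudTableClient tableClient = CloudStorageAccount
        .Parse("<storage-connection-string>")
        .CreateCloudTableClient();

    TableContinuationToken token = null;
    do
    {
        TableResultSegment segment = tableClient.ListTablesSegmented(token);

        foreach (CloudTable table in segment.Results)
            Console.WriteLine(table.Name);

        token = segment.ContinuationToken;
        if (token != null)
            Console.WriteLine($"Next page starts at table: {token.NextTableName}");
    } while (token != null);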
Each index batch is limited to between 1 and 1,000 documents. When I call it from my local machine or an Azure VM, I get 800 ms to 3,000 ms per 1,000-document batch. If I submit multiple batches asynchronously, the time spent is roughly the same. That means it would take 15 to 20 hours for my ~50M document collection.
Is there a way I can make it faster?
It looks like you are using our Standard S1 search service. There are a lot of things that can impact how fast data can be ingested, but I would expect ingestion into a single-partition search service at a rate of about 700 docs/second for an average index, so I think your numbers are not far off from what I would expect. Please note that these are purely rough estimates and you may see different results based on any number of factors (such as the number of fields, the number of facets, etc.).
It is possible that some of the extra time you are seeing is due to the latency of uploading the content from your local machine to Azure; it would likely be faster if you did this directly from Azure, but if this is just a one-time upload, that is probably not worth the effort.
You can slightly increase the speed of data ingestion by increasing the number of partitions you have, and the S2 search service will also ingest data faster, although both of these come at a cost.
By the way, if you have 50M documents, please make sure that you allocate enough partitions: a single S1 partition can handle 15M documents or 25 GB, so you will definitely need extra partitions for this workload.
Also, as another side note, when you are uploading your content (and especially if you choose to do parallelized uploads), keep an eye on the HTTP responses: if the search service exceeds the resources available, you could get an HTTP 207 (indicating one or more items failed to apply) or a 503 indicating the whole batch failed due to throttling. If throttling occurs, you will want to back off a bit to let the service catch up.
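One way to wire that up with the current Azure.Search.Documents .NET SDK (a sketch only; the endpoint, key, index name, and MyDocument type are placeholders, not your actual setup): per-document failures show up in the batch result, while a throttled batch surfaces as a 503.

    using System;
    using System.Collections.Generic;
    using System.Threading.Tasks;
    using Azure;
    using Azure.Search.Documents;
    using Azure.Search.Documents.Models;

    // Sketch: upload one batch and react to partial failures (the HTTP 207
    // case) and to whole-batch throttling (HTTP 503). Endpoint, key, index
    // name and the MyDocument type are placeholders.
    var searchClient = new SearchClient(
        new Uri("https://<my-service>.search.windows.net"),
        "my-index",
        new AzureKeyCredential("<api-key>"));

    var docs = new List<MyDocument> { new MyDocument("1", "example content") };

    try
    {
        IndexDocumentsResult result = await searchClient.IndexDocumentsAsync(
            IndexDocumentsBatch.Upload(docs),
            new IndexDocumentsOptions { ThrowOnAnyError = false });

        // Partial failure: some documents were rejected; queue them for retry.
        foreach (IndexingResult r in result.Results)
            if (!r.Succeeded)
                Console.WriteLine($"Retry key {r.Key} (status {r.Status}): {r.ErrorMessage}");
    }
    catch (RequestFailedException ex) when (ex.Status == 503)
    {
        // The whole batch was throttled: back off, then resubmit the batch.
        await Task.Delay(TimeSpan.FromSeconds(30));
    }

    // Placeholder document type; your real index schema will differ.
    public record MyDocument(string Id, string Content);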
I think you're reaching the request capacity limits:
https://azure.microsoft.com/en-us/documentation/articles/search-limits-quotas-capacity/
I would try another tier (S1, S2). If you still face the same problem, try getting in touch with the support team.
Another option:
Instead of pushing data, try adding your data to Blob storage, DocumentDB, or SQL Database, and then use the pull approach:
https://azure.microsoft.com/en-us/documentation/articles/search-howto-indexing-azure-blob-storage/
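For illustration, a sketch of what the pull model looks like with the current Azure.Search.Documents.Indexes client (all names and connection strings are placeholders): register the blob container as a data source and let an indexer crawl it into an existing index.

    using System;
    using Azure;
    using Azure.Search.Documents.Indexes;
    using Azure.Search.Documents.Indexes.Models;

    // Sketch of the pull model: a data source pointing at a blob container
    // and an indexer that crawls it into an existing index. All names and
    // connection strings are placeholders.
    var indexerClient = new SearchIndexerClient(
        new Uri("https://<my-service>.search.windows.net"),
        new AzureKeyCredential("<admin-api-key>"));

    var dataSource = new SearchIndexerDataSourceConnection(
        "docs-blob-datasource",
        SearchIndexerDataSourceType.AzureBlob,
        "<storage-connection-string>",
        new SearchIndexerDataContainer("documents"));
    await indexerClient.CreateOrUpdateDataSourceConnectionAsync(dataSource);

    var indexer = new SearchIndexer(
        name: "docs-indexer",
        dataSourceName: dataSource.Name,
        targetIndexName: "my-index");
    await indexerClient.CreateOrUpdateIndexerAsync(indexer);

    await indexerClient.RunIndexerAsync(indexer.Name);   // kick off a crawl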
AFAIK, ExecuteQuery handles segmented queries internally, and every request (i.e., call to storage) for a segment counts as a transaction against storage (I mean a billing transaction: http://www.windowsazure.com/en-us/pricing/details/storage/). Right?
Is there a way to find out how many segments a query was split into? If I run a query and get 5,000 items, I can suppose that my query was split into 5 segments (due to the limit of 1,000 items per segment), but for a complex query there is also the 5-second timeout per call.
I don't believe there's a way to get at that in the API. You could set up an HTTPS proxy to log the requests, if you just want to check it in development.
If it's really important that you know, use the BeginExecuteSegmented and EndExecuteSegmented calls instead. Your code will get a callback for each segment, so you can easily track how many calls there are.