Unreliable performance in Azure Table Storage when joining point queries together - azure

I am having a consistent problem with the performance of Azure Table Storage. I'm querying a table which holds user accounts. The table stores the userId in both the PartitionKey and RowKey so I can easily make point queries.
My issue is because in several cases I need to retrieve multiple users in a single query. To achieve that I have a class which builds filter strings for me. The manner which this works is not related to the problem, however this is an example of the output:
(PartitionKey eq '00540de6-dd2b-469f-8730-e7800e06ccc0' and RowKey eq '00540de6-dd2b-469f-8730-e7800e06ccc0') or
(PartitionKey eq '02aa11b7-974a-4ee9-9a8e-5fc09970bb99' and RowKey eq '02aa11b7-974a-4ee9-9a8e-5fc09970bb99') or
(PartitionKey eq '040aec50-ebcd-4e5d-8f58-82aea616bd82' and RowKey eq '040aec50-ebcd-4e5d-8f58-82aea616bd82') or
// up to 22 more (25 total)
Upon first execution of the query it takes a long time to execute, between 2-5 seconds, and is missing data which is leading to errors. When run a second time the query takes between 0.2 and 0.5 seconds to complete and has all data contained within it.
Note that I also tried it just supplying just the PartitionKey, however it made no difference. I had assumed that a point query would perform better.
From this presentation of the bug I can only presume it's caused by the data being 'cold' when first requested and then pulled from a 'hot' cache upon successive requests.
If this is the case, how can I change the filter string to improve performance? Alternatively, how can I change the timeout of the table storage query to give it more time to complete? Is it possible to increase the scaling of my table storage?

Please don't use point query strings concatenated with 'or', since Azure Storage Table can't treat it as multiple point queries. Instead, Azure Table will treat it as a full table scan, which is terrible in performance. You should execute 25 point queries respectively to improve performance.

Related

Cosmos DB partition key and query design for sequential access

We would like to store a set of documents in Cosmos DB with a primary key of EventId. These records are evenly distributed across a number of customers. Clients need to access the latest records for a subset of customers as new documents are added. The documents are immutable, and need to be stored indefinitely.
How should we design our partition key and queries to avoid clients all hitting the same partitions and/or high RU usage?
If we use just CustomerId as the partition key, we would eventually run over the 10GB limit for a logical partition, and if we use EventId, then querying becomes inefficient (would result in a cross-partition query, and high RU usage, which we'd like to avoid).
Another idea would be to group documents into blocks. i.e. PartitionKey = int(EventId / PartitionSize). This would result in all clients hitting the latest partition(s), which presumably would result in poor performance and throttling.
If we use a combined PartitionKey of CustomerId and int(EventId / PartitionSize), then it's not clear to me how we would avoid a cross-partition query to retrieve the correct set of documents.
Edit:
Clarification of a couple of points:
Clients will access the events by specifying a list of CustomerId's, the last EventId they received, and a maximum number of records to retrieve.
For this reason, the use of EventId alone won't perform well, as it will result in a cross partition query (i.e. WHERE EventId > LastEventId).
The system will probably be writing on the order of 1GB a day, in 15 minute increments.
It's hard to know what the read volume will be, but I'd guess probably moderate, with maybe a few thousand clients polling the API at regular intervals.
So first thing first, logical partitions size limit has now been increased to 20GB, please see here.
You can use EventID as a partition as well, as you have limit of logical partition's size in GB but you have no limit on amount of logical partitions. So using EventID is fine, you will get a point to point read which is very fast if you query using the EventID. Now you mention using this way you will have to do cross-partition queries, can you explain how?
Few things to keep in mind though, Cosmos DB is not really meant for storing this kind of Log based data as it stores everything in SSDs so please calculate how much is your 1 document size and how many in a second would you have to store then how much in a day to how much in a month. You can use TTL to delete from Cosmos when done though and for long term storage store it in Azure BLOB Storage and for fast retrievals use Azure Search to query the data in BLOB by using CustomerID and EventID in your search query.
How should we design our partition key and queries to avoid clients all hitting the same partitions and/or high RU usage?
I faced a similar issue some time back and a PartitionKey with customerId + datekey e.g. cust1_20200920 worked well for me.
I created the date key as 20200920 (YYYYMMDD), but you can choose to ignore the date part or even the month (cust1_202009 /cust1_2020), based on your query requirement.
Also, IMO, if there are multiple known PartitionKeys at a query time it's kind of a good thing. For example, if you keep YYYYMM as the PartitionKey and want to get data for 4 months, you can run 4 queries in parallel and combine the data. Which is faster if you have many clients and these Partition Keys are distributed among multiple physical partitions.
On a separate note, Cosmos Db has recently introduced an analytical store for the transactional data which can be useful for your use case.
More about it here - https://learn.microsoft.com/en-us/azure/cosmos-db/analytical-store-introduction
One approach is using multiple Cosmos containers as "hot/cold" tiers with different partitioning. We could use two containers:
Recent: all writes and all queries for recent items go here. Partitioned by CustomerId.
Archive: all items are copied here for long term storage and access. Partitioned by CustomerId + timespan (e.g. partition per calendar month)
The Recent container would provide single partition queries by customer. Data growth per partition would be limited either by setting reasonable TTL during creation, or using a separate maintenance job (perhaps Azure Function on timer) to delete items when they are no longer candidates for recent-item queries.
A Change Feed processor, implemented by an Azure Function or otherwise, would trigger on each creation in Recent and make a copy into Archive. This copy would have partition key combining the customer ID and date range as appropriate to limit the partition size.
This scheme should provide efficient recent-item queries from Recent and safe long-term storage in Archive, with reasonable Archive query efficiency given a desired date range. The main downside is two writes for each item (one for each container) -- but that's the tradeoff for efficient polling. Whether this tradeoff is worthwhile is probably best determined by simulating the load and observing performance.

Azure Table querying

I'm new to Azure Tables and NOSQL in general.
I currently have a table in Azure Tables with about 60,000 rows. It contains job vacancies with columns such as ID (RowKey), Employer (PartitionKey), title, region, category and other columns.
When someone on the website performs a search, the request is sent to an external matching system. The matching system will then return applicable jobs. The matching engine returns for each job the ID and the employer.
A search could contain 10 distinct employers covering 300 jobs. If I was using a SQL Server database I would simply create an "IN" clause on the ID which would be the primary key in order to select those jobs from my table.
With Azure Table's, the recommended approach is to use PartitionKey's (which I have created the Employer on) so my plan was to issue a query for each employer to the table and then union the results, and then finally sort before passing to the UI.
My problem here is that I can't run a query such as "WHERE Employer = "MyEmployer" AND ID IN (1,2,3,4,5,6).
What would be the best way to achieve this? I don't want all jobs from an Employer returning as I might not use them all. I want certain jobs from a certain employer based on the ID's I get back.
This is something that I thought should be simple, but so far proving tricky!
My problem here is that I can't run a query such as "WHERE Employer =
"MyEmployer" AND ID IN (1,2,3,4,5,6).
You certainly can do IN queries though the syntax will be a bit different. This is what you will use assuming you're using REST API and writing query using ODATA syntax:
(PartitionKey eq 'MyEmployer' and (RowKey eq '1' or RowKey eq '2' or RowKey eq '3' or RowKey eq '4' or RowKey eq '5' or RowKey eq '6')
A few things you would need to consider:
This will result in full Partition scan. This may not be an issue if the number of entities in a partition are less but could become an issue as the number of entities grow.
Since the query is passed as query string in the request to table service, there's a limit of number of or clauses you can have in your query. If I am not mistaken, max length of a query string is 1024 characters.
You may want to consider firing many single queries in parallel where each query will be targeting individual entity e.g. PartitionKey eq 'MyEmployer' and RowKey eq '1'.
Lastly, you may want to read this blog post from Storage Team: http://blogs.msdn.com/b/windowsazurestorage/archive/2010/11/06/how-to-get-most-out-of-windows-azure-tables.aspx. This will provide you with very good insights.

Setting RowKey in Azure Table Storage

I am storing some logs in Azure Table Storage. I've identified the PartitionKey I should use. However, I'm having trouble determining what I should use for the RowKey. If I was using Sql Server, I would use an auto-incrementing integer. From what I can tell, having an auto-generated RowKey is not an option with Azure Table Storage. I'm fine using a GUID, however, everyone seems to warn against using a GUID. Yet, I'm not sure what I should be using.
Can anyone provide me a pointer for what I should use as the RowKey for storing log data? I've seen the following syntax (RowKey: {'_': '1'}), as shown below, but can't find out what it means:
var task = {
PartitionKey: {'_':'hometasks'},
RowKey: {'_': '1'}
};
Thanks!
There are many approaches you can take. One such approach would be to store date/time value in ticks as RowKey. This would help you in fetching logs data for a particular time range. Just remember that since RowKey is of String data type, you may want to pre-pad it with zeros so that all values are of same length. For example,
DateTime.UtcNow.Ticks.ToString("d20")
With this, you could take 2 approaches:
Store them in chronological order as shown in example above.
Store them in reverse chronological order. The advantage of this approach is that the latest entries will always be added on top of the table. So you could just query the table on PartitionKey and get top 'x' rows and they will be latest. You will do something like:
(DateTime.MaxValue.Ticks - DateTime.UtcNow.Ticks).ToString("d20")
You also mentioned in your comment that I expect the data sets to be quite large.. I hope you are not using a single PartitionKey. Because if number of records are quite large and all of them are put in same partition, the performance might be impacted.

Design of Partitioning for Azure Table Storage

I have some software which collects data over a large period of time, approx 200 readings per second. It uses an SQL database for this. I am looking to use Azure to move a lot of my old "archived" data to.
The software uses a multi-tenant type architecture, so I am planning to use one Azure Table per Tenant. Each tenant is perhaps monitoring 10-20 different metrics, so I am planning to use the Metric ID (int) as the Partition Key.
Since each metric will only have one reading per minute (max), I am planning to use DateTime.Ticks.ToString("d19") as my RowKey.
I am lacking a little understanding as to how this will scale however; so was hoping somebody might be able to clear this up:
For performance Azure will/might split my table by partitionkey in order to keep things nice and quick. This would result in one partition per metric in this case.
However, my rowkey could potentially represent data over approx 5 years, so I estimate approx 2.5 million rows.
Is Azure clever enough to then split based on rowkey as well, or am I designing in a future bottleneck? I know normally not to prematurely optimise, but with something like Azure that doesn't seem as sensible as normal!
Looking for an Azure expert to let me know if I am on the right line or whether I should be partitioning my data into more tables too.
Few comments:
Apart from storing the data, you may also want to look into how you would want to retrieve the data as that may change your design considerably. Some of the questions you might want to ask yourself:
When I retrieve the data, will I always be retrieving the data for a particular metric and for a date/time range?
Or I need to retrieve the data for all metrics for a particular date/time range? If this is the case then you're looking at full table scan. Obviously you could avoid this by doing multiple queries (one query / PartitionKey)
Do I need to see the most latest results first or I don't really care. If it's former, then your RowKey strategy should be something like (DateTime.MaxValue.Ticks - DateTime.UtcNow.Ticks).ToString("d19").
Also since PartitionKey is a string value, you may want to convert int value to a string value with some "0" prepadding so that all your ids appear in order otherwise you'll get 1, 10, 11, .., 19, 2, ...etc.
To the best of my knowledge, Windows Azure partitions the data based on PartitionKey only and not the RowKey. Within a Partition, RowKey serves as unique key. Windows Azure will try and keep data with the same PartitionKey in the same node but since each node is a physical device (and thus has size limitation), the data may flow to another node as well.
You may want to read this blog post from Windows Azure Storage Team: http://blogs.msdn.com/b/windowsazurestorage/archive/2010/11/06/how-to-get-most-out-of-windows-azure-tables.aspx.
UPDATE
Based on your comments below and some information from above, let's try and do some math. This is based on the latest scalability targets published here: http://blogs.msdn.com/b/windowsazurestorage/archive/2012/11/04/windows-azure-s-flat-network-storage-and-2012-scalability-targets.aspx. The documentation states that:
Single Table Partition– a table partition are all of the entities in a
table with the same partition key value, and usually tables have many
partitions. The throughput target for a single table partition is:
Up to 2,000 entities per second
Note, this is for a single partition, and not a single table. Therefore, a table with good partitioning, can process up to the
20,000 entities/second, which is the overall account target described
above.
Now you mentioned that you've 10 - 20 different metric points and for for each metric point you'll write a maximum of 1 record per minute that means you would be writing a maximum of 20 entities / minute / table which is well under the scalability target of 2000 entities / second.
Now the question remains of reading. Assuming a user would read a maximum of 24 hours worth of data (i.e. 24 * 60 = 1440 points) per partition. Now assuming that the user gets the data for all 20 metrics for 1 day, then each user (thus each table) will fetch a maximum 28,800 data points. The question that is left for you I guess is how many requests like this you can get per second to meet that threshold. If you could somehow extrapolate this information, I think you can reach some conclusion about the scalability of your architecture.
I would also recommend watching this video as well: http://channel9.msdn.com/Events/Build/2012/4-004.
Hope this helps.

Performance - Table Service, SQL Azure - insert. Query speed on large amount of data

I'd read many posts and articles about comparing SQL Azure and Table Service and most of them told that Table Service is more scalable than SQL Azure.
http://www.silverlight-travel.com/blog/2010/03/31/azure-table-storage-sql-azure/
http://www.intertech.com/Blog/post/Windows-Azure-Table-Storage-vs-Windows-SQL-Azure.aspx
Microsoft Azure Storage vs. Azure SQL Database
https://social.msdn.microsoft.com/Forums/en-US/windowsazure/thread/2fd79cf3-ebbb-48a2-be66-542e21c2bb4d
https://blogs.msdn.com/b/windowsazurestorage/archive/2010/05/10/windows-azure-storage-abstractions-and-their-scalability-targets.aspx
https://stackoverflow.com/questions/2711868/azure-performance
http://vermorel.com/journal/2009/9/17/table-storage-or-the-100x-cost-factor.html
Azure Tables or SQL Azure?
http://www.brentozar.com/archive/2010/01/sql-azure-frequently-asked-questions/
https://code.google.com/p/lokad-cloud/wiki/FatEntities
Sorry for http, I'm new user >_<
But http://azurescope.cloudapp.net/BenchmarkTestCases/ benchmark shows different picture.
My case. Using SQL Azure: one table with many inserts, about 172,000,000 per day(2000 per second). Can I expect good perfomance for inserts and selects when I have 2 million records or 9999....9 billion records in one table?
Using Table Service: one table with some number of partitions. Number of partitions can be large, very large.
Question #1: is Table service has some limitations or best practice for creating many, many, many partitions in one table?
Question #2: in a single partition I have a large amount of small entities, like in SQL Azure example above. Can I expect good perfomance for inserts and selects when I have 2 million records or 9999 billion entities in one partition?
I know about sharding or partition solutions, but it is a cloud service, is cloud not powerfull and do all work without my code skills?
Question #3: Can anybody show me benchmarks for quering on large amount of datas for SQL Azure and Table Service?
Question #4: May be you could suggest a better solution for my case.
Short Answer
I haven't seen lots of partitions cause Azure Tables (AZT) problems, but I don't have this volume of data.
The more items in a partition, the slower queries in that partition
Sorry no, I don't have the benchmarks
See below
Long Answer
In your case I suspect that SQL Azure is not going work for you, simply because of the limits on the size of a SQL Azure database. If each of those rows you're inserting are 1K with indexes you will hit the 50GB limit in about 300 days. It's true that Microsoft are talking about databases bigger than 50GB, but they've given no time frames on that. SQL Azure also has a throughput limit that I'm unable to find at this point (I pretty sure it's less than what you need though). You might be able to get around this by partitioning your data across more than one SQL Azure database.
The advantage SQL Azure does have though is the ability to run aggregate queries. In AZT you can't even write a select count(*) from customer without loading each customer.
AZT also has a limit of 500 transactions per second per partition, and a limit of "several thousand" per second per account.
I've found that choosing what to use for your partition key (PK) and row key depends (RK) on how you're going to query the data. If you want to access each of these items individually, simply give each row it's own partition key and a constant row key. This will mean that you have lots of partition.
For the sake of example, if these rows you were inserting were orders and the orders belong to a customer. If it was more common for you to list orders by customer you would have PK = CustomerId, RK = OrderId. This would mean to find orders for a customer you simply have to query on the partition key. To get a specific order you'd need to know the CustomerId and the OrderId. The more orders a customer had, the slower finding any particular order would be.
If you just needed to access orders just by OrderId, then you would use PK = OrderId, RK = string.Empty and put the CustomerId in another property. While you can still write a query that brings back all orders for a customer, because AZT doesn't support indexes other than on PartitionKey and RowKey if your query doesn't use a PartitionKey (and sometimes even if it does depending on how you write them) will cause a table scan. With the number of records you're talking about that would be very bad.
In all of the scenarios I've encountered, having lots of partitions doesn't seem to worry AZT too much.
Another way you can partition your data in AZT that is not often mentioned is to put the data in different tables. For example, you might want to create one table for each day. If you want to run a query for last week, run the same query against the 7 different tables. If you're prepared to do a bit of work on the client end you can even run them in parallel.
Azure SQL can easily ingest that much data an more. Here's a video I recorded months ago that show a sample (available on GitHub) that shows one way you can do that.
https://www.youtube.com/watch?v=vVrqa0H_rQA
here's the full repo
https://github.com/Azure-Samples/streaming-at-scale/tree/master/eventhubs-streamanalytics-azuresql

Resources