I am fetching a list of persons from Cosmos DB using the LINQ expression below, taking an array (string[] getPersons) as input.
var container = db.GetContainer(containerId);
var q = container.GetItemLinqQueryable<Person>();
var iterator = q.Where(p => getPersons.Contains(p.Name)).ToFeedIterator();
var results = await iterator.ReadNextAsync();
I am able to get the results, but it takes more than 15 seconds to retrieve the data (more than 1K records) from Cosmos DB. I need to get the data in under 1 second. Is there any way to optimise the above query to achieve this?
There could be multiple things you can do, but my intuition is that this is not an SDK issue but a design one.
Is the partition key the person's name? If not, then you are running a cross-partition query, which will never scale. In fact, your performance will worsen as you add more data.
I suggest taking a look at Choosing a Partition key to learn more about how to create a database that can scale easily.
Also, if you're new to this type of database, this article is super helpful too: How to model and partition data on Azure Cosmos DB using a real-world example.
Once you have a design that can scale where queries are answered from a single (or a bounded set of) partitions, your query performance will drastically increase.
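For illustration, here is a minimal sketch, assuming the container is partitioned on the person's name and using the .NET SDK v3, of how supplying the partition key turns this into a set of single-partition queries, and of draining every page of the iterator rather than reading only the first one (Person is the class from the question):

using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;
using Microsoft.Azure.Cosmos;
using Microsoft.Azure.Cosmos.Linq;

public static class PersonQueries
{
    // Sketch only: assumes the container is partitioned on /name, so each
    // query below targets exactly one logical partition.
    public static async Task<List<Person>> GetPersonsAsync(Container container, string[] getPersons)
    {
        var results = new List<Person>();
        foreach (var name in getPersons)
        {
            // Supplying the PartitionKey makes this a single-partition query.
            var options = new QueryRequestOptions { PartitionKey = new PartitionKey(name) };
            var iterator = container
                .GetItemLinqQueryable<Person>(requestOptions: options)
                .Where(p => p.Name == name)
                .ToFeedIterator();

            // Drain every page; ReadNextAsync returns only one page at a time.
            while (iterator.HasMoreResults)
            {
                results.AddRange(await iterator.ReadNextAsync());
            }
        }
        return results;
    }
}

If the container is partitioned on something else, the draining loop still applies, but the query remains cross-partition and the partitioning advice above is the real fix.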
Related
We would like to store a set of documents in Cosmos DB with a primary key of EventId. These records are evenly distributed across a number of customers. Clients need to access the latest records for a subset of customers as new documents are added. The documents are immutable, and need to be stored indefinitely.
How should we design our partition key and queries to avoid clients all hitting the same partitions and/or high RU usage?
If we use just CustomerId as the partition key, we would eventually run over the 10GB limit for a logical partition, and if we use EventId, then querying becomes inefficient (would result in a cross-partition query, and high RU usage, which we'd like to avoid).
Another idea would be to group documents into blocks, i.e. PartitionKey = int(EventId / PartitionSize). This would result in all clients hitting the latest partition(s), which presumably would result in poor performance and throttling.
If we use a combined PartitionKey of CustomerId and int(EventId / PartitionSize), then it's not clear to me how we would avoid a cross-partition query to retrieve the correct set of documents.
Edit:
Clarification of a couple of points:
Clients will access the events by specifying a list of CustomerIds, the last EventId they received, and a maximum number of records to retrieve.
For this reason, the use of EventId alone won't perform well, as it will result in a cross partition query (i.e. WHERE EventId > LastEventId).
The system will probably be writing on the order of 1GB a day, in 15 minute increments.
It's hard to know what the read volume will be, but I'd guess probably moderate, with maybe a few thousand clients polling the API at regular intervals.
First things first: the logical partition size limit has now been increased to 20 GB, please see here.
You can use EventID as the partition key as well; the size limit applies to each logical partition, but there is no limit on the number of logical partitions. So using EventID is fine: you get a point read, which is very fast, whenever you query by EventID. You mention that this approach would force cross-partition queries; can you explain how?
A few things to keep in mind, though. Cosmos DB is not really meant for storing this kind of log-based data, as it stores everything on SSDs, so work out the size of a single document, how many you need to store per second, and from that how much per day and per month. You can use TTL to delete documents from Cosmos DB once they are no longer needed, keep them long term in Azure Blob Storage, and use Azure Search for fast retrieval, querying the data in Blob Storage with CustomerID and EventID in your search query.
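For example, here is a minimal sketch, using the .NET SDK v3 with placeholder names, of creating a container with a default TTL so old events expire automatically:

using System;
using System.Threading.Tasks;
using Microsoft.Azure.Cosmos;

public static class ContainerSetup
{
    // Sketch: an "events" container whose items expire 30 days after their last write.
    public static async Task<Container> CreateEventsContainerAsync(Database database)
    {
        var properties = new ContainerProperties(id: "events", partitionKeyPath: "/eventId")
        {
            // Default TTL in seconds; individual items can override it via a "ttl" property.
            DefaultTimeToLive = (int)TimeSpan.FromDays(30).TotalSeconds
        };
        ContainerResponse response = await database.CreateContainerIfNotExistsAsync(properties, throughput: 400);
        return response.Container;
    }
}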
How should we design our partition key and queries to avoid clients all hitting the same partitions and/or high RU usage?
I faced a similar issue some time back and a PartitionKey with customerId + datekey e.g. cust1_20200920 worked well for me.
I created the date key as 20200920 (YYYYMMDD), but you can choose to ignore the date part or even the month (cust1_202009 /cust1_2020), based on your query requirement.
Also, IMO, having multiple known PartitionKeys at query time is kind of a good thing. For example, if you keep YYYYMM in the PartitionKey and want data for 4 months, you can run 4 queries in parallel and combine the results, which is faster when you have many clients and these partition keys are distributed across multiple physical partitions.
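As an illustration, here is a rough sketch, assuming the .NET SDK v3 and a hypothetical Event class and BuildKey helper, of building the customerId + month key and fanning one query per month out in parallel:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;
using Microsoft.Azure.Cosmos;

// Hypothetical event shape; partitionKey holds the composite key, e.g. "cust1_202009".
public class Event
{
    public string id { get; set; }
    public string partitionKey { get; set; }
    public string CustomerId { get; set; }
    public long EventId { get; set; }
}

public static class EventQueries
{
    // Hypothetical helper for the composite partition key.
    public static string BuildKey(string customerId, DateTime month) => $"{customerId}_{month:yyyyMM}";

    // Query several months for one customer in parallel and merge the results.
    public static async Task<List<Event>> GetEventsAsync(Container container, string customerId, IEnumerable<DateTime> months)
    {
        var tasks = months.Select(async month =>
        {
            string key = BuildKey(customerId, month);
            var query = new QueryDefinition("SELECT * FROM c WHERE c.partitionKey = @pk").WithParameter("@pk", key);
            var options = new QueryRequestOptions { PartitionKey = new PartitionKey(key) };

            var items = new List<Event>();
            var iterator = container.GetItemQueryIterator<Event>(query, requestOptions: options);
            while (iterator.HasMoreResults)
            {
                items.AddRange(await iterator.ReadNextAsync());
            }
            return items;
        });

        var perMonth = await Task.WhenAll(tasks);
        return perMonth.SelectMany(x => x).ToList();
    }
}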
On a separate note, Cosmos Db has recently introduced an analytical store for the transactional data which can be useful for your use case.
More about it here - https://learn.microsoft.com/en-us/azure/cosmos-db/analytical-store-introduction
One approach is using multiple Cosmos containers as "hot/cold" tiers with different partitioning. We could use two containers:
Recent: all writes and all queries for recent items go here. Partitioned by CustomerId.
Archive: all items are copied here for long term storage and access. Partitioned by CustomerId + timespan (e.g. partition per calendar month)
The Recent container would provide single partition queries by customer. Data growth per partition would be limited either by setting reasonable TTL during creation, or using a separate maintenance job (perhaps Azure Function on timer) to delete items when they are no longer candidates for recent-item queries.
A Change Feed processor, implemented by an Azure Function or otherwise, would trigger on each creation in Recent and make a copy into Archive. This copy would have partition key combining the customer ID and date range as appropriate to limit the partition size.
This scheme should provide efficient recent-item queries from Recent and safe long-term storage in Archive, with reasonable Archive query efficiency given a desired date range. The main downside is two writes for each item (one for each container) -- but that's the tradeoff for efficient polling. Whether this tradeoff is worthwhile is probably best determined by simulating the load and observing performance.
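Here is a rough sketch of that copy step, assuming the .NET SDK v3 change feed processor and a hypothetical EventDocument shape with a settable archive partition key:

using System;
using System.Threading.Tasks;
using Microsoft.Azure.Cosmos;

// Hypothetical document shape.
public class EventDocument
{
    public string id { get; set; }
    public string CustomerId { get; set; }
    public DateTime Timestamp { get; set; }
    public string ArchivePartitionKey { get; set; }
}

public static class ArchiveCopy
{
    // Sketch: copy every item written to Recent into Archive, re-keyed by customer + month.
    public static ChangeFeedProcessor Build(Container recent, Container archive, Container leases)
    {
        return recent
            .GetChangeFeedProcessorBuilder<EventDocument>("recent-to-archive", async (changes, cancellationToken) =>
            {
                foreach (var item in changes)
                {
                    // Combine customer ID and calendar month to bound the Archive partition size.
                    item.ArchivePartitionKey = $"{item.CustomerId}_{item.Timestamp:yyyyMM}";
                    await archive.UpsertItemAsync(item, new PartitionKey(item.ArchivePartitionKey), cancellationToken: cancellationToken);
                }
            })
            .WithInstanceName(Environment.MachineName)
            .WithLeaseContainer(leases)
            .Build();
    }
}

The processor needs a lease container; the same copy could equally be driven by an Azure Functions Cosmos DB trigger, as mentioned above.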
I have inserted exactly 1 million documents in an Azure Cosmos DB SQL container using the Bulk Executor. No errors were logged. All documents share the same partition key. The container is provisioned for 3,200 RU/s, unlimited storage capacity and single-region write.
When performing a simple count query:
select value count(1) from c where c.partitionKey = @partitionKey
I get results varying from 303,000 to 307,000.
This count query works fine for smaller partitions (from 10k up to 250k documents).
What could cause this strange behavior?
This is expected behaviour in Cosmos DB. Firstly, what you need to know is that DocumentDB imposes limits on the response page size. This link summarizes some of those limits: Azure DocumentDb Storage Limits - what exactly do they mean?
Secondly, if you want to query large amounts of data from DocumentDB, you have to consider query performance; please refer to this article: Tuning query performance with Azure Cosmos DB.
By looking at the DocumentDB REST API, you can observe several important parameters which have a significant impact on query operations: x-ms-max-item-count and x-ms-continuation.
So your issue results from the RU bottleneck: the count query is limited by the number of RUs allocated to your collection. The response you received will have included a continuation token.
You may have two solutions:
1. You could simply raise the RU setting.
2. To save cost, you could keep fetching the next set of results via the continuation token and keep adding them up to get the total count (probably in the SDK).
You could set the value of Max Item Count and paginate your data using continuation tokens. The DocumentDB SDK supports reading paginated data seamlessly. You could refer to the snippet of Python code below:
# Query with a small page size; each call to the fetch function returns one page.
q = client.QueryDocuments(collection_link, query, {'maxItemCount': 10})
results_1 = q._fetch_function({'maxItemCount': 10})
# results_1[1] holds the response headers; the continuation token is a string
# representing a JSON object.
token = results_1[1]['x-ms-continuation']
# Pass the token back to fetch the next page.
results_2 = q._fetch_function({'maxItemCount': 10, 'continuation': token})
I imported exactly 30k documents into my database, then tried to run the query select value count(1) from c in Query Explorer. It turns out each page covers only part of the total documents, so I need to add them all up by clicking the Next Page button.
You could, of course, do this query in SDK code via the continuation token.
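For completeness, here is a minimal C# sketch (current .NET SDK v3, names are placeholders) of draining every page and summing the partial counts:

using System.Threading.Tasks;
using Microsoft.Azure.Cosmos;

public static class CountHelper
{
    // Sketch: each page can contain a partial COUNT; add them all up.
    public static async Task<long> CountAsync(Container container, string partitionKeyValue)
    {
        var query = new QueryDefinition("SELECT VALUE COUNT(1) FROM c WHERE c.partitionKey = @pk")
            .WithParameter("@pk", partitionKeyValue);

        long total = 0;
        var iterator = container.GetItemQueryIterator<long>(query);
        while (iterator.HasMoreResults)
        {
            foreach (var partialCount in await iterator.ReadNextAsync())
            {
                total += partialCount;
            }
        }
        return total;
    }
}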
I have a scenario where I store large amounts of third party data for ad-hoc analysis by business users. Most queries against the data will be complicated, using multiple self-joins, projections, and ranges.
When it comes to picking a PartitionKey for use in Azure DocumentDB, I see people advising to use a logical separator such as TenantId, DeviceId, etc.
Given the parallel nature of DocumentDB, however, I was curious how it would handle a PartitionKey based on some sort of GUID or large integer so that, during large reads, it would be highly parallelized.
With that in mind, I devised a test with two collections:
test-col-1
PartitionKey is TenantId with roughly 100 possible values
test-col-2
PartitionKey is unique value assigned by third party that follows the pattern "AB1234568". Guaranteed to be globally unique by the third party.
Both collections are set to 100,000 RUs.
In my experiment I loaded both collections with roughly 2,000 documents. Each document is roughly 20 KB in size and is highly denormalized. Each document is an order, which contains multiple jobs, each of which contain users, prices, etc.
Example query:
SELECT
orders.Attributes.OrderNumber,
orders.Attributes.OpenedStamp,
jobs.SubOrderNumber,
jobs.LaborTotal.Amount As LaborTotal,
jobs.LaborActualHours As LaborHours,
jobs.PartsTotal.Amount As PartsTotal,
jobs.JobNumber,
jobs.Tech.Number As TechNumber,
orders.Attributes.OrderPerson.Number As OrderPersonNumber,
jobs.Status
FROM orders
JOIN jobs IN orders.Attributes.Jobs
JOIN tech IN jobs.Techs
WHERE orders.TenantId = @TenantId
AND orders.Attributes.Type = 1
AND orders.Attributes.Status IN (4, 5)
In my testing I adjusted the following settings:
Default ConnectionPolicy
Best practices ConnectionPolicy
ConnectionMode.Direct, Protocol.Tcp
Various MaxDegreeOfParallelism values
Various MaxBufferedItemCount
The collection with the GUID PartitionKey was queried with EnableCrossPartitionQuery = true. I am using C# and the .NET SDK v1.14.0.
In my initial tests with default settings, I found that querying the collection with TenantId as the PartitionKey was faster, taking 3,765 ms on average compared to 4,680 ms for the GUID-keyed collection.
When I set the ConnectionPolicy to Direct with TCP, I discovered the TenantId collection query times decreased by nearly 1,000 ms to an average of 2,865 ms, while the GUID collection increased by about 800 ms to an average of 5,492 ms.
Things started getting interesting when I started playing around with MaxDegreeOfParallelism and MaxBufferedItemCount. The TenantId collection query times were generally unaffected because the query wasn't cross-partition; however, the GUID collection sped up considerably, reaching values as fast as 450 ms (MaxDegreeOfParallelism = 2000, MaxBufferedItemCount = 2000).
Given these observations, why would you not want to make the PartitionKey as broad a value as possible?
Things started getting interesting when I started playing around with MaxDegreeOfParallelism and MaxBufferedItemCount. The TenantId collection query times were generally unaffected because the query wasn't cross-partition; however, the GUID collection sped up considerably, reaching values as fast as 450 ms (MaxDegreeOfParallelism = 2000, MaxBufferedItemCount = 2000).
MaxDegreeOfParallelism sets the maximum number of concurrent tasks, much like a ParallelOptions instance does. As far as I know, this is client-side parallelism, and it will cost CPU/memory resources on your side.
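For reference, here is a minimal sketch of those settings with the v1.x SDK used in the question (the values are illustrative, not recommendations):

using System.Collections.Generic;
using System.Threading.Tasks;
using Microsoft.Azure.Documents.Client;
using Microsoft.Azure.Documents.Linq;

public static class ParallelQuery
{
    // Sketch: cross-partition query with client-side parallelism tuned.
    public static async Task<List<dynamic>> RunAsync(DocumentClient client, string databaseId, string collectionId, string sql)
    {
        var options = new FeedOptions
        {
            EnableCrossPartitionQuery = true, // fan out across partitions
            MaxDegreeOfParallelism = 100,     // concurrent partition requests on the client
            MaxBufferedItemCount = 1000       // items prefetched into client memory
        };

        var query = client
            .CreateDocumentQuery<dynamic>(UriFactory.CreateDocumentCollectionUri(databaseId, collectionId), sql, options)
            .AsDocumentQuery();

        var results = new List<dynamic>();
        while (query.HasMoreResults)
        {
            results.AddRange(await query.ExecuteNextAsync<dynamic>());
        }
        return results;
    }
}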
Given these observations, why would you not want to make the PartitionKey as broad a value as possible?
For write operations, we can scale across partition keys in order to use the throughput you have provisioned. For read operations, however, we need to minimize cross-partition lookups for lower latency.
Also, as this official document mentioned:
The choice of the partition key is an important decision that you have to make at design time. You must pick a property name that has a wide range of values and has even access patterns.
It is a best practice to have a partition key with many distinct values (100s-1000s at a minimum).
To achieve the full throughput of the container, you must choose a partition key that allows you to evenly distribute requests among some distinct partition key values.
For more details, you could refer to How to partition and scale in Azure Cosmos DB and this channel 9 tutorial about Azure DocumentDB Elastic Scale - Partitioning.
I have a bunch of primary keys - tens of thousands - and I want to retrieve their associated table entities. All row keys are empty strings. The best way I know of doing so is querying them one by one, asynchronously. It seems fast, but ideally I would like to bunch a few entities together in a single transaction. Playing with the new Storage Client, I have the following code failing:
var sample = GetSampleIds(); //10000 pks
var account = GetStorageAccount();
var tableClient = account.CreateCloudTableClient();
var table = tableClient.GetTableReference("myTable");
//I'm trying to get first and second pk in a single request.
var keyA = sample[0];
var keyB = sample[1];
var filterA = TableQuery.GenerateFilterCondition("PartitionKey", QueryComparisons.Equal, keyA);
var filterB = TableQuery.GenerateFilterCondition("PartitionKey", QueryComparisons.Equal, keyB);
//filterAB = "(PartitionKey eq 'keyA') or (PartitionKey eq 'keyB')"
var filterAB = TableQuery.CombineFilters(filterA, TableOperators.Or, filterB);
var query = new TableQuery<TweetEntity>().Where(filterAB);
//Does something weird. I thought it might be fetching a range at one point.
//Whatever it does, it doesn't return. Expected the following line to get an array of 2 items.
var results = table.ExecuteQuery(query).ToArray();
// replacing filterAB in query with either filterA or filterB works as expected
Examples always show CombineFilters working on PK and then RK, but this is of no use to me. I'm assuming that this is not possible.
Question
Is it possible to bundle together entities by PK? I know the maximum filter length is 15, but even 2 is a potential improvement when you are fetching 10,000 items. Also, where is the manual? I can't find proper documentation anywhere. For example, the MSDN page for CombineFilters is a basic shell wrapping less information than IntelliSense provides.
tl;dr: sounds like you need to rethink your partitioning strategy. Unique, non-sequential IDs are not good PKs when you commonly have to query or work on many. More:
Partition Keys are not really meant to be 'primary' keys. Think of them more as grouping closely related sets of data that you want to work with together. You can group by ID, date, etc. PKs are used to scale the system - in theory, you could have one partition server per PK working on your data.
To your question: you won't get very good performance doing what you are doing. In fact, OR queries are non-optimized and will require a full table scan (bad). So, instead of doing PK = "foo" OR PK = "bar", you really should be doing 2 queries (in parallel) as that will get you much better performance.
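A minimal sketch of the two-parallel-queries approach, using the same classic Table SDK and TweetEntity type as in the question:

using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;
using Microsoft.WindowsAzure.Storage.Table;

public static class PartitionFetch
{
    // Sketch: one query per partition key, run in parallel, instead of OR-ing the filters.
    public static async Task<List<TweetEntity>> GetByPartitionKeysAsync(CloudTable table, IEnumerable<string> partitionKeys)
    {
        var tasks = partitionKeys.Select(async pk =>
        {
            var filter = TableQuery.GenerateFilterCondition("PartitionKey", QueryComparisons.Equal, pk);
            var query = new TableQuery<TweetEntity>().Where(filter);

            var entities = new List<TweetEntity>();
            TableContinuationToken token = null;
            do
            {
                // Each segment returns up to 1,000 entities plus a continuation token.
                var segment = await table.ExecuteQuerySegmentedAsync(query, token);
                entities.AddRange(segment.Results);
                token = segment.ContinuationToken;
            } while (token != null);
            return entities;
        });

        var perPartition = await Task.WhenAll(tasks);
        return perPartition.SelectMany(x => x).ToList();
    }
}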
Back to your core issue: if you are using a unique identifier for a particular entity as the PK, then it also means you are not able to work on more than one entity at a time. In order to work on a set of entities you really need a common partition key. Can you think of a better one that describes your entities? Does date/time work? Some other common attribute? Those tend to be good partition keys. The only other thing you can do is what is called partition ranging, where your queries tend to be ranged over partition keys. An example of this is date/time partition keys: you can use ticks to describe your partition and end up with sequential ticks as PKs. Your query can then use > and < comparisons to specify a range (no OR). Those can be more optimized, but you will still potentially get a ton of continuation tokens.
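And here is a rough sketch of the partition-ranging idea, assuming the PartitionKey is a zero-padded tick string:

using System;
using Microsoft.WindowsAzure.Storage.Table;

public static class RangeQueries
{
    // Sketch: range filter over sequential, tick-based partition keys.
    public static TableQuery<TweetEntity> Build(DateTime fromUtc, DateTime toUtc)
    {
        string low = fromUtc.Ticks.ToString("D19");
        string high = toUtc.Ticks.ToString("D19");

        string lower = TableQuery.GenerateFilterCondition("PartitionKey", QueryComparisons.GreaterThanOrEqual, low);
        string upper = TableQuery.GenerateFilterCondition("PartitionKey", QueryComparisons.LessThan, high);

        // No OR involved, so the service can serve the range without a full table scan,
        // though you may still see many continuation tokens across partitions.
        return new TableQuery<TweetEntity>().Where(TableQuery.CombineFilters(lower, TableOperators.And, upper));
    }
}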
As dunnry mentioned in his reply, the problem with this approach is that OR queries are horribly slow. I got my problem to work without the storage client (at this point, I'm not sure what's wrong with it; let's call it a bug, maybe), but getting the 2 entities separately without the OR query turns out to be much(!) faster than getting them with the OR query.
I've read many posts and articles comparing SQL Azure and the Table service, and most of them say that the Table service is more scalable than SQL Azure.
http://www.silverlight-travel.com/blog/2010/03/31/azure-table-storage-sql-azure/
http://www.intertech.com/Blog/post/Windows-Azure-Table-Storage-vs-Windows-SQL-Azure.aspx
Microsoft Azure Storage vs. Azure SQL Database
https://social.msdn.microsoft.com/Forums/en-US/windowsazure/thread/2fd79cf3-ebbb-48a2-be66-542e21c2bb4d
https://blogs.msdn.com/b/windowsazurestorage/archive/2010/05/10/windows-azure-storage-abstractions-and-their-scalability-targets.aspx
https://stackoverflow.com/questions/2711868/azure-performance
http://vermorel.com/journal/2009/9/17/table-storage-or-the-100x-cost-factor.html
Azure Tables or SQL Azure?
http://www.brentozar.com/archive/2010/01/sql-azure-frequently-asked-questions/
https://code.google.com/p/lokad-cloud/wiki/FatEntities
But the benchmark at http://azurescope.cloudapp.net/BenchmarkTestCases/ shows a different picture.
My case: using SQL Azure, one table with many inserts, about 172,000,000 per day (2,000 per second). Can I expect good performance for inserts and selects when I have 2 million records, or 9999....9 billion records, in one table?
Using the Table service: one table with some number of partitions. The number of partitions can be large, very large.
Question #1: does the Table service have limitations or best practices for creating many, many, many partitions in one table?
Question #2: in a single partition I have a large number of small entities, like in the SQL Azure example above. Can I expect good performance for inserts and selects when I have 2 million records or 9999 billion entities in one partition?
I know about sharding and partitioning solutions, but this is a cloud service; isn't the cloud powerful enough to do all the work without my code skills?
Question #3: can anybody show me benchmarks for querying large amounts of data for SQL Azure and the Table service?
Question #4: maybe you could suggest a better solution for my case.
Short Answer
I haven't seen lots of partitions cause Azure Tables (AZT) problems, but I don't have this volume of data.
The more items in a partition, the slower queries in that partition
Sorry no, I don't have the benchmarks
See below
Long Answer
In your case I suspect that SQL Azure is not going to work for you, simply because of the limits on the size of a SQL Azure database. If each of the rows you're inserting is 1K with indexes, you will hit the 50GB limit in about 300 days. It's true that Microsoft are talking about databases bigger than 50GB, but they've given no time frames on that. SQL Azure also has a throughput limit that I'm unable to find at this point (I'm pretty sure it's less than what you need, though). You might be able to get around this by partitioning your data across more than one SQL Azure database.
The advantage SQL Azure does have though is the ability to run aggregate queries. In AZT you can't even write a select count(*) from customer without loading each customer.
AZT also has a limit of 500 transactions per second per partition, and a limit of "several thousand" per second per account.
I've found that choosing what to use for your partition key (PK) and row key (RK) depends on how you're going to query the data. If you want to access each of these items individually, simply give each row its own partition key and a constant row key. This will mean that you have lots of partitions.
For the sake of example, say the rows you were inserting were orders, and the orders belong to a customer. If it were more common for you to list orders by customer, you would use PK = CustomerId, RK = OrderId. This would mean that to find the orders for a customer you simply have to query on the partition key. To get a specific order you'd need to know the CustomerId and the OrderId. The more orders a customer had, the slower finding any particular order would be.
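As a rough sketch, with the classic Table SDK and a hypothetical OrderEntity (PK = CustomerId, RK = OrderId):

using System.Threading.Tasks;
using Microsoft.WindowsAzure.Storage.Table;

// Hypothetical entity: PartitionKey = CustomerId, RowKey = OrderId.
public class OrderEntity : TableEntity
{
    public double Total { get; set; }
}

public static class OrderQueries
{
    // All orders for one customer: a single-partition query.
    public static TableQuery<OrderEntity> OrdersForCustomer(string customerId) =>
        new TableQuery<OrderEntity>().Where(
            TableQuery.GenerateFilterCondition("PartitionKey", QueryComparisons.Equal, customerId));

    // One specific order: a point lookup on PartitionKey + RowKey.
    public static async Task<OrderEntity> GetOrderAsync(CloudTable table, string customerId, string orderId)
    {
        var result = await table.ExecuteAsync(TableOperation.Retrieve<OrderEntity>(customerId, orderId));
        return (OrderEntity)result.Result;
    }
}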
If you just needed to access orders by OrderId, then you would use PK = OrderId, RK = string.Empty and put the CustomerId in another property. You can still write a query that brings back all orders for a customer, but because AZT doesn't support indexes other than on PartitionKey and RowKey, any query that doesn't use the PartitionKey (and sometimes even one that does, depending on how you write it) will cause a table scan. With the number of records you're talking about, that would be very bad.
In all of the scenarios I've encountered, having lots of partitions doesn't seem to worry AZT too much.
Another way you can partition your data in AZT that is not often mentioned is to put the data in different tables. For example, you might want to create one table for each day. If you want to run a query for last week, run the same query against the 7 different tables. If you're prepared to do a bit of work on the client end you can even run them in parallel.
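A rough sketch of that per-day fan-out, assuming day-suffixed table names (the naming convention is an illustrative assumption):

using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;
using Microsoft.WindowsAzure.Storage.Table;

public static class DailyTables
{
    // Sketch: run the same filter against the last 7 daily tables in parallel.
    public static async Task<List<DynamicTableEntity>> QueryLastWeekAsync(CloudTableClient tableClient, string filter)
    {
        var tasks = Enumerable.Range(0, 7).Select(async offset =>
        {
            var day = DateTime.UtcNow.Date.AddDays(-offset);
            var table = tableClient.GetTableReference($"events{day:yyyyMMdd}");
            var query = new TableQuery<DynamicTableEntity>().Where(filter);

            var results = new List<DynamicTableEntity>();
            TableContinuationToken token = null;
            do
            {
                var segment = await table.ExecuteQuerySegmentedAsync(query, token);
                results.AddRange(segment.Results);
                token = segment.ContinuationToken;
            } while (token != null);
            return results;
        });

        return (await Task.WhenAll(tasks)).SelectMany(x => x).ToList();
    }
}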
Azure SQL can easily ingest that much data and more. Here's a video I recorded months ago that shows a sample (available on GitHub) demonstrating one way you can do that.
https://www.youtube.com/watch?v=vVrqa0H_rQA
Here's the full repo:
https://github.com/Azure-Samples/streaming-at-scale/tree/master/eventhubs-streamanalytics-azuresql