What are some ways to optimize the retrieval of large numbers of entities (~250K) from a single partition from Azure Table Storage to a .NET application?
As far as I know, there are two ways to optimize the retrieval of large numbers of entities from a single partition from Azure Table Storage to a .NET application.
1. If you don't need all of the entity's properties, use server-side projection.
A single entity can have up to 255 properties and be up to 1 MB in size. When you query the table and retrieve entities, you may not need all the properties and can avoid transferring data unnecessarily (to help reduce latency and cost). You can use server-side projection to transfer just the properties you need.
From: Azure Storage Table Design Guide: Designing Scalable and Performant Tables (Server-side projection)
For more details, you can refer to the following code:
string filter = TableQuery.GenerateFilterCondition(
    "PartitionKey", QueryComparisons.Equal, "Sales");
List<string> columns = new List<string>() { "Email" };
TableQuery<EmployeeEntity> employeeQuery =
    new TableQuery<EmployeeEntity>().Where(filter).Select(columns);
var entities = employeeTable.ExecuteQuery(employeeQuery);
foreach (var e in entities)
{
    Console.WriteLine("RowKey: {0}, EmployeeEmail: {1}", e.RowKey, e.Email);
}
2. If you just want to display the table's data, you don't need to retrieve all the entities at the same time. You can retrieve one segment of the results, and then use the continuation token to request the next segment when you need it. This improves table query performance.
A query against the table service may return a maximum of 1,000 entities at one time and may execute for a maximum of five seconds. If the result set contains more than 1,000 entities, if the query did not complete within five seconds, or if the query crosses the partition boundary, the Table service returns a continuation token to enable the client application to request the next set of entities. For more information about how continuation tokens work, see Query Timeout and Pagination.
From: Azure Storage Table Design Guide: Designing Scalable and Performant Tables (Retrieving large numbers of entities from a query)
By using continuation tokens explicitly, you can control when your application retrieves the next segment of data.
For more details, you can refer to the following code:
string filter = TableQuery.GenerateFilterCondition(
    "PartitionKey", QueryComparisons.Equal, "Sales");
TableQuery<EmployeeEntity> employeeQuery =
    new TableQuery<EmployeeEntity>().Where(filter);
TableContinuationToken continuationToken = null;
do
{
    var employees = employeeTable.ExecuteQuerySegmented(
        employeeQuery, continuationToken);
    foreach (var emp in employees)
    {
        ...
    }
    continuationToken = employees.ContinuationToken;
} while (continuationToken != null);
Besides, pay attention to the scalability targets for a single table partition:
Target throughput for a single table partition (1 KB entities): up to 2,000 entities per second.
If you reach the scalability target for a partition, the storage service will throttle your requests.
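If throttling does occur, the classic Table storage client can retry with exponential back-off. Below is a minimal sketch, reusing the employeeTable and employeeQuery from the example above and ExponentialRetry from Microsoft.WindowsAzure.Storage.RetryPolicies; the back-off interval and attempt count are illustrative assumptions, not required values.
// Retry throttled requests with exponential back-off (values are illustrative).
var requestOptions = new TableRequestOptions
{
    RetryPolicy = new ExponentialRetry(TimeSpan.FromSeconds(2), 5),
    MaximumExecutionTime = TimeSpan.FromSeconds(60) // overall cap across retries
};

// ExecuteQuery accepts per-request options, so throttled calls are retried transparently.
var entities = employeeTable.ExecuteQuery(employeeQuery, requestOptions);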
Related
I am fetching a list of persons from Cosmos DB using the LINQ expression below, with an array (String[] getPersons) as input.
var container = db.GetContainer(containerId);
var q = container.GetItemLinqQueryable<Person>();
var iterator = q.Where(p => getPersons.Contains(p.Name)).ToFeedIterator();
var results = await iterator.ReadNextAsync();
I am able to get the result, but it takes too long (>15 sec) to retrieve the data (more than 1K records) from Cosmos DB. I need to get the data within 1 second. Is there any way to optimise the above query to achieve this?
There could be multiple things you can do, but my intuition says this is not an SDK issue but one of design.
Is the partition key the person's name? If not, then you are running a cross-partition query, which will never scale. In fact, your performance will worsen as you add more data.
I suggest taking a look at Choosing a Partition key to learn more about how to create a database that can scale easily.
Also, if you're new to this type of database, this article is super helpful too, How to model and partition data on Azure Cosmos DB using a real-world example
Once you have a design that can scale where queries are answered from a single (or a bounded set of) partitions, your query performance will drastically increase.
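As an illustration only, if Name were chosen as the partition key (an assumption; your real key may differ), the query could be pinned to a single partition via QueryRequestOptions in the V3 SDK and the iterator drained page by page, reusing the container and Person types from the question:
// Assumes "Name" is the container's partition key; adjust to your actual design.
var options = new QueryRequestOptions { PartitionKey = new PartitionKey("John Doe") };

var iterator = container.GetItemLinqQueryable<Person>(requestOptions: options)
    .Where(p => p.Name == "John Doe")
    .ToFeedIterator();

var people = new List<Person>();
while (iterator.HasMoreResults)
{
    // Each ReadNextAsync call returns one page; drain all pages for the full result set.
    people.AddRange(await iterator.ReadNextAsync());
}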
Although I am setting high RUs, I am not getting the performance I need.
Background: I am working on an IoT application, and unfortunately the chosen partition key is very bad: {deviceID} + {dd/mm/yyyy hh:mm:ss}. Technically speaking, this means each logical partition has very few items (it never reaches the 10 GB limit), but I suspect a huge number of physical partitions has been created, which forces my RUs to be split across them. How do I get the list of physical partitions?
You can't control partitions, nor can you get a partition list, but you don't actually need them. It's not as if each partition is placed on a separate box. If you are suffering from low performance, you need to identify what is causing throttling. You can use the Metrics blade to identify throttled partitions and figure out why they are throttled. You can also use diagnostic settings and stream them to Log Analytics to gain additional insights.
We can get the list of partition key ranges using this API. Partition key ranges might change in the future as the data changes.
Physical partitions are internal implementations. We don't have any control over the size or number of physical partitions and we can't control the mapping between logical & physical partitions.
But we can control the distribution of data over logical partitions by choosing an appropriate partition key that spreads data evenly across them.
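For illustration only (the right granularity depends on your workload), a coarser synthetic key, such as device ID plus day instead of device ID plus second, avoids creating a near-empty logical partition per write:
// Hypothetical helper: one logical partition per device per day.
static string BuildPartitionKey(string deviceId, DateTime timestampUtc) =>
    $"{deviceId}_{timestampUtc:yyyy-MM-dd}";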
This information used to be displayed straightforwardly in the portal, but it was removed in a redesign.
I feel that this is a mistake, as provisioning RUs requires knowledge of peak RU per partition multiplied by the number of partitions, so this number should be easily accessible.
The information is returned in the JSON sent to the portal but not shown to us. For collections provisioned with dedicated throughput (i.e. not using database-provisioned throughput), this JavaScript bookmarklet shows the information.
javascript:(function () { var ss = ko.contextFor($(".ext-quickstart-tabs-left-margin")[0]).$rawData().selectedSection(); var coll = ss.selectedCollectionId(); if (coll === null) { alert("Please drill down into a specific container"); } else { alert("Partition count for container " + coll + " is " + ss.selectedCollectionPartitionCount()); } })();
Visit the Metrics tab in the portal, select the database and container, and then run the bookmarklet to see the count in an alert box.
You can also see this information from the pkranges REST endpoint, which is used by the SDK. Some code that works with the V2 SDK is below:
var documentClient = new DocumentClient(new Uri(endpointUrl), authorizationKey,
    new ConnectionPolicy {
        ConnectionMode = ConnectionMode.Direct
    });
var partitionKeyRangesUri = UriFactory.CreatePartitionKeyRangesUri(dbName, collectionName);
FeedResponse<PartitionKeyRange> response = null;
do {
    response = await documentClient.ReadPartitionKeyRangeFeedAsync(partitionKeyRangesUri,
        new FeedOptions {
            MaxItemCount = 1000,
            // Pass the continuation back in so the loop advances past the first page.
            RequestContinuation = response?.ResponseContinuation
        });
    foreach (var pkRange in response) {
        //TODO: Something with the pkRange
    }
} while (!string.IsNullOrEmpty(response.ResponseContinuation));
EDIT question summary:
I want to expose an endpoint that is capable of returning portions of XML data based on some query parameters.
I have a stateful service (which keeps the XML data, converted to DTOs, in a reliable dictionary).
I use a single, named partition (I just can't tell which partition holds the data from the query parameters passed, so I can't implement a smarter partitioning strategy).
I am using service remoting for communication between the stateless Web API service and the stateful one.
XML data may reach 500 MB
Everything is OK when the XML is only around 50 MB.
When the data gets larger, Service Fabric complains about MaxReplicationMessageSize.
And the summary of my questions below: how can one store a large amount of data in a reliable dictionary?
TL;DR:
Apparently, I am missing something...
I want to parse huge XMLs and load them into a reliable dictionary for later querying.
I am using a single, named partition.
I have an XMLData stateful service that loads these XMLs into a reliable dictionary in its RunAsync method via this piece of code:
var myDictionary = await this.StateManager.GetOrAddAsync<IReliableDictionary<string, List<HospitalData>>>("DATA");
using (var tx = this.StateManager.CreateTransaction())
{
    var result = await myDictionary.TryGetValueAsync(tx, "data");
    ServiceEventSource.Current.ServiceMessage(this, "data status: {0}",
        result.HasValue ? "loaded" : "not loaded yet, starts loading");
    if (!result.HasValue)
    {
        Stopwatch timer = new Stopwatch();
        timer.Start();
        var converter = new DataConverter(XmlFolder);
        List<HospitalData> data = converter.LoadData();
        await myDictionary.AddOrUpdateAsync(tx, "data", data, (key, value) => data);
        timer.Stop();
        ServiceEventSource.Current.ServiceMessage(this,
            string.Format("Loading of data finished in {0} ms",
                timer.ElapsedMilliseconds));
    }
    await tx.CommitAsync();
}
I have a stateless Web API service that communicates with the above stateful one via service remoting and queries the dictionary via this code:
ServiceUriBuilder builder = new ServiceUriBuilder(DataServiceName);
IDataService DataServiceClient = ServiceProxy.Create<IDataService>(builder.ToUri(),
    new Microsoft.ServiceFabric.Services.Client.ServicePartitionKey("My.single.named.partition"));
try
{
    var data = await DataServiceClient.QueryData(SomeQuery);
    return Ok(data);
}
catch (Exception ex)
{
    ServiceEventSource.Current.Message("Web Service: Exception: {0}", ex);
    throw;
}
It works really well when the XMLs do not exceed 50 MB.
After that I get errors like:
System.Fabric.FabricReplicationOperationTooLargeException: The replication operation is larger than the configured limit - MaxReplicationMessageSize ---> System.Runtime.InteropServices.COMException
Questions:
I am almost certain that this is about the partitioning strategy and that I need to use more partitions. But how do I reference a particular partition from within the RunAsync method of the stateful service? (The stateful service is invoked via RPC from the Web API, where I explicitly specify a partition, so there I can easily choose among partitions if using the ranged partitioning strategy - but how do I do that during the initial loading of data in the RunAsync method?)
Are these thoughts of mine correct: the code in a stateful service operates on a single partition, so loading a huge amount of data and partitioning that data should happen outside the stateful service (for example, in an Actor)? Then, after determining the partition key, I just invoke the stateful service via RPC, pointing it to that particular partition.
Actually, is this a partitioning problem at all, and what (where, who) defines the size of a replication message? I.e. does the partitioning strategy influence the replication message sizes?
Would extracting the loading logic into a stateful Actor help in any way?
For any help on this - thanks a lot!
The issue is that you're trying to add a large amount of data into a single dictionary record. When Service Fabric tries to replicate that data to other replicas of the service, it encounters a quota of the replicator, MaxReplicationMessageSize, which indeed defaults to 50MB (documented here).
You can increase the quota by specifying a ReliableStateManagerConfiguration:
internal sealed class Stateful1 : StatefulService
{
    public Stateful1(StatefulServiceContext context)
        : base(context, new ReliableStateManager(context,
            new ReliableStateManagerConfiguration(new ReliableStateManagerReplicatorSettings
            {
                MaxReplicationMessageSize = 1024 * 1024 * 200
            }))) { }
}
But I strongly suggest you change the way you store your data. The current method won't scale very well and isn't the way Reliable Collections were meant to be used.
Instead, you should store each HospitalData in a separate dictionary item. Then you can query the items in the dictionary (see this answer for details on how to use LINQ). You will not need to change the above quota.
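A minimal sketch of that approach, assuming HospitalData exposes some unique Id property to use as the dictionary key (replace it with whatever uniquely identifies your items):
var hospitals = await this.StateManager
    .GetOrAddAsync<IReliableDictionary<string, HospitalData>>("DATA");

var converter = new DataConverter(XmlFolder);
List<HospitalData> data = converter.LoadData();

// One dictionary entry per HospitalData keeps each replicated operation small.
foreach (var item in data)
{
    using (var tx = this.StateManager.CreateTransaction())
    {
        await hospitals.AddOrUpdateAsync(tx, item.Id, item, (key, existing) => item);
        await tx.CommitAsync();
    }
}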
PS - You don't necessarily have to use partitioning for 500MB of data. But regarding your question - you could use partitions even if you can't derive the key from the query, simply by querying all partitions and then combining the data.
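A rough sketch of that fan-out, assuming a ranged (Int64) partitioning scheme, reusing the IDataService/QueryData names from the question, and assuming QueryData returns a collection of HospitalData (the service URI is hypothetical):
var fabricClient = new FabricClient();
var serviceUri = new Uri("fabric:/MyApp/XMLDataService"); // hypothetical name

// Enumerate every partition of the stateful service and query each one.
var partitions = await fabricClient.QueryManager.GetPartitionListAsync(serviceUri);

var combined = new List<HospitalData>();
foreach (var partition in partitions)
{
    var range = (Int64RangePartitionInformation)partition.PartitionInformation;
    var proxy = ServiceProxy.Create<IDataService>(
        serviceUri, new ServicePartitionKey(range.LowKey));

    combined.AddRange(await proxy.QueryData(SomeQuery));
}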
We are using the .NET Azure Storage client library to retrieve data from the server. When we try to retrieve data, the result has 0 items and a continuation token. When we fetch the next page with this continuation token, we again get the same result. However, when we use the 4th continuation token fetched in this way, we get the proper result with 15 items (the take count for all requests is 15). This issue is observed only when we apply filter conditions. The code used to fetch the result is given below:
var tableReference = _tableClient.GetTableReference(tableName);
var query = new TableQuery();
query.Where("'DeviceId' eq '99'"); // DeviceId is of type Int32
query.TakeCount = 15;
var resultsQuery = tableReference.ExecuteQuerySegmented(query, token);
var nextToken = resultsQuery.ContinuationToken;
var results = resultsQuery.ToList();
This is expected behavior. From Query Timeout and Pagination:
A query against the Table service may return a maximum of 1,000 items at one time and may execute for a maximum of five seconds. If the result set contains more than 1,000 items, if the query did not complete within five seconds, or if the query crosses the partition boundary, the response includes headers which provide the developer with continuation tokens to use in order to resume the query at the next item in the result set. Continuation token headers may be returned for a Query Tables operation or a Query Entities operation.
I noticed that you're not using PartitionKey in your query. This will result in full table scan. Recommendation would be to always use PartitionKey (and possibly RowKey) in your queries to avoid full table scans. I would highly recommend reading Azure Storage Table Design Guide: Designing Scalable and Performant Tables to get the most out of Azure Tables.
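For instance, here is a sketch of the same query scoped to one partition ("Sales" is just an illustrative PartitionKey value) and using the typed Int32 comparison instead of a quoted string:
string partitionFilter = TableQuery.GenerateFilterCondition(
    "PartitionKey", QueryComparisons.Equal, "Sales");
string deviceFilter = TableQuery.GenerateFilterConditionForInt(
    "DeviceId", QueryComparisons.Equal, 99);

// Combining both predicates keeps the query inside a single partition.
var query = new TableQuery().Where(
    TableQuery.CombineFilters(partitionFilter, TableOperators.And, deviceFilter));
query.TakeCount = 15;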
UPDATE: Explaining "If the query crosses the partition boundary"
Let me try with an example to explain what I understand by partition boundary. Let's assume you have 1 million rows in your table spread evenly across 10 partitions (let's assume your PartitionKeys are 001, 002, 003, ... 010). Now we know that the data in Azure Tables is organized by PartitionKey and then, within a partition, by RowKey. Since your query did not specify a PartitionKey, the Table service starts with the 1st partition (i.e. PartitionKey == 001) and tries to find matching data there. If it does not find any data in that partition, it does not know whether the data is in another partition, so instead of moving on to the next partition it simply returns with a continuation token and leaves it to the client consuming the API to decide whether to continue the search using the same parameters plus the continuation token, or to revise the search and start again.
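In practice this means the client should keep following the continuation token until either enough items have been collected or the token comes back null; a sketch along those lines, reusing the query and tableReference from the question:
var collected = new List<DynamicTableEntity>();
TableContinuationToken token = null;
do
{
    var segment = tableReference.ExecuteQuerySegmented(query, token);
    collected.AddRange(segment.Results);
    token = segment.ContinuationToken;
    // An empty segment with a token just means the scan has not reached matching data yet.
} while (token != null && collected.Count < 15);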
I am trying to benchmark search/read and insert queries on an Azure Table Storage table which is small (500 entities). The average insert time is 400 ms and the average search + retrieve time is 190 ms.
When searching, I am querying on the partition key, and the condition itself is composed of only one predicate: [PartitionKey] eq <value> (no additional ands/ors). Also, I am returning only 1 property.
What could be the reason for such results?
Search code:
TableQuery<DynamicTableEntity> projectionQuery = new TableQuery<DynamicTableEntity>().Select(new string[] { "State" });
projectionQuery.Where(TableQuery.GenerateFilterCondition("PartitionKey", QueryComparisons.Equal, "" + msg.PartitionKey));
// Define an entity resolver to work with the entity after retrieval.
EntityResolver<bool?> resolver = (pk, rk, ts, props, etag) => props.ContainsKey("State") ? (props["State"].BooleanValue) : null;
Stopwatch sw = new Stopwatch();
sw.Start();
List<bool?> sList = table.ExecuteQuery(projectionQuery, resolver, null, null).ToList();
sw.Stop();
Insert Code:
CloudTable table = tableClient.GetTableReference("Messages");
TableOperation insertOperation = TableOperation.Insert(msg);
Stopwatch sw = new Stopwatch();
// Execute the insert operation.
sw.Start();
table.Execute(insertOperation);
sw.Stop();
You can refer to this post for possible performance issues: Microsoft Azure Storage Performance and Scalability Checklist.
The reason why you only get one property back is that you're using an EntityResolver; please try removing that. Refer to Windows Azure Storage Client Library 2.0 Tables Deep Dive for the usage of EntityResolver - when you should use it and how to use it correctly.
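If you do drop the resolver, a sketch of reading the projected property directly from the returned DynamicTableEntity (reusing the table and projectionQuery from the question) might look like this:
// Without a resolver, the projected entities come back as DynamicTableEntity.
foreach (var ent in table.ExecuteQuery(projectionQuery))
{
    bool? state = ent.Properties.ContainsKey("State")
        ? ent.Properties["State"].BooleanValue
        : null;
    // ... use state ...
}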
From the SLA Document:
Storage
We guarantee that at least 99.99% of the time, we will successfully process requests to read data from Read Access-Geo Redundant Storage (RA-GRS) Accounts, provided that failed attempts to read data from the primary region are retried on the secondary region.
We guarantee that at least 99.9% of the time, we will successfully process requests to read data from Locally Redundant Storage (LRS), Zone Redundant Storage (ZRS), and Geo Redundant Storage (GRS) Accounts.
We guarantee that at least 99.9% of the time, we will successfully process requests to write data to Locally Redundant Storage (LRS), Zone Redundant Storage (ZRS), and Geo Redundant Storage (GRS) Accounts and Read Access-Geo Redundant Storage (RA-GRS) Accounts.
And also from the referenced document:
Table Query / List Operations - Maximum Processing Time: Ten (10) seconds (to complete processing or return a continuation)
There is no commitment to fast / low response times, nor is there any commitment to being faster with smaller tables.