Even if you have designed your document schema with care and handcrafted the minimal set of indexes needed to balance read and change scenarios, it is not always intuitive which index is actually doing the work for a heavy-RU query, or whether the engine's choices match your expectations. Or maybe a typo in a critical property name in the indexing policy is causing a silent fall-back to an unsuitable index that some other query required.
I know that I can use the following tools to debug index usage in DocumentDB:
RequestCharge per query, but it does not say what the RUs were spent on.
time/count metrics via the x-ms-documentdb-populatequerymetrics header, which is useful and hints that "some" index was used, but not which one(s) were actually used.
The problem is that this toolset still forces blind experiments and unverifiable assumptions, making query/index optimization a time-consuming process.
In SQL Server you could simply fetch the execution plan and verify index design and usage correctness. Is there an analogous tool for DocumentDB?
An illustrative pseudo-example of a query where it is not obvious which index(es) DocumentDB would pick:
SELECT s.poorlySelectiveIndexed
FROM c
JOIN s IN c.sub
WHERE c.anotherPoorlySelectiveIndexed = @aCommonValue
  AND s.Indexed1 IN ('a', 'b', 'c')
  AND ARRAY_CONTAINS(s.Indexed2, @searchValue)
  AND ARRAY_CONTAINS(s.Indexed3, 'literalValue')
  AND (s.SuperSelective = '23456' OR c.AnotherSuperSelective = '76543')
ORDER BY s.RangeIndexed4
It seems the DocumentDB team considers the already-mentioned x-ms-documentdb-populatequerymetrics header and its corresponding response to be such a tool.
As mentioned in this response from the "Azure Cosmos DB Team" on the Azure feedback site, dated August 27, 2017:
We’re pleased to announce the availability of query execution statistics: https://learn.microsoft.com/en-us/azure/cosmos-db/documentdb-sql-query-metrics#query-execution-metrics
Using these metrics, you can infer the execution plan and tune the query and index for best performance tradeoffs.
Currently it does not seem to officially expose detailed information about which indexes were used, but let's hope that changes in a future version.
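In the meantime, here is a minimal sketch of pulling those metrics from the .NET SDK (assuming a DocumentClient named client; the database, collection, and query text are placeholders). Note the output describes phases, counts, and index-hit ratios per partition key range, but still does not name the specific indexes used:

var query = client.CreateDocumentQuery<dynamic>(
    UriFactory.CreateDocumentCollectionUri("myDb", "myColl"),   // placeholder names
    "SELECT * FROM c WHERE c.anotherPoorlySelectiveIndexed = 'aCommonValue'",
    new FeedOptions
    {
        PopulateQueryMetrics = true,          // ask the backend to return execution statistics
        EnableCrossPartitionQuery = true
    }).AsDocumentQuery();

while (query.HasMoreResults)
{
    var page = await query.ExecuteNextAsync();
    // One entry per partition key range: retrieved document count/size, index lookup time,
    // document load time, runtime execution time, index hit ratio, ...
    foreach (var metric in page.QueryMetrics)
        Console.WriteLine($"{metric.Key}: {metric.Value}");
}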
Related
My previous question: Errors saving data to Google Datastore
We're running into issues writing to Datastore. Based on the previous question, we think the issue is that we're indexing a "SeenTime" attribute with YYYY-MM-DDTHH:MM:SSZ (e.g. 2021-04-29T17:42:58Z) and this is creating a hotspot (see: https://cloud.google.com/datastore/docs/best-practices#indexes).
We need to index this because we're querying the data by date and need the time for each observation in the end application. Is there a way around this issue where we can still query by date?
This answer is a bit late but:
On your previous question: before even getting to the query, the main issue seems to be "running into issues writing" (DEADLINE_EXCEEDED/UNAVAILABLE) happening on "some saves". So it's not completely clear whether it's caused by data hot-spotting or by "ingesting more data in shorter bursts", which causes contention (see "Designing for scale").
A single entity in Datastore mode should not be updated too rapidly. If you are using Datastore mode, design your application so that it will not need to update an entity more than once per second. If you update an entity too rapidly, then your Datastore mode writes will have higher latency, timeouts, and other types of error. This is known as contention.
You would need to add a prefix to the key to index monotonically increasing timestamps (as mentioned in the best-practices doc; see the sketch below). You can then test your queries using the GQL interface in the console. However, since you most likely want "all events", I don't think it would be possible to avoid this entirely, and it will still result in hot-spotting and read latency.
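To illustrate the prefixing idea (a sketch only; the shard count and the entity property used for hashing are hypothetical): writes get spread across N index ranges, and a query for a date range must then fan out over all N prefixes and merge the results client-side.

const int ShardCount = 10;                                  // assumption: tune to your write rate
string seenTime = "2021-04-29T17:42:58Z";
string sourceId = "sensor-42";                              // hypothetical stable property of the entity
int shard = Math.Abs(sourceId.GetHashCode()) % ShardCount;  // use a stable hash in production code
string indexedSeenTime = $"{shard}_{seenTime}";             // e.g. "3_2021-04-29T17:42:58Z"
// Query side: issue ShardCount queries, one per prefix, then merge and sort the results.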
The impression is that the latency might be unavoidable. If so, you would need to decide whether it's acceptable, depending on how often you run the query, how many elements it returns, and how much latency it adds (the performance impact).
Consider switching to Firestore Native Mode. It has a different architecture under the hood and is the next version of Datastore. While Firestore is not perfect, it can be more forgiving about hot-spotting and contention, so it's possible that you'll have fewer issues than in Datastore.
I'm using an Azure Function as a scheduled job, with a cron timer trigger. At a specific time each morning it calls a stored procedure.
The function now takes 4 minutes to run a stored procedure that takes a few seconds in SSMS, and this time keeps increasing despite successful efforts to speed up the stored procedure itself.
The function is not doing anything intensive.
using (SqlConnection conn = new SqlConnection(str))
{
    conn.Open();
    // "Stored Proc Here" stands in for the real sproc name; 10-minute timeout
    using (var cmd = new SqlCommand("Stored Proc Here", conn) { CommandType = CommandType.StoredProcedure, CommandTimeout = 600 })
    {
        // the last 30 days, up to today
        cmd.Parameters.Add("@Param1", SqlDbType.DateTime2).Value = DateTime.Today.AddDays(-30);
        cmd.Parameters.Add("@Param2", SqlDbType.DateTime2).Value = DateTime.Today;
        var result = cmd.ExecuteNonQuery();
    }
}
I've checked and the database is not under load with another process when the stored procedure is running.
Is there anything I can do to speed up the Azure function? Or any approaches to finding out why it's so slow?
UPDATE.
I don't believe Azure Functions is at fault; the issue seems to be with SQL Server.
I eventually ran the production stored procedure and had a look at the execution plan. I noticed that the statistics were way out; for example, a join expected 20 returned rows, but the actual figure was closer to 800k.
The solution for my issue was to update the statistics on a specific table each week.
Regarding why the stats were out by so much: the client does a batch update each night, inserting several hundred thousand rows. I can only assume this affects the stats, and the effect is cumulative, so it seems to get worse over time.
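For reference, a minimal sketch of that weekly fix, using the same kind of connection code as the function above (the table name is a placeholder for the table that receives the nightly batch load):

using (var conn = new SqlConnection(str))
{
    conn.Open();
    // FULLSCAN rebuilds the histogram from every row; omit it for a cheaper sampled update
    using (var cmd = new SqlCommand("UPDATE STATISTICS dbo.NightlyBatchTable WITH FULLSCAN", conn) { CommandTimeout = 600 })
    {
        cmd.ExecuteNonQuery();
    }
}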
Please be careful when adding WITH RECOMPILE hints. Compilation is often far more expensive than execution for a given simple query, meaning that you may not get decent performance for all apps with this approach.
There are different possible reasons for what you're experiencing. One common cause of this kind of scenario is getting different query plans on the app path vs. the SSMS path. This can happen for various reasons (summarized below). You can determine whether you are getting different plans by using the Query Store, which records summary data about queries, plans, and runtime stats. Please review a summary of it here:
https://learn.microsoft.com/en-us/sql/relational-databases/performance/monitoring-performance-by-using-the-query-store?view=sql-server-2017
You need a recent SSMS to get the UI, though you can run the underlying queries from any TDS client.
Now for a summary of some possible reasons:
One possible reason for plan differences is SET options. These are per-session environment settings for a query, such as ANSI_NULLS on or off. Each different setting can change the plan choice and thus performance. Unfortunately, the defaults differ across language drivers (historical artifacts from when each was built; hard to change now without breaking apps). You can review the Query Store to see whether there are different "context settings" (each unique combination of SET options is a unique context-settings entry in the Query Store). Each different set implies different possible plans and thus potential performance changes.
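Here is a sketch of that check, runnable from any client against a database with Query Store enabled (the sproc name is a placeholder). It returns one row per recorded plan together with the SET-options bitmask it was compiled under; two plans under two different set_options values would confirm the app and SSMS compile in different contexts:

const string diagnostic = @"
    SELECT q.query_id, p.plan_id, cs.set_options
    FROM sys.query_store_query q
    JOIN sys.query_store_plan p ON p.query_id = q.query_id
    JOIN sys.query_context_settings cs ON cs.context_settings_id = q.context_settings_id
    WHERE q.object_id = OBJECT_ID('dbo.MySproc');";   // placeholder sproc name

using (var conn = new SqlConnection(str))
using (var cmd = new SqlCommand(diagnostic, conn))
{
    conn.Open();
    using (var reader = cmd.ExecuteReader())
        while (reader.Read())
            Console.WriteLine($"query {reader[0]}, plan {reader[1]}, set_options {reader[2]}");
}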
The second major reason for plan changes like the ones you describe is parameter sniffing. Depending on the scope of compilation (for example, inside a sproc vs. ad hoc query text), SQL Server will sometimes look at the current parameter value during compilation to infer the frequency of that value in future executions. Instead of ignoring the value and using a default frequency, using a specific value can generate a plan that is optimal for a single value (or set of values) but potentially slower for values outside that set. You can see this in the plan choices recorded in the Query Store as well.
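If the Query Store shows one plan that is consistently fast, one possible remedy (a sketch, and a heavier hammer than the answer itself suggests) is to pin that plan; the ids below would come from the diagnostic query above and are placeholders:

using (var conn = new SqlConnection(str))
{
    conn.Open();
    // Forces the optimizer to reuse plan 7 for query 42 until it is unforced
    // (sys.sp_query_store_unforce_plan) or the plan becomes invalid.
    using (var cmd = new SqlCommand("EXEC sys.sp_query_store_force_plan @query_id = 42, @plan_id = 7", conn))
        cmd.ExecuteNonQuery();
}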
There are other possible reasons for performance differences beyond what I mentioned. Sometimes there are differences when the client runs in MARS mode vs. not. There may also be differences in how you call the client drivers that impact performance beyond this.
I hope this gives you a few tools to debug possible reasons for the difference. Good luck!
For a project I worked on, we ran into the same thing. It's not a function issue but a SQL Server issue. We were updating sprocs during development, and it turns out that SQL Server caches certain routes/indexes per execution plan (a layman's explanation), and that cache gets out of sync with the new sproc.
We resolved it by specifying WITH (RECOMPILE) at the end of the sproc, after which the API call and SSMS had the same timings.
Once the system is settled, that statement can and should be removed.
Search for "slow sproc fast in SSMS" and similar to find others who have run into this situation.
I have an application that requires a lot of RUs, but for some reason I cannot get the client app to consume more than 1000-1500 RUs, although the collection is provisioned with 10000 RUs. Obviously I can add more clients, but I need one client to give me at least 10000 RUs, then scale from there.
My requests are simple:
var query = connection.CreateDocumentQuery<DocumentDBProfile>(
CollectionUri, //cached
"SELECT * FROM Col1 WHERE Col1.key = '" + partitionKey + "' AND Col1.id ='" + id + "'",
new FeedOptions
{
MaxItemCount = -1,
MaxDegreeOfParallelism = 10000000,
MaxBufferedItemCount = 1000,
}).AsDocumentQuery();
var dataset = await query.ExecuteNextAsync().ConfigureAwait(false);
The query above hits 150,000 partitions, each within its own task (awaiting them all at the end), and the client is initialized with TCP and Direct mode:
var policy = new ConnectionPolicy
{
EnableEndpointDiscovery = false,
ConnectionMode = ConnectionMode.Direct,
ConnectionProtocol = Protocol.Tcp,
};
The CPU on the client appears to max out, mostly servicing the query.ExecuteNextAsync() calls.
Am I doing anything wrong? Any optimization tips? Is there a lower level API I can use? Is there a way to pre-parse queries or make Json parsing more optimal?
UPDATE
I was able to get up to 3000-4000 RUs on one client by lowering the number of concurrent requests and stripping my deserialized class down to a single property (id), but I am still at 10% of the 50,000-RU limit mentioned in the performance guidelines.
Not sure what else I can do. Are there any security checks or overhead I can disable in the .NET SDK?
UPDATE 2
All our tests run on Azure in the same region, on D11_v2. Running multiple clients scales well, so we are client-bound, not server-bound.
Still not able to get past 10% of the performance outlined in the Cosmos DB performance guidelines.
By default the SDK uses a retry policy to mask throttling errors. Have you looked at the RU metrics available in the Azure portal to confirm whether or not you are being throttled? For more details on this, see the tutorial here.
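If you want to see throttling directly instead of retry-induced latency, you can dial the automatic retries down on the connection policy you already build; a sketch (per the SDK docs, the default is 9 retries with up to a 30-second cumulative wait):

var policy = new ConnectionPolicy
{
    ConnectionMode = ConnectionMode.Direct,
    ConnectionProtocol = Protocol.Tcp,
    RetryOptions = new RetryOptions
    {
        MaxRetryAttemptsOnThrottledRequests = 0,   // surface 429s to the caller immediately
        MaxRetryWaitTimeInSeconds = 0
    }
};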
Not sure why the REST API would perform better than the .NET SDK. Can you give some more details on the operation you used here?
The example query you provided looks up a single document with a known partition key and id per request. For this kind of point-read operation it is better to use DocumentClient.ReadDocumentAsync, as it is cheaper than a query.
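A sketch of that point read, reusing the id and partitionKey from your query (the database and collection names are placeholders); the response also reports the RU charge, so you can compare it against the query directly:

var response = await client.ReadDocumentAsync(
    UriFactory.CreateDocumentUri("myDb", "Col1", id),           // placeholder names
    new RequestOptions { PartitionKey = new PartitionKey(partitionKey) });
Console.WriteLine($"Point read cost: {response.RequestCharge} RU");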
It sounds like your sole purpose has become to disprove Microsoft's documentation. Don't overrate that "50,000 RU/s" figure when deciding how to scale your clients.
I don't think you can get a faster or lower-level API than the .NET SDK with TCP and Direct mode. The critical part is to use the TCP protocol (which you are). Only the Java SDK also has Direct mode, and I doubt it's faster. Maybe .NET Core...
How can your requirement be to "have large RU/s"? That is equivalent to "the application should require us to pay X$ for Cosmos DB every month". The requirement should rather be "needs to complete X queries per second" or something similar; you then work forward from there. See also the request unit calculator.
A request unit is the cost of your transaction. It depends on how large your documents are, how your collection is configured, and what you are doing. Inserting documents is usually much more expensive than retrieving data. Retrieving data across partitions within one query is more expensive than touching only a single partition. A rule of thumb is that writing data is about five times more expensive than reading it.
I suggest you read the documentation about request units.
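A small sketch for grounding those rules of thumb in your own data: every SDK response reports the RUs it actually consumed (client, collectionUri, and the document shape here are assumed placeholders):

var created = await client.CreateDocumentAsync(collectionUri, new { id = "newId", key = "someKey" });
Console.WriteLine($"insert: {created.RequestCharge} RU");

var page = await client.CreateDocumentQuery<dynamic>(collectionUri, "SELECT * FROM c")
    .AsDocumentQuery()
    .ExecuteNextAsync();
Console.WriteLine($"query page: {page.RequestCharge} RU");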
The problem with Microsoft's performance tip is that it doesn't mention which kind of request should incur those RU/s. I would not expect it to mean "the most basic request possible will not max out the CPU on the client system while you are still below 50,000 RU/s". Inserting data will get you to those numbers much more easily. I did a very quick test on my local machine and got the official benchmarking sample up to about 7-8k RU/s using TCP + Direct, without doing anything beyond downloading the code and running it from Visual Studio. So my guess would be that the tips are about inserting as well, since the performance testing examples are. (That example, incidentally, achieves 100,000 RU/s.)
There are some good samples from Azure about "Benchmarking" and "Request Units". They should also be good sources for further experiments.
Only one actual tip on how to improve your query: maybe ditch deserialization into your class by using CreateDocumentQuery(..) or CreateDocumentQuery<dynamic>(..). That could help your CPU; my first guess would be that it is spending a lot of time on deserialization.
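Sketched against the query from your question (same placeholders), the change is just the type parameter:

var query = connection.CreateDocumentQuery<dynamic>(            // was <DocumentDBProfile>
    CollectionUri,
    "SELECT * FROM Col1 WHERE Col1.key = '" + partitionKey + "' AND Col1.id = '" + id + "'",
    new FeedOptions { MaxItemCount = -1 }).AsDocumentQuery();
var page = await query.ExecuteNextAsync();                      // results stay as JSON-backed dynamic objects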
Hope this helps in any way.
I've read through this excellent feedback on Azure Search. However, I have to question one of the answers to question #1 from that list a bit more explicitly...
...When you index data, it is not available for querying immediately.
...Currently there is no mechanism to control concurrent updates to the same document in an index.
Eventual consistency is fine - I perform a few updates and eventually I will see my updates on read/query.
However, no guarantee on the ordering of updates is really problematic. Perhaps I'm misunderstanding. Let's assume this basic scenario:
1) update index entry E.fieldX w/ foo at time 12:00:01
2) update index entry E.fieldX w/ bar at time 12:00:02
From what I gather, it's entirely possible that E.fieldX will contain "foo" after all updates have been processed?
If that is true, it seems to severely limit the applicability of this product.
Currently, Azure Search does not provide document-level optimistic concurrency, primarily because the overwhelming majority of scenarios don't require it. Please vote for the External Version UserVoice suggestion to help us prioritize this ask.
One way to manage data ingress concurrency today is to use Azure Search indexers. Indexers guarantee that they will process only the current version of a source document at each point of time, removing potential for races.
Ordering is unknown if you issue multiple concurrent requests, since you cannot predict in which order they'll reach the server.
If you issue indexing batches in sequence (that is, start the second batch only after you have seen an ACK from the service for the first batch), you shouldn't see reordering.
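A sketch of that sequential pattern with the Microsoft.Azure.Search .NET SDK (the index client, document type, and batches collection are placeholders):

foreach (IEnumerable<MyDocument> docs in batches)
{
    var batch = IndexBatch.MergeOrUpload(docs);
    // Await the service's acknowledgement before sending the next batch;
    // this preserves per-document update order across batches.
    await indexClient.Documents.IndexAsync(batch);
}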
I am getting the following error while executing an MKS Integrity query:
Cannot show view information: Your query was stopped because it was using too many system resources.
Your query is likely taking longer than the time allotted to queries by the Integrity server. By default this value is 15 seconds. This usually indicates that your query is very broad, or that an index needs to be created in the database to help increase the performance of the query. The latter requires the assistance of your database administrator.
DISCLAIMER: I am employed by the PTC Integrity Business Unit (formerly MKS).
One thing you can check is whether your query could return a very large list of items. Try adding more restrictive filters first and then easing them step by step. At least this was my use case :)
Try to use filters as much as you can; filtering cuts down on unnecessary results.