CreateContainerIfNotExistsAsync is slower than GetContainer? - azure

I am using the Azure Cosmos DB SDK v3. As you know, the SDK supports CreateContainerIfNotExistsAsync, which creates a container if no container matches the provided container id. This is convenient.
But it pings Cosmos DB to check whether the container exists, whereas GetContainer doesn't, because GetContainer assumes the container exists. So CreateContainerIfNotExistsAsync would need one more round trip to Cosmos DB for most operations, if my understanding is correct.
So my question is: would it be better to avoid using CreateContainerIfNotExistsAsync as much as possible from an API perspective? The API would then have better latency and use less bandwidth.

The difference is explained in the IntelliSense: GetContainer just returns a proxy object, one that simply gives you the ability to execute operations within that container; it performs no network requests. If, for example, you try to read an item (ReadItemAsync) on that proxy and the container does not exist (which also makes the item non-existent), you will get a 404 response.
CreateContainerIfNotExists is also not recommended for hot path operations as it involves a metadata or management plane operation:
Retrieve the names of your databases and containers from configuration or cache them on start. Calls like ReadDatabaseAsync or ReadDocumentCollectionAsync and CreateDatabaseQuery or CreateDocumentCollectionQuery will result in metadata calls to the service, which consume from the system-reserved RU limit. CreateIfNotExist should also only be used once for setting up the database. Overall, these operations should be performed infrequently.
See https://learn.microsoft.com/azure/cosmos-db/sql/best-practice-dotnet for more details
Bottom line: unless you expect the container to be deleted due to some logical pathway in your application, GetContainer is the right way. It gives you a proxy object that you can use to execute item operations without any network requests.
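One way to picture the difference is to count round trips. The sketch below is a toy model in Python, not the Cosmos DB SDK: FakeService, ContainerProxy, get_container and create_container_if_not_exists are hypothetical stand-ins that only mimic the behaviour described above (building a proxy costs no request; create-if-not-exists costs a metadata read, plus a create when the container is missing).

```python
class FakeService:
    """Toy stand-in for the database service; counts simulated network calls."""

    def __init__(self):
        self.containers = {}      # container id -> {item id -> item}
        self.metadata_calls = 0   # management-plane requests
        self.data_calls = 0       # item (data-plane) requests


class ContainerProxy:
    """Local handle: holds a name and a service reference, nothing else."""

    def __init__(self, service, container_id):
        self.service = service
        self.container_id = container_id

    def read_item(self, item_id):
        self.service.data_calls += 1
        container = self.service.containers.get(self.container_id)
        if container is None or item_id not in container:
            raise LookupError("404: item or container not found")
        return container[item_id]


def get_container(service, container_id):
    # No simulated network request: just build the proxy.
    return ContainerProxy(service, container_id)


def create_container_if_not_exists(service, container_id):
    # One metadata read to check existence, plus a create when missing.
    service.metadata_calls += 1
    if container_id not in service.containers:
        service.metadata_calls += 1
        service.containers[container_id] = {}
    return ContainerProxy(service, container_id)


service = FakeService()

# Start-up: pay the metadata cost once.
create_container_if_not_exists(service, "orders")
service.containers["orders"]["42"] = {"id": "42"}
startup_metadata_calls = service.metadata_calls

# Hot path: building a proxy per request adds no metadata calls.
for _ in range(1000):
    get_container(service, "orders").read_item("42")

hot_path_metadata_calls = service.metadata_calls - startup_metadata_calls
```

In this model the 1000 hot-path reads add no metadata calls, while calling create_container_if_not_exists on every request would add at least one each time.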

Related

Maintain a distributed incremental counter in Azure cosmos DB

I am fairly new to Cosmos DB and was trying to understand the increment operation that the Azure Cosmos DB SDK for Java provides for patching a document.
I have a requirement to maintain an incremental counter in one of the Documents in the container. The document looks like this-
{"counter": 1}
Now, from my application, I want to increment this counter by 1 every time an action happens. For this I am using CosmosPatchOperations. I add an increment like this: cosmosPatch.increment("/counter", 1), which works fine.
This application can have multiple instances running, all of them talking to the same document in the Cosmos container. So App1 and App2 could both trigger an increment at the same time. The SDK method returns the updated document, and I need to use that updated value.
My question is: does Cosmos DB employ some locking mechanism to make sure the patches happen one after another? And in this case, what updated value would I get in App1 and App2? Will it be 2 in one of them and 3 in the other?
Couchbase supports such a counter at the cluster level, as explained here, and it has been working perfectly for me without any concurrency issues. I am now migrating to Cosmos DB and have been struggling to find out how this can be achieved.
Update 1:
I decided to test this. I set up the Cosmos emulator on my local Mac and created a DB and container with automatically increasing RUs starting from 1 to 10K. Then in this container I added a document like this -
{
"id": "randomId",
"counter": 0
}
After this, I created a simple API whose only responsibility is to increment the counter by 1 every time it is invoked. Then I used Locust to invoke this API multiple times to mimic a small load-like scenario.
Initially the test ran fine, with each invocation receiving a counter value in an incremental manner, as it is supposed to. On increasing the load I saw some errors, namely RequestTimeOutException with status code 408. Other requests were still working fine and getting the correct counter value. I do not understand what caused the RequestTimeOut exceptions here. The stack trace hints at something to do with concurrency, but I am not able to get my head around it. Here's the stack trace-
Update 2:
The test run in Update 1 was done on my local machine, and I realised I might have resource issues on my local machine leading to those errors. I decided to test this in a Pre-Prod environment with an actual Cosmos DB account and not the emulator.
Test configuration-
Cosmos DB container with RUs to automatically scale from 400 to 4000
2 instances of application sharing the load.
Locust script to ingest load on the application
Findings-
Up until ~170 TPS, everything was running smoothly. Beyond that I noticed errors belonging to 2 different buckets-
"exception": "["Request rate is large. More Request Units may be needed, so no changes were made. Please retry this request later. Learn more: http://aka.ms/cosmosdb-error-429"]".
I am not sure how 170 odd patch operations would have exhausted 4000 RUs but that's a different discussion altogether.
"exception": "["Conflicting request to resource has been attempted. Retry to avoid conflicts."]", with status code 449.
This error seems to indicate that Cosmos DB doesn't handle concurrent requests. I want to understand whether it maintains a queue internally to handle some requests or doesn't handle concurrent writes at all.
PATCH is no different from other operations. Fundamentally, Cosmos DB implements Optimistic Concurrency Control (OCC) rather than the locking mechanisms that relational databases have. OCC allows you to prevent lost updates and to keep your data correct. It can be implemented by using the ETag of a document: each document within Azure Cosmos DB has an _etag property.
In your scenario, yes, it will return 2 in one of them and 3 in the other, provided both succeed, because the SDK has a retry mechanism, as explained here. Also have a look at this sample.
If your Azure Cosmos DB account is configured with multiple write
regions, conflicts and conflict resolution policies are applicable at
the document level, with Last Write Wins (LWW) being the default
conflict resolution policy
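
The ETag-based OCC described in this answer can be sketched as a read, modify, conditional-write loop. This is a minimal in-memory simulation in Python, not the Cosmos DB SDK: InMemoryStore, PreconditionFailed and increment_with_retry are hypothetical names, and the retry loop is written out by hand here purely for illustration.

```python
import threading
import uuid


class PreconditionFailed(Exception):
    """Raised when the caller's etag no longer matches the stored one."""


class InMemoryStore:
    """Hypothetical single-document store that checks an etag on every write."""

    def __init__(self, doc):
        self._lock = threading.Lock()  # models the server applying one write at a time
        self._doc = dict(doc)
        self._etag = uuid.uuid4().hex

    def read(self):
        with self._lock:
            return dict(self._doc), self._etag

    def replace(self, new_doc, if_match):
        with self._lock:
            if if_match != self._etag:
                raise PreconditionFailed()
            self._doc = dict(new_doc)
            self._etag = uuid.uuid4().hex  # every successful write changes the etag
            return dict(self._doc)


def increment_with_retry(store, max_attempts=1000):
    """Read, modify, conditionally write; retry when another writer got in first."""
    for _ in range(max_attempts):
        doc, etag = store.read()
        doc["counter"] += 1
        try:
            return store.replace(doc, if_match=etag)["counter"]
        except PreconditionFailed:
            continue
    raise RuntimeError("gave up after too many conflicts")


store = InMemoryStore({"id": "randomId", "counter": 1})
results = []
results_lock = threading.Lock()


def worker():
    value = increment_with_retry(store)
    with results_lock:
        results.append(value)


threads = [threading.Thread(target=worker) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(sorted(results))  # [2, 3]
```

Neither writer holds a lock across its read and write; the loser of a race simply sees a stale etag and tries again, which is why one caller ends up with 2 and the other with 3.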

Dealing with Azure Cosmos DB cross-partition queries in REST API

I'm talking to Cosmos DB via the (SQL) REST API, so existing questions that refer to various SDKs are of limited use.
When I run a simple query on a partitioned container, like
select value count(1) from foo
I run into an HTTP 400 error:
The provided cross partition query can not be directly served by the gateway. This is a first chance (internal) exception that all newer clients will know how to handle gracefully. This exception is traced, but unless you see it bubble up as an exception (which only happens on older SDK clients), then you can safely ignore this message.
How can I get rid of this error? Is it a matter of running separate queries by partition key? If so, would I have to keep track of what the existing key values are?

Is there a way to read a database link in Cosmos DB Java V4 API?

For example, reading "dbs/colls/document" instead of getting a container, then calling read on the container.
I've been having an issue where the first readItem on a container (after calling database.getContainer(x)) is extremely slow (like 1 second or longer) and was thinking using a database link could be faster.
I'm guessing a read after getting the container is slow because it doesn't make a service call until I call read.
Is there a way I can have this preloaded when reading in a database?
I have an application with a read(collectionName, key) method, and my approach was to use getContainer(collectionName) and then call read on that, but this method needs to be fast.
As discussed, the best practice is to keep an instance of your container alive between requests and call readItem on each request. This should resolve the primary issue.
As for the secondary concern, the "high latency every 50 requests or so": this is a known issue, but it should only occur in the first minute or so of operation. If you can tolerate the initial slow requests, the solution is to wait for performance to stabilize. How long do you have to run your app before you no longer see these high-latency requests?
FYI, if latency is a concern, run your client application in a geographically colocated Azure VM. Also a good rule of thumb is to allocate client CPU cores such that CPU utilization is not more than 40% or 50%.
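Keeping the container instance alive between requests could look roughly like the sketch below. It is a Python toy, not the Java V4 SDK: SlowFirstUseContainer and ContainerCache are hypothetical, and the one-off warm-up on first use is an assumption made only to mirror the slow first readItem described in the question.

```python
class SlowFirstUseContainer:
    """Hypothetical handle whose first read pays a one-off warm-up cost."""

    def __init__(self, name):
        self.name = name
        self.warmups = 0
        self._warm = False

    def read_item(self, key):
        if not self._warm:
            self.warmups += 1  # stands in for the slow first request
            self._warm = True
        return {"id": key, "container": self.name}


class ContainerCache:
    """Keeps one handle per container name alive between requests."""

    def __init__(self, factory):
        self._factory = factory
        self._handles = {}

    def get(self, name):
        handle = self._handles.get(name)
        if handle is None:
            handle = self._factory(name)
            self._handles[name] = handle
        return handle


cache = ContainerCache(SlowFirstUseContainer)


def read(collection_name, key):
    # Reuse the cached handle instead of building a new one per request.
    return cache.get(collection_name).read_item(key)


for i in range(100):
    read("users", str(i))

warmups = cache.get("users").warmups
```

With the handle reused, only the first of the 100 reads pays the warm-up in this model. The dictionary is not guarded by a lock, so a multi-threaded service would need to add one.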

Can I cache a single value in Azure Functions without any negative effects?

I have an Azure Function on a timer that activates every minute, which calls an API which returns an integer value. I store this value in SQL.
I then have another Azure Function that can be queried by the user to retrieve the integer value. This query could in theory be as high as hundreds or thousands of times per second.
Rather than have the second Azure Function query SQL every single time it gets a request, I would like it to cache the value in memory. If the cache were perfect there would be no need for SQL at all, but because Functions can scale up and also seem to lose their cache periodically there has to be some persistent storage.
Is it just a case of a static variable within the function to cache the value, and another with the last date retrieved? Or is there another type of caching that I can use within the function?
I understand there are solutions such as Redis but they seem pretty overkill to spin up just for a single integer value. I'm also not even sure if Azure SQL itself would cache the value when it's requested.
My question is, would a static variable work (if it's null/reset then we'd just do a quick SQL query to get the value) and actually persist? Or does an alternative like redis or similar exist that wouldn't be overkill for this application? And finally, is there actually any harm (performance problems) in hammering SQL over and over to retrieve a single value (i.e. is it clever enough to cache already so there's not a significant performance hit vs. querying a variable in memory)?
It really depends. If you understand the limitations of using an in-memory cache in an Azure Function, and your business case is fine with those limitations, you should use it.
The main thing is that you can't invalidate the cache.
So, for example, if your number changes, the cache may not be usable for you. You will have cases where a container for your Azure Function is spinning and has an old value. The same user could get different values on each request, because who knows which instance they will hit and what that instance is caching.
If your number is something that is set only once and doesn't change, you don't have this issue.
And another important thing is that you still make quite a few requests just to cache it. Every new container will have to cache it for itself, while centralized cache would do it only once. This can be fine for something smaller, but if the thing you're caching really takes significant amount of time, or if the resources of the service are super limited, you would be a lot more efficient with centralized cache.
No matter what, caching in Azure Function level still reduces the load, and there's no reason to make requests when you don't have to.
To answer your SQL question: yes, most likely the SQL server will cache it too, but your Azure Function still needs to establish a connection to the SQL server, make the request, and close the connection.
Azure Functions best practices state that functions should be stateless and that your state information should be with the data. I think Redis is still a better option than SQL.
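The in-memory approach from the question (a cached value plus the time it was last retrieved, falling back to SQL when empty or old) could be sketched as below. This is plain Python with hypothetical names: CachedValue is not an Azure Functions API, fake_sql_query stands in for the real SQL call, and the 60-second lifetime is an assumption chosen to match the one-minute timer.

```python
import time


class CachedValue:
    """Holds one value in memory and reloads it once it is older than the TTL."""

    def __init__(self, loader, ttl_seconds, clock=time.monotonic):
        self._loader = loader
        self._ttl = ttl_seconds
        self._clock = clock
        self._value = None
        self._loaded_at = None  # None means nothing has been cached yet

    def get(self):
        now = self._clock()
        if self._loaded_at is None or now - self._loaded_at >= self._ttl:
            self._value = self._loader()  # cache miss or expired: go to storage
            self._loaded_at = now
        return self._value


# Demo with a fake clock and a fake SQL query so the behaviour is repeatable.
queries = []       # records each simulated SQL round trip
fake_time = [0.0]


def fake_sql_query():
    queries.append(fake_time[0])
    return 7


cache = CachedValue(fake_sql_query, ttl_seconds=60.0, clock=lambda: fake_time[0])
first = cache.get()    # cold start: queries SQL
second = cache.get()   # served from memory
fake_time[0] = 61.0
third = cache.get()    # older than the TTL: queries SQL again
```

A lifetime like this does not invalidate anything across instances; it only bounds how stale a single instance can get, and each new instance still performs its own first load.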

DynamoDB Application Architecture

We are using DynamoDB with node.js and Express to create REST APIs. We have started to go with Dynamo on the backend, for simplicity of operations.
We have started to use the DynamoDB Document SDK from AWS Labs to simplify usage, and make it easy to work with JSON documents. To instantiate a client to use, we need to do the following:
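// Steps 1 and 2: load the AWS SDK and the DynamoDB Document SDK (assigned without var, so they are app-wide globals).
AWS = require('aws-sdk');
Doc = require("dynamodb-doc");
// Steps 3 and 4: create the low-level client, then wrap it in the document client.
var Dynamodb = new AWS.DynamoDB();
var DocClient = new Doc.DynamoDB(Dynamodb);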
My question is, where do those last two steps need to take place, in order to ensure data integrity? I’m concerned about an object that is waiting for something to happen in Dynamo being taken over by another process and getting the data swapped, resulting in incorrect data being sent back to a client, or incorrect data being written to the database.
We have three parts to our REST API. We have the main server.js file, that starts express and the HTTP server, and assigns resources to it, sets up logging, etc. We do the first two steps of creating the connection to Dynamo, creating the AWS and Doc requires, at that point. Those vars are global in the app. We then, depending on the route being followed through the API, call a controller that parses up the input from the rest call. It then calls a model file, that does the interacting with Dynamo, and provides the response back to the controller, which formats the return package along with any errors, and sends it to the client. The model is simply a group of methods that essentially cover the same area of the app. We would have a user model, for instance, that covers things like login and account creation in an app.
I have done the last two steps above for creating the Dynamo object in two places. In the first, I simply placed them in one spot, at the top of each model file; I do not reinstantiate them in the methods below, I simply use them. In the second, I instantiated them within the methods, when we are preparing to make the call to Dynamo, making them entirely local to the method, and passed them to a secondary function if needed. This second method has always struck me as the safest way to do it. However, under load testing, I have run into situations where we seem to have overwhelmed the outgoing network connections, and I start getting errors telling me that the DynamoDB endpoint is unavailable in the region I’m running in. I believe this is from the additional calls required to make the connections.
So, the question is, is creating those objects local to the model file, safe, or do they need to be created locally in the method that uses them? Any thoughts would be much appreciated.
You should be safe creating just one instance of those clients and sharing them in your code, but that isn't related to your underlying concern.
Concurrent access to various records in DynamoDB is still something you have to deal with. It is possible to have different requests attempt writes to the object at the same time. This is possible if you have concurrent requests on a single server, but is especially true when you have multiple servers.
Writes to DynamoDB are atomic only at the level of an individual item. This means that if your logic requires multiple updates to separate items, potentially in separate tables, there is no way to guarantee that all or none of those changes are made. It is possible that only some of them are made.
DynamoDB natively supports conditional writes so it is possible to ensure specific conditions are met, such as specific attributes still have certain values, otherwise the write will fail.
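The idea behind a conditional write can be sketched with a small in-memory table. This is a Python illustration, not the DynamoDB API: ItemTable, its expected argument and ConditionalCheckFailed are hypothetical names, and the version attribute is an assumed convention for detecting stale writes.

```python
class ConditionalCheckFailed(Exception):
    """Raised when the expected attribute values no longer hold."""


class ItemTable:
    """Hypothetical in-memory table keyed by item id."""

    def __init__(self):
        self._items = {}

    def get(self, key):
        item = self._items.get(key)
        return dict(item) if item is not None else None

    def put(self, key, item, expected=None):
        """Write the item only if every attribute in `expected` still matches."""
        current = self._items.get(key, {})
        for attr, value in (expected or {}).items():
            if current.get(attr) != value:
                raise ConditionalCheckFailed(attr)
        self._items[key] = dict(item)


table = ItemTable()
table.put("user-1", {"email": "a@example.com", "version": 1})

# Two requests read the same item before either of them writes.
seen_by_a = table.get("user-1")
seen_by_b = table.get("user-1")

# Request A writes first and bumps the version.
table.put("user-1", {"email": "a2@example.com", "version": 2},
          expected={"version": seen_by_a["version"]})

# Request B's condition is now stale, so its write is rejected.
try:
    table.put("user-1", {"email": "b@example.com", "version": 2},
              expected={"version": seen_by_b["version"]})
    b_rejected = False
except ConditionalCheckFailed:
    b_rejected = True
```

The rejected request can re-read the item and decide whether to retry, so the protection comes from the condition on the write rather than from where the client object is created.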
With respect to making too many requests to DynamoDB: unless you are overwhelming your machine, there shouldn't be any way to overwhelm the DynamoDB API. If you are performing more reads/writes than you have provisioned, you will receive errors indicating that provisioned throughput has been exceeded, but the API itself is still functioning as intended under these conditions.
