Cosmos DB is very slow with the newer library compared to the old one - Azure

I am in the process of upgrading my Azure Function application from V3 to V4. In doing so I am also upgrading from the older, no-longer-supported Microsoft.Azure.DocumentDB (v2.18.0) to the newest Microsoft.Azure.Cosmos (3.32), as per the recommendations. The problem is that a basic get request now takes almost 3 times as long, and every single request shows up as a query rather than a read.
An example is below where we are calling the provided ReadItemAsync(id, partitionKey, options, token). The payload returned is about 589 bytes. The resulting diagnostics show that it takes 400 - 900 ms to return!
This cannot stand.
I am at a loss as to how to fix this issue. If every get is going to take 500 - 1000 ms and I only want to run through 26 items, that is going to take almost 25 seconds. How can this be the case? This is crazy bad. Each pass through my method does a get, a save and an upsert. On the old library this took about 300 ms per iteration; on 3.31.2 it is taking > 1500 ms.
I have no idea where or how to fix a request that takes 460 ms when it is running in Azure.
The raw data call looks like this:
response = await _database.GetContainer(containerId)
    .ReadItemAsync<T>(id, partitionKey, null, cancellationToken);
LastQueryUsage = response.RequestCharge;
return response;
Diagnostics Dump from the above Read request:
{
  "Summary": {
    "DirectCalls": {
      "(200, 0)": 1
    },
    "GatewayCalls": {
      "(200, 0)": 3,
      "(304, 0)": 1
    }
  },
  "name": "ReadItemAsync",
  "id": "0add6a37-9928-4145-aed1-b29e910e22f3",
  "start time": "12:55:11:446",
  "duration in milliseconds": 928.666
  // reduced for brevity in light of the initial answer
}
----------EDITS AFTER MARK'S RESPONSE: ----------
I am still seeing bad performance on my test collections. Very bad.
I spun up a brand new Azure Functions V4 .NET 6 isolated project.
public class CosmosSingleTonConnection
{
    private static TestSettings _settings = new TestSettings();

    private static readonly List<(string, string)> containers = new()
    {
        ("myDb", "col1"),
        ("myDb", "col2")
    };

    private static CosmosClient cosmosClient;
    private static Container Raw;
    private static Container State;

    public Container Container1 => Raw;
    public Container Container2 => State;

    public CosmosSingleTonConnection(IOptions<TestSettings> settings)
    {
        _settings = settings.Value;
        cosmosClient = InitializeCosmosClient(_settings.Key, _settings.Endpoint);
        Raw = cosmosClient.GetDatabase("myDb").GetContainer("col1");
        State = cosmosClient.GetDatabase("myDb").GetContainer("col2");
    }

    private CosmosClient InitializeCosmosClient(string key, string endpoint)
    {
        // Blocks on the async initialization; this runs once at startup.
        return CosmosClient
            .CreateAndInitializeAsync(accountEndpoint: endpoint, authKeyOrResourceToken: key, containers: containers, null, CancellationToken.None)
            .Result;
    }
}
---Program.cs ---
var host = new HostBuilder()
    .ConfigureFunctionsWorkerDefaults(builder =>
    {
        builder
            .AddApplicationInsights(opt => { opt.EnableHeartbeat = true; })
            .AddApplicationInsightsLogger();
    })
    .ConfigureServices(DoConfiguration)
    .Build();

void DoConfiguration(IServiceCollection services)
{
    services.AddOptions<TestSettings>()
        .Configure<IConfiguration>((settings, configuration) => { configuration.Bind(settings); });
    services.AddSingleton<CosmosSingleTonConnection>();
    services.AddScoped<IDoStuffService, DoStuffService>();
}

host.Run();
---DoStuffService---
private readonly CosmosSingleTonConnection _db;

public DoStuffService(CosmosSingleTonConnection db)
{
    _db = db;
}

public FeedIterator<ObjectDTO> QueryLast30(string sensor)
{
    string top30 = "Select * from Col1 r Where r.partitionKey = @partitionKey"; // + " Order by r.DateTimeCreatedUtc"
    QueryRequestOptions ops = new QueryRequestOptions()
    {
        PartitionKey = new PartitionKey(sensor)
    };
    var query = new QueryDefinition(top30).WithParameter("@partitionKey", sensor);
    // Do not dispose the iterator here; the caller drains it.
    FeedIterator<ObjectDTO> feed = _db.Container1.GetItemQueryIterator<ObjectDTO>(queryDefinition: query, requestOptions: ops);
    return feed;
}
---The FUNCTION ---
private readonly ILogger _logger;
private readonly IDoStuffService Service;

public Function1(ILoggerFactory loggerFactory, IDoStuffService service)
{
    _logger = loggerFactory.CreateLogger<Function1>();
    Service = service;
}

[Function("Function1")]
public async Task<HttpResponseData> RunAsync([HttpTrigger(AuthorizationLevel.Function, "get", "post")] HttpRequestData req)
{
    var response = req.CreateResponse(HttpStatusCode.OK);
    List<string> responseTimes = new();
    for (int i = 0; i < 10; i++)
    {
        var feed = Service.QueryLast30("01020001");
        while (feed.HasMoreResults)
        {
            FeedResponse<ObjectDTO> fr = await feed.ReadNextAsync();
            responseTimes.Add(fr.Diagnostics.GetClientElapsedTime().TotalMilliseconds.ToString());
        }
    }
    response.WriteString(string.Join(" | ", responseTimes));
    return response;
}
----Initial plus subsequent requests----
Is this as good as it can get? Because this is not good if I have to do 4 atomic operations against cosmos per iteration.
459.3067 | 86.5555 | 421.989 | 81.4663 | 426.62 | 81.7712 | 82.6038 | 78.9875 | 81.0167 | 79.0283
201.5111 | 86.7607 | 79.1739 | 83.5416 | 79.2815 | 80.5983 | 79.8568 | 83.7092 | 79.7441 | 79.3132
81.8724 | 79.7575 | 91.6382 | 80.5015 | 81.7875 | 87.2023 | 79.3385 | 78.3251 | 78.3159 | 79.2731
82.8567 | 81.5768 | 81.6155 | 81.535 | 81.5871 | 79.2668 | 79.6522 | 78.9888 | 79.2734 | 80.0451
81.1635 | 88.578 | 111.7357 | 84.9948 | 80.207 | 81.2129 | 79.9344 | 80.1654 | 79.4129 | 82.7971

This is likely due to metadata requests on first access of your container object or due to establishing the connection to your container on the first call to it.
Best practice is to use a singleton CosmosClient, cache the database and container objects, and keep references to these objects in scope for the lifetime of your application.
If you are creating and destroying these references on every invocation, your app will suffer high latency as the Cosmos client fetches metadata from the master partition and establishes connections to the service. Once these references and connections are in place (after the first call to the service), all subsequent calls will be fast.
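As a rough illustration only (the COSMOS_ENDPOINT/COSMOS_KEY setting names and the client options are assumptions, not taken from your app), the singleton registration in an isolated-worker Program.cs could look something like this:

using System;
using Microsoft.Azure.Cosmos;
using Microsoft.Extensions.DependencyInjection;
using Microsoft.Extensions.Hosting;

var host = new HostBuilder()
    .ConfigureFunctionsWorkerDefaults()
    .ConfigureServices(services =>
    {
        // One CosmosClient for the lifetime of the app; it owns the connections
        // and the partition/address metadata caches.
        services.AddSingleton(_ => new CosmosClient(
            Environment.GetEnvironmentVariable("COSMOS_ENDPOINT"), // assumed setting name
            Environment.GetEnvironmentVariable("COSMOS_KEY"),      // assumed setting name
            new CosmosClientOptions { ApplicationName = "MyFunctionApp" }));

        // Cache the container reference as well so functions just inject Container.
        services.AddSingleton(sp =>
            sp.GetRequiredService<CosmosClient>().GetContainer("myDb", "col1"));
    })
    .Build();

host.Run();

With this in place, the metadata and connection warm-up cost is paid once at startup (or on the first call) rather than on every invocation.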
It is also a good idea to take a quick look at our .NET v3 migration guidance and these performance tips:
Migrate your application to use the Azure Cosmos DB .NET SDK v3
Best practices for Azure Cosmos DB .NET SDK Checklist
Performance tips for Azure Cosmos DB and .NET
Manage Connections for Azure Functions

Related

Cloud Run PubSub high latency

I'm building a microservice application consisting of many microservices built with Node.js and running on Cloud Run. I use PubSub in several different ways:
For streaming data daily. The microservices responsible for gathering analytical data from different advertising services (Facebook Ads, LinkedIn Ads, etc.) use PubSub to stream data to a microservice responsible for uploading data to Google BigQuery. There also are services that stream a higher load of data (> 1 Gb) from CRMs and other services by splitting it into smaller chunks.
For messaging among microservices about different events that don't require an immediate response.
Earlier, I experienced some insignificant latency with PubSub. I know there is an open issue about latency of up to several seconds at low message throughput. But in my case, we are talking about several minutes of latency.
Also, I occasionally get an error message
Received error while publishing: Total timeout of API google.pubsub.v1.Publisher exceeded 60000 milliseconds before any response was received.
In this case the message is either not sent at all or is heavily delayed.
This is what my code looks like.
const subscriptions = new Map<string, Subscription>();
const topics = new Map<string, Topic>();

const listenForMessages = async (
  subscriptionName: string,
  func: ListenerCallback,
  secInit = 300,
  secInter = 300
) => {
  let logger = new TestLogger("LISTEN_FOR_MSG");
  let init = true;

  const _setTimeout = () => {
    let timer = setTimeout(() => {
      console.log(`Subscription to ${subscriptionName} cancelled`);
      subscription.removeListener("message", messageHandler);
    }, (init ? secInit : secInter) * 1000);
    init = false;
    return timer;
  };

  const messageHandler = async (msg: Message) => {
    msg.ack();
    await func(JSON.parse(msg.data.toString()));
    // wait for next message: clear the previous idle timer before starting a new one
    clearTimeout(timeout);
    timeout = _setTimeout();
  };

  let subscription: Subscription;
  if (subscriptions.has(subscriptionName)) {
    subscription = subscriptions.get(subscriptionName);
  } else {
    subscription = pubSubClient.subscription(subscriptionName);
    subscriptions.set(subscriptionName, subscription);
  }

  let timeout = _setTimeout();
  subscription.on("message", messageHandler);
  console.log(`Listening for messages: ${subscriptionName}`);
};

const publishMessage = async (
  data: WithAnyProps,
  topicName: string,
  options?: PubOpt
) => {
  const serializedData = JSON.stringify(data);
  const dataBuffer = Buffer.from(serializedData);
  try {
    let topic: Topic;
    if (topics.has(topicName)) {
      topic = topics.get(topicName);
    } else {
      topic = pubSubClient.topic(topicName, {
        batching: {
          maxMessages: options?.batchingMaxMessages,
          maxMilliseconds: options?.batchingMaxMilliseconds,
        },
      });
      topics.set(topicName, topic);
    }
    let msg = {
      data: dataBuffer,
      attributes: options?.attributes, // options is optional, so guard the access
    };
    await topic.publishMessage(msg);
    console.log(`Publishing to ${topicName}`);
  } catch (err) {
    console.error(`Received error while publishing: ${err.message}`);
  }
};
The listenForMessages function is triggered by an HTTP request.
What I have already checked
PubSub client is created only once outside the function.
Topics and Subscriptions are reused.
I made at least one instance of each container running to eliminate the possibility of delays triggered by cold start.
I tried to increase the CPU and Memory capacity of containers.
batchingMaxMessages and batchingMaxMilliseconds are set to 1
I checked that the latest version of @google-cloud/pubsub is installed.
Notes
High latency problem occurs only in the cloud environment. With local tests, everything works well.
Timeout error sometimes occurs in both environments.
The problem was in my understanding of the Cloud Run container lifecycle. I used to send an HTTP 202 response while PubSub kept working in the background. After the response was sent, the container switched to the idling state, which looked like high latency in my logs.
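For illustration, a minimal sketch of the corrected flow, assuming an Express handler (the /publish route and my-topic name are made up): the publish is awaited before the response is sent, so the container is never idled while work is still in flight.

import express from "express";
import { PubSub } from "@google-cloud/pubsub";

const app = express();
app.use(express.json());
const pubSubClient = new PubSub();

app.post("/publish", async (req, res) => {
  try {
    // Finish the PubSub work *before* responding; after the response is sent,
    // a Cloud Run container may be throttled to idle and background work stalls.
    await pubSubClient
      .topic("my-topic") // made-up topic name
      .publishMessage({ data: Buffer.from(JSON.stringify(req.body)) });
    res.status(202).send("queued");
  } catch (err) {
    res.status(500).send((err as Error).message);
  }
});

app.listen(Number(process.env.PORT) || 8080);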

Does a NodeJS app have to be purely stateless in order for it to be replicated?

In my Node application, there are Values that the User can define. These Values, once created, can change, either from a user-triggered action or from something else, for example an MQTT message received on the server. A Value can change very sporadically or a few times per second.
class Value {
  constructor(valueStore, id, name) {
    this.valueStore = valueStore;
    this.id = id;
    this.name = name;
  }

  // ...
}
Because some Values can change many times per second, I don't save the Values to MongoDB Atlas every time they change (data transfer costs are quite expensive). Instead, I have a "ValueStore", which is basically a global object where all my Values are stored with their current value. In case my app goes down, I save the contents of the ValueStore to Atlas every 5 seconds, which is much less expensive.
// This is a global object
class ValueStore {
  constructor() {
    this.values = [];
    this.bufferUpdateInterval = setInterval(() => {
      // Every 5s, save values in the ValueStore collection
    }, VALUES_UPDATE_DB_FREQUENCY);
  }

  setValue(value) {
    this.value = value;
  }

  // ...
}
I haven't yet implemented zero-downtime deployment. When I update my application, I have to bring the app down. When I put it back up, I want all the Values to be initialized with the Values they had when I put the app down. So, before anything else, I have to query the collection in which I save my Values every 5 seconds in order to reinstantiate my ValueStore.
// When my app starts:
const initializeValueStore = async (streams, variables) => {
  const values = await Values.getLastValues(); // call to get the last known Values
  for (let i = 0; i < values.length; i++) {
    const lastKnownValue = // ...
    valueStore.addValue(valueStore, values[i].id, values[i].name, lastKnownValue);
  }
}
Now, I want to scale my application and implement patterns such as zero-downtime deployment, which implies replicating my app across several nodes. The more I think about it, the more I am under the impression that I won't be able to do that until I make my app stateless (i.e. until I get rid of the ValueStore).
Am I right to think that my app should be stateless in order for it to be replicated and if so, how could I do differently what I'm currently doing with my ValueStore? Could a Redis cache come into play?
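What I have in mind would be roughly the sketch below (assuming the ioredis client and a REDIS_URL setting, neither of which I use today), where every replica reads and writes the shared store instead of its own memory:

import Redis from "ioredis";

// Hypothetical externalized ValueStore: state lives in Redis, not in the process.
const redis = new Redis(process.env.REDIS_URL ?? "redis://localhost:6379"); // assumed connection setting

async function setValue(id: string, value: unknown): Promise<void> {
  // One hash field per Value keeps each write small and atomic.
  await redis.hset("values", id, JSON.stringify(value));
}

async function getAllValues(): Promise<Record<string, unknown>> {
  const raw = await redis.hgetall("values");
  return Object.fromEntries(
    Object.entries(raw).map(([id, v]) => [id, JSON.parse(v)])
  );
}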

Tune Redis connection inside an Azure Function to prevent timeouts

TL;DR
How does one amend the min number of threads for redis within an Azure Function?
Problem
I have an Azure Function that uses Redis (via the StackExchange.Redis package) to cache some values, or retrieve the existing value if it already exists. I'm currently getting timeout issues that look to be because the busy IOCP thread count exceeds the min IOCP thread value.
2016-09-08T11:52:44.492 Exception while executing function: Functions.blobtoeventhub. mscorlib: Exception has been thrown by the target of an invocation. StackExchange.Redis: Timeout performing SETNX 586:tag:NULL, inst: 1, mgr: Inactive, err: never, queue: 4, qu: 0, qs: 4, qc: 0, wr: 0, wq: 0, in: 260, ar: 0, clientName: RD00155D3AE265, IOCP: (Busy=8,Free=992,Min=2,Max=1000), WORKER: (Busy=7,Free=32760,Min=2,Max=32767), Local-CPU: unavailable (Please take a look at this article for some common client-side issues that can cause timeouts: https://github.com/StackExchange/StackExchange.Redis/tree/master/Docs/Timeouts.md).
According to the docs on timeouts, the resolution involves adjusting the MinThread count:
How to configure this setting:
In ASP.NET, use the "minIoThreads" configuration setting under the configuration element in machine.config. If you are running inside of Azure WebSites, this setting is not exposed through the configuration options. You should be able to set this programmatically (see below) from your Application_Start method in global.asax.cs.
Important Note: the value specified in this configuration element is a per-core setting. For example, if you have a 4 core machine and want your minIOThreads setting to be 200 at runtime, you would use .
Outside of ASP.NET, use the ThreadPool.SetMinThreads(…) API.
In an Azure Function a global.asax.cs file is not available, and I can find very little information on how to use ThreadPool.SetMinThreads there!
There is a similar question on webjobs that is unanswered.
My specifics
Redis = Azure Redis Cache Standard 1Gb
Azure Function = version 0.5
StackExchange.Redis = version 1.1.603
The Redis code is in a file separate from the main function.
using StackExchange.Redis;
using System.Text;

private static Lazy<ConnectionMultiplexer> lazyConnection = new Lazy<ConnectionMultiplexer>(() =>
{
    string redisCacheName = System.Environment.GetEnvironmentVariable("rediscachename", EnvironmentVariableTarget.Process).ToString();
    string redisCachePassword = System.Environment.GetEnvironmentVariable("rediscachepassword", EnvironmentVariableTarget.Process).ToString();
    return ConnectionMultiplexer.Connect(redisCacheName + ",abortConnect=false,ssl=true,password=" + redisCachePassword);
});

public static ConnectionMultiplexer Connection
{
    get
    {
        return lazyConnection.Value;
    }
}

static string depersonalise_value(string input, string field, int account_id)
{
    IDatabase cache = Connection.GetDatabase();
    string depersvalue = $"{account_id}:{field}:{input}";
    string value = $"{account_id}{Guid.NewGuid()}";
    bool created = cache.StringSet(depersvalue, value, when: When.NotExists);
    string retur = created ? value : cache.StringGet(depersvalue).ToString();
    return retur;
}
Ultimately, we needed to pursue @mathewc's answer and added a line to the connection multiplexer code to set the min threads to 500:
readonly static Lazy<ConnectionMultiplexer> lazyConnection =
    new Lazy<ConnectionMultiplexer>(() =>
    {
        ThreadPool.SetMinThreads(500, 500);
Additionally, further tuning was required, and the code was enhanced via an SO code review. The main point here is the drastic increase of the timeouts.
using StackExchange.Redis;
using System.Text;
using System.Threading;

readonly static Lazy<ConnectionMultiplexer> lazyConnection =
    new Lazy<ConnectionMultiplexer>(() =>
    {
        ThreadPool.SetMinThreads(500, 500);
        string redisCacheName = System.Environment.GetEnvironmentVariable("rediscache_name", EnvironmentVariableTarget.Process).ToString();
        string redisCachePassword = System.Environment.GetEnvironmentVariable("rediscache_password", EnvironmentVariableTarget.Process).ToString();
        return ConnectionMultiplexer.Connect(new ConfigurationOptions
        {
            AbortOnConnectFail = false,
            Ssl = true,
            ConnectRetry = 3,
            ConnectTimeout = 5000,
            SyncTimeout = 5000,
            DefaultDatabase = 0,
            EndPoints = { { redisCacheName, 0 } },
            Password = redisCachePassword
        });
    });

public static ConnectionMultiplexer Connection => lazyConnection.Value;

static string depersonalise_value(string input, string field, int account_id)
{
    IDatabase cache = Connection.GetDatabase();
    string depersvalue = $"{account_id}:{field}:{input}";
    string existingguid = (string)cache.StringGet(depersvalue);
    if (String.IsNullOrEmpty(existingguid))
    {
        string value = $"{account_id}{Guid.NewGuid()}";
        cache.StringSet(depersvalue, value);
        return value;
    }
    return existingguid;
}
We don't have a good way for you to perform app-level initialization like this currently. This is tracked by an issue in our repo here.
For now, your only real workaround is to put this init code into a shared helper that you invoke at the beginning of your function. The shared init method should have logic in it such that it is only performed once.
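For example, a minimal sketch of such a run-once helper (the 500/500 values are simply carried over from the snippet above) could be:

using System;
using System.Threading;

public static class AppInit
{
    // Lazy<T> guarantees the init delegate runs exactly once,
    // even if many function invocations race on the first call.
    private static readonly Lazy<bool> _done = new Lazy<bool>(() =>
    {
        ThreadPool.SetMinThreads(500, 500);
        return true;
    });

    public static void EnsureInitialized()
    {
        var _ = _done.Value;
    }
}

// At the top of the function body:
// AppInit.EnsureInitialized();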

StackExchange.Redis on Azure is throwing timeout performing get and no connection available exceptions

I recently switched an MVC application that serves data feeds and dynamically generated images (6k rpm throughput) from the v3.9.67 ServiceStack.Redis client to the latest StackExchange.Redis client (v1.0.450) and I'm seeing some slower performance and some new exceptions.
Our Redis instance is S4 level (13GB), CPU shows a fairly constant 45% or so and network bandwidth appears fairly low. I'm not entirely sure how to interpret the gets/sets graph in our Azure portal, but it shows us around 1M gets and 100k sets (appears that this may be in 5 minute increments).
The client library switch was straightforward and we are still using the v3.9 ServiceStack JSON serializer so that the client lib was the only piece changing.
Our external monitoring with New Relic shows clearly that our average response time increases from about 200ms to about 280ms between ServiceStack and StackExchange libraries (StackExchange being slower) with no other change.
We recorded a number of exceptions with messages along the lines of:
Timeout performing GET feed-channels:ag177kxj_egeo-_nek0cew, inst: 12, mgr: Inactive, queue: 30, qu=0, qs=30, qc=0, wr=0/0, in=0/0
I understand this to mean that there are a number of commands in the queue that have been sent but for which no response is available from Redis, and that this can be caused by long-running commands that exceed the timeout. These errors appeared for a period when our SQL database behind one of our data services was getting backed up, so perhaps that was the cause? After scaling out that database to reduce load we haven't seen very many more of this error, but the DB query should be happening in .NET and I don't see how that would hold up a Redis command or connection.
We also recorded a large batch of errors this morning over a short period (couple of minutes) with messages like:
No connection is available to service this operation: SETEX feed-channels:vleggqikrugmxeprwhwc2a:last-retry
We were used to transient connection errors with the ServiceStack library, and those exception messages were usually like this:
Unable to Connect: sPort: 63980
I'm under the impression that SE.Redis should be retrying connections and commands in the background for me. Do I still need to be wrapping our calls through SE.Redis in a retry policy of my own? Perhaps different timeout values would be more appropriate (though I'm not sure what values to use)?
Our redis connection string sets these parameters: abortConnect=false,syncTimeout=2000,ssl=true. We use a singleton instance of ConnectionMultiplexer and transient instances of IDatabase.
The vast majority of our Redis use goes through a Cache class, and the important bits of the implementation are below, in case we're doing something silly that's causing us problems.
Our keys are generally 10-30 or so character strings. Values are largely scalar or reasonably small serialized object sets (hundred bytes to a few kB generally), though we do also store jpg images in the cache so a large chunk of the data is from a couple hundred kB to a couple MB.
Perhaps I should be using different multiplexers for small and large values, probably with longer timeouts for larger values? Or a couple/few multiplexers in case one is stalled?
public class Cache : ICache
{
    private readonly IDatabase _redis;

    public Cache(IDatabase redis)
    {
        _redis = redis;
    }

    // storing this placeholder value allows us to distinguish between a stored null and a non-existent key
    // while only making a single call to redis. see Exists method.
    static readonly string NULL_PLACEHOLDER = "$NULL_VALUE$";

    // this is a dictionary of https://github.com/StephenCleary/AsyncEx/wiki/AsyncLock
    private static readonly ILockCache _locks = new LockCache();

    public T GetOrSet<T>(string key, TimeSpan cacheDuration, Func<T> refresh)
    {
        T val;
        if (!Exists(key, out val))
        {
            using (_locks[key].Lock())
            {
                if (!Exists(key, out val))
                {
                    val = refresh();
                    Set(key, val, cacheDuration);
                }
            }
        }
        return val;
    }

    private bool Exists<T>(string key, out T value)
    {
        value = default(T);
        var redisValue = _redis.StringGet(key);
        if (redisValue.IsNull)
            return false;
        if (redisValue == NULL_PLACEHOLDER)
            return true;
        value = typeof(T) == typeof(byte[])
            ? (T)(object)(byte[])redisValue
            : JsonSerializer.DeserializeFromString<T>(redisValue);
        return true;
    }

    public void Set<T>(string key, T value, TimeSpan cacheDuration)
    {
        if (value.IsDefaultForType())
            _redis.StringSet(key, NULL_PLACEHOLDER, cacheDuration);
        else if (typeof(T) == typeof(byte[]))
            _redis.StringSet(key, (byte[])(object)value, cacheDuration);
        else
            _redis.StringSet(key, JsonSerializer.SerializeToString(value), cacheDuration);
    }

    public async Task<T> GetOrSetAsync<T>(string key, Func<T, TimeSpan> getSoftExpire, TimeSpan additionalHardExpire, TimeSpan retryInterval, Func<Task<T>> refreshAsync)
    {
        var softExpireKey = key + ":soft-expire";
        var lastRetryKey = key + ":last-retry";
        T val;
        if (ShouldReturnNow(key, softExpireKey, lastRetryKey, retryInterval, out val))
            return val;
        using (await _locks[key].LockAsync())
        {
            if (ShouldReturnNow(key, softExpireKey, lastRetryKey, retryInterval, out val))
                return val;
            Set(lastRetryKey, DateTime.UtcNow, additionalHardExpire);
            try
            {
                var newVal = await refreshAsync();
                var softExpire = getSoftExpire(newVal);
                var hardExpire = softExpire + additionalHardExpire;
                if (softExpire > TimeSpan.Zero)
                {
                    Set(key, newVal, hardExpire);
                    Set(softExpireKey, DateTime.UtcNow + softExpire, hardExpire);
                }
                val = newVal;
            }
            catch (Exception ex)
            {
                if (val == null)
                    throw;
            }
        }
        return val;
    }

    private bool ShouldReturnNow<T>(string valKey, string softExpireKey, string lastRetryKey, TimeSpan retryInterval, out T val)
    {
        if (!Exists(valKey, out val))
            return false;
        var softExpireDate = Get<DateTime?>(softExpireKey);
        if (softExpireDate == null)
            return true;
        // value is in the cache and not yet soft-expired
        if (softExpireDate.Value >= DateTime.UtcNow)
            return true;
        var lastRetryDate = Get<DateTime?>(lastRetryKey);
        // value is in the cache, it has soft-expired, but it's too soon to try again
        if (lastRetryDate != null && DateTime.UtcNow - lastRetryDate.Value < retryInterval)
        {
            return true;
        }
        return false;
    }
}
A few recommendations:
- You can use different multiplexers with different timeout values for different types of keys/values (a sketch follows below): http://azure.microsoft.com/en-us/documentation/articles/cache-faq/
- Make sure you are not network bound on the client or the server. If you are on the server, move to a higher SKU which has more bandwidth.
Please read this post for more details: http://azure.microsoft.com/blog/2015/02/10/investigating-timeout-exceptions-in-stackexchange-redis-for-azure-redis-cache/
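As a rough sketch of the first recommendation (host name and timeout values are placeholders, and the password is omitted): one multiplexer tuned for small scalar values and another with a longer syncTimeout for the large image payloads.

using System;
using StackExchange.Redis;

static class RedisConnections
{
    // Short timeout for small keys/values.
    private static readonly Lazy<ConnectionMultiplexer> _small = new Lazy<ConnectionMultiplexer>(() =>
        ConnectionMultiplexer.Connect(
            "mycache.redis.cache.windows.net,ssl=true,abortConnect=false,syncTimeout=2000"));

    // Longer timeout for the multi-MB jpg payloads.
    private static readonly Lazy<ConnectionMultiplexer> _large = new Lazy<ConnectionMultiplexer>(() =>
        ConnectionMultiplexer.Connect(
            "mycache.redis.cache.windows.net,ssl=true,abortConnect=false,syncTimeout=10000"));

    public static IDatabase SmallValues => _small.Value.GetDatabase();
    public static IDatabase LargeValues => _large.Value.GetDatabase();
}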

Azure web job keep restarting

I've created a console application that I run as a continuous web job. It does not use the web job SDK, but it checks for the file specified in the WEBJOBS_SHUTDOWN_FILE environment variable.
I have turned the 'Always On' option on, and I'm running the job in the shared plan as a singleton using the settings.job file.
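For reference, a settings.job that requests singleton behavior is just a small JSON file along these lines:

{
  "is_singleton": true
}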
The web job keeps getting stopped by the WEBJOBS_SHUTDOWN_FILE and is restarted again after that.
I've been looking through the Kudu sources, but I can't find why my web job is restarted.
Does anybody have an idea why this happens?
This is the code that initializes the FileSystemWatcher:
private void SetupExitWatcher()
{
    var file = Environment.GetEnvironmentVariable("WEBJOBS_SHUTDOWN_FILE");
    var dir = Path.GetDirectoryName(file);
    var fileSystemWatcher = new FileSystemWatcher(dir);
    FileSystemEventHandler changed = (o, e) =>
    {
        // IndexOf (not Equals) so the comparison compiles and matches the shutdown file name within the path
        if (e.FullPath.IndexOf(Path.GetFileName(file), StringComparison.OrdinalIgnoreCase) >= 0)
        {
            this.Exit();
        }
    };
    fileSystemWatcher.Created += changed;
    fileSystemWatcher.NotifyFilter = NotifyFilters.CreationTime | NotifyFilters.FileName | NotifyFilters.LastWrite;
    fileSystemWatcher.IncludeSubdirectories = false;
    fileSystemWatcher.EnableRaisingEvents = true;
}
