I am running an azure function app on azure cloud , from time to time I am getting the fallowing error:
Exception while executing function: ****Timeout performing SETEX (10000ms), next: GET *****, inst: 140, qu: 0, qs: 0, aw: False, bw: SpinningDown, rs: ReadAsync, ws: Idle, in: 2939, serverEndpoint: *****:6380, mc: 1/1/0, mgr: 10 of 10 available, clientName: 4ad57eb720e9(SE.Redis-v2.6.66.47313), IOCP: (Busy=0,Free=1000,Min=6,Max=1000), WORKER: (Busy=69,Free=32698,Min=6,Max=32767), POOL: (Threads=69,QueuedItems=54,CompletedItems=8674751), v: 2.6.66.47313 (Please take a look at this article for some common client-side issues that can cause timeouts: https://stackexchange.github.io/StackExchange.Redis/Timeouts)
It doesn't look like there a specific reason for this to happen.
Any ideas why this happens ?
The bottle neck could either be network bandwidth or CPU cycles. Based on the log you shared, you have 69 work threads, so I would first check the CPU usage to be sure you aren't maxing it out.
You are likely not hitting network bandwidth issues assuming you are running on Azure and since the qs/qu values are 0 but there could be network glitches that are causing the timeouts too but should be transient and resolve on their own.
Related
I have a fargate cluster running a node.js API, running on fargate 1.4.0
I have maybe 8-25 instances running depending on the load. Instances are defined with these parameters using aws CDK:
cpu: 512,
assignPublicIp: true,
memoryLimitMiB: 2048,
publicLoadBalancer: true,
Like few times a day I get error like this:
error Command failed with signal "SIGKILL".
I though I was running out of memory, so I've configured node, to start with less memory like this: NODE_OPTIONS=--max_old_space_size=900
This made it less likely to occur, but I am still getting some SIGKILLs.
When looking at the instances at runtime I see they have plenty of memory free on the OS level:
{
"freemem": "6.95GB",
"totalmem": "7.79GB",
"max_old_space_size": 813.1680679321289,
"processUptime": "46m",
"osUptime": "49m",
"rssMemory": "396.89MB"
}
Why is fargate still killing those instances? Is there a way to find out most memory hungry processes just before the SIGKILL?
I have a TimerTrigger which calls my own Azure Functions at a relatively high rate - a few times per second. It is not being stress tested. Every call takes just a 100ms and the purpose of the test is not a stress test.
This call to my own endpoint works about 9999 times out of 10000 but just once in a while I get the following error:
System.Net.Http.HttpRequestException: Name or service not known (app.mycustomdomain.com:443)
---> System.Net.Sockets.SocketException (0xFFFDFFFF): Name or service not known
at System.Net.Sockets.Socket.AwaitableSocketAsyncEventArgs.ThrowException(SocketError error, CancellationToken cancellationToken)
I replaced my actual domain with "app.mycustomdomain.com" in the error message above. It is a custom domain set up to point to the Azure Function App using CNAME.
The Function App does not detect any downtime in the Azure Portal and I have Application Insights enabled and do not see any errors. So I assume the issue is somehow on the callers side and the call never actually happens.
What does this error indicate? And how can I alleviate the problem?
For your second question - alleviating the problem, one option would certainly be to build in retry using a library like Polly. High level you create a policy, e.g. for a simple retry:
var myPolicy = Policy
.Handle<SomeExceptionType>()
.Retry(3);
This would retry 3 times, to use the policy you can call a sync or async version of Execute:
await myPolicy.ExecuteAsync(async () =>
{
//do stuff that might fail up to three times
});
More complete samples are available
This library has lots of support for other approaches, e.g. with delays, exponential delays, etc.
I'm using this code to run the tests outlined in this blog post.
(For posterity, relevant code pasted at the bottom).
What I've found is that if I run these experiments with a local instance of Mongo (in my case, using docker)
docker run -d -p 27017:27017 -v ~/data:/data/db mongo
Then I get pretty good performance, similar results as outlined in the blog post:
finished populating the database with 10000 users
default_query: 277.986ms
query_with_index: 262.886ms
query_with_select: 157.327ms
query_with_select_index: 136.965ms
lean_query: 58.678ms
lean_with_index: 65.777ms
lean_with_select: 23.039ms
lean_select_index: 21.902ms
[nodemon] clean exit - waiting
However, when I switch do using a cloud instance of Mongo, in my case an Atlas sandbox instance, with the following configuration:
CLUSTER TIER
M0 Sandbox (General)
REGION
GCP / Iowa (us-central1)
TYPE
Replica Set - 3 nodes
LINKED STITCH APP
None Linked
(Note that I'm based in Melbourne, Australia).
Then I get much worse performance.
adding 10000 users to the database
finished populating the database with 10000 users
default_query: 8279.730ms
query_with_index: 8791.286ms
query_with_select: 5234.338ms
query_with_select_index: 4933.209ms
lean_query: 13489.728ms
lean_with_index: 10854.134ms
lean_with_select: 4906.428ms
lean_select_index: 4710.345ms
I get that obviously there's going to be some round trip overhead between my computer and the mongo instance, but I would expect that to add 200ms max.
It seems that that round trip time must be being added multiple times, or something completely else that I'm not aware of - can someone explain just what it is that would cause this to blow out?
A good answer might involve doing an explain plan, and explaining that in terms of network latency.
Tests against different Atlas instances - For those suggesting the issue is that I'm using a Sandbox instance of Atlas - here is the results for a M20 and M30 instances:
BACKUPS
Active
CLUSTER TIER
M20 (General)
REGION
GCP / Iowa (us-central1)
TYPE
Replica Set - 3 nodes
LINKED STITCH APP
None Linked
BI CONNECTOR
Disabled
adding 10000 users to the database
finished populating the database with 10000 users
default_query: 9015.309ms
query_with_index: 8779.388ms
query_with_select: 4568.794ms
query_with_select_index: 4696.811ms
lean_query: 7694.718ms
lean_with_index: 7886.828ms
lean_with_select: 3654.518ms
lean_select_index: 5014.867ms
BACKUPS
Active
CLUSTER TIER
M30 (General)
REGION
GCP / Iowa (us-central1)
TYPE
Replica Set - 3 nodes
LINKED STITCH APP
None Linked
BI CONNECTOR
Disabled
adding 10000 users to the database
finished populating the database with 10000 users
default_query: 8268.799ms
query_with_index: 8933.502ms
query_with_select: 4740.234ms
query_with_select_index: 5457.168ms
lean_query: 9296.202ms
lean_with_index: 9111.568ms
lean_with_select: 4385.125ms
lean_select_index: 4812.982ms
These really don't show any significant difference (be aware than any difference may just be network noise).
Tests colocating the Mongo client and the mongo database instance
I created a docker container and ran it on Google's Cloud Run, in the same region (US Central1), the results are:
2019-12-30 11:46:06.814 AEDTfinished populating the database with 10000 users
2019-12-30 11:46:07.885 AEDTdefault_query: 1071.233ms
2019-12-30 11:46:08.917 AEDTquery_with_index: 1031.952ms
2019-12-30 11:46:09.375 AEDTquery_with_select: 457.659ms
2019-12-30 11:46:09.657 AEDTquery_with_select_index: 281.678ms
2019-12-30 11:46:10.281 AEDTlean_query: 623.417ms
2019-12-30 11:46:10.961 AEDTlean_with_index: 680.622ms
2019-12-30 11:46:11.056 AEDTlean_with_select: 94.722ms
2019-12-30 11:46:11.148 AEDTlean_select_index: 91.984ms
So while this doesn't give results as fast as running on my own machine - it does show that colocating the client and the database gives a very large performance improvement.
So the question again is - why is the improvement ~7000ms?
The test code:
(async () => {
try {
await mongoose.connect('mongodb://localhost:27017/perftest', {
useNewUrlParser: true,
useCreateIndex: true
})
await init()
// const query = { age: { $gt: 22 } }
const query = { favoriteFruit: 'potato' }
console.time('default_query')
await User.find(query)
console.timeEnd('default_query')
console.time('query_with_index')
await UserWithIndex.find(query)
console.timeEnd('query_with_index')
console.time('query_with_select')
await User.find(query)
.select({ name: 1, _id: 1, age: 1, email: 1 })
console.timeEnd('query_with_select')
console.time('query_with_select_index')
await UserWithIndex.find(query)
.select({ name: 1, _id: 1, age: 1, email: 1 })
console.timeEnd('query_with_select_index')
console.time('lean_query')
await User.find(query).lean()
console.timeEnd('lean_query')
console.time('lean_with_index')
await UserWithIndex.find(query).lean()
console.timeEnd('lean_with_index')
console.time('lean_with_select')
await User.find(query)
.select({ name: 1, _id: 1, age: 1, email: 1 })
.lean()
console.timeEnd('lean_with_select')
console.time('lean_select_index')
await UserWithIndex.find(query)
.select({ name: 1, _id: 1, age: 1, email: 1 })
.lean()
console.timeEnd('lean_select_index')
process.exit(0)
} catch (err) {
console.error(err)
}
})()
My best guess is that you're dealing with slow network throughput between your local machine and Atlas (something I've experienced myself this week - hence how I found this post!)
Looking at your local query performance:
default_query: 277.986ms
query_with_index: 262.886ms
The query with index isn't noticeably any faster than the one without. For an indexed query to take 262ms in a Node app with a local DB probably means that either:
The index isn't being used properly OR more likely...
You're returning quite a few results in the query. If the query returns say 3,000 results and each result is 1KB, that's 3MB of JSON data that your app needs to handle.
I've got a 150Mbit/s internet connection and yet my throughput to Atlas (M2 shared tier, if that makes a difference) fluctuates between around 1Mbit/s to 6Mbit/s.
On localhost I have a Mongo query that returns 2,400 results for a total of 1.7MB of JSON data. The roundtrip time for that query in my Node app (using console.time() like you did) connected to Mongo on the same local dev machine is ~150ms. But when connecting that local app to Atlas the query takes 2,400ms to 3,400ms to return. When I profiled the query on Atlas it only took 2ms to execute, so the query itself is really fast, it's apparently the data transfer that's slow.
Based on these results, I have a feeling that Atlas perhaps throttles throughput over the public internet (or just doesn't bother optimizing for it in their network) because 99% of apps are colocated in the same network region as their Atlas DB. That's the reason why they ask you to pick not just AWS, Azure, etc but your specific network region when creating a cluster.
UPDATE: I just ran a few Amazon EC2 speed tests for my network region (us-east-1) using a 3rd-party service and the average download speed was 4.5Mbit/s for smaller files (1KB to 128KB) and 41Mbit/s for larger files (256KB to 10MB). So the primary issue may be generally slow throughput on the EC2 instances that Atlas clusters run on rather than any throttling by Atlas, or perhaps a combination of both.
Usually, It takes a little bit of time for a request to propagate over the network. this depends on the connection speed, latency, and distance to the server and so many factors. but the server on your local computer doesn't face above mentioned issues as it is for a cloud environment.
But since you are confident about the max delay due to network propagations is ~200ms.
There may be several other possible reasons also to consider
Usually, sandbox plans are for testing and they have limited resources allocated to them.
They don't use SSD drives to store data and uses cheap storage solutions.
They assume that sandbox plans are usually just for exploring features.
Most of the times those instances are run on shared virtual machines.
Make sure there are no other services running on your computer which consumes a higher data rate eg :( torrent applications )
Cloud services depend on a variety of metrics like System Availability, Response Time, Throughput, Latency and many more...
If the average response time of the user base and the data centers is located in the same region then the average overall response time is about 50ms but if located in the different region the response time significantly increases from 200ms - 400ms which can also depend upon the type of instance you're using and the region which you choose.
Since you're using the Atlas Sandbox cluster you must first select the nearest region to avoid poor performance issues as Atlas Sandbox clusters do have it's own limitations. If you're looking for quick response time and faster performance try to upgrade your instance.
If you are sure that it's not about network issues like latency and bandwidth vs response size, then it's either low edge host (non-SSD, low RAMs) or misconfigured web server/proxy, or there is throttling/filtering happening to your traffic.
To narrow it down more use encrypted (https) connection (it's easy, just install letsencrypt on your server) and try to use VPN to change your network route.
Also you can try running the script directly on the server to measure actual executing performance.
Of course you have to consider that your network delay is for each request to the cloud instance , so if you have a ping time of +30ms , you will take 30ms more for each query (approximately) , moreover if your instance is a sandbox ( free account https://docs.atlas.mongodb.com/tutorial/deploy-free-tier-cluster/ ) you will have a poor and shared CPU/RAM.
This is why your mongo db queries are slow.
Making a system faster in production is one of the design goals
We need to take into the account many variables:
Networking, for example, VPC/subnetting
MongoDB Storage (SSD)
MongoDB Indexes
MongoDB RAM, CPU
Node Web Servers or Cluster
Cluud Tenants
TLS encryption
You may need to discard each and every single possible bottleneck
I use a combination of lookup and foreach activities to iterate through the set of data ingestion queries and execute them (reasons behind that is a separate broad topic :)). As the data source is connected to the private network, I have provisioned a dedicated VM to run the self-hosted runtime. In most cases everything runs smoothly, I can see worker processes eating the CPU and high overall CPU utilization (which is good).
But: sometimes, when most work is done, and there are just 2-3 activities standing in line, I can see that the runtime does no processing and CPU usage drops to zero, no new entries appear in the event log. After some time (approximately 10 minutes) I get the 30002 (the example is provided below) and runtime happily completes the work.
Example event message:
Job ID: ***-fcab-429a-bb45-***
Task ID: ***-d820-414e-ad8c-***
Queue ID: ***-4f44-4c39-a1c1-***
Log ID: PulledOffNewTask
The question: What could be the root cause of such Azure Data Factory self-hosted integration runtime's behaviour? Can this be fine-tuned?
UPDATE 1
Errors have been spotted in the application log and warnings have been spotted in the integration runtime log.
Application log contains 3 sets of errors (see below events [1] to [5]) that occured in the time interval of ~2 minutes, shortly after that 8 events (exactly the number of my worker processes) were logged to the integration runtime log (see [6]), straight after that "Windows Error Reporting" events appear. And then we face a "freeze".
So - looks like a bug :(
"application" log:
[1]
Application: diawp.exe
Framework Version: v4.0.30319
Description: The process was terminated due to an unhandled exception.
Exception Info: System.NullReferenceException
at Microsoft.DataTransfer.TransferTask.CopyTaskBase.UpdateJobProgress(System.Object)
at System.Threading.ExecutionContext.RunInternal(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object, Boolean)
at System.Threading.ExecutionContext.Run(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object, Boolean)
at System.Threading.TimerQueueTimer.CallCallback()
at System.Threading.TimerQueueTimer.Fire()
at System.Threading.TimerQueue.FireNextTimers()
[2]
Faulting application name: diawp.exe, version: 3.5.6639.1, time stamp: 0x5aa8cf5f
Faulting module name: unknown, version: 0.0.0.0, time stamp: 0x00000000
Exception code: 0xc0000005
Fault offset: 0x00007ff914402c65
Faulting process id: 0x1bc4
Faulting application start time: 0x01d3d287ef6e34fa
Faulting application path: C:\Program Files\Microsoft Integration Runtime\3.0\Shared\diawp.exe
Faulting module path: unknown
Report Id: 1fe7de4d-5481-478d-b9e7-d542c24ab18a
Faulting package full name:
Faulting package-relative application ID:
[3]: Unable to open the Server service performance object. The first four bytes (DWORD) of the Data section contains the status code.
[4]: The Open Procedure for service "WmiApRpl" in DLL "C:\Windows\system32\wbem\wmiaprpl.dll" failed. Performance data for this service will not be available.
"Integration Runtime" log:
[6]
'Type=System.InvalidOperationException,Message=Instance 'diawp#10' does not exist in the specified Category.,Source=System,StackTrace= at System.Diagnostics.CounterDefinitionSample.GetInstanceValue(String instanceName)
at System.Diagnostics.PerformanceCounter.NextSample()
at System.Diagnostics.PerformanceCounter.NextValue()
at Microsoft.DataTransfer.TransferTask.FormatedPerfCounter.TryGet(Single& value),'
Job ID: 7b629411-c6cd-42d0-9939-e830e58db015
Log ID: Warning
It looks like caused by worker crash. Could you please check event log from: Windows Log => Application? Any error in the category?
As far as I know, you don't have a lot of options to tune the Integration Runtime. My bet is a connectivity issue with your private network. Whenever you run the pipeline, open a cmd at the vm and ping the database pc with -t. If the process hangs, take a look at the response time between pings.
Example ping:
ping 192.168.1.1 -t
Hope this helped!
30002 means IntegrationRuntime got new tasks assigned and started execution.
If the 10 minutes "retry interval" could constantly be reproduced, then 30002 could further indicate that IntegrationRuntime lost tracks on the previous failed tasks it got assigned and had to go with retry.
You can search the specific JobIds in the eventlogs to verify whether shown up 10 minutes before and any exceptions related to.
Btw, the polling interval in normal happy path is in seconds level.
I am trying to use Redis Session State with my Windows Azure Cloud website. I am using the 1 GB Standard Tier. I am using the P1 Premium Database. I am testing on local host. My cache and website are located on EAST US.
I am storing 200 - 400 objects in session state, which include an order and its payments.
Here is the error:
An exception of type 'System.TimeoutException' occurred in Microsoft.Web.RedisSessionStateProvider.dll but was not handled in user code
Additional information: Timeout performing EVAL, inst: 0, mgr: Inactive, err: never, queue: 7, qu: 1, qs: 6, qc: 0, wr: 1, wq: 1, in: 0, ar: 0, IOCP: (Busy=0,Free=1000,Min=8,Max=1000), WORKER: (Busy=1,Free=4094,Min=8,Max=4095), clientName: XX
Here are my settings:
<sessionState mode="Custom" customProvider="MySessionStateStore">
<providers>
<add name="MySessionStateStore" type="Microsoft.Web.Redis.RedisSessionStateProvider" host="XX.redis.cache.windows.net" accessKey="XX" ssl="true" syncTimeout="3000" connectionTimeoutInMilliseconds="5000" operationTimeoutInMilliseconds="1000" retryTimeoutInMilliseconds="3000" />
</providers>
</sessionState>
late answer perhaps, but never the less...
in this case, it looks like the amount of info within the cache is your problem. Redis is fine tuned to retrieve a lot of small cached info, not a single huge string...
Another thing i'd suggest you to try is to increase the number min number of IOCP and worker threads... in my scenario (2 core machine) i figured out that the best number is 100...