I have a question about Azure App Insights Sampling.
If the itemCount field is greater than 1 for a log item, does it mean that there was exactly the SAME request and it was sampled?
My logs have one request that sends this message with itemCount = 2. This request ended with an OptimisticConcurrencyException, so my transaction was rolled back. In this transaction I send a message to a 3rd party service.
The most interesting part is that they told me they got 2 messages from my service, and my database has been updated (so the transaction was committed).
All of this would make sense if there were 2 requests and one of them returned a 200 code and the other returned 500. But the App Insights log item about the OptimisticConcurrencyException has itemCount = 2, which would mean that this exception was thrown twice (for both requests).
Besides this, I don't see any other requests that could have changed the data this request was changing.
So could anybody explain to me how App Insights samples requests and errors?
This really depends on how/where your sampling occurred, as sampling could have occurred at 3 different places depending on how you have your app configured.
There's a fair amount of documentation about the various layers of sampling, but hypothetically:
The sampling algorithm decides which telemetry items to drop, and which ones to keep (whether it's in the SDK or in the Application Insights service). The sampling decision is based on several rules that aim to preserve all interrelated data points intact, maintaining a diagnostic experience in Application Insights that is actionable and reliable even with a reduced data set. For example, if for a failed request your app sends additional telemetry items (such as exception and traces logged from this request), sampling will not split this request and other telemetry. It either keeps or drops them all together. As a result, when you look at the request details in Application Insights, you can always see the request along with its associated telemetry items.
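The "keep or drop them all together" rule quoted above can be sketched as a decision made once per operation rather than once per item. This is only an illustration: the hash function, the `operation_id` field name and both function names are assumptions, not the SDK's actual implementation.

```python
import hashlib

def keep_operation(operation_id: str, sampling_percentage: float) -> bool:
    """Decide once per operation; every item sharing the id gets the same answer."""
    digest = hashlib.sha256(operation_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the digest to a score in [0, 100).
    score = int.from_bytes(digest[:8], "big") % 10000 / 100
    return score < sampling_percentage

def sample(items, sampling_percentage):
    """Keep or drop telemetry items without ever splitting one operation."""
    return [i for i in items if keep_operation(i["operation_id"], sampling_percentage)]

items = [
    {"operation_id": "op-a", "type": "request"},
    {"operation_id": "op-a", "type": "exception"},
    {"operation_id": "op-b", "type": "request"},
]
kept = sample(items, 50)
```

Because the decision depends only on the operation id, the request and the exception for "op-a" are either both in `kept` or both missing.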
Update:
I got some more details from people on the team that do the sampling, and it works like this:
Sampling ratio is determined by the number of events per second occurring in the app
The AI SDK randomly selects requests to be sampled when the request begins (so, it is not known whether it will fail or succeed)
AI SDK assigns itemCount=<sampling ratio>
This would then explain the behavior you are seeing, when two requests (success + failure) were counted as two failures: the failed request was sampled "in", and so in telemetry, you'd have 2 failed requests (one request with itemCount=2) instead of a failed and a successful, because the successful one got sampled away.
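To make the effect concrete, here is a minimal sketch of how counts are reconstructed from sampled telemetry by summing itemCount. The record layout and the `resultCode` and `name` fields are illustrative assumptions; only itemCount comes from the question.

```python
from collections import Counter

def estimate_counts(items):
    """Sum itemCount per result code to estimate the pre-sampling totals."""
    totals = Counter()
    for item in items:
        # Items without itemCount are treated as unsampled (weight 1).
        totals[item["resultCode"]] += item.get("itemCount", 1)
    return dict(totals)

# What actually happened: one request returned 200 and one returned 500.
# What survived sampling: only the failed request, weighted to stand for two.
stored = [{"name": "POST /send", "resultCode": "500", "itemCount": 2}]
print(estimate_counts(stored))  # {'500': 2}
```

The estimate is statistically fair across many requests, but for a single pair like this one it reports two failures where there was one failure and one success.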
Related
We have a service which pings our EP1 Premium service and yesterday we received 3 client side timeout errors after 2 minutes of waiting. When opening the trace in App insights, these requests which time out are not even logged and have no trace of ever being received Azure side, and therefore stay unanswered. By looking at the metrics provided in the Azure Functions app, I found out that 1-2 minutes after the request has been sent, the app loses all its ability to work as its Total App Domains falls to 0 as well as all connections, threads and so on and this state lasts until the next request is received, therefore "skipping" the request that happened beforehand. This is a big issue as I need to make sure requests get answered in a timely manner.
The client service sent HTTP requests to the Azure Functions app expecting an answer, only to time out while the Azure-side doesn't have any record of ever receiving the request.
I believe this issue is related to the cold start behaviour of Azure Functions on the Consumption plan. The "skipping" mechanism is explained below:
Apps may scale to zero when idle, meaning some requests may have additional latency at startup. The consumption plan does have some optimizations to help decrease cold start time, including pulling from pre-warmed placeholder functions that already have the function host and language processes running. https://learn.microsoft.com/en-us/azure/azure-functions/functions-scale#cold-start-behavior
Please also have a look at this article, which explains the behaviour: https://azure.microsoft.com/en-us/blog/understanding-serverless-cold-start/
We have an Azure setup with an Azure Event Grid Topic, and an Azure Function Service with about 15 functions that subscribe to the topic via different prefix filters. The Azure Function Service is set up as a consumption based resource and should be able to scale as it prefers.
Each subscription is set up to attempt delivery up to 10 times over a maximum of 4 hours before dropping the event. So far so good and the setup is working as expected – most of the time.
In certain situations, unknown to us, it seems like the Event Grid Topic cannot deliver events to the different functions. What we can see is that our dead letter storage fills up with events that have not been delivered.
Now to my question
From the logs we can see the reason for various events not being delivered. The reason is most often Outcome: Probation. We cannot find any information from Microsoft on what this actually means.
In addition, the Grid fails and adds the event to the dead letter log before either the timeout policy (4 hours) or the delivery attempts policy (10 retries) has been exceeded. Sometimes the Function Service is idling and does not receive any events from the Grid.
Do any of you good people have ideas on how we can proceed with troubleshooting this? What has happened between the Grid and the Function App when the error message Probation occurs? One thing we have noticed is that the number of connections from the Grid to our function app is quite high compared with the number of events delivered.
There are no other incoming connections to the Function App besides the Event Grid.
Example of a dead letter message
[{
"id":"a40a1f02-5ec8-46c3-a349-aea6aaff646f",
"eventTime":"2020-06-02T17:45:09.9710145Z",
"eventType":"mitbalAdded",
"dataVersion":"1",
"metadataVersion":"1",
"topic":"/subscriptions/XXXXXXX/resourceGroups/XXXX_STAGING/providers/Microsoft.EventGrid/topics/XXXXXstaging",
"subject":"odl/type/mitbal/v1",
"deadLetterReason":"TimeToLiveExceeded",
"deliveryAttempts":6,
"lastDeliveryOutcome":"Probation",
"publishTime":"2020-06-02T17:45:10.1869491Z",
"lastDeliveryAttemptTime":"2020-06-02T19:30:10.5756332Z",
"data":"<?xml version=\"1.0\" encoding=\"utf-8\"?><Stock><Action>ADD</Action><Id>123456</Id><Store>123</Store><Shelf>1</Shelf></Stock>"
}]
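As a troubleshooting aid, a small sketch like the following can compare a dead letter message against the configured policy. It only uses fields that appear in the example above; the helper names are made up for illustration.

```python
from datetime import datetime, timezone

def parse_utc(ts: str) -> datetime:
    """Parse an Event Grid timestamp such as 2020-06-02T17:45:10.1869491Z."""
    body = ts.rstrip("Z")
    if "." in body:
        head, frac = body.split(".")
        # datetime stores at most microseconds, so trim or pad to 6 digits.
        body = f"{head}.{frac[:6].ljust(6, '0')}"
    return datetime.fromisoformat(body).replace(tzinfo=timezone.utc)

def delivery_window(event: dict):
    """Elapsed time between publishing and the last delivery attempt."""
    return parse_utc(event["lastDeliveryAttemptTime"]) - parse_utc(event["publishTime"])

event = {
    "deadLetterReason": "TimeToLiveExceeded",
    "deliveryAttempts": 6,
    "lastDeliveryOutcome": "Probation",
    "publishTime": "2020-06-02T17:45:10.1869491Z",
    "lastDeliveryAttemptTime": "2020-06-02T19:30:10.5756332Z",
}
window = delivery_window(event)
print(window, event["deliveryAttempts"])
```

For this message the window is about 1 hour 45 minutes with 6 attempts, which is below both the 4 hour and the 10 attempt settings described in the question.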
Function Service Metrics
Blue = Connections (count)
Red = Function Executions (count)
White = Requests (count)
I'm not sure if you have figured the issue here, but here are some insights for others in a comparable situation.
Firstly, probation is the outcome when the destination is not healthy, for which Event Grid would still attempt deliveries.
Based on the graph, it looks like functions hit the 100 executions mark and then took a while to scale out for the next 100. You could get better results by tweaking the host.json settings depending on what each function execution does.
Enabling scale controller logs could shed more light on what is happening internally when scaling out.
Also, another option would be to send events into service bus or event hubs first and then have a function run from there.
I'm running an Azure Function app on the Consumption Plan and I want to monitor the number of instances currently running. Using a REST API endpoint of the format
https://management.azure.com/subscriptions/{subscr}/resourceGroups/{rg}
/providers/Microsoft.Web/sites/{appname}/instances?api-version=2015-08-01
I'm able to retrieve the instances. However, the result doesn't match the information that I see in Application Insights / Live Metrics Stream.
For example, right now App Insights shows 4 servers online, while API call returns just one (the GUID of this 1 instance is also among App Insights guids).
Who can I trust? Is there a better way to get instance count (e.g. from App Insights)?
UPDATE: It looks like data from REST API are wrong.
I was sending 10000 messages to the queue, logging each function call with respective instance ID which processed the request.
While messages keep coming in and the backlog grows, instance count from REST API seems to be correct (scaled from 1 to 12). After sending stops, the reported instance count rapidly goes down (eventually back to 1, while processors are still busy).
But based on the speed and the execution logs I can tell that the actual instance count kept growing and ended up at 15 instances at the moment of last message processed.
UPDATE2: It looks like the SDK refuses to report more than 20 servers. The metric flattens out at 20, while App Insights kept growing steadily and is already showing 41.
Who can I trust? Is there a better way to get instance count (e.g. from App Insights)?
Based on my understanding, we need to use the REST API endpoint to retrieve the instances. App Insights can be configured for multiple web apps, so the number of servers online shown in App Insights may cover multiple web apps.
Updated:
Based on my test, the number shown in Application Insights may not be real time.
During my test, when the Function App scaled out I could get multiple instances with the REST API, and I could also check the number of servers online in App Insights.
https://management.azure.com/subscriptions/{subscriptionId}/resourceGroups/{resourcegroup}/providers/Microsoft.Web/sites/{functionname}/instances?api-version=2016-08-01
But after I finished the test, the REST API returned 1 instance, which as I understand it is the right result.
At the same time, Application Insights still showed the maximum number of servers online reached during my test.
After a while, the number of servers online in Application Insights also dropped to 1.
So if we want to get the number of instances for an Azure Function, my suggestion is to use the REST API.
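A minimal sketch of calling that endpoint from Python follows. The URL format and api-version are taken from this thread; the `{"value": [...]}` response shape is assumed from the usual Azure Resource Manager list convention, and obtaining the bearer token is left out entirely. The function names are illustrative.

```python
import json
import urllib.request

API_VERSION = "2016-08-01"  # version used in the URL in this answer

def instances_url(subscription_id, resource_group, app_name):
    """Build the instances endpoint URL for a Function App."""
    return (
        "https://management.azure.com"
        f"/subscriptions/{subscription_id}"
        f"/resourceGroups/{resource_group}"
        f"/providers/Microsoft.Web/sites/{app_name}"
        f"/instances?api-version={API_VERSION}"
    )

def count_instances(payload):
    """Count entries, assuming an ARM-style {"value": [...]} list response."""
    return len(payload.get("value", []))

def fetch_instance_count(url, bearer_token):
    """Call the endpoint; bearer_token is an access token obtained separately."""
    request = urllib.request.Request(
        url, headers={"Authorization": f"Bearer {bearer_token}"}
    )
    with urllib.request.urlopen(request) as response:
        return count_instances(json.load(response))
```

Polling `fetch_instance_count` on a timer and logging the result next to the App Insights server count makes it easier to see where the two sources diverge.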
Update2:
As DavidEbbo mentioned, the REST API is not always reliable:
Unfortunately, the REST API is not always reliable. Specifically, when a Function App scales across multiple scale units, only the instances from the 'home' scale unit are reflected. You probably will not see this in a smallish test, but likely will if you start scaling out widely (say over 20 instances).
On this documentation page there is the following limitation of Application Insights documented:
Up to 500 telemetry data points per second per instrumentation key (that is, per application). This includes both the standard telemetry sent by the SDK modules, and custom events, metrics and other telemetry sent by your code.
However, it doesn't explain what the implications of that limit are.
a) Does it buffer and throttle, but still persist all data eventually? So say - 1000 data points get pushed within a second - it will persist the first 500, then wait for a bit and push the other 500?
or
b) Does it just drop/not log data? So say - 1000 data points get pushed within a second and only the first 500 will be persisted and the other 500 not (ever)?
It is the latter (b), with the caveat that ALL data will start to be throttled in this case: once an RPC rate above 500 (100 for free apps; see https://azure.microsoft.com/en-us/documentation/articles/app-insights-data-retention-privacy/ for details) is detected, the data collection endpoint starts rejecting all data from this instrumentation key until the RPC rate is back under 500.
EDIT: Further information from Bret Grinslade:
The current implementation averages over one minute -- so if you send 30K in 1 minute (500*60) it will throttle your application. The HTTP response will tell the SDK to retry later. If the incoming rate never comes down, the response will tell the SDK to drop the data. We are working on other features to improve this experience -- pre-aggregation on the client, improved burst data rates, etc.
A bit more detail on top of Alex's response. The current implementation averages over one minute -- so if you send 30K in 1 minute (500*60) it will throttle your application. The HTTP response will tell the SDK to retry later. If the incoming rate never comes down, the response will tell the SDK to drop the data. We are working on other features to improve this experience -- pre-aggregation on the client, improved burst data rates, etc.
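The one-minute averaging can be illustrated with a short sketch. The 500 per second limit and the 60 second window come from the answers above; treating exactly 30K as the trigger point, and the function name, are assumptions for illustration.

```python
LIMIT_PER_SECOND = 500   # per instrumentation key, from the quoted documentation
WINDOW_SECONDS = 60      # averaging window described in the answer

def is_throttled(points_per_second):
    """True if the most recent minute reaches 500 * 60 = 30,000 data points."""
    window = points_per_second[-WINDOW_SECONDS:]
    return sum(window) >= LIMIT_PER_SECOND * WINDOW_SECONDS

# A one-second burst of 1000 points inside an otherwise quiet minute.
burst = [100] * 59 + [1000]
# A sustained 600 points per second for the whole minute.
sustained = [600] * 60
print(is_throttled(burst), is_throttled(sustained))  # False True
```

Under this model the 1000-point burst from the question is absorbed by the average, while a sustained rate above the limit triggers throttling.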
AI now has the ingestion throttling limit of 16K EPS: https://learn.microsoft.com/en-us/azure/application-insights/app-insights-pricing
I'm building an application using tag subscriptions in the real-time API and have a question related to capacity planning. We may have a large number of users posting to a subscribed hashtag at once, so the question is how often will the API actually POST to our subscription processing endpoint? E.g., if 100 users post to #testhashtag within a second or two, will I receive 100 POSTs or does the API batch those together as one update? A related question: is there a maximum rate at which POSTs can be sent (e.g., one per second or one per ten seconds, etc.)?
The Instagram API seems to lack detailed information about both how many updates are sent and what the rate limits are. From the API docs:
Limits
Be nice. If you're sending too many requests too quickly, we'll send back a 503 error code (server unavailable).
You are limited to 5000 requests per hour per access_token or client_id overall. Practically, this means you should (when possible) authenticate users so that limits are well outside the reach of a given user.
In other words, you'll need to check for a 503 and throttle your application accordingly. I've seen no information on how long they might block you, but it's best to avoid that completely. I would advise you to manage this by placing a rate limiting mechanism in your own code, such as pushing your API requests through a queue with rate control. That will also give you the benefit of a retry if you're throttled, so you won't lose any of the updates.
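A minimal sketch of such a queue with rate control is shown here. The 5000 per hour default mirrors the quoted limit; the class, the `drain` helper and the `send` callable (which is expected to return an HTTP status code) are hypothetical names, not part of any Instagram client.

```python
import time
from collections import deque

class HourlyRateLimiter:
    """Sliding-window limiter: at most max_calls within any period seconds."""

    def __init__(self, max_calls=5000, period=3600.0, clock=time.monotonic):
        self.max_calls = max_calls
        self.period = period
        self.clock = clock
        self.calls = deque()

    def try_acquire(self):
        now = self.clock()
        # Forget calls that have aged out of the window.
        while self.calls and now - self.calls[0] >= self.period:
            self.calls.popleft()
        if len(self.calls) < self.max_calls:
            self.calls.append(now)
            return True
        return False

def drain(queue, send, limiter):
    """Send queued requests while the limiter allows; requeue any that get a 503."""
    retry = deque()
    while queue and limiter.try_acquire():
        request = queue.popleft()
        if send(request) == 503:
            retry.append(request)
    # Put throttled requests back at the front, in their original order.
    queue.extendleft(reversed(retry))
    return len(retry)
```

Calling `drain` periodically from a background worker keeps the webhook handler fast, since the handler only has to append to the queue.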
Moreover, a mechanism such as a queue in the case of real-time updates is further relevant because of the following from the API docs:
You should build your system to accept multiple update objects per payload - though often there will be only one included. Also, you should acknowledge the POST within a 2 second timeout--if you need to do more processing of the received information, you can do so in an asynchronous task.
Regarding the number of updates, the API can send you 1 update or many. The problem with this is you can absolutely murder your API calls because I don't think you can batch calls to specific media items, at least not using the official python or ruby clients or API console as far as I have seen.
This means that if you receive 500 updates either as 1 request to your server or split into many, it won't matter because either way, you need to go and fetch these items. From what I observed in a real application, these seemed to count against our quota, however the quota itself seems to consume resources erratically. That is, sometimes we saw no calls at all consumed, other times the available calls dropped by far more than we actually made. My advice is to be conservative and take the 5000 as a best guess rather than an absolute. You can check the remaining calls by parsing one of the headers they send back.
Use common sense and don't be stupid. A rate limiting mechanism should keep you safe and has the added benefit of handling failures due to outages (this happens more than you may think), network hiccups, and accidental rate limiting. You could try to be tricky and use different API keys in a pooling mechanism, but this is likely a violation of the TOS, and if they are doing anything via IP, you'd have to split this up across different machines with different IPs.
My final advice would be to restructure your application so it does not rely completely on the subscription mechanism. It's less than reliable and very expensive API-wise. It's only truly useful if you just need to do something in your app that doesn't require calling back to Instagram, your number of items is small, or you can filter out the majority of items to avoid calling back to Instagram except when a specific business rule is matched.
Instead, you can do things like query the tag or the user (ex: recent media) and scale it out that way. Normally this allows you to grab 100 items with 1 request rather than 100 items with 100 requests. If you really want to be cute, you could at least merge the subscription notifications asynchronously, grouping those that share a characteristic such as tag into a single bucket and issuing one batched request per bucket. Sort of like a map/reduce but on a small data set. You could of course do an actual map/reduce from time to time on your own data as another way of keeping things async. Again, be careful not to thrash Instagram, but rather just use map/reduce to batch out your calls in a way that's useful to your app.
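The bucketing idea could look roughly like this. The `object` and `object_id` keys are assumed from the shape of the real-time update payload, and `fetch_recent` is a hypothetical callable standing in for whatever "recent media for a tag" request your client exposes.

```python
from collections import defaultdict

def bucket_by_tag(updates):
    """Collapse many update notifications into one bucket per tag."""
    buckets = defaultdict(int)
    for update in updates:
        if update.get("object") == "tag":
            buckets[update["object_id"]] += 1
    return dict(buckets)

def fetch_once_per_tag(updates, fetch_recent):
    """Issue a single recent-media request per distinct tag."""
    return {tag: fetch_recent(tag) for tag in bucket_by_tag(updates)}

updates = [
    {"object": "tag", "object_id": "testhashtag"},
    {"object": "tag", "object_id": "testhashtag"},
    {"object": "tag", "object_id": "othertag"},
]
print(bucket_by_tag(updates))  # {'testhashtag': 2, 'othertag': 1}
```

Three notifications turn into two API calls here; with 100 posts to one hashtag the saving is 100 calls down to 1.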
Hope that helps.