Azure Microservices Performance Insights - Collective Performance Counter Reporting

I have around 10 microservice applications in .net, all hosted on Azure ServiceFabric.
These applications are set up in a sequence, for example:
API call to Application 1 > stores data in cosmos > sends message to Application 2
Application 2 > depending on the data and business logic, sends a message to the relevant department (Application 3, 4, 5, etc.)
Application 3 processes and stores the data in database
I want a performance metric that shows the start/end time or the total time taken to perform one end-to-end cycle for a payload.
I have come across certain solutions for this:
Log metrics in Application Insights before and after method calls
Example:
Create and use a unique Guid as correlationId
Application 1 > Method1() - Record Start Time
Application 1 > Method() - Record start and end time
Application 3 > Method2() - Record start and end time
Application 3 > Method2() - Record End Time
These entries are then available in Application Insights when I search for that Guid.
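A minimal sketch of what this recording could look like in each application (the class name StageLogger and the event name "PipelineStage" are illustrative assumptions, not an established implementation):

// Sketch: each stage emits a custom event tagged with the shared correlation Guid,
// so all stages of one payload can later be joined on CorrelationId.
using Microsoft.ApplicationInsights;
using Microsoft.ApplicationInsights.DataContracts;

public class StageLogger
{
    private readonly TelemetryClient _telemetry;

    public StageLogger(TelemetryClient telemetry) => _telemetry = telemetry;

    public void RecordStage(string correlationId, string application, string stage)
    {
        var evt = new EventTelemetry("PipelineStage");
        evt.Properties["CorrelationId"] = correlationId;
        evt.Properties["Application"] = application;
        evt.Properties["Stage"] = stage; // e.g. "Start" or "End"
        _telemetry.TrackEvent(evt);      // the timestamp is attached automatically
    }
}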
Even here I have a question: how could I improve the visibility of this, maybe with charts or reports? What options could I use in Application Insights?
Log as above but in a separate database; this way we have control over the data (Application Insights holds huge amounts of data and can't serve as the backend for a separate API).
Create a new API that takes the Guid as input; the response would be something like below:
Total EndToEnd Time: 10seconds
Application1> Method2(): 2 seconds
...
I know there could be better options, but I need some direction on this, please.

There are two options to do this with Application Insights. Neither is ideal at this point.
Option I. If you store all telemetry in the same resource and your app doesn't have too much load, then you can group (summarize) by CorrelationId. Here is an idea (you might want to extend it by recording the start time when a payload enters Application 1 and the end time when it reaches Application 3):
let Source = datatable(RoleName:string, CorrelationId:string, Timestamp:datetime)
[
'Application 1', '1', '2021-04-22T05:00:45.237Z',
'Application 2', '1', '2021-04-22T05:01:45.237Z',
'Application 3', '1', '2021-04-22T05:02:45.237Z',
'Application 1', '2', '2021-04-22T05:00:45.237Z',
'Application 2', '2', '2021-04-22T05:01:46.237Z',
'Application 3', '2', '2021-04-22T05:02:47.237Z',
];
Source
| summarize min_timestamp=min(Timestamp), max_timestamp=max(Timestamp) by CorrelationId
| extend duration = max_timestamp - min_timestamp
| project CorrelationId, duration
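Regarding the charts question: the same summarized result can be rendered directly as a chart in the Logs blade with the standard KQL render operator, and the resulting chart can be pinned to an Azure dashboard. A sketch building on the query above:

Source
| summarize min_timestamp=min(Timestamp), max_timestamp=max(Timestamp) by CorrelationId
| extend duration_seconds = (max_timestamp - min_timestamp) / 1s
| project CorrelationId, duration_seconds
| render columnchart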
Option II. Application Insights supports the W3C Distributed Tracing standard for HTTP calls. If you manually propagate the distributed tracing context through your messages (between applications) and restore this context, then you can do the following (see the sketch after this list):
In Application 1 you can put the start time in Baggage
This field will get propagated across applications [note, OperationId will also propagate]
In Application N you will know exactly when a particular request/transaction started, so you will be able to emit a proper metric
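A minimal sketch of the Baggage approach, assuming System.Diagnostics.Activity is available in the message handlers and a TelemetryClient instance is in scope; the baggage key "pipeline-start" and the metric name are illustrative assumptions:

using System;
using System.Diagnostics;
using Microsoft.ApplicationInsights;

public static class PipelineTiming
{
    // Application 1: stamp the pipeline start time into the tracing context.
    public static void MarkStart()
    {
        Activity.Current?.AddBaggage("pipeline-start", DateTimeOffset.UtcNow.ToString("o"));
    }

    // Application N: read the start time back once the propagated context is restored,
    // then emit the end-to-end duration as a metric.
    public static void EmitDuration(TelemetryClient telemetry)
    {
        var startRaw = Activity.Current?.GetBaggageItem("pipeline-start");
        if (startRaw == null) return;

        var elapsed = DateTimeOffset.UtcNow - DateTimeOffset.Parse(startRaw);
        telemetry.GetMetric("EndToEndDurationMs").TrackValue(elapsed.TotalMilliseconds);
    }
}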

Related

BigQuery Internal Error with `pageToken` when running in GCP

I run into this error with BigQuery:
"An internal error occurred and the request could not be completed. This is usually caused by a transient issue. Retrying the job with back-off as described in the BigQuery SLA should solve the problem: https://cloud.google.com/bigquery/sla. If the error continues to occur please contact support at https://cloud.google.com/support. Error: 5034996"
Two applications use the same approach with pageToken to paginate through big result sets:
1. run query with initial startIndex: 0, maxResults: 10
2. get result together with pageToken
3. send to client
4. ... some time may pass ...
5. request "next page": use pageToken together with maxResults: 10 to get the next result
6. repeat from 3.
NodeJS 16, @google-cloud/bigquery 6.0.3
Locally (Windows 10), everything works for both applications; pagination with pageToken returns results quite fast (<5s). All steps 1 to 6 work, including requesting multiple next pages one after another; I even tested that the pageToken still works after 60+ minutes.
The production cloud has problems: the initial query always works, but as soon as a pageToken is given, the query fails after ~15s, even when the next page is requested directly (1-5s delay) after getting the first page. Steps 1 to 3 work, but requesting the next page fails almost every time; it is very rare that it doesn't fail.
Production uses Google Cloud Functions and Google Cloud Run to serve the applications.
One application is an internal experiment; it uses the same dataset + table when running locally and when running in "production".
The other application uses the same dataset but different tables for local/production - and is in another Google Cloud project than the first application.
Thus project-level quotas or, e.g., different table setups between local and production shouldn't be causing the issue here (hopefully).
Example code used:
const [rows, , jobQueryResults] = await (job.getQueryResults(
    ('maxResults' in paginate
        ? {
            // There seems to be no need to give the startIndex again, but tested it also with a stable `0` index;
            // at the moment, as soon as a pageToken is given, the `startIndex` is omitted.
            startIndex: paginate.pageToken ? undefined : paginate.startIndex,
            maxResults: paginate.maxResults,
            pageToken: paginate.pageToken,
        }
        : undefined) as QueryResultsOptions,
) as Promise<QueryRowsResponse>)
What puzzles me is that the pageToken isn't shown in the log of the failure, while maxResults is visible.
Edit
The error suggests some SLA problem. One of the GCP projects only includes experimental (non-public) applications, so any traffic/usage can easily be monitored.
The monitoring for BigQuery in that project shows roughly 1 job per second when testing it: jobs 1 and 2 were "load without pageToken", and job 3 used the pageToken from job 2 and ran into the error. Any retries must happen on the BigQuery side; there is nothing implemented on my side (using only the official BigQuery package).
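Not an explanation of the internal error, but since the error text itself suggests retrying with back-off, one workaround is to wrap the paginated call in an explicit retry loop. This is a sketch only; the function name, attempt count, and delay values are my own choices, not from the question:

import type { Job, QueryResultsOptions, QueryRowsResponse } from '@google-cloud/bigquery';

// Retry getQueryResults with exponential back-off (500ms, 1s, 2s, ...).
async function retryGetQueryResults(
    job: Job,
    options: QueryResultsOptions,
    maxAttempts = 4,
): Promise<QueryRowsResponse> {
    let delayMs = 500;
    for (let attempt = 1; ; attempt++) {
        try {
            return await job.getQueryResults(options);
        } catch (err) {
            if (attempt >= maxAttempts) throw err; // give up after the last attempt
            await new Promise((resolve) => setTimeout(resolve, delayMs));
            delayMs *= 2;
        }
    }
}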

What is iKey in the traces table - KQL [Kusto Query Language] - Application Insights - also optimization of the following KQL query

I was getting around 60 exceptions when I ran my .NET Core application, all of a particular type: Partner Center exceptions.
I have dealt with those exceptions, but now I am writing some KQL queries so that I know beforehand if anything goes wrong.
I want to write a KQL query that in future catches exceptions from Partner Center, but not that type of exception. How do I filter them out?
My Query looks like -
traces
| where customDimensions.LogLevel == "Error"
| where operation_Name == "functionName"
| where iKey != "************"
I saw this iKey field: what is it, and how can I write the desired query?
Also: Could not find purchase charge for customer and "errorName":"RelationshipDoesNotExist" both come through in the message and also in the customDimensions field.
Can I extract this errorName and exclude these types of exceptions? Is there any way to do that?
For now I have used:
where message !contains_cs "Could not find purchase charge for customer"
but it has a high compute price, so I am looking for an alternative to optimize the query.
iKey corresponds to the instrumentation key:
When you set up Application Insights monitoring for your web app, you create an Application Insights resource in Microsoft Azure. You open this resource in the Azure portal in order to see and analyze the telemetry collected from your app. The resource is identified by an instrumentation key (ikey). When you install the Application Insights package to monitor your app, you configure it with the instrumentation key, so that it knows where to send the telemetry.
(source)
I want to write KQL query which in future catches exceptions from partner center but not that type of exception - so how to filter them out?
Exceptions are stored in the exceptions table. You can filter them based on a known property like the exception type. For example, say you want all exceptions except those of type NullReferenceException; you can do something like this:
exceptions
| where ['type'] != "System.NullReferenceException"
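As for extracting errorName: if it really is present in customDimensions, pulling it out with tostring should be cheaper than scanning the whole message with !contains_cs. A sketch, assuming the property inside customDimensions is literally named errorName:

traces
| where customDimensions.LogLevel == "Error"
| where operation_Name == "functionName"
| extend errorName = tostring(customDimensions.errorName)
| where errorName != "RelationshipDoesNotExist"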

How to use BeginScope in Azure Application Insights (in https://portal.azure.com)?

My C# code is log.BeginScope("Testing Scope1"); and log.BeginScope("Testing Scope2");. How can I use these in Azure Application Insights (in https://portal.azure.com)?
If your code is like below:
using (_logger.BeginScope("Testing Scope1"))
{
    _logger.LogInformation("this is an info from index page 111111");
}
Then, after the code is executed, navigate to Azure portal -> your Application Insights resource -> Logs, and in the traces table run the following query (also note to select a proper "time range"):
traces
| where customDimensions.Scope contains "Testing Scope1"
| project message, customDimensions
By the way, it may take a few minutes for the logs to show up. Please also set the proper log level in your application (for example, the log level in your Azure Function).
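As a side note, if you pass key/value state to BeginScope instead of a plain string, the Application Insights logger flattens those keys into customDimensions, which is easier to filter on than the Scope text. A sketch (the OrderId key is illustrative):

using (_logger.BeginScope(new Dictionary<string, object> { ["OrderId"] = 42 })) // needs System.Collections.Generic
{
    _logger.LogInformation("processing order");
}

You could then query it with, for example: traces | where customDimensions.OrderId == "42".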

Reserved EventIds in ApplicationInsights

I am creating some LogError calls in my ASP.NET Core web app, along the lines of
_logger.LogError(new EventId(5000,"CustomName"),"description");
I can find this event in Application Insights by querying like this
traces | where timestamp > ago(10m) | where customDimensions.EventId == 5000
Is there any list of event ids that are reserved? I only want to get my own events. I know that a third-party library that I bind into my project could theoretically write some events with the above event id, but I am wondering whether Microsoft has a list of reserved event ids. If I do this search in my log
traces | where timestamp > ago(10m) | where customDimensions.EventId > 1
I get some hits on Azure Functions startup, so I know that Microsoft uses these as well.
I have searched the docs, but haven't found any list.
No, there are no reserved EventIds in Application Insights. You always need to provide them yourself.
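Since nothing is reserved, one way to keep only your own events is to also filter on the logger category, which the Application Insights logger provider records in customDimensions.CategoryName. A sketch, where the MyCompany. namespace prefix is an illustrative assumption:

traces
| where timestamp > ago(10m)
| where customDimensions.EventId == 5000
| where tostring(customDimensions.CategoryName) startswith "MyCompany."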

Umbraco 7.6.0 - Site becomes unresponsive for several minutes every day

We've been having a problem for several months where the site becomes completely unresponsive for 5-15 minutes every day. We have added a ton of request logging, enabled DEBUG logging, and have finally found a pattern: Approximately 2 minutes prior to the outages (in every single log file I've looked at, going back to the beginning), the following lines appear:
2017-09-26 15:13:05,652 [P7940/D9/T76] DEBUG Umbraco.Web.PublishedCache.XmlPublishedCache.XmlCacheFilePersister - Timer: release.
2017-09-26 15:13:05,652 [P7940/D9/T76] DEBUG Umbraco.Web.PublishedCache.XmlPublishedCache.XmlCacheFilePersister - Run now (sync).
From what I gather this is the process that rebuilds the umbraco.config, correct?
We have ~40,000 nodes, so I can't imagine this is the quickest process to complete; however, the strange thing is that CPU and memory on the Azure Web App do not spike during these outages. This seems to point to disk I/O as the bottleneck.
This raises a few questions:
Is there a way to schedule this task so that it only runs during off-peak hours?
Are there performance improvements in the newer versions (we're on 7.6.0) that might improve this functionality?
Are there any other suggestions to help correct this behavior?
Hosting environment:
Azure App Service B2 (Basic)
SQL Azure Standard (20 DTUs) - DTU usage peaks at 20%, so I don't think there's anything there. Just noting for completeness
Azure Storage for media storage
Azure CDN for media requests
Thank you so much in advance.
Update 10/4/2017
If it helps, it appears that these particular log entries correspond with the first publish of the day.
I don't think 40,000 nodes is too much for Umbraco, but if you want to schedule republishes, you can do this:
You can programmatically call a cache refresh using:
ApplicationContext.Current.Services.ContentService.RePublishAll();
(Umbraco source)
You could create an API controller which you could call periodically by a URL. The controller would probably look something like:
public class CacheController : UmbracoApiController
{
    [HttpGet]
    public HttpResponseMessage Republish(string pass)
    {
        if (pass != "passcode")
        {
            return Request.CreateResponse(HttpStatusCode.Unauthorized, new
            {
                success = false,
                message = "Access denied."
            });
        }

        var result = Services.ContentService.RePublishAll();

        if (result)
        {
            return Request.CreateResponse(HttpStatusCode.OK, new
            {
                success = true,
                message = "Republished"
            });
        }

        return Request.CreateResponse(HttpStatusCode.InternalServerError, new
        {
            success = false,
            message = "An error occurred"
        });
    }
}
You could then periodically ping this URL:
/umbraco/api/cache/republish?pass=passcode
I have a blog post you can read on how to schedule events like these. I recommend just using the Windows Task Scheduler to ping the URL: https://harveywilliams.net/blog/better-task-scheduling-in-umbraco#windows-task-scheduler
