I have been having an issue with an unresponsive node application for a few days now. It started when I added a new trigger. The trigger runs on every new insert to a table, and the trigger function copies all rows older than one month over to an archive table. Typically 10 or more records are inserted every ~5 seconds; I wouldn't expect that to be much of a load on the database. The node app uses the pg module.
I discovered that when I enabled the trigger, the application would stop running. When I disabled it, a huge backlog of requests would flood the system and the app would start working normally again. The issue would only manifest itself after ~10-15 minutes of running time. I thought it wouldn't be a problem since most of the time there'd be nothing for the trigger to do, but obviously I'm missing something here. Does running a trigger on every insert impose so much overhead that it brings the database to a screeching halt?
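The trigger is set up roughly like this (simplified; the table and column names here are placeholders rather than my real schema):

    -- Placeholder names; the point is the shape: an AFTER INSERT, row-level
    -- trigger whose function re-scans the table for month-old rows on every
    -- single insert.
    CREATE OR REPLACE FUNCTION archive_old_rows() RETURNS trigger AS $$
    BEGIN
        INSERT INTO readings_archive
        SELECT * FROM readings
        WHERE  created_at < now() - interval '1 month';
        RETURN NEW;
    END;
    $$ LANGUAGE plpgsql;

    CREATE TRIGGER trg_archive_on_insert
    AFTER INSERT ON readings
    FOR EACH ROW EXECUTE PROCEDURE archive_old_rows();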
Use pg_stat_statements with pg_stat_statements.track = all to find out which query is hogging your database.
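For example, something along these lines (assuming you can edit postgresql.conf and restart; on newer Postgres versions the timing column is total_exec_time rather than total_time):

    -- postgresql.conf (restart required for the module to load):
    --   shared_preload_libraries = 'pg_stat_statements'
    --   pg_stat_statements.track = all   -- also track statements run inside functions/triggers
    CREATE EXTENSION IF NOT EXISTS pg_stat_statements;

    -- After the workload has run for a while, list the most expensive queries:
    SELECT query, calls, total_time, rows
    FROM   pg_stat_statements
    ORDER  BY total_time DESC
    LIMIT  10;

With track = all, the statements executed inside your trigger function show up individually, so you can see whether the archiving query is what's eating the time.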
G'day folks,
I'm having some issues with an Azure function that I'm hoping someone might be able to help with.
We have a relatively long-running process (3-4 mins) that is triggered from a Service Bus message, and we were having issues with the function execution ending without error and then attempting to re-process. The time taken for this to happen is less than all of the timeout/lock duration settings we have configured. Watching the logs (log stream, for both the file system and App Insights) we see the last line of the previous execution, then it kicks straight into the next one.
To determine whether it's service bus related, I've also tried executing the process via a blob trigger (the process uses the file as a data source anyway) but I'm seeing the same thing except I don't see the subsequent retries.
In both scenarios I don't see anything in App insights apart from the Trace records. I don't get an exception, or even a 'request' entry. (function logic is all enclosed in try/catch blocks btw)
So my question is - Is it possible to trap these scenarios so we can determine the root cause? Currently I've got nothing to go on to try and diagnose. These errors don't happen when running locally.
FWIW, we've seen this issue happen during the execution of third-party libraries (MS Graph and an OpenXMLPowerTools library), as we're generating documents for upload into SharePoint. Not sure if this is relevant.
Thanking you in advance,
Tim
Maybe this is because of the plan you are using. If you're on the Consumption plan, the default timeout is 5 minutes, but you can increase it to a maximum of 10 minutes. The maximum timeout on a Premium plan is 60 minutes. You can set the timeout as long as you want if you have a dedicated App Service plan.
Also try configuring the timeout of your function app by changing the value of functionTimeout in its host.json; the value is a timespan string such as 00:10:00 for 10 minutes.
You should have a look at durable functions.
They allow you to run long-running processes, e.g. import/export tasks.
I was able to wrap a long-running import process, which takes about 20 minutes, and run it successfully.
We have a long-running ASP.NET WebApp in Azure which has no real endpoints exposed – it serves a single functional purpose, primarily reading and manipulating database data; effectively it is a batched, scheduled task, triggered by a timer every 30 seconds.
The app runs fine most of the time, but we are seeing occasional issues where the CPU load for the app goes close to the maximum for the App Service plan instantaneously rather than gradually, and the app stops executing any more timer triggers. We cannot find anything in the executing code to account for it (no signs of deadlocks etc., and all code paths have try/catch so there should be no unhandled exceptions). More often than not we see errors getting a connection to a database, but it's not clear if those are cause or symptom.
Note, this is the only resource within the App Service plan. The Azure SQL database is in the same region and, whilst utilised by other apps, is only lightly used by them; those apps exhibit none of the issues seen by the problem app.
It feels like this is infrastructure related but we have been unable to find anything to explain what is happening so if anyone has any suggestions for where we should be looking they would be gratefully received. We have enabled basic Application Insights (not SDK) but other than seeing CPU load spike prior to loss of app response there is little information of interest given our limited knowledge of how to best utilise Insights.
Based on your description, there are two things I would check to troubleshoot the problem. First, track the running status of your program in code: write a log entry at the beginning and end of your batched scheduled task to record the status of each run. If possible, also record request and response information along with the start and end times, so you have a complete record of when the task ran and how it behaved.
Second, write a log entry before the program starts any database operation and record whether the database connection succeeds. Ideally you would also record which piece of business logic is running when the CPU load spikes and track the specific conditions, so you can analyse exactly what causes the database connection failures.
Because you cannot reproduce the problem on demand, you can only guess at the cause. If you still can't find the problem through the two points above, then adjust your timer so the program triggers once every 5 minutes instead of every 30 seconds.
In Kentico (9) when I run the task "Delete inactive contacts" it never actually runs and the result is always "Rescheduled to delete more contacts in next off-peak period"
I've tried changing the settings to run once a week and I've tried creating a custom IDeleteContacts then setting it to use that custom class, but I always get the same result.
Any ideas?
By default, Kentico runs its scheduled tasks in the tail of regular web requests. That's fine if you have traffic 24/7. If you don't, you can run into all kinds of nastiness, including the issue you're describing now, because scheduled tasks are not executing.
If you're running on a Windows server you can set up a service to trigger scheduled tasks. If that's not an option, you can set up monitoring to hit your site every couple of minutes, for example UptimeRobot or Application Insights. You'll get the added bonus of being notified whenever the site goes down.
If you really need to clean up the EMS contacts because it's getting out of control, you can access the database directly and trigger the same stored procedure that the scheduled task uses. It's called [Proc_OM_Contact_MassDelete] and takes a where clause and a batch size. The where clause is where you specify the delete policy. For example:
ContactCreated < GETDATE()-60 AND ([ContactEmail] IS NULL OR [ContactEmail]='')
With this where clause the stored proc would process contacts that were created over 60 days ago and don't have an e-mail address yet.
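A call would look something like this; note that the parameter names here are an assumption based on the description above (a where clause and a batch size), so check the procedure definition in your database for the exact signature:

    -- Parameter names assumed for illustration; verify against the actual
    -- stored procedure definition before running.
    EXEC [Proc_OM_Contact_MassDelete]
        @where = N'ContactCreated < GETDATE()-60 AND ([ContactEmail] IS NULL OR [ContactEmail]='''')',
        @batchSize = 1000;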
Please be aware that large volumes of EMS data will require database index tuning for this procedure to run within an acceptable period of time. This is true for EMS in general when your site has a decent amount of traffic.
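As a rough sketch, an index supporting the example where clause above might look like the following (the index name is made up, and the right key and included columns depend entirely on the filter you actually pass, so treat this as a starting point only):

    -- Hypothetical supporting index for the example filter on the contact table;
    -- adjust the key and included columns to match your real where clause.
    CREATE NONCLUSTERED INDEX IX_OM_Contact_ContactCreated
        ON OM_Contact (ContactCreated)
        INCLUDE (ContactEmail);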
If the standard Kentico cleanup doesn't work, for example because the database is unable to deal with millions of contacts, we've written a script to purge all EMS data. Use with caution ;)
Do you have the latest hotfix (9.0.50) applied to your project? There was a bug where, if the deletion of inactive contacts took longer than 1 minute, the next run of the "Delete inactive contacts" scheduled task was not set and the task did not execute again. You can download the package directly from this page: https://devnet.kentico.com/download/hotfixes
The "Delete inactive contacts" scheduled task only runs between 2am and 6am based on the servers time the site is running on. You can see this in the documentation. It only ever deletes a batch of 1000 contacts and never more. If you want to "trick" the site into running the scheduled task more, update the time on the server to 1:58am and restart the site.
UPDATE: I've figured it out. See the end of this question.
I have an Azure App Service running four sites. One of the sites has two deployment slots in addition to the primary one. Recently I've been seeing really high CPU utilization for the App Service plan as a whole.
The dark orange line shows the CPU percentage. This is just after restarting all my sites, which brought it down to this level.
However, when I look at the CPU use reported by each site, it's really low.
The darker blue line shows the CPU time, which is basically nothing. I did this for all of my sites, and all the graphs look the same. Basically, it seems that none of my sites are causing the issue.
A couple of the sites have web jobs, so I took a look at the logs but everything is running fine there. The jobs run for a few seconds every few hours.
So my question is: how can I determine the source of this CPU utilization? Any pointers would be greatly appreciated.
UPDATE: Thanks to the replies below, I was able to get more detail into what was happening. I ended up getting what I needed from SCM / Kudu tools. You can get here by going to your web app in Azure and choosing Advanced Tools from the side nav. From the Kudu dashboard, choose Process Explorer. The value in the Total CPU Time column is not directly useful, because it's the time in seconds that the process has run since it started, which might have been minutes or days ago.
However, if you make a record of the value at intervals, you can look at the change over time, and one process might jump out at you. In my case, it was my WebJobs process. Every 60 seconds, this one process was consuming about 10 seconds of processor time, just within one environment.
The great thing about this Kudu dashboard is, if you can catch the problem while it is actually happening, you can hit the Start Profiling button and capture a diagnostic session. You can then open this up in Visual Studio and get some nice details about where the CPU time is being spent.
Just in case anyone else is seeing similar issues, I'll provide more details about my particular case. As I mentioned, my WebJobs exe was the culprit, and I found that all the CPU time was being spent in StackExchange.Redis.SocketManager, which manages connections to Azure Redis Cache. In my main web app, I create only one connection, as recommended. But since my WebJobs only run every once in a while, I was creating a new connection to Azure Redis Cache each time one ran, which apparently can lead to issues. I changed my code to create the Redis Cache connection once when the WebJob process starts up and to reuse that connection whenever an individual WebJob runs.
Time will tell if this really fixes the issue, but I think it will. When the problem occurred, it always fit the same pattern: After a few days of running fine, my CPU would slowly ramp up over the course of about 12 hours. My thinking is that each time a WebJob ran, it created a connection object, which at first didn't produce trouble, but gradually as WebJobs ran every hour or two, cruft was building up until finally some critical threshold was met and the CPU usage would take off.
Hope this helps someone out there. Best wishes!
Maybe you should go to the web app's SCM (Kudu) site?
%yourAppName%.scm.azurewebsites.net
There is a page there that shows all the processes currently running on your web app (something like Console > Process).
You can also go to the support page (linked from the corner of the SCM site).
You can find some more info about your app's performance there and take a memory dump (not for this particular problem, but it's useful for performance issues).
Based on your description, you could leverage the Crash Diagnoser extension to capture dump files from your Web Apps and WebJobs when the CPU usage percentage is higher than a specific threshold, to help isolate this issue. For more details, you could refer to the official blog.
I have recently started working with Firebird DB v2.1 on a Linux Redhawk 5.4.11 system. I am trying to create a monitoring script that gets kicked off via a cron job. However, I am running into a few issues and I was hoping for some advice...
First off, I have read through most of the documentation that comes with the Firebird DB and a lot of the documentation provided on their site. I have tried using the gstat tool which is supplied, but that didn't seem to give me the kind of information I was looking for. I then ran across the README.monitoring_tables file, which seemed to describe exactly what I wanted to monitor. Yet this is where I started to hit a snag in my progress...
After logging into the db via isql, I ran SELECT MON$PAGE_READS, MON$PAGE_WRITES FROM MON$IO_STATS; and got some numbers which seemed okay. However, upon running the command again, the data appeared to be stale because the numbers were not updating. I waited 1 minute, 5 minutes, 15 minutes, and the data was the same each time. Only once I logged off and back on and ran the command again did the data change. It appears that the data only refreshes on a relogin, and even then I am not sure the data is correct.
My question now is: am I even doing this correctly? Are these commands truly monitoring my db, or are they just monitoring the command itself? Also, why does it take a relogin to refresh the statistics? One thing I was worried about was inconsistency in my data. In other words, my system was running, yet each time I logged on the read/write counts were not increasing linearly; they would vary from 10k to 500 to 2k. Any advice or help would be appreciated!
When you query a monitoring table, a snapshot of the monitoring information is created so the contents of the monitoring tables are stable for the rest of the transaction. You need to commit and start a new transaction if you want fresh information. Firebird always uses a transaction (and isql implicitly starts a transaction if none was started explicitly).
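For example, in isql:

    SELECT MON$PAGE_READS, MON$PAGE_WRITES FROM MON$IO_STATS;
    COMMIT; -- ends the current transaction; isql starts a new one for the next statement
    -- This select runs in the new transaction, so a fresh snapshot is taken:
    SELECT MON$PAGE_READS, MON$PAGE_WRITES FROM MON$IO_STATS;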
This is also documented in doc/README.monitoring_tables (at least in the Firebird 2.5 version):
A snapshot is created the first time any of the monitoring tables is being selected from in the given transaction and it's preserved until the transaction ends, so multiple queries (e.g. master-detail ones) will always return the consistent view of the data. In other words, the monitoring tables always behave like a snapshot (aka consistency) transaction, even if the host transaction has been started with another isolation level. To refresh the snapshot, the current transaction should be finished and the monitoring tables should be queried in the new transaction context.
(emphasis mine)
Note that depending on your monitoring needs, you should also look at the trace functionality that was introduced in Firebird 2.5.