I have Zabbix version 3.4 with 2 templates: one for monitoring the OS and the other for monitoring databases. I have a few servers running CentOS 6.9 linked to these templates, and everything works just fine.
Then I added 4 CentOS 7 servers to these templates. The items work correctly and return the expected results. The problem is that when a trigger fires for these 4 servers, it never resolves: it stays active and keeps showing on the dashboard.
For example, in the database template we have an item for service status. If it is 1, the service is running; any other value means the service is not running. I stopped the service on one of the CentOS 7 servers. The value the agent returned was 0, and the trigger fired. Then I started the service. In Latest data I can see that the value is 1, which means the service is running, but the trigger didn't resolve and is still active.
Then I repeated the steps above on one of the CentOS 6.9 servers, and everything worked just fine.
Why does this happen and how can I fix it?
Update:
the trigger expression is:
{log-b:db2stat.db2instance_service[].last()}<>1
Long story short: check the DB logs to see whether some inserts/updates are failing (especially on the event_recovery and problem tables).
Short story long:
We observe similar behaviour on ZBX 4.4, and only with certain triggers checking the last 10 minutes of data (e.g. item_key.str('problem',10m)=1). The problem gets detected but is then never resolved, even after several days, even though the trigger conditions are no longer matched.
In our particular case:
I looked into the DB, found the event with the relevant eventid (e.g. 123) in the events table, and noted down its objectid (e.g. 100123).
Then I searched the events table for that objectid (100123) and found that there was indeed a "resolution" event (e.g. 125).
When I checked the event_recovery table, I couldn't find an entry pairing those two eventids (other triggers did have an entry in event_recovery after they were resolved).
I simply created the entry: insert into event_recovery (eventid, r_eventid) VALUES ('123', '125');
That is not enough, however, as the same pairing also needs to be made in the problem table.
In the problem table I found the problem with my eventid (123) and mapped the recovery event to it: update problem set r_eventid='125' where eventid='123' and objectid='100123';
The catch is that this is not a solution, just a one-time workaround. The issue keeps popping up, and at this point we suspect the cause is on the database side: we have a primary + standby DB with selects directed to the standby, so select operations that end up writing can fail because the standby is read-only.
We will try to redirect everything to primary DB to see if it helps.
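The manual lookup above can be sketched as a single query. This is only an illustration under stated assumptions: it uses an in-memory SQLite database with heavily simplified events and event_recovery tables (just the columns mentioned above, plus a value column where 1 is assumed to mean problem and 0 recovery), and the function name find_unpaired_recoveries is made up. It only reports candidate pairs and does not write anything.

```python
import sqlite3


def find_unpaired_recoveries(conn):
    """Return (problem_eventid, recovery_eventid) pairs missing from event_recovery."""
    query = """
        SELECT p.eventid, MIN(r.eventid)
        FROM events AS p
        JOIN events AS r
          ON r.objectid = p.objectid   -- same trigger
         AND r.value = 0               -- assumed: 0 marks a recovery event
         AND r.eventid > p.eventid     -- recovery came after the problem
        LEFT JOIN event_recovery AS er ON er.eventid = p.eventid
        WHERE p.value = 1              -- assumed: 1 marks a problem event
          AND er.eventid IS NULL       -- no pairing row exists yet
        GROUP BY p.eventid
        ORDER BY p.eventid
    """
    return conn.execute(query).fetchall()


# Simplified stand-in schema and sample rows (not the real Zabbix schema).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE events (eventid INTEGER PRIMARY KEY, objectid INTEGER, value INTEGER);
    CREATE TABLE event_recovery (eventid INTEGER PRIMARY KEY, r_eventid INTEGER);
    INSERT INTO events VALUES (123, 100123, 1), (125, 100123, 0);
    INSERT INTO events VALUES (200, 100200, 1), (201, 100200, 0);
    INSERT INTO event_recovery VALUES (200, 201);
""")

unpaired = find_unpaired_recoveries(conn)
print(unpaired)
```

Against a real Zabbix database the query would need the actual column set and should be run read-only first, to see how many events are affected before touching anything.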
I am fairly new to Cosmos DB and was trying to understand the increment operation that the Azure Cosmos DB SDK for Java provides for patching a document.
I have a requirement to maintain an incremental counter in one of the Documents in the container. The document looks like this-
{"counter": 1}
From my application I want to increment this counter by 1 every time an action happens. For this I am using CosmosPatchOperations. I add an increment like this: cosmosPatch.increment("/counter", 1), which works fine.
This application can have multiple instances running, all of them talking to the same document in the Cosmos container. So App1 and App2 could both trigger an increment at the same time. The SDK method returns the updated document and I need to use that updated value.
My question is: does Cosmos DB employ some locking mechanism to make sure the two patches happen one after another? And in that case, what updated value would I get in App1 and App2 (the SDK method returns the updated document)? Will it be 2 in one of them and 3 in the other?
Couchbase supports such a counter at cluster level as explained here, and it has been working perfectly for me without any concurrency issues. I am now migrating to Cosmos DB and have been struggling to find out how this can be achieved.
Update 1:
I decided to test this. I set up the Cosmos emulator on my local Mac and created a DB and a container with automatically increasing RUs starting from 1 to 10K. Then in this container I added a document like this -
{
"id": "randomId",
"counter": 0
}
After this I created a simple API whose only responsibility is to increment the counter by 1 every time it is invoked. Then I used Locust to invoke this API multiple times to mimic a small load scenario.
Initially the test ran fine, with each invocation receiving an incremented counter as expected. On increasing the load I saw some errors, namely RequestTimeOutException with status code 408. Other requests were still working fine and getting the correct counter value. I do not understand what caused the RequestTimeOut exceptions here. The stack trace hints at something to do with concurrency but I am not able to get my head around it. Here's the stack trace-
Update 2:
The test run in Update 1 was done on my local machine, and I realised I might have had resource issues locally that led to those errors. I decided to test this in a Pre-Prod environment with an actual Cosmos DB account rather than the emulator.
Test configuration-
Cosmos DB container with RUs to automatically scale from 400 to 4000
2 instances of application sharing the load.
Locust script to ingest load on the application
Findings-
Up until ~170 TPS, everything was running smoothly. Beyond that I noticed errors belonging to 2 different buckets-
"exception": "["Request rate is large. More Request Units may be needed, so no changes were made. Please retry this request later. Learn more: http://aka.ms/cosmosdb-error-429"]".
I am not sure how 170 odd patch operations would have exhausted 4000 RUs but that's a different discussion altogether.
"exception": "["Conflicting request to resource has been attempted. Retry to avoid conflicts."]", with status code 449.
This error clearly indicates that cosmos DB doesn't handle concurrent requests. I want to understand if they maintain a queue internally to handle some requests or they don't handle any concurrent writes at all.
PATCH is no different from other operations. Fundamentally, Cosmos DB implements Optimistic Concurrency Control (OCC), unlike relational databases, which rely on locking mechanisms. OCC lets you prevent lost updates and keep your data correct. It can be implemented by using the etag of a document: each document within Azure Cosmos DB has an _etag property.
In your scenario, yes: it will return 2 in one of them and 3 in the other, provided both succeed, because the SDK has a retry mechanism, which is explained here. Also have a look at this sample.
If your Azure Cosmos DB account is configured with multiple write
regions, conflicts and conflict resolution policies are applicable at
the document level, with Last Write Wins (LWW) being the default
conflict resolution policy
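To make the etag idea concrete, here is a minimal sketch of the optimistic concurrency pattern described above. It is not Cosmos DB code and does not use the SDK: TinyStore, PreconditionFailed and increment are hypothetical names for an in-memory stand-in that rejects a write when the caller's etag is stale, and the caller simply re-reads and retries.

```python
import threading


class PreconditionFailed(Exception):
    """Raised when the caller's etag no longer matches the stored one."""


class TinyStore:
    """Hypothetical in-memory stand-in for a document store with etags."""

    def __init__(self):
        # The lock only makes each individual read or compare-and-swap atomic.
        self._lock = threading.Lock()
        self._doc = {"counter": 1}
        self._etag = 0

    def read(self):
        with self._lock:
            return dict(self._doc), self._etag

    def replace(self, doc, if_match):
        with self._lock:
            if if_match != self._etag:
                raise PreconditionFailed()
            self._doc = dict(doc)
            self._etag += 1
            return dict(self._doc)


def increment(store, results):
    """Read, modify, and write back; retry if another writer got in first."""
    while True:
        doc, etag = store.read()
        doc["counter"] += 1
        try:
            updated = store.replace(doc, if_match=etag)
        except PreconditionFailed:
            continue  # stale etag: re-read and try again
        results.append(updated["counter"])
        return


store = TinyStore()
results = []
threads = [threading.Thread(target=increment, args=(store, results)) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(sorted(results))
```

Starting from a counter of 1, the two writers end up with 2 and 3 regardless of which one wins the race, because the loser's write is rejected and retried against the fresh value instead of overwriting it.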
I've set up a function app in Azure. I've added a proxy to the function (so I can assign it a different URI).
When the proxy and function have been torn down and it's time to wake them up, I sometimes get the error code 429: Too many requests, from a single Postman/Insomnia request sent to wake it up.
How do I stop this from happening?
For the time being, I've added a logic app to ping it every 5 mins.
Seems to be something with the last release of https://github.com/Azure/azure-functions-host/releases/tag/v3.0.15185
On the date of this release we started receiving 429s, a lot, on the functions we had running for a long time.
We fixed it by adding the following to host.json:
"extensions": {
    "http": {
        "dynamicThrottlesEnabled": false
    }
}
Doc: https://learn.microsoft.com/pt-br/azure/azure-functions/functions-bindings-http-webhook-output
My guess is that they've changed some default values.
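If host.json already contains other settings, the fragment above has to be merged into it rather than pasted over it. A minimal sketch of that merge follows; the helper name disable_dynamic_throttles and the sample "version" and "routePrefix" values are illustrative assumptions about what an existing file might hold, not something taken from our app.

```python
import json


def disable_dynamic_throttles(host_config):
    """Return a copy of a host.json dict with dynamicThrottlesEnabled set to False."""
    updated = json.loads(json.dumps(host_config))  # deep copy via a JSON round trip
    http = updated.setdefault("extensions", {}).setdefault("http", {})
    http["dynamicThrottlesEnabled"] = False
    return updated


# Example of an existing host.json; these values are placeholders.
existing = {"version": "2.0", "extensions": {"http": {"routePrefix": "api"}}}
patched = disable_dynamic_throttles(existing)
print(json.dumps(patched, indent=2))
```

The other keys under "extensions" and "http" are left untouched, and the original dict is not modified.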
EDIT:
We have been operating for a long time using BOTH the host.json update from above and the pinned version suggested by sanjo (https://stackoverflow.com/a/65311645/10585914).
You can follow the entire discussion here: https://github.com/Azure/azure-functions-host/issues/6984
And the PR: https://github.com/Azure/azure-functions-host/pull/6986
We are also experiencing 429s in our Azure function and have been advised by MS to force the Azure Functions extension version to a lower version by setting FUNCTIONS_EXTENSION_VERSION to 3.0.14916.0 instead of ~3.
We're still evaluating the "solution".
From Microsoft support, there are 2 workarounds:
Cassio's answer, which actually worked for us for a couple hours but then stopped working. We had been getting very consistent 429s for multiple days, then a brief stoppage after the change, then it came back.
Update your FUNCTIONS_EXTENSION_VERSION app setting to the previous version (3.0.14916.0). This has worked again in the short time since we've changed it.
I don't think your 5-minute ping is the problem, as the answer from Hury Shen suggests. We have recently begun receiving 429 responses any time our functions wake from a cold period. I don't know what has changed on the Azure side, but it is not good! One fix you could try is simply redeploying your function; we did this and it worked, at least for a time. Will report back if we find anything else.
It seems the error was caused by the logic app pinging the function every 5 minutes. As I understand it, you scheduled the logic app to call the function in order to keep the function awake.
If so, you do not need to create a logic app specifically to wake it up. You can choose the Premium plan for your function app when you create it.
Then go to the "Scale out" tab of your function app and set Always Ready Instances to 1. Your function will then have one instance always awake and will not cold start when a request comes in.
The Premium plan provides the same features and scaling mechanism as the Consumption plan (based on number of events) but with no cold start, so it costs much more than the Consumption plan. You can refer to this page about function cost.
A development server I was using ran low on disk space, causing the system to crash. When I checked the replica set cluster, it reported that 1 node was unreachable. I removed the bad node and forced the config. I went home for the day; when I came back the next day, the status was still not good, reporting one of the nodes as unreachable. I worked on something else, and later that day when I checked rs.status it came back with a primary and a secondary. I then added back the 3rd node, the one that had run out of space. Now I can connect to each node and the data looks OK, but I cannot connect to the replica set group from PHP/Node.js or Studio 3T. When I use the group connection it returns auth invalid, but I can use that same auth for each node individually.
Any ideas what could be going on and how to fix it?
What I needed to do was take down the 3 services making up the replica set in the Docker swarm. I redeployed them using my scripts with auth turned on. When I checked the replica status it returned host unreachable; when I checked again a few hours later that day, it had come back online. I was unable to get the replica set back online with rs.add / remove, but I did get it back up and running by recreating the services.
Is there anyone using the nice pinoccio from www.pinocc.io ?
I want to use it to post data into an Apache CouchDB using node.js. So I'm trying to poll data from the Pinoccio API, but I'm a little lost about whether to:
schedule the polls
do long polls
do a completely different approach
Any ideas are welcome
Pitt
Sure. I wrote the Pinoccio API, here’s how you do it
https://gist.github.com/soldair/c11d6ae6f4bead140838
This example depends on the pinoccio npm module ~0.1.3 so make sure to npm install again to pick up the newest version.
You don't need to poll, because Pinoccio will send you changes as they happen if you have an open connection to either "stats" or "sync". If you want to poll you can, but it's not "real time".
sync gives you the current state and streams changes as they happen, so it's perfect if you only need to save the changes to your troop while your script is running, or show the current and last known state on a web page.
The solution that replicates every data point we store is stats. This is the example provided. Stats lets you read everything that has happened to a scout. Digital pins for example are the "digital" report. You can ask for data from a specific point in time or just from the current time (default). Changes to this "digital" report will continue streaming live as they happen, until the "end" time is reached, or if "tail" equals 0 in the options passed to stats.
Hope this helps. I tested the script on my local Couch and it worked well. You would need to modify it to copy more stats from each scout. I hope that soon you will be able to request multiple reports from multiple scouts in the same stream; I just have some bugs to sort out ;)
You need to look into 2 dimensions:
node.js talking to CouchDB. This is well understood and there are some questions you can find here.
Getting the data from the pinoccio. The API suggests that as long as the connection is open, you get data. So use a short timeout and a loop. You might want to run your own node.js instance for that.
Interesting fact: the CouchDB team seems to work on replacing their internal JS engine with node.js
I have an intranet Sitecore website set up on IIS 7 which randomly throws the following error message:
HTTP Error 503.2 - Service Unavailable
The serverRuntime#appConcurrentRequestLimit setting is being exceeded.
To fix this issue, I have made following changes
Increased the Queue Length of application pool myrjetAppPool from 1000 to 65535.
Modified Machine.Config to increase requestQueueLimit property of ProcessModel element to 100000
Increased appConcurrentRequestLimit to 10000 by running
C:\Windows\System32\inetsrv\appcmd.exe set config /section:serverRuntime /appConcurrentRequestLimit:100000
But I'm still getting the same error. Any help is greatly appreciated.
You might check to see where all your threads are going. We had occurrences where threads for Media Library assets were hanging and blocking up the queue.
In IIS Manager, select the server node from the tree, then the "Worker Processes" feature icon, then right-click the application pool of interest and select "View current requests". You might find something is getting stuck. I sometimes hit F5 on this screen a few dozen times in very quick succession to see the rate the requests are going through (of course Performance Monitor is better for viewing metrics but it won't tell you what URLs are being processed).
Investigate references in the linked URL to 'MaxConcurrentRequestsPerCPU', which you may need to set by creating a new registry key, depending on your OS and framework.
https://learn.microsoft.com/en-us/archive/blogs/tmarq/asp-net-thread-usage-on-iis-7-5-iis-7-0-and-iis-6-0
As already commented - check the actual concurrent request count using performance counters to determine which limit you're hitting i.e. it could be a limit of 5000 or maybe 12 (per cpu).
Edit: I realise this may look like I'm talking about a different setting entirely, but I believe there is overlap here.
We got this problem after installing an IIS plugin. After a long investigation we saw that the config file C:\Windows\System32\inetsrv\config\applicationHost.config had an extra location tag for the site with the problem. After removing the extra entry and running iisreset, the site/server worked normally again. So something must have gone wrong during the installation.