Heroku - restart on failed health check - node.js

Heroku does not support health checks on its own. It will restart services that have crashed, but there is nothing like a health check.
It sometimes happens that a service becomes unresponsive even though the process is still running. In most modern cloud solutions, you can provide a health endpoint which is periodically called by the cloud hosting service; if that endpoint returns an error, or does not respond at all, the service is shut down and a new one is started.
That seems like the industry standard these days, but I am unable to find any solution for this on Heroku. I could even use an external service together with the Heroku CLI, but just calling some endpoint is not sufficient: if there are multiple instances, they all share the same URL and the load balancer routes each call to one of them at random, so it is possible to never hit the failed instance at all. Even when I do hit it, health checks usually work like "after 3 failed checks in a row, restart that instance", which is highly improbable to trigger if there are 10 instances and only one of them becomes unhealthy.
Do you have any solution to this?

You are right that this is an industry standard, and it's a shame that it's not provided out of the box.
I can think of 2 solutions (both involve running some extra code that does all of this):
a) use the Heroku API, which allows you to get the IP of individual dynos, and then you can call each dyno however you want
b) in each dyno instance, send a request to a webserver like https://iamaalive.com/?dyno=${process.env.HEROKU_DYNO_ID} (see the sketch below)
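A minimal sketch of option (b), assuming a hypothetical heartbeat receiver at iamaalive.com (the URL from the answer) and that Heroku's dyno metadata feature is enabled so HEROKU_DYNO_ID is set:

```js
// heartbeat.js - started inside each dyno alongside the app.
// Assumptions: iamaalive.com is a hypothetical monitoring endpoint,
// and HEROKU_DYNO_ID requires Heroku's dyno metadata feature.
const https = require('https');

const INTERVAL_MS = 30 * 1000; // how often each dyno reports in

function sendHeartbeat() {
  const dyno = process.env.HEROKU_DYNO_ID || process.env.DYNO || 'unknown';
  https
    .get(`https://iamaalive.com/?dyno=${encodeURIComponent(dyno)}`, (res) => {
      res.resume(); // drain the response; only the fact that we reported matters
    })
    .on('error', () => {
      // Swallow delivery errors: a dead monitor must not crash the dyno.
    });
}

setInterval(sendHeartbeat, INTERVAL_MS);
```

The receiving service then knows which dynos have reported recently; when one goes silent, it can restart that dyno through the Heroku API or CLI.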

Related

Why is my Azure node.js app becoming unresponsive?

I recently deployed a Node.js backend service to Azure and have the following problem. The service becomes unresponsive after a certain amount of time, and only comes back to life when an external request is sent. The problem is that it takes about 3 minutes for the container to start back up and actually return the response. I'm running Node 14 LTS. I also added a health check yesterday, but Azure simply doesn't bother actually keeping the app alive; here is the metric from Azure.
I verified that Azure is actually trying to reach the correct endpoint, and it does. I also have "Always On" enabled. I also verified that the app itself is not crashing. I log every request, and all of a sudden requests are no longer received, which means the health endpoint doesn't respond either, but that does not result in a container restart. Azure just waits for an external request to appear and then decides to start everything back up, which takes too long.
I feel like it's some kind of configuration issue, because the app itself is not very complex and I never experienced crashes during local development.
The official documentation tells us that on the Free pricing tier you are currently using, Always On does not take effect.
How do I decrease the response time for the first request after idle time?

How to find/cure source of function app throughput issues

I have an Azure function app triggered by an HttpRequest. The function app reads the request, tosses one copy of it into a storage table for safekeeping and sends another copy to a queue for further processing by another element of the system. I have a client running an ApacheBench test that reports approximately 148 requests per second processed. That rate of processing will not be enough for our expected load.
My understanding of function apps is that they should spawn as many instances as needed to handle the load sent to them. But this function app might not be scaling out quickly enough, as it's only handling those 148 requests per second. I need it to handle at least 200 requests per second.
I’m not 100% sure the problem is on my end, though. In analyzing the performance of my function app I found a LOT of 429 errors. What I found online, particularly https://learn.microsoft.com/en-us/azure/azure-resource-manager/resource-manager-request-limits, suggests that these errors could be due to too many requests being sent from a single IP. Would several ApacheBench 10K and 20K request load tests within a given day cause the 429 error?
However, if that’s not it, if the problem is with my function app, how can I force my function app to spawn more instances more quickly? I assume this is the way to get more throughput per second. But I’m still very new at working with function apps so if there is a different way, I would more than welcome your input.
Maybe the Premium App Service plan that's in public preview would handle more throughput? I've thought about switching over to that and running a quick test, but I'm unsure whether I'd be able to switch back.
Maybe EventHub is something I need to investigate? Is that something that might increase my apparent throughput by catching more requests and holding on to them until the function app could accept and process them?
Thanks in advance for any assistance you can give.
You don't provide much context about your app, but here are a few steps for how you can improve:
If you want more control, you need to use an App Service plan with Always On to avoid cold starts. You will also need to configure auto-scaling, since you are responsible for that in this plan; auto-scale is not enabled by default in an App Service plan.
Your Azure Function must be fully async: you have external dependencies, so you don't want to block the thread while calling them (see the sketch below).
Look at the limits. Using host.json you can tweak them.
A 429 error means that the function is too busy to process your request, so probably when you are writing to the table you are not using async and are blocking the thread.
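A minimal sketch of the fully async shape, assuming a Node.js function with declarative output bindings in function.json; the binding names outputTable and outputQueue are illustrative:

```js
// index.js - HTTP-triggered function: one copy to Table storage, one to a queue.
// Assumes function.json declares a table output binding named "outputTable"
// and a queue output binding named "outputQueue" (names are illustrative).
module.exports = async function (context, req) {
  const payload = req.body;

  // Output bindings are flushed asynchronously when the function resolves,
  // so no thread is blocked while the storage calls are in flight.
  context.bindings.outputTable = {
    PartitionKey: 'requests',
    RowKey: `${Date.now()}-${Math.random().toString(36).slice(2)}`,
    body: JSON.stringify(payload),
  };
  context.bindings.outputQueue = JSON.stringify(payload);

  context.res = { status: 202, body: 'accepted' };
};
```

With bindings, there is no hand-written storage SDK call left to accidentally make synchronous.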
Function apps do work very well and scale as documented. The 429s could be because the requests come from a single IP, which Azure could be treating as a DDoS. You can do the following:
Azure DevOps Load Test
You can load test using one of the Azure services; I am fairly sure they have better criteria for handling IPs: Azure DevOps Load Test
Provision a VM in Azure
The way I normally do it is to provision a VM (Windows 10 Pro) in Azure and use JMeter to load test. I have used this method and it works fine. You can provision a couple of them and subdivide the load.
Use professional load testing services
If possible, you may use services like Loader.io. They use sophisticated algorithms to run the load test and provision a bunch of VMs to run the same test.
Use Application Insights
If you are not already, you should be using Application Insights for a better look from the server's perspective. Go to Live Stream and see how many instances it provisions to handle the load test. You can easily look into events and error logs that may arise and investigate. You can dive deep into each associated dependency and investigate the problem.

Load test on Azure

I am running a load test using JMeter against my Azure web services.
I scaled my services to S2 with 4 instances and run 4 JMeter instances with 500 threads each.
It starts perfectly fine, but after a while calls start failing with a timeout error (HTTP status 500).
I checked the HTTP request queue on Azure and found that it is very high on the 2nd instance and very low on two instances.
Please help me make my load test succeed.
I assume you are using Azure App Service. If you check the settings of your app, you will notice that ARR's Instance Affinity is enabled by default. A brief explanation:
ARR cleverly keeps track of connecting users by giving them a special cookie (known as an affinity cookie), which allows it to know, upon subsequent requests, which server instance they were talking to. This way, we can be sure that once a client establishes a session with a specific server instance, it will keep talking to the same server as long as its session is active.
This is an important feature for session-sensitive applications, but if that's not your case, then you can safely disable it to improve the load balancing between your instances and avoid situations like the one you've described (see the sketch after the link below).
Disabling ARR’s Instance Affinity in Windows Azure Web Sites
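Besides the portal route in the linked article, the app can also opt out per response by sending the documented Arr-Disable-Session-Affinity header. A minimal sketch, assuming an Express-based Node.js app:

```js
// Assumption: an Express app. Arr-Disable-Session-Affinity is the documented
// header that tells Azure's ARR front end not to set an affinity cookie.
const express = require('express');
const app = express();

app.use((req, res, next) => {
  res.set('Arr-Disable-Session-Affinity', 'true'); // no sticky sessions => even load spread
  next();
});

app.get('/', (req, res) => res.send('ok'));

app.listen(process.env.PORT || 3000);
```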
It might be due to caching of network name resolution at the JVM or OS level, so all your requests are hitting only one server. If that is the case, add a DNS Cache Manager to your Test Plan and it should resolve your issue.
See The DNS Cache Manager: The Right Way To Test Load Balanced Apps article for more detailed explanation and configuration instructions.

How to fail over node.js timer on amazon load balancer?

I have set up 2 instances behind an AWS load balancer and deployed Node.js web services + MongoDB on both instances. The load balancer works fine with the web services.
The problem is that I also have one timer service (a Node.js service only); its job is to update my MongoDB based on some calculation.
This timer service (timer.js) must run on only one AWS instance (out of the 2) at a time, and if that instance goes down, the timer service on the other instance should take over.
I know ELB does not provide this kind of facility. Can anyone please help me get this done?
Condition: at any given time, only one timer service must be running behind the Amazon load balancer.
Thanks.
You would have to implement this yourself using a locking algorithm on top of a shared data store that supports atomic operations (a sketch using your existing MongoDB follows).
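A minimal leader-election sketch using a MongoDB lease, since you already run MongoDB; collection and field names are illustrative, and a production version would also let the current owner renew its lease early:

```js
// timer-lock.js - ensures timer.js logic runs on only one instance at a time.
const { MongoClient } = require('mongodb');

const LEASE_MS = 60 * 1000; // lease length before another instance may take over

async function tryAcquireLease(locks, owner) {
  const now = new Date();
  try {
    await locks.findOneAndUpdate(
      { _id: 'timer-lock', expiresAt: { $lt: now } }, // only take over an expired lease
      { $set: { owner, expiresAt: new Date(now.getTime() + LEASE_MS) } },
      { upsert: true }
    );
    return true; // fresh insert, or an expired lease was taken over
  } catch (err) {
    if (err.code === 11000) return false; // duplicate key: another instance holds a live lease
    throw err;
  }
}

function runTimerJob() {
  /* the MongoDB-updating work currently in timer.js */
}

async function main() {
  const client = await MongoClient.connect(process.env.MONGO_URL); // illustrative env var
  const locks = client.db('app').collection('locks');
  const owner = process.env.HOSTNAME || `pid-${process.pid}`;

  setInterval(async () => {
    if (await tryAcquireLease(locks, owner)) {
      runTimerJob(); // only the current lease holder does the work
    }
  }, LEASE_MS);
}

main();
```

Because findOneAndUpdate is atomic per document, two instances racing for an expired lease cannot both win: the loser's upsert fails with a duplicate-key error and it simply stays idle.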
Alternatively, consider starting a "timer" server in an Auto Scaling group with Min: 1, Max: 1 so Amazon keeps exactly one running. This instance can be a t2.micro, which is very cheap. It can either run the job itself, or just make an HTTP request to your load balancer to run the job at the desired interval. If you do that, only one of your servers will run each job.
Wouldn't it make more sense to handle this like any other "service" that needs to keep running?
upstart service
running node.js server using upstart causes 'terminated with status 127' on 'ubuntu 10.04'
This guy had a bad path in his file but his upstart script looks okay
monit
Node.js (sudo) and monit

what is an appropriate value for maxLag in node toobusy on Heroku?

We're evaluating using the toobusy module https://github.com/lloyd/node-toobusy on an app hosted on Heroku. I am not sure what an appropriate value for maxLag would be for the Heroku environment. It seems like it would need a fair amount of playing around and tweaking to tune. Has anyone used this module in production, and with what kind of setup (i.e. dynos) and with what params?
Thanks!
I recently started running a small auto-complete RESTful service on Heroku.
It serves lots of requests for each user (a request is sent for each character the user types).
This service is running on two 1X dynos.
I need it to keep responding fast under load, so I'm also using toobusy. At first I set it to a max lag of 10 ms, but that was an ambitious goal: it denied many requests when I ran a load test.
After some tweaking, I ended up with a max lag of 40 ms. That gave me a good balance between the load (the number of requests the service needs to handle) and the desired response time (before denying requests).
I'm monitoring my app for requests denied due to load, so I can add more dynos when needed.
I believe you'll have to play with this value and run load tests to get to the right number, as it's very specific to the hardware (in your case, Heroku) and to what your app does for each request.
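For reference, the wiring is small. A minimal sketch with the 40 ms figure from above, assuming an Express app (toobusy-js is the maintained npm package with the same API as lloyd/node-toobusy):

```js
// Assumptions: Express, and the toobusy-js package (same API as node-toobusy).
const express = require('express');
const toobusy = require('toobusy-js');

toobusy.maxLag(40); // max event-loop lag in ms before requests are shed

const app = express();

// Shed load before doing any real work for the request.
app.use((req, res, next) => {
  if (toobusy()) {
    res.status(503).send('Server is too busy, try again shortly.');
  } else {
    next();
  }
});

app.get('/autocomplete', (req, res) => res.json({ suggestions: [] }));

const server = app.listen(process.env.PORT || 3000);

process.on('SIGINT', () => {
  server.close();
  toobusy.shutdown(); // stop the lag-monitoring timer so the process can exit
  process.exit();
});
```

Counting the 503s this middleware returns is also a cheap way to drive the "add more dynos" decision mentioned above.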
