[Problem Statement]
We have a Tier 0 service with an HAProxy load balancer and multiple back-end servers configured behind it. Currently, the infrastructure serves requests with a P99 latency of ~100 ms. Our target is 100% availability and zero downtime. Sometimes a back-end server misbehaves or drops out of the LB, and at that moment all requests that landed on that server time out.
So we are looking for a configuration in which any request that takes more than 100 ms on one server is routed to another back-end server, so that we can get close to 100% of requests completing without timeouts.
[Disclaimer]
I understand that if a request still times out after a certain number of retries, the timeout will be returned to the end consumer of our Tier 0 service.
[Tech Stack]
HAProxy
Java
MySQL
Azure
I would appreciate a discussion of this problem. I searched a lot but could not find any reference for the approach I have in mind. There may well be other ways to achieve no downtime while staying within the service's defined SLA.
Thanks
The option redispatch directive sends a request to a different server.
The retry-on directive states what type of errors to retry on.
The retries directive states how many times to retry.
option redispatch 1
retry-on all-retryable-errors
retries 3
Plus, you'll want to test how to set up the following timeouts:
timeout connect 5000ms
timeout client 50000ms
timeout server 50000ms
Make sure all requests are idempotent and have no side effects. Otherwise, you will end up causing a lot of problems for yourself.
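To make the retry behaviour described above concrete, here is a minimal Python sketch of the idea, not of HAProxy's internals. It assumes one initial attempt plus up to `retries` further attempts, each sent to the next backend in round-robin order. The names `BackendTimeout`, `dispatch_with_redispatch`, `slow_backend` and `healthy_backend` are hypothetical and exist only for this illustration.

```python
from typing import Callable, Optional, Sequence


class BackendTimeout(Exception):
    """Raised by a backend call that did not answer in time (hypothetical)."""


def dispatch_with_redispatch(
    backends: Sequence[Callable[[str], str]],
    request: str,
    retries: int = 3,
) -> str:
    """Try the request on one backend; on timeout, move on to the next one.

    Assumption of this sketch: one initial attempt plus up to `retries`
    further attempts, each on the next backend in round-robin order.
    """
    last_error: Optional[Exception] = None
    for attempt in range(retries + 1):
        backend = backends[attempt % len(backends)]
        try:
            return backend(request)
        except BackendTimeout as exc:
            last_error = exc  # remember the failure and redispatch
    # Every attempt timed out, so the caller sees the timeout after all.
    raise BackendTimeout(f"gave up after {retries + 1} attempts") from last_error


def slow_backend(request: str) -> str:
    """Stand-in for a misbehaving server: it always times out."""
    raise BackendTimeout("no answer within the deadline")


def healthy_backend(request: str) -> str:
    """Stand-in for a healthy server."""
    return f"ok:{request}"
```

Because the same request may reach more than one backend, the sketch is only safe when `backend(request)` is idempotent, which is the point of the warning above.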
I am using
App Engine Flexible, custom runtime.
nodejs, as base Image.
express
Cloud Tasks for queuing the requests
puppeteer job
My Requirements
20GB RAM
long-running process
Because of my unique requirement, I want each instance to handle only one request at a time. Only when the instance becomes free, or the request times out, should it get a new request.
I have managed to reject other requests while the instance is processing a request, but I cannot figure out the appropriate automatic scaling settings.
Please suggest the best way to achieve this.
Thanks in advance!
In your app.yaml try restricting the max_instances and max_concurrent_requests.
I also recommend looking into rate limiting your Cloud Tasks queue in order to reduce unnecessary attempts to send requests. Also you may want to increase your MIN_INTERVAL for retry attempts to spread out requests as well.
Your task queue will continue to process and send tasks at the rate you have set, so if your instance rejects a request, the task will go into a retry pattern. It seems like you're focused on the scaling of App Engine, but your issue is with Cloud Tasks. You may want to schedule your tasks so they fire at the interval you want.
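As a rough illustration of why a larger minimum retry interval spreads requests out, the following Python sketch computes a plain doubling backoff with a cap. This is a generic pattern and not a reproduction of the exact Cloud Tasks retry algorithm; the function name `backoff_schedule` and its parameters are hypothetical.

```python
from typing import List


def backoff_schedule(min_interval: float, max_interval: float, attempts: int) -> List[float]:
    """Delay before each retry: start at min_interval, double, cap at max_interval."""
    delays: List[float] = []
    delay = min_interval
    for _ in range(attempts):
        delays.append(min(delay, max_interval))
        delay *= 2
    return delays
```

For example, `backoff_schedule(10, 300, 6)` gives `[10, 20, 40, 80, 160, 300]`, so a rejected task is not retried again straight away while the instance is still busy.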
You could set readiness checks on your app.
When an instance is handling a request, set the readiness check to return a non-ready status. 429 (too many requests) seems like a good option.
This should avoid traffic to that specific instance.
Once the request is finished, return a 200 from the readiness endpoint to signal that the instance is ready to accept a new request.
However, I'm not sure how this will work with the auto-scaling options. The app only scales up once the average CPU goes over the defined threshold, so if all instances are occupied but do not reach that threshold, the load balancer won't know where to route requests (no instances are ready), and the app won't scale up.
You could play around a little with this idea and manual scaling, or programmatically change min_instances (in automatic scaling) through the GAE admin API.
Be sure to always return a 200 for the liveness check, or the instance will be killed as it will be considered unhealthy.
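The readiness idea above can be sketched in Python, independent of any web framework. The class `SingleSlotWorker` and its method names are hypothetical. In a real app, the two status methods would back the readiness and liveness endpoints, and `try_start` / `finish` would wrap the long-running job.

```python
import threading


class SingleSlotWorker:
    """Tracks whether the instance's single request slot is in use."""

    def __init__(self) -> None:
        self._busy = threading.Lock()

    def try_start(self) -> bool:
        """Claim the only slot; returns False if a request is already running."""
        return self._busy.acquire(blocking=False)

    def finish(self) -> None:
        """Free the slot once the request has completed or timed out."""
        self._busy.release()

    def readiness_status(self) -> int:
        """429 while a request is in flight, 200 when the instance is free."""
        return 429 if self._busy.locked() else 200

    def liveness_status(self) -> int:
        """Always 200, so a busy instance is never treated as unhealthy."""
        return 200
```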
We have a setup with several RESTful APIs on the same VM in Azure.
The websites run in Kestrel on IIS.
They are protected by the azure application gateway with firewall.
We now have requests that would run for at least 20 minutes.
The request runs its full length uninterrupted on Kestrel (visible in the logs), but the sender either gets "socket hang up" after exactly 5 minutes or waits forever, even after the request has finished in Kestrel. The request continues in Kestrel even if the connection was interrupted for the sender.
What I have done:
Wrote a small example application that returns after a set number of seconds, to rule out our websites as the problem.
Ran the request in the VM (to localhost): No problems, response was received.
Ran the request within Azure from one to another VM: Request ran forever.
Ran the request from outside of Azure: Request terminates after 5 minutes with "socket hang up".
Checked set timeouts: Kestrel: 50m , IIS: 4000s, ApplicationGateway-HttpSettings: 3600
Requests were tested with Postman.
Is there another request or connection timeout hidden somewhere in Azure?
"We now have requests that would run for at least 20 minutes."
This is a horrible architecture and it should be rewritten to be async. Don't take this personally, it is what it is. Consider returning a 202 Accepted with a Location header to poll for the result.
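A minimal Python sketch of the 202 Accepted plus polling pattern suggested above, with no web framework involved. The class `JobApi`, the `/jobs/<id>` path and the response shapes are assumptions made for illustration. A real service would also persist jobs and handle failures in the background work, which this sketch does not.

```python
import uuid
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Any, Callable, Dict, Tuple


class JobApi:
    """In-memory stand-in for 'POST starts a job, GET polls for its result'."""

    def __init__(self) -> None:
        self._pool = ThreadPoolExecutor(max_workers=2)
        self._jobs: Dict[str, Future] = {}

    def submit(self, work: Callable[[], Any]) -> Tuple[int, Dict[str, str]]:
        """Start the work in the background and answer immediately with 202."""
        job_id = uuid.uuid4().hex
        self._jobs[job_id] = self._pool.submit(work)
        return 202, {"Location": f"/jobs/{job_id}"}

    def poll(self, location: str) -> Tuple[int, Dict[str, Any]]:
        """Report whether the job behind the Location path has finished."""
        job_id = location.rsplit("/", 1)[-1]
        future = self._jobs.get(job_id)
        if future is None:
            return 404, {"error": "unknown job"}
        if not future.done():
            return 200, {"status": "running"}
        # Error handling is omitted: result() re-raises if the work failed.
        return 200, {"status": "done", "result": future.result()}
```

The client keeps no connection open for the 20 minutes of work. It only makes short requests, so connection idle timeouts along the path stop mattering.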
You're most probably hitting the Azure SNAT layer timeout.
Change it under the Configuration blade for the Public IP.
So I ran into something like this a little while back:
For us the issue was probably the timeout, as the other answer suggests, but our solution (instead of increasing the timeout) was to add PgBouncer in front of our Postgres database to manage the connections and make sure a new one is started before the timeout fires.
Not sure what your backend connection looks like, but something similar (a backend DB proxy) could give you more ability to tune connection and reconnection behaviour on your side.
We were running AKS (Azure Kubernetes Service), but all Azure public IPs obey the same rules that cause issues like this one.
While it isn't an answer, I know there are two types of public IP addresses; one of them is considered 'basic' and doesn't have the same configurability. Could this be related to the difference between basic and standard public IPs / load balancers?
How frequently does Traffic Manager monitor endpoints? It is clearly not event driven: by my observations, when an endpoint goes down it takes anywhere from 30 seconds to 2.5 minutes for the endpoint status to be detected. Can we configure this frequency? I cannot see any setting for it.
Is there a relationship between the Traffic Manager monitoring interval and the TTL?
This may look like a general question, but my real issue is that I experience service downtime in a failover scenario (failover of the primary). I understand the effect of the TTL: until the client's DNS cache expires, clients keep calling the cached endpoint. I spent a lot of time on this and have now narrowed it down to a specific question.
The issue is that there is a delay before Traffic Manager identifies the endpoint status after it is stopped or started. I need a logical explanation for this and could not find any Azure reference that explains it.
[Image: Traffic Manager settings]
I need to understand this delay and plan for that downtime.
I have gone through the same issue. Check this link; it explains the monitoring behaviour:
Traffic Manager Monitoring
The monitoring system performs a GET, but does not receive a response in 10 seconds or less. It then performs three more tries at 30 second intervals. This means that at most, it takes approximately 1.5 minutes for the monitoring system to detect when a service becomes unavailable. If one of the tries is successful, then the number of tries is reset. Although not shown in the diagram, if the 200 OK message(s) come back more than 10 seconds after the GET, the monitoring system will still count this as a failed check.
This explains the delay of 30 seconds to about 2 minutes.
Basically, the maximum delay would be 1.5 minutes plus the TTL, as per the details above.
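The arithmetic behind that estimate can be written out as a small Python sketch. It follows the quoted approximation (three retries at 30-second intervals, about 1.5 minutes) and adds the DNS TTL. The function name and the TTL of 300 seconds are assumptions for illustration; substitute the TTL configured on your profile.

```python
def worst_case_failover_seconds(
    retry_count: int = 3,
    retry_interval_s: int = 30,
    dns_ttl_s: int = 300,  # assumed TTL, not taken from the question
) -> int:
    """Approximate upper bound: detection time plus client DNS cache lifetime."""
    detection_s = retry_count * retry_interval_s  # 3 * 30 s = 90 s = 1.5 minutes
    return detection_s + dns_ttl_s
```

With the assumed values this gives 390 seconds; with a TTL of 30 seconds it gives 120 seconds. Lowering the TTL shortens only the second term.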
When we migrated our apps to Azure from Rackspace, we saw almost 50% of HTTP requests getting read timeouts.
We tried placing the client both inside and outside Azure, with the same results. The client in this case is also a server, by the way, so there are no geographic or browser issues either.
We even tried increasing the size of the box to ensure Azure wasn't throttling. But even using D boxes for a single request, the result was the same.
Once we moved our apps out of Azure, they started functioning properly again.
Each query was done directly on an instance using a public ip, so no load balancer issues either.
Almost 50% of queries ran into this issue. The timeout was set to 15 minutes.
Region was US East 2
Having 50% of HTTP requests time out is not normal behaviour. You need to analyze what is causing those timeouts, starting by validating that the requests are reaching your VM. For this, I would recommend running a packet capture on your server to analyze response times and look for a high number of retransmissions. It is even better if you can take a simultaneous network trace on your client machines, so you can do TCP sequence number analysis and compare packets sent vs. received.
If you see high latencies or a high number of retransmissions in the packet capture, detailed analysis is required. I strongly suggest you open a support incident so Microsoft support can help you investigate your issue further.
Here is my application cloud environment.
I have an ELB with sticky sessions -> 2 HAProxy instances -> 1 machine that hosts my application on JBoss.
I am processing a request which takes more than 1 minute. I log IP addresses at the start of request processing.
When I process this request through a browser, I see a duplicate request logged after 1 minute and a few seconds. If the first request is routed through HAProxy1, the duplicate is routed through HAProxy2. In the browser I get an HttpStatus=0 response after 2.1 minutes.
My hypothesis is that the ELB is triggering this duplicate request.
Kindly help me to verify this hypothesis.
When I use the Apache HTTP Client for the same request, I do not see a duplicate request being triggered. Instead, I get an exception after 1 minute and a few seconds:
org.apache.http.NoHttpResponseException: The target server failed to respond
Kindly help me understand what is happening here.
-Thanks
By ELB I presume you are referring to Amazon AWS's Elastic Load Balancer.
Elastic Load Balancer has an idle time-out of 60 seconds by default. The browser has smart retry logic, hence you're seeing two requests, but your server processes them as two separate, unrelated requests, so this actually makes matters worse. Using httpclient, the timeout causes the NoHttpResponseException, and no retry is used.
The solution is to either improve the performance of your request on the server, or have the initial request fire off a background task and then use a supplemental request (possibly via AJAX) that polls for completion.