I'm using JHipster version 4.6.2 on my gateway. I have a JHipster registry and two instances of the same microservice. In the JHipster registry I can see that the gateway and both microservice instances are registered properly. I can check their configuration, health and so on. In short, everything is working fine.
The microservice was created with a newer JHipster version (4.11.1). The gateway and the microservice seem to cooperate well; for example, the default (generated) user interface on the gateway is able to fetch data (entities) from the microservice. On the gateway I just use the logic that JHipster generated for me to fetch data from the microservice. I can see in the logs that calls are routed to both microservice instances.
The issue I'm facing is that when I shut down one microservice instance, the gateway still sometimes routes a service call to the instance that has already been shut down. After some time all calls are routed correctly to the remaining running instance, but right after shutting one instance down, a call might still be routed to the "wrong" instance.
I expected that components like Ribbon, Zuul and Eureka would automatically try the other microservice instance if the call to the first one fails. Is my expectation correct? Should the JHipster "microservices platform" automatically retry the service call against another registered microservice instance?
If by default retrying is not supported, what should I do to make it happen?
Thanks for your response, Gael. I tried the configuration from the link you provided, but it did not fully solve my case.
When I managed to get rid of the original exception ("com.netflix.client.ClientException: null"), I faced the next issue ("Caused by: com.netflix.client.ClientException: Number of retries on next server exceeded max 2 retries, while making a call for: 192.168.1.4:8082").
I needed to adjust MaxAutoRetriesNextServer (see https://github.com/spring-cloud/spring-cloud-netflix/issues/2052).
That brought me one step further, but I still got Hystrix exceptions ("Caused by: com.netflix.hystrix.exception.HystrixRuntimeException: myservice timed-out and no fallback available.").
Finally, with the help of these two links (https://github.com/jhipster/generator-jhipster/issues/3323 and https://github.com/spring-cloud/spring-cloud-netflix/issues/321), I managed to put together a configuration that has given me 100% availability in my tests (so far).
This is the configuration that has worked for me. I don't fully understand all the details behind these settings, so if you spot any inconsistencies or have suggestions for improvements, please point them out. Thanks!
zuul:
  routes:
    myservice:
      retryable: true        # let Zuul/Ribbon retry requests on this route
  host:
    connect-timeout-millis: 5000
    socket-timeout-millis: 20000

ribbon:
  MaxAutoRetries: 1              # retries on the same instance
  MaxAutoRetriesNextServer: 5    # how many other instances to try next
  OkToRetryOnAllOperations: true
  ReadTimeout: 2500
  restclient:
    enabled: true

hystrix:
  command:
    default:
      execution:
        isolation:
          thread:
            # should be large enough to cover the Ribbon retries, roughly
            # (ReadTimeout + ConnectTimeout) * (MaxAutoRetries + 1) * (MaxAutoRetriesNextServer + 1)
            timeoutInMilliseconds: 20000
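As a side note, the ribbon block above changes the retry behaviour for every Ribbon client globally. If you only want it for one route, Spring Cloud Netflix also accepts the same keys nested under a specific client name; a minimal sketch, assuming the Ribbon client name matches the myservice route above:

myservice:
  ribbon:
    MaxAutoRetries: 1
    MaxAutoRetriesNextServer: 5
    OkToRetryOnAllOperations: true
    ReadTimeout: 2500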
Related
I'm currently researching and experimenting with Kubernetes in Azure. I'm playing with AKS and the Application Gateway ingress. As I understand it, when a pod is added to a service, the endpoints are updated and the ingress controller continuously polls this information. As new endpoints are added, the AG is updated; as they're removed, the AG is updated as well.
As pods are added there will be a small delay whilst that pod is added to the AG before it receives requests. However, when pods are removed, does that delay in update result in requests being forwarded to a pod that no longer exists?
If not, how does AG/K8S guarantee this? What behaviour could the end client potentially experience in this scenario?
The Azure Application Gateway Ingress Controller (AGIC) is an ingress controller for your Kubernetes deployment that allows you to use the native Azure Application Gateway to expose your application to the internet. Its purpose is to route traffic directly to the pods; everything concerning pod availability, scheduling and, generally speaking, management remains the responsibility of Kubernetes itself.
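For context, this is roughly what exposing a service through AGIC looks like; a minimal sketch, where the ingress name, service name and port are illustrative assumptions:

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: myapp-ingress                # illustrative name
  annotations:
    # tells AKS to let the Application Gateway Ingress Controller handle this Ingress
    kubernetes.io/ingress.class: azure/application-gateway
spec:
  rules:
  - http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: myapp-service      # assumed service in front of the pods
            port:
              number: 80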
When a pod receives a command to terminate, it doesn't happen instantly. Shortly afterwards the kube-proxies update iptables to stop directing traffic to the pod, but there may also be ingress controllers or load balancers forwarding connections directly to the pod (which is the case with Application Gateway), and they need time to pick up the change. It's impossible to eliminate this window completely, but adding a 5-10 second delay can significantly improve the user experience.
If you need to terminate or scale down your application, you should consider the following steps:
Wait for a few seconds and then stop accepting connections
Close all keep-alive connections not in the middle of request
Wait for all active requests to finish
Shut down the application completely
Here are the exact Kubernetes mechanisms that will help you address these questions (a combined example sketch follows after the last item):
preStop hook - this hook is called immediately before a container is terminated. It is very helpful for graceful shutdowns of an application. For example, a simple sh "sleep 5" command in a preStop hook can prevent users from seeing "connection refused" errors. After the pod receives an API request to be terminated, it takes some time to update iptables and to let the application gateway know that this pod is out of service. Since the preStop hook is executed before the SIGTERM signal is sent, it helps bridge that gap.
(an example can be found in the Kubernetes documentation under "Attach Handlers to Container Lifecycle Events")
readiness probe - this type of probe runs on the container throughout its life and defines whether the pod is ready to accept and serve requests. When a container's readiness probe returns success, the container can handle requests and the pod is added to the endpoints. If the readiness probe fails, the pod is considered unable to handle requests and is removed from the Endpoints object. This works very well for newly created pods when an application takes some time to load, as well as for already running pods when the application is temporarily busy processing and cannot serve requests.
Before a pod is removed from the endpoints, the readiness probe has to fail several times. It's possible to lower this to a single failure using the failureThreshold field, but it still needs to detect at least one failed check.
(additional information on how to set this up can be found in "Configure Liveness, Readiness and Startup Probes")
startup probe - for applications that require additional time for their first initialisation, it can be tricky to choose readiness probe parameters that cover the slow start without compromising fast failure detection afterwards.
A startup probe whose failureThreshold * periodSeconds covers the worst-case start-up time provides this flexibility.
terminationGracePeriodSeconds - may also be worth adjusting if an application requires more than the default 30 seconds to shut down gracefully (this is especially important for stateful applications).
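Here is the combined sketch referred to above, putting these mechanisms together in one pod spec; the image, health endpoint, port and timing values are assumptions and need to be adapted to the real application:

apiVersion: v1
kind: Pod
metadata:
  name: graceful-shutdown-demo          # illustrative name
spec:
  terminationGracePeriodSeconds: 60     # assumed: allow up to 60s for a graceful shutdown
  containers:
  - name: app
    image: myapp:latest                 # placeholder image
    lifecycle:
      preStop:
        exec:
          # give kube-proxy and the Application Gateway time to remove this pod
          # from their backends before SIGTERM is sent to the container
          command: ["/bin/sh", "-c", "sleep 5"]
    readinessProbe:
      httpGet:
        path: /healthz                  # assumed health endpoint
        port: 8080                      # assumed container port
      periodSeconds: 5
      failureThreshold: 1               # drop the pod from the endpoints after one failed check
    startupProbe:
      httpGet:
        path: /healthz
        port: 8080
      periodSeconds: 10
      failureThreshold: 30              # 30 * 10s = up to 5 minutes allowed for the first start-up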
Heroku does not support health checks on its own. It will restart services that crashed, but there is nothing like health checks.
It sometimes happens that a service becomes unresponsive while the process is still running. In most modern cloud solutions you can provide a health endpoint that is periodically called by the hosting platform; if that endpoint returns an error, or doesn't respond at all, the platform shuts that instance down and starts a new one.
That seems to be the industry standard these days, but I am unable to find any solution for this on Heroku. I could even use an external service together with the Heroku CLI, but just calling some endpoint is not sufficient: if there are multiple instances, they all share the same URL and the load balancer routes each call to one of them at random, so it is possible to never hit the failed instance at all. Even when I do hit it, health checks usually work like "after 3 failed checks in a row, restart that instance", which is highly improbable if there are 10 instances and only one of them becomes unhealthy.
Do you have any solution to this?
You are right that this is an industry standard, and it's a shame that it isn't provided out of the box.
I can think of two solutions (both involve running some extra code that does all of this):
a) use the Heroku API, which allows you to get the IP of individual dynos, and then call each dyno however you want
b) from each dyno instance, send a request to a monitoring web server, e.g. https://iamaalive.com/?dyno=${process.env.HEROKU_DYNO_ID}
I have an app service (plan B2) running on Azure.
My integration tests, running from a Docker container, call some App Service endpoints one by one and sometimes receive a 500 or 502 error.
When I debug the tests I pause between calls, and then all requests succeed. Also, when I scale up my App Service, everything works properly. (I don't want to scale up because CPU and the other metrics are low.)
In my tests I have only one HttpClient and I dispose of it at the end, so I don't think there should be any connection leaks.
Also, in TCP Connections I see around 60 total connections, while the limit in the Azure docs is 1,920.
This app is not accessed by any users, but here it says that I reached the maximum number of connections. Is there any way to track these connections? Why don't I see anything in Application Insights when I receive these 5xx errors? Also, how can 15 connections exceed the limit when the limit is 1,920? Are these connections related to my errors, and how can they be fixed?
You don't see them in Application Insights because they happen at the IIS level, which breaks the request before it reaches your application, so no telemetry is sent to Application Insights.
The place to look for information is "Diagnose and solve problems", then "Availability and Performance". More info here:
https://learn.microsoft.com/en-us/azure/app-service/overview-diagnostics
PS: I do think the problem is related to the disposal of your HttpClient. It's a well-known issue and the reason HttpClientFactory was introduced. More info here:
https://www.stevejgordon.co.uk/httpclient-creation-and-disposal-internals-should-i-dispose-of-httpclient
https://stackoverflow.com/a/15708633/1384539
Unhealthy event: SourceId='System.FabricDnsService', Property='Environment', HealthState='Warning', ConsiderWarningAsError=false.
FabricDnsService is not preferred DNS server on the node.
Wondering if anyone has a pointer on where to start with this warning in Azure Service Fabric?
It looks like this was an issue in Service Fabric v6.0 and it has been fixed in v6.1:
https://github.com/Azure/service-fabric-issues/issues/496
For now, to work around this, you should turn off all your network connections except the primary one, reset the local cluster, and redeploy the app.
https://learn.microsoft.com/en-us/azure/service-fabric/service-fabric-cluster-manifest
I am running a load test using JMeter on my Azure web services.
I scale my service to the S2 tier with 4 instances and run 4 JMeter instances with 500 threads each.
It starts perfectly fine, but after a while calls start failing with timeout errors (HTTP status 500).
I have checked the HTTP request queue on Azure and found that it is very high on the 2nd instance while it is very low on two other instances.
Please help me get my load test to succeed.
I assume you are using Azure App Service. If you check the settings of your app, you will notice that ARR's Instance Affinity is enabled by default. A brief explanation:
ARR cleverly keeps track of connecting users by giving them a special cookie (known as an affinity cookie), which allows it to know, upon subsequent requests, which server instance they were talking to. This way, we can be sure that once a client establishes a session with a specific server instance, it will keep talking to the same server as long as its session is active.
This is an important feature for session-sensitive applications, but if it's not your case then you can safely disable it to improve the load balance between your instances and avoid situations like the one you've described.
Disabling ARR’s Instance Affinity in Windows Azure Web Sites
It might be due to caching of DNS name resolution at the JVM or OS level, so all your requests are hitting only one server. If that is the case, add a DNS Cache Manager to your Test Plan and it should resolve your issue.
See The DNS Cache Manager: The Right Way To Test Load Balanced Apps article for a more detailed explanation and configuration instructions.