I am encountering the following behavior with Azure Batch. I am using Shipyard to start a pool of 500 low-priority nodes to perform a list of 400,000 tasks. The pool size is managed using auto-scaling.
At first, the pool seems to be running just fine. The number of nodes increases to maximum capacity and the tasks complete as expected. However, after some time (having completed a sizable number of tasks), I start to encounter 'start task failed' errors. The pool then quickly degrades until all nodes crash with this same error.
This is the error I get in the stdout.txt file of one of the crashed nodes:
Login Succeeded
2020-03-04T09:09:07UTC - INFO - Docker registry logins completed.
2020-03-04T09:09:07UTC - WARNING - No Singularity registry servers found.
2020-03-04T09:13:37,840996225+00:00 - ERROR - Cascade Docker exited with non-zero exit code: 1
This seems to be an issue related to pulling the Docker image? Although it worked without issue on other nodes before.
I am aware that this is not a lot of information to go on, but I am having trouble figuring out what information is relevant and what's not.
UPDATE
After updating to shipyard 3.9.1, this is the output in stdout.txt for one of the crashed nodes (start task failed):
2020-03-05T08:23:43,784166638+00:00 - DEBUG - Pulling Docker Image: mcr.microsoft.com/azure-batch/shipyard:3.9.1-cargo (fallback: 0)
2020-03-05T08:23:58,876629647+00:00 - ERROR - Error response from daemon: Get https://mcr.microsoft.com/v2/: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
2020-03-05T08:23:58,878254953+00:00 - ERROR - No fallback registry specified, terminating
Please see the GitHub issue https://github.com/Azure/batch-shipyard/issues/340. You will likely need to upgrade your Batch Shipyard version and recreate your pool.
Related
I have a very basic NodeJS application hosted on Google App Engine that executes an async function on 15 second intervals. The deployment is successful and the app starts and runs fine, but stops after about 30 minutes with the following error logs. This runs fine locally, though.
Quitting on terminated signal
Start program failed: user application failed with exit code -1 (refer to stdout/stderr logs for more detail): signal: terminated
I have used App Engine before with no issues, so I'm not sure why this is happening. I used https://github.com/GoogleCloudPlatform/nodejs-docs-samples/tree/main/appengine/typescript as a reference and am still not able to resolve this issue. Any ideas?
Quitting on terminated signal
You may receive this error when your App Engine instance is scaling down or shutting down, possibly for one of these reasons:
Your application runs out of Instance Hours quota.
Your instance is moved to a different machine, either because the current machine that is running the instance is restarted, or App Engine moved your instance to improve load distribution.
There are good strategies to avoid instance downtime; here are a few (a minimal app.yaml sketch follows this list):
Keep a minimum number of idle instances.
Use manual scaling, which lets you specify the number of instances that will continuously run regardless of the load level.
Increase the maximum number of instances.
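A minimal app.yaml sketch of these options, assuming the standard environment (the values are placeholders, not recommendations); note that automatic and manual scaling are mutually exclusive:

# Option A: automatic scaling with warm idle instances and a higher ceiling
automatic_scaling:
  min_idle_instances: 1   # keep at least one warm spare ready for traffic
  max_instances: 10       # raise the upper bound if load requires it

# Option B: a fixed number of instances that run continuously regardless of load
# manual_scaling:
#   instances: 2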
Asynchronous background work is not recommended in App Engine. It can result in higher billing, and users may also experience increased latency because of high pushback or request queuing. Google recommends using Cloud Tasks instead. With Cloud Tasks, HTTP requests are long-lived and return a response only after any asynchronous work ends.
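If you do move the background work to Cloud Tasks, here is a hedged sketch of a queue definition via queue.yaml (the queue name and rates are placeholders; queues can also be created with gcloud instead):

queue:
- name: background-work       # placeholder queue name
  rate: 10/s                  # maximum dispatch rate
  max_concurrent_requests: 5  # limit parallel task handlers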
I have two Vert.x microservices running in a cluster that communicate with each other using a headless service(link) in an on-premise cloud. Whenever I do a rolling deployment I am facing connectivity issues between the services. When I analysed the logs I can see that the old node/pod is removed from the cluster member list, but the event bus does not remove it and keeps using it on a round-robin basis.
Below is the member group information before deployment
Member [192.168.4.54]:5701 - ace32cef-8cb2-4a3b-b15a-2728db068b80 //pod 1
Member [192.168.4.54]:5705 - f0c39a6d-4834-4b1d-a179-1f0d74cabbce this
Member [192.168.101.79]:5701 - ac0dcea9-898a-4818-b7e2-e9f8aaefb447 //pod 2
When the deployment starts, pod 2 gets removed from the member list:
[192.168.4.54]:5701 [dev] [4.0.2] Could not connect to: /192.168.101.79:5701. Reason: SocketException[Connection refused to address /192.168.101.79:5701]
Removing connection to endpoint [192.168.101.79]:5701 Cause => java.net.SocketException {Connection refused to address /192.168.101.79:5701}, Error-Count: 5
Removing Member [192.168.101.79]:5701 - ac0dcea9-898a-4818-b7e2-e9f8aaefb447
And the new member is added:
Member [192.168.4.54]:5701 - ace32cef-8cb2-4a3b-b15a-2728db068b80
Member [192.168.4.54]:5705 - f0c39a6d-4834-4b1d-a179-1f0d74cabbce this
Member [192.168.94.85]:5701 - 1347e755-1b55-45a3-bb9c-70e07a29d55b //new pod
All migration tasks have been completed. (repartitionTime=Mon May 10 08:54:19 MST 2021, plannedMigrations=358, completedMigrations=358, remainingMigrations=0, totalCompletedMigrations=3348, elapsedMigrationTime=1948ms, totalElapsedMigrationTime=27796ms)
But when a request is made to the deployed service, even though the old pod has been removed from the member group, the event bus still uses the old pod/service reference (ac0dcea9-898a-4818-b7e2-e9f8aaefb447):
[vert.x-eventloop-thread-1] DEBUG io.vertx.core.eventbus.impl.clustered.ConnectionHolder - tx.id=f9f5cfc9-8ad8-4eb1-b12c-322feb0d1acd Not connected to server ac0dcea9-898a-4818-b7e2-e9f8aaefb447 - starting queuing
I checked the official documentation for rolling deployments, and my deployment seems to follow the two key requirements mentioned there: only one pod is removed, and then the new one is added.
never start more than one new pod at once
forbid more than one unavailable pod during the process
I am using Vert.x 4.0.3 and hazelcast-kubernetes 1.2.2. My verticle class extends AbstractVerticle and is deployed using:
Vertx.clusteredVertx(options, vertx -> {
    vertx.result().deployVerticle(verticleName, deploymentOptions);
});
Sorry for the long post, any help is highly appreciated.
One possible reason is a race condition between Kubernetes removing the pod and updating the endpoint in kube-proxy, as detailed in this extensive article. This race condition leads to Kubernetes continuing to send traffic to the pod being removed after it has terminated.
One TL;DR solution is to add a delay when terminating a pod, by either:
Having the service delay when it receives a SIGTERM (e.g. for 15 seconds) so that it keeps responding to requests as normal during that delay period.
Using the Kubernetes preStop hook to execute a sleep 15 command in the container (see the sketch below). This allows the service to continue responding to requests during that 15-second period while Kubernetes updates its endpoints. Kubernetes sends SIGTERM when the preStop hook completes.
Both solutions give Kubernetes some time to propagate changes to its internal components so that traffic stops being routed to the pod being removed.
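A minimal sketch of the preStop approach in a Deployment's pod spec (the container name, image, and 15-second sleep are placeholders you should tune for your service):

spec:
  terminationGracePeriodSeconds: 30          # must be longer than the preStop sleep
  containers:
  - name: my-service                         # placeholder container name
    image: registry.example.com/my-service   # placeholder image
    lifecycle:
      preStop:
        exec:
          command: ["sh", "-c", "sleep 15"]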
A caveat to this answer is that I'm not familiar with Hazelcast clustering and how your specific discovery mode is set up.
Background:
I have a Kubernetes cluster on my cloud server, and I deployed a Spring Boot application as a pod. Everything was fine until two weeks ago, when my application unexpectedly could no longer be accessed (on Feb 22).
I ran "kubectl exec -it <pod> sh" and then curl 127.0.0.1:<port> inside the pod, but got no response. I checked my application logs but could not find any error related to this issue. I tried restarting my application, but the same issue occurred again after two days.
I have no idea what is causing this. Can anyone help me?
When everything is OK, I can call 127.0.0.1:18890 and get a response immediately; once the issue happens, the request times out. Only this Kubernetes service has this problem, the others seem normal.
We have a nodejs app that gets successfully deployed to a standard environment. Something happens after about two hours (or sooner depending on traffic): our downstream clients start receiving a bunch of 502 responses and then the service stabilizes. We think this has been happening for at least a few months.
When investigating the cause of the 502s, I see that:
There are no unhandled exception/promise rejection logs to indicate that the node app has crashed
I console.log when receiving SIGTERM and that, too, does not appear in the logs
The logs of the nginx sidecar include the following:
2020/06/16 23:11:11 [error] 35#35: *1149 recv() failed (104: Connection reset by peer) while reading response header from upstream, client: 169.254.1.1, server: _, request: "POST /api/redacted HTTP/1.1", upstream: "http://127.0.0.1:8081/api/redacted", host: "redacted.appspot.com""
I'm assuming that the 502s are coming from nginx because the upstream has disappeared. Are there other explanations I should explore?
If GAE is replacing my app containers intentionally, shouldn't that process prevent these types of 502s?
Should I expect something other than SIGTERM to be sent by the environment when the application/container is getting replaced?
Update #1 (2020-06-22)
I investigated and found evidence that we might be exceeding the memory quota, so I changed our instance_class from F1 to F2. As I write this, our instances are sitting at ~200M of memory usage (F2s have 512M available). Additionally, I use the --max-old-space-size switch to cap Node's heap usage at 496M.
The 502s are still happening.
I suspect that the 502s are happening as a result of the autoscaler terminating instances. Our app never receives SIGTERM (even during deployments). That means I can't close HTTP keep-alive connections gracefully, which might explain why nginx raises Connection reset by peer.
Update #2 (2020-06-24)
Our service is just standard REST type stuff, no heavy loops.
I'll post another update with some memory graphs but I don't see any spikes. Perhaps a small memory leak.
Here's our app.yaml:
service: redacted
runtime: nodejs12
instance_class: F2
handlers:
- url: /.*
  secure: always
  redirect_http_response_code: 301
  script: auto
We had a very similar problem with our Node.js app deployed on App Engine Flexible.
In our case, we ultimately determined that we had memory pressure that was causing the Node.js garbage collector to sometimes delay the processing of a request for hundreds of milliseconds (sometimes more). This caused our health check URLs to sporadically timeout, prompting GAE to remove the instance from the active pool.
Because we typically had just two instances handling the steady traffic, removing one instance quickly overloaded the remaining instance, and it would soon suffer the same fate.
We were surprised to find that it could take two minutes or longer before App Engine assigned traffic to a newly-created instance. Between the time our original instances were declared unhealthy, and when new instance(s) were online, 502s would be returned (presumably by GAE's nginx) to the client.
We were able to stabilize the environment simply by adding:
automatic_scaling:
  min_num_instances: 4
To our app.yaml. Because two instances were generally sufficient for the traffic, ensuring we always had four running apparently kept our memory usage low enough to prevent the GC from stalling request handling, and even if it did, we had enough excess capacity to handle one instance being removed.
The scaling settings for GAE standard are slightly different.
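For example, on GAE standard the rough equivalent of the setting above would look something like this in app.yaml (the counts are placeholders, not recommendations):

automatic_scaling:
  min_instances: 4        # instances kept running at all times
  min_idle_instances: 1   # warm spares held ready for traffic spikes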
In retrospect, we could see that our latency/response times would get a little "jittery" before the real problems started. Most responses had typical response times ~30ms, but increasingly we would see outlier requests in the x00ms range. You may want to check your request logs to see if you see something similar.
New Relic's Node.js VM data was helpful in detecting that garbage collection was taking an increasing amount of time.
Usually, 502 messages are errors on the nginx side, as you have mentioned. The detailed logs related to these errors are not surfaced to Cloud Logging yet.
Based on the behavior you describe, this looks like a workload issue, so it is likely related to running out of resources.
There are some things that are well worth taking a look at:
Check your metrics. The memory and CPU usage should be under healthy limits.
Check whether your scaling settings are sufficient for your workload.
Is there a chance you could share these metrics from around the time of the restart event?
Also, it would be good if you could share the resource and scaling settings from your app.yaml.
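For reference, a hedged sketch of where those scaling knobs live in a standard-environment app.yaml (all values are placeholders meant only to show the available settings):

automatic_scaling:
  target_cpu_utilization: 0.65   # scale out when average CPU crosses this fraction
  max_concurrent_requests: 40    # per-instance concurrency before scaling out
  max_instances: 10              # upper bound on instances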
We're running App Service for Linux in docker container.
When things work, they work really well. But, occasionally, our site becomes unavailable for no clear reason.
Our health status reports look like this:
Now, after some time, the app becomes completely unavailable. The health check reports Available, but in our docker log we find records like this:
2017-11-18 08:01:50.060 ERROR - Container for --- site ---is unhealthy. Stopping site.
2017-11-18 08:32:49.295 INFO - Issuing docker login to sever: http://---
2017-11-18 08:32:49.837 INFO - docker login to http://--- succeeded
2017-11-18 08:32:49.858 INFO - Issuing docker pull ---
2017-11-18 08:39:49.096 INFO - docker pull returned STDOUT>> 40: Pulling from ---
The only thing that helps is restarting the app. Then it comes back to normal and all works as expected.
To emphasise: the site doesn't hang on every 'Unavailable' report from the health check. It hangs randomly. CPU/memory are at normal levels, nothing unusual there and no crazy spikes.
The application itself has a global exception filter, and no uncaught exceptions escape the app.
Any ideas why it might happen?
Depending on the size of your docker image, the application goes offline while the new image is being pulled and initialized. I noticed that our deploy took nearly 20 minutes before coming back up.