App Engine Flex deployment health check fails

I've made a Python 3 Flask app that serves as an API proxy behind gunicorn. I've deployed the OpenAPI spec to Cloud Endpoints and filled in the Endpoints service in the app.yaml file.
When I try to deploy to App Engine Flex, the health check fails because startup takes too long. I've tried altering the readiness_check's app_start_timeout_sec as suggested, but to no avail. When checking the logs in Stackdriver, I can only see gunicorn booting a couple of workers and eventually terminating everything, a couple of times in a row, with no further explanation of what goes wrong. I've also tried specifying resources in app.yaml and scaling the workers in gunicorn.conf.py, again without success.
Then I tried switching to uwsgi, but it behaved the same way: starting up and terminating a couple of times in a row, followed by the health check timeout.
error:
ERROR: (gcloud.app.deploy) Error Response: [4] Your deployment has failed to become healthy in the allotted time and therefore was rolled back. If you believe this was an error, try adjusting the 'app_start_timeout_sec' setting in the 'readiness_check' section.
app.yaml:
runtime: python
env: flex
entrypoint: gunicorn -c gunicorn.conf.py -b :$PORT main:app
runtime_config:
  python_version: 3
endpoints_api_service:
  name: 2019-09-27r0
  rollout_strategy: managed
resources:
  cpu: 1
  memory_gb: 2
  disk_size_gb: 10
gunicorn.conf.py:
import multiprocessing
bind = "127.0.0.1:8000"
workers = multiprocessing.cpu_count() * 2 + 1
requirements.txt:
aniso8601==8.0.0
certifi==2019.9.11
chardet==3.0.4
Click==7.0
Flask==1.1.1
Flask-Jsonpify==1.5.0
Flask-RESTful==0.3.7
gunicorn==19.9.0
idna==2.8
itsdangerous==1.1.0
Jinja2==2.10.1
MarkupSafe==1.1.1
pytz==2019.2
requests==2.22.0
six==1.12.0
urllib3==1.25.5
Werkzeug==0.16.0
pyyaml==5.1.2
Can anyone spot a conflict or something I forgot in here? I'm out of ideas and really need help. It would also help if someone could point me to where I can find more information in the logs (I also ran gcloud app deploy with --verbosity=debug, but this only shows "Updating service [default]... ...Waiting to retry."). I would really like to know what causes the health checks to time out!
Thanks in advance!

You can either disable health checks or customize them.
To disable them, add the following to your app.yaml:
health_check:
  enable_health_check: False
To customize them, take a look at split health checks.
You can customize liveness check requests by adding an optional liveness_check section to your app.yaml file, for example:
liveness_check:
  path: "/liveness_check"
  check_interval_sec: 30
  timeout_sec: 4
  failure_threshold: 2
  success_threshold: 2
In the documentation you can check the settings available for liveness checks.
In addition, there are readiness checks. You can customize their settings in the same way, for example:
readiness_check:
  path: "/readiness_check"
  check_interval_sec: 5
  timeout_sec: 4
  failure_threshold: 2
  success_threshold: 2
  app_start_timeout_sec: 300
The values shown above can be changed according to your needs. Check these values carefully, since App Engine Flexible takes some minutes to start up an instance; this is a notable difference from App Engine Standard and should not be taken lightly.
If you examine the nginx.health_check logs for your application, you might see health check polling happening more frequently than you have configured, due to the redundant health checkers that are also following your settings. These redundant health checkers are created automatically and you cannot configure them.
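As far as I understand the split health check docs, once you set a custom path the check requests are forwarded to your application, and the check only passes if the app answers with 200 on that path. Below is a minimal sketch of such handlers as a plain WSGI callable (standard library only, so it does not depend on Flask). The paths match the example sections above; the helper name call and the 404 fallback are hypothetical, for illustration only.

```python
"""Minimal WSGI app that answers split health checks (stdlib only)."""

# Paths mirror the example liveness_check / readiness_check sections above.
# Adjust them if your app.yaml uses different paths.
HEALTH_PATHS = {"/liveness_check", "/readiness_check"}


def app(environ, start_response):
    """Return 200 on the health-check paths, 404 elsewhere (placeholder routing)."""
    path = environ.get("PATH_INFO", "/")
    if path in HEALTH_PATHS:
        status, body = "200 OK", b"ok"
    else:
        status, body = "404 Not Found", b"not found"
    start_response(status, [
        ("Content-Type", "text/plain"),
        ("Content-Length", str(len(body))),
    ])
    return [body]


def call(path):
    """Hypothetical helper: invoke the app in-process, return (status, body)."""
    captured = {}

    def start_response(status, headers):
        captured["status"] = status

    chunks = app({"REQUEST_METHOD": "GET", "PATH_INFO": path}, start_response)
    return captured["status"], b"".join(chunks)


if __name__ == "__main__":
    for p in ("/liveness_check", "/readiness_check", "/"):
        print(p, call(p))
```

If this were saved as main.py, an entrypoint such as gunicorn -b :$PORT main:app should be able to serve it. In a Flask app the equivalent would be two routes that return a 200 response, ideally without doing any slow work, so that the check stays within timeout_sec.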

Related

Continuous WebJob randomly restarts

I run a continuous WebJob in my WebApp. What I have found is that it can randomly restart.
I've checked my web app settings and "Always On" is turned on. I have no triggers that can cause a reboot.
This is an empty web app, creating from scratch. All I have done is just published my continuous WebJob.
How can I prevent these random reboots?
As I can see from App Insights, it restarts 10 minutes after the first run:
2/9/2021, 12:46:45 PM - TRACE
Checking for active containers
Severity level: Information
2/9/2021, 12:46:45 PM - TRACE
Job host started
Severity level: Information
2/9/2021, 12:46:45 PM - TRACE
No job functions found. Try making your job classes and methods public. If you're using binding extensions (e.g. Azure Storage, ServiceBus, Timers, etc.) make sure you've called the registration method for the extension(s) in your startup code (e.g. builder.AddAzureStorage(), builder.AddServiceBus(), builder.AddTimers(), etc.).
Severity level: Warning
2/9/2021, 12:46:45 PM - TRACE
Starting JobHost
Severity level: Information
2/9/2021, 12:36:44 PM - TRACE
2: Change Feed Processor: Processor_Container2 with Instance name: 4b94336ff47c4678b9cf4083a60f0b3bf1cd9f77ce7d501100a9d4e60bd87e8e has been started
Severity level: Information
2/9/2021, 12:36:37 PM - TRACE
1: Change Feed Processor: Processor_Container1 with Instance name: 4b94336ff47c4678b9cf4083a60f0b3bf1cd9f77ce7d501100a9d4e60bd87e8e has been started
Severity level: Information
2/9/2021, 12:36:32 PM - TRACE
Checking for active containers
Severity level: Information
2/9/2021, 12:36:32 PM - TRACE
Job host started
Severity level: Information

Kindly review the jobs and logs to isolate the issue further:
For continuous WebJobs, Console.Out and Console.Error are routed to the "application logs"; they will show up as a file or a blob depending on how you configured application logs (similar to your WebApp).
Kindly check this document for more details: https://github.com/projectkudu/kudu/wiki/WebJobs#logging
I have seen cases where unnecessary app settings on the WebJob's configuration blade in the portal caused reboots.
Kindly identify and remove unnecessary app settings (as a test).
Also, kindly see if this setting is present: WEBJOBS_RESTART_TIME, the timeout in seconds between when a continuous job's process goes down (for any reason) and the time it is re-launched (continuous jobs only).
On the App Service, in the left navigation, click Diagnose and solve problems, then check the "Diagnostic Tools" tile > "Availability and Performance" and "Best Practices", and review the WebJob details.
Just to isolate, kindly see if setting the job as a singleton helps. A continuous job set as singleton runs on a single instance only, as opposed to running on all instances, which is the default.
{
  "is_singleton": true
}
Refer to this doc: https://github.com/projectkudu/kudu/wiki/WebJobs-API#set-a-continuous-job-as-singleton
P.S. Copying the answer from our discussion on the Q&A thread to benefit the community.

how to make azure external.metrics.k8s adapter work?

I've set up the Azure external metrics adapter following this document: "https://github.com/Azure/azure-k8s-metrics-adapter/tree/master/samples/servicebus-queue"
After the helm installation using a service principal, when I execute the command kubectl get --raw "/apis/external.metrics.k8s.io/v1beta1" | jq, I should get the output shown in the document. Instead I get an error stating: Error from server (ServiceUnavailable): the server is currently unable to handle the request
The helm installation was successful, and below are the logs:
I0116 12:49:36.216094 1 controller.go:40] Setting up external metric event handlers
I0116 12:49:36.216148 1 controller.go:52] Setting up custom metric event handlers
I0116 12:49:36.216528 1 controller.go:69] initializing controller
I0116 12:49:36.353905 1 main.go:104] Looking up subscription ID via instance metadata
I0116 12:49:36.359887 1 instancemetadata.go:40] connected to sub: *********************
I0116 12:49:36.416858 1 controller.go:77] starting 2 workers with 1000000000 interval
I0116 12:49:36.417062 1 controller.go:88] Worker starting
I0116 12:49:36.417068 1 controller.go:88] Worker starting
I0116 12:49:36.417074 1 controller.go:98] processing item
I0116 12:49:36.417078 1 controller.go:98] processing item
I0116 12:49:36.680065 1 serving.go:312] Generated self-signed cert (apiserver.local.config/certificates/apiserver.crt, apiserver.local.config/certificates/apiserver.key)
I0116 12:49:37.197936 1 secure_serving.go:116] Serving securely on [::]:6443
When I execute the command kubectl api-versions, external.metrics.k8s.io/v1beta1 is displayed in the list, which suggests the installation was successful. But why am I not able to hit the API?

Solved it. Initially I was installing into my custom namespace. It looks like the Azure metrics adapter works only if it is installed in the namespace "custom-metrics". They should probably mention this somewhere in the document. It cost me 2 days of troubleshooting to figure this out :-(

app engine fails to deploy app with large dataframe

So I built an app using machine learning on a small pandas dataframe, ~1000 records. I use gcloud app deploy, it hosts on appspot, and I am able to use it.
When I increase the dataframe to ~30,000 records, the app still runs locally. But when I use gcloud app deploy, I get a 500 server error. I am loading the dataframe from a CSV in my project root.
My app.yaml looks like:
runtime: python37
service: snow
instance_class: F4_1G
From another stackoverflow post, I switched the instance_class to F4_1G but it keeps having the same error. I also tried
gcloud config set app/cloud_build_timeout 1000
Any other ideas on what could cause app engine to have this error?

The error:
"exceeded soft memory limit of 2048 mb, consider increasing in app yaml file"
indicates that your instance has run out of memory. In theory you could get more memory by specifying another instance class, but you are already using the one with the most memory (2048 MB). Check the list of instance classes.
So in your case, the solution would be to switch to App Engine Flex. To do so, you will need to specify something like this in your app.yaml:
runtime: python
env: flex
entrypoint: gunicorn -b :$PORT main:app
runtime_config:
  python_version: 3
manual_scaling:
  instances: 1
resources:
  cpu: 1
  memory_gb: 2.1
  disk_size_gb: 10
In memory_gb you specify the memory that the VM instance will use. Here is the formula for choosing a valid value:
memory_gb = cpu * [0.9 - 6.5] - 0.4
Choose the desired memory per core from the interval [0.9 - 6.5], multiply it by the number of CPUs, and subtract 0.4. For a more extended explanation, check the app.yaml reference documentation.
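To make the formula concrete, here is a small sketch that computes the range of memory_gb values it allows for a given cpu count. The function names are hypothetical, and the constants (0.9, 6.5, 0.4) are taken straight from the formula quoted above; check the current app.yaml reference in case the limits have changed.

```python
def memory_gb_range(cpu, per_core_min=0.9, per_core_max=6.5, overhead_gb=0.4):
    """Return (lowest, highest) memory_gb the formula allows for `cpu` cores.

    Implements memory_gb = cpu * [0.9 - 6.5] - 0.4 at both ends of the interval.
    """
    low = cpu * per_core_min - overhead_gb
    high = cpu * per_core_max - overhead_gb
    # Round to avoid float noise such as 0.5000000000000001.
    return round(low, 2), round(high, 2)


def is_valid_memory_gb(cpu, memory_gb):
    """True if `memory_gb` falls inside the range allowed for `cpu` cores."""
    low, high = memory_gb_range(cpu)
    return low <= memory_gb <= high


if __name__ == "__main__":
    for cpu in (1, 2, 4):
        print(cpu, memory_gb_range(cpu))
    # The example app.yaml above uses cpu: 1 and memory_gb: 2.1.
    print(is_valid_memory_gb(1, 2.1))
```

With cpu: 1 this gives a range of 0.5 to 6.1, so the memory_gb: 2.1 in the example is valid, and there is room to go higher on a single core if the dataframe needs it.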
Also, check the App Engine Pricing documentation to know how your billing will change from Standard to Flex.

Openshift 3 App Deployment Failed: Took longer than 600 seconds to become ready

I have a problem with my openshift 3 setup, based on Node.js + MongoDB (Persistent) https://github.com/openshift/nodejs-ex.git
Latest App Deployment: nodejs-mongo-persistent-7: Failed
--> Scaling nodejs-mongo-persistent-7 to 1
--> Waiting up to 10m0s for pods in rc nodejs-mongo-persistent-7 to become ready
error: update acceptor rejected nodejs-mongo-persistent-7: pods for rc "nodejs-mongo-persistent-7" took longer than 600 seconds to become ready
Latest Build: Complete
Pushing image 172.30.254.23:5000/husk/nodejs-mongo-persistent:latest ...
Pushed 5/6 layers, 84% complete
Pushed 6/6 layers, 100% complete
Push successful
I have no idea how to debug this. Can you help, please?

Check what went wrong in the console with: oc get events
If it failed to pull the image, make sure you included a proper secret.

Unable to update VM with nodejs app on Google App Engine

When I try to deploy from the gcloud CLI I get the following error.
Copying files to Google Cloud Storage...
Synchronizing files to [gs://staging.logically-abstract-www-site.appspot.com/].
Updating module [default]...\Deleted [https://www.googleapis.com/compute/v1/projects/logically-abstract-www-site/zones/us-central1-f/instances/gae-builder-vm-20151030t150724].
Updating module [default]...failed.
ERROR: (gcloud.preview.app.deploy) Error Response: [4] Timed out creating VMs.
My app.yaml is:
runtime: nodejs
vm: true
api_version: 1
automatic_scaling:
min_num_instances: 2
max_num_instances: 20
cool_down_period_sec: 60
cpu_utilization:
target_utilization: 0.5
I am logged in successfully and have the correct project ID. I can see the new version created in the Cloud Console for App Engine, so the error seems to occur after that.
In the stdout log I see both instances go up with the last console.log statement I put in the app after it starts listening on the port, but in the shutdown.log I see "app was unhealthy" and in syslog I see "WARNING: never got healthy response from app, but sending /_ah/start query anyway."

From my experience with Node.js on Google App Engine, "Timed out creating VMs" is neither a traditional timeout nor really about creating VMs. I found that other errors were reported during the launch of the server, which happens right after the VMs are created. So I recommend checking the console output to see if it tells you anything.
To see the console output:
For a VM instance, go to your VM instances and click the instance you want, then scroll towards the bottom and click "Serial console output".
For stdout console logging, go to Monitoring, open your logs, then change the log type dropdown from Request to stdout.
I also found differences in process.env when running locally versus in the cloud. I hope you find your solution too. Good luck!
