App Engine fails to deploy app with large dataframe - python-3.x

So I built an app that uses machine learning on a small pandas dataframe, ~1,000 records. I deploy with gcloud app deploy, it hosts on appspot, and I am able to use it.
I increased the dataframe to ~30,000 records and the app still runs locally, but when I use gcloud app deploy I get a 500 server error. I am loading the dataframe from a CSV in my project root.
My app.yaml looks like:
runtime: python37
service: snow
instance_class: F4_1G
From another Stack Overflow post I switched the instance_class to F4_1G, but I keep getting the same error. I also tried
gcloud config set app/cloud_build_timeout 1000
Any other ideas on what could cause App Engine to return this error?

The error:
"exceeded soft memory limit of 2048 mb, consider increasing in app yaml file"
indicates that your instance has run out of memory. In theory, you could increase the memory by specifying a larger instance class; however, you are already using the one with the most memory (2048 MB). Check the list of instance classes.
So in your case the solution would be to change to App Engine Flex, and to do so you will need to specify something like this in your app.yaml:
runtime: python
env: flex
entrypoint: gunicorn -b :$PORT main:app
runtime_config:
  python_version: 3
manual_scaling:
  instances: 1
resources:
  cpu: 1
  memory_gb: 2.1
  disk_size_gb: 10
In memory_gb you specify the memory that the VM instance will use, and here's the formula for the allowed values:
memory_gb = cpu * [0.9 - 6.5] - 0.4
You choose the desired factor from the interval [0.9 - 6.5], multiply it by the number of CPUs, and subtract 0.4. For example, with cpu: 1 and a chosen factor of 2.5, memory_gb = 1 * 2.5 - 0.4 = 2.1, which is the value used in the sample above. For a more extended explanation, check the app.yaml reference documentation.
Also, check the App Engine Pricing documentation to know how your billing will change from Standard to Flex.
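As a side note, if staying on the Standard environment is preferable, it is sometimes possible to shrink the dataframe enough to fit in memory instead. A minimal sketch, assuming a file named data.csv and placeholder column names (adjust both to the real CSV):
import pandas as pd

# Load only the columns the model needs and downcast numeric dtypes;
# float32 halves the footprint of the default float64.
df = pd.read_csv(
    "data.csv",
    usecols=["feature_a", "feature_b", "label"],
    dtype={"feature_a": "float32", "feature_b": "float32"},
)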

Related

sam local invoke timeout on newly created project (created via sam init)

I create a new project via sam init and I select the options:
1 - AWS Quick Start Templates
1 - nodejs14.x
8 - Quick Start: Web Backend
Then from inside the project root, I run sam local invoke -e ./events/event-get-all-items.json getAllItemsFunction, which returns:
Invoking src/handlers/get-all-items.getAllItemsHandler (nodejs14.x)
Skip pulling image and use local one: public.ecr.aws/sam/emulation-nodejs14.x:rapid-1.32.0.
Mounting /home/rob/code/sam-app-2/.aws-sam/build/getAllItemsFunction as /var/task:ro,delegated inside runtime container
Function 'getAllItemsFunction' timed out after 100 seconds
No response from invoke container for getAllItemsFunction
Any idea what could be going on or how to debug this? Thanks.
Any chance the image/lambda makes a call to a database someplace? And does the container running the lambda have the right connection string and/or access? To me it sounds like your function is getting called and then trying to reach something it can't reach.
As far as debugging - lots of console.log() statements to narrow down how far your code is getting before it runs into trouble.
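If the dependency might just be slow rather than unreachable, raising the function timeout while debugging can turn a silent hang into a visible error. A sketch of the relevant template.yaml fragment; Timeout is a standard AWS::Serverless::Function property, and the 100 seconds in your log suggests that is the template's current value:
Resources:
  getAllItemsFunction:
    Type: AWS::Serverless::Function
    Properties:
      Handler: src/handlers/get-all-items.getAllItemsHandler
      Runtime: nodejs14.x
      Timeout: 300   # raised from 100 while debugging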

App Engine Google Cloud Storage - Error 500 when downloading a file

I'm getting an error 500 when I download a JSON file (approx. 2 MB) using the nodejs-storage library. The file gets downloaded without any problem, but once I render the view and pass the file as a parameter, the app crashes with "The server encountered an error and could not complete your request."
file.download(function(err, contents) {
  var messages = JSON.parse(contents);
  res.render('_myview.ejs', {
    "messages": messages
  });
});
I am using the App Engine Standard Environment and have this further error detail:
Exceeded soft memory limit of 256 MB with 282 MB after servicing 11 requests total. Consider setting a larger instance class in app.yaml
Can someone give me a hint? Thank you in advance.
500 error messages are quite hard to troubleshoot due to all the possible scenarios that could go wrong with the App Engine instances. A good way to start debugging this type of error with App Engine is to go to Stackdriver Logging, query for the 500 error messages, click on the expander arrow, and check the specific error code. In the specific case of the Exceeded soft memory limit... error message in the App Engine Standard environment, my suggestion would be to choose an instance class better suited to your application's load.
Assuming you are using automatic scaling, you could try an F2 instance class (which has higher memory and CPU limits than the default F1) and start from there. Adding or modifying the instance_class element of your app.yaml file to instance_class: F2 would suffice, and you can move to a still larger class later if your application's load requires it.
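For reference, a minimal sketch of the change; the runtime line is a placeholder for whatever the app already declares, and only instance_class is the point here:
runtime: nodejs10      # placeholder - keep your existing runtime line
instance_class: F2     # roughly doubles the memory limit of the default F1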
Notice that increasing the instance class directly affects your billing; you can use the Google Cloud Platform Pricing Calculator to get an estimate of the costs associated with using a different instance class for your App Engine application.

K8S - using Prometheus to monitor another Prometheus instance in a secure way

I've installed Prometheus operator 0.34 (which works as expected) on cluster A (the main Prometheus).
Now I want to use the federation option, i.e. collect metrics from another Prometheus which is located on another K8S cluster B.
Scenario:
In cluster A I have the MAIN Prometheus operator v0.34 config.
In cluster B I have the SLAVE Prometheus 2.13.1 config.
Both were installed successfully via helm; I can access localhost via port-forwarding and see the scraping results on each cluster.
I did the following steps:
On the operator (main cluster A) I used additionalScrapeConfigs:
I added the following to the values.yaml file and updated it via helm.
additionalScrapeConfigs:
  - job_name: 'federate'
    honor_labels: true
    metrics_path: /federate
    params:
      match[]:
        - '{job="prometheus"}'
        - '{__name__=~"job:.*"}'
    static_configs:
      - targets:
          - 101.62.201.122:9090  # The External-IP and port from the target Prometheus on cluster B
I took the target as follows:
On the Prometheus inside cluster B (from which I want to collect the data) I ran:
kubectl get svc -n monitoring
From the resulting entries I took the EXTERNAL-IP and put it inside the additionalScrapeConfigs entry.
Now I switch to cluster A and run:
kubectl port-forward svc/mon-prometheus-operator-prometheus 9090:9090 -n monitoring
I open the browser at localhost:9090, see the graphs, click on Status and then on Targets,
and see the new target with the job federate.
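The federate endpoint can also be queried directly as a sanity check; a sketch using the same External-IP and match[] parameter as in the config above:
curl -G 'http://101.62.201.122:9090/federate' \
  --data-urlencode 'match[]={job="prometheus"}'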
Now to my main question/gaps (security & verification):
To get that target state to green (see the pic), I configured the Prometheus server in cluster B to use type: LoadBalancer instead of type: NodePort, which exposes the metrics externally. This can be fine for testing, but I need to secure it. How can that be done?
How can I make this work end to end in a secure way?
TLS
https://prometheus.io/docs/prometheus/1.8/configuration/configuration/#tls_config
Inside cluster A (the main cluster) we use certificates for our services with Istio, like the following, which works:
tls:
  mode: SIMPLE
  privateKey: /etc/istio/oss-tls/tls.key
  serverCertificate: /etc/istio/oss-tls/tls.crt
I see that inside the docs there is an option to configure:
additionalScrapeConfigs:
  - job_name: 'federate'
    honor_labels: true
    metrics_path: /federate
    params:
      match[]:
        - '{job="prometheus"}'
        - '{__name__=~"job:.*"}'
    static_configs:
      - targets:
          - 101.62.201.122:9090  # The External-IP and port from the target
    # tls_config:
    #   ca_file: /opt/certificate-authority-data.pem
    #   cert_file: /opt/client-certificate-data.pem
    #   key_file: /sfp4/client-key-data.pem
    #   insecure_skip_verify: true
But I am not sure which certificate I need to use inside the Prometheus operator config: the certificate of the main Prometheus A or of the slave B?
You should consider using Additional Scrape Configuration
AdditionalScrapeConfigs allows specifying a key of a Secret
containing additional Prometheus scrape configurations. Scrape
configurations specified are appended to the configurations generated
by the Prometheus Operator.
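A minimal sketch of that approach; the Secret name additional-scrape-configs and the file name prometheus-additional.yaml are placeholders:
# Put the scrape config shown above into prometheus-additional.yaml, then:
kubectl create secret generic additional-scrape-configs \
  --from-file=prometheus-additional.yaml -n monitoring
# and reference it from the Prometheus custom resource:
spec:
  additionalScrapeConfigs:
    name: additional-scrape-configs
    key: prometheus-additional.yaml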
I am afraid this is not officially supported. However, you can update the prometheus.yml section within the Helm chart. If you want to learn more about it, check out this blog.
I see two options here:
Connections to Prometheus and its exporters are not encrypted or authenticated by default. One way of fixing that is with TLS certificates and stunnel.
Or specify Secrets which you can add to your scrape configuration.
Please let me know if that helped.
A couple of options spring to mind:
Put the two clusters in the same network space and put a firewall in front of them.
Set up a VPN tunnel between the clusters.
Use Istio multicluster routing (but this could get complicated): https://istio.io/docs/setup/install/multicluster

App Engine Flex deployment health check fails

I've made a Python 3 Flask app with gunicorn to serve as an API proxy. I've deployed the OpenAPI spec to Cloud Endpoints and filled in the endpoints service in the app.yaml file.
When I try to deploy to App Engine Flex, the health check fails because it took too long. I've tried to alter the readiness_check's app_start_timeout_sec as suggested, but to no avail. When checking the logs on Stackdriver, I can only see gunicorn booting a couple of workers and then terminating everything, several times in a row, with no further explanation of what goes wrong. I've also tried specifying resources in the app.yaml and scaling the workers in the gunicorn.conf.py file, but to no avail.
Then I tried switching to uwsgi, but it behaved the same way: starting up and terminating several times in a row, followed by the health check timeout.
error:
ERROR: (gcloud.app.deploy) Error Response: [4] Your deployment has failed to become healthy in the allotted time and therefore was rolled back. If you believe this was an error, try adjusting the 'app_start_timeout_sec' setting in the 'readiness_check' section.
app.yaml:
runtime: python
env: flex
entrypoint: gunicorn -c gunicorn.conf.py -b :$PORT main:app
runtime_config:
  python_version: 3
endpoints_api_service:
  name: 2019-09-27r0
  rollout_strategy: managed
resources:
  cpu: 1
  memory_gb: 2
  disk_size_gb: 10
gunicorn.conf.py:
import multiprocessing
bind = "127.0.0.1:8000"
workers = multiprocessing.cpu_count() * 2 + 1
requirements.txt:
aniso8601==8.0.0
certifi==2019.9.11
chardet==3.0.4
Click==7.0
Flask==1.1.1
Flask-Jsonpify==1.5.0
Flask-RESTful==0.3.7
gunicorn==19.9.0
idna==2.8
itsdangerous==1.1.0
Jinja2==2.10.1
MarkupSafe==1.1.1
pytz==2019.2
requests==2.22.0
six==1.12.0
urllib3==1.25.5
Werkzeug==0.16.0
pyyaml==5.1.2
Is there anyone who can spot a conflict or something I forgot here? I'm out of ideas and really need help. It would also help if someone could point me in the right direction on where to find more info in the logs (I also ran gcloud app deploy with --verbosity=debug, but this only shows "Updating service [default]... ...Waiting to retry."). I would really like to know what causes the health checks to time out!
Thanks in advance!
You can either disable health checks or customize them.
To disable them, add the following to your app.yaml:
health_check:
  enable_health_check: False
To customize them, take a look at the split health checks.
You can customize the liveness check requests by adding an optional liveness_check section to your app.yaml file, for example:
liveness_check:
  path: "/liveness_check"
  check_interval_sec: 30
  timeout_sec: 4
  failure_threshold: 2
  success_threshold: 2
In the documentation you can check the settings available for liveness checks.
In addition, there are the readiness checks. In the same way, you can customize some of their settings, for example:
readiness_check:
  path: "/readiness_check"
  check_interval_sec: 5
  timeout_sec: 4
  failure_threshold: 2
  success_threshold: 2
  app_start_timeout_sec: 300
The values mentioned above can be changed according to your needs. Pay special attention to these values, since App Engine Flexible takes several minutes to start an instance; this is a notable difference from App Engine Standard and should not be taken lightly.
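Note that the application itself has to answer these checks with an HTTP 200. A minimal sketch of matching Flask handlers, assuming main.py defines the app and the paths configured above:
from flask import Flask

app = Flask(__name__)

# Paths must match the liveness_check/readiness_check sections in app.yaml.
@app.route('/liveness_check')
def liveness_check():
    return 'ok', 200

@app.route('/readiness_check')
def readiness_check():
    # Return a non-200 status here until the app is actually ready to serve.
    return 'ok', 200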
If you examine the nginx.health_check logs for your application, you might see health check polling happening more frequently than you have configured, due to the redundant health checkers that are also following your settings. These redundant health checkers are created automatically and you cannot configure them.

Unable to update VM with nodejs app on Google App Engine

When I try to deploy from the gcloud CLI I get the following error.
Copying files to Google Cloud Storage...
Synchronizing files to [gs://staging.logically-abstract-www-site.appspot.com/].
Updating module [default]...\Deleted [https://www.googleapis.com/compute/v1/projects/logically-abstract-www-site/zones/us-central1-f/instances/gae-builder-vm-20151030t150724].
Updating module [default]...failed.
ERROR: (gcloud.preview.app.deploy) Error Response: [4] Timed out creating VMs.
My app.yaml is:
runtime: nodejs
vm: true
api_version: 1
automatic_scaling:
  min_num_instances: 2
  max_num_instances: 20
  cool_down_period_sec: 60
  cpu_utilization:
    target_utilization: 0.5
and I am logged in successfully and have the correct project ID. I see the new version created in the Cloud Console for App Engine, but the error seems to occur after that.
In the stdout log I see both instances go up with the last console.log statement I put in the app after it starts listening on the port, but in the shutdown.log I see "app was unhealthy" and in syslog I see "WARNING: never got healthy response from app, but sending /_ah/start query anyway."
From my experience with nodejs on Google Cloud App Engine, "Timed out creating VMs" is neither a traditional timeout nor does it necessarily have to do with creating VMs. I found that other errors were reported during the launch of the server, which happens right after the VMs are created. So I recommend checking the console output to see if it tells you anything.
To see the console output:
For a VM instance, go to your VM instances, click the VM instance you want, scroll towards the bottom, and click "Serial console output".
For stdout console logging, go to your monitoring logs, then change the log type dropdown from Request to stdout.
I found differences in process.env when running locally versus in the cloud. I hope you find your solution too. Good luck!
