I'm building multiple projects in a single docker build, generating an image and pushing it to AWS ECR. I've recently noticed that builds that used to take 6-7 minutes are now taking on the order of 25 minutes. The part of the Docker build that checks out the git repos and runs the project builds takes ~5 minutes, but what is really slow are the individual Dockerfile instructions such as COPY, ARG, RUN, ENV, LABEL, etc. Each one takes a very long time, adding roughly another 18 minutes. The timings vary quite a bit, even though the build itself stays largely the same.
When I first noticed this degradation, Azure was reporting that their pipelines were impacted by "abuse", which I took to mean a DDoS against the platform (early April 2021). That issue has apparently been resolved, but the slow performance continues.
Are Azure DevOps builds assigned random agents? Should we be running some kind of cleanup process such as docker system prune etc?
Based on your description:
The timings vary quite a bit, even though the build remains generally the same.
This is most likely still a performance problem with the hosted agent.
Based on how Azure DevOps works, every time you run a pipeline on a hosted agent, the system randomly assigns a new qualified agent from the pool. Because each build gets a fresh agent, we do not need to run a cleanup process such as docker system prune.
To verify this, you could set up a private (self-hosted) agent and check whether the build time still varies that much from run to run (the first build may take a bit longer because there is no local cache yet).
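If you do try a private agent, keep in mind that its local Docker cache persists between runs, so a periodic cleanup step is reasonable there. A minimal sketch in Azure Pipelines YAML, assuming Docker is installed on the agent:

steps:
  - script: docker system prune --force
    displayName: Clean up unused Docker data on the self-hosted agent
    condition: always()   # run even when earlier steps fail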
By the way, if you still want to determine whether degraded hosted-agent performance is causing your problem, you should contact the product team directly; they can check the region where your organization is located and confirm whether there is degradation in that region.
This is probably a weird situation. Every post I have found on this topic is the other way around, where people want "Remove additional files" checked; in my case I want it unchecked, but that is causing problems at later stages. To give some context:
We are building around 15 to 20 Azure Functions as a wrapper API on top of the Dynamics CRM APIs. The two options we evaluated are:
a) Create each function in its own function app. This gives us a maintenance issue (20 URLs for each of Dev, SIT, UAT, Stage, Prod and Training is a considerable mess to manage, along with their managed identities, app registrations, etc.). Another key reason not to take this approach is the consumption plan's warm-up issue; it is unlikely that all of these functions are heavily used, but some of them are.
b) Keep all functions under one big function app. This is our preferred option, as it takes care of most of the issues above. The problem we observed is that if we have to deploy one function, we have to wait for all the functions to be tested and approved and then deploy all of them, even if the requirement is to deploy only one. That is a total no-no from an architectural point of view.
So we adopted a hybrid approach: in Visual Studio we still maintain multiple function app projects, but during deployment all of these functions are deployed into a single function app using Web Deploy with "Remove additional files in target" unchecked.
The problem now
This all worked very well for us during our POC. However, now that we have started deploying through pipelines into a staging slot, it has become a problem. Say we first deploy function 1 to staging and swap it to production: staging now has 0 functions and production has 1. Then when we deploy the 2nd function, staging contains only the 2nd function, and if we swap it with production, production ends up with only the 2nd function and we lose the 1st function from production entirely.
Logically this behaviour makes sense to me, but I'm wondering if anyone can suggest a workaround for it.
Please let me know if any further details are required.
I followed the Node.js on App Engine flexible environment tutorial:
https://cloud.google.com/appengine/docs/flexible/nodejs/create-app
Having successfully deployed and tested the tutorial, I changed the code to experiment a little and successfully deployed it... and then left it running since this was a testing environment (not public).
A month later, I receive a bill from Google for over $370!
In the transaction details I see the following:
Oct 1 – 31, 2017  App Engine Flex Instance RAM: 5948.774 Gibibyte-hours ([MYPROJECT])  $42.24
Oct 1 – 31, 2017  App Engine Flex Instance Core Hours: 5948.774 Hours ([MYPROJECT])  $312.91
How did this testing environment, with almost zero requests, use about 6,000 hours of resources? At worst, I would have assumed 720 hours of running full-time for a month at ~$0.05 per hour, costing me around $40.
https://cloud.google.com/appengine/pricing
Can someone help shed light on this? I have not been able to find out why so many resources were needed.
Thanks for the help!
For more data, this is the traffic over the last month (basically 0):
And instance data
UPDATE:
Note that I did make one modification to the package.json: I added nodemon as a dependency and made it part of my "npm start" script. Though I doubt this explains the 6,000 hours of resources:
"scripts": {
"deploy": "gcloud app deploy",
"start": "nodemon app.js",
"dev": "nodemon app js",
"lint": "samples lint",
"pretest": "npm run lint",
"system-test": "samples test app",
"test": "npm run system-test",
"e2e-test": "samples test deploy"
},
app.yaml (default, unchanged from the tutorial):
runtime: nodejs
env: flex
After multiple back-and-forths with Google, and hours of reading blog posts and looking at reports, I've finally found an explanation for what happened. I will post it here along with my suggestions so that other people do not also fall victim to this problem.
Note, this may seem obvious to some, but as a new GAE user, all of this was brand new to me.
In short, when you deploy to GAE using "$ gcloud app deploy", it creates a new version and sets it as the default, but, more importantly, it does NOT remove the previous version that was deployed.
More info about versions and instances can be found here: https://cloud.google.com/appengine/docs/standard/python/an-overview-of-app-engine
So in my case, without knowing it, I had created multiple versions of my simple Node app. These versions keep running in case you need to switch back after an error. But these versions also require instances, and the default, unless stated otherwise in app.yaml, is 2 instances.
Google says:
App Engine by default scales the number of instances running up and down to match the load, thus providing consistent performance for your app at all times while minimizing idle instances and thus reducing cost.
However, from my experience, this was not the case. As I said earlier, I pushed my Node app with nodemon, which it seems was causing errors.
In the end, by following the tutorial and not shutting down the project, I had 4 versions, each with 2 instances, running full-time for 1.5 months, serving 0 requests and generating lots of error messages, and it cost me $500. The math checks out: 4 versions × 2 instances × ~744 hours in October is roughly the 5,948 instance-hours on the bill.
RECOMMENDATIONS IF YOU STILL WANT TO USE GAE FLEX ENV:
First and foremost, set up a billing budget and alerts so that you do not get surprised by an expensive invoice that is automatically charged to your credit card: https://cloud.google.com/billing/docs/how-to/budgets
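A budget can also be created from the command line with a recent gcloud SDK; a sketch follows, where the billing account ID, display name, and amounts are placeholders and older SDKs may need the gcloud beta billing surface instead:

$ gcloud billing budgets create \
    --billing-account=0X0X0X-0X0X0X-0X0X0X \
    --display-name="GAE testing budget" \
    --budget-amount=50USD \
    --threshold-rule=percent=0.5 \
    --threshold-rule=percent=0.9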
In a testing environment you most likely do not need multiple versions, so when deploying use the following command:
$ gcloud app deploy --version v1
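If you do end up with more than one version anyway, gcloud app deploy also has flags to promote the new version and stop the previously serving one, for example (check gcloud app deploy --help for your SDK version):

$ gcloud app deploy --version v1 --promote --stop-previous-version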
Update your app.yaml to force only 1 instance with minimal resources:
runtime: nodejs
env: flex

# This sample incurs costs to run on the App Engine flexible environment.
# The settings below are to reduce costs during testing and are not appropriate
# for production use. For more information, see:
# https://cloud.google.com/appengine/docs/flexible/nodejs/configuring-your-app-with-app-yaml
manual_scaling:
  instances: 1
resources:
  cpu: 1
  memory_gb: 0.5
  disk_size_gb: 10
Set a daily spending limit.
See this blog post for more info: https://medium.com/google-cloud/three-simple-steps-to-save-costs-when-prototyping-with-app-engine-flexible-environment-104fc6736495
I wish some of these steps had been included in the tutorial to protect those who are trying to learn and experiment, but they were not.
Google App Engine flex env can be tricky if you don't know all these details. A friend pointed me to Heroku, which has both set pricing and free/hobby tiers. I was able to quickly push a new Node app there, and it worked like a charm!
https://www.heroku.com/pricing
It "only" cost me $500 to learn this lesson, but I do hope this helps others looking at Google App Engine Flex Env.
If you want to reduce your GAE costs please do not use manual_scaling as suggested in this article or the accepted answer!
The beautiful thing about Google App Engine is that it can scale up and down to hundreds of machines within milliseconds based on demand. And you only pay for instances that are running.
To be able to optimize your costs, you need to understand the different scaling options and instance types:
1. App Engine flex vs. standard:
The details of the differences can be found here, but one important difference relevant to this question is:
[Standard is] Intended to run for free or at very low cost, where you pay only for what you need and when you need it. For example, your application can scale to 0 instances when there is no traffic.
2. Scaling Options:
Automatic scaling: Google scales your app up and down depending on demand and the configuration you provide.
Manual scaling: No scaling at all; GAE runs exactly the number of instances you asked for, all the time (very misleading naming).
Basic scaling: Scales up to the limit you set and also scales back down after a certain idle time.
3. Instance Types:
There are two instance types, and they basically differ in how long it takes to spin up a new instance. F-class instances (used with automatic scaling) can be created on demand within ~0.1 seconds, and B-class instances (used with manual/basic scaling) within ~0.7 seconds.
Now that you understand the basics, let's go back to the accepted answer:
manual_scaling:
  instances: 1
resources:
  cpu: 1
  memory_gb: 0.5
  disk_size_gb: 10
What this instructs GAE to do is run a custom instance class (more costly) all the time. Obviously this is not the cheapest option, because a B1/F1 instance type could be used instead (it has lower specs), and it also keeps an instance running constantly.
The cheapest option is to turn the instance off when there is no traffic. If you don't mind the ~0.1 second spin-up time, you could go with this instead:
instance_class: F1
automatic_scaling:
  max_instances: 1  # adjust this as you wish
  min_instances: 0  # scales to 0 when there is no traffic, so it won't incur costs
This will fall within the free quotas Google provides, and it should not cost you anything if you don't have any real traffic.
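For context, instance_class and min_instances: 0 are App Engine standard environment settings (the flex environment cannot scale to zero, as the validation errors quoted in another answer below show), so this assumes you move the app to the standard environment. A minimal sketch of such an app.yaml, assuming a currently supported Node.js standard runtime:

# App Engine standard environment (no "env: flex" line); adjust the runtime
# to a currently supported Node.js version.
runtime: nodejs20
instance_class: F1
automatic_scaling:
  min_instances: 0   # scale to zero when idle, so no cost without traffic
  max_instances: 1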
PS: It's also highly recommended to set up a daily spending limit in case you forget something running or have some costly setting somewhere (daily spending limits are deprecated but will remain available until July 24, 2021, source).
We had code deployed to GAE FE go absolutely nuts due to a cascading, exponential failure (bounced emails generated bounced-email emails, and so on), and we could NOT turn off the buggy GAE instances. After 4+ hours and 1M+ emails sent, with Mailgun refusing to let us disable the account (it said "Please wait up to 24 hours for the password change to go into effect", and revoking API keys did nothing), the redis VM stopped, the DB down, and all the site's code reduced to a single static "Down for Maintenance" 503 page, the emails kept being sent.
I determined that GAE FE simply does not shut down either Docker VMs or Compute Engine VMs (redis) that are under CPU load. Maybe never! Once we actually deleted the Compute VM (instead of "merely" stopping it), the emails instantly stopped.
But, our DB continued to get filled with "could not send email" notices for up to 2 more hours, despite the GAE app reporting 100% of the versions and instances to be "Stopped". I ended up having to change the Google Cloud SQL password.
We kept checking the bill, and the 7 rogue instances kept using CPU, so we cancelled the card used on that account. The site did, in fact, go down when the bill was past due, but so did the rogue instances. We were never able to resolve the situation with GAE email support.
Update (30 Sep 2020): This is still the worst moment of my 22-year career!! An entire company of 15 crack genius devs couldn't figure out how to turn off GAE. We knew customers were receiving MILLIONS of emails when one of my devs couldn't access her Gmail account. Couldn't unplug it, couldn't turn it off. It was quite a "Terminator" moment!
It wouldn't have been nearly so bad, except for expenses, if MailGun had allowed us to actually disable the API access or change the password. But it would have still been bad expense-wise on GAE.
I no longer trust servers I can't issue reboot on.
In the end, MailGun only charged us about $50. GAE, however... If I had just assumed "OK, mails stopped, we can stop", we could have ended up with a $20,000 excess bill! As it was, it "only" cost $1,500. And we never could get in contact with anyone to dispute it. So the CEO just ate it.
Also note that if you still want your app to have automatic scaling but you don't want the default minimum of 2 instances running at all times, you can configure your app.yaml like so:
runtime: nodejs
env: flex

automatic_scaling:
  min_num_instances: 1
Since no one has mentioned them, here are the gcloud commands related to versions:
# List all versions
$ gcloud app versions list
SERVICE VERSION.ID TRAFFIC_SPLIT LAST_DEPLOYED SERVING_STATUS
default 20200620t174631 0.00 2020-06-20T17:46:56+03:00 SERVING
default 20200620t174746 0.00 2020-06-20T17:48:12+03:00 SERVING
default prod 1.00 2020-06-20T17:54:51+03:00 SERVING
# Delete these 2 versions (you can't delete all versions, you have to have at least one remaining)
$ gcloud app versions delete 20200620t174631 20200620t174746
# Help
$ gcloud app versions --help
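If you want to keep an old version around for a quick rollback but don't want to pay for its instances, you can also stop it instead of deleting it (this shuts down its instances; it works for flex versions and for manual/basic scaling in standard):

# Stop a version so it no longer runs instances
$ gcloud app versions stop 20200620t174631

# Start it again later if you need it back
$ gcloud app versions start 20200620t174631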
For dev environments where I don't mind a little latency, I'm using the following settings:
instance_class: B1
basic_scaling:
  max_instances: 1
  idle_timeout: 1m
And if you use your instance for more than the free backend instance allowance, try this:
instance_class: F1
automatic_scaling:
  max_instances: 1
In the App Engine dashboard, watch the Instances chart, take note of the start time, and check that after the idle_timeout period has passed the instance count drops to zero and you see the message "This version has no instances deployed".
These options don't work in the flex env:
app.yaml:

# 1.
resources:
  cpu: .5
  memory_gb: .18
  disk_size_gb: 10

# 2.
automatic_scaling:
  min_instances: 1
  max_instances: 1

# 3.
beta_settings:
  machine_type: f1-micro
Related errors:

1.
Error Response: [3] App Engine Flexible validation error: Memory GB (0.58) per VCPUs must be between 0.90 and 6.50

2.
ERROR: (gcloud.app.deploy) INVALID_ARGUMENT: VM-based automatic scaling should NOT have the following parameter(s): [standard_scheduler_settings.min_instances, standard_scheduler_settings.max_instances]
'#type': type.googleapis.com/google.rpc.BadRequest
fieldViolations:
  description: 'VM-based automatic scaling should NOT have the following parameter(s): [standard_scheduler_settings.min_instances, standard_scheduler_settings.max_instances]'
  field: version.automatic_scaling

3.
ERROR: (gcloud.app.deploy) INVALID_ARGUMENT: Unrecognized or unpermitted key(s) in configuration "beta_settings"
'#type': type.googleapis.com/google.rpc.BadRequest
fieldViolations:
  description: beta_setting key can not be used with env:flex
  field: machine_type
I wonder if there is a way to get the average build time from Bitbucket Pipelines (via the API?).
We have several projects running in parallel, and build times can range from 20 seconds to 4-5 minutes depending on what each project needs.
Since this build process reports to Slack, I would like to include something like "The average build time for this project is 3.2 minutes" in the report, so people know when to expect the build to finish or when to start worrying about a failure.
Does anyone have any leads on metadata about Bitbucket Pipelines that could be accessed from within the pipeline?
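One hedged way to sketch this against the Bitbucket 2.0 REST API, assuming curl and jq are available in the build image and that pipeline objects expose a build_seconds_used field (verify the field names against an actual response from your workspace; WORKSPACE, REPO_SLUG, and the credentials are placeholders):

# Average duration (in seconds) of the 30 most recent completed pipelines
curl -s -u "$BB_USER:$BB_APP_PASSWORD" \
  "https://api.bitbucket.org/2.0/repositories/WORKSPACE/REPO_SLUG/pipelines/?sort=-created_on&pagelen=30" \
  | jq '[.values[] | select(.state.name == "COMPLETED") | .build_seconds_used] | add / length'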
How do you handle having different time delays and sending notifications to different users based on the environment they are running in (dev, test & production)?
We are developing long running workflows that we would like to have delay for minutes in our dev and test environments, but need to delay for days in production.
These same workflows need to send their notifications to us in the dev environment and business users in the test and production environments.
What are the best practices for handling these types of situations?
Store the delay values in a list, and just change the values based on which environment you are in.
If you were creating the workflow in Visual Studio, you could vary the delay value based on the host name of the site the workflow is running on.