How to collect statistics from CI over time? - gitlab

CI environments and services like GitLab (self-hosted and cloud), GitHub, Codecov, Codacy, ... collect statistics so a developer or team lead can see the evolution of the project over time:
number of merge requests, commits, contributors, ...
number of passed/failed tests
code coverage
used runtime for e.g. unit tests on a server
...
Unfortunately, these statistics are decentralized (multiple cloud services are needed), specific to the services that offer them, and not general purpose.
I'm looking for a solution to collect arbitrary data points over time, per repository or repository group. My background is hardware development (e.g. FPGAs) and embedded software, so typical metrics are:
used hardware resources like gates, memory, multiplier units, ...
timing errors (how many wires do not meet timing constraints)
achievable (max) frequency
number of critical errors, warnings and info messages
Other more software-like parameters could be:
performance / per test-case runtime
executable size
All these metrics are essential to detect improvements / optimizations over time, or to notice degradation before a hardware design stops working (becomes unreliable).
What I know so far:
Such data is ideally stored in a time series database, either with an unlimited time span (if you want to compare against years back, when the project started) or a limited one, e.g. the last 12 months.
Prometheus is widely used in cloud and network setups, e.g. to collect CPU/RAM usage, network traffic, temperatures and other arbitrary data points over time (see the push sketch after this list).
Prometheus is part of a self-hosted GitLab installation.
Visualization can be done via Grafana.
Users can define new diagrams and panels.
Grafana is part of a self-hosted GitLab installation.
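As a minimal sketch of what pushing such data points from a CI job could look like, assuming a Prometheus Pushgateway is reachable (the hostname, job name and metric names below are hypothetical, CI_PROJECT_PATH_SLUG is GitLab's predefined variable):
# push one data point per pipeline run, labelled with the repository
cat <<EOF | curl --data-binary @- "http://pushgateway.example.com:9091/metrics/job/ci_metrics/repository/${CI_PROJECT_PATH_SLUG}"
fpga_max_frequency_mhz 312.5
fpga_timing_errors 0
EOF
Grafana could then plot these series per repository by filtering on the repository label; the open questions below about per-repository access rights and automatic cleanup would still need to be solved around such a setup.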
What's missing from my point of view - and here I'm seeking help or ideas:
How to connect new time series in Prometheus with a Git repository?
How to define access rights based on who can access a Git repository?
How to add new views to Grafana if a repository pushes such statistics?
How to get rid of old data if the repository gets deleted?
At best configure it with a YAML file in the repository itself.
...
Of course I could set this up by hand if it were just a single repository pushing data points, but I have > 100 repositories and currently 1-3 new ones are added per week.
Is such a service / add-on already available?
(I tried asking this on the DevOps site first, but it got only 10 views due to the low activity there.)

You could use AWS CloudWatch custom metrics and then build a CloudWatch dashboard to look at the metrics.
aws cloudwatch put-metric-data --metric-name Buffers --namespace MyNameSpace --unit Bytes --value 231434333 --dimensions InstanceId=1-23456789,InstanceType=m1.small
aws cloudwatch get-metric-statistics --metric-name Buffers --namespace MyNameSpace --dimensions Name=InstanceId,Value=1-23456789 Name=InstanceType,Value=m1.small --start-time 2016-10-15T04:00:00Z --end-time 2016-10-19T07:00:00Z --statistics Average --period 60
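Mapped onto the per-repository metrics from the question, that could look roughly like this (namespace, metric names and dimension values are made up for illustration):
aws cloudwatch put-metric-data --namespace "CI/FPGA" --metric-name MaxFrequencyMHz --dimensions Repository=my-fpga-repo --value 312.5
aws cloudwatch put-metric-data --namespace "CI/FPGA" --metric-name TimingErrors --dimensions Repository=my-fpga-repo --unit Count --value 0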

Related

How to understand Gitlab latency metric

Gitlab has a complete metrics dashboard running on a Grafana instance.
If I head to Gitlab Omnibus - Overview within this dashboard, I can see a Workhorse Latency panel, and if I hover the mouse over it, it shows me the count of requests in that specific bucket. So if I understand correctly, there are three requests in the 10s - 30s bucket in the image below.
However, when I inspect the data, I see these numbers in image 2, which do not make sense to me. What do these figures mean? How do I make sense of them?
On Prometheus I see the same numbers from the query:
sum by (le)(rate(gitlab_workhorse_http_request_duration_seconds_bucket{instance=~"localhost:9229"}[1m]))
How can I count those numbers to make the same as what I'm seeing in the panel?
The GitLab Prometheus metrics do indeed include:
gitlab_rails_queue_duration_seconds Measures latency between GitLab Workhorse forwarding a request to Rails
gitlab_transaction_rails_queue_duration_total Measures latency between GitLab Workhorse forwarding a request to Rails
I suppose the first graph displays the number of requests that fell within a certain latency bucket, as reported by all the occurrences listed in the second image.
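The _bucket series of a Prometheus histogram are cumulative: le="30" counts every request faster than 30 s, including those already counted under le="10". So the count in the 10 s - 30 s band of the heatmap is the difference between two adjacent buckets. A sketch of checking this against the raw data, assuming the exporter actually exposes le="10" and le="30" boundaries and Prometheus listens on localhost:9090 (add your instance filter as in your original query):
curl -G http://localhost:9090/api/v1/query --data-urlencode \
  'query=sum(increase(gitlab_workhorse_http_request_duration_seconds_bucket{le="30"}[1m])) - sum(increase(gitlab_workhorse_http_request_duration_seconds_bucket{le="10"}[1m]))'
That subtraction is presumably what the heatmap panel does internally, which is why the panel shows per-band counts while your raw query returns cumulative values.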

GitLab analyze job queue waiting time

Is there a built-in way to get an overview of how long jobs, by tag, spend in the queue, to check whether the runners are over- or under-committed? I checked the Admin Area but did not find anything; have I overlooked something?
If not, are there any existing solutions you are aware of? I tried searching, but my keywords are too broad and so are the results; I have found nothing yet.
Edit: I see the jobs REST API can return all of a runner's jobs and includes created_at, started_at and finished_at; maybe I'll have to analyze that.
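A rough sketch of that analysis with curl and jq, using the project-level jobs endpoint (gitlab.example.com, <project_id> and <token> are placeholders; pagination is ignored):
# queue time per job = started_at - created_at, in seconds
curl --header "PRIVATE-TOKEN: <token>" \
  "https://gitlab.example.com/api/v4/projects/<project_id>/jobs?per_page=100" |
  jq -r '.[]
    | select(.started_at != null)
    | [.id, .name, ((.started_at | sub("\\.\\d+Z$"; "Z") | fromdate) - (.created_at | sub("\\.\\d+Z$"; "Z") | fromdate))]
    | @tsv'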
One way to do this is to use the runners API to list the status of all your active runners. So, at any given moment in time, you can assess how many active jobs you have running by listing each runner's 'running' jobs (use the ?status=running filter).
So, from the above, you should be able to arrive at the number of currently running jobs. You can compare that against your maximum capacity of jobs (the sum of the configured job limits for all your runners).
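For example (host and token are placeholders; both endpoints exist in the GitLab REST API):
# list your runners, then list a given runner's currently running jobs
curl --header "PRIVATE-TOKEN: <token>" "https://gitlab.example.com/api/v4/runners?per_page=100"
curl --header "PRIVATE-TOKEN: <token>" "https://gitlab.example.com/api/v4/runners/<runner_id>/jobs?status=running"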
However, if you are at full capacity, this won't tell you how much over capacity you are or how big the job backlog is. For that, you can look at the pending queue under /admin/jobs. I'm not sure if there's an official API endpoint to list all the pending jobs, however.
Another way to do this is through GitLab's Prometheus metrics. Go to the /-/metrics endpoint for your GitLab server (the full URL can be found at /admin/health_check). These are the Prometheus metrics exposed by GitLab for monitoring. Among them is a metric called job_queue_duration_ that you can use to query the queue duration for jobs, that is, how long jobs are queued before they begin running. You can even break it down project by project, or by whether the runners are shared or not.
To get an average of this time per minute, I've used the following prometheus query:
sum(rate(job_queue_duration_seconds_sum{jobs_running_for_project=~".*",shard="default",shared_runner="true"}[1m]))
Although the feature is currently deprecated, you could even chart this metric with the built-in metrics monitoring feature: using localhost:9000 (the GitLab server itself) as the configured Prometheus server and the query above, you would get a chart of the average queue time per minute.
Of course, any tool that can view prometheus metrics will work (e.g., grafana, or your own prometheus server).
GitLab 15.6 (November 2022) starts implementing your request:
Admin Area Runners - job queued and duration times
When GitLab administrators get reports from their development team that a CI job is either waiting for a runner to become available or is slower than expected, one of the first areas they investigate is runner availability and queue times for CI jobs.
While there are various methods to retrieve this data from GitLab, those options could be more efficient.
They should provide what users need - a view that makes it more evident if there is a bottleneck on a specific runner.
The first iteration of solving this problem is now available in the GitLab UI.
GitLab administrators can now use the runner details view in Admin Area > Runners to view the queue time for CI jobs and job execution duration metrics.
See Documentation and Issue.

Azure DevOps build using Docker becoming progressively slower

I'm building multiple projects using a single docker build, generating an image and pushing that into AWS ECR. I've recently noticed that builds that were taking 6-7 minutes are now taking on the order of 25 minutes. The Docker build portion of the process that checks out git repos and does the project builds takes ~5 minutes, but what is really slow are the individual Docker build commands such as COPY, ARG, RUN, ENV, LABEL etc. Each one is taking a very long time resulting in an additional 18 minutes or so. The timings vary quite a bit, even though the build remains generally the same.
When I first noticed this degradation Azure was reporting that their pipelines were impacted by "abuse", which I took as a DDOS against the platform (early April 2021). Now, that issue has apparently been resolved, but the slow performance continues.
Are Azure DevOps builds assigned random agents? Should we be running some kind of cleanup process such as docker system prune etc?
Are Azure DevOps builds assigned random agents? Should we be running some kind of cleanup process such as docker system prune etc?
Based on your description:
The timings vary quite a bit, even though the build remains generally the same.
This issue is most likely down to the performance of the hosted agents.
Based on how Azure DevOps works, every time you run the pipeline with a hosted agent, the system randomly assigns a new qualified agent. Because each run gets a fresh agent, there is no need to run any kind of cleanup process such as docker system prune.
To verify this, you could set up your own private (self-hosted) agent and check whether the build time still varies that much from run to run (the first build may take a bit longer because there is no local cache yet).
By the way, if you still want to determine whether a decline in hosted-agent performance is causing your problem, you should contact the Product team directly; they can check the region where your organization is located for any degradation.
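If you do end up keeping a long-lived self-hosted agent, the cleanup mentioned in the question does become relevant there; a typical (destructive) maintenance step could be:
# reclaim disk space on a long-lived build agent: removes stopped containers,
# unused images, networks, build cache and volumes
docker system prune -af --volumes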

Terraform: Can it be used to spin up reserved infrastructure for short periods of time?

I've just learned about Terraform at the highest level and I'm wondering if it's something I should look into for the following purpose:
I'm developing a toy project and exploring the idea of a quick "back-up-and-share-your-own-data" type of feature. The user would mostly need to access their data when they open or save a document (more of a periodic sync than a pub-sub model). It would be great to allow users to capitalize on DigitalOcean's hourly rate.
Would it be possible to just use a tiny persistent volume on DigitalOcean, then programmatically...
Spin up a minimal droplet in response to a (client-side) application event
Hook up to a DigitalOcean floating IP address
Run for just the 1-2 minutes of operation time required
Then destroy the droplet?
This feels like cheating... is this a use-case for Terraform or would I basically be reinventing the wheel when I should just be using the provider API?
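For scale, the plain provider-API route sketched in the steps above might look roughly like this with DigitalOcean's doctl CLI (droplet name, size, image, region, IP and IDs are made up):
# spin up a minimal droplet, attach an existing floating IP, run the sync, tear down
doctl compute droplet create sync-worker --size s-1vcpu-1gb --image ubuntu-22-04-x64 --region nyc3 --wait
doctl compute floating-ip-action assign 203.0.113.10 <droplet-id>
# ... run the 1-2 minute sync job ...
doctl compute droplet delete sync-worker --force
Terraform can express the same lifecycle declaratively (apply / destroy), which helps if the per-user infrastructure grows beyond a single droplet; for a minutes-long spin-up/tear-down driven by an application event, calling the API or CLI directly is arguably the lighter option.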

Dynamic Service Creation to Distribute Load

Background
The problem we're facing is that we are doing video encoding and want to distribute the load to multiple nodes in the cluster.
We would like to constrain the number of video encoding jobs on a particular node to some maximum value. We would also like to have small video encoding jobs sent to a certain grouping of nodes in the cluster, and long video encoding jobs sent to another grouping of nodes in the cluster.
The idea behind this is to help maintain fairness amongst clients by partitioning the large jobs into a separate pool of nodes. This helps ensure that the small video encoding jobs are not blocked / throttled by a single tenant running a long encoding job.
Using Service Fabric
We plan on using an ASF service for the video encoding. With this in mind we had an idea of dynamically creating a service for each job that comes in. Placement constraints could then be used to determine which pool of nodes a job would run in. Custom metrics based on memory usage, CPU usage ... could be used to limit the number of active jobs on a node.
With this method the node distributing the jobs would have to poll whether a new service could currently be created that satisfies the placement constraints and metrics.
Questions
What happens when a service can't be placed on a node? (Using CreateServiceAsync I assume?)
Will this polling be prohibitively expensive?
Our video encoding executable is packaged along with the service which is approximately 80MB. Will this make the spinning up of a new service take a long time? (Minutes vs seconds)
As an alternative to this we could use a reliable queue based system, where the large jobs pool pulls from one queue and the small jobs pool pulls from another queue. This seems like the simpler way, but I want to explore all options to make sure I'm not missing out on some of the features of Service Fabric. Is there another better way you would suggest?
I have no experience with placement constraints and dynamic services, so I can't speak to that.
The polling of the perf counters isn't terribly expensive, that being said it's not a free operation. A one second poll interval shouldn't cause any huge perf impact while still providing a decent degree of resolution.
The service packages get copied to each node at deployment time rather than when services get spun up, so it'll make the deployment a bit slower but not affect service creation.
You're going to want to put the job data in reliable collections any way you structure it, but the question is how. One idea that might be worth considering is making the job-processing service a partitioned service and basing your partitioning strategy on encoding job size and/or tenant, so that large jobs from the same tenant end up in the same queue and smaller jobs from others go elsewhere.
As an aside, one thing I've dealt with in the past is that SF remoting limits the size of the messages sent and throws if a message is too big, so if your video files are being passed from service to service, you're going to want to consider a paging strategy for inter-service communication.
