Is there a built-in way to get an overview of how long jobs, by tag, spend in the queue, to check whether the runners are over- or under-committed? I checked the Admin Area, but did not find anything; have I overlooked something?
If not, are there any existing solutions you are aware of? I tried searching, but my keywords are so broad that the results are equally broad, and I have found nothing yet.
Edit: I see the jobs REST API can return all of a runner's jobs and includes created_at, started_at and finished_at; maybe I'll have to analyze that.
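For reference, a minimal sketch of that analysis, assuming the runner jobs endpoint returns those timestamp fields and that curl and jq are available (the runner ID, host, and token are placeholders):

```
# List recent successful jobs for one runner and print each job's queue
# time (started_at - created_at) in seconds. Fractional seconds are
# stripped so jq's fromdateiso8601 accepts the timestamps.
curl --silent --header "PRIVATE-TOKEN: <your_access_token>" \
  "https://gitlab.example.com/api/v4/runners/<RUNNER_ID>/jobs?status=success&per_page=100" \
| jq -r 'def ts(x): x | sub("\\.[0-9]+"; "") | fromdateiso8601;
    .[] | select(.started_at != null)
        | "job \(.id): queued \(ts(.started_at) - ts(.created_at))s"'
```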
One way to do this is to use the runners API to list the status of all your active runners. At any given moment, you can assess how many jobs are running by listing each runner's running jobs (use the ?status=running filter).
So, from the above, you should be able to arrive at the number of currently running jobs. You can compare that against your maximum job capacity (the sum of the configured job limits for all your runners).
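As a rough illustration, a hedged shell sketch of that count (admin token assumed; host and token are placeholders, and pagination is kept minimal):

```
# Sum the currently running jobs over all runners.
TOKEN="<your_admin_token>"
HOST="https://gitlab.example.com"
total=0
for id in $(curl --silent --header "PRIVATE-TOKEN: $TOKEN" \
              "$HOST/api/v4/runners/all?per_page=100" | jq -r '.[].id'); do
  n=$(curl --silent --header "PRIVATE-TOKEN: $TOKEN" \
        "$HOST/api/v4/runners/$id/jobs?status=running&per_page=100" | jq 'length')
  total=$((total + n))
done
echo "currently running jobs: $total"
```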
However, if you are at full capacity, this won't tell you how much over capacity you are or how big the job backlog is. For that, you can look at the pending queue under /admin/jobs. I'm not sure if there's an official API endpoint to list all the pending jobs, however.
Another way to do this is through GitLab's Prometheus metrics. Go to the /-/metrics endpoint for your GitLab server (the full URL can be found at /admin/health_check). These are the Prometheus metrics exposed by GitLab for monitoring. Among them is a metric called job_queue_duration_seconds, which you can use to query the queue duration for jobs -- that is, how long jobs wait before they begin running. You can even break it down project by project, or by whether the runners are shared or not.
To get an average of this time per minute, I've used the following prometheus query:
sum(rate(job_queue_duration_seconds_sum{jobs_running_for_project=~".*",shard="default",shared_runner="true"}[1m]))
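You can also run this kind of query directly against the Prometheus HTTP API. A hedged sketch, assuming Prometheus listens on localhost:9090 and that job_queue_duration_seconds is a histogram (the _sum series above implies a matching _bucket series), estimating the 95th-percentile queue time instead of the average:

```
# Ask Prometheus for the p95 job queue duration over 5-minute windows.
curl --silent -G 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=histogram_quantile(0.95,
    sum by (le) (rate(job_queue_duration_seconds_bucket{shared_runner="true"}[5m])))'
```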
Although the feature is currently deprecated, you could even chart this metric in the built-in metrics monitoring feature, using localhost:9000 (the GitLab server itself) as the configured Prometheus server. With the above query, you would see a chart of the job queue durations over time.
Of course, any tool that can view Prometheus metrics will work (e.g., Grafana, or your own Prometheus server).
GitLab 15.6 (November 2022) starts implementing your request:
Admin Area Runners - job queued and duration times
When GitLab administrators get reports from their development team that a CI job is either waiting for a runner to become available or is slower than expected, one of the first areas they investigate is runner availability and queue times for CI jobs.
While there are various methods to retrieve this data from GitLab, those options could be more efficient.
They should provide what users need - a view that makes it more evident if there is a bottleneck on a specific runner.
The first iteration of solving this problem is now available in the GitLab UI.
GitLab administrators can now use the runner details view in Admin Area > Runners to view the job queued time and job execution duration metrics.
See Documentation and Issue.
Related
I have a long-running Java/Gradle process and an Azure Pipelines job to run it.
It's perfectly fine and expected for the process to run for several days, potentially over a week. The Azure Pipelines job is run on a self-hosted agent (to rule out any timeout issues) and the timeout is set to 0, which in theory means that the job can run forever.
Sometimes the Azure Pipelines job fails after a day or two with an error message that says "We stopped hearing from agent". Even when this happens, the job may still be running, as is evident when SSH-ing into the machine that hosts the agent.
When I discuss investigating these failures with DevOps, I often hear that Azure Pipelines is a CI tool that is not designed for long-running jobs. Is there evidence to support this claim? Does Microsoft commit to only support running jobs within a certain duration limit?
Based on the troubleshooting guide and timeout documentation page referenced above, there's a duration limit applicable to Microsoft-hosted agents, but I fail to see anything similar for self-hosted agents.
Agree with @Daniel Mann.
It's not common to run long-running jobs, but as per the docs, it should be supported.
"We stopped hearing from agent" can be caused by a network problem on the agent, or by an agent issue due to high CPU, storage, or RAM usage. You can check the agent diagnostic logs to troubleshoot.
CI environments like GitLab (self-hosted and cloud), GitHub, Codecov, Codacy, ... collect statistics over time, so a developer or team lead can see the evolution of the project over time:
number of merge requests, commits, contributors, ...
number of passed/failed tests
code coverage
used runtime for e.g. unit tests on a server
...
Unfortunately, these statistics are decentralized (multiple cloud services are needed), specific to the services that offer them, and not general purpose.
I'm looking for a solution to collect data points over time per repository or repository group. My background is hardware development with e.g. FPGAs, and also embedded software. Typical data points are:
used hardware resources like gates, memory, multiplier units, ...
timing errors (how many wires do not meet timing constraints)
achievable (max) frequency
number of critical errors, warnings and info messages
Other more software-like parameters could be:
performance / per test-case runtime
executable size
All these metrics are essential to detect improvements/optimizations over time, or to notice degradation before a hardware design stops working (gets unreliable).
What I know so far:
Such data is ideally stored in a time series database, with either an unlimited time span (if you want to compare even years back to when the project started) or a limited one, like the last 12 months.
Prometheus is used widely in cloud and network setups e.g. to collect CPU/RAM usage, network traffic, temperatures and other arbitrary data points over time.
Prometheus is part of a self-hosted GitLab installation.
Visualization can be done via Grafana.
Users can define new diagrams and panels.
Grafana is part of a self-hosted GitLab installation.
What's missing from my point of view - and here I'm seeking help or ideas:
How to connect new time series in Prometheus with a Git repository? (One possible building block is sketched after this list.)
How to define access rights based on who can access a Git repository?
How to add new views to Grafana if a repository pushes such statistics?
How to get rid of old data if the repository gets deleted?
Ideally, this would be configured with a YAML file in the repository itself.
...
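For the Prometheus part, one common building block is the Prometheus Pushgateway. A minimal sketch, assuming a Pushgateway at pushgateway.example.com and purely illustrative metric names, run from a CI job so that CI_PROJECT_PATH identifies the repository:

```
# Push one data point per pipeline run; the repository becomes a label.
# Slashes are replaced because they are not allowed in the URL path
# (newer Pushgateway versions also accept base64-encoded label values).
cat <<EOF | curl --data-binary @- \
  "https://pushgateway.example.com/metrics/job/ci_stats/repo/${CI_PROJECT_PATH//\//_}"
# TYPE fpga_luts_used gauge
fpga_luts_used 10423
# TYPE fpga_max_frequency_mhz gauge
fpga_max_frequency_mhz 187.5
EOF
```

This does not answer the access-rights or cleanup questions, but it gives each repository its own time series that Grafana can then pick up.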
Of course I could set this up if it were just a single repository pushing data points, but I have > 100 repositories, and currently 1-3 are added per week.
Is such a service / add-on already available?
(I tried asking this on the DevOps site first, but it got only 10 views due to the low activity there.)
You could use AWS CloudWatch custom metrics and then build a CloudWatch dashboard to look at the metrics.
```
# Publish a custom data point (metric name, namespace, and values are
# examples):
aws cloudwatch put-metric-data --metric-name Buffers --namespace MyNameSpace \
  --unit Bytes --value 231434333 --dimensions InstanceId=1-23456789,InstanceType=m1.small

# Read the data back as per-minute averages:
aws cloudwatch get-metric-statistics --metric-name Buffers --namespace MyNameSpace \
  --dimensions Name=InstanceId,Value=1-23456789 Name=InstanceType,Value=m1.small \
  --start-time 2016-10-15T04:00:00Z --end-time 2016-10-19T07:00:00Z \
  --statistics Average --period 60
```
I have a GitLab Runner running on Kubernetes. I see there are options to limit concurrent jobs at the runner level, but it would be preferable if we could do this at the .gitlab-ci.yml level or at the project level; however, we can't find the settings for it.
I saw this question, Disallow CI pipelines of one GitLab project to run concurrently?, but I am also not sure how to do this on a runner deployed on Kubernetes.
Even if I do this, it still won't stop the runner from creating pods, which defeats the purpose of limiting concurrent jobs to control how many resources are allocated.
I've also looked at resource_group, but it limits concurrency to a single job, while we want to allow 2 or 3 jobs to run concurrently.
Yes, there now is a way officially supported with GitLab 14.10 (April 2022):
CI/CD Limits set at the Instance Level
To contain resource usage in support of instance stability, GitLab administrators of instances with high CI/CD usage might want to add limits for specific CI/CD events.
This could be to set:
the maximum number of jobs in a single pipeline,
the maximum number of concurrent jobs in active pipelines, or
the maximum number of scheduled pipelines per project.
...
GitLab instance administrators can now set these limits (and others) directly in the instance’s Admin Area panel.
See Documentation and Issue.
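If you prefer scripting over the Admin Area UI, these limits also appear to be exposed through the plan limits API. A hedged sketch (I'm taking ci_pipeline_size as the parameter for the maximum number of jobs in a single pipeline; verify the exact parameter names against your GitLab version's documentation):

```
# Set the maximum pipeline size for the default plan (admin token required).
curl --silent --request PUT --header "PRIVATE-TOKEN: <your_admin_token>" \
  "https://gitlab.example.com/api/v4/application/plan_limits?plan_name=default&ci_pipeline_size=500"
```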
I am facing the below error while running the pipeline.
This agent request is not running because you have reached the maximum number of requests that can run for parallelism type 'Microsoft-Hosted Private'. Current position in queue: 1
Note: This is my first job, and I am not running any additional pipelines in the same project.
Please help me to sort out this issue.
First of all, you can view in-progress jobs under Parallel jobs in Organization Settings to check whether only one job is running.
If your organization is newly created, there could be no agent pool available.
Since March, we have temporarily disabled the free grant of parallel jobs for public projects and for certain private projects in new organizations. However, you can request this grant by sending an email to azpipelines-freetier@microsoft.com.
Related release note
For more information about parallel jobs and free grants, see our documentation.
You can create your own private build server to overcome this limit.
I have two pipelines (also called "build definitions") in azure pipelines, one is executing system tests and one is executing performance tests. Both are using the same test environment. I have to make sure that the performance pipeline is not triggered when the system test pipeline is running and vice versa.
What I've tried so far: I can access the Azure DevOps REST API to check whether a build is running for a certain definition. So it would be possible for me to implement a job that runs a script before the actual pipeline: the script polls the other pipeline's build status through the REST API every second and times out after e.g. 1 hour.
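For completeness, a hedged sketch of that polling approach against the Builds REST API (organization, project, PAT, and the other pipeline's definition ID are placeholders):

```
# Wait until the other pipeline has no in-progress builds, checking once
# per second and giving up after one hour.
ORG_URL="https://dev.azure.com/<your_org>"
PROJECT="<your_project>"
for i in $(seq 1 3600); do
  count=$(curl --silent -u ":$AZURE_DEVOPS_PAT" \
    "$ORG_URL/$PROJECT/_apis/build/builds?definitions=<OTHER_DEF_ID>&statusFilter=inProgress&api-version=6.0" \
    | jq '.count')
  [ "$count" -eq 0 ] && exit 0
  sleep 1
done
echo "timed out waiting for the other pipeline to finish" >&2
exit 1
```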
However, this seems quite hacky to me. Is there a better way to block a build pipeline while another one is running?
If your project is private, the Microsoft-hosted CI/CD parallel job limit is one free parallel job that can run for up to 60 minutes each time, until you've used 1,800 minutes (30 hours) per month.
The self-hosted CI/CD parallel job limit is one self-hosted parallel job. Additionally, for each active Visual Studio Enterprise subscriber who is a member of your organization, you get one additional self-hosted parallel job.
For now, there isn't such a setting to control the parallel job limit per agent pool. However, there is a similar problem discussed in the community, and an answer has been marked there. I recommend checking whether that answer is helpful for you. Here is the link.