How can I make group runners I own pick up jobs faster?

I set up GitLab to use runners configured on my host. They pick up jobs which are in a specific group.
This works fine, except that the runners take some time to pick the jobs up. The first job is usually picked up quickly, but the second (sequential) one takes noticeably longer.
This delay is acceptable, but I would like to understand whether it is due to GitLab processing requests (in which case I will not be able to fine-tune it), or whether it is something I can set on the machine hosting the runners.

The time is configurable on the runner using the check_interval parameter, but the default is 3 seconds, so that shouldn't be an issue. Maybe you need to adjust the concurrent parameter so more jobs can run in parallel?
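For reference, both settings live in the runner's config.toml. A minimal sketch, assuming a typical Linux install (the file path, runner name, and values below are illustrative, not taken from the question):

# /etc/gitlab-runner/config.toml
concurrent = 4        # maximum number of jobs the runner process will run at once, across all [[runners]]
check_interval = 3    # seconds between polls for new jobs

[[runners]]
  name = "my-group-runner"   # hypothetical runner name
  limit = 2                  # optional per-runner cap on concurrent jobs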

Related

GitLab analyze job queue waiting time

Is there a built-in way to get an overview of how long jobs, by tag, spend in the queue, to check whether the runners are over- or undercommitted? I checked the Admin Area but did not find anything; have I overlooked something?
If not, are there any existing solutions you are aware of? I tried searching, but my keywords are too broad, so the results are equally broad, and I have found nothing yet.
Edit: I see the jobs REST API can return all of a runner's jobs and includes created_at, started_at and finished_at, so maybe I'll have to analyze that.
One way to do this is to use the runners API to list the status of all your active runners. So, at any given moment, you can assess how many jobs you have running by listing each runner's running jobs (use the ?status=running filter).
So, from the above, you should be able to arrive at the number of currently running jobs. You can compare that against your maximum capacity of jobs (the sum of the configured job limits for all your runners).
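As a rough sketch of that approach in Python (the instance URL and token are placeholders, and you may need to handle pagination if you have many runners or jobs):

# Count currently running jobs across the runners visible to your token.
import requests

GITLAB_URL = "https://gitlab.example.com/api/v4"   # hypothetical instance
HEADERS = {"PRIVATE-TOKEN": "YOUR_TOKEN"}          # token needs API read access

runners = requests.get(f"{GITLAB_URL}/runners", headers=HEADERS).json()

running_jobs = 0
for runner in runners:
    jobs = requests.get(
        f"{GITLAB_URL}/runners/{runner['id']}/jobs",
        headers=HEADERS,
        params={"status": "running"},
    ).json()
    running_jobs += len(jobs)

print(f"Currently running jobs: {running_jobs}")
# Compare this number against the sum of the configured job limits for your runners.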
However, if you are at full capacity, this won't tell you how much over capacity you are or how big the job backlog is. For that, you can look at the pending queue under /admin/jobs. I'm not sure if there's an official API endpoint to list all the pending jobs, however.
Another way to do this is through GitLab's Prometheus metrics. Go to the /-/metrics endpoint for your GitLab server (the full URL can be found at /admin/health_check). These are the Prometheus metrics exposed by GitLab for monitoring. Among them is a metric called job_queue_duration_seconds that you can use to query the queue duration for jobs, that is, how long jobs are queued before they begin running. You can even break it down project by project, or by whether the runners are shared or not.
To get an average of this time per minute, I've used the following prometheus query:
sum(rate(job_queue_duration_seconds_sum{jobs_running_for_project=~".*",shard="default",shared_runner="true"}[1m]))
Although the feature is currently deprecated, you could even chart this metric with the built-in metrics monitoring feature, using localhost:9000 (the GitLab server itself) as the configured Prometheus server and the query above; the result is a chart of the queue duration over time.
Of course, any tool that can view Prometheus metrics will work (e.g., Grafana, or your own Prometheus server).
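As a side note, if you want the average queue time per job rather than a rate of total queued seconds, the usual PromQL pattern is to divide the _sum rate by the _count rate. This assumes the metric is exported as a histogram, so a matching job_queue_duration_seconds_count series exists:
sum(rate(job_queue_duration_seconds_sum{shared_runner="true"}[5m])) / sum(rate(job_queue_duration_seconds_count{shared_runner="true"}[5m]))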
GitLab 15.6 (November 2022) starts implementing your request:
Admin Area Runners - job queued and duration times
When GitLab administrators get reports from their development team that a CI job is either waiting for a runner to become available or is slower than expected, one of the first areas they investigate is runner availability and queue times for CI jobs.
While there are various methods to retrieve this data from GitLab, those options could be more efficient.
They should provide what users need - a view that makes it more evident if there is a bottleneck on a specific runner.
The first iteration of solving this problem is now available in the GitLab UI.
GitLab administrators can now use the runner details view in Admin Area > Runners to view the queued-time and execution-duration metrics for CI jobs.
See Documentation and Issue.

Preferred way to schedule a job: @Scheduled vs crontab

I have to run a utility periodically, say every minute.
So I have two options: Spring Boot's @Scheduled versus crontab on the Linux box we use to deploy the artifact.
So, my question is: which way should I use?
What are the pros and cons of each solution, and is there any other solution you can suggest?
Comparing the two, I don't have many points beyond the situation I'm facing right now. I just built a new endpoint and am doing performance and stress testing for it on production. I have yet to decide the cron schedule times, and they may need slight tweaking after some more observation. Setting the schedule via @Scheduled requires me to redeploy and restart the application every time I make a change.
An application restart generally takes more time than a crontab edit.
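For illustration, a per-minute crontab entry is a one-line change that takes effect without a redeploy (the script path and log file here are hypothetical):

# crontab -e on the box where the artifact is deployed; runs the utility every minute
* * * * * /opt/myapp/bin/run-utility.sh >> /var/log/run-utility.log 2>&1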
Other than this, a few points on availability and scalability:
Setting the job only via crontab on a single server means a single point of failure if that server goes down.
Setting it via @Scheduled can mean the same.
If you have multiple instances of the server, the endpoint could get triggered twice, which you may not want. The worst case is when scaling up happens a long time after you wrote the @Scheduled endpoint: it was deployed on a single server, you forgot about it, and as soon as you scale up, the process starts getting hit twice.
So, none of these seem to be the best in terms of points of availability and scalability.
In such situations, you ideally need a distributed cron management system (I have heard of Rundeck), which decides which of the available servers should be called to hit the desired endpoint, and calls the next server if the first one is down.
If any investigation is needed, Rundeck's logs can be checked to find out which server was actually called.

Is there any performance overhead when 'RunInterval' on a Puppet agent is set to 0 (run continuously)?

By default, a Puppet agent polls the Puppet master every 30 minutes for configuration changes. So there is always a lag of up to 30 minutes between a configuration change on the master and its application on the relevant agents.
I want the changes to be applied to agents in near real time (roughly within a minute). For that, I want to set 'RunInterval' to 0 on the agents, so that changes are applied in near real time.
I want to understand whether there is any performance overhead associated with setting 'RunInterval' to 0 (run continuously). How does the agent function when it is set to run continuously? Does it use some sort of long polling? Is it recommended/advisable to override the default and set 'RunInterval' to 0?
Yes, there is a good amount of overhead.
There is overhead at the master, which must handle many more requests per unit time -- perhaps as many as 200 times as many requests, depending on how long catalog runs take at the agents. For each request, it must sync plugins with the agent, compile and return a catalog, and possibly serve files, none of which are trivial.
There is also overhead at the agent. For each catalog run, it must at minimum go through each declared resource and test whether that resource is in the specified target state. Doing so is non-trivial even when no changes are required.
Your strategy is more likely to fall over because of the greatly-increased demands it will place on your master than because of the extra load on the clients, but your clients will definitely feel it if they're already carrying a heavy load.
If you want the ability to occasionally trigger specific servers to sync immediately, then consider looking into mcollective.
If you want the ability to routinely trigger many servers to sync immediately, then consider switching to masterless mode, combined with mcollective or some other kind of group remote-control software.
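For completeness, the interval itself is set in the [agent] section of puppet.conf on each node. A minimal sketch, assuming a standard open source Puppet layout (check your version's documentation for the exact semantics of 0 before relying on it):

# /etc/puppetlabs/puppet/puppet.conf on the agent
[agent]
  runinterval = 0      # 0 means run continuously; the default is 30m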
The Puppet master compiles your Puppet manifests into a catalog for each client to use. If all clients are querying the Puppet master at the same time, it will likely struggle under the load, assuming you have a non-trivial Puppet deployment with several hundred or several thousand Puppet-managed servers.
What should matter to you is eventual consistency - the certainty that given enough time, the clients will converge to the desired configuration.

Simultaneous Lotus Notes server-side agents

In my Lotus Notes workflow application, I have a scheduled server agent (every five minutes). When users act on a document, a server-side agent is also triggered (this agent modifies the said document on the server side). In production, we are receiving many complaints that the processing is incomplete or sometimes does not happen at all. I checked the server configuration and found out that only 4 agents can run concurrently. Since this is a global application with over 50,000 users, the only thing I can blame for these issues is the volume of agent runs, but I'm not sure if I'm correct (I'm a developer and lack knowledge about this area). Can someone help me determine whether my reasoning about simultaneous agents is correct, and help me understand how I can solve this? Can you provide references, please? Thank you in advance!
An important thing to remember:
The server will only run one scheduled agent from the same database at any given time!
So if you have database A with agent X (every 5 minutes) and agent Y (every 10 minutes), it will first run X. Once X completes, whichever is scheduled next (X or Y) runs. It will never run X and Y at the same time if they are in the same database.
This is intended behaviour to stop possible deadlocks within the database agents.
You also have a schedule queue, which has a limit on the number of agents that can be queued up. For example, if agent X runs every 5 minutes but takes 10 minutes to complete, your schedule queue will slowly fill up and eventually run out of space.
So how do you work around this? There are a couple of ways.
Option 1: Use Program Documents on the server.
Set the agent's schedule to "Never" and have a program document execute the agent with the command:
tell amgr run "dir/database.nsf" 'agentName'
PRO:
You will be able to run agents on a <5-minute schedule.
You can run multiple agents in the same database.
CON:
You have to be aware of what the agent is interacting with, and code for it to handle other agents or itself running at the same time.
There can be serious performance implications in doing this. You need to be aware of what is going on in the server and how it would impact it.
If you have lots of databases, you end up with a messy program document list that is hard to maintain.
Agents via "Tell AMGR" will not be terminated if they exceed the agent execution time allowed on the server. They have to be manually killed.
There is no easy way to determine which agents are running or have run.
Option 2: Create an agent which calls out to web agents.
PRO:
You will be able to run agents on a <5-minute schedule.
You can run multiple agents in the same database.
You have slightly better control of what runs via another agent.
CON:
You need HTTP running on the server.
There are performance implications in doing this, and again you need to be aware of how it will interact with the system if multiple instances of it, or other agents, run at the same time.
Agents will not be terminated if they exceed the agent execution time allowed on the server.
You will need to allow concurrent web agents/web services on the server or you can potentially hang the server.
Option 3: Change from scheduled to another trigger.
For example, "When new mail arrives". Overall, this is the best option of the three.
...
In closing, I would say that you should rarely use an "execute every 5 minutes" schedule if you can avoid it, unless it is a critical agent that isn't going to be executed by multiple users across different databases.

Crons for Clusters

Just a quick question that has been bothering me today. I own five servers that all have the exact same image and run behind a load balancer. I want to run a process-heavy cron job on these servers every half hour.
I don't want to put the cron job on every machine, as it is resource-heavy and would block all incoming connections for a good thirty seconds. At the same time, I don't really want to put it on just one machine, because I want it to be redundant and guaranteed to run.
My possible solution would be a remote service that runs the job by accessing a URL that triggers it; I think that would be the most feasible approach at this point.
I'm really curious as to what other solutions might be available.
Thanks for your time!
You could set up staggered cron jobs on your 5 machines, so it runs every 2.5 hours on each of your 5 machines. Probably the cleanest way to do that is to schedule a job to run every 30 minutes, and have the job itself be a script that runs conditionally, depending on the current time and which machine it's on.
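A minimal sketch of that conditional wrapper in Python, assuming each machine is numbered 0 to 4 (the MACHINE_INDEX environment variable is an assumption; set it per host however you prefer) and cron invokes the script every 30 minutes on every box:

#!/usr/bin/env python3
# Run the heavy task only when it is "this machine's turn" for the current half-hour slot.
import os
import sys
from datetime import datetime, timezone

MACHINE_INDEX = int(os.environ.get("MACHINE_INDEX", "0"))   # 0..4, set per host
NUM_MACHINES = 5

now = datetime.now(timezone.utc)
slot = (now.hour * 60 + now.minute) // 30    # which half-hour slot of the day this is

if slot % NUM_MACHINES != MACHINE_INDEX:
    sys.exit(0)   # not our turn; another machine covers this slot

# ... the resource-heavy work goes here ...
print(f"Machine {MACHINE_INDEX} handling slot {slot}")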
Or, if you have some kind of batch scheduling system, you could run a cron job on one system that submits a batch job, letting the scheduling system choose which server to use. This has the advantage that, assuming your batch system works properly, the job should still run if one of your servers is down. You'll likely need to set up some environment variables in your cron job to let it use the batch system properly.
