GitLab has a complete metrics dashboard running on a Grafana instance.
If I head to GitLab Omnibus - Overview within this dashboard, I can see a Workhorse Latency panel, and if I hover the mouse over it, it shows me the count of requests in that specific bucket. So if I understand correctly, there are three requests in the 10s - 30s bucket in the image below.
However, when I inspect the data, I see these numbers in image 2, which do not make sense to me. What do these figures mean? How do I make sense of them?
On Prometheus I see the same numbers from the query:
sum by (le)(rate(gitlab_workhorse_http_request_duration_seconds_bucket{instance=~"localhost:9229"}[1m]))
How can I add up those numbers to arrive at what I'm seeing in the panel?
The GitLab Prometheus metrics do indeed include:
gitlab_rails_queue_duration_seconds Measures latency between GitLab Workhorse forwarding a request to Rails
gitlab_transaction_rails_queue_duration_total Measures latency between GitLab Workhorse forwarding a request to Rails
I suppose the first graph displays the number of requests that completed within a certain latency limit, as reported by all the occurrences listed in the second image.
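One detail that may explain the confusing raw numbers: Prometheus histogram _bucket series are cumulative, so the series with le="30" counts every request that took 30 seconds or less, not just the ones between 10s and 30s. To reproduce the panel's per-bucket count you would subtract adjacent buckets. A sketch, assuming the histogram really exposes bucket boundaries of 10 and 30 seconds (check the exact le values your instance reports) and using [1m] only as an example window:
sum(increase(gitlab_workhorse_http_request_duration_seconds_bucket{instance=~"localhost:9229", le="30"}[1m]))
-
sum(increase(gitlab_workhorse_http_request_duration_seconds_bucket{instance=~"localhost:9229", le="10"}[1m]))
The result should match the count Grafana shows for that bucket, since the heatmap panel performs the same cumulative-to-per-bucket conversion.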
CI environments like GitLab (self-hosted and cloud), GitHub, Codecov, Codacy, ... collect statistics over time, so a developer or team lead can see the evolution of the project:
number of merge requests, commits, contributors, ...
number of passed/failed tests
code coverage
used runtime for e.g. unit tests on a server
...
Unfortunately, these statistics are decentralized (multiple cloud services are needed), specific to the services that offer them, and not general purpose.
I'm seeking a solution to collect data points over time per repository or repository group. My background is hardware development with e.g. FPGAs and also embedded software, so typical metrics would be:
used hardware resources like gates, memory, multiplier units, ...
timing errors (how many wires do not meet timing constraints)
achievable (max) frequency
number of critical errors, warnings and info messages
Other more software-like parameters could be:
performance / per test-case runtime
executable size
All these metrics are essential to detect improvements / optimizations over time, or to notice degradation before a hardware design stops working (gets unreliable).
What I know so far:
Such data is ideally stored in a time series database, with either an unlimited time span (if you want to compare even years back to when the project started) or a limited one, e.g. the last 12 months.
Prometheus is used widely in cloud and network setups e.g. to collect CPU/RAM usage, network traffic, temperatures and other arbitrary data points over time.
Prometheus is part of a self-hosted GitLab installation.
Visualization can be done via Grafana.
Users can define new diagrams and panels.
Grafana is part of a self-hosted GitLab installation.
What's missing from my point of view - and here I'm seeking help or ideas:
How to connect new time series in Prometheus with a Git repository?
How to define access rights based on who can access a Git repository?
How to add new views to Grafana if a repository pushes such statistics?
How to get rid of old data if the repository gets deleted?
Ideally, all of this would be configured with a YAML file in the repository itself.
...
Of course I could set it up by hand if it were just a single repository pushing data points, but I have > 100 repositories and currently 1-3 are added per week.
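For illustration only, a minimal sketch of what such a per-repository push could look like, assuming you run a Prometheus Pushgateway reachable from the CI runners (a Pushgateway is not part of the GitLab Omnibus bundle); the host, metric name and value below are placeholders, and CI_PROJECT_PATH is GitLab CI's predefined variable for the repository path:
# hypothetical CI job step: push one gauge sample, grouped by repository
# (slashes in the project path are replaced because Pushgateway grouping-key
#  values appear as URL path segments)
REPO="${CI_PROJECT_PATH//\//_}"
cat <<EOF | curl --data-binary @- "http://pushgateway.example.com:9091/metrics/job/hw_build/repo/${REPO}"
# TYPE fpga_lut_usage gauge
fpga_lut_usage 12345
EOF
That only covers getting a data point tied to a repository, though; the access-rights, dashboard and cleanup questions above remain open.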
Is such a service / add-on already available?
(I tried asking this at DevOps, but it got only 10 views due to the low activity there.)
You could use AWS CloudWatch custom metrics and then build a CloudWatch dashboard to look at the metrics.
aws cloudwatch put-metric-data --metric-name Buffers --namespace MyNameSpace --unit Bytes --value 231434333 --dimensions InstanceId=1-23456789,InstanceType=m1.small
aws cloudwatch get-metric-statistics --metric-name Buffers --namespace MyNameSpace --dimensions Name=InstanceId,Value=1-23456789 Name=InstanceType,Value=m1.small --start-time 2016-10-15T04:00:00Z --end-time 2016-10-19T07:00:00Z --statistics Average --period 60
Is there a built-in way to get an overview of how long jobs, by tag, spend in the queue, to check whether the runners are over- or undercommitted? I checked the Admin Area but did not find anything; have I overlooked something?
If not, are there any existing solutions you are aware of? I tried searching, but my keywords are too broad, so the results are equally broad, and I have found nothing yet.
Edit: I see the jobs REST API can return all of a runner's jobs and includes created_at, started_at and finished_at; maybe I'll have to analyze that.
One way to do this is to use the runners API to list the status of all your active runners. So, at any given moment in time, you can assess how many active jobs you have running by listing each runner's 'running' jobs (use the ?status=running filter).
So, from the above, you should be able to arrive at the number of currently running jobs. You can compare that against your maximum capacity of jobs (the sum of the configured job limits for all your runners).
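As a rough sketch (the host, token and runner id are placeholders; the endpoints used are GET /runners/all and GET /runners/:id/jobs from the GitLab REST API):
# list all runners visible to an admin token
curl --header "PRIVATE-TOKEN: <your_access_token>" "https://gitlab.example.com/api/v4/runners/all"
# list the jobs a given runner (id 42 here) is currently executing
curl --header "PRIVATE-TOKEN: <your_access_token>" "https://gitlab.example.com/api/v4/runners/42/jobs?status=running"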
However, if you are at full capacity, this won't tell you how much over capacity you are or how big the job backlog is. For that, you can look at the pending queue under /admin/jobs. I'm not sure if there's an official API endpoint to list all the pending jobs, however.
Another way to do this is through GitLab's Prometheus metrics. Go to the /-/metrics endpoint for your GitLab server (the full URL can be found at /admin/health_check). These are the Prometheus metrics exposed by GitLab for monitoring. Among these metrics there is one called job_queue_duration_seconds, which you can use to query the queue duration for jobs -- that is, how long jobs are queued before they begin running. You can even break it down project by project, or by whether the runners are shared or not.
To get an average of this time per minute, I've used the following Prometheus query:
sum(rate(job_queue_duration_seconds_sum{jobs_running_for_project=~".*",shard="default",shared_runner="true"}[1m]))
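The query above gives the rate at which queue time accumulates across all jobs. If you want a per-job average or percentiles instead, sketches along these lines might help (assuming the metric is exported as a histogram, i.e. with matching _count and _bucket series; adjust the label filters to match your runners):
sum(rate(job_queue_duration_seconds_sum{shared_runner="true"}[5m])) / sum(rate(job_queue_duration_seconds_count{shared_runner="true"}[5m]))
and, for the 95th percentile queue time:
histogram_quantile(0.95, sum by (le) (rate(job_queue_duration_seconds_bucket{shared_runner="true"}[5m])))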
Although the feature is currently deprecated, you could even create this metric in the built-in metrics monitoring feature, using localhost:9000 (the GitLab server itself) as the configured Prometheus server. With the above query, you would see a chart like so:
Of course, any tool that can view Prometheus metrics will work (e.g., Grafana, or your own Prometheus server).
GitLab 15.6 (November 2022) starts implementing your request:
Admin Area Runners - job queued and duration times
When GitLab administrators get reports from their development team that a CI job is either waiting for a runner to become available or is slower than expected, one of the first areas they investigate is runner availability and queue times for CI jobs.
While there are various methods to retrieve this data from GitLab, those options could be more efficient.
They should provide what users need - a view that makes it more evident if there is a bottleneck on a specific runner.
The first iteration of solving this problem is now available in the GitLab UI.
GitLab administrators can now use the runner details view in Admin Area > Runners to view the queue time for CI job and job execution duration metrics.
See Documentation and Issue.
I want to emit metrics from my Node application to monitor how frequently a certain branch of code is reached. For example, I am interested in knowing how many times a service call didn't return the expected response. I also want to be able to emit, for each service call, the time it took, etc.
I expect to use a client in the code that emits metrics to a server, and then to be able to view the metrics in a dashboard on that server. I am more interested in open source solutions that I can host on my own infrastructure.
Please note, I am not interested in system metrics here such as CPU, memory usage etc.
Implement pervasive logging and then use something like Elasticsearch + Kibana to display them in a dashboard.
There are other metric dashboard systems such as Grafana, Graphite, Tableau, etc. A lot of them work with metrics, which are numbers associated with tags, such as counts of function calls, CPU load, etc. The main reason I like the Kibana solution is that it is not based on metrics but instead extracts metrics from your log files.
The only thing you really need to do with your code is make sure your logs are timestamped.
Google for Kibana or "ELK stack" (ELK stands for Elasticsearch + Logstash + Kibana) for how to set this up. The first time I set it up took me just a few hours to get results.
Node has several loggers that can be configured to send log events to ELK. In addition, the Logstash (or the more modern "Beats") part of ELK can ingest any log file, parse it with regexes, and forward the data to Elasticsearch, so you do not need to modify your software.
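To make the idea concrete, here is a minimal sketch of the kind of timestamped document that ends up in Elasticsearch (a logger or Beats shipper would normally do this for you; the host, index name and fields are placeholders):
# index one timestamped log event directly into Elasticsearch
curl -X POST "http://localhost:9200/app-logs/_doc" \
  -H 'Content-Type: application/json' \
  -d '{"@timestamp": "2024-01-01T12:00:00Z", "level": "error", "service": "payments", "message": "unexpected response from upstream", "duration_ms": 812}'
Kibana then lets you build counts and averages (e.g. of duration_ms) over those documents.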
The ELK solution can be configured simply or you can spend literally weeks tuning your data parsing and graphs to get more insights - it is very flexible and how you use it is up to you.
Metrics vs Logs (opinion):
What you want is, of course, the metrics. But metrics alone don't say much. What you are ultimately after is being able to analyse your system for debugging and optimisation. This is where logging has an advantage.
With a solution that extracts metrics from logs like Kibana you have another layer to deep-dive into behind the metrics. You can query it to find what events caused the metrics. This is not easy to do on a running system because you would normally have to simulate inputs to your system to get similar metrics to figure out what is happening. But with Kibana you can analyse historical events that already happened instead!
Here's an old screenshot of a Kibana set-up I did a few years back to monitor a web service (including all emails it receives):
Note the screenshot above - apart from the graphs and metrics I extract from my system, I also display parsed logs at the bottom of the dashboard, so I get a near real-time view of what is happening. This is the email-received dashboard, which we used to monitor things like subscriptions, complaints, click-through rates, etc.
I have an Azure VM running (Windows) where the Scheduler regularly calls a VBS script to load a small data set, retrieved from a web site API, into a SQL database table. Now, when I look at the Network In and Network Out chart on my Azure Portal dashboard, there seems to be ridiculously high traffic going on, like GBs of data flowing in and out for no obvious reason. My VBS only loads small KB amounts per day - where is all that traffic (see the Azure dashboard screenshot) coming from?
From your screenshot, you have set the time range to Last 30 days in your metrics chart. If you take the mouse off the graph, the total bytes show at the bottom. This reflects the total incoming or outgoing network traffic received on all the interfaces of the machine over that period. You could set the time range to one day.
Generally, we use the Network In or Network Out metric to monitor network performance on the VM. Refer to the network metrics in the article.
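If it helps to cross-check the portal chart, a sketch using the Azure CLI (the subscription, resource group and VM name are placeholders; the exact metric names available can vary by VM):
# pull the same per-day network totals for the last 30 days
az monitor metrics list \
  --resource "/subscriptions/<sub-id>/resourceGroups/<rg>/providers/Microsoft.Compute/virtualMachines/<vm-name>" \
  --metric "Network In Total" "Network Out Total" \
  --aggregation Total \
  --interval 24h \
  --offset 30d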
My conclusion / answer is now that the LRS is responsible for the traffic. Thanks also for your help!
I need to do a performance test where I have to send millions of URL requests to the server over a period of time and capture all their responses. From the responses, I need to calculate the average response times and standard deviation (this can be done using a spreadsheet, but when it comes to millions of URLs it is cumbersome). What is the best possible way to test this performance scenario? Any help would be greatly appreciated.
My environment is as below:
NLBs to route the requests to resolvers.
Linux servers as our core resolvers.
Windows machines as clients; the requests are generated by these machines.
I believe the fastest and easiest way would be to use a load testing tool, e.g. Apache JMeter, which is free, open source and doesn't require any extra knowledge.
URLs can be defined using the CSV Data Set Config, or JMeter can be configured to act like a "crawler" via the HTML Link Parser.
Once your test is finished, you can visualize the results using e.g. the Summary Report listener, which shows average response time, standard deviation and some other "interesting" metrics:
JMeter can be run in Distributed Mode, so you can run a JMeter instance per Windows machine and have the results aggregated on the master node.
Check out the JMeter Academy to get ramped up on the tool in just a few hours (if not minutes).
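For reference, a sketch of a non-GUI run (the test plan name and remote host addresses are placeholders); the .jtl results file can afterwards be loaded into a Summary Report listener to read off the average and standard deviation:
# single-machine, non-GUI run: -n no GUI, -t test plan, -l results file,
# -e/-o generate the HTML dashboard into the given folder
jmeter -n -t resolver_test.jmx -l results.jtl -e -o report/
# distributed run: -R lists the remote machines, each running jmeter-server
jmeter -n -t resolver_test.jmx -R 10.0.0.11,10.0.0.12 -l results.jtl -e -o report/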