Grouping servers in grafana/prometheus - percona

I would like to group database servers in Grafana dashboards, e.g. servers belonging to the same cluster, db-pxc, would end up looking like this:
DB-PXC
  - Disk_Performance
    - db-pxc-1
    - db-pxc-2
    - db-pxc-3
    ...
  - Disk_Space
    - db-pxc-1
    - db-pxc-2
    - db-pxc-3
    ...
  - MySQL_Overview
    - db-pxc-1
    - db-pxc-2
    - db-pxc-3
    ...
  - MySQL_Table_statistics
    - db-pxc-1
    - db-pxc-2
    - db-pxc-3
    ...
  ...
So if I click on the parent dashboard Disk_Space, it displays the disk-space sub-dashboard for each host in the db-pxc cluster (db-pxc-1, db-pxc-2, db-pxc-3, ...). That way I can compare the disk space usage of all the servers in one cluster on a single page. We already have this set up in Cacti, but I'm not sure how we can achieve the same with Grafana.
We are using the Prometheus monitoring system, node_exporter and mysqld_exporter for collecting statistics on each individual server, and Grafana for viewing the dashboards. To view the data from the mysqld and node exporters in Grafana, we are using the Percona Grafana plugin.
Below is an example of what I am asking for. In the picture below, the db cluster name is kdb, with db-kdb-1, db-kdb-2, db-kdb-3 and db-kdb-4 being part of the nodes that form the cluster. So, as seen below, when I click on CPU, it shows the CPU usage of all my kdb cluster nodes.

For, say, percentage of root filesystem used, you'd have one graph with an expression like:
100 - node_filesystem_free{job='node',mountpoint='/'} / node_filesystem_size{job='node',mountpoint='/'} * 100
which would show the result for all matching machines.

You need to create a Prometheus target with the IPs of all the instances in your cluster and use its job name in Grafana.
Create the following target file with the IPs of your cluster's instances:
- targets:
    - '10.149.121.21:9100'
    - '10.149.121.22:9100'
    - '10.149.121.23:9100'
    - '10.149.121.24:9100'
  labels:
    job: kdbcluster
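For completeness, here is a minimal sketch of how such a target file could be wired into prometheus.yml via file-based service discovery; the path targets/kdbcluster.yml is an assumption, adjust it to wherever you save the file above:
scrape_configs:
  - job_name: 'kdbcluster'
    file_sd_configs:
      - files:
          - 'targets/kdbcluster.yml'  # the target file shown above (path is an assumption)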
Then, in Grafana, you create 4 new graphs with the following respective queries:
100 - (avg by (instance) (irate(node_cpu{instance="10.149.121.21:9100",mode="idle", job="kdbcluster"}[5m])) * 100)
100 - (avg by (instance) (irate(node_cpu{instance="10.149.121.22:9100",mode="idle", job="kdbcluster"}[5m])) * 100)
100 - (avg by (instance) (irate(node_cpu{instance="10.149.121.23:9100",mode="idle", job="kdbcluster"}[5m])) * 100)
100 - (avg by (instance) (irate(node_cpu{instance="10.149.121.24:9100",mode="idle", job="kdbcluster"}[5m])) * 100)
If you want to have all graphs on the same one, you can use that query:
100 - (avg by (instance) (irate(node_cpu{mode="idle", job="kdbcluster"}[5m])) * 100)
If you want to add to the previous graph a line which is the average of all the instances CPU load, you can use this query:
100 - (avg (irate(node_cpu{mode="idle", job="kdbcluster"}[5m])) * 100)
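If you'd rather not hard-code the instance IPs, one option (a sketch on my part, not part of the original answer) is a Grafana template variable populated from Prometheus, plus a panel query that filters on it:
# Template variable query (Prometheus data source):
label_values(node_cpu{job="kdbcluster"}, instance)
# Panel query using the (multi-value) variable:
100 - (avg by (instance) (irate(node_cpu{instance=~"$instance", mode="idle", job="kdbcluster"}[5m])) * 100)
With panel repeating enabled on that variable, Grafana draws one CPU graph per node of the cluster, which gives the per-cluster grouping described in the question.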

Related

Database design issue for Multi-tenant application

We have an application that does a lot of data-heavy work on the server for a multi-tenant workspace.
Here are the things that it does:
It loads data from files in different file formats.
Executes idempotence rules based on the logic defined.
Executes processing logic like adding discounts based on country for users, calculating tax amounts, etc. These are specific to each tenant.
Generates refreshed data for bulk edit.
Now, after this processing is done, the tenant will go to the interface, do some bulk-edit overrides on users, and finally download them in some format.
We have tried a lot of solutions before, like:
Doing it in one SQL database where each tenant is separated by tenant id.
Doing it in Azure Blobs.
Loading it from file-system files.
But none of these performed well. So what is presently designed is:
We have a central database which keeps track of all the customers' databases.
We have a number of database elastic pools in Azure.
When a new tenant comes in, we create a database, do all the processing for the users and notify the user to do the manual job.
When they have downloaded all the data, we keep the database for future use.
Now, as you know, elastic pools have a limit on the number of databases, which led us to create multiple elastic pools, eventually increasing the Azure cost immensely, while 90% of the databases are not in use at a given point in time. We already have more than 10 elastic pools, each consisting of 500 databases.
Proposed Changes:
As we are gradually incurring more and more cost on our Azure account, we are thinking about how to reduce it.
What I was proposing is:
We create one elastic pool, which has a 500-database limit with enough DTU.
In this pool, we will create blank databases.
When a customer comes in, the data is loaded into one of the blank databases.
It does all the calculations and notifies the tenant for the manual job.
When the manual job is done, we keep the database for the next 7 days.
After 7 days, we back up the database to Azure Blob and run a cleanup job on the database.
Finally, if the same customer comes in again, we restore the backup onto a blank database and continue. (This step might take 15-20 mins to set up, but that is fine for us... though if we can reduce it, even better.)
What do you think is best suited for this kind of problem?
Our objective is to reduce the Azure cost while also providing the best solution to our customers. Please advise on any architecture that you think would be best suited in this scenario.
Each customer can have millions of records... we even see customers with 50-100 GB databases... and with different workloads for each tenant.
Here is where the problem starts:
"[...] When they have downloaded all the data, we keep the database for future use."
This is very wrong because it leads to:
"[...] increasing the Azure cost immensely, while 90% of the databases are not in use at a given point in time. We already have more than 10 elastic pools, each consisting of 500 databases."
This is not only a cost problem but also a security-compliance problem:
How long should you store that data?
Does that data comply with the relevant country's policies?
Here are my 2 solutions:
It goes without saying that if you don't need the data, you should simply delete those databases. You will lower your costs immediately.
If you cannot delete them, then since they are mostly not in use, switch from an Elastic Pool to serverless.
EDIT:
Serverless Azure SQL databases get expensive only when you actually use them.
If they are unused they will cost almost nothing (an auto-paused serverless database is billed only for storage). But "unused" means no connections to them: if you have some internal tool that wakes them up every hour, they will never auto-pause, so you will pay a lot.
HOW TO TEST SERVERLESS:
Take a database that you know is unused and put it in the serverless tier for 1 week; you will see how the cost of that database drops in Cost Management. And of course, take it out of the Elastic Pool.
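As a minimal sketch of that switch (the database name is hypothetical, and GP_S_Gen5_1 is just one example serverless service objective; pick the size that fits your workload):
-- Move the database out of the pool and onto a serverless service objective.
-- GP_S_Gen5_1 = General Purpose, Serverless, Gen5 hardware, 1 max vcore.
ALTER DATABASE [MyTenantDb] MODIFY (SERVICE_OBJECTIVE = 'GP_S_Gen5_1');
-- The auto-pause delay is configured via the Azure portal, CLI or ARM, not via T-SQL.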
You can run this query on the master database:
DECLARE @StartDate date = DATEADD(day, -30, GETDATE()) -- last 30 days (sys.resource_stats itself retains only about 14 days)
SELECT
@@SERVERNAME AS ServerName
,database_name AS DatabaseName
,sysso.edition
,sysso.service_objective
,(SELECT TOP 1 dtu_limit FROM sys.resource_stats AS rs3 WHERE rs3.database_name = rs1.database_name ORDER BY rs3.start_time DESC) AS DTU
/*,(SELECT TOP 1 storage_in_megabytes FROM sys.resource_stats AS rs2 WHERE rs2.database_name = rs1.database_name ORDER BY rs2.start_time DESC) AS StorageMB */
/*,(SELECT TOP 1 allocated_storage_in_megabytes FROM sys.resource_stats AS rs4 WHERE rs4.database_name = rs1.database_name ORDER BY rs4.start_time DESC) AS Allocated_StorageMB*/
,avcon.AVG_Connections_per_Hour
,CAST(MAX(storage_in_megabytes) / 1024 AS DECIMAL(10, 2)) StorageGB
,CAST(MAX(allocated_storage_in_megabytes) / 1024 AS DECIMAL(10, 2)) Allocated_StorageGB
,MIN(end_time) AS StartTime
,MAX(end_time) AS EndTime
,CAST(AVG(avg_cpu_percent) AS decimal(4,2)) AS Avg_CPU
,MAX(avg_cpu_percent) AS Max_CPU
,(COUNT(database_name) - SUM(CASE WHEN avg_cpu_percent >= 40 THEN 1 ELSE 0 END) * 1.0) / COUNT(database_name) * 100 AS [CPU Fit %]
,CAST(AVG(avg_data_io_percent) AS decimal(4,2)) AS Avg_IO
,MAX(avg_data_io_percent) AS Max_IO
,(COUNT(database_name) - SUM(CASE WHEN avg_data_io_percent >= 40 THEN 1 ELSE 0 END) * 1.0) / COUNT(database_name) * 100 AS [Data IO Fit %]
,CAST(AVG(avg_log_write_percent) AS decimal(4,2)) AS Avg_LogWrite
,MAX(avg_log_write_percent) AS Max_LogWrite
,(COUNT(database_name) - SUM(CASE WHEN avg_log_write_percent >= 40 THEN 1 ELSE 0 END) * 1.0) / COUNT(database_name) * 100 AS [Log Write Fit %]
,CAST(AVG(max_session_percent) AS decimal(4,2)) AS 'Average % of sessions'
,MAX(max_session_percent) AS 'Maximum % of sessions'
,CAST(AVG(max_worker_percent) AS decimal(4,2)) AS 'Average % of workers'
,MAX(max_worker_percent) AS 'Maximum % of workers'
FROM sys.resource_stats AS rs1
inner join sys.databases dbs on rs1.database_name = dbs.name
INNER JOIN sys.database_service_objectives sysso on sysso.database_id = dbs.database_id
inner join
(SELECT t.name
,round(avg(CAST(t.Count_Connections AS FLOAT)), 2) AS AVG_Connections_per_Hour
FROM (
SELECT name
--,database_name
--,success_count
--,start_time
,CONVERT(DATE, start_time) AS Dating
,DATEPART(HOUR, start_time) AS Houring
,sum(CASE
WHEN name = database_name
THEN success_count
ELSE 0
END) AS Count_Connections
FROM sys.database_connection_stats
CROSS JOIN sys.databases
WHERE start_time > @StartDate
AND database_id != 1
GROUP BY name
,CONVERT(DATE, start_time)
,DATEPART(HOUR, start_time)
) AS t
GROUP BY t.name) avcon on avcon.name = rs1.database_name
WHERE start_time > @StartDate
GROUP BY database_name, sysso.edition, sysso.service_objective,avcon.AVG_Connections_per_Hour
ORDER BY database_name , sysso.edition, sysso.service_objective
The query returns statistics for all the databases on the server.
AVG_Connections_per_Hour: covers the last 30 days.
All AVG and MAX statistics: cover the last 14 days (the retention period of sys.resource_stats).
Pick a provider and host the workloads there; on demand, fan out across cloud providers when needed.
This solution requires minimal data transfer.
You could perhaps denormalise the data you need and store it in ClickHouse? It's a fast column-oriented database for online analytical processing, meaning you can run queries that compute discounts on the fly, and it's very fast: millions to billions of rows per second. You query it using its own SQL dialect, which is intuitive and powerful and can be extended with Python/C++.
You can try doing it like you did before, but with ClickHouse, and opt for a distributed deployment:
"Doing it in one SQL database where each tenant is separated by tenant id"
A ClickHouse cluster can be deployed on Kubernetes using the Altinity operator; it's free and you only have to pay for the resources. Paid or managed options are also available.
ClickHouse also supports lots of integrations, which means you can stream data into it from Kafka or RabbitMQ, or load it from local files / S3 files.
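As a minimal sketch of what that could look like (table and column names are made up for illustration only):
-- Hypothetical denormalised per-tenant fact table
CREATE TABLE user_charges
(
    tenant_id  UInt32,
    user_id    UInt64,
    country    LowCardinality(String),
    amount     Decimal(18, 2),
    created_at DateTime
)
ENGINE = MergeTree
ORDER BY (tenant_id, user_id);

-- On-the-fly per-tenant aggregation, e.g. as input for discount/tax logic
SELECT country, sum(amount) AS total
FROM user_charges
WHERE tenant_id = 42
GROUP BY country;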
I've been running a test ClickHouse cluster with 150M rows and 70 columns, mostly Int64 fields. A query with 140 filters across all the columns took about 7-8 seconds under light load and 30-50 seconds under heavy load. The cluster had 5 members (2 shards, 3 replicas).
Note: I'm not affiliated with ClickHouse, I just like the database. You could also look for another OLAP alternative on Azure.

How does cAdvisor get container_cpu_load_average_10s, and what is the unit of this metric?

I'm trying to check which container drives the CPU load up in my cluster, so I deployed the cAdvisor DaemonSet on my k8s cluster with the args below.
args:
- -enable_load_reader=true
- -disable_metrics=sched,network,tcp,advtcp,udp
also .spec.hostNetwork: true
I want to know how cAdvisor calculates or reports the value of container_cpu_load_average_10s (how cAdvisor gets that value).
And if the value of container_cpu_load_average_10s is 100, does that mean the container loads 100 cores? 100 millicores? Or something else? (the unit of this metric)
When I checked the metric, the value for the cAdvisor pod is over 3000. The others are below 300, but the cAdvisor pod's values are between 4000 and 7000.
Please let me know why the cAdvisor pods' container_cpu_load_average_10s values are so high.
Thanks a lot
container_cpu_load_average_10s = loadAvg * 1000.
cAdvisor converts the load average to "milliLoad" to avoid floats and preserve precision, so the unit is neither cores nor millicores of CPU time: a value of 3000 corresponds to a 10-second load average of roughly 3 runnable tasks in the container's cgroup.
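So, if you want to read it back as a plain load-average number in Grafana/PromQL, something like this should do (just the metric divided back down; label selectors omitted for brevity):
container_cpu_load_average_10s / 1000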

Prometheus: CPU process time total to percent (%)

We started using Prometheus and Grafana as the main tools for monitoring our Service Fabric cluster. As a Prometheus target we use wmi_exporter with its predefined collectors: CPU, system, process, service, memory, etc. Our main goal was to start monitoring our product services on each instance of the node group in Azure Service Fabric.
For instance, we are using this PromQL query to calculate total CPU usage in %:
100 - (avg by (hostname) (irate(wmi_cpu_time_total{scaleset="name",mode="idle" }[5m])) * 100) and the metrics look more or less realistic.
That was until we started writing queries for services.
For services we use sum by (process,hostname)(irate(wmi_process_cpu_time_total{scaleset="name", process=~"processes"}[5m])) * 100, and the metrics seem unrealistic from time to time; it is especially obvious when you compare them with the total CPU time %. I found an article about multiplying by 100 to get % from CPU time, but in this case I get metrics around 170% or more. Perhaps I need to divide by the number of CPU cores?
Regarding the query, I'm using sum over the process because I get two different metrics for one process ID, in two modes: user and privileged.
Can anyone please help me with the correct calculation for the CPU process time total metric and transforming it to a percentage?
Thank you, I would be grateful for any help!
I hope this will help!
The result is pretty much the same as the Windows Performance Monitor.
So, for CPU % of running services (tasks, processes):
sum by (process,hostname)(irate(wmi_process_cpu_time_total{scaleset="name", process=~"processes"}[5m])) * 100 / 2 (where 2 is the number of CPU cores)
First, you sum all the metrics for the running process: the exporter provides results for the same process ID in both user and kernel mode, so they need to be summed. The same grouping is applied for hostname (instance, etc.); in my case I have Azure scale sets with 2 to 5 instances. The result must be multiplied by 100 to get a percentage and divided by the number of CPU cores.
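If you don't want to hard-code the core count, a possible variant (a sketch, assuming the wmi_exporter cs collector is enabled so wmi_cs_logical_processors is available, and that both metrics carry the same hostname label from your relabeling) is:
sum by (process, hostname) (irate(wmi_process_cpu_time_total{scaleset="name", process=~"processes"}[5m])) * 100
  / on (hostname) group_left
  wmi_cs_logical_processors{scaleset="name"}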
Cheers!

Nodetool load and own stats

We are running 2 nodes in a cluster with replication factor 1.
After writing a burst of data, we see the following via nodetool status:
Node 1 - load 22G (owns 48.2)
Node 2 - load 17G (owns 51.8)
As the payload size per record is exactly equal - what could lead to a node showing higher load despite lower ownership?
nodetool status uses the Owns column to indicate the effective percentage of the token range owned by the node, while Load is the size on disk of the data that node stores.
I don't see anything wrong here. Your data is almost evenly distributed across your two nodes, which is exactly what you want for good performance; the on-disk load can differ somewhat from the ownership percentage because ownership only describes the token range, not how much data actually landed in it.

How to calculate Azure SQL Data Warehouse DWU?

I am analyzing Azure SQL DW and I came across the term DWU (Data Warehouse Units). The link on the Azure site only gives a crude definition of DWU. I want to understand how DWU is calculated and how I should scale my system accordingly.
I have also referred to the link, but it does not cover my question.
In addition to the links you found it is helpful to know that Azure SQL DW stores data in 60 different parts called "distributions". If your DW is DWU100 then all 60 distributions are attached to one compute node. If you scale to DWU200 then 30 distributions are detached and reattached to a second compute node. If you scale all the way to DWU2000 then you have 20 compute nodes each with 3 distributions attached. So you see how DWU is a measure of the compute/query power of your DW. As you scale you have more compute operating on less data per compute node.
Update: For Gen2 there are still 60 distributions but the DWU math is a bit different. DWU500c is one full size node (playing both compute and control node roles) where all 60 distributions are mounted. Scales smaller than DWU500c are single nodes that are not full size (meaning fewer cores and less RAM than full size nodes on larger DWUs). DWU1000c is 2 compute nodes each with 30 distributions mounted and there is a separate control node. DWU1500c is 3 compute nodes and a separate control node. And the largest is DWU30000c which is 60 compute nodes each with one distribution mounted.
I just found this link, which shows the relation between throughput and DWU.
You can also checkout the dwucalculator. This site walks you through the process of taking a capture for your existing workload and makes a recommendation on the number of DWUs necessary to fulfill the workload in Azure SQL DW.
http://dwucalculator.azurewebsites.net/
You can also choose the DWU level based on the time the workload takes and the number of tables involved.
For example, if loading 3 tables takes 15 minutes at DWU100 and you need it done in roughly 3 minutes, you might choose DWU500.
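For reference, scaling is a single statement; a minimal sketch (the database name is hypothetical, and DW500c is just an example service objective):
-- Scale the data warehouse to 500 DWU (run against the logical server, e.g. from master)
ALTER DATABASE [MyDataWarehouse] MODIFY (SERVICE_OBJECTIVE = 'DW500c');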
