I have an AKS cluster with Windows and Linux nodes and containers. For Linux I collect metrics normally with Prometheus, but Windows metrics are not displayed. I have installed and configured windows_exporter (https://github.com/prometheus-community/windows_exporter). Metrics appeared for pods that are in the same namespace as windows_exporter. Could you please help me collect metrics from the other namespaces? Or advise how best to collect metrics from Windows AKS nodes and pods. Thanks.
You can try the steps below.
After downloading windows_exporter, open the folder in a terminal and run:
.\windows_exporter.exe --collectors.enabled "cpu,cs,logical_disk,os,system,net"
Once windows_exporter is started, configure Prometheus to scrape the exporter by adding the following inside the scrape_configs array:
  - job_name: "windows_exporter"
    static_configs:
      - targets: ["localhost:9182"]
Now configure Prometheus for remote write by adding the following at the root of the config:
remote_write:
  - url: "https://<PROMETHEUS_SERVER_NAME>/prometheus/remote/write"
    tls_config:
      insecure_skip_verify: true
Once the above steps are performed you can start Prometheus. If you want one-day data retention, or whatever retention period your requirement calls for, you can run:
prometheus.exe --storage.tsdb.retention.time=1d   # change 1d as per your requirement
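Putting the pieces together, a minimal prometheus.yml could look like the sketch below (the scrape interval is an assumed value and the remote-write URL is the same placeholder as above; adjust both for your environment):

global:
  scrape_interval: 15s              # assumed interval; tune as needed

scrape_configs:
  - job_name: "windows_exporter"
    static_configs:
      - targets: ["localhost:9182"]

remote_write:
  - url: "https://<PROMETHEUS_SERVER_NAME>/prometheus/remote/write"
    tls_config:
      insecure_skip_verify: true    # skips certificate verification; avoid outside testing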
Reference:
Monitoring a Windows cluster with Prometheus – Sysdig
OR
As RahulKumarShaw-MT suggested, you can refer to How to export metrics from Windows Kubernetes nodes in AKS (Octopus Deploy) and aidapsibr/aks-prometheus-windows-exporter.
Related
I am running a load test against a Kubernetes pod and I want to sample its CPU and memory usage every 5 minutes.
I have been doing this manually with the Linux top command inside the pod.
Is there any way, given a Kubernetes pod, to fetch its CPU/memory usage every X minutes and append it to a file?
You can install the metrics server and then write a small bash script that, in a loop, calls kubectl top pods --all-namespaces and appends the output to a file; see the sketch below. Another option, if you want something with more meat, is to run Prometheus and/or kube-state-metrics and set them up to scrape metrics from all pods in the system.
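A minimal sketch of such a script, assuming metrics-server is installed and kubectl is configured for the cluster (the file name and the 5-minute interval are just example choices):

#!/usr/bin/env bash
# Append a timestamped snapshot of pod CPU/memory usage every 5 minutes.
while true; do
  echo "=== $(date -u +%Y-%m-%dT%H:%M:%SZ) ===" >> pod-usage.log
  kubectl top pods --all-namespaces --no-headers >> pod-usage.log
  sleep 300
done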
I'm running an Azure AKS cluster (1.15.11) with prometheus-operator 8.15.6 installed as a Helm chart, and I'm seeing different metrics in the Kubernetes Dashboard compared to the ones provided by the prometheus-operator Grafana dashboards.
An application pod which is being monitored has three containers in it. The Kubernetes Dashboard shows the memory consumption for this pod as ~250 MB, while the standard prometheus-operator dashboard displays almost exactly double that value, ~500 MB.
At first we thought there might be some misconfiguration in our monitoring setup. Since prometheus-operator is installed as a standard Helm chart, the node exporter DaemonSet ensures that every node has exactly one exporter deployed, so duplicate exporters shouldn't be the reason. However, after migrating our cluster to different node pools I noticed that when our application runs in the user node pool instead of the system node pool, the metrics match exactly in both tools. I know that the system node pool runs CoreDNS and tunnelfront, but I assume these run as separate components, and I'm aware that overall it's not the best choice to run infrastructure and applications in the same node pool.
However, I'm still wondering why running the application in the system node pool causes the Prometheus metrics to be doubled?
I ran into a similar problem (AKS v1.14.6, prometheus-operator v0.38.1) where all my values were multiplied by a factor of 3. It turns out you have to remember to remove the extra endpoints called prometheus-operator-kubelet that are created in the kube-system namespace during install before you remove/reinstall prometheus-operator, since Prometheus aggregates the metrics collected for each endpoint.
Log in to the Prometheus pod and check the status page. There should be as many endpoints as there are nodes in the cluster; otherwise you may have a surplus of endpoints.
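To check for and clean up the leftovers from the command line, something like the following should work (assuming the stale objects are the Service/Endpoints named prometheus-operator-kubelet in kube-system; verify before deleting):

kubectl get endpoints,svc -n kube-system | grep prometheus-operator-kubelet
kubectl delete svc,endpoints prometheus-operator-kubelet -n kube-system   # only before removing/reinstalling prometheus-operator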
What is the best way to monitor whether Cassandra nodes are up? For security reasons JMX and nodetool are out of the question. I have cluster metrics monitoring via a REST API, but I understand that even if a node goes down, the REST API will only report on the cluster as a whole.
Well, I have integrated a system where I can monitor all the metrics of all the nodes in my cluster. It may seem complicated, but it is pretty simple to put together. You will need the following components to build a monitoring system for Cassandra:
jolokia jar
telegraf
influxdb
grafana
Here is a short procedure describing how it works.
Step 1: Copy the Jolokia JVM agent jar to install_dir/apache-cassandra-version/lib/. The Jolokia JVM agent is easy to find with a Google search.
Step 2: Add the following line to install_dir/apache-cassandra-version/conf/cassandra-env.sh:
JVM_OPTS="$JVM_OPTS -javaagent:<here_goes_the_path_of_your_jolokia_jar>"
Step 3: Install Telegraf on each node, configure the metrics you want to monitor, and start the Telegraf service (a sample config sketch follows these steps).
Step 4: Install Grafana, configure its IP, port, and protocol, and start the Grafana service. Grafana will give you a dashboard to keep an eye on your nodes; your metrics will be visible there.
Step 5: Install InfluxDB on another server, where the metrics data coming through the Telegraf agent will be stored.
Step 6: In a browser, open the IP where you launched Grafana, add the data source (the InfluxDB IP), and then customize your dashboard.
image source: https://blog.pythian.com/monitoring-cassandra-grafana-influx-db/
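For step 3, a minimal Telegraf config sketch using the jolokia2_agent input and an InfluxDB output might look like the following (the Jolokia port, the MBean chosen, the database name, and the InfluxDB host are assumptions to adapt to your setup):

# e.g. /etc/telegraf/telegraf.d/cassandra.conf
[[inputs.jolokia2_agent]]
  urls = ["http://localhost:8778/jolokia"]        # default Jolokia JVM agent port

  [[inputs.jolokia2_agent.metric]]
    name  = "java_memory"
    mbean = "java.lang:type=Memory"
    paths = ["HeapMemoryUsage", "NonHeapMemoryUsage"]

[[outputs.influxdb]]
  urls     = ["http://<influxdb-host>:8086"]      # the InfluxDB server from step 5
  database = "cassandra_metrics"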
This is not for monitoring, but only for node state.
The Cassandra CQL driver provides information on whether a particular node is UP or DOWN through the Host.StateListener interface. The driver uses this information to mark a node UP or DOWN, so it can also be used to tell whether a node is up or down when JMX is not accessible.
Java driver API docs: https://docs.datastax.com/en/drivers/java/3.3/
I came up with a script that watches for DN nodes in the cluster and reports them to our monitoring setup, which is integrated with PagerDuty.
The script runs on one of our nodes, executes nodetool status every minute, and reports all down nodes.
Here is the script https://gist.github.com/johri21/87d4d549d05c3e2162af7929058a00d1
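The gist above has the full version; a minimal sketch of the idea (the alerting call is a placeholder you would replace with your PagerDuty integration):

#!/usr/bin/env bash
# Once a minute, report any node whose "nodetool status" line starts with DN (Down/Normal).
while true; do
  down_nodes=$(nodetool status | awk '/^DN/ {print $2}')
  if [ -n "$down_nodes" ]; then
    echo "$(date -u): Cassandra nodes down: $down_nodes"   # replace with your alerting/PagerDuty call
  fi
  sleep 60
done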
I have a Kubernetes cluster, provisioned with kops, running on CoreOS workers. From time to time I see significant load spikes that correlate with I/O spikes reported in Prometheus by the node_disk_io_time_ms metric. The thing is, I seem to be unable to use any metric to pinpoint where this I/O workload actually originates. Metrics like container_fs_* seem useless, as I always get zero values for actual containers and data only for the whole node.
Any hints on how to approach the issue of locating what is to blame for I/O load in a kube cluster / CoreOS node are very welcome.
If you are using the nginx ingress controller you can configure it with
enable-vts-status: "true"
This will give you a bunch of Prometheus metrics for each pod that is behind an ingress. The metric names start with nginx_upstream_. The setting goes into the controller's ConfigMap, as in the sketch below.
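A sketch, assuming the ConfigMap name and namespace of a typical nginx ingress install (check your own deployment for the actual names):

apiVersion: v1
kind: ConfigMap
metadata:
  name: nginx-configuration     # assumed name; use the ConfigMap your controller is started with
  namespace: ingress-nginx      # assumed namespace
data:
  enable-vts-status: "true"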
In case it is a cronjob creating the spikes, install the node-exporter DaemonSet and check the container_fs_* metrics; a sample query follows.
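If those metrics do show data, a query along these lines can help rank pods by filesystem I/O (label names such as pod_name vs. pod vary with the Kubernetes/cAdvisor version, so treat this as a sketch):

topk(10, sum by (namespace, pod_name)
  (rate(container_fs_reads_bytes_total[5m]) + rate(container_fs_writes_bytes_total[5m])))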
After reading the discussions about the differences between Mesos and Kubernetes and kubernetes-vs.-mesos-vs.-swarm, I am still confused about how to create a Spark and TensorFlow cluster with Docker containers across some bare-metal hosts and an AWS-like private cloud (OpenNebula).
Currently, I am able to build a static TensorFlow cluster with Docker containers manually distributed to different hosts. I only run a standalone Spark instance on a bare-metal host. The way to manually set up a Mesos cluster for containers can be found here.
Since my resources are limited, I would like to find a way to deploy Docker containers onto the current mixed infrastructure to build either a TensorFlow or a Spark cluster, so that I can do data analysis with either TensorFlow or Spark on the same resources.
Is it possible to create/run/undeploy a Spark or TensorFlow cluster quickly with Docker containers on a mixed infrastructure using Mesos or Kubernetes? How can I do that?
Any comments and hints are welcome.
Given you have limited resources, I suggest you have a look at using the Spark Helm chart (a sample install command follows the list below), which gives you:
1 x Spark Master with port 8080 exposed on an external LoadBalancer
3 x Spark Workers with HorizontalPodAutoscaler to scale to max 10 pods when CPU hits 50% of 100m
1 x Zeppelin with port 8080 exposed on an external LoadBalancer
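A sample install, assuming Helm 2 syntax and the chart from the old stable repository (the release and chart names are illustrative; check the chart's README for current values):

helm install --name my-spark stable/spark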
If this configuration doesn't work, then you can build your own Docker images and deploy those; take a look at this blog series. There is work underway to make Spark more Kubernetes-friendly. This issue also gives some insight.
I haven't looked into TensorFlow; I suggest you look at this blog.