Prometheus - Scraping metrics from different endpoints inside an Azure VM - azure

I have Prometheus running inside a Kubernetes cluster in Azure, and I'm trying to use it to also monitor a few VMs inside the same resource-group.
I have set up Azure SD for the VMs, and it's scanning them correctly, but each of these VMs runs more than one service, each exposing metrics on a different port.
Is there a way to tell Prometheus to scan multiple ports under the azure_service_discovery job?
Or at least have these metrics aggregated, so Prometheus can scrape them from one single port?
The job definition that I'm using is:
azure_sd_configs:
  - authentication_method: "OAuth"
    subscription_id: AZURE_SUBSCRIPTION_ID
    tenant_id: AZURE_TENANT_ID
    client_id: AZURE_CLIENT_ID
    client_secret: AZURE_CLIENT_SECRET
    port: 9100
    refresh_interval: 300s

You can't have two different ports in the same SD config entry.
However, you can do either of the following.
Have multiple jobs, each with its own azure_sd_configs. This way each job can have a different configuration (drop some targets, customize the sample limit, etc.):
- job_name: azure_exporters_a
  sample_limit: 1000
  azure_sd_configs:
    - port: 9100
      ...
- job_name: azure_exporters_b
  sample_limit: 5000
  azure_sd_configs:
    - port: 9800
      ...
Or have multiple azure_sd_configs entries within a single job. In that case all of your exporters are grouped under the same job, so they share the same configuration (sample_limit, scrape_timeout, ...):
- job_name: azure_exporters
  sample_limit: 5000
  azure_sd_configs:
    - port: 9100
      ...
    - port: 9800
      ...
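For illustration, here is what the single-job variant might look like once written out in full. This is only a sketch: it reuses the placeholder credentials from the question, the job name is arbitrary, and port 9800 stands in for whatever the second exporter actually listens on.

```yaml
scrape_configs:
  - job_name: azure_exporters            # arbitrary job name
    azure_sd_configs:
      # First discovery entry: every discovered VM becomes a target on port 9100
      - authentication_method: OAuth
        subscription_id: AZURE_SUBSCRIPTION_ID
        tenant_id: AZURE_TENANT_ID
        client_id: AZURE_CLIENT_ID
        client_secret: AZURE_CLIENT_SECRET
        port: 9100
        refresh_interval: 300s
      # Second discovery entry: same VMs again, this time on port 9800 (assumed port)
      - authentication_method: OAuth
        subscription_id: AZURE_SUBSCRIPTION_ID
        tenant_id: AZURE_TENANT_ID
        client_id: AZURE_CLIENT_ID
        client_secret: AZURE_CLIENT_SECRET
        port: 9800
        refresh_interval: 300s
```

Each VM then shows up as two targets under the same job, distinguished by the port in the instance label.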

Related

Pushing metrics to prometheus server via prometheus remote write from netdata

I have netdata installed in one of my computers and I want to export data to my prometheus server (both Ubuntu).
But I can't use prometheus' pull system, I need the metrics to be pushed from netdata to prometheus.
Netdata has prometheus remote write implemented in its exporting engine and I am able to configure it to send metrics to my server PC just fine.
But I can't see the metrics in prometheus at all, although I know the metrics are being sent to the server PC as I can see them by listening on the port I'm pushing to, via netcat.
So I think that my prometheus config is wrong.
This is my netdata exporting config:
[prometheus_remote_write:prometheus_receiver]
enabled = yes
destination = 192.168.5.45:9090
remote write URL path = /write
#username = admin
#password = admin
data source = average
prefix = netdata
# hostname = my_hostname
# update every = 10
# buffer on failures = 10
# timeout ms = 20000
# send names instead of ids = yes
# send charts matching = *
send hosts matching = *
And this is my prometheus config:
global:
  scrape_interval: 15s
  evaluation_interval: 15s
scrape_configs:
  - job_name: "prometheus"
    static_configs:
      - targets: ["localhost:9090"]
remote_read:
  - url: http://localhost/api/v1/write
    remote_timeout: 30s
When I open localhost:9090/api/v1/write, I expect to see the metrics pushed from netdata, but instead I get a blank page that says "Method Not Allowed".
I execute prometheus with the flags --web.enable-admin-api --web.enable-remote-write-receiver.
Any clue on what I'm doing wrong?
Try running Prometheus with the flag --enable-feature=remote-write-receiver.
You may have an older version of Prometheus, in which case this flag should work.
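As a rough sketch of the receiving side: the remote-write receiver is switched on by the command-line flag, not by prometheus.yml, so the config file needs no remote_read or remote_write section just to accept pushed samples. The receiver endpoint is /api/v1/write on Prometheus's own port and it only accepts POST requests, so opening it in a browser (a GET) returning "Method Not Allowed" is expected and does not by itself indicate a problem. Under that assumption, the netdata "remote write URL path" would need to point at /api/v1/write rather than /write.

```yaml
# prometheus.yml on the receiving server (sketch, based on the config in the question)
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: "prometheus"
    static_configs:
      - targets: ["localhost:9090"]

# No remote_read / remote_write block here: receiving pushed samples is
# enabled by the remote-write-receiver flag passed when starting Prometheus.
```

Once samples arrive, they should be queryable in the normal expression browser like any scraped metric.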

Kubernetes on Azure - liveness and readiness probes failing - Liveness probe failed with connect: connection refused

I'm a noob with Azure deployment, Kubernetes and HA implementation. When I implement health probes as part of my app deployment, the health probes fail and I end up with either a 503 (service unavailable) or 502 (bad gateway) error when I try accessing the app via the URL. When I remove the health probes, I can successfully access the app using its URL.
I use the following yaml deployment configuration when implementing the health probes, which is utilised by an Azure devops pipeline. The app takes under 5 mins to become available, so I set the initialDelaySeconds for the health probes to 300s.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myApp
spec:
  ...
  template:
    metadata:
      labels:
        app: myApp
    spec:
      ...
      containers:
        - name: myApp
          ...
          ports:
            - containerPort: 5000
          ...
          readinessProbe:
            tcpSocket:
              port: 5000
            initialDelaySeconds: 300
            periodSeconds: 5
            successThreshold: 1
            failureThreshold: 3
          livenessProbe:
            tcpSocket:
              port: 5000
            periodSeconds: 30
            initialDelaySeconds: 300
            successThreshold: 1
            failureThreshold: 3
          ...
When I perform the deployment and describe the pod, I see the following listed under 'Events' at the bottom of the output:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning Unhealthy 2m1s (x288 over 86m) kubelet, aks-vm-id-appears-here Readiness probe failed: dial tcp 10.123.1.23:5000: connect: connection refused
(this is confusing as it states the age as 2m1s - but the initialDelaySeconds is greater than this - so I'm not sure why it reports this as the age)
The readiness probe subsequently fails with the same error. The IP number matches the IP of my pod and I see this under Containers in the pod description:
Containers:
....
Port: 5000/TCP
The failure of the liveness and readiness probes results in the pod being continually terminated and restarted.
The app has a default index.html page, so I believe the health probe should receive a 200 response if it's able to connect.
Because the health probe is failing, the pod IP doesn't get assigned to the endpoints object and therefore isn't assigned against the service.
If I comment out the readinessProbe and livenessProbe from the deployment, the app runs successfully when I use the URL via the browser, and the pod IP gets successfully assigned as an endpoint that the service can communicate with. The endpoint address is in the form 10.123.1.23:5000 - i.e. port 5000 seems to be the correct port for the pod.
I don't understand why the health probe would be failing to connect? It looks correct to me that it should be trying to connect on an IP that looks like 10.123.1.23:5000.
It's possible that the port is taking longer than 300s to open, but I don't know of a way to check that. If I enter a bash session on the pod, watch isn't available (I read that watch ss -lnt can be used to examine port availability).
The following answer suggests increasing initialDelaySeconds but I already tried that - https://stackoverflow.com/a/51932875/1549918
I saw this question ("Liveness and readiness probe connection refused"), but resource utilisation (e.g. CPU/RAM) is not the issue.
UPDATE
If I curl from a replica of the pod to https://10.123.1.23:5000, I get a similar error (Failed to connect to ...the IP.. port 5000: Connection refused). Why could this be failing? I read something that suggests that attempting this connection from another pod may indicate reachability for the health probes also.
If you are unsure whether your application is starting correctly, replace it with a known-good image such as httpd: change the image to httpd and the ports to 80.
You might also want to increase the health check timeout, which defaults to 1 second, for example timeoutSeconds: 5.
In addition, if your image is a web application, it would be better to use an HTTP probe.
Your statement "The app has a default index.html page, so I believe the health probe should receive a 200 response if it's able to connect." is incorrect.
You are doing a tcpSocket check, which only tests whether the port accepts a connection and never requests a page. Try switching to:
livenessProbe:
  failureThreshold: 3
  httpGet:
    path: /
    port: 5000
    scheme: HTTP
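Putting the suggestions from both answers together, the probe section of the container spec might look roughly like the sketch below. The path, port, delays and thresholds are taken from the question and answers; timeoutSeconds: 5 is the value suggested above. Note that an HTTP probe will still report "connection refused" if nothing is listening on port 5000 on the pod IP, so this changes what is checked, not whether the port is reachable.

```yaml
# Fragment of the container spec (sketch)
readinessProbe:
  httpGet:
    path: /                 # default index page mentioned in the question
    port: 5000
    scheme: HTTP
  initialDelaySeconds: 300
  periodSeconds: 5
  timeoutSeconds: 5         # raised from the 1-second default
  successThreshold: 1
  failureThreshold: 3
livenessProbe:
  httpGet:
    path: /
    port: 5000
    scheme: HTTP
  initialDelaySeconds: 300
  periodSeconds: 30
  timeoutSeconds: 5
  successThreshold: 1
  failureThreshold: 3
```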

K8S - using Prometheus to monitor another prometheus instance in secure way

I've installed Prometheus operator 0.34 (which works as expected) on cluster A (main prom)
Now I want to use the federation option, i.e. collect metrics from another Prometheus located on another K8S cluster, B.
Scenario:
Cluster A has the MAIN Prometheus Operator v0.34 config.
Cluster B has the SLAVE Prometheus 2.13.1 config.
Both were installed successfully via Helm; I can access each on localhost via port-forwarding and see the scraping results on each cluster.
I did the following steps:
On the operator (main cluster A) I used additionalScrapeConfigs.
I added the following to the values.yaml file and updated it via Helm.
additionalScrapeConfigs:
  - job_name: 'federate'
    honor_labels: true
    metrics_path: /federate
    params:
      match[]:
        - '{job="prometheus"}'
        - '{__name__=~"job:.*"}'
    static_configs:
      - targets:
          - 101.62.201.122:9090 # The External-IP and port from the target prometheus on Cluster B
I obtained the target as follows:
On the Prometheus inside cluster B (from which I want to collect the data) I ran:
kubectl get svc -n monitoring
And get the following entries:
I took the EXTERNAL-IP and put it inside the additionalScrapeConfigs config entry.
Now I switch to cluster A and run kubectl port-forward svc/mon-prometheus-operator-prometheus 9090:9090 -n monitoring
I open the browser at localhost:9090, see the graphs, click on Status and then on Targets,
and see the new target with job federate.
Now my main questions/gaps (security and verification):
To get that target's state to green (see the pic), I configured the Prometheus server in cluster B to use type: LoadBalancer instead of type: NodePort, which exposes the metrics externally. This is fine for testing, but I need to secure it. How can that be done?
How can I make this work end to end in a secure way?
tls
https://prometheus.io/docs/prometheus/1.8/configuration/configuration/#tls_config
Inside cluster A (main cluster) we use a certificate for our services with Istio as follows, which works:
tls:
  mode: SIMPLE
  privateKey: /etc/istio/oss-tls/tls.key
  serverCertificate: /etc/istio/oss-tls/tls.crt
I see in the docs that there is an option to configure tls_config:
additionalScrapeConfigs:
  - job_name: 'federate'
    honor_labels: true
    metrics_path: /federate
    params:
      match[]:
        - '{job="prometheus"}'
        - '{__name__=~"job:.*"}'
    static_configs:
      - targets:
          - 101.62.201.122:9090 # The External-IP and port from the target
    # tls_config:
    #   ca_file: /opt/certificate-authority-data.pem
    #   cert_file: /opt/client-certificate-data.pem
    #   key_file: /sfp4/client-key-data.pem
    #   insecure_skip_verify: true
But I am not sure which certificate I need to use in the Prometheus Operator config: the certificate of the main Prometheus (A) or of the slave (B)?
You should consider using Additional Scrape Configuration:
"AdditionalScrapeConfigs allows specifying a key of a Secret containing additional Prometheus scrape configurations. Scrape configurations specified are appended to the configurations generated by the Prometheus Operator."
I am afraid this is not officially supported. However, you can update the prometheus.yml section within the Helm chart. If you want to learn more about it, check out this blog.
I see two options here:
"Connections to Prometheus and its exporters are not encrypted and authenticated by default. This is one way of fixing that with TLS certificates and stunnel."
Or specify Secrets which you can add to your scrape configuration.
Please let me know if that helped.
A couple of options spring to mind:
Put the two clusters in the same network space and put a firewall in front of them
VPN tunnel between the clusters.
Use istio multicluster routing (but this could get complicated): https://istio.io/docs/setup/install/multicluster
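On the question of which certificate goes where, here is a sketch of the federate job over HTTPS. It assumes the Prometheus in cluster B is actually served over TLS (for example behind a TLS-terminating proxy), which is not the case in the setup described above. In tls_config, ca_file is the CA used to verify the server certificate presented by the target (cluster B), while cert_file and key_file are a client certificate and key that cluster A presents, needed only if the target requires client authentication. The secret name federate-tls and the file names are hypothetical; the paths assume the files are mounted into the Prometheus pod in cluster A via the operator's secrets list.

```yaml
additionalScrapeConfigs:
  - job_name: 'federate'
    scheme: https                       # assumes cluster B serves /federate over TLS
    honor_labels: true
    metrics_path: /federate
    params:
      match[]:
        - '{job="prometheus"}'
        - '{__name__=~"job:.*"}'
    tls_config:
      # CA that signed cluster B's server certificate (hypothetical path)
      ca_file: /etc/prometheus/secrets/federate-tls/ca.crt
      # Client certificate and key presented by cluster A; only needed for mutual TLS
      cert_file: /etc/prometheus/secrets/federate-tls/tls.crt
      key_file: /etc/prometheus/secrets/federate-tls/tls.key
    static_configs:
      - targets:
          - 101.62.201.122:9090         # External-IP and port of the target, as in the question
```

Leaving insecure_skip_verify out keeps server certificate verification on, which is the point of supplying ca_file.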

Expose Cassandra running on Kubernetes

I am running Cassandra on Kubernetes (3 instances) and want to expose it to the outside, because my application is not yet in Kubernetes. So I created a load-balanced service like so:
apiVersion: v1
kind: Service
metadata:
  namespace: getquanty
  labels:
    app: cassandra
  name: cassandra
  annotations:
    kubernetes.io/tls-acme: "true"
spec:
  clusterIP:
  ports:
    - port: 9042
      name: cql
      nodePort: 30001
    - port: 7000
      name: intra-node
      nodePort: 30002
    - port: 7001
      name: tls-intra-node
      nodePort: 30003
    - port: 7199
      name: jmx
      nodePort: 30004
  selector:
    app: cassandra
  type: LoadBalancer
This is the result:
NAME CLUSTER-IP EXTERNAL-IP PORT(S) AGE
cassandra 10.55.249.88 GIVEN_IP_GCE_LB 9042:30001/TCP,7000:30002/TCP,7001:30003/TCP,7199:30004/TCP 26m
I am able to connect from the shell (cqlsh GIVEN_IP_GCE_LB), but when I try to add data to Cassandra using the DataStax driver for Node.js, I get this:
message: 'Cannot achieve consistency level SERIAL',
info: 'Represents an error message from the server',
code: 4096,
consistencies: 8,
required: 1,
alive: 0,
coordinator: '35.187.166.68:9042' },
'10.52.4.32:9042': 'Host considered as DOWN',
'10.52.2.15:9042': 'Host considered as DOWN' },
info: 'Represents an error when a query cannot be performed because no host is available or could be reached by the driver.',
message: 'All host(s) tried for query failed. First host tried, 35.187.166.68:9042: ResponseError: Cannot achieve consistency level SERIAL. See innerErrors.' }
My first thought was that I needed to expose the other ports too, so I did (intra-node, tls-intra-node, jmx), but I got the same error.
Kubernetes gives you access to a proxy. I tried to proxy from my machine using the constructed URL for the pod to test whether I have access, but I cannot connect using cqlsh:
http://127.0.0.1:8001/api/v1/namespaces/qq/pods/cassandra-0:cql/proxy
I am out of ideas. The one thing left to try is to expose every instance (make a service for every instance), which is very ugly, but it would let me connect to the nodes from the outside until I migrate the application to Kubernetes.
Does anyone have ideas on how to expose Cassandra nodes to the internet and make the DataStax driver aware of all the nodes? Thank you for your time.
After more reading I found out that the replication strategy was causing the problem: NetworkTopologyStrategy is meant for multi-datacenter deployments and I have a single one, so I changed the replication to SimpleStrategy with the number of nodes I had. Now everything works as expected.
EDIT 1:
Putting databases on Kube is not a good solution. I ended up making a standalone cluster, added it to the same network as Kube, and was able to access it from Kube pods.
Kube is made to manage applications and make them 'elastic'. I don't think people really need to scale databases as quickly as applications; furthermore, scaling a database is not the same operation as scaling a stateless application.
You need to use a headless service for the replication controller you created.
Your service should be something like:
apiVersion: v1
kind: Service
metadata:
  labels:
    app: cassandra
  name: cassandra
spec:
  clusterIP: None
  ports:
    - port: 9042
  selector:
    app: cassandra
Also, you can refer to the link below to bring up a Cassandra cluster.
https://github.com/kubernetes/kubernetes/tree/master/examples/storage/cassandra
I would recommend running the Cassandra pods via a replication controller, StatefulSet or DaemonSet, because Kubernetes then manages restarting and rescheduling the pods whenever required.
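As a minimal sketch of the StatefulSet route: the manifest below pairs with a headless Service named cassandra selecting app: cassandra. The image tag is an assumption, and seed configuration, persistent storage and resource settings are deliberately left out, so on its own this would not form a proper three-node ring; it only shows how the StatefulSet ties to the headless Service.

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: cassandra
  labels:
    app: cassandra
spec:
  serviceName: cassandra        # name of the headless Service governing the pods
  replicas: 3                   # three instances, as in the question
  selector:
    matchLabels:
      app: cassandra
  template:
    metadata:
      labels:
        app: cassandra          # must match the Service selector
    spec:
      containers:
        - name: cassandra
          image: cassandra:3.11 # assumed image and tag
          ports:
            - containerPort: 9042
              name: cql
          # Seeds, volumeClaimTemplates and resource limits omitted in this sketch
```

With a headless Service, each pod gets a stable DNS name of its own instead of sharing one virtual IP, which is what a driver needs in order to reach individual nodes from inside the cluster.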

Cassandra and Graphite don't expose metrics other than org.apache.cassandra.metrics.*

I have configured Apache Cassandra 2.2 to report to Graphite using metrics-graphite-3.1.2.jar (in the lib folder) and the following metrics_reporter_graphite.yaml in /etc/cassandra/.
The problem is that I don't get any metrics other than org.apache.cassandra.metrics.+.
For example, I want metrics under java.lang.+, but they are not sent to Graphite.
graphite:
  period: 30
  timeunit: 'SECONDS'
  prefix: 'cassandra-clustername-node1'
  hosts:
    - host: 'localhost'
      port: 2003
  predicate:
    color: 'white'
    useQualifiedName: true
    patterns:
      - '^org.apache.cassandra.metrics.+'
      - '^java.lang.+'
JVM related metrics will be available starting with 2.2.8 (see CASSANDRA-12312). These metrics will be exposed under jvm.*, see here for a list of options.
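If the JVM metrics are indeed published under jvm.* as described, the whitelist in the reporter config would presumably need a matching pattern. A sketch of just the predicate section, assuming Cassandra 2.2.8 or later; the '^jvm.+' pattern is an assumption derived from the naming mentioned above, not something taken from the original config:

```yaml
# predicate section of metrics_reporter_graphite.yaml (sketch)
predicate:
  color: 'white'                          # whitelist: only matching metrics are sent
  useQualifiedName: true
  patterns:
    - '^org.apache.cassandra.metrics.+'
    - '^jvm.+'                            # assumed pattern for the jvm.* metrics
```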

Resources