We are using Spark on Kubernetes (via the SparkOperator) and Prometheus to expose the application's metrics. The application is a Spark Streaming app (NOT Structured Streaming).
The application used to run on an image with Spark 2.4.7 and was later migrated to Spark 3.1.2.
After this migration all spark_streaming_* metrics disappeared, e.g. spark_streaming_driver_totalreceivedrecords (as defined here: https://github.com/apache/spark/blob/master/streaming/src/main/scala/org/apache/spark/streaming/StreamingSource.scala#L53).
In general the Prometheus setup seems to work: when you curl the Prometheus port you still see a bunch of other metrics, just none of the Spark Streaming ones.
The Spark image contains the Prometheus JMX java agent, and in the Helm chart of the streaming app the monitoring spec is configured to use it:
monitoring:
  exposeDriverMetrics: true
  exposeExecutorMetrics: true
  prometheus:
    jmxExporterJar: "/prometheus/jmx_prometheus_javaagent-0.11.0.jar"
    port: 8888
as described in the spark-operator documentation: https://github.com/GoogleCloudPlatform/spark-on-k8s-operator/blob/master/docs/user-guide.md#monitoring.
This is also how the setup used to work with Spark 2.4.7.
Are these metrics gone in Spark 3? Or are we maybe just missing some configuration?
Another side note: when you check the metrics via <spark-driver>:<ui-port>/metrics/json you can see the desired metrics.
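For reference, this is roughly how we check the exporter output (assuming the JMX exporter agent listens on port 8888 as configured above; the driver pod name is a placeholder):

kubectl port-forward <spark-driver-pod> 8888:8888
curl -s localhost:8888/metrics | grep -i spark_streaming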
OK, it seems the spark-operator currently does not work properly with the metrics as named in Spark 3.
The issue can be fixed by provisioning the Spark image with a fixed Prometheus (JMX exporter) configuration file and then using it in your Helm chart:
monitoring:
  exposeDriverMetrics: true
  exposeExecutorMetrics: true
  prometheus:
    jmxExporterJar: "/prometheus/jmx_prometheus_javaagent-0.11.0.jar"
    port: 8888
    configFile: PATH_TO_THE_CONFIG
More info and a working Prometheus config can be found here: https://github.com/GoogleCloudPlatform/spark-on-k8s-operator/issues/1117
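For context, the config file is a standard jmx_prometheus_javaagent configuration: a list of rules mapping the Spark JMX MBean names to Prometheus metric names. A sketch of the kind of rule involved is below (the pattern follows the one bundled with the spark-operator; the exact pattern that matches the Spark 3 bean names is the one discussed in the issue linked above, so treat this as illustrative only):

lowercaseOutputName: true
attrNameSnakeCase: true
rules:
  # Maps driver StreamingMetrics MBeans to spark_streaming_driver_* gauges.
  # Adjust the pattern to the Spark 3 bean names as per the linked issue.
  - pattern: metrics<name=(\S+)\.(\S+)\.driver\.(\S+)\.StreamingMetrics\.streaming\.(\S+)><>Value
    name: spark_streaming_driver_$4
    labels:
      app_namespace: "$1"
      app_id: "$2"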
Related
I'm running a Spark 3.0 application (Spark Structured Streaming) on Kubernetes and I'm trying to use the new native Prometheus metric sink. I'm able to make it work and get all the metrics described here.
However, the metrics I really need are the ones provided upon enabling the following config: spark.sql.streaming.metricsEnabled, as proposed in this Spark Summit presentation. Now, even with that config set to "true", I can't see any streaming metrics under /metrics/executors/prometheus as advertised. One thing to note is that I can see them under /metrics/json, so we know that the configuration was properly applied.
Why aren't streaming metrics sent to the Prometheus sink? Do I need to add some additional configuration? Is that not supported yet?
After quite a bit of investigation, I was able to make it work. In short, the Spark job k8s definition file needed one additional line, to tell Spark where to find the metrics.properties config file.
Make sure to add the following line under sparkConf in the Spark job k8s definition file, adjusting it to your actual path. The metrics.properties file should be placed at that path by your Dockerfile.
sparkConf:
  "spark.metrics.conf": "/etc/metrics/conf/metrics.properties"
For reference, here's the rest of my sparkConf, for metric-related config.
sparkConf:
  "spark.metrics.conf": "/etc/metrics/conf/metrics.properties"
  "spark.ui.prometheus.enabled": "true"
  "spark.kubernetes.driver.annotation.prometheus.io/scrape": "true"
  "spark.kubernetes.driver.annotation.prometheus.io/path": "/metrics/executors/prometheus/"
  "spark.kubernetes.driver.annotation.prometheus.io/port": "4040"
  "spark.sql.streaming.metricsEnabled": "true"
  "spark.metrics.appStatusSource.enabled": "true"
  "spark.kubernetes.driver.service.annotation.prometheus.io/scrape": "true"
  "spark.kubernetes.driver.service.annotation.prometheus.io/path": "/metrics/prometheus/"
  "spark.kubernetes.driver.service.annotation.prometheus.io/port": "4040"
I've installed Prometheus Operator 0.34 (which works as expected) on cluster A (the main Prometheus).
Now I want to use the federation option, i.e. collect metrics from another Prometheus located on another K8s cluster, B.
Scenario:
In cluster A I have the MAIN Prometheus Operator v0.34 config.
In cluster B I have the SLAVE Prometheus 2.13.1 config.
Both were installed successfully via Helm; I can access each via port-forwarding to localhost and see the scrape results on each cluster.
I did the following steps:
Use additionalScrapeConfigs on the operator (main cluster A).
I added the following to the values.yaml file and updated it via Helm:
additionalScrapeConfigs:
  - job_name: 'federate'
    honor_labels: true
    metrics_path: /federate
    params:
      match[]:
        - '{job="prometheus"}'
        - '{__name__=~"job:.*"}'
    static_configs:
      - targets:
        - 101.62.201.122:9090 # The External-IP and port from the target prometheus on Cluster B
I obtained the target as follows:
On the Prometheus inside cluster B (from which I want to collect the data) I ran:
kubectl get svc -n monitoring
This lists the services with their EXTERNAL-IPs; I took the EXTERNAL-IP of the Prometheus service and put it into the additionalScrapeConfigs target entry.
Then I switched to cluster A and ran kubectl port-forward svc/mon-prometheus-operator-prometheus 9090:9090 -n monitoring,
opened the browser at localhost:9090, saw the graphs, clicked on Status and then on Targets,
and saw the new target with the job federate.
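As an extra check that federation itself works (before worrying about securing it), the /federate endpoint on cluster B can also be queried directly with the same match[] selector:

curl -G 'http://101.62.201.122:9090/federate' \
  --data-urlencode 'match[]={job="prometheus"}'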
Now my main questions/gaps (security & verification):
To get that target into a green state (see the pic) I configured the Prometheus server in cluster B to use type: LoadBalancer instead of type: NodePort, which exposes the metrics externally. This can be good for testing, but I need to secure it. How can that be done?
How can the end-to-end setup be made to work in a secure way?
TLS
https://prometheus.io/docs/prometheus/1.8/configuration/configuration/#tls_config
Inside cluster A (the main cluster) we use certificates for our services with Istio, like the following, which works:
tls:
  mode: SIMPLE
  privateKey: /etc/istio/oss-tls/tls.key
  serverCertificate: /etc/istio/oss-tls/tls.crt
I see that in the docs there is an option to configure:
additionalScrapeConfigs:
  - job_name: 'federate'
    honor_labels: true
    metrics_path: /federate
    params:
      match[]:
        - '{job="prometheus"}'
        - '{__name__=~"job:.*"}'
    static_configs:
      - targets:
        - 101.62.201.122:9090 # The External-IP and port from the target
    # tls_config:
    #   ca_file: /opt/certificate-authority-data.pem
    #   cert_file: /opt/client-certificate-data.pem
    #   key_file: /sfp4/client-key-data.pem
    #   insecure_skip_verify: true
But I'm not sure which certificate I need to use inside the Prometheus Operator config: the certificate of the main Prometheus A, or of the slave B?
You should consider using Additional Scrape Configuration:
AdditionalScrapeConfigs allows specifying a key of a Secret containing additional Prometheus scrape configurations. Scrape configurations specified are appended to the configurations generated by the Prometheus Operator.
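Concretely, the usual flow (names here are just examples) is to put the scrape config into a file, create a Secret from it, and reference that Secret from the Prometheus custom resource:

# prometheus-additional.yaml contains the 'federate' job shown above
kubectl create secret generic additional-scrape-configs \
  --from-file=prometheus-additional.yaml -n monitoring

# then, in the Prometheus custom resource (or the corresponding Helm values):
spec:
  additionalScrapeConfigs:
    name: additional-scrape-configs
    key: prometheus-additional.yaml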
I am afraid this is not officially supported. However, you can update your prometheus.yml section within the Helm chart. If you want to learn more about it, check out this blog.
I see two options here:
Connections to Prometheus and its exporters are not encrypted or authenticated by default. This is one way of fixing that, with TLS certificates and stunnel.
Or specify Secrets which you can add to your scrape configuration.
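To expand on the Secrets option, a rough sketch (field names follow the Prometheus Operator CRD; the Secret name, key names, and file names are placeholders): mount the certificates into the Prometheus pods via spec.secrets, then point the federate job's tls_config at the mounted files.

# Prometheus custom resource on cluster A; each Secret listed under 'secrets'
# is mounted at /etc/prometheus/secrets/<secret-name>/ inside the pods.
spec:
  secrets:
    - federation-tls

# and inside the additional scrape config, at the level of the 'federate' job:
    tls_config:
      ca_file: /etc/prometheus/secrets/federation-tls/ca.crt
      cert_file: /etc/prometheus/secrets/federation-tls/client.crt
      key_file: /etc/prometheus/secrets/federation-tls/client.key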
Please let me know if that helped.
A couple of options spring to mind:
Put the two clusters in the same network space and put a firewall in front of them.
VPN tunnel between the clusters.
Use istio multicluster routing (but this could get complicated): https://istio.io/docs/setup/install/multicluster
I am trying to get metrics from DSE Cassandra (DSE 5.1.0, Cassandra 3.10.0.1652) using built-in reporters like ConsoleReporter. I am able to get all the metrics except those under ClientRequest.* and Storage.*, even though there are reads/writes to this cluster. The only metric in the ClientRequest.* group is org.apache.cassandra.metrics.ClientRequest.ViewPendingMutations.ViewWrite.
I tried different reporter configs with no luck, and I didn't find any associated JIRA ticket either. The behavior is the same with the StatsD reporter.
Here is the reporter config with a wildcard whitelist:
console:
  -
    outfile: '/tmp/metrics.out'
    period: 10
    timeunit: 'SECONDS'
    predicate:
      color: "white"
      useQualifiedName: true
      patterns:
        - ".*"
Both the ClientRequest and Storage metrics are critical for me. Does anybody have any pointers as to why I am not getting these metrics? I'd appreciate any insights on resolving this issue.
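One way to sanity-check whether the underlying ClientRequest MBeans are being updated at all, independent of the reporter config, is to dump the coordinator latency histograms that are backed by them:

# Shows read/write/range latency percentiles derived from the ClientRequest metrics.
nodetool proxyhistograms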
It seems to be an issue with the DSE version of Cassandra; something may be broken in the latest version of DSE/Cassandra. I tested with open-source Cassandra 3.9.0 and it works there: I am able to get all the metrics under ClientRequest.* and Storage.* with open-source Cassandra 3.9.0.
We managed to get Spark (2.x) to send metrics to Graphite by changing the metrics.properties file as below:
# Enable Graphite
*.sink.graphite.class=org.apache.spark.metrics.sink.GraphiteSink
*.sink.graphite.host=graphite-host
*.sink.graphite.port=2003
*.sink.graphite.period=5
*.sink.graphite.prefix=my-app
However, I noticed that we are getting only a subset of the metrics in Graphite compared to what we see in the monitoring web UI (http://localhost:4040). Are there any settings to get all the metrics (including accumulators) into Graphite?
I use this library to sink user-defined metrics from user code into Graphite: spark-metrics.
Initialise the metrics system on the driver side:
UserMetricsSystem.initialize(sc, "test_metric_namespace")
Then use Counter, Gauge, Histogram, or Meter like Spark accumulators:
UserMetricsSystem.counter("test_metric_name").inc(1L)
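Putting those together, a minimal sketch of counting processed records per micro-batch (the DStream stream and the metric name records_processed are placeholders; this assumes the spark-metrics jar is on the driver classpath):

import org.apache.spark.groupon.metrics.UserMetricsSystem

// Register the user metrics system once on the driver, under its own namespace.
UserMetricsSystem.initialize(sc, "test_metric_namespace")

// Bump a counter per micro-batch; the value flows through the regular Spark
// metrics system and therefore reaches the configured Graphite sink.
stream.foreachRDD { rdd =>
  UserMetricsSystem.counter("records_processed").inc(rdd.count())
}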
For Spark 2.0, you can specify --conf spark.app.id=job_name so that in Grafana, metrics from different runs of a job with multiple application IDs can share the same metric name. E.g. without setting spark.app.id, the metric name may include the application ID, like this:
job_name.application_id_1.metric_namespace.metric_name
But with spark.app.id set, it looks like:
job_name.unique_id.metric_namespace.metric_name
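For example (the job name and the rest of the submit command are placeholders):

spark-submit \
  --conf spark.app.id=job_name \
  ... # other submit options and the application jar as usual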
I'm trying to put in place global resource monitoring of a small cluster. The chosen stack:
- collectd on the nodes for data gathering
- InfluxDB as backend, using the official Docker container
- Grafana as frontend, again using the official container
The containers are launched on a central server. Grafana is able to connect to the InfluxDB source, and I updated the collectd agent (network plugin in collectd.conf) and InfluxDB (collectd plugin in influxdb.conf) to enable them to talk to each other.
But no data is showing up... There is not much log to check, but the InfluxDB data files are definitely empty and nothing comes up when querying.
Has anyone experienced such a setup? Any idea where to dig?
collectd conf extract:
# /etc/collectd/collectd.conf
<Plugin network>
Server "<public_IP_of_the_docker_host>" "25826"
</Plugin>
influxdb conf:
[input_plugins.collectd]
enabled = true
address = "public_IP_of_the_docker_host"
port = 25826
database = "collectd"
typesdb = "/usr/share/collectd/types.db"
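A couple of things worth verifying in this kind of setup (the commands below are generic sketches, not taken from the setup above): that the collectd UDP packets actually reach the Docker host, and that the InfluxDB container publishes the collectd port as UDP. It is also worth double-checking whether the address in the collectd input section is meant to be the listen address inside the container, in which case 0.0.0.0 is typically what you want rather than the host's public IP.

# On the Docker host: confirm the collectd packets actually arrive.
tcpdump -i any -n udp port 25826

# When starting the InfluxDB container: the collectd port must be published as UDP.
docker run -p 25826:25826/udp <other options> influxdb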