Spark 3.0 streaming metrics in Prometheus - apache-spark

I'm running a Spark 3.0 application (Spark Structured Streaming) on Kubernetes and I'm trying to use the new native Prometheus metric sink. I'm able to make it work and get all the metrics described here.
However, the metrics I really need are the ones exposed upon enabling the following config: spark.sql.streaming.metricsEnabled, as proposed in this Spark Summit presentation. Now, even with that config set to "true", I can't see any streaming metrics under /metrics/executors/prometheus as advertised. One thing to note: I can see them under /metrics/json, so we know the configuration was applied properly.
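For reference, this is roughly how I'm checking the two endpoints (the driver host placeholder and the default UI port 4040 are assumptions about the setup; adjust to yours):
curl http://<driver-host>:4040/metrics/json/                  # streaming metrics do show up here
curl http://<driver-host>:4040/metrics/executors/prometheus/  # but not here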
Why aren't streaming metrics sent to the Prometheus sink? Do I need to add some additional configuration? Is that not supported yet?

After quite a bit of investigation, I was able to make it work. In short, the Spark job k8s definition file needed one additional line to tell Spark where to find the metrics.properties config file.
Make sure to add the following line under sparkConf in the Spark job k8s definition file, and adjust it to your actual path (the metrics.properties file itself is placed at that path by your Dockerfile).
sparkConf:
  "spark.metrics.conf": "/etc/metrics/conf/metrics.properties"
For reference, here's the rest of my sparkConf, for metric-related config.
sparkConf:
  "spark.metrics.conf": "/etc/metrics/conf/metrics.properties"
  "spark.ui.prometheus.enabled": "true"
  "spark.kubernetes.driver.annotation.prometheus.io/scrape": "true"
  "spark.kubernetes.driver.annotation.prometheus.io/path": "/metrics/executors/prometheus/"
  "spark.kubernetes.driver.annotation.prometheus.io/port": "4040"
  "spark.sql.streaming.metricsEnabled": "true"
  "spark.metrics.appStatusSource.enabled": "true"
  "spark.kubernetes.driver.service.annotation.prometheus.io/scrape": "true"
  "spark.kubernetes.driver.service.annotation.prometheus.io/path": "/metrics/prometheus/"
  "spark.kubernetes.driver.service.annotation.prometheus.io/port": "4040"

Related

How can Spark read/write from Azurite?

I am trying to read (and eventually write) from Azurite (version 3.18.0) using Spark (3.1.1).
I can't figure out which Spark configurations and file URI I need to set to make this work properly.
For example, these are the containers and files I have inside Azurite:
/devstoreaccount1/container1/file1.avro
/devstoreaccount1/container2/file2.avro
This is the code that I'm running; the uri val is one of the values listed below:
val uri = ...
val spark = SparkSession.builder()
  .appName(appName)
  .master("local")
  .config("spark.driver.host", "127.0.0.1")
  .getOrCreate()
spark.conf.set("spark.hadoop.fs.wasbs.impl", "org.apache.hadoop.fs.azure.NativeAzureFileSystem")
spark.conf.set("spark.hadoop.fs.azure.account.auth.type.devstoreaccount1.blob.core.windows.net", "SharedKey")
spark.conf.set("spark.hadoop.fs.azure.account.key.devstoreaccount1.blob.core.windows.net", <azurite account key>)
spark.read.format("avro").load(uri)
URI value - which one is correct?
http://127.0.0.1:10000/container1/file1.avro
I get an UnsupportedOperationException when I perform the spark.read.format("avro").load(uri), because Spark uses the HttpFileSystem implementation, which doesn't support listStatus.
wasb://container1@devstoreaccount1.blob.core.windows.net/file1.avro
Spark tries to authenticate against the real Azure servers (and fails, for obvious reasons).
I have tried to follow this Stack Overflow post without success.
I have also tried to remove the blob.core.windows.net postfix from the configuration keys, but then I don't know how to give Spark the endpoint of the Azurite container.
So my question is: what are the correct configurations to give Spark so that it can read from Azurite, and what is the correct file path format to pass as the URI?
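I have also seen mentions of hadoop-azure's storage-emulator mode (fs.azure.storage.emulator.account.name), which supposedly points wasb:// URIs for that account at the local emulator endpoint instead of *.blob.core.windows.net. Something like the sketch below, though I haven't verified the property or the URI shape against Azurite 3.18.0:
// Sketch only: treat devstoreaccount1 as the local storage-emulator account.
// Property name and URI shape are assumptions, not verified against Azurite 3.18.0.
spark.conf.set("spark.hadoop.fs.azure.storage.emulator.account.name", "devstoreaccount1")
val emulatorUri = "wasb://container1@devstoreaccount1/file1.avro"
spark.read.format("avro").load(emulatorUri)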

Spark streaming: expose spark_streaming_* metrics

We are using Spark on Kubernetes (via the SparkOperator) and Prometheus to expose the metrics of the application. The application is a Spark Streaming app (NOT Structured Streaming).
The application used to run on an image with Spark version 2.4.7 and was later migrated to Spark 3.1.2.
After this migration, all spark_streaming_* metrics disappeared, such as spark_streaming_driver_totalreceivedrecords (as defined here: https://github.com/apache/spark/blob/master/streaming/src/main/scala/org/apache/spark/streaming/StreamingSource.scala#L53).
In general, the Prometheus setup seems to work, because when you curl the Prometheus port you can still see a bunch of other metrics - just none of the Spark Streaming ones.
The Spark image contains the Prometheus Java agent, and in the Helm chart of the streaming app the monitoring spec is configured to use it:
monitoring:
  exposeDriverMetrics: true
  exposeExecutorMetrics: true
  prometheus:
    jmxExporterJar: "/prometheus/jmx_prometheus_javaagent-0.11.0.jar"
    port: 8888
as described in the spark-operator documentation: https://github.com/GoogleCloudPlatform/spark-on-k8s-operator/blob/master/docs/user-guide.md#monitoring.
This is also how the setup used to work with Spark 2.4.7.
Are these metrics gone in Spark 3, or are we just missing some configuration?
Another side note: when you check the metrics via <spark-driver>:<ui-port>/metrics/json, you can see the desired metrics.
OK, it seems the spark-operator currently does not work properly with the metrics in Spark 3.
The issue can be fixed by provisioning the Spark image with a fixed Prometheus configuration file and then referencing it in your Helm chart:
monitoring:
  exposeDriverMetrics: true
  exposeExecutorMetrics: true
  prometheus:
    jmxExporterJar: "/prometheus/jmx_prometheus_javaagent-0.11.0.jar"
    port: 8888
    configFile: PATH_TO_THE_CONFIG
More info and a working Prometheus config can be found here: https://github.com/GoogleCloudPlatform/spark-on-k8s-operator/issues/1117
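The exact rule set is in the issue linked above. As a starting point, even a permissive catch-all JMX exporter config (a sketch; the file path is hypothetical) will at least surface the streaming MBeans again, albeit under auto-generated names:
# /prometheus/metrics-config.yaml (hypothetical path): pass every MBean through,
# then tighten the rules as in the linked issue to get the spark_streaming_* names back.
lowercaseOutputName: true
attrNameSnakeCase: true
rules:
  - pattern: ".*"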

How to read stderr logs from AWS logs

I am using EMR steps to run my jobs.
Typically, when I want to analyze the performance of a job or understand why it failed, I look at the Spark history server for DAG visualizations, job errors, etc.
For example, if the job failed due to a heap error, a FetchFailed, etc., I can see it clearly in the Spark history server.
However, I can't find such descriptions when I look at the stderr log files written to the LOG URI S3 bucket.
Is there a way to obtain such information?
I use PySpark and set the log level as follows:
sc = spark.sparkContext
sc.setLogLevel('DEBUG')
Any insight as to what I am doing wrong?
I haven't really tested this, but as it's a bit too long to fit in a comment, I'm posting it here as an answer.
As pointed out in my comment, the logs you're viewing in the Spark History Server UI aren't the same as the Spark driver logs that EMR saves to S3.
To get the event logs that the history server reads written to S3, you'll have to add some additional configuration to your cluster. These configuration options are described in the Monitoring and Instrumentation section of the Spark documentation.
In AWS EMR, you could try to add something like this into your cluster configuration:
...
{
    'Classification': 'spark-defaults',
    'Properties': {
        'spark.eventLog.dir': 's3a://your_bucket/spark_logs',
        'spark.history.fs.logDirectory': 's3a://your_bucket/spark_logs',
        'spark.eventLog.enabled': 'true'
    }
}
...
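On the nodes, the spark-defaults classification simply ends up in spark-defaults.conf, so the equivalent plain Spark settings are:
spark.eventLog.enabled           true
spark.eventLog.dir               s3a://your_bucket/spark_logs
spark.history.fs.logDirectory    s3a://your_bucket/spark_logs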
I found this interesting post which describes how to set this up for a Kubernetes cluster; you may want to check it for further details.

Sending Metrics: Spark to Graphite

We managed to get Spark (2.x) to send metrics to Graphite by changing the metrics.properties file as below:
# Enable Graphite
*.sink.graphite.class=org.apache.spark.metrics.sink.GraphiteSink
*.sink.graphite.host=graphite-host
*.sink.graphite.port=2003
*.sink.graphite.period=5
*.sink.graphite.prefix=my-app
However, I noticed that we are getting only a subset of the metrics in Graphite compared to what we see in the monitoring web UI (http://localhost:4040). Are there any settings to get all the metrics (including accumulators) into Graphite?
I use this library to sink user-defined metrics from user code into Graphite: spark-metrics
Initialise the metric system on the driver side:
UserMetricsSystem.initialize(sc, "test_metric_namespace")
Then use Counter, Gauge, Histogram, or Meter like Spark accumulators:
UserMetricsSystem.counter("test_metric_name").inc(1L)
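Put together, a minimal driver-side sketch (the import path assumes the Groupon spark-metrics artifact and is my assumption; verify it against the version you actually depend on):
// Assumed package name for the spark-metrics library mentioned above.
import org.apache.spark.groupon.metrics.UserMetricsSystem

// Initialise once on the driver, then use counters (and gauges, histograms,
// meters) from user code much like accumulators.
UserMetricsSystem.initialize(sc, "test_metric_namespace")
UserMetricsSystem.counter("test_metric_name").inc(1L)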
For Spark 2.0, you can specify --conf spark.app.id=job_name so that, in Grafana, metrics from different runs of the same job (each with its own application id) share the same metric name. E.g. without setting spark.app.id, the metric name may include the application id, like this:
job_name.application_id_1.metric_namespace.metric_name
But with spark.app.id set, it looks like this:
job_name.unique_id.metric_namespace.metric_name

How to configure SSL between Spark and Cassandra?

I'm trying to configure SSL for the Cassandra Spark connector, but I couldn't find an example of how to do it.
I'm trying to configure it like this:
SparkConf conf = new SparkConf().setAppName("someApp")
.set("spark.cassandra.connection.host", "111.111.111.111")
.set("spark.cassandra.connection.ssl.enabled", "true")
.set("spark.cassandra.connection.ssl.trustStore.path", "/some/tfile.jks")
.set("spark.cassandra.connection.ssl.trustStore.password", "apassword")
.set("spark.cassandra.connection.ssl.trustStore.type", "JKS")
.set("spark.cassandra.connection.ssl.enabledAlgorithms", "TLS_RSA_WITH_AES_128_CBC_SHA,TLS_RSA_WITH_AES_256_CBC_SHA")
.set("spark.cassandra.connection.ssl.keyStore.path", "/some/kfile.jks")
.set("spark.cassandra.connection.ssl.keyStore.password", "anotherpassword")
.set("spark.cassandra.connection.ssl.keyStore.type", "JKS")
.set("spark.cassandra.connection.ssl.protocol", "TLS");
When I try to submit the Spark job, I get these errors:
Exception in thread "main" com.datastax.spark.connector.util.ConfigCheck$ConnectorConfigurationException: Invalid Config Variables
Only known spark.cassandra.* variables are allowed when using the Spark Cassandra Connector.
spark.cassandra.connection.ssl.keyStore.password is not a valid Spark Cassandra Connector variable.
No likely matches found.
spark.cassandra.connection.ssl.enabled is not a valid Spark Cassandra Connector variable.
No likely matches found.
spark.cassandra.connection.ssl.protocol is not a valid Spark Cassandra Connector variable.
No likely matches found.
spark.cassandra.connection.ssl.keyStore.type is not a valid Spark Cassandra Connector variable.
No likely matches found.
spark.cassandra.connection.ssl.trustStore.path is not a valid Spark Cassandra Connector variable.
No likely matches found.
spark.cassandra.connection.ssl.enabledAlgorithms is not a valid Spark Cassandra Connector variable.
No likely matches found.
spark.cassandra.connection.ssl.keyStore.path is not a valid Spark Cassandra Connector variable.
No likely matches found.
spark.cassandra.connection.ssl.trustStore.password is not a valid Spark Cassandra Connector variable.
No likely matches found.
spark.cassandra.connection.ssl.trustStore.type is not a valid Spark Cassandra Connector variable.
No likely matches found.
So I'm not sure if this is supported or I'm just using the wrong property names.
I saw this ticket for release 1.2.3 of the connector, but I couldn't find an example of how to use it and it sounded like it may not support keystores. I'm using version 1.4.0-M1 of the connector.
Can anyone show me an example of how to configure SSL for the Spark Cassandra connector? Thanks.
Though I don't see any keystore configurations, I can see the config variables below, and they are working fine for me.
Note: I am using version 1.5.0-M1. Not sure if there is some other bug in the version you are using.
sparkConf.set("spark.cassandra.connection.ssl.enabled", "true");
sparkConf.set("spark.cassandra.connection.ssl.trustStore.password", "password");
sparkConf.set("spark.cassandra.connection.ssl.trustStore.path", "jks file path");
