Sending Metrics: Spark to Graphite - apache-spark

We managed to get Spark (2.x) to send metrics to graphite by changing the metrics.properties file as below:
# Enable Graphite
*.sink.graphite.class=org.apache.spark.metrics.sink.GraphiteSink
*.sink.graphite.host=graphite-host
*.sink.graphite.port=2003
*.sink.graphite.period=5
*.sink.graphite.prefix=my-app
However I noticed that we are getting only a subset of the metrics in graphite compared to what we get under Monitoring Web UI (http://localhost:4040). Is there any settings to get all the metrics (including Accumulators) in graphite?

I use this library to sink user defined metrics in user code into Graphite: spark-metrics
Initialise the metric system in driver side:
UserMetricsSystem.initialize(sc, "test_metric_namespace")
Then use Counter Gauge Histogram Meter like Spark Accumulators:
UserMetricsSystem.counter("test_metric_name").inc(1L)
For Spark 2.0, you can specify --conf spark.app.id=job_name so that in Grafana, metrics from different job run with multiple application id could have the same metric name. E.g. without setting spark.app.id, the metric name may include application id like this:
job_name.application_id_1.metric_namespace.metric_name
But with setting spark.app.id, it looks like:
job_name.unique_id.metric_namespace.metric_name

Related

Spark streaming: expose spark_streaming_* metrics

We are using spark on kubernetes (using the SparkOperator) and Prometheus to expose the metrics of the application. The application is a spark streaming app (NOT structured streaming).
The application used to run on an image with spark version 2.4.7 and was later migrated to spark 3.1.2.
Because of this migration all spark_streaming_* metrics disappeared, like spark_streaming_driver_totalreceivedrecords (as defined here https://github.com/apache/spark/blob/master/streaming/src/main/scala/org/apache/spark/streaming/StreamingSource.scala#L53)
In general the Prometheus setup seems to work because when you curl the prometheus port you can still see a bunch of other metrics - just none of the spark streaming metrics.
The spark-image contains the prometheus-java agent and in the helm chart of the stream-app the monitoring spec is configured to use it
monitoring:
exposeDriverMetrics: true
exposeExecutorMetrics: true
prometheus:
jmxExporterJar: "/prometheus/jmx_prometheus_javaagent-0.11.0.jar"
port: 8888
as described in the docu of the spark-operator https://github.com/GoogleCloudPlatform/spark-on-k8s-operator/blob/master/docs/user-guide.md#monitoring.
This is also how the setup used to work with spark 2.4.7
Are this metrics gone in spark3? Or are we maybe just missing some configuration?
Another side note: when you check the metrics via <spark-driver>:<ui-port>/metrics/json you can see the desired metrics
ok it seems the spark-operator currently is not working properly with the current metrics in Spark3.
The issue could be fixed by provisioning the spark-image with a fixed prometheus configuration file and then use it in your helm chart
monitoring:
exposeDriverMetrics: true
exposeExecutorMetrics: true
prometheus:
jmxExporterJar: "/prometheus/jmx_prometheus_javaagent-0.11.0.jar"
port: 8888
configFile: PATH_TO_THE_CONFIG
more infos and a working prometheus config can be found here: https://github.com/GoogleCloudPlatform/spark-on-k8s-operator/issues/1117

How to read stderr logs from AWS logs

I am using EMR steps to run my jobs.
Typically when I want to analyze the performance of a job or to understand why it failed, I look at the spark history server for DAG visualizations, and job errors, etc.
For example, if the job failed due to heap error, or Fetchfailed, etc, I can see it clearly specified in the spark history server.
However, I can't seem to be able to find such descriptions when I look at the stderr log files that are written to the LOG URI S3 bucket.
Is there a way to obtain such information?
I use pyspark and set the log level to
sc = spark.sparkContext
sc.setLogLevel('DEBUG')
Any insight as to what I am doing wrong?
I haven't really tested this but as it's a bit long to fit in a comment, I post it here as an answer.
Like pointed out in my comment, the logs you're viewing using Spark History Server UI aren't the same as the Spark driver logs that are saved to S3 from EMR.
To get the spark history server logs written into S3, you'll have to add some additional configuration to your cluster. These configuration options are described in the section Monitoring and Instrumentation of Spark documentation.
In AWS EMR, you could try to add something like this into your cluster configuration:
...
{
'Classification': 'spark-defaults',
'Properties': {
'spark.eventLog.dir': 's3a://your_bucket/spark_logs',
'spark.history.fs.logDirectory': 's3a://your_bucket/spark_logs',
'spark.eventLog.enabled': 'true'
}
}
...
I found this interesting post which describes how to set this for Kubernetes cluster, you may want to check it for further details.

Spark 3.0 streaming metrics in Prometheus

I'm running a Spark 3.0 application (Spark Structured Streaming) on Kubernetes and I'm trying to use the new native Prometheus metric sink. I'm able to make it work and get all the metrics described here.
However, the metrics I really need are the ones provided upon enabling the following config: spark.sql.streaming.metricsEnabled, as proposed in this Spark Summit presentation. Now, even with that config set to "true", I can't see any streaming metrics under /metrics/executors/prometheus as advertised. One thing to note is that I can see them under metrics/json, therefore, we know that the configuration was properly applied.
Why aren't streaming metrics sent to the Prometheus sink? Do I need to add some additional configuration? Is that not supported yet?
After quite a bit of investigation, I was able to make it work. In short, the Spark job k8s definition file needed one additional line, to tell spark where to find the metrics.propreties config file.
Make sure to add the following line under sparkConf in the Spark job k8s definition file, and adjust it to your actual path. The path to the metrics.properties file should be set in your Dockerfile.
sparkConf:
"spark.metrics.conf": "/etc/metrics/conf/metrics.properties"
For reference, here's the rest of my sparkConf, for metric-related config.
sparkConf:
"spark.metrics.conf": "/etc/metrics/conf/metrics.properties"
"spark.ui.prometheus.enabled": "true"
"spark.kubernetes.driver.annotation.prometheus.io/scrape": "true"
"spark.kubernetes.driver.annotation.prometheus.io/path": "/metrics/executors/prometheus/"
"spark.kubernetes.driver.annotation.prometheus.io/port": "4040"
"spark.sql.streaming.metricsEnabled": "true"
"spark.metrics.appStatusSource.enabled": "true"
"spark.kubernetes.driver.service.annotation.prometheus.io/scrape": "true"
"spark.kubernetes.driver.service.annotation.prometheus.io/path": "/metrics/prometheus/"
"spark.kubernetes.driver.service.annotation.prometheus.io/port": "4040"

What does "avoid multiple Kudu clients per cluster" mean?

I am looking at kudu's documentation.
Below is a partial description of kudu-spark.
https://kudu.apache.org/docs/developing.html#_avoid_multiple_kudu_clients_per_cluster
Avoid multiple Kudu clients per cluster.
One common Kudu-Spark coding error is instantiating extra KuduClient objects. In kudu-spark, a KuduClient is owned by the KuduContext. Spark application code should not create another KuduClient connecting to the same cluster. Instead, application code should use the KuduContext to access a KuduClient using KuduContext#syncClient.
To diagnose multiple KuduClient instances in a Spark job, look for signs in the logs of the master being overloaded by many GetTableLocations or GetTabletLocations requests coming from different clients, usually around the same time. This symptom is especially likely in Spark Streaming code, where creating a KuduClient per task will result in periodic waves of master requests from new clients.
Does this mean that I can only run one kudu-spark task at a time?
If I have a spark-streaming program that is always writing data to the kudu,
How can I connect to kudu with other spark programs?
In a non-Spark program you use a KUDU Client for accessing KUDU. With a Spark App you use a KUDU Context that has such a Client already, for that KUDU cluster.
Simple JAVA program requires a KUDU Client using JAVA API and maven
approach.
KuduClient kuduClient = new KuduClientBuilder("kudu-master-hostname").build();
See http://harshj.com/writing-a-simple-kudu-java-api-program/
Spark / Scala program of which many can be running at the same time
against the same Cluster using Spark KUDU Integration. Snippet
borrowed from official guide as quite some time ago I looked at this.
import org.apache.kudu.client._
import collection.JavaConverters._
// Read a table from Kudu
val df = spark.read
.options(Map("kudu.master" -> "kudu.master:7051", "kudu.table" -> "kudu_table"))
.format("kudu").load
// Query using the Spark API...
df.select("id").filter("id >= 5").show()
// ...or register a temporary table and use SQL
df.registerTempTable("kudu_table")
val filteredDF = spark.sql("select id from kudu_table where id >= 5").show()
// Use KuduContext to create, delete, or write to Kudu tables
val kuduContext = new KuduContext("kudu.master:7051", spark.sparkContext)
// Create a new Kudu table from a dataframe schema
// NB: No rows from the dataframe are inserted into the table
kuduContext.createTable("test_table", df.schema, Seq("key"),
new CreateTableOptions()
.setNumReplicas(1)
.addHashPartitions(List("key").asJava, 3))
// Insert data
kuduContext.insertRows(df, "test_table")
See https://kudu.apache.org/docs/developing.html
The more clear statement of "avoid multiple Kudu clients per cluster" is "avoid multiple Kudu clients per spark application".
Instead, application code should use the KuduContext to access a KuduClient using KuduContext#syncClient.

real time log processing using apache spark streaming

I want to create a system where I can read logs in real time, and use apache spark to process it. I am confused if I should use something like kafka or flume to pass the logs to spark stream or should I pass the logs using sockets. I have gone through a sample program in the spark streaming documentation- Spark stream example. But I will be grateful if someone can guide me a better way to pass logs to spark stream. Its kind of a new turf to me.
Apache Flume may help to read the logs in real time.
Flume provides logs collection and transport to the application where Spark Streaming is used to analyze required information.
1. Download Apache Flume from official site or follow the instructions from here
2. Setup and run Flume
modify flume-conf.properties.template from the directory where Flume is installed (FLUME_INSTALLATION_PATH\conf), here you need to provide logs source, channel and sinks (output). More details about setup here
There is an example of launching flume which collects log information from ping comand running on windows host and writes it to a file:
flume-conf.properties
agent.sources = seqGenSrc
agent.channels = memoryChannel
agent.sinks = loggerSink
agent.sources.seqGenSrc.type = exec
agent.sources.seqGenSrc.shell = powershell -Command
agent.sources.seqGenSrc.command = for() { ping google.com }
agent.sources.seqGenSrc.channels = memoryChannel
agent.sinks.loggerSink.type = file_roll
agent.sinks.loggerSink.channel = memoryChannel
agent.sinks.loggerSink.sink.directory = D:\\TMP\\flu\\
agent.sinks.loggerSink.serializer = text
agent.sinks.loggerSink.appendNewline = false
agent.sinks.loggerSink.rollInterval = 0
agent.channels.memoryChannel.type = memory
agent.channels.memoryChannel.capacity = 100
To run the example go to FLUME_INSTALLATION_PATH and execute
java -Xmx20m -Dlog4j.configuration=file:///%CD%\conf\log4j.properties -cp .\lib\* org.apache.flume.node.Application -f conf\flume-conf.properties -n agent
OR you may create your java application that has flume libraries in a classpath and call org.apache.flume.node.Application instance from the application passing corresponding arguments.
How to setup Flume to collect and transport logs?
You can use some script for gathering logs from the specified location
agent.sources.seqGenSrc.shell = powershell -Command
agent.sources.seqGenSrc.command = your script here
instead of windows script you also can launch java application (put 'java path_to_main_class arguments' in field) which provides smart logs collection. For example, if the file is modified in real-time you can use Tailer from Apache Commons IO.
To configure the Flume to transport the log infromation read this article
3. Get the Flume stream from your source code and analyze it with Spark.
Take a look on a code sample from github https://github.com/apache/spark/blob/master/examples/src/main/java/org/apache/spark/examples/streaming/JavaFlumeEventCount.java
You can use Apache Kafka as queue system for your logs. The system that generated your logs e.g websever will send logs to Apache KAFKA. Then you can use apache storm or spark streaming library to read from KAFKA topic and process logs at real time.
You need to create stream of logs , which you can create using Apache Kakfa. There are integration available for kafka with storm and apache spark. both has its pros and cons.
For Storm Kafka Integration look here
For Apache Spark Kafka Integration take a look here
Although this is a old question, posting a link from Databricks, which has a great step by step article for log analysis with Spark considering many areas.
https://databricks.gitbooks.io/databricks-spark-reference-applications/content/logs_analyzer/index.html
Hope this helps.

Resources