Spark executor metrics don't reach prometheus sink - apache-spark

Circumstances:
I have read through these:
https://spark.apache.org/docs/3.1.2/monitoring.html
https://dzlab.github.io/bigdata/2020/07/03/spark3-monitoring-1/
Versions: Spark 3.1.2, K8s v19
I am submitting my application via
-c spark.ui.prometheus.enabled=true
-c spark.metrics.conf=/spark/conf/metric.properties
metric.properties:
*.sink.prometheusServlet.class=org.apache.spark.metrics.sink.PrometheusServlet
*.sink.prometheusServlet.path=/metrics/prometheus
Result:
Both of these endpoints have some metrics
<driver-ip>:4040/metrics/prometheus
<driver-ip>:4040/metrics/executors/prometheus
the first one - the driver one - has all the metrics
the second one - the executor one - has all the metrics except the ones under the executor namespace
described here: https://spark.apache.org/docs/3.1.2/monitoring.html#component-instance--executor
So everything is missing from bytesRead.count to threadpool.startedTasks
But these metrics are indeed reported by the executors, because I can see them under /api/v1/applications/app-id/stages/stage-id too.
I have struggled with this for hours: moving the configs to --conf flags, splitting the configs by instance, enabling everything, etc. No result.
However if I change the sink from prometheus to ConsoleSink:
*.sink.console.class=org.apache.spark.metrics.sink.ConsoleSink
*.sink.console.period=10
*.sink.console.unit=seconds
Then the metrics appear successfully.
So something is definitely wrong with the Spark-K8s-Prometheus integration.
Note:
One interesting thing is what happens if I split the config by instance, like this:
driver.sink.prometheusServlet.class=org.apache.spark.metrics.sink.PrometheusServlet
executor.sink.prometheusServlet.class=org.apache.spark.metrics.sink.PrometheusServlet
driver.sink.prometheusServlet.path=/metrics/prometheus1
executor.sink.prometheusServlet.path=/metrics/executor/prometheus1
(note the trailing '1' at the end)
Then the executor sink path is not taken into account: the driver metrics will be on
/metrics/prometheus1, but the executors will still be on /metrics/executors/prometheus.
The class config does take effect, because if I change it to a nonexistent class, the executor throws an error as expected.
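
The comparison of the two endpoints described above can be scripted. The following is a minimal sketch, assuming the endpoint returns the standard Prometheus text format. The helper names and the sample lines are hypothetical; real input would be the body fetched from <driver-ip>:4040/metrics/executors/prometheus.

```python
import re

# Matches the metric name at the start of a Prometheus text-format sample line.
_NAME = re.compile(r"^([A-Za-z_:][A-Za-z0-9_:]*)")


def metric_names(exposition_text):
    """Return the set of metric names found in Prometheus text-format output."""
    names = set()
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks and HELP/TYPE comments
            continue
        match = _NAME.match(line)
        if match:
            names.add(match.group(1))
    return names


def missing_metrics(exposition_text, expected_fragments):
    """Return the expected name fragments that appear in no metric name."""
    names = metric_names(exposition_text)
    return [f for f in expected_fragments if not any(f in n for n in names)]


# Hypothetical sample, not real Spark output.
sample = """# TYPE example_rddBlocks gauge
example_rddBlocks{executor_id="1"} 0
example_memoryUsed_bytes{executor_id="1"} 1024
"""
print(missing_metrics(sample, ["bytesRead", "rddBlocks"]))
```

Running the same check against the driver endpoint and the executor endpoint gives a quick list of which name fragments are absent from each.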

I have been trying to understand why custom user metrics are not sent to the driver, while the regular Spark metrics are.
It looks like the PrometheusSink uses the class ExecutorSummary, which doesn't allow adding custom metrics.
For the moment, it seems the only working approach is to use the JMXExporter (with the Java agent exporting to Prometheus), or just to use the ConsoleSink with
*.sink.console.class=org.apache.spark.metrics.sink.ConsoleSink

Related

Where to check the pool stats of FAIR scheduler in Spark Web UI

I see my Spark application is using FAIR scheduler:
But I can't confirm whether it is using the two pools I set up (pool1, pool2). Here is a thread function I implemented in PySpark which is called twice, once with "pool1" and once with "pool2".
def do_job(f1, f2, id, pool_name, format="json"):
    spark.sparkContext.setLocalProperty("spark.scheduler.pool", pool_name)
    ...
I thought the "Stages" menu is supposed to show the pool info but I don't see it. Does that mean the pools are not set up correctly or am I looking at the wrong place?
I am using PySpark 3.3.0 on top of EMR 6.9.0
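
Besides the web UI, pool assignment can also be checked from the stage list returned by Spark's monitoring REST API (/api/v1/applications/[app-id]/stages). The sketch below assumes each stage object in that JSON carries a `schedulingPool` field alongside `stageId`. The helper name and the sample records are hypothetical, and the sample stands in for the parsed JSON response.

```python
from collections import defaultdict


def stages_by_pool(stages):
    """Group stage IDs by the scheduler pool each stage reports."""
    pools = defaultdict(list)
    for stage in stages:
        # Assumed field name; falls back to "unknown" if it is absent.
        pools[stage.get("schedulingPool", "unknown")].append(stage["stageId"])
    return dict(pools)


# Hypothetical sample standing in for the parsed REST response.
sample_stages = [
    {"stageId": 0, "schedulingPool": "pool1"},
    {"stageId": 1, "schedulingPool": "pool2"},
    {"stageId": 2, "schedulingPool": "pool1"},
]
print(stages_by_pool(sample_stages))
```

If every stage ends up under a single pool name, the local property is probably not being set on the thread that submits the job.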
You can confirm it as shown in this diagram.
Please refer to my article: I created 3 pools (module1, module2, module3) based on certain logic.
Each job uses a specific pool, as above. Based on this I created the diagrams below.
Note: Please see the verification steps in the article I gave.

Submitting multiple runs to the same node on AzureML

I want to perform hyperparameter search using AzureML. My models are small (around 1GB) thus I would like to run multiple models on the same GPU/node to save costs but I do not know how to achieve this.
The way I currently submit jobs is the following (resulting in one training run per GPU/node):
from azureml.core import Experiment, ScriptRunConfig

experiment = Experiment(workspace, experiment_name)
config = ScriptRunConfig(source_directory="./src",
                         script="train.py",
                         compute_target="gpu_cluster",
                         environment="env_name",
                         arguments=["--args args"])
run = experiment.submit(config)
ScriptRunConfig can be provided with a distributed_job_config. I tried to use MpiConfiguration there but if this is done the run fails due to an MPI error that reads as if the cluster is configured to only allow one run per node:
Open RTE detected a bad parameter in hostfile: [...]
The max_slots parameter is less than the slots parameter:
slots = 3
max_slots = 1
[...] ORTE_ERROR_LOG: Bad Parameter in file util/hostfile/hostfile.c at line 407
Using HyperDriveConfig also defaults to submitting one run to one GPU and additionally providing a MpiConfiguration leads to the same error as shown above.
I guess I could always rewrite my train script to train multiple models in parallel, so that each run wraps multiple trainings. I would like to avoid this option though, because logging and checkpoint writes then become increasingly messy and it would require a large refactor of the train pipeline. Also, this functionality seems so basic that I hope there is a way to do it gracefully. Any ideas?
Use the Run.create_children method, which starts child runs that are “local” to the parent run and don’t need authentication.
For AmlCompute, max_concurrent_runs maps to the maximum number of nodes that will be used to run a hyperparameter tuning run.
So there would be 1 execution per node.
For deployment, a single service is deployed, but you can load multiple model versions in the init function; the score function then uses a particular model version depending on the request’s parameter.
Alternatively, use the new ML Endpoints (Preview).
See: What are endpoints (preview) - Azure Machine Learning | Microsoft Docs
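
The "parent run that starts local children" idea can be sketched with the standard library alone. This is not AzureML API: it only shows one process on a node launching several training commands with bounded concurrency. The helper name and the stand-in commands are hypothetical; in practice each command would be something like a `train.py` invocation, and each child could log to its own child run.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor


def run_concurrently(commands, max_parallel):
    """Run each command as a subprocess, at most max_parallel at a time.

    Returns the exit codes in the same order as the commands.
    """
    def _run(cmd):
        return subprocess.run(cmd, check=False).returncode

    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        return list(pool.map(_run, commands))


# Hypothetical stand-ins for real training invocations.
commands = [
    [sys.executable, "-c", f"print('training with lr={lr}')"]
    for lr in (0.1, 0.01, 0.001)
]
exit_codes = run_concurrently(commands, max_parallel=3)
print(exit_codes)
```

Threads are enough here because each worker only waits on a subprocess; the training itself runs in separate processes that share the node's GPU.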

Is there a reference of Spark Log4j properties?

I've been trying to find a reference of all the log4j properties for Spark and am having a hard time finding one. I've found a lot of examples where people seem to have pieces of it, but I'm trying to see if there's a reference somewhere that has all of them.
For my particular use case, I'm writing some code that performs a series of data transformations by firing off a spark-submit job, that can then be used/extended by other users. I don't need most of what spark spits out by default and it's easy to just set something like log4j.rootLogger=WARN,stdout. However, there's some useful bits in INFO that would be good to have printed to the screen. In particular:
org.apache.spark.deploy.yarn.Client (Logging.scala:logInfo(54)) -
client token: Token { kind: YARN_CLIENT_TOKEN, service: }
diagnostics: N/A
ApplicationMaster host: N/A
ApplicationMaster RPC port: -1
queue: ****
start time: 1508185983070
final status: UNDEFINED
tracking URL: ***My tracking URL***
user: ***User***
And even more specifically, the tracking URL. My limited knowledge of Log4j probably also makes this a bit tough. I've tried doing something like:
org.apache.spark.deploy.yarn.Client=Info
But that doesn't appear to be a legit logging property. Is there a way to only get that piece of info in Spark? Is there a trick to seeing all the possible logging properties to set?
Thanks!
Update
I was able to figure this out. Most of it was due to me not knowing how log4j.properties works, but I have a much better handle on it now.
You can set the logger and log level per class, and that setting carries down to everything beneath it in the logger hierarchy.
I changed my log4j.properties to look something like this:
log4j.logger.org.apache.spark=INFO, RollingAppender
log4j.additivity.org.apache.spark=false
log4j.logger.org.apache.hadoop=INFO, RollingAppender
log4j.additivity.org.apache.hadoop=false
log4j.logger.org.spark_project.jetty=INFO, RollingAppender
log4j.additivity.org.spark_project.jetty=false
log4j.logger.org.apache.spark.deploy.yarn.Client=INFO, RollingAppender
log4j.additivity.org.apache.spark.deploy.yarn.Client=false
And that redirects pretty much all Spark on YARN logs to a file (slightly modified from the link Thiago shared).
The key things I was missing...
1) I needed to include log4j.logger.CLASS_NAME; I was missing the log4j.logger prefix.
2) I needed log4j.additivity.CLASS_NAME=false. Without this, INFO messages are also passed up to the parent loggers' appenders, so they still get logged under the default setting.
It's pretty confusing at first but starts to make a bit of sense once you get the pattern down.
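
Because every logger follows the same two-line pattern, the entries can be generated rather than typed by hand. This is a minimal sketch: the helper name is hypothetical, and the appender name RollingAppender is taken from the snippet above and is assumed to be defined elsewhere in log4j.properties.

```python
def logger_lines(logger_names, level="INFO", appender="RollingAppender"):
    """Build the log4j.logger / log4j.additivity pair for each logger name."""
    lines = []
    for name in logger_names:
        lines.append(f"log4j.logger.{name}={level}, {appender}")
        # additivity=false keeps messages from also reaching parent appenders
        lines.append(f"log4j.additivity.{name}=false")
    return lines


print("\n".join(logger_lines([
    "org.apache.spark",
    "org.apache.spark.deploy.yarn.Client",
])))
```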
I suggest you take a look at this article on Hacker Noon:
https://hackernoon.com/how-to-log-in-apache-spark-f4204fad78a
Generating your own logs in Spark is a bit more complex when the application runs on YARN and is launched through spark-submit.

Slurm sinfo format

When I use "sinfo" in Slurm, I see an asterisk next to one of the partitions (like: RUNNING-CLUSTER*).
The partition looks fine and all nodes under it are idle.
When I run a simple script with "sleep 300", for example, I can see the jobs in the queue (using "squeue"), but they run for a few seconds and then end. There is no error message (the log shows that they failed, but gives no more info).
Any idea what the asterisk is for?
Couldn't find it in the manual.
Thanks.
The "*" following the partition name indicates that this is the default partition for submitted jobs. LLNL provides documentation that directly supports my findings:
LLNL Documentation
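
The default partition can also be picked out of sinfo output programmatically. This is a minimal sketch, assuming the usual layout with one header row and the partition name in the first column. The helper name and the sample text are hypothetical, not captured from a real cluster.

```python
def default_partition(sinfo_output):
    """Return the partition name marked with a trailing '*', or None."""
    for line in sinfo_output.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if fields and fields[0].endswith("*"):
            return fields[0].rstrip("*")
    return None


# Hypothetical sample output.
sample = """PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
RUNNING-CLUSTER* up infinite 4 idle node[1-4]
debug up 1:00:00 2 idle node[5-6]
"""
print(default_partition(sample))
```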

Running multiple Apache Nutch fetch map tasks on a Hadoop Cluster

I am unable to run multiple fetch map tasks for Nutch 1.7 on Hadoop YARN.
I am using the bin/crawl script and made the following tweaks to trigger a fetch with multiple map tasks, but it still does not work.
Added maxNumSegments and numFetchers parameters to the generate phase.
$bin/nutch generate $commonOptions $CRAWL_PATH/crawldb $CRAWL_PATH/segments -maxNumSegments $numFetchers -numFetchers $numFetchers -noFilter
Removed the topN parameter and removed the noParsing parameter because I want the parsing to happen at the time of fetch.
$bin/nutch fetch $commonOptions -D fetcher.timelimit.mins=$timeLimitFetch $CRAWL_PATH/segments/$SEGMENT -threads $numThreads #-noParsing#
The generate phase is not generating more than one segment.
As a result, the fetch phase is not creating multiple map tasks. Also, I believe that, as the script is written, it does not allow the fetch to fetch multiple segments even if the generate phase were to generate multiple segments.
Can someone please let me know how they got the script to run in a distributed Hadoop cluster? Or is there a different version of the script that should be used?
Thanks.
Are you using Nutch 1.xx for this? In this case, the Generator class looks for a flag called "mapred.job.tracker" and tries to see if it is local. This property has been deprecated in Hadoop2 and the default value is set to local. You will have to overwrite the value of the property to something other than local and the Generator will generate multiple partitions for the segments.
I've recently faced this problem and thought it'd be a good idea to build upon Keith's answer to provide a more thorough explanation about how to solve this issue.
I've tested this with Nutch 1.10 and Hadoop 2.4.0.
As Keith said, the if block on line 542 in Generator.java reads the mapred.job.tracker property and sets the variable numLists to 1 if the property is local. This variable seems to control the number of reduce tasks and also influences the number of map tasks.
Overwriting the value of said property in mapred-site.xml fixes this:
<property>
    <name>mapred.job.tracker</name>
    <value>distributed</value>
</property>
(Or any other value you like except local).
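
A quick way to confirm the override is in place is to read the property back from the XML. This is a minimal sketch: the helper name is hypothetical, and the sample string stands in for the contents of mapred-site.xml, assuming the usual `<configuration>` root element.

```python
import xml.etree.ElementTree as ET


def property_value(xml_text, name):
    """Return the <value> of the named Hadoop <property>, or None if absent."""
    root = ET.fromstring(xml_text)
    for prop in root.iter("property"):
        if prop.findtext("name") == name:
            return prop.findtext("value")
    return None


# Hypothetical stand-in for the contents of mapred-site.xml.
sample = """<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>distributed</value>
  </property>
</configuration>"""

value = property_value(sample, "mapred.job.tracker")
# A missing property counts as "local", since that is the default described above.
print(value is not None and value != "local")
```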
The problem is this wasn't enough in my case to generate more than one fetch map task. I also had to update the value of the numSlaves parameter in the runtime/deploy/bin/crawl script. I didn't find any mentions of this parameter in the Nutch 1.x docs so I stumbled upon it after a bit of trial and error.
#############################################
# MODIFY THE PARAMETERS BELOW TO YOUR NEEDS #
#############################################
# set the number of slaves nodes
numSlaves=3
# and the total number of available tasks
# sets Hadoop parameter "mapred.reduce.tasks"
numTasks=`expr $numSlaves \* 2`
...
