Is there a reference of Spark Log4j properties? - apache-spark

I've been trying to find a reference of all the Log4j properties for Spark and am having a hard time finding one. I've found a lot of examples where people seem to have pieces of it, but I'm trying to see if there's a reference somewhere that has all of them.
For my particular use case, I'm writing some code that performs a series of data transformations by firing off a spark-submit job, which can then be used/extended by other users. I don't need most of what Spark spits out by default, and it's easy to just set something like log4j.rootLogger=WARN,stdout. However, there are some useful bits at INFO that would be good to have printed to the screen. In particular:
org.apache.spark.deploy.yarn.Client (Logging.scala:logInfo(54)) -
client token: Token { kind: YARN_CLIENT_TOKEN, service: }
diagnostics: N/A
ApplicationMaster host: N/A
ApplicationMaster RPC port: -1
queue: ****
start time: 1508185983070
final status: UNDEFINED
tracking URL: ***My tracking URL***
user: ***User***
And even more specifically, the tracking URL. My limited knowledge of Log4j probably also makes this a bit tougher. I've tried doing something like:
org.apache.spark.deploy.yarn.Client=Info
But that doesn't appear to be a legit logging property. Is there a way to only get that piece of info in Spark? Is there a trick to seeing all the possible logging properties to set?
Thanks!
Update
I was able to figure this out. Most of it was due to me not knowing how log4j.properties works, but I have a much better handle on it now.
You can set the logger and log level per class, and that setting cascades down to all child loggers/classes.
I changed my log4j.properties to look something like this:
log4j.logger.org.apache.spark=INFO, RollingAppender
log4j.additivity.org.apache.spark=false
log4j.logger.org.apache.hadoop=INFO, RollingAppender
log4j.additivity.org.apache.hadoop=false
log4j.logger.org.spark_project.jetty=INFO, RollingAppender
log4j.additivity.org.spark_project.jetty=false
log4j.logger.org.apache.spark.deploy.yarn.Client=INFO, RollingAppender
log4j.additivity.org.apache.spark.deploy.yarn.Client=false
And that redirects pretty much all Spark-on-YARN logging to a file (slightly modified from the link Thiago shared).
The key things I was missing:
1) I needed the log4j.logger. prefix in front of the class name; I was missing the log4j.logger bit.
2) I needed log4j.additivity.CLASS_NAME=false. Without it, the messages also propagate up to the root logger and get logged with the default settings on top of the per-class one.
It's pretty confusing at first but starts to make sense once you get the pattern down.
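For the original ask (surface the YARN Client lines, including the tracking URL, while keeping everything else at WARN on the console), a minimal log4j.properties sketch along these lines should do it; the appender name and pattern are just illustrative:
# Everything at WARN on the console by default
log4j.rootLogger=WARN, stdout
log4j.appender.stdout=org.apache.log4j.ConsoleAppender
log4j.appender.stdout.layout=org.apache.log4j.PatternLayout
log4j.appender.stdout.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n
# Re-enable INFO just for the YARN client, which prints the tracking URL
log4j.logger.org.apache.spark.deploy.yarn.Client=INFO, stdout
log4j.additivity.org.apache.spark.deploy.yarn.Client=false
Then either drop it into $SPARK_HOME/conf/ or point spark-submit at it with --driver-java-options "-Dlog4j.configuration=file:/path/to/log4j.properties" (the path is a placeholder).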

I suggest you take a look at this article on Hacker Noon:
https://hackernoon.com/how-to-log-in-apache-spark-f4204fad78a
Generating your own logs from a Spark application submitted to YARN via spark-submit is a little more involved, and the article walks through it.
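If what you need from PySpark is simply to emit your own messages through the same Log4j configuration Spark is already using (a common trick, not necessarily what the article does; the logger name below is made up), you can go through the Py4J gateway:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

# Reuse the JVM-side Log4j that spark-submit configured via log4j.properties
log4j = sc._jvm.org.apache.log4j
logger = log4j.LogManager.getLogger("MyTransformJob")
logger.info("Starting transformations")
logger.warn("Something worth flagging")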

Related

Where to check the pool stats of FAIR scheduler in Spark Web UI

I can see that my Spark application is using the FAIR scheduler.
But I can't confirm whether it is using the two pools I set up (pool1, pool2). Here is a thread function I implemented in PySpark which is called twice: once with "pool1" and once with "pool2".
def do_job(f1, f2, id, pool_name, format="json"):
    spark.sparkContext.setLocalProperty("spark.scheduler.pool", pool_name)
    ...
I thought the "Stages" menu is supposed to show the pool info, but I don't see it. Does that mean the pools are not set up correctly, or am I looking at the wrong place?
I am using PySpark 3.3.0 on top of EMR 6.9.0.
You can confirm it as shown in this diagram.
Please refer to my article: I created three pools (module1, module2, module3) based on certain logic, and each job uses a specific pool, as above. Based on that, I created the diagrams below.
Note: please see the verification steps in the article I linked.
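To make the verification concrete, here is a minimal sketch (not the answerer's exact setup; the pool names, allocation file, and dummy jobs are assumptions) that submits work from two threads into named pools. With spark.scheduler.mode=FAIR enabled, the Stages page of the Spark UI should then show a Fair Scheduler Pools table and a pool name for each stage:
import threading
from pyspark.sql import SparkSession

# Assumes pool1/pool2 are defined in fairscheduler.xml and the app is launched with
# spark.scheduler.mode=FAIR and spark.scheduler.allocation.file pointing at that file.
spark = SparkSession.builder.appName("pool-check").getOrCreate()

def run_in_pool(pool_name):
    # The pool assignment is thread-local, so set it inside the worker thread
    spark.sparkContext.setLocalProperty("spark.scheduler.pool", pool_name)
    spark.range(50_000_000).selectExpr("sum(id)").collect()  # dummy job to generate stages

threads = [threading.Thread(target=run_in_pool, args=(p,)) for p in ("pool1", "pool2")]
for t in threads:
    t.start()
for t in threads:
    t.join()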

Spark executor metrics don't reach prometheus sink

Circumstances:
I have read through these:
https://spark.apache.org/docs/3.1.2/monitoring.html
https://dzlab.github.io/bigdata/2020/07/03/spark3-monitoring-1/
Versions: Spark 3.1.2, K8s v19
I am submitting my application via
-c spark.ui.prometheus.enabled=true
-c spark.metrics.conf=/spark/conf/metric.properties
metric.properties:
*.sink.prometheusServlet.class=org.apache.spark.metrics.sink.PrometheusServlet
*.sink.prometheusServlet.path=/metrics/prometheus
Result:
Both of these endpoints have some metrics
<driver-ip>:4040/metrics/prometheus
<driver-ip>:4040/metrics/executors/prometheus
The first one (the driver endpoint) has all the metrics.
The second one (the executor endpoint) has all the metrics except the ones under the executor namespace, described here: https://spark.apache.org/docs/3.1.2/monitoring.html#component-instance--executor
So everything from bytesRead.count to threadpool.startedTasks is missing.
But these metrics are indeed reported by the executors, because I can also see them under /api/v1/applications/app-id/stages/stage-id.
I have struggled with this for hours: moving the configs to the --conf flag, splitting the configs up by instance, enabling everything, etc. No result.
However if I change the sink from prometheus to ConsoleSink:
*.sink.console.class=org.apache.spark.metrics.sink.ConsoleSink
*.sink.console.period=10
*.sink.console.unit=seconds
Then the metrics appear successfully.
So something is definitely wrong with the Spark-K8s-Prometheus integration.
Note:
One interesting thing is that if I split up the config by instance, like
driver.sink.prometheusServlet.class=org.apache.spark.metrics.sink.PrometheusServlet
executor.sink.prometheusServlet.class=org.apache.spark.metrics.sink.PrometheusServlet
driver.sink.prometheusServlet.path=/metrics/prometheus1
executor.sink.prometheusServlet.path=/metrics/executor/prometheus1
(note the trailing '1' at the end)
then the executor sink path is not taken into account: the driver metrics will be on
/metrics/prometheus1, but the executors will still be on /metrics/executor/prometheus.
The class config is definitely being picked up, because if I change it to a nonexistent class, the executor throws an error as expected.
I have been trying to understand why custom user metrics are not sent to the driver while the regular Spark metrics are.
It looks like the Prometheus sink uses the ExecutorSummary class, which doesn't allow adding custom metrics.
For the moment, the only working approaches seem to be the JMX exporter (using the Java agent to expose the metrics to Prometheus), or simply the ConsoleSink with
*.sink.console.class=org.apache.spark.metrics.sink.ConsoleSink
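For reference, a rough sketch of that JMX route (the agent path, port, and exporter config file are illustrative assumptions, not anything Spark or the exporter prescribes). In metric.properties, publish every instance's registry over JMX with Spark's built-in JmxSink:
*.sink.jmx.class=org.apache.spark.metrics.sink.JmxSink
Then attach the Prometheus JMX exporter Java agent to the driver and executors at submit time so Prometheus can scrape them:
--conf "spark.driver.extraJavaOptions=-javaagent:/opt/jmx_prometheus_javaagent.jar=8090:/opt/jmx_exporter.yaml"
--conf "spark.executor.extraJavaOptions=-javaagent:/opt/jmx_prometheus_javaagent.jar=8090:/opt/jmx_exporter.yaml"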

How to get job status of crawl tasks in nutch

In a crawl cycle, we have many tasks/phases like inject, generate, fetch, parse, updatedb, invertlinks, dedup, and an index job.
Now I would like to know whether there is any way to get the status of a crawl task (whether it is running or has failed) other than reading the hadoop.log file.
To be more precise, I would like to know whether I can track the status of a generate/fetch/parse phase. Any help would be appreciated.
You should always run Nutch with Hadoop in pseudo or fully distributed mode; this way you'll be able to use the Hadoop UI to track the progress of your crawls, see the logs for each step, and access the counters (extremely useful!).
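If you also want to check programmatically rather than through the UI, one option (a sketch only, assuming Hadoop runs on YARN with the ResourceManager web services reachable at localhost:8088; host and port are assumptions) is to poll the ResourceManager REST API, since each Nutch phase runs as its own MapReduce job:
import requests

RM = "http://localhost:8088"  # assumed ResourceManager address

# Each phase (inject, generate, fetch, parse, updatedb, ...) shows up as an application
resp = requests.get(f"{RM}/ws/v1/cluster/apps", timeout=10).json()
for app in (resp.get("apps") or {}).get("app", []):
    # state is NEW/RUNNING/FINISHED/..., finalStatus is SUCCEEDED/FAILED/KILLED/UNDEFINED
    print(app["name"], app["state"], app["finalStatus"])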

How to tune "spark.rpc.askTimeout"?

We have a Spark 1.6.1 application which takes input from two Kafka topics and writes the result to another Kafka topic. The application receives some large (approximately 1 MB) files on the first input topic and some simple conditions from the second input topic. If a condition is satisfied, the file is written to the output topic; otherwise it is held in state (we use mapWithState).
The logic works fine for a small number (a few hundred) of input files, but fails with org.apache.spark.rpc.RpcTimeoutException and a recommendation to increase spark.rpc.askTimeout. After increasing it from the default (120s) to 300s, the job ran fine for longer but crashed with the same error after 1 hour. After changing the value to 500s, the job ran fine for more than 2 hours.
Note: We are running the Spark job in local mode and Kafka is also running locally on the machine. Also, I sometimes see the warning "[2016-09-06 17:36:05,491] [WARN] - [org.apache.spark.storage.MemoryStore] - Not enough space to cache rdd_2123_0 in memory! (computed 2.6 GB so far)"
Now, 300s seemed like a large enough timeout considering the all-local configuration. But is there a way to come up with an ideal timeout value instead of just using 500s or higher based on testing? I have seen crashes with 800s and cases suggesting 60000s.
I was facing the same problem. I found this page, which says that under heavy workloads it is wise to set spark.network.timeout (which controls all the timeouts, including the RPC one) to 800. For the moment this has solved my problem.
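For completeness, one way to apply that at submit time (800s is just the value from that page; you would still need to validate it against your own workload, and the application file name is a placeholder). Note that spark.network.timeout only provides the default: an explicitly set spark.rpc.askTimeout still takes precedence.
spark-submit \
  --conf spark.network.timeout=800s \
  your_streaming_app.py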

How to make Spark to fail fast with clarity

I'm learning Spark, and quite often I have some issue that causes tasks and stages to fail. With my default configuration, there are rounds of retries and a bunch of ERROR messages to that effect.
While I totally appreciate the idea of retrying tasks when I finally get to production, I'd love to know how to make my application fail at the first sign of trouble so that I can avoid all the extra noise in the logs and within the application history itself. For example, if I run it out of memory, I'd love to just see the OOM exception near the end of my log and have the whole app fail.
What's the best way to set up the configs for this kind of workflow?
You can set spark.task.maxFailures to 1.
spark.task.maxFailures is the number of individual task failures tolerated before giving up on the job; its default value is 4.
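A minimal sketch of setting it (the app and file names are placeholders; either the submit-time flag or the builder config works, as long as it is in place before the SparkContext starts):
spark-submit --conf spark.task.maxFailures=1 my_app.py
or, equivalently, in code:
from pyspark.sql import SparkSession

# Give up after the first task failure instead of retrying up to 4 times
spark = (SparkSession.builder
         .appName("fail-fast-demo")
         .config("spark.task.maxFailures", "1")
         .getOrCreate())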
