EMR pyspark trackable logging architecture - apache-spark

I am in the middle of building a PySpark application that fails a lot. It has many jobs with many steps, so searching the logs by cluster ID and step ID is not practical. The current format in which Spark on EMR saves logs is below.
I want something traceable in place of {clusterid} and {stepid}, such as cluster name + datetime and step name.
I saw that log4j.properties has a setting named datepattern, but nothing is being saved with a datetime.

You could index the logs into an ELK cluster (managed or not) using Filebeat.
Or send the logs to CloudWatch Logs using a bootstrap script on the EMR cluster or a Lambda function. You can then customize the log group and log stream names to suit your needs.
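As a rough sketch of the CloudWatch naming idea, the log group and log stream names can be derived from the cluster name, a start timestamp, and the step name instead of the opaque IDs. The function names and the /emr/... layout below are assumptions made up for illustration, not an EMR or CloudWatch convention. The only constraint relied on is that stream names may not contain ':' or '*'.

```python
import re
from datetime import datetime, timezone


def _slug(text):
    """Replace every run of characters outside letters, digits, '_', '.', '-' with '-'."""
    return re.sub(r"[^A-Za-z0-9_.-]+", "-", text.strip()).strip("-")


def log_group_name(cluster_name, started_at):
    """Hypothetical layout: /emr/<cluster-name>/<UTC start time>."""
    stamp = started_at.astimezone(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    return f"/emr/{_slug(cluster_name)}/{stamp}"


def log_stream_name(step_name):
    """Name the stream after the step; _slug drops ':' and '*'."""
    return _slug(step_name)


if __name__ == "__main__":
    started = datetime(2020, 1, 31, 8, 15, 0, tzinfo=timezone.utc)
    print(log_group_name("nightly etl", started))
    print(log_stream_name("load: customers *full*"))
```

The resulting strings could then be handed to whichever shipper is in use, for example the CloudWatch agent configuration written by the bootstrap script.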


Get Dataproc Logs to Stackdriver Logging

I am running Dataproc and submitting Spark Jobs using the default client-mode.
The logs for the jobs are visible in the GCP console and are available in the GCS bucket. However, I would like to see the logs in Stackdriver Logging.
Currently, the only way I found was to use cluster-mode instead.
Is there a way to push logs to Stackdriver when using client-mode?
This is something the Dataproc team is actively working on and should have a solution for sometime soon. Filing a public feature request to track this is an option, but I will try to update this response when the feature is usable.
Digging into it a bit, the reason why you can see the logs when using cluster-mode is that we have Fluentd configurations that pick up YARN container logs (userlogs) by default. When running in cluster-mode the driver runs in a YARN container and those logs are picked up by that configuration.
Currently, output produced by the driver is forwarded directly to GCS by the Dataproc agent. In the future there will be an option to have all driver output sent to Stackdriver when starting a cluster.
This feature is now in Beta and is stable to use. When creating a Cluster, the property "dataproc:dataproc.logging.stackdriver.job.driver.enable" can be used to toggle whether the cluster will send Job driver logs to Stackdriver. Additionally you can use the property "dataproc:dataproc.logging.stackdriver.job.yarn.container.enable" to have the cluster associate YARN container logs with the Job they were created by instead of the Cluster they ran on.
Documentation is available here
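A minimal sketch of passing those two properties at cluster creation, assuming the cluster is created with the gcloud CLI and its --properties flag. The cluster name and region are placeholders, and the helper only builds the argument list; it does not run anything.

```python
import shlex

# Property names quoted from the answer above.
DRIVER_LOGS = "dataproc:dataproc.logging.stackdriver.job.driver.enable"
YARN_LOGS = "dataproc:dataproc.logging.stackdriver.job.yarn.container.enable"


def create_cluster_command(cluster_name, region, yarn_container_logs=False):
    """Build the argv for creating a cluster with job driver logging enabled."""
    props = [f"{DRIVER_LOGS}=true"]
    if yarn_container_logs:
        # Also associate YARN container logs with the job that created them.
        props.append(f"{YARN_LOGS}=true")
    return [
        "gcloud", "dataproc", "clusters", "create", cluster_name,
        f"--region={region}",
        "--properties=" + ",".join(props),
    ]


if __name__ == "__main__":
    # "my-cluster" and "us-central1" are placeholder values.
    print(shlex.join(create_cluster_command("my-cluster", "us-central1", True)))
```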

Spark custom user log from aws EMR

I'm running a Spark job on EMR (YARN, cluster mode, transient: the cluster shuts down after the job is done) with debug mode turned on. All Spark logs are uploaded to S3 as expected, but I can't upload my own custom logs...
Using log4j, I'm trying to write them to the following path, according to the Spark docs: log4j.appender.algoLog.File=${spark.yarn.app.container.log.dir}/algoLog.log
It seems like the variable is undefined, so it tries to write directly to the root: /algoLog.log.
If I write it to some other arbitrary location, it just doesn't appear on S3.
Where should I write my own log files if I want EMR to upload them to S3 after the cluster shuts down?
Log4j isn't set up to write to object stores; its notion of a filesystem is different.
You may be able to get YARN to do it with its log collection. See How to keep YARN's log files?

Bluemix Apache Spark Metrics

I have been looking for a way to monitor performance in Spark on Bluemix. I know in the Apache Spark project, they provide a metrics service based on the Coda Hale Metrics Library. This allows users to report Spark metrics to a variety of sinks including HTTP, JMX, and CSV files. Details here: http://spark.apache.org/docs/latest/monitoring.html
Does anyone know of any way to do this in the Bluemix Spark service? Ideally, I would like to save the metrics to a csv file in Object Storage.
Appreciate the help.
Currently, I do not see an option for usage of "Coda Hale Metrics Library" and reporting the job history or accessing the information via REST API.
However, on the main page of the Spark history server, you can see the Event log directory. It refers to your following user directory: file:/gpfs/fs01/user/USER_ID/events/
There I saw JSON (like) formatted files.
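As a sketch of getting from those event files to a CSV, the snippet below counts events by type. It assumes each line of an event file is a standalone JSON object with an "Event" field, which is how Spark event logs are generally laid out; that has not been checked against the Bluemix service specifically. The function names are made up for illustration.

```python
import csv
import io
import json
from collections import Counter


def count_events(lines):
    """Count event log records by their "Event" type, skipping unparsable lines."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # tolerate "JSON-like" lines that do not parse
        if isinstance(record, dict) and "Event" in record:
            counts[record["Event"]] += 1
    return counts


def to_csv(counts):
    """Render the counts as CSV text, one row per event type, sorted by name."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["event", "count"])
    for name in sorted(counts):
        writer.writerow([name, counts[name]])
    return buf.getvalue()


if __name__ == "__main__":
    sample = [
        '{"Event": "SparkListenerJobStart", "Job ID": 0}',
        '{"Event": "SparkListenerTaskEnd", "Task Type": "ResultTask"}',
        '{"Event": "SparkListenerTaskEnd", "Task Type": "ResultTask"}',
        "not json",
    ]
    print(to_csv(count_events(sample)))
```

To use it on a real file, pass the open file object to count_events and write the returned text wherever the CSV should live.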

Apache Spark: Yarn logs Analysis

I have a spark-streaming application, and I want to analyse the logs of the job using Elasticsearch and Kibana. My job runs on a YARN cluster, so the logs are written to HDFS, as I have set yarn.log-aggregation-enable to true. But when I try to do this:
hadoop fs -cat ${yarn.nodemanager.remote-app-log-dir}/${user.name}/logs/<application ID>
I am seeing some encrypted/compressed data. What file format is this? How can I read the logs from this file? Can I use logstash to read this?
Also, if there is a better approach to analyse Spark logs, I am open to your suggestions.
The format is called a TFile, and it is a compressed file format.
Yarn however chooses to write the application logs into a TFile!! For those of you who don’t know what a TFile is (and I bet a lot of you don’t), you can learn more about it here, but for now this basic definition should suffice “A TFile is a container of key-value pairs. Both keys and values are type-less bytes”.
Splunk / Hadoop Rant
There may be a way to edit YARN's and Spark's log4j.properties to send messages to Logstash using SocketAppender.
However, that method is being deprecated.

View worker / executor logs in Spark UI since 1.0.0+

In 0.9.0, viewing worker logs was simple: they were one click away from the Spark UI home page.
Now (1.0.0+) I cannot find them. Furthermore, the Spark UI stops working when my job crashes! This is annoying: what is the point of a debugging tool that only works when your application does not need debugging? According to http://apache-spark-user-list.1001560.n3.nabble.com/Viewing-web-UI-after-fact-td12023.html I need to find out what my master URL is, but I don't know how to. Spark doesn't print this information at startup; all it says is:
... -Dspark.master=\"yarn-client\" ...
and obviously http://yarn-client:8080 doesn't work. Some sites talk about how finding logs in YARN has now been super obfuscated - rather than just being on the UI, you have to log in to the boxes to find them. Surely this is a massive regression and there has to be a simpler way??
How am I supposed to find out what the master URL is? How can I find my worker (now called executor) logs?
Depending on your configuration of YARN NodeManager log aggregation, the Spark job logs are aggregated automatically. Runtime logs can usually be found in the following ways:
Spark Master Log
If you're running with yarn-cluster, go to the YARN Scheduler web UI. You can find the Spark Master log there. The "log" button on the job description page shows the content.
With yarn-client, the driver runs in your spark-submit command. What you see then is the driver log, provided log4j.properties is configured to output to stderr or stdout.
Spark Executor Log
Search for "executorHostname" in driver logs. See comments for more detail.
These answers document how to find them from command line or UI
Where are logs in Spark on YARN?
For UI, on an edge node
Look in /etc/hadoop/conf/yarn-site.xml for the yarn resource manager URI (yarn.resourcemanager.webapp.address).
Or use command line:
yarn logs -applicationId <application ID> [OPTIONS]
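Both lookups above can be sketched in Python. This is an illustration under assumptions, not a complete tool: the helper names are made up, the XML is expected in the usual Hadoop configuration/property/name/value layout, and the application ID in the demo is a placeholder.

```python
import xml.etree.ElementTree as ET

WEBAPP_KEY = "yarn.resourcemanager.webapp.address"


def resource_manager_webapp_address(yarn_site_xml):
    """Return the value of yarn.resourcemanager.webapp.address, or None if unset."""
    root = ET.fromstring(yarn_site_xml)
    for prop in root.iter("property"):
        if prop.findtext("name", "").strip() == WEBAPP_KEY:
            value = prop.findtext("value")
            return value.strip() if value else None
    return None


def yarn_logs_command(application_id, container_id=None):
    """Build the argv for `yarn logs`; pass it to subprocess.run to execute it."""
    cmd = ["yarn", "logs", "-applicationId", application_id]
    if container_id:
        # Some Hadoop versions also require -nodeAddress with -containerId.
        cmd += ["-containerId", container_id]
    return cmd


if __name__ == "__main__":
    # On an edge node the XML would be read from /etc/hadoop/conf/yarn-site.xml.
    sample = """<configuration>
      <property>
        <name>yarn.resourcemanager.webapp.address</name>
        <value>rm-host:8088</value>
      </property>
    </configuration>"""
    print(resource_manager_webapp_address(sample))
    print(" ".join(yarn_logs_command("application_1400000000000_0001")))
```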
