Spark custom user log from aws EMR - apache-spark

I'm running a Spark job on EMR (YARN, cluster mode, transient, i.e. the cluster shuts down after the job is done) with debug mode turned on. All Spark logs are uploaded to S3 as expected, but I can't upload my own custom logs...
Using log4j, I'm trying to write them to the following path, according to the Spark docs: log4j.appender.algoLog.File=${spark.yarn.app.container.log.dir}/algoLog.log
It seems the variable is undefined: log4j tries to write directly to the root, /algoLog.log.
If I write the file to some other arbitrary location, it just doesn't appear on S3.
Where should I write my own log files if I want EMR to upload them to S3 after the cluster shuts down?

Log4J isn't set up to write to object stores; its notion of a filesystem is different.
You may be able to get YARN to do it with its log collection. See "How to keep YARN's log files?"
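For PySpark code (rather than log4j), one possible way to land a custom file next to the container's stdout/stderr is to resolve the directory from the LOG_DIRS environment variable, which YARN sets for the containers it launches as a comma-separated list of log directories. This is only a sketch under that assumption: container_log_path and make_algo_logger are hypothetical helper names, not part of Spark or EMR, and whether the file ends up on S3 still depends on the cluster's log collection.

```python
import logging
import os
import tempfile


def container_log_path(filename, env=None):
    """Return a path for `filename` inside the first YARN container log dir.

    Assumption: the process runs inside a YARN container, where LOG_DIRS holds
    a comma-separated list of log directories. Outside YARN the variable is
    unset, so fall back to the system temp dir instead of writing to "/".
    """
    env = os.environ if env is None else env
    dirs = [d for d in env.get("LOG_DIRS", "").split(",") if d]
    base = dirs[0] if dirs else tempfile.gettempdir()
    return os.path.join(base, filename)


def make_algo_logger(env=None):
    """Build a logger that appends to algoLog.log in the resolved directory."""
    logger = logging.getLogger("algoLog")
    logger.setLevel(logging.INFO)
    handler = logging.FileHandler(container_log_path("algoLog.log", env))
    handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
    logger.addHandler(handler)
    return logger
```

The explicit fallback is there to avoid the behaviour described in the question, where an undefined variable silently turns the path into /algoLog.log.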

Related

Spark RDD S3 saveAsTextFile taking long time

I have a Spark Streaming job on EMR which runs in 30-minute batches, processes the data and finally writes the output to several different files in S3. The output step is taking too long (about 30 minutes) to write the files to S3. On investigating further, I found that the tasks finish writing their data to the temporary folder within 20 seconds, and the rest of the time is spent by the master node moving the S3 files from the _temporary folder to the destination folder, renaming them, etc. (Similar to: Spark: long delay between jobs)
Some other details on the job configurations, file format etc are as below:
EMR version: emr-5.22.0
Hadoop version: Amazon 2.8.5
Applications: Hive 2.3.4, Spark 2.4.0, Ganglia 3.7.2
S3 files: written using the RDD saveAsTextFile API with an S3A URL; the S3 file format is text
Although the EMRFS output committer is enabled by default in the job, it is not taking effect, because we are using RDDs and the text file format, which is supported only from EMR 6.4.0 onwards. One way I can think of to reduce the time taken by the S3 save is to upgrade the EMR version, convert the RDDs to DataFrames/Datasets and use their APIs instead of saveAsTextFile. Is there any other simpler solution possible to optimize the time taken for the job?
Is there any other simpler solution possible to optimize the time taken for the job?
Unless you use an S3-specific committer, your jobs will not only be slow, they will be incorrect in the presence of failures. As this may matter to you, it is good that the slow job commits are providing an early warning of problems even before worker failures result in invalid output.
Options:
Upgrade. The committers were added for a reason.
Use a real cluster filesystem (e.g. HDFS) as the output, then upload afterwards.
The S3A zero-rename committers do work with saveAsTextFile, but they aren't supported by AWS, and the ASF developers don't test on EMR as it is Amazon's own fork. You might be able to get whatever S3A connector Amazon ships to work, but you'd be on your own if it didn't.
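The "write to HDFS, then upload" option could look roughly like the sketch below: the job calls saveAsTextFile on an hdfs:// path, and a follow-up step copies the finished directory to S3 with s3-dist-cp, the copy tool that ships with EMR. The helper name s3_dist_cp_command and both paths are made up for illustration; only the s3-dist-cp --src/--dest invocation itself is taken from the tool.

```python
import shlex


def s3_dist_cp_command(hdfs_src, s3_dest):
    """Build the argv for copying finished job output from HDFS to S3."""
    if not hdfs_src.startswith("hdfs://"):
        raise ValueError("source should be an hdfs:// URI")
    if not s3_dest.startswith("s3://"):
        raise ValueError("destination should be an s3:// URI")
    return ["s3-dist-cp", "--src", hdfs_src, "--dest", s3_dest]


# Hypothetical paths, for illustration only.
cmd = s3_dist_cp_command(
    "hdfs:///output/batch-0001",
    "s3://example-bucket/output/batch-0001",
)
print(shlex.join(cmd))
```

The command can then be run as an extra EMR step, or from the driver with subprocess.run(cmd, check=True) once the Spark job has committed its output to HDFS. The rename-heavy commit then happens on HDFS, where renames are cheap, rather than on S3.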

Communicate to cluster that Spark History server is running

I have a working Spark cluster, with a master node and some worker nodes running on Kubernetes. This cluster has been used for multiple spark submit jobs and is operational.
On the master node, I have started up a Spark History server using the $SPARK_HOME/sbin/start-history-server.sh script and some configs to determine where the History Server's logs should be written:
spark.eventLog.enabled=true
spark.eventLog.dir=...
spark.history.fs.logDirectory=...
spark.hadoop.fs.s3a.access.key=...
spark.hadoop.fs.s3a.secret.key=...
spark.hadoop.fs.s3a.endpoint=...
spark.hadoop.fs.s3a.path.style.access=true
This was done a while after the cluster became operational. The server is supposed to read and write the logs in an external store (MinIO, accessed through the s3a protocol).
Now, whenever I submit Spark jobs, it seems like nothing is being written to the location I'm specifying.
I'm wondering about the following: how can the workers know I have started up the Spark History server on the master node? Do I need to communicate this to the workers somehow?
Possible causes that I have checked:
No access/permissions to write to MinIO: this shouldn't be the case, as I'm running spark-submit jobs that read/write files on the same MinIO using the same settings.
Logs folder does not exist: I was getting these errors before, but then I created a location for the files to be written to, and since then I'm not getting any errors.
spark.eventLog.dir should be the same as spark.history.fs.logDirectory: they are.
Just found out the answer: the way your workers will know where to store the logs is by supplying the following configs to your spark-submit job:
spark.eventLog.enabled=true
spark.eventLog.dir=...
spark.history.fs.logDirectory=...
It is probably also enough to have these in the spark-defaults.conf used by the driver program. That is likely why I couldn't find much information on this: I hadn't added them to my own spark-defaults.conf.
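As a rough sketch of what "supplying the configs to spark-submit" means in practice, the helper below turns the event-log settings into --conf arguments. The function name, the log directory and the application file are hypothetical. Note that spark.history.fs.logDirectory is read by the history server process itself, so the sketch only passes the two spark.eventLog.* settings that the application uses; passing the third one as well does no harm.

```python
def event_log_submit_args(event_log_dir, app, extra_conf=None):
    """Build a spark-submit argv that enables event logging for one job."""
    conf = {
        "spark.eventLog.enabled": "true",
        "spark.eventLog.dir": event_log_dir,
    }
    # Extra settings (for example the spark.hadoop.fs.s3a.* options) can be
    # merged in by the caller; they are not filled in here.
    conf.update(extra_conf or {})
    args = ["spark-submit"]
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    args.append(app)
    return args


# Hypothetical bucket and script name, for illustration only.
args = event_log_submit_args("s3a://spark-logs/events", "job.py")
print(" ".join(args))
```

The event log directory given here has to be the same location the history server reads from, otherwise the server will start fine but never list any applications.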

EMR pyspark trackable logging architecture

I am in the middle of building a PySpark application that fails a lot and has a lot of jobs with a lot of steps, so it is not practical to search by cluster ID and step ID. The current format in which Spark on EMR saves its logs is below:
S3/buckt-name/logs/sparksteps/j-{clusterid}/steps/s-{stepid}/stderr.gz
I want something traceable in place of {clusterid} and {stepid}, such as cluster name + datetime and step name.
I looked at log4j.properties and it has something named DatePattern, but nothing is being saved with a datetime.
You could index the logs into an ELK cluster (managed or not) using Filebeat.
Or send the logs to CloudWatch Logs using a bootstrap script on the EMR cluster or a Lambda. You can then customize the log group and log stream names to your needs.

How to forward logs to s3 from yarn container?

I am setting up Spark on a Hadoop YARN cluster on AWS EC2 machines.
This cluster will be ephemeral (up for a few hours within a day), and hence I want to forward the generated container logs to S3.
I have seen that Amazon EMR supports this feature by forwarding logs to S3 every 5 minutes.
Is there any built-in configuration in Hadoop/Spark that I can leverage?
Any other solution to this issue would also be helpful.
Sounds like you're looking for YARN log aggregation.
I haven't tried changing it myself, but you can configure yarn.nodemanager.remote-app-log-dir to point to an S3 filesystem, assuming you've set up your core-site.xml accordingly.
yarn.log-aggregation.retain-seconds and yarn.log-aggregation.retain-check-interval-seconds control how long the aggregated logs are kept and how often that retention check runs. By default the logs are shipped when an application finishes; for applications that are still running, yarn.nodemanager.log-aggregation.roll-monitoring-interval-seconds determines how often the NodeManagers upload them.
The alternative would be to build your own AMI that has Fluentd or Filebeat pointing at the local YARN log directories, then set up those log forwarders to write to a remote location. For that, Elasticsearch (or one of the AWS log solutions) would be a better destination than plain S3.
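The log aggregation settings mentioned above live in yarn-site.xml. As an untested sketch in the same spirit as the answer, the snippet below renders the two core properties as the XML that file expects. The bucket and path are hypothetical, and the s3a:// URL assumes the S3A connector and credentials are already configured in core-site.xml.

```python
import xml.etree.ElementTree as ET


def yarn_site_xml(properties):
    """Render a {name: value} mapping as a yarn-site.xml <configuration>."""
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = str(value)
    return ET.tostring(root, encoding="unicode")


# Hypothetical destination; replace with a location the NodeManagers can write to.
log_aggregation = {
    "yarn.log-aggregation-enable": "true",
    "yarn.nodemanager.remote-app-log-dir": "s3a://example-bucket/yarn-logs",
}
print(yarn_site_xml(log_aggregation))
```

The generated property elements would be merged into the existing yarn-site.xml on every node, followed by a restart of the YARN daemons, before the setting takes effect.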

Apache Spark: Yarn logs Analysis

I have a Spark Streaming application, and I want to analyse the job's logs using Elasticsearch and Kibana. My job runs on a YARN cluster, so the logs are written to HDFS, as I have set yarn.log-aggregation-enable to true. But when I try to do this:
hadoop fs -cat ${yarn.nodemanager.remote-app-log-dir}/${user.name}/logs/<application ID>
I am seeing some encrypted/compressed data. What file format is this? How can I read the logs from this file? Can I use logstash to read this?
Also, if there is a better approach to analyse Spark logs, I am open to your suggestions.
Thanks.
The format is called a TFile, and it is a compressed file format.
Yarn however chooses to write the application logs into a TFile!! For those of you who don’t know what a TFile is (and I bet a lot of you don’t), you can learn more about it here, but for now this basic definition should suffice “A TFile is a container of key-value pairs. Both keys and values are type-less bytes”.
Splunk / Hadoop Rant
There may be a way to edit YARN's and Spark's log4j.properties to send messages to Logstash using SocketAppender. However, that method is being deprecated.
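Since the aggregated files are TFiles, reading them with hadoop fs -cat shows binary data. The yarn logs -applicationId command decodes them and prints plain text, which could then be fed to Logstash or anything else. The wrapper below is a minimal sketch around that command; the function names and the application ID format check are my own additions, and fetch_logs assumes a configured YARN client on the machine where it runs.

```python
import re
import subprocess

# YARN application IDs look like application_<cluster timestamp>_<sequence>.
_APP_ID = re.compile(r"^application_\d+_\d+$")


def yarn_logs_command(application_id):
    """Build the argv that prints an application's aggregated logs as text."""
    if not _APP_ID.match(application_id):
        raise ValueError(f"not a YARN application ID: {application_id!r}")
    return ["yarn", "logs", "-applicationId", application_id]


def fetch_logs(application_id):
    """Run the command and return the decoded log text."""
    result = subprocess.run(
        yarn_logs_command(application_id),
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout
```

For a streaming job the output can be large, so redirecting the command to a file and pointing a log shipper at that file may be more practical than holding everything in memory.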

Resources