Apache Spark: Yarn logs Analysis - apache-spark

I have a Spark Streaming application, and I want to analyse the job's logs using Elasticsearch and Kibana. The job runs on a YARN cluster, so the logs are written to HDFS because I have set yarn.log-aggregation-enable to true. But when I try this:
hadoop fs -cat ${yarn.nodemanager.remote-app-log-dir}/${user.name}/logs/<application ID>
I am seeing some encrypted/compressed data. What file format is this? How can I read the logs from this file? Can I use logstash to read this?
Also, if there is a better approach to analyse Spark logs, I am open to your suggestions.
Thanks.

The format is called TFile, and it is a compressed container format, which is why hadoop fs -cat shows what looks like binary data.
As the "Splunk / Hadoop Rant" blog post puts it:
Yarn however chooses to write the application logs into a TFile!! For those of you who don’t know what a TFile is (and I bet a lot of you don’t), you can learn more about it here, but for now this basic definition should suffice “A TFile is a container of key-value pairs. Both keys and values are type-less bytes”.
There may be a way to edit YARN's and Spark's log4j.properties to send messages to Logstash using SocketAppender.
However, that method is being deprecated.
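For simply reading the aggregated logs as text, the yarn logs -applicationId command decodes the TFile for you. Here is a minimal Python sketch that wraps that command, assuming the yarn CLI is on the PATH of the machine running it. The function names and the application ID in the usage comment are made up for illustration.

```python
import subprocess


def yarn_logs_command(application_id):
    """Build the CLI call that prints the aggregated logs of one application."""
    return ["yarn", "logs", "-applicationId", application_id]


def dump_yarn_logs(application_id, out_path):
    """Write the decoded, plain-text logs to out_path.

    Assumes the yarn CLI is installed and log aggregation has finished.
    """
    with open(out_path, "w") as out:
        subprocess.run(yarn_logs_command(application_id), stdout=out, check=True)


# Hypothetical usage (the application ID is a placeholder):
# dump_yarn_logs("application_1500000000000_0001", "app.log")
```

The resulting plain-text file is something a shipper such as Logstash can then read like any other log file.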

Related

integration of csv file with flume vs spark

I have a project to bring CSV files from our partners' servers into our Hadoop cluster.
I found that both Flume and Spark can do this.
I know that Spark is preferred when you need to perform data transformations.
My question is what's the difference between Flume and Spark in integration logic?
Is there a performance difference between them in importing CSV files?
Flume is a constantly running process that watches paths or executes functions on files. It is more comparable to Logstash or Fluentd because it is driven by a config file rather than by code that you write, deploy, and tune.
Preferably, you would parse the CSV files while you are reading them, then convert them to a more self-describing format such as Avro, then put them into HDFS. See Morphlines Flume processors.
With Spark, on the other hand, you would have to write all of that code yourself, end to end. While Spark Streaming can do the same thing, you generally would not run it the same way as Flume. Instead, you run it within YARN or another cluster scheduler, where you have no control over which server it runs on; at the end of the day, you should only care whether there are resource constraints.
Other alternatives exist, such as Apache NiFi or StreamSets, which allow more visual pipeline building rather than writing code.

spark history server in yarn mode

Can someone help in understanding the clear difference between
spark.eventLog.dir and spark.history.fs.logDirectory?
Also please relate these properties to yarn.nodemanager.remote-app-log-dir and mapreduce.jobhistory.done-dir
Please don't paste documentation as the response :)
I have already gone through the below link, but could not understand.
What's the difference between spark.eventLog.dir and spark.history.fs.logDirectory?
As the linked post says,
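spark.eventLog.dir is where a running Spark application writes its event logs, while spark.history.fs.logDirectory is where the Spark History Server reads those event logs from.
In certain scenarios, these could be different. For example, some external, periodic job could move the files being actively written into the history location.
The Spark event log directories are unrelated to yarn.nodemanager.remote-app-log-dir (where YARN puts aggregated container logs) and mapreduce.jobhistory.done-dir (where the MapReduce JobHistory Server keeps completed job files), and should be kept separate from them.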
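To make the relationship concrete, here is a small Python sketch that renders the relevant spark-defaults.conf entries. In the common case both properties point at the same directory, and the second argument only matters in the "external job moves the files" scenario. The helper names and the hdfs:///spark-history path are assumptions for illustration, not values from the question.

```python
def spark_history_conf(event_log_dir, history_read_dir=None):
    """Return the event-log settings as a dict.

    event_log_dir: where applications write event logs (spark.eventLog.dir).
    history_read_dir: where the History Server reads them
        (spark.history.fs.logDirectory); defaults to the same directory.
    """
    if history_read_dir is None:
        history_read_dir = event_log_dir
    return {
        "spark.eventLog.enabled": "true",
        "spark.eventLog.dir": event_log_dir,
        "spark.history.fs.logDirectory": history_read_dir,
    }


def render_spark_defaults(conf):
    """Format the dict as 'key value' lines, as used in spark-defaults.conf."""
    return "\n".join(f"{key} {value}" for key, value in conf.items())


# Hypothetical path; replace with a directory that exists on your cluster.
print(render_spark_defaults(spark_history_conf("hdfs:///spark-history")))
```

Neither property needs to match the YARN or MapReduce directories mentioned in the question.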

Spark custom user log from aws EMR

I'm running a Spark job on EMR (YARN, cluster mode, transient: the cluster shuts down after the job is done) with debug mode turned on. All Spark logs are uploaded to S3 as expected, but I can't upload my own custom logs...
Using log4j, I'm trying to write them to the following path, according to the Spark docs: log4j.appender.algoLog.File=${spark.yarn.app.container.log.dir}/algoLog.log
It seems the variable is undefined, because it tries to write directly to the root: /algoLog.log.
If I write to some other arbitrary location, the file just doesn't appear on S3.
Where should I write my own log files if I want EMR to upload them to S3 after the cluster shuts down?
Log4J isn't set up to write to object stores; its notion of a filesystem is different.
You may be able to get YARN to do it with its log collection. See How to keep YARN's log files?

Replace LogStash with Spark Streaming

My requirement is to read log data from multiple machines.
LogStash - As far as I understand, LogStash agents have to be installed on all the machines, and LogStash can push data to Kafka as and when it arrives, i.e. even if a new line is added to a file, LogStash reads only that, not the entire file again.
Questions
Is it possible to achieve the same with Spark Streaming?
If so, what are the advantages/disadvantages of using Spark Streaming over LogStash?
LogStash agents to be installed on all the machines
Yes, you need some agent on all machines. The solution in the ELK stack is actually FileBeat, not Logstash agents. Logstash is more of a server/message-bus in this scenario.
Similarly, some Spark job would need to be running to read a file. Personally, I would have anything else tail-ing a log file (even literally just tail -f file.log piping out to a network socket). Needing to write and distribute a Spark JAR + config files is a clear disadvantage, especially since you would need Java installed on each of the machines you are collecting logs from.
Flume or Fluentd are other widely used options for distributed log collection with Kafka destinations.
LogStash can push data to Kafka
The Beats framework has a Kafka Output, but you can also ship to Logstash first.
It's not clear if you are using LogStash purely for Kafka, or also using ElasticSearch here, but Kafka Connect provides a file-source (and Elasticsearch output).
reads only that not the entire file again
Whatever tool you use (including Spark Streaming's File source) will typically be watching directories of files (because if you aren't rotating log files, you're doing it wrong). As files come in, or as bytes are written to a file, that framework needs to commit some type of marker internally to indicate what has been consumed so far. To reset the agent, you should be able to remove or reset this metadata so that it starts from the beginning.
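The "marker" idea can be sketched in a few lines of Python. This is only an illustration of the principle, not how Filebeat, Logstash, or Spark actually store their state: the function name and the JSON state file are assumptions. It keeps a byte offset in a side file, returns only the complete lines appended since the last call, and starts over if the state file is deleted.

```python
import json
import os


def read_new_lines(log_path, state_path):
    """Return complete lines appended to log_path since the previous call."""
    offset = 0
    if os.path.exists(state_path):
        with open(state_path) as f:
            offset = json.load(f)["offset"]

    # If the file is shorter than the saved offset it was probably rotated
    # or truncated, so start again from the beginning.
    if os.path.getsize(log_path) < offset:
        offset = 0

    with open(log_path, "rb") as f:
        f.seek(offset)
        data = f.read()

    # Consume only up to the last newline; a partial trailing line is left
    # in place and picked up on a later call once it is complete.
    end = data.rfind(b"\n") + 1
    lines = data[:end].decode("utf-8", errors="replace").splitlines()

    # Commit the marker so the next call resumes after what was consumed.
    with open(state_path, "w") as f:
        json.dump({"offset": offset + end}, f)
    return lines
```

Deleting the state file is the "reset" described above: the next call re-reads the log from the first byte.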

Running spark streaming forever on production

I am developing a spark streaming application which basically reads data off kafka and saves it periodically to HDFS.
I am running pyspark on YARN.
My question is more for production purpose. Right now, I run my application like this:
spark-submit stream.py
Imagine you are going to deliver this spark streaming application (in python) to a client, what would you do in order to keep it running forever? You wouldn't just give this file and say "Run this on the terminal". It's too unprofessional.
What I want to do is submit the job to the cluster (or to local processors) and never have to watch logs on the console, and without using something like Linux screen to run it in the background (because that seems too unprofessional).
What is the most professional and efficient way to permanently submit a spark-streaming job to the cluster ?
I hope I was unambiguous. Thanks!
You could use spark-jobserver, which provides a REST interface for uploading your JAR and running it. You can find the documentation in the spark-jobserver project.
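Separately from spark-jobserver, and since the question says the job already runs on YARN, another option is to submit in cluster deploy mode so that the driver runs inside the cluster rather than in your terminal. Below is a rough Python sketch of building such a spark-submit call. The function name and the app name are made up, and setting spark.yarn.submit.waitAppCompletion to false (so the client returns right after submission) is one possible choice, not something the answer above prescribes.

```python
import subprocess


def cluster_submit_command(script, app_name):
    """Build a spark-submit call that runs the driver on the YARN cluster."""
    return [
        "spark-submit",
        "--master", "yarn",
        "--deploy-mode", "cluster",
        "--name", app_name,
        # Return as soon as YARN accepts the application instead of
        # streaming status to the console until it finishes.
        "--conf", "spark.yarn.submit.waitAppCompletion=false",
        script,
    ]


def submit(script, app_name):
    """Run the command; assumes spark-submit is on the PATH."""
    subprocess.run(cluster_submit_command(script, app_name), check=True)


# Hypothetical usage:
# submit("stream.py", "kafka-to-hdfs")
```

Once submitted this way, the application keeps running after the terminal is closed, and its logs can be collected through YARN.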
