ELK apache spark application log - apache-spark

How to configure Filebeat to read Apache Spark application logs? The generated logs are moved to the history server, in a non-readable format, as soon as the application completes. What is the ideal way to handle this?

You can configure Spark logging via Log4J. For a discussion around some edge cases for setting up log4j configuration, see SPARK-16784, but if you simply want to collect all application logs coming off a cluster (vs logs per job) you shouldn't need to consider any of that.
On the ELK side, there was a log4j input plugin for Logstash, but it is deprecated.
Thankfully, the documentation for the deprecated plugin describes how to configure log4j to write data locally for FileBeat, and how to set up FileBeat to consume this data and send it to a Logstash instance. This is now the recommended way to ship logs from systems using log4j.
So in summary, the recommended way to get logs from Spark into ELK is (a minimal configuration sketch follows this list):
Set the Log4J configuration for your Spark cluster to write to local files
Run FileBeat to consume from these files and send to Logstash
Logstash will send data into Elasticsearch
You can search through your indexed log data using Kibana
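For concreteness, here is a minimal sketch of the first two steps. It assumes a log4j 1.x based Spark distribution and made-up paths, hostnames and ports, so treat it as a starting point rather than a drop-in configuration.

    # conf/log4j.properties on each Spark node -- write logs to a local rolling file
    log4j.rootCategory=INFO, file
    log4j.appender.file=org.apache.log4j.RollingFileAppender
    log4j.appender.file.File=/var/log/spark/spark.log
    log4j.appender.file.MaxFileSize=50MB
    log4j.appender.file.MaxBackupIndex=10
    log4j.appender.file.layout=org.apache.log4j.PatternLayout
    log4j.appender.file.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n

    # filebeat.yml on the same nodes -- tail those files and ship them to Logstash
    filebeat.inputs:
      - type: log
        paths:
          - /var/log/spark/*.log
    output.logstash:
      hosts: ["logstash-host:5044"]

On the Logstash side you would then add a beats input and an elasticsearch output, and Kibana queries the resulting indices.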

Related

EMR pyspark trackable logging architecture

I am in the middle of building a PySpark application that fails a lot and has many jobs with many steps, so it is not practical to search by cluster ID and step ID. The current format in which Spark on EMR saves logs is below:
s3://bucket-name/logs/sparksteps/j-{clusterid}/steps/s-{stepid}/stderr.gz
I want something traceable in place of {clusterid} and {stepid}, such as clustername+datetime and step-name.
I looked at log4j.properties and it has something named DatePattern, but it is not saving anything with a datetime.
You could index the logs into an ELK cluster (managed or not) using Filebeat.
Or send the logs to CloudWatch Logs using a bootstrap script on the EMR cluster or a Lambda. You can then customize the log group and log stream names to your needs.
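As a rough sketch of the Filebeat route, something like the following could be dropped onto each node by a bootstrap action. The log paths and the ELK endpoint are assumptions about a typical EMR node layout, so verify them on your cluster; the fields entry just shows one way to attach a traceable cluster label to every event.

    # filebeat.yml sketch for EMR nodes (paths and endpoint are assumptions)
    filebeat.inputs:
      - type: log
        paths:
          - /var/log/hadoop-yarn/containers/*/*/stderr
          - /var/log/hadoop-yarn/containers/*/*/stdout
        fields:
          cluster_label: "mycluster-2020-01-01"   # hypothetical traceable name
    output.elasticsearch:
      hosts: ["https://my-elk-endpoint:9200"]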

Configure elasticsearch APM with Apache Storm

I am trying to get real-time data for topologies running in Apache Storm using Elastic APM.
During topology submission, I am passing the required arguments such as -Delastic.apm.service_name, -Delastic.apm.server_url, -Delastic.apm.application_packages, and the -javaagent flag.
The topologies are running, and I can see the required parameters on the topology process. The topologies are processing data, but no data is coming into APM.
Can someone help me with this? Am I missing some arguments, is Storm not supported, or is it something else?
Has anyone configured APM on Apache Storm or Spark?
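For reference, the usual way to make sure JVM agent flags reach the Storm worker processes (and not just the process that submits the topology) is worker.childopts in storm.yaml, or topology.worker.childopts set per topology. The sketch below uses made-up paths and URLs and is not a confirmed fix for the problem described above.

    # storm.yaml sketch -- attach the Elastic APM agent to every worker JVM
    # (agent path, service name and server URL are placeholders)
    worker.childopts: "-javaagent:/opt/elastic-apm-agent.jar
      -Delastic.apm.service_name=my-storm-topology
      -Delastic.apm.server_url=http://apm-server:8200
      -Delastic.apm.application_packages=com.example"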

Replace LogStash with Spark Streaming

My requirement is to read log data from multiple machines.
LogStash - As far as I understand, Logstash agents have to be installed on all the machines, and Logstash can push data to Kafka as and when it arrives, i.e. even if a new line is added to a file, Logstash reads only that line, not the entire file again.
Questions
Now, is it possible to achieve the same with Spark Streaming?
If so, what are the advantages/disadvantages of using Spark Streaming over Logstash?
LogStash agents to be installed on all the machines
Yes, you need some agent on all machines. The solution in the ELK stack is actually FileBeat, not Logstash agents. Logstash is more of a server/message-bus in this scenario.
Similarly, some Spark job would need to be running to read a file. Personally, I would have anything else tail-ing a log file (even literally just tail -f file.log piping out to a network socket). Needing to write and distribute a Spark JAR plus config files is a clear disadvantage, especially when you need Java installed on each of the machines you are collecting logs from.
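As a trivial illustration of that point, assuming something (Logstash, for example) is listening on a plain TCP port, the "agent" can be as small as this; the hostname and port are placeholders:

    # ship newly appended log lines over a TCP socket (host/port are placeholders)
    tail -F /var/log/app/file.log | nc logstash-host 5000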
Flume and Fluentd are other widely used options for distributed log collection with Kafka destinations.
LogStash can push data to Kafka
The Beats framework has a Kafka Output, but you can also ship to Logstash first.
It's not clear if you are using LogStash purely for Kafka, or also using ElasticSearch here, but Kafka Connect provides a file-source (and Elasticsearch output).
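For example, Filebeat can write straight to Kafka via its Kafka output; the broker address and topic below are placeholders:

    # filebeat.yml sketch -- Beats Kafka output
    output.kafka:
      hosts: ["kafka-broker:9092"]
      topic: "app-logs"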
reads only that not the entire file again
Whatever tool you use (including Spark Streaming's file source) will typically be watching directories of files (because if you aren't rotating log files, you're doing it wrong). As files come in, or as bytes are written to a file, that framework will need to commit some type of marker internally to indicate what has been consumed so far. To reset the agent, this metadata should be removable/resettable so that consumption starts from the beginning.
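To make the Spark Streaming option concrete, here is a minimal PySpark sketch of its file source. Note that textFileStream only picks up new files appearing in the watched directory; it does not tail appends to an existing file. The directory path is a placeholder.

    # minimal Spark Streaming file-source sketch (directory is a placeholder)
    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext

    sc = SparkContext(appName="log-dir-watcher")
    ssc = StreamingContext(sc, 10)   # 10-second micro-batches

    # watches the directory for newly created files and streams their lines
    lines = ssc.textFileStream("hdfs:///logs/incoming")
    lines.pprint()

    ssc.start()
    ssc.awaitTermination()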

Apache Spark: Yarn logs Analysis

I have a Spark Streaming application, and I want to analyse the logs of the job using Elasticsearch and Kibana. My job runs on a YARN cluster, so the logs are written to HDFS because I have set yarn.log-aggregation-enable to true. But when I try to do this:
hadoop fs -cat ${yarn.nodemanager.remote-app-log-dir}/${user.name}/logs/<application ID>
I am seeing some encrypted/compressed data. What file format is this? How can I read the logs from this file? Can I use logstash to read this?
Also, if there is a better approach to analyse Spark logs, I am open to your suggestions.
Thanks.
The format is called a TFile, and it is a compressed file format.
Yarn however chooses to write the application logs into a TFile!! For those of you who don't know what a TFile is (and I bet a lot of you don't), you can learn more about it here, but for now this basic definition should suffice: "A TFile is a container of key-value pairs. Both keys and values are type-less bytes."
(quoted from a blog post titled "Splunk / Hadoop Rant")
There may be a way to edit YARN's and Spark's log4j.properties to send messages to Logstash using a SocketAppender.
However, that method is being deprecated.
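The usual way to turn the aggregated TFile logs back into plain text is the yarn logs command, whose output you could redirect to a file for Filebeat or Logstash to pick up; the application ID below is a placeholder:

    # dump the aggregated (TFile) YARN logs for an application as plain text
    yarn logs -applicationId application_1234567890123_0001 > app.log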

Is there any open source program that integrates logs in log4j format in a distributed system to a Cassandra DB?

I have a distributed system in which every node uses log4j to log system events. I have a Cassandra cluster, and I want to put all of my logs, in log4j format, into it.
Is there any open source program that ships logs in log4j format into my Cassandra cluster?
A good choice would be Apache Flume. https://cwiki.apache.org/FLUME/
Flume is a distributed log collection engine. It has built-in support for log4j: http://archive.cloudera.com/cdh/3/flume/UserGuide/#_logging_via_log4j_directly
There is also a plugin for using Cassandra as the log storage mechanism instead of HDFS: https://github.com/thobbs/flume-cassandra-plugin
I started an open source project with a custom log4j appender that sends messages to an Apache Cassandra cluster. The project site is: https://github.com/rviper/cassandra-log4j
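For the Flume route, the client side is typically just an extra appender in log4j.properties. The sketch below assumes a Flume NG agent with an Avro source and the flume-ng-log4jappender jar on the application's classpath; the hostname and port are placeholders.

    # log4j.properties sketch -- send log4j events to a Flume agent
    log4j.rootLogger=INFO, flume
    log4j.appender.flume=org.apache.flume.clients.log4jappender.Log4jAppender
    log4j.appender.flume.Hostname=flume-host
    log4j.appender.flume.Port=41414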
