Write Spark application ID with Spark logs - apache-spark

I have been researching this for a month and couldn't find a good solution. The default Spark logs don't contain the application ID.
The default logs contain: "time":, "date":, "level":, "thread":, "message"
I tried to customize the log4j properties but couldn't find a way. I am a newbie to the big data area.
My default log4j.properties file is:
log4j.rootLogger=${hadoop.root.logger}
hadoop.root.logger=INFO,console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.err
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{2}: %m%n
Does anyone know a solution? Any help, however small, is appreciated.
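For reference, here is a minimal sketch of one possible approach (an assumption, not a confirmed solution): log4j 1.x has a mapped diagnostic context (MDC), so the driver can register the application ID under an arbitrary key (appId below) once the SparkContext exists, and the layout can print it with %X{appId}. This only tags log lines from threads that see the MDC value, so it mainly helps on the driver side.
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} [%X{appId}] %p %c{2}: %m%n
// Scala, on the driver, right after the SparkSession/SparkContext is created
import org.apache.log4j.MDC
MDC.put("appId", spark.sparkContext.applicationId)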

Related

Databricks Log4J Custom Appender Not Working as expected

I'm trying to figure out how a custom appender should be configured in a Databricks environment, but I haven't managed to get it working.
When the cluster is running, in the driver logs, the time is displayed as 'unknown' for my custom log file, and when the cluster is stopped, the custom log file is not displayed at all in the list of log files.
#appender configuration
log4j.appender.bplm=com.databricks.logging.RedactionRollingFileAppender
log4j.appender.bplm.layout=org.apache.log4j.PatternLayout
log4j.appender.bplm.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n
log4j.appender.bplm.rollingPolicy=org.apache.log4j.rolling.TimeBasedRollingPolicy
log4j.appender.bplm.rollingPolicy.FileNamePattern=logs/log4j-%d{yyyy-MM-dd-HH}-bplm.log.gz
log4j.appender.bplm.rollingPolicy.ActiveFileName=logs/log4j-bplm.log
log4j.logger.com.myPackage=INFO,bplm
The above configuration was added to the following files:
"/databricks/spark/dbconf/log4j/executor/log4j.properties"
"/databricks/spark/dbconf/log4j/driver/log4j.properties"
"/databricks/spark/dbconf/log4j/master-worker/log4j.properties"
After the above configuration was added to the files mentioned, there are two issues I cannot figure out.
1 - When the cluster is running, if I go to the driver logs, I can see my custom log file in the list, generated and correctly populated, but the time column is displayed as 'unknown'.
2 - When the cluster is stopped, if I go to the driver logs, the files produced by my custom appender are not displayed at all (stdout, stderr, and log4j-active are displayed).
I also used different FileNamePatterns, but the issues mentioned above seem to happen for every configuration I tried:
log4j.appender.bplm.rollingPolicy.FileNamePattern=logs/log4j-%d{yyyy-MM-dd-HH}.bplm.log.gz - appender1
log4j.appender.bplm.rollingPolicy.FileNamePattern=logs/log4j.bplm-%d{yyyy-MM-dd-HH}.log.gz - appender2
log4j.appender.bplm.rollingPolicy.FileNamePattern=logs/bplm-log4j-%d{yyyy-MM-dd-HH}.log.gz - appender3
log4j.appender.bplm.rollingPolicy.FileNamePattern=logs/bplm.log4j-%d{yyyy-MM-dd-HH}.log.gz - appender4
log4j.appender.bplm.rollingPolicy.FileNamePattern=logs/log4j-%d{yyyy-MM-dd-HH}.log.bplm.gz - appender5
log4j.appender.bplm7.rollingPolicy.FileNamePattern=logs/log4j-bplm-%d{yyyy-MM-dd-HH}.log.gz - appender7
log4j.appender.bplm8.rollingPolicy.FileNamePattern=logs/log4j-%d{yyyy-MM-dd-HH}-bplm.log.gz - appender8
I also tried to put *-active in the ActiveFileName, but the result was the same:
log4j.appender.custom.rollingPolicy.FileNamePattern=/tmp/custom/logs/log4j-bplm-%d{yyyy-MM-dd-HH}.log.gz
log4j.appender.custom.rollingPolicy.ActiveFileName=/tmp/custom/logs/log4j-bplm-active.log
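For context (a sketch of common practice, not a confirmed fix for the two issues above): on Databricks these node-local files are recreated on every cluster start, so appender changes like the ones above are usually applied from a cluster-scoped init script rather than by editing the files by hand, for example:
#!/bin/bash
# Hypothetical init script: append the custom appender to the driver log4j.properties at cluster start
cat >> /databricks/spark/dbconf/log4j/driver/log4j.properties <<'EOF'
# ... the log4j.appender.bplm lines and log4j.logger.com.myPackage line from above ...
EOF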

How To Get Local Spark on AWS to Write to S3

I have installed Spark 2.4.3 with Hadoop 3.2 on an AWS EC2 instance. I’ve been using Spark (mainly pyspark) in local mode with great success. It is nice to be able to spin up something small and then resize it when I need power, and do it all very quickly. When I really need to scale I can switch to EMR and go to lunch. It all works smoothly apart from one issue: I can’t get the local Spark to reliably write to S3 (I've been using local EBS space instead). This is clearly something to do with all the issues outlined in the docs about S3’s limitations as a file system. However, using the latest Hadoop, my reading of the docs is that it should be possible to get this working.
Note that I'm aware of this other post, which asks a related question; there is some guidance there, but no solution that I can see: How to use new Hadoop parquet magic commiter to custom S3 server with Spark
I have the following settings (set in various places), following my best understanding of the documentation here: https://hadoop.apache.org/docs/r3.2.1/hadoop-aws/tools/hadoop-aws/index.html
fs.s3.impl: org.apache.hadoop.fs.s3a.S3AFileSystem
fs.s3a.committer.name: directory
fs.s3a.committer.magic.enabled: false
fs.s3a.committer.threads: 8
fs.s3a.committer.staging.tmp.path: /cache/staging
fs.s3a.committer.staging.unique-filenames: true
fs.s3a.committer.staging.conflict-mode: fail
fs.s3a.committer.staging.abort.pending.uploads: true
mapreduce.outputcommitter.factory.scheme.s3a: org.apache.hadoop.fs.s3a.commit.S3ACommitterFactory
fs.s3a.connection.maximum: 200
fs.s3a.fast.upload: true
A relevant point is that I’m saving as Parquet. I see that there was some problem with Parquet saving previously, but I don’t see it mentioned in the latest docs. Maybe this is the problem?
In any case, here is the error I’m getting, which seems indicative of the kind of error S3 gives when trying to rename the temporary folder. Is there some combination of settings that will make this go away?
java.io.IOException: Failed to rename S3AFileStatus{path=s3://my-research-lab-recognise/spark-testing/v2/nz/raw/bank/_temporary/0/_temporary/attempt_20190910022011_0004_m_000118_248/part-00118-c8f8259f-a727-4e19-8ee2-d6962020c819-c000.snappy.parquet; isDirectory=false; length=185052; replication=1; blocksize=33554432; modification_time=1568082036000; access_time=0; owner=brett; group=brett; permission=rw-rw-rw-; isSymlink=false; hasAcl=false; isEncrypted=false; isErasureCoded=false} isEmptyDirectory=FALSE to s3://my-research-lab-recognise/spark-testing/v2/nz/raw/bank/part-00118-c8f8259f-a727-4e19-8ee2-d6962020c819-c000.snappy.parquet
at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.mergePaths(FileOutputCommitter.java:473)
at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.mergePaths(FileOutputCommitter.java:486)
at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.commitTask(FileOutputCommitter.java:597)
at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.commitTask(FileOutputCommitter.java:560)
at org.apache.spark.mapred.SparkHadoopMapRedUtil$.performCommit$1(SparkHadoopMapRedUtil.scala:50)
at org.apache.spark.mapred.SparkHadoopMapRedUtil$.commitTask(SparkHadoopMapRedUtil.scala:77)
at org.apache.spark.internal.io.HadoopMapReduceCommitProtocol.commitTask(HadoopMapReduceCommitProtocol.scala:225)
at org.apache.spark.sql.execution.datasources.FileFormatDataWriter.commit(FileFormatDataWriter.scala:78)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:247)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:242)
at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1394)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:248)
... 10 more
I helped @brettc with his configuration and we worked out the correct settings.
Under $SPARK_HOME/conf/spark-defaults.conf
# Enable the S3 file system to be recognised
spark.hadoop.fs.s3a.impl org.apache.hadoop.fs.s3a.S3AFileSystem
# Parameters to use the new committers
spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version 2
spark.hadoop.fs.s3a.committer.name directory
spark.hadoop.fs.s3a.committer.magic.enabled false
spark.hadoop.fs.s3a.committer.staging.conflict-mode replace
spark.hadoop.fs.s3a.committer.staging.unique-filenames true
spark.hadoop.fs.s3a.committer.staging.abort.pending.uploads true
spark.hadoop.mapreduce.outputcommitter.factory.scheme.s3a org.apache.hadoop.fs.s3a.commit.S3ACommitterFactory
spark.sql.sources.commitProtocolClass org.apache.spark.internal.io.cloud.PathOutputCommitProtocol
spark.sql.parquet.output.committer.class org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter
If you look at the last 2 configuration lines above, you see that you need the library containing the org.apache.spark.internal.io.cloud.PathOutputCommitProtocol and BindingParquetOutputCommitter classes. To get it you have to download the spark-hadoop-cloud jar (in our case we took version 2.3.2.3.1.0.6-1) and place it under $SPARK_HOME/jars/.
You can easily verify that you are using the new committer by creating a Parquet file. The _SUCCESS file should contain a JSON document like the one below:
{
"name" : "org.apache.hadoop.fs.s3a.commit.files.SuccessData/1",
"timestamp" : 1574729145842,
"date" : "Tue Nov 26 00:45:45 UTC 2019",
"hostname" : "<hostname>",
"committer" : "directory",
"description" : "Task committer attempt_20191125234709_0000_m_000000_0",
"metrics" : { [...] },
"diagnostics" : { [...] },
"filenames" : [...]
}
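As a quick smoke test (a sketch; the bucket and path below are placeholders), writing any small Parquet dataset should produce that _SUCCESS object, which you can then download and inspect for the "committer" : "directory" entry:
// spark-shell (Scala); s3a://my-bucket/committer-check/ is a made-up location
spark.range(100).toDF("id")
  .write.mode("overwrite")
  .parquet("s3a://my-bucket/committer-check/")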

parsing log4j HTMLLayout logfile with filebeat

I am a newbie to the ELK stack and it could be that my approach is wrong.
I'm trying to get my logs into Elasticsearch via Filebeat > Logstash.
Logfiles are saved as [Date]log.html.
They look like this (picture copied from a Stack Overflow thread): HTMLLayout Log4j picture example
I'm able to read the [Date]log.html and view it via Kibana.
But the way it is displayed is awful. All the HTML tags are shown too.
I would like to parse it into the fields shown in the picture linked above:
- Time
- Thread
- Level
- Category
- Message
I'm trying this in filebeat.yml under filebeat.prospectors, without success.
Thank you for your help in advance.
Best regards
Edit:
My filebeat.yml (trial and error):
filebeat.prospectors:
- input_type: log
  # Paths that should be crawled and fetched. Glob based paths.
  paths:
    - c:\MyDir\log.html
  multiline:
    pattern: '^[[:space:]]'
    negate: false
    match: after

#----------------------------- Logstash output --------------------------------
output.logstash:
  # The Logstash hosts
  hosts: ["localhost:5044"]
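One direction worth trying (a sketch, not a verified configuration): HTMLLayout writes each log event as a multi-line <tr> block, so the multiline settings can group every line that does not start a new row onto the preceding <tr> line; stripping the tags and extracting Time/Thread/Level/Category/Message is then usually easier to do on the Logstash side (e.g. with mutate gsub and grok) than in Filebeat itself.
filebeat.prospectors:
- input_type: log
  paths:
    - c:\MyDir\log.html
  # Start a new event on every <tr>; all other lines are appended to the previous event
  multiline:
    pattern: '^<tr'
    negate: true
    match: after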

Is the rollover logic of TimeBasedRollingPolicy correct?

The documentation says that if the file names haven't changed, no rollover happens; the file name is derived from the FileNamePattern string.
I have two observations:
1) If the appender has no message to write today, it won't roll the file even if the triggering time has passed (i.e. we are left with a log file that was last modified yesterday).
2) If yesterday's log file is 0 KB (meaning no log was written to the file yesterday) and the appender has messages to write today, it rolls the 0 KB file and writes the data to a newly created log file.
I wanted to discuss whether both of the above cases are correctly implemented by the TimeBasedRollingPolicy class, or whether the implementation should be changed.
My implementation strategy for the first scenario would be: if FileNamePattern is set to %d{dd-MM-yyyy}, then at midnight the file should be rolled regardless of whether the appender has data to write, as long as yesterday's file is non-empty.
For the second scenario, if yesterday's file is 0 KB, meaning no message was logged yesterday, then the appender should keep writing into the same file, because the main purpose of rolling is to take a backup of the logs, and if the file is empty, is it worth rolling it?
Take the log4j.properties configuration below as a reference for the discussion of the two scenarios above.
Sample log4j.properties
####### Root Logger ########################################
log4j.rootLogger=ERROR,CA,FA
############################################################
################### APPENDERS ##############################
############################################################
# CA is set to be a ConsoleAppender
log4j.appender.CA=org.apache.log4j.ConsoleAppender
log4j.appender.CA.layout=org.apache.log4j.PatternLayout
log4j.appender.CA.layout.ConversionPattern=%d %p %t %c: %m%n
# FA is set to be a FileAppender
log4j.appender.FA=org.apache.log4j.rolling.RollingFileAppender
log4j.appender.FA.RollingPolicy=org.apache.log4j.rolling.TimeBasedRollingPolicy
log4j.appender.FA.RollingPolicy.FileNamePattern=.\\logs\\application.log-%d{dd-MM-yyyy}
log4j.appender.FA.File=.\\logs\\application.log
log4j.appender.FA.layout=org.apache.log4j.EnhancedPatternLayout
log4j.appender.FA.layout.ConversionPattern=%d %p %t %c: %m%n
log4j.appender.FA.Append=true

How to avoid duplicating column headers in nlog CSVLayout?

I'm using NLog's CSV layout for data logging.
<target xsi:type="File" [...]>
<layout xsi:type="CSVLayout">
<column name="Value1" layout="${event-context:item=Value1}"/>
[...]
I'm getting column headers appended to the file every time I run the logging application. Is there a way to have the column headers written only when a new log file is created?
Thanks
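One workaround sketch (an assumption, not a confirmed fix; it changes the file rotation rather than suppressing the header): let the File target archive the previous file on startup, so each application run writes to a fresh file that gets exactly one header row. The file names below are placeholders.
<target xsi:type="File" name="csvlog" fileName="data.csv"
        archiveFileName="data.archive.{#}.csv"
        archiveNumbering="Rolling" maxArchiveFiles="10"
        archiveOldFileOnStartup="true">
  <layout xsi:type="CSVLayout">
    <column name="Value1" layout="${event-context:item=Value1}"/>
  </layout>
</target>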
