log4j in spark main code - apache-spark

i want to collect logger according to log4j, i can get the hadoop logger but can't get the main code logger.
tow attached - 1 log4j.properties
log4j.rootLogger=INFO, rolling
log4j.appender.rolling=org.apache.log4j.RollingFileAppender
log4j.appender.rolling.layout=org.apache.log4j.PatternLayout
log4j.appender.rolling.layout.conversionPattern=[%d] %p %m (%c)%n
log4j.appender.rolling.maxFileSize=50MB
log4j.appender.rolling.maxBackupIndex=5
log4j.appender.rolling.file=/home/spark.log
log4j.appender.rolling.encoding=UTF-8
log4j.logger.org.eclipse.jetty=WARN
log4j.logger.org.eclipse.jetty.util.component.AbstractLifeCycle=WARN
log4j.logger.org.apache.spark.repl.SparkIMain$exprTyper=WARN
log4j.logger.org.apache.spark.repl.SparkILoop$SparkILoopInterpreter=WARN
log4j.logger.com.test.main.Main=WARN
tow attached - 2 com.test.main.Main
object Main {
def main(args: Array[String]): Unit = {
val logger = LogManager.getLogger(Main.getClass)
logger.info("info\n")
logger.warn("warn\n")
logger.debug("DEBUG\n")
logger.error("EEOR\n")
now,I can get spark logger in /home/spark.log, such as
[2018-08-14 18:04:28,852] INFO Running Spark version 2.2.0.cloudera1 (org.apache.spark.SparkContext)
[2018-08-14 18:04:29,705] INFO Submitted application: steaming_Test (org.apache.spark.SparkContext)
but no logger in main code such as
18:05:17.250 [main] ERROR com.sgm.bgdt.main.Main$ - EEOR
is there a error setting for "log4j.logger.com.test.main.Main=WARN" or something wrong in my main code?
PS.
this is my spark submit
--driver-java-options "-Dlog4j.configuration=file:/path/log4j.properties
--conf "spark.executor.extraJavaOptions=-Dlog4j.configuration=file:/path/log4j.properties

Related

Add Sentry Log4j2 appender at runtime

I've been browsing previous threads about adding Log4j2 appenders at runtime but none of them really seem to fit my scenario.
We have a longrunning Flink Job packaged into a FAT jar that we essentially submit to a running Flink cluster. We want to forward error logs to Sentry. Conveniently enough Sentry provides a Log4j2 appender that I want to be able to use, but all attempts to get Log4j2 to work have failed -- going a bit crazy about this (spent days).
Since Flink (who also uses log4j2) provides a set of default logging configurations that takes precedence of any configuration files we bundle in our jar; I'm essentially left with attempting to configure the appender at runtime to see if that will make it register the appender and forward the LogEvents to it.
As a side note: I attempted to override the Flink provided configuration file (to essentially add the appender directly into the Log4j2.properties file) but Flink fails to load the plugin due to a missing dependency - io.sentry.IHub - which doesn't make sense since all examples/sentry docs don't mention any other dependencies outside of log4j related ones which already exists in the classpath.
I've followed the example in the log4j docs: Programmatically Modifying the Current Configuration after Initialization but the logs are not getting through to Sentry.
SentryLog4j.scala
package com.REDACTED.thoros.config
import io.sentry.log4j2.SentryAppender
import org.apache.logging.log4j.Level
import org.apache.logging.log4j.LogManager
import org.apache.logging.log4j.core.LoggerContext
import org.apache.logging.log4j.core.config.AppenderRef
import org.apache.logging.log4j.core.config.Configuration
import org.apache.logging.log4j.core.config.LoggerConfig
object SentryLog4j2 {
val SENTRY_LOGGER_NAME = "Sentry"
val SENTRY_BREADCRUMBS_LEVEL: Level = Level.ALL
val SENTRY_MINIMUM_EVENT_LEVEL: Level = Level.ERROR
val SENTRY_DSN =
"REDACTED"
def init(): Unit = {
// scalafix:off
val loggerContext: LoggerContext =
LogManager.getContext(false).asInstanceOf[LoggerContext]
val configuration: Configuration = loggerContext.getConfiguration
val sentryAppender: SentryAppender = SentryAppender.createAppender(
SENTRY_LOGGER_NAME,
SENTRY_BREADCRUMBS_LEVEL,
SENTRY_MINIMUM_EVENT_LEVEL,
SENTRY_DSN,
false,
null
)
sentryAppender.start()
configuration.addAppender(sentryAppender)
// Creating a new dedicated logger for Sentry
val ref: AppenderRef =
AppenderRef.createAppenderRef("Sentry", null, null)
val refs: Array[AppenderRef] = Array(ref)
val loggerConfig: LoggerConfig = LoggerConfig.createLogger(
false,
Level.ERROR,
"org.apache.logging.log4j",
"true",
refs,
null,
configuration,
null
)
loggerConfig.addAppender(sentryAppender, null, null)
configuration.addLogger("org.apache.logging.log4j", loggerConfig)
println(configuration.getAppenders)
loggerContext.updateLoggers()
// scalafix:on
}
}
Then invoking the SentryLog4j.init() in the Main module.
import org.apache.logging.log4j.LogManager
import org.apache.logging.log4j.Logger
import org.apache.logging.log4j.core.LoggerContext
import org.apache.logging.log4j.core.config.Configuration
object Main {
val logger: Logger = LogManager.getLogger()
sys.env.get("ENVIRONMENT") match {
case Some("dev") | Some("staging") | Some("production") =>
SentryLog4j2.init()
case _ => SentryLog4j2.init() // <-- this was only added during debugging
}
def main(args: Array[String]): Unit = {
logger.error("test") // this does not forward the logevent to the appender
}
}
I think I somehow need to register the appender to loggerConfig that the rootLogger uses so that all logger.error statements are propogated to the configured Sentry appender?
Greatly appreciate any guidance with this!
Although not an answer to how you get log4j2 and the SentryAppender to work. For anyone else that is stumbling on this problem, I'll just briefly explain what I did to get the sentry integration working.
What I eventually decided to do was drop the use of the SentryAppender and instead used the raw sentry client. Adding a wrapper class exposing the typical debug, info, warn and error methods. Then for the warn+ methods, I'd also send the logevent to Sentry.
This is essentially the only way I got this to work within a Flink cluster.
See example below:
sealed trait LoggerLike {
type LoggerFn = (String, Option[Object]) => Unit
val debug: LoggerFn
val info: LoggerFn
val warn: LoggerFn
val error: LoggerFn
}
trait LazyLogging {
#transient
protected lazy val logger: CustomLogger =
CustomLogger.getLogger(getClass.getName, enableSentry = true)
}
final class CustomLogger(slf4JLogger: Logger) extends LoggerLike {...your implementation...}
Then for each class/object (scala language at least), you'd just extend the LazyLogging trait to get a logger instance.

Log4j in file Spark

I am trying to log my project. For that, I'm using log4j and I'm putting the information and settings in the code itself, without using the properties file, as shown below.
public class Teste {
Logger log = Logger.getLogger(Teste.class.getName());
public static void configError() {
EnhancedPatternLayout layout = new EnhancedPatternLayout();
String conversionPattern = "%d{ISO8601}{GMT+1} %-5p %m%n";
layout.setConversionPattern(conversionPattern);
String fileError = "C:/ProducerError.log";
// creates console appender
ConsoleAppender consoleAppender = new ConsoleAppender();
consoleAppender.setLayout(layout);
consoleAppender.activateOptions();
// creates file appender
FileAppender fileAppender = new FileAppender();
fileAppender.setFile(fileError);
fileAppender.setLayout(layout);
fileAppender.activateOptions();
// configures the root logger
Logger rootLogger = Logger.getRootLogger();
rootLogger.setLevel(Level.ERROR);
rootLogger.addAppender(consoleAppender);
rootLogger.addAppender(fileAppender);
log.error("Error teste");
rootLogger.removeAllAppenders();
}
}
I wanted to do the same but in a spark file. I tried the same way but it doesn't return anything. How does spark logs work? Can't I put it in the code like I did before? I have a DockerFile with spark-submit, but I didn't want to mess with that code.
Provide config file path while submitting the spark job -Dlog4j.configuration=path/to/log4j.properties

Too many open files in spark aborting spark job

In my application i am reading 40 GB text files that is totally spread across 188 files .
I split this files and create xml files per line in spark using pair rdd .
For 40 GB of input it will create many millions small xml files and this is my requirement.
All working fine but when spark saves files in S3 it throws error and job fails .
Here is the exception i get
Caused by: java.nio.file.FileSystemException:
/mnt/s3/emrfs-2408623010549537848/0000000000: Too many open files at
sun.nio.fs.UnixException.translateToIOException(UnixException.java:91)
at
sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
at
sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107)
at
sun.nio.fs.UnixFileSystemProvider.newByteChannel(UnixFileSystemProvider.java:214)
at java.nio.file.Files.newByteChannel(Files.java:361) at
java.nio.file.Files.createFile(Files.java:632) at
com.amazon.ws.emr.hadoop.fs.files.TemporaryFiles.create(TemporaryFiles.java:70)
at
com.amazon.ws.emr.hadoop.fs.s3n.MultipartUploadOutputStream.openNewPart(MultipartUploadOutputStream.java:493)
... 21 more
ApplicationMaster host: 10.97.57.198 ApplicationMaster RPC port: 0
queue: default start time: 1542344243252 final status: FAILED
tracking URL:
http://ip-10-97-57-234.tr-fr-nonprod.aws-int.thomsonreuters.com:20888/proxy/application_1542343091900_0001/
user: hadoop Exception in thread "main"
org.apache.spark.SparkException: Application
application_1542343091900_0001 finished with failed status
And this as well
com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.model.AmazonS3Exception:
Please reduce your request rate. (Service: Amazon S3; Status Code:
503; Error Code: SlowDown; Request ID: D33581CA9A799F64; S3 Extended
Request ID:
/SlEplo+lCKQRVVH+zHiop0oh8q8WqwnNykK3Ga6/VM2HENl/eKizbd1rg4vZD1BZIpp8lk6zwA=),
S3 Extended Request ID:
/SlEplo+lCKQRVVH+zHiop0oh8q8WqwnNykK3Ga6/VM2HENl/eKizbd1rg4vZD1BZIpp8lk6zwA=
Here is my code to do that .
object TestAudit {
def main(args: Array[String]) {
val inputPath = args(0)
val output = args(1)
val noOfHashPartitioner = args(2).toInt
//val conf = new SparkConf().setAppName("AuditXML").setMaster("local");
val conf = new SparkConf().setAppName("AuditXML")
val sc = new SparkContext(conf);
val input = sc.textFile(inputPath)
val pairedRDD = input.map(row => {
val split = row.split("\\|")
val fileName = split(0)
val fileContent = split(1)
(fileName, fileContent)
})
import org.apache.hadoop.io.NullWritable
import org.apache.spark.HashPartitioner
import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat
class RddMultiTextOutputFormat extends MultipleTextOutputFormat[Any, Any] {
override def generateActualKey(key: Any, value: Any): Any = NullWritable.get()
override def generateFileNameForKeyValue(key: Any, value: Any, name: String): String = key.asInstanceOf[String]
}
pairedRDD.partitionBy(new HashPartitioner(10000)).saveAsHadoopFile("s3://a205381-tr-fr-development-us-east-1-trf-auditabilty//AUDITOUTPUT", classOf[String], classOf[String], classOf[RddMultiTextOutputFormat], classOf[GzipCodec])
}
}
Even i tried reducing no of HashPartitioner then also it does not work
Every process on Unix systems has a limitation of open files or number of file descriptors. As your data is large and partitions to subfile (in internal of Spark), your process meet the limitation and error.
You can increase the number of file descriptors for each user as following:
edit the file: /etc/security/limits.conf and add (or modify)
* hard nofile 500000
* soft nofile 500000
root hard nofile 500000
root soft nofile 500000
This will set the nofile (number of file descriptors) feature to 500000 for each user along with the root user.
After restarting the changes will be applied.
Also, someone can set the number of file descriptors for a special process, by setting the LimitNOFILE. For example, if you use yarn to run Spark jobs and the Yarn daemon will be started using systemd, you can add LimitNOFILE=128000 to Yarn systemd script(resource manager and nodemanager) to set Yarn process number of file descriptors to 128000.
related articles:
3 Methods to Change the Number of Open File Limit in Linux
Limits on the number of file descriptors

Spark Structured Streaming -

I'm trying to run the following code from IntelliJ idea to print messages from Kafka to console. But it throws the following error -
Exception in thread "main" org.apache.spark.sql.AnalysisException: Queries with streaming sources must be executed with writeStream.start();;
Stacktrace started from Dataset.checkpoint and way up. If I remove .checkpoint() then I get some other error - related to permission
17/08/02 12:10:52 ERROR StreamMetadata: Error writing stream metadata StreamMetadata(4e612f22-efff-4c9a-a47a-a36eb533e9d6) to C:/Users/rp/AppData/Local/Temp/temporary-2f570b97-ad16-4f00-8356-d43ccb7660db/metadata
java.io.IOException: (null) entry in command string: null chmod 0644 C:\Users\rp\AppData\Local\Temp\temporary-2f570b97-ad16-4f00-8356-d43ccb7660db\metadata
Source:
def main(args : Array[String]) = {
val spark = SparkSession.builder().appName("SparkStreaming").master("local[*]").getOrCreate()
val canonicalSchema = new StructType()
.add("cid",StringType)
.add("uid",StringType)
.add("sourceSystem",
new StructType().add("id",StringType)
.add("name",StringType))
.add("name", new StructType()
.add("firstname",StringType)
.add("lastname",StringType))
val messages = spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers","localhost:9092")
.option("subscribe","c_canonical")
.option("startingOffset","earliest")
.load()
.checkpoint()
.select(from_json(col("value").cast("string"),canonicalSchema))
.writeStream.outputMode("append").format("console").start.awaitTermination
}
Can anyone please help me understand where I'm doing wrong?
Structured Streaming doesn't support Dataset.checkpoint(). There is an open ticket to provide a better message or just ignore it: https://issues.apache.org/jira/browse/SPARK-20927
IOException probably is because you don't install cygwin on Windows.

Apache Spark: saveAsTextFile not working correctly in Stand Alone Mode

I wrote a simple Apache Spark (1.2.0) Java program to import a text file and then write it to disk using saveAsTextFile. But the output folder either has no content (just the _SUCCESS file) or at times has incomplete data (data from just 1/2 of the tasks ).
When I do a rdd.count() on the RDD, it shows the correct number, so I know the RDD correctly constructed, it is just the saveAsTextFile method which is not working.
Here is the code:
/* SimpleApp.java */
import java.util.List;
import org.apache.spark.api.java.*;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.function.Function;
public class SimpleApp {
public static void main(String[] args) {
String logFile = "/tmp/READ_ME.txt"; // Should be some file on your system
SparkConf conf = new SparkConf().setAppName("Simple Application");
JavaSparkContext sc = new JavaSparkContext(conf);
JavaRDD<String> logData = sc.textFile(logFile);
logData.saveAsTextFile("/tmp/simple-output");
System.out.println("Lines -> " + logData.count());
}
}
This is because you're saving to a local path. Are you running multiple machines? so, each worker is saving to its own /tmp directory. Sometimes, you have the driver executing a task so you get part of the result locally. Really you won't want to mix distributed mode and local file systems.
You can try code like below(for eg)..
JavaSparkContext sc = new JavaSparkContext("local or your network IP","Application name");
JavaRDD<String> lines = sc.textFile("Path Of Your File", No. of partitions).count();
And then you print no. of lines containing in file.

Resources