slf4j logger is not creating log file in my spring batch application - log4j

I am using slf4j in my application for logging. It logs to the console but does not create a log file on the file system. Below is my log4j.properties file:
# Root logger option
log4j.rootLogger=DEBUG, RollingAppender,stdout
# Redirect log messages to console
log4j.appender.stdout=org.apache.log4j.ConsoleAppender
log4j.appender.stdout.Target=System.out
log4j.appender.stdout.layout=org.apache.log4j.PatternLayout
log4j.appender.stdout.layout.ConversionPattern=%d{yyyy-MM-dd HH:mm:ss} %-5p %c{1}:%L - %m%n
# Redirect log messages to a log file
log4j.appender.RollingAppender=org.apache.log4j.DailyRollingFileAppender
log4j.appender.RollingAppender.File=E:\\SalesTerritory.log
log4j.appender.RollingAppender.DatePattern='.'yyyy-MM-dd
log4j.appender.RollingAppender.layout=org.apache.log4j.PatternLayout
log4j.appender.RollingAppender.layout.ConversionPattern=[%p] %d %c %M - %m%n
The logging dependencies in my pom.xml are:
<dependency>
<groupId>org.slf4j</groupId>
<artifactId>slf4j-api</artifactId>
<version>1.7.6</version>
</dependency>
<dependency>
<groupId>org.slf4j</groupId>
<artifactId>slf4j-log4j12</artifactId>
<version>1.7.5</version>
</dependency>
Please find the dependency jar list in the attached image:
[1]: https://i.stack.imgur.com/KGa2x.png

Add the dependency below for log4j itself; slf4j-log4j12 is only the binding, and it needs the actual log4j implementation on the classpath:
<!-- https://mvnrepository.com/artifact/log4j/log4j -->
<dependency>
<groupId>log4j</groupId>
<artifactId>log4j</artifactId>
<version>1.2.17</version>
</dependency>
Use INFO in log4j.rootLogger instead of DEBUG, because INFO designates informational messages that highlight the progress of the application at a coarse-grained level.
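Putting it together, the full slf4j-to-log4j chain needs all three artifacts; ideally the two slf4j versions should also match (the versions below are illustrative):

```xml
<!-- slf4j API, the binding to log4j 1.x, and log4j itself.
     Keep slf4j-api and slf4j-log4j12 at the same version. -->
<dependency>
    <groupId>org.slf4j</groupId>
    <artifactId>slf4j-api</artifactId>
    <version>1.7.6</version>
</dependency>
<dependency>
    <groupId>org.slf4j</groupId>
    <artifactId>slf4j-log4j12</artifactId>
    <version>1.7.6</version>
</dependency>
<dependency>
    <groupId>log4j</groupId>
    <artifactId>log4j</artifactId>
    <version>1.2.17</version>
</dependency>
```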

Related

log4j2 TimeBasedTriggeringPolicy isn't logging anything

I'm looking to create a simple TimeBasedTriggeringPolicy that creates a new log file every 5 minutes, but it's not logging anything and I'm not sure what I'm doing wrong. Here are the details, thank you!
My log4j dependencies:
<dependency>
<groupId>org.apache.logging.log4j</groupId>
<artifactId>log4j-api</artifactId>
<version>2.17.1</version>
</dependency>
<dependency>
<groupId>org.apache.logging.log4j</groupId>
<artifactId>log4j-to-slf4j</artifactId>
<version>2.17.1</version>
</dependency>
<dependency>
<groupId>org.apache.logging.log4j</groupId>
<artifactId>log4j-core</artifactId>
<version>2.17.1</version>
</dependency>
log4j.rootLogger=INFO,rolling
log4j.appender.rolling.type=RollingFile
log4j.appender.rolling.name=fileAppender
log4j.appender.rolling.fileName=${spark.yarn.app.container.log.dir}/spark-${date:yyyyMMdd_HH-mm}.log
log4j.appender.rolling.filePattern=${spark.yarn.app.container.log.dir}/spark-%d{yyyyMMdd_HH-mm}-%i.log
log4j.appender.rolling.layout.type=PatternLayout
log4j.appender.rolling.layout.pattern=%d %p %t %c - %m%n
log4j.appender.rolling.policies.type=Policies
log4j.appender.rolling.policies.time.type=TimeBasedTriggeringPolicy
log4j.appender.rolling.policies.time.interval=5
log4j.appender.rolling.policies.time.modulate=true
log4j.rootLogger.level=info
log4j.rootLogger.appenderRef.rolling.ref=fileLogger
Error Log:
log4j:ERROR Could not find value for key log4j.appender.rolling
log4j:ERROR Could not instantiate appender named "rolling".
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/mnt1/yarn/usercache/hadoop/filecache/45/__spark_libs__6130489434181934811.zip/slf4j-log4j12-1.7.16.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/lib/hadoop/lib/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
Final Working Result (creates log every minute, and rotates monthly):
log4j.rootLogger=INFO, loggerId
log4j.appender.loggerId=org.apache.log4j.rolling.RollingFileAppender
log4j.appender.loggerId.rollingPolicy=org.apache.log4j.rolling.TimeBasedRollingPolicy
log4j.appender.loggerId.rollingPolicy.ActiveFileName=${spark.yarn.app.container.log.dir}/spark.log
log4j.appender.loggerId.rollingPolicy.FileNamePattern=${spark.yarn.app.container.log.dir}/spark_%d{dd-HHmm}.log.gz
log4j.appender.loggerId.layout=org.apache.log4j.PatternLayout
log4j.appender.loggerId.layout.ConversionPattern=%d %p %t %c - %m%n
log4j.appender.loggerId.encoding=UTF-8
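Note that the org.apache.log4j.rolling.* classes used above are not in the core log4j jar; they ship in the log4j "extras" companion, so a dependency along these lines is presumably also needed (version illustrative):

```xml
<!-- Companion jar providing org.apache.log4j.rolling.RollingFileAppender
     and TimeBasedRollingPolicy for log4j 1.x. -->
<dependency>
    <groupId>log4j</groupId>
    <artifactId>apache-log4j-extras</artifactId>
    <version>1.2.17</version>
</dependency>
```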
There are several errors in your configuration (run with -Dlog4j.debug=true to catch them all):
Log4j2 does not have properties prefixed with log4j., remove the prefix,
There is no property spark.yarn.app.container.log.dir in your configuration. If you meant to use a Java system property use ${sys:spark.yarn.app.container.log.dir},
You define the root logger twice:
once with the shorthand notation introduced in version 2.17.2: rootLogger = INFO, rolling,
another time using the full notation:
rootLogger.level = INFO
rootLogger.appenderRef.rolling.ref = fileLogger
neither definition uses the name you gave to the appender: fileAppender.
You don't have a <SizeBasedTriggeringPolicy>, but your pattern contains %i. If you remove -%i, your fileName and filePattern are identical: you probably want to use a direct rollover strategy instead and omit fileName.
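For reference, a sketch of the intended configuration in valid Log4j2 properties syntax, assuming spark.yarn.app.container.log.dir is set as a Java system property (untested; adjust names and paths to your setup):

```properties
# No "log4j." prefix in Log4j2 properties files.
appender.rolling.type = RollingFile
appender.rolling.name = fileAppender
appender.rolling.fileName = ${sys:spark.yarn.app.container.log.dir}/spark.log
appender.rolling.filePattern = ${sys:spark.yarn.app.container.log.dir}/spark-%d{yyyyMMdd_HH-mm}.log
appender.rolling.layout.type = PatternLayout
appender.rolling.layout.pattern = %d %p %t %c - %m%n
appender.rolling.policies.type = Policies
appender.rolling.policies.time.type = TimeBasedTriggeringPolicy
appender.rolling.policies.time.interval = 5
appender.rolling.policies.time.modulate = true
# The root logger references the appender by its name, fileAppender.
rootLogger.level = INFO
rootLogger.appenderRef.rolling.ref = fileAppender
```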
This creates a new log file every minute; we can't do 5-minute increments unless we use XML configuration or a custom policy. It rotates monthly: I took yyyyMM out of the date pattern so the file names repeat each month; take out dd as well and it'll rotate daily:
log4j.rootLogger=INFO, loggerId
log4j.appender.loggerId=org.apache.log4j.rolling.RollingFileAppender
log4j.appender.loggerId.rollingPolicy=org.apache.log4j.rolling.TimeBasedRollingPolicy
log4j.appender.loggerId.rollingPolicy.ActiveFileName=${spark.yarn.app.container.log.dir}/spark.log
log4j.appender.loggerId.rollingPolicy.FileNamePattern=${spark.yarn.app.container.log.dir}/spark_%d{dd-HHmm}.log.gz
log4j.appender.loggerId.layout=org.apache.log4j.PatternLayout
log4j.appender.loggerId.layout.ConversionPattern=%d %p %t %c - %m%n
log4j.appender.loggerId.encoding=UTF-8
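As a side note, Log4j2 itself can roll every 5 minutes: TimeBasedTriggeringPolicy counts its interval in the smallest time unit of the filePattern (minutes below). A hedged XML sketch, with a hypothetical log.dir system property:

```xml
<Configuration status="warn">
  <Appenders>
    <!-- interval="5" rolls every 5 minutes because %d{...HH-mm}
         makes minutes the finest unit in the file pattern. -->
    <RollingFile name="rolling"
                 fileName="${sys:log.dir}/spark.log"
                 filePattern="${sys:log.dir}/spark-%d{yyyyMMdd_HH-mm}.log">
      <PatternLayout pattern="%d %p %t %c - %m%n"/>
      <TimeBasedTriggeringPolicy interval="5" modulate="true"/>
    </RollingFile>
  </Appenders>
  <Loggers>
    <Root level="INFO">
      <AppenderRef ref="rolling"/>
    </Root>
  </Loggers>
</Configuration>
```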

Flink job with CassandraSink fails with Error writing

I have two simple Flink streaming jobs that read from Kafka, do some transformations, and put the result into a Cassandra sink. They read from different Kafka topics and save into different Cassandra tables.
When I run any one of the two jobs alone everything works fine. Checkpoints are triggered and completed and data is saved to Cassandra.
But whenever I run both jobs (or one of them twice), the second job fails at startup with this exception:
com.datastax.driver.core.exceptions.NoHostAvailableException: All host(s) tried for query failed (tried: localhost/127.0.0.1:9042 (com.datastax.driver.core.exceptions.TransportException: [localhost/127.0.0.1] Error writing)).
I could not find much info about this error, it may be caused by any one of the following:
Flink (v 1.10.0-scala_2.12),
Flink Cassandra Connector (flink-connector-cassandra_2.11:jar:1.10.2, also tried with flink-connector-cassandra_2.12:jar:1.10.0),
Datastax underlying driver (v 3.10.2),
Cassandra v4.0 (same with v3.0),
Netty transport (v 4.1.51.Final).
I also use packages that may have collisions with the first ones:
mysql-connector-java (v 8.0.19),
cassandra-driver-extras (v 3.10.2)
Finally this is my code for the cluster builder:
ClusterBuilder builder = new ClusterBuilder() {
    @Override
    protected Cluster buildCluster(Cluster.Builder builder) {
        Cluster cluster = null;
        try {
            cluster = builder
                    .addContactPoint("localhost")
                    .withPort(9042)
                    .withClusterName("Test Cluster")
                    .withoutJMXReporting()
                    .withProtocolVersion(ProtocolVersion.V4)
                    .withoutMetrics()
                    .build();
            // register codecs from datastax extras.
            cluster.getConfiguration().getCodecRegistry()
                    .register(LocalTimeCodec.instance);
        } catch (ConfigurationException e) {
            e.printStackTrace();
        } catch (NoHostAvailableException nhae) {
            nhae.printStackTrace();
        }
        return cluster;
    }
};
I tried with different PoolingOptions and SocketOptions settings but no success.
Cassandra Sink:
CassandraSink.addSink(dataRows)
        .setQuery("insert into table_name_(16 columns names) " +
                "values (16 placeholders);")
        .enableWriteAheadLog()
        .setClusterBuilder(builder)
        .setFailureHandler(new CassandraFailureHandler() {
            @Override
            public void onFailure(Throwable throwable) {
                LOG.error("A {} occurred.", "Cassandra Failure", throwable);
            }
        })
        .build()
        .setParallelism(1)
        .name("Cassandra Sink For Unique Count every N minutes.");
The full trace log from flink job manager:
com.datastax.driver.core.exceptions.NoHostAvailableException: All host(s) tried for query failed (tried: localhost/127.0.0.1:9042 (com.datastax.driver.core.exceptions.TransportException: [localhost/127.0.0.1] Error writing))
at com.datastax.driver.core.ControlConnection.reconnectInternal(ControlConnection.java:231)
at com.datastax.driver.core.ControlConnection.connect(ControlConnection.java:77)
at com.datastax.driver.core.Cluster$Manager.init(Cluster.java:1414)
at com.datastax.driver.core.Cluster.init(Cluster.java:162)
at com.datastax.driver.core.Cluster.connectAsync(Cluster.java:333)
at com.datastax.driver.core.Cluster.connectAsync(Cluster.java:308)
at com.datastax.driver.core.Cluster.connect(Cluster.java:250)
at org.apache.flink.streaming.connectors.cassandra.CassandraSinkBase.createSession(CassandraSinkBase.java:143)
at org.apache.flink.streaming.connectors.cassandra.CassandraSinkBase.open(CassandraSinkBase.java:87)
at org.apache.flink.streaming.connectors.cassandra.AbstractCassandraTupleSink.open(AbstractCassandraTupleSink.java:49)
at org.apache.flink.api.common.functions.util.FunctionUtils.openFunction(FunctionUtils.java:36)
at org.apache.flink.streaming.api.operators.AbstractUdfStreamOperator.open(AbstractUdfStreamOperator.java:102)
at org.apache.flink.streaming.api.operators.StreamSink.open(StreamSink.java:48)
at org.apache.flink.streaming.runtime.tasks.StreamTask.initializeStateAndOpen(StreamTask.java:1007)
at org.apache.flink.streaming.runtime.tasks.StreamTask.lambda$beforeInvoke$0(StreamTask.java:454)
at org.apache.flink.streaming.runtime.tasks.StreamTaskActionExecutor$SynchronizedStreamTaskActionExecutor.runThrowing(StreamTaskActionExecutor.java:94)
at org.apache.flink.streaming.runtime.tasks.StreamTask.beforeInvoke(StreamTask.java:449)
at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:461)
at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:707)
at org.apache.flink.runtime.taskmanager.Task.run(Task.java:532)
at java.base/java.lang.Thread.run(Thread.java:834)
Any help is appreciated.
Edit:
I just tried using two separate Cassandra instances (different machines and different clusters). I then pointed one job to one instance and the other job to the other instance. Nothing changed; I still get the same error.
Tried to reduce dependencies, here is the new pom file:
<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
<modelVersion>4.0.0</modelVersion>
<groupId>com.abcde.ai</groupId>
<artifactId>analytics-etl</artifactId>
<version>1.0-SNAPSHOT</version>
<packaging>jar</packaging>
<name>Flink Quickstart Job</name>
<properties>
<project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
<flink.version>1.10.2</flink.version>
<java.version>1.8</java.version>
<scala.binary.version>2.11</scala.binary.version>
<maven.compiler.source>${java.version}</maven.compiler.source>
<maven.compiler.target>${java.version}</maven.compiler.target>
</properties>
<repositories>
<repository>
<id>apache.snapshots</id>
<name>Apache Development Snapshot Repository</name>
<url>https://repository.apache.org/content/repositories/snapshots/</url>
<releases>
<enabled>false</enabled>
</releases>
<snapshots>
<enabled>true</enabled>
</snapshots>
</repository>
</repositories>
<dependencies>
<!-- Apache Flink dependencies -->
<!-- These dependencies are provided, because they should not be packaged into the JAR file. -->
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-java</artifactId>
<version>${flink.version}</version>
<scope>provided</scope>
</dependency>
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-streaming-java_${scala.binary.version}</artifactId>
<version>${flink.version}</version>
<scope>provided</scope>
</dependency>
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-streaming-scala_${scala.binary.version}</artifactId>
<version>${flink.version}</version>
<scope>provided</scope>
</dependency>
<dependency>
<groupId>com.google.code.gson</groupId>
<artifactId>gson</artifactId>
<version>2.8.6</version>
</dependency>
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-connector-kafka_${scala.binary.version}</artifactId>
<version>${flink.version}</version>
</dependency>
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-connector-cassandra_${scala.binary.version}</artifactId>
<version>${flink.version}</version>
</dependency>
<dependency>
<groupId>commons-configuration</groupId>
<artifactId>commons-configuration</artifactId>
<version>1.10</version>
</dependency>
<!-- Add logging framework, to produce console output when running in the IDE. -->
<!-- These dependencies are excluded from the application JAR by default. -->
<dependency>
<groupId>org.slf4j</groupId>
<artifactId>slf4j-log4j12</artifactId>
<version>1.7.7</version>
<scope>runtime</scope>
</dependency>
<dependency>
<groupId>log4j</groupId>
<artifactId>log4j</artifactId>
<version>1.2.17</version>
<scope>runtime</scope>
</dependency>
</dependencies>
<build>
<plugins>
<!-- Java Compiler -->
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-compiler-plugin</artifactId>
<version>3.1</version>
<configuration>
<source>${java.version}</source>
<target>${java.version}</target>
</configuration>
</plugin>
<!-- We use the maven-shade plugin to create a fat jar that contains all necessary dependencies. -->
<!-- Change the value of <mainClass>...</mainClass> if your program entry point changes. -->
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-shade-plugin</artifactId>
<version>3.1.1</version>
<executions>
<!-- Run shade goal on package phase -->
<execution>
<phase>package</phase>
<goals>
<goal>shade</goal>
</goals>
<configuration>
<artifactSet>
<excludes>
<exclude>org.apache.flink:force-shading</exclude>
<exclude>com.google.code.findbugs:jsr305</exclude>
<exclude>org.slf4j:*</exclude>
<exclude>log4j:*</exclude>
</excludes>
</artifactSet>
<filters>
<filter>
<!-- Do not copy the signatures in the META-INF folder.
Otherwise, this might cause SecurityExceptions when using the JAR. -->
<artifact>*:*</artifact>
<excludes>
<exclude>META-INF/*.SF</exclude>
<exclude>META-INF/*.DSA</exclude>
<exclude>META-INF/*.RSA</exclude>
</excludes>
</filter>
</filters>
<transformers>
<transformer implementation="org.apache.maven.plugins.shade.resource.ManifestResourceTransformer">
<mainClass>com.abcde.analytics.etl.KafkaUniqueCountsStreamingJob</mainClass>
</transformer>
</transformers>
</configuration>
</execution>
</executions>
</plugin>
</plugins>
<pluginManagement>
<plugins>
<!-- This improves the out-of-the-box experience in Eclipse by resolving some warnings. -->
<plugin>
<groupId>org.eclipse.m2e</groupId>
<artifactId>lifecycle-mapping</artifactId>
<version>1.0.0</version>
<configuration>
<lifecycleMappingMetadata>
<pluginExecutions>
<pluginExecution>
<pluginExecutionFilter>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-shade-plugin</artifactId>
<versionRange>[3.1.1,)</versionRange>
<goals>
<goal>shade</goal>
</goals>
</pluginExecutionFilter>
<action>
<ignore/>
</action>
</pluginExecution>
<pluginExecution>
<pluginExecutionFilter>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-compiler-plugin</artifactId>
<versionRange>[3.1,)</versionRange>
<goals>
<goal>testCompile</goal>
<goal>compile</goal>
</goals>
</pluginExecutionFilter>
<action>
<ignore/>
</action>
</pluginExecution>
</pluginExecutions>
</lifecycleMappingMetadata>
</configuration>
</plugin>
</plugins>
</pluginManagement>
</build>
</project>
Edit:
I managed to narrow down the problem. The error gets fixed when I mark the dependency flink-connector-cassandra as provided and I simply copy the jar file from my local maven repository (~/.m2/repository/org/apache/flink/flink-connector-cassandra_2.11/1.10.2/flink-connector-cassandra_2.11-1.10.2.jar) to Flink lib folder. My problem is solved but the root cause is still a mystery.
I might be wrong, but most likely the issue is caused by a Netty client version conflict. The error states NoHostAvailableException, however the underlying error is a TransportException with an "Error writing" message. Cassandra is definitely operating well.
There is a somewhat similar Stack Overflow case - Cassandra - error writing - with very similar symptoms: a single project running well, and an AllNodesFailedException with a TransportException with an "Error writing" message as the root cause when adding one more. The author was able to solve it by unifying the Netty client version.
In your case, I'm not sure why there are so many dependencies, so I would try to get rid of all the extras and leave just the Flink (v 1.10.0-scala_2.12) and Flink Cassandra Connector (flink-connector-cassandra_2.12:jar:1.10.0) libraries. They must already include the necessary drivers, Netty, etc. All other drivers should be skipped (at least for an initial iteration, to confirm that this solves the issue and that it is a library conflict).
To fix the error, I marked the flink-connector-cassandra dependency as provided, copied the jar file from my local Maven repository (~/.m2/repository/org/apache/flink/flink-connector-cassandra_2.11/1.10.2/flink-connector-cassandra_2.11-1.10.2.jar) to the Flink lib folder, and restarted Flink. Here is the relevant part of my new pom.xml file:
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-connector-cassandra_${scala.binary.version}</artifactId>
<version>${flink.version}</version>
<scope>provided</scope>
</dependency>
How did I find this? I was about to compile the connector from source with a more recent driver version. First I tried to reproduce the error with the unchanged sources, so I compiled the connector without changing anything and put the jar into the Flink lib folder - hooray, it works! I then suspected that the jar from Maven was somehow different, so I copied it into the lib folder and, to my surprise, it also worked.
My problem is solved but the root cause remains a mystery.
My last attempt was to check whether any packages conflict with the Cassandra connector, so I ran mvn dependency:tree -Dverbose. There was one conflict between org.apache.flink:flink-metrics-dropwizard and the connector over metrics-core:
[INFO] +- org.apache.flink:flink-connector-cassandra_2.12:jar:1.10.0:provided
[INFO] | +- (io.dropwizard.metrics:metrics-core:jar:3.1.2:provided - omitted for conflict with 3.1.5)
[INFO] | \- (org.apache.flink:force-shading:jar:1.10.0:provided - omitted for duplicate)
I removed this dependency from my project, but the error remains unless the connector is marked as provided and also placed in the lib folder.

Spark error with google/guava library: java.lang.NoSuchMethodError: com.google.common.cache.CacheBuilder.refreshAfterWrite

I have a simple Spark project in which the pom.xml dependencies are only the basic Scala, ScalaTest/JUnit, and Spark artifacts:
<dependency>
<groupId>net.alchim31.maven</groupId>
<artifactId>scala-maven-plugin</artifactId>
<version>3.2.0</version>
</dependency>
<dependency>
<groupId>org.scala-lang</groupId>
<artifactId>scala-library</artifactId>
<version>${scala.version}</version>
</dependency>
<dependency>
<groupId>org.scala-lang</groupId>
<artifactId>scala-compiler</artifactId>
<version>${scala.version}</version>
</dependency>
<dependency>
<groupId>junit</groupId>
<artifactId>junit</artifactId>
<version>4.11</version>
<scope>test</scope>
</dependency>
<dependency>
<groupId>org.scalatest</groupId>
<artifactId>scalatest_${scala.binary.version}</artifactId>
<version>3.0.1</version>
<scope>test</scope>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_${scala.binary.version}</artifactId>
<version>${spark.version}</version>
<scope>compile</scope>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-mllib_${scala.binary.version}</artifactId>
<version>${spark.version}</version>
<scope>compile</scope>
</dependency>
</dependencies>
When attempting to run a basic Spark program, the SparkSession init fails on this line:
SparkSession.builder.master(master).appName("sparkApp").getOrCreate
Here is the output / error:
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
18/04/07 18:06:15 INFO SparkContext: Running Spark version 2.2.1
Exception in thread "main" java.lang.NoSuchMethodError: com.google.common.cache.CacheBuilder.refreshAfterWrite(JLjava/util/concurrent/TimeUnit;)Lcom/google/common/cache/CacheBuilder;
at org.apache.hadoop.security.Groups.<init>(Groups.java:96)
at org.apache.hadoop.security.Groups.<init>(Groups.java:73)
at org.apache.hadoop.security.Groups.getUserToGroupsMappingService(Groups.java:293)
at org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:283)
at org.apache.hadoop.security.UserGroupInformation.ensureInitialized(UserGroupInformation.java:260)
at org.apache.hadoop.security.UserGroupInformation.loginUserFromSubject(UserGroupInformation.java:789)
at org.apache.hadoop.security.UserGroupInformation.getLoginUser(UserGroupInformation.java:774)
at org.apache.hadoop.security.UserGroupInformation.getCurrentUser(UserGroupInformation.java:647)
at org.apache.spark.util.Utils$$anonfun$getCurrentUserName$1.apply(Utils.scala:2424)
at org.apache.spark.util.Utils$$anonfun$getCurrentUserName$1.apply(Utils.scala:2424)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.util.Utils$.getCurrentUserName(Utils.scala:2424)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:295)
at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2516)
at org.apache.spark.sql.SparkSession$Builder$$anonfun$6.apply(SparkSession.scala:918)
at org.apache.spark.sql.SparkSession$Builder$$anonfun$6.apply(SparkSession.scala:910)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:910)
I have run Spark locally many dozens of times on other projects; what might be wrong with this simple one? Is there a dependency on the $HADOOP_HOME environment variable or similar?
Update: By downgrading the Spark version to 2.0.1 I was able to compile. That does not fix the problem (we need a newer version), but it helps point out the source of the problem.
Another update: In a different project the hack of downgrading to 2.0.1 does help, i.e. execution proceeds further, but then a similar exception occurs when writing out to Parquet:
8/05/07 11:26:11 ERROR Executor: Exception in task 0.0 in stage 2741.0 (TID 2618)
java.lang.NoSuchMethodError: com.google.common.cache.CacheBuilder.build(Lcom/google/common/cache/CacheLoader;)Lcom/google/common/cache/LoadingCache;
at org.apache.hadoop.io.compress.CodecPool.createCache(CodecPool.java:62)
at org.apache.hadoop.io.compress.CodecPool.<clinit>(CodecPool.java:74)
at org.apache.parquet.hadoop.CodecFactory$BytesCompressor.<init>(CodecFactory.java:92)
at org.apache.parquet.hadoop.CodecFactory.getCompressor(CodecFactory.java:169)
at org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:303)
at org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:262)
at org.apache.spark.sql.execution.datasources.parquet.ParquetOutputWriter.<init>(ParquetFileFormat.scala:562)
at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anon$1.newInstance(ParquetFileFormat.scala:139)
at org.apache.spark.sql.execution.datasources.BaseWriterContainer.newOutputWriter(WriterContainer.scala:131)
at org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:247)
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143)
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
at org.apache.spark.scheduler.Task.run(Task.scala:86)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
This error occurs due to a version mismatch between Google's Guava library and Spark.
Spark shades Guava, but many libraries use Guava. You can try shading the Guava dependencies as per this post:
Apache-Spark-User-List
Adding the shade plugin to your pom file and relocating the Google package can resolve this issue.
More information can be found here and here:
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-shade-plugin</artifactId>
<version>3.2.1</version>
<executions>
<execution>
<phase>package</phase>
<goals>
<goal>shade</goal>
</goals>
<configuration>
<relocations>
<relocation>
<pattern>com.google.common</pattern>
<shadedPattern>shade.com.google.common</shadedPattern>
</relocation>
</relocations>
</configuration>
</execution>
</executions>
</plugin>
If this also doesn't help, then pinning the Guava library to version 15.0 works nicely. The reason this workaround works lies in dependencyManagement. A good SO answer is here:
<dependencyManagement>
<dependencies>
<dependency>
<groupId>com.google.guava</groupId>
<artifactId>guava</artifactId>
<version>15.0</version>
</dependency>
</dependencies>
</dependencyManagement>
I am getting this error in Spring Boot: java.lang.TypeNotPresentException: Type com.google.common.cache.CacheBuilderSpec
com.google.common.cache.CacheBuilder.build()Lcom/google/common/cache/Cache
The issue is due to the "com.google.guava:guava" artifact. In Spring Boot it comes in transitively under some other dependency, such as "spring-boot-starter-web" or "springfox-swagger2", so we first need to exclude the Guava artifact from that jar and then add an updated version of Guava.
Solution:
1. Add the Guava dependency above all other dependencies so that Spring Boot picks up the latest version:
<dependency>
<groupId>com.google.guava</groupId>
<artifactId>guava</artifactId>
<version>19.0</version>
</dependency>
2. Find the Spring Boot dependency that includes the "guava" artifact, exclude "guava" from that dependency, and then add the Guava dependency as above.
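The exclusion just described might look like the sketch below, using springfox-swagger2 as the (assumed) carrier of the stale Guava; the version numbers are illustrative:

```xml
<!-- Exclude the transitive Guava, then declare the newer one explicitly. -->
<dependency>
    <groupId>io.springfox</groupId>
    <artifactId>springfox-swagger2</artifactId>
    <version>2.9.2</version>
    <exclusions>
        <exclusion>
            <groupId>com.google.guava</groupId>
            <artifactId>guava</artifactId>
        </exclusion>
    </exclusions>
</dependency>
<dependency>
    <groupId>com.google.guava</groupId>
    <artifactId>guava</artifactId>
    <version>19.0</version>
</dependency>
```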

Logback to log4j bridge

My entire system is logged through slf4j with a log4j implementation.
I had a problem when a new module used a logback-classic dependency for logging, which cannot be excluded from the pom.xml file since that breaks it.
First I looked for a solution within this third-party dependency but couldn't find any, so I thought a bridge between them might solve it.
What I'm mainly looking for is to split my logs between the dependency's default console logging and my own log4j.xml loggers and appenders, so I can use them separately.
Is there any bridge so I could use both logback-classic and log4j under slf4j, with an example?
Thanks!
You said:
I had a problem when a new module used a logback-classic dependency for logging, which cannot be excluded from the pom.xml file since that breaks it.
Is this an assumption or did you try it? Also, if you did try it did you remove the logback-core dependency as well? The only way removing these dependencies would break the module is if the module depends on the logging implementation since logback natively implements the slf4j API. If the module does depend on the implementation rather than the API/interface I don't think there's anything you can do without either removing those dependencies (changing the module source code) or writing some stubbed versions of the implementation classes that the module depends on.
When I write code that follows the pattern in the logback manual I'm able to swap the implementation from logback to log4j2 without any issues as long as I don't introduce dependencies on the logback implementation classes.
Here is the example I wrote:
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class Main {
    private Logger log = LoggerFactory.getLogger(Main.class);

    public static void main(String[] args) {
        Main main = new Main();
        main.main();
    }

    public void main() {
        log.trace("trace msg");
        log.debug("debug msg");
        log.info("info msg");
        log.warn("warn msg");
        log.error("Error msg");
        log.info(log.getClass().getName());
    }
}
Here are the dependencies in the pom:
<dependency>
<groupId>org.slf4j</groupId>
<artifactId>slf4j-api</artifactId>
<version>1.7.22</version>
</dependency>
<dependency>
<groupId>ch.qos.logback</groupId>
<artifactId>logback-core</artifactId>
<version>1.2.3</version>
</dependency>
<dependency>
<groupId>ch.qos.logback</groupId>
<artifactId>logback-classic</artifactId>
<version>1.2.3</version>
</dependency>
Here is the output:
00:48:26.378 [main] DEBUG blah.Main - debug msg
00:48:26.380 [main] INFO blah.Main - info msg
00:48:26.380 [main] WARN blah.Main - warn msg
00:48:26.380 [main] ERROR blah.Main - Error msg
00:48:26.380 [main] INFO blah.Main - ch.qos.logback.classic.Logger
Now I change the pom to replace the logback jars with log4j2:
<dependency>
<groupId>org.slf4j</groupId>
<artifactId>slf4j-api</artifactId>
<version>1.7.22</version>
</dependency>
<dependency>
<groupId>org.apache.logging.log4j</groupId>
<artifactId>log4j-slf4j-impl</artifactId>
<version>2.7</version>
</dependency>
<dependency>
<groupId>org.apache.logging.log4j</groupId>
<artifactId>log4j-api</artifactId>
<version>2.7</version>
</dependency>
<dependency>
<groupId>org.apache.logging.log4j</groupId>
<artifactId>log4j-core</artifactId>
<version>2.7</version>
</dependency>
Here's the output after making this change:
2017-09-03 00:52:21,630 INFO b.Main [main] info msg
2017-09-03 00:52:21,631 WARN b.Main [main] warn msg
2017-09-03 00:52:21,631 ERROR b.Main [main] Error msg
2017-09-03 00:52:21,632 INFO b.Main [main] org.apache.logging.slf4j.Log4jLogger
So based on this I think you should, if things are implemented the "right way", be able to swap the logback jars with log4j2 and it should "just work".
You also said:
What I'm mainly looking for is to split my logs between the dependency's default console logging and my own log4j.xml loggers and appenders, so I can use them separately.
Now, it's not entirely clear to me what you were asking, but I think you wanted log messages from the module to go to the console as well as to any logs you are using with your log4j2 configuration. If that's the case, it's as simple as modifying your log4j2 configuration: add a logger with the appropriate name and assign the appropriate appenders. For example, if your module's classes are com.my.package.Class1, com.my.package.Class2, com.my.package.Class3, etc., then you could create a logger for com.my.package and give it a console appender along with the appropriate file appenders.
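A hedged sketch of such a <Loggers> section (the appender names Console and File are assumptions; use whatever your configuration defines):

```xml
<Loggers>
  <!-- Route the module's packages to the console plus a file appender.
       additivity="false" keeps the messages from being logged twice. -->
  <Logger name="com.my.package" level="info" additivity="false">
    <AppenderRef ref="Console"/>
    <AppenderRef ref="File"/>
  </Logger>
  <Root level="info">
    <AppenderRef ref="File"/>
  </Root>
</Loggers>
```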
Hope this helps!

Kinesis Stream with Empty Records in Google Dataproc with Spark 1.6.1 Hadoop 2.7.2

I am trying to connect to Amazon Kinesis Stream from Google Dataproc but am only getting Empty RDDs.
Command: spark-submit --verbose --packages org.apache.spark:spark-streaming-kinesis-asl_2.10:1.6.2 demo_kinesis_streaming.py --awsAccessKeyId XXXXX --awsSecretKey XXXX
Detailed Log: https://gist.github.com/sshrestha-datalicious/e3fc8ebb4916f27735a97e9fcc42136c
More Details
Spark 1.6.1
Hadoop 2.7.2
Assembly Used: /usr/lib/spark/lib/spark-assembly-1.6.1-hadoop2.7.2.jar
Surprisingly, it works when I download and use the assembly containing Spark 1.6.1 with Hadoop 2.6.0, with the following command.
Command: SPARK_HOME=/opt/spark-1.6.1-bin-hadoop2.6 spark-submit --verbose --packages org.apache.spark:spark-streaming-kinesis-asl_2.10:1.6.2 demo_kinesis_streaming.py --awsAccessKeyId XXXXX --awsSecretKey XXXX
I am not sure if there is any version conflict between the two hadoop versions and Kinesis ASL or it has to do with custom settings with Google Dataproc.
Any help would be appreciated.
Thanks
Suren
Our team was in a similar situation and we managed to solve it:
We are running on the same environment:
DataProc Image Version 1 with Spark 1.6.1 with Hadoop 2.7
A simple Spark Streaming Kinesis script that boils down to this:
# Run the script as
# spark-submit \
# --packages org.apache.spark:spark-streaming-kinesis-asl_2.10:1.6.1\
# demo_kinesis_streaming.py\
# --awsAccessKeyId FOO\
# --awsSecretKey BAR\
# ...
import argparse
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.storagelevel import StorageLevel
from pyspark.streaming.kinesis import KinesisUtils, InitialPositionInStream
ap = argparse.ArgumentParser()
ap.add_argument('--awsAccessKeyId', required=True)
ap.add_argument('--awsSecretKey', required=True)
ap.add_argument('--stream_name')
ap.add_argument('--region')
ap.add_argument('--app_name')
ap = ap.parse_args()
kinesis_application_name = ap.app_name
kinesis_stream_name = ap.stream_name
kinesis_region = ap.region
kinesis_endpoint_url = 'https://kinesis.{}.amazonaws.com'.format(ap.region)
spark_context = SparkContext(appName=kinesis_application_name)
streamingContext = StreamingContext(spark_context, 60)
kinesisStream = KinesisUtils.createStream(
    ssc=streamingContext,
    kinesisAppName=kinesis_application_name,
    streamName=kinesis_stream_name,
    endpointUrl=kinesis_endpoint_url,
    regionName=kinesis_region,
    initialPositionInStream=InitialPositionInStream.TRIM_HORIZON,
    checkpointInterval=60,
    storageLevel=StorageLevel.MEMORY_AND_DISK_2,
    awsAccessKeyId=ap.awsAccessKeyId,
    awsSecretKey=ap.awsSecretKey
)
kinesisStream.pprint()
streamingContext.start()
streamingContext.awaitTermination()
The code had been tested and works on AWS EMR and in a local environment using the same Spark 1.6.1 with Hadoop 2.7 setup.
On Dataproc, however, the script returns empty RDDs without printing any error, even though there is data in the Kinesis stream.
We've tested it on DataProc with the following envs, and none of them worked.
Submit job via gcloud command;
ssh into Cluster Master Node and run in yarn client mode;
ssh into Cluster Master Node and run as local[*].
Upon enabling verbose logging by updating /etc/spark/conf/log4j.properties with the following values:
log4j.rootCategory=DEBUG, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.err
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c: %m%n
log4j.logger.org.eclipse.jetty=ERROR
log4j.logger.org.eclipse.jetty.util.component.AbstractLifeCycle=ERROR
log4j.logger.org.apache.spark.repl.SparkIMain$exprTyper=DEBUG
log4j.logger.org.apache.spark.repl.SparkILoop$SparkILoopInterpreter=DEBUG
log4j.logger.org.apache.spark=DEBUG
log4j.logger.org.apache.hadoop.conf.Configuration.deprecation=DEBUG
log4j.logger.org.spark-project.jetty.server.handler.ContextHandler=DEBUG
log4j.logger.org.apache=DEBUG
log4j.logger.com.amazonaws=DEBUG
We noticed something weird in the log (note that spark-streaming-kinesis-asl_2.10:1.6.1 uses aws-sdk-java/1.9.37 as a dependency, yet somehow aws-sdk-java/1.7.4 was used, as suggested by the user-agent):
16/07/10 06:30:16 DEBUG com.amazonaws.services.kinesis.clientlibrary.lib.worker.ShardConsumer: PROCESS task encountered execution exception:
java.util.concurrent.ExecutionException: java.lang.NoSuchMethodError: com.amazonaws.services.kinesis.model.GetRecordsResult.getMillisBehindLatest()Ljava/lang/Long;
at java.util.concurrent.FutureTask.report(FutureTask.java:122)
at java.util.concurrent.FutureTask.get(FutureTask.java:192)
at com.amazonaws.services.kinesis.clientlibrary.lib.worker.ShardConsumer.checkAndSubmitNextTask(ShardConsumer.java:137)
at com.amazonaws.services.kinesis.clientlibrary.lib.worker.ShardConsumer.consumeShard(ShardConsumer.java:126)
at com.amazonaws.services.kinesis.clientlibrary.lib.worker.Worker.run(Worker.java:334)
at org.apache.spark.streaming.kinesis.KinesisReceiver$$anon$1.run(KinesisReceiver.scala:174)
Caused by: java.lang.NoSuchMethodError: com.amazonaws.services.kinesis.model.GetRecordsResult.getMillisBehindLatest()Ljava/lang/Long;
at com.amazonaws.services.kinesis.clientlibrary.lib.worker.ProcessTask.call(ProcessTask.java:119)
at com.amazonaws.services.kinesis.clientlibrary.lib.worker.MetricsCollectingTaskDecorator.call(MetricsCollectingTaskDecorator.java:48)
at com.amazonaws.services.kinesis.clientlibrary.lib.worker.MetricsCollectingTaskDecorator.call(MetricsCollectingTaskDecorator.java:23)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
content-length:282
content-type:application/x-amz-json-1.1
host:kinesis.ap-southeast-2.amazonaws.com
user-agent:SparkDemo,amazon-kinesis-client-library-java-1.4.0, aws-sdk-java/1.7.4 Linux/3.16.0-4-amd64 OpenJDK_64-Bit_Server_VM/25.91-b14/1.8.0_91
x-amz-date:20160710T063016Z
x-amz-target:Kinesis_20131202.GetRecords
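The mismatch is visible directly in that user-agent header: the KCL version the stream consumer was built against versus the AWS SDK version actually loaded. A small sketch (pure Python, using the exact header string from the log above) of pulling both versions out:

```python
import re

def parse_sdk_versions(user_agent):
    """Extract the KCL and aws-sdk-java versions from a Kinesis user-agent string.

    Returns a (kcl_version, sdk_version) tuple; either entry is None if absent.
    """
    kcl = re.search(r"amazon-kinesis-client-library-java-([\d.]+)", user_agent)
    sdk = re.search(r"aws-sdk-java/([\d.]+)", user_agent)
    return (kcl.group(1) if kcl else None, sdk.group(1) if sdk else None)

# The user-agent captured in the verbose log above
ua = ("SparkDemo,amazon-kinesis-client-library-java-1.4.0, "
      "aws-sdk-java/1.7.4 Linux/3.16.0-4-amd64 "
      "OpenJDK_64-Bit_Server_VM/25.91-b14/1.8.0_91")

kcl_version, sdk_version = parse_sdk_versions(ua)
print(kcl_version, sdk_version)  # → 1.4.0 1.7.4
```

Here KCL 1.4.0 expects SDK APIs (such as `GetRecordsResult.getMillisBehindLatest()`) that simply do not exist in the 1.7.4 SDK that ended up on the classpath, which matches the `NoSuchMethodError` in the stack trace.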
It appears that Dataproc built its own Spark with a much older AWS SDK as a dependency, and it blows up when used in conjunction with code that requires a much newer version of the AWS SDK, although we are not sure exactly which module caused this error.
Update:
Based on @DennisHuo's comment, this behaviour is caused by Hadoop's leaky classpath:
https://github.com/apache/hadoop/blob/branch-2.7.2/hadoop-project/pom.xml#L650
To make things worse, AWS KCL 1.4.0 (used by Spark 1.6.1) silently suppresses runtime errors instead of throwing a RuntimeException, which caused a lot of headaches while debugging.
Eventually, our solution was to build our own org.apache.spark:spark-streaming-kinesis-asl_2.10:1.6.1 with all of its com.amazonaws.* classes shaded.
We built the JAR with the following pom (replacing spark/extra/kinesis-asl/pom.xml) and shipped the new JAR via the --jars flag in spark-submit:
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>
    <parent>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-parent_2.10</artifactId>
        <version>1.6.1</version>
        <relativePath>../../pom.xml</relativePath>
    </parent>

    <!-- Kinesis integration is not included by default due to ASL-licensed code. -->
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-streaming-kinesis-asl_2.10</artifactId>
    <packaging>jar</packaging>
    <name>Spark Kinesis Integration</name>

    <properties>
        <sbt.project.name>streaming-kinesis-asl</sbt.project.name>
    </properties>

    <dependencies>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-streaming_${scala.binary.version}</artifactId>
            <version>${project.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-core_${scala.binary.version}</artifactId>
            <version>${project.version}</version>
            <type>test-jar</type>
            <scope>test</scope>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-streaming_${scala.binary.version}</artifactId>
            <version>${project.version}</version>
            <type>test-jar</type>
            <scope>test</scope>
        </dependency>
        <dependency>
            <groupId>com.amazonaws</groupId>
            <artifactId>amazon-kinesis-client</artifactId>
            <version>${aws.kinesis.client.version}</version>
        </dependency>
        <dependency>
            <groupId>com.amazonaws</groupId>
            <artifactId>amazon-kinesis-producer</artifactId>
            <version>${aws.kinesis.producer.version}</version>
            <scope>test</scope>
        </dependency>
        <dependency>
            <groupId>org.mockito</groupId>
            <artifactId>mockito-core</artifactId>
            <scope>test</scope>
        </dependency>
        <dependency>
            <groupId>org.scalacheck</groupId>
            <artifactId>scalacheck_${scala.binary.version}</artifactId>
            <scope>test</scope>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-test-tags_${scala.binary.version}</artifactId>
        </dependency>
    </dependencies>

    <build>
        <outputDirectory>target/scala-${scala.binary.version}/classes</outputDirectory>
        <testOutputDirectory>target/scala-${scala.binary.version}/test-classes</testOutputDirectory>
        <plugins>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-shade-plugin</artifactId>
                <configuration>
                    <shadedArtifactAttached>false</shadedArtifactAttached>
                    <artifactSet>
                        <includes>
                            <!-- At a minimum we must include this to force effective pom generation -->
                            <include>org.spark-project.spark:unused</include>
                            <include>com.amazonaws:*</include>
                        </includes>
                    </artifactSet>
                    <relocations>
                        <relocation>
                            <pattern>com.amazonaws</pattern>
                            <shadedPattern>foo.bar.YO.com.amazonaws</shadedPattern>
                            <includes>
                                <include>com.amazonaws.**</include>
                            </includes>
                        </relocation>
                    </relocations>
                </configuration>
                <executions>
                    <execution>
                        <phase>package</phase>
                        <goals>
                            <goal>shade</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>
        </plugins>
    </build>
</project>
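The `<relocation>` element above is what makes the fix work: at build time the shade plugin rewrites the bundled `com.amazonaws` classes under a new package prefix, so they can no longer collide with the older SDK classes that Hadoop leaks onto the classpath. As a toy illustration only (pure Python, not how the shade plugin is actually implemented), the renaming it applies to a class name is essentially:

```python
def relocate(class_name,
             pattern="com.amazonaws",
             shaded="foo.bar.YO.com.amazonaws"):
    """Mimic maven-shade-plugin's <relocation>: rewrite a package prefix.

    Only names exactly equal to the pattern, or inside its package, are
    rewritten; everything else is left untouched.
    """
    if class_name == pattern or class_name.startswith(pattern + "."):
        return shaded + class_name[len(pattern):]
    return class_name

print(relocate("com.amazonaws.services.kinesis.model.GetRecordsResult"))
# → foo.bar.YO.com.amazonaws.services.kinesis.model.GetRecordsResult
print(relocate("org.apache.spark.streaming.kinesis.KinesisReceiver"))
# → org.apache.spark.streaming.kinesis.KinesisReceiver (unchanged)
```

After shading, the KCL inside the JAR calls `foo.bar.YO.com.amazonaws...GetRecordsResult.getMillisBehindLatest()`, which exists in the bundled 1.9.37 SDK, so the `NoSuchMethodError` above goes away.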

Resources