Apache Spark custom log4j configuration for application

I would like to customize the Log4J configuration for my application in a standalone Spark cluster. I have a log4j.xml file which is inside my application JAR. What is the correct way to get Spark to use that configuration instead of its own Log4J configuration?
I tried using the --conf options to set the following, but no luck.
spark.executor.extraJavaOptions -> -Dlog4j.configuration=log4j.xml
spark.driver.extraJavaOptions -> -Dlog4j.configuration=log4j.xml
I am using Spark 1.4.1, and there is no log4j.properties file in my conf/ directory.

If you are using SBT as package manager/builder:
There is a log4j.properties.template in $SPARK_HOME/conf
copy it into your SBT project's src/main/resources
remove the .template suffix
edit it to fit your needs
sbt run/package (and similar tasks) will include this in the JAR, and Spark will pick it up.
This works for me; the steps will probably be similar for other build tools, e.g. Maven.
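For illustration, a trimmed-down log4j.properties along these lines (the layout and the com.mycompany package are placeholders) is a reasonable starting point:
# Console appender: WARN for everything, INFO for your own code
log4j.rootCategory=WARN, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.err
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n
# 'com.mycompany' is a placeholder for your application's package
log4j.logger.com.mycompany=INFO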

Try using driver-java-options. For example:
spark-submit --class my.class --master spark://myhost:7077 --driver-java-options "-Dlog4j.configuration=file:///opt/apps/conf/my.log4j.properties" my.jar
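The option above only affects the driver. If the executors should use the custom configuration as well, a commonly used variant (not part of the original answer; file names and paths are examples) is to ship the file with --files and point the executor JVMs at it:
spark-submit --class my.class --master spark://myhost:7077 \
  --files /opt/apps/conf/my.log4j.properties \
  --driver-java-options "-Dlog4j.configuration=file:///opt/apps/conf/my.log4j.properties" \
  --conf "spark.executor.extraJavaOptions=-Dlog4j.configuration=file:my.log4j.properties" \
  my.jar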

Related

How to configure Spark 2.4 correctly with user-provided Hadoop

I'd like to use Spark 2.4.5 (the current stable Spark version) and Hadoop 2.10 (the current stable Hadoop version in the 2.x series). Further I need to access HDFS, Hive, S3, and Kafka.
http://spark.apache.org provides Spark 2.4.5 pre-built and bundled with either Hadoop 2.6 or Hadoop 2.7.
Another option is to use Spark with user-provided Hadoop, so I tried that one.
As a consequence of using the build with user-provided Hadoop, Spark does not include the Hive libraries either.
There will be an error, like here: How to create SparkSession with Hive support (fails with "Hive classes are not found")?
When I add the spark-hive dependency to the spark-shell (spark-submit is affected as well) by using
spark.jars.packages=org.apache.spark:spark-hive_2.11:2.4.5
in spark-defaults.conf, I get this error:
20/02/26 11:20:45 ERROR spark.SparkContext:
Failed to add file:/root/.ivy2/jars/org.apache.avro_avro-mapred-1.8.2.jar to Spark environment
java.io.FileNotFoundException: Jar /root/.ivy2/jars/org.apache.avro_avro-mapred-1.8.2.jar not found
at org.apache.spark.SparkContext.addJarFile$1(SparkContext.scala:1838)
at org.apache.spark.SparkContext.addJar(SparkContext.scala:1868)
at org.apache.spark.SparkContext.$anonfun$new$11(SparkContext.scala:458)
at org.apache.spark.SparkContext.$anonfun$new$11$adapted(SparkContext.scala:458)
at scala.collection.immutable.List.foreach(List.scala:392)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:458)
at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2520)
at org.apache.spark.sql.SparkSession$Builder.$anonfun$getOrCreate$5(SparkSession.scala:935)
at scala.Option.getOrElse(Option.scala:189)
at org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:926)
at org.apache.spark.repl.Main$.createSparkSession(Main.scala:106)
This happens because spark-shell cannot handle classifiers together with bundle dependencies; see https://github.com/apache/spark/pull/21339 and https://github.com/apache/spark/pull/17416.
A workaround for the classifier problem looks like this:
$ cp .../.ivy2/jars/org.apache.avro_avro-mapred-1.8.2-hadoop2.jar .../.ivy2/jars/org.apache.avro_avro-mapred-1.8.2.jar
but DevOps won't accept this.
The complete list of dependencies looks like this (I have added line breaks for better readability):
root#a5a04d888f85:/opt/spark-2.4.5/conf# cat spark-defaults.conf
spark.jars.packages=com.fasterxml.jackson.datatype:jackson-datatype-jdk8:2.9.10,
com.fasterxml.jackson.datatype:jackson-datatype-jsr310:2.9.10,
org.apache.spark:spark-hive_2.11:2.4.5,
org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.5,
org.apache.hadoop:hadoop-aws:2.10.0,
io.delta:delta-core_2.11:0.5.0,
org.postgresql:postgresql:42.2.5,
mysql:mysql-connector-java:8.0.18,
com.datastax.spark:spark-cassandra-connector_2.11:2.4.3,
io.prestosql:presto-jdbc:307
(everything works - except for Hive)
Is the combination of Spark 2.4.5 and Hadoop 2.10 used anywhere? How?
How to combine Spark 2.4.5 with user-provided Hadoop and Hadoop 2.9 or 2.10?
Is it necessary to build Spark to get around the Hive dependency problem?
There does not seem to be an easy way to configure Spark 2.4.5 with user-provided Hadoop to use Hadoop 2.10.0.
As my task actually was to minimize dependency problems, I chose to compile Spark 2.4.5 against Hadoop 2.10.0.
./dev/make-distribution.sh \
--name hadoop-2.10.0 \
--tgz \
-Phadoop-2.7 -Dhadoop.version=2.10.0 \
-Phive -Phive-thriftserver \
-Pyarn
Now Maven deals with the Hive dependencies/classifiers, and the resulting package is ready to be used.
In my personal opinion, compiling Spark is actually easier than configuring Spark with user-provided Hadoop.
Integration tests so far have not shown any problems, Spark can access both HDFS and S3 (MinIO).
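As a rough sanity check of the resulting build (my own sketch, not part of the original answer), a fresh spark-shell should now report a Hive-backed catalog:
scala> spark.conf.get("spark.sql.catalogImplementation")
// expected to return "hive" for a build with -Phive; "in-memory" would mean Hive support is missing
scala> spark.sql("SHOW DATABASES").show()
// should run without the "Hive classes are not found" error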
Update 2021-04-08
If you want to add support for Kubernetes, just add -Pkubernetes to the list of arguments
Assuming you don't want to run Spark-on-YARN -- start from bundle "Spark 2.4.5 with Hadoop 2.7" then cherry-pick the Hadoop libraries to upgrade from bundle "Hadoop 2.10.x"
Discard spark-yarn / hadoop-yarn-* / hadoop-mapreduce-client-* JARs because you won't need them, except hadoop-mapreduce-client-core, which is referenced by write operations on HDFS and S3 (cf. the "MR commit procedure", V1 or V2)
you may also discard spark-mesos / mesos-* and/or spark-kubernetes / kubernetes-* JARs depending on what you plan to run Spark on
you may also discard spark-hive-thriftserver and hive-* JARs if you don't plan to run a "thrift server" instance, except hive-metastore, which is necessary for, as you might guess, managing the Metastore (either a regular Hive Metastore service or an embedded Metastore inside the Spark session)
Discard hadoop-hdfs / hadoop-common / hadoop-auth / hadoop-annotations / htrace-core* / xercesImpl JARs
Replace with hadoop-hdfs-client / hadoop-common / hadoop-auth / hadoop-annotations / htrace-core* / xercesImpl / stax2-api JARs from Hadoop 2.10 (under common/ and common/lib/, or hdfs/ and hdfs/lib/)
Add the S3A connector from Hadoop 2.10 i.e. hadoop-aws / jets3t / woodstox-core JARs (under tools/lib/)
Download the aws-java-sdk from Amazon (it cannot be bundled with Hadoop because it is not under an Apache license, I guess)
and finally, run a lot of tests...
That worked for me, after some trial-and-error -- with a caveat: I ran my tests against an S3-compatible storage system, but not against the "real" S3, and not against regular HDFS. And without a "real" Hive Metastore service, just the embedded in-memory & volatile Metastore that Spark runs by default.
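A rough sketch of the jar swap described above, assuming the official "Spark 2.4.5 with Hadoop 2.7" and Hadoop 2.10.0 tarballs are unpacked side by side (exact file names and locations under the Hadoop tree may differ slightly):
SPARK_JARS=spark-2.4.5-bin-hadoop2.7/jars
HADOOP=hadoop-2.10.0/share/hadoop
# drop the bundled Hadoop 2.7 client libraries
rm $SPARK_JARS/hadoop-hdfs-2.7*.jar $SPARK_JARS/hadoop-common-2.7*.jar \
   $SPARK_JARS/hadoop-auth-2.7*.jar $SPARK_JARS/hadoop-annotations-2.7*.jar \
   $SPARK_JARS/htrace-core-*.jar $SPARK_JARS/xercesImpl-*.jar
# copy in the Hadoop 2.10 counterparts (from common/, common/lib/ and hdfs/)
cp $HADOOP/common/hadoop-common-2.10*.jar $SPARK_JARS/
cp $HADOOP/common/lib/hadoop-auth-2.10*.jar $SPARK_JARS/
cp $HADOOP/common/lib/hadoop-annotations-2.10*.jar $SPARK_JARS/
cp $HADOOP/common/lib/htrace-core4-*.jar $SPARK_JARS/
cp $HADOOP/common/lib/stax2-api-*.jar $SPARK_JARS/
cp $HADOOP/common/lib/xercesImpl-*.jar $SPARK_JARS/
cp $HADOOP/hdfs/hadoop-hdfs-client-2.10*.jar $SPARK_JARS/
# S3A connector and friends (from tools/lib/)
cp $HADOOP/tools/lib/hadoop-aws-2.10*.jar $SPARK_JARS/
cp $HADOOP/tools/lib/jets3t-*.jar $SPARK_JARS/
cp $HADOOP/tools/lib/woodstox-core-*.jar $SPARK_JARS/
# the aws-java-sdk has to be downloaded from Amazon separately (see above)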
For the record, the process is the same with Spark 3.0.0 previews and Hadoop 3.2.1, except that
you also have to upgrade guava
you don't have to upgrade xercesImpl nor htrace-core nor stax2-api
you don't need jets3t any more
you need to retain more hadoop-mapreduce-client-* JARs (probably because of the new "S3 committers")

Building Spark with provided Hadoop

I've been trying to build a custom Spark build with a custom built Hadoop (I need to apply a patch to Hadoop 2.9.1 that allows me to use S3Guard on paths that start with s3://).
Here is how I build Spark in my Dockerfile, after cloning it and checking out Spark 2.3.1:
ARG HADOOP_VER=2.9.1
RUN bash -c \
"MAVEN_OPTS='-Xmx2g -XX:ReservedCodeCacheSize=512m' \
./dev/make-distribution.sh \
--name hadoop${HADOOP_VER} \
--tgz \
-Phadoop-provided \
-Dhadoop.version=${HADOOP_VER} \
-Phive \
-Phive-thriftserver \
-Pkubernetes"
This compiles successfully, but when I try to use Spark with s3:// paths I still get an error from the Hadoop code that I'm sure I removed through my patch when compiling it. So, as far as I can tell, that Spark build is not using the Hadoop JARs I provide.
What is the right way to compile Spark so that it does not include the Hadoop JARs and uses the ones I provide?
Note: I run on standalone mode and I set SPARK_DIST_CLASSPATH=$(hadoop classpath) so that it points to my Hadoop classpath.
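For context, the SPARK_DIST_CLASSPATH wiring for a hadoop-provided build usually lives in conf/spark-env.sh on every node, along these lines (the Hadoop install path is a placeholder):
# conf/spark-env.sh -- /opt/hadoop-2.9.1 stands in for wherever the patched Hadoop is installed
export HADOOP_HOME=/opt/hadoop-2.9.1
export SPARK_DIST_CLASSPATH=$("$HADOOP_HOME"/bin/hadoop classpath)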
For custom Hadoop versions, you need to get your own artifacts onto the local machines and into the Spark tar file which is distributed around the cluster (usually in HDFS) and downloaded when the workers are deployed (in YARN; no idea about k8s).
The best way to do this reliably is to locally build a hadoop release with a new version number, and build spark against that.
dev/make-distribution.sh -Phive -Phive-thriftserver -Pyarn -Pkubernetes -Phadoop-3.1 -Phadoop-cloud -Dhadoop.version=2.9.3-SNAPSHOT
That will create a spark distro with the hadoop-aws and matching SDK which you have built up.
It's pretty slow: run nailgun/zinc if you can for some speedup. If you refer to a version which is also in the public repos, there's a high chance that whatever cached copies sit in your local Maven repo (~/.m2/repository) have crept in.
Then: bring up spark shell and test from there, before trying any more complex setups.
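A quick spark-shell check (just a sketch; the expected version string depends on what you built) shows which Hadoop actually ended up on the classpath:
scala> org.apache.hadoop.util.VersionInfo.getVersion
// should report the locally built version (e.g. 2.9.3-SNAPSHOT), not a stock release pulled from a public repo
scala> classOf[org.apache.hadoop.fs.FileSystem].getProtectionDomain.getCodeSource.getLocation
// shows which JAR the Hadoop classes are actually loaded from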
Finally, there is an open JIRA for S3Guard to not worry about s3 vs s3a in URLs. Is that your patch? If not, does it work? We could probably get it into future Hadoop releases if the people who need it are happy.

Connecting to Hive from Spark in IntelliJ

I'm trying to connect to a remote Hive from within my Spark program in IntelliJ, installed on my local machine.
I placed the Hadoop cluster config files on the local machine and set the HADOOP_CONF_DIR environment variable in the IntelliJ run configuration of this Spark program so it could detect the Hadoop cluster, but IntelliJ is somehow not reading these files and the Spark program defaults to the local Hive metastore instance.
Is there any way to configure IntelliJ to read the Hadoop config files locally? Any help is highly appreciated.
Configure the SPARK_CONF_DIR environment variable and copy hive-site.xml into that directory. Spark will connect to the specified Hive metastore; make sure that hive-site.xml points to your cluster details.
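A minimal hive-site.xml for that directory might look like this (the metastore host is a placeholder; 9083 is the default thrift port):
<configuration>
  <property>
    <name>hive.metastore.uris</name>
    <value>thrift://metastore-host.example.com:9083</value>
  </property>
</configuration>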
Add the Hadoop configuration files folder to the IntelliJ project classpath:
Project Settings -> Libraries -> + -> java -> select folder with all the config files -> classes

Prebuilt Spark 2.1.0 creates metastore_db folder and derby.log when launching spark-shell

I just upgraded from Spark 2.0.2 to Spark 2.1.0 (by downloading the prebuilt version for Hadoop 2.7 and later). No Hive is installed.
Upon launch of the spark-shell, the metastore_db/ folder and derby.log file are created at the launch location, together with a bunch of warning logs (which were not printed in the previous version).
Closer inspection of the debug logs shows that Spark 2.1.0 tries to initialise a HiveMetastoreConnection:
17/01/13 09:14:44 INFO HiveUtils: Initializing HiveMetastoreConnection version 1.2.1 using Spark classes.
Similar debug logs for Spark 2.0.2 do not show any initialisation of HiveMetastoreConnection.
Is this intended behaviour? Could it be related to the fact that spark.sql.warehouse.dir is now a static configuration shared among sessions? How do I avoid this, since I have no Hive installed?
Thanks in advance!
From Spark 2.1.0 documentation pages:
When not configured by the hive-site.xml, the context automatically
creates metastore_db in the current directory and creates a directory
configured by spark.sql.warehouse.dir, which defaults to the directory
spark-warehouse in the current directory that the Spark application is
started. Note that the hive.metastore.warehouse.dir property in
hive-site.xml is deprecated since Spark 2.0.0. Instead, use
spark.sql.warehouse.dir to specify the default location of database in
warehouse.
Since you do not have Hive installed, you will not have a hive-site.xml config file, and this must be defaulting to the current directory.
If you are not planning to use HiveContext in Spark, you could reinstall Spark 2.1.0 from source, rebuilding it with Maven and making sure you omit the -Phive and -Phive-thriftserver flags, which enable Hive support.
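A sketch of such a rebuild, simply leaving out the Hive profiles (the remaining profiles depend on your environment):
./dev/make-distribution.sh --name without-hive --tgz -Phadoop-2.7 -Pyarn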
For future googlers: the actual underlying reason for the creation of metastore_db and derby.log in every working directory is the default value of derby.system.home.
This can be changed in spark-defaults.conf, as sketched below.
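For instance, an entry along these lines (the /tmp/derby path is only an illustration) relocates the Derby files:
spark.driver.extraJavaOptions -Dderby.system.home=/tmp/derby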
This also happens with Spark 1.6. You can change the path by adding extra Java options on spark-submit:
-Dderby.system.home=/tmp/derby
(or via derby.properties; there are several ways to change it).

How to manually deploy 3rd party utility jar for Apache Spark cluster?

I have an Apache Spark cluster (multi-node) and I would like to manually deploy some utility JARs to each Spark node. Where should I put these JARs?
For example: spark-streaming-twitter_2.10-1.6.0.jar
I know we can use Maven to build a fat JAR that includes these JARs; however, I would like to deploy these utilities manually, so that programmers would not have to deploy these utility JARs themselves.
Any suggestion?
1. Copy your 3rd-party JARs to a reserved HDFS directory;
for example hdfs://xxx-ns/user/xxx/3rd-jars/
2. In spark-submit, specify these JARs using the HDFS path;
hdfs: - executors will pull down files and JARs from the HDFS directory
--jars hdfs://xxx-ns/user/xxx/3rd-jars/xxx.jar
3. spark-submit will not repeatedly upload these JARs:
Client: Source and destination file systems are the same. Not copying hdfs://xxx-ns/user/xxx/3rd-jars/xxx.jar
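Putting it together, a full submission might look roughly like this (the class name, master URL and application JAR are placeholders):
spark-submit \
  --class com.example.MyStreamingApp \
  --master spark://master-host:7077 \
  --jars hdfs://xxx-ns/user/xxx/3rd-jars/spark-streaming-twitter_2.10-1.6.0.jar \
  my-app.jar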
spark-submit and spark-shell have a --jars option. This will distribute the JARs to all the executors. The spark-submit --help output for --jars is as follows:
--jars JARS Comma-separated list of local jars to include on the driver
and executor classpaths.
This is taken from the programming guide:
Or, to also add code.jar to its classpath, use:
$ ./bin/spark-shell --master local[4] --jars code.jar
