Passing custom log4j.properties file from s3 - apache-spark

I'm trying to set custom logging configurations. If I place the log4j.properties file on the cluster and reference it in my spark-submit command, the configuration takes effect. But if I try to pass the file using --files s3://..., it doesn't work.
Works (assuming I placed the file in the home dir):
spark-submit \
--master yarn \
--conf spark.driver.extraJavaOptions=-Dlog4j.configuration=file:log4j.properties \
--conf spark.executor.extraJavaOptions=-Dlog4j.configuration=file:log4j.properties \
Doesn't work:
spark-submit \
--master yarn \
--files s3://my_path/log4j.properties \
--conf spark.driver.extraJavaOptions=-Dlog4j.configuration=log4j.properties \
--conf spark.executor.extraJavaOptions=-Dlog4j.configuration=log4j.properties \
How can I use a config file in s3 to set the logging configuration?

You can't do it directly: Log4j always loads its files from the local filesystem (or from the classpath).
You can, however, put the config inside a JAR; since Spark downloads JARs with your job, you can get at it indirectly. Create a JAR containing only the log4j.properties file and tell Spark to load it with the job, as sketched below.
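A minimal sketch of that approach (the JAR name log4j-config.jar and the S3 path are placeholders, and this assumes your cluster can resolve s3:// URLs, as the --files attempt implies):
jar cf log4j-config.jar log4j.properties
spark-submit \
--master yarn \
--jars s3://my_path/log4j-config.jar \
--conf spark.driver.extraJavaOptions=-Dlog4j.configuration=log4j.properties \
--conf spark.executor.extraJavaOptions=-Dlog4j.configuration=log4j.properties \
...
Without a file: prefix, Log4j falls back to looking up log4j.properties as a classpath resource, which is exactly where the JAR places it once Spark adds it to the driver and executor classpaths.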

Related

Path of jars added to a Spark Job - spark-submit

I am using Spark 2.1 (BTW) on a YARN cluster.
I am trying to upload JARs to the YARN cluster, and to use them to replace the on-site (already in place) Spark JARs.
I am trying to do so through spark-submit.
The question Add jars to a Spark Job - spark-submit - and the related answers - are full of interesting points.
One helpful answer is the following one:
spark-submit --jars additional1.jar,additional2.jar \
--driver-class-path additional1.jar:additional2.jar \
--conf spark.executor.extraClassPath=additional1.jar:additional2.jar \
--class MyClass main-application.jar
So, I understand the following:
"--jars" is for uploading jar on each node
"--driver-class-path" is for using uploaded jar for the driver.
"--conf spark.executor.extraClassPath" is for using uploaded jar for executors.
While I know which file paths to give to "--jars" in a spark-submit command, what will be the file paths of the uploaded JARs to use in "--driver-class-path", for example?
The doc says: "JARs and files are copied to the working directory for each SparkContext on the executor nodes"
Fine, but for the following command, what should I put instead of XXX and YYY ?
spark-submit --jars /a/b/some1.jar,/a/b/c/some2.jar \
--driver-class-path XXX:YYY \
--conf spark.executor.extraClassPath=XXX:YYY \
--class MyClass main-application.jar
When using spark-submit, how can I reference that "working directory for the SparkContext" to form the XXX and YYY file paths?
Thanks.
PS: I have tried
spark-submit --jars /a/b/some1.jar,/a/b/c/some2.jar \
--driver-class-path some1.jar:some2.jar \
--conf spark.executor.extraClassPath=some1.jar:some2.jar \
--class MyClass main-application.jar
No success (if I made no mistake)
And I have tried also:
spark-submit --jars /a/b/some1.jar,/a/b/c/some2.jar \
--driver-class-path ./some1.jar:./some2.jar \
--conf spark.executor.extraClassPath=./some1.jar:./some2.jar \
--class MyClass main-application.jar
No success either.
spark-submit by default uses client mode.
In client mode, you should not use --jars in conjunction with --driver-class-path.
--driver-class-path will overwrite the original classpath, instead of prepending to it as one may expect.
--jars will automatically add the extra jars to the driver and executor classpaths, so you do not need to add their paths manually.
It seems that in cluster mode --driver-class-path is ignored.
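So, in client mode, a minimal sketch of the simplified submit (reusing the question's placeholder paths) would be:
spark-submit --jars /a/b/some1.jar,/a/b/c/some2.jar \
--class MyClass main-application.jar
--jars on its own puts both JARs on the driver and executor classpaths, so XXX and YYY are simply not needed.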

df.show() prints empty result while in hdfs it is not empty

I have a PySpark application which is submitted to YARN with multiple nodes, and it also reads Parquet files from HDFS.
In my code, I have a DataFrame which is read directly from HDFS:
df = self.spark.read.schema(self.schema).parquet("hdfs://path/to/file")
When I use df.show(n=2) directly in my code after the above, it outputs:
+---------+--------------+-------+----+
|aaaaaaaaa|bbbbbbbbbbbbbb|ccccccc|dddd|
+---------+--------------+-------+----+
+---------+--------------+-------+----+
But when I manually check the HDFS path, the data is not empty.
What have I tried?
1- At first I thought that I might have given too few cores and too little memory to my executor and driver, so I doubled them and nothing changed.
2- Then I thought that the path may be wrong, so I gave it a wrong HDFS path and it threw an error that this path does not exist.
What am I assuming?
1- I think this may have something to do with the drivers and executors.
2- It may have something to do with YARN.
3- It may have something to do with the configs provided when using spark-submit.
Current config:
spark-submit \
--master yarn \
--queue my_queue_name \
--deploy-mode cluster \
--jars some_jars \
--conf spark.yarn.dist.files some_files \
--conf spark.sql.catalogImplementation=in-memory \
--properties-file some_zip_file \
--py-files some_py_files \
main.py
What I am sure of:
The data is not empty. The same HDFS path is provided in another project, which is working fine.
So the problem was with the JAR files I was providing.
The Hadoop version was 2.7.2; I changed it to 3.2.0 and it's working fine.
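A quick, hedged sanity check for this kind of mismatch (standard commands, not from the original post; the jars directory applies to Spark builds bundled with Hadoop):
hadoop version                      # Hadoop version installed on the cluster
ls $SPARK_HOME/jars | grep hadoop   # Hadoop client JARs your Spark distribution ships with
If the versions you pass with --jars disagree with these, reads can misbehave without an explicit error, as described above.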

How to deploy Spark application jar file to Kubernetes cluster?

I am currently trying to deploy a Spark example jar on a Kubernetes cluster running on IBM Cloud.
If I try to follow these instructions to deploy Spark on a Kubernetes cluster, I am not able to launch Spark Pi, because I am always getting the error message:
The system cannot find the file specified
after entering the code
bin/spark-submit \
--master k8s://<url of my kubernetes cluster> \
--deploy-mode cluster \
--name spark-pi \
--class org.apache.spark.examples.SparkPi \
--conf spark.executor.instances=5 \
--conf spark.kubernetes.container.image=<spark-image> \
local:///examples/jars/spark-examples_2.11-2.3.0.jar
I am in the right directory with the spark-examples_2.11-2.3.0.jar file in the examples/jars directory.
Ensure your .jar file is present inside the container image.
The instructions say that it should be there:
Finally, notice that in the above example we specify a jar with a
specific URI with a scheme of local://. This URI is the location of
the example jar that is already in the Docker image.
In other words, the local:// scheme is stripped from local:///examples/jars/spark-examples_2.11-2.3.0.jar and the path /examples/jars/spark-examples_2.11-2.3.0.jar is expected to be available in the container image.
Please make sure the absolute path /examples/jars/spark-examples_2.11-2.3.0.jar exists.
Or, if you are trying to load a jar file from the current directory, it should be a relative path like local://./examples/jars/spark-examples_2.11-2.3.0.jar.
I'm not sure whether spark-submit accepts relative paths or not.
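If in doubt, you can inspect the image directly; a hedged sketch (the image name is whatever you passed to spark.kubernetes.container.image, and /opt/spark/examples/jars is where the stock Spark Dockerfiles usually place the examples, so it may differ in your build):
docker run --rm <spark-image> ls /opt/spark/examples/jars
If the jar shows up there, the submit line should reference local:///opt/spark/examples/jars/spark-examples_2.11-2.3.0.jar rather than a path relative to your local working directory.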

Can't override Typesafe configuration on the command line in Spark

I have a Typesafe configuration file application.conf in the src/main/resources folder, which is loaded by default.
A single value can be overridden by specifying:
--conf spark.driver.extraJavaOptions=-DsomeValue="foo"
However, specifying a completely new, i.e. overriding, application.conf file like:
spark-submit \
--class my.Class \
--master "local[2]" \
--files foo.conf \
--conf spark.driver.extraClassPath="-Dconfig.file=file:foo.conf" \
--conf spark.driver.extraJavaOptions=-Dvalue="abcd" \
job.jar
will fail to load foo.conf. Instead, the original file from the resources folder will be loaded.
Trying the tricks from: Using typesafe config with Spark on Yarn did not help as well.
Edit:
Overriding multiple config values in Typesafe config when using an uberjar to deploy seems to be the answer for plain (non-Spark) programs.
The question remains how to bring this to Spark.
Also passing:
--conf spark.driver.extraClassPath="-Dconfig.resource=file:foo.conf"
--conf spark.driver.extraClassPath="-Dconfig.resource=foo.conf"
fails to load my configuration from the command line.
Though, according to the docs (https://github.com/lightbend/config):
For applications using application.{conf,json,properties}, system properties can be used to force a different config source (e.g. from the command line -Dconfig.file=path/to/config-file):
config.resource specifies a resource name - not a basename, i.e. application.conf not application
config.file specifies a filesystem path, again it should include the extension, not be a basename
config.url specifies a URL
These system properties specify a replacement for
application.{conf,json,properties}, not an addition. They only affect
apps using the default ConfigFactory.load() configuration. In the
replacement config file, you can use include "application" to include
the original default config file; after the include statement you
could go on to override certain settings.
it should be possible with these parameters.
spark-submit \
--class my.Class \
--master "local[2]" \
--files foo.conf \
--conf spark.driver.extraJavaOptions="-Dvalue='abcd' -Dconfig.file=foo.conf" \
target/scala-2.11/jar-0.1-SNAPSHOT.jar
Changing from spark.driver.extraClassPath to spark.driver.extraJavaOptions does the trick.

spark-submit, how to specify log4j.properties

In spark-submit, how do I specify log4j.properties?
Here is my script. I have tried all combinations and even just using one local node, but it looks like log4j.properties is not loaded; all debug-level info was dumped.
current_dir=/tmp
DRIVER_JAVA_OPTIONS="-Dlog4j.configuration=file://${current_dir}/log4j.properties "
spark-submit \
--conf "spark.driver.extraClassPath=$current_dir/lib/*" \
--conf "spark.driver.extraJavaOptions=-Djava.security.krb5.conf=${current_dir}/config/krb5.conf -Djava.security.auth.login.config=${current_dir}/config/mssqldriver.conf" \
--conf "spark.executor.extraJavaOptions=-Dlog4j.configuration=file://${curent_dir}/log4j.properties " \
--class "my.AppMain" \
--files ${current_dir}/log4j.properties \
--master local[1] \
--driver-java-options "$DRIVER_JAVA_OPTIONS" \
--num-executors 4 \
--driver-memory 16g \
--executor-cores 10 \
--executor-memory 6g \
$current_dir/my-app-SNAPSHOT-assembly.jar
log4j properties:
log4j.rootCategory=INFO, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yyyy-MM-dd HH:mm:ss} %-5p %c{1}:%L - %m%n
log4j.additivity.org=false
log4j.logger.org=WARN
parquet.hadoop=WARN
log4j.logger.com.barcap.eddi=WARN
log4j.logger.com.barcap.mercury=WARN
log4j.logger.yarn=WARN
log4j.logger.io.netty=WARN
log4j.logger.Remoting=WARN
log4j.logger.org.apache.hadoop=ERROR
# this disables the table creation logging which is so verbose
log4j.logger.hive.ql.parse.ParseDriver=WARN
# this disables pagination nonsense when running in combined mode
log4j.logger.com.barcap.risk.webservice.servlet.PaginationFactory=WARN
Pay attention: the Spark worker is not your Java application, so you can't use a log4j.properties file from the classpath.
To understand how Spark on YARN will read a log4j.properties file, you can use the log4j.debug=true flag:
spark.executor.extraJavaOptions=-Dlog4j.debug=true
Most of the time, the error is that the file is not found or not available from the worker's YARN container. There is a very useful Spark directive that allows you to share files: --files.
--files "./log4j.properties"
This will make the file available to all your drivers/workers. Then add the Java extra option:
-Dlog4j.configuration=log4j.properties
Et voilà!
log4j: Using URL [file:/var/log/ambari-server/hadoop/yarn/local/usercache/hdfs/appcache/application_1524817715596_3370/container_e52_1524817715596_3370_01_000002/log4j.properties] for automatic log4j configuration.
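Putting those fragments together, a minimal sketch of a complete submit line (class and assembly names are simply the ones from the question; adjust to your job, and drop -Dlog4j.debug=true once it works):
spark-submit \
--master yarn \
--files "./log4j.properties" \
--conf "spark.driver.extraJavaOptions=-Dlog4j.configuration=log4j.properties -Dlog4j.debug=true" \
--conf "spark.executor.extraJavaOptions=-Dlog4j.configuration=log4j.properties -Dlog4j.debug=true" \
--class "my.AppMain" \
my-app-SNAPSHOT-assembly.jar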
How to pass local log4j.properties file
As I see from your script you want to:
Pass local log4j.properties to executors
Use this file for node's configuration.
Note two things about --files settings:
Files uploaded to spark-cluster with --files will be available at root dir of executor workspace, so there is no need to add any path in file:log4j.properties.
Files listed in --files must be provided with an absolute path!
Fixing your snippet is very easy now:
current_dir=/tmp
log4j_setting="-Dlog4j.configuration=file:log4j.properties"
spark-submit \
...
--conf "spark.driver.extraJavaOptions=${log4j_setting}" \
--conf "spark.executor.extraJavaOptions=${log4j_setting}" \
--class "my.AppMain" \
--files ${current_dir}/log4j.properties \
...
$current_dir/my-app-SNAPSHOT-assembly.jar
Need more?
If you would like to read about other ways of configuring logging while using spark-submit, please visit my other detailed answer: https://stackoverflow.com/a/55596389/1549135
Just to add: you can directly pass the conf via spark-submit; there is no need to modify the defaults conf file.
--conf spark.driver.extraJavaOptions=-Dlog4j.configuration=file:///export/home/siva/log4j.properties
I ran the below command and it worked fine:
/usr/hdp/latest/spark2/bin/spark-submit --master local[*] --files ~/log4j.properties --conf spark.sql.catalogImplementation=hive --conf spark.driver.extraJavaOptions=-Dlog4j.configuration=file:///export/home/siva/log4j.properties ~/SCD/spark-scd-assembly-1.0.jar test_run
Note: if you have extra Java options configured in the conf file, just append them and submit.
Copy the spark-defaults.conf to a new app-spark-defaults.conf
Add -Dlog4j.configuration=file://log4j.properties to the spark.driver.extraJavaOptions in the app-spark-defaults.conf. For example:
spark.driver.extraJavaOptions -XXOther_flag -Dlog4j.configuration=file://log4j.properties
Run your Spark job using --properties-file pointing to the new conf file.
For example :
spark-submit --properties-file app-spark-defaults.conf --class my.app.class --master yarn --deploy-mode client ~/my-jar.jar
Solution for spark-on-yarn
For me, running Spark on YARN, just adding --files log4j.properties makes everything OK.
1. Make sure the directory where you run spark-submit contains the file "log4j.properties".
2. Run spark-submit ... --files log4j.properties
Let's see why this works.
1. spark-submit will upload log4j.properties to HDFS, like this:
20/03/31 01:22:51 INFO Client: Uploading resource file:/home/ssd/homework/shaofengfeng/tmp/firesparkl-1.0/log4j.properties -> hdfs://sandbox/user/homework/.sparkStaging/application_1580522585397_2668/log4j.properties
2. When YARN launches containers for the driver or executors, it downloads all uploaded files into the node's local file cache, including files under ${spark_home}/jars, ${spark_home}/conf and ${hadoop_conf_dir}, and files specified by --jars and --files.
3. Before launching the container, YARN exports the classpath and makes soft links, like this:
export CLASSPATH="$PWD:$PWD/__spark_conf__:$PWD/__spark_libs__/*:$HADOOP_CONF_DIR:/usr/hdp/current/hadoop-client/*:/usr/hdp/current/hadoop-client/lib/*:/usr/hdp/current/hadoop-hdfs-client/*:/usr/hdp/current/hadoop-hdfs-client/lib/*:/usr/hdp/current/hadoop-yarn-client/*
ln -sf "/var/hadoop/yarn/local/usercache/homework/filecache/1484419/log4j.properties" "log4j.properties"
hadoop_shell_errorcode=$?
if [ $hadoop_shell_errorcode -ne 0 ]
then
exit $hadoop_shell_errorcode
fi
ln -sf "/var/hadoop/yarn/local/usercache/homework/filecache/1484440/apache-log4j-extras-1.2.17.jar" "apache-log4j-extras-1.2.17.jar"
4. After step 3, "log4j.properties" is already on the CLASSPATH, so there is no need to set spark.driver.extraJavaOptions or spark.executor.extraJavaOptions.
Be aware that Spark 3.3.0 switched to Log4j 2, which means you have to configure things differently.
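A hedged sketch of the equivalent on 3.3+ (log4j2.properties and the log4j.configurationFile system property are the Log4j 2 counterparts of what the answers above use):
spark-submit \
--files log4j2.properties \
--conf "spark.driver.extraJavaOptions=-Dlog4j.configurationFile=log4j2.properties" \
--conf "spark.executor.extraJavaOptions=-Dlog4j.configurationFile=log4j2.properties" \
...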
If this is just for a self-learning or small development project, there is already a log4j.properties in hadoop_home/conf. Just edit that one and add your own loggers.
