How to specify cluster location in HADOOP_CONF_DIR?

The Spark documentation about submitting applications says:
Connect to a YARN cluster in client or cluster mode depending on the value of --deploy-mode. The cluster location will be found based on the HADOOP_CONF_DIR or YARN_CONF_DIR variable.
I am afraid I did not get it. I found that HADOOP_CONF_DIR is set to /etc/hadoop, which contains many shell scripts and configuration files.
Where exactly should I find the cluster location there?

HADOOP_CONF_DIR is the directory holding the configuration files that the Hadoop libraries read for their Hadoop-specific settings. I say Hadoop-specific on purpose: very little in there is Spark-related.
More importantly, HADOOP_CONF_DIR can also point to an empty directory, in which case the defaults are assumed.
To answer your question, you define the cluster location in yarn-site.xml using yarn.resourcemanager.address. If yarn-site.xml is not found, the client falls back to the defaults and looks for the YARN ResourceManager on the local machine.
Where should I place yarn-site.xml so spark-submit will use it?
I used to use YARN_CONF_DIR to point to the directory with yarn-site.xml.
YARN_CONF_DIR=/tmp ./bin/spark-shell --master yarn
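To check which cluster location a given configuration directory actually defines, you can read yarn-site.xml yourself. The following is a minimal sketch using only the Python standard library; the function names are made up for this example, and it only looks at yarn.resourcemanager.address, not at the other properties (such as the ResourceManager hostname) that Hadoop can also derive the address from.

```python
import xml.etree.ElementTree as ET
from pathlib import Path
from typing import Optional, Union

RM_ADDRESS_KEY = "yarn.resourcemanager.address"


def parse_rm_address(yarn_site_xml: Union[str, bytes]) -> Optional[str]:
    """Return the yarn.resourcemanager.address value from yarn-site.xml content, if set."""
    root = ET.fromstring(yarn_site_xml)
    # Hadoop site files are a flat list of <property><name/><value/></property> entries.
    for prop in root.iter("property"):
        if (prop.findtext("name") or "").strip() == RM_ADDRESS_KEY:
            return (prop.findtext("value") or "").strip()
    return None


def rm_address_from_conf_dir(conf_dir: str) -> Optional[str]:
    """Look for yarn-site.xml in conf_dir; None means the property is not defined there."""
    site = Path(conf_dir) / "yarn-site.xml"
    if not site.is_file():
        return None
    return parse_rm_address(site.read_bytes())
```

Calling rm_address_from_conf_dir with the value of HADOOP_CONF_DIR or YARN_CONF_DIR returns the host:port string if the property is set, and None if the file or the property is missing.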

Related

Can there be conflicts between HADOOP_CONF_DIR and SPARK_CONF_DIR?

We have a virtual machine (Red Hat) with PySpark installed in a Python virtual environment.
We have set HADOOP_CONF_DIR and SPARK_CONF_DIR.
HADOOP_CONF_DIR contains core-site.xml, hdfs-site.xml etc. and also a log4j.properties file.
SPARK_CONF_DIR contains spark-defaults.conf and log4j2.properties file.
SPARK_HOME is set to lib/python3.x/site-packages/pyspark inside the Python virtual environment.
The application works as expected except for one thing, and I can't find the root cause.
For some reason, the settings of the log4j.properties file in HADOOP_CONF_DIR are applied to the root logger of the Spark application, while the loggers of dependencies such as Spark, Parquet and Iceberg use the settings of the log4j2.properties file in SPARK_CONF_DIR.
If we remove the log4j.properties file from HADOOP_CONF_DIR, nothing from the Spark application gets logged. If we change that file, the new configuration is applied to the root logger of the application.
We also deployed the application in a Docker container, and for these deployments we don't have the issue.
In those deployments the root logger of the Spark application follows the configuration in the log4j2.properties file in SPARK_CONF_DIR, and removing the log4j.properties file from HADOOP_CONF_DIR has no impact.
So it seems that in one environment (the virtual machine) the system gives priority to HADOOP_CONF_DIR when locating the log4j configuration file, while in the other (Docker) it uses SPARK_CONF_DIR.
Even if we set spark.driver.extraJavaOptions -Dlog4j.configurationFile=file:/full_path_to_log4j2.properties in spark-defaults.conf, the system still uses the log4j.properties file in HADOOP_CONF_DIR for its root logger.
What could cause the system to pick up the wrong log4j configuration file?

Windows log path for running Spark HistoryServer

I followed the instructions on the Spark website for configuring the PySpark History Server locally on Windows, but I cannot get past this error when I run spark-class.cmd org.apache.spark.deploy.history.HistoryServer:
Log directory specified does not exist: file:/tmp/spark-events Did you configure the correct one through spark.history.fs.logDirectory?
spark-defaults.conf has:
spark.eventLog.enabled true
spark.history.fs.logDirectory file:/tmp/spark-events
spark.eventLog.dir file:/tmp/spark-events
I can run pyspark, and I can successfully submit a .py script with spark-submit.
I have created the directory /tmp/spark-events under both SPARK_HOME and SPARK_HOME/bin because I'm not sure where "file:/tmp/spark-events" should actually be located. Where exactly on Windows do I need to create the directory "tmp/spark-events" so it can be found? Am I missing anything else? Also, even if I change the paths in the .conf file, it still gives an error saying it can't find tmp/spark-events, as if it's not using the values in the config at all.

You can choose where spark.history.fs.logDirectory points to! In your case, it should be a Windows path. The idea is the following:
You make a directory wherever you would like it, with the proper permissions set on it (more info on that here)
When that is done, you should be able to start up your history server, with spark.history.fs.logDirectory pointing to that one directory you made. This is not a relative path w.r.t. your $SPARK_HOME env variable, but an absolute path.
If that works, you should see a rather uninteresting screen (the default port is 18080 so locally you should visit localhost:18080): since none of your applications have written to your directory yet you will see an empty History Server screen
If you want to make use of the history server, you have to make your apps write to the eventlog directory you made. That can be done by adding --conf spark.eventLog.enabled=true --conf spark.eventLog.dir=<your-dir> to your spark-submit call.
If that was successful, you should see a file in your log directory you made!
Have a look at your History server (by default on localhost:18080). You should see your application's logs in there!
Hope this helps :)
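To make the "absolute Windows path" point concrete, here is a small sketch that turns a Windows directory into a file: URI and builds the two --conf flags mentioned above. It uses only the Python standard library; the helper names and the C:\tmp\spark-events directory are assumptions for illustration, so use whatever directory you actually created.

```python
from pathlib import PureWindowsPath
from typing import List


def event_log_uri(windows_dir: str) -> str:
    """Convert an absolute Windows path into a file: URI."""
    # as_uri() raises ValueError for relative paths, which catches the
    # "relative to SPARK_HOME" mistake early.
    return PureWindowsPath(windows_dir).as_uri()


def event_log_flags(windows_dir: str) -> List[str]:
    """Build the spark-submit flags that make an application write event logs."""
    uri = event_log_uri(windows_dir)
    return [
        "--conf", "spark.eventLog.enabled=true",
        "--conf", f"spark.eventLog.dir={uri}",
    ]


if __name__ == "__main__":
    # Hypothetical directory; it must already exist before the History Server starts.
    print(" ".join(event_log_flags(r"C:\tmp\spark-events")))
```

The same URI would then be the value of spark.history.fs.logDirectory for the History Server.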

Spark Event log directory

I am using PySpark (standalone, without Hadoop etc.) and running my PySpark jobs as shown below, which works fine:
PYSPARK_PYTHON=python3 JAVA_HOME="/usr/lib/jvm/java-8-openjdk-amd64/jre" SPARK_HOME=~/.local/lib/python3.6/site-packages/pyspark spark-submit job.py --master local
The History Server is running, but I am trying to configure it to read the correct directory. The settings I have configured are in /pyspark/conf/spark-env.sh:
....
SPARK_HISTORY_OPTS="$SPARK_HISTORY_OPTS -Dspark.eventLog.enabled=true -Dspark.eventLog.dir=/home/Documents/Junk/logs/ -Dspark.history.fs.logDirectory=/home/Documents/Junk/logs"
....
But when I run jobs, this directory stays empty (no logs are written to it).
Am I specifying the directory addresses correctly? (These are local paths in my file system.)

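To get it working, do not use spark-env.sh; instead, edit the conf/spark-defaults.conf file and add the following. SPARK_HISTORY_OPTS only affects the History Server process, so the spark.eventLog.* settings placed there never reach your jobs. Note the file:// prefix.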
spark.eventLog.enabled true
spark.eventLog.dir file:///home/user/.local/lib/python3.6/site-packages/pyspark/logs
spark.history.fs.logDirectory file:///home/user/.local/lib/python3.6/site-packages/pyspark/logs
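If you want to generate those three lines rather than type them, the sketch below renders them from a single local directory so that the writer (spark.eventLog.dir) and the reader (spark.history.fs.logDirectory) cannot drift apart. The function name is made up for this example, and the directory passed in is an assumption; it must be an absolute path that already exists.

```python
from pathlib import PurePosixPath


def event_log_conf(log_dir: str) -> str:
    """Render spark-defaults.conf lines that point both settings at one directory."""
    # as_uri() produces the file:/// form and rejects relative paths.
    uri = PurePosixPath(log_dir).as_uri()
    lines = [
        "spark.eventLog.enabled true",
        f"spark.eventLog.dir {uri}",
        f"spark.history.fs.logDirectory {uri}",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical directory, used only to show the output format.
    print(event_log_conf("/home/user/spark-logs"), end="")
```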

What specific properties are required to submit job on remote yarn cluster

We are trying to submit a Spark/MapReduce job to a remote YARN cluster, and we know that to do so we need core-site.xml and yarn-site.xml in the conf directory. But I am trying to understand which specific properties the YARN client and the Spark client need to submit a job remotely. I don't want to share all the properties.
Any pointers to this would help.

We generally call such a machine an Edge Node or Gateway Node, and the following files must be available under the Hadoop and YARN configuration folders.
#hadoop folder
core-site.xml
hadoop-env.sh
hdfs-site.xml
log4j.properties
Yarn configuration
#yarn folder
core-site.xml
hadoop-env.sh
hdfs-site.xml
log4j.properties
mapred-site.xml
yarn-site.xml
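A quick way to verify an edge node against the two lists above is to compare what is actually in each folder with what is expected. This is only a sketch: the file lists are copied from this answer, the helper names are invented, and the folder locations are whatever your HADOOP_CONF_DIR and YARN_CONF_DIR point to.

```python
from pathlib import Path
from typing import Iterable, List

# File lists taken from the answer above.
HADOOP_FILES = ("core-site.xml", "hadoop-env.sh", "hdfs-site.xml", "log4j.properties")
YARN_FILES = HADOOP_FILES + ("mapred-site.xml", "yarn-site.xml")


def missing_files(present: Iterable[str], required: Iterable[str]) -> List[str]:
    """Return the required file names that are not present, in required order."""
    have = set(present)
    return [name for name in required if name not in have]


def missing_in_dir(conf_dir: str, required: Iterable[str]) -> List[str]:
    """Same check against a real directory on disk."""
    names = (p.name for p in Path(conf_dir).iterdir() if p.is_file())
    return missing_files(names, required)
```

For example, missing_in_dir(os.environ["YARN_CONF_DIR"], YARN_FILES) returns an empty list when the YARN folder is complete.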

Custom log4j.properties on AWS EMR

I am unable to override the default and use a custom log4j.properties on Amazon EMR. I am running Spark on EMR (YARN) and have tried all the combinations below in spark-submit to try to use the custom log4j configuration.
--driver-java-options "-Dlog4j.configuration=hdfs://host:port/user/hadoop/log4j.properties"
--conf "spark.driver.extraJavaOptions=-Dlog4j.configuration=hdfs://host:port/user/hadoop/log4j.properties"
I have also tried picking it up from the local filesystem using file://// instead of hdfs. None of these seems to work. However, I can get this working when running on my local YARN setup.
Any ideas?

Basically, after chatting with support and reading the documentation, I see that there are two options for doing this:
1 - Pass the log4j.properties through the configuration supplied when bringing up the EMR cluster. Jonathan has mentioned this in his answer.
2 - Add the --files /path/to/log4j.properties switch to your spark-submit command. This distributes the log4j.properties file to the working directory of each Spark executor. Then change your -Dlog4j.configuration to point to the file name only: "spark.driver.extraJavaOptions=-Dlog4j.configuration=log4j.properties"

log4j knows nothing about HDFS, so it can't accept an hdfs:// path as its configuration file. See here for more information about configuring log4j in general.
To configure log4j on EMR, you may use the Configuration API to add key-value pairs to the log4j.properties file that is loaded by the driver and executors. Specifically, you want to add your Properties to the spark-log4j configuration classification.
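As a rough illustration of what such a configuration object looks like, the sketch below builds the JSON for the spark-log4j classification. The helper name and the example property (log4j.rootCategory set to WARN, console) are assumptions chosen for illustration; put in whichever log4j.properties keys you actually need.

```python
import json
from typing import Dict, List


def spark_log4j_classification(properties: Dict[str, str]) -> List[dict]:
    """Build an EMR configuration list that adds key-value pairs to Spark's log4j.properties."""
    return [
        {
            "Classification": "spark-log4j",
            "Properties": dict(properties),
        }
    ]


if __name__ == "__main__":
    # Example property only; replace with your own logger settings.
    config = spark_log4j_classification({"log4j.rootCategory": "WARN, console"})
    print(json.dumps(config, indent=2))
```

The resulting JSON is what you would supply as the cluster configuration when creating the EMR cluster.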

Here is the simplest solution, which worked quite well in my case:
SSH to the EMR cluster via a terminal.
Go to the conf directory (cd /usr/lib/spark/conf).
Replace the log4j.properties file with your custom values.
Make sure you edit the file with root access (type sudo -i to log in as root).
Note: All the Spark applications running in this cluster will log according to the custom log4j.properties file.

For those using Terraform, it can be cumbersome to define a bootstrap action that creates a new log4j file in EMR or updates the default one at /etc/spark/conf/log4j.properties, because this recreates the EMR cluster.
In this case, it's possible to use S3 paths in the --files option, so something like --files=s3://my-bucket/log4j.properties is valid. As mentioned by @Kaptrain, EMR will distribute the log4j.properties file to the working directory of each Spark executor. Then we can pass these two settings to our Spark jobs to use the new log4j configuration:
spark.driver.extraJavaOptions=-Dlog4j.configuration=log4j.properties
spark.executor.extraJavaOptions=-Dlog4j.configuration=log4j.properties
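Put together, the --files option and the two Java options can be assembled as in the sketch below. It only builds the argument list and does not run anything; the function name and the job.py application are hypothetical, and the S3 location is the example bucket from this answer.

```python
import posixpath
from typing import List


def log4j_submit_args(log4j_location: str, app: str) -> List[str]:
    """Build spark-submit arguments that ship a log4j file and point both JVMs at it."""
    # Only the bare file name goes into -Dlog4j.configuration, because --files
    # places the file in the container's working directory.
    file_name = posixpath.basename(log4j_location)
    java_opt = f"-Dlog4j.configuration={file_name}"
    return [
        "spark-submit",
        "--files", log4j_location,
        "--conf", f"spark.driver.extraJavaOptions={java_opt}",
        "--conf", f"spark.executor.extraJavaOptions={java_opt}",
        app,
    ]


if __name__ == "__main__":
    # job.py is a placeholder application name.
    print(" ".join(log4j_submit_args("s3://my-bucket/log4j.properties", "job.py")))
```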
