Spark Event log directory - apache-spark

I am using PySpark (standalone, without Hadoop etc.) and submit my PySpark jobs as shown below, which works fine:
PYSPARK_PYTHON=python3 JAVA_HOME="/usr/lib/jvm/java-8-openjdk-amd64/jre" SPARK_HOME=~/.local/lib/python3.6/site-packages/pyspark spark-submit job.py --master local
The History Server is running, but I am trying to configure it to read the correct directory. The settings I have configured are in /pyspark/conf/spark-env.sh:
....
SPARK_HISTORY_OPTS="$SPARK_HISTORY_OPTS -Dspark.eventLog.enabled=true -Dspark.eventLog.dir=/home/Documents/Junk/logs/ -Dspark.history.fs.logDirectory=/home/Documents/Junk/logs"
....
But when I run jobs, this directory stays empty (no logs are written to it).
Am I specifying the directory addresses correctly? (These are local addresses in my file system.)

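To get it working, do the following. Do not use spark-env.sh: SPARK_HISTORY_OPTS only sets JVM options for the History Server process, so the spark.eventLog.* settings placed there never reach the jobs you submit. Instead, edit the conf/spark-defaults.conf file with the following, and note the file:// prefix.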
spark.eventLog.enabled true
spark.eventLog.dir file:///home/user/.local/lib/python3.6/site-packages/pyspark/logs
spark.history.fs.logDirectory file:///home/user/.local/lib/python3.6/site-packages/pyspark/logs
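As a minimal sketch of the file:// point above, Python's standard library can turn a local directory into the URI form these settings use. The helper name event_log_settings and the example path are hypothetical and not part of Spark; the function only formats text.

```python
import os
from pathlib import Path


def event_log_settings(log_dir):
    """Return spark-defaults.conf lines for a local event log directory.

    Hypothetical helper for illustration; it does not call Spark.
    """
    # abspath normalises the path without requiring it to exist yet.
    absolute = os.path.abspath(os.path.expanduser(log_dir))
    # as_uri() adds the file:// scheme and percent-encodes special characters.
    uri = Path(absolute).as_uri()
    return [
        "spark.eventLog.enabled true",
        f"spark.eventLog.dir {uri}",
        f"spark.history.fs.logDirectory {uri}",
    ]


if __name__ == "__main__":
    # The path below is an assumption; substitute your own directory.
    print("\n".join(event_log_settings("/home/user/spark-logs")))
```

The printed lines can be pasted into conf/spark-defaults.conf. The directory itself still has to exist before the History Server starts.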

Related

windows log path for running Spark HistoryServer

I have followed the instructions on the Spark website for configuring the PySpark History Server locally on Windows, but cannot get past this error when I run: spark-class.cmd org.apache.spark.deploy.history.HistoryServer
: Log directory specified does not exist: file:/tmp/spark-events Did you configure the correct one through spark.history.fs.logDirectory?
spark-defaults.conf has:
spark.eventLog.enabled true
spark.history.fs.logDirectory file:/tmp/spark-events
spark.eventLog.dir file:/tmp/spark-events
I can get pyspark to run and I can successfully submit a .py script with spark-submit.
I have created the directory /tmp/spark-events in both SPARK_HOME and SPARK_HOME/bin because I'm not exactly sure where "file:/tmp/spark-events" should actually be located. Where exactly on Windows do I need to create this directory "tmp/spark-events" so it can be found? Am I missing anything else? Also, even if I change the paths in the .conf file, it still gives an error saying it can't find tmp/spark-events, so it's as if it's not even using the values in the config.
You can choose where spark.history.fs.logDirectory points to! In your case, it should be a windows path. The idea is the following:
You make a directory wherever you would like it, with the proper permissions on there (more info on that here)
When that is done, you should be able to start up your history server, with spark.history.fs.logDirectory pointing to that one directory you made. This is not a relative path w.r.t. your $SPARK_HOME env variable, but an absolute path.
If that works, you should see a rather uninteresting screen (the default port is 18080 so locally you should visit localhost:18080): since none of your applications have written to your directory yet you will see an empty History Server screen
If you want to make use of the history server, you have to make your apps write to the eventlog directory you made. That can be done by adding --conf spark.eventLog.enabled=true --conf spark.eventLog.dir=<your-dir> to your spark-submit call.
If that was successful, you should see a file in your log directory you made!
Have a look at your History server (by default on localhost:18080). You should see your application's logs in there!
Hope this helps :)
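The spark-submit step above can be sketched as follows. This is only an illustration: the helper name spark_submit_args is hypothetical, and it assumes the log directory is given as an absolute Windows path that Spark will accept in file:///C:/... URI form.

```python
from pathlib import PureWindowsPath


def spark_submit_args(script, log_dir):
    """Build a spark-submit argument list that enables event logging.

    Hypothetical helper; log_dir must be an absolute Windows path.
    """
    # as_uri() turns C:\tmp\spark-events into file:///C:/tmp/spark-events
    uri = PureWindowsPath(log_dir).as_uri()
    return [
        "spark-submit",
        "--conf", "spark.eventLog.enabled=true",
        "--conf", f"spark.eventLog.dir={uri}",
        script,
    ]


if __name__ == "__main__":
    # Script name and directory are assumptions for the example.
    print(" ".join(spark_submit_args("job.py", r"C:\tmp\spark-events")))
```

The same URI would then be the value of spark.history.fs.logDirectory, so that the History Server reads from the directory the applications write to.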

What specific properties are required to submit job on remote yarn cluster

We are trying to submit a Spark/MapReduce job to a remote YARN cluster, and we know that to do so we need the core-site and yarn-site XML files in the conf directory. But I am trying to understand which specific properties the YARN client and Spark client need in order to submit a job remotely. I don't want to share all the properties.
Any pointers to this would help.
The machine you submit from is generally called an Edge Node or Gateway Node, and the following files must be available in its Hadoop and YARN configuration folders.
#hadoop folder
core-site.xml
hadoop-env.sh
hdfs-site.xml
log4j.properties
Yarn configuration
#yarn folder
core-site.xml
hadoop-env.sh
hdfs-site.xml
log4j.properties
mapred-site.xml
yarn-site.xml
Attaching a sample screenshot.
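The file listing above can be checked on an edge node with a short script. This is a sketch under assumptions: the helper name missing_conf_files is hypothetical, the file names are taken from the YARN folder listing, and the directory you pass in depends on your installation.

```python
from pathlib import Path

# File names copied from the YARN folder listing above.
REQUIRED_FILES = [
    "core-site.xml",
    "hadoop-env.sh",
    "hdfs-site.xml",
    "log4j.properties",
    "mapred-site.xml",
    "yarn-site.xml",
]


def missing_conf_files(conf_dir, required=REQUIRED_FILES):
    """Return the required file names that are absent from conf_dir."""
    conf_path = Path(conf_dir)
    return [name for name in required if not (conf_path / name).is_file()]


if __name__ == "__main__":
    # /etc/hadoop/conf is an assumed location; adjust to your setup.
    print(missing_conf_files("/etc/hadoop/conf"))
```

An empty list means every listed file is present; it says nothing about whether the properties inside them are correct.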

How to specify cluster location in HADOOP_CONF_DIR?

The Spark documentation about submitting applications says:
Connect to a YARN cluster in client or cluster mode depending on the value of --deploy-mode. The cluster location will be found based on the HADOOP_CONF_DIR or YARN_CONF_DIR variable.
I am afraid I did not get it. I found that HADOOP_CONF_DIR is set to /etc/hadoop that contains many shell scripts and configuration files.
Where exactly should I find the cluster location there?
HADOOP_CONF_DIR is the directory with the configuration files that Hadoop libraries use for various Hadoop-specific stuff. I wrote "various Hadoop-specific stuff" to highlight that there is not much here that is Spark-related.
What's more important is that HADOOP_CONF_DIR can also point to an empty directory (which says to assume the defaults).
To answer your question, you can define the cluster location in yarn-site.xml using yarn.resourcemanager.address. If yarn-site.xml is not found, the YARN cluster is available at localhost.
Where should I place yarn-site.xml so spark-submit will use it?
I used to use YARN_CONF_DIR to point to the directory with yarn-site.xml.
YARN_CONF_DIR=/tmp ./bin/spark-shell --master yarn
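To see which cluster location a given yarn-site.xml defines, the property can be read with Python's standard XML parser. This is a sketch only: the helper name read_hadoop_property and the sample host rm.example.com:8032 are hypothetical, and the code merely reads the file format; it does not reproduce how Spark or Hadoop resolve their defaults.

```python
import xml.etree.ElementTree as ET

# Sample content in the usual Hadoop <configuration>/<property> layout.
# The host name is made up for the example.
SAMPLE_YARN_SITE = """<configuration>
  <property>
    <name>yarn.resourcemanager.address</name>
    <value>rm.example.com:8032</value>
  </property>
</configuration>"""


def read_hadoop_property(xml_text, name):
    """Return the value of a named property, or None if it is not set."""
    root = ET.fromstring(xml_text)
    for prop in root.iter("property"):
        if prop.findtext("name", default="").strip() == name:
            value = prop.findtext("value")
            return value.strip() if value is not None else None
    return None


if __name__ == "__main__":
    print(read_hadoop_property(SAMPLE_YARN_SITE, "yarn.resourcemanager.address"))
```

For a real file, pass the text of the yarn-site.xml found in the directory that YARN_CONF_DIR or HADOOP_CONF_DIR points to.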

Custom log4j.properties on AWS EMR

I am unable to override and use a Custom log4j.properties on Amazon EMR. I am running Spark on EMR (Yarn) and have tried all the below combinations in the Spark-Submit to try and use the custom log4j.
--driver-java-options "-Dlog4j.configuration=hdfs://host:port/user/hadoop/log4j.properties"
--conf "spark.driver.extraJavaOptions=-Dlog4j.configuration=hdfs://host:port/user/hadoop/log4j.properties"
I have also tried picking the file up from the local filesystem using file://// instead of hdfs. None of these seems to work. However, I can get this working when running on my local YARN setup.
Any ideas?
Basically, after chatting with support and reading the documentation, I see that there are 2 options available to do this:
1 - Pass the log4j.properties through the configuration passed when bringing up EMR. Jonathan has mentioned this in his answer.
2 - Add the --files /path/to/log4j.properties switch to your spark-submit command. This distributes the log4j.properties file to the working directory of each Spark executor. Then change your -Dlog4j.configuration to point to the file name only: "spark.driver.extraJavaOptions=-Dlog4j.configuration=log4j.properties"
log4j knows nothing about HDFS, so it can't accept an hdfs:// path as its configuration file. See here for more information about configuring log4j in general.
To configure log4j on EMR, you may use the Configuration API to add key-value pairs to the log4j.properties file that is loaded by the driver and executors. Specifically, you want to add your Properties to the spark-log4j configuration classification.
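A minimal sketch of what such a configuration object might look like, built with Python's json module. The Classification / Properties layout follows the EMR configuration format and the spark-log4j name comes from the text above; the log4j.rootCategory key, its value, and the helper name are illustrative assumptions only.

```python
import json


def spark_log4j_classification(properties):
    """Wrap log4j key-value pairs in an EMR configuration object.

    Hypothetical helper; it only builds the JSON structure.
    """
    return [
        {
            "Classification": "spark-log4j",
            "Properties": dict(properties),
        }
    ]


if __name__ == "__main__":
    # Example property; replace with the settings you actually need.
    config = spark_log4j_classification({"log4j.rootCategory": "WARN, console"})
    print(json.dumps(config, indent=2))
```

The resulting JSON is what would be supplied as the cluster's configurations when it is created.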
Here is the simplest solution which worked quite well in my case
ssh to the EMR cluster via terminal
Go to the conf directory (cd /usr/lib/spark/conf)
Replace the log4j.properties file with your custom values.
Make sure you are editing the file with root user access (type sudo -i to log in as the root user).
Note: All the spark applications running in this cluster will output the logs defined in the custom log4j.properties file.
For those using Terraform, it can be cumbersome to define a bootstrap action that creates a new log4j file within EMR or updates the default one at /etc/spark/conf/log4j.properties, because this will recreate the EMR cluster.
In this case, it's possible to use S3 paths in the --files option, so something like --files=s3://my-bucket/log4j.properties is valid. As mentioned by @Kaptrain, EMR will distribute the log4j.properties file to the working directory of each Spark executor. Then we can pass these two settings to our Spark jobs in order to use the new log4j configuration:
spark.driver.extraJavaOptions=-Dlog4j.configuration=log4j.properties
spark.executor.extraJavaOptions=-Dlog4j.configuration=log4j.properties
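Putting the --files option and the two settings together, the full argument list could be assembled as in the sketch below. The helper name log4j_submit_args and the script name are hypothetical; the bucket path is the example one from the text.

```python
import posixpath


def log4j_submit_args(script, log4j_path):
    """Build spark-submit arguments that ship and use a custom log4j file.

    Hypothetical helper for illustration only.
    """
    # Only the bare file name is passed to log4j, because --files copies
    # the file into each container's working directory.
    file_name = posixpath.basename(log4j_path)
    java_opt = f"-Dlog4j.configuration={file_name}"
    return [
        "spark-submit",
        "--files", log4j_path,
        "--conf", f"spark.driver.extraJavaOptions={java_opt}",
        "--conf", f"spark.executor.extraJavaOptions={java_opt}",
        script,
    ]


if __name__ == "__main__":
    print(" ".join(log4j_submit_args("job.py", "s3://my-bucket/log4j.properties")))
```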

Apache spark: permission denied for files uploaded to job's staging directory

I wrote an Apache Spark job that uses a configuration file. When I run this job locally, it works fine. But when I submit this job to a YARN cluster, it fails with a java.io.FileNotFoundException: (Permission denied)
I submit my job with the following command:
bin/spark-submit --master yarn --deploy-mode cluster --num-executors 1 --files /home/user/app.conf --class org.myorg.PropTest assembly.jar
It uploads assembly.jar and app.conf file to a subdirectory of the .sparkStaging directory in my home directory on HDFS.
I'm trying to access app.conf file on the following line:
ConfigFactory.parseFile(new File("app.conf"))
When I upload a file with a name other than app.conf, it fails with a FileNotFoundException as expected.
But when I upload app.conf, it also fails with a FileNotFoundException, but with a message that permission to ./app.conf is denied. So it seems that it can find this file, but does not have the required permissions.
What can be wrong?
Ok, I've figured it out. An uploaded file is added to the driver's classpath, so it can be accessed as a resource:
val config = ConfigFactory.parseResources("app.conf")
