Separate logs from Apache Spark

I would like to have separate log files for workers, masters, and jobs (executors and submitted applications, I'm not sure what to call them). I tried a configuration in log4j.properties like
log4j.appender.myAppender.File=/some/log/dir/${log4j.myAppender.FileName}
and then passing log4j.myAppender.FileName in SPARK_MASTER_OPTS, SPARK_WORKER_OPTS, spark.executor.extraJavaOptions and spark.driver.extraJavaOptions.
It works perfectly well for workers and masters but fails for executors and drivers. Here is an example of how I use these:
./spark-submit ... --conf "\"spark.executor.extraJavaOptions=log4j.myAppender.FileName=myFileName some.other.option=foo\"" ...
I also tried putting log4j.myAppender.FileName with some default value in spark-defaults.conf, but that doesn't work either.
Is there some way to achieve what I want?

Logging for executors and drivers can be configured via conf/spark-defaults.conf by adding these entries (from my Windows config):
spark.driver.extraJavaOptions -Dlog4j.configuration=file:C:/dev/programs/spark-1.2.0/conf/log4j-driver.properties
spark.executor.extraJavaOptions -Dlog4j.configuration=file:C:/dev/programs/spark-1.2.0/conf/log4j-executor.properties
Note that each entry above references a different log4j.properties file so you can configure them independently.
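For reference, each of those files can then define its own appender. A minimal sketch of what log4j-executor.properties might contain (the paths, levels and appender names here are illustrative, not from the original answer):

# log4j-executor.properties (illustrative)
log4j.rootLogger=INFO, executorFile
log4j.appender.executorFile=org.apache.log4j.FileAppender
log4j.appender.executorFile.File=/some/log/dir/executor.log
log4j.appender.executorFile.layout=org.apache.log4j.PatternLayout
log4j.appender.executorFile.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n

A log4j-driver.properties would look the same except for the File entry (e.g. /some/log/dir/driver.log), which is what keeps the driver and executor logs separate.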

Related

How to understand the relationship and use of spark --jars, extraClassPath and extraLibraryPath?

First of all, I have seen similar problems (another problem link), but I think the answer there is not very clear.
Some of my questions are as follows:
(1) Is the --jars parameter the same as the spark.executor.extraClassPath parameter? If they are different, what is the difference?
I checked the --help output for --jars in the spark-submit command line, which explains it as follows:
Comma-separated list of local jars to include on the driver
and executor classpaths.
However, I did not find an explanation of spark.executor.extraClassPath in the spark-submit command-line help. Finally, I found the following explanation of spark.executor.extraClassPath on the official Spark website:
Extra classpath entries to prepend to the classpath of executors.
Judging from these two descriptions, they seem to have the same effect?
But I see the following paragraph in another question link:
--jars vs SparkContext.addJar: These are identical, only one is set through spark submit and one via code. Choose the one which suites you better. One important thing to note is that using either of these options does not add the JAR to your driver/executor classpath, you'll need to explicitly add them using the extraClassPath config on both.
Why is that, then?
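If I understand that quote correctly, it is suggesting an invocation roughly like the following (the jar name mylib.jar and class name are just placeholders I made up):

./spark-submit --class com.example.MyApp \
  --jars /local/path/mylib.jar \
  --conf "spark.driver.extraClassPath=mylib.jar" \
  --conf "spark.executor.extraClassPath=mylib.jar" \
  my-app.jar

where --jars ships the file to the nodes and the extraClassPath entries then put it on the driver and executor classpaths.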
(2) spark.executor.extraClassPath and spark.executor.extraLibraryPath, and likewise the spark.driver.extraXXXXPath options with the same prefix.
What is the difference between extraClassPath and extraLibraryPath?
The explanation of spark.executor.extraLibraryPath from the official Spark website:
Set a special library path to use when launching executor JVM's.
I don't understand what the difference is between this option, --jars, and spark.executor.extraClassPath.
I look forward to your explanation. Thank you.

Apache Spark: setting executor instances

I run my Spark application on YARN with parameters:
in spark-defaults.conf:
spark.master yarn-client
spark.driver.cores 1
spark.driver.memory 1g
spark.executor.instances 6
spark.executor.memory 1g
in yarn-site.xml:
yarn.nodemanager.resource.memory-mb 10240
All other parameters are set to default.
I have a 6-node cluster and the Spark Client component is installed on each node.
Every time I run the application, only 2 executors and 1 driver are visible in the Spark UI. The executors appear on different nodes.
Why can't Spark create more executors? Why are only 2 instead of 6?
I found a very similar question: Apache Spark: setting executor instances does not change the executors, but increasing the memory-mb parameter didn't help in my case.
The configuration looks OK at first glance.
Make sure that you have modified the proper spark-defaults.conf file.
Execute echo $SPARK_HOME for the current user and verify that the modified spark-defaults.conf file is in the $SPARK_HOME/conf/ directory. Otherwise Spark cannot see your changes.
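A quick way to check this (a sketch; adjust the property name and paths to your setup):

echo $SPARK_HOME
ls -l $SPARK_HOME/conf/spark-defaults.conf
grep spark.executor.instances $SPARK_HOME/conf/spark-defaults.conf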
It turned out I had modified the wrong spark-defaults.conf file. I had two users on my system and each user had a different $SPARK_HOME directory set (I didn't know that before). That's why I couldn't see any effect of my settings for one of the users.
You can also run spark-shell or spark-submit with the argument --num-executors 6 (if you want to have 6 executors). If Spark then creates more executors than before, you can be sure that it's not a memory issue but a problem with the configuration not being read.
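For example (the class and jar names here are placeholders):

spark-submit --master yarn-client --num-executors 6 --executor-memory 1g --class com.example.MyApp my-app.jar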

spark streaming application and kafka log4j appender issue

I am testing my spark streaming application, and I have multiple functions in my code:
- some of them operate on a DStream[RDD[XXX]], some of them on RDD[XXX] (after I do DStream.foreachRDD).
I use the Kafka log4j appender to log business events that occur within my functions, both those that operate on the DStream and those that operate on the RDDs themselves.
But data gets appended to Kafka only from the functions that operate on RDDs; it doesn't work when I want to append data to Kafka from my functions that operate on the DStream.
Does anyone know the reason for this behaviour?
I am working on a single virtual machine, where I have both Spark and Kafka. I submit applications using spark-submit.
EDITED
Actually, I have figured out part of the problem. Data gets appended to Kafka only from the code that is in my main function. All the code outside my main doesn't write data to Kafka.
In main I declared the logger like this:
val kafkaLogger = org.apache.log4j.LogManager.getLogger("kafkaLogger")
While outside of my main, I had to declare it like this:
@transient lazy val kafkaLogger = org.apache.log4j.LogManager.getLogger("kafkaLogger")
in order to avoid serialization issues.
The reason might be related to how the JVM serializes closures, or simply that the workers don't see the log4j configuration file (but my log4j file is in my source code, in the resources folder).
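For reference, a minimal sketch of that pattern outside main (the class and method names are made up for illustration):

import org.apache.log4j.{LogManager, Logger}
import org.apache.spark.rdd.RDD

class RddProcessor extends Serializable {
  // @transient lazy: the Logger is not serialized with this instance;
  // it is re-created on each executor the first time it is used there
  @transient lazy val kafkaLogger: Logger = LogManager.getLogger("kafkaLogger")

  def process(rdd: RDD[String]): Unit = {
    rdd.foreach { record =>
      kafkaLogger.info(s"Processed record: $record")
    }
  }
}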
Edited 2
I have tried many ways to send the log4j file to the executors, but none of them work. I tried:
sending the log4j file via the --files option of spark-submit
setting: --conf "spark.executor.extraJavaOptions =-Dlog4j.configuration=file:/home/vagrant/log4j.properties" in spark-submit
putting the log4j.properties file on the --driver-class-path of spark-submit...
None of these options worked.
Does anyone have a solution? I do not see any errors in my error log.
Thank you
I think you are close. First, you want to make sure all the files are exported to the WORKING DIRECTORY (not the CLASSPATH) on all nodes using the --files flag. Then you want to reference these files in the extraClassPath option of the executors and the driver. I have attached the following command; hope it helps. The key is to understand that once the files are exported, they can be accessed on each node using just the file name relative to the working directory (not a URL path).
Note: Putting the log4j file in the resources folder will not work (at least it didn't when I tried).
sudo -u hdfs spark-submit --class "SampleAppMain" --master yarn --deploy-mode cluster --verbose \
  --files file:///path/to/custom-log4j.properties,hdfs:///path/to/jar/kafka-log4j-appender-0.9.0.0.jar \
  --conf "spark.driver.extraClassPath=kafka-log4j-appender-0.9.0.0.jar" \
  --conf "spark.executor.extraClassPath=kafka-log4j-appender-0.9.0.0.jar" \
  --conf "spark.driver.extraJavaOptions=-Dlog4j.configuration=custom-log4j.properties" \
  --conf "spark.executor.extraJavaOptions=-Dlog4j.configuration=custom-log4j.properties" \
  /path/to/your/jar/SampleApp-assembly-1.0.jar

Which configuration options are preferred in spark?

I wanted to ask which configuration option is given priority in Spark: is it the configuration file, or the options we specify manually when running spark-submit?
What if I have one value for executor memory in my configuration file and specify a different value while running spark-submit?
The Spark (1.5.0) configuration page clearly states what the priorities are:
Properties set directly on the SparkConf take highest precedence, then flags passed to spark-submit or spark-shell, then options in the spark-defaults.conf file.
So this is the priority order (from highest to lowest):
Properties set on the SparkConf (in program).
Flags passed to spark-submit or spark-shell.
Options set in the spark-defaults.conf file.
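As a sketch of what that means in practice (the values and app name are made up): if spark-defaults.conf sets spark.executor.memory to 1g and spark-submit is given --executor-memory 2g, but the program itself does the following, the executors end up with 4g, because the SparkConf setting wins:

import org.apache.spark.{SparkConf, SparkContext}

// Programmatic setting: takes precedence over both the spark-submit flag
// and the value in spark-defaults.conf for the same property
val conf = new SparkConf()
  .setAppName("PrecedenceDemo")
  .set("spark.executor.memory", "4g")
val sc = new SparkContext(conf)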

Spark - config file that sets spark.storage.memoryFraction

I have come to learn that spark.storage.memoryFraction and spark.storage.safetyFraction are multiplied by the executor memory supplied to the SparkContext. Also, I have learned that it is desirable to lower the memoryFraction for better performance.
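For instance (an illustrative calculation, using the defaults of memoryFraction = 0.6 and safetyFraction = 0.9): with 4g of executor memory, roughly 4g × 0.6 × 0.9 ≈ 2.16g would be available for cached blocks.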
The question is where do I set the spark.storage.memoryFraction? Is there a config file?
The default file that Spark searches for such configuration is conf/spark-defaults.conf.
If you want to move the conf directory to a custom location, set SPARK_CONF_DIR in conf/spark-env.sh.
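For example, in conf/spark-env.sh (the path here is a placeholder):

export SPARK_CONF_DIR=/opt/my-spark-conf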
I recommend keeping this on a per-job basis instead of updating spark-defaults.conf.
You can create a config file per job, say spark.properties, and pass it to spark-submit:
--properties-file /spark.properties
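A minimal sketch of what that per-job file and invocation might look like (the file contents, paths and names are illustrative):

# spark.properties (per-job settings)
spark.storage.memoryFraction 0.4
spark.executor.memory 2g

spark-submit --properties-file /path/to/spark.properties --class com.example.MyJob my-job.jar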
