spark streaming application and kafka log4j appender issue - apache-spark

I am testing my Spark Streaming application, and I have multiple functions in my code:
- some of them operate on a DStream[RDD[XXX]], some of them on RDD[XXX] (after I call DStream.foreachRDD).
I use the Kafka log4j appender to log business cases that occur within my functions, which operate on both the DStream[RDD] and the RDD itself.
But data gets appended to Kafka only from functions that operate on an RDD; it doesn't work when I want to append data to Kafka from my functions that operate on a DStream.
Does anyone know the reason for this behaviour?
I am working on a single virtual machine, where I have Spark & Kafka. I submit applications using spark submit.
EDITED
Actually, I have figured out part of the problem. Data gets appended to Kafka only from the part of the code that is in my main function. All the code that is outside of my main doesn't write data to Kafka.
In main I declared the logger like this:
val kafkaLogger = org.apache.log4j.LogManager.getLogger("kafkaLogger")
While outside of my main, I had to declare it like:
@transient lazy val kafkaLogger = org.apache.log4j.LogManager.getLogger("kafkaLogger")
in order to avoid serialization issues.
The reason might lie in JVM serialization, or simply in the workers not seeing the log4j configuration file (but my log4j file is in my source code, in the resources folder).
Edited 2
I have tried many ways of sending the log4j file to the executors, but none of them worked. I tried:
sending the log4j file with the --files option of spark-submit
setting: --conf "spark.executor.extraJavaOptions =-Dlog4j.configuration=file:/home/vagrant/log4j.properties" in spark-submit
setting the log4j.properties file in --driver-class-path of spark-submit...
None of these options worked.
Does anyone have a solution? I do not see any errors in my error log.
Thank you

I think you are close. First, make sure all the files are exported to the WORKING DIRECTORY (not the CLASSPATH) on all nodes using the --files flag. Then reference these files in the extraClassPath option of the executors and the driver. I have attached the following command; I hope it helps. The key is to understand that once the files are exported, they can be accessed on each node using just the file name in the working directory (and not a URL path).
Note: Putting the log4j file in the resources folder will not work (at least when I tried, it didn't).
sudo -u hdfs spark-submit --class "SampleAppMain" --master yarn --deploy-mode cluster --verbose \
--files file:///path/to/custom-log4j.properties,hdfs:///path/to/jar/kafka-log4j-appender-0.9.0.0.jar \
--conf "spark.driver.extraClassPath=kafka-log4j-appender-0.9.0.0.jar" \
--conf "spark.executor.extraClassPath=kafka-log4j-appender-0.9.0.0.jar" \
--conf "spark.driver.extraJavaOptions=-Dlog4j.configuration=custom-log4j.properties" \
--conf "spark.executor.extraJavaOptions=-Dlog4j.configuration=custom-log4j.properties" \
/path/to/your/jar/SampleApp-assembly-1.0.jar
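Because the driver and executor options in this command mirror each other, it can help to generate the argument list rather than type it by hand. The following is only a minimal Python sketch of that idea, assuming the same file names as the command in this answer; build_submit_args is a hypothetical helper written for illustration, not part of Spark.

```python
import shlex


def build_submit_args(app_jar, main_class, log4j_file, appender_jar):
    """Build a spark-submit argument list (hypothetical helper, not a Spark API).

    log4j_file and appender_jar are shipped with --files; the driver and the
    executors then refer to them by bare file name in their working directory.
    """
    log4j_name = log4j_file.rsplit("/", 1)[-1]
    jar_name = appender_jar.rsplit("/", 1)[-1]
    args = [
        "spark-submit",
        "--class", main_class,
        "--master", "yarn",
        "--deploy-mode", "cluster",
        "--files", f"{log4j_file},{appender_jar}",
    ]
    # The same two settings are needed for both the driver and the executors.
    for role in ("driver", "executor"):
        args += ["--conf", f"spark.{role}.extraClassPath={jar_name}"]
        args += ["--conf", f"spark.{role}.extraJavaOptions=-Dlog4j.configuration={log4j_name}"]
    args.append(app_jar)
    return args


submit_args = build_submit_args(
    "/path/to/your/jar/SampleApp-assembly-1.0.jar",
    "SampleAppMain",
    "file:///path/to/custom-log4j.properties",
    "hdfs:///path/to/jar/kafka-log4j-appender-0.9.0.0.jar",
)
# Print a shell-quoted command line for inspection before running it.
print(shlex.join(submit_args))
```

The list form can be handed to subprocess.run directly, which avoids the quoting mistakes that are easy to make with nested --conf strings.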

Related

How to dynamically add dependencies to spark executors at runtime

I would like to add archive dependencies to my spark executors in a way that would work similarly to how it functions when passing the archive paths in to the spark-submit with --archives option. However, I will not know what dependencies are required until runtime, so I need to do this programmatically after the spark job has already been submitted.
Is there a way to do this? I'm currently working on a hacky solution where I download the required archives from within the function running on the executors, however this is much slower than having the driver just download the archives once and then distribute them to the executors.
Assuming your resource manager is YARN, it is possible to set the property spark.yarn.dist.archives when creating the SparkSession.
SparkSession.builder \
    .appName("myappname") \
    .config("spark.yarn.dist.archives", "file1.zip#file1,file2.zip#file2,...") \
    .getOrCreate()
More info here: https://spark.apache.org/docs/latest/running-on-yarn.html
You may find the properties spark.yarn.dist.files and spark.yarn.dist.jars useful too.
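Since the required archives are only known at runtime, the property value has to be assembled as a string before the session is created. A minimal Python sketch of that step follows; dist_archives_value is a hypothetical helper, and the archive names are the placeholders from the answer above.

```python
def dist_archives_value(archives):
    """Join {archive_path: alias} pairs into the 'path#alias,path#alias' form.

    Hypothetical helper for illustration; the alias after '#' is the directory
    name under which the unpacked archive appears in the working directory.
    """
    return ",".join(f"{path}#{alias}" for path, alias in archives.items())


# The mapping would be computed at runtime, before the SparkSession is built.
value = dist_archives_value({"file1.zip": "file1", "file2.zip": "file2"})
print(value)  # prints: file1.zip#file1,file2.zip#file2

# The string is then passed when building the session, for example:
# SparkSession.builder.config("spark.yarn.dist.archives", value).getOrCreate()
```

Note that this only helps if the dependencies are known before the session is created, which is the situation the answer assumes.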

Fail to enable hive support in Spark submit (spark 3)

I am using spark in an integration test suite. It has to run locally and read/write files to local file-system. I also want to read/write these data as tables.
In the first step of the suite I write some hive tables in the db feature_store specifying
spark.sql.warehouse.dir=/opt/spark/work-dir/warehouse. The step completes correctly and I see the files in the folder I expect.
Afterwards I run a spark-submit step with (among others) these confs
--conf spark.sql.warehouse.dir=/opt/spark/work-dir/warehouse --conf spark.sql.catalogImplementation=hive
and when trying to read a table previously written I get
Exception in thread "main" org.apache.spark.sql.catalyst.analysis.NoSuchDatabaseException: Database 'feature_store' not found
However if I try to do exactly the same thing with exactly the same configs in a spark-shell I am able to read the data.
In the spark-submit I use the following code to get the spark-session
SparkSession spark = SparkSession.active();
I have also tried to use instead
SparkSession spark = SparkSession.builder().enableHiveSupport().getOrCreate();
but I keep getting the same problem as above.
As far as I understand, the problem is that spark-submit does not pick up hive as the catalog implementation. In fact, I see that spark.catalog is not an instance of HiveCatalogImpl during the spark-submit (while it is when using spark-shell).

How to understand the relationship and use of spark --jars, extraClassPath and extraLibraryPath?

First of all, I have seen similar questions (another problem link), but I think the answer is not very clear.
Some of my questions are as follows:
(1) Is the --jars parameter the same as the spark.executor.extraClassPath parameter? If they are different, what is the difference?
I checked the --help output of spark-submit, which describes --jars as follows:
Comma-separated list of local jars to include on the driver
and executor classpaths.
However, I did not find an explanation of spark.executor.extraClassPath in the spark-submit help. Eventually, I found the following explanation of spark.executor.extraClassPath on the official Spark website:
Extra classpath entries to prepend to the classpath of executors.
Judging by these two descriptions, they seem to have the same effect.
But I saw the following paragraph in another question (link):
--jars vs SparkContext.addJar: These are identical, only one is set through spark submit and one via code. Choose the one which suits you better. One important thing to note is that using either of these options does not add the JAR to your driver/executor classpath, you'll need to explicitly add them using the extraClassPath config on both.
Why is that?
(2) spark.executor.extraClassPath and spark.executor.extraLibraryPath (the same question applies to the spark.driver.extraXXXXpath properties).
What is the difference between extraClassPath and extraLibraryPath?
The explanation of spark.executor.extraLibraryPath from the official Spark website:
Set a special library path to use when launching executor JVM's.
I don't understand the difference between this and --jars and spark.executor.extraClassPath.
I look forward to your explanation and answer, thank you.

How to stop Spark Structured Streaming from filling HDFS

I have a Spark Structured Streaming task running on AWS EMR that is essentially a join of two input streams over a one minute time window. The input streams have a 1 minute watermark. I don't do any aggregation. I write results to S3 "by hand" with a foreachBatch and a foreachPartition per batch that converts the data to strings and writes them to S3.
I would like to run this for a long time, i.e. "forever", but unfortunately Spark slowly fills up HDFS storage on my cluster and eventually dies because of this.
There seem to be two types of data that accumulate: logs in /var, and .delta and .snapshot files in /mnt/tmp/.../. They don't get deleted when I kill the task with CTRL+C (or, when using yarn, with a yarn application kill) either; I have to delete them manually.
I run my task with spark-submit. I tried adding
--conf spark.streaming.ui.retainedBatches=100 \
--conf spark.streaming.stopGracefullyOnShutdown=true \
--conf spark.cleaner.referenceTracking.cleanCheckpoints=true \
--conf spark.cleaner.periodicGC.interval=15min \
--conf spark.rdd.compress=true
without effect. When I add --master yarn the paths where the temporary files are stored change a bit, but the problem of them accumulating over time persists. Adding a --deploy-mode cluster seems to make the problem worse as more data seems to be written.
I used to have a Trigger.ProcessingTime("15 seconds") in my code, but removed it as I read that Spark might fail to clean up after itself if the trigger time is too short compared to the compute time. This seems to have helped a bit: HDFS fills up more slowly, but temporary files are still piling up.
If I don't join the two streams, but just select on both and union the results to write them to S3, the accumulation of cruft in /mnt/tmp doesn't happen. Could it be that my cluster is too small for the input data?
I would like to understand why Spark is writing these temp files, and how to limit the space they consume. I would also like to know how to limit the amount of space consumed by logs.
Spark fills HDFS with logs because of https://issues.apache.org/jira/browse/SPARK-22783
One needs to set spark.eventLog.enabled=false so that no logs are created.
In addition to @adrianN's answer: on the EMR side, application logs are retained on HDFS. See https://aws.amazon.com/premiumsupport/knowledge-center/core-node-emr-cluster-disk-space/

Separate logs from Apache spark

I would like to have separate log files for workers, masters and jobs (executors, submits; I don't know what to call them). I tried a configuration in log4j.properties like
log4j.appender.myAppender.File=/some/log/dir/${log4j.myAppender.FileName}
and then passing log4j.myAppender.FileName in SPARK_MASTER_OPTS, SPARK_WORKER_OPTS, spark.executor.extraJavaOptions and spark.driver.extraJavaOptions.
It works perfectly well with workers and masters but fails with executors and drivers. Here is an example of how I use these:
./spark-submit ... --conf "\"spark.executor.extraJavaOptions=log4j.myAppender.FileName=myFileName some.other.option=foo\"" ...
I also tried putting log4j.myAppender.FileName with some default value in spark-defaults.conf, but that doesn't work either.
Is there some way to achieve what I want?
Logging for executors and drivers can be configured in conf/spark-defaults.conf by adding these entries (from my Windows config):
spark.driver.extraJavaOptions -Dlog4j.configuration=file:C:/dev/programs/spark-1.2.0/conf/log4j-driver.properties
spark.executor.extraJavaOptions -Dlog4j.configuration=file:C:/dev/programs/spark-1.2.0/conf/log4j-executor.properties
Note that each entry above references a different log4j.properties file so you can configure them independently.
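To make the per-role setup concrete, here is a minimal Python sketch that renders one log4j.properties text per role together with the matching spark-defaults.conf entries. It assumes log4j 1.x properties syntax and the appender name myAppender from the question; the function names, directories and conversion pattern are illustrative assumptions, not taken from the answer.

```python
def render_log4j_properties(log_file, level="INFO"):
    """Render a minimal log4j 1.x file-appender config (illustrative only)."""
    lines = [
        f"log4j.rootLogger={level}, myAppender",
        "log4j.appender.myAppender=org.apache.log4j.FileAppender",
        f"log4j.appender.myAppender.File={log_file}",
        "log4j.appender.myAppender.layout=org.apache.log4j.PatternLayout",
        "log4j.appender.myAppender.layout.ConversionPattern=%d{ISO8601} %p %c: %m%n",
    ]
    return "\n".join(lines) + "\n"


def spark_defaults_entries(conf_dir):
    """Return one spark-defaults.conf line per role, each pointing at its own file."""
    return [
        f"spark.{role}.extraJavaOptions -Dlog4j.configuration=file:{conf_dir}/log4j-{role}.properties"
        for role in ("driver", "executor")
    ]


# Hypothetical directories, chosen only for this example.
configs = {
    role: render_log4j_properties(f"/some/log/dir/{role}.log")
    for role in ("driver", "executor")
}
entries = spark_defaults_entries("/path/to/spark/conf")
print("\n".join(entries))
```

Each role gets its own log file because the file name is written into that role's properties file, so no extra system property has to reach the JVM.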
