spark.worker.cleanup does not work, logs are not deleted - apache-spark

I want to periodically clean up the log files stored in ${SPARK_HOME}/logs for our Spark cluster (1 master + 4 workers).
Since I didn't configure SPARK_LOG_DIR in spark-env.sh, the default log directory is ${SPARK_HOME}/logs, so all logs are stored there.
To test it, I added the configuration below (spark.worker.cleanup.enabled) on one of the worker nodes.
export SPARK_WORKER_OPTS="-Dspark.worker.cleanup.enabled=true
-Dspark.worker.cleanup.interval=300 -Dspark.worker.cleanup.appDataTtl=300"
I then ran stop-slave.sh to stop the worker node and started it again with start-slave.sh.
But the log files in ${SPARK_HOME}/logs are not deleted after the configured interval.
Am I doing the right steps, or does something more have to be done? I also put the spark.worker.cleanup configuration in the master node's spark-env.sh, and I don't see any effect there either.

I was a bit confused about which folder gets cleaned up. The Spark documentation says spark.worker.cleanup.enabled only cleans up the worker's APPLICATION directories.
Our application directory is located at "spark-2.3.3-bin-hadoop2.7/work", and that directory did get cleaned up.
So after changing spark-env.sh, running stop-slave.sh, and then start-slave.sh again, everything is working well.
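Since spark.worker.cleanup.* only touches the application work directories, the daemon logs under ${SPARK_HOME}/logs still have to be rotated some other way. One minimal sketch is a cron entry with find; the path, file pattern, and 7-day retention below are assumptions, not part of the original setup:
# hypothetical crontab entry: purge Spark daemon logs older than 7 days, daily at 03:00
0 3 * * * find /path/to/spark-2.3.3-bin-hadoop2.7/logs -type f -name "*.out*" -mtime +7 -delete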

Related

spark-submit in cluster deploy mode get application id to console

I am stuck on a problem that I need to resolve quickly. I have gone through many posts and tutorials about Spark's cluster deploy mode, but I am still unclear about the approach after being stuck for some days.
My use case: I have lots of Spark jobs submitted using the 'spark2-submit' command, and I need the application id printed to the console once they are submitted. The jobs are submitted using cluster deploy mode. (In normal client mode, it does get printed.)
Points I need to consider while creating the solution: I am not supposed to change application code (it would take a long time, as there are many applications running); I can only provide log4j properties or some custom coding.
My approach:
1) I have tried changing the log4j levels and various log4j parameters, but the logging still goes to the centralized log directory.
Part of my log4j.properties:
log4j.logger.org.apache.spark.scheduler.cluster.YarnClusterSchedulerBackend=ALL,console
log4j.appender.org.apache.spark.scheduler.cluster.YarnClusterSchedulerBackend.Target=System.out
log4j.logger.org.apache.spark.deploy.SparkSubmit=ALL
log4j.appender.org.apache.spark.deploy.SparkSubmit=console
log4j.logger.org.apache.spark.deploy.SparkSubmit=TRACE,console
log4j.additivity.org.apache.spark.deploy.SparkSubmit=false
log4j.logger.org.apache.spark.deploy.yarn.Client=ALL
log4j.appender.org.apache.spark.deploy.yarn.Client=console
log4j.logger.org.apache.spark.SparkContext=WARN
log4j.logger.org.apache.spark.scheduler.DAGScheduler=INFO,console
log4j.logger.org.apache.hadoop.ipc.Client=ALL
2) I have also tried adding a custom listener, and I am able to get the Spark application id after the application finishes, but not in the console.
Code logic:
public void onApplicationEnd(SparkListenerApplicationEnd arg0)
{
    for (Thread t : Thread.getAllStackTraces().keySet())
    {
        if (t.getName().equals("main"))
        {
            System.out.println("The current state : " + t.getState());
            Configuration config = new Configuration();
            ApplicationId appId = ConverterUtils.toApplicationId(getjobUId);
            // some logic to communicate with the main thread to print the app id to the console
        }
    }
}
3) I have set spark.eventLog.enabled to true and specified an HDFS directory for the event logs in the spark-submit command.
If anyone could help me find an approach to a solution, it would be really helpful. Or if I am doing something very wrong, any insight would help.
Thanks.
After being stuck in the same place for some days, I was finally able to find a solution to my problem.
After going through the Spark code for cluster deploy mode and some blogs, a few things became clear. It might help someone else looking to achieve the same result.
In cluster deploy mode, the job is submitted via a Client thread from the machine the user submits from. I was actually passing the log4j configs to the driver and executors, but missed that the log4j config for the "Client" was missing.
So we need to use:
SPARK_SUBMIT_OPTS="-Dlog4j.debug=true -Dlog4j.configuration=<location>/log4j.properties" spark-submit <rest of the parameters>
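For reference, a minimal sketch of what that log4j.properties could contain so the yarn Client's output (including the application report lines that carry the application id) goes to the console; this is an assumed configuration, not the exact one used above:
log4j.rootLogger=WARN, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.out
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n
log4j.logger.org.apache.spark.deploy.yarn.Client=INFO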
To clarify:
Client mode means the Spark driver is running on the same machine you ran spark-submit from.
Cluster mode means the Spark driver is running out on the cluster somewhere.
You mentioned that the id is getting logged when you run the app in client mode and you can see it in the console. Your output is also getting logged when you run in cluster mode; you just can't see it because it is running on a different machine.
Some ideas:
Aggregate the logs from the worker nodes into one place where you can parse them to get the app ID.
Write the appIDs to some shared location like HDFS or a database. You might be able to use a Log4j appender if you want to keep log4j.
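Building on those ideas, once the Client's output reaches the submitting console (for instance via the SPARK_SUBMIT_OPTS approach above), a wrapper script can scrape the id without touching any application code. A rough sketch, assuming YARN-style application ids; the submit arguments and file name are placeholders:
spark2-submit --deploy-mode cluster <rest of the parameters> 2>&1 | tee submit.log | grep -m1 -oE 'application_[0-9]+_[0-9]+'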

PySpark logging from the executor in a standalone cluster

This question has answers related to how to do this on a YARN cluster. But what if I am running a standalone Spark cluster? How can I log from executors? Logging from the driver is easy using the log4j logger that we can derive from the SparkContext.
But how can I log from within an RDD's foreach or foreachPartition? Is there any way I can collect these logs and print them?
The answer is to import Python's logging module and write the messages using logging; the logged messages will be in the work directory that is created under the Spark installation location.
There is nothing else that is needed.
I went crazy modifying the log4j.properties file and adding --driver-java-options and spark.executor.extraJavaOptions.
In your Spark program, import logging and add log messages straight away, e.g.
logging.warning("whatever message and variable values you want to check")
Then navigate to the work directory: if I have installed Spark at /home/vagrant/spark, then we are talking about the /home/vagrant/spark/work directory.
There will be a directory for each application, and the executors used for the application are numbered 0, 1, 2, 3, etc. You have to check each of them; in the stderr of whichever executor was created to execute your task, you will see the logging messages.
Hope this helps to see the user-logged messages on the executor when using the Spark standalone cluster mode.
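A minimal sketch of what this looks like inside a foreachPartition; the RDD name and the message are placeholders:
import logging
def log_partition(rows):
    # runs on the executor; messages end up in the executor's stderr
    # under <spark install>/work/<app-id>/<executor-id>/stderr
    logging.basicConfig(level=logging.WARNING)
    log = logging.getLogger("executor")
    for row in rows:
        log.warning("processing row: %s", row)
rdd.foreachPartition(log_partition)  # rdd is whichever RDD you are working with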

Apache Spark FileNotFoundException

I am trying to experiment a little bit with apache-spark cluster mode.
My cluster consists of a driver on my machine, and a worker and manager on a host machine (a separate machine).
I send a text file using sparkContext.addFile(filepath), where filepath is the path of my text file on the local machine, for which I get the following output:
INFO Utils: Copying /home/files/data.txt to /tmp/spark-b2e2bb22-487b-412b-831d-19d7aa96f275/userFiles-147c9552-1a77-427e-9b17-cb0845807860/data.txt
INFO SparkContext: Added file /home/files/data.txt at http://192.XX.XX.164:58143/files/data.txt with timestamp 1457432207649
But when I try to access the same file using SparkFiles.get("data.txt"), I get the path to the file on my driver instead of on the worker.
I am setting my file like this:
SparkConf conf = new SparkConf().setAppName("spark-play").setMaster("spark://192.XX.XX.172:7077");
conf.setJars(new String[]{"jars/SparkWorker.jar"});
JavaSparkContext sparkContext = new JavaSparkContext(conf);
sparkContext.addFile("/home/files/data.txt");
List<String> file = sparkContext.textFile(SparkFiles.get("data.txt")).collect();
I am getting FileNotFoundException here.
I have recently faced the same issue, and hopefully my solution can help other people solve it.
We know that when you use SparkContext.addFile(<file_path>), it sends the file to the automatically created working directories on the driver node (in this case, your machine) as well as on the worker nodes of the Spark cluster.
The block of code you shared, where you use SparkFiles.get("data.txt"), is executed on the driver, so it returns the path to the file on the driver, not on the worker. But the task is run on the worker, and the path to the file on the driver does not match the path to the file on the worker, because the driver and worker nodes have different working directory paths. Hence, you get the FileNotFoundException.
There is a workaround to this problem that does not need any distributed file system or FTP server. Put the file in your working directory on the host machine. Then, instead of using SparkFiles.get("data.txt"), use "./data.txt".
List<String> file = sparkContext.textFile("./data.txt").collect();
Now, even though there is a mismatch of working directory paths between the spark driver and worker nodes, you will NOT face FileNotFoundException since you are using a relative path to access the file.
I think the main issue is that you are trying to read the file via the textFile method. What is inside the parentheses of textFile is evaluated in the driver program; on the worker node, only the code to be run against an RDD is executed. When you call textFile, the driver program creates an RDD object with a trivial associated DAG, but nothing happens on the worker node.
Thus, when you try to collect the data, the worker is asked to read the file at the path you passed to textFile, which the driver tells it. Since your file is on the local filesystem of the driver and the worker node doesn't have access to it, you get the FileNotFoundException.
The solution is to make the file available to the worker node, either by putting it into a distributed filesystem such as HDFS or on an (S)FTP server, or by transferring the file to the worker node before running the Spark job and then passing the path of the file on the worker filesystem as the argument of textFile.
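A quick sketch of the HDFS route, assuming an HDFS client is available where the file lives; the target paths and the namenode address are placeholders:
hdfs dfs -mkdir -p /data
hdfs dfs -put /home/files/data.txt /data/data.txt
The job can then read it with sparkContext.textFile("hdfs://<namenode>:8020/data/data.txt"), a path every worker can resolve.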

Write spark event log to local filesystem instead of hdfs

I want to redirect the event logs of my Spark applications to a local directory like "/tmp/spark-events" instead of "hdfs://user/spark/applicationHistory".
I set the "spark.eventLog.dir" variable to "file:///tmp/spark-events" in Cloudera Manager (Spark Client Advanced Configuration Snippet (Safety Valve) for spark-conf/spark-defaults.conf).
But when I restart Spark, the Spark configuration contains (spark.eventLog.dir=hdfs://nameservice1file:///tmp/spark-eventstmp/spark), and this does not work.
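For reference, the intended result in a plain spark-defaults.conf would be just these lines (a sketch of the target values only; it does not cover how Cloudera Manager merges the safety-valve snippet with the existing hdfs:// default):
spark.eventLog.enabled true
spark.eventLog.dir file:///tmp/spark-events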

Spark webUI - completed application details page

I use spark 1.1.0 on a standalone cluster with 3 nodes.
I want to see the detailed logs of Completed Applications, so I've set in my program:
set("spark.eventLog.enabled","true")
set("spark.eventLog.dir","file:/tmp/spark-events")
but when I click on the application in the web UI, I get a page with the message:
Application history not found (app-20150126000651-0331)
No event logs found for application xxx$ in file:/tmp/spark-events/xxx-1422227211500. Did you specify the correct logging directory?
despite the fact that the directory exists and contains 3 files:
APPLICATION_COMPLETE*, EVENT_LOG_1* and SPARK_VERSION_1.1.0*
Any suggestions to solve the problem?
Thanks.
Why is your application name xxx$ and then xxx in your error message? Is that really what Spark reports?
Permissions problem: check that the directory you log to is readable and executable by the user under which you run Spark (and that the files inside are readable as well).
Check that you do specify master correctly, i.e. --master spark://<localhostname>:7077
Dig into the EVENT_LOG_1* file. The last event (on the last line) of the file should be an "Application Complete" event. If it isn't, it's likely that your application did not call sc.stop(), though the logs should still show up nonetheless.
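A quick way to check that (the application directory name is a placeholder):
# the last line should be the application end ("Application Complete") event
tail -n 1 /tmp/spark-events/<app-directory>/EVENT_LOG_1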
I had the same error "Did you specify the correct logging directory?", and for me the fix was to add a '/' at the end of the path for 'spark.eventLog.dir', i.e. /root/ephemeral-hdfs/spark-events/
>> cat spark/conf/spark-defaults.conf
spark.eventLog.dir /root/ephemeral-hdfs/spark-events/
spark.executor.memory 5929m
