For our Spark jobs, we are trying to add a logging framework that creates a custom log file on the local FS.
In client mode everything is fine: the files are created on the local FS as the user who launched the spark-submit.
However, in cluster mode the local files are created as the yarn user, which does not have permission to write to the local directory...
Is there any way to write a local file in cluster mode as the user who submitted the job, without changing permissions to 777 everywhere?
Is cluster mode really better in this case (we are in a PROD environment), knowing that the job is launched from a node of the cluster (so there is no network issue)?
Thank you.
Yes, here is a way: use a shell script to submit the Spark job and collect the logs afterwards.
We use a logger to print all our logs, and we always include a unique marker in the log message,
e.g. log.info("INFO_CUSTOM: Info message"). Once the application has completed, we run the yarn logs command and grep for that unique marker.
Get the application id with the yarn command, filtering by application name.
e.g. yarn application -list -appStates FINISHED,FAILED,KILLED | grep <application name>
Run the yarn logs command, grep for the marker, and redirect the output to the file you want.
e.g. yarn logs -applicationId <application id you got from step 1> | grep -w "INFO_CUSTOM" >> joblog.log
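A minimal wrapper around these steps might look like the following sketch; the application name, marker string, and output file are placeholders, not values from the original job.
#!/bin/bash
# Sketch: after a cluster-mode run, pull the custom log lines out of the aggregated YARN logs.
APP_NAME="my-spark-job"   # placeholder application name
MARKER="INFO_CUSTOM"      # unique text used in log.info(...)
OUT="joblog.log"          # local file to collect the custom logs

# Step 1: grab the id of the most recent finished/failed/killed run of this application.
APP_ID=$(yarn application -list -appStates FINISHED,FAILED,KILLED \
  | grep "$APP_NAME" | head -n 1 | awk '{print $1}')

# Step 2: fetch the aggregated logs and keep only our custom lines.
yarn logs -applicationId "$APP_ID" | grep -w "$MARKER" >> "$OUT"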
Related
I'm running Spark in standalone mode as a Docker service, with one master node and one Spark worker. I followed the instructions in the Spark documentation:
https://spark.apache.org/docs/latest/spark-standalone.html
to add the properties that make the Spark cluster clean up after itself, and I set them in my docker_entrypoint:
export SPARK_WORKER_OPTS="-Dspark.worker.cleanup.enabled=true -Dspark.worker.cleanup.interval=900 -Dspark.worker.cleanup.appDataTtl=900"
and verified that it was enabled by following the logs of the worker node service.
My question is: should I expect all directories located under the SPARK_WORKER_DIR directory to be cleaned, or does it only clean the application files?
I ask because I still see some empty directories hanging around there.
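A quick way to check what the cleaner actually removes is to list what is left under the worker directory; SPARK_WORKER_DIR defaults to $SPARK_HOME/work, and the 15-minute threshold below simply mirrors the 900 s appDataTtl configured above.
# Sketch: run inside the worker container to inspect leftover application directories.
WORK_DIR="${SPARK_WORKER_DIR:-$SPARK_HOME/work}"
# Directories older than the configured appDataTtl (900 s = 15 min):
find "$WORK_DIR" -maxdepth 1 -mindepth 1 -type d -mmin +15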
How do I view / download the complete stderr output from a worker in Apache Spark, deployed in cluster mode?
I've deployed a program with spark-submit --deploy-mode cluster foo.jar, and a worker crashed. To investigate, I go to localhost:8081 and access the worker's log (stderr in particular), but it shows me only the bottom of the file, and I have to click the "Load More" button a hundred times to scroll up to the first error -- clearly, I shouldn't have to do that. Is there a way to download the whole stderr output, or to redirect it to a known location? Which part of Spark's documentation gives me this kind of information?
Get the application id of your Spark job from the YARN web UI, or note it when it is printed after you submit the Spark job.
Then use the command below with the YARN CLI to view the YARN logs from your edge/gateway node. For more details about the YARN CLI, refer to this link: click here
yarn logs -applicationId <Application ID> -log_files stderr
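To keep the whole stderr locally rather than paging through the UI, the same command can simply be redirected to a file; the application id below is a placeholder.
# Sketch: dump the complete stderr of all containers into one local file.
yarn logs -applicationId application_1234567890123_0001 -log_files stderr > app_stderr.log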
The /var/log/spark/apps/ folder was deleted on our EMR cluster. I created a new HDFS folder with the same name and changed its permissions to 777. Each Spark application is now successfully writing logs to this HDFS folder.
However, that folder must have contained something else that allowed the Spark History Server you connect to through SSH tunneling to display the list of application logs. It worked just fine before the folder was deleted, but now it does not display any Spark application logs (complete or incomplete), even though hdfs dfs -ls /var/log/spark/apps/ shows that the folder is full of logs.
The Spark History Server accessed through the EMR AWS Console still works, but it is less ideal because it lags significantly behind the Spark History Server accessed through an SSH tunnel.
What other item do I need to restore to this folder so that the Spark History Server opened through ssh tunneling shows these logs?
On a Windows computer, the following PowerShell code still opens the Spark History Server UI correctly, but the UI does not show any logs:
Start-Process powershell "-noexit", `
"`$host.ui.RawUI.WindowTitle = 'Spark HistoryServer'; `
Start-Process chrome.exe http://localhost:8158 ; `
ssh -N -L 8158:ip-10-226-66-190.us-east-2.compute.internal:18080 hadoop@10.226.66.190"
Note:
I have also stopped and restarted the Spark History Server.
sudo stop spark-history-server
sudo start spark-history-server
Also:
sudo -s ./$SPARK_HOME/sbin/start-history-server.sh
Changing the permissions fixed it.
hdfs dfs -chmod -R 777 /var/log/spark/apps/
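If the logs still did not show up after the chmod and a restart, it would also be worth confirming that the history server and the applications agree on the event log location; /etc/spark/conf/spark-defaults.conf is the usual place to check on EMR, and the path below mirrors the one in the question.
# Sketch: check which event log location the history server and the jobs are using.
grep -E 'spark\.eventLog\.dir|spark\.history\.fs\.logDirectory' /etc/spark/conf/spark-defaults.conf
# Both properties should point at the same location, e.g. hdfs:///var/log/spark/apps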
I am trying to run my PySpark code. My destination directory is a local directory. The user with which I am submitting the spark-submit command is the superuser and has all the privileges needed to read files from HDFS and write files locally.
The job runs without any error, but no output directory or files are being created.
I have also set HADOOP_USER_NAME to the superuser in my Spark code to avoid permission issues.
Can someone please help?
If you are running in YARN cluster mode, the YARN ApplicationMaster (and hence the driver) is actually running on one of the cluster nodes, so it writes to that node's local filesystem. If you find out which node it was, you should find your output directory and files there.
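One way to find out which node that was is to ask YARN for the application report; the application id below is a placeholder, and the exact field label can vary slightly between Hadoop versions.
# Sketch: the "AM Host" field of the report shows where the ApplicationMaster (and driver) ran.
yarn application -status application_1234567890123_0001 | grep -i 'AM Host'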
I have a Jenkins job that executes a Node application. The job is configured to run inside Docker only during execution.
Is it possible to download a file from the Node application every time the job is executed?
I tried using Node.js plugins to save and download the file. The file gets saved locally, but I am not able to download it.
If your Docker container runs some job and creates a file as its output, and you want that file available outside the container after the job is done, my suggestion is to create the file in a location that is mapped to a host folder via the volume option. Run your Docker container as follows:
sudo docker run -d -v /my/host/folder:/my/location/inside/container mynodeapp:latest
Ensure that your Node application writes the output file to the location /my/location/inside/container. When the job is completed, the output file can be accessed on the host machine at /my/host/folder.
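In a Jenkins job, the natural host folder to map is the job workspace, so the file ends up where Jenkins can archive it; the image name and container path below are examples, not taken from the original setup.
# Sketch of an "Execute shell" build step (image name and paths are examples).
# Mount a workspace subfolder into the container so the output file lands on the agent.
mkdir -p "$WORKSPACE/output"
docker run --rm -v "$WORKSPACE/output:/my/location/inside/container" mynodeapp:latest

# After the container exits, the file sits in $WORKSPACE/output and can be
# archived with the "Archive the artifacts" post-build action or the archiveArtifacts step.
ls "$WORKSPACE/output"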