Apache Spark: permission denied for files uploaded to job's staging directory

I wrote an Apache Spark job that uses a configuration file. When I run this job locally, it works fine, but when I submit it to a YARN cluster, it fails with java.io.FileNotFoundException: (Permission denied).
I submit my job with the following command:
bin/spark-submit --master yarn --deploy-mode cluster --num-executors 1 --files /home/user/app.conf --class org.myorg.PropTest assembly.jar
It uploads assembly.jar and the app.conf file to a subdirectory of the .sparkStaging directory in my home directory on HDFS.
I'm trying to access the app.conf file on the following line:
ConfigFactory.parseFile(new File("app.conf"))
When I upload a file with a name other than app.conf, it fails with a FileNotFoundException, as expected.
But when I upload app.conf, it also fails with a FileNotFoundException, only now the message says that permission to ./app.conf is denied. So it seems the file can be located, but the required permissions can't be obtained.
What could be wrong?

OK, I've figured it out. An uploaded file is added to the driver's classpath, so it can be accessed as a resource:
val config = ConfigFactory.parseResources("app.conf")
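For completeness, a minimal sketch of the resulting code (Typesafe Config is what the question already uses; the key name some.key is a made-up placeholder):
import com.typesafe.config.{Config, ConfigFactory}
// Files shipped via --files land in the container's working directory,
// which is on the classpath, so the config can be loaded as a resource.
val config: Config = ConfigFactory.parseResources("app.conf")
val value = config.getString("some.key")  // "some.key" is a hypothetical key for illustration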

Related

Spark Event log directory

I am using PySpark (standalone, without Hadoop etc.), calling my PySpark jobs as shown below, and it works fine:
PYSPARK_PYTHON=python3 JAVA_HOME="/usr/lib/jvm/java-8-openjdk-amd64/jre" SPARK_HOME=~/.local/lib/python3.6/site-packages/pyspark spark-submit job.py --master local
The History Server is running; however, I am trying to configure the Spark History Server to read the correct directory. The settings I have configured are in /pyspark/conf/spark-env.sh:
....
SPARK_HISTORY_OPTS="$SPARK_HISTORY_OPTS -Dspark.eventLog.enabled=true -Dspark.eventLog.dir=/home/Documents/Junk/logs/ -Dspark.history.fs.logDirectory=/home/Documents/Junk/logs"
....
But when I run jobs, this directory is empty (logs are not being written to this directory).
Am I specifying the directory addresses correctly? (These are local addresses in my file system.)
To get it working, do not use spark-env.sh; instead, edit the conf/spark-defaults.conf file with the following (note the file:// prefix):
spark.eventLog.enabled true
spark.eventLog.dir file:///home/user/.local/lib/python3.6/site-packages/pyspark/logs
spark.history.fs.logDirectory file:///home/user/.local/lib/python3.6/site-packages/pyspark/logs
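With that in place, one way to check it (a sketch; the paths assume the pip-installed PySpark layout from the question):
mkdir -p /home/user/.local/lib/python3.6/site-packages/pyspark/logs
~/.local/lib/python3.6/site-packages/pyspark/sbin/start-history-server.sh
# run a job, then the application event log files should appear here
ls /home/user/.local/lib/python3.6/site-packages/pyspark/logs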

How to initialize the Spark shell with a specific user to save data to HDFS with Apache Spark

I'm using Ubuntu.
I'm using the Spark dependency in IntelliJ.
When I enter spark in the shell I get: Command 'spark' not found, but can be installed with: ..
I have two users, amine and hadoop_amine (where Hadoop HDFS is set up).
When I try to save a DataFrame to HDFS (Spark Scala):
procesed.write.format("json").save("hdfs://localhost:54310/mydata/enedis/POC/processed.json")
I get this error:
Caused by: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.AccessControlException): Permission denied: user=root, access=WRITE, inode="/mydata/enedis/POC":hadoop_amine:supergroup:drwxr-xr-x
Try changing the permissions of the HDFS directory, or simply change your Spark user.
To change the directory permissions you can use the hdfs command line, like this:
hdfs dfs -chmod ...
With spark-submit you can use the --proxy-user option.
And finally, you can run spark-submit or spark-shell as the proper user, with a command like this:
sudo -u hadoop_amine spark-submit ...
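For illustration, using the directory and users from the question (a sketch, not verified against the cluster; the class and jar names are placeholders):
# Option 1: open up the target directory so other users can write to it
sudo -u hadoop_amine hdfs dfs -chmod -R 777 /mydata/enedis/POC
# Option 2: impersonate the owning user at submit time (requires proxy-user setup in Hadoop)
spark-submit --proxy-user hadoop_amine --class com.example.MyApp myapp.jar
# Option 3: run the shell or the submit command entirely as the owning user
sudo -u hadoop_amine spark-shell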

SPARK Application + HDFS + User Airflow is not the owner of inode=alapati

We are running a Spark application on a Hadoop cluster (HDP version 2.6.5 from Hortonworks).
From the logs we can see the following diagnostics:
User: airflow
Application Type: SPARK
User class threw exception: org.apache.hadoop.security.AccessControlException: Permission denied. user=airflow is not the owner of inode=alapati
The log does not state clearly what we need to look for in HDFS to find out why we get Permission denied.
It looks like user=airflow doesn't have permission to write data to HDFS.
By default the /user/ directory is owned by "hdfs" with 755 permissions. As a result only hdfs can write to that directory.
You have two options:
change the Spark user name from airflow to hdfs, or,
if you still need to use user=airflow, create a home directory for airflow:
sudo -u hdfs hadoop fs -mkdir /user/airflow
sudo -u hdfs hadoop fs -chown airflow /user/airflow
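A quick sanity check after running the commands above (a sketch):
sudo -u hdfs hadoop fs -ls /user
# /user/airflow should now be listed, owned by airflow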

What specific properties are required to submit a job to a remote YARN cluster

We are trying to submit a Spark/MapReduce job to a remote YARN cluster, and we know that to submit it we need core-site.xml and yarn-site.xml in the conf directory. But I am trying to understand which specific properties the YARN client and the Spark client need in order to submit a job remotely. I don't want to share all the properties.
Any pointers to this would help.
We generally call this an Edge Node or Gateway Node, and the following files must be available under the Hadoop and YARN configuration directories; a minimal example of the key properties follows the file lists below.
#hadoop folder
core-site.xml
hadoop-env.sh
hdfs-site.xml
log4j.properties
YARN configuration
#yarn folder
core-site.xml
hadoop-env.sh
hdfs-site.xml
log4j.properties
mapred-site.xml
yarn-site.xml
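As a minimal sketch of the client-side properties (hostnames and ports are placeholders; secure or HA clusters will need more), the files usually have to tell the client at least where the NameNode and the ResourceManager live:
core-site.xml:
<property>
  <name>fs.defaultFS</name>
  <value>hdfs://namenode-host:8020</value>
</property>
yarn-site.xml:
<property>
  <name>yarn.resourcemanager.hostname</name>
  <value>resourcemanager-host</value>
</property>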

Custom log4j.properties on AWS EMR

I am unable to override and use a custom log4j.properties on Amazon EMR. I am running Spark on EMR (YARN) and have tried both of the combinations below in spark-submit to try to use the custom log4j:
--driver-java-options "-Dlog4j.configuration=hdfs://host:port/user/hadoop/log4j.properties"
--conf "spark.driver.extraJavaOptions=-Dlog4j.configuration=hdfs://host:port/user/hadoop/log4j.properties"
I have also tried picking it up from the local filesystem using file://// instead of hdfs. None of these seems to work. However, I can get this working when running on my local YARN setup.
Any ideas?
Basically, after chatting with support and reading the documentation, I see that there are two options available to do this:
1 - Pass the log4j.properties through the configuration supplied when bringing up EMR. Jonathan has mentioned this in his answer.
2 - Include the --files /path/to/log4j.properties switch in your spark-submit command. This will distribute the log4j.properties file to the working directory of each Spark executor; then change your -Dlog4j.configuration to point to the filename only: "spark.driver.extraJavaOptions=-Dlog4j.configuration=log4j.properties"
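Putting option 2 together, a spark-submit invocation might look like this (a sketch; the jar and class names are placeholders):
spark-submit \
  --master yarn --deploy-mode cluster \
  --files /path/to/log4j.properties \
  --conf "spark.driver.extraJavaOptions=-Dlog4j.configuration=log4j.properties" \
  --conf "spark.executor.extraJavaOptions=-Dlog4j.configuration=log4j.properties" \
  --class com.example.MyJob myjob.jar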
log4j knows nothing about HDFS, so it can't accept an hdfs:// path as its configuration file. See here for more information about configuring log4j in general.
To configure log4j on EMR, you may use the Configuration API to add key-value pairs to the log4j.properties file that is loaded by the driver and executors. Specifically, you want to add your Properties to the spark-log4j configuration classification.
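For example, a spark-log4j classification entry in the EMR configuration JSON might look like this (the property values here are only illustrative):
[
  {
    "Classification": "spark-log4j",
    "Properties": {
      "log4j.rootCategory": "WARN, console",
      "log4j.logger.org.apache.spark": "WARN"
    }
  }
]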
Here is the simplest solution, which worked quite well in my case:
ssh to the EMR cluster via a terminal.
Go to the conf directory (cd /usr/lib/spark/conf).
Replace the log4j.properties file with your custom values.
Make sure you are editing the file with root access (type sudo -i to log in as the root user).
Note: all the Spark applications running in this cluster will output the logs defined in the custom log4j.properties file.
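For reference, a custom log4j.properties might start from something like this (a sketch based on Spark's default log4j template; adjust the levels to your needs):
# default level and console sink
log4j.rootCategory=WARN, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.err
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n
# quiet down noisy Spark internals
log4j.logger.org.apache.spark=WARN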
For those using Terraform, it can be cumbersome to define a bootstrap action to create a new log4j file in EMR or to update the default one at /etc/spark/conf/log4j.properties, because this will recreate the EMR cluster.
In this case, it's possible to use S3 paths in the --files option, so something like --files=s3://my-bucket/log4j.properties is valid. As mentioned by Kaptrain, EMR will distribute the log4j.properties file to the working directory of each Spark executor. Then we can pass these two flags to our Spark jobs in order to use the new log4j configuration:
spark.driver.extraJavaOptions=-Dlog4j.configuration=log4j.properties
spark.executor.extraJavaOptions=-Dlog4j.configuration=log4j.properties
