How to initialize the spark shell with a specific user to save data to HDFS with Apache Spark - apache-spark

I'm using Ubuntu.
I'm using the Spark dependency in IntelliJ.
Command 'spark' not found, but can be installed with: .. (this is what I get when I enter spark in the shell)
I have two users, amine and hadoop_amine (where Hadoop HDFS is set up).
When I try to save a DataFrame to HDFS (Spark Scala):
processed.write.format("json").save("hdfs://localhost:54310/mydata/enedis/POC/processed.json")
I get this error:
Caused by: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.AccessControlException): Permission denied: user=root, access=WRITE, inode="/mydata/enedis/POC":hadoop_amine:supergroup:drwxr-xr-x

Try to change the permissions of the HDFS directory, or simply change your Spark user!
To change the directory permissions you can use the hdfs command line, like this:
hdfs dfs -chmod ...
With spark-submit you can use the --proxy-user option.
And finally, you can run spark-submit or spark-shell as the proper user, with a command like this:
sudo -u hadoop_amine spark-submit ...
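For reference, the full versions of those commands could look roughly like this (a sketch reusing the directory and user names from the question; the application class and jar are placeholders, and --proxy-user only helps if proxy-user impersonation is configured on the cluster):
# Option 1: open up the target directory (run as hadoop_amine, the HDFS owner)
sudo -u hadoop_amine hdfs dfs -chmod -R 777 /mydata/enedis/POC
# Option 2: submit while impersonating hadoop_amine (needs hadoop.proxyuser.* settings)
spark-submit --proxy-user hadoop_amine --class com.example.MyApp my-app.jar
# Option 3: start the shell directly as the owning user
sudo -u hadoop_amine spark-shell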

Related

Trying to move a CSV file from the local file system to the Hadoop file system

I am trying to copy a CSV file from my local file system to Hadoop, but I am not able to do it successfully. I am not sure which permissions I need to change. As I understand it, the HDFS superuser does not have access to /home/naya/dataFiles/BlackFriday.csv.
hdfs dfs -put /home/naya/dataFiles/BlackFriday.csv /tmp
# Error: put: Permission denied: user=naya, access=WRITE, inode="/tmp":hdfs:supergroup:drwxr-xr-x
sudo -u hdfs hdfs dfs -put /home/naya/dataFiles/BlackFriday.csv /tmp
# Error: put: `/home/naya/dataFiles/BlackFriday.csv': No such file or directory
Any help is highly appreciated. I want to do it via the command line utility. I can do it via Cloudera Manager from the Hadoop side, but I want to understand what's happening behind the commands.
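As a rough sketch (not from the original thread): the second error occurs because the hdfs user cannot read files under /home/naya, so one workaround is to stage the file somewhere world-readable, or alternatively to open up the HDFS /tmp directory so that naya can write to it directly:
# Stage the local file where the hdfs user can read it, then put it into HDFS
cp /home/naya/dataFiles/BlackFriday.csv /var/tmp/BlackFriday.csv
chmod 644 /var/tmp/BlackFriday.csv
sudo -u hdfs hdfs dfs -put /var/tmp/BlackFriday.csv /tmp
# Or make the HDFS /tmp directory world-writable (sticky bit, like a Unix /tmp)
sudo -u hdfs hdfs dfs -chmod 1777 /tmp
hdfs dfs -put /home/naya/dataFiles/BlackFriday.csv /tmp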

Spark Event log directory

I am using PySpark (standalone, without Hadoop etc.) and calling my PySpark jobs as shown below, and it works fine:
PYSPARK_PYTHON=python3 JAVA_HOME="/usr/lib/jvm/java-8-openjdk-amd64/jre" SPARK_HOME=~/.local/lib/python3.6/site-packages/pyspark spark-submit job.py --master local
The History Server is running; however, I am trying to configure the Spark History Server to read the correct directory. The settings I have configured are in /pyspark/conf/spark-env.sh:
....
SPARK_HISTORY_OPTS="$SPARK_HISTORY_OPTS -Dspark.eventLog.enabled=true -Dspark.eventLog.dir=/home/Documents/Junk/logs/ -Dspark.history.fs.logDirectory=/home/Documents/Junk/logs"
....
But when I run jobs, this directory stays empty (no logs are written to it).
Am I specifying the directory addresses correctly? (These are local paths on my file system.)
To get it working, do the following. Do not use spark-env.sh; instead edit the conf/spark-defaults.conf file with the following, and note the file:// prefix.
spark.eventLog.enabled true
spark.eventLog.dir file:///home/user/.local/lib/python3.6/site-packages/pyspark/logs
spark.history.fs.logDirectory file:///home/user/.local/lib/python3.6/site-packages/pyspark/logs
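As a usage sketch (the directory follows the answer above; adjust it to your own paths), you would create the directory, restart the history server so it picks up spark-defaults.conf, and then run a job:
# Create the event log directory referenced in spark-defaults.conf
mkdir -p /home/user/.local/lib/python3.6/site-packages/pyspark/logs
# Restart the history server so the new settings are read
$SPARK_HOME/sbin/stop-history-server.sh
$SPARK_HOME/sbin/start-history-server.sh
# Run a job; an event log file should now appear in the directory above
PYSPARK_PYTHON=python3 spark-submit --master local job.py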

Error when trying to save a Spark dataframe to an HDFS file

I'm using Ubuntu.
When I try to save a DataFrame to HDFS (Spark Scala):
processed.write.format("json").save("hdfs://localhost:54310/mydata/enedis/POC/processed.json")
I get this error:
Caused by: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.AccessControlException): Permission denied: user=root, access=WRITE, inode="/mydata/enedis/POC":hadoop_amine:supergroup:drwxr-xr-x
You are trying to write data as the root user, but the HDFS directory (/mydata/enedis/POC) only grants write permission to the hadoop_amine user.
Either change the permissions on the HDFS directory to allow the root user to write to /mydata/enedis/POC:
# Log in as the hadoop_amine user, then execute the command below
hdfs dfs -chmod -R 777 /mydata/enedis/POC
(Or)
Initialize the spark shell as the hadoop_amine user; then there is no need to change the permissions of the directory.
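As a sketch of that second option (assuming the cluster uses simple, non-Kerberos authentication; HADOOP_USER_NAME is only honoured in that case), either of the following would start the shell as hadoop_amine:
# Start the shell as the OS user that owns the HDFS directory
sudo -u hadoop_amine spark-shell
# Or keep the current OS user but tell the HDFS client to act as hadoop_amine
export HADOOP_USER_NAME=hadoop_amine
spark-shell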

SPARK Application + HDFS + User Airflow is not the owner of inode=alapati

We are running a Spark application on a Hadoop cluster (HDP version 2.6.5 from Hortonworks).
From the logs we can see the following diagnostics:
User: airflow
Application Type: SPARK
User class threw exception: org.apache.hadoop.security.AccessControlException: Permission denied. user=airflow is not the owner of inode=alapati
The log does not state clearly what to look for in HDFS in order to find why we get Permission denied.
It looks like user=airflow does not have access to write data into HDFS.
By default the /user/ directory is owned by "hdfs" with 755 permissions. As a result only hdfs can write to that directory.
You can use two options:
change the Spark user name from airflow to hdfs, or
if you still need to use user=airflow, create a home directory for airflow:
sudo -u hdfs hadoop fs -mkdir /user/airflow
sudo -u hdfs hadoop fs -chown airflow /user/airflow
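A quick way to confirm the result (a sketch; the exact listing will vary with your cluster) is to list /user and check the owner of the new directory:
sudo -u hdfs hadoop fs -ls /user
# The /user/airflow entry should now show airflow as its owner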

Configure Apache Hive for local multiple users, or in Spark standalone, without Hadoop

We run Apache Spark standalone for teaching purposes on Fedora 27. I have it configured with MySQL for the metastore_db.
I am trying to follow this tip for running Hive in local mode, so I run the export command:
export HIVE_OPTS='-hiveconf mapred.job.tracker=local -hiveconf fs.default.name=file:///tmp -hiveconf hive.metastore.warehouse.dir=file:///tmp/warehouse' before starting hive. Another blog suggested this command:
export HIVE_OPTS='-hiveconf mapred.job.tracker=local -hiveconf fs.default.name=file:///tmp -hiveconf hive.metastore.warehouse.dir=file:///tmp/warehouse -hiveconf javax.jdo.option.ConnectionURL=jdbc:derby:;databaseName=/tmp/metastore_db;create=true'
So I ran chmod 777 /tmp/warehouse, but won't any subsequent users just write over everything in this local file database? Is there a better way to achieve this? Whenever I try to use Hive within Spark, or without any export of a local file database, I get localhost:8020 failed on connection exception: java.net.ConnectException: Connection refused errors, i.e. on the port that Hadoop runs on. We're trying to do this without Hadoop for teaching purposes. Should each user just specify a different path to either warehouse or databaseName, perhaps in their home directory?
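As a sketch of the per-user idea raised at the end of the question (the paths under $HOME are illustrative, not from the post; this uses an embedded Derby metastore per user rather than the shared MySQL one), each student could export something like:
# Per-user warehouse and metastore under the student's home directory
export HIVE_OPTS="-hiveconf mapred.job.tracker=local \
  -hiveconf fs.default.name=file://$HOME/hive \
  -hiveconf hive.metastore.warehouse.dir=file://$HOME/hive/warehouse \
  -hiveconf javax.jdo.option.ConnectionURL=jdbc:derby:;databaseName=$HOME/hive/metastore_db;create=true"
mkdir -p $HOME/hive/warehouse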
