hdfs dfs -setfacl fails on Azure Data Lake Store

I am trying to change the Access Control List for a file located in an Azure Data Lake Store Gen 1 from an HDInsight 3.6 Cluster with the following command:
hdfs dfs -setfacl -m user:d7de0903-abcabc-44c1-8f1c-31311f5caa69:r-x adl://<myadls>.azuredatalakestore.net/user/pics/sample.jpeg
It fails with:
-setfacl: Fatal internal error
java.lang.NullPointerException
at org.apache.hadoop.fs.adl.HdiAdlFileSystem.shouldUseDaemonUserOrGroup(Unknown Source)
at org.apache.hadoop.fs.adl.HdiAdlFileSystem.getEffectiveAclEntries(Unknown Source)
at org.apache.hadoop.fs.adl.HdiAdlFileSystem.modifyAclEntries(Unknown Source)
at org.apache.hadoop.fs.shell.AclCommands$SetfaclCommand.processPath(AclCommands.java:240)
at org.apache.hadoop.fs.shell.Command.processPaths(Command.java:317)
at org.apache.hadoop.fs.shell.Command.processPathArgument(Command.java:289)
at org.apache.hadoop.fs.shell.Command.processArgument(Command.java:271)
at org.apache.hadoop.fs.shell.Command.processArguments(Command.java:255)
at org.apache.hadoop.fs.shell.FsCommand.processRawArguments(FsCommand.java:119)
at org.apache.hadoop.fs.shell.Command.run(Command.java:165)
at org.apache.hadoop.fs.FsShell.run(FsShell.java:297)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:90)
at org.apache.hadoop.fs.FsShell.main(FsShell.java:356)
As
hdfs dfs -getfacl adl://<myadls>.azuredatalakestore.net/user/pics/sample.jpeg
works fine, I am wondering if an ESP (Enterprise Security Package) configuration is needed.

You can try the following steps to set ACLs on Azure Data Lake Storage Gen1.
Before: get the current ACLs of the ADLS folder example, which include the entry user:bd04d6b3-xxx-xxxx-xxxx-3e30e6c99f3c:r--
hdfs dfs -getfacl adl://<adlsname>.azuredatalakestore.net/clusters/example
Set ACLs on ADLS: use the following command to set the ACLs for a specific user:
hdfs dfs -setfacl -m user:bd04d6b3-xxxx-xxxx-xxxx-3e30e6c99f3c:rwx adl://<adlsname>.azuredatalakestore.net/clusters/example
After: run getfacl again and you will see that the ACLs have changed.
Hope this helps.

Related

Unable to read xlsx file to pyspark dataframe from azure blob storage container

I am trying to load data from an Azure storage container into a PySpark dataframe in Azure Databricks. Reading txt or CSV files works, but when I try to read .xlsx files I get the following issue.
Apache Spark 3.2.0, Scala 2.12
Below are the steps I am performing
spark.conf.set("fs.azure.account.key.teststorage.blob.core.windows.net",
"**********************")
This works:
df = spark.read.format("csv").option("header", "true") \
.option("inferSchema", "true") \
.load("wasbs://testcontainer@teststorage.blob.core.windows.net/data/samplefile.txt")
This does not work:
df = spark.read.format("com.crealytics.spark.excel") \
.option("header", "true").option("inferSchema","true") \
.load("wasbs://testcontainer@teststorage.blob.core.windows.net/data/samplefile.xlsx")
I get the error below while loading xlsx files:
: shaded.databricks.org.apache.hadoop.fs.azure.AzureException: shaded.databricks.org.apache.hadoop.fs.azure.AzureException: Container producer in account teststorage.blob.core.windows.net not found, and we can't create it using anoynomous credentials, and no credentials found for them in the configuration.
at shaded.databricks.org.apache.hadoop.fs.azure.AzureNativeFileSystemStore.createAzureStorageSession(AzureNativeFileSystemStore.java:1063)
at shaded.databricks.org.apache.hadoop.fs.azure.AzureNativeFileSystemStore.initialize(AzureNativeFileSystemStore.java:512)
at shaded.databricks.org.apache.hadoop.fs.azure.NativeAzureFileSystem.initialize(NativeAzureFileSystem.java:1384)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3469)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:537)
at com.crealytics.spark.excel.WorkbookReader$.readFromHadoop$1(WorkbookReader.scala:35)
at com.crealytics.spark.excel.WorkbookReader$.$anonfun$apply$2(WorkbookReader.scala:41)
at com.crealytics.spark.excel.DefaultWorkbookReader.$anonfun$openWorkbook$1(WorkbookReader.scala:49)
at scala.Option.fold(Option.scala:251)
at com.crealytics.spark.excel.DefaultWorkbookReader.openWorkbook(WorkbookReader.scala:49)
at com.crealytics.spark.excel.WorkbookReader.withWorkbook(WorkbookReader.scala:14)
at com.crealytics.spark.excel.WorkbookReader.withWorkbook$(WorkbookReader.scala:13)
at com.crealytics.spark.excel.DefaultWorkbookReader.withWorkbook(WorkbookReader.scala:45)
at com.crealytics.spark.excel.ExcelRelation.excerpt$lzycompute(ExcelRelation.scala:31)
at com.crealytics.spark.excel.ExcelRelation.excerpt(ExcelRelation.scala:31)
at com.crealytics.spark.excel.ExcelRelation.headerColumns$lzycompute(ExcelRelation.scala:102)
at com.crealytics.spark.excel.ExcelRelation.headerColumns(ExcelRelation.scala:101)
at com.crealytics.spark.excel.ExcelRelation.$anonfun$inferSchema$1(ExcelRelation.scala:163)
at scala.Option.getOrElse(Option.scala:189)
at com.crealytics.spark.excel.ExcelRelation.inferSchema(ExcelRelation.scala:162)
at com.crealytics.spark.excel.ExcelRelation.<init>(ExcelRelation.scala:35)
at com.crealytics.spark.excel.DefaultSource.createRelation(DefaultSource.scala:35)
at com.crealytics.spark.excel.DefaultSource.createRelation(DefaultSource.scala:13)
at com.crealytics.spark.excel.DefaultSource.createRelation(DefaultSource.scala:8)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:385)
at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:355)
at org.apache.spark.sql.DataFrameReader.$anonfun$load$2(DataFrameReader.scala:322)
at scala.Option.getOrElse(Option.scala:189)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:322)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:235)
at sun.reflect.GeneratedMethodAccessor338.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:380)
at py4j.Gateway.invoke(Gateway.java:295)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:251)
at java.lang.Thread.run(Thread.java:748)
Note: I am able to read from DBFS and from a mount point.
This problem arises from the public access level of the blob storage container. When the container's public access level is Private or Blob, the same error occurs for Excel files. When the public access level is set to Container, the Excel files can be read without error. This is what I observed while trying to reproduce the issue.
A simple solution would be either to change the public access level of the container to Container, or to mount the blob storage account to the Databricks file system (which is already working for you). If you choose to change the public access level, go to the container in your blob storage and you will find the option “Change access level”, where you can select the Container level.
Navigate to the container in the storage account and change the access level. Then go back to Databricks and run the dataframe read again; it works without any error:
df2 = spark.read.format("com.crealytics.spark.excel") \
.option("header", "true").option("inferSchema","true") \
.load("wasbs://<container>#<storage_acc>.blob.core.windows.net/data.xlsx")
Please refer to the following document to learn more about accessing a blob storage account from Databricks:
https://learn.microsoft.com/en-us/azure/databricks/data/data-sources/azure/azure-storage

Is there a way to read a file from a Kerberized HDFS into a non-Kerberized Spark cluster, given the keytab file, principal and other details?

I need to read data from a Kerberized HDFS cluster using webHDFS in a non-Kerberized Spark cluster. I have access to the keytab file and the username/principal, and can access any other details needed to log in. I need to programmatically log in and allow my Spark cluster to read the file(s) from the Kerberized HDFS.
It looks like I can log in using this piece of code:
System.setProperty("java.security.krb5.kdc", "<kdc>")
System.setProperty("java.security.krb5.realm", "<realm>")
val conf = new Configuration()
conf.set("hadoop.security.authentication", "kerberos")
UserGroupInformation.setConfiguration(conf)
UserGroupInformation.loginUserFromKeytab("<my-principal>", "<keytab-file-path>/keytabFile.keytab")
From the official webHDFS documentation, I see that I can configure the keytab file path and principal in the Hadoop configuration like this:
sparkSession.sparkContext.hadoopConfiguration.set("dfs.web.authentication.kerberos.principal", "<my-principal>")
sparkSession.sparkContext.hadoopConfiguration.set("dfs.web.authentication.kerberos.keytab", "<keytab-file-path>/keytabFile.keytab")
After this, I should be able to read the files from the Kerberized HDFS cluster using:
val irisDFWebHDFS = sparkSession.read
.format("csv")
.option("header", "true")
.csv(s"webhdfs://<namenode-host>:<namenode-port>/user/hadoop/iris.csv")
But it still refuses to read and throws the following exception:
org.apache.hadoop.security.AccessControlException: Authentication required
at org.apache.hadoop.hdfs.web.WebHdfsFileSystem.validateResponse(WebHdfsFileSystem.java:490)
at org.apache.hadoop.hdfs.web.WebHdfsFileSystem.access$300(WebHdfsFileSystem.java:135)
at org.apache.hadoop.hdfs.web.WebHdfsFileSystem$AbstractRunner.connect(WebHdfsFileSystem.java:721)
at org.apache.hadoop.hdfs.web.WebHdfsFileSystem$AbstractRunner.runWithRetry(WebHdfsFileSystem.java:796)
at org.apache.hadoop.hdfs.web.WebHdfsFileSystem$AbstractRunner.access$100(WebHdfsFileSystem.java:619)
at org.apache.hadoop.hdfs.web.WebHdfsFileSystem$AbstractRunner$1.run(WebHdfsFileSystem.java:657)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
at org.apache.hadoop.hdfs.web.WebHdfsFileSystem$AbstractRunner.run(WebHdfsFileSystem.java:653)
at org.apache.hadoop.hdfs.web.WebHdfsFileSystem.getDelegationToken(WebHdfsFileSystem.java:1741)
at org.apache.hadoop.hdfs.web.WebHdfsFileSystem.getDelegationToken(WebHdfsFileSystem.java:365)
at org.apache.hadoop.hdfs.web.WebHdfsFileSystem.getAuthParameters(WebHdfsFileSystem.java:585)
at org.apache.hadoop.hdfs.web.WebHdfsFileSystem.toUrl(WebHdfsFileSystem.java:608)
at org.apache.hadoop.hdfs.web.WebHdfsFileSystem$AbstractFsPathRunner.getUrl(WebHdfsFileSystem.java:898)
at org.apache.hadoop.hdfs.web.WebHdfsFileSystem$AbstractRunner.runWithRetry(WebHdfsFileSystem.java:794)
at org.apache.hadoop.hdfs.web.WebHdfsFileSystem$AbstractRunner.access$100(WebHdfsFileSystem.java:619)
at org.apache.hadoop.hdfs.web.WebHdfsFileSystem$AbstractRunner$1.run(WebHdfsFileSystem.java:657)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
at org.apache.hadoop.hdfs.web.WebHdfsFileSystem$AbstractRunner.run(WebHdfsFileSystem.java:653)
at org.apache.hadoop.hdfs.web.WebHdfsFileSystem.getHdfsFileStatus(WebHdfsFileSystem.java:1086)
at org.apache.hadoop.hdfs.web.WebHdfsFileSystem.getFileStatus(WebHdfsFileSystem.java:1097)
at org.apache.hadoop.fs.FileSystem.isDirectory(FileSystem.java:1707)
at org.apache.spark.sql.execution.streaming.FileStreamSink$.hasMetadata(FileStreamSink.scala:47)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:376)
at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:326)
at org.apache.spark.sql.DataFrameReader.$anonfun$load$3(DataFrameReader.scala:308)
at scala.Option.getOrElse(Option.scala:189)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:308)
at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:796)
at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:596)
... 44 elided
Any pointers on what I might be missing?

Client cannot authenticate via: [TOKEN, KERBEROS]

From my Spark application I am trying to distcp from HDFS to S3. My app does some processing and writes data to HDFS, and I am trying to push that data to S3 via distcp. I am facing the error below. Any pointer will be helpful.
org.apache.hadoop.security.UserGroupInformation doAs -
PriviledgedActionException as: (auth:SIMPLE) cause:org.apache.hadoop.security.
Failed on local exception: java.io.IOException:
org.apache.hadoop.security.AccessControlException:
Client cannot authenticate via: [TOKEN, KERBEROS];
I was already doing kinit. Doing ugi.doAs with a new privileged action fixed this issue.
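For reference, a minimal Scala sketch of that ugi.doAs pattern, assuming a hypothetical principal and keytab path (the body of run() is where the distcp or FileSystem calls would go):
import java.security.PrivilegedExceptionAction
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.security.UserGroupInformation
// Hypothetical principal and keytab path; replace with your own values.
val hadoopConf = new Configuration()
hadoopConf.set("hadoop.security.authentication", "kerberos")
UserGroupInformation.setConfiguration(hadoopConf)
val ugi = UserGroupInformation.loginUserFromKeytabAndReturnUGI(
  "myuser@EXAMPLE.COM", "/path/to/myuser.keytab")
// Running the work inside doAs attaches the Kerberos credentials of ugi
// to the Hadoop RPC calls instead of the default (auth:SIMPLE) identity.
ugi.doAs(new PrivilegedExceptionAction[Unit] {
  override def run(): Unit = {
    // place the distcp / FileSystem calls here
  }
})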

Azure Databricks - Unable to read simple blob storage file from notebook

I've set up a cluster with Databricks runtime version 5.1 (includes Apache Spark 2.4.0, Scala 2.11) and Python 3. I also installed the hadoop-azure library (hadoop-azure-3.2.0) on the cluster.
I'm trying to read a blob stored in my blob storage account, which is just a text file containing some space-delimited numeric data. I used the template generated by Databricks for reading blob data:
spark.conf.set(
"fs.azure.account.key."+storage_account_name+".blob.core.windows.net",
storage_account_access_key)
df = spark.read.format(file_type).option("inferSchema", "true").load(file_location)
where file_location is my blob file (https://xxxxxxxxxx.blob.core.windows.net).
I get the following error:
No filesystem named https
I tried using sc.textFile(file_location) to read it into an RDD and got the same error.
Your file_location should be in the format:
"wasbs://<your-container-name>#<your-storage-account-name>.blob.core.windows.net/<your-directory-name>"
See: https://docs.databricks.com/spark/latest/data-sources/azure/azure-storage.html
You need to mount the blob storage container to a DBFS mount point to access it via Azure Databricks.
Reference: https://docs.databricks.com/spark/latest/data-sources/azure/azure-storage.html#mount-azure-blob-storage-containers-with-dbfs
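A minimal sketch of such a mount, shown in Scala for a Databricks notebook (where dbutils and spark are predefined); the container, storage account, and mount names are hypothetical placeholders, and in practice the access key would come from a secret scope rather than being pasted in:
// Hypothetical names; replace <container>, <storage-account> and <mount-name>.
dbutils.fs.mount(
  source = "wasbs://<container>@<storage-account>.blob.core.windows.net",
  mountPoint = "/mnt/<mount-name>",
  extraConfigs = Map(
    "fs.azure.account.key.<storage-account>.blob.core.windows.net" -> "<storage-account-access-key>"))
// Once mounted, the files are readable through the mount path.
val df = spark.read.option("header", "true").csv("/mnt/<mount-name>/<your-directory-name>")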
These three lines of code worked for me:
spark.conf.set("fs.azure.account.key.STORAGE_ACCOUNT.blob.core.windows.net","BIG_KEY")
df = spark.read.csv("wasbs://CONTAINER@STORAGE_ACCOUNT.blob.core.windows.net/")
df.select('*').show()
NOTE that line 2 ends with .net/ because I do not have a sub-folder.

Write spark event log to local filesystem instead of hdfs

I want to redirect the event log of my Spark applications to a local directory like "/tmp/spark-events" instead of "hdfs://user/spark/applicationHistory".
I set the "spark.eventLog.dir" variable to "file:///tmp/spark-events" in Cloudera Manager (Spark Client Advanced Configuration Snippet (Safety Valve) for spark-conf/spark-defaults.conf).
But when I restart Spark, spark-conf contains (spark.eventLog.dir=hdfs://nameservice1file:///tmp/spark-eventstmp/spark) and this does not work.
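For reference, the intended safety-valve entry is the single line spark.eventLog.dir=file:///tmp/spark-events in spark-defaults.conf; the same setting can also be passed per application. A minimal Scala sketch, assuming the local directory /tmp/spark-events already exists on the driver node:
import org.apache.spark.sql.SparkSession
// Illustration only: the intended event-log settings supplied when the session is built.
val spark = SparkSession.builder()
  .appName("event-log-to-local-dir")
  .config("spark.eventLog.enabled", "true")
  .config("spark.eventLog.dir", "file:///tmp/spark-events")
  .getOrCreate()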
