ADLS gen2 compatible filesystem - azure

Is there any ADLS Gen2 API-compatible filesystem available?
We want to automate testing of our ADLS writer. The tests will run in our build environment, which doesn't have access to Azure. Is there an API-compatible filesystem that we can use for automated testing? We use MinIO for S3 writer testing and are looking for a similar tool for ADLS.

Unfortunately, I could not find any tool that provides a local setup for testing an ADLS Gen2 writer client. Please feel free to post your question on Microsoft Q&A.

Related

When ingesting into a data lake using ADLS Gen2, should files be stored in File Shares or Containers

When ingesting data and transforming the various layers of our data lake built on top of Azure ADLS gen2 storage account (hierarchical), I can organize files in Containers or File Shares. We currently ingest raw files into a RAW container in their native format ".csv". We then take those files and merge them into a QUERY container in compressed parquet format so that we can virtualize all the data using Polybase in SQL server.
It is my understanding that only files stored within File Shares can be accessed using the typical SMB/UNC paths. When building out a data lake such as this, should Containers within ADLS be avoided in order to gain the additional benefit of being able to access those same files via File Shares?
I did notice that files located under File shares do not appear to support metadata key/values (unless it's just not exposed through the UI). Other than that, I wonder if there are any other real differences between the two types.
Thanks to @Gaurav for sharing the knowledge in the comment section.
(Posting the answer using the details provided in the comment section to help other community members.)
Previously, only files stored in an Azure Storage file share could be accessed using typical SMB/UNC paths. It is now also possible to mount a Blob container using the NFS 3.0 protocol. The official Microsoft documentation provides step-by-step guidance.
Limitation: You can mount a container in Blob storage only from a Linux-based Azure Virtual Machine (VM) or a Linux system that runs on-premises. There is no support for Windows or macOS.

Copy files to ADLS gen2 from mobile phone

I have a few files that are regularly created on my mobile phone. How can I upload these files to my ADLS Gen2 storage account? I generally use azcopy to copy files, but how can this be done on Android phones?
Is there an upload-file REST API for ADLS Gen2, or any SDK?
Yes, as @GeorgeChen's comment said. As far as I know, there is currently no SDK for Azure Data Lake Storage Gen2, so the only solution is to use its REST APIs.
There is a very similar SO thread, "Upload data to the Azure ADLS Gen2 from on-premise using Python or Java", which you can refer to. My answer there posts a Python script that defines 7 functions for working with the REST APIs: auth, mkfs, mkdir, touch_file, append_file, flush_file and mkfile.
For Java, you can use my Python code as a reference and write your Java code with OkHttp.
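To make the REST approach more concrete, here is a minimal sketch of the three calls an upload needs on the ADLS Gen2 Path API: create the file, append the bytes, then flush at the final length. This is not the script from the linked thread. The function name build_upload_plan and the account, filesystem and path values are made up for illustration, and the authentication and x-ms-version headers are deliberately left out, so you would still need to add those before sending anything.

```python
from urllib.parse import quote, urlencode


def build_upload_plan(account, filesystem, path, data):
    """Return (method, url) pairs for create, append and flush.

    Hypothetical helper: it only builds the requests, it does not send them.
    Auth headers (OAuth bearer token or shared key) are intentionally omitted.
    """
    base = (
        f"https://{account}.dfs.core.windows.net/"
        f"{quote(filesystem)}/{quote(path)}"
    )
    return [
        # 1. Create an empty file at the target path.
        ("PUT", f"{base}?{urlencode({'resource': 'file'})}"),
        # 2. Append the payload (sent as the request body) at offset 0.
        ("PATCH", f"{base}?{urlencode({'action': 'append', 'position': 0})}"),
        # 3. Flush: commit everything up to the total length written.
        ("PATCH", f"{base}?{urlencode({'action': 'flush', 'position': len(data)})}"),
    ]


if __name__ == "__main__":
    payload = b"a,b\n1,2\n"
    for method, url in build_upload_plan("myaccount", "myfs", "dir/report.csv", payload):
        print(method, url)
```

On Android the same three requests can be issued from any HTTP client; only the second one carries the file bytes as its body.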
Update: I reviewed the official Azure documents and searched the official GitHub repos for ADLS Gen2. There is a public preview version of an ADLS Gen2 SDK named Azure File Data Lake client library for Java. It uses the Netty HTTP client by default, but according to the README you can use OkHttp as an alternate HTTP client, so I think you can try using it with OkHttp on Android.

Uploading Data(csv file) using Azure Functions(Nodejs) To Azure DataLakeGen2

I am currently trying to send a CSV file to Azure Data Lake Gen2 using an Azure Function with Node.js, but I am unable to do so. Any suggestions would be really helpful.
Thanks.
I have tried to use the credentials of the Blob storage present in ADLS Gen2 with the Blob storage APIs, but I am getting an error.
For now, this cannot be implemented with the SDK. Please check this known issue:
Blob storage APIs are disabled to prevent feature operability issues that could arise because Blob Storage APIs aren't yet interoperable with Azure Data Lake Gen2 APIs.
And in the table of features, you could find the information about APIs for Data Lake Storage Gen2 storage accounts:
multi-protocol access on Data Lake Storage is currently in public preview. This preview enables you to use Blob APIs in the .NET, Java, Python SDKs with accounts that have a hierarchical namespace. The SDKs don't yet contain APIs that enable you to interact with directories or set access control lists (ACLs). To perform those functions, you can use Data Lake Storage Gen2 REST APIs.
So if you want to implement it, you have to use the REST API: Azure Data Lake Store REST API.

Can Spark write to Azure Datalake Gen2?

It seems impossible to write to Azure Datalake Gen2 using spark, unless you're using Databricks.
I'm using jupyter with almond to run spark in a notebook locally.
I have imported the hadoop dependencies:
import $ivy.`org.apache.hadoop:hadoop-azure:2.7.7`
import $ivy.`com.microsoft.azure:azure-storage:8.4.0`
which allows me to use the wasbs:// protocol when trying to write my dataframe to azure
spark.conf.set(
"fs.azure.sas.[container].prodeumipsadatadump.blob.core.windows.net",
"?sv=2018-03-28&ss=b&srt=sco&sp=rwdlac&se=2019-09-09T23:33:45Z&st=2019-09-09T15:33:45Z&spr=https&sig=[truncated]")
This is where the error comes:
val data = spark.read.json(spark.createDataset(
  """{"name":"Yin", "age": 25.35,"address":{"city":"Columbus","state":"Ohio"}}""" :: Nil))
data
  .write
  .orc("wasbs://[filesystem]@[datalakegen2storageaccount].blob.core.windows.net/lalalalala")
We are now greeted with "Blob API is not yet supported for hierarchical namespace accounts" error:
org.apache.hadoop.fs.azure.AzureException: com.microsoft.azure.storage.StorageException: Blob API is not yet supported for hierarchical namespace accounts.
So is this indeed impossible? Should I just abandon the Datalake gen2 and just use regular blob storage? Microsoft really dropped the ball in creating a "Data lake" product but creating no documentation for a connector with spark.
Working with ADLS Gen2 in Spark is straightforward, and Microsoft haven't "dropped the ball" so much as "the Hadoop binaries shipped with ASF Spark don't include the ABFS client". Those in HDInsight, Cloudera CDH 6.x, etc. do.
What you need to do:
1. Consistently upgrade the hadoop-* JARs to Hadoop 3.2.1. That means all of them, not dropping in a later hadoop-azure-3.2.1 JAR and expecting things to work.
2. Use abfs:// URLs.
3. Configure the client as per the docs.
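As a rough illustration of the last two steps, the sketch below builds an ABFS URI and the shared-key configuration entry that the Hadoop ABFS client reads. The helper names abfs_uri and shared_key_conf and the sample account and filesystem values are hypothetical; shared-key auth is only one of the options the ABFS docs describe, and you would still pass the resulting key/value pair to spark.conf.set (or the Hadoop configuration) yourself.

```python
def abfs_uri(filesystem, account, path="", secure=True):
    """Build an ABFS URI: abfs[s]://<filesystem>@<account>.dfs.core.windows.net/<path>."""
    scheme = "abfss" if secure else "abfs"
    return f"{scheme}://{filesystem}@{account}.dfs.core.windows.net/{path.lstrip('/')}"


def shared_key_conf(account, key):
    """Return the Hadoop property that holds the storage account key.

    Assumes shared-key authentication; OAuth setups use different properties.
    """
    return {f"fs.azure.account.key.{account}.dfs.core.windows.net": key}


if __name__ == "__main__":
    print(abfs_uri("myfs", "myaccount", "/out/data.orc"))
    for name, value in shared_key_conf("myaccount", "<account-key>").items():
        print(name, "=", value)
```

Note that the host is the dfs endpoint, not the blob endpoint used by wasbs:// URLs.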
ADLS Gen2 is the best object store Microsoft have deployed. With hierarchical namespaces you get O(1) directory operations, which for Spark means high-performance task and job commits. Security and permissions are great too.
Yes, it is unfortunate that it doesn't work out of the box with the Spark distribution you have, but Microsoft are not in a position to retrofit a new connector to a set of artifacts released in 2017. You're going to have to upgrade your dependencies.
I think you have to enable the preview feature to use the Blob API with Azure Data Lake Gen2: Data Lake Gen2 Multi-Protocol-Access
Another thing that I found: the endpoint format needs to be updated by replacing "blob" with "dfs" in the host name. See here. But I am not sure if that helps with your problem.
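For what it's worth, the endpoint change amounts to swapping the host suffix and leaving the rest of the URL alone. A small sketch, assuming the standard public-cloud host names; to_dfs_endpoint is a made-up helper name, not part of any SDK:

```python
from urllib.parse import urlsplit, urlunsplit

BLOB_SUFFIX = ".blob.core.windows.net"
DFS_SUFFIX = ".dfs.core.windows.net"


def to_dfs_endpoint(url):
    """Rewrite a Blob endpoint URL to the matching Data Lake (dfs) endpoint.

    URLs that do not point at a *.blob.core.windows.net host are returned unchanged.
    """
    parts = urlsplit(url)
    if not parts.netloc.endswith(BLOB_SUFFIX):
        return url
    new_netloc = parts.netloc[: -len(BLOB_SUFFIX)] + DFS_SUFFIX
    return urlunsplit(parts._replace(netloc=new_netloc))


if __name__ == "__main__":
    print(to_dfs_endpoint("https://myaccount.blob.core.windows.net/myfs/dir/file.csv"))
```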
On the other hand, you could use the ABFS driver to access the data. This is not officially supported, but you could start from a hadoop-free spark solution and install a newer hadoop version containing the driver. I think this might be an option depending on your scenario: Adding hadoop ABFS driver to spark distribution

When Will Azure ADLS Gen 2 SDK Be Released?

It seems like the SDKs for Data Lake Storage Gen2 are not available now. Are there other ways / workarounds?
This seems like a question many others also have: https://github.com/MicrosoftDocs/azure-docs/issues/22913
Any news about an SDK for gen2 datalake?
According to the known issues about ADLS GEN2:
You can use Data Lake Storage Gen2 REST APIs, but APIs in other Blob
SDKs such as the .NET, Java, Python SDKs are not yet available.
So you could use the REST API. Here are some threads for your reference:
1. https://social.msdn.microsoft.com/Forums/en-US/45be0931-379d-4252-9d20-164261cc64c5/error-while-calling-adls-gen-2-rest-api-to-create-file?forum=AzureDataLake
2. https://social.msdn.microsoft.com/Forums/azure/en-US/dc102604-bdb7-47be-8de4-dc47a42e31a4/azure-data-lake-gen2-rest-api?forum=AzureDataLake
To help push progress on the SDK, you can submit your feedback here so that the Azure team can respond with the latest updates.
