How to get MD5 of file stored in ADLS Gen2? - azure

I receive daily files through SFTP into an ADLS Gen2 storage account. I need to verify each file by checking the MD5 of the file stored in ADLS Gen2.
I tried using the Blob API, but at the moment it does not support ADLS Gen2. I was able to get the Content-MD5 from the blob properties when the file is stored in Blob storage.
Can someone help me get the Content-MD5 of a file stored in ADLS Gen2?

As of now the Blob API is not supported, as you know, but you can take a look at the Data Lake Storage Gen2 REST API -> Path - Get Properties, which can be used to fetch the properties of files stored in ADLS Gen2.
Here is a sample (note that I append a SAS token to the API URL):
using System;
using System.Net;

namespace ConsoleApp3
{
    class Program
    {
        static void Main(string[] args)
        {
            string sasToken = "?sv=2018-03-28&ss=b&srt=sco&sp=rwdl&st=2019-04-15T08%3A07%3A49Z&se=2019-04-16T08%3A07%3A49Z&sig=xxxx";
            string url = "https://xxxx.dfs.core.windows.net/myfilesys1/app.JPG" + sasToken;
            var req = (HttpWebRequest)WebRequest.CreateDefault(new Uri(url));
            req.Method = "HEAD";
            var res = (HttpWebResponse)req.GetResponse();
            Console.WriteLine("the status code is: " + res.StatusCode);
            var headers = res.Headers;
            Console.WriteLine("the count of the headers is: " + headers.Count);
            Console.WriteLine("*********");
            Console.WriteLine();

            // List all the response headers if you don't know the exact name of the property you need.
            foreach (var h in headers.Keys)
            {
                Console.WriteLine(h.ToString());
            }
            Console.WriteLine("*********");
            Console.WriteLine();

            // Take the Content-Type property as an example.
            var myheader = res.GetResponseHeader("Content-Type");
            Console.WriteLine($"the header Content-Type is: {myheader}");
            Console.ReadLine();
        }
    }
}
Result: the request succeeds and every response header name is printed; Content-MD5 will be among them if it was set on the file.
If you don't know how to generate a SAS token, you can create one in the Azure portal -> your storage account -> Shared access signature.
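To check the hash specifically, you can read the Content-MD5 header from the same response and compare it with a locally computed MD5. A minimal sketch that continues the sample above (the local file path is just a placeholder, and Content-MD5 is only present if it was populated when the file was written):
// Content-MD5 is returned base64-encoded, if it was set when the file was created/flushed.
var md5Header = res.GetResponseHeader("Content-MD5");
Console.WriteLine($"the header Content-MD5 is: {md5Header}");

// Compare against a locally computed MD5 of a downloaded copy (hypothetical local path).
using (var md5 = System.Security.Cryptography.MD5.Create())
using (var stream = System.IO.File.OpenRead(@"d:\local-copy-of-file.JPG"))
{
    var localMd5 = Convert.ToBase64String(md5.ComputeHash(stream));
    Console.WriteLine($"hashes match: {localMd5 == md5Header}");
}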

Related

How to copy files from Azure File-Share (not blob) of different storage account using .net core

public void MoveFiles(AzureFileClient srcAzureClient, AzureFileClient destAzureClient, ShareClient srcShareClient, ShareClient destShareClient, string dirName)
{
    if (!destAzureClient.ShareClient.GetDirectoryClient(dirName).Exists())
        destAzureClient.ShareClient.GetDirectoryClient(dirName).Create();

    var fileItems = GetChildNodes(srcShareClient, dirName);
    if (fileItems.Count == 0)
        return;

    foreach (var item in fileItems)
    {
        if (item.ShareFileItem.IsDirectory)
        {
            MoveFiles(srcAzureClient, destAzureClient, srcShareClient, destShareClient, $"{dirName}/{item.ShareFileItem.Name}");
        }
        else
        {
            var srcFileClient = srcShareClient.GetDirectoryClient(Path.GetDirectoryName(item.FullPath)).GetFileClient(Path.GetFileName(item.FullPath));
            var destFileClient = destShareClient.GetDirectoryClient(Path.GetDirectoryName(item.FullPath)).GetFileClient(Path.GetFileName(item.FullPath));
            if (srcFileClient.Exists())
            {
                destFileClient.StartCopy(srcFileClient.Uri);
            }
        }
    }
}
This code throws an error at
destFileClient.StartCopy(srcFileClient.Uri)
saying
the copy source is not verified, even though connection strings are supplied for both the source and destination file share clients.
I am able to copy files within the same storage account.
When copying files (or blobs) across storage accounts, the source file (or blob) must be publicly accessible. This restriction does not apply when the source and destination are in the same storage account.
Because Azure Files is inherently private (there is no concept of public access like we have in Blob Storage), you are getting this error because the Azure Storage service is not able to read the source file.
To fix this, you would need to create a SAS URL with at least read permission on the source file and use that SAS URL as the copy source.
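A minimal sketch of that fix, assuming the Azure.Storage.Files.Shares SDK from the question and that srcFileClient was created with account credentials (otherwise GenerateSasUri cannot be used); the variable names follow the code above:
// Build a short-lived, read-only SAS URI for the source file so the storage
// service can read it during the cross-account copy.
Uri srcSasUri = srcFileClient.GenerateSasUri(
    Azure.Storage.Sas.ShareFileSasPermissions.Read,
    DateTimeOffset.UtcNow.AddHours(1));

// Use the SAS URI (not the plain URI) as the copy source.
destFileClient.StartCopy(srcSasUri);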

How to Read a file from Azure Data Lake Storage with file Url?

Is there a way to read files from Azure Data Lake? I have the HTTP URL of the file and I want to read it directly. How can I achieve this? I don't see a way to do it via the SDK.
Thanks for your help.
Regards
Did you check the docs?
public async Task ListFilesInDirectory(DataLakeFileSystemClient fileSystemClient)
{
    IAsyncEnumerator<PathItem> enumerator =
        fileSystemClient.GetPathsAsync("my-directory").GetAsyncEnumerator();

    await enumerator.MoveNextAsync();
    PathItem item = enumerator.Current;

    while (item != null)
    {
        Console.WriteLine(item.Name);

        if (!await enumerator.MoveNextAsync())
        {
            break;
        }
        item = enumerator.Current;
    }
}
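Since GetPathsAsync returns an AsyncPageable<PathItem>, the same listing can be written more compactly with await foreach (a sketch against the same Azure.Storage.Files.DataLake SDK):
public async Task ListFilesInDirectorySimple(DataLakeFileSystemClient fileSystemClient)
{
    // AsyncPageable<PathItem> supports await foreach directly.
    await foreach (PathItem item in fileSystemClient.GetPathsAsync("my-directory"))
    {
        Console.WriteLine(item.Name);
    }
}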
You can also use the ADLS Gen2 REST API.
For example, you can write code like the snippet below with SAS token authentication (or you can use shared key authentication instead):
string sasToken = "?sv=2018-03-28&ss=b&srt=sco&sp=rwdl&st=2019-04-15T08%3A07%3A49Z&se=2019-04-16T08%3A07%3A49Z&sig=xxxx";
string url = "https://xxxx.dfs.core.windows.net/myfilesys1/app.JPG" + sasToken;
var req = (HttpWebRequest)WebRequest.CreateDefault(new Uri(url));
//you can change the Method depending on the operation you need, as described in the API doc
req.Method = "HEAD";
var res = (HttpWebResponse)req.GetResponse();
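Because the question is about reading the file, here is a small sketch of the corresponding GET request (the Path - Read operation) against the same SAS-authenticated URL:
// Read the file content with a GET request (Path - Read).
var readReq = (HttpWebRequest)WebRequest.CreateDefault(new Uri(url));
readReq.Method = "GET";
using (var readRes = (HttpWebResponse)readReq.GetResponse())
using (var reader = new System.IO.StreamReader(readRes.GetResponseStream()))
{
    Console.WriteLine("file content: " + reader.ReadToEnd());
}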
Since Blob APIs and Data Lake Storage Gen2 APIs can operate on the same data, you can also use the Azure Blob Storage SDK directly to read a file from ADLS Gen2.
First, install this NuGet package: Microsoft.Azure.Storage.Blob, version 11.1.6.
Note that in this case you should use the blob endpoint URL "https://xxx.blob.core.windows.net/mycontainer/myfolder/test.txt" instead of the dfs endpoint URL "https://xxx.dfs.core.windows.net/mycontainer/myfolder/test.txt".
Here is sample code that reads a .txt file from ADLS Gen2:
var blob_url = "https://xxx.blob.core.windows.net/mycontainer/myfolder/test.txt";
//var blob_url = "https://xxx.dfs.core.windows.net/mycontainer/myfolder/test.txt";

// Here "username" is the storage account name and "password" is the account key.
var username = "xxxx";
var password = "xxxx";
StorageCredentials credentials = new StorageCredentials(username, password);

var blob = new CloudBlockBlob(new Uri(blob_url), credentials);
var mystream = blob.OpenRead();
using (StreamReader reader = new StreamReader(mystream))
{
    Console.WriteLine("Read file content: " + reader.ReadToEnd());
}

// You can also use another method like below:
//string text = blob.DownloadText();
//Console.WriteLine($"the text is: {text}");
The test result: the content of the .txt file is printed to the console.
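Alternatively, the newer Azure.Storage.Files.DataLake SDK can read the file through the dfs endpoint URL directly. A minimal sketch, assuming shared key authentication (account name/key and the file URL are placeholders):
// Read a file via the dfs endpoint with Azure.Storage.Files.DataLake.
var fileUri = new Uri("https://xxx.dfs.core.windows.net/mycontainer/myfolder/test.txt");
var credential = new StorageSharedKeyCredential("account-name", "account-key");
var fileClient = new DataLakeFileClient(fileUri, credential);

// Read() returns Response<FileDownloadInfo>; Content is the file stream.
var download = fileClient.Read();
using (var reader = new StreamReader(download.Value.Content))
{
    Console.WriteLine("Read file content: " + reader.ReadToEnd());
}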

Delete unflushed file from Azure Data Lake Gen 2

To upload a file to ADLS Gen2 you first need to:
do a PUT request with the ?resource=file parameter (this creates the file on ADLS)
append data to the file with the ?action=append&position=<N> parameters
lastly, flush the data with ?action=flush&position=<FILE_SIZE> (a rough sketch of these calls follows below)
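For reference, a rough sketch of those three calls with HttpClient (the SAS-authenticated URL, the payload and the names are placeholders; it only illustrates the create/append/flush sequence and assumes using System.Net.Http, System.Text and System.Threading.Tasks):
// Illustration of the create/append/flush sequence against the ADLS Gen2 Path REST API.
// fileUrl is assumed to already carry a valid SAS token, hence "&" for the extra parameters.
static async Task CreateAppendFlushAsync()
{
    var http = new HttpClient();
    string fileUrl = "https://account.dfs.core.windows.net/myfilesystem/folder/file.txt?sv=xxxx&sig=xxxx";
    byte[] data = Encoding.UTF8.GetBytes("hello");

    // 1. Create the (empty) file.
    await http.PutAsync(fileUrl + "&resource=file", new ByteArrayContent(Array.Empty<byte>()));

    // 2. Append data at position 0 (stays uncommitted until it is flushed).
    var append = new HttpRequestMessage(new HttpMethod("PATCH"), fileUrl + "&action=append&position=0")
    {
        Content = new ByteArrayContent(data)
    };
    await http.SendAsync(append);

    // 3. Flush (commit) the appended data; position = total number of bytes written.
    var flush = new HttpRequestMessage(new HttpMethod("PATCH"), fileUrl + $"&action=flush&position={data.Length}")
    {
        Content = new ByteArrayContent(Array.Empty<byte>()) // sends Content-Length: 0, required for flush
    };
    await http.SendAsync(flush);
}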
My question is:
Is there a way to tell the server how long the data should live if it is never flushed (written)?
Since you need to create a file first in order to write data into it, there might be scenarios where the flush does not happen, and you are stuck with an empty file in the data lake.
I could not find anything in the Microsoft documentation about this.
Any info would be appreciated.
Updated 0219:
If you only call the append API and never call the flush API, the uncommitted data is kept in Azure for up to 7 days.
The uncommitted data is deleted automatically after 7 days and cannot be deleted from your end.
Original:
The SDK for Azure Data Lake Storage Gen2 is ready, and you can use it to operate on ADLS Gen2 more easily than with the REST API.
If you're using .NET/C#, there is an SDK for Azure Data Lake Storage Gen2: Azure.Storage.Files.DataLake.
Here is the official doc for how to use this SDK to operate on ADLS Gen2, and the C# code below deletes a file / uploads a file in ADLS Gen2:
static void Main(string[] args)
{
    string accountName = "xxx";
    string accountKey = "xxx";
    StorageSharedKeyCredential sharedKeyCredential =
        new StorageSharedKeyCredential(accountName, accountKey);

    string dfsUri = "https://" + accountName + ".dfs.core.windows.net";
    DataLakeServiceClient dataLakeServiceClient = new DataLakeServiceClient(new Uri(dfsUri), sharedKeyCredential);

    DataLakeFileSystemClient fileSystemClient = dataLakeServiceClient.GetFileSystemClient("w22");
    DataLakeDirectoryClient directoryClient = fileSystemClient.GetDirectoryClient("t2");

    // Use this line of code to delete a file:
    //directoryClient.DeleteFile("22.txt");

    // Use the code below to upload a file:
    //DataLakeFileClient fileClient = directoryClient.CreateFile("22.txt");
    //FileStream fileStream = File.OpenRead("d:\\foo2.txt");
    //long fileSize = fileStream.Length;
    //fileClient.Append(fileStream, offset: 0);
    //fileClient.Flush(position: fileSize);

    Console.WriteLine("**completed**");
    Console.ReadLine();
}
For Java, refer to this doc.
For Python, refer to this doc.

Duplicating File Uploading Process - Asp.net WebApi

I created a web API that allows users to send files and upload them to Azure Storage. The way it works is: the client app connects to the API to send one or more files to the file upload controller, and the controller takes care of the rest, such as:
Upload the file to Azure Storage
Update the database
It works great, but I don't think it is the right way to do this, because now I can see there are two different processes:
Upload the file from the client's file system to my web API (server)
Upload the file to Azure Storage from the API (server)
It gives me the feeling that I am duplicating the upload process, as the same file first travels from the client (file system) to the API (server) and then to Azure (destination). I feel the need to show two progress bars to the client for file upload progress (from client to server, and then from server to Azure). That just doesn't make sense to me, and I feel my approach is incorrect.
My API accepts files of up to 250 MB, so you can imagine the overhead.
What do you guys think?
//// API Controller
if (!Request.Content.IsMimeMultipartContent("form-data"))
{
    throw new HttpResponseException(HttpStatusCode.UnsupportedMediaType);
}

var provider = new RestrictiveMultipartMemoryStreamProvider();
var contents = await Request.Content.ReadAsMultipartAsync(provider);
int Total_Files = contents.Contents.Count();

foreach (HttpContent ctnt in contents.Contents)
{
    await storageManager.AddBlob(ctnt);
}
////// Stream
#region StreamHelper
public class RestrictiveMultipartMemoryStreamProvider : MultipartMemoryStreamProvider
{
    public override Stream GetStream(HttpContent parent, HttpContentHeaders headers)
    {
        var extensions = new[] { "pdf", "doc", "docx", "cab", "zip" };
        var filename = headers.ContentDisposition.FileName.Replace("\"", string.Empty);

        if (filename.IndexOf('.') < 0)
            return Stream.Null;

        var extension = filename.Split('.').Last();
        return extensions.Any(i => i.Equals(extension, StringComparison.InvariantCultureIgnoreCase))
            ? base.GetStream(parent, headers)
            : Stream.Null;
    }
}
#endregion StreamHelper
///// AddBlob
public async Task<string> AddBlob(HttpContent _Payload)
{
    CloudStorageAccount cloudStorageAccount = KeyVault.AzureStorage.GetConnectionString();
    CloudBlobClient cloudBlobClient = cloudStorageAccount.CreateCloudBlobClient();
    CloudBlobContainer cloudBlobContainer = cloudBlobClient.GetContainerReference("SomeContainer");
    cloudBlobContainer.CreateIfNotExists();
    try
    {
        byte[] fileContentBytes = await _Payload.ReadAsByteArrayAsync();
        CloudBlockBlob blob = cloudBlobContainer.GetBlockBlobReference("SomeBlob");
        blob.Properties.ContentType = _Payload.Headers.ContentType.MediaType;
        blob.UploadFromByteArray(fileContentBytes, 0, fileContentBytes.Length);
        var B = await blob.CreateSnapshotAsync();
        B.FetchAttributes();
        return "Snapshot ETAG: " + B.Properties.ETag.Replace("\"", "");
    }
    catch (Exception X)
    {
        return ($"Error : " + X.Message);
    }
}
It gives me the feeling that I am duplicating the upload process, as the same file first travels from the client (file system) to the API (server) and then to Azure (destination).
I think you're correct. One possible solution would be to have your API generate a Shared Access Signature (SAS) token and return that SAS token/URI to the client whenever a client wishes to upload a file.
Using this SAS URI, your client can directly upload the file to Azure Storage without sending it to your API first. Once the file is uploaded successfully by the client, it can send a message to the API to update the database.
You can read more about SAS here: https://learn.microsoft.com/en-us/azure/storage/common/storage-dotnet-shared-access-signature-part-1.
I have also written a blog post a long time back on using SAS that you may find useful: https://gauravmantri.com/2013/02/13/revisiting-windows-azure-shared-access-signature/.
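A minimal sketch of such an endpoint, reusing the Microsoft.Azure.Storage.Blob types from the question's code (the action name, container name and expiry are hypothetical):
// Hypothetical API action: returns a short-lived, write-only SAS URI for a blob
// so the client can upload directly to Azure Storage.
[HttpGet]
public string GetUploadSasUrl(string fileName)
{
    CloudStorageAccount account = KeyVault.AzureStorage.GetConnectionString();
    CloudBlobContainer container = account.CreateCloudBlobClient().GetContainerReference("somecontainer");
    CloudBlockBlob blob = container.GetBlockBlobReference(fileName);

    var policy = new SharedAccessBlobPolicy
    {
        Permissions = SharedAccessBlobPermissions.Create | SharedAccessBlobPermissions.Write,
        SharedAccessExpiryTime = DateTimeOffset.UtcNow.AddMinutes(30)
    };

    // Blob URI + SAS query string = the URL the client uploads to.
    return blob.Uri + blob.GetSharedAccessSignature(policy);
}

// On the client, the SAS URL alone is enough to upload:
// var blob = new CloudBlockBlob(new Uri(sasUrl));
// blob.UploadFromFile(localFilePath);
// ...then call the API to update the database.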

Azure Storage Search Blobs by Metadata

I have CloudBlockBlobs that have metadata.
CloudBlockBlob blockBlob = container.GetBlockBlobReference("myblob.jpg");
using (var fileStream = System.IO.File.OpenRead(filePath))
{
    blockBlob.UploadFromStream(fileStream);
    blockBlob.Properties.ContentType = "image/jpg";
    blockBlob.Metadata.Add("Title", "Yellow Pear");
    blockBlob.SetProperties();
}
I see the Metadata is there:
Debug.WriteLine(blockBlob.Metadata["Title"]);
Now, later, if I query the blobs from storage, I see them but the metadata is missing:
(in the code below, I know blobItems[0] had metadata when it was uploaded, but now blobItems[0].Metadata.Count == 0)
var blobItems = container.ListBlobs(
null, false, BlobListingDetails.Metadata);
I also noticed the Metadata is not available when I obtain the blob by itself:
CloudBlockBlob a = container.GetBlockBlobReference("myblob.jpg");
//Below throws an exception
var b = a.Metadata["Title"];
Thank you!
There are some issues with your code :(.
The blob doesn't actually have any metadata set. After setting the metadata, you're calling the blob.SetProperties() method, which only sets the blob's properties (ContentType in your example). To set the metadata, you would actually need to call the blob.SetMetadata() method.
Your upload code is currently making 2 calls to the storage service: 1) upload blob and 2) set properties. If you call SetMetadata as well, it would be 3 calls. IMHO, these can be combined into just 1 call to the storage service by doing something like below:
using (var fileStream = System.IO.File.OpenRead(filePath))
{
    blockBlob.Properties.ContentType = "image/jpg";
    blockBlob.Metadata.Add("Title", "Yellow Pear");
    blockBlob.UploadFromStream(fileStream);
}
This will not only upload the blob but also set its properties and metadata in a single call to the storage service.
Regarding
I also noticed the Metadata is not available when I obtain the blob by
itself:
CloudBlockBlob a = container.GetBlockBlobReference("myblob.jpg");
//Below throws an exception
var b = a.Metadata["Title"];
Basically, the code above just creates an instance of the blob on the client side. It doesn't actually fetch the properties (and metadata) of the blob. To fetch details about the blob, you would need to call the FetchAttributes method on the blob. Something like:
CloudBlockBlob a = container.GetBlockBlobReference("myblob.jpg");
a.FetchAttributes();
If you retrieve the blob's metadata after that, you should be able to see it (provided the metadata was set properly).
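Once the metadata has actually been set, the original listing call from the question does return it. A small sketch, assuming the same container variable and System.Linq for OfType:
// After SetMetadata (or uploading with metadata already populated),
// listing with BlobListingDetails.Metadata returns the metadata inline.
foreach (var blobItem in container.ListBlobs(null, false, BlobListingDetails.Metadata).OfType<CloudBlockBlob>())
{
    string title = blobItem.Metadata.ContainsKey("Title") ? blobItem.Metadata["Title"] : "<no Title metadata>";
    Console.WriteLine($"{blobItem.Name}: {title}");
}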
