Delete unflushed file from Azure Data Lake Gen 2

To upload a file to ADLS Gen2 you first need to:
do a PUT request with the ?resource=file parameter (this creates an empty file in the data lake)
append data to the file with the ?action=append&position=<N> parameters
lastly, flush the data with ?action=flush&position=<FILE_SIZE>
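Put together, here is a minimal sketch of those three calls with HttpClient, assuming SAS token authentication; the account, file system, path and token below are placeholders, not values from the question:

using System;
using System.Net.Http;
using System.Text;
using System.Threading.Tasks;

class CreateAppendFlushSketch
{
    static async Task Main()
    {
        // hypothetical account/file system/path and SAS token - replace with your own
        string fileUrl = "https://myaccount.dfs.core.windows.net/myfilesystem/folder/file.txt";
        string sasToken = "sv=2018-03-28&ss=b&srt=sco&sp=rwdl&sig=xxxx"; // note: no leading '?'
        byte[] data = Encoding.UTF8.GetBytes("hello ADLS Gen2");

        using var client = new HttpClient();

        // 1. PUT ?resource=file creates an empty file in the data lake
        var create = await client.PutAsync($"{fileUrl}?resource=file&{sasToken}",
            new ByteArrayContent(Array.Empty<byte>()));
        create.EnsureSuccessStatusCode(); // 201 Created

        // 2. PATCH ?action=append&position=0 sends the bytes; they remain uncommitted
        var append = new HttpRequestMessage(HttpMethod.Patch, $"{fileUrl}?action=append&position=0&{sasToken}")
        {
            Content = new ByteArrayContent(data)
        };
        (await client.SendAsync(append)).EnsureSuccessStatusCode(); // 202 Accepted

        // 3. PATCH ?action=flush&position=<total length> commits everything appended so far
        var flush = new HttpRequestMessage(HttpMethod.Patch, $"{fileUrl}?action=flush&position={data.Length}&{sasToken}");
        (await client.SendAsync(flush)).EnsureSuccessStatusCode(); // 200 OK
    }
}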
My question is:
Is there a way to tell the server how long the data should live if it is never flushed (written)?
Since you need to create the file before you can write data into it, there may be scenarios where the flush never happens, and you are stuck with an empty file in the data lake.
I could not find anything about this in the Microsoft documentation.
Any info would be appreciated.

Updated 02/19:
If you just call the Append API but never call the Flush API, the uncommitted data is kept in Azure for 7 days.
The uncommitted data is deleted automatically after 7 days and cannot be deleted from your end before then.
Original:
The SDK for Azure Data Lake Storage Gen2 is now available, and you can use it to operate on ADLS Gen2 more easily than via the REST API.
If you're using .NET/C#, the SDK is Azure.Storage.Files.DataLake.
Here is the official doc for how to use this SDK with ADLS Gen2, and the C# code below deletes a file / uploads a file in ADLS Gen2:
static void Main(string[] args)
{
    string accountName = "xxx";
    string accountKey = "xxx";
    StorageSharedKeyCredential sharedKeyCredential =
        new StorageSharedKeyCredential(accountName, accountKey);

    string dfsUri = "https://" + accountName + ".dfs.core.windows.net";
    DataLakeServiceClient dataLakeServiceClient = new DataLakeServiceClient
        (new Uri(dfsUri), sharedKeyCredential);

    DataLakeFileSystemClient fileSystemClient = dataLakeServiceClient.GetFileSystemClient("w22");
    DataLakeDirectoryClient directoryClient = fileSystemClient.GetDirectoryClient("t2");

    // use this line of code to delete a file
    //directoryClient.DeleteFile("22.txt");

    // use the code below to upload a file
    //DataLakeFileClient fileClient = directoryClient.CreateFile("22.txt");
    //FileStream fileStream = File.OpenRead("d:\\foo2.txt");
    //long fileSize = fileStream.Length;
    //fileClient.Append(fileStream, offset: 0);
    //fileClient.Flush(position: fileSize);

    Console.WriteLine("**completed**");
    Console.ReadLine();
}
For Java, refer to this doc.
For Python, refer to this doc.

Related

How to Read a file from Azure Data Lake Storage with file Url?

Is there a way to read files from Azure Data Lake? I have the HTTP URL of the file and I want to read it directly. How can I achieve it? I don't see a way to do it via the SDK.
Thanks for your help.
Regards
Did you check the docs?
public async Task ListFilesInDirectory(DataLakeFileSystemClient fileSystemClient)
{
    IAsyncEnumerator<PathItem> enumerator =
        fileSystemClient.GetPathsAsync("my-directory").GetAsyncEnumerator();

    await enumerator.MoveNextAsync();
    PathItem item = enumerator.Current;

    while (item != null)
    {
        Console.WriteLine(item.Name);

        if (!await enumerator.MoveNextAsync())
        {
            break;
        }
        item = enumerator.Current;
    }
}
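On C# 8, a shorter sketch of the same listing: GetPathsAsync returns an AsyncPageable<PathItem>, which can be consumed with await foreach directly (same hypothetical "my-directory" as above):

public async Task ListFilesInDirectory(DataLakeFileSystemClient fileSystemClient)
{
    // AsyncPageable<PathItem> supports await foreach; no manual enumerator handling needed
    await foreach (PathItem item in fileSystemClient.GetPathsAsync("my-directory"))
    {
        Console.WriteLine(item.Name);
    }
}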
You can also use the ADLS Gen2 REST API.
For example, you can write code like the below with SAS token authentication (you can also use shared key authentication):
string sasToken = "?sv=2018-03-28&ss=b&srt=sco&sp=rwdl&st=2019-04-15T08%3A07%3A49Z&se=2019-04-16T08%3A07%3A49Z&sig=xxxx";
string url = "https://xxxx.dfs.core.windows.net/myfilesys1/app.JPG" + sasToken;
var req = (HttpWebRequest)WebRequest.CreateDefault(new Uri(url));
// set the Method according to the operation you need (see the REST API doc)
req.Method = "HEAD";
var res = (HttpWebResponse)req.GetResponse();
Blob APIs and Data Lake Storage Gen2 APIs can operate on the same data, so you can directly use the Azure Blob Storage SDK to read a file from ADLS Gen2.
First, install this NuGet package: Microsoft.Azure.Storage.Blob, version 11.1.6.
Note that in this case you should use the blob-style URL "https://xxx.blob.core.windows.net/mycontainer/myfolder/test.txt" instead of the dfs-style URL "https://xxx.dfs.core.windows.net/mycontainer/myfolder/test.txt".
Here is sample code that reads a .txt file in ADLS Gen2:
var blob_url = "https://xxx.blob.core.windows.net/mycontainer/myfolder/test.txt";
//var blob_url = "https://xxx.dfs.core.windows.net/mycontainer/myfolder/test.txt";

// here, username is the storage account name and password is the account key
var username = "xxxx";
var password = "xxxx";
StorageCredentials credentials = new StorageCredentials(username, password);

var blob = new CloudBlockBlob(new Uri(blob_url), credentials);
var mystream = blob.OpenRead();

using (StreamReader reader = new StreamReader(mystream))
{
    Console.WriteLine("Read file content: " + reader.ReadToEnd());
}

// you can also use another method like the below
//string text = blob.DownloadText();
//Console.WriteLine($"the text is: {text}");

How to get MD5 of file stored in ADLS Gen2?

I receive daily files through SFTP to an ADLS Gen2 storage account. I need to verify each file by checking the MD5 of the file stored in ADLS Gen2.
I tried using the Blob API, but currently it does not support ADLS Gen2. I was able to get Content-MD5 from the blob properties when the file is stored in Blob storage.
Can someone help with how to get the Content-MD5 from ADLS Gen2?
As of now the Blob API is not supported, as you know, but you can take a look at the Data Lake Storage Gen2 REST API -> Path - Get Properties, which can be used to fetch properties of files stored in ADLS Gen2.
Here is sample code (note that I use a SAS token appended to the API URL):
using System;
using System.Net;

namespace ConsoleApp3
{
    class Program
    {
        static void Main(string[] args)
        {
            string sasToken = "?sv=2018-03-28&ss=b&srt=sco&sp=rwdl&st=2019-04-15T08%3A07%3A49Z&se=2019-04-16T08%3A07%3A49Z&sig=xxxx";
            string url = "https://xxxx.dfs.core.windows.net/myfilesys1/app.JPG" + sasToken;

            var req = (HttpWebRequest)WebRequest.CreateDefault(new Uri(url));
            req.Method = "HEAD";
            var res = (HttpWebResponse)req.GetResponse();
            Console.WriteLine("the status code is: " + res.StatusCode);

            var headers = res.Headers;
            Console.WriteLine("the count of the headers is: " + headers.Count);
            Console.WriteLine("*********");
            Console.WriteLine();

            // list all the headers if you don't know the exact name of the property you need
            foreach (var h in headers.Keys)
            {
                Console.WriteLine(h.ToString());
            }
            Console.WriteLine("*********");
            Console.WriteLine();

            // take the Content-Type property for example
            var myheader = res.GetResponseHeader("Content-Type");
            Console.WriteLine($"the header Content-Type is: {myheader}");
            Console.ReadLine();
        }
    }
}
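To answer the original question: the same response should carry the MD5 hash in the Content-MD5 header, provided a hash was stored when the file was uploaded (an assumption to verify against your own files), so you can read it the same way as Content-Type above:

// Content-MD5 is only populated if a hash was stored when the file was uploaded
var contentMd5 = res.GetResponseHeader("Content-MD5");
Console.WriteLine($"the header Content-MD5 is: {contentMd5}");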
If you don't know how to generate a SAS token, you can navigate to the Azure portal -> your storage account -> Shared access signature and generate one there.
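You can also generate an account SAS in code; here is a sketch using CloudStorageAccount.GetSharedAccessSignature from the Microsoft.Azure.Storage.Common package (the permissions, services and lifetime below are assumptions, adjust them to your needs):

using System;
using Microsoft.Azure.Storage;

CloudStorageAccount account = CloudStorageAccount.Parse("your_connection_string_here");

var policy = new SharedAccessAccountPolicy
{
    Permissions = SharedAccessAccountPermissions.Read | SharedAccessAccountPermissions.List,
    Services = SharedAccessAccountServices.Blob,
    ResourceTypes = SharedAccessAccountResourceTypes.Service
                  | SharedAccessAccountResourceTypes.Container
                  | SharedAccessAccountResourceTypes.Object,
    SharedAccessExpiryTime = DateTimeOffset.UtcNow.AddHours(24)
};

// the returned token already starts with "?", ready to append to the request URL
string sasToken = account.GetSharedAccessSignature(policy);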

What is the best way to upload a large number of files to azure file storage?

I want File storage specifically, not Blob storage (I think). This is code for my Azure Function, and I just have a bunch of stuff in my node_modules folder.
What I would like to do is zip the entire app, upload that single zip, and have Azure unpack it into a given folder. Is this possible?
Right now I'm essentially iterating over all of my files and calling:
var fileStream = new stream.Readable();
fileStream.push(myFileBuffer);
fileStream.push(null);

fileService.createFileFromStream('taskshare', 'taskdirectory', 'taskfile', fileStream, myFileBuffer.length, function(error, result, response) {
    if (!error) {
        // file uploaded
    }
});
And this works, it's just too slow. So I'm wondering if there is a faster way to upload a bunch of files for use in apps.
If the Microsoft Azure Storage Data Movement Library is acceptable, please give it a try. The Microsoft Azure Storage Data Movement Library is designed for high-performance uploading, downloading, and copying of Azure Storage blobs and files. This library is based on the core data movement framework that powers AzCopy.
You can also get the demo code from the GitHub documentation.
string storageConnectionString = "myStorageConnectionString";
CloudStorageAccount account = CloudStorageAccount.Parse(storageConnectionString);
CloudBlobClient blobClient = account.CreateCloudBlobClient();
CloudBlobContainer blobContainer = blobClient.GetContainerReference("mycontainer");
blobContainer.CreateIfNotExists();

string sourcePath = "path\\to\\test.txt";
CloudBlockBlob destBlob = blobContainer.GetBlockBlobReference("myblob");

// Setup the number of concurrent operations
TransferManager.Configurations.ParallelOperations = 64;

// Setup the transfer context and track the upload progress
SingleTransferContext context = new SingleTransferContext();
context.ProgressHandler = new Progress<TransferStatus>((progress) =>
{
    Console.WriteLine("Bytes uploaded: {0}", progress.BytesTransferred);
});

// Upload a local blob
var task = TransferManager.UploadAsync(
    sourcePath, destBlob, null, context, CancellationToken.None);
task.Wait();
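Since the question is about many files, the same library can also transfer an entire local folder in one call. Below is a sketch assuming the directory overload of TransferManager (UploadDirectoryAsync) and reusing the share/directory names from the question; verify the overload against the Data Movement version you install:

CloudStorageAccount account = CloudStorageAccount.Parse(storageConnectionString);
CloudFileShare share = account.CreateCloudFileClient().GetShareReference("taskshare");
share.CreateIfNotExists();
CloudFileDirectory destDir = share.GetRootDirectoryReference().GetDirectoryReference("taskdirectory");
destDir.CreateIfNotExists();

// Setup the number of concurrent operations, then upload the whole folder at once
TransferManager.Configurations.ParallelOperations = 64;
var options = new UploadDirectoryOptions { Recursive = true };
var context = new DirectoryTransferContext();
context.ProgressHandler = new Progress<TransferStatus>((progress) =>
{
    Console.WriteLine("Files uploaded: {0}", progress.NumberOfFilesTransferred);
});

await TransferManager.UploadDirectoryAsync("path\\to\\local\\folder", destDir, options, context);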

Azure Datalake Store Create an empty file with C#

How do I create an empty file on Azure Data Lake with C#? In the thread Create File From Azure Data Lake Store .NET SDK it's mentioned to use FileSystemOperationsExtensions.Create, but how do I use this to create an empty file?
Below is a sample snippet that accomplishes this (passing a null stream to Create leaves an empty file):
// to find out how to get the credentials...
// see: https://github.com/Azure-Samples/data-lake-analytics-dotnet-auth-options
var adlCreds = ...;

string adls = "youradlsaccountname";
var adlsFileSystemClient = new DataLakeStoreFileSystemManagementClient(adlCreds);

adlsFileSystemClient.FileSystem.Create(adls, "/emptyfile.dat", null, null, null, null, null);
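Note that this snippet uses the Gen1 (Data Lake Store) management SDK. On ADLS Gen2 with the Azure.Storage.Files.DataLake SDK from the first answer, a sketch of the equivalent is a single call on a directory client (names reused from that answer):

// CreateFile without a following Append/Flush leaves an empty, committed file
DataLakeFileClient emptyFile = directoryClient.CreateFile("emptyfile.dat");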

How to copy an Azure File to an Azure Blob?

I've got some files sitting in an Azure File storage.
I'm trying to programmatically archive them to Azure Blobs and I'm not sure how to do this efficiently.
I keep seeing code samples about copying from one blob container to another blob container, but not from a File to a Blob.
Is it possible to do this without downloading the entire File content locally and then uploading it? Maybe using URIs or something?
More Info:
The File and Blob containers are in the same Storage account.
The storage account is RA-GRS
Here is some sample code I was thinking of using, but it just doesn't feel right :( (pseudo code, with validation and checks omitted).
var file = await ShareRootDirectory.GetFileReference(fileName);

using (var stream = new MemoryStream())
{
    await file.DownloadToStreamAsync(stream);

    // Custom method that basically does:
    // 1. GetBlockBlobReference
    // 2. UploadFromStreamAsync
    await cloudBlob.AddItemAsync("some-container", stream);
}
You can use CloudBlockBlob.StartCopy(CloudFile); the CloudFile type is accepted as the source by the CloudBlockBlob.StartCopy function.
For how to copy a CloudFile to a blob, please refer to the document. The following demo code is a snippet from that document.
// Parse the connection string for the storage account.
CloudStorageAccount storageAccount = CloudStorageAccount.Parse(
    Microsoft.Azure.CloudConfigurationManager.GetSetting("StorageConnectionString"));

// Create a CloudFileClient object for credentialed access to File storage.
CloudFileClient fileClient = storageAccount.CreateCloudFileClient();

// Create a new file share, if it does not already exist.
CloudFileShare share = fileClient.GetShareReference("sample-share");
share.CreateIfNotExists();

// Create a new file in the root directory.
CloudFile sourceFile = share.GetRootDirectoryReference().GetFileReference("sample-file.txt");
sourceFile.UploadText("A sample file in the root directory.");

// Get a reference to the blob to which the file will be copied.
CloudBlobClient blobClient = storageAccount.CreateCloudBlobClient();
CloudBlobContainer container = blobClient.GetContainerReference("sample-container");
container.CreateIfNotExists();
CloudBlockBlob destBlob = container.GetBlockBlobReference("sample-blob.txt");

// Create a SAS for the file that's valid for 24 hours.
// Note that when you are copying a file to a blob, or a blob to a file, you must use a SAS
// to authenticate access to the source object, even if you are copying within the same
// storage account.
string fileSas = sourceFile.GetSharedAccessSignature(new SharedAccessFilePolicy()
{
    // Only read permissions are required for the source file.
    Permissions = SharedAccessFilePermissions.Read,
    SharedAccessExpiryTime = DateTime.UtcNow.AddHours(24)
});

// Construct the URI to the source file, including the SAS token.
Uri fileSasUri = new Uri(sourceFile.StorageUri.PrimaryUri.ToString() + fileSas);

// Copy the file to the blob.
destBlob.StartCopy(fileSasUri);
Note:
If you are copying a blob to a file, or a file to a blob, you must use a shared access signature (SAS) to authenticate the source object, even if you are copying within the same storage account.
Use the Transfer Manager:
https://msdn.microsoft.com/en-us/library/azure/microsoft.windowsazure.storage.datamovement.transfermanager_methods.aspx
There are methods to copy from CloudFile to CloudBlob.
Add the "Microsoft.Azure.Storage.DataMovement" NuGet package
using Microsoft.WindowsAzure.Storage.Blob;
using Microsoft.WindowsAzure.Storage.File;
using Microsoft.WindowsAzure.Storage.DataMovement;

private string _storageConnectionString = "your_connection_string_here";

public async Task CopyFileToBlob(string blobContainer, string blobPath, string fileShare, string fileName)
{
    CloudStorageAccount storageAccount = CloudStorageAccount.Parse(_storageConnectionString);

    CloudFileShare cloudFileShare = storageAccount.CreateCloudFileClient().GetShareReference(fileShare);
    CloudFile source = cloudFileShare.GetRootDirectoryReference().GetFileReference(fileName);

    // local variable renamed so it does not collide with the blobContainer parameter
    CloudBlobContainer container = storageAccount.CreateCloudBlobClient().GetContainerReference(blobContainer);
    CloudBlockBlob target = container.GetBlockBlobReference(blobPath);

    await TransferManager.CopyAsync(source, target, true);
}
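A hypothetical call, with placeholder container, path, share and file names:

await CopyFileToBlob("sample-container", "archive/sample-file.txt", "sample-share", "sample-file.txt");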
