I'm using the Microsoft.Azure.Storage.DataMovement nuget package to transfer multiple, very large (150GB) files into Azure cold storage using
TransferManager.UploadDirectoryAsync
It works very well, but a choke point in my process is that after upload I attach to the FileTransferred event and read the local file all over again to calculate its MD5 checksum and compare it to the remote copy:
private void FileTransferredCallback(object sender, TransferEventArgs e)
{
    var sourceFile = e.Source.ToString();
    var destinationFile = (ICloudBlob)e.Destination;
    var localMd5 = CalculateMd5(sourceFile);
    var remoteMd5 = destinationFile.Properties.ContentMD5;
    if (localMd5 == remoteMd5)
    {
        destinationFile.Metadata.Add(Md5VerifiedKey, DateTimeOffset.UtcNow.ToDisplayText());
        destinationFile.SetMetadata();
    }
}
It is slower than it needs to be since every file is handled twice: first by the library, then by my MD5 check.
Is this check even necessary or is the library already doing the heavy lifting for me? I can see Md5HashStream but after quickly looking through the source it isn't clear to me if it is being used to verify the entire remote file.
Note that blob.Properties.ContentMD5 of the entire blob is actually set by the Microsoft.Azure.Storage.DataMovement library from its own local calculation after uploading all the blocks of the blob; it is not computed by the Azure Storage Blob Service.
The data integrity of blob uploads is guaranteed by the Content-MD5 HTTP header sent with every single Put Block request, not by the blob.Properties.ContentMD5 metadata of the entire blob, since the Azure Storage Blob Service does not actually validate that value when the Microsoft.Azure.Storage.DataMovement library sets it (see the description of the x-ms-blob-content-md5 HTTP header).
The main purpose of blob.Properties.ContentMD5 is to verify data integrity when downloading the blob back to local disk via the Microsoft.Azure.Storage.DataMovement library (provided DownloadOptions.DisableContentMD5Validation is left at its default of false).
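For example, a minimal sketch of a download that relies on that check, assuming the TransferManager.DownloadAsync overload that accepts DownloadOptions and a SingleTransferContext (the destination path is illustrative):

var downloadOptions = new DownloadOptions { DisableContentMD5Validation = false }; // false is already the default
var transferContext = new SingleTransferContext();
// The library recomputes the MD5 of the downloaded bytes and compares it to blob.Properties.ContentMD5.
await TransferManager.DownloadAsync(cloudBlob, @"C:\restore\bigfile.bin", downloadOptions, transferContext);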
Is this check even necessary or is the library already doing the heavy lifting for me?
Based on my knowledge, we just need to check whether the blob has a value for the ContentMD5 property.
When Microsoft.Azure.Storage.DataMovement uploads a large file, the upload actually consists of multiple Put Block requests plus one Put Block List request. Each Put Block request uploads only part of the content, so the MD5 in such a request covers only that block and cannot serve as the final MD5 value of the blob.
The body of the Put Block List request is the list of identifiers of all the blocks uploaded above, so the MD5 value of that request can only verify the integrity of the list itself.
When all of these requests are validated, the integrity of the content is guaranteed. For performance reasons, the Storage service does not read back the contents of all the blocks from the earlier requests to compute an MD5 value for the entire blob; instead it provides a special request header, x-ms-blob-content-md5, and sets the blob's MD5 property from that header's value. So as long as the client sets the MD5 of the entire content in x-ms-blob-content-md5 on the final Put Block List request, the verification chain is intact and the blob also ends up with an MD5 value.
So the MD5-based integrity workflow for a block upload is:
The file to be uploaded is divided into blocks.
Each block is sent as a Put Block request, with the MD5 of that block's content set in the Content-MD5 header.
After all the blocks have been sent, the Put Block List request is sent:
The MD5 of the entire uploaded file is calculated and set in the x-ms-blob-content-md5 header.
The list of identifiers of the previously sent blocks forms the body of the request.
The MD5 of that block ID list is set in the Content-MD5 header.
The service then assigns the x-ms-blob-content-md5 value from the Put Block List request to the blob's MD5 property.
In summary, for a block upload the check boils down to whether x-ms-blob-content-md5 (surfaced as blob.Properties.ContentMD5) has a value.
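Under that assumption, the FileTransferred callback from the question can become a simple presence check instead of re-reading 150 GB from disk. A sketch only, reusing Md5VerifiedKey and ToDisplayText from the question's own code:

private void FileTransferredCallback(object sender, TransferEventArgs e)
{
    var destinationBlob = (ICloudBlob)e.Destination;

    // The library has already computed ContentMD5 locally, and each block was
    // protected in transit by its own Content-MD5 header, so it is enough to
    // confirm the property was populated.
    if (!string.IsNullOrEmpty(destinationBlob.Properties.ContentMD5))
    {
        destinationBlob.Metadata.Add(Md5VerifiedKey, DateTimeOffset.UtcNow.ToDisplayText());
        destinationBlob.SetMetadata();
    }
}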
I am creating a pipeline in Azure Data Factory where I am using a Function App as one of the activities to transform data and store it in an append blob container in CSV format. Since I have 50 batches in a for loop, my Function App runs 50 times, once per order. I am appending the header to the CSV file with the logic below.
// First I create the file as per business logic
// csveventcontent is my source data
var dateAndTime = DateTime.Now.AddDays(-1);
string FileDate = dateAndTime.ToString("ddMMyyyy");
string FileName = _config.ContainerName + FileDate + ".csv";
StringBuilder csveventcontent = new StringBuilder();
OrderEventService obj = new OrderEventService();
// Now I check whether today's file exists and, if it doesn't, create it.
if (await appBlob.ExistsAsync() == false)
{
    await appBlob.CreateOrReplaceAsync(); // CreateOrReplace();
    // Append header
    csveventcontent.AppendLine(obj.GetHeader());
}
Now the problem is that the header is appended many times to the CSV file, and sometimes it is not at the top, probably because the function app runs 50 times in parallel.
How can I make sure the header is written only once, at the top?
I have tried with Data Flow and Logic Apps as well but was unable to do it. If it can be handled through code, that would be easier I guess.
I think you are right there. It's the concurrency of the function app that is causing the problem. The best approach would be to use a queue and process the messages one by one. Alternatively, you could use a distributed lock to ensure only one function writes to the file at a time; blob leases work well for this, as sketched below.
The Lease Blob operation creates and manages a lock on a blob for write and delete operations. The lock duration can be 15 to 60 seconds, or can be infinite.
Refer: Lease Blob Request Headers
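A minimal sketch of the lease approach with the same SDK the question already uses (CloudAppendBlob from the Microsoft.Azure.Storage / WindowsAzure.Storage packages); retry and back-off for the case where the lease is already held are omitted:

// Acquire a 30-second lease so only one function instance writes at a time.
string leaseId = await appBlob.AcquireLeaseAsync(TimeSpan.FromSeconds(30), null);
try
{
    var leaseCondition = AccessCondition.GenerateLeaseCondition(leaseId);
    // Write the header (first writer only) plus the rows while holding the lease.
    await appBlob.AppendTextAsync(csveventcontent.ToString(), Encoding.UTF8, leaseCondition, null, null);
}
finally
{
    await appBlob.ReleaseLeaseAsync(AccessCondition.GenerateLeaseCondition(leaseId));
}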
I currently have a Timer triggered Azure Function that checks a data endpoint to determine if any new data has been added. If new data has been added, then I generate an output blob (which I return).
However, returning output appears to be mandatory: although I'd only like to generate an output blob under specific conditions, I have to do it every time, clogging up my storage.
Is there any way to generate output only under specified conditions?
If you have the blob output binding set to your return value, but you do not want to generate a blob, simply return null to ensure the blob is not created.
You're free to execute whatever logic you want in your functions. You may need to remove the output binding from your function (this is what is making the output required) and construct the connection to blob storage in your function instead. Then you can conditionally create and save the blob.
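A minimal sketch of that approach, using the Microsoft.Azure.Storage blob client directly instead of an output binding; HasNewDataAsync and BuildPayload are hypothetical stand-ins for your endpoint check and output generation:

[FunctionName("CheckEndpoint")]
public static async Task Run([TimerTrigger("0 */5 * * * *")] TimerInfo timer, ILogger log)
{
    if (!await HasNewDataAsync())   // hypothetical check against the data endpoint
        return;                     // nothing new, so no blob is written

    var account = CloudStorageAccount.Parse(Environment.GetEnvironmentVariable("AzureWebJobsStorage"));
    var container = account.CreateCloudBlobClient().GetContainerReference("output");
    await container.CreateIfNotExistsAsync();

    var blob = container.GetBlockBlobReference($"data-{DateTime.UtcNow:yyyyMMddHHmmss}.json");
    await blob.UploadTextAsync(BuildPayload()); // hypothetical payload builder
}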
Hi guys, I am building a client which interacts with the Azure Storage REST API.
I was going through the documentation https://learn.microsoft.com/ru-ru/rest/api/storageservices/fileservices/list-containers2:
I didn't understand the use of the prefix and marker parameters that can be sent along with the request.
It says:
prefix
Optional. Filters the results to return only containers whose name
begins with the specified prefix.
marker
Optional. A string value that identifies the portion of the list of
containers to be returned with the next listing operation. The
operation returns the NextMarker value within the response body if the
listing operation did not return all containers remaining to be listed
with the current page. The NextMarker value can be used as the value
for the marker parameter in a subsequent call to request the next page
of list items.
The marker value is opaque to the client.
With prefix, I think:
If I have this directory structure:
file01.txt
images/image01.jpg
images/folder/image001.jpg
fightVideo/subFolder/current/video001.mpg
fightVideo/subFolder/current/video002.mpg
If I give the prefix as "fight", it should return fightVideo.
But I am not sure.
And for marker, I don't understand what its use is.
Please can someone explain the use of Prefix and Marker with examples?
In context of listing containers, if you specify prefix parameter it will list the containers names of which start with that prefix value. It has nothing to do with listing blobs.
List blobs operation also supports this prefix parameter and when you specify this parameter, it will list the blobs names of which start with that prefix value.
So the example you have given is for listing blobs: when you specify fight as the prefix there, you will get back fightVideo/subFolder/current/video001.mpg and fightVideo/subFolder/current/video002.mpg in the response, but not when you call List Containers with this prefix.
Regarding marker, Kalyan's explanation is correct but let me add a little bit more to that.
Essentially Azure Storage Service is a shared service and you simply can't ask it to return all the results in one go (if we were to take an analogy from SQL world, you simply can't do SELECT * FROM TABLE kind of thing). Each request to the service is assigned a predefined timeout and the response would include the items fetched in that time + optionally a token if the service thinks that there's more data available. This token is called continuation token. In order to get the next set of items, you would need to pass this continuation token in the marker parameter in your next request.
Each call to storage service will try to return a predefined maximum number of items. For listing blob containers/blobs, this limit is 5000 items. For listing tables/entities, this limit is 1000 items. If there are more items in your account, then apart from this data storage service returns you a continuation token which tells you that there's more data available.
Please note that even though the limit is there, you can't always assume that you will get that many records. Based on a number of conditions, it is quite possible that you get back no data at all but still receive a continuation token, so your code needs to handle this condition as well.
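For example, a minimal paging loop over containers with the Microsoft.Azure.Storage.Blob client, where blobClient is assumed to be an existing CloudBlobClient:

BlobContinuationToken token = null;
do
{
    // "fight" is the prefix; the token is the marker returned by the previous page (null for the first call).
    ContainerResultSegment segment = await blobClient.ListContainersSegmentedAsync("fight", token);
    foreach (CloudBlobContainer container in segment.Results)
        Console.WriteLine(container.Name);

    // A null continuation token means there are no more pages.
    token = segment.ContinuationToken;
} while (token != null);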
If there are too many containers to be listed, the response contains the NextMarker element.
<?xml version="1.0" encoding="utf-8"?>
<EnumerationResults ServiceEndpoint="https://myaccount.blob.core.windows.net">
<Prefix>string-value</Prefix>
<Marker>string-value</Marker>
<MaxResults>int-value</MaxResults>
<Containers>
<Container>
<Name>container-name</Name>
<Properties>
<Last-Modified>date/time-value</Last-Modified>
<Etag>etag</Etag>
<LeaseStatus>locked | unlocked</LeaseStatus>
<LeaseState>available | leased | expired | breaking | broken</LeaseState>
<LeaseDuration>infinite | fixed</LeaseDuration>
<PublicAccess>container | blob</PublicAccess>
</Properties>
<Metadata>
<metadata-name>value</metadata-name>
</Metadata>
</Container>
</Containers>
<NextMarker>marker-value</NextMarker>
</EnumerationResults>
The REST API documentation mentions that the marker value can be used in a subsequent call to request the next set of list items.
You can imagine the marker as a pagination index.
I am able to upload a file to an Azure blob using the REST API provided by Azure.
I want to set metadata at the time of the Put Blob request, but when I set it in the headers as shown here, I am unable to upload the file and get the following exception: org.apache.http.client.ClientProtocolException,
thrown from the last line of the code below.
HttpPut req = new HttpPut(uri);
req.setHeader("x-ms-blob-type", blobType);
req.setHeader("x-ms-date", date);
req.setHeader("x-ms-version", storageServiceVersion);
req.setHeader("x-ms-meta-Cat", user);
req.setHeader("Authorization", authorizationHeader);
HttpEntity entity = new InputStreamEntity(is,blobLength);
req.setEntity(entity);
HttpResponse response = httpClient.execute(req);
Regarding the same, I have two questions:
Can setting different metadata avoid overwriting of the file? See my question about that here.
If yes to the first question, how do I set metadata in the REST request to put a blob into Azure?
Please help.
So a few things are going on here.
Regarding the error you're getting, it is because you're not adding your metadata header when calculating the authorization header. Please read the Constructing the Canonicalized Headers String section here: http://msdn.microsoft.com/en-us/library/windowsazure/dd179428.aspx.
Based on this, you would need to change the following line of code (from your blog post):
String canonicalizedHeaders = "x-ms-blob-type:"+blobType+"\nx-ms-date:"+date+"\nx-ms-version:"+storageServiceVersion;
to
String canonicalizedHeaders = "x-ms-blob-type:"+blobType+"\nx-ms-date:"+date+"\nx-ms-meta-cat:"+user+"\nx-ms-version:"+storageServiceVersion;
(Note: I have just made these changes in Notepad, so they may not work; among other things, the canonicalized x-ms-* headers must appear in lexicographic order. Please go to the link I mentioned above for correctly creating the canonicalized headers string.)
can setting different metadata, avoid overwriting of file?
Not sure what you mean by this. You can update the metadata of a blob by performing a Set Blob Metadata operation on the blob.
This issue in a nutshell:
A block blob can be created with a single PUT request. This will create a blob with committed content but the blob will not have any committed blocks!
This means that you cannot assume that the concatenation of committed blocks is the same as the committed content.
When working with block blobs you'll have to pay extra attention to blobs with empty block lists, because such blobs may or may not be empty!
The original question:
One of our storage blobs in an Azure account has an empty block list, although it is non-empty.
I'm retrieving the block list like this (C#):
foreach (var block in _cloudBlob.DownloadBlockList(
BlockListingFilter.Committed,
AccessCondition.GenerateLeaseCondition(_leaseId)))
{
// ...
}
The code in the foreach block is NOT executed. The returned list is empty.
However, the blob reports that it has a non-zero length when I check: _cloudBlob.Properties.Length
I can also download the blob and see that it is not empty.
Am I missing something? How can the block list be empty when the blob is not?!
It does not matter whether I use BlockListingFilter.Committed, BlockListingFilter.Uncommitted or BlockListingFilter.All; the list is still empty!
UPDATE
I have copied this blob to a public container so that this issue can be reproduced by anyone.
Here's how to reproduce what I'm unable to understand:
First get blob properties from Azure using the REST API:
HEAD http://dfdev.blob.core.windows.net/pub/test HTTP/1.1
Host: dfdev.blob.core.windows.net
Response:
HTTP/1.1 200 OK
Content-Length: 66
Content-Type: application/octet-stream
Last-Modified: Sat, 02 Feb 2013 09:37:19 GMT
ETag: 0x8CFCF40075A5F31
Server: Windows-Azure-Blob/1.0 Microsoft-HTTPAPI/2.0
x-ms-request-id: 4b149a7e-2fcd-4ab4-8d53-12ef047cbfa1
x-ms-version: 2009-09-19
x-ms-lease-status: unlocked
x-ms-blob-type: BlockBlob
Date: Sat, 02 Feb 2013 09:40:54 GMT
The response headers tell us that this is a block blob and that it has a length of 66 bytes.
Now retrieve the block list from:
http://dfdev.blob.core.windows.net/pub/test?comp=blocklist
Response body:
<?xml version="1.0" encoding="utf-8"?><BlockList><CommittedBlocks /></BlockList>
So, the blob does not have any committed blocks, still it has a length of 66 bytes!
Is this a bug or have I misunderstood something?
Please help me out!
UPDATE 2
I've found that if I upload the blob like this:
container.GetBlockBlobReference("put-only")
.UploadFromStream(File.OpenRead("test-blob"));
...then a single PUT request is sent to Azure and the blob gets an empty block list (just like above).
However, if I upload the blob like this:
var blob = container.GetBlockBlobReference("put-block");
string blockId = Convert.ToBase64String(Guid.NewGuid().ToByteArray());
blob.PutBlock(blockId, File.OpenRead("test-blob"), null);
blob.PutBlockList(new string[] { blockId });
...then two requests are sent to Azure (one for putting the block and another for putting the block list).
The second blob gets a non-empty block list.
Why won't a single PUT yield a block list?
Can't we rely on the concatenation of a blob's committed blocks being equal to the blob's actual content?!
If not, how shall we determine when the block list is OK and when it's not??
UPDATE 3
I've implemented a workaround that I think suffices for the case in which we encountered this problem: if we discover an empty block list AND a blob length greater than zero, we assume that everything is OK (although it really isn't) and rewrite that data using Put Block and Put Block List at the next opportunity (see the sketch below).
However, although this will do the trick in our case, it is still very confusing that a non-empty block blob can have an empty list of committed blocks!!
Is this by-design in Azure? Can anyone explain what's going on?
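For reference, a minimal sketch of the workaround described above, using the same client library as the earlier snippets (error handling and lease conditions omitted):

// If the committed block list is empty but the blob has content, assume it was
// uploaded with a single Put Blob and rewrite it as blocks.
blob.FetchAttributes();
var committedBlocks = blob.DownloadBlockList(BlockListingFilter.Committed).ToList();
if (committedBlocks.Count == 0 && blob.Properties.Length > 0)
{
    using (var buffer = new MemoryStream())
    {
        blob.DownloadToStream(buffer);
        buffer.Position = 0;
        string blockId = Convert.ToBase64String(Guid.NewGuid().ToByteArray());
        blob.PutBlock(blockId, buffer, null);
        blob.PutBlockList(new[] { blockId });
    }
}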
UPDATE 4
Microsoft confirmed this issue on the MSDN forums too. Quote from Allen Chen:
I've confirmed with the product team. This is a normal behavior. The x-ms-blob-content-length header is the size of the committed blob. In your case you use Put Blob API so all content is uploaded in a single API and is committed in the same request. As a result in the Get Block List API's response you see the x-ms-blob-content-length header has value of 66 which means the committed blob size.
We have been aware of the issue that the MSDN document of the Get Block List API is not quite clear on this and will work on it.
As you also identified with your tests, querying the list of blocks of a block blob uploaded using Put Blob will return an empty list. This is by design.
UploadFromStream API in the Storage Client Library makes a couple of checks before deciding whether to upload a blob using a single Put Blob operation or a sequence of Put Block operations followed by a Put Block List. One property that changes this behavior is SingleBlobUploadThresholdInBytes.
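A minimal sketch of forcing the block path by lowering that threshold (the property lives on BlobRequestOptions in the WindowsAzure.Storage-era client; its exact location varies between SDK versions):

var requestOptions = new BlobRequestOptions
{
    // Anything larger than this is uploaded as Put Block + Put Block List,
    // so the blob ends up with a committed block list. 1 MB is the minimum allowed value.
    SingleBlobUploadThresholdInBytes = 1 * 1024 * 1024
};

var blob = container.GetBlockBlobReference("put-block-forced");
blob.UploadFromStream(File.OpenRead("test-blob"), null, requestOptions);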