Azure DML Data slices?

I have 40 million blobs totaling about 10 TB in blob storage. I am using the DML CopyDirectory method to copy them into another storage account for backup purposes. It took nearly two weeks to complete. Now I am wondering up to which date the blobs were copied to the target directory: is it the date the job started or the date the job finished?
Does DML use anything like data slices?

Now I am wondering up to which date the blobs were copied to the target directory. Is it the date the job started or the date the job finished?
As far as I know, when you start the CopyDirectory method, it just sends requests that tell the Azure storage account to copy the files from the other storage account. The copy operation itself runs entirely inside Azure Storage.
When the method starts copying the directory, Azure Storage first creates each target file with a size of 0, as in the screenshot below:
After the job finishes, you will find the size has changed, as below:
So the result is: once the job has started, the file is created in the target directory, but its size is 0. You can see the file's last-modified time in the first image.
Azure Storage then keeps copying the file content to the target directory.
When the copy of a file finishes, its last-modified time is updated.
So the DML SDK just tells the storage service to copy the files, and then keeps sending requests to Azure Storage to check each file's copy status, like below:
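For reference, a minimal sketch of kicking off such a server-side directory copy with the Data Movement library. This assumes a recent Microsoft.Azure.Storage.DataMovement package (older versions take a bool isServiceCopy flag instead of the CopyMethod enum); the connection strings and container names are placeholders.

using System;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Azure.Storage;
using Microsoft.Azure.Storage.Blob;
using Microsoft.Azure.Storage.DataMovement;

class BackupCopy
{
    static async Task Main()
    {
        // Placeholder connection strings and container names.
        var sourceAccount = CloudStorageAccount.Parse("<source-connection-string>");
        var targetAccount = CloudStorageAccount.Parse("<target-connection-string>");

        CloudBlobDirectory sourceDir = sourceAccount.CreateCloudBlobClient()
            .GetContainerReference("source-container").GetDirectoryReference("");
        CloudBlobDirectory targetDir = targetAccount.CreateCloudBlobClient()
            .GetContainerReference("backup-container").GetDirectoryReference("");

        var options = new CopyDirectoryOptions { Recursive = true };
        var context = new DirectoryTransferContext();
        context.FileTransferred += (s, e) =>
            Console.WriteLine($"Copied: {e.Source} -> {e.Destination}");

        // Server-side (async) copy: DML only issues the copy requests and then
        // polls each blob's copy status, as described above.
        TransferStatus status = await TransferManager.CopyDirectoryAsync(
            sourceDir, targetDir,
            CopyMethod.ServiceSideAsyncCopy,
            options, context, CancellationToken.None);

        Console.WriteLine($"Files transferred: {status.NumberOfFilesTransferred}");
    }
}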
Thanks. But what happens to files added to the source directory during this copy operation? Do the new files get copied to the target directory as well?
The short answer is yes, they can be, but it depends on where their names fall in the listing order.
DML won't get the whole blob list and send requests to copy every file at once.
It first gets a portion of your file-name list and sends requests that tell the storage service to copy those files.
The list is sorted by file name.
For example, suppose DML has already copied the files whose names start with '0', as in the target blob folder screenshot below.
If you then add another file whose name starts with '0' to your source folder, it will not be copied, because the listing has already moved past that point.
Screenshots: the source blob folder, and the target blob folder after the copy completed.
If you add a file at the end of your blob folder (a name that sorts after the point DML has scanned), it will be copied to the new folder.
So during those two weeks at least a million blobs must have been added to the container, with very random names. Does that mean DML doesn't work in the case of large containers?
As far as I know, DML is designed for high-performance uploading, downloading and copying of Azure Storage blobs and files.
When you use the DML CopyDirectoryAsync to copy blob files, it first sends a request to list the folder's current files, and then sends the requests to copy them.
By default, each listing request returns 250 file names.
After getting a page of results it receives a marker, which is the continuation point for the next page of blob names. It then lists the next page of names in the folder and starts copying again.
Also, by default the .NET HTTP connection limit is 2, which means only two concurrent connections can be maintained.
This means that if you don't raise the .NET HTTP connection limit, CopyDirectoryAsync will only fetch about 500 records (2 connections x 250 names) before it starts copying.
After those copies complete, the operation moves on to the next page of files.
You can see this in the screenshots below; the second one shows the marker.
I suggest you first raise the maximum number of HTTP connections so that more blob files are detected:
ServicePointManager.DefaultConnectionLimit = Environment.ProcessorCount * 8;
Besides that, I suggest you create multiple folders to store the files.
For example, you could create a folder that stores one week's files.
Next week, you start a new folder.
Then you can back up the old folder's files without new files being added to it while the copy runs.
Finally, you could also write your own code to achieve this: first get the list of the folder's files.
The maximum number of results returned by one list request is 5,000.
Then send a request to tell the storage service to copy each file.
Any file uploaded to the folder after you took the listing will simply not be copied to the new folder.
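A minimal sketch of that do-it-yourself approach, assuming the classic Microsoft.Azure.Storage.Blob client (connection strings and container names are placeholders). It takes the listing first and then starts a server-side copy for each blob in it, so anything uploaded after the listing is naturally excluded.

using System;
using System.Threading.Tasks;
using Microsoft.Azure.Storage;
using Microsoft.Azure.Storage.Blob;

class ManualBackup
{
    static async Task Main()
    {
        var source = CloudStorageAccount.Parse("<source-connection-string>")
            .CreateCloudBlobClient().GetContainerReference("source-container");
        var target = CloudStorageAccount.Parse("<target-connection-string>")
            .CreateCloudBlobClient().GetContainerReference("backup-container");

        BlobContinuationToken token = null;
        do
        {
            // Each list request returns at most 5,000 blobs; the continuation
            // token is the "marker" for the next page.
            var segment = await source.ListBlobsSegmentedAsync(
                null, true, BlobListingDetails.None, 5000, token, null, null);

            foreach (var item in segment.Results)
            {
                if (item is CloudBlockBlob blob)
                {
                    // Ask the service to copy the blob; the copy itself runs
                    // inside Azure Storage, just like DML's server-side copy.
                    var destBlob = target.GetBlockBlobReference(blob.Name);
                    await destBlob.StartCopyAsync(blob);
                }
            }

            token = segment.ContinuationToken;
        } while (token != null);
    }
}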

Related

Copy n number of files from Azure Datalake to SFTP location using Logic Apps

I have a scenario:
I have some files in Azure Data Lake. A job pushes these files to an ADLS location.
These files need to be uploaded to an SFTP location as input data.
An application consumes this input file, performs some operations, and later saves the processed data as an output file in an output directory at the same SFTP location.
With the help of Logic Apps I want to upload this output file to an ADLS location.
The application that consumes the input file has a limitation: it cannot consume more than 10,000 records at a time.
That is, if my source file has more than 10,000 records, I have to split it into multiple files (depending on the number of rows) and then replicate those files to the SFTP location.
This replication has to be performed in such a way that only after one job completes is the next file copied to the SFTP location.
To upload the files I want to use Azure Logic Apps.
As far as I understand, Azure Logic Apps does not currently provide a trigger for "a file added or modified" at an ADLS location, but it has a similar feature for blob storage, so I decided to use a blob container.
Once my raw files are uploaded to the ADLS location, I will upload a file to the blob location;
since my Logic App keeps polling this specific directory, whenever a new file arrives it will immediately trigger the file-copy job.
Now the problem:
My ADLS directory may have one or more files.
How do I create a copy activity in Logic Apps to replicate these files to the SFTP location?
How do I identify how many CSV files are available in the ADLS directory, so that my Logic App can decide the number of iterations needed to copy the files?
Thanks in advance.
You can use the List File action on ADLS.
The output of this action is documented here: https://learn.microsoft.com/en-us/connectors/azuredatalake/#folderresponse
It is basically an array of FileStatus objects. You can loop through this array, extract information from each status object, and use it to copy the data wherever you want.
FileStatus contains the following fields:
Name               | Path             | Type    | Description
File name          | pathSuffix       | string  | File or folder name.
Type               | type             | string  | Type of item (directory or file).
Block Size         | blockSize        | integer | Block size of the folder or file.
Access Time        | accessTime       | integer | Unix (Epoch) time when the item was last accessed.
Modification Time  | modificationTime | integer | Unix (Epoch) time when the item was last modified.
A sample Logic App would look like this:
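If it helps to see the same response handled in code rather than in the designer, here is a rough C# illustration. The FileStatus class below is hypothetical and simply mirrors the fields in the table above; the CSV filter shows the kind of loop the Logic App's For each performs when deciding how many files to copy.

using System;
using System.Collections.Generic;

// Illustration only: mirrors the FileStatus fields documented for the
// ADLS connector's folder response (pathSuffix, type, blockSize,
// accessTime, modificationTime).
public class FileStatus
{
    public string PathSuffix { get; set; }        // file or folder name
    public string Type { get; set; }              // "FILE" or "DIRECTORY"
    public long BlockSize { get; set; }
    public long AccessTime { get; set; }          // Unix (Epoch) time
    public long ModificationTime { get; set; }    // Unix (Epoch) time
}

public static class FolderResponseExample
{
    // Loop over the array and pick out the CSV files; the count of this
    // sequence is the number of copy iterations needed.
    public static IEnumerable<string> CsvFiles(IEnumerable<FileStatus> statuses)
    {
        foreach (var status in statuses)
        {
            if (status.Type == "FILE" &&
                status.PathSuffix.EndsWith(".csv", StringComparison.OrdinalIgnoreCase))
            {
                yield return status.PathSuffix;
            }
        }
    }
}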

Azure ADF how to ensure that the same files that are copied, are also deleted?

Using Azure ADF and currently my setup is as follows:
An event-based trigger on an input blob fires on file upload. The file upload triggers a copy activity to an output blob, and this action is followed by a delete operation on the input blob. The input blob container can receive one or many files at once (I am not sure how often the files are scanned or how quickly the event triggers the pipeline). Reading the Delete activity documentation, it says:
Make sure you are not deleting files that are being written at the same time.
Would my current setup delete files that are being written?
Event based trigger on file upload >> Write from input Blob to Output Blob >> Delete input Blob
I've made an alternative solution that runs a Get Metadata activity (based on the event) at the beginning of the pipeline and then a ForEach loop that deletes the files at the end; I'm not sure this is necessary, though. Would my original solution suffice in the unlikely event that I'm receiving files every 15 seconds or so?
Also, while I'm at it: in a Get Metadata activity, how can I get the actual path to the file, not just the file name?
Thank you for the help.
The Delete activity documentation says:
Make sure you are not deleting files that are being written at the
same time.
Your setup is:
Event-based trigger on file upload >> Write from input Blob to Output
Blob >> Delete input Blob
The Delete input Blob step only runs after the Write from input Blob to Output Blob activity has finished, so by then the files being deleted are no longer being written.
Your question: would my current setup delete files that are being written?
Have you tested these steps? Test it yourself and you will get the answer.
Please note:
The Delete activity does not support deleting a list of folders described by a wildcard.
Another suggestion:
You don't need a Delete activity to remove the input blob after Write from input Blob to Output Blob finishes.
You can use a Data Flow instead: its source settings support deleting the source file (the input blob) after the copy completes.
Hope this helps.
I could not use Leon Yue's solution because my source dataset was an SFTP one, which is not supported by Azure Data Flows.
To deal with this, I used the dataset's Filter by last modified setting and set the End Time to the time the pipeline started.
With this solution, only the files added to the source before the pipeline started are consumed by both the copy and delete activities.
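Outside of ADF, the same "cutoff at pipeline start" idea looks roughly like this in C#. This is only a sketch of the filtering logic against blob storage, not the SFTP scenario itself; the classic Microsoft.Azure.Storage.Blob client, connection strings, and container names are assumptions.

using System;
using System.Threading.Tasks;
using Microsoft.Azure.Storage;
using Microsoft.Azure.Storage.Blob;

class CutoffCopyAndDelete
{
    static async Task Main()
    {
        // Record the "pipeline start" time first; anything modified after
        // this instant is left alone for the next run.
        DateTimeOffset cutoff = DateTimeOffset.UtcNow;

        var source = CloudStorageAccount.Parse("<source-connection-string>")
            .CreateCloudBlobClient().GetContainerReference("input-container");
        var target = CloudStorageAccount.Parse("<target-connection-string>")
            .CreateCloudBlobClient().GetContainerReference("output-container");

        BlobContinuationToken token = null;
        do
        {
            var segment = await source.ListBlobsSegmentedAsync(
                null, true, BlobListingDetails.None, 5000, token, null, null);

            foreach (var item in segment.Results)
            {
                if (item is CloudBlockBlob blob &&
                    blob.Properties.LastModified.HasValue &&
                    blob.Properties.LastModified.Value < cutoff)
                {
                    var dest = target.GetBlockBlobReference(blob.Name);
                    await dest.StartCopyAsync(blob);

                    // Wait for the server-side copy to finish before deleting
                    // the source, so we never delete a file still being written.
                    do
                    {
                        await Task.Delay(500);
                        await dest.FetchAttributesAsync();
                    } while (dest.CopyState.Status == CopyStatus.Pending);

                    if (dest.CopyState.Status == CopyStatus.Success)
                    {
                        await blob.DeleteIfExistsAsync();
                    }
                }
            }
            token = segment.ContinuationToken;
        } while (token != null);
    }
}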

How can I attach all content files from folders in my blob container with a Logic App?

I have a blob container "image-blob", and I create a folder blob with the OCR text of an image and the image itself (two files: image.txt, with the text of the image, and image.png). The container has multiple folders, and each folder contains both files. How can I make a Logic App that sends an email with both files of every folder? (That would be one email per folder, with its 2 files.) The folder name is generated randomly, and every file has the name of the folder plus an extension.
I've tried making a condition with an isFolder() check, but nothing happens.
This is how my container looks:
These are the files each folder has:
You could try List blobs in root folder if your folders are in the root of the container; otherwise you could use List blobs.
If you use List blobs in root folder, your flow would look like the picture below. After List blobs you will get all the blob info, and you can then add an action such as Get blob content using path.
If you use List blobs instead, only the first step is different: you need to specify the container path. The other steps are just like List blobs in root folder.
In my test, I added the Get blob content using path action, and here is the result.
It does get all the blobs. However, because of the For each action, you can only get them one by one, so in your situation you may need to store the information you need into a file and then get the whole information sheet from that file.
Hope this helps; if you still have other questions, please let me know.
How can I make a Logic App that sends an email with both files of every folder?
It's hard to put the two files in one email. The following screenshot shows sending an email for each file of every folder.
If you still have any problem, please feel free to let me know.
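If it helps to see the grouping logic outside of the designer, here is a rough C# sketch, assuming the classic Microsoft.Azure.Storage.Blob client and a placeholder connection string. It lists the container flat, groups blobs by their virtual folder prefix, and reads both files of each folder; folders in blob storage are just name prefixes, which is why the pairing has to be done by name.

using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Threading.Tasks;
using Microsoft.Azure.Storage;
using Microsoft.Azure.Storage.Blob;

class FolderPairs
{
    static async Task Main()
    {
        var container = CloudStorageAccount.Parse("<connection-string>")
            .CreateCloudBlobClient().GetContainerReference("image-blob");

        var blobs = new List<CloudBlockBlob>();
        BlobContinuationToken token = null;
        do
        {
            var segment = await container.ListBlobsSegmentedAsync(
                null, true, BlobListingDetails.None, 5000, token, null, null);
            blobs.AddRange(segment.Results.OfType<CloudBlockBlob>());
            token = segment.ContinuationToken;
        } while (token != null);

        // Group "folder/folder.txt" and "folder/folder.png" by the folder prefix.
        foreach (var folder in blobs.GroupBy(b => b.Name.Split('/')[0]))
        {
            Console.WriteLine($"Folder: {folder.Key}");
            foreach (var blob in folder)
            {
                // Each pair could then be attached to one email per folder.
                using (var ms = new MemoryStream())
                {
                    await blob.DownloadToStreamAsync(ms);
                    Console.WriteLine($"  {blob.Name} ({ms.Length} bytes)");
                }
            }
        }
    }
}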

Azure function resize image in place

I'm trying to resize an image from blob storage using an Azure Function. It's an easy task with lots of samples and it works great, but only when the resized image is saved to a different file. My problem is that I would like to replace the original image with the resized one, at the same location and with the same name.
When I set the output blob to be the same as the input blob, the function is triggered over and over again without ever finishing.
Is there any way I can change a blob using an Azure Function and store the result in the same file?
The easiest option is to accept two invocations for the same file, but add a check on the incoming file. If the image is already the target size, do nothing and exit without changing the file again. This breaks you out of the loop.
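A rough sketch of that guard as a blob-triggered C# function. It assumes the in-process model with the Microsoft.Azure.WebJobs storage bindings (which can bind the trigger to a CloudBlockBlob), the SixLabors.ImageSharp package for the resize, a hypothetical target width of 1024 pixels, a placeholder container name, and that the image is written back as PNG.

using System.IO;
using System.Threading.Tasks;
using Microsoft.Azure.Storage.Blob;
using Microsoft.Azure.WebJobs;
using Microsoft.Extensions.Logging;
using SixLabors.ImageSharp;
using SixLabors.ImageSharp.Processing;

public static class ResizeInPlace
{
    private const int TargetWidth = 1024; // hypothetical target size

    [FunctionName("ResizeInPlace")]
    public static async Task Run(
        [BlobTrigger("images/{name}")] CloudBlockBlob blob,
        string name,
        ILogger log)
    {
        using var input = new MemoryStream();
        await blob.DownloadToStreamAsync(input);
        input.Position = 0;

        using var image = Image.Load(input);

        // Second-invocation guard: if the image is already at (or below)
        // the target size, exit without writing, which stops the
        // trigger/re-trigger loop described above.
        if (image.Width <= TargetWidth)
        {
            log.LogInformation($"{name} is already resized, skipping.");
            return;
        }

        image.Mutate(x => x.Resize(TargetWidth, 0)); // 0 height keeps the aspect ratio

        using var output = new MemoryStream();
        image.SaveAsPng(output); // assumption: write back as PNG
        output.Position = 0;

        // Overwrites the same blob, which triggers the function once more;
        // the guard above then ends the loop.
        await blob.UploadFromStreamAsync(output);
    }
}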
The blob trigger uses storage logs to watch for new or changed blobs. It then compares a changed blob against the blob receipts stored in a container named azure-webjobs-hosts in the Azure storage account. Each receipt has an ETag associated with it, so when you change a blob, its ETag changes and the blob is submitted to the function again.
Unless you want to get fancy and update the ETags in the receipts from within the function (I'm not sure that's feasible), your changed files will go for re-processing.

Avoid over-writing blobs in Azure

If I upload a file to Azure blob storage in the same container where a file with that name already exists, it overwrites the file. How do I avoid overwriting it? Below is the scenario:
Step 1 - upload a file "abc.jpg" to Azure in a container called, say, "filecontainer"
Step 2 - once it is uploaded, try uploading a different file with the same name to the same container
Output - it overwrites the existing file with the latest upload
My requirement - I want to avoid this overwrite, as different people may upload files having the same name to my container.
Please help.
P.S.
- I do not want to create different containers for different users
- I am using the REST API with Java
Windows Azure Blob Storage supports conditional headers, which you can use to prevent overwriting of blobs. You can read more about conditional headers here: http://msdn.microsoft.com/en-us/library/windowsazure/dd179371.aspx.
Since you want a blob not to be overwritten, you would specify the If-None-Match conditional header and set its value to *. This causes the upload operation to fail with a Precondition Failed (412) error when the blob already exists.
Another idea would be to check for the blob's existence just before uploading (by fetching its properties); however, I would not recommend this approach, as it can lead to concurrency issues.
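The question uses the REST API from Java, but as a sketch of the same conditional-header idea with the classic .NET blob client (the file path, container name, and connection string are placeholders):

using System;
using System.Threading.Tasks;
using Microsoft.Azure.Storage;
using Microsoft.Azure.Storage.Blob;

class NoOverwriteUpload
{
    static async Task Main()
    {
        var container = CloudStorageAccount.Parse("<connection-string>")
            .CreateCloudBlobClient().GetContainerReference("filecontainer");
        CloudBlockBlob blob = container.GetBlockBlobReference("abc.jpg");

        try
        {
            // GenerateIfNotExistsCondition() sends "If-None-Match: *",
            // so the upload is rejected if a blob with this name exists.
            await blob.UploadFromFileAsync(
                "abc.jpg",
                AccessCondition.GenerateIfNotExistsCondition(),
                null, null);
        }
        catch (StorageException ex)
        {
            // The answer above notes the service rejects the request
            // (Precondition Failed) when the blob already exists.
            Console.WriteLine($"Upload rejected: {ex.RequestInformation?.HttpStatusCode}");
        }
    }
}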
You have no control over the names your users upload their files with. You do, however, have control over the names you store those files with. The standard approach is to generate a GUID and name each file accordingly; the chance of a conflict is practically zero.
Simple pseudocode looks like this:
// generate a GUID and rename the file the user uploaded with the generated GUID
// store the original file name together with the GUID in a database or similar
// upload the file to blob storage using the name you generated above
Hope that helps.
Let me put it this way:
Step one - user X uploads a file "abc1.jpg" and you save it to a local folder XYZ
Step two - user Y uploads another file with the same name "abc1.jpg", and now you save it again in the local folder XYZ
What do you do now?
With this I am illustrating that your question does not really relate to Azure at all.
Just do not rely on the original file names when saving files, wherever you are saving them. Generate random names (GUIDs, for example) and attach the original name as metadata.
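A minimal sketch of that GUID-naming idea with the classic .NET blob client (the container name, connection string, and local file path are placeholders); the original name is kept in blob metadata, as both answers suggest.

using System;
using System.IO;
using System.Threading.Tasks;
using Microsoft.Azure.Storage;
using Microsoft.Azure.Storage.Blob;

class GuidUpload
{
    static async Task Main()
    {
        var container = CloudStorageAccount.Parse("<connection-string>")
            .CreateCloudBlobClient().GetContainerReference("filecontainer");

        string originalName = "abc.jpg";                               // what the user uploaded
        string storedName = Guid.NewGuid() + Path.GetExtension(originalName);

        CloudBlockBlob blob = container.GetBlockBlobReference(storedName);
        blob.Metadata["originalName"] = originalName;                  // keep the user's name as metadata

        // The stored name is unique per upload, so two users uploading
        // "abc.jpg" can never overwrite each other's files.
        await blob.UploadFromFileAsync(originalName);
    }
}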
