Copy latest files from S3 to Azure Blob (using Azure Data Factory V2)

I'm still new to Azure Data Factory and am trying to move files that are dumped into my S3 folder/bucket daily over to Azure Blob Storage. I have already created datasets (for source and sink) and linked services in Data Factory.
But since my S3 bucket receives a new file every day, I'm wondering how to move only the latest file that was dropped in S3 (say at 5am EST) on a daily basis. I have looked through most of the answers online like this, this, this and this. But none of them explains how to figure out which is the latest file in S3 (maybe based on last modified date/time, or by matching a file name pattern like 'my_report_YYYYMMDD.csv.gz') and copy only that file to the destination blob.
Thank you in advance for your help/answer!

My idea is as follows:
1. First, configure your pipeline execution with a schedule trigger. Refer to this link.
2. Use the Get Metadata activity, which supports the Amazon S3 connector, to list the files in your S3 dataset and retrieve metadata such as last modified time and file name.
3. Pass this metadata array (containing lastModified time and file name) to a Web activity or an Azure Function activity. In that REST API or function, implement the sorting logic to pick out the most recently modified file (a sketch of such a function is shown below).
4. Get the fileName back from the Web activity or Azure Function activity, then copy that file into Azure Blob Storage.
Another idea is to use a Custom activity: you could implement your requirement with .NET code.
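For step 3, a minimal Python sketch of such an Azure Function is shown below. It assumes the pipeline posts the Get Metadata output as a JSON array of objects with name and lastModified fields; the exact payload shape and timestamp format are assumptions, so adjust them to whatever you actually send.

```python
import json
from datetime import datetime

import azure.functions as func


def main(req: func.HttpRequest) -> func.HttpResponse:
    # Assumed request body, built in the pipeline from the Get Metadata output:
    # [{"name": "my_report_20190601.csv.gz", "lastModified": "2019-06-01T05:00:00Z"}, ...]
    files = req.get_json()

    def modified_time(item):
        # Parse the ISO-8601 timestamp; adjust the format string if yours differs.
        return datetime.strptime(item["lastModified"], "%Y-%m-%dT%H:%M:%SZ")

    latest = max(files, key=modified_time)

    # The pipeline can parse this JSON from the function activity's output
    # and pass the file name on to the Copy activity.
    return func.HttpResponse(
        json.dumps({"fileName": latest["name"]}),
        mimetype="application/json",
    )
```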

(Side note: thanks to Jay Gong above for suggesting a solution)
I found the answer. It's simpler than I expected: there is dynamic content/an expression that we can add to the 'Filter by last modified' field of the S3 dataset. I used a dynamic expression so that only files no more than 5 hours old are picked up. More about these expressions can be read here.
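For reference, the expression I mean is of roughly this shape, placed in the start time of the dataset's 'Filter by last modified' setting (this uses the addhours/utcnow functions of the ADF expression language; tune the 5-hour window to your own schedule):

```
@addhours(utcnow(), -5)
```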
Hope this is helpful.

Related

Move data from Sharepoint through a Logic App

We are using a Logic App to move data from a Sharepoint folder to Azure Blob Storage.
We were using the Sharepoint trigger "When a file is created or modified in a folder". Unfortunately, this trigger has been deprecated and no longer works (i.e., when a file is indeed created or modified, no further action is taken after the trigger runs).
No file is moved around anymore; the trigger does not execute the Logic App even though a file is created or modified in the Sharepoint origin folder. I have been through the other Sharepoint triggers but they do not seem to fit our use case: we are not using Sharepoint lists but classic folders, and we cannot create a Logic App per file. We could use several triggers pointing directly at each existing file, but since we have many files to move in the same folder, that would mean creating many Logic Apps, which is not how we want to do it. Moreover, new files may be created in the future.
What could we do to keep the same architecture of moving data around from Sharepoint to Blob Storage through the non-deprecated Logic App triggers?
Thank you in advance,
Alexis
You can use the "When a file is created or modified (properties only)" trigger to get the properties of the file that is being created or updated. Then you can use "Get file content" with the properties from the previous step. Finally, you can create a blob from that content. That is the flow of my logic app.

Use Azure Data Factory to copy files and place a csv of files copied

I am trying to implement the following flow in an Azure Data Factory pipeline:
Copy files from an SFTP to a local folder.
Create a comma separated file in the local folder with the list of files and their sizes.
The first step was easy enough, using a 'Copy Data' step with 'SFTP' as source and 'File System' as sink.
The files are being copied, but in the output of this step, I don't see any file information.
I also don't see an option to create a file using data from a previous step.
Maybe I'm using the wrong technology?
One of the reasons I'm using Azure Data Factory is the integration runtime, which allows us to have a single fixed IP to connect to the external SFTP (easier firewall configuration).
Is there a way to implement step 2?
Thanks for any insight!
There is no built-in feature to achieve this.
You need to combine ADF with another service; I suggest first using an Azure Function to check the files and then doing the copy.
The structure would be an Azure Function activity followed by the Copy activity.
You can get the sizes of the files and save them to a csv file:
Get the size of files (Python):
How to fetch sizes of all SFTP files in a directory through Paramiko
And use pandas to save the results as csv (Python):
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_csv.html
Writing a pandas DataFrame to CSV file
Simple HTTP trigger of an Azure Function (Python):
https://learn.microsoft.com/en-us/azure/azure-functions/functions-bindings-http-webhook-trigger?tabs=python
(Put the processing logic in the body of the Azure Function. You can do almost anything you want there, apart from anything requiring a graphical interface and a few unsupported things, and you can use whichever language you are familiar with. In short, there is no feature in ADF itself that satisfies your requirement.)
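Putting those pieces together, a sketch of such an HTTP-triggered Azure Function (Python) could look like the following. The host, credentials and paths are placeholders; in practice you would read them from app settings or Key Vault, and you might write the csv to blob storage instead of a local path.

```python
import azure.functions as func
import pandas as pd
import paramiko


def main(req: func.HttpRequest) -> func.HttpResponse:
    # Placeholder connection details -- replace with your own (ideally from Key Vault).
    transport = paramiko.Transport(("sftp.example.com", 22))
    transport.connect(username="user", password="password")
    sftp = paramiko.SFTPClient.from_transport(transport)

    # listdir_attr returns SFTPAttributes objects carrying filename and st_size.
    entries = sftp.listdir_attr("/upload")
    rows = [{"fileName": e.filename, "sizeBytes": e.st_size} for e in entries]

    sftp.close()
    transport.close()

    # Save the file list as csv (placeholder path; it could also be uploaded to blob storage).
    pd.DataFrame(rows).to_csv("/tmp/file_list.csv", index=False)

    return func.HttpResponse("file_list.csv written", status_code=200)
```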

How to uncompress rar files using Azure Data Factory

We have a new client. While setting up the project we gave them a blob storage container where they can drop files, so that we can later automate processing of the information.
The idea is to use Azure Data Factory, but we can find no way of dealing with .rar files, and even .zip files coming from Windows are giving us trouble. Since it is the client providing the .rar format, we wanted to make absolutely sure there is no way to process it before asking them to change it, or before deploying Databricks or a similar service just for the purpose of transforming the file.
Is there any way to get a .rar file from a blob storage, uncompress it, then process it?
I have been looking at posts like this one and the related official documentation, and the closest we have come is using ZipDeflate, but it does not seem to fulfil our requirement.
Thanks in advance!
The only compression types Data Factory supports are GZip, Deflate, BZip2, and ZipDeflate.
For unsupported file types and compression formats, Data Factory offers some workarounds:
You can use the extensibility features of Azure Data Factory to transform files that aren't supported. Two options include Azure Functions and custom tasks by using Azure Batch.
You can see a sample that uses an Azure function to extract the contents of a tar file. For more information, see Azure Functions activity.
You can also build this functionality using a custom dotnet activity. Further information is available here.
Going down that road, you would need to figure out how to use an Azure Function to extract the contents of a rar file.
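As a rough sketch of that idea: the Python rarfile package (which needs an unrar/bsdtar backend installed on the function host) together with azure-storage-blob could download the archive, extract it, and upload the members as individual blobs. The connection string, container and blob names below are placeholders.

```python
import os

import rarfile
from azure.storage.blob import BlobServiceClient

# Placeholder connection string and container names.
service = BlobServiceClient.from_connection_string(os.environ["STORAGE_CONNECTION"])
source = service.get_container_client("incoming")
target = service.get_container_client("extracted")

# Download the .rar blob to a local temp file so rarfile can open it.
archive_path = "/tmp/client_drop.rar"
with open(archive_path, "wb") as f:
    f.write(source.download_blob("client_drop.rar").readall())

# Extract every member and upload it as its own blob
# (directory entries, if the archive has any, may need to be skipped).
with rarfile.RarFile(archive_path) as archive:
    for name in archive.namelist():
        target.upload_blob(name=name, data=archive.read(name), overwrite=True)
```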
You can use Logic Apps, or you can use a Webhook activity calling a runbook.
Both are easier than using a custom activity.

Build a pipeline in Azure Data Factory to load Excel files, format content, transform to CSV and send to Azure SQL DB

I'm new to the Azure environment and have been watching tutorials/reading documents, but I'm trying to figure out how to set up a flow that enables the process I will describe hereunder. The starting point is reports in .xlsx format produced monthly by the Mktg Dept: the requirement is to bring them into an Azure SQL DB so that the data can be stored and analysed. So far I have managed to put those files (previously converted manually to .csv format) into blob storage and to build an ADF pipeline that copies each file into a table in the SQL DB.
The problem is that, as far as I understood, with ADF it's not possible to directly manage xlsx files, and I'm wondering how to set up an automated procedure that converts the files from .xlsx to .csv and saves them to blob storage. I was thinking about adding a Python script/Databricks notebook to the pipeline to convert the format, but I'm not sure this is the best solution. Any hint/reference to an existing tutorial or resources would be much appreciated.
I found a tutorial which uses Logic Apps to do the conversion.
Datanovice indirectly suggested using a Custom activity to run either a C# or Python application to do the conversion for you.
The least expensive solution would be to do the conversion before uploading to blob, like Datanovice said.
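If you do go the Python route (from a Custom activity, an Azure Function, or a Databricks notebook), the conversion itself is only a couple of lines with pandas; the file names and sheet index below are placeholders, and in practice the files would be downloaded from and uploaded back to blob storage around this step.

```python
import pandas as pd

# Placeholder paths; pandas needs the openpyxl package installed to read .xlsx files.
df = pd.read_excel("monthly_report.xlsx", sheet_name=0)
df.to_csv("monthly_report.csv", index=False)
```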

azcopy list function gives a different count (almost double) of objects than Storage Explorer

I am uploading files with AzCopy (one by one, as and when they are provided) to Azure Data Lake Storage Gen2, and I keep track with Storage Explorer and an individual log for each file.
There have been 6253 file uploads, and Storage Explorer shows the same number, matching the number of per-file logs.
But when I use azcopy list it gives me 11254, which makes it difficult to script and automate.
Is there a logical explanation for this?
There is no access issue; in fact the same AzCopy setup works for copying the files.
I have tried re-downloading, if that makes sense.
This is a known bug, scheduled for fixing in our next release: https://github.com/Azure/azure-storage-azcopy/issues/692
