Build a pipeline in Azure Data Factory to load Excel files, format the content, transform them to CSV and send them to Azure SQL DB

I'm new to the Azure environment and have been watching tutorials and reading documentation, but I'm trying to figure out how to set up the flow described below. The starting point is a set of reports in .xlsx format produced monthly by the Marketing Dept: the requirement is to bring them into an Azure SQL DB so the data can be stored and analysed. So far I have managed to put those files (previously converted to .csv by hand) in Blob storage and build an ADF pipeline that copies each file into a table in the SQL DB.
The problem is that, as far as I understand, ADF cannot directly handle .xlsx files, and I'm wondering how to set up an automated procedure that converts them from .xlsx to .csv and saves them to Blob storage. I was thinking about adding a Python script or Databricks notebook to the pipeline to convert the format, but I'm not sure this is the best solution. Any hint or reference to an existing tutorial or resource would be much appreciated.

I found a tutorial which uses Logic Apps to do the conversion.
Datanovice indirectly suggested using a Custom activity to run either a C# or Python application to do the conversion for you.
The least expensive solution would be to do the conversion before uploading to blob, like Datanovice said.
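If you do want the conversion to run in Azure, here is a minimal sketch of the kind of script a Custom activity (or a Databricks notebook / Azure Function) could execute, assuming pandas, openpyxl and azure-storage-blob are available; the connection string and container names are placeholders:

```python
# Minimal sketch: convert .xlsx blobs to .csv blobs in the same storage account.
# Connection string and container names are placeholders for your own values.
import io

import pandas as pd
from azure.storage.blob import BlobServiceClient

CONN_STR = "<storage-connection-string>"   # placeholder
SOURCE_CONTAINER = "xlsx-in"               # placeholder
TARGET_CONTAINER = "csv-out"               # placeholder

service = BlobServiceClient.from_connection_string(CONN_STR)
source = service.get_container_client(SOURCE_CONTAINER)
target = service.get_container_client(TARGET_CONTAINER)

for blob in source.list_blobs():
    if not blob.name.lower().endswith(".xlsx"):
        continue
    data = source.download_blob(blob.name).readall()
    # Read the first sheet; adjust sheet_name/skiprows to match the report layout.
    df = pd.read_excel(io.BytesIO(data), sheet_name=0, engine="openpyxl")
    csv_name = blob.name.rsplit(".", 1)[0] + ".csv"
    target.upload_blob(csv_name, df.to_csv(index=False), overwrite=True)
```

The rest of the existing pipeline, which already copies .csv files from Blob storage into the SQL DB, would not need to change.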

Related

Use Azure Data Factory to copy files and place a csv of files copied

I am trying to implement the following flow in an Azure Data Factory pipeline:
1. Copy files from an SFTP to a local folder.
2. Create a comma-separated file in the local folder with the list of files and their sizes.
The first step was easy enough, using a 'Copy Data' step with 'SFTP' as source and 'File System' as sink.
The files are being copied, but in the output of this step, I don't see any file information.
I also don't see an option to create a file using data from a previous step.
Maybe I'm using the wrong technology?
One of the reasons I'm using Azure Data Factory is the integration runtime, which allows us to have a single fixed IP to connect to the external SFTP (easier firewall configuration).
Is there a way to implement step 2?
Thanks for any insight!
There is no built-in feature to achieve this.
You need to use ADF together with another service; I suggest first using an Azure Function to inspect the files and then doing the copy.
You can get the sizes of the files and save them to a CSV file:
Get size of files (Python):
How to fetch sizes of all SFTP files in a directory through Paramiko
And use pandas to save the results as CSV (Python):
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_csv.html
Writing a pandas DataFrame to CSV file
Simple HTTP trigger for an Azure Function (Python):
https://learn.microsoft.com/en-us/azure/azure-functions/functions-bindings-http-webhook-trigger?tabs=python
(Put the processing logic in the body of the Azure Function. You can do almost anything you want there, apart from anything that needs a graphical interface or other unsupported features, and you can pick whichever language you're familiar with. In short, there is no feature in ADF itself that satisfies your idea.)
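To make that concrete, here is a minimal sketch of an HTTP-triggered Azure Function body that lists an SFTP directory with Paramiko and returns the file names and sizes as CSV. The host, credentials and directory are placeholders; the linked answer uses pandas for the CSV step, but the standard csv module works just as well:

```python
# Minimal sketch of an HTTP-triggered Azure Function (Python) that lists the
# files in an SFTP directory with their sizes and returns the result as CSV.
# Host, credentials and paths are placeholders; error handling is omitted.
import csv
import io

import azure.functions as func
import paramiko


def main(req: func.HttpRequest) -> func.HttpResponse:
    transport = paramiko.Transport(("sftp.example.com", 22))   # placeholder host
    transport.connect(username="user", password="password")    # placeholder credentials
    sftp = paramiko.SFTPClient.from_transport(transport)

    # listdir_attr returns SFTPAttributes objects carrying filename and st_size.
    rows = [(attr.filename, attr.st_size) for attr in sftp.listdir_attr("/inbound")]
    sftp.close()
    transport.close()

    # Build the CSV in memory; in practice you would write it to the local
    # folder or blob container that the rest of the pipeline expects.
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["file_name", "size_bytes"])
    writer.writerows(rows)

    return func.HttpResponse(buffer.getvalue(), mimetype="text/csv")
```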

How to uncompress rar files using Azure Data Factory

We have a new client, and while setting up the project we gave them a Blob storage account where they can drop files so we can later automate processing of the information.
The idea is to use Azure Data Factory, but we can find no way of dealing with .rar files, and even .zip files coming from Windows are giving us trouble. Since it is the client who chose the .rar format, we want to be absolutely sure there is no way to process it before asking them to change it, or before deploying Databricks or a similar service just for the purpose of transforming the file.
Is there any way to get a .rar file from a blob storage, uncompress it, then process it?
I have been looking at posts like this one and the related official documentation, and the closest we have come is ZipDeflate, but it does not seem to meet our requirement.
Thanks in advance!
The only compression types Data Factory supports are GZip, Deflate, BZip2, and ZipDeflate.
For unsupported file types and compression formats, Data Factory provides some workarounds:
You can use the extensibility features of Azure Data Factory to transform files that aren't supported. Two options include Azure Functions and custom tasks by using Azure Batch.
You can see a sample that uses an Azure function to extract the contents of a tar file. For more information, see Azure Functions activity.
You can also build this functionality using a custom dotnet activity. Further information is available here.
From there, you would need to figure out how to use an Azure Function to extract the contents of a .rar file.
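As an illustration only, such a function could use the third-party rarfile package, which in turn needs an unrar/unar/bsdtar backend available in the Function's environment; the connection string, container and blob names below are placeholders:

```python
# Minimal sketch: download a .rar blob, extract it with the third-party
# 'rarfile' package, and upload each extracted member back to Blob storage.
# rarfile needs an unrar/unar/bsdtar tool on the host, which would have to
# be bundled with the Azure Function. All names below are placeholders.
import os
import tempfile

import rarfile
from azure.storage.blob import BlobServiceClient

CONN_STR = "<storage-connection-string>"                   # placeholder
service = BlobServiceClient.from_connection_string(CONN_STR)
source = service.get_container_client("incoming")          # placeholder
target = service.get_container_client("extracted")         # placeholder

data = source.download_blob("client_upload.rar").readall() # placeholder blob name

# rarfile works most reliably on a real file, so stage the archive on disk first.
with tempfile.NamedTemporaryFile(suffix=".rar", delete=False) as tmp:
    tmp.write(data)
    tmp_path = tmp.name

try:
    with rarfile.RarFile(tmp_path) as archive:
        for member in archive.infolist():
            if member.isdir():
                continue
            target.upload_blob(member.filename, archive.read(member), overwrite=True)
finally:
    os.remove(tmp_path)
```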
You can use Logic Apps.
You can use a Webhook activity calling a runbook.
Both are easier than using a custom activity.

U-SQL User Defined Function in Azure Data Factory

I'm currently using Azure Data Factory for an ETL job, and at the end I want to start a U-SQL job. I've created my datalake.usql script, and the UDFs the script uses are in the datalake.usql.cs file, the same structure a U-SQL project has in Visual Studio (which is where I developed the U-SQL job and successfully ran it).
After that, I uploaded them both to Azure Blob Storage and set up the U-SQL step in Azure Data Factory to use the U-SQL script, but it doesn't see the datalake.usql.cs with my UDFs.
How can I do this?
I figured this one out, so I'll post the answer here in case someone else has this problem in the future.
When you're developing locally, you have a script.usql and a script.usql.cs file. When you run it, Visual Studio does all the heavy lifting, compiles it all in a usable way and your script runs. But when you're trying to run a script from Azure Data Factory, you can't perform that compilation on the fly, like Visual Studio does.
The solution to this problem is to make an Assembly, a .dll file, from your script.usql.cs, upload it to the data storage you're using and register it in the U-SQL Catalog. Then, you can reference this assembly and use it normally, as you would on your local machine.
All the steps needed for this are presented in these short guides:
https://saveenr.gitbooks.io/usql-tutorial/content/usql-catalog/intro.html
https://saveenr.gitbooks.io/usql-tutorial/content/usql-catalog/usql-databases.html
https://saveenr.gitbooks.io/usql-tutorial/content/usql-catalog/assemblies.html
https://www.c-sharpcorner.com/UploadFile/1e050f/creating-and-using-dll-class-library-in-C-Sharp/
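Condensed from those guides, the registration and reference steps look roughly like this in U-SQL; the database name, assembly name and DLL path are placeholders for whatever you actually use:

```
// One-time registration, run as a separate U-SQL job, assuming the DLL built
// from datalake.usql.cs was uploaded to /Assemblies/Datalake.Udfs.dll in the
// Data Lake store and that the catalog database is called MarketingDb.
CREATE DATABASE IF NOT EXISTS MarketingDb;
USE DATABASE MarketingDb;
CREATE ASSEMBLY IF NOT EXISTS [Datalake.Udfs] FROM @"/Assemblies/Datalake.Udfs.dll";

// Then, at the top of the script that Azure Data Factory runs:
REFERENCE ASSEMBLY MarketingDb.[Datalake.Udfs];
```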

Copy latest files from S3 to Azure Blob (using Azure Data Factory V2)

I'm still new to Azure Data Factory and am trying to move files that are dumped in my S3 folder/bucket daily to Azure blob. I already created datasets (for source and sink) and linked services in Data Factory.
But since my S3 bucket receives new file every day, I'm wondering how to move the latest file that was dropped in the S3 (say at 5am EST) on a daily basis. I have looked through most of the answers online like this, this, this and this. But none of them explains how to figure out which is the latest file in S3 (maybe based on last modified date/time or by matching the file name pattern that goes like this 'my_report_YYYYMMDD.csv.gz') and only copy that file to the destination blob.
Thank you in advance for your help/answer!
My idea is as below:
1. First, configure your pipeline execution with a schedule trigger. Refer to this link.
2. Use the Get Metadata activity, which supports the Amazon S3 connector, to get the files in your S3 dataset.
Get the last modified time, file name and other metadata.
3. Pass this metadata array, which contains the lastModified time and file name, into a Web activity or Azure Function activity. In that REST API or function you can apply some sorting logic to find the most recently modified file (a sketch of that logic follows below).
4. Get the fileName back from the Web activity or Azure Function activity, then copy that file into Azure Blob Storage.
Another idea is to use a Custom activity; you could implement your requirements with .NET code.
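As a sketch of step 3 only: assuming the pipeline POSTs a JSON array of objects that each carry a name and an ISO-8601 lastModified field (an assumption about the payload shape, not something ADF produces out of the box), an HTTP-triggered Azure Function could pick the newest file like this:

```python
# Minimal sketch of the sorting logic for an HTTP-triggered Azure Function.
# Assumes the request body is a JSON array such as:
#   [{"name": "my_report_20200101.csv.gz", "lastModified": "2020-01-01T05:02:11Z"}, ...]
import json

import azure.functions as func


def main(req: func.HttpRequest) -> func.HttpResponse:
    files = req.get_json()
    # ISO-8601 timestamps in the same timezone sort chronologically as strings.
    latest = max(files, key=lambda f: f["lastModified"])
    return func.HttpResponse(
        json.dumps({"fileName": latest["name"]}),
        mimetype="application/json",
    )
```

The pipeline can then read fileName from the activity output and use it in the copy activity's source path.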
(Side note: thanks to Jay Gong above for suggesting a solution)
I found the answer, and it's simpler than I expected. There is dynamic content/expression support in the 'Filter by last modified' field of the S3 dataset, and I used an expression there to pick up only files that are no more than 5 hours old. More about these expressions can be read here.
Hope this is helpful.
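A filter like that can be written with ADF's built-in date functions, for example setting the start of 'Filter by last modified' to @addHours(utcNow(), -5) and leaving the end empty, so that only objects modified in the last five hours are copied; treat this particular expression as an illustration of the approach rather than the exact one used.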

Azure Data Factory v2 - SSIS lift and shift

I'm in the process of evaluating the possibilities of lifting and shifting my SSIS packages to ADFv2, but without testing I'm finding it hard to see whether all SSIS functionality is supported.
For example, my package unzips files, modifies the contents of files (Script Task) saving the new versions in a different directory, loads the modified files to the DB, updates data, etc.
What I'm not sure about is unzipping the files (I don't want to transfer unzipped files from on-premises) and also modifying files with a Script Task. I believe these would have to be moved outside of SSIS and created as activities in ADF, leaving only the load of files, updating of data etc. in my SSIS package, probably with the files stored in Blob storage?
Or can all this still be done directly in SSIS?
Thanks
What you currently do using SSIS on premises, you can also do using SSIS in ADF. For example, you can install additional (un)zip programs using custom setup and use the %TEMP% folder/current working directory (".") of your Azure-SSIS IR to modify files; see:
https://learn.microsoft.com/en-us/azure/data-factory/how-to-configure-azure-ssis-ir-custom-setup
https://learn.microsoft.com/en-us/sql/integration-services/lift-shift/ssis-azure-files-file-shares?view=sql-server-2017
