SSIS Connector for Azure File Storage

I have a directory on a local machine that holds various source files containing data that I need to load into an Azure SQL Server instance. The source files are in a variety of formats including xlsx, xls, csv, txt and dat. I built a solution a while back that transforms these files with SSIS (Developer Edition) and loads them into a local SQL Server instance.
Now that development has concluded, I would like to deploy the database and packages to Azure. With an Azure account I created SQL Server and SSIS instances, then I created a file storage account in Azure and copied the source file directory into the file store. My intention was that I would be able to simply take the old solution, change the sources from local files to an Azure Data Lake Store source and the destinations from the local database to the Azure SQL instance.
However, I am having a lot of complications with Active Directory authentication, and it also appears that the Data Lake and Blob source components in SSIS only work with text and Avro files. Is there not a way for SSIS to easily access files in an Azure file store?

Related

Parsing DAT files, CSV files and image files using Azure services

I have 5 types of EDI files, namely *.DAT, *.XML, *.TXT, *.CSV and image files, which contain data that is not in a standard format.
I need to parse them, extract the required data, and persist it in a SQL database.
Currently I'm spending time writing parser class libraries for each type of EDI file, which is not scalable.
I need to know if there are any Azure services which can do the parsing work for me and are scalable.
Can anyone suggest a solution in this regard?
Yes, you can use Azure Functions to process files like CSV and import the data into Azure SQL. Azure Data Factory is also helpful for reading or copying many file formats and storing them in a SQL Server database in the specified formats; there is a practical example provided by Microsoft, please refer here.
To do this with Azure Functions, the steps are as follows:
Create an Azure Function (stack: .NET Core 3.1) of type Blob Trigger and define the local storage account connection string in local.settings.json like below:
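For local development, the contents of local.settings.json typically look something like the sketch below; AzureWebJobsStorage is the storage connection the Blob Trigger binding usually points at, and the development-storage value shown here is only a placeholder for your real storage account connection string.

    {
      "IsEncrypted": false,
      "Values": {
        "AzureWebJobsStorage": "UseDevelopmentStorage=true",
        "FUNCTIONS_WORKER_RUNTIME": "dotnet"
      }
    }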
In Function.cs there will be some boilerplate code which logs the uploaded blob's name and size.
In the Run function, you can define your parsing logic for the uploaded blob files.
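As a rough, hypothetical sketch of what that parsing logic can look like (shown here as a Python blob-triggered function purely for brevity; in the .NET Core 3.1 project the equivalent code goes in the Run method of Function.cs, and the table and column names below are made up):

    import csv
    import io
    import logging
    import os

    import azure.functions as func
    import pyodbc

    # Blob Trigger entry point (the blob path/connection binding lives in function.json).
    def main(myblob: func.InputStream):
        # Boilerplate from the template: log the uploaded blob's name and size.
        logging.info("Processing blob %s (%s bytes)", myblob.name, myblob.length)

        # Parse the uploaded CSV blob into rows.
        rows = csv.DictReader(io.StringIO(myblob.read().decode("utf-8")))

        # Persist the parsed rows into Azure SQL; the connection string is an app setting.
        conn = pyodbc.connect(os.environ["SqlConnectionString"])
        cursor = conn.cursor()
        for row in rows:
            cursor.execute(
                "INSERT INTO dbo.EdiRecords (RecordId, Payload) VALUES (?, ?)",
                row.get("Id"), row.get("Payload"))
        conn.commit()
        conn.close()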
Create the Azure SQL database and configure the server with the location, pricing tier and the other required settings. After that, select Set Server Firewall on the database overview page, click Add Client IP to add your IP address, and save. Test whether you're able to connect to the database.
Deploy the project to the Azure Function App from Visual Studio.
Open your Azure SQL Database in the Azure portal and navigate to Connection Strings. Copy the connection string for ADO.NET.
Paste that connection string into the Function App's application settings in the Azure portal.
Test the function app from the portal; the remaining steps for uploading files from storage to the SQL database are available in this GitHub documentation.
Also, for parsing files such as CSV to JSON format with Azure Functions, please refer here.
Consider using Azure Data Factory. It supports a range of file types.

Connection manager in SSIS for formatted Excel files in ADLS

Scenario: I have formatted Excel files in ADLS. I want to access them in an SSIS package, perform simple transformations, and load them into a SQL DB.
Which connection manager should I use for fetching the Excel files, since these are not CSV files?
Using the Flexible File Source component in the data flow, you can connect SSIS to Azure Data Lake Storage.
Install the Azure Feature Pack for Integration Services (SSIS) to get the components required for Azure resources.
Provide all the details in the Flexible File Source Editor properties to connect to Azure Data Lake, as mentioned here.
Currently the source file formats supported are Text, Avro, ORC, Parquet.
Also refer to this MS document on configuring the Azure Data Lake Store connection manager, and this link for an example.

Ingest Data From On-Premise SFTP Folder To Azure SQL Database (Azure Data Factory)

Use case: I have data files of varying size copied to a specific SFTP folder periodically (daily/weekly). All these files need to be validated and processed, and then written to the related tables in Azure SQL. The files are in CSV format and are actually flat text files, each of which corresponds directly to a specific table in Azure SQL.
Implementation:
I am planning to use Azure Data Factory. So far, from my reading, I can see that I can have a Copy pipeline in order to copy the data from the on-prem SFTP server to Azure Blob storage. Likewise, we can have an SSIS pipeline to copy data from an on-premises SQL Server to Azure SQL.
But I don't see an existing solution to achieve what I am looking for. Can someone provide some insight on how I can achieve this?
I would try to use Data Factory with a Data Flow to validate/process the files (if possible for your case). If the validation is too complex or depends on other components, then I would use Functions and put the resulting files into blob storage. The Copy activity is also able to import the resulting CSV files into SQL Server.
You can create a pipeline that does the following:
Copy data - copy the files from SFTP to Blob storage
Do the data processing/validation via a Data Flow
and sink them directly into the SQL table (via the Data Flow sink)
Of course, you need an integration runtime that can access the on-prem server (if it is not publicly accessible), either by using VNet integration or by using a self-hosted IR.
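Sketched as ADF pipeline JSON, that could look roughly like the following; the dataset, data flow and activity names are placeholders rather than anything from the question, and a real pipeline would also carry your own triggers, parameters and policies.

    {
      "name": "SftpToAzureSql",
      "properties": {
        "activities": [
          {
            "name": "CopySftpToBlob",
            "type": "Copy",
            "inputs": [ { "referenceName": "SftpCsvFiles", "type": "DatasetReference" } ],
            "outputs": [ { "referenceName": "BlobStagingFiles", "type": "DatasetReference" } ],
            "typeProperties": {
              "source": { "type": "BinarySource", "storeSettings": { "type": "SftpReadSettings", "recursive": true } },
              "sink": { "type": "BinarySink", "storeSettings": { "type": "AzureBlobStorageWriteSettings" } }
            }
          },
          {
            "name": "ValidateAndLoadToSql",
            "type": "ExecuteDataFlow",
            "dependsOn": [ { "activity": "CopySftpToBlob", "dependencyConditions": [ "Succeeded" ] } ],
            "typeProperties": {
              "dataFlow": { "referenceName": "ValidateCsvDataFlow", "type": "DataFlowReference" }
            }
          }
        ]
      }
    }

The Data Flow referenced here is where the validation and the sink into the Azure SQL table would live, as described in the steps above.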

Azure Databricks: how to auto-download CSV files to local network drives?

My job currently uses Azure Databricks. Is it possible to have my dataframes automatically downloaded as CSV files to a local network drive path on a recurring basis?
For example, our company has recurring reports, and I was hoping I could automate this by creating the dataframe in Databricks and somehow having Azure download the CSV into a specific path in our company network folder. Would this be possible?
FYI, I understand I could save the CSV file to the FileStore (DBFS), but the main problem is: how can I (or Azure) have the CSV automatically delivered to our company network on a recurring basis?
Write the file to Blob storage or a data lake rather than DBFS.
Use Azure Data Factory to run the notebook and then copy the output file to your on-prem network.
You will need an integration runtime installed somewhere in your network for the file copy to be able to reach it.
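For the first step (writing the dataframe out to the lake instead of DBFS), a minimal PySpark sketch could look like the following; the source table, storage account and container names are made up, and the notebook is assumed to already have access to the storage account (for example via an access key or a service principal).

    # Hypothetical report dataframe; in a Databricks notebook the Spark session (`spark`) already exists.
    report_df = spark.table("reporting.daily_sales")

    # Hypothetical ADLS Gen2 container/account; adjust to your own storage.
    output_path = "abfss://reports@mystorageaccount.dfs.core.windows.net/daily_sales"

    (report_df
        .coalesce(1)                      # one part file, so the downstream copy picks up a single CSV
        .write
        .mode("overwrite")
        .option("header", "true")
        .csv(output_path))

An ADF pipeline can then run this notebook on a schedule and follow it with a Copy activity that delivers the CSV to the network share through the self-hosted integration runtime.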

Loading only the latest files' data to Azure SQL Data Warehouse

Step #1: We are supposed to copy the CSV files from the on-premises file server to Azure Blob storage (say, a 'Staging' container in Blob storage).
Step #2: Using PolyBase, we will load the data from these files into Azure SQL Data Warehouse.
We maintain the same file names (in sync with the staging DB tables) every time the files are loaded to Azure Blob from the on-prem file server.
We are facing a challenge while loading data into Azure SQL Data Warehouse from Blob storage: during each batch cycle execution (using an ADF pipeline run), we have to process and load all the files from staging to Azure SQL DWH. We run 4 batch cycles every day, and for each cycle we process the latest files as well as the old files which have already been processed. Is there any way we can load only the files currently available on the on-prem file server for each individual batch job? (I mean, we will load these files to staging and process only these files into SQL DWH without touching the others.)
The same issue occurred for me. What I did was add an ExtractDate column to the CSV file and then select via PolyBase only those records with the ExtractDate I want. Currently PolyBase doesn't support delta file detection from blob storage, so this workaround worked for me.
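A minimal sketch of the stamping half of that workaround, assuming the files can be post-processed before they are uploaded to the Staging container (pandas, the customer.csv file name and the ISO date format are illustrative choices, not from the answer):

    import pandas as pd
    from datetime import date

    # Stamp every row with the current batch's extract date before uploading the file,
    # so the PolyBase query can filter on ExtractDate for this batch only.
    extract_date = date.today().isoformat()

    df = pd.read_csv("customer.csv")
    df["ExtractDate"] = extract_date
    df.to_csv("customer.csv", index=False)

The load into the warehouse then selects from the external table only the rows whose ExtractDate matches the current batch, leaving previously processed data untouched.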
