Loading only the latest files' data to Azure SQL Data Warehouse

Step 1: We copy the CSV files from an on-premises file server to Azure Blob Storage (say, a 'Staging' container).
Step 2: Using PolyBase, we load the data from these files into Azure SQL Data Warehouse.
We keep the same file names (in sync with the staging DB tables) every time we load from the on-prem file server to Azure Blob.
The challenge is loading the data from blob storage into Azure SQL Data Warehouse: during each batch cycle (an ADF pipeline run) we have to process and load all the files in staging into Azure SQL DWH. We run 4 batch cycles every day, and each cycle processes the latest files as well as the old files that have already been processed. Is there a way to load, for each individual batch job, only the files currently available on the on-prem file server? (That is, load those files to staging and process only those files into SQL DWH, without touching the others.)

The same issue occurred for me. What I did was add an ExtractDate column to the CSV files and then select, through PolyBase, only the records with the ExtractDate I want. PolyBase currently doesn't support delta (new-file) detection from blob storage, so this workaround worked for me.
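For illustration, here is a minimal sketch of that filter, assuming a pyodbc connection to the data warehouse, a PolyBase external table named ext_StagingOrders over the staged CSV files, and a target table dbo.Orders (all of these names, and the batch date, are hypothetical):

import pyodbc

# Connection details and object names are placeholders.
conn = pyodbc.connect(
    "Driver={ODBC Driver 17 for SQL Server};"
    "Server=yourserver.database.windows.net;Database=yourdw;"
    "Uid=youruser;Pwd=yourpassword"
)
cursor = conn.cursor()

# Pull only the rows whose ExtractDate matches the current batch.
cursor.execute(
    """
    INSERT INTO dbo.Orders (OrderId, Amount, ExtractDate)
    SELECT OrderId, Amount, ExtractDate
    FROM ext_StagingOrders        -- PolyBase external table over the blob files
    WHERE ExtractDate = ?
    """,
    "2019-06-01",
)
conn.commit()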

Related

Moving data from Teradata to Snowflake

We are trying to move data from Teradata to Snowflake. We have created a process that runs TPT scripts for each table to generate files per table.
The files are also split to achieve concurrency while running COPY INTO in Snowflake.
We need to understand the best way to move those files from the on-prem Linux machine to Azure ADLS, considering the files are terabytes in size.
Does Azure provide any mechanism to move these files, or can we create files on ADLS directly from Teradata?
The best approach is to load the data into Snowflake via an external table if you have Azure Blob Storage or ADLS Gen2: land the files in blob storage, create an external stage or table over them, and then load the data into Snowflake.
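As a rough sketch of that flow (the account names, SAS token, and file-format options below are placeholders, not from the original post), using the Snowflake Python connector once the TPT-exported files have landed in an Azure container:

import snowflake.connector

# All identifiers and credentials are placeholders.
conn = snowflake.connector.connect(
    account="myaccount", user="myuser", password="mypassword",
    warehouse="LOAD_WH", database="MYDB", schema="PUBLIC",
)
cur = conn.cursor()

# External stage over the Azure container holding the exported files.
cur.execute("""
    CREATE STAGE IF NOT EXISTS teradata_stage
    URL = 'azure://mystorageaccount.blob.core.windows.net/teradata-extracts'
    CREDENTIALS = (AZURE_SAS_TOKEN = '<sas-token>')
""")

# The split files for one table load in parallel with a single COPY INTO.
cur.execute("""
    COPY INTO my_table
    FROM @teradata_stage/my_table/
    FILE_FORMAT = (TYPE = CSV FIELD_DELIMITER = '|' SKIP_HEADER = 1)
""")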

Azure Databricks: how to automatically download CSVs to local network drives?

My job currently uses Azure Databricks. Is it possible to have my dataframes automatically downloaded as CSV to a local network drive path on a recurring basis?
For example, our company has recurring reports, and I was hoping to automate them by creating the dataframe in Databricks and somehow having Azure drop the CSV into a specific path in our company network folder. Would this be possible?
FYI, I understand I could save the CSV file to the filestore (DBFS), but the main problem is: how can I, or Azure, have the CSV automatically delivered to our company network on a recurring basis?
Write the file to blob storage or a data lake rather than DBFS.
Use Azure Data Factory to run the notebook and then copy the output file to your on-prem network.
You will need an integration runtime (self-hosted) installed somewhere in your network for the file copy to reach your network.
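As a minimal sketch of the notebook side (the mount point, path, and dataframe contents are hypothetical), write the report to a blob-storage mount instead of DBFS so the ADF copy activity can pick the file up from there:

# 'spark' is provided by the Databricks runtime; the dataframe below is a stand-in for the real report.
report_df = spark.createDataFrame(
    [(1, "2019-06-01", 100.0)], ["id", "report_date", "amount"]
)

# Assumes the blob container is already mounted at /mnt/reports.
output_path = "/mnt/reports/daily_sales"

(report_df
    .coalesce(1)                      # single CSV part file for easier pickup
    .write
    .mode("overwrite")
    .option("header", True)
    .csv(output_path))

An ADF pipeline can then run the notebook activity followed by a copy activity that moves the CSV from the container to the network share through the self-hosted integration runtime.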

I'm continuously getting blob files in blob storage. I have to load them with Databricks and put them in Azure SQL DB, with Data Factory orchestrating the pipeline

I receive data continuously in blob storage. I initially had 5 blob files, and I'm able to load them from blob into Azure SQL DB using Databricks, automated with Data Factory. The problem is that when newer files arrive in blob storage, Databricks loads them along with the older files and sends everything to Azure SQL DB. I don't want the old files; each time I want only the newer ones, so that the same data is not loaded again and again into Azure SQL DB.
The easiest way to do that is simply to archive the files you have just read into a new folder; call it archiveFolder. Say your Databricks job is reading from the following directory:
mnt
  sourceFolder
    file1.txt
    file2.txt
    file3.txt
You run your code, ingest the files, and load them into SQL Server. Then you simply archive those files (move them from sourceFolder into archiveFolder). In Databricks this can be done with the following command:
dbutils.fs.mv(sourcefilePath, archiveFilePath, True)
So, next time your code runs, you will only have the new files in your sourceFolder.
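Putting it together, a minimal sketch of one run (the paths and JDBC connection details are placeholders; 'spark' and 'dbutils' are provided by the Databricks runtime):

source_path = "/mnt/sourceFolder"
archive_path = "/mnt/archiveFolder"

# Read whatever is currently sitting in the source folder.
df = spark.read.option("header", True).csv(source_path)

# Append this batch to Azure SQL DB over JDBC (connection details are placeholders).
(df.write
   .format("jdbc")
   .option("url", "jdbc:sqlserver://yourserver.database.windows.net:1433;database=yourdb")
   .option("dbtable", "dbo.TargetTable")
   .option("user", "youruser")
   .option("password", "yourpassword")
   .mode("append")
   .save())

# Archive the files that were just loaded so the next run only sees new ones.
for f in dbutils.fs.ls(source_path):
    dbutils.fs.mv(f.path, archive_path + "/" + f.name, True)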

SSIS Connector for Azure File Storage

I have a directory on a local machine that holds various source files containing data that I need to load into an Azure SQL Server instance. The source files are in a variety of formats, including xlsx, xls, csv, txt, and dat. A while back I built a solution that transforms and loads these files into a local SQL Server and SSIS instance (Developer Edition).
Now that development has concluded, I would like to deploy the DB and packages to Azure. With an Azure account I created SQL Server and SSIS instances, then I created a file storage account in Azure and copied the source file directory into the file store. My intention was to simply take the old solution, change the sources from local files to an Azure Data Lake Store source, and change the destinations from the local DB to the Azure SQL instance.
However, I am having a lot of complications with Active Directory authentication, and it also appears that the Data Lake and blob source tools in SSIS only work with text and Avro files. Is there no way for SSIS to easily access files in an Azure file store?

Incremental loading of files from On-prem file server to Azure Data Lake

We would like to do incremental loading of files from our on-premises file server to Azure Data Lake using Azure Data Factory v2.
Files are stored on a daily basis on the on-prem file server, and we have to run the ADFv2 pipeline at regular intervals during the day, picking up only the new, unprocessed files from the folder.
Our recommendation is to put each day's set of files into /YYYY/MM/DD directories. You can refer to this example of how to use system variables (@trigger().scheduledTime) to read files from the corresponding directory:
https://learn.microsoft.com/en-us/azure/data-factory/how-to-read-write-partitioned-data
In the source dataset you can also apply a file filter. You can do that by time, for example (calling a datetime function in the expression language), or by anything else that identifies a new file.
https://learn.microsoft.com/en-us/azure/data-factory/control-flow-expression-language-functions
Then, with a scheduled trigger, you can execute the pipeline n times during the day.
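For intuition only, a small Python sketch of what the date-partitioned path and the time-based filter resolve to (the actual configuration lives in the ADF dataset and pipeline; the folder name and the four-runs-a-day window here are assumptions):

from datetime import datetime, timedelta

# Hypothetical trigger time for one of several daily runs.
scheduled_time = datetime(2019, 6, 1, 6, 0)

# The folder an expression like
#   @concat('raw/', formatDateTime(trigger().scheduledTime, 'yyyy/MM/dd'))
# would resolve to in the dataset's folder path.
folder_path = "raw/{:%Y/%m/%d}".format(scheduled_time)

# The time-window idea behind the file filter: only consider files
# modified since the previous run (assuming a run every 6 hours).
window_start = scheduled_time - timedelta(hours=6)
print(folder_path, window_start)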
