I'm looking for a way to process SAP IDoc files via an Azure Data Factory or Azure Synapse pipeline. The solution should be capable of extracting data from the raw files taken from an SFTP folder (in IDoc format) and writing the processed data to a SQL database. I know there are SAP connectors available, but the issue is that the files (IDocs) should be extracted not from the SAP system directly but from the SFTP folder where thousands of them are stored daily. I'm not an expert in Azure Data Factory/Synapse, but these services allow you to build a decent processing pipeline for all sorts of ETL work. So I wonder whether they can also handle the data parsing, or whether I'm looking in the wrong place.
Please advise.
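One pattern ADF supports is to copy the raw files from SFTP to Blob storage with a Copy activity and let the pipeline call an Azure Function (or a custom activity) for the actual IDoc parsing, since as far as I know ADF has no built-in flat-file IDoc parser. Below is only a minimal sketch of such a parser; the segment name, field offsets, and column names are hypothetical and would have to match your actual IDoc type.

```python
# Hypothetical sketch: parse a flat-file IDoc into rows for a SQL sink.
# The segment name and field offsets are placeholders; real ones depend
# on the IDoc type's segment definitions.
from dataclasses import dataclass
from typing import Iterator

SEGMENT_NAME_WIDTH = 30  # assumed width of the segment-name column


@dataclass
class Segment:
    name: str
    payload: str


def parse_idoc(text: str) -> Iterator[Segment]:
    """Yield one Segment per non-empty line of a flat-file IDoc."""
    for line in text.splitlines():
        if not line.strip():
            continue
        yield Segment(name=line[:SEGMENT_NAME_WIDTH].strip(),
                      payload=line[SEGMENT_NAME_WIDTH:])


def to_rows(text: str) -> list[dict]:
    """Keep only the segments of interest and slice out example fields."""
    rows = []
    for seg in parse_idoc(text):
        if seg.name.startswith("E2EDK01"):  # hypothetical header segment
            rows.append({
                "doc_currency": seg.payload[0:3].strip(),
                "doc_number": seg.payload[3:13].strip(),
            })
    return rows


if __name__ == "__main__":
    sample = "E2EDK01005".ljust(SEGMENT_NAME_WIDTH) + "EUR0000012345\n"
    print(to_rows(sample))
```

The rows produced this way could be written back to Blob storage as CSV and loaded into the SQL database with a plain Copy activity.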
Related
Use case: I have data files of varying size copied to a specific SFTP folder periodically (daily/weekly). All these files need to be validated and processed, and then written to the related tables in Azure SQL. The files are in CSV format and are actually flat text files, each corresponding directly to a specific table in Azure SQL.
Implementation:
I am planning to use Azure Data Factory. So far, from my reading, I can see that I can have a Copy pipeline in order to copy the data from the on-premises SFTP server to Azure Blob storage. Likewise, we can have an SSIS pipeline to copy data from an on-premises SQL Server to Azure SQL.
But I don't see an existing solution that achieves what I am looking for. Can someone provide some insight on how I can achieve this?
I would try to use Data Factory with a Data Flow to validate/process the files (if possible for your case). If the validation is too complex or depends on other components, then I would use Azure Functions and write the resulting files to blob storage (see the sketch below). The Copy activity is also able to import the resulting CSV files into SQL Server.
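If the validation does end up in a function, a blob-triggered Azure Function (Python v2 programming model) could look roughly like this sketch; the container names and the "amount must be numeric" rule are made up for illustration.

```python
# Rough sketch: blob-triggered Azure Function that validates incoming CSV
# files and writes only the valid rows to another container.
# Container names and the validation rule are hypothetical.
import csv
import io

import azure.functions as func

app = func.FunctionApp()


@app.blob_trigger(arg_name="inblob", path="incoming/{name}",
                  connection="AzureWebJobsStorage")
@app.blob_output(arg_name="outblob", path="validated/{name}",
                 connection="AzureWebJobsStorage")
def validate_csv(inblob: func.InputStream, outblob: func.Out[str]) -> None:
    reader = csv.DictReader(io.StringIO(inblob.read().decode("utf-8")))
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=reader.fieldnames)
    writer.writeheader()
    for row in reader:
        # Example rule only: keep rows whose 'Amount' column is numeric.
        if row.get("Amount", "").replace(".", "", 1).isdigit():
            writer.writerow(row)
    outblob.set(buffer.getvalue())
```

The Copy activity or a Data Flow can then pick the files up from the validated container and load them into Azure SQL.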
You can create a pipeline that does the following:
Copy data - Copy Files from SFTP to Blob Storage
Do Data processing/validation via Data Flow
and sink it directly to a SQL table (via the Data Flow sink)
Of course, you need an integration runtime that can access the on-premises server if it is not publicly accessible, either by using VNet integration or by using the self-hosted IR.
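If you ever need to kick off and monitor such a pipeline from code instead of a schedule or event trigger, the Data Factory management SDK can do it. A sketch only; the subscription, resource group, factory, and pipeline names are placeholders.

```python
# Sketch: start and poll an existing ADF pipeline from Python.
# All resource names below are placeholders.
import time

from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient

adf = DataFactoryManagementClient(DefaultAzureCredential(),
                                  "<subscription-id>")

run = adf.pipelines.create_run("my-resource-group", "my-factory",
                               "sftp_to_sql_pipeline", parameters={})

while True:
    status = adf.pipeline_runs.get("my-resource-group", "my-factory",
                                   run.run_id).status
    if status not in ("Queued", "InProgress"):
        break
    time.sleep(30)

print(f"Pipeline finished with status: {status}")
```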
I have been tasked with creating an Azure Data Factory pipeline that processes messages generated by an MQ farm, which are stored in data storage in .xml format, and then ingests them into a SharePoint table.
The question is: what would your approach be in this scenario to slice the .xml files into smaller pieces? The .xml files nest a lot of records in one file (with a valid separator on each record), and I wish to discard some while processing the valid ones.
P.S.: For receiving and storing the MQ farm messages, I am using a Logic App before Azure Data Factory.
OK, the solution was more obvious than I previously thought... I solved it from the Logic App designer and saved the results to blob.
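For reference, if the slicing ever has to move out of the Logic App, it can also be done in a few lines of code. This sketch streams through the file, keeps only valid records, and writes them out in smaller chunks; the <Record> element name and the validity rule are placeholders.

```python
# Sketch: split a large XML file of repeated records into smaller valid chunks.
# The element name "Record" and the validity rule are placeholders for
# whatever the MQ messages actually contain.
import xml.etree.ElementTree as ET

CHUNK_SIZE = 500  # records per output file


def is_valid(record: ET.Element) -> bool:
    # Example rule only: discard records without a MessageId element.
    return record.find("MessageId") is not None


def write_chunk(records: list, file_no: int) -> None:
    with open(f"records_part{file_no}.xml", "w", encoding="utf-8") as f:
        f.write("<Records>\n" + "\n".join(records) + "\n</Records>")


def split_xml(path: str) -> None:
    chunk, file_no = [], 0
    # iterparse streams the document, so very large files are fine too
    for _, elem in ET.iterparse(path, events=("end",)):
        if elem.tag != "Record":
            continue
        if is_valid(elem):
            chunk.append(ET.tostring(elem, encoding="unicode"))
        elem.clear()  # release memory for records already processed
        if len(chunk) >= CHUNK_SIZE:
            write_chunk(chunk, file_no)
            chunk, file_no = [], file_no + 1
    if chunk:
        write_chunk(chunk, file_no)
```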
So I have data providers who will upload files using Power Apps; then an ETL job will run, read the content of each file, and save it to a database in the cloud. I want a solution in which, when the schema of the file changes (a column is added, removed, or changed), the ETL job handles it by itself.
Is this possible in ADF?
Yes. You will use ADF Data Flows for this ETL scenario. https://learn.microsoft.com/en-us/azure/data-factory/concepts-data-flow-schema-drift
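In the Data Flow itself this is mostly declarative: enable "Allow schema drift" on the source and sink and use auto-mapping, as described in the linked article. Purely to illustrate what drift handling means, here is the same idea sketched in plain Python outside ADF, with a made-up target schema and rename map.

```python
# Illustration only (outside ADF): align whatever columns a file delivers
# to a fixed target schema, so added/removed/renamed columns don't break
# the load. Target columns and renames below are hypothetical.
import pandas as pd

TARGET_COLUMNS = ["customer_id", "name", "amount"]   # hypothetical target table
RENAMES = {"cust_id": "customer_id"}                 # known renames, if any


def load_with_drift(path: str) -> pd.DataFrame:
    df = pd.read_csv(path).rename(columns=RENAMES)
    extra = [c for c in df.columns if c not in TARGET_COLUMNS]
    if extra:
        print(f"New columns not in the target (ignore or add them): {extra}")
    # Missing columns become NULLs, extra ones are dropped, order is fixed.
    return df.reindex(columns=TARGET_COLUMNS)
```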
I am just going through some Microsoft documentation and doing hands-on work for data-engineering-related things.
I have a couple of queries about a scenario: copy CSV file(s) from Blob storage to Synapse Analytics (stage table(s)).
I read that we can pull data directly into Synapse by creating external tables (https://learn.microsoft.com/en-us/azure/synapse-analytics/sql-data-warehouse/load-data-wideworldimportersdw).
If the above is possible, in what cases do we use the Azure Data Factory Copy activity or the data flow method?
While working with Azure Data Factory, is it a good idea to use PolyBase, given that it will use Blob storage again for staging in this scenario (i.e. I am copying the file from Blob in the first place and would be using Blob again for staging)?
I searched for answers to my queries but haven't found any satisfactory answer yet.
If you're just loading data straight from CSV into the DW, use the Copy activity. PolyBase is recommended, but not always needed for small files.
If you need to transform that data or perform updates, then use data flows.
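As a side note: besides the ADF Copy activity with PolyBase, a Synapse dedicated SQL pool can also load CSVs that already sit in Blob storage with the T-SQL COPY statement, which avoids a second staging hop. A rough sketch, run from Python; the server, pool, table, and storage names are placeholders.

```python
# Sketch: load CSVs that are already in Blob storage straight into a Synapse
# stage table with the T-SQL COPY statement. All names are placeholders.
import pyodbc

SQL = """
COPY INTO dbo.StageSales
FROM 'https://<account>.blob.core.windows.net/<container>/sales/*.csv'
WITH (
    FILE_TYPE = 'CSV',
    FIRSTROW = 2,                                 -- skip the header row
    CREDENTIAL = (IDENTITY = 'Managed Identity')  -- or a SAS token
);
"""

conn = pyodbc.connect(
    "Driver={ODBC Driver 18 for SQL Server};"
    "Server=<workspace>.sql.azuresynapse.net;Database=<pool>;"
    "Authentication=ActiveDirectoryInteractive;"
)
conn.execute(SQL)
conn.commit()
```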
I am building an Azure IoT solution for my BI project. For now I have an application that, once per set time window, sends a .csv blob to Azure Blob Storage with an incremental number in the name. So after some time I will have files in my storage such as 'data1.csv', 'data2.csv', 'data3.csv', etc.
Now I need to load this data into a database, which will be my warehouse, with the use of an Azure Stream Analytics job. The issue is that the .csv files will have overlapping data: they will be sent every 4 hours and contain data for the past 24 hours. I need to always read only the last file (the one with the highest number) and prepare a lookup so it properly updates the data in the warehouse. What would be the best approach to make Stream Analytics read only the latest file and update the records in the DB?
EDIT:
To clarify: I am fully aware that ASA is not capable of acting as an ETL job. My question is what the best approach would be for my case using IoT tools.
I would suggest one of these two approaches:
Use ASA to write to a temporary SQL table, and then use a SQL trigger to update the main table of the DW with the diff.
Or remove duplicates by adding a unique constraint as described here:
https://blogs.msdn.microsoft.com/streamanalytics/2017/01/13/how-to-achieve-exactly-once-delivery-for-sql-output/
Thanks,
JS - Azure Stream Analytics
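As a follow-up on the "read only the latest file" part of the question: if a small piece of code (for example an Azure Function on a timer) sits in front of the load, picking the highest-numbered blob is straightforward. A sketch, with the connection string and container name as placeholders:

```python
# Sketch: find the data<N>.csv blob with the highest N in a container.
# Connection string and container name are placeholders.
import re

from azure.storage.blob import ContainerClient

container = ContainerClient.from_connection_string(
    "<storage-connection-string>", container_name="iot-data")


def latest_blob_name() -> str:
    pattern = re.compile(r"data(\d+)\.csv$")
    numbered = []
    for blob in container.list_blobs():
        m = pattern.match(blob.name)
        if m:
            numbered.append((int(m.group(1)), blob.name))
    # Raises ValueError if no matching blobs exist yet.
    return max(numbered)[1]


print(latest_blob_name())
```

The duplicate rows caused by the 24-hour overlap can then still be handled on the SQL side with the unique-constraint approach from the linked post.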