Azure Data Factory - MQ data flow - azure

I have been tasked with creating an Azure Data Factory pipeline that processes messages generated by an MQ farm, which are stored in Data Storage in .xml format, and then ingests them into a SharePoint table.
The question is: how would you approach slicing the .xml files into smaller pieces in this scenario? Each .xml file nests a lot of records (with a valid separator on each record), and I wish to discard some of them while processing the valid ones.
P.S.: To receive and store the MQ farm messages, I am using a Logic App that runs before Azure Data Factory.

OK, the solution was more obvious than I previously thought: I solved it in the Logic App designer and saved the result to blob storage.
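For readers who would rather do the slicing in code (for example in an Azure Function called from the pipeline), here is a minimal Python sketch of the idea in the question: split one large XML file into one small document per record and drop the records that are not wanted. The element names `Messages`, `Message` and `Status`, and the rule "keep records whose Status is OK", are hypothetical; the original post does not show the file layout.

```python
import xml.etree.ElementTree as ET


def split_records(xml_text, record_tag, keep):
    """Return each kept record as its own small XML string."""
    root = ET.fromstring(xml_text)
    pieces = []
    for record in root.iter(record_tag):
        if keep(record):
            # tostring() also emits the whitespace that follows the element,
            # so strip it to get a clean standalone fragment.
            pieces.append(ET.tostring(record, encoding="unicode").strip())
    return pieces


# Hypothetical sample file: three records, one of which should be discarded.
sample = """<Messages>
  <Message><Status>OK</Status><Body>first</Body></Message>
  <Message><Status>ERROR</Status><Body>second</Body></Message>
  <Message><Status>OK</Status><Body>third</Body></Message>
</Messages>"""

valid = split_records(sample, "Message", lambda r: r.findtext("Status") == "OK")
for piece in valid:
    print(piece)
```

Each string in `valid` could then be written to its own blob for the downstream pipeline to pick up. For very large files, `ET.iterparse` would avoid loading the whole document into memory at once.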

Related

Processing SAP IDoc files in Azure Synapse/Data Factory

I'm looking for a way to process SAP IDoc files via an Azure Data Factory or Azure Synapse pipeline. The solution should be capable of extracting data from the raw files taken from SFTP (in IDoc format) and writing the processed data to a SQL database. I know there are SAP connectors available, but the issue is that the files (IDocs) should be extracted not from the SAP system directly but from the SFTP folder in which they're stored, in the thousands, daily. I'm not an expert in Azure Data Factory/Synapse, but these services allow you to build a decent processing pipeline for all sorts of ETL processing. So I wonder whether they can also handle the data parsing, or whether I'm looking in the wrong place.
Please advise.

Process event files into Azure EventHub

I am fairly new to Azure.
I have a requirement where the source will send event data in flat files. Each file will contain header and trailer records, with the events as data records. Each file will be about 10 MB in size and can contain about 50,000-60,000 events.
I want to process these files using Python/Scala and send the data to Azure Event Hubs. Can someone tell me whether this is the best solution, and how I can achieve it?
It's an architectural question, but you can use either Azure Logic Apps or Azure Functions.
First of all, trigger whichever one you choose by uploading a file to Blob Storage. The file will get picked up, processed, and then sent.
Use Azure Logic Apps if the files are simple to parse (for instance, because they are JSON files); then simply loop over each event and direct it to the event hub you want.
If parsing the files is more complex, use Azure Functions: write the code and output the events to an event hub.
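As a rough illustration of the Azure Functions route, the parsing step could look like the Python sketch below: skip the header and trailer records, then group the remaining events into size-limited batches before sending. The `HDR`/`TRL` prefixes, the pipe-delimited layout and the `max_bytes` value are assumptions made for the example, and `send_batch` is a placeholder for whatever Event Hubs client call you use (the `azure-eventhub` package, for instance); no real SDK call is shown here.

```python
def parse_events(lines, header_prefix="HDR", trailer_prefix="TRL"):
    """Yield data records, skipping blank lines, the header and the trailer."""
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith((header_prefix, trailer_prefix)):
            continue
        yield line


def batches(events, max_bytes):
    """Group events into lists whose total UTF-8 size stays within max_bytes."""
    batch, size = [], 0
    for event in events:
        event_size = len(event.encode("utf-8"))
        if batch and size + event_size > max_bytes:
            yield batch
            batch, size = [], 0
        batch.append(event)
        size += event_size
    if batch:
        yield batch


def send_file(lines, send_batch, max_bytes=200_000):
    """Parse one file and hand each batch to send_batch; return the event count."""
    count = 0
    for batch in batches(parse_events(lines), max_bytes):
        send_batch(batch)  # placeholder for the real Event Hubs send
        count += len(batch)
    return count


# Hypothetical file content; a tiny max_bytes forces more than one batch.
demo_lines = ["HDR|20240101", "EVT|1|a", "EVT|2|b", "EVT|3|c", "TRL|3"]
sent = []
total = send_file(demo_lines, sent.append, max_bytes=16)
print(total, sent)
```

Batching matters here because sending 50,000-60,000 events one request at a time would be slow; the batch size limit to use depends on the Event Hubs tier and is not something this sketch determines.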

Data from HTTP endpoint to be loaded into Azure Data Lake using Azure Data Factory

I am trying to build a so-called "modern data warehouse" using Azure services.
The first step is to gather all the data in its native raw format into Azure Data Lake Store. For some of the data sources we have no choice but to consume the data through an API. I could not find much information by searching, so I am asking here.
Is it possible to define 2 Web activities in my pipeline that will handle the scenario below?
The Web1 activity gets an API URL generated from C# (Azure Function). It returns data in JSON format and saves it to Web1.Output - this is working fine.
The Web2 activity consumes Web1.Output and saves it into Azure Data Lake as a plain txt file (PUT or POST) - this is needed.
The above scenario is achievable with a Copy activity, but then I am not able to pass in the dynamic URL generated by Azure Functions. How do I save the JSON output to ADL? Is there any other way?
Thanks!
Since you are using blob storage as an intermediary and want to consume the blob upon creation, you could take advantage of event triggers. You can set up an event trigger to run a pipeline containing the Web2 activity, which kicks off when the Web1 activity (in a separate pipeline) completes.
By splitting the two activities into separate pipelines, the workflow becomes asynchronous. This means you will not need to wait for both activities to complete before processing the next URL. There are many other benefits as well.

Azure Data Factory and SharePoint

I have some Excel files stored in SharePoint Online. I want to copy the files stored in SharePoint folders to Azure Blob storage.
To achieve this, I am creating a new pipeline in Azure Data Factory using the Azure Portal. What are the possible ways to copy files from SharePoint to Azure Blob storage using Azure Data Factory pipelines?
I have looked at all the linked service types in Azure Data Factory but couldn't find any suitable type to connect to SharePoint.
Rather than directly accessing the file in SharePoint from Data Factory, you might have to use an intermediate technology and have Data Factory call that. You have a few options:
Use a Logic App to move the file.
Use an Azure Function.
Use a custom activity and write your own C# to copy the file.
To call a Logic App from ADF, you use a Web activity.
You can now call an Azure Function directly from ADF.
We can create a linked service of type 'File system' by providing the directory URL as 'Host' value. To authenticate the user, provide username and password/AKV details.
Note: Use Self-hosted IR
You can use a Logic App to fetch the data from SharePoint and load it into Azure Blob storage, and then use Azure Data Factory to read the data from the blob. You can also set an event trigger so that the pipeline runs automatically whenever a file arrives in the blob container.
You can use Power Automate (https://make.powerautomate.com/) to do this task automatically:
Create an automated cloud flow that triggers whenever a new file is dropped in SharePoint.
Use whichever of the available triggers fits your requirement and fill in the SharePoint details.
Add an action to create a blob and fill in the details as per your use case.
This way you copy the SharePoint files to Blob storage without using ADF at all.
My previous answer was true at the time, but in the last few years Microsoft has published guidance on how to copy documents from a SharePoint library. You can copy a file from SharePoint Online by using a Web activity to authenticate and grab an access token from SPO, then passing that token to a subsequent Copy activity that copies the data with the HTTP connector as the source.
I ran into some issues with large files and Logic Apps. It turned out there were some extremely large files to be copied from that SharePoint library. SharePoint has a default limit of 100 MB buffer size, and the Get File Content action doesn’t natively support chunking.
I successfully pulled the files with the web activity and copy activity. But I found the SharePoint permissions configuration to be a bit tricky. I blogged my process here.
You can use a binary dataset if you just want to copy the full file rather than read the data.
If my file is located at https://mytenant.sharepoint.com/sites/site1/libraryname/folder1/folder2/folder3/myfile.CSV, the URL I need to retrieve the file is https://mytenant.sharepoint.com/sites/site1/_api/web/GetFileByServerRelativeUrl('/sites/site1/libraryname/folder1/folder2/folder3/myfile.CSV')/$value.
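The mapping from a file's browser URL to its download URL can be sketched in Python as below. It assumes the SharePoint REST pattern `<site>/_api/web/GetFileByServerRelativeUrl('<server-relative path>')/$value`, that you know the site path (`/sites/site1` here), and that the path contains no characters needing extra escaping (spaces, single quotes); the function name is made up for the example.

```python
from urllib.parse import urlsplit


def sharepoint_download_url(file_url, site_path):
    """Build the REST URL that returns the file's content ($value)."""
    parts = urlsplit(file_url)
    site_path = site_path.rstrip("/")
    if not parts.path.startswith(site_path + "/"):
        raise ValueError("file is not under the given site path")
    site = f"{parts.scheme}://{parts.netloc}{site_path}"
    # The server-relative path starts at /sites/..., not at the library name.
    return f"{site}/_api/web/GetFileByServerRelativeUrl('{parts.path}')/$value"


download_url = sharepoint_download_url(
    "https://mytenant.sharepoint.com/sites/site1/libraryname/folder1/folder2/folder3/myfile.CSV",
    "/sites/site1",
)
print(download_url)
```

In ADF itself the same string would normally be assembled with a dynamic-content expression on the HTTP dataset rather than in Python; the sketch only shows which parts of the URL go where.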
Be careful about when you get your auth token. Your auth token is valid for 1 hour. If you copy a bunch of files sequentially, and it takes longer than that, you might get a timeout error.
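One way to avoid the expiry problem is to fetch a fresh token shortly before the old one runs out instead of once at the start. In an ADF pipeline that usually means placing the token-fetching Web activity inside the loop; the Python sketch below shows the same idea in code form. `fetch_token` is a placeholder for the real token request, the 3600-second lifetime comes from the 1-hour figure above, and the 300-second safety margin is an arbitrary assumption.

```python
import time


class TokenCache:
    """Reuse a token until it is close to expiry, then fetch a new one."""

    def __init__(self, fetch_token, lifetime_s=3600, margin_s=300, clock=time.monotonic):
        self._fetch = fetch_token
        self._lifetime = lifetime_s
        self._margin = margin_s
        self._clock = clock
        self._token = None
        self._expires_at = 0.0

    def get(self):
        now = self._clock()
        if self._token is None or now >= self._expires_at - self._margin:
            self._token = self._fetch()
            self._expires_at = now + self._lifetime
        return self._token


# Demo with a fake clock and a fake token source (both hypothetical).
fake_now = [0.0]
issued = []


def fake_fetch():
    issued.append(f"token-{len(issued) + 1}")
    return issued[-1]


cache = TokenCache(fake_fetch, clock=lambda: fake_now[0])
first = cache.get()    # fetched at t = 0
fake_now[0] = 1000.0
second = cache.get()   # still well inside the lifetime, so reused
fake_now[0] = 3400.0
third = cache.get()    # inside the 300 s margin before expiry, so refreshed
print(first, second, third)
```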

Azure IoT data warehouse updates

I am building an Azure IoT solution for my BI project. For now I have an application that, once per set time window, sends a .csv blob to Azure Blob Storage with an incrementing number in its name. So after some time my storage will contain files such as 'data1.csv', 'data2.csv', 'data3.csv', etc.
Now I need to load this data into a database, which will be my warehouse, using an Azure Stream Analytics job. The issue is that the .csv files will contain overlapping data: they will be sent every 4 h and contain data for the past 24 h. I need to always read only the last file (the one with the highest number) and prepare a lookup so that it properly updates the data in the warehouse. What is the best approach to make Stream Analytics read only the latest file and to update the records in the DB?
EDIT:
To clarify - I am fully aware that ASA is not meant to be an ETL tool. My question is what the best approach would be for my case using IoT tools.
I would suggest one of these 2 ways:
Use ASA to write to a temporary SQL table, and then use a SQL trigger to update the main table of the DW with the diff.
Or remove duplicates by adding a unique constraint, as described here:
https://blogs.msdn.microsoft.com/streamanalytics/2017/01/13/how-to-achieve-exactly-once-delivery-for-sql-output/
Thanks,
JS - Azure Stream Analytics
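Separately from the ASA suggestions, the "read only the file with the highest number" part of the question could be handled by a small pre-processing step (for example an Azure Function) that picks the latest blob name. A minimal Python sketch, assuming the `data<N>.csv` naming from the question; the function name is made up. The number is compared as an integer because a plain string sort would rank 'data9.csv' above 'data10.csv'.

```python
import re


def latest_file(names, pattern=r"^data(\d+)\.csv$"):
    """Return the name with the highest numeric suffix, or None if none match."""
    numbered = []
    for name in names:
        match = re.match(pattern, name)
        if match:
            numbered.append((int(match.group(1)), name))
    if not numbered:
        return None
    return max(numbered)[1]


newest = latest_file(["data1.csv", "data2.csv", "data10.csv", "data9.csv", "notes.txt"])
print(newest)
```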
