How to archive my on-premises files once they are processed into Azure Data Lake? - azure

I have a pipeline activity that processes my on-premises files from a file share into Azure Data Lake. Once the data has been processed into the data lake, I want to archive the source files automatically.
STEP 1: I have built a logic app that automatically copies any new data from the IN folder to the OUT directory and deletes the old files in the IN directory.
STEP 2: I have built my pipeline, which processes data inside my data lake.
Now, how can I trigger my logic app from an activity inside the pipeline, so that calling it archives and deletes my on-premises files?
Please suggest.

The ADF Web activity can be used to call a custom REST endpoint from a Data Factory pipeline.
As far as I know, a logic app can listen for HTTP requests, so you could chain a Web activity after your data processing activity.
An ADF Custom activity runs your own code on an Azure Batch pool of virtual machines, so you could also put your trigger logic into a Custom activity.
Another option is to invoke your logic app on a schedule, such as daily or weekly.
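As a rough illustration of chaining a Web activity after the processing step, here is a minimal sketch that builds the activity definition as a Python dict and prints it as JSON. The activity names, the payload, and the URL are hypothetical placeholders; the real URL would be the callback URL of the logic app's HTTP request trigger.

```python
import json


def build_logic_app_web_activity(logic_app_url, depends_on, payload):
    """Return an ADF Web activity definition that POSTs to a logic app.

    logic_app_url: callback URL of the logic app's HTTP request trigger
    depends_on:    name of the activity that must succeed first
    payload:       JSON-serializable body sent to the logic app
    """
    return {
        "name": "ArchiveSourceFiles",  # hypothetical activity name
        "type": "WebActivity",
        # Run only after the data processing activity has succeeded.
        "dependsOn": [
            {"activity": depends_on, "dependencyConditions": ["Succeeded"]}
        ],
        "typeProperties": {
            "url": logic_app_url,
            "method": "POST",
            "headers": {"Content-Type": "application/json"},
            "body": payload,
        },
    }


if __name__ == "__main__":
    activity = build_logic_app_web_activity(
        "https://example.invalid/logic-app-callback",  # placeholder URL
        "ProcessIntoDataLake",                         # hypothetical name
        {"folder": "IN"},                              # hypothetical payload
    )
    print(json.dumps(activity, indent=2))
```

The printed JSON can be compared against the activity that the ADF authoring UI generates; the dependency on "Succeeded" is what keeps the archive step from running when processing fails.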

Related

How to get notified when a file is uploaded in ftp location using Azure and start copy job?

I need to start a copy job whenever a desired file is uploaded to an FTP location. Is there any way in ADF, other than Logic Apps, to be notified that the file is available in that location (for example, if the file is available, then run this copy job)? Can anyone please post some suggestions?
Azure Data Factory event-based triggers do not support FTP, so this cannot be done directly from Azure Data Factory.
As you said, you might need to rely on another process, such as Logic Apps, Azure Functions, or Azure Automation, to check for the file landing on FTP and kick off the ADF pipeline execution.
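If the checking process is a small script (for example inside an Azure Function or an Automation runbook), the "has the file landed" check could look roughly like the sketch below. It uses Python's standard ftplib; the host, credentials, directory, and file name are placeholders, and starting the pipeline run is left as a comment.

```python
import posixpath
from ftplib import FTP


def file_has_landed(ftp, directory, filename):
    """Return True if `filename` appears in `directory` on the FTP server.

    `ftp` is any object with an nlst(path) method, such as ftplib.FTP.
    """
    names = ftp.nlst(directory)
    # Some servers return full paths, others bare names; compare base names.
    return filename in {posixpath.basename(name) for name in names}


if __name__ == "__main__":
    # Placeholder host and credentials.
    with FTP("ftp.example.invalid") as ftp:
        ftp.login("user", "password")
        if file_has_landed(ftp, "/incoming", "desired.csv"):
            print("File is available; start the ADF pipeline run here.")
```

A scheduler would call this every few minutes; some servers report an error instead of an empty list for an empty directory, which this sketch does not handle.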

Use of Azure Grid Events to trigger ADF Pipe to move On-premises CSV files to Azure database

We have a series of CSV files landing every day (daily delta), and these need to be loaded into an Azure database using Azure Data Factory (ADF). We have created an ADF pipeline that moves data straight from an on-premises folder to an Azure DB table, and it is working.
Now we need this pipeline to be executed based on an event, not on a schedule: specifically, the creation of a particular file in the same local folder. This file is created when the landing of the daily delta files is complete. Let's call it SRManifest.csv.
The question is, how do I create a trigger that starts the pipeline when SRManifest.csv is created? I have looked into Azure Event Grid, but it seems it doesn't work with on-premises folders.
You're right that you cannot configure an Event Grid trigger to watch local files, since you're not writing to Azure Storage. You'd need to generate your own signal after writing your local file content.
Aside from timer-based triggers, Event-based triggers are tied to Azure Storage, so the only way to use that would be to drop some type of "signal" file in a well-known storage location, after your files are written locally, to trigger your ADF pipeline to run.
Alternatively, you can trigger an ADF pipeline programmatically (.NET and Python SDKs support this; maybe other ones do as well, plus there's a REST API). Again, you'd have to build this, and run your trigger program after your local content has been created. If you don't want to write a program, you can use PowerShell (via Invoke-AzDataFactoryV2Pipeline).
There are other tools/services that integrate with Data Factory as well; I wasn't attempting to provide an exhaustive list.
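For the REST API route, a minimal sketch using only the Python standard library is shown below. It builds the "create run" request for a pipeline but does not send it. The subscription, resource group, factory, and pipeline names are placeholders, and obtaining the Azure AD bearer token is assumed to happen elsewhere.

```python
import json
import urllib.request

API_VERSION = "2018-06-01"  # Data Factory REST API version


def build_create_run_request(subscription_id, resource_group, factory,
                             pipeline, token, parameters=None):
    """Build (but do not send) the request that starts a pipeline run."""
    url = (
        "https://management.azure.com"
        f"/subscriptions/{subscription_id}"
        f"/resourceGroups/{resource_group}"
        f"/providers/Microsoft.DataFactory/factories/{factory}"
        f"/pipelines/{pipeline}/createRun?api-version={API_VERSION}"
    )
    # Pipeline parameters, if any, go in the JSON request body.
    body = json.dumps(parameters or {}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )


if __name__ == "__main__":
    req = build_create_run_request(
        "<subscription-id>", "<resource-group>", "<factory>", "<pipeline>",
        token="<bearer-token>",  # assumed to be acquired separately
    )
    print(req.get_method(), req.full_url)
    # To actually start the run: urllib.request.urlopen(req)
```

The script would run right after SRManifest.csv has been written locally, which makes the manifest file itself the signal.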
Have a look at the File System connector triggers in Azure Logic Apps.

Using Azure Data Factory to ingest incoming data from a REST API

Is there a way to create an ADF pipeline to ingest incoming POST requests? I have a gateway app (outside Azure) that can publish data via REST as it arrives from the application, and this data needs to be ingested into a Data Lake. I am using REST calls from another pipeline to pull data, but this needs to do the reverse: the data will be pushed, and I need to be constantly 'listening' for those calls...
Is this something an ADF pipeline should do or maybe there are any other Azure components able to do it?
The previous comment is right, and that is one approach to get this working, but it needs a bit of coding (for the Azure Function).
An alternative solution that meets your requirement uses Azure Logic Apps and Azure Data Factory.
Step 1: Create an HTTP-triggered logic app that your gateway app invokes; the data is posted to this REST-callable endpoint.
Step 2: Create an ADF pipeline with a parameter that holds the data to be pushed to the data lake. It can be raw data, and it can be transformed as a step within the pipeline before being pushed to the data lake.
Step 3: Once the logic app is triggered, use the Azure Data Factory actions to invoke the pipeline created in step 2 and pass the posted data as a pipeline parameter.
That should be it; with this you can spin up a code-less solution.
If your outside application is already pushing via REST, why not have it make calls directly to the Data Lake REST APIs? This would cut out the middle steps and bring everything under your control.
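To give an idea of what calling the Data Lake REST API directly involves, here is a minimal sketch that assumes the lake is Azure Data Lake Storage Gen2 (the thread does not say which generation is in use). Writing a file there takes three calls: create the path, append the bytes, then flush. The account, filesystem, and path names are placeholders, the bearer token is assumed to be acquired separately, and the x-ms-version value should be checked against the current storage documentation.

```python
import urllib.request


def build_adls_upload_requests(account, filesystem, path, data, token):
    """Build (but do not send) the create, append, and flush requests."""
    base = f"https://{account}.dfs.core.windows.net/{filesystem}/{path}"
    headers = {
        "Authorization": f"Bearer {token}",
        "x-ms-version": "2020-06-12",  # assumption: verify before use
    }
    # 1. Create an empty file at the target path.
    create = urllib.request.Request(
        f"{base}?resource=file", method="PUT", headers=headers)
    # 2. Append the payload starting at offset 0.
    append = urllib.request.Request(
        f"{base}?action=append&position=0",
        data=data, method="PATCH", headers=headers)
    # 3. Flush, passing the final length, to commit the appended bytes.
    flush = urllib.request.Request(
        f"{base}?action=flush&position={len(data)}",
        method="PATCH", headers=headers)
    return [create, append, flush]


if __name__ == "__main__":
    requests_to_send = build_adls_upload_requests(
        "<account>", "<filesystem>", "incoming/event.json",
        b'{"example": true}', "<bearer-token>")
    for req in requests_to_send:
        print(req.get_method(), req.full_url)
```

In practice an SDK wraps these three calls, so the gateway app would only need this level of detail if it cannot take a dependency on one.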
Azure Data Factory is a batch data movement service. If you want to push the data over HTTP, you can implement a simple Azure Function to accept the data and write it to the Azure Data Lake.
See Azure Functions HTTP triggers and bindings overview

Data from HTTP endpoint to be loaded into Azure Data Lake using Azure Data Factory

I am trying to build a so called "modern data warehouse" using Azure services.
First step is to gather all the data in its native raw format into Azure Data Lake store. For some of the data sources we have no other choice than to use API for consuming the data. There's not much information when searching, therefore I am asking.
Is it possible to define 2 Web Activities in my pipeline that will handle below scenario?
Web1 activity gets an API URL generated from C# (Azure Function). It returns data in JSON format and saves it to Web1.Output - this is working fine.
Web2 activity consumes Web1.Output and saves it into Azure Data Lake as a plain txt file (PUT or POST) - this is needed.
Above scenario is achievable by using Copy activity, but then I am not able to pass dynamic URL generated by Azure Functions. How do I save the JSON output to ADL? Is there any other way?
Thanks!
Since you are using blob storage as an intermediary and want to consume the blob upon creation, you could take advantage of event triggers. You can set up the event trigger to run a pipeline containing the Web2 activity, which kicks off when the Web1 activity (in a separate pipeline) completes.
By separating the two activities into separate pipelines, the workflow becomes asynchronous. This means you will not need to wait for both activities to complete before processing the next URL. There are many other benefits as well.
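A storage event trigger of the kind described above could be defined roughly as in the sketch below, which builds the trigger definition as a Python dict. The trigger name, container, folder, and pipeline name are hypothetical, and the scope must be the full resource ID of the storage account that receives the Web1 output.

```python
import json


def build_blob_created_trigger(storage_account_id, container, folder,
                               pipeline_name):
    """Return a trigger definition that fires when a blob is created."""
    return {
        "name": "OnWeb1OutputCreated",  # hypothetical trigger name
        "properties": {
            "type": "BlobEventsTrigger",
            "typeProperties": {
                # Only blobs under this container/folder fire the trigger.
                "blobPathBeginsWith": f"/{container}/blobs/{folder}/",
                "ignoreEmptyBlobs": True,
                "scope": storage_account_id,
                "events": ["Microsoft.Storage.BlobCreated"],
            },
            # The pipeline that contains the Web2 activity.
            "pipelines": [
                {
                    "pipelineReference": {
                        "referenceName": pipeline_name,
                        "type": "PipelineReference",
                    }
                }
            ],
        },
    }


if __name__ == "__main__":
    trigger = build_blob_created_trigger(
        "/subscriptions/<subscription-id>/resourceGroups/<resource-group>"
        "/providers/Microsoft.Storage/storageAccounts/<account>",
        "staging", "web1-output", "Web2Pipeline")
    print(json.dumps(trigger, indent=2))
```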

Azure Data Factory and SharePoint

I have some Excel files stored in SharePoint Online. I want to copy files stored in SharePoint folders to Azure Blob storage.
To achieve this, I am creating a new pipeline in Azure Data Factory using the Azure portal. What are the possible ways to copy files from SharePoint to Azure Blob storage using Azure Data Factory pipelines?
I have looked at all the linked service types in Azure Data Factory but couldn't find a suitable type to connect to SharePoint.
Rather than directly accessing the file in SharePoint from Data Factory, you might have to use an intermediate technology and have Data Factory call that. You have a few options:
Use a Logic App to move the file
Use an Azure Function
Use a custom activity and write your own C# to copy the file.
To call a Logic App from ADF, you use a web activity.
You can directly call an Azure Function now.
We can create a linked service of type 'File system' by providing the directory URL as 'Host' value. To authenticate the user, provide username and password/AKV details.
Note: Use Self-hosted IR
You can use a logic app to fetch data from SharePoint and load it into Azure Blob storage, and then use Azure Data Factory to fetch the data from the blob. You can even set an event trigger so that the pipeline runs automatically whenever a file arrives in the blob container.
You can use Power Automate (https://make.powerautomate.com/) to do this task automatically:
Create an automated cloud flow that triggers whenever a new file is dropped in a SharePoint location
Use whichever of the available triggers fits your requirement and fill in the SharePoint details
Add an action to create a blob and fill in the details as per your use case
This way, the SharePoint files are copied to the blob without using ADF at all.
My previous answer was true at the time, but in the last few years Microsoft has published guidance on how to copy documents from a SharePoint library. You can copy a file from SharePoint Online by using a Web activity to authenticate and grab an access token from SPO, then passing that token to a subsequent Copy activity that uses the HTTP connector as its source.
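The token request that the Web activity makes can be sketched as below. The endpoint and body layout follow my recollection of Microsoft's guidance for SharePoint Online and should be verified against the current documentation; the tenant ID, tenant name, client ID, and client secret are placeholders for an app registration that has been granted access to the site.

```python
from urllib.parse import urlencode

# Fixed identifier of the SharePoint Online resource principal
# (as I recall it from the guidance; verify before use).
SHAREPOINT_PRINCIPAL = "00000003-0000-0ff1-ce00-000000000000"


def build_token_request(tenant_id, tenant_name, client_id, client_secret):
    """Return (url, headers, body) for the SharePoint token request."""
    url = (
        "https://accounts.accesscontrol.windows.net/"
        f"{tenant_id}/tokens/OAuth/2"
    )
    body = urlencode({
        "grant_type": "client_credentials",
        "client_id": f"{client_id}@{tenant_id}",
        "client_secret": client_secret,
        "resource": (
            f"{SHAREPOINT_PRINCIPAL}/{tenant_name}.sharepoint.com@{tenant_id}"
        ),
    })
    headers = {"Content-Type": "application/x-www-form-urlencoded"}
    return url, headers, body


if __name__ == "__main__":
    url, headers, body = build_token_request(
        "<tenant-id>", "mytenant", "<client-id>", "<client-secret>")
    print("POST", url)
    print(headers)
    print(body)
```

In the pipeline, the same URL, header, and body go into the Web activity, and the Copy activity then sends the returned access token as a Bearer value in its Authorization header.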
I ran into some issues with large files and Logic Apps. It turned out there were some extremely large files to be copied from that SharePoint library. SharePoint has a default limit of 100 MB buffer size, and the Get File Content action doesn’t natively support chunking.
I successfully pulled the files with the Web activity and Copy activity, but I found the SharePoint permissions configuration to be a bit tricky. I blogged about my process.
You can use a binary dataset if you just want to copy the full file rather than read the data.
If my file is located at https://mytenant.sharepoint.com/sites/site1/libraryname/folder1/folder2/folder3/myfile.CSV, the URL I need to retrieve the file is https://mytenant.sharepoint.com/sites/site1/_api/web/GetFileByServerRelativeUrl('/sites/site1/libraryname/folder1/folder2/folder3/myfile.CSV')/$value.
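The mapping from a file's browser URL to its download URL can be expressed as a small helper. This is a sketch that assumes the SharePoint REST call GetFileByServerRelativeUrl with the server-relative path of the file; the function name is made up for illustration, and file names containing single quotes are not handled.

```python
from urllib.parse import urlsplit


def sharepoint_file_api_url(site_url, file_url):
    """Build the REST URL that returns the raw content of a file.

    site_url: URL of the SharePoint site, e.g. https://host/sites/site1
    file_url: URL of the file as it appears in the browser
    """
    site_url = site_url.rstrip("/")
    # The API expects the path relative to the server root,
    # including the /sites/<site> prefix.
    server_relative_path = urlsplit(file_url).path
    return (
        f"{site_url}/_api/web/"
        f"GetFileByServerRelativeUrl('{server_relative_path}')/$value"
    )


if __name__ == "__main__":
    print(sharepoint_file_api_url(
        "https://mytenant.sharepoint.com/sites/site1",
        "https://mytenant.sharepoint.com/sites/site1/libraryname/"
        "folder1/folder2/folder3/myfile.CSV",
    ))
```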
Be careful about when you get your auth token. Your auth token is valid for 1 hour. If you copy a bunch of files sequentially, and it takes longer than that, you might get a timeout error.
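One way to guard against the token expiring mid-run is to refresh it shortly before the hour is up. The sketch below shows that idea in a standalone script rather than in ADF itself (inside a pipeline, the equivalent is re-running the token Web activity, for example within each loop iteration). The TokenCache class and the 55-minute threshold are illustrative choices, not part of any Azure library.

```python
import time


class TokenCache:
    """Cache a token and fetch a new one once it is close to expiry."""

    def __init__(self, fetch_token, max_age_seconds=55 * 60,
                 clock=time.monotonic):
        self._fetch = fetch_token        # callable returning a new token
        self._max_age = max_age_seconds  # refresh before the 1-hour limit
        self._clock = clock
        self._token = None
        self._fetched_at = 0.0

    def get(self):
        now = self._clock()
        if self._token is None or now - self._fetched_at >= self._max_age:
            self._token = self._fetch()
            self._fetched_at = now
        return self._token


if __name__ == "__main__":
    cache = TokenCache(lambda: "<token from the auth endpoint>")
    for name in ["a.csv", "b.csv"]:
        token = cache.get()  # refreshed automatically when too old
        print(f"copy {name} using the cached token")
```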
