Use of Azure Event Grid to trigger an ADF pipeline to move on-premises CSV files to an Azure database

We have a series of CSV files landing every day (a daily delta) that need to be loaded into an Azure database using Azure Data Factory (ADF). We have created an ADF pipeline which moves data straight from an on-premises folder to an Azure DB table, and it is working.
Now we need this pipeline to execute based on an event rather than a scheduled time: specifically, the creation of a specific file in the same local folder. This file is created when the daily delta files have finished landing. Let's call it SRManifest.csv.
The question is: how do we create a trigger to start the pipeline when SRManifest.csv is created? I have looked into Azure Event Grid, but it seems it doesn't work with on-premises folders.

You're right that you cannot configure an Event Grid trigger to watch local files, since you're not writing to Azure Storage. You'd need to generate your own signal after writing your local file content.
Aside from timer-based triggers, Event-based triggers are tied to Azure Storage, so the only way to use that would be to drop some type of "signal" file in a well-known storage location, after your files are written locally, to trigger your ADF pipeline to run.
Alternatively, you can trigger an ADF pipeline programmatically (.NET and Python SDKs support this; maybe other ones do as well, plus there's a REST API). Again, you'd have to build this, and run your trigger program after your local content has been created. If you don't want to write a program, you can use PowerShell (via Invoke-AzDataFactoryV2Pipeline).
There are other tools/services that integrate with Data Factory as well; I wasn't attempting to provide an exhaustive list.
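For illustration, here is a minimal sketch of the programmatic route using the Python SDK (azure-mgmt-datafactory): poll the local folder until SRManifest.csv appears, then start a pipeline run. The local path, subscription, resource group, factory, and pipeline names are all placeholders.
```python
# watch_manifest.py -- hedged sketch: wait for SRManifest.csv, then start the
# ADF pipeline run via the Python SDK. All Azure resource names are placeholders.
import os
import time

from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient

MANIFEST = r"C:\landing\daily\SRManifest.csv"    # hypothetical local path

adf = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")

# Wait for the manifest file that marks the end of the daily delta landing.
while not os.path.exists(MANIFEST):
    time.sleep(60)

run = adf.pipelines.create_run(
    resource_group_name="<resource-group>",
    factory_name="<data-factory-name>",
    pipeline_name="CopyDailyDeltaPipeline",      # hypothetical pipeline name
)
print(f"Started pipeline run {run.run_id}")
```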

Have a look at the File System connector triggers for Azure Logic Apps; they can watch an on-premises folder (via the on-premises data gateway).

Related

How to create a trigger (for Data Factory, an Azure Function, or Databricks) when 500 new files have landed in an Azure Storage blob container

I have an Azure Storage container where I receive many files on a daily basis. My requirement is a trigger in Azure Data Factory or Databricks each time 500 new files have arrived, so I can process them.
In Data Factory we have an event trigger that fires for each new file (with file name and path), but is it possible to get multiple new files and their details at the same time?
Which Azure services can I use for this scenario: Event Hubs? Azure Functions? Queues?
One of the characteristics of a serverless architecture is that something executes whenever a new event occurs; since each event here corresponds to a single file, you can't use those services alone.
Here's what I would do:
#1 An Azure Function with a Blob trigger, executed whenever a new file arrives. This would not start the processing of the file, but just increment a count of files stored in Cosmos DB.
#2 Azure Cosmos DB also offers a change feed, which acts like event sourcing whenever something changes in a collection. Since the document created/modified in #1 holds the count, you can use another Azure Function (#3) to consume the change feed.
#3 This function just contains an if statement that monitors the current count and, if it's above the threshold, starts the processing.
After that, you just need to update the document and reset the count (a rough sketch of both functions follows below).
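A hedged sketch of that flow with Python Azure Functions (v2 programming model): one function bumps a counter document for each new blob, and a second one consumes the Cosmos DB change feed and checks the threshold. The database, container, and connection-setting names are assumptions, and a real implementation would need an atomic increment (patch operations or ETags) to cope with concurrent blob events.
```python
# function_app.py -- illustrative sketch only. Database/container names,
# connection settings, and the 500-file threshold are assumptions.
import os

import azure.functions as func
from azure.cosmos import CosmosClient

app = func.FunctionApp()
THRESHOLD = 500

def _counters():
    client = CosmosClient.from_connection_string(os.environ["CosmosConnection"])
    return client.get_database_client("metadata").get_container_client("counters")

@app.blob_trigger(arg_name="blob", path="incoming/{name}",
                  connection="AzureWebJobsStorage")
def count_new_file(blob: func.InputStream):
    # #1: fires per new blob and only bumps the counter document.
    # Assumes a document {"id": "file-count", "count": 0} already exists and the
    # container is partitioned on /id. A production version needs an atomic
    # increment, since concurrent blob events race on this read-modify-write.
    counters = _counters()
    doc = counters.read_item(item="file-count", partition_key="file-count")
    doc["count"] = doc.get("count", 0) + 1
    counters.upsert_item(doc)

@app.cosmos_db_trigger(arg_name="docs", database_name="metadata",
                       container_name="counters", connection="CosmosConnection",
                       lease_container_name="leases",
                       create_lease_container_if_not_exists=True)
def check_threshold(docs: func.DocumentList):
    # #2/#3: consumes the change feed and starts processing once 500 is reached.
    for doc in docs:
        if doc.get("count", 0) >= THRESHOLD:
            # kick off the batch processing here (e.g. call the ADF REST API),
            # then reset the counter document
            counters = _counters()
            doc["count"] = 0
            counters.upsert_item(dict(doc))
```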

Is there a way to know if a file has been uploaded to a network drive in Azure

I have a network location where a CSV file gets dumped every hour. I need to copy that file to an Azure blob. How do I know that a file has been uploaded to that network drive? Is there something like a file watcher in Azure which monitors this network location? Also, is it possible to copy a file from the network location to an Azure blob through code?
I'm using .NET Core APIs deployed to an Azure App Service.
Please suggest a possible solution.
You can use Azure Event Grid, but as of today Event Grid does not support Azure file shares.
As your file share is on-premises, the only way I see is to write a custom publisher that runs on-premises and sends events to an Azure Event Grid custom topic, with a subscriber (for example an Azure Function) doing the work you want.
https://learn.microsoft.com/en-us/azure/event-grid/custom-event-quickstart-portal
But that will only be an event, not the file itself that was added or changed; you will still have to upload the file into Azure for processing. Since that approach requires you to do two things, I would recommend running custom code on-premises as a cron-like job that looks for new or edited files, uploads them to Azure Blob Storage, and then lets an Azure Function do your processing task.
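If you do go the custom-publisher route, a minimal sketch with the azure-eventgrid Python SDK could look like the following; the topic endpoint, access key, event type, and payload are placeholders, and a subscriber (such as an Azure Function) would react to the event:
```python
# publish_file_event.py -- hedged sketch of an on-prem custom publisher that
# sends a "file arrived" event to an Event Grid custom topic.
import os

from azure.core.credentials import AzureKeyCredential
from azure.eventgrid import EventGridEvent, EventGridPublisherClient

client = EventGridPublisherClient(
    "https://<your-topic>.<region>-1.eventgrid.azure.net/api/events",  # custom topic endpoint
    AzureKeyCredential(os.environ["EVENTGRID_TOPIC_KEY"]),
)

event = EventGridEvent(
    subject="fileshare/incoming/report.csv",      # hypothetical file path
    event_type="Contoso.FileShare.FileCreated",   # hypothetical event type
    data={"path": r"\\fileserver\incoming\report.csv", "sizeBytes": 12345},
    data_version="1.0",
)

client.send(event)   # the Azure Function subscriber picks this up
```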
Since the files are on-premises, you can use PowerShell to monitor a folder for new files and then fire an event to upload the file to an Azure blob.
There is a video showing how to do this here: https://www.youtube.com/watch?v=Usih7UywZYA
The changes you need to make are:
Replace the action with an upload to Azure: https://argonsys.com/microsoft-cloud/library/how-to-upload-files-to-azure-blob-storage-using-powershell-and-azcopy/
Run PowerShell in the context of a user that can upload files.
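If you'd rather keep everything in one script, here is a rough Python equivalent of that PowerShell approach, polling the folder and uploading new files with azure-storage-blob; the share path, container name, and connection string are placeholders:
```python
# watch_and_upload.py -- hedged sketch: poll an on-prem share for new files and
# push them to Azure Blob Storage. Paths and names are placeholders.
import os
import time

from azure.storage.blob import BlobServiceClient

WATCH_DIR = r"\\fileserver\share\incoming"    # hypothetical network folder
CONTAINER = "landing"                          # hypothetical blob container

service = BlobServiceClient.from_connection_string(
    os.environ["AZURE_STORAGE_CONNECTION_STRING"])
container = service.get_container_client(CONTAINER)

seen = set(os.listdir(WATCH_DIR))
while True:
    current = {f for f in os.listdir(WATCH_DIR)
               if os.path.isfile(os.path.join(WATCH_DIR, f))}
    for name in sorted(current - seen):        # only files that appeared since the last poll
        with open(os.path.join(WATCH_DIR, name), "rb") as data:
            container.upload_blob(name=name, data=data, overwrite=True)
        print(f"uploaded {name}")
    seen |= current
    time.sleep(60)                             # poll once a minute
```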

Data from HTTP endpoint to be loaded into Azure Data Lake using Azure Data Factory

I am trying to build a so called "modern data warehouse" using Azure services.
The first step is to gather all the data in its native raw format into Azure Data Lake Store. For some of the data sources we have no choice but to consume the data through an API. There isn't much information to be found when searching, hence this question.
Is it possible to define two Web activities in my pipeline to handle the scenario below?
The Web1 activity gets an API URL generated from C# (an Azure Function). It returns data in JSON format and saves it to Web1.Output - this is working fine.
The Web2 activity consumes Web1.Output and saves it into Azure Data Lake as a plain txt file (PUT or POST) - this is what's needed.
The above scenario is achievable with a Copy activity, but then I am not able to pass the dynamic URL generated by the Azure Function. How do I save the JSON output to ADLS? Is there any other way?
Thanks!
Since you are using blob storage as an intermediary, and want to consume the blob upon creation, you could take advantage of event triggers. You can set up an event trigger to run a pipeline containing the Web2 activity, which then kicks off once the Web1 activity (in a separate pipeline) has written its output.
By separating the two activities into separate pipelines, the workflow becomes asynchronous. This means you will not need to wait for both activities to complete before moving on to the next URL. There are other benefits as well.
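As a concrete illustration, such a blob-created event trigger could be defined through the ADF Python SDK roughly as follows (the same thing is normally configured in the ADF UI). The subscription, resource group, factory, storage account scope, container path, and pipeline name are placeholders.
```python
# create_blob_event_trigger.py -- hedged sketch: attach a BlobCreated event
# trigger to the pipeline that holds the Web2 activity. All names/IDs are placeholders.
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    BlobEventsTrigger, PipelineReference, TriggerPipelineReference, TriggerResource)

adf = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")

trigger = TriggerResource(properties=BlobEventsTrigger(
    scope=("/subscriptions/<subscription-id>/resourceGroups/<rg>"
           "/providers/Microsoft.Storage/storageAccounts/<storage-account>"),
    events=["Microsoft.Storage.BlobCreated"],
    blob_path_begins_with="/staging/blobs/",      # the intermediary container
    pipelines=[TriggerPipelineReference(
        pipeline_reference=PipelineReference(reference_name="Web2Pipeline"))],
))

adf.triggers.create_or_update("<rg>", "<factory-name>", "BlobCreatedTrigger", trigger)
# begin_start in recent (track 2) SDK versions; older versions expose triggers.start
adf.triggers.begin_start("<rg>", "<factory-name>", "BlobCreatedTrigger").result()
```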

Azure Data Factory and SharePoint

I have some Excel files stored in SharePoint Online. I want to copy the files stored in SharePoint folders to Azure Blob storage.
To achieve this, I am creating a new pipeline in Azure Data Factory using the Azure portal. What are the possible ways to copy files from SharePoint to Azure Blob storage using Azure Data Factory pipelines?
I have looked at all the linked service types in Azure Data Factory but couldn't find any suitable type to connect to SharePoint.
Rather than directly accessing the file in SharePoint from Data Factory, you might have to use an intermediate technology and have Data Factory call that. You have a few options:
Use a Logic App to move the file
Use an Azure Function
Use a custom activity and write your own C# to copy the file.
To call a Logic App from ADF, you use a web activity.
You can directly call an Azure Function now.
We can create a linked service of type 'File system' by providing the directory URL as the 'Host' value. To authenticate, provide a username and password (or Azure Key Vault references).
Note: use a self-hosted integration runtime.
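For consistency with the other sketches here, the same linked service could be created with the Python SDK roughly like this; the host path, account, password handling, and the 'SelfHostedIR' runtime name are all assumptions (secrets would normally come from Key Vault):
```python
# create_fs_linked_service.py -- illustrative sketch only; all names and
# credentials are placeholders.
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    FileServerLinkedService, IntegrationRuntimeReference,
    LinkedServiceResource, SecureString)

adf = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")

fs_ls = LinkedServiceResource(properties=FileServerLinkedService(
    host=r"\\file-sync-server\documents",                 # directory URL / UNC path
    user_id="CONTOSO\\svc-adf",                           # hypothetical account
    password=SecureString(value="<password>"),
    connect_via=IntegrationRuntimeReference(reference_name="SelfHostedIR"),
))

adf.linked_services.create_or_update(
    "<resource-group>", "<data-factory-name>", "FileSystemLinkedService", fs_ls)
```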
You can use a Logic App to fetch data from SharePoint and load it into Azure Blob storage, and then use Azure Data Factory to fetch the data from blob storage. You can even set an event trigger so that whenever a file lands in the blob container, the pipeline runs automatically.
You can use Power Automate (https://make.powerautomate.com/) to do this task automatically:
Create an automated cloud flow that triggers whenever a new file is dropped in a SharePoint folder
Use whichever trigger fits your requirement and fill in the SharePoint details
Add an action to create a blob and fill in the details as per your use case
By doing this you push the SharePoint files to the blob without even using ADF.
My previous answer was true at the time, but in the last few years Microsoft has published guidance on how to copy documents from a SharePoint library. You can copy a file from SharePoint Online by using a Web activity to authenticate and grab an access token from SPO, then passing it to a subsequent Copy activity that copies the data with the HTTP connector as its source.
I ran into some issues with large files and Logic Apps. It turned out there were some extremely large files to be copied from that SharePoint library. SharePoint has a default buffer-size limit of 100 MB, and the Get File Content action doesn't natively support chunking.
I successfully pulled the files with the Web activity and Copy activity, but I found the SharePoint permissions configuration to be a bit tricky. I blogged my process here.
You can use a binary dataset if you just want to copy the full file rather than read the data.
If my file is located at https://mytenant.sharepoint.com/sites/site1/libraryname/folder1/folder2/folder3/myfile.CSV, the URL I need to retrieve the file is https://mytenant.sharepoint.com/sites/site1/_api/web/GetFileByServerRelativeUrl('/sites/site1/libraryname/folder1/folder2/folder3/myfile.CSV')/$value.
Be careful about when you get your auth token. Your auth token is valid for 1 hour. If you copy a bunch of files sequentially, and it takes longer than that, you might get a timeout error.
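Putting the pieces above together, a hedged Python sketch of the two-step flow (grab an SPO access token via the ACS endpoint described in Microsoft's guidance, then download the file through the REST URL) might look like this; the tenant ID, AAD app registration details, and paths are placeholders:
```python
# spo_download.py -- illustrative sketch only; tenant, client ID/secret, site,
# and file paths are placeholders you would supply.
import requests

TENANT_ID = "<tenant-guid>"
CLIENT_ID = "<aad-app-client-id>"
CLIENT_SECRET = "<aad-app-secret>"
SP_HOST = "mytenant.sharepoint.com"

# Step 1: token request against the ACS endpoint (what the Web activity does).
token_resp = requests.post(
    f"https://accounts.accesscontrol.windows.net/{TENANT_ID}/tokens/OAuth/2",
    data={
        "grant_type": "client_credentials",
        "client_id": f"{CLIENT_ID}@{TENANT_ID}",
        "client_secret": CLIENT_SECRET,
        "resource": f"00000003-0000-0ff1-ce00-000000000000/{SP_HOST}@{TENANT_ID}",
    },
)
access_token = token_resp.json()["access_token"]

# Step 2: fetch the file content (what the Copy activity's HTTP source does).
file_url = (f"https://{SP_HOST}/sites/site1/_api/web/GetFileByServerRelativeUrl("
            "'/sites/site1/libraryname/folder1/folder2/folder3/myfile.CSV')/$value")
resp = requests.get(file_url, headers={"Authorization": f"Bearer {access_token}"})
with open("myfile.CSV", "wb") as f:
    f.write(resp.content)
```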

How to archive my on-premises files once they are processed into Azure Data Lake?

I have a pipeline activity which processes my on-premises files from a file share into Azure Data Lake. I then need to automate archiving the source files whenever the data has been processed into the data lake.
STEP 1: I have built a Logic App which automatically copies any new data from the IN folder to the OUT directory and deletes the old files in the IN directory.
STEP 2: I have built my pipeline which processes the data inside my data lake.
Now, how can I trigger my Logic App from inside my pipeline through some activity, so that my on-premises files are deleted automatically by calling the Logic App?
Please suggest a solution.
An ADF Web activity can be used to call a custom REST endpoint from a Data Factory pipeline.
And as far as I know, a Logic App can listen for an HTTP request, so maybe you could chain a Web activity after your data processing activity? (A small sketch of that call follows below.)
An ADF Custom activity runs your custom code logic on an Azure Batch pool of virtual machines, so you could also put your trigger logic into a Custom activity.
Another option: maybe you could invoke your Logic App on a schedule, like daily or weekly?
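To test that idea outside ADF, the POST a Web activity would make to the Logic App's "When an HTTP request is received" trigger can be sketched in a few lines of Python (the callback URL and body are placeholders):
```python
# call_logic_app.py -- hedged sketch: the same POST an ADF Web activity would
# make to the Logic App's HTTP request trigger. URL and payload are placeholders.
import requests

LOGIC_APP_URL = "https://<your-logic-app-http-trigger-callback-url>"
payload = {"processedFolder": "IN", "action": "archive"}   # hypothetical body

resp = requests.post(LOGIC_APP_URL, json=payload, timeout=30)
resp.raise_for_status()
print("Logic App accepted the request:", resp.status_code)
```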
