We have a few third-party companies sending us emails with CSV/Excel data files attached. I want to build a pipeline (preferably in ADF) to get the attachments, load the raw files (attachments) to blob storage, process/transform them, and finally load the processed files to another directory in blob storage.
To get the attachments, I think I can use the instructions (using a Logic App) in this link, then trigger an ADF pipeline with a storage trigger, pick up the file, process it, and do the rest.
However, first, I'm not sure how reliable storage triggers are.
Second, although it seems OK, this approach makes it difficult to monitor the runs and make sure things are working properly. For example, if the Logic App fails to read/load the attachments for any reason, you can't pick that up in ADF, because nothing has been written to the blob to trigger the pipeline.
Anyway, is this approach good, or are there better ways to do this?
Thanks
If you are able to save the attachments into a blob or something, you can schedule an ADF pipeline that imports every file in the blob every minute or every 5 minutes or so.
Do the files have the same data structure every time? (That makes things much easier.)
It is most common to schedule imports in ADF, rather than trigger them based on external events.
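For illustration, the per-run logic is essentially "take everything that landed in the raw container and move it to a processed location". A minimal sketch of that logic, assuming azure-storage-blob with placeholder container names and connection string (inside ADF you would express the same thing with a Copy activity plus a Delete activity):

```python
# Minimal sketch of the per-run import logic (not an ADF definition): list every file
# in a raw container, copy it to a processed container, then remove the original.
# CONN_STR and the container names are placeholders.
from azure.storage.blob import BlobServiceClient

CONN_STR = "<storage-connection-string>"

service = BlobServiceClient.from_connection_string(CONN_STR)
raw = service.get_container_client("raw")
processed = service.get_container_client("processed")

for blob in raw.list_blobs():
    source = raw.get_blob_client(blob.name)
    destination = processed.get_blob_client(blob.name)
    # Server-side copy; depending on how you authorize, the source URL may need a SAS appended.
    destination.start_copy_from_url(source.url)
    # For robustness you may want to poll destination.get_blob_properties().copy.status
    # before deleting, in case the copy is still pending.
    raw.delete_blob(blob.name)  # remove the raw file so the next scheduled run skips it
```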
In Synapse I've set up 3 different pipelines. They all gather data from different sources (SQL, REST and CSV) and sink this into the same SQL database.
Currently they all run during the night, but I already know the question of running them more frequently is coming. I want to prevent my pipelines from running through all the sources when nothing has changed in them.
Therefore I would like to store the last successful sync run of each pipeline (or pipeline activity). Before the next start of each pipeline, I want to create a new pipeline, a fourth one, which checks whether something has changed in the sources. If so, it triggers the execution of one, two or all three of the pipelines.
I still see some complications in doing this, so I'm not fully convinced about how to do it. All help and thoughts are welcome; does someone have experience doing this?
This is (at least in part) the subject of the following Microsoft tutorial:
Incrementally load data from Azure SQL Database to Azure Blob storage using the Azure portal
You're on the correct path - the crux of the issue is creating and persisting "watermarks" for each source from which you can determine if there have been any changes. The approach you use may be different for different source types. In the above tutorial, they create a stored procedure that can store and retrieve a "last run date", and use this to intelligently query tables for only rows modified after this last run date. Of course this requires the cooperation of the data source to take note of when data is inserted or modified.
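In outline, that watermark pattern boils down to something like the following (a pyodbc sketch; the watermark table, the LastModifiedTime column and the copy_to_sink helper are hypothetical names, not the tutorial's exact objects):

```python
import pyodbc

conn = pyodbc.connect("<odbc-connection-string>")  # placeholder connection string
cur = conn.cursor()

# 1. Read the last successful watermark for this source table.
cur.execute("SELECT WatermarkValue FROM dbo.WatermarkTable WHERE TableName = ?", "dbo.Orders")
last_watermark = cur.fetchone()[0]

# 2. Pull only the rows that changed since the last run.
cur.execute("SELECT * FROM dbo.Orders WHERE LastModifiedTime > ?", last_watermark)
changed_rows = cur.fetchall()

if changed_rows:
    copy_to_sink(changed_rows)  # placeholder for the actual copy/transform step

    # 3. Advance the watermark only after a successful run, using the max modified
    #    time of the rows just processed so nothing gets skipped next time.
    new_watermark = max(row.LastModifiedTime for row in changed_rows)
    cur.execute(
        "UPDATE dbo.WatermarkTable SET WatermarkValue = ? WHERE TableName = ?",
        new_watermark,
        "dbo.Orders",
    )
    conn.commit()
```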
If you have a source that cannot be intelligently queried in part (e.g. a CSV file), you still have options, such as using the Get Metadata activity to query the lastModified property of a source file (or even its contentMD5 if using Blob or ADLS Gen2) and compare this to a value saved during your last run (you would have to pick a place to store this, e.g. an operational DB, an Azure Table or a small blob file) to determine whether it needs to be reprocessed.
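A sketch of that file-source variant, assuming azure-storage-blob and a small JSON blob as the watermark store (the container names, blob paths and the reprocess helper are placeholders):

```python
import json
from azure.storage.blob import BlobServiceClient

service = BlobServiceClient.from_connection_string("<storage-connection-string>")
source = service.get_blob_client("landing", "vendor/data.csv")
state = service.get_blob_client("state", "watermarks/vendor_data.json")

# Load the watermark persisted by the previous run, if any.
try:
    last_seen = json.loads(state.download_blob().readall())["last_modified"]
except Exception:
    last_seen = None  # first run: no watermark stored yet

current = source.get_blob_properties().last_modified.isoformat()

if current != last_seen:
    reprocess(source)  # placeholder for triggering/running the pipeline
    # Persist the new watermark only after the run succeeds.
    state.upload_blob(json.dumps({"last_modified": current}), overwrite=True)
```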
If you want to go crazy, you can look into streaming patterns (which might require dabbling in HDInsight or getting your hands dirty with Azure Event Hubs to trigger ADF) to move from the scheduled trigger to automatic ingestion as new data appears at the sources.
I have a Google Apps Script that builds a new csv file any time someone makes an edit to one of our shared google sheets (via a trigger). The file gets saved off to a dedicated Shared Drive folder (which is first cleaned out by trashing all existing files before writing the updated one). This part works splendidly.
I need to take that CSV and consume it in SSIS so I can datestamp it and load it into an MSSQL table for historical tracking purposes, but aside from paying for some third-party apps (e.g. CData, COZYROC), I can't find a way to do this. Eventually this package will be deployed on our SQL Agent to run on a daily schedule, so it will run under a service account that wouldn't have any sort of access to the Google Shared Drive. If I can get that CSV over to one of the shared folders on our SQL Server, I will be golden... but that is what I am struggling with.
If something via Apps Script isn't possible, can someone direct me on how I might be able to programmatically get an Excel spreadsheet to open, refresh its dataset, then save itself and close? I can get the data I need into Excel out of the Google Sheet directly using a Power Query, but I need it to refresh itself in an unattended process on a daily schedule.
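For what it's worth, the unattended Excel refresh part can be scripted with COM automation. A minimal sketch, assuming pywin32 and a local Excel install, with a placeholder workbook path:

```python
import win32com.client

excel = win32com.client.DispatchEx("Excel.Application")
excel.Visible = False
excel.DisplayAlerts = False

workbook = excel.Workbooks.Open(r"C:\jobs\sheet_snapshot.xlsx")  # placeholder path
try:
    workbook.RefreshAll()                    # refresh Power Query / data connections
    excel.CalculateUntilAsyncQueriesDone()   # block until background queries finish
    workbook.Save()
finally:
    workbook.Close(SaveChanges=False)
    excel.Quit()
```

Bear in mind that Microsoft doesn't officially support automating Office from unattended services such as SQL Agent jobs, so this can be fragile under a service account.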
I found that CData actually has a tool called Sync which got us what we needed. There is a limited-options free version of the tool (which they claim is "free forever") that runs as a local service. On a set schedule, it can query all sorts of sources, including Google Sheets, and will write out to various destinations.
The free version limits the sources and destinations you can use (though there are quite a few), and it only allows 2 connection definitions. That said, you can define multiple source files, but only 1 source type (i.e. I can define 20 different Google Sheets to use in 20 different jobs, but can only use Google Sheets as my source).
I have it set up to read my shared Google Sheet and output the CSV to our server's share. An SSIS project reads the local CSV, processes it as needed, and then writes to our SQL Server. It seems to work pretty well if you don't mind having an additional service running and don't need a series of different sources and destinations.
Link to their Sync landing page: https://www.cdata.com/sync/
Use the Buy Now button, add the free version to your cart, then proceed to checkout. They will email you a key and a download link.
I have 200+ SFTP subfolders, and 10 new folders are added dynamically every month. We created a "List rows in a table" step through OneDrive and started monitoring the SFTP location, but somehow this approach misses some files at certain points. Is there a better way or a different approach to tackle this problem? Has anyone come across this in the past?
The first thing that comes to my mind: is the OneDrive file with the table perhaps in a locked state, so it can't be accessed by the Logic App? That is the most similar issue I have had in the past.
Is this one Logic App or multiple Logic Apps? If it is one Logic App, you could turn it into an app that only checks the list of dynamically added folders and kicks off an Azure Automation job that deploys a new Logic App for each line (folder) to monitor that folder. With this approach you would end up with 200+ Logic Apps; not that there is anything wrong with that, but keep in mind the limit of 800 resources per resource group.
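Alternatively, if a single scheduled poller is acceptable instead of one Logic App per folder, the enumeration logic is roughly the following (a paramiko sketch with placeholder host and credentials; load_seen_files, process_file and save_seen_files are hypothetical helpers for whatever persistence you choose):

```python
import stat
import paramiko

# Placeholder connection details.
transport = paramiko.Transport(("sftp.example.com", 22))
transport.connect(username="svc_account", password="<secret>")
sftp = paramiko.SFTPClient.from_transport(transport)

seen = load_seen_files()   # hypothetical: previously processed paths from a table/blob
ROOT = "/inbound"

for item in sftp.listdir_attr(ROOT):
    if not stat.S_ISDIR(item.st_mode):
        continue                                 # only descend into subfolders
    folder = f"{ROOT}/{item.filename}"           # picks up folders added later, too
    for entry in sftp.listdir_attr(folder):
        path = f"{folder}/{entry.filename}"
        if path not in seen:
            process_file(sftp, path)             # hypothetical download/hand-off
            seen.add(path)

save_seen_files(seen)                            # hypothetical persistence
transport.close()
```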
I am using Azure Data Lake Store (ADLS), targeted by an Azure Data Factory (ADF) pipeline that reads from Blob Storage and writes into ADLS. During execution I notice that a folder is created in the output ADLS that does not exist in the source data. The folder has a GUID for a name and contains many files, also named with GUIDs. The folder is temporary and disappears after around 30 seconds.
Is this part of the ADLS metadata indexing? Is it something used by ADF during processing? Although it appears in the Data Explorer in the portal, does it show up through the API? I am concerned it may create issues down the line, even though it is a temporary structure.
Any insight appreciated - a Google turned up little.
What you're seeing here is something that Azure Data Lake Storage does regardless of the method you use to upload and copy data into it. It's not specific to Data Factory and not something you can control.
For large files it basically parallelises the read/write operation for a single file. You then get multiple smaller files appearing in the temporary directory, one for each thread of the parallel operation. Once complete, the process concatenates the thread outputs into the single expected destination file.
Comparison: this is similar to what PolyBase does in SQLDW with its 8 external readers that hit a file in 512MB blocks.
I understand your concerns here. I've also done battle with this, where the operation fails and does not clean up the temp files. My advice would be to be explicit with your downstream services when specifying the target file path.
One other thing: I've had problems when using the Visual Studio Data Lake file explorer tool to upload large files. Sometimes the parallel threads did not concatenate into the single file correctly and caused corruption in my structured dataset. This was with files in the 4-8 GB range. Be warned!
Side note. I've found PowerShell most reliable for handling uploads into Data Lake Store.
Hope this helps.
The scenario is as follows: a large text file is put somewhere. At a certain time of day (or manually, or after x number of files), a virtual machine with BizTalk installed should start automatically to process these files. Then the files should be put in some output location and the VM should be shut down. I don't know how long processing these files takes.
What is the best way to build such a solution? It should preferably be reusable for similar scenarios in the future.
I was thinking of Logic Apps for the workflow, blob storage or FTP for input/output of the files, an API App for starting/shutting down the VM. Can Azure Functions be used in some way?
EDIT:
I also asked the question elsewhere, see link.
https://social.msdn.microsoft.com/Forums/en-US/19a69fe7-8e61-4b94-a3e7-b21c4c925195/automated-processing-of-large-text-files?forum=azurelogicapps
Just create an Azure Automation Runbook with a schedule. Have the Runbook check for specific files in a Storage Account; if they exist, start the VM and wait until the files are gone (meaning BizTalk has processed them, deleted them and put them where they belong), then have the Runbook stop the VM.
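A rough sketch of that Runbook logic in Python (Azure Automation also supports Python runbooks; the subscription, resource group, VM and container names are placeholders, and newer azure-mgmt-compute SDKs expose the begin_* methods used here while older ones use start()/deallocate()):

```python
import time

from azure.identity import DefaultAzureCredential
from azure.mgmt.compute import ComputeManagementClient
from azure.storage.blob import ContainerClient

# Placeholder names for the environment.
SUBSCRIPTION_ID = "<subscription-id>"
RESOURCE_GROUP = "biztalk-rg"
VM_NAME = "biztalk-vm"
CONTAINER_URL = "https://<account>.blob.core.windows.net/inbound"

credential = DefaultAzureCredential()
container = ContainerClient.from_container_url(CONTAINER_URL, credential=credential)
compute = ComputeManagementClient(credential, SUBSCRIPTION_ID)

if any(True for _ in container.list_blobs()):
    # Files are waiting: start the BizTalk VM so it can pick them up.
    compute.virtual_machines.begin_start(RESOURCE_GROUP, VM_NAME).result()

    # Wait until BizTalk has processed and removed the files from the container.
    while any(True for _ in container.list_blobs()):
        time.sleep(60)

    # Deallocate so the VM stops incurring compute charges.
    compute.virtual_machines.begin_deallocate(RESOURCE_GROUP, VM_NAME).result()
```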