In Synapse I've set up 3 different pipelines. They all gather data from different sources (SQL, REST and CSV) and sink it to the same SQL database.
Currently they all run during the night, but I already know the question of running them more frequently is coming. I want to prevent my pipelines from running through all the sources when nothing has changed in them.
Therefore I would like to store the last successful sync run of each pipeline (or pipeline activity). Before each pipeline's next run, I want a new pipeline, a 4th one, to check whether something has changed in the sources. If so, it triggers one, two or all three of the pipelines to run.
I still see some complications in doing this, so I'm not fully sure how to approach it. All help and thoughts are welcome; does anyone have experience with this?
This is (at least in part) the subject of the following Microsoft tutorial:
Incrementally load data from Azure SQL Database to Azure Blob storage using the Azure portal
You're on the correct path - the crux of the issue is creating and persisting "watermarks" for each source from which you can determine if there have been any changes. The approach you use may be different for different source types. In the above tutorial, they create a stored procedure that can store and retrieve a "last run date", and use this to intelligently query tables for only rows modified after this last run date. Of course this requires the cooperation of the data source to take note of when data is inserted or modified.
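For the SQL source, the watermark pattern from that tutorial boils down to something like the following minimal sketch (Python with pyodbc; the watermark table, source table and column names are hypothetical, and the connection string is assumed to exist):

```python
# Minimal watermark sketch (hypothetical table/column names, pyodbc assumed).
import pyodbc

conn = pyodbc.connect(CONN_STR)  # CONN_STR assumed to be defined elsewhere
cur = conn.cursor()

# 1. Read the watermark saved by the last successful run.
cur.execute("SELECT WatermarkValue FROM dbo.WatermarkTable WHERE TableName = ?", "SalesOrders")
last_run = cur.fetchone()[0]

# 2. Query only the rows modified since then.
cur.execute("SELECT * FROM dbo.SalesOrders WHERE LastModifiedTime > ?", last_run)
changed_rows = cur.fetchall()

# ... copy changed_rows to the sink here ...

# 3. Advance the watermark only after the copy succeeded, using the newest
#    modification time actually seen so no rows are skipped on the next run.
if changed_rows:
    new_watermark = max(row.LastModifiedTime for row in changed_rows)
    cur.execute(
        "UPDATE dbo.WatermarkTable SET WatermarkValue = ? WHERE TableName = ?",
        new_watermark, "SalesOrders",
    )
    conn.commit()
```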
If you have a source that cannot be intelligently queried in part (e.g. a CSV file), you still have options, such as using the Get Metadata activity to query the lastModified property of a source file (or even its contentMD5 if using Blob or ADLS Gen2) and comparing this to a value saved during your last run (you would have to pick a place to store this, e.g. an operational DB, Azure Table or small blob file) to determine whether it needs to be reprocessed.
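As an illustration, the same check can be done outside ADF with the storage SDK; here is a minimal sketch using azure-storage-blob (the container, blob name and the place where the previous state lives are all assumptions):

```python
# Minimal change-detection sketch (hypothetical names; azure-storage-blob assumed).
from azure.storage.blob import BlobClient

blob = BlobClient.from_connection_string(
    conn_str=STORAGE_CONN_STR,          # assumed to be defined elsewhere
    container_name="landing",           # hypothetical container
    blob_name="exports/customers.csv",  # hypothetical source file
)

props = blob.get_blob_properties()
current_md5 = props.content_settings.content_md5  # may be None if the service didn't compute it
current_modified = props.last_modified

# saved_state would normally be read from an operational DB, Azure Table, or a small blob.
saved_state = {"content_md5": None, "last_modified": None}

if current_md5 != saved_state["content_md5"] or current_modified != saved_state["last_modified"]:
    print("Source changed since last run - trigger the pipeline for this source.")
else:
    print("No change detected - skip this source.")
```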
If you want to go crazy, you can look into streaming patterns (which might require dabbling in HDInsight or getting your hands dirty with Azure Event Hubs to trigger ADF) to move from a scheduled trigger to automatic ingestion as new data appears at the sources.
I have a Google Apps Script that builds a new CSV file any time someone makes an edit to one of our shared Google Sheets (via a trigger). The file gets saved off to a dedicated Shared Drive folder (which is first cleaned out by trashing all existing files before writing the updated one). This part works splendidly.
I need to take that CSV and consume it in SSIS so I can datestamp it and load it into an MSSQL table for historical tracking purposes, but aside from paying for some third-party apps (e.g. CData, COZYROC), I can't find a way to do this. Eventually this package will be deployed on our SQL Agent to run on a daily schedule, so it will be attached to a service account that wouldn't have any sort of access to the Google Shared Drive. If I can get that CSV over to one of the shared folders on our SQL server, I will be golden...but that is what I am struggling with.
If something via Apps Script isn't possible, can someone direct me as to how I might programmatically get an Excel spreadsheet to open, refresh its dataset, then save itself and close? I can get the data I need into Excel from the Google Sheet directly using Power Query, but I need it to refresh itself in an unattended process on a daily schedule.
I found that CData actually has a tool called Sync which got us what we needed out of this. There is a limited-options free version of the tool (which they claim is "free forever") that runs as a local service. On a set schedule, it can query all sorts of sources, including Google Sheets, and will write out to various destinations.
The free version is limited in the sources and destinations you can use (though there are quite a few), and it only allows 2 connection definitions. That said, you can define multiple source files, but only 1 source type (i.e. I can define 20 different Google Sheets to use in 20 different jobs, but can only use Google Sheets as my source).
I have it set up to read my shared Google Sheet and output the CSV to our server's share. An SSIS project reads the local CSV, processes it as needed, and then writes to our SQL server. It seems to work pretty well if you don't mind having an additional service running, and don't need a range of different sources and destinations.
Link to their Sync landing page: https://www.cdata.com/sync/
Use the Buy Now button and load the free version into your cart, then proceed to check out. They will email you a key and a download link.
We have a few third-party companies sending us emails with CSV/Excel data files attached. I want to build a pipeline (preferably in ADF) to get the attachments, load the raw files (attachments) to blob storage, process/transform them, and finally load the processed files to another directory in the blob.
To get the attachments, I think I can use the instructions (using a Logic App) in this link. Then, trigger an ADF pipeline using a storage trigger, get the file, process it, and do the rest.
However, first, I'm not sure how reliable storage triggers are.
Second, although it seems OK, this approach makes it difficult to monitor the runs and make sure things are working properly. For example, if the Logic App fails to read/load the attachments for any reason, you can't pick that up in ADF, since nothing has been written to the blob to trigger the pipeline.
Anyway, is this approach good, or are there better ways to do this?
Thanks
If you are able to save the attachments into a blob or something, you can schedule an ADF pipeline that imports every file in the blob every minute or 5 minutes or so.
Do the files have the same data structure every time? (That makes things much easier.)
It is most common to schedule imports in ADF rather than to trigger them based on external events.
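For reference, such a schedule can also be created programmatically; here is a minimal sketch using the azure-mgmt-datafactory Python SDK (the subscription, resource group, factory, pipeline and trigger names are all placeholders):

```python
# Minimal schedule-trigger sketch (placeholder names; azure-identity and
# azure-mgmt-datafactory assumed).
from datetime import datetime, timezone
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    PipelineReference, ScheduleTrigger, ScheduleTriggerRecurrence,
    TriggerPipelineReference, TriggerResource,
)

client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")

trigger = ScheduleTrigger(
    recurrence=ScheduleTriggerRecurrence(
        frequency="Minute", interval=5,
        start_time=datetime(2024, 1, 1, tzinfo=timezone.utc), time_zone="UTC",
    ),
    pipelines=[TriggerPipelineReference(
        pipeline_reference=PipelineReference(reference_name="ImportAttachmentsPipeline"),
    )],
)

client.triggers.create_or_update(
    "<resource-group>", "<factory-name>", "Every5Minutes",
    TriggerResource(properties=trigger),
)
# Triggers are created in a stopped state and must be started explicitly.
client.triggers.begin_start("<resource-group>", "<factory-name>", "Every5Minutes").result()
```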
I have a tricky question about the "Copy Activity" in ADF. Assume the following scenario:
Source: an external API or a non-Azure database using a self-hosted integration runtime.
Sink: an Azure SQL Server database.
The "pre-copy Script" field has a command to delete some data from the sink table (why deleting is out of scope of the discussion).
When the pipeline runs, the connection with the source fails (due to a time-out, network issue, authentication, etc.)
The question is: will the pre-copy script run in this case? Or does the script only run after ADF has successfully connected to the source data store? I couldn't find any reference about it.
I could just try to simulate it and see what happens, but I'm hoping someone can save me the time. :)
Thanks in advance!
In my experience with Data Factory, the pre-copy script won't run.
As I understand it, we can consider it a workflow: connect to source --> get data from source --> connect to sink --> run the pre-copy script --> write data to sink. No matter which step fails, Data Factory will stop the run.
Is it possible to build a data pipeline in AWS to transfer data between two different RDS MySQL instances? The transfer would be taking place once per day (although not necessarily at the same time every day).
I am interested in copying full data tables from one instance to another, but the documentation for the Data Pipeline service doesn't seem to consider this use case.
Thanks in advance.
If one is a copy of the other, you can use the Database Migration Service (a different Amazon service).
If you choose "ongoing replication" then the service will update your target database throughout the day with changes from the source database.
I suspect that if you start making changes to the target database that make it different from the source database, then you will have problems.
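If you go the DMS route, the task setup is scriptable too; here is a minimal sketch with boto3, assuming the source/target endpoints and the replication instance already exist (the ARNs and region are placeholders):

```python
# Minimal DMS replication-task sketch (placeholder ARNs; boto3 assumed).
import json
import boto3

dms = boto3.client("dms", region_name="us-east-1")  # region is an assumption

# Replicate every table in every schema; narrow this down as needed.
table_mappings = {
    "rules": [{
        "rule-type": "selection",
        "rule-id": "1",
        "rule-name": "include-all-tables",
        "object-locator": {"schema-name": "%", "table-name": "%"},
        "rule-action": "include",
    }]
}

dms.create_replication_task(
    ReplicationTaskIdentifier="mysql-to-mysql-sync",
    SourceEndpointArn="arn:aws:dms:...:endpoint:SOURCE",    # placeholder
    TargetEndpointArn="arn:aws:dms:...:endpoint:TARGET",    # placeholder
    ReplicationInstanceArn="arn:aws:dms:...:rep:INSTANCE",  # placeholder
    MigrationType="full-load-and-cdc",  # full copy first, then ongoing replication
    TableMappings=json.dumps(table_mappings),
)
```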
We are using Azure Blob Storage in all our projects. Over the lifetime of a project the naming conventions for files in Azure change: sometimes we would like to rename containers, remove extra folders, and perform other clean-up operations.
But Azure does not make it easy to rename things; we have to do a copy-then-delete.
Also, we can change the naming convention locally during development, but we need to remember to perform the exact same operation on production storage when we deploy new versions.
At the same time we use Entity Framework migrations: we change the database model and a migration script is created. Then we run "update-database" and the DB is updated. The same runs automatically from the deployment scripts: check whether the production DB needs to be updated, and update it if needed.
It would be good if we could have the same migration goodness for Azure Storage: check whether all the migration scripts have been applied, and execute the processes for any missing scripts. Somewhere in the containers we would keep a reference to the latest executed script.
Does such a thing exist? Or should I have a go and try implementing something myself?
No, such functionality/behavior does not exist. And do remember that EF migrations are supported by and are part of EF itself, not the database! So when you talk about Azure Blob Storage: it, as a service, does not provide such functionality, the same way SQL Server itself does not.
As to whether such a library/code exists: no, there isn't.
You are raising a very interesting question though!
I personally am not a big fan of "migrations". You can do them in the early stages of the development life cycle, but once you hit GA/production you have to be very careful about what you are doing. Even EF migrations might be fine with small database sizes, but are you willing to run migrations on a DB whose tables hold millions of records of production data? Same with blobs. If you have 100 or 1,000 blobs it might be fine. How about 2M blobs? Are you really willing to put in code that would go through 2M entities and perform some operations over them, and run that code as part of your build/deploy process? I would not.
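That said, if you do decide to have a go at it yourself as the question suggests, the bookkeeping part is small; here is a rough sketch that tracks applied "migrations" in a marker blob (all names are hypothetical, azure-storage-blob assumed):

```python
# Rough blob-"migration" runner sketch (hypothetical names; azure-storage-blob assumed).
import json
from azure.core.exceptions import ResourceNotFoundError
from azure.storage.blob import BlobClient

marker = BlobClient.from_connection_string(
    conn_str=STORAGE_CONN_STR,            # assumed to be defined elsewhere
    container_name="ops",                 # hypothetical container
    blob_name="applied-migrations.json",  # hypothetical marker blob
)

def applied_migrations() -> set:
    try:
        return set(json.loads(marker.download_blob().readall()))
    except ResourceNotFoundError:
        return set()

def record_migration(name: str) -> None:
    done = applied_migrations()
    done.add(name)
    marker.upload_blob(json.dumps(sorted(done)), overwrite=True)

def run_pending(migrations) -> None:
    """migrations: ordered list of (name, callable) pairs, e.g.
    [("001_rename_container", rename_container_001), ...]"""
    done = applied_migrations()
    for name, migrate in migrations:
        if name not in done:
            migrate()               # copy/delete blobs, rename "folders", etc.
            record_migration(name)
```

Whether running such a thing against millions of blobs in a deploy pipeline is wise is exactly the concern raised above.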