I am currently developing an Azure ML pipeline that is fed data and triggered by Power Automate and that outputs to a couple of SQL tables in Azure SQL. One of the tables generated by the pipeline needs to be refreshed each time the pipeline is run, so I need to be able to drop the entire table from the SQL database so that the only data present in the table after the run is the newly calculated data.
At the moment I am dropping the table as part of the Power Automate flow that initially feeds the data into the pipeline. However, due to the size of the dataset, this means there is a 2-6 hour period, while the pipeline runs, during which the analytics I am calculating are not available to the end user.
Hence, my question: is there any way to perform the "DROP TABLE" SQL command from within my Azure ML pipeline? If this is possible, it would allow me to move the drop to immediately before the export, which would be a great improvement in performance.
EDIT: From discussions with Microsoft Support, it does appear that this is not possible due to how the current ML platform is designed. I am leaving this question unanswered in case someone does solve it, but adding this note so that people who come along with the same problem know.
Yes, you can do anything you want inside an Azure ML pipeline with a Python Script Step. I'd recommend using the pyodbc library; you'd just have to pass the credentials to your script as environment variables or script arguments.
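For example, a minimal sketch of such a script step, assuming the connection details are passed in as script arguments (the argument names and the table name below are illustrative):

```python
# Minimal sketch of a Python Script Step that drops a table via pyodbc.
# The argument names and the table name are illustrative, not from the original post.
import argparse
import pyodbc

parser = argparse.ArgumentParser()
parser.add_argument("--server", required=True)
parser.add_argument("--database", required=True)
parser.add_argument("--username", required=True)
parser.add_argument("--password", required=True)
parser.add_argument("--table", default="dbo.AnalyticsResults")  # hypothetical table name
args = parser.parse_args()

conn_str = (
    "DRIVER={ODBC Driver 17 for SQL Server};"
    f"SERVER={args.server};DATABASE={args.database};"
    f"UID={args.username};PWD={args.password}"
)

with pyodbc.connect(conn_str, autocommit=True) as conn:
    # DROP TABLE IF EXISTS avoids an error if the table was already dropped.
    conn.execute(f"DROP TABLE IF EXISTS {args.table}")
```

The ODBC driver has to be present on the compute image, so you may need to add it (and pyodbc) to the environment definition for the step.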
Imagine we have a data source, which can be blob storage or a table.
When new data comes into the data source, the main objective is to create a mechanism so that we can first check the data quality of the new data using certain statistical tests; if it passes these tests, we should be able to combine the new data with the previous data source. The data source must be versioned.
Also, if the new data fails the statistical tests, we should have a mechanism that alerts a developer; then, if the developer decides to override, we should still be able to combine the new data with the previous data source.
This specific part, the starting point where we check the new delta, must be triggered manually. After doing so, we need to trigger an Azure DevOps pipeline.
What tools can we use for this? Are there any reference guides we can follow for this? I need to implement this in Azure.
Key Concerns:
Dataset: Being able to version.
Way to detect delta and store it in a separate place before the tests.
Way to allow a developer to have an override.
Performing statistical tests.
Assuming that the steps in your entire workflow can be broken down into discrete steps, are relatively idempotent or can be checkpointed at each step, and are not long-running, then yes, you could explore using Durable Functions, an advanced orchestration framework for Azure Functions.
Suggestions to match your goals:
Dataset: Being able to version - You should version this explicitly in your dataset during generation. If that's not feasible, you may derive the version based on a hash of a combination of metadata for the dataset.
Way to detect delta and store it in a separate place before the tests - This depends on what a delta means for your dataset. You can have the code check the previous hash from an entry in Table storage and compare it with the current hash.
Way to allow a developer to have an override - Yes, see Human interaction in Durable Functions.
Performing statistical tests - If there are multiple tests to run for each pass, then consider using the fan-out/fan-in pattern (a sketch combining this with the developer-override event follows below).
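To make those last suggestions concrete, here is a rough Python orchestrator sketch using the azure-durable-functions SDK that fans out the statistical tests and then waits for a developer override if they fail; the activity names and the event name are hypothetical:

```python
# Rough Durable Functions orchestrator sketch (Python, v1 programming model).
# Activity functions "HasDatasetChanged", "RunStatisticalTest", "MergeDelta"
# and the external event "DeveloperOverride" are hypothetical names.
import azure.durable_functions as df

def orchestrator_function(context: df.DurableOrchestrationContext):
    dataset_ref = context.get_input()  # e.g. blob path of the new delta

    # Delta detection: an activity compares a hash of the new data with the
    # previously stored hash (e.g. kept in Table storage).
    if not (yield context.call_activity("HasDatasetChanged", dataset_ref)):
        return "no-delta"

    # Fan-out/fan-in: run all statistical tests in parallel, then collect results.
    tests = ["ks_test", "mean_shift", "null_rate"]
    tasks = [context.call_activity("RunStatisticalTest",
                                   {"dataset": dataset_ref, "test": t})
             for t in tests]
    results = yield context.task_all(tasks)

    if all(results):
        yield context.call_activity("MergeDelta", dataset_ref)
        return "merged"

    # Human interaction: block until a developer raises the override event
    # (e.g. from an HTTP-triggered function behind an approval UI).
    approved = yield context.wait_for_external_event("DeveloperOverride")
    if approved:
        yield context.call_activity("MergeDelta", dataset_ref)
        return "merged-after-override"
    return "rejected"

main = df.Orchestrator.create(orchestrator_function)
```

The merge activity could then kick off your Azure DevOps pipeline, e.g. via its REST API.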
Let's say I have raw data in Azure Storage that I need to read and refine for the training step of an Azure ML pipeline. While I run the pipeline for different models/params and experiment with code, the prepared data won't change for a while. I'm thinking about caching it, since the preparation step takes a good amount of resources (regexes involved). I'll still need to re-run the preparation step when the raw data has changed, but only once in a while. I wonder what the good practices are for doing this right from Python code and with the help of the Azure SDK.
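One hedged illustration of what this can look like with the Azure ML SDK v1, where marking the preparation step with allow_reuse lets its cached output be reused until the preparation code or its inputs change (the workspace, compute, and script names below are placeholders):

```python
# Rough sketch with the Azure ML SDK v1 (azureml-pipeline-steps); workspace,
# compute target, and script names are placeholders, not from the question.
from azureml.core import Workspace
from azureml.data import OutputFileDatasetConfig
from azureml.pipeline.core import Pipeline
from azureml.pipeline.steps import PythonScriptStep

ws = Workspace.from_config()
prepared = OutputFileDatasetConfig(name="prepared_data")

# allow_reuse=True lets the service skip this step and reuse its cached output
# when the script, arguments, and inputs have not changed since the last run.
prep_step = PythonScriptStep(
    name="prepare",
    script_name="prepare.py",        # hypothetical script doing the regex-heavy cleaning
    arguments=["--out", prepared],
    compute_target="cpu-cluster",    # placeholder compute target
    source_directory="src",
    allow_reuse=True,
)

train_step = PythonScriptStep(
    name="train",
    script_name="train.py",          # hypothetical training script
    arguments=["--data", prepared.as_input()],
    compute_target="cpu-cluster",
    source_directory="src",
    allow_reuse=False,               # always re-run training while experimenting
)

pipeline = Pipeline(workspace=ws, steps=[prep_step, train_step])
```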
I have a requirement to write up to 500k records daily to an Azure SQL DB using an ADF pipeline.
I have simple calculations as part of the data transformation that can be performed in a SQL stored procedure activity. I've also observed Databricks notebooks being used commonly, especially for the benefits of scalability going forward. But there is the overhead of placing files in another location after transformation, managing authentication, etc., and I want to avoid any over-engineering unless absolutely required.
I've tested SQL Stored Proc and it's working quite well for ~50k records (not yet tested with higher volumes).
But I'd still like to know the general recommendation between the two options, especially from experienced Azure or data engineers.
Thanks
I'm not sure there is enough information to make a solid recommendation. What is the source of the data? Why is ADF part of the solution? Is this 500K rows once per day or a constant stream? Are you loading to a Staging table then using SPROC to move and transform the data to another table?
Here are a couple of thoughts:
If the data operation is SQL to SQL [meaning the same SQL instance for both source and sink], then use Stored Procedures. This allows you to stay close to the metal and will perform the best. An exception would be if the computational load is really complicated, but that doesn't appear to be the case here.
Generally speaking, the only reason to call Databricks from ADF is if you already have that expertise and the resources already exist to support it.
Since ADF is part of the story, there is a middle ground between your two scenarios - Data Flows. Data Flows are a low-code abstraction over Databricks. They are ideal for in-flight data transforms and perform very well at high loads. You do not author or deploy notebooks, nor do you have to manage the Databricks configuration. And they are first-class citizens in ADF pipelines.
As an experienced (former) DBA, data engineer and data architect, I cannot see what Databricks adds in this situation. The piece of the architecture you might need to scale is the target for the INSERTs, i.e. Azure SQL Database, which is ridiculously easy to scale either manually via the portal or via the REST API, if that's even required. Consider techniques such as loading into heaps and partition switching if you need to tune the insert.
The overhead of adding an additional component to your architecture and then taking your data through it would have to be worth it, plus there is the additional cost of spinning up Spark clusters at the same time your database is running.
Databricks is a superb tool and has a number of great use cases, e.g. advanced data transforms (i.e. things you cannot do with SQL), machine learning, streaming and others. Have a look at this free resource for a few ideas:
https://databricks.com/p/ebook/the-big-book-of-data-science-use-cases
I am looking for the proper kind of architecture to load data from text files, CSV, etc. (the format is not decided yet) into Azure SQL Database table(s).
It should not be done manually by the user; I want this workflow to happen automatically, provide input for the components, and use a service so that it will be a fail-safe process.
Please suggest an approach or architecture that does a similar job.
There are a number of ways to do this, and it really depends on your goals and how much you want to invest.
Since you mention CSV, by default you could script a SQL bcp (bulk copy).
https://msdn.microsoft.com/en-us/library/ms162802.aspx
You could script it so it isn't manual, but you mentioned it should be fail-safe. Any group insert of rows into an existing table could fail for a number of reasons (referential integrity checks, hardware issues, locks, etc.).
Most integration architectures would bulk copy the data to a clean staging location (an empty or new table in the same or another database) using either BCP or SqlBulkCopy (https://msdn.microsoft.com/en-us/library/7ek5da1a(v=vs.110).aspx). Then they would loop through those rows, inserting them into the final destination table. That way you can either build in retries or report failures on a row-by-row basis and handle them accordingly.
Since this is in Azure, there are a number of ways to host that final loop-through-rows process, like Azure Batch, Azure Jobs, or even Azure Logic Apps (basically a BizTalk-like workflow/integration engine), or a custom worker role.
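For illustration, a rough Python sketch of that staging-then-row-by-row pattern using pyodbc; the connection string, table, and column names are placeholders:

```python
# Sketch of the staging pattern: bulk-load a CSV into an empty staging table,
# then move rows one at a time so failures can be reported per row.
# CONN_STR, table names, and columns are placeholders.
import csv
import pyodbc

CONN_STR = "Driver={ODBC Driver 17 for SQL Server};Server=...;Database=...;UID=...;PWD=..."

conn = pyodbc.connect(CONN_STR, autocommit=False)
cur = conn.cursor()
cur.fast_executemany = True  # speeds up the parameterized bulk insert

# 1) Bulk copy the file into the clean staging table.
with open("input.csv", newline="") as f:
    rows = [tuple(r) for r in csv.reader(f)]
cur.executemany("INSERT INTO dbo.Staging (col1, col2) VALUES (?, ?)", rows)
conn.commit()

# 2) Loop through the staged rows, inserting into the destination and
#    recording any per-row failures for retry or reporting.
failures = []
cur.execute("SELECT col1, col2 FROM dbo.Staging")
for row in cur.fetchall():
    try:
        conn.execute("INSERT INTO dbo.Destination (col1, col2) VALUES (?, ?)", row)
        conn.commit()
    except pyodbc.DatabaseError as exc:
        conn.rollback()
        failures.append((tuple(row), str(exc)))

print(f"{len(failures)} rows failed and can be retried or reported")
```

Any of the hosting options mentioned above could run a loop like this on a schedule or trigger.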
For a project, I am using both SQL Azure and Azure Table storage. A requirement here is that for the first 7 days, all data are stored in SQL Azure. After the first 7 days, the data are migrated into Azure Table storage.
Is there any reliable project to achieve this goal, or any ideas on how to implement this?
thanks,
I think your best bet is to have a set of SQL queries (or sprocs) that return data older than 7 days. Then have table-insertion code that writes this data to one or more tables, with appropriate partition/row keys based on your query needs. Then just build some type of background operation to perform the read+write+delete. There's no tool to do this (that I know of), since one is a relational database and the other is a NoSQL variant with no specific schema.
To optimize your writes, see if you can write batches of rows at the same time (this is called an Entity Group Transaction). It reduces the number of transactions, plus the rows in a group will be written atomically. See more info on entity group transactions here.
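As an illustration with the current azure-data-tables SDK (the storage connection string, table name, partition strategy, and row source are assumptions), batching by PartitionKey might look like this:

```python
# Hedged sketch: write SQL rows older than 7 days to Table storage in
# entity group transactions (max 100 operations, single PartitionKey per batch).
# STORAGE_CONN_STR, the table name, and the sample rows are assumptions.
from itertools import islice
from azure.data.tables import TableClient

STORAGE_CONN_STR = "DefaultEndpointsProtocol=...;AccountName=...;AccountKey=..."
table = TableClient.from_connection_string(STORAGE_CONN_STR, table_name="ArchivedOrders")

def to_entity(row):
    # Map SQL columns to entity properties; all entities in one transaction
    # must share the same PartitionKey.
    return {
        "PartitionKey": str(row["CustomerId"]),
        "RowKey": str(row["OrderId"]),
        "Amount": float(row["Amount"]),
    }

# rows_by_partition: PartitionKey -> list of SQL rows older than 7 days,
# produced by the queries/sprocs described above (example data shown).
rows_by_partition = {
    "cust-001": [{"CustomerId": "cust-001", "OrderId": 1, "Amount": 9.99}],
}

for rows in rows_by_partition.values():
    entities = iter(to_entity(r) for r in rows)
    while batch := list(islice(entities, 100)):
        table.submit_transaction([("upsert", e) for e in batch])
```

Deleting the migrated rows from SQL only after a batch succeeds keeps the operation safe to re-run, which fits the queue-driven retry approach described next.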
You also may want to consider using a queue for workload assignment. That is, maybe once a day (or hour, whenever), push a queue message telling some background process to transfer data from SQL to Table Storage. This way, in case something fails during the operation, you can process it again later, since the queue message will still be there (you'd only delete the message if the operation succeeded).
If you're looking for a tool to do so, take a look at Cloud Storage Studio (http://www.cerebrata.com/products/cloudstoragestudio), which has a feature to import data from SQL Server to Azure Table Storage. I haven't checked for a long time, but I believe ClumsyLeaf's TableXplorer (http://www.clumsyleaf.com) also has this feature. A long time back, we also built an open-source tool to do the same. You can find it here: http://azuredatabaseupload.codeplex.com/.
As David mentioned, you could basically write some views in your database to fetch data older than 7 days. The idea is simple: you fetch the data, map the SQL Server data types to Azure data types, choose appropriate PartitionKey/RowKey values, convert the data into entities, and then upload the entities in batches.