Let's say I have raw data in Azure Storage that I need to read and refine for the Azure ML pipeline training step. Since I'm running the pipeline for different models/params and experimenting with code, the prepared data won't change for a while. I'm thinking about caching it, since the preparation step takes a good amount of resources (regexes involved). I'll still need to re-run the preparation step when the raw data has changed, but only once in a while. I wonder what the good practices are for doing this right from Python code and with the help of the Azure SDK.
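A minimal, SDK-agnostic sketch of the cache-invalidation logic: fingerprint the raw blobs (name plus ETag, both obtainable from the azure-storage-blob client) and re-run the preparation step only when the fingerprint changes. In Azure ML specifically, `allow_reuse=True` on a pipeline step gives similar behavior, reusing the step's cached output while its inputs and settings are unchanged. The function names below are illustrative, not from any SDK.

```python
import hashlib

def raw_fingerprint(blob_etags):
    """Deterministic key for the raw inputs, from (blob_name, etag) pairs.

    The ETags would come from e.g. azure-storage-blob's list_blobs();
    sorting makes the key independent of listing order.
    """
    digest = hashlib.sha256()
    for name, etag in sorted(blob_etags):
        digest.update(f"{name}:{etag}".encode("utf-8"))
    return digest.hexdigest()

def needs_prep(current_fingerprint, cached_fingerprint):
    """Re-run the expensive regex preparation only when the raw data changed."""
    return current_fingerprint != cached_fingerprint
```

The cached fingerprint could live next to the prepared dataset, e.g. as a small marker blob, so any pipeline run can decide cheaply whether the prepared data is stale.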
Related
Imagine we have a Data source, which can be a blob storage or table.
When new data comes into the data source, the main objective is to create a mechanism so that we can first check the data quality of the new data using certain statistical tests, then if it passes these tests, we should be able to combine the new data with the previous data source. The Data source must be versioned.
Also, if the new data fails the statistical tests, we should have a mechanism that alerts a developer; then, if the developer decides to override, we should still be able to combine the new data with the previous data source.
This specific part, the starting point where we check the new delta, must be triggered manually. After doing so we need to trigger an Azure DevOps Pipeline.
What tools can we use for this? Are there any reference guides we can follow for this? I need to implement this in Azure.
Key Concerns:
Dataset: Being able to version.
Way to detect delta and store it in a separate place before the tests.
Way to allow a developer to have an override.
Performing statistical tests.
Assuming that the steps in your entire workflow can be broken down into discrete steps, are relatively idempotent or can be checkpointed at each step, and are not long-running, then yes, you could explore using Durable Functions, an advanced orchestration framework for Azure Functions.
Suggestions to match your goals:
Dataset: Being able to version - You should version this explicitly in your dataset during generation. If that's not feasible, you may derive the version based on a hash of a combination of metadata for the dataset.
Way to detect delta and store it in a separate place before the tests - this depends on what a delta means for your dataset. You can have the code read the previous hash from an entry in Table Storage and compare it with the current hash.
Way to allow a developer to have an override - Yes, see Human interaction in Durable Functions.
Performing statistical tests - If there are multiple tests to run for each pass, then consider using the fan-out/fan-in pattern.
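As a rough illustration of the fan-out/fan-in idea, here is plain Python standing in for a Durable Functions orchestrator, with two made-up quality tests: each test runs independently (fan-out) and the gate passes only if all of them do (fan-in).

```python
from concurrent.futures import ThreadPoolExecutor

def mean_in_range(data, lo=0.0, hi=100.0):
    """Example test: the mean must fall inside an expected band."""
    return lo <= sum(data) / len(data) <= hi

def no_nulls(data):
    """Example test: reject batches containing missing values."""
    return all(x is not None for x in data)

def run_quality_gate(data, tests):
    # Fan out: run every test concurrently; fan in: all must pass.
    with ThreadPoolExecutor() as pool:
        results = list(pool.map(lambda t: t(data), tests))
    return all(results)
```

In a real Durable Functions app, each test would be an activity function scheduled by the orchestrator, and a failing gate would branch into the human-interaction (developer override) pattern.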
I am currently developing an Azure ML pipeline that is fed data and triggered using Power Automate and outputs to a couple of SQL Tables in Azure SQL. One of the tables that is generated by the pipeline needs to be refreshed each time the pipeline is run, and as such I need to be able to drop the entire table from the SQL database so that the only data present in the table after the run is the newly calculated data.
Now, at the moment I am dropping the table as part of the Power Automate flow that feeds the data into the pipeline initially. However, due to the size of the dataset, this means that there is a 2-6 hour period during which the analytics I am calculating are not available for the end user while the pipeline I created runs.
Hence, my question; is there any way to perform the "DROP TABLE" SQL command from within my Azure ML Pipeline? If this is possible, it would allow me to move the drop to immediately before the export, which would be a great improvement in performance.
EDIT: From discussions with Microsoft Support, it does appear that this is not possible due to how the current ML platform is designed. I'm leaving this question open in case someone does solve it, but adding this note so that people who come along with the same problem know.
Yes, you can do anything you want inside an Azure ML Pipeline with a Python Script Step. I'd recommend using the pyodbc library; you'd just have to pass the credentials to your script as environment variables or script arguments.
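A rough sketch of what that script step could look like. The environment variable names are placeholders, and the identifier check is a simplistic guard for illustration, not a production-grade defense:

```python
import os

def build_drop_statement(table_name):
    # Simplistic guard against injection via the table name
    # (assumption: plain alphanumeric/underscore identifiers only).
    if not table_name.replace("_", "").isalnum():
        raise ValueError(f"unexpected table name: {table_name}")
    return f"DROP TABLE IF EXISTS [{table_name}]"

def drop_table(table_name):
    # pyodbc imported lazily so the pure logic above works without the driver installed.
    import pyodbc
    conn_str = (
        "DRIVER={ODBC Driver 17 for SQL Server};"
        f"SERVER={os.environ['SQL_SERVER']};"
        f"DATABASE={os.environ['SQL_DATABASE']};"
        f"UID={os.environ['SQL_USER']};"
        f"PWD={os.environ['SQL_PASSWORD']}"
    )
    with pyodbc.connect(conn_str, autocommit=True) as conn:
        conn.execute(build_drop_statement(table_name))
```

Calling `drop_table(...)` immediately before the export step would shrink the window during which the analytics table is empty.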
Summarize the problem
I've been seeing particularly slow performance out of Azure Data Factory. Searching for similar questions on Stack Overflow turns up nothing except the advice to contact support.
I'm rolling the dice here to see if anyone has seen something similar and knows how to fix it.
In short, every operation I try in ADF results in excruciatingly slow performance.
This includes:
Extracting a zip in blob storage to blob storage
Copying a number of small compressed files into Azure Data Explorer
Copying a number of small uncompressed json files into Azure Data Explorer
Extracting ZIP
Copying to ADX
In both cases the performance is in the kilobytes per second range.
In both cases the copy/import will eventually work but it can take hours.
Describe what you've tried
I've tried:
using different regions
creating and using my own Integration Runtime
playing with different parameters that could potentially affect performance such as parallel connections etc.
Contacting Microsoft support (who sent me here)
Show some code
Not really any code to share. To reproduce just try extracting a zip to and from blob storage. I get ~400KB/s.
In summary, any advice would be gratefully received. If I can't get this bit working, I'll have to implement the ingestion manually, which on reflection sounds like more fun than I've been having with ADF.
These 'deep' folder structures will affect copy speed. You should minimize the folder depth and increase the number of copy activities. You can reference this document to troubleshoot copy activity performance, or you can send feedback to Microsoft Azure.
I have a requirement to write upto 500k records daily to Azure SQL DB using an ADF pipeline.
I have simple calculations as part of the data transformation that can be performed in a SQL Stored Procedure activity. I've also observed Databricks notebooks being used commonly, especially due to the benefits of scalability going forward. But there is the overhead of placing files in another location after transformation, managing authentication, etc., and I want to avoid over-engineering unless absolutely required.
I've tested SQL Stored Proc and it's working quite well for ~50k records (not yet tested with higher volumes).
But I'd still like to know the general recommendation between the 2 options, esp. from experienced Azure or data engineers.
Thanks
I'm not sure there is enough information to make a solid recommendation. What is the source of the data? Why is ADF part of the solution? Is this 500K rows once per day or a constant stream? Are you loading to a Staging table then using SPROC to move and transform the data to another table?
Here are a couple thoughts:
If the data operation is SQL to SQL [meaning the same SQL instance for both source and sink], then use Stored Procedures. This allows you to stay close to the metal and will perform the best. An exception would be if the computational load is really complicated, but that doesn't appear to be the case here.
Generally speaking, the only reason to call Databricks from ADF is if you already have that expertise and the resources already exist to support it.
Since ADF is part of the story, there is a middle ground between your two scenarios - Data Flows. Data Flows are a low-code abstraction over Databricks. They are ideal for in-flight data transforms and perform very well at high loads. You do not author or deploy notebooks, nor do you have to manage the Databricks configuration. And they are first-class citizens in ADF pipelines.
As an experienced (former) DBA, data engineer and data architect, I cannot see what Databricks adds in this situation. The piece of the architecture you might need to scale is the target for the INSERTs, ie Azure SQL Database, which is ridiculously easy to scale either manually via the portal or via the REST API, if that's even required. Consider techniques such as loading into heaps and partition switching if you need to tune the insert.
The overhead of adding an additional component to your architecture and then taking your data through it would have to be worth it, plus the additional cost of spinning up Spark clusters at the same time your db is running.
Databricks is a superb tool and has a number of great use cases, eg advanced data transforms (ie things you cannot do with SQL), machine learning, streaming and others. Have a look at this free resource for a few ideas:
https://databricks.com/p/ebook/the-big-book-of-data-science-use-cases
I am trying to move data from on-premises Postgres to blob storage incrementally, but the data is moving very slowly. Are there any performance tuning steps to be followed?
Welcome to Stack Overflow!
I would suggest that you take these steps to tune the performance of your Data Factory service with Copy Activity:
Establish a baseline: During the development phase, test your pipeline by using Copy Activity against a representative data sample. Collect execution details and performance characteristics following Copy activity monitoring.
Diagnose and optimize performance: If the performance you observe doesn't meet your expectations, you need to identify performance bottlenecks. Then, optimize performance to remove or reduce the effect of bottlenecks.
Expand the configuration to your entire data set: When you're satisfied with the execution results and performance, you can expand the definition and pipeline to cover your entire data set.
For more details, refer to Performance tuning steps to learn about key factors that impact the performance of data movement (Copy Activity) in Azure Data Factory and the various ways to optimize it.
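For the incremental part itself, the usual pattern is a watermark column: remember the highest modified timestamp you've loaded and pull only newer rows on the next run, so each copy moves a small delta instead of the whole table. A minimal sketch of building such a query (table and column names are hypothetical, and real code should bind the watermark as a query parameter rather than format it into the string):

```python
def incremental_query(table, watermark_column, last_watermark):
    """Build a query that pulls only rows newer than the last successful load.

    For illustration only: in production, pass last_watermark as a bound
    parameter to avoid SQL injection, and persist the new high-water mark
    after each successful copy.
    """
    return (
        f"SELECT * FROM {table} "
        f"WHERE {watermark_column} > '{last_watermark}' "
        f"ORDER BY {watermark_column}"
    )
```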
Hope this helps.