I'm immersed in a Data Lake project to which an old Oracle instance will be migrated. Data is not an issue; it will be ingested into the lake. But as part of this migration, some PL/SQL will be migrated to the cloud as well. The tool replacing the current data flow orchestration will be Spark.
Two questions here:
Is there any tool for migrating PL/SQL to any of Spark's languages?
Based on the community's experience, is there any average time estimate for performing this task manually? E.g. 100 h per 1,000 lines of PL/SQL code.
The PL/SQL is not particularly complex.
Many thanks!
JP
Background: We are working on a solution to ingest huge sets of telemetry data from various clients. The data is in XML format and contains multiple independent groups of information with a lot of nested elements. Clients run different versions, and as a result the data is ingested into different but similar schemas in the data lake. For instance, a startDate field can be a string or an object containing a date. Our goal is to visualise the accumulated information in a BI tool.
Questions:
What are the best practices for dealing with polymorphic data?
1. Process and transform the required piece of data (a reduced version) into a uni-schema file using a programming language, then process it in Spark and Databricks and consume it in a BI tool.
2. Decompose the data into meaningful groups, then process and join them (using the data relationships) with Spark and Databricks.
I appreciate your comments and your sharing of opinions and experiences on this topic, especially from subject matter experts and data engineers. It would be super nice if you could also share some useful resources about this particular topic.
Cheers!
One of the tags that you have selected for this thread points out that you would like to use Databricks for this transformation. Databricks is one of the tools that I am using, and I think it is powerful and effective enough for this kind of data processing. Since the data processing platforms I have been using the most are Azure and Cloudera, my answer will rely on the Azure stack because it is integrated with Databricks. Here is what I would recommend based on my experience.
The first thing you have to do is define data layers and create a platform for them. Particularly for your case, it should have Landing Zone, Staging, ODS, and Data Warehouse layers.
Landing Zone
Will be used for polymorphic data ingestion from your clients. This can be done with just Azure Data Factory (ADF) connecting the clients and Azure Blob Storage. In this layer, I recommend not putting any transformation into the ADF pipeline, so that we can create a common pipeline for ingesting raw files. If you have many clients that can send data into Azure Storage, that is fine; you can create dedicated folders for them as well.
Normally, I create folders aligned with client types. For example, if I have 3 types of clients, Oracle, SQL Server, and SAP, the root folders on my Azure Storage will be oracle, sql_server, and sap, followed by server/database/client names.
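For illustration, a hypothetical layout following that convention (server, database, and client names are made up) could look like:

oracle/server01/erp_db/client_a/
sql_server/server02/crm_db/client_b/
sap/sap_prod/sap_erp/client_c/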
Additionally, it seems you may have to set up Azure IoT hub if you are going to ingest data from IoT devices. If that is the case, this page would be helpful.
Staging Area
Will be an area for schema cleanup. I would have multiple ADF pipelines that transform the polymorphic data from the Landing Zone and ingest it into the Staging Area. You will have to create a schema (Delta table) for each of your decomposed datasets and data sources. I recommend utilizing Delta Lake, as it makes the data easy to manage and retrieve.
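As a rough illustration of landing one decomposed dataset in Staging from a Databricks (Python) notebook, here is a minimal sketch. It assumes the spark-xml library is attached to the cluster; the rowTag value, storage path, and table name are hypothetical.

# 'spark' is the SparkSession that a Databricks notebook provides.
raw_df = (
    spark.read.format("xml")
    .option("rowTag", "telemetry")  # hypothetical root element of one record
    .load("abfss://landing@<storage-account>.dfs.core.windows.net/client_a/*.xml")
)

(
    raw_df.write.format("delta")
    .mode("append")
    .saveAsTable("staging.telemetry_client_a")  # one Delta table per decomposed dataset
)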
The transformation options you will have are:
Use only ADF transformations. These allow you to unnest some nested XML columns as well as do some data cleansing and wrangling on the data from the Landing Zone, so that the same dataset can be inserted into the same table.
For your case, you may have to create a dedicated ADF pipeline for each combination of dataset and client version.
Use an additional common ADF pipeline that runs a Databricks transformation based on the dataset and client version. This allows more complex transformations that ADF transformations are not capable of.
For your case, there will also be a dedicated Databricks notebook for each combination of dataset and client version.
As a result, different versions of one particular dataset will be extracted from raw files, cleaned up in terms of schema, and ingested into one table for each data source. There will be some duplicated data for master datasets across different data sources.
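To make the "one notebook per dataset and client version" idea concrete, here is a minimal, hypothetical sketch: two client versions expose the same value under different shapes (as in the property example later in this thread), and each version-specific transformation maps it onto the common Staging schema. The DataFrame and table names are made up.

from pyspark.sql import functions as F

# raw_v1_df / raw_v2_df stand for DataFrames already loaded from the Landing Zone.
# Version 1 payload: {'theProperty': 8}; Version 2 payload: {'data': {'myProperty': 10}}.
v1 = raw_v1_df.select(F.col("theProperty").alias("property"))
v2 = raw_v2_df.select(F.col("data.myProperty").alias("property"))

# Both versions land in the same Staging table, so downstream layers see one schema.
(
    v1.unionByName(v2)
    .write.format("delta")
    .mode("append")
    .saveAsTable("staging.telemetry_property")
)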
ODS Area
Will be an area for the so-called single source of truth of your data. Multiple data sources will be merged into one; therefore, all duplicated data gets eliminated and the relationships between datasets get clarified, which addresses the second item in your question. If you have just one data source, this will also be an area for applying more data cleansing, such as validation and consistency checks. As a result, one dataset will be stored in one table.
I recommend using ADF running Databricks, but this time we can use a SQL notebook instead of Python, because the data has already been properly inserted into tables in the Staging area.
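A minimal sketch of what that ODS merge step could look like, shown here as spark.sql from a Python cell (in a SQL notebook the MERGE statement would stand alone). It assumes Delta tables; the table and column names are hypothetical.

# Upsert one Staging source into the ODS "single source of truth" table.
spark.sql("""
    MERGE INTO ods.telemetry AS target
    USING staging.telemetry_client_a AS source
    ON target.device_id = source.device_id AND target.event_time = source.event_time
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
""")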
The data at this stage can be consumed by Power BI. Read more about Power BI integration with Databricks.
Furthermore, if you still want a data warehouse or star schema for advanced analytics, you can do further transformation (again via ADF and Databricks) and utilize Azure Synapse.
Source Control
Fortunately, the tools that I mentioned above are already integrated with source code version control, thanks to Microsoft's acquisition of GitHub. The Databricks notebook and ADF pipeline source code can be version-controlled. Check Azure DevOps.
Many thanks for your comprehensive answer, PhuriChal! Indeed, the data sources are always the same software, but in various versions, and unfortunately the data properties do not always remain steady across those versions. Would it be an option to process the raw data after ingestion in order to unify and resolve the unmatched properties using a high-level programming language, before processing them further in Databricks? (We may end up with many of these processing scripts to refine the raw data for specific purposes.) I have added an example in the original post.
Version1:{
'theProperty': 8
}
Version2:{
'data':{
'myProperty': 10
}
}
Processing =>
Refined version: [{
'property': 8
},
{
'property': 10
}]
So that the inconsistencies are resolved before the data comes to Databricks for further processing. Can this also be an option?
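Just to sketch that option: a minimal, hypothetical refinement step in plain Python, mirroring the example above (field names are taken from the original post).

def refine(record: dict) -> dict:
    # Version 1 layout: {'theProperty': 8}
    if "theProperty" in record:
        return {"property": record["theProperty"]}
    # Version 2 layout: {'data': {'myProperty': 10}}
    if "data" in record:
        return {"property": record["data"]["myProperty"]}
    raise ValueError(f"Unknown payload version: {record}")

raw = [{"theProperty": 8}, {"data": {"myProperty": 10}}]
refined = [refine(r) for r in raw]
# refined == [{'property': 8}, {'property': 10}]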
I am currently developing an Azure ML pipeline that is fed data and triggered using Power Automate and outputs to a couple of SQL Tables in Azure SQL. One of the tables that is generated by the pipeline needs to be refreshed each time the pipeline is run, and as such I need to be able to drop the entire table from the SQL database so that the only data present in the table after the run is the newly calculated data.
Now, at the moment I am dropping the table as part of the Power Automate flow that feeds the data into the pipeline initially. However, due to the size of the dataset, this means that there is a 2-6 hour period during which the analytics I am calculating are not available for the end user while the pipeline I created runs.
Hence my question: is there any way to perform the "DROP TABLE" SQL command from within my Azure ML pipeline? If this is possible, it would allow me to move the drop to immediately before the export, which would be a great improvement in performance.
EDIT: From discussions with Microsoft Support, it does appear that this is not possible due to how the current ML platform is designed. I'm leaving this question open in case someone does solve it, but adding this note so that people who come across the same problem know.
Yes, you can do anything you want inside an Azure ML Pipeline with a Python Script Step. I'd recommend using the pyodbc library; you'd just have to pass the credentials to your script as environment variables or script arguments.
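A minimal sketch of such a Python Script Step, assuming pyodbc and the ODBC Driver 17 for SQL Server are available on the compute and that the connection details arrive as environment variables; the table name is hypothetical.

import os
import pyodbc

conn_str = (
    "DRIVER={ODBC Driver 17 for SQL Server};"
    f"SERVER={os.environ['SQL_SERVER']};"
    f"DATABASE={os.environ['SQL_DATABASE']};"
    f"UID={os.environ['SQL_USER']};"
    f"PWD={os.environ['SQL_PASSWORD']}"
)

with pyodbc.connect(conn_str, autocommit=True) as conn:
    # Drop the results table right before the export step repopulates it.
    conn.execute("DROP TABLE IF EXISTS dbo.analytics_results")  # hypothetical table name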
I have a requirement to write up to 500k records daily to Azure SQL DB using an ADF pipeline.
I have simple calculations as part of the data transformation that can be performed in a SQL stored procedure activity. I've also observed Databricks notebooks being used commonly, especially for the benefits of scalability going forward. But there is the overhead of placing files in another location after transformation, managing authentication, etc., and I want to avoid any over-engineering unless absolutely required.
I've tested SQL Stored Proc and it's working quite well for ~50k records (not yet tested with higher volumes).
But I'd still like to know the general recommendation between the two options, especially from experienced Azure or data engineers.
Thanks
I'm not sure there is enough information to make a solid recommendation. What is the source of the data? Why is ADF part of the solution? Is this 500K rows once per day or a constant stream? Are you loading to a Staging table then using SPROC to move and transform the data to another table?
Here are a couple thoughts:
If the data operation is SQL to SQL [meaning the same SQL instance for both source and sink], then use Stored Procedures. This allows you to stay close to the metal and will perform the best. An exception would be if the computational load is really complicated, but that doesn't appear to be the case here.
Generally speaking, the only reason to call Databricks from ADF is if you already have that expertise and the resources to support it already exist.
Since ADF is part of the story, there is a middle ground between your two scenarios: Data Flows. Data Flows are a low-code abstraction over Databricks. They are ideal for in-flight data transforms and perform very well at high loads. You do not author or deploy notebooks, nor do you have to manage the Databricks configuration. And they are first-class citizens in ADF pipelines.
As an experienced (former) DBA, data engineer, and data architect, I cannot see what Databricks adds in this situation. The piece of the architecture you might need to scale is the target for the INSERTs, i.e. Azure SQL Database, which is ridiculously easy to scale either manually via the portal or via the REST API, if that is even required. Consider techniques such as loading into heaps and partition switching if you need to tune the insert.
The overhead of adding an additional component to your architecture and then pushing your data through it would have to be worth it, plus there is the additional cost of spinning up Spark clusters at the same time your DB is running.
Databricks is a superb tool and has a number of great use cases, e.g. advanced data transforms (i.e. things you cannot do with SQL), machine learning, streaming, and others. Have a look at this free resource for a few ideas:
https://databricks.com/p/ebook/the-big-book-of-data-science-use-cases
I want to migrate a UniData database, which is multivalue, to SQL using .NET code. Is this possible? One possibility is through SSIS, but this will consume a lot of time because we have to do an ETL process for all the tables in the DB. So I was looking for .NET code with which I can connect to the UniData DB and migrate the data to SQL.
You're probably getting downvoted because this is an awfully general question, and it's not particularly a programming question, but rather a big project.
One piece of advice is to flip things around and extract the information from the Unidata side, "exploding" out the multivalues into flat tables that your ETL process can consume. And the challenge there (apart from writing UniBasic code) is identifying which multivalued fields are associated with each other. Unless you have very good documentation, that can be tough to do.
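Purely as an illustration of that "exploding" idea (in Python rather than UniBasic or .NET, and with made-up field names): associated multivalues stay aligned by position, and each position becomes one flat row for a child table.

order = {
    "ORDER_ID": "1001",
    "ITEM_CODE": ["A1", "B2", "C3"],  # associated multivalued fields,
    "QTY": [5, 2, 7],                 # aligned with ITEM_CODE by position
}

flat_rows = [
    {"ORDER_ID": order["ORDER_ID"], "ITEM_CODE": code, "QTY": qty}
    for code, qty in zip(order["ITEM_CODE"], order["QTY"])
]
# flat_rows can then be bulk-loaded into a child table on the SQL side.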
From the beginning of an application, you plan ahead and denormalize data at write-time for faster queries at read-time. Using Cassandra "BATCH" commands, you can ensure atomic updates across multiple tables.
But, what about when you add a new feature, and need a new denormalized table? Do you need to run a temporary script to populate this new table with data? Is this how people normally do it? Is there a feature in Cassandra that will do this for me?
I can't comment yet, hence the new answer. The answer is yes, you'd have to write a migration script and run it when you deploy your software upgrade with the new feature. That's a fairly typical DevOps release process in my experience.
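As a rough sketch of such a migration script, using the DataStax Python driver; keyspace, table, and column names are hypothetical.

from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])
session = cluster.connect("shop")  # hypothetical keyspace

insert = session.prepare(
    "INSERT INTO orders_by_customer (customer_id, order_id, total) VALUES (?, ?, ?)"
)

# Backfill the new denormalized table from the existing one. For large tables
# you would typically drive this from Spark or walk it token range by token range.
for row in session.execute("SELECT customer_id, order_id, total FROM orders"):
    session.execute(insert, (row.customer_id, row.order_id, row.total))

cluster.shutdown()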
I've not seen anything like Code First Migrations (for MS SQL Server & Entity Framework) for Cassandra, which would generate the migration script automatically for you.