U-SQL Tables vs SQL Data Warehouse - azure

So here's where I'm at.
I'm storing huge amounts of data in Data Lake Store. But when I want to make a report (it can be a month's worth), I want to schematize it into a table to refer to over and over again when querying upon it.
Should I just use the built in database feature that Data Lake Analytics provides by creating U-SQL tables (https://msdn.microsoft.com/en-us/library/azure/mt621301.aspx) or should I create this table in SQL Data Warehouse? I guess what I really want to know is what are the pros and cons of either case and when is it best to use either?
By the way, I'm a noob in this Microsoft Azure world. Still actively learning.

At this point it depends on what you want to do with the data.
If you need interactive report queries, then moving the data into a SQL DB or DW schema is recommended at this point until ADLA provides interactive query capabilities.
If you need the tables during your data preparation steps, want to use partitioning to manage data life cycles, need to run U-SQL queries that can benefit from the clustering and data distribution offered by U-SQL tables, you should use U-SQL tables.

Related

REST API to query Databricks table

I have a usecase and needed help with the best available approach.
I use Azure databricks to create data transformations and create table in the presentation layer/gold layer. The underlying data in these tables are in Azure Storage account.
The transformation logic runs twice daily and updates the gold layer tables.
I have several such tables in the gold layer Eg: a table to store Single customer view data.
An external application from a different system needs access to this data i.e. the application would initiate an API call for details regarding a customer and need to send back the response for matching details (customer details) by querying the single customer view table.
Question:
Is databricks SQL API the solution for this?
As it is a spark table, the response will not be quick i assume. Is this correct or is there a better solution for this.
Is databricks designed for such use cases or is a better approach to copy this table (gold layer) in an operational database such as azure sql db after the transformations are done in pyspark via databricks?
What are the cons of this approach? One would be the databricks cluster should be up and running all time i.e. use interactive cluster. Anything else?
It's possible to use Databricks for that, although it heavily dependent on the SLAs - how fast should be response. Answering your questions in order:
There is no standalone API for execution of queries and getting back results (yet). But you can create a thin wrapper using one of the drivers to work with Databricks: Python, Node.js, Go, or JDBC/ODBC.
Response time heavily dependent on the size of the data, and if the data is already cached on the nodes, and other factors (partitioning of the data, data skipping, etc.). Databricks SQL Warehouses are also able to cache results of queries execution so they won't reprocess the data if such query was already executed.
Storing data in operational databases is also one of the approaches that often used by different customers. But it heavily dependent on the size of the data, and other factors - if you have huge gold layer, then SQL databases may also not the best solution from cost/performance perspective.
For such queries it's recommended to use Databricks SQL that is more cost efficient that having always running interactive cluster. Also, on some of the cloud platforms there is already support for serverless Databricks SQL, where the startup time is very short (seconds instead of minutes), so if your queries to gold layer doesn't happen very often, you may have them configured with auto-termination, and pay only when they are used.

Polymorphic data transformation techniques / data lake/ big data

Background: We are working on a solution to ingest huge sets of telemetry data from various clients. The data is in xml format and contains multiple independent groups of information which have a lot of nested elements. Clients have different versions and as a result the data is ingested in different but similar schema in the data lake. For instance a startDate field can be string or an object containing date. ) Our goal is to visualise accumulated information in a BI tool.
Questions:
What are the best practices for dealing with polymorphic data?
Process and transform required piece of data (reduced version) to a uni-schema file using a programming language and then process it in spark and databricks and consume in a BI tool.
Decompose data to the meaningful groups and process and join (using data relationships) them with spark and databricks.
I appreciate your comments and sharing opinions and experiences on this topic especially from subject matter experts and data engineers. That would be siper nice if you could also share some useful resources about this particular topic.
Cheers!
One of the tags that you have selected for this thread is pointing out that you would like to use Databricks for this transformation. Databricks is one of the tools that I am using and think is powerful enough and effective to do this kind of data processing. Since, the data processing platforms that I have been using the most are Azure and Cloudera, my answer will rely on Azure stack because it is integrated with Databricks. here is what I would recommend based on my experience.
The first think you have to do is to define data layers and create a platform for them. Particularly, for your case, it should have Landing Zone, Staging, ODS, and Data Warehouse layers.
Landing Zone
Will be used for polymorphic data ingestion from your clients. This can be done by only Azure Data Factory (ADF) connecting between the client and Azure Blob Storage. I recommend ,in this layer, we don't put any transformation into ADF pipeline so that we can create a common one for ingesting raw files. If you have many clients that can send data into Azure Storage, this is fine. You can create some dedicated folders for them as well.
Normally, I create folders aligning with client types. For example, if I have 3 types of clients, Oracle, SQL Server, and SAP, the root folders on my Azure Storage will be oracle, sql_server, and sap followed by server/database/client names.
Additionally, it seems you may have to set up Azure IoT hub if you are going to ingest data from IoT devices. If that is the case, this page would be helpful.
Staging Area
Will be an area for schema cleanup. I will have multiple ADF pipelines that transform polymorphic data from Landing Zone and ingest it into Staging Area. You will have to create schema (delta table) for each of your decomposed datasets and data sources. I recommend utilizing Delta Lake as it will be easy to manage and retrieve data.
The transformation options you will have are:
Use only ADF transformation. It will allow you to unnest some nested XML columns as well as do some data cleansing and wrangling from Landing Zone so that the same dataset can be inserted into the same table.
For your case, you may have to create particular ADF pipelines for each of datasets multiplied by client versions.
Use an additional common ADF pipeline that ran Databricks transformation base on datasets and client versions. This will allow more complex transformations that ADF transformation is not capable of.
For your case, there will also be a particular Databricks notebook for each of datasets multiplied by client versions.
As a result, different versions of one particular dataset will be extracted from raw files, cleaned up in terms of schema, and ingested into one table for each data source. There will be some duplicated data for master datasets across different data sources.
ODS Area
Will be an area for so-called single source of truth of your data. Multiple data sources will be merge into one. Therefore, all duplicated data gets eliminated and relationships between dataset get clarified resulting in the second item per your question. If you have just one data source, this will also be an area for applying more data cleansing, such as, validation and consistency. As a result, one dataset will be stored in one table.
I recommend using ADF running Databricks, but for this time, we can use SQL notebook instead of Python because data is well inserted into the table in Staging area already.
The data at this stage can be consumed by Power BI. Read more about Power BI integration with Databricks.
Furthermore, if you still want a data warehouse or star schema for advance analytics, you can do further transformation (via again ADF and Databricks) and utilize Azure Synapse.
Source Control
Fortunately, the tools that I mentioned above are already integrated with source code version control thanks to acquisition of Github by Microsoft. The Databricks notebook and ADF pipeline source codes can be versioning. Check Azure DevOps.
Many thanks for your comprehensive answer PhuriChal! Indeed the data sources are always the same software, but with various different versions and unfortunately data properties are not always remain steady among those versions. Would it be an option to process the raw data after ingestion in order to unify and resolve unmatched properties using a high level programming language before processing them further in databricks?(We may have many of this processing code to refine the raw data for specific proposes)I have added an example in the original post.
Version1:{
'theProperty': 8
}
Version2:{
'data':{
'myProperty': 10
}
}
Processing =>
Refined version: [{
'property: 8
},
{
'property: 10
}]
So that the inconsistencies are resolved before the data comes to databricks for further processing. Can this also be an option?

Using SQL Stored Procedure vs Databricks in Azure Data Factory

I have a requirement to write upto 500k records daily to Azure SQL DB using an ADF pipeline.
I had simple calculations as part of the data transformation that can performed in a SQL Stored procedure activity. I've also observed Databricks Notebooks being used commonly, esp. due to benefits of scalability going forward. But there is an overhead activity of placing files in another location after transformation, managing authentication etc. and I want to avoid any over-engineering unless absolutely required.
I've tested SQL Stored Proc and it's working quite well for ~50k records (not yet tested with higher volumes).
But I'd still like to know the general recommendation between the 2 options, esp. from experienced Azure or data engineers.
Thanks
I'm not sure there is enough information to make a solid recommendation. What is the source of the data? Why is ADF part of the solution? Is this 500K rows once per day or a constant stream? Are you loading to a Staging table then using SPROC to move and transform the data to another table?
Here are a couple thoughts:
If the data operation is SQL to SQL [meaning the same SQL instance for both source and sink], then use Stored Procedures. This allows you to stay close to the metal and will perform the best. An exception would be if the computational load is really complicated, but that doesn't appear to be the case here.
Generally speaking, the only reason to call Data Bricks from ADF is if you already have that expertise and the resources already exist to support it.
Since ADF is part of the story, there is a middle ground between your two scenarios - Data Flows. Data Flows are a low-code abstraction over Data Bricks. They are ideal for in-flight data transforms and perform very well at high loads. You do not author or deploy notebooks, nor do you have to manage the Data Bricks configuration. And they are first class citizens in ADF pipelines.
As an experienced (former) DBA, Data Engineer and data architect, I cannot see what Databricks adds in this situation. This piece of the architecture you might need to scale is the target for the INSERTs, ie Azure SQL Database which is ridiculously easy to scale either manually via the portal or via the REST API, if even required. Consider techniques such as loading into heaps and partition switching if you need to tune the insert.
The overhead of adding an additional component to your architecture and then taking your data through would have to be worth it, plus the additional cost of spinning up Spark clusters at the same time your db is running.
Databricks is a superb tool and has a number of great use cases, eg advanced data transforms (ie things you cannot do with SQL), machine learning, streaming and others. Have a look at this free resource for a few ideas:
https://databricks.com/p/ebook/the-big-book-of-data-science-use-cases

Best practice - Storage options for external reference data that is queried in different ways

We have a cloud platform with various Health Care applications. Each application needs what we call reference data. Reference data is always external data coming from a provider on a daily or some regular schedule. An example of reference data is FDB MedKnowledge which includes a comprehensive compendium of consumer medication monographs, along with drug images and imprints.
Various applications will query the reference data to present it to their target customers (who can be physicians, nurses, technicians, procurement department etc...). A common global API will be developed to return the requested data.
Historical information is required ( for ex: FDB in 2017 had NDC1 which then got deleted from the FDB feed in 2019. So a physician who prescribed NDC1 should be able to query the information of that drug going through history).
Daily we receive the feed from the external provider and use it as input source to merge ( update, insert, delete) our reference data copy such that its live table reflects the latest external feed.
In Azure, we have the following storage options:
Blob storage
Cosmos Db
Azure sql database with system versioning
Azure Datawarehouse
Azure Data lake
What is the best practice to store external reference data? We are leaning toward azure sql database with system versioning. Have any of you worked with external reference data? If yes, what is your storage decision and has it worked well for you? I would like to hear your comments and opinions. Thank you!
You need to base your choice on the type of data you are trying to store, and how you need to reference it. It sounds like you might actually need a few different technologies here.
For example, Azure SQL is great for storing relational data. So if your data is tabular in form and needs to have relationships between it, then this is a good choice. However, if you're going to be storing millions and millions of rows then performance might suffer in a relational database. In that sort of scenario, or one where you have lots of transactional data you might want to look at Cosmos DB.
You mentioned images at one point, putting these in a database is not a good idea, in this sort of scenario you are going to want to look at using blob storage.
"Reference Data" really doesn't mean anything, look at the individual types of data you need to store, and how this data is used, and make decisions based on this. For lots of different types of data, there is unlikely to be a one size fits all solution.

Options to copy multiple tables with one million rows from on-prem SQL 2012 to SQL Azure

I have around 20 tables in my On-prem SQL server 2012 that I like to copy them to a SQL Azure instance on a schedule basis.
In each table, I have around one millions records.
I need to schedule the table copy five times a day.
What are my options to copy this volume of data from SQL2012 to SQL Azure?
Considering the bandwidth limitations between on-prem data center and azure, Is this a feasible requirement?
Thank you,
Assuming
A) you have some .NET programming knowledge and B) the data should completely replace what is currently in the azure tables I would do the following.
Write some .NET code to read the data from the local db into a IDataReader, you don't mention about FK dependencies so you may have to take this into account.
Use SQLBulkCopy to insert the data into holding tables in azure such as
_myrealtablename. We use it we this in entity Framework EntityFramework.BulkInsert
then when the data is uploaded, rename the tables within a transaction so the new tables have replaced the previous.
We see performance of ~100000 rows per second uploading in this manner on fairly wide tables. A few caveats would be how much data is changing, as timestamps and just updating changed and new entities may be easy.
If you wanted something more rudimentary you could use BCP to export and import the data
Check out transactional replication: https://azure.microsoft.com/en-us/blog/transactional-replication-to-azure-sql-db/.
Mihaela

Resources