Polymorphic data transformation techniques / data lake/ big data - apache-spark

Background: We are working on a solution to ingest huge sets of telemetry data from various clients. The data is in xml format and contains multiple independent groups of information which have a lot of nested elements. Clients have different versions and as a result the data is ingested in different but similar schema in the data lake. For instance a startDate field can be string or an object containing date. ) Our goal is to visualise accumulated information in a BI tool.
Questions:
What are the best practices for dealing with polymorphic data?
Process and transform required piece of data (reduced version) to a uni-schema file using a programming language and then process it in spark and databricks and consume in a BI tool.
Decompose data to the meaningful groups and process and join (using data relationships) them with spark and databricks.
I appreciate your comments and sharing opinions and experiences on this topic especially from subject matter experts and data engineers. That would be siper nice if you could also share some useful resources about this particular topic.
Cheers!

One of the tags that you have selected for this thread is pointing out that you would like to use Databricks for this transformation. Databricks is one of the tools that I am using and think is powerful enough and effective to do this kind of data processing. Since, the data processing platforms that I have been using the most are Azure and Cloudera, my answer will rely on Azure stack because it is integrated with Databricks. here is what I would recommend based on my experience.
The first think you have to do is to define data layers and create a platform for them. Particularly, for your case, it should have Landing Zone, Staging, ODS, and Data Warehouse layers.
Landing Zone
Will be used for polymorphic data ingestion from your clients. This can be done by only Azure Data Factory (ADF) connecting between the client and Azure Blob Storage. I recommend ,in this layer, we don't put any transformation into ADF pipeline so that we can create a common one for ingesting raw files. If you have many clients that can send data into Azure Storage, this is fine. You can create some dedicated folders for them as well.
Normally, I create folders aligning with client types. For example, if I have 3 types of clients, Oracle, SQL Server, and SAP, the root folders on my Azure Storage will be oracle, sql_server, and sap followed by server/database/client names.
Additionally, it seems you may have to set up Azure IoT hub if you are going to ingest data from IoT devices. If that is the case, this page would be helpful.
Staging Area
Will be an area for schema cleanup. I will have multiple ADF pipelines that transform polymorphic data from Landing Zone and ingest it into Staging Area. You will have to create schema (delta table) for each of your decomposed datasets and data sources. I recommend utilizing Delta Lake as it will be easy to manage and retrieve data.
The transformation options you will have are:
Use only ADF transformation. It will allow you to unnest some nested XML columns as well as do some data cleansing and wrangling from Landing Zone so that the same dataset can be inserted into the same table.
For your case, you may have to create particular ADF pipelines for each of datasets multiplied by client versions.
Use an additional common ADF pipeline that ran Databricks transformation base on datasets and client versions. This will allow more complex transformations that ADF transformation is not capable of.
For your case, there will also be a particular Databricks notebook for each of datasets multiplied by client versions.
As a result, different versions of one particular dataset will be extracted from raw files, cleaned up in terms of schema, and ingested into one table for each data source. There will be some duplicated data for master datasets across different data sources.
ODS Area
Will be an area for so-called single source of truth of your data. Multiple data sources will be merge into one. Therefore, all duplicated data gets eliminated and relationships between dataset get clarified resulting in the second item per your question. If you have just one data source, this will also be an area for applying more data cleansing, such as, validation and consistency. As a result, one dataset will be stored in one table.
I recommend using ADF running Databricks, but for this time, we can use SQL notebook instead of Python because data is well inserted into the table in Staging area already.
The data at this stage can be consumed by Power BI. Read more about Power BI integration with Databricks.
Furthermore, if you still want a data warehouse or star schema for advance analytics, you can do further transformation (via again ADF and Databricks) and utilize Azure Synapse.
Source Control
Fortunately, the tools that I mentioned above are already integrated with source code version control thanks to acquisition of Github by Microsoft. The Databricks notebook and ADF pipeline source codes can be versioning. Check Azure DevOps.

Many thanks for your comprehensive answer PhuriChal! Indeed the data sources are always the same software, but with various different versions and unfortunately data properties are not always remain steady among those versions. Would it be an option to process the raw data after ingestion in order to unify and resolve unmatched properties using a high level programming language before processing them further in databricks?(We may have many of this processing code to refine the raw data for specific proposes)I have added an example in the original post.
Version1:{
'theProperty': 8
}
Version2:{
'data':{
'myProperty': 10
}
}
Processing =>
Refined version: [{
'property: 8
},
{
'property: 10
}]
So that the inconsistencies are resolved before the data comes to databricks for further processing. Can this also be an option?

Related

What is Azure Data Explorer? A Datalake? A datawarehouse?

As the title says, I'm confused about the role Azure Data Explorer has in the Azure data ecosystem. The documentation states that it's an analytics tool, but technically it ingests data from different sources such as kafka, spark and go on.
Is it some kind of enhanced datawarehouse?
TIA
"For our own troubleshooting needs we wanted to run ad-hoc queries on
the massive telemetry data stream produced by our service. Finding no
suitable solution, we decided to create one"
                            - Ziv Caspi Architect, Azure Data Explorer -
Once we established the need, we can discuss the implementation.
Here are some key features:
The service is distributed and can be easily scaled out (or in) which makes it good fit for big data (big as you need).
The data is ingested into the service in batch/stream and stored in a propriety format.
The data is stored in tables (columns & rows).
Columns' data types include bool, int, long, real, decimal, datetime & timespan as well as native support for JSON (the dynamic data type).
Everything is indexed, including free text that is tokenized and indexed with Full-text search index, which mean we can find rows with specific tokens in sub-seconds - seconds.
The data is stored in a columnar format which makes it great for aggregations on large volumes.
ADX has its own highly intuitive query language, KQL (Kusto Query Language), which supports numerous analytical features including distributed joins.
ADX has native support for time-series with a lot of built-in functionality around it (forecast, anomaly detection etc.).
Since the service was created to handle telemetry and telemetry does not change over time, the service was created as append only (inserts) + built-in support for data retention.
Later on, soft & hard deletes were added.
As of today, updates are not supported.
Here is some additional reading:
Ziv Caspi: Azure Data Explorer Technology 101
Brian Harry: Introducing Application Insights Analytics
Uri Barash: Azure Announcements: Azure Data Explorer

REST API to query Databricks table

I have a usecase and needed help with the best available approach.
I use Azure databricks to create data transformations and create table in the presentation layer/gold layer. The underlying data in these tables are in Azure Storage account.
The transformation logic runs twice daily and updates the gold layer tables.
I have several such tables in the gold layer Eg: a table to store Single customer view data.
An external application from a different system needs access to this data i.e. the application would initiate an API call for details regarding a customer and need to send back the response for matching details (customer details) by querying the single customer view table.
Question:
Is databricks SQL API the solution for this?
As it is a spark table, the response will not be quick i assume. Is this correct or is there a better solution for this.
Is databricks designed for such use cases or is a better approach to copy this table (gold layer) in an operational database such as azure sql db after the transformations are done in pyspark via databricks?
What are the cons of this approach? One would be the databricks cluster should be up and running all time i.e. use interactive cluster. Anything else?
It's possible to use Databricks for that, although it heavily dependent on the SLAs - how fast should be response. Answering your questions in order:
There is no standalone API for execution of queries and getting back results (yet). But you can create a thin wrapper using one of the drivers to work with Databricks: Python, Node.js, Go, or JDBC/ODBC.
Response time heavily dependent on the size of the data, and if the data is already cached on the nodes, and other factors (partitioning of the data, data skipping, etc.). Databricks SQL Warehouses are also able to cache results of queries execution so they won't reprocess the data if such query was already executed.
Storing data in operational databases is also one of the approaches that often used by different customers. But it heavily dependent on the size of the data, and other factors - if you have huge gold layer, then SQL databases may also not the best solution from cost/performance perspective.
For such queries it's recommended to use Databricks SQL that is more cost efficient that having always running interactive cluster. Also, on some of the cloud platforms there is already support for serverless Databricks SQL, where the startup time is very short (seconds instead of minutes), so if your queries to gold layer doesn't happen very often, you may have them configured with auto-termination, and pay only when they are used.

Using SQL Stored Procedure vs Databricks in Azure Data Factory

I have a requirement to write upto 500k records daily to Azure SQL DB using an ADF pipeline.
I had simple calculations as part of the data transformation that can performed in a SQL Stored procedure activity. I've also observed Databricks Notebooks being used commonly, esp. due to benefits of scalability going forward. But there is an overhead activity of placing files in another location after transformation, managing authentication etc. and I want to avoid any over-engineering unless absolutely required.
I've tested SQL Stored Proc and it's working quite well for ~50k records (not yet tested with higher volumes).
But I'd still like to know the general recommendation between the 2 options, esp. from experienced Azure or data engineers.
Thanks
I'm not sure there is enough information to make a solid recommendation. What is the source of the data? Why is ADF part of the solution? Is this 500K rows once per day or a constant stream? Are you loading to a Staging table then using SPROC to move and transform the data to another table?
Here are a couple thoughts:
If the data operation is SQL to SQL [meaning the same SQL instance for both source and sink], then use Stored Procedures. This allows you to stay close to the metal and will perform the best. An exception would be if the computational load is really complicated, but that doesn't appear to be the case here.
Generally speaking, the only reason to call Data Bricks from ADF is if you already have that expertise and the resources already exist to support it.
Since ADF is part of the story, there is a middle ground between your two scenarios - Data Flows. Data Flows are a low-code abstraction over Data Bricks. They are ideal for in-flight data transforms and perform very well at high loads. You do not author or deploy notebooks, nor do you have to manage the Data Bricks configuration. And they are first class citizens in ADF pipelines.
As an experienced (former) DBA, Data Engineer and data architect, I cannot see what Databricks adds in this situation. This piece of the architecture you might need to scale is the target for the INSERTs, ie Azure SQL Database which is ridiculously easy to scale either manually via the portal or via the REST API, if even required. Consider techniques such as loading into heaps and partition switching if you need to tune the insert.
The overhead of adding an additional component to your architecture and then taking your data through would have to be worth it, plus the additional cost of spinning up Spark clusters at the same time your db is running.
Databricks is a superb tool and has a number of great use cases, eg advanced data transforms (ie things you cannot do with SQL), machine learning, streaming and others. Have a look at this free resource for a few ideas:
https://databricks.com/p/ebook/the-big-book-of-data-science-use-cases

How can I decide, if I should use the Power BI API to push data into my streaming dataset or Azure Stream Analytics?

I am very new to Azure. I need to create a Power BI dashboard to visualize some data produced by a sensor. The dashboard needs to get updated "almost" real-time. I have identified that I need a push data set, as I want to visualize some historic data on a line chart. However, from the architecture point of view, I could use the Power BI REST APIs (which would be completely fine in my case, as we process the data with a Python app and I could use that to call Power BI) or Azure Stream Analytics (which could also work, I could dump the data to the Azure Blob storage from the Python app and then stream it).
Can you tell me generally speaking, what are the advantages/disadvantages of the two approaches?
Azure stream analytics lets you have multiple sources and define multiple targets and one of those targets could be Power-BI and Blob ... and at the same time you can use windowing function on the data as it comes in. It also provides you a visual way of managing your pipeline including windowing function.
In your case you are kind of replicating the incoming data to Blob first and secondly to power-BI. But if you have a use case to apply windowing function(1 minutes or so) as your data is coming in from multiple sources e.g. more than one sensor or a senor and other source, you have to fiddle around a lot to get it working manually, where as in stream analytics you can easily do it.
Following article highlights some of the pros and cons of Azure Analytics...
https://www.axonize.com/blog/iot-technology/the-advantages-and-disadvantages-of-using-azure-stream-analytics-for-iot-applications/
If possible, I would recommend streaming data to IoT Hub first, and then ASA can pick it up and render the same on Power BI. It will provide you better latency than streaming data from Blob to ASA and then Power BI. It is the recommended IoT pattern for remote monitoring, predictive maintenance etc , and provides you longer term options to add a lot of logic in the real-time pipelines (ML scoring, windowing, custom code etc).

Application insight -> export -> Power BI Data Warehouse Architecture

Our team have just recently started using Application Insights to add telemetry data to our windows desktop application. This data is sent almost exclusively in the form of events (rather than page views etc). Application Insights is useful only up to a point; to answer anything other than basic questions we are exporting to Azure storage and then using Power BI.
My question is one of data structure. We are new to analytics in general and have just been reading about star/snowflake structures for data warehousing. This looks like it might help in providing the answers we need.
My question is quite simple: Is this the right approach? Have we over complicated things? My current feeling is that a better approach will be to pull the latest data and transform it into a SQL database of facts and dimensions for Power BI to query. Does this make sense? Is this what other people are doing? We have realised that this is more work than we initially thought.
Definitely pursue Michael Milirud's answer, if your source product has suitable analytics you might not need a data warehouse.
Traditionally, a data warehouse has three advantages - integrating information from different data sources, both internal and external; data is cleansed and standardised across sources, and the history of change over time ensures that data is available in its historic context.
What you are describing is becoming a very common case in data warehousing, where star schemas are created for access by tools like PowerBI, Qlik or Tableau. In smaller scenarios the entire warehouse might be held in the PowerBI data engine, but larger data might need pass through queries.
In your scenario, you might be interested in some tools that appear to handle at least some of the migration of Application Insights data:
https://sesitai.codeplex.com/
https://github.com/Azure/azure-content/blob/master/articles/application-insights/app-insights-code-sample-export-telemetry-sql-database.md
Our product Ajilius automates the development of star schema data warehouses, speeding the development time to days or weeks. There are a number of other products doing a similar job, we maintain a complete list of industry competitors to help you choose.
I would continue with Power BI - it actually has a very sophisticated and powerful data integration and modeling engine built in. Historically I've worked with SQL Server Integration Services and Analysis Services for these tasks - Power BI Desktop is superior in many aspects. The design approaches remain consistent - star schemas etc, but you build them in-memory within PBI. It's way more flexible and agile.
Also are you aware that AI can be connected directly to PBI Web? This connects to your AI data in minutes and gives you PBI content ready to use (dashboards, reports, datasets). You can customize these and build new reports from the datasets.
https://powerbi.microsoft.com/en-us/documentation/powerbi-content-pack-application-insights/
What we ended up doing was not sending events from our WinForms app directly to AI but to the Azure EventHub
We then created a job that reads from the eventhub and send the data to
AI using the SDK
Blob storage for later processing
Azure table storage to create powerbi reports
You can of course add more destinations.
So basically all events are send to one destination and from there stored in many destinations, each for their own purposes. We definitely did not want to be restricted to 7 days of raw data and since storage is cheap and blob storage can be used in many analytics solutions of Azure and Microsoft.
The eventhub can be linked to stream analytics as well.
More information about eventhubs can be found at https://azure.microsoft.com/en-us/documentation/articles/event-hubs-csharp-ephcs-getstarted/
You can start using the recently released Application Insights Analytics' feature. In Application Insights we now let you write any query you would like so that you can get more insights out of your data. Analytics runs your queries in seconds, lets you filter / join / group by any possible property and you can also run these queries from Power BI.
More information can be found at https://azure.microsoft.com/en-us/documentation/articles/app-insights-analytics/

Resources