As the title says, I'm confused about the role Azure Data Explorer has in the Azure data ecosystem. The documentation states that it's an analytics tool, but technically it ingests data from different sources such as Kafka, Spark, and so on.
Is it some kind of enhanced data warehouse?
TIA
"For our own troubleshooting needs we wanted to run ad-hoc queries on
the massive telemetry data stream produced by our service. Finding no
suitable solution, we decided to create one"
- Ziv Caspi Architect, Azure Data Explorer -
Now that we have established the need, we can discuss the implementation.
Here are some key features:
The service is distributed and can easily be scaled out (or in), which makes it a good fit for big data (as big as you need).
The data is ingested into the service in batch or streaming mode and stored in a proprietary format.
The data is stored in tables (columns & rows).
Columns' data types include bool, int, long, real, decimal, datetime & timespan as well as native support for JSON (the dynamic data type).
Everything is indexed, including free text, which is tokenized and indexed with a full-text search index. This means we can find rows containing specific tokens in anywhere from under a second to a few seconds.
The data is stored in a columnar format which makes it great for aggregations on large volumes.
ADX has its own highly intuitive query language, KQL (Kusto Query Language), which supports numerous analytical features including distributed joins.
ADX has native support for time-series with a lot of built-in functionality around it (forecast, anomaly detection etc.).
Since the service was created to handle telemetry and telemetry does not change over time, the service was created as append only (inserts) + built-in support for data retention.
Later on, soft & hard deletes were added.
As of today, updates are not supported.
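To make the full-text indexing point above concrete, here is a toy token index in Python. It only illustrates the general idea of looking up rows by token instead of scanning them; it is not how ADX is implemented, and the function names are made up for this sketch.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Split free text into lowercase alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(rows):
    """Map each token to the set of row positions that contain it."""
    index = defaultdict(set)
    for position, text in enumerate(rows):
        for token in tokenize(text):
            index[token].add(position)
    return index

def find_rows(index, rows, token):
    """Return the rows containing the token without scanning every row."""
    return [rows[i] for i in sorted(index.get(token.lower(), set()))]

rows = [
    "Connection timeout on node 17",
    "User login succeeded",
    "Timeout while writing to disk",
]
index = build_index(rows)
matches = find_rows(index, rows, "timeout")
```

The lookup cost depends on the number of matching rows rather than on the table size, which is why token searches can stay fast on large volumes.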
Here is some additional reading:
Ziv Caspi: Azure Data Explorer Technology 101
Brian Harry: Introducing Application Insights Analytics
Uri Barash: Azure Announcements: Azure Data Explorer
Background: We are working on a solution to ingest huge sets of telemetry data from various clients. The data is in XML format and contains multiple independent groups of information with many nested elements. Clients run different versions, and as a result the data lands in the data lake in different but similar schemas. For instance, a startDate field can be a string or an object containing a date. Our goal is to visualise accumulated information in a BI tool.
Questions:
What are the best practices for dealing with polymorphic data?
Process and transform required piece of data (reduced version) to a uni-schema file using a programming language and then process it in spark and databricks and consume in a BI tool.
Decompose data to the meaningful groups and process and join (using data relationships) them with spark and databricks.
I would appreciate your comments, opinions, and experiences on this topic, especially from subject matter experts and data engineers. It would be super nice if you could also share some useful resources on this particular topic.
Cheers!
One of the tags you selected for this thread indicates that you would like to use Databricks for this transformation. Databricks is one of the tools I use, and I think it is powerful and effective enough for this kind of data processing. Since the data processing platforms I have used the most are Azure and Cloudera, my answer relies on the Azure stack because it is integrated with Databricks. Here is what I would recommend based on my experience.
The first thing you have to do is define data layers and create a platform for them. For your case in particular, it should have Landing Zone, Staging, ODS, and Data Warehouse layers.
Landing Zone
Will be used for polymorphic data ingestion from your clients. This can be done with Azure Data Factory (ADF) alone, connecting the client to Azure Blob Storage. I recommend not putting any transformation into the ADF pipeline in this layer, so that we can create one common pipeline for ingesting raw files. If you have many clients that can send data into Azure Storage, that is fine. You can create dedicated folders for them as well.
Normally, I create folders aligning with client types. For example, if I have 3 types of clients, Oracle, SQL Server, and SAP, the root folders on my Azure Storage will be oracle, sql_server, and sap followed by server/database/client names.
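As a rough illustration of that folder convention, a small helper could build the landing path. The function name, the exact folder order after the client type, and the sample values are assumptions for this sketch, not part of any Azure API.

```python
from pathlib import PurePosixPath

def landing_path(client_type, server, database, file_name):
    """Build a landing-zone blob path: <client_type>/<server>/<database>/<file>."""
    # Normalize the client type into a folder-friendly root, e.g. "SQL Server" -> "sql_server".
    root = client_type.strip().lower().replace(" ", "_")
    return str(PurePosixPath(root, server, database, file_name))

path = landing_path("SQL Server", "srv01", "sales", "export_001.xml")
```

Keeping the path logic in one place makes it easier for a single common ingestion pipeline to write raw files for every client type.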
Additionally, it seems you may have to set up Azure IoT hub if you are going to ingest data from IoT devices. If that is the case, this page would be helpful.
Staging Area
Will be an area for schema cleanup. I will have multiple ADF pipelines that transform polymorphic data from Landing Zone and ingest it into Staging Area. You will have to create schema (delta table) for each of your decomposed datasets and data sources. I recommend utilizing Delta Lake as it will be easy to manage and retrieve data.
The transformation options you will have are:
Use only ADF transformation. It will allow you to unnest some nested XML columns as well as do some data cleansing and wrangling from Landing Zone so that the same dataset can be inserted into the same table.
For your case, you may have to create a separate ADF pipeline for each combination of dataset and client version.
Use an additional common ADF pipeline that runs Databricks transformations based on datasets and client versions. This allows more complex transformations than ADF transformation is capable of.
For your case, there will also be a separate Databricks notebook for each combination of dataset and client version.
As a result, different versions of one particular dataset will be extracted from raw files, cleaned up in terms of schema, and ingested into one table for each data source. There will be some duplicated data for master datasets across different data sources.
ODS Area
Will be an area for the so-called single source of truth of your data. Multiple data sources will be merged into one. All duplicated data is therefore eliminated and the relationships between datasets are clarified, which covers the second item in your question. If you have just one data source, this is also the area for applying more data cleansing, such as validation and consistency checks. As a result, one dataset will be stored in one table.
I recommend using ADF to run Databricks again, but this time we can use a SQL notebook instead of Python because the data is already cleanly inserted into the tables in the Staging Area.
The data at this stage can be consumed by Power BI. Read more about Power BI integration with Databricks.
Furthermore, if you still want a data warehouse or star schema for advanced analytics, you can do further transformation (again via ADF and Databricks) and use Azure Synapse.
Source Control
Fortunately, the tools I mentioned above are already integrated with source code version control, thanks to Microsoft's acquisition of GitHub. The source code of Databricks notebooks and ADF pipelines can be versioned. Check Azure DevOps.
Many thanks for your comprehensive answer, PhuriChal! Indeed, the data sources are always the same software, but in various versions, and unfortunately the data properties do not always remain stable across those versions. Would it be an option to process the raw data after ingestion, using a high-level programming language, in order to unify and resolve mismatched properties before processing them further in Databricks? (We may have a lot of this processing code to refine the raw data for specific purposes.) I have added an example in the original post.
Version1:{
'theProperty': 8
}
Version2:{
'data':{
'myProperty': 10
}
}
Processing =>
Refined version: [{
'property': 8
},
{
'property': 10
}]
So that the inconsistencies are resolved before the data comes to databricks for further processing. Can this also be an option?
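The refinement step described above could look roughly like this in Python. This is a minimal sketch that assumes the two layouts shown are the only ones; the function name `refine` is made up for illustration.

```python
def refine(record):
    """Map a record from either client version to the unified shape."""
    if "theProperty" in record:                     # Version1 layout
        value = record["theProperty"]
    elif "myProperty" in record.get("data", {}):    # Version2 layout
        value = record["data"]["myProperty"]
    else:
        # Fail loudly on unknown layouts instead of silently dropping data.
        raise ValueError(f"unrecognized record layout: {record!r}")
    return {"property": value}

raw = [{"theProperty": 8}, {"data": {"myProperty": 10}}]
refined = [refine(record) for record in raw]
```

Raising on an unknown layout means a new client version shows up as an error rather than as missing values further downstream.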
I am very new to Azure. I need to create a Power BI dashboard to visualize some data produced by a sensor. The dashboard needs to get updated "almost" real-time. I have identified that I need a push data set, as I want to visualize some historic data on a line chart. However, from the architecture point of view, I could use the Power BI REST APIs (which would be completely fine in my case, as we process the data with a Python app and I could use that to call Power BI) or Azure Stream Analytics (which could also work, I could dump the data to the Azure Blob storage from the Python app and then stream it).
Can you tell me generally speaking, what are the advantages/disadvantages of the two approaches?
Azure Stream Analytics lets you define multiple sources and multiple targets. One of those targets could be Power BI and another Blob storage, and at the same time you can apply windowing functions to the data as it comes in. It also gives you a visual way of managing your pipeline, including the windowing functions.
In your case you are essentially replicating the incoming data, first to Blob storage and then to Power BI. But if you have a use case for a windowing function (1 minute or so) on data arriving from multiple sources, e.g. more than one sensor, or a sensor and another source, you would have to fiddle around a lot to get it working manually, whereas in Stream Analytics you can do it easily.
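To give a sense of what doing the windowing by hand involves, here is a minimal Python sketch of a 1-minute tumbling window that averages readings per sensor. The event shape (epoch seconds, sensor id, value) and the function name are assumptions for illustration; a real stream would also have to handle late and out-of-order events, which this sketch ignores.

```python
from collections import defaultdict

def tumbling_window_average(events, window_seconds=60):
    """Average readings per (window start, sensor) over fixed, non-overlapping windows."""
    buckets = defaultdict(list)
    for timestamp, sensor_id, value in events:
        # Every timestamp falls into exactly one window, identified by its start time.
        window_start = timestamp - (timestamp % window_seconds)
        buckets[(window_start, sensor_id)].append(value)
    return {key: sum(values) / len(values) for key, values in sorted(buckets.items())}

events = [
    (0, "sensor_a", 20.0),
    (30, "sensor_a", 22.0),
    (45, "sensor_b", 5.0),
    (61, "sensor_a", 30.0),
]
averages = tumbling_window_average(events)
```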
The following article highlights some of the pros and cons of Azure Stream Analytics:
https://www.axonize.com/blog/iot-technology/the-advantages-and-disadvantages-of-using-azure-stream-analytics-for-iot-applications/
If possible, I would recommend streaming the data to IoT Hub first; ASA can then pick it up and render it in Power BI. This gives you better latency than streaming data from Blob storage to ASA and then to Power BI. It is the recommended IoT pattern for remote monitoring, predictive maintenance, etc., and it leaves you longer-term options to add a lot of logic to the real-time pipelines (ML scoring, windowing, custom code, etc.).
We have a cloud platform with various Health Care applications. Each application needs what we call reference data. Reference data is always external data coming from a provider on a daily or some regular schedule. An example of reference data is FDB MedKnowledge which includes a comprehensive compendium of consumer medication monographs, along with drug images and imprints.
Various applications will query the reference data to present it to their target customers (who can be physicians, nurses, technicians, procurement department etc...). A common global API will be developed to return the requested data.
Historical information is required. For example, in 2017 FDB had NDC1, which was then deleted from the FDB feed in 2019. A physician who prescribed NDC1 should still be able to query the information on that drug by going back through history.
Every day we receive the feed from the external provider and use it as the input source to merge (update, insert, delete) into our copy of the reference data, so that its live table reflects the latest external feed.
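The merge-with-history requirement can be sketched in plain Python as follows. This only illustrates the bookkeeping (closing a version instead of deleting it) and is not tied to any of the storage options; the data layout and the function names `merge_feed` and `as_of` are assumptions for this sketch.

```python
from datetime import date

def merge_feed(history, feed, feed_date):
    """Apply one daily feed to a versioned store.

    history maps key -> list of versions; each version is a dict with
    'value', 'valid_from', and 'valid_to' (None while the version is live).
    """
    for key, value in feed.items():
        versions = history.setdefault(key, [])
        live = versions[-1] if versions and versions[-1]["valid_to"] is None else None
        if live is not None and live["value"] == value:
            continue                              # unchanged, nothing to do
        if live is not None:
            live["valid_to"] = feed_date          # update: close the old version
        versions.append({"value": value, "valid_from": feed_date, "valid_to": None})
    for key, versions in history.items():
        if key not in feed and versions and versions[-1]["valid_to"] is None:
            versions[-1]["valid_to"] = feed_date  # delete: close it, keep the history

def as_of(history, key, when):
    """Return the value that was live for key on the given date, or None."""
    for version in history.get(key, []):
        ended = version["valid_to"]
        if version["valid_from"] <= when and (ended is None or when < ended):
            return version["value"]
    return None

history = {}
merge_feed(history, {"NDC1": "drug info v1"}, date(2017, 1, 1))
merge_feed(history, {}, date(2019, 1, 1))   # NDC1 has been dropped from the feed
```

With this layout, `as_of(history, "NDC1", date(2017, 6, 1))` still finds the 2017 record, while a lookup dated after the 2019 feed finds nothing live.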
In Azure, we have the following storage options:
Blob storage
Cosmos DB
Azure SQL Database with system versioning
Azure Data Warehouse
Azure Data Lake
What is the best practice for storing external reference data? We are leaning toward Azure SQL Database with system versioning. Have any of you worked with external reference data? If yes, what storage did you choose, and has it worked well for you? I would like to hear your comments and opinions. Thank you!
You need to base your choice on the type of data you are trying to store, and how you need to reference it. It sounds like you might actually need a few different technologies here.
For example, Azure SQL is great for storing relational data. So if your data is tabular in form and needs to have relationships between it, then this is a good choice. However, if you're going to be storing millions and millions of rows then performance might suffer in a relational database. In that sort of scenario, or one where you have lots of transactional data you might want to look at Cosmos DB.
You mentioned images at one point. Putting these in a database is not a good idea; in that sort of scenario you will want to use blob storage.
"Reference data" really doesn't mean anything by itself. Look at the individual types of data you need to store and how that data is used, and make decisions based on that. For lots of different types of data, there is unlikely to be a one-size-fits-all solution.
Our team have just recently started using Application Insights to add telemetry data to our windows desktop application. This data is sent almost exclusively in the form of events (rather than page views etc). Application Insights is useful only up to a point; to answer anything other than basic questions we are exporting to Azure storage and then using Power BI.
My question is one of data structure. We are new to analytics in general and have just been reading about star/snowflake structures for data warehousing. This looks like it might help in providing the answers we need.
My question is quite simple: Is this the right approach? Have we over complicated things? My current feeling is that a better approach will be to pull the latest data and transform it into a SQL database of facts and dimensions for Power BI to query. Does this make sense? Is this what other people are doing? We have realised that this is more work than we initially thought.
Definitely pursue Michael Milirud's answer. If your source product has suitable analytics, you might not need a data warehouse.
Traditionally, a data warehouse has three advantages - integrating information from different data sources, both internal and external; data is cleansed and standardised across sources, and the history of change over time ensures that data is available in its historic context.
What you are describing is becoming a very common case in data warehousing, where star schemas are created for access by tools like Power BI, Qlik, or Tableau. In smaller scenarios the entire warehouse might be held in the Power BI data engine, but larger data might need pass-through queries.
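As a small illustration of what a star schema means for event telemetry, the following Python sketch splits flat events into one dimension table and one fact table. The event fields (`name`, `timestamp`, `duration_ms`) and the function name are made up for this example; a real model would usually have several dimensions (date, user, application version, and so on).

```python
def to_star_schema(events):
    """Split flat events into an event-name dimension and a fact table."""
    event_dim = {}   # event name -> surrogate key
    facts = []
    for event in events:
        # Reuse the surrogate key if the name is known, otherwise assign the next one.
        key = event_dim.setdefault(event["name"], len(event_dim) + 1)
        facts.append({
            "event_key": key,
            "timestamp": event["timestamp"],
            "duration_ms": event["duration_ms"],
        })
    return event_dim, facts

events = [
    {"name": "OpenFile", "timestamp": "2016-03-01T10:00:00", "duration_ms": 120},
    {"name": "SaveFile", "timestamp": "2016-03-01T10:05:00", "duration_ms": 300},
    {"name": "OpenFile", "timestamp": "2016-03-01T10:09:00", "duration_ms": 95},
]
event_dim, facts = to_star_schema(events)
```

The fact rows carry only keys and measures, so the reporting tool can aggregate them and join to the small dimension table for labels.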
In your scenario, you might be interested in some tools that appear to handle at least some of the migration of Application Insights data:
https://sesitai.codeplex.com/
https://github.com/Azure/azure-content/blob/master/articles/application-insights/app-insights-code-sample-export-telemetry-sql-database.md
Our product Ajilius automates the development of star schema data warehouses, speeding the development time to days or weeks. There are a number of other products doing a similar job, we maintain a complete list of industry competitors to help you choose.
I would continue with Power BI - it actually has a very sophisticated and powerful data integration and modeling engine built in. Historically I've worked with SQL Server Integration Services and Analysis Services for these tasks - Power BI Desktop is superior in many aspects. The design approaches remain consistent - star schemas etc, but you build them in-memory within PBI. It's way more flexible and agile.
Also are you aware that AI can be connected directly to PBI Web? This connects to your AI data in minutes and gives you PBI content ready to use (dashboards, reports, datasets). You can customize these and build new reports from the datasets.
https://powerbi.microsoft.com/en-us/documentation/powerbi-content-pack-application-insights/
What we ended up doing was sending events from our WinForms app not directly to AI but to Azure Event Hub.
We then created a job that reads from the event hub and sends the data to:
AI using the SDK
Blob storage for later processing
Azure table storage to create powerbi reports
You can of course add more destinations.
So basically all events are sent to one destination and from there stored in many destinations, each for its own purpose. We definitely did not want to be restricted to 7 days of raw data, and storage is cheap; blob storage can also be used by many of the analytics solutions in Azure and Microsoft.
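The one-source, many-destinations pattern can be sketched in Python like this. The sinks here are plain lists standing in for the real destinations; the function name and the sink names are hypothetical, and no Azure SDK calls are shown.

```python
def fan_out(event, sinks):
    """Deliver one event to every sink; collect failures instead of stopping."""
    failures = {}
    for name, send in sinks.items():
        try:
            send(event)
        except Exception as error:   # one failing destination must not block the others
            failures[name] = error
    return failures

# In-memory stand-ins for the three destinations mentioned above.
app_insights, blob_archive, report_table = [], [], []
sinks = {
    "application_insights": app_insights.append,
    "blob_storage": blob_archive.append,
    "table_storage": report_table.append,
}
failures = fan_out({"name": "ButtonClicked"}, sinks)
```

Adding another destination is then just one more entry in `sinks`.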
The eventhub can be linked to stream analytics as well.
More information about eventhubs can be found at https://azure.microsoft.com/en-us/documentation/articles/event-hubs-csharp-ephcs-getstarted/
You can start using the recently released Application Insights Analytics feature. In Application Insights we now let you write any query you like so that you can get more insights out of your data. Analytics runs your queries in seconds and lets you filter, join, and group by any property, and you can also run these queries from Power BI.
More information can be found at https://azure.microsoft.com/en-us/documentation/articles/app-insights-analytics/
I am choosing database technology for my new project. I am wondering what are the key differences between Azure DocumentDB and Azure Table Storage?
It seems that main advantage of DocumentDB is full text search and rich query functionality. If I understand it correctly, I would not need separate search engine library such as Lucene/Elasticsearch.
On the other hand Table Storage is much cheaper.
What are the other differences that could influence my decision?
I consider Azure Search an alternative to Lucene. I used Lucene.net in a worker role, and simply not having to deal with infrastructure, ingestion, and similar issues makes the Azure Search service very appealing to me.
There is a scenario I approached with Azure storage in which I see DocumentDB as a perfect fit, and it might explain my point of view.
I used Azure storage to prepare and keep daily summaries of user activity in my solution outside of Azure SQL Database, because the summaries are requested frequently by a large number of clients, with a good chance of spikes at certain times of day. It is a simple write-once, read-many usage pattern that (with my schema) Azure SQL Database found difficult to cope with, while it fit the capacity of storage perfectly (the daily summaries were not cached because of their size).
This scenario evolved over time and now I happen to keep more aggregated and ready to use data in those summaries, and updates became more complex.
Keeping these daily summaries in DocumentDB would make the write once part of the scenario more granular, updating only the relevant data in the complex summary, and ease the read part, as the capability of getting parts of more summaries becomes a trivial quest, for example.
I would consider DocumentDB in scenarios in which data is unstructured and rather complex and I need rich query capability (Table storage is lagging on this part).
I would consider Azure Search in scenarios in which a high throughput full-text search is required.
I did not find the quotas/expected perf to precisely compare DocumentDB to Search but I highly suspect Search is the best fit to replace Lucene.
HTH, Davide