Using ksqldb to join data from multiple types of source connectors


We are evaluating ksqlDB as an ETL tool for our organization. Our entire app is hosted on Microsoft Azure, and our organization prefers PaaS offerings. One use case is that we have multiple microservices, each with its own database, and we want to join tables across those databases to produce denormalized data for other tasks. For example, a Users table contains user data and an Orders table contains all the orders. Users may be stored in a SQL database such as MySQL, whereas Orders may be stored in a NoSQL database such as MongoDB. We need to generate a report by joining the Orders and Users tables on user_id. This can be done in ksqlDB by adding a source connector for each database and joining the resulting streams/tables. We can then add a sink connector that writes the joined Users_Orders data to a new MongoDB database. As long as the connectors and joins are running, new data in the sources will also update the joined data in Users_Orders.
I read that using ksqlDB in production with Azure Event Hubs is not possible because of some licensing issues. So my question is:
Before looking at other products like Azure HDInsight or Confluent Cloud, is there any way of running ksqlDB to achieve the same solution (for example, by managing our own Kafka cluster)?

You don't necessarily need ksqlDB; you should be able to do something similar with Spark SQL, which is available on Azure (Databricks). You don't necessarily need Kafka / Event Hubs either, since Spark could read, join, and write Mongo/JDBC data all on its own (with the appropriate plugins).
The main reason Azure doesn't offer ksqlDB as a hosted service is that doing so would conflict with Confluent's licensing. That does not prevent you from running it yourself, as long as you adhere to the licensing restriction against offering the ksqlDB REST API as a publicly available or paid service. I've not personally tried it, but ksqlDB should work against Event Hubs on its own; I don't think you need to self-manage Kafka as the documentation suggests.
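To make the join described in the question concrete, here is a minimal in-memory sketch of stream-table join semantics in plain Python. It is not ksqlDB or Spark code; the class name, the field names other than user_id, and the sample records are all assumptions for illustration.

```python
class UsersOrdersJoiner:
    """Keeps the latest row per user (the "table") and enriches each
    incoming order (the "stream") with the matching user fields."""

    def __init__(self):
        self.users = {}  # user_id -> latest user row

    def on_user(self, user):
        # An upsert: a newer row for the same user_id replaces the old one.
        self.users[user["user_id"]] = user

    def on_order(self, order):
        user = self.users.get(order["user_id"])
        if user is None:
            return None  # inner-join behaviour: no matching user, no output
        # Denormalized Users_Orders record (field names are hypothetical).
        return {**order, "user_name": user["name"], "user_email": user["email"]}


joiner = UsersOrdersJoiner()
joiner.on_user({"user_id": 1, "name": "Ada", "email": "ada@example.com"})
joined = joiner.on_order({"order_id": "o-1", "user_id": 1, "total": 25.0})
unmatched = joiner.on_order({"order_id": "o-2", "user_id": 99, "total": 5.0})
```

Whichever engine runs it, the shape is the same: the Users side is kept as keyed state, and each order is looked up against it by user_id.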

Related

Multi-tenant IoT application with Azure

I'm building an Azure IoT Hub application. I have several customers, each with a set of devices. Do you think all those customers should be connected to the same hub or a different one(s)?
I would like to populate a multi-tenant db (single db, multiple schemas) via Azure Stream Analytics. The idea is to use a job that partitions the data by customer and saves it in a table of a specific schema (the schema associated with that customer) in my db. Is it possible to do this, or is the only way to keep customer data separate to have several dbs (instead of one db with multiple schemas)?

"I'm building an Azure IoT Hub application. I have several customers, each with a set of devices. Do you think all those customers should be connected to the same hub or a different one(s)?"
It really depends on the data being processed and on your actual requirements. If sharing the IoT Hub resource details with other customers is not an issue, then you can use the same IoT Hub. Otherwise, use a separate IoT Hub for each customer.
"I would like to populate a multi-tenant db (single db, multiple schemas) via Azure Stream Analytics. The idea is to use a job that partitions the data by customer and saves it in a table of a specific schema (the schema associated with that customer) in my db. Is it possible to do this, or is the only way to keep customer data separate to have several dbs (instead of one db with multiple schemas)?"
SQL output in Azure Stream Analytics supports writing in parallel as an option. This option allows for fully parallel job topologies, where multiple output partitions are writing to the destination table in parallel. Enabling this option in Azure Stream Analytics however may not be sufficient to achieve higher throughputs, as it depends significantly on your database configuration and table schema. The choice of indexes, clustering key, index fill factor, and compression have an impact on the time to load tables. For more information about how to optimize your database to improve query and load performance based on internal benchmarks, see SQL Database performance guidance. Ordering of writes is not guaranteed when writing in parallel to SQL Database.
See Increase throughput performance to Azure SQL Database from Azure Stream Analytics for more details.
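As a rough illustration of the routing idea in the question (one schema per customer inside a single database), the sketch below groups incoming events by the schema-qualified table they would be written to. It is plain Python, not Stream Analytics query language, and the tenant_ prefix, the telemetry table name, and the customer_id field are assumptions.

```python
import re
from collections import defaultdict

# Only allow simple identifiers, since the value ends up in a table name.
_SAFE_ID = re.compile(r"^[A-Za-z0-9_]+$")


def target_table(customer_id, table="telemetry"):
    """Return the schema-qualified table for a customer (naming is hypothetical)."""
    if not _SAFE_ID.match(customer_id):
        raise ValueError(f"unsafe customer id: {customer_id!r}")
    return f"[tenant_{customer_id}].[{table}]"


def partition_by_customer(events):
    """Group events by the table they should be written to."""
    batches = defaultdict(list)
    for event in events:
        batches[target_table(event["customer_id"])].append(event)
    return dict(batches)


events = [
    {"customer_id": "acme", "device": "d1", "temp": 21.5},
    {"customer_id": "globex", "device": "d7", "temp": 19.0},
    {"customer_id": "acme", "device": "d2", "temp": 22.1},
]
batches = partition_by_customer(events)
```

Each batch could then be written by a separate output or a separate job; whether a single Stream Analytics job can target a schema chosen per event is not something the answer above confirms.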

Storing IOT Data in Azure: SQL vs Cosmos vs Other Methods

The project I am working on as an architect has an IoT setup where lots of sensors send data such as water pressure and temperature to an FTP server (I can't change this, as I have no control over it for security reasons). From there, a few Windows services on Azure pull the data and store it in an Azure SQL Database.
Here is my observation with respect to this architecture:
Problem 1: Azure SQL has a 1 TB limit. A higher tier can go to 4 TB, but that's the maximum. So it does not appear to be infinitely scalable, and query performance could become a problem as the data grows. Columnstore indexes and partitioning seem to be options, but the size limit and DTUs are a deal breaker.
Problem 2: The IoT data and the SQL Database (downstream storage) seem to be tightly coupled. If a customer wants to extract a few months of data or more, with millions of rows, the DB will get busy and may throttle other customers due to DTU exhaustion.
I would like some ideas on scaling this further. SQL DB is great, and with JSON support it is awesome, but it is not a horizontally scalable solution.
Here is what I am thinking:
All the messages should be consumed from the FTP server by Azure IoT Hub by some means.
From the central hub, I want to push all messages to Azure Blob Storage in 128 MB files for later analysis at low cost.
At the same time, I would like all messages to go from IoT Hub to Azure Cosmos DB (for long-term storage) or Azure SQL DB (also long term, but I am not sure because of the size restriction).
I am keeping data in Blob Storage because if the client wants or hires a machine learning team to create some models, I would prefer them to pull data from Blob Storage rather than hitting my DB.
Kindly suggest a few ideas on this. Thanks in advance!
Chandan Jha

First, Azure SQL DB does have Hyperscale which is much larger than 4TB. That said, there is a tipping point where it makes sense to consider alternative architectures when you get to be bigger than what one machine can handle for your solution. While CosmosDB does give you a horizontal sharding solution, you can do the same with N SQL Databases (there are libraries to help there). Stepping back, it is actually pretty important to understand what you want to do with the data if it were in a database. Both CosmosDB and SQL DB are set up for OLTP-style operations (with some limited forms of broader queries - SQL DB supports columnstore and batch mode, for example, which means you could do a reasonably-sized data mart just fine there too). If you are just storing things in the database in the hope of needing to support future data scientists, then you may or may not really need either of these two OLTP stores.
Synapse SQL is set up for analytics and generally has support to read from data in formats in Azure Storage. So, this may be a better strategy if you want to support arbitrarily-large IoT data and do analytics/ML processing over it.
If you know your solution will never be above , you may not need to consider something like Synapse, but it is set up for those scenarios if you are of sufficient size.
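The "N SQL Databases" idea mentioned above comes down to a stable mapping from a key to one of the databases. Below is a minimal sketch of hash-based shard selection in plain Python; it does not use any particular sharding library, and the device ids and database names are hypothetical.

```python
import hashlib


def shard_for(device_id: str, shard_count: int) -> int:
    """Map a device id to a shard index in [0, shard_count).

    Uses SHA-256 rather than Python's built-in hash(), which is salted
    per process and would give different answers across restarts.
    """
    if shard_count < 1:
        raise ValueError("shard_count must be at least 1")
    digest = hashlib.sha256(device_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count


# Hypothetical database names, one per shard.
databases = ["iot-shard-0", "iot-shard-1", "iot-shard-2", "iot-shard-3"]
chosen = databases[shard_for("sensor-0042", len(databases))]
```

With plain modulo placement, changing the number of shards remaps most keys, so a real deployment would usually add a lookup table or a range/consistent-hashing scheme on top of this.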

Option 1:
Why don't you extract and serialize the data by partition id (device id) and send it to the IoT Hub, where Azure Functions or Logic Apps can deserialize the data into files that are stored in blob containers?
Option 2:
You can also try creating a module that extracts the data into an Excel file, which is then sent to the IoT Hub to be stored in the storage containers.
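For Option 1, the "serialize by device id" step might look roughly like the following. This is a plain-Python sketch that only covers grouping and serialization, not the upload to IoT Hub; the reading fields and device names are assumptions.

```python
import json
from collections import defaultdict


def serialize_by_device(readings):
    """Group readings by device_id and serialize each group as one JSON payload."""
    grouped = defaultdict(list)
    for reading in readings:
        grouped[reading["device_id"]].append(reading)
    # sort_keys keeps the output stable, which helps when comparing payloads.
    return {
        device_id: json.dumps(rows, sort_keys=True)
        for device_id, rows in grouped.items()
    }


readings = [
    {"device_id": "pump-1", "pressure": 3.2},
    {"device_id": "pump-2", "pressure": 2.9},
    {"device_id": "pump-1", "pressure": 3.4},
]
payloads = serialize_by_device(readings)
```

Each payload can then be sent as one message keyed by its device id, and the receiving function only has to call json.loads on it before writing the file.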

Relationships in Azure synapse (DWH)

I'm currently working in Azure Synapse DWH and I have some theoretical questions:
How can I create relationships between tables (dimensions and facts), and what would the implications be if I created those relationships?
I read that to create a primary key I would need to make it nonclustered, but what does that mean?

Azure Synapse Analytics (ASA) has three engines:
serverless SQL pools (was SQL on-demand)
dedicated SQL pools (the next step on from Azure SQL Data Warehouse)
Apache Spark pools
None of these support database relationships as of today. I suspect you mean dedicated SQL pools, and to confirm, they do not support the FOREIGN KEY syntax. Relationships are more of an OLTP concept and are not common in big data platforms, which ASA is.
Therefore your options are to enforce these relationships downstream or on import to your warehouse. A common method is to identify unknown values and substitute them with a -1 / Unknown value on import. This will ensure there are no NULLs in your key columns.
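The -1 / Unknown substitution can be sketched in a few lines. The example below is plain Python over dictionaries rather than T-SQL or Spark, and the column name customer_key and the sample rows are assumptions.

```python
UNKNOWN_KEY = -1  # surrogate key of the "Unknown" dimension member


def resolve_dimension_keys(fact_rows, known_keys, key_column="customer_key"):
    """Replace NULL or unmatched dimension keys with UNKNOWN_KEY.

    Input rows are not modified; a new list of rows is returned.
    """
    resolved = []
    for row in fact_rows:
        key = row.get(key_column)
        if key is None or key not in known_keys:
            row = {**row, key_column: UNKNOWN_KEY}
        resolved.append(row)
    return resolved


known_customers = {10, 11}
facts = [
    {"sale_id": 1, "customer_key": 10, "amount": 5.0},
    {"sale_id": 2, "customer_key": None, "amount": 7.5},
    {"sale_id": 3, "customer_key": 99, "amount": 1.0},
]
clean = resolve_dimension_keys(facts, known_customers)
```

This assumes the dimension table itself contains a row with key -1, so that every fact row still joins to exactly one dimension member.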
Additionally, enforce your relationships downstream, e.g. in an Azure Analysis Services tabular model or a Power BI model.
If you really need relationships then depending on your data volumes you might consider Azure SQL Database which supports data volumes up to 4TB alongside columnstore indexes which give great compression.

I'm having a similar issue and cannot find an automated solution thus far.
I'm importing 'entities' from D365 to the data lake, and they do NOT come with the relationships.
It will also NOT suggest the "Related Tables".
One option: introduce ETL of the 'entities' using T-SQL and Spark.
This needs governance of PySpark, notebooks, schema, T-SQL linting, and orchestration of activities, pipelines, workflows, etc.
OR
For small datasets and projects:
1. Reverse look up each table needed.
2. In Azure Synapse, create a new Dataflow and download the .PBIX.
3. Do your ETL: create the primary fact and dimension tables (by whatever means, such as using a Power Pivot unique/distinct DAX expression on a Customer table).
4. Once complete, if you like, import the newly created primary tables to the data lake.
5. Repeat step 2.
6. Create the relationships with Power BI. (Ideally, if the ETL is done correctly, PBI will find the relationships automatically.)
7. Re-publish the .PBIX with the relationships as a "Dataflow".
a. You must create relationships for every Dataflow; Dataflows cannot be combined.
8. Measures and Dataflows will consume resources and require performance analysis if they grow.
At some point Dataverse may support D365 data, making this easier.
Depending on your cost/spend, cloning all of D365 still doesn't solve your relationship needs.

Two solutions I'm aware of thus far:
Import the serverless DBOs into Power BI, then model and create the dataset there. You can do massive ETL, including foreign key creation and filtering of NULL values to create primary keys for dimensions, aggregate data, create fact tables, etc.
It's far easier than using the Synapse GUI. The drawbacks are related to PBI licensing.
Create a "Lake Database" (map as you go, great for 5 or fewer entities/tables). ETL is low-code, but I suspect that after 40 hours of training I should have just learned how to script this in a notebook/Spark.
Do BOTH: use Power BI to develop your model and test it, then go back to Synapse and deploy the working model as a pipeline or lake database.
Points of Clarity from the top posted solution:
Do not trust the auto-relationship feature of Power BI; stay away from pre-made REFID relationships in PBI unless you know for sure this is what you want. (Step 6 of the original poster: if the ETL is correct, it's a 1:M.)
Publishing with a .PBIX has its limitations with sharing, along with the other issues the OP mentioned. A Lake Database might be the workaround if you have Tableau, Python, or Qlik as your solution.
Dataverse is coming, and PBI analytics as well as predictive analysis with HDInsight will be embedded into D365. You will also be able to create drag-and-drop dashboards. As of 08-05-2022 this is already working in its infancy. Even though they want you to go modular, with a hybrid serverless setup you can STILL pull the aggregate measures from D365 into Synapse and reverse engineer them.

Can you do distributed transactions across Azure SQL and Azure Cosmos Db?

I have a C# application where my domain transactions are stored in Azure SQL. For my event store, I would like to use Azure Cosmos DB. I am wondering whether a distributed transaction across the two will work?

Of course distributed transactions work across whatever systems, if implemented correctly. But actually implementing a full architecture for a distributed atomic commit protocol with transaction managers is a daunting task.
If you mean built-in support in Azure - no, there is no support for that. There is only very limited support for transactions even within a single Cosmos DB container.
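To give a feel for what an atomic commit protocol involves, here is a toy two-phase commit coordinator in Python. The participants are in-memory stand-ins, not Azure SQL or Cosmos DB clients, and every class and method name is hypothetical. A real implementation would also need durable logs, timeouts, and recovery when the coordinator itself fails, which is where most of the difficulty lies.

```python
class Participant:
    """In-memory stand-in for a resource that can vote on and apply a change."""

    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit
        self.state = "idle"

    def prepare(self):
        # Phase 1: vote yes (and hold the change) or vote no.
        self.state = "prepared" if self.can_commit else "aborted"
        return self.can_commit

    def commit(self):
        self.state = "committed"

    def rollback(self):
        self.state = "aborted"


def two_phase_commit(participants):
    """Commit on all participants only if every one of them votes yes."""
    prepared = []
    for participant in participants:
        if participant.prepare():
            prepared.append(participant)
        else:
            # Any "no" vote aborts everyone who had already prepared.
            for done in prepared:
                done.rollback()
            return False
    # Phase 2: all voted yes, so tell everyone to commit.
    for participant in participants:
        participant.commit()
    return True


sql_store = Participant("sql")
event_store = Participant("events", can_commit=False)
ok = two_phase_commit([sql_store, event_store])
```

In this run the second participant votes no, so the first one is rolled back and nothing is committed.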

Which Azure products are needed for a staging database?

I have several external data APIs that I access using some Python scripts. My scripts run from an on-premises server, transform the data, and store it in a SQL Server database on the same server. I suppose it's a rudimentary ETL system run with Python and T-SQL.
The system is about to grow quite a bit with new APIs and will require more complex data pipelines (for example, some of the API data will be spun off to more than one table). I think this would be a good time to move the system onto Azure (we are heavily integrated with Microsoft so it will have to be Azure!).
I have spent a few days researching the Azure products that would let me run Python scripts to access data from web APIs and store the processed data in a cloud database. I'm looking for advice on what sort of Azure products other people have used for similar jobs. At the moment it seems I will need:
Azure SQL Database to hold the processed data that can be accessed by various colleagues.
Azure Data Factory to manage, log, and schedule the pipeline jobs and to run my custom Python scripts (is this even possible?).
Azure Batch to run the aforementioned Python scripts but I'm not sure about this.
I want to put together a proposal basically and start thinking about costs but it would be good to hear from someone who has done something similar - am I on the right track or completely off? Should I just stay on-premises? Thank you in advance.

Azure SQL Database and Azure SQL Data Warehouse are good for relational data. If you want to use NoSQL, you could go with Azure Cosmos DB. If you want to store data as files, you could use Azure Data Lake.
For Python scripts, you could use a Custom activity or a Databricks activity in Azure Data Factory.
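As a rough idea of the kind of script such an activity would run, the sketch below takes records shaped like an API response and spins them off into two tables, as described in the question. It uses an in-memory SQLite database purely as a stand-in for Azure SQL Database so the example is self-contained, and the payload shape, table names, and column names are assumptions.

```python
import sqlite3


def load_payload(conn, payload):
    """Split each API record into a customers row and an orders row."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS customers (id INTEGER PRIMARY KEY, name TEXT)"
    )
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders "
        "(id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL)"
    )
    # The connection as a context manager commits on success
    # and rolls back if any insert raises.
    with conn:
        for record in payload:
            conn.execute(
                "INSERT OR REPLACE INTO customers (id, name) VALUES (?, ?)",
                (record["customer_id"], record["customer_name"]),
            )
            conn.execute(
                "INSERT OR REPLACE INTO orders (id, customer_id, amount) "
                "VALUES (?, ?, ?)",
                (record["order_id"], record["customer_id"], record["amount"]),
            )


# Hard-coded stand-in for a web API response.
payload = [
    {"order_id": 1, "customer_id": 7, "customer_name": "Acme", "amount": 12.5},
    {"order_id": 2, "customer_id": 7, "customer_name": "Acme", "amount": 30.0},
]
conn = sqlite3.connect(":memory:")
load_payload(conn, payload)
```

Against a real Azure SQL Database the connection and the upsert syntax would differ, but the structure (fetch, transform, write to several tables in one transaction) stays the same.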

Azure SQL Data Warehouse should be used if the amount of data you want to load is in the petabyte range. Also, Azure SQL Data Warehouse is not meant for complex transformations. I would recommend it for plain data loads with PolyBase.
