Multi-tenant IoT application with Azure

I'm building an Azure IoT Hub application. I have several customers, each with a set of devices. Do you think all those customers should be connected to the same hub or a different one(s)?
I would like to populate a multi-tenant DB (single database, multiple schemas) via Azure Stream Analytics. The idea is to use a job that partitions the data by customer and saves it in a table of a specific schema (the schema associated with that customer) in my DB. Is it possible to do this, or is the only way to keep customer data separate to have several DBs (instead of one DB with multiple schemas)?

I'm building an Azure IoT Hub application. I have several customers, each with a set of devices. Do you think all those customers should be connected to the same hub or a different one(s)?
It really depends on the data being processed and on your actual requirements. If sharing the IoT Hub resource details with other customers is not an issue, you can use a single IoT Hub; otherwise, use individual IoT Hubs per customer.
I would like to populate a multi-tenant DB (single database, multiple schemas) via Azure Stream Analytics. The idea is to use a job that partitions the data by customer and saves it in a table of a specific schema (the schema associated with that customer) in my DB. Is it possible to do this, or is the only way to keep customer data separate to have several DBs (instead of one DB with multiple schemas)?
SQL output in Azure Stream Analytics supports writing in parallel as an option. This option allows for fully parallel job topologies, where multiple output partitions are writing to the destination table in parallel. Enabling this option in Azure Stream Analytics however may not be sufficient to achieve higher throughputs, as it depends significantly on your database configuration and table schema. The choice of indexes, clustering key, index fill factor, and compression have an impact on the time to load tables. For more information about how to optimize your database to improve query and load performance based on internal benchmarks, see SQL Database performance guidance. Ordering of writes is not guaranteed when writing in parallel to SQL Database.
See Increase throughput performance to Azure SQL Database from Azure Stream Analytics for more details.
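Stream Analytics itself is configured through its query and output settings rather than through application code, so the per-customer schema routing is not something you express in Python. Purely as a rough sketch of the routing idea, assuming each event carries a customer_id and each customer owns a schema of the same name (the connection string, schema, table, and column names here are hypothetical), a small pyodbc-based consumer could look like this:

# Hedged sketch: route IoT events into per-customer schemas of one Azure SQL database.
# Assumes each event dict carries "customer_id", "device_id", "ts", and "value";
# the connection string, schema names, and table name are hypothetical.
import pyodbc

CONN_STR = (
    "Driver={ODBC Driver 18 for SQL Server};"
    "Server=tcp:myserver.database.windows.net,1433;"
    "Database=iot;Uid=loader;Pwd=<secret>;Encrypt=yes;"
)

# Whitelist of known customer schemas so arbitrary schema names cannot be injected.
KNOWN_SCHEMAS = {"customer_a", "customer_b"}

def write_event(cursor, event):
    schema = event["customer_id"]
    if schema not in KNOWN_SCHEMAS:
        raise ValueError(f"unknown customer schema: {schema}")
    # The schema name comes from the whitelist; the values are passed as parameters.
    cursor.execute(
        f"INSERT INTO [{schema}].[telemetry] (device_id, ts, value) VALUES (?, ?, ?)",
        event["device_id"], event["ts"], event["value"],
    )

def write_batch(events):
    conn = pyodbc.connect(CONN_STR)
    try:
        cur = conn.cursor()
        for ev in events:
            write_event(cur, ev)
        conn.commit()
    finally:
        conn.close()

Whether you keep this routing inside an Azure Function fed by the IoT Hub or behind the Stream Analytics output is a separate design choice; the point is only that a single database with one schema per customer is workable.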

Related

Using ksqldb to join data from multiple types of source connectors

We are evaluating ksqlDB as an ETL tool for our organization. Our entire app is hosted on Microsoft Azure, and PaaS offerings are generally preferred in our organization. However, one use case is that we have multiple microservices with their own databases, and we want to join tables across those databases to produce data in a denormalized format for other tasks. An example would be a Users table containing user data and an Orders table containing all the orders. Users may be in SQL format in MySQL, whereas Orders may be in NoSQL format in MongoDB. We need to generate a report by joining the Orders and Users tables on user_id. This can be done in ksqlDB by using joins on streams/tables and adding source connectors for each of the databases. Then we can write a sink connector to a new MongoDB database that holds the joined Users_Orders info, so if new data is added while the connectors and joins are running, our joined Users_Orders data also gets updated.
With Azure Event Hubs, I read that using ksqlDB in production will not be possible because of licensing issues. So my question is:
Before moving to other products like Azure HDInsight or Confluent Cloud, is there any way of running ksqlDB to achieve the same solution (perhaps by managing our own Kafka cluster)?
You don't necessarily need ksqlDB; you should be able to do something similar with Spark SQL, offered in Azure via Databricks. You don't necessarily need Kafka / Event Hubs either, since Spark can read, join, and write Mongo/JDBC data all on its own (with the appropriate plugins).
The main reason ksqlDB isn't a hosted service on Azure is that it conflicts with Confluent licensing, but that does not prevent you from running it yourself, as long as you also adhere to the licensing restriction of not offering the ksqlDB REST API as a publicly available / paid API. I've not personally tried it, but ksqlDB should work against Event Hubs on its own; I don't think you need to self-manage Kafka, as the documentation suggests.
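To make the Spark alternative concrete, here is a hedged PySpark sketch that joins a MySQL Users table with a MongoDB Orders collection and writes the denormalized result back to Mongo. The URLs, credentials, and table/collection names are hypothetical, and the "mongodb" format assumes the MongoDB Spark connector (v10+) is on the classpath.

# Hedged PySpark sketch of the Spark SQL alternative: join a MySQL Users table
# with a MongoDB Orders collection and write the denormalized result back to Mongo.
# URLs, credentials, and table/collection names are hypothetical.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("users-orders-join")
    .getOrCreate()
)

# Users from MySQL via the generic JDBC source.
users = (
    spark.read.format("jdbc")
    .option("url", "jdbc:mysql://mysql-host:3306/appdb")
    .option("dbtable", "Users")
    .option("user", "report")
    .option("password", "<secret>")
    .load()
)

# Orders from MongoDB via the MongoDB Spark connector.
orders = (
    spark.read.format("mongodb")
    .option("connection.uri", "mongodb://mongo-host:27017")
    .option("database", "appdb")
    .option("collection", "Orders")
    .load()
)

# Denormalize: one row per order enriched with its user's data.
users_orders = orders.join(users, on="user_id", how="inner")

(
    users_orders.write.format("mongodb")
    .option("connection.uri", "mongodb://mongo-host:27017")
    .option("database", "reporting")
    .option("collection", "Users_Orders")
    .mode("append")
    .save()
)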

Azure Architecture Implementation Ideas

We've designed a data architecture for our client on Azure wherein we ingest the sources into a Raw Layer consisting of an Azure SQL Database. This Azure SQL Database acts as a Source Mirror and has near-real-time sync enabled.
We also have an ODS Layer which is populated from the previously mentioned Azure SQL Database (Source Mirror) as per the given data model. This layer should ideally take anywhere between 30 minutes and 1 hour to load.
How do I handle the concurrent writes and reads on the Raw Layer (Azure SQL Database, Source Mirror)? It syncs with the sources every 5 minutes but is also read every 30 minutes to 1 hour for the ODS Layer.
I have to use Azure Data Factory to implement my data loads.
Yes, Azure Data Factory is a good fit for such scenarios. It's a cloud-based ETL and data integration service that lets you build data-driven workflows for orchestrating data movement and transforming data at scale. You can use Azure Data Factory to create and schedule data-driven workflows (also known as pipelines) that can ingest data from a variety of sources, and with data flows or compute services such as Azure HDInsight Hadoop, Azure Databricks, and Azure SQL Database you can build sophisticated ETL processes that transform data visually.
When using control flow, you can use a GetMetadata activity to get a list of files in a storage account, then pass that list to a ForEach activity with the Sequential flag set to false to process all files concurrently (in parallel), up to the maximum batch size, according to the activities defined in the ForEach loop.
Here is the official Microsoft documentation for the Azure Data Factory connectors overview.
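Data Factory pipelines are defined in the designer (or as JSON) rather than in application code, but as a rough conceptual analogy to GetMetadata feeding a non-sequential ForEach, the sketch below lists blobs and processes them in parallel up to a batch size. The container name, connection string, and per-file processing are hypothetical placeholders.

# Conceptual analogy only (outside ADF): list files, then process them in parallel
# up to a fixed batch size, like GetMetadata feeding a ForEach with "Sequential"
# unchecked and a batchCount limit. All names are hypothetical.
from concurrent.futures import ThreadPoolExecutor
from azure.storage.blob import ContainerClient

BATCH_COUNT = 20  # analogous to the ForEach batchCount setting

container = ContainerClient.from_connection_string(
    "<storage-connection-string>", container_name="raw-layer"
)

def process_file(blob_name):
    # Placeholder for the per-file work an ADF ForEach iteration would do.
    data = container.download_blob(blob_name).readall()
    print(f"processed {blob_name}: {len(data)} bytes")

# Roughly what GetMetadata's child items gives you.
blob_names = [b.name for b in container.list_blobs()]

with ThreadPoolExecutor(max_workers=BATCH_COUNT) as pool:
    list(pool.map(process_file, blob_names))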

Storing IOT Data in Azure: SQL vs Cosmos vs Other Methods

The project I am working on as an architect has an IoT setup where lots of sensors send data such as water pressure, temperature, etc. to an FTP server (can't change this, as we have no control over it due to security). From there, a few Windows services on Azure pull the data and store it in an Azure SQL Database.
Here are my observations with respect to this architecture:
Problem 1: The 1 TB limit in Azure SQL. With a higher tier it can go to 4 TB, but that's the max, so it does not appear to be infinitely scalable, and as the size grows, query performance could become a problem. Columnstore indexes and partitioning seem to be options, but the size limitation and DTUs are a deal breaker.
Problem 2: The IoT data and the SQL Database (downstream storage) seem to be tightly coupled. If a customer wants to extract a few months of data, or even more, amounting to millions of rows, the DB will get busy and possibly throttle other customers due to DTU exhaustion.
I would like to have some ideas on possibly scaling this further. SQL DB is great, and with JSON support it is awesome, but it is not a horizontally scalable solution.
Here is what I am thinking:
All the messages should be consumed from the FTP server into Azure IoT Hub by some means.
From the central hub, I want to push all messages to Azure Blob Storage in 128 MB files for later analysis at low cost.
At the same time, I would like all messages to go to IoT Hub and from there to Azure Cosmos DB (for long-term storage) / Azure SQL DB (long-term, but not sure due to the size restriction).
I am keeping data in Blob Storage because if the client wants to build models or hires a machine learning team to create them, I would prefer them to pull data from Blob Storage rather than hitting my DB.
Kindly suggest a few ideas on this. Thanks in advance!
First, Azure SQL DB does have Hyperscale, which goes much larger than 4 TB. That said, there is a tipping point where it makes sense to consider alternative architectures once you get bigger than what one machine can handle for your solution. While Cosmos DB does give you a horizontal sharding solution, you can do the same with N SQL Databases (there are libraries to help there). Stepping back, it is actually pretty important to understand what you want to do with the data once it is in a database. Both Cosmos DB and SQL DB are set up for OLTP-style operations (with some limited forms of broader queries; SQL DB supports columnstore and batch mode, for example, which means you could run a reasonably sized data mart just fine there too). If you are just storing things in the database in the hope of supporting future data scientists, then you may or may not really need either of these two OLTP stores.
Synapse SQL is set up for analytics and generally has support to read from data in formats in Azure Storage. So, this may be a better strategy if you want to support arbitrarily-large IoT data and do analytics/ML processing over it.
If you know your solution will never grow that large, you may not need to consider something like Synapse, but it is set up for those scenarios if you are of sufficient size.
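On the N-databases point above: the library support for this (the Elastic Database client library) is .NET, but the core routing idea is simple. As a hedged illustration (connection strings and table names are hypothetical), you map each device or customer key to a shard with a stable hash:

# Hedged sketch of manual horizontal sharding across N Azure SQL databases:
# route each device's telemetry to a shard chosen by a stable hash of its key.
# Connection strings and table names are hypothetical; Microsoft's Elastic
# Database client library (.NET) offers a richer, production-grade version.
import hashlib
import pyodbc

SHARD_CONN_STRS = [
    "Driver={ODBC Driver 18 for SQL Server};Server=tcp:iot-shard-0.database.windows.net;Database=iot;Uid=loader;Pwd=<secret>;Encrypt=yes;",
    "Driver={ODBC Driver 18 for SQL Server};Server=tcp:iot-shard-1.database.windows.net;Database=iot;Uid=loader;Pwd=<secret>;Encrypt=yes;",
]

def shard_for(device_id):
    # Stable hash so a given device always lands on the same shard.
    digest = hashlib.sha256(device_id.encode("utf-8")).digest()
    index = int.from_bytes(digest[:4], "big") % len(SHARD_CONN_STRS)
    return SHARD_CONN_STRS[index]

def insert_reading(device_id, ts, value):
    conn = pyodbc.connect(shard_for(device_id))
    try:
        cur = conn.cursor()
        cur.execute(
            "INSERT INTO telemetry (device_id, ts, value) VALUES (?, ?, ?)",
            device_id, ts, value,
        )
        conn.commit()
    finally:
        conn.close()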
Option - 1:
Why don't you extract and serialize the data based on the partition ID (device ID) and send it over to the IoT Hub, where you can have Azure Functions or Logic Apps deserialize the data into files that are stored in blob containers?
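As a rough sketch of that flow, the snippet below reads device messages from the IoT Hub's Event Hubs-compatible endpoint and lands them as files in Blob Storage; an Azure Function with an IoT Hub trigger and a blob output binding would be the managed equivalent of this loop. The connection strings, container name, flush threshold, and naming scheme are hypothetical.

# Hedged sketch of Option 1: read device messages from the IoT Hub
# Event Hubs-compatible endpoint and write them as files to Blob Storage.
# All connection strings and names are hypothetical placeholders.
import datetime
import uuid

from azure.eventhub import EventHubConsumerClient
from azure.storage.blob import BlobServiceClient

EVENTHUB_CONN = "<iot-hub-event-hubs-compatible-connection-string>"
STORAGE_CONN = "<storage-account-connection-string>"

blob_service = BlobServiceClient.from_connection_string(STORAGE_CONN)
container = blob_service.get_container_client("iot-raw")

buffer = []

def flush():
    # Write the buffered messages as one newline-delimited JSON blob.
    if not buffer:
        return
    name = f"{datetime.date.today():%Y/%m/%d}/{uuid.uuid4()}.jsonl"
    container.upload_blob(name=name, data="\n".join(buffer))
    buffer.clear()

def on_event(partition_context, event):
    buffer.append(event.body_as_str())
    if len(buffer) >= 1000:  # flush in chunks; tune this toward the file size you want
        flush()
    partition_context.update_checkpoint(event)

client = EventHubConsumerClient.from_connection_string(
    EVENTHUB_CONN, consumer_group="$Default", eventhub_name="<iot-hub-name>"
)
with client:
    client.receive(on_event=on_event, starting_position="-1")

Tuning the flush threshold (or rolling files by accumulated size) is how you would work toward the ~128 MB files mentioned in the question.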
Option - 2:
You can also attempt to create a module that extracts the data into an Excel file, which is then sent through the IoT Hub to be stored in the storage containers.

Provision throughput on Database level using Table API in cosmos db

I have come across a requirement where I have to choose the API for Cosmos DB.
I have gone through all the APIs: SQL, Graph, Mongo, and Table. My current project structure is based on Table storage, where I am storing IoT device data.
In the current structure (Table storage):
I have a separate table for each device, with a payload like the one below.
{
Timestamp,
Parameter name,
value
}
Now, if I plan to use Cosmos DB, I can see that I have to provision RUs/throughput for each table, which I think is going to be a big cost. I have not found any way to assign RUs at the database level so that my allocated RUs can be shared across all tables.
Please let me know if there is a way to do this, or whether it is a limitation I have to accept for Cosmos DB with the Table API.
As far as I can see with the SQL API, considering my use case, I can create a single database and then multiple collections (named after the tables), and then I have the option to provision RUs at the database level as well as at the device (collection) level, which gives me more control over cost.
You can set the throughput at the account level.
You can optionally provision throughput at the account level to be shared by all tables in this account, to reduce your bill. These settings can be changed ONLY when you don't have any tables in the account. Note that throughput provisioned at the account level is billed whether you have tables created or not.
Azure Cosmos DB pricing
The throughput configured on the database is shared across all the containers of the database. You can choose to explicitly exclude certain containers from database provisioning and instead provision throughput for those containers at container level.
A Cosmos DB database maps to the following: a database while using SQL or MongoDB APIs, a keyspace while using Cassandra API or a database account while using Gremlin or Table storage APIs.
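For the SQL API route the asker mentions, a hedged sketch with the azure-cosmos Python SDK shows the two options side by side: shared throughput provisioned on the database, plus one container explicitly given its own dedicated throughput. The account URL, key, and names are placeholders.

# Hedged sketch with the azure-cosmos (SQL API) Python SDK: provision shared
# throughput on the database so per-device containers draw from one RU pool,
# while one busy container gets dedicated throughput of its own.
# Account URL, key, and names are hypothetical.
from azure.cosmos import CosmosClient, PartitionKey

client = CosmosClient("https://<account>.documents.azure.com:443/", credential="<key>")

# 400 RU/s shared by every container created without its own throughput.
db = client.create_database_if_not_exists(id="iot", offer_throughput=400)

# Draws from the shared 400 RU/s pool.
readings = db.create_container_if_not_exists(
    id="device-readings", partition_key=PartitionKey(path="/deviceId")
)

# Explicitly excluded from the shared pool: dedicated 1000 RU/s of its own.
hot_device = db.create_container_if_not_exists(
    id="hot-device-readings",
    partition_key=PartitionKey(path="/deviceId"),
    offer_throughput=1000,
)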
You can also use Cerebrata, which lets you assign any number of throughput values after choosing the throughput type (fixed, auto-scale, or no throughput).
Disclaimer: this is purely based on my experience.

Maintaining relationships between data stored in Azure SQL and Table Storage

Working on a project in which the high volume data will be moved from SQL Server (running on an Azure VM) to Azure Table Storage for scaling and cheaper storage reasons. There are several foreign-keys within the data that is being moved to Table Storage, which are GUIDs (Primary-key) in SQL tables. Obviously, there is no means to ensure referential integrity, as transactions do not span different Azure storage types. I would like to know if anyone has had any success with this storage design. Are there any transaction management solutions that allow transactions to be created that span SQL Server and Azure Table Storage? What are the implications of having queries that read from both databases (SQL and Table Storage)?
If you're trying to perform a transaction that spans both SQL Server and Azure Tables, your best bet will be using the eventually consistent transaction pattern.
In a nutshell, you put your updates into a queue message, then have a worker process (whether it be a WebJob, a Worker Role, or something running on your VM) dequeue the message with peek-lock, make sure that all steps within the transaction are executed, and then call complete on the message.
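A hedged sketch of that worker using Azure Service Bus (where peek-lock is the default receive mode) follows; the queue name, connection string, and the two apply_* helpers are hypothetical, and each step is assumed to be idempotent so a retry after a crash is safe.

# Hedged sketch of the eventually consistent transaction pattern: the producer
# enqueues a message describing both updates; this worker receives it in
# peek-lock mode, applies each step idempotently, and only then completes the
# message, so a crash mid-way leads to a retry rather than a half-applied change.
# Queue name, connection string, and the apply_* helpers are hypothetical.
import json

from azure.servicebus import ServiceBusClient

CONN_STR = "<service-bus-connection-string>"
QUEUE = "pending-transactions"

def apply_sql_update(step):
    ...  # hypothetical idempotent write to SQL Server (e.g. a MERGE keyed on the entity id)

def apply_table_update(step):
    ...  # hypothetical idempotent upsert into Azure Table Storage

with ServiceBusClient.from_connection_string(CONN_STR) as sb:
    with sb.get_queue_receiver(queue_name=QUEUE) as receiver:  # peek-lock by default
        for msg in receiver:
            work = json.loads(str(msg))
            apply_sql_update(work["sql_step"])
            apply_table_update(work["table_step"])
            receiver.complete_message(msg)  # only after every step succeeded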
If you're looking to perform a transaction on just an Azure Table, you can do so with batch updates as long as your entities live within the same partition.
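For the single-partition case, a hedged sketch with the azure-data-tables SDK: all operations in the batch share the same PartitionKey, so they commit or fail as a unit. The connection string, table name, and entity values are hypothetical, and the table is assumed to already exist.

# Hedged sketch of a batch (entity group) transaction against one Azure Table
# partition: every operation uses the same PartitionKey and the batch is atomic.
# The connection string, table name, and entities are hypothetical.
from azure.data.tables import TableClient

table = TableClient.from_connection_string(
    "<storage-account-connection-string>", table_name="Orders"
)

partition = "customer-42"
operations = [
    ("upsert", {"PartitionKey": partition, "RowKey": "order-1001", "Status": "Shipped"}),
    ("upsert", {"PartitionKey": partition, "RowKey": "order-1002", "Status": "Pending"}),
]

# Atomic within the partition; raises an error and applies nothing if any operation fails.
table.submit_transaction(operations)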
