I have a requirement to choose an API for Cosmos DB.
I have gone through all the APIs: SQL, Graph (Gremlin), MongoDB, and Table. My current project is based on Table storage, where I store IoT device data.
In the current structure (Table storage):
I have a separate table for each device, with a payload like the following:
{
  Timestamp,
  Parameter name,
  Value
}
Now, if I move to Cosmos DB with the Table API, it appears I have to provision RU/s (throughput) against each table, which I think is going to be a big cost. I have not found any way to assign RU/s at the database level so that my allocated throughput can be shared across all tables.
Please let me know if there is a way to do this, or whether I should treat this as a limitation of the Cosmos DB Table API.
As far as I can see, with the SQL API and my use case I could create a single database with multiple collections (one per table/device), and then I would have the option to provision RU/s at either the database level or the device (collection) level, which gives me more control over cost.
You can set throughput at the account level.
You can optionally provision throughput at the account level to be shared by all tables in the account, to reduce your bill. These settings can be changed only when you don't have any tables in the account. Note that throughput provisioned at the account level is billed whether you have tables created or not.
Azure Cosmos DB pricing
The throughput configured on the database is shared across all the containers of the database. You can choose to explicitly exclude certain containers from database provisioning and instead provision throughput for those containers at container level.
A Cosmos DB database maps to the following: a database while using SQL or MongoDB APIs, a keyspace while using Cassandra API or a database account while using Gremlin or Table storage APIs.
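If you do switch to the SQL (Core) API as described in the question, a minimal sketch of database-level shared throughput with the azure-cosmos Python SDK might look like this; the account URL, key, and names are placeholders, not anything from the question:

```python
# Sketch: one database with shared RU/s, plus one container that opts out
# of the shared pool. All names and RU values are illustrative assumptions.
from azure.cosmos import CosmosClient, PartitionKey

client = CosmosClient("https://<your-account>.documents.azure.com:443/", credential="<your-key>")

# Shared throughput: 1000 RU/s provisioned on the database and shared by its containers.
database = client.create_database_if_not_exists(id="iot", offer_throughput=1000)

# Containers created without their own offer_throughput draw from the shared pool.
telemetry = database.create_container_if_not_exists(
    id="telemetry",
    partition_key=PartitionKey(path="/deviceId"),
)

# A container can be excluded from the shared pool by provisioning its own throughput.
hot_device = database.create_container_if_not_exists(
    id="hot-device",
    partition_key=PartitionKey(path="/deviceId"),
    offer_throughput=400,
)
```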
You can also bring a tool such as Cerebrata into the picture; it lets you assign the throughput value after you pick the throughput type (fixed, autoscale, or no throughput).
Disclaimer: It’s purely based on my experience
Related
I'm building an Azure IoT Hub application. I have several customers, each with a set of devices. Do you think all those customers should be connected to the same hub or a different one(s)?
It really depends on the data being processed and on your actual requirements. If sharing one IoT Hub resource across customers is not an issue, you can use the same IoT Hub; otherwise, provision individual IoT Hubs.
I would like to populate a multi-tenant DB (single DB, multiple schemas) via Azure Stream Analytics. The idea is to use a job that partitions the data by customer and saves it in a table of a specific schema (a schema associated with that customer) in my DB. Is it possible to do this, or is the only way to keep customer data separate to have several DBs (instead of one DB with multiple schemas)?
SQL output in Azure Stream Analytics supports writing in parallel as an option. This option allows for fully parallel job topologies, where multiple output partitions are writing to the destination table in parallel. Enabling this option in Azure Stream Analytics however may not be sufficient to achieve higher throughputs, as it depends significantly on your database configuration and table schema. The choice of indexes, clustering key, index fill factor, and compression have an impact on the time to load tables. For more information about how to optimize your database to improve query and load performance based on internal benchmarks, see SQL Database performance guidance. Ordering of writes is not guaranteed when writing in parallel to SQL Database.
See Increase throughput performance to Azure SQL Database from Azure Stream Analytics for more details.
The project I am working on as an architect has an IoT setup where many sensors send data such as water pressure and temperature to an FTP server (this can't be changed, as we have no control over it for security reasons). From there, a few Windows services on Azure pull the data and store it in an Azure SQL Database.
Here is my observation with respect to this architecture:
Problem 1: the 1 TB limit in Azure SQL. With a higher tier it can go to 4 TB, but that's the max, so it does not appear to be infinitely scalable, and query performance could also become a problem as the data grows. Columnstore indexes and partitioning seem to be options, but the size limitation and DTU limits are deal breakers.
Problem 2: the IoT data and the SQL Database (downstream storage) seem to be tightly coupled. If a customer wants to extract a few months of data, or even more, with millions of rows, the DB will get busy and could throttle other customers due to DTU exhaustion.
I would like some ideas on scaling this further. SQL DB is great, and its JSON support is awesome, but it is not a horizontally scalable solution.
Here is what I am thinking:
All messages should be consumed from the FTP server into Azure IoT Hub by some means.
From the central hub, I want to push all messages to Azure Blob Storage in 128 MB files for later analysis at low cost (a rough sketch of this step follows after these points).
At the same time, I would like all messages to go through IoT Hub and from there into Azure Cosmos DB (for long-term storage) or Azure SQL DB (also long term, but I'm not sure about it due to the size restriction).
I am keeping data in Blob Storage because if the client wants a machine learning team (or hires one) to create some models, I would prefer them to pull data from Blob Storage rather than hitting my DB.
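As referenced in the second point above, here is a rough sketch of the blob-batching step using the azure-storage-blob SDK; the connection string, container name, blob naming, and flush threshold are illustrative assumptions, not part of the original design:

```python
# Sketch: buffer IoT messages in memory and write them out as ~128 MB
# newline-delimited JSON blobs for later analytics/ML use.
import json
from azure.storage.blob import BlobServiceClient


class BlobBatcher:
    """Accumulates messages and flushes them as JSON-lines blobs of roughly target_bytes."""

    def __init__(self, connection_string, container, target_bytes=128 * 1024 * 1024):
        service = BlobServiceClient.from_connection_string(connection_string)
        self._container = service.get_container_client(container)  # container assumed to exist
        self._target = target_bytes
        self._lines = []
        self._size = 0
        self._index = 0

    def add(self, message):
        """Queue one message (a dict); flush automatically once the batch is large enough."""
        line = json.dumps(message)
        self._lines.append(line)
        self._size += len(line) + 1  # +1 for the newline
        if self._size >= self._target:
            self.flush()

    def flush(self):
        """Write the current batch as a single blob and reset the buffer."""
        if not self._lines:
            return
        name = f"telemetry/batch-{self._index:06d}.jsonl"
        self._container.upload_blob(name=name, data="\n".join(self._lines), overwrite=True)
        self._lines, self._size, self._index = [], 0, self._index + 1
```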
Kindly suggest a few ideas on this. Thanks in advance!
Chandan Jha
First, Azure SQL DB does have Hyperscale, which goes much larger than 4 TB. That said, there is a tipping point where it makes sense to consider alternative architectures when you get to be bigger than what one machine can handle for your solution. While Cosmos DB gives you a horizontal sharding solution, you can do the same with N SQL Databases (there are libraries to help there).
Stepping back, it is actually pretty important to understand what you want to do with the data if it were in a database. Both Cosmos DB and SQL DB are set up for OLTP-style operations (with some limited forms of broader queries; SQL DB supports columnstore and batch mode, for example, which means you could do a reasonably-sized data mart just fine there too). If you are just storing things in the database in the hope of needing to support future data scientists, then you may or may not really need either of these two OLTP stores.
Synapse SQL is set up for analytics and has support for reading data in open formats (such as Parquet or CSV) directly from Azure Storage. So this may be a better strategy if you want to support arbitrarily large IoT data and do analytics/ML processing over it.
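For illustration only, such a query could be run from Python against the Synapse serverless SQL endpoint; the workspace name, storage account, container, path, and credentials below are all placeholders:

```python
# Sketch: query Parquet files landed in Azure Storage through the Synapse
# serverless SQL endpoint, so the raw IoT data never has to be loaded into an OLTP store.
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 18 for SQL Server};"
    "SERVER=<workspace>-ondemand.sql.azuresynapse.net;"
    "DATABASE=master;UID=<user>;PWD=<password>"
)

# OPENROWSET lets serverless SQL read files in place from storage.
query = """
SELECT TOP 100 *
FROM OPENROWSET(
    BULK 'https://<storageaccount>.dfs.core.windows.net/iot-raw/telemetry/*.parquet',
    FORMAT = 'PARQUET'
) AS telemetry
"""

for row in conn.execute(query):
    print(row)
```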
If you know your solution will never grow beyond a certain size, you may not need to consider something like Synapse, but it is set up for those scenarios if you are of sufficient size.
Option - 1:
Why don't you extract and serialize the data based on the partition ID (device ID) and send it to IoT Hub, where you can have an Azure Function or Logic App that deserializes the data into files stored in blob containers?
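A hypothetical sketch of the Azure Functions piece of Option 1, assuming an Event Hub trigger on the IoT Hub built-in endpoint and a blob output binding; the binding names and the blob path are illustrative:

```python
# Sketch of the function body. The bindings (summarized) would live in function.json:
#   eventHubTrigger (in)  -> name "events", cardinality "many",
#                            connection to the IoT Hub-compatible endpoint
#   blob (out)            -> name "outputBlob", path "iot-raw/{rand-guid}.jsonl"
import json
import typing

import azure.functions as func


def main(events: typing.List[func.EventHubEvent], outputBlob: func.Out[str]) -> None:
    # Deserialize each IoT message and keep its device (partition) id alongside the payload.
    lines = []
    for event in events:
        body = json.loads(event.get_body().decode("utf-8"))
        lines.append(json.dumps({"deviceId": body.get("deviceId"), **body}))

    # Write the whole batch as one newline-delimited JSON blob.
    outputBlob.set("\n".join(lines))
```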
Option - 2:
You can also attempt to create a module that extracts the data into an Excel file, which is then sent to IoT Hub and stored in the storage containers.
I have data being pushed from Azure IoT Hub -> Stream Analytics -> Cosmos DB.
With 1 simulated device and my Cosmos DB collection at 1,000 RU/s, everything worked fine.
Now I have 10 simulated devices and have scaled my Cosmos DB collection to 15,000 RU/s, but my Stream Analytics job is still degraded.
Do I need to increase the number of parallel connections to the collection?
Can we make it more optimal, since Azure pricing for Cosmos DB depends on throughput and RUs?
Can we make it more optimal, since Azure pricing for Cosmos DB depends on throughput and RUs?
I just want to share some thoughts about improving the write performance of Cosmos DB here.
1. Consistency level
Based on the documentation:
Depending on what levels of read consistency your scenario needs against read and write latency, you can choose a consistency level on your database account.
You could try setting the consistency level to Eventual. For details, please refer to the documentation linked here.
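For instance, with the azure-cosmos Python SDK the client-level consistency can be relaxed like this (account URL and key are placeholders; the account's default consistency can also be changed in the portal):

```python
# Sketch: open the client with a weaker consistency level to reduce read latency/RU cost.
from azure.cosmos import CosmosClient

client = CosmosClient(
    "https://<your-account>.documents.azure.com:443/",
    credential="<your-key>",
    consistency_level="Eventual",
)
```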
2. Indexing
Based on the documentation:
By default, Azure Cosmos DB enables synchronous indexing on each CRUD operation to your collection. This is another useful option to control the write/read performance in Azure Cosmos DB.
Try setting the indexing mode to lazy, and remove any indexes you don't need (a sketch covering this and the partitioning point follows below).
3. Partitioning
Based on the documentation:
Azure Cosmos DB unlimited containers are the recommended approach for partitioning your data, as Azure Cosmos DB automatically scales partitions based on your workload. When writing to unlimited containers, Stream Analytics uses as many parallel writers as the previous query step or input partitioning scheme.
Partition your collection and pass the partition key in the Stream Analytics output to improve write performance.
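As promised above, a combined sketch of points 2 and 3 with the azure-cosmos Python SDK; the database/container names, index paths, and RU value are placeholder assumptions, and lazy indexing is only honored on accounts that still support it:

```python
# Sketch: create the collection with a partition key and a trimmed indexing policy.
from azure.cosmos import CosmosClient, PartitionKey

client = CosmosClient("https://<your-account>.documents.azure.com:443/", credential="<your-key>")
database = client.get_database_client("iot")

indexing_policy = {
    "indexingMode": "lazy",              # defer index maintenance off the write path
    "includedPaths": [{"path": "/deviceId/?"}, {"path": "/timestamp/?"}],
    "excludedPaths": [{"path": "/*"}],   # skip indexing everything else
}

container = database.create_container_if_not_exists(
    id="telemetry",
    partition_key=PartitionKey(path="/deviceId"),  # same key passed in the ASA output settings
    indexing_policy=indexing_policy,
    offer_throughput=15000,
)
```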
What is the best way to limit latency for SQL Azure in global applications?
My application uses SQL Azure, and I would like to know whether, based on the network location of users, it is possible to connect them to a SQL Azure instance near them.
Logically, I would need a SQL Azure database with global replication, but not geo-replication, since each copy would serve as a master and not a secondary.
Thank you in advance.
You may want to try Cosmos DB to distribute data globally and obtain low latency, as explained in this article and this documentation.
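A minimal sketch of that approach with the azure-cosmos Python SDK, where each regional app instance lists its nearest replica regions first (account URL, key, and region list are placeholders):

```python
# Sketch: route reads to the nearest replicated region for this app instance.
from azure.cosmos import CosmosClient

client = CosmosClient(
    "https://<your-account>.documents.azure.com:443/",
    credential="<your-key>",
    preferred_locations=["West Europe", "East US"],  # nearest regions first for this deployment
)
```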
For replicating data with SQL Data Sync for Azure SQL Database, take paired regions into consideration, as they may reduce latency. With SQL Data Sync you can define a hub database and many member databases in other regions, and data can be synced in both directions between the hub and any member database.
We are considering partitioning our data through multiple collections, or creating a separate collection per key.
The key range is in the tens of thousands, less than 100k. My concern is the limit on collections per SQL (DocumentDB) database.
According to the Azure Cosmos DB limits, it seems there is no limit on the number of collections.
If you have any questions about the scale Azure Cosmos DB provides, you can email askcosmosdb@microsoft.com.
Azure Cosmos DB is a global scale database in which throughput and storage can be scaled to handle whatever your application requires.
But I don't recommend creating such a huge number of collections, because collections are billing entities. For more information, refer to the Azure Cosmos DB FAQ.
Collections are also the billing entities for Azure Cosmos DB. Each collection is billed hourly, based on the provisioned throughput and used storage space