Reduce storage in Cosmos DB - Azure

I just realized that some of the tables I moved from Parquet to Cosmos DB are pretty large, since there obviously isn't the same level of compression as in Parquet. That is resulting in a big cost. The RUs don't cost me much, but storage is a bit high. Are there any good recommendations for reducing the size of collections in Cosmos DB, apart from excluding unneeded fields and indexes?

Cosmos DB is not designed to be a cold store for massive amounts of data that isn't actively queried. If you have huge amounts of data that is infrequently queried, one suggestion would be to enable Synapse Link and let it write that data from Cosmos DB into analytical storage on a remote blob store in parquet format. With your data in analytical store, you can then TTL the data from Cosmos DB that you are not actively using and querying for OLTP operations.
If you need to query the older data, you can provision a new Workspace and Notebooks and use SQL or Spark to query it. If you don't need to query it, then you can just let the data remain there. Best of all, the storage costs are the same as regular blob storage, definitely less expensive than the $0.25/GB price for storage in Cosmos DB, which is higher due to it being on-cluster SSD storage.
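For the TTL side of this, a rough sketch with the Python azure-cosmos SDK might look like the following. The account URL, key, database/container names, and the 30-day cutoff are all placeholders, not anything from the original setup:

```python
from azure.cosmos import CosmosClient, PartitionKey

# Placeholder endpoint, key and names -- replace with your own.
client = CosmosClient("https://<your-account>.documents.azure.com:443/", credential="<your-key>")
database = client.get_database_client("telemetry")

# Create the container with a default TTL of 30 days. Documents older than
# this are removed from the transactional store; with Synapse Link enabled,
# the analytical-store copy can be kept around longer and queried from Synapse.
container = database.create_container_if_not_exists(
    id="readings",
    partition_key=PartitionKey(path="/deviceId"),
    default_ttl=30 * 24 * 60 * 60,  # seconds; on an existing container, change it via replace_container or the portal
)

# TTL can also be overridden per document with a "ttl" property.
container.upsert_item({
    "id": "reading-001",
    "deviceId": "device-42",
    "value": 21.5,
    "ttl": 7 * 24 * 60 * 60,  # expire this particular document after 7 days
})
```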

Maybe someone will find this useful: I resolved this problem by applying for the "high storage / low throughput program" https://learn.microsoft.com/en-us/azure/cosmos-db/set-throughput#high-storage-low-throughput-program

Related

Is Azure Table Storage data retrieval faster than SQL Azure?

There is a requirement to store XML data in some kind of storage. Each record (row) is nearly 1 MB in size. The question is which storage to use: Azure Table Storage (a storage account) or SQL Azure.
So which storage will make data storage and retrieval faster?
When looking at sheer volume, Table Storage is today far more scalable than SQL Azure. Given a storage account (storage accounts hold blobs, queues and tables) is allowed to be 100TB in size, in theory your table could consume all 100TB. At first glance, a 100TB chunk of data may seem overwhelming. However, Table Storage can be partitioned. Each partition of Table Storage can be moved to a separate server by the Azure controller thereby reducing the load on any single server. As demand lessens, the partitions can be reconsolidated. Reads of Azure Table Storage are load balanced across three replicas to help performance.
Entities in Table Storage are limited to 1 MB each, with no more than 255 properties (3 of which are the required partition key, row key, and timestamp).
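To make the PartitionKey/RowKey structure concrete, here is a small sketch using the current azure-data-tables Python SDK; the connection string, table name, and XML payload are made up for illustration:

```python
from azure.data.tables import TableServiceClient

# Placeholder connection string -- replace with your storage account's.
service = TableServiceClient.from_connection_string("<your-connection-string>")
table = service.create_table_if_not_exists("XmlRecords")

# PartitionKey groups entities that can be served from the same partition
# server; RowKey must be unique within a partition. Together they form the
# entity's primary key and give the fastest point lookups.
table.create_entity({
    "PartitionKey": "2010-06",         # e.g. one partition per month
    "RowKey": "record-000123",         # unique within the partition
    "Payload": "<order>...</order>",   # the XML body; the whole entity must stay under 1 MB
})

# A point read by PartitionKey + RowKey is the cheapest retrieval path.
entity = table.get_entity(partition_key="2010-06", row_key="record-000123")
```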
Today, SQL Azure databases are limited to 1 GB or 10 GB. However, sometime this month (June 2010), a 50 GB limit is supposed to be available. What happens if your database is larger than 10 GB today (or 50 GB tomorrow)? Options include repartitioning your database into multiple smaller databases or sharding (Microsoft's generally recommended approach). Without getting into the database details of these design patterns, neither approach is without issues and complexity, some of which must be resolved at the application level.
It's hard to say that Azure Table Storage data retrieval must be faster than SQL Azure. It depends on your data structure and size.
As you said, each record (row) of your XML data is nearly 1 MB; if it does not exceed the 1 MB entity limit, you can try Table Storage first.
You can refer to this document for more comparisons between Azure Table Storage and SQL Azure: Azure Table Storage vs. Windows SQL Azure
Hope this helps.

Azure Stream Analytics job degrading while pushing data to Cosmos DB

I have data getting pushed from Azure IoT Hub -> Stream Analytics -> Cosmos DB.
I had 1 simulated device, and my Cosmos DB collection at 1000 RU/s was working fine.
Now I have made it 10 simulated devices and scaled my Cosmos DB collection to 15000 RU/s, but my Stream Analytics job is still getting degraded.
Do I need to increase the number of parallel connections to the collection?
Can we make it more optimal, as Azure pricing of Cosmos DB depends on throughput and RUs?
Can we make it more optimal, as Azure pricing of Cosmos DB depends on throughput and RUs?
I just want to share some thoughts with you about improving the write performance of Cosmos DB here.
1. Consistency Level
Based on the document:
Depending on what levels of read consistency your scenario needs against read and write latency, you can choose a consistency level on your database account.
You could try setting the consistency level to Eventual. For details, please refer to here.
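For what it's worth, here is a minimal sketch of requesting Eventual consistency from the Python SDK; the endpoint and key are placeholders, and the account-level default consistency is still set on the Cosmos DB account itself:

```python
from azure.cosmos import CosmosClient

# Placeholder endpoint and key -- replace with your account's values.
# The client can request a level equal to or weaker than the account default.
client = CosmosClient(
    "https://<your-account>.documents.azure.com:443/",
    credential="<your-key>",
    consistency_level="Eventual",  # weakest level, lowest read/write latency
)
```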
2. Indexing:
Based on the document:
By default, Azure Cosmos DB enables synchronous indexing on each CRUD operation to your collection. This is another useful option to control the write/read performance in Azure Cosmos DB.
Please try setting the indexing mode to lazy. Also, remove indexes you don't need.
3. Partition:
Based on the document:
Azure Cosmos DB unlimited containers are the recommended approach for partitioning your data, as Azure Cosmos DB automatically scales partitions based on your workload. When writing to unlimited containers, Stream Analytics uses as many parallel writers as the previous query step or input partitioning scheme.
Please partition your collection and set the partition key in the Stream Analytics output to improve write performance.
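Putting points 2 and 3 together, a rough sketch of creating a partitioned collection with a trimmed-down indexing policy via the Python SDK; the /deviceId key, the indexed paths, and the throughput value are just examples, not something prescribed by Stream Analytics:

```python
from azure.cosmos import CosmosClient, PartitionKey

client = CosmosClient("https://<your-account>.documents.azure.com:443/", credential="<your-key>")
database = client.get_database_client("iot")

# Index only the paths you actually query on; excluding the rest cuts the RU
# cost of every write. (The answer above also mentions lazy indexing, which
# would be set through "indexingMode".)
indexing_policy = {
    "indexingMode": "consistent",
    "includedPaths": [{"path": "/deviceId/?"}, {"path": "/eventTime/?"}],
    "excludedPaths": [{"path": "/*"}],
}

# Partition on the same key that the Stream Analytics output is configured
# with, so writes fan out across partitions and parallel writers.
container = database.create_container_if_not_exists(
    id="events",
    partition_key=PartitionKey(path="/deviceId"),
    indexing_policy=indexing_policy,
    offer_throughput=15000,
)
```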

Does usage of the Cosmos DB data migration tool consume any RUs?

Does usage of the Cosmos DB data migration tool for periodic backups (export data to blob storage and then restore it when needed) consume any RUs? And does it affect the performance or availability of my DB operations?
What is the recommended solution to implement something like blob snapshots in Azure Storage, but for Cosmos DB?
Does usage of the Cosmos DB data migration tool for periodic backups (export data to blob storage and then restore it when needed) consume any RUs?
Yes. Because you're reading from your Cosmos DB collections you will consume RUs.
And does it affect performance or availability of my db operations?
Again, yes. Simply because you have assigned a certain number of RUs to your collection, and taking a backup consumes part of those RUs, effectively leaving you with fewer RUs for other operations.
What is the recommended solution to implement something like blob snapshots in Azure Storage, but for Cosmos DB?
I'm also curious to know about this. Just thinking out loud, one possible solution would be to increase the throughput (i.e. either assign more RUs or enable RUPM) while the backup is running so that other operations are not impacted, and bring it back to normal once the backup is done.
Any operation against Cosmos DB is going to consume RU (that is, each operation has an RU cost). If you're reading content to back up to another source, it's still going to cost you RU during the reads.
As for affecting performance of other db operations: depends on your RU capacity. If you exceed your committed capacity, then you're throttled. If you find yourself being throttled while backing up, you'll need to increase RU accordingly.
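If you want to see what a backup-style scan actually costs, every response carries a request charge header. A rough sketch with the Python SDK, relying on the SDK's last_response_headers to surface x-ms-request-charge (the database and container names are placeholders):

```python
from azure.cosmos import CosmosClient

client = CosmosClient("https://<your-account>.documents.azure.com:443/", credential="<your-key>")
container = client.get_database_client("appdb").get_container_client("orders")

total_ru = 0.0
# Page through everything, the way an export/backup would, and add up the RU charges.
pages = container.query_items("SELECT * FROM c", enable_cross_partition_query=True).by_page()
for page in pages:
    list(page)  # materialize the page so the request actually executes
    total_ru += float(container.client_connection.last_response_headers["x-ms-request-charge"])

print(f"Backup scan consumed roughly {total_ru:.0f} RUs")
```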
There is no "best" solution for backup. Up to you.

Why is Polybase slow for large compressed files that span 1 billion records?

What would cause Polybase performance to degrade when querying larger datasets in order to insert records into Azure Data Warehouse from Blob storage?
For example, a few thousand compressed (.gz) CSV files with headers, partitioned by a few hours per day, across 6 months' worth of data. Querying these files from an external table in SSMS is not exactly optimal; it's extremely slow.
Ultimately, I'm loading data through Polybase in order to transfer it into Azure Data Warehouse. But with large datasets, Polybase seems pretty slow.
What options are available to optimize Polybase here? Should I wait out the query, or load the data incrementally after each upload to blob storage?
In your scenario, Polybase has to connect to the files in the external source, uncompress them, then ensure they fit your external table definition (schema) and then allow the contents to be targeted by the query. When you are processing large amounts of text files in a one-off import fashion, there is nothing to really cache either, since it is dealing with new content every time. In short, your scenario is compute heavy.
Azure Blob Storage will (currently) max out at around 1,250 MB/sec, so if your throughput is not near maxing this, then the best way to improve performance is to upgrade the DWU on your SQL data warehouse. In the background, this will spread your workload over a bigger cluster (more servers). SQL Data Warehouse DWU can be scaled either up or down in a matter of minutes.
If you have huge volumes and are maxing the storage, then use multiple storage accounts to spread the load.
Other alternatives include relieving Polybase of the unzip work as part of your upload or staging process. Do this from within Azure where the network bandwidth within a data center is lightning fast.
You could also consider using Azure Data Factory to do the work. See here for supported file formats. GZip is supported. Use the Copy Activity to copy from Blob storage into SQL DW.
Also look in to:
CTAS (Create Table As Select), the fastest way to move data from external tables into internal storage in Azure Data Warehouse.
Creating statistics for your external tables if you are going to query them repeatedly. SQL Data Warehouse does not create statistics automatically the way SQL Server does; you need to do this yourself.
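A rough sketch of those last two points as T-SQL fired from Python; the connection string, schema, table, and column names are all made up for illustration:

```python
import pyodbc

# Placeholder connection to the SQL Data Warehouse database.
conn = pyodbc.connect("DSN=MySqlDw;UID=loader;PWD=<password>", autocommit=True)
cur = conn.cursor()

# Statistics on the external table help the optimizer if you query it repeatedly;
# SQL Data Warehouse will not create them for you.
cur.execute("CREATE STATISTICS st_order_date ON ext.Orders (OrderDate)")

# CTAS: land the blob-backed external data into internal, distributed storage
# in one parallel pass -- the fastest Polybase load path.
cur.execute("""
    CREATE TABLE dbo.Orders
    WITH (DISTRIBUTION = HASH(OrderId), CLUSTERED COLUMNSTORE INDEX)
    AS SELECT * FROM ext.Orders
""")
conn.close()
```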

Azure Data Warehouse Database Storage

I am new to Azure Data Warehouse and a little confused, while reading some articles, as to where the data is actually stored. Is it the compute nodes that store the data for the DB tables, or Azure Blob storage?
Thanks
From the Azure documentation:
SQL Data Warehouse is a massively parallel processing (MPP) distributed database system. By dividing data and processing capability across multiple nodes, SQL Data Warehouse can offer huge scalability - far beyond any single system. Behind the scenes, SQL Data Warehouse spreads your data across many shared-nothing storage and processing units. The data is stored in Premium locally redundant storage, and linked to compute nodes for query execution. With this architecture, SQL Data Warehouse takes a "divide and conquer" approach to running loads and complex queries. Requests are received by the Control node, optimized and then passed to the Compute nodes to do their work in parallel.
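If you want to see that spread for yourself, DBCC PDW_SHOWSPACEUSED reports the rows and space a table occupies on each of the 60 distributions. A small sketch via pyodbc; the connection string and table name are placeholders:

```python
import pyodbc

# Placeholder connection -- point it at your SQL Data Warehouse database.
conn = pyodbc.connect("DSN=MySqlDw;UID=reader;PWD=<password>", autocommit=True)
cur = conn.cursor()

# One result row per distribution: row count plus reserved/used space,
# showing how the table's data is spread across the storage units.
cur.execute("DBCC PDW_SHOWSPACEUSED('dbo.FactSales')")
for row in cur.fetchall():
    print(row)
conn.close()
```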
