Does usage of the Cosmos DB data migration tool consume any RUs? - azure

Does usage of the Cosmos DB data migration tool for periodic backups (export data to Blob storage and then restore it when needed) consume any RUs? And does it affect the performance or availability of my DB operations?
What is the recommended solution for implementing in Cosmos DB something like the snapshots available for blobs in Azure Storage?

Does usage of the Cosmos DB data migration tool for periodic backups (export data to Blob storage and then restore it when needed) consume any RUs?
Yes. Because you're reading from your Cosmos DB collections, you will consume RUs.
And does it affect the performance or availability of my DB operations?
Again, yes. You have assigned a certain number of RUs to your collection, and taking a backup consumes a portion of them, effectively leaving fewer RUs for your other operations.
What is the recommended solution for implementing in Cosmos DB something like the snapshots available for blobs in Azure Storage?
I'm also curious about this. Just thinking out loud, one possible solution would be to increase the throughput (i.e. either assign more RUs or enable RUPM) while the backup is running, so that other operations are not impacted, and bring it back to normal once the backup is done.
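A minimal sketch of that idea with the azure-cosmos Python SDK, assuming the collection has its own dedicated, manually provisioned throughput (the account URL, key, database/collection names, the +5000 RU/s bump and run_backup() are all placeholders):

    from azure.cosmos import CosmosClient

    # Placeholders: point these at your own account and the backed-up collection.
    client = CosmosClient("https://<account>.documents.azure.com:443/", "<key>")
    container = client.get_database_client("mydb").get_container_client("mycoll")

    # Remember the normal provisioned throughput so it can be restored afterwards.
    normal_rus = container.get_throughput().offer_throughput

    # Bump RU/s for the backup window (the extra 5000 is purely illustrative).
    container.replace_throughput(normal_rus + 5000)
    try:
        run_backup()  # hypothetical: export the documents to Blob Storage here
    finally:
        # Scale back down as soon as the backup finishes.
        container.replace_throughput(normal_rus)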

Any operation against Cosmos DB is going to consume RUs (that is, each operation has an RU cost). If you're reading content to back up to another source, it's still going to cost you RUs during the reads.
As for affecting the performance of other DB operations: it depends on your RU capacity. If you exceed your provisioned capacity, you're throttled. If you find yourself being throttled while backing up, you'll need to increase RUs accordingly.
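If it helps to see what those backup reads actually cost, here is a rough sketch with the azure-cosmos Python SDK that sums the x-ms-request-charge header per query page (the account details, names and the SELECT * query are placeholders):

    from azure.cosmos import CosmosClient

    # Placeholders: point these at the account/collection being backed up.
    client = CosmosClient("https://<account>.documents.azure.com:443/", "<key>")
    container = client.get_database_client("mydb").get_container_client("mycoll")

    total_ru = 0.0
    pages = container.query_items(
        query="SELECT * FROM c",
        enable_cross_partition_query=True,
    ).by_page()

    for page in pages:
        docs = list(page)  # one page of documents to append to the backup
        # ... write `docs` to Blob Storage here ...
        total_ru += float(
            container.client_connection.last_response_headers["x-ms-request-charge"]
        )

    print(f"Backup read consumed roughly {total_ru:.0f} RUs")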
There is no "best" solution for backup. Up to you.

Related

Reduce storage in Cosmos DB

I just realized that some of the tables I moved from Parquet to Cosmos DB are pretty large, since there obviously isn't the same level of compression as in Parquet. That is resulting in a big cost. In the end the RUs don't cost me much, but storage is a bit high. Are there any good recommendations for reducing the size of collections in Cosmos DB, apart from excluding unneeded fields and indexes?
Cosmos DB is not designed to be a cold store for massive amounts of data that isn't actively queried. If you have huge amounts of data that is infrequently queried, one suggestion would be to enable Synapse Link and let it write that data from Cosmos DB into analytical storage on a remote blob store in Parquet format. With your data in the analytical store, you can then TTL out of Cosmos DB the data you are not actively using and querying for OLTP operations.
If you need to query the older data, you can provision a new workspace and notebooks and use SQL or Spark to query it. If you don't need to query it, you can just let the data remain there. Best of all, the storage costs are the same as regular Blob storage, definitely less expensive than the price for storage in Cosmos DB, which is $0.25/GB because it sits on cluster SSD storage.
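A small sketch of the TTL part with the azure-cosmos Python SDK (the database/container names, the /pk partition key path and the 90-day window are placeholders; the partition key must match the container's existing one, and Synapse Link / analytical storage is configured separately on the account):

    from azure.cosmos import CosmosClient, PartitionKey

    # Placeholders for the account and the collection being slimmed down.
    client = CosmosClient("https://<account>.documents.azure.com:443/", "<key>")
    database = client.get_database_client("mydb")

    # Set a default TTL of 90 days so documents age out of the transactional
    # store; with Synapse Link and analytical-store TTL configured, the
    # analytical copy can be retained.
    database.replace_container(
        "mycoll",
        partition_key=PartitionKey(path="/pk"),  # must match the existing key
        default_ttl=90 * 24 * 60 * 60,           # seconds
    )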
Maybe someone will find this useful: I resolved this problem by applying for the "high storage low throughput program": https://learn.microsoft.com/en-us/azure/cosmos-db/set-throughput#high-storage-low-throughput-program

Azure Data Factory copy from Azure Blob to Cosmos DB is slow

I have a BlockBlob in Premium Azure Storage.
It's a 500 MB zip file containing around 280 million phone numbers in CSV format.
I've created a Pipeline in ADF to unzip this and copy the entries into Cosmos DB SQL API, but it took 40 hours to complete. The goal is to update the DB nightly with a diff in the information.
My Storage Account and Cosmos DB are located in the same region.
The Cosmos DB partition key is the area code and that seems to distribute well.
Currently I'm at 20,000 RU/s. I've scaled up a few times, but the portal keeps telling me to scale more; it's suggesting 106,000 RU/s, which is about $6K a month.
Any ideas on practical ways I can speed this up?
-- Update.
I've tried importing the unzipped file, but it doesn't appear any faster. Slower in fact, despite reporting more peak connections.
I'm now trying to dynamically scale the RU/s up to a really high number when it's time to start the transfer and back down afterwards. I'm still playing with numbers, and I'm not sure of the formula to determine the number of RUs I need to transfer this 10.5 GB in X minutes.
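A back-of-the-envelope version of that formula, assuming roughly 6 RUs per ~1 KB insert (the real charge depends on document size and indexing policy, so a sample write should be measured first):

    docs = 280_000_000        # phone numbers in the CSV
    ru_per_insert = 6         # assumed cost per write; measure a real sample
    window_seconds = 60 * 60  # target: finish the load in one hour

    required_rus = docs * ru_per_insert / window_seconds
    print(f"~{required_rus:,.0f} RU/s")  # roughly 467,000 RU/s for a 1-hour load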
I ended up dynamically scaling the throughput with Azure Functions. Yes, the price for Cosmos DB would have been very expensive if I had left the RUs very high, but I only need them that high while I'm doing the data ingestion, and then I scale back down. I used a Logic App to call an Azure Function that scales the RUs up and then kicks off my Azure Data Factory pipeline. When the pipeline is done, it calls the Azure Function again to scale back down.
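Roughly what that scaling Function might look like in Python (a sketch using the v1 programming model; the COSMOS_URL/COSMOS_KEY app settings, database/container names and default RU value are placeholders):

    import os

    import azure.functions as func
    from azure.cosmos import CosmosClient


    def main(req: func.HttpRequest) -> func.HttpResponse:
        # The Logic App passes the desired RU/s, e.g. ...?ru=100000 before the
        # ADF pipeline runs and ...?ru=20000 once it has finished.
        target_ru = int(req.params.get("ru", "20000"))

        client = CosmosClient(os.environ["COSMOS_URL"], os.environ["COSMOS_KEY"])
        container = (
            client.get_database_client("phonedb").get_container_client("numbers")
        )
        container.replace_throughput(target_ru)

        return func.HttpResponse(f"Throughput set to {target_ru} RU/s")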

Azure Stream Analytics job degrading while pushing data to Cosmos DB

I have data being pushed from Azure IoT Hub -> Stream Analytics -> Cosmos DB.
With 1 simulated device and a Cosmos DB collection at 1,000 RU/s this worked fine.
Now I have 10 simulated devices and have scaled my Cosmos DB collection to 15,000 RU/s, yet my Stream Analytics job is still degrading.
Do I need to increase the number of parallel connections to the collection?
Can we make it more optimal, given that Azure pricing of Cosmos DB depends on throughput and RUs?
Can we make it more optimal, given that Azure pricing of Cosmos DB depends on throughput and RUs?
I just want to share some thoughts on improving the write performance of Cosmos DB here.
1. Consistency level
Based on the documentation:
Depending on what levels of read consistency your scenario needs against read and write latency, you can choose a consistency level on your database account.
You could try setting the consistency level to Eventual. For details, please refer here.
2. Indexing
Based on the documentation:
By default, Azure Cosmos DB enables synchronous indexing on each CRUD operation to your collection. This is another useful option to control the write/read performance in Azure Cosmos DB.
Try setting the indexing mode to lazy, and remove any indexes you don't need.
3. Partitioning
Based on the documentation:
Azure Cosmos DB unlimited containers are the recommended approach for partitioning your data, as Azure Cosmos DB automatically scales partitions based on your workload. When writing to unlimited containers, Stream Analytics uses as many parallel writers as the previous query step or input partitioning scheme.
Partition your collection and pass the partition key in the output to improve write performance (a sketch of these settings follows this list).
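A minimal sketch of those three knobs with the azure-cosmos Python SDK (the account details, database/container names, the /deviceId partition key and the included index paths are all placeholders; the lazy indexing suggested above would be "indexingMode": "lazy"):

    from azure.cosmos import CosmosClient, PartitionKey

    # 1. A weaker consistency level (Eventual here) trades read guarantees
    #    for lower latency.
    client = CosmosClient(
        "https://<account>.documents.azure.com:443/", "<key>",
        consistency_level="Eventual",
    )
    database = client.get_database_client("telemetry")

    # 2. Index only the paths you actually query; everything else is excluded.
    indexing_policy = {
        "indexingMode": "consistent",
        "includedPaths": [{"path": "/deviceId/?"}, {"path": "/eventTime/?"}],
        "excludedPaths": [{"path": "/*"}],
    }

    # 3. Partition the output collection (the same key is then set on the
    #    Stream Analytics Cosmos DB output).
    container = database.create_container_if_not_exists(
        id="readings",
        partition_key=PartitionKey(path="/deviceId"),
        indexing_policy=indexing_policy,
    )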

HDInsight: HBase or Azure Table Storage?

Currently my team is creating a solution that would use HDInsight. We will be getting 5 TB of data daily and will need to run some MapReduce jobs on this data. Would there be any performance/cost difference if our data were stored in Azure Table Storage instead of Azure HBase?
The main differences will be in both functionality and cost.
Azure Table Storage doesn't have a MapReduce engine attached to it in itself, though of course you could use the MapReduce approach to write your own.
You can use Azure HDInsight to connect MapReduce to Table Storage. There are a couple of connectors around, including one written by me, which is Hive-focused, requires some configuration, and may not suit your partition scheme (http://www.simonellistonball.com/technology/hadoop-hive-inputformat-azure-tables/), and a less performance-focused but more complete version from someone at Microsoft (http://blogs.msdn.com/b/mostlytrue/archive/2014/04/04/analyzing-azure-table-storage-data-with-hdinsight.aspx).
The main advantage of Table Storage is that you aren't constantly paying for processing.
If you use HBase, you will need to run a full cluster all the time, so there is a cost disadvantage; however, you will get some functionality and performance gains, plus you will have something a bit more portable should you wish to use other Hadoop platforms. You would also have access to a much greater range of analytic functionality with the HBase option.
HDInsight (HBase/Hadoop) uses Azure Blob storage, not Azure Table Storage. For your data storage you will be charged only the applicable Blob storage cost, based on your subscription.
P.S. Don't forget to delete your cluster once the job has completed, to avoid charges. Your data will persist in Blob storage and can be used by the next cluster you build.

Azure Table Storage vs. on-premises NoSQL

I need to choose a database to store large volumes of data. Though my initial requirement is simply to retrieve chunks of data and save them to an Excel file, I am expecting more complex use cases for this data in the future, where it will be consumed by different applications, especially for analytics - hence the need for aggregated queries.
I am open to using either cloud-based or on-premises storage. I am considering Azure Table storage (when there is a need for aggregated data, I can put a wrapper service + cache around Azure Table storage, but I will eventually end up with NoSQL-type storage) and on-premises MongoDB. Can someone suggest the pros and cons of saving large data in Azure Table storage vs. on-premises MongoDB/Couchbase/RavenDB? The cost factor can be ignored.
I suspect this question may end up getting closed due to its broad nature and potential for gathering more opinions than fact. That said:
This is really going to be an app-specific architecture issue, dealing with latency and bandwidth, as well as the need to maintain on-premises servers and other resources. On-prem, you'll have full control of your hardware resources, but if you're doing high-volume querying against your database, from the cloud, your performance will be hampered by latency and bandwidth. Cloud-based storage (whether in MongoDB or any other database) will have the advantage of being neighbors with your app if set up in the same data center.
Keep in mind: Any persistent database store will need to back its data in Azure Storage, meaning a mounted disk backed by Blob storage. You'll need to deal with the 1TB-per-disk size limit (expanding to 16TB on an 8-core box via stripe), and you'll need to compare this to your storage needs. If you need to go beyond 16TB, you'll need to either shard, go with 200TB Table storage, or go with on-prem MongoDB. But... MongoDB and Table Storage are two different beasts, one being document-based with a focus on query strength, the other a key/value store with very high speed discrete lookups. Comparing the two on the notion of on-prem vs cloud is secondary (in my opinion) to comparing functionality as it relates to your app.
