Azure Data Factory copy from Azure Blob to Cosmos DB is slow

I have a BlockBlob in Premium Azure Storage.
It's a 500 MB zip file containing around 280 million phone numbers in CSV format.
I've created a Pipeline in ADF to unzip this and copy the entries into Cosmos DB SQL API, but it took 40 hours to complete. The goal is to update the DB nightly with a diff in the information.
My Storage Account and Cosmos DB are located in the same region.
The Cosmos DB partition key is the area code and that seems to distribute well.
I've scaled a few times and am currently at 20,000 RU/s, but the portal keeps telling me to scale more. It is suggesting 106,000 RU/s, which is about $6K a month.
Any ideas on practical ways I can speed this up?
Update:
I've tried importing the unzipped file, but it doesn't appear to be any faster; slower, in fact, despite reporting more peak connections.
I'm now trying to dynamically scale the RU/s up to a really high number when it's time to start the transfer and back down afterwards. I'm still playing with numbers, and I'm not sure of the formula to determine how many RU/s I need to transfer this 10.5 GB in X minutes.
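As a rough back-of-envelope (this assumes ~6 RU per ~1 KB insert, which is my own guess; the real per-write charge depends on document size and indexing policy), the required throughput for a given window looks like this:

    # Back-of-envelope estimate of the RU/s needed to ingest N documents in X minutes.
    # The 6 RU per write figure is an assumption, not a measured value.
    docs = 280_000_000          # documents to ingest
    ru_per_write = 6            # assumed average RU cost per insert
    target_minutes = 120        # desired ingestion window

    total_ru = docs * ru_per_write
    required_ru_per_sec = total_ru / (target_minutes * 60)
    print(f"~{required_ru_per_sec:,.0f} RU/s to finish in {target_minutes} minutes")
    # ~233,333 RU/s for a 2-hour window; halving the window doubles the RU/s.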

I ended up dynamically scaling the throughput with Azure Functions. Yes, the price for Cosmos DB would have been very expensive if I had left the RUs that high, but I only need them that high during the data ingestion, and then I scale back down. I used a Logic App that calls an Azure Function to scale the RUs up and then kicks off my Azure Data Factory pipeline. When the pipeline is done, it calls the Azure Function again to scale the RUs back down.
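A minimal sketch of what that scaling Function can look like with the Python Cosmos SDK. The database/container names, environment variables, and the rus query parameter are placeholders, the HTTP-trigger binding configuration is omitted, and it assumes the container has its own manually provisioned throughput:

    import os
    import azure.functions as func
    from azure.cosmos import CosmosClient

    def main(req: func.HttpRequest) -> func.HttpResponse:
        # The Logic App passes the desired throughput, e.g. ?rus=100000 before the
        # ADF pipeline runs and ?rus=20000 once it finishes.
        target_rus = int(req.params.get("rus", "20000"))

        client = CosmosClient(os.environ["COSMOS_URI"], os.environ["COSMOS_KEY"])
        container = client.get_database_client("phones").get_container_client("numbers")

        container.replace_throughput(target_rus)  # change the provisioned RU/s in place
        current = container.get_throughput().offer_throughput
        return func.HttpResponse(f"Throughput is now {current} RU/s")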

Related

Storing IoT Data in Azure: SQL vs Cosmos vs Other Methods

The project I am working on as an architect has an IoT setup where lots of sensors send data such as water pressure and temperature to an FTP server (this can't be changed, as we have no control over it for security reasons). From there, a few Windows services on Azure pull the data and store it in an Azure SQL Database.
Here is my observation with respect to this architecture:
Problem 1: the 1 TB limit in Azure SQL. With a higher tier it can go to 4 TB, but that's the max, so it does not appear to be infinitely scalable, and as the data grows, query performance could become a problem. Columnstore indexes and partitioning seem to be options, but the size limit and DTU ceiling are deal breakers.
Problem 2: the IoT data and the SQL Database (the downstream storage) seem to be tightly coupled. If a customer wants to extract a few months of data, or even more, with millions of rows, the DB will get busy and could throttle other customers through DTU exhaustion.
I would like some ideas on scaling this further. SQL DB is great, and with JSON support it is awesome, but it is not a horizontally scalable solution.
Here is what I am thinking:
All the messages should be consumed from the FTP server by Azure IoT Hub by some means.
From the central hub, I want to push all messages to Azure Blob Storage in 128 MB files for later, low-cost analysis.
At the same time, I would like all messages to go to IoT Hub and from there to Azure Cosmos DB (for long-term storage) or Azure SQL DB (also long term, but I'm not sure because of the size restriction).
I am keeping the data in blob storage because if the client wants or hires a machine learning team to create some models, I would prefer them to pull the data from blob storage rather than hitting my DB.
Kindly suggest a few ideas on this. Thanks in advance!
Chandan Jha
First, Azure SQL DB does have Hyperscale, which goes much larger than 4 TB. That said, there is a tipping point where it makes sense to consider alternative architectures once your solution gets bigger than what one machine can handle. While Cosmos DB gives you a horizontal sharding solution, you can do the same with N SQL Databases (there are libraries to help there).
Stepping back, it is actually pretty important to understand what you want to do with the data once it is in a database. Both Cosmos DB and SQL DB are set up for OLTP-style operations (with some limited forms of broader queries - SQL DB supports columnstore and batch mode, for example, which means you could run a reasonably sized data mart just fine there too). If you are just storing things in the database in the hope of needing to support future data scientists, then you may or may not really need either of these two OLTP stores.
Synapse SQL is set up for analytics and generally has support for reading data stored in various formats in Azure Storage. So this may be a better strategy if you want to support arbitrarily large IoT data and do analytics/ML processing over it.
If you know your solution will never grow beyond a certain size, you may not need to consider something like Synapse, but it is set up for those scenarios if you are of sufficient scale.
Option - 1:
Why don't you extract and serialize the data based on the partition ID (the device ID), send it to the IoT Hub, and have Azure Functions or Logic Apps deserialize the data into files that are stored in blob containers?
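A rough sketch of the IoT Hub-to-blob half of that option, using the hub's Event Hub-compatible endpoint directly instead of a Function binding (connection strings, the container name, and the per-device file layout are all placeholders; a checkpoint store is omitted for brevity):

    import os
    from azure.eventhub import EventHubConsumerClient
    from azure.storage.blob import BlobServiceClient

    blob_service = BlobServiceClient.from_connection_string(os.environ["STORAGE_CONN"])
    container = blob_service.get_container_client("iot-raw")  # assumed to already exist

    def on_event(partition_context, event):
        # IoT Hub stamps the sending device ID onto each message as a system property.
        raw = event.system_properties.get(b"iothub-connection-device-id", b"unknown")
        device_id = raw.decode() if isinstance(raw, bytes) else str(raw)

        blob = container.get_blob_client(f"{device_id}.jsonl")
        if not blob.exists():
            blob.create_append_blob()
        # One JSON line per message, appended to that device's blob.
        blob.append_block((event.body_as_str() + "\n").encode("utf-8"))

    # The IoT Hub's built-in endpoint connection string already contains the entity path.
    consumer = EventHubConsumerClient.from_connection_string(
        os.environ["IOTHUB_EVENTHUB_CONN"], consumer_group="$Default")
    with consumer:
        consumer.receive(on_event=on_event, starting_position="-1")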
Option - 2:
You can also attempt to create a module that extracts the data into an Excel file, which is then sent to the IoT Hub to be stored in the storage containers.

Need help to understand the billing terms/parameters of Azure Storage Account

I have an Azure Storage account on V1. Now I am confused about shifting to V2. My system accesses, moves, and deletes blob files very frequently, so if I move to V2 I will have to keep it on the Hot tier.
My system processes around 1,000 blob files daily, with a total size of around 1 GB. So by the end of the month, a total of 30,000 files will have been created in my system. All 30,000 will be processed and moved to a different blob folder (deleted from the current one and created in the other). That means 30 GB of blob data will be created and moved.
To calculate the price for this scenario, I have used the following link:
https://azure.microsoft.com/en-in/pricing/calculator/?&OCID=AID719810_SEM_ha1Od3sE&lnkd=Google_Azure_Brand&dclid=CMTolaWp3OECFQLtjgodO7oBpQ
Calculating the bill for V1 (RA-GRS) is easy; it contains only two parameters:
Capacity
Storage transactions
So I selected 30 GB for Capacity and 60,000 (30,000 created and 30,000 moved) for Storage transactions. It comes to US$23.43.
I want to calculate the same thing for V2 (RA-GRS, Hot tier), but there are many more parameters:
Capacity
Write Operations
List and Create Container Operations
Read operations
All other operations
Data retrieval
Data write
Geo-replication data transfer
I am confused about which value should go in which field.
Please help me to understand these parameters/fields of storage account V2 billing.
Thank You.
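For what it's worth, here is a rough sketch of how I would lay the same scenario out against the V2 fields. The mapping of creates, copies, and deletes onto the operation categories is my own reading of the calculator, not an official one, and the actual unit prices should be taken from the pricing page:

    # Map the monthly scenario onto the V2 (Hot, RA-GRS) calculator fields.
    # The field assignments below are assumptions, not an official mapping.
    monthly_files = 30_000
    monthly_gb = 30

    v2_fields = {
        "Capacity (GB)": monthly_gb,
        "Write Operations": monthly_files * 2,             # 30k uploads + 30k copies into the new folder
        "List and Create Container Operations": 0,          # assuming the containers already exist
        "Read Operations": monthly_files,                   # each file read once for processing (assumption)
        "All Other Operations": monthly_files,              # the 30k deletes after the move (assumption)
        "Data Write (GB)": monthly_gb,
        "Data Retrieval (GB)": monthly_gb,
        "Geo-Replication Data Transfer (GB)": monthly_gb,   # RA-GRS replicates what is written
    }

    for field, quantity in v2_fields.items():
        print(f"{field}: {quantity}")                       # multiply by the calculator's unit prices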

Does usage of the Cosmos DB data migration tool consume any RUs?

Does usage of the Cosmos DB data migration tool for periodic backups (exporting data to blob storage and then restoring it when needed) consume any RUs? And does it affect the performance or availability of my DB operations?
What is the recommended solution for implementing something in Cosmos DB like the snapshots available for blobs in Azure Storage?
Does usage of the Cosmos DB data migration tool for periodic backups (exporting data to blob storage and then restoring it when needed) consume any RUs?
Yes. Because you're reading from your Cosmos DB collections, you will consume RUs.
And does it affect the performance or availability of my DB operations?
Again, yes. You have assigned a certain number of RUs to your collection, and taking a backup consumes part of those RUs, effectively leaving you with fewer RUs for other operations.
What is the recommended solution for implementing something in Cosmos DB like the snapshots available for blobs in Azure Storage?
I'm also curious to know about this. Just thinking out loud, one possible solution would be to increase the throughput (i.e., assign more RUs or enable RUPM) while the backup is running so that other operations are not impacted, and bring it back down to normal once the backup is done.
Any operation against Cosmos DB is going to consume RUs (that is, each operation has an RU cost). If you're reading content to back it up to another source, it will still cost you RUs during the reads.
As for affecting the performance of other DB operations: it depends on your RU capacity. If you exceed your provisioned capacity, you get throttled. If you find yourself being throttled while backing up, you'll need to increase RUs accordingly.
There is no single "best" solution for backup; it's up to you.
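One way to gauge the RU impact of a backup pass up front: every response from the service carries an x-ms-request-charge header, so you can sum the charges of the reads a backup would perform. A small sketch with the Python SDK (account, database, and container names are placeholders):

    import os
    from azure.cosmos import CosmosClient

    client = CosmosClient(os.environ["COSMOS_URI"], os.environ["COSMOS_KEY"])
    container = client.get_database_client("mydb").get_container_client("mycoll")

    total_ru = 0.0
    pages = container.query_items("SELECT * FROM c",
                                  enable_cross_partition_query=True).by_page()
    for page in pages:
        list(page)  # drain the page; each page corresponds to one billed request
        headers = container.client_connection.last_response_headers
        total_ru += float(headers.get("x-ms-request-charge", 0))

    print(f"Reading everything cost about {total_ru:,.0f} RU")
    # Spread over the backup window, this tells you how far the backup eats into
    # your provisioned RU/s and whether other operations are likely to see 429s.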

How well does Azure SQL Data Warehouse scale?

I want to replace all of my on-prem DWs on SQL Server and use Azure SQL DW. My plan is to remove the hub and spoke model that I currently use for my on-prem SQL and basically have one large Azure SQL DW instance that scales with my client base (currently ~1,000 clients). Would SQL DW scale, or do I need to retain my hub and spoke model?
Azure SQL Data Warehouse is a great choice for removing your hub and spoke model. The service allows you to scale storage and compute independently to meet the compute and storage needs of your data warehouse. For example, you may have a large number of data sets that are infrequently accessed. SQL Data Warehouse allows you to run a small number of compute resources (to save costs) and, using SQL Server features like table partitioning, efficiently access only the data in the "hot" partitions - say, the last 6 months.
The service offers the ability to adjust compute power by moving a slider up or down, with the compute change happening in about 60 seconds (during the preview). This allows you to start small - say, with a single spoke - and add over time, making the migration to the cloud easy. As you need more power, you simply add DWU/compute resources by moving the slider to the right.
As the compute model scales, the number of query and loading concurrency slots increase offering you the ability to support larger numbers of customers. You can read more about Elastic performance and scale with SQL Data Warehouse on Azure.com

best design solution to migrate data from SQL Azure to Azure Table

In our service, we use SQL Azure as the main storage and Azure Table storage for backup. Every day, about 30 GB of data is collected and stored in SQL Azure. Since the data is no longer valid from the next day, we want to migrate it from SQL Azure to Azure Table storage every night.
The question is: what would be the most efficient way to migrate the data from SQL Azure to Azure Table storage?
The naive idea I came up with is to leverage the producer/consumer concept using an IDataReader. That is, first get a data reader by executing "select * from TABLE" and put the data into a queue. At the same time, a set of threads grabs data from the queue and inserts it into Azure Table storage.
Of course, the main disadvantage of this approach (I think) is that we need to keep the connection open for a long time (possibly several hours).
Another approach is to first copy the data from the SQL Azure table to local storage on the Windows Azure instance and then use the same producer/consumer concept. With this approach we can close the connection as soon as the copy is done.
At this point, I'm not sure which one is better, or whether either of them is a good design to implement. Could you suggest a good design for this problem?
Thanks!
I would not recommend using local storage, primarily because:
It is transient storage.
You're limited by the size of local storage (which in turn depends on the size of the VM).
Local storage is local only, i.e. it is accessible only to the VM in which it is created, which prevents you from scaling out your solution.
I like the idea of using queues; however, I see some issues there as well:
Assuming you're planning on storing each row in the queue as a message, you would be performing a lot of storage transactions. If we assume your row size is 64 KB, then to store 30 GB of data you would be doing about 500,000 write transactions (and similarly 500,000 read transactions) - I hope I got my math right :). Even though storage transactions are cheap, I still think you'll be doing so many transactions that it would slow down the entire process.
Since you're doing so many transactions, you may get hit by storage scalability limits. You may want to check into that.
Yet another limitation is the maximum size of a message. Currently a maximum of 64 KB of data can be stored in a single message. What would happen if your row size is more than that?
I would actually recommend throwing blob storage into the mix. What you could do is read a chunk of data from the SQL table (say 10,000 or 100,000 records) and save it to blob storage as a file. Depending on how you want to put the data into table storage, you could store it in CSV, JSON, or XML format (XML if you need to preserve data types). Once the file is written to blob storage, you write a message to the queue containing the URI of the blob you've just written. Your worker role (the processor) continuously polls this queue, gets a message, fetches the file from blob storage, and processes it. Once the worker role has processed the file, you can simply delete the file and the message.
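A sketch of the producer side of that pattern (the original question is .NET, but the idea is the same in Python; the table, container, and queue names plus the connection-string variables are placeholders, and the container and queue are assumed to already exist):

    import csv, io, os
    import pyodbc
    from azure.storage.blob import BlobServiceClient
    from azure.storage.queue import QueueClient

    CHUNK = 100_000  # rows per blob, as suggested above

    blob_service = BlobServiceClient.from_connection_string(os.environ["STORAGE_CONN"])
    container = blob_service.get_container_client("nightly-export")
    queue = QueueClient.from_connection_string(os.environ["STORAGE_CONN"], "export-chunks")

    conn = pyodbc.connect(os.environ["SQLAZURE_CONN"])
    cursor = conn.cursor()
    cursor.execute("SELECT * FROM dbo.DailyData")   # placeholder table name
    columns = [c[0] for c in cursor.description]

    chunk_no = 0
    while True:
        rows = cursor.fetchmany(CHUNK)
        if not rows:
            break
        buf = io.StringIO()
        writer = csv.writer(buf)
        writer.writerow(columns)
        writer.writerows(rows)

        blob = container.get_blob_client(f"daily-{chunk_no:05d}.csv")
        blob.upload_blob(buf.getvalue(), overwrite=True)

        # The worker role polls this queue, downloads the blob, writes the rows
        # to Azure Table storage, then deletes both the blob and the message.
        queue.send_message(blob.url)
        chunk_no += 1

    conn.close()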
