Is Azure Table Storage data retrieval faster than SQL Azure?

There is a requirement to store XML data in some storage. Each record (row) is nearly 1 MB in size. The question is which storage to use: Azure Table Storage (a Storage Account) or SQL Azure.
Which storage will store and retrieve the data faster?

When looking at sheer volume, Table Storage is today far more scalable than SQL Azure. Given that a storage account (storage accounts hold blobs, queues and tables) is allowed to be 100 TB in size, in theory your table could consume all 100 TB. At first glance, a 100 TB chunk of data may seem overwhelming. However, Table Storage can be partitioned. Each partition of Table Storage can be moved to a separate server by the Azure controller, thereby reducing the load on any single server. As demand lessens, the partitions can be reconsolidated. Reads of Azure Table Storage are load balanced across three replicas to help performance.
Entities in Table Storage are limited to 1 MB each, with no more than 255 properties (3 of which are required: partition key, row key, and timestamp).
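To make those limits concrete, here is a minimal sketch using the azure-data-tables Python SDK (the connection string, table name, and key values are placeholders, not anything from the question). Since a single string or binary property is capped at 64 KB, an XML record approaching 1 MB has to be split across several properties to stay within one entity:

```python
# Minimal sketch: store a ~1 MB XML record as one Table Storage entity.
# The connection string, table name, and keys are illustrative placeholders.
from azure.data.tables import TableServiceClient

CHUNK = 64 * 1024 - 1024  # stay safely under the 64 KB per-property limit

def to_entity(partition_key: str, row_key: str, xml: bytes) -> dict:
    """Split the XML across Xml00, Xml01, ... properties (total entity <= 1 MB)."""
    entity = {"PartitionKey": partition_key, "RowKey": row_key}
    for i in range(0, len(xml), CHUNK):
        entity[f"Xml{i // CHUNK:02d}"] = xml[i:i + CHUNK]
    return entity

service = TableServiceClient.from_connection_string("<connection-string>")
table = service.create_table_if_not_exists("XmlRecords")
table.create_entity(to_entity("customer-42", "record-0001", b"<data>...</data>"))

# Point reads by (PartitionKey, RowKey) are the fastest access path.
e = table.get_entity(partition_key="customer-42", row_key="record-0001")
xml = b"".join(v for k, v in sorted(e.items()) if k.startswith("Xml"))
```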
Today, SQL Azure databases are limited to 1 GB or 10 GB. However, sometime this month (June 2010), a 50 GB limit is supposed to become available. What happens if your database is larger than 10 GB today (or 50 GB tomorrow)? Options include repartitioning your database into multiple smaller databases or sharding (Microsoft's generally recommended approach). Without getting into the details of either database design pattern, both approaches bring their own issues and complexity, some of which must be resolved at the application level.
It's hard to say that Azure Table Storage data retrieval must be faster than SQL Azure. It depends on your data structure and size.
As you said, each record (row) of your XML data is nearly 1 MB; if it does not exceed the 1 MB entity limit, you can choose Table Storage first.
You can refer to this document for a more detailed comparison of Azure Table Storage and SQL Azure: Azure Table Storage vs. Windows SQL Azure
Hope this helps.

Related

Reduce storage in Cosmos DB

I just realized that some of the tables I moved from Parquet to Cosmos DB are pretty large, since there is obviously not the same level of compression as in Parquet. That results in a big cost. The RUs don't cost me much, but storage is a bit high. Any good recommendations on how to reduce the size of collections in Cosmos DB, apart from excluding unneeded fields and indexes?
Cosmos DB is not designed to be a cold store for massive amounts of data that isn't actively queried. If you have huge amounts of data that is infrequently queried, one suggestion would be to enable Synapse Link and let it write that data from Cosmos DB into analytical storage on a remote blob store in parquet format. With your data in analytical store, you can then TTL the data from Cosmos DB that you are not actively using and querying for OLTP operations.
If you need to query the older data, you can provision a new Workspace and Notebooks and use SQL or Spark to query the data. If you don't need to query it, then you can just let the data remain there. Best of all, the storage costs are the same as regular blob storage, definitely less expensive than the price for storage in Cosmos DB, which is $0.25/GB because it sits on cluster SSD storage.
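For the TTL part of that approach, here is a rough sketch with the azure-cosmos Python SDK (the account URL, key, database, container, and partition key path below are made-up placeholders):

```python
# Sketch: enable TTL so OLTP documents expire out of Cosmos DB automatically.
# Account URL, key, database/container names and partition key are placeholders.
from azure.cosmos import CosmosClient, PartitionKey

client = CosmosClient("<account-url>", credential="<account-key>")
db = client.create_database_if_not_exists("telemetry")

# default_ttl is in seconds; -1 enables TTL but never expires documents
# unless an individual document sets its own "ttl" field.
container = db.create_container_if_not_exists(
    id="events",
    partition_key=PartitionKey(path="/deviceId"),
    default_ttl=30 * 24 * 3600,  # auto-delete documents after ~30 days
)

# Individual documents can override the container default:
container.upsert_item({"id": "evt-1", "deviceId": "d-42", "ttl": 7 * 24 * 3600})
```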
Maybe someone will find it useful: I resolved this problem by applying for the "high storage / low throughput program": https://learn.microsoft.com/en-us/azure/cosmos-db/set-throughput#high-storage-low-throughput-program

Azure Storage: Table vs Blob

Here is the problem. I have devices pushing telemetry messages to Azure IoT Hub, and currently I save all messages to Table Storage with the device Id as partition key and the telemetry kind as row key. What I want to do is restrict the size of the stored data. For instance, the table should keep only up to 50 MB and then be cleared. What kind of storage should I use for such a use case, and what are the benefits? Any suggestions are highly appreciated.
Neither Azure Tables nor Azure Blobs have a feature where the content automatically gets deleted after a certain size is reached. In fact, I don't think I have come across any cloud storage solution that offers it (I have seen data get deleted automatically based on age).
Thus if you want to delete the data once it reaches a certain size, you will have to write some code and schedule it (using either Functions or WebJobs). That code will find the size occupied and delete the data going over the limit.
Between Blobs and Tables, I am somewhat conflicted. With Blobs, it is much easier to get the storage consumed: you just list the blobs in a container and sum up their sizes. With Tables, you will need to keep fetching entities (i.e. downloading the data) and calculate the size of that data. But then deleting data from Tables is easier, as you will be deleting rows (unless you store each record in a separate blob).
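As a hedged sketch of the blob variant (using the azure-storage-blob Python SDK; the connection string, container name, and 50 MB cap are placeholders), a scheduled Function or WebJob could do something like this:

```python
# Sketch of a size-based cleanup pass over one container, meant to run on a
# schedule (Function/WebJob). Connection string and names are placeholders.
from azure.storage.blob import BlobServiceClient

LIMIT_BYTES = 50 * 1024 * 1024  # keep at most ~50 MB in the container

service = BlobServiceClient.from_connection_string("<connection-string>")
container = service.get_container_client("telemetry")

# Walk blobs newest-first so recent data is kept and the oldest overflow goes.
blobs = sorted(container.list_blobs(), key=lambda b: b.last_modified, reverse=True)

total = 0
for blob in blobs:
    total += blob.size
    if total > LIMIT_BYTES:
        container.delete_blob(blob.name)  # over budget: drop this older blob
```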
If it were based on data age rather than data size, I would have recommended Cosmos DB. Though more expensive than Azure Storage, it lets you define a TTL at the collection level, and based on that policy the documents will be deleted automatically.

Why is Polybase slow for large compressed files that span 1 billion records?

What would cause Polybase performance to degrade when querying larger datasets in order to insert records into Azure Data Warehouse from Blob storage?
For example, a few thousand compressed (.gz) CSV files with headers, partitioned into a few hours per day across six months' worth of data. Querying these files from an external table in SSMS is not exactly optimal, and it's extremely slow.
Objectively, I'm loading data through Polybase in order to transfer it into Azure Data Warehouse. Except it seems that, with large datasets, Polybase is pretty slow.
What options are available to optimize Polybase here? Wait out the query, or load the data incrementally after each upload to blob storage?
In your scenario, Polybase has to connect to the files in the external source, uncompress them, then ensure they fit your external table definition (schema) and then allow the contents to be targeted by the query. When you are processing large amounts of text files in a one-off import fashion, there is nothing to really cache either, since it is dealing with new content every time. In short, your scenario is compute heavy.
Azure Blob Storage will (currently) max out at around 1,250 MB/sec, so if your throughput is not close to that maximum, the best way to improve performance is to upgrade the DWU on your SQL Data Warehouse. In the background, this will spread your workload over a bigger cluster (more servers). SQL Data Warehouse DWUs can be scaled either up or down in a matter of minutes.
If you have huge volumes and are maxing the storage, then use multiple storage accounts to spread the load.
Other alternatives include relieving Polybase of the unzip work as part of your upload or staging process. Do this from within Azure where the network bandwidth within a data center is lightning fast.
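For illustration, a minimal sketch of that staging idea with the azure-storage-blob Python SDK (the connection string and container names are assumptions, and whole files are held in memory for simplicity): decompress each .gz blob inside Azure and re-upload plain CSV for Polybase to read.

```python
# Sketch: download a .gz blob, decompress it inside the data center, and
# re-upload the plain CSV to a staging container that Polybase reads from.
# Connection string and container names are placeholders; files are buffered
# fully in memory, which is fine for a sketch but not for very large blobs.
import gzip

from azure.storage.blob import BlobServiceClient

service = BlobServiceClient.from_connection_string("<connection-string>")
src = service.get_container_client("raw-gz")
dst = service.get_container_client("staged-csv")

for blob in src.list_blobs():
    if not blob.name.endswith(".gz"):
        continue
    compressed = src.download_blob(blob.name).readall()
    dst.upload_blob(blob.name[:-3], gzip.decompress(compressed), overwrite=True)
```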
You could also consider using Azure Data Factory to do the work. See here for supported file formats. GZip is supported. Use the Copy Activity to copy from the Blob storage into SQL DW.
Also look into:
CTAS (Create Table As Select), the fastest way to move data from external tables into internal storage in Azure Data Warehouse.
Creating statistics for your external tables if you are going to query them repeatedly. SQL Data Warehouse does not create statistics automatically the way SQL Server does, so you need to do this yourself.
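To illustrate both points, here is a hedged sketch that drives the T-SQL from Python via pyodbc (the connection string, table names, columns, and distribution choice are invented for the example):

```python
# Sketch: CTAS from an external (blob-backed) table into internal, distributed
# DW storage, then create statistics on the external table. The connection
# string, table and column names are illustrative placeholders.
import pyodbc

conn = pyodbc.connect("<sql-dw-odbc-connection-string>", autocommit=True)
cur = conn.cursor()

# CTAS: the fastest way to land external data in internal DW storage.
cur.execute("""
CREATE TABLE dbo.Telemetry
WITH (DISTRIBUTION = HASH(DeviceId), CLUSTERED COLUMNSTORE INDEX)
AS SELECT * FROM ext.Telemetry;
""")

# Statistics are not created automatically, so create them on columns you
# filter or join on when querying the external table repeatedly.
cur.execute("CREATE STATISTICS st_deviceid ON ext.Telemetry (DeviceId);")
```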

Azure Table Storage vs. On-premises NoSQL

I need to consider a database to store large volumes of data. Though my initial requirement is simply to retrieve chunks of data and save them in an Excel file, I am expecting more complex use cases for this data in the future, where the data will be consumed by different applications, especially for analytics; hence the need for aggregated queries.
I am open to using either cloud-based storage or on-premises storage. I am considering Azure Table Storage (when there is a need to use aggregated data, I can put a wrapper service + cache around Azure Table Storage, but I will still end up with NoSQL-type storage) and on-premises MongoDB. Can someone suggest the pros and cons of storing large data in Azure Table Storage vs. on-premises MongoDB/Couchbase/RavenDB? The cost factor can be ignored.
I suspect this question may end up getting closed due to its broad nature and potential for gathering more opinions than fact. That said:
This is really going to be an app-specific architecture issue, dealing with latency and bandwidth, as well as the need to maintain on-premises servers and other resources. On-prem, you'll have full control of your hardware resources, but if you're doing high-volume querying against your database, from the cloud, your performance will be hampered by latency and bandwidth. Cloud-based storage (whether in MongoDB or any other database) will have the advantage of being neighbors with your app if set up in the same data center.
Keep in mind: any persistent database store will need to back its data in Azure Storage, meaning a mounted disk backed by Blob storage. You'll need to deal with the 1 TB-per-disk size limit (expandable to 16 TB on an 8-core box via striping), and you'll need to compare this to your storage needs. If you need to go beyond 16 TB, you'll need to either shard, go with 200 TB Table storage, or go with on-prem MongoDB. But... MongoDB and Table Storage are two different beasts: one is document-based with a focus on query strength, the other a key/value store with very high-speed discrete lookups. Comparing the two on the notion of on-prem vs. cloud is secondary (in my opinion) to comparing functionality as it relates to your app.

How to manage throughput of Azure Table Storage? (like AWS)

AWS dynamo db has a throughput parameter you can set.
How does Azure Table Storage scale in that regard?
Windows Azure Table Storage does not provide a throughput parameter; instead, throughput targets are already set for Azure Table Storage, as described in this article.
Single Table Partition – a table partition is all of the entities in a table with the same partition key value, and most tables have many partitions. The throughput target for a single partition is:
Up to 500 entities per second
Note, this is for a single partition, and not a single table. Therefore, a table with good partitioning, can process up to a few thousand requests per second (up to the storage account target).
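As a rough sketch of what good partitioning looks like in practice (azure-data-tables Python SDK; the per-device key scheme is just an assumption), writing each device to its own partition spreads the load so no single partition has to absorb more than its 500-entities-per-second target:

```python
# Sketch: one partition per device so writes fan out over many partitions
# instead of funnelling into a single hot one. Connection string and key
# scheme are illustrative assumptions.
from datetime import datetime, timezone

from azure.data.tables import TableServiceClient

service = TableServiceClient.from_connection_string("<connection-string>")
table = service.create_table_if_not_exists("Telemetry")

def write_reading(device_id: str, value: float) -> None:
    table.create_entity({
        "PartitionKey": device_id,  # many devices -> many partitions
        "RowKey": datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%S%f"),
        "Value": value,
    })

write_reading("device-001", 21.5)
write_reading("device-002", 19.8)
```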
