How efficient can the Azure Table storage service be?

How efficient can Azure Table storage be?
Azure Storage has several components: Blobs (in containers), Queues and Tables. How efficient can Tables be, what is their exact use case, and why are they generally used with a supporting service like Azure Cosmos DB?
Can anyone help me understand the concept and thought behind it?
Edit: The problem I am facing is that I have to log a processing batch of 700,000 data rows from C# into Table storage. What is the best practice for achieving this?

This is a three-in-one question :-)
How efficient can tables be
Very efficient, if used properly. Every row in a table has a PartitionKey and a RowKey. Queries perform very well when you can narrow the set using (parts of) the PartitionKey and RowKey; as soon as you start filtering on other columns, performance can degrade quickly. See also the docs regarding this topic.
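To make that concrete, here is a minimal sketch using the Azure.Data.Tables SDK for C# (the table name, key values and the Status property are made up for illustration): a point lookup and a single-partition range query stay fast, while a filter on a non-key property falls back to a table scan.
    using Azure.Data.Tables;

    var table = new TableClient("<storage-connection-string>", "logs");

    // Fast: point lookup resolved directly from PartitionKey + RowKey.
    TableEntity entity = await table.GetEntityAsync<TableEntity>("batch-2021-06-01", "row-0000042");

    // Still fast: range query confined to a single partition.
    var partitionQuery = table.Query<TableEntity>(
        filter: "PartitionKey eq 'batch-2021-06-01' and RowKey lt 'row-0001000'");

    // Slow at scale: filtering on a non-key property forces a scan across the table.
    var scanQuery = table.Query<TableEntity>(filter: "Status eq 'Failed'");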
what is their exact use case
It is basically a key/value NoSQL solution. It can be used very efficiently to store simple data in a fast and cheap manner, and it is one of the cheapest options when it comes to data storage. Tables don't have a fixed schema (hence, NoSQL) and are used to store, for example, logs, configuration data and simple data structures.
and why are they generally used with a supporting service like Azure Cosmos DB
This is not the case: Azure Table Storage can be used on its own. Cosmos DB has a Table API that lets you use Cosmos DB with code written for Azure Table Storage without modifications. It allows for premium performance because not only the PartitionKey and RowKey are indexed, but all the other columns as well, so performance stays very good even when you filter on other columns. But it will cost you more.
Writing data is best done in batches, since batch operations are scoped to a single partition. See Ivan's answer, and the sketch below.
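For the 700,000-row logging scenario from the question, a rough sketch of batched writes with Azure.Data.Tables could look like this (the table name, the hypothetical LoadRows() helper and the choice of partition key are illustrative). Each transaction accepts at most 100 entities, and all of them must share the same PartitionKey, hence the grouping and chunking:
    using System.Collections.Generic;
    using System.Linq;
    using Azure.Data.Tables;

    var table = new TableClient("<storage-connection-string>", "processinglog");
    await table.CreateIfNotExistsAsync();

    // Assume each entity already carries its PartitionKey (e.g. the batch date) and a unique RowKey.
    IEnumerable<TableEntity> rows = LoadRows(); // hypothetical loader for the 700,000 rows

    foreach (var partition in rows.GroupBy(r => r.PartitionKey))
    {
        // A single transaction may contain up to 100 entities, all from the same partition.
        foreach (var chunk in partition.Chunk(100))
        {
            var batch = chunk
                .Select(e => new TableTransactionAction(TableTransactionActionType.Add, e))
                .ToList();
            await table.SubmitTransactionAsync(batch);
        }
    }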
Some more material on when to use it:
https://markheath.net/post/azure-tables-what-are-they-good-for
https://blogs.msdn.microsoft.com/brunoterkaly/2013/01/13/knowing-when-to-choose-windows-azure-table-storage-or-windows-azure-sql-database/

Related

Using SQL Stored Procedure vs Databricks in Azure Data Factory

I have a requirement to write up to 500k records daily to Azure SQL DB using an ADF pipeline.
I have simple calculations as part of the data transformation that can be performed in a SQL Stored Procedure activity. I've also observed Databricks notebooks being used commonly, especially for the scalability benefits going forward. But there is the overhead of placing files in another location after transformation, managing authentication, etc., and I want to avoid any over-engineering unless absolutely required.
I've tested the SQL Stored Procedure approach and it's working quite well for ~50k records (not yet tested with higher volumes).
But I'd still like to know the general recommendation between the two options, especially from experienced Azure or data engineers.
Thanks
I'm not sure there is enough information to make a solid recommendation. What is the source of the data? Why is ADF part of the solution? Is this 500K rows once per day or a constant stream? Are you loading to a Staging table then using SPROC to move and transform the data to another table?
Here are a couple of thoughts:
If the data operation is SQL to SQL [meaning the same SQL instance for both source and sink], then use Stored Procedures. This allows you to stay close to the metal and will perform the best. An exception would be if the computational load is really complicated, but that doesn't appear to be the case here.
Generally speaking, the only reason to call Databricks from ADF is if you already have that expertise and the resources to support it already exist.
Since ADF is part of the story, there is a middle ground between your two scenarios - Data Flows. Data Flows are a low-code abstraction over Databricks. They are ideal for in-flight data transforms and perform very well at high loads. You do not author or deploy notebooks, nor do you have to manage the Databricks configuration. And they are first-class citizens in ADF pipelines.
As an experienced (former) DBA, data engineer and data architect, I cannot see what Databricks adds in this situation. The piece of the architecture you might need to scale is the target for the INSERTs, i.e. Azure SQL Database, which is ridiculously easy to scale either manually via the portal or via the REST API, if that is even required. Consider techniques such as loading into heaps and partition switching if you need to tune the insert.
The overhead of adding an additional component to your architecture and moving your data through it would have to be worth it, plus the additional cost of spinning up Spark clusters while your database is running.
Databricks is a superb tool and has a number of great use cases, e.g. advanced data transforms (i.e. things you cannot do with SQL), machine learning, streaming and others. Have a look at this free resource for a few ideas:
https://databricks.com/p/ebook/the-big-book-of-data-science-use-cases

Azure: Redis vs Table Storage for web cache

We currently use Redis as our persistent cache for our web application, but with its limited memory and cost I'm starting to consider whether Table storage is a viable option.
The data we store is fairly basic JSON data with a clear two-part key, which we'd use for the partition and row key in Table storage, so I'm hoping that would mean fast querying.
I appreciate one is in memory and one is not, so Table storage will be a bit slower, but as we scale I believe there is only one CPU serving data from a Redis cache, whereas with Table storage we wouldn't have that issue as it would be down to the number of web servers we have running.
Does anyone have any experience of using Table storage in this way, or comparisons between the two?
I should add we use Redis in a very minimalist way: get/set and nothing more. We evict our own data and, failing that, leave eviction to Redis when it runs out of space.
This is a fairly broad/opinion-soliciting question. But from an objective perspective, these are the attributes you'll want to consider when deciding which to use:
Table Storage is a durable, key/value store. As such, content doesn't expire. You'll be responsible for clearing out data.
Table Storage scales to 500TB.
Redis is scalable horizontally across multiple nodes (or via the managed Redis cache service). In contrast, Table Storage will provide up to 2,000 transactions/sec on a partition and 20,000 transactions/sec across the storage account; to scale beyond that, you'd need to utilize multiple storage accounts.
Table Storage will have a significantly lower cost footprint than a VM or Redis service.
Redis provides features beyond Azure Storage tables (such as pub/sub, content eviction, etc).
Both Table Storage and Redis Cache are accessible via an endpoint, with many language-specific SDK wrappers around the APIs.
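Both are easy to use from C# for the simple get/set pattern the question describes. A rough sketch with StackExchange.Redis and Azure.Data.Tables follows (connection strings, key names and the payload are made up for illustration); note that only the Redis side supports expiry, so with Table Storage deleting stale entries is your own code's job:
    using System;
    using Azure.Data.Tables;
    using StackExchange.Redis;

    string jsonPayload = "{\"name\":\"Ada\",\"plan\":\"premium\"}";

    // Redis: in-memory, supports expiry.
    var redis = ConnectionMultiplexer.Connect("<redis-connection-string>");
    var cache = redis.GetDatabase();
    await cache.StringSetAsync("user:42:profile", jsonPayload, expiry: TimeSpan.FromMinutes(30));
    RedisValue cached = await cache.StringGetAsync("user:42:profile");

    // Table Storage: durable and cheap, but no expiry - eviction is your code's job.
    var table = new TableClient("<storage-connection-string>", "webcache");
    await table.UpsertEntityAsync(new TableEntity("user:42", "profile") { ["Payload"] = jsonPayload });
    var entry = await table.GetEntityAsync<TableEntity>("user:42", "profile");
    await table.DeleteEntityAsync("user:42", "profile"); // manual eviction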
I found some materials about Azure Redis and Table storage that I hope can help you. There is a video about Azure Redis that also includes a demo comparing Table storage and Redis, starting from around the 50th minute.
Perhaps it can serve as a reference, but the detailed performance depends on your application, data volumes and so on.
The pricing of Table storage depends on the capacity you use; please refer to the pricing details. It is much cheaper than Redis.
There are many differences you might care about, including price, performance, feature set, data persistence and data consistency.
Because Redis is an in-memory data store, it is pretty expensive; that is the price you pay for low latency. Check out Azure's planning FAQ for a general understanding of Redis performance in terms of throughput.
Azure Redis planning FAQ
Redis does have an optional persistence feature that you can turn on if you want your data persisted and restored when the servers have their rare downtime, but it doesn't have a strong consistency guarantee.
Azure Table Storage is not a caching solution. It's a persistent storage solution that saves the data permanently on some kind of disk. Historically (disclaimer: I have not looked for the latest and greatest performance numbers), it has much higher read and write latency. It is also strictly a key-value store model (with two-part keys). Values can have properties, but with many strict limitations around the size of objects you can store, the length of properties, and so on. These limitations are inflexible and painful if your app runs up against them.
Redis has a larger feature set. It can do key-value but also has a bunch of other data structures like sets and lists, and many apps can find ways to benefit from that added flexibility.
See 'Introduction to Redis' (Redis docs).
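A few of those extra structures, sketched with StackExchange.Redis (key names and values are made up for illustration):
    using StackExchange.Redis;

    var redis = ConnectionMultiplexer.Connect("<redis-connection-string>");
    var db = redis.GetDatabase();

    await db.ListRightPushAsync("recent:views", "page-42");       // list: simple activity feed
    await db.SetAddAsync("online:users", "user-7");               // set: membership checks
    await db.SortedSetAddAsync("leaderboard", "user-7", 1520);    // sorted set: ranking by score
    long? rank = await db.SortedSetRankAsync("leaderboard", "user-7", Order.Descending);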
Cosmos DB could be yet another alternative to consider if you're leaning primarily towards Azure technologies. It is pretty expensive, but quite fast and feature-rich, though it too is primarily intended as a persistent store.

How does Azure DocumentDB scale? And do I need to worry about it?

I've got an application that's outgrowing SQL Azure - at the price I'm willing to pay, at any rate - and I'm interested in investigating Azure DocumentDB. The preview clearly has distinct scalability limits (as described here, for instance), but I think I could probably get away with those for the preview period, provided I'm using it correctly.
So here's the question I've got. How do I need to design my application to take advantage of the built-in scalability of Azure DocumentDB? For instance, I know that with Azure Table Storage - that cheap but awfully limited alternative - you need to structure all your data in a two-part hierarchy: PartitionKey and RowKey. Provided you do that (which is well-nigh impossible in a real-world application), ATS (as I understand it) moves partitions around behind the scenes, from machine to machine, so that you get near-infinite scalability. Awesome, and you never have to think about it.
Scaling out with SQL Server is obviously much more complicated - you need to design your own sharding system, deal with figuring out which server the shard in question sits on, and so forth. Possible, and quite scalable when done right, but complex and painful.
So how does scalability work with DocumentDB? It promises arbitrary scalability, but how does the storage engine work behind the scenes? I see that it has "Databases", and each database can have some number of "Collections", and so forth. But how does its arbitrary scalability map to these other concepts? If I have a SQL table that contains hundreds of millions of rows, am I going to get the scalability I need if I put all this data into one collection? Or do I need to manually spread it across multiple collections, sharded somehow? Or across multiple DBs? Or is DocumentDB somehow smart enough to coalesce queries in a performant way from across multiple machines, without me having to think about any of it? Or...?
I've been looking around, and haven't yet found any guidance on how to approach this. Very interested in what other people have found or what MS recommends.
Update: As of April 2016, DocumentDB has introduced the concept of a partitioned collection which allows you scale-out and take advantage of server-side partitioning.
A single DocumentDB database can scale practically to an unlimited amount of document storage partitioned by collections (in other words, you can scale out by adding more collections).
Each collection provides 10 GB of storage and a variable amount of throughput (based on performance level). A collection also provides the scope for document storage and query execution, and is the transaction domain for all the documents contained within it.
Source: http://azure.microsoft.com/en-us/documentation/articles/documentdb-manage/
Here's a link to a blog post I wrote on scaling and partitioning data for a multi-tenant application on DocumentDB.
With the latest version of DocumentDB, things have changed. The 10 GB limit still exists, but it now applies per partition key value rather than per collection; in the past it was up to you to figure out how to split your data across multiple collections to avoid hitting the 10 GB collection limit.
Instead, you can now specify a partition key and DocumentDB handles the partitioning for you, e.g. if you have log data, you may want to partition on the date value in your JSON document so that a new partition is created for each day.
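A rough sketch of that pattern using the current Microsoft.Azure.Cosmos SDK (DocumentDB has since become Azure Cosmos DB; the database, container and document below are made up for illustration):
    using System;
    using Microsoft.Azure.Cosmos;

    var client = new CosmosClient("<cosmos-connection-string>");
    Database database = await client.CreateDatabaseIfNotExistsAsync("telemetry");

    // "/date" is the partition key path; each distinct date value forms its own logical partition.
    Container logs = await database.CreateContainerIfNotExistsAsync("logs", "/date");

    var logEntry = new { id = Guid.NewGuid().ToString(), date = "2016-03-28", message = "batch completed" };
    await logs.CreateItemAsync(logEntry, new PartitionKey(logEntry.date));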
You can fan out queries like this - http://stuartmcleantech.blogspot.co.uk/2016/03/scalable-querying-multiple-azure.html

Key differences between Azure DocumentDB and Azure Table Storage

I am choosing a database technology for my new project, and I am wondering: what are the key differences between Azure DocumentDB and Azure Table Storage?
It seems that the main advantage of DocumentDB is full-text search and rich query functionality. If I understand it correctly, I would not need a separate search engine library such as Lucene/Elasticsearch.
On the other hand, Table Storage is much cheaper.
What are the other differences that could influence my decision?
I consider Azure Search an alternative to Lucene. I used Lucene.NET in a worker role, and simply not having to deal with infrastructure, ingestion and similar issues makes the Azure Search service very appealing to me.
There is a scenario I approached with Azure storage in which I see DocumentDB as a perfect fit, and it might explain my point of view.
I used Azure storage to prepare and keep daily summaries of user activity in my solution outside of Azure SQL Database, as the summaries are requested frequently by a large number of clients, with a good chance of spikes at certain times of the day. It is a simple write-once-read-many usage pattern (my schema) that Azure SQL DB found difficult to cope with, while it fit the capabilities of Table storage perfectly (by the way, the daily summaries could not be cached because of their size).
This scenario evolved over time, and now I keep more aggregated, ready-to-use data in those summaries, and the updates have become more complex.
Keeping these daily summaries in DocumentDB would make the write-once part of the scenario more granular, updating only the relevant data in the complex summary, and would ease the read part, as retrieving parts of several summaries becomes a trivial task, for example.
I would consider DocumentDB in scenarios where the data is unstructured and rather complex and I need rich query capability (Table storage lags behind on this front).
I would consider Azure Search in scenarios in which a high throughput full-text search is required.
I did not find the quotas/expected performance figures needed to compare DocumentDB to Search precisely, but I strongly suspect Search is the best fit to replace Lucene.
HTH, Davide

ColumnStore index benefits on Azure?

We are currently running on Azure and we have a table with hundreds of millions of rows. This table is static and will be refreshed weekly. We've looked at the ColumnStore index, but unfortunately it is not available in SQL Azure yet, so below are my questions:
Will the ColumnStore index be available in Azure?
If not, what other technology can we use to get the same performance benefits as the ColumnStore index would provide?
Can we get the same query performance by using Azure Table Storage?
I'm a newbie to both Azure and columnar databases, so please bear with me if I ask basic questions. :)
About ColumnStore: if you have bought the license, you can check with the development team or ask on blogs such as ScottGu's Blog; that is usually where you will first hear about any feature release.
Azure Table storage is designed for scalability, and you will need to use the partition key very wisely. The partition key is like the index of a book: if you want to find something in the book, you can quickly refer to the index and jump to the right page. In other words, you can group data according to certain criteria and store it in a single partition, so a query that matches those criteria hits only one partition. The thing with partitions is that a table can have any number of them, but they will not necessarily all reside on the same machine, or even the same farm. So when you fire a query against a badly designed Azure table, it can hit more than one server, and thus perform badly. Read Real World: Designing a Scalable Partitioning Strategy for Windows Azure Table Storage.
Hope you get what you are looking for.
As Amar pointed out, keep an eye on the team blogs for the latest in new feature announcements. The goal for SQL Azure is for it to eventually be where new features appear first. However, it will still take a while for things to get there.
As for your performance question, there's no simple answer. Windows Azure resources are designed for scale, not necessarily high performance, so you have to take your scale/capacity targets into account when designing solutions. For your situation, I would encourage you to consider Table storage, but this will depend on access frequency and the types of queries you need to run on the data. Just do not be surprised if you end up keeping redundant copies of your data that are modelled differently, or possibly even running parallel queries and aggregating the results. This is the way Table storage was designed to be used. It's cheaper than SQL Azure, and it's this price difference that makes redundant, specialized data models possible.
This approach also has to be weighed against the cost of retraining your developers to stop thinking in RDBMS terms. :)
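To make the "parallel queries and aggregating results" idea concrete, here is a rough C# sketch with Azure.Data.Tables (the table and partition names are made up); each partition is queried independently and the per-partition results are combined on the client:
    using System.Linq;
    using System.Threading.Tasks;
    using Azure.Data.Tables;

    var table = new TableClient("<storage-connection-string>", "facts");
    string[] partitions = { "2012-W01", "2012-W02", "2012-W03" }; // e.g. one partition per weekly refresh

    var perPartitionCounts = partitions.Select(async pk =>
    {
        long count = 0;
        await foreach (var entity in table.QueryAsync<TableEntity>(filter: $"PartitionKey eq '{pk}'"))
        {
            count++; // aggregate whatever measure you need per partition
        }
        return count;
    });

    long total = (await Task.WhenAll(perPartitionCounts)).Sum();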
