Azure Table Storage - Entity Design Best Practices Question

I'm writing a proof-of-concept application to investigate the possibility of moving a bespoke ASP.NET ecommerce system over to Windows Azure during a necessary re-write of the entire application.
I'm tempted to look at using Azure Table Storage as an alternative to SQL Azure, as the entities being stored are likely to change their schema (properties) over time as the application matures, and I won't need to make endless database schema changes. In addition, we can build referential integrity into the application code, so the case for considering Azure Table Storage is a strong one.
The only potential issue I can see at this time is that we do a small amount of simple reporting - e.g. value of sales between two dates, number of items sold for a particular product, etc.
I know that Table Storage doesn't support aggregate-type functions, and I believe we can achieve what we want with clever use of partitions, multiple entity types to store subsets of the same data, and possibly pre-aggregation, but I'm not 100% sure about how to go about it.
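To make the pre-aggregation idea concrete, here is a rough sketch of the shape I have in mind, using the .NET storage client - the entity, table, and property names are all hypothetical, with one rollup row per product per day that the order-processing code would update as orders arrive:

```csharp
using System;
using System.Collections.Generic;
using Microsoft.WindowsAzure.Storage.Table;

// Hypothetical rollup entity: one row per product per day.
// PartitionKey = product id keeps a product's history together;
// RowKey = yyyyMMdd keeps it sorted, so "units sold for product X
// between two dates" becomes a single-partition range query.
public class DailyProductSales : TableEntity
{
    public DailyProductSales() { }

    public DailyProductSales(string productId, DateTime day)
    {
        PartitionKey = productId;
        RowKey = day.ToString("yyyyMMdd");
    }

    public int UnitsSold { get; set; }
    public double SalesValue { get; set; }
}

public static class SalesReport
{
    // Reads a date range back as a PartitionKey filter plus a RowKey range,
    // with no aggregate functions needed at read time.
    public static IEnumerable<DailyProductSales> ForProduct(
        CloudTable table, string productId, string fromDay, string toDay)
    {
        var query = new TableQuery<DailyProductSales>().Where(
            TableQuery.CombineFilters(
                TableQuery.GenerateFilterCondition("PartitionKey", QueryComparisons.Equal, productId),
                TableOperators.And,
                TableQuery.CombineFilters(
                    TableQuery.GenerateFilterCondition("RowKey", QueryComparisons.GreaterThanOrEqual, fromDay),
                    TableOperators.And,
                    TableQuery.GenerateFilterCondition("RowKey", QueryComparisons.LessThan, toDay))));
        return table.ExecuteQuery(query);
    }
}
```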
Does anyone know of any in-depth documents about Azure Table Storage design principles, so that we make proper and efficient use of tables, PartitionKeys, entity design, and so on?
There are a few simplistic documents around, and the current books available tend not to go into this subject in much depth.
FYI - the ecommerce site has about 25,000 customers and takes about 100,000 orders per year.

Have you seen this post?
http://blogs.msdn.com/b/windowsazurestorage/archive/2010/11/06/how-to-get-most-out-of-windows-azure-tables.aspx
Pretty thorough coverage of tables.

I think there are three potential issues in porting your app to Table Storage:
The lack of reporting - including aggregate functions - which you've already identified
The limited availability of transaction support - batches are atomic only within a single partition, as the sketch after this list shows - and with 100,000 orders per year I think you'll end up missing this support.
Some problems with costs - $1 per million operations is only a small cost, but you may need to factor this in if you get a lot of page views.
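A minimal sketch of the one transactional construct Table Storage does offer - the entity group transaction, in which every operation must share a PartitionKey. The table name, keys, and connection string here are placeholders:

```csharp
using Microsoft.WindowsAzure.Storage;
using Microsoft.WindowsAzure.Storage.Table;

public static class OrderWriter
{
    // Writes an order header and its line items atomically: every entity
    // shares the PartitionKey "customer-42", the batch may hold at most
    // 100 operations, and it either fully succeeds or fully fails.
    public static void SaveOrder(string connectionString)
    {
        var table = CloudStorageAccount.Parse(connectionString)
            .CreateCloudTableClient()
            .GetTableReference("Orders");
        table.CreateIfNotExists();

        var batch = new TableBatchOperation();
        batch.Insert(new DynamicTableEntity("customer-42", "order-1001"));
        batch.Insert(new DynamicTableEntity("customer-42", "order-1001-line-01"));
        batch.Insert(new DynamicTableEntity("customer-42", "order-1001-line-02"));

        table.ExecuteBatch(batch); // throws on any failure; nothing is applied
    }
}
```

Anything spanning two partitions - say, an order plus a global stock counter - gets no such guarantee, which is the gap you are likely to feel at 100,000 orders per year.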
Honestly, I think a hybrid approach might be the way to go - perhaps EF or NHibernate against SQL Azure for critical data, with large objects stored in Table or Blob storage?
Enough of my opinion! For "in depth":
try the storage team's blog http://blogs.msdn.com/b/windowsazurestorage/ - I've found this very good
try the PDC sessions from Jai Haridas (couldn't spot a link - but I'm sure it's still there)
try articles inside Eric's book - http://geekswithblogs.net/iupdateable/archive/2010/06/23/free-96-page-book---windows-azure-platform-articles-from.aspx
there's some very good best-practice advice on http://azurescope.cloudapp.net/ - but this is somewhat performance-oriented

If you have started looking at Azure storage such as Table Storage, it would do no harm to look at other NoSQL offerings on the market (especially document databases). This would give you insight into the NoSQL space and how solutions around such stores are designed.
You can also think about a hybrid approach of a SQL DB + NoSQL solution. Parts of the system may lend themselves very well to the Azure Table Storage model.
NoSQL solutions such as Azure Table Storage have their own challenges, such as:
Schema changes for data
Transactional support
ACID constraints

All table design papers I have seen are pretty much exclusively focused on the topics of scalability and search performance. I have not seen anything related to design considerations for reporting or BI.
Now, Azure tables are accessible through REST APIs and via the Azure SDK. Depending on what reporting you need, you might be able to pull out the information you require with minimal effort. If your reporting requirements are very sophisticated, then perhaps SQL Azure together with Windows Azure SQL Reporting Services might be a better option to consider?

Related

How suitable is DocumentDB for saving application logs?

I want to save logs and traces of my bulky, big enterprise app in DocumentDB,
so that those logs not only help developers troubleshoot issues in production but also help the business take critical data-driven decisions.
For such a scenario, is MongoDB or Azure DocumentDB the better fit?
There is no right answer to this question - only opinions.
Here are some tradeoffs you may want to consider:
Pros:
Document-oriented databases, like DocumentDB, are schema-agnostic. This means the logging data's schema is dictated solely by the application. In other words, you can store log output without having to manage schema updates across both the application and the database to keep those models in sync (low friction).
DocumentDB automatically indexes every property in every document (record). This can speed up your ability to query off arbitrary attributes when debugging... which in turn, can reduce your time-to-mitigate when troubleshooting high-severity incidents.
Cons:
When compared to storing logs as blobs in a blob store... DocumentDB can look fairly expensive as a log store. You are paying a premium to be able to easily index and quickly query the data you are storing. You will want to make sure you are getting value out of what you are paying for.
As the comments above suggested, NoSQL is an umbrella term that encapsulates key-value stores, column-oriented databases, document-oriented databases, graph databases, etc. I'd recommend taking a quick look at the various database categories and understanding the differences between them.
As with any project (logging or otherwise)... you should evaluate the tradeoffs you are making when picking between technologies. An important aspect of software engineering is making the right tradeoffs, not ticking feature checkboxes for their own sake.
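To make the schema-agnostic and automatic-indexing points above concrete, here is a minimal sketch with the DocumentDB .NET client - the account endpoint, key, database, and collection names are placeholders, and the anonymous log shape is just an example:

```csharp
using System;
using System.Threading.Tasks;
using Microsoft.Azure.Documents.Client;

public static class LogSink
{
    public static async Task WriteAsync()
    {
        // Placeholder endpoint and key.
        var client = new DocumentClient(
            new Uri("https://myaccount.documents.azure.com:443/"), "myAuthKey");

        // No schema to declare up front: any serializable object becomes a document.
        await client.CreateDocumentAsync(
            UriFactory.CreateDocumentCollectionUri("LogsDb", "AppLogs"),
            new
            {
                level = "Error",
                message = "Payment gateway timeout",
                orderId = "12345",
                timestamp = DateTime.UtcNow
            });

        // Because every property is indexed automatically, an ad-hoc filter on
        // an arbitrary attribute needs no upfront index design:
        var hits = client.CreateDocumentQuery<dynamic>(
            UriFactory.CreateDocumentCollectionUri("LogsDb", "AppLogs"),
            "SELECT * FROM logs l WHERE l.orderId = '12345'");
        foreach (var hit in hits) Console.WriteLine(hit);
    }
}
```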

How does Azure DocumentDB scale? And do I need to worry about it?

I've got an application that's outgrowing SQL Azure - at the price I'm willing to pay, at any rate - and I'm interested in investigating Azure DocumentDB. The preview clearly has distinct scalability limits (as described here, for instance), but I think I could probably get away with those for the preview period, provided I'm using it correctly.
So here's the question I've got. How do I need to design my application to take advantage of the built-in scalability of Azure DocumentDB? For instance, I know that with Azure Table Storage - that cheap but awfully limited alternative - you need to structure all your data in a two-step hierarchy: PartitionKey and RowKey. Provided you do that (which is nigh-on impossible in a real-world application), ATS (as I understand it) moves partitions around behind the scenes, from machine to machine, so that you get near-infinite scalability. Awesome, and you never have to think about it.
Scaling out with SQL Server is obviously much more complicated - you need to design your own sharding system, deal with figuring out which server the shard in question sits on, and so forth. Possible, and done right, quite scalable, but complex and painful.
So how does scalability work with DocumentDB? It promises arbitrary scalability, but how does the storage engine work behind the scenes? I see that it has "Databases", and each database can have some number of "Collections", and so forth. But how does its arbitrary scalability map to these other concepts? If I have a SQL table that contains hundreds of millions of rows, am I going to get the scalability I need if I put all this data into one collection? Or do I need to manually spread it across multiple collections, sharded somehow? Or across multiple DBs? Or is DocumentDB somehow smart enough to coalesce queries in a performant way from across multiple machines, without me having to think about any of it? Or...?
I've been looking around, and haven't yet found any guidance on how to approach this. Very interested in what other people have found or what MS recommends.
Update: As of April 2016, DocumentDB has introduced the concept of a partitioned collection, which allows you to scale out and take advantage of server-side partitioning.
A single DocumentDB database can scale practically to an unlimited amount of document storage partitioned by collections (in other words, you can scale out by adding more collections).
Each collection provides 10 GB of storage and a variable amount of throughput (based on performance level). A collection also provides the scope for document storage and query execution, and is the transaction domain for all the documents contained within it.
Source: http://azure.microsoft.com/en-us/documentation/articles/documentdb-manage/
Here's a link to a blog post I wrote on scaling and partitioning data for a multi-tenant application on DocumentDB.
With the latest version of DocumentDB, things have changed. There is still a 10 GB limit, but it now applies per partition rather than per collection; in the past, it was up to you to figure out how to split up your data into multiple collections to avoid hitting the 10 GB limit.
Instead, you can now specify a partition key and DocumentDB handles the partitioning for you. For example, if you have log data, you may want to partition the data on the date value in your JSON document, so that each day a new partition is created.
You can fan out queries like this - http://stuartmcleantech.blogspot.co.uk/2016/03/scalable-querying-multiple-azure.html
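Roughly, the setup described above looks like this with the DocumentDB .NET SDK. This is a sketch: the account, database, collection name, throughput figure, and the "/date" key path are all assumptions:

```csharp
using System;
using System.Threading.Tasks;
using Microsoft.Azure.Documents;
using Microsoft.Azure.Documents.Client;

public static class PartitionedSetup
{
    public static async Task CreateAsync()
    {
        var client = new DocumentClient(
            new Uri("https://myaccount.documents.azure.com:443/"), "myAuthKey");

        // Declare "/date" as the partition key; the service then places and
        // splits physical partitions server-side, with no client-side sharding.
        var collection = new DocumentCollection { Id = "Logs" };
        collection.PartitionKey.Paths.Add("/date");

        await client.CreateDocumentCollectionIfNotExistsAsync(
            UriFactory.CreateDatabaseUri("LogsDb"),
            collection,
            // Throughput above the single-partition ceiling makes the
            // collection partitioned at creation time.
            new RequestOptions { OfferThroughput = 10100 });
    }
}
```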

Key differences between Azure DocumentDB and Azure Table Storage

I am choosing database technology for my new project. I am wondering what are the key differences between Azure DocumentDB and Azure Table Storage?
It seems that the main advantage of DocumentDB is full-text search and rich query functionality. If I understand it correctly, I would not need a separate search engine library such as Lucene/Elasticsearch.
On the other hand, Table Storage is much cheaper.
What are the other differences that could influence my decision?
I consider Azure Search an alternative to Lucene. I used Lucene.net in a worker role, and the idea of not having to deal with infrastructure, ingestion, and similar issues makes the Azure Search service very appealing to me.
There is a scenario I approached with Azure storage in which I see DocumentDB as a perfect fit, and it might explain my point of view.
I used Azure storage to prepare and keep daily summaries of user activity in my solution outside of Azure SQL Database, as the summaries are requested frequently by a large number of clients, with a good chance of spikes at certain times of the day. It is a simple write-once-read-many usage pattern (my schema) that Azure SQL DB found difficult to cope with, while it fit the capacity of Table storage perfectly (the daily summaries were too large to cache, by the way).
This scenario evolved over time, and now I happen to keep more aggregated, ready-to-use data in those summaries, and updates became more complex.
Keeping these daily summaries in DocumentDB would make the write-once part of the scenario more granular, updating only the relevant data in the complex summary, and would ease the read part, as getting at parts of several summaries becomes trivial.
I would consider DocumentDB in scenarios in which the data is unstructured and rather complex and I need rich query capability (Table Storage lags behind on this front).
I would consider Azure Search in scenarios in which a high throughput full-text search is required.
I did not find quota or expected-performance figures that precisely compare DocumentDB to Search, but I highly suspect Search is the best fit to replace Lucene.
HTH, Davide

ColumnStore index benefits on Azure?

We are currently running on Azure and we have a table with hundreds of millions of rows. This table is static and will be refreshed weekly. We've looked at the ColumnStore index, but unfortunately it is not in Azure yet, so below are my questions:
Will the ColumnStore index be available in Azure?
If not, what other technology can we use to get the same performance benefits as the ColumnStore index would provide?
Can we get the same query performance by using Azure Table Storage?
I'm a newbie to both Azure and columnar databases, so please bear with me if I ask these questions.. :)
About ColumnStore: if you have bought the license, you can check with the development team or ask on blogs such as ScottGu's blog. That is where you will first learn about any feature release.
Azure Table Storage is designed for scalability. You will need to use the Partition Key very wisely. A Partition Key is like the index of a book: if you want to find something in the book, you can quickly refer to the index and reach the right page. In other words, you can group data by certain criteria and store it in a single partition, so whenever you query on the same criteria, your query hits only one partition. The thing with partitions is that a table can have any number of them, but it is not guaranteed that all the partitions reside on the same machine, or even the same farm. So when you fire a query against a badly designed Azure table, it can hit more than one server, and thus perform badly. Read Real World: Designing a Scalable Partitioning Strategy for Windows Azure Table Storage.
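To illustrate the difference, here is a minimal sketch of the three query shapes with the .NET storage client (the table name and key values are made up):

```csharp
using Microsoft.WindowsAzure.Storage;
using Microsoft.WindowsAzure.Storage.Table;

public static class QueryShapes
{
    public static void Demo(string connectionString)
    {
        var table = CloudStorageAccount.Parse(connectionString)
            .CreateCloudTableClient()
            .GetTableReference("Orders");

        // 1. Point query: PartitionKey + RowKey resolves to one entity on one server.
        var point = table.Execute(
            TableOperation.Retrieve<DynamicTableEntity>("customer-42", "order-1001"));

        // 2. Partition scan: a PartitionKey filter still touches a single partition.
        var partitionScan = table.ExecuteQuery(new TableQuery<DynamicTableEntity>().Where(
            TableQuery.GenerateFilterCondition("PartitionKey", QueryComparisons.Equal, "customer-42")));

        // 3. Table scan: no PartitionKey in the filter, so the query may cross
        //    partition (and server) boundaries - the badly performing case.
        var tableScan = table.ExecuteQuery(new TableQuery<DynamicTableEntity>().Where(
            TableQuery.GenerateFilterCondition("Status", QueryComparisons.Equal, "Shipped")));
    }
}
```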
Hope you get what you are looking for.
As Amar pointed out, keep an eye on the team blogs for the latest new-feature announcements. The goal is for SQL Azure to eventually be where new features appear first. However, it will still take a while for things to get there.
As for your performance question, there's no simple answer. Windows Azure resources are designed for scale, not necessarily high performance, so you need to take your scale/capacity targets into account when designing solutions. For your situation, I would encourage you to consider Table Storage, but this will depend on access frequency and the types of queries you need to run on the data. Just do not be surprised if you have to keep redundant copies of your data that are modelled differently, or possibly even run parallel queries and aggregate the results. This is the way Table Storage was designed to be used. It's cheaper than SQL Azure, and it's this price difference that makes redundant, specialized data models possible.
This approach also has to be weighed against the cost of retraining your developers to stop thinking in RDBMS terms. :)
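For the parallel-queries-plus-aggregation pattern mentioned above, a minimal sketch - assuming a hypothetical "SalesValue" property on each entity and one partition per product:

```csharp
using System.Linq;
using System.Threading.Tasks;
using Microsoft.WindowsAzure.Storage.Table;

public static class FanOut
{
    // Queries one partition per product id in parallel and sums a property
    // client-side, since Table Storage offers no server-side SUM/COUNT.
    public static double TotalSales(CloudTable table, string[] productIds)
    {
        var perPartition = productIds.Select(id => Task.Run(() =>
            table.ExecuteQuery(new TableQuery<DynamicTableEntity>().Where(
                    TableQuery.GenerateFilterCondition(
                        "PartitionKey", QueryComparisons.Equal, id)))
                .Sum(e => e.Properties["SalesValue"].DoubleValue ?? 0.0)));

        return Task.WhenAll(perPartition).Result.Sum();
    }
}
```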

What design decisions can I make today, that would make a migration to Azure and Azure Tables easier later?

I'm rebuilding an application from the ground up. At some point in the future - not sure if it's near or far yet - I'd like to move it to Azure. What decisions can I make today that will make that migration easier?
I'm going to be dealing with large amounts of data, and I like the idea of Azure Tables. Are there some specific persistence choices I can make now that will mimic Azure Tables, so that when the time comes the pain of migration will be lessened?
A good place to start is the Windows Azure Guidance
If you want to use Azure Tables eventually, you could design your database so that every table is just a primary key plus a field holding XML data.
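A minimal sketch of that key-plus-XML-payload shape, with a hypothetical Product type: today the serialized string goes into the XML column next to the key, and after migration the same string can become a property on a table entity keyed by PartitionKey/RowKey:

```csharp
using System.IO;
using System.Xml.Serialization;

public class Product
{
    public string Id { get; set; }
    public string Name { get; set; }
    public decimal Price { get; set; }
}

public static class XmlPayload
{
    // Serializes any public POCO to an XML string, so the relational table
    // only needs (key, payload) and entity schema changes never touch the
    // database schema.
    public static string Serialize<T>(T value)
    {
        var serializer = new XmlSerializer(typeof(T));
        using (var writer = new StringWriter())
        {
            serializer.Serialize(writer, value);
            return writer.ToString();
        }
    }
}
```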
I would advise planning along the lines of almost-infinitely scalable solutions (see Pat Helland's paper "Life beyond Distributed Transactions") and the CQRS approach in general. This way you'll be able to avoid the common pitfalls of distributed apps in general and the peculiarities of Azure Table Storage.
This approach really helps us work with Azure and cloud computing at Lokad (our data sets are quite large, and various levels of scalability are needed).
