We are currently working on a design that uses Azure Functions with an Azure Storage queue binding.
Each message in the queue represents a complete transaction. An Azure Function is bound to that queue, so the function is triggered as soon as a new message arrives in the queue.
The function will then commit the transaction to a SQL database.
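For context, the current function is shaped roughly like this (a simplified sketch; the queue name, message shape, connection-string setting, and SQL schema are all placeholders):

    using System;
    using System.Data.SqlClient;
    using Microsoft.Azure.WebJobs;
    using Microsoft.Azure.WebJobs.Host;
    using Newtonsoft.Json;

    public static class CommitTransaction
    {
        // Hypothetical message shape; the real one carries the full transaction.
        public class TransactionMessage
        {
            public string AccountId { get; set; }
            public decimal Amount { get; set; }
        }

        [FunctionName("CommitTransaction")]
        public static void Run(
            [QueueTrigger("transactions")] string queueMessage, // fires on each new queue message
            TraceWriter log)
        {
            var tx = JsonConvert.DeserializeObject<TransactionMessage>(queueMessage);

            using (var conn = new SqlConnection(
                Environment.GetEnvironmentVariable("SqlConnectionString"))) // app setting name is a placeholder
            {
                conn.Open();
                using (var dbTx = conn.BeginTransaction())
                {
                    using (var cmd = new SqlCommand(
                        "INSERT INTO Transactions (AccountId, Amount) VALUES (@a, @m)", conn, dbTx))
                    {
                        cmd.Parameters.AddWithValue("@a", tx.AccountId);
                        cmd.Parameters.AddWithValue("@m", tx.Amount);
                        cmd.ExecuteNonQuery();
                    }
                    dbTx.Commit();
                }
            }

            log.Info($"Committed transaction for account {tx.AccountId}");
        }
    }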
The first-cut implementation is complete and working fine. However, in retrospect, we are considering the following:
In a typical DAL, there are well-established design patterns using Entity Framework, the repository pattern, etc. However, we didn't find similar guidance/best practices for implementing a DAL within serverless code.
Therefore, my question is: should such patterns be implemented with Azure Functions (this would be challenging :) ), should the serverless code be kept as light as possible, or is this not a use case for Azure Functions at all?
It doesn't take anything too special. We're using a routine set of library DLLs for all kinds of things -- database access, interacting with other parts of Azure (like retrieving Key Vault secrets for connection strings), parsing file uploads, business rules, and so on. The libraries target netstandard2.0 so we can more easily migrate to Functions v2 when the right triggers become available.
Mainly just design your libraries so they're highly modularized, so you can minimize how much you load to get the job done (assuming reuse in other areas of the system is important, which it usually is).
It would be easier if dependency injection were available today. See this for a few ways some of us have hacked it together until we get official DI support. (DI is on the roadmap for Functions; I believe it's slated for the 3.0 release.)
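One common interim workaround (just a sketch of the general idea, not the official DI story; IUploadParser and friends stand in for our real library types) is to hold services in lazily-initialized statics, so they are built once per host instance and shared across invocations:

    using System;
    using Microsoft.Azure.WebJobs;

    public static class ProcessUpload
    {
        // Lazy<T> ensures the service is created once per host instance, not per invocation.
        private static readonly Lazy<IUploadParser> Parser =
            new Lazy<IUploadParser>(() => new UploadParser(
                Environment.GetEnvironmentVariable("StorageConnectionString")));

        [FunctionName("ProcessUpload")]
        public static void Run([QueueTrigger("uploads")] string message)
        {
            Parser.Value.Parse(message);
        }
    }

    // Placeholder types standing in for the real library code.
    public interface IUploadParser { void Parse(string message); }
    public class UploadParser : IUploadParser
    {
        public UploadParser(string connectionString) { /* ... */ }
        public void Parse(string message) { /* ... */ }
    }

Proper constructor injection will make this much cleaner once official support lands; this is only a stopgap.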
At first I was a little worried about startup time with the library approach, but the underlying WebJobs stack itself is already pretty heavy, and Functions startup performance seems to vary wildly anyway (on the cheaper tiers, at least). During testing, one of our infrequently-executed Functions has varied from just ~300ms to a peak of about ~3800ms to parse the exact same test file, with all but ~55ms spent on startup.
should such patterns be implemented with Azure Functions (this would be challenging :) ), should the serverless code be kept as light as possible, or is this not a use case for Azure Functions at all?
My answer is NO.
There should be patterns to follow, but the traditional repository pattern and CRUD operations do not seem to carry over to the cloud era.
Many strong concepts we were raised to adhere to have become invalid these days.
Denormalizing the database has become not only acceptable but often preferable.
Designing a pattern now depends on the database you selected for your solution, on the type of your application, and on the type of your data.
Here is a link with general guidance for Table Storage design: Table storage design guidelines.
Is your application read-heavy or write-heavy? The design will vary accordingly.
Are you using Azure Tables or Mongo? There are design decisions based on that: indexing is important in Mongo, while in Azure Table storage there is no secondary indexing you can add (only the PartitionKey/RowKey index).
Sharding considerations.
Redundancy considerations.
In modern development/architecture many principles have changed; each microservice has its own database, which might be totally different from any other microservice's.
If you read through the guidelines I linked, you will see what I mean; a short sketch applying a couple of them follows the lists below.
Designing your Table service solution to be read efficient:
Design for querying in read-heavy applications. When you are designing your tables, think about the queries (especially the latency sensitive ones) that you will execute before you think about how you will update your entities. This typically results in an efficient and performant solution.
Specify both PartitionKey and RowKey in your queries. Point queries such as these are the most efficient table service queries.
Consider storing duplicate copies of entities. Table storage is cheap so consider storing the same entity multiple times (with different keys) to enable more efficient queries.
Consider denormalizing your data. Table storage is cheap so consider denormalizing your data. For example, store summary entities so that queries for aggregate data only need to access a single entity.
Use compound key values. The only keys you have are PartitionKey and RowKey. For example, use compound key values to enable alternate keyed access paths to entities.
Use query projection. You can reduce the amount of data that you transfer over the network by using queries that select just the fields you need.
Designing your Table service solution to be write efficient:
Do not create hot partitions. Choose keys that enable you to spread your requests across multiple partitions at any point in time.
Avoid spikes in traffic. Smooth the traffic over a reasonable period of time and avoid spikes in traffic.
Don't necessarily create a separate table for each type of entity. When you require atomic transactions across entity types, you can store these multiple entity types in the same partition in the same table.
Consider the maximum throughput you must achieve. You must be aware of the scalability targets for the Table service and ensure that your design will not cause you to exceed them.
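To make a couple of the read guidelines concrete, here is a minimal sketch with the classic .NET Table storage SDK (table, key, and entity names are made up) showing a point query on a compound RowKey and a projected query:

    using System.Collections.Generic;
    using Microsoft.WindowsAzure.Storage;
    using Microsoft.WindowsAzure.Storage.Table;

    // Hypothetical entity; a pre-computed "summary" row like this is the denormalization idea above.
    public class DailySummaryEntity : TableEntity
    {
        public int PageViews { get; set; }
        public int Comments { get; set; }
    }

    public static class ReadEfficientSketch
    {
        public static void Run(string connectionString)
        {
            var table = CloudStorageAccount.Parse(connectionString)
                .CreateCloudTableClient()
                .GetTableReference("Reports");
            table.CreateIfNotExists();

            // 1. Point query: PartitionKey + RowKey specified -- the most efficient read.
            //    The RowKey here is a compound key (site + date) to allow a direct lookup.
            var op = TableOperation.Retrieve<DailySummaryEntity>("summaries", "site1_2018-01-31");
            var summary = (DailySummaryEntity)table.Execute(op).Result;

            // 2. Query projection: only pull the columns you actually need over the wire.
            var query = new TableQuery<DailySummaryEntity>()
                .Where(TableQuery.GenerateFilterCondition(
                    "PartitionKey", QueryComparisons.Equal, "summaries"))
                .Select(new List<string> { "PageViews" });
            foreach (var row in table.ExecuteQuery(query))
            {
                // row.PageViews is populated; the other properties are left at their defaults.
            }
        }
    }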
Another good source is this related question:
I've read a bit about microservices and the favored approach appears to be a separate database for each microservice. With regards to Azure's CosmosDB, would that mean a separate Table for each service? What's the best way to architect this?
There are a huge variety of factors to consider here which ultimately means there is no right answer to this question and it will be very specific to the nature of the application you're trying to build. As such, broad statements attempting to offer "general" advice and patterns should be taken with a huge grain of salt. With Cosmos a few of the many high level things to consider when making your decisions are as follows:
Partitioning: Cosmos collections support almost infinite scale based on the selection of an appropriate partition key. So, for example, you could have a single collection and separate your services such that they each write to a distinct partition key. This would provide you with a form of service multi-tenancy which might be perfectly appropriate for your particular application. However, throughput is also scaled at the collection level, so if certain services have much higher read and/or write requirements this may not work for you and could be an indication that that particular service should use its own collection, which can be scaled independently. (A minimal sketch of the single-collection option follows these points.)
Cost: You're billed per collection with a minimum throughput requirement. Depending on the number and nature of your micro services this could result in exponentially higher costs for little gain.
Isolation: Again, depending on the nature of your application you might have a hard business requirement that data from different services be physically separate from each other which would force you to use separate collections.
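To make the first point concrete, here is a minimal sketch (DocumentDB .NET SDK; the account, database, collection, and the /service partition key path are all made up, and the collection is assumed to have been created with /service as its partition key) of one shared collection where each service writes under its own partition key value:

    using System;
    using System.Threading.Tasks;
    using Microsoft.Azure.Documents;
    using Microsoft.Azure.Documents.Client;

    public static class PerServicePartitionSketch
    {
        public static async Task RunAsync()
        {
            var client = new DocumentClient(
                new Uri("https://myaccount.documents.azure.com:443/"), "<key>");

            // One shared collection; each microservice tags its documents with its own
            // partition key value, giving a simple form of service multi-tenancy.
            var collectionUri = UriFactory.CreateDocumentCollectionUri("appdb", "shared");

            await client.CreateDocumentAsync(collectionUri, new
            {
                id = Guid.NewGuid().ToString(),
                service = "ordering",          // partition key path is /service
                orderId = 1234,
                total = 99.95
            });

            // Queries scoped to one service stay inside a single partition.
            foreach (dynamic doc in client.CreateDocumentQuery(
                collectionUri,
                "SELECT * FROM c WHERE c.orderId = 1234",
                new FeedOptions { PartitionKey = new PartitionKey("ordering") }))
            {
                // process matching documents for the "ordering" service only
            }
        }
    }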
The point that I'm trying to make here is that there is absolutely no right answer to this question. You need to weigh the pros/cons very carefully in the context of the solution you are trying to build and select the approach that is right for you.
In terms of
scalability,
performance,
maintenance,
ease of use / learning curve,
cost,
in order of significance, but I wouldn't mind a general answer, as I appreciate I'm probably asking for too much :)
Thanks
EDIT: I'm looking for a database that will serve as the single authoritative data store, and I need all attributes of the stored documents to be indexed, for various business reasons. Therefore I know that other solutions won't do what I'm looking for.
tl;dr: If you are using JavaScript and building browser apps, node.js and DocumentDB are a match made in heaven. If you are using .NET and/or other Azure services, then DocumentDB is favored. If you are using other AWS services, then SimpleDB might be better.
I know that questions like this are not ideal for Stack Overflow, but I often see value in answers like this and my most popular answer on SO is essentially informed opinion backed by evidence. I have not used SimpleDB but I looked into it before deciding on DocumentDB. I rejected it pretty quickly... although I did give AWS Lambda a serious look before deciding on DocumentDB. So:
scalability. DocumentDB has a very straightforward and explicit scaling model -- add more collections if you need either more space or more operations per second. SimpleDB's scaling model is similar but less straightforward, since you add domains, which are overloaded to provide both type separation (think tables) and scalability. You can scale either to whatever you need.
performance. Since I never built anything on it, I can't say anything about SimpleDB's performance. However, I've been very impressed with the performance of DocumentDB. Latency is less than 10ms for simple id-based reads, and I get impressive latency and throughput for queries. The DocumentDB implementation of our current app returns complex n-dimensional aggregations (done in stored procedures on DocumentDB using documentdb-lumenize) in 1/4 the time of the functionally-equivalent MongoDB/node.js implementation. You'd have to do your own performance testing on your actual application to have a definitive answer here.
maintenance. Both are much more hands off than traditional data stores. There just aren't that many knobs to turn maintaining either of them. SimpleDB geographically distributes your data by default. You'd have to do the equivalent manually in DocumentDB. Possible, but harder. DocumentDB has good import/export tools and their backup solution is about to be significantly upgraded.
ease of use / learning curve. If you are a JavaScript programmer, then DocumentDB has a lot to recommend it. DocumentDB uses JSON natively. SimpleDB uses XML. DocumentDB has ACID-enabling stored procedures written in JavaScript. You'd need to combine SimpleDB with something else (Lambda maybe, but the XML/JavaScript mismatch would make this less than ideal) to get the equivalent. Both allow you to use SQL, but DocumentDB also allows for JavaScript-native queries.
There is one huge mindset hurdle that you will have to get over in order to be successful with DocumentDB. Despite the fact that they both scale by adding more domains/collections, SimpleDB domains are closer conceptually to tables. The word choice of "collection" by the DocumentDB team is unfortunate, since they are more akin to partitions and should not be thought of as tables. The hard part is getting used to the idea that you store all of your different data types in the same collection. Once you get over that, I find DocumentDB's approach refreshing and incredibly flexible. I can efficiently model inheritance and type mixins. Collections (nay, partitions) have one purpose -- scalability. Domains are used for both scalability and data-type separation, which is actually harder in practice.
cost. Not much to say here. Both allow you to scale your cost gradually. For really small implementations, DocumentDB is probably more expensive since the smallest unit of usage is a single collection, which is $25/month minimum. You'd have to do your own modeling/what-if analysis to determine which would be less expensive for you. Note that Azure is being very aggressive in general and even pushing AWS to lower prices in some cases. My gut is that they would be roughly equal in cost for most applications.
Other thoughts:
You wrote, "I need all attributes of the documents stored to be indexed". One really nice feature of DocumentDB is that you can specify the size of your indexes. By default, every field is indexed into a 3-byte-per-field hash index, which is highly space efficient. I do not know if SimpleDB has an equivalent.
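For illustration, an indexing policy along those lines might be declared like this when creating the collection (a sketch only; as I understand it the default string hash precision is 3 bytes, and the collection name here is made up):

    using System.Collections.ObjectModel;
    using Microsoft.Azure.Documents;

    public static class IndexingSketch
    {
        public static DocumentCollection Build()
        {
            return new DocumentCollection
            {
                Id = "docs",
                IndexingPolicy = new IndexingPolicy
                {
                    Automatic = true,                          // index every attribute automatically
                    IncludedPaths = new Collection<IncludedPath>
                    {
                        new IncludedPath
                        {
                            Path = "/*",
                            Indexes = new Collection<Index>
                            {
                                Index.Hash(DataType.String, 3),   // 3-byte hash index per string field
                                Index.Range(DataType.Number, -1)  // full-precision range index on numbers
                            }
                        }
                    }
                }
            }; // pass the result to CreateDocumentCollectionAsync
        }
    }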
This is a bit like comparing apples to oranges. I consider DocumentDB to be like MongoDB or CouchDB in its data model and VoltDB in its execution model (although VoltDB sprocs are written in Java). SimpleDB feels more like a simple XML object store. If you already have a big XML mindset, then it might be easier, but I think there are more folks using JSON today than XML.
Writing ACID-enabling stored procedures in JavaScript is a killer feature that only DocumentDB has. Some say the days of stored procedures are over and that you should put all such logic in your application server layer. If you're implementing a simple CRUD API, that may be, but almost every application requires some sort of transaction where more than one row is changed at a time. This is mind-bogglingly hard to do correctly without transaction support in your data store. Even if you do implement the equivalent of transactions with your NoSQL database, the overhead of the implementation eats away any development/performance/scalability advantages that you got by choosing NoSQL rather than SQL.
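For what it's worth, calling such a stored procedure from the .NET SDK looks roughly like this (the "transferFunds" sproc, its arguments, and the database/collection names are hypothetical; the transactional logic itself lives in the JavaScript sproc on the server):

    using System;
    using System.Threading.Tasks;
    using Microsoft.Azure.Documents.Client;

    public static class SprocSketch
    {
        public static async Task RunAsync()
        {
            var client = new DocumentClient(
                new Uri("https://myaccount.documents.azure.com:443/"), "<key>");

            // Assumes a JavaScript sproc named "transferFunds" is already registered on the
            // collection; it debits one document and credits another in a single transaction.
            var sprocUri = UriFactory.CreateStoredProcedureUri("appdb", "accounts", "transferFunds");

            var response = await client.ExecuteStoredProcedureAsync<bool>(
                sprocUri, "account1", "account2", 100.0);
            // response.Response is true if the transfer committed; any error thrown inside
            // the sproc rolls back both writes.
        }
    }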
DocumentDB's user defined functions and triggers (also written in JavaScript) might be useful, although I believe the trigger implementation is crippled at this moment in time and I haven't found a use for UDFs myself yet.
DocumentDB has built-in attachment support. You need to integrate manually with S3 for the equivalent on AWS.
DocumentDB has geo indexing and operators.
SimpleDB's 1K per document limit is a serious limitation. This tells me that it's designed mostly for logging or as an index to S3 and not a full-fledged document store. The limit for DocumentDB is 512K.
If comparison to SimpleDB is like apples to oranges, then comparison to ElasticSearch is like apples to fire engines. My impression of ElasticSearch is that it's all about full-text searching and analytics. I don't think it's space/execution/api efficient to serve as a primary transactional store. Built on Lucene, it was not designed to have the reliability/durability to be your primary store. Further, even when hosted, it's more of an IaaS offering, whereas DocumentDB and SimpleDB are true PaaS offerings. The maintenance burden will be much higher with ElasticSearch.
I've got an application that's outgrowing SQL Azure - at the price I'm willing to pay, at any rate - and I'm interested in investigating Azure DocumentDB. The preview clearly has distinct scalability limits (as described here, for instance), but I think I could probably get away with those for the preview period, provided I'm using it correctly.
So here's the question I've got. How do I need to design my application to take advantage of the built-in scalability of Azure DocumentDB? For instance, I know that with Azure Table Storage - that cheap-but-awful, highly limited alternative - you need to structure all your data in a two-level hierarchy: PartitionKey and RowKey. Provided you do that (which is well-nigh impossible in a real-world application), ATS (as I understand it) moves partitions around behind the scenes, from machine to machine, so that you get near-infinite scalability. Awesome, and you never have to think about it.
Scaling out with SQL Server is obviously much more complicated - you need to design your own sharding system, deal with figuring out which server the shard in question sits on, and so forth. Possible, and done right quite scalable, but complex and painful.
So how does scalability work with DocumentDB? It promises arbitrary scalability, but how does the storage engine work behind the scenes? I see that it has "Databases", and each database can have some number of "Collections", and so forth. But how does its arbitrary scalability map to these other concepts? If I have a SQL table that contains hundreds of millions of rows, am I going to get the scalability I need if I put all this data into one collection? Or do I need to manually spread it across multiple collections, sharded somehow? Or across multiple DB's? Or is DocumentDB somehow smart enough to coalesce queries in a performant way from across multiple machines, without me having to think about any of it? Or...?
I've been looking around, and haven't yet found any guidance on how to approach this. Very interested in what other people have found or what MS recommends.
Update: As of April 2016, DocumentDB has introduced the concept of a partitioned collection, which allows you to scale out and take advantage of server-side partitioning.
A single DocumentDB database can scale practically to an unlimited amount of document storage partitioned by collections (in other words, you can scale out by adding more collections).
Each collection provides 10 GB of storage and a variable amount of throughput (based on performance level). A collection also provides the scope for document storage and query execution, and is the transaction domain for all the documents contained within it.
Source: http://azure.microsoft.com/en-us/documentation/articles/documentdb-manage/
Here's a link to a blog post I wrote on scaling and partitioning data for a multi-tenant application on DocumentDB.
With the latest version of DocumentDB, things have changed. There is still a 10 GB limit, but in the past it was up to you to figure out how to split your data across multiple collections to avoid hitting it.
Instead, you can now specify a partition key and DocumentDB handles the partitioning for you. For example, if you have log data, you may want to partition on the date value in your JSON documents, so that a new partition is created each day.
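Creating such a partitioned collection with the .NET SDK looks roughly like this (database/collection names, the /date path, and the throughput figure are placeholders):

    using System.Collections.ObjectModel;
    using System.Threading.Tasks;
    using Microsoft.Azure.Documents;
    using Microsoft.Azure.Documents.Client;

    public static class PartitionedCollectionSketch
    {
        public static async Task CreateAsync(DocumentClient client)
        {
            var logs = new DocumentCollection
            {
                Id = "logs",
                // Partition on the date property of each JSON document; documents that share
                // a date value land in the same logical partition.
                PartitionKey = new PartitionKeyDefinition
                {
                    Paths = new Collection<string> { "/date" }
                }
            };

            await client.CreateDocumentCollectionAsync(
                UriFactory.CreateDatabaseUri("appdb"),
                logs,
                new RequestOptions { OfferThroughput = 10100 }); // throughput figure is just a placeholder
        }
    }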
You can fan out queries like this - http://stuartmcleantech.blogspot.co.uk/2016/03/scalable-querying-multiple-azure.html
I need to create incremental reports in Table storage. I need to be able to update the same records from several different worker role instances (different roles with several instances each).
My reports consist mainly of values that I need to increment after I parse the raw data I initially stored.
The optimistic solution I found is to use a retry mechanism: try to update the record; if you get a 412 result code (you don't have the latest ETag value), retry. This solution becomes less efficient and more costly the more users you have and the more data you need to update simultaneously (my case exactly).
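For reference, the retry loop I'm describing looks roughly like this (table, entity, and key names are only illustrative, and the sketch assumes the row already exists):

    using Microsoft.WindowsAzure.Storage;
    using Microsoft.WindowsAzure.Storage.Table;

    public class ReportEntity : TableEntity
    {
        public int Count { get; set; }
    }

    public static class OptimisticIncrement
    {
        public static void Increment(CloudTable table, string partitionKey, string rowKey)
        {
            while (true)
            {
                // Read the current value (this also captures the entity's ETag).
                var entity = (ReportEntity)table.Execute(
                    TableOperation.Retrieve<ReportEntity>(partitionKey, rowKey)).Result;
                entity.Count++;

                try
                {
                    // Replace sends the ETag we read; storage rejects it with 412 if someone
                    // else updated the row in the meantime.
                    table.Execute(TableOperation.Replace(entity));
                    return;
                }
                catch (StorageException ex)
                    when (ex.RequestInformation.HttpStatusCode == 412)
                {
                    // Lost the race -- reread and retry.
                }
            }
        }
    }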
Another solution that comes to mind is to have only one instance of one worker role that can possibly update any given record. This is very problematic because this means that I will by-design create bottlenecks in my architecture, which is the opposite of the scale I want to reach with Azure.
If anyone here has some best practices in mind for such a use case, I would love to hear it.
Most cloud storage services (Table Storage is one of them) do not offer scalable writes on a single entity/blob/whatever. There is no quick fix for this limitation, as it comes from the core tradeoffs that were made to create cloud storage in the first place.
Basically, a storage unit (entity/blob/whatever) can be updated about once every 20ms, and that's about it. Having a dedicated worker or not will not change anything about this aspect.
Instead, you need to approach your task from a different angle. For counters, the most usual approach is to use sharded counters (the link is for GAE, but you can implement equivalent behavior on Azure).
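A sharded counter on Table storage might look roughly like the sketch below (the shard count, naming, and entity shape are arbitrary): each increment goes to one of N shard rows chosen at random, reads sum the shards, and the ETag retry you described still applies per shard, just with far less contention.

    using System;
    using System.Linq;
    using Microsoft.WindowsAzure.Storage;
    using Microsoft.WindowsAzure.Storage.Table;

    public class CounterShard : TableEntity
    {
        public long Value { get; set; }
    }

    public static class ShardedCounter
    {
        private const int ShardCount = 16;              // more shards = more write throughput
        private static readonly Random Rng = new Random();

        public static void Increment(CloudTable table, string counterName)
        {
            string shardRowKey = "shard_" + Rng.Next(ShardCount);
            while (true)
            {
                var result = table.Execute(
                    TableOperation.Retrieve<CounterShard>(counterName, shardRowKey));
                var shard = (CounterShard)result.Result;
                try
                {
                    if (shard == null)
                    {
                        table.Execute(TableOperation.Insert(new CounterShard
                        {
                            PartitionKey = counterName, RowKey = shardRowKey, Value = 1
                        }));
                    }
                    else
                    {
                        shard.Value++;
                        table.Execute(TableOperation.Replace(shard)); // uses the ETag we just read
                    }
                    return;
                }
                catch (StorageException ex) when (
                    ex.RequestInformation.HttpStatusCode == 409 ||   // Insert raced another Insert
                    ex.RequestInformation.HttpStatusCode == 412)     // Replace lost an ETag race
                {
                    // Contention on this shard is roughly 1/ShardCount as likely as on a single row; retry.
                }
            }
        }

        public static long Read(CloudTable table, string counterName)
        {
            // Reading sums all shards for the counter.
            var query = new TableQuery<CounterShard>().Where(
                TableQuery.GenerateFilterCondition("PartitionKey", QueryComparisons.Equal, counterName));
            return table.ExecuteQuery(query).Sum(s => s.Value);
        }
    }

More shards mean more write throughput at the cost of a slightly more expensive read.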
Also, another way to ease the pain is to go for an asynchronous architecture à la CQRS, where the performance constraints you put on the update latency of entities are significantly relaxed.
I believe the approach needs re-architecting. In order to ensure scalability and limit the amount of contention, you want to make sure that every write can work optimistically by providing a unique Table/PartitionKey/RowKey combination.
If you need those values merged together for reports, have a separate process/worker that will post-aggregate/merge the records for reporting purposes. You can use a queue or a timing mechanism to start the aggregation/merging.
I'm just wondering if anyone who has experience with Azure Table Storage could comment on whether it is a good idea to use one table to store multiple entity types.
The reason I want to do this is so I can do transactions. However, I also want to get a sense, in terms of development, of whether this approach would be easy or messy to handle. So far, I'm using Azure Storage Explorer to assist development, and viewing multiple types in one table has been messy.
To give an example, say I'm designing a community site of blogs. If I store all blog posts, categories, and comments in one table, what problems would I encounter? On the other hand, if I don't, then how do I ensure some consistency between category and post, for example (assume one post can have only one category)?
Or are there any other different approaches people take to get around this problem using table storage?
Thank you.
If your goal is to have perfect consistency, then using a single table is a good way to go about it. However, I think that you are probably going to be making things more difficult for yourself and get very little reward. The reason I say this is that table storage is extremely reliable. Transactions are great and all if you are dealing with very, very important data, but in most cases, such as a blog, I think you would be better off either 1) allowing for some very small percentage of inconsistent data or 2) handling failures in a more manual way.
The biggest issue you will have with storing multiple types in the same table is serialization. Most of the current table storage SDKs and utilities were designed to handle a single type. That being said, you can certainly handle multiple schemas either manually (i.e. deserializing your object to a master object that contains all possible properties) or interacting directly with the REST services (i.e. not going through the Azure SDK). If you used the REST services directly, you would have to handle serialization yourself and thus you could more efficiently handle the multiple types, but the trade off is that you are doing everything manually that is normally handled by the Azure SDK.
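As a rough sketch of the manual approach (the EntityType discriminator and the property names are made up), DynamicTableEntity lets you read mixed rows from a single table and branch on a type marker, while a batch gives you the atomicity across types that motivated the single table in the first place:

    using Microsoft.WindowsAzure.Storage.Table;

    public static class MixedTableSketch
    {
        public static void Run(CloudTable table)
        {
            // Write a post and a comment atomically: same table, same partition, one batch.
            var post = new DynamicTableEntity("blog1", "post_42");
            post.Properties["EntityType"] = EntityProperty.GeneratePropertyForString("Post"); // discriminator
            post.Properties["Title"] = EntityProperty.GeneratePropertyForString("Hello world");

            var comment = new DynamicTableEntity("blog1", "post_42_comment_001");
            comment.Properties["EntityType"] = EntityProperty.GeneratePropertyForString("Comment");
            comment.Properties["Text"] = EntityProperty.GeneratePropertyForString("First!");

            var batch = new TableBatchOperation();
            batch.Insert(post);
            batch.Insert(comment);
            table.ExecuteBatch(batch);

            // Read the whole partition back and branch on the discriminator.
            var query = new TableQuery<DynamicTableEntity>().Where(
                TableQuery.GenerateFilterCondition("PartitionKey", QueryComparisons.Equal, "blog1"));
            foreach (var row in table.ExecuteQuery(query))
            {
                switch (row.Properties["EntityType"].StringValue)
                {
                    case "Post":    /* map to a Post object */    break;
                    case "Comment": /* map to a Comment object */ break;
                }
            }
        }
    }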
There really is no right or wrong way to do this. Both situations will work, it is just a matter of what is most practical. I personally tend to put a single schema per table unless there is a very good reason to do otherwise. I think you will find table storage to be reliable enough without the use of transactions.
You may want to check out the Windows Azure Toolkit. We have designed that toolkit to simplify some of the more common Azure tasks.