Azure Table Storage Design for Web Application - azure

I am evaluating the use of Azure Table Storage for an application I am building, and I would like to get some advice on...
whether or not this is a good idea for the application, or
if I should stick with SQL, and
if I do go with ATS, what would be a good approach to the design of the storage.
The application is a task-management web application, targeted to individual users. It is really a very simple application. It has the following entities...
Account (each user has an account.)
Task (users create tasks, obviously.)
TaskList (users can organize their tasks into lists.)
Folder (users can organize their lists into folders.)
Tag (users can assign tags to tasks.)
There are a few features / requirements that we will also be building which I need to account for...
We eventually will provide features for different accounts to share lists with each other.
Users need to be able to filter their tasks in a variety of ways. For example...
Tasks for a specific list
Tasks for a specific list which are tagged with "A" and "B"
Tasks that are due tomorrow.
Tasks that are tagged "A" across all lists.
Tasks that I have shared.
Tasks that contain "hello" in the note for the task.
Etc.
Our application is AJAX-heavy, with updates occurring for very small changes to a task. So there are a lot of small requests and updates going on. For example...
Inline editing
Click to complete
Change due date
Etc...
Because of the heavy CRUD work, and the fact that we really have a list of simple entities, it would be feasible to go with ATS. But, I am concerned about the transaction cost for updates, and also whether or not the querying / filtering I described could be supported effectively.
We imagine numbers starting small (~hundreds of accounts, ~hundreds or thousands of tasks per account), but we obviously hope to grow our accounts.
If we do go with ATS, would it be better to have...
One table per entity (Accounts, Tasks, TaskLists, etc.)
Sets of tables per customer (JohnDoe_Tasks, JohnDoe_TaskLists, etc.)
Other thoughts?
I know this is a long post, but if anyone has any thoughts or ideas on the direction, I would greatly appreciate it!

Azure Table Storage is well suited to a task application. As long as you set up your partition keys and row keys well, you can expect fast and consistent performance with a huge number of simultaneous users.
For task sharing, ATS provides optimistic concurrency to support multiple users accessing the same data in parallel. You can use optimistic concurrency to warn users when more than one account is editing the same data at the same time, and prevent them from accidentally overwriting each other's changes.
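To make that concrete, here's a minimal, SDK-free sketch of the conditional-update pattern that ATS's ETags give you; Task, TaskStore and the exception are hypothetical stand-ins for whatever client library you end up using:

```java
// Plain-Java sketch of the conditional-update pattern behind optimistic concurrency.
// Task, TaskStore and PreconditionFailedException are hypothetical stand-ins; the real
// Table service does this by sending the entity's ETag in an If-Match header.
public class OptimisticUpdateExample {

    static class Task {
        String rowKey;
        String note;
        String etag;   // version token returned by the storage service on every read/write
    }

    static class PreconditionFailedException extends RuntimeException { }

    interface TaskStore {
        Task fetch(String partitionKey, String rowKey);
        // Throws PreconditionFailedException when the stored entity's ETag no longer matches.
        void replace(String partitionKey, Task task, String expectedEtag);
    }

    static void editNote(TaskStore store, String accountId, String rowKey, String newNote) {
        Task task = store.fetch(accountId, rowKey);
        task.note = newNote;
        try {
            store.replace(accountId, task, task.etag);
        } catch (PreconditionFailedException e) {
            // Someone else changed the task since we read it:
            // re-fetch and merge, or warn the user instead of overwriting their change.
        }
    }
}
```

The point is that a stale update fails loudly instead of silently clobbering the other account's change, which is exactly what you want for shared lists.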
As to the costs, you can estimate your transaction costs based on the number of accounts, and how active you expect those accounts to be. So, if you expect 300 accounts, and each account makes 100 edits a day, you'll have 30K transactions a day, which (at $.01 per 10K transactions) will cost about $.03 a day, or a little less than $1 a month. Even if this estimate is off by 10X, the transaction cost per month is still less than a hamburger at a decent restaurant.
For the design, the main aspect to think about is how to key your tables. Before designing your application for ATS, I'd recommend reading the ATS white paper, particularly the section on partitioning. One reasonable design for the application would be to use one table per entity type (Accounts, Tasks, etc), then partition by the account name, and use some unique feature of the tasks for the row key. For both key types, be sure to consider the implications on future queries. For example, by grouping entities that are likely to be updated together into the same partition, you can use Entity Group Transactions to update up to 100 entities in a single transaction -- this not only increases speed, but saves on transaction costs as well. For another implication of your keys, if users will tend to be looking at a single folder at a time, you could use the row key to store the folder (e.g. rowkey="folder;unique task id"), and have very efficient queries on a folder at a time.
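As an illustration of that key scheme (hypothetical names, not a prescribed design), composing the row key from the folder and a unique task id turns "all tasks in a folder" into a cheap prefix range query:

```java
// Hypothetical key scheme for a Tasks table: partition by account, prefix the row key
// with the folder so "all tasks in folder X" becomes a range query within one partition.
public class TaskKeys {

    static String partitionKey(String accountId) {
        return accountId;                  // all of one user's tasks share a partition
    }

    static String rowKey(String folderId, String taskId) {
        return folderId + ";" + taskId;    // e.g. rowkey = "folder;unique task id"
    }

    // Range bounds for "every task in this folder":
    // RowKey >= "folderId;" and RowKey < "folderId<"  ('<' is the character after ';').
    static String[] folderRange(String folderId) {
        return new String[] { folderId + ";", folderId + "<" };
    }

    public static void main(String[] args) {
        String[] range = folderRange("inbox");
        // An OData-style filter for the Table service (exact query API depends on your client library):
        System.out.println("PartitionKey eq '" + partitionKey("johndoe")
                + "' and RowKey ge '" + range[0] + "' and RowKey lt '" + range[1] + "'");
    }
}
```

Because everything for one account lives in one partition, the Entity Group Transaction batching mentioned above also becomes available for those small, frequent updates.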
Overall, ATS will support your task application well, and allow it to scale to a huge number of users. I think the main question is, do you need cloud magnitude of scaling? If you do, ATS is a great solution; if you don't, you may find that adjusting to a new paradigm costs more time in design and implementation than the benefits you receive.

What you are asking is a rather big question, so forgive me if I don't give you an exact answer. The short answer would be: sure, go ahead with ATS :)
Your biggest concern in this scenario would be speed. As you've pointed out, you are expecting a lot of CRUD operations. Out of the box, ATS doesn't support transactions across partitions, but you can architect yourself out of such a challenge by using the CQRS pattern.
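In case it helps, here is roughly the shape that CQRS split could take; all the names (CompleteTask, TaskWriteStore, TaskProjection) are hypothetical:

```java
// Minimal CQRS shape: small, fast writes go through command handlers, while the
// query side reads from projections shaped for the filters the UI needs.
import java.util.List;

public class CqrsSketch {

    record CompleteTask(String accountId, String taskId) { }    // a command

    interface TaskWriteStore {                                   // write model (e.g. the Tasks table)
        void markCompleted(String accountId, String taskId);
    }

    interface TaskProjection {                                   // read model, denormalized per query
        void onTaskCompleted(String accountId, String taskId);
        List<String> openTasksDueTomorrow(String accountId);
    }

    static class CommandHandler {
        private final TaskWriteStore writes;
        private final TaskProjection reads;

        CommandHandler(TaskWriteStore writes, TaskProjection reads) {
            this.writes = writes;
            this.reads = reads;
        }

        void handle(CompleteTask cmd) {
            writes.markCompleted(cmd.accountId(), cmd.taskId());  // the tiny AJAX update stays cheap
            reads.onTaskCompleted(cmd.accountId(), cmd.taskId()); // keep the query-side views in sync
        }
    }
}
```

The frequent, tiny writes stay cheap, while the query side is free to maintain whatever denormalized views your filters (by tag, by due date, by list, and so on) require.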
The big difference between SQL and ATS is the lack of relations and general query capabilities, since ATS is a "NoSQL" approach. This means you have to structure your tables in a way that supports your query operations, which is not a simple task.
If you are aware of this, I don't see any trouble doing what you're describing.
Would love to see the end result!

Related

Real-time multithreaded max-heap for top-N geohash

There is a requirement to keep a list of top-10 localities in a city from where demand for our food service is emanating at any given instant. The city could have tens of thousands of localities.
If one has to make a near real time (lag no more than 5 minutes) in-memory datastore that would
- keep count of incoming demand by locality (geohash)
- support reads by hundreds of our suppliers every minute (the AJAX refresh is every minute)
I was thinking of a multithreaded, synchronized max-heap. This would be a complex solution, as tree locking is by itself a complex implementation.
Any recommendations for the best in-memory (replicable master-slave) data structure that can be read and updated in a multithreaded environment?
We expect 10K QPS and 100K updates per second. When we scale to other cities and regions, we will need per city implementation of top-10.
Are there any off the shelf solutions available?
Persistence is not a need, so no MySQL-based solutions. If you recommend a Redis or MongoDB solution, please realize that the queries are not point queries by key but top-N queries instead.
Thanks in advance.
If you're looking for exactly what you're describing, there are a few approaches that might work nicely. There are several papers describing concurrent data structures that could work as priority queues; here is one option that I'm not super familiar with but which looks promising. You might also want to check out concurrent skip lists, which should also match your requirements.
If I'm interpreting your problem statement correctly, you're hoping to maintain a top-10 list of locations based on the number of hits you receive. If that's the case, I would suspect that while the number of updates would be huge, the number of times that two locations would switch positions would not actually be all that large. In other words, most updates wouldn't actually require the data structure to change shape. Consequently, you could consider using a standard binary heap where each element uses an atomic-compare-and-set integer key and where you have some kind of locking system that's used only in the case where you need to add, move, or delete an element from the heap.
Given the scale that you're working at, you may also want to consider approximate solutions to your problem. The count-min sketch data structure, for example, was specifically designed to estimate frequent elements in a data stream and does so extremely quickly. It can easily be distributed and linked up with a priority queue in a manner similar to what I described above. There are lots of good implementations out there, and if I remember correctly this data structure is actually deployed in situations like the one you're describing.
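For a flavour of how small such a sketch is, here's a minimal thread-safe count-min sketch in Java; the width/depth values are illustrative only, and in practice you'd pick them from your error bounds:

```java
import java.util.Random;
import java.util.concurrent.atomic.AtomicLongArray;

/**
 * Minimal count-min sketch: approximate per-geohash counts in fixed memory,
 * safe to update from many threads without locks.
 */
public class CountMinSketch {
    private final AtomicLongArray[] rows;
    private final int[] seeds;
    private final int width;

    public CountMinSketch(int depth, int width) {
        this.width = width;
        this.rows = new AtomicLongArray[depth];
        this.seeds = new int[depth];
        Random rnd = new Random(42);
        for (int i = 0; i < depth; i++) {
            rows[i] = new AtomicLongArray(width);
            seeds[i] = rnd.nextInt();
        }
    }

    private int bucket(String item, int row) {
        int h = item.hashCode() ^ seeds[row];
        h ^= (h >>> 16);                       // extra mixing of the hash bits
        return Math.floorMod(h, width);
    }

    /** Record one demand event for a locality (geohash). Lock-free. */
    public void add(String geohash) {
        for (int r = 0; r < rows.length; r++) {
            rows[r].incrementAndGet(bucket(geohash, r));
        }
    }

    /** Estimated count: never underestimates, may overestimate slightly. */
    public long estimate(String geohash) {
        long min = Long.MAX_VALUE;
        for (int r = 0; r < rows.length; r++) {
            min = Math.min(min, rows[r].get(bucket(geohash, r)));
        }
        return min;
    }
}
```

You'd pair this with a small candidate set of localities (or the priority queue described above) to turn the per-geohash estimates into a top-10 list.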
Hope this helps!

DDD (Domain-Driven-Design) - large aggregates

I'm currently studying Eric Evans's Domain-Driven Design. The idea of aggregates is clear to me and I find it very interesting. Now I'm thinking of an example of an aggregate like:
BankAccount (1) ----> (*) Transaction.
BankAccount
BigDecimal calculateTurnover();
BankAccount is an aggregate. To calculate turnover I should traverse all transactions and sum up all the amounts. Evans assumes that I should use repositories to load only aggregates. In the above case there could be a few thousand transactions which I don't want to load into memory at once.
In the context of the repository pattern, aggregate roots are the only objects your client code loads from the repository.
The repository encapsulates access to child objects - from a caller's perspective it automatically loads them, either at the same time the root is loaded or when they're actually needed (as with lazy loading).
What would be your suggestion to implement calculateTurnover in a DDD aggregate?
As you have pointed out, to load 1000s of entities in an aggregate is not a scalable solution. Not only will you run into performance problems but you will likely also experience concurrency issues, as emphasised by Vaughn Vernon in his Effective Aggregate Design series.
Do you want every transaction to be available in the BankAccount aggregate or are you only concerned with turnover?
If it is only the turnover that you need, then you should establish this value when instantiating your BankAccount aggregate. This could likely be calculated efficiently by your data store technology (indexed JOINs, for example, if you are using SQL). Perhaps you also need to consider having this as a precalculated value in your data store (what happens when you start dealing with millions of transactions per bank account)?
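For example (a sketch only, with assumed table and column names), the repository could let SQL do the summing and hand the aggregate a ready-made turnover instead of loading thousands of Transaction rows:

```java
import java.math.BigDecimal;
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

// The database computes the turnover; the aggregate is built without its transactions.
public class BankAccountRepository {
    private final Connection connection;

    public BankAccountRepository(Connection connection) {
        this.connection = connection;
    }

    public BankAccount findById(long accountId) throws SQLException {
        String sql = "SELECT COALESCE(SUM(amount), 0) FROM bank_transaction WHERE account_id = ?";
        try (PreparedStatement ps = connection.prepareStatement(sql)) {
            ps.setLong(1, accountId);
            try (ResultSet rs = ps.executeQuery()) {
                rs.next();
                BigDecimal turnover = rs.getBigDecimal(1);
                return new BankAccount(accountId, turnover);
            }
        }
    }
}

class BankAccount {
    private final long id;
    private final BigDecimal turnover;

    BankAccount(long id, BigDecimal turnover) {
        this.id = id;
        this.turnover = turnover;
    }

    BigDecimal calculateTurnover() {
        return turnover;   // already established when the aggregate was instantiated
    }
}
```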
But perhaps you still require the transactions available in your domain? Then you should consider having a separate Transaction repository.
I would highly recommend reading Vaughn Vernon's series on aggregate design, as linked above.
You have managed to pick a very interesting example :)
I actually use Account1->*Transaction when explaining event sourcing (ES) to anyone not familiar with it.
As a developer I was taught (way back) to use what we can now refer to as entity interaction. So we have a Customer record and it has a current state. We change the state of the record in some way (address, tax details, discount, etc.) and store the result. We never quite know what happened, but we have the latest state and, since that is the current state of our business, it is just fine. Of course, one of the first issues we needed to deal with was concurrency, but we had ways of handling that, and even though they were not fantastic, they "worked".
For some reason the accounting discipline didn't quite buy into this. Why not simply keep the latest state of an Account? We would load the related record, change the balance, and save the state. Oddly enough, most people would probably cringe at the thought, yet it seems to be OK for the rest of our data.
The accounting domain got around this by registering the change events as a series of Transaction entries. So should you lose your account record and the latest balance, you can always run through all the transactions to obtain the latest balance. That is event sourcing.
In ES one typically loads the entire list of events for an aggregate root (AR) to obtain its latest state. There is also, typically, a mechanism to deal with a huge number of events when loading them all would cause performance issues: snapshots. Usually only the latest snapshot is stored. The snapshot contains the full latest state of the aggregate, and only events after the snapshot version are applied.
One of the huge advantages of ES is that one can come up with new queries and then simply apply all the events to the query handler and determine the outcome. Perhaps something like: "How many customers do I have that have moved twice in the last year?" Quite arbitrary, but using the "traditional" approach the answer would quite likely be that we'll start gathering that information from today and have it available next year, as we have not been saving the CustomerMoved events. With ES we can search for the CustomerMoved events and get a result at any point.
So this brings me back to your example. You probably do not want to be loading all the transactions. Instead, store the "Turnover" and calculate it on the go. Should the "Turnover" be a new requirement, then a once-off processing of all the ARs should get it up to speed. You can still have a calculateTurnover() method somewhere, but that would be something you wouldn't run all too often. And in those cases you would need to load all the transactions for an AR.
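A rough sketch of that idea with event sourcing, using hypothetical event and class names: turnover is just state the aggregate maintains as TransactionRecorded events are applied, optionally seeded from a snapshot:

```java
import java.math.BigDecimal;
import java.util.List;

// Event-sourced aggregate: rehydrated from a snapshot plus newer events,
// with turnover maintained incrementally instead of recomputed from all transactions.
public class EventSourcedBankAccount {

    record TransactionRecorded(BigDecimal amount) { }

    private BigDecimal turnover = BigDecimal.ZERO;

    static EventSourcedBankAccount load(BigDecimal snapshotTurnover,
                                        List<TransactionRecorded> eventsAfterSnapshot) {
        EventSourcedBankAccount account = new EventSourcedBankAccount();
        account.turnover = snapshotTurnover;
        eventsAfterSnapshot.forEach(account::apply);
        return account;
    }

    void apply(TransactionRecorded event) {
        // Assumption for the sketch: turnover = sum of absolute transaction amounts.
        turnover = turnover.add(event.amount().abs());
    }

    BigDecimal calculateTurnover() {
        return turnover;   // no need to reload every transaction
    }
}
```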

Paging among multiple aggregate root

I'm new to DDD, so please excuse me if some terms or my understanding are a bit off. Please correct me; any advice is appreciated.
Let's say I'm doing a social job board site, and I've identified my aggregate roots: Candidates, Jobs, and Companies. Very different things/contexts, so each has its own database table, repository, and service. But now I have to build a Pinterest-style homepage where data blocks show data for either a Candidate, a Job, or a Company.
Now the tricky part is that the data blocks have to be ordered by the last time something happened to the aggregate each represents (a company is liked/commented on, a job is updated, etc.), and paging occurs in the form of infinite scrolling, again just like Pinterest. Since things occur to these aggregates independently, I do not have a way to know how many of which aggregate are on any particular page. (But if I did, by the way, say via a table that tracks aggregates' last update time, would I have no choice but to promote this to be another aggregate root, with its own repository?)
Where would I implement the paging logic? I read somewhere that there should be one service per repository per aggregate root, so should I sort and page in the controller (I'm using MVC, by the way)? Or should there be an independent Application Service that does cross-boundary stuff like this? In either case, do I have to fetch ALL entities for ALL aggregates from the db?
That's too many questions already but I'm basically asking:
Is paging presentation, business, or persistence logic? Which horizontal layer?
Where should cross boundary code reside in DDD? Which vertical stack?
Several things come to mind.
How fresh does this aggregated data need to be? I doubt realtime is going to add much value. Talk to a business person and bargain for some latency. This will allow you to build a simpler solution to the problem.
Why not have some process do the scanning, aggregation, sorting and store the result of that asynchronously? Doesn't even need to be in a database (Redis). The bargained latency could be the interval at which to run your process.
Paging is hardly a business decision concern in your example. You just need to provide infinite scrolling and some ajax calls that fetch the cached, aggregated, sorted information. This has little to do with DDD.
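To illustrate the previous two paragraphs (hypothetical names throughout): a scheduled job rebuilds a sorted, cached feed at the bargained interval, and the infinite-scroll AJAX calls just slice pages out of it:

```java
import java.time.Instant;
import java.util.Comparator;
import java.util.List;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Supplier;

// Background aggregation with a "bargained latency": the feed is rebuilt periodically,
// and reads only ever touch the precomputed, sorted snapshot.
public class HomepageFeedCache {

    record FeedBlock(String aggregateType, String aggregateId, Instant lastActivity) { }

    private final AtomicReference<List<FeedBlock>> sortedFeed = new AtomicReference<>(List.of());
    private final Supplier<List<FeedBlock>> loadRecentlyTouchedBlocks;   // scans the three contexts

    public HomepageFeedCache(Supplier<List<FeedBlock>> loadRecentlyTouchedBlocks) {
        this.loadRecentlyTouchedBlocks = loadRecentlyTouchedBlocks;
    }

    public void startRefreshing(long intervalSeconds) {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleAtFixedRate(this::refresh, 0, intervalSeconds, TimeUnit.SECONDS);
    }

    private void refresh() {
        List<FeedBlock> blocks = loadRecentlyTouchedBlocks.get().stream()
                .sorted(Comparator.comparing(FeedBlock::lastActivity).reversed())
                .toList();
        sortedFeed.set(blocks);                      // swap in the new snapshot atomically
    }

    /** One "infinite scroll" page; offset/size come straight from the AJAX request. */
    public List<FeedBlock> page(int offset, int size) {
        List<FeedBlock> feed = sortedFeed.get();
        if (offset >= feed.size()) return List.of();
        return feed.subList(offset, Math.min(offset + size, feed.size()));
    }
}
```

The snapshot could just as well live in Redis instead of in-process memory; the shape of the solution stays the same.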
Your UI artifacts and the aggregation/sorting process seem to be very much a thing of their own, working together with the data or, better yet, with a data component of each context that provides the data in the desired format.

When is the Data Vault model the right model for a data-warehouse?

I recently found a reference to 'Data Vault Modeling' as a model for data warehouses. The models I've seen before are Inmon and Kimball. The author refers to possible performance problems due to the joins needed. It looks like a nice model, but I wonder about the gotchas. Are there any experience reports online?
We have been using a home-grown modification of Data Vault for a number of years, called 'Link Modelling', which only has entities and links; it draws principles from neo4j, but is implemented in a SQL database.
Both Link Modelling and Data Vault are very different ways of thinking to Kimball/Inmon models.
My comments below relate to a system built with the following structure: a temporary staging database, a DWH, then a number of marts built from the DWH. There are other ways to architect a DWH solution, but this is quite typical.
With Kimball/Inmon
Data is cleaned on the way into the DWH, though cleaning is sometimes applied on the way into the staging database
Business rules and MDM are (generally) applied between the staging db and the DWH
The marts are often subject area specific
With Data Vault/Link Modelling
Data is landed unchanged in staging
These data are passed through to the DWH also uncleaned, but stored in an entity/link form
Data cleansing, MDM and business rules are applied between the DWH and the marts.
Marts are based on subject area specific needs (same as above).
For us, we would often (but not always) build Kimball Star Schema style Marts, as the end users understand the data structures of these easily.
The occasions where a Link Modelled DWH comes into its own are the following (using Kimball terminology to express the issues):
Upon occasion, there will be queries from users asking 'why does a specific number have this value?'. In traditional Kimball/Inmon, data is cleansed on the way in, so there is no way to know what the original value was. A Link Model has the original data in the DWH.
When no transaction records exist that link a number of dimensions, but it is still required to be able to report on the full set of data, e.g. to ask questions like 'How many insurance policies that were sold by a particular broker have no claim transactions paid?'.
The application of MDM in a type 2 Kimball or Inmon DWH can cause massive numbers of type 2 change records to be written to Dimensions, which often contain all the data values, so there is a lot of duplication of data. With a Link Model/Data Vault, a new dimensional value will just cause new type 2 links to be created in a link table, which only have foreign keys to entity tables. This is often overcome in a Kimball DWH by having a slowly changing dimension and a fast-changing dimension, which is a fair workaround.
In insurance and other industries where there is the need to be able to produce 'as at date' reports, fact tables will be slowly changing as well, and type 2 dimension tracking against type 2 fact records is a nightmare.
From a development point of view, adding a new column to a large Kimball dimension needs to be done carefully and consideration of back-populating is important, but with a Link Model, adding an extra column to an Entity is relatively trivial.
There are always ways around these in Kimball methodology, but they require some careful thought and sometimes some jumping through hoops.
From our perspective, there is little downside to the Link Modelling.
I am not connected with any of the companies marketing/producing Kimball/Inmon or Data Vault methodologies.
You can find a whole lot more information on my blog: http://danLinstedt.com, and on the forums at datavaultinstitute dot com
But to give you a quick/brief answer to your question:
The gotchas are as follows:
1) Have to accept the concept of loading raw data to the data warehouse
2) Understand that the Data Vault usually doesn't allow "end-users" direct access because of the model.
There may be a few more, but the benefits outweigh the drawbacks.
Feel free to check out the blog, it's free to register/follow.
Cheers,
Dan Linstedt

Strategies for search across disparate data sources

I am building a tool that searches people based on a number of attributes. The values for these attributes are scattered across several systems.
As an example, dateOfBirth is stored in a SQL Server database as part of system ABC. That person's sales region assignment is stored in some horrible legacy database. Other attributes are stored in a system only accessible over an XML web service.
To make matters worse, the legacy database and the web service can be really slow.
What strategies and tips should I consider for implementing a search across all these systems?
Note: Although I posted an answer, I'm not confident it's a great answer. I don't intend to accept my own answer unless no one else gives better insight.
You could consider using an indexing mechanism to retrieve and locally index the data across all the systems, and then perform your searches against the index. Searches would be an awful lot faster and more reliable.
Of course, this just shifts the problem from one part of your system to another - now your indexing mechanism has to handle failures and heterogeneous systems, but that may be an easier problem to solve.
Another factor is how often the data changes. If you have to query data in real-time that goes stale very quickly, then indexing may not be practical.
If you can get away with a restrictive search, start by returning a list based on the search criteria corresponding to the fastest data source. Then join up those records with the other systems and remove records which don't match the search criteria.
If you have to implement OR logic, this approach is not going to work.
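As a sketch of that restrictive (AND-only) approach, with hypothetical client interfaces: get candidates from the fastest source, then drop anything the slower systems rule out.

```java
import java.util.List;
import java.util.stream.Collectors;

// Progressive filtering: the fast source produces candidates, the slow sources only
// confirm or reject them. Works for AND semantics only, as noted above.
public class ProgressiveFilterSearch {

    interface FastSource {                  // e.g. the SQL Server database of system ABC
        List<String> candidateIds(String criteria);
    }

    interface SlowSource {                  // e.g. the legacy database or the XML web service
        boolean matches(String personId, String criteria);
    }

    static List<String> search(String criteria, FastSource fast, List<SlowSource> slowSources) {
        return fast.candidateIds(criteria).stream()
                .filter(id -> slowSources.stream().allMatch(s -> s.matches(id, criteria)))
                .collect(Collectors.toList());
    }
}
```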
While not an actual answer, this might at least get you partway to a workable solution. We had a similar situation at a previous employer - lots of data sources, different ways of accessing those data sources, different access permissions, military/government/civilian sources, etc. We used Mule, which is built around the Enterprise Service Bus concept, to connect these data sources to our application. My details are a bit sketchy, as I wasn't the actual implementor, just an integrator, but what we did was define a channel in Mule. Then you write a simple integration piece to go between the channel and the data source, and between the application and the channel. The integration piece does the work of making the actual query and formatting the results, so we had a generic SQL integration piece for accessing a database, and for things like web services we had some base classes that implemented common functionality, so the actual customization of the integration pieces was a lot less work than it sounds like. The application could then query the channel, which would handle accessing the various data sources, transforming the results into a normalized bit of XML, and returning them to the application.
This had a lot of advantages for our situation. We could include new data sources for existing queries by simply connecting them to the channel - the application didn't have to know or care what data sources were there, as it only looked at the data from the channel. Since data can be pushed or pulled from the channel, we could have a data source update the application when, for example, it was updated.
It took a while to get it configured and working, but once we got it going, we were pretty successful with it. In our demo setup, we ended up with 4 or 5 applications acting as both producers and consumers of data, and connecting to maybe 10 data sources.
Have you thought of moving the data into a separate structure?
For example, Lucene stores data to be searched in a schema-less inverted index. You could have a separate program that retrieves data from all your different sources and puts them in a Lucene index. Your search could work against this index, and the search results could contain a unique identifier and the system it came from.
http://lucene.apache.org/java/docs/
(There are implementations in other languages as well)
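A small sketch of the Lucene approach described above; the field names are illustrative, and the exact Lucene API varies a little between versions:

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;

public class PersonSearchIndex {

    public static void main(String[] args) throws Exception {
        Directory dir = new ByteBuffersDirectory();    // in-memory for the example

        // Index: every record gets a unique id plus a "source" field naming the system it came from.
        try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()))) {
            Document doc = new Document();
            doc.add(new StringField("personId", "12345", Field.Store.YES));
            doc.add(new StringField("source", "ABC-sql", Field.Store.YES));
            doc.add(new TextField("name", "Jane Example", Field.Store.YES));
            doc.add(new StringField("salesRegion", "EMEA", Field.Store.YES));
            writer.addDocument(doc);
        }

        // Search: hits carry enough information to trace back to the originating system.
        try (DirectoryReader reader = DirectoryReader.open(dir)) {
            IndexSearcher searcher = new IndexSearcher(reader);
            var query = new QueryParser("name", new StandardAnalyzer()).parse("jane");
            for (ScoreDoc hit : searcher.search(query, 10).scoreDocs) {
                Document found = searcher.doc(hit.doc);
                System.out.println(found.get("personId") + " from " + found.get("source"));
            }
        }
    }
}
```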
Have you taken a look at YQL? It may not be the perfect solution, but it might give you a starting point to work from.
Well, for starters I'd parallelize the queries to the different systems. That way we can minimize the query time.
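Something along these lines (hypothetical client interfaces), so the total latency is roughly that of the slowest source rather than the sum of all of them:

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Fan the per-system lookups out in parallel and merge the attribute maps afterwards.
public class ParallelPersonLookup {

    interface SourceClient {                 // one per backend: SQL, legacy DB, XML web service...
        Map<String, String> attributesFor(String personId);
    }

    static Map<String, String> lookup(String personId, ExecutorService pool, SourceClient... sources) {
        var futures = Arrays.stream(sources)
                .map(s -> CompletableFuture.supplyAsync(() -> s.attributesFor(personId), pool))
                .toList();

        Map<String, String> merged = new HashMap<>();
        for (var future : futures) {
            merged.putAll(future.join());    // waits only as long as the slowest source
        }
        return merged;
    }

    public static void main(String[] args) {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        // lookup("12345", pool, sqlClient, legacyClient, webServiceClient);
        pool.shutdown();
    }
}
```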
You might also want to think about caching and aggregating the search attributes for subsequent queries in order to speed things up.
You have the option of creating an aggregation service or middleware that aggregates all the different systems so that you can provide a single interface for querying. If you do that, this is where I'd apply the previously mentioned caching and parallelization optimizations.
However, with all of that, you will need to weigh up the development time, deployment time, and long-term benefits of the effort against migrating the old legacy database to a faster, more modern one. You haven't said how tied into other systems those databases are, so it may not be a very viable option in the short term.
EDIT: in response to data going out of date. You can consider caching your data if you don't need it to always match the database in real time. Also, if some data doesn't change very often (e.g. dates of birth) then you should cache it. If you employ caching, you could make your system configurable as to what tables/columns to include or exclude from the cache, and you could give each table/column a configurable cache timeout with an overall default.
Use Pentaho/Kettle to copy all of the data fields that you can search on and display into a local MySQL database
http://www.pentaho.com/products/data_integration/
Create a batch script to run nightly and update your local copy. Maybe even every hour. Then, write your query against your local MySQL database and display the results.
