When is the Data Vault model the right model for a data-warehouse? - modeling

I recently found a reference to 'Data Vault Modeling' as a model for data-warehouses. The models I've seen before are Inmon and Kimball. The author refers to possible performance problems due to the joins needed. It looks like a nice model, but I wonder about the gotcha's. Are there any experience reports on-line?

We have been using a home-grown modification to Data Vault for a number of years, called 'Link Modelling', which only has entities and links; drawing principles from neo4j, but implementing in a SQL database.
Both Link Modelling and Data Vault are very different ways of thinking to Kimball/Inmon models.
My comments below relate to a system built with the follow structure: a temporary staging database, a DWH, then a number of marts build from the DWH. There are other ways to architect a DWH solution, but this is quite typical.
With Kimball/Inmon
Data is cleaned on the way into the DWH, but sometimes applied on the way into the staging database
Business rules and MDM are (generally) applied between the staging db and the DWH
The marts are often subject area specific
With Data Vault/Link Modelling
Data is landed unchanged in staging
These data are passed through to the DWH also uncleaned, but stored in an entity/link form
Data cleansing, MDM and business rules are applied between the DWH and the marts.
Marts are based on subject area specific needs (same as above).
For us, we would often (but not always) build Kimball Star Schema style Marts, as the end users understand the data structures of these easily.
The occasions where a Link Modelled DWH comes into its own, are the following (using Kimball terminology to express the issues)
Upon occasion, there will be queries from the users asking 'why is a specific number having this value?'. In traditional Kimball/Inmon, data is cleansed on the way in, there is no way to know what the original value was. Link Model has the original data in the DWH.
When no transaction records exist that link a number of dimensions, and it is required to be able to report on the full set of data, so e.g. ask questions like 'How many insurance policies that were sold by a particular broker have no claim transactions paid?'.
The application of MDM in a type 2 Kimball or Inmon DWH can cause massive numbers of type 2 change records to be written to Dimensions, which often contain all the data values, so there is a lot of duplication of data. With a Link Model/Data Vault, a new dimensional value will just cause new type 2 links to be created in a link table, which only have foreign keys to entity tables. This is often overcome in Kimball DWH by having a slowly changing dimension and a fast changing dimension, which is a fair workaround.
In Insurance and other industries where there is the need to be able to produce 'as at date' reports, Fact tables will be slowly changing as well, type 2 dimension tracking against type 2 fact records are a nightmare.
From a development point of view, adding a new column to a large Kimball dimension needs to be done carefully and consideration of back-populating is important, but with a Link Model, adding an extra column to an Entity is relatively trivial.
There are always ways around these in Kimball methodology, but they require some careful thought and sometimes some jumping through hoops.
From our perspective, there is little downside to the Link Modelling.
I am not connected with any of the companies marketing/producing Kimball/Inmon or Data Vault methodologies.

You can find a whole lot more information on my blog: http://danLinstedt.com, and on the forums at datavaultinstitute dot com
But to give you a quick/brief answer to your question:
The gotchas are as follows:
1) Have to accept the concept of loading raw data to the data warehouse
2) Understand that the Data Vault usually doesn't allow "end-users" direct access because of the model.
There may be a few more, but the benefits outweigh the drawbacks.
Feel free to check out the blog, it's free to register/follow.
Cheers,
Dan Linstedt

Related

How to denormalize deep hierarchies?

I’ve read quite a lot about Cassandra and the art of denormalization and materialization while writing the data. I think I understand the concept, and it seems to make sense. However, I am having some trouble implementing it in scenarios where there is a deep hierarchical data structure.
Consider the contrived domain where
Owner 1:* Company
Company 1:* Teams
Team 1:* Players
Players 1:* Equipment
We have tables for each of these entities, but we would also like to query quickly for equipment attributes by owner so it seems the thing to do is create a table (OwnerEquipment) that has the owner id and the equipment id as the primary key with the owner id as the partition key. This makes sense, but what if the UX scenarios that add and edit equipment do not include the owner’s id as part of the working set?
Most of the denormalization examples I’ve encountered in my research are usually a single level parent-child or master-detail type use case. It seems pretty reasonable that an updating client would have enough information about the immediate parent when updating the child to write the denormalized reverse index, but what if the data you would really like to denormalize by is several “joins” away?
This problem is compounded further in our example when we consider a Company is sold to a different Owner. Assume that the desired behavior is for OwnerEquipment to reflect this change. How should the code that writes this updated Company to the database handle the OwnerEquipment table updates? Should it, knowing the ID of the old owner, try to update all the OwnerEquipment records for that owner? This seems like a very un-Cassandra-y thing to do and also fraught with concurrency issues. The problem gets worse as you move down the chain (Team to new Company, Player to new Team). In these cases the “old owner” is not necessarily in the working set and would need to be read in order to be updated.
Are there some better ways to think about this problem?
This makes sense, but what if the UX scenarios that add and edit equipment do not include the owner’s id as part of the working set?
Easy, pass the owner id along with equipment id to the UX. Owner id can be a hidden value not to be shown on the interface
but what if the data you would really like to denormalize by is several “joins” away?
Create as many tables for different query use-cases
For multiple updates and denormalizations, you can look at the new materialized views feature. Read my blog: www.doanduyhai.com/blog/?p=1930

Cassandra Data sync issues

I'm Researching on Cassandra for over 2 weeks just have the full grasp on the same. I've read almost all the web about Cassandra and still not clear over some concepts. Following are the ones:-
As per the documentation :- We model our Column Families as per our queries. Hence we need to know our queries before-hand, which is not at all possible in a real world scenario. We can have a certain set of queries before-hand, which all keeps changing with time. Hence if I'd designed a model based on my previous queries, then after a new requirement comes i, I need to redesign a the model. And as read over one SO thread It’s very hard to fix a bad Cassandra data model in the future. For Eg:- I'd a user model having fields say
name, age,phone,imei,address, state,city,registration_type, created_at
Currently, I need to filter by (lets say) only by state. I'll make a PK as state. Lets name the model UserByState.
Now after 2-3 months, I came with a requirement of filtering by created_at. Now I'll create a model UserByCreatedAt with PK as created_at.
Now there are 2 problems:-
a) If I create a new model when the requirement comes in, then I need to migrate the data into the new model, ie if I create a new model, I need to have the previous data in the current model as well. Hence I need to migrate the data from UserByState to UserByCreatedAt, ie I need to write a script to copy the data from UserByState to UserByCreatedAt. Correct me if Im wrong!!!
If another new filtering requirement comes in, I'll be creating new models and then migration and so on.
b) To create models before-hand as per the queries, I need to keep data in sync, ie in the above case of Users, I created 2 models for 2 queries.
UserByState and UserByCreatedAt
So do I need to apply 2 different write queries??, ie
UserByState.create(row = value,......)
UserByCreatedAt.create(row = value,......)
And if I've other models, such as 'UserByGender' and so on. do I need to apply different write queries to different models MANUALLY or does it happen on its own??? The problem of keeping the data in sync arises.
There is no free lunch in distributed systems and you've hit some of key limitations on the head.
If you want extremely performant writes that scale horizontally you end up having to make concessions on other pats of the database. Cassandra chose to sacrifice flexibility in query patterns to ensure extremely fast access to well defined query patterns.
When most users reach a situation where they need to have to extremely different and frequent query patterns, they build a second table and update both at once. To get atomicity with the multi-table writes, logged batching can be used to make sure that either all of the data is written or none of it is. Logged batching increases the cost so this is still yet another tradeoff with performance. Beyond that the normal consistency level tradeoffs all still apply.
For moving data from the old table to the new one Hadoop/Spark are good options. These are batch based systems so they will not provide low latency but are great for one-offs like rebuilding a table with a new index and cronjob operations.

SharePoint InfoPath best practices for persisting large forms

I am currently architecting a large SharePoint deployment.
This deployment has the potential to grow to petabytes in size over the course of several years.
One of the current issues we are discussing is the option of storing our data in SharePoint using InfoPath Forms. Some of these forms contain hundres of fields and require a lot of mapping to content types for persistence and search. Our search requirement is primarily a singular identifier and NOT the contents of the forms, although I am told I should preempt the "want" to search in the future.
We require our information to be utilised for secondary purposes (such as reporting etc). The information MUST be accessible instantly after persisting to the system.
My core questions therefore are:
What are the benefit/risks of this approach compared to storing
our data in a singular relational store using web-service
persistence?
If we decided on this approach what would be the
impact of changing the forms, content-types over time?
What happens when our farm grows beyond a single web-application / site collection how accessible will the information be?
Will I know where it is and how portable will the information be overtime?
1.)
Benefit:
Form templates can be created & deployed (relatively) easy
You can easily configure Field Validation
Probably no code involved
Risks:
Hitting SharePoint 2010 Limits (not so uncommon as you might
think)
Needs careful form design/planning (correct XML structure)
Information only accessible via SharePoint Object model or
WebService's (very slow)
2.) Well this is a tough one. Changing the form template and re-deploying is easy and only takes a few minutes. However changing the structure (underlying XML) of the template can get you in trouble very easily, because older (filled out) forms will be invalid - there is an option to "upgrade" older forms out-of-the-box, but in my experience it never worked as it supposed to.
Content Types behave very similar, say you want to delete a column from a content type because it's no longer needed - you'll have to remove all references to it, which means removing all items so you can delete the column.
3.) Well portability is definitly an issue with InfoPath, because it heavily relies on the corresponding URL structure. You absolutely can add more site collections, but this means you have to deploy your form template to each site collection. Information (filled out forms) can't easily be shared between site collection's because each form contains the SourceURL (where it came from) and the Namespace of the template (which changes constantly once you deploy).
Considering your requirements, i would strongly recommend a relational store instead of InfoPath - simply because it is not designed to be a data storage.
I would use a SQL database to store the data and a custom UI (WebPart or Application Page) to perform CRUD operations. This means that the information is not actually stored in SharePoint - just displayed (which also means that it can't be searched with the builtin SharePoint Search). There is also the possibilty to use the Business Connectivity Services (which basically does all of the above without you needing to create a custom UI - however very slow with large amount of data).
If you do need the information just in SharePoint, why not just make all this happen with Lists only?
This is going to be a long one and may not have an answer just because there's no silver bullet for what you're looking for. It's mostly insight and ultimately the choice is up to you.
the option of storing our data in SharePoint using InfoPath Forms
This statement throws me a little. SharePoint data is stored in SharePoint (well, SQL technically) but InfoPath is just a UI layer for accessing any part of that data.
Some of these forms contain 100s of fields and require alot of mapping to content types for persistence and search
From this I assume there are multiple forms which would mean different types of data being accessed (and probably different purposes). Hundreds of fields is no problem and it really boils down to managing the form and view design.
From the form side you should check out cxpartners form design crib sheet. This gives you a nice standard to follow to manage all those fields. Another thing would be to look at breaking the form up in tabs or views itself (in InfoPath) based on what the user needs to fill out. Basically it breaks down to not creating a form with 100s of fields on one massively scrolling screen the user will just freak out over.
Same with the views on the form or document library you're storing the form data in. InfoPath forms are just xml stored in a library (so regardless of how many fields you have, the footprint is pretty minimal). You don't want to map and surface every field in the form nor do you want to have a view with 100 columns on it. You should look at breaking down the views as they're fit for purpose, with only a few hundred items in each view with a few columns. It's a balancing act too as you don't want to create 100s of views either so you need to find out what's right. A good B.A. or Information Architect will help with this with the SharePoint/InfoPath guru and business user helping out.
We require our information to be utilised for secondary purposes (such as reporting etc). The information MUST be accessible instantly
This is another requirement that's going to be a little difficult to meet exactly. If the library has thousands of items (or 10s of thousands) and a view has dozens of fields then expect the view to come to crawl (especially if the user is insistent on "seeing everything" and wants the limits of each view to be set to 1000 items, like anyone could process that much information at once). Instant access is difficult if you're keeping everything online for a long time (like for reporting). There's the operational side where users are filling out forms, finding forms, editing them, etc. and for that you only want a few hundred items to be live at any given moment (up to a few thousand but you need to be careful on the views). If you have a list with 100,000 items in it and users are using this for daily activities and trying to run reports for trending or long term operations against it, you're going to lose the performance battle. Look at doing reporting offline, potentially shipping the data that's reportable to a second source like SQL and using SSRS against it. Performance Point is an option but adds a layer of complexity to the architecture. The question will ultimately fall to what reporting looks like and how important is it in relation to daily operations.
To try to answer your questions directly:
The benefits to using SharePoint over a database are that the data can be easily viewed and sliced and diced up. Creating a view is child's play and can quickly show you useful information like # of sales in a month or customer feedback grouped by call centre person. SharePoint makes it easy to view this information and even setup dashboards, hook in KPIs, etc. without having to get some developer to craft custom web pages. As far as risks go, you need to be careful with letting things grow organically and out of control. Don't let the users design views of data, they generally want something but not sure and will ask for all columns to be available which they just export to Excel to slice and dice. Make sure there's a good design around the views and lists and they're fit for purpose and meet what needs the user is trying to get out of the data. Ask the question of what they're looking for and why, that will help shape what to expose.
Any change needs to be thought out and planned and tested. It's no different in SharePoint if you add a column to a list as you would by adding a column to a SQL database. Form updates should be considered and while you won't get it 100% right the first time, you should try to get as much as possible without going overboard and putting in crazy things like 100 "blank" fields that are players to be named later. Strike a balance by understanding the needs of the users and company and where things are going. Hopefully someone will have a vision of what this thing might be when it grows up and that'll go a long way to understanding the impact of change.
Data is just xml and as long as you're not doing stupid stuff in the form like hard coding absolute paths to services (use data connection libraries) the impact of growth will be minimal. Growing beyond a web application into multiple ones is a pretty big change and not something to be taken lightly. Even splitting site collections out is big and there needs to be a really good reason for this. Site collections can handle thousands of sites and millions of documents without issue. Web applications are really there for dividing up areas of interest or separation of purpose (like team sites on one web app and a publishing portal on another) and not really meant for splitting data due to growth concerns.
Like I said, there's no silver bullet here and what you're asking for is an architecture for a solution that nobody here has all the requirements for. Hope this helps.

Azure Table Storage Design for Web Application

I am evaluating the use of Azure Table Storage for an application I am building, and I would like to get some advice on...
whether or not this is a good idea for the application, or
if I should stick with SQL, and
if I do go with ATS, what would be a good approach to the design of the storage.
The application is a task-management web application, targeted to individual users. It is really a very simple application. It has the following entities...
Account (each user has an account.)
Task (users create tasks, obviously.)
TaskList (users can organize their tasks into lists.)
Folder (users can organize their lists into folders.)
Tag (users can assign tags to tasks.)
There are a few features / requirements that we will also be building which I need to account for...
We eventually will provide features for different accounts to share lists with each other.
Users need to be able to filter their tasks in a variety of ways. For example...
Tasks for a specific list
Tasks for a specific list which are tagged with "A" and "B"
Tasks that are due tomorrow.
Tasks that are tagged "A" across all lists.
Tasks that I have shared.
Tasks that contain "hello" in the note for the task.
Etc.
Our application is AJAX-heavy with updates occurring for very small changes to a task. So, there is a lot of small requests and updates going on. For example...
Inline editing
Click to complete
Change due date
Etc...
Because of the heavy CRUD work, and the fact that we really have a list of simple entities, it would be feasible to go with ATS. But, I am concerned about the transaction cost for updates, and also whether or not the querying / filtering I described could be supported effectively.
We imagine numbers starting small (~hundreds of accounts, ~hundreds or thousands of tasks per account), but we obviously hope to grow our accounts.
If we do go with ATS, would it be better to have...
One table per entity (Accounts, Tasks, TaskLists, etc.)
Sets of tables per customer (JohnDoe_Tasks, JohnDoe_TaskLists, etc.)
Other thoughts?
I know this is a long post, but if anyone has any thoughts or ideas on the direction, I would greatly appreciate it!
Azure Table Storage is well suited to a task application. As long as you setup your partition keys and row keys well, you can expect fast and consistent performance with a huge number of simultaneous users.
For task sharing, ATS provides optimistic concurrency to support multiple users accessing the same data in parallel. You can use optimistic concurrency to warn users when more than one account is editing the same data at the same time, and prevent them from accidentally overwriting each-other's changes.
As to the costs, you can estimate your transaction costs based on the number of accounts, and how active you expect those accounts to be. So, if you expect 300 accounts, and each account makes 100 edits a day, you'll have 30K transactions a day, which (at $.01 per 10K transactions) will cost about $.03 a day, or a little less than $1 a month. Even if this estimate is off by 10X, the transaction cost per month is still less than a hamburger at a decent restaurant.
For the design, the main aspect to think about is how to key your tables. Before designing your application for ATS, I'd recommend reading the ATS white paper, particularly the section on partitioning. One reasonable design for the application would be to use one table per entity type (Accounts, Tasks, etc), then partition by the account name, and use some unique feature of the tasks for the row key. For both key types, be sure to consider the implications on future queries. For example, by grouping entities that are likely to be updated together into the same partition, you can use Entity Group Transactions to update up to 100 entities in a single transaction -- this not only increases speed, but saves on transaction costs as well. For another implication of your keys, if users will tend to be looking at a single folder at a time, you could use the row key to store the folder (e.g. rowkey="folder;unique task id"), and have very efficient queries on a folder at a time.
Overall, ATS will support your task application well, and allow it to scale to a huge number of users. I think the main question is, do you need cloud magnitude of scaling? If you do, ATS is a great solution; if you don't, you may find that adjusting to a new paradigm costs more time in design and implementation than the benefits you receive.
What your are asking is a rather big question, so forgive me if I don't give you an exact answer.. The short answer would be: Sure, go ahead with ATS :)
Your biggest concern in this scenario would be about speed. As you've pointed out, you are expecting a lot of CRUD operations. Out of the box, ATS doesn't support tranactions, but you can architect yourself out of such a challenge by using the CQRS structure.
The big difference from using a SQL to ATS is your lack of relations and general query possibilities, since ATS is a "NoSQL" approach. This means you have to structure your tables in a way that supports your query operations, which is not a simple task..
If you are aware of this, I don't see any trouble doing what your'e describing.
Would love to see the end result!

Strategies for search across disparate data sources

I am building a tool that searches people based on a number of attributes. The values for these attributes are scattered across several systems.
As an example, dateOfBirth is stored in a SQL Server database as part of system ABC. That person's sales region assignment is stored in some horrible legacy database. Other attributes are stored in a system only accessible over an XML web service.
To make matters worse, the the legacy database and the web service can be really slow.
What strategies and tips should I consider for implementing a search across all these systems?
Note: Although I posted an answer, I'm not confident its a great answer. I don't intend to accept my own answer unless no one else gives better insight.
You could consider using an indexing mechanism to retrieve and locally index the data across all the systems, and then perform your searches against the index. Searches would be an awful lot faster and more reliable.
Of course, this just shifts the problem from one part of your system to another - now your indexing mechanism has to handle failures and heterogeneous systems, but that may be an easier problem to solve.
Another factor is how often the data changes. If you have to query data in real-time that goes stale very quickly, then indexing may not be practical.
If you can get away with a restrictive search, start by returning a list based on the search criteria corresponding to the fastest data source. Then join up those records with the other systems and remove records which don't match the search criteria.
If you have to implement OR logic, this approach is not going to work.
While not an actual answer, this might at least get you partway to a workable solution. We had a similar situation at a previous employer - lots of data sources, different ways of accessing those data sources, different access permissions, military/government/civilian sources, etc. We used Mule, which is built around the Enterprise Service Bus concept, to connect these data sources to our application. My details are a bit sketchy, as I wasn't the actual implementor, just an integrator, but what we did was define a channel in Mule. Then you write a simple integration piece to go between the channel and the data source, and the application and the channel. The integration piece does the work of making the actual query, and formatting the results, so we had a generic SQL integration piece for accessing a database, and for things like web services, we had some base classes that implemented common functionality, so the actual customization of the integration piecess was a lot less work than it sounds like. The application could then query the channel, which would handle accessing the various data sources, transforming them into a normalized bit of XML, and return the results to the application.
This had a lot of advantages for our situation. We could include new data sources for existing queries by simply connecting them to the channel - the application didn't have to know or care what data sources where there, as it only looked at the data from the channel. Since data can be pushed or pulled from the channel, we could have a data source update the application when, for example, it was updated.
It took a while to get it configured and working, but once we got it going, we were pretty successful with it. In our demo setup, we ended up with 4 or 5 applications acting as both producers and consumers of data, and connecting to maybe 10 data sources.
Have you thought of moving the data into a separate structure?
For example, Lucene stores data to be searched in a schema-less inverted indexed. You could have a separate program that retrieves data from all your different sources and puts them in a Lucene index. Your search could work against this index and the search results could contain a unique identifier and the system it came from.
http://lucene.apache.org/java/docs/
(There are implementations in other languages as well)
Have you taken a look at YQL? It may not be the perfect solution but I might give you starting point to work from.
Well, for starters I'd parallelize the queries to the different systems. That way we can minimize the query time.
You might also want to think about caching and aggregating the search attributes for subsequent queries in order to speed things up.
You have the option of creating an aggregation service or middleware that aggregates all the different systems so that you can provide a single interface for querying. If you do that, this is where I'd do the previously mentioned cache and parallize optimizations.
However, with all of that it you will need weighing up the development time/deployment time /long term benefits of the effort against migrating the old legacy database to a faster more modern one. You haven't said how tied into other systems those databases are so it may not be a very viable option in the short term.
EDIT: in response to data going out of date. You can consider caching if your data if you don't need the data to always match the database in real time. Also, if some data doesn't change very often (e.g. dates of birth) then you should cache them. If you employ caching then you could make your system configurable as to what tables/columns to include or exclude from the cache and you could give each table/column a personalizable cache timeout with an overall default.
Use Pentaho/Kettle to copy all of the data fields that you can search on and display into a local MySQL database
http://www.pentaho.com/products/data_integration/
Create a batch script to run nightly and update your local copy. Maybe even every hour. Then, write your query against your local MySQL database and display the results.

Resources