We are building an inventory system using CQRS and event sourcing. We have separate read models (views) such as OnHandSTockView, InTransitSTockView and so on, where each view keeps its state in the database (1 view = 1 denormalized SQL table). How reusable should we make our service contracts? If client A (a web platform built with React) needs inTransitView and the mobile client also needs inTransitView, with minor differences such as the number of columns, should we keep 1 table in the database and create 1 service contract (reused by both clients), or create 2 tables, one per client-specific view, with 2 service contracts? Or should it be 1 SQL table and 2 REST endpoints for the different clients?
What are you trying to generalize, and what are you trying to optimize?
I would start with the first option you mentioned: 1 table in the database and 1 service contract (reusable for both clients). That way you reuse as much as possible and leave it to each client to use only the data it needs.
If you then notice that using 2 endpoints over the single denormalized table, one returning simplified data and one returning extended data, saves you a lot of resources, that would be the logical next step. It means more contracts to maintain, so it has a maintenance cost, but it makes sense as a planned optimization.
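A minimal sketch of that second step, assuming one denormalized row type and two client-specific contracts projected from it. The column names and the two field lists are hypothetical and only illustrate the idea of "1 table, 2 endpoints":

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class InTransitRow:
    """One row of the single denormalized in-transit table (hypothetical columns)."""
    sku: str
    quantity: int
    origin_warehouse: str
    destination_warehouse: str
    carrier: str
    expected_arrival: str  # ISO date string


# Hypothetical contracts: the mobile contract is a subset of the web contract.
WEB_FIELDS = ("sku", "quantity", "origin_warehouse",
              "destination_warehouse", "carrier", "expected_arrival")
MOBILE_FIELDS = ("sku", "quantity", "expected_arrival")


def project(row, fields):
    """Keep only the columns that a given contract exposes."""
    data = asdict(row)
    return {name: data[name] for name in fields}


def web_in_transit(rows):
    """Body of the extended endpoint."""
    return [project(row, WEB_FIELDS) for row in rows]


def mobile_in_transit(rows):
    """Body of the simplified endpoint, read from the same table."""
    return [project(row, MOBILE_FIELDS) for row in rows]
```

Both functions read the same rows, so the projection that builds the table from events stays unchanged when a second contract is added.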
2 tables, one per client-specific view, with 2 service contracts: this, like any other solution, would probably be the result of solving some other specific problem you hit with the previous solutions. It makes sense when the same data is used for different purposes and optimized in different ways. For example, the data may look the same, but one copy is used by a search engine and the other as a document or relational database. Such projections might also be built differently than by just projecting events as they are (e.g. by combining them with data from other projections).
I'm currently getting started with NodeJS, MongoDB, Mongoose etc. and have a few questions about Mongoose schema/model best practices:
Is it acceptable to have multiple schemas/models for a single view? Let's take a calendar app as an example. I have a single view for the calendar and several models, such as the CalendarModel (stores calendarID, color, owner etc.) and the CalendarEventModel (contains info on the events to be shown in the calendar). Is this a suitable approach?
In the above example, should there be a controller for each model/schema, or is it considered acceptable to have one controller that controls both models/schemas and puts them together into the single view that I have?
Would it be a good alternative to have one CalendarModel and store all CalendarEvents within that model, so that the model would consist of info like this: calendarID, owner, color, eventsList?
Thank you! :)
There is no one simple answer to this problem. It really depends on the requirements. Here are a few pointers.
On using multiple schemas/models for a single view: both options have pros and cons, but I would say it is a suitable approach.
Single view with multiple schemas/models
Pro:
Allows grouping data that changes together in one schema
Simple, because there is only one view to use and everything is present
Con:
A single view is great, but changing data requires the whole view to be loaded again
A single view may not be reusable, because it tends to contain a lot of useless info when reused
Multiple views with multiple schemas/models
Pro:
Very flexible and reusable.
More control over the size of the data.
A good grouping of data that changes together (granularity)
Con:
More complex to manage
You may not need the reusability
Maybe overkill
Whether to use one controller or one per model depends on the job the controller does for the app. Here I would really ask myself what the goal of the controller is. A big controller is simpler, but it does multiple things and may quickly become a mess to maintain.
If you change a model and you have a single controller, you have to change that controller, and you may break it for another piece of functionality.
If you change a model and you have multiple controllers, you have to change a single point and it is more controlled, but you are more vulnerable to creating side effects on other views and controllers.
Whether to store all events inside one CalendarModel is a question of data.
Single Model
Pro:
Simple
Load once, get everything
Fewer round trips
All centralized
Con:
Forced to store events and calendar in the same database
No caching possible
Large transfer size each time something changes
A new event requires an update of the calendar
Multiple Models
Pro:
More control over how the database stores the data
Possible to fetch pieces to limit the size of the data
Possible to cache stable data
Con:
More complex for migration and data coherence
More round trips to the database, or aggregation required
Maybe overkill, with more queries and assembly
My approach to this example, without knowing the exact requirements:
I would separate the events from the main model.
This allows updating and reloading events without reloading the calendar.
Since the calendar does not change a lot, I would cache it (in Redis or in process state) to avoid the database load.
I could load events per month or per year. This is great for keeping the size small.
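The approach above can be sketched as follows. This is a minimal illustration, assuming an in-memory stand-in for the database and an in-process dictionary as the cache; the class and field names are hypothetical and not part of Mongoose:

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Calendar:
    calendar_id: str
    owner: str
    color: str


@dataclass(frozen=True)
class CalendarEvent:
    event_id: str
    calendar_id: str
    name: str
    day: date


class CalendarStore:
    """In-memory stand-in for the database, with an in-process calendar cache."""

    def __init__(self, calendars, events):
        self._calendars = {c.calendar_id: c for c in calendars}
        self._events = list(events)
        self._cache = {}
        self.calendar_reads = 0  # counts simulated database reads

    def get_calendar(self, calendar_id):
        # The calendar rarely changes, so it is served from the cache after
        # the first read.
        if calendar_id not in self._cache:
            self.calendar_reads += 1
            self._cache[calendar_id] = self._calendars[calendar_id]
        return self._cache[calendar_id]

    def events_for_month(self, calendar_id, year, month):
        # Events are stored separately and loaded one month at a time.
        return [e for e in self._events
                if e.calendar_id == calendar_id
                and e.day.year == year and e.day.month == month]
```

Adding or editing an event only touches the event list, so the cached calendar stays valid.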
If the requirements are fixed, one controller and one view is fine, but requirements rarely stay fixed. Without over-engineering, I would separate events and calendar into two controllers. When I add, say, a custom font feature per calendar, this will change the calendar model and controller only.
The view can be a single instance for the global view, but if I also have a detailed event view I would add another view. Each view has a single purpose, and reuse is possible but not forced.
Lastly, the business rules should stay in one layer and not leak between layers. For instance, the rule that two events cannot be on the same day may be enforced by the controller. Another rule, saying that an event can only be moved to an existing calendar, should also live in the controller layer if that is the layer you decided on.
Note: This is my opinion, and there are multiple opinions on the subject. Also, it really depends on the requirements. A good solution for one app/API may be the worst solution for another.
EDIT:
Single or Multiple Controllers
I would tend to group code that serves a single purpose. Here, if I have 2 models (event/calendar) and 2 views (calendar overview and event detail), I would tend to have 2 controllers if they have different roles. If creating and editing can be done directly in the calendar overview, then use a single controller and let the event detail view use a subset of that same controller. But calendar preferences/overview and event management can be two different things. Note that the number of models could be 5 or 7 and it would not matter. I could have 6 different schemas to help me with storage and the database but only have 1 controller.
Deciding the number of things
Models:
An abstraction of the data and the storage solution (files, database, in memory, ...). Therefore, choosing the correct representation depends on the desired data structure. Think about what changes and what can be grouped together.
In this example, 1 model for Calendar ({id, color, owner, ...}), 1 model for Events, 1 model for Owner, ... If you need to use SQL for Events ({id, calendar id, detail id}) while the event details are in Mongo ({id, name, time, color, date, description}), then use 2 models for the events.
Controllers:
They represent a function and a way to interact with the user: a logical grouping of functions and a central place for business rules.
In this example, there are 2 logical groupings: calendar overview management with preferences, and event creation, update and detail loading. Note that both controllers will use the Event model. The calendar controller will load the events of the current month but may load only the id, name and time. The event controller will load only a specific event and allow creating and updating that event.
Views:
These are representations of the data. They allow us to simplify the output and keep the model structure and the business rules from leaking. Nobody has to know that you may use 3 kinds of database to store your data, or how your data is structured.
In this example, there could be 3 or more views. There could be a per-month overview and a per-year overview. Both use the calendar overview controller but are separate views, because the data may not be structured exactly the same. There could also be an event list overview that uses the calendar overview controller, and an event detail view.
Note that visually all views will be different, but at the core each is just a way to package the calendar and the selected events (1 month, 1 day, 1 year, all events) for display. If all the views turn out identical, just create a single one, but independent views allow you to accommodate changing requirements. The 1-day view may require more detail on events than the 1-year view.
I've been researching Cassandra for over 2 weeks to get a full grasp of it. I've read almost everything on the web about Cassandra and am still not clear on some concepts. They are the following:
As per the documentation, we model our column families according to our queries. Hence we need to know our queries beforehand, which is not possible in a real-world scenario. We can have a certain set of queries beforehand, but they keep changing over time. So if I designed a model based on my earlier queries, then when a new requirement comes in I need to redesign the model. And as I read in one SO thread, it's very hard to fix a bad Cassandra data model in the future. For example, I have a user model with the fields, say:
name, age,phone,imei,address, state,city,registration_type, created_at
Currently, I need to filter only by (let's say) state. I'll make state the PK and name the model UserByState.
Now, after 2-3 months, a requirement comes in to filter by created_at. I'll create a model UserByCreatedAt with created_at as the PK.
Now there are 2 problems:
a) If I create a new model when the requirement comes in, the new model also needs the data that already exists. Hence I need to migrate the data from UserByState to UserByCreatedAt, i.e. write a script that copies it. Correct me if I'm wrong!
If another new filtering requirement comes in, I'll be creating more new models, migrating again, and so on.
b) If I create models beforehand as per the queries, I need to keep their data in sync. In the above case of users, I created 2 models for 2 queries:
UserByState and UserByCreatedAt
So do I need to apply 2 different write queries, i.e.
UserByState.create(row = value,......)
UserByCreatedAt.create(row = value,......)
And if I have other models, such as 'UserByGender' and so on, do I need to apply separate write queries to each model MANUALLY, or does it happen on its own? This is where the problem of keeping the data in sync arises.
There is no free lunch in distributed systems, and you've hit some of the key limitations on the head.
If you want extremely performant writes that scale horizontally, you end up having to make concessions in other parts of the database. Cassandra chose to sacrifice flexibility in query patterns to ensure extremely fast access for well-defined query patterns.
When most users reach a situation where they need two extremely different and frequent query patterns, they build a second table and update both at once. To get atomicity for the multi-table writes, logged batching can be used to make sure that either all of the data is written or none of it is. Logged batching increases the cost of the write, so this is yet another tradeoff with performance. Beyond that, the normal consistency level tradeoffs all still apply.
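As a rough illustration of "update both at once", the sketch below only builds the CQL text of a logged batch (in CQL, a plain BEGIN BATCH is logged by default) that inserts the same row into every query table. The helper names and the table names user_by_state and user_by_created_at are assumptions for the example, and the statement would still have to be prepared and executed through a Cassandra driver:

```python
def build_logged_batch(tables, columns):
    """Return CQL for a logged batch that inserts one row into each table.

    Uses '?' bind markers, so the text is meant to be prepared by a driver.
    """
    column_list = ", ".join(columns)
    markers = ", ".join("?" for _ in columns)
    inserts = [
        f"  INSERT INTO {table} ({column_list}) VALUES ({markers});"
        for table in tables
    ]
    return "\n".join(["BEGIN BATCH", *inserts, "APPLY BATCH;"])


def batch_params(row, columns, table_count):
    """Repeat the row's values once per INSERT, in column order."""
    return [row[name] for name in columns] * table_count
```

With this shape, adding a third query table such as a hypothetical user_by_gender means adding one name to the table list rather than writing another separate write path.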
For moving data from the old table to the new one Hadoop/Spark are good options. These are batch based systems so they will not provide low latency but are great for one-offs like rebuilding a table with a new index and cronjob operations.
I'm new to DDD, so please excuse me if some terms or my understanding are a bit off. Please correct me; any advice is appreciated.
Let's say I'm building a social job board site, and I've identified my aggregate roots: Candidates, Jobs, and Companies. They are very different things/contexts, so each has its own database table, repository, and service. But now I have to build a Pinterest-style homepage where data blocks show data for either a Candidate, a Job, or a Company.
Now the tricky part is that the data blocks have to be ordered by the last time something happened to the aggregate each one represents (a company was liked/commented on, a job was updated, etc.), and paging takes the form of infinite scrolling, again just like Pinterest. Since things happen to these aggregates independently, I have no way of knowing how many of each aggregate are on any particular page. (But if I did, say with a table that tracks the aggregates' last update times, would I have no choice but to promote it to another aggregate root with its own repository?)
Where would I implement the paging logic? I read somewhere that there should be one service per repository per aggregate root, so should I sort and page in the controller (I'm using MVC, by the way)? Or should there be an independent Application Service that does cross-boundary work like this? In either case, do I have to fetch ALL entities for ALL aggregates from the database?
That's too many questions already but I'm basically asking:
Is paging presentation, business, or persistence logic? Which horizontal layer?
Where should cross boundary code reside in DDD? Which vertical stack?
Several things come to mind.
How fresh does this aggregated data need to be? I doubt realtime is going to add much value. Talk to a business person and bargain for some latency. This will allow you to build a simpler solution to the problem.
Why not have some process do the scanning, aggregation, sorting and store the result of that asynchronously? Doesn't even need to be in a database (Redis). The bargained latency could be the interval at which to run your process.
Paging is hardly a business concern in your example. You just need to provide infinite scrolling and some AJAX calls that fetch the cached, aggregated, sorted information. This has little to do with DDD.
Your UI artifacts and the aggregation and sorting process seem to be very much a thing of their own, working together with the data or, better yet, with a data component of each context that provides the data in the desired format.
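The points above can be sketched as a periodic rebuild step plus a paging function over its cached result. This is only an illustration under stated assumptions: FeedItem, rebuild_feed and page are hypothetical names, the per-context lists stand in for each context's data component, and the returned list stands in for whatever store (for example Redis) holds the result:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FeedItem:
    kind: str           # "candidate", "job" or "company"
    aggregate_id: str
    last_activity: int  # e.g. a Unix timestamp of the latest like/comment/update


def rebuild_feed(*sources):
    """Merge the activity lists of all contexts into one list, newest first.

    Intended to run asynchronously at the agreed latency interval.
    """
    merged = [item for source in sources for item in source]
    return sorted(merged, key=lambda item: item.last_activity, reverse=True)


def page(feed, offset, size):
    """Return one page of the cached feed and the next offset (None at the end)."""
    items = feed[offset:offset + size]
    next_offset = offset + size if offset + size < len(feed) else None
    return items, next_offset
```

The infinite-scroll AJAX call then only needs to pass the next offset back, without touching any repository or aggregate.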
I am evaluating the use of Azure Table Storage for an application I am building, and I would like to get some advice on...
whether or not this is a good idea for the application, or
if I should stick with SQL, and
if I do go with ATS, what would be a good approach to the design of the storage.
The application is a task-management web application, targeted to individual users. It is really a very simple application. It has the following entities...
Account (each user has an account.)
Task (users create tasks, obviously.)
TaskList (users can organize their tasks into lists.)
Folder (users can organize their lists into folders.)
Tag (users can assign tags to tasks.)
There are a few features / requirements that we will also be building which I need to account for...
We eventually will provide features for different accounts to share lists with each other.
Users need to be able to filter their tasks in a variety of ways. For example...
Tasks for a specific list
Tasks for a specific list which are tagged with "A" and "B"
Tasks that are due tomorrow.
Tasks that are tagged "A" across all lists.
Tasks that I have shared.
Tasks that contain "hello" in the note for the task.
Etc.
Our application is AJAX-heavy, with updates occurring for very small changes to a task. So, there are a lot of small requests and updates going on. For example...
Inline editing
Click to complete
Change due date
Etc...
Because of the heavy CRUD work, and the fact that we really have a list of simple entities, it would be feasible to go with ATS. But, I am concerned about the transaction cost for updates, and also whether or not the querying / filtering I described could be supported effectively.
We imagine numbers starting small (~hundreds of accounts, ~hundreds or thousands of tasks per account), but we obviously hope to grow our accounts.
If we do go with ATS, would it be better to have...
One table per entity (Accounts, Tasks, TaskLists, etc.)
Sets of tables per customer (JohnDoe_Tasks, JohnDoe_TaskLists, etc.)
Other thoughts?
I know this is a long post, but if anyone has any thoughts or ideas on the direction, I would greatly appreciate it!
Azure Table Storage is well suited to a task application. As long as you set up your partition keys and row keys well, you can expect fast and consistent performance with a huge number of simultaneous users.
For task sharing, ATS provides optimistic concurrency to support multiple users accessing the same data in parallel. You can use optimistic concurrency to warn users when more than one account is editing the same data at the same time, and prevent them from accidentally overwriting each other's changes.
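To make the idea concrete, here is a minimal in-memory sketch of optimistic concurrency: every write must present the version tag it last read, and a stale tag is rejected. It does not use the Azure SDK; VersionedStore and ConflictError are hypothetical names, and the integer tag stands in for the ETag that ATS attaches to each entity:

```python
class ConflictError(Exception):
    """Raised when a write is based on a version that is no longer current."""


class VersionedStore:
    def __init__(self):
        self._rows = {}  # key -> (value, etag)

    def read(self, key):
        """Return the stored value together with its current version tag."""
        return self._rows[key]

    def write(self, key, value, expected_etag):
        """Store value only if expected_etag matches the current tag.

        Pass expected_etag=None when creating a new entity.
        """
        current = self._rows.get(key)
        current_etag = current[1] if current is not None else None
        if expected_etag != current_etag:
            raise ConflictError(f"{key} was changed by someone else")
        new_etag = (current_etag or 0) + 1
        self._rows[key] = (value, new_etag)
        return new_etag
```

An application would catch the conflict, show the warning described above, and let the user reload the task before retrying.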
As to the costs, you can estimate your transaction costs based on the number of accounts, and how active you expect those accounts to be. So, if you expect 300 accounts, and each account makes 100 edits a day, you'll have 30K transactions a day, which (at $.01 per 10K transactions) will cost about $.03 a day, or a little less than $1 a month. Even if this estimate is off by 10X, the transaction cost per month is still less than a hamburger at a decent restaurant.
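The same estimate written out as a small calculation, so the inputs can be varied. The function name is hypothetical, and the price per 10K transactions is the figure quoted in this answer, not necessarily a current price:

```python
def transaction_cost(accounts, edits_per_account_per_day,
                     price_per_10k=0.01, days_per_month=30):
    """Return (transactions per day, cost per day, cost per month)."""
    per_day = accounts * edits_per_account_per_day
    daily_cost = per_day / 10_000 * price_per_10k
    return per_day, daily_cost, daily_cost * days_per_month


# The numbers from the answer: 300 accounts making 100 edits a day each.
per_day, daily_cost, monthly_cost = transaction_cost(300, 100)
```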
For the design, the main aspect to think about is how to key your tables. Before designing your application for ATS, I'd recommend reading the ATS white paper, particularly the section on partitioning. One reasonable design for the application would be to use one table per entity type (Accounts, Tasks, etc), then partition by the account name, and use some unique feature of the tasks for the row key. For both key types, be sure to consider the implications on future queries. For example, by grouping entities that are likely to be updated together into the same partition, you can use Entity Group Transactions to update up to 100 entities in a single transaction -- this not only increases speed, but saves on transaction costs as well. For another implication of your keys, if users will tend to be looking at a single folder at a time, you could use the row key to store the folder (e.g. rowkey="folder;unique task id"), and have very efficient queries on a folder at a time.
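A minimal sketch of that key design, assuming the account name as partition key and "folder;unique task id" as row key. The dictionary stands in for the Tasks table and the helper names are hypothetical; the prefix filter mimics the row key range query that ATS would run inside one partition:

```python
def task_keys(account, folder, task_id):
    """Return (PartitionKey, RowKey) for a task."""
    return account, f"{folder};{task_id}"


def tasks_in_folder(table, account, folder):
    """Return one account's tasks in one folder, ordered by row key."""
    prefix = f"{folder};"
    return [
        table[(partition_key, row_key)]
        for partition_key, row_key in sorted(table)
        if partition_key == account and row_key.startswith(prefix)
    ]
```

Queries that do not start from the account and folder, such as "tagged A across all lists", get no help from these keys, which is why the choice has to be checked against the expected queries.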
Overall, ATS will support your task application well, and allow it to scale to a huge number of users. I think the main question is, do you need cloud magnitude of scaling? If you do, ATS is a great solution; if you don't, you may find that adjusting to a new paradigm costs more time in design and implementation than the benefits you receive.
What you are asking is a rather big question, so forgive me if I don't give you an exact answer. The short answer would be: Sure, go ahead with ATS :)
Your biggest concern in this scenario would be speed. As you've pointed out, you are expecting a lot of CRUD operations. Out of the box, ATS doesn't support transactions that span partitions or tables (only Entity Group Transactions within a single partition), but you can architect your way around that by using a CQRS structure.
The big difference between SQL and ATS is the lack of relations and general query possibilities, since ATS is a "NoSQL" approach. This means you have to structure your tables in a way that supports your query operations, which is not a simple task.
If you are aware of this, I don't see any trouble doing what you're describing.
Would love to see the end result!
I am building a tool that searches people based on a number of attributes. The values for these attributes are scattered across several systems.
As an example, dateOfBirth is stored in a SQL Server database as part of system ABC. That person's sales region assignment is stored in some horrible legacy database. Other attributes are stored in a system only accessible over an XML web service.
To make matters worse, the legacy database and the web service can be really slow.
What strategies and tips should I consider for implementing a search across all these systems?
Note: Although I posted an answer, I'm not confident it's a great answer. I don't intend to accept my own answer unless no one else gives better insight.
You could consider using an indexing mechanism to retrieve and locally index the data across all the systems, and then perform your searches against the index. Searches would be an awful lot faster and more reliable.
Of course, this just shifts the problem from one part of your system to another - now your indexing mechanism has to handle failures and heterogeneous systems, but that may be an easier problem to solve.
Another factor is how often the data changes. If you have to query data in real-time that goes stale very quickly, then indexing may not be practical.
If you can get away with a restrictive search, start by returning a list based on the search criteria corresponding to the fastest data source. Then join up those records with the other systems and remove records which don't match the search criteria.
If you have to implement OR logic, this approach is not going to work.
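A minimal sketch of that restrictive (AND-only) strategy, assuming the fastest source has already returned its matching ids and the slower systems are exposed as lookup functions. All names here are hypothetical:

```python
def restrictive_search(fast_candidates, slow_lookups, criteria):
    """Keep only the candidates that also match every remaining criterion.

    fast_candidates: ids already matched by the fastest data source.
    slow_lookups:    attribute name -> function(person_id) returning its value.
    criteria:        attribute name -> required value (combined with AND).
    """
    matches = []
    for person_id in fast_candidates:
        # all() stops at the first failing attribute, so the slow systems
        # are only queried while the record is still a candidate.
        if all(slow_lookups[attribute](person_id) == wanted
               for attribute, wanted in criteria.items()):
            matches.append(person_id)
    return matches
```

Because every criterion can only remove records, a person who matches just one of several OR-ed criteria but was not in the initial list would never be found, which is the limitation noted above.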
While not an actual answer, this might at least get you partway to a workable solution. We had a similar situation at a previous employer: lots of data sources, different ways of accessing those data sources, different access permissions, military/government/civilian sources, etc. We used Mule, which is built around the Enterprise Service Bus concept, to connect these data sources to our application. My details are a bit sketchy, as I wasn't the actual implementor, just an integrator, but what we did was define a channel in Mule. Then you write a simple integration piece to go between the channel and the data source, and between the application and the channel. The integration piece does the work of making the actual query and formatting the results. We had a generic SQL integration piece for accessing a database, and for things like web services we had some base classes that implemented common functionality, so the actual customization of the integration pieces was a lot less work than it sounds. The application could then query the channel, which would handle accessing the various data sources, transforming the results into a normalized bit of XML, and returning them to the application.
This had a lot of advantages for our situation. We could include new data sources for existing queries by simply connecting them to the channel. The application didn't have to know or care what data sources were there, as it only looked at the data from the channel. Since data can be pushed to or pulled from the channel, we could have a data source notify the application when, for example, its data was updated.
It took a while to get it configured and working, but once we got it going, we were pretty successful with it. In our demo setup, we ended up with 4 or 5 applications acting as both producers and consumers of data, and connecting to maybe 10 data sources.
Have you thought of moving the data into a separate structure?
For example, Lucene stores the data to be searched in a schema-less inverted index. You could have a separate program that retrieves data from all your different sources and puts it in a Lucene index. Your search could work against this index, and the search results could contain a unique identifier and the system it came from.
http://lucene.apache.org/java/docs/
(There are implementations in other languages as well)
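To show what an inverted index over several systems looks like, here is a toy sketch in plain Python. It is not Lucene's API; the function names and the (system, record id, text) record shape are assumptions for illustration:

```python
from collections import defaultdict


def build_index(records):
    """Map each lower-cased token to the set of (system, record_id) it appears in.

    records: iterable of (system, record_id, text) tuples gathered from all sources.
    """
    index = defaultdict(set)
    for system, record_id, text in records:
        for token in text.lower().split():
            index[token].add((system, record_id))
    return index


def search(index, query):
    """Return the (system, record_id) pairs that contain every query token."""
    tokens = query.lower().split()
    if not tokens:
        return set()
    postings = [index.get(token, set()) for token in tokens]
    return set.intersection(*postings)
```

Each hit carries the source system and the identifier, so the full record can be fetched from the original system only when the user opens it.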
Have you taken a look at YQL? It may not be the perfect solution, but it might give you a starting point to work from.
Well, for starters I'd parallelize the queries to the different systems. That way we can minimize the query time.
You might also want to think about caching and aggregating the search attributes for subsequent queries in order to speed things up.
You have the option of creating an aggregation service or middleware that aggregates all the different systems so that you can provide a single interface for querying. If you do that, this is where I'd do the previously mentioned caching and parallelization optimizations.
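A minimal sketch of such an aggregation function, using the standard library's thread pool so that the slow sources are queried concurrently. The function name and the shape of the source callables are assumptions, and error handling and timeouts are left out:

```python
from concurrent.futures import ThreadPoolExecutor


def query_all(sources, person_id):
    """Query every source in parallel and merge the returned attributes.

    sources: name -> callable(person_id) returning a dict of attributes.
    """
    with ThreadPoolExecutor(max_workers=max(1, len(sources))) as pool:
        futures = {name: pool.submit(fetch, person_id)
                   for name, fetch in sources.items()}
        merged = {}
        for name, future in futures.items():
            # result() blocks until that source has answered.
            merged.update(future.result())
    return merged
```

The total wait is then roughly that of the slowest source rather than the sum of all of them.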
However, with all of that you will need to weigh up the development time, deployment time and long-term benefits of that effort against migrating the old legacy database to a faster, more modern one. You haven't said how tied into other systems those databases are, so migration may not be a very viable option in the short term.
EDIT: in response to data going out of date. You can consider caching if you don't need the data to always match the database in real time. Also, if some data doesn't change very often (e.g. dates of birth), then you should cache it. If you employ caching, you could make your system configurable as to which tables/columns to include in or exclude from the cache, and you could give each table/column its own configurable cache timeout with an overall default.
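A minimal sketch of such a cache, with a default timeout and per-column overrides. TtlCache and its parameters are hypothetical names, and the loader function stands in for the slow query against the real system:

```python
import time


class TtlCache:
    """Cache values per (column, key) with a per-column time to live in seconds."""

    def __init__(self, default_ttl, ttl_overrides=None, clock=time.monotonic):
        self._default_ttl = default_ttl
        self._ttl_overrides = dict(ttl_overrides or {})
        self._clock = clock  # injectable so expiry can be tested without sleeping
        self._entries = {}   # (column, key) -> (value, stored_at)

    def get(self, column, key, load):
        """Return the cached value, calling load(key) if it is missing or expired."""
        ttl = self._ttl_overrides.get(column, self._default_ttl)
        now = self._clock()
        entry = self._entries.get((column, key))
        if entry is not None and now - entry[1] < ttl:
            return entry[0]
        value = load(key)
        self._entries[(column, key)] = (value, now)
        return value
```

A stable column such as the date of birth can then be given a long timeout, while a volatile one keeps the short default.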
Use Pentaho/Kettle to copy all of the data fields that you can search on and display into a local MySQL database
http://www.pentaho.com/products/data_integration/
Create a batch script to run nightly and update your local copy. Maybe even every hour. Then, write your query against your local MySQL database and display the results.