How do you deal with legacy data integrity issues when rewriting software? - domain-driven-design

I am working on a project that is a rewrite of an existing piece of legacy software. The legacy software primarily consists of CRUD operations (create, read, update, delete) on an SQL database.
Despite the CRUD-based style of coding, the legacy software is extremely complex. This complexity is not only the result of the complexity of the problem domain itself, but also of poor (and regularly bordering on insane) design decisions. This poor coding has led to the data in the database lacking integrity. These integrity issues are not solely in terms of relationships (foreign keys), but also in terms of the integrity within a single row. E.g., the meaning of column "x" outright contradicts the meaning of column "y". (Before you ask, the answer is "yes": I have analysed the problem domain and correctly understand the meaning and purpose of these columns, better than the original software developers did, it seems.)
When writing the replacement software, I have used principles from Domain Driven Design and Command Query Responsibility Segregation, primarily due to the complexity of the domain. E.g., I've designed aggregate roots to enforce invariants in the write model, command handlers to perform "cross-aggregate" consistency checks, query handlers to query intentionally denormalised data in a manner appropriate for various screens, etc.
The replacement software works very well when entering new data, in terms of accuracy and ease of use. In that respect, it is successful. However, because the existing data is full of integrity issues, operations that involve the existing data regularly fail by throwing an exception. This typically occurs because an aggregate can't be read from a repository because the data passed to the constructor violates the aggregate's invariants.
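To illustrate the failure mode, here is a stripped-down sketch (hypothetical names, in Java for illustration; nothing like my real model):

```java
import java.time.LocalDate;

// Cut-down example of an aggregate whose constructor enforces its invariants.
public final class Policy {
    private final String policyNumber;
    private final LocalDate startDate;
    private final LocalDate endDate; // null means open-ended

    public Policy(String policyNumber, LocalDate startDate, LocalDate endDate) {
        if (policyNumber == null || policyNumber.isBlank()) {
            throw new IllegalArgumentException("policy number is required");
        }
        if (startDate == null) {
            throw new IllegalArgumentException("start date is required");
        }
        if (endDate != null && endDate.isBefore(startDate)) {
            throw new IllegalArgumentException("end date cannot precede start date");
        }
        this.policyNumber = policyNumber;
        this.startDate = startDate;
        this.endDate = endDate;
    }
}
```

The repository calls a constructor like this when rehydrating, so a legacy row with a missing policy number, or with an end date earlier than its start date, throws before I can do anything with it.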
How should I deal with this legacy data that "breaks the rules"? The old software worked fine in this respect because it performed next to no validation. Because of this lack of validation, it was easy for inexperienced users to enter nonsensical data (and experienced users became very valuable because they had years of understanding of its "idiosyncrasies").
The data itself is very important, so it cannot be discarded. I've tried sorting out the integrity issues as I go, and this has worked in some cases, but in others it is nearly impossible (e.g., data is outright missing from the database because the original developers decided not to save it). The sheer number of data integrity issues is overwhelming.
What can I do?

For a question tagged with DDD, the answer is almost always: talk to your domain expert. How do they want things to work?
I also noticed your question is tagged with CQRS. Are you actually implementing CQRS? In that case this should be almost a non-issue.
Your domain model will live on the command side of your application and always enforce validation. The read stack will provide just dumb viewmodels. This means that on a read your domain model isn't even involved and no validation is applied; it will just show whatever nonsense it can use to populate your viewmodel. On a write, however, validations are triggered, and any write will need to adhere to the full validations of your domain model.
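As a rough sketch of that split (hypothetical names, in Java, no particular framework assumed): the read side is just a flat projection of the denormalized read store, while writes go through the domain model.

```java
import java.time.LocalDate;
import java.util.List;

// Read side: a dumb, flat viewmodel -- no invariants, no behaviour.
// It shows whatever is in the read store, nonsense included.
record PolicyListItem(String policyNumber, String clientName, String status) {}

interface PolicyListQuery {
    List<PolicyListItem> forScreen(String clientNameFilter);
}

// Write side: the command is validated by the aggregate before anything is persisted.
record RenewPolicy(String policyNumber, LocalDate newEndDate) {}

interface RenewPolicyHandler {
    void handle(RenewPolicy command); // loads the aggregate, which enforces its invariants
}
```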
Now back to reality: be VERY sure that the validations you implement are actually the validations required. Take something as simple as a telephone number (often implemented as 3 digits, dash, 3 digits, dash, 4 digits). But then companies have special phone numbers like 1800-CALLME, which contain letters as well as digits and can even be of different lengths (and different countries may have different rules). If your system needs to handle this, it pretty much means you can't apply any validation to phone numbers at all.
This is just an example of how something you might think is a solid validation really can't be implemented at all because of the one special case it needs to handle. The rule here is, again: talk to your domain expert about how they want things handled. But be VERY careful that your validations don't make it nearly impossible for real users to use your system, since that's the fastest way to get your project killed.
Update: In DDD you will also hear the term anti-corruption layer. This layer ensures that incoming data meets the expectations of your domain model. This might be the preferred approach, but if you say you cannot ignore items with garbage data, then it might not solve the problem in your case.
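A minimal sketch of such a layer, assuming a Policy aggregate like the one sketched in the question and repair rules agreed with the domain expert (all names here are hypothetical):

```java
import java.time.LocalDate;
import java.util.Optional;

// Sits between the legacy tables and the new domain model.
public class LegacyPolicyTranslator {

    /** Raw legacy row -- no integrity guarantees whatsoever. */
    public record LegacyPolicyRow(String policyNumber, LocalDate start, LocalDate end) {}

    /** Hypothetical sink for rows that need manual attention. */
    public interface QuarantineLog { void record(LegacyPolicyRow row, String reason); }

    private final QuarantineLog quarantine;

    public LegacyPolicyTranslator(QuarantineLog quarantine) { this.quarantine = quarantine; }

    public Optional<Policy> translate(LegacyPolicyRow row) {
        if (row.policyNumber() == null || row.policyNumber().isBlank()) {
            quarantine.record(row, "missing policy number"); // can't repair: park it for a human
            return Optional.empty();
        }
        if (row.start() == null) {
            quarantine.record(row, "missing start date");
            return Optional.empty();
        }
        LocalDate end = row.end();
        if (end != null && end.isBefore(row.start())) {
            end = null; // example repair rule: treat inverted dates as an open-ended policy
        }
        return Optional.of(new Policy(row.policyNumber(), row.start(), end));
    }
}
```

Rows that can be repaired flow into the domain model; rows that can't are parked where a human can look at them, instead of blowing up the whole operation.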

Related

Looking for way of inspecting Drools interactions with Facts

I am inserting pojo-like objects/facts into KieSession and have rules that interact with those objects. After all rules are fired I would like to be able to inspect which objects and their methods were accessed by rules. In concept this is similar to mocking.
I attempted to use Mockito and inserted mocked objects into kieSession. I'm able to get a list of methods that were called but not all of the interactions seem to show up.
I'm not sure if this is a limitation of Mockito or something about how Drools manages facts and their lifecycles that breaks mocking.
Perhaps there is a better way of accomplishing this?
Update: Reasoning - we have an application executing various rule sets. The application makes all of the data available, but each rule set needs only some subset of it. There are some monitoring needs where we want to see exactly what data was accessed (getters called on fact objects) by a given rule set.
This part of your question indicates that you (or more likely your management) fundamentally don't understand the basic Drools lifecycle:
Reasoning - we have an application executing various rule sets. The application makes all of the data available, but each rule set needs only some subset of it. There are some monitoring needs where we want to see exactly what data was accessed (getters called on fact objects) by a given rule set.
The following is a very simplified explanation of how this works. A more detailed explanation would exceed the character limit of the answer field on StackOverflow.
When you call fireAllRules in Drools, the rule engine enters a phase called "matching", which is when it decides which rules are actually going to be run. First, the rule engine orders the rules depending on your particular rule set (via salience, natural order, etc.), and then it iterates across this list of rules and executes the left-hand side (LHS; AKA "when clause" or "conditions") only to determine if the rule is valid. On the LHS of each rule, each statement is evaluated sequentially until any single statement does not evaluate to true.
After Drools has inspected all of the rules' LHS, it moves into the "execution" phase. At this point all of the rules which it decided were valid to fire are executed. Using the same ordering from the matching phase, it iterates across each rule and then executes the statements on the right hand side of the rules.
Things get even more complicated when you consider that Drools supports inheritance, so Rule B can extend Rule A, so the statements in Rule A's LHS are going to be executed twice during the matching phase -- once when the engine evaluates whether it can fire Rule B, and also once when the engine evaluates whether it can fire Rule A.
This is further complicated by the fact that you can re-enter the matching phase from execution through the use of specific keywords in the right hand side (RHS; aka "then clause" or "consequences") of the rules -- specifically by calling update, insert, modify, retract/delete, etc. Some of these keywords like insert will reevaluate a subset of the rules, while others like update will reevaluate all of the rules during that second matching phase.
I've focused on the LHS in this discussion because your statement said:
There are some monitoring needs where we want to see exactly what data was accessed (getters called on fact objects) ....
The majority of your getters should be on your LHS unless you've got some really non-standard rules. This is where you should be getting your data out and doing comparisons/checks/making decisions about whether your rule should be fired.
Hopefully this makes it clear why the request to know which "get" calls are triggered doesn't really make sense -- in the matching phase we're going to be triggering a whole lot of "get" calls and then ignoring the results because some other part of the LHS doesn't evaluate to true.
I did consider that potentially we're having a communication problem here and the actual need is to know what data is actually being used for execution (RHS). In this case, you should be looking at using listeners, as I suggested in the comments. Write a listener that hooks into the Drools lifecycle, specifically into the execution phase (AgendaEventListener's afterMatchFired). At that point you know that the rule matched and was actually executed, so you can log or record the rule name and details. Since you know the exact data needed and used by each rule, this will allow you to track the data you actually use.
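A sketch of such a listener against the kie-api (the class name and the logging are mine):

```java
import org.kie.api.event.rule.AfterMatchFiredEvent;
import org.kie.api.event.rule.DefaultAgendaEventListener;

// Fires only after a rule has matched AND executed, so it records real usage
// rather than the speculative "get" calls made during the matching phase.
public class FiredRuleLogger extends DefaultAgendaEventListener {
    @Override
    public void afterMatchFired(AfterMatchFiredEvent event) {
        String ruleName = event.getMatch().getRule().getName();
        // The facts bound by this particular match:
        event.getMatch().getObjects()
             .forEach(fact -> System.out.println(ruleName + " used " + fact));
    }
}

// Register it on the session before calling fireAllRules():
//   kieSession.addEventListener(new FiredRuleLogger());
```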
This all being said, I found this part concerning based on my previous experience:
The application makes all of the data available, but each rule set needs only some subset of it.
The company I worked for followed this approach -- we made all data available to all the rules by adding everything into the working memory. The idea was that if all the data was available, then we would be able to write rules without changing the supporting code because any data that you might need in the future was already available in working memory.
This turned out to be OK when we had small data, but as the company and product grew, this data also grew, and our rules started requiring massive amounts of memory to support working memory (especially as our call volume increased, since we needed that larger heap allocation per rules request). It was exacerbated by the fact that we were using extremely unperformant objects to pass into working memory -- namely HashMaps and objects which extended HashMap.
Given that, you should strongly consider rethinking your strategy. Once we trimmed the data that we were passing into the rules, decreasing both volume and changing the structures to performant POJOs, we not only saw a great decrease in resource usage (heap, primarily) but we also saw a performance improvement in terms of greater rule throughput, since the rule engine didn't need to keep handling and evaluating those massive and inefficient volumes of data in working memory.
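For illustration only (a hypothetical fact, not our actual model), the change was roughly from map-shaped facts to small typed ones:

```java
// Before: everything crammed into maps -- lots of boxing, and little for the
// engine to index, so constraints tended to be evaluated the slow way.
//   Map<String, Object> fact = new HashMap<>();
//   fact.put("customerTier", "GOLD");
//   fact.put("orderTotal", 250.0);

// After: a small typed fact exposing exactly the fields the rules need.
public class OrderFact {
    private final String customerTier;
    private final double orderTotal;

    public OrderFact(String customerTier, double orderTotal) {
        this.customerTier = customerTier;
        this.orderTotal = orderTotal;
    }

    public String getCustomerTier() { return customerTier; }
    public double getOrderTotal() { return orderTotal; }
}
```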
And, finally, in terms of the question you had about mocking objects in working memory -- I would strongly caution against attempting to do this. Mocking libraries really shouldn't be used in production code. Most mocking works by leveraging combinations of reflection and bytecode manipulation. Data in working memory isn't guaranteed to remain in the initial state it was passed in -- it can get serialized and deserialized at different points in the process, so depending on how your specific mocking library is implemented, you'll likely lose "access" to the specific mocked instance and instead your rules will be working against a functionally equivalent copy produced by that serialization/deserialization process.
Though I've never tried it in this situation, you may be able to use aspects if you really want to instrument your getter methods. There's a non-negligible chance you'll run into the same issue there, though.

Complex Finds in Domain Driven Design

I'm looking into converting part of a large existing VB6 system into .NET. I'm trying to use domain driven design, but I'm having a hard time getting my head around some things.
One thing that I'm completely stumped on is how I should handle complex find statements. For example, we currently have a screen that displays a list of saved documents, that the user can select and print off, email, edit or delete. I have a SavedDocument object that does the trick for all the actions, but it only has the properties relevant to it, and I need to display the client name that the document is for and their email address if they have one. I also need to show the policy reference that this document may have come from. The Client and Policy are linked to the SavedDocument but are their own aggregate roots, so are not loaded at the same time the SavedDocuments are.
The user is also allowed to specify several filters to reduce the list down. These can be based on properties stored on the SavedDocument or on the Client and Policy.
I'm not sure how to handle this from a Domain driven design point of view.
Do I have a function on a repository that takes the filters and returns me a list of SavedDocuments, which I then have to turn into a different object or DTO and fill with the additional client and policy information? That seems a little slow, as I have to load all the details using multiple calls.
Do I have a function on a repository that takes the filters and returns me a list of SavedDocumentsForList objects that contain just the information I want? This seems the quickest but doesn't feel like I'm using DDD.
Do I load everything from their objects and do all the filtering and column selection in a service? This seems the slowest, but also appears to be very domain orientated.
I'm just really confused about how to handle these situations, and I've not really seen any other people asking questions about it, which makes me feel that I'm missing something.
Queries can be handled in a few ways in DDD. Sometimes you can use the domain entities themselves to serve queries. This approach can become cumbersome in scenarios such as yours when queries require projections of multiple aggregates. In this case, it is easier to use objects explicitly designed for the respective queries - effectively DTOs. These DTOs will be read-only and won't have any behavior. This can be referred to as the read-model pattern.
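As a rough sketch of what that read model could look like for your screen (hypothetical names, shown in Java; the C# shape is analogous, and the query behind it can be a single SQL join or view rather than multiple aggregate loads):

```java
import java.util.List;

// Read-only projection shaped for the saved-documents screen -- not a domain object.
record SavedDocumentListItem(
        int documentId,
        String documentTitle,
        String clientName,
        String clientEmail,       // null if the client has no email
        String policyReference) { // null if not linked to a policy
}

// Filter object built from whatever the user typed into the screen.
record SavedDocumentFilter(String clientNameContains, String policyReference) {}

// Lives alongside the repositories but bypasses the aggregates entirely.
interface SavedDocumentQueries {
    List<SavedDocumentListItem> find(SavedDocumentFilter filter);
}
```

Printing, emailing, editing and deleting would still go through the SavedDocument aggregate; only the listing screen reads this projection.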

API design and security: Why hide internal ids?

I've heard a few people say that you should never expose your internal ids to the outside world (for instance an auto-incrementing primary key).
Some suggest having some sort of uuid column that you use instead for lookups.
I'm wondering really why this would be suggested and if it's truly important.
Using a uuid instead is basically just obfuscating the id. What's the point? The only thing I can think of is that auto_incrementing integers obviously point out the ordering of my db objects. Does it matter if an outside user knows that one thing was created before/after another?
Or is it purely that obfuscating the ids would prevent "guessing" at different operations on specific objects?
Is this even an issue I should be thinking about when designing an external-facing API?
Great answers; I'll add another reason why you don't want to expose your internal auto-incremented IDs.
As a competitor, I can easily measure how many new users/orders/etc. you get every week/day/hour. I just need to create a user and/or an order and subtract the new ID from the one I got last time.
So it's not only about security; there are business reasons as well.
Any information that you provide a malicious user about your application and its layout can and will be used against your application. One of the problems we face in (web) application security is that seemingly innocuous design decisions taken in the infancy of a project become Achilles' heels when the project scales larger. Letting an attacker make informed guesses about the ordering of entities can come back to haunt you in the following, somewhat unrelated ways:
The ID of the entity will inevitably be passed as a parameter at some point in your application. This will result in hackers eventually being able to feed your application arguments they ordinarily should not have access to. I've personally been able to view order details (on a very popular retailer's site) that I had no business viewing, as a URL argument no less. I simply fed the app sequential numbers from my own legitimate order.
Knowing the limits, or at least the progression, of primary key field values is invaluable fodder for SQL injection attacks, the scope of which I can't cover here.
Key values are used not only in RDBMS systems, but in other key-value mapping systems too. Imagine if the JSESSIONID cookie value could be predetermined or guessed: everybody with opposable thumbs would be replaying sessions in web apps.
And many more that I'm sure other people here will come up with.
SEAL team 6 doesn't necessarily mean there are 6 seal teams. Just keeps the enemy guessing. And the time spent guessing by a potential attacker is more money in your pocket any way you slice it.
As with many security-related issues, it's a subtle answer - kolossus gives a good overview.
It helps to understand how an attacker might go about compromising your API, and how many security breaches occur.
Most security breaches are caused by bugs or oversights, and attackers look for those. An attacker who is trying to compromise your API will first try to collect information about it - as it's an API, presumably you publish detailed usage documentation. An attacker will use this documentation and try lots of different ways to make your site crash (and thereby expose more information, if they're lucky), or react in ways you didn't anticipate.
You have to assume the attacker has lots of time, and will script their attack to try every single avenue - like a burglar with infinite time, who goes around your house trying every door and window, with a lock pick that learns from every attempt.
So, if your API exposes a method like getUserInfo(userid), and userID is an integer, the attacker will write a script to iterate from 0 upwards to find out how many users you have. They'll try negative numbers, and max(INT) + 1. Your application could leak information in all those cases, and - if the developer forgot to handle certain errors - may expose more data than you intended.
If your API includes logic to restrict access to certain data - e.g. you're allowed to execute getUserInfo only for users in your friend list - the attacker may get lucky with some numbers because of a bug or an oversight, and they'll know that the info they're getting relates to a valid user, so they can build up a model of the way your application is designed. It's the equivalent of a burglar knowing that all your locks come from a single manufacturer, so they only need to bring that one lock pick.
By itself, this may be of no advantage to the attacker - but it makes their life a tiny bit easier.
Given the effort of using UUIDs or another meaningless identifier, it's probably worth making things harder for the attacker. It's not the most important consideration, of course - it probably doesn't make the top 5 things you should do to protect your API from attackers - but it helps.
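A minimal sketch of the usual compromise (hypothetical names): keep the auto-increment key for joins and foreign keys, and hand the outside world only a random identifier.

```java
import java.util.UUID;

// The internal key never leaves the service; the API only sees the public UUID.
public class UserRecord {
    private final long internalId;  // auto-increment PK, used for joins and FKs
    private final UUID publicId;    // what getUserInfo(...) accepts and returns

    public UserRecord(long internalId) {
        this.internalId = internalId;
        this.publicId = UUID.randomUUID(); // no ordering, nothing to enumerate
    }

    public UUID publicId() { return publicId; }

    long internalId() { return internalId; } // package-private on purpose
}
```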

Code generation against Sprocs?

I'm trying to understand choices for code generation tools/ORM tools and discover what solution will best meet the requirements that I have and the limitations present.
I'm creating a foundational solution to be used for new projects. It consists of ASP.NET MVC 3.0 and layers for business logic and data access. The data access layer will need to go against Oracle for now, and then switch to SQL Server this year once the db migration is finished.
From a DTO standpoint, mapping to custom types in the solution, what ORM/code generation tool will work for creating the code I need but can ONLY access stored procs in Oracle and SQL Server?
Meaning, I need to generate the custom objects that are the artifacts coming back from the stored procedures and being pushed to them as parameters; I don't need to generate the sprocs themselves, as they already exist. I'm looking for the representation of what a sproc needs and gives back to be generated into DTOs. In some cases I can go against views and generate DTOs; I'm assuming most tools already do this. But for 90% of the time, I don't have direct access to any tables or views, only stored procs.
Does this make sense?
ORMs are best at mapping objects to tables (and/or views), not mapping objects to sprocs.
Very few tools can do automated code generation against whatever output a sproc may generate, depending on the complexity of the sproc. It's much more straightforward to code-generate the input to a sproc, as that is generally well defined and clear.
I would say if you are stuck with sprocs, your options for using third party code to help reduce your development and maintenance time are severely limited.
I believe either LinqToSql or EntityFramework (or both?) are capable of some magic with regard to SQL Server to try to mostly-automatically figure out what a sproc may be returning. I don't think it works all the time; it's just sophisticated guesswork, and I seriously doubt it would work with Oracle. I am not aware of anything else software-wise that even attempts to figure out what a sproc may return.
A sproc can return multiple diverse record sets that can be built dynamically by the sproc depending on the input and data in the database. A technical solution to automatically anticipating sproc output seems like it would require the following:
A static set of underlying data in the database
The ability to pass all possible inputs to the sproc and execute the sproc without any negative impact or side effects
That would give you a static set of possible outputs for any given valid input. A small change in the data in the database could invalidate everything.
If I recall correctly, the magic Microsoft did was something like calling the sproc passing NULL for all input parameters and assuming the output is always exactly the first recordset that comes back from the database. That is clearly an incomplete solution to the problem, but in simple cases it appears to be magic because it can work very well some of the time.
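You can reproduce the fragility of that trick by hand; here is a sketch over plain JDBC (Java here just for illustration, with a hypothetical procedure name, assuming a SQL Server-style proc that returns its result set directly rather than through an Oracle ref cursor):

```java
import java.sql.CallableStatement;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSetMetaData;
import java.sql.SQLException;
import java.sql.Types;

public class SprocShapeProbe {
    public static void main(String[] args) throws SQLException {
        try (Connection conn = DriverManager.getConnection(args[0]);
             CallableStatement call = conn.prepareCall("{call GET_ORDER_SUMMARY(?, ?)}")) {
            // Pass NULLs and hope the first result set is representative...
            call.setNull(1, Types.INTEGER);
            call.setNull(2, Types.VARCHAR);
            if (call.execute()) {
                ResultSetMetaData md = call.getResultSet().getMetaData();
                for (int i = 1; i <= md.getColumnCount(); i++) {
                    // ...but this shape can change with different inputs or different data.
                    System.out.println(md.getColumnName(i) + " : " + md.getColumnTypeName(i));
                }
            }
        }
    }
}
```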

Transaction with Cassandra data model

According to the CAP theorem, Cassandra can only offer eventual consistency. To make things worse, if we have multiple reads and writes during one request without proper handling, we may even lose logical consistency. In other words, if we do things fast, we may do them wrong.
Meanwhile the best practice to design the data model for Cassandra is to think about the queries we are going to have, and then add a CF to it. In this way, to add/update one entity means to update many views/CFs in many cases. Without atomic transaction feature, it's hard to do it right. But with it, we lose the A and P parts again.
I don't see this concerning many people, so I wonder why.
Is this because we can always find a way to design our data model that avoids doing multiple reads and writes in one session?
Is this because we can just ignore the 'right' part?
In real practice, do we always have ACID features somewhere in the middle? I mean, maybe implement it in the application layer, or add a middleware to handle it?
It does concern people, but presumably you are using Cassandra because a single database server is unable to meet your needs due to scaling or reliability concerns. Because of this, you are forced to work around the limitations of a distributed system.
In real practice, do we always have ACID features somewhere in the middle? I mean, maybe implement it in the application layer, or add a middleware to handle it?
No, you don't usually have ACID somewhere else, as presumably that somewhere else must be distributed over multiple machines as well. Instead, you design your application around the limitations of a distributed system.
If you are updating multiple columns to satisfy queries, you can look at the "eventually atomic" section in this presentation for ideas on how to do that. Basically, you write enough info about your update to Cassandra before you do the write itself; that way, if the write fails, you can retry it later.
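A sketch of that pattern with the DataStax Java driver (the driver choice, table names and column layout are all assumptions, not something taken from the presentation): record the intended update first, apply it to every denormalized table, then discard the intent, so a background sweeper can replay anything left unfinished.

```java
import com.datastax.oss.driver.api.core.CqlSession;
import java.util.UUID;

public class EventuallyAtomicWriter {
    private final CqlSession session;

    public EventuallyAtomicWriter(CqlSession session) { this.session = session; }

    public void renameUser(UUID userId, String newName) {
        UUID intentId = UUID.randomUUID();

        // 1. Durably record what we are about to do.
        session.execute(
            "INSERT INTO pending_updates (intent_id, user_id, new_name) VALUES (?, ?, ?)",
            intentId, userId, newName);

        // 2. Apply the change to every table/CF that denormalizes the name
        //    (both assumed to be keyed by user_id).
        session.execute("UPDATE users SET name = ? WHERE user_id = ?", newName, userId);
        session.execute("UPDATE user_profiles SET name = ? WHERE user_id = ?", newName, userId);

        // 3. Only now discard the intent; if we crash before this line,
        //    a background job re-reads pending_updates and retries the writes.
        session.execute("DELETE FROM pending_updates WHERE intent_id = ?", intentId);
    }
}
```

Because each individual write is idempotent, replaying an incomplete intent is safe.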
If you can structure your application in such a way, using a coordination service like ZooKeeper or Cages may be useful.
