Looking for a way of inspecting Drools interactions with facts (Mockito)

I am inserting POJO-like objects (facts) into a KieSession and have rules that interact with those objects. After all rules have fired, I would like to be able to inspect which objects, and which of their methods, were accessed by the rules. In concept this is similar to mocking.
I attempted to use Mockito and inserted mocked objects into the KieSession. I'm able to get a list of methods that were called, but not all of the interactions seem to show up.
I'm not sure whether this is a Mockito limitation or something about how Drools manages facts and their lifecycle that breaks mocking.
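Roughly, the attempt looked like the sketch below (the fact class, package, and session name are placeholders, not the real code):

```java
import java.util.Collection;

import org.kie.api.KieServices;
import org.kie.api.runtime.KieContainer;
import org.kie.api.runtime.KieSession;
import org.mockito.Mockito;
import org.mockito.invocation.Invocation;

public class MockedFactProbe {

    // Placeholder fact type; the real rules consume our application's own POJOs.
    public static class Customer {
        public int getAge() { return 0; }
    }

    public static void main(String[] args) {
        KieContainer container = KieServices.Factory.get().getKieClasspathContainer();
        KieSession session = container.newKieSession("monitored-session"); // placeholder session name

        // Insert a Mockito mock instead of the real fact.
        Customer customer = Mockito.mock(Customer.class);
        Mockito.when(customer.getAge()).thenReturn(42); // stub the getters the rules read
        session.insert(customer);
        session.fireAllRules();

        // List every method that was (apparently) called on the mock.
        Collection<Invocation> invocations = Mockito.mockingDetails(customer).getInvocations();
        invocations.forEach(inv -> System.out.println(inv.getMethod().getName()));

        session.dispose();
    }
}
```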
Perhaps there is a better way of accomplishing this?
Update: Reasoning - we have an application executing various rule sets. The application makes all of the data available, but each rule set needs only a subset of it. There are some monitoring needs where we want to see exactly what data was accessed (getters called on fact objects) by a given rule set.

This part of your question indicates that you (or more likely your management) fundamentally don't understand the basic Drools lifecycle:
Reasoning - we have an application executing various rule sets. The application makes all of the data available, but each rule set needs only a subset of it. There are some monitoring needs where we want to see exactly what data was accessed (getters called on fact objects) by a given rule set.
The following is a very simplified explanation of how this works. A more detailed explanation would exceed the character limit of the answer field on StackOverflow.
When you call fireAllRules in Drools, the rule engine enters a phase called "matching", which is when it decides which rules are actually going to be run. First, the rule engine orders the rules depending on your particular rule set (via salience, natural order, etc.), and then it iterates across this list of rules and executes only the left-hand side (LHS; AKA "when clause" or "conditions") to determine whether the rule is valid. On the LHS of each rule, each statement is executed sequentially until any single statement does not evaluate to true.
After Drools has inspected all of the rules' LHS, it moves into the "execution" phase. At this point all of the rules which it decided were valid to fire are executed. Using the same ordering from the matching phase, it iterates across each rule and then executes the statements on the right hand side of the rules.
Things get even more complicated when you consider that Drools supports inheritance: Rule B can extend Rule A, so the statements in Rule A's LHS are going to be executed twice during the matching phase -- once when the engine evaluates whether it can fire Rule B, and once when it evaluates whether it can fire Rule A.
This is further complicated by the fact that you can re-enter the matching phase from the execution phase through the use of specific keywords on the right-hand side (RHS; AKA "then clause" or "consequences") of the rules -- specifically by calling update, insert, modify, retract/delete, etc. Some of these keywords, like insert, will reevaluate a subset of the rules, while others, like update, will reevaluate all of the rules during that second matching phase.
I've focused on the LHS in this discussion because your statement said:
There are some monitoring needs where we want to see exactly what data was accessed (getters called on fact objects) ....
The majority of your getters should be on your LHS unless you've got some really non-standard rules. This is where you should be getting your data out and doing comparisons/checks/making decisions about whether your rule should be fired.
Hopefully it is now clear why the request to know which "get" calls are triggered doesn't really make sense -- during the matching phase we're going to trigger a whole lot of "get" calls and then ignore the results because some other part of the LHS doesn't evaluate to true.
I did consider that potentially we're having a communication problem here and the actual need is to know what data is actually being used for execution (RHS). In that case, you should look into using listeners, as I suggested in the comments: write a listener that hooks into the Drools lifecycle, specifically into the execution phase (AgendaEventListener's afterMatchFired). At that point you know that the rule matched and was actually executed, so you can log or record the rule name and details. Since you know the exact data needed and used by each rule, this will allow you to track the data you actually use.
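A minimal sketch of that kind of listener (the class name and the logging are just placeholders; the kie-api types are the real ones):

```java
import org.kie.api.event.rule.AfterMatchFiredEvent;
import org.kie.api.event.rule.DefaultAgendaEventListener;
import org.kie.api.runtime.KieSession;

// Hypothetical listener that records which rules actually fired and which facts each match used.
public class FiredRuleAuditListener extends DefaultAgendaEventListener {

    @Override
    public void afterMatchFired(AfterMatchFiredEvent event) {
        String ruleName = event.getMatch().getRule().getName();
        // The facts bound by this rule's conditions.
        event.getMatch().getObjects()
             .forEach(fact -> System.out.println(ruleName + " used " + fact));
    }

    // Attach the listener before calling fireAllRules.
    public static void register(KieSession session) {
        session.addEventListener(new FiredRuleAuditListener());
    }
}
```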
This all being said, I found this part concerning based on my previous experience:
The application makes all of the data available, but each rule set needs only a subset of it.
The company I worked for followed this approach -- we made all data available to all the rules by adding everything into the working memory. The idea was that if all the data was available, then we would be able to write rules without changing the supporting code because any data that you might need in the future was already available in working memory.
This turned out to be OK when we had small data, but as the company and product grew, this data also grew, and our rules started requiring massive amounts of memory to support working memory (especially as our call volume increased, since we needed that larger heap allocation per rules request). It was exacerbated by the fact that we were passing extremely unperformant objects into working memory -- namely HashMaps and objects which extended HashMap.
Given that, you should strongly consider rethinking your strategy. Once we trimmed the data that we were passing into the rules, both decreasing the volume and changing the structures to performant POJOs, we not only saw a great decrease in resource usage (heap, primarily) but also a performance improvement in terms of greater rule throughput, since the rule engine no longer needed to keep handling and evaluating those massive and inefficient volumes of data in working memory.
And, finally, in terms of the question you had about mocking objects in working memory -- I would strongly caution against attempting to do this. Mocking libraries really shouldn't be used in production code. Most mocking works by leveraging combinations of reflection and bytecode manipulation. Data in working memory isn't guaranteed to remain in the initial state that it was passed in -- it gets serialized and deserialized at different points in the process, so depending on how your specific mocking library is implemented, you'll likely lose "access" to the specific mocked instance and instead your rules will be working against a functionally equivalent copy produced by that serialization/deserialization process.
Though I've never tried it in this situation, you may be able to use aspects if you really want to instrument your getter methods. There's a non-negligible chance you'll run into the same issue there, though.
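If you do try that, an annotation-style AspectJ aspect would look roughly like the sketch below (the fact package is a placeholder, and you would still need compile-time or load-time weaving configured):

```java
import org.aspectj.lang.JoinPoint;
import org.aspectj.lang.annotation.Aspect;
import org.aspectj.lang.annotation.Before;

// Logs every getter call on the fact classes, regardless of who the caller is.
@Aspect
public class FactGetterAudit {

    // "com.example.facts" is a placeholder for the package holding the fact POJOs.
    @Before("execution(* com.example.facts..*.get*(..))")
    public void recordGetter(JoinPoint joinPoint) {
        System.out.println("accessed " + joinPoint.getSignature().toShortString()
                + " on " + joinPoint.getTarget());
    }
}
```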

Related

Should I use a NestJS pipe, a guard, or should I go for an interceptor?

Well I have a few pipes in the application I'm working on and I'm starting to think they actually should be guards or even interceptors.
One of them is called PincodeStatusValidationPipe and its job is as simple as can be: it checks the cache for a certain value; if that value is the one expected, it returns what it gets, otherwise it throws a FORBIDDEN exception.
Another pipe is called UserExistenceValidationPipe. It operates on the login method and checks whether a user exists in the DB, along with some other things related to that user (e.g. whether the password expected by the login method is present and, if so, whether it matches that of the retrieved user); otherwise it throws appropriate exceptions.
I know it's more of a design question but I find it quite important and I would appreciate any hints. Thanks in advance.
EDIT:
Well I think UserExistenceValidationPipe is definitely not the best name choice, something like UserValidationPipe fits way better.
If you are throwing a FORBIDDEN already, I would suggest migrating the PincodeStatusValidationPipe to be a PincodeStatusValidationGuard, as returning false from a guard will throw a FORBIDDEN for you. You'll also have full access to the Request object, which is pretty nice to have.
For the UserExistenceValidationPipe, a pipe is not the worst thing to have. I consider existence validation to be a part of business logic, and as such it should be handled in the service, but that's just me. I use pipes for data validation and transformation, meaning I check the shape of the data there and pass it on to the service if the shape looks correct.
As for interceptors, I like to use those for logging, caching, and response mapping, though I've heard of others using interceptors for overall validators instead of using multiple pipes.
As the question is mostly an opinionated one, I'll leave the final decision up to you. In short, guards are great for short circuiting requests with a failure, interceptors are good for logging, caching, and response mapping, and pipes are for data validation and transformation.

Is it a good idea to rely on a given aggregate's history with Event Sourcing?

I'm currently dealing with a situation in which I need to make a decision based on whether it's the first time my aggregate got into a situation (an Order was bought).
I can solve this problem in two ways:
1. Introduce in my aggregate a field stating whether an order has ever been bought (or maybe the number of bought orders);
2. Look through the aggregate's history for any OrderWasBought event.
Is option 2 ever acceptable? For some reason I think option 1 is safer/cleaner in the general case, but I lack experience in these matters.
Thanks
IMHO both effectively do the same thing: the field stating that an order was bought needs to be hydrated somehow, and that happens as part of the replay -- which simply means that when an OrderWasBought event is encountered during replay, the field gets set.
So it does not make any difference whether you look at the field or look for the existence of the event -- at least not with respect to the effective result.
In terms of efficiency, it may be better to use a field, since the field gets hydrated as part of the replay, which needs to run anyway. That way you don't have to search the list of events again; you can simply look at the (cached) value in the field.
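A rough illustration of the field-based approach in plain Java (the aggregate and event names are made up for the example):

```java
import java.util.List;

// Minimal aggregate that hydrates a flag while its history is replayed.
public class Order {

    private boolean everBought = false;

    // Replay: apply each past event in order to rebuild the current state.
    public static Order fromHistory(List<Object> history) {
        Order order = new Order();
        history.forEach(order::apply);
        return order;
    }

    private void apply(Object event) {
        if (event instanceof OrderWasBought) {
            everBought = true; // the "field" is set as a side effect of the replay
        }
        // ... other event types update other fields ...
    }

    public boolean wasEverBought() {
        return everBought;
    }

    public static class OrderWasBought { }
}
```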
So, in the end, to cut a long story short: It doesn't matter. Use what feels better to you. If the history of an aggregate gets lengthy, you may be better off using the field approach in terms of performance.
PS: Of course, this depends on the implementation of how aggregates are being loaded – is the aggregate able to access its own event history at all? If not, setting a field while the aggregate is being replayed is your only option, anyway. Please note that the aggregate does not (and should not!) have access to the underlying repository, so it can not load its history on its own.
Option 2 is valid as long as the use case doesn't need the previous state of the aggregate. Replaying events only restores a read-only state; if the current command doesn't care about that state, searching for a certain event may be a valid, simple solution.
If you fear "breaking encapsulation", that concern may not apply here. Event sourcing and aggregates are concepts; they don't impose a particular OO approach. The event store contains the business state expressed as a stream of events, and you can read it and use it as an immutable collection at any time. I would replay events only if I needed some complex state restored. But in your case, the simpler 'has event' solution encapsulated as a service should work very well.
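For instance, that check can be as small as this (hypothetical event and service names):

```java
import java.util.List;

// Reads the stored history directly instead of rehydrating the aggregate.
public class OrderPurchaseCheck {

    public boolean hasEverBeenBought(List<Object> eventStream) {
        return eventStream.stream()
                          .anyMatch(event -> event instanceof OrderWasBought);
    }

    public static class OrderWasBought { }
}
```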
That being said, there's nothing wrong with always replaying events to restore state and having that field. It's mostly a matter of style: either keep a consistent way of doing things everywhere, or adapt and go for the simplest solution in each given case.

How do you deal with legacy data integrity issues when rewriting software?

I am working on a project which is a rewrite of an existing legacy software. The legacy software primarily consists of CRUD operations (create, read, update, delete) on an SQL database.
Despite the CRUD-based style of coding, the legacy software is extremely complex. This software complexity is not only the result of the complexity of the problem domain itself, but also the result of poor (and regularly bordering on insane) design decisions. This poor coding has led to the data in the database lacking integrity. These integrity issues are not solely in terms of relationships (foreign keys), but also in terms of the integrity within a single row. E.g., the meaning of column "x" outright contradicts the meaning of column "y". (Before you ask, the answer is "yes", I have analysed the problem domain and correctly understand the meaning and purpose of these columns -- better, it seems, than the original software developers did.)
When writing the replacement software, I have used principles from Domain Driven Design and Command Query Responsibility Segregation, primarily due to the complexity of the domain. E.g., I've designed aggregate roots to enforce invariants in the write model, command handlers to perform "cross-aggregate" consistency checks, query handlers to query intentionally denormalised data in a manner appropriate for various screens, etc.
The replacement software works very well when entering new data, in terms of accuracy and ease of use. In that respect, it is successful. However, because the existing data is full of integrity issues, operations that involve the existing data regularly fail by throwing an exception. This typically occurs because an aggregate can't be read from a repository because the data passed to the constructor violates the aggregate's invariants.
How should I deal with this legacy data that "breaks the rules"? The old software worked fine in this respect, because it performed next to no validation. Because of this lack of validation, it was easy for inexperienced users to enter nonsensical data (and experienced users became very valuable because they had years of understanding its "idiosyncrasies").
The data itself is very important, so it cannot be discarded. What can I do? I've tried sorting out the integrity issues as I go, and this has worked in some cases, but in others it is nearly impossible (e.g., data is outright missing from the database because the original developers decided not to save it). The sheer number of data integrity issues is overwhelming.
What can I do?
For a question tagged with DDD, the answer is almost always: talk to your domain expert. How do they want things to work?
I also noticed your question is tagged with CQRS. Are you actually implementing CQRS? In that case it should be almost a non-issue.
Your domain model lives on the command side of your application and always enforces validation, while the read stack provides just dumb viewmodels. This means that on a read your domain model isn't even involved and no validation is applied; it will just show whatever nonsense it can use to populate your viewmodel. On a write, however, validations are triggered, and any write will need to adhere to the full validations of your domain model.
Now back to reality: be VERY sure that the validations you implement are actually the validations required. Take, for example, something as simple as a telephone number (often implemented as 3 digits, dash, 3 digits, dash, 4 digits). But companies have special phone numbers like 1800-CALLME which not only have digits but also letters, and could even be of different lengths (and different countries might also have different rules). If your system needs to handle this, it pretty much means you can't apply any validation on phone numbers.
This is just an example of how something you might think is a real validation can't actually be implemented at all because of that one special case it needs to handle. The rule here, again, is: talk to your domain expert about how they want things handled. But be VERY careful that your validations don't make it nearly impossible for real users to use your system, since that's the fastest way to have your project killed.
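To make that concrete, a naive format check immediately rejects perfectly real numbers (this is just the "3 digits, dash, 3 digits, dash, 4 digits" rule from above):

```java
import java.util.regex.Pattern;

public class PhoneValidationExample {

    // The "obvious" rule: 3 digits, dash, 3 digits, dash, 4 digits.
    private static final Pattern NAIVE_FORMAT = Pattern.compile("\\d{3}-\\d{3}-\\d{4}");

    public static void main(String[] args) {
        System.out.println(NAIVE_FORMAT.matcher("555-123-4567").matches()); // true
        System.out.println(NAIVE_FORMAT.matcher("1800-CALLME").matches());  // false, yet it's a real number
    }
}
```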
Update: in DDD you will also hear the term anti-corruption layer. This layer ensures that incoming data meets the expectations of your domain model. It might be the preferred approach, but if, as you say, you cannot ignore items with garbage data, then it might not solve the problem in your case.
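For what it's worth, a bare-bones anti-corruption layer for this kind of situation might look like the sketch below (all names are hypothetical); the point is that the translation, and the decision about what to do with garbage rows, live in one place instead of leaking into the aggregates:

```java
import java.util.Optional;

// Sits between the legacy tables and the new domain model. All names are hypothetical.
public class LegacyOrderTranslator {

    // Raw shape of the legacy row, exactly as stored (no invariants enforced).
    public static class LegacyOrderRow {
        public String status;      // column "x"
        public String cancelledOn; // column "y", may contradict status
    }

    // Clean domain object with its invariants enforced in the constructor.
    public static class Order {
        private final String status;

        public Order(String status) {
            if (status == null || status.isEmpty()) {
                throw new IllegalArgumentException("status is required");
            }
            this.status = status;
        }

        public String getStatus() {
            return status;
        }
    }

    // Returns the translated aggregate, or empty if the row is beyond repair
    // (the caller decides whether to quarantine it, report it, or fall back).
    public Optional<Order> translate(LegacyOrderRow row) {
        if (row.status == null || row.status.isEmpty()) {
            return Optional.empty();
        }
        // Example repair rule agreed with the domain expert:
        // a cancellation date wins over a contradictory status value.
        String repairedStatus = (row.cancelledOn != null) ? "CANCELLED" : row.status;
        return Optional.of(new Order(repairedStatus));
    }
}
```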

Complex Finds in Domain Driven Design

I'm looking into converting part of a large existing VB6 system into .NET. I'm trying to use domain driven design, but I'm having a hard time getting my head around some things.
One thing that I'm completely stumped on is how I should handle complex find statements. For example, we currently have a screen that displays a list of saved documents, that the user can select and print off, email, edit or delete. I have a SavedDocument object that does the trick for all the actions, but it only has the properties relevant to it, and I need to display the client name that the document is for and their email address if they have one. I also need to show the policy reference that this document may have come from. The Client and Policy are linked to the SavedDocument but are their own aggregate roots, so are not loaded at the same time the SavedDocuments are.
The user is also allowed to specify several filters to reduce the list down. These can be based on properties stored on the SavedDocument or on the Client and Policy.
I'm not sure how to handle this from a Domain driven design point of view.
Do I have a function on a repository that takes the filters and returns me a list of SavedDocuments, which I then have to turn into a different object or DTO and fill with the additional client and policy information? That seems a little slow, as I have to load all the details using multiple calls.
Do I have a function on a repository that takes the filters and returns me a list of SavedDocumentsForList objects that contain just the information I want? This seems the quickest but doesn't feel like I'm using DDD.
Do I load everything from their objects and do all the filtering and column selection in a service? This seems the slowest, but also appears to be very domain orientated.
I'm just really confused about how to handle these situations, and I've not really seen any other people asking questions about it, which makes me feel that I'm missing something.
Queries can be handled in a few ways in DDD. Sometimes you can use the domain entities themselves to serve queries. This approach can become cumbersome in scenarios such as yours when queries require projections of multiple aggregates. In this case, it is easier to use objects explicitly designed for the respective queries - effectively DTOs. These DTOs will be read-only and won't have any behavior. This can be referred to as the read-model pattern.
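A rough sketch of what such a read model could look like (all names are made up; the implementation behind the interface would typically be a single denormalised SQL query or view rather than multiple repository calls):

```java
import java.time.LocalDate;
import java.util.List;

// Read-only projection shaped exactly for the "saved documents" screen.
public class SavedDocumentSummary {
    public long documentId;
    public String documentTitle;
    public String clientName;
    public String clientEmail;     // may be null
    public String policyReference; // may be null
}

// Filters the user can apply on the screen.
class SavedDocumentFilter {
    public String clientNameContains;
    public String policyReference;
    public LocalDate createdAfter;
}

// Query-side service: not a repository for an aggregate, just a finder for the screen.
interface SavedDocumentQueries {
    List<SavedDocumentSummary> find(SavedDocumentFilter filter);
}
```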

Code generation against Sprocs?

I'm trying to understand choices for code generation tools/ORM tools and discover what solution will best meet the requirements that I have and the limitations present.
I'm creating a foundational solution to be used for new projects. It consists of ASP.NET MVC 3.0 plus layers for business logic and data access. The data access layer will need to go against Oracle for now, and then switch to SQL Server this year once the DB migration is finished.
From the standpoint of DTOs mapping to custom types in the solution, what ORM/code generation tool will work for creating the code I need when it can ONLY access stored procs in Oracle and SQL Server?
Meaning, I need to generate the custom objects that come back from, and are pushed to, the stored procedures as parameters; I don't need to generate the sprocs themselves, they already exist. I'm looking for the representation of what the sproc needs and gives back to be generated into DTOs. In some cases I can go against views and generate DTOs, and I'm assuming most tools already do this. But for 90% of the time, I don't have direct access to any tables or views, only stored procs.
Does this make sense?
ORMs are best at mapping objects to tables (and/or views), not mapping objects to sprocs.
Very few tools can do automated code generation against whatever output a sproc may generate, depending on the complexity of the sproc. It's much more straightforward to code generate the input to a sproc, as that is generally well defined and clear.
I would say if you are stuck with sprocs, your options for using third party code to help reduce your development and maintenance time are severely limited.
I believe either LinqToSql or EntityFramework (or both?) are capable of some magic with regard to SQL Server to try to figure out, mostly automatically, what a sproc may be returning. I don't think it works all the time; it's just sophisticated guesswork, and I seriously doubt it would work with Oracle. I am not aware of anything else software-wise that even attempts to figure out what a sproc may return.
A sproc can return multiple diverse record sets that can be built dynamically by the sproc depending on the input and data in the database. A technical solution to automatically anticipating sproc output seems like it would require the following:
A static set of underlying data in the database
The ability to pass all possible inputs to the sproc and execute the sproc without any negative impact or side effects
That would give you a static set of possible outputs for any given valid input. A small change in the data in the database could invalidate everything.
If I recall correctly, the magic Microsoft did was something like calling the sproc passing NULL for all input parameters and assuming the output is always exactly the first recordset that comes back from the database. That is clearly an incomplete solution to the problem, but in simple cases it appears to be magic because it can work very well some of the time.
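To make the shape of the problem concrete (sketched here in plain JDBC since the constraint is the same on any platform; all proc, column, and DTO names are invented, and an Oracle proc returning a ref cursor would need a slightly different call): the input side is declared in the proc's signature and is easy to generate against, while the output side is whatever the proc happens to SELECT, so the result-set-to-DTO mapping usually ends up hand-written.

```java
import java.sql.CallableStatement;
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.List;

public class GetCustomerOrdersCall {

    // DTO shaped after what the sproc is *known* to return today.
    public static class OrderDto {
        public long orderId;
        public String status;
    }

    public List<OrderDto> call(Connection connection, long customerId) throws SQLException {
        // Input parameters: declared in the proc's signature, easy to generate code for.
        try (CallableStatement stmt = connection.prepareCall("{call get_customer_orders(?)}")) {
            stmt.setLong(1, customerId);
            List<OrderDto> results = new ArrayList<>();
            // Output: only the first result set, and only the columns we assume exist.
            try (ResultSet rs = stmt.executeQuery()) {
                while (rs.next()) {
                    OrderDto dto = new OrderDto();
                    dto.orderId = rs.getLong("order_id");
                    dto.status = rs.getString("status");
                    results.add(dto);
                }
            }
            return results;
        }
    }
}
```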
