Managing cross-references between objects

Often I have a unit object and a node object. Units are required to know which node they are at, and nodes have to know which units they are holding. This means that they each hold a reference to the other. The problem is that I can see this getting messy, since moving a unit to a new node also means updating the references on the nodes. Is there an easier way to model/organize this relationship? This specific example is in Lua, but the question applies to any language.

It may be easiest to manage one side using a caching protocol. If the information changes, invalidate the cache. Then, when the side needs the information again, it obtains it.
In your example, perhaps the node has an authoritative list of units, while each unit only caches which node it is on. Perhaps these units are stopped and started, and the unit can simply look up which node it is attached to when it starts?
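A minimal sketch of that arrangement (in Java rather than Lua here; the Node and Unit classes and their methods are just illustrative names, not anything from your code) could look like this:

import java.util.HashSet;
import java.util.Set;

// The Node owns the authoritative membership list.
class Node {
    private final Set<Unit> units = new HashSet<>();

    void add(Unit unit) {
        units.add(unit);
        unit.invalidateNodeCache(); // force the unit to re-resolve its node lazily
    }

    void remove(Unit unit) {
        units.remove(unit);
        unit.invalidateNodeCache();
    }

    boolean contains(Unit unit) {
        return units.contains(unit);
    }
}

class Unit {
    private Node cachedNode; // cache only; the Node side is the source of truth

    void invalidateNodeCache() {
        cachedNode = null;
    }

    Node getNode(Iterable<Node> allNodes) {
        if (cachedNode != null && cachedNode.contains(this)) {
            return cachedNode; // cache still valid
        }
        cachedNode = null;
        for (Node node : allNodes) { // cache miss: re-resolve from the authoritative side
            if (node.contains(this)) {
                cachedNode = node;
                break;
            }
        }
        return cachedNode;
    }
}

Moving a unit then only means updating the two node lists; the unit's cached reference repairs itself the next time it is read.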

Related

Looking for a way of inspecting Drools interactions with facts

I am inserting POJO-like objects/facts into a KieSession and have rules that interact with those objects. After all rules are fired, I would like to be able to inspect which objects and which of their methods were accessed by the rules. In concept this is similar to mocking.
I attempted to use Mockito and inserted mocked objects into the KieSession. I'm able to get a list of methods that were called, but not all of the interactions seem to show up.
I'm not sure if this is a limitation of Mockito or something about how Drools manages facts and their lifecycle that breaks mocking.
Perhaps there is a better way of accomplishing this?
Update: Reasoning - we have an application executing various rule sets. The application makes all of the data available, but each rule set needs only a subset of it. There are some monitoring needs where we want to see exactly what data was accessed (getters called on fact objects) by a given rule set.
This part of your question indicates that you (or more likely your management) fundamentally misunderstand the basic Drools lifecycle:
Reasoning - we have an application executing various rule sets. The application makes all of the data available, but each rule set needs only a subset of it. There are some monitoring needs where we want to see exactly what data was accessed (getters called on fact objects) by a given rule set.
The following is a very simplified explanation of how this works. A more detailed explanation would exceed the character limit of the answer field on Stack Overflow.
When you call fireAllRules in Drools, the rule engine enters a phase called "matching", which is when it decides which rules are actually going to be run. First, the rule engine orders the rules according to your particular rule set (via salience, natural order, etc.), and then it iterates across this list of rules and executes the left-hand side (LHS; AKA the "when clause" or "conditions") only, to determine whether the rule is valid. On the LHS of each rule, each statement is executed sequentially until any single statement does not evaluate to true.
After Drools has inspected all of the rules' LHS, it moves into the "execution" phase. At this point, all of the rules which it decided were valid to fire are executed. Using the same ordering from the matching phase, it iterates across each rule and executes the statements on the right-hand side of the rule.
Things get even more complicated when you consider that Drools supports inheritance: Rule B can extend Rule A, so the statements in Rule A's LHS are going to be executed twice during the matching phase -- once when the engine evaluates whether it can fire Rule B, and once when it evaluates whether it can fire Rule A.
This is further complicated by the fact that you can re-enter the matching phase from the execution phase through the use of specific keywords in the right-hand side (RHS; AKA the "then clause" or "consequences") of the rules -- specifically by calling update, insert, modify, retract/delete, etc. Some of these keywords, like insert, will cause a subset of the rules to be reevaluated, while others, like update, will cause all of the rules to be reevaluated during that second matching phase.
I've focused on the LHS in this discussion because your statement said:
There are some monitoring needs where we want to see exactly what data was accessed (getters called on fact objects) ....
The majority of your getters should be on your LHS unless you've got some really non-standard rules. This is where you should be getting your data out and doing comparisons/checks/making decisions about whether your rule should be fired.
Hopefully it now makes sense why the request to know which "get" calls are triggered doesn't really work -- because in the matching phase we're going to trigger a whole lot of "get" calls and then ignore the result, because some other part of the LHS doesn't evaluate to true.
I did consider that potentially we're having a communication problem here, and the actual need is to know what data is actually being used for execution (RHS). In this case, you should look at using listeners, as I suggested in the comments. Write a listener that hooks into the Drools lifecycle, specifically into the execution phase (an AgendaEventListener's afterMatchFired). At that point you know that the rule matched and was actually executed, so you can log or record the rule name and details. Since you know the exact data needed and used by each rule, this lets you track the data you actually use.
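As a rough illustration (the logging is just a placeholder for whatever recording mechanism you actually need), such a listener might look like this:

import org.kie.api.event.rule.AfterMatchFiredEvent;
import org.kie.api.event.rule.DefaultAgendaEventListener;

// Sketch of an AgendaEventListener that records which rule fired and which
// facts were bound to that match when it fired.
public class FiredRuleAuditListener extends DefaultAgendaEventListener {

    @Override
    public void afterMatchFired(AfterMatchFiredEvent event) {
        String ruleName = event.getMatch().getRule().getName();
        // The facts that matched this rule's LHS and are available to its RHS.
        for (Object fact : event.getMatch().getObjects()) {
            System.out.println(ruleName + " used fact: " + fact);
        }
    }
}

// Usage, assuming an existing KieSession named kieSession:
//   kieSession.addEventListener(new FiredRuleAuditListener());
//   kieSession.fireAllRules();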
This all being said, I found this part concerning based on my previous experience:
The application makes all of the data available, but each rule set needs only a subset of it.
The company I worked for followed this approach -- we made all data available to all the rules by adding everything into the working memory. The idea was that if all the data was available, then we would be able to write rules without changing the supporting code because any data that you might need in the future was already available in working memory.
This turned out to be OK when we had small data, but as the company and product grew, this data also grew, and our rules started requiring massive amounts of memory to support working memory (especially as our call volume increased, since we needed that larger heap allocation per rules request). It was exacerbated by the fact that we were passing extremely unperformant objects into working memory -- namely HashMaps and objects which extended HashMap.
Given that, you should strongly consider rethinking your strategy. Once we trimmed the data that we were passing into the rules, both decreasing its volume and changing the structures to performant POJOs, we not only saw a great decrease in resource usage (heap, primarily) but also a performance improvement in terms of greater rule throughput, since the rule engine no longer needed to keep handling and evaluating those massive and inefficient volumes of data in working memory.
And, finally, in terms of the question you had about mocking objects in working memory -- I would strongly caution against attempting this. Mocking libraries really shouldn't be used in production code. Most mocking works by leveraging combinations of reflection and bytecode manipulation. Data in working memory isn't guaranteed to remain in the initial state that it was passed in -- it gets serialized and deserialized at different points in the process, so depending on how your specific mocking library is implemented, you'll likely lose "access" to the specific mocked instance, and your rules will instead be working against a functionally equivalent copy produced by that serialization/deserialization.
Though I've never tried it in this situation, you may be able to use aspects if you really want to instrument your getter methods. There's a non-negligible chance you'll run into the same issue there, though.
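If you do experiment with that, a sketch in the AspectJ annotation style might look like the following; the com.example.facts package is a placeholder for wherever your fact classes live, and you would still need to configure compile-time or load-time weaving for the advice to take effect:

import org.aspectj.lang.JoinPoint;
import org.aspectj.lang.annotation.Aspect;
import org.aspectj.lang.annotation.Before;

// Logs every getter call on classes under the (hypothetical) com.example.facts
// package, regardless of whether the rule that called it ultimately fires.
@Aspect
public class FactGetterAuditAspect {

    @Before("execution(* com.example.facts..*.get*(..))")
    public void logGetterAccess(JoinPoint joinPoint) {
        System.out.println("Getter accessed: " + joinPoint.getSignature().toShortString());
    }
}

Note that this has the same caveat as the matching-phase discussion above: it records every getter invocation, including the many whose results are evaluated and then discarded.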

How to use UUIDs in Neo4j, to keep pointers to nodes elsewhere?

I figured out, thanks to some other questions, that Neo4j uses internal ids for its nodes that can get recycled when nodes are deleted.
That's a real concern for me, as I need to store a reference to my node in another database (relational, this time) in order to keep some sort of "pinned" nodes.
I've tried using https://github.com/graphaware/neo4j-uuid to generate them automatically, but I did not succeed: all my queries kept running indefinitely.
My new idea is to add a new field to each of my nodes that I would manually fill with a UUID generated by the Node.js package uuid through uuid.v4().
I also came across the concept of indexing several times, which is still not totally clear to me, but it seems that I should run this query:
CREATE INDEX ON :MyNodeLabel(myUUIDField)
If you think that it doesn't make sense at all don't hesitate to come up with another proposition. I am open to all kinds of suggestions.
Thanks for your help.
I would consider using the APOC library's apoc.uuid.install procedure.
Definitely create a unique constraint on the label and attribute you are going to use. This will not only create an index but also guarantee uniqueness of the attribute in the label namespace.
CREATE CONSTRAINT ON (mynode:MyNodeLabel) ASSERT mynode.myUUIDField IS UNIQUE
Then call the apoc.uuid.install procedure. This will create UUIDs in the attribute myUUIDField on all of the existing MyNodeLabel nodes and on any new ones.
CALL apoc.uuid.install('MyNodeLabel', {addToExistingNodes: true, uuidProperty: 'myUUIDField'}) yield label, installed, properties
NOTE: you will have to install APOC and set apoc.uuid.enabled=true in the neo4j.conf file.

What is the best method for setting up data for ATDD style automation?

I assume that most implementations have a base set of known data that gets spun up fresh each test run. I think there are a few basic schools of thought from here:
1. Have test code use application calls to produce the data.
2. Have test code spin up the data manually via direct datastore calls.
3. Have the base set of data encompass everything you need to run the tests.
I think it's obvious that #3 is the least maintainable approach, but I'm still curious whether anyone has been successful with it. Perhaps you could have databases for various scenarios and drop/add them from test code.
It depends on the type of data and your domain. I had one unsuccessful attempt when the schema wasn't stable yet: we kept running into problems adding data to new and changed columns, which broke the tests all the time.
Now we successfully use starting-state data where the dataset is largely fixed, the schema is stable, and the data is required in the same state for all tests (e.g. a postcode database).
For most other stuff, the tests are responsible for setting up their own data. That works for us!
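As a rough illustration of tests owning their own data (JUnit 5 here; CustomerService and Customer are made-up stand-ins for whatever application calls you would really use):

import static org.junit.jupiter.api.Assertions.assertEquals;

import java.util.HashMap;
import java.util.Map;
import org.junit.jupiter.api.BeforeEach;
import org.junit.jupiter.api.Test;

// Hypothetical stand-ins for the real application, only here so the sketch is self-contained.
class Customer {
    final String email;
    Customer(String email) { this.email = email; }
}

class CustomerService {
    private final Map<Customer, Double> totalSpent = new HashMap<>();
    Customer register(String email) { return new Customer(email); }
    void recordOrder(Customer c, double amount) { totalSpent.merge(c, amount, Double::sum); }
    double discountFor(Customer c) { return totalSpent.getOrDefault(c, 0.0) >= 100.0 ? 0.10 : 0.0; }
}

// Each test builds exactly the starting state it needs through normal
// application calls, instead of depending on a shared pre-loaded dataset.
class CustomerDiscountTest {
    private CustomerService customerService;
    private Customer customer;

    @BeforeEach
    void createTestData() {
        customerService = new CustomerService();
        customer = customerService.register("test-customer@example.com");
        customerService.recordOrder(customer, 120.00);
    }

    @Test
    void loyalCustomerGetsDiscount() {
        assertEquals(0.10, customerService.discountFor(customer), 0.001);
    }
}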

Transaction with Cassandra data model

According to the CAP theorem, Cassandra can only offer eventual consistency. To make things worse, if we have multiple reads and writes during one request without proper handling, we may even lose logical consistency. In other words, if we do things fast, we may do them wrong.
Meanwhile, the best practice for designing a Cassandra data model is to think about the queries we are going to have, and then add a column family (CF) for each. This way, adding/updating one entity often means updating many views/CFs. Without an atomic transaction feature, it's hard to do this right. But with one, we lose the A and P parts again.
I don't see this concerning many people, and I wonder why.
Is it because we can always find a way to design our data model to avoid doing multiple reads and writes in one session?
Is it because we can just ignore the 'right' part?
In real practice, do we always end up with ACID features somewhere in the middle? I mean, maybe implemented in the application layer, or by adding middleware to handle it?
It does concern people, but presumably you are using Cassandra because a single database server is unable to meet your needs due to scaling or reliability concerns. Because of this, you are forced to work around the limitations of a distributed system.
In real practice, do we always end up with ACID features somewhere in the middle? I mean, maybe implemented in the application layer, or by adding middleware to handle it?
No, you don't usually have ACID somewhere else, as presumably that somewhere else would have to be distributed over multiple machines as well. Instead, you design your application around the limitations of a distributed system.
If you are updating multiple column families to satisfy queries, you can look at the eventually atomic section in this presentation for ideas on how to do that. Basically, you write enough information about your update to Cassandra before you do your writes, so that if a write fails, you can retry it later.
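A rough sketch of that idea with the DataStax Java driver (the keyspace and the pending_updates / users_by_id / users_by_email tables are hypothetical, and the job that replays leftover pending rows is assumed to exist elsewhere):

import com.datastax.oss.driver.api.core.CqlSession;
import com.datastax.oss.driver.api.core.cql.SimpleStatement;
import java.util.UUID;

public class EventuallyAtomicUpdate {
    public static void main(String[] args) {
        try (CqlSession session = CqlSession.builder().withKeyspace("app").build()) {
            UUID updateId = UUID.randomUUID();
            UUID userId = UUID.randomUUID();
            String email = "someone@example.com";

            // 1. Record the intended update first, so it can be replayed after a failure.
            session.execute(SimpleStatement.newInstance(
                "INSERT INTO pending_updates (update_id, user_id, email) VALUES (?, ?, ?)",
                updateId, userId, email));

            // 2. Apply the same change to every query table (view) that needs it.
            session.execute(SimpleStatement.newInstance(
                "INSERT INTO users_by_id (user_id, email) VALUES (?, ?)", userId, email));
            session.execute(SimpleStatement.newInstance(
                "INSERT INTO users_by_email (email, user_id) VALUES (?, ?)", email, userId));

            // 3. Only once every view is written, clear the pending record. A background
            //    job re-applies any pending_updates rows that are still present.
            session.execute(SimpleStatement.newInstance(
                "DELETE FROM pending_updates WHERE update_id = ?", updateId));
        }
    }
}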
If you can structure your application in such a way, using a coordination service like ZooKeeper or Cages may be useful.

Drupal: Avoid database when dealing with node type info?

I'm writing a Drupal module that deals with creating new nodes from CSV files. The way I've been doing it currently: the user provides a node type, and my module goes to the database to find the fields for that node type.
After the user matches the node fields to the CSV fields, I want to validate the data. This requires finding out the types of the node fields. I'm not entirely sure how to do that. (Maybe look at the content_node_field table?)
Then, I have to create the nodes. Currently, the module creates a new stdClass object, populates it with the necessary data, and saves it.
But what if I could abstract away from the database entirely and avoid dealing with it? What if I asked the user for an existing node of this type? I could node_load() that node and use it to determine the node's fields. When it comes time to save the new nodes, I could use the "seed" node to figure out what their structure needs to be.
One downside: this requires at least one node of this type to exist before the module can function.
Also, would this be slower than accessing the db directly?
I fear that over time, db names could change, and content types could be defined across multiple tables. By working only from a pre-existing node, I could get around many of these issues. Right?
Surely node_load will be hitting the database anyway? The node fields are stored in the database, so if you need to get them, at some point you have to talk to the database. Given that some page loads in Drupal invoke hundreds (or even thousands!) of database queries, I really wouldn't worry about one or two!
Table names are unlikely to change, and the schema should stay fixed between point versions of Drupal, at least. It would be better practice to use the API to get the data you want if possible, though, as this would give better protection against change. I don't know whether that's possible.

Resources