How do I keep track of references in a distributed object graph? - garbage-collection

I'm developing a distributed system that must maintain a distributed object graph, where objects can point to other objects, some of which may be on remote computers. It must also be possible to move objects from one computer to another.
This raises two related problems:
How do we know when it is safe to garbage-collect objects?
If we move an object, how do we robustly ensure that all references to it are updated?
An initial thought is to maintain an exhaustive list of all inbound references.
Of course, this alone isn't sufficient for garbage collection, as it won't catch cyclic references (the same problem as with reference-counting garbage collection).
However this list can be used to ensure that any remote computer with references to the object can be notified if the object is moved.
Another thought is that if an object is moved, it leaves behind a "breadcrumb" on its original computer. If a computer receives a message intended for an object that is no longer present, the message can be forwarded to the object's new location.
However, it is not scalable to keep such breadcrumbs around indefinitely; how do we know when it's safe to delete them?
Can anyone provide pointers to existing solutions to this problem, particularly those that deal with replication and concurrency issues too?

The best-known solution in Java is implemented as part of the RMI specification.
See Garbage Collection of Remote Objects & Distributed Garbage Collection
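As a rough sketch of what that looks like from application code (GraphNode and GraphNodeImpl are invented for the example): RMI tracks remote references with leases, and an exported object can implement Unreferenced to be told when no remote client holds a reference to it any more; lease duration is tunable via the java.rmi.dgc.leaseValue system property.

import java.rmi.Remote;
import java.rmi.RemoteException;
import java.rmi.server.UnicastRemoteObject;
import java.rmi.server.Unreferenced;

// Hypothetical remote interface, for illustration only.
interface GraphNode extends Remote {
    String describe() throws RemoteException;
}

class GraphNodeImpl implements GraphNode, Unreferenced {
    @Override
    public String describe() throws RemoteException {
        return "a node in the distributed graph";
    }

    // Called by the RMI runtime once no remote client holds a live (leased)
    // reference any more; only local references can keep the object alive now.
    @Override
    public void unreferenced() {
        System.out.println("no remote references remain");
    }
}

class Server {
    public static void main(String[] args) throws RemoteException {
        // Exporting registers the object with the distributed garbage
        // collector; client JVMs renew leases while they hold the stub.
        GraphNode stub = (GraphNode) UnicastRemoteObject.exportObject(new GraphNodeImpl(), 0);
    }
}

Note that this only handles acyclic remote references (plus local collection of cycles within a single JVM); it does not solve the cross-machine cycle problem raised in the question.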

Related

Spring Integration Header Update Not Reflected between PublishSubscribe SubFlows

Maybe it is my lack of knowledge, but I'm seeing behavior where a header value updated in SubFlow1 of a PublishSubscribe channel is NOT reflected in the other SubFlow2, which is executed on the same main thread.
.publishSubscribeChannel(pubSubSpec -> pubSubSpec
        .subscribe(flow1())
        .subscribe(flow2()))
.get();
The Message is immutable by definition. So, in those two sub-flows you really deal with different messages.
Imagine you have two Map objects which contain the same key. Modifying one of them definitely does not affect the other instance.
If you still think that you need to share the same object between sub-flows, then consider using a mutable object as a header. For example, an AtomicReference will do the trick: it can be shared as a header value between the different message instances. Modifying that value in one sub-flow will then be reflected in the other.
In general it is better to think only in terms of immutable objects, since there is no guarantee how your messages are going to travel through a distributed, asynchronous solution.
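A minimal sketch of that idea using the plain spring-messaging API (the header name "shared", the payload, and the handler methods are made up for illustration). The same AtomicReference instance rides along in the header, so a mutation made while one sub-flow processes the message is visible when the other sub-flow reads it:

import java.util.concurrent.atomic.AtomicReference;
import org.springframework.messaging.Message;
import org.springframework.messaging.support.MessageBuilder;

class SharedHeaderSketch {

    static Message<String> buildMessage() {
        // The mutable holder is the header value; the Message itself stays immutable.
        return MessageBuilder.withPayload("some payload")
                .setHeader("shared", new AtomicReference<>("initial"))
                .build();
    }

    @SuppressWarnings("unchecked")
    static void handleInSubFlow1(Message<?> message) {
        AtomicReference<String> shared =
                (AtomicReference<String>) message.getHeaders().get("shared");
        shared.set("updated in sub-flow 1");    // mutate the shared holder
    }

    @SuppressWarnings("unchecked")
    static void handleInSubFlow2(Message<?> message) {
        AtomicReference<String> shared =
                (AtomicReference<String>) message.getHeaders().get("shared");
        System.out.println(shared.get());       // prints "updated in sub-flow 1"
    }
}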

What is the difference between a data transfer object (DTO) and a representation object in the domain-driven design pattern?

I know DTO is returned by the server-side and received by the client-side, but I am confused by the representation object in DDD. I think they are almost the same. Can someone tell me their differences?
Can someone tell me their differences?
They solve different problems in different contexts.
Data transfer is a boundary concern - how do we move information from here to there (across a remote interface)? Among the issues that you may run into: the transfer of information is slow, or expensive. One way of keeping this under control is to move information in a larger grain.
the main reason for using a Data Transfer Object is to batch up what would be multiple remote calls into a single call -- Martin Fowler, Patterns of Enterprise Application Architecture
In other words, a DTO is your program's representation of a fat message.
In DDD, the value object pattern is a modeling concern; it is used to couple immutable representations of information and related computations.
A DTO tends to look like a data structure, with methods that can be used to transform that data structure into a representation (for example: an array of bytes) that can be sent across a boundary.
A value object tends to look like a data structure, with methods that can be used to compute other information that is likely to be interesting in your domain.
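A small, purely illustrative Java sketch of that contrast (the names are invented for the example):

import java.io.Serializable;

// DTO: a bag of data shaped for the wire; stable and easy to serialize.
class CustomerDto implements Serializable {
    public String name;
    public String isoCountryCode;
    public long balanceInCents;
}

// Value object: immutable, and it carries the domain computation.
final class Money {
    private final long amountInCents;
    private final String currency;

    Money(long amountInCents, String currency) {
        this.amountInCents = amountInCents;
        this.currency = currency;
    }

    // Domain behaviour lives with the data; a new value is returned.
    Money add(Money other) {
        if (!currency.equals(other.currency)) {
            throw new IllegalArgumentException("currency mismatch");
        }
        return new Money(amountInCents + other.amountInCents, currency);
    }
}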
DTOs tend to be more stable (or at least backwards compatible) out of necessity -- because producer and consumer are remote from one another, coordinating a change to both requires more effort than a single local change.
Value objects, in contrast, are easier to change because they are a domain model concern. If you want to change the model, that's just one thing to change, and correspondingly easier to coordinate.
(There's kind of a hedge here - for systems that need persistence, we need some way to get the information out of the object and into a representation that can be stored and retrieved. That's not necessarily a value object concern, especially if you are willing to use general-purpose data structures to move information in and out of "the model".)
In the kingdom of nouns, the lines can get blurry - partly because any information that isn't a general-purpose data structure/primitive is "an object", and partly because you can often get away with using the same objects for your internal concerns and your boundary concerns.

How to check that adding a child DDD entity to a parent entity (tree-like structure) doesn't create a cycle in another tree

I'm modelling a domain where entities are packages of business services and can be made of other packages - a tree structure. I have some issues with designing the domain logic to avoid cycles.
Let's say we have an entity PackageA (an object of class "Package") which has children PackageB and PackageC. We also have PackageD with child PackageB.
Now we want to add PackageA to PackageB as its child. We do this by getting PackageB from the PackageRepository and using the PackageB::addChildPackage() method. But before that we need to make sure that adding this package won't create any loops in other trees (in the example it will). The question is how to implement this in DDD.
I thought about moving the adding of packages to a domain service, so I can get all packages that are currently related to the package I want to modify and check them to ensure there will be no cycles.
Do you think it's a good idea?
I thought about moving the adding of packages to a domain service, so I can get all packages that are currently related to the package I want to modify and check them to ensure there will be no cycles.
Do you think it's a good idea?
Yes, you can do it. Adding a PackageServices class that provides validation for Package makes sense given your description of the domain.
An aggregate root should not contain a reference to an instance of another aggregate root. You should indirectly reference the other aggregate using either an Id or a value object containing the Id and some extra interesting information.
Your case appears to be along the lines of a classification structure, and you could even model that separately from your main aggregate. Either way, you would need to get the entire hierarchy that your aggregate belongs to and check that there is no cycle. However, there is no guarantee that you are not creating a cycle concurrently with another user, if what you and the other user are doing together would create one.
More-or-less the same goes for some unique attribute such as an e-mail address. In that case it could be as simple as creating a unique constraint on your data store.
In the case of a cyclic dependency it isn't going to be as simple as creating a constraint and the same goes for a data store that does not support unique constraints.
In such a case you would need to use a bit of a process manager and perform a couple of steps. We would probably want to prevent the obvious cycles by checking for a cycle before creating the item in the hierarchy. After you have committed your unit of work (say, a transaction) you could send a message to validate the new entry. That second step would then check for a cycle again, since someone else may have created an invalid state along with you. If there is a cycle, the last entered item "loses" and is removed, and a notification of sorts is published in order to make that decision known.
Another way would be to attempt to prevent creating cycles using some locking strategy. This would have to have the correct grain depending on your design. If you have various independent hierarchies they could be locked on a root level. The root would be the Package that has no parent. After your commit you would release the lock. If you have a single hierarchy then you could probably still lock that and permit only a single change to the hierarchy at any one time.
Using a domain service for this seems to be the way to do it but that in itself does not quite solve your issue. It is more about what that service is going to do.
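To make that concrete, here is a minimal sketch of such a domain service (the query interface, method name, and identifiers are hypothetical): walk upward from the prospective parent and refuse the operation if the child is already an ancestor. As noted above, a post-commit re-check or a locking strategy is still needed to cover concurrent edits.

import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.Set;

class PackageCycleService {

    // Hypothetical query: the ids of the packages that directly contain
    // the given package (a package may appear in several trees).
    interface PackageGraphQueries {
        Set<String> findParentIds(String packageId);
    }

    private final PackageGraphQueries queries;

    PackageCycleService(PackageGraphQueries queries) {
        this.queries = queries;
    }

    // Adding childId under parentId closes a loop exactly when childId is
    // already an ancestor of parentId, so walk upwards from parentId.
    boolean wouldCreateCycle(String parentId, String childId) {
        Set<String> visited = new HashSet<>();
        Deque<String> toVisit = new ArrayDeque<>();
        toVisit.push(parentId);
        while (!toVisit.isEmpty()) {
            String current = toVisit.pop();
            if (current.equals(childId)) {
                return true;
            }
            if (visited.add(current)) {
                toVisit.addAll(queries.findParentIds(current));
            }
        }
        return false;
    }
}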

Managing complex state in FP

I want to write a simulation of a multi-entity system. I believe such systems motivated the creation of Simula and OOP, where each object would maintain its own state and the runtime would manage the entire system (e.g. stop threads, serialize data).
On the other hand, I would like to have the ability to rewind, change the simulation parameters, and compare the results. Thus, immutability sounds great (at least up to the almost certain garbage-collection issues caused by keeping track of possibly redundant data).
However I don't know how to model this. Does this mean that I must put every interacting entity into a single, huge structure where each object update would require locating it first?
I'm worried that such approach would affect performance badly because of GC overhead and constant structure traversals as opposed to keeping one fixed address of entity in memory.
UPDATE
To clarify, this question asks whether there is any other design option available besides creating a single structure that contains all possibly interacting entities as a root. Intuitively, such a structure would imply a logarithmic per-update penalty unless updates are "clustered" somehow to amortize the cost.
Is there a known system where interactions could be modelled differently? For example, like in cold/hot data storage optimization?
After some research, there seems to be a connection with N-body simulation, where systems can be clustered, but I'm not familiar with it yet. Even so, would that also mean I need to have a single structure of clusters?
While I agree with the people commenting that this is a vague question, I'll still try to address some of the issues put forth.
It's true that there's some performance overhead from immutability, because when you use mutable state, you can update some values in-place, whereas with immutable state, some copying has to take place.
It is, however, a common misconception that this causes problems with big 'object' graphs. It doesn't have to.
Consider a Haskell data structure:
data BigDataStructure = BigDataStructure {
      bigChild1 :: AnotherBigDataStructure
    , bigChild2 :: YetAnotherBigDataStructure
    -- more elements go here...
    , bigChildN :: Whatever }
    deriving (Show, Eq)
Imagine that each of these child elements are big and complex themselves. If you want to change, say, bigChild2, you could write something like:
updatedValue = myValue { bigChild2 = updatedChild }
When you do that, some data copying takes place, but it's often less than most people think. This expression does create a new BigDataStructure record, but it doesn't 'deep copy' any of its values. It just reuses bigChild1, updatedChild, bigChildN, and all the other values, because they're immutable.
In theory (but we'll get back to that in a minute), the flatter your data structures are, the more data sharing is enabled. If, on the other hand, you have some deeply nested data structures and you need to update the leaves, you'll need to create a copy of the immediate parents of those leaves, plus the parents of those parents, and their parents as well, all the way to the root. That might be expensive.
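The same "copy the path, share the rest" effect can be illustrated outside Haskell; a rough Java sketch with made-up immutable types (Java 16+ records):

// Hypothetical immutable types, for illustration only.
record Leaf(int value) {}
record Branch(Leaf left, Leaf right) {}
record Root(Branch a, Branch b) {}

class SharingSketch {
    // Only the records on the path root -> a -> left are rebuilt;
    // root.a().right() and the entire root.b() branch are reused as-is.
    static Root updateLeftLeafOfA(Root root, int newValue) {
        Branch newA = new Branch(new Leaf(newValue), root.a().right());
        return new Root(newA, root.b());
    }
}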
That's the theory, though, and we've known for decades that it's impractical to try to predict how software will perform. Instead, try to measure it.
While the OP suggests that significant data is involved, it doesn't state how much, nor does it state the hardware specs of the system that's going to run the simulation. So, as Eric Lippert explains so well, the person who can best answer questions about performance is you.
P.S. It's my experience that when I start to encounter performance problems, I need to get creative with how I design my system. Efficient data structures can address many performance issues. This is just as much the case in OOP as it is in FP.

Showing data on the UI in the Hexagonal architecture

I'm learning DDD and Hexagonal architecture, I think I got the basics. However, there's one thing I'm not sure how to solve: how am I showing data to the user?
So, for example, I got a simple domain with a Worker entity with some functionality (some methods cause the entity to change) and a WorkerRepository so I can persist Workers. I got an application layer with some commands and command bus to manipulate the domain (like creating Workers and updating their work hours, persisting the changes), and an infrastructure layer which has the implementation of the WorkerRepository and a GUI application.
In this application I want to show all workers with some of their data, and be able to modify them. How do I show the data?
I could give the GUI a reference to the implementation of WorkerRepository.
I think it's not a good solution because this way I could insert new Workers in the repository skipping the command bus. I want all changes going through the command bus.
Okay then, I'd split the WorkerRepository into WorkerQueryRepository and WorkerCommandRepository (as per CQRS), and give the GUI a reference only to the WorkerQueryRepository. It's still not a good solution, because the repo gives back Worker entities which have methods that change them, and then how will those changes be persisted?
Should I create two types of repositories? One would be used in the domain and application layers, and the other would be used only for providing data to the outside world. The second one wouldn't return full-fledged Worker entities, only WorkerDTOs containing just the data the GUI needs. This way, the GUI has no way to change Workers other than through the command bus.
Is the third approach the right way? Or am I wrong to force all changes to go through the command bus?
Should I create two types of repositories? One would be used in the domain and application layers, and the other would be used only for providing data to the outside world. The second one wouldn't return full-fledged Worker entities, only WorkerDTOs containing just the data the GUI needs.
That's the CQRS approach; it works pretty well.
Greg Young (2010)
CQRS is simply the creation of two objects where there was previously only one. The separation occurs based upon whether the methods are a command or a query (the same definition that is used by Meyer in Command and Query Separation, a command is any method that mutates state and a query is any method that returns a value).
The current term for the WorkerDTO you propose is "Projection". You'll often have more than one; that is to say, you can have a separate projection for each view of a worker in the GUI. (That has the neat side effect of making the view easier -- it doesn't need to think about the data that it is given, because the data is already formatted usefully).
Another way of thinking of this, is that you have a "write-only" representation (the aggregate) and "read-only" representations (the projections). In both cases, you are reading the current state from the book of record (via the repository), and then using that state to construct the representation you need.
As the read models don't need to be saved, you are probably better off thinking factory, rather than repository, on the read side. (In 2009, Greg Young used "provider", for this same reason.)
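As an illustrative sketch (the names WorkerProjection and WorkerProjectionFactory are invented for the example), the read side can be as thin as a factory that hands the GUI projections, while the full Worker aggregate stays behind the command bus:

import java.util.List;

// Read-side representation: exactly the data one view needs, with no
// behaviour that can mutate a Worker.
class WorkerProjection {
    public final String workerId;
    public final String name;
    public final int hoursThisWeek;

    WorkerProjection(String workerId, String name, int hoursThisWeek) {
        this.workerId = workerId;
        this.name = name;
        this.hoursThisWeek = hoursThisWeek;
    }
}

// Read-side port the GUI depends on; it never sees WorkerRepository.
interface WorkerProjectionFactory {
    List<WorkerProjection> allWorkers();
    WorkerProjection workerById(String workerId);
}

// Writes still go: GUI -> command bus -> application service
//                      -> WorkerRepository -> Worker aggregate.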
Once you've taken the first step of separating the two objects, you can start to address their different use cases independently.
For instance, if you need to scale out read performance, you have the option to replicate the book of record to a bunch of slave copies, and have your projection factory load from the slaves, instead of the master. Or to start exploring whether a different persistence store (key value store, graph database, full text indexer) is more appropriate. Udi Dahan reviews a number of these ideas in CQRS - but different (2015).
"read models don't need to be saved" Is not correct.
It is correct; but it isn't perhaps as clear and specific as it could be.
We don't need to create a durable representation of a read model, because all of the information that describes the variance between instances of the read model has already been captured by our writes.
We will often want to cache the read model (or a representation of it), so that we can amortize the work of creating it across many queries. And various trade-offs may indicate that the cached representations should be stored durably.
But if a meteor comes along and destroys our cache of read models, we lose a work investment, but we don't lose information.

Resources