I am exploring GPars(Groovy Parallel System) library these days and I came across DataFlow variables. The documentation says Dataflow variables can only be assigned once. However I could not find a reason for this restriction.
Can somebody tell me why Dataflow variables in GPars be assigned only once?
In dataflow programming variables are write-once and this is how they synchronize the code that uses (consumes) their value - it is this flow of data that guides execution.
A nice book about programming in general and dataflow concurrency in particular is Concepts, Techniques, and Models of Computer Programming, where you can learn more about the way dataflow synchronization variables are meant to be used (using the Oz language).
On the other hand, if you could assign more than one values to a variable, then each time the value was written, its consumers should be notified - this sounds more like message-passing.
Since a DataflowVariable can never change it's value, you cannot have any race condition on it. You can even do non-blocking writes and thus be very efficient and deadlock-free by design. After the initial assignment, you can even do non-blocking reads.
Related
I have a system which may generate certain events in the lifecycle of a transaction. On every even I need to update a row in a DB and also send out a UI event over websocket.
One option I have is to implement the event processing (DB and UI) in actors thus avoiding any locking issues - also I can afford minor delays so handling this sequentially will greatly simplify matters.
What are alternative ways of handling this in Scala as I feel maybe Actors might be overkill in this case?
There are those blogs stating that actors should be used for "concurrency with state" - though I would like to see a more appropriate mechanism in order to eliminate this option.
Ultimately the main unique benefit for using actors is that they are great for encapsulating mutable variables so that you avoid race conditions.
To do what you describe, you could just use classic threads. In your (possibly simplified) description, I don't see the potential for deadlocks. If you want to something a bit more composable, e.g. a sequence of asynchronous tasks, you can use Scala's Futures.
Not at all sure if this is applicable for Scala, but there's a great lib for Groovy and Java with several concurrency models. I've myself used the Dataflow Concurrency with great success and can recommend it as a light-weight yet manageable model.
Dataflow Concurrency offers an alternative concurrency model, which is
inherently safe and robust. It puts an emphasis on the data and their
flow though your processes instead of the actual processes that
manipulate the data. Dataflow algorithms relieve developers from
dealing with live-locks, race-conditions and make dead-locks
deterministic and thus 100% reproducible. If you don’t get dead-locks
in tests you won’t get them in production.
There are other models available in the linked GPars library as well.
I wouldn't suggest making the threading yourself, unless you have no other choices.
Addendum
After posting I got interested in the topic and made a few searches. Seems like Akka has direct support for Dataflow model also. Or at least has had in some version.
Actors avoid locking issues because they use queues to interact. You can use threads with (blocking) queues and get the same level of safety. The only advantage of Actors over Threads is that an Actor does not spend memory for call stack, thus we can have many more actors than threads in the same amount of core memory. The downside of actor model is that a complex algorithm which could be implemented in single thread require several actors and so actor implementation may look obscured.
I was just introduced to concurrent programming, and I learned that locks are one of the simplest synchronization primitive and people almost always get it wrong.
My question is: why aren't they rather using something like promises(dataflow variables), csp, actors, etc all the time? Wouldn't that save us from bugs and deadlocks? Cannot all locks be replaced by that?
Other problems are livelocks, race conditions and lack of parallelism. All concurrency models suffer from a few of these to different degrees.
Locks are just one tool in the toolbox. They have their sweet spot, so do have the other models.
The more you think about concurrency models the more you recognize how similar they really are. For example, actors plus message passing are like one lock per actor and the additional rule that you can only take one lock at a time. Clearly, that guarantees freedom of deadlocks and per-actor data races but it does not provide global correctness.
I'm looking for insight/ papers/ articles, etc. whether a fully declarative Domain Model (as per DDD) is possible.
For example:
Validation can be declarative (lot's of ORMs do this)
business flow logic can be declarative: having a DSL for defining a workflow / Finite State Machine / process manager / DDD Saga (whatever you want to call it) on Crud-operations, through ddd-repositories most likely
decision logic can be declarative. I.e: most of the time this boils down to simple conditionals
derived / calculated fields could be done declaratively but is as bit tricky, especially when this cascades. I.e: you'd have to keep a dependency graph on the calculated fields, etc. Still it can be done.
Any links to people having actually tried that, or some convincing couter-arguments why this can't be done?
p.s.: Please don't answer with "Yes it can be done, since a FSM is Turing-complete with enough memory bla bla"
"Everything is a nail if you hold a hammer"
Instead of asking if it is possible - ask:
What are the reasons I want to do this particular thing declaratively?
Data validation is a nice thing to do declaratively. You have a DTO, you add some attibutes, everything is clear and readable.
Business flow done declaratively... Reminds me of a great failure of Windows Workflow Foundation. Is anyone actually using it?
What is the benefit of making a behaviour-centric components in a declarative way? Imperative way seems to be well-fitted to this.
Decision logic... maybe. But again - why?
"derived / calculated fields could be done declaratively but is as bit tricky"
Why do you want to do tricky things when there is a way of achieving the same result with straightforward things?
So why?
Do you need to change the behaviour of the app in runtime by editing some config file? OK, go for it.
Do you have a generic domain that will be used by many clients and you need some reconfiguration to fit them? Ok, but you need to clearly distinguish the unchangeable core of your domain and the varying stuff. Don't try to make all the things configurable - you'll end up with Inner Platform syndrome.
Do you believe you can make a declarative language your client could use to change his domain without a need for a programmer? No. You will fail. I know a language which was supposed to be like this. An easy, declarative language that ordinary accountants would use to explore their data. It's called SQL :D
I fully concur with Bartłomiej Szypelow's remark. Regardless, I will try to answer your question.
A declarative language always depends on an imperative language for it to work. Take something as simple as arithmetic for example. When we ask for the outcome of 1+1 we first need to know how the 1 should be understood and how in that context the + operator can be calculated before the statement as a whole can be interpreted. This is one of the ways we are able to overcome complexity; you don't need to know how a plus operation is done to be able to use one.
From http://en.wikipedia.org/wiki/Imperative_programming:
Procedural programming could be considered a step towards declarative
programming. A programmer can often tell, simply by looking at the
names, arguments and return types of procedures (and related
comments), what a particular procedure is supposed to do, without
necessarily looking at the details of how it achieves its result. At
the same time, a complete program is still imperative since it 'fixes'
the statements to be executed and their order of execution to a large
extent.
In any domain the same applies. If you ask the flight attendant to reschedule your flight she knows what a flight and a passenger is and how to reschedule one. Creating a fully declarative flight scheduling domain model without describing how scheduling works is therefore impossible. Since all domain models contain operations/behavior it is therefore equally impossible to create a domain model in a declarative language unless your problem domain is not unique, for example when you have three flight companies that operate in similar fashion.
Back to why... you said:
Actually, the hypothetical reason why is actually to allow non-coders
to create a (trivial) backend by configuration
Declarative programming is not always simpler than imperative programming. Most of the time people tend to think in outcomes, but sometimes the outcomes are so varied that it is (much) simpler to think in steps. This is the reason why these different styles exist and continue to exist in the first place. The main difference between a coder and a non-coder is not that a non-coder tends to think in terms of outcomes instead of processes, but that the non-coder simply does not want to be bothered with the workings of a computer. The non-coder wants to focus on the problem. Depending on the domain a non-coder may be helped best with an imperative DSL, instead of an declarative DSL.
I need to know this differences in order to undestand how to use them right.
Which are the differences of DFD and Activity diagram?
Actually, it's reasonably logical. You only have to look at the names.
In data flow diagrams, the lines between "boxes" represent data that flows between components of a system. Because these only show the flow of data, they do not give an indication of sequencing.
In activity diagrams, those lines are simply transitions between activities and do not represent data flow at all. They more represent the sequencing of activities and decisions. You can tell from these what order things happen in.
That's a simplistic explanation but should be a good starting point. Further information can be garnered from Wikipedia for DFDs and activity diagrams.
Explicit bias: I'm a DFDs proponent.
#John is correct that Activity diagrams can be used to represent object flow. #pax equally correct they seldom are.
Two big DFD advantages for me:
Link to object model. Data stores on a DFD provide a really nice way to link the data produced / consumed to the object model. Very useful for consistency and ensuring your thinking is joined up.
They de-emphasise control flow. Far too often designs over-constrain sequencing. Activity diagrams do support concurrency - but it requires the user to (a) remember and (b) use it. So the default is over-serialisation. DFDs don't. They lay bare the real sequence dependencies without any extra effort on the part of the user. Consequently they also make it easier to see causal relationships. If processes a and b both require data input D then it's obvious on the diagram. And hence parallel activity is obvious.
Don't get me wrong - I'm not against Activity diags. Where control flow is the primary consideration I'll use an AD over a DFD. But empirically I'd say I find DFDs a more useful tool in ~70-80% of cases.
Of course, YMMV.
Just a humble opinion from someone who has had to explain processes, both computer and manual, to upper management and CIOs. I have found simple is better and pound-for-pound, DFDs get the message across when I am actually "asked" about details. That being said, the better approach is to always practice the story line and answer in simple answers.
One final comment regarding the age of tools and products. Remember that in most cases these are running the business and work pretty darn good. The adage "you break it (or replace it) and you own it" can make you a hero or make you into a clown.
We have a CIO who wants to replace all mainframe application for the simple reason that they are old technology. One must weigh the consequences and understand if the replacement can handle workload. Have you ever wondered why JPMC, Credit Swiss, Walmart, and Bank of America to name a few still run mainframes?
My apologies for taking it in that direction. Just make sure, whatever analysis tool is used, that all aspects of the replacement are documented including workload, I/O, concurrent users, adoption curve, and scalability.
Data Flow represents flow within one module or one independent code. However Sequence Diagram represents sequence of activity in between different modules.
Yes at some points they may pass the same messages.
I basically use Sequence diagram in interface documents which will be shared with other modules/elements, however DFD will be used in Low level design documents which will be used to develop the code within one module or network element.
If we look closer to a dataflow diagram, we can notice that when a node collects data from all its edges, it starts to process them. And to process, it needs an activity token, which represents access to a processor. Usually the process of obtaining that token is omitted, but it has to exist. Typically, the whole node as a token is put in a double-ended queue, where on the other end of that queue free activities (processors or threads) are stored. A thread pool is a perfect example of such a queue, and the nodes ready to work are represented as tasks. As soon as a node meets activity, they both are taken from the queue and actual processing starts. When processing finishes, activity is returned to the queue. This way we can think of activity as a special kind of token.
So both dataflow and activity diagrams are just simplified variants of a general active-data-flow diagram, with either activities or data tokens omitted. But generally, both kinds of tokens can be represented in a diagram simultaneously.
Programmers used to think about threads as activities, but if we look at them closer, we notice that when a thread is ready to execute, it gets in a line to processors, and real execution starts only when a free processor switch to that thread. This is absolute analogy as tasks are executed on a thread pool. So from simplified point of view, a thread is an activity, and from more rigorous point of view a thread is a data token and the only real activity is a physical processor. This shows that activity tokens are not different from data tokens. And indeed, we can omit the route of node chasing an activity and consider a dataflow node as an activity itself, which starts to work immediately when all its edges (inputs) contain data.
it's my first message here and I'm glad to join this community.
It looks like that everything is now going towards multi-thread development. Big fishes say that it won't take longer to reach hundreds of cores.
I've recently read about actor based development and how wonderful message passing is to handle concurrent programming. In addition, I also read that they can be implemented as a means of method call. In this case, a given object is also an actor.
In other words we no longer call methods arbitrarily. They are post in queue for late processing. A queue then ensures that a object's state(var) isn't modified at the same time because messages are all serialized.
I understand that this model is quite straightforward to implement (at least an experimental one) and perhaps that's why is too difficult to find any technical detail.
My question concerns queues. This is a typical case of multiple-producers and one consumer and I suspect they require some sort of synchronization. Is that true? There would be another solution? I heard they can be implemented as lock-free structures.
I'm not really sure about that. Any comment will be greatly appreciated.
Have a nice day pals
Multiple producers and a single consumer is a great scenario for using Actors, and doesn't require any synchronization. In Scala, you generally don't use any mutable state when working with Actors. You just send over a copy of whatever data needs processing.
You can read more about Actors in Scala in "Programming Scala", available online for free.
If I understood correctly, agents received the messages in a MailBox which behaves like a concurrent queue. So you do not have to care about it. If you want to play with mailbox directly, you can have a look at this nice article from the great "The busy Java developer's guide to Scala" series.