Spring XD: Writing an aggregator; how to output asynchronously, or at least not 1:1? - aggregation

Per the title, are there best practices, examples, etc. for writing an aggregator transformer module? We have a situation in which our aggregation is based on a timestamp in the data, and the data may come out of order, late, or even TOO late to consider.
For the "too late" case, we plan on using a filter to remove them before the transformer sees them.
However, this gets down to the basic question - how does one write a transformer to aggregate when the output of the aggregator is not based on the input? Can one be written to output to the same XD stream asynchronously? Or even, can one write a transformer to output based on some aspect of the input, but only on certain inputs and not output anything on the others?
As a concrete, albeit contrived, example, I want an aggregator to take a Tuple of (String, Int), and sum the Int's based on the String, and then ONLY output the aggregations when the String is a certain value; say "stop". Easy enough to demonstrate:
(foo, 1)
(bar, 2)
(foo, 2)
(stop, 0)
When the last tuple enters the transformer, I should emit a stream of tuples:
(foo, 3)
(bar, 2)
How is a transformer written to do this?

Use an aggregator processor module, not a transform. The standard aggregator module doesn't have it, but the Spring Integration aggregator can discard late messages via a discard-channel (if expire-groups-on-completion is false).
The aggregator itself, when the release strategy determines that a group can be released (or via group-timeout), can do anything you want with the group, returning any kind of result that is emitted and will go to the next module in the stream. So you would typically do your tuple accumulation there, or use the default aggregator, which will release a collection and you can use a transform module downstream.
The aggregator is a passive component by default; unless you configure a MessageGroupStoreReaper or use group-timeout, group release is only triggered when a message arrives for that group (has the same correlation).

Related

How to create a 100% declarative model?

I am creating a simple model of a network. The network contains nodes. Nodes send and receive data.
Here's one way to model the network: Each node has a "data" field representing the data possessed by the node at time t. Each node also has a "send" field recording the data sent to other nodes at time t.
sig Node {
data: Data -> Time,
send: Data -> Node -> Time
}
In the spectrum between 100% declarative and 100% imperative, I don't think that that signature is at the 100% declarative side. In my first paragraph I said nothing about nodes having data stores, nothing about keeping a record of sent data.
Also, isn't "send" a verb? Isn't that a sign that the model is not as declarative as it could be? Shouldn't declarative models exclusively use nouns?
I want my model to be at the 100% declarative end of the spectrum. To achieve that, I must simply state what "is". Let's do it!
What is is there are nodes:
sig Node {}
What is is that at any time t, a node has data.
sig Node {
data: Data -> Time
}
What is is that the network has wires between nodes.
sig Network {
wire: Node -> Node
}
What is is that data d is on a wire between time t and t' ...
Let me stop there. Is the second approach that I sketched out more declarative? Is there an approach that is even more declarative?
How to model a network in a 100% declarative manner?
I think you need to separate the semantic content of the model, which is one issue, from the names used in it. To me, the essence of a declarative model is that it records observations -- which may be about state, or about dynamic things like transitions -- expressed in logic. The alternative, an operational model, is to describe behavior and states by building up sequences of primitive actions. A simple example: modeling an action that picks an element non-deterministically from a set. An operational spec might treat the set as ordered, and then use a loop to walk through the set, at each step tossing a coin, and returning the given element when the toss first comes up heads. This models the intuitive operational description: "go through the elements of the set one at a time, and pick one to return". A declarative spec would simply say that the returned value is an element of the set -- nothing else needs to be said.
Regarding the use of names in your particular model, it seems that a larger issue to me is whether the names convey directly what they mean. So send to me is not a very helpful name; a better one would be sentAt.

How does lazy-evaluation allow for greater modularization?

In his article "Why Functional Programming Matters," John Hughes argues that "Lazy evaluation is perhaps the most powerful tool for modularization in the functional programmer's repertoire." To do so, he provides an example like this:
Suppose you have two functions, "infiniteLoop" and "terminationCondition." You can do the following:
terminationCondition(infiniteLoop input)
Lazy evaluation, in Hughes' words "allows termination conditions to be separated from loop bodies." This is definitely true, since "terminationCondition" using lazy evaluation here means this condition can be defined outside the loop -- infiniteLoop will stop executing when terminationCondition stops asking for data.
But couldn't higher-order functions achieve the same thing as follows?
infiniteLoop(input, terminationCondition)
How does lazy evaluation provide modularization here that's not provided by higher-order functions?
Yes you could use a passed in termination check, but for that to work the author of infiniteLoop would have had to forsee the possibility of wanting to terminate the loop with that sort of condition, and hardwire a call to the termination condition into their function.
And even if the specific condition can be passed in as a function, the "shape" of it is predetermined by the author of infiniteLoop. What if they give me a termination condition "slot" that is called on each element, but I need access to the last several elements to check some sort of convergence condition? Maybe for a simple sequence generator you could come up with "the most general possible" termination condition type, but it's not obvious how to do so and remain efficient and easy to use. Do I repeatedly pass the entire sequence so far into the termination condition, in case that's what it's checking? Do I force my callers to wrap their simple termination conditions up in a more complicated package so they fit the most general condition type?
The callers certainly have to know exactly how the termination condition is called in order to supply a correct condition. That could be quite a bit of dependence on this specific implementation. If they switch to a different implementation of infiniteLoop written by another third party, how likely is it that exactly the same design for the termination condition would be used? With a lazy infiniteLoop, I can drop in any implementation that is supposed to produce the same sequence.
And what if infiniteLoop isn't a simple sequence generator, but actually generates a more complex infinite data structure, like a tree? If all the branches of the tree are independently recursively generated (think of a move tree for a game like chess) it could make sense to cut different branches at different depths, based on all sorts of conditions on the information generated thus far.
If the original author didn't prepare (either specifically for my use case or for a sufficiently general class of use cases), I'm out of luck. The author of the lazy infiniteLoop can just write it the natural way, and let each individual caller lazily explore what they want; neither has to know much about the other at all.
Furthermore, what if the decision to stop lazily exploring the infinite output is actually interleaved with (and dependent on) the computation the caller is doing with that output? Think of the chess move tree again; how far I want to explore one branch of the tree could easily depend on my evaluation of the best option I've found in other branches of the tree. So either I do my traversal and calculation twice (once in the termination condition to return a flag telling infinteLoop to stop, and then once again with the finite output so I can actually have my result), or the author of infiniteLoop had to prepare for not just a termination condition, but a complicated function that also gets to return output (so that I can push my entire computation inside the "termination condition").
Taken to extremes, I could explore the output and calculate some results, display them to a user and get input, and then continue exploring the data structure (without recalling infiniteLoop based on the user's input). The original author of the lazy infiniteLoop need have no idea that I would ever think of doing such a thing, and it will still work. If we've got purity enforced by the type system, then that would be impossible with the passed-in termination condition approach unless the whole infiniteLoop was allowed to have side effects if the termination condition needs to (say by giving the whole thing a monadic interface).
In short, to allow the same flexibility you'd get with lazy evaluation by using a strict infiniteLoop that takes higher order functions to control it can be a large amount of extra complexity for both the author of infiniteLoop and its caller (unless a variety of simpler wrappers are exposed, and one of them matches the caller's use case). Lazy evaluation can allow producers and consumers to be almost completely decoupled, while still giving the consumer the ability to control how much output the producer generates. Everything you can do that way you can do with extra function arguments as you say, but it requires to the producer and consumer to essentially agree on a protocol for how the control functions work; and that protocol is almost always either specialised to the use case at hand (tying the consumer and producer together) or so complicated in order to be fully-general that the producer and consumer are up tied to that protocol, which is unlikely to be recreated elsewhere, and so they're still tied together.

Representing timestamps

I would like to represent the timestamp coming from an HTMLMediaElement. Its defining characteristics are:
Its value is represented as a Double
It can be queried at any time using getCurrentTime :: IO Double (as partially applied on a given HTMLMediaElement)
It is potentially continuously changing (whenever the media player is playing)
My initial plan was to represent it as a Behavior t Double that re-runs the IO Double every time it is observed, but that hasn't worked out too well.
Things I've tried:
Using a Behavior that is prodded under the hood at a fixed frequency, as described in the workaround section of this question
Passing an Event t () representing the desired sampling frequency, and returning an Event t Double that holds the coinciding timestamps
I don't really like either -- the first one either couples the behaviour (sorry) too much to my specific use case (if I use the eventual sampling frequency I'll use in my app) or seems wasteful (if I use something like 1 kHz sampling when creating the Behavior just to then sample it at 60 Hz on the application end), and the second is quite inflexible if you want to do more than one thing with the timestamp at different sampling rates.
Right now, using an Event to explicitly sample the time (your second option) value is your best bet. We haven't yet created a way to write Behaviors that lazily poll outside resources, although that is something that I hope we'll be able to get done soon.
Keep in mind that, with your second option, you don't necessarily need to use a specific sampling rate; instead, you can sample on-demand, and even have multiple locations doing that sampling. It's not perfect, but I hope that'll let you get the job done!

Why are bacon.js observables classified lazy sequences?

My understanding of lazy sequences is that they don't load data in memory until it's accessed by the program. So I can see how this would make sense if there was a large list of numbers, waiting to be consumed, but the sequence only pulled in the data from the producer when the iterator called the next method.
But observables append the item to themselves whenever the producer pushes it to them. So it's not like the sequence loads the data when consumer asks for it, it loads it whenever the producer sends it. So in what way are observables lazy?
There are 2 kinds of laziness in Bacon.js Observables:
Observables don't register to their underlying data source (for example, an AJAX fetch) until there is at least one Observer. This laziness practically gives you automatic resource management in the sense that connections to data sources are automatically opened and closed based on demand. Also if there are multiple Observers, only a single connection to the data source is used and the results are shared.
Observables don't evaluate the functions passed to map, combine etc until the value is actually used. So if you do expensive calculation in the function you give to map and only sample the stream once in a second, only those values will be actually evaluated.
What you don't get is backpressure management. So if you're data source, say observable a, produces infinite values sequentially, you can either
process them immediately using a.onValue
take 1000 first of them using a.take(1000).onValue
take until some condition using a.takeUntil((x) -> x > 1000)).onValue
But you cannot affect the rate that the source produces the values, because Bacon.js provides no way to tell the source that "I'm interested in more values later, I'll tell you when". I've been thinking about adding something like this, but it would complicate matters a lot.
As the bottom line, I'd say that Bacon.js is not your ideal library for working with infinite lists, for instance, in the Haskell style. And AFAIK, neither is any other Javascript FRP lib.
A lazy sequence is a sequence that evaluates its elements at the last time possible. What's useful about them is that you can pass the sequence around and even perform operations on it without evaluating its contents.
Since you don't evaluate the sequence, you can even create an infinite sequence; just make sure that you don't evaluate the whole of it. For example, the following Haskell program creates an infinite sequence of natural numbers, then lazily multiplies each element by 2, producing an infinite sequence of even numbers, then takes the first 5 elements, evaluating them (and only them):
take 5 (map (*2) [1..])
-- [2,4,6,8,10]
Basically, with lazy sequences and a set of functions that work over them, like map, you can write programs that create and process potentially infinite streams of data in a composable way.
Which, incidentally, is exactly what (functional) reactive programming is about. An observable can be seen as a potentially infinite stream of pairs (value, timestamp). An operation on an observable, such as map, simply creates another stream based on the original one. Attaching a consumer to an observable evaluates the elements of it that "already happened", while leaving the rest un-evaluated.
For example, Bacon's observable.map can be implemented as a function that lazily applies a function to the 'value' part of the stream (again in Haskell):
map f = Prelude.map (\(value, tstamp) -> (f value, tstamp))
while observable.delay adds delay to the 'timestamp' part:
delay dt = Prelude.map (\(value, tstamp) -> (value, tstamp + dt))
where Prelude.map is just a regular "map" function over lazy sequences.
Hopefully I haven't confused you any further!

Behavior in reactive-banana

Pardon me, I'm just starting to look into reactive-banana and FRP.
The author of reactive-banana made this example per my suggestion, in which he creates a counter which can be increased and decreased. He uses accumE function which accumulates events. I think I was able to somewhat grok the Event type, and was able to test quite a few things with it, but then I remembered that there was also Behavior. I looked into it, but it seems like the behavior is meant to be used in similar situations; to modify an existing variable, just like accumE does with events.
What does Behavior mean, and what are the use cases for it?
I agree with Ankur rather than Chris: a text box is a value over time and so naturally wants to be a behavior rather than an event. The reasons Chris give for the less natural choice of event are implementation issues and so (if accurate) an unfortunate artifact of the reactive-banana implementation. I'd much rather see the implementation improved than the paradigm used unnaturally.
Besides the semantic fit, it's pragmatically very useful to choose Behavior over Event. You can then, for instance, use the Applicative operations (e.g., liftA2) to combine the time-varying text box value with other time-varying values (behaviors).
Semantically, you have
Behavior a = Time -> a
That is, a Behavior a is a value of type a that varies over time. In general, you know nothing at all about when a Behavior a would change, so it turns out to be a rather poor choice for updating a text field on the click of a button. That said, it would be easy to get a behavior that expresses the current value of the number in the counter example. Just use stepper on the event stream, or alternatively, build it from scratch the same way, except by using accumB instead of accumE.
Typically, things you hook up to input and output will always be Events, so Behavior is used internally for intermediate results.
Suppose that in the given example, you want to add a new button that remembers the current value, like the memory function on simple calculators. You would start out by adding a memory button and a text field for the remembered value:
bmem <- button f [text := "Remember"]
memory <- staticText f []
You need to be able to ask for the current value at any time, so in your network, you'd add a behavior to represent it.
let currentVal = stepper 0 counter
Then you can hook up events, and use apply to read the value of the behavior every time the Remember button is pressed, and produce an Event with that sequence of values.
emem <- event0 bmem command
let memoryE = apply (const <$> currentVal) emem
And finally, hook up this new event to the output
sink memory [text :== ("", show <$> memoryE)]
If you wanted to use memory internally, then again you'd want a Behavior for its current value too... but since we only ever use it to connect it to an output, we only need an event for now.
Does that help?
Library author speaking. :-)
Apparently, Chris Smith can read minds because he accurately describes what I am thinking. :-)
But Conal and Arthur have a point, too. Conceptually, the counter is a value that varies in time, not a sequence of event occurrences. Thus, thinking of it as a Behavior would be more appropriate.
Unfortunately, behaviors do not come with any information about when they will change, the are "poll-only". Now, I could try to implement various clever schemes that will minimize the polling and thus allow effient updates of GUI elements. (Conal does something similar in the original paper.) But I have adopted a "no magic" philosophy: the library user shall be responsible for managing updates via events himself.
The solution I currently envision is to provide a third type besides Event and Behavior, namely Reactive (name subject to change) which embodies qualities of both: conceptually, it's a value that varies in time, but it also comes with an event that notifies of changes. One possible implementation would be
type Reactive a = (a,Event a)
changes :: Reactive a -> Event a
changes (_, e) = e
value :: Reactive a -> Behavior a
value (x, e) = stepper x e
It is no surprise that this is precisely the type that sink expects. This will be included in a future version of the reactive-banana library.
EDIT: I have released reactive-banana version 0.4 which includes the new type, which is now called Discrete.
Generally, Behavior is a value that changes over a period of time. It is a continuous value, where as events are discrete values. In case of Behavior a value is always present.
For example: The text on a text box is a Behavior as the text can change over a period of time but there will be a current value, where as a keyboard stroke in a event as you cannot query a keyboard stroke for its "current" value.

Resources