Spring Integration aggregator based on content of next message - spring-integration

I have to read a file, split each line, and group the lines based on the first column; when the first column value changes, I have to release the previous group. Can this be done with the Spring Integration DSL?
Here is what the file looks like (it's sorted):
x 1
x 2
x 3
y 4
y 5
y 6
The output should be two messages: one with x = 1, 2, 3 and one with y = 4, 5, 6.
Since there is no other relation that says when messages should be grouped, can I group messages as soon as I hit the next non-matching record? In this case, as soon as I hit "y" at line 4, group the previous "x" messages and release them? Is this possible with a custom aggregator?

The simplest solution is to rely on groupTimeout(), as long as you split and aggregate in a single thread and quickly enough. That way, all your records will be processed and distributed to their groups. But since we don't know when to release them, we rely on a scheduled timeout. So, the configuration for the aggregator would be like this:
.aggregate(a -> a
        .correlationExpression("payload.column1")
        .releaseStrategy(g -> false)        // never release on message arrival
        .groupTimeout(1000)                 // rely on the scheduled timeout instead
        .sendPartialResultOnExpiry(true)
        .outputProcessor(g -> {
            Collection<Message<?>> messages = g.getMessages();
            // iterate and build your output payload
            return messages;
        }))
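For completeness, a rough end-to-end sketch of such a flow (untested; the directory, poller interval, splitter, the SpEL split on a space, and the joined-string output are my assumptions, and imports are omitted):

@Bean
public IntegrationFlow groupByFirstColumn() {
    return IntegrationFlows
            .from(Files.inboundAdapter(new File("/tmp/in")),
                    e -> e.poller(Pollers.fixedDelay(1000)))
            .split(Files.splitter())                                  // one message per line
            .aggregate(a -> a
                    .correlationExpression("payload.split(' ')[0]")   // correlate on the first column
                    .releaseStrategy(g -> false)
                    .groupTimeout(1000)
                    .sendPartialResultOnExpiry(true)
                    .outputProcessor(g -> g.getMessages().stream()
                            .map(m -> (String) m.getPayload())
                            .collect(Collectors.joining(", "))))      // e.g. "x 1, x 2, x 3"
            .channel("groupedLines")
            .get();
}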

Related

Is there a way in AnyLogic to assign resources to a population of agents in use rather than individual agents?

I have a simple example of dish washers at a restaurant to illustrate the issue I am having.
Question
How can I ensure that the correct number of dish washers is seized & released when it depends on the number of agents being used?
Problem
When a function is used to assign the resources, the number of dish washers is not always correct due to the different times at which sinks are used and not used.
Example
Main:
Generates dishes and randomly assigns them to one of three sinks in the exit block.
Sinks is a population of agents.
dish_washers is a ResourcePool with a capacity of 10.
Sink:
Dishes enter a queue and are let through one at a time using a hold block.
Once the dish is cleaned, the hold is unblocked to grab the next dish.
Details:
I have a shared ResourcePool of dish_washers at a restaurant.
There are 3 sinks at the restaurant.
Dishes are generated and randomly assigned to each sink.
If only 1 sink is being used, then two dish washers are needed.
However, if 2 or more sinks are being used then the number of dish washers becomes:
numberOfDishWashers = 2 + numberOfSinksInUse;
In order to change the numberOfDishWashers as more sinks are being used, I created a function that defines the numberOfDishWashers to be seized from the dish_washers ResourcePool.
int numberOfSinksUsed = 0;
int numberOfWorkersToSeize = 0;
int numberOfWorkersAlreadySeized = 0;
int numberOfWorkersToAssign = 0;
ResourcePool[][] dish_washers;

for (Sink curSink : main.sinks) {
    if (curSink.queue.size() > 0) {
        numberOfSinksUsed += 1;
    }
}

numberOfWorkersAlreadySeized = main.dish_washers.busy();
numberOfWorkersToSeize = 2 + numberOfSinksUsed;
numberOfWorkersToAssign = numberOfWorkersToSeize - numberOfWorkersAlreadySeized;

dish_washers = new ResourcePool[1][numberOfWorkersToAssign];
for (int i = 0; i < numberOfWorkersToAssign; i++) {
    dish_washers[0][i] = main.dish_washers;
}
return dish_washers;
Error Description:
However, depending on which sink completes first and releases its workers, the number of dish washers assigned will be incorrect. A traceln at the end of the sink process illustrates this: the numberOfDishWashers seized in the exit block doesn't match "2 + numberOfSinksInUse".
There is an instance where 3 sinks are in use but only 4 workers were seized.
Exit, Sink: C Workers Currently Seized: 4
Sinks in Use: 2
Exit, Sink: C Workers Currently Seized: 4
Sinks in Use: 3
Exit, Sink: C Workers Currently Seized: 5
Sinks in Use: 2
Exit, Sink: C Workers Currently Seized: 4
Sinks in Use: 2
Another way to look at the issue is this Excel table outlining the current logic.
The number of busy workers doesn't match the number of busy workers there should be based on the number of active sinks.
Methods I have Tried
Custom function to release only the necessary workers to keep the correct total.
Generates an error because the resource gets assigned to the 'agent' or dish.
When the dish gets destroyed it has unreleased resources attached to it.
Passing the "sink" agent through an "enter", "seize", and "exit" block to assign the
resource to the agent "sink" instead of the dish that is generated.
Error regarding the "dish" agent being in the flowchart of the "sink" agent while the
"sink" agent is seizing the workers.
How can I ensure the correct number of dish washers are always grabbed?
So your fundamental problem here is that inside the sink you seize a dishwasher, then the dish goes into the delay (with the number of dishwashers seized), and once out of the delay it releases whatever dishwashers it seized. But during the time it is in the delay the situation might have changed, and you actually wanted to seize a different number of dishwashers for that specific sink.
Your options are to either
Remove dishes from the delay, release the correct number of dishwashers, return them to the delay, and delay for the remainder of the time...
Implement your own logic.
I would go for option 2, as option 1 means that you develop a workaround for the block created by AnyLogic and you end up not using the blocks the way they were designed; this is, unfortunately, the issue with blockification.
So I would have a collection inside each sink that shows the number of dishwashers currently assigned to that sink. Then whenever a new dish enters a sink we recalculate the number of dishwashers to assign (perhaps at every sink?) and then make the correct assignment.
Here is an example with some sample code - I did not test it, but you will have something similar:
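Roughly along these lines (an untested illustration of the idea above; names such as washersAssigned and reassignWashers are made up for the sketch, they are not AnyLogic API):

// Variable declared inside the Sink agent type:
int washersAssigned = 0;   // dishwashers currently seized on behalf of this sink

// Function on Main, called whenever a dish enters a sink and whenever a sink finishes:
void reassignWashers() {
    int sinksInUse = 0;
    for (Sink curSink : sinks) {
        if (curSink.queue.size() > 0) {
            sinksInUse++;
        }
    }
    // target from the question: 2 washers for a single sink, 2 + n for n >= 2 sinks
    int target = (sinksInUse <= 1) ? 2 : 2 + sinksInUse;
    int alreadySeized = dish_washers.busy();
    int delta = target - alreadySeized;
    // seize `delta` more washers (or release -delta) here and update each sink's
    // washersAssigned counter, instead of attaching the washers to individual dishes
}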

List all "unique paths" to a node

I have a representation of a process through something that is very much like a DAG (Directed Acyclic Graph). This graph is represented with an adjacency table, but not a "regular" adjacency table; there are a few differences:
Each entry in the table is a list of lists,
Each "inner" list states the predecessor nodes required.
The idea for this data structure is to hold requirements of steps within a process. So, for example:
P = {1:[[]], 2:[[1]], 3:[[2]], 4:[[3]], 5:[[2]], 6:[[]], 7: [[4,6],[8,5]], 8:[[]]}
For process P, step 1 doesn't require any predecessor, step 2 requires step 1, ..., step 6 also doesn't require any predecessor, and step 7 requires steps (4 and 6) OR (8 and 5).
Each step has a state (some ID reference) that determines if the next step can be executed or not, or if the process can be terminated or not.
In the example above, I would not be able to execute step 2 if step 1 didn't fulfill some specific condition regarding its state; the same goes for step 5, which requires step 2 with state = something specific. And for step 7, the only way to execute it would be if steps 4 & 6 OR 5 & 8 have their corresponding state = something specific.
What I need is a way to get all the unique paths that lead to a certain step, so later I can check against these paths whether the conditions are met. For step 7 it would be:
paths = [[1,2,3,4,6],[1,2,5,8]]
I've checked:
Python get all paths from graph
How to implement the search for all paths inside a directed graph in JavaScript? (reversing this??)
Depth first search list paths to all end nodes
How to find the nodes that leads to node A without traversing all the graph (directed graph)
Most of the information around points to some sort of modified DFS or some kind of enhanced Dijkstra. From what I've checked and tested, none of the above gives me what I need, which is a list of all "unique paths" that lead to a node that may be reached via different paths.
The question is not language specific, so any example in any language would be appreciated :)
EDIT: 04/01/22
Further clarifications:
The steps are one way, meaning that node 1 is connected to step 2 by a distance of 1, to step 3 by a distance of 2, and so on. But step/node 1 is not connected with 6 or 8.
All graphs have a unique starting point and ending point. In the example 1 and 7.
Yes, node 5 should be connected to node 7. Img updated.
The number of nodes will always be <100.
How big is your graph? What is your performance requirement?
For a small graph like your example, Dijkstra is almost instant, so you do not need to store all the paths.
Set cost of all links to 1
Set cost of links that lead to nodes that are NOT in the required state to 10^10
Run Dijkstra to find shortest path from source to destination through nodes in required state.
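This does not enumerate all paths as asked, but as an untested sketch of the weighting trick above (the adjacency map and the set of nodes that are not in the required state are placeholders to be filled from the real process data):

import java.util.*;

public class StatefulReachability {

    static final long BAD = 10_000_000_000L;   // effectively infinite cost

    // Plain Dijkstra where every link costs 1, except links into nodes that are
    // not in the required state, which cost 10^10.
    static long shortestValidPath(Map<Integer, List<Integer>> adjacency,
                                  Set<Integer> nodesNotInRequiredState,
                                  int source, int destination) {
        Map<Integer, Long> dist = new HashMap<>();
        PriorityQueue<long[]> queue =
                new PriorityQueue<>(Comparator.comparingLong((long[] entry) -> entry[0]));
        dist.put(source, 0L);
        queue.add(new long[]{0L, source});
        while (!queue.isEmpty()) {
            long[] current = queue.poll();
            long d = current[0];
            int node = (int) current[1];
            if (node == destination) {
                return d;                       // a result below BAD means a valid path exists
            }
            if (d > dist.getOrDefault(node, Long.MAX_VALUE)) {
                continue;                       // stale queue entry
            }
            for (int next : adjacency.getOrDefault(node, Collections.emptyList())) {
                long cost = nodesNotInRequiredState.contains(next) ? BAD : 1L;
                long candidate = d + cost;
                if (candidate < dist.getOrDefault(next, Long.MAX_VALUE)) {
                    dist.put(next, candidate);
                    queue.add(new long[]{candidate, next});
                }
            }
        }
        return Long.MAX_VALUE;                  // destination unreachable
    }
}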
I think I've managed to get what I needed; nevertheless, I think the answer is overly complex.
Function to populate a tracker object with all the possible paths.
const tracker = {};

function getPaths (step, branchRef) {
  const currentStepRequires = getStepRequires(step); // func that gets the array of arrays of current step
  const oldBranchRef = branchRef;
  const hasBranches = currentStepRequires.length > 1;
  for (const branch of currentStepRequires) {
    if (branch.length === 0) {
      return;
    }
    if (!hasBranches && !branchRef) {
      tracker[branch] = [];
    }
    if (!branchRef) branchRef = branch;
    if (hasBranches) {
      if (oldBranchRef && oldBranchRef !== branchRef) {
        tracker[branch] = [...tracker[oldBranchRef]];
      }
      else if (tracker[branchRef]) {
        tracker[branch] = [...tracker[branchRef]];
        branchRef = branch;
      }
      else {
        tracker[branch] = [];
      }
    }
    for (const step of branch) {
      tracker[branchRef].push(step);
      getPaths(step, branchRef);
    }
    if (hasBranches) branchRef = '';
  }
}
After the tracker object has been populated I need to remove the paths that are contained within the other paths.
I'm using lodash here to simplify the filtering, checking, and adding of the paths:
const paths = [];

_.forEach(_.sortBy(tracker, path => path.length * -1), branch => {
  const isSubpath = _.some(paths, path => _.isEqual(branch, _.intersection(path, branch)));
  if (!isSubpath) {
    paths.push(branch);
  }
});
For the example above, this returns the following:
[[4,3,2,1,6], [8,5,2,1]]
I've also tested with more "branching", like example:
P = {1:[[]], 2:[[1]], 3:[[2]], 4:[[3]], 5:[[2]], 6:[[]], 7: [[4,6],[8],[5]], 8:[[6],[3]]}
Which returns:
[[4,3,2,1,6],[8,6],[8,3,2,1],[5,2,1]]
For now it's working, but as I said, I think it's more complicated than it needs to be. So, any improvements are welcome.
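One possible simplification, as an untested Java sketch rather than the OP's code: recursively expand each requirement alternative into the full set of predecessor steps, taking the cross product whenever a prerequisite itself has alternatives.

import java.util.*;

public class UniquePaths {

    // P from the question: step -> list of alternative prerequisite lists
    static final Map<Integer, List<List<Integer>>> P = Map.of(
            1, List.of(List.of()),
            2, List.of(List.of(1)),
            3, List.of(List.of(2)),
            4, List.of(List.of(3)),
            5, List.of(List.of(2)),
            6, List.of(List.of()),
            7, List.of(List.of(4, 6), List.of(8, 5)),
            8, List.of(List.of()));

    // All alternative prerequisite sets that allow `step` to run (excluding the step itself).
    static List<Set<Integer>> pathsTo(int step) {
        List<Set<Integer>> result = new ArrayList<>();
        for (List<Integer> alternative : P.get(step)) {
            // start with one empty combination, then merge in each prerequisite's alternatives
            List<Set<Integer>> combos = new ArrayList<>();
            combos.add(new TreeSet<>());
            for (int required : alternative) {
                List<Set<Integer>> expanded = new ArrayList<>();
                for (Set<Integer> upstream : pathsTo(required)) {   // recurse into the prerequisite
                    for (Set<Integer> combo : combos) {
                        Set<Integer> merged = new TreeSet<>(combo);
                        merged.add(required);
                        merged.addAll(upstream);
                        expanded.add(merged);
                    }
                }
                combos = expanded;
            }
            result.addAll(combos);
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(pathsTo(7));
    }
}

For the example P this should print [[1, 2, 3, 4, 6], [1, 2, 5, 8]], and for the second, more branched example it should produce the same four sets as the lodash version.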

ArangoDB: Traversal condition on related document

I've been stuck on this for days, trying to accomplish the following:
See the provided picture.
The black is the start vertex. Trying to get:
1: All child parts OUTBOUND (from) the start vertex
2: Condition: the children MUST have an INBOUND "types" edge whose other end is a document of the "type" collection with an attribute set to "true".
3: When a document of type "part" fails to meet that requirement (an INBOUND document of type "type" with the attribute set to "true"), the expansion stops for that path then and there.
4: The documents that failed aren't included in the result either.
5: It should work for any depth.
6: No subqueries (if possible).
Example of graph
With the given information, the data model seems questionable. Why are there true and false vertices instead of a boolean edge attribute per partScrew? Is there a reason why it is modeled like this?
Using this data model, I don't see how this would be possible without subqueries. The traversal down a path can be stopped early with PRUNE, but PRUNE does not support subqueries. That only leaves FILTER for post-filtering as an option, but be careful: you need to check all vertices on the path, not just the emitted vertex, for an inbound false type.
Not sure if it works as expected in all cases, but here is what I came up with and the query result, which looks good to me:
LET startScrew = FIRST(FOR doc IN screw LIMIT 1 RETURN doc) // Screw A

FOR v, e, p IN 1..2 OUTBOUND startScrew partScrew
  FILTER (
    FOR v_id IN SHIFT(p.vertices[*]._id) // ignore start vertex
      FOR v2 IN 1..1 INBOUND v_id types
        RETURN v2.value
  ) NONE == false
  RETURN {
    path: CONCAT_SEPARATOR(" -- ", p.vertices[*].value)
  }
path
Screw A -- Part D
Screw A -- Part E
Screw A -- Part E -- Part F
Dump with test data: https://gist.github.com/Simran-B/6bd9b154d1d1e2e74638caceff42c44f

Spring Integration aggregator's release strategy based on last modified

I'm trying to implement the following scenario:
I get a bunch of files that have common file pattern, i.e. doc0001_page0001, doc0001_page0002, doc0001_page0003, doc0002_page0001 (where doc0001 would be one document consisting of 3 pages that I would need to merge, doc0002 would only have 1 page)
I want to aggregate them in a way that a group is released only when all of the files for a specific document have been gathered (doc0001 after 3 files were picked up, doc0002 after 1 file)
My idea was to read the files in alphabetical order and to release a group 2 seconds after it was last modified (g.getLastModified() is smaller than the current time minus 2 seconds)
I've tried the following without success:
return IntegrationFlows.from(Files.inboundAdapter(tmpDir.getRoot())
                .patternFilter("*.json")
                .useWatchService(true)
                .watchEvents(FileReadingMessageSource.WatchEventType.CREATE,
                        FileReadingMessageSource.WatchEventType.MODIFY),
        e -> e.poller(Pollers.fixedDelay(100)
                .errorChannel("filePollingErrorChannel")))
        .enrichHeaders(h -> h.headerExpression("CORRELATION_PATTERN",
                "headers[" + FileHeaders.FILENAME + "].substring(0,7)")) // docxxxx.length()
        .aggregate(a -> a.correlationExpression("headers['CORRELATION_PATTERN']")
                .releaseStrategy(g -> g.getLastModified() < System.currentTimeMillis() - 2000))
        .channel(MessageChannels.queue("fileReadingResultChannel"))
        .get();
Changing the release strategy to the following also didn't work:
.aggregate(a -> a.correlationExpression("headers['CORRELATION_PATTERN']")
        .releaseStrategy(g -> {
            Stream<Message<?>> stream = g.getMessages()
                    .stream();
            Long timestamp = (Long) stream.skip(stream.count() - 1)
                    .findFirst()
                    .get()
                    .getHeaders()
                    .get(MessageHeaders.TIMESTAMP);
            System.out.println("Timestamp: " + timestamp);
            return timestamp.longValue() < System.currentTimeMillis() - 2000;
        }))
Am I misunderstanding the release strategy concept?
Also, is it possible to print something out from the releaseStrategy block? I wanted to compare the timestamp (see System.out.println("Timestamp: " + timestamp);)
Right: since you don't know the whole sequence for the message group, you don't have any choice other than to use a groupTimeout. The regular releaseStrategy works only when a message arrives at the aggregator. Since at the point of a single message you don't have enough info to release the group, it is going to sit in the group store forever.
The groupTimeout option was introduced on the aggregator especially for this kind of use case, when we definitely would like to release a group that doesn't have enough messages to be released normally.
You may consider using a groupTimeoutExpression instead of a constant-based groupTimeout. The MessageGroup is the root evaluation context object for SpEL, so you will be able to access the mentioned lastModified through it.
The .sendPartialResultOnExpiry(true) is the right option to use here.
See more info in the docs: https://docs.spring.io/spring-integration/reference/html/#agg-and-group-to
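For example, a minimal sketch (untested) of a dynamic timeout; the expression is evaluated against the MessageGroup, so group state such as size() is available, and the numbers are purely illustrative:

.aggregate(a -> a
        .correlationExpression("headers['CORRELATION_PATTERN']")
        .groupTimeoutExpression("size() ge 10 ? 100 : 2000")   // shorter wait once the group is large
        .sendPartialResultOnExpiry(true))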
I found a solution to that with a different approach. I still don't understand why the above one wasn't working.
I've also found a cleaner way of defining the correlation function.
IntegrationFlows.from(Files.inboundAdapter(tmpDir.getRoot())
                .patternFilter("*.json")
                .useWatchService(true)
                .watchEvents(FileReadingMessageSource.WatchEventType.CREATE,
                        FileReadingMessageSource.WatchEventType.MODIFY),
        e -> e.poller(Pollers.fixedDelay(100)))
        .enrichHeaders(h -> h.headerFunction(IntegrationMessageHeaderAccessor.CORRELATION_ID,
                m -> ((String) m.getHeaders().get(FileHeaders.FILENAME)).substring(0, 17)))
        .aggregate(a -> a.groupTimeout(2000)
                .sendPartialResultOnExpiry(true))
        .channel(MessageChannels.queue("fileReadingResultChannel"))
        .get();

DAG source return false on emitFromTraverser and processor wait for all element loaded by source before start processing

USECASE
HazelcastJet version 0.6.1
Hazelcast version 3.10.2
Given this (simplified) version of a DAG:
VERTICES
S1
Source that emits 5 items of type A (read from DB with partitioning)
Local parallelism = 1
S2
Source that emits 150K items of type B (Iterator that read from DB in batch of 100 with partitioning)
Local parallelism = 1
AD
Processor that adapts types A->A1 and B->B1 and emits one by one
FA
Processors.filterP that accepts only items of type A1 and emits one by one
FB
Processors.filterP that accepts only items of type B1 and emits one by one
CL
Processor that first accumulates all items of type A1; then, when it receives an item of type B1, it enriches it with some data from the matching A1 and emits it, one by one.
WR
Sink that writes B1
Local parallelism = 1
NOTE:
Just to give meaning to the filter processors: in the DAG there are other sources that flow into the same adapter AD and then go down other paths via filter processors.
EDGES
S1 --> AD
S2 --> AD
AD --> FA (from ordinal 0)
AD --> FB (from ordinal 1)
FA --> CL (to ordinal 0 with priority 0 distributed and broadcast)
FB --> CL (to ordinal 1 with priority 1)
CL --> WR
PROBLEM
If source S2 has "few" items to load (e.g. 15K), the emitFromTraverser never returns false.
If source S2 has "many" items to load (e.g. 150K), the emitFromTraverser returns false after:
All A1 items have been processed by CL
About 30% of the B1 items have already been transmitted to CL, but none have been processed by CL (the DiagnosticProcessor logs that elements are sent to CL but not processed)
S2 code for reference:
protected void init(Context context) throws Exception {
    super.init(context);
    this.iterator = new BQueryIterator(querySupplier, batchSize);
    this.traverser = Traversers.traverseIterator(this.iterator);
}

public boolean complete() {
    boolean result = emitFromTraverser(this.traverser);
    return result;
}
QUESTION
Is it correct that CL doesn't process items until the source ends?
Is the usage of priority + distributed + broadcast correct on the CL vertex?
UPDATE
It seems that completeEdge on CL's edge 1 is never called.
Can someone tell me why?
Thanks!
You suffer from a deadlock caused by priority. Your DAG branches from AD and then rejoins in CL, but with a priority.
AD --+-- FA --+-- CL
     |        |
     +-- FB --+
Setting a priority causes no item from the lower-priority edge to be processed before all items from the higher-priority edge are processed. AD will eventually get blocked by backpressure from the lower-priority path, which is not being consumed by CL. So AD is blocked because it can't emit to the lower-priority edge, and CL is blocked because it's still waiting for items from the higher-priority edge, resulting in a deadlock.
In your case, you can resolve it by making 2 AD vertices, each processing items from one of the sources:
S1 --- AD1 ---+--- CL
              |
S2 --- AD2 ---+
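An untested sketch of that rewiring with the Jet core DAG API (between(...) and from(...) are the static factories on Edge; the processor suppliers are placeholders for whatever the existing vertices use):

DAG dag = new DAG();
Vertex s1  = dag.newVertex("s1",  s1Supplier);    // emits items of type A
Vertex s2  = dag.newVertex("s2",  s2Supplier);    // emits items of type B
Vertex ad1 = dag.newVertex("ad1", ad1Supplier);   // adapts A -> A1
Vertex ad2 = dag.newVertex("ad2", ad2Supplier);   // adapts B -> B1
Vertex cl  = dag.newVertex("cl",  clSupplier);    // accumulates A1, enriches B1
Vertex wr  = dag.newVertex("wr",  wrSupplier);    // sink that writes B1

dag.edge(between(s1, ad1))
   .edge(between(s2, ad2))
   .edge(from(ad1).to(cl, 0).priority(0).distributed().broadcast())
   .edge(from(ad2).to(cl, 1).priority(1))
   .edge(between(cl, wr));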
After a while I understood what the problem is...
The CL processor cannot know when all the A1 items have arrived, because all items come through the same AD processor.
So it needs to wait for everything coming from AD before it can start processing B1 items.
I'm not sure, but probably after a lot of B items are loaded, all the inbox buffers in the DAG become full and can't accept any more B items from S2, while at the same time CL cannot process B1 items to make progress: that's the deadlock.
Maybe the DAG would be able to detect this?
I don't know Jet that deeply, but it would be nice to have such a warning.
Maybe there is some logging to enable?
I hope someone can confirm my answer and suggest how to improve and detect these problems.
