Spring Integration - Parallel ordered processing

Spring Integration - Parallel ordered processing - spring-integration

In the application i work on, there is a continuous flow of messages coming from a TCP socket. Messages have different types. Different types of messages should be processed in parallel. But each specific type of message must be processed in the order they come in.
I used ExecutorChannel from spring integration and it solves parallel processing need. I created a channel for each specific type of message.
But i cannot guarantee ordered processing of messages for specific types.
Is there a way to do ordered processing with publish/subscribe channels while also using parallel processing?

Consider to use the same ExecutorChannel but as an input for each type. The trick that each of them should be configured with executors with a single thread. So, you have as much single-threaded executor channels as you have message types.
Another trick is like QueueChannel for each type and polling endpoints with the fixed-delay as subscribers to those queues.
One more option available since the current Spring Integration 5.0 is FluxMessageChannel. The ordering is guaranteed by the internal Reactor's Flux and the parallelism will be achieved by the subscribers - the processing messages in Flux is happened in the subscriber thread.

Related

Is it possible to make a Poller (or PollableMessageSource) to poll messages as List?

Following the example found in GitHub https://github.com/spring-cloud/spring-cloud-gcp/tree/master/spring-cloud-gcp-samples/spring-cloud-gcp-pubsub-polling-binder-sample regarding polling messages from a PubSub subscription, I was wondering...
Is it possible to make a PollableMessageSource retrieve List<Message<?>> instead of a single message per poll?
I've seen the #Poller notation only being used in Source typed objects, never in Processor or Sink. Is it possible to use in such context when for example using #StreamListener or with a functional approach?

The PollableMessageSource binding and Source stream applications are fully based on the Poller and MessageSource abstraction from Spring Integration where its contract is to produce a single message to the channel configured. The point of the messaging is really to process a single message not affecting others. The failure for one message doesn't mean to fail others in the flow.
On the other hand you probably mean GCP Pub/Sub messages to be produced as a list in the Spring message payload. That is really possible, but via some custom code from Pub/Sub consumer and MessageSource impl. Although I would think twice to expect some batched from the source. Probably you may utilize an aggregator to build some small windows if your further logic is about processing as list. But again: it is going to be a single Spring message.
May be better to start thinking about a reactive function implementation where you indeed can expect a Flux<Message<?>> as an input and Spring Cloud Stream framework will take care for you how to emit the data from Pub/Sub into the reactive stream you expect.
See more info in docs: https://docs.spring.io/spring-cloud-stream/docs/3.1.0/reference/html/spring-cloud-stream.html#_reactive_functions_support

How splitting in spring integration works for web container?

I want to use Spring Integration for HTTP inbound message processing.
I know, that it spring integration channel would run on a container thread, but if I want to use splits,
what threads would be used?
How the result of split would be returned to the initial web request thread?

(Note: I am not 100% sure if I understand you use case, but as a general remark:)
The spring integration spitter splits a message in multiple "smaller" messages. This is unrelated to multi-threading, that is, it does not per-se imply that the smaller messages are processed in parallel. It is still a sequential stream of smaller messages.
You can then process the smaller messages in parallel, by defining a handler with a given parallelism and you can define that this handler uses a dedicated thread pool.
(Sorry if this does not answer your question, please clarify).

How Spring Cloud Stream prevents the application’s instances from receiving duplicate messages?

Spring Cloud Stream is based on At least once method,This means that in some rare cases a duplicate message can arrive at an endpoint.
Does Spring Cloud Stream keep a buffer of already received messages?
The IdempotentReceiver in Enterprise Integration Patterns book suggests :
Design a receiver to be an Idempotent Receiver,one that can safely receive the same message multiple times.
Does Spring Cloud Stream control duplicate messages in consumers?
Update:
A paragraph from Spring Cloud Stream says :
4.5.1. Durability
Consistent with the opinionated application model of Spring Cloud Stream, consumer group subscriptions are durable. That is, a binder implementation ensures that group subscriptions are persistent and that, once at least one subscription for a group has been created, the group receives messages, even if they are sent while all applications in the group are stopped.
Anonymous subscriptions are non-durable by nature. For some binder implementations (such as RabbitMQ), it is possible to have non-durable group subscriptions.
In general, it is preferable to always specify a consumer group when binding an application to a given destination. When scaling up a Spring Cloud Stream application, you must specify a consumer group for each of its input bindings. Doing so prevents the application’s instances from receiving duplicate messages (unless that behavior is desired, which is unusual).

I think your assumption on the responsibility of the spring-cloud-stream framework are incorrect.
Spring-cloud-stream in a nutshell is a framework responsible for connecting and adapting producers/consumers provided by the developer to the message broker(s) exposed by the spring-cloud-stream binder (e.g., Kafka, Rabbit, Kinesis etc).
So connecting to a broker, receiving message from the broker, deserialising it, invoking user code, serialising message and sending it back to the broker is in the scope of framework responsibility. So you can look at it as purely infrastructure.
What you're describing is more of an application concern since the actual receiver is something that user would develop as part of the spring-cloud-stream development experience, hence responsibility for idempotence would reside with such user.
Also, on top of that most brokers already handle idempotency (in a way) by ensuring that a particular message has been delivered only once. That said, if someone sends identical message to such broker, it will have no idea that it is duplicate so the requirement for idempotency and/or deduplication is still valid, but as you can see it is not as straight forward given the amount of factor that are in play where your understanding of idempotence could be different from mine, hence our approaches could be different as well.
One last thing (partially to prove my last point): can safely receive the same message multiple times. - That is all it states, but what does safely really mean to you vs. me vs. some other person?

If you are concerned about a case where the application receives and processes message from the broker but crashes before it acknowledges the message, that can happen. Spring cloud stream app starters provides support for auto-configuration of a persistent message metadata store which backs Spring Integration's IdempotentReceiverInterceptor. An example of this is in the SFTP source app starter. By default, the sftp source uses an in-memory metadata store, so it would not survive a restart, but can be customized to use a persistent store.

what is best practice to consume messages from multiple kafka topics?

I need to consumer messages from different kafka topics,
Should i create different consumer instance per topic and then start a new processing thread as per the number of partition.
or
I should subscribe all topics from a single consumer instance and the should start different processing threads
Thanks & regards,
Megha

The only rule is that you have to account for what Kafka does and doesn't not guarantee:
Kafka only guarantees message order for a single topic/partition. edit: this also means you can get messages out of order if your single topic Consumer switches partitions for some reason.
When you subscribe to multiple topics with a single Consumer, that Consumer is assigned a topic/partition pair for each requested topic.
That means the order of incoming messages for any one topic will be correct, but you cannot guarantee that ordering between topics will be chronological.
You also can't guarantee that you will get messages from any particular subscribed topic in any given period of time.
I recently had a bug because my application subscribed to many topics with a single Consumer. Each topic was a live feed of images at one image per message. Since all the topics always had new images, each poll() was only returning images from the first topic to register.
If processing all messages is important, you'll need to be certain that each Consumer can process messages from all of its subscribed topics faster than the messages are created. If it can't, you'll either need more Consumers committing reads in the same group, or you'll have to be OK with the fact that some messages may never be processed.
Obviously one Consumer/topic is the simplest, but it does add some overhead to have the additional Consumers. You'll have to determine whether that's important based on your needs.
The only way to correctly answer your question is to evaluate your application's specific requirements and capabilities, and build something that works within those and within Kafka's limitations.

This really depends on logic of your application - does it need to see all messages together in one place, or not. Sometimes, consumption from single topic could be easier to implement in terms of business logic of your application.

Sharing EventHub between Azure Fabric reliable actors

I'm having an application where I map devices from the physical world to Reliable Actors in Azure Fabric. Each time I receive a message from a device, I want to push a message to an event hub.
What I'm doing right now is creating/using/closing the EventHubClient object for each message.
This is very inefficient (it takes about 1500ms) but it solves an issue I had in the past where I was keeping the EventHubClient in memory. When I have a lot of devices, the underlying virtual machine can quickly run out of network connections.
I'm thinking about creating a new actor that would be responsible for pushing data to the EventHub (by keeping the EventHubClient alive). Because of the turned based concurrency model of Reliable Actors, I'm not sure it's a good idea. If I get 10 000 devices pushing data "at the same time", each of their actors will block to push the message to the new actor that pushes message to the EventHub.
What is the recommended approach for this scenario ?
Thanks,

One approach would be to create a stateless service that is responsible for pushing messages to the EventHub. Each time an Actor receives a message from the device (by the way, how are they communicating with actors?) the Actor calls the stateless service. The stateless service in turn would be responsible for creating, maintining and disposing of one EventHubClient per service. Reliable Service would not introduce the same 'overhead' when it comes to handling incoming messages as a Reliable Actor would. If it is important for your application that the messages reach the EventHub in strictly the same order that they were produced in then you would have to do this with a Stateful Service and a Reliable Queue. (Note, this there is on the other hand no guarantee that Actors would be able to finish handling incoming messages in the same order as they are produced)
You could then fine tune-tune the solution by experimenting with the instance count (https://learn.microsoft.com/en-us/azure/service-fabric/service-fabric-availability-services) to make sure you have enough instances to handle the throughput of incoming messages. How many instances are roughly determined by the number of nodes and cores per node, although other factors may also affect.
Devices communicate with your Actors, the Actors in turn communicate with the Service (may be Stateless or Stateful if you want to queue message, see below), each Service manages an EventHubClient that can push messages to the EventHub.
If your cluster is unable to support an instance count for this service that is high enough (a little simplified: more instances = higher throughput), then you may need to create it as a Stateful Service instead and put messages in a Reliable Queue in the Service and then have the the RunAsync for the Service processing the queue in order. This could take the pressure of peaks in performance.
The Service Fabric Azure-Samples WordCount shows how you work with different Partitions to make the messages from Actors target different instances (or really partitions).
A general tip would be to not try to use Actors for everything (but for the right things they are great and reduces complexity a lot), the Reliable Services model support a lot more scenarios and requirements and could really complement your Actors (rather than trying to make Actors do something they are not really designed for).

You could use a pub/sub pattern here (use the BrokerService).
By decoupling event publishing from event processing, you don't need to worry about the turn based concurrency model.
Publishers:
The Actor sends out messages by simply publishing them to a BrokerService.
Subscribers
Then you use one or more Stateless Services or (different) Actors as subscribers of the events.
They would send them into EventHub in their own pace.
Event Hub Client
Using this approach you'd have full control over the EventHubClient instance counts and lifetimes.
You could increase event processing power by simply adding more subscribers.

In my opinion you should directly call from your actors the event hub in a background thread with an internal memory queue. You should aggregate messages and use SendBatch to improve performance.
The event hub is able to receive the load by himself.

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string