Spring Integration Feed Inbound Channel Adapter duplicate entries - spring-integration

I am using Spring Integration to consume RSS feeds using its inbound channel adapter and writing the feeds to a database table.
To prevent duplicate entries when the process is stopped/started, I have enabled the PropertiesPersistingMetadataStore. As a secondary measure, on the database table, I also have a unique constraint across the feed id/feed entry link columns.
This seems to be working fine but I have noticed on some restarts (not all the time) that I am getting some DB exception errors where it is trying to insert the same RSS feed item again.
Under what conditions would I be getting these duplicate errors, and is there any way I can get around them?

The PropertiesPersistingMetadataStore only persists its state on a normal application shutdown (when the bean is destroy()ed by the application context).
However, it implements Flushable, so you can call flush() on it from your flow after the entry has been persisted.
For example, you could use transaction synchronization to flush the store after the database transaction commits, with an after-commit expression such as #metadataStore.flush().
Or, you could use a more robust persistent store, such as Redis, which persists on each update.
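A minimal sketch of the flush-after-persist idea in Java config (the bean name, channel name, and base directory are assumptions, not the original poster's code); the transaction-synchronization variant mentioned above would only flush after the commit, which is more robust if the insert can roll back:

import java.io.IOException;

import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.integration.annotation.ServiceActivator;
import org.springframework.integration.config.EnableIntegration;
import org.springframework.integration.metadata.PropertiesPersistingMetadataStore;
import org.springframework.messaging.Message;

@Configuration
@EnableIntegration
public class FeedMetadataConfig {

    @Bean
    public PropertiesPersistingMetadataStore metadataStore() {
        PropertiesPersistingMetadataStore store = new PropertiesPersistingMetadataStore();
        store.setBaseDirectory("/var/data/feed-metadata"); // assumed location
        return store;
    }

    // Tail of the flow: invoked once the feed entry has been written to the DB,
    // so the "already processed" key is persisted even if the app later crashes.
    @ServiceActivator(inputChannel = "afterDbWriteChannel")
    public void flushMetadataStore(Message<?> feedEntry) throws IOException {
        metadataStore().flush();
    }
}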

Related

How to control idempotency of messages in an event-driven architecture?

I'm working on a project where DynamoDB is being used as the database, and every use case of the application is triggered by a message published after an item has been created/updated in the DB. Currently the code follows this approach:
repository.save(entity);
messagePublisher.publish(event);
Udi Dahan has a video called Reliable Messaging Without Distributed Transactions where he talks about a solution to situations where a system can fail right after saving to the DB but before publishing the message, since messages are not part of the transaction. But in his solution I think he assumes a SQL database, as the process involves saving, as part of the transaction, the correlationId of the message being processed, the entity modification, and the messages that are to be published. Using a NoSQL DB, I cannot think of a clean way to store the information about the messages.
A solution would be to use DynamoDB Streams and subscribe to the published events using either a Lambda or another service to transform them into domain-specific events. My problem with this is that I wouldn't be able to send the messages from the domain logic; the logic would be spread across the service processing the message and the Lambda/service reacting to changes, and the solution would be platform-specific.
Is there any other way to handle this?
I can't give a specific solution for DynamoDB since I've never used that engine, but I've built an event-driven system on top of MongoDB, so I can share some learnings you might find useful for your case.
You can have different approaches:
1) With an event sourcing approach you can just save the events/messages your use case produces within a transaction. In Mongo, when you are only inserting/appending new items to the same collection, you can ensure atomicity. Even if the engine does not provide that capability, the write operations are so centralized that you reduce the possibility of an error to a minimum.
Once all the events are stored, you can then consume them and project them to a given state and then persist the updated state in another transaction.
Here you have to deal with eventual consistency as data will be stale in your read model until you have projected the events.
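For what it's worth, this first approach can also be applied to DynamoDB, which supports transactional writes across items via TransactWriteItems. A rough sketch with the AWS SDK v2 (the "orders" and "outbox" table names and their attributes are made up for illustration):

import java.util.Map;
import java.util.UUID;

import software.amazon.awssdk.services.dynamodb.DynamoDbClient;
import software.amazon.awssdk.services.dynamodb.model.AttributeValue;
import software.amazon.awssdk.services.dynamodb.model.Put;
import software.amazon.awssdk.services.dynamodb.model.TransactWriteItem;
import software.amazon.awssdk.services.dynamodb.model.TransactWriteItemsRequest;

public class SaveEntityAndEvent {

    public static void main(String[] args) {
        DynamoDbClient dynamo = DynamoDbClient.create();

        // The state change produced by the use case (hypothetical "orders" table).
        Put saveOrder = Put.builder()
                .tableName("orders")
                .item(Map.of(
                        "pk", AttributeValue.builder().s("ORDER#123").build(),
                        "status", AttributeValue.builder().s("CONFIRMED").build()))
                .build();

        // The domain event to be published later (hypothetical "outbox" table).
        Put saveEvent = Put.builder()
                .tableName("outbox")
                .item(Map.of(
                        "pk", AttributeValue.builder().s(UUID.randomUUID().toString()).build(),
                        "type", AttributeValue.builder().s("OrderConfirmed").build(),
                        "payload", AttributeValue.builder().s("{\"orderId\":\"123\"}").build(),
                        "published", AttributeValue.builder().bool(false).build()))
                .build();

        // Both writes succeed or fail together, so a crash between "save" and
        // "publish" can no longer lose the event.
        dynamo.transactWriteItems(TransactWriteItemsRequest.builder()
                .transactItems(
                        TransactWriteItem.builder().put(saveOrder).build(),
                        TransactWriteItem.builder().put(saveEvent).build())
                .build());
    }
}

The event row then sits in the outbox table until a background publisher picks it up, which is the part the later sketch covers.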
2) Another approach is to apply the Unit of Work pattern, where you cache all the query operations (insert/update/delete) needed to save both the events and the state. Once your use case finishes, you execute all the cached queries against the database (flush). This way, although the operations are not atomic, you again centralize them enough to minimize errors.
Of course, the best option is to use an ACID database if you require that capability; any other approach is a workaround that only gets close to it.
As for publishing the events, I don't know if you mean publishing them to a messaging transport such as RabbitMQ, Kafka, etc. But that should be a background process in which you fetch the events from the DB and publish them, so that you avoid a two-phase commit spanning the database and the broker.
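A sketch of such a background publisher (the EventStore interface, the table it reads from, and the topic name are hypothetical; only the Kafka producer API is real): it polls the database for events that have not been published yet, pushes them to the broker, and marks them as published once the broker acknowledges them.

import java.time.Duration;
import java.util.List;
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class OutboxPublisher {

    // Hypothetical port over whichever database the use cases store their events in.
    public interface EventStore {
        List<StoredEvent> fetchUnpublished(int limit);
        void markPublished(String eventId);
    }

    public record StoredEvent(String id, String type, String payload) {}

    private final EventStore eventStore;
    private final KafkaProducer<String, String> producer;

    public OutboxPublisher(EventStore eventStore, Properties kafkaProps) {
        this.eventStore = eventStore;
        this.producer = new KafkaProducer<>(kafkaProps, new StringSerializer(), new StringSerializer());
    }

    // Poll the DB for events written by the use cases and push them to the broker.
    public void run() throws InterruptedException {
        while (true) {
            for (StoredEvent event : eventStore.fetchUnpublished(100)) {
                producer.send(new ProducerRecord<>("domain-events", event.id(), event.payload()),
                        (metadata, error) -> {
                            if (error == null) {
                                eventStore.markPublished(event.id()); // at-least-once: consumers must be idempotent
                            }
                        });
            }
            producer.flush();
            Thread.sleep(Duration.ofSeconds(1).toMillis());
        }
    }
}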

Detect a new record was added to cassandra table

I have a requirement: when a new comment is posted, I want to get all the previous comments' owner ids and send them a notification.
The problem is how I will know that a new comment was added to the Cassandra table. What would the solution for this kind of requirement be?
If you want to use only Cassandra, without any changes, it's impossible.
With changes, you have three options:
You can use Cassandra as an embedded service in Java. Here is a simple and short how-to: http://prettyprint.me/prettyprint.me/2010/02/14/running-cassandra-as-an-embedded-service/index.html
You can create a wrapper for your Cassandra connection: an application which handles the Cassandra connection and is available via an API to your other applications.
Cassandra has trigger functionality (I have never used it, and never heard of anyone using it).
I prefer the second solution. Here are the reasons why:
It's simpler to create.
You can handle all your views in this application.
You can validate the input, resolve relations, log data, etc.
You can simply push the newly added comment to Kafka or another message queue.
This could be a setup:
Create a new comment -> call a backend API -> call the Cassandra database interface -> push a new message to Kafka -> send the data to all Kafka consumers
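A rough sketch of that database-interface/wrapper idea in Java (the comments table, its columns, and the Kafka topic name are assumptions): every write goes through one service method, which inserts the comment with the DataStax driver and then pushes a message to Kafka for the notification consumers.

import java.time.Instant;
import java.util.Properties;
import java.util.UUID;

import com.datastax.oss.driver.api.core.CqlSession;
import com.datastax.oss.driver.api.core.cql.PreparedStatement;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class CommentService {

    private final CqlSession session;
    private final PreparedStatement insertComment;
    private final KafkaProducer<String, String> producer;

    public CommentService(CqlSession session, Properties kafkaProps) {
        this.session = session;
        this.insertComment = session.prepare(
                "INSERT INTO comments (post_id, comment_id, owner_id, body, created) VALUES (?, ?, ?, ?, ?)");
        this.producer = new KafkaProducer<>(kafkaProps, new StringSerializer(), new StringSerializer());
    }

    // Every comment is written through this API, so the application always knows when one was added.
    public void addComment(UUID postId, UUID ownerId, String body) {
        UUID commentId = UUID.randomUUID();
        session.execute(insertComment.bind(postId, commentId, ownerId, body, Instant.now()));

        // Notify interested consumers (e.g. a notification service that looks up the
        // previous commenters' owner ids for this post) that a new comment exists.
        producer.send(new ProducerRecord<>("new-comments", postId.toString(), commentId + ":" + ownerId));
    }
}

A consumer subscribed to the new-comments topic can then query the earlier comments for that post and notify their owners.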

spring batch design advice for processing 50k files

We have more than 50k files coming in every day that need to be processed. For that we have developed POC apps with a design like this:
A polling app continuously picks up files from the FTP zone.
It validates each file and creates metadata in a DB table.
Another poller picks 10-20 files from the DB (only file id and status) and delivers them to slave apps as messages.
A slave app takes a message and launches a Spring Batch job, which reads the data, does business validation in processors, and writes the validated data to the DB/another file.
We used Spring Integration and Spring Batch for this POC.
Is it a good idea to launch the Spring Batch job in the slaves, or to directly implement the read, process and write logic as plain Java or Spring bean objects?
I need some insight on launching this job, where each slave can have 10-25 MDPs (Spring message-driven POJOs) and each of these MDPs launches a job.
Note: each file will have at most 30-40 thousand records.
Generally, using Spring Integration and Spring Batch for such tasks is a good idea. This is what they are intended for.
With regard to Spring Batch, you get the whole retry, skip and restart handling out of the box. Moreover, you have all those readers and writers that are optimised for bulk operations. This works very well and you only have to concentrate on writing the appropriate mappers and such.
If you use plain Java or Spring bean objects, you will probably end up developing that infrastructure code yourself, including all the effort needed for testing and so on.
Concerning your design:
Besides validating the file and creating the metadata entry, you could consider loading the entries directly into a database table. This would give you better "transactional" control if something fails. Your load job could look something like this:
step1:
tasklet to create an entry in the metadata table with columns like
FILE_TO_PROCESS: XY.txt
STATE: START_LOADING
DATE: ...
ATTEMPT: ... first attempt
step2:
read and validate each line of the file and store it in a data table
DATA: ........
STATE:
FK_META_TABLE: foreign key to the meta table
step3:
update the metadata table entry
STATE: LOAD_COMPLETED
So, as soon as your metadata table entry gets the state LOAD_COMPLETED, you know that all entries of the file have been validated and are ready for further processing.
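A rough Spring Batch (4.x-style) sketch of such a three-step load job; the table names, the FileRecord type, and the injected reader/processor/writer beans are assumptions for illustration, not the actual POC code:

import org.springframework.batch.core.Job;
import org.springframework.batch.core.Step;
import org.springframework.batch.core.configuration.annotation.EnableBatchProcessing;
import org.springframework.batch.core.configuration.annotation.JobBuilderFactory;
import org.springframework.batch.core.configuration.annotation.StepBuilderFactory;
import org.springframework.batch.item.ItemProcessor;
import org.springframework.batch.item.ItemReader;
import org.springframework.batch.item.ItemWriter;
import org.springframework.batch.repeat.RepeatStatus;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.jdbc.core.JdbcTemplate;

@Configuration
@EnableBatchProcessing
public class FileLoadJobConfig {

    // Hypothetical holder for one validated line of the input file.
    public static class FileRecord {
        public long metaId;
        public String data;
    }

    @Bean
    public Job fileLoadJob(JobBuilderFactory jobs, StepBuilderFactory steps, JdbcTemplate jdbc,
                           ItemReader<FileRecord> reader,
                           ItemProcessor<FileRecord, FileRecord> validator,
                           ItemWriter<FileRecord> writer) {

        // step1: register the file in the metadata table with STATE = START_LOADING.
        Step registerFile = steps.get("registerFile")
                .tasklet((contribution, chunkContext) -> {
                    // In a real job the file name would come from a job parameter.
                    jdbc.update("INSERT INTO file_meta (file_to_process, state, load_date, attempt) "
                            + "VALUES (?, 'START_LOADING', CURRENT_TIMESTAMP, 1)", "XY.txt");
                    return RepeatStatus.FINISHED;
                })
                .build();

        // step2: read and validate each line, then write it to the data table.
        Step loadData = steps.get("loadData")
                .<FileRecord, FileRecord>chunk(1000)
                .reader(reader)
                .processor(validator)
                .writer(writer)
                .build();

        // step3: mark the metadata entry as LOAD_COMPLETED.
        Step markLoaded = steps.get("markLoaded")
                .tasklet((contribution, chunkContext) -> {
                    jdbc.update("UPDATE file_meta SET state = 'LOAD_COMPLETED' WHERE file_to_process = ?",
                            "XY.txt");
                    return RepeatStatus.FINISHED;
                })
                .build();

        return jobs.get("fileLoadJob")
                .start(registerFile)
                .next(loadData)
                .next(markLoaded)
                .build();
    }
}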
If something fails, you can just fix the file and reload it.
Then, for further processing, you could simply have jobs that poll periodically and check whether there is new data in the database that should be processed. If more than one file has been loaded during the last period, simply process all the files that are ready.
You could even have several slave processes polling from time to time. Just do a read-for-update on the state of the metadata table, or use an optimistic locking approach, to prevent several slaves from trying to process the same entries (a sketch of the optimistic locking variant follows below).
With this solution, you don't need a messaging infrastructure and you can still scale the whole application without any problems.
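A small sketch of that optimistic-locking claim (plain JDBC with an in-memory H2 database; the table and column names are made up): several pollers may see the same LOAD_COMPLETED entry, but only the one whose conditional UPDATE actually matches the expected state wins and gets to process the file.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.Statement;

public class ClaimFileForProcessing {

    // Returns true if this worker claimed the metadata row, false if another worker got there first.
    static boolean claim(Connection connection, long metaId) throws Exception {
        try (PreparedStatement ps = connection.prepareStatement(
                "UPDATE file_meta SET state = 'PROCESSING' WHERE id = ? AND state = 'LOAD_COMPLETED'")) {
            ps.setLong(1, metaId);
            return ps.executeUpdate() == 1; // 0 rows updated means someone else claimed it already
        }
    }

    public static void main(String[] args) throws Exception {
        try (Connection connection = DriverManager.getConnection("jdbc:h2:mem:demo", "sa", "");
             Statement setup = connection.createStatement()) {
            setup.execute("CREATE TABLE file_meta(id BIGINT PRIMARY KEY, state VARCHAR(32))");
            setup.execute("INSERT INTO file_meta VALUES (42, 'LOAD_COMPLETED')");

            System.out.println(claim(connection, 42L)); // true  -> this poller processes the file
            System.out.println(claim(connection, 42L)); // false -> a second attempt is rejected
        }
    }
}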

Is it possible to specify the Update Sequence from which to start a CouchDB replication?

I'm using CouchDB in an application that has Account objects, Event objects and Ticket objects (one-to-many-to-many). The basic setup is that each Account has its own database, in which all the Event and Ticket objects are stored, and each Ticket object stores a reference to its parent Event object. The events and tickets are then consumed by a mobile client application following the _changes feed for the database.
I've come to a point where the primary database has become so big that the _changes feed being consumed has become unbearably slow. So my plan is to have another database for each Event object that only contains tickets for that event, but I still need to keep the primary database for other reasons. So I'm setting up a replication from the primary database to the event database using a filter that only allows tickets for the correct event to pass through.
Because creating an event is a discrete operation, I know that the replication for a new event can start from the primary database's update sequence just before the event is created, removing the need for it to run through all the earlier, irrelevant documents. So my question is: when I start a replication, can I specify where it should start from? Basically like passing ?since=XXX to the _changes feed on the mobile client.
Yes, that's possible. Just add the since_seq attribute to the replication document.
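A minimal sketch of what that could look like (hypothetical host, database, and filter names): the since_seq value is the primary database's update_seq captured when the event is created (a number on CouchDB 1.x, an opaque string on 2.x and later), and the same field can also be used in a document saved to the _replicator database.

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class StartEventReplication {
    public static void main(String[] args) throws Exception {
        // Replication document: replicate only this event's tickets, starting at since_seq.
        String replicationDoc = """
                {
                  "source": "http://127.0.0.1:5984/primary",
                  "target": "http://127.0.0.1:5984/event_42",
                  "filter": "tickets/by_event",
                  "query_params": { "event_id": "42" },
                  "since_seq": 123456,
                  "continuous": true
                }
                """;

        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://127.0.0.1:5984/_replicate"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(replicationDoc))
                .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode() + " " + response.body());
    }
}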

Custom Logging mechanism: Master Operation with n-Operation Details or Child operations

I'm trying to implement a logging mechanism in a Service-Workflow hybrid application. The requirement for logging is that, instead of being an independent log action, each log entry must be treated as a detail operation and placed against a parent/master operation. So it's a parent-child relationship, and it goes to database table(s). This is the primary reason NLog failed.
To help understand better, I'll go into some generic detail. This is how the application flow goes:
Now, the main entry point of the application (normally called Program.cs) is the Platform. It initializes an engine that is capable of listening for incoming calls from ISDN lines, VoIP, or web services. The interface is generic, so any call that reaches the Platform triggers OnConnecting(). OnConnecting() is a thread-safe event and can be triggered as many times as the system requires.
Within OnConnecting(), a new instance of our custom Workflow manager is launched and the context is a custom object called ProcessingInfo:
new WorkflowManager<ZeProcessingInfo>();
Where, ZeProcessingInfo:
var ZeProcessingInfo = new ProcessingInfo(this, new LogMaster());
As you can see, the ProcessingInfo is composed of Platform itself and a new instance of LogMaster. LogMaster is defined in an independent assembly.
Now this LogMaster is available throughout the WorkflowManager, all the Workflows it launches, all the activities within any running Workflow, and passed on to external code called from within any Activity. Now, when a new LogMaster is initialized, a Master Operation entry is created in the database and this LogMaster object now lives until this call is ended after a series of very serious roller coaster rides through different workflows. Upon every call of OnConnecting(), a new Master Operation is created and maintained.
The LogMaster allows for calling an AddDetail() method that adds a new child detail under the internally stored Master Operation (distinguished through a Guid primary key). The LogMaster is built upon Entity Framework.
And I'm able to log under the same Master Operation as many times as I require. But the application requirements are changing and there is now a need to log from other assemblies. There is a Platform Server assembly, which is a Windows Service that acts as a server listening to web-service-based calls; once a client calls a method, OnConnecting() in Platform is triggered.
I need a mechanism to somehow retrieve the related LogMaster object so that I can add detail to the same Master Operation. But Platform Server is the one triggering OnConnecting() on the Platform and thus instantiating LogMaster. This creates a redundancy loop.
Failure scenarios are being considered as well. If LogMaster fails, I need to fall back from database logging to event logging. If event logging fails (or is not allowed through the unified configuration), I need to fall back to file-based (XML) logging.
I hope I have given a rough idea. I don't expect code, but I need some strategy for a seamless, pluggable, configurable logging mechanism that supports Master-Child operations.
Thanks for reading. Any help would be much appreciated.
I've read this question a number of times and it was pretty hard to figure out what was going on. I don't think your diagram helps at all. If your question is about trying to retrieve the master log record when writing child log records then I would forget about trying to create normalised data in the log tables. You will just slow down the transactional system in trying to do so. You want the log/audit records to write as fast as possible and you can later aggregate them when you want to read them.
Create a de-normalised table for the log entries and use a single Guid in that table to track the session/parent log master. Yes, this will be a big table, but it will write fast.
As for guaranteed delivery of log messages to a destination, I would try not to create multiple destinations, as combining them later will be a nightmare. Instead, use something like MSMQ to emit the audit logs as fast as possible and have another service pick them up and process them in a guaranteed-delivery manner. ETW (Event Logging) is not guaranteed under load, and you will not know that it has failed.
