How to ensure idempotency on Event Hubs for consumers that only store aggregated information?

I'm working on an event-driven microservices architecture and I'm using Event Hubs to send a lot of data (around 20-30k events per minute) to multiple consumer groups. I process these events with Azure Functions using the EventHubTrigger.
Each event carries a unique identifier, and my other consumers can guarantee idempotency because they store the events in their own data stores upon processing - so if the unique event identifier already exists, I can skip processing for that specific event.
I do however have one service that only does data aggregation for reporting to a relational database - doing counts, sums, and so on. These are pretty much upserts so that I can run queries against the result to produce reports - and I have seen quite a few events that were processed multiple times.
So one idea I had was to keep some sort of store of processed event identifiers: Redis with a TTL, Azure Table Storage, or even a table in my relational database that contains a single field with a unique constraint, so that I can wrap the whole event processing in one transaction.
Is there a better way to do this?
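The relational-table variant of that idea can be sketched roughly as follows. This is only an illustration under stated assumptions: SQLite stands in for the reporting database, and the table names (`processed_events`, `report`) and the function `apply_event` are hypothetical, not part of any Azure SDK.

```python
import sqlite3


def open_store() -> sqlite3.Connection:
    """Create an in-memory database with a dedup table and a report table."""
    conn = sqlite3.connect(":memory:")
    # The primary key acts as the unique constraint on the event identifier.
    conn.execute("CREATE TABLE processed_events (event_id TEXT PRIMARY KEY)")
    conn.execute(
        "CREATE TABLE report ("
        "metric TEXT PRIMARY KEY, total INTEGER NOT NULL, cnt INTEGER NOT NULL)"
    )
    return conn


def apply_event(conn: sqlite3.Connection, event_id: str, metric: str, amount: int) -> bool:
    """Apply one event to the aggregate. Returns False if it was a duplicate."""
    try:
        with conn:  # one transaction: commit on success, roll back on error
            # Fails with IntegrityError if this event_id was already recorded.
            conn.execute(
                "INSERT INTO processed_events (event_id) VALUES (?)", (event_id,)
            )
            updated = conn.execute(
                "UPDATE report SET total = total + ?, cnt = cnt + 1 WHERE metric = ?",
                (amount, metric),
            )
            if updated.rowcount == 0:
                conn.execute(
                    "INSERT INTO report (metric, total, cnt) VALUES (?, ?, 1)",
                    (metric, amount),
                )
        return True
    except sqlite3.IntegrityError:
        return False  # duplicate delivery: the aggregate is left untouched
```

Because the identifier insert and the aggregate update share one transaction, a redelivered event either fails on the unique constraint or was never recorded at all, so the counts are not applied twice.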

Related

Implementing transactional outbox event architecture in azure

I have a large message submitted to a rest service, it could be 100k or 50mb. I need to process it asynchronously, and it looks like the transactional outbox event pattern is suitable for my needs.
Effectively, my service would commit the data to a database, along with an event record, in a single transaction. A process would poll for events and push each event to a message queue. The event contains a reference to the data in the DB, usually through a unique identifier of some sort. The consumer of the queue would query the data from the database, do whatever it needs to do, and then remove the event record from the database.
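The flow described above can be sketched as follows. This is a minimal sketch and not an Azure implementation: SQLite stands in for the database, a `deque` stands in for the message queue, and the names `accept`, `relay_once`, and `consume` are hypothetical.

```python
import json
import sqlite3
from collections import deque


def open_db() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE payloads (id TEXT PRIMARY KEY, body TEXT NOT NULL)")
    conn.execute(
        "CREATE TABLE outbox ("
        "seq INTEGER PRIMARY KEY AUTOINCREMENT, "
        "payload_id TEXT NOT NULL, "
        "sent INTEGER NOT NULL DEFAULT 0)"
    )
    return conn


def accept(conn: sqlite3.Connection, payload_id: str, body: str) -> None:
    """Store the data and its event record in a single transaction."""
    with conn:
        conn.execute("INSERT INTO payloads (id, body) VALUES (?, ?)", (payload_id, body))
        conn.execute("INSERT INTO outbox (payload_id) VALUES (?)", (payload_id,))


def relay_once(conn: sqlite3.Connection, queue: deque) -> int:
    """Poll for unsent event records and push each one to the queue."""
    rows = conn.execute(
        "SELECT seq, payload_id FROM outbox WHERE sent = 0 ORDER BY seq"
    ).fetchall()
    for seq, payload_id in rows:
        # Publish first, then mark as sent: a crash in between causes a
        # resend, so delivery is at-least-once rather than exactly-once.
        queue.append(json.dumps({"payload_id": payload_id}))
        with conn:
            conn.execute("UPDATE outbox SET sent = 1 WHERE seq = ?", (seq,))
    return len(rows)


def consume(conn: sqlite3.Connection, queue: deque) -> str:
    """Take one message, load the referenced data, then remove the event record."""
    message = json.loads(queue.popleft())
    payload_id = message["payload_id"]
    body = conn.execute(
        "SELECT body FROM payloads WHERE id = ?", (payload_id,)
    ).fetchone()[0]
    with conn:
        conn.execute("DELETE FROM outbox WHERE payload_id = ?", (payload_id,))
    return body
```

The message carries only the identifier, so a 50 MB payload never travels through the queue; the consumer reads it from the database instead.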
This pattern is well documented. Here and here are two places.
I have a reasonable understanding of how one could implement this on-premises, in a .NET/SQL Server environment that we are familiar with. What would this look like in Azure? Are there other ways I can transactionally write to the database and a queue that do not require the outbox pattern? Or, following the outbox pattern, what would be the mechanism that polls for events in the DB, and what would provide the queue service?

Usually, if you want to use the transactional outbox pattern in Azure, you can use a Logic App or an Azure Function to read the events from the DB and send them to the queue.
This works especially well with the Cosmos DB change feed, because your architecture is then also reactive and performs well with less resource consumption.
To avoid this pattern you would have to find a queue in Azure that can take part in a transaction with your DB, and as far as I know that is not possible unless you use a third-party queue.

How to send message to Microsoft EventHub with Db Transaction?

I want to send an event to Microsoft Event Hubs as part of a DB transaction.
Explanation:
1. The user hits an order creation endpoint.
2. OrderService accepts the order and puts that order into the DB.
3. OrderService now wants to send the orderId as an event to other services using Event Hubs.
How can I achieve transactional behaviour for steps 2 and 3?
I know these solutions:
Outbox pattern: I put the message in another table within the order creation transaction. A cron job/scheduler then takes the messages from that table, sends them, and marks them as delivered, so that the next run picks up only the undelivered messages.
Database audit log: use the database audit log and a library that takes care of these things. The library binds the database table to Event Hubs, and on every update it sends that change to Event Hubs.
I would like to know: is there any built-in transactional feature in Event Hubs?
Or is there a better way to handle this?

There is no concept of transactions within Event Hubs at present. I'm not sure, given the limited context that was shared, that Event Hubs is the best fit for your scenario. Azure Service Bus has transaction support and may be a more natural fit for your intended flow.
In this kind of distributed scenario, regardless of which message broker you decide on, I would advise embracing eventual consistency and considering a pattern similar to:
1. Your order creation endpoint receives a request.
2. The order creation endpoint assigns a unique identifier to the request and emits the event to Event Hubs; if the send was successful, it returns a 202 (Accepted) to the caller with a Retry-After header telling the caller how long to wait before checking the status of that order's creation.
3. Some process is responsible for reading events from the Event Hub and creating that order within the database. Depending on your ecosystem's tolerance, this may be a dedicated process or could be something like an Azure Function with an Event Hubs trigger.
4. Other event consumers interested in orders will also see the creation request and will call into your order service or database for the details, using the unique identifier that was assigned by the order creation endpoint; this may or may not be the official order number within the system.
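As a rough illustration of the steps above, here is a minimal sketch. It makes several assumptions: `InMemoryHub` is a hypothetical stand-in for an Event Hubs producer, `Response` is a plain data class rather than any web framework's type, and a dict plays the role of the order database.

```python
import uuid
from dataclasses import dataclass


@dataclass
class Response:
    status: int
    headers: dict
    body: dict


class InMemoryHub:
    """Hypothetical stand-in for an event broker; it only collects events."""

    def __init__(self) -> None:
        self.events: list = []

    def send(self, event: dict) -> None:
        self.events.append(event)


def create_order(hub: InMemoryHub, request: dict, retry_after_seconds: int = 5) -> Response:
    """Assign an identifier, emit the event, and answer 202 with Retry-After."""
    request_id = str(uuid.uuid4())
    hub.send({"type": "OrderCreationRequested", "id": request_id, "order": request})
    return Response(
        status=202,
        headers={"Retry-After": str(retry_after_seconds)},
        body={"id": request_id},
    )


def project_orders(hub: InMemoryHub, database: dict) -> None:
    """Reader process: create each order once, keyed by the assigned identifier."""
    for event in hub.events:
        # setdefault makes reprocessing the same event a no-op.
        database.setdefault(event["id"], event["order"])
```

Keying the database write on the identifier assigned by the endpoint is what keeps the reader safe to run again after a redelivery.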

How to reliably store event to Azure CosmosDB and dispatch to Event Grid exactly once

I'm experimenting with event sourcing / cqrs pattern using serverless architecture in Azure.
I've chosen the Cosmos DB document database for the Event Store and Azure Event Grid for dispatching events to denormalizers.
How do I ensure that events are reliably delivered to Event Grid exactly once when the event is stored in Cosmos DB? I mean, if delivery to Event Grid fails, the event shouldn't be stored in the Event Store, should it?

Look into the Cosmos DB change feed. It is a built-in event raiser/queue for each change in the DB. You can register one or many listeners/handlers, e.g. Azure Functions.
This might be exactly what you are asking for.
Some suggest you can go directly to Cosmos DB and attach Event Grid behind the change feed.

You cannot, and you shouldn't do it anyway. Maybe there are some very complicated methods using distributed transactions, but they are not scalable. You cannot atomically store and publish events because you are writing to two different persistences with different transactional boundaries. You can have a synchronous CQRS monolith, but only if you are using the same technology for the event persistence and the read model persistence.
In CQRS the application is split in Write/Command and Read/Query sides (this long video may help). You are trying to unify the two parts into a single one, a downgrade if you will. Instead you should treat them separately, with different models (see Domain driven design).
The Write side should not depend on the outcome of the Read side. This means that after the Event store persists the events, the Write side is done. Also, the Write side should contain all the data it needs to do its job, which is emitting events based on the business rules.
If you have different technologies in the Write and Read part then your Read side should be decoupled from the Write side, that is, it should run in a separate thread/process.
One way to do this is to have a thread/process that listens for appends to the Event store, fetches new events, and then publishes them to the Event Grid. If this process fails or is restarted, it should resume from where it left off. I don't know if CosmosDB supports this, but MongoDB (also a document database) has the oplog, which you can tail to get the new events within a few milliseconds.
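The "resume from where it left off" behaviour can be sketched with a stored checkpoint. This is only a sketch under assumptions: a Python list stands in for the append-only Event store, a dict stands in for durable checkpoint storage, and `CheckpointedPublisher` is a hypothetical name.

```python
from typing import Callable


class CheckpointedPublisher:
    """Publishes events appended after the last recorded position."""

    def __init__(self, event_store: list, publish: Callable[[dict], None], checkpoint: dict) -> None:
        self.event_store = event_store  # stand-in for the append-only store
        self.publish = publish          # stand-in for the call to the broker
        self.checkpoint = checkpoint    # stand-in for durable checkpoint storage

    def run_once(self) -> int:
        """Publish everything after the checkpoint and return how many were sent."""
        position = self.checkpoint.get("position", 0)
        published = 0
        for event in self.event_store[position:]:
            self.publish(event)
            position += 1
            # Saved after the publish: a crash between the two lines resends
            # one event, so subscribers see at-least-once delivery.
            self.checkpoint["position"] = position
            published += 1
        return published
```

Since the checkpoint is written only after a successful publish, the process never skips an event, but denormalizers still need to tolerate the occasional duplicate.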

How to control idempotency of messages in an event-driven architecture?

I'm working on a project where DynamoDB is being used as database and every use case of the application is triggered by a message published after an item has been created/updated in DB. Currently the code follows this approach:
repository.save(entity);
messagePublisher.publish(event);
Udi Dahan has a video called Reliable Messaging Without Distributed Transactions where he talks about a solution to situations where a system can fail right after saving to DB but before publishing the message as messages are not part of a transaction. But in his solution I think he assumes using a SQL database as the process involves saving, as part of the transaction, the correlationId of the message being processed, the entity modification and the messages that are to be published. Using a NoSQL DB I cannot think of a clean way to store the information about the messages.
One solution would be to use DynamoDB Streams and subscribe to the published events, using either a Lambda or another service to transform them into domain-specific events. My problem with this is that I wouldn't be able to send the messages from the domain logic: the logic would be spread across the service processing the message and the Lambda/service reacting to changes, and the solution would be platform-specific.
Is there any other way to handle this?

I can't offer a specific solution based on DynamoDB since I've never used that engine. But I've built an event-driven system on top of MongoDB, so I can share what I learned, which you might find useful for your case.
You can have different approaches:
1) Based on an event sourcing approach, you can just save the events/messages your use case produces within a transaction. In Mongo, when you are just inserting/appending new items to the same collection, you can ensure atomicity. Even if the engine does not provide that capability, the write operation is so centralized that you reduce the possibility of an error to a minimum.
Once all the events are stored, you can then consume them and project them to a given state and then persist the updated state in another transaction.
Here you have to deal with eventual consistency as data will be stale in your read model until you have projected the events.
2) Another approach is applying the UnitOfWork pattern where you cache all the query operations (insert/update/delete) to save both events and the state. Once your use case finishes, you execute all the cached queries against the database (flush). This way although the operations are not atomic you are again centralizing them quite enough to minimize errors.
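A minimal sketch of the second approach might look like this. The `UnitOfWork` class and its method names are hypothetical, and a nested dict stands in for the database; as noted above, the flush is not atomic, it only keeps the window for failure small.

```python
class UnitOfWork:
    """Caches write operations and applies them together at the end of a use case."""

    def __init__(self, database: dict) -> None:
        self._database = database  # stand-in: {collection name: {key: value}}
        self._pending: list = []

    def register_put(self, collection: str, key: str, value: dict) -> None:
        self._pending.append(("put", collection, key, value))

    def register_delete(self, collection: str, key: str) -> None:
        self._pending.append(("delete", collection, key, None))

    def flush(self) -> int:
        """Apply all cached operations in order and return how many were applied."""
        for operation, collection, key, value in self._pending:
            table = self._database.setdefault(collection, {})
            if operation == "put":
                table[key] = value
            else:
                table.pop(key, None)
        applied = len(self._pending)
        self._pending.clear()
        return applied
```

A use case would register both the state change and the event it produced, then call `flush()` once at the end, so nothing reaches the database if the use case fails halfway through.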
Of course the best is to use an ACID database if you require that capability and any other approach will be a workaround to get close to it.
About publishing the events: I don't know if you mean they are published to a messaging transport such as RabbitMQ, Kafka, etc. But that must be a background process that fetches the events from the DB and publishes them, so that you avoid a two-phase commit within the same transaction.

I am not sure which NoSQL is suitable for my scenario

I am trying to design a cloud-based system (IaaS) that will gather data from sensors (water pollution related activity) and, upon certain events, will decide to process the data for a specific sensor.
Data characteristics are:
1. For each sensor data is being sent once every couple of days (up to 6 times a month)
2. each sensor reading contains about 5000 events that are encapsulated in 50-100 messages that are sent to the server (such "session" takes about 20 minutes where messages are sent every 5 seconds)
3. I am building the system to handle rate of 30,000 messages per second.
4. Processing of the data doesn't need to be real time; I have about 10 minutes once the "session" is finished to do the processing.
5. 90% of the sessions are not interesting and can be thrown away once they are finished. The other 10% contain one or more events, encapsulated in the messages, based on which I need to decide whether to process the entire session data and send an alert to the sensor that there is pollution.
I created a tool that generates 5000 messages per second and I am trying to figure out which database would be the most optimal for my scenario.
These are the databases I am thinking to try:
Cassandra - For each session I will keep an in-memory collection of keys. The keys are for the messages that are stored in Cassandra. Once I detect a message that contains bad readings, I will need to pull all of the other messages in the "session" and process them (that means 50-100 requests to Cassandra). My concern here is about write performance (since I have many read and write operations), and I don't have a good strategy for deleting the 90% of sessions that are not needed.
Couchbase - I will save a document for each "session" according to sensorID and will append each message to the document. Once I detect a message that contains bad readings, I will only need to send one request for the document. My concern here is about the read performance.
Redis - Use it like Cassandra. I assume performance will be the best, but I will need to handle the sharding and replication of data myself in order not to reach the memory limit.
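Independently of the database chosen, the per-session logic described above (collect messages per sensor, flag the session when a bad reading arrives, drop the rest) can be sketched like this. It is a store-agnostic illustration only: `SessionBuffer` is a hypothetical name, the messages are kept in process memory, and the bad-reading test is supplied by the caller.

```python
from collections import defaultdict
from typing import Callable, Optional


class SessionBuffer:
    """Collects messages per sensor session and keeps only flagged sessions."""

    def __init__(self, is_bad_reading: Callable[[dict], bool]) -> None:
        self._messages = defaultdict(list)  # sensor_id -> messages of the open session
        self._flagged = set()               # sensor_ids whose session had a bad reading
        self._is_bad_reading = is_bad_reading

    def add(self, sensor_id: str, message: dict) -> None:
        self._messages[sensor_id].append(message)
        if self._is_bad_reading(message):
            self._flagged.add(sensor_id)

    def finish(self, sensor_id: str) -> Optional[list]:
        """Close a session: return its messages if flagged, otherwise discard them."""
        messages = self._messages.pop(sensor_id, [])
        if sensor_id in self._flagged:
            self._flagged.discard(sensor_id)
            return messages
        return None
```

With this shape, the uninteresting 90% of sessions are dropped with a single `pop`, and only flagged sessions are handed on for full processing.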
I would love to hear which option would be the most appropriate
thanks

Regarding Redis: you may consider using a DAAS (Data as a Service). The service will manage all the instances, clusters, scaling, data persistence, and high availability settings for you.
One example, is Redis Cloud by Redis Labs

This is an interesting one. Let's go back to the basics of the CAP theorem and choose a DB based on the need for consistency, availability, and partition tolerance.
For high consistency and availability, choose MySQL, PostgreSQL, Greenplum, Vertica, Neo4J.
For high availability and partition tolerance, use Cassandra, Voldemort, Dynamo, CouchDB, Riak.
For high consistency and partition tolerance, use HBase, Redis, MongoDB, BerkeleyDB, BigTable.
So my vote is for Cassandra here.
