Store messages of a messaging application - Cassandra

I am developing a messaging app, using Cassandra as the database and Kafka as the message broker.
My question is:
Do I need to store the messages exchanged between users in Cassandra? If I do, the size of my database will grow really fast.
As I am using a message queue, messages are only stored as long as they have not been delivered. I have heard that messaging apps (such as Facebook Messenger and WhatsApp) do not store the message content between users in a database, but only use a queuing system (XMPP, MQTT) that deletes messages as soon as they are delivered, so there is no need to store them in an external database. Am I right?
What is the best practice? Also, am I required, from a legal perspective (government regulations or the like), to retain message content for a period of time (for example, 2 years)?
Looking at http://www.planetcassandra.org/apache-cassandra-use-cases/, many projects use Cassandra as the database backend for messaging apps. However, it is an anti-pattern to use Cassandra as a message queue (see the Cassandra docs).

Using Cassandra as a queue is clearly an anti-pattern.
However, Cassandra is a good fit for storing messages. Read my blog post on KillrChat (http://www.doanduyhai.com/blog/?p=1859) for a possible data model for message storage.
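For reference, here is a minimal sketch (not the KillrChat model itself) of a chat-message table using the Node.js cassandra-driver; the keyspace, table, and column names are made up for illustration. The room is the partition key and a time-based clustering column keeps a conversation's messages ordered:

const cassandra = require('cassandra-driver');
const client = new cassandra.Client({ contactPoints: ['127.0.0.1'], localDataCenter: 'datacenter1', keyspace: 'chat' });

// One partition per conversation, messages clustered by time (newest first):
// CREATE TABLE messages (
//   room_id text, message_id timeuuid, sender text, body text,
//   PRIMARY KEY ((room_id), message_id)
// ) WITH CLUSTERING ORDER BY (message_id DESC);

async function saveMessage(roomId, sender, body) {
  // Write path: one insert per message.
  await client.execute(
    'INSERT INTO messages (room_id, message_id, sender, body) VALUES (?, now(), ?, ?)',
    [roomId, sender, body],
    { prepare: true }
  );
}

async function lastMessages(roomId, limit) {
  // Read path: a single partition read returns the latest messages of a room.
  const rs = await client.execute(
    'SELECT message_id, sender, body FROM messages WHERE room_id = ? LIMIT ?',
    [roomId, limit],
    { prepare: true }
  );
  return rs.rows;
}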

Related

What are the advantages and disadvantages of using Kafka as a storage layer?

I have 2 approaches:
Approach #1
Kafka --> Spark Stream (processing data) --> Kafka -(Kafka Consumer)-> Nodejs (Socket.io)
Approach #2
Kafka --> Kafka Connect (processing data) --> MongoDB -(mongo-oplog-watch)-> Nodejs (Socket.io)
Note: in Approach #2, I use mongo-oplog-watch to detect when data is inserted. (A rough sketch of the last hop of Approach #1 is shown below.)
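For concreteness, here is a rough sketch of that last hop of Approach #1 (Kafka consumer pushing into Socket.IO), assuming the kafkajs client and a made-up processed-messages topic; it only illustrates the shape of the pipeline, not a production setup:

const { Kafka } = require('kafkajs');
const io = require('socket.io')(3000);

const kafka = new Kafka({ clientId: 'chat-gateway', brokers: ['localhost:9092'] });
const consumer = kafka.consumer({ groupId: 'socket-gateway' });

async function run() {
  await consumer.connect();
  await consumer.subscribe({ topic: 'processed-messages', fromBeginning: false });
  await consumer.run({
    eachMessage: async ({ message }) => {
      const event = JSON.parse(message.value.toString());
      // Fan the processed event out to the clients of the target room.
      io.to(event.roomId).emit('message', event);
    },
  });
}

run().catch(console.error);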
What are the advantages and disadvantages of using Kafka as storage versus another store such as MongoDB, in a real-time application context?
Kafka topics typically have a retention period (7 days by default) after which messages are deleted. That said, there is no hard rule that you must not persist data in Kafka.
You can set the topic retention period to -1 (reference)
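As an illustration (not the only way to do it), creating such a topic from Node.js with the kafkajs admin client could look like this; the topic name and partition count are arbitrary, and the relevant part is retention.ms set to -1 so the broker never deletes messages by age:

const { Kafka } = require('kafkajs');

const kafka = new Kafka({ clientId: 'admin-script', brokers: ['localhost:9092'] });
const admin = kafka.admin();

async function createPersistentTopic() {
  await admin.connect();
  await admin.createTopics({
    topics: [{
      topic: 'chat-messages',
      numPartitions: 3,
      replicationFactor: 1,
      // retention.ms = -1 disables time-based deletion for this topic.
      configEntries: [{ name: 'retention.ms', value: '-1' }],
    }],
  });
  await admin.disconnect();
}

createPersistentTopic().catch(console.error);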
The only problem I know of with persisting data in Kafka is security. Out of the box (at least as of now), Kafka doesn't provide data-at-rest encryption; you need a custom (or home-grown) solution to get that.
Protecting data-at-rest in Kafka with Vormetric
There is also a KIP for it, but it is still under discussion:
Add end to end encryption in Kafka (KIP)
MongoDB, on the other hand, does seem to provide data-at-rest encryption:
Security data at rest in MongoDB
Most importantly, it also depends on the type of data you are going to store and what you want to do with it.
If you are dealing with data that is not a simple key-value model (i.e., give the key, get the value), for example data that you query by indexed fields (as you typically do with logs), then MongoDB probably makes sense.
In simple words: if you query by more than one field (other than the key), storing the data in MongoDB could make sense. If you intended to use Kafka for such a purpose, you would probably end up creating a topic for every field that needs to be queried, which is too much.
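A small sketch of what "querying by more than one field" looks like in MongoDB, assuming the official Node.js driver and a made-up logs collection; doing the equivalent in Kafka would mean a dedicated topic (or a stream-processing job) per access pattern:

const { MongoClient } = require('mongodb');

async function findErrorsForService(serviceName) {
  const client = await MongoClient.connect('mongodb://localhost:27017');
  const logs = client.db('ops').collection('logs');

  // A compound index supports queries on several fields other than the key.
  await logs.createIndex({ service: 1, level: 1, timestamp: -1 });

  const results = await logs
    .find({ service: serviceName, level: 'ERROR' })
    .sort({ timestamp: -1 })
    .limit(100)
    .toArray();

  await client.close();
  return results;
}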

How to control idempotency of messages in an event-driven architecture?

I'm working on a project where DynamoDB is used as the database, and every use case of the application is triggered by a message published after an item has been created/updated in the DB. Currently the code follows this approach:
repository.save(entity);
messagePublisher.publish(event);
Udi Dahan has a video called Reliable Messaging Without Distributed Transactions where he talks about a solution for situations where a system can fail right after saving to the DB but before publishing the message, since messages are not part of a transaction. But in his solution I think he assumes a SQL database, because the process involves saving, as part of the transaction, the correlation ID of the message being processed, the entity modification, and the messages that are to be published. Using a NoSQL DB, I cannot think of a clean way to store the information about the messages.
One solution would be to use DynamoDB Streams and subscribe to the published events, using either a Lambda or another service to transform them into domain-specific events. My problem with this is that I wouldn't be able to send the messages from the domain logic; the logic would be spread across the service processing the message and the Lambda/service reacting to changes, and the solution would be platform-specific.
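For illustration only, a minimal sketch of that Streams-based option, assuming a Node.js Lambda attached to the table's stream and an SNS topic for the domain events (the topic and field names are made up); it shows where the "turn a DB change into a domain event" logic would live:

const AWS = require('aws-sdk');
const sns = new AWS.SNS();

// Lambda handler wired to the DynamoDB stream of the table.
exports.handler = async (event) => {
  for (const record of event.Records) {
    if (record.eventName !== 'INSERT' && record.eventName !== 'MODIFY') continue;

    // Convert the stream record's new image into a plain JS object.
    const item = AWS.DynamoDB.Converter.unmarshall(record.dynamodb.NewImage);

    // Translate the low-level change into a domain-specific event.
    const domainEvent = { type: 'EntityChanged', entityId: item.id, payload: item };

    await sns.publish({
      TopicArn: process.env.DOMAIN_EVENTS_TOPIC_ARN,
      Message: JSON.stringify(domainEvent),
    }).promise();
  }
};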
Is there any other way to handle this?
I can't suggest a specific solution based on DynamoDB since I've never used that engine, but I've built an event-driven system on top of MongoDB, so I can share some learnings you might find useful for your case.
You can have different approaches:
1) With an event-sourcing approach, you can just save the events/messages your use case produces within a transaction. In Mongo, a single insert (or any write to a single document) is atomic, so if you append all the new events in one operation you can ensure atomicity. Even where the engine does not provide that capability, the write operation is so centralized that you reduce the possibility of an error to a minimum.
Once all the events are stored, you can consume them, project them into a given state, and persist the updated state in another transaction.
Here you have to deal with eventual consistency as data will be stale in your read model until you have projected the events.
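A minimal sketch of option 1 with the Node.js MongoDB driver, using a made-up events collection; it leans on the fact that a single-document insert is atomic, so the use case performs exactly one write:

const { MongoClient } = require('mongodb');

async function placeOrder(orderId, lines) {
  const client = await MongoClient.connect('mongodb://localhost:27017');
  const events = client.db('shop').collection('events');

  // All events produced by the use case are persisted in ONE insert:
  // either the whole batch is recorded or none of it is.
  await events.insertOne({
    streamId: orderId,
    occurredAt: new Date(),
    published: false,
    events: [
      { type: 'OrderPlaced', orderId, lines },
      { type: 'StockReserved', orderId },
    ],
  });

  await client.close();
}

A background process can later pick up the documents with published: false, project them into the read model, and push them to the broker, which matches the publishing note at the end of this answer.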
2) Another approach is to apply the Unit of Work pattern, where you buffer all the write operations (insert/update/delete) needed to save both the events and the state. Once your use case finishes, you execute all the buffered operations against the database (flush). This way, although the operations are not atomic, you are again centralizing them enough to minimize errors.
Of course, the best option is to use an ACID database if you require that capability; any other approach is a workaround to get close to it.
About publishing the events: I don't know if you mean publishing them to a messaging transport such as RabbitMQ, Kafka, etc., but that should be a background process that fetches the events from the DB and publishes them, so that you avoid a two-phase commit within the same transaction.

I am not sure which NoSQL database is suitable for my scenario

I am trying to design a cloud-based system (IaaS) that will gather data from sensors (water-pollution-related activity) and, upon certain events, decide to process the data for a specific sensor.
Data characteristics are:
1. For each sensor, data is sent once every couple of days (up to 6 times a month).
2. Each sensor reading contains about 5000 events, encapsulated in 50-100 messages sent to the server (such a "session" takes about 20 minutes, with messages sent every 5 seconds).
3. I am building the system to handle a rate of 30,000 messages per second.
4. Processing of the data doesn't have to be real time; I have about 10 minutes once the "session" is finished to do the processing.
5. 90% of the sessions are not interesting and can be thrown away once they are finished. The other 10% contain events in their messages based on which I need to decide whether to process the entire session data and alert the sensor that there is pollution.
I created a tool that generates 5000 messages per second, and I am trying to figure out which database would be optimal for my scenario.
These are the databases I am thinking of trying:
Cassandra - For each session I will keep an in-memory collection of keys; the keys point to the messages stored in Cassandra. Once I detect a message that contains bad readings, I will need to pull all of the other messages in the "session" and process them (which means 50-100 requests to Cassandra). My concern here is write performance (since I have many read and write operations), plus I don't have a good strategy for deleting the 90% of sessions that are not needed.
Couchbase - I will save one document per "session", keyed by sensorID, and append each message to the document. Once I detect a message that contains bad readings, I will only need to send one request for the document. My concern here is read performance.
Redis - used like Cassandra above. I assume performance will be the best, but I will need to handle sharding and replication of the data myself in order not to hit the memory limit.
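To make the Redis option concrete, here is a tiny sketch assuming ioredis and one list per session (the sharding/replication mentioned above is left out of the example):

const Redis = require('ioredis');
const redis = new Redis();

async function storeMessage(sessionId, payload) {
  // Append each incoming message of the session to a per-session list.
  await redis.rpush(`session:${sessionId}`, JSON.stringify(payload));
}

async function finishSession(sessionId, interesting) {
  if (interesting) {
    // ~10% of sessions: pull the whole session back for processing.
    const raw = await redis.lrange(`session:${sessionId}`, 0, -1);
    return raw.map((s) => JSON.parse(s));
  }
  // ~90% of sessions: throw the data away once the session is finished.
  await redis.del(`session:${sessionId}`);
  return [];
}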
I would love to hear which option would be the most appropriate.
Thanks!
Regarding Redis: you may consider using a DBaaS (Database as a Service). The service will manage the instances, clusters, scaling, data persistence, and high-availability settings for you.
One example is Redis Cloud by Redis Labs.
This is an interesting one. Going back to the basics of the CAP theorem, we can try to choose a DB based on the need for consistency, availability, and partition tolerance:
For high consistency and availability: MySQL, PostgreSQL, Greenplum, Vertica, Neo4j.
For high availability and partition tolerance: Cassandra, Voldemort, Dynamo, CouchDB, Riak.
For high consistency and partition tolerance: HBase, Redis, MongoDB, BerkeleyDB, BigTable.
So my vote is for Cassandra here.

How to implement something similar to Storm DRPC in Samza?

I have a Samza job with a number of tasks, each of which holds some state in its embedded store. I want to expose this store for reading to the outside world via some kind of RPC mechanism. What could be the best solution for this?
Here is one paragraph in the Samza documentation about it:
Samza does not currently have an equivalent API to DRPC, but you can build it yourself using Samza’s stream processing primitives.
The only solution that comes to my mind is to make my tasks, in addition to their normal processing, consume request messages carrying correlation IDs from a special request topic, and put response messages with the same correlation IDs onto a special response topic. So it's an RPC-over-Kafka solution, which seems suboptimal to me.
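A rough sketch of what that request/response pattern could look like from the caller's side, assuming kafkajs and made-up topic names; the Samza task would do the mirror image (consume store-requests, look up its embedded store, and produce to store-responses with the same correlationId):

const { Kafka } = require('kafkajs');
const { randomUUID } = require('crypto');

const kafka = new Kafka({ clientId: 'store-client', brokers: ['localhost:9092'] });
const producer = kafka.producer();
const consumer = kafka.consumer({ groupId: 'store-client' });
const pending = new Map(); // correlationId -> resolve callback

async function start() {
  await producer.connect();
  await consumer.connect();
  await consumer.subscribe({ topic: 'store-responses' });
  await consumer.run({
    eachMessage: async ({ message }) => {
      const res = JSON.parse(message.value.toString());
      const resolve = pending.get(res.correlationId);
      if (resolve) { pending.delete(res.correlationId); resolve(res.value); }
    },
  });
}

// "RPC" call: publish a request and wait for the matching response.
async function getFromStore(key) {
  const correlationId = randomUUID();
  const reply = new Promise((resolve) => pending.set(correlationId, resolve));
  await producer.send({
    topic: 'store-requests',
    messages: [{ key, value: JSON.stringify({ correlationId, key }) }],
  });
  return reply;
}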
Any thoughts are welcome!
As far as I remember, the embedded store is backed up by a Kafka changelog topic: when you set something in the store, a message is produced to that topic. You can therefore consume this topic and "clone" the embedded store into a different database, and then query that database. Or you can use just the database instead of the embedded store, but that approach could lead to performance issues in your Samza job...
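A hedged sketch of that "clone the store" idea, assuming kafkajs, the Node.js MongoDB driver, and a made-up changelog topic name (in Samza the changelog topic is whatever you configured via stores.<store>.changelog):

const { Kafka } = require('kafkajs');
const { MongoClient } = require('mongodb');

async function mirrorChangelog() {
  const kafka = new Kafka({ clientId: 'store-mirror', brokers: ['localhost:9092'] });
  const consumer = kafka.consumer({ groupId: 'store-mirror' });
  const mongo = await MongoClient.connect('mongodb://localhost:27017');
  const mirror = mongo.db('samza').collection('store_mirror');

  await consumer.connect();
  // Read the changelog from the beginning to rebuild the full store state.
  await consumer.subscribe({ topic: 'my-job-my-store-changelog', fromBeginning: true });
  await consumer.run({
    eachMessage: async ({ message }) => {
      const key = message.key.toString();
      if (message.value === null) {
        // A null value is typically a tombstone: the key was deleted from the store.
        await mirror.deleteOne({ _id: key });
      } else {
        // How to deserialize the value depends on the serde configured for the store.
        await mirror.replaceOne({ _id: key }, { _id: key, value: message.value.toString() }, { upsert: true });
      }
    },
  });
}

mirrorChangelog().catch(console.error);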

SocketIO scaling architecture and large rooms requirements

We are using Socket.IO in a large chat application.
At certain points we want to dispatch "presence" (user availability) updates to all other users.
io.in('room1').emit('availability:update', { userid: 'xxx', isAvailable: false });
room1 may contain a lot of users (500 max). We observe a significant rise in our Node.js load when many availability updates are triggered.
The idea was to use something similar to the Redis store for Socket.IO: have web browser clients connect to different Node.js servers.
When we want to emit to a room, we dispatch the "emit to room1" payload to all other Node.js processes using Redis Pub/Sub, ZeroMQ, or even RabbitMQ for persistence. Each process then calls its own io.in('room1').emit to target its subset of connected users.
One of the concerns with this setup is that the inter-process communication may become quite busy, and I was wondering if it could become a problem in the future.
Here is the architecture I have in mind.
Could you batch changes and only distribute them every 5 seconds or so? In other words, on each node server, simply take a 'snapshot' every X seconds of the current state of all users (e.g. 'connected', 'idle', etc.) and then send that to the other relevant servers in your cluster.
Each server then does the same: every 5 seconds or so it sends a message containing only the changes in user state, as one batch array, to all connected clients.
Right now, I'm rather surprised you are attempting to send information about each user as a packet. Batching seems like it would solve your problem quite well, as it would also make better use of standard packet sizes that are normally transmitted via routers and switches.
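A small sketch of that batching idea on a single Node server with plain Socket.IO: state changes are collected in a map and flushed to the room every 5 seconds as one array instead of one packet per user (room and event names are just examples):

const io = require('socket.io')(3000);

const pendingChanges = new Map(); // userid -> latest presence state

function recordPresenceChange(userid, isAvailable) {
  // Only the latest state per user is kept until the next flush.
  pendingChanges.set(userid, isAvailable);
}

setInterval(() => {
  if (pendingChanges.size === 0) return;
  const batch = Array.from(pendingChanges, ([userid, isAvailable]) => ({ userid, isAvailable }));
  pendingChanges.clear();
  // One emit carrying the whole batch instead of one emit per user change.
  io.in('room1').emit('availability:batch', batch);
}, 5000);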
You are looking for this library:
https://github.com/automattic/socket.io-redis
Which can be used with this emitter:
https://github.com/Automattic/socket.io-emitter
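Roughly, wiring those two up looks like this (the Redis host/port and the event payload are just placeholders):

// On every Socket.IO node: plug in the Redis adapter so that
// io.in('room1').emit(...) is propagated to all other processes.
const io = require('socket.io')(3000);
const redisAdapter = require('socket.io-redis');
io.adapter(redisAdapter({ host: '127.0.0.1', port: 6379 }));

// From any other process (a worker, an API server, ...), broadcast
// into the cluster without holding any socket connections itself.
const emitter = require('socket.io-emitter')({ host: '127.0.0.1', port: 6379 });
emitter.in('room1').emit('availability:update', { userid: 'xxx', isAvailable: false });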
Regarding the available-users function, I think there are two alternatives: you can create a "users queue" containing the "public data" of connected users, or you can use the exchange-binding information to show which users are connected. If you use a "users queue", it will be the same for each "room", and you could update it when a user leaves by "popping" their state message from the queue (although you would have to "reorganize" the whole queue to do that).
Nevertheless, I think RabbitMQ is designed for asynchronous communication and is not a very good fit for keeping a register of user presence. It is better suited to applications where you don't know when the user will receive the message or their "real availability" ("fire and forget" architectures). ZeroMQ requires more work from scratch, but you could implement something more specific to your situation with better performance.
The publish/subscribe example from the RabbitMQ site could be a good starting point for a design like yours, where a message is sent to several users at the same time. In summary, I would create two queues per user (one for receiving and one for sending messages) and use a dedicated exchange for each "chat room", tracking which users are in each room via the exchange-binding information. You always have two queues per user, and you create exchanges and bind the queues to one or more "chat rooms".
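To make the binding idea concrete, here is a tentative amqplib sketch (all names are made up): one fanout exchange per chat room, one receive queue per user, and the queue bound to every room the user joins:

const amqp = require('amqplib');

async function joinRoom(userId, roomId) {
  const conn = await amqp.connect('amqp://localhost');
  const ch = await conn.createChannel();

  // One fanout exchange per chat room: every bound queue gets every message.
  await ch.assertExchange(`room.${roomId}`, 'fanout', { durable: false });

  // One receive queue per user, bound to each room the user joins.
  const queue = `user.${userId}.in`;
  await ch.assertQueue(queue, { durable: false });
  await ch.bindQueue(queue, `room.${roomId}`, '');

  // Deliver anything published to the room to this user's consumer.
  await ch.consume(queue, (msg) => {
    if (msg) {
      console.log(msg.content.toString());
      ch.ack(msg);
    }
  });

  return ch;
}

function sendToRoom(ch, roomId, payload) {
  ch.publish(`room.${roomId}`, '', Buffer.from(JSON.stringify(payload)));
}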
I hope this answer is useful for you; sorry for my bad English.
This is the common approach for sharing data across several Socket.IO processes. You have done well so far with a single process and a single thread. I would assume you could pick any of the mentioned technologies for communicating shared data without hitting performance issues.
If all you need is IPC, you could perhaps have a look at Faye. If, however, you need to have some data persisted, you could start a Redis cluster with as many Redis masters as you have CPUs, though this will add minor networking noise for Pub/Sub.
