readonly transaction with Spanner R2DBC and Spring Data - google-cloud-spanner

Is there a possibility to use read-only transactions with Spring Data R2DBC, especially with a Google Spanner backend? Read-only transactions are supported in the Spanner R2DBC driver and offer a significant scalability advantage (no locking!). However, they are not part of the R2DBC standard, and I haven't found any support for this in the Spring Data R2DBC documentation.

Based on my research it looks like this is not possible yet.
I filed a feature request asking to support extended features in Spring Data R2DBC here: https://github.com/GoogleCloudPlatform/cloud-spanner-r2dbc/issues/314
The goal would be to allow you to make read-only transactions using the @Transactional annotation, like:
@Transactional(readOnly = true)
public void readAndSaveRecords(..) {
}
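If and when that feature lands, a minimal sketch of the intended usage could look like this (PersonRepository, Person, and the service below are hypothetical names, not part of the driver):

import org.springframework.stereotype.Service;
import org.springframework.transaction.annotation.Transactional;
import reactor.core.publisher.Flux;

@Service
public class PersonReadService {

    private final PersonRepository repository; // assumed Spring Data R2DBC repository

    public PersonReadService(PersonRepository repository) {
        this.repository = repository;
    }

    // The idea of the feature request: readOnly = true would be translated into a
    // Spanner read-only transaction, so these reads take no locks.
    @Transactional(readOnly = true)
    public Flux<Person> findAllPeople() {
        return repository.findAll();
    }
}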

Related

periodic refresh of static data in Structured Streaming and Stateful Streaming

I am trying to implement 5-minute batch monitoring using Spark Structured Streaming, where I read from Kafka, look up two different static datasets (one huge and one smaller) as part of the ETL logic, and call a REST API to send the final results to an external application (out of billions of records from Kafka, fewer than 100 will go out to the REST API after the ETL).
How can I refresh the static lookups without restarting the whole streaming application? (A StreamingQueryListener registered via StreamingQueryManager.addListener with our own logic for refreshing/recreating the static DataFrame around StreamingQuery.awaitTermination? Or persist and unpersist the cache? Or any other better ideas?)
Note: I went through the article below, but I am not sure whether HBase is the better option, as it is an old one.
https://medium.com/@anchitsharma1994/hbase-lookup-in-spark-streaming-acafe28cb0dc
Once a record is enriched with lookup information and some rules/conditions have been applied, we need to start keeping track of it and send updates, per custom logic, via the REST API until it completes its event lifecycle. I am hoping a flatMapGroupsWithState implementation helps here to keep track of the event state. Please suggest the best options here.
Managing group state within HDFS vs. using HBase: please suggest the best option from an operationalization and monitoring point of view in a production environment where the support team has minimal knowledge of Spark. If we use HDFS for state maintenance, how do we keep event-state tracking up to date in case the REST API fails to send updates to the end user/system?
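One rough sketch of refreshing the smaller lookup without restarting the query (this uses foreachBatch rather than the listener approach mentioned above; the topic, lookup path, and join key are assumptions, and the rules/REST call are left out):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.streaming.Trigger;

public class LookupRefreshSketch {
    public static void main(String[] args) throws Exception {
        SparkSession spark = SparkSession.builder().appName("lookup-refresh").getOrCreate();

        // Stream from Kafka (bootstrap servers and topic are placeholders)
        Dataset<Row> stream = spark.readStream()
                .format("kafka")
                .option("kafka.bootstrap.servers", "localhost:9092")
                .option("subscribe", "events")
                .load()
                .selectExpr("CAST(key AS STRING) AS key", "CAST(value AS STRING) AS value");

        stream.writeStream()
                .trigger(Trigger.ProcessingTime("5 minutes"))
                .foreachBatch((Dataset<Row> batch, Long batchId) -> {
                    // Re-reading here refreshes the lookup once per micro-batch,
                    // without restarting the streaming query (path is a placeholder)
                    Dataset<Row> smallLookup = spark.read().parquet("/data/lookup_small");
                    Dataset<Row> enriched = batch.join(smallLookup, "key");
                    // ... apply the rules here and push the few surviving rows to the REST API ...
                    enriched.count(); // placeholder action so the batch is materialized
                })
                .start()
                .awaitTermination();
    }
}

The huge lookup is probably better kept cached and refreshed on its own schedule; this sketch only covers the smaller one.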

What is the advantage and disadvantage when considering Kafka as a storage?

I have 2 approaches:
Approach #1
Kafka --> Spark Stream (processing data) --> Kafka -(Kafka Consumer)-> Nodejs (Socket.io)
Approach #2
Kafka --> Kafka Connect (processing data) --> MongoDB -(mongo-oplog-watch)-> Nodejs (Socket.io)
Note: in Approach #2, I use mongo-oplog-watch to check when inserting data.
What is the advantage and disadvantage when using Kafka as a storage vs using another storage like MongoDB in real-time application context?
Kafka topics typically have a retention period (the default is 7 days), after which the messages are deleted. Though, there is no hard rule that we must not persist data in Kafka.
You can set the topic retention period to -1 (reference)
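For illustration, something along these lines should set infinite retention on an existing topic via the Java AdminClient (the topic name and bootstrap address are placeholders):

import java.util.List;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.AlterConfigOp;
import org.apache.kafka.clients.admin.ConfigEntry;
import org.apache.kafka.common.config.ConfigResource;
import org.apache.kafka.common.config.TopicConfig;

public class InfiniteRetentionSketch {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        try (Admin admin = Admin.create(props)) {
            // retention.ms = -1 disables time-based deletion for this topic
            ConfigResource topic = new ConfigResource(ConfigResource.Type.TOPIC, "my-topic");
            AlterConfigOp keepForever = new AlterConfigOp(
                    new ConfigEntry(TopicConfig.RETENTION_MS_CONFIG, "-1"),
                    AlterConfigOp.OpType.SET);
            admin.incrementalAlterConfigs(Map.of(topic, List.of(keepForever))).all().get();
        }
    }
}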
The only problem I know of with persisting data in Kafka is security. Kafka, out of the box (at least as of now), doesn't provide data-at-rest encryption. You need to go with a custom (or home-grown) solution to have that.
Protecting data-at-rest in Kafka with Vormetric
A KIP is also there, but it is under discussion:
Add end to end encryption in Kafka (KIP)
MongoDB, on the other hand, seems to provide data-at-rest encryption.
Security data at rest in MongoDB
And most importantly, it also depends on the type of the data that you are going to store and what you want to do with it.
If you are dealing with data that is quite complex (not a simple key-value model, i.e., give the key and get the value), for example querying by indexed fields etc. (as you typically do with logs), then MongoDB could probably make sense.
In simple words, if you are querying by more than one field (other than the key), then storing it in MongoDB could make sense; if you intend to use Kafka for such a purpose, you would probably end up creating a topic for every field that should be queried... which is too much.

Transaction Synchronization in Spring Kafka

I want to synchronize a kafka transaction with a repository transaction:
@Transactional
public void syncTransaction() {
    myRepository.save(someObject);
    kafkaTemplate.send(someEvent);
}
Since the merge of https://github.com/spring-projects/spring-kafka/issues/373, and according to the docs, this is possible. Nevertheless, I have problems understanding and implementing that feature.
Looking at the example in https://docs.spring.io/spring-kafka/reference/html/#transaction-synchronization I have to create a MessageListenerContainer to listen to my own events.
Do I still have to send my events using the KafkaTemplate?
Does the MessageListenerContainer prohibit the sending to the broker?
And if I understand correctly, the kafkaTemplate and the kafkaTransactionManager have to use the same producerFactory, in which I have to enable transactions by setting a transactionIdPrefix. And in my example I have to set the TransactionManager of the messageListenerContainer to the DataSourceTransactionManager. Is that correct?
From my perspective it looks weird that I send an event via the kafkaTemplate, listen to my own event, and forward the event using the kafkaTemplate again.
It would really help me if I could get an example of a simple synchronization of a Kafka transaction with a repository transaction, along with an explanation.
If the listener container is provisioned with a KafkaTransactionManager, the container will create a producer which will be used by any downstream kafka template and the container will send the offsets to the transaction for you.
If the container has some other transaction manager, the container can't send the offsets since it doesn't have access to the producer (or template).
Another solution is to annotate your method with @Transactional (with the datasource TM) and configure the container with a Kafka TM.
That way, your DB tx will commit just before the thread returns to the container, which will then send the offsets to the Kafka transaction and commit it.
See the framework test cases for examples.
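A minimal sketch of that second arrangement, assuming a JPA/DataSource transaction manager bean named transactionManager, a KafkaTemplate bean, and spring-kafka's ConcurrentKafkaListenerContainerFactory (setter names can vary slightly between spring-kafka versions; MyRepository, MyEntity, and the topic names are made up):

import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.kafka.annotation.KafkaListener;
import org.springframework.kafka.config.ConcurrentKafkaListenerContainerFactory;
import org.springframework.kafka.core.ConsumerFactory;
import org.springframework.kafka.core.KafkaTemplate;
import org.springframework.kafka.transaction.KafkaTransactionManager;
import org.springframework.stereotype.Component;
import org.springframework.transaction.annotation.Transactional;

@Configuration
class KafkaTxConfigSketch {

    // The container opens a Kafka transaction for each delivery.
    @Bean
    public ConcurrentKafkaListenerContainerFactory<String, String> kafkaListenerContainerFactory(
            ConsumerFactory<String, String> consumerFactory,
            KafkaTransactionManager<String, String> kafkaTransactionManager) {
        ConcurrentKafkaListenerContainerFactory<String, String> factory =
                new ConcurrentKafkaListenerContainerFactory<>();
        factory.setConsumerFactory(consumerFactory);
        factory.getContainerProperties().setTransactionManager(kafkaTransactionManager);
        return factory;
    }
}

@Component
class SyncingListener {

    private final MyRepository myRepository;
    private final KafkaTemplate<String, String> kafkaTemplate;

    SyncingListener(MyRepository myRepository, KafkaTemplate<String, String> kafkaTemplate) {
        this.myRepository = myRepository;
        this.kafkaTemplate = kafkaTemplate;
    }

    // The DB transaction commits when this method returns; the container then sends
    // the consumed offsets to the Kafka transaction and commits it.
    @KafkaListener(topics = "incoming-events")
    @Transactional("transactionManager") // the DataSource/JPA transaction manager
    public void onEvent(String event) {
        myRepository.save(new MyEntity(event));        // DB work inside the JPA/DataSource tx
        kafkaTemplate.send("outgoing-events", event);  // joins the container's Kafka tx
    }
}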
@Eike Behrends, to have a DB + Kafka transaction, you can use ChainedTransactionManager and define it this way:
@Bean
public KafkaTransactionManager kafkaTransactionManager() {
    KafkaTransactionManager ktm = new KafkaTransactionManager(producerFactory());
    ktm.setTransactionSynchronization(AbstractPlatformTransactionManager.SYNCHRONIZATION_ON_ACTUAL_TRANSACTION);
    return ktm;
}
@Bean
@Primary
public JpaTransactionManager transactionManager(EntityManagerFactory em) {
    return new JpaTransactionManager(em);
}
@Bean(name = "chainedTransactionManager")
public ChainedTransactionManager chainedTransactionManager(JpaTransactionManager jpaTransactionManager,
        KafkaTransactionManager kafkaTransactionManager) {
    return new ChainedTransactionManager(kafkaTransactionManager, jpaTransactionManager);
}
You need to annotate your transactional DB + Kafka methods with @Transactional("chainedTransactionManager").
(you can see the issue on spring-kafka project : https://github.com/spring-projects/spring-kafka/issues/433 )
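For example, a usage sketch for such a method (the repository, template, entity, and topic name here are made up):

// Both the Kafka and the JPA transaction are started by the chained manager
// and committed when the method returns.
@Transactional("chainedTransactionManager")
public void saveAndPublish(SomeEntity entity) {
    myRepository.save(entity);
    kafkaTemplate.send("events-topic", entity.getId().toString());
}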
You say: "From my perspective it looks weird that I send an event via kafkaTemplate, listen to my own event and forward the event using the kafkaTemplate again."
Have you tried this? If so, can you provide an example, please?
For achieving your target you should use a different, "eventually consistent" approach like CDC (Change Data Capture). There are no atomic transactions between Kafka writes and any other system (e.g. a database), a.k.a. XA transactions. It is a complete paradigm shift when you have distributed services (some call them microservices) that, in your case, probably communicate by producing/consuming to/from Kafka topics.
TL;DR: just use upsert / merge.
I accidentally came across this old topic, and after so many years people still struggle.
I just want to share the simplest and most native approach to dealing with systems such as Kafka.
The real reason people come here for an answer is the old approach of distributed transactions. And most want to synchronize non-transactional Kafka (Kafka names its functionality "transactions", but they are actually "special") with some ACID database.
If your service is working within an idempotent environment, everything downstream should be idempotent too.
Just make sure your operations on the underlying storage are idempotent; the simplest approach is upsert / merge (depending on the storage).
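A minimal sketch of that idea with a JDBC upsert (PostgreSQL ON CONFLICT syntax; the table, columns, event type, and listener wiring are assumptions):

// Replaying the same Kafka record produces the same row state, so redeliveries
// after a crash or rebalance are harmless.
@KafkaListener(topics = "orders")
public void onOrderEvent(OrderEvent event) {
    jdbcTemplate.update(
        "INSERT INTO orders (id, status) VALUES (?, ?) "
      + "ON CONFLICT (id) DO UPDATE SET status = EXCLUDED.status",
        event.getId(), event.getStatus());
}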
P.S. CDC is a thing, but it requires much more labor and is unnecessary in most typical cases.
More:
If you want to dig into why Kafka "transactions" are special, here are good starting points (explained in the context of exactly-once semantics):
for newer versions: https://www.youtube.com/watch?v=j0l_zUhQaTc
for older: https://www.youtube.com/watch?v=zm5A7z95pdE
EDIT
It is very interesting that this answer got downvotes... Just check this issue and its comments/related issues: https://github.com/spring-projects/spring-data-commons/issues/2232 - that's why one would not want to use ChainedTransactionManager for business-critical transactions (it can't act as a real 2PC by design).

store messages of messaging application

Actually, I am developing a messaging app, using Cassandra as a database and Kafka as a message broker.
My question is:
Do I need to store the messages between users in Cassandra? If I do so, the size of my database will grow really fast.
As I am using a messaging queue, the messages are stored only as long as they have not been delivered. I have heard that messaging apps (such as Facebook Messenger and WhatsApp) do not store the message content between users in a database, but only use a queuing system (XMPP, MQTT) which deletes messages as soon as they are delivered, so there is no need to store them in an external database. Am I right?
What is best practice? Besides, do I need to store the message content from a legal perspective (government or the like) for a period of time (for example, 2 years)?
Looking at http://www.planetcassandra.org/apache-cassandra-use-cases/, there are a lot of projects using Cassandra as a database backend for messaging apps. However, it is an anti-pattern to use Cassandra as a message queue (see the Cassandra docs).
Using Cassandra as a queue is clearly an anti-pattern.
However, Cassandra is a good fit for storing messages; read my blog post on KillrChat (http://www.doanduyhai.com/blog/?p=1859) for a possible data model for message storage.
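For a rough idea of what such a model can look like (the keyspace, table, and column names below are assumptions, not the KillrChat schema), messages can be partitioned by conversation and clustered by a time-based id:

import com.datastax.oss.driver.api.core.CqlSession;

public class ChatSchemaSketch {
    public static void main(String[] args) {
        // Connects to a local node by default; adjust contact points as needed
        try (CqlSession session = CqlSession.builder().withKeyspace("chat").build()) {
            // One partition per conversation; messages ordered newest-first within it
            session.execute(
                "CREATE TABLE IF NOT EXISTS messages ("
              + "  conversation_id uuid,"
              + "  message_id timeuuid,"
              + "  author text,"
              + "  body text,"
              + "  PRIMARY KEY ((conversation_id), message_id)"
              + ") WITH CLUSTERING ORDER BY (message_id DESC)");
        }
    }
}

Reading a conversation page is then a single-partition query ordered by message_id, which is exactly the access pattern Cassandra handles well.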

How to implement something similar to Storm DRPC in Samza?

I have a Samza job with a number of tasks, each of which holds some state in its embedded store. I want to expose this store for reading to the outside world via some kind of RPC mechanism. What could be the best solution for this?
Here is one paragraph from the Samza documentation about it:
Samza does not currently have an equivalent API to DRPC,
but you can build it yourself using Samza’s stream
processing primitives.
The only solution that comes to my mind is to make my tasks, in addition to their normal processing, consume request messages carrying correlation IDs from a special request topic, and put response messages with the same correlation IDs onto a special response topic. So it's an RPC-over-Kafka solution, which seems suboptimal to me.
Any thoughts are welcome!
As far as I remember, the embedded store is backed up to a Kafka topic. When you set something in the store, a message is produced to that topic. Thus you can consume this topic and "clone" the embedded store into a different database. Then you can query that database, or you can just use the database instead of the embedded store. But this approach could lead to performance issues in your Samza job...
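A bare-bones sketch of that mirroring idea (assuming the store's changelog topic is named my-store-changelog with string keys and values; ExternalStore stands for whatever queryable database/DAO you pick):

import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class StoreMirrorSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "store-mirror");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");

        ExternalStore externalStore = new ExternalStore(); // hypothetical queryable mirror

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("my-store-changelog"));
            while (true) {
                for (ConsumerRecord<String, String> record : consumer.poll(Duration.ofSeconds(1))) {
                    if (record.value() == null) {
                        externalStore.delete(record.key());   // null value = deletion in the changelog
                    } else {
                        externalStore.upsert(record.key(), record.value());
                    }
                }
            }
        }
    }

    // Placeholder for the external storage that your RPC/REST layer would query
    static class ExternalStore {
        void upsert(String key, String value) { /* write to your DB */ }
        void delete(String key) { /* remove from your DB */ }
    }
}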
