Cleaning state object using wildcard in spark structured streaming - apache-spark

We are planning to use spark structured streaming's arbitrary state processing for one of our use cases where we want to maintain certain attributes of user in the state. The query runs for a multi-tenant environment and hence we have kept the state key as <tenantID>:<userId>. Now, when the tenant is deleted, we want to remove the state as soon as possible.
So, what we are looking for is to delete all the state keys <tenantId>:*, when we receive a tenant delete event.
Is there a way we can achieve that with arbitrary state processing in spark structured streaming?

Related

periodic refresh of static data in Structure Streaming and Stateful Streaming

I am trying to implement 5 min batch monitoring using spark structured streaming where read from kafka and look up on (1 huge and 1 smaller) diff static datasets as part of ETL logic and call rest API to send final results to an external application (out of billions of records from kafka only less than 100 will be out to rest API after ETL).
How to achieve refreshing static look ups with out restarting the whole streaming application ? (StreamingQueryListener using StreamingQueryManager.addListener method to have our own logic of refreshing/recreating static df via StreamingQuery.AwaitTermination ? or use persist and unpersis cache ? or any other better ideas ?)
Note : Went through below article but not sure if hbase is better option as its an old one.
https://medium.com/#anchitsharma1994/hbase-lookup-in-spark-streaming-acafe28cb0dc
Once a record is enriched with look up information and applied some rules/conditions , we need to start keep track of it to send updates until it completed its lifecycle of an event as per custom logic via rest API. So hoping flatmapwithGroupState implementation helps here to keep track of event state. Please suggest best options here.
Managing group state with in HDFS vs using HBase. Please suggest best options from an operationalization and monitoring point of view in production environment where support team has minimal knowledge of Spark. If we use HDFS for state maintenance, how to keep it up with event state tracking in case of rest API fails to send updates to end user/system?

Is spark JDBC sink transactionally safe at node level?

I have a question related to opening a transaction at partition level. If I use jdbc connector to write to database (postgess), will partition specific writes at worker node be transactionally safe i.e.
If a worker node goes down while writing the data, will the rows related to this partition/ worker node be rolled back?
There is a transaction boundary on the partition (see https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala#L588)
But if there are failures afterwards before the task is marked as SUCCESS, for example with a network issue or timeout, then you might still get multiple writes

What is the advantage and disadvantage when considering Kafka as a storage?

I have 2 approaches:
Approach #1
Kafka --> Spark Stream (processing data) --> Kafka -(Kafka Consumer)-> Nodejs (Socket.io)
Approach #2
Kafka --> Kafka Connect (processing data) --> MongoDB -(mongo-oplog-watch)-> Nodejs (Socket.io)
Note: in Approach #2, I use mongo-oplog-watch to check when inserting data.
What is the advantage and disadvantage when using Kafka as a storage vs using another storage like MongoDB in real-time application context?
Kafka topics typically have a retention period (default to 7 days) after which they will be deleted. Though, there is no hard rule that we must not persist in Kafka.
You can set the topic retention period to -1 (reference)
The only problem, I know of persisting data in Kafka, is security. Kafka, out of the box (atleast as of now) doesn't provide Data-at-rest encryption. You need to go with a custom solution (or a home-grown one) to have that.
Protecting data-at-rest in Kafka with Vormetric
A KIP is also there, but it is Under discussion
Add end to end encryption in Kafka (KIP)
MongoDB on the other hand seems to provide Data-at-rest encryption.
Security data at rest in MongoDB
And most importantly, it also depends on the type of the data that you are going to store and what you want to do with it.
If you are dealing with data that is quite complex (not easy as Key-Value i.e., give the key and get the value model), for example, like querying by indexed fields etc (as you do typically with logs), then MongoDB could probably make sense.
In simple words, if you are querying by more than one field (other than the key), then storing it in MongoDB could make sense, if you intend to use Kafka for such a purpose, you would probably end up with creating a topic for every field that should be queried... which is too much.

Is it possible to have external trigger in Cassandra?

I need a worker to subscribe to new data entries in a column family.
I have to invoke the services consuming data on the producer side, or poll the column family for new data, which is a waste of resources and also leads to some extended latency.
I want some external service to be invoked when new data is written to column family. Is it possible to invoke an external service, such as an REST endpoint upon new data arrival?
There are two features, triggers and CDC (change data capture) that may work. You can create a trigger to receive updates and execute the http request, or you can use CDC to get a per replica copy of the mutations as a log to walk through.
CDC is better for consistency, since a trigger fires before mutations applied, your API endpoint may be notified but then have the mutation fail to apply so your at an inconsistent state. But triggers are easier since you dont need to worry about deduplication since its only 1 per query vs 1 per replica. Or you can use both, triggers that update a cached state and then CDC with a map reduce job to fix any inconsistencies.

spring batch design advice for processing 50k files

We have more than 50k files coming in everyday and needs to be processed. For that we have developed POC apps with design like,
Polling app picks the file continuously from ftp zone.
Validate that file and create metadata in db table.
Another poller picks 10-20 files from db(only file id and status) and deliver it to slave apps as message
Slave app take message and launch a spring batch job, which is reading data, does biz validation in processors and writes validated data to db/another file.
We used spring integration and spring batch technology for this POC
Is it a good idea to launch spring batch job in slaves or directly implement read,process and write logic as plan java or spring bean objects?
Need some insight on launching this job where slave can have 10-25 MDP(spring message driven pojo) and each of this MDP is launching a job.
Note : Each file will have max 30 - 40 thousand records
Generally, using SpringIntegration and SpringBatch for such tasks is a good idea. This is what they are intended for.
With regard to SpringBatch, you get the whole retry, skip and restart handling out of the box. Moreover, you have all these readers and writers that are optimised for bulk operations. This works very well and you only have to concentrate on writing the appropriate mappers and such stuff.
If you want to use plain java or spring bean objects, you will probably end up developing such infrastructure code by yourself... incl. all the needed effort for testing and so on.
Concerning your design:
Besides validating and creation of the metadata entry, you could consider to load the entries directly into a database table. This would give you a better "transactional" control, if something fails. Your load job could look something like this:
step1:
tasklet to create an entry in metadata table with columns like
FILE_TO_PROCESS: XY.txt
STATE: START_LOADING
DATE: ...
ATTEMPT: ... first attempt
step2:
read and validate each line of the file and store it in a data table
DATA: ........
STATE:
FK_META_TABLE: ForeignKey to meta table
step3:
update metatable with status LOAD_completed
-STATE : LOAD_COMPLETED
So, as soon as your metatable entry gets the state LOAD_COMPLETED, you know that all entries of the files have been validated and are ready for further processing.
If something fails, you just can fix the file and reload it.
Then, to process further, you could just have jobs which poll periodically and check if there are new data in the database which should be processed. If more than one file had been loaded during the last period, simply process all files that are ready.
You could even have several slave-processes polling from time to time. Just do a read for update on the state of the metadata table or use an optimistic locking approach to prevent several slaves from trying to process the same entries.
With this solution, you don't need a message infrastructure and you can still scale the whole application without any problems.

Resources