I am working with a Spring Batch application for the first time and, since the framework is very flexible, I have a few questions on performance and on best practices for implementing jobs that I couldn't find clear answers to in the Spring docs.
My Goals:
read an ASCII file with fixed-length column values, sent by a third party in a previously specified layout (STEP 1 reader)
validate the read values and log the errors to a file with custom messages
apply some business logic in the processor to filter out any undesirable lines (STEP 1 processor)
write the valid lines to an Oracle database (STEP 1 writer)
after the previous step finishes, update a table in the database with the step 1 finish timestamp (STEP 2 tasklet)
send an email when the job stops with a summary of the quantities already processed, the errors and written lines, and the start and finish times (is this information available in the jobRepository metadata?)
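For the email summary, here is a rough sketch of how I imagine pulling those figures out of the step execution metadata with a JobExecutionListener (the mail helper is hypothetical):

import org.springframework.batch.core.JobExecution;
import org.springframework.batch.core.JobExecutionListener;
import org.springframework.batch.core.StepExecution;

public class SummaryMailListener implements JobExecutionListener {

    @Override
    public void beforeJob(JobExecution jobExecution) {
        // nothing to do before the job starts
    }

    @Override
    public void afterJob(JobExecution jobExecution) {
        StringBuilder summary = new StringBuilder();
        for (StepExecution step : jobExecution.getStepExecutions()) {
            summary.append(step.getStepName())
                   .append(": read=").append(step.getReadCount())
                   .append(", written=").append(step.getWriteCount())
                   .append(", skipped=").append(step.getSkipCount())
                   .append(", start=").append(step.getStartTime())
                   .append(", end=").append(step.getEndTime())
                   .append('\n');
        }
        // sendSummaryMail(...) would be a hypothetical helper wrapping JavaMailSender
        // sendSummaryMail("Batch summary", summary.toString());
    }
}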
Assumptions:
The file is incremental: the third party always resends the prior file's lines (possibly with some value changes) plus any new lines (~120 million lines in total). A new file is sent every 6 months.
We must validate the input file lines while processing (are the required values present? can some be converted to numbers and dates?).
The job must be stoppable/restartable since it is intended to run within a time window.
What I am planning to do:
To get some reading and writing performance I am avoiding Spring's out-of-the-box reflection-based beans and using a JdbcBatchItemWriter to write the processed lines to the database.
The file reader reads the lines with a custom FieldSetMapper and maps all the columns with the FieldSet.readString method (which implies no ParseException can occur during reading). A bean injected into the processor performs the parsing and validation; this way we avoid skip-on-exception handling during reading, which seems like an expensive operation, and we can count the invalid lines to pass to later steps, saving that info in the step/job execution context.
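As a rough sketch of the reading side (the RawLine bean and the column names are just placeholders for the real layout):

import org.springframework.batch.item.file.mapping.FieldSetMapper;
import org.springframework.batch.item.file.transform.FieldSet;

// Every column is kept as a String so the reader itself never throws a ParseException.
public class RawLineMapper implements FieldSetMapper<RawLine> {

    @Override
    public RawLine mapFieldSet(FieldSet fieldSet) {
        RawLine line = new RawLine();
        line.setAccountNumber(fieldSet.readString("accountNumber")); // placeholder column names
        line.setAmount(fieldSet.readString("amount"));
        line.setOperationDate(fieldSet.readString("operationDate"));
        return line;
    }
}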
The processor bean converts the object that was read and returns a wrapper holding the original object, the parsed values (i.e. dates and longs), the first exception thrown during parsing, if any, and a boolean indicating whether the validation succeeded. After the parsing, another custom processor checks whether the register should be inserted into the database by querying for similar or identical registers already inserted. In the worst case this business rule implies one database query per valid line.
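The parsing processor would look roughly like this (the wrapper type and field names are placeholders):

import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import org.springframework.batch.item.ItemProcessor;

// Wraps the raw line together with the parsed values, the first parsing error and a validity flag.
public class ParsingProcessor implements ItemProcessor<RawLine, ParsedLineWrapper> {

    private static final DateTimeFormatter DATE_FORMAT = DateTimeFormatter.ofPattern("yyyyMMdd");

    @Override
    public ParsedLineWrapper process(RawLine item) {
        ParsedLineWrapper wrapper = new ParsedLineWrapper(item);
        try {
            wrapper.setAmount(Long.parseLong(item.getAmount().trim()));
            wrapper.setOperationDate(LocalDate.parse(item.getOperationDate().trim(), DATE_FORMAT));
            wrapper.setValid(true);
        } catch (RuntimeException e) {   // NumberFormatException, DateTimeParseException, ...
            wrapper.setValid(false);
            wrapper.setFirstError(e);    // kept for the error log and the summary counts
        }
        return wrapper;                  // the follow-up processor may return null to filter the line
    }
}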
A JdbcBatchItemWriter writes the valid registers to the database; lines for which a processor returned null are filtered out and never reach the writer.
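The writer configuration would be something like this (table and column names are placeholders):

import javax.sql.DataSource;
import org.springframework.batch.item.database.JdbcBatchItemWriter;
import org.springframework.batch.item.database.builder.JdbcBatchItemWriterBuilder;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
public class WriterConfig {

    @Bean
    public JdbcBatchItemWriter<ParsedLineWrapper> registerWriter(DataSource dataSource) {
        return new JdbcBatchItemWriterBuilder<ParsedLineWrapper>()
                .dataSource(dataSource)
                .sql("INSERT INTO register (account_number, amount, operation_date) "
                   + "VALUES (:accountNumber, :amount, :operationDate)")
                .beanMapped()   // named parameters resolved from the wrapper's getters
                .build();
    }
}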
So, the real questions regarding batch processing:
What performance tips could I use to improve the batch performance? In a preliminary attempt, loading a perfectly valid mock input file into the database took 15 hours of processing, without even querying the database to verify whether each processed register should be inserted. What could be the simplest solution for local processing?
Have you seen partitioning? http://docs.spring.io/spring-batch/reference/html/scalability.html Remote chunking, with control over the reader, may also be helpful in Spring Batch.
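A locally partitioned step looks roughly like this (just a sketch: the partitioner, the worker step and the grid size are placeholders you would adapt to how the file is split):

import org.springframework.batch.core.Step;
import org.springframework.batch.core.configuration.annotation.StepBuilderFactory;
import org.springframework.batch.core.partition.support.Partitioner;
import org.springframework.core.task.SimpleAsyncTaskExecutor;

public class PartitionedStepConfig {

    public Step masterStep(StepBuilderFactory steps, Partitioner partitioner, Step workerStep) {
        return steps.get("masterStep")
                .partitioner("workerStep", partitioner)    // the partitioner names each worker's ExecutionContext
                .step(workerStep)                          // the reader/processor/writer chunk step
                .gridSize(8)                               // number of partitions / threads
                .taskExecutor(new SimpleAsyncTaskExecutor())
                .build();
    }
}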
Related
Are there any recommended architectural patterns with Service Bus for ensuring ordered processing of nested groups of messages which are sent out of order? We are using Sessions, but when it comes down to ensuring that a set of Sessions must be processed sequentially in a certain order before moving on to another set of Sessions, the architecture becomes cumbersome very quickly. This question might best be illustrated with an example.
We are using Service Bus to integrate changes in real-time from a database to a third-party API. Every N minutes, we get notified of a new 'batch' of changes from the database which consists of individual records of data across different entities. We then transform/map each record and send it along to an API. For example, a 'batch' of changes might include 5 new/changed 'Person' records, 3 new/changed 'Membership' records, etc.
At the outer-most level, we must always process one entire batch before we can move on to another batch of data, but we also have a requirement to process each type of entity in a certain order. For example, all 'Person' changes must be processed for a given batch before we can move on to any other objects.
There is no guarantee that these records will be queued up in any order which is relevant to how they will need to be processed, particularly within a 'batch' of changes (e.g. the data from different entity types will be interleaved).
We actually do not necessarily need to send the individual records of entity data in any order to the API (e.g. it does not matter in which order I send those 5 Person records for that batch, as long as they are all sent before the 3 Membership records for that batch). However, we do group the messages into Sessions by entity type so that we can guarantee homogeneous records in a given session and target all records for that entity type (this also helps us support a separate requirement we have when calling the API to send a batch of records when possible instead of an individual call per record to avoid API rate limiting issues). Currently, our actual Topic Subscription containing the record data is broken up into Sessions which are unique to the entity type and the batch.
"SessionId": "Batch1234\Person"
We are finding that it is cumbersome to manage the requirement that all changes for a given batch must be processed before we move on to the next batch, because there is no Session which reliably groups those "groups of entities" together (let alone processing those groups of entities themselves in a certain order). There is, of course, no concept of a 'session of sessions', and we are currently handling this by having a separate 'Sync' queue message that represents an entire batch of changes to be processed and lists which sessions of data are contained in that batch:
"SessionId": "Batch1234",
"Body":
{
"targets": ["Batch1234\Person", "Batch1234\Membership", ...]
}
This is quite cumbersome, because something (e.g. a Durable Azure Function) now has to orchestrate the entire process by watching the Sync queue and then spinning off separate processors that it oversees to ensure correct ordering at each level (which makes concurrency management and scalability much more complicated to deal with). If this is indeed a good pattern, then I do not mind implementing the extra orchestration architecture to ensure a robust, scalable implementation. However, I cannot help from feeling that I am missing something or not thinking about the architecture the right way.
Is anyone aware of any other recommended pattern(s) in Service Bus for handling ordered processing of groups of data which themselves contain groups of data which must be processed in a certain order?
For the record I'm not a service bus expert, specifically.
The entire batch construct sounds painful - can you do away with it? Often if you have a painful input, you'll have a painful solution - the old "crap in, crap out" maxim. Sometimes it's just hard to find an elegant solution.
Do the 'sets of sessions' need to be processed in a specific order?
Is a 'batch' of changes = a session?
I can't think of a specific pattern, but a "divide and conquer" approach seems reasonable (which is roughly what you have already?):
Watch for new batches; when one occurs, hand it off to a BatchProcessor.
BatchProcessor applies all the rules to the batch, as you outlined.
Consider having the BatchProcessor dump its results on a queue of some kind which is the source for the API - that way you have some kind of isolation between the batch processing and the API.
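For illustration only, a rough sketch of that hand-off with the azure-messaging-servicebus session client (the topic/subscription names are made up, the ordered session list is assumed to come from your Sync message, and error handling is omitted):

import com.azure.messaging.servicebus.ServiceBusClientBuilder;
import com.azure.messaging.servicebus.ServiceBusReceivedMessage;
import com.azure.messaging.servicebus.ServiceBusReceiverClient;
import com.azure.messaging.servicebus.ServiceBusSessionReceiverClient;
import java.time.Duration;
import java.util.List;

public class BatchProcessor {

    private final ServiceBusSessionReceiverClient sessionClient;

    public BatchProcessor(String connectionString) {
        this.sessionClient = new ServiceBusClientBuilder()
                .connectionString(connectionString)
                .sessionReceiver()
                .topicName("changes")            // hypothetical topic holding the record data
                .subscriptionName("records")     // hypothetical subscription
                .buildClient();
    }

    // Drain the batch's entity sessions strictly in the order listed in the Sync message,
    // e.g. ["Batch1234\Person", "Batch1234\Membership", ...]
    public void process(List<String> orderedSessionIds) {
        for (String sessionId : orderedSessionIds) {
            try (ServiceBusReceiverClient receiver = sessionClient.acceptSession(sessionId)) {
                // a real implementation would loop until the session is fully drained
                for (ServiceBusReceivedMessage message
                        : receiver.receiveMessages(100, Duration.ofSeconds(5))) {
                    // map/transform here and push the result onto the API-facing queue
                    receiver.complete(message);
                }
            }
        }
    }
}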
I am trying to execute a getRange command in fdbCli but it fails with
FDBException: Transaction is too old to perform reads or be committed
What is the meaning of this particular exception?
Does it mean my query took more than 5 seconds to complete?
FDB keeps track of the transactions started within the last 5 seconds. Also, the data nodes only keep versions from the last 5 seconds. So if the read version is older than the oldest version kept by the data nodes, the data nodes have no way to answer the request; that's why FDB throws this exception. The trick to avoid such exceptions is to split one huge, long-running transaction into many small transactions. I also noticed that FDB performs really well when the transaction time is under 300 ms.
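For example, a large range can be scanned in bounded pages, each inside its own short transaction (a sketch against the FoundationDB Java binding; the key prefix and page size are arbitrary):

import com.apple.foundationdb.Database;
import com.apple.foundationdb.FDB;
import com.apple.foundationdb.KeySelector;
import com.apple.foundationdb.KeyValue;
import com.apple.foundationdb.Range;
import com.apple.foundationdb.tuple.Tuple;
import java.util.List;

public class ChunkedScan {

    public static void main(String[] args) {
        Database db = FDB.selectAPIVersion(620).open();
        Range range = Tuple.from("events").range();        // hypothetical key prefix
        int pageSize = 10_000;
        byte[] lastKey = null;

        while (true) {
            final KeySelector begin = (lastKey == null)
                    ? KeySelector.firstGreaterOrEqual(range.begin)
                    : KeySelector.firstGreaterThan(lastKey);

            // Each page is read in its own transaction, so no single transaction
            // comes anywhere near the 5 second read-version limit.
            List<KeyValue> page = db.read(tr ->
                    tr.getRange(begin, KeySelector.firstGreaterOrEqual(range.end), pageSize)
                      .asList()
                      .join());

            for (KeyValue kv : page) {
                // process kv.getKey() / kv.getValue() here
            }
            if (page.size() < pageSize) {
                break;                                     // end of the range reached
            }
            lastKey = page.get(page.size() - 1).getKey();
        }
        db.close();
    }
}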
Firstly - yes, you are correct (your query took more than 5 seconds to complete).
If the read request's timestamp is older than 5 seconds, the storage server may have already flushed the data from its in-memory multi-version data structure to its on-disk single-version data structure. This means the storage server no longer has versions older than 5 seconds, so the client receives the error you've mentioned.
NB: You can avoid this problem by using a RecordCursor and passing a continuation to your query.
More on continuations here.
I am having some difficulty with the data modeling of an application which may involve the use of counters.
The app is basically a messaging app. Messages are capped for free users, hence the initial plan of using a counter column to keep track of the total count.
I've discovered that batches (logged or not) cannot contain operations on both standard tables and counter ones. How do I ensure correctness if I cannot batch the operation I am trying to perform together with the counter update? Is the counter type really needed if there is basically no race condition on the column, given that it is associated with each individual user?
My second idea would be to use a standard int column that is only updated inside batches. Is this a viable option?
Thank you
If you can absolutely guarantee that each user will produce only one update at a time, then you could rely on plain ints to do the job.
The problem, however, is that you will need to perform a read-before-write, which is an anti-pattern. You could work around this too, e.g. skipping the read by caching your ints and performing in-memory updates followed by writes only. This is viable if you couple your system with a caching server (e.g. Redis).
And thinking about it, you will still need to read these counters at some point, because if the number of messages a free user can send is bound to some value, then you need to perform a check when they log in/try to send a new message/look at the dashboard/etc. and block their action.
Another option (if you store the messages sent by each user somewhere and don't want to add complexity to your system) could be to count them directly with a SELECT COUNT-type query, even though this can become pretty inefficient very quickly in the Cassandra world.
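For reference, the counter route would look like this with the DataStax Java driver (schema and names are made up); the key point is that the message insert and the counter bump must be issued as two separate statements, since counter and non-counter tables cannot share a batch:

import com.datastax.oss.driver.api.core.CqlSession;
import com.datastax.oss.driver.api.core.cql.PreparedStatement;
import java.util.UUID;

public class MessageDao {

    private final CqlSession session;
    private final PreparedStatement insertMessage;
    private final PreparedStatement bumpCounter;

    public MessageDao(CqlSession session) {
        this.session = session;
        // hypothetical schema:
        //   messages(user_id uuid, msg_id timeuuid, body text, PRIMARY KEY (user_id, msg_id))
        //   message_counts(user_id uuid PRIMARY KEY, sent counter)
        this.insertMessage = session.prepare(
                "INSERT INTO messages (user_id, msg_id, body) VALUES (?, now(), ?)");
        this.bumpCounter = session.prepare(
                "UPDATE message_counts SET sent = sent + 1 WHERE user_id = ?");
    }

    public void send(UUID userId, String body) {
        // Two separate statements: a failure between them can leave the count slightly off,
        // which is the correctness gap the question is asking about.
        session.execute(insertMessage.bind(userId, body));
        session.execute(bumpCounter.bind(userId));
    }
}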
I want to create a job in Spring Batch which should consist of two steps:
Step 1 - The first step reads certain transactions from the database and produces a list of record IDs that will be sent to step 2 via a jobContext attribute.
Step 2 - This should be a partitioned step: the slave steps should be partitioned based on the list obtained from step 1 (each thread gets a different ID from the list) and perform their read/process/write operations without interfering with each other.
My problem is that even though I want to partition the data based on the list produced by step 1, Spring configures step 2 (and thus calls the partitioner's partition() method) before step 1 even starts, so I cannot inject the partitioning criteria in time. I tried using @StepScope on the partitioner bean, but it still attempts to create the partitions before the job starts.
Is there a way to dynamically create the step partitions during runtime, or an alternative way to divide a step into threads based on the list provided by step 1?
Some background:
I am working on a batch job using spring batch which has to process Transactions stored in a database. Every transaction is tied to an Account (in a different table), which has an accountBalance that also needs to be updated whenever the transaction is processed.
Since I want to perform these operations using multi-threading, I thought a good way to avoid collisions would be to group transactions based on their accountId, and have each thread process only the transactions that belong to that specific accountId. This way, no two threads will attempt to modify the same Account at the same time, as their Transactions will always belong to different Accounts.
However, I cannot know which accountIds need to be processed until I get the list of transactions to process and extract the accountIds from there, so I need to be able to provide the list to the partitioner at runtime. That's why I thought I could generate that list in a previous step, and then have the next step partition and process the data accordingly.
Is the approach I am taking plausible with this setup? Or should I just look for a different solution?
I couldn't find a way to partition the data mid-job like I wanted, so I had to use this workaround:
Instead of dividing the job into two steps, I moved the logic from step 1 (the "setup step") into a service method that returns the list of transactions to process, and added a call to that method inside the partition() method of my partitioner, allowing me to create the partitions based on the returned list.
This achieves the same result in my case, although I'm still interested in knowing if it is possible to configure the partitions mid-job, since this solution would not work if I had to perform more complex processing or writing in the setup step and wanted to configure exception handling policies and such. It probably would not work either if the setup step was placed in the middle of a step chain instead of at the start.
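The partitioner ended up looking roughly like this (the lookup service and key names are simplified):

import java.util.HashMap;
import java.util.List;
import java.util.Map;
import org.springframework.batch.core.partition.support.Partitioner;
import org.springframework.batch.item.ExecutionContext;

public class AccountIdPartitioner implements Partitioner {

    private final TransactionLookupService lookupService;   // holds the old "setup step" logic

    public AccountIdPartitioner(TransactionLookupService lookupService) {
        this.lookupService = lookupService;
    }

    @Override
    public Map<String, ExecutionContext> partition(int gridSize) {
        // The list is fetched here, when the partitioned step starts, instead of in a previous step.
        List<Long> accountIds = lookupService.findAccountIdsToProcess();

        Map<String, ExecutionContext> partitions = new HashMap<>();
        for (Long accountId : accountIds) {
            ExecutionContext context = new ExecutionContext();
            context.putLong("accountId", accountId);         // read back by a @StepScope reader
            partitions.put("partition-" + accountId, context);
        }
        return partitions;
    }
}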
I execute a batch update which modifies a few rows within a few column families. In case of a TimedOutException some data could be modified, but possibly not the whole set...
In order to implement a compensating transaction, I would need to know which data (rows) was modified - is there a way to find this out? Does the exception contain this information?
Thanks,
Maciej
Creating a system that can scale out means making some trade-offs - one of these is facilitating "idempotent" operations in your application.
This means that you would either:
assume that the data was written somewhere and that the node will eventually become consistent, or
fire the entire contents of the write again, perhaps after sleeping a given amount of time, or at a less restrictive consistency level
A good description of this approach can be found in section 6 of Pat Helland's "Building on Quicksand" paper: http://arxiv.org/pdf/0909.1788
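As an illustration of the "fire the write again" option, here is a rough retry sketch; it uses the modern DataStax Java driver rather than the Thrift-era API from the question, and the statement, attempt count and back-off are placeholders:

import com.datastax.oss.driver.api.core.ConsistencyLevel;
import com.datastax.oss.driver.api.core.CqlSession;
import com.datastax.oss.driver.api.core.DefaultConsistencyLevel;
import com.datastax.oss.driver.api.core.cql.SimpleStatement;
import com.datastax.oss.driver.api.core.servererrors.WriteTimeoutException;

public class IdempotentWriter {

    // Re-fires the whole (idempotent) write on timeout, dropping to a weaker
    // consistency level on the last attempt instead of trying to compensate.
    public static void writeWithRetry(CqlSession session, SimpleStatement statement, int maxAttempts)
            throws InterruptedException {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            ConsistencyLevel cl = (attempt == maxAttempts)
                    ? DefaultConsistencyLevel.ONE            // last try: less restrictive
                    : DefaultConsistencyLevel.QUORUM;
            try {
                session.execute(statement.setConsistencyLevel(cl));
                return;                                      // success, even if an earlier attempt partially landed
            } catch (WriteTimeoutException e) {
                Thread.sleep(200L * attempt);                // back off, then replay the entire write
            }
        }
        throw new IllegalStateException("write still timing out after " + maxAttempts + " attempts");
    }
}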