Partition data mid-job in Spring Batch - multithreading

I want to create a job in Spring Batch which should consist of two steps:
Step 1 - The first step reads certain transactions from the database and produces a list of record Ids that will be sent to step 2 via a job context attribute.
Step 2 - This should be a partition step: the slave steps should be partitioned based on the list obtained from step 1 (each thread gets a different Id from the list) and perform their read/process/write operations without interfering with each other.
My problem is that even though I want to partition the data based on the list produced by step 1, Spring configures step 2 (and thus calls the partitioner's partition() method) before step 1 even starts, so I cannot inject the partitioning criteria in time. I tried using @StepScope on the partitioner bean, but it still attempts to create the partitions before the job starts.
Is there a way to dynamically create the step partitions during runtime, or an alternative way to divide a step into threads based on the list provided by step 1?
Some background:
I am working on a batch job using Spring Batch which has to process Transactions stored in a database. Every transaction is tied to an Account (in a different table), which has an accountBalance that also needs to be updated whenever the transaction is processed.
Since I want to perform these operations using multi-threading, I thought a good way to avoid collisions would be to group transactions based on their accountId, and have each thread process only the transactions that belong to that specific accountId. This way, no two threads will attempt to modify the same Account at the same time, as their Transactions will always belong to different Accounts.
However, I cannot know which accountIds need to be processed until I get the list of transactions to process and extract the list from there, so I need to be able to provide the partitioning list at runtime. That's why I thought I could generate that list in a previous step, and then have the next step partition and process the data accordingly.
Is the approach I am taking plausible with this setup? Or should I just look for a different solution?

I couldn't find a way to partition the data mid-job like I wanted, so I had to use this workaround:
Instead of dividing the job into two steps, I moved the logic from step 1 (the "setup step") into a service method that returns the list of transactions to process, and added a call to that method inside the partition() method in my partitioner, allowing me to create the partitions based on the returned list.
This achieves the same result in my case, although I'm still interested in knowing whether it is possible to configure the partitions mid-job, since this solution would not work if I had to perform more complex processing or writing in the setup step and wanted to configure exception-handling policies and such. It probably would not work either if the setup step were placed in the middle of a step chain instead of at the start.
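For reference, here is a minimal sketch (plain Java, not the author's actual code) of what that workaround can look like: a Partitioner that calls a hypothetical TransactionService holding the former step-1 logic and creates one partition per accountId. The service name and its method are assumptions for illustration.

import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.springframework.batch.core.partition.support.Partitioner;
import org.springframework.batch.item.ExecutionContext;

public class AccountIdPartitioner implements Partitioner {

    // Hypothetical service holding the logic that used to live in the setup step.
    public interface TransactionService {
        List<Long> findAccountIdsWithPendingTransactions();
    }

    private final TransactionService transactionService;

    public AccountIdPartitioner(TransactionService transactionService) {
        this.transactionService = transactionService;
    }

    @Override
    public Map<String, ExecutionContext> partition(int gridSize) {
        // The former setup step now runs here, when the partition step actually starts.
        List<Long> accountIds = transactionService.findAccountIdsWithPendingTransactions();

        Map<String, ExecutionContext> partitions = new HashMap<>();
        int i = 0;
        for (Long accountId : accountIds) {
            ExecutionContext context = new ExecutionContext();
            context.putLong("accountId", accountId);   // each worker processes only this account's transactions
            partitions.put("partition-" + i++, context);
        }
        return partitions;
    }
}

Each worker step can then pick up its accountId through step-scope late binding, e.g. @Value("#{stepExecutionContext['accountId']}") on a @StepScope reader.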

Related

Azure Service Bus: Ordered Processing of Session Sequences

Are there any recommended architectural patterns with Service Bus for ensuring ordered processing of nested groups of messages which are sent out of order? We are using Sessions, but when it comes down to ensuring that a set of Sessions must be processed sequentially in a certain order before moving onto another set of Sessions, the architecture becomes cumbersome very quickly. This question might best be illustrated with an example.
We are using Service Bus to integrate changes in real-time from a database to a third-party API. Every N minutes, we get notified of a new 'batch' of changes from the database which consists of individual records of data across different entities. We then transform/map each record and send it along to an API. For example, a 'batch' of changes might include 5 new/changed 'Person' records, 3 new/changed 'Membership' records, etc.
At the outer-most level, we must always process one entire batch before we can move on to another batch of data, but we also have a requirement to process each type of entity in a certain order. For example, all 'Person' changes must be processed for a given batch before we can move on to any other objects.
There is no guarantee that these records will be queued up in any order which is relevant to how they will need to be processed, particularly within a 'batch' of changes (e.g. the data from different entity types will be interleaved).
We actually do not necessarily need to send the individual records of entity data in any order to the API (e.g. it does not matter in which order I send those 5 Person records for that batch, as long as they are all sent before the 3 Membership records for that batch). However, we do group the messages into Sessions by entity type so that we can guarantee homogeneous records in a given session and target all records for that entity type (this also helps us support a separate requirement we have when calling the API to send a batch of records when possible instead of an individual call per record to avoid API rate limiting issues). Currently, our actual Topic Subscription containing the record data is broken up into Sessions which are unique to the entity type and the batch.
"SessionId": "Batch1234\Person"
We are finding that it is cumbersome to manage the requirement that all changes for a given batch must be processed before we move on to the next batch, because there is no Session which reliably groups those "groups of entities" together (let alone processing those groups of entities themselves in a certain order). There is, of course, no concept of a 'session of sessions', and we are currently handling this by having a separate 'Sync' queue to represent an entire batch of changes which needs to be processed, along with which sessions of data are contained in that batch:
"SessionId": "Batch1234",
"Body":
{
"targets": ["Batch1234\Person", "Batch1234\Membership", ...]
}
This is quite cumbersome, because something (e.g. a Durable Azure Function) now has to orchestrate the entire process by watching the Sync queue and then spinning off separate processors that it oversees to ensure correct ordering at each level (which makes concurrency management and scalability much more complicated to deal with). If this is indeed a good pattern, then I do not mind implementing the extra orchestration architecture to ensure a robust, scalable implementation. However, I cannot help feeling that I am missing something or not thinking about the architecture the right way.
Is anyone aware of any other recommended pattern(s) in Service Bus for handling ordered processing of groups of data which themselves contain groups of data which must be processed in a certain order?
For the record, I'm not a Service Bus expert specifically.
The entire batch construct sounds painful - can you do away with it? Often if you have a painful input, you'll have a painful solution - the old "crap in, crap out" maxim. Sometimes it's just hard to find an elegant solution.
Do the 'sets of sessions' need to be processed in a specific order?
Is a 'batch' of changes = a session?
I can't think of a specific pattern, but a "divide and conquer" approach seems reasonable (which is roughly what you have already?):
Watch for new batches, when one occurs hand it off to a BatchProcessor.
BatchProcessor applies all the rules to the batch, as you outlined.
Consider having the BatchProcessor dump its results on a queue of some kind which is the source for the API - that way you have some kind of isolation between the batch processing and the API (a rough sketch of this flow follows).
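This is not specific SDK code, just a sketch of the shape of that divide-and-conquer flow; SyncMessage, SessionReader and ApiQueue are made-up placeholders for whatever watches the Sync queue, drains a Service Bus session, and feeds the API-facing queue.

import java.util.List;

public class BatchProcessor {

    // Hypothetical types only; not Azure Service Bus SDK classes.
    public interface SyncMessage { List<String> targets(); }          // e.g. ["Batch1234\\Person", "Batch1234\\Membership"]
    public interface SessionReader { List<String> drain(String sessionId); }
    public interface ApiQueue { void enqueue(List<String> records); }

    private final SessionReader sessionReader;
    private final ApiQueue apiQueue;

    public BatchProcessor(SessionReader sessionReader, ApiQueue apiQueue) {
        this.sessionReader = sessionReader;
        this.apiQueue = apiQueue;
    }

    // Handles one entry from the 'Sync' queue: the session ids for a batch,
    // in the order the entity types must be processed.
    public void process(SyncMessage sync) {
        for (String sessionId : sync.targets()) {
            List<String> records = sessionReader.drain(sessionId);  // records within a session can go in any order
            apiQueue.enqueue(records);                              // isolates batch processing from the API sender
        }
        // Only once every target session has been drained does the orchestrator pick up the next batch.
    }
}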

Lock CustomRecord Serials Table

We have a Fulfillment script in 1.0 that pulls a Serial number from the custom record based on SKU and other parameters. There is a search that is created based on SKU and the first available record is used. One of the criteria for the search is that there is no end user associated with the key.
We are working on converting the script to 2.0. What I am unable to figure out is: if the script (say the above functionality is put into the Map function of an MR script) runs on multiple queues/instances, does that mean there is a potential chance that 2 instances might hit the same entry of the Custom record? What is a workaround to ensure that X instances of the Map function don't end up using the same SN/Key? The way this could happen in 2.0 would be that 2 instances of Map make a search request on the Custom record at the same time and get the same results, since the first Map has not completed processing and marked the key as used (updating the end user information on the key).
Is there a better way to accomplish this in 2.0, or do I need to go about creating another custom record that the script would have to read to be able to pull the key off of? Also, is there a wait I can implement if the table is locked?
Thx
Probably the best thing to do here would be to break your assignment process into two parts, or restructure it so you end up with a Scheduled script that you give an explicit queue. That way your access to Serial Numbers will be serialized and no extra work would need to be done by you. If you need a hint on processing large batches with SS2, see https://github.com/BKnights/KotN-Netsuite-2 for a utility script that you can require for large batch processing.
If that's not possible then what I have done is the following:
Create another custom record called "Lock Table". It must have at least an id and a text field. Create one record and note its internal id. If you leave it with a name column then give it a name that reflects its purpose.
When you want to pull a serial number you:
Read from the Lock Table with a lookup field function. If it's not 0 then do a wait*.
If it's 0 then generate a random integer from 1 to MAX_SAFE_INTEGER.
Try to write that to the "Lock Table" with a submit field function, then read it back right away. If it contains your random number then you have the lock. If it doesn't then wait*.
If you have the lock then go ahead and assign the serial number. Release the lock by writing back a 0.
wait:
This is tough in NS. Since I am not expecting the s/n assignment to take much time, I've sometimes implemented a wait by simply looping through what I hope is a CPU-intensive task that has no governance cost until some time has elapsed. A rough sketch of the overall locking pattern follows.
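To make the steps above concrete, here is a language-neutral sketch in Java; the LockTable interface stands in for NetSuite's lookup/submit field calls, and Thread.sleep stands in for the governance-free busy wait described above. The names are illustrative, not SuiteScript API.

import java.util.concurrent.ThreadLocalRandom;

public class SerialNumberLock {

    // Stand-in for the single "Lock Table" custom record described above.
    public interface LockTable {
        long read();            // assumed: read the lock record's numeric field (lookup field)
        void write(long value); // assumed: write the field back (submit field)
    }

    private final LockTable lockTable;

    public SerialNumberLock(LockTable lockTable) {
        this.lockTable = lockTable;
    }

    public void withLock(Runnable assignSerialNumber) throws InterruptedException {
        while (true) {
            if (lockTable.read() != 0) { Thread.sleep(200); continue; }   // someone else holds the lock: wait
            long token = ThreadLocalRandom.current().nextLong(1, Long.MAX_VALUE);
            lockTable.write(token);
            if (lockTable.read() == token) break;                         // read back: our token survived, we own the lock
            Thread.sleep(200);                                            // lost the race: wait and retry
        }
        try {
            assignSerialNumber.run();   // safe to search for and assign the serial number here
        } finally {
            lockTable.write(0);         // release the lock
        }
    }
}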

Cassandra counter usage

I am finding some difficulties in the data modeling of an application which may involve the use of counters.
The app is basically a messaging app. Messages are capped for free users, hence the initial plan of using a counter column to keep track of the total count.
I've discovered that batches (logged or not) cannot contain operations on both standard tables and counter ones. How do I ensure correctness if I cannot batch the operation I am trying to perform and the counter update together? Is the counter type really needed if there's basically no race condition on the column, given that it is associated with each individual user?
My second idea would be to use a standard int column to use only inside batches. Is this a viable option?
Thank you
If you can absolutely guarantee that each user will produce only one update at a time then you could rely on plain ints to perform the job.
The problem, however, is that you will need to perform a read-before-write anti-pattern. You could work around this as well, e.g. by skipping the read part: cache your ints and perform in-memory updates followed by writes only. This is viable by coupling your system with a caching server (e.g. Redis).
And thinking about it, you will still need to read these counters at some point, because if the number of messages a free user can send is bound to some value then you need to perform a check when they log in/try to send a new message/look at the dashboard/etc. and block their action.
Another option (if you store the messages sent by each user somewhere and don't want to add complexity to your system) could be to directly count them with a SELECT COUNT... type of query, even if this could become pretty inefficient very soon in the Cassandra world.
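As a minimal sketch of the two alternatives, assuming the DataStax Java driver 3.x; the keyspace, table and column names are made up for illustration:

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;

public class MessageQuota {

    public static void main(String[] args) {
        Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
        try {
            Session session = cluster.connect();
            session.execute("CREATE KEYSPACE IF NOT EXISTS app WITH replication = "
                    + "{'class': 'SimpleStrategy', 'replication_factor': 1}");

            // Option 1: a counter column -- cannot be mixed with regular tables in a batch.
            session.execute("CREATE TABLE IF NOT EXISTS app.user_message_count "
                    + "(user_id text PRIMARY KEY, sent counter)");
            session.execute("UPDATE app.user_message_count SET sent = sent + 1 WHERE user_id = 'alice'");

            // Option 2: a plain int with read-before-write -- only safe if there is
            // a single writer per user, as discussed above.
            session.execute("CREATE TABLE IF NOT EXISTS app.user_quota "
                    + "(user_id text PRIMARY KEY, sent int)");
            Row row = session.execute("SELECT sent FROM app.user_quota WHERE user_id = 'alice'").one();
            int sent = (row == null) ? 0 : row.getInt("sent");
            session.execute("UPDATE app.user_quota SET sent = " + (sent + 1) + " WHERE user_id = 'alice'");
        } finally {
            cluster.close();
        }
    }
}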

Designing a Windows Azure Tables database for storing checkboxes states

With zero experience designing non-relational databases (Azure Storage Tables, to be specific), I'm having trouble coming up with a good design to store the data for my application.
The application is really simple. It is basically a multi-user To-Do list:
User selects a "Procedure".
User gets presented with webpage with several checkboxes.
User starts checking checkboxes.
Each check/uncheck gets stored in the DB.
For example, let's say that we have a procedure to obtain Milk:
Procedure 1 - How to obtain Milk:
[_] Step 1 - Open fridge
[_] Step 2 - Get Milk
[_] Step 3 - Close fridge
Alice decides to execute this procedure, so she creates a new execution and starts checking checkboxes:
Procedure 1, Execution 1:
Executor(s): Alice
[X] Step 1 - Open fridge
[X] Step 2 - Get Milk
[_] Step 3 - Close fridge
Bob also decides to execute this procedure, but not together with Alice. So, Bob creates a new execution. Charlie, on the other hand, wants to help Bob, so instead of creating a new execution he joins Bob's execution:
Procedure 1, Execution 2:
Executor(s): Bob, Charlie
[_] Step 1 - Open fridge
[X] Step 2 - Get Milk
[_] Step 3 - Close fridge
In summary, we can have multiple procedures, and each procedure can have multiple executions.
So, we need a way to store procedures (a list of checkboxes); executions (who, when, checkboxes states); and the history of checks/unchecks.
This is what I have come up with so far:
Create three tables: Procedures, Executions, Actions.
The Procedures table stores what checkboxes are there in each procedure.
The Executions table stores who initiated the execution of a Procedure and when, plus the checkbox states.
The Actions table stores every checkbox check and uncheck, including who and when.
I'm not too happy with this approach for a number of reasons. For instance, every time a user clicks on a checkbox we need to update the Executions table row and insert a new row into the Actions table at the same time. Also, I'm not sure if this design will scale for a really large number of Procedures, Executions, and Actions.
What would be a good way to store this data using Azure Storage Tables, or a similar NoSQL store? How would you go about designing this database? And, how would you partition the data (row keys, partition keys)?
First, you don't need to coerce Azure tables into a relational structure. They're very fast and very cheap, and designed so you can dump blocks of data in and worry about the structure when you retrieve it.
Second, correctly identifying and structuring your partition keys makes retrieval even faster.
Third, Azure tables don't have to have uniform structures. You can store different kinds of data within one table, even with the same partition keys. This opens up possibilities not available to an RDBMS.
So how are you planning to retrieve the data? What are the use cases?
Let's say your primary use case is to retrieve the data by time, like an audit log. In that case, I would suggest this approach:
Put your procedures, executions, and actions all within the same table.
Create a new table for each unit of time that gives you tens of thousands to hundreds of thousands of rows per table, or some other unit that makes sense. (For one project I've done recently, the application's event log uses one table per month, with each table growing to around 100,000 rows.)
Create a partition key that gives you hundreds to thousands of rows per partition. (We use hours remaining until DateTimeOffset.MaxValue. When you query an Azure table without using a partition key, you see the lowest partitions first. This descending-hourly scheme means the most recent hour's entries are at the top of the results pane in our Azure tool.)
Structure your row keys to be human-readable. Remember they need to be unique within the table. So possibly a row key like Procedure_Bob_ID12345_20140514-134630Z_unique, where unique is a counter or hash, would work (see the sketch after this list).
When you query for data, pull back the entire partition--remember, it's just a few hundred rows--and filter the results in memory, where it's faster.
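To make the key scheme concrete, here is a small sketch of computing such keys in plain Java; the MAX timestamp, padding width and key layout are illustrative choices, not prescribed values.

import java.time.Duration;
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;
import java.util.UUID;

public class TableKeys {

    // Rough stand-in for DateTimeOffset.MaxValue from the answer above.
    private static final Instant MAX = Instant.parse("9999-12-31T23:59:59Z");

    // Hours remaining until MAX, zero-padded so lexicographic order matches numeric order.
    // The most recent hour has the fewest hours remaining, so it sorts first.
    public static String partitionKey(Instant eventTime) {
        long hoursRemaining = Duration.between(eventTime, MAX).toHours();
        return String.format("%010d", hoursRemaining);
    }

    // Human-readable row key, e.g. Procedure_Bob_ID12345_20140514-134630Z_<unique>.
    public static String rowKey(String entity, String user, String id, Instant eventTime) {
        String stamp = DateTimeFormatter.ofPattern("yyyyMMdd-HHmmss'Z'")
                .withZone(ZoneOffset.UTC).format(eventTime);
        String unique = UUID.randomUUID().toString().substring(0, 8);   // a counter or hash would also do
        return entity + "_" + user + "_" + id + "_" + stamp + "_" + unique;
    }

    public static void main(String[] args) {
        Instant now = Instant.now();
        System.out.println(partitionKey(now) + " / " + rowKey("Procedure", "Bob", "ID12345", now));
    }
}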
Say you have a second use case where you need to retrieve data by user name. Simple: within the same table, add a second row containing the same data but with a partition key based on the user name (bob_execution_20140514).
Another thing to consider is storing the entire procedure etc. object graphs in the table. Getting back to our logging example, a log entry might have detailed information, so we just plop an entire block of JSON right in the table. (We're usually retrieving it in an Azure cloud service, so the network throughput isn't a meaningful constraint as Azure-to-Azure speeds within the same region are gigabits per second.)
Depending on your usage approach, use either the Procedure ID or a combination of ProcedureID-ExecutionID as the partition key. Don't worry about building a quasi-relational model - just choose the right partition key based on how you are most likely to create or consume the data in the majority of cases (i.e. will you care more about procedures, executions, assignees or steps in the longer term, and how might you retrieve all items related to a single entity such as a procedure in a single query?)
Depending on the volume of steps in a procedure, you might not even care too much about how step values are tracked (maybe using an integer or enum that could be combined via a bitwise operator?) - see Most common C# bitwise operations on enums.
The selection of PK, RK and other table properties depends on how you are going to use the data, your dominant query and application behavior. The storage team blog (http://blogs.msdn.com/b/windowsazurestorage/archive/2010/11/06/how-to-get-most-out-of-windows-azure-tables.aspx) has guidance on this for common scenarios.

Sequential numbering in the cloud

Ok so a simple task such as generating a sequential number has caused us an issue in the cloud.
Where you have more than one server it gets harder and harder to guarantee that the numbers allocated by different servers do not clash.
We are using Azure servers if it helps.
We thought about using the app cache but you cannot guarantee it will be updated between servers.
We are limited to using:
a SQL table with an identity column
or
some peer to peer method between servers
or
use a blob store and utilise the locks to store the most up-to-date number (this could have scaling issues)
I just wondered if anyone has an idea of a solution to resolve this?
Surely it's a simple problem and must have been solved by now.
If you can live with a use case where the numbers you get from this central location are not always sequential (but are guaranteed to be unique), I would suggest considering the following pattern. I've helped a large e-commerce client implement this since they needed unique int PKs to synchronize back to an on-premise system:
Create a queue and create a small always-running process that populates this queue with sequential integers (this process should remember which number it generated last and keep replenishing the pool with more numbers once the queue gets close to be empty)
Now, you can have your code first poll the next number from the queue, delete it from the queue and then attempt to save it into the SQL Azure database. In case of failure, all you'll have is a "hole" in your sequential numbers. In scenarios with frequent inserts, you may be saving things out of order to the database (two processes poll from queue, one polls first but saves last, the PK's saved to the database are not sequential anymore)
The biggest downside is that you now have to maintain/monitor a process that replenishes the pool of PK's.
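For illustration only, here is an in-memory sketch of that replenisher idea; a real implementation would back the pool with a durable queue (for example an Azure queue) rather than this BlockingQueue, and would persist the last generated number so it survives restarts.

import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.atomic.AtomicLong;

public class SequencePool {

    private final BlockingQueue<Long> pool = new ArrayBlockingQueue<>(1000);
    private final AtomicLong last = new AtomicLong(0);   // "remembers" the last number generated

    // The small always-running replenisher: keeps the pool topped up with sequential ids.
    public void startReplenisher() {
        Thread t = new Thread(() -> {
            while (true) {
                try {
                    pool.put(last.incrementAndGet());    // blocks while the pool is full
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return;
                }
            }
        });
        t.setDaemon(true);
        t.start();
    }

    // Callers take the next id; if the subsequent database save fails,
    // that id simply becomes a "hole" in the sequence.
    public long nextId() throws InterruptedException {
        return pool.take();
    }
}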
After reading this, I would not trust an identity column.
I think the best way is, before the insert, to get the last stored id and increment it by one (programmatically). Another option is to create a trigger, but it could be a mess if you receive a lot of concurrent requests on the DB or if your table has millions of records.
create trigger trigger_name
on table_name
after insert
as
declare @seq int
set @seq = (select max(id) + 1 from table_name)
update table_name
set table_name.id = @seq
from table_name
inner join inserted
on table_name.id = inserted.id
More info:
http://msdn.microsoft.com/en-us/library/windowsazure/ee336242.aspx
If you're worried about scaling the number generation when using blobs, then you can use the SnowMaker library, which is available on GitHub and NuGet. It gets around the scale problem by retrieving blocks of ids into a local cache. This guarantees that the ids are unique, but not necessarily sequential if you have more than one server. I'm not sure if that would achieve what you're after.
