Hibernate Implementing a Queue - multithreading

I have a Quartz schedule that inserts data into the TblTransactions table. I want to run another Quartz schedule with multiple instances/threads that will fetch records from TblTransactions, do some processing, and delete the records.
How do I ensure that a record fetched by one thread doesn't get fetched by another thread?
Can I integrate Oracle Advanced Queuing with Hibernate? What other options can I consider?
I am using Hibernate with Oracle 11g.

It can get very tricky to avoid fetching the same record twice when multiple threads read the same table, even if you somehow mark rows as fetched in the database (another thread could read the row before the marking transaction commits).
The way I would implement this is to use a single thread to fetch the records, split them up, and delegate N records to each processor thread, using Futures or callbacks to track progress. That way, if some processor thread fails, I know to re-submit its records for processing and/or log or email the error so admins can check whether there is invalid data or the like.
The processor threads could remove the records they handle themselves, either immediately after each record is processed or in one go after the whole batch is done. Alternatively, the fetch thread could keep a mapping of records to processor threads and, once a thread finishes successfully, remove all the records that thread processed.
If the fetch operation runs periodically while older records may still be in processing, you'd probably need that mapping on the fetch-thread side anyway, so you can tell whether a newly fetched batch contains records that are already being processed from an earlier run.
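Below is a minimal Java sketch of that single-fetcher pattern, assuming Hibernate 5+ (so Session is AutoCloseable). The TblTransaction entity stub, the pool size, and the batch sizes are placeholders of my own, not anything from the question; failed batches are simply left in the table to be re-fetched on the next run.

    import org.hibernate.Session;
    import org.hibernate.SessionFactory;
    import org.hibernate.Transaction;

    import javax.persistence.Entity;
    import javax.persistence.Id;
    import javax.persistence.Table;

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;

    // Single fetcher thread hands fixed-size batches to a pool of processor
    // threads and deletes a batch only after its Future reports success.
    public class TransactionQueueDrainer {

        private final SessionFactory sessionFactory;
        private final ExecutorService workers = Executors.newFixedThreadPool(4);

        public TransactionQueueDrainer(SessionFactory sessionFactory) {
            this.sessionFactory = sessionFactory;
        }

        public void drainOnce(int batchSize) {
            List<TblTransaction> fetched = fetch(batchSize * 4);   // only this thread reads the table
            Map<Future<?>, List<TblTransaction>> inFlight = new HashMap<>();

            for (int i = 0; i < fetched.size(); i += batchSize) {
                List<TblTransaction> batch = new ArrayList<>(
                        fetched.subList(i, Math.min(i + batchSize, fetched.size())));
                inFlight.put(workers.submit(() -> process(batch)), batch);  // N records per worker
            }

            for (Map.Entry<Future<?>, List<TblTransaction>> entry : inFlight.entrySet()) {
                try {
                    entry.getKey().get();        // blocks; throws if the worker failed
                    delete(entry.getValue());    // remove only successfully processed records
                } catch (Exception e) {
                    // leave the failed batch in the table so the next run re-fetches it,
                    // and log/alert here so admins can check for invalid data
                }
            }
        }

        private List<TblTransaction> fetch(int max) {
            try (Session session = sessionFactory.openSession()) {
                return session.createQuery("from TblTransaction", TblTransaction.class)
                        .setMaxResults(max)
                        .list();
            }
        }

        private void process(List<TblTransaction> batch) {
            // placeholder for the real per-record work
        }

        private void delete(List<TblTransaction> batch) {
            try (Session session = sessionFactory.openSession()) {
                Transaction tx = session.beginTransaction();
                batch.forEach(session::delete);
                tx.commit();
            }
        }

        // Placeholder entity; the real mapping comes from the application.
        @Entity
        @Table(name = "TblTransactions")
        public static class TblTransaction {
            @Id
            private Long id;
        }
    }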

Related

Is it bad to run cron jobs to poll from a huge table of scheduled job records?

I have a table that a cron job polls every minute to send out messages to other services. The records in the table are essentially activities that are scheduled to run at a certain time. The cron job simply checks which of those activities are ready to run and sends a message for each ready activity through SQS to the other services.
When the cron job finds an activity that is ready to run, it sends a message through SQS and then marks the record as done. There is an API that allows other services to check whether a scheduled activity has already been done, so I need to keep a history of those done records.
My concern, however, is whether a design like this is scalable in the long run. There are around 200k scheduled activities a day, sometimes more. Since I keep the records and mark them as done after they complete, I'm worried that the table will eventually grow to tens of millions of rows and become a problem for a cron job that polls this frequently.
Even with a properly indexed table, is my concern valid? If so, how else could I design this, given that I need to persist the scheduled activities somewhere for a cron job (or something similar) to poll and check when they are ready to run?
I'm using a Postgres database.
As long as the number of rows that the cron job's query has to fetch stays constant and you can use an index, the size of the table won't matter.
Index scans are O(n) with respect to the number of rows scanned and O(log(n)) with respect to the table size. To be more specific, increasing the table size by a factor between 10 and 200 (smaller size of the index key leads to better fan-out) will make an index scan use one more block, and that block is normally cached.
If the table gets large, you might still want to consider partitioning, but mostly so that you can get rid of old data efficiently.
With the right index, the cron job should have no serious problem. To keep the index small, you can use a partial/filtered index, like
    create index on jobs (id) where status <> 'done'
The query has to repeat the index's WHERE clause for the planner to use it.
I used (id) just because an empty column list is not allowed, so something has to be there. Based on your comment, schedule_dt might be a better choice. If you include all the columns you select, you can get an index-only scan. If you don't, the query will still use the index; it just has to visit the table to fetch the columns for the matching rows. I suspect trying for an index-only scan won't be worth it to you, because the pages you need probably won't be marked all-visible, given that neighboring tuples were modified just a minute ago.
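As a rough illustration of the partial-index idea (plain JDBC against Postgres; the connection details and the id/schedule_dt/status column names are assumptions based on the discussion above): the index is created once with the status <> 'done' predicate, and the polling query repeats that predicate so the planner can use it.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.Statement;
    import java.sql.Timestamp;
    import java.time.Instant;

    // Partial index on the not-yet-done rows; the polling query repeats the
    // same WHERE status <> 'done' predicate so the planner can use the index.
    public class DueJobPoller {

        public static void main(String[] args) throws Exception {
            try (Connection conn = DriverManager.getConnection(
                    "jdbc:postgresql://localhost/scheduler", "app", "secret")) {

                try (Statement ddl = conn.createStatement()) {
                    ddl.execute("CREATE INDEX IF NOT EXISTS jobs_pending_idx "
                            + "ON jobs (schedule_dt) WHERE status <> 'done'");
                }

                try (PreparedStatement ps = conn.prepareStatement(
                        "SELECT id, schedule_dt FROM jobs "
                                + "WHERE status <> 'done' AND schedule_dt <= ? "
                                + "ORDER BY schedule_dt")) {
                    ps.setTimestamp(1, Timestamp.from(Instant.now()));
                    try (ResultSet rs = ps.executeQuery()) {
                        while (rs.next()) {
                            long id = rs.getLong("id");
                            // hand the due job off to SQS here, then mark it done
                        }
                    }
                }
            }
        }
    }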
However, it does seem a bit odd to mark a job as 'done' when it has merely been dispatched, rather than actually completed.
There is an API which allows other services to check whether a scheduled activity has already been done.
A table that increases in size without bound is likely to present management problems apart from the cron job. Surely the services aren't going to have to look back months in order to do this, are they? Could you delete 'done' jobs after a few days? What if a service tries to look up a job and rather than finding it 'done', it just doesn't find it at all?
I don't think the cron job is inherently a problem, but it would be cleaner not to have it. Why doesn't whoever inserts the job just invoke SQS in real time?

DocumentDB: How to run a query without timing out

I am new to DocumentDB. I wrote a stored procedure that checks all records and updates them under certain circumstances.
Current scenario:
It processes 100 records at a time, updates them, and after running a few times (taking 100 records at a time and updating them) it times out.
Expectation
Run the script on all the records without timing out.
The collection has close to a million documents, so running the same script multiple times manually is not the approach I am looking for.
Can anyone please advise me how I can achieve that?
tl;dr: Keep calling the sproc, passing the query continuation token back and forth.
A few thoughts:
There is no RU capacity for a collection that will let you process all million documents in a single call to the sproc.
Sprocs run in isolation on a single replica. This means they can be transactional, but it also means they have lower throughput than a regular query, which can use all replicas to satisfy the request. So unless the operation needs to be transactional with writes, I recommend using direct queries for reads. Even then, with a million documents, your queries will max out and you'll have to run them again with a continuation token.
If you must use a sproc: as you are probably aware, since you have already done the 100-at-a-time approach, each query returns a continuation token. You can add that token to the package your sproc sends back when it runs out of time, pass it into the next call to the same sproc, and write the sproc to pick up where it left off. The documentdb-utils library for node.js automatically re-calls the sproc until it is done, as long as you follow this pattern when writing your sprocs. If you are using node.js, you could use that library (it has not yet been upgraded to support partitioned collections), or you could write the equivalent in whatever platform you are using.
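Here is a rough Java sketch of that driver loop. The SprocResult shape and the callBulkUpdateSproc helper are hypothetical stand-ins for whatever SDK call you use and the package your sproc returns; they are not part of any DocumentDB API.

    // Driver loop for the continuation-token pattern: call the sproc, let it do
    // as much work as it can before the time limit, then feed the token it
    // returns back into the next call until there is nothing left to do.
    public class BulkUpdateDriver {

        /** Shape of the package the sproc sends back (continuation + done flag). */
        static final class SprocResult {
            final String continuation;   // query continuation token, null when finished
            final boolean done;
            SprocResult(String continuation, boolean done) {
                this.continuation = continuation;
                this.done = done;
            }
        }

        public static void main(String[] args) throws Exception {
            String continuation = null;
            SprocResult result;
            do {
                result = callBulkUpdateSproc(continuation);   // one bounded unit of work
                continuation = result.continuation;           // pick up where the sproc left off
            } while (!result.done);
        }

        // Hypothetical wrapper around your SDK's executeStoredProcedure call; it
        // passes the continuation token as the sproc parameter and parses the
        // { continuation, done } object the sproc returns.
        private static SprocResult callBulkUpdateSproc(String continuation) {
            throw new UnsupportedOperationException("wire this to your DocumentDB client");
        }
    }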

Explicit table locking to disable DELETES?

Using Oracle 11gR2:
We already have a process that cleans up particular tables by deleting records from them that are past a specified retention date (based on the comparison between the timestamp from when the record finished processing and the retention date). I am currently writing code that will alert my team if this process fails. The only way I can see this process possibly failing is if DELETEs are disabled on the particular table it is trying to clean up.
I want to test the alerts to make sure they work and look correct by having the process fail. If I temporarily exclusively lock the table, will that disable DELETEs and cause the procedure that deletes records to fail? Or does it only disable DDL operations? Is there a better way to do this?
Assuming that "fail" means "throw an error" rather than, say, exceeding some performance bound, locking the table won't accomplish what you want. If you locked every row via a SELECT FOR UPDATE in one session, your delete job would block forever waiting for the first session to release its lock. That wouldn't throw an error and wouldn't cause the process to fail for most definitions. If your monitoring includes alerts for jobs that are running longer than expected, however, that would work well.
If your monitoring process only looks to see if the process ran and encountered an error, the easiest option would be to put a trigger on the table that throws an error when there is a delete. You could also create a child table with a foreign key constraint that would generate an error if the delete tried to delete the parent row while a child row exists. Depending on how the delete process is implemented, you probably could engineer a second process that would produce an ORA-00060 deadlock for the process you are monitoring but that is probably harder to implement than the trigger or the child table.
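For reference, a minimal sketch of the trigger approach (the table name, connection details, and error text are placeholders): install a statement-level BEFORE DELETE trigger that raises an application error, run the cleanup job so it fails and the alert fires, then drop the trigger.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.Statement;

    // Temporarily installs a trigger that makes any DELETE on the table fail,
    // so the cleanup job errors out and the new alerting can be verified.
    public class DisableDeletesForTest {

        public static void main(String[] args) throws Exception {
            try (Connection conn = DriverManager.getConnection(
                    "jdbc:oracle:thin:@//dbhost:1521/ORCL", "app", "secret");
                 Statement stmt = conn.createStatement()) {

                // Statement-level trigger: fires once per DELETE statement, even
                // if it would match zero rows, and raises a user-defined error.
                stmt.execute(
                    "CREATE OR REPLACE TRIGGER block_deletes_for_test " +
                    "BEFORE DELETE ON retention_cleanup_table " +
                    "BEGIN " +
                    "  RAISE_APPLICATION_ERROR(-20001, 'DELETEs disabled for alert testing'); " +
                    "END;");

                // ... run the cleanup job, confirm the alert fires, then clean up:
                // stmt.execute("DROP TRIGGER block_deletes_for_test");
            }
        }
    }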

jdbc inbound adapter with a stored procedure cursor

The JDBC inbound channel adapter relies on an update query to mark records that have already been processed; that's how we retrieve only the unprocessed records in subsequent polls. This makes sense, but I am working with a table that doesn't have a column I can modify to indicate that a record has been processed.
I was wondering whether I can use a stored procedure that returns a cursor, so that I don't have to load all of the (let's say) million records into memory and can still process, say, 1000 records per poll.
Edit: I am working with Oracle.
Yes, you can use a stored procedure for this. For that purpose Spring Integration provides the <int-jdbc:stored-proc-inbound-channel-adapter> component.
Here you can find the sample.
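The adapter ultimately just wraps a stored-procedure call, so as a point of reference here is a plain-JDBC sketch (not the Spring Integration component itself) of calling an Oracle procedure that returns a SYS_REFCURSOR and reading it with a modest fetch size. The procedure name, connection details, and the 1000-row limit are made up for illustration.

    import java.sql.CallableStatement;
    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;

    import oracle.jdbc.OracleTypes;

    // Calls a procedure such as:
    //   PROCEDURE get_unprocessed(p_rows OUT SYS_REFCURSOR)
    // and streams the cursor instead of loading every row into memory at once.
    public class RefCursorPoll {

        public static void main(String[] args) throws Exception {
            try (Connection conn = DriverManager.getConnection(
                    "jdbc:oracle:thin:@//dbhost:1521/ORCL", "app", "secret");
                 CallableStatement call = conn.prepareCall("{call get_unprocessed(?)}")) {

                call.registerOutParameter(1, OracleTypes.CURSOR);
                call.execute();

                try (ResultSet rs = (ResultSet) call.getObject(1)) {
                    rs.setFetchSize(1000);               // pull ~1000 rows per round trip
                    int handled = 0;
                    while (rs.next() && handled < 1000) {
                        // process one record here
                        handled++;
                    }
                }
            }
        }
    }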

Strategies for checking inactivity on Azure

I have a table in Azure Table Storage, with rows that are regularly updated by various processes. I want to efficiently monitor when rows haven't been updated within a specific time period, and to cause alerts to be generated if that occurs.
Most task scheduler implementations I've seen for Azure function by making sure only one worker will perform a given job at a time. However, setting up a scheduled task that waits n minutes, and then queries the latest time-stamp to determine if action should be taken, seems inefficient since the work won't be spread across workers. It also seems generally inefficient to have to poll so many records.
An example use of this would be to send an email to a user that hasn't logged into a web site in the last 30 days. Assume that the number of users is a "large number" for the purposes of producing an efficient algorithm.
Does anyone have any recommendations for strategies that could be used to check for recent activity without forcing only one worker to do the job?
Keep a LastActive table with a timestamp as a rowkey (DateTime.UtcNow.Ticks.ToString("d19")). Update it by doing a batch transaction that deletes the old row and inserts the new row.
Now the query for inactive users is just something like from user in LastActive where user.PartitionKey == string.Empty && user.RowKey < (DateTime.UtcNow - TimeSpan.FromDays(30)).Ticks.ToString("d19") select user. That will be quite efficient for any size table.
Depending on what you're going to do with that information, you might want to then put a message on a queue and then delete the row (so it doesn't get noticed again the next time you check). Multiple workers can now pull those queue messages and take action.
I'm confused about your desire to do this on multiple worker instances... you presumably want to act on an inactive user only once, so you want only one instance to do the check. (The work of sending emails or whatever else you're doing can then be spread about by using a queue, but that initial check should be done by exactly one instance.)
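A rough Java equivalent of that pattern, assuming the classic com.microsoft.azure.storage table SDK: the table name, connection-string handling, and the zero-padded epoch-millis row key are my own placeholders (any fixed-width, monotonically increasing string key plays the same role as Ticks.ToString("d19") does in the C# version).

    import com.microsoft.azure.storage.CloudStorageAccount;
    import com.microsoft.azure.storage.table.CloudTable;
    import com.microsoft.azure.storage.table.DynamicTableEntity;
    import com.microsoft.azure.storage.table.EntityProperty;
    import com.microsoft.azure.storage.table.TableBatchOperation;
    import com.microsoft.azure.storage.table.TableQuery;
    import com.microsoft.azure.storage.table.TableQuery.Operators;
    import com.microsoft.azure.storage.table.TableQuery.QueryComparisons;

    import java.util.concurrent.TimeUnit;

    // LastActive pattern: the row key is a fixed-width activity timestamp, so
    // one range query finds every user inactive for more than 30 days.
    public class LastActiveTable {

        // Zero-padded epoch millis; compared as strings these sort by time.
        static String timeKey(long epochMillis) {
            return String.format("%019d", epochMillis);
        }

        // Replace the user's old row with a new one in a single batch
        // transaction, so the table never holds more than one row per user.
        static void recordActivity(CloudTable table, String userId, String previousRowKey) throws Exception {
            DynamicTableEntity oldRow = new DynamicTableEntity();
            oldRow.setPartitionKey("");
            oldRow.setRowKey(previousRowKey);    // the key this process wrote last time
            oldRow.setEtag("*");                 // unconditional delete

            DynamicTableEntity newRow = new DynamicTableEntity();
            newRow.setPartitionKey("");
            newRow.setRowKey(timeKey(System.currentTimeMillis()));
            newRow.getProperties().put("UserId", new EntityProperty(userId));

            TableBatchOperation batch = new TableBatchOperation();
            batch.delete(oldRow);
            batch.insert(newRow);
            table.execute(batch);
        }

        // Everyone whose last-activity key is older than the 30-day cutoff.
        static Iterable<DynamicTableEntity> findInactive(CloudTable table) {
            String cutoff = timeKey(System.currentTimeMillis() - TimeUnit.DAYS.toMillis(30));
            String filter = TableQuery.combineFilters(
                    TableQuery.generateFilterCondition("PartitionKey", QueryComparisons.EQUAL, ""),
                    Operators.AND,
                    TableQuery.generateFilterCondition("RowKey", QueryComparisons.LESS_THAN, cutoff));
            return table.execute(TableQuery.from(DynamicTableEntity.class).where(filter));
        }

        public static void main(String[] args) throws Exception {
            CloudTable table = CloudStorageAccount.parse(System.getenv("STORAGE_CONN"))
                    .createCloudTableClient()
                    .getTableReference("LastActive");
            for (DynamicTableEntity user : findInactive(table)) {
                // put a message on a queue for the follow-up email, then delete the row
                System.out.println("inactive since key " + user.getRowKey());
            }
        }
    }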
