Is it bad to run cron jobs to poll from a huge table of scheduled job records? - node.js

I've a table which a cron job would poll at every minute to send out messages to other services. The records in the table are essentially activities that are scheduled to run at a certain time. The cron job simply checks to see which of those activities are ready to be run and send a message of that activity through SQS to the other services.
When an activity is found to be ready to run by the cron job, that record will be marked as done after sending a message through SQS. There is an API which allows other services to check whether a scheduled activity has already been done. So keeping a history of those done records is needed.
My concern here, however, is whether a design like this is scalable in the long run. There are around 200k scheduled activities a day, or even more on some days. Since I'm keeping the records by marking them as done after they are completed, I'm worried that the table will eventually get very huge with ten over millions of rows and become an issue for the cron job to run as frequently.
Even with a properly indexed table, is my concern valid? Otherwise, what other alternatives can I design it if I had to somehow persist those scheduled activities for a cron or something to poll and check when they are ready to run?
I'm using Postgres database.

As long as the number of rows that the cron job's query has to fetch stays constant and you can use an index, the size of the table won't matter.
Index scans are O(n) with respect to the number of rows scanned and O(log(n)) with respect to the table size. To be more specific, increasing the table size by a factor between 10 and 200 (smaller size of the index key leads to better fan-out) will make an index scan use one more block, and that block is normally cached.
If the table gets large, you might still want to consider partitioning, but mostly so that you can get rid of old data efficiently.

With the right index, the cron job should have no serious problem. You can have a partial/filtered index, like
create index on jobs (id) where status <> 'done'.
To keep the size of the index small. The query has to match the index where clause.
I used (id) just because an empty list is not allowed and so something has to be there. Based on your comment, schedule_dt might be a better choice. If you include all the columns you select, you can get an index-only scan. But if you don't, it will still use the index, it just has to visit the table to fetch the columns for those specific rows. I suspect the index only scan attempt won't be worth it to you as the pages you need probably won't be marked all visible, as modifications were made to neighboring tuples just one minute ago.
However, it does seem a bit odd to mark a job as done when it has only been scheduled, rather than being done.
There is an API which allows other services to check whether a scheduled activity has already been done.
A table that increases in size without bound is likely to present management problems apart from the cron job. Surely the services aren't going to have to look back months in order to do this, are they? Could you delete 'done' jobs after a few days? What if a service tries to look up a job and rather than finding it 'done', it just doesn't find it at all?
I don't think the cron job is inherently a problem, but it would be cleaner not to have it. Why doesn't whoever inserts the job just invoke SQS in real time?

Related

Test functionality manually by adding maximum records in SQL table, how should i add those many records?

I have to test functionality manually where , if a background job fails then a record is added in a table having columns Exception Id, Job Id, Job Name, Exception, Method Name, Service Name, etc. columns. At present there are around 25 background jobs in the application and i want to test the impact of adding maximum records in the table. How should I do that? Or i have to manually fail jobs to add maximum record in the table (which is practically not possible, as the we can add unlimited records in the table)Is there any way?
I tried to manually fail the job, so that records can be added in the table, however, adding thousands of records manually is practically impossible.
How should i add maximum records in the table manually/ How should i test this scenario?
Can you try doing the following :
Create various Windows batch jobs covering the 25 background jobs.
Create Windows task scheduler such that it is continuously triggered. You may try scheduling the job execution on the same day or spread across multiple days.
What this will do is you no longer need to run the background jobs manually as the task scheduler will take care of this.

What is the best way to query timeseries data with cassandra?

My table is a time series one. The queries are going to process the latest entries and TTL expire them after successful processing. If they are not successfully processed, TTL will not set.
The only query I plan to run on this is to select all entries for a given entry_type. They will be processed and records corresponding to processed entries will be expired.
This way every time I run this query I will get all records in the table that are not processed and processing will be done. Is this a reasonable approach?
Would using a listenablefuture with my own executor add any value to this considering that the thread doing the select is just processing.
I am concerned about the TTL and tombstones. But if I use clustering key of timeuuid type is this ok?
You are right one important thing getting in your way will be tombstones. By Default you will keep them around for 10 days. Depending on your access patter this might cause significant problems. You can lower this by setting the directly on the table or change it in the cassandra yaml file. Then it will be valid for all the newly created table gc_grace_seconds
http://docs.datastax.com/en/cql/3.1/cql/cql_reference/tabProp.html
It is very important that you make sure you are running the repair on whole cluster once within this period. So if you lower this setting to let's say 2 days, then within two days you have to have one full repair done on the cluster. This is very important because processed data will reaper. I saw this happening multiple times, and is never pleasant especially if you are using cassandra as a queue and it seems to me that you might be using it in your solution. I'll try to give some tips at the end of the answer.
I'm slightly worried about you setting the ttl dynamically depending on result. What would be the point of inserting the ttl-ed data that was successful and keeping forever the data that wasn't. I guess some sort of audit or something similar. Again this is a queue pattern, try to avoid this if possible. Also one thing to keep in mind is that you will almost always insert the data once in the beginning and then once again with the ttl should your processing be o.k.
Also getting all entries might be a bit tricky. For very moderate load 10-100 req/s this might be reasonable but if you have thousands per second getting all the requests every time might not be a good idea. At least not if you put them into single partition.
Separating the workload is also good idea. So yes using listenable future seems totally legit.
Setting clustering key to be timeuuid is usually the case with time series thata and I totally agree with you on this one.
In reality as I mentioned earlier you have to to take into account you will be saving 10 days worth of data (unless you tweak it) no matter what you do, it doesn't matter if you ttl it. It's still going to be ther, and every time cassandra will scan the partition will have to read the ttl-ed columns. In short this is just pain. I would seriously consider actually using something as kafka if I were you because what you are describing simply looks to me like a queue.
If you still want to stick with cassandra then please consider using buckets (adding date info to partitioning key and having a composite partitioning key). Depending on the load you are expecting you will have to bucket by month, week, day, hour even minutes. In some cases you might even want to add artificial columns to reduce load on the cluster. But then again this might be out of scope of this question.
Be very careful when using cassandra as a queue, it's a known antipattern. You can do it, but there are a lot of variables and it extremely depends on the load you are using. I once consulted a team that sort of went down the path of cassandra as a queue. Since basically using cassandra there was a must I recommended them bucketing the data by day (did some calculations that proved this is o.k. time unit) and I also had a look at this solution https://github.com/paradoxical-io/cassieq basically there are a lot of good stuff in this repo when using cassandra as a queue, data models etc. Basically this team had zombie rows, slow reading because of the tombstones etc. etc.
Also the way you described it it might happen that you have "hot rows" basically since you would just have one wide partition where all your data would go some nodes in the cluster might not even be that good utilised. This can be avoided by artificial columns.
When using cassandra as a queue it's very easy to mess a lot of things up. (But it's possible for moderate workloads)

Dynamodb infrequently scheduled scan

I am implementing a session table with nodejs which will grow to a huge number of items. each hash key is a uuid representing a user.
In order to delete the expired sessions, I must scan the table for expired attribute and delete old sessions. I am planning to do this scan once a few days, and other than that, I don't really need high read capacity.
I came out with 2 solutions, and i would like to hear some feedback about them.
1) UpdateTable to higher capacities for only that scheduled routine, and after the scan is done, simply reduce the table capacities to it's original values.
2) Perform the scan, and when retrieving the 'LastEvaluatedKey' after an x*MB read, create a initiation delay (for not consuming all read/sec units), and then continue the scan with 'ExclusiveStartKey'.
If you're doing a scan, option 1 is your best best. This is the only real way to guarantee that you won't effect your application performance while the scan is ongoing.
The only thing you need to be sure of is that you only run this operation once a day -- I believe you can only DOWNGRADE throughput units on a DynamoDB table 2x's per day (at most).
This is an old question, but I saw it through a related question.
There is now a much better native solution: DynamoDB Time to Live
It allows you to specify one attribute per table that serves as the time to live value for each item. You can then set the attribute per item with a Unix-Timestamp that specifies when the item should be deleted.
Within about 24 hours of that timestamp, the item will be deleted at no additional charge.

Adding and retrieving sorted counts in Cassandra

I have a case where I need to record a user action in Cassandra, then later retrieve a sorted list of users with the highest number of that action in an arbitrary time period.
Can anyone suggest a way to store and retrieve this data in a pre-aggregated method?
Outside of Cassandra I would recommend using stream-summary or count min sketch you would be able to solve this with much less space and have immediate results. Just update and periodically serialize and persist it (assuming you don't need guaranteed accuracy)
In Cassandra you can keep a row per period of time like by hours and have a counter per user in that row, incrementing them on use. Then use a batch job to run through them and find the heavy hitters. You would be constrained to having the minimal queryable time be 1 hour and it wont be particularly cheap or fast to compute but it would work.
Generally it would be good treating these as a log of operation, every time there is an event store it and have batch jobs do analytics against it with hadoop or custom. If need it realtime id recommend the above approach of keeping stream summaries in memory.

Strategies for checking inactivity on Azure

I have a table in Azure Table Storage, with rows that are regularly updated by various processes. I want to efficiently monitor when rows haven't been updated within a specific time period, and to cause alerts to be generated if that occurs.
Most task scheduler implementations I've seen for Azure function by making sure only one worker will perform a given job at a time. However, setting up a scheduled task that waits n minutes, and then queries the latest time-stamp to determine if action should be taken, seems inefficient since the work won't be spread across workers. It also seems generally inefficient to have to poll so many records.
An example use of this would be to send an email to a user that hasn't logged into a web site in the last 30 days. Assume that the number of users is a "large number" for the purposes of producing an efficient algorithm.
Does anyone have any recommendations for strategies that could be used to check for recent activity without forcing only one worker to do the job?
Keep a LastActive table with a timestamp as a rowkey (DateTime.UtcNow.Ticks.ToString("d19")). Update it by doing a batch transaction that deletes the old row and inserts the new row.
Now the query for inactive users is just something like from user in LastActive where user.PartitionKey == string.Empty && user.RowKey < (DateTime.UtcNow - TimeSpan.FromDays(30)).Ticks.ToString("d19") select user. That will be quite efficient for any size table.
Depending on what you're going to do with that information, you might want to then put a message on a queue and then delete the row (so it doesn't get noticed again the next time you check). Multiple workers can now pull those queue messages and take action.
I'm confused about your desire to do this on multiple worker instances... you presumably want to act on an inactive user only once, so you want only one instance to do the check. (The work of sending emails or whatever else you're doing can then be spread about by using a queue, but that initial check should be done by exactly one instance.)

Resources