Strategies for checking inactivity on Azure

I have a table in Azure Table Storage, with rows that are regularly updated by various processes. I want to efficiently monitor when rows haven't been updated within a specific time period, and to cause alerts to be generated if that occurs.
Most task scheduler implementations I've seen for Azure function by making sure only one worker will perform a given job at a time. However, setting up a scheduled task that waits n minutes, and then queries the latest time-stamp to determine if action should be taken, seems inefficient since the work won't be spread across workers. It also seems generally inefficient to have to poll so many records.
An example use of this would be to send an email to a user that hasn't logged into a web site in the last 30 days. Assume that the number of users is a "large number" for the purposes of producing an efficient algorithm.
Does anyone have any recommendations for strategies that could be used to check for recent activity without forcing only one worker to do the job?

Keep a LastActive table with a timestamp as a rowkey (DateTime.UtcNow.Ticks.ToString("d19")). Update it by doing a batch transaction that deletes the old row and inserts the new row.
Now the query for inactive users is just something like from user in LastActive where user.PartitionKey == string.Empty && user.RowKey < (DateTime.UtcNow - TimeSpan.FromDays(30)).Ticks.ToString("d19") select user. That will be quite efficient for any size table.
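For illustration, here's a minimal sketch of that pattern using the newer Azure.Data.Tables SDK (this answer predates it, so treat the entity shape and property names as assumptions):

using System;
using System.Collections.Generic;
using Azure;
using Azure.Data.Tables;

public static class LastActive
{
    // Replace the user's timestamp row in one entity group transaction.
    // All rows share PartitionKey = "" so they can be batched together;
    // the caller must know the old RowKey (e.g. stored on the user record).
    public static void Touch(TableClient table, string userId, string oldRowKey)
    {
        var newRowKey = DateTime.UtcNow.Ticks.ToString("d19");
        table.SubmitTransaction(new List<TableTransactionAction>
        {
            new(TableTransactionActionType.Delete,
                new TableEntity(string.Empty, oldRowKey), ETag.All),
            new(TableTransactionActionType.Add,
                new TableEntity(string.Empty, newRowKey) { ["UserId"] = userId }),
        });
    }

    // Every user whose last-activity timestamp is older than the cutoff.
    public static Pageable<TableEntity> Inactive(TableClient table, TimeSpan window)
    {
        var cutoff = (DateTime.UtcNow - window).Ticks.ToString("d19");
        return table.Query<TableEntity>(
            filter: $"PartitionKey eq '' and RowKey lt '{cutoff}'");
    }
}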
Depending on what you're going to do with that information, you might want to then put a message on a queue and then delete the row (so it doesn't get noticed again the next time you check). Multiple workers can now pull those queue messages and take action.
I'm confused about your desire to do this on multiple worker instances... you presumably want to act on an inactive user only once, so you want only one instance to do the check. (The work of sending emails or whatever else you're doing can then be spread about by using a queue, but that initial check should be done by exactly one instance.)

Related

How to architect job scheduler

I am building a job scheduler and I am stuck between two approaches. I have two types of jobs, ones that are scheduled for a specific date and ones that run hourly. For the specific date ones, I poll my database table that stores the jobs and post the results to a rabbitmq message broker where specific workers process them. This works well for more defined tasks like sending reminder notifications or emails. For the hourly jobs, I have a cron expression based job running and have the logic directly in the function, so it does not go to a queue. Usually, these are jobs to clean up my database or set certain values based on previous day activity, etc.
I am wondering what the best way to architect this is. Does it make sense to have all these smaller jobs running on a cadence as microservices and listen on a queue? Should I group all of them together into one service? Should I combine all logic of both types into one large worker app?
In my opinion, as described, you have two different systems doing two different things, and this is a problem.
For example consider sending an email. With the design you have right now you have to write your email code twice -- once for messages that are sent to the queue and once for the ones sent from the cron.
Ideally you want to make supporting the system in your design as easy as possible. For example if your design uses a queue then ALL actions should work the same way -- get a command message and parameters off the queue and execute them. Your cron job and your scheduler would both add messages to the queue. Supporting a new action would just mean coding it once and then both systems could add a message to the queue.
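To make that concrete, here's a hedged sketch of the single command-message shape; all the names here are invented for illustration:

using System.Collections.Generic;

// One message shape for every action in the system.
public record Command(string Action, IReadOnlyDictionary<string, string> Parameters);

public interface ICommandQueue { void Enqueue(Command cmd); }

public static class Producers
{
    // The hourly cron and the date-based scheduler both just enqueue commands;
    // the workers executing them never know (or care) which one sent them.
    public static void OnCronTick(ICommandQueue q) =>
        q.Enqueue(new Command("cleanup-database", new Dictionary<string, string>()));

    public static void OnScheduledJobDue(ICommandQueue q, string userId) =>
        q.Enqueue(new Command("send-reminder-email",
            new Dictionary<string, string> { ["userId"] = userId }));
}

With this shape, supporting a new action means writing one handler and letting either producer enqueue it.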
The same should be true of your data model. The data model should support both types of schedule but share as much as possible.
The last time I created a system like this I did something along the following lines:
I had an events table -- this table had a list of events to fire at specific dates and times in the future.
I had a recur table -- this table had a list of recurring events (e.g. every week on a Tuesday at this time).
There was a process -- it would look at the events table to see if there was something that needed to fire. If there was, it would fire it. THEN it would remove it from the events table (no longer in the future) and log that it had been run. It would also check if this was an event from the recur table -- if it was, it would add the next future event to the events table.
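A rough sketch of that firing loop, with in-memory lists standing in for the two tables (all names are illustrative, not from the original system):

using System;
using System.Collections.Generic;
using System.Linq;

public record ScheduledEvent(Guid Id, DateTime FireAt, string Action, Guid? RecurrenceId);
public record Recurrence(Guid Id, string Action, TimeSpan Interval);

public class Dispatcher
{
    readonly List<ScheduledEvent> events = new();   // stands in for the events table
    readonly List<Recurrence> recurrences = new();  // stands in for the recur table

    public void Tick(Action<string> fire)
    {
        foreach (var e in events.Where(e => e.FireAt <= DateTime.UtcNow).ToList())
        {
            fire(e.Action);    // fire it (e.g. enqueue a command message)
            events.Remove(e);  // no longer in the future; log it as run elsewhere
            if (e.RecurrenceId is Guid rid)  // recurring? schedule the next occurrence
            {
                var r = recurrences.First(x => x.Id == rid);
                events.Add(e with { Id = Guid.NewGuid(), FireAt = e.FireAt + r.Interval });
            }
        }
    }
}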
Note, this design only has two tables here in this simplified explanation of how the events table worked, but in reality there were a number of others in the complete data model. For example I had a number of tables to support different event types. (Email templates for the email events, etc).

Is it bad to run cron jobs to poll from a huge table of scheduled job records?

I have a table that a cron job polls every minute to send out messages to other services. The records in the table are essentially activities that are scheduled to run at a certain time. The cron job simply checks which of those activities are ready to run and sends a message for each one through SQS to the other services.
When an activity is found to be ready to run by the cron job, that record will be marked as done after sending a message through SQS. There is an API which allows other services to check whether a scheduled activity has already been done. So keeping a history of those done records is needed.
My concern here, however, is whether a design like this is scalable in the long run. There are around 200k scheduled activities a day, or even more on some days. Since I'm keeping the records by marking them as done after they are completed, I'm worried that the table will eventually grow to tens of millions of rows and become a problem for a cron job running this frequently.
Even with a properly indexed table, is my concern valid? If so, how else could I design this if I have to persist those scheduled activities for a cron job (or something similar) to poll and check when they are ready to run?
I'm using Postgres database.
As long as the number of rows that the cron job's query has to fetch stays constant and you can use an index, the size of the table won't matter.
Index scans are O(n) with respect to the number of rows scanned and O(log(n)) with respect to the table size. To be more specific, increasing the table size by a factor between 10 and 200 (smaller size of the index key leads to better fan-out) will make an index scan use one more block, and that block is normally cached.
If the table gets large, you might still want to consider partitioning, but mostly so that you can get rid of old data efficiently.
With the right index, the cron job should have no serious problem. To keep the size of the index small, you can have a partial/filtered index, like
create index on jobs (id) where status <> 'done';
The query has to match the index's where clause for the index to be used.
I used (id) just because an empty list is not allowed and so something has to be there. Based on your comment, schedule_dt might be a better choice. If you include all the columns you select, you can get an index-only scan. But if you don't, it will still use the index, it just has to visit the table to fetch the columns for those specific rows. I suspect the index only scan attempt won't be worth it to you as the pages you need probably won't be marked all visible, as modifications were made to neighboring tuples just one minute ago.
However, it does seem a bit odd to mark a job as done when it has only been dispatched, rather than actually completed.
There is an API which allows other services to check whether a scheduled activity has already been done.
A table that increases in size without bound is likely to present management problems apart from the cron job. Surely the services aren't going to have to look back months in order to do this, are they? Could you delete 'done' jobs after a few days? What if a service tries to look up a job and rather than finding it 'done', it just doesn't find it at all?
I don't think the cron job is inherently a problem, but it would be cleaner not to have it. Why doesn't whoever inserts the job just invoke SQS in real time?

Sequential numbering in the cloud

Ok so a simple task such as generating a sequential number has caused us an issue in the cloud.
Where you have more than one server, it gets harder and harder to guarantee that the numbers allocated on different servers don't clash.
We are using Azure servers if it helps.
We thought about using the app cache, but you cannot guarantee it will be kept in sync between servers.
We are limited to using:
a SQL table with an identity column
or
some peer to peer method between servers
or
use a blob store and utilise its leases to store the most up-to-date number (this could have scaling issues)
I just wondered if anyone has an idea of a solution to resolve this?
Surely it's a simple problem and must have been solved by now.
If you can live with a use case where the numbers you get from this central location are not always sequential (but guaranteed to be unique), I would suggest considering the following pattern. I've helped a large e-commerce client implement this since they needed unique int PKs to synchronize back on-premises:
Create a queue and create a small always-running process that populates this queue with sequential integers (this process should remember which number it generated last and keep replenishing the pool with more numbers once the queue gets close to empty).
Now, you can have your code first poll the next number from the queue, delete it from the queue and then attempt to save it into the SQL Azure database. In case of failure, all you'll have is a "hole" in your sequential numbers. In scenarios with frequent inserts, you may be saving things out of order to the database (two processes poll from queue, one polls first but saves last, the PK's saved to the database are not sequential anymore)
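A minimal sketch of the replenisher and consumer, using the Azure.Storage.Queues SDK (which postdates this answer); how the replenisher durably persists its high-water mark is left open:

using Azure.Storage.Queues;

public static class IdPool
{
    // Always-running replenisher: top up the pool when it runs low.
    public static void Replenish(QueueClient pool, ref long lastGenerated, int batch = 1000)
    {
        if (pool.GetProperties().Value.ApproximateMessagesCount > batch / 2) return;
        for (var i = 0; i < batch; i++)
            pool.SendMessage((++lastGenerated).ToString()); // persist lastGenerated somewhere durable
    }

    // Consumer: take the next unique number. A crash after the delete just
    // leaves a "hole" in the sequence; a redelivered message that was already
    // saved will be rejected by the PK constraint rather than duplicated.
    public static long? NextId(QueueClient pool)
    {
        var msg = pool.ReceiveMessage().Value;
        if (msg == null) return null; // pool ran dry
        pool.DeleteMessage(msg.MessageId, msg.PopReceipt);
        return long.Parse(msg.Body.ToString());
    }
}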
The biggest downside is that you now have to maintain/monitor a process that replenishes the pool of PK's.
After reading this, I would not trust identity columns.
I think the best way is, before the insert, to get the last stored id and increment it by one (programmatically). Another option is to create a trigger, but it could be a mess if you receive a lot of concurrent requests on the DB or if your table has millions of records.
create trigger trigger_name
on table_name
after insert
as
begin
    -- Note: this assumes single-row inserts, and max(id) is still racy
    -- under concurrent inserts (see the caveat above).
    declare @seq int;
    set @seq = (select max(id) + 1 from table_name);
    update table_name
    set table_name.id = @seq
    from table_name
    inner join inserted
        on table_name.id = inserted.id;
end
More info:
http://msdn.microsoft.com/en-us/library/windowsazure/ee336242.aspx
If you're worried about scaling the number generation when using blobs, then you can use the SnowMaker library which is available on GitHub and Nuget. It gets around the scale problem by retrieving blocks of ids into a local cache. This guarantees that the Ids are unique, but not necessarily sequential if you have more than one server. I'm not sure if that would achieve what you're after.

Running query on database after a document/row is of certain age

What is the best practice for running a database query after any document in a collection reaches a certain age?
Let's say this is a Node.js web system with MongoDB, with a collection of posts. After a new post is inserted, it should be updated with some data after 60 minutes.
Would a cron job that checks all posts with (age < one hour) every minute or two be the best solution? What would be the least stressful solution if this system has >10,000 active users?
Some ideas:
Create a second collection as a queue with a "time to update" field which would contain the time at which the source record needs to be updated. Index it, and scan through looking for values older than "now".
Include the field mentioned above in the original document and index it the same way
You could just clear the value when done or reset it to the next 60 minutes depending on behavior (rather than inserting/deleting/inserting documents into the collection).
By keeping the update-collection distinct, you have a better chance of always keeping the entire working set of queued updates in memory (compared to storing the update info in your posts).
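A rough sketch of that queue-collection scan (shown with the MongoDB .NET driver to keep one language across these examples, even though the question's stack is Node.js; the field and collection names are invented):

using System;
using MongoDB.Bson;
using MongoDB.Driver;

public static class PostUpdater
{
    // Assumes the queue collection has an ascending index on "updateAt", e.g.:
    // queue.Indexes.CreateOne(new CreateIndexModel<BsonDocument>(
    //     Builders<BsonDocument>.IndexKeys.Ascending("updateAt")));
    public static void ProcessDue(IMongoCollection<BsonDocument> queue)
    {
        var due = Builders<BsonDocument>.Filter.Lte("updateAt", DateTime.UtcNow);
        foreach (var item in queue.Find(due).ToEnumerable())
        {
            // ... apply the 60-minute update to the post that item["postId"] refers to ...
            queue.DeleteOne(Builders<BsonDocument>.Filter.Eq("_id", item["_id"]));
        }
    }
}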
I'd kick off the update not as a web request to the same instance of Node but instead as a separate process so as to not block user-requests.
As to how you schedule it -- that's up to you and your architecture and what's best for your system. There's no right "best" answer, especially if you have multiple web servers or a sharded data system.
You might use a capped collection, although you'd run the risk of losing records that still need to be updated (in exchange for a performance gain)

Atomic operations in azure table storage

I am looking to implement a page view counter in azure table storage. If say two users visit the page at the same time, and the current value on PageViews = 100, is it guaranteed that the PageViews = 102 after the update operation?
The answer depends on how you implement your counter. :-)
Table storage doesn't have an "increment" operator, so you'd need to read the current value (100) and update it to the new value (101). Table storage employs optimistic concurrency, so if you do what comes naturally when using the .NET storage client library, you'd likely see an exception when two processes tried to do this simultaneously. This would be the flow:
Process A reads the value of PageViews and receives 100.
Process B reads the value of PageViews and receives 100.
Process A makes a conditional update to PageViews that means "set PageViews to 101 as long as it's currently 100." This succeeds.
Process B performs the same operations and fails, because the precondition (PageViews == 100) is false.
The obvious thing to do when you receive the error is to repeat the process. (Read the current value, which is now 101, and update to 102.) This will always (eventually) result in your counter having the correct value.
There are other possibilities, and we did an entire Cloud Cover episode about how to implement a truly scalable counter: http://channel9.msdn.com/Shows/Cloud+Cover/Cloud-Cover-Episode-43-Scalable-Counters-with-Windows-Azure.
What's described in that video is probably overkill if collisions are unlikely. I.e., if your hit rate is one-per-second, the normal "read, increment, write" pattern will be safe and efficient. If, on the other hand, you receive 1000 hits per second, you'll want to do something smarter.
EDIT
Just wanted to clarify for people who read this to understand optimistic concurrency... the conditional operation isn't really "set PageViews to 101 as long as it's currently 100." It's more like "set PageViews to 101 as long as it hasn't changed since the last time I looked at it." (This is accomplished by using the ETag that came back in the HTTP request.)
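A minimal sketch of that read-increment-retry loop, using the newer Azure.Data.Tables SDK (which postdates this answer; the partition and row keys are illustrative):

using Azure;
using Azure.Data.Tables;

public static class PageViewCounter
{
    public static void Increment(TableClient table)
    {
        while (true)
        {
            // Read the current value together with its ETag.
            var entity = table.GetEntity<TableEntity>("Counters", "PageViews").Value;
            entity["Count"] = (entity.GetInt64("Count") ?? 0) + 1;
            try
            {
                // "Set it as long as it hasn't changed since I looked at it":
                // the replace succeeds only if the ETag still matches.
                table.UpdateEntity(entity, entity.ETag, TableUpdateMode.Replace);
                return;
            }
            catch (RequestFailedException ex) when (ex.Status == 412)
            {
                // Precondition failed: another process won the race. Re-read and retry.
            }
        }
    }
}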
You could also rethink the 'count' part. Why not turn this into a 2 step process?
Step 1 - Recording Page Views
Each time someone views a page, add a record to a table (let's call it PageViews). The info you would add in each of these records would be the following:
PartitionKey = PageName
RowKey = Random GUID
After a few views you would have something like this:
MyPage.aspx - someGuid
MyPage.aspx - someGuid
SomePage.aspx - someGuid
MyPage.aspx - someGuid
Step 2 - Counting Page Views
What we want to do now is get all those records, count them, increase a counter somewhere, and delete all the records. Let's assume you have multiple workers running. Each worker would have a loop that runs at a random interval between 1 and 10 minutes. Each time a worker's timer elapses, it will take a lease on a blob if no lease has been taken yet (this should always be the same blob; you can use AutoRenewLease).
The first worker getting the lock can go ahead and do the counting:
Get all records from the PageViews table or from cache
Count all page views per page
Update count somewhere
Delete the records that were taken into account when counting
The issue here is that it's very hard to turn this into an idempotent process. What happens if your instance crashes between the count and the delete? You'll have an increased page count, but since the items were not deleted they'll be added to the total count the next time you process them.
This is why I would suggest the following. In the same table (PageViews), you will also record the total page views, in that same partition. But the data will be a bit different (this will be a single record in that partition holding the total count):
PartitionKey = PageName
RowKey = Guid.Empty (just don't use a random guid, this way we know the difference between a recorded page view and the record holding the total count).
Count = The current page view count
This is perfectly possible because Table Storage is schemaless. And why are we doing this? Because we do have transactions if we limit ourselves to the same table + partition with a maximum of 100 entities. What can we do with this?
Using Take, we get 100 records from that table + partition.
The first record we'll get is the 'counter' record. Why? Because its rowkey is Guid.Empty and sorting is lexicographical
Count these records (-1 because the first record isn't a page view, it's just our counter placeholder)
Update the Count property of the counter record
Delete the 99 (or less) other records
SaveChanges using Batch.
Repeat until there is only 1 record left (the counter record).
And every X minutes, your workers will check whether there is a lease on the blob; if not, they will take the lease and restart the process.
Is this answer clear enough or should I add some code?
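Since code was offered, here is a rough sketch of one counting pass with the newer Azure.Data.Tables SDK (which postdates this answer); it assumes the counter record already exists and leaves out the blob-lease coordination:

using System.Collections.Generic;
using System.Linq;
using Azure.Data.Tables;

public static class PageViewAggregator
{
    public static void Aggregate(TableClient table, string pageName)
    {
        while (true)
        {
            // Up to 100 entities from the page's partition; the counter record
            // (RowKey = Guid.Empty) sorts first because RowKeys sort lexicographically.
            var batch = table.Query<TableEntity>(
                    filter: $"PartitionKey eq '{pageName}'", maxPerPage: 100)
                .Take(100).ToList();
            if (batch.Count <= 1) return; // only the counter record is left

            var counter = batch[0];
            counter["Count"] = (counter.GetInt64("Count") ?? 0) + (batch.Count - 1);

            var actions = new List<TableTransactionAction>
            {
                new(TableTransactionActionType.UpdateReplace, counter, counter.ETag),
            };
            // Delete the 99 (or fewer) page views just counted, committing the
            // count update and the deletes as one atomic entity group transaction.
            actions.AddRange(batch.Skip(1).Select(e =>
                new TableTransactionAction(TableTransactionActionType.Delete, e, e.ETag)));
            table.SubmitTransaction(actions);
        }
    }
}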
I ran into the same question. With the Azure Python library, I'm developing a simple counter increment using the ETag and If-Match header instead of a lock. The basic idea is to retry the increment until the update succeeds without any other update interfering in the meantime. If the update load is heavy, sharding should be invoked.
https://github.com/flyakite/simple-scalable-datastore/blob/master/datastore/azuretable.py
If using Azure Websites, then Azure Queues and WebJobs are another option.
In one scenario of mine, though, I am actually going to take the sharding approach and have WebJobs update the aggregates periodically: an Azure Table Storage table of UserPageViews with PartitionKey = User and RowKey = Page. Two simultaneous users with the same user id will not be allowed.
