Update 40+ million entities in an Azure table with many instances - how to handle concurrency issues

So here is the problem. I need to update about 40 million entities in an Azure table. Doing this with a single instance (select -> delete original -> insert with new PartitionKey) will take until about Christmas.
My thought is to use an Azure worker role with many instances running. The problem here is that the query grabs the top 1000 records. That's fine with one instance, but with 20 running, their selects will obviously overlap... a lot. This would result in a lot of wasted compute trying to delete records that were already deleted by another instance, and trying to update records that have already been updated.
I've run through a few ideas, but the best option I have is to have a single role fill up a queue with partition and row keys, and then have the workers dequeue and do the actual processing.
Any better ideas?

Very interesting question!!! Extending @Brian Reischl's answer (and a lot of it is thinking out loud, so please bear with me :))
Assumptions:
Your entities are serializable in some shape or form. I would assume that you'll get raw data in XML format.
You have one separate worker role which is doing all the reading of entities.
You know how many worker roles would be needed to write modified entities. For the sake of argument, let's assume it is 20 as you mentioned.
Possible Solution:
First you will create 20 blob containers. Let's name them container-00, container-01, ... container-19.
Then you start reading entities - 1000 at a time. Since you're getting raw data in XML format out of table storage, you create an XML file and store those 1000 entities in container-00. You fetch the next set of entities and save them in XML format in container-01, and so on and so forth until you hit container-19. Then the next set of entities goes into container-00 again. This way you're evenly distributing your entities across all 20 containers.
Once all the entities are written, your worker role for processing these entities comes into the picture. Since we know that instances in Windows Azure are sequentially ordered, you get instance names like WorkerRole_IN_0, WorkerRole_IN_1, and so on.
What you would do is take the instance name and extract the number: "0", "1", etc. Based on this, you would determine which worker role instance reads from which blob container: WorkerRole_IN_0 will read files from container-00, WorkerRole_IN_1 from container-01, and so on.
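A minimal sketch of that mapping, assuming the Microsoft.WindowsAzure.ServiceRuntime API is available inside the role (the parsing is naive on purpose):

using Microsoft.WindowsAzure.ServiceRuntime;

// Instance IDs end in the instance's ordinal, e.g. "...WorkerRole_IN_0";
// map that ordinal straight onto a container name.
string instanceId = RoleEnvironment.CurrentRoleInstance.Id;
int ordinal = int.Parse(instanceId.Substring(instanceId.LastIndexOf('_') + 1));
string containerName = string.Format("container-{0:00}", ordinal);  // "container-00" ... "container-19"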
Now each individual worker role instance will read an XML file, create the entities from that XML file, update those entities, and save them back into table storage. Once this process is done, it deletes the XML file and moves on to the next file in that container. Once all files are read and processed, you can just delete the container.
As I said earlier, this is very much a "thinking out loud" kind of solution, and some things still must be considered, like what happens when the "reader" worker role goes down partway through.

If your PartitionKeys and/or RowKeys fall into a known range, you could attempt to divide them into disjoint sets of roughly equal size for each worker to handle. E.g., Worker1 handles keys starting with 'A' through 'C', Worker2 handles keys starting with 'D' through 'F', etc.
If that's not feasible, then your queuing solution would probably work. But again, I would suggest that each queue message represent a range of keys if possible. E.g., a single queue message specifies deleting everything in the range 'A' through 'C', or something like that.
In any case, if you have multiple entities in the same PartitionKey, then use batch transactions to your advantage for both inserting and deleting. That could cut the number of transactions by almost a factor of ten in the best case. You should also use parallelism within each worker role. Ideally, use the async methods (either Begin/End or *Async) to do the writing, and run several transactions (12 is probably a good number) in parallel. You can also run multiple threads, but that's somewhat less efficient. Either way, a single worker can push a lot of transactions through table storage.
As a side note, your process should go "Select -> Insert New -> Delete Old". Going "Select -> Delete Old -> Insert New" could result in permanent data loss if a failure occurs between steps 2 and 3.
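A hedged sketch of one such move with the storage client's entity group transactions (MyEntity, newPartitionKey, oldEntities, and table are placeholders; a batch requires all entities to share one PartitionKey and holds at most 100):

// All entities in one batch must share a PartitionKey; max 100 per batch.
var insertBatch = new TableBatchOperation();
var deleteBatch = new TableBatchOperation();
foreach (var entity in oldEntities)          // <= 100 rows, same old PartitionKey
{
    insertBatch.Insert(new MyEntity(newPartitionKey, entity.RowKey) { /* copy properties */ });
    deleteBatch.Delete(entity);
}
// Insert first, delete second: a crash in between leaves re-deletable
// duplicates rather than lost data.
table.ExecuteBatch(insertBatch);
table.ExecuteBatch(deleteBatch);

Several of these batch pairs can run in parallel via BeginExecuteBatch/EndExecuteBatch, per the parallelism advice above.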

I think you should mark your question as the answer ;) I can't think of a better solution, since I don't know what your partition and row keys look like. But to enhance your solution, you may choose to pump multiple partition/row keys into each queue message to save on transaction cost. Also, when consuming from the queue, get messages in batches of 32 and process them asynchronously. I was able to transfer 170 million records from SQL Server (Azure) to table storage in less than a day.
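For the consuming side, a sketch with the classic storage client (ProcessKeys is a placeholder for the actual move):

// Pull up to 32 messages at once, invisible to other consumers for 5 minutes.
var messages = queue.GetMessages(32, TimeSpan.FromMinutes(5));
Parallel.ForEach(messages, msg =>
{
    ProcessKeys(msg.AsString);   // hypothetical: move the entities named in the message
    queue.DeleteMessage(msg);    // delete only after the work succeeded
});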

Related

Azure Service Bus: Ordered Processing of Session Sequences

Are there any recommended architectural patterns with Service Bus for ensuring ordered processing of nested groups of messages which are sent out of order? We are using Sessions, but when it comes down to ensuring that a set of Sessions must be processed sequentially in a certain order before moving onto another set of Sessions, the architecture becomes cumbersome very quickly. This question might best be illustrated with an example.
We are using Service Bus to integrate changes in real-time from a database to a third-party API. Every N minutes, we get notified of a new 'batch' of changes from the database which consists of individual records of data across different entities. We then transform/map each record and send it along to an API. For example, a 'batch' of changes might include 5 new/changed 'Person' records, 3 new/changed 'Membership' records, etc.
At the outermost level, we must always process one entire batch before we can move on to another batch of data, but we also have a requirement to process each type of entity in a certain order. For example, all 'Person' changes must be processed for a given batch before we can move on to any other objects.
There is no guarantee that these records will be queued up in any order which is relevant to how they will need to be processed, particularly within a 'batch' of changes (e.g. the data from different entity types will be interleaved).
We actually do not need to send the individual records of entity data to the API in any particular order (e.g. it does not matter in which order I send those 5 Person records for that batch, as long as they are all sent before the 3 Membership records for that batch). However, we do group the messages into Sessions by entity type so that we can guarantee homogeneous records in a given session and process all records for that entity type together. (This also supports a separate requirement: when calling the API, we send a batch of records where possible instead of one call per record, to avoid rate limiting issues.) Currently, our actual Topic Subscription containing the record data is broken up into Sessions which are unique to the entity type and the batch.
"SessionId": "Batch1234\Person"
We are finding it cumbersome to manage the requirement that all changes for a given batch must be processed before we move on to the next batch, because there is no Session which reliably groups those "groups of entities" together (let alone processes those groups of entities themselves in a certain order). There is, of course, no concept of a 'session of sessions', so we are currently handling this with a separate 'Sync' queue whose messages represent an entire batch of changes and list which sessions of data are contained in that batch:
"SessionId": "Batch1234",
"Body":
{
"targets": ["Batch1234\Person", "Batch1234\Membership", ...]
}
This is quite cumbersome, because something (e.g. a Durable Azure Function) now has to orchestrate the entire process by watching the Sync queue and then spinning off separate processors that it oversees to ensure correct ordering at each level (which makes concurrency management and scalability much more complicated). If this is indeed a good pattern, then I do not mind implementing the extra orchestration architecture to ensure a robust, scalable implementation. However, I cannot help feeling that I am missing something or not thinking about the architecture the right way.
Is anyone aware of any other recommended pattern(s) in Service Bus for handling ordered processing of groups of data which themselves contain groups of data which must be processed in a certain order?
For the record, I'm not a Service Bus expert specifically.
The entire batch construct sounds painful - can you do away with it? Often if you have a painful input, you'll have a painful solution - the old "crap in, crap out" maxim. Sometimes it's just hard to find an elegant solution.
Do the 'sets of sessions' need to be processed in a specific order?
Is a 'batch' of changes = a session?
I can't think of a specific pattern, but a "divide and conquer" approach seems reasonable (which is roughly what you have already?):
Watch for new batches; when one occurs, hand it off to a BatchProcessor.
The BatchProcessor applies all the rules to the batch, as you outlined.
Consider having the BatchProcessor dump its results on a queue of some kind which feeds the API - that way you have some isolation between the batch processing and the API (see the sketch below).
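To make the hand-off concrete, here is a rough, hedged sketch of that loop using the Microsoft.ServiceBus.Messaging session APIs (Deserialize, OrderByEntityType, and Transform are hypothetical helpers; syncQueue, subscriptionClient, and apiQueue are assumed to be already-created clients):

var syncSession = syncQueue.AcceptMessageSession();      // next unprocessed batch, e.g. "Batch1234"
var syncMessage = syncSession.Receive();
var targets = Deserialize(syncMessage);                  // ["Batch1234\Person", "Batch1234\Membership", ...]

foreach (var sessionId in OrderByEntityType(targets))    // e.g. all Person sessions before Membership
{
    var session = subscriptionClient.AcceptMessageSession(sessionId);
    BrokeredMessage message;
    while ((message = session.Receive(TimeSpan.FromSeconds(5))) != null)
    {
        apiQueue.Send(Transform(message));               // the isolation queue in front of the API
        message.Complete();
    }
    session.Close();
}
syncMessage.Complete();                                  // whole batch done; move to the next one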

Cassandra counter usage

I am running into some difficulties with the data modeling of an application which may involve the use of counters.
The app is basically a messaging app. Message counts are capped for free users, hence the initial plan of using a counter column to keep track of the total count.
I've discovered that batches (logged or not) cannot contain operations on both standard tables and counter ones. How do I ensure correctness if I cannot batch the operation I am trying to perform together with the counter update? Is the counter type really needed if there's basically no race condition on the column, given that it is associated with each individual user?
My second idea would be to use a standard int column, to be used only inside batches. Is this a viable option?
Thank you
If you can absolutely guarantee that each user will produce only one update at a time, then you could rely on plain ints to do the job.
The problem, however, is that you will need to perform a read-before-write anti-pattern. You could work around this too, e.g., by skipping the read part: cache your ints and perform in-memory updates followed by write-only operations. This is viable by coupling your system with a caching server (e.g. Redis).
Thinking about it, though, you will still need to read these counters at some point: if the number of messages a free user can send is bounded, you need to perform a check when they log in / try to send a new message / look at the dashboard / etc., and block their action accordingly.
Another option (if you store the messages sent by each user somewhere and don't want to add complexity to your system) could be to count them directly with a SELECT COUNT-type query, even though this can become pretty inefficient very quickly in the Cassandra world.
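For illustration only, a sketch with the DataStax C# driver (assuming a connected ISession; table and column names are made up). The counter update has to stand alone, and the quota check is the read mentioned above:

// Counter tables cannot share a batch with standard tables.
session.Execute(new SimpleStatement(
    "UPDATE user_message_counts SET sent = sent + 1 WHERE user_id = ?", userId));

// The quota check is a plain read, done when the user tries to send.
var row = session.Execute(new SimpleStatement(
    "SELECT sent FROM user_message_counts WHERE user_id = ?", userId)).FirstOrDefault();
long sent = (row == null) ? 0 : row.GetValue<long>("sent");
if (sent >= FreeTierLimit)         // FreeTierLimit is your business rule
    RejectMessage();               // placeholder for your blocking logic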

Sequential numbering in the cloud

OK, so a simple task such as generating a sequential number has caused us an issue in the cloud.
Where you have more than one server, it gets harder and harder to guarantee that the numbers allocated across servers do not clash.
We are using Azure servers if it helps.
We thought about using the app cache but you cannot guarantee it will be updated between servers.
We are limited to using:
a SQL table with an identity column
or
some peer to peer method between servers
or
use a blob store and utilise its locks to store the most up-to-date number (this could have scaling issues)
I just wondered if anyone has an idea of a solution to this?
Surely it's a simple problem that must have been solved by now.
If you can live with a use case where the numbers you get from this central location are not always sequential (but are guaranteed to be unique), I would suggest considering the following pattern. I've helped a large e-commerce client implement this, since they needed unique int PKs to synchronize back on-premises:
Create a queue, and create a small always-running process that populates this queue with sequential integers (this process should remember which number it generated last and keep replenishing the pool with more numbers once the queue gets close to empty).
Now your code can poll the next number from the queue, delete it from the queue, and then attempt to save it into the SQL Azure database. In case of failure, all you'll have is a "hole" in your sequential numbers. In scenarios with frequent inserts, you may also save things out of order (two processes poll from the queue; the one that polled first saves last; the PKs saved to the database are no longer sequential).
The biggest downside is that you now have to maintain/monitor the process that replenishes the pool of PKs.
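A sketch of both halves with the classic storage client (LowWaterMark, lastIssued, and InsertRow are placeholders; the durable bookkeeping of lastIssued is the part you must get right):

// Replenisher (single always-on instance): top the pool up.
queue.FetchAttributes();
int available = queue.ApproximateMessageCount ?? 0;
while (available < LowWaterMark)
{
    queue.AddMessage(new CloudQueueMessage((++lastIssued).ToString()));
    available++;
}
// Persist lastIssued durably (e.g. in a blob) so a restart never re-issues numbers.

// Consumer (any instance): take a number, then try the insert.
var message = queue.GetMessage();
int id = int.Parse(message.AsString);
queue.DeleteMessage(message);
InsertRow(id);   // a failure here just leaves a hole in the sequence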
After reading this, I would not trust an identity column.
I think the best way is, before insert, to get the last stored id and increment it by one (programmatically). Another option is to create a trigger, but it could be a mess if you receive a lot of concurrent requests on the DB or if your table has millions of records.
create trigger trigger_name
on table_name
after insert
as
-- assign max(id) + 1 to the row(s) that were just inserted
declare @seq int
set @seq = (select max(id) + 1 from table_name)
update table_name
set table_name.id = @seq
from table_name
inner join inserted
on table_name.id = inserted.id
More info:
http://msdn.microsoft.com/en-us/library/windowsazure/ee336242.aspx
If you're worried about scaling the number generation when using blobs, then you can use the SnowMaker library, which is available on GitHub and NuGet. It gets around the scale problem by retrieving blocks of IDs into a local cache. This guarantees that the IDs are unique, but not necessarily sequential if you have more than one server. I'm not sure if that would achieve what you're after.
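If SnowMaker fits, usage is roughly as follows (container and scope names are examples; check the project README for the exact current API):

// SnowMaker hands out blocks of IDs from a blob-backed counter,
// so most NextId() calls never touch storage.
var store = new BlobOptimisticDataStore(
    CloudStorageAccount.Parse(connectionString), "unique-ids");
var generator = new UniqueIdGenerator(store) { BatchSize = 100 };
long orderId = generator.NextId("orders");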

Cassandra: rotating lists

Suppose I store a list of events in a Cassandra row, implemented with composite columns:
{
    event:123 => 'something happened'
    event:234 => 'something else happened'
}
It's almost fine by me and, as far as I understand, a common pattern. Compared to having a single event column with a JSON-ized list, it scales better, since it's easy to add a new item to the list without reading it first and then writing it back.
However, now I need to implement these two requirements:
I don't want to add a new event if the last added one is the same,
I want to keep only the N last events.
Is there any standard way of doing that with the best possible performance? (Any storage schema changes are ok).
Checking whether or not things already exist, or checking how many exist and removing extra items, are both read-modify-write operations, and they don't fit very well with Cassandra's constraints.
One way of keeping only the N last events is to make sure they are ordered so that you can do a range query and read the N last (for example, by prefixing the column key with a timestamp/TimeUUID). This wouldn't remove the outdated events (you'd need a separate process for that), but the code that queries the data will only ever see the last N, which is the real requirement if I interpret things correctly. The garbage collection of old events is just an optimization to avoid keeping things that will never be needed again.
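With CQL3, for example, that ordering can fall out of the clustering key. A sketch with the DataStax C# driver (the schema and names are invented, and this swaps the composite-column model for an equivalent CQL table):

// Cluster events by time, newest first, so "the last N" is just a LIMIT.
session.Execute(
    "CREATE TABLE events (" +
    "  stream_id text, event_time timeuuid, payload text," +
    "  PRIMARY KEY (stream_id, event_time)" +
    ") WITH CLUSTERING ORDER BY (event_time DESC)");

// Readers only ever see the last N; older rows can be garbage-collected separately.
var lastTen = session.Execute(new SimpleStatement(
    "SELECT event_time, payload FROM events WHERE stream_id = ? LIMIT 10", streamId));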
If the requirement isn't strictly the N last events, but rather events not older than some time T, you can of course use the TTL feature, but I assume that's not an option for you.
The first requirement is trickier. You can do a read before every write and check whether you already have the item, but that would be slow, and unless you do some kind of locking outside of Cassandra, there is no guarantee that two writers won't both do a read and then both do a write, so that neither sees the other's write. Maybe that's not a problem for you, but there's no good way around it; Cassandra doesn't do CAS.
The way I've handled similar situations with Cassandra is to keep a cache in the application nodes of what has been written, and check that before writing. You then need to make sure that each application node sees all events for the same row, and that events for one row aren't distributed over multiple application nodes. One way of doing that is to put a message queue system in front of your application nodes and divide the event stream over several queues by the same key you use as the row key in the database.
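In code, the cache can be as simple as a per-node dictionary, which is safe exactly because all events for a given row key are routed to the same node (names are illustrative):

using System.Collections.Concurrent;

// Remember the last payload written per row key; skip exact repeats.
var lastEvent = new ConcurrentDictionary<string, string>();

void WriteIfChanged(string rowKey, string payload)
{
    string previous;
    if (lastEvent.TryGetValue(rowKey, out previous) && previous == payload)
        return;                       // same as the last event; don't write
    AppendEvent(rowKey, payload);     // placeholder for the actual Cassandra write
    lastEvent[rowKey] = payload;
}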

Strategies for checking inactivity on Azure

I have a table in Azure Table Storage, with rows that are regularly updated by various processes. I want to efficiently monitor when rows haven't been updated within a specific time period, and to cause alerts to be generated if that occurs.
Most task scheduler implementations I've seen for Azure function by making sure only one worker will perform a given job at a time. However, setting up a scheduled task that waits n minutes, and then queries the latest time-stamp to determine if action should be taken, seems inefficient since the work won't be spread across workers. It also seems generally inefficient to have to poll so many records.
An example use of this would be to send an email to a user that hasn't logged into a web site in the last 30 days. Assume that the number of users is a "large number" for the purposes of producing an efficient algorithm.
Does anyone have any recommendations for strategies that could be used to check for recent activity without forcing only one worker to do the job?
Keep a LastActive table with a timestamp as the RowKey (DateTime.UtcNow.Ticks.ToString("d19")). Update it by doing a batch transaction that deletes the old row and inserts the new row.
Now the query for inactive users is just something like from user in LastActive where user.PartitionKey == string.Empty && user.RowKey < (DateTime.UtcNow - TimeSpan.FromDays(30)).Ticks.ToString("d19") select user. That will be quite efficient for any size table.
Depending on what you're going to do with that information, you might want to then put a message on a queue and then delete the row (so it doesn't get noticed again the next time you check). Multiple workers can now pull those queue messages and take action.
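A sketch of both halves with the storage client (appending the user id to the RowKey keeps two users who update on the same tick from colliding; oldRow is the row previously read for this user):

// Update: delete the old timestamp row and insert the new one in a single
// entity group transaction (both rows share PartitionKey = string.Empty).
string rowKey = DateTime.UtcNow.Ticks.ToString("d19") + "_" + userId;
var batch = new TableBatchOperation();
batch.Delete(oldRow);
batch.Insert(new DynamicTableEntity(string.Empty, rowKey));
table.ExecuteBatch(batch);

// Query: any RowKey below the cutoff has been idle for 30+ days.
string cutoff = (DateTime.UtcNow - TimeSpan.FromDays(30)).Ticks.ToString("d19");
var query = new TableQuery<DynamicTableEntity>().Where(TableQuery.CombineFilters(
    TableQuery.GenerateFilterCondition("PartitionKey", QueryComparisons.Equal, string.Empty),
    TableOperators.And,
    TableQuery.GenerateFilterCondition("RowKey", QueryComparisons.LessThan, cutoff)));
foreach (var idle in table.ExecuteQuery(query))
{
    // enqueue a notification, then delete the row so it isn't flagged again
}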
I'm confused about your desire to do this on multiple worker instances... you presumably want to act on an inactive user only once, so you want only one instance to do the check. (The work of sending emails or whatever else you're doing can then be spread about by using a queue, but that initial check should be done by exactly one instance.)
