How to use SplitOn with singleInstance? - azure

I have a logic app with a sql trigger that gets multiple rows.
I need to split on the rows so that I have a better overview of the actions I perform per row.
Now I would like the logic app to work on only one row at a time.
What would be the best solution for that, since
"operationOptions": "singleInstance", and
"runtimeConfiguration": {
    "concurrency": {
        "runs": 1
    }
},
do not work with splitOn?
I was also thinking about calling another logic app and having that logic app use a runtimeConfiguration, but that sounds like an ugly workaround.
Edit:
The row is atomic, and no sorting is needed. Each row can be worked on separately and independent of other data.
As far as I can tell, I wouldn't use a foreach for that, since then one failure within a row will lead to a failed logic app.
If one dataset (row) fails, the others should still be tried, and the error should be easily visible.

Yes, you are seeing the expected behavior. Keep in mind, the split happens in the trigger, not the workflow. BizTalk works the same way except it's a bit more obvious there.
You don't want concurrent processing, you want ordered processing. Right now, the most direct way to handle this is to use a Foreach over the collection, though waiting ~3 weeks might be a better option.
One decision point will be whether the unit of atomicity is the collection or the item. Also, you'll need to know whether overlapping batches are OK.
For instance, if you need to process all items in order, with batch level validation, Foreach with concurrency = 1 is what you need.

Today (as of 2018-03-06) concurrency control is not supported for split-on triggers.
Having said that, concurrency control should be enabled for all trigger types (including split-on triggers) within the next 2-3 weeks.
In the interim you could remove the splitOn property on your trigger and set its concurrency limit to 1. This will start a single run for the entire collection of items, but you can use a foreach loop in your definition to limit concurrency as well. The drawback here is that the trigger will wait until the run as a whole is completed (all items are processed), so the throughput will not be optimal.
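A minimal sketch of that interim definition, written as a Python dictionary so the relevant properties are easy to inspect. The trigger and action names, the recurrence values, and the `@triggerBody()?['value']` expression are assumptions for illustration, and the trigger inputs are omitted. Only the absence of splitOn and the two concurrency settings reflect the workaround described above.

```python
import json

# Hypothetical workflow fragment: names, recurrence and the body expression are assumptions.
definition = {
    "triggers": {
        "When_rows_are_returned": {
            "type": "ApiConnection",
            "recurrence": {"frequency": "Minute", "interval": 5},
            # No "splitOn" here, so the whole collection starts a single run.
            "runtimeConfiguration": {"concurrency": {"runs": 1}},
        }
    },
    "actions": {
        "For_each_row": {
            "type": "Foreach",
            "foreach": "@triggerBody()?['value']",
            # Process one row at a time inside that single run.
            "runtimeConfiguration": {"concurrency": {"repetitions": 1}},
            "actions": {},
            "runAfter": {},
        }
    },
}

print(json.dumps(definition, indent=2))
```

The per-row actions would go inside the inner "actions" object of the loop.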

Related

About speedy mass deletion of users in Kentico10

I want to delete more than 1 million users in Kentico 10.
I tried deleting them with UserInfoProvider.DeleteUser() (see the following documentation), but a simple calculation suggests that it would take nearly one year.
https://docs.kentico.com/api10/configuration/users#Users-Deletingauser
Since that is only a rough estimate, the actual time is probably somewhat shorter, but it would still take a long time.
Is there any other way to delete users in a short time?
Of course make sure you have a backup of your database before you do any of this.
Depending on the features you're using, you could get away with a SQL statement. Due to the complexities of the references of a user to multiple other tables, the SQL statement can get pretty complex and you need to make sure you remove the other references before removing the actual user record.
I'd highly recommend an API approach and delete users through the API so it removes all the references for you automatically. In your API calls make sure you wrap the delete action in the following so it stops the logging of the events and other labor-intensive activities not needed.
using (var context = new CMSActionContext())
{
    // Turn off event logging and other side effects that are not needed here
    context.DisableAll();
    // delete your user
}
In your code, I'd only select the top 100 or so at a time and delete them in batches. Assuming you don't need this done all in one run, you could let the scheduled task run your custom code for a week and see where you're at.
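The batching idea can be sketched independently of the Kentico API. The Python below is only an illustration of the loop structure: `fetch_batch` and `delete_user` are hypothetical callables standing in for "select the top N users" and "delete one user through the API", and `max_batches` is an assumed cap so that one scheduled run stays bounded.

```python
from typing import Callable, List


def delete_in_batches(
    fetch_batch: Callable[[int], List[int]],
    delete_user: Callable[[int], None],
    batch_size: int = 100,
    max_batches: int = 1000,
) -> int:
    """Delete users batch by batch and return how many were deleted."""
    deleted = 0
    for _ in range(max_batches):
        batch = fetch_batch(batch_size)
        if not batch:
            break  # nothing left to delete
        for user_id in batch:
            delete_user(user_id)
            deleted += 1
    return deleted


# In-memory stand-in for the user table, for demonstration only.
users = set(range(250))
count = delete_in_batches(
    fetch_batch=lambda n: sorted(users)[:n],
    delete_user=users.discard,
)
print(count, len(users))
```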
If all else fails, figure out how to delete the user and the 70+ foreign key references and you'll be golden.
Why don't you delete them with a SQL query? I believe it will be much faster.
Bulk delete functionality exists starting from version 10.
UserInfoProvider has a BulkDelete method. In fact, any InfoProvider object inherited from AbstractInfoProvider has a BulkDelete method.

Cassandra counter usage

I am finding some difficulties in the data modeling of an application which may involve the use of counters.
The app is basically a messaging app. Messages are bounded for free users, hence the initial plan of using a counter column to keep track of the total count.
I've discovered that batches (logged or not) cannot contain operations on both standard tables and counter tables. How do I ensure correctness if I cannot batch the operation I am trying to perform together with the counter update? Is the counter type really needed if there's basically no race condition on the column, given that it is associated with a single user?
My second idea would be to use a standard int column to use only inside batches. Is this a viable option?
Thank you
If you can absolutely guarantee that each user will produce only one update at a time, then you could rely on plain ints to do the job.
The problem, however, is that you will need to perform a read before each write, which is an anti-pattern. You could solve this as well, e.g. by skipping the read part: cache your ints, perform the updates in memory, and follow them with writes only. This is viable by coupling your system with a caching server (e.g. Redis).
And thinking about it, you would still need to read these counters at some point, because if the number of messages a free user can send is bound to some value, then you need to perform a check when they log in, try to send a new message, look at the dashboard, etc., and block their action.
Another option (if you store the messages sent by each user somewhere and don't want to add complexity to your system) could be to count them directly with a SELECT COUNT... type query, even if this could become pretty inefficient very soon in the Cassandra world.
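A minimal sketch of the cached-int idea, assuming a single writer per user. The class name `MessageQuota` and the `persist` callable are hypothetical; a plain dictionary stands in for the caching server, and `persist` stands in for the write of the int column that would go into the same batch as the message.

```python
from typing import Callable, Dict


class MessageQuota:
    """Tracks per-user message counts in memory and writes them through."""

    def __init__(self, limit: int, persist: Callable[[str, int], None]) -> None:
        self.limit = limit
        self.persist = persist            # hypothetical write to the int column
        self.counts: Dict[str, int] = {}  # stand-in for a cache such as Redis

    def try_send(self, user_id: str) -> bool:
        current = self.counts.get(user_id, 0)
        if current >= self.limit:
            return False  # quota reached, block the action
        self.counts[user_id] = current + 1
        self.persist(user_id, current + 1)  # write only, no read from the table
        return True


stored: Dict[str, int] = {}
quota = MessageQuota(limit=2, persist=stored.__setitem__)
results = [quota.try_send("alice") for _ in range(3)]
print(results, stored)
```

The check in `try_send` is the read mentioned above; it is served from the cache rather than from Cassandra.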

CouchDB _change Feed

For my application I implemented a logical separation of my documents with a type attribute. I have several views. For every view I implemented a dedicated change feed, which gets triggered whenever a matching document is added or updated. At the moment the performance is quite good. Do I have to expect a slowdown in the future?
Well, every filter function associated with your feeds is executed once for each new (or updated) document. So you may expect a slowdown with a large number of concurrent inserts and updates. It's not related to the size of the database, but to the number of concurrent updates.
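To make the cost concrete, here is a small simulation (not CouchDB code) in which three hypothetical feeds each filter on the type attribute. It only counts how often a filter runs, under the assumption stated above that every feed's filter sees every changed document.

```python
def by_type(doc_type):
    """Build a filter that accepts documents of one type."""
    def filter_fn(doc):
        return doc.get("type") == doc_type
    return filter_fn


# Hypothetical feeds, one per document type.
feeds = {name: by_type(name) for name in ("order", "invoice", "customer")}
changes = [
    {"_id": "1", "type": "order"},
    {"_id": "2", "type": "invoice"},
    {"_id": "3", "type": "order"},
]

calls = 0
delivered = {name: [] for name in feeds}
for doc in changes:
    for name, accept in feeds.items():
        calls += 1  # every feed's filter runs for every changed document
        if accept(doc):
            delivered[name].append(doc["_id"])

print(calls, delivered)
```

The number of filter calls is the number of feeds times the number of changes, regardless of how many documents the database already holds.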

update 40+ million entities in azure table with many instances how to handle concurrency issues

So here is the problem. I need to update about 40 million entities in an azure table. Doing this with a single instance (select -> delete original -> insert with new partitionkey) will take until about Christmas.
My thought is to use an Azure worker role with many instances running. The problem here is that the query grabs the top 1000 records. That's fine with one instance, but with 20 running, their selects will obviously overlap a lot. This would result in a lot of wasted compute trying to delete records that were already deleted by another instance and updating records that have already been updated.
I've run through a few ideas, but the best option I have is to have the roles fill up a queue with partition and row keys then have the workers dequeue and do the actual processing?
Any better ideas?
Very interesting question! Extending @Brian Reischl's answer (and a lot of it is thinking out loud, so please bear with me :))
Assumptions:
Your entities are serializable in some shape or form. I would assume that you'll get raw data in XML format.
You have one separate worker role which is doing all the reading of entities.
You know how many worker roles would be needed to write modified entities. For the sake of argument, let's assume it is 20 as you mentioned.
Possible Solution:
First you will create 20 blob containers. Let's name them container-00, container-01, ... container-19.
Then you start reading entities, 1000 at a time. Since you're getting raw data in XML format out of table storage, you create an XML file and store those 1000 entities in container-00. You fetch the next set of entities and save them in XML format in container-01, and so on until you reach container-19. Then the next set of entities goes into container-00. This way you're evenly distributing your entities across all 20 containers.
Once all the entities are written, your worker role for processing these entities would come into picture. Since we know that instances in Windows Azure are sequentially ordered, you get instance names like WorkerRole_IN_0, WorkerRole_IN_1, ... and so on.
What you would do is take the instance name, get the number "0", "1" etc. Based on this you would determine which worker role instance will read from which blob container...WorkerRole_IN_0 will read files from container-00, WorkerRole_IN_1 will read files from container-01 and so on.
Now your individual worker role instance will read the XML file, create the entities from that XML file, update those entities and save it back into table storage. Once this process is done, you would then delete the XML file and you move on to next file in that container. Once all files are read and processed, you can just delete the container.
As I said earlier, this is very much a "thinking out loud" kind of solution, and some things must still be considered, such as what happens when the "reader" worker role goes down.
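The two mappings in this scheme (batch to container, and instance to container) can be sketched as follows. The function names are hypothetical, and the instance ID format is assumed to end in `_IN_<number>` as in the examples above.

```python
CONTAINER_COUNT = 20  # one container per writer instance, as assumed above


def container_for_batch(batch_index: int) -> str:
    """Round-robin: batch 0 -> container-00, ..., batch 20 -> container-00 again."""
    return f"container-{batch_index % CONTAINER_COUNT:02d}"


def container_for_instance(instance_id: str) -> str:
    """Map an instance name such as 'WorkerRole_IN_3' to 'container-03'."""
    number = int(instance_id.rsplit("_", 1)[1])
    return f"container-{number:02d}"


print(container_for_batch(21), container_for_instance("WorkerRole_IN_7"))
```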
If your PartitionKeys and/or RowKeys fall into a known range, you could attempt to divide them into disjoint sets of roughly equal size for each worker to handle, e.g. Worker1 handles keys starting with 'A' through 'C', Worker2 handles keys starting with 'D' through 'F', etc.
If that's not feasible, then your queuing solution would probably work. But again, I would suggest that each queue message represent a range of keys if possible, e.g. a single queue message specifies deleting everything in the range 'A' through 'C', or something like that.
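A minimal sketch of splitting a key space into disjoint ranges, assuming keys start with an uppercase ASCII letter (an assumption for illustration, since the real key format is unknown). Each resulting pair could be assigned to a worker directly or sent as one queue message. The function name is hypothetical.

```python
import string
from typing import List, Tuple


def split_key_ranges(alphabet: str, workers: int) -> List[Tuple[str, str]]:
    """Split the alphabet into disjoint (first, last) ranges of near-equal size."""
    size, extra = divmod(len(alphabet), workers)
    ranges = []
    start = 0
    for i in range(workers):
        # The first `extra` ranges get one additional character.
        end = start + size + (1 if i < extra else 0)
        if end > start:
            ranges.append((alphabet[start], alphabet[end - 1]))
        start = end
    return ranges


ranges = split_key_ranges(string.ascii_uppercase, 8)
print(ranges)
```

Equal-sized letter ranges only balance the load if keys are spread evenly over the first character, which has to be checked against the real data.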
In any case, if you have multiple entities in the same PartitionKey then use batch transactions to your advantage for both inserting and deleting. That could cut down the number of transactions by almost a factor of ten in the best case. You should also use parallelism within each worker role. Ideally use the async methods (either Begin/End or *Async) to do the writing, and run several transactions (12 is probably a good number) in parallel. You can also run multiple threads, but that's somewhat less efficient. In either case, a single worker can push a lot of transactions with table storage.
As a side note, your process should go "Select -> Insert New -> Delete Old". Going "Select -> Delete Old -> Insert New" could result in permanent data loss if a failure occurs between steps 2 & 3.
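The difference between the two orderings can be shown with an in-memory dictionary standing in for the table. The function names and the simulated crash are hypothetical. The point is only what survives when the process dies between the two steps.

```python
def move_entity(table, old_key, new_key, insert_first, crash_between_steps=False):
    """Re-key one entity, optionally simulating a crash between the two steps."""
    entity = table[old_key]
    if insert_first:
        table[new_key] = entity
        if crash_between_steps:
            raise RuntimeError("crash between steps")
        del table[old_key]
    else:
        del table[old_key]
        if crash_between_steps:
            raise RuntimeError("crash between steps")
        table[new_key] = entity


def surviving_copies(insert_first: bool) -> int:
    """Count the copies of the entity left in the table after a simulated crash."""
    table = {("p-old", "r1"): {"value": 42}}
    try:
        move_entity(table, ("p-old", "r1"), ("p-new", "r1"),
                    insert_first=insert_first, crash_between_steps=True)
    except RuntimeError:
        pass
    return len(table)


print(surviving_copies(insert_first=True), surviving_copies(insert_first=False))
```

With insert-first, a crash leaves a duplicate that a retry can clean up. With delete-first, the entity is gone.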
I think you should mark your question as the answer ;) I can't think of a better solution, since I don't know what your partition and row keys look like. But to enhance your solution, you may choose to pump multiple partition/row keys into each queue message to save on transaction cost. Also, when consuming from the queue, get the messages in batches of 32 and process them asynchronously. I was able to transfer 170 million records from SQL Server (Azure) to Table storage in less than a day.

Cassandra: rotating lists

Suppose I store a list of events in a Cassandra row, implemented with composite columns:
{
event:123 => 'something happened'
event:234 => 'something else happened'
}
It's almost fine by me, and, as far as I understand, that's a common pattern. Compared to having a single column event with the JSON-serialized list, it scales better, since it's easy to add a new item to the list without reading it first and then writing it back.
However, now I need to implement these two requirements:
I don't want to add a new event if the last added one is the same,
I want to keep only N last events.
Is there any standard way of doing that with the best possible performance? (Any storage schema changes are ok).
Checking whether or not things already exist, or checking how many that exist and removing extra items, are both read-modify-write operations, and they don't fit very well with the constraints of Cassandra.
One way of keeping only the N last events is to make sure they are ordered so that you can do a range query and read the N last (for example by prefixing the column key with a timestamp/TimeUUID). This wouldn't remove the outdated events; you need to do that as a separate process. But by doing it this way, the code that queries the data will only see the last N, which is the real requirement if I interpret things correctly. The garbage collection of old events is just an optimization to avoid keeping things that will never be needed again.
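A sketch of that idea with a sorted in-memory list standing in for one row whose column keys are prefixed with a timestamp. The class `EventRow` and its methods are hypothetical. Reading the last N is a slice of the ordered columns, and garbage collection is a separate step that readers do not depend on.

```python
import bisect
from typing import List, Tuple


class EventRow:
    """One row whose columns are kept ordered by (timestamp, event)."""

    def __init__(self) -> None:
        self._columns: List[Tuple[int, str]] = []

    def add(self, timestamp: int, event: str) -> None:
        # Blind write: no read of existing columns is needed.
        bisect.insort(self._columns, (timestamp, event))

    def last(self, n: int) -> List[Tuple[int, str]]:
        # Equivalent of a range query over the tail of the row.
        return self._columns[-n:] if n > 0 else []

    def collect_garbage(self, keep: int) -> int:
        # Separate clean-up pass; returns how many columns were dropped.
        removed = max(len(self._columns) - keep, 0)
        self._columns = self.last(keep)
        return removed


row = EventRow()
row.add(234, "something else happened")
row.add(123, "something happened")
row.add(345, "a third thing happened")
print(row.last(2))
```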
If the requirement isn't a strict N events, but events that are not older than T you can of course use the TTL feature, but I assume that it's not an option for you.
The first requirement is trickier. You can do a read before every write and check whether you already have the item, but that would be slow, and unless you do some kind of locking outside of Cassandra there is no guarantee that two writers won't both do a read and then both do a write, so that neither sees the other's write. Maybe that's not a problem for you, but there's no good way around it. Cassandra doesn't do CAS.
The way I've handled similar situations when using Cassandra is to keep a cache in the application nodes of what has been written, and check that before writing. You then need to make sure that each application node sees all events for the same row, and that events for the same row aren't distributed over multiple application nodes. One way of doing that is to have a message queue system in front of your application nodes, and divide the event stream over several queues by the same key as you use as row key in the database.
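A minimal sketch of both parts, under the assumption that a stable hash of the row key picks the queue. `queue_for` and `DedupingWriter` are hypothetical names, and the `written` list stands in for the actual Cassandra write.

```python
import zlib
from typing import Dict, List, Tuple


def queue_for(row_key: str, queue_count: int) -> int:
    """Stable routing: the same row key always lands in the same queue."""
    return zlib.crc32(row_key.encode("utf-8")) % queue_count


class DedupingWriter:
    """Runs on one application node and remembers the last event per row."""

    def __init__(self) -> None:
        self.last_event: Dict[str, str] = {}
        self.written: List[Tuple[str, str]] = []  # stand-in for database writes

    def write(self, row_key: str, event: str) -> bool:
        if self.last_event.get(row_key) == event:
            return False  # same as the last added event, skip the write
        self.last_event[row_key] = event
        self.written.append((row_key, event))
        return True


writer = DedupingWriter()
outcomes = [
    writer.write("user-1", "logged in"),
    writer.write("user-1", "logged in"),
    writer.write("user-1", "logged out"),
]
print(outcomes, queue_for("user-1", 4))
```

The cache only gives the right answer if all events for a row reach the same node, which is what the routing function is for. A node restart loses the cache, so the first write after a restart may duplicate the last event.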

Resources