Can multiple virtual workers share the same collection? (BluePrism)

I am working on a problem where I have to write data from a CSV file into a collection.
Example: I have a CSV file with 20 items. These items are added to the queue. Each time a case from the queue is processed, I write the item number into a collection. At the end, I convert the collection to CSV format. This works perfectly with 1 virtual worker. However, when I use multiple workers, each one only writes the item numbers of the cases it processes into its own collection, so the collection does not contain the item numbers of the cases worked by the other virtual workers.
Is there a way for multiple workers to share a collection so that I don't lose any information before converting it to a CSV file? Basically I want all the worked items in one collection, regardless of which worker worked on them.
Thanks in advance. Let me know if you require more information regarding this issue.

As far as I know, there is no way for multiple workers to access the same collection at runtime. You can try a different approach, however:
Do you have the item number in the starting CSV file as well? If so, add the item number to the queue item's data when adding items to the queue, and have the last working resource (the one processing the last pending item) loop through the worked items in the batch (you can use tags to differentiate batches of work) and pass the data into a collection.
If you only receive the item number while processing it, add it to the queue item's data once it's processed and then loop through the queue once the whole batch is processed, just as described above.
Queue data is stored in the queue until you delete it manually, whereas a collection's data is discarded once the session is over, so you will be able to retrieve the queue data at any time, which is an advantage in my opinion.
Alternatively, you can append each item's data to the CSV file directly after the item is worked. You would need to add some logic so that a resource acquires a lock before trying to write to the file, to avoid exceptions when 2 or more resources try to access the file at the same time (see the sketch below).
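For that last option, the pattern is simply acquire lock, append one row, release lock. In Blue Prism this would typically be done with the Environment Locking utility object plus a write-to-file stage; the TypeScript sketch below only illustrates the pattern conceptually, and the file names and helper functions are assumptions, not a Blue Prism API.
import { appendFile, open, unlink } from "fs/promises";
const LOCK_FILE = "output.csv.lock";
const sleep = (ms: number) => new Promise((resolve) => setTimeout(resolve, ms));
async function acquireLock(retries = 50, waitMs = 200): Promise<boolean> {
  for (let i = 0; i < retries; i++) {
    try {
      // "wx" fails if the lock file already exists, so only one worker wins
      const handle = await open(LOCK_FILE, "wx");
      await handle.close();
      return true;
    } catch {
      await sleep(waitMs); // another worker holds the lock; wait and retry
    }
  }
  return false;
}
async function appendWorkedItem(itemNumber: string): Promise<void> {
  if (!(await acquireLock())) throw new Error("could not acquire CSV lock");
  try {
    await appendFile("output.csv", itemNumber + "\n");
  } finally {
    await unlink(LOCK_FILE); // release the lock for the next worker
  }
}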
Hope this helps.

Related

How do I loop over the results of a data copy in Data Factory

Hi guys, I'm struggling with a data pipeline.
I have a pipeline where I first fetch some data from an API.
This data contains, among other things, a column of ids.
I've set up a data copy and I'm saving the JSON result in a blob.
What I want to do next is iterate over all the ids and make an API call for each of them.
But I can't for the life of me figure out how to iterate over the ids.
I've looked into using a Lookup and a ForEach, but it seems that Lookup is limited to 5000 results, and I have just over 70k.
Any pointers for me?
As a workaround, you could partition the API call results and store them as smaller JSON files, then use multiple pipelines according to the number of files you get and iterate over them.
The ForEach activity supports a maximum batchCount of 50 for parallel processing and a maximum of 100,000 items, so the workaround is only needed for the Lookup part.
Design a two-level pipeline where the outer pipeline iterates over an inner pipeline, which retrieves data that doesn't exceed the maximum rows or size.
Example:
Here I would get the details from the API and store them as a number of JSON blobs, to feed small chunks of data to the next Lookup activity.
Use a Get Metadata activity to get the number of partitioned files to iterate over, and their names to pass to the parameterized source dataset of the Lookup activity.
Use an Execute Pipeline activity to call a child pipeline, which contains the Lookup activity and a Web activity that calls the API for the ids.
Inside the child pipeline, the Lookup activity has a parameterized source dataset pointing at the file to read. As the ForEach activity iterates, the child pipeline is triggered once per file, with that single file as the source of the Lookup activity. This gets around the limitation.
You can store the lookup result in a variable or use it directly in a dynamic expression.

Create flat feeds in a batch process

I want to know if we can create flat feeds in a batch process.
Let me give you some context: We want to create feeds for university students based on the courses they are taking. We want them to see feeds for every course that is available at their campus. We have the list of those courses in our MongoDB. Is it possible to create a flat feed for each course on that list in a batch process? There are more than 5000 courses in total.
Definitely. Also, until you actually put activities into a feed, nothing is created.
The feed group should be created in your dashboard, for example course in this case. The actual feed, for example math-101, comes into existence when you push data to it.
Reading a nonexistent feed simply returns an empty result set (assuming access policies permit):
nonexisting:math-101: error, since the feed group is missing
nonexisting:nonexisting: error, as above
course:nonexisting: empty response
course:math-101: your data
Finally, if you're ingesting a lot of data, there is an import mechanism to process your data efficiently.
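Assuming the feeds in question are Stream (getstream.io) flat feeds, a minimal sketch of the batch process with the official JavaScript client and the MongoDB Node driver could look like the following. The database, collection, and field names (code, title) are assumptions; the key point is that each feed is created lazily on its first write, so 5000+ feeds is not a problem.
import { connect } from "getstream";
import { MongoClient } from "mongodb";
async function createCourseFeeds() {
  const client = connect("STREAM_API_KEY", "STREAM_API_SECRET");
  const mongo = await MongoClient.connect("mongodb://localhost:27017");
  const courses = mongo.db("university").collection("courses");
  for await (const course of courses.find()) {
    // The feed group "course" must already exist in the Stream dashboard;
    // the individual feed (e.g. course:math-101) appears on first write.
    const feed = client.feed("course", course.code);
    await feed.addActivity({
      actor: "system",
      verb: "create",
      object: `course:${course.code}`,
      title: course.title,
    });
  }
  await mongo.close();
}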

Sequential numbering in the cloud

OK, so a simple task such as generating a sequential number has caused us an issue in the cloud.
Where you have more than one server, it gets harder and harder to guarantee that the numbers allocated by different servers do not clash.
We are using Azure servers if it helps.
We thought about using the app cache but you cannot guarantee it will be updated between servers.
We are limited to using:
a SQL table with an identity column
or
some peer to peer method between servers
or
use a blob store and utilise its locks to store the most up-to-date number (this could have scaling issues)
I just wondered if anyone has an idea of a solution to resolve this?
Surely it's a simple problem and must have been solved by now.
If you can live with a use case where the numbers you get from this central location are not always sequential (but are guaranteed to be unique), I would suggest considering the following pattern. I've helped a large e-commerce client implement this, since they needed unique int PKs to synchronize back to on-premise systems:
Create a queue and a small always-running process that populates this queue with sequential integers (this process should remember which number it generated last and keep replenishing the pool with more numbers once the queue gets close to empty).
Now, you can have your code first poll the next number from the queue, delete it from the queue, and then attempt to save it into the SQL Azure database. In case of failure, all you'll have is a "hole" in your sequence. In scenarios with frequent inserts, you may also save things out of order (two processes poll from the queue, one polls first but saves last, and the PKs saved to the database are no longer sequential).
The biggest downside is that you now have to maintain/monitor the process that replenishes the pool of PKs.
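A minimal sketch of this pattern using the current @azure/storage-queue package (the original answer predates it); the queue name and environment variable are assumptions. Note that storage queues do not guarantee strict FIFO, which is exactly the "unique but not always sequential" trade-off described above.
import { QueueClient } from "@azure/storage-queue";
const queue = new QueueClient(process.env.AZURE_STORAGE_CONNECTION_STRING!, "id-pool");
// Replenisher: a single always-on process that remembers the last number it
// generated and keeps the pool topped up.
async function replenish(lastIssued: number, batchSize = 1000): Promise<number> {
  for (let n = lastIssued + 1; n <= lastIssued + batchSize; n++) {
    await queue.sendMessage(String(n));
  }
  return lastIssued + batchSize;
}
// Consumer: take the next number, then delete it so no other server can get it.
async function nextNumber(): Promise<number> {
  const res = await queue.receiveMessages({ numberOfMessages: 1 });
  if (res.receivedMessageItems.length === 0) throw new Error("pool is empty");
  const msg = res.receivedMessageItems[0];
  await queue.deleteMessage(msg.messageId, msg.popReceipt);
  return Number(msg.messageText);
}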
After reading this, I would not trust an identity column.
I think the best way is to get the last stored id before each insert and increment it by one (programmatically). Another option is to create a trigger, but it could be a mess if you receive a lot of concurrent requests on the DB or if your table has millions of records.
create trigger trigger_name
on table_name
after insert
as
declare @seq int
-- next id = current maximum id in the table + 1
set @seq = (select max(id) + 1 from table_name)
update table_name
set table_name.id = @seq
from table_name
inner join inserted
    on table_name.id = inserted.id
More info:
http://msdn.microsoft.com/en-us/library/windowsazure/ee336242.aspx
If you're worried about scaling the number generation when using blobs, you can use the SnowMaker library, which is available on GitHub and NuGet. It gets around the scale problem by retrieving blocks of ids into a local cache. This guarantees that the ids are unique, but not necessarily sequential if you have more than one server. I'm not sure if that would achieve what you're after.
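For completeness, a rough sketch of the blob-plus-lease idea that SnowMaker implements in .NET: a single blob stores the highest number handed out, a short lease serializes updates, and each server reserves a whole block to dispense from memory. This uses @azure/storage-blob; the container/blob names are assumptions, and the counter blob is assumed to already exist and contain a number.
import { BlobServiceClient } from "@azure/storage-blob";
async function reserveBlock(blockSize = 100): Promise<{ first: number; last: number }> {
  const service = BlobServiceClient.fromConnectionString(process.env.AZURE_STORAGE_CONNECTION_STRING!);
  const blob = service.getContainerClient("ids").getBlockBlobClient("counter");
  // Acquire a 15-second lease so only one server can move the counter at a time.
  const lease = blob.getBlobLeaseClient();
  const { leaseId } = await lease.acquireLease(15);
  try {
    const current = Number((await blob.downloadToBuffer()).toString());
    const next = current + blockSize;
    const body = String(next);
    await blob.upload(body, Buffer.byteLength(body), { conditions: { leaseId } });
    // This server may now hand out current+1 .. next from its local cache.
    return { first: current + 1, last: next };
  } finally {
    await lease.releaseLease();
  }
}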

Update 40+ million entities in an Azure table with many instances: how to handle concurrency issues

So here is the problem. I need to update about 40 million entities in an azure table. Doing this with a single instance (select -> delete original -> insert with new partitionkey) will take until about Christmas.
My thought is to use an Azure worker role with many instances running. The problem here is that the query grabs the top 1000 records. That's fine with one instance, but with 20 running, their selects will obviously overlap... a lot. This would result in a lot of wasted compute trying to delete records that were already deleted by another instance and updating records that have already been updated.
I've run through a few ideas, but the best option I have is to have the roles fill up a queue with partition and row keys, then have the workers dequeue and do the actual processing.
Any better ideas?
Very interesting question!!! Extending @Brian Reischl's answer (a lot of this is thinking out loud, so please bear with me :))
Assumptions:
Your entities are serializable in some shape or form. I would assume that you'll get raw data in XML format.
You have one separate worker role which is doing all the reading of entities.
You know how many worker roles would be needed to write modified entities. For the sake of argument, let's assume it is 20 as you mentioned.
Possible Solution:
First you will create 20 blob containers. Let's name them container-00, container-01, ... container-19.
Then you start reading entities, 1000 at a time. Since you're getting raw data in XML format out of table storage, you create an XML file and store those 1000 entities in container-00. You fetch the next set of entities and save them in XML format in container-01, and so on until you hit container-19; the next set then goes into container-00 again. This way you're evenly distributing your entities across all 20 containers.
Once all the entities are written, your worker role for processing these entities comes into the picture. Since instances in Windows Azure are sequentially ordered, you get instance names like WorkerRole_IN_0, WorkerRole_IN_1, and so on.
What you would do is take the instance name and extract the number "0", "1", etc. Based on this you would determine which worker role instance reads from which blob container: WorkerRole_IN_0 will read files from container-00, WorkerRole_IN_1 from container-01, and so on.
Now each worker role instance reads an XML file, creates the entities from that XML file, updates those entities, and saves them back into table storage. Once that is done, it deletes the XML file and moves on to the next file in its container. Once all files are read and processed, you can just delete the container.
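A rough sketch of the round-robin distribution step with the current @azure/storage-blob SDK (the original answer is from the classic-SDK era). The batches are assumed to already be serialized as XML strings; the container naming follows the answer, everything else is illustrative.
import { BlobServiceClient } from "@azure/storage-blob";
async function distributeBatches(batches: string[], containerCount = 20) {
  const service = BlobServiceClient.fromConnectionString(process.env.AZURE_STORAGE_CONNECTION_STRING!);
  for (let i = 0; i < batches.length; i++) {
    // container-00 .. container-19, filled in round-robin order
    const containerName = `container-${String(i % containerCount).padStart(2, "0")}`;
    const container = service.getContainerClient(containerName);
    await container.createIfNotExists();
    // each blob holds one page of 1000 entities for exactly one worker instance
    const blobName = `batch-${String(i).padStart(6, "0")}.xml`;
    await container.getBlockBlobClient(blobName).upload(batches[i], Buffer.byteLength(batches[i]));
  }
}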
As I said earlier, this is very much a "thinking out loud" kind of solution, and some things still need to be considered, such as what happens when the "reader" worker role goes down.
If your PartitionKeys and/or RowKeys fall into a known range, you could attempt to divide them into disjoint sets of roughly equal size for each worker to handle. eg, Worker1 handles keys starting with 'A' through 'C', Worker2 handles keys starting with 'D' through 'F', etc.
If that's not feasible, then your queuing solution would probably work. But again, I would suggest that each queue message represent a range of keys if possible. eg, a single queue message specifies deleting everything in the range 'A' through 'C', or something like that.
In any case, if you have multiple entities in the same PartitionKey then use batch transactions to your advantage for both inserting and deleting. That could cut down the number of transactions by almost a factor of ten in the best case. You should also use parallelism within each worker role. Ideally use the async methods (either Begin/End or *Async) to do the writing, and run several transactions (12 is probably a good number) in parallel. You can also run multiple threads, but that's somewhat less efficient. In either case, a single worker can push a lot of transactions with table storage.
As a side note, your process should go "Select -> Insert New -> Delete Old". Going "Select -> Delete Old -> Insert New" could result in permanent data loss if a failure occurs between steps 2 & 3.
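To illustrate the batch-transaction advice with the current @azure/data-tables SDK: entities that share a PartitionKey are inserted under the new key first and deleted from the old key second, 100 per transaction, following the "Select -> Insert New -> Delete Old" order above. The table name, and the assumption that all rows passed in share the same old PartitionKey, are mine, not the answer's.
import { TableClient, TransactionAction } from "@azure/data-tables";
type Row = { partitionKey: string; rowKey: string; [key: string]: string | number | boolean };
async function migratePartition(oldRows: Row[], newPartitionKey: string) {
  const conn = process.env.AZURE_STORAGE_CONNECTION_STRING!;
  const table = TableClient.fromConnectionString(conn, "mytable");
  // Entity group transactions allow at most 100 operations, all in one partition.
  for (let i = 0; i < oldRows.length; i += 100) {
    const chunk = oldRows.slice(i, i + 100);
    const inserts = chunk.map((r): TransactionAction => [
      "create",
      { ...r, partitionKey: newPartitionKey },
    ]);
    await table.submitTransaction(inserts); // 1. insert under the new key
    const deletes = chunk.map((r): TransactionAction => [
      "delete",
      { partitionKey: r.partitionKey, rowKey: r.rowKey },
    ]);
    await table.submitTransaction(deletes); // 2. then delete the old entities
  }
}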
I think you should mark the approach in your question as the answer ;) I can't think of a better solution, since I don't know what your partition and row keys look like. But to enhance your solution, you could pump multiple partition/row keys into each queue message to save on transaction costs. Also, when consuming from the queue, get messages in batches of 32 and process them asynchronously. I was able to transfer 170 million records from SQL Server (Azure) to Table storage in less than a day.

Processing a stream in Node where action depends on asynchronous calls

I am trying to write a node program that takes a stream of data (using xml-stream) and consolidates it and writes it to a database (using mongoose). I am having problems figuring out how to do the consolidation, since the data may not have hit the database by the time I am processing the next record. I am trying to do something like:
on order data being read from stream:
    look to see if the customer exists in the mongodb collection
    if customer exists:
        add the order to the document
    else:
        create the customer record with just this order
    save the customer
My problem is that two 'nearby' orders for a customer cause duplicate customer records to be written, since the first one hasn't been written before the second one checks to see if it is there.
In theory I think I could get around the problem by pausing the xml-stream, but there is a bug preventing me from doing this.
Not sure that this is the best option, but using an async queue was what I ended up doing.
At around the same time, a pull request that allowed pausing was added to xml-stream (which is what I was using to process the stream).
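A minimal sketch of that approach with the async library: stream events just push orders onto a queue with concurrency 1, so each order is fully saved before the next one is examined. The Order shape and saveOrder are placeholders, not the poster's actual code.
import async from "async";
type Order = { customerId: string; orderId: string };
// Worker saves one order at a time; concurrency 1 keeps processing strictly sequential.
const q = async.queue<Order>(async (order) => {
  await saveOrder(order);
}, 1);
// In the xml-stream handler, push instead of saving directly:
// xml.on("endElement: order", (order) => q.push(order));
async function saveOrder(order: Order): Promise<void> {
  // placeholder for the look-up / create / save logic against MongoDB
}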
Is there a unique field on the customer object in the data coming from the stream? You could add a unique restriction to your mongoose schema to prevent duplicates at the database level.
When creating new customers, add some fallback logic to handle the case where you try to create a customer but that same customer has just been created by another save at the same time. When this happens, try the save again, but first fetch the other customer document and add the order to it.
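A minimal sketch of the unique-index plus fallback idea with mongoose; the schema shape, field names, and single-retry policy are assumptions.
import mongoose, { Schema } from "mongoose";
const customerSchema = new Schema({
  customerId: { type: String, unique: true }, // unique index blocks duplicate customers
  orders: [{ orderId: String, total: Number }],
});
const Customer = mongoose.model("Customer", customerSchema);
async function addOrder(customerId: string, order: { orderId: string; total: number }) {
  try {
    // Atomic upsert: creates the customer if missing, otherwise appends the order.
    await Customer.findOneAndUpdate(
      { customerId },
      { $push: { orders: order } },
      { upsert: true, new: true }
    );
  } catch (err: any) {
    // E11000 = another save created the same customer first; retry against the existing doc.
    if (err.code === 11000) {
      await Customer.updateOne({ customerId }, { $push: { orders: order } });
    } else {
      throw err;
    }
  }
}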
