Create flat feeds in batch process - getstream-io

I want to know if we can create flat feeds in a batch process.
Let me give you some context: We want to create feeds for university students based on the courses they are taking. We want them to see feeds for every course that is available at their campus. We have the list of those courses in our MongoDB. Is it possible to create a flat feed for each course on that list in a batch process? There are more than 5000 courses in total.

Definitely. In fact, nothing needs to be done until you actually put activities into feeds.
The feed group should be created in your dashboard, for example course in this case. The actual feed, for example math-101, comes into existence when you push data to it.
Reading a nonexistent feed simply returns an empty result set (assuming access policies permit it). For example:
nonexisting:math-101: error, since the feed group is missing
nonexisting:nonexisting: error, as above
course:nonexisting: empty response
course:math-101: your data
Finally, if you're ingesting a lot of data, there is an import mechanism to process your data efficiently.
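For example, a minimal sketch of that loop with the official stream-ruby client, assuming the course flat feed group already exists in your dashboard; the slugs, actor and verb are placeholders for whatever you read out of MongoDB:

require 'stream'

# Your app's key/secret (environment variable names are placeholders).
client = Stream::Client.new(ENV['STREAM_API_KEY'], ENV['STREAM_API_SECRET'])

# e.g. the course slugs pulled from your MongoDB courses collection
course_slugs = %w[math-101 bio-210 cs-350]

course_slugs.each do |slug|
  feed = client.feed('course', slug)   # no explicit "create feed" call is needed
  feed.add_activity(
    actor: 'system',                   # placeholder seed activity
    verb: 'announce',
    object: "course:#{slug}",
    foreign_id: "course:#{slug}"
  )
end

For the full 5000+ courses, the import mechanism mentioned above is more efficient than issuing one request per course.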

Related

How do I loop over the results of a data copy in Data Factory

Hi guys, I'm struggling with a data pipeline.
I have a pipeline where I first fetch some data from an API.
This data contains, among other things, a column of ids.
I've set up a data copy and I'm saving the JSON result in a blob.
What I want to do next is to iterate over all the ids and do an API call for each of them.
But I can't for the life of me figure out how to iterate over the ids.
I've looked into using a Lookup and a ForEach, but it seems that Lookup is limited to 5,000 results, and I have just over 70k.
Any pointers for me?
As a workaround, you could partition the API call results and store them as a number of smaller JSON files, then iterate over those files to achieve this.
The ForEach activity itself can handle a maximum batchCount of 50 for parallel processing and a maximum of 100,000 items, so the workaround is only needed for the Lookup part.
Design a two-level pipeline where the outer pipeline iterates over an inner pipeline, which retrieves data that doesn't exceed the maximum rows or size.
Example:
Here I would get the details from the API and store them as a number of JSON blobs, to feed small chunks of data to the downstream Lookup activity.
Use a Get Metadata activity to get the number of partitioned files to iterate over and their names, which are passed to the parameterized source dataset of the Lookup activity.
Use an Execute Pipeline activity to call another pipeline, which holds the Lookup activity and the Web activity that calls the API for the ids.
Inside the child pipeline, the Lookup activity reads from a parameterized source file. As the ForEach activity iterates, the child pipeline is triggered once per file, with that file as the source of the Lookup activity. This works around the 5,000-row limitation.
You can store the Lookup result in a variable or use it as-is in a dynamic expression.
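For reference, the wiring is mostly two dynamic expressions (the activity name Get Metadata1 is a placeholder, and the Get Metadata activity is assumed to have Child Items in its field list):

Items setting of the ForEach: @activity('Get Metadata1').output.childItems
File name passed to the child pipeline's parameter (inside the ForEach): @item().name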

Pulling more than 10 records for proof of delivery API from NetSuite

Is there any way we can pull more than 10 records from the WorkWave proof of delivery (POD) API?
Whenever I call the WorkWave API through a Map/Reduce script, it gives me an error message telling me to slow down. Has anyone run into this, and how did they manage to work around it?
Thanks
If you're using the List Orders or Get Orders API, there is a throttling limit - "Leaky bucket (size: 10, refill: 1 per minute)". However, both of those APIs allow for retrieving multiple orders in a single call. My suggestion would be to restructure your script so that instead of making the call to WorkWave in the Reduce stage for a single order, you make it in the Get Input Data stage for all orders you want to operate on, and map the relevant data to the corresponding NetSuite data in the Map stage before passing it through to the Reduce stage.
In other words, you make one call listing multiple order ids rather than multiple calls listing one order id.

How to filter data in extractor?

I've got a long-running pipeline that has some failing items (items that at the end of the process are not loaded because they fail database validation or something similar).
I want to rerun the pipeline, but only process the items that failed the import on the last run.
I have a system in place where I check each item ID (received from the external source). I do this check in my loader. If I already have that item ID in the database, I skip loading/inserting that item into the database.
This works great. However, it's slow, since I do extract-transform-load for each of these items, and only then, on load, do I query the database (one query per item) and compare item IDs.
I'd like to filter out these records sooner. If I do it in the transformer, I can only do it per item again. It looks like the extractor could be the place, or I could pass records to the transformer in batches and then filter and explode the items in the (first) transformer.
What would be better approach here?
I'm also thinking about reusability of my extractor, but I guess I could live with the fact that one extractor does both extract and filter. I think the best solution would be to be able to chain multiple extractors. Then I'd have one that extracts the data and another one that filters the data.
EDIT: Maybe I could do something like this:
already_imported_item_ids = Items.pluck(:item_id)

Kiba.run(
  Kiba.parse do
    source(...)

    transform do |item|
      # compare on the item's ID (assuming the row carries it as :item_id),
      # not on the whole item hash
      next if already_imported_item_ids.include?(item[:item_id])
      item
    end

    transform(...)
    destination(...)
  end
)
I guess that could work?
A few hints:
The higher (sooner) in the pipeline, the better. If you can find a way to filter out right from the source, the cost will be lower, because you do not have to manipulate the data at all.
If your scale is small enough, you could load the full list of ids just once at the start in a pre_process block (mostly what you have in mind in your code sample), then compare right after the source. Obviously it doesn't scale infinitely, but it can work for a long time depending on your dataset size.
If you need a higher scale, I would advise either working with a buffering transform (grouping N rows) that issues a single SQL query to verify the existence of all N row ids in the target database, or working with groups of rows and then exploding them back into single rows, indeed.
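A rough sketch of that batching approach, assuming Kiba v3 (whose default StreamingRunner lets a class transform yield extra rows from process and emit a final row from close) and an ActiveRecord-style Items model; the class names and the :item_id key are illustrative:

require 'set'

# Groups incoming rows into arrays of `size` rows.
class BatchRows
  def initialize(size:)
    @size = size
    @buffer = []
  end

  def process(row)
    @buffer << row
    return nil if @buffer.size < @size
    batch = @buffer
    @buffer = []
    batch # emit the full batch as a single "row" (an array of rows)
  end

  def close
    @buffer.empty? ? nil : @buffer # flush the trailing partial batch
  end
end

# One SQL query per batch, then explode the surviving rows one by one.
class RejectAlreadyImported
  def process(batch)
    existing = Items.where(item_id: batch.map { |r| r[:item_id] })
                    .pluck(:item_id).to_set
    batch.each { |row| yield row unless existing.include?(row[:item_id]) }
    nil
  end
end

Wired into the job right after the source as transform BatchRows, size: 500 followed by transform RejectAlreadyImported.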

Can multiple virtual workers share a same collection? (BluePrism)

I am working on a problem where I have to write data from a CSV file into a collection.
Example: I have a CSV file with 20 items. These items are added to the queue. Each time a case from the queue is processed, I write the item number into a collection. At the end, I convert the collection to CSV format. This works perfectly with one virtual worker. However, when I use multiple workers, each one only writes the item numbers it is processing into its own collection, so the collection doesn't have the item numbers of the cases worked by the other virtual workers.
Is there a way for multiple workers to share a collection, so that I don't lose any information before converting it to a CSV file? Basically, I want all the worked items to end up in one collection, regardless of which worker worked on them.
Thanks in advance. Let me know if you require more information regarding this issue.
As far as I know, there is no way for multiple resources to access the same collection at runtime. You can try different approaches, however:
Do you have the item number in the starting CSV file as well? If so, then just add the item number to the queue item data when adding items to the queue, and make the last working resource (the one processing the last pending item) loop through the worked items in this batch (you can use tags to differentiate batches of work) and then write the data to the collection.
If you only receive the item number while processing the item, then add it to the queue item data once the item is processed, and loop through the queue once the whole batch is processed, just as described above.
Queue data is stored in the queue until you delete it manually, while a collection's data is discarded once the session is over, so you will be able to retrieve queue data at any time, which is an advantage in my opinion.
Alternatively, you can write each item's data into the CSV file directly after the item is worked. You would need to add some logic so that a resource acquires a lock before trying to write to the file, to avoid exceptions when two or more resources try to access the file at the same time.
Hope this helps.

Truncate feeds in getStream

I would like to limit the number of feed updates (records) in my GetStream app. I want to keep each feed at a constant length of 500 items.
I make heavy use of the 'to:' field, which results in a lot of feeds of different lengths. I want them all to grow to 500 items, so I would rather not remove items by date.
For what it's worth, I store all the updates in my own database which results in a replica of the network activity.
What would be a good way of keeping my feeds short?
There's no straightforward way to limit your feeds to 500 items. There are two ways to remove activities from Stream:
the removeActivity method, which removes one activity at a time via its foreign_id or activity id (https://getstream.io/docs/js/#removing-activities)
the "Truncate Data" button on the dashboard for your app, which removes all activities in Stream.
It might be possible to get the behavior you're looking for by keeping track of all activities that you're adding to Stream, then periodically culling the ones that put you over 500.
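A sketch of that periodic culling with the stream-ruby client, assuming your own database (the replica you mentioned) can tell you, per feed, which Stream activity ids fall outside the newest 500; stale_activity_ids below is a placeholder for that query, and the feed names are illustrative:

require 'stream'

MAX_ITEMS = 500

client = Stream::Client.new(ENV['STREAM_API_KEY'], ENV['STREAM_API_SECRET'])

# Placeholder: query your replica and return the Stream activity ids of
# everything older than the newest MAX_ITEMS entries of the given feed.
def stale_activity_ids(feed_group, feed_id, keep: MAX_ITEMS)
  []
end

def truncate_feed(client, feed_group, feed_id)
  feed = client.feed(feed_group, feed_id)
  stale_activity_ids(feed_group, feed_id).each do |activity_id|
    feed.remove_activity(activity_id) # one delete per activity; there is no bulk delete
  end
end

truncate_feed(client, 'timeline', 'some-feed-id')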
Hopefully this helps!
