How do I loop over the results of a data copy in Azure Data Factory?

Hi guys, I'm struggling with a data pipeline.
I have a pipeline where I first fetch some data from an API.
This data contains, among other things, a column of ids.
I've set up a Copy activity and I'm saving the JSON result in a blob.
What I want to do next is iterate over all the ids and make an API call for each of them.
But I can't for the life of me figure out how to iterate over the ids.
I've looked into using a Lookup and a ForEach, but it seems the Lookup activity is limited to 5,000 rows, and I have just over 70k.
Any pointers for me?

As a workaround, you could partition the API call results and store them as a number of smaller JSON files, then iterate over those files with a second pipeline.
The ForEach activity itself supports a maximum batchCount of 50 for parallel processing and a maximum of 100,000 items, so the workaround below is only needed for the Lookup part.
Design a two-level pipeline where the outer pipeline iterates over an inner pipeline, which retrieves data that doesn't exceed the maximum rows or size.
Example:
Here I would get the details from the API and store them as a number of JSON blobs, to feed small chunks of data to the subsequent Lookup activity.
Use a Get Metadata activity to get the number of partitioned files to iterate over and their names, which are passed to the parameterized source dataset of the Lookup activity going forward.
Use an Execute Pipeline activity to call a child pipeline, which holds the Lookup activity and the Web activity that calls the API for the ids.
Inside the child pipeline, the Lookup activity has a parameterized source dataset pointing at a single file. As the ForEach activity iterates, the child pipeline is triggered once per file, with that file as the source of the Lookup activity. This gets around the limitation.
You can store the Lookup result in a variable or use it as-is in a dynamic expression.
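For illustration, a minimal sketch of the wiring, assuming the Get Metadata activity is named Get Metadata1 (with the Child Items field selected), the Execute Pipeline activity sits inside the ForEach, and the child pipeline exposes a fileName parameter (all of these names are placeholders):

  ForEach -> Settings -> Items:                  @activity('Get Metadata1').output.childItems
  Execute Pipeline -> parameter fileName:        @item().name
  Child pipeline -> Lookup source dataset file:  @pipeline().parameters.fileName

Each run of the child pipeline then looks up only one partition, which keeps every Lookup under the 5,000-row limit.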

Related

Processing grouped data in Lookup+Foreach activity in ADF

I am looking for an ADF solution to introduce workload management for a metadata-driven ingestion system.
In the pipeline, I read data from a metadata table into a Lookup activity, and say the data looks something like this:
ObjectName,Tshirtsize,TaskGroup,IncrementalLoadFlag,InitialLoadFlag
Asset1,Large,1,N,Y
Asset2,Large,1,N,Y
Asset3,Large,1,N,Y
Asset4,Small,2,N,Y
Now I have to process this data in a ForEach sequentially, based on the value of TaskGroup: in my first batch, I need to process the three tables that share the same TaskGroup and copy them asynchronously after determining the load flags.
However, as far as I have seen, ForEach iterates over every item of the Lookup output one after another, and as a result I am not able to iterate over the data grouped by TaskGroup for a bulk load.
Is there a solution how this scenario can be implemented?

How to filter data in extractor?

I've got a long-running pipeline that has some failing items (items that at the end of the process are not loaded because they fail database validation or something similar).
I want to rerun the pipeline, but only process the items that failed the import on the last run.
I have a system in place where I check each item ID (received from the external source). I do this check in my loader: if I already have that item ID in the database, I skip loading/inserting that item.
This works great. However, it's slow, since I do extract-transform-load for each of these items, and only then, on load, query the database (one query per item) and compare item IDs.
I'd like to filter out these records sooner. If I do it in a transformer, I can only do it per item again. It looks like the extractor could be the place, or I could pass records to the transformer in batches and then filter + explode the items in the (first) transformer.
What would be better approach here?
I'm also thinking about the reusability of my extractor, but I guess I could live with the fact that one extractor does both extract and filter. I think the best solution would be the ability to chain multiple extractors: I'd have one that extracts the data and another one that filters it.
EDIT: Maybe I could do something like this:
already_imported_item_ids = Items.pluck(:item_id)
Kiba.run(
  Kiba.parse do
    source(...)
    # skip rows whose id was already imported
    # (assuming the row exposes its id under :item_id)
    transform do |item|
      next if already_imported_item_ids.include?(item[:item_id])
      item
    end
    transform(...)
    destination(...)
  end
)
I guess that could work?
A few hints:
The higher (sooner) in the pipeline, the better. If you can find a way to filter out right from the source, the cost will be lower, because you do not have to manipulate the data at all.
If your scale is small enough, you could load the full list of ids once at the start in a pre_process block (mostly what you have in mind in your code sample), then compare right after the source. Obviously it doesn't scale infinitely, but it can work for a long time depending on your dataset size.
If you need a higher scale, I would advise either working with a buffering transform (grouping N rows) that issues a single SQL query to verify the existence of all N row ids in the target database, or working with groups of rows and then exploding them, indeed (see the sketch below).
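A rough sketch of the buffering idea, assuming a Kiba version whose class transforms can yield rows from both process and close (check your Kiba version's changelog); the class name and batch size are made up, Items and :item_id are taken from your sample:

require "set"

class FilterAlreadyImported
  def initialize(batch_size = 500)
    @batch_size = batch_size
    @buffer = []
  end

  def process(row)
    @buffer << row
    flush { |r| yield r } if @buffer.size >= @batch_size
    nil # rows are emitted by flush, not returned one by one
  end

  def close
    flush { |r| yield r }
    nil
  end

  private

  def flush
    # one SQL existence check for the whole batch
    existing = Items.where(item_id: @buffer.map { |r| r[:item_id] }).pluck(:item_id).to_set
    @buffer.each { |r| yield r unless existing.include?(r[:item_id]) }
    @buffer = []
  end
end

It would be registered in the pipeline with transform FilterAlreadyImported, right after the source.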

Create flat feeds in batch process

I want to know if we can create flat feeds in a batch process.
Let me give you some context: We want to create feeds for university students based on the courses they are taking. We want them to see feeds for every course that is available at their campus. We have the list of those courses in our MongoDB. Is it possible to create a flat feed for each course on that list in a batch process? There are more than 5000 courses in total.
Definitely. Also, nothing actually happens until you put activities into feeds.
The feed group should be created in your dashboard, for example course in this case. The actual feed, for example math-101, comes into existence when you push data to it.
Reading a nonexistent feed simply returns an empty result set (assuming access policies permit):
nonexisting:math-101: error, since the feed group is missing
nonexisting:nonexisting: error, as above
course:nonexisting: empty response
course:math-101: your data
Finally, if you're ingesting a lot of data, there is an import mechanism to process your data efficiently.
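For example, a minimal Node.js sketch (using the getstream package) that seeds one activity per course pulled from MongoDB; the database, collection, and field names, and the course feed group, are assumptions:

const stream = require('getstream');
const { MongoClient } = require('mongodb');

// server-side client, so it can write to any feed
const client = stream.connect('STREAM_API_KEY', 'STREAM_API_SECRET');

async function seedCourseFeeds() {
  const mongo = await MongoClient.connect('mongodb://localhost:27017');
  const courses = await mongo.db('university').collection('courses').find().toArray();

  for (const course of courses) {
    // the 'course' feed group must already exist in the Stream dashboard;
    // the feed itself (e.g. course:math-101) is created on first write
    await client.feed('course', course.slug).addActivity({
      actor: 'system',
      verb: 'announce',
      object: `course:${course.slug}`,
    });
  }
  await mongo.close();
}

For a large backlog, the import mechanism mentioned above is the more efficient route than looping one call at a time.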

Add timestamp column to file in Node.js Google Cloud function

Right now I'm copying files from Google Cloud Storage to BigQuery using the following lines in Node.js:
const bigquery = new BigQuery();
bigquery.dataset(xx).table(xx).load(storage.bucket(bucketName).file(fileName));
But now I'd like to add a new timestamp column to this file. How can I do this?
Two questions come to mind:
First, read this file into some data structure like an array:
array = FunctionToReadFileNameToArray(FileName);
Does such a function exist? Supposing it does, it's then quite easy to manipulate the array to add a timestamp column.
Second, load the new array data into BigQuery. But I can only find one way to insert streaming data:
bigquery.dataset(xx).table(xx).insert(rows);
And here rows is a different data structure, like a dictionary/map rather than an array. So how can we load an array into BigQuery?
Thanks
I'm going to assume you have a file (object) of structured records (JSON, XML, CSV). The first task would be to open that GCS object for reading. You would then read one record at a time, augment that record with your desired extra column (timestamp), and invoke the insert() API. This API can take a single object to be inserted or an array of objects.
However, if this is a one-time event or can be performed in batch, you may find it cheaper to read the GCS object, write a new GCS object containing your desired data, and THEN load that data into BQ as a unit. Looking at the pricing for BQ, streaming inserts are charged at $0.01 per 200 MB on top of the storage costs, a charge that is bypassed when you load a GCS object as a unit. My own thinking is that doing extra work to save pennies is a poor use of time/money, but if you are processing TB of data over months, it may add up.
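A minimal sketch of the batch route, assuming the GCS file is newline-delimited JSON; the bucket path, dataset, table, and the load_ts column name are placeholders:

const { Storage } = require('@google-cloud/storage');
const { BigQuery } = require('@google-cloud/bigquery');

const storage = new Storage();
const bigquery = new BigQuery();

async function loadWithTimestamp(bucketName, fileName) {
  // 1. read the original object (assumed to be newline-delimited JSON)
  const [contents] = await storage.bucket(bucketName).file(fileName).download();

  // 2. add a load_ts value to every record
  const stamped = contents
    .toString('utf8')
    .split('\n')
    .filter((line) => line.trim() !== '')
    .map((line) => JSON.stringify({ ...JSON.parse(line), load_ts: new Date().toISOString() }))
    .join('\n');

  // 3. write the augmented data back as a new object...
  const stampedFile = storage.bucket(bucketName).file(`stamped/${fileName}`);
  await stampedFile.save(stamped);

  // 4. ...and load it into BigQuery as a unit, avoiding streaming-insert charges
  await bigquery
    .dataset('my_dataset')
    .table('my_table')
    .load(stampedFile, { sourceFormat: 'NEWLINE_DELIMITED_JSON', autodetect: true });
}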

Can multiple virtual workers share the same collection? (BluePrism)

I am working on a problem where I have to write data from a CSV file into a collection.
Example: I have a CSV file with 20 items. These items are added to the queue. Each time a case from the queue is processed, I write the item number into a collection. At the end, I convert the collection to CSV format. This works perfectly with one virtual worker. However, when I use multiple workers, each one only writes the item numbers it is processing into its own collection, so the collection doesn't have the item numbers of the cases worked by the other virtual workers.
Is there a way for multiple workers to share a collection so that I don't lose any information before converting it to a CSV file? Basically, I want all the items worked to end up in one collection, regardless of which worker worked on them.
Thanks in advance. Let me know if you require more information regarding this issue.
As far as I know, there is no way for multiple workers to access the same collection during runtime. You can try different approaches, however:
Do you have the item number in the starting CSV file as well? If so, just add the item number to the queue item data when adding items to the queue, and make the last working resource (the one processing the last pending item) loop through the worked items in this batch (you can use tags to differentiate batches of work) and then pass the data into the collection.
If you only receive the item number while processing an item, then add it to the queue item data once it's processed, and loop through the queue once the whole batch is processed, just as described above.
Queue data is stored in the queue until you delete it manually, while a collection's data is removed once the session is over, so you will be able to retrieve the queue data at any time, which is an advantage in my opinion.
Alternatively, you can append a single item's data to the CSV file directly after each item is worked. You would need to add some logic so that a resource acquires a lock before trying to write to the file, to avoid exceptions when two or more resources access the file at the same time.
Hope this helps.
