supply two inputs to set_downstream() in Prefect - prefect

Right now I have a setup where a third flow depends on two prior flows.
It should look something like
third_flow.set_downstream([flow_foo, flow_bar])
But instead I have it unnecessarily chained like this:
flow_foo.set_downstream(flow_bar)
flow_bar.set_downstream(third_flow)
The order that foo and bar run in doesn't matter; it's only important that they both complete before third_flow runs.
What's the best way to express this in Prefect?
EDIT: I'm not on Prefect v2 at this point

There are two options:
For the imperative API: flow_foo.set_dependencies(task=first_task, upstream_tasks=[another_task, yet_another])
Much easier, for the functional API, is to set the dependency when you call the task: task3(upstream_tasks=[task1, task2])
And if you are just getting started, it got much easier in Prefect 2, since you can run any Python with no DAG structure required: https://docs.prefect.io
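For reference, here is a minimal runnable sketch of the functional-API form in Prefect 1.x (matching the "not on v2" constraint above); the task and flow names are placeholders, not taken from the question:

from prefect import task, Flow

@task
def task1():
    pass

@task
def task2():
    pass

@task
def task3():
    pass

with Flow("example") as flow:
    a = task1()
    b = task2()
    # No data is passed between the tasks; task3 simply waits for both to finish.
    task3(upstream_tasks=[a, b])

flow.run()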

Related

kiba-etl Pattern to split transformations into independent pipelines

Kiba is a very small library, and it is my understanding that most of its value is derived from enforcing a modular architecture of small independent transformations.
However, it seems to me that the model of a series of serial transformations does not fit most of the ETL problems we face. To explain the issue, let me give a contrived example:
A source yields hashes with the following structure
{ spend: 3, cost: 7, people: 8, hours: 2 ... }
Our preferred output is a list of hashes where some of the keys might be the same as those from the source, though the values might differ
{ spend: 8, cost: 10, amount: 2 }
Now, calculating the resulting spend requires a series of transformations: ConvertCurrency, MultiplyByPeople, etc. And so does calculating the cost: ConvertCurrencyDifferently, MultiplyByOriginalSpend, etc. Notice that the cost calculations depend on the original (non-transformed) spend value.
The most natural pattern would be to calculate the spend and cost in two independent pipelines, and merge the final output. A map-reduce pattern if you will. We could even benefit from running the pipelines in parallel.
However in my case it is not really a question of performance (as the transformations are very fast). The issue is that since Kiba applies all transforms as a set of serial steps, the cost calculations will be affected by the spend calculations, and we will end up with the wrong result.
Does Kiba have a way of solving this issue? The only thing I can think of is to make sure that the destination names are not the same as the source names, e.g. something like 'originSpend' and 'finalSpend'. It still bothers me, however, that my spend calculation pipeline will have to make sure to pass on the full set of keys at each step, rather than just passing the key relevant to it and then merging in the cost keys at the end. Or perhaps one could define two independent Kiba jobs and have a master job call the two and merge their results at the end? What is the most Kiba-idiomatic solution to this?
Splitting an ETL pipeline into multiple parallel paths seems to be a key feature of most ETL tools, so I'm surprised that it doesn't seem to be something Kiba supports.
I think I lack extra details to be able to properly answer your main question. I will get in touch via email for this round, and will maybe comment here later for public visibility.
Splitting an ETL pipeline into multiple parallel paths seems to be a key feature of most ETL tools, so I'm surprised that it doesn't seem to be something Kiba supports.
The main focus of Kiba ETL today is: component reuse, lower maintenance cost, modularity, and the ability to have strong data & process quality.
Parallelisation is supported to some extent, though, via different patterns.
Using Kiba Pro parallel transform to run sister jobs
If your main input is something that you can manage to "partition" with a low volume of items (e.g. database id ranges, or a list of files), you can use the Kiba Pro parallel transform like this:
source ... # something that generates the list of work items
parallel_transform(max_threads: 10) do |group_items|
  Kiba.run(...)
end
This works well if there is no output at all, or not much output, coming to the destinations of the sister jobs.
This works with threads, but one can also "fork" here for extra performance.
Using process partitioning
In a similar fashion, one can structure the jobs in a way where each process only handles a subset of the input data.
This way one can start, say, 4 processes (via cron jobs, or monitored via a parent tool) and pass SHARD_NUMBER=1,2,3,4, which is then used by the source for input-load partitioning.
But!
I'm pretty sure your problem, as you said, is more about workflow control, declarations, and the ability to express what needs to be done, rather than about performance.
I'll reach out and we'll discuss that.

Is there a generic DAG/task that can be written in Airflow?

Detailed question: I have a scenario in mind and would like suggestions from experts!
In the attached image, the first set of images shows the current workflow, and below it is the expected workflow. I can't combine all the workflows into one DAG due to different source data refresh times.
I was thinking: is it possible to create one DAG named "generic" and use its individual tasks in other child DAGs like templates?
Or, how can I call these generic DAGs from my child DAGs?
I'm looking for your valuable suggestions to make the workflow optimised and easy to maintain when any update is needed in the extract and load parts. As of now, it's really complex, as I need to touch all 4 DAGs individually!
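Purely as an illustration of the "call a generic DAG from a child DAG" idea above (not a confirmed design for this workflow), here is a sketch using Airflow's TriggerDagRunOperator; it assumes Airflow 2.x, and the DAG ids and task logic are made-up placeholders:

from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.operators.trigger_dagrun import TriggerDagRunOperator


def child_transform():
    pass  # child-specific transformation logic would go here (placeholder)


with DAG(
    dag_id="child_dag_a",              # one of the child DAGs
    start_date=datetime(2023, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    # Reuse the shared extract/load logic by triggering the assumed "generic" DAG.
    run_generic = TriggerDagRunOperator(
        task_id="run_generic_extract_load",
        trigger_dag_id="generic_extract_load",  # assumed id of the shared DAG
        wait_for_completion=True,               # block until the generic DAG finishes
    )

    transform = PythonOperator(
        task_id="transform",
        python_callable=child_transform,
    )

    run_generic >> transform

An alternative, if triggering whole DAGs is too coarse, is to put the shared extract/load tasks in a plain Python factory function that each child DAG imports and calls; which option fits better depends on the differing refresh times mentioned above.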

Gherkin: Is it correct to repeat steps?

I am reading a lot about Gherkin, and I have read that it is not good to repeat steps and that the "Background" keyword should be used instead. But in the example on this page, the same "Given" is repeated again and again. Could it be that I am doing something wrong? I'd like to know your opinion about it:
As with several things, this is a topic that will generate different opinions. In this particular example, I would have moved "Given that I select the post" to the Background section, as this seems to be a prerequisite for all scenarios in this feature. Of course, this would leave the scenarios in the feature without an actual Given section, but those steps would be incorporated from the Background section on execution.
I have also seen cases where the decision to move steps to the Background is a trade-off between having more or fewer feature files and how these are structured. For example, if there are 10 scenarios for a particular feature with a lot of similar steps between them, but there are 1 or 2 scenarios which do not require a particular step, then those 1 or 2 scenarios would have to be moved into a new feature file in order to keep the exact same steps in the Background section of the original feature.
Of course it is correct to keep the scenarios like this. From a tester's perspective, the scenarios/test cases should run independently; therefore, you can keep these tests separate for each functionality.
But if you are doing integration testing, then some of these test cases can be merged, so you can cover multiple test cases in one scenario.
And since the "Given" statement is repeated, you can put it in the Background so you don't have to call it in each scenario.
Note: these separate scenarios will be handy when you run the scripts separately with annotation tags, when you just have to check a specific functionality or a bug fix.

Azure Webjobs: One Job with several Functions, or several Jobs with 1 function each?

How do I decide between creating several WebJobs with 1 function each and bundling several functions into one or only a few WebJobs?
Thanks
There is no straight answer to your question. Sorry.
Usually you group functions by workflow or role. For example, if you have a workflow that contains a function that resizes an image, then a function that applies a watermark, and another one that replicates the images, then it makes sense to put all the functions together because they are related. You are more likely to change all of them when you modify the flow.
On the other hand, you might argue that functions should be separated. Unless you change the input/output, there is no reason to modify more than one function. However, if you need to change more than one function, you will end up editing more projects.
As you see, both arguments have pros/cons and there is really no right answer.
Try to experiment and see which approach works better for your solution.
PS: The only guideline that I can give is this: if the functions are really small (a few lines of code), it is probably easier to put them in the same WebJob, because there is quite some overhead in maintaining multiple assemblies.

Run a subroutine after all Given (or Then or When) steps

I'd like to have my testers be able to organize their Given (or When or Then) steps in any order. This means the Given steps will be accumulating actions to take (database insertions, page visits, etc). Before the When steps execute, I'd like to execute the accumulation of actions to take from the Given steps. Is there a hook to do that?
I don't know of a hook to achieve what you want, but I believe that the problem is that you are not cuking your scenarios properly.
It sounds as though you (it would've helped if you'd included an example scenario!) are writing imperative instead of declarative scenarios. See here for examples of imperative and declarative scenarios.
Also scenarios should be written in a technology-agnostic way so that anyone in the business can understand them, hence you should not include steps which detail "database insertion" actions.
If you were to write your scenario in a declarative fashion (i.e. detailing what action you want to execute without detailing exactly how that action will be executed) then there would be no need to execute an "accumulation of actions".
Another benefit of declarative scenarios is that they are more explicit in stating what the scenario is trying to achieve. E.g., with the following:
When I enter "email@domain.com" in "email"
And I enter "password1" in "password"
And I tap "login"
A reader has to deduce what the purpose of these steps is, whereas with:
Given I login using valid credentials
It's clear what the steps' intent is.
