Detailed question: I have a scenario in mind and would like suggestions from the experts.
In the attached image, the first set of diagrams shows the current workflow; below it is the expected workflow. I can't combine all the workflows into one DAG because the source data is refreshed at different times.
I was wondering: is it possible to create one DAG named generic and use its individual tasks in other child DAGs, like templates?
Or, how can I call these generic DAGs from my child DAGs?
I'm looking for suggestions to make the workflow optimised and easy to maintain when an update is needed in the extract and load part. As of now it's really complex, as I need to touch all 4 DAGs individually!
Kiba is a very small library, and it is my understanding that most of its value is derived from enforcing a modular architecture of small independent transformations.
However, it seems to me that the model of a series of serial transformations does not fit most of the ETL problems we face. To explain the issue, let me give a contrived example:
A source yields hashes with the following structure
{ spend: 3, cost: 7, people: 8, hours: 2 ... }
Our preferred output is a list of hashes where some of the keys might be the same as those from the source, though the values might differ
{ spend: 8, cost: 10, amount: 2 }
Now, calculating the resulting spend requires a series of transformations: ConvertCurrency, MultiplyByPeople etc. etc. And so does calculating the cost: ConvertCurrencyDifferently, MultiplyByOriginalSpend.. Notice that the cost calculations depend on the original (non transformed) spend value.
The most natural pattern would be to calculate the spend and cost in two independent pipelines, and merge the final output. A map-reduce pattern if you will. We could even benefit from running the pipelines in parallel.
However in my case it is not really a question of performance (as the transformations are very fast). The issue is that since Kiba applies all transforms as a set of serial steps, the cost calculations will be affected by the spend calculations, and we will end up with the wrong result.
Does Kiba have a way of solving this issue? The only thing I can think of is to make sure that the destination names are not the same as the source names, e.g. something like 'originSpend' and 'finalSpend'. It still bothers me, however, that my spend calculation pipeline will have to make sure to pass on the full set of keys for each step, rather than just passing the key relevant to it and then merging in the cost keys at the end. Or perhaps one can define two independent Kiba jobs, and have a master job call the two and merge their results at the end? What is the most Kiba-idiomatic solution to this?
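As a rough illustration of the "independent pipelines, then merge" idea described above, here is a minimal plain-Ruby sketch. It is not a Kiba feature: it only relies on the convention that a Kiba transform is an object responding to `process(row)`. The class name `ForkAndMerge`, the step lambdas, and the arithmetic inside them are all hypothetical stand-ins for ConvertCurrency, MultiplyByPeople and so on.

```ruby
# Hypothetical steps: each lambda takes the current value plus the untouched
# source row and returns the next value. The arithmetic is made up.
SPEND_STEPS = [
  ->(value, _row) { value * 2 },            # stand-in for ConvertCurrency
  ->(value, row)  { value * row[:people] }  # stand-in for MultiplyByPeople
].freeze

COST_STEPS = [
  ->(value, _row) { value + 1 },            # stand-in for ConvertCurrencyDifferently
  ->(value, row)  { value * row[:spend] }   # reads the ORIGINAL spend
].freeze

# A Kiba-style transform (assumption: any object with `process(row)`).
# Each output key is computed by its own chain of steps, always starting
# from the original row, so the chains cannot affect each other.
class ForkAndMerge
  def initialize(pipelines)
    @pipelines = pipelines # { output_key => [source_key, steps] }
  end

  def process(row)
    original = row.dup.freeze
    @pipelines.each_with_object({}) do |(out_key, (in_key, steps)), result|
      result[out_key] = steps.reduce(original[in_key]) do |value, step|
        step.call(value, original)
      end
    end
  end
end

transform = ForkAndMerge.new(
  spend: [:spend, SPEND_STEPS],
  cost:  [:cost, COST_STEPS]
)
result = transform.process({ spend: 3, cost: 7, people: 8, hours: 2 })
```

Because every chain reads from the frozen original row, the cost chain sees the source spend rather than the transformed one, and each step only handles the single value it cares about instead of passing the full set of keys along.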
Splitting an ETL pipeline into multiple parallel paths seems to be a key feature of most ETL tools, so I'm surprised that it doesn't seem to be something Kiba supports?
I think I lack extra details to be able to properly answer your main question. I will get in touch via email for this round, and will maybe comment here later for public visibility.
"Splitting an ETL pipeline into multiple parallel paths seems to be a key feature of most ETL tools, so I'm surprised that it doesn't seem to be something Kiba supports?"
The main focus of Kiba ETL today is component reuse, lower maintenance cost, modularity, and the ability to ensure strong data and process quality.
Parallelisation is supported to some extent though, via different patterns.
Using Kiba Pro parallel transform to run sister jobs
If your main input is something that you can manage to "partition" with a low volume of items (e.g. database id ranges, or a list of files), you can use Kiba Pro parallel transform like this:
source ... # something that generates a list of work items
parallel_transform(max_threads: 10) do |group_items|
  Kiba.run(...)
end
This works well if there is no output at all, or not much output, coming to the destinations of the sister jobs.
This works with threads but one can also "fork" here for extra performance.
Using process partitioning
In a similar fashion, one can structure their jobs in a way where each process will only process a subset of the input data.
This way one can start, say, 4 processes (via cron jobs, or monitored via a parent tool), and pass each one a SHARD_NUMBER (1, 2, 3 or 4), which the source then uses to partition the input load.
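A minimal sketch of what such a sharded source could look like, assuming the usual Kiba convention that a source is an object whose `each` method yields rows. The class name `ShardedSource`, the modulo-based partitioning, and the `shard_count` parameter are illustrative assumptions, not something Kiba provides.

```ruby
# A Kiba-style source (assumption: any object whose `each` yields rows).
# It only yields the items that belong to this process's shard.
class ShardedSource
  def initialize(items, shard_number:, shard_count:)
    @items = items
    @shard_index = shard_number - 1 # SHARD_NUMBER is 1-based, as in the text
    @shard_count = shard_count
  end

  def each
    @items.each_with_index do |item, index|
      # Round-robin partitioning: item N goes to shard N modulo shard_count.
      yield item if index % @shard_count == @shard_index
    end
  end
end

# Each process reads its own shard number from the environment.
shard_number = Integer(ENV.fetch("SHARD_NUMBER", "1"))
source = ShardedSource.new((1..10).to_a, shard_number: shard_number, shard_count: 4)

rows = []
source.each { |row| rows << row }
```

Each of the 4 processes would then be started with a different value, for example `SHARD_NUMBER=2 ruby job.rb` (the script name is hypothetical), and together they cover the whole input without overlap.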
But!
I'm pretty sure your problem, as you said, is more about workflow control & declarations & ability to express what you need to be done, rather than performance.
I'll reach out and we'll discuss that.
I asked in my previous question if Karate is capable of executing tests on specific data sets (For instance, based on priority p0,p1) given in a csv file.
Now my second question is if Karate is capable of executing tests on specific data sets in a csv file in parallel?
Example: DataProvider supports data-provider-thread-count. Here's an example of usage.
I've read the documentation regarding parallel execution in Karate, but I did not find anything on this type of parallel feature. Can you please let me know if this is possible in Karate? Thank you.
Yes, if you use a Scenario Outline, each row will run in parallel. This applies even to the "Dynamic" Scenario Outline, as explained here: https://github.com/intuit/karate#dynamic-scenario-outline
Karate runs each Scenario in parallel and, behind the scenes, each Examples row is turned into a Scenario. This is mentioned a few paragraphs further down in the docs: https://intuit.github.io/karate/#parallel-stats
I have been reading a lot about Gherkin, and I had read that it is not good to repeat steps and that the keyword "Background" should be used to avoid this. But in the example on this page they repeat the same "Given" again and again. Could it be that I am doing it wrong? I would like your opinion on this:
As with many things, this is a topic that will generate different opinions. In this particular example I would have moved the "Given that I select the post" step to the Background section, as it seems to be a prerequisite for all scenarios in this feature. Of course, this would leave the scenarios in the feature without an actual Given section, but those steps would be incorporated from the Background section on execution.
I have also seen cases where the decision to move steps to the Background is a trade-off between having more or fewer feature files and how these are structured. For example, if there are 10 scenarios for a particular feature with a lot of similar steps between them, but 1 or 2 scenarios do not require a particular step, then those 1 or 2 scenarios would have to be moved into a new feature file in order to keep the exact same steps in the Background section of the original feature.
Of course it is correct to keep the scenarios like this. From a tester's perspective, the Scenarios/Test cases should run independently, therefore, you can keep these tests separately for each functionality.
But if you are doing integration testing, then some of these test cases can be merged, so that you cover multiple test cases in one scenario.
And since the "Given" statement is repeated, you can put it in the Background, so you don't have to call it in each scenario.
Note: These separate scenarios will be handy when you run the scripts separately with annotation tags, when you just have to check for a specific functionality, or a bug fix.
For example, I have a list of ids for a product that lives in a database and is updated daily. I need to be able to run a scenario that consumes that data and runs the same steps over each of the ids in order. However, the test should not stop because one of the ids failed in the scenario, similar to what cucumber does with the scenario outline type of tests.
We would also want to format the output of the cucumber test(s) so that each id is formatted as if it is a separate test or example in a "scenario outline."
I believe I did something similar some time ago. Have a look at this feature definition.
The "Then I should be able to get to the browse categories page" action is defined here and, as you can see, Category at line 59 retrieves data from this class. In this case I'm getting data from a CSV file, but you can just substitute it with your DB.
My Ruby is a bit basic so the code style might not look so good, but it is an example I had around to easily explain what I did. Hope this helps!
Cucumber is not designed for writing complex information in the feature file.
If your data is complex or dynamically generated, you should get the data in the step definition and use a generic term in the feature file.
That's the intention of Cucumber: writing simple features so that a non-technical person can easily understand what the scenario is doing.
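As a rough sketch of that approach for the list-of-ids case above, the body of a step definition could loop over the ids itself and record a result per id instead of stopping at the first failure. This is plain Ruby, not a Cucumber API: `run_for_each` and `check_product` are hypothetical names, and the "odd ids fail" rule is made up purely for illustration.

```ruby
# Hypothetical check. In a real step definition this would run the
# scenario's steps for one id and raise if something is wrong.
check_product = ->(id) { raise "product #{id} is invalid" if id.odd? }

# Runs the check for every id and records the outcome per id, so one
# failing id does not prevent the remaining ids from being checked.
def run_for_each(ids, check)
  ids.each_with_object({}) do |id, results|
    begin
      check.call(id)
      results[id] = :passed
    rescue StandardError => e
      results[id] = "failed: #{e.message}"
    end
  end
end

ids = [2, 3, 4] # stand-in for the ids loaded from the database
results = run_for_each(ids, check_product)
failures = results.reject { |_id, status| status == :passed }
```

Inside the step definition one could then raise once at the end if `failures` is not empty, listing every failing id in the message, so the scenario is still reported as failed after all ids have been tried.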
So, with my limited knowledge of TestComplete scripting, it seems as though one should look at the object viewer to see your windows, and use the UI features via name mapping of these objects: clicking, selecting, or populating their fields.
I have a question about how to do assertions in JS scripting tests. If I want to see whether a certain window looks like a past window, what I have been doing is making a checkpoint via keyword tests at that time. I feel like I should be doing this through the API, though. Is there an area that explains how to do this via code, rather than using the keyword checkpoints?
Bob, the checkpoints idea is not limited to Keyword Tests. You can use checkpoints in scripts as well. When recording a script, you just create the needed Checkpoint type via the Recording toolbar (I guess you need the Region Checkpoint in your case), and you will get the needed script generated. Based on this script, you will see how checkpoints are called from a script.
As for the documentation, the "Region Checkpoints" help topic does a good job of explaining the basics and giving links to other topics to read. The "Creating Region Checkpoints" help topic shows the procedure step by step.
I hope this helps. Let me know if there are unclear points.