Ability to execute tests on data sets in a CSV file 'in parallel' - multithreading

I asked in my previous question whether Karate is capable of executing tests on specific data sets (for instance, based on priority p0, p1) given in a CSV file.
Now my second question is whether Karate is capable of executing tests on specific data sets in a CSV file in parallel.
Example: TestNG's DataProvider supports data-provider-thread-count. Here's an example of its usage.
I've read the documentation regarding parallel execution in Karate, but I did not find anything on this type of parallel feature. Can you please let me know if this is possible in Karate? Thank you.

Yes, if you use a Scenario Outline, each row will run in parallel. This applies even to the "dynamic" Scenario Outline, as explained here: https://github.com/intuit/karate#dynamic-scenario-outline
Karate runs each Scenario in parallel, and behind the scenes each Examples row is turned into a Scenario. This is mentioned a few paragraphs further down in the docs: https://intuit.github.io/karate/#parallel-stats
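For reference, kicking off that parallel execution from Java looks roughly like the sketch below; the feature path, tag filter, and thread count are placeholders, and the exact entry point depends on your Karate version (recent versions expose the Runner builder shown here).

```java
import static org.junit.Assert.assertEquals;

import org.junit.Test;

import com.intuit.karate.Results;
import com.intuit.karate.Runner;

public class ParallelTest {

    @Test
    public void testParallel() {
        // Every Scenario - including each row generated by a (dynamic) Scenario Outline -
        // is distributed across the thread pool.
        Results results = Runner.path("classpath:examples") // placeholder feature path
                                .tags("~@ignore")            // placeholder tag filter
                                .parallel(5);                // number of threads
        assertEquals(results.getErrorMessages(), 0, results.getFailCount());
    }
}
```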

Related

How to deal with (Apache Beam) high IO bottlenecks?

Let's say, to cite a simple example, that I have a very simple Beam pipeline which just reads from a file and dumps the data into an output file. Now let's consider that the input file is huge (some GBs in size, the kind of file you typically can't open in a text editor). Since the direct-runner implementation is quite simple (it reads the whole input set into memory), it won't be able to read and output such huge files (unless you assign an impractically high amount of memory to the Java VM process). So my question is: how do production runners like Flink, Spark, or Cloud Dataflow deal with this 'huge dataset' problem, assuming they don't just try to fit the whole file(s)/dataset into memory?
I'd expect a production runner's implementation to need to work 'in parts or batches' (e.g. reading/processing/outputting in parts) to avoid trying to fit huge datasets into memory at any specific point in time. Can somebody please share their feedback on how production runners deal with this 'huge data' situation?
Generalizing, please notice this applies to other input/output mechanisms too. For example, if my input is a PCollection coming from a huge database table (broadly speaking, huge in both row size and row count), does the internal implementation of the production runner somehow divide the given input SQL statement into many internally generated sub-statements, each taking a smaller subset (for example by internally generating a count(*) statement, followed by N statements each taking count(*)/N elements)? The direct-runner won't do this and will just pass the given query 1:1 to the DB. Or is it my responsibility as a developer to 'iterate in batches' and divide the problem? And if this is indeed the case, what are the best practices here, i.e. having one pipeline for this or many? And if only one, should I somehow parametrise the pipeline to read/write in batches, or iterate over a simple pipeline and manage the necessary metadata externally to the pipeline?
Thanks in advance, any feedback would be greatly appreciated!
EDIT (reflecting David's feedback):
David, your feedback is highly valuable and definitely touches on the point I'm interested in. Having a work-discovery phase for splitting a source and a read phase to concurrently read the split partitions is definitely what I was interested in hearing about, so thanks for pointing me in the right direction. I have a couple of small follow-up questions, if you don't mind:
1 - Under the section "Generic enumerator-reader communication mechanism", the article points out the following:
"The SplitEnumerator and SourceReader are both user implemented class.
It is not rare that the implementation require some communication
between these two components. In order to facilitate such use cases [....]"
So my question here would be: is that "splitting + reading" behaviour triggered by some user-provided (i.e. developer-provided) implementation (specifically SplitEnumerator and SourceReader), or can I benefit from it out of the box without any custom code?
2 - Probably just delving deeper into the question above: if I have a batch/bounded workload (let's say I'm using Apache Flink) and I'm interested in processing a "huge file" as described in the original post, will the pipeline work "out of the box" (doing the behind-the-scenes "work preparation phase" splits and the parallel reads), or would that require some custom code implemented by the developer?
Thanks in advance for all your valuable feedback!
Note that when the inputs are bounded and known in advance (i.e., a batch workload as opposed to streaming), this is more straightforward.
In Flink, which is designed with streaming in mind, this is done by separating "work discovery" from "reading". A single SplitEnumerator runs once and enumerates the chunks to be read (the splits/partitions), and assigns them to parallel readers. In the batch case a split is defined by a range of offsets, while in the streaming case, the end offset for each split is set to LONG_MAX.
This is described in more detail in FLIP-27: Refactor Source Interface.
Just to provide some closure to this question: the motivation behind it was to find out whether Apache Beam, when coupled with a production runner (like Flink, Spark, or Google Cloud Dataflow), offers out-of-the-box mechanisms for splitting work, i.e. reading/writing/manipulating huge files (or data sources in general). The answer provided by David Anderson above proved of great value in hinting at how Apache Flink deals with these workflows.
At this point I've implemented solutions using huge files (to test for possible IO bottlenecks) with a "Beam on Flink" based pipeline, and I can confirm that Flink creates an execution plan which includes splitting sources and dividing work in such a way that no memory problem arises. There can of course be conditions under which stability or IO performance is compromised, but at least I can confirm that the workflows carried out behind the pipeline abstraction use the filesystem when carrying out tasks, avoiding fitting all data into memory and thus avoiding trivial out-of-memory errors. Conclusion: yes, "Beam on Flink" (and likely on Spark and Dataflow too) offers proper work preparation, work splitting, and filesystem usage so that the available volatile memory is used efficiently.
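To make that concrete, a minimal "read a huge file, write it back out" Beam pipeline is sketched below (the file paths are placeholders). The point is that the pipeline code stays the same; it is the production runner, selected via --runner, that splits the source and distributes the reads, which is exactly what the direct runner does not do for very large inputs.

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class CopyHugeFile {
    public static void main(String[] args) {
        // Pass e.g. --runner=FlinkRunner (or SparkRunner / DataflowRunner) on the command line;
        // it is the runner, not the pipeline code, that decides how the source is split and parallelized.
        PipelineOptions options = PipelineOptionsFactory.fromArgs(args).withValidation().create();
        Pipeline p = Pipeline.create(options);

        p.apply("ReadHugeFile", TextIO.read().from("/data/input/huge-file.txt"))   // placeholder path
         .apply("WriteCopy", TextIO.write().to("/data/output/huge-file-copy"));    // placeholder path prefix

        p.run().waitUntilFinish();
    }
}
```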
Update about data sources: regarding DBs as data sources, Flink won't (and can't; it is not trivial) optimize/split/distribute work related to DB data sources in the same way it optimizes reading from the filesystem. There are still approaches for reading huge amounts of records from a DB, but the implementation details need to be addressed by the developer rather than being the responsibility of the framework. I've found this article (https://nl.devoteam.com/expert-view/querying-jdbc-database-in-parallel-with-google-dataflow-apache-beam/) very helpful in addressing the point of reading massive amounts of records from a DB in Beam (the article uses the Cloud Dataflow runner, but I used Flink and it worked just fine): splitting queries and distributing the processing.
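The gist of that approach, roughly sketched under my own assumptions (the buildRanges helper, the id bounds, the huge_table/payload names, and the connection details are all made up), is to turn the id space into a PCollection of ranges and have a DoFn issue one bounded query per range, so the runner can distribute the ranges and no single worker ever pulls the whole table:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.util.ArrayList;
import java.util.List;

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.coders.KvCoder;
import org.apache.beam.sdk.coders.VarLongCoder;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.KV;

public class PartitionedJdbcRead {

    // Hypothetical helper: split [minId, maxId] into fixed-size sub-ranges.
    static List<KV<Long, Long>> buildRanges(long minId, long maxId, long rangeSize) {
        List<KV<Long, Long>> ranges = new ArrayList<>();
        for (long lo = minId; lo <= maxId; lo += rangeSize) {
            ranges.add(KV.of(lo, Math.min(lo + rangeSize - 1, maxId)));
        }
        return ranges;
    }

    public static void main(String[] args) {
        Pipeline p = Pipeline.create();

        // Each element is one id range; the runner distributes the ranges across workers.
        p.apply(Create.of(buildRanges(1L, 10_000_000L, 100_000L))
                      .withCoder(KvCoder.of(VarLongCoder.of(), VarLongCoder.of())))
         .apply(ParDo.of(new DoFn<KV<Long, Long>, String>() {
             @ProcessElement
             public void processElement(ProcessContext c) throws Exception {
                 KV<Long, Long> range = c.element();
                 // Connection details and table/column names are placeholders.
                 try (Connection conn = DriverManager.getConnection(
                          "jdbc:postgresql://db-host/mydb", "user", "pass");
                      PreparedStatement stmt = conn.prepareStatement(
                          "SELECT id, payload FROM huge_table WHERE id BETWEEN ? AND ?")) {
                     stmt.setLong(1, range.getKey());
                     stmt.setLong(2, range.getValue());
                     try (ResultSet rs = stmt.executeQuery()) {
                         while (rs.next()) {
                             c.output(rs.getLong("id") + "," + rs.getString("payload"));
                         }
                     }
                 }
             }
         }));
        // ... downstream transforms / writes go here.

        p.run().waitUntilFinish();
    }
}
```

Newer Beam releases also ship partitioned-read helpers in JdbcIO, but the manual version above makes the splitting explicit and works the same way on any runner.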

Is there a generic dag/task can be written in Airflow?

Detailed question: I have a scenario in mind and would like to get a suggestion from experts!
In the attached image, the first set of images shows the current workflow; below it is the expected workflow. I can't combine all the workflows into one DAG due to the different source data refresh times.
Is it possible to create one DAG named "generic" and reuse its individual tasks in the other child DAGs, like templates?
Or, how can I call these generic DAGs from my child DAGs?
I'm looking for your valuable suggestions to make the workflow optimised and easy to maintain when any update is needed in the extract and load parts. As of now, it's really complex, as I need to touch all 4 DAGs individually!

Writing DRY Spock tests

Assume that I have a Spock specification that, given a city and state, tests for the correct zip code. Assume I have a text file of cities and states that is used to drive the tests in the where clause.
Now assume that I want to split the tests so that I can run them for "Virginia" or "Maryland". The approach I have taken is to create a new VirginiaSpec and a new MarylandSpec and, in each spec, modify the where clause.
This works, but seems inefficient because every other part of the VirginiaSpec and MarylandSpec is exactly the same. In addition, if the logic changes, then I would need to change it in every spec that I have.
So what I am looking for is an approach that allows me to have one StateSpec in which the where clause can be parameterized.
I realize I have not included a code example, however if my question is not clear, then I can provide one. Thanks for your help.
-Dan
You have a couple of options. You could put the basic setup and structure, and even the test itself, in a base test class, then extend that class with your VirginiaSpec and MarylandSpec. Your spec classes would be very small, probably just defining a constant that is the state for the spec.
But that seems needless. If both the cities and states are in this file, you could just read in the file in the where section of your test.
https://snekse.github.io/test-often-and-prosper-slides/#/42
If you cannot get the WHERE section working, you could always read in your file during the setupSpec and store the data in some kind of data structure then loop through it.
Spock: Reading Test Data from CSV File
But in general, using the Where label is going to be the right answer.

Is there a way to run a cucumber scenario multiple times over a list of dynamically generated data?

For example, I have a list of ids for a product that lives in a database and is updated daily. I need to be able to run a scenario that consumes that data and runs the same steps over each of the ids in order. However, the test should not stop because one of the ids failed in the scenario, similar to what Cucumber does with Scenario Outline tests.
We would also want to format the output of the Cucumber test(s) so that each id is formatted as if it were a separate test or example in a "Scenario Outline".
I believe I did something similar some time ago. Have a look at this feature definition.
The "Then I should be able to get to the browse categories page" action is defined here and, as you can see, Category at line 59 retrieves data from this class. In this case I'm getting data from a CSV file, but you can just substitute it with your DB.
My Ruby is a bit basic so the code style might not look so good, but it is an example I had around to easily explain what I did. Hope this helps!
Cucumber is not designed for writing complex information in the feature file.
If your data is complex or dynamically generated, you should fetch the data in the step definition and write a generic phrase in the feature file.
That's the intention of Cucumber: writing simple features so that a non-technical person can easily understand what the scenario is doing.
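A rough sketch of that idea with Cucumber-JVM is shown below (assuming a recent Cucumber-JVM version; the fetchProductIds and checkProduct helpers are hypothetical stand-ins for your DB query and per-id assertions). A single generic step loops over the ids and collects failures so one bad id does not stop the rest, and the final message lists every failing id:

```java
import static org.junit.Assert.fail;

import java.util.ArrayList;
import java.util.List;

import io.cucumber.java.en.Then;

public class ProductSteps {

    @Then("every product id from the database passes the product checks")
    public void everyProductIdPasses() {
        List<String> failures = new ArrayList<>();
        for (String id : fetchProductIds()) {             // hypothetical: query the daily-updated table
            try {
                checkProduct(id);                          // hypothetical: per-id verification steps
            } catch (AssertionError e) {
                failures.add(id + ": " + e.getMessage());  // keep going; just record the failure
            }
        }
        if (!failures.isEmpty()) {
            fail("Failing product ids:\n" + String.join("\n", failures));
        }
    }

    private List<String> fetchProductIds() {
        // hypothetical DB lookup
        return new ArrayList<>();
    }

    private void checkProduct(String id) {
        // hypothetical assertions for a single product id
    }
}
```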

Automating Acceptance Testing for ETL

We have a Java application which essentially performs ETL - reading from and writing to files/databases, with transformation rules applied in the middle.
I've started looking into automating acceptance testing for the application, however I am struggling to apply the frameworks I've looked at so far (Concordion, Cucumber, etc.). They seem very easy to implement for simple applications like those shown in their tutorials, but I basically need tests saying "I have this input file and expect this output file (or result in a DB table)", with each file having hundreds of fields.
I could fake it so that the input values are read from an HTML table (as per the Concordion tutorial), however that isn't really a true test.
Has anyone come across a framework that could help, or been able to use Concordion for such a purpose?
Many thanks
Who is the audience of the test? If this is purely a technical exercise and there are no non-technical business owners who need to interact with the test, then just doing it with your favorite unit testing framework is fine. FitNesse works best when there is collaboration with non-techies on the acceptance criteria.
So no, just "file input 'a' produces file output 'b'" probably is not enough to warrant the overhead of FitNesse. I would only move it to tables if someone was going to change it on a regular basis and that person was not comfortable editing a file directly.
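If a plain unit testing framework is enough, an acceptance test of the "input file a produces output file b" kind can be as small as the JUnit sketch below; the fixture paths and the EtlJob.run entry point are assumptions standing in for however your application is invoked:

```java
import static org.junit.Assert.assertEquals;

import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;

import org.junit.Test;

public class EtlAcceptanceTest {

    @Test
    public void knownInputProducesExpectedOutput() throws Exception {
        Path input = Paths.get("src/test/resources/acceptance/input-a.csv");       // hypothetical input fixture
        Path expected = Paths.get("src/test/resources/acceptance/expected-b.csv"); // hypothetical golden file
        Path actual = Paths.get("target/acceptance/output-b.csv");

        EtlJob.run(input, actual); // hypothetical entry point into the ETL application

        List<String> expectedLines = Files.readAllLines(expected);
        List<String> actualLines = Files.readAllLines(actual);

        // Compare line by line so a failure points at the first differing record,
        // which matters when each record has hundreds of fields.
        assertEquals("Row count differs", expectedLines.size(), actualLines.size());
        for (int i = 0; i < expectedLines.size(); i++) {
            assertEquals("Mismatch at line " + (i + 1), expectedLines.get(i), actualLines.get(i));
        }
    }
}
```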
At a leading bank in the Netherlands we have set up test automation with FitNesse and ETL fixtures. It is an Agile project, and for our ETL solution we use Informatica PowerCenter and an Oracle DB. For us, our test automation/specification in FitNesse is of great value now.
We have SLIM fixtures for truncating tables, inserting records into tables, checking table records against expected values, updating records, and calling our FitNesse workflows.
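A SLIM decision-table fixture in that style is just a plain Java class: setters back the input columns, and a method per column ending in "?" produces the value SLIM compares against the expectation. The sketch below is a hypothetical example (the connection string and the customer_stage table are made up), checking one column of one staged record:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

// Used from a FitNesse decision table roughly like:
// | check customer stage          |
// | customer id | country code?   |
// | 42          | NL              |
public class CheckCustomerStage {

    private long customerId;

    // Backs the "customer id" input column.
    public void setCustomerId(long customerId) {
        this.customerId = customerId;
    }

    // Backs the "country code?" output column; SLIM compares the return
    // value with the expected value written in the table cell.
    public String countryCode() throws Exception {
        try (Connection conn = DriverManager.getConnection(
                 "jdbc:oracle:thin:@db-host:1521/etl", "user", "pass"); // placeholder connection details
             PreparedStatement stmt = conn.prepareStatement(
                 "SELECT country_code FROM customer_stage WHERE customer_id = ?")) {
            stmt.setLong(1, customerId);
            try (ResultSet rs = stmt.executeQuery()) {
                return rs.next() ? rs.getString(1) : null;
            }
        }
    }
}
```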
Did you try JBehave?
Read more information about it at http://www.qatestingtools.com/jbehave
