Automating Acceptance Testing for ETL - Cucumber

We have a Java application which essentially performs ETL - reading from and writing to files/databases, with transformation rules applied in the middle.
I've started looking into automating acceptance testing for the application, however I am struggling to apply the frameworks I've looked at so far (Concordion, Cucumber, etc.). They seem very easy to implement for simple applications like those shown in their tutorials, but I basically have to have tests saying "I have this input file and expect this output file (or result in a db table)" - with each file having hundreds of fields.
I could fake it so that the input values are read from an HTML table (as per the Concordion tutorial), however that isn't really a true test.
Has anyone come across a framework that could help? Or been able to use Concordion for such a purpose?
Many thanks

Who is the audience of the test? If this is purely a technical exercise and there are no non-technical business owners that need to interact with the test, then just doing it with your favorite unit testing framework is fine. FitNesse works best when there is collaboration with non-techies on the acceptance criteria.
So no, just "file input 'a' produces file output 'b'" probably is not enough to warrant the overhead of FitNesse. I would only move it to tables if someone was going to change it on a regular basis and that person was not comfortable editing a file directly.
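If you go the plain unit-testing route, the "golden file" pattern usually covers the "this input file produces this output file" case, even with hundreds of fields. A minimal JUnit 4 sketch - EtlJob and the resource paths are placeholders for however your application is actually invoked:

```java
import static org.junit.Assert.assertEquals;

import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;

import org.junit.Test;

public class EtlAcceptanceTest {

    @Test
    public void knownInputProducesExpectedOutput() throws Exception {
        Path input    = Paths.get("src/test/resources/acceptance/case01/input.csv");
        Path expected = Paths.get("src/test/resources/acceptance/case01/expected.csv");
        Path actual   = Files.createTempFile("etl-actual", ".csv");

        // EtlJob is a stand-in for however your application is invoked.
        new EtlJob().run(input, actual);

        List<String> expectedLines = Files.readAllLines(expected);
        List<String> actualLines   = Files.readAllLines(actual);

        assertEquals("row count differs", expectedLines.size(), actualLines.size());
        for (int i = 0; i < expectedLines.size(); i++) {
            // Comparing line by line gives a far more useful failure message
            // than diffing the whole file in a single assert.
            assertEquals("mismatch at line " + (i + 1), expectedLines.get(i), actualLines.get(i));
        }
    }
}
```

If the output is a database table instead of a file, the same pattern applies: query the table, dump it to a canonical text form, and diff against a checked-in expected dump.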

At a leading bank in the Netherlands we have set up test automation with FitNesse and ETL fixtures. It is an Agile project and for our ETL solution we use Informatica PowerCenter and Oracle DB. Our test automation/specification in FitNesse is of great value to us now.
We have SLIM fixtures for truncating tables, inserting records into tables, checking table records against expected values, updating records and calling our workflows from FitNesse.
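For anyone going the same route, a SLIM fixture is just a plain Java class; something like the sketch below (connection handling and table names are illustrative, not our actual code) can back wiki rows for truncating a staging table and checking its row count:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

// Minimal SLIM script-table fixture; JDBC URL and table names are illustrative.
public class DatabaseFixture {

    private final String url = System.getProperty("etl.test.jdbc.url");

    public boolean truncateTable(String tableName) throws Exception {
        try (Connection c = DriverManager.getConnection(url);
             Statement s = c.createStatement()) {
            // Fine for a test fixture against a dedicated test schema.
            s.execute("TRUNCATE TABLE " + tableName);
            return true;
        }
    }

    public int rowCountOf(String tableName) throws Exception {
        try (Connection c = DriverManager.getConnection(url);
             Statement s = c.createStatement();
             ResultSet rs = s.executeQuery("SELECT COUNT(*) FROM " + tableName)) {
            rs.next();
            return rs.getInt(1);
        }
    }
}
```

SLIM maps wiki cells to camelCase method names, so a script-table row like |truncate table|STG_CUSTOMER| calls truncateTable, and |check|row count of|STG_CUSTOMER|0| calls rowCountOf and compares the result.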

Did you try JBehave?
You can read more about it at http://www.qatestingtools.com/jbehave

Related

Grouping similar strings that have misspellings, spacing differences, etc

I have a data set of about 1 million employer names. These names are from a free-form text field so they include misspellings and variations in the way they are inputted (e.g. "Amazon" .. "Amzaon" .. "Amazon.com" .. "Amazon Web Services" .. "AWS").
I want to either A) group these 1 million names so I have a somewhat accurate sense of how many unique employers are in the data set, or B) be able to find all variations of any given employer.
So far, I've been using the data in Tableau, then filtering on "employer name" and searching all variations of the name I can think of. But it's tedious and I'm pretty sure I'm leaving many out.
I've also used the fuzzy add-in for Excel, but it hasn't worked that well on misspellings, special characters, etc.
Tableau just isn't suited for doing this kind of analysis straight out of the box, and I would highly recommend doing some pre-processing on your data before trying to build a workbook around it.
Like another commenter said, you could look into using Tableau Prep Builder for a one-time transformation on your data set, but if you wanted to automate this process it costs extra to add functionality to whatever Tableau Server installation you have.
If you're familiar with Python or R (and the integration between Tableau Server and those services is supported by your organization), you could look into building a script to run the transformation in real time, but it probably won't be too efficient.
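The pre-processing script itself can be fairly lightweight. The answer suggests Python or R; as a language-neutral illustration of the idea (sketched here in Java for consistency with the rest of this page), normalize the names and then group them with an edit-distance threshold. For a million rows you would also want blocking (e.g. by first letter or token) so you are not comparing everything against everything:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class EmployerGrouper {

    // Classic Levenshtein edit distance.
    static int editDistance(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i;
        for (int j = 0; j <= b.length(); j++) d[0][j] = j;
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1), d[i - 1][j - 1] + cost);
            }
        }
        return d[a.length()][b.length()];
    }

    // Lower-case, drop a few common suffix tokens, strip punctuation.
    static String normalize(String name) {
        return name.toLowerCase()
                   .replaceAll("\\b(inc|llc|ltd|corp|com|web services)\\b", "")
                   .replaceAll("[^a-z0-9]+", " ")
                   .trim();
    }

    // Greedy clustering: each name joins the first existing group whose
    // representative is within the similarity threshold.
    public static Map<String, List<String>> group(List<String> names, double threshold) {
        Map<String, List<String>> groups = new LinkedHashMap<>();
        for (String name : names) {
            String n = normalize(name);
            String match = null;
            for (String rep : groups.keySet()) {
                int dist = editDistance(n, rep);
                int longest = Math.max(1, Math.max(n.length(), rep.length()));
                double similarity = 1.0 - (double) dist / longest;
                if (similarity >= threshold) { match = rep; break; }
            }
            groups.computeIfAbsent(match != null ? match : n, k -> new ArrayList<>()).add(name);
        }
        return groups;
    }
}
```

A threshold around 0.85-0.9 is a common starting point but needs tuning against your own data, and "Amazon" vs "AWS" shows why pure edit distance is not enough - expect to add token-based rules or manual review for the biggest employers.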
Try experimenting with Tableau Prep Builder - the companion tool that comes with your Tableau Creator license. It has a group feature that is designed for just these problems.
In Prep Builder, you’ll just need to connect to your data, add a cleaning step, and then add a group to your cleaning step.

Automating Raw Export Data Cleansing for Client Onboarding - Format is Always Different

So a bit of a general question. I work as a data analyst for a startup. My primary process involves taking a client's existing customer data and cleansing/normalizing it once, as part of our onboarding process, so it fits into our platform. A member of our team exports their data from the system they are transitioning from or, if they kept track of it in house, we receive the Excel log they used to track it. It is always in a different format and requires extensive cleansing (avg 1 min/record). We take what is usually one large table (.xlsx format) and, after cleansing, split it into four .csv files, which we load as four tables on our platform.
I feel I have optimized the process quite well in terms of the process steps and cleansing with Excel functions (IF, CONCAT, text-to-columns, etc.). I have beginner-to-intermediate skills in VBA and SQL and have just scratched the surface of R; what is frustrating is that I know there is the potential to automate this process, but I just don't know where to start. If anyone has experience with something like this - code, a link to an article or another thread, or just some general direction - it would be much appreciated. Please ask for clarification where you feel it is needed. Thanks.
This will be really hard to do in Excel. If you have the time, you can try out Optimus, a data cleansing library written in Python and PySpark (you don't need to know Spark). Here is the webpage: https://hioptimus.com.
You can create data pipelines with it, and I recommend that you do: try to generalize your processes, and ask the client for a more structured way of passing the data.
The good thing is that you don't need Big Data to run Optimus, but if you have it some day, the same code will work.
Check out the documentation for more:
http://optimus-ironmussa.readthedocs.io/en/latest/
Let me know if you have doubts!
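Optimus itself is Python/PySpark, but the underlying advice - generalize the cleansing into a configurable pipeline rather than repeating manual steps per client - is language-agnostic. A rough sketch of the shape (in Java for consistency with the rest of this page; every name here is made up):

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;
import java.util.stream.Collectors;

// A cleansing "pipeline" is just an ordered list of row transformations
// that you configure once per client instead of redoing manual Excel steps.
public class CleansingPipeline {

    private final List<Function<String[], String[]>> steps;

    public CleansingPipeline(List<Function<String[], String[]>> steps) {
        this.steps = steps;
    }

    public List<String[]> apply(List<String[]> rows) {
        return rows.stream()
                   .map(row -> {
                       String[] out = row;
                       for (Function<String[], String[]> step : steps) {
                           out = step.apply(out);
                       }
                       return out;
                   })
                   .collect(Collectors.toList());
    }

    // Example reusable step: trim whitespace in every column.
    public static String[] trimAll(String[] row) {
        return Arrays.stream(row)
                     .map(c -> c == null ? null : c.trim())
                     .toArray(String[]::new);
    }
}
```

Per-client differences then live in a small configuration (which steps to run, in what order) rather than in one-off spreadsheet work, and splitting the cleaned rows into the four CSV outputs becomes just another step at the end of the same pipeline.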

Is there a way to run a cucumber scenario multiple times over a list of dynamically generated data?

For example, I have a list of ids for a product; the list lives in a database and is updated daily. I need to be able to run a scenario that consumes that data and runs the same steps over each of the ids in order. However, the test should not stop because one of the ids failed in the scenario, similar to how Cucumber handles the scenario outline type of test.
We would also want to format the output of the cucumber test(s) so that each id is formatted as if it is a separate test or example in a "scenario outline."
I believe I did something similar some time ago. Have a look at this feature definition.
The "Then I should be able to get to the browse categories page" action is defined here and, as you can see, Category at line 59 retrieves data from this class. In this case I'm getting data from a CSV file, but you can just substitute it with your DB.
My Ruby is a bit basic so the code style might not look great, but it is an example I had around to easily explain what I did. Hope this helps!
Cucumber is not designed for putting complex information in the feature file.
If your data is complex or dynamically generated, you should fetch the data in the step definition and use a generic phrase in the feature file.
That's the intention of Cucumber: writing simple features so that a non-technical person can easily understand what the scenario is doing.
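A sketch of what that looks like with Cucumber-JVM (the application in question is Java; fetchIdsFromDatabase and checkProduct are placeholders for your own query and per-id checks). The whole list is fetched inside the step, and failures are collected rather than thrown so one bad id does not stop the rest:

```java
import io.cucumber.java.en.Given;
import io.cucumber.java.en.Then;

import java.util.ArrayList;
import java.util.List;

import static org.junit.Assert.fail;

public class ProductIdSteps {

    private List<String> productIds;

    @Given("the current list of product ids")
    public void theCurrentListOfProductIds() {
        productIds = fetchIdsFromDatabase();
    }

    @Then("every product id can be processed")
    public void everyProductIdCanBeProcessed() {
        List<String> failures = new ArrayList<>();
        for (String id : productIds) {
            try {
                // Stands in for the per-id steps you would otherwise
                // repeat in a scenario outline.
                checkProduct(id);
            } catch (AssertionError | RuntimeException e) {
                // Collect instead of rethrowing, so all ids get exercised.
                failures.add(id + ": " + e.getMessage());
            }
        }
        if (!failures.isEmpty()) {
            fail("Failed ids:\n" + String.join("\n", failures));
        }
    }

    private List<String> fetchIdsFromDatabase() {
        // Placeholder: replace with the real query against your database.
        return new ArrayList<>();
    }

    private void checkProduct(String id) {
        // Placeholder: per-id assertions go here.
    }
}
```

Note that this reports all ids under a single scenario. If you genuinely need each id to show up as a separate example in the report, the usual workaround is to generate the examples table (or the whole feature file) from the database in a build step before Cucumber runs.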

Is there a way to specify generic Cucumber tests which can be included into a feature?

Is there a way to reuse Cucumber scenarios across multiple features which each share the same UI characteristics?
For example, if I am creating an iPhone app then I would have certain tests that I would like to run for each table view screen regardless of what is actually being displayed. These would be adding a new row, deleting a row, reordering, etc.
Copying and pasting these into each feature breaks DRY. Is there a way to specify generic or templated tests in Cucumber which I can include in each feature that requires them?
Don't think of BDD as testing. Think of it as providing a set of examples which show why your app is valuable and how you can use it.
If the app uses the same code for each table, all you need is one example. I would expect it to be phrased something like:
Given Quentin has three films listed
When he adds "Kill Bill" to the film list
Then he should be reminded that he already owns it.
Given Quentin has mistakenly added "Twilight" to his favourite films
When he deletes it from the film list
Then it should no longer be in the table.
The more interesting and human you can make the scenarios, the more you'll be able to work out why your application is valuable. That's the real heart of BDD - not testing, but knowing why you're writing the code in the first place.
Talking to people who really want the app will help you find out what it should do - and if you're writing it on your own, buy a rubber duck!

Integrating with 500+ applications

Our customers use 500+ applications and we would like to integrate these applications with ours. What is the best way to do that? These applications are time registration applications, and common to most of them is that they can export to CSV or similar; some of them are actually home-brewed Excel sheets where time is registered.
The best idea so far is to create our own Excel sheet, which can be used to integrate with all these applications. The integrations could be in the form of cells containing something like ='[c:\export.csv]rawdata'!$A$3, where export.csv is the CSV file exported from the time registration applications. Can you see a better way to integrate with all these applications? It should be mentioned that almost all our customers have Microsoft Office.
Edit: Answers to the excellent questions from Pontus Gagge:
How similar are the data in the different applications?
I assume that since they are time registration applications, they will have some similarities, but some will register how long one has worked in total for a whole month, while others will specify it for each day. If Excel is chosen, I believe that many of the differences could be ironed out using basic formulas.
What quality is the data?
The quality of the data can vary, so basic validation must be undertaken. A good approach is also to make it transparent to the customers how our application understands their input, so that they are responsible for it.
How large amounts of data are you talking about?
There will be information about the time worked for up to 50 employees.
Is the integration one-way only?
Yes
With what frequency should information be transferred?
Once per month (when they need to pay salaries).
How often do the applications themselves change, and how often does your product change?
If their application is a home-brewed Excel sheet, then I assume it will change once a year (due, for example, to a mistake someone made). If it is a proper, standard time registration application, then I do not believe they are updated more often than every five years or so, as it is a very stable concept.
Should the integration be fully automatic or can your end users trigger a data transfer?
They can surely trigger the data transfer. The users are often dedicated to the process, so they can be trained to do it, which means that they could make up to, say, 30 mouse clicks in order to integrate each month.
Will the customers have somebody to monitor the integrations?
As we have many customers, many of them should be able to undertake the integration themselves. We will, though, be able to assist them over the telephone. We cannot, however, undertake the integration ourselves, because we would then be responsible for any errors due to user mistakes, etc.
Does the phrase 'integration spaghetti' mean anything to you...?
I am looking for ideas from the best chefs to cook a nice large portion of that.
You need to come up with a common data format, and a way to translate the individual data formats to the common format. There's really no way around this - any solution you come up with will have to do this in one way or the other. It's the essential complexity of what you're doing.
The bigger issue is actually variances within the source data, in terms of how things like dates are stored, missing columns, etc. Doing a generic conversion for CSV to move columns around is comparatively easy.
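A sketch of what that common data format can look like in code - all names are illustrative; the point is only that everything downstream sees one canonical record type:

```java
import java.time.LocalDate;
import java.util.List;

// The canonical record every source format is translated into.
class TimeEntry {
    String employeeId;
    LocalDate date;
    double hoursWorked;
}

// One small converter per source format; adding a new customer format
// means writing one new converter, not touching the rest of the system.
interface TimeExportConverter {
    boolean canHandle(String fileName, List<String> sampleLines);
    List<TimeEntry> convert(List<String> lines);
}
```

The importer then just walks the registered converters and uses the first one whose canHandle returns true.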
I would also look at CSV and then use an OLEDB connection against the CSV file for importing.
If you try to make something that can interface to any data structure in the universe (and 500 is plenty close enough), it is guaranteed to be a maintenance nightmare. Instead I would approach this from multiple angles:
Devise an interface into which a human can enter this data already in the proper format. With 500+ clients, I'd make this a small, raw but functional browser-based site that users can use to enter this information manually. This is the fall-back. At the end of the day, a human can re-key the information into the site and solve the import issue. Ideally, everyone would use this instead of their own format. Data entry people are cheap.
Similar to above, but expanded, I would develop a standard application or standardize on an off-the-shelf application that can be used to replace their existing format. This might take more time than #1. The goal would be to only do one-time imports of these varying data schemas into the application and be done with them for good.
The nice thing about spreadsheets is that you can do anything anywhere. The bad thing about spreadsheets is that you can do anything anywhere. With CSV or a spreadsheet there is simply no way to enforce data integrity and thus consistency (which is the primary goal) on the data. If the source data is already in a database, then that is obviously simpler.
I would be inclined to use a database format into which each of these files needs to be converted, rather than a spreadsheet (e.g. use something like Jet (MDB)). If you have non-Windows users then that will make it harder and you might have to use a spreadsheet. The problem is that it is too easy for the user to change their source structure, break their upload and come crying to you. If a given end user has a resident expert, they can find a way of importing the data into that database format. If you are that expert, then I would, on a case-by-case basis, write something that imports into that database format. XML would be the other choice, but that will likely take more coding than an import/export into a database format.
Standardization of the apps (even having all the sources in a database format instead of a spreadsheet would help) and control over the data schema is the ultimate goal rather than permitting a gazillion formats. There really is no nice answer other than standardization. Otherwise, you are having to write a converter for every Tom-Dick-and-Harry format and again when someone changes the source format.
With a multitude of data sources, mapping each one correctly to an intermediate format is not trivial. Regular expressions are good with a finite set of known data formats. A multi-pass approach can help when data is ambiguous without context (e.g. month/day fields when you have several days of data), and also helps defeat data entry errors. But since this data is connected to salaries, the transfer needs to be reliable.
An import configuring trick
Get the customer to make a set of training data in the application. It should have a "predefined unique date", and each subsequent data field should contain a number corresponding to the target data field in your application. On import, your application needs to recognise the predefined date, determine the translation required, display/save this "mapping key", and stop the import. E.g. if you expect "Duration hours" in field two, then get the user to enter 2 in the relevant source field, which might be "Attendance hours".
On subsequent runs, and with the mapping key stored, import becomes a fairly easy process of translation.
Note on terms
"predefined date" - must be historical, say founding date of your company?, might need to be in PC clock settable range.
"mapping key" - could be string of hex digits and nybble based so tractable to workout
The entered code can be extended to signify required conversions ie customer's application has durations in days and your application expects it in hours.
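A rough sketch of that trick in code (the sentinel date, zero-based target field numbers and the column layout are all assumptions for illustration):

```java
import java.util.HashMap;
import java.util.Map;

public class MappingKeyImporter {

    // The agreed "predefined date"; its exact text depends on how the
    // customer's application exports dates.
    private static final String SENTINEL_DATE = "1901-01-01";

    private final Map<Integer, Integer> sourceToTarget = new HashMap<>();

    /** Returns true if this row was the training row and a mapping was learned. */
    public boolean learnMappingIfTrainingRow(String[] row) {
        if (!SENTINEL_DATE.equals(row[0])) {
            return false;
        }
        for (int source = 1; source < row.length; source++) {
            String cell = row[source] == null ? "" : row[source].trim();
            if (!cell.isEmpty()) {
                // The customer typed the (zero-based, in this sketch) number of
                // the target field here; real code would validate, not trust parseInt.
                sourceToTarget.put(source, Integer.parseInt(cell));
            }
        }
        return true; // caller persists the mapping key and stops this import
    }

    /** On subsequent imports, rearrange each source row into target order. */
    public String[] remap(String[] sourceRow, int targetFieldCount) {
        String[] target = new String[targetFieldCount];
        sourceToTarget.forEach((src, tgt) -> target[tgt] = sourceRow[src]);
        return target;
    }
}
```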
Interfacing with Windows programs (in order of increasing fragility):
Ye olde saving as a CSV file
Print to an operating system printer that is set up as a text file/PDF, then scavenge the data out of that
Extract data via the application's interface control, typically ActiveX for several Windows programs, e.g. Matlab's Spreadsheet Link
Read the native file format (e.g. xls), e.g. Matlab's xlsread
Add an additional intermediate spreadsheet that has external cell references, e.g. ='[filename]rawdata'!$A$3
Have a look at Teiid by JBoss: http://jboss.org/teiid
Also consider using SOA - e.g., if you're on Java, try JBoss SOA platform: http://www.jboss.com/resources/soa/?intcmp=1004
Use a simple XML format. A non-technical person can easily understand a simple XML format (and could even identify basic problems with XML documents that are not well-formed).
Maybe use a DTD (or, even better, an XML schema) to do very basic validation, and then supplement this with an XSL stylesheet to do more validation with better error reporting. (An XSL stylesheet simply converts from XML to something else and so can generate readable error messages.)
The advantage of this approach is that web browsers such as Internet Explorer can apply the XSL stylesheets. A customer need only spend at most a day enhancing their applications or writing Excel macros to generate the XML data in the format that you specify.
Recent versions of Excel have support for converting spreadsheet data to XML, and can even validate against schemas.
Once the data passes the XSL validation checks, you have validated XML data.
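On the receiving side, checking an uploaded file against the agreed schema is only a few lines with Java's built-in javax.xml.validation API (the file names here are placeholders):

```java
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.Validator;
import java.io.File;

public class TimesheetXmlValidator {

    public static void main(String[] args) throws Exception {
        // timesheet.xsd defines the agreed exchange format.
        SchemaFactory factory = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
        Schema schema = factory.newSchema(new File("timesheet.xsd"));

        Validator validator = schema.newValidator();
        try {
            validator.validate(new StreamSource(new File("customer-upload.xml")));
            System.out.println("Upload is structurally valid.");
        } catch (org.xml.sax.SAXException e) {
            // The parser's message points at the offending element or attribute;
            // an XSL stylesheet can be layered on top for friendlier reports.
            System.out.println("Rejected: " + e.getMessage());
        }
    }
}
```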
If you have heaps of data and heaps of money, you could look at existing data management and cleansing tools:
http://www-01.ibm.com/software/data/infosphere/datastage
http://www-01.ibm.com/software/data/infosphere/qualitystage
But even then, you'll likely need to follow kyoryu's suggestion, assuming you have 500+ data formats. The problem isn't on your side. You need them to standardize their output formats if you have no control over their apps. CSV is likely the easiest. You could even send them an Excel template to help them along.
