ETL: Transforming/cleaning Excel files

I am working for a startup that receives Excel files with customer information from different companies. We do not have any ETL tool at present; the work of transforming the data into the required structure and loading it into the CRM system is handled manually.
My plan is to load these Excel files into a database, replicate the CRM into a database as well, and do some fuzzy mapping.
Can you please recommend a lightweight ETL tool to apply a few rules to clean the data and compare it against the existing customer data that we have?
Thanks,
mc

Getting Excel feeds is certainly very common, and you need a good process for ingesting and validating them, especially since they are often manually created or tweaked, leading to frequent data and formatting issues. Adding insult to injury, Excel has a very fuzzy concept of data types, often throwing spanners in the works.
Where possible, switch your data sources to other formats (JSON, CSV, database extract). This requires upstream work but so does troubleshooting feed issues, so switching to a better format (and defining the feed well!) pays off for both sides fairly quickly.
Process Incoming Files Example describes a general approach for reliably handling multiple feeds of incoming files, with processing and archiving of successful and failing files. The example uses my company's actionETL cross-platform .NET ETL library, but I've also used the same approach previously with other ETL tools.
Map out all current and upcoming data sources and destinations, and see which tools are a good fit. Try before you buy with your actual ETL feeds and requirements. Expect the ETL data integration to be an ongoing project since feeds and requirements never stop changing and growing.
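To make the cleaning and fuzzy-comparison part concrete, here's a minimal Python sketch of the kind of rules involved (pandas assumed; the file names, column name and 0.85 cutoff are just illustrative placeholders):

```python
import difflib
import pandas as pd

# Hypothetical inputs: an incoming Excel feed and an extract of existing CRM customers.
incoming = pd.read_excel("incoming_customers.xlsx")
crm = pd.read_csv("crm_customers.csv")

# Simple cleaning rules: treat everything as text, trim whitespace, normalise case.
for frame in (incoming, crm):
    frame["name"] = frame["name"].astype(str).str.strip().str.upper()
incoming = incoming[incoming["name"] != ""]

def best_crm_match(name, crm_names, cutoff=0.85):
    """Return the closest existing CRM name above the similarity cutoff, else None."""
    matches = difflib.get_close_matches(name, crm_names, n=1, cutoff=cutoff)
    return matches[0] if matches else None

crm_names = crm["name"].tolist()
incoming["crm_match"] = incoming["name"].apply(lambda n: best_crm_match(n, crm_names))

# Rows with no match are candidates for new CRM records; the rest need review/merging.
new_customers = incoming[incoming["crm_match"].isna()]
```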
Cheers,
Kristian

Related

How to maintain Power Query code across multiple workbooks/workstations?

I'm working on automating a lot of the data reporting in the business I work at.
It's all various tables originating from a central database and spread across Excel workbooks.
I'm largely limited to MS office tools at the moment.
Power Query is a great deal faster than the current methods and easier to maintain.
I notice that a lot of the reporting uses the same results over and over again. As such, I can write a query and distribute it to my coworkers in an ODC file, or otherwise through a file server or Teams.
However, loading an ODC copies the raw Power Query code into the file, which means any changes made to the master query have to be manually applied to each file.
Is there a way to update Power Query code across multiple workbooks?
I'm trying to avoid writing database-level queries as much as possible. I have minimal support for that, would prefer not to freeze the system, and learning the IBM i series would be a disproportionately larger undertaking.
Store the M code in flat files in a OneDrive-synced folder, then load the queries dynamically using Expression.Evaluate. Chris Webb has a great article on this here: https://blog.crossjoin.co.uk/2014/02/04/loading-power-query-m-code-from-text-files/

Tool for storing information about tables, their sources and ETL for a DWH

I'm searching for a tool for storing documentation about tables, data sources, ETL processes, etc. for my DWH.
I've seen some presentations on YouTube, but I've found that most companies use their own custom system or something like a wiki with plain-text descriptions.
I don't think that is very useful for analysts, managers and other users who need to find out what data exists and how to use it to calculate the statistics they need.
Can you please suggest what I could use for this case? What should I read?
While Airflow was built with some support for Apache Atlas, in my opinion one of the best data-lake metadata management tools right now is Lyft's Amundsen, and they've also released lyft/amundsendatabuilder, the introduction of which says:
Amundsen Databuilder is a data ingestion library, which is inspired by Apache Gobblin. It could be used in an orchestration framework (e.g. Apache Airflow) to build data for Amundsen. You could use the library either with an ad-hoc Python script (example) or inside an Apache Airflow DAG (example).

How can I alter the incoming documents on replication in CouchDB

I need to replicate data in CouchDB from one database to another, but in the process I want to alter the documents being replicated,
mostly by stripping out particular fields (though other applications are mentioned in the comments).
The replication would always be 100% one-way (though other applications mentioned in the comments could use bi-directional replication and sync).
I would prefer it if this process did not increment their revision IDs, but that might be asking for too much.
But I don't see any design document functions that do what I am trying to do.
As CouchDB doesn't seem to do this, what plans are there for adding it? And in the meantime, what workarounds are there?
No, there is no out-of-the-box solution, as this would defy the whole purpose of the multi-master, MVCC design.
The only option I can see here is to create your own solution, but I would not call this a replication, but rather ETL (Extract, Transform, Load). And for ETL there are tools available that will let you do the trick, like (mixing open source and commercial here):
Scriptella
CloverETL
Pentaho Data Integration, or to be more specific Kettle
Jaspersoft ETL
Talend have some tools as well
There are plenty more ETL tools on the market.
I believe the best approach here would be to break out the fields you want to filter out into a separate document and then filter out the document during replication.
Of course the best way would be to have built-in support for this, but a workaround that occurs to me would be, instead of using the built-in replication, to code and use a custom replication that performs the additional alterations/transformations needed, still using (rather than going beneath) the other built-ins. With good coding, in many situations (especially if each master can push to its slaves), this could be nearly as efficient.
This requires efficient triggers on each source/master to detect any changes, which I believe CouchDB does offer (or at least PouchDB appears to); these would then copy the changes to another location while applying the full alterations.
If the source of the change is unable to push the change to the final destination, this staging store may need to be local to the source so that the destination can pull from it -- which could get pretty expensive, especially in multi-master setups, as each location has to store & maintain not only its own data but also the data (being sent) for everyone it sends to.
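As a rough illustration of that custom-replication idea (not CouchDB's own replicator), here is a minimal Python sketch that pulls the source's _changes feed over the plain HTTP API and writes altered copies to the target. The database URLs and the list of stripped fields are hypothetical, authentication is omitted, and for simplicity it lets the target keep its own revision history rather than preserving the source _rev (the trade-off discussed next):

```python
import requests

SOURCE = "http://localhost:5984/source_db"        # hypothetical source database
TARGET = "http://localhost:5984/target_db"        # hypothetical target database
STRIP_FIELDS = {"internal_notes", "cost_price"}   # hypothetical fields to remove

# One-shot read of the source changes feed (a live process would use feed=continuous
# and remember last_seq between runs).
changes = requests.get(f"{SOURCE}/_changes", params={"include_docs": "true"}).json()

for change in changes["results"]:
    doc = change.get("doc")
    if not doc or doc.get("_deleted"):
        continue
    # Strip unwanted fields and the source revision before writing to the target.
    cleaned = {k: v for k, v in doc.items() if k not in STRIP_FIELDS and k != "_rev"}
    # If the target already has the document, supply its current _rev so the update is accepted.
    existing = requests.get(f"{TARGET}/{doc['_id']}")
    if existing.status_code == 200:
        cleaned["_rev"] = existing.json()["_rev"]
    requests.put(f"{TARGET}/{doc['_id']}", json=cleaned)
```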
This replication would also place each source document's revision ID in the document's copy...
...ideally at least, and essential if the copy is itself to be updated (i.e. act as a master) too.
...in the form of either:
ideally the normal "_rev" property. Indeed this looks quite possible, as it ("preserve their revisions ID") is already done by the normal replication algorithm using the built-in "Bulk Docs API", which our variant would seemingly use too;
otherwise, a new copy object (with its own _rev) plus another field such as "_rev_original" recording the original rev. But would that work?
Clearly such a copy could be created, no problem.
Probably no big deal if the destination is just reading the data.
It seems hairier if the destination is also writing the data, as we'd then have to merge with these non-standard revisions. But doable.
Relevant to this (coding a custom/improved replication to provide this apparently-missing functionality, ideally without altering the PouchDB and especially the CouchDB source code), here is some starter material on the standard method: the normal CouchDB replication algorithm, which unfortunately doesn't clearly say it only uses built-in operations, though it looks like it does, and also the official overview of what it does; I suspect PouchDB implements this, likely in PouchDB's replicate.js (latest release as of 2014-07).
Further implementation particulars? Those who would know, please add them here.
This is a "community wiki" answer, so please extend it.
Also, please comment with links & details of anyone/any system already doing or trying to do this or something similar.

Can I query SAP BO WEBI via Excel VBA? Can I do it fast enough?

Following up on my previous post, I need to be able to query a database of 6M+ rows in the fastest way possible, so that this DB can be effectively used as a "remote" data source for a dynamic Excel report.
Like I said, normally I would store the data I need on a separate (perhaps hidden) worksheet and manipulate it through a second "control" sheet. This time, the size (i.e. number of rows) of my database prevents me from doing so (as you all know, Excel cannot handle much more than 1M rows).
The solution my IT guy put in place consists of holding the data in a txt file inside a network folder. So far, I have managed to query this file through ADO (slow, but no maintenance needed) or to use it as a source to populate an indexed Access table, which I can then query (faster, but requires more maintenance & additional software).
I feel both solutions, although viable, are sub-optimal. Plus, it seems to me that all of this is an unnecessary overcomplication. The txt file is actually an export from SAP BO, which the IT guy has access to through WEBI. Now, can't I just query the BO database through WEBI myself in a "dynamic" kind of way?
What I'm trying to say is, why can't I extract only bits of information at a time, on a need-to-know basis and directly from the primary source, instead of having all of the data transferred in bulk to a secondary/duplicate database?
Are these sorts of "dynamic" queries even possible? Or will the processing times hinder the success of my approach? I need this whole thing to feel really instantaneous, as if the data were already there and I'm not actually retrieving it every time.
And most of all, can I do this through VBA? Unfortunately that's the only thing I will have access to; I can't do this on the BO side.
I'd like to thank you guys in advance for whatever help you can grant me!
Webi (short for Web Intelligence) is a front-end analytical reporting application from Business Objects. Your IT contact apparently has created (or has access to) such a Webi document, which retrieves data through a universe (an abstraction layer) from a database.
One way that you could use the data retrieved by Web Intelligence as a source and dynamically request bits of it, instead of retrieving all the information in one go, is to use a feature called BI Web Service. This will make data from Webi available as a web service, which you can then retrieve from within Excel. You can even make this dynamic by adding prompts, which put restrictions on the data retrieved.
Have a look at this page for a quick overview (or Google Web Intelligence BI Web Service for other tutorials).
Another approach could be to use the SDK, though as you're trying to manipulate Web Intelligence, your only language options are .NET or Java, as the Rebean SDK (used to talk to Webi) is not available for COM (i.e. VBA/VBScript/…).
Note: if you're using BusinessObjects BI 4.x, remember that the Rebean SDK is actually deprecated and replaced by a REST SDK. This could make it possible to approach Webi using VBA after all.
That being said, I'm not quite sure if this is the best approach, as you're actually introducing several intermediate layers:
Database (holding the data you want to retrieve)
Universe (semantic abstraction layer)
Web Intelligence
A way to get data out of Webi (manual export, web service, SDK, …)
Excel
Depending on your license and what you're trying to achieve, Xcelsius or Design Studio (BusinessObjects BI 4.x) could also be a viable alternative to the Excel front-end, thereby eliminating layers 3 and 4 (and replacing layer 5). The former's back-end is actually heavily based on Excel (although there's no VBA support). Design Studio allows scripting in JavaScript.

Strategies for search across disparate data sources

I am building a tool that searches people based on a number of attributes. The values for these attributes are scattered across several systems.
As an example, dateOfBirth is stored in a SQL Server database as part of system ABC. That person's sales region assignment is stored in some horrible legacy database. Other attributes are stored in a system only accessible over an XML web service.
To make matters worse, the legacy database and the web service can be really slow.
What strategies and tips should I consider for implementing a search across all these systems?
Note: Although I posted an answer, I'm not confident it's a great answer. I don't intend to accept my own answer unless no one else gives better insight.
You could consider using an indexing mechanism to retrieve and locally index the data across all the systems, and then perform your searches against the index. Searches would be an awful lot faster and more reliable.
Of course, this just shifts the problem from one part of your system to another - now your indexing mechanism has to handle failures and heterogeneous systems, but that may be an easier problem to solve.
Another factor is how often the data changes. If you have to query data in real-time that goes stale very quickly, then indexing may not be practical.
If you can get away with a restrictive search, start by returning a list from the fastest data source based on the search criteria it supports. Then join those records up with the other systems and remove the records that don't match the remaining search criteria.
If you have to implement OR logic, this approach is not going to work.
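A hedged Python sketch of that narrow-then-enrich approach, with the source functions standing in for the real systems (all names, ids and criteria here are made up):

```python
# Hypothetical lookups: the fast source supports filtering, the slow ones are hit per candidate.
def search_fast_source(fast_criteria):
    """e.g. a SQL query against system ABC; returns candidate person ids."""
    return [101, 102, 103]

def fetch_slow_attributes(person_id):
    """e.g. the legacy database or the XML web service."""
    return {"salesRegion": "EMEA"} if person_id != 102 else {"salesRegion": "APAC"}

def search(fast_criteria, slow_criteria):
    results = []
    # 1. Narrow down candidates using only the fastest source.
    for person_id in search_fast_source(fast_criteria):
        # 2. Enrich each candidate from the slower systems.
        record = {"id": person_id, **fetch_slow_attributes(person_id)}
        # 3. Drop candidates that fail the remaining (AND-ed) criteria.
        if all(record.get(k) == v for k, v in slow_criteria.items()):
            results.append(record)
    return results

print(search({"name": "Smith"}, {"salesRegion": "EMEA"}))  # -> ids 101 and 103
```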
While not an actual answer, this might at least get you partway to a workable solution. We had a similar situation at a previous employer - lots of data sources, different ways of accessing those data sources, different access permissions, military/government/civilian sources, etc. We used Mule, which is built around the Enterprise Service Bus concept, to connect these data sources to our application.
My details are a bit sketchy, as I wasn't the actual implementor, just an integrator, but what we did was define a channel in Mule. You then write a simple integration piece to go between the channel and the data source, and between the application and the channel. The integration piece does the work of making the actual query and formatting the results, so we had a generic SQL integration piece for accessing a database, and for things like web services we had some base classes that implemented common functionality, so the actual customization of the integration pieces was a lot less work than it sounds.
The application could then query the channel, which would handle accessing the various data sources, transforming their results into a normalized bit of XML, and returning the results to the application.
This had a lot of advantages for our situation. We could include new data sources for existing queries by simply connecting them to the channel - the application didn't have to know or care what data sources were there, as it only looked at the data from the channel. Since data can be pushed or pulled through the channel, we could have a data source notify the application when, for example, it was updated.
It took a while to get it configured and working, but once we got it going, we were pretty successful with it. In our demo setup, we ended up with 4 or 5 applications acting as both producers and consumers of data, and connecting to maybe 10 data sources.
Have you thought of moving the data into a separate structure?
For example, Lucene stores the data to be searched in a schema-less inverted index. You could have a separate program that retrieves data from all your different sources and puts it in a Lucene index. Your search could then work against this index, and the search results could contain a unique identifier and the system the record came from.
http://lucene.apache.org/java/docs/
(There are implementations in other languages as well)
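To make the idea concrete, here is a toy Python sketch of a schema-less inverted index keyed back to (system, record id) pairs, the sort of structure Lucene builds and maintains for you (the sample records are made up):

```python
from collections import defaultdict

# Toy inverted index: token -> set of (source_system, record_id) pairs.
index = defaultdict(set)

def add_record(system, record_id, fields):
    """Index every word of every field value, regardless of schema."""
    for value in fields.values():
        for token in str(value).lower().split():
            index[token].add((system, record_id))

def search(term):
    return index.get(term.lower(), set())

# Made-up records from different systems.
add_record("ABC", 101, {"name": "Jane Smith", "dateOfBirth": "1980-01-01"})
add_record("legacy", "X-42", {"name": "Jane Smith", "salesRegion": "North"})

print(search("smith"))  # -> {('ABC', 101), ('legacy', 'X-42')}
```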
Have you taken a look at YQL? It may not be the perfect solution, but it might give you a starting point to work from.
Well, for starters I'd parallelize the queries to the different systems. That way we can minimize the query time.
You might also want to think about caching and aggregating the search attributes for subsequent queries in order to speed things up.
You have the option of creating an aggregation service or middleware that aggregates all the different systems, so that you can provide a single interface for querying. If you do that, this is where I'd apply the previously mentioned caching and parallelization optimizations.
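For the parallelization point, here's a minimal Python sketch using concurrent.futures; the three query functions are hypothetical stand-ins for the real systems, with sleeps simulating their latency:

```python
import time
from concurrent.futures import ThreadPoolExecutor

# Hypothetical per-system lookups; each simulates a slow call and returns some attributes.
def query_sql_server(person_id):
    time.sleep(0.5)   # stand-in for the real SQL Server query
    return {"dateOfBirth": "1980-01-01"}

def query_legacy_db(person_id):
    time.sleep(2.0)   # the slow legacy database
    return {"salesRegion": "EMEA"}

def query_xml_service(person_id):
    time.sleep(1.0)   # the XML web service
    return {"department": "Sales"}

def search_person(person_id):
    sources = (query_sql_server, query_legacy_db, query_xml_service)
    # Run all lookups concurrently: total latency is roughly the slowest source,
    # not the sum of all of them.
    with ThreadPoolExecutor(max_workers=len(sources)) as pool:
        futures = [pool.submit(fn, person_id) for fn in sources]
        merged = {}
        for future in futures:
            merged.update(future.result())
    return merged

print(search_person(42))
```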
However, with all of that you will need to weigh up the development time/deployment time/long-term benefits of the effort against migrating the old legacy database to a faster, more modern one. You haven't said how tied into other systems those databases are, so that may not be a very viable option in the short term.
EDIT: in response to data going out of date -- you can consider caching if you don't need the data to always match the database in real time. Also, if some data doesn't change very often (e.g. dates of birth), then you should cache it. If you employ caching, you could make your system configurable as to which tables/columns to include or exclude from the cache, and you could give each table/column a configurable cache timeout with an overall default.
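And a small, hedged sketch of that configurable cache idea in Python, with per-attribute timeouts and an overall default (the attribute names and TTL values are illustrative only):

```python
import time

class AttributeCache:
    """Tiny TTL cache: per-attribute timeouts with an overall default."""

    def __init__(self, default_ttl=300, ttl_overrides=None):
        self.default_ttl = default_ttl
        self.ttl_overrides = ttl_overrides or {}  # e.g. {"dateOfBirth": 86400}
        self._store = {}                          # (person_id, attr) -> (value, expires_at)

    def get(self, person_id, attr, loader):
        key = (person_id, attr)
        value, expires_at = self._store.get(key, (None, 0.0))
        if time.time() < expires_at:
            return value                          # still fresh, skip the slow source
        value = loader(person_id)                 # fall through to the slow system
        ttl = self.ttl_overrides.get(attr, self.default_ttl)
        self._store[key] = (value, time.time() + ttl)
        return value

# Dates of birth rarely change, so cache them for a day; everything else for 5 minutes.
cache = AttributeCache(default_ttl=300, ttl_overrides={"dateOfBirth": 86400})
dob = cache.get(42, "dateOfBirth", lambda pid: "1980-01-01")  # loader is a stand-in
```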
Use Pentaho/Kettle to copy all of the data fields that you can search on and display into a local MySQL database
http://www.pentaho.com/products/data_integration/
Create a batch script to run nightly and update your local copy. Maybe even every hour. Then, write your query against your local MySQL database and display the results.
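If you end up scripting the refresh yourself rather than building it entirely in Kettle, the nightly update could look roughly like this Python sketch (it assumes the pymysql package, and the table, columns, connection details and extract function are all hypothetical):

```python
import pymysql

def extract_search_fields():
    """Hypothetical extract from the source systems; returns (id, name, date_of_birth) rows."""
    return [(101, "Jane Smith", "1980-01-01"), (102, "John Doe", "1975-06-30")]

conn = pymysql.connect(host="localhost", user="etl", password="secret", database="search_copy")
try:
    with conn.cursor() as cur:
        # Upsert so the nightly (or hourly) run refreshes existing rows and adds new ones.
        cur.executemany(
            "INSERT INTO people (id, name, date_of_birth) VALUES (%s, %s, %s) "
            "ON DUPLICATE KEY UPDATE name = VALUES(name), date_of_birth = VALUES(date_of_birth)",
            extract_search_fields(),
        )
    conn.commit()
finally:
    conn.close()
```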
