Integrating with 500+ applications - excel

Our customers use 500+ applications, and we would like to integrate these applications with ours. What is the best way to do that? They are time registration applications, and what most of them have in common is that they can export to CSV or similar; some of them are actually home-brewed Excel sheets where time is registered.
The best idea so far is to create our own Excel sheet, which can be used to integrate with all these applications. The integrations could take the form of cells containing something like ='[c:\export.csv]rawdata'!$A$3, where export.csv is the CSV file exported from the time registration application. Can you see a better way to integrate with all these applications? It should be mentioned that almost all our customers have Microsoft Office.
Edit: Answers to the excellent questions from Pontus Gagge:
How similar are the data in the different applications?
I assume that since they are time registration applications, they will have some similarities, but some will register only the total time worked over a whole month, while others will specify it for each day. If Excel is chosen, I believe that many of the differences could be ironed out using basic formulas.
What quality is the data?
The quality of the data can vary, so basic validation must be undertaken. It also helps to make it transparent to the customers how our application understands their input, so that they remain responsible for it.
How large amounts of data are you talking about?
There will be information about the time worked for up to 50 employees.
Is the integration one-way only?
Yes
With what frequency should information be transferred?
Once per month (when they need to pay salaries).
How often do the applications themselves change, and how often does your product change?
If their application is a home-brewed Excel sheet, then I assume it will change about once a year (for example, due to someone's mistake). If it is a proper, standard time registration application, then I do not believe they are updated more often than every fifth year or so, as it is a very stable concept.
Should the integration be fully automatic or can your end users trigger a data transfer?
They can certainly trigger the data transfer. The users are often dedicated to the process, so they can be trained to do it, which means that they could make up to, say, 30 mouse clicks to run the integration each month.
Will the customers have somebody to monitor the integrations?
As we have many customers, most of them should be able to undertake the integration themselves, though we will be able to assist them over the telephone. We cannot undertake the integration ourselves, because we would then be responsible for any errors due to user mistakes, etc.
Does the phrase 'integration spaghetti' mean anything to you...?
I am looking for ideas from the best chefs to cook a nice large portion of that.

You need to come up with a common data format, and a way to translate the individual data formats to the common format. There's really no way around this - any solution you come up with will have to do this in one way or the other. It's the essential complexity of what you're doing.
The bigger issue is actually variances within the source data, in terms of how things like dates are stored, missing columns, etc. Doing a generic conversion for CSV to move columns around is comparatively easy.
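For the CSV case, the translation layer can stay small: one column mapping per source format, all feeding one common format. A minimal Python sketch, where the column names and file paths are invented:

    import csv

    # Hypothetical per-source mapping: source column name -> common column name.
    # One small dict like this per source format is the irreducible work.
    COLUMN_MAP = {"Employee name": "employee", "Hrs": "hours", "Work date": "date"}

    def to_common_format(src_path, dst_path, column_map):
        """Translate one source CSV into the common format, keeping only mapped columns."""
        with open(src_path, newline="") as src, open(dst_path, "w", newline="") as dst:
            reader = csv.DictReader(src)
            writer = csv.DictWriter(dst, fieldnames=list(column_map.values()))
            writer.writeheader()
            for row in reader:
                writer.writerow({common: row[source] for source, common in column_map.items()})

    to_common_format("export.csv", "common.csv", COLUMN_MAP)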

I would also look at CSV and then use an OLEDB connection against the CSV file for importing.
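To sketch the idea from Python you would reach for the closely related ODBC text driver via pyodbc (OLEDB proper is what you would use from VBA or .NET); the driver name and folder here are assumptions about the target machine:

    import pyodbc

    # Assumes the Microsoft Access Text Driver is installed. DBQ names the
    # *folder* holding the CSV; the file name itself acts as the table name.
    conn = pyodbc.connect(
        r"Driver={Microsoft Access Text Driver (*.txt, *.csv)};DBQ=C:\exports;",
        autocommit=True,
    )
    for row in conn.execute("SELECT * FROM [export.csv]"):
        print(row)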

If you try to make something that can interface to any data structure in the universe (and 500 is plenty close enough), it is guaranteed to be a maintenance nightmare. Instead I would approach this from multiple angles:
Devise an interface into which a human can enter this data already in the proper format. With 500+ clients, I'd make this a small, raw but functional browser-based site that users can use to enter this information manually. This is the fall-back: at the end of the day, a human can re-key the information into the site and solve the import issue. Ideally, everyone would use this instead of their own format. Data entry people are cheap.
Similar to the above, but expanded: I would develop a standard application, or standardize on an off-the-shelf application, that can be used to replace their existing formats. This might take more time than #1. The goal would be to do only one-time imports of these varying data schemas into the application and be done with them for good.
The nice thing about spreadsheets is that you can do anything anywhere. The bad thing about spreadsheets is that you can do anything anywhere. With CSV or a spreadsheet there is simply no way to enforce data integrity and thus consistency (which is the primary goal) on the data. If the source data is already in a database, then that is obviously simpler.
I would be inclined to use a database format into which each of these files needs to be converted, rather than a spreadsheet (e.g. something like Jet (MDB)). If you have non-Windows users, that will make it harder and you might have to use a spreadsheet. The problem is that it is too easy for the user to change their source structure, break their upload, and come crying to you. If a given end user has a resident expert, that expert can find a way of importing the data into the database format. If you are that expert, then I would, on a case-by-case basis, write something that imports into the database format. XML would be the other choice, but it will likely take more coding than an import/export into a database format.
Standardization of the apps (even having all the sources in a database format instead of a spreadsheet would help) and control over the data schema are the ultimate goal, rather than permitting a gazillion formats. There really is no nice answer other than standardization; otherwise you end up writing a converter for every Tom-Dick-and-Harry format, and writing it again whenever someone changes the source format.

With a multitude of data sources, mapping each one correctly to an intermediate format is not trivial. Regular expressions are good with a finite set of known data formats. A multipass approach can help when data is ambiguous without context (say, month and day fields across several days of data; see the sketch below), and it also helps defeat data entry errors. And since this data is connected to salaries, the transfer needs to be reliably correct.
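A minimal sketch of that multipass idea for one ambiguous date column (formats and sample values are invented; if every day in the batch is 12 or less, the ambiguity genuinely cannot be resolved from the data alone):

    from datetime import datetime

    # Two passes over an ambiguous date column: pass 1 scans everything to
    # settle day-first vs month-first, pass 2 parses with the winner.
    def parse_dates(raw_dates):
        day_first, month_first = "%d/%m/%Y", "%m/%d/%Y"
        fmt = month_first
        for value in raw_dates:
            try:
                datetime.strptime(value, month_first)
            except ValueError:     # e.g. "14/01/2009": month 14 is impossible
                fmt = day_first
                break
        return [datetime.strptime(value, fmt) for value in raw_dates]

    print(parse_dates(["03/01/2009", "14/01/2009"]))  # the 14 forces day-first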
An import configuring trick
Get the customer to make a set of training data in the application. It should have a "predefined unique date", and each subsequent data field should contain the number of the corresponding target field in your application. On import, your application needs to recognise the predefined date, determine the unique translation required, display and save this "mapping key", and stop the import. E.g. if you expect "Duration hours" in field two, get the user to enter 2 in their relevant field, which might be called "Attendance hours".
On subsequent runs, with the mapping definition key in hand, import becomes a fairly easy process of translation.
Note on terms
"predefined date" - must be historical, say founding date of your company?, might need to be in PC clock settable range.
"mapping key" - could be string of hex digits and nybble based so tractable to workout
The entered code can be extended to signify required conversions, e.g. the customer's application records durations in days while your application expects hours.
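A minimal sketch of how an import could recognise the training row and derive the mapping key, assuming CSV exports; the magic date, target field numbers, and file name are all invented:

    import csv

    MAGIC_DATE = "1906-12-09"  # the predefined, clearly historical date
    TARGET_FIELDS = {1: "employee", 2: "duration_hours", 3: "project"}

    def derive_mapping(csv_path):
        """Find the training row; return {source column index: target field name}."""
        with open(csv_path, newline="") as f:
            for row in csv.reader(f):
                if MAGIC_DATE in row:
                    return {i: TARGET_FIELDS[int(cell)]
                            for i, cell in enumerate(row)
                            if cell != MAGIC_DATE and cell.strip().isdigit()}
        raise ValueError("training row with the predefined date not found")

    # Saved once as the customer's "mapping key", then reused on every import.
    mapping = derive_mapping("training_export.csv")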
Interfacing with Windows programs (in order of increasing fragility)
Ye Olde saving as CSV file
Print to an operating system printer that is set up as a text file/PDF, then scavenge the data out of that
Extract data via the application's interface control, typically ActiveX for several Windows programs, e.g. Matlab's Spreadsheet Link
Read the native file format, e.g. xls, as Matlab's xlsread does
Add an additional intermediate spreadsheet that has extended cell references, e.g. ='[filename]rawdata'!$A$3
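As a rough Python analogue of the "read the native file format" option (pandas here is my assumption, as are the file and sheet names):

    import pandas as pd

    # Pull the raw sheet straight out of the workbook, no Excel needed.
    raw = pd.read_excel(r"c:\export.xls", sheet_name="rawdata", header=None)
    print(raw.iloc[2, 0])  # row 3, column A: the cell the ='[filename]rawdata'!$A$3 link would read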

Have a look at Teiid by JBoss: http://jboss.org/teiid
Also consider using SOA - e.g., if you're on Java, try JBoss SOA platform: http://www.jboss.com/resources/soa/?intcmp=1004

Use a simple XML format. A non-technical person can easily understand a simple XML format (and could even identify basic problems with XML documents that are not well-formed).
Maybe use a DTD (or, even better, an XML schema) to do very basic validation, and then supplement this with an XSL stylesheet to do more validation with better error reporting. (An XSL stylesheet simply converts from XML to something else, and so can generate readable error messages.)
The advantage of this approach is that web browsers such as Internet Explorer can apply the XSL stylesheets. A customer need only spend at most a day enhancing their applications or writing Excel macros to generate the XML data in the format that you specify.
Recent versions of Excel have support for converting spreadsheet data to XML, and can even validate against schemas.
Once the data passes the XSL validation checks, you have validated XML data.
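The same validation step can also run outside the browser; a minimal sketch with Python's lxml against an XML schema (file names are invented):

    from lxml import etree

    # Validate a customer's submission against the published schema and
    # report line-numbered errors they can act on themselves.
    schema = etree.XMLSchema(etree.parse("timesheet.xsd"))
    doc = etree.parse("timesheet.xml")

    if schema.validate(doc):
        print("OK: validated XML data")
    else:
        for error in schema.error_log:
            print("line %d: %s" % (error.line, error.message))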

If you have heaps of data and heaps of money, you could look at existing data management and cleansing tools:
http://www-01.ibm.com/software/data/infosphere/datastage
http://www-01.ibm.com/software/data/infosphere/qualitystage
But even then, you'll likely need to follow kyoryu's suggestion, assuming you have 500+ data formats. The problem isn't on your side: you need them to standardize their output formats if you have no control over their apps. CSV is likely the easiest. You could even send them an Excel template to help them along.

Related

Excel Data Validation not processing recent cell-data from smartphone input

I have recently observed an issue with data in a column that I use to perform data validation on my spreadsheet.
There is nothing wrong with the formula, nor is there anything wrong with the use of data validation.
It should be looking for duplicate entries, which works quite fine.
The issue is that it no longer recognizes input made from a smartphone using the Excel app.
What I did was retype the cell text from my PC, and it worked perfectly.
Is there a way that I can continue using this technique (Data validation) without having to re-enter data from a PC in order for it to process?
Certainly! Yes, that is possible.
But... with all the possibilities in today's world, is your current strategy the one that is the best for you?
That is something I cannot answer for you.
That is something I cannot enumerate for you.
But... There is something that I can introduce to you.
PowerQuery
PowerQuery was a free add-on for Excel 2010 and 2013, and it has been baked directly into Excel for more than half a decade. So if you're using the mobile app, then you probably have a modern version of Excel with PowerQuery right at your fingertips.
Your first step is to determine how you want to make your data available for Excel to get. Go to the Data tab on the ribbon and review your options in the "Get External Data" group.
It doesn't matter if free data is your Creed and your most intimate moments are publicly available through your raw data feed. Or if paranoia is the reason why you constantly drive around the block scraping SSIDs before squirreling them away to SQL server for detailed analysis. Or if you're using a USB cable to transfer photos to your PC because your mom walked in on you without knocking and was so disgusted by what she saw on your desktop that you're banned from the family LAN... For life. None of that matters because Excel can connect to your data in so many ways that one of them will be perfect for you.
There is a sense of familiarity when importing your data into PowerQuery. It's not unlike following those timeless MS wizards, but nothing is quite like the uncanny sensation of being dropped into the PowerQuery editor. It is simultaneously the same as Excel and different from Excel, and it may be the closest you ever come to visiting a parallel universe. Many of the same tools are available, but they behave just slightly differently. And in some cases, like the Text To Columns tool, it is light years ahead of Excel, and you will find yourself cursing at MS for not using it as a replacement for the old tool.
When you're done transforming your data, you'll have a tight, clean table. But the real prize is that you have a fully automated pipe from source to product.
I figured out that the phone user had included extra spaces when inputting the data.
So I used the TRIM() function, which takes care of the extra spaces between, before, or after each word, and that did the job.
Therefore the major error was that there were additional spaces that were not being recognized in the tested data.

Automating Raw Export Data Cleansing for Client Onboarding - Format is Always Different

So, a bit of a general question. I work as a data analyst for a startup. My primary process involves taking the existing customer data a client has and cleansing/normalizing it to fit into our platform, as a one-time step in our onboarding process. A member of our team exports their data from the system they are transitioning from or, if they kept track of it in house, we receive the Excel log they used to track it. It is always in a different format and requires extensive cleansing (avg 1 min/record). We take what is usually one large table (.xlsx format) and, after cleansing, split it into four .csv files, which we load as four tables on our platform.
I feel I have optimized the process quite well in terms of the process steps and cleansing with Excel functions (IF, CONCAT, text-to-columns, etc.). I have beginner-to-intermediate skills in VBA and SQL and have just scratched the surface in R. What is frustrating is that I know there is potential to automate this process, but I just don't know where to start. If anyone has experience with something like this, then code, a link to an article or another thread, or just some general direction would be much appreciated. Please ask for clarification where you feel it is needed. Thanks.
This will be really hard to do in Excel. If you have the time, you can try out Optimus, a data cleansing library written in Python and PySpark (you don't need to know Spark). Here is the webpage: https://hioptimus.com.
You can create data pipelines with it, and I recommend that you do: try to generalize your processes, and ask the client for a more structured way of passing the data.
The good thing is that you don't need Big Data to run Optimus, but if you have it some day, the same code will work.
Check out the documentation for more:
http://optimus-ironmussa.readthedocs.io/en/latest/
Let me know if you have doubts!
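To make the "generalize your processes" advice concrete, here is a minimal sketch in plain pandas rather than Optimus itself; the column names and the table split are invented:

    import pandas as pd

    def clean(df):
        df = df.rename(columns=str.strip).drop_duplicates()
        df["name"] = df["name"].str.strip().str.title()           # normalise text
        df["date"] = pd.to_datetime(df["date"], errors="coerce")  # normalise dates
        return df

    tidy = clean(pd.read_excel("client_log.xlsx"))

    # Split the one big table into per-platform tables (two of the four shown),
    # one CSV per table, ready to load.
    for table, cols in {"contacts": ["name"], "events": ["date"]}.items():
        tidy[cols].to_csv(table + ".csv", index=False)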

Data extraction from mainframe to excel

How do I extract data from a mainframe into Excel? Currently I am fetching data from MS Access, but the requirement is for the mainframe.
Thanks in advance
First, please understand that saying "extract data from mainframe" is similar to saying "extract data from Intel." The following is not comprehensive but is intended to provide an idea of how to ask your question in a manner which can be meaningfully answered.
Please understand there is a big difference between...
what is technically possible
what is allowed in your shop
what is likely to provide a robust and maintainable solution given your requirements
These are three very different things. Some of us answering questions here on Stack Overflow have life experiences that make us reticent about answering questions regarding what is technically possible absent any mention of what is allowed in your shop or what the actual business requirement is that is being solved.
Mainframes have been around for over half a century, and many shops have standard solutions to technical problems. Sometimes the solution is "don't do that, and here's what we do instead." Working against the recommendations of your technical staff, or your shop standards, is career limiting.
What operating system?
z/OS is in common use on mainframes, but there do exist shops that still run one of its ancestors like MVS/XA. The mainframe operating system traces its roots back to OS/360 first available in 1965.
z/TPF
z/Linux usually runs on top of the z/VM hypervisor.
z/VSE
In what sort of file does the data reside?
QSAM or Queued Sequential Access Method, also commonly called flat files.
VSAM or Virtual Storage Access Method. There are several different kinds of VSAM files, including KSDS (Key Sequenced Data Set), ESDS (Entry Sequenced Data Set), RRDS (Relative Record Data Set), and Linear (conceptually similar to a memory mapped file).
a DBMS like DB2 or IMS. A DBMS typically has extract facilities to allow writing a flat file from its own internal format. DB2, for example, stores data in Linear VSAM datasets.
Unix System Services files reside in a different file system than QSAM or VSAM. This will be more familiar, as it has a directory structure where the classic z/OS file system has none.
What does the data look like?
You must know the record layout of the data you wish to retrieve.
It is common for mainframe data to include both text and binary data in a single record, for example a name and a currency amount:
Hopper Grace ar%
...which would be...
x'C8969797859940404040C799818385404040404081996C'
...in hex. This is code page 37, commonly referred to as EBCDIC.
Without knowing that the family name is confined to the first 10 bytes, the given name to the subsequent 10 bytes, and the currency amount stored in packed decimal (also known as binary coded decimal) in the last 3 bytes, you cannot accurately transfer the data, because code page conversion will destroy the currency amount, which is +819.96. Converting to code page 1250, commonly in use on Microsoft Windows, you would end up with...
x'486F707065722020202047726163652020202020617225'
...where the text data is translated but the packed data is destroyed. The packed data no longer has a valid sign in the last nibble (the lower half of the last byte) and the amount itself has been changed.
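To make the layout concrete, here is a small Python sketch that splits the record before any code page conversion, decoding the text fields with the cp037 codec and unpacking the packed decimal by hand:

    # The 23-byte record from the example above.
    record = bytes.fromhex("C8969797859940404040C799818385404040404081996C")

    family = record[0:10].decode("cp037").rstrip()   # EBCDIC text -> "Hopper"
    given  = record[10:20].decode("cp037").rstrip()  # EBCDIC text -> "Grace"

    packed = record[20:23].hex().upper()             # "81996C"
    sign   = -1 if packed[-1] in ("B", "D") else 1   # last nibble carries the sign
    amount = sign * int(packed[:-1]) / 100           # two implied decimal places

    print(family, given, amount)                     # Hopper Grace 819.96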
Security
Is the data you wish to access covered by privacy legislation? You may have to provide some evidence that whatever protections are in place to guarantee that only authorized personnel have access to this data on the mainframe are also in place once you have transferred it off of the mainframe. Such guarantees may have to satisfy an auditor.
What you need
You need to know what operating system holds your data, you need to know what type of file holds your data (a DBMS isn't a type of file but let's let that go for now), and you need to know your record layout(s).
Typically, the easy way to retrieve data is to extract it from its existing data store (QSAM, VSAM, DBMS) into a flat file where all the data is in a text format. There are mainframe utilities to accomplish this. In extreme cases, a program can be written to accomplish this goal. Once it has been accomplished, you can transfer your data without fear of destroying packed or binary data.
You may be able to read data directly from a DBMS if that's where your data resides, but this may depend on shop standards, including security.
Modern mainframes can transfer data via FTP, FTPS, and SFTP. Which is recommended in your shop is something to ask your technical staff.

alternative spreadsheet for real time data

I am looking for an alternative spreadsheet to Excel, preferably but not necessarily open source, that allows a programmer to create a plugin that can update cells in the sheet from an external data source in real time. The spreadsheet would then internally compute all dependent calculation chains upon change of value.
This is similar functionality to what the RTD method does with Microsoft Excel. The rate of external data change could be moderate to high (whatever such relativistic terms mean).
Also the reverse process would be useful, i.e. detecting a change in cells and then sending that information to a plugin that can communicate with external processes.
Any recommendations or experience in trying this?
I am afraid you will not find one. The main consumers of real-time spreadsheets (grids) are big banks, and they usually invest in their own solutions (because they can afford to, and they have traditionally seen it as their advantage over the competition). Some of the solutions are very dated, but still going strong! Three years ago I worked on a system that was written in C++ (with TibcoRv as a backbone) and was already five years old. It is still alive and kicking.
One of the strong points of a bespoke grid is "Excel-like formulae", where a user can use a field from the provided data dictionary. So rather than referencing cells, you reference data from your systems. It makes formulae easier to implement and read. And of course you can export or share them; users really like that.
The following could be of some help:
http://www.dadisp.com
http://www.quantrix.com
http://www.resolversystems.com/products/
http://pyspread.sourceforge.net/
http://matrex.sourceforge.net/
This may not exactly satisfy your real time requirement but worth exploring.

Can an excel worksheet be used as UDF?

I'm building a network business model in Excel. A similar model is that of Gawker Media.
In my model I have a number of properties that have some overlap of audience. Each property attracts users, which in turn affords cross-promotional opportunities. In the case of Gawker, they have a series of blogs whose audiences will likely read several of the blogs in their network.
If Gawker launched a new blog, they would be able to direct traffic to it from their blog network.
Creating a model for a single blog is fairly simple - although the initial assumptions are harder. The next step is to model the network effect.
Excel provides a Scenario Manager that allows me to vary the key assumptions in the basic model. This is almost perfect: I can model the launch of 10 properties, each with different launch assumptions, and see the summary.
Where I need help is figuring out how I can vary the initial number of users for the launch of each property. In other words, once the network is established, it's possible to drive people to any new property launched on the network.
I don't believe the scenario manager will do what I need.
So, I'm wondering if it's possible to use the model worksheet as a UDF? The UDF would need to spit out the monthly revenue and unique users given a number of input assumptions.
I would then be able to create my own summary sheet for the 10 properties and using the total uniques for each property get a summary for the network. This network summary would be used to determine how many people could be driven to the launch of a new property.
In effect, the only difference to the scenario manager is that I need one of my input variables (initial users) to be programmatically generated as a function of the number of people in the network at the time of launch.
I'm hoping it's possible to achieve something along these lines in Excel. I could drop down and create the whole model in Java, but then it's much harder to share with business colleagues!
Thanks - Matt.
You could try Data Table.
It only allows you to analyse the effect of varying 2 input parameters, but you can create several data tables, and each parameter can take hundreds of different values.
It's little known, but efficient, and it has been available since Excel 3.0.
There is a product that I have researched but never used - search for calc4web. It takes a sheet of formulas and generates code (C++) that can be compiled into an XLL add-in. Then you can call a function that does what your sheet does. But of course then you have an XLL to distribute, and a build step every time you change your logic, which defeats much of the point of using a spreadsheet.
In my case, I wound up writing some very simple VBA code to vary my sheet "inputs" using the scenario manager, and capture my "outputs". This works if you have a batch of inputs that you can just point your macro at and step through.
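The same vary-the-inputs, capture-the-outputs loop can also be driven from outside Excel, for instance from Python with xlwings; the workbook name, cell addresses, and input batch below are all invented:

    import xlwings as xw

    wb = xw.Book("model.xlsx")
    sheet = wb.sheets["Model"]

    results = []
    for initial_users in [1000, 5000, 10000]:
        sheet.range("B1").value = initial_users   # write one input assumption
        results.append(sheet.range("B20").value)  # Excel recalculates; read the output

    print(results)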
EDIT:
See here for a VBA-only example of doing this:
using a sheet in an excel user defined function
