OLAP cube powering Excel Pivot. What's a better solution? - excel

I'm looking to build a dynamic data environment for non-technical marketers.
I want to provide large sets of data in an Excel pivot table form so even marketers without analytics/technical backgrounds can access relevant performance information. I'm trying for avoid non-excel front ends since I don't want users to have to constantly export data when they need to manipulate it in some way.
My first thought was to just throw together an OLAP cube populated with pre-aggregated data, but I got pushback from the IT team as OLAP is "obsolete." I don't disagree with them - there are definitely faster data processing architectures out there.
So my question is this: are there any other ways to structure the data so that marketers can access it easily but still manipulate it to some degree in Excel? I'm working with probably 50-100m rows of data and need the ability to scale dimensionality.

This is just my thoughts.
Really the question could be thrown back at your IT team. Your first thought was to throw together an OLAP cube. IT didn't like this. If they're so achingly hip that they consider OLAP "obsolete", what do they suggest as a better, more up-to-date alternative?
Or, to put it a different way - what is the substance of their objection to an OLAP solution? (I'm assuming there is one beyond "MS gave us an awesome presentation of PowerPivot/Azure tabular, with really great free snacks and coffee").
Your requirements are pretty clear:
Easy access for non-technical people
Structured data so that they don't have to interpret the raw data
Access through Excel
Scalability
I'll be paying close attention to any other answers to your question, because I'm always interested in finding out that I don't know something; but personally I haven't come across a better solution to these requirements than OLAP.
What makes me suspicious of the "post-OLAP" sentiment is related to point (2) in the list above. Non-technical users can tend to think of the cube data they consume as being somehow effortlessly produced, by some kind of magic. That in itself is an indicator of success, demonstrating just how easy it is for users to get what they want from a well-designed OLAP system.
But this effortlessness is an illusion: to structure the raw data into this form takes design effort, and the resulting structure incorporates design decisions and assertions: that is how it can be easy to use, because the hard stuff has been encapsulated in the cube design.
I have a definite Han Solo-like bad feeling about "post-OLAP": that it amounts to pandering to this illusion of effortless transformation of data into a usable form, and propagates further illusions.
Under OLAP, users get their wonderful magic usable data structure, and the hard work is done out of sight by developers like you or me. Perhaps we get something wrong so that they can't see data exactly as they'd like to - but at least the users can then talk to us and ask for what they do want.
My impression of the "post-OLAP" sales pitch is that it tries to dispense with the design work. We don't need those pesky expensive developers, we don't need to make specific design decisions (which necessarily enable some functionality while precluding some other functionality), we don't need cube-processing time-lags. We can somehow deliver this:
Input any data you like. Don't worry if it's completely unstructured or full of dirt!
Any scale
Immediate access to analytics without ETL/processing delays
Somehow, the output is usable, structured data. Structured by... no-one in particular. The user can structure it as they like, but somehow this will be easy
Call me cynical, but this sounds like magical thinking to me.

Related

Automating Raw Export Data Cleansing for Client Onboarding - Format is Always Different

So a bit of a general question. I work as a data analyst for a startup. My primary process involves taking existing customer data a client has and cleansing/normalizing it to fit into our platform once as part of our onboarding process. A member of our team exports their data from their system they are transitioning from or, if they kept track of it in house, we receive their Excel log they used to track it. It is always in a different format and requires extensive cleansing (avg 1 min/record). We take what is usually one large table (.xlxs format), and after cleansing, split it into four .csv files; which we load as four tables on our platform.
I feel I have optimized the process quite well in terms of the process steps and cleansing with excel functions (if, concat, text-to-columns, etc). I have beginner-intermediate skills in VBA and SQL and have just scratched the surface in R; what is frustrating is that I know there is the potential to automate this process but I just don't know where to start. If anyone has experience with something like this, code, a link to an article / another thread, or just some general direction would be much appreciated. Please ask for clarification where you feel it is needed. Thanks.
This will be really hard to do in Excel. If you have the time you can try out Optimus, a Data Cleansing library written in Python and Pyspark (you don't need to know spark). Here is the webpage https://hioptimus.com.
You can create Data Pipelines with it, and I recommend that you do that, try to generalize your processes, and asking the client for more a structure way of passing the data.
The good thing is that you don't need Big Data for running Optimus, bit if you have it some day, the same code will work.
Check out the documentation for more:
http://optimus-ironmussa.readthedocs.io/en/latest/
Let me know if you have doubts!

Converting data into information:Where to start?

We (my company) runs a website which have lots of data recorded like user registration, visits, clicks, what the stuff they post etc etc but so far we don't have a tool to find out how to monitor entire thing or how to find patterns in it so that we can understand what kind of information we can get from it? So that Mgmt can take decisions based on it. In short, the people do at Amazon or Google based on data they retrieve, we want a similar thing.
Now, after the intro, I would like to know what technology could it be called;is it Data Mining,Machine Learning or what? Where should we start to convert meaningless data into useful Information?
I think what you need enters in the "realm" of: parsing data, creating graphs, showing statistics about some elements, etc.
There is no "easy" answer, I can only answer parts of your question.
There are no premade magical analytical tools, big companies have their own backend tools tunned to parse the large amounts of data and spit out data summaries that are then used to build graphs or for statistical analysis.
I think the domain you are searching for is statistical data analysis. But there are many parts that go together here.
Best advice I can give you is to set up specific goals for you analysis and then try to see what is the best solution, you question is too open.
ie. if you are interested in visits/clicks/website related statistics Google Analytics is a great tool, and very easy to use.

Integrating with 500+ applications

Our customers use 500+ applications and we would like to integrate these applications with our. What is the best way to do that? These applications are time registration applications and common for most of them is that they can export to csv or similar, some of them are actually home-brewed excel sheets where time is registered.
The best idea so far is to create our own excel sheet, which can be used to integrate with all these applications. The integrations could be in the form of cells containing something like ='[c:\export.csv]rawdata'!$A$3 Where export.csv is the csv file exported from the time registration applications. Can you see a better way to integrate against all these applications? It should be mentioned that almost all our customers have Microsoft Office.
Edit: Answers to the excellent questions from Pontus Gagge:
How similar are the data in the different applications?
I assume that since they time registration applications, they will have some similarities, but I assume that some will register the how long time one has worked in total for a whole month, while others will spesify for each day. If Excel is chosen, I believe that many of the differences could be ironed out using basic formulas.
What quality is the data?
The quality of the data can vary so basic validation must be undertaken, a good way is also to make it transparent for the customers, how our application understands their input, so they are responsible.
How large amounts of data are you talking about?
There will be information about the time worked for up to 50 employees.
Is the integration one-way only?
Yes
With what frequency should information be transferred?
Once per month (when they need to pay salaries).
How often do the applications themselves change, and how often does your product change?
If their application is a home-brewed Excel sheet, then I assume it will change once a year (due for example a mistake someone). If it is a standard proper time registration application, then I do not believe they are updated more often than every fifth year or so, as it is a very stabile concept.
Should the integration be fully automatic or can your end users trigger a data transfer?
They can surely trigger data transfer. The users are often dedicated to the process so they can be trained at doing it, which means that they could make up to, say 30, mouse clicks in order to integrate each month.
Will the customers have somebody to monitor the integrations?
As we have many customers, many of them should be able to undertake the integration themselves. We will though be able to assist them over the telephone. We cannot, though undertake the integration ourselves because we would then be responsible for any errors due to user mistakes, etc.
Does the phrase 'integration spaghetti' mean anything to you...?
I am looking for ideas from the best chefs to cook a nice large portion of that.
You need to come up with a common data format, and a way to translate the individual data formats to the common format. There's really no way around this - any solution you come up with will have to do this in one way or the other. It's the essential complexity of what you're doing.
The bigger issue is actually variances within the source data, in terms of how things like dates are stored, missing columns, etc. Doing a generic conversion for CSV to move columns around is comparatively easy.
I would also look at CSV and then use an OLEDB connection against the CSV file for importing.
If you try to make something that can interface to any data structure in the universe (and 500 is plenty close enough), it is guaranteed to be a maintenance nightmare. Instead I would approach this from multiple angles:
Devise an interface into which a human can enter this data already in the proper format. With 500+ clients, I'd make this a small, raw but functional browser based site that users can use to enter this information manally. This is the fall-back. At the end of the day, a human can re-key the information into the site and solve the import issue. Ideally, everyone would use this instead of their own format. Data entry people are cheap.
Similar to above, but expanded, I would develop a standard application or standardize on an off-the-shelf application that can be used to replace their existing format. This might take more time than #1. The goal would be to only do one-time imports of these varying data schemas into the application and be done with them for good.
The nice thing about spreadsheets is that you can do anything anywhere. The bad thing about spreadsheets is that you can do anything anywhere. With CSV or a spreadsheet there is simply no way to enforce data integrity and thus consistency (which is the primary goal) on the data. If the source data is already in a database, then that is obviously simpler.
I would be inclined to use database format into which each of these files need to be converted rather than a spreadsheet (e.g. use something like Jet (MDB)). If you have non-Windows users then that will make it harder and you might have to use a spreadsheet. The problem is that it is too easy for the user to change their source structure, break their upload and come crying to you. If a given end user has a resident expert, they can find a way of importing the data into that database format . If you are that expert, then I would on a case-by-case basis, write something that would import into that database format. XML would be the other choice, but that will likely take more coding than an import/export into a database format.
Standardization of the apps (even having all the sources in a database format instead of a spreadsheet would help) and control over the data schema is the ultimate goal rather than permitting a gazillion formats. There really is no nice answer other than standardization. Otherwise, you are having to write a converter for every Tom-Dick-and-Harry format and again when someone changes the source format.
With a multitude of data sources mapping each one correctly to an intermediate format is not trivial. Regular expressions are good with a finite set of known data formats. Multipass can help when data is ambiguous without context (month,day fields and have several days of data), and also help defeat data entry errors. But it seems as this data is connected to salaries there needs a good reliable transfer.
An import configuring trick
Get the customer to make a set of training data in the application. It should have a "predefined unique date" and each subsequent data field have a number corresponding to the target data field in your application. On importing your application needs to recognise the predefined date, determine the unique translation required and effect the displaying/saving of this "mapping key", and stop the import. eg If you expect "Duration hours" in field two then get the user to enter 2 in the relevant field which might be "Attendance hours".
On subsequent runs, and with the mapping definition key, import becomes a fairly easy process of translation.
Note on terms
"predefined date" - must be historical, say founding date of your company?, might need to be in PC clock settable range.
"mapping key" - could be string of hex digits and nybble based so tractable to workout
The entered code can be extended to signify required conversions ie customer's application has durations in days and your application expects it in hours.
Interfacing with windows programs (in order if increasing fragility)
Ye Olde saving as CSV file
Print to operating system printer that is setup as a text file/pdf, then scavenge the data out of that
Extract data via the application interface control, typically ActiveX for several windows programs ie like Matlab's Spreadsheet Link
Read native file format xls format ie like Matlab's xlsread
Add an additional intermediate spreadsheet sheet that has extended cell references ie ='[filename]rawdata'!$A$3
Have a look at Teiid by JBoss: http://jboss.org/teiid
Also consider using SOA - e.g., if you're on Java, try JBoss SOA platform: http://www.jboss.com/resources/soa/?intcmp=1004
Use a simple XML format. A non-technical person can easily understand a simple XML format (and could even identify basic problems with XML documents that are not well-formed).
Maybe use a DTD (or even better an XML schema) to do very basic validation, and then supplement this with an XSL stylesheet to do more validation with better error reporting. (An XSL stylesheet simply converts from XML to something else and so can be generate readable error messages.)
The advantage of this approach is that web browsers such as Internet Explorer can apply the XSL stylesheets. A customer need only spend at most a day enhancing their applications or writing excel macros to generate the XML data in the format that you specify.
Recent versions of Excel have support for converting spreadsheet data to XML, and can even validate against schemas.
Once the data passes the XSL validation checks, you have validated XML data.
If you have heaps of data and heaps of money, you could look at existing data management and cleansing tools:
http://www-01.ibm.com/software/data/infosphere/datastage
http://www-01.ibm.com/software/data/infosphere/qualitystage
But even then, you'll likely need to follow kyoryu's suggestion assuming you have 500+ data formats. The problem isn't your side. You need them to standardize their output formats if you have no control over their apps. CSV is likely the easiest. You could even send them a excel template to help them along.

Data Aggregation :: How important is it really?

I'm curious to know where people value Data Aggregation. I'm truly curious, if you don't mind letting me know how important this really is to you personally with respect to your work environment, and if you have to work directly with data agg in your line of work.
Really interested to hear about your feedback.
If you persist data (e.g. store it in a database) chances are that the data will be used by managers, statisticians, stake holders etc. to analyze the workings of their software-supported undertaking to make executive decisions. This analysis can only take place by methods of aggregation. There's no one in the world who can look at a million rows of raw data and glean insight. The data has to be summed, averaged, standard deviated etc. to make any sense to a human being.
A few examples of areas where data aggregation is important:
Public Health (CDC, WHO)
Marketing
Advertising
Politics
Organizational Management
Space Exploration
lol. Take your pick!
Very important, what else is there to say?
I work at a large hospital and not only do we have numerous departments using Analysis Services cubes we develioped but they rely heavily on the daily totals and different aggregations they can derive from these cubes by simple browsing. Without the very basic capability of being able to aggregate on some portions of your data you might as well write it on paper (IMO).
Say you have data over every individual sale.
Looking at these individual purchases could be interesting some level level(e.g. whne a customer comes and wants a refund)
However, I cannot send those 20 million records to my boss at the end of each month and say "Heres how much we sold this month".
This data needs to be aggregated and summarized on various levels. The business would not operate if the marketing guys couldn't get an aggregate for each product, the regional boss couldn't get an total aggregate over a time period and so on.
Our databases have millions of rows, of course we rely on aggregation for managment information, not to use it would be to put too heavy a load on the database in order to run large reports which would impact heavily on the users of the database. I can't think of many cases where the database contains business critical information that managers use to make decisions where aggregation would not be needed for managment reports.
I view data aggregation like data in a grid and being able to group, order, and sort columns. In a large grid of data, this is very important. It's really the difference between looking at a pile of numbers and looking at meaningful data.

Production, Test, Developer Environments vs Security

What are current practices for enabling developers to build systems that contain private data? Can anyone point to a "best practices" guide for that sort of thing?
We have a Catch-22 here in that developers need to write applications that go against systems that have data that is considered "private." The IT administration would like for us developers to not have access to the data (ie. provide a schema or data structure, but not data itself) whereas most developers (myself included) would like to have access to the production data since not having a representative dataset can lead to bad assumptions (eg. the format of data) and bugs later on.
Does anyone have any formalized "best practices" for this type of thing? Especially official guildines from some "BigCo" (eg. Microsoft, IBM) might help since it is needed to convince management.
My view of the world may be different, as I'm based in the UK, but for the past 20-odd years, I've worked primarily in the public sector on systems handling sensitive data.
The rules are **completely** cut-and-dried. No production data is allowed on the development estate.
As a fundamental principle, we do not want to be responsible for the loss of sensitive data. The users are perfectly good at that, themselves.
Within the past 12 months, my wife has moved from the same regime to one in the private sector where they allow developers access to production data and she's horrified by it. The legal implications (in the UK, at least) can be severe.
Developers don't **need** access to production data. It's simply laziness. Define and create test data to exercise defined test cases (including edge cases) and don't rely on the random-esque nature of production data.
If you **must** use production data (i.e. you manage to convince someone who doesn't know any better that it's acceptable), ensure the data is anonymised **before** it reaches the development estate.
Often times, a subset of sanitized data will be provided that is representative of the private data, but not the private data itself.
At my company, we started using Red-gate's data generator to generate test data. There is a bit of setup, but you can use the tools to generate very usable test data. Yes, I would prefer to use live production data, but it's not feasible (especially if you need to consider in HIPAA). It uses regex for each column and allows you to use look-up table's for related tables.
At MediumCo, we strip proprietary data out of our production data in Test and Dev. It has hurt us a little in the past to not have exactly-representative data, but the clients have asked about this point before, and it's usually not an issue, as the environments are populated with a lot of fake proprietary data.
I don't have any best practices paper or anything. But I would think that if you're developing out of an environment that is as protected as the environment that hosts the data in production, there wouldn't be a lot of argument to be made against it.
That is, if your production database is in a datacenter hosted and controlled and secured by your IT staff, if you have a development database that lives in the exact same scenario and doesn't offer any new ways to access the information - you would be in pretty good shape. As an added token of good will - it might be nice to offer to allow anyone worried about security a chance to do some kind of penetration test to ensure that you're telling the truth about security.
The other side of this, of course, is the analysis of the cost for not using the data: that is, it will lead to buggier code, which will cost $xxxxxx.xx in development time vs. virtually no cost to allow a small subset of your development team access to said data.
To avoid the need to manually sanitise/anonymise data, you could use random text replacement - to replace every alphanumeric character in each text field with a random alphanumeric. This:
keeps the data similar in length, size etc. from the developer's point of view
does not cause problems with character sets
leaves date and number fields untouched, which allows for accurate testing with respect to date ranges and quantities
will satisfy most privacy requirements
If you wanted to go a little further you could run random number-for-number replacement on telephone numbers and zip codes, while using alphanumeric replacement on other text fields.
Having an automated replacement script allows you to get up-to-date data dumps from the live system regularly, so your tests are up-to-date with respect to the size and variability of the data in practice.
It does mean that a small number of operations will not be realistic (e.g. indexing on name fields, which in real life are clustered around common letters) but these should be limited.

Resources