Data Aggregation :: How important is it really?

I'm curious to know how much people value data aggregation. If you don't mind, let me know how important this really is to you personally with respect to your work environment, and whether you have to work directly with data aggregation in your line of work.
Really interested to hear your feedback.

If you persist data (e.g. store it in a database), chances are that the data will be used by managers, statisticians, stakeholders etc. to analyze the workings of their software-supported undertaking and to make executive decisions. This analysis can only take place by methods of aggregation. There's no one in the world who can look at a million rows of raw data and glean insight. The data has to be summed, averaged, standard-deviated etc. to make any sense to a human being.
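To make that concrete, here's a minimal sketch (plain Python, made-up numbers) of how a million raw values collapse into the handful of summary statistics a human can actually reason about:

```python
import random
import statistics

# A million rows of raw data: e.g. individual transaction amounts.
raw = [random.uniform(1, 500) for _ in range(1_000_000)]

# No human can read the list above; these three numbers they can.
print(f"total: {sum(raw):,.2f}")
print(f"mean:  {statistics.mean(raw):.2f}")
print(f"stdev: {statistics.stdev(raw):.2f}")
```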
A few examples of areas where data aggregation is important:
Public Health (CDC, WHO)
Marketing
Advertising
Politics
Organizational Management
Space Exploration
lol. Take your pick!

Very important, what else is there to say?
I work at a large hospital, and not only do we have numerous departments using the Analysis Services cubes we developed, but they also rely heavily on the daily totals and the different aggregations they can derive from those cubes by simply browsing. Without the very basic capability of aggregating over some portion of your data, you might as well write it on paper (IMO).

Say you have data on every individual sale.
Looking at these individual purchases can be interesting at some level (e.g. when a customer comes in and wants a refund).
However, I cannot send those 20 million records to my boss at the end of each month and say "Here's how much we sold this month".
This data needs to be aggregated and summarized at various levels. The business would not operate if the marketing guys couldn't get an aggregate for each product, the regional boss couldn't get a total aggregate over a time period, and so on.
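A minimal pandas sketch of those levels, with invented column names (`date`, `region`, `product`, `amount`) standing in for a real sales table:

```python
import pandas as pd

# One row per individual sale (20 million in the real case; 4 here).
sales = pd.DataFrame({
    "date":    pd.to_datetime(["2024-01-03", "2024-01-15", "2024-02-07", "2024-02-09"]),
    "region":  ["North", "North", "South", "South"],
    "product": ["A", "B", "A", "A"],
    "amount":  [120.0, 75.5, 99.9, 210.0],
})

# For the boss: "Here's how much we sold this month".
monthly = sales.groupby(sales["date"].dt.to_period("M"))["amount"].sum()

# For the marketing guys: an aggregate for each product.
per_product = sales.groupby("product")["amount"].agg(["sum", "mean", "count"])

# For the regional boss: a total aggregate over a time period.
per_region = sales.groupby("region")["amount"].sum()

print(monthly, per_product, per_region, sep="\n\n")
```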

Our databases have millions of rows, so of course we rely on aggregation for management information. Not to use it would put too heavy a load on the database when running large reports, which would impact heavily on the users of the database. I can't think of many cases where a database contains business-critical information that managers use to make decisions and aggregation would not be needed for the management reports.

I think of data aggregation like viewing data in a grid and being able to group, order, and sort columns. In a large grid of data, this is very important. It's really the difference between looking at a pile of numbers and looking at meaningful data.

Related

statistical test for an effect across groups, data are nested, not of normal distribution

What is the best statistical test for an effect across groups when the data are nested and may not be normally distributed? I get a highly significant effect using the Kruskal-Wallis test, but that test does not account for the fact that the data points come from several locations, each of which contributed for several years, and that in every year the data were grouped into age groups.
I think you can categorize the data by year and change the data structure so that the data are no longer nested, making them easier to process. I agree that the Kruskal-Wallis test is a good choice for testing the cross-group effect.
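For reference, a minimal SciPy sketch of the test itself, on made-up samples (one array per group); the caveat about nesting above still applies, since the test treats all observations as independent:

```python
from scipy.stats import kruskal

# Hypothetical measurements for three age groups (one sample per group).
group_a = [2.9, 3.0, 2.5, 2.6, 3.2]
group_b = [3.8, 2.7, 4.0, 2.4]
group_c = [2.8, 3.4, 3.7, 2.2, 2.0]

# Kruskal-Wallis H-test: non-parametric, so no normality assumption,
# but it knows nothing about locations or years, which is exactly
# the concern raised above.
stat, p = kruskal(group_a, group_b, group_c)
print(f"H = {stat:.3f}, p = {p:.3f}")
```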

Is my database design consistent with RDBMS principles?

I am working on my website where I sell concert tickets.
I am designing the part of the website where I generate tickets based on the seats and rows available.
After some thinking and drawing I have come to the conclusion that this design would be best for my problem.
I was wondering: is this a poor design, or are there any improvements that I can make?
Thank you
I wouldn't expect to have a table of unbooked seats. A table of bookings seems more logical. Your concerts table looks questionable if you expect to have a series of dates for the same concert.
Perhaps you should first sketch out the key functions of your site as User Stories or Use Cases and list the required attributes for each. That could give you a better set of requirements for your database design: e.g. which customer attributes you need, and what about seat attributes such as restricted view, standing places, or accessible places for the disabled.
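To make the bookings suggestion concrete, here's a minimal sketch of one possible shape (SQLite via Python; all table and column names are assumptions for illustration, not your actual design):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- A concert is the event; a performance is one date of that event,
    -- which resolves the 'series of dates for the same concert' issue.
    CREATE TABLE concert     (id INTEGER PRIMARY KEY, title TEXT NOT NULL);
    CREATE TABLE performance (id INTEGER PRIMARY KEY,
                              concert_id INTEGER NOT NULL REFERENCES concert(id),
                              starts_at  TEXT    NOT NULL);
    -- Seats exist independently of bookings; no table of *unbooked* seats.
    CREATE TABLE seat        (id INTEGER PRIMARY KEY,
                              row_label TEXT NOT NULL,
                              seat_no   INTEGER NOT NULL);
    -- A booking ties a customer to a seat at one performance;
    -- the UNIQUE constraint stops double-selling a seat.
    CREATE TABLE booking     (id INTEGER PRIMARY KEY,
                              performance_id INTEGER NOT NULL REFERENCES performance(id),
                              seat_id        INTEGER NOT NULL REFERENCES seat(id),
                              customer_email TEXT    NOT NULL,
                              UNIQUE (performance_id, seat_id));
""")

# Unbooked seats for a performance are then just a query, not a table:
free = conn.execute("""
    SELECT s.id, s.row_label, s.seat_no
    FROM seat s
    WHERE s.id NOT IN (SELECT seat_id FROM booking WHERE performance_id = ?)
""", (1,)).fetchall()
print(free)
```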

OLAP cube powering Excel Pivot. What's a better solution?

I'm looking to build a dynamic data environment for non-technical marketers.
I want to provide large sets of data in Excel pivot-table form so that even marketers without analytics/technical backgrounds can access relevant performance information. I'm trying to avoid non-Excel front ends, since I don't want users to have to constantly export data whenever they need to manipulate it in some way.
My first thought was to just throw together an OLAP cube populated with pre-aggregated data, but I got pushback from the IT team as OLAP is "obsolete." I don't disagree with them - there are definitely faster data processing architectures out there.
So my question is this: are there any other ways to structure the data so that marketers can access it easily but still manipulate it to some degree in Excel? I'm working with probably 50-100m rows of data and need the ability to scale dimensionality.
These are just my thoughts.
Really the question could be thrown back at your IT team. Your first thought was to throw together an OLAP cube. IT didn't like this. If they're so achingly hip that they consider OLAP "obsolete", what do they suggest as a better, more up-to-date alternative?
Or, to put it a different way - what is the substance of their objection to an OLAP solution? (I'm assuming there is one beyond "MS gave us an awesome presentation of PowerPivot/Azure tabular, with really great free snacks and coffee").
Your requirements are pretty clear:
1. Easy access for non-technical people
2. Structured data so that they don't have to interpret the raw data
3. Access through Excel
4. Scalability
I'll be paying close attention to any other answers to your question, because I'm always interested in finding out that I don't know something; but personally I haven't come across a better solution to these requirements than OLAP.
What makes me suspicious of the "post-OLAP" sentiment relates to point (2) in the list above. Non-technical users tend to think of the cube data they consume as being somehow effortlessly produced, by some kind of magic. That in itself is an indicator of success, demonstrating just how easy it is for users to get what they want from a well-designed OLAP system.
But this effortlessness is an illusion: to structure the raw data into this form takes design effort, and the resulting structure incorporates design decisions and assertions: that is how it can be easy to use, because the hard stuff has been encapsulated in the cube design.
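To illustrate the kind of design work a cube encapsulates, here's a minimal pandas sketch with invented dimensions (`month`, `region`, `channel`) and one measure (`spend`), producing the sort of pre-aggregated, pivot-friendly structure a cube would serve up to Excel:

```python
import pandas as pd

# Raw fact rows: in the real case, 50-100m of these.
facts = pd.DataFrame({
    "month":   ["2024-01", "2024-01", "2024-02", "2024-02"],
    "region":  ["EMEA", "AMER", "EMEA", "AMER"],
    "channel": ["search", "social", "search", "social"],
    "spend":   [1200.0, 800.0, 950.0, 1100.0],
})

# The 'design decisions': which dimensions exist, which measures,
# which aggregation function. This is the work the cube hides.
cube = pd.pivot_table(
    facts,
    values="spend",
    index=["region", "channel"],
    columns="month",
    aggfunc="sum",
    margins=True,   # grand totals, like a cube's All member
)
print(cube)
```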
I have a definite Han Solo-like bad feeling about "post-OLAP": that it amounts to pandering to this illusion of effortless transformation of data into a usable form, and propagates further illusions.
Under OLAP, users get their wonderful magic usable data structure, and the hard work is done out of sight by developers like you or me. Perhaps we get something wrong so that they can't see data exactly as they'd like to - but at least the users can then talk to us and ask for what they do want.
My impression of the "post-OLAP" sales pitch is that it tries to dispense with the design work. We don't need those pesky expensive developers, we don't need to make specific design decisions (which necessarily enable some functionality while precluding some other functionality), we don't need cube-processing time-lags. We can somehow deliver this:
Input any data you like. Don't worry if it's completely unstructured or full of dirt!
Any scale
Immediate access to analytics without ETL/processing delays
Somehow, the output is usable, structured data. Structured by... no-one in particular. The user can structure it as they like, but somehow this will be easy
Call me cynical, but this sounds like magical thinking to me.

How to export specific price and volume data from the LMAX level 2 widget to excel

Background -
I am not a programmer.
I do trade spot forex on an intraday basis.
I am willing to learn programming
Specific Query -
I would like to know how to export into Excel, in real time, the 'top of book' price and volume data as displayed on the LMAX Level 2 widget/frame at -
https://s3-eu-west-1.amazonaws.com/lmax-widget/website-widget-quote-prof-flex.html?a=rTWcS34L5WRQkHtC
In essence I am looking to export:
1) price and volume data where the coloured flashes occur;
2) price and volume data for when the coloured flashes do not occur.
I understand that 1) and 2) will encompass all the top-of-book prices and volumes. However, I would like to keep 1) and 2) separate/distinguished as far as data collection is concerned.
Time period for which the collected data is intended to be stored: 2-3 hours.
What kind of languages do I need to know to do the above?
I understand that I need to be an advanced Excel user too.
Long term goals -
I intend to use the above information to make discretionary intraday trading decisions.
In the long run I will get more involved with creating an algo or indicator to help with the decision making process, which would include the information above.
I have understood that one needs to know coding to get involved in activities such as the above. Hence I have started learning C++, mostly to get a feel for coding.
I have been searching all over the web for where to start in this endeavor. However, I am quite confused and overwhelmed by all the information.
Hence, apart from the specific data-export query, any additional guidelines would also be helpful.
As of now I use MT4 to trade. Hence I believe that to do the above I will need more than just MT4.
Any help would be highly appreciated.
Yes, MetaTrader4 is still not able (in spite of all the white-labelled Terminals' OrderBook Add-On(s) marketing and PR efforts) to provide OrderBook-L2/DoM data to your MQL4 / New-MQL4 algorithm for any decision making. Third-party software integration is needed to make MQL4 code aware of the real-time L2/DoM data.
The LMAX widget has an impressive look & feel; however, re-using it as an automated scanner to produce data for 1) and 2) for your Excel export would take a lot of programming effort, and there may be further, non-technical trouble from legal / operational restrictions on running an automated scanner against such a data source. As an example, one data publisher's policy restricts automated options-pricing scanners for options on { FTSE | CAC | AMS | DAX } to re-visiting the online published data no more than once per quarter of an hour, and scanners get blocked / black-listed otherwise. So care and proper data-source engineering are in order.
The size of the data collection is another issue. Excel has restrictions on the number of rows/columns that can be imported, and the larger the data files, the more likely CSV imports are to hit those limits. L2/DoM data collected over 2-3 hours for just one single FX Major may go beyond such a limit, as there are many records per second (tens, if not hundreds, with just a few milliseconds between them); at, say, 100 records per second, 3 hours already means some 1,080,000 rows, beyond Excel's 1,048,576-row worksheet limit. Static files of collected data records typically take several minutes just to be written to disk, so a properly distributed processing data-flow design and non-blocking file-I/O engineering are a must.
Real-time system design is the right angle from which to view the solution, rather than treating it as just a programming-language exercise. Having mastered a programming language is a great move; nevertheless, robust real-time system design (and trading software is such a domain) requires, with all respect, a lot more insight and hands-on experience than making MQL4 code run multi-threaded & multi-processed with a few DLL services in a Cloud/Grid-based distributed processing system.
How much real-time traffic is expected to be there?
For just a raw idea of what the Market can produce per second, per millisecond, per microsecond, consider a NYNEX traffic analysis for one instrument: a single second can contain a wild relief of bursts, which becomes even more apparent at 5-msec sampling. (Charts of the 1-second and 5-msec sampled traffic omitted.)
How to export
Check if the data-source owner legally permits your automated processing.
Create your own real-time DataPump software, independent of the HTML-wrapped widget (see the sketch after this list).
Create your own 'DB-store' to efficiently off-load scanned data records from the real-time DataPump.
Test the live data-source >> DataPump >> DB-store chain for performance & robustness, so that it can serve error-free on a 24/6 duty for several FX Majors in parallel.
Integrate your DataPump-fed DB-store as a local data source for on-line/off-line interactions with your preferred { MT4 | Excel | quantitative-analytics } package.
Integrate monitoring for any production-environment irregularity in your real-time processing pipeline, which may range from network issues, VPN / hosting issues and data-source availability issues to an unexpected change in the scanned data source's format or access conditions.
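A minimal sketch of the DataPump and DB-store steps, under loud assumptions: fetch_top_of_book() below is a hypothetical stand-in for whatever legally permitted feed you end up with, and SQLite stands in for the 'DB-store':

```python
import sqlite3
import time

def fetch_top_of_book():
    """Hypothetical placeholder: replace with a real, permitted feed.
    Returns (bid_price, bid_volume, ask_price, ask_volume, flashed)."""
    return (1.0923, 500000, 1.0925, 750000, False)

conn = sqlite3.connect("l2_ticks.db")
conn.execute("""CREATE TABLE IF NOT EXISTS tick (
                    ts      REAL NOT NULL,   -- epoch seconds
                    bid     REAL,
                    bid_vol INTEGER,
                    ask     REAL,
                    ask_vol INTEGER,
                    flashed INTEGER           -- 1 = coloured flash, 0 = not,
                                              -- keeps cases 1) and 2) separate
                )""")

# DataPump loop: poll, stamp, persist. A real pump would be event-driven
# with non-blocking I/O; this only shows the shape of the data flow.
for _ in range(10):          # 2-3 hours in production; 10 ticks here
    bid, bid_vol, ask, ask_vol, flashed = fetch_top_of_book()
    conn.execute("INSERT INTO tick VALUES (?,?,?,?,?,?)",
                 (time.time(), bid, bid_vol, ask, ask_vol, int(flashed)))
    conn.commit()
    time.sleep(0.1)

# Excel side: export either stream separately, e.g. flashed rows only.
rows = conn.execute("SELECT * FROM tick WHERE flashed = 1").fetchall()
```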

How to estimate a search application's efficiency?

I hope this belongs here.
Can anyone please tell me whether there is a method to compare different search applications working in the same domain with the same dataset?
The problem is that they are quite different: one is a web application which looks up a database where items are grouped in categories, and the other is a rich client which searches by keywords.
Are there any standard test guides for that purpose?
There are testing methods. You may use e.g. precision/recall or the F-beta measure to compute a rate that estimates the "efficiency". However, you need to build a reference set yourself. That means you will be measuring not so much the efficiency in the domain as the efficiency relative to your own judgments.
All the more reason to make sure that your reference set is representative of the data you have.
In most cases, common-sense reasoning will get you to that result as well.
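A minimal sketch of those measures, assuming you have hand-labelled a reference set of relevant document IDs per query (all names here are invented):

```python
def precision_recall_fbeta(retrieved: set, relevant: set, beta: float = 1.0):
    """Compare one query's retrieved IDs against a hand-made reference set."""
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall    = hits / len(relevant)  if relevant  else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    b2 = beta * beta
    fbeta = (1 + b2) * precision * recall / (b2 * precision + recall)
    return precision, recall, fbeta

# Same query, same dataset, two engines -- the comparison you're after.
reference = {"doc1", "doc4", "doc7"}   # your own relevance judgment
print(precision_recall_fbeta({"doc1", "doc2", "doc4"}, reference))          # engine A
print(precision_recall_fbeta({"doc1", "doc4", "doc7", "doc9"}, reference))  # engine B
```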
If you want to measure performance in terms of speed, you need to formulate a set of assumed queries against the search and then query your search engine with these at a given rate. That's doable with every common load-testing tool.
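And a tiny sketch of the speed side, timing a hypothetical search() stand-in over a fixed query set (a real load test would use a dedicated tool and concurrent clients):

```python
import time
import statistics

def search(query: str):
    """Hypothetical stand-in for either application's search call."""
    time.sleep(0.005)  # pretend work

queries = ["red shoes", "laptop 15 inch", "gift card"] * 50
latencies = []
for q in queries:
    t0 = time.perf_counter()
    search(q)
    latencies.append(time.perf_counter() - t0)

print(f"median {statistics.median(latencies) * 1000:.1f} ms, "
      f"p95 {sorted(latencies)[int(0.95 * len(latencies))] * 1000:.1f} ms")
```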
