Reporting a huge volume of data - Excel

We run a SaaS product. One of our customers is an enterprise client: while other customers generate 8,000-10,000 rows of data for a six-month bill report, this client generates about 300,000 rows, each with 30 columns. When they go to our reporting module and export their six-month report, our servers become unresponsive as the CPU jumps to 80% and the entire process fails. Fetching the 300K rows from the SQL database (after joins, etc.) takes 30 minutes, followed by 3 hours to write the 300K rows with 30 columns to Excel. My worry is that just one client can bring the entire site down: we have 40 reports, and if they start going back 6 months and exporting data, other users will have a terrible experience.
What is the best way to handle such large exports of data in SAAS products?
Tx

Replicate your production database into an "end-user sandbox" database and serve your end-users' reporting from there, where they can slow the DB to a grinding halt if they need to.
Such large exports shouldn't be run against a production system.
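For the export itself, here is a minimal sketch of a streaming approach, assuming (purely for illustration) a Postgres read replica accessed with psycopg2 and openpyxl's write-only mode; the DSN, SQL, and sheet name are placeholders. The idea is to run it as a queued background job against the replica so the full 300K x 30 result set never sits in web-worker memory:

```python
# A minimal sketch, assuming a Postgres read replica (psycopg2) and a
# streaming .xlsx writer (openpyxl write-only mode). DSN, SQL, and file
# names are placeholders; run this as a queued background job, not
# inside a web request.
import psycopg2
from openpyxl import Workbook

REPLICA_DSN = "host=replica-host dbname=reports user=report_ro"  # hypothetical

def export_bill_report(sql, params, out_path, chunk_size=5000):
    conn = psycopg2.connect(REPLICA_DSN)
    try:
        # A named (server-side) cursor streams rows from the replica
        # instead of buffering the whole result set in app memory.
        cur = conn.cursor(name="bill_report_export")
        cur.execute(sql, params)

        wb = Workbook(write_only=True)   # rows are flushed to disk as they are appended
        ws = wb.create_sheet("Bill report")
        while True:
            rows = cur.fetchmany(chunk_size)
            if not rows:
                break
            for row in rows:
                ws.append(row)
        wb.save(out_path)
    finally:
        conn.close()
```

The finished file can then be made available as a download link, so the web tier never blocks on the multi-hour write.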

Related

Excel Power Query taking extremely long to load to the data model

I have one table in Excel 365 that I transform using a few queries to output as 2 pivot tables and 1 table. One of these queries is also loaded to the data model as I need to create measures.
The operation takes an extremely and erratically long time - approximately 25 to 70 minutes just to load a query involving 14 columns and 100 rows. The same operation in Power BI consistently takes 2.5 minutes. After spending two weeks on diagnostics, I eventually isolated the problem to:
Output query to pivot table and load to data model.
Row-wise operations in the query (if loaded to the data model).
I have spent two weeks refactoring my code, buffering what can be buffered and optimising every operation as best I can. I also tried "Fast Data Load", but it doesn't help.
Not loading the query to the data model means everything takes only 2-3 minutes, but I can't skip loading the query to the data model because I need measures.
Is there anything else I can do to fix this extremely long load-to-data-model operation?

Load Large Amount of Rows in Cassandra / Datastax

I have a table that has about 3 million rows in one partition key. I need to load all that data and save it as a file.
It is the data recorded from one day of sensor input, and what I want to build is a playback service to replay the sensor events for at least the past 3 months, so I am expecting a really big volume of data.
I am new to NoSQL databases; is there any approach to achieve this goal?
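A hedged sketch of the usual approach, assuming the DataStax Python driver (cassandra-driver): give the statement a bounded fetch_size so the driver pages through the partition, and stream the rows straight to a file. The keyspace, table, and column names here are placeholders:

```python
# A hedged sketch, assuming the Python cassandra-driver. Paging with
# fetch_size keeps memory bounded even for ~3M rows in one partition;
# keyspace/table/column names are placeholders.
import csv
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

cluster = Cluster(["127.0.0.1"])            # assumption: local contact point
session = cluster.connect("sensor_data")    # hypothetical keyspace

query = SimpleStatement(
    "SELECT event_time, sensor_id, value FROM readings WHERE day = %s",
    fetch_size=5000,        # the driver fetches 5000 rows per page
)

with open("readings_2017-01-01.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["event_time", "sensor_id", "value"])
    # Iterating the result set transparently requests the next page
    # from Cassandra as needed, so the full partition is never in memory.
    for row in session.execute(query, ("2017-01-01",)):
        writer.writerow([row.event_time, row.sensor_id, row.value])

cluster.shutdown()
```

For the 3-month playback, the same loop can be run once per daily partition rather than as one query over everything.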

Using QDigest over a date range

I need to keep a 28 day history for some dashboard data. Essentially I have an event/action that is recorded through our BI system. I want to count the number of events and the distinct users who do that event for the past 1 day, 7 days and 28 days. I also use grouping sets (cube) to get the fully segmented data by country/browser/platform etc.
The old way was to do this by keeping a 28-day history per user, for all segments. So if a user accessed the site from mobile and desktop every day for all 28 days, they would have 54 rows in the DB. This ends up being a large table, and it is time-consuming even to calculate approx_distinct, never mind an exact distinct count. But the issue is that I also wish to calculate approx_percentiles.
So I started investigating the use of HyperLogLog: https://prestodb.io/docs/current/functions/hyperloglog.html
This works great; it's much more efficient to store the daily sketches rather than the entire list of unique users per day. As I am using approx_distinct, the values are close enough and it works.
I then noticed a similar function for medians. Qdigest.
https://prestodb.io/docs/current/functions/qdigest.html
Unfortunately the documentation is not nearly as good on this page as it is on previous pages, so it took me a while to figure it out. This works great for calculating daily medians. But it does not work if I want to calculate the median actions per user over the longer time period. The examples in HyperLogLog demonstrate how to calculate approx_distinct users over a time period but the Qdigest docs do not give such an example.
When I try something similar to the HLL example for date ranges, but with qdigest, I get results similar to the 1-day results.
Because you need medians of values that are aggregated (summed) across multiple days on a per-user basis, you'll need to perform that aggregation before insertion into the qdigest for this to work for 7- and 28-day per-user counts. In other words, the units of the data need to be consistent: if daily values are being inserted into the qdigest, you can't use that qdigest for 7- or 28-day per-user counts of the events.
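To make that unit-consistency point concrete, here is a toy illustration in plain Python (not Presto SQL); the per-user daily counts are made up:

```python
# Toy illustration: a median of 7-day per-user totals must be computed
# from per-user totals; it cannot be recovered from per-day medians
# (or per-day digests). All numbers are hypothetical.
from statistics import median

daily_counts = {
    "alice": [1, 2, 1, 0, 3, 1, 2],
    "bob":   [0, 0, 5, 0, 0, 0, 0],
    "carol": [2, 2, 2, 2, 2, 2, 2],
}

# Wrong unit: one median per day of the per-user counts for that day.
per_day_medians = [median(day) for day in zip(*daily_counts.values())]

# Right unit: sum each user's counts across the 7 days first, then take
# the median over users -- this per-user total is what the qdigest
# would need to ingest for a 7-day per-user median.
per_user_totals = [sum(counts) for counts in daily_counts.values()]
weekly_median = median(per_user_totals)

print(per_day_medians)   # [1, 2, 2, 0, 2, 1, 2] -- daily-scale numbers
print(weekly_median)     # 10 -- a 7-day per-user number
```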

Power Pivot Report Slow

I wanted to see if anyone can assist. I have been using Power Pivot a fair amount over the last 2 years. I have created a report within it that uses 9 columns measuring 3 metrics (essentially sales over the last 2 years with a variance).
The pivot table it pulls through uses no sub-totals, is in tabular form, and pulls from only the one source file, which is an Excel workbook.
The issue is that when I pulled in the 6th to 9th columns, the query got notably slower - it now takes up to 15 minutes simply to add a filter.
Can anyone assist in making this data refresh at a normal rate? I have tried making the source file smaller and consolidating down to one source file instead of 3, but this seems to have minimal effect.
I have other reports with much larger amounts of data and more connections where I am not encountering this problem at all, so this is a little beyond my comprehension.

Access DB: Daily cumulative count and daily number on assignment

I have a table that tracks the dispatching of personnel. The table has the employee name and the date the person went out and the date they returned.
The table has hundreds of entries from 1988 to current.
In Excel I track the cumulative count per day of the year of how many people have been sent out, and I also track the number of people out on any given day. The table lists the month and day in the first column (every day of the year, including leap days) and the years in the first row. There is data for every date: a zero is entered until the first person is sent out that year, and the count then rises as there are more dispatches; for the number of people out each day, it shows zero if no one is out that day, or, say, "5" if five people were out. I then use the data in Excel to construct a graph that shows the number of dispatches on the y-axis and the day of the year on the x-axis (along with the current year's number, the average, and the max over the 27-year history). Currently I just track this manually: I keep a running count of each and enter it into Excel by hand.

I would like to build queries of my Access data that return the same information, which I could import into my Excel spreadsheet: one query that shows the day and month in the first column and the years along the top row, with a cumulative count for each day of how many people have been sent out that year; and another query with the day and month in the first column and the years along the top, showing how many people were out on that particular day of that particular year. There shouldn't be any gaps (every day has data, even if it is "0"). I would then import those queries into Excel to replace the manual tracking I am doing now.

I know how to construct the Excel side (I have that running already) and how to import data from Access into Excel; what I need to know is how to construct these two Access queries.
Any help/ideas on how to construct those 2 queries would be greatly appreciated!
I'd recommend that you migrate this app to a web-based solution that uses a real database - SQL Server or MySQL, not Access.
"Desk drawer software" is what I call homegrown apps that someone creates for themselves to perform some small task that eventually become integral to running a business and grow out of hand. Your truck factor is 1: if anything happened to you, no one would know how to do this function. The software may not be backed up or checked into a source code management system. There's no QA. There's no way to migrate new features to production: if you alter the app, then that is what you have.
I'd recommend a web app to mitigate all the risks I've described:
You have to deploy a web app to a server, which takes it off your desktop and puts it in a central place where anyone who's authorized can access it.
It separates database concerns from display concerns.
It makes you think about how to archive historical data; partitioning by year makes sense.
Likely you'll put this in a source code management system like Subversion or Git.
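Separately from the migration advice, the two tabulations themselves can be expressed outside Access; here is a hedged pandas sketch (not an Access query), assuming a dispatch table with hypothetical columns EmployeeName, DateOut, and DateReturned:

```python
# A hedged sketch of the two tabulations in pandas; table and column
# names are hypothetical stand-ins for the Access dispatch table.
import pandas as pd

dispatch = pd.DataFrame({
    "EmployeeName": ["A", "B", "C"],
    "DateOut":      pd.to_datetime(["2014-01-05", "2014-01-10", "2015-02-01"]),
    "DateReturned": pd.to_datetime(["2014-01-08", "2014-01-12", "2015-02-03"]),
})

# One row per calendar day across the whole history (leap days included),
# so there are no gaps even when the value is zero.
days = pd.DataFrame({"Date": pd.date_range(
    dispatch["DateOut"].min().replace(month=1, day=1),
    dispatch["DateReturned"].max().replace(month=12, day=31))})
days["Year"] = days["Date"].dt.year
days["MonthDay"] = days["Date"].dt.strftime("%m-%d")

# Table 1: cumulative count of dispatches per day of year, years as columns.
sent_per_day = dispatch.groupby(dispatch["DateOut"].dt.date).size()
days["Sent"] = days["Date"].dt.date.map(sent_per_day).fillna(0)
days["Cumulative"] = days.groupby("Year")["Sent"].cumsum()
cumulative = days.pivot_table(index="MonthDay", columns="Year",
                              values="Cumulative", aggfunc="first")

# Table 2: number of people out on each day (DateOut <= day <= DateReturned).
days["Out"] = days["Date"].apply(
    lambda d: int(((dispatch["DateOut"] <= d) &
                   (dispatch["DateReturned"] >= d)).sum()))
people_out = days.pivot_table(index="MonthDay", columns="Year",
                              values="Out", aggfunc="first")

print(cumulative.head())
print(people_out.head())
```

Whether this ends up as pandas, as two Access crosstab queries, or inside the web app suggested above, the logic is the same: a calendar spine with no gaps, a per-year cumulative sum of dispatches keyed on DateOut, and a between-dates count for people out.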
