Options for running data extraction on a daily basis - excel

I currently have an Excel-based data extraction method using Power Query and VBA (for documents with passwords). Ideally this would be scheduled to run once or twice a day.
My current solution involves setting up a spare laptop on the network that will run the extraction twice a day on its own. This works but I am keen to understand the other options. The task itself seems to be quite a struggle for our standard hardware. It is 6 network locations across 2 servers with around 30,000 rows and increasing.
Any suggestions would be greatly appreciated
Thanks

If you are going to work with growing data and dedicate a laptop exclusively to the process, I would think about installing a database on the laptop (MySQL, for example). You can use Access too, but Access file corruption is a risk.
Download into this database all the data you need for your report, using incremental downloads (only new, modified and deleted records).
Then run the Excel report against this database on the same computer.
This should improve the performance of your solution.
Your biggest problem is probably that you query ALL the data on each report generation.
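The incremental download described above could look roughly like the sketch below. It is only an illustration: it uses Python's built-in sqlite3 instead of MySQL so that it runs without a server, and the table name `extract` and the `(id, payload, modified_at)` row layout are assumptions, not anything from the question. It still walks the source keys to detect deletions, but it only writes rows that are new or changed.

```python
import sqlite3

def sync_incremental(conn, source_rows):
    """Apply new, modified and deleted source rows to the local table.

    source_rows: iterable of (id, payload, modified_at) tuples (hypothetical layout).
    Returns a (written, deleted) pair of row counts.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS extract ("
        "id INTEGER PRIMARY KEY, payload TEXT, modified_at TEXT)"
    )
    # What the local copy already holds: id -> last modification stamp.
    known = dict(conn.execute("SELECT id, modified_at FROM extract"))
    seen = set()
    written = 0
    for row_id, payload, modified_at in source_rows:
        seen.add(row_id)
        # Write only rows that are new or whose stamp has changed.
        if known.get(row_id) != modified_at:
            conn.execute(
                "INSERT OR REPLACE INTO extract (id, payload, modified_at) "
                "VALUES (?, ?, ?)",
                (row_id, payload, modified_at),
            )
            written += 1
    # Rows that disappeared from the source are removed locally.
    stale = [row_id for row_id in known if row_id not in seen]
    conn.executemany("DELETE FROM extract WHERE id = ?", [(i,) for i in stale])
    conn.commit()
    return written, len(stale)
```

If the source can be filtered by a modification timestamp, fetching only rows changed since the last run would reduce the network load further.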

Related

Live Connection to Database for Excel PowerQuery?

I currently have approximately 10M rows, ~50 columns in a table that I wrap up and share as a pivot. However, this also means that it takes approximately 30mins-1hour to download the csv or much longer to do a powerquery ODBC connection directly to Redshift.
So far the best solution I've found is to use Python with redshift_connector to run update queries and unload a zipped result set to an S3 bucket, then use boto3 and gzip to download and unzip the file, and finally refresh from the CSV. This resulted in a 600MB Excel file compiled in ~15-20 mins.
However, this process still feels clunky, and sharing a 600MB Excel file among teams isn't the best either. I've searched for several days but I'm no closer to finding an alternative: What would you use if you had to share a drillable table/pivot among a team with a 10GB datastore?
As a last note: I thought about programming a couple of PHP scripts, but my office doesn't have the infrastructure to support that.
Any help or ideas would be most appreciated!
Call a meeting with the team and let them know about the constraints. You will get some suggestions, and you can offer some of your own.
Suggestions from my side:
For the file part:
Reduce the data. For example, if it is time dependent, increase the interval: hourly data can be reduced to daily data.
If the data belongs to distinct groups, you can split the file into parts, one file per group.
Or send them only the final reports and numbers they require; don't send the full data.
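As an illustration of the first suggestion (reducing hourly data to daily data), here is a minimal sketch using only the Python standard library. The `(ISO timestamp, value)` row layout and the choice of summing, rather than averaging, are assumptions made for the example.

```python
from collections import defaultdict
from datetime import datetime

def hourly_to_daily(rows):
    """Sum (ISO timestamp, value) rows into one total per calendar day."""
    totals = defaultdict(float)
    for stamp, value in rows:
        day = datetime.fromisoformat(stamp).date().isoformat()
        totals[day] += value
    return dict(sorted(totals.items()))

# Hypothetical sample data: three hourly readings over two days.
hourly = [
    ("2024-01-01T00:00:00", 1.0),
    ("2024-01-01T01:00:00", 2.0),
    ("2024-01-02T00:00:00", 5.0),
]
daily = hourly_to_daily(hourly)
```

With 24 readings per day, this kind of roll-up cuts the row count by a factor of up to 24 before the file is shared.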
For a fully functional app:
You can buy a desktop PC (if budget is a constraint, buy a used one or use any desktop or laptop from old inventory) and create a PHP/Python web application that does all the steps automatically.
Create a local database and link it to the application.
Build the charting, pivoting, etc. modules in that application, and remove Excel from your process altogether.
You can even use prebuilt applications for the charting and pivoting part; Oracle APEX is one example.

Alternative of Cassandra for storing User data with high IO

We are looking for a technology stack which will have the following criteria.
We will have around 10 million customers.
Each customer will have 20MB+ of data.
Each user's data will be updated every day.
We need to store the data for more than six months.
We may need to query the data at any time within that six-month span.
Currently we are thinking of using Cassandra, but since the maximum storage per node in Cassandra should stay below 3TB, we are looking for alternatives to use with or without Cassandra.
Well, I don't know if my suggestion applies to your case. We had a similar case with one of our products. A blob field had been created to store binary data, such as PDF documents, which made the database grow considerably.
Our solution was to create a second database as a repository for records older than one year. On the application server there's a service running which:
1) Copies records older than one year, from specific tables, to this second database;
2) Deletes those records from the main database once we have a copy on the other side;
3) Directs queries that need data older than one year to this second database.
Sure, we had to make some changes to the code to adapt to this setup, but it has been running well so far.
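A rough sketch of the three steps above, using Python's built-in sqlite3 purely as a stand-in for the two databases. The `documents` table, its columns, and the function names are hypothetical; the original service was not described in that detail. Dates are ISO strings, so plain string comparison orders them correctly.

```python
import sqlite3

def archive_old_records(main, archive, cutoff):
    """Move rows created before `cutoff` (ISO date string) to the archive."""
    archive.execute(
        "CREATE TABLE IF NOT EXISTS documents "
        "(id INTEGER PRIMARY KEY, created TEXT, body BLOB)"
    )
    old = main.execute(
        "SELECT id, created, body FROM documents WHERE created < ?", (cutoff,)
    ).fetchall()
    # 1) Copy first and commit, so a failure never loses data.
    archive.executemany("INSERT OR IGNORE INTO documents VALUES (?, ?, ?)", old)
    archive.commit()
    # 2) Delete from the main database only after the copy is committed.
    main.executemany("DELETE FROM documents WHERE id = ?", [(r[0],) for r in old])
    main.commit()
    return len(old)

def find_documents(main, archive, cutoff, since):
    """3) Query the archive as well only when `since` predates the cutoff."""
    sql = "SELECT id, created FROM documents WHERE created >= ? ORDER BY created"
    rows = main.execute(sql, (since,)).fetchall()
    if since < cutoff:
        rows = archive.execute(sql, (since,)).fetchall() + rows
    return rows
```

Copying before deleting means a crash between the two steps leaves a duplicate rather than a gap, and the `INSERT OR IGNORE` makes a rerun harmless.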
You can try ScyllaDB. It's a C++ reimplementation of Cassandra at 10x the speed. Scylla supports 10TB/node and there are examples of larger amounts per node. Proper disclosure - I work there but am speaking from experience.
You can definitely consider storing just the metadata in the database and the blobs on separate nodes outside it, but that's complex, and Scylla can store it all together. A similar system is already in production, and we hope that the user will eventually open source it.

Unable to open a large .csv file

A very simple question...
I have downloaded a very large .csv file (around 3.7 GB) and now wish to open it; but excel can't seem to manage this.
Please how do I open this file?
Clearly I am missing a trick!
Please help!
There are a number of other Stackoverflow questions addressing this problem, such as:
Excel CSV. file with more than 1,048,576 rows of data
The bottom line is that you're getting into database territory with that sort of size. The best solution I've found is Bigquery from Google's cloud platform. It's super cheap, astonishingly fast, and it automatically detects schemas on most CSVs. The downside is you'll have to learn SQL to do even the simplest things with the data.
Can you not tell excel to only "open" the file with the first 10 lines ...
This would allow you to inspect the format and then use some database functions on the contents.
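Excel aside, peeking at the first few rows is easy to do in a script. A minimal sketch in Python (the function names are made up for the example): it reads only as many lines as it needs, so the size of the file does not matter.

```python
import csv
from itertools import islice

def peek_rows(lines, n=10):
    """Parse at most the first n CSV rows from an iterable of text lines."""
    return list(islice(csv.reader(lines), n))

def peek_csv(path, n=10):
    """Return the first n rows of the file at `path` without loading the rest."""
    with open(path, newline="", encoding="utf-8") as handle:
        return peek_rows(handle, n)
```

Calling `peek_csv` with the path to the download returns the header and the first nine data rows, which is enough to inspect the format. The utf-8 encoding is an assumption and may need changing for your file.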
Another thing that can impact whether you can open a large Excel file is the resources and capacity of the computer. That's a huge file and you have to have a lot of on-disk swap space (page file in windows terms) + memory to open a file of that size. So, one thing you can do is find another computer that has more memory and resources or increase your swap space on your computer. If you have windows just google how to increase your page file.
This is a common problem. The typical solutions are:
Insert your .csv file into a SQL database such as MySQL, PostgreSQL, etc.
Process your data using Python or R.
Find a data hub for your data. For example, Acho Studio.
The problem with the first solution is that you'll have to design a table schema and find a server to host the database. You also need to write server-side code to maintain or change the database. The problem with Python or R is that running processes on GBs of data will put a lot of stress on your local computer. A data hub is much easier, but its cost may vary.
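For what it's worth, the first two options can be combined without a server. The sketch below streams a CSV into a SQLite file in batches, so memory use stays bounded by the batch size. It is an illustration only: SQLite is swapped in for MySQL/PostgreSQL, the table name `data` is made up, columns are created untyped from the header row, and header names containing double quotes are not handled.

```python
import csv
import sqlite3
from itertools import islice

def load_csv(lines, conn, table="data", batch_size=50_000):
    """Stream CSV lines into a SQLite table, one batch of rows at a time."""
    reader = csv.reader(lines)
    header = next(reader)
    columns = ", ".join(f'"{name}"' for name in header)
    marks = ", ".join("?" for _ in header)
    conn.execute(f'CREATE TABLE IF NOT EXISTS "{table}" ({columns})')
    total = 0
    while True:
        # Pull the next batch; only this many rows are held in memory.
        batch = list(islice(reader, batch_size))
        if not batch:
            break
        conn.executemany(f'INSERT INTO "{table}" VALUES ({marks})', batch)
        conn.commit()
        total += len(batch)
    return total
```

Pass it an open file handle (opened with `newline=""`) and a connection from `sqlite3.connect` pointing at a file on disk. After that, the data can be queried with SQL instead of being opened in Excel.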

Grails Excel import fails for huge data

I am using grails 2.3.7 and the latest excel-import plugin (1.0.0). My requirement is that I need to copy the contents of an excel sheet completely as it is into the database. My database is mssql server 2012.
I have got the code working for the development version. The code works fine when there are only a few records, or perhaps up to a few hundred.
But in production the Excel sheet will have as many as 50,000 rows and over 75 columns.
Initially I hit an out of memory exception. I increased the heap size to as much as 8GB, but now the thread keeps running on and on without terminating. No errors are generated.
Please note that this is a once-in-a-while operation, and it will be carried out by a person who will ensure that it does not hamper other operations running in parallel. So there is no need to worry about the huge load of this operation. I can afford to run it.
When there are up to 10,000 records with the same number of columns, the data gets copied in around 5 mins. With 50,000 rows the time taken should ideally be around 5 times more, which is around 25 mins. But the code kept running for more than an hour without terminating.
Any idea how to go about this issue? Any help is highly appreciated.
If you load 5 times more data into memory, it doesn't always take just 5 times longer. I guess that most of the 8GB is in virtual memory, and virtual memory is very slow. Try decreasing the memory allocation, run some memory tests, and try to stay within physical RAM as much as possible.
In my experience, this is a normal problem with large batch operations in Grails. I think you have memory leaks that radically slow down the operation as it proceeds.
My solution has been to use an ETL tool such as Pentaho Kettle for the import, or chunk the import into manageable pieces. See this related question:
Insert 10,000,000+ rows in grails
Not technically an answer to your problem, but have you considered just using CSV instead of Excel?
From a user's point of view, saving as a CSV before importing is not a lot of work.
I am loading, validating and saving CSVs with 200,000-300,000 rows without a hitch.
Just make sure you have the logic in a service so that it puts a transaction around it.
It may take a bit more code to decode the CSV, especially to translate values to the various primitives, but it should be orders of magnitude faster.
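To show the shape of that approach, here is a sketch in Python rather than Groovy/Grails, so treat it as an illustration of the idea and not as Grails code. The `items` table and the column names in `CONVERTERS` are hypothetical. Each CSV field is converted to a primitive, and the whole import runs in a single transaction that is rolled back if any row fails validation.

```python
import csv
import sqlite3

# Hypothetical columns and the primitive type each one should be decoded to.
CONVERTERS = {"id": int, "amount": float, "name": str}

def parse_rows(lines):
    """Yield typed tuples; raise ValueError naming the offending line."""
    reader = csv.DictReader(lines)
    for record in reader:
        try:
            yield tuple(convert(record[col]) for col, convert in CONVERTERS.items())
        except (KeyError, TypeError, ValueError) as exc:
            raise ValueError(f"line {reader.line_num}: {exc}") from exc

def import_rows(conn, lines):
    """Insert every row in one transaction; nothing is kept if a row is bad."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS items (id INTEGER, amount REAL, name TEXT)"
    )
    with conn:  # commits on success, rolls back if an exception escapes
        cursor = conn.executemany(
            "INSERT INTO items VALUES (?, ?, ?)", parse_rows(lines)
        )
    return cursor.rowcount
```

Because the rows are produced by a generator, the file is never fully held in memory, which is the main difference from parsing a whole workbook up front.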

SharePoint SQL Reporting Services - OutOfMemory exception on large reports. How to solve?

We have a bunch of reports on SharePoint, using SQL Reporting Services.
The statistics reports, the ones that aggregate data and display a few hundred to a few thousand records, load fine.
However, we also have reports that display raw records from the database. These reports usually have tens or hundreds of thousands of records, sometimes even millions. Most of the time they do not load but throw OutOfMemory errors.
The queries for those reports are very simple selects with some where conditions (sometimes, another few small tables might be joined on the huge one). In SQL Server Management Studio the query completes in 5-10 seconds.
I'm frustrated, because the clients are asking for the report, but I can't find any solutions to this (I googled a lot, but the best advice I could find was "get rid of the report or try to minimize the amount of data in it" which doesn't really solve anything - clients insist that they need the ENTIRE report.)
Is it possible to solve this somehow?
Move to 64 bit for the reporting server?
Chances are the users need the ENTIRE report because they are scraping the data off into excel or some other format and using it elsewhere. If you can get away with it, coding a webpage or similar that displays the query in a simple text format/csv may be more effective than a report.
I.e. The best advice you can find is the best advice.
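A sketch of what "display the query as CSV" could look like, written in Python with the built-in sqlite3 standing in for SQL Server (an assumption made so the example runs anywhere). The function name is made up. The point is that rows are fetched and written in chunks, so the full result set is never held in memory the way a rendered report would hold it.

```python
import csv
import io
import sqlite3

def stream_query_as_csv(conn, sql, out, params=(), chunk=1000):
    """Write a query's result to `out` as CSV, fetching `chunk` rows at a time."""
    cursor = conn.execute(sql, params)
    writer = csv.writer(out)
    # Header row taken from the cursor's column metadata.
    writer.writerow([col[0] for col in cursor.description])
    rows = 0
    while True:
        batch = cursor.fetchmany(chunk)
        if not batch:
            break
        writer.writerows(batch)
        rows += len(batch)
    return rows

# Small demonstration against an in-memory table.
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE t (id INTEGER, name TEXT)")
demo.executemany("INSERT INTO t VALUES (?, ?)", [(1, "a"), (2, "b"), (3, "c")])
buffer = io.StringIO()
written = stream_query_as_csv(demo, "SELECT id, name FROM t ORDER BY id", buffer, chunk=2)
```

In a web page, `out` would be the response stream instead of a `StringIO`, so users can download the whole result and open or import it wherever they need it.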
