I currently have approximately 10M rows and ~50 columns in a table that I wrap up and share as a pivot. However, this also means that it takes approximately 30 mins to 1 hour to download the CSV, or much longer to do a Power Query ODBC connection directly to Redshift.
So far the best solution I've found is to use Python with redshift_connector to run the update queries and UNLOAD a zipped result set to an S3 bucket, then use boto3/gzip to download and unzip the file, and finally perform a refresh from the CSV. This results in a 600 MB Excel file compiled in ~15-20 mins.
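Roughly, the pipeline looks like this (a simplified sketch; the bucket, IAM role, credentials and table names below are placeholders, not my real ones):

```python
# Sketch of the current pipeline: UNLOAD from Redshift to S3 as gzipped CSV,
# then download and decompress it so Excel can refresh from the local CSV.
import gzip
import shutil

import boto3
import redshift_connector

# 1) Run the update queries, then UNLOAD the result set to S3
conn = redshift_connector.connect(
    host="my-cluster.xxxx.us-east-1.redshift.amazonaws.com",  # placeholder
    database="analytics",
    user="report_user",
    password="...",
)
cur = conn.cursor()
cur.execute("""
    UNLOAD ('SELECT * FROM reporting.big_table')
    TO 's3://my-bucket/exports/big_table_'
    IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftUnloadRole'
    FORMAT AS CSV HEADER GZIP PARALLEL OFF
""")
conn.commit()
conn.close()

# 2) Download whatever UNLOAD produced and decompress it for the Excel refresh
s3 = boto3.client("s3")
listing = s3.list_objects_v2(Bucket="my-bucket", Prefix="exports/big_table_")
with open("big_table.csv", "wb") as dst:
    for obj in listing.get("Contents", []):
        s3.download_file("my-bucket", obj["Key"], "part.gz")
        with gzip.open("part.gz", "rb") as src:
            shutil.copyfileobj(src, dst)  # Excel then refreshes from big_table.csv
```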
However, this process still feels clunky, and sharing a 600 MB Excel file among teams isn't great either. I've searched for several days but I'm no closer to finding an alternative. What would you use if you had to share a drillable table/pivot among a team with a 10GB datastore?
As a last note: I thought about programming a couple of PHP scripts, but my office doesn't have the infrastructure to support that.
Any help or ideas would be most appreciated!
Call a meeting with the team and let them know about the constraints; you will get some suggestions and you can offer some of your own.
Suggestions from my side:
For the file part
reduce the data: for example, if it is time dependent, increase the interval, e.g. hourly data can be reduced to daily data (see the pandas sketch after this list)
if the data is related to groups, you can split the file into different parts, one file per group
or send them only the final reports and numbers they require; don't send them the full data.
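A quick pandas sketch of the first two ideas (the file and column names are just examples, adjust them to your data):

```python
# Illustrative only: aggregate hourly data to daily and split the extract by group.
# "timestamp", "product_group", "amount" and "quantity" are made-up column names.
import pandas as pd

df = pd.read_csv("full_extract.csv", parse_dates=["timestamp"])

# 1) Reduce hourly data to daily aggregates
daily = (
    df.set_index("timestamp")
      .groupby("product_group")
      .resample("D")
      .agg({"amount": "sum", "quantity": "sum"})
      .reset_index()
)
daily.to_csv("daily_summary.csv", index=False)

# 2) Split the extract into one file per group so each team only gets its slice
for group, part in df.groupby("product_group"):
    part.to_csv(f"extract_{group}.csv", index=False)
```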
For a fully functional app:
you can buy a desktop PC (if budget is a constraint, buy a used one or use any desktop/laptop from old inventory) and create a PHP/Python web application that does all the steps automatically
create a local database and link it with the application
create the charting, pivoting, etc. modules in that application and remove Excel from your process altogether (a rough sketch follows after this list)
you can even use some pre-built applications for the charting and pivoting part; Oracle APEX is one example that can be used.
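As a very rough illustration of the "remove Excel" idea, a tiny Python web app that serves a pivot straight from the local database could look something like this (Flask is just one option, and the table/column names here are invented):

```python
# Minimal sketch only: a small Flask app that pivots data from a local MySQL table.
# The table and columns (sales, region, month, amount) are made-up examples.
import pandas as pd
from flask import Flask
from sqlalchemy import create_engine

app = Flask(__name__)
engine = create_engine("mysql+pymysql://report:***@localhost/reports")  # placeholder DSN

@app.route("/pivot")
def pivot():
    df = pd.read_sql("SELECT region, month, amount FROM sales", engine)
    table = df.pivot_table(index="region", columns="month",
                           values="amount", aggfunc="sum")
    # The team opens this page in a browser instead of a 600 MB workbook
    return table.to_html()

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8050)
```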
I need to access about 9 files in a particular SharePoint sub-folder for my Power BI visualisation. Each file holds different data and I need them as separate tables.
I tried the approach below because connecting to the SharePoint folder directly felt really slow.
Sharepoint_Folder_Query > This connects to the SharePoint site and filters for the subfolder. It uses SharePoint.Files, as I was not able to use SharePoint.Contents since the files are in a subfolder.
File_Query > This references the above "Sharepoint_Folder_Query" and picks up the file it needs. There are 9 File_Query(s), one for each file.
Data_Query > There are again 9 data queries, each referencing its respective "File_Query" and performing additional manipulation on the data. These are the tables used in my visualisations.
My idea was that, since connecting to SharePoint takes a lot of time, I would connect just once and use references from then on.
But right now my refresh is taking almost 1 hour... During the refresh I can see each of my queries trying to connect to SharePoint, with the message "Waiting for https://..." under all of them. Not sure what I did wrong.
For comparison, initially I had the files in OneDrive and just used the Data_Query part to create all the visualisations. It then took only 30 seconds to refresh.
So my questions: Is something wrong with my approach? If yes, what? And is there a better way to do this that reduces the refresh time?
I download the end of day stock prices for over 20,000 global securities across 20 different markets. I then run my 20,000 proprietary trading setups over these securities to find profitable trades. The process is simple, but it needs the power of cloud computing to automate because it's impossible to run on a desktop.
I'm coming at this solution as a complete beginner so please excuse my lack of technical understanding.
I download the prices from a single source onto my computer into Microsoft Excel Files.
Do I use Apache Arrow to convert the Excel files into Apache Parquet? I'm considering Parquet because it's a columnar storage format, which is ideal for historical stock price data.
To run my 20,000 proprietary trading setups, I would use Apache Spark to read the Parquet files in my chosen cloud environment.
This would produce the high-probability trade results every day, which would be uploaded to my web-based platform.
This is a very simplified setup based on my current research. Thank you in advance for your assistance.
Kind regards
Levi
I'm sorry but you don't have a big data setup.
What you are doing is using just one computer to convert from Excel files into Parquet. If you are able to read the data and write it back to disk in a reasonable time, it seems you don't have "big data".
What you should do is:
Get data into your datalake using something like Apache NiFi
Use Spark to read data from the data lake. For Excel files see How to construct Dataframe from a Excel (xls,xlsx) file in Scala Spark? (a PySpark sketch of the flow is below).
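For instance, a PySpark version of that flow might look like this (just a sketch: the paths and column names are placeholders, it assumes the spark-excel package is on the cluster, and the exact reader options depend on the spark-excel version):

```python
# Sketch only: land the daily Excel drop as Parquet in the data lake, then run the
# screens over the columnar store with Spark instead of over Excel files.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("eod-screens").getOrCreate()

# Read one day's Excel drop (requires the com.crealytics spark-excel package)
prices = (
    spark.read.format("com.crealytics.spark.excel")
         .option("header", "true")
         .option("inferSchema", "true")
         .load("s3://my-lake/raw/eod_prices_2024-01-31.xlsx")  # placeholder path
)

# Append it to the Parquet store, partitioned by a column such as "market"
prices.write.mode("append").partitionBy("market").parquet("s3://my-lake/prices/")

# The trading setups then read the Parquet history instead of Excel
history = spark.read.parquet("s3://my-lake/prices/")
signals = history.filter(F.col("close") > F.col("sma_200"))  # stand-in for a real setup
signals.write.mode("overwrite").parquet("s3://my-lake/signals/latest/")
```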
I currently have an Excel-based data extraction method using Power Query and VBA (for docs with passwords). Ideally this would be programmed to run once or twice a day.
My current solution involves setting up a spare laptop on the network that will run the extraction twice a day on its own. This works, but I am keen to understand the other options. The task itself seems to be quite a struggle for our standard hardware: it covers 6 network locations across 2 servers with around 30,000 rows and growing.
Any suggestions would be greatly appreciated
Thanks
If you are going to work with growing data and dedicate a laptop exclusively to the process, I would think about installing a database on that laptop (MySQL, for example). You could use Access too... but Access file corruption is a risk.
Download into this DB all the data you need for your report, using incremental loads (only new, modified and deleted info).
Then run the Excel report, extracting from this database on the same computer.
This should improve your solution's performance.
Your bigger problem is probably that you query ALL the data on each report generation.
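Something along these lines could cover the incremental part (an untested sketch; the paths, table and column names are invented, and it only re-loads source files whose modification time changed since the last run):

```python
# Illustrative sketch: re-load only the source files that changed since the last run
# into a local MySQL database, so the Excel report queries the DB instead of the shares.
import os
import time

import pandas as pd
from sqlalchemy import create_engine, text

SOURCES = [r"\\server1\reports\site_a.xlsx", r"\\server2\reports\site_b.xlsx"]  # placeholders
engine = create_engine("mysql+pymysql://report:***@localhost/extracts")  # placeholder DSN
WATERMARK_FILE = "last_run.txt"

last_run = float(open(WATERMARK_FILE).read()) if os.path.exists(WATERMARK_FILE) else 0.0

for path in SOURCES:
    if os.path.getmtime(path) <= last_run:
        continue  # file unchanged since the previous run, skip it
    df = pd.read_excel(path)
    df["source_file"] = os.path.basename(path)
    with engine.begin() as conn:
        # drop the rows that came from this file, then insert the fresh ones
        conn.execute(text("DELETE FROM extract_rows WHERE source_file = :f"),
                     {"f": os.path.basename(path)})
        df.to_sql("extract_rows", conn, if_exists="append", index=False)

with open(WATERMARK_FILE, "w") as f:
    f.write(str(time.time()))
```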
A very simple question...
I have downloaded a very large .csv file (around 3.7 GB) and now wish to open it, but Excel can't seem to manage this.
Please how do I open this file?
Clearly I am missing a trick!
Please help!
There are a number of other Stackoverflow questions addressing this problem, such as:
Excel CSV. file with more than 1,048,576 rows of data
The bottom line is that you're getting into database territory with that sort of size. The best solution I've found is BigQuery on Google Cloud Platform. It's super cheap, astonishingly fast, and it automatically detects schemas on most CSVs. The downside is you'll have to learn SQL to do even the simplest things with the data.
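For example, loading the CSV with the Python client and schema autodetection looks roughly like this (the project, dataset and table names are placeholders):

```python
# Sketch: load the big CSV into BigQuery with schema autodetection, then query it in SQL.
from google.cloud import bigquery

client = bigquery.Client(project="my-project")  # placeholder project
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,   # skip the header row
    autodetect=True,       # let BigQuery infer the schema from the CSV
)
with open("huge_file.csv", "rb") as f:
    job = client.load_table_from_file(f, "my_dataset.huge_table", job_config=job_config)
job.result()  # wait for the load job to finish

rows = client.query("SELECT COUNT(*) AS n FROM my_dataset.huge_table").result()
print(list(rows)[0].n)
```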
Can you not tell Excel to only "open" the file with the first 10 lines?
This would allow you to inspect the format and then use some database functions on the contents.
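If Python is available, a quick way to peek at the format without loading the whole file (the file name is a placeholder):

```python
# Read only the first 10 rows to inspect the columns and inferred types
# before deciding where to load the full 3.7 GB file.
import pandas as pd

preview = pd.read_csv("huge_file.csv", nrows=10)
print(preview)
print(preview.dtypes)
```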
Another thing that can impact whether you can open a large Excel file is the resources and capacity of the computer. That's a huge file, and you need a lot of on-disk swap space (page file in Windows terms) plus memory to open a file of that size. So one thing you can do is find another computer that has more memory and resources, or increase the swap space on your computer. If you have Windows, just google how to increase your page file.
This is a common problem. The typical solutions are
Insert your .CSV file into a SQL database such as MySQL, PostgreSQL etc.
Process your data using Python or R.
Find a data hub for your data. For example, Acho Studio.
The problem with solution one is that you'll have to design a table schema and find a server to host the database. You also need to write server-side code to maintain or change the database. The problem with Python or R is that running processes on GBs of data will put a lot of stress on your local computer. A data hub is much easier, but its costs may vary.
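For the Python route, a sketch that keeps memory bounded by streaming the CSV in chunks into a local SQLite database (the file and table names are placeholders):

```python
# Stream the 3.7 GB CSV in 100k-row chunks into SQLite, then use SQL on the full data
# instead of trying to open it in Excel.
import sqlite3

import pandas as pd

conn = sqlite3.connect("huge_file.db")
for chunk in pd.read_csv("huge_file.csv", chunksize=100_000):
    chunk.to_sql("records", conn, if_exists="append", index=False)

# Ordinary SQL now works on the whole data set
summary = pd.read_sql("SELECT COUNT(*) AS n FROM records", conn)
print(summary)
conn.close()
```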
I am working on a project within Excel and am starting to prepare my document for future performance-related problems. The Excel file contains large amounts of data and large numbers of images, which are all in sets, i.e. 40 images belong to one function of the program, another 50 belong to another, etc., and only one set of them is used at a time.
This file is only going to get bigger as the number of jobs/functions it has to handle increases. Now, I could just make multiple Excel files and let the user choose which one is appropriate for the job, but it has been requested that this all be done from one file.
Bearing this in mind, I started thinking about methods of creating such a large file whilst keeping its performance levels high, and had an idea which I am not sure is possible or not. It is to have multiple protected workbooks, each containing the information for one job "set", and a main workbook which accesses these files depending on the user's inputs. This will result in many Excel files which take time to download initially, but whilst being used it should eliminate the low-performance issues, as the computer only has to access a subset of these files.
From what I understand this is sort of like what DLLs are for, but I am not sure if the same can be done with Excel and, if it is possible, whether the performance increase would be significant.
If anyone has any other suggestions or elegant solutions on how this can be done please let me know.
Rather than saving data such as images in the Excel file itself, write your macro to load the appropriate images from files and have your users select which routine to run. This way, you load only the files you need. If your data is text/numbers, you can store it in a CSV or, if your data gets very large, use a Microsoft Access database and retrieve the data using the ADODB library.
Inserting Images: How to insert a picture into Excel at a specified cell position with VBA
More on using ADODB: http://msdn.microsoft.com/en-us/library/windows/desktop/ms677497%28v=vs.85%29.aspx