Streaming writes to a Microsoft Excel file - excel

I am trying to generate very large Microsoft Excel files in a browser application. While there are JavaScript libraries which allow me to generate XLSX files from the browser, the issue with them is that they require all of the document contents to be loaded in memory before writing them, which puts an upper bound on how much I can store in a single file before the browser crashes. Thus I would like to have a write stream that allows me to write data sequentially into an Excel file using something like StreamSaver.js.
Doing such a thing with CSV would be trivial:
// write each page of results as CSV rows, one line at a time
for (let i = 0; i < paginatedRequest.length; i++) {
  writer.write(paginatedRequest[i].join(",") + "\n");
}
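For reference, a fuller sketch of how that loop might be wired up to StreamSaver.js; the file name, the createWriteStream call and the TextEncoder step are assumptions about the surrounding setup rather than part of the snippet above:

// inside an async function, with StreamSaver.js loaded globally as `streamSaver`
const fileStream = streamSaver.createWriteStream("export.csv");
const writer = fileStream.getWriter();
const encoder = new TextEncoder();

for (let i = 0; i < paginatedRequest.length; i++) {
  // each row is encoded and handed to the download stream immediately,
  // so only the current page of results needs to be held in memory
  await writer.write(encoder.encode(paginatedRequest[i].join(",") + "\n"));
}
await writer.close();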
The approach above would allow me to write an extremely large number of CSV rows to an output stream without having to store all of the data in memory. My question is: is this technically feasible to do with an XLSX file?
My main concern here is that internally XLSX files are ZIP archives, so my first idea was to use an uncompressed ZIP archive and stream writes to it, but every file inside a ZIP archive comes with a header which indicates its size and I can't possibly know that beforehand. Is there a workaround that I could possibly use for this?
Lastly, if that is not possible, are there any other streamable spreadsheet formats which can be opened in Excel and "look nice"? (There is a flat OpenDocument specification with the .fods extension, so I could stream writes to such a file. Sadly, Microsoft Office does not support flat OpenDocument files.)

A possible solution would be to generate a small, static XLSX file which imports an external CSV file using Excel's Data Model. Since generating a streaming CSV file is almost trivial, that could be a feasible solution. However, it's somewhat unsatisfactory:
It's rather annoying to have the user download two files instead of one (or a compressed file that they'd need to uncompress).
Excel does not support relative paths to external CSV files, so we'd also need a macro to make sure the path is updated every time the file is opened (if that is feasible at all). This requires the user to accept the use of macros, which comes with a security warning and is not terribly nice for them.

Related

In Excel how can I update an external file if it has so many rows that it cannot be loaded?

I have a .csv file that has around 2 million rows, and I want to add a new column. The problem is, I could only manage to do that by losing a lot of data (basically everything above ~1.1m rows). When I used a connection to the external file (so that I could read all rows) and made the changes in Power Query, the changes were not saved to the .csv file.
You can apply one of several solutions:
Using a text editor which can handle huge files, split the csv file into smaller chunks. Apply the modifications to each chunk. Join the chunks again to get the desired file.
Create a "small" program yourself which loads the csv line by line and applies the modification, writing the resulting data to a second file (see the sketch after this list).
Maybe some other software can handle a csv of that size. Or patch LibreOffice for this purpose, so that it handles 2,000,000+ lines - the source code is available :)
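A minimal sketch of the second option in Node.js (the answer doesn't prescribe a language; the file names and the appended column value are placeholders):

// stream the csv line by line and append one column, without loading it all into memory
const fs = require("fs");
const readline = require("readline");

const input = readline.createInterface({
  input: fs.createReadStream("input.csv"),
  crlfDelay: Infinity,
});
const output = fs.createWriteStream("output.csv");

input.on("line", (line) => {
  // placeholder transformation: append a new column to every row
  output.write(line + ",new-column-value\n");
});
input.on("close", () => output.end());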

Azure Data Factory deflate without creating a folder

I have a Data Factory v2 job which copies files from an SFTP server to an Azure Data Lake Gen2.
There is a mix of .csv files and .zip files (each containing only one csv file).
I have one dataset for copying the csv files and another for copying the zip files (with Compression type set to ZipDeflate). The problem is that ZipDeflate creates a new folder containing the csv file, and I need it to respect the folder hierarchy without creating any folders.
Is this possible in Azure Data Factory?
Good question, I ran into similar trouble* and it doesn't seem to be well documented.
If I remember correctly, Data Factory assumes a ZipDeflate source could contain more than one file, and it appears to create a folder no matter what.
If, on the other hand, you have Gzip files, which only ever contain a single file, then it will create only that file.
You'll probably already know this bit, but having it at the front of my mind helped me realise why Data Factory's default is sensible:
My understanding is that the Zip standard is an archive format which happens to use the Deflate algorithm. Being an archive format, it can naturally contain multiple files.
Gzip (for example), on the other hand, is just the compression algorithm; it doesn't support multiple files (unless they are tar-archived first), so it will decompress to just a file, without a folder.
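As a small illustration of that difference (a Node.js sketch using the built-in zlib module; this is not something Data Factory runs, just the same distinction in code): gzip wraps exactly one byte stream, so there is nothing to unpack into a folder.

// gzip-compress a single csv: one byte stream in, one compressed stream out,
// with no notion of entries or folders the way a zip archive has
const fs = require("fs");
const zlib = require("zlib");

fs.createReadStream("report.csv")
  .pipe(zlib.createGzip())
  .pipe(fs.createWriteStream("report.csv.gz"));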
You could perhaps add an additional Data Factory step to take the hierarchy and copy it to a flat folder, but that leads to random file names (which you may or may not be happy with). For us it didn't work, as our next step in the pipeline needed predictable filenames.
N.B. Data Factory does not move files, it copies them, so if they're very large this could be a pain. You can, however, trigger a metadata move operation via the Data Lake Store API or PowerShell etc.
*Mine was a slightly crazier situation in that I was receiving files named .gz from a source system which were in fact zip files in disguise! In the end the best option was to ask our source system to change to true gzip files.

Unable to open a large .csv file

A very simple question...
I have downloaded a very large .csv file (around 3.7 GB) and now wish to open it, but Excel can't seem to manage this.
How do I open this file, please?
Clearly I am missing a trick!
Please help!
There are a number of other Stack Overflow questions addressing this problem, such as:
Excel CSV. file with more than 1,048,576 rows of data
The bottom line is that you're getting into database territory with that sort of size. The best solution I've found is BigQuery on Google Cloud Platform. It's super cheap, astonishingly fast, and it automatically detects schemas on most CSVs. The downside is that you'll have to learn SQL to do even the simplest things with the data.
Can you not tell Excel to only "open" the file with the first 10 lines...?
This would allow you to inspect the format and then use some database functions on the contents.
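One way to pull just the first few lines out into a small file that Excel will open comfortably, sketched in Node.js (the file names and the line count are placeholders):

// grab the start of the file and keep only the first 10 lines as a preview
const fs = require("fs");

const fd = fs.openSync("huge.csv", "r");
const buffer = Buffer.alloc(64 * 1024);   // 64 KB is usually plenty for 10 rows
const bytesRead = fs.readSync(fd, buffer, 0, buffer.length, 0);
fs.closeSync(fd);

const firstLines = buffer.toString("utf8", 0, bytesRead).split("\n").slice(0, 10);
fs.writeFileSync("preview.csv", firstLines.join("\n") + "\n");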
Another thing that can impact whether you can open a large file in Excel is the resources and capacity of the computer. That's a huge file, and you need a lot of on-disk swap space (page file in Windows terms) plus memory to open a file of that size. So one thing you can do is find another computer with more memory and resources, or increase the swap space on your own computer. If you have Windows, just search for how to increase your page file.
This is a common problem. The typical solutions are
Insert your .CSV file into a SQL database such as MySQL, PostgreSQL etc.
Process your data using Python or R.
Find a data hub for your data. For example, Acho Studio.
The problem with the first solution is that you'll have to design a table schema and find a server to host the database. You also need to write server-side code to maintain or change the database. The problem with Python or R is that running processes on GBs of data will put a lot of stress on your local computer. A data hub is much easier, but its costs may vary.

Pentaho Data Integration - Excel Writer Output File Size

Is PDI inefficient in terms of writing Excel xlsx files with the Microsoft Excel Writer step?
An Excel data file produced as Pentaho output seems to be three times the size it is when the same data is transformed manually. Is this inefficiency expected, or is there a workaround for it?
A CSV file of the same transformed output is much smaller. Have I configured something wrong?
xlsx files should normally be smaller than CSV, since they consist of XML data compressed into a ZIP file. Pentaho's Microsoft Excel Writer uses org.apache.poi.xssf.streaming.SXSSFWorkbook and org.apache.poi.xssf.usermodel.XSSFWorkbook to write xlsx files, and both create compressed files, so that should not be your issue.
To investigate, you could inspect the files with a zip utility to see the entry sizes and compression ratios and check whether there is a bug. You could also try opening the file in Excel and re-saving it; if that produces a noticeably smaller file, it would point to an inefficiency in how the file is written.
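For example, with the Info-ZIP command-line tools installed, listing the workbook as an archive shows every entry with its compressed size and compression ratio (the filename is just a placeholder):

unzip -v output.xlsx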

Creating an excel library (DLL for excel)?

I am working on a project within Excel and am starting to prepare my document for future performance-related problems. The Excel file contains large amounts of data and large amounts of images, which are all in sets, i.e., 40 images belong to one function of the program, another 50 belong to another, etc., and only one set of them is used at a time.
This file is only going to get bigger as the number of jobs/functions it has to handle increases. Now, I could just make multiple Excel files and let the user choose which one is appropriate for the job, but it has been requested that this all be done from one file.
Bearing this in mind, I started thinking about methods of creating such a large file whilst keeping its performance levels high, and had an idea which I am not sure is possible or not. This is to have multiple protected workbooks, each one containing the information for one job "set", and a main workbook which accesses these files depending on the user's inputs. This will result in many Excel files which take time to download initially, but whilst being used it should eliminate the poor performance, as the computer only has to access a subset of these files.
From what I understand this is sort of like what DLLs are for, but I am not sure if the same can be done with Excel and, if it is possible, whether the performance increase would be significant.
If anyone has any other suggestions or elegant solutions for how this can be done, please let me know.
Rather than saving data such as images in the Excel file itself, write your macro to load the appropriate images from files and have your users select which routine to run. This way, you load only the files you need. If your data is text/numbers, you can store it in a CSV or, if your data gets very large, use a Microsoft Access database and retrieve the data using the ADODB library.
Inserting Images: How to insert a picture into Excel at a specified cell position with VBA
More on using ADODB: http://msdn.microsoft.com/en-us/library/windows/desktop/ms677497%28v=vs.85%29.aspx
