Need your help badly. I am dealing with a workbook which has 7000 rows x 5000 columns of data in one sheet. Each of these data points has to be manipulated and pasted into another sheet. The manipulations are relatively simple; each one takes less than 10 lines of code (simple multiplications and divisions with a couple of Ifs). However, the file crashes every now and then and throws various types of errors. The problem is the file size. To overcome this, I am trying a few approaches:
a) Separate the data and output into different files. Keep both files open, take the data chunk by chunk (typically 200 rows x 5000 columns), manipulate it and paste it into the output file. However, if both files are open, I am not sure this remedies the problem, since the memory consumed will be the same either way, i.e. instead of one file consuming a large amount of memory, it would be two files together consuming the same amount.
b) Separate the data and output into different files. Access the data in the data file while it is still closed by inserting links into the output file through a macro, manipulate the data and paste it into the output file. This can be done chunk by chunk.
c) Separate the data and output into different files. Run a macro to open the data file, load a chunk of data (say 200 rows) into an array in memory, and close it. Process the array, then open the output file and paste in the results.
Which of the three approaches is better? I am sure there are other methods which are more efficient. Kindly suggest.
I am not familiar with Access, but I tried to import the raw data into it and the import failed because Access allows only 255 columns.
Is there a way to keep the file open but swap it in and out of memory? Then slight variations of (a) and (c) above could be tried. (I am afraid repeated opening and closing will crash the file.)
Looking forward to your suggestions.
If you don't want to leave Excel, one trick you can use is to save the base Excel file as binary ".xlsb". This will clean out a lot of potential rubbish that might be in the file (it all depends on where it first came from).
I just shrank a load of web data by 99.5% - from 300 MB to 1.5 MB - by doing this, and now the various manipulations in Excel work like a dream.
The other trick (from the 80s :) ), if you are using a lot of in-cell formulae rather than a macro to iterate through, is to:
turn calculation off,
copy your formulae,
turn calculation on, or just run a calculation manually,
copy and paste-special-values the formula outputs.
My suggestion is to use a scripting language of your choice and handle the decomposition/composition of the spreadsheets there.
I was composing and decomposing spreadsheets back in the day (in PHP, oh the shame) and it worked like a charm. I wasn't even using any libraries.
Just grab the xlutils library for Python and get your hands dirty.
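For what it's worth, here is a minimal sketch of the chunked read-process-write idea in Python. It uses openpyxl rather than xlutils (xlutils targets the legacy .xls format); the file names, chunk size and the transform itself are placeholder assumptions, not anything from the question.

# Chunked read -> transform -> write sketch; assumes openpyxl is installed.
# File names, CHUNK_ROWS and transform() are placeholders.
from openpyxl import Workbook, load_workbook

CHUNK_ROWS = 200

def transform(value):
    # Stand-in for the real "multiplications/divisions with a couple of Ifs".
    if isinstance(value, (int, float)):
        return value * 2 if value < 100 else value / 2
    return value  # leave text/blank cells untouched

src = load_workbook("data.xlsx", read_only=True)   # streams rows, keeps memory low
out = Workbook(write_only=True)                    # streams rows out as well
ws_in = src.active
ws_out = out.create_sheet("output")

chunk = []
for row in ws_in.iter_rows(values_only=True):
    chunk.append([transform(v) for v in row])
    if len(chunk) >= CHUNK_ROWS:
        for r in chunk:
            ws_out.append(r)
        chunk = []
for r in chunk:                                    # flush the last partial chunk
    ws_out.append(r)

out.save("output.xlsx")
src.close()

Because both the read and the write are streamed, only one chunk of rows sits in memory at a time, which is essentially approach (c) from the question without repeatedly opening and closing the files.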
I have a large .xlsx file where each row contains a person's name and various other information. Some rows have duplicate entries throughout the file. I'd like to create a Node.js script that parses the file and deletes the rows with duplicate entries. What is the easiest way to go about this?
I have found Sheet.js to be the easiest way to interact with Excel files in node. They publish the xlsx node module: https://www.npmjs.com/package/xlsx.
The documentation can be a bit confusing, however. If you have specific issues during your implementation, feel free to edit your question with code or ask a new question!
Concerning your specific scenario, the xlsx module comes with some nifty ways to convert spreadsheets to and from arrays of arrays as well as arrays of objects. You say you have "a large .xlsx file". If it is truly massive, you might consider something like a streaming read of the spreadsheet, populating an array of duplicate keys as you go. Then stream the original spreadsheet again into a new document, omitting the entries in the duplicates array.
However, the array-of-arrays helpers etc. might be an easier route. I have done in-memory processing of CSVs with nearly 100,000 rows (~50 MB). It's a bit slow, but definitely possible.
Hope that helps
https://docs.sheetjs.com
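Not an answer to the Node.js part, but purely to illustrate the dedupe logic described above, here is the same idea sketched in Python with openpyxl (keep a set of keys already seen and copy only the first occurrence). The file names and the choice of the name column as the key are assumptions.

# Dedupe-by-key sketch (Python/openpyxl), only to illustrate the logic;
# file names and the key column (0 = the person's name) are assumptions.
from openpyxl import Workbook, load_workbook

src = load_workbook("people.xlsx", read_only=True)
out = Workbook(write_only=True)
ws_out = out.create_sheet("deduped")

seen = set()
for row in src.active.iter_rows(values_only=True):
    key = row[0]              # assume the name is in the first column
    if key in seen:
        continue              # skip later duplicates, keep the first occurrence
    seen.add(key)
    ws_out.append(list(row))

out.save("people_deduped.xlsx")
src.close()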
I'm very new to this tool and I want to do a simple operation:
Dump data from an XML to tables.
I have an Excel file that has around 10-12 sheets, and almost every sheet corresponds to a table.
With the first Excel input operation there is no problem.
The only problem is that, I don't know why, but when I try to edit a second Excel Input step (show the list of sheets, or get the list of columns), the software just hangs, and when it responds it just opens a warning with an error.
This is an image of the actual diagram that I'm trying to use:
This is a typical out-of-memory problem. PDI is not able to read the file and requires more memory to process the Excel file. You need to give PDI more memory to work with your Excel workbook. Try increasing Spoon's memory; you can read "Increase Spoon memory".
Alternatively, try replicating your Excel file with only a few rows of data, keeping the structure of the file as it is, i.e. a test file. You can use that test file to generate the necessary sheet names and columns in the Excel Input step. Once you are done, point the step at the original file and execute the job.
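In case it helps, here is a minimal sketch (outside PDI, assuming Python with openpyxl is available) of building such a structural sample: copy the header row plus a handful of data rows from every sheet into a small test file. The file names and the sample row count are placeholders.

# Build a small structural sample of a large workbook (header + a few rows
# per sheet) so the Excel Input step can be configured quickly.
# File names and SAMPLE_ROWS are placeholders.
from openpyxl import Workbook, load_workbook

SAMPLE_ROWS = 5

src = load_workbook("big_input.xlsx", read_only=True)
out = Workbook(write_only=True)

for name in src.sheetnames:
    ws_in = src[name]
    ws_out = out.create_sheet(name)          # keep the original sheet names
    for i, row in enumerate(ws_in.iter_rows(values_only=True)):
        if i > SAMPLE_ROWS:                  # header row + SAMPLE_ROWS data rows
            break
        ws_out.append(list(row))

out.save("test_sample.xlsx")
src.close()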
I wrote a Python script to generate some data into a CSV file. The data looks something like the following:
12/10/2015 1 0:05:38 0:09:18 0:00:24 0:15:20
5/11/2016 1 0:39:07 3:22:09 0:00:08 4:01:24
7/27/2016 1 0:00:00 0:37:42 0:02:12 0:39:54
8/4/2016 1 0:00:00 0:00:29 0:00:35 0:01:04
10/3/2016 1 0:05:51 0:50:46 0:00:17 0:56:54
The data I am interested in analyzing is in the form h:mm:ss, but the formulas I write to sum it don't work. I figured out that ISTEXT(CELLNUM) returns TRUE, so Excel is clearly treating the data as text even if I manually reformat the cells as h:mm:ss. I must be overlooking something simple, because there must be an easier way to do this without going through a process every time I open a CSV in Excel and save it as a spreadsheet. How can I open this CSV in Excel and save it as a spreadsheet in a way that lets me set up formulas to sum the times? I might end up creating a lot of these CSV files, so I need a way to do it that is fast. What am I missing? Why isn't simply selecting all of the cells and reformatting them working?
The best answer was posted here by jeeped:
When you have pasted data from an external source (e.g. web pages are horrific for this) into a worksheet and numbers, dates and/or times come in as textual representations rather than true numbers, dates and/or times, usually the quickest method is to select the column and choose Data ► Text to Columns ► Fixed Width ► Finish. This forces Excel to reevaluate the text values and should revert the pseudo-numbers into their true numerical values.
It's strange that Excel can't figure this out or provide a way to do it as the data is imported. It can handle dates during import, but not times. However, the fact that I can so easily fix the time values one column at a time after saving as an .xlsx file makes me wonder why Microsoft never bothered to make it easier to specify what the columns are when bringing the data in the first time. Instead, I had to search the internet for hours on end to ultimately find a solution that takes just a minute or two. Weird. There are some other answers posted for other types of data where you can use Paste Special to add a number to the existing data, but those solutions do not seem to work for times.
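Since the CSV comes out of a Python script in the first place, one alternative worth sketching (assuming openpyxl is available; the file name and sample rows below are made up) is to skip the CSV round-trip and write an .xlsx with real time values directly, so Excel never sees the durations as text. Excel stores times as fractions of a day, so seconds / 86400 plus a number format gives values that SUM() can handle.

# Write durations as real Excel time values instead of text.
# File name and sample data are made up for illustration.
from openpyxl import Workbook

def hms_to_day_fraction(hms):
    h, m, s = (int(part) for part in hms.split(":"))
    return (h * 3600 + m * 60 + s) / 86400.0   # Excel times are day fractions

rows = [
    ("12/10/2015", 1, "0:05:38", "0:09:18", "0:00:24", "0:15:20"),
    ("5/11/2016",  1, "0:39:07", "3:22:09", "0:00:08", "4:01:24"),
]

wb = Workbook()
ws = wb.active
for r, (date_str, flag, *durations) in enumerate(rows, start=1):
    ws.cell(row=r, column=1, value=date_str)   # the date is left as text here
    ws.cell(row=r, column=2, value=flag)
    for c, d in enumerate(durations, start=3):
        cell = ws.cell(row=r, column=c, value=hms_to_day_fraction(d))
        cell.number_format = "[h]:mm:ss"       # [h] keeps totals over 24h readable
wb.save("durations.xlsx")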
I'm finishing up a program that I built to import an Excel file into a database, do some manipulations/edits, and then spit back out the edited Excel file. My problem is that the file size ballooned from approx. 3 MB to ~19 MB.
It has the same record count (~20k). It has ~3 more columns (out of 40+ columns total), but that shouldn't make the file size 6x, should it? Below is the code I use for the output:
DoCmd.OutputTo acOutputQuery, "Q_Export", acFormatXLS, txtFilePath & txtFileName
Any ideas on how I can get that file size down a bit more? Or does anyone have an indication of what is causing it, at least?
There are three possible reasons that spring immediately to mind:
1) You are importing more records than you think you are. Check the table after the Excel file has been imported. Make sure the table has EXACTLY as many records as there are rows of data in Excel. Often the import process will bring in many empty records, and that data is then exported as empty strings. To you it looks the same, but to Excel it's information that must be stored, which takes up space.
2) Excel handles NULL values differently than Access does. If your data has a lot of missing information, it's going to be stored differently when it's imported into Access. This actually brings us back to Reason #1.
3) When you import data, sometimes it comes in with trailing spaces. Make sure you TRIM() your data before exporting it, to get rid of any storage space being used unnecessarily.
First, I don't have any experience with programming. If I ever start, this would probably be my first project. I kept looking for an answer until I found this site.
I am looking outside the box because, in Excel, working with 1 million+ rows and 20+ columns takes a very long time just waiting for the calculation to finish, and copying and pasting with formulas takes even longer. Imagine having to leave the computer running for 8+ hours with the help of macros and F4 (repeat). All my formulas have to be pasted as values only once I am done with them. And even when I break the files into pieces, the file sizes are 20 MB to 110 MB without active formulas. Opening a file takes forever.
I wonder how to write a program with 1) a dialog box, 2) the Excel commands and formulas (sort, delimiter, concatenate), 3) the ability to create graphs, 4) tabs to view different sets of data or graphs, 5) the ability to add in a set of data, 6) limits on the numbers (1-100000), etc. The look should be something like uTorrent.
What compiler is suitable for this program? It's easier if you tell me which 'book' to read than for me to find out which 'book' is suitable, because even if it is, I might just flip through it and go on to the next one. 'Book' may refer to a book, a way, steps, etc.
I'm not sure what you actually want. With 1M+ rows and 20+ columns, an Excel sheet doesn't seem to be the right tool for the job. So do you...
want to keep using Excel, but automate the job? Use Excel VBA like renick suggested. It's the language that Excel uses internally for macros, but you can write any kind of automated processing you'd like. Beware, however, that VBA is not exactly the best language to start a programming experience with. (That's my personal opinion, and what matters is of course whether you get the job done).
want to switch to something else? A database management system seems better suited for the amount of data you have. Microsoft Access is part of Office and might already be on your system. Getting your data into and out of the database could be a problem, but the advantage you have is that a database is built to handle colossal amounts of data and will happily munch your figures for several days without failing. You can access the data using the Structured Query Language (SQL), which is not really a programming language, but very powerful (and it most certainly has CONCATENATE, ADD etc.). Graphing is more difficult, but can also be done.
If you know Excel, then Excel VBA is a VERY capable language to do all this. I would suggest you go to the VBA Dev Center here to get started.
I can't believe I'm about to say this (for most things I do it would be the wrong choice) but:
If the computations aren't that complex (just lots of them) Python might be a good bet.
If you can get the input as a CSV file then, in about 10 lines of code, you can write a loop that runs for each line of input and hands you the values to play with.
for line in open('filename', 'r'):          # open for reading, not 'w'
    values = line.rstrip('\n').split(',')
    # values holds the fields from this line as strings;
    # these can be converted to numbers:
    x = float(values[0])
    n = int(values[1])
    # ... and then processed
That might not be the cleanest/best approach, but it's simple and straightforward.
p.s. For 1M+ rows, don't expect it to be blazing fast (10 sec to a min or so, depending on what you do to the data)
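If the fields could ever contain quoted commas, a slightly more robust variant of the same loop (a sketch using the standard csv module, with the same placeholder file name) would be:

import csv

with open('filename', newline='') as f:     # the csv module handles quoting for you
    for values in csv.reader(f):
        x = float(values[0])                # convert fields as needed
        n = int(values[1])
        # ... and then process x, n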