How can I speed up my openpyxl program reading Excel .xlsx files? - python-3.x

Like many people, I need to read both .xls files (I call them S files, read with xlrd) and .xlsx files (the X files, read with openpyxl), in both cases files of about 30,000 rows. In both cases I'm just copying all the Excel data out to a .csv file with no other processing, so it is purely input/output.
But the X file operations are over 200 times slower than for .xls: reading a 30,000-row .xlsx file currently takes 2 minutes, compared to half a second for .xls with xlrd. We have thousands of files to process, so the time per file matters.
Is openpyxl really that much slower, or do I need to do something, like release some resource at the end of each row?
BTW, I have already made several big improvements by using read_only=True and by reading a row at a time instead of cell by cell, as shown in the following code segment. Thanks to blog.davep.org:
https://blog.davep.org/2018/06/02/a_little_speed_issue_with_openpyxl.html
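import openpyxl

# read_only=True streams rows lazily; data_only=True returns cached values instead of formulas
wb = openpyxl.load_workbook("excel_file.xlsx", data_only=True, read_only=True)
sheet = wb.active
for row in sheet.rows:
    for cell in row:
        cell_from_excel = cell.value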
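
For the pure copy-to-CSV case described above, one possible approach is to ask openpyxl for plain values rather than cell objects and hand each row straight to csv.writer. The sketch below assumes openpyxl 2.6 or later (for values_only=True); the function names rows_to_csv and xlsx_to_csv and the file paths are made up for illustration. Whether this closes the gap with xlrd depends on the file, so treat it as something to measure rather than a guaranteed fix.

```python
import csv


def rows_to_csv(rows, csv_path):
    """Write an iterable of row tuples to csv_path and return the row count."""
    count = 0
    # newline="" lets the csv module control line endings itself
    with open(csv_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        for row in rows:
            writer.writerow(row)  # None values are written as empty fields
            count += 1
    return count


def xlsx_to_csv(xlsx_path, csv_path):
    """Stream the active sheet of xlsx_path into csv_path (hypothetical helper)."""
    import openpyxl  # third-party; imported here so rows_to_csv works without it

    wb = openpyxl.load_workbook(xlsx_path, read_only=True, data_only=True)
    try:
        sheet = wb.active
        # values_only=True yields tuples of plain values, skipping cell objects
        return rows_to_csv(sheet.iter_rows(values_only=True), csv_path)
    finally:
        wb.close()  # read-only workbooks keep the file open until closed
```

A call such as xlsx_to_csv("excel_file.xlsx", "excel_file.csv") would then replace the nested loop; timing both versions on one representative file is the quickest way to see whether it helps.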

Related

Using Query function in Excel to combine data from multiple CSV files into one excel worksheet with each CSV parallel to last?

I have a bunch of CSV files, each with 2 columns: time and value. Can anyone help me use MS Query to import data from multiple CSV files into one worksheet, with one time column and then the data from each CSV file as a new subsequent column? So far I've got it to put each CSV dataset into one ginormous worksheet with 1000s of rows, but I have no idea how to make it put them next to one another.
Many thanks

How to Append CSV file via VBA without reading empty formatted cells

I have code that appends a range of data in a worksheet to an existing CSV file. However, I noticed that when the existing CSV file contains empty but formatted cells beyond the end of the data range, the append takes those empty cells into account as well.
For example, the existing CSV file has 10 rows of data, while rows 11 to 20 are empty but have been formatted (e.g. as "dd-mmm-yy"). So when I append to the CSV file, the new data is added at row 21 instead of row 11.
Apart from manually deleting rows 11 to 20 in the CSV file, is there a quick fix for this? I have many existing CSV files, so doing it manually is not feasible. As to why some of the rows are empty but formatted, it is due to some earlier amendments.
Appreciate the help.
If you've written that code, it needs to account for the formatted cells you want to ignore. Let's say you're exchanging data between an .xlsx file and a .csv file, and all of this is being done inside Excel. If, for example, "General" is the format that works best for you, but some trailing cells in column A of the .csv sheet are not in that format, have your code loop over the rows and check whether Worksheets("Book1").Range("A[whatever row]").NumberFormat = "General".
Then have your code change the ones that aren't to General, and proceed with the append.
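
If the append step can run outside Excel, one way to sidestep the formatting question is to treat any trailing row whose fields are all empty as padding, drop it, and then append. This is a minimal Python sketch using only the standard csv module, not the VBA approach described above. The name append_after_last_data_row is hypothetical, and the code assumes the file is small enough to read into memory, that it may be rewritten in place, and that blank rows in the middle of the data should be kept.

```python
import csv


def append_after_last_data_row(csv_path, new_rows):
    """Drop trailing all-empty rows from csv_path, then append new_rows.

    Returns the number of rows in the file afterwards.
    """
    with open(csv_path, newline="", encoding="utf-8") as handle:
        rows = list(csv.reader(handle))

    # Strip only trailing rows where every field is empty or whitespace
    while rows and all(field.strip() == "" for field in rows[-1]):
        rows.pop()

    rows.extend(list(row) for row in new_rows)

    # Rewrite the whole file so the new data follows the last real row
    with open(csv_path, "w", newline="", encoding="utf-8") as handle:
        csv.writer(handle).writerows(rows)
    return len(rows)
```

With the example from the question, 10 data rows followed by 10 empty ones, the new data would land at row 11.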

Excel Save CSV without blanks

I have an Excel spreadsheet that generates CSV scripts used in an application. The scripts must be in a very specific format, and I save a master in XLSX format with protected sheets and data validation to save the CSVs from rather than directly edit the CSVs, as directly editing the CSVs can lead to mistakes.
The issue is that the scripts can be of nearly any length. The left column of each line can only be one of a certain set of values, and the last line has to say "END". The only way I can do this without VBA is the following formula in the A column, from row 7 (the first 6 are header information) to row 1048576 (last Excel row) and protect the sheet with column A locked:
=IF(AND(ISBLANK(B368),NOT(ISBLANK(B367))),"END",IF(ISBLANK(B368),"",A367))
This makes the last row say "END" in column A and leaves all rows after it blank, which is what is desired. The problem is that now, when the CSV file is saved, it always has 1048576 rows, with all the bottom rows containing only the delimiters ",,,,". This won't work: the CSV file needs to stop after the "END" row. Is there a way to write the formula so that Excel ignores the cells which evaluate to blank when saving to CSV, or an alternate way to save to CSV in Excel that ignores all the rows that evaluate to blank?
Note: I have a solution in VBA already that I can use on my own machine (it copies the data up to "END", pastes in a new sheet in text only format, then saves as CSV with the name of the original worksheet). I want to share this sheet, however, and getting around the security constraints to share macros at my company is a pain. So I'm looking for a way this might be done without Macros, if it's possible at all.
In looking for an answer I found this link, which is similar, but not the same:
Saving Excel data as csv with VBA - removing blank rows at end of file to save
As the "blanks" I have are active rows because they contain formulas, this method will not work.
Manually deleting the rows/columns will work to reset the size, as GSerg noted in the other question. Alternatively, as GSerg also suggested, you can copy the data to a new sheet before saving.
Otherwise, an easy fix might be a small script that post-processes the file after Excel saves it and before the application reads it, removing the extra rows. This could be a batch file (see "Batch / Find And Edit Lines in TXT file") or a similar solution in any small scripting language.
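
As one illustration of such a post-processing script, here is a minimal Python sketch that copies the saved CSV up to and including the "END" row and discards everything after it. The function name truncate_after_end and the two-file approach are assumptions made for the example; it relies only on what the question states, namely that the marker appears in the first column of the last script line.

```python
import csv


def truncate_after_end(src_path, dst_path, marker="END"):
    """Copy rows from src_path to dst_path, stopping after the marker row.

    Returns the number of rows written.
    """
    written = 0
    with open(src_path, newline="", encoding="utf-8") as src, \
         open(dst_path, "w", newline="", encoding="utf-8") as dst:
        writer = csv.writer(dst)
        for row in csv.reader(src):
            writer.writerow(row)
            written += 1
            # The first column holds the marker on the final script line
            if row and row[0] == marker:
                break
    return written
```

Because the loop stops reading at the marker, the remaining delimiter-only rows are never processed.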

Matlab: Excel COM actxserver read entire sheet

I have a large amount of measurement data in Excel files. I need to read and process this data using MATLAB. The problem I have is that not all Excel files contain the same amount of data.
I have tried the following
excel = actxserver('excel.application');
xlswb = excel.Workbooks.Open(myFile);
data = xlswb.Sheets.Item(mySheet).get('Range', 'A1', 'L10000').Value;
This "works" as there will not be more than 10000 rows and 8 columns (in my case). However, this is very dirty code and I have to check for each file where the data actually ends.
Is there a way to read a whole sheet (similar to the xlsread function)?
Thanks!
Sheets("SheetName").UsedRange will give you a range containing every used cell in that sheet. However, if cell L10000 once held data that was later cleared, it will still be part of that range.
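
The caveat about cleared cells means a "used range" can be larger than the real data. A language-neutral way to cope is to read the whole block and then trim trailing empty rows and columns. The sketch below illustrates the idea in Python on a plain list of rows, with None or an empty string standing for an empty cell; trim_used_block is a hypothetical helper, not part of MATLAB or the Excel COM interface.

```python
def trim_used_block(block):
    """Return block without trailing all-empty rows and columns.

    block is a list of rows (lists); None or "" marks an empty cell.
    """
    def is_empty(value):
        return value is None or value == ""

    last_row = 0  # one past the last row that holds any data
    last_col = 0  # one past the last column that holds any data
    for r, row in enumerate(block):
        for c, value in enumerate(row):
            if not is_empty(value):
                last_row = max(last_row, r + 1)
                last_col = max(last_col, c + 1)

    return [list(row[:last_col]) for row in block[:last_row]]
```

Empty cells inside the data are kept; only the empty margin at the bottom and right is removed.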

excel into matlab

I have several excel spreadsheets in a folder, where each spreadsheet contains several worksheets. I've written a code which loads a specific worksheet from each spreadsheet into matlab. The worksheet is called 'Bass min'.
files = dir('*.xls');
% read data from Excel into MATLAB
for i = 1:length(files)
    File_Name{i} = files(i,1).name; % extract the file name from 'files'
    [num{i},txt{i},raw{i}] = xlsread(File_Name{i},'Bass min');
end
Is there a faster way of doing this? As I have many spreadsheets, it takes a long time to read them. I've heard some people mention actxserver as a faster method, but I don't know how this would work!
Many thanks
You could try reading the files in basic mode, in which case MATLAB reads the files directly without going through Excel:
[num{i},txt{i},raw{i}] = xlsread(File_Name{i},'Bass min','','basic');
