Extremely slow adding a table to a Word document from a CSV file with python-docx - python-3.x

I have to add a table from a CSV file of around 1500 rows and 9 columns (about 75 pages) to a docx Word document, using python-docx.
I have tried different approaches, reading the CSV with pandas or opening the CSV file directly, but it takes around 150 minutes to finish the job regardless of the way I choose.
My question is whether this is normal behavior, or whether there is any other way to speed this task up.
I'm using this loop to read several CSV files and add each one as a table:
import pandas as pd
from docx.shared import Pt

# `doc` is a docx.Document and `listBRUTO` is the list of CSV paths,
# both created earlier in the script
for toTAB in listBRUTO:
    df = pd.read_csv(toTAB)
    # add a table to the end and create a reference variable
    # extra row is so we can add the header row
    t = doc.add_table(df.shape[0] + 1, df.shape[1])
    t.style = 'LightShading-Accent1'  # border
    # add the header row
    for j in range(df.shape[-1]):
        t.cell(0, j).text = df.columns[j]
    # add the rest of the data frame
    for i in range(df.shape[0]):
        for j in range(df.shape[-1]):
            t.cell(i + 1, j).text = str(df.values[i, j])
    # table format
    for row in t.rows:
        for cell in row.cells:
            paragraphs = cell.paragraphs
            for paragraph in paragraphs:
                for run in paragraph.runs:
                    font = run.font
                    font.name = 'Calibri'
                    font.size = Pt(7)
    doc.add_page_break()
doc.save('blabla.docx')
Thanks in advance

You'll want to minimize the number of calls to table.cell(). Because of the way cell-merging works, these are expensive operations that really add up when performed in a tight loop.
I would start with refactoring this block and see how much improvement that yields:
# --- add the rest of the data frame ---
for i in range(df.shape[0]):
    for j, cell in enumerate(table.rows[i + 1].cells):
        cell.text = str(df.values[i, j])
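The header-row loop can be handled the same way; here is a small sketch of my own (not part of the original answer) that replaces the repeated t.cell(0, j) calls from the question:
# --- fill the header row from a single row access ---
header_cells = table.rows[0].cells
for j, column_name in enumerate(df.columns):
    header_cells[j].text = str(column_name)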

python-docx walks the whole table every single time you access its "cells" property, so you should call ".cell" as little as possible and cache the cells instead.
Here are two examples accessing a table of size 3*1500:
Code 1: about 150.0 s
for row in table.rows:
    print('processing: {0:30s}'.format(row.cells[0].text), end='\r')
Code 2: about 1.4 s
clls = table._cells
for row_idx in range(len(clls) // table._column_count):
    print('processing: {0:30s}'.format(
        clls[0 + row_idx * table._column_count].text), end='\r')
The clls = table._cells line in code 2 uses "_cells" to handle the cell-merging once, so clls[column_idx + row_idx * table._column_count].text works just as well as table.rows[row_idx].cells[column_idx].text, and it doesn't require the table to be exactly rectangular.
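Applied to the question's fill loop, the same cache looks roughly like this (a sketch, assuming t was created with df.shape[0] + 1 rows as in the question):
cells = t._cells                      # one walk over the table's XML
ncols = t._column_count
for i in range(df.shape[0]):
    for j in range(ncols):
        cells[(i + 1) * ncols + j].text = str(df.values[i, j])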

For a rectangular table without merged cells you can export all cells into a list-of-lists structure and fill them very quickly (less than 0.5 s vs 15 s for a ~300-row table with 3 columns):
from docx.table import _Cell

def get_cells_grid(table):
    cells = [[]]
    col_count = table._column_count
    for tc in table._tbl.iter_tcs():
        cells[-1].append(_Cell(tc, table))
        if len(cells[-1]) == col_count:
            cells.append([])
    return cells

cells = get_cells_grid(t)
for i in range(df.shape[0]):
    for j in range(df.shape[1]):
        cells[i + 1][j].text = str(df.values[i, j])  # row 0 holds the header
The function is based on the table._cells property code: https://github.com/python-openxml/python-docx/blob/da75fcf01f7f322e846e2ac3e1936aedd766acc8/docx/table.py#L162

Just to add my experience: if you have to create a huge table, create the whole structure first, meaning all the rows and cells you will need, and then store the cells like so:
table_cells = table._cells (according to @kztopia)
From there you can manipulate cells as you wish, merging, adding text, etc., with rather optimized speed, since you make only one call to cell().
In my use case, for a table that is in my opinion not so big (~130 rows, 8 cells per row), it used to take 9 s to create the whole thing and now I'm at 0.5 s or so.
Keep in mind that the bigger the table, the more time each call to cell() takes.
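A rough sketch of this approach (my own illustration, not the answerer's exact code), assuming doc and df as in the question; it also applies the font in the same pass so no cell is visited twice:
from docx.shared import Pt

table = doc.add_table(rows=df.shape[0] + 1, cols=df.shape[1])  # full structure first
table_cells = table._cells                 # single pass over the table
ncols = df.shape[1]
all_rows = [list(df.columns)] + df.values.tolist()   # header row + data rows
for i, row_values in enumerate(all_rows):
    for j, value in enumerate(row_values):
        cell = table_cells[i * ncols + j]
        cell.text = str(value)
        run = cell.paragraphs[0].runs[0]   # setting .text leaves a single run
        run.font.name = 'Calibri'
        run.font.size = Pt(7)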

Related

Clean data in excel that comes in varying formats

I have an Excel table that contains values in these formats. The table spans over 30,000 entries.
I need to clean this data so that only the numbers directly after V- are left. This would mean that when the value is SV-51140r3_rule, V-4407..., I would only want 4407 to remain, and when the value is SV-245744r822811_rule, I would only want 245744 to remain. I have about 10 formulas that can handle these variations, but it requires a lot of manual labor. I've also used the text-to-columns feature of Excel to clean this data, but it takes about 30 minutes to an hour to go through the whole document. I'm looking for ways to streamline this process so that one formula or function can handle all of these variations. I'm open to using VBA but don't have a whole lot of experience with it, and I am unable to use Pandas or any IDE or programming language. Help please!!
I've used text to columns to clean data that way and I've used a variation of this formula
=IFERROR(RIGHT(A631,LEN(A631)-FIND("#",SUBSTITUTE(A631,"-","#",LEN(A631)-LEN(SUBSTITUTE(A631,"-",""))))),A631)
Depending on your version of Excel, either of these should work. If you have the ability to use the LET function, it will improve performance, as this outstanding article articulates.
If you're on a really old version of Excel, you'll need to hit Ctrl+Shift+Enter to make the array formula work.
While these look daunting, all these functions are doing is finding the text after the last V (using this expression) =SUBSTITUTE(RIGHT(SUBSTITUTE(A2,"V",REPT("🍄",999)),999),"🍄","") and then looping through each character and returning only the numbers.
Obviously the mushroom 🍄 could be any character that is improbable to appear in the actual data.
Old School
=TEXTJOIN("",TRUE,IF(ISNUMBER(MID(MID(SUBSTITUTE(RIGHT(SUBSTITUTE(A2,"V",REPT("🍄",999)),999),"🍄",""),
FIND("-",SUBSTITUTE(RIGHT(SUBSTITUTE(A2,"V",REPT("🍄",999)),999),"🍄","")),9^9),
FILTER(COLUMN($1:$1),COLUMN($1:$1)<=LEN(MID(SUBSTITUTE(RIGHT(SUBSTITUTE(A2,"V",REPT("🍄",999)),999),"🍄",""),
FIND("-",SUBSTITUTE(RIGHT(SUBSTITUTE(A2,"V",REPT("🍄",999)),999),"🍄","")),9^9))),1)+0),
MID(MID(SUBSTITUTE(RIGHT(SUBSTITUTE(A2,"V",REPT("🍄",999)),999),"🍄",""),
FIND("-",SUBSTITUTE(RIGHT(SUBSTITUTE(A2,"V",REPT("🍄",999)),999),"🍄","")),9^9),
FILTER(COLUMN($1:$1),COLUMN($1:$1)<=LEN(MID(SUBSTITUTE(RIGHT(SUBSTITUTE(A2,"V",REPT("🍄",999)),999),"🍄",""),
FIND("-",SUBSTITUTE(RIGHT(SUBSTITUTE(A2,"V",REPT("🍄",999)),999),"🍄","")),9^9))),1),""))
Let Function
(use this if you can)
=LET(zText,SUBSTITUTE(RIGHT(SUBSTITUTE(A2,"V",REPT("🍄",999)),999),"🍄",""),
TEXTJOIN("",TRUE,IF(ISNUMBER(MID(MID(zText,FIND("-",zText),9^9),
FILTER(COLUMN($1:$1),COLUMN($1:$1)<=LEN(MID(zText,FIND("-",zText),9^9))),1)+0),
MID(MID(zText,FIND("-",zText),9^9),
FILTER(COLUMN($1:$1),COLUMN($1:$1)<=LEN(MID(zText,FIND("-",zText),9^9))),1),"")))
VBA Custom Function
You could also use a VBA custom function to accomplish what you want.
Function getNumbersAfterCharcter(aCell As Range, aCharacter As String) As String
    Const errorValue = "#NoValuesInText"
    Dim i As Long, theValue As String
    For i = Len(aCell.Value) To 1 Step -1
        theValue = Mid(aCell.Value, i, 1)
        If IsNumeric(theValue) Then
            getNumbersAfterCharcter = Mid(aCell.Value, i, 1) & getNumbersAfterCharcter
        ElseIf theValue = aCharacter Then
            Exit Function
        End If
    Next i
    If getNumbersAfterCharcter = "" Then getNumbersAfterCharcter = errorValue
End Function
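For reference only (the asker can't use a programming language, but it may help to see the intended logic spelled out): the same "digits after the last V-" extraction that the formulas and the VBA function perform can be sketched in a few lines of Python:
import re

def digits_after_last_v(value):
    # keep the digits of the last "V-<digits>" occurrence, if any
    matches = re.findall(r"V-(\d+)", str(value))
    return matches[-1] if matches else value

print(digits_after_last_v("SV-51140r3_rule, V-4407"))    # 4407
print(digits_after_last_v("SV-245744r822811_rule"))      # 245744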

Table Segmentation, Data Reduction, Sorting and Writing in MATLAB/Excel

I would like to do a data reduction operation on a spreadsheet. Preferably I would like to use MATLAB (or Excel), since I need separate output files for each case.
The link for the spreadsheet is below:
Spreadsheet
A screenshot of the spreadsheet is below.
The output I require in text files is something like the below.
The first sheet in the .xls file is the main input, whereas the following sheets (d**) are my required output. I also need these sheets as separate ASCII files (.dat) to plot them later on. Here is how the algorithm works:
1. Look up the number/string in column B (FileName).
2. Extract all data in columns C and D (Saturation and ETC) with the same FileName value (column B).
3. Look up the matching FileName (column B) value in column E (ImageIndex).
4. Copy the value of ImageName (column F) to the corresponding value in ImageIndex (column E).
5. The result would be three columns (ImageName, Saturation, ETC). ImageName would be the same for each subcase.
6. Sort the columns based on Saturation.
7. Write each subcase to a separate .dat file.
I tried a few recipes using categorical arrays (findgroups and splitapply) in MATLAB, but they didn't seem to work out for me. I will later be working on a larger data set, so automation is necessary. I think this could be done using macros in Excel, but I would prefer MATLAB since I will use MATLAB to plot the data. Any other alternative suggestions are welcome.
Thanks,
Here's a Matlab solution. You could do it with a rather convoluted accumarray call, but readability would be rather bad, so I'm opting for a loop here.
out is a structure which you can use to either write files, or to plot the data.
tbl = readtable('yourFile.xls');
%# get the group indices for the files
%# this assumes that you have cleaned up the dash after the 1
%# so that all of the entries in the FileName column are numeric
idx = tbl.FileName;
%# the uIdx business is to account for the possibility
%# that there are images missing from the sequence
uIdx = unique(idx);
nImages = length(uIdx);
%# preassign output structure
out(1:nImages) = struct('name','','saturation',0,'etc',0);
%# loop to extract relevant information
for iImage = uIdx(:)'
    myIdx = idx==iImage;
    data = tbl(myIdx,{'Saturation','ETC'});
    data = sortrows(data,'Saturation');
    name = tbl.ImageName{tbl.ImageIdx==iImage};
    out(iImage==uIdx).name = name;
    out(iImage==uIdx).saturation = data.Saturation;
    out(iImage==uIdx).etc = data.ETC;
end
%# plotting
for iImage = 1:nImages
    figure('name',out(iImage).name)
    plot(out(iImage).saturation, out(iImage).etc,'.');
end

Can pandas implicitly determine header based on value, not row?

I work with people who use Excel and continuously add or subtract rows unbeknownst to me. I have to scrape a document for data, and the row where the header is found changes based on moods.
My challenge is to handle these oscillating currents by detecting where the header is.
I first organized my scrape using xlrd and a number of conditional statements using the values in the workbook.
My initial attempt works and is long (so I will not publish it) but involves bringing in the entire sheet, and not slices:
from xlrd import open_workbook
book = open_workbook(fName)
sheet = book.sheet_by_name(sht)
return book,sheet
However, it is big and I would prefer a more targeted selection. The header values never change, nor does the fact that the data starts right after the header row.
Do you know of a way to implicitly get the header based on a found value in the sheet using either pandas.ExcelFile or pandas.read_excel?
Here is my attempt with pandas.ExcelFile:
import pandas as pd
xlsx = pd.ExcelFile(fName)
dataFrame = pd.read_excel(xlsx, sht,
                          parse_cols=21, merge_cells=noMerge,
                          header=header)
return dataFrame
I cannot get the code to work unless I give the call the correct header value, which is exactly what I'm hoping to avoid.
This previous question seems to present a similar problem without addressing the concern of finding the headers implicitly.
Do the same kind of loop against the ExcelFile object:
xlsx = pd.ExcelFile(fName)
sheet = xlsx.book.sheet_by_name(sht)  # the underlying xlrd workbook
# apply the same algorithm you wrote against xlrd here
# ... results in having header_row = something, 0-based
dataFrame = pd.read_excel(xlsx, sht,
                          parse_cols=21, merge_cells=noMerge,
                          skiprows=header_row)
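If you prefer to stay entirely in pandas, a minimal sketch (using the current pandas API; the file name, sheet name, and the "Date" header value below are placeholders) is to read the sheet once with header=None, find the row containing a header value you know never changes, and read again with that row as the header:
import pandas as pd

def read_with_detected_header(path, sheet_name, known_header_value):
    # first pass: no header, every cell comes in as data
    raw = pd.read_excel(path, sheet_name=sheet_name, header=None)
    # boolean mask of rows that contain the known header value
    mask = raw.apply(lambda row: row.astype(str).eq(known_header_value).any(), axis=1)
    if not mask.any():
        raise ValueError("header value %r not found" % known_header_value)
    header_row = mask.idxmax()  # first matching 0-based row
    # second pass: let pandas use that row as the header
    return pd.read_excel(path, sheet_name=sheet_name, header=header_row)

df = read_with_detected_header('report.xlsx', 'Sheet1', 'Date')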

How to loop through excel sheets, perform calculations, and compile results

I have roughly 70,000 sheets that all have to have calculations done, and then all results compiled into a new sheet (which would be 70,000 lines long).
It needs to be sorted by date.
I'm VERY very very poor at MATLAB, but I've worked out what I need the script to do for each Excel sheet; I'm just unsure how to make it do that for all of them.
Thank you!!! (I took out some of the unimportant code.)
%Reading in excel sheet
B = xlsread('24259893-008020361800.TorqueData.20160104.034602AM.csv');
%Creating new matrix
[inYdim, inXdim] = size(B);
Ydim = inYdim;
[num,str,raw]=xlsread('24259893-008020361800.TorqueData.20160104.034602AM.csv',strcat('A1:C',num2str(Ydim)));
%Extracting column C
C=raw(:,3);
for k = 1:numel(C)
    if isnan(C{k})
        C{k} = '';
    end
end
%Calculations
TargetT=2000;
AvgT=mean(t12);
TAcc=((AvgT-TargetT)/TargetT)*100 ;
StdDev=std(B(ind1:ind2,2));
ResTime=t4-t3;
FallTime=t6-t5;
DragT=mean(t78);
BreakInT=mean(t910);
BreakInTime=(t10-t9)/1000;
BreakInE=BreakInT*BreakInTime*200*.1047;
%Combining results
Results=[AvgT TAcc StdDev ResTime FallTime DragT BreakInT BreakInTime BreakInE]
I think I need to do something along the lines of:
filenames=dir('*.csv')
and I found this that may be useful:
filenames=dir('*.csv');
for file=filenames'
    csv=load(file.name);
    % with stuff in here
end
You have the right idea, but you need to index your file names in order to be able to step through them in the for loop.
FileDir = 'Your Directory';
FileNames = {'Test1';'Test2';'Test3'};
for k=1:length(FileNames)
    file=[FileDir,'/',FileNames{k}];
    [outputdata]=xlsread(file,sheet#, data locations);
    % THE REST OF YOUR LOOP, indexed by k
end
How you choose to get the file names and directory is up to you.

How to get MATLAB xlsread to read until a last row of a contiguous <<data-range>>?

I want to use xlsread in MATLAB to read an Excel file.
While I know which columns I want to read from, and which row I want to start reading from, the file could contain any number of rows.
Is there a way to do something like:
array = xlsread( 'filename', 'D4:F*end*' ); %% OR ANY SIMILAR SYNTAX
Where F*end* is the last row in column F?
Yes. Try this:
FileFormat = '.xls';   % or '.xlsx' -- choose one
                       % ( by default MATLAB imports only '.xls' )
filename = strcat( 'Filename you desire', FileFormat );
array = xlsread( filename );   % this will read the whole matrix ( by default
                               % MATLAB imports all numerical data from the
                               % file with this syntax )
Then you can look at the size of the matrix to refine the search/import.
[nRows,nCols] = size( array );
Then, if you want to import just part of the matrix, you can do this:
NewArray = xlsread( filename, strcat( 'initial cell', ...
                                      ':', ...
                                      'ColumnLetter', ...
                                      num2str( nRows ) ) );
% for your case:
NewArray = xlsread( filename, strcat( 'D3', ':', 'F', num2str( nRows ) ) );
Hope this helps.
In xls-format Excel files, 65536 seems to be the limit on the number of rows you can use. You can use this number with F, and that will basically tell MATLAB to search until the end of the file. That's all I could gather from a little digging on these, and this trick/hack seems to work all right.
To sum up, this seems to do the trick for xls files -
array = xlsread('filename', 'D4:F65536')
For xlsx files, the limit seems to be 1048576, so the code would change to -
array = xlsread('filename', 'D4:F1048576')
External source to confirm the limit on number of rows -
Excel versions 97-2003 (Windows) have a file extension of XLS and the
worksheet size is 65,536 rows and 256 columns. In Excel 2007 and 2010
the default file extension is XLSX and the worksheet size is 1,048,576
rows and 16,384 columns.
You could read column by column:
col1= xlsread( 'filename', 'D:D' );
col2= xlsread( 'filename', 'E:E' );
col3= xlsread( 'filename', 'F:F' );
...
Don't provide row numbers (such as D12:D465), Matlab will deal with D:D like you would expect. col1, col2 and col3 will have different sizes depending on how much data was extracted from each column.
I haven't tried something like this though, so I don't know if it would work:
colAll= xlsread( 'filename', 'D:F' );
No, But...
MATLAB does not have either documented or undocumented feature for doing this directly.
The maximum one can use under direct MATLAB support is to:
___ = xlsread(filename,-1) opens an Excel window to interactively select data.
Select the worksheet, drag and drop the mouse over the range you want,
and click OK.
This syntax is supported only on Windows systems with Excel software.
Still, how to approach the task efficiently and future-proof?
The "blind" black-box approach would be to first test the boundary of the contiguous area, where your data is present -- use any feasible iterator, first forward-stepping by doubling a blind-test step-distance of a tested cell alike aRowToTEST = ( aRowToStartFROM + aRowNumberDistanceToTEST ) and in case the tested cell contains a number, set aLastNonEmptyROW = aRowToTEST; double the aRowNumberDistanceToTEST and repeat.
In case aRowToTEST points "behind" the format-specific maximum row number, set aRowToStartFROM = aLastNonEmptyROW; and reset the forward-stepping distance aRowNumberDistanceToTEST = 1; to continue forward-stepping iterations with a doubling-step stepping. If this again hits the limit, having the step == 1 and yet pointing right "behind" the format-specific limit, your sheet-under-review contains data until its last row ( finishing on the format-specific "edge" ).
But once the target cell is empty/NaN, stop the forward-stepping phase and start a standard back-stepping phase by halving the interval between a found/failed ( empty ) cell aFirstEmptyROW = aRowToTEST; and the last known cell at aLastNonEmptyROW, that contained number.
Again, if a cell under test contained a fair value, move the aLastNonEmptyROW-boundary to aRowToTEST value, if not, move the same way aFirstEmptyROW-boundary.
Finally set aBackSteppingSTEP = ( aFirstEmptyROW - aLastNonEmptyROW )/2; aRowToTEST = aFirstEmptyROW - aBackSteppingSTEP;.
Iterate the above until your step is < 1 and thus you have iteratively found the contiguous data-area boundary.
This is way faster and incomparably more efficient than a raw-dumb-import-whole-sheet and works until both a 64k or 1M or any further upper-limit of an XLS rowNumber.
Having the boundary, simply array = xlsread( 'filename', 'D4:F<<aLastNonEmptyROW>>' );
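The doubling-then-halving search described above can be sketched in any language; here is a rough Python illustration (my addition, not part of the original answer), where cell_has_number(row) stands for whatever single-cell probe you use (e.g. an xlsread of one cell) and max_row is the format-specific limit (65536 for .xls, 1048576 for .xlsx):
def find_last_nonempty_row(cell_has_number, start_row, max_row):
    # Phase 1: forward-stepping with a doubling distance until an empty cell
    # (or the format-specific row limit) is reached; assumes start_row itself
    # holds data.
    last_nonempty = start_row
    step = 1
    first_empty = None
    while first_empty is None:
        probe = last_nonempty + step
        if probe > max_row:
            if step == 1:
                return max_row        # data reaches the sheet's last row
            step = 1                  # restart the doubling nearer the limit
            continue
        if cell_has_number(probe):
            last_nonempty = probe
            step *= 2
        else:
            first_empty = probe
    # Phase 2: back-stepping, halving the interval between the two boundaries.
    while first_empty - last_nonempty > 1:
        mid = (first_empty + last_nonempty) // 2
        if cell_has_number(mid):
            last_nonempty = mid
        else:
            first_empty = mid
    return last_nonempty

# hypothetical usage: read_cell() is whatever one-cell reader you implement
# last_row = find_last_nonempty_row(lambda r: read_cell('D', r) is not None, 4, 65536)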
