Can pandas implicitly determine header based on value, not row? - excel

I work with people who use Excel and continuously add or subtract rows without telling me. I have to scrape a document for data, and the row where the header sits changes depending on their mood.
My challenge is to handle these oscillating currents by detecting where the header is.
I first organized my scrape using xlrd and a number of conditional statements on the values in the workbook.
My initial attempt works but is long (so I will not publish it), and it involves bringing in the entire sheet rather than slices:
from xlrd import open_workbook

def open_sheet(fName, sht):
    book = open_workbook(fName)
    sheet = book.sheet_by_name(sht)
    return book, sheet
However, it is big, and I would prefer a more targeted selection. The header values never change, nor does the position of the data relative to the header row.
Do you know of a way to implicitly get the header based on a found value in the sheet using either pandas.ExcelFile or pandas.read_excel?
Here is my attempt with pandas.ExcelFile:
import pandas as pd

def load_frame(fName, sht, noMerge, header):
    xlsx = pd.ExcelFile(fName)
    dataFrame = pd.read_excel(xlsx, sht,
                              parse_cols=21, merge_cells=noMerge,
                              header=header)
    return dataFrame
I cannot get the code to work unless I give the call the correct header value, which is exactly what I'm hoping to avoid.
This previous question seems to present a similar problem without addressing the concern of finding the headers implicitly.

Do the same loop, but through the ExcelFile object's underlying workbook:
xlsx = pd.ExcelFile(fName)
sheet = xlsx.book.sheet_by_name(sht)  # .book exposes the underlying xlrd workbook
# apply the same algorithm you wrote against xlrd here
# ... resulting in header_row = the found row, 0-based
dataFrame = pd.read_excel(xlsx, sht,
                          parse_cols=21, merge_cells=noMerge,
                          header=header_row)
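If you would rather stay inside pandas, another approach is to read the sheet once with header=None, locate the row whose first cell holds a known header value, and then read it again with that row as the header. A minimal sketch, assuming the known value sits in the first column (the file name, sheet name, and the sentinel 'Date' are placeholders; newer pandas spell the sheet argument sheet_name):
import pandas as pd

def find_header_row(path, sheet, sentinel):
    # read with no header so every row is scannable data
    raw = pd.read_excel(path, sheet_name=sheet, header=None)
    for idx, value in enumerate(raw.iloc[:, 0]):
        if value == sentinel:
            return idx
    raise ValueError('header value %r not found' % sentinel)

header_row = find_header_row('report.xlsx', 'Sheet1', 'Date')
df = pd.read_excel('report.xlsx', sheet_name='Sheet1', header=header_row)
This reads the file twice, but for modestly sized workbooks that cost is usually negligible next to the robustness gained.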

Related

extremely slow add a table to python-docx from a csv file

I have to add a table of around 1500 rows and 9 columns (about 75 pages) from a CSV file to a docx Word document, using python-docx.
I have tried different approaches, reading the CSV with pandas or opening the CSV file directly; it costs me around 150 minutes to finish the job regardless of which way I choose.
My question is whether this is normal behavior, or whether there is any way to improve this task.
I'm using this for loop to read several CSV files and render each one as a table:
import pandas as pd
from docx import Document
from docx.shared import Pt

doc = Document()
for toTAB in listBRUTO:  # listBRUTO is my list of CSV paths
    df = pd.read_csv(toTAB)
    # add a table to the end and create a reference variable
    # extra row is so we can add the header row
    t = doc.add_table(df.shape[0] + 1, df.shape[1])
    t.style = 'LightShading-Accent1'  # border
    # add the header row
    for j in range(df.shape[-1]):
        t.cell(0, j).text = df.columns[j]
    # add the rest of the data frame
    for i in range(df.shape[0]):
        for j in range(df.shape[-1]):
            t.cell(i + 1, j).text = str(df.values[i, j])
    # table format
    for row in t.rows:
        for cell in row.cells:
            for paragraph in cell.paragraphs:
                for run in paragraph.runs:
                    font = run.font
                    font.name = 'Calibri'
                    font.size = Pt(7)
    doc.add_page_break()
doc.save('blabla.docx')
Thanks in advance
You'll want to minimize the number of calls to table.cell(). Because of the way cell-merging works, these are expensive operations that really add up when performed in a tight loop.
I would start with refactoring this block and see how much improvement that yields:
# --- add the rest of the data frame ---
for i in range(df.shape[0]):
    for j, cell in enumerate(table.rows[i + 1].cells):
        cell.text = str(df.values[i, j])
python-docx walks the whole table every single time you access its "cells" property, so you should call ".cell" as little as possible and cache the cells instead.
Here are two examples that access a table of size 3×1500:
Code 1: about 150.0 s
for row in table.rows:
    print('processing: {0:30s}'.format(row.cells[0].text), end='\r')
Code 2: about 1.4 s
cells = table._cells
for row_idx in range(len(cells) // table._column_count):
    print('processing: {0:30s}'.format(
        cells[0 + row_idx * table._column_count].text), end='\r')
cells = table._cells in code 2 uses "_cells" to resolve the cell merging once, so cells[column_idx + row_idx * table._column_count].text works just as well as table.rows[row_idx].cells[column_idx].text, and it does not require the table to be exactly rectangular.
For a rectangular table without merged cells you can export all cells into a list-of-lists structure and fill them very quickly (less than 0.5 s vs. 15 s for ~300-line tables with 3 columns):
from docx.table import _Cell

def get_cells_grid(table):
    cells = [[]]
    col_count = table._column_count
    for tc in table._tbl.iter_tcs():
        cells[-1].append(_Cell(tc, table))
        if len(cells[-1]) == col_count:
            cells.append([])
    return cells

cells = get_cells_grid(t)
for i in range(df.shape[0]):
    for j in range(df.shape[1]):  # shape[1] is the column count
        cells[i][j].text = str(df.values[i, j])
Function based on table._cells() code: https://github.com/python-openxml/python-docx/blob/da75fcf01f7f322e846e2ac3e1936aedd766acc8/docx/table.py#L162
Just to add my experience: if you have to create a huge table, create the whole structure first, meaning all the rows and cells you will need, and then store the cells like so:
table_cells = table._cells (according to @kztopia)
From there you can manipulate cells as you wish (merging, adding text, etc.) with rather optimized speed, since you make only one call to cell().
In my use case, for a table that is in my opinion not so big (~130 rows, 8 cells per row), it used to take 9 s to create the whole thing, and now I'm at 0.5 s or so.
Keep in mind that the bigger the table, the more time it'll take to execute cell().
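Putting those pieces together, a minimal sketch of the build-first, fill-from-_cells pattern (the dimensions, cell text, and output name are placeholders):
from docx import Document

doc = Document()
rows, cols = 130, 8               # placeholder dimensions
table = doc.add_table(rows, cols)

# one traversal of the table XML; returns a flat list of cells in row-major order
cells = table._cells
for i in range(rows):
    for j in range(cols):
        cells[i * cols + j].text = 'r%dc%d' % (i, j)

doc.save('fast_fill.docx')        # placeholder output name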

Cannot find a way to replace last row inside excel files

I am fighting with an Excel file in which I would simply like to delete the last row.
I am using XlsxWriter, and I have tried several ways, but nothing is working. I must be doing something wrong (maybe I have to take a break).
I tried
worksheet.write_blank(row, col, None)
but I found out that XlsxWriter cannot modify an existing file; it only creates new ones. So using write_blank() to overwrite an existing row does not work.
Could you please help me? I am looping through several XLSX files, opening each one to replace the last row with a blank.
Many thanks!
So, I found a way to achieve this step on my own.
Basically, I wasn't able to do this with the XlsxWriter library, so I loop through my Excel files and open them with openpyxl instead.
import glob
import openpyxl

## look for all the Excel files needed
filepath = r"C:\Users\name\Desktop\folder\folder\folder"
xlsxfiles = glob.glob(filepath + r"\**\*.xlsx", recursive=True)

## for each Excel file, open the workbook and its active sheet
for file in xlsxfiles:
    wb = openpyxl.load_workbook(file)
    ws = wb.active
    ## count the rows and store the index of the last one
    last_row = ws.max_row
    print("MAX NUMBER OF ROWS:", last_row)
    ## blank out the first cell of the last row
    ws.cell(last_row, 1).value = None
    ## save each Excel file
    wb.save(file)
My need was quite specific, but I think this can easily be modified for other purposes.
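If the goal is to actually remove the last row rather than blank a cell in it, openpyxl also provides delete_rows; a minimal sketch (the file name is a placeholder):
import openpyxl

wb = openpyxl.load_workbook("example.xlsx")  # placeholder file name
ws = wb.active
ws.delete_rows(ws.max_row)  # remove the last row entirely
wb.save("example.xlsx")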

how to update a portion of existing excel sheet with filtered dataframe?

I have an Excel workbook with several sheets. I need to read a portion of one sheet, get a filtered dataframe, and write a single value from that filtered dataframe to a specific cell in the same sheet. What is the best way to accomplish this, ideally without opening the Excel workbook? I need to run this on Linux, so I can't use xlwings. I don't want to write the entire sheet, just a selected cell/offset inside it. I tried the following to write to the existing sheet, but it doesn't seem to work for me (no update occurs at the desired cell):
with pd.ExcelWriter('test.xlsx', engine='openpyxl') as writer:
    writer.book = load_workbook('test.xlsx')
    df_filtered.to_excel(writer, 'Sheet_Name', columns=['CS'], startrow=638, startcol=96)
Any tips would be helpful. Thanks.
If you're just writing a single cell, the below should suffice.
import pandas as pd
import openpyxl
df = pd.DataFrame(data=[1,2,3], columns=['col'])
filtered_dataframe = df[df.col == 1].values[0][0]
filename = 'test.xlsx'
wb = openpyxl.load_workbook(filename)
wb['Sheet1'].cell(column=1, row=2, value=filtered_dataframe)
wb.save(filename)
I believe your issue was that you never called the save method of the writer.
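To cover the reading side as well, one option is to pull just the needed block with pandas and write the single value back with openpyxl, leaving the rest of the workbook untouched. A sketch under assumed ranges (the usecols/skiprows/nrows values and the 'CS' filter are placeholders; the row/column numbers are the question's 0-based startrow=638/startcol=96 converted to openpyxl's 1-based indexing):
import pandas as pd
import openpyxl

# read only a portion of the sheet
portion = pd.read_excel('test.xlsx', sheet_name='Sheet_Name',
                        usecols='A:D', skiprows=10, nrows=50)
value = portion.loc[portion['CS'] > 0, 'CS'].iloc[0]  # assumes a 'CS' column

wb = openpyxl.load_workbook('test.xlsx')
wb['Sheet_Name'].cell(row=639, column=97, value=float(value))
wb.save('test.xlsx')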

Working with Excel sheets in MATLAB

I need to import some Excel files in MATLAB and work on them. My problem is that each Excel file has 15 sheets and I don't know how to "number" each sheet so that I can make a loop or something similar (because I need to find the average on a certain column on each sheet).
I have already tried importing the data and building a loop but MATLAB registers the sheets as chars.
Use xlsfinfo to get the sheet names, then use xlsread in a loop.
[status, sheets, xlFormat] = xlsfinfo(filename);
for sheetindex = 1:numel(sheets)
    [num, txt, raw] = xlsread(filename, sheets{sheetindex});
    data{sheetindex} = num;  % keep, for example, the numeric data to process later outside the loop
end
I've just remembered that I posted this question almost 2 years ago, and since I figured it out, I thought that posting the answer could prove useful to someone in the future.
So, to recap: I needed to import a single column from 4 Excel files, each containing 15 worksheets. The columns were of variable lengths. I figured out two ways to do this. The first uses the xlsread function with the following syntax:
for count_p = 1:2
    a = sprintf('control_group_%d.xls', count_p);
    [status, sheets, xlFormat] = xlsfinfo(a);
    for sheetindex = 1:numel(sheets)
        [num, txt, raw] = xlsread(a, sheets{sheetindex}, '', 'basic');
        data{sheetindex} = num;
        FifthCol{count_p, sheetindex} = data{sheetindex}(:, 5);
    end
end
for count_p = 3:4
    a = sprintf('exercise_group_%d.xls', count_p - 2);
    [status, sheets, xlFormat] = xlsfinfo(a);
    for sheetindex = 1:numel(sheets)
        [num, txt, raw] = xlsread(a, sheets{sheetindex}, '', 'basic');
        data{sheetindex} = num;
        FifthCol{count_p, sheetindex} = data{sheetindex}(:, 5);
    end
end
The files were obviously named control_group_1, control_group_2, etc. I used the 'basic' input in xlsread because I only needed the raw data from the files, and it proved to be much faster than using the full functionality of the function.
The second way to import the data, and the one I ended up using, is starting your own ActiveX server and running a single Excel application on it. xlsread opens and closes an ActiveX server each time it is called, so it is rather time-consuming (although using the 'basic' input avoids this). The code I used is the following:
Folder = cd(pwd);                          % get the working directory
d = dir('*.xls');                          % find the xls files
N_File = numel(d);                         % number of files
hexcel = actxserver('Excel.Application');  % start the ActiveX server and run an Excel application on it
hexcel.DisplayAlerts = true;
for index = 1:N_File                       % loop through the workbooks (xls files)
    Wrkbk = hexcel.Workbooks.Open(fullfile(pwd, d(index).name));  % VBA functions & commands
    WorkName = Wrkbk.Name;                 % get the workbook name
    display(WorkName)
    Sheets = Wrkbk.Sheets;                 % sheets handle
    ShCo(index) = Wrkbk.Sheets.Count;      % count the sheets for use in the next loop
    for j = 1:ShCo(index)                  % loop through each sheet
        itemm = hexcel.Sheets.Item(sprintf('sheet%d', j));  % VBA commands
        itemm.Activate;
        robj = itemm.Columns.End(4);       % get the column I needed
        numrows = robj.row;                % count to the end of the column
        dat_range = ['E1:E' num2str(numrows)];  % data range
        rngObj = hexcel.Range(dat_range);
        xldat{index, j} = cell2mat(rngObj.Value);  % collect the data in a cell array
    end
end
%invoke(hexcel);
Quit(hexcel);
delete(hexcel);

How to import lots of data into matlab from a spreadsheet?

I have an excel spreadsheet with lots of data that I want to import into matlab.
filename = 'for_matlab.xlsx';
sheet = (13*2) + 1;
xlRange = 'A1:G6';
all_data = {'one_a', 'one_b', 'two_a', 'two_b', 'three_a', 'three_b', 'four_a', 'four_b', 'five_a', 'five_b', 'six_a', 'six_b', 'seven_a', 'seven_b', 'eight_a', 'eight_b', 'nine_a', 'nine_b', 'ten_a', 'ten_b', 'eleven_a', 'eleven_b', 'twelve_a', 'twelve_b', 'thirteen_a', 'thirteen_b', 'fourteen_a'};
%read data from excel spreadsheet
for i = 1:sheet
    all_data{i} = xlsread(filename, sheet, xlRange);
end
Each element of the 'all_data' vector has a corresponding matrix in a separate Excel sheet. The code above imports only the last matrix into all of the variables. Could somebody tell me how to import these matrices into individual MATLAB variables (without calling the xlsread function 28 times)?
You define a loop using i but then put sheet in the actual xlsread call, which just makes it read repeatedly from the same sheet (the value of the variable sheet never changes). Also, it is not clear whether you intend to somehow save the contents of all_data; as written, there is no point in defining it that way, since it will just be overwritten.
There are two ways of specifying the sheet using xlsread.
1) Using a number. If you intended this, then:
all_data{i} = xlsread(filename, i, xlRange);
2) Using the name of the sheet. If you intended this, and the contents of all_data are the names of sheets, then:
data{i} = xlsread(filename, all_data{i}, xlRange); % avoiding overwriting
