Error in fetching value using openpyxl and python - python-3.x

Cell G82 holds the formula =Assumptions!E32, which evaluates to 15,000. When I read G82 with openpyxl, I get =Assumptions!E32 instead of 15,000. Is there another method or function I should use? Is there a mistake in my code?
import openpyxl
from openpyxl import load_workbook
aa = load_workbook('H2.xlsx')
ab = 'Sheet1' #Sheet1's name
ab1 = "Assumptions" #Sheet: "Assumption"
b = aa[ab] #worksheet object for Sheet1
print(b['G82'].value)
#Answer is =Assumptions!E32 instead of 15,000

I found the solution in existing answers by Charlie Clark and others. It comes down to changing how the file is read: pass data_only=True to load_workbook. Formula cells are then returned as their stored values instead of the formula text, which resolves the issue.
import openpyxl
from openpyxl import load_workbook
aa = load_workbook('H2.xlsx', data_only=True) #only difference is in this line
ab = 'Sheet1' #Sheet1's name
ab1 = "Assumptions" #Sheet: "Assumption"
b = aa[ab] #worksheet object for Sheet1
print(b['G82'].value)
This change gives the needed output.
For the application I was writing, I wanted the formulas as well as their immediate outputs (and not the cells they refer to), so I loaded the workbook twice under different names: once normally, and once with data_only=True.
Wherever I need the formulas I use the normal workbook, and wherever I need the values I use the second one.
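The two-workbook approach described above could be sketched like this. It is only a minimal illustration: the helper name read_formula_and_value is made up for this example, and it assumes the file was last saved by Excel (or another application that stores computed results). With data_only=True, openpyxl returns the cached result, and that is None when no cached value exists in the file.

```python
from openpyxl import load_workbook


def read_formula_and_value(path, sheet_name, cell_ref):
    """Return (formula, cached_value) for one cell.

    Hypothetical helper for illustration; it is not part of openpyxl.
    """
    # Default load: formula cells come back as the formula text.
    wb_formulas = load_workbook(path)
    # data_only=True: formula cells come back as the last cached result,
    # or None if the file was never calculated and saved by Excel.
    wb_values = load_workbook(path, data_only=True)

    formula = wb_formulas[sheet_name][cell_ref].value
    value = wb_values[sheet_name][cell_ref].value
    return formula, value
```

For the workbook in the question, the call would look like read_formula_and_value('H2.xlsx', 'Sheet1', 'G82'). Loading the file twice costs extra time and memory, so for large workbooks it may be worth loading each version once and reusing it.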

Related

How to get a similar sheet name in pandas

I am trying to find a sheet with a similar name in an Excel file using pandas.
Currently I am using the code below to get a dataframe for a sheet.
excel = pd.ExcelFile(excel)
tab_name = 'Employee'
emp_df = excel.parse(tab_name)
But this code fails if the sheet name in the Excel file contains a space or other extra characters.
Is there an easy way to do this?
I used a similarity API (fuzzywuzzy) to look for a similar sheet name, but only after excel.parse(tab_name) throws a sheet-not-found error:
from fuzzywuzzy import fuzz
import xlrd
try:
    tab_df = excel.parse(tab_name)
except xlrd.biffh.XLRDError:
    sheet_names = excel.sheet_names
    ratios = [fuzz.ratio(tab_name, tbname) for tbname in sheet_names]
    if max(ratios) > 50:
        tab_name = sheet_names[ratios.index(max(ratios))]
        tab_df = excel.parse(tab_name)
    else:
        logger.error(tab_name + " not found")
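The same idea can be tried without a third-party dependency by using difflib from the standard library. This is a sketch that swaps fuzzywuzzy for difflib.get_close_matches; the helper name closest_sheet_name and the 0.5 cutoff are assumptions made for this example, not part of pandas.

```python
import difflib


def closest_sheet_name(wanted, sheet_names, cutoff=0.5):
    """Return the sheet name most similar to `wanted`, or None.

    Hypothetical helper; `cutoff` (0..1) is an arbitrary threshold.
    """
    # Compare case-insensitively and ignore surrounding whitespace,
    # but remember the original spelling so it can be passed to pandas.
    normalized = {name.strip().lower(): name for name in sheet_names}
    matches = difflib.get_close_matches(
        wanted.strip().lower(), list(normalized), n=1, cutoff=cutoff
    )
    return normalized[matches[0]] if matches else None
```

It could be called as closest_sheet_name(tab_name, excel.sheet_names) before parsing, which avoids relying on a specific exception type.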

How to load multiple excel files with multiple sheets in to one dataframe in python

We're trying to write a program that automatically takes multiple Excel files, each with multiple sheets, from a folder and appends them all to one data frame.
Our problem is that we're not sure how to do this so that the process is as automatic as possible. Since the sheet names vary, we can't hard-code them.
All of the files are *.xlsx, and the code has to load an arbitrary number of files.
We have tried different kinds of code, primarily using pandas, but we can't seem to append everything into one data frame.
import numpy as np
import pandas as pd
import glob
all_data = pd.DataFrame()
for f in glob.glob("*.xlsx"):
    df = pd.read_excel(f)
    all_data = all_data.append(df, ignore_index=True)
# now save the data frame
writer = pd.ExcelWriter('output.xlsx')
all_data.to_excel(writer)
writer.save()
sheet1 = xls.parse(0)
We expect to have one data frame with all data, such that we can use data and extract different features and make statistics.
The documentation of pandas.read_excel states:
sheet_name : str, int, list, or None, default 0
Strings are used for sheet names. Integers are used in zero-indexed sheet positions. Lists of strings/integers are used to request multiple sheets. Specify None to get all sheets.
Available cases:
Defaults to 0: 1st sheet as a DataFrame
1: 2nd sheet as a DataFrame
"Sheet1": Load sheet with name “Sheet1”
[0, 1, "Sheet5"]: Load first, second and sheet named “Sheet5” as a dict of DataFrame
None: All sheets.
I would suggest trying the last option, pd.read_excel(f, sheet_name=None), which returns a dict of DataFrames keyed by sheet name. Alternatively, you could loop over sheet positions (indexes) instead of the actual sheet names. Either way, you need no prior knowledge of the sheets in the .xlsx files.
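Putting the sheet_name=None suggestion together with the folder loop from the question, a minimal sketch might look like this. The function name load_all_sheets and the two bookkeeping columns source_file and source_sheet are assumptions added for illustration, and the sketch assumes all sheets share a compatible column layout.

```python
import glob
import os

import pandas as pd


def load_all_sheets(folder):
    """Stack every sheet of every .xlsx file in `folder` into one DataFrame.

    Hypothetical helper; the "source_file" and "source_sheet" columns are
    added here so each row can be traced back to where it came from.
    """
    frames = []
    for path in sorted(glob.glob(os.path.join(folder, "*.xlsx"))):
        # sheet_name=None returns a dict: {sheet name: DataFrame}.
        sheets = pd.read_excel(path, sheet_name=None)
        for sheet_name, df in sheets.items():
            df = df.copy()
            df["source_file"] = os.path.basename(path)
            df["source_sheet"] = sheet_name
            frames.append(df)
    if not frames:
        return pd.DataFrame()
    # Concatenate once at the end instead of growing a frame in the loop.
    return pd.concat(frames, ignore_index=True)
```

If the combined result is written back to the same folder as an .xlsx file, a later run would pick it up as input, so saving it elsewhere is probably safer.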

Pandas read_excel for multiple sheets returns same nb_rows on every sheet

I've got an Excel file with multiple sheets that all have the same structure.
The number of rows varies from sheet to sheet, but pd.read_excel() returns a df whose nb_rows equals the nb_rows of the first sheet.
I've checked the Excel sheets with CTRL+down: there are no empty lines in the middle of a sheet.
How can I fix the problem?
The example code is as follows:
import pandas as pd
xls_sheets = ['01', '02', '03']
fname = 'C:\\data\\data.xlsx'
xls = pd.ExcelFile(fname)
for sheet in xls_sheets:
    df = pd.read_excel(io=xls, sheet_name=sheet)
    print(len(df))
Output:
>> 4043 #Actual nb_rows = 4043
>> 4043 #Actual nb_rows = 11015
>> 4043 #Actual nb_rows = 5622
python 3.5, pandas 0.20.1
Check that the sheet names in your xls_sheets list are correct. If they are, install the xlrd library (pip install xlrd) and run the code again; it works fine for me. Hope this helps!
Given the limited information in the question, and assuming you want to read all of the sheets in the Excel file, I would suggest the following:
data = pd.read_excel('excelfile.xlsx', sheet_name=None)
data is a dictionary where the keys are the sheet names and the values are the data in each sheet. Please try this method. It may solve your problem.
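To check whether each sheet really comes back with its own row count, the dictionary can be reduced to a mapping of sheet name to length. This is a small diagnostic sketch; the function name rows_per_sheet is made up for this example.

```python
import pandas as pd


def rows_per_sheet(path):
    """Return {sheet name: number of data rows} for every sheet in `path`.

    Hypothetical helper; the header row of each sheet is not counted.
    """
    # sheet_name=None loads every sheet into a dict of DataFrames.
    sheets = pd.read_excel(path, sheet_name=None)
    return {name: len(df) for name, df in sheets.items()}
```

If the counts printed by print(rows_per_sheet(fname)) differ per sheet here but not in the original loop, the loop is the more likely culprit than the file.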

How to execute the second Iteration of Data in excel using Openpyxl with Python 3.4

I am trying to read data from my Excel spreadsheet. So far I have been able to do it using the code below, but I can't run iterations.
from openpyxl import load_workbook
import numpy as np
# raw string so the backslashes in the Windows path are kept literally
wb = load_workbook(r'c:\ExcelData\pyExcel.xlsx')
ws = wb['Sheet1']  # replaces the deprecated get_sheet_by_name
table = np.array([[cell.value for cell in col] for col in ws['A2':'A3']])
print(table)
Another Example:
import os
from openpyxl import load_workbook
val1=2
val2=1
wb = load_workbook(os.path.abspath(os.path.join(os.path.dirname(__file__), r'c:\ExcelData\pyExcel.xlsx')))
sheet = wb['Sheet1']  # replaces the deprecated get_sheet_by_name
c = sheet.cell(row=val1, column=val2).value
d = sheet.cell(row=val2, column=val2).value
print(c)
print(d)
So far this reads a hardcoded row and cell from an Excel file and prints the value or assigns it to a variable, but I am looking for a way to run iterations over the data. I want to use the sheet as a data table: the first row of all the columns is used the first time, and at the end the script starts over again using the next row.
Thanks.
Pinky, you should use variables in the range of table = np.array([[cell.value for cell in col] for col in ws['A2':'A3']]),
for example ws[variable:variable], where each variable holds a cell reference, or a reference built from a row-number variable.
@Pinky Read this page http://openpyxl.readthedocs.org/en/latest/tutorial.html and try to find your answer. If you still do not understand after that, I'll try to help you with code. I feel this is the best way to actually learn what you are doing, rather than just receiving the code directly.
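One way to get the "start over with the next row" behaviour is to loop over the worksheet rows instead of addressing cells by hardcoded numbers. The sketch below uses ws.iter_rows(values_only=True); the helper name iter_data_rows, the assumption that row 1 holds column headers, and the demo data are all made up for illustration.

```python
from openpyxl import Workbook


def iter_data_rows(ws):
    """Yield one dict per data row, keyed by the header cells in row 1.

    Hypothetical helper for illustration; it assumes row 1 holds headers.
    """
    rows = ws.iter_rows(values_only=True)
    headers = next(rows, None)
    if headers is None:
        return  # empty worksheet, nothing to iterate
    for row in rows:
        yield dict(zip(headers, row))


# Small in-memory demo; the column names and values are invented.
wb = Workbook()
ws = wb.active
ws.append(["name", "quantity"])
ws.append(["apples", 3])
ws.append(["pears", 7])

for record in iter_data_rows(ws):
    # Each pass of the loop is one iteration over the next data row.
    print(record["name"], record["quantity"])
```

With a real file, the worksheet would come from load_workbook(...) instead of the in-memory Workbook used here.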

reading 300k cells in excel using read_only in openpyxl not enough

I've read quite a few questions on here about reading large excel files with openpyxl and the read_only param in load_workbook(), and I've done it successfully with source excels 50x30, but when I try to do it on a workbook with a 30x1100 sheet, it stalls. Right now, it just reads in the excel and transfers it to a multi-dimensional array.
from openpyxl import Workbook
from openpyxl import load_workbook
def transferCols(refws,mx,refCol,newCol,header):
    rmax = refws.max_row
    for r in range(1, rmax+1):
        if (r == 1):
            mx[r-1][newCol-1] = header
        else:
            mx[r-1][newCol-1] = refws.cell(row = r, column = refCol).value
    return
ref_wb = load_workbook("UESfull.xlsx", read_only= True)
ref_ws = ref_wb.active
rmax = ref_ws.max_row
matrix = [["fill" for col in range(30)] for row in range(rmax)]
print("step ", 1)
transferCols(ref_ws,matrix,1,1,"URL")
...
I only put the print("step") line in to track the progress, but surprisingly, it stalls at step 1! I don't know whether the structure is poor or whether 300k cells is too much for openpyxl. I haven't even begun to write to my output Excel file yet! Thanks in advance!
I suspect that you have an undimensioned worksheet, so ws.max_row is unknown. ws.calculate_dimensions() will tell you whether this is the case. If it is, you should just iterate over the rows of both sheets in parallel.
Instead of reading a large Excel file with openpyxl, try pandas; it should give you better results. pandas also has better functions for the data cleaning you will need to do.
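Iterating over rows, as suggested, might look like the sketch below. It reads the active sheet once, front to back, and does not depend on ws.max_row being known. The function name read_matrix is an assumption made for this example. Random access through ws.cell() in read-only mode can be slow, since each lookup may have to re-read the sheet, which is why the sketch avoids it.

```python
from openpyxl import load_workbook


def read_matrix(path):
    """Read the active sheet of `path` into a list of row lists.

    Hypothetical helper; it streams the rows in read-only mode.
    """
    wb = load_workbook(path, read_only=True)
    try:
        ws = wb.active
        # values_only=True yields plain tuples of cell values per row.
        return [list(row) for row in ws.iter_rows(values_only=True)]
    finally:
        # Read-only workbooks keep the file open until closed explicitly.
        wb.close()
```

The header replacement from transferCols could then be applied to the first row of the returned list in ordinary Python, without further calls into openpyxl.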
Here is an example of 10000 rows and 30 columns of data that is written and read back in pandas:
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(10000,30))
df.to_excel('test.xlsx')
df1 = pd.read_excel('test.xlsx')
