Reading 300k cells in Excel using read_only in openpyxl is not enough

I've read quite a few questions on here about reading large Excel files with openpyxl and the read_only param in load_workbook(), and I've done it successfully with 50x30 source files, but when I try it on a workbook with a 30x1100 sheet, it stalls. Right now, the code just reads in the Excel file and transfers it to a multi-dimensional array.
from openpyxl import Workbook
from openpyxl import load_workbook

def transferCols(refws, mx, refCol, newCol, header):
    rmax = refws.max_row
    for r in range(1, rmax + 1):
        if r == 1:
            mx[r-1][newCol-1] = header
        else:
            mx[r-1][newCol-1] = refws.cell(row=r, column=refCol).value
    return

ref_wb = load_workbook("UESfull.xlsx", read_only=True)
ref_ws = ref_wb.active
rmax = ref_ws.max_row
matrix = [["fill" for col in range(30)] for row in range(rmax)]
print("step ", 1)
transferCols(ref_ws, matrix, 1, 1, "URL")
...
I only put in the print("step") line to track progress, but surprisingly, it stalls at step 1! I just don't know if the structure is poor or if 300k cells is too much for openpyxl. I haven't even begun to write to my output Excel file yet! Thanks in advance!

I suspect that you have an undimensioned worksheet, so ws.max_row is unknown. If this is the case, ws.calculate_dimension() will tell you; then you should just iterate over the rows of both sheets in parallel.
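A minimal sketch of that row-parallel iteration, with a made-up file name and contents standing in for the question's UESfull.xlsx: in read-only mode, streaming with ws.rows (or ws.iter_rows()) is the idiomatic approach, since each random-access ws.cell() call forces openpyxl to re-parse the sheet, which is what makes the cell-by-cell approach stall.

```python
from openpyxl import Workbook, load_workbook

# Build a small throwaway workbook so the sketch is self-contained
# (the file name and contents here are invented for illustration).
wb = Workbook()
ws = wb.active
ws.append(['URL', 'name'])
ws.append(['a.com', 'alpha'])
ws.append(['b.com', 'beta'])
wb.save('ro_demo.xlsx')

# Stream the rows once instead of calling ws.cell() per cell.
ro_ws = load_workbook('ro_demo.xlsx', read_only=True).active
matrix = [[cell.value for cell in row] for row in ro_ws.rows]
print(matrix)  # [['URL', 'name'], ['a.com', 'alpha'], ['b.com', 'beta']]
```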

Instead of trying to read a large Excel file with openpyxl, try pandas; it should get you better results. pandas also has better functions for the data cleaning you are likely to need.
Here is an example where 10000 rows and 30 columns of data are written and read back with pandas:
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(10000,30))
df.to_excel('test.xlsx')
df1 = pd.read_excel('test.xlsx')

Related

Better way to read columns from excel file as variables in Python

I have an .xlsx file with 5 columns (X, Y, Z, Row_Cog, Col_Cog) that will be in the same order each time. I would like to have each column as a variable in Python. I am using the method below, but would like to know if there is a better way to do it.
Also, I am writing the range manually (in the for loop), while I would like a robust way to know the length of each column (number of rows) and assign it.
# READ THE TEST DATA from Excel file
import xlrd
workbook = xlrd.open_workbook(r"C:\Desktop\SawToothCalib\TestData.xlsx")
worksheet = workbook.sheet_by_index(0)
X_Test = []
Y_Test = []
Row_Test = []
Col_Test = []
for i in range(1, 29):
    x_val = worksheet.cell_value(i, 0)
    X_Test.append(x_val)
    y_val = worksheet.cell_value(i, 2)
    Y_Test.append(y_val)
    row_val = worksheet.cell_value(i, 3)
    Row_Test.append(row_val)
    col_val = worksheet.cell_value(i, 4)
    Col_Test.append(col_val)
Do you really need this package? You can easily do this kind of operation with pandas.
You can read your file as a DataFrame with:
import pandas as pd
df = pd.read_excel(path + 'file.xlsx', sheet_name=the_sheet_you_want)
and access the list of columns with df.columns. You can access each column with df['column name']. If there are empty entries, they are stored as NaN; you can count them with df['column_name'].isnull().sum().
If you are uncomfortable with DataFrames, you can then convert the columns to lists or arrays, like
df['my_col'].tolist()
or
df['my_col'].to_numpy()
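Putting those pieces together, a minimal sketch with a made-up DataFrame standing in for the real file (the column names X, Y, Z, Row_Cog, Col_Cog come from the question; the values are invented):

```python
import pandas as pd

# Throwaway frame standing in for the spreadsheet's five columns.
df = pd.DataFrame({'X': [1.0, 2.0], 'Y': [3.0, 4.0], 'Z': [5.0, 6.0],
                   'Row_Cog': [7, 8], 'Col_Cog': [9, 10]})

X_Test = df['X'].tolist()            # plain Python list
Row_Test = df['Row_Cog'].to_numpy()  # NumPy array
n_rows = len(df)                     # row count: no hardcoded range needed
n_missing = int(df['X'].isnull().sum())
print(X_Test, n_rows, n_missing)  # [1.0, 2.0] 2 0
```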

How can I do something similar to VLOOKUP in Excel with urlparse?

I need to compare two sets of data from CSVs: one (csv1) with a column 'listing_url', the other (csv2) with columns 'parsed_url' and 'url_code'. I would like to use the result of urlparse on csv1 (specifically the netloc) to compare against csv2's 'parsed_url' and output the matching value from 'url_code' to a CSV.
from urllib.parse import urlparse
import re, pandas as pd
scr = pd.read_csv('csv2',squeeze=True,usecols=['parsed_url','url_code'])[['parsed_url','url_code']]
data = pd.read_csv('csv1')
L = data.values.T[0].tolist()
T = pd.Series([scr])
for i in L:
    n = urlparse(i)
    nf = pd.Series([(n.netloc)])
I'm stuck trying to convert the data into objects I can use map with, if that's even the best thing to use, I don't know.
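One way to sketch the VLOOKUP-style matching is a pandas left merge on the netloc; the frames and values below are invented stand-ins for the two CSVs, keeping only the column names from the question:

```python
import pandas as pd
from urllib.parse import urlparse

# Stand-in frames for csv1 and csv2 (data invented for illustration).
data = pd.DataFrame({'listing_url': ['https://a.com/x', 'https://b.com/y']})
scr = pd.DataFrame({'parsed_url': ['a.com', 'b.com'],
                    'url_code': [101, 102]})

# Derive the netloc column, then a left merge plays the role of VLOOKUP.
data['netloc'] = data['listing_url'].map(lambda u: urlparse(u).netloc)
out = data.merge(scr, left_on='netloc', right_on='parsed_url', how='left')
print(out['url_code'].tolist())  # [101, 102]
```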

for loop that can iterate over range n, n-1

I am fairly new to Python, so I need your help to try and figure out how to do this.
I am trying to import data from an Excel file to a MySQL database, and I need to import everything but the header and last row from this file.
At the moment, this is the code:
for r in range(1, sheet.nrows - 1):
    rowid = sheet.cell(r, 0).value
    currency_Symbol = sheet.cell(r, 1).value
    ....
I know it is probably in this part of the code that I need to do something, but I tried everything I could think of, and nothing worked.
Any thoughts?
It's important to say that I am using xlrd and MySQLdb, so I need something that works with these modules.
Thanks in advance
Use pandas:
pd.read_excel(<parameters>)
df.to_sql(<parameters>)
A quick sample below with pandas and SQLAlchemy:
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine(<<MySQL Connection String>>)
df = pd.read_excel("") # https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_excel.html
df.to_sql(con=engine, name='tablename', if_exists='append', index=False)
You will need to install the pandas and SQLAlchemy modules, though. You can google those easily.
Having run a few tests on one of my own spreadsheets, it looks like a confusion as to what nrows means and how it relates to the range index.
The header row is row 0.
By starting your range at 1, you correctly exclude the header (if the header is only 1 row!).
nrows is the number of rows, so with 362 rows the valid indices run from 0 to 361 and the last row is index nrows - 1. Since range() stops one short of its end argument, range(1, sheet.nrows) covers every row except the header, including the last one. To also exclude the last row, stop one index earlier:
for r in range(1, sheet.nrows - 1):
which is exactly what your code already does.
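A quick self-contained check of the indexing, with a hypothetical 5-row sheet (one header, three data rows, one footer):

```python
# Hypothetical sheet contents: header, three data rows, footer.
rows = ['header', 'data1', 'data2', 'data3', 'footer']
nrows = len(rows)  # plays the role of xlrd's sheet.nrows (here 5)

# range(1, nrows - 1) yields indices 1..3: header and last row excluded.
kept = [rows[r] for r in range(1, nrows - 1)]
print(kept)  # ['data1', 'data2', 'data3']
```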

Pandas read_excel for multiple sheets returns same nb_rows on every sheet

I've got an Excel file with multiple sheets with the same structure.
The number of rows varies on every sheet, but pd.read_excel() returns a df whose number of rows always equals that of the first sheet.
I've checked the Excel sheets with CTRL+Down - there are no empty lines in the middle of any sheet.
How can I fix the problem?
The example code is as follows:
import pandas as pd
xls_sheets = ['01', '02', '03']
fname = 'C:\\data\\data.xlsx'
xls = pd.ExcelFile(fname)
for sheet in xls_sheets:
    df = pd.read_excel(io=xls, sheet_name=sheet)
    print(len(df))
Output:
>> 4043 #Actual nb_rows = 4043
>> 4043 #Actual nb_rows = 11015
>> 4043 #Actual nb_rows = 5622
python 3.5, pandas 0.20.1
Check that the sheet names in your xls_sheets list are correct. If they are, try installing the xlrd library/module (pip install xlrd) and then run the code again, because it works fine for me. Hope this helps!
Given the limited information in the question, and assuming you want to read all of the sheets in the Excel file, I would suggest the following:
data = pd.read_excel('excelfile.xlsx', sheet_name=None)
data is a dictionary where the keys are the sheet names and the values are the data in each sheet. Please try this method; it may solve your problem.
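A self-contained sketch of that dictionary behavior, writing a throwaway two-sheet workbook first (the sheet names '01' and '02' mirror the question; the data is invented):

```python
import pandas as pd

# Write a throwaway two-sheet workbook so the example is self-contained.
with pd.ExcelWriter('demo.xlsx') as writer:
    pd.DataFrame({'a': range(3)}).to_excel(writer, sheet_name='01', index=False)
    pd.DataFrame({'a': range(5)}).to_excel(writer, sheet_name='02', index=False)

# sheet_name=None returns a dict of {sheet name: DataFrame},
# and each frame keeps its own row count.
data = pd.read_excel('demo.xlsx', sheet_name=None)
for name, df in data.items():
    print(name, len(df))
```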

How to execute the second iteration of data in Excel using openpyxl with Python 3.4

I am trying to read data from my Excel spreadsheet, and so far I have been able to do it using the code below, but I can't run iterations.
from openpyxl import load_workbook
import numpy as np
wb = load_workbook(r'c:\ExcelData\pyExcel.xlsx')
ws = wb.get_sheet_by_name('Sheet1')
table = np.array([[cell.value for cell in col] for col in ws['A2':'A3']])
print(table)
Another Example:
val1=2
val2=1
wb = load_workbook(os.path.abspath(os.path.join(os.path.dirname(__file__),'c:\ExcelData\pyExcel.xlsx')))
sheet = wb.get_sheet_by_name('Sheet1')
c = sheet.cell(row=val1, column=val2).value
d = sheet.cell(row=val2, column=val2).value
print(c)
print(d)
So far what this does is read a hardcoded row and cell from an Excel file and print or assign the value to a variable. But I am looking for a way to run iterations over the data: I want to use it as a data table where, on the first run, the first row of all the columns is used, and then the script starts over again using the next row.
Thanks.
Pinky, you should use variables in table = np.array([[cell.value for cell in col] for col in ws['A2':'A3']]).
For example, ws[start:end], where start and end are cell-reference strings like 'A2' that you build from a row-number variable.
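For example, a sketch of that variable-based slicing on a made-up sheet standing in for pyExcel.xlsx (the next row of the "data table" is then just start_row + 1):

```python
import numpy as np
from openpyxl import Workbook

# Throwaway sheet standing in for the question's pyExcel.xlsx.
wb = Workbook()
ws = wb.active
for v in ('header', 1, 2, 3):
    ws.append([v])

# Build the slice bounds from variables instead of hardcoding 'A2':'A3'.
start_row, end_row = 2, 4
table = np.array([[cell.value for cell in row]
                  for row in ws['A%d' % start_row:'A%d' % end_row]])
print(table.tolist())  # [[1], [2], [3]]
```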
@Pinky, read this page http://openpyxl.readthedocs.org/en/latest/tutorial.html and try to find your answer. If you still don't understand it, I'll try to help you with some code. I feel this is the best way for you to actually learn what you are doing, rather than just receiving the code directly.