There are many questions on this, but there has been no simple answer on how to read an xlsb file into pandas. Is there an easy way to do this?
With the 1.0.0 release of pandas - January 29, 2020, support for binary Excel files was added.
import pandas as pd
df = pd.read_excel('path_to_file.xlsb', engine='pyxlsb')
Notes:
You will need to upgrade pandas - pip install pandas --upgrade
You will need to install pyxlsb - pip install pyxlsb
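Before calling read_excel with engine='pyxlsb', it can help to confirm that both requirements from the notes above are actually met. The following is a minimal sketch using only the standard library (Python 3.8+); the helper name installed_version is made up for illustration and is not part of pandas or pyxlsb.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of `package`, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Report what is available in the current environment.
for name in ("pandas", "pyxlsb"):
    version = installed_version(name)
    print(name, version if version is not None else "not installed")
```

If pandas reports a version below 1.0.0, or pyxlsb is missing, the pip commands in the notes apply.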
There is a way: just use the pyxlsb library.
import pandas as pd
from pyxlsb import open_workbook as open_xlsb
df = []
with open_xlsb('some.xlsb') as wb:
    with wb.get_sheet(1) as sheet:
        for row in sheet.rows():
            df.append([item.v for item in row])
# The first row holds the column names
df = pd.DataFrame(df[1:], columns=df[0])
UPDATE:
As of pandas version 1.0, read_excel() can read binary Excel (.xlsb) files when you pass engine='pyxlsb'.
Source: https://pandas.pydata.org/pandas-docs/version/1.0.0/whatsnew/v1.0.0.html
pyxlsb is indeed an option for reading xlsb files, but it is rather limited.
I suggest using the xlwings package, which can read and write xlsb files without losing sheet formatting, formulas, etc. There is extensive documentation available.
import pandas as pd
import xlwings as xw
app = xw.App()
book = xw.Book('file.xlsb')
sheet = book.sheets('sheet_name')
df = sheet.range('A1').options(pd.DataFrame, expand='table').value
book.close()
app.kill()
'A1' in this case is the starting position of the excel table.
To write to xlsb file, simply write:
sheet.range('A1').value = df
If you want to read a big binary file, or any Excel file, and split it at a given row, you can use this code directly:
import pandas as pd
from pyxlsb import open_workbook as open_xlsb

split_at = (your_index_number)  # row index where the second block starts
first_rows = []
second_rows = []
with open_xlsb('Test.xlsb') as wb:
    with wb.get_sheet('Sheet1') as sheet:
        for i, row in enumerate(sheet.rows()):
            values = [item.v for item in row]
            if i < split_at:
                first_rows.append(values)
            else:
                second_rows.append(values)
# The first row of the first block holds the column names
first_dataframe = pd.DataFrame(first_rows[1:], columns=first_rows[0])
second_dataframe = pd.DataFrame(second_rows, columns=first_dataframe.columns)
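The splitting step in the loop above does not depend on pyxlsb and can be tried on plain lists. Here is a minimal sketch, assuming the first row is the header and that split_at counts rows including that header; the function name split_rows and the sample data are hypothetical.

```python
from itertools import islice


def split_rows(rows, split_at):
    """Split an iterable of rows into (header, first_block, second_block).

    Row 0 is treated as the header, rows 1..split_at-1 go to the first block,
    and every remaining row goes to the second block.
    """
    iterator = iter(rows)
    head = list(islice(iterator, split_at))  # consumes the first split_at rows
    if not head:
        return [], [], list(iterator)
    return head[0], head[1:], list(iterator)


# Stand-in data; with pyxlsb the rows would come from
# ([item.v for item in row] for row in sheet.rows()).
sample = [["id", "name"], [1, "a"], [2, "b"], [3, "c"], [4, "d"]]
header, first, second = split_rows(sample, 3)
print(header)
print(first)
print(second)
```

Each block can then be passed to pd.DataFrame together with header as the columns argument. Because the input is consumed lazily, the whole sheet is only held in memory once, inside the two result lists.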
To be able to read xlsb files with pandas, pyxlsb must be installed and passed as the engine.
As per https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_excel.html#pandas.read_excel
engine: str, default None
If io is not a buffer or path, this must be set to identify io. Supported engines: “xlrd”, “openpyxl”, “odf”, “pyxlsb”. Engine compatibility :
“xlrd” supports old-style Excel files (.xls).
“openpyxl” supports newer Excel file formats.
“odf” supports OpenDocument file formats (.odf, .ods, .odt).
“pyxlsb” supports Binary Excel files.
Changed in version 1.2.0: The engine xlrd now only supports old-style .xls files.
When engine=None, the following logic will be used to determine the engine:
If path_or_buffer is an OpenDocument format (.odf, .ods, .odt), then odf will be used.
Otherwise if path_or_buffer is an xls format, xlrd will be used.
Otherwise if openpyxl is installed, then openpyxl will be used.
Otherwise if xlrd >= 2.0 is installed, a ValueError will be raised.
Otherwise xlrd will be used and a FutureWarning will be raised. This case will raise a ValueError in a future version of pandas.
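The engine=None rules quoted above can be read as a small decision function. The sketch below is an illustration of that documented logic only, not pandas' actual implementation; it assumes the format can be judged from the file suffix, and the function name and parameters are made up.

```python
from pathlib import Path

OPENDOCUMENT_SUFFIXES = {".odf", ".ods", ".odt"}


def pick_default_engine(path, openpyxl_installed, xlrd_major_version=None):
    """Mirror the documented engine=None rules for a file path (illustration only)."""
    suffix = Path(path).suffix.lower()
    if suffix in OPENDOCUMENT_SUFFIXES:
        return "odf"
    if suffix == ".xls":
        return "xlrd"
    if openpyxl_installed:
        return "openpyxl"
    if xlrd_major_version is not None and xlrd_major_version >= 2:
        raise ValueError("xlrd >= 2.0 only reads .xls files; install openpyxl")
    return "xlrd"  # the quoted docs say a FutureWarning is raised in this case


print(pick_default_engine("report.ods", openpyxl_installed=False))
print(pick_default_engine("legacy.xls", openpyxl_installed=True))
print(pick_default_engine("data.xlsb", openpyxl_installed=True))
```

Under these quoted rules, "pyxlsb" is never selected automatically, which is why the examples in this thread pass engine='pyxlsb' explicitly for .xlsb files.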
xlsb reading with index_col:
import pandas as pd
dfcluster = pd.read_excel('c:/xml/baseline/distribucion.xlsb', sheet_name='Cluster', index_col=0, engine='pyxlsb')
Related
Windows 10, Python 3.6, xlwings 0.27.8
When trying to debug my code outside of RunPython, I keep stumbling on the following issue, for example:
import xlwings as xw
xlsx_file = 'anExcelFile.xlsm'
xlsx = xw.Book(xlsx_file).set_mock_caller()
From there, I was hoping to use xlsx just as I would if the routine had been called from RunPython, but typing xlsx now returns None.
However, if I do:
xlsx = xw.Book(xlsx_file).set_mock_caller
then xlsx contains: <Book [anExcelFile.xlsm]>
but xlsx() still returns None.
Any lead on what I am getting wrong would help, thanks!
Finally understood my issue.
As explained in the xlwings documentation, the code sequence should read as follows:
import xlwings as xw
xlsx_file = 'anExcelFile.xlsm'
xw.Book(xlsx_file).set_mock_caller()
# Sets the Excel file which is used to mock xw.Book.caller()
# when the code is called from Python and not from Excel via RunPython.
# and then:
xlsx = xw.Book.caller()
Now, xlsx returns: <Book [anExcelFile.xlsm]>
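The underlying pitfall is a general Python one: a method that only records state returns None, so assigning its result gives None. A toy sketch follows; FakeBook is a hypothetical stand-in that imitates the pattern and is not the xlwings API.

```python
class FakeBook:
    """Hypothetical stand-in for a workbook class; not the xlwings API."""

    _mock_caller = None

    def __init__(self, name):
        self.name = name

    def set_mock_caller(self):
        # Records the book as a side effect and implicitly returns None.
        FakeBook._mock_caller = self

    @classmethod
    def caller(cls):
        # Returns whatever book was registered earlier.
        return cls._mock_caller

    def __repr__(self):
        return "<Book [{}]>".format(self.name)


result = FakeBook("anExcelFile.xlsm").set_mock_caller()
print(result)             # None: the setter has no return value
print(FakeBook.caller())  # <Book [anExcelFile.xlsm]>
```

This is why the working sequence calls set_mock_caller() on its own line and then fetches the book through the caller method.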
I have a .gds file. How can I read that file with pandas and do some analysis? What is the best way to do that in Python? The file can be downloaded here.
You need to change the encoding and read the data using latin1:
import pandas as pd
df = pd.read_csv('example.gds',header=27,encoding='latin1')
This will get you the data. Note that header=27 skips the first 27 rows of the file, so parsing starts at the actual table.
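To see what the two arguments do, here is a rough standard-library sketch of the same idea: decode the bytes as latin1, drop the leading rows, and parse the remainder as comma-separated text. The function name read_table and the sample bytes are invented for illustration, and pandas may treat blank lines differently.

```python
import csv
import io


def read_table(raw_bytes, header_row, encoding="latin1"):
    """Decode bytes, skip `header_row` leading lines, and parse the rest as CSV."""
    text = raw_bytes.decode(encoding)  # latin1 maps every byte, so this cannot fail
    lines = text.splitlines()[header_row:]
    reader = csv.reader(io.StringIO("\n".join(lines)))
    rows = list(reader)
    return rows[0], rows[1:]


# Made-up sample: two preamble lines, then a small table with a non-ASCII byte.
sample = b"preamble 1\npreamble 2\nname,value\ncaf\xe9,1\n"
header, data = read_table(sample, header_row=2)
print(header)
print(data)
```

The same bytes would fail to decode as UTF-8, which is the kind of error that switching to latin1 avoids.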
The gdspy package comes in handy for such applications. For example:
import numpy
import gdspy
gdsii = gdspy.GdsLibrary(infile="filename.gds")
main_cell = gdsii.top_level()[0] # Assume a single top level cell
points = main_cell.polygons[0].polygons[0]
for p in points:
    print("Points: {}".format(p))
I've read a lot of stackoverflow and other threads where it's been mentioned how to read excel binary file.
Reference: Read XLSB File in Pandas Python
import pandas as pd
df = pd.read_excel('path_to_file.xlsb', engine='pyxlsb')
However, I cannot find any solution for writing the data back to an .xlsb file after processing it with pandas. Can anyone suggest a workable solution for this using Python?
Any help is much appreciated!
I haven't been able to find any solution to write into xlsb files or create xlsb files using python.
One possible workaround is to save your file as xlsx using any of the many libraries that can do that (such as pandas, xlsxwriter, openpyxl) and then convert that file to xlsb using xlsb-converter. https://github.com/gibz104/xlsb-converter
CAUTION: This repository uses WIN32COM, which is why this script only supports Windows
You can read a binary file with open_workbook from pyxlsb. The code is below:
import pandas as pd
from pyxlsb import open_workbook
path=r'D:\path_to_file.xlsb'
df2 = []
with open_workbook(path) as wb:
    with wb.get_sheet(1) as sheet:
        for row in sheet.rows():
            df2.append([item.v for item in row])
# The first row holds the column names
data = pd.DataFrame(df2[1:], columns=df2[0])
I am trying to import a series of HTML files with news articles that I have saved in my working directory. I developed the code using a single HTML file and it was working perfectly. However, I have since amended the code to import multiple files.
As you can see from the code below, I am using pandas and pd.read_html(). It no longer imports any files and gives me the error 'ValueError: No tables found'.
I have tried with different types of HTML files so that doesn't seem to be the problem. I have also updated all of the packages that I am using. I am using OSX and Python 3.6 and Pandas 0.20.3 in Anaconda Navigator.
It was working, now it's not. What am I doing wrong?
Any tips or clues would be greatly appreciated.
import pandas as pd
from os import listdir
from os.path import isfile, join, splitext
import os
mypath = 'path_to_my_wd'
raw_data = [f for f in listdir(mypath) if (isfile(join(mypath, f)) and splitext(f)[1]=='.html')]
news = pd.DataFrame()
for htmlfile in raw_data:
    articles = pd.read_html(join(mypath, htmlfile), index_col=0) #reads file as html
    data = pd.concat([art for art in articles if 'HD' in art.index.values],
                     axis=1).T.set_index('AN')
    data_export = pd.DataFrame(data, columns=['AN', 'BY', 'SN', 'LP', 'TD'])
    #selects columns to export
    news = news.append(data_export)
The HTML files were slightly different in formatting, so I needed to pass sort=False to pd.concat():
data = pd.concat([art for art in articles if 'HD' in art.index.values], sort=False, axis=1).T.set_index('AN')
The sort parameter is new in pandas version 0.23.0. That solved the problem.
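For anyone who hits 'ValueError: No tables found' for a different reason, one way to narrow it down is to check which files contain any table markup at all. The sketch below is a rough standard-library check, not what pandas does internally; the class and function names are invented for illustration.

```python
from html.parser import HTMLParser


class TableCounter(HTMLParser):
    """Counts opening <table> tags in an HTML document."""

    def __init__(self):
        super().__init__()
        self.tables = 0

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names before calling this method.
        if tag == "table":
            self.tables += 1


def count_tables(html_text):
    parser = TableCounter()
    parser.feed(html_text)
    parser.close()
    return parser.tables


print(count_tables("<html><body><p>no table here</p></body></html>"))
print(count_tables("<table><tr><td>HD</td><td>Headline</td></tr></table>"))
```

Running count_tables on the text of each saved file shows which ones would give pd.read_html() nothing to parse.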
I have tried to find the answer online but only got confused.
I am a Windows user with Python 2.7, working with Excel 2010 files saved on SharePoint and trying to automate the data extraction. Basically, my solitaire-long program opens the files one by one, extracts the data and saves it into a new Excel file.
Hitherto I have used xlwt and xlrd and everything went pretty smoothly. But now I have encountered an .xlsm file that contains pivot tables that need to be refreshed every time. I googled it and found this code: Python: Refresh PivotTables in worksheet
The problem is it does not work for me at all... I keep getting attribute errors like
AttributeError: .Open
I have noticed that the syntax differs substantially, too (wb.Sheets.Count versus wb.nsheets). With win32com I am not able even to iterate through the sheets of a workbook... I just do not have a clue what is a problem - the Python version, importing problem or whatever...
The thing that I cannot find in xlrd/xlwt: ws.PivotTables(j).PivotCache().Refresh() - if I am not mistaken, the problem is that with xlrd/xlwt I am not actually opening Excel file, so probably it is not possible to refresh data using them... Alas, moving to win32com.client hasn't helped...
Any suggestions or links? :) I am probably gonna add an automatic update (refresh) to a VBA code of the xl file in question, but I'd rather not change the files, but my code :)
Edit: I paste my code below, along with an error I keep getting:
The snippet below, copied from someone else's code, does not work. It raises AttributeError: Property 'Excel.Application.Visible' can not be set.:
import win32com.client
import os
xl = win32com.client.DispatchEx("Excel.Application")
wb = xl.workbooks.open("//some_dir.xlsm")
xl.Visible = True
wb.RefreshAll()
xl.Quit()
Alas when I try to merge it into a program that uses also xlrd and xlwt it does not work...
from copy import deepcopy
from xlrd import open_workbook
from xlutils.copy import copy as copy
from xlwt import *
from datetime import *
#some definitions here...
rb = open_workbook('my_template.xlsx')
wb = copy(rb)
sheet_1 = wb.get_sheet(0)
sheet_2 = wb.get_sheet(1)
#now the code that evaluates the dates:
import datetime
ys = raw_input('year in yyyy format:\n')
ms = raw_input('month in mm format:\n')
ds = raw_input('day in dd format:\n')
#here the definitions concerning dates are called, all in xlwt and xlrd
#now other files are being opened and data extracted and written into sheet_1 and sheet_2
#now it's time for refreshments ;) - I wanted to 1. open and update the file with win32com, #close it and 2. with xlrd get the data from the updated file:
#refreshing all (preparing file)
import win32com.client
import os
xl = win32com.client.DispatchEx("Excel.Application")
wb = xl.workbooks.open("some_dir.xlsm") #it is a shared file saved in intranet
xl.Visible = True
wb.RefreshAll()
xl.Quit()
print 'file refreshed'
file = open_workbook("some_dir.xlsm")
row_sheet1 = 176
row_sheet2 = 248
for sheet in file.sheets():
    print sheet.name
#... the rest of the code ensues
The error again is: AttributeError: Property 'Excel.Application.Visible' can not be set.
I guess I am not supposed to use xlrd and win32com.client simultaneously and my whole idea is just soooo wrooong? ;) But why does the first, short snippet not work even on its own?