I am having trouble reading an xlsx file on Pandas.. The same code used to work before but does not work anymore. I tried a lot of ways but to no avails.
Here is my code
import pandas as pd
from io import StringIO
df = pd.read_csv("Muzika.xlsx")
print(df)
Your file is not .csv, it is an excel file, so try read_excel(): https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_excel.html
Related
I have a .gds file. How can I read that file with pandas and do some analysis? What is the best way to do that in Python? The file can be downloaded here.
you need to change the encoding and read the data using latin1
import pandas as pd
df = pd.read_csv('example.gds',header=27,encoding='latin1')
will get you the data file, also you need to skip the first 27 rows of data for the real pandas meat of the file.
The gdspy package comes handy for such applications. For example:
import numpy
import gdspy
gdsii = gdspy.GdsLibrary(infile="filename.gds")
main_cell = gdsii.top_level()[0] # Assume a single top level cell
points = main_cell.polygons[0].polygons[0]
for p in points:
print("Points: {}".format(p))
I've read a lot of stackoverflow and other threads where it's been mentioned how to read excel binary file.
Reference: Read XLSB File in Pandas Python
import pandas as pd
df = pd.read_excel('path_to_file.xlsb', engine='pyxlsb')
However, I can not find any solution on how to write it back as .xlsb file after processing using pandas? Can anyone please suggest a workable solution for this using python?
Any help is much appreciated!
I haven't been able to find any solution to write into xlsb files or create xlsb files using python.
But maybe one work around is to save your file as xlsx using any of the many available libraries to do that (such as pandas, xlsxwriter, openpyxl) and then converting that file into a xlsb using xlsb-converter. https://github.com/gibz104/xlsb-converter
CAUTION: This repository uses WIN32COM, which is why this script only supports Windows
you can read binary file with open_workbook under pyxlsb. Please find below the code:
import pandas as pd
from pyxlsb import open_workbook
path=r'D:\path_to_file.xlsb'
df2=[]
with open_workbook(path) as wb:
with wb.get_sheet(1) as sheet:
for row in sheet.rows():
df2.append([item.v for item in row])
data= pd.DataFrame(df2[1:], columns=df2[0])
Recently i encountered a file with .data extension and i searched google, i found irrelevant answers. I tried different solutions provided by blogs and websites. Nothing seems helpful. I am providing solution which was suggested to me by my colleague. Before i tried reading with read_csv.
import pandas as pd
data = pd.read_csv("example.data")
It processed the csv file but with irrelevant data.
Hope it will be helpful.
The solution to read .data file using pandas is read_fwf(). For better knowledge refer read_fwf.
Example:
import pandas as pd
data = pd.read_fwf("example.data")
By default data will not contains columns because in .data will contain any columns. In order to get column names we have to pass the column names while reading the file.
Example:
import pandas as pd
data = pd.read_fwf("example.data", names=["col1", "col2"])
print(data.columns)
>>> [col1, col2]
Hope this is useful..!!!
I would say treat the .data file as a csv file(for me worked out). In case your column names are missing, just specify them.
I used:
import pandas as pd
DataFrame = pd.read_csv("file.data", names=["columnName", "..." , ".." ])
I am trying to import a series of HTML files with news articles that I have saved in my working directory. I developed the code using one single HTML files and it was working perfectly. However, I have since amended the code to import multiple files.
As you can see from the code below I am using pandas and pd.read_html(). It no longer imports any files and give me the error code 'ValueError: No tables found'.
I have tried with different types of HTML files so that doesn't seem to be the problem. I have also updated all of the packages that I am using. I am using OSX and Python 3.6 and Pandas 0.20.3 in Anaconda Navigator.
It was working, now it's not. What am I doing wrong?
Any tips or clues would be greatly appreciated.
import pandas as pd
from os import listdir
from os.path import isfile, join, splitext
import os
mypath = 'path_to_my_wd'
raw_data = [f for f in listdir(mypath) if (isfile(join(mypath, f)) and splitext(f)[1]=='.html')]
news = pd.DataFrame()
for htmlfile in raw_data:
articles = pd.read_html(join(mypath, htmlfile), index_col=0) #reads file as html
data = pd.concat([art for art in articles if 'HD' in art.index.values],
axis=1).T.set_index('AN')
data_export = pd.DataFrame(data, columns=['AN', 'BY', 'SN', 'LP', 'TD'])
#selects columns to export
news = news.append(data_export)
The HTML files were slightly different in formatting and I needed to pass sort=False to pd.concat(): data = pd.concat([art for art in articles if 'HD' in art.index.values], sort=False, axis=1).T.set_index('AN') This is new in Pandas version 0.23.0. That solved the problem.
There are many questions on this, but there has been no simple answer on how to read an xlsb file into pandas. Is there an easy way to do this?
With the 1.0.0 release of pandas - January 29, 2020, support for binary Excel files was added.
import pandas as pd
df = pd.read_excel('path_to_file.xlsb', engine='pyxlsb')
Notes:
You will need to upgrade pandas - pip install pandas --upgrade
You will need to install pyxlsb - pip install pyxlsb
Hi actually there is a way. Just use pyxlsb library.
import pandas as pd
from pyxlsb import open_workbook as open_xlsb
df = []
with open_xlsb('some.xlsb') as wb:
with wb.get_sheet(1) as sheet:
for row in sheet.rows():
df.append([item.v for item in row])
df = pd.DataFrame(df[1:], columns=df[0])
UPDATE:
as of pandas version 1.0 read_excel() now can read binary Excel (.xlsb) files by passing engine='pyxlsb'
Source: https://pandas.pydata.org/pandas-docs/version/1.0.0/whatsnew/v1.0.0.html
Pyxlsb indeed is an option to read xlsb file, however, is rather limited.
I suggest using the xlwings package which makes it possible to read and write xlsb files without losing sheet formating, formulas, etc. in the xlsb file. There is extensive documentation available.
import pandas as pd
import xlwings as xw
app = xw.App()
book = xw.Book('file.xlsb')
sheet = book.sheets('sheet_name')
df = sheet.range('A1').options(pd.DataFrame, expand='table').value
book.close()
app.kill()
'A1' in this case is the starting position of the excel table.
To write to xlsb file, simply write:
sheet.range('A1').value = df
If you want to read a big binary file or any excel file with some ranges you can directly put at this code
range = (your_index_number)
first_dataframe = []
second_dataframe = []
with open_xlsb('Test.xlsb') as wb:
with wb.get_sheet('Sheet1') as sheet:
i=0
for row in sheet.rows():
if(i!=range):
first_dataframe.append([item.v for item in row])
i=i+1
else:
second_dataframe.append([item.v for item in row])
first_dataframe = pd.DataFrame(first_dataframe[1:], columns=first[0])
second_dataframe = pd.DataFrame(second_dataframe[:], columns=first.columns)
To be able to read xlsb files, it is necessary to have openpyxl installed.
As per https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_excel.html#pandas.read_excel
engine: str, default None
If io is not a buffer or path, this must be set to identify io. Supported engines: “xlrd”, “openpyxl”, “odf”, “pyxlsb”. Engine compatibility :
“xlrd” supports old-style Excel files (.xls).
“openpyxl” supports newer Excel file formats.
“odf” supports OpenDocument file formats (.odf, .ods, .odt).
“pyxlsb” supports Binary Excel files.
Changed in version 1.2.0: The engine xlrd now only supports old-style .xls files.
When engine=None, the following logic will be used to determine the engine:
If path_or_buffer is an OpenDocument format (.odf, .ods, .odt), then odf will be used.
Otherwise if path_or_buffer is an xls format, xlrd will be used.
Otherwise if openpyxl is installed, then openpyxl will be used.
Otherwise if xlrd >= 2.0 is installed, a ValueError will be raised.
Otherwise xlrd will be used and a FutureWarning will be raised. This case will raise a ValueError in a future version of pandas.
xlsb reading without index_col:
import pandas as pd
dfcluster = pd.read_excel('c:/xml/baseline/distribucion.xlsb', sheet_name='Cluster', index_col=0, engine='pyxlsb')