What are the Python 3 options for efficiently (in terms of performance and memory) extracting sheet names and, for a given sheet, column names from a very large .xlsx file?
I've tried using pandas:
For sheet names using pd.ExcelFile:
xl = pd.ExcelFile(filename)
return xl.sheet_names
For column names using pd.ExcelFile:
xl = pd.ExcelFile(filename)
df = xl.parse(sheetname, nrows=2, **kwargs)
df.columns
For column names using pd.read_excel with and without nrows (>v23):
df = pd.read_excel(io=filename, sheet_name=sheetname, nrows=2)
df.columns
However, both pd.ExcelFile and pd.read_excel seem to read the entire .xlsx file into memory and are therefore slow.
Thanks a lot!
Here is the easiest way I can share with you:
# read the sheet file
import pandas as pd
my_sheets = pd.ExcelFile('sheet_filename.xlsx')
my_sheets.sheet_names
According to this SO question, reading Excel files in chunks is not supported (see this issue on GitHub), and using nrows will always read the whole file into memory first.
Possible solutions:
Convert the sheet to csv, and read that in chunks.
Use something other than pandas. See this page for a list of alternative libraries.
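As one illustration of the "something other than pandas" option for the sheet names: an .xlsx file is a ZIP archive, and the sheet list lives in the xl/workbook.xml part. The sketch below reads only that part using the standard library. The function name list_sheet_names is hypothetical, and the code assumes the common (non-strict) SpreadsheetML namespace; it does not read column names.

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespace of the workbook part in ordinary (non-strict) .xlsx files (assumption).
NS = {"main": "http://schemas.openxmlformats.org/spreadsheetml/2006/main"}

def list_sheet_names(path_or_file):
    """Return the sheet names by parsing only xl/workbook.xml inside the archive."""
    with zipfile.ZipFile(path_or_file) as archive:
        with archive.open("xl/workbook.xml") as part:
            root = ET.parse(part).getroot()
    # Each <sheet> element under <sheets> carries the sheet title in its "name" attribute.
    return [sheet.get("name") for sheet in root.findall("main:sheets/main:sheet", NS)]
```

Because the worksheet parts themselves are never decompressed, the cost should not depend on how many rows the sheets contain.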
I think this will do what you need:
from openpyxl import load_workbook
workbook = load_workbook(filename, read_only=True)
data = {}  # maps each sheet title to a tuple of its column headings
for sheet in workbook.worksheets:
    # iterate over the first row only
    for value in sheet.iter_rows(min_row=1, max_row=1, values_only=True):
        data[sheet.title] = value  # value is a tuple with the heading of each column
This program lists all the sheets in the Excel workbook.
Pandas is used here.
import pandas as pd
with pd.ExcelFile('yourfile.xlsx') as xlsx:
    sh = xlsx.sheet_names
print("This workbook has the following sheets : ", sh)
Related
I have about 50 Excel files with the '.xlsb' extension. I'd like to concatenate a specific worksheet from each into a pandas DataFrame (the worksheet names are all the same). The problem is that the column names are not exactly the same in each worksheet. I wrote some code using pandas, but it concatenates values into the same column of the DataFrame based on the column name. For example, sometimes I have a column called FgsNr and sometimes FgNr. The datatype and the meaning of both columns are exactly the same, and I would like to have them in the same column of the DataFrame, but pandas creates two separate columns and stacks together only those values that are listed under the same column name.
from glob import glob
import pandas as pd
files = glob(r'C:\Users\Folder\*xlsb')
Datafile = pd.concat(pd.read_excel(file, engine='pyxlsb', sheet_name='Sheet1', usecols='A:F', header=0) for file in files)
How could I correct the code so that it copies and concatenates all values based on their column position in Excel, ignoring the column names?
When concatenating multiple dataframes with the same format, you can use the below snippet for speed and efficiency.
The basic logic is that you put them into a list, and then concatenate at the final stage.
from glob import glob
import pandas as pd
files = glob(r'C:\Users\Folder\*xlsb')
dfs = []
for file in files:
    df = pd.read_excel(file, engine='pyxlsb', sheet_name='Sheet1', usecols='A:F', header=0)
    dfs.append(df)
large_df = pd.concat(dfs, ignore_index=True)
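Note that pd.concat aligns on column labels, so FgsNr and FgNr would still land in separate columns. A minimal sketch of one way to stack by position instead, assuming every worksheet has the same columns in the same order. The helper concat_by_position and the names list are hypothetical, not part of pandas:

```python
import pandas as pd

def concat_by_position(frames, names):
    """Stack frames vertically, relabelling columns by position first."""
    aligned = []
    for frame in frames:
        renamed = frame.copy()
        # Overwrite the labels; raises ValueError if the column count differs.
        renamed.columns = names
        aligned.append(renamed)
    return pd.concat(aligned, ignore_index=True)
```

In the question's setting, frames would be the DataFrames returned by pd.read_excel for each file, and names would be whichever six labels you want for columns A:F.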
Also refer to the below :
Creating an empty Pandas DataFrame, then filling it?
I have an Excel file with multiple sheets (6) in it. I am writing a Python script to convert each individual sheet into a CSV file.
My input file looks like this; for example, for the sheet named class5:
Name ID
Mark 11
Tom 22
Jane 33
I have multiple sheets like this in the Excel file.
I need to convert them into CSV files containing just 'Name' and the class, like this:
Mark,class5
Tom,class5
Jane,class5
That is one sheet; I have multiple sheets like it, so what I am doing is converting every sheet into a dataframe like this:
xls = pd.ExcelFile('path_of_file.xlsx')
df1 = pd.read_excel(xls, 'Sheet1')
df2 = pd.read_excel(xls, 'Sheet2')
df3 = pd.read_excel(xls, 'Sheet3')
How can I make csv file called 'class5'.csv with output as above and same for every sheet such as class6,7,8?
So, assuming from your question what you want is the contents of each sheet to be saved to a different csv, where the csv has the name column, and another column containing the name of the sheet it came from, without a header.
If that's what you're after, you could do:
xls = pd.read_excel('path_of_file', sheet_name=None)
for sheet_name, df in xls.items():
    df['sheet'] = sheet_name
    # index=False keeps the row index out of the CSV, matching the requested output
    df[['Name', 'sheet']].to_csv(f'{sheet_name}.csv', header=False, index=False)
The key point is the sheet_name argument of read_excel. As the commenter on your question states, leave this as None and you will get a dictionary of DataFrames, keyed by sheet name, that you can iterate through.
I have a pandas dataframe as below:
header = [np.array(['location','location','location','location2','location2','location2']),
np.array(['S1','S2','S3','S4','S5','S6'])]
df = pd.DataFrame(np.random.randn(5, 6), columns = header )
df
I want to export my dataframe to an Excel sheet without the index. Here is my code, which exports the dataframe to an Excel spreadsheet but with the index. When I use the parameter index=False, it gives me an error.
# output all the consolidated input to an excel sheet
out_file_name = os.path.join(folder_location, "templates", future_template)
writer = pd.ExcelWriter(out_file_name, engine='xlsxwriter')
# Write each dataframe to a different worksheet.
df.to_excel(writer, sheet_name='Ratings Inputs')
# Close the Pandas Excel writer and output the Excel file.
writer.close()
DataFrame.to_excel(index=False) is still unsupported for MultiIndex columns (as of pandas 1.3.4, Oct 2021). You will get the error:
NotImplementedError: Writing to Excel with MultiIndex columns and no index ('index'=False) is not yet implemented.
You can try some workarounds instead:
Write with index=True. Then using openpyxl, re-open the file, delete the undesired cols/rows, and re-save the file. This is a slow process, so it may not be practical for large dataframes.
You can manually write the MultiIndex headers. This won't have merged cells though. See How to hide the rows index
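The second workaround could be sketched roughly as follows: split the frame into header rows (one per column level) and a body with plain column labels, then write both to the same sheet. The helper names split_header_and_body and write_without_index are hypothetical, the header cells are not merged, and write_without_index assumes an Excel engine such as openpyxl or xlsxwriter is installed.

```python
import pandas as pd

def split_header_and_body(df):
    """Return (header_rows, body), both with plain integer column labels."""
    levels = range(df.columns.nlevels)
    # One row per level of the column MultiIndex, repeated labels left as-is.
    header_rows = pd.DataFrame([list(df.columns.get_level_values(i)) for i in levels])
    body = df.copy()
    body.columns = range(df.shape[1])  # replace the MultiIndex with flat labels
    return header_rows, body

def write_without_index(df, path, sheet_name="Sheet1"):
    """Write the header rows, then the data underneath, with no index column."""
    header_rows, body = split_header_and_body(df)
    with pd.ExcelWriter(path) as writer:
        header_rows.to_excel(writer, sheet_name=sheet_name, header=False, index=False)
        body.to_excel(writer, sheet_name=sheet_name, header=False, index=False,
                      startrow=len(header_rows))
```

Both frames passed to to_excel have single-level columns, so the MultiIndex code path that raises the error should not be reached.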
I have a .csv file that I'm reading from. I read only select columns from it and I need to further process this data before I save it into an excel sheet. The idea is to repeat this process for all the files in the folder and save the sheets with the same names as the original .csv.
As of now, I'm able to read the specific columns from the .csv and write the whole file into Excel. I am yet to figure out how to further process these columns before I save to Excel. Further processing involves:
Averaging rows 18000-20000 for each column separately.
Calculating (Column value - Average)/Average
Saving these values in separate columns with different column names.
My code is as follows. Need some help with this.
import pandas as pd
import os
from pathlib import Path
for f in os.listdir():
    file_name, file_ext = os.path.splitext(f) #splitting into file name and extension
    if file_ext == '.atf':
        #open the data file and get data only from specific columns.
        df = pd.read_csv(f, header = 9, index_col = 0, usecols = [0,55,59,63,67,71,75,79,83,87,91,95,99,103], encoding = "ISO-8859-1", sep = '\t', dtype = {'YFP.13':str,'YFP.14':str,'YFP.15':str,'YFP.16':str,'YFP.17':str,'YFP.18':str,'YFP.19':str,'YFP.20':str,'YFP.21':str,'YFP.22':str,'YFP.23':str,'YFP.24':str,'YFP.25':str,'Signals=':str})
        df.to_excel(file_name+'.xlsx',sheet_name=file_name, engine = 'xlsxwriter') #writing into an excel file
Let's say your dataframe has a shape of 4 columns and 100 rows:
import numpy as np
import pandas as pd
data = pd.DataFrame(np.random.randint(0,100,size=(100, 4)), columns=list('ABCD'))
Averaging rows 18000-20000 for each column separately.
To average over a subset of rows, define a boolean mask from inequalities on the index and apply mean() to the selected rows. In this 100-row example, rows 61 to 79 stand in for rows 18000-20000 (both inequalities are strict). The result is saved in a new Series, since it is needed later:
means = data[(60<data.index) & (80>data.index)].mean()
Calculating (Column value - Average)/Average
Saving these values in separate columns with different column names
For the last two steps, the code below speaks for itself:
cols = data.columns
for col in cols:
    data[col +"_calc"] = (data[col]-means[col])/means[col]
In the end, you may export the dataframe "data" to an excel format as you did earlier.
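The same three steps can also be written without an explicit loop. This is only a sketch: the function name add_relative_change is hypothetical, rows are selected by position with iloc (so stop is exclusive), and the columns are assumed to be numeric, which would mean not reading them with dtype str as in the question.

```python
import pandas as pd

def add_relative_change(df, start, stop, suffix="_calc"):
    """Append (value - baseline) / baseline columns, one per original column."""
    # Column-wise mean over the positional row window [start, stop).
    baseline = df.iloc[start:stop].mean()
    # DataFrame and Series align on column labels, so this works per column.
    relative = (df - baseline) / baseline
    return df.join(relative.add_suffix(suffix))
```

For the question's data, a call such as add_relative_change(df, 18000, 20000) would use rows 18000 to 19999 by position as the baseline.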
I have a data set (3000 rows and 20000 columns). I need to add a column with the header 'class' in which all 3000 rows contain the same word, here 'Big'. How can I do that using Python? I tried to do it manually, but the file is too large for Excel and can't be loaded completely.
I know it may seem easy, but I'm new to Python and have tried several snippets, and none of them gave the needed result.
Use Pandas module:
import pandas as pd
# 'class' is a Python keyword, so it is passed to assign() through a dict
df = pd.read_csv(r'/path/to/file.csv').assign(**{'class': 'Big'})
df.to_csv('/path/to/new_file.csv', index=False)
or as a one-liner:
pd.read_csv(r'/path/to/file.csv').assign(**{'class': 'Big'}) \
    .to_csv(r'/path/to/new_file.csv', index=False)
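If loading 20000 columns into a DataFrame turns out to be heavy, the column can also be appended while streaming the file row by row with the standard csv module. This is a sketch under the assumption that the first row is the header; append_constant_column is a hypothetical helper name.

```python
import csv

def append_constant_column(src, dst, header, value):
    """Copy CSV rows from src to dst, adding one extra column at the end."""
    reader = csv.reader(src)
    writer = csv.writer(dst, lineterminator="\n")
    for i, row in enumerate(reader):
        # The first row gets the new header; every other row gets the constant value.
        writer.writerow(row + [header if i == 0 else value])
```

With real files, src and dst would be handles opened with newline='' (as the csv module documentation recommends), for example open('/path/to/file.csv', newline='') and open('/path/to/new_file.csv', 'w', newline='').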
UPDATE:
I have 9 files like the one you just helped me add a column to; each
one represents a class's attributes. Can you tell me how I can combine
these files into one CSV file that will have 27000 rows and 30000
columns?
files = ['file1.csv','file2.csv', ...]
df = pd.concat([pd.read_csv(f) for f in files], ignore_index=True)