I have a pandas dataframe as below:
import numpy as np
import pandas as pd

header = [np.array(['location','location','location','location2','location2','location2']),
          np.array(['S1','S2','S3','S4','S5','S6'])]
df = pd.DataFrame(np.random.randn(5, 6), columns=header)
df
I want to export my dataframe to an Excel sheet, ignoring the index. The code below exports the dataframe to a spreadsheet, but with the index. When I pass the parameter index=False, it gives me an error.
# output all the consolidated input to an excel sheet
out_file_name = os.path.join(folder_location, "templates", future_template)
writer = pd.ExcelWriter(out_file_name, engine='xlsxwriter')
# Write each dataframe to a different worksheet.
df.to_excel(writer, sheet_name='Ratings Inputs')
# Close the Pandas Excel writer and output the Excel file.
writer.save()
DataFrame.to_excel(index=False) is still unsupported for MultiIndex columns (as of pandas 1.3.4, October 2021). You will get the error:
NotImplementedError: Writing to Excel with MultiIndex columns and no index ('index'=False) is not yet implemented.
You can try some workarounds instead:
Write with index=True. Then, using openpyxl, re-open the file, delete the undesired column and row, and re-save it (see the sketch below). This round trip is slow, so it may not be practical for large dataframes.
You can manually write the MultiIndex headers yourself. This won't produce merged cells, though. See How to hide the rows index.
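A minimal sketch of the first workaround, assuming a two-level column header as in the example above. The exact offsets are assumptions: with a 2-level MultiIndex, pandas typically writes two header rows, a blank spacer row, and the index in column A, but this can vary by pandas version, so verify against your output.
import openpyxl

# Write normally first (index=True is the default)
df.to_excel(out_file_name, sheet_name='Ratings Inputs')

# Re-open the file and strip the index artifacts
wb = openpyxl.load_workbook(out_file_name)
ws = wb['Ratings Inputs']
ws.delete_cols(1)   # remove the index column (column A)
ws.delete_rows(3)   # remove the blank spacer row under the 2-level header
wb.save(out_file_name)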
I have about 50 Excel files with the .xlsb extension. I'd like to concatenate a specific worksheet from each into a single pandas DataFrame (the worksheet name is the same in every file). The problem is that the column names are not exactly the same in each worksheet. The code I wrote concatenates values into the same DataFrame column based on the column name. So, for example, sometimes a column is called FgsNr and sometimes FgNr; the datatype and the meaning of both columns are exactly the same, and I would like them in the same DataFrame column, but pandas creates two separate columns and stacks together only those values that share a name.
from glob import glob
import pandas as pd

files = glob(r'C:\Users\Folder\*xlsb')
for file in files:
    Datafile = pd.concat(pd.read_excel(file, engine='pyxlsb', sheet_name='Sheet1', usecols='A:F', header=0) for file in files)
How could I correct the code so that it concatenates all values based on column positions in the Excel files, ignoring the column names?
When concatenating multiple dataframes with the same format, you can use the snippet below for speed and efficiency.
The basic logic is to collect the frames in a list and concatenate once at the final stage.
from glob import glob
import pandas as pd

files = glob(r'C:\Users\Folder\*xlsb')
dfs = []
for file in files:
    df = pd.read_excel(file, engine='pyxlsb', sheet_name='Sheet1', usecols='A:F', header=0)
    dfs.append(df)
large_df = pd.concat(dfs, ignore_index=True)
Also refer to:
Creating an empty Pandas DataFrame, then filling it?
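Note that ignore_index=True only resets the row index; it does not align columns whose names differ between files (e.g. FgsNr vs FgNr). Assuming every sheet has the same column order, one option is to replace each frame's column names with positions before concatenating, then assign the final names once at the end. A sketch; the final column names below are placeholders:
dfs = []
for file in files:
    df = pd.read_excel(file, engine='pyxlsb', sheet_name='Sheet1', usecols='A:F', header=0)
    df.columns = range(df.shape[1])  # positional names 0..5, so FgsNr/FgNr land in the same column
    dfs.append(df)
large_df = pd.concat(dfs, ignore_index=True)
large_df.columns = ['FgNr', 'col_2', 'col_3', 'col_4', 'col_5', 'col_6']  # placeholder final names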
I have a workbook named CustomerSales with two Excel worksheets, Sheet1 and Sheet2. In both sheets I need to write/change values in particular columns, so I used:
if (data.at[g, 'failedColumn'] == '' and data.at[g, 'reason'] == ''):
    data.at[g, 'status'] = 'Fail'
    data.at[g, 'failedColumn'] = 'BUKRS'
    data.at[g, 'reason'] = 'Customer Not Extended To Any Company code'
data.to_excel(variable)  # variable = path to the Excel file
Here data is the DataFrame of Sheet1, and the code works perfectly fine: the resulting Excel file has the updated column values.
But when I run the same code with the DataFrame of Sheet2, the existing Sheet1 data gets replaced by Sheet2. Is there a way I can change values in both sheets?
To manipulate the data in both sheets, read the file with pandas, create two separate dataframes (one per sheet), and then save them into the same workbook using xlsxwriter. Here is a demo:
import pandas as pd

# Read the 2 sheets as separate dataframes
df1 = pd.read_excel('name_of_your_file.xlsx', sheet_name='Sheet1')
df2 = pd.read_excel('name_of_your_file.xlsx', sheet_name='Sheet2')

# Do all of your data manipulation here

# Start using xlsxwriter
writer = pd.ExcelWriter('name_of_the_new_file.xlsx', engine='xlsxwriter')

# Save each df to a separate sheet in the same file
df1.to_excel(writer, sheet_name='Sheet1', index=False)
df2.to_excel(writer, sheet_name='Sheet2', index=False)

# You can format your worksheets here

# Finally, save the file (note: in pandas >= 2.0, writer.save() was removed; use writer.close())
writer.save()
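Equivalently, a with block closes and saves the writer automatically, which sidesteps the save()/close() rename across pandas versions:
with pd.ExcelWriter('name_of_the_new_file.xlsx', engine='xlsxwriter') as writer:
    df1.to_excel(writer, sheet_name='Sheet1', index=False)
    df2.to_excel(writer, sheet_name='Sheet2', index=False)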
What are the Python 3 options to efficiently (in both performance and memory) extract the sheet names from a very large .xlsx file and, for a given sheet, the column names?
I've tried using pandas:
For sheet names using pd.ExcelFile:
xl = pd.ExcelFile(filename)
return xl.sheet_names
For column names using pd.ExcelFile:
xl = pd.ExcelFile(filename)
df = xl.parse(sheetname, nrows=2, **kwargs)
df.columns
For column names using pd.read_excel, with and without nrows (supported in pandas >= 0.23):
df = pd.read_excel(io=filename, sheet_name=sheetname, nrows=2)
df.columns
However, both pd.ExcelFile and pd.read_excel seem to read the entire .xlsx file into memory and are therefore slow.
Here is the easiest way:
import pandas as pd

# List the sheet names in the workbook
my_sheets = pd.ExcelFile('sheet_filename.xlsx')
my_sheets.sheet_names
According to this SO question, reading Excel files in chunks is not supported (see this issue on GitHub), and using nrows will still read the whole file into memory first.
Possible solutions:
Convert the sheet to CSV, and read that in chunks (see the sketch below).
Use something other than pandas. See this page for a list of alternative libraries.
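A sketch of the chunked-CSV route, assuming the sheet has already been exported to a file named large.csv (the file name and chunk size are placeholders):
import pandas as pd

# Stream the CSV in 100,000-row chunks instead of loading it all at once
for chunk in pd.read_csv('large.csv', chunksize=100_000):
    print(chunk.columns)  # the column names are available from the first chunk
    break                 # stop early if you only need the header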
I think this would help:
from openpyxl import load_workbook

workbook = load_workbook(filename, read_only=True)
data = {}  # maps each sheet title to a tuple of its column headings
for sheet in workbook.worksheets:
    for value in sheet.iter_rows(min_row=1, max_row=1, values_only=True):
        data[sheet.title] = value  # value is a tuple with the heading of each column
workbook.close()  # read-only workbooks keep the file handle open until closed
This program lists all the sheets in the Excel workbook, using pandas.
import pandas as pd

with pd.ExcelFile('yourfile.xlsx') as xlsx:
    sh = xlsx.sheet_names
print("This workbook has the following sheets:", sh)
I have a .csv file that I'm reading from. I read only select columns from it, and I need to further process this data before saving it to an Excel sheet. The idea is to repeat this process for all the files in the folder and save each sheet with the same name as the original .csv.
As of now, I'm able to read the specific columns from the .csv and write the whole file to Excel. I have yet to figure out how to further process these columns before saving. Further processing involves
Averaging rows 18000-20000 for each column separately.
Calculating (Column value - Average)/Average
Saving these values in separate columns with different column names.
My code is as follows. Need some help with this.
import pandas as pd
import os
from pathlib import Path

for f in os.listdir():
    file_name, file_ext = os.path.splitext(f)  # split into file name and extension
    if file_ext == '.atf':
        # open the data file and get data only from specific columns
        df = pd.read_csv(f, header=9, index_col=0, usecols=[0,55,59,63,67,71,75,79,83,87,91,95,99,103], encoding="ISO-8859-1", sep='\t', dtype={'YFP.13':str,'YFP.14':str,'YFP.15':str,'YFP.16':str,'YFP.17':str,'YFP.18':str,'YFP.19':str,'YFP.20':str,'YFP.21':str,'YFP.22':str,'YFP.23':str,'YFP.24':str,'YFP.25':str,'Signals=':str})
        df.to_excel(file_name + '.xlsx', sheet_name=file_name, engine='xlsxwriter')  # write to an Excel file
Let's say your dataframe has a shape of 4 columns and 100 rows:
import numpy as np
import pandas as pd

data = pd.DataFrame(np.random.randint(0, 100, size=(100, 4)), columns=list('ABCD'))
Averaging rows 18000-20000 for each column separately.
To perform the averaging on a subset, define a logical mask based on inequalities over the index and apply the averaging function to the selected rows. The result is a Series of per-column means, saved so you can use it later:
means = data[(60 < data.index) & (data.index < 80)].mean()  # demo range for the 100-row example
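On the real data described in the question, the same pattern with the actual row range would be (whether the boundaries are inclusive is an assumption; adjust to taste):
means = data[(data.index >= 18000) & (data.index <= 20000)].mean()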
Calculating (Column value - Average)/Average
Saving these values in separate columns with different column names
For the last two steps, the code below speaks for itself:
cols = data.columns
for col in cols:
    data[col + "_calc"] = (data[col] - means[col]) / means[col]
In the end, you can export the dataframe data to Excel as you did earlier.
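For completeness, a one-line export matching the earlier pattern (the file and sheet names are placeholders):
data.to_excel('processed.xlsx', sheet_name='processed', index=False, engine='xlsxwriter')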
This question is kind of odd and complex, so bear with me, please.
I have several massive CSV files (GB size) that I am importing with pandas. These CSV files are dumps of data collected by a data acquisition system, and I don't need most of it, so I'm using the usecols parameter to filter out the relevant data. The issue is that not all of the CSV files have all of the columns I need (a property of the data system being used).
The problem is that, if the column doesn't exist in the file but is specified in usecols, read_csv throws an error.
Is there a straightforward way to force a specified column set in a dataframe and have pandas just return blank rows if the column doesn't exist? I thought about iterating over each column for each file and working the resulting series into the dataframe, but that seems inefficient and unwieldy.
Assuming some kind of master list all_cols_to_use, can you do something like:
import pandas as pd

def parse_big_csv(csvpath):
    # Read just the header line to see which columns this file actually has
    with open(csvpath, 'r') as infile:
        header = infile.readline().strip().split(',')
    cols_to_use = sorted(set(header) & set(all_cols_to_use))
    missing_cols = sorted(set(all_cols_to_use) - set(header))
    df = pd.read_csv(csvpath, usecols=cols_to_use)
    # Add the columns absent from this file, filled with NaN
    df = df.reindex(columns=cols_to_use + missing_cols)
    return df
This assumes that you're okay with filling the missing columns with NaN, but it should work. (Also, if you're concatenating the data frames, the missing columns will be present in the final df and filled with NaN as appropriate.)
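A usage sketch for the concatenation step (all_cols_to_use and the file paths are placeholders):
all_cols_to_use = ['col_a', 'col_b', 'col_c']  # hypothetical master column list
paths = ['dump1.csv', 'dump2.csv']             # hypothetical CSV dumps
df_all = pd.concat([parse_big_csv(p) for p in paths], ignore_index=True)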