How to avoid writing an empty row when I save a multi-header DataFrame into an Excel file?

I would like to save a multi-header DataFrame as an Excel file. Here is sample code:
import pandas as pd
import numpy as np

header = pd.MultiIndex.from_product([['location1', 'location2'],
                                     ['S1', 'S2', 'S3']],
                                    names=['loc', 'S'])
df = pd.DataFrame(np.random.randn(5, 6),
                  index=['a', 'b', 'c', 'd', 'e'],
                  columns=header)
df.to_excel('result.xlsx')
There are two issues with the resulting Excel file:
Issue 1:
There is an empty row under the headers. How can I prevent pandas from writing / inserting this empty row in the Excel file?
Issue 2:
I want to save the DataFrame without the index. However, when I set index=False, I get the following error:
df.to_excel('result.xlsx', index=False)
Error:
NotImplementedError: Writing to Excel with MultiIndex columns and no index ('index'=False) is not yet implemented.

You can create two DataFrames, one holding only the headers and one with default integer column labels, and write both to the same sheet with the startrow parameter:
header = df.columns.to_frame(index=False)
header.loc[header['loc'].duplicated(), 'loc'] = ''
header = header.T
print(header)
             0   1   2          3   4   5
loc  location1          location2
S           S1  S2  S3         S1  S2  S3

df1 = df.set_axis(range(len(df.columns)), axis=1)
print(df1)
          0         1         2         3         4         5
a -1.603958  1.067986  0.474493 -0.352657 -2.198830 -2.028590
b -0.989817 -0.621200  0.010686 -0.248616  1.121244  0.727779
c -0.851071 -0.593429 -1.398475  0.281235 -0.261898 -0.568850
d  1.414492 -1.309289 -0.581249 -0.718679 -0.307876  0.535318
e -2.108857 -1.870788  1.079796  0.478511  0.613011 -0.441136

with pd.ExcelWriter('output.xlsx') as writer:
    header.to_excel(writer, sheet_name='Sheet_name_1', header=False, index=False)
    df1.to_excel(writer, sheet_name='Sheet_name_1', header=False, index=False, startrow=2)
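As a sanity check of the header-flattening step, here is a minimal sketch (independent of any Excel writer) showing the two rows that end up in the sheet:

```python
import pandas as pd
import numpy as np

# Same MultiIndex header as in the question
header = pd.MultiIndex.from_product([['location1', 'location2'],
                                     ['S1', 'S2', 'S3']],
                                    names=['loc', 'S'])
df = pd.DataFrame(np.random.randn(5, 6),
                  index=['a', 'b', 'c', 'd', 'e'],
                  columns=header)

# Flatten the MultiIndex into a plain frame and blank repeated 'loc' labels
hdr = df.columns.to_frame(index=False)
hdr.loc[hdr['loc'].duplicated(), 'loc'] = ''
hdr = hdr.T

print(hdr.iloc[0].tolist())  # ['location1', '', '', 'location2', '', '']
print(hdr.iloc[1].tolist())  # ['S1', 'S2', 'S3', 'S1', 'S2', 'S3']
```

Because these two rows are written with header=False and the data block starts at startrow=2, no blank spacer row appears between the headers and the data.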

Related

Append values to DataFrame in loop and if conditions

Need help please.
I have a script that reads rows from Excel sheets and appends them to a DataFrame if certain columns exist.
If the columns don't exist in a sheet, I need to record that sheet's filename and sheet name, and write all those file names and sheet names to an Excel file. I also want the values to be unique.
I tried adding to dfErrorList, but it only showed the last sheet name and file name, repeated many times in the output Excel file.
from xlsxwriter import Workbook
import pandas as pd
import openpyxl
import glob
import os

path = 'filestoimport/*.xlsx'
list_of_dfs = []
list_of_dferror = []
dfErrorList = pd.DataFrame()  # create empty df
for filepath in glob.glob(path):
    xl = pd.ExcelFile(filepath)
    # Define an empty list to store individual DataFrames
    for sheet_name in xl.sheet_names:
        df = pd.read_excel(filepath, sheet_name=sheet_name)
        df['sheetname'] = sheet_name
        file_name = os.path.basename(filepath)
        df['sourcefilename'] = file_name
        if "Project ID" in df.columns and "Status" in df.columns:
            print('')
        else:
            dfErrorList['sheetname'] = df['sheetname']  # adds `sheet_name` into the column
            dfErrorList['sourcefilename'] = df['sourcefilename']
            continue
        list_of_dferror.append(dfErrorList)
        df['Status'].fillna('', inplace=True)
        df['Added by'].fillna('', inplace=True)
        list_of_dfs.append(df)

# Combine all DataFrames into one
data = pd.concat(list_of_dfs, ignore_index=True)
dataErrors = pd.concat(list_of_dferror, ignore_index=True)
dataErrors.to_excel(r'error.xlsx', index=False)
# data.to_excel("total_countries.xlsx", index=None)
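One common fix is to collect one record per failing sheet in a plain list and build the error DataFrame once at the end, de-duplicating with drop_duplicates. A minimal sketch, with dummy (sheet name, file name, columns) triples standing in for the Excel reads:

```python
import pandas as pd

# Dummy data standing in for the per-sheet Excel reads
sheets = [
    ('q1', 'a.xlsx', ['Project ID', 'Status']),
    ('q2', 'a.xlsx', ['Other']),
    ('q2', 'a.xlsx', ['Other']),   # same failing sheet seen twice
    ('q3', 'b.xlsx', ['Other']),
]

error_records = []
for sheet_name, file_name, columns in sheets:
    if 'Project ID' in columns and 'Status' in columns:
        continue  # sheet has the required columns, nothing to record
    # append one row per failing sheet instead of overwriting DataFrame columns
    error_records.append({'sheetname': sheet_name, 'sourcefilename': file_name})

dfErrorList = pd.DataFrame(error_records).drop_duplicates()
print(dfErrorList)
```

This keeps the values unique and avoids the last-sheet-repeated symptom, which comes from assigning whole columns of the same dfErrorList object on every iteration.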

pandas read_csv with data and headers in alternate columns

I have a generated CSV file that:
- doesn't have headers
- has header and data occurring alternately in every row (the headers do not change from row to row)
E.g.:
imageId,0,feat1,30,feat2,34,feat,90
imageId,1,feat1,0,feat2,4,feat,89
imageId,2,feat1,3,feat2,3,feat,80
IMO, this format is redundant and cumbersome (I don't see why anyone would generate files in this format). A saner/normal CSV of the same data, which I could read directly using pd.read_csv(), would be:
imageId,feat1,feat2,feat
0,30,34,90
1,0,4,89
2,3,3,80
My question is, how do I read the original data into a pandas DataFrame? For now, I do a read_csv and then drop all alternate columns:
df = pd.read_csv(file, header=None)
df = df[range(1, len(df.columns), 2)]
The problem with this is that I don't get the headers unless I make a point of specifying them myself.
Is there a simpler way of telling pandas that the format has data and headers in every row?
You can select the data columns by position with DataFrame.iloc and set the new column names from the header values in the first row (assuming the header columns hold the same values in every row, as in the sample data):
# default headers
df = pd.read_csv(file, header=None)
df1 = df.iloc[:, 1::2]
df1.columns = df.iloc[0, ::2].tolist()
print(df1)
   imageId  feat1  feat2  feat
0        0     30     34    90
1        1      0      4    89
2        2      3      3    80
I didn't measure but I would expect that it could be a problem to read the entire file (redundant headers and actual data) before filtering for the interesting stuff. So I tried to exploit the optional parameters nrows and usecols to (hopefully) limit the amount of memory needed to process the CSV input file.
# --- Utilities for generating test data ---
import random as rd

def write_csv(file, line_count=100):
    with open(file, 'w') as f:
        r = lambda: rd.randrange(100)
        for i in range(line_count):
            line = f"imageId,{i},feat1,{r()},feat2,{r()},feat,{r()}\n"
            f.write(line)

file = 'text.csv'
# Generate a small CSV test file
write_csv(file, 10)

# --- Actual answer ---
import pandas as pd

# Read only the first row to discover the column layout
dfi = pd.read_csv(file, header=None, nrows=1)
ncols = dfi.size
# Read the data columns only
dfd = pd.read_csv(file, header=None, usecols=range(1, ncols, 2))
dfd.columns = dfi.iloc[0, ::2].to_list()
print(dfd)
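The same slicing idea can be exercised without touching the filesystem by reading from an in-memory buffer; a small self-contained sketch:

```python
import io
import pandas as pd

csv_text = (
    "imageId,0,feat1,30,feat2,34,feat,90\n"
    "imageId,1,feat1,0,feat2,4,feat,89\n"
    "imageId,2,feat1,3,feat2,3,feat,80\n"
)

df = pd.read_csv(io.StringIO(csv_text), header=None)
data = df.iloc[:, 1::2].copy()           # odd columns hold the values
data.columns = df.iloc[0, ::2].tolist()  # even columns hold the repeated headers

print(data.columns.tolist())  # ['imageId', 'feat1', 'feat2', 'feat']
```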

How to prepend a string to a column in csv

I have a CSV with 5 columns, in.csv, and I want to prepend "text" to all the data in column 2, in the same file, so that the row
data1 data2 data3 data4 data5
becomes
data1 textdata2 data3 data4 data5
I thought using a regex might be a good idea, but I'm not sure how to proceed.
Edit:
After proceeding according to bigbounty's answer, I used the following script:
import pandas as pd
df = pd.read_csv("in.csv")
df["id_str"] = str("text" + str(df["id_str"]))
df.to_csv("new_in.csv", index=False)
My in.csv file is like:
s_no,id_str,screen_name
1,1.15017060743203E+018,lorem
2,1.15015544419693E+018,ipsum
3,1.15015089995785E+018,dolor
4,1.15015054311063E+018,sit
After running the script, the new_in.csv file is:
s_no,id_str,screen_name
1,"text0 1.150171e+18
1 1.150155e+18
2 1.150151e+18
3 1.150151e+18
Name: id_str, dtype: float64",lorem
2,"text0 1.150171e+18
1 1.150155e+18
2 1.150151e+18
3 1.150151e+18
Name: id_str, dtype: float64",ipsum
3,"text0 1.150171e+18
1 1.150155e+18
2 1.150151e+18
3 1.150151e+18
Name: id_str, dtype: float64",dolor
4,"text0 1.150171e+18
1 1.150155e+18
2 1.150151e+18
3 1.150151e+18
Name: id_str, dtype: float64",sit
Whereas it should be:
s_no,id_str,screen_name
1,text1.15017060743203E+018,lorem
2,text1.15015544419693E+018,ipsum
3,text1.15015089995785E+018,dolor
4,text1.15015054311063E+018,sit
This can easily be done using pandas. The problem in the edited script above is that str("text" + str(df["id_str"])) turns the whole Series into one string (its printed representation) and assigns that same string to every row; concatenating onto the Series itself keeps the operation element-wise:
import pandas as pd
df = pd.read_csv("in.csv")
df["data2"] = "text" + df["data2"].astype(str)
df.to_csv("new_in.csv", index=False)
Using the csv module
import csv

with open('test.csv', 'r+', newline='') as f:
    data = list(csv.reader(f))  # produces a list of lists
    for i, r in enumerate(data):
        if i > 0:  # presumes the first list is a header and skips it
            r[1] = 'text' + r[1]  # add text to the front of the value at index 1
    f.seek(0)  # rewind to the beginning of the file
    writer = csv.writer(f)
    writer.writerows(data)  # write the new data back to the file
# the resulting text file
s_no,id_str,screen_name
1,text1.15017060743203E+018,lorem
2,text1.15015544419693E+018,ipsum
3,text1.15015089995785E+018,dolor
4,text1.15015054311063E+018,sit
Using pandas
This solution is agnostic of any column names because it uses column index.
pandas.DataFrame.iloc
import pandas as pd
# read the file set the column at index 1 as str
df = pd.read_csv('test.csv', dtype={1: str})
# add text to the column at index 1
df.iloc[:, 1] = 'text' + df.iloc[:, 1]
# write to csv
df.to_csv('test.csv', index=False)
# resulting csv
s_no,id_str,screen_name
1,text1.15017060743203E+018,lorem
2,text1.15015544419693E+018,ipsum
3,text1.15015089995785E+018,dolor
4,text1.15015054311063E+018,sit
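Both answers behave element-wise; this can be verified without touching files. A minimal sketch of the csv-module variant over an in-memory buffer:

```python
import csv
import io

src = io.StringIO(
    "s_no,id_str,screen_name\n"
    "1,1.15017060743203E+018,lorem\n"
    "2,1.15015544419693E+018,ipsum\n"
)

rows = list(csv.reader(src))
for row in rows[1:]:          # skip the header row
    row[1] = 'text' + row[1]  # prepend to the second column only

print(rows[1])  # ['1', 'text1.15017060743203E+018', 'lorem']
```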

How to get the full text file after merge?

I'm merging two text files, file1.tbl and file2.tbl, on a common column. I used pandas to make a data frame from each and the merge function to produce the output.
The problem is that the output file does not contain the whole data: there is a row of "..." in the middle, and at the end it just prints [9997 rows x 5 columns].
I need a file containing all 9997 rows.
import pandas

with open("file1.tbl") as file:
    d1 = file.read()
with open("file2.tbl") as file:
    d2 = file.read()

df1 = pandas.read_table('file1.tbl', delim_whitespace=True, names=('ID', 'chromosome', 'strand'))
df2 = pandas.read_table('file2.tbl', delim_whitespace=True, names=('ID', 'NUClen', 'GCpct'))
merged_table = pandas.merge(df1, df2)
with open('merged_table.tbl', 'w') as f:
    print(merged_table, file=f)
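The "..." row appears because print writes pandas' truncated repr, which elides the middle rows of a large frame. Writing the merged frame with DataFrame.to_csv emits every row; a minimal sketch with small in-memory frames standing in for the .tbl files:

```python
import io
import pandas as pd

# Small frames standing in for the two .tbl files
df1 = pd.DataFrame({'ID': range(100), 'chromosome': 'chr1', 'strand': '+'})
df2 = pd.DataFrame({'ID': range(100), 'NUClen': 500, 'GCpct': 0.41})

merged = pd.merge(df1, df2)  # joins on the shared 'ID' column

# to_csv writes every row; print(merged) would elide the middle ones
buf = io.StringIO()
merged.to_csv(buf, sep='\t', index=False)

# read it back to confirm nothing was truncated
buf.seek(0)
roundtrip = pd.read_csv(buf, sep='\t')
print(len(roundtrip))  # 100
```

For a real file, replacing the final print(merged_table, file=f) with merged_table.to_csv('merged_table.tbl', sep='\t', index=False) preserves all rows.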

Appending Columns from several worksheets Python

I am trying to import certain columns of data from several different sheets inside a workbook. However, appending only seems to write 'q2 survey' to the new workbook. How do I get this to append properly?
import sys, os
import pandas as pd
import xlrd
import xlwt

b = ['q1 survey', 'q2 survey', 'q3 survey']  # sheet names
df_t = pd.DataFrame(columns=["Month", "Date", "Year"])  # column names
xls = "path_to_file/R.xls"
sheet = []
df_b = pd.DataFrame()
pd.read_excel(xls, sheet)
for sheet in b:
    df = pd.read_excel(xls, sheet)
    df.rename(columns=lambda x: x.strip().upper(), inplace=True)
    bill = df_b.append(df[df_t])
bill.to_excel('Survey.xlsx', index=False)
I think if you do:
b = ['q1 survey', 'q2 survey', 'q3 survey']  # sheet names
list_col = ["MONTH", "DATE", "YEAR"]  # column names (upper-case, to match the rename below)
xls = "path_to_file/R.xls"
# create the empty df named bill to append to afterwards
bill = pd.DataFrame(columns=list_col)
for sheet in b:
    # read the sheet
    df = pd.read_excel(xls, sheet)
    df.rename(columns=lambda x: x.strip().upper(), inplace=True)
    # need to assign bill again
    bill = bill.append(df[list_col])
# to excel
bill.to_excel('Survey.xlsx', index=False)
it should work and corrects the errors in your code, but you can do it a bit differently using pd.concat (DataFrame.append was removed in pandas 2.0, so pd.concat is also the future-proof choice):
list_sheet = ['q1 survey', 'q2 survey', 'q3 survey']  # sheet names
list_col = ["MONTH", "DATE", "YEAR"]  # column names (upper-case, to match the rename below)
# open the xls file once and access the sheets in the loop; should be faster
xls_file = pd.ExcelFile("path_to_file/R.xls")
# create a list to collect the dfs
list_df_to_concat = []
for sheet in list_sheet:
    # read the sheet
    df = pd.read_excel(xls_file, sheet)
    df.rename(columns=lambda x: x.strip().upper(), inplace=True)
    # append the df to the list
    list_df_to_concat.append(df[list_col])
# to excel
pd.concat(list_df_to_concat).to_excel('Survey.xlsx', index=False)
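The collect-then-concat pattern can be checked without an Excel file, with small frames standing in for the per-sheet reads (a minimal sketch; note the headers are normalized to upper case before selecting, just as the rename in the answer does):

```python
import pandas as pd

# Dummy per-sheet frames standing in for pd.read_excel(xls_file, sheet)
sheets = {
    'q1 survey': pd.DataFrame({'Month ': ['Jan'], 'Date': [1], 'Year': [2023], 'Extra': ['x']}),
    'q2 survey': pd.DataFrame({'month': ['Feb'], 'DATE ': [2], 'year': [2023]}),
}

list_col = ['MONTH', 'DATE', 'YEAR']
parts = []
for name, df in sheets.items():
    # normalize headers the same way as the answer: strip and upper-case
    df = df.rename(columns=lambda x: x.strip().upper())
    parts.append(df[list_col])

combined = pd.concat(parts, ignore_index=True)
print(combined)
```

Every sheet contributes its rows to the output, which is the behaviour the question was after.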
