I'm trying to convert a list of Excel files to CSVs before loading them into a pandas DataFrame, but I'm unsure how to do the conversion from within my script. csvkit and xlsx2csv both work from the command line, but when I try to start a subprocess like so
for filename in sorted_files:
    file = subprocess.Popen("in2csv filename", stdout=subprocess.PIPE)
    print file.stdout
    dataframe = pd.read_csv(file)
I'm getting the error
IOError: Expected file path name or file-like object, got type
schema must not be null when format is "fixed"
Is it possible to get the output from the subprocess and pipe that to a dataframe? Any help greatly appreciated!
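On the piping question itself: yes, in principle. Below is a minimal sketch, assuming in2csv (from csvkit) is on the PATH and writes the converted CSV to standard output. The helper names command_output and command_to_dataframe are made up for illustration. The command is passed as a list so that the value of filename, rather than the literal word "filename", reaches in2csv, and the captured text is wrapped in io.StringIO because pd.read_csv expects a path or a file-like object, not a Popen object.

```python
import io
import subprocess


def command_output(cmd):
    """Run cmd (a list of arguments) and return its standard output as text."""
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return result.stdout


def command_to_dataframe(cmd):
    """Parse the CSV text printed by cmd into a pandas DataFrame."""
    import pandas as pd  # imported here so command_output works without pandas

    return pd.read_csv(io.StringIO(command_output(cmd)))


# Hypothetical usage, with sorted_files as in the question:
# for filename in sorted_files:
#     dataframe = command_to_dataframe(["in2csv", filename])
```

check=True makes a failing conversion raise subprocess.CalledProcessError instead of silently handing empty output to pandas.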
Although this question was asked a long time ago, I had the same issue, and this is how I solved it inside a Python script:
I could only run Xlsx2csv with the sheetid parameter. To get the sheet names and ids, I used get_sheet_details.
csvfrmxlsx creates one CSV file per sheet in a csv folder under the workbook's parent directory.
import pandas as pd
from pathlib import Path


def get_sheet_details(filename):
    import xmltodict
    import shutil
    import zipfile
    sheets = []
    # Make a temporary directory with the file name
    directory_to_extract_to = filename.with_suffix('')
    directory_to_extract_to.mkdir(parents=True, exist_ok=True)
    # Extract the xlsx file as it is just a zip file
    zip_ref = zipfile.ZipFile(filename, 'r')
    zip_ref.extractall(directory_to_extract_to)
    zip_ref.close()
    # Open the workbook.xml which is very light and only has meta data, get sheets from it
    path_to_workbook = directory_to_extract_to / 'xl' / 'workbook.xml'
    with open(path_to_workbook, 'r') as f:
        xml = f.read()
        dictionary = xmltodict.parse(xml)
        for sheet in dictionary['workbook']['sheets']['sheet']:
            # xmltodict prefixes XML attributes with '@' by default
            sheet_details = {
                'id': sheet['@sheetId'],  # can be sheetId for some versions
                'name': sheet['@name']  # can be name
            }
            sheets.append(sheet_details)
    # Delete the extracted files directory
    shutil.rmtree(directory_to_extract_to)
    return sheets


def csvfrmxlsx(xlsxfl, df):  # create csv files in csv folder on parent directory
    from xlsx2csv import Xlsx2csv
    (xlsxfl.parent / 'csv').mkdir(parents=True, exist_ok=True)
    for index, row in df.iterrows():
        shnum = row['id']
        shnph = xlsxfl.parent / 'csv' / Path(row['name'] + '.csv')  # path for converted csv file
        Xlsx2csv(str(xlsxfl), outputencoding="utf-8").convert(str(shnph), sheetid=int(shnum))
    return


pthfnc = 'c:/xlsx/'
wrkfl = 'my.xlsx'
xls_file = Path(pthfnc + wrkfl)
sheetsdic = get_sheet_details(xls_file)  # list of dicts with sheet names and ids, without opening the xlsx file
df = pd.DataFrame.from_dict(sheetsdic)
csvfrmxlsx(xls_file, df)  # df with sheets to be converted
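As a possible variation on get_sheet_details, the sheet list can also be read straight from the zip archive with the standard library, without extracting to a temporary directory or installing xmltodict. This is a sketch, not the approach used above: the name list_sheets is hypothetical, and it assumes the workbook uses the usual SpreadsheetML namespace shown in the code (workbooks saved in another variant of the format may declare a different one, in which case the result would be empty).

```python
import xml.etree.ElementTree as ET
import zipfile

# Namespace used by xl/workbook.xml in ordinary .xlsx files (an assumption
# about the input; adjust if your workbooks declare a different one).
NS = {"main": "http://schemas.openxmlformats.org/spreadsheetml/2006/main"}


def list_sheets(xlsx_path):
    """Return [{'id': ..., 'name': ...}, ...] for each sheet in the workbook."""
    with zipfile.ZipFile(xlsx_path) as zf:
        root = ET.fromstring(zf.read("xl/workbook.xml"))
    return [
        {"id": sheet.get("sheetId"), "name": sheet.get("name")}
        for sheet in root.findall("main:sheets/main:sheet", NS)
    ]
```

Because findall always returns a list, a workbook with a single sheet needs no special handling here.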
The script below converts all xlsx files placed inside a folder to CSV files with the same names.
import os
import glob
import pandas as pd

# set working directory
os.chdir("C:/Users/piyush.upadhyay/Piyush/Tasks/Task-61(Script to convert excel to csv)/Files")
all_files = [i for i in glob.glob('*.{}'.format('xlsx'))]
print(all_files)
li = []
for filename in all_files:
    try:
        print(filename)
        input('Going to read xlsx to csv')
        outFileName = filename.split('.')[0] + '.csv'
        data_xls = pd.read_excel(filename, engine='openpyxl')
        print(data_xls)
        input('Going to convert xlsx to csv')
        data_xls.to_csv(outFileName, header=True, index=None)
        input('Converted')
    except Exception as e:
        print("Error Logged..")
        print(e)
input('Enter to exit')
The code emits a warning while reading the xlsx file, which says:
Issue: the code above produces an empty DataFrame when we print the data_xls variable. As soon as we re-save the same file manually with the same extension (i.e. xlsx), the code successfully converts all Excel files in the folder to CSV files.
The issue is the same as the one the link describes.
I have a CSV file and I am running some Python to remove line breaks from the CSV.
import csv

with open('Jan2020.csv', 'r') as txtReader:
    with open('new_jan2020.csv', 'w') as txtWriter:
        for line in txtReader.readlines():
            line = line.replace('\r', '')
            txtWriter.write(line)
This works fine.
What I want to achieve is the following:
I have multiple CSV files in a folder: jan2020, feb2020, march2020, april2020, may2020
How would I loop through each file, remove line breaks like my above method and then output a new file for each where the name of the new file is the format: new_monthYear.csv?
So I would end up with a bunch of CSVs new_jan2020, new_feb2020, new_march2020, new_april2020, new_may2020
Thanks
You can list all files in a directory with os.listdir. I would also recommend writing your new files to another folder.
import os
import csv

# Folder name
path_input = '/path/folder/'
path_output = '/path/other/'
dirs = os.listdir(path_input)

# This would iterate over all the listed files
for file in dirs:
    file_to_read = os.path.join(path_input, file)
    file_to_write = os.path.join(path_output, 'new_' + file)
    # Your code
    with open(file_to_read, 'r') as txtReader:
        with open(file_to_write, 'w') as txtWriter:
            for line in txtReader.readlines():
                line = line.replace('\r', '')
                txtWriter.write(line)
Updated with the use of pandas:
import os
import pandas as pd

# Folder name
path_input = r'your/path'
path_output = r'your/path'
dirs = os.listdir(path_input)

# Iterate over all the listed files
for file in dirs:
    file_to_read = os.path.join(path_input, file)
    file_to_write = os.path.join(path_output, 'new_' + file)
    # Remove line breaks inside the cell values
    df = pd.read_csv(file_to_read)
    df2 = df.replace("\n", "", regex=True)
    # to_csv writes the file and returns None when given a path
    df2.to_csv(file_to_write, index=False)
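For comparison, the same loop can be sketched with only the standard library by letting the csv module parse each row, so that line breaks embedded inside quoted fields are removed while the row boundaries stay intact. The function name strip_line_breaks is hypothetical, and the sketch assumes every *.csv file in the input folder should be processed and that the output goes to a different folder.

```python
import csv
from pathlib import Path


def strip_line_breaks(src_dir, dst_dir):
    """Copy every CSV in src_dir to dst_dir as new_<name>, with CR/LF
    characters removed from inside the field values."""
    src_dir, dst_dir = Path(src_dir), Path(dst_dir)
    dst_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for src in sorted(src_dir.glob("*.csv")):
        dst = dst_dir / f"new_{src.name}"
        # newline="" lets the csv module see and control line endings itself
        with open(src, newline="") as fin, open(dst, "w", newline="") as fout:
            writer = csv.writer(fout)
            for row in csv.reader(fin):
                writer.writerow(
                    [cell.replace("\r", "").replace("\n", "") for cell in row]
                )
        written.append(dst)
    return written
```

Called as strip_line_breaks('/path/folder/', '/path/other/'), it would turn jan2020.csv into new_jan2020.csv, and so on for each month.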
I am a beginner at Python. I am writing a script to:
Read all csv files in a folder
Drop duplicate rows within a .csv file by reading one csv file at a time
Write to *_new.csv file
The code:
import csv
import os
import pandas as pd

path = "/Users/<mylocaldir>/Documents/Data/"
file_list = os.listdir(path)
for file in file_list:
    fullpath = os.path.join(path, file)
    data = pd.read_csv(fullpath)
    newdata = data.drop_duplicates()
    newfile = fullpath.replace(".csv", "_new.csv")
    newdata.to_csv("newfile", index=True, header=True)
When I run the script, no error is displayed, but *_new.csv is not created.
Any help resolving this issue?
I don't know pandas but you don't need it. You could try something like this:
import os

file_list = os.listdir()
# loop through the list
for filename in file_list:
    # don't process any non csv file
    if not filename.endswith('.csv'):
        continue
    # lines will be a temporary holding spot to check
    # for duplicates
    lines = []
    new_file = filename.replace('.csv', '_new.csv')
    # open 2 files - csv file and new csv file to write
    with open(filename, 'r') as fr, open(new_file, 'w') as fw:
        # read line from csv
        for line in fr:
            # if that line is not in temporary list called lines,
            # add it there and write to file
            # if that line is found in temporary list called lines,
            # don't do anything
            if line not in lines:
                lines.append(line)
                fw.write(line)
print('Done')
Result
Original file
cat name.csv
id,name
1,john
1,john
2,matt
1,john
New file
cat name_new.csv
id,name
1,john
2,matt
Another original file
cat pay.csv
id,pay
1,100
2,300
1,100
4,400
4,400
2,300
4,400
Its new file
id,pay
1,100
2,300
4,400
Update
The following script works, with a slight modification to read from the Src folder and write to the Dest folder:
import csv
import os
import pandas as pd

path = "/Users/<localdir>/Documents/Data/Src"
newPath = "/Users/<localdir>/Documents/Data/Dest"
file_list = os.listdir(path)
for file in file_list:
    fullpath = os.path.join(path, file)
    data = pd.read_csv(fullpath)
    newdata = data.drop_duplicates()
    newfile = file.replace(".csv", "_new.csv")
    # only write the file if it does not already exist in Dest
    if not os.path.isfile(os.path.join(newPath, newfile)):
        newdata.to_csv(os.path.join(newPath, newfile), index=False, header=True)
I also added a check to see if a file already exists in the Dest folder.
I would be keen to learn whether there is a better way to write this script.
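On the "better way" question, one possible tidy-up is sketched below. It is only a sketch under stated assumptions: it follows the standard-library answer in comparing whole lines (so it is not equivalent to pandas' drop_duplicates on parsed rows), it only looks at files ending in .csv, and the function name drop_duplicate_lines is made up for illustration. Keeping the lines already seen in a set avoids rescanning a growing list for every line.

```python
from pathlib import Path


def drop_duplicate_lines(src_dir, dest_dir):
    """Write <name>_new.csv into dest_dir for each CSV in src_dir, keeping
    only the first occurrence of every line. Existing outputs are skipped."""
    src_dir, dest_dir = Path(src_dir), Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    created = []
    for src in sorted(src_dir.glob("*.csv")):
        dest = dest_dir / f"{src.stem}_new.csv"
        if dest.exists():
            continue  # same check as in the update above
        seen = set()
        with open(src) as fr, open(dest, "w") as fw:
            for line in fr:
                if line not in seen:
                    seen.add(line)
                    fw.write(line)
        created.append(dest)
    return created
```

Using src.stem also avoids the case where ".csv" appears elsewhere in the path and gets replaced by str.replace.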
I need to dynamically save my pandas data frame. I have successfully managed to output a CSV file with a static name using the following code:
export_csv = df.to_csv(r'static_name.csv', header=False , index=False)
But I have failed to make this work dynamically. With the code below, I expect to get a file with the name passed into args.save_file_name and .csv suffix. However, I get no result.
import os
import argparse
import pandas as pd
parser = argparse.ArgumentParser()
parser.add_argument('file', help="this is the file you want to open")
parser.add_argument('save_file_name', help="the name of the file you want for the output CSV")
args = parser.parse_args()
print("file name:", args.file) # checking that this worked
...
# export csv
path = os.getcwd()
export_path = path + args.save_file_name + '.csv'
export_csv = df.to_csv(path_or_buf=export_path, header=False, index=False)
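I think the problem is in how your export_path variable is built: os.getcwd() returns the directory without a trailing path separator, so plain string concatenation glues the file name onto the last directory name instead of placing the file inside that directory. The following code should do the job.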
export_path = os.path.join(path, args.save_file_name + '.csv')
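To make the difference concrete, here is a small illustration. The helper name build_export_path and the file name "report" are placeholders, not part of the original script.

```python
import os


def build_export_path(directory, name, suffix=".csv"):
    """Join directory and file name using the platform's path separator."""
    return os.path.join(directory, name + suffix)


cwd = os.getcwd()
# Plain concatenation: no separator between the directory and the file name
concatenated = cwd + "report" + ".csv"
# os.path.join inserts the separator where one is needed
joined = build_export_path(cwd, "report")
print(concatenated)
print(joined)
```

With the concatenated form, the CSV is still written, but next to the working directory under an unexpected name, which is why it can look as if nothing was produced.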
I am converting multiple CSV files in the same directory to XLSX files.
The CSV files are tab-delimited.
I ran the program and it generated the XLSX files. However, the values in the XLSX files are not split into columns at the tabs.
Please look at my code and tell me what is wrong. In line 10, I have specified the delimiter as tab, but the resulting XLSX file is not separated.
import os
import glob
import csv
import openpyxl

for csvfile in glob.glob(os.path.join(r'(my directory)', '*.csv')):
    wb = openpyxl.Workbook()
    ws = wb.active
    with open(csvfile, 'r') as f:
        reader = csv.reader(f, delimiter='\t')
        for r, row in enumerate(reader, start=1):
            for c, val in enumerate(row, start=1):
                ws.cell(row=r, column=c).value = val
    wb.save(csvfile + '.xlsx')
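One way to narrow this down, offered as a diagnostic sketch rather than a diagnosis, is to check which delimiter the files actually use before building the workbook. The sketch relies on csv.Sniffer from the standard library; the helper name detect_delimiter and the candidate set of delimiters are assumptions. If it reports something other than a tab, csv.reader with delimiter='\t' would return each line as a single value, which would match the unsplit output described above.

```python
import csv


def detect_delimiter(path, candidates=",\t;|", sample_size=4096):
    """Guess the delimiter of a delimited text file from its first bytes.

    Raises csv.Error if the Sniffer cannot decide.
    """
    with open(path, newline="") as f:
        sample = f.read(sample_size)
    return csv.Sniffer().sniff(sample, delimiters=candidates).delimiter


# Hypothetical usage inside the loop from the question:
# print(csvfile, repr(detect_delimiter(csvfile)))
```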