Read file and output specific fields to CSV file - excel

I'm trying to search for data based on a key word and export that data to an Excel or text file.
When I "print" the variable/list it works no problem. When I try and output the data to a file it only outputs the last entry. I think something is wrong with the iteration, but I can't figure it out.
import xlsxwriter
# Paths
xls_output_path = 'C:\\Data\\'
config = 'C:\\Configs\\filename.txt'
excel_inc = 0  # used to increment the excel columns so not everything is written in "A1"
lines = open(config, "r").read().splitlines()
search_term = "ACL"
for i, line in enumerate(lines):
    if search_term in line:
        split_lines = line.split(' ')  # Split lines via a space.
        linebefore = lines[i - 1]  # Print the line before the search term
        linebefore_split = linebefore.split(' ')  # Split the line before via space
        from_obj = linebefore_split[2]  # [2] holds the data I need
        to_object = split_lines[4]  # [4] holds the data I need
        print(len(split_lines))  # Prints each found line with no problem.
        excel_inc = excel_inc + 1  # Increments for column A so not all of the data is placed in A1
        excel_inc_str = str(excel_inc)  # Change type to string so it can concatenate.
        workbook = xlsxwriter.Workbook(xls_output_path + 'Test.xlsx')  # Creates the xls file
        worksheet = workbook.add_worksheet()
        worksheet.write('A' + excel_inc_str, split_lines[4])  # Write data from split_lines[4] to column A
workbook.close()
I created this script so it will go and find all lines in the "config" file with the keyword "ACL".
It then has the ability to print the line before and the actual line the data is found. This works great.
My next step is outputting the data to an excel spreadsheet. This is where I get stuck.
The script only prints the very last item in the column A row 10.
I need help figuring out why it'll print the data correctly, but it won't output it to an excel spreadsheet or even a .txt file.

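Try this. I moved your workbook and worksheet definitions outside the loop so they are not recreated on every match. Every new Workbook pointed at the same Test.xlsx, which is why only the last entry ended up in the file.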
import xlsxwriter
# Paths
xls_output_path = 'C:\\Data\\'
config = 'C:\\Configs\\filename.txt'
excel_inc = 0  # used to increment the excel columns so not everything is written in "A1"
lines = open(config, "r").read().splitlines()
search_term = "ACL"
workbook = xlsxwriter.Workbook(xls_output_path + 'Test.xlsx')  # Creates the xls file
worksheet = workbook.add_worksheet()
for i, line in enumerate(lines):
    if search_term in line:
        split_lines = line.split(' ')  # Split lines via a space.
        linebefore = lines[i - 1]  # Print the line before the search term
        linebefore_split = linebefore.split(' ')  # Split the line before via space
        from_obj = linebefore_split[2]  # [2] holds the data I need
        to_object = split_lines[4]  # [4] holds the data I need
        print(len(split_lines))  # Prints each found line with no problem.
        excel_inc = excel_inc + 1  # Increments for column A so not all of the data is placed in A1
        excel_inc_str = str(excel_inc)  # Change type to string so it can concatenate.
        worksheet.write('A' + excel_inc_str, split_lines[4])  # Write data from split_lines[4] to column A
workbook.close()
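Since the title mentions CSV and the question also asks about plain-text output, here is the same open-once idea using only the standard library. Treat it as a minimal sketch rather than a drop-in replacement: extract_pairs and write_csv are hypothetical helper names, the field positions [2] and [4] are taken from the question, and the sample config lines are invented for illustration.

```python
import csv
import io

def extract_pairs(lines, search_term="ACL"):
    """Return (from_obj, to_object) for every line containing search_term."""
    pairs = []
    for i, line in enumerate(lines):
        if search_term not in line or i == 0:
            continue  # no match, or no previous line to read
        before = lines[i - 1].split(' ')
        current = line.split(' ')
        # Skip lines that are too short instead of raising IndexError.
        if len(before) > 2 and len(current) > 4:
            pairs.append((before[2], current[4]))
    return pairs

def write_csv(pairs, out_file):
    """Write all pairs in one pass to an already opened file object."""
    writer = csv.writer(out_file)
    writer.writerow(["from_obj", "to_object"])
    writer.writerows(pairs)

# Invented sample input, standing in for the real config file.
sample = [
    "set object web01 type host",
    "apply rule 10 ACL inbound permit",
    "set object db01 type host",
    "apply rule 20 ACL outbound deny",
]
pairs = extract_pairs(sample)
buffer = io.StringIO()  # for a real file: open(path, "w", newline="")
write_csv(pairs, buffer)
print(buffer.getvalue())
```

Collecting the matches first and writing them afterwards keeps the output file from being reopened inside the loop, which is the same mistake in a different form.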

Related

How do I convert multiple multiline txt files to excel - ensuring each file is its own line, then each line of text is its own row? Python3

Using openpyxl and Path I aim to:
Create multiple multiline .txt files,
then insert .txt content into a .xlsx file ensuring file 1 is in column 1 and each line has its own row.
I thought I would create a nested list and then loop through it to insert the text, but I cannot figure out how to make every string in the nested list appear. This is what I have so far. It nearly does what I want, but it just repeats the first line of text.
from pathlib import Path
import openpyxl
listOfText = []
wb = openpyxl.Workbook()  # Create a new workbook to insert the text files
sheet = wb.active
for txtFile in range(5):  # create 5 text files
    createTextFile = Path('textFile' + str(txtFile) + '.txt')
    createTextFile.write_text(f'''Hello, this is a multiple line text file.
My Name is x.
This is text file {txtFile}.''')
    readTxtFile = open(createTextFile)
    listOfText.append(readTxtFile.readlines())  # nest the list from each text file into a parent list
    textFileList = len(listOfText[txtFile])  # get the number of lines of text from the file. They are all 3 as made above
# Each column displays text from each text file
for row in range(1, txtFile + 1):
    for col in range(1, textFileList + 1):
        sheet.cell(row=row, column=col).value = listOfText[txtFile][0]
wb.save('importedTextFiles.xlsx')
The output is 4 columns/4 rows. All of which say the same 'Hello, this is a multiple line text file.'
Appreciate any help with this!
The problem is in the for loop that writes the cells: the index [0] always selects the first line, so every cell gets the same text. Change the line sheet.cell(row=row, column=col).value = listOfText[txtFile][0] to sheet.cell(row=col, column=row).value = listOfText[row-1][col-1] and it will work.
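The index swap can be checked without openpyxl. This is a minimal sketch, assuming listOfText is the nested list from the question (one inner list per file); cell_assignments is a hypothetical helper that only computes which value belongs in which 1-based cell. It iterates over the list itself instead of range(1, txtFile + 1), because that range stops one file short (txtFile ends at 4, so only files 0 to 3 are written).

```python
def cell_assignments(list_of_text):
    """Yield (row, column, value): one column per file, one row per line."""
    for col, file_lines in enumerate(list_of_text, start=1):
        for row, text in enumerate(file_lines, start=1):
            yield row, col, text.rstrip("\n")

# Stand-in for the nested list built from the text files in the question.
list_of_text = [
    ["Hello\n", "My Name is x.\n", "This is text file 0."],
    ["Hello\n", "My Name is x.\n", "This is text file 1."],
]

cells = list(cell_assignments(list_of_text))
for row, col, value in cells:
    # With openpyxl this would be: sheet.cell(row=row, column=col).value = value
    print(row, col, value)
```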

How can I create an excel file with multiple sheets that stores content of a text file using python

I need to create an Excel file in which each sheet contains the contents of one text file in my directory. For example, if I have two text files, then I'll have two sheets, and each sheet contains the content of its text file.
I've managed to create the Excel file, but I could only fill it with the contents of the last text file in my directory. However, I need to read all my text files and save them into Excel.
This is my code so far:
import os
import glob
import xlsxwriter
file_name='WriteExcel.xlsx'
path = 'C:/Users/khouloud.ayari/Desktop/khouloud/python/Readfiles'
txtCounter = len(glob.glob1(path,"*.txt"))
for filename in glob.glob(os.path.join(path, '*.txt')):
    f = open(filename, 'r')
    content = f.read()
    print (len(content))
workbook = xlsxwriter.Workbook(file_name)
ws = workbook.add_worksheet("sheet" + str(i))
ws.set_column(0, 1, 30)
ws.set_column(1, 2, 25)
parametres = (
    ['file', content],
)
# Start from the first cell. Rows and
# columns are zero indexed.
row = 0
col = 0
# Iterate over the data and write it out row by row.
for name, parametres in (parametres):
    ws.write(row, col, name)
    ws.write(row, col + 1, parametres)
    row += 1
workbook.close()
Example:
If I have two text files, the content of the first file is 'hello' and the content of the second is 'world', then I need two worksheets: the first should store 'hello' and the second 'world'.
But both of my worksheets contain 'world'.
I recommend using pandas. It can use xlsxwriter as its engine to write data (whole tables) to Excel files, but makes it much easier, with literally a couple of lines of code.
import pandas as pd
df_1 = pd.DataFrame({'data': ['Hello']})
sn_1 = 'hello'
df_2 = pd.DataFrame({'data': ['World']})
sn_2 = 'world'
filename_excel = '1.xlsx'
with pd.ExcelWriter(filename_excel) as writer:
    for df, sheet_name in zip([df_1, df_2], [sn_1, sn_2]):
        df.to_excel(writer, index=False, header=False, sheet_name=sheet_name)
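If you would rather stay with xlsxwriter, the same result needs only one Workbook created before the loop and one add_worksheet call per file. A minimal sketch under stated assumptions: collect_sheets and write_workbook are hypothetical helper names, sheets are named sheet1, sheet2, ... in sorted file order, and the demo writes two temporary text files instead of reading the directory from the question.

```python
import glob
import os
import tempfile

def collect_sheets(folder):
    """Return [(sheet_name, content)] for every .txt file in folder."""
    sheets = []
    paths = sorted(glob.glob(os.path.join(folder, "*.txt")))
    for i, path in enumerate(paths, start=1):
        with open(path, "r") as f:
            sheets.append(("sheet" + str(i), f.read()))
    return sheets

def write_workbook(sheets, out_path):
    """Create the workbook once, then add one worksheet per text file."""
    import xlsxwriter  # third-party; imported here so collect_sheets runs without it
    workbook = xlsxwriter.Workbook(out_path)
    for sheet_name, content in sheets:
        ws = workbook.add_worksheet(sheet_name)
        ws.write(0, 0, "file")
        ws.write(0, 1, content)
    workbook.close()

# Demo with two temporary files containing 'hello' and 'world'.
with tempfile.TemporaryDirectory() as folder:
    for name, text in [("a.txt", "hello"), ("b.txt", "world")]:
        with open(os.path.join(folder, name), "w") as f:
            f.write(text)
    sheets = collect_sheets(folder)

print(sheets)
# write_workbook(sheets, "WriteExcel.xlsx")  # uncomment if xlsxwriter is installed
```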

To extract content of 1st column (all rows) from an .xlsx file and replace it with the extracted information from each column

I have to replace the entire first column (all rows) with information extracted from each cell of that column. With my code, the last digit is missing from each value.
I have written the code, but I had to save the output to a different file. I cannot figure out how to replace the first column in the existing file itself. I need a single file containing only the required output.
import openpyxl
import xlwt
fname = 'output.xlsx'
wb = openpyxl.load_workbook(fname)
sheet = wb.active
print('The sheet title is: ', sheet.title)
row_a = sheet['A']
d = []
for cell in row_a:
    a = cell.value
    d.append(a)
print(d)
s = []
for i in d:
    i = i[-1:-8]
    s.append(i)
print('The list of account numbers is: ', s)
wc = xlwt.Workbook()
ws = wc.add_sheet('Sheet1')
row=0
col=0
list_d = s
for item in list_d:
    ws.write(row, col, item)
    row+=1
wc.save('FINAL.xls')
I suggest using Python's built-in str.split method:
import openpyxl
fname = 'output.xlsx'
wb = openpyxl.load_workbook(fname)
sheet = wb.active
d = [cell.value for cell in sheet['A']]  # List comprehension to replace your for loop
# str.split splits the 'Name' column data into a list of strings
# selecting [-1] selects only the account number
s = [i.split('.')[-1] for i in d]
s[0] = 'Account'  # replace 'Name' with 'Account' for column header
row = 1
col = 1
for item in s:
    sheet.cell(row, col).value = item
    row += 1
wb.save(fname)
I also used list comprehensions, which are often a more Pythonic way of building lists from data.
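On the missing digit: negative slices are easy to get wrong. With the default step, i[-1:-8] starts at the last character and stops before an earlier one, so it selects nothing, and a slice that stops at -1 drops the final character. A small sketch on an invented value; the 'Name.Account' layout with a '.' separator is an assumption, since the real cell contents are not shown in the question:

```python
value = "Smith.1234567"  # invented example of a 'Name.Account' cell

empty = value[-1:-8]              # start lies after stop with step +1: nothing selected
truncated = value[-8:-1]          # 7 characters, but the last digit is excluded
last_seven = value[-7:]           # the last 7 characters, including the final digit
after_dot = value.split('.')[-1]  # everything after the last '.', whatever its length

print(repr(empty), truncated, last_seven, after_dot)
```

split is the more forgiving choice when account numbers can differ in length, because it does not depend on a fixed character count.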

Combine Multiple Workbooks into One

I have a varying number of input .xlsx documents that each contain 12 sheets (the sheet names are the same in every document). I need to combine these into one .xlsx document that retains the original sheet names, with the data from all documents appended to the matching sheet.
For example, see my original output and the desired output:
[Image: Original Output]
[Image: Desired Output]
Currently, I am not adding the inputFile name anywhere and am just trying to merge into one workbook. However, I keep receiving an error:
[Image: error]
def createEmptyWorkbook(self, outputFileName):
    logging.info('creating empty workbook: %s' % (outputFileName))
    # create empty workbook
    ncoa_combined_report = openpyxl.Workbook()
    # save new file
    ncoa_combined_report.save(outputFileName)
    ncoa_combined_report = openpyxl.load_workbook(filename=outputFileName)#, data_only=True)
    return ncoa_combined_report
def combine_sheets(self, inputFiles):
    logging.info('combining ncoa reports to one workbook')
    # new output files
    outputFile = os.path.join(self.processingDir, 'combined_ncoa_report.xlsx')
    # create empty workbook
    ncoa_combined_report = self.createEmptyWorkbook(outputFile)
    # get a list of sheet names created in output file
    outputSheetNames = ncoa_combined_report.sheetnames
    for inputFile in inputFiles:
        logging.info('reading ncoa report: %s' % (os.path.split(inputFile)[-1]))
        # load entire input file into memory
        input_wb = openpyxl.load_workbook(filename = inputFile)#, data_only=True)
        # get sheet name values in inputFile
        sheets = input_wb.sheetnames
        # iterate worksheets in input file
        for worksheet in input_wb.worksheets:
            outputSheetMaxRow = 0
            currentSheet = ''
            row = ''
            column = ''
            logging.info('working on sheet: %s' % (worksheet.title))
            # check if sheet exists in output file and add if necessary
            if not worksheet.title in outputSheetNames:
                logging.info('creating sheet: %s' % (worksheet.title))
                currentSheet = ncoa_combined_report.create_sheet(worksheet.title)
            else:
                currentSheet = worksheet.title
            ## check if default sheet name is in output
            #if 'Sheet' in outputSheetNames:
            #    ncoa_combined_report.remove_sheet(ncoa_combined_report.get_sheet_by_name('Sheet'))
            outputSheetMaxRow = currentSheet.max_row
            for row, entry in enumerate(worksheet, start=1):
                logging.info('working on row: %s' % (row))
                for cell in entry:
                    try:
                        outputSheetMaxRow = currentSheet.max_row
                        # add cell value to output file
                        #currentSheet[cell.coordinate].value
                        currentSheet.cell(row=row+outputSheetMaxRow, column=cell.column).value = cell.value #, value=cell
                    except:
                        logging.critical('could not add row:%s, cell:%s' % (row, entry))
                        raise ValueError('could not add row:%s, cell:%s' % (row, entry))
    # save new file
    ncoa_combined_report.save(outputFile)
I am not sure why I am getting the error or what I need to update to correct it. Any guidance is appreciated.
I think I found the issue with this portion of the code. I found that openpyxl.utils provides helpers for getting the coordinate tuple (xy), the column index, and the row of a cell, which let me append the values at the correct locations. Hopefully this will help someone else in the future.
for line, entry in enumerate(worksheet, start=1):
    #logging.info('working on row: %s' % (row))
    for cell in entry:
        #try:
        xy = openpyxl.utils.coordinate_from_string(cell.coordinate) # returns ('A',4)
        col = openpyxl.utils.column_index_from_string(xy[0]) # returns 1
        rowCord = xy[1]
        # add cell value to output file
        #currentSheet[cell.coordinate].value
        if line == 1 and inputFileCount == 1:
            currentSheet.cell(row=1, column=1).value = 'Project'
            currentSheet.cell(row=1, column=2).value = os.path.split(inputFile)[-1]
        if line == 1 and inputFileCount > 1:
            currentSheet.cell(row=outputSheetMaxRow + 2, column=1).value = 'Project'
            currentSheet.cell(row=outputSheetMaxRow + 2, column=2).value = os.path.split(inputFile)[-1]
        else:
            currentSheet.cell(row=outputSheetMaxRow + rowCord + 1, column=col).value = cell.value #, value=cell
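For anyone adapting this, the append logic is easier to verify on plain lists before involving openpyxl. The following is only a model of the idea, not openpyxl code: each workbook is represented as a dict mapping a sheet name to a list of rows, and combine (a hypothetical helper) appends a 'Project' marker row with the file name, followed by that file's rows, to the output sheet of the same name.

```python
def combine(workbooks):
    """workbooks: {file_name: {sheet_name: [row, ...]}} -> {sheet_name: [row, ...]}"""
    combined = {}
    for file_name, sheets in workbooks.items():
        for sheet_name, rows in sheets.items():
            # Create the output sheet the first time its name is seen.
            target = combined.setdefault(sheet_name, [])
            if target:
                target.append([])                  # blank separator row between files
            target.append(["Project", file_name])  # marker row for this input file
            target.extend(rows)                    # rows land after the existing data
    return combined

# Invented data standing in for two input workbooks.
workbooks = {
    "a.xlsx": {"Summary": [["id", "count"], [1, 10]]},
    "b.xlsx": {"Summary": [["id", "count"], [2, 20]]},
}
result = combine(workbooks)
for row in result["Summary"]:
    print(row)
```

The point to carry over is that the output sheet is looked up (or created) once per input sheet and always stays a sheet object, never a sheet title string.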

KeyError: 'AK' census2010.allData['AK']['Anchorage']

'''
Reads the data from the Excel spreadsheet.
Counts the number of census tracts in each county.
Counts the total population of each county.
Prints the results.
This means your code will need to do the following:
Open and read the cells of an Excel document with the openpyxl module.
Calculate all the tract and population data and store it in a data structure.
Write the data structure to a text file with the .py extension using the pprint module.
'''
import openpyxl, os, pprint
os.chdir('C:\\Python34')
wb = openpyxl.load_workbook('censuspopdata.xlsx')
sheet = wb.get_sheet_by_name('Population by Census Tract')
CountyData = {}
for row in range(2, sheet.max_row + 1):
    state = sheet['B' + str(row)].value
    county = sheet['C' + str(row)].value
    pop = sheet['D' + str(row)].value
    CountyData.setdefault(state, {})
    CountyData[state].setdefault(county, {'tracts': 0, 'pop': 0})
    CountyData[state][county]['tracts'] += 1
    CountyData[state][county]['pop'] += int(pop)
print('Writing the results...')
resultFile = open('census2010.py', 'w')
resultFile.write('allData = ' + pprint.pformat(CountyData))
resultFile.close()
print('Done')
I can't deal with some KeyErrors. I made this program by following the instructions of "Project: Reading Data from a Spreadsheet" on this website: https://automatetheboringstuff.com/chapter12/
(I downloaded the excel file from here: https://www.nostarch.com/automatestuff/)
When I typed census2010.allData['AK']['Anchorage'], I got KeyError: 'AK'. I tried typing the other state abbreviations, but it didn't work either. Please help me out with this.
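A first diagnostic step is to look at which keys allData actually contains. The sketch below rebuilds the nested dictionary from a few invented rows (the real spreadsheet is not reproduced here). A KeyError simply means the state was never added as a key, for example if in your file the aggregation lines sit outside the for loop and run only once, or if the state column holds something other than the abbreviation. Both causes are assumptions to check, not a confirmed diagnosis; build_county_data is a hypothetical helper name.

```python
def build_county_data(rows):
    """rows: iterable of (state, county, pop) tuples, one per census tract."""
    county_data = {}
    for state, county, pop in rows:
        # These lines must run once per row, i.e. inside the loop.
        county_data.setdefault(state, {})
        county_data[state].setdefault(county, {'tracts': 0, 'pop': 0})
        county_data[state][county]['tracts'] += 1
        county_data[state][county]['pop'] += int(pop)
    return county_data

# Invented sample rows, not real census figures.
rows = [("AK", "Anchorage", 100), ("AK", "Anchorage", 250), ("AL", "Autauga", 300)]
all_data = build_county_data(rows)

print(sorted(all_data))                         # which states were actually stored
print(all_data.get("AK", {}).get("Anchorage"))  # lookup that does not raise KeyError
```

Running sorted(census2010.allData) on the generated module shows the same thing for the real data.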
