Extract some data from a text file

I am not very experienced in Python.
I have a “CompilerWarningsAllProtocol.txt” file that contains something like this:
" adm_1 C:\Work\CompilerWarnings\adm_1.h type:warning Reason:wunused
adm_2 E:\Work\CompilerWarnings\adm_basic.h type:warning Reason:undeclared variable
adm_X C:\Work\CompilerWarnings\adm_X.h type:warning Reason: Unknown ID"
How can I extract these three paths (C:..., E:..., C:...) from the txt file and use them to fill an Excel column named “Affected Item”?
Can I do it with the re.findall or re.search methods?
For now the script checks whether the input txt file exists at my location and confirms it. After that it creates a blank Excel file with headers, but I don't know how to populate the Excel file with these paths, written in the "Affected Item" column, let's say.
Thanks for your help. I will copy-paste the code:
import os
import os.path
import re
import xlsxwriter
import openpyxl
from jira import JIRA
import pandas as pd
import numpy as np
# Print error message if no "CompilerWarningsAllProtocol.txt" file exists in the folder
# (raw string, so the backslashes in the Windows path are not read as escapes)
inputpath = r'D:\Work\Python\CompilerWarnings\Python_CompilerWarnings\CompilerWarningsAllProtocol.txt'
if os.path.isfile(inputpath) and os.access(inputpath, os.R_OK):
    print(" 'CompilerWarningsAllProtocol.txt' exists and is readable")
else:
    print("Either the file is missing or not readable")
# Create a new Excel file and add a worksheet.
workbook = xlsxwriter.Workbook('CompilerWarningsFresh.xlsx')
worksheet = workbook.add_worksheet('Results')
# Widen the columns correspondingly.
worksheet.set_column('A:A', 20)
worksheet.set_column('B:AZ', 45)
# Create the headers
headers = ('Module', 'Affected Item', 'Issue', 'Class of Issue', 'Issue Root Cause', 'Type of Issue',
           'Source of Issue', 'Test sequence', 'Current Issue appearances in module')
# Create the bold headers and font size
format1 = workbook.add_format({'bold': True, 'font_color': 'black'})
format1.set_font_size(14)
format1.set_border()
row = col = 0
for item in headers:
    worksheet.write(row, col, item, format1)
    col += 1
workbook.close()

I agree with @dantechguy that CSV is probably easier (and more lightweight) than writing a real xlsx file, but if you want to stick to the Excel format, the code below will work. Also, based on the code you've provided, you don't need to import openpyxl, jira, pandas or numpy.
The regex here matches full paths with any drive letter A-Z, followed by "type:warning". If you don't need to check for the warning and simply want to get every path in the file, you can delete everything in the regex after \S+. And if you know you'll only ever want drives C and E, just change A-Z to CE.
import re
warningPathRegex = r"[A-Z]:\\\S+(?=\s*type:warning)"
compilerWarningFile = r"D:\Work\Python\CompilerWarnings\Python_CompilerWarnings\CompilerWarningsAllProtocol.txt"
warningPaths = []
with open(compilerWarningFile, 'r') as f:
    fullWarningFile = f.read()
    warningPaths = re.findall(warningPathRegex, fullWarningFile)
# ... open Excel file, then before workbook.close():
pathColumn = 1  # Affected item
for num, warningPath in enumerate(warningPaths):
    worksheet.write(num + 1, pathColumn, warningPath)  # num + 1 to skip header row
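To see what the regex returns before wiring it into the Excel code, it can be tried on the sample lines from the question. This is only a minimal sketch: the names `WARNING_PATH_REGEX` and `extract_warning_paths` are made up for the illustration, and the sample text is typed in rather than read from the real file.

```python
import re

# Same pattern as in the answer: a drive letter, a colon, a backslash, then
# non-space characters, matched only when "type:warning" follows.
WARNING_PATH_REGEX = re.compile(r"[A-Z]:\\\S+(?=\s*type:warning)")


def extract_warning_paths(text):
    """Return every Windows path in text that is followed by 'type:warning'."""
    return WARNING_PATH_REGEX.findall(text)


# Sample lines copied from the question (not read from the real file).
sample = (
    "adm_1 C:\\Work\\CompilerWarnings\\adm_1.h type:warning Reason:wunused\n"
    "adm_2 E:\\Work\\CompilerWarnings\\adm_basic.h type:warning Reason:undeclared variable\n"
    "adm_X C:\\Work\\CompilerWarnings\\adm_X.h type:warning Reason: Unknown ID\n"
)

paths = extract_warning_paths(sample)
for path in paths:
    print(path)
```

Because `\S+` stops at the first whitespace character, a path that contains spaces would be cut short, so the pattern would need adjusting for such files.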

Related

Iterating and editing a varying number of excel files in a specific directory

I got a number of .xls files inside a specific directory with varying names and quantities that are downloaded from outlook. The objective of the script is to open each file and then write "Confirmed" in the O column if the M column is not blank.
import openpyxl as xl
import os
import sys
import pathlib
from pathlib import path
if __name__ == "main":
    while True:
        desktop_folder = Path.home().joinpath("Desktop", "Excel Files")
        folder = (str(desktop_folder) + str("\\"))
        os.chdir(folder)
        excelFiles = os.listdir('.')
        for i in range(0, len(excelFiles)):
            wb = xl.load_workbook(excelFiles[i])
            sheet = wb.active
            for c, cellObj in enumerate(sheet['O'], 1):
                if c != 1:
                    cellObj.value = '=IF(M="","","Confirmed")'.format(c)
            wb.save(excelFiles[i])
            print(excelFiles[i] + 'completed')
        sys.exit()
At the moment this is the code I have, but I'm not getting any output on the terminal.
Any thoughts?
Thanks!

From looking at your code, there are a few issues I see. Some of these may be copy/paste errors. Assuming issues 1 and 2 below are actually fine in your testing, you should at least be getting the 'completed' print output UNLESS there are no [xlsx] files whatsoever in your '<user>\Desktop\Excel Files' directory. Even a file that is not an xlsx file should cause an error. So an empty directory is probably the reason for your issue.
As Andreas says, DEBUG: check whether any [xlsx] files actually end up in the excelFiles list.
1. You are importing 'path' from pathlib. The class is 'Path', with an upper-case 'P':
from pathlib import Path
2. __name__ should be compared with "__main__". Very likely this is just a copy/paste error:
if __name__ == "__main__":
3. Your formula wouldn't do much:
'=IF(M="","","Confirmed")'.format(c)
M="" is not going to achieve what you think. You need to use the cell coordinate, so the format placeholder (the curly brackets) is missing:
cellObj.value = '=IF(M{}="","","Confirmed")'.format(c)
or, with the newer f-string syntax:
cellObj.value = f'=IF(M{cellObj.row}="","","Confirmed")'
Note that you don't need the enumerate; just use the cell's row value.
4. There is no space between the Excel file name and the word 'completed':
print(excelFiles[i] + 'completed')
The two words would run together, like 'excelfilecompleted'.
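Putting these corrections together, the loop might look like the sketch below. Several things here are assumptions made for the illustration and are not in the original code: the workbooks are taken to be .xlsx files (hence the `*.xlsx` glob), row 1 is taken to be a header row, and the helper names `confirmed_formula` and `mark_confirmed` are invented. openpyxl is imported inside the function so that the formula helper can be tried on its own.

```python
from pathlib import Path


def confirmed_formula(row):
    """Build the formula for one row: 'Confirmed' if M<row> is not blank."""
    return f'=IF(M{row}="","","Confirmed")'


def mark_confirmed(folder):
    """Write the formula into column O of every .xlsx file in folder."""
    import openpyxl  # third-party; imported here so the helper above has no dependency

    for xlsx_path in sorted(Path(folder).glob("*.xlsx")):
        wb = openpyxl.load_workbook(str(xlsx_path))
        sheet = wb.active
        # Row 1 is assumed to hold headers, so start at row 2.
        for row in range(2, sheet.max_row + 1):
            sheet.cell(row=row, column=15).value = confirmed_formula(row)  # 15 = column O
        wb.save(str(xlsx_path))
        print(f"{xlsx_path.name} completed")  # f-string keeps the space before 'completed'


# Show the formula that would be written into row 2.
print(confirmed_formula(2))
```

Calling `mark_confirmed(Path.home() / "Desktop" / "Excel Files")` would then process the folder from the question. Globbing for `*.xlsx` also avoids the error that `os.listdir` would trigger on any non-Excel file in the folder.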

How do I convert multiple multiline txt files to excel - ensuring each file is its own line, then each line of text is it own row? Python3

Using openpyxl and Path I aim to:
Create multiple multiline .txt files,
then insert .txt content into a .xlsx file ensuring file 1 is in column 1 and each line has its own row.
I thought to create a nested list then loop through it to insert the text. I cannot figure how to ensure that all the nested list string is displayed. This is what I have so far which nearly does what I want however it's just a repeat of the first line of text.
from pathlib import Path
import openpyxl
listOfText = []
wb = openpyxl.Workbook()  # Create a new workbook to insert the text files
sheet = wb.active
for txtFile in range(5):  # create 5 text files
    createTextFile = Path('textFile' + str(txtFile) + '.txt')
    createTextFile.write_text(f'''Hello, this is a multiple line text file.
My Name is x.
This is text file {txtFile}.''')
    readTxtFile = open(createTextFile)
    listOfText.append(readTxtFile.readlines())  # nest the list from each text file into a parent list
    textFileList = len(listOfText[txtFile])  # get the number of lines of text from the file. They are all 3 as made above
    # Each column displays text from each text file
    for row in range(1, txtFile + 1):
        for col in range(1, textFileList + 1):
            sheet.cell(row=row, column=col).value = listOfText[txtFile][0]
wb.save('importedTextFiles.xlsx')
The output is 4 columns/4 rows. All of which say the same 'Hello, this is a multiple line text file.'
Appreciate any help with this!

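The problem is in the for loop that writes the cells. The original line always writes element [0], i.e. the first line of the current file. Change the line sheet.cell(row=row, column=col).value = listOfText[txtFile][0] to sheet.cell(row=col, column=row).value = listOfText[row-1][col-1] and it will work: swapping row and column puts each file in its own column, and the two indices pick the matching file and line.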
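As an alternative sketch of the same idea, iterating over the nested list directly avoids the index arithmetic altogether. The helper names `cell_assignments` and `write_columns` are invented for this illustration, and openpyxl is only imported when a workbook is actually written.

```python
def cell_assignments(files_lines):
    """Yield (row, column, text): file i goes to column i + 1, one line per row."""
    for col, lines in enumerate(files_lines, start=1):
        for row, line in enumerate(lines, start=1):
            yield row, col, line.rstrip("\n")


def write_columns(files_lines, out_name="importedTextFiles.xlsx"):
    """Write each file's lines into its own column of a new workbook."""
    import openpyxl  # third-party

    wb = openpyxl.Workbook()
    sheet = wb.active
    for row, col, text in cell_assignments(files_lines):
        sheet.cell(row=row, column=col).value = text
    wb.save(out_name)


# Two small "files", shaped like the lists that readlines() returns.
demo = [
    ["Hello\n", "This is text file 0."],
    ["Hello\n", "This is text file 1."],
]
for assignment in cell_assignments(demo):
    print(assignment)
```

With the question's variables, `write_columns(listOfText)` would be called once, after the loop that creates and reads the text files.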

How can I create an excel file with multiple sheets that stores content of a text file using python

I need to create an Excel file in which each sheet contains the contents of one text file from my directory. For example, if I have two text files then I'll have two sheets, and each sheet contains the content of one text file.
I've managed to create the Excel file, but I could only fill it with the contents of the last text file in my directory. However, I need to read all my text files and save them into Excel.
This is my code so far:
import os
import glob
import xlsxwriter
file_name='WriteExcel.xlsx'
path = 'C:/Users/khouloud.ayari/Desktop/khouloud/python/Readfiles'
txtCounter = len(glob.glob1(path,"*.txt"))
for filename in glob.glob(os.path.join(path, '*.txt')):
    f = open(filename, 'r')
    content = f.read()
    print (len(content))
workbook = xlsxwriter.Workbook(file_name)
ws = workbook.add_worksheet("sheet" + str(i))
ws.set_column(0, 1, 30)
ws.set_column(1, 2, 25)
parametres = (
    ['file', content],
)
# Start from the first cell. Rows and
# columns are zero indexed.
row = 0
col = 0
# Iterate over the data and write it out row by row.
for name, parametres in (parametres):
    ws.write(row, col, name)
    ws.write(row, col + 1, parametres)
    row += 1
workbook.close()
Example:
If I have two text files, where the content of the first is 'hello' and the content of the second is 'world', I need to create two worksheets: the first worksheet should store 'hello' and the second should store 'world'.
But my two worksheets both contain 'world'.

I recommend using pandas. It in turn uses an engine such as xlsxwriter to write data (whole tables) to Excel files, but makes it much easier, with literally a couple of lines of code.
import pandas as pd
df_1 = pd.DataFrame({'data': ['Hello']})
sn_1 = 'hello'
df_2 = pd.DataFrame({'data': ['World']})
sn_2 = 'world'
filename_excel = '1.xlsx'
with pd.ExcelWriter(filename_excel) as writer:
    for df, sheet_name in zip([df_1, df_2], [sn_1, sn_2]):
        df.to_excel(writer, index=False, header=False, sheet_name=sheet_name)
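To apply the same pattern to the text files in a folder rather than to hard-coded frames, one possible sketch is shown below. The function names `read_text_files` and `write_sheets` are invented for the illustration, and using the file name (without extension) as the sheet name is an assumption; it is cut to 31 characters because Excel does not accept longer sheet names.

```python
import tempfile
from pathlib import Path


def read_text_files(folder):
    """Map a sheet name (the file stem, cut to 31 characters) to the file's content."""
    contents = {}
    for txt_path in sorted(Path(folder).glob("*.txt")):
        contents[txt_path.stem[:31]] = txt_path.read_text()
    return contents


def write_sheets(contents, excel_name="WriteExcel.xlsx"):
    """Write one sheet per text file, with the file content in the first cell."""
    import pandas as pd  # third-party; also needs an Excel engine such as xlsxwriter

    with pd.ExcelWriter(excel_name) as writer:
        for sheet_name, content in contents.items():
            df = pd.DataFrame({"data": [content]})
            df.to_excel(writer, index=False, header=False, sheet_name=sheet_name)


# Small demonstration of the reading step in a temporary folder.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "first.txt").write_text("hello")
    Path(tmp, "second.txt").write_text("world")
    demo = read_text_files(tmp)
print(demo)
```

`write_sheets(read_text_files(path))` would then create the workbook once and add every sheet before it is closed, which is the step the question's code was missing. Two files whose names agree in the first 31 characters would collide in this sketch.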

Openpyxl 2.6.0 save issue

I have an issue while I try to save an Excel workbook with comments.
Without any comments in the Excel file, there is no issue with my scripts. I simply use:
wb_archive = load_workbook(archive_file)
However, if the file I want to save has comments, it doesn't work and I have the message:
AttributeError: 'NoneType' object has no attribute 'read'
So, I open the file with the method:
wb_archive = load_workbook(archive_file, keep_vba=True)
First run is ok, however, the second one always fails with the error:
KeyError: "There is no item named 'xl/sharedStrings.xml' in the archive"
Am I wrong somewhere in my code?
# coding: utf8
# !/usr/bin/env python3
"""
Program to extract Excel data to an archive
Program Python version 3.5
"""
# Standard Library
from pathlib import Path
from datetime import date
# External Libraries
from openpyxl import load_workbook
# Import the interface
# Import the project .py files
filein = "file1.xlsx"
fileout = "file2.xlsx"
def xlarchive(source_file, source_sheet, archive_file, archive_sheet, source_start_line=0, archiving_method="NEW", option_date=False):
    """
    Function to save data from an Excel Workbook (source) to another one (archive).
    Variables shall be check before calling the function.
    :param source_file: file where data are copied from (source)
    :type source_file: Path file
    :param source_sheet: name of the sheet where data are located on the source file
    :type source_sheet: string
    :param archive_file: file where data are copied to (destination)
    :type archive_file: Path file
    :param archive_sheet: name of the sheet where data have to be copied on the destination file
    :type archive_sheet: string
    :param source_start_line:
    :type source_start_line: int
    :param archiving_method: defines if the destination file has to be created
    :type archiving_method: string
    :param option_date: defines if the extraction data shall be recorded
    :type option_date: bool
    :return: None
    """
    wb_source = load_workbook(source_file)
    #keep_vba = true to avoid issue with comments
    ws_source = wb_source.get_sheet_by_name(source_sheet)
    wb_archive = load_workbook(archive_file)
    ws_archive = wb_archive.get_sheet_by_name(archive_sheet)
    if archiving_method == "NEW":
        # index of [archive_sheet] sheet
        idx = wb_archive.sheetnames.index(archive_sheet)
        # remove [ws_archive]
        wb_archive.remove(ws_archive)
        # create an empty sheet [ws_archive] using old index
        wb_archive.create_sheet(archive_sheet, idx)
        ws_archive = wb_archive.get_sheet_by_name(archive_sheet)
    # If extraction has been performed the same day, previous data will be replaced
    # Date are store in Excel under format YYYY-MM-DD HH:MM:SS.
    # extractiondate is from datetime.now().date() and its format is YYYY-MM-DD
    # Comparison thanks to string is needed
    # As Openpyxl does not enable to delete row, the below code clear data and find the first empty occurrence
    if option_date == True:
        j = 0
        for i in range(ws_archive.max_row, 1, -1):
            if str(ws_archive.cell(row=i, column=1).value)[0:10] == str(date.today()):
                j=j+1
        ws_archive.delete_rows(ws_archive.max_row - j + 1, j)
    for row in ws_source.iter_rows(min_row=source_start_line):
        complete_row = []
        for item in row:
            complete_row.append(item.value)
        if option_date is True:
            complete_row.insert(0, str(date.today()))
        ws_archive.append(complete_row)
    wb_archive.save(archive_file)
xlarchive(filein, "Sheet1", fileout, "Sheet1", option_date=True, archiving_method="False", source_start_line=2)

Pandas read_excel with Hyperlink

I have an Excel spreadsheet that I am reading into a Pandas DataFrame:
df = pd.read_excel("file.xls")
However, one of the columns of the spreadsheet contains text which have a hyperlink associated with it. How do I access the underlying hyperlink in Pandas?

This can be done with openpyxl; I'm not sure it's possible with pandas at all. Here's how I've done it:
import openpyxl
wb = openpyxl.load_workbook('yourfile.xlsm')
ws = wb['Sheet1']  # get_sheet_by_name() is deprecated in current openpyxl
print(ws.cell(row=2, column=1).hyperlink.target)
You can also use IPython and set a variable equal to the hyperlink object:
t = ws.cell(row=2, column=1).hyperlink
then type t. and press Tab to see all the options for what you can do with or access from the object.
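The single-cell lookup above can be extended to a whole column. The following is only a sketch under a few assumptions: the helper names `hyperlink_targets` and `column_targets` are invented, the first row is treated as a header, and the stand-in cells at the end exist only so the first helper can be tried without a workbook.

```python
from types import SimpleNamespace


def hyperlink_targets(cells):
    """Return each cell's link target, or None where the cell has no hyperlink."""
    return [cell.hyperlink.target if cell.hyperlink else None for cell in cells]


def column_targets(xlsx_path, sheet_name, column_letter):
    """Collect the hyperlink targets of one column, skipping the header row."""
    import openpyxl  # third-party

    wb = openpyxl.load_workbook(xlsx_path)  # not read_only, so hyperlinks are loaded
    ws = wb[sheet_name]
    return hyperlink_targets(ws[column_letter][1:])


# Stand-in objects that mimic the two attributes used above (not real openpyxl cells).
fake_cells = [
    SimpleNamespace(hyperlink=SimpleNamespace(target="https://example.com")),
    SimpleNamespace(hyperlink=None),
]
print(hyperlink_targets(fake_cells))
```

The list returned by `column_targets('yourfile.xlsm', 'Sheet1', 'A')` could then be assigned to a new column of the DataFrame read with pandas, provided both cover the same rows.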

Here is a quick monkey patch, without converters or anything like that, for when you would like ALL cells with hyperlinks to be read as hyperlinks. A more sophisticated approach, I suppose, would at least let you choose which columns to treat as hyperlinked, or would keep both the data and the hyperlink for the same cell in the dataframe. Whether that can be done with converters, I don't know. (BTW, I also played with data_only and keep_links, which did not help; only changing read_only gave the right result. I suppose it can slow your code down.)
P.S.: Works only with xlsx, i.e., when the engine is openpyxl.
P.P.S.: If you are reading this in the future and the issue https://github.com/pandas-dev/pandas/issues/13439 is still open, don't forget to check for changes to _convert_cell and load_workbook in pandas.io.excel._openpyxl and update the patch accordingly.
import pandas
from pandas.io.excel._openpyxl import OpenpyxlReader
import numpy as np
from pandas._typing import FilePathOrBuffer, Scalar
def _convert_cell(self, cell, convert_float: bool) -> Scalar:
    from openpyxl.cell.cell import TYPE_BOOL, TYPE_ERROR, TYPE_NUMERIC
    # here we adding this hyperlink support:
    if cell.hyperlink and cell.hyperlink.target:
        return cell.hyperlink.target
        # just for example, you able to return both value and hyperlink,
        # comment return above and uncomment return below
        # btw this may hurt you on parsing values, if symbols "|||" in value or hyperlink.
        # return f'{cell.value}|||{cell.hyperlink.target}'
    # here starts original code, except for "if" became "elif"
    elif cell.is_date:
        return cell.value
    elif cell.data_type == TYPE_ERROR:
        return np.nan
    elif cell.data_type == TYPE_BOOL:
        return bool(cell.value)
    elif cell.value is None:
        return ""  # compat with xlrd
    elif cell.data_type == TYPE_NUMERIC:
        # GH5394
        if convert_float:
            val = int(cell.value)
            if val == cell.value:
                return val
        else:
            return float(cell.value)
    return cell.value
def load_workbook(self, filepath_or_buffer: FilePathOrBuffer):
    from openpyxl import load_workbook
    # had to change read_only to False:
    return load_workbook(
        filepath_or_buffer, read_only=False, data_only=True, keep_links=False
    )
OpenpyxlReader._convert_cell = _convert_cell
OpenpyxlReader.load_workbook = load_workbook
After adding the code above to your Python file, you will be able to call df = pandas.read_excel(input_file)
After writing all this, it occurred to me that it might be easier and cleaner to just use openpyxl by itself ^_^

As commented by slaw, it doesn't grab the hyperlink but only the text.
Here test.xlsx contains links in the 9th column:
from openpyxl import load_workbook
workbook = load_workbook('test.xlsx')
worksheet = workbook.active
column_indices = [9]
for row in range(2, worksheet.max_row + 1):
    for col in column_indices:
        filelocation = worksheet.cell(column=col, row=row)  # this is the hyperlink
        text = worksheet.cell(column=col + 1, row=row)  # this is your text
        worksheet.cell(column=col + 1, row=row).value = '=HYPERLINK("' + filelocation.value + '","' + text.value + '")'
workbook.save('test.xlsx')

You cannot do that in pandas. You can try with other libraries designed to deal with excel files.
