Iterating and editing a varying number of excel files in a specific directory - python-3.x

I got a number of .xls files inside a specific directory with varying names and quantities that are downloaded from outlook. The objective of the script is to open each file and then write "Confirmed" in the O column if the M column is not blank.
import openpyxl as xl
import os
import sys
import pathlib
from pathlib import path
if __name__ == "main":
while True:
desktop_folder = Path.home().joinpath("Desktop", "Excel Files")
folder = (str(desktop_folder) + str("\\"))
os.chdir(folder)
excelFiles = os.listdir('.')
for i in range(0, len(excelFiles)):
wb = xl.load_workbook(excelFiles[i])
sheet = wb.active
for c, cellObj in enumerate(sheet['O'], 1):
if c != 1:
cellObj.value = '=IF(M="","","Confirmed")'.format(c)
wb.save(excelFiles[i])
print(excelFiles[i] + 'completed')
sys.exit()
At the moment this is the code I have, but I'm not getting any output on the terminal.
Any thoughts?
Thanks!

From looking at your code there are a few issues I see. Some of these may be copy/paste errors. Therefore given that issue 1 and 2 below are probably Ok in your testing you should at least be getting the 'completed' print output UNLESS there are no [xlsx] files whatsoever in your '<user>\Desktop\Excel Files' directory. Even a file that is not an xlsx file should cause an error. So it seems this is probably the reason for your issue.
As Andreas says, DEBUG, see if you are actually adding any [xlsx] files into the excelFiles list.
You are importing 'path' from pathlib. The function is 'Path' upper case 'P'
from pathlib import Path
__name__ should equal __main__ Very much likely this is just a copy/paste error
if __name__ == "__main__":
Your formula wouldn't do much
=IF(M="","","Confirmed")'.format(c)
M="" is not going to achieve what you think. You need to use the cell
coordinate so there is some missing brackets
cellObj.value = '=IF(M{}="","","Confirmed")'.format(c)
or the new method
cellObj.value = f'=IF(M{cellObj.row}="","","Confirmed")
Note you dont need the enumerate, just use the cells' row value
There is no space between the Excel file name and the word 'completed'
print(excelFiles[i] + 'completed')
The two words would be run together like 'excelfilecompleted'

Related

PDF parser in pdfs with multiple images and formats with python and tabula (open to other options)

So first off what im trying to do: create a pdf parser that will take ONLY tables out of any given pdf. I currently have some pdfs that are for parts manuals which contain an image of the part and then a table for details of the parts and I want to scrape and parse the table data from the pdf into a csv or similar excel style file(csv, xls etc)
What ive tried/trying: I am currently using python3 and tabula(i have no preference for either of these and open to other options) in which I have a py program that is able to scrape all the data of any pdf or directory of pdfs however it takes EVERYTHING including the image file code that has a bunch of 0 1 NaN(adding examples at the bottom). I was thinking of writing a filter function that removes these however that feels like overkill and was wondering/hoping there is a way to filter out the images with tabula or another library? (side note ive also attempted camelot however the module is not importing correctly even when it is in my pip freeze and this happens on both my mac m1 and mac m2 so assuming there is no arm support)
If anyone could help me or help guide me in a direction of a library or method of being able to iterate through all pages in a pdf and JUST grab the tables for export t csv that would be AMAZING!
current main file:
from tabula.io import read_pdf;
from traceback import print_tb;
import pandas as pd;
from tabulate import tabulate;
import os
def parser(fileName, count):
print("\nFile Number: ",count, "\nNow parsing file: ", fileName)
df = read_pdf(fileName, pages="all") #address of pdf file
for i in range(len(df)):
df[i].to_excel("./output/test"+str(i)+".xlsx")
print(tabulate(df))
print_tb(df)
def reader(type):
filecount = 1
if(type == 'f'):
file = input("\nFile(f) type selected\nplease enter full file name with path (ex. Users/Name/directory1/filename.pdf: ")
parser(file, filecount)
elif(type == 'd'):
#directory selected
location = input("\nPlease enter diectory path, if in the same folder just enter a period(.)")
print("Opening directory: ", location)
#loop through and parse directory
for filename in os.listdir(location):
f = os.path.join(location, filename)
# checking if it is a file
if os.path.isfile(f):
parser(f, filecount)
filecount + 1
else:
print('\n\n ERROR, path given does not contain a file or is not a directory type..')
else:
print("Error: please select directory(d) or file(f)")
fileType = input("\n-----> Hello!\n----> Would you like to parse a directory(d) or file(f)?").lower()
reader(fileType)

Extract some data from a text file

I am not so experienced in Python.
I have a “CompilerWarningsAllProtocol.txt” file that contains something like this:
" adm_1 C:\Work\CompilerWarnings\adm_1.h type:warning Reason:wunused
adm_2 E:\Work\CompilerWarnings\adm_basic.h type:warning Reason:undeclared variable
adm_X C:\Work\CompilerWarnings\adm_X.h type:warning Reason: Unknown ID"
How can I extract these three paths(C:..., E:..., C:...) from the txt file and to fill an Excel column named “Affected Item”.?
Can I do it with re.findall or re.search methods?
For now the script is checkling if in my location exists the input txt file and confirms it. After that it creates the blank excel file with headers, but I don't know how to populate the excel file with these paths written in column " Affected Item" let's say.
thanks for help. I will copy-paste the code:
import os
import os.path
import re
import xlsxwriter
import openpyxl
from jira import JIRA
import pandas as pd
import numpy as np
# Print error message if no "CompilerWarningsAllProtocol.txt" file exists in the folder
inputpath = 'D:\Work\Python\CompilerWarnings\Python_CompilerWarnings\CompilerWarningsAllProtocol.txt'
if os.path.isfile(inputpath) and os.access(inputpath, os.R_OK):
print(" 'CompilerWarningsAllProtocol.txt' exists and is readable")
else:
print("Either the file is missing or not readable")
# Create an new Excel file and add a worksheet.
workbook = xlsxwriter.Workbook('CompilerWarningsFresh.xlsx')
worksheet = workbook.add_worksheet('Results')
# Widen correspondingly the columns.
worksheet.set_column('A:A', 20)
worksheet.set_column('B:AZ', 45)
# Create the headers
headers=('Module','Affected Item', 'Issue', 'Class of Issue', 'Issue Root Cause', 'Type of Issue',
'Source of Issue', 'Test sequence', 'Current Issue appearances in module')
# Create the bold headers and font size
format1 = workbook.add_format({'bold': True, 'font_color': 'black',})
format1.set_font_size(14)
format1.set_border()
row=col=0
for item in (headers):
worksheet.write(row, col, item, format1)
col += 1
workbook.close()
I agree with #dantechguy that csv is probably easier (and more light weight) than writing a real xlsx file, but if you want to stick to Excel format, the code below will work. Also, based on the code you've provided, you don't need to import openpyxl, jira, pandas or numpy.
The regex here matches full paths with any drive letter A-Z, followed by "type:warning". If you don't need to check for the warning and simply want to get every path in the file, you can delete everything in the regex after S+. And if you know you'll only ever want drives C and E, just change A-Z to CE.
warningPathRegex = r"[A-Z]:\\\S+(?=\s*type:warning)"
compilerWarningFile = r"D:\Work\Python\CompilerWarnings\Python_CompilerWarnings\CompilerWarningsAllProtocol.txt"
warningPaths = []
with open(compilerWarningFile, 'r') as f:
fullWarningFile = f.read()
warningPaths = re.findall(warningPathRegex, fullWarningFile)
# ... open Excel file, then before workbook.close():
pathColumn = 1 # Affected item
for num, warningPath in enumerate(warningPaths):
worksheet.write(num + 1, pathColumn, warningPath) # num + 1 to skip header row

Way to compare two excel files and CSV file

I need to compare two excel files and a csv file, then write some data from one excel file to another.
It looks like this:
CSV file with names which I will compare. For example (spam, eggs)
First Excel file with name and value of it. For example (spam, 100)
Second Excel file with name. For example (eggs)
Now, when I input file (second) into program I need to ensure that eggs == spam with csv file and then save value of 100 to the eggs.
For operating on excel files I'm using openpyxl and for csv I'm using csv.
Can I count on your help? Maybe there are better libraries to do that, because my trials proved to be a total failure.
Got it by myself. Some complex way, but it works like I wanted to. Will be glad for some tips to it.
import openpyxl
import numpy as np
lines = np.genfromtxt("csvtest.csv", delimiter=";", dtype=None)
compdict = dict()
for i in range(len(lines)):
compdict[lines[i][0]] = lines[i][1]
wb1 = openpyxl.load_workbook('inputtest.xlsx')
wb2 = openpyxl.load_workbook(filename='spistest.xlsx')
ws = wb1.get_sheet_by_name('Sheet1')
spis = wb2.get_sheet_by_name('Sheet1')
for row in ws.iter_rows(min_row=1, max_row=ws.max_row, min_col=1):
for cell in row:
if cell.value in compdict:
for wiersz in spis.iter_rows(min_row=1, max_row=spis.max_row, min_col=1):
for komorka in wiersz:
if komorka.value == compdict[cell.value]:
cena = spis.cell(row=komorka.row, column=2)
ws.cell(row=cell.row, column=2, value=cena.value)
wb1.save('inputtest.xlsx')
wb2.close()

Python changing file name

My application offers the ability to the user to export its results. My application exports text files with name Exp_Text_1, Exp_Text_2 etc. I want it so that if a file with the same file name pre-exists in Desktop then to start counting from this number upwards. For example if a file with name Exp_Text_3 is already in Desktop, then I want the file to be created to have the name Exp_Text_4.
This is my code:
if len(str(self.Output_Box.get("1.0", "end"))) == 1:
self.User_Line_Text.set("Nothing to export!")
else:
import os.path
self.txt_file_num = self.txt_file_num + 1
file_name = os.path.join(os.path.expanduser("~"), "Desktop", "Exp_Txt" + "_" + str(self.txt_file_num) + ".txt")
file = open(file_name, "a")
file.write(self.Output_Box.get("1.0", "end"))
file.close()
self.User_Line_Text.set("A text file has been exported to Desktop!")
you likely want os.path.exists:
>>> import os
>>> help(os.path.exists)
Help on function exists in module genericpath:
exists(path)
Test whether a path exists. Returns False for broken symbolic links
a very basic example would be create a file name with a formatting mark to insert the number for multiple checks:
import os
name_to_format = os.path.join(os.path.expanduser("~"), "Desktop", "Exp_Txt_{}.txt")
#the "{}" is a formatting mark so we can do file_name.format(num)
num = 1
while os.path.exists(name_to_format.format(num)):
num+=1
new_file_name = name_to_format.format(num)
this would check each filename starting with Exp_Txt_1.txt then Exp_Txt_2.txt etc. until it finds one that does not exist.
However the format mark may cause a problem if curly brackets {} are part of the rest of the path, so it may be preferable to do something like this:
import os
def get_file_name(num):
return os.path.join(os.path.expanduser("~"), "Desktop", "Exp_Txt_" + str(num) + ".txt")
num = 1
while os.path.exists(get_file_name(num)):
num+=1
new_file_name = get_file_name(num)
EDIT: answer to why don't we need get_file_name function in first example?
First off if you are unfamiliar with str.format you may want to look at Python doc - common string operations and/or this simple example:
text = "Hello {}, my name is {}."
x = text.format("Kotropoulos","Tadhg")
print(x)
print(text)
The path string is figured out with this line:
name_to_format = os.path.join(os.path.expanduser("~"), "Desktop", "Exp_Txt_{}.txt")
But it has {} in the place of the desired number. (since we don't know what the number should be at this point) so if the path was for example:
name_to_format = "/Users/Tadhg/Desktop/Exp_Txt_{}.txt"
then we can insert a number with:
print(name_to_format.format(1))
print(name_to_format.format(2))
and this does not change name_to_format since str objects are Immutable so the .format returns a new string without modifying name_to_format. However we would run into a problem if out path was something like these:
name_to_format = "/Users/Bob{Cat}/Desktop/Exp_Txt_{}.txt"
#or
name_to_format = "/Users/Bobcat{}/Desktop/Exp_Txt_{}.txt"
#or
name_to_format = "/Users/Smiley{:/Desktop/Exp_Txt_{}.txt"
Since the formatting mark we want to use is no longer the only curly brackets and we can get a variety of errors:
KeyError: 'Cat'
IndexError: tuple index out of range
ValueError: unmatched '{' in format spec
So you only want to rely on str.format when you know it is safe to use. Hope this helps, have fun coding!

Pandas read_excel with Hyperlink

I have an Excel spreadsheet that I am reading into a Pandas DataFrame:
df = pd.read_excel("file.xls")
However, one of the columns of the spreadsheet contains text which have a hyperlink associated with it. How do I access the underlying hyperlink in Pandas?
This can be done with openpyxl, I'm not sure its possible with Pandas at all. Here's how I've done it:
import openpyxl
wb = openpyxl.load_workbook('yourfile.xlsm')
ws = wb.get_sheet_by_name('Sheet1')
print(ws.cell(row=2, column=1).hyperlink.target)
You can also use iPython, and set a variable equal to the hyperlink object:
t = ws.cell(row=2, column=1).hyperlink
then do t. and press tab to see all the options for what you can do with or access from the object.
Quick monkey patching, without converters or anything like this, if you would like to treat ALL cells with hyperlinks as hyperlinks, more sophisticated way, I suppose, at least be able to choose, what columns treat as hyperlinked or gather data, or save somehow both data and hyperlink in same cell at dataframe. And using converters, dunno. (BTW I played also with data_only, keep_links, did not helped, only changing read_only resulted ok, I suppose it can slow down your code speed).
P.S.: Works only with xlsx, i.e., engine is openpyxl
P.P.S.: If you reading this comment in the future and issue https://github.com/pandas-dev/pandas/issues/13439 still Open, don't forget to see changes in _convert_cell and load_workbook at pandas.io.excel._openpyxl and update them accordingly.
import pandas
from pandas.io.excel._openpyxl import OpenpyxlReader
import numpy as np
from pandas._typing import FilePathOrBuffer, Scalar
def _convert_cell(self, cell, convert_float: bool) -> Scalar:
from openpyxl.cell.cell import TYPE_BOOL, TYPE_ERROR, TYPE_NUMERIC
# here we adding this hyperlink support:
if cell.hyperlink and cell.hyperlink.target:
return cell.hyperlink.target
# just for example, you able to return both value and hyperlink,
# comment return above and uncomment return below
# btw this may hurt you on parsing values, if symbols "|||" in value or hyperlink.
# return f'{cell.value}|||{cell.hyperlink.target}'
# here starts original code, except for "if" became "elif"
elif cell.is_date:
return cell.value
elif cell.data_type == TYPE_ERROR:
return np.nan
elif cell.data_type == TYPE_BOOL:
return bool(cell.value)
elif cell.value is None:
return "" # compat with xlrd
elif cell.data_type == TYPE_NUMERIC:
# GH5394
if convert_float:
val = int(cell.value)
if val == cell.value:
return val
else:
return float(cell.value)
return cell.value
def load_workbook(self, filepath_or_buffer: FilePathOrBuffer):
from openpyxl import load_workbook
# had to change read_only to False:
return load_workbook(
filepath_or_buffer, read_only=False, data_only=True, keep_links=False
)
OpenpyxlReader._convert_cell = _convert_cell
OpenpyxlReader.load_workbook = load_workbook
And after adding this above in your python file, you will be able to call df = pandas.read_excel(input_file)
After writing all this stuff it came to me, that maybe it would be easier and cleaner just use openpyxl by itself ^_^
as commented by slaw it doesnt grab the hyperlink but only the text
here text.xlsx contains links in the 9th column
from openpyxl import load_workbook
workbook = load_workbook('test.xlsx')
worksheet = workbook.active
column_indices = [9]
for row in range(2, worksheet.max_row + 1):
for col in column_indices:
filelocation = worksheet.cell(column=col, row=row) # this is hyperlink
text = worksheet.cell(column=col + 1, row=row) # thi is your text
worksheet.cell(column=col + 1, row=row).value = '=HYPERLINK("' + filelocation.value + '","' + text.value + '")'
workbook.save('test.xlsx')
You cannot do that in pandas. You can try with other libraries designed to deal with excel files.

Resources