Python filepaths have double backslashes - python-3.x

Ultimately, I want to loop through every PDF in a specified directory ('C:\Users\dude\pdfs_for_parsing') and print the metadata for each one. The issue is that when I try to loop through "directory" I receive the error "FileNotFoundError: [Errno 2] No such file or directory:". I understand this error is occurring because I now have double slashes in my filepaths for some reason.
Example Code
import PyPDF2
import os
path_of_the_directory = r'C:\Users\dude\pdfs_for_parsing'
directory = []
ext = ('.pdf')
def isolate_pdfs():
    for files in os.listdir(path_of_the_directory):
        if files.endswith(ext):
            x = os.path.abspath(files)
            directory.append(x)
    for pdf in directory:
        reader = PyPDF2.PdfReader(pdf)
        information = reader.metadata
        print(information)
isolate_pdfs()
If I print the file paths one at a time, I see that the files have single '/' like I'm expecting:
for pdf in directory:
    print(pdf)
The '//' seems to get added when I try to open each of the PDFs 'PDFFile = open(pdf,'rb')'

Your issue has nothing to do with //, it's here:
os.path.abspath(files)
Say you have C:\Users....\x.pdf: you list that directory, so files will contain just x.pdf. You then take the absolute path of x.pdf, which os.path.abspath resolves against the current working directory, not against the directory you listed. You should replace it with:
x = os.path.join(path_of_the_directory, files)
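To see the difference between the two calls, here is a minimal sketch. The scratch directory and the file name x.pdf are stand-ins for the real PDF folder, not something from the question.

```python
import os
import tempfile

# Hypothetical setup: a scratch directory standing in for the PDF folder.
with tempfile.TemporaryDirectory() as pdf_dir:
    open(os.path.join(pdf_dir, "x.pdf"), "wb").close()
    name = os.listdir(pdf_dir)[0]           # bare file name, no directory part

    wrong = os.path.abspath(name)           # resolved against os.getcwd()
    right = os.path.join(pdf_dir, name)     # resolved against the listed folder

    right_exists = os.path.exists(right)
    print("listed name:", name)
    print("abspath    :", wrong)
    print("join       :", right, "exists:", right_exists)
```

Unless the script happens to run from inside the PDF folder, the abspath result points somewhere the file does not live, which is what produces the FileNotFoundError.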
Other notes:
PDFFile and PDF shouldn't be in uppercase. Prefer pdf_file and pdf_reader. The latter also avoids confusion with the pdf in for pdf in...
Try to use a debugger rather than print statements. This is how I found your bug. It can be in your IDE or on the command line (python -i runs the script and then leaves you at an interactive prompt; python -m pdb starts the actual debugger). You can step through your code, test a few variations, fiddle with the variables...
Why is ext = ('.pdf') in parentheses? They don't do anything, but they suggest that ext might be a tuple (it isn't, it's a plain string).
As an exercise, the first for loop can be written as: directory = [os.path.join(path_of_the_directory, x) for x in os.listdir(path_of_the_directory) if x.endswith(ext)]
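On the parentheses note, a quick sketch of what does and does not make a tuple. The names ext_tuple and many are made up for illustration.

```python
ext = ('.pdf')            # parentheses only: this is still a str
ext_tuple = ('.pdf',)     # the trailing comma is what makes a one-element tuple
many = ('.pdf', '.PDF')   # str.endswith also accepts a tuple of suffixes

print(type(ext).__name__, type(ext_tuple).__name__)
print('report.PDF'.endswith(ext), 'report.PDF'.endswith(many))
```

A tuple only becomes useful here if you want to match more than one suffix in a single endswith call.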

Related

Problem using glob: file not found after os.path.join()

I ran into a strange problem using glob (Python 3.10.0/Linux):
If I use glob to locate the required file with the following construct:
def get_last_file(folder, date=datetime.today().date()):
    os.chdir(folder)
    _files = glob.glob("*.csv")
    _files.sort(key=os.path.getctime)
    os.chdir(os.path.join("..", ".."))
    for _filename in _files[::-1]:
        string = str(date).split("-")
        if "".join(string) in _filename:
            return _filename
    # if cannot find the specific date, return newest file
    return _files[-1]
but when I try to
os.path.join(fileDir, file)
with the resulting file, I get the relative path which leads to:
FileNotFoundError: [Errno 2] No such file or directory: 'data/1109.csv'.
The file certainly exists, and when I try os.path.join(fileDir, '1109.csv'), the file is found.
The weirdest thing: if I do:
filez = get_last_file(fileDir, datetime.today().date())
file = '1109.csv'
I still get file not found for file after os.path.join(fileDir, file).
Should I avoid using glob at all?
I came up with this solution:
file = ''
_mtime = 0
for root, dirs, filenames in os.walk(fileDir):
    for f in sorted(filenames):
        if f.endswith(".csv"):
            if os.path.getmtime(fileDir+f) > _mtime:
                _mtime = os.path.getmtime(fileDir+f)
                file = f
print (f'fails {file}')
and the resulting os.path.join(fileDir, file) gives (relative) path fit for further operations
Also the difference between getctime and getmtime is accounted for.
While not a direct solution, try looking at Python's pathlib module. It often leads to cleaner, less buggy solutions.
import os
from datetime import datetime
from pathlib import Path
def get_last_file(folder, date=datetime.today().date()):
    folder = Path(folder) # Works for both relative and absolute paths
    # glob inside the folder itself (no os.chdir needed); sorted() turns the generator into a list
    _files = sorted(folder.glob("*.csv"), key=os.path.getctime)
    for _filename in _files[::-1]:
        string = str(date).split("-")
        # compare against the file name only, not the whole Path object
        if "".join(string) in _filename.name:
            return _filename
    # if cannot find the specific date, return newest file
    return _files[-1]
Then, instead of using os.path.join(), you can write path_dir / file_name, where path_dir is a Path object. It may also be that you are changing the working directory inside your function (the os.chdir calls), which leads to unexpected behaviour when relative paths are resolved afterwards.
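As a rough sketch of the "no os.chdir" idea: the helper name newest_csv and the demo files are hypothetical, and it uses the modification time rather than getctime.

```python
import os
import tempfile
from pathlib import Path

def newest_csv(folder):
    """Return the most recently modified *.csv in folder, or None if there is none."""
    candidates = list(Path(folder).glob("*.csv"))
    if not candidates:
        return None
    return max(candidates, key=lambda p: p.stat().st_mtime)

# Hypothetical demo data in a scratch directory.
with tempfile.TemporaryDirectory() as tmp:
    old, new = Path(tmp, "1108.csv"), Path(tmp, "1109.csv")
    old.write_text("a\n")
    new.write_text("b\n")
    os.utime(old, (1_000_000_000, 1_000_000_000))   # force an older mtime
    os.utime(new, (2_000_000_000, 2_000_000_000))   # force a newer mtime

    cwd_before = os.getcwd()
    result = newest_csv(tmp)
    result_name = result.name
    result_is_file = result.is_file()
    cwd_unchanged = os.getcwd() == cwd_before
    print(result_name, result_is_file, cwd_unchanged)
```

Because the returned Path already includes the folder, it can be opened directly, and the working directory is never touched, so relative paths elsewhere in the program keep resolving the same way.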

Python - Script that copy certain files by file name

I wrote a script to copy files with specific names from one folder to another.
The file name format I want to copy is 2021052444592AKC. However, the script I wrote copies all files ending in AKC, even though the if condition specifies that it should copy a file only if its name starts with "202105" and ends with "AKC". The folder also contains other files in the same format, that is "YYYYMMDD44592threeUpperCaseLetters".
Can anyone help, because I haven't found the answer to this problem, thanks in advance :)
P.S I'm using Python3 in PyCharm
import shutil
import os
os.chdir(r"C:\\")
# without a double backslash and the letter r, the compiler throws an error
dir_src = r"C:\\Users\\Adam\\Desktop\1\\"
dir_dst = r"C:\\Users\\Adam\\Desktop\\2\\"
for filename in os.listdir(dir_src):
    if filename.startswith("202105") and filename.endswith("AKC"):
        shutil.copy(dir_src + filename, dir_dst)
print("End")
I'm not sure exactly why your script is failing, but you might want to try a solution with a regular expression (re).
import os
import re
import shutil
pattern = re.compile(r'^202105(\d{2})44592AKC$')
os.chdir(r"C:\\")
# without a double backslash and the letter r, the compiler throws an error
dir_src = r"C:\\Users\\Adam\\Desktop\\1\\"
dir_dst = r"C:\\Users\\Adam\\Desktop\\2\\"
for filename in os.listdir(dir_src):
    if pattern.match(filename):
        shutil.copy(dir_src + filename, dir_dst)
print("End")
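To check what that pattern accepts without touching any files, here is a small sketch. The file names in the list are invented for illustration.

```python
import re

# Pattern from the answer: 202105 + two-digit day + 44592 + AKC, and nothing else.
pattern = re.compile(r'^202105(\d{2})44592AKC$')

# Hypothetical file names for illustration.
names = [
    "2021052444592AKC",      # wanted
    "2021042444592AKC",      # wrong month
    "2021052444592XYZ",      # wrong suffix
    "2021052444592AKC.txt",  # has an extension after AKC
]
matches = [n for n in names if pattern.match(n)]
print(matches)
```

Note that the `$` anchor rejects the last name. If the real files carry an extension (an assumption, since the question does not say), the pattern would need to allow for it.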

Double backslashes for filepath_or_buffer with pd.read_csv

Python 3.6, OS Windows 7
I am trying to read a .txt file with pd.read_csv() using a relative filepath. According to the pd.read_csv() API documentation, the filepath argument can be any valid string path.
So, in order to define the relative path, I use the pathlib module. I have defined the relative path as:
df_rel_path = pathlib.Path.cwd() / ("folder1") / ("folder2") / ("file.txt")
a = str(df_rel_path)
Finally, I just want to use it to feed pd.read_csv() as:
df = pd.read_csv(a, engine = "python", sep = r"\s+")
However, I am just getting an error stating "No such file or directory: ..." showing double backslashes on the folder path.
I have tried to manually write the path on pd.read_csv() using a raw string, that is, using r"relative/path". However, I am still getting the same result, double backslashes. Is there something I am overlooking?
You can get what you want by using the os module:
df_rel_path = os.path.abspath(os.path.join(os.getcwd(), "folder1", "folder2"))
This way the os module takes care of joining the path parts with the proper separator. You can omit os.path.abspath if you read a file that's within the same directory, but I wrote it for the sake of completeness.
For more info, refer to this SO question: Find current directory and file's directory
You need a filename to call pd.read_csv. In the example, 'a' is only the path and does not point to a specific file. You could do something like this:
df_rel_path = pathlib.Path.cwd() / ("folder1") / ("folder2")
a = str(df_rel_path)
df = pd.read_csv(a+'/' +'filename.txt')
With the filename your code works for me (on Windows 10):
df_rel_path = pathlib.Path.cwd() / ("folder1") / ("folder2")/ ("file.txt")
a = str(df_rel_path)
df = pd.read_csv(a)
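On the doubled backslashes themselves: a likely explanation (an assumption, since the full traceback isn't shown) is that the error message prints the path using repr(), which escapes each backslash. The string itself contains single backslashes. A minimal sketch, using PureWindowsPath so that it behaves the same on any OS:

```python
from pathlib import PureWindowsPath

# "folder1", "folder2" and "file.txt" are the names used in the question.
p = str(PureWindowsPath("folder1") / "folder2" / "file.txt")

print(p)        # the actual characters in the string
print(repr(p))  # how the string is displayed inside an error message or REPL echo

single = p.count("\\")         # backslashes really present in the path
shown = repr(p).count("\\")    # backslashes visible in the repr
print(single, shown)
```

If that is what is happening, the doubled backslashes are only a display detail, and the real cause of "No such file or directory" is that the path does not exist relative to the current working directory.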

Rename multiple files in Python from another list

I am trying to rename multiple files using suffixes from another list. For example, rename test.wav to test_1.wav using the list ['_1','_2'].
import os
list_2 = ['_1','_2']
path = '/Users/file_process/new_test/'
file_name = os.listdir(path)
for name in file_name:
    for ele in list_2:
        new_name = name.replace('.wav',ele+'.wav')
        os.renames(os.path.join(path,name),os.path.join(path,new_name))
But it turns out the error shows "FileNotFoundError: [Errno 2] No such file or directory: /Users/file_process/new_test/test.wav -> /Users/file_process/new_test/test_2.wav"
However, the first file in the folder has been renamed to test_1.wav, but not the rest.
You are looping over the whole list for each file. You need to pair each file name with one list element in a single for loop.
This can be done using the zip(file_name, list_2) function.
This renames each file by appending the matching element from the list. We just have to make sure the list and the number of files are always equal.
Code:
import os
list_2 = ['_1','_2']
path = '/Users/file_process/new_test/'
file_name = os.listdir(path)
for name, ele in zip(file_name, list_2):
    new_name = name.replace(name , name[:-4] + ele+'.wav')
    print(new_name)
    os.renames(os.path.join(path,name),os.path.join(path,new_name))
There is an error in your algorithm.
In the first iteration of the outer loop (for name in file_name), the inner loop renames test.wav to test_1.wav. At that point there is no longer a file named test.wav (it has already been renamed to test_1.wav); however, the inner loop still tries to rename test.wav to test_2.wav and, of course, cannot find it!
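One way to check the pairing before renaming anything is to compute the new names first. This is only a sketch: plan_renames is a hypothetical helper, and the file names are made up. It uses os.path.splitext instead of slicing off the last four characters.

```python
import os

def plan_renames(file_names, suffixes):
    """Pair each file name with one suffix and return (old, new) name tuples."""
    plan = []
    # sorted() makes the pairing deterministic; os.listdir order is arbitrary.
    for name, suffix in zip(sorted(file_names), suffixes):
        stem, extension = os.path.splitext(name)   # 'test.wav' -> ('test', '.wav')
        plan.append((name, stem + suffix + extension))
    return plan

# Hypothetical file names; in the question they come from os.listdir(path).
plan = plan_renames(["test.wav", "demo.wav"], ["_1", "_2"])
print(plan)
```

Each (old, new) pair can then be passed to os.rename with os.path.join(path, ...) on both sides. Since each file is renamed exactly once, the original name always still exists when it is needed.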

Can't Glob Several File Types

I've researched and tested this issue for a while and can't seem to get it to work.
user_path is provided by the user and contains .xlsm, .xlsb and .xlsx file types. I'm trying to catch all of them and convert them to .csv. This works individually if I substitute the extensions:
all_files = glob.glob(os.path.join(user_path, "*.xlsm")) #xlsb, xlsm
I've tried the following two methods, neither of which work (win32com just tells me Excel can't access the out_folder.)
all_files = glob.glob(os.path.join(user_path, "*"))
all_files = glob.glob(user_path)
How can I send these two file types together with user_path?
Thanks in advance.
By using just *, glob matches all files AND directories under the given folder, including those you have no access to (in your case, the out_folder directory). So when you iterate over the file names, check that each one ends with one of the file extensions you're looking for before you try to open it.
Since glob can't test for multiple file extensions at a time, it's actually better to use os.listdir and do the filtering of multiple file extensions on your own.
for filename in os.listdir(user_path):
    if any(map(filename.endswith, ('.xlsm', '.xlsb', '.xlsx'))):
        do_something(filename)
Or, with list comprehension,
all_files = [filename for filename in os.listdir(user_path) if any(map(filename.endswith, ('.xlsm', '.xlsb', '.xlsx')))]
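One caveat with os.listdir is that it returns bare names, so they need to be joined back onto user_path before being opened. A hedged sketch of the whole filter follows. The names list_excel_files and EXCEL_EXTENSIONS and the demo folder contents are made up, and the lower() call assumes you want case-insensitive matching.

```python
import os
import tempfile

EXCEL_EXTENSIONS = ('.xlsm', '.xlsb', '.xlsx')   # str.endswith accepts a tuple

def list_excel_files(folder):
    """Return full paths of regular files in folder that have an Excel extension."""
    paths = []
    for filename in sorted(os.listdir(folder)):
        full_path = os.path.join(folder, filename)
        if filename.lower().endswith(EXCEL_EXTENSIONS) and os.path.isfile(full_path):
            paths.append(full_path)
    return paths

# Hypothetical demo folder with three workbooks, a text file and a subfolder.
with tempfile.TemporaryDirectory() as user_path:
    for name in ("a.xlsm", "b.xlsb", "c.xlsx", "notes.txt"):
        open(os.path.join(user_path, name), "w").close()
    os.mkdir(os.path.join(user_path, "out_folder"))
    found = [os.path.basename(p) for p in list_excel_files(user_path)]
    print(found)
```

The os.path.isfile check skips directories such as out_folder, so they never reach the Excel conversion step.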
Edit by the OP (actual code):
pathlib.Path(path + '\out_folder').mkdir(parents = True, exist_ok = True)
newpath = os.path.join(path,'out_folder')
#this is the line I can't seem to get to read both file types - it works as is.
all_files_test = glob.glob(os.path.join(user_path, "*.xlsm")) #xlsb, xlsm
for file in all_files_test:
    name1 = os.path.splitext(os.path.split(file)[1])[0]

Resources