Python os.path.join weird behaviour with ("any_path", "c:\") as input - python-3.x

So, I am making a pseudo login system and I've run into a few bugs with the os.path.join function.
It seems to act strangely when inputs such as "c:" or "d:" are input as the username, as it voids any path string before the root drives.
So, an input such as:
>>> import os
>>> os.path.exists( os.path.join( some_path, "this_is", "voided", "c:", "python34" ) )
True
Will have its first 3 statements completely ignored.
Is there any way to avoid this?

Seems like a known bug
On Windows, the drive letter is not reset when an absolute path component (e.g., r'\foo') is encountered. If a component contains a drive letter, all previous components are thrown away and the drive letter is reset. Note that since there is a current directory for each drive, os.path.join("c:", "foo") represents a path relative to the current directory on drive C: (c:foo), not c:\foo.
You can try pathlib.PurePath.joinpath

Related

I need to copy files from appdata/local to C drive and overwrite them each time the program is run

I use a program which, sadly corrupts some saved files at random times. To be helpful (although I am a novice at this) I am trying to make a Python program to basically backup those file from the AppData/local directory and put them in a folder on C. I need this program to overwrite the previously copied files each time it is run.
I need to generalize the AppData/local because each person who uses this program would, in theory, have a different user directory preceding the AppData folder.
I've tried running some of my own attempts at a solution.
I will post the results.
# Imports
import shutil
import os
import distutils
from distutils import dir_util
# Paths
# os.makedirs("C:/RevSaves-Backup")
path = '%LOCALAPPDATA%/Remnant'
backup_path = "C:/RevSaves-Backup"
# Procedures
print("The Very Basic Remnant Save Backup Utility")
print(" ")
print("Backing up the save source:")
print(path)
print(" ")
print("It is recommended you run this at regular intervals \nto ensure you have the latest saves up to date.")
distutils.dir_util.copy_tree(path, backup_path)
print("Backup completed.")
When I execute this via command prompt or PowerShell, I get the following message:
Traceback (most recent call last):
File "RevSaveBkUp.py", line 28, in
distutils.dir_util.copy_tree(path, backup_path)
File "C:\Users\candr\AppData\Local\Programs\Python\Python37-32\lib\distutils\dir_util.py", line 124, in copy_tree
"cannot copy tree '%s': not a directory" % src)
distutils.errors.DistutilsFileError: cannot copy tree '%LOCALAPPDATA%/Remnant': not a directory
I am having trouble "targeting" the system-specific local AppData folder.
After a lot of reading, I made the following solution if anyone else is trying to do something similar. I do not know if this is the "best" or "right" way of doing things, however.
Here is how I targeted the AppData Local folder regardless of the user logged in:
path = os.path.join(os.path.expanduser('~'), 'AppData', 'Local')
Some explanations for anyone who is new like me:
os.path.join basically connects folders together in the path. For example, using the above code, join would "connect" AppData to Local and the "User Folder" (referenced in the code as '~'). The output would look like this: C:\Users\your_username\AppData\Local
os.path.expanduser defines the user in question. For example, "~" targets the current user logged in. It goes inside the () because this is how you tell "your code" who, to target. If you wanted a specific user (if you had more than one) you could possibly use os.path.expanduser('Jane') I believe.
Keeping the notes above in reference, this method allowed me to define the variables I needed to and use them for the copy above, where I could not normally use the AppData directory as I wanted.
This was done by using the following code as an example:
path = os.path.join(os.path.expanduser('~'), 'AppData', 'Local')
backup_path = "C:/MyBackupFolder"
Finally we executed the copy with this:
distutils.dir_util.copy_tree(path, backup_path)
The above copied The AppData information I needed to the backup folder.
I hope this helps everyone learn as I did, it came in quite handy.

Copying files across a network with Python

I am trying to loop through a list of filepaths for files I have throughout the entire network at my company. The filepaths have locations of various drives throughout the network.
The user submitted the file once upon a time and the filepath was passed through at the point of submission. However, the file drive is not the same for every user and is not the same for what that drive is named on my machine.
For example: a path like X:\Users\Submissions\Bob's File.xlsx may coincide with the same drive and file but named differently on my machine:
K:\Users\Submissions\Bob's File.xlsx
Each user has the potential of using a different letter for that particular drive for a various number of reasons.
Is there a way I can make my pattern string that I pass in smart enough to be able to find the proper directory and locate that file? Any ideas would be great.
Thank you
import pandas as pd
import shutil as sh
copydir = r"C:\Users\me\Desktop\PythonSpyderDesktop\Extractor\Models"
file_path_list = r"C:\Users\me\Desktop\PythonSpyderDesktop\Extractor\FilePathList.csv"
data = pd.read_csv(file_path_list)
i = 1 #Start at 2nd row
for i in range(1, len(data)):
try:
sh.copyfile(data.FilePath[i], copydir)
print("Copied over file: " + data.FilePath[i])
except:
print ("File not found.")
Your question is unclear. It revolves around the source & dest arguments being passed to copyfile:
sh.copyfile(data.FilePath[i], copydir)
It's hard to tell what pathnames you're extracting from the .CSV, but apparently source files may have the "wrong" drive letter, and/or the destination directory copydir may have the "wrong" drive letter. The script apparently runs on multiple machines, and those machines have diverse drive letters mounted.
Write a helper function that finds the "right" drive letter. It should accept a pathname like copydir, then probe a search list, then return a corrected pathname.
Given a list of drive letters, you can iterate through them and test whether a pathname exists using os.path.exists(). Return the first one found.
Use splitdrive() to parse out components of the input pathname.
Suppose that both source and dest may need their drive letters fixed up. Then the call might look like this:
sh.copyfile(fix_path(data.FilePath[i]), fix_path(copydir))

use glob module to get certain files with python?

The following is a piece of the code:
files = glob.iglob(studentDir + '/**/*.py',recursive=True)
for file in files:
shutil.copy(file, newDir)
The thing is: I plan to get all the files with extension .py and also all the files whose names contain "write". Is there anything I can do to change my code? Many thanks for your time and attention.
If you want that recursive option, you could use :
patterns = ['/**/*write*','/**/*.py']
for p in patterns:
files = glob.iglob(studentDir + p, recursive=True)
for file in files:
shutil.copy(file, newDir)
If the wanted files are in the same directory you could simply use :
certainfiles = [glob.glob(e) for e in ['*.py', '*write*']]
for file in certainfiles:
shutil.copy(file, newDir)
I would suggest the use of pathlib which has been available from version 3.4. It makes many things considerably easier.
In this case '**' stands for 'descend the entire folder'.
'*.py' has its usual meaning.
path is an object but you can recover its string representation using the str function to get just the file name.
When you want the entire path name, use path.absolute and get the str of that.
Don't worry, you'll get used to it. :) If you look at the other goodies in pathlib you'll see it's worth it.
from pathlib import Path
studentDir = <something>
newDir = <something else>
for path in Path(studentDir).glob('**/*.py'):
if 'write' in str(path):
shutil.copy(str(path.absolute()), newDir)

Python3 multiprocessing

I am an absolute beginner. I fumble my way through code by analogy to examples so apologies for any misuse of terminology.
I have written a small piece of code in python 3 which:
takes a user input (a folder on their computer)
searches the folder for pdf files
turns each page of the PDF to an image with sequential numbering. Iterates through the jpgs in order of numbering, turning them black and white. OCR scans the files and outputs the text into an object, saves the text contents to a .txt file (via pytesseract). Deletes jpgs, leaving .txt file. Most time is taken in converting to jpgs and possibly making them black and white.
The code works, though I am sure it could be improved. It takes a while so I thought I'd try multiprocessing using Pools.
My code appears to create pools. I can also get the function to print a list of files in the folder, so it appears to have the list passed to it in one form or another.
I cannot get it to work and have now hacked the code about repeatedly with various errors. I think the main problem is, I am clueless.
My code begins:
User input block (asks for a folder in the user's directory, checks it is a valid folder etc).
OCR block as a function (parses PDF then outputs contents into single .txt file)
For loop block as a function (is supposed to loop over each PDF in folder and execute OCR block on it.
Multiprocessing block (is supposed to feed the list of files in the directory to the loop block.
To avoid writing War and Peace, I set out last version of the loop block and multiprocessing blocks below:
#import necessary modules
home_path = os.path.expanduser('~')
#ask for input with various checking mechanisms to make sure a useful pdfDir is obtained
pdfDir = home_path + '/Documents/' + input('Please input the folder name where the PDFs are stored. The folder must be directly under the Documents folder. It cannot have a space in it. \n \n Name of folder:')
def textExtractor():
#convert pdf to jpeg with a tesseract friendly resolution
with Img(filename=pdf_filename, resolution=300) as img: #some can be encrypted so use OCR instead of other libraries
#various lines of code here
compilation_temp.close()
def per_file_process (subject_files):
for pdf in subject_files:
#decode the whole file name as a string
pdf_filename = os.fsdecode(pdf)
#check whether the string ends in .pdf
if pdf_filename.endswith(".pdf"):
#call the OCR function on it
textExtractor()
else:
print ('nonsense')
if __name__ == '__main__':
pool = Pool(2)
pool.map(per_file_process, os.listdir(pdfDir))
Is anyone willing/able to point out my errors, please?
The relevant bits of the code whilst working:
#import necessary
home_path = os.path.expanduser('~')
#block accepting input
pdfDir = home_path + '/Documents/' + input('Please input the folder name where the PDFs are stored. The folder must be directly under the Documents folder. It cannot have a space in it. \n \n Name of folder:')
def textExtractor():
#convert pdf to jpeg with a tesseract friendly resolution
with Img(filename=pdf_filename, resolution=300) as img: #need to think about using generic expanduser or other libraries to allow portability
#various lines of code to OCR and output .txt file
compilation_temp.close()
subject_files = os.listdir(pdfDir)
for pdf in subject_files:
#decode the whole file name as a string you can see
pdf_filename = os.fsdecode(pdf)
#check whether the string ends in /pdf
if pdf_filename.endswith(".pdf"):
textExtractor()
else:
#print for debugging
Pool.map calls the worker function repeatedly with each name returned by os.listdir. In per_file_process, subject_files is a single filename and for pdf in subject_files: is enumerating the individual characters in the name. Further, listdir only shows the base name, without subdirectories, so you aren't looking in the right place for the pdf. You can use glob to filter by extension name and return a working path to the file.
Your example is confusing... textExtractor() takes no parameters so how is it to know which file it is processing? I'm going out on a limb and assuming that it really does take the path to the file processing. If so, you can parallelize rather easily just by feeding pdf's directory it via map. Assuming processing time will vary by pdf, I am setting chunksize to 1 so that an early finishing worker can grap extra files to process.
from glob import glob
import os
from multiprocessing import Pool
def textExtractor(pdf_filename):
#convert pdf to jpeg with a tesseract friendly resolution
with Img(filename=pdf_filename, resolution=300) as img: #some can be encrypted so use OCR instead of other libraries
#...various lines of code here
compilation_temp.close()
if __name__ == '__main__':
#pdfDir is the folder inputted by user
with Pool(2) as pool:
# assuming call signature: textExtractor(path_to_file)
pool.map(textExtractor,
(filename for filename in glob(os.path.join(pdfDir, '*.pdf'))
if os.path.isfile(filename))
chunksize=1)

How to make python spot a folder and print its files

I want to be able to make python print everything in my C drive. I have figured out this to print whats on the first "layer" of the drive,
def filespotter():
import os
path = 'C:/'
dirs = os.listdir( path )
for file in dirs:
print(file)
but I want it to go into the other folders and print what is in every other folder.
Disclaimer os.walk is just fine, I'm here to provide a easier solution.
If you're using python 3.5 or above, you can use glob.glob with '**' and recursive=True
For example: glob.glob(r'C:\**', recursive=True)
Please note that getting the entire file list of C:\ drive can take a lot of time.
If you don't need the entire list at the same time, glob.iglob is a reasonable choice. The usage is the same, except that you get an iterator instead of a list.
To print everything under C:\
for filename in glob.iglob(r'C:\**', recursive=True):
print(filename)
It gives you output as soon as possible.
However if you don't have python 3.5 available, you can see the source of glob.py and adapt it for your use case.

Resources