Extract text from first page of word document and use it as a folder name, then move the file inside that folder

I have hundreds of Word documents that need to be processed, but first I need to organize them by version into subfolders.
I get a drop of these Word documents in a single folder and need to automate the organization going forward before I go nuts.
I already have a script that creates a folder with the same name as each file and moves the file into that folder; this part is done.
Now I need to go into each subfolder, get the document version from the first page of each document, create a subfolder named after the version number, and move the Word file into that subfolder.
The structure should be as follows (taking two folders as examples):
(Folder) Test
    (Subfolder) 12.0
        Test.docx
(Folder) Test1
    (Subfolder) 13.0
        Test1.docx
Luckily I was able to figure out that "doc.paragraphs[6].text" always returns the version information on a single line, as follows:
>>> doc.paragraphs[6].text
'Version Number: 12.0'
I would appreciate it if someone could point me in the right direction.
This is the script I have so far:
#!/usr/bin/env python3
import glob, os, shutil, docx, sys

folder = sys.argv[1]
#print(folder)
for file_path in glob.glob(os.path.join(folder, '*.docx')):
    # Target directory: the file path without its extension.
    new_dir = file_path.rsplit('.', 1)[0]
    #print(new_dir)
    try:
        # new_dir already starts with folder, so it is used as-is.
        os.mkdir(new_dir)
    except FileExistsError:
        # Handle the case where the target dir already exists.
        pass
    shutil.move(file_path, os.path.join(new_dir, os.path.basename(file_path)))

Please see below a complete solution for your requirement.
Note: for background on re.search, see https://www.geeksforgeeks.org/python-regex-re-search-vs-re-findall/
import docx, os, glob, re, shutil
from pathlib import Path

def create_dir(path):
    # Create the directory (and any missing parents) if it does not exist yet
    if not os.path.exists(path):
        os.makedirs(path)

folder = r"C:\Users\rams\Documents\word_docs"  # my local folder
for file in glob.glob(os.path.join(folder, '*.docx')):
    # Test, Test1, Test2 in your structure
    main_folder = os.path.join(folder, Path(file).stem)
    file_name = os.path.basename(file)
    # Get the first line from the docx
    # (the question's documents keep the version line in paragraphs[6]; adjust the index to match)
    doc = docx.Document(file).paragraphs[0].text
    # group(1) = the whole match, "Version Number: <version>"
    version_no = re.search("(Version Number: (.*))", doc).group(1)
    # extract the number portion from version_no
    sub_folder = version_no.split(':')[1].strip()
    # path to actual sub_folder with version_no
    sub_folder = os.path.join(main_folder, sub_folder)
    # destination path
    dest_file_path = os.path.join(sub_folder, file_name)
    for i in [main_folder, sub_folder]:
        create_dir(i)  # function call
    # move the file to the corresponding version folder (overwrite if it exists)
    if os.path.exists(dest_file_path):
        os.remove(dest_file_path)
    shutil.move(file, sub_folder)

So you have a script that creates a folder named after the file and moves the file into that folder. That part is done.
Since you already know how to get the document version from the first page of each document, what remains is to create a subfolder named after that version and move the Word file into it. This can be done with the same code as before, replacing:
new_dir = file_path.rsplit('.', 1)[0]
with
document_dir = os.path.dirname(file_path)
document_name = os.path.basename(file_path)
# check if the document is already in the right directory:
assert os.path.basename(document_dir) == document_name.rsplit('.', 1)[0]
doc = docx.Document(file_path)  # needs "import docx" (python-docx)
doc_version_tuple = doc.paragraphs[6].text.rsplit(': ', 1)
# check if doc_version_tuple has the right content:
assert doc_version_tuple[0] == 'Version Number'
doc_version = doc_version_tuple[1]
new_dir = os.path.join(document_dir, doc_version)
Note that you can also do both steps in one pass over the list of full document paths.
Note further that the script you posted in your question has no check such as:
assert os.path.basename(document_dir) != document_name.rsplit('.', 1)[0]
This check raises an error if the script has already been run and the documents are already in folders named after them. Without it, running the script twice will destroy what you have already achieved, and you will need to write another script to reverse it.
This is why it is a good idea to keep a backup copy of all the documents, so you can re-create the directory if something goes wrong. More generally, always keep a backup copy when you work on files, especially with a self-written script.
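Both steps can be combined in one pass, as suggested above. The following is a minimal sketch, assuming the version line has the form 'Version Number: 12.0' shown in the question. The helper names parse_version and file_into_version_folder are hypothetical, and reading the line from the .docx (for example with docx.Document(path).paragraphs[6].text) is left to the caller so that the sketch runs without python-docx installed.

```python
import shutil
from pathlib import Path


def parse_version(line):
    """Return the version from a line such as 'Version Number: 12.0'."""
    label, sep, value = line.partition(":")
    if not sep or label.strip() != "Version Number" or not value.strip():
        raise ValueError(f"not a version line: {line!r}")
    return value.strip()


def file_into_version_folder(file_path, version_line):
    """Move file_path to <parent>/<stem>/<version>/<name>; return the new path."""
    src = Path(file_path)
    target_dir = src.parent / src.stem / parse_version(version_line)
    # Creates both the per-document folder and the version subfolder.
    target_dir.mkdir(parents=True, exist_ok=True)
    dest = target_dir / src.name
    shutil.move(str(src), str(dest))
    return dest
```

Raising ValueError instead of using assert keeps the check active even when Python runs with optimizations enabled, and lets the caller skip a document whose version line is missing.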

Related

Using Python to copy contents of multiple files and paste in a main file

I'll start by mentioning that I have no knowledge of Python, but I read online that it could help me with my situation.
I'd like to do a few things using (I believe?) a Python script.
I have a bunch of .yml files whose contents I want to transfer into one main .yml file (let's call it Main.yml). I'd also like to take the name of each individual .yml file and add it before its content in Main.yml as "##Name". If possible, the script would look at each file in a directory, instead of my having to list every .yml file I want it to look for (the directory in question only contains .yml files). Not sure if I need to specify this, but just in case: I want to append the contents of all files to Main.yml and keep the indentation (spacing). P.S. I'm on Windows.
Example of what I want:
File: Apes.yml
Contents:
Documentation:
  "Apes":
    year: 2009
    img: 'link'
After running the script, my Main.yml would look like:
##Apes.yml
Documentation:
  "Apes":
    year: 2009
    img: 'link'

I'm just starting out in Python too so this was a great opportunity to see if my newly learned skills work!
I think you want to use the os.walk function to go through all of the files and folders in the directory.
This code should work - it assumes your files are stored in a folder called "Folder" which is a subfolder of where your Python script is stored
# This ensures that you have the correct library available
import os

# Open a new file to write to
output_file = open('output.txt', 'w+')
# This starts the 'walk' through the directory
for folder, sub_folders, files in os.walk("Folder"):
    # For each file...
    for f in files:
        # create the current path using the folder variable plus the file variable
        current_path = os.path.join(folder, f)
        # write the filename & path to the output file, on its own line
        output_file.write(current_path + "\n")
        # Open the file to read the contents
        current_file = open(current_path, 'r')
        # read each line one at a time and then write them to your file
        for line in current_file:
            output_file.write(line)
        # close the file
        current_file.close()
# close your output file
output_file.close()
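For the exact layout the question asks for (a "##Name" line before each file's contents, all appended to Main.yml), a variant along these lines might fit better. It is only a sketch: the function name merge_yml is hypothetical, and it assumes all the .yml files sit directly in one directory and are UTF-8 encoded.

```python
from pathlib import Path


def merge_yml(directory, main_name="Main.yml"):
    """Append every .yml file in directory to main_name under a '##<name>' line."""
    directory = Path(directory)
    main_path = directory / main_name
    # Collect the sources first so that Main.yml itself is never merged in.
    sources = sorted(p for p in directory.glob("*.yml") if p.name != main_name)
    with main_path.open("a", encoding="utf-8") as out:
        for src in sources:
            text = src.read_text(encoding="utf-8")
            out.write(f"##{src.name}\n")
            # The text is copied unchanged, so indentation is preserved.
            out.write(text)
            if not text.endswith("\n"):
                out.write("\n")
    return main_path
```

Calling merge_yml(r"C:\path\to\folder") with the folder that holds the .yml files would then append to Main.yml in that same folder; the path here is a placeholder.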

Does the following program access a file in a subfolder of a folder?

Using
import sys
folder = sys.argv[1]
for i in folder:
    for file in i:
        if file == "test.txt":
            print(file)
would this access a file in a subfolder of the folder? For example: one main folder with 20 subfolders, each containing 35 files. I want to pass the folder on the command line and access the second file in the first subfolder.

Neither. This doesn't look at files or folders.
sys.argv[1] is just a string. i takes each character of that string in turn, and for file in i only yields that same single character again, so file can never equal "test.txt".
Maybe you want to glob or walk a directory instead?

Here's a short example using the os.walk method.
import os
import sys

input_path = sys.argv[1]
filters = ["test.txt"]
print(f"Searching input path '{input_path}' for matches in {filters}...")
for root, dirs, files in os.walk(input_path):
    for file in files:
        if file in filters:
            print("Found a match!")
            match_path = os.path.join(root, file)
            print(f"The path is: {match_path}")
If the above file was named file_finder.py, and you wanted to search the directory my_folder, you would call python file_finder.py my_folder from the command line. Note that if my_folder is not in the same directory as file_finder.py, then you have to provide the full path.

No, this won't work, because folder will be a string, so you'll be iterating through the characters of the string. You could use the os module (e.g., the os.listdir() function). I don't know exactly what you are passing to the script, but it would probably be easiest to pass an absolute path. Look at some of the other functions in the module that are used for path manipulation.
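To pick "the second file in the first subfolder" as the question describes, something like the following sketch could work. The function name nth_file_in_nth_subfolder is hypothetical, and "first" and "second" are assumed to mean positions in a name-sorted listing, because the order in which the operating system returns directory entries is not guaranteed.

```python
import os


def nth_file_in_nth_subfolder(root, subfolder_index=0, file_index=1):
    """Return the path of one file, picked by position in name-sorted listings."""
    # Indexes are zero-based: 0 is the first subfolder, 1 is the second file.
    subfolders = sorted(e.name for e in os.scandir(root) if e.is_dir())
    subfolder = os.path.join(root, subfolders[subfolder_index])
    files = sorted(e.name for e in os.scandir(subfolder) if e.is_file())
    return os.path.join(subfolder, files[file_index])
```

In a script, the folder from the command line would be passed in as nth_file_in_nth_subfolder(sys.argv[1]). An IndexError signals that there are fewer subfolders or files than requested.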

Python 3.5:Not able to remove non alpha -numeric characters from file_name

I have written a Python script to rename all the files in a folder by removing all the digits from each file name, but it doesn't work.
Note: the same code works fine in Python 2.7.
import os

def rename_files():
    #(1) get file names from a folder
    file_list = os.listdir(r"D:\prank")
    print(file_list)
    saved_path = os.getcwd()
    print("Current working Directory is " + saved_path)
    os.chdir(r"D:\prank")
    #(2) for each file, rename filename
    for file_name in file_list:
        os.rename(file_name, file_name.translate(None, "0123456789"))

rename_files()
Can anyone tell me how to make it work? Is it the translate function that is not working properly?

One thing to check is the os.rename() portion of your code.
os.rename() resolves a bare file name relative to the current working directory, so it is safer to give it the full path to the file than file_name alone.
To do that, prepend the folder path to each file name. In Python 3 the translate() call also has to change, because str.translate() takes a single translation table.
So it should look like this:
import os

def rename_files():
    # add the folder path
    folder_path = "D:\\prank\\"
    file_list = os.listdir(r"D:\prank")
    print(file_list)
    saved_path = os.getcwd()
    print("Current working Directory is " + saved_path)
    os.chdir(r"D:\prank")
    # Concat the folder_path with file_name to create the full path.
    for file_name in file_list:
        full_path = folder_path + file_name
        print(full_path)  # See the full path here.
        # Strip digits from the file name only, not from the folder path.
        new_name = file_name.translate(str.maketrans("", "", "0123456789"))
        os.rename(full_path, folder_path + new_name)

Look up the documentation for os; here's what I found on rename:
os.rename(src, dst, *, src_dir_fd=None, dst_dir_fd=None)
Rename the file or directory src to dst. If dst is a directory, OSError will be raised. On Unix, if dst exists and is a file, it will be replaced silently if the user has permission. The operation may fail on some Unix flavors if src and dst are on different filesystems. If successful, the renaming will be an atomic operation (this is a POSIX requirement). On Windows, if dst already exists, OSError will be raised even if it is a file.
This function can support specifying src_dir_fd and/or dst_dir_fd to supply paths relative to directory descriptors.
If you want cross-platform overwriting of the destination, use replace().
New in version 3.3: The src_dir_fd and dst_dir_fd arguments.
Here's a link to the documentation. Hope this helps, thanks.
https://docs.python.org/3/library/os.html

Others have pointed out other issues with your code, but as to your use of translate, in Python 3.x, you need to pass a dictionary mapping ordinals to new values (or None). This code would work:
import string
...
file_name.translate({ord(c): None for c in string.digits})
but this seems easier to understand:
import re
...
re.sub(r'\d', '', file_name)
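Putting the translate fix together with the rename loop might look like the sketch below. It uses str.maketrans("", "", string.digits), which builds the same kind of ordinal-to-None table as the dictionary above. The function names strip_digits and rename_without_digits are hypothetical, and skipping names that would collide with an existing file is an assumption about the desired behavior.

```python
import os
import string

# Translation table that deletes every ASCII digit.
DIGITS_TABLE = str.maketrans("", "", string.digits)


def strip_digits(name):
    """Return name with all ASCII digits removed."""
    return name.translate(DIGITS_TABLE)


def rename_without_digits(folder):
    """Rename entries in folder so their names contain no digits; return the renames."""
    renamed = []
    for name in os.listdir(folder):
        new_name = strip_digits(name)
        src = os.path.join(folder, name)
        dst = os.path.join(folder, new_name)
        # Skip unchanged names, empty results, and names that would collide.
        if new_name == name or not new_name or os.path.exists(dst):
            continue
        os.rename(src, dst)
        renamed.append((name, new_name))
    return renamed
```

Because both arguments to os.rename() are built with os.path.join(), the function does not depend on the current working directory and needs no os.chdir() call.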

Python: Delete a special file

You know those special files or folders whose names start with a dot, for example .example_folder or .example_file. Think of the .htaccess/.htpassword files. I know how to delete folders and files the regular way with Python, but how can I delete special files like these? Another problem is that these special files don't have extensions like .txt or .jpg. When I try to delete them the regular way, Python skips all the special files and folders. Does anyone have an idea?
import glob, os, shutil

def delete_temp_update_files(path_files):
    files = glob.glob(path_files)
    for f in files:
        os.remove(f)

def delete_temp_update_folders(folder_path, path_files):
    folder_paths = glob.glob(folder_path)
    '''
    First, all folders are deleted in this current folder (folder_paths).
    '''
    if not folder_paths:
        '''
        The list (folder_paths) is empty, which means there are no folders.
        In this case it's enough to delete all files - everything including
        hidden files.
        '''
        result_files = delete_temp_update_files(path_files)
        return result_files
    else:
        '''
        There are some folders. Before the files are deleted, the folders must be deleted.
        '''
        for folder_element in folder_paths:
            shutil.rmtree(folder_element, ignore_errors=True)
        '''
        Now all folders have been deleted. Next, all files should be deleted.
        '''
        result_files = delete_temp_update_files(path_files)
        return result_files

def on_delete_files_folders(update_temp_zip_file):
    # The variable named (all_files) shows all files - everything including hidden files
    all_files = os.path.join(update_temp_zip_file, '*')
    # The variable named (all_folders) shows all folders in current folder.
    all_folders = os.path.join(update_temp_zip_file, '*/')
    delete_temp_update_folders(all_folders, all_files)

on_delete_files_folders("PATH/TO/YOUR/FOLDER")

Your solution is rather complicated.
You can use my script below:
import os

def del_recursive(path):
    for file in os.listdir(path):
        file = os.path.join(path, file)
        if os.path.isdir(file):
            try:
                # works only if the directory is already empty
                os.rmdir(file)
            except OSError:
                # not empty: clear its contents first, then remove it
                del_recursive(file)
                os.rmdir(file)
        else:
            os.remove(file)
If you want to delete everything in a directory, you can also use shutil.rmtree(). Both variants work on Python 2.7.3.
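The reason the code in the question skips these files is that a glob pattern such as '*' does not match names that begin with a dot, whereas os.listdir() and os.scandir() return them like any other entry. A minimal sketch that empties a directory, hidden entries included, while keeping the directory itself (the function name clear_directory is hypothetical):

```python
import os
import shutil


def clear_directory(path):
    """Delete everything inside path, including dot-prefixed entries; keep path."""
    # Take a snapshot of the entries first, so nothing is deleted mid-iteration.
    with os.scandir(path) as it:
        entries = list(it)
    for entry in entries:
        if entry.is_dir(follow_symlinks=False):
            shutil.rmtree(entry.path)
        else:
            # Regular files and symbolic links are removed directly.
            os.remove(entry.path)
```

Symbolic links to directories are removed as links rather than followed, so the sketch does not delete anything outside path.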

create a list of files to be deleted

I am working on a search-and-destroy type program. What I need it to do is search all directories for files with a certain name and append them to a list, and after that delete all those files (the files themselves, not the objects in the list or the list).
import os
file_list = []
for root, dirs, files in os.walk('path-to-dir'):
    for f_name in files:
        if f_name.startswith("file-name"):
            file_list.append(f_name)
I could write the code up to the appending part, but I don't know what comes next.
Some help, please.

To remove a file from your computer, use os.remove(). It takes the full path to the file as its parameter, so instead of calling os.remove("infectedFile.dll") you would call os.remove("C:/program files/avira/infectedFile.dll").
So your file_list should contain full paths to the files, and then just call:
for file in file_list:
    os.remove(file)

Modify your file_list.append(f_name). The f_name is only a bare name. You need to add the path to the file name at the time of processing, because otherwise you do not know where in the directory hierarchy the file was found:
file_list.append(os.path.join(root, f_name))
The root variable contains the path of the directory currently being walked.
To check whether your code works, just print the content of the list:
print('\n'.join(file_list))
Or you can do it in a loop to get ready for the later part:
for fname in file_list:
    print(fname)
Then you just add os.remove(fname) to remove the file:
for fname in file_list:
    print('removing', fname)
    os.remove(fname)
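The whole search-then-delete flow can be split into two small functions, which makes it easy to inspect the list before anything is deleted. This is a sketch under the question's assumptions (matching on a file-name prefix); the names find_files and remove_files are hypothetical.

```python
import os


def find_files(top, prefix):
    """Return full paths of all files under top whose names start with prefix."""
    matches = []
    for root, dirs, files in os.walk(top):
        for f_name in files:
            if f_name.startswith(prefix):
                # Join with root so the path stays valid outside the walk.
                matches.append(os.path.join(root, f_name))
    return matches


def remove_files(paths):
    """Delete each path; return the ones that could not be removed."""
    failed = []
    for path in paths:
        try:
            os.remove(path)
        except OSError:
            failed.append(path)
    return failed
```

Printing the result of find_files() and checking it by eye before passing it to remove_files() is a cheap safeguard, since deleted files do not go to a recycle bin.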
