Creating a directory tree and feeding it as input to create the same tree again - python-3.x

I am trying to get only the directory tree from a root folder on a server, then feed the output to another program so that it creates the same structure on another server.
The hard way is to build a JSON-like structure of the tree, send it to the other server, parse it there, and create the folders.
Is there a more Pythonic way to do this?

Why JSON? You could just create a list of directories, e.g. with this script:
import os

def print_dir(path):
    with os.scandir(path) as it:
        for entry in it:
            if entry.is_dir():
                d = os.path.join(path, entry.name)
                yield d
                yield from print_dir(d)

for d in print_dir('/'):
    print(d)
This prints the whole directory tree from the root ('/'):
/lib
/lib/crda
/lib/crda/pubkeys
/lib/terminfo
/lib/terminfo/m
/lib/terminfo/c
/lib/terminfo/x
/lib/terminfo/E
...etc.
You then send this list to the other server, read it there line by line, and run mkdir with the -p option for each line (it creates parent directories as needed and gives no error if the directory already exists).
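If the receiving side also runs Python, the mkdir -p step could be sketched as below. This is only a minimal sketch, assuming the list arrives as one path per line, exactly as printed by the script above; the names recreate_tree and dest_root are illustrative and not part of the original answer.

```python
import os


def recreate_tree(lines, dest_root):
    """Create every directory listed in `lines` underneath `dest_root`.

    `lines` is any iterable of paths (for example an open text file).
    Both the function name and `dest_root` are hypothetical.
    """
    created = []
    for line in lines:
        # Drop the trailing newline and the leading slash so the path
        # can be re-rooted under dest_root.
        rel = line.strip().lstrip("/")
        if not rel:
            continue  # skip blank lines
        target = os.path.join(dest_root, rel)
        # Counterpart of `mkdir -p`: create parents, no error if it exists.
        os.makedirs(target, exist_ok=True)
        created.append(target)
    return created
```

With the list saved to a file, a call such as recreate_tree(open("dirs.txt"), "/some/destination") would rebuild the tree; both the file name and the destination here are placeholders.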

Related

Using regex and cp: cannot stat

I am trying to copy files over from an old file structure where data are stored in folders with improper names to a new (better) structure, but as there are 671 participants I need to copy, I want to use regex in order to streamline it (each participant has files saved in the same format). However, I keep getting a cp: cannot stat error message saying that no file/directory exists. I had assumed this meant that I had missed a / or put "" in the wrong location but I cannot see anything in the code that would suggest it.
My code is as follows (I have added a lot of comments so that other collaborators can understand it):
#!/bin/bash
# This code below copies the initial .nii file.
# These data are copied into my Trial Participant folders.
# Create a variable called parent_folder1 that describes initial mask directory e.g. folders for each participant which contains the files.
parent_folder1="/this/path/here/contains/Trial_Participants"
# The original folders are named according to ClinicalID_scandate_randomdigits e.g. folder 1234567890_20000101_987654.
# The destination folders are named according to TrialIDNumber e.g. LBC100001.
# The .nii files are saved under TrialIDNumber_1_ICV.nii.gz e.g. LBC1000001_1_ICV.nii.gz.
# These files need copied over from their directories into the Trial Participant folders, using the for loop function.
# The * symbol is used as a wildcard.
for i in $(ls -1d "${parent_folder1}"/*_20*); do
    lbc=$(ls ${i}/finalMasks/*ICV* | sed 's/^.*\///'); lbc=${lbc:0:9}
    cp "${parent_folder1}/${i}"/finalMasks/*_1_ICV.nii.gz /this/path/is/the/destination/path/${lbc}/
done
# This code uses regular expression to find the initial ICV file.
# ls asks for a list, -1 makes each new folder on a new line, d is for directory.
# *_20* refers to the name of the folders. The * covers the ClinicalID, _20* refers to the scan date and random digits.
# I have no idea what the | sed 's/^.*\///' does, but I think it strips the path.
# lbc=${lbc:0:9} is used to keep the ID numbers.
# cp copies the files that are named under TrialIDNumber(replaced by *)_1_ICV.nii.gz to the destination under the respective folder.
So after a bit of fooling around, I changed the code a lot (took out sed as it confuses me), and came up with this that worked. Thanks to those who commented!
# Create a variable called parent_folder1 that describes initial mask directory.
parent_folder1="/original/path/here"
# Iterate over directories in parent_folder1
for i in $(ls -1d "${parent_folder1}"/*_20*); do
    # Extract the base name of the file in the finalMasks directory
    lbc=$(basename $(ls "${i}"/finalMasks/*ICV*))
    # Extract the LBC number from the file name
    lbc=${lbc:0:9}
    # Copy the file to the specific folder
    cp "${i}"/finalMasks/${lbc}_1_ICV.nii.gz /destination/path/here/${lbc}/
done
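One difference between the two scripts is worth spelling out: ls -1d is given a glob that starts with the full parent path, so each $i already holds a full path. The first script put ${parent_folder1}/ in front of it again, which produces a path that does not exist and would explain the cp: cannot stat message; the working version uses "${i}" directly. For comparison, here is a rough Python sketch of the same loop. The function name copy_icv_files is made up for illustration, and the sketch assumes the folder layout described in the question's comments.

```python
import shutil
from pathlib import Path


def copy_icv_files(parent_folder, destination_root):
    """Copy each *_1_ICV.nii.gz file into the folder named after its ID.

    Hypothetical helper; the layout (scan folders matching *_20*, each with
    a finalMasks sub-folder) is taken from the question.
    """
    copied = []
    # Path.glob yields full paths, so the parent folder is not prefixed again.
    for scan_dir in sorted(Path(parent_folder).glob("*_20*")):
        for src in sorted((scan_dir / "finalMasks").glob("*_1_ICV.nii.gz")):
            lbc = src.name[:9]  # same slice as ${lbc:0:9} in the shell script
            dest_dir = Path(destination_root) / lbc
            if not dest_dir.is_dir():
                continue  # skip missing destination folders instead of creating them
            copied.append(shutil.copy2(src, dest_dir))
    return copied
```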

Extract text from first page of word document and use it as a folder name, then move the file inside that folder

I have hundreds of Word documents that need to be processed, but I need to organize them into subfolders by version first.
I basically get a drop of these Word documents in a single folder and need to automate the organization going forward before I go nuts.
I have a script that creates a folder with the same name as each file and moves the file into that folder; this part is done.
Now I need to go into each subfolder, get the document version from the first page of each Word document, create a subfolder named after the version number, and move the Word file into that subfolder.
The structure should be as follows (taking two folders as examples):
(Folder) Test
    (Subfolder) 12.0
        Test.docx
(Folder) Test1
    (Subfolder) 13.0
        Test1.docx
Luckily, I was able to figure out that "doc.paragraphs[6].text" always returns the version information in a single line, as follows:
>>> doc.paragraphs[6].text
'Version Number: 12.0'
I would appreciate it if someone could point me in the right direction.
This is the script I have so far:
#!/usr/bin/env python3
import glob, os, shutil, docx, sys

folder = sys.argv[1]
#print(folder)
for file_path in glob.glob(os.path.join(folder, '*.docx')):
    new_dir = file_path.rsplit('.', 1)[0]
    #print(new_dir)
    try:
        os.mkdir(os.path.join(folder, new_dir))
    except WindowsError:
        # Handle the case where the target dir already exist.
        pass
    shutil.move(file_path, os.path.join(new_dir, os.path.basename(file_path)))
Please see below for a complete solution to your requirement.
Note: for background on re.search, see https://www.geeksforgeeks.org/python-regex-re-search-vs-re-findall/
import docx, os, glob, re, shutil
from pathlib import Path

def create_dir(path): # function to check if a given path exists and create it if not
    # Check whether the specified path exists or not
    is_exist = os.path.exists(path)
    # Create a new directory if the path does not exist
    if not is_exist:
        os.makedirs(path)

folder = r"C:\Users\rams\Documents\word_docs" # my local folder
for file in glob.glob(os.path.join(folder, '*.docx')):
    # Test, Test1, Test2 in your structure
    main_folder = os.path.join(folder, Path(file).stem)
    file_name = os.path.basename(file)
    # Get the first line from the docx
    doc = docx.Document(file).paragraphs[0].text
    # group(1) = Version Number: (.*)
    version_no = re.search("(Version Number: (.*))", doc).group(1)
    # extract the number portion from version_no
    sub_folder = version_no.split(':')[1].strip()
    # path to actual sub_folder with version_no
    sub_folder = os.path.join(main_folder, sub_folder)
    # destination path
    dest_file_path = os.path.join(sub_folder, file_name)
    for i in [main_folder, sub_folder]:
        create_dir(i) # function call
    # move the file to the corresponding version folder (overwrite if exists)
    if os.path.exists(dest_file_path):
        os.remove(dest_file_path)
    shutil.move(file, sub_folder)
So you have a script that creates a folder named after the file and moves the file into that folder. This part is done. OK.
Now that you know how to get the document version from the first page of each Word document, you need to create a sub-folder named after this version number and move the Word file into it. This can be done with the same code as before, replacing:
new_dir = file_path.rsplit('.', 1)[0]
with
document_dir = os.path.dirname(file_path)
document_name = os.path.basename(file_path)
# check if the document is already in the right directory:
assert os.path.basename(document_dir) == document_name.rsplit('.', 1)[0]
# here comes: doc = some_function_getting_the_doc_object(file_path)
doc_version_tuple = doc.paragraphs[6].text.rsplit(': ', 1)
# check if doc_version_tuple has the right content:
assert doc_version_tuple[0] == 'Version Number'
doc_version = doc_version_tuple[1]
new_dir = os.path.join(document_dir, doc_version)
Notice that you can also do both steps in a single pass over the list of full document paths.
Notice further that running the script from your question a second time will destroy what you have already achieved, and you will need to write another script to reverse it, unless you add a check such as:
assert os.path.basename(document_dir) != document_name.rsplit('.', 1)[0]
which raises an error if the script has already been run and the documents already sit in folders named after them.
This is why it is a good idea to keep a backup copy of all the documents, so that you can re-create the directory if something goes wrong. More generally, it is a good idea to always keep a backup copy when you work on files, especially when using a self-written script.
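The second step described in this answer could be sketched roughly as follows. It is a minimal sketch that assumes the version line always has the form 'Version Number: 12.0', as shown in the question; parse_version and move_into_version_folder are hypothetical helper names, and the line is passed in as plain text so the sketch itself does not depend on python-docx.

```python
import os
import shutil


def parse_version(line):
    """Return '12.0' from a line such as 'Version Number: 12.0'."""
    label, sep, version = line.partition(": ")
    # Refuse anything that does not look like the expected line.
    if label != "Version Number" or not sep or not version.strip():
        raise ValueError(f"unexpected version line: {line!r}")
    return version.strip()


def move_into_version_folder(file_path, version_line):
    """Move file_path into a sub-folder named after its version."""
    version = parse_version(version_line)
    target_dir = os.path.join(os.path.dirname(file_path), version)
    os.makedirs(target_dir, exist_ok=True)
    target = os.path.join(target_dir, os.path.basename(file_path))
    return shutil.move(file_path, target)
```

In the setting of the question, the second argument would be doc.paragraphs[6].text for a doc opened with docx.Document(file_path).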

Zip compress without root folders

My problem is that I have to generate a zip file using the linux zip console command. My command is as follows:
zip -r /folder1/folder2/EXP_45.zip /folder1/folder2/EXP_45/
That produces a valid zip, except that it includes the root folders, which I don't want:
Returns
EXP_45.zip
-folder1
--folder2
---EXP_45
...
I want
EXP_45.zip
-EXP_45
...
EXP_45 is a folder that can contain files and folders and they must be present in the zip. I just want the tree structure to start with the EXP_45 folder.
Is there any solution?
The reason why I need it to be a single command is that it is an action of a job in a PL SQL function like:
BEGIN
DBMS_SCHEDULER.CREATE_JOB (
JOB_NAME=>'compress_files', --- job name
JOB_ACTION=>'/usr/bin/zip', --- executable file with path
JOB_TYPE=>'executable', ----- job type
NUMBER_OF_ARGUMENTS=>4, -- parameters in numbers
AUTO_DROP =>false,
CREDENTIAL_NAME=>'credentials' -- give credentials name which you have created before "credintial"
);
dbms_scheduler.set_job_argument_value('compress_files',1,'-r');
dbms_scheduler.set_job_argument_value('compress_files',2,'-m');
dbms_scheduler.set_job_argument_value('compress_files',3,'/folder1/folder2/EXP_45.zip');
dbms_scheduler.set_job_argument_value('compress_files',4,'/folder1/folder2/EXP_45/');
DBMS_SCHEDULER.RUN_JOB('compress_files');
END;
I haven't been able to find a solution to this problem using zip but I have found it using jar. The command would be:
jar cMf /folder1/folder2/EXP_45.zip -C /folder1/folder2/EXP_45 .
Also, the solution using a job in pl sql in case it works for someone would be:
BEGIN
DBMS_SCHEDULER.CREATE_JOB (
JOB_NAME=>'compress_files', --- job name
JOB_ACTION=>'/usr/bin/jar', --- executable file with path
JOB_TYPE=>'executable', ----- job type
NUMBER_OF_ARGUMENTS=>5, -- parameters in numbers
AUTO_DROP =>false,
CREDENTIAL_NAME=>'credentials' -- give credentials name which you have created before "credintial"
);
dbms_scheduler.set_job_argument_value('compress_files',1,'cMf');
dbms_scheduler.set_job_argument_value('compress_files',2,'/folder1/folder2/EXP_45.zip');
dbms_scheduler.set_job_argument_value('compress_files',3,'-C');
dbms_scheduler.set_job_argument_value('compress_files',4,'/folder1/folder2/EXP_45');
dbms_scheduler.set_job_argument_value('compress_files',5,'.');
DBMS_SCHEDULER.RUN_JOB('compress_files');
END;
You want to use the -j (or --junk-paths) option when you are creating the zip file. Below is from the zip man page.
-j
--junk-paths
Store just the name of a saved file (junk the path), and do not store directory names.
By default, zip will store the full path (relative to the current directory).
Update following Question Clarification
Why not put the equivalent of the code below in a shell script and have the SQL function invoke that? You just need to pass the directory to cd into and the name of the output zip.
cd /folder1/folder2
zip -r /tmp/EXP_45.zip EXP_45
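If a Python interpreter happens to be available on the host (an assumption; the question only mentions zip and jar), the same "change directory first, then archive the folder by its relative name" idea can be sketched with shutil.make_archive, whose root_dir and base_dir arguments play the roles of the cd target and the folder name. The function name zip_folder is made up for illustration.

```python
import os
import shutil


def zip_folder(folder, zip_path):
    """Zip `folder` so that archive entries start at the folder's own name.

    Hypothetical helper. `zip_path` is expected to end in ".zip";
    make_archive appends that extension itself, so it is stripped first.
    """
    folder = os.path.abspath(folder)
    base_name, _ext = os.path.splitext(os.path.abspath(zip_path))
    return shutil.make_archive(
        base_name,
        "zip",
        root_dir=os.path.dirname(folder),   # directory to work from, like `cd`
        base_dir=os.path.basename(folder),  # the only thing that gets archived
    )
```

Calling zip_folder("/folder1/folder2/EXP_45", "/folder1/folder2/EXP_45.zip") would then give an archive whose entries begin with EXP_45/ rather than folder1/folder2/EXP_45/.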

pathlib mkdir creates a folder by filename

I have the following preexisting folder in my machine
D:\scripts\myfolder
I want my script to create a folder named logs and create a file somelog.txt in it. So the path would look like
D:\scripts\myfolder\logs\somelog.txt
So I used
p = pathlib.Path(r"D:\scripts\myfolder\logs\somelog.txt")
p.mkdir(parents=True, exist_ok=True)
Now
print(p.parents[0]) ==> D:\scripts\myfolder\logs
print(p.parents[1]) ==> D:\scripts\myfolder
print(p.parents[2]) ==> D:\scripts
So, as per the Path.mkdir documentation,
p.mkdir(parents=True, exist_ok=True) should create the folders logs, myfolder, scripts and so on if they don't exist.
But it creates a folder named somelog.txt inside the logs folder, although that is none of the parents. Why is that so?
I understand that the workaround is to use pathlib.Path(r"D:\scripts\myfolder\logs")
The entire point of mkdir is to create the directory pointed to by its argument. Passing in parents=True creates the parent folders in addition.
Create a new directory at this given path. [...] If parents is true, any missing parents of this path are created as needed; [1]
If you want to ensure the containing directory exists, create the parent of your path:
p = pathlib.Path(r"D:\scripts\myfolder\logs\somelog.txt")
p.parent.mkdir(parents=True, exist_ok=True)
That's the way Path.mkdir works. It can't tell whether the final component should be a file or a directory. parents=True means that the parents are created as well, not that only the parents are created. If the final path component is always a file, you can avoid the problem like this:
p.parents[0].mkdir(parents=True)
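Putting the two answers together, a small helper along these lines would create the logs folder and then the file itself. This is only a sketch; ensure_log_file is a hypothetical name, and creating an empty file with touch is an assumption about what the script needs.

```python
from pathlib import Path


def ensure_log_file(path):
    """Create the containing directory of `path`, then the (empty) file."""
    p = Path(path)
    # mkdir is called on the parent, so the last component is never
    # turned into a directory.
    p.parent.mkdir(parents=True, exist_ok=True)
    # touch creates the file if it is missing and leaves it alone otherwise.
    p.touch(exist_ok=True)
    return p
```

For the path in the question this would be ensure_log_file(r"D:\scripts\myfolder\logs\somelog.txt").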

Message <class 'WindowsError'> when reading a directory recursively

I have started to study Python 3 and, for fun, I'm trying to write a small program that, starting from a root directory, reads the filesystem recursively and stores some information about the files in a CSV file, but I'm getting this exception:
<class 'WindowsError'>
which I don't understand well. It looks like an import problem, but I use this code:
def buildsinglefile(currentfile, directory):
    x = SingleFile(currentfile, directory)
    return x

def walkdir(root_path):
    # traverse root directory, and list directories as dirs and files as files
    readfiles = []
    print(root_path)
    for root, dirs, files in os.walk(root_path):
        path = root.split(os.sep)
        for file in files:
            print('file>'+file)
            row = buildsinglefile(file, path).tocsv()  # exception raised here
            readfiles.add(row)
        for directory in dirs:
            walkdir(directory)
All that I'm doing (in a Java developer state of mind :) ) in the main file (not inside a class) is:
1. parsing the filesystem with os.walk
2. calling the function buildsinglefile, which creates an object
3. putting this object in a list
Step 2 is incorrect, but I don't understand why.
What do you think?
Thank you
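For reference, the three steps listed in the question could be sketched as below. SingleFile is not shown in the question, so the class here is a hypothetical stand-in, and the sketch does not diagnose the WindowsError itself. It only reflects two things that are certain about Python: os.walk already descends into sub-directories on its own, and a list grows with append rather than add.

```python
import os


class SingleFile:
    """Hypothetical stand-in for the SingleFile class from the question."""

    def __init__(self, name, directory):
        self.name = name
        self.directory = directory

    def tocsv(self):
        # Naive CSV row; names containing commas would need the csv module.
        return f"{self.name},{self.directory}"


def walkdir(root_path):
    """Collect one CSV row per file found below root_path."""
    rows = []
    # os.walk visits every sub-directory itself, so walkdir is not
    # called again for the entries in dirs.
    for root, dirs, files in os.walk(root_path):
        for file in files:
            rows.append(SingleFile(file, root).tocsv())
    return rows
```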
