How to iterate over multiple files by name within given range? - python-3.x

So I'm trying to iterate over multiple XML files from a library that contains more than 100k files, and I need to select files by their last 3 digits.
The expected result is a list of files named from 'asset-PD471090' to 'asset-PD471110', or 'asset-GT888185' to 'asset-GT888209', and so on.
My Code -
'''
import glob

strtid = input('From ID: ')  # First file in range
seps = strtid[-3:]
endid = input('To ID: ')  # Last file in range
eeps = endid[-3:]
FileId = strtid[:5]  # always same File Id for whole range
for name in glob.iglob('asset-' + FileId + [seps-eeps] + '.xml', recursive=True):
    print(name)  # iterate over every file in given range and print file names.
'''
The error I'm getting is
TypeError: unsupported operand type(s) for -: 'str' and 'str'
How can I load a specific range of input files?

As the error tells you, you are trying to use - on strings:
strtid = input('From ID: ') # string
seps = strtid[-3:] # part of a string
endid = input('To ID: ') # string
eeps = endid[-3:] # part of a string
FileId = strtid[:5] # also part of a string
# [seps-eeps]: trying to subtract a string from a string:
for name in glob.iglob('asset-' + FileId + [seps-eeps] + '.xml', recursive=True):
You can convert a string to an integer using int("1234") - that won't help you much here though, because then you only have one (wrong) number for your iglob.
If you wanted to give them as a glob pattern you would need to enclose them in string delimiters - and glob does not work that way with number ranges:
"[123-678]" matches one digit out of 1,2,3,4,5,6,7,8 - not 123 up to 678.
However, you can test your files yourself:
import os

def get_files(directory, prefix, postfix, numbers):
    lp = len(prefix)  # your "asset-GT"
    li = len(postfix) + 4  # your id + ".xml"
    for root, dirs, files in os.walk(directory):
        for file in sorted(files):  # sorted to get files in order, might not need it
            if int(file[lp:len(file)-li]) in numbers:
                yield os.path.join(root, file)

d = "test"
prefix = "asset-GT"  # input("Basename: ")
postfix = "185"  # input("Id: ")

# create demo files to search in
os.makedirs(d)
for i in range(50, 100):
    with open(os.path.join(d, f"{prefix}{i:03}{postfix}.xml"), "w") as f:
        f.write("")

# search params
fromto = "75 92"  # input("From To (space separated numbers): ")
fr, to = map(int, fromto.strip().split())
to += 1  # range upper limit is exclusive, so need to add 1 to include it

all_searched = list(get_files("./test", prefix, postfix, range(fr, to)))
print(*all_searched, sep="\n")
Output:
./test/asset-GT075185.xml
./test/asset-GT076185.xml
./test/asset-GT077185.xml
./test/asset-GT078185.xml
./test/asset-GT079185.xml
./test/asset-GT080185.xml
./test/asset-GT081185.xml
./test/asset-GT082185.xml
./test/asset-GT083185.xml
./test/asset-GT084185.xml
./test/asset-GT085185.xml
./test/asset-GT086185.xml
./test/asset-GT087185.xml
./test/asset-GT088185.xml
./test/asset-GT089185.xml
./test/asset-GT090185.xml
./test/asset-GT091185.xml
./test/asset-GT092185.xml

Related

How do I find multiple strings in a text file?

I need all occurrences of the string in the text file to be found and capitalized. I have found out how to find the string, but getting multiple occurrences is my issue. If you can help me print where the given string appears throughout the file, that would be great, thanks.
import os
import subprocess

i = 1
string1 = 'biscuit eater'
# opens the text file
# if this is the path where my file resides, f will become an absolute path to it
f = os.path.expanduser("/users/acarroll55277/documents/Notes/new_myfile.txt")
# with this form of open, the file will automatically close when exiting the code block
txtfile = open(f, 'r')
# print(txtfile.read()) to print the text document in terminal
# this sets variables flag and index to 0
flag = 0
index = 0
# looks through the file line by line
for line in txtfile:
    index += 1
    # checking if the string is in the line or not
    if string1 in line:
        flag = 1
        break
# checking condition for string found or not
if flag == 0:
    print('string ' + string1 + ' not found')
else:
    print('string ' + string1 + ' found in line ' + str(index))
I believe your approach would work, but it is very verbose and not very Pythonic. Try this out:
import os

string1 = 'biscuit eater'
path = os.path.expanduser("/users/acarroll55277/documents/Notes/new_myfile.txt")
with open(path, 'r+') as fptr:
    lines = fptr.readlines()
    matches = [i for i, line in enumerate(lines, start=1) if string1 in line]
    fptr.seek(0)
    fptr.writelines(line.replace(string1, string1.title()) for line in lines)
    fptr.truncate()
if not matches:
    print(f"string {string1} not found")
for i in matches:
    print(f"string {string1} found in line {i}")
This will now print out a message for every occurrence of your string in the file. In addition, the file is handled safely and closed automatically at the end of the script thanks to the with statement.
You can use the str.replace method. So in the line where you find the string, write line.replace(string1, string1.upper(), 1). The last 1 is there to make the function replace only 1 occurrence of the string.
Either that, or you read the whole text file as a string and use the replace method on that entire string. That saves you the trouble of finding each occurrence manually. In that case, you can write:
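For instance, with a hypothetical example line, the count argument limits the replacement to the first match:

```python
string1 = 'biscuit eater'
line = 'that biscuit eater is a real biscuit eater'  # hypothetical example line
line = line.replace(string1, string1.upper(), 1)     # count=1: only the first occurrence
print(line)
```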
with open(f, 'r') as txtfile:
    content = txtfile.read()
content = content.replace(string1, string1.upper())
with open(f, 'w') as txtfile:
    txtfile.write(content)

How to search for a specific string and replace a number in this string using python?

I have a text file that is a Fortran code. I am trying to create copies (Fortran files) of this code, but in each copy I want to find and replace a few things. As an example:
This is the code, and I want to search for pMax and tShot and replace the numbers next to them.
I would be grateful if someone could give me a hint on how this can be done using Python 3.x. Thank you.
This is my try at it:
There is no error, but for some reason re.sub() doesn't replace the string with the desired string 'pMax' in the destination file, although printing the return value of re.sub shows that pMax is modified to the desired value.
import os
import re
import shutil

vd_load_name = []
for i in range(len(Pressure_values)):
    vd_load_name.append('{}_{}MPa'.format(n, int(Pressure_values[i])))

src_dir = os.getcwd()  # get the current working dir
# create a dir where we want to copy and rename
os.mkdir('vd_loads')
dest_dir = src_dir + "/vd_loads"
for i in range(len(vd_load_name)):
    src_file = os.path.join(src_dir, 'n_t05_pressure_MPa.txt')
    shutil.copy(src_file, dest_dir)  # copy the file to destination dir
    dst_file = os.path.join(dest_dir, 'n_t05_pressure_MPa.txt')
    new_dst_file_name = os.path.join(dest_dir, vd_load_name[i] + '.txt')
    os.rename(dst_file, new_dst_file_name)  # rename
    os.chdir(dest_dir)
    dest_file_path = dest_dir + '/' + vd_load_name[i] + '.txt'
    with open(dest_file_path, 'r+') as reader:
        # reading all the lines in the file one by one
        for line in reader:
            re.sub(r"pMax=\d+", "pMax=" + str(int(Pressure_values[i])), line)
            print(re.sub(r"pMax=\d+", "pMax=" + str(int(Pressure_values[i])), line))
The actual part of the Fortran code that I want to edit:
integer :: i !shot index in x
integer :: j !shot index in y
integer :: sigma !1D overlap
dimension curCoords(nblock,ndim), velocity(nblock,ndim),dirCos(nblock,ndim,ndim), value(nblock)
character*80 sname
pMax=3900d0 !pressure maximum [MPa] Needs to be updated!!
fact=1d0 !correction factor
JLTYP=0 !key that identifies the distributed load type, 0 for Surface-based load
t=stepTime !current time[s]
tShot=1.2d-7 !Time of a shot[s] Needs to be updated!!
sigma=0 !1D overlap in x&y [%]
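Note that re.sub returns a new string and never touches the file itself, which is why the printed value looks right while the destination file stays unchanged. A minimal sketch of reading the whole file, substituting, and writing the result back (the file name and pressure value here are placeholders standing in for the real loop variables):

```python
import re

path = "example.txt"  # stands in for dest_file_path
with open(path, "w") as f:  # create a tiny stand-in for the Fortran source
    f.write("pMax=3900d0 !pressure maximum [MPa]\ntShot=1.2d-7 !Time of a shot [s]\n")

new_pressure = 250  # stands in for int(Pressure_values[i])
with open(path) as f:
    text = f.read()

# re.sub returns a NEW string; the file is only changed once we write it back
text = re.sub(r"pMax=\d+", "pMax=" + str(new_pressure), text)

with open(path, "w") as f:
    f.write(text)
```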

rSplit file names according to some rules?

I have 1.5M files with 2 patterns:
"userId_page.jpg"
"date_userId_page.jpg"
I want to be able to split the file name into 2 or 3 parts according to the given patterns.
I know I can use:
file_name = '/2020_03_10_123456_001.jpg'
date_part, user_id, page_num = file_name.rsplit('_', 2)
and if my filename consists only of an ID and a page:
file_name = '/12232454345234_005.jpg'
user_id, page_num = file_name.rsplit('_', 1)
Should I count the number of "_" in each case and rsplit it with 1 or 2?
Is there any other better option?
You could use a regex to match the different parts and then assign them to each of the variables. By using an optional group, we can get one regex to match both filename patterns; when there is no date_part that variable will be an empty string:
import re
file_name = '/2020_03_10_123456_001.jpg'
date_part, user_id, page_num = re.findall(r'(?:(\w+)_)?(\d+)_(\d+)\..*$', file_name)[0]
print(f'date={date_part}, user={user_id}, page={page_num}')
file_name = '/12232454345234_005.jpg'
date_part, user_id, page_num = re.findall(r'(?:(\w+)_)?(\d+)_(\d+)\..*$', file_name)[0]
print(f'date={date_part}, user={user_id}, page={page_num}')
Output:
date=2020_03_10, user=123456, page=001
date=, user=12232454345234, page=005
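Alternatively, rsplit alone can handle both patterns without counting underscores by hand, assuming the user id and page number themselves never contain '_' (the date part may, since rsplit works from the right):

```python
def split_name(file_name):
    """Split 'date_userId_page' or 'userId_page' names; date may contain '_'."""
    stem = file_name.rsplit('/', 1)[-1].rsplit('.', 1)[0]  # drop directory and extension
    parts = stem.rsplit('_', 2)  # at most 3 pieces, splitting from the right
    if len(parts) == 3:
        return tuple(parts)          # (date_part, user_id, page_num)
    return ('', parts[0], parts[1])  # no date part
```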

How to prepare text in all TXT files in folder for python via terminal?

I have a folder with a lot of TXT files (books) which have many special symbols (multiple spaces, paragraphs, #, -, '.', etc.) at the beginning. This causes a great variety of problems when reading the files in python (pandas). Usually it turns into errors like:
ParserError: Error tokenizing data. C error: Expected 1 fields in line 29, saw 2
or
Found 0 texts.
Can I use some terminal script for text preconditioning? Your assistance will be much appreciated!
example for one file:
and code:
import os
import sys

texts = []  # list of text samples
labels_index = {}  # dictionary mapping label name to numeric id
labels = []  # list of label ids
for name in sorted(os.listdir(TEXT_DATA_DIR)):
    path = os.path.join(TEXT_DATA_DIR, name)
    if os.path.isdir(path):
        label_id = len(labels_index)
        labels_index[name] = label_id
        for fname in sorted(os.listdir(path)):
            if fname.isdigit():
                fpath = os.path.join(path, fname)
                args = {} if sys.version_info < (3,) else {'encoding': 'utf-8'}
                with open(fpath, **args) as f:
                    t = f.read()
                i = t.find('\n\n')  # skip header
                if 0 < i:
                    t = t[i:]
                texts.append(t)
                labels.append(label_id)
print('Found %s texts.' % len(texts))
You can try the unicodedata module:
import unicodedata
text = unicodedata.normalize('NFKD', text)
It replaces unicode characters with their normalized (compatibility) representations.
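Building on that, a small helper (a sketch; adjust the cleanup rules to your files) that normalizes unicode and collapses stray whitespace and blank lines before the files ever reach pandas:

```python
import unicodedata

def precondition(text):
    """Normalize unicode and collapse messy whitespace, line by line."""
    text = unicodedata.normalize('NFKD', text)
    # collapse runs of spaces/tabs inside each line, drop empty lines
    lines = (' '.join(line.split()) for line in text.splitlines())
    return '\n'.join(line for line in lines if line)
```

This could be run over a whole folder from a short loop (or from the terminal via python -c), writing the cleaned text to a parallel directory so the originals stay untouched.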

Split a string by '_'

I have a number of files in a directory with the following file format:
roll_#_oe_yyyy-mm-dd.csv
where # is an integer and yyyy-mm-dd is a date (for example roll_6_oe_2008-02-12).
I am trying to use the split function so I can return the number on its own. So for example:
roll_6_oe_2008-02-12 would yield 6
and
roll_14_oe_2008-02-12 would yield 14
I have tried :
filename.split("_")
but cannot write the number to a variable. What can I try next?
Supposing that: filename = 'roll_14_oe_2008-02-12'
print(filename.split('_')) evaluates to ['roll', '14', 'oe', '2008-02-12']
The number you want to retrieve is in the 2nd position of the list:
my_number = filename.split('_')[1]
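And since split returns strings, converting the piece to an actual number is one more step:

```python
filename = 'roll_6_oe_2008-02-12'
my_number = int(filename.split('_')[1])  # the second piece, as an integer
print(my_number)
```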
You could also extract the number using a regex:
import re

filename = 'roll_134_oe_2008-02-12'
number_match = re.match(r"roll_(\d+)", filename)
if number_match:
    print(number_match.group(1))
Working example for both methods: http://www.codeskulptor.org/#user41_jEFOv5N5GN_2.py
