rSplit file names according to some rules? - python-3.x

I have 1.5M files with 2 patterns:
"userId_page.jpg"
"date_userId_page.jpg"
I want to be able to split the file name into 2 or 3 parts according to the given patterns.
I know I can use:
file_name = '/2020_03_10_123456_001.jpg'
date_part, user_id, page_num = file_name.rsplit('_', 2)
and if my filename consists only of an ID and a page:
file_name = '/12232454345234_005.jpg'
user_id, page_num = file_name.rsplit('_', 1)
Should I count the number of "_" in each case and rsplit it with 1 or 2?
Is there any other better option?

You could use a regex to match the different parts and then assign them to each of the variables. By using an optional group, we can get one regex to match both filename patterns; when there is no date_part that variable will be an empty string:
import re
file_name = '/2020_03_10_123456_001.jpg'
date_part, user_id, page_num = re.findall(r'(?:(\w+)_)?(\d+)_(\d+)\..*$', file_name)[0]
print(f'date={date_part}, user={user_id}, page={page_num}')
file_name = '/12232454345234_005.jpg'
date_part, user_id, page_num = re.findall(r'(?:(\w+)_)?(\d+)_(\d+)\..*$', file_name)[0]
print(f'date={date_part}, user={user_id}, page={page_num}')
Output:
date=2020_03_10, user=123456, page=001
date=, user=12232454345234, page=005
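If a regex feels heavier than needed, the same split can be sketched with rsplit alone, padding the missing date with an empty string. This is only a sketch: parse_name is a hypothetical helper, and it assumes that neither the user ID nor the page number ever contains an underscore.

```python
import os

def parse_name(path):
    """Split 'date_userId_page.jpg' or 'userId_page.jpg' into three parts.

    Hypothetical helper: assumes the user ID and page number never contain
    an underscore, so everything left of them is the (optional) date.
    """
    stem = os.path.splitext(os.path.basename(path))[0]  # drop directory and extension
    parts = stem.rsplit('_', 2)                          # at most three pieces
    if len(parts) == 3:
        return tuple(parts)        # (date, user_id, page)
    if len(parts) == 2:
        return ('', *parts)        # no date prefix, so date is ''
    raise ValueError(f'unexpected file name: {path!r}')

print(parse_name('/2020_03_10_123456_001.jpg'))
print(parse_name('/12232454345234_005.jpg'))
```

Because rsplit is capped at two splits, there is no need to count the underscores first; the length of the result tells the two patterns apart.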

Related

How to iterate over multiple files by name within given range?

So I'm trying to iterate over multiple XML files from a library which contains more than 100k files; I need to list files by their last 3 digits.
Expected result is a list of files named from 'asset-PD471090' to 'asset-PD471110' or 'asset-GT888185' to 'asset-GT888209', and so on.
My Code -
'''
import glob
strtid = input('From ID: ') # First file in range
seps = strtid[-3:]
endid = input('To ID: ') # Last file in range
eeps = endid[-3:]
FileId = strtid[:5] # always same File Id for whole range
for name in glob.iglob('asset-' + FileId + [seps-eeps] + '.xml', recursive=True):
    print(name) # iterate over every file in given range and print file names.
'''
The error I'm getting is
TypeError: unsupported operand type(s) for -: 'str' and 'str'
How can I load a specific range of input files?
As the error tells you, you are trying to use - on strings:
strtid = input('From ID: ') # string
seps = strtid[-3:] # part of a string
endid = input('To ID: ') # string
eeps = endid[-3:] # part of a string
FileId = strtid[:5] # also part of a string
# [seps-eeps]: trying to subtract a string from a string:
for name in glob.iglob('asset-' + FileId + [seps-eeps] + '.xml', recursive=True):
You can convert a string to an integer using int("1234"), but that won't help you much, because then you only have one (wrong) number for your iglob.
If you wanted to pass them as a glob pattern you would need to enclose them in string delimiters, and glob does not work that way with number ranges:
"[123-678]" would be one digit of 1,2,3,4,5,6,7,8 - not 123 up to 678
However, you can test your files yourself:
import os
def get_files(directory, prefix, postfix, numbers):
    lp = len(prefix)        # length of the prefix, e.g. "asset-GT"
    li = len(postfix) + 4   # length of the trailing id plus ".xml"
    for root, dirs, files in os.walk(directory):
        for file in sorted(files):  # sorted to get files in order, might not need it
            if int(file[lp:len(file)-li]) in numbers:
                yield os.path.join(root, file)
d = "test"
prefix = "asset-GT"  # input("Basename: ")
postfix = "185"      # input("Id: ")
# create demo files to search into
os.makedirs(d, exist_ok=True)  # exist_ok so that a second run does not fail
for i in range(50, 100):
    with open(os.path.join(d, f"{prefix}{i:03}{postfix}.xml"), "w") as f:
        f.write("")
# search params
fromto = "75 92"  # input("From To (space separated numbers): ")
fr, to = map(int, fromto.strip().split())
to += 1  # range upper limit is exclusive, so need to add 1 to include it
all_searched = list(get_files("./test", prefix, postfix, range(fr, to)))
print(*all_searched, sep="\n")
Output:
./test/asset-GT075185.xml
./test/asset-GT076185.xml
./test/asset-GT077185.xml
./test/asset-GT078185.xml
./test/asset-GT079185.xml
./test/asset-GT080185.xml
./test/asset-GT081185.xml
./test/asset-GT082185.xml
./test/asset-GT083185.xml
./test/asset-GT084185.xml
./test/asset-GT085185.xml
./test/asset-GT086185.xml
./test/asset-GT087185.xml
./test/asset-GT088185.xml
./test/asset-GT089185.xml
./test/asset-GT090185.xml
./test/asset-GT091185.xml
./test/asset-GT092185.xml
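The demo above filters on the digits in the middle of the name. If the range applies to the whole trailing number, as in 'asset-PD471090' to 'asset-PD471110' from the question, a similar filter can be sketched with a regex. names_in_range is a hypothetical helper; it assumes IDs consist of letters followed by digits and that file names look like 'asset-<ID>.xml'.

```python
import re

def names_in_range(names, first, last):
    """Yield names that share first's letter prefix and whose number is in first..last.

    Hypothetical helper: assumes IDs look like letters followed by digits,
    e.g. 'PD471090', and file names look like 'asset-<ID>.xml'.
    """
    prefix, start = re.fullmatch(r'(\D+)(\d+)', first).groups()
    end = re.fullmatch(r'(\D+)(\d+)', last).group(2)
    lo, hi = int(start), int(end)
    pattern = re.compile(rf'asset-{re.escape(prefix)}(\d+)\.xml')
    for name in names:
        m = pattern.fullmatch(name)
        if m and lo <= int(m.group(1)) <= hi:  # compare as integers, not strings
            yield name

demo = ['asset-PD471089.xml', 'asset-PD471090.xml', 'asset-PD471110.xml',
        'asset-PD471111.xml', 'asset-GT888185.xml']
print(list(names_in_range(demo, 'PD471090', 'PD471110')))
```

In real use the names argument could come from os.listdir or os.walk instead of the demo list.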

how to replace blank space with "_" on every element's list - Python

I imported a .csv file with this command:
mydata = pd.read_csv(file ,sep='\t' , engine='python' , dtype = {'Day' : np.datetime64 , 'Year' : np.int} )
But I noticed that some of the column names contain blank spaces, like Account id instead of Account_id.
Now I get the list of my column names with this:
dwb_col = mydata.columns
I'd like to replace blank spaces " " with "_" in every column name (i.e. every dwb_col element),
in order to rename the columns this way:
mydata.columns = [my_new_columns_list]
How can I do the find-and-replace part?
Is there any workaround or shortcut during the import phase that lets me collect the column names with "_" (underscore) instead of " " (space)?
This will do, using str.replace:
df.columns = df.columns.str.replace(" ", "_")
Another way is to use the regex \s+, which matches one or more whitespace characters, whilst ' ' matches only a single space. In recent pandas versions str.replace treats the pattern literally unless regex=True is passed:
dwb_col = df.columns.str.replace(r'\s+', '_', regex=True)
Then just re-assign:
df.columns = dwb_col
If you have leading or trailing whitespace that you want to remove first, add str.strip:
df.columns.str.strip().str.replace(r'\s+', '_', regex=True)
Regarding your second question, you can read the file with the nrows argument so that only the top n rows are loaded, which is enough to gather the column names:
col_df = pd.read_csv(data, nrows=1)
cols = [col for col in col_df.columns.tolist() if '_' in col]
Then read your data with usecols:
df = pd.read_csv(data, usecols=cols)
Try this, assuming your column names look like this:
l = ["hello world", "hello cat"]
cols = ['_'.join(i.split()) for i in l]
# output
['hello_world', 'hello_cat']
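The strip-then-replace idea can also be sketched in plain Python, without any pandas string methods. normalize_columns is a hypothetical helper and the column names are made up for illustration; the resulting list could then be assigned back with mydata.columns = normalize_columns(mydata.columns).

```python
import re

def normalize_columns(names):
    """Strip outer whitespace and collapse each inner whitespace run to '_'."""
    return [re.sub(r'\s+', '_', str(name).strip()) for name in names]

# Made-up column names, including leading/trailing and doubled spaces.
print(normalize_columns(['Account id', ' Day ', 'Total  amount', 'Year']))
```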

Python: Trouble indexing a list from .split()

I'm currently working on a folder rename program that will crawl a directory, and rename specific words to their abbreviated version. These abbreviations are kept in a dictionary. When I try to replace mylist[mylist.index(w)] with the abbreviation, it replaces the entire list. The list shows 2 values, but it is treating them like a single index. Any help would be appreciated, as I am very new to Python.
My current test environment has the following:
c:\test\Accounting 2018
My expected result when this is completed, is c:\test\Acct 2018
import os
keyword_dict = {
'accounting': 'Acct',
'documents': 'Docs',
'document': 'Doc',
'invoice': 'Invc',
'invoices': 'Invcs',
'operations': 'Ops',
'administration': 'Admin',
'estimate': 'Est',
'regulations': 'Regs',
'work order': 'WO'
}
path = 'c:\\test'
def format_path():
    for kw in os.walk(path, topdown=False):
        #split the output to separate the '\'
        usable_path = kw[0].split('\\')
        #pull out the last folder name
        string1 = str(usable_path[-1])
        #Split this output based on ' '
        mylist = [string1.lower().split(" ")]
        #Iterate through the folders to find any values in dictionary
        for i in mylist:
            for w in i:
                if w in keyword_dict.keys():
                    mylist[i.index(w)] = keyword_dict.get(w)
                    print(mylist)
format_path()
When I use print(mylist) prior to the index replacement, I get ['accounting', '2018'], and print(mylist[0]) returns the same result.
After the index replacement, print(mylist) returns ['acct']; the '2018' is gone as well.
Why is it treating the list values as a single index?
I didn't test the following, but it should point you in the right direction. First, though, I'm not sure that splitting on spaces is the way to go: Accounting 2018 could also show up as accounting2018 or accounting_2018, so a regular expression would be more robust. Anyway, here is a slightly modified version of your code:
import os
keyword_dict = {
'accounting': 'Acct',
'documents': 'Docs',
'document': 'Doc',
'invoice': 'Invc',
'invoices': 'Invcs',
'operations': 'Ops',
'administration': 'Admin',
'estimate': 'Est',
'regulations': 'Regs',
'work order': 'WO'
}
path = 'c:\\test'
def format_path():
    for kw in os.walk(path, topdown=False):
        #split the output to separate the '\'
        usable_path = kw[0].split('\\')
        #pull out the last folder name
        string1 = str(usable_path[-1])
        #Split this output based on ' '
        mylist = string1.lower().split(" ")  # no surrounding []: they would nest the list inside another list
        #Iterate through the words to find any values in dictionary
        for i in range(0, len(mylist)):
            abbreviation = keyword_dict.get(mylist[i], '')
            if abbreviation != '':  # abbreviation exists, so overwrite the word
                mylist[i] = abbreviation
        new_path = " ".join(mylist)  # create new folder name, i.e. ['Acct', '2018'] ==> 'Acct 2018'
        usable_path[len(usable_path)-1] = new_path  # replace the last item in the original path, then rejoin the path
        print("\\".join(usable_path))
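What you need is:
import re, os
# longest keys first, so that 'invoices' is tried before 'invoice'
regex = "|".join(re.escape(k) for k in sorted(keyword_dict, key=len, reverse=True))
repl = lambda x: keyword_dict.get(x.group().lower())
path = 'c:\\test'
# flags must be passed by keyword: the fourth positional argument of re.sub is count
[re.sub(regex, repl, i[0], flags=re.I) for i in os.walk(path)]
Make sure the above produces the names you expect before you actually rename anything. (So far it is working as expected.)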
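As a way to check the substitution before touching the file system, the regex idea can be sketched on plain strings. This is only an illustration: abbreviate is a hypothetical helper, the dictionary is a subset of the one in the question, and the word boundaries (\b) are an added assumption so that keywords only match whole words.

```python
import re

# A subset of the question's dictionary, for illustration.
keyword_dict = {'accounting': 'Acct', 'invoice': 'Invc', 'invoices': 'Invcs',
                'work order': 'WO'}

# Longest keys first, so 'invoices' is tried before 'invoice'.
keys = sorted(keyword_dict, key=len, reverse=True)
pattern = re.compile(r'\b(?:' + '|'.join(map(re.escape, keys)) + r')\b',
                     flags=re.IGNORECASE)

def abbreviate(folder_name):
    """Replace whole-word keywords in a folder name with their abbreviations."""
    return pattern.sub(lambda m: keyword_dict[m.group().lower()], folder_name)

print(abbreviate('Accounting 2018'))
print(abbreviate('Invoices and Work Order files'))
```

Unlike splitting on spaces, this keeps the case of the untouched words and also handles multi-word keys such as 'work order'.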

Split a string by '_'

I have a number of files in a directory with the following file format:
roll_#_oe_yyyy-mm-dd.csv
where # is an integer and yyyy-mm-dd is a date (for example roll_6_oe_2008-02-12).
I am trying to use the split function so I can return the number on its own. So for example:
roll_6_oe_2008-02-12 would yield 6
and
roll_14_oe_2008-02-12 would yield 14
I have tried :
filename.split("_")
but cannot write the number to a variable. What can I try next?
Supposing that: filename = 'roll_14_oe_2008-02-12'
print(filename.split('_')) evaluates to ['roll', '14', 'oe', '2008-02-12']
The number you want to retrieve is in the 2nd position of the list:
my_number = filename.split('_')[1]
You could also extract the number using regex:
import re
filename = 'roll_134_oe_2008-02-12'
number_match = re.match(r"roll_*(\d+)", filename)
if number_match:
    print(number_match.group(1))
Working example for both methods: http://www.codeskulptor.org/#user41_jEFOv5N5GN_2.py
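Both methods return the number as a string. If it is needed as an integer in a variable, a small wrapper along these lines may help; roll_number is a hypothetical helper and assumes names of the form roll_<number>_oe_<date>, as in the question.

```python
import re

def roll_number(filename):
    """Return the integer after 'roll_', or None if the name does not fit.

    Assumes names of the form roll_<number>_oe_<date>, as in the question.
    """
    match = re.match(r'roll_(\d+)_', filename)
    return int(match.group(1)) if match else None

print(roll_number('roll_6_oe_2008-02-12.csv'))
print(roll_number('roll_14_oe_2008-02-12.csv'))
print(roll_number('notes.txt'))
```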

Python changing file name

My application offers the ability to the user to export its results. My application exports text files with name Exp_Text_1, Exp_Text_2 etc. I want it so that if a file with the same file name pre-exists in Desktop then to start counting from this number upwards. For example if a file with name Exp_Text_3 is already in Desktop, then I want the file to be created to have the name Exp_Text_4.
This is my code:
if len(str(self.Output_Box.get("1.0", "end"))) == 1:
    self.User_Line_Text.set("Nothing to export!")
else:
    import os.path
    self.txt_file_num = self.txt_file_num + 1
    file_name = os.path.join(os.path.expanduser("~"), "Desktop", "Exp_Txt" + "_" + str(self.txt_file_num) + ".txt")
    file = open(file_name, "a")
    file.write(self.Output_Box.get("1.0", "end"))
    file.close()
    self.User_Line_Text.set("A text file has been exported to Desktop!")
You likely want os.path.exists:
>>> import os
>>> help(os.path.exists)
Help on function exists in module genericpath:
exists(path)
Test whether a path exists. Returns False for broken symbolic links
A very basic example is to build the file name with a formatting mark, so that the number can be inserted for each check:
import os
name_to_format = os.path.join(os.path.expanduser("~"), "Desktop", "Exp_Txt_{}.txt")
# the "{}" is a formatting mark so we can do name_to_format.format(num)
num = 1
while os.path.exists(name_to_format.format(num)):
    num += 1
new_file_name = name_to_format.format(num)
This checks each file name, starting with Exp_Txt_1.txt, then Exp_Txt_2.txt, and so on, until it finds one that does not exist.
However the format mark may cause a problem if curly brackets {} are part of the rest of the path, so it may be preferable to do something like this:
import os
def get_file_name(num):
    return os.path.join(os.path.expanduser("~"), "Desktop", "Exp_Txt_" + str(num) + ".txt")
num = 1
while os.path.exists(get_file_name(num)):
    num += 1
new_file_name = get_file_name(num)
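Checking with os.path.exists and opening the file afterwards are two separate steps, so another process could in principle create the same name in between. One possible variation, sketched here under that assumption, is to let open() do the check itself with mode "x", which raises FileExistsError when the file is already there. create_next_file is a hypothetical helper, and the demo writes into a temporary directory rather than the Desktop.

```python
import os
import tempfile

def create_next_file(directory, stem="Exp_Txt", ext=".txt"):
    """Create and return the first free '<stem>_<n><ext>' path in directory.

    Hypothetical helper: mode 'x' makes open() fail with FileExistsError
    if the file already exists, so checking and creating are one step.
    """
    num = 1
    while True:
        path = os.path.join(directory, f"{stem}_{num}{ext}")
        try:
            with open(path, "x"):
                return path          # file created (empty) and closed
        except FileExistsError:
            num += 1                 # name taken, try the next number

with tempfile.TemporaryDirectory() as demo_dir:
    first = create_next_file(demo_dir)
    second = create_next_file(demo_dir)
    print(os.path.basename(first), os.path.basename(second))
```

The returned path can then be opened again in "a" or "w" mode to write the exported text.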
EDIT: why don't we need the get_file_name function in the first example?
First off, if you are unfamiliar with str.format, you may want to look at the Python documentation on common string operations and/or this simple example:
text = "Hello {}, my name is {}."
x = text.format("Kotropoulos","Tadhg")
print(x)
print(text)
The path string is figured out with this line:
name_to_format = os.path.join(os.path.expanduser("~"), "Desktop", "Exp_Txt_{}.txt")
But it has {} in place of the desired number (since we don't know what the number should be at this point). So if the path was, for example:
name_to_format = "/Users/Tadhg/Desktop/Exp_Txt_{}.txt"
then we can insert a number with:
print(name_to_format.format(1))
print(name_to_format.format(2))
This does not change name_to_format: str objects are immutable, so .format returns a new string without modifying name_to_format. However, we would run into a problem if our path was something like one of these:
name_to_format = "/Users/Bob{Cat}/Desktop/Exp_Txt_{}.txt"
#or
name_to_format = "/Users/Bobcat{}/Desktop/Exp_Txt_{}.txt"
#or
name_to_format = "/Users/Smiley{:/Desktop/Exp_Txt_{}.txt"
Since the formatting mark we want to use is no longer the only pair of curly brackets, we can get a variety of errors:
KeyError: 'Cat'
IndexError: tuple index out of range
ValueError: unmatched '{' in format spec
So you only want to rely on str.format when you know it is safe to use. Hope this helps, have fun coding!
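If str.format is still preferred, one way to make it safe is to double any braces in the part of the path you do not control, since '{{' and '}}' are how str.format spells literal braces. The sketch below assumes this approach; make_template is a hypothetical helper and joins with '/' only to keep the example short.

```python
def make_template(directory, file_pattern="Exp_Txt_{}.txt"):
    """Build a str.format template in which only file_pattern's {} is live.

    Hypothetical helper: braces in the directory are doubled so that
    str.format treats them as literal characters.
    """
    safe_dir = directory.replace("{", "{{").replace("}", "}}")
    return safe_dir + "/" + file_pattern

template = make_template("/Users/Bob{Cat}/Desktop")
print(template)
print(template.format(4))
```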
