Count multiple files in a directory with the same name - python-3.x

I'm relatively new to Python and was working on a project where the user can navigate to a folder, after which the program does a count of all the files in that folder with a specific name.
The problem is that I have a folder with over 5000 files many of them sharing the same name but different extensions. I wrote code that somewhat does what I want the final version to do but its VERY redundant and I can't see myself doing this for over 600 file names.
Wanted to ask if it is possible to make this program "automated" or less redundant where I don't have to manually type out the names of 600 files to return data for.
Sample code I currently have:
import os, sys
print(sys.version)
file_counting1 = 0
file_counting2 = 0
filepath = input("Enter file path here: ")
if os.path.exists(filepath):
for file in os.listdir(filepath):
if file.startswith('expressmail'):
file_counting1 += 1
print('expressmail')
print('Total files found:', file_counting1)
for file in os.listdir(filepath):
if file.startswith('prioritymail'):
file_counting2 += 1
print('prioritymail')
print('Total files found:', file_counting2)
Sample Output:
expressmail
Total files found: 3
prioritymail
Total files found: 1

The following script will count occurrences of files with the same name. If the file does not have an extension, the whole filename is treated as the name. It also does not traverse subdirectories, since the original question just asks about files in the given folder.
import os
dir_name = "."
files = next(os.walk(dir_name))[2] # get all the files directly in the directory
names = [f[:f.rindex(".")] for f in files if "." in f] # drop the extensions
names += [f for f in files if "." not in f] # add those without extensions
for name in set(names): # for each unique name-
print("{}\nTotal files found: {}".format(name, names.count(name)))
If you want to support files in subdirectories, you could use something like
files = [os.path.join(r,file) for r,d,f in os.walk(dir_name) for file in f]
If you don't want to consider files without extensions, just remove the line:
names += [f for f in files if "." not in f]

There are a number of ways you can do what you're trying to do. Partly it depends on whether or not you need to recover the list of extension for a given duplicated file.
Counter, from the collections module - use this for a simple count of file. Ignore the extensions when building the count.
Use the filename without extension as a dictionary key, add a list of items as the key-value, where the list of items is each occurrence of the file.
Here's an example using the Counter class:
import os, sys, collections
c = collections.Counter()
for root, dirs,files in os.walk('/home/myname/hg/2018/'):
# discard any path data and just use filename
for names in files:
name, ext = os.path.splitext(names)
# discard any extension
c[name] += 1
# Counter.most_common() gives the values in the form of (entry, count)
# Counter.most_common(x) - pass a value to display only the top x counts
# e.g. Counter.most_common(2) = top 2
for x in c.most_common():
print(x[0] + ': ' + str(x[1]))

you can use regular expressions:
import os, sys, re
print(sys.version)
filepath = input("Enter file path here: ")
if os.path.exists(filepath):
allfiles = "\n".join(os.listdir(filepath))
file_counting1 = len(re.findall("^expressmail",allfiles,re.M))
print('expressmail')
print('Total files found:', file_counting1)
file_counting2 = len(re.findall("^prioritymail",allfiles,re.M))
print('prioritymail')
print('Total files found:', file_counting2)

Related

How to copy merge files of two different directories with different extensions into one directory and remove the duplicated ones

I would need a Python function which performs below action:
I have two directories which in one of them I have files with .xml format and in the other one I have files with .pdf format. To simplify things consider this example:
Directory 1: a.xml, b.xml, c.xml
Directory 2: a.pdf, c.pdf, d.pdf
Output:
Directory 3: a.xml, b.xml, c.xml, d.pdf
As you can see the priority is with the xml files in the case that both extensions have similar names.
I would be thankful for your help.
You need to use the shutil module and the os module to achieve this. This function will work on the following assumption:
A given directory has all files with the same extension
The priority_directory will be the directory with file extensions to be prioritized
The secondary_directory will be the directory with file extensions to be dropped in case of a name collision
Try:
import os,shutil
def copy_files(priority_directory,secondary_directory,destination = "new_directory"):
file_names = [os.path.splitext(filename)[0] for filename in os.listdir(priority_directory)] # get the file names to check for collisions
os.mkdir(destination) # make a new directory
for file in os.listdir(priority_directory): # this loop copies the first direcotory as it is
file_path = os.path.join(priority_directory,file)
dst_path = os.path.join(destination,file)
shutil.copy(file_path,dst_path)
for file in os.listdir(secondary_directory): # this loop checks for collisions and drops files whose name collide
if(os.path.splitext(file)[0] not in file_names):
file_path = os.path.join(secondary_directory,file)
dst_path = os.path.join(destination,file)
shutil.copy(file_path,dst_path)
print(os.listdir(destination))
Let's run it with your direcotry names as arguments:
copy_files('directory_1','directory_2','directory_3')
You can now check a new directory with the name directory_3 will be created with the desired files in it.
This will work for all such similar cases no matter what the extension is.
Note: There should not be a need to do this i guess cause a directory can have two files with the same name as long as the extensions differ.
Rough working solution:
import os
from shutil import copy2
d1 = './d1/'
d2 = './d2/'
d3 = './d3/'
ext_1 = '.xml'
ext_2 = '.pdf'
def get_files(d: str, files: list):
directory = os.fsencode(d)
for file in os.listdir(d):
dup = False
filename = os.fsdecode(file)
if filename[-4:] == ext_2:
for (x, y) in files:
if y == filename[:-4] + ext_1:
dup = True
break
if dup:
continue
files.append((d, filename))
files = []
get_files(d1, files)
get_files(d2, files)
for d, file in files:
copy2(d+file, d3)
I'll see if I can get it to look/perform better.

For Loop to Move and Rename .html Files - Python 3

I'm asking for help in trying to create a loop to make this script go through all files in a local directory. Currently I have this script working with a single HTML file, but would like it so it picks the first file in the directory and just loops until it gets to the last file in the directory.
Another way to help would be adding a line to the string would add a (1), (2), (3), etc. at the end if the names are duplicate.
Can anyone help with renaming thousands of files with a string that is parsed with BeautifulSoup4. Each file contains a name and reference number at the same position/line. Could be same name and reference number, or could be different reference number with same name.
import bs4, shutil, os
src_dir = os.getcwd()
print(src_dir)
dest_dir = os.mkdir('subfolder')
os.listdir()
dest_dir = src_dir+"/subfolder"
src_file = os.path.join(src_dir, 'example_filename_here.html')
shutil.copy(src_file, dest_dir)
exampleFile = open('example_filename_here.html')
exampleSoup = bs4.BeautifulSoup(exampleFile.read(), 'html.parser')
elems = exampleSoup.select('.bodycopy')
type(elems)
elems[2].getText()
dst_file = os.path.join(dest_dir, 'example_filename_here.html')
new_dst_file_name = os.path.join(dest_dir, elems[2].getText()+ '.html')
os.rename(dst_file, new_dst_file_name)
os.chdir(dest_dir)
print(elems[2].getText())

Compare by NAME only, and not by NAME + EXTENSION using existing code; Python 3.x

The python 3.x code (listed below) does a great job of comparing files from two different directories (Input_1 and Input_2) and finding the files that match (are the same between the two directories). Is there a way I can alter the existing code (below) to find files that are the same BY NAME ONLY between the two directories. (i.e. find matches by name only and not name + extension)?
comparison = filecmp.dircmp(Input_1, Input_2) #Specifying which directories to compare
common_files = ', '.join(comparison.common) #Finding the common files between the directories
TextFile.write("Common Files: " + common_files + '\n') # Writing the common files to a new text file
Example:
Directory 1 contains: Tacoma.xlsx, Prius.txt, Landcruiser.txt
Directory 2 contains: Tacoma.doc, Avalon.xlsx, Rav4.doc
"TACOMA" are two different files (different extensions). Could I use basename or splitext somehow to compare files by name only and have it return "TACOMA" as a matching file?
To get the file name, try:
from os import path
fil='..\file.doc'
fil_name = path.splitext(fil)[0].split('\\')[-1]
This stores file in file_name. So to compare files, run:
from os import listdir , path
from os.path import isfile, join
def compare(dir1,dir2):
files1 = [f for f in listdir(dir1) if isfile(join(dir1, f))]
files2 = [f for f in listdir(dir2) if isfile(join(dir2, f))]
common_files = []
for i in files1:
for j in files2:
if(path.splitext(i)[0] == path.splitext(j)[0]): #this compares it name by name.
common_files.append(i)
return common_files
Now just call it:
common_files = compare(dir1,dir2)
As you know python is case-sensitive, if you want common files, no matter if they contain uppers or lowers, then instead of:
if(path.splitext(i)[0] == path.splitext(j)[0]):
use:
if(path.splitext(i)[0].lower() == path.splitext(j)[0].lower()):
You're code worked very well! Thank you again, Infinity TM! The final use of the code is as follows for anyone else to look at. (Note: that Input_3 and Input_4 are the directories)
def Compare():
Input_3 = #Your directory here
Input_4 = #Your directory here
files1 = [f for f in listdir(Input_3) if isfile(join(Input_3, f))]
files2 = [f for f in listdir(Input_4) if isfile(join(Input_4, f))]
common_files = []
for i in files1:
for j in files2:
if(path.splitext(i)[0].lower() == path.splitext(j)[0].lower()):
common_files.append(path.splitext(i)[0])

Select specific files with their name in folders then manipulate them (Python)

I am trying to write a script to select specific files from their names in several folders and them copy those files in a new folder.
More precisely, I have a directory that contains 29 folders, in each folders there are hundreds of '*.fits' files.
I want to select among those fits files those which do not have the numbers '4' or '8' in the last 3 digits before .fits
For example: "ngc6397id016000520jd2456870p5705f002.fits" has '002' as last three digits before the extension .fits
I am kind of lost here as I am pretty new to this, can anyone help ?
Thank you!
import os
import shutil
data_path = ".<your directory with 29 folders>"
data_dir = os.listdir(data_path)
# for each folder in the directory
for fits_dir in data_dir:
fits_path = data_path + "/" + log_dir + "/"
# for each .fits file in the folder
for file in os.listdir(fits_path):
# if neither 4 or 8 are in the last 3 digits before the dot
if '4' not in file.split(".")[0][-3:] and '8' not in file.split(".")[0][-3:]:
shutil.copy(fits_path + "/" + file, destination)
SO is for code help, however, this should get you started.
The below code will print out the desired files by traversing all subdirectories and contained files. There are improvements you must make to this code for it to be reliable; such as error and case checking, however, it will serve to get you going.
import os
TARGET_DIR = r"C:\yourDir"
IGNORE_NUM = ['4', '8']
for path, dirs, files in os.walk(TARGET_DIR):
for index, file in enumerate(files):
fileSplit = os.path.splitext(file)
if(fileSplit[1] != ".fits"):
continue
lastThree = fileSplit[0][-3:]
if(set(IGNORE_NUM).intersection(set(lastThree))):
continue
print(f"[{index}] {file}")
From that, it is trivial to copy the file over to your desired directory using the shutil library.
shutil.copyfile(src, dst)
Combine all of that and you have your script.

Create folders dynamically and write csv files to that folders

I would like to read several input files from a folder, perform some transformations,create folders on the fly and write the csv to corresponding folders. The point here is I have the input path which is like
"Input files\P1_set1\Set1_Folder_1_File_1_Hour09.csv" - for a single patient (This file contains readings of patient (P1) at 9th hour)
Similarly, there are multiple files for each patient and each patient files are grouped under each folder as shown below
So, to read each file, I am using wildcard regex as shown below in code
I have already tried using the glob package and am able to read it successfully but am facing issue while creating the output folders and saving the files. I am parsing the file string as shown below
f = "Input files\P1_set1\Set1_Folder_1_File_1_Hour09.csv"
f[12:] = "P1_set1\Set1_Folder_1_File_1_Hour09.csv"
filenames = sorted(glob.glob('Input files\P*_set1\*.csv'))
for f in filenames:
print(f) #This will print the full path
print(f[12:]) # This print the folder structure along with filename
df_transform = pd.read_csv(f)
df_transform = df_transform.drop(['Format 10','Time','Hour'],axis=1)
df_transform.to_csv("Output\" + str(f[12:]),index=False)
I expect the output folder to have the csv files which are grouped by each patient under their respective folders. The screenshot below shows how the transformed files should be arranged in output folder (same structure as input folder). Please note that "Output" folder is already existing (it's easy to create one folder you know)
So to read files in a folder use os library then you can do
import os
folder_path = "path_to_your_folder"
dir = os.listdir(folder_path)
for x in dir:
df_transform = pd.read_csv(f)
df_transform = df_transform.drop(['Format 10','Time','Hour'],axis=1)
if os.path.isdir("/home/el"):
df_transform.to_csv("Output/" + str(f[12:]),index=False)
else:
os.makedirs(folder_path+"/")
df_transform.to_csv("Output/" + str(f[12:]),index=False)
Now instead of user f[12:] split the x in for loop like
file_name = x.split('/')[-1] #if you want filename.csv
Let me know if this is what you wanted

Resources