Compare by NAME only, and not by NAME + EXTENSION using existing code; Python 3.x - python-3.x

The python 3.x code (listed below) does a great job of comparing files from two different directories (Input_1 and Input_2) and finding the files that match (are the same between the two directories). Is there a way I can alter the existing code (below) to find files that are the same BY NAME ONLY between the two directories. (i.e. find matches by name only and not name + extension)?
comparison = filecmp.dircmp(Input_1, Input_2) #Specifying which directories to compare
common_files = ', '.join(comparison.common) #Finding the common files between the directories
TextFile.write("Common Files: " + common_files + '\n') # Writing the common files to a new text file
Example:
Directory 1 contains: Tacoma.xlsx, Prius.txt, Landcruiser.txt
Directory 2 contains: Tacoma.doc, Avalon.xlsx, Rav4.doc
"TACOMA" are two different files (different extensions). Could I use basename or splitext somehow to compare files by name only and have it return "TACOMA" as a matching file?

To get the file name, try:
from os import path
fil='..\file.doc'
fil_name = path.splitext(fil)[0].split('\\')[-1]
This stores file in file_name. So to compare files, run:
from os import listdir , path
from os.path import isfile, join
def compare(dir1,dir2):
files1 = [f for f in listdir(dir1) if isfile(join(dir1, f))]
files2 = [f for f in listdir(dir2) if isfile(join(dir2, f))]
common_files = []
for i in files1:
for j in files2:
if(path.splitext(i)[0] == path.splitext(j)[0]): #this compares it name by name.
common_files.append(i)
return common_files
Now just call it:
common_files = compare(dir1,dir2)
As you know python is case-sensitive, if you want common files, no matter if they contain uppers or lowers, then instead of:
if(path.splitext(i)[0] == path.splitext(j)[0]):
use:
if(path.splitext(i)[0].lower() == path.splitext(j)[0].lower()):
You're code worked very well! Thank you again, Infinity TM! The final use of the code is as follows for anyone else to look at. (Note: that Input_3 and Input_4 are the directories)
def Compare():
Input_3 = #Your directory here
Input_4 = #Your directory here
files1 = [f for f in listdir(Input_3) if isfile(join(Input_3, f))]
files2 = [f for f in listdir(Input_4) if isfile(join(Input_4, f))]
common_files = []
for i in files1:
for j in files2:
if(path.splitext(i)[0].lower() == path.splitext(j)[0].lower()):
common_files.append(path.splitext(i)[0])

Related

How to copy merge files of two different directories with different extensions into one directory and remove the duplicated ones

I would need a Python function which performs below action:
I have two directories which in one of them I have files with .xml format and in the other one I have files with .pdf format. To simplify things consider this example:
Directory 1: a.xml, b.xml, c.xml
Directory 2: a.pdf, c.pdf, d.pdf
Output:
Directory 3: a.xml, b.xml, c.xml, d.pdf
As you can see the priority is with the xml files in the case that both extensions have similar names.
I would be thankful for your help.
You need to use the shutil module and the os module to achieve this. This function will work on the following assumption:
A given directory has all files with the same extension
The priority_directory will be the directory with file extensions to be prioritized
The secondary_directory will be the directory with file extensions to be dropped in case of a name collision
Try:
import os,shutil
def copy_files(priority_directory,secondary_directory,destination = "new_directory"):
file_names = [os.path.splitext(filename)[0] for filename in os.listdir(priority_directory)] # get the file names to check for collisions
os.mkdir(destination) # make a new directory
for file in os.listdir(priority_directory): # this loop copies the first direcotory as it is
file_path = os.path.join(priority_directory,file)
dst_path = os.path.join(destination,file)
shutil.copy(file_path,dst_path)
for file in os.listdir(secondary_directory): # this loop checks for collisions and drops files whose name collide
if(os.path.splitext(file)[0] not in file_names):
file_path = os.path.join(secondary_directory,file)
dst_path = os.path.join(destination,file)
shutil.copy(file_path,dst_path)
print(os.listdir(destination))
Let's run it with your direcotry names as arguments:
copy_files('directory_1','directory_2','directory_3')
You can now check a new directory with the name directory_3 will be created with the desired files in it.
This will work for all such similar cases no matter what the extension is.
Note: There should not be a need to do this i guess cause a directory can have two files with the same name as long as the extensions differ.
Rough working solution:
import os
from shutil import copy2
d1 = './d1/'
d2 = './d2/'
d3 = './d3/'
ext_1 = '.xml'
ext_2 = '.pdf'
def get_files(d: str, files: list):
directory = os.fsencode(d)
for file in os.listdir(d):
dup = False
filename = os.fsdecode(file)
if filename[-4:] == ext_2:
for (x, y) in files:
if y == filename[:-4] + ext_1:
dup = True
break
if dup:
continue
files.append((d, filename))
files = []
get_files(d1, files)
get_files(d2, files)
for d, file in files:
copy2(d+file, d3)
I'll see if I can get it to look/perform better.

Having trouble using zipfile.ZipFile.extractall (Already read the docs)

I have a folder with many zipfiles, most of these zipfiles contain shapefiles and some of them have subfolders which contain zipfiles that contain shapefiles. I am trying to extract everything into one main folder wihtout keeping any folder structure. This is where I am now;
import os, zipfile
def getListOfFiles(dirName):
# create a list of file and sub directories
# names in the given directory
listOfFile = os.listdir(dirName)
allFiles = list()
# Iterate over all the entries
for entry in listOfFile:
# Create full path
fullPath = os.path.join(dirName, entry)
# If entry is a directory then get the list of files in this directory
if os.path.isdir(fullPath):
allFiles = allFiles + getListOfFiles(fullPath)
else:
allFiles.append(fullPath)
return allFiles
def main():
dirName = r'C:\Users\myusername\My_Dataset'
# Get the list of all files in directory tree at given path
listOfFiles = getListOfFiles(dirName)
# Print the files
for elem in listOfFiles:
print(elem)
zipfile.ZipFile.extractall(elem)
print("****************")
if __name__ == '__main__':
main()
This script prints all the shapefiles (including the ones under subfolders). Now I need to extract all these listed shapefiles into one main folder. I try zipfile.ZipFile.extractall(elem) but it doesn't work.
line 1611, in extractall
members = self.namelist()
AttributeError: 'str' object has no attribute 'namelist'
Is the error I'm getting. zipfile.ZipFile.extractall(elem) is the line that doesn't work. I imagine it expects one zipfile but I'm trying to feed it a folder (or a list in this case?)
How would I change this script so that it extracts my listed shapefiles into a folder (preferably a new folder)
You need to make an instance of ZipFile first and use extractall on this instance:
for elem in listOfFiles:
my_zipfile = zipfile.ZipFile(elem)
my_zipfile.extractall()
I have added this code block to my script and it works now.
def getfiles(path):
if os.path.isdir(path):
for root, dirs, files in os.walk(path):
for name in files:
yield os.path.join(root, name)
else:
yield path
fromdir = r"C:\Users\username\My_Dataset\new"
for f in getfiles(fromdir):
filename = str.split(f, '/')[-1]
if os.path.isfile(destination + filename):
filename = f.replace(fromdir, "", 1).replace("/", "_")
# os.rename(f, destination+filename)
shutil.copy2(f, r"C:\Users\username\Documents\flatten")

How to find files in multilevel subdirectories

Suppose I have a directory that contains multiple subdirectories:
one_meter = r"C:\Projects\NED_1m"
Within the directory one_meter I want to find all of the files that end with '.xml' and contain the string "_meta". My problem is that some of the subdirectories have that file one level donw, while others have it 2 levels down
EX:
one_meter > USGS_NED_one_meter_x19y329_LA_Jean_Lafitte_2013_IMG_2015 > USGS_NED_one_meter_x19y329_LA_Jean_Lafitte_2013_IMG_2015_meta.xml
one_meter > NY_Long_Island> USGS_NED_one_meter_x23y454_NY_LongIsland_Z18_2014_IMG_2015 > USGS_NED_one_meter_x23y454_NY_LongIsland_Z18_2014_IMG_2015_meta.xml
I want to look in my main directory (one_meter') and find all of the_meta.xmlfiles (regardless of the subdirectory) and append them to a list (one_m_lister = []`).
I tried the following but it doesn't produce any results. What am I doing incorrectly?
one_m_list = []
for filename in os.listdir(one_meter):
if filename.endswith(".xml") and "_meta" in filename:
print(filename)
one_m_list.append(filename)
The answer of #JonathanDavidArndt is good but quite outdated. Since Python 3.5, you can use pathlib.Path.glob to search a pattern in any subdirectory.
For instance:
import pathlib
destination_root = r"C:\Projects\NED_1m"
pattern = "**/*_meta*.xml"
master_list = list(pathlib.Path(destination_root).glob(pattern))
The function you are looking for is os.walk
A simple and minimal working example is below. You should be able to modify this to suit your needs:
destination_root = "C:\Projects\NED_1m"
extension_to_find = ".xml"
master_list = []
extension_to_find_len = len(extension_to_find)
for path,dir,files in os.walk(destination_root):
for filename in files:
# and of course, you can add extra filter criteria
# such as "contains _meta" right in here
if filename[-extension_to_find_len:] == extension_to_find:
print(os.path.join(path, filename))
master_list.append(os.path.join(path, filename))

Count multiple files in a directory with the same name

I'm relatively new to Python and was working on a project where the user can navigate to a folder, after which the program does a count of all the files in that folder with a specific name.
The problem is that I have a folder with over 5000 files many of them sharing the same name but different extensions. I wrote code that somewhat does what I want the final version to do but its VERY redundant and I can't see myself doing this for over 600 file names.
Wanted to ask if it is possible to make this program "automated" or less redundant where I don't have to manually type out the names of 600 files to return data for.
Sample code I currently have:
import os, sys
print(sys.version)
file_counting1 = 0
file_counting2 = 0
filepath = input("Enter file path here: ")
if os.path.exists(filepath):
for file in os.listdir(filepath):
if file.startswith('expressmail'):
file_counting1 += 1
print('expressmail')
print('Total files found:', file_counting1)
for file in os.listdir(filepath):
if file.startswith('prioritymail'):
file_counting2 += 1
print('prioritymail')
print('Total files found:', file_counting2)
Sample Output:
expressmail
Total files found: 3
prioritymail
Total files found: 1
The following script will count occurrences of files with the same name. If the file does not have an extension, the whole filename is treated as the name. It also does not traverse subdirectories, since the original question just asks about files in the given folder.
import os
dir_name = "."
files = next(os.walk(dir_name))[2] # get all the files directly in the directory
names = [f[:f.rindex(".")] for f in files if "." in f] # drop the extensions
names += [f for f in files if "." not in f] # add those without extensions
for name in set(names): # for each unique name-
print("{}\nTotal files found: {}".format(name, names.count(name)))
If you want to support files in subdirectories, you could use something like
files = [os.path.join(r,file) for r,d,f in os.walk(dir_name) for file in f]
If you don't want to consider files without extensions, just remove the line:
names += [f for f in files if "." not in f]
There are a number of ways you can do what you're trying to do. Partly it depends on whether or not you need to recover the list of extension for a given duplicated file.
Counter, from the collections module - use this for a simple count of file. Ignore the extensions when building the count.
Use the filename without extension as a dictionary key, add a list of items as the key-value, where the list of items is each occurrence of the file.
Here's an example using the Counter class:
import os, sys, collections
c = collections.Counter()
for root, dirs,files in os.walk('/home/myname/hg/2018/'):
# discard any path data and just use filename
for names in files:
name, ext = os.path.splitext(names)
# discard any extension
c[name] += 1
# Counter.most_common() gives the values in the form of (entry, count)
# Counter.most_common(x) - pass a value to display only the top x counts
# e.g. Counter.most_common(2) = top 2
for x in c.most_common():
print(x[0] + ': ' + str(x[1]))
you can use regular expressions:
import os, sys, re
print(sys.version)
filepath = input("Enter file path here: ")
if os.path.exists(filepath):
allfiles = "\n".join(os.listdir(filepath))
file_counting1 = len(re.findall("^expressmail",allfiles,re.M))
print('expressmail')
print('Total files found:', file_counting1)
file_counting2 = len(re.findall("^prioritymail",allfiles,re.M))
print('prioritymail')
print('Total files found:', file_counting2)

Moving files in python based on file and folder name

Relatively new to python ( not using it everyday ). However I am trying to simplify some things. I basically have Keys which have long names however a subset of the key ( or file name ) has the same sequence of the associated folder.{excuse the indentation, it is properly indented.} I.E
file1 would be: 101010-CDFGH-8271.dat and folder is CDFGH-82
file2 would be: 101010-QWERT-7425.dat and folder is QWERT-74
import os
import glob
import shutil
files = os.listdir("files/location")
dest_1 = os.listdir("dest/location")
for f in files:
file = f[10:21]
for d in dest_1:
dire = d
if file == dire:
shutil.move(file, dest_1)
The code runs with no errors, however nothing moves. Look forward to your reply and chance to learn.
Sorry updated the format.
Try a variation of:
basedir = "dest/location"
for fname in os.listdir("files/location"):
dirname = os.path.join(basedir, fname[10:21])
if os.path.isdir(dirname):
path = os.path.join("files/location", fname)
shutil.move(path, dirname)

Resources