Recursive data scraping in excel sheets within nested folder structure


Help me out, please. I would like to traverse a directory structure that looks like this:
Topdir > subdir 1 > excel 1/2/3
Topdir > subdir 2 > excel 4
etc.
I am scraping column B of each Excel workbook for a string, and that is working nicely. However, my script only goes through the top directory and does not descend into the subdirectories. Below is my code:
import openpyxl, os, sys, warnings, glob
warnings.simplefilter("ignore")
targetString = str("Sample Error")
scriptPath = os.path.abspath(__file__)
outputFile = open('logging.txt', "w+")
def scrapeSheets():
    for i in os.listdir(path='.'):
        if i.endswith("data-eval.xlsm"):
            print("Working on:", i)
            wb = openpyxl.load_workbook(i, data_only=True)
            sheet = wb["data-sheet"]
            outputFile.write("{}\n".format(i))
            for cellObj in sheet["B"]:
                if cellObj.value == targetString:
                    print(cellObj.row, cellObj.value)
                    outputFile.write("\t{}\t{}\n".format(cellObj.row, cellObj.value))
def mainLoop():
    for filename in glob.iglob('**/*.xlsm', recursive=True):
        scrapeSheets()
if __name__ == "__main__":
    mainLoop()
As I said, the scraping works, but the script never goes into the subfolders. I have a hunch it has to do with the line
for i in os.listdir(path='.'):
However, I don't know how to make that loop move on to the files in the subfolders.

You can try it like this:
for dirname in os.listdir(path='.'):
    for main_dir, dirs, files in os.walk(dirname):
        for f in files:
            if f.endswith("data-eval.xlsm"):
                # os.walk yields bare file names, so join them with the directory being walked
                full_path = os.path.join(main_dir, f)
                print("Working on:", full_path)
                wb = openpyxl.load_workbook(full_path, data_only=True)
                sheet = wb["data-sheet"]
                outputFile.write("{}\n".format(full_path))
                for cellObj in sheet["B"]:
                    if cellObj.value == targetString:
                        print(cellObj.row, cellObj.value)
                        outputFile.write("\t{}\t{}\n".format(cellObj.row, cellObj.value))
Explanation:
Use listdir to iterate over the entries of the top directory:
for dirname in os.listdir(path='.'):
Walk each entry with os.walk, which visits the sub-directories and the files inside them:
    for main_dir, dirs, files in os.walk(dirname):
Iterate over the files and continue with your logic. os.walk returns bare file names, so join each one with main_dir before opening it:
        for f in files:
            if f.endswith("data-eval.xlsm"):
                full_path = os.path.join(main_dir, f)
                print("Working on:", full_path)
                wb = openpyxl.load_workbook(full_path, data_only=True)
                sheet = wb["data-sheet"]
                outputFile.write("{}\n".format(full_path))
                for cellObj in sheet["B"]:
                    if cellObj.value == targetString:
                        print(cellObj.row, cellObj.value)
                        outputFile.write("\t{}\t{}\n".format(cellObj.row, cellObj.value))

For future reference: I figured out that using the for filename in glob.iglob(...) loop inside the scraping function, in place of the os.listdir line, works perfectly and loops through the contents of the script's folder and its subfolders.
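A minimal sketch of that resolution in one place, in case it helps later readers. It assumes the file suffix data-eval.xlsm, the sheet name data-sheet and the target string from the question. The function names find_workbooks, scrape_workbook and main are made up for illustration, and pathlib's rglob is swapped in for glob.iglob (both descend into subfolders).

```python
from pathlib import Path


def find_workbooks(top=".", suffix="data-eval.xlsm"):
    """Return every file under `top`, at any depth, whose name ends with `suffix`."""
    return sorted(
        p for p in Path(top).rglob("*") if p.is_file() and p.name.endswith(suffix)
    )


def scrape_workbook(path, target="Sample Error", sheet_name="data-sheet"):
    """Return (row, value) pairs from column B whose value equals `target`."""
    import openpyxl  # third-party; imported here so the file search works without it

    wb = openpyxl.load_workbook(path, data_only=True)
    sheet = wb[sheet_name]
    return [(cell.row, cell.value) for cell in sheet["B"] if cell.value == target]


def main(top=".", log_name="logging.txt"):
    """Write each workbook path, followed by its matching rows, to the log file."""
    with open(log_name, "w") as log:
        for path in find_workbooks(top):
            log.write("{}\n".format(path))
            for row, value in scrape_workbook(path):
                log.write("\t{}\t{}\n".format(row, value))
```

Calling main() from the script's folder should reproduce the log format of the original script. The key difference from the question's code is that each workbook is opened by its full path, not by its bare file name.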

Related

Compare by NAME only, and not by NAME + EXTENSION using existing code; Python 3.x

The Python 3.x code listed below does a great job of comparing the files in two directories (Input_1 and Input_2) and finding the ones that are the same in both. Is there a way to alter the existing code so that it finds files that match BY NAME ONLY between the two directories (i.e. matching on the name and ignoring the extension)?
comparison = filecmp.dircmp(Input_1, Input_2) #Specifying which directories to compare
common_files = ', '.join(comparison.common) #Finding the common files between the directories
TextFile.write("Common Files: " + common_files + '\n') # Writing the common files to a new text file
Example:
Directory 1 contains: Tacoma.xlsx, Prius.txt, Landcruiser.txt
Directory 2 contains: Tacoma.doc, Avalon.xlsx, Rav4.doc
The two "TACOMA" files are different files (different extensions). Could I use basename or splitext somehow to compare files by name only and have it return "TACOMA" as a matching file?

To get the file name without its extension, try:
from os import path
fil = r'..\file.doc'  # raw string, so that \f is not read as a form-feed escape
fil_name = path.splitext(fil)[0].split('\\')[-1]
This stores file in fil_name. So to compare files, run:
from os import listdir, path
from os.path import isfile, join
def compare(dir1, dir2):
    files1 = [f for f in listdir(dir1) if isfile(join(dir1, f))]
    files2 = [f for f in listdir(dir2) if isfile(join(dir2, f))]
    common_files = []
    for i in files1:
        for j in files2:
            if path.splitext(i)[0] == path.splitext(j)[0]:  # compares the names without extensions
                common_files.append(i)
    return common_files
Now just call it:
common_files = compare(dir1, dir2)
String comparison in Python is case-sensitive. If you want common files regardless of upper or lower case, then instead of:
if path.splitext(i)[0] == path.splitext(j)[0]:
use:
if path.splitext(i)[0].lower() == path.splitext(j)[0].lower():
Your code worked very well! Thank you again, Infinity TM! The final version of the code is as follows, for anyone else to look at. (Note that Input_3 and Input_4 are the directories.)
def Compare():
    Input_3 = "..."  # your directory here
    Input_4 = "..."  # your directory here
    files1 = [f for f in listdir(Input_3) if isfile(join(Input_3, f))]
    files2 = [f for f in listdir(Input_4) if isfile(join(Input_4, f))]
    common_files = []
    for i in files1:
        for j in files2:
            if path.splitext(i)[0].lower() == path.splitext(j)[0].lower():
                common_files.append(path.splitext(i)[0])
    return common_files
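The nested loop above appends a name once per matching pair, so a name can show up more than once when one directory holds several files with the same stem. A set-based variant avoids that. This is only a sketch: the function name common_stems and the ignore_case flag are made up for illustration.

```python
import os


def common_stems(dir1, dir2, ignore_case=True):
    """Return the sorted file-name stems (names without extension) found in both directories."""

    def stems(directory):
        result = set()
        for name in os.listdir(directory):
            if os.path.isfile(os.path.join(directory, name)):
                stem = os.path.splitext(name)[0]
                result.add(stem.lower() if ignore_case else stem)
        return result

    # Set intersection keeps each shared stem exactly once.
    return sorted(stems(dir1) & stems(dir2))
```

With the example directories from the question this would report only the Tacoma stem, lowercased when ignore_case is left on.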

Python3: Index out of range for script that worked before

The attached script returns:
IndexError: list index out of range
for the line starting with values = {line.split(...)
values=dict()
with open(csv) as f:
    lines =f.readlines()
    values = {line.split(',')[0].strip():line.split(',')[1].strip() for line in lines}
However, yesterday I could use it for exactly the same task:
replacing certain text in a directory of XML files with different text
import os
from distutils.dir_util import copy_tree
drc = 'D:/Spielwiese/00100_Arbeitsverzeichnis'
backup = 'D:/Spielwiese/Backup/'
csv = 'D:/persons1.csv'
copy_tree(drc, backup)
values=dict()
with open(csv) as f:
    lines =f.readlines()
    values = {line.split(',')[0].strip():line.split(',')[1].strip() for line in lines}
#Getting a list of the full paths of files
for dirpath, dirname, filename in os.walk(drc):
    for fname in filename:
        #Joining dirpath and filenames
        path = os.path.join(dirpath, fname)
        #Opening the files for reading only
        filedata = open(path,encoding="Latin-1").read()
        for k,v in values.items():
            filedata=filedata.replace(k,v)
        f = open(path, 'w',encoding="Latin-1")
        # We are writing the changes to the files
        f.write(filedata)
        f.close() #Closing the files
print("In case something went wrong, you can find a backup in " + backup)
I don't see anything weird, and as mentioned I could use it before ... :-o
Any ideas on how to fix it?
Best wishes,
K
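No answer is recorded for this question. One possible cause, which is a guess because the CSV file is not shown: line.split(',')[1] raises IndexError for any line that contains no comma, and a single blank line at the end of the file is enough for that. A minimal sketch that skips such lines follows. The name load_replacements is hypothetical, and the csv module is swapped in for the manual split, so quoted fields are handled differently than in the original.

```python
import csv


def load_replacements(lines):
    """Build {search: replacement} from CSV lines, skipping rows with fewer than two fields."""
    values = {}
    for row in csv.reader(lines):
        if len(row) < 2:  # blank line, or a line without a comma
            continue
        values[row[0].strip()] = row[1].strip()
    return values
```

It could be used as values = load_replacements(f) inside a with open(csv_path, newline="") as f: block. The path variable is called csv_path here because the original name csv would shadow the csv module.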

Having trouble using zipfile.ZipFile.extractall (Already read the docs)

I have a folder with many zipfiles. Most of these zipfiles contain shapefiles, and some of them have subfolders which contain zipfiles that contain shapefiles. I am trying to extract everything into one main folder without keeping any folder structure. This is where I am now:
import os, zipfile
def getListOfFiles(dirName):
    # create a list of file and sub directories
    # names in the given directory
    listOfFile = os.listdir(dirName)
    allFiles = list()
    # Iterate over all the entries
    for entry in listOfFile:
        # Create full path
        fullPath = os.path.join(dirName, entry)
        # If entry is a directory then get the list of files in this directory
        if os.path.isdir(fullPath):
            allFiles = allFiles + getListOfFiles(fullPath)
        else:
            allFiles.append(fullPath)
    return allFiles
def main():
    dirName = r'C:\Users\myusername\My_Dataset'
    # Get the list of all files in directory tree at given path
    listOfFiles = getListOfFiles(dirName)
    # Print the files
    for elem in listOfFiles:
        print(elem)
        zipfile.ZipFile.extractall(elem)
    print("****************")
if __name__ == '__main__':
    main()
This script prints all the shapefiles (including the ones under subfolders). Now I need to extract all these listed shapefiles into one main folder. I tried zipfile.ZipFile.extractall(elem), but it doesn't work. This is the error I'm getting:
line 1611, in extractall
members = self.namelist()
AttributeError: 'str' object has no attribute 'namelist'
zipfile.ZipFile.extractall(elem) is the line that fails. I imagine it expects one zipfile, but I'm feeding it a folder (or a list, in this case?).
How would I change this script so that it extracts my listed shapefiles into a folder (preferably a new folder)?

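You need to create an instance of ZipFile first and call extractall on that instance. Calling zipfile.ZipFile.extractall(elem) on the class passes the path string in as self, which is why the error mentions a 'str' object:
for elem in listOfFiles:
    my_zipfile = zipfile.ZipFile(elem)
    my_zipfile.extractall()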

I have added this code block to my script and it works now.
import os, shutil
def getfiles(path):
    if os.path.isdir(path):
        for root, dirs, files in os.walk(path):
            for name in files:
                yield os.path.join(root, name)
    else:
        yield path
fromdir = r"C:\Users\username\My_Dataset\new"
destination = r"C:\Users\username\Documents\flatten"
for f in getfiles(fromdir):
    filename = os.path.basename(f)
    if os.path.isfile(os.path.join(destination, filename)):
        # name already taken: build a unique name from the path relative to fromdir
        filename = os.path.relpath(f, fromdir).replace(os.sep, "_")
    shutil.copy2(f, os.path.join(destination, filename))
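An alternative that flattens at extraction time, so no second copying pass is needed. This is only a sketch under a few assumptions: the function name extract_flat is made up, files with the same name overwrite each other, and zipfiles nested inside a zipfile are written out as ordinary files without being opened.

```python
import os
import shutil
import zipfile


def extract_flat(zip_path, dest):
    """Extract every file in zip_path directly into dest, dropping the archive's folders."""
    os.makedirs(dest, exist_ok=True)
    extracted = []
    with zipfile.ZipFile(zip_path) as zf:
        for info in zf.infolist():
            if info.is_dir():
                continue  # directory entries carry no data
            name = os.path.basename(info.filename)
            if not name:
                continue
            target = os.path.join(dest, name)
            # Stream the member to disk under its bare file name.
            with zf.open(info) as src, open(target, "wb") as dst:
                shutil.copyfileobj(src, dst)
            extracted.append(target)
    return extracted
```

In the question's main() this could replace the failing line with a call such as extract_flat(elem, dest), guarded by zipfile.is_zipfile(elem) so that non-zip files in the listing are skipped. The dest folder is a path of your choosing.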

Python: traverse directory tree and check if last subdirectory has a file

I want to check if the last subdirectories in a directory tree have certain files.
For example, if there are following subdirectories that I want to look through:
C:\Test Dir\My Dir\ABC
C:\Test Dir\My Dir\Your Dir\XYZ
C:\Test Dir\My Dir\Your Dir\PQR
I want to check whether the ABC, XYZ, and PQR subdirectories each have at least one file matching the following pattern:
*Orange*.txt
If, say, ABC has a file ABC_Orange_true.txt, and XYZ and PQR don't have a file matching the above pattern, I want to get XYZ and PQR in a list, as follows:
list = ['C:\Test Dir\My Dir\Your Dir\XYZ', 'C:\Test Dir\My Dir\Your Dir\PQR']
So far I've written the following code, but stuck here:
import os
subdir_list = []
file_list = []
txt_list = []
list = []
for dirName, subdirList, fileList in os.walk('.'):
    subdir_list.append(subdirList)
    for fname in fileList:
        file_list.append(fname)
        if '.txt' in fname:
            if 'Orange' in fname:
                txt_list.append(fname)
subdir_list = [i for i in subdir_list if i][-1]
print(subdir_list)
print(txt_list)
This code gives me the file names, and list of subdirectories as follows:
['ABC', 'XYZ', 'PQR']
['ABC_Orange_true.txt']
I need help to reach my end result of
>>list
>>['C:\Test Dir\My Dir\Your Dir\XYZ', 'C:\Test Dir\My Dir\Your Dir\PQR']

The glob module is your friend here. Take a look at https://docs.python.org/3/library/glob.html#glob.glob
Using it together with the os module can solve your problem. Something like
import glob, os
def findSubDirs(files):
    subDirs = []
    for f in files:
        if os.path.isdir(f):
            subDirs.append(f)
    return subDirs
def findEmptyLeafDirs(path, filename):
    files, dirs = glob.glob(path + "/*"), []
    subDirs = findSubDirs(files)
    if len(subDirs) == 0:
        # leaf directory: record it if no file matches the pattern
        fileMatches = glob.glob(path + "/" + filename)
        if len(fileMatches) == 0:
            dirs.append(path)
    else:
        for subd in subDirs:
            dirs.extend(findEmptyLeafDirs(subd, filename))
    return dirs
print(findEmptyLeafDirs("path", "file"))
should do it.
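The same idea can also be sketched with os.walk and fnmatch, without explicit recursion. The function name leaf_dirs_missing is made up for illustration, and a "last subdirectory" is taken to mean a directory that has no subdirectories of its own.

```python
import fnmatch
import os


def leaf_dirs_missing(top, pattern):
    """Return the leaf directories under top that contain no file matching pattern."""
    missing = []
    for dirpath, dirnames, filenames in os.walk(top):
        if dirnames:
            continue  # not a leaf: it still has subdirectories
        if not fnmatch.filter(filenames, pattern):
            missing.append(dirpath)
    return missing
```

For the tree in the question, a call such as leaf_dirs_missing(r"C:\Test Dir\My Dir", "*Orange*.txt") should list the XYZ and PQR folders, though this has not been run against that exact tree.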

Moving files in python based on file and folder name

I am relatively new to Python (not using it every day), but I am trying to simplify some things. I basically have keys (files) with long names, and a subset of each key (file name) matches the name of the associated folder. (Excuse the indentation; it is properly indented in my script.) For example:
file1 would be: 101010-CDFGH-8271.dat and folder is CDFGH-82
file2 would be: 101010-QWERT-7425.dat and folder is QWERT-74
import os
import glob
import shutil
files = os.listdir("files/location")
dest_1 = os.listdir("dest/location")
for f in files:
    file = f[10:21]
    for d in dest_1:
        dire = d
        if file == dire:
            shutil.move(file, dest_1)
The code runs with no errors, however nothing moves. Look forward to your reply and chance to learn.
Sorry updated the format.

Try a variation of:
basedir = "dest/location"
for fname in os.listdir("files/location"):
    dirname = os.path.join(basedir, fname[10:21])
    if os.path.isdir(dirname):
        path = os.path.join("files/location", fname)
        shutil.move(path, dirname)
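Note that for the shortened sample names in the question, the slice [10:21] of 101010-CDFGH-8271.dat is GH-8271.dat, which never equals a folder name, so the slice has to be adjusted to the real file names. A sketch that derives the folder name from the dash-separated parts is given below. The naming rule (second part, a dash, then the first two characters of the third part) is an assumption read off the two examples, and the function names are made up.

```python
import os
import shutil


def folder_key(fname):
    """Map '101010-CDFGH-8271.dat' to 'CDFGH-82' (assumed naming rule)."""
    parts = fname.split("-")
    if len(parts) < 3:
        return None  # name does not follow the assumed pattern
    return "{}-{}".format(parts[1], parts[2][:2])


def move_to_matching_folders(src_dir, dest_root):
    """Move each file in src_dir into dest_root/<key> when that folder exists."""
    moved = []
    for fname in os.listdir(src_dir):
        src = os.path.join(src_dir, fname)
        key = folder_key(fname)
        if key is None or not os.path.isfile(src):
            continue
        target_dir = os.path.join(dest_root, key)
        if os.path.isdir(target_dir):
            shutil.move(src, target_dir)
            moved.append(fname)
    return sorted(moved)
```

Files whose folder does not exist are left where they are, which matches the isdir check in the answer above.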
