os.walk help - processing data in chunks

I have some files that are scattered across many different folders within a directory, and I was wondering whether there is a way to iterate through these folders in chunks.
Here's a picture of my directory tree.
I want to go through all the files in my 2010A folder, then the 2010B folder, then move on to 2011A, 2011B, etc.
My goal is to amend my current script, which only works for a single folder, so that it flows like this:
Start: ROOT FOLDER > 2010 > 2010A > output to csv > restart loop >
2010B > append to the csv after the last row > restart loop >
2011 > 2011A > append to the csv after the last row > and so on...
Is this possible?
Here's my code; it currently works if I run it on a single folder containing my txt files, e.g. the 2010A folder:
import re
import pandas as pd
import os
from collections import Counter
# get file list in the target directory
folder = r'root_folder\2010\2010A'
filelist = os.listdir(folder)
dict1 = {}
# open and read files, store into dictionary
for file in filelist:
    with open(os.path.join(folder, file)) as f:
        items = f.read()
    dict1[file] = items
# create filter for specific words
filter = ["cat", "dog", "elephant", "fowl"]
dict2 = {}
# count occurrence of words in each file
for k, v in dict1.items():
    matches = []
    for i in filter:
        matches.extend(re.findall(r"{}".format(i), v))
    dict2[k] = dict(Counter(matches))
dict3 = {}
# count total words in each file, store in separate dictionary
dict3 = {k: {'total': len(v)} for k, v in dict1.items()}
join_dict = {}
# join both dictionaries
join_dict = {k: {**dict2[k], **dict3[k]} for k in dict1}
# convert to pandas dataframe
df = pd.DataFrame.from_dict(join_dict, orient='index').fillna(0).astype(int)
# output to csv
df.to_csv(r'path\output.csv', index=True, header=True)
I have a feeling I need to replace:
for file in filelist:
with for (root,dirs,files) in os.walk(r'root_folder', topdown=True):
But I'm not exactly sure how, since I'm quite new to coding and python in general.

You can use glob to get a list of files like this (note that recursive=True only takes effect when the pattern contains **):
import glob
files = glob.glob('root_folder\\**\\*.txt', recursive=True)
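If it helps, here is a minimal sketch of the chunked flow described in the question, built on os.walk. It is untested against your actual tree; count_folder is a made-up helper name standing in for your counting logic above, and root / out_csv are placeholder paths:
import os
import re
import pandas as pd
from collections import Counter

root = r'root_folder'          # placeholder: your top-level folder
out_csv = r'path\output.csv'   # placeholder: the single csv to build up
keywords = ["cat", "dog", "elephant", "fowl"]

def count_folder(folder, filenames):
    # same idea as the per-folder script above: one row per txt file
    rows = {}
    for name in filenames:
        with open(os.path.join(folder, name)) as f:
            text = f.read()
        counts = Counter()
        for word in keywords:
            counts.update(re.findall(word, text))
        counts['total'] = len(text)
        rows[name] = dict(counts)
    return pd.DataFrame.from_dict(rows, orient='index').fillna(0).astype(int)

first = True
for folder, dirs, files in os.walk(root, topdown=True):
    dirs.sort()   # with topdown=True this makes 2010A come before 2010B, etc.
    txt_files = sorted(f for f in files if f.endswith('.txt'))
    if not txt_files:
        continue  # e.g. the 2010 folder itself, which only holds subfolders
    df = count_folder(folder, txt_files)
    # the first chunk writes the header, every later chunk appends after the last row
    df.to_csv(out_csv, mode='w' if first else 'a', header=first, index=True)
    first = False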

Related

File comparison in two directories

I am comparing all files in two directories. If the comparison is greater than 90%, I continue the outer loop, and I want to remove the matched file from the second directory so that the next file in the first directory doesn't get compared against a file that has already been matched.
Here's what I've tried:
for i for i in sorted_files:
    for j in sorted_github_files:
        #pdb.set_trace()
        with open(f'./files/{i}') as f1:
            try:
                text1 = f1.read()
            except:
                pass
        with open(f'./github_files/{j}') as f2:
            try:
                text2 = f2.read()
            except:
                pass
        m = SequenceMatcher(None, text1, text2)
        print("file1:", i, "file2:", j)
        if m.ratio() > 0.90:
            os.remove(f'./github_files/{j}')
            break
I know I cannot change what I'm iterating over once the iteration is in progress; that's why it's returning a file-not-found error. I don't want to use try/except blocks. Any ideas appreciated.
A couple of things to point out:
Always provide a minimal reproducible example
Your first for loop is not working since you wrote `for i for i ...`
If you want to iterate over the files in list1 (sorted_files) first, then read each file outside of the inner loop
I would add the files that match with a ratio over 0.90 to a new list and remove the files afterwards, so your items do not change during the iteration
You can find the test data I created and used here
import os
from difflib import SequenceMatcher

# define your two folders, full paths
first_path = os.path.abspath(r"C:\Users\XYZ\Desktop\testfolder\a")
second_path = os.path.abspath(r"C:\Users\XYZ\Desktop\testfolder\b")

# get files from folder
first_path_files = os.listdir(first_path)
second_path_files = os.listdir(second_path)

# join path and filenames
first_folder = [os.path.join(first_path, f) for f in first_path_files]
second_folder = [os.path.join(second_path, f) for f in second_path_files]

# empty list for matching results
matched_files = []

# iterate over the files in the first folder
for file_one in first_folder:
    # read file content
    with open(file_one, "r") as f:
        file_one_text = f.read()
    # iterate over the files in the second folder
    for file_two in second_folder:
        # read file content
        with open(file_two, "r") as f:
            file_two_text = f.read()
        # match the two file contents
        match = SequenceMatcher(None, file_one_text, file_two_text)
        if match.ratio() > 0.90:
            print(f"Match found ({match.ratio()}): '{file_one}' | '{file_two}'")
            # TODO: here you have to decide whether you want to remove files from the first or the second folder
            matched_files.append(file_two)  # I delete files from the second folder

# remove duplicates from the resulting list
matched_files = list(set(matched_files))

# remove the files
for f in matched_files:
    print(f"Removing file: {f}")
    os.remove(f)

write specific file name in specific list in python

I have a folder containing multiple .txt files; the file names are one.txt, two.txt, three.txt, ... I need to read one.txt and write its content into a list named onefile[], then read two.txt and write its content into a list named twofile[], and so on. How can I do this?
Update! I am trying this code; now how can I print the values in each list?
def writeinlist(file_path, i):
    multilist = {}
    output = open(file_path, 'r')
    globals()['List%s' % i] = output
    print('List%s' % i)

input_path = Path(Path.home(), "Desktop", "NN")
index = 1
for root, dirs, files in os.walk(input_path):
    for file in files:
        file_path = Path(root, file)
        writeinlist(file_path, index)
        index += 1
Update 2: how can I delete \n from the values?
value_list1 = files_dict['file1']
print('Values of file1 are:')
print(value_list1)
I used the following to create a dictionary with dynamic keys (the names of the files), where each value is a list whose elements are the lines of the corresponding file.
First, contents of onefile.txt:
First file first line
First file second line
First file third line
Contents of twofile.txt:
Second file first line
Second file second line
My code:
import os
import pprint

files_dict = {}
for file in os.listdir("/path/to/folder"):
    if file.endswith(".txt"):
        key = file.split(".")[0]
        full_filename = os.path.join("/path/to/folder", file)
        with open(full_filename, "r") as f:
            files_dict[key] = f.readlines()
pprint.pprint(files_dict)
Output:
{'onefile': ['First file first line\n',
             'First file second line\n',
             'First file third line'],
 'twofile': ['Second file first line\n', 'Second file second line']}
Another way to do this that's a bit more Pythonic:
import os
import pprint

files_dict = {}
for file in [
    f
    for f in os.listdir("/path/to/folder")
    if f.endswith(".txt")
]:
    with open(os.path.join("/path/to/folder", file), "r") as fo:
        files_dict[file.split(".")[0]] = fo.readlines()
pprint.pprint(files_dict)
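Re: Update 2: the \n characters appear because readlines() keeps the line ending on every line. A small variant of the snippet above (same placeholder folder) that strips them by using read().splitlines() instead:
import os
import pprint

files_dict = {}
for file in os.listdir("/path/to/folder"):
    if file.endswith(".txt"):
        full_filename = os.path.join("/path/to/folder", file)
        with open(full_filename, "r") as f:
            # splitlines() drops the trailing '\n' from each line
            files_dict[file.split(".")[0]] = f.read().splitlines()
pprint.pprint(files_dict)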

Converting list to dictionary, and tokenizing the key values - possible?

So basically I have a folder of files I'm opening and reading into python.
I want to search these files and count the keywords in each file, to make a dataframe like the attached image.
I have managed to open and read these files into a list, but my problem is as follows:
Edit 1:
I decided to try and import the files as a dictionary instead. It works, but when I try to lower-case the values, I get a 'list' object attribute error - even though in my variable explorer, it's defined as a dictionary.
import os

filenames = os.listdir('.')
file_dict = {}
for file in filenames:
    with open(file) as f:
        items = [i.strip() for i in f.read().split(",")]
    file_dict[file.replace(".txt", "")] = items

def lower_dict(d):
    new_dict = dict((k, v.lower()) for k, v in d.items())
    return new_dict

print(lower_dict(file_dict))
Output:
AttributeError: 'list' object has no attribute 'lower'
Pre-edit post:
1. Each list value doesn't retain the filename key. So I don't have the rows I need.
2. I can't conduct a search of keywords in the list anyway, because it is not tokenized. So I can't count the keywords per file.
Here's my code for opening the files, converting them to lowercase and storing them in a list.
How can I transform this into a dictionary that retains the filename and has tokenized values? Additionally, is it better to somehow import the files and their contents into a dictionary directly? Can I still tokenize and lower-case everything?
import os
import nltk

# create list of filenames to loop over
filenames = os.listdir('.')
# create empty lists for storage
Lcase_content = []
tokenized = []
num = 0
# read files from folder, convert to lower case
for filename in filenames:
    if filename.endswith(".txt"):
        with open(os.path.join('.', filename)) as file:
            content = file.read()
        # convert to lower-case value
        Lcase_content.append(content.lower())
        ## these two lines below don't work - index out of range error
        tokenized[num] = nltk.tokenize.word_tokenize(tokenized[num])
        num = num + 1
You can compute the count of each token by using collections.Counter, which can take a list of strings and return a dictionary-like Counter with each token as a key and the count of that token as its value. Since NLTK's word_tokenize takes a string and returns a list of tokens, to get a dictionary with tokens and their counts you can basically do this:
Counter(nltk.tokenize.word_tokenize(content))
Since you want your file names as index (first column), make it as a nested dictionary, with a file name as a key and another dictionary with tokens and counts as a value, which looks like this:
{'file1.txt': Counter({'cat': 4, 'dog': 0, 'squirrel': 12, 'sea horse': 3}),
'file2.txt': Counter({'cat': 11, 'dog': 4, 'squirrel': 17, 'sea horse': 0})}
If you are familiar with Pandas, you can convert your dictionary to a Pandas dataframe. It will make your life much easier when working with any tsv/csv/excel file, since you can export the Pandas dataframe result as a csv file. Make sure you apply .lower() to your file content and include orient='index' so that the file names become your index.
import os
import nltk
from collections import Counter
import pandas as pd

result = dict()
filenames = os.listdir('.')
for filename in filenames:
    if filename.endswith(".txt"):
        with open(os.path.join('.', filename)) as file:
            content = file.read().lower()
        result[filename] = Counter(nltk.tokenize.word_tokenize(content))

df = pd.DataFrame.from_dict(result, orient='index').fillna(0)
df['total words'] = df.sum(axis=1)
df.to_csv('words_count.csv', index=True)
Re: your first attempt, since your 'items' is a list (see [i.strip() for i in f.read().split(",")]), you can't apply .lower() to it.
Re: your second attempt, your 'tokenized' is empty as it was initialized as tokenized = []. That's why when you try to do tokenized[num] = nltk.tokenize.word_tokenize(tokenized[num]), tokenized[num] with num = 0 gives you the index out of range error.
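As a concrete illustration of the first point, here is a minimal sketch of a lower_dict that works on a dictionary whose values are lists of strings, applying .lower() per element (file_dict is the dictionary built in your edit; the sample dict below is just for demonstration):
def lower_dict(d):
    # apply .lower() to every string inside each list value
    return {k: [item.lower() for item in v] for k, v in d.items()}

print(lower_dict({"file1": ["Cat", "DOG"]}))  # {'file1': ['cat', 'dog']}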

How to combine multiple csv files based on file name

I have more than 1000 csv files; I want to combine the files whose names share the same first five digits into one csv file.
input:
100044566.csv
100040457.csv
100041458.csv
100034566.csv
100030457.csv
100031458.csv
100031459.csv
import pandas as pd
import os
import glob

path_1 = ''
all_files_final = glob.glob(os.path.join(path_1, "*.csv"))
names_1 = [os.path.basename(x1) for x1 in all_files_final]
final = pd.DataFrame()
for file_1, name_1 in zip(all_files_final, names_1):
    file_df_final = pd.read_csv(file_1, index_col=False)
    #file_df['file_name'] = name
    final = final.append(file_df_final)
final.to_csv('', index=False)
I used the above code but it merges all the files into one csv file; I don't know how to make the selection based on the name.
So, from the above input:
output 1: combine the first three csv files into one csv file, because the first five digits of their filenames are the same.
output 2: combine the next four files into one csv file, because the first five digits of their filenames are the same.
I would recommend approaching the problem slightly differently.
Here's my solution:
import os
import pandas as pd

files = os.listdir('.')  # returns list of filenames in the current folder
files_of_interest = {}   # a dictionary that we will be using below

for filename in files:  # iterate over files in the folder
    if filename[-4:] == '.csv':  # check whether the file is of .csv format
        key = filename[:5]  # as mentioned in the question - the first five characters of the filename are of interest
        files_of_interest.setdefault(key, [])  # if we don't have such a key yet, .setdefault creates it and assigns an empty list to it
        files_of_interest[key].append(filename)  # append the new filename to the list

for key in files_of_interest:
    buff_df = pd.DataFrame()
    for filename in files_of_interest[key]:
        buff_df = buff_df.append(pd.read_csv(filename))  # read every file for this key and append it to buff_df
    files_of_interest[key] = buff_df  # replace the list of files with a data frame
This code will create a dictionary of dataframes, where the keys are the unique first five characters of the .csv filenames.
Then you can iterate over the keys of the dictionary to save each corresponding dataframe as a .csv file.
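A minimal sketch of that saving step, continuing from the files_of_interest dictionary built above (the output filename pattern is just a suggestion):
for key, df in files_of_interest.items():
    # one combined csv per group of files sharing the same first five digits
    df.to_csv(f"{key}_combined.csv", index=False)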
Hope my answer helped.

Applying function to a list of file-paths and writing csv output to the respective paths

How do I apply a function to a list of file paths I have built, and write an output csv in the same path?
read file in a subfolder -> perform a function -> write file in the
subfolder -> go to next subfolder
import os
import pandas as pd
import xmltodict

# opened xml by filename
with open(r'XML_opsReport 100001.xml', encoding="utf8") as fd:
    Odict_parsedFromFilePath = xmltodict.parse(fd.read())

# func called in func below
def activity_to_df_one_day(list_activity_this_day):
    ib_list = [pd.DataFrame(list_activity_this_day[i], columns=list_activity_this_day[i].keys()).drop("#uom") for i in range(len(list_activity_this_day))]
    return pd.concat(ib_list)

# processes parsed xml and writes csv
def activity_to_df_all_days(Odict_parsedFromFilePath, subdir):  # writes csv from parsed xml after some processing
    nodes_reports = Odict_parsedFromFilePath['opsReports']['opsReport']
    list_activity = []
    for i in range(len(nodes_reports)):
        try:
            df = activity_to_df_one_day(nodes_reports[i]['activity'])
            list_activity.append(df)
        except KeyError:
            continue
    opsReport = pd.concat(list_activity)
    opsReport['dTimStart'] = pd.to_datetime(opsReport['dTimStart'], infer_datetime_format=True)
    opsReport.sort_values('dTimStart', axis=0, ascending=True, inplace=True, kind='quicksort', na_position='last')
    opsReport.to_csv("subdir\opsReport.csv")  # write to the subdir

def scanfolder():  # fetches list of file-paths with the desired starting name
    list_files = []
    for path, dirs, files in os.walk(r'C:\..\xml_objects'):  # directory containing several subfolders
        for f in files:
            if f.startswith('XML_opsReport'):
                list_files.append(os.path.join(path, f))
    return list_files

filepaths = scanfolder()  # list of file-paths
Every function works well and the xml processing is good, so I am not sharing the xml structure. There are 100+ paths in filepaths, each in a different subdirectory. I want to be able to apply the above flow in the future as well, where I can get filepaths and perform the desired actions. It's important to write the csv file to its subdirectory.
To get the directory that a file is in, you can use:
import os

for root, dirs, files in os.walk(some_dir):
    for f in files:
        print(root)
        output_file = os.path.join(root, "output_file.csv")
        print(output_file)
Is that what you're looking for?
Output:
somedir
somedir\output_file.csv
See also Python 3 - travel directory tree with limited recursion depth and Find current directory and file's directory.
Was able to solve with os.path.join.
exceptions_path_list = []
for i in filepaths:
    try:
        with open(i, encoding="utf8") as fd:
            doc = xmltodict.parse(fd.read())
        activity_to_df_all_days(doc, i)
    except ValueError:
        exceptions_path_list.append(os.path.dirname(i))
        continue

def activity_to_df_all_days(Odict_parsedFromFilePath, filepath):
    ...
    ...
    ...
    opsReport.to_csv(os.path.join(os.path.dirname(filepath), "opsReport.csv"))
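As a side note, the same output path can be built with pathlib instead of os.path; a small equivalent sketch (output_path is just an illustrative helper name):
from pathlib import Path

def output_path(filepath):
    # same directory as the input xml, fixed csv file name
    return Path(filepath).parent / "opsReport.csv"

# e.g. opsReport.to_csv(output_path(filepath))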
