How to skip entire file when iterating through folder? - python-3.x

I have a folder path with many files in it.
Some of the files don't have the data I want in them. How do I skip over those files and move on to the next set of files?
import glob
import json
from pandas import json_normalize  # or pandas.io.json.json_normalize on older pandas

path = '/path/'  # use your path
allFiles = glob.glob(path + "/*.json")
for file_ in allFiles:
    # print(file_)
    with open(file_) as f:
        data = json.load(f)
    df = json_normalize(data['col_to_be_flattened'])
    # REST OF THE OPERATIONS
Once the data is in the dataframe df, the REST OF THE OPERATIONS relies on a column called 'Rows.Row'. If this column does not exist in df, I want to skip the file. How do I do this?

Just check whether 'Rows.Row' is among the column names before continuing.
path = '/path/'  # use your path
allFiles = glob.glob(path + "/*.json")
for file_ in allFiles:
    # print(file_)
    with open(file_) as f:
        data = json.load(f)
    df = json_normalize(data['col_to_be_flattened'])
    if 'Rows.Row' in df.columns.tolist():
        # REST OF THE OPERATIONS
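If you prefer to skip the file early instead of nesting the rest of the work inside the if, a guard clause with continue does the same thing. This is only a minimal sketch; process_frame is a hypothetical placeholder for the REST OF THE OPERATIONS.

import glob
import json
from pandas import json_normalize

def process_frame(df):
    # hypothetical placeholder for the REST OF THE OPERATIONS
    pass

for file_ in glob.glob('/path/' + "/*.json"):
    with open(file_) as f:
        data = json.load(f)
    df = json_normalize(data['col_to_be_flattened'])
    if 'Rows.Row' not in df.columns:
        continue  # skip files that lack the required column
    process_frame(df)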

Related

Python. Append CSV files in a folder into one big file

I am a little confused with the Pandas library and would really appreciate your help.
The task is to combine all *.csv files in a folder into one big file.
CSV files don't have a header, so I just want to append all of them and add a header in the end.
Here is the code I use.
The final file is "ALBERTA GENERAL"; at the beginning I delete the old one before creating an updated version.
os.chdir(dataFolder)
with io.open("ALBERTA GENERAL.csv", "w+", encoding='utf8') as f:
    os.remove("ALBERTA GENERAL.csv")
extension = 'csv'
all_filenames = [i for i in glob.glob('*.{}'.format(extension))]
combined_csv = pd.concat([pd.read_csv(f, error_bad_lines=False) for f in all_filenames], axis=0, ignore_index=True)
print(combined_csv)
with io.open('ALBERTA GENERAL.csv', "w+", encoding='utf8') as outcsv:
    writer = csv.DictWriter(outcsv, fieldnames=["Brand, Name, Strain, Genre, Product type, Date"], delimiter=";")
    writer.writeheader()
    combined_csv.to_csv(outcsv, index=False, encoding='utf-8-sig')
But I get a confusing result that I don't know how to fix.
The final file doesn't append the intermediate files one below another; instead it adds new columns for each subsequent file. I tried adding the same headers to the intermediate files, but it did not help.
On top of that, the header is not split into columns and is recognized as one line.
Can anyone help me to fix my code, please?
Here is the link to the files
Just to fix the irregularities of the first file:
with open('ALBERTA GENERAL.csv','r') as f_in, open('ALBERTA GENERAL_fixed.csv','w') as f_out:
    for line in f_in:
        line = line.replace(',', ';')
        line = line.strip().rstrip(';')
        line = line.strip().lstrip(';')
        f_out.write(line + '\n')
os.remove('ALBERTA GENERAL.csv')
We will import the first file separately because it has different requirements than the others:
df1 = pd.read_csv('ALBERTA GENERAL_fixed.csv',header=0,sep=';')
We can then do the other two:
df2 = pd.read_csv('file_ALBERTA_05.14.2020.csv',header=None,sep=';')
df3 = pd.read_csv('file_ALBERTA_05.18.2020.csv',header=None,sep=';')
df2.columns = df1.columns
df3.columns = df1.columns
Final steps:
combined = pd.concat([df1,df2,df3])
combined.to_csv('out.csv',index=False)
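If more files arrive later, the same idea can be generalized with glob so the per-date filenames are not hard-coded. This is only a sketch under the assumptions that every incoming file matches the file_ALBERTA_*.csv naming seen above, uses ';' as the separator, and has no header row; the column names are the ones from the DictWriter call in the question.

import glob
import pandas as pd

# column names taken from the DictWriter call in the question
columns = ["Brand", "Name", "Strain", "Genre", "Product type", "Date"]

# assumption: every incoming file matches this pattern, uses ';' and has no header row
frames = [pd.read_csv(name, header=None, names=columns, sep=';')
          for name in sorted(glob.glob('file_ALBERTA_*.csv'))]

combined = pd.concat(frames, ignore_index=True)
combined.to_csv('ALBERTA GENERAL.csv', sep=';', index=False)  # the header is written once, at the top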

os.walk help - processing data in chunks - python3

I have some files that are scattered in many different folders within a directory, and I was wondering if there is a way to iterate through these folders in chunks.
Here's a picture of my directory tree
I'd want to go through all the files in my 2010A folder, then the 2010B folder, then move to 2011A and 2011B, etc.
My goal is to amend my current script, which only works for a single folder, so that it flows like this:
Start: ROOT FOLDER > 2010 > 2010A > output to csv > re-start loop >
2010B > append csv after the last row > re-start loop >
2011 > 2011A > append csv after the last row > and so on...
Is this possible?
Here's my code; it currently works if I run it on a single folder containing my txt files, e.g., the 2010A folder:
import re
import pandas as pd
import os
from collections import Counter

# get file list in the target directory
folder = r'root_folder\2010\2010A'
filelist = os.listdir(folder)
dict1 = {}
# open and read files, store contents into a dictionary
for file in filelist:
    with open(os.path.join(folder, file)) as f:
        dict1[file] = f.read()
# create filter for specific words
filter = ["cat", "dog", "elephant", "fowl"]
dict2 = {}
# count occurrences of the filter words in each file
for k, v in dict1.items():
    matches = []
    for i in filter:
        matches.extend(re.findall(r"{}".format(i), v))
    dict2[k] = dict(Counter(matches))
# record the total length of each file's text in a separate dictionary
dict3 = {k: {'total': len(v)} for k, v in dict1.items()}
# join both dictionaries
join_dict = {k: {**dict2[k], **dict3[k]} for k in dict1}
# convert to pandas dataframe
df = pd.DataFrame.from_dict(join_dict, orient='index').fillna(0).astype(int)
# output to csv
df.to_csv(r'path\output.csv', index=True, header=True)
I have a feeling I need to replace
    for file in filelist:
with
    for (root, dirs, files) in os.walk(r'root_folder', topdown=True):
but I'm not exactly sure how, since I'm quite new to coding and Python in general.
You can use glob to get the list of files like this:
import glob
files = glob.glob('root_folder/**/*.txt', recursive=True)
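To actually walk the tree folder by folder and grow a single CSV as you go, a minimal sketch along these lines could work. It assumes the per-folder logic from the question is wrapped in a hypothetical process_folder helper that returns a DataFrame, and it appends to one output file, writing the header only once.

import os
import pandas as pd

def process_folder(folder):
    # hypothetical helper: run the word-counting code from the question
    # on every .txt file in `folder` and return the resulting DataFrame
    ...

out_path = r'path\output.csv'
first = True
for root, dirs, files in os.walk(r'root_folder', topdown=True):
    dirs.sort()  # visit 2010A before 2010B, 2010 before 2011, and so on
    if not any(f.endswith('.txt') for f in files):
        continue  # skip folders with no data files
    df = process_folder(root)
    # append below the last row; only the first chunk writes the header
    df.to_csv(out_path, mode='w' if first else 'a', header=first, index=True)
    first = False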

Dynamically read and load files into Python

In Python, is there a way to import csv or text files dynamically? We process multiple files a week that have different names, and I don't want to update the with open statement manually each time the script runs. I have a function to read the file name, which I pass to a variable for later use in my code.
I can see and read the files in the directory, but I am not sure if I can add the contents of the folder into a variable that can then be used in the with open statement.
import os
os.chdir('T:\Credit Suite')
DIR = os.listdir()
print(DIR)
import csv, sys
with open('July 19.csv', mode='r') as csv_file:
    ROWCOUNT = 0
    FILENAME = (csv_file.name)
    output = csv.writer(open('test2.txt', 'w', newline=''))
    reader = csv.DictReader(csv_file)
    for records in reader:
        ROWCOUNT += 1
        EIN = records['EIN']
        DATE = records['Date Established']
        DUNS = records['DUNS #']
        COMPANYNAME = records['Company Name']
        lineout = ('<S>' + EIN + '$EIN ' + EIN + '*' + DATE + ')' + COMPANYNAME + '#D-U-N-S ' + DUNS).upper()
        output.writerow([lineout])
    print("writing completed")
I will be running my script when a file hits a folder, using a monitor and scheduler in an automated process. I want the code to run no matter what the inbound file is named, so I won't have to update the code manually for the file name or rename the file to a standard name each time.
os.chdir('T:\Credit Suite')
for root, dirs, files in os.walk("."):
    for filename in files:
        if filename.endswith('.csv'):
            f = os.path.join(root, filename)
            import csv, sys
            with open(f, mode='r') as csv_file:
                ...  # same processing as in the question
os.listdir() returns a list of all the files in the directory, so you can just loop over all the files:
import os
os.chdir('T:\Credit Suite')
DIR = os.listdir()
print(DIR)
import csv, sys
for file in DIR:
    if file.endswith('.csv'):
        with open(file, mode='r') as csv_file:
            ROWCOUNT = 0
            FILENAME = (csv_file.name)
            output = csv.writer(open(FILENAME + '_output.txt', 'w', newline=''))
            reader = csv.DictReader(csv_file)
            all_lines = []
            for records in reader:
                ROWCOUNT += 1
                EIN = records['EIN']
                DATE = records['Date Established']
                DUNS = records['DUNS #']
                COMPANYNAME = records['Company Name']
                lineout = ('<S>' + EIN + '$EIN ' + EIN + '*' + DATE + ')' + COMPANYNAME + '#D-U-N-S ' + DUNS).upper()
                all_lines.append(lineout)
            output.writerow(all_lines)
            print("writing completed")
        # remove the file to avoid reprocessing it on the next run
        # of the script, or just move it elsewhere with os.rename
        os.remove(file)
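If the watcher fires once per inbound file and you only ever want the newest CSV in the folder, a small sketch like this could pick it out instead of looping over everything. The folder path is the one from the question, and process_csv is a hypothetical stand-in for the DictReader/writer logic above.

import glob
import os

def process_csv(path):
    # hypothetical helper: the DictReader/writer logic from the answer above
    ...

folder = r'T:\Credit Suite'
csv_files = glob.glob(os.path.join(folder, '*.csv'))
if csv_files:
    # assume the most recently modified file is the one that just arrived
    newest = max(csv_files, key=os.path.getmtime)
    process_csv(newest)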

Applying function to a list of file-paths and writing csv output to the respective paths

How do I apply a function to a list of file paths I have built, and write an output csv in the same path?
read file in a subfolder -> perform a function -> write file in the subfolder -> go to the next subfolder
# opened xml by filename
with open(r'XML_opsReport 100001.xml', encoding="utf8") as fd:
    Odict_parsedFromFilePath = xmltodict.parse(fd.read())

# func called in func below
def activity_to_df_one_day(list_activity_this_day):
    ib_list = [pd.DataFrame(list_activity_this_day[i], columns=list_activity_this_day[i].keys()).drop("#uom") for i in range(len(list_activity_this_day))]
    return pd.concat(ib_list)

# processes parsed xml and writes csv
def activity_to_df_all_days(Odict_parsedFromFilePath, subdir):  # writes csv from parsed xml after some processing
    nodes_reports = Odict_parsedFromFilePath['opsReports']['opsReport']
    list_activity = []
    for i in range(len(nodes_reports)):
        try:
            df = activity_to_df_one_day(nodes_reports[i]['activity'])
            list_activity.append(df)
        except KeyError:
            continue
    opsReport = pd.concat(list_activity)
    opsReport['dTimStart'] = pd.to_datetime(opsReport['dTimStart'], infer_datetime_format=True)
    opsReport.sort_values('dTimStart', axis=0, ascending=True, inplace=True, kind='quicksort', na_position='last')
    opsReport.to_csv("subdir\opsReport.csv")  # write to the subdir

def scanfolder():  # fetches list of file-paths with the desired starting name
    list_files = []
    for path, dirs, files in os.walk(r'C:\..\xml_objects'):  # directory containing several subfolders
        for f in files:
            if f.startswith('XML_opsReport'):
                list_files.append(os.path.join(path, f))
    return list_files

filepaths = scanfolder()  # list of file-paths
Every function works well and the xml processing is good, so I am not sharing the xml structure. There are 100+ paths in filepaths, each in a different subdirectory. I want to be able to apply the above flow in the future as well, where I can get filepaths and perform the desired actions. It's important to write the csv file to its subdirectory.
To get the directory that a file is in, you can use:
import os
for root, dirs, files in os.walk(some_dir):
    for f in files:
        print(root)
        output_file = os.path.join(root, "output_file.csv")
        print(output_file)
Is that what you're looking for?
Output:
somedir
somedir\output_file.csv
See also Python 3 - travel directory tree with limited recursion depth and Find current directory and file's directory.
Was able to solve with os.path.join.
exceptions_path_list = []
for i in filepaths:
    try:
        with open(i, encoding="utf8") as fd:
            doc = xmltodict.parse(fd.read())
        activity_to_df_all_days(doc, i)
    except ValueError:
        exceptions_path_list.append(os.path.dirname(i))
        continue

def activity_to_df_all_days(Odict_parsedFromFilePath, filepath):
    ...
    opsReport.to_csv(os.path.join(os.path.dirname(filepath), "opsReport.csv"))
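The same write-next-to-the-input idea reads a little more directly with pathlib. This is just an alternative sketch, not part of the accepted answer, and it assumes the activity_to_df_all_days signature shown above.

from pathlib import Path
import xmltodict

exceptions_path_list = []
for path in map(Path, filepaths):
    try:
        doc = xmltodict.parse(path.read_text(encoding="utf8"))
        activity_to_df_all_days(doc, str(path))  # the csv lands in path.parent
    except ValueError:
        exceptions_path_list.append(str(path.parent))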

How to exclude some files when reading with glob.glob()?

I am reading some files using glob.glob(). I want to read all the files with names 123*.txt except those with 123*error.txt. Also, is there a way to print the filenames in the for loop, which is inside pd.concat()?
fields = ['StudentID', 'Grade']
path= 'C:/script_testing/'
parse = lambda f: pd.read_csv(f, usecols=fields)
table3 = pd.concat(
    [parse(f) for f in glob.glob('C:/script_testing/**/*.txt', recursive=True)]
).pipe(lambda d: pd.crosstab(d.StudentID, d.Grade))
Use this pattern:
files = glob.glob('C:/script_testing/**/123*[!error].txt', recursive=True)
Then proceed
fields = ['StudentID', 'Grade']
path= 'C:/script_testing/'
parse = lambda f: pd.read_csv(f, usecols=fields)
table3 = pd.concat(
    [parse(f) for f in files]
).pipe(lambda d: pd.crosstab(d.StudentID, d.Grade))
Reference this post
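One caveat worth noting: in glob patterns a [!...] class matches a single character, so 123*[!error].txt only excludes names whose last character before .txt is e, r, or o; it is not a substring exclusion. A plain list-comprehension filter is more robust, and it also gives a natural place to print each filename as it is read. A small sketch under those assumptions:

import glob
import pandas as pd

fields = ['StudentID', 'Grade']

def parse(f):
    print(f)  # show each filename as it is read
    return pd.read_csv(f, usecols=fields)

files = [f for f in glob.glob('C:/script_testing/**/123*.txt', recursive=True)
         if not f.endswith('error.txt')]

table3 = pd.concat([parse(f) for f in files]).pipe(
    lambda d: pd.crosstab(d.StudentID, d.Grade))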