I am reading some files using glob.glob(). I want to read all the files whose names match 123*.txt, except those matching 123*error.txt. Also, is there a way to print the filenames from the for loop that sits inside pd.concat()?
fields = ['StudentID', 'Grade']
path = 'C:/script_testing/'
parse = lambda f: pd.read_csv(f, usecols=fields)
table3 = pd.concat(
    [parse(f) for f in glob.glob('C:/script_testing/**/*.txt', recursive=True)]
).pipe(lambda d: pd.crosstab(d.StudentID, d.Grade))
Use this pattern:
files = glob.glob('C:/script_testing/**/123*[!error].txt', recursive=True)
Then proceed:
fields = ['StudentID', 'Grade']
path = 'C:/script_testing/'
parse = lambda f: pd.read_csv(f, usecols=fields)
table3 = pd.concat(
    [parse(f) for f in files]
).pipe(lambda d: pd.crosstab(d.StudentID, d.Grade))
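Note that [!error] in the pattern above is just a glob character class, so it really only excludes names whose last character before .txt is e, o or r. If you want the exclusion to be explicit, and also print each filename as it is read (your second question), you can filter with a plain list comprehension and wrap the parser in a small function. A minimal sketch, assuming the same path and columns as above:
import glob
import pandas as pd

fields = ['StudentID', 'Grade']

def parse(f):
    print(f)  # log each filename as it is read inside the concat
    return pd.read_csv(f, usecols=fields)

# collect all 123*.txt files, then drop the 123*error.txt ones explicitly
files = [f for f in glob.glob('C:/script_testing/**/123*.txt', recursive=True)
         if not f.endswith('error.txt')]

table3 = pd.concat(
    [parse(f) for f in files]
).pipe(lambda d: pd.crosstab(d.StudentID, d.Grade))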
I am comparing all files in two directories. If the comparison ratio is greater than 90%, I continue the outer loop, and I want to remove the matched file from the second directory so that the next file in the first directory is not compared against a file that has already been matched.
Here's what I've tried:
for i for i in sorted_files:
    for j in sorted_github_files:
        #pdb.set_trace()
        with open(f'./files/{i}') as f1:
            try:
                text1 = f1.read()
            except:
                pass
        with open(f'./github_files/{j}') as f2:
            try:
                text2 = f2.read()
            except:
                pass
        m = SequenceMatcher(None, text1, text2)
        print("file1:", i, "file2:", j)
        if m.ratio() > 0.90:
            os.remove(f'./github_files/{j}')
            break
I know I cannot change the list while it is being iterated over, which is why it is returning a file-not-found error. I don't want to use try/except blocks. Any ideas appreciated.
A couple of things to point out:
Always provide a minimal reproducible example.
Your first for loop is not working since you used `for i for i ..`
If you want to iterate over the files in list1 (sorted_files) first, read each file outside of the second loop.
I would add the files that match with a ratio over 0.90 to a new list and remove the files afterwards, so your items do not change during the iteration.
You can find the test data I created and used here.
import os
from difflib import SequenceMatcher

# define your two folders, full paths
first_path = os.path.abspath(r"C:\Users\XYZ\Desktop\testfolder\a")
second_path = os.path.abspath(r"C:\Users\XYZ\Desktop\testfolder\b")

# get files from folder
first_path_files = os.listdir(first_path)
second_path_files = os.listdir(second_path)

# join path and filenames
first_folder = [os.path.join(first_path, f) for f in first_path_files]
second_folder = [os.path.join(second_path, f) for f in second_path_files]

# empty list for matching results
matched_files = []

# iterate over the files in the first folder
for file_one in first_folder:
    # read file content
    with open(file_one, "r") as f:
        file_one_text = f.read()

    # iterate over the files in the second folder
    for file_two in second_folder:
        # read file content
        with open(file_two, "r") as f:
            file_two_text = f.read()

        # match the two file contents
        match = SequenceMatcher(None, file_one_text, file_two_text)
        if match.ratio() > 0.90:
            print(f"Match found ({match.ratio()}): '{file_one}' | '{file_two}'")
            # TODO: here you have to decide if you rather want to remove files from the first or second folder
            matched_files.append(file_two)  # i delete files from the second folder

# remove duplicates from the resulting list
matched_files = list(set(matched_files))

# remove the files
for f in matched_files:
    print(f"Removing file: {f}")
    os.remove(f)
I have started on some code that is intended to write many text files by first reading one text file. More details of the question follow the code I have so far.
The text file (I'm reading from a text file called alphabet.txt):
a
b
c
:
d
e
f
:
g
h
i
:
I want the result to be like this:
file1:
a
b
c
file2:
d
e
f
file3:
g
h
i
with open('alphabet.txt', 'r') as f:
    a = []
    for i in f:
        i.split(':')
        a.append(a)
The code is of course not finished. Question: I don't know how to continue with the code. Is it possible to write the text files, place them in a specific folder, and name them 'file1', 'file2', ... directly from the code, without hardcoding the names?
You could implement that function with something like this:
if __name__ == '__main__':
    with open('alphabet.txt', 'r') as f:
        split_alph = f.read().split(':\n')
    for i in range(len(split_alph)):
        x = open(f"file_{i}", "w")
        x.write(split_alph[i])
        x.close()
Depending on whether there is a last : in the alphabet.txt file, you'd have to dismiss the last element in split_alph with
split_alph = f.read().split(':\n')[:-1]
If you got any further questions regarding the solution, please tell me.
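If you also want the files to end up in a specific folder and be named file1, file2, ... without hardcoding the names, you can build each path with os.path.join and create the folder first. A minimal sketch along the same lines, assuming a hypothetical target folder called output_files:
import os

output_dir = "output_files"  # hypothetical target folder
os.makedirs(output_dir, exist_ok=True)  # create it if it does not exist yet

with open('alphabet.txt', 'r') as f:
    blocks = f.read().split(':\n')

for i, block in enumerate(blocks, start=1):
    if not block.strip():  # skip an empty trailing block caused by a final ':'
        continue
    # the file name is generated from the loop index, not hardcoded
    with open(os.path.join(output_dir, f"file{i}.txt"), "w") as out:
        out.write(block)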
file = open("C:/Users/ASUS/Desktop/tutl.py", "r")  # this is the input file
text = file.read()  # read the content of the file
file.close()  # close the file

splitted = text.split(":")  # list containing the strings of all split parts

destinationFolder = "path_to_Folder"  # replace this with your folder path

for x in range(len(splitted)):  # loop over each part
    newFile = open(destinationFolder + "/File" + str(x) + ".txt", "w")  # create a file for this part
    newFile.write(splitted[x])  # write the content
    newFile.close()  # close the file
I am a little confused with a Pandas library and would really appreciate your help.
The task is to combine all *.csv files in a folder into one big file.
CSV files don't have a header, so I just want to append all of them and add a header in the end.
Here is the code I use.
The final file is "ALBERTA GENERAL.csv"; at the beginning I delete the old one before creating an updated version.
os.chdir(dataFolder)

with io.open("ALBERTA GENERAL.csv", "w+", encoding='utf8') as f:
    os.remove("ALBERTA GENERAL.csv")

extension = 'csv'
all_filenames = [i for i in glob.glob('*.{}'.format(extension))]

combined_csv = pd.concat([pd.read_csv(f, error_bad_lines=False) for f in all_filenames], axis=0, ignore_index=True)
print(combined_csv)

with io.open('ALBERTA GENERAL.csv', "w+", encoding='utf8') as outcsv:
    writer = csv.DictWriter(outcsv, fieldnames=["Brand, Name, Strain, Genre, Product type, Date"], delimiter=";")
    writer.writeheader()
    combined_csv.to_csv(outcsv, index=False, encoding='utf-8-sig')
But I get a confusing result that I don't know how to fix.
The final file doesn't append the intermediate files one below another; instead it adds new columns for each subsequent file. I tried adding the same headers to the intermediate files, but it did not help.
On top of that, the header is not split into columns and is recognized as one line.
Can anyone help me to fix my code, please?
Here is the link to the files
Just to fix the irregularities of the first file:
with open('ALBERTA GENERAL.csv', 'r') as f_in, open('ALBERTA GENERAL_fixed.csv', 'w') as f_out:
    for line in f_in:
        line = line.replace(',', ';')
        line = line.strip().rstrip(';')
        line = line.strip().lstrip(';')
        f_out.write(line + '\n')
os.remove('ALBERTA GENERAL.csv')  # remove the original now that the fixed copy exists
We will import the first file separately because it has different requirements than the others:
df1 = pd.read_csv('ALBERTA GENERAL_fixed.csv',header=0,sep=';')
We can then do the other two:
df2 = pd.read_csv('file_ALBERTA_05.14.2020.csv',header=None,sep=';')
df3 = pd.read_csv('file_ALBERTA_05.18.2020.csv',header=None,sep=';')
df2.columns = df1.columns
df3.columns = df1.columns
Final steps:
combined = pd.concat([df1,df2,df3])
combined.to_csv('out.csv',index=False)
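If more files get added to the folder later, the same idea can be generalized with glob instead of listing each file by hand. A minimal sketch, assuming the remaining files keep the file_ALBERTA_*.csv naming shown above, are semicolon-separated and header-less, and that the six column names from the question are the intended header:
import glob
import pandas as pd

fields = ["Brand", "Name", "Strain", "Genre", "Product type", "Date"]

# the fixed first file already carries a header row
df_first = pd.read_csv('ALBERTA GENERAL_fixed.csv', header=0, sep=';')
df_first.columns = fields

# every other csv in the folder is header-less, so assign the names on read
others = [pd.read_csv(f, header=None, sep=';', names=fields)
          for f in sorted(glob.glob('file_ALBERTA_*.csv'))]

combined = pd.concat([df_first, *others], ignore_index=True)
combined.to_csv('out.csv', index=False)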
I have a folder path with many files in it.
Some of the files don't have the data I want in them; how do I skip over these files and move on to the next ones?
path = '/path/'  # use your path
allFiles = glob.glob(path + "/*.json")
for file_ in allFiles:
    #print(file_)
    with open(file_) as f:
        data = json.load(f)
        df = json_normalize(data['col_to_be_flattened'])
        REST OF THE OPERATIONS
Once the data is in the dataframe df, the REST OF THE OPERATIONS relies on a column called 'Rows.Row'; if this column does not exist in df, I want to skip that file. How do I do this?
Just check whether 'Rows.Row' is among the column names before continuing:
path = '/path/'  # use your path
allFiles = glob.glob(path + "/*.json")
for file_ in allFiles:
    #print(file_)
    with open(file_) as f:
        data = json.load(f)
        df = json_normalize(data['col_to_be_flattened'])
        if 'Rows.Row' in df.columns.tolist():
            REST OF THE OPERATIONS
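The same check can also be written the other way around, skipping straight to the next file with continue. A minimal sketch, assuming a hypothetical process(df) function that holds the rest of the operations:
import glob
import json

from pandas import json_normalize  # in older pandas: from pandas.io.json import json_normalize

def process(df):
    ...  # hypothetical: the rest of the operations

path = '/path/'  # use your path
allFiles = glob.glob(path + "/*.json")

for file_ in allFiles:
    with open(file_) as f:
        data = json.load(f)
    df = json_normalize(data['col_to_be_flattened'])
    if 'Rows.Row' not in df.columns:  # no tolist() needed; membership works on the column Index
        continue                      # skip this file and move on to the next one
    process(df)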
How do I apply a function to a list of file paths I have built, and write an output csv in the same path?
read file in a subfolder -> perform a function -> write file in the subfolder -> go to next subfolder
#opened xml by filename
with open(r'XML_opsReport 100001.xml', encoding="utf8") as fd:
    Odict_parsedFromFilePath = xmltodict.parse(fd.read())

#func called in func below
def activity_to_df_one_day(list_activity_this_day):
    ib_list = [pd.DataFrame(list_activity_this_day[i], columns=list_activity_this_day[i].keys()).drop("#uom") for i in range(len(list_activity_this_day))]
    return pd.concat(ib_list)

#Processes parsed xml and writes csv
def activity_to_df_all_days(Odict_parsedFromFilePath, subdir):  #writes csv from parsed xml after some processing
    nodes_reports = Odict_parsedFromFilePath['opsReports']['opsReport']
    list_activity = []
    for i in range(len(nodes_reports)):
        try:
            df = activity_to_df_one_day(nodes_reports[i]['activity'])
            list_activity.append(df)
        except KeyError:
            continue
    opsReport = pd.concat(list_activity)
    opsReport['dTimStart'] = pd.to_datetime(opsReport['dTimStart'], infer_datetime_format=True)
    opsReport.sort_values('dTimStart', axis=0, ascending=True, inplace=True, kind='quicksort', na_position='last')
    opsReport.to_csv("subdir\opsReport.csv")  #write to the subdir

def scanfolder():  #fetches list of file-paths with desired starting name
    list_files = []
    for path, dirs, files in os.walk(r'C:\..\xml_objects'):  #directory containing several subfolders
        for f in files:
            if f.startswith('XML_opsReport'):
                list_files.append(os.path.join(path, f))
    return list_files

filepaths = scanfolder()  #list of file-paths
Every function works well and the xml processing is good, so I am not sharing the xml structure. There are 100+ paths in filepaths, each in a different subdirectory. I want to be able to apply the above flow in the future as well, where I can get filepaths and perform the desired actions. It's important to write the csv file to its own subdirectory.
To get the directory that a file is in, you can use:
import os

for root, dirs, files in os.walk(some_dir):
    for f in files:
        print(root)
        output_file = os.path.join(root, "output_file.csv")
        print(output_file)
Is that what you're looking for?
Output:
somedir
somedir\output_file.csv
See also Python 3 - travel directory tree with limited recursion depth and Find current directory and file's directory.
I was able to solve it with os.path.join.
exceptions_path_list = []

for i in filepaths:
    try:
        with open(i, encoding="utf8") as fd:
            doc = xmltodict.parse(fd.read())
        activity_to_df_all_days(doc, i)
    except ValueError:
        exceptions_path_list.append(os.path.dirname(i))
        continue

def activity_to_df_all_days(Odict_parsedFromFilePath, filepath):
    ...
    ...
    ...
    opsReport.to_csv(os.path.join(os.path.dirname(filepath), "opsReport.csv"))