How to combine multiple csv files based on file name - python-3.x

I have more than 1,000 CSV files. I want to combine all files whose filenames share the same first five digits into one CSV file.
input:
100044566.csv
100040457.csv
100041458.csv
100034566.csv
100030457.csv
100031458.csv
100031459.csv
import pandas as pd
import os
import glob

path_1 = ''
all_files_final = glob.glob(os.path.join(path_1, "*.csv"))
names_1 = [os.path.basename(x1) for x1 in all_files_final]
final = pd.DataFrame()
for file_1, name_1 in zip(all_files_final, names_1):
    file_df_final = pd.read_csv(file_1, index_col=False)
    #file_df['file_name'] = name
    final = final.append(file_df_final)
final.to_csv('', index=False)
I used the above code, but it merges all the files into one CSV file; I don't know how to make the selection based on the filename.
So, from the above input:
Output 1: combine the first three CSV files into one CSV file, because the first five digits of their filenames are the same.
Output 2: combine the next four files into another CSV file, because the first five digits of those filenames are the same.

I would recommend you to approach the problem slightly differently.
Here's my solution:
import os
import pandas as pd

files = os.listdir('.')   # list of filenames in the current folder
files_of_interest = {}    # maps a five-character prefix to a list of filenames

for filename in files:                # iterate over files in the folder
    if filename.endswith('.csv'):     # keep only .csv files
        # as you've mentioned in your question, the first five characters of the filename are of interest
        key = filename[:5]
        files_of_interest.setdefault(key, [])  # if the key is missing, .setdefault creates it with an empty list
        files_of_interest[key].append(filename)  # append the new filename to the list

for key in files_of_interest:
    # concatenate every file that shares this prefix into one dataframe
    # (DataFrame.append is deprecated and removed in recent pandas, so use pd.concat)
    frames = [pd.read_csv(filename) for filename in files_of_interest[key]]
    files_of_interest[key] = pd.concat(frames, ignore_index=True)  # replace the list of files with a dataframe
This code will create a dictionary of dataframes, where the keys are the unique five-character prefixes of the .csv filenames.
Then you can iterate over the keys of the dictionary to save each corresponding dataframe as a .csv file.
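That save step could be sketched like this (the helper name and the `<prefix>_combined.csv` output pattern are assumptions, not from the original answer):

```python
import os
import pandas as pd

def save_groups(files_of_interest, out_dir='.'):
    # Write each grouped dataframe to '<prefix>_combined.csv' inside out_dir.
    # The output filename pattern is only for illustration.
    for key, df in files_of_interest.items():
        df.to_csv(os.path.join(out_dir, f"{key}_combined.csv"), index=False)
```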
Hope my answer helped.

Related

How to change data type of a same column in 7000 csv file and replace the files with the updated ones

I have 7,000 CSV files with 30,000 records in each file. All the files have the same column headers and the same column count, i.e. 14.
I want to change the data type of columns 1, 3 and 4, which I can do with pandas for a single file. My question is: how can I do this for all the files in a loop, replacing each file with its updated version, rather than loading them one by one by hand?
I tried this code, but honestly I copied it from somewhere else, so I don't know where to give the path of my CSV files folder or how it will replace the files.
import pandas as pd
import os
import glob

def main():
    path = os.getcwd()
    csv_files = glob.glob(os.path.join(path, "*.csv"))
    for f in csv_files:
        df = pd.read_csv(f)
        df[['LOAD DATE','DATE OF ISSUE','DATE OF DEPARTURE']] = df[['LOAD DATE','DATE OF ISSUE','DATE OF DEPARTURE']].apply(pd.to_datetime, errors='coerce')
        df.to_csv(f, index=False)

main()
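To point the loop at a specific folder instead of the current working directory, the folder path can be passed in directly. A minimal sketch (the function name and the example path at the bottom are placeholders, not from the original post):

```python
import glob
import os
import pandas as pd

def convert_dates(folder):
    # Glob for every .csv inside the given folder.
    for f in glob.glob(os.path.join(folder, "*.csv")):
        df = pd.read_csv(f)
        cols = ['LOAD DATE', 'DATE OF ISSUE', 'DATE OF DEPARTURE']
        # Coerce unparsable values to NaT instead of raising.
        df[cols] = df[cols].apply(pd.to_datetime, errors='coerce')
        df.to_csv(f, index=False)  # overwrite the original file in place

convert_dates(r'C:\path\to\csv_folder')  # hypothetical path - adjust to your folder
```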

What is the appropriate way to take in files that have a filename with a timestamp in it?

What is the appropriate way to take in files that have a timestamp in the filename and read them properly?
One way I'm thinking of so far is to collect these filenames into one single text file and read them all at once.
For example, filenames such as
1573449076_1570501819_file1.txt
1573449076_1570501819_file2.txt
1573449076_1570501819_file3.txt
Go into a file named filenames.txt
Then something like
with open('/Documents/filenames.txt', 'r') as f:
    for item in f:
        if item.is_file():
            file_stat = os.stat(item)
            item = item.replace('\n', '')
            print("Fetching {}".format(convert_times(file_stat)))
My question is how I would go about this so that I can properly read the names in the text file, given that they have timestamps in the actual names. Once I figure that out, I can convert them.
If you just want to get the timestamps from the file names, assuming that they all use the same naming convention, you can do so like this:
import glob
import os
from datetime import datetime

# Grab all .txt files in the specified directory
files = glob.glob("<path_to_dir>/*.txt")
for file in files:
    file = os.path.basename(file)
    # Check that it contains an underscore
    if not '_' in file:
        continue
    # Split the file name using the underscore as the delimiter
    stamps = file.split('_')
    # Convert the epochs to legible strings
    start = datetime.fromtimestamp(int(stamps[0])).strftime("%c")
    end = datetime.fromtimestamp(int(stamps[1])).strftime("%c")
    # Consume the data
    print(f"{start} - {end}")
    ...
You'll want to add some error checking and handling; for instance, if the first or second index in the stamps array isn't a parsable int, this will fail.
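A sketch of what that checking could look like, wrapped in a helper (the function name is an assumption):

```python
import os
from datetime import datetime

def parse_stamps(filename):
    """Return (start, end) datetimes parsed from a name like
    '1573449076_1570501819_file1.txt', or None if the name doesn't match."""
    parts = os.path.basename(filename).split('_')
    if len(parts) < 3:
        return None  # not enough underscore-separated fields
    try:
        start = datetime.fromtimestamp(int(parts[0]))
        end = datetime.fromtimestamp(int(parts[1]))
    except ValueError:
        return None  # first or second field was not a parsable integer epoch
    return start, end
```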

How to merge two files from different folders base on few character match in files name

I have two folders with text files. I want to read the files from the first folder, check in the second folder whether some specific characters in the filenames match, and if so merge the matching pair on the column 'Time'. I need to do this for several files.
folder 1:
07k0ms_610s_hh85m_sq150_t40k0_sn183_0
08k0ms_610s_hh85m_sq150_t40k0_sn183_20
011k0ms_610s_hh85m_sq150_t40k0_sn183_-10
folder 2:
07k0m_t40k0_try-0.2
08k0m_t40k0_try-0.2
32k0m_t40k0_try-0.2
Read a file from folder 1 and check whether 07k0m_t40k0, 08k0m_t40k0 or 11k0m_t40k0 matches in the file name; if it does, merge the corresponding folder 2 file into the folder 1 file and save the result as CSV, one pair at a time.
Try the following:
import glob
import os
import pandas as pd

lst_folders = ['folder_1',
               'folder_2']
lst_str_find = ['07k0m_t40k0', '08k0m_t40k0', '11k0m_t40k0']

lst_files_1 = sorted(glob.glob(lst_folders[0] + '/*.txt'))
lst_files_2 = sorted(glob.glob(lst_folders[1] + '/*.txt'))

for file_1 in lst_files_1:
    # extract e.g. '07k0m' from 'folder_1/07k0ms_610s_...'
    base_1 = os.path.basename(file_1)
    str_search = base_1[:base_1.find("s_")]
    if any(str_search in i for i in lst_str_find):
        for file_2 in lst_files_2:
            if str_search in file_2:
                print(file_1)
                print(file_2)
                # here load, merge and save file_1 & file_2 - the specific code
                # depends on the structure of your files and the way you want
                # to import them. Should look similar to:
                #
                # merge_1 = pd.read_csv(file_1)
                # merge_2 = pd.read_csv(file_2)
                # merged_file = pd.concat([merge_1, merge_2])
                # merged_file.to_csv(lst_folders[0]+'/merged_'+str_search+'.csv', index=None)
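Since the question asks for a merge on the column 'Time', the commented block could be filled in roughly like this (a sketch only; the separator, the inner join, and the helper name are assumptions about your files):

```python
import pandas as pd

def merge_pair_on_time(path_1, path_2, out_path):
    # Load both files; adjust sep=... if your .txt files are not comma-separated.
    df_1 = pd.read_csv(path_1)
    df_2 = pd.read_csv(path_2)
    # Join rows that share the same 'Time' value; how='inner' keeps only matches.
    merged = pd.merge(df_1, df_2, on='Time', how='inner')
    merged.to_csv(out_path, index=False)
    return merged
```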
Notes:
- read/merge/write might need to be adjusted, depending on the actual structure of your files, which did not become clear from your post
- the code assumes that it lives in the same directory as the folders; if that is not the case, the paths must be adjusted accordingly
Let me know, if it worked :)

os.walk help - processing data in chunks - python3

I have some files that are scattered in many different folders within a directory, and I was wondering if there were a way to iterate through these folders in chunks.
Here's a picture of my directory tree
I'd want to go through all the files in my 2010A folder, then 2010B folder, then move to 2011A and 2011B etc..
My goal is to amend my current script, which only works for a single folder, so that it flows like this:
Start: ROOT FOLDER >
2010 > 2010A >
output to csv> re-start loop >
2010B > append csv after the last row
re-start loop > 2011 > 2011A >
append csv after the last row > and so on...
Is this possible?
Here's my code, it currently works if I run it on a single folder containing my txt files, e.g., for the 2010A folder:
import re
import pandas as pd
import os
from collections import Counter

# get file list in current directory
filelist = os.listdir(r'root_folder\2010\2010A')

dict1 = {}
# open and read files, store into dictionary
for file in filelist:
    with open(file) as f:
        items = f.read()
    dict1[file] = items

# create filter for specific words
filter = ["cat", "dog", "elephant", "fowl"]

dict2 = {}
# count occurrence of words in each file
for k, v in dict1.items():
    matches = []
    for i in filter:
        matches.extend(re.findall(r"{}".format(i), v))
    dict2[k] = dict(Counter(matches))

dict3 = {}
# count total words in each file, store in separate dictionary
dict3 = {k: {'total': len(v)} for k, v in dict1.items()}

join_dict = {}
# join both dictionaries
join_dict = {k: {**dict2[k], **dict3[k]} for k in dict1}

# convert to pandas dataframe
df = pd.DataFrame.from_dict(join_dict, orient='index').fillna(0).astype(int)

# output to csv
df.to_csv(r'path\output.csv', index=True, header=True)
I have a feeling I need to replace:
for file in filelist:
with for (root,dirs,files) in os.walk(r'root_folder', topdown=True):
But I'm not exactly sure how, since I'm quite new to coding and python in general.
You can use a recursive glob to get the list of files, like this (the ** pattern together with recursive=True is what makes it descend into subfolders):
import glob
files = glob.glob('root_folder/**/*.txt', recursive=True)
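An os.walk-based sketch of the folder-by-folder traversal the question describes, visiting 2010A before 2010B and so on (the helper name is an assumption):

```python
import os

def collect_txt_files(root_folder):
    """Walk root_folder top-down and return .txt paths, visiting
    sibling folders in sorted order (2010A before 2010B, etc.)."""
    paths = []
    for root, dirs, files in os.walk(root_folder, topdown=True):
        dirs.sort()  # in topdown mode, sorting dirs in place controls visit order
        for name in sorted(files):
            if name.endswith('.txt'):
                paths.append(os.path.join(root, name))
    return paths
```

Each returned path can then be fed to the existing counting code, appending to one CSV as you go.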

Converting list to dictionary, and tokenizing the key values - possible?

So basically I have a folder of files I'm opening and reading into python.
I want to search these files and count the keywords in each file, to make a dataframe like the attached image.
I have managed to open and read these files into a list, but my problem is as follows:
Edit 1:
I decided to try importing the files as a dictionary instead. It works, but when I try to lower-case the values, I get a 'list' object attribute error, even though in my variable explorer it's defined as a dictionary.
import os

filenames = os.listdir('.')
file_dict = {}
for file in filenames:
    with open(file) as f:
        items = [i.strip() for i in f.read().split(",")]
        file_dict[file.replace(".txt", "")] = items

def lower_dict(d):
    new_dict = dict((k, v.lower()) for k, v in d.items())
    return new_dict

print(lower_dict(file_dict))
output =
AttributeError: 'list' object has no attribute 'lower'
Pre-edit post:
1. Each list value doesn't retain the filename key. So I don't have the rows I need.
2. I can't conduct a search of keywords in the list anyway, because it is not tokenized. So I can't count the keywords per file.
Here's my code for opening the files, converting them to lower case and storing them in a list.
How can I transform this into a dictionary that retains the filename keys, with tokenized values? Additionally, is it better to somehow import the files and their contents into a dictionary directly? Can I still tokenize and lower-case everything?
import os
import nltk

# create list of filenames to loop over
filenames = os.listdir('.')

# create empty lists for storage
Lcase_content = []
tokenized = []
num = 0

# read files from folder, convert to lower case
for filename in filenames:
    if filename.endswith(".txt"):
        with open(os.path.join('.', filename)) as file:
            content = file.read()
            # convert to lower-case value
            Lcase_content.append(content.lower())
            ## these two lines below don't work - index out of range error
            tokenized[num] = nltk.tokenize.word_tokenize(tokenized[num])
            num = num + 1
You can compute the count of each token by using the collections module. collections.Counter can take a list of strings and return a dictionary-like Counter with each token in its keys and the count of the tokens in its values. Since NLTK's word_tokenize takes a string and returns a list of tokens, to get a dictionary with tokens and their counts, you can basically do this:
Counter(nltk.tokenize.word_tokenize(text))
Since you want your file names as index (first column), make it as a nested dictionary, with a file name as a key and another dictionary with tokens and counts as a value, which looks like this:
{'file1.txt': Counter({'cat': 4, 'dog': 0, 'squirrel': 12, 'sea horse': 3}),
'file2.txt': Counter({'cat': 11, 'dog': 4, 'squirrel': 17, 'sea horse': 0})}
If you are familiar with Pandas, you can convert your dictionary to a Pandas dataframe. It will make your life much easier when working with any tsv/csv/excel file, by exporting the Pandas dataframe result as a csv file. Make sure you apply .lower() to your file content and include orient='index' so that the file names become your index.
import os
import nltk
from collections import Counter
import pandas as pd

result = dict()
filenames = os.listdir('.')
for filename in filenames:
    if filename.endswith(".txt"):
        with open(os.path.join('.', filename)) as file:
            content = file.read().lower()
            result[filename] = Counter(nltk.tokenize.word_tokenize(content))

df = pd.DataFrame.from_dict(result, orient='index').fillna(0)
df['total words'] = df.sum(axis=1)
df.to_csv('words_count.csv', index=True)
Re: your first attempt, since your 'items' is a list (see [i.strip() for i in f.read().split(",")]), you can't apply .lower() to it.
Re: your second attempt, your 'tokenized' is empty as it was initialized as tokenized = []. That's why when you try to do tokenized[num] = nltk.tokenize.word_tokenize(tokenized[num]), tokenized[num] with num = 0 gives you the index out of range error.
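A version of lower_dict that works on list values could look like this (a sketch of the fix, not from the original post):

```python
def lower_dict(d):
    # lower-case every string inside each list value, keeping the keys intact
    return {k: [s.lower() for s in v] for k, v in d.items()}

print(lower_dict({"file1": ["CAT", "Dog"]}))  # {'file1': ['cat', 'dog']}
```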