loop through read_csv files python - python-3.x

First time post, Python newbie, appreciate your help and patience. I'm looking for help iterating through read_csv() files and assigning each to its own new dataframe. The filenames get pulled in correctly; I'm just having trouble making a unique dataframe for each csv imported.
# makes a variable list of filenames with common patterns via glob
filenames = sorted(glob.glob('617*.txt'))
# gets the number of files from the folder
count = len(filenames)
for i in range(0, count):
    file = pd.read_csv('C:/Users/fjehlik/Desktop/python\
data_frame_excel/' + filenames[i], sep='\t', header=0)

IIUC you want to create a dictionary of DataFrames:
dfs = {f:pd.read_csv(f, sep='\t') for f in filenames}
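As a sketch of the same idea wrapped in a helper (the `load_frames` name is mine, and the folder argument stands in for the path from the question):

```python
import glob
import os
import pandas as pd

def load_frames(folder, pattern="617*.txt"):
    """Return a dict mapping each file's base name to its own DataFrame."""
    paths = sorted(glob.glob(os.path.join(folder, pattern)))
    return {os.path.basename(p): pd.read_csv(p, sep="\t") for p in paths}
```

Each csv is then available as its own DataFrame via the dict, e.g. `dfs = load_frames("C:/Users/fjehlik/Desktop/...")` and `dfs[some_name]`, rather than as a separately named variable.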

Related

Python 3.9: For loop is not producing output files even though no errors are displayed

Everyone, I am fairly new to using Python for data analysis, so apologies for the silly questions:
IDE : PyCharm
What I have : A massive .xyz file (with 4 columns) which is a combination of several datasets. Each dataset can be identified by the third column of the file, which goes from 10,000 to -10,000 (with 0 in between and 100 as spacing) and then repeats, so every 201 rows is one dataset.
What I want to do : Split the massive file into its individual datasets (201 rows each) and save each file under a different name.
What I have done so far :
# Import packages
import os
import pandas as pd
import numpy as np  # for next steps
import math  # for next steps

# Check and change directory
path = 'C:/Clayton/lines/profiles_aufmod'
os.chdir(path)
print(os.getcwd())  # correct path is printed

# split the xyz file into different files for each profile
main_xyz = 'bathy_SPO_1984_50x50_profile.xyz'
number_lines = sum(1 for row in open(main_xyz))
print(number_lines)  # 10854 is the output
rowsize = 201
for i in range(number_lines, rowsize):
    profile_raw_df = pd.read_csv(main_xyz, delimiter=',', header=None,
                                 nrows=rowsize, skiprows=i)
    out_xyz = 'Profile' + str(i) + '.xyz'
    profile_raw_df.to_csv(out_xyz, index=False, header=False, mode='a')
Problems I am facing :
The for loop at first produced output files (see the "Proof of output" image), but now it does not produce any output, and it is not rewriting the previous files either. The other mystery is that I am not getting an error either (see "Code executed without error").
What I tried to fix the issue :
I updated all the packages and restarted PyCharm
I ran each line of code one by one and everything works until the for loop
While counting the number of rows in
    number_lines = sum(1 for row in open(main_xyz))
you exhaust the iterator that loops over the lines of the file, and you never close the file. (Neither of these should prevent pandas from reading the same file again, though.) A better idiom would be
    with open(main_xyz) as fh:
        number_lines = sum(1 for row in fh)
Your for loop as it stands does not do what you probably want. I guess you want:
    for i in range(0, number_lines, rowsize):
so rowsize is the step size, not the end value, of the range.
If you want to number the output files by dataset, keep a count of the dataset, like this:
    data_set = 0
    for i in range(0, number_lines, rowsize):
        data_set += 1
        ...
        out_xyz = f"Profile{data_set}.xyz"
        ...
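Alternatively, pandas can do the slicing itself: read_csv's chunksize parameter yields successive blocks of rows, which avoids the skiprows arithmetic entirely. A sketch (the `split_profiles` helper is my own name, assuming the comma-separated file from the question):

```python
import pandas as pd

def split_profiles(main_xyz, rowsize=201):
    """Write each block of `rowsize` rows to its own ProfileN.xyz file."""
    written = []
    # chunksize makes read_csv yield successive DataFrames of `rowsize` rows
    reader = pd.read_csv(main_xyz, delimiter=',', header=None, chunksize=rowsize)
    for data_set, chunk in enumerate(reader, start=1):
        out_xyz = f"Profile{data_set}.xyz"
        chunk.to_csv(out_xyz, index=False, header=False)
        written.append(out_xyz)
    return written
```

This reads the big file once, start to finish, instead of re-opening and skipping rows for every dataset.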

python iterating on multiple files

I have
file_2000.dta, file_2001.dta, file_2002.dta and so on.
I also have
file1_2000.dta, file1_2001.dta, file1_2002.dta and so on.
I want to iterate on the file year.
Let (year) = 2000, 2001, 2002, etc
import file_(year) using pandas.
import file1_(year) using pandas.
file_(year)['name'] = file_(year).index
file1_(year)['name'] = file1_(year).index2
merged = pd.merge(file_(year), file1_(year), on='name')
write/export merged_(year).dta
It seems to me that, based on your .dta extensions, you need the read_stata function to read the files in a loop, build a list of the separate dataframes so you can work with them individually, and then concatenate all the dataframes into one.
Something like:
import pandas as pd

list_of_files = ['file_2000.dta', 'file_2001.dta', 'file_2002.dta']  # full paths here...
frames = []
for f in list_of_files:
    df = pd.read_stata(f)
    frames.append(df)
consolidated_df = pd.concat(frames, axis=0, ignore_index=True)
These questions might be relevant to your case:
How to Read multiple files in Python for Pandas separate dataframes
Pandas read_stata() with large .dta files
As far as I know, there is no 'Let' keyword in Python. To iterate over multiple files in a directory you can simply use a for loop with the os module, like the following:
import os

directory = r'C:\Users\admin'
for filename in os.listdir(directory):
    if filename.startswith("file_200") and filename.endswith(".dta"):
        # do something
        pass
    else:
        continue
Another approach is to use a regex to tell Python which file names to match during the iteration. The pattern should be: pattern = r"file_20\d+"
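Putting the pieces together for the original year-by-year merge, here is a sketch (`merge_year_files` is my own name; it assumes the file_YEAR.dta / file1_YEAR.dta naming from the question and uses the row index as the merge key, as in the pseudocode):

```python
import pandas as pd

def merge_year_files(year):
    # read both files for the year (names follow the question's pattern)
    df = pd.read_stata(f"file_{year}.dta")
    df1 = pd.read_stata(f"file1_{year}.dta")
    # use the row index as the merge key, as in the pseudocode
    df["name"] = df.index
    df1["name"] = df1.index
    merged = pd.merge(df, df1, on="name")
    # export the merged result back to Stata format
    merged.to_stata(f"merged_{year}.dta", write_index=False)
    return merged
```

It can then be driven by a plain loop, e.g. `for year in range(2000, 2003): merge_year_files(year)`.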

Converting multiple .pdf files with multiple pages into 1 single .csv file

I am trying to convert .pdf data to a spreadsheet. Based on some research, several people recommended transforming it into csv first in order to avoid errors.
So, I made the below coding which is giving me:
"TypeError: cannot concatenate object of type ''; only Series and DataFrame objs are valid"
Error appears at 'pd.concat' command.
import tabula
import pandas as pd
import glob

path = r'C:\Users\REC.AC'
all_files = glob.glob(path + "/*.pdf")
print(all_files)
df = pd.concat(tabula.read_pdf(f1) for f1 in all_files)
df.to_csv("output.csv", index=False)
Since this might be a common issue, I am posting the solution I found. In recent versions of tabula-py, read_pdf returns a list of DataFrames (one per table found), so each item fed to pd.concat above is a list, not a DataFrame, which is exactly what the TypeError complains about. Flattening the per-file lists first generates the dataframes needed, and therefore works:
    frames = []
    for f1 in all_files:
        frames.extend(tabula.read_pdf(f1))  # read_pdf returns a list of tables
    df = pd.concat(frames)
    df.to_csv("output.csv", index=False)

Converting list to dictionary, and tokenizing the key values - possible?

So basically I have a folder of files I'm opening and reading into python.
I want to search these files and count the keywords in each file, to make a dataframe like the attached image.
I have managed to open and read these files into a list, but my problem is as follows:
Edit 1:
I decided to try and import the files as a dictionary instead. It works, but when I try to lower-case the values, I get a 'list' object attribute error - even though in my variable explorer, it's defined as a dictionary.
import os

filenames = os.listdir('.')
file_dict = {}
for file in filenames:
    with open(file) as f:
        items = [i.strip() for i in f.read().split(",")]
        file_dict[file.replace(".txt", "")] = items

def lower_dict(d):
    new_dict = dict((k, v.lower()) for k, v in d.items())
    return new_dict

print(lower_dict(file_dict))
Output:
    AttributeError: 'list' object has no attribute 'lower'
Pre-edit post:
1. Each list value doesn't retain the filename key. So I don't have the rows I need.
2. I can't conduct a search of keywords in the list anyway, because it is not tokenized. So I can't count the keywords per file.
Here's my code for opening the files, converting them to lowercase and storing them in a list.
How can I transform this into a dictionary retaining the filename, with tokenized key values? Additionally, is it better to somehow import the file and contents into a dictionary directly? Can I still tokenize and lower-case everything?
import os
import nltk

# create list of filenames to loop over
filenames = os.listdir('.')
# create empty lists for storage
Lcase_content = []
tokenized = []
num = 0
# read files from folder, convert to lower case
for filename in filenames:
    if filename.endswith(".txt"):
        with open(os.path.join('.', filename)) as file:
            content = file.read()
            # convert to lower-case value
            Lcase_content.append(content.lower())
            ## these two lines below don't work - index out of range error
            tokenized[num] = nltk.tokenize.word_tokenize(tokenized[num])
            num = num + 1
You can compute the count of each token by using the collections module. collections.Counter can take a list of strings and return a dictionary-like Counter with each token in its keys and the count of that token in its values. Since NLTK's word_tokenize takes a string and returns a list of tokens, to get a dictionary with tokens and their counts, you can basically do this:
    Counter(nltk.tokenize.word_tokenize(content))
Since you want your file names as index (first column), make it as a nested dictionary, with a file name as a key and another dictionary with tokens and counts as a value, which looks like this:
    {'file1.txt': Counter({'cat': 4, 'dog': 0, 'squirrel': 12, 'sea horse': 3}),
     'file2.txt': Counter({'cat': 11, 'dog': 4, 'squirrel': 17, 'sea horse': 0})}
If you are familiar with Pandas, you can convert your dictionary to a Pandas dataframe. It will make your life much easier when working with any tsv/csv/excel file, since you can export the result as a csv file. Make sure you apply .lower() to your file content and include orient='index' so that the file names become your index.
import os
import nltk
from collections import Counter
import pandas as pd

result = dict()
filenames = os.listdir('.')
for filename in filenames:
    if filename.endswith(".txt"):
        with open(os.path.join('.', filename)) as file:
            content = file.read().lower()
            result[filename] = Counter(nltk.tokenize.word_tokenize(content))

df = pd.DataFrame.from_dict(result, orient='index').fillna(0)
df['total words'] = df.sum(axis=1)
df.to_csv('words_count.csv', index=True)
Re: your first attempt, since items is a list (see [i.strip() for i in f.read().split(",")]), you can't apply .lower() to it directly.
Re: your second attempt, tokenized is empty because it was initialized as tokenized = []. That's why, when you try to do tokenized[num] = nltk.tokenize.word_tokenize(tokenized[num]), tokenized[num] with num = 0 gives you the index-out-of-range error.
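For completeness, the first attempt only needs the lower-casing pushed inside the list values. A minimal sketch of lower_dict rewritten for list values:

```python
def lower_dict(d):
    # lower-case every string inside each list value, keeping the keys
    return {k: [s.lower() for s in v] for k, v in d.items()}
```

For example, `lower_dict({"file1": ["Cat", "DOG"]})` returns `{"file1": ["cat", "dog"]}`.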

How to combine multiple csv files based on file name

I have more than 1000 csv files; I want to combine the files whose first five filename digits are the same into one csv file.
input:
100044566.csv
100040457.csv
100041458.csv
100034566.csv
100030457.csv
100031458.csv
100031459.csv
import pandas as pd
import os
import glob

path_1 = ''
all_files_final = glob.glob(os.path.join(path_1, "*.csv"))
names_1 = [os.path.basename(x1) for x1 in all_files_final]
final = pd.DataFrame()
for file_1, name_1 in zip(all_files_final, names_1):
    file_df_final = pd.read_csv(file_1, index_col=False)
    # file_df['file_name'] = name
    final = final.append(file_df_final)
final.to_csv('', index=False)
I used the above code but it merges all the files into one csv file; I don't know how to make the selection based on the name. So, from the above input:
output 1: combine the first three csv files into one csv file, because the first five digits of the filenames are the same.
output 2: combine the next four files into one csv file, because the first five digits of the filenames are the same.
I would recommend you to approach the problem slightly differently.
Here's my solution:
import os
import pandas as pd

files = os.listdir('.')  # returns the list of filenames in the current folder
files_of_interest = {}  # a dictionary that we will be using below
for filename in files:  # iterate over the files in the folder
    if filename[-4:] == '.csv':  # check whether the file is of .csv format
        key = filename[:5]  # as mentioned in the question, the first five characters of the filename are of interest
        files_of_interest.setdefault(key, [])  # if there is no such key yet, .setdefault creates it and assigns an empty list to it
        files_of_interest[key].append(filename)  # append the new filename to the list

for key in files_of_interest:
    # read every file for this key and concatenate them into one dataframe
    frames = [pd.read_csv(filename) for filename in files_of_interest[key]]
    files_of_interest[key] = pd.concat(frames, ignore_index=True)  # replace the list of files by a dataframe
This code will create a dictionary of dataframes, where the keys of the dictionary are the unique five-character prefixes of the .csv file names.
Then you can iterate over the keys of the dictionary to save each corresponding dataframe as a .csv file.
Hope my answer helped.
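The two steps above can be sketched end to end like this (`combine_by_prefix` is my own name; it assumes all the input files share the same header, as in the question):

```python
import os
import pandas as pd

def combine_by_prefix(folder=".", prefix_len=5):
    """Group csv files by the first `prefix_len` filename characters and
    concatenate each group into one DataFrame."""
    groups = {}
    for filename in sorted(os.listdir(folder)):
        if filename.endswith(".csv"):
            # group full paths by filename prefix
            groups.setdefault(filename[:prefix_len], []).append(
                os.path.join(folder, filename))
    # one concatenated DataFrame per prefix
    return {key: pd.concat((pd.read_csv(p) for p in paths), ignore_index=True)
            for key, paths in groups.items()}
```

Each combined frame can then be saved with, e.g., `df.to_csv(f"{key}_combined.csv", index=False)` for each key/df pair in the returned dict.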
