Converting list to dictionary, and tokenizing the key values - possible? - python-3.x

I have a folder of files that I'm opening and reading into Python.
I want to search these files and count the keywords in each file, to make a dataframe like the attached image.
I have managed to open and read these files into a list, but my problem is as follows:
Edit 1:
I decided to try importing the files as a dictionary instead. It works, but when I try to lower-case the values, I get a 'list' object attribute error - even though in my variable explorer it's shown as a dictionary.
import os

filenames = os.listdir('.')
file_dict = {}
for file in filenames:
    with open(file) as f:
        items = [i.strip() for i in f.read().split(",")]
    file_dict[file.replace(".txt", "")] = items

def lower_dict(d):
    new_dict = dict((k, v.lower()) for k, v in d.items())
    return new_dict

print(lower_dict(file_dict))
Output:
AttributeError: 'list' object has no attribute 'lower'
Pre-edit post:
1. Each list value doesn't retain the filename key. So I don't have the rows I need.
2. I can't conduct a search of keywords in the list anyway, because it is not tokenized. So I can't count the keywords per file.
Here's my code for opening the files, converting them to lowercase and storing them in a list.
How can I transform this into a dictionary that retains the filename and has tokenized key values? Additionally, is it better to import the files and their contents into a dictionary directly? Can I still tokenize and lower-case everything?
import os
import nltk

# create list of filenames to loop over
filenames = os.listdir('.')

# create empty lists for storage
Lcase_content = []
tokenized = []
num = 0

# read files from folder, convert to lower case
for filename in filenames:
    if filename.endswith(".txt"):
        with open(os.path.join('.', filename)) as file:
            content = file.read()
            # convert to lower-case value
            Lcase_content.append(content.lower())
            ## these two lines below don't work - index out of range error
            tokenized[num] = nltk.tokenize.word_tokenize(tokenized[num])
            num = num + 1

You can compute the count of each token with collections.Counter. A Counter can take a list of strings and return a dictionary-like object mapping each token to its count. Since NLTK's word_tokenize takes a string and returns a list of tokens, you can get a dictionary of tokens and their counts with basically this:
Counter(nltk.tokenize.word_tokenize(content))
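A minimal sketch of what that call produces on a made-up string:

from collections import Counter
import nltk  # requires NLTK's punkt tokenizer data to be downloaded

counts = Counter(nltk.tokenize.word_tokenize("the cat chased the dog"))
print(counts)  # Counter({'the': 2, 'cat': 1, 'chased': 1, 'dog': 1})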
Since you want your file names as the index (first column), make it a nested dictionary, with the file name as the key and another dictionary of tokens and counts as the value, which looks like this:
{'file1.txt': Counter({'cat': 4, 'dog': 0, 'squirrel': 12, 'sea horse': 3}),
'file2.txt': Counter({'cat': 11, 'dog': 4, 'squirrel': 17, 'sea horse': 0})}
If you are familiar with Pandas, you can convert your dictionary to a Pandas dataframe. That will make it much easier to work with any tsv/csv/excel file, since you can export the dataframe as a csv file. Make sure you apply .lower() to your file content and pass orient='index' so that the file names become your index.
import os
import nltk
from collections import Counter
import pandas as pd

result = dict()
filenames = os.listdir('.')
for filename in filenames:
    if filename.endswith(".txt"):
        with open(os.path.join('.', filename)) as file:
            content = file.read().lower()
            result[filename] = Counter(nltk.tokenize.word_tokenize(content))

df = pd.DataFrame.from_dict(result, orient='index').fillna(0)
df['total words'] = df.sum(axis=1)
df.to_csv('words_count.csv', index=True)
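If you only need the specific keywords from your image rather than every token, you can select those columns from the resulting dataframe. A hedged sketch (the keyword names are assumptions, since the image isn't included here; multi-word phrases would need separate handling because word_tokenize splits them):

keywords = ['cat', 'dog', 'squirrel']            # assumed keyword list
cols = [w for w in keywords if w in df.columns]  # keep only keywords that actually occur
df_keywords = df[cols + ['total words']]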
Re: your first attempt - since items is a list (see [i.strip() for i in f.read().split(",")]), you can't apply .lower() to it.
Re: your second attempt - tokenized is empty because it was initialized as tokenized = []. That's why tokenized[num] = nltk.tokenize.word_tokenize(tokenized[num]) with num = 0 gives you the index out of range error.
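A minimal corrected sketch of that loop, appending to the list instead of indexing into it (variable names follow your question):

tokenized = []
for filename in filenames:
    if filename.endswith(".txt"):
        with open(os.path.join('.', filename)) as file:
            content = file.read().lower()
            tokenized.append(nltk.tokenize.word_tokenize(content))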

Related

Instead of printing to console create a dataframe for output

I am currently comparing the text of one file to that of another file.
The method: for each row in the source text file, check each row in the compare text file.
If the word is present in the compare file, write the word and write 'present' next to it.
If the word is not present, write the word and write 'not_present' next to it.
So far I can do this fine by printing to the console output, as shown below:
import sys

filein = 'source.txt'
compare = 'compare.txt'
source = 'source.txt'

# change to lower case
with open(filein, 'r+') as fopen:
    string = ""
    for line in fopen.readlines():
        string = string + line.lower()
with open(filein, 'w') as fopen:
    fopen.write(string)

# search and list
with open(compare) as f:
    searcher = f.read()
if not searcher:
    sys.exit("Could not read data :-(")

# search and output the results
with open(source) as f:
    for item in (line.strip() for line in f):
        if item in searcher:
            print(item, ',present')
        else:
            print(item, ',not_present')
the output looks like this:
dog ,present
cat ,present
mouse ,present
horse ,not_present
elephant ,present
pig ,present
What I would like is to put this into a pandas dataframe, preferably with 2 columns: one for the word and the second for its state. I can't seem to get my head around doing this.
I am making several assumptions here, including:
compare.txt is a text file consisting of a list of single words, one word per line.
source.txt is a free-flowing text file which includes multiple words per line, with each word separated by a space.
When determining whether a compare word is in source, it is found if and only if no punctuation marks (i.e. " ' , . ? etc.) are appended to the word in source.
The output dataframe will only contain the words found in compare.txt.
The final output is a printed version of the pandas dataframe.
With these assumptions:
import pandas as pd
from collections import defaultdict

compare = 'compare.txt'
source = 'source.txt'
rslt = defaultdict(list)

def getCompareTxt(fid: str) -> list:
    clist = []
    with open(fid, 'r') as cmpFile:
        for line in cmpFile.readlines():
            clist.append(line.lower().strip('\n'))
    return clist

cmpList = getCompareTxt(compare)

if cmpList:
    with open(source, 'r') as fsrc:
        items = []
        for item in (line.strip().split(' ') for line in fsrc):
            items.extend(item)
    print(items)
    for cmpItm in cmpList:
        rslt['Name'].append(cmpItm)
        if cmpItm in items:
            rslt['State'].append('Present')
        else:
            rslt['State'].append('Not Present')
    df = pd.DataFrame(rslt, index=range(len(cmpList)))
    print(df)
else:
    print('No compare data present')
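If you prefer to skip the defaultdict, an equivalent sketch that builds the dataframe directly (using the same cmpList and items as above):

states = ['Present' if word in items else 'Not Present' for word in cmpList]
df = pd.DataFrame({'Name': cmpList, 'State': states})
print(df)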

Splitting a list entry in Python

I am importing a CSV file into a list in Python. When I split it into list elements and then print an index, the entry is printed like this:
2000-01-03,3.745536,4.017857,3.631696,3.997768,2.695920,133949200
How would I split this list so that I could print just a single element, like this?
2000-01-03
Here is my code so far.
def main():
    list = []
    filename = "AAPL.csv"
    with open(filename) as x:
        for line in x.readlines():
            val = line.strip('\n').split(',')
            list.append(val)
    print(list[2])
Your current code builds a list of lists, precisely a list (of rows) of lists (of fields).
To extract one single element, say the first field of the third row, you could do:
...
print(list[2][0])
But except for trivial tasks, you should use the csv module when processing CSV files, because it is robust to corner cases like newlines or field separators contained in fields. Your code could become:
import csv

def main():
    list = []
    filename = "AAPL.csv"
    with open(filename) as x:
        rd = csv.reader(x)
        for val in rd:  # the reader is an iterator of lists of fields
            list.append(val)
    print(list[2][0])
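If you only need that single row, you don't have to keep the whole file in memory; a small sketch with itertools.islice (assuming the same AAPL.csv):

import csv
from itertools import islice

with open("AAPL.csv") as x:
    rd = csv.reader(x)
    row = next(islice(rd, 2, 3))  # third row (0-based index 2)
print(row[0])                     # e.g. '2000-01-03'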

os.walk help - processing data in chunks - python3

I have some files that are scattered in many different folders within a directory, and I was wondering if there were a way to iterate through these folders in chunks.
Here's a picture of my directory tree
I'd want to go through all the files in my 2010A folder, then the 2010B folder, then move to 2011A and 2011B, etc.
My goal is to amend my current script, which only works for a single folder, so that it flows like this:
Start: ROOT FOLDER > 2010 > 2010A > output to csv > re-start loop > 2010B > append csv after the last row > re-start loop > 2011 > 2011A > append csv after the last row > and so on...
Is this possible?
Here's my code; it currently works if I run it on a single folder containing my txt files, e.g., the 2010A folder:
import re
import pandas as pd
import os
from collections import Counter

# get file list in the target directory
folder = r'root_folder\2010\2010A'
filelist = os.listdir(folder)

dict1 = {}
# open and read files, store contents in a dictionary keyed by filename
for file in filelist:
    with open(os.path.join(folder, file)) as f:
        items = f.read()
    dict1[file] = items

# create filter of specific words
keywords = ["cat", "dog", "elephant", "fowl"]

dict2 = {}
# count occurrences of the keywords in each file
for k, v in dict1.items():
    matches = []
    for i in keywords:
        matches.extend(re.findall(r"{}".format(i), v))
    dict2[k] = dict(Counter(matches))

# store the total length of each file's contents in a separate dictionary
dict3 = {k: {'total': len(v)} for k, v in dict1.items()}

# join both dictionaries
join_dict = {k: {**dict2[k], **dict3[k]} for k in dict1}

# convert to pandas dataframe
df = pd.DataFrame.from_dict(join_dict, orient='index').fillna(0).astype(int)

# output to csv
df.to_csv(r'path\output.csv', index=True, header=True)
I have a feeling I need to replace:
for file in filelist:
with for (root,dirs,files) in os.walk(r'root_folder', topdown=True):
But I'm not exactly sure how, since I'm quite new to coding and python in general.
You can use glob to get the list of files recursively like this (the ** pattern is what makes recursive=True actually descend into subfolders):
import glob
files = glob.glob(r'root_folder\**\*.txt', recursive=True)
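If you want to keep the per-folder "chunk" behaviour from your description (process 2010A, write to csv, then append 2010B, and so on), one option is to walk the tree folder by folder and append each folder's result to the same csv. A rough sketch, assuming the counting logic from your script is wrapped in a process_folder(paths) helper that returns a dataframe (that helper name is made up here):

import os
import pandas as pd

first = True
for root, dirs, files in os.walk(r'root_folder', topdown=True):
    txt_paths = [os.path.join(root, f) for f in files if f.endswith('.txt')]
    if not txt_paths:
        continue
    df = process_folder(txt_paths)  # hypothetical helper containing the counting logic
    # write the header only for the first folder, then append subsequent chunks
    df.to_csv(r'path\output.csv', mode='a', index=True, header=first)
    first = False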

Python 3 - files imported as dictionary, but the values are lists - how to resolve?

I'm importing files from a folder into Python as a dictionary, with key = filename and value = file contents.
e.g.,
mydict = {'File1': ['this is a string as a list'],
'File2': ['second string in file 2 is also a list']}
Whereas what I'd like is to have:
mydict = {'File1': 'this is a string as a list', 'File2': '...'}
I also want to count the following strings: "string", "this is", "second string" and output the result in a data structure. I'm guessing I'd need to use Counter from collections to do that - but do I first need to tokenize my values?
Code to import text into dictionary:
filenames = os.listdir('.')
file_dict = {}
for file in filenames:
    with open(file) as f:
        items = [i.strip() for i in f.read().split(",")]
    file_dict[file.replace(".txt", "")] = items
print(file_dict)
To make the values all lowercase (this doesn't work, since they are in a list):
# convert dictionary to lower case
def lower_dict(d):
    new_dict = dict((k, v.lower()) for k, v in d.items())
    return new_dict

print(lower_dict(file_dict))
AttributeError: 'list' object has no attribute 'lower'
Any help is greatly appreciated.
You are splitting the input when you load it:
items = [i.strip() for i in f.read().split(",")]
If you want just the file contents, then you can just use:
items = f.read()
If you still want to trim the whitespace around commas (,), you can recombine the pieces with str.join:
items = ','.join([i.strip() for i in f.read().split(",")]) # This will reinsert commas
items = ''.join([i.strip() for i in f.read().split(",")]) # This will not
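As a quick illustration of the difference (the file content here is made up):

text = "cat , dog ,  bird"
print(','.join(i.strip() for i in text.split(",")))  # cat,dog,bird
print(''.join(i.strip() for i in text.split(",")))   # catdogbird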

How to combine multiple csv files based on file name

I have more than 1000 CSV files, and I want to combine the files whose filenames share the same first five digits into one CSV file.
input:
100044566.csv
100040457.csv
100041458.csv
100034566.csv
100030457.csv
100031458.csv
100031459.csv
import pandas as pd
import os
import glob

path_1 = ''
all_files_final = glob.glob(os.path.join(path_1, "*.csv"))
names_1 = [os.path.basename(x1) for x1 in all_files_final]

final = pd.DataFrame()
for file_1, name_1 in zip(all_files_final, names_1):
    file_df_final = pd.read_csv(file_1, index_col=False)
    # file_df['file_name'] = name
    final = final.append(file_df_final)
final.to_csv('', index=False)
I used the above code, but it merges all files into one CSV file; I don't know how to make the selection based on the name.
So from the above input:
Output 1: combine the first three CSV files into one CSV file, because the first five digits of their filenames are the same.
Output 2: combine the next four files into one CSV file, because the first five digits of their filenames are the same.
I would recommend approaching the problem slightly differently.
Here's my solution:
import os
import pandas as pd

files = os.listdir('.')  # returns list of filenames in the current folder
files_of_interest = {}   # a dictionary that we will be using below

for filename in files:           # iterate over files in the folder
    if filename[-4:] == '.csv':  # check whether a file is of .csv format
        key = filename[:5]       # as mentioned in the question - the first five characters of the filename are of interest
        files_of_interest.setdefault(key, [])    # if we don't have such a key, .setdefault creates it and assigns an empty list
        files_of_interest[key].append(filename)  # append the new filename to that list

for key in files_of_interest:
    buff_df = pd.DataFrame()
    for filename in files_of_interest[key]:
        buff_df = buff_df.append(pd.read_csv(filename))  # iterate over every filename for this key and append it to buff_df
    files_of_interest[key] = buff_df  # replace the list of files with a dataframe
This code creates a dictionary of dataframes, whose keys are the unique first five characters of the .csv filenames.
Then you can iterate over the keys of the dictionary to save each corresponding dataframe as a .csv file.
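For that last step, a minimal sketch (the output filenames here are an assumption):

for key, df in files_of_interest.items():
    df.to_csv(key + '_combined.csv', index=False)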
Hope my answer helped.
