Counting the frequency distribution of letters in a text file - python-3.x

I'm writing a program that will count how many of each letter there are.
Currently, it's working but it counts upper and lower case letters separately. I tried to convert all of the characters to upper case but it didn't work.
myFile = open('textFile.txt', 'r+')
with open('textFile.txt', 'r') as fileinput:
    for line in fileinput:
        line = line.upper()
d = {}
for i in myFile.read():
    d[i] = d.get(i,0) + 1
for k,v in sorted(d.items()):
    print("{}: {}".format(k,v))
If my text file consists of:
abc
ABC
it will print:
(space) : 1
A: 1
B: 1
C: 1
a: 1
b: 1
c: 1
I would like it to print:
A: 2
B: 2
C: 2

The result of line = line.upper() is not used anywhere. Move the counting code into the block that performs the uppercase transformation, so that you count the characters of each uppercased line.

You are converting each line to upper case, but then counting the characters from a separate, unconverted read of the file (myFile.read()). Count the characters of the uppercased line instead, something like this:
d = {}
with open('textFile.txt', 'r') as fileinput:
    for line in fileinput:
        line = line.upper()
        # change is here: count the uppercased line, not myFile.read()
        for i in line:
            d[i] = d.get(i,0) + 1
for k,v in sorted(d.items()):
    print("{}: {}".format(k,v))

In Python, indentation is critical. You are converting the input to uppercase, but then throwing it away.
Try rearranging it like this:
d = {}
# myFile = open('textFile.txt', 'r+') - removed as not needed due to "with" variant of file processing below.
with open('textFile.txt', 'r') as fileinput:
    for line in fileinput:
        line = line.upper()
        for i in line:
            d[i] = d.get(i,0) + 1
for k,v in sorted(d.items()):
    print("{}: {}".format(k,v))

This will do it.
chars = []
with open('textFile.txt', 'r') as fileinput:
    for line in fileinput:
        for c in line:
            chars.append(c.upper())
d = {}
for i in chars:
    d[i] = d.get(i, 0) + 1
for k,v in sorted(d.items()):
    print("{}: {}".format(k,v))
Or this:
d = {}
with open('textFile.txt', 'r') as fileinput:
    for line in fileinput:
        line = line.upper()
        for i in line:
            d[i] = d.get(i,0) + 1
for k,v in sorted(d.items()):
    print("{}: {}".format(k,v))
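The answers above still count the newline between the two lines as a character, so the output contains one extra entry besides A, B and C. A minimal sketch of one way to get exactly the desired output, assuming only alphabetic characters should be counted (the function name letter_frequencies is made up for this example, and it takes a string rather than a file so it can be run as-is):

```python
from collections import Counter


def letter_frequencies(text):
    """Count letters case-insensitively, ignoring everything that is not a letter."""
    # str.isalpha() is False for spaces, newlines, digits and punctuation,
    # so those characters never reach the Counter.
    return Counter(ch for ch in text.upper() if ch.isalpha())


# Same content as the example file in the question: "abc", newline, "ABC".
counts = letter_frequencies("abc\nABC")
for letter, n in sorted(counts.items()):
    print("{}: {}".format(letter, n))
# A: 2
# B: 2
# C: 2
```

For a real file, the text would come from something like `open('textFile.txt').read()` inside a `with` block.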

Related

How does tell() in python file handling work

f = open("test.txt","w")
s= "This\nThis\nThis"
f.write(s)
f.close()
f= open("test.txt","r")
w=''
for i in f:
    for j in i:
        w = w+j
print(w)
print("Number of Characters",len(w))
print("Current Position of handler",f.tell())
f.close()
The output of the above is
This
This
This
Number of Characters 14
Current Position of handler 16
The file holds 12 letters and 2 newline characters, so the number of characters is 14. I got that. But I do not understand why the tell() function returns 16.
Just an assumption.
I think the difference comes from the newline characters. The `>>>python .\test.py` prompts in the outputs suggest Windows, where a file opened in text mode writes each '\n' to disk as the two-byte sequence '\r\n'. Reading in text mode translates '\r\n' back into a single '\n', so len(w) counts one character per newline, while tell() reports a position derived from the bytes on disk and so advances by two. This doesn't happen for other escape characters like '\t', which are stored as a single byte.
I ran some examples on my system. You can compare the results one by one.
Example 1
f = open("test.txt","w")
s= "\t"
f.write(s)
f.close()
f= open("test.txt","r")
w=''
for i in f:
    for j in i:
        w = w+j
print(w)
print("Number of Characters",len(w))
print("Current Position of handler",f.tell())
f.close()
Output
>>>python .\test.py
Number of Characters 1
Current Position of handler 1
Example 2
f = open("test.txt","w")
s= "\n"
f.write(s)
f.close()
f= open("test.txt","r")
w=''
for i in f:
    for j in i:
        w = w+j
print(w)
print("Number of Characters",len(w))
print("Current Position of handler",f.tell())
f.close()
Output
>>>python .\test.py
Number of Characters 1
Current Position of handler 2
Example 3
f = open("test.txt","w")
s= "ThisThisThis"
f.write(s)
f.close()
f= open("test.txt","r")
w=''
for i in f:
    for j in i:
        w = w+j
print(w)
print("Number of Characters",len(w))
print("Current Position of handler",f.tell())
f.close()
Output
>>>python .\test.py
ThisThisThis
Number of Characters 12
Current Position of handler 12
Example 4
f = open("test.txt","w")
s= "ThisThisThis\n"
f.write(s)
f.close()
f= open("test.txt","r")
w=''
for i in f:
    for j in i:
        w = w+j
print(w)
print("Number of Characters",len(w))
print("Current Position of handler",f.tell())
f.close()
Output
>>>python .\test.py
ThisThisThis
Number of Characters 13
Current Position of handler 14
Example 5
f = open("test.txt","w")
s= "ThisThisThis\t"
f.write(s)
f.close()
f= open("test.txt","r")
w=''
for i in f:
    for j in i:
        w = w+j
print(w)
print("Number of Characters",len(w))
print("Current Position of handler",f.tell())
f.close()
Output
>>>python .\test.py
ThisThisThis
Number of Characters 13
Current Position of handler 13
For your case, you used \n two times in your string. You can count 2 instead of 1 for every \n while guessing the result of tell(). So, 4 + 2 + 4 + 2 + 4 = 16.
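The "count 2 for every \n" rule matches what happens when each newline is stored on disk as the two bytes '\r\n', which is the default for text-mode writes on Windows. A small sketch that reproduces the 14 versus 16 difference on any platform by forcing that line ending with the newline argument of open(); the temporary directory and the variable names are only for illustration:

```python
import os
import tempfile

text = "This\nThis\nThis"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "test.txt")

    # newline="\r\n" makes every "\n" written go to disk as "\r\n",
    # which is what text mode does by default on Windows.
    with open(path, "w", newline="\r\n") as f:
        f.write(text)

    size_on_disk = os.path.getsize(path)

    # Binary mode shows the bytes exactly as stored.
    with open(path, "rb") as f:
        raw = f.read()

    # Default text mode translates "\r\n" back to a single "\n".
    with open(path, "r") as f:
        decoded = f.read()

print(len(decoded))         # 14 characters after newline translation
print(size_on_disk)         # 16 bytes on disk
print(raw.count(b"\r\n"))   # 2 two-byte line endings
```

The two extra bytes are the '\r' bytes, one per newline, which is where 4 + 2 + 4 + 2 + 4 = 16 comes from.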

Python: treat words with commas the same as those without in a dictionary

I am making a program, that reads a .txt file and prints how many times a certain word has been used:
filename = 'for_python.txt'
with open(filename) as file:
    contents = file.read().split()
dict = {}
for word in contents:
    if word not in dict:
        dict[word] = 1
    else:
        dict[word] += 1
dict = sorted(dict.items(), key=lambda x: x[1], reverse=True)
for i in dict:
    print(i[0], i[1])
It works, but it treats words with commas as different words. Is there an easy and efficient way to solve this?
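This is what I did: keep only the letters and digits of each word before counting it.
filename = 'for_python.txt'
with open(filename) as file:
    contents = file.read().splitlines()
counts = {}  # renamed from dict to avoid shadowing the built-in
for sentence in contents:
    word_list = sentence.split()
    for word in word_list:
        # rebuild the word from its alphanumeric characters only
        cleaned_word = ""
        for character in word:
            if character.isalnum():
                cleaned_word += character
        if not cleaned_word:
            continue  # the token was punctuation only
        if cleaned_word not in counts:
            counts[cleaned_word] = 1
        else:
            counts[cleaned_word] += 1
sorted_counts = sorted(counts.items(), key=lambda x: x[1], reverse=True)
for i in sorted_counts:
    print(i[0], i[1])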
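A shorter variant of the same idea, as a sketch only: it uses collections.Counter from the standard library and strips punctuation from the ends of each token with str.strip(string.punctuation). The function name count_words and the sample sentence are made up for this example. Unlike the character-by-character loop above, it leaves punctuation inside a word (such as the apostrophe in "don't") untouched.

```python
import string
from collections import Counter


def count_words(text):
    """Count whitespace-separated words, ignoring leading/trailing punctuation."""
    # "red," and "red." both become "red"; a token such as "--" becomes ""
    # and is filtered out.
    stripped = (token.strip(string.punctuation) for token in text.split())
    return Counter(word for word in stripped if word)


counts = count_words("red, green red. blue, -- red")
# most_common() yields (word, count) pairs, highest count first.
for word, n in counts.most_common():
    print(word, n)
```

For a file, the argument would be the result of `file.read()` inside the `with` block from the question.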

Separate and write unique files from delimited text

I am following this tutorial here to separate and write out a delimited text file, but only get one file on output. Is this a python2 -> 3 issue? Help please.
filename = ('file path')
with open(filename) as Input:
    op = ''
    start = 0
    count = 1
    for x in Input.read().split("\n"):
        if (x == 'XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX'):
            if (start == 1):
                with open(str(count) + '.txt', 'w') as Output:
                    Output.write(op)
                    Output.close()
                    op = ''
                    count = + 1
            else:
                start = 1
Input.close()
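Your count is always 1, so every section is written to the same file, 1.txt. The statement count = + 1 does not increment anything: it assigns the value +1 to count.
Change this line:
    count = + 1
to
    count += 1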
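The difference between the two spellings is easy to check in isolation. A small sketch, with hypothetical function names chosen only for this illustration:

```python
def buggy_counter(steps):
    count = 1
    for _ in range(steps):
        count = + 1  # unary plus: assigns the value +1 on every pass
    return count


def fixed_counter(steps):
    count = 1
    for _ in range(steps):
        count += 1  # augmented assignment: adds 1 to the current value
    return count


print(buggy_counter(3))  # 1
print(fixed_counter(3))  # 4
```

With the buggy version the output file name never changes, which is why only one file appears on disk.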

I have a file "httpd-access.txt" and I want to count the number of times a line starts with '81' in Python

numberOfTime = 0
with open('httpd-access.txt') as f:
    for line in f:
        finded = line.find('81')
        if finded != -1 and finded != 0:
            numberOfTime += 1
How can I count only the lines that start with 81?
Thanks!
You can use startswith() for this:
numberOfTime = 0
with open('httpd-access.txt') as f:
    for line in f:
        if line.startswith('81'):
            numberOfTime += 1
Here is a function to achieve what you need:
def count_lines(httpd_file):
    counter = 0
    with open(httpd_file) as my_file:
        for line in my_file:
            if line.startswith('81'):
                counter += 1
    return counter
if __name__ == '__main__':
    print(count_lines('httpd-access.txt'))
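The same count can be written as a single sum() over a generator expression. The sketch below takes any iterable of lines so it can be tried without the log file; the helper name and the sample lines are invented for illustration. It also shows that the prefix '81' matches any line beginning with those two characters, so if the goal is an IP address whose first octet is 81 (an assumption about the log format), the prefix '81.' is stricter.

```python
def count_lines_with_prefix(lines, prefix):
    """Return how many lines start with the given prefix."""
    return sum(1 for line in lines if line.startswith(prefix))


# Hypothetical sample lines, not taken from a real access log.
sample = [
    '81.2.3.4 - - "GET /"',
    '10.0.0.81 - - "GET /"',
    '8123 some other record',
    '',
]

print(count_lines_with_prefix(sample, '81'))   # 2
print(count_lines_with_prefix(sample, '81.'))  # 1
```

An open file object can be passed directly as `lines`, since iterating over it yields one line at a time.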

Print Word & Line Number Where Word Occurs in File Python

I am trying to print the word and line number(s) where the word occurs in the file in Python. Currently I am getting the correct numbers for second word, but the first word I look up does not print the right line numbers. I must iterate through infile, use a dictionary to store the line numbers, remove new line chars, remove any punctuation & skip over blank lines when pulling the number. I need to add a value that is actually a list, so that I may add the line numbers to the list if the word is contained on multiple lines.
Adjusted code:
def index(f,wordf):
    infile = open(filename, 'r')
    dct = {}
    count = 0
    for line in infile:
        count += 1
        newLine = line.replace('\n', ' ')
        if newLine == ' ':
            continue
        for word in wordf:
            if word in split_line:
                if word in dct:
                    dct[word] += 1
                else:
                    dct[word] = 1
    for word in word_list:
        print('{:12} {},'.format(word,dct[word]))
    infile.close()
Current Output:
>>> index('leaves.txt',['cedars','countenance'])
pines [9469, 9835, 10848, 10883],
counter [792, 2092, 2374],
Desired output:
>>> index2('f.txt',['pines','counter','venison'])
pines [530, 9469, 9835, 10848, 10883]
counter [792, 2092, 2374]
There is some ambiguity about how your file is set up, but I think I understand it.
Try this:
import numpy as np # add this import
...
        for word in word_f:
            if word in split_line:
                np_array = np.array(split_line)
                item_index_list = np.where(np_array == word)
                dct[word] = item_index_list # note, you might want the 'index + 1' instead of the 'index'
    for word in word_f:
        print('{:12} {},'.format(word,dct[word]))
...
By the way, as far as I can tell, you're not using your 'increment' variable.
I think that'll work; let me know if it doesn't and I'll fix it.
Per request, I made an additional answer (that I think works) without importing another library:
def index2(f,word_f):
    infile = open(f, 'r')
    dct = {}
    # deleted line
    for line in infile:
        newLine = line.replace('\n', ' ')
        if newLine == ' ':
            continue
        # deleted line
        newLine2 = removePunctuation(newLine)
        split_line = newLine2.split()
        for word in word_f:
            count = 0 # you might want to start at 1 instead, if you're going for 'word number'
            # important note: you need to have 'word2', not 'word' here, and on the next line
            for word2 in split_line: # changed to looping through data
                if word2 == word:
                    if word2 in dct:
                        temp = dct[word]
                        temp.append(count)
                        dct[word] = temp
                    else:
                        temp = []
                        temp.append(count)
                        dct[word] = temp
                count += 1
    for word in word_f:
        print('{:12} {},'.format(word,dct[word]))
    infile.close()
Do be aware that this code does not handle words that are not in the file. I'm not sure about the file you're reading from, but if you pass in a word that never occurs, dct has no entry for it, so the final print loop raises a KeyError.
Note: I took this code from my other post to see if it works, and it seems that it does
def index2():
    word_list = ["work", "many", "lots", "words"]
    infile = ["lots of words","many many work words","how come this picture lots work","poem poem more words that rhyme"]
    dct = {}
    # deleted line
    for line in infile:
        newLine = line.replace('\n', ' ') # shouldn't do anything, because I have no newlines
        if newLine == ' ':
            continue
        # deleted line
        newLine2 = newLine # ignoring punctuation
        split_line = newLine2.split()
        for word in word_list:
            count = 0 # you might want to start at 1 instead, if you're going for 'word number'
            # important note: you need to have 'word2', not 'word' here, and on the next line
            for word2 in split_line: # changed to looping through data
                if word2 == word:
                    if word2 in dct:
                        temp = dct[word]
                        temp.append(count)
                        dct[word] = temp
                    else:
                        temp = []
                        temp.append(count)
                        dct[word] = temp
                count += 1
    for word in word_list:
        print('{:12} {}'.format(word, ", ".join(map(str, dct[word])))) # edited output so it's comma separated list without a trailing comma
def main():
    index2()
if __name__ == "__main__":main()
and the output:
work 2, 5
many 0, 1
lots 0, 4
words 2, 3, 3
and the explanation:
infile = [
    "lots of words", # lots at index 0, words at index 2
    "many many work words", # many at index 0, many at index 1, work at index 2, words at index 3
    "how come this picture lots work", # lots at index 4, work at index 5
    "poem poem more words that rhyme" # words at index 3
]
When the indexes are appended in that order, each word ends up with its correct positions within the lines.
My biggest error was that I was not properly adding the line number to the counter. I completely used the wrong call, and did nothing to increment the line number as the word was found in the file. The proper format was dct[word] += [count] not dct[word] += 1
def index(filename,word_list):
    infile = open(filename, 'r')
    dct = {}
    count = 0
    for line in infile:
        count += 1
        newLine = line.replace('\n', ' ')
        if newLine == ' ':
            continue
        newLine2 = removePunctuation(newLine)
        split_line = newLine2.split()
        for word in word_list:
            if word in split_line:
                if word in dct:
                    dct[word] += [count]
                else:
                    dct[word] = [count]
    for word in word_list:
        print('{:12} {}'.format(word,dct[word]))
    infile.close()
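The code above relies on a removePunctuation helper that is not shown, and it raises a KeyError for a word that never occurs. A self-contained sketch of the same line-number index, under these assumptions: punctuation is removed with str.translate and string.punctuation (a stand-in for the unseen helper), and a word that never occurs gets an empty list. The name build_index and the sample lines are invented for this example.

```python
import string
from collections import defaultdict

# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def build_index(lines, word_list):
    """Map each word in word_list to the 1-based line numbers where it occurs."""
    wanted = set(word_list)
    index = defaultdict(list)
    for line_number, line in enumerate(lines, start=1):
        # split() drops the trailing newline; a blank line yields an empty set,
        # so it is skipped but still counted in the line numbering.
        words = set(line.translate(_PUNCT_TABLE).split())
        for word in wanted & words:
            index[word].append(line_number)
    # Words that never occur map to an empty list instead of raising KeyError.
    return {word: index.get(word, []) for word in word_list}


sample_lines = ["The pines, tall.\n", "\n", "a counter\n", "pines again; counter!\n"]
result = build_index(sample_lines, ["pines", "counter", "venison"])
for word, line_numbers in result.items():
    print('{:12} {}'.format(word, line_numbers))
```

An open file can be passed as `lines`, for example `with open(filename) as infile: build_index(infile, word_list)`.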

Resources