Counting the occurrences of all letters in a txtfile [duplicate] - python-3.x

I'm trying to open a file and count the occurrences of letters.
So far this is where I'm at:
def frequencies(filename):
    infile=open(filename, 'r')
    wordcount={}
    content = infile.read()
    infile.close()
    counter = {}
    invalid = "\u2018'`,.?!:;-_\n\u2014' '"
    for word in content:
        word = content.lower()
        for letter in word:
            if letter not in invalid:
                if letter not in counter:
                    counter[letter] = content.count(letter)
                    print('{:8} appears {} times.'.format(letter, counter[letter]))
Any help would be greatly appreciated.

One option is to use the numpy package. An example would look like this:
import numpy
text = "xvasdavawdazczxfawaczxcaweac"
text = list(text)
a,b = numpy.unique(text, return_counts=True)
x = sorted(zip(b,a), reverse=True)
print(x)
In your case, you can combine all your words into a single string and then convert the string into a list of characters.
If you want to remove everything except letters, you can use a regex to clean the string first:
import re
import numpy
# remove everything except letters
content = re.sub(r'[^a-zA-Z]', r'', content)
# convert to a list of characters
content = list(content)
a,b = numpy.unique(content, return_counts=True)
x = sorted(zip(b,a), reverse=True)
print(x)

If you are looking for a solution that does not use numpy:
invalid = set("\u2018'`,.?!:;-_\n\u2014' '")
def frequencies(filename):
    counter = {}
    with open(filename, 'r') as f:
        for ch in (char.lower() for char in f.read()):
            if ch not in invalid:
                if ch not in counter:
                    counter[ch] = 0
                counter[ch] += 1
    # (count, letter) pairs sort by count first
    results = [(counter[ch], ch) for ch in counter]
    return sorted(results)
# filename is the path to your text file
for result in reversed(frequencies(filename)):
    print(result)

I would suggest using collections.Counter instead.
Compact solution:
from collections import Counter
from string import ascii_lowercase # a-z string
VALID = set(ascii_lowercase)
with open('in.txt', 'r') as fin:
    counter = Counter(char.lower() for line in fin for char in line if char.lower() in VALID)
print(counter.most_common()) # print values in order of most common to least.
More readable solution:
from collections import Counter
from string import ascii_lowercase # a-z string
VALID = set(ascii_lowercase)
with open('in.txt', 'r') as fin:
    counter = Counter()
    for char in (char.lower() for line in fin for char in line):
        if char in VALID:
            counter[char] += 1
print(counter)
If you don't want to use a Counter, you can just use a dict:
from string import ascii_lowercase # a-z string
VALID = set(ascii_lowercase)
with open('test.txt', 'r') as fin:
    counter = {}
    for char in (char.lower() for line in fin for char in line):
        if char in VALID:
            # add the letter to dict
            # dict.get used to either get the current count value
            # or default to 0. Saves checking if it is in the dict already
            counter[char] = counter.get(char, 0) + 1
# sort the values by occurrence in descending order
data = sorted(counter.items(), key = lambda t: t[1], reverse = True)
print(data)
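To tie the Counter approach back to the output format used in the question, here is a minimal sketch. The function names letter_frequencies and report are made up for illustration, and the sketch assumes the file's contents have already been read into a string and that only the ASCII letters a-z should be counted.

```python
from collections import Counter


def letter_frequencies(text):
    """Count ASCII letters in text, ignoring case (assumption: a-z only)."""
    return Counter(ch for ch in text.lower() if 'a' <= ch <= 'z')


def report(text):
    """Return one formatted line per letter, most frequent letter first."""
    lines = []
    for letter, count in letter_frequencies(text).most_common():
        lines.append('{:8} appears {} times.'.format(letter, count))
    return lines


if __name__ == '__main__':
    for line in report("Hello, World!"):
        print(line)
```

Counter.most_common() already returns the pairs in descending order of count, so no separate sort is needed.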

Related

TypeError: string indices must be integers --> Python

I want to write a Python function that reads each character of a text file
and then counts and displays the occurrences of the letters E and T
separately (including lowercase e and t).
def test():
    f = open("poem.txt",'r')
    count = 0
    count1 =0
    try:
        line = f.readlines()
        for i in line:
            for x in line:
                if (i[x] in 'Ee'):
                    count+=1
                else:
                    if (i[x] in 'Tt'):
                        count1+=1
        print("E or e",count)
        print("T or t",count1)
    except EOFError:
        f.close()
test()
This is what I tried
And it gave :
File "/Users/ansusinha/Desktop/Tution/Untitled15.py", line 23, in test
if (i[x] in 'Ee'):
TypeError: string indices must be integers
What am I missing here?
You are missing the fact that Python strings come with a .count() method.
You can read the entire file with
file_as_string = f.read()
and then count occurrences of any substring with
amount_of_E = file_as_string.count('E')
Check out str.count in Python documentation.
With
amount_of_Ee = file_as_string.lower().count('e')
you count occurrences of both E and e and with
amount_of_Tt = file_as_string.lower().count('t')
you are done with counting using two lines of code.
In your own code you try to index a string with another string, but string indices must be integers.
Instead of for x in line: you actually wanted for x in i:, so that x is a single character of the line i. You can then test it directly with if x in 'eE':.
But there is no need for the loops at all as Python strings come with the .count() method, so just use it.
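As a minimal sketch of the str.count approach described above, assuming the file's text has already been read into a string (the function name count_e_and_t is made up for illustration):

```python
def count_e_and_t(text):
    """Return (number of E/e, number of T/t) found in text."""
    lowered = text.lower()  # fold case once so 'E' and 'e' count together
    return lowered.count('e'), lowered.count('t')


if __name__ == '__main__':
    e_count, t_count = count_e_and_t("Tea Time")
    print("E or e", e_count)
    print("T or t", t_count)
```

With a real file, the string could come from f.read() inside a with open(...) block.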
The error occurs because f.readlines() does not read a single line; it returns a list of all lines. Both of your loops therefore iterate over whole lines, so x is a string rather than an integer index.
The code should look like this:
def test():
    count = 0
    count1 = 0
    # the with block closes the file even if an error occurs
    with open("poem.txt", 'r') as f:
        lines = f.readlines()
    for line in lines:
        for char_in_line in line:
            if char_in_line in 'Ee':
                count += 1
            elif char_in_line in 'Tt':
                count1 += 1
    print("E or e", count)
    print("T or t", count1)
test()
If your poem.txt is this,
LaLaLa
I'm shoes.
Who are you?
then x takes whole lines such as "LaLaLa\n" as its value, so i[x] is evaluated as i["LaLaLa\n"], which indexes a string with another string.
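The failure can be reproduced in isolation. This is a small sketch (the helper name index_with is hypothetical) showing that indexing a string with an integer works while indexing it with another string raises TypeError; the exact wording of the error message differs between Python versions.

```python
def index_with(line, key):
    """Try line[key] and report a TypeError as text instead of crashing."""
    try:
        return line[key]
    except TypeError as exc:
        return "TypeError: " + str(exc)


if __name__ == '__main__':
    print(index_with("LaLaLa\n", 0))            # integer index: works
    print(index_with("LaLaLa\n", "LaLaLa\n"))   # string index: TypeError
```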

cs50 Pset 6 DNA - Issue creating list

I have code that iterates through the text and tells me the maximum number of times each DNA STR is found. The only step missing before I can match these values against the CSV file is storing them in a list, but I am not able to do so. When I run the code, the maximum values are printed separately for each STR sequence.
I have tried to append the values to a list, but without success, so I cannot match them against the DNA sequences in the CSV (large or small).
Any help or advice is greatly appreciated!
Here is my code, and the results I get using "text 1" and "small csv":
import cs50
import sys
import csv
import os
if len(sys.argv) != 3:
    print("Usage: python dna.py data.csv sequence.txt")
csv_db = sys.argv[1]
file_seq = sys.argv[2]
with open(csv_db, newline='') as csvfile: #with open(csv_db) as csv_file
    csv_reader = csv.reader(csvfile, delimiter=',')
    header = next(csv_reader)
    i = 1
    while i < len(header):
        STR = header[i]
        len_STR = len(STR)
        with open(file_seq, 'r') as my_file:
            file_reader = my_file.read()
            counter = 0
            a = 0
            b = len_STR
            list = []
            for text in file_reader:
                if file_reader[a:b] != STR:
                    a += 1
                    b += 1
                else:
                    counter += 1
                    a += len_STR
                    b += len_STR
            list.append(counter)
            print(list)
        i += 1
The problem is where the variable "list" is created. Every time the loop moves on to the next STR in "header", you execute:
list = []
This creates a brand-new list on each iteration, so it only ever holds the count for the current STR. To build one list containing the counts for all STRs, create "list" once before the while loop, remove the assignment inside the loop, and move the print after the loop:
list = []
while i < len(header):
    <your loop code>
print(list)
This should solve your problem.
P.S. Avoid using "list" as a variable name. list is a Python built-in type, so once you rebind the name you can no longer call list() in that scope.
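As a hedged sketch of the "create the list once, append inside the loop" idea, the snippet below collects one value per STR into a single list. The helper names longest_run and count_all are made up for illustration, and the sketch assumes the value wanted for each STR is its longest run of back-to-back repeats, which is one reading of "maximum amount of times" in the question; the question's own loop counts occurrences differently.

```python
def longest_run(sequence, str_unit):
    """Return the longest number of consecutive repeats of str_unit in sequence."""
    n = len(str_unit)
    if n == 0:
        return 0  # guard: an empty unit would otherwise match forever
    best = 0
    for start in range(len(sequence)):
        run = 0
        pos = start
        # keep stepping one unit at a time while the repeats continue
        while sequence[pos:pos + n] == str_unit:
            run += 1
            pos += n
        best = max(best, run)
    return best


def count_all(sequence, str_units):
    """Collect one count per STR in a single list, in header order."""
    counts = []  # created once, before the loop
    for unit in str_units:
        counts.append(longest_run(sequence, unit))
    return counts


if __name__ == '__main__':
    print(count_all("AATGAATGTTAATG", ["AATG", "TT"]))
```

The resulting list can then be compared against each CSV row after converting the row's values with int().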

Concatenate returned strings into a single line python without using end=" "

#open a file for input
#loop through the contents to find four letter words
#split the contents of the string
#if length of string = 4 then print the word
my_file = open("myfile.txt", 'r')
for sentence in my_file:
    single_strings = sentence.split()
    for word in single_strings:
        if len(word) == 4:
            print(word)
I would like my code to return the four-letter words in a single string, but instead it prints each word on a new line. How can I return the words as one string so that I can split() it and get the lengths to print out?
Every problem gets simpler when broken into small parts. First write a function that returns a list containing all the words in a file:
def words_in_file(filename):
    with open(filename, 'r') as f:
        return [word for sentence in f for word in sentence.split()]
Then write a function that filters a list of words by length:
def words_with_k_letters(words, k=-1):
    return filter(lambda w: len(w) == k, words)
Once you have these two functions, the problem becomes trivial:
words = words_in_file("myfile.txt")
words = words_with_k_letters(words, k=4)
print(', '.join(words))
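The same decomposition can be tried without a file on disk. In this sketch, io.StringIO stands in for the opened file and the function name four_letter_words is hypothetical; it accepts any iterable of lines, so an open file object would work the same way.

```python
import io


def four_letter_words(lines, k=4):
    """Return every whitespace-separated word of length k, in order."""
    return [word for line in lines for word in line.split() if len(word) == k]


# StringIO behaves like a text file opened for reading
sample = io.StringIO("this is some text\nwith four word long items\n")
joined = " ".join(four_letter_words(sample))
print(joined)
```

Because the result is a single space-separated string, calling joined.split() gives back the individual words.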

How to scramble/shuffle/randomize all the letters of a string in python except the first and the last letter?

For example :
Example 1:
string = "Jack and the bean stalk."
updated_string = "Jcak and the baen saltk."
Example 2:
string = "Hey, Do you want to boogie? Yes, Please."
updated_string = "Hey, Do you wnat to bogoie? Yes, Palsee."
Now this string is stored in a file.
I want to read this string from the file. And write the updated string back in the file at same positions.
Letters of each word of the string with length greater than 3 must be scrambled/shuffled while keeping first and last letter as it is. Also if there is any punctuation mark the punctuation mark stays as it is.
My approach:
import random
with open("path/textfile.txt","r+") as file:
    file.seek(0)
    my_list = []
    for line in file.readlines():
        word = line.split()
        my_list.append(word)
    scrambled_list =[]
    for i in my_list:
        if len(i) >3:
            print(i)
            s1 = i[1]
            s2 = i[-1]
            s3 = i[1:-1]
            random.shuffle(s3)
            y = ''.join(s3)
            z = s1+y+s2+' '
            print(z)
This is one approach.
Demo:
from random import shuffle
import string
punctuation = tuple(string.punctuation)
with open("path/textfile.txt") as file:
    for line in file.readlines(): #Iterate lines
        res = []
        for i in line.split(): #Split sentence to words
            punch = False
            if len(i) >= 4: #Check if word is greater than 3 letters
                if i.endswith(punctuation): #Check if words ends with punctuation
                    val = list(i[1:-2]) #Exclude last 2 chars
                    punch = True
                else:
                    val = list(i[1:-1]) #Exclude last 1 chars
                shuffle(val) #Shuffle letters excluding the first and last.
                if punch:
                    res.append("{0}{1}{2}".format(i[0], "".join(val), i[-2:]))
                else:
                    res.append("{0}{1}{2}".format(i[0], "".join(val), i[-1]))
            else:
                res.append(i)
        print(" ".join(res))
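An alternative sketch, not taken from the answer above, uses re.sub with a callback so that punctuation and spacing are never touched. The names scramble_word and scramble_text are made up for illustration, and the pattern assumes a "word" is a run of four or more ASCII letters.

```python
import random
import re


def scramble_word(match):
    """Shuffle the interior letters of one matched word."""
    word = match.group(0)
    middle = list(word[1:-1])
    random.shuffle(middle)
    return word[0] + "".join(middle) + word[-1]


def scramble_text(text):
    """Scramble every run of 4+ letters; everything else stays in place."""
    return re.sub(r"[A-Za-z]{4,}", scramble_word, text)


if __name__ == '__main__':
    print(scramble_text("Hey, Do you want to boogie? Yes, Please."))
```

Since the output has the same length as the input, the scrambled text could be written back over the original contents of a file opened in "r+" mode after seeking to position 0.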

How can I simplify and format this function?

I have this messy code that is meant to get every word from frankenstein.txt, sort the words alphabetically, eliminate one- and two-letter words, and write the result to a new file.
def Dictionary():
    d = []
    count = 0
    bad_char = '~!##$%^&*()_+{}|:"<>?\`1234567890-=[]\;\',./ '
    replace = ' '*len(bad_char)
    table = str.maketrans(bad_char, replace)
    infile = open('frankenstein.txt', 'r')
    for line in infile:
        line = line.translate(table)
        for word in line.split():
            if len(word) > 2:
                d.append(word)
                count += 1
    infile.close()
    file = open('dictionary.txt', 'w')
    file.write(str(set(d)))
    file.close()
Dictionary()
How can I simplify it and make it more readable and also how can I make the words write vertically in the new file (it writes in a horizontal list):
abbey
abhorred
about
etc....
A few improvements below:
from string import digits, punctuation
def create_dictionary():
    words = set()
    bad_char = digits + punctuation + '...' # may need more characters
    replace = ' ' * len(bad_char)
    table = str.maketrans(bad_char, replace)
    with open('frankenstein.txt') as infile:
        for line in infile:
            line = line.strip().translate(table)
            for word in line.split():
                if len(word) > 2:
                    words.add(word)
    with open('dictionary.txt', 'w') as outfile:
        # writelines does not add newlines, so append one to each word
        outfile.writelines(word + '\n' for word in sorted(words))
A few notes:
follow the style guide
string contains constants you can use to provide the "bad characters";
you never used count (which was just len(d) anyway);
use the with context manager for file handling; and
using a set from the start prevents duplicates, but they aren't ordered (hence sorted).
Using the re module:
import re
words = set()
with open('frankenstein.txt') as infile:
    for line in infile:
        # set has no extend(); update() adds every item from an iterable
        words.update(x for x in re.split(r'[^A-Za-z]+', line) if len(x) > 2)
with open('dictionary.txt', 'w') as outfile:
    # one word per line
    outfile.writelines(word + '\n' for word in sorted(words))
In the pattern r'[^A-Za-z]+' passed to re.split, replace 'A-Za-z' with the characters that you want to keep in dictionary.txt.
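A compact sketch of the same idea using re.findall, which extracts the words directly instead of splitting on everything else. The function name unique_words is hypothetical, and lowercasing the words is an assumption based on the lowercase sample output in the question.

```python
import re


def unique_words(text, min_length=3):
    """Return sorted unique lowercase words with at least min_length letters."""
    found = re.findall(r"[A-Za-z]+", text)
    return sorted({w.lower() for w in found if len(w) >= min_length})


if __name__ == '__main__':
    sample = "The abbey. I abhorred it; about the Abbey!"
    # joining with newlines gives the vertical layout asked for
    print("\n".join(unique_words(sample)))
```

Writing "\n".join(unique_words(text)) to dictionary.txt would put one word on each line.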
