How can I simplify and format this function? - python-3.x

So I have this messy code where I want to get every word from frankenstein.txt, sort them alphabetically, eliminate one- and two-letter words, and write them into a new file.
def Dictionary():
    d = []
    count = 0
    bad_char = '~!##$%^&*()_+{}|:"<>?\`1234567890-=[]\;\',./ '
    replace = ' '*len(bad_char)
    table = str.maketrans(bad_char, replace)
    infile = open('frankenstein.txt', 'r')
    for line in infile:
        line = line.translate(table)
        for word in line.split():
            if len(word) > 2:
                d.append(word)
                count += 1
    infile.close()
    file = open('dictionary.txt', 'w')
    file.write(str(set(d)))
    file.close()
Dictionary()
How can I simplify it and make it more readable? Also, how can I make the words write vertically in the new file (it currently writes them as a horizontal list), like this:
abbey
abhorred
about
etc....

A few improvements below:
from string import digits, punctuation
def create_dictionary():
    words = set()
    bad_char = digits + punctuation + '...' # may need more characters
    replace = ' ' * len(bad_char)
    table = str.maketrans(bad_char, replace)
    with open('frankenstein.txt') as infile:
        for line in infile:
            line = line.strip().translate(table)
            for word in line.split():
                if len(word) > 2:
                    words.add(word)
    with open('dictionary.txt', 'w') as outfile:
        # writelines does not add newlines, so append one to each word
        outfile.writelines(word + '\n' for word in sorted(words))
A few notes:
follow the style guide
string contains constants you can use to provide the "bad characters";
you never used count (which was just len(d) anyway);
use the with context manager for file handling; and
using a set from the start prevents duplicates, but they aren't ordered (hence sorted).
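Since the goal is one word per line, it may help to see that `writelines` writes each string as-is and does not add line separators. A minimal sketch, using an in-memory `io.StringIO` buffer in place of `dictionary.txt` and a made-up word set (both are assumptions for illustration only):

```python
import io

# Hypothetical sample data standing in for the words collected from the book.
words = {"about", "abbey", "abhorred"}

# io.StringIO behaves like a text file opened for writing.
outfile = io.StringIO()

# writelines() writes each string unchanged, so the newline is appended here.
outfile.writelines(word + "\n" for word in sorted(words))

result = outfile.getvalue()
print(result, end="")
```

With a real file, the same generator expression can be passed to `outfile.writelines` inside the `with` block.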

Using the re module:
import re
words = set()
with open('frankenstein.txt') as infile:
    for line in infile:
        # a set has no extend(); update() adds every item from the iterable
        words.update(x for x in re.split(r'[^A-Za-z]+', line) if len(x) > 2)
with open('dictionary.txt', 'w') as outfile:
    outfile.writelines(word + '\n' for word in sorted(words))
In r'[^A-Za-z]+' in re.split, replace 'A-Za-z' with the characters you want to include in dictionary.txt.
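As an alternative to splitting on the unwanted characters, the wanted words can be matched directly with `re.findall`. A minimal sketch, assuming only ASCII letters count as word characters; the helper name `extract_words` and the sample sentence are made up for illustration:

```python
import re

def extract_words(text, min_length=3):
    """Return the set of letter-only runs that are at least min_length long."""
    # {n,} means "n or more repetitions", so shorter words never match.
    pattern = r'[A-Za-z]{%d,}' % min_length
    return set(re.findall(pattern, text))

# Hypothetical sample text, not taken from frankenstein.txt.
sample = "I am by birth a Genevese; 1818!"
print("\n".join(sorted(extract_words(sample))))
```

This folds the length check into the pattern, so no separate `len(x) > 2` filter is needed.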

Related

Concatenate returned strings into a single line python without using end=" "

#open a file for input
#loop through the contents to find four letter words
#split the contents of the string
#if length of string = 4 then print the word
my_file = open("myfile.txt", 'r')
for sentence in my_file:
    single_strings = sentence.split()
    for word in single_strings:
        if len(word) == 4:
            print(word)
I would like my code to return the four-letter words in a single string, but instead it prints each word on a new line. How can I return the words as one string, so that I can split() them and get their length to print out?
All problems are simpler when broken into small parts. First, write a function that returns a list containing all the words from a file:
def words_in_file(filename):
    with open(filename, 'r') as f:
        return [word for sentence in f for word in sentence.split()]
Then a function that filters lists of words:
def words_with_k_letters(words, k=-1):
    return filter(lambda w: len(w) == k, words)
Once you have these two functions the problem becomes trivial:
words = words_in_file("myfile.txt")
words = words_with_k_letters(words, k=4)
print(', '.join(words))
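In Python 3, `filter` returns a lazy iterator rather than a list, so it can be consumed only once; `str.join` accepts it directly. The same idea can be sketched as one function that works on any iterable of lines, which makes it easy to try without a file (the name `words_of_length` and the sample lines are hypothetical):

```python
def words_of_length(lines, k=4):
    """Return all words of exactly k characters, joined into one string."""
    matches = [word for line in lines for word in line.split() if len(word) == k]
    return " ".join(matches)

# Hypothetical sample input; an open file object could be passed instead.
sample_lines = ["this file has some four letter words\n", "and also a few more\n"]
print(words_of_length(sample_lines))
```

Because the result is a single space-separated string, calling `split()` on it gives back the individual words.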

Unable to save the file correctly

I have a text file containing a story. I want to find the word "like", get the word that follows it, and call a function to find synonyms for that word. Here is my code:
file = 'File1.txt'
with open(file, 'r') as open_file:
    read_file = open_file.readlines()
output_lines = []
for line in read_file:
    words = line.split()
    for u, word in enumerate(words):
        if 'like' == word:
            next_word = words[u + 1]
            find_synonymous(next_word )
    output_lines.append(' '.join(words))
    with open(file, 'w') as open_file:
        open_file.write(' '.join(words))
I think my only problem is with the text itself, because when the file holds a single sentence that includes the word "like" (for example, 'I like movies'), it works. But when the file contains many sentences and I run the code, it deletes all the text. Does anyone know where the problem could be?
You have a couple of problems. find_synonymous(next_word ) doesn't replace the word in the list, so at best you will get the original text back. You do open(file, 'w') inside the for loop, so the file is overwritten for each line. next_word = words[u + 1] will raise an IndexError if like happens to be the last word on the line. And you don't handle the case where the thing that is liked continues on the next line.
In this example, I track an "is_liked" state. If a word is in the like state, it is converted. That way you can handle sentences that are split across lines and don't have to worry about index errors. The list is written to the file outside the loop.
file = 'File1.txt'
with open(file, 'r') as open_file:
    read_file = open_file.readlines()
output_lines = []
is_liked = False
for line in read_file:
    words = line.split()
    for u, word in enumerate(words):
        if is_liked:
            words[u] = find_synonymous(word)
            is_liked = False
        else:
            is_liked = 'like' == word
    output_lines.append(' '.join(words) + '\n')
with open(file, 'w') as open_file:
    open_file.writelines(output_lines)
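The state-tracking idea can also be tried on in-memory lines before touching the real file. A minimal sketch, in which `replace_after_like` is a hypothetical helper and `str.upper` is only a stand-in for `find_synonymous`, which is not shown in the question:

```python
def replace_after_like(lines, replacer):
    """Return new lines where the word following 'like' is passed through replacer."""
    output_lines = []
    is_liked = False  # True when the previous word was 'like'
    for line in lines:
        words = line.split()
        for i, word in enumerate(words):
            if is_liked:
                words[i] = replacer(word)
                is_liked = False
            else:
                is_liked = word == 'like'
        output_lines.append(' '.join(words) + '\n')
    return output_lines

# Hypothetical sample text; the second 'like' ends a line on purpose.
sample = ["I like movies and I like\n", "music too\n"]
result = replace_after_like(sample, str.upper)
print(''.join(result), end='')
```

Because `is_liked` survives from one line to the next, a word that follows "like" across a line break is still replaced, and no index past the end of the list is ever read.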

Counting the occurrences of all letters in a txtfile [duplicate]

I'm trying to open a file and count the occurrences of letters.
So far this is where I'm at:
def frequencies(filename):
    infile=open(filename, 'r')
    wordcount={}
    content = infile.read()
    infile.close()
    counter = {}
    # \u2018 is a left single quotation mark, \u2014 is an em dash
    invalid = "\u2018'`,.?!:;-_\n\u2014' '"
    for word in content:
        word = content.lower()
        for letter in word:
            if letter not in invalid:
                if letter not in counter:
                    counter[letter] = content.count(letter)
                    print('{:8} appears {} times.'.format(letter, counter[letter]))
Any help would be greatly appreciated.
One way is to use the numpy package. An example would look like this:
import numpy
text = "xvasdavawdazczxfawaczxcaweac"
text = list(text)
a,b = numpy.unique(text, return_counts=True)
x = sorted(zip(b,a), reverse=True)
print(x)
In your case, you can combine all your words into a single string, then convert the string into a list of characters.
If you want to remove everything except letters, you can use a regex to clean it:
import re
#clean all except character
content = re.sub(r'[^a-zA-Z]', r'', content)
#convert to list of char
content = list(content)
a,b = numpy.unique(content, return_counts=True)
x = sorted(zip(b,a), reverse=True)
print(x)
If you are looking for a solution not using numpy:
# \u2018 is a left single quotation mark, \u2014 is an em dash
invalid = set([ch for ch in "\u2018'`,.?!:;-_\n\u2014' '"])
def frequencies(filename):
    counter = {}
    with open(filename, 'r') as f:
        for ch in (char.lower() for char in f.read()):
            if ch not in invalid:
                if ch not in counter:
                    counter[ch] = 0
                counter[ch] += 1
    results = [(counter[ch], ch) for ch in counter]
    return sorted(results)
for result in reversed(frequencies(filename)):
    print(result)
I would suggest using collections.Counter instead.
Compact Solution
from collections import Counter
from string import ascii_lowercase # a-z string
VALID = set(ascii_lowercase)
with open('in.txt', 'r') as fin:
    counter = Counter(char.lower() for line in fin for char in line if char.lower() in VALID)
print(counter.most_common()) # print values in order of most common to least.
More readable solution.
from collections import Counter
from string import ascii_lowercase # a-z string
VALID = set(ascii_lowercase)
with open('in.txt', 'r') as fin:
    counter = Counter()
    for char in (char.lower() for line in fin for char in line):
        if char in VALID:
            counter[char] += 1
print(counter)
If you don't want to use a Counter then you can just use a dict.
from string import ascii_lowercase # a-z string
VALID = set(ascii_lowercase)
with open('test.txt', 'r') as fin:
    counter = {}
    for char in (char.lower() for line in fin for char in line):
        if char in VALID:
            # add the letter to dict
            # dict.get used to either get the current count value
            # or default to 0. Saves checking if it is in the dict already
            counter[char] = counter.get(char, 0) + 1
# sort the values by occurrence in descending order
data = sorted(counter.items(), key = lambda t: t[1], reverse = True)
print(data)
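To try the counting logic without a file, the same approach can be wrapped in a function that takes a string. This is a sketch under the assumption that only the letters a-z should be counted; the name `letter_frequencies` and the sample text are made up for illustration:

```python
from collections import Counter
from string import ascii_lowercase

VALID = set(ascii_lowercase)

def letter_frequencies(text):
    """Count a-z letters case-insensitively, most common first."""
    counter = Counter(ch for ch in text.lower() if ch in VALID)
    # most_common() returns (letter, count) pairs sorted by descending count.
    return counter.most_common()

for letter, count in letter_frequencies("Hello, World!"):
    print('{:8} appears {} times.'.format(letter, count))
```

For a file, the text could come from `f.read()` inside a `with open(...)` block.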

Writing python scripts

I need to write a standalone program that can be run from the command line with Python. The program counts the number of characters in every line of the HumptyDumpty.txt file and outputs the counts to a new file.
Note that the new file needs to contain only the number of characters per line.
Here's my code:
import sys
infilename = sys.argv[1]
outfilename = sys.argv[2]
infile=open(infilename)
outfile=open(outfilename, 'w')
char_=0
for line in infile:
    line.split()
    char_= len(line.strip("\n"))
    outfile.write(str(char_ ))
    print(line,end='')
infile.close()
outfile.close()
The output file has only one line, the concatenation xyz, instead of
x
y
z
"\n" doesn't seem to be doing the trick. Any suggestions?
To get each count on its own line, write a newline after each one. If you don't want to count the white space between the words, replace the spaces with an empty string first.
for line in infile:
    nline = line.replace(" ", "")
    nline = nline.strip("\n")
    char= len(nline)
    outfile.write(str(char))
    outfile.write("\n")
    print(line, end='')
    print(char)
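The per-line counting can be sketched as a small function that takes any file-like objects, which makes it easy to check with in-memory buffers. The name `write_line_lengths`, the `count_spaces` flag, and the sample text are assumptions for illustration, not part of the original script:

```python
import io

def write_line_lengths(infile, outfile, count_spaces=True):
    """Write the character count of each input line on its own output line."""
    for line in infile:
        text = line.rstrip("\n")  # the line break itself is not counted
        if not count_spaces:
            text = text.replace(" ", "")
        outfile.write(str(len(text)) + "\n")

# Hypothetical stand-ins for the real input and output files.
infile = io.StringIO("Humpty Dumpty sat on a wall\nHumpty Dumpty had a great fall\n")
outfile = io.StringIO()
write_line_lengths(infile, outfile)
print(outfile.getvalue(), end="")
```

In the command-line script, the two `io.StringIO` objects would be replaced by files opened from `sys.argv[1]` and `sys.argv[2]`.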

Creating a dictionary to count the number of occurrences of Sequence IDs

I'm trying to write a function to count the number of times each sequence ID occurs in this file (it's a sample BLAST file).
The picture above is the input file I'm dealing with.
def count_seq(input):
    dic1={}
    count=0
    for line in input:
        if line.startswith('#'):
            continue
        if line.find('hits found'):
            line=line.split('\t')
            if line[1] in dic1:
                dic1[line]+=1
            else:
                dic1[line]=1
    return dic1
Above is my code, which when called just returns empty brackets {}.
So I'm trying to count how many times each of the sequence IDs (the second element of the last 13 lines) occurs, e.g. FO203510.1 occurs 4 times.
Any help would be appreciated immensely, thanks!
Maybe this is what you're after:
def count_seq(input_file):
    dic1={}
    with open(input_file, "r") as f:
        for line in f:
            line = line.strip()
            if not line.startswith('#'):
                line = line.split()
                seq_id = line[1]
                if not seq_id in dic1:
                    dic1[seq_id] = 1
                else:
                    dic1[seq_id] += 1
    return dic1
print(count_seq("blast_file"))
This is a fitting case for collections.defaultdict. Let f be the file object. Assuming the sequences are in the second column, it's only a few lines of code as shown.
from collections import defaultdict
d = defaultdict(int)
seqs = (line.split()[1] for line in f if not line.strip().startswith("#"))
for seq in seqs:
    d[seq] += 1
See if it works!
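A variation on the same idea uses `collections.Counter` and skips blank or short lines, which would otherwise raise an IndexError on `[1]`. This is a sketch that assumes the ID is the second whitespace-separated column; `count_seq_ids` is a hypothetical name and the sample rows are made up (only the ID FO203510.1 comes from the question):

```python
from collections import Counter

def count_seq_ids(lines):
    """Count the second whitespace-separated column of each non-comment line."""
    counts = Counter()
    for line in lines:
        fields = line.split()
        # Skip comment lines, blank lines, and lines with fewer than two columns.
        if line.lstrip().startswith('#') or len(fields) < 2:
            continue
        counts[fields[1]] += 1
    return counts

# Made-up rows in a tab-separated layout, for illustration only.
sample = [
    "# comment line\n",
    "query1\tFO203510.1\t99.0\n",
    "query1\tFO203510.1\t98.5\n",
    "\n",
    "query1\tXX000000.0\t97.0\n",
]
print(count_seq_ids(sample).most_common())
```

An open file object can be passed in place of the list, since the function only iterates over lines.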
