Splitting words by syllable with the CMU Pronunciation Dictionary, NLTK, and Python 3

I am working on a natural language processing project and am stuck on splitting words into syllables (using nltk and cmudict.dict() in python 3).
I currently count syllables by looking each word in my corpus up in the CMU Pronunciation Dictionary and counting the number of stress markers in its list of phonemes. This appears to work pretty well.
What I am stuck on is how to use this information to split the corresponding grapheme after counting. I do not understand how to either translate the phonemes back to graphemes (which seems error-prone) or use the list of phonemes to somehow split the grapheme.
Here is the function I wrote to do this (word tokenization happens elsewhere):
from nltk.corpus import cmudict

def getSyllables(self):
    pronunciation = cmudict.dict() # get the pronunciation dictionary
    syllableThreshold = 1 # we don't care about 1 syllable words right now
    for word in self.tokens:
        for grapheme, phonemes in pronunciation.items():
            if grapheme == word.lower(): # all graphemes are lowercase, we have to word.lower() to match
                syllableCounter = 0
                for x in phonemes[0]:
                    for y in x:
                        if y[-1].isdigit(): # an item ending in a number is a stress (syllable)
                            syllableCounter += 1
                if syllableCounter > syllableThreshold:
                    output = ' '.join([word, "=", str(syllableCounter)])
                    print(output)
                    print(phonemes)
                else:
                    print(word)
Just as an example, my current output is:
Once
an
angry = 2
[['AE1', 'NG', 'G', 'R', 'IY0']]
man
How can I split the word angry, for example, into an - gry?
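For reference, the counting step described above can be sketched with a direct dictionary lookup instead of a scan over every entry for each token. This is only a sketch under assumptions: `pronunciations` is a hypothetical stand-in shaped like the result of `cmudict.dict()` (word mapped to a list of pronunciations) and holds only the entry shown in the output above, and `count_syllables` is an illustrative name. It does not solve the splitting of the spelling itself.

```python
# Hypothetical stand-in for cmudict.dict(): word -> list of pronunciations.
# Only the entry shown in the question's output is included.
pronunciations = {
    "angry": [["AE1", "NG", "G", "R", "IY0"]],
}

def count_syllables(word, pronunciations):
    """Return the syllable count of the first listed pronunciation, or None if unknown."""
    entries = pronunciations.get(word.lower())  # direct lookup; keys are lowercase
    if not entries:
        return None
    # A phoneme ending in a digit carries a stress marker, i.e. it is a vowel.
    return sum(1 for phoneme in entries[0] if phoneme[-1].isdigit())

print(count_syllables("Angry", pronunciations))  # 2
```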

Related

(Python) Why doesn't my program run anything while my CPU usage skyrockets?

So I was following this tutorial on YouTube; however, when it comes to running the program, nothing shows up in the terminal while my CPU usage doubles. I am using VS Code to run this program. P.S. I am very new to this and frequently have trouble with coding in general, so I decided to resort to asking questions.
import random
from words import words
import string

def get_valid_word(words):
    word = random.choice(words) #randomly chooses something from the list
    while '-' in word or '' in word: #helps pick words that have a dash or space in it
        word = random.choice(words)
    return word.upper() #returns in uppercase

def hangman():
    word = get_valid_word(words)
    word_letters = set(word) # saves all the letters in the word
    alphabet = set(string.ascii_uppercase) #imports predetermined list from english dictionary
    used_letters = set() #keeps track of what user guessed
    #getting user input
    while len(word_letters) > 0: #keep guessing till you have no letters to guess the word
        #letters used
        #' '.join(['a', 'b', 'cd']) --> 'a b cd'
        print('you have used these letters: ',' '.join(used_letters))
        #what the current word is (ie W - R D)
        word_list = [letter if letter in used_letters else '-' for letter in word]
        print('current word: ',' '.join(word_list))
        user_letter = input('Guess a letter: ')
        if user_letter in alphabet - used_letters: #letters haven't used
            used_letters.add(user_letter) #add the used letters
            #if the letter you've just guessed is in the word, remove it from word_letters,
            #so the set of letters still to guess shrinks with every correct guess
            if user_letter in word_letters:
                word_letters.remove(user_letter)
        elif user_letter in used_letters:
            print('You have used this')
        else:
            print('Invalid')
    #gets here when len(word_letters) == 0

hangman()
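One likely cause of the hang, offered as a sketch rather than a full review of the program: the empty string `''` is a substring of every Python string, so the condition `'' in word` is always true and the `while` loop in `get_valid_word` never exits, which would explain the busy CPU and the silent terminal. The version below assumes the intent was to reject words containing a space; `sample_words` is a hypothetical stand-in for the list imported from `words`.

```python
import random

# Hypothetical stand-in for the list imported from the words module.
sample_words = ["ice-cream", "hot dog", "python", "hangman"]

def get_valid_word(words):
    """Pick a random word that contains neither a dash nor a space."""
    word = random.choice(words)
    while '-' in word or ' ' in word:  # ' ' (a space), not '' (the empty string)
        word = random.choice(words)
    return word.upper()

print('' in 'python')  # True: the empty string is found in every string
print(get_valid_word(sample_words))
```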

find a sequence in string

Hi, I've got a problem set in CS50 and am having difficulties, as this is my first week in Python. I would appreciate it if you didn't write out a full answer directly but pointed me to the right functions or methods to use.
We've been given a long string sequence in a .txt file: one line, no white space. I have to find the longest consecutive run of a given DNA string.
Example txt:
GGAGGCCAAAGTCTTGTGATATCGGGCAACTCCCCGGGAGGAACACAGGCCCACCGAAAACAGCTTGAAATGGGAAACGTTCCCGATCTACGCCGGGCCAGAGG
The original text is around 5000 characters, but it goes like the example above. My task is to find the longest consecutive run of the 'AGATC' string.
Let's say the first consecutive run is 23 repeats long; if I keep reading and find another consecutive run of 34 repeats, I have to store the bigger number.
My problem is not finding a way to read and analyse a string. I can read a string, find the total number of repetitions and so on, but finding the longest repetition has not made sense in any way I've tried. I thought C was hard, but I could write this code in C easily, as we can manipulate strings in so many ways in C. At least in C there are ways to read a given size, but as far as I can see Python reads everything at once and there is no control over the read. In Python it doesn't seem you can do much with it, at least at my level of knowledge at the moment :/ Python probably has a one-line solution for this; please don't judge, this is my 3rd day and 4th program in Python.
What functions or methods should I look at to analyze a string in this way? I've watched videos about a similar thing, but for a sequence of a single character, not a string. I also bought Python Crash Course to get some knowledge about string manipulation but couldn't find anything related to this case. I also checked the Python documentation, but obviously it's too complicated for day 3 in Python.
Could anyone help me, please? TIA
Here is my not-working and not-making-sense code:
import csv
import sys

#check the arguments count
if len(sys.argv) != 3:
    print("Usage: python dna.py data.csv sequence.txt")
    sys.exit(1)

#create a dictionary to store str results
SEQ = {
    "AGATC": 0,
    "AATG": 0,
    "TATC": 0
}

counter = 0 #keeps the length of the sequence
seq = 0 #keeps the longest sequence
DNA = '' ## keeps the key of SEQ, "AGATC" etc.

#find the longest consecutive sequence of DNA
def findSEQ(file, DNA): #get the sequences text file and the string of the key as parameters
    for DNA in (DNA, file):
        if file[i:i + len(DNA)] == DNA: #if find a match
            counter += 1 #count up the sequence
        else:
            if counter > seq: #if it's not a sequence the next thing it reads
                seq = counter
                counter = 0
    return seq

seq = 0

#open sequence file and read
with open(sys.argv[2],'r') as file:
    reader = csv.reader(file)
    #find the longest sequence of AGATC
    findSEQ("AGATC", file)
    #update the seq dictionary
    SEQ["AGATC"] = seq
    #find the longest sequence of AATG
    findSEQ(file, "AATG")
    #update the seq dictionary
    SEQ["AATG"] = seq
    #find the longest sequence of TATC
    findSEQ(file, "TATC")
    #update the seq dictionary
    SEQ["TATC"] = seq

#open and read database
with open(sys.argv[1], "r") as file:
    reader = csv.reader(file)
    #skip the first row
    next(reader)
    #compare the seq dictionary results with database
    for row in reader:
        seq1, seq2, seq3 = row[1], row[2], row[3]
        #if found any match print the name
        if SEQ[seq1] == row[1] and SEQ[seq2] == row[2] and SEQ[seq3] == row[3]:
            print(row[0])
        #otherwise print not found
        else:
            print("Not found any match.")
To elaborate on my comment, please find the following example:
import re
text = 'GGAGGCCAAGATCAAGTCTTGTGATATCGGGCAACTCCCCGGGAAGATCAGATCAGATCGGAACACAGGCCCACCGAAAACAGCTTGAAGATCAATGGGAAACGTTCCCGATCTACGCCGGGCCAGAGG'
sequence = 'AGATC'
pattern = f'(?:{sequence})+'  # one or more back-to-back repeats of the sequence
findings = sorted(re.findall(pattern, text), key=len)  # shortest run first
longest_sequence = len(findings[-1]) // len(sequence)  # integer count of repeats in the longest run
print(f'longest sequence: {longest_sequence}')
This program uses regex (regular expressions) to find runs of the pattern you're looking for. It then sorts the findings by length (in ascending order), so the longest run is the last item in the list.
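Since the question describes attempting this with a loop, the same idea can also be sketched without regex by comparing slices of the text. This is a minimal sketch, not the course's reference solution; `longest_run` is an illustrative name, and the short test string is made up for the example.

```python
def longest_run(text, sequence):
    """Return the largest number of back-to-back repeats of sequence in text."""
    if not sequence:  # an empty sequence would match forever, so treat it as no run
        return 0
    best = 0
    step = len(sequence)
    for start in range(len(text)):
        count = 0
        position = start
        # Keep stepping forward while the next slice is another copy of sequence.
        while text[position:position + step] == sequence:
            count += 1
            position += step
        best = max(best, count)
    return best

print(longest_run("AGATCAGATCTTAGATC", "AGATC"))  # 2
```

Slicing past the end of a string simply returns a shorter string, so the comparison fails cleanly at the end of the text without any bounds checks.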

Splitting a sentence in Python using iteration

I have a challenge in my class: split a sentence into a list of separate words using iteration. I can't use any .split functions. Does anybody have any ideas?
sentence = 'how now brown cow'
words = []
wordStartIndex = 0
for i in range(0,len(sentence)):
    if sentence[i:i+1] == ' ':
        if i > wordStartIndex:
            words.append(sentence[wordStartIndex:i])
            wordStartIndex = i + 1
# append the final word, including a single-letter one
if wordStartIndex < len(sentence):
    words.append(sentence[wordStartIndex:len(sentence)])
for w in words:
    print('word = ' + w)
Needs tweaking for leading spaces or multiple spaces or punctuation.
I never miss an opportunity to drag out itertools.groupby():
from itertools import groupby
sentence = 'How now brown cow?'
words = []
for isalpha, characters in groupby(sentence, str.isalpha):
    if isalpha: # characters are letters
        words.append(''.join(characters))
print(words)
OUTPUT
% python3 test.py
['How', 'now', 'brown', 'cow']
%
Now go back and define what you mean by 'word', e.g. what do you want to do about hyphens, apostrophes, etc.
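As one possible way to act on that last point, the `groupby()` approach can take a custom key function that decides which characters count as part of a word. The sketch below assumes hyphens and apostrophes should stay inside words; `split_words` and `extra_chars` are illustrative names, not part of `itertools`.

```python
from itertools import groupby

def split_words(sentence, extra_chars="-'"):
    """Split sentence into words made of letters plus any characters in extra_chars."""
    def is_word_char(character):
        return character.isalpha() or character in extra_chars

    # groupby yields consecutive runs of characters that share the same key value.
    return [''.join(run) for is_word, run in groupby(sentence, is_word_char) if is_word]

print(split_words("Don't stop, well-known cow?"))
# ["Don't", 'stop', 'well-known', 'cow']
```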

Compressing a sentence in python

So I have this task for school, and it states:
Develop a program that identifies individual words in a sentence, stores these in a list and replaces each word with the position of that word in the list. The sentence has to be input by the user of the program.
The example we have to use is:
ask not what your country can do for you ask what you can do for your country
should become:
1,2,3,4,5,6,7,8,9,1,3,9,6,7,8,4,5
I have no idea how to even go about starting this task.
You could do something like this:
sentence = input().split()  # split the user's sentence on whitespace
word_list = []  # unique words, in order of first appearance
positions = []
for word in sentence:
    if word not in word_list:
        word_list.append(word)
    # list indexes are 0-based, so add 1 to match the numbering in the example
    positions.append(str(word_list.index(word) + 1))
print(','.join(positions))
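The same task can also be sketched with a dictionary that remembers each word's position, which avoids searching the list for every word. This is only an illustration: `compress` is a made-up name, and the sentence is passed in as an argument instead of read with `input()` so the example is easy to check against the expected output from the question.

```python
def compress(sentence):
    """Replace each word with the 1-based position of its first occurrence."""
    positions = {}  # word -> position in the list of unique words
    output = []
    for word in sentence.split():
        if word not in positions:
            positions[word] = len(positions) + 1  # next free 1-based position
        output.append(str(positions[word]))
    return ','.join(output)

example = "ask not what your country can do for you ask what you can do for your country"
print(compress(example))  # 1,2,3,4,5,6,7,8,9,1,3,9,6,7,8,4,5
```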

Stemming: I need to write code for this

If you search for something in Google and use a word like "running", Google is smart enough to match "run" or "runs" as well. That's because search engines do what's called stemming before matching words.
In English, stemming involves removing common endings from words to produce a base word. It's hard to come up with a complete set of rules that work for all words, but this simplified set does a pretty good job:
If the word starts with a capital letter, output it without changes.
If the word ends in 's', 'ed', or 'ing' remove those letters, but if the resulting stemmed word is only 1 or 2 letters long (e.g. chopping the ing from sing), use the original word.
Your program should read one word of input and print out the corresponding stemmed word. For example:
Enter the word: states
state
Another example interaction with your program is:
Enter the word: rowed
row
Remember that capitalised words should not be stemmed:
Enter the word: James
James
and nor should words that become too short after stemming:
Enter the word: sing
sing
Here is the code:
word = input("Enter the word:")
x = 'ing'
y = 'ed'
z = 's'
first = word[:1]
last = word[-1:]
uppercase = first.upper
if word == uppercase:
    print("")
elif (x in word) == True:
    word = (word.replace('ing',''))
    print(word)
elif (y in word) == True:
    word = (word.replace('ed',''))
    print(word)
elif (z in word) == True:
    word = (word.replace('s',''))
    print(word)
I see two options. Either this is a homework question, in which case - please try to solve your own homework.
The other case - you need this in real life. If so, please look at NLTK for Python natural language processing needs. In particular see http://nltk.org/api/nltk.stem.html
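Install the NLTK toolkit and try this:
from nltk.stem.porter import PorterStemmer
word = "running"  # example input
stemmer = PorterStemmer()  # stem() is an instance method, so create a stemmer first
print(stemmer.stem(word))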
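For comparison, the simplified rules quoted in the question can be sketched in plain Python without NLTK. This is a minimal sketch of those stated rules only, not a real stemmer; `simple_stem` is an illustrative name, and checking the suffixes in the order 'ing', 'ed', 's' is an assumption, since the question does not specify an order.

```python
def simple_stem(word):
    """Apply the simplified stemming rules from the question to a single word."""
    if word[:1].isupper():  # capitalised words are returned unchanged
        return word
    for suffix in ('ing', 'ed', 's'):
        if word.endswith(suffix):
            stemmed = word[:-len(suffix)]
            # Keep the original word if the stem would be too short (1 or 2 letters).
            return stemmed if len(stemmed) > 2 else word
    return word

for example in ('states', 'rowed', 'James', 'sing'):
    print(simple_stem(example))
```

Using `endswith()` rather than `in` and `replace()` matters here: it only strips an ending, so a word such as 'singer' is not altered just because it contains 'ing'.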
