How to train/extend an nltk vocabulary in a non-English language

I'm parsing a German text with many hyphens in it. To check whether a word is a proper German word (and was only separated by a hyphen because it fell at the end of a line) or needs the hyphen because that is how it is actually written, I am currently extending a collection of lemmatized words that I found here:
https://github.com/michmech/lemmatization-lists
Can you point me to a way to do this with nltk?
What I do: when my parser encounters a word with a hyphen, I check the spelling without the hyphen (i.e. whether it is contained in my list of lemmatized words). If it is not in my list (currently some 420,000 words), I check manually whether it should be added to the list or written with a hyphen.
This is the function that does the work:
import re

# _is_url_or_mail_address and german_stopwords are not shown in this post;
# they are assumed to be defined elsewhere in the module.
def function(sents, german_words, hyphened_words):
    # characters stripped from a candidate word before the lookup
    clutter = r'[*!?,.;:_\s()\u201C\u201D\u201E\u201F\u2033\u2036\u0022]'
    sentences = list()
    new_hyphened_words = list()
    new_german_words = list()
    skip = False
    for i, sentence in enumerate(sents):
        if skip:  # this line was already merged into the previous one
            skip = False
            continue
        new_sentence = ''
        words = sentence.split(' ')
        words = list(filter(None, words))
        new_words = list()  # words to make a correct sentence
        last_word = words[-1]
        last_word = last_word.strip()
        if last_word[-1] == '-':
            try:
                next_sentence = sents[i+1]
            except IndexError as e:
                raise e
            next_words = next_sentence.split(' ')
            next_words = list(filter(None, next_words))
            first_word = next_words[0]
            new_word = last_word[:-1] + first_word
            new_word = re.sub(clutter, '', new_word)
            if _is_url_or_mail_address(new_word):
                new_words = words[:-1] + [new_word] + next_words[1:]
                skip = True
            elif new_word in german_stopwords:
                new_words = words[:-1] + [new_word] + next_words[1:]
                skip = True
            elif new_word in german_words:
                new_words = words[:-1] + [new_word] + next_words[1:]
                skip = True
            else:
                new_word = last_word + first_word  # now with hyphen!
                new_word = re.sub(clutter, '', new_word)
                if new_word in hyphened_words:
                    new_words = words[:-1] + [new_word] + next_words[1:]
                    skip = True
                else:  # found neither with nor without hyphen
                    with_hyphen = re.sub(clutter, '', last_word + first_word)
                    without_hyphen = re.sub(clutter, '', last_word[:-1] + first_word)
                    print(f'1: {with_hyphen}, 2: {without_hyphen}')
                    choose = input('1 or 2, or . if correction')
                    if choose == '1':
                        new_hyphened_words.append(with_hyphen)
                        new_words = words[:-1] + [last_word+first_word] + next_words[1:]
                        skip = True
                    elif choose == '2':
                        new_german_words.append(without_hyphen)
                        new_words = words[:-1] + [last_word[:-1]+first_word] +\
                            next_words[1:]
                        skip = True
                    else:
                        corrected_word = input('Corrected word: ')
                        print()
                        new_german_words.append(corrected_word)
                        print(f'Added to dict: "{corrected_word}"')
                        ok = input('Also add to speech? ./n')
                        if ok == 'n':
                            speech_word = input('Speech word: ')
                            new_words = words[:-1] + [speech_word] + next_words[1:]
                            skip = True
                        else:
                            new_words = words[:-1] + [corrected_word] + next_words[1:]
                            skip = True
        else:
            new_words = words
        new_sentence = ' '.join(w for w in new_words)
        sentences.append(new_sentence)
    return sentences
The lists "german_words" and "hyphened_words" get updated every now and again so they contain the new words from the sessions before.
What I do works, but it is slow going. I have been searching for ways to do this with nltk, but I seem to have looked in the wrong places. Can you point me to a way to train an nltk collection of words, or to a more efficient way of processing this?
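As a rough sketch of the lookup described above (plain Python, not an nltk feature): if german_words and hyphened_words are kept as sets rather than lists, each membership test takes roughly constant time on average, even with 420,000 entries. The helper name resolve_hyphen and the sample words are made up for illustration.

```python
def resolve_hyphen(last_word, first_word, german_words, hyphened_words):
    """Join a line-final fragment with the first word of the next line.

    Returns the joined word, or None when neither spelling is known and a
    human decision is needed. Hypothetical helper, not from the post.
    """
    without_hyphen = last_word[:-1] + first_word
    with_hyphen = last_word + first_word
    if without_hyphen in german_words:
        return without_hyphen
    if with_hyphen in hyphened_words:
        return with_hyphen
    return None


# Sets give average constant-time membership tests; lists are scanned linearly.
german_words = {"Haustür", "Fahrrad"}    # illustrative sample only
hyphened_words = {"E-Mail", "U-Bahn"}    # illustrative sample only

print(resolve_hyphen("Haus-", "tür", german_words, hyphened_words))
print(resolve_hyphen("E-", "Mail", german_words, hyphened_words))
print(resolve_hyphen("Foo-", "bar", german_words, hyphened_words))
```

Only the third case, where neither spelling is known, would still need the manual prompt.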

Related

index-error: string out of range in wordle similar game

So I'm making a Wordle clone and I keep getting "IndexError: string index out of range", and I don't know what to do. Every time I enter the second guess, the error is raised at line 30.
import random
def Wordle():
    position = 0
    clue = ""
    print("Welcome to Wordle, in this game your're going to be guessing a random word of 5 letters, if the word have a a correct letters it will tell you, if the letter is in the word but isn't at the place that it's supposed to be it will say that the letter is in the word but ins't at the correct place, and if it ins't at the word it will say nothing.And you only have 6 tries")
    user_name = input("Enter your name:")
    valid_words = ['sweet','shark','about','maybe','tweet','shard','pleat','elder','table','birds','among','share','label','frame','water','earth','winds','empty','audio','pilot','radio','steel','words','chair','drips','mouse','moose','beach','cloud','yours','house','holes','short','small','large','glass','ruler','boxes','charm','tools']
    TheAnswer = random.choice(valid_words)
    print(TheAnswer)
    number_of_guesses = 1
    guessed_correct = False
    while number_of_guesses < 7 and not guessed_correct:
        print(' ')
        TheGuess = input('Enter your guess (it have to be a 5-letter word and not captalied letters):')
        if len(TheGuess) > 5:
            print('word is not acceptable')
        else:
            print(' ')
            for letter in TheGuess:
                if letter == TheAnswer[position]:
                    clue += ' V'
                elif letter in TheAnswer:
                    clue += ' O'
                else:
                    clue += ' F'
                position += 1
            print(clue)
            if TheGuess == TheAnswer:
                guessed_correct = True
            else:
                guessed_correct = False
                number_of_guesses += 1
                clue = ""
    if guessed_correct:
        print('Congradulations', user_name,', you won and only used', number_of_guesses,'guesses, :D')
    else:
        print(user_name,'Unfortunatealy you usead all your guesses and lost :( . the word was', TheAnswer)

You are indexing TheAnswer[position], which means that if TheAnswer has length x, position must stay between 0 and x - 1.
But position is only set to 0 once, before the while loop, so it keeps increasing across guesses and is never reset. After a 5-letter first guess it is already 5, which is out of range for a 5-letter word.
Try moving
position = 0
to the top of the while loop.
Code:
import random
def Wordle():
    clue = ""
    print("Welcome to Wordle, in this game your're going to be guessing a random word of 5 letters, if the word have a a correct letters it will tell you, if the letter is in the word but isn't at the place that it's supposed to be it will say that the letter is in the word but ins't at the correct place, and if it ins't at the word it will say nothing.And you only have 6 tries")
    user_name = input("Enter your name:")
    valid_words = ['sweet', 'shark', 'about', 'maybe', 'tweet', 'shard', 'pleat', 'elder', 'table', 'birds', 'among', 'share', 'label', 'frame', 'water', 'earth', 'winds', 'empty', 'audio',
                   'pilot', 'radio', 'steel', 'words', 'chair', 'drips', 'mouse', 'moose', 'beach', 'cloud', 'yours', 'house', 'holes', 'short', 'small', 'large', 'glass', 'ruler', 'boxes', 'charm', 'tools']
    TheAnswer = random.choice(valid_words)
    print(TheAnswer)
    number_of_guesses = 1
    guessed_correct = False
    while number_of_guesses < 7 and not guessed_correct:
        position = 0  # reset the index for every new guess
        print(' ')
        TheGuess = input(
            'Enter your guess (it have to be a 5-letter word and not captalied letters):')
        if len(TheGuess) > 5:
            print('word is not acceptable')
        else:
            print(' ')
            for letter in TheGuess:
                if letter == TheAnswer[position]:
                    clue += ' V'
                elif letter in TheAnswer:
                    clue += ' O'
                else:
                    clue += ' F'
                position += 1
            print(clue)
            if TheGuess == TheAnswer:
                guessed_correct = True
            else:
                guessed_correct = False
                number_of_guesses += 1
                clue = ""
    if guessed_correct:
        print('Congradulations', user_name, ', you won and only used',
              number_of_guesses, 'guesses, :D')
    else:
        print(
            user_name, 'Unfortunatealy you usead all your guesses and lost :( . the word was', TheAnswer)
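An alternative that avoids a manual counter altogether is to let enumerate supply the index for each letter. This is only a sketch: the function name score_guess is made up here, and it assumes the guess is no longer than the answer.

```python
def score_guess(guess, answer):
    """Build the clue string: V = right place, O = in word, F = absent."""
    clue = ""
    # enumerate starts at 0 on every call, so the index cannot carry over
    # from one guess to the next
    for position, letter in enumerate(guess):
        if letter == answer[position]:
            clue += " V"
        elif letter in answer:
            clue += " O"
        else:
            clue += " F"
    return clue


print(score_guess("shark", "share"))
print(score_guess("tools", "steel"))
```

Because the index lives inside the function, there is nothing to reset between guesses.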
Thanks!

How do I print the output in one line as opposed to it creating a new line?

I cannot find where I have gone wrong with this program. It simply takes a file and reverses the text in it, but each separate sentence prints on a new line, and I need them all to print on the same line.
Here is my code for reference:
def read_file(filename):
    try:
        sentences = []
        with open(filename, 'r') as infile:
            sentence = ''
            for line in infile.readlines():
                if(line.strip())=='':continue
                for word in line.split():
                    if word[-1] in ['.', '?', '!']:
                        sentence += word
                        sentences.append(sentence)
                        sentence = ''
                    else:
                        sentence += word + ' '
        return sentences
    except:
        return None
def reverse_line(sentence):
    stack = []
    punctuation=sentence[-1]
    sentence=sentence[:-1].lower()
    words=sentence.split()
    words[-1] = words[-1].title()
    for word in words:
        stack.append(word)
    reversed_sentence = ''
    while len(stack) != 0:
        reversed_sentence += stack.pop() + ' '
    return reversed_sentence.strip()+punctuation
def main():
    filepath = input('File: ')
    sentences = read_file(filepath)
    if sentences is None:
        print('Unable to read data from file: {}'.format(filepath))
        return
    for sentence in sentences:
        reverse_sentence = reverse_line(sentence)
        print(reverse_sentence)
main()

You can use the end keyword argument:
print(reverse_sentence, end=' ')
The default value of end is '\n', which is why print normally finishes each call with a newline character.
https://docs.python.org/3.3/library/functions.html#print
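For illustration, here is a small sketch of two ways to keep the output on one line. The sample sentences are made up; joining the strings first is an added suggestion that avoids the trailing space left behind by end=' '.

```python
sentences = ["World hello.", "You are how?"]  # made-up sample data

# Option 1: replace the default newline terminator with a space
for sentence in sentences:
    print(sentence, end=' ')
print()  # finish the line once, after the loop

# Option 2: join the sentences first, then print a single time
one_line = ' '.join(sentences)
print(one_line)
```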

How can I complete this project? It is about sentiment classification

I was trying to write code for a sentiment classifier for a Twitter CSV file.
The code works, but on the Coursera platform it gets stuck and shows the error below. It is for the Coursera Python Functions, Files, and Dictionaries specialization course.
projectTwitterDataFile = open("project_twitter_data.csv","r")
resultingDataFile = open("resulting_data.csv","w")
punctuation_chars = ["'", '"', ",", ".", "!", ":", ";", '#', '#']
# lists of words to use
positive_words = []
with open("positive_words.txt") as pos_f:
    for lin in pos_f:
        if lin[0] != ';' and lin[0] != '\n':
            positive_words.append(lin.strip())
def get_pos(strSentences):
    strSentences = strip_punctuation(strSentences)
    listStrSentences= strSentences.split()
    count=0
    for word in listStrSentences:
        for positiveWord in positive_words:
            if word == positiveWord:
                count+=1
    return count
negative_words = []
with open("negative_words.txt") as pos_f:
    for lin in pos_f:
        if lin[0] != ';' and lin[0] != '\n':
            negative_words.append(lin.strip())
def get_neg(strSentences):
    strSentences = strip_punctuation(strSentences)
    listStrSentences = strSentences.split()
    count=0
    for word in listStrSentences:
        for negativeWord in negative_words:
            if word == negativeWord:
                count+=1
    print(count)
    return count
def strip_punctuation(strWord):
    for charPunct in punctuation_chars:
        strWord = strWord.replace(charPunct, "")
    return strWord
def writeInDataFile(resultingDataFile):
    resultingDataFile.write("Number of Retweets, Number of Replies, positive Score, Negative Score, Net Score")
    resultingDataFile.write("\n")
    linesPTDF = projectTwitterDataFile.readlines()
    headerDontUsed= linesPTDF.pop(0)
    for linesTD in linesPTDF:
        listTD = linesTD.strip().split(',')
        resultingDataFile.write("{}, {}, {}, {}, {}".format(listTD[1],
            listTD[2], get_pos(listTD[0]), get_neg(listTD[0]),
            (get_pos(listTD[0])- get_neg(listTD[0]))))
        resultingDataFile.write("\n")
writeInDataFile(resultingDataFile)
projectTwitterDataFile.close()
resultingDataFile.close()
Error
TimeLimitError: Program exceeded run time limit. on line 37
Description
Your program is running too long. Most programs in this book should
run in less than 10 seconds easily. This probably indicates your
program is in an infinite loop.
To Fix
Add some print statements to figure out if your program is in an
infinte loop. If it is not you can increase the run time with
sys.setExecutionLimit(msecs)

It's not very clear to me what you're trying to do. Your code isn't very well-formatted, so besides the fact that your line indentation is off, it's not immediately clear where your functions end and where new commands begin. In fact, you have commands put in between your definitions/functions. The typical convention in Python is to put your callable functions at the top, then the flow of actual code at the bottom after your functions.
I take it that positive_words.txt and negative_words.txt are just two text files where each line is either a word (with punctuation marks such as apostrophes already stripped out), something starting with a semicolon, or a blank line? If so, you can probably just do something like this to extract the lists:
with open("positive_words.txt") as f:
    positive_words = [ c.strip() for c in f.readlines() if c[0] not in [';', '\n'] ]
Also, instead of opening files and passing the open file objects into your functions, only to not use them in any other function, maybe you should just pass in the name of the file and do all of the opening and closing inside the function.
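The suggestion to pass in a file name and handle the opening and closing inside the function could look roughly like this. The helper name load_words is hypothetical, and returning a set instead of a list is an added assumption that makes later membership tests cheaper.

```python
def load_words(filename):
    """Read one word per line, skipping blank lines and ';' comment lines."""
    # the with-block closes the file as soon as the words are read
    with open(filename) as f:
        return {line.strip() for line in f
                if line.strip() and not line.startswith(';')}
```

It would be called once per word list, for example load_words("positive_words.txt"), so no file handle needs to stay open at module level.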

punctuation_chars = ["'", '"', ",", ".", "!", ":", ";", '#', '#']
def strip_punctuation(oldS):
    # remove every punctuation character from the given text
    for i in punctuation_chars:
        oldS = str(oldS).replace('%s' % i, '')
    return oldS
# list of positive words to use
positive_words = []
with open("positive_words.txt") as pos_f:
    for lin in pos_f:
        if lin[0] != ';' and lin[0] != '\n':
            positive_words.append(lin.strip())
def get_pos(text):
    # count the words of the text that appear in positive_words
    words = strip_punctuation(text).split()
    j = 0
    for i in words:
        if i in positive_words:
            j += 1
    return j
negative_words = []
with open("negative_words.txt") as pos_f:
    for lin in pos_f:
        if lin[0] != ';' and lin[0] != '\n':
            negative_words.append(lin.strip())
def get_neg(text):
    # count the words of the text that appear in negative_words
    words = strip_punctuation(text).split()
    k = 0
    for i in words:
        if i in negative_words:
            k += 1
    return k
def run(file):
    with open(file, 'r') as csvFile:
        lines = csvFile.readlines()
    lines = lines[1:]  # drop the header row
    neg_count = []
    pos_count = []
    wordList = []
    for i in lines:
        i = i.strip()
        i = i.split(",")[0]
        wordList.append(i)
    for i in wordList:
        neg_count.append(get_neg(i))
        pos_count.append(get_pos(i))
    res = []
    for i in lines:
        i = i.strip()
        i = i.split(",")[1:]
        res.append(i)
    temp = []
    for i in res:
        i = list(map(int, i))
        temp.append(i)
    res = temp
    for i in range(len(res)):
        res[i].append(pos_count[i])
        res[i].append(neg_count[i])
        res[i].append(pos_count[i] - neg_count[i])
    temp = []
    for i in res:
        temp.append(','.join('%s' %id for id in i))
    res = temp
    res.insert(0, "Number of Retweets, Number of Replies, Positive Score, Negative Score, Net Score")
    print(res)
    res = '\n'.join('%s' % id for id in res)
    with open("resulting_data.csv", 'w') as csvFile:
        write = csvFile.write(res)
if __name__ == '__main__':
    run('project_twitter_data.csv')

Return value of function gets ignored

VOWELS = ['a', 'e', 'i', 'o', 'u']
BEGINNING = ["th", "st", "qu", "pl", "tr"]
def pig_latin2(word):
    # word is a string to convert to pig-latin
    string = word
    string = string.lower()
    # get first letter in string
    test = string[0]
    if test not in VOWELS:
        # remove first letter from string skip index 0
        string = string[1:] + string[0]
        # add characters to string
        string = string + "ay"
    if test in VOWELS:
        string = string + "hay"
    print(string)
def pig_latin(word):
    string = word
    transfer_word = word
    string.lower()
    test = string[0] + string[1]
    if test not in BEGINNING:
        pig_latin2(transfer_word)
    if test in BEGINNING:
        string = string[2:] + string[0] + string[1] + "ay"
    print(string)
When I un-comment the code below and replace print(string) with return string in the two functions above, it only works for words handled in pig_latin(). As soon as a word is passed on to pig_latin2(), I get a value of None and the program crashes.
# def start_program():
#     print("Would you like to convert words or sentence into pig latin?")
#     answer = input("(y/n) >>>")
#     print("Only have words with spaces, no punctuation marks!")
#     word_list = ""
#     if answer == "y":
#         words = input("Provide words or sentence here: \n>>>")
#         new_words = words.split()
#         for word in new_words:
#             word = pig_latin(word)
#             word_list = word_list + " " + word
#         print(word_list)
#     elif answer == "n":
#         print("Goodbye")
#         quit()
# start_program()

You're not capturing the return value of the pig_latin2 function. So whatever that function does, you're discarding its output.
Fix this line in the pig_latin function:
    if test not in BEGINNING:
        string = pig_latin2(transfer_word)  # <----------- forgot 'string =' here
When fixed thusly, it works for me. Having said that, there would still be a bunch of stuff to clean up.
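The underlying behaviour can be reproduced in a few lines. In this sketch, which uses made-up function names, a function that calls a helper without passing the result back evaluates to None, which matches what the question describes.

```python
def helper(word):
    return word + "ay"


def forgets_result(word):
    helper(word)           # result is computed, then thrown away


def keeps_result(word):
    return helper(word)    # result is handed back to the caller


print(forgets_result("test"))
print(keeps_result("test"))
```

Concatenating that None with a string, as start_program does with word_list + " " + word, is what raises the TypeError that crashes the program.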

Simple Python. Not sure why my program is outputting this

I am making a program that takes in a sentence, converts each word to pig latin, and then prints it back out as a sentence. I have no idea where I have messed up. When I input a sentence and run it, it prints
built-in method lower of str object at 0x03547D40
s = input("Input an English sentence: ")
s = s[:-1]
string = s.lower
vStr = ("a","e","i","o","u")
def findFirstVowel(word):
    for index in range(len(word)):
        if word[index] in vStr:
            return index
    return -1
def translateWord():
    if(vowel == -1) or (vowel == 0):
        end = (word + "ay")
    else:
        end = (word[vowel:] + word[:vowel]+ "ay")
def pigLatinTranslator(string):
    for word in string:
        vowel = findFirstVowel(word)
        translateWord(vowel)
    return
print (string)

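You have used the lower method incorrectly.
You should call it like this: string = s.lower().
The parentheses change everything. Without them, s.lower evaluates to the method object itself instead of calling it, which is why the output reads "built-in method lower of str object".
A function or method only runs when it is followed by ().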
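A short sketch of the difference, using a made-up sample string:

```python
s = "Hello World"

method = s.lower      # no parentheses: a reference to the bound method
result = s.lower()    # parentheses: the method is actually called

print(callable(method))   # the reference is a callable object, not a string
print(result)             # the lowercased text
print(method())           # the stored reference can still be called later
```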

Here is a corrected version of the code which should work:
s = input("Input an English sentence: \n").strip()
string = s.lower()  # lowercasing
vStr = ("a","e","i","o","u")
def findFirstVowel(word):
    # index of the first vowel in the word, or -1 if there is none
    for idx, ch in enumerate(word):
        if ch in vStr:
            return idx
    return -1
def translateWord(vowel, word):
    if(vowel == -1) or (vowel == 0):
        end = (word + "ay")
    else:
        end = (word[vowel:] + word[:vowel]+ "ay")
    return end
def pigLatinTranslator(string):
    # translate word by word and rebuild the sentence
    translated = []
    for word in string.split():
        vowel = findFirstVowel(word)
        translated.append(translateWord(vowel, word))
    return " ".join(translated)
print(pigLatinTranslator(string))
