I'm trying to make a program that grabs a random word from a JSON file and prints it and its definition using PyDictionary. It works occasionally, but I think the issue I'm having is with displaying the output of dictionary.meaning(word) when the word has multiple meanings. I get an IndexError when that appears to be the case.
Example outputs:
expected: tinamidae Noun ['comprising the tinamous']
unwanted result: unmaterially Error: The Following Error occured: list index out of range No definition found!
import json
import random
from PyDictionary import PyDictionary

dictionary = PyDictionary()

with open('C:\\Users\\jabes\\Desktop\\words_dictionary.json') as json_file:
    words = json.load(json_file)
    word = random.choice(list(words.keys()))
    print(word)

    try:
        meanings = dictionary.meaning(word)
        if meanings:
            for k, v in meanings.items():
                print(k, v)
        else:
            print("No definition found!")
    except Exception as error:
        print(error)
        print("Exiting!")
I am trying to delete a phrase from a text file using numpy. I have tried num = [] with num.append(num1), and opening the file with 'a' instead of 'w' to write it back. With append mode the phrase is not deleted; with write mode the first run deletes the phrase, the second run deletes the second line (which is not the phrase), and the third run empties the file.
import numpy as np

phrase = 'the dog barked'
num = 0
with open("yourfile.txt") as myFile:
    for num1, line in enumerate(myFile, 1):
        if phrase in line:
            num += num1
        else:
            break

a = np.genfromtxt("yourfile.txt", dtype=None, delimiter="\n", encoding=None)
with open('yourfile.txt', 'w') as f:
    for el in np.delete(a, (num), axis=0):
        f.write(str(el) + '\n')
'''
the bird flew
the dog barked
the cat meowed
'''
I think you can still use nums.append(num1) with 'w' mode. The issue, I think, is that you enumerated myFile's lines starting from 1 instead of the 0-based indexing the numpy array expects. Changing enumerate(myFile, 1) to enumerate(myFile, 0) seems to fix the issue:
import numpy as np

phrase = 'the dog barked'
nums = []
with open("yourfile.txt") as myFile:
    for num1, line in enumerate(myFile, 0):
        if phrase in line:
            nums.append(num1)

a = np.genfromtxt("yourfile.txt", dtype=None, delimiter="\n", encoding=None)
with open('yourfile.txt', 'w') as f:
    for el in np.delete(a, nums, axis=0):
        f.write(str(el) + '\n')
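For what it's worth, the same result can also be had without numpy at all; a minimal sketch, assuming the whole file fits in memory:

phrase = 'the dog barked'

# Read all lines, then write back only the ones that do not contain the phrase.
with open("yourfile.txt") as myFile:
    lines = myFile.readlines()

with open("yourfile.txt", "w") as f:
    for line in lines:
        if phrase not in line:
            f.write(line)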
I am working on this movie classification problem: https://www.tensorflow.org/tutorials/keras/text_classification
In this example, text files (12,500 files with movie reviews) are read and a batched dataset is prepared like below:
raw_train_ds = tf.keras.preprocessing.text_dataset_from_directory(
    'aclImdb/train',
    batch_size=batch_size,
    validation_split=0.2,
    subset='training',
    seed=seed)
At the time of standardization:
def custom_standardization(input_data):
    lowercase = tf.strings.lower(input_data)
    stripped_html = tf.strings.regex_replace(lowercase, '<br />', ' ')
    # I WANT TO REMOVE STOP WORDS HERE, CAN I DO THAT?
    return tf.strings.regex_replace(stripped_html,
                                    '[%s]' % re.escape(string.punctuation), '')
Problem: I understand that I now have the training dataset with labels in the variable raw_train_ds. I want to iterate over this dataset, remove the stop words from the movie review text, and store the result back in the same variable. I tried to do it in the function custom_standardization, but it gives a TypeError.
I also tried tf.strings.as_string, but it returns the error:
InvalidArgumentError: Value for attr 'T' of string is not in the list of allowed values: int8, int16, int32, int64
Can someone please help with this, or simply explain how to remove stop words from the batched dataset?
It looks like TensorFlow does not currently have built-in support for stop word removal, just basic standardization (lowercasing and punctuation stripping). The TextVectorization layer used in the tutorial supports a custom standardization callback, but I couldn't find any stop word examples for it.
Since the tutorial downloads the IMDB dataset and reads the text files from disk, you can just do the standardization manually with Python before reading them. This modifies the text files themselves, but you can then read the files in normally using tf.keras.preprocessing.text_dataset_from_directory, and the entries will already have the stop words removed.
#!/usr/bin/env python3
import pathlib
import re

from bs4 import BeautifulSoup
from nltk.corpus import stopwords

stop_words = set(stopwords.words("english"))

def cleanup_text_files_in_folder(folder_name):
    text_files = []
    for file_path in pathlib.Path(folder_name).glob('*.txt'):
        text_files.append(str(file_path))
    print(f'Found {len(text_files)} files in {folder_name}')

    # Give some kind of status
    i = 0
    for text_file in text_files:
        replace_file_contents(text_file)
        i += 1
        if i % 1000 == 0:
            print("No of files processed =", i)
    return text_files

def replace_file_contents(input_file):
    """
    Read in the contents of the text file, process it (clean up, remove stop words)
    and overwrite that same file with the new 'processed' output.
    """
    with open(input_file, 'r') as file:
        file_data = file.read()

    file_data = process_text_adv(file_data)

    with open(input_file, 'w') as file:
        file.write(file_data)

def process_text_adv(text):
    # review without HTML tags
    text = BeautifulSoup(text, features="html.parser").get_text()
    # review without punctuation (note: flags must be passed by keyword,
    # otherwise re.UNICODE is silently treated as the count argument)
    text = re.sub(r'[^\w\s]', '', text, flags=re.UNICODE)
    # lowercase
    text = text.lower()
    # simple split
    text = text.split()
    swords = set(stopwords.words("english"))  # conversion into a set for fast lookups
    text = [w for w in text if w not in swords]
    # join the split words back with spaces and return
    return " ".join(text)

if __name__ == "__main__":
    # Download & untar the dataset beforehand; running this modifies the text files
    # in place. Back up the originals if that's a concern.
    cleanup_text_files_in_folder('aclImdb/train/pos/')
    cleanup_text_files_in_folder('aclImdb/train/neg/')
    cleanup_text_files_in_folder('aclImdb/test/pos/')
    cleanup_text_files_in_folder('aclImdb/test/neg/')
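If you would rather keep everything inside the TextVectorization pipeline instead of rewriting the files, one alternative (a sketch I haven't benchmarked, assuming the NLTK stop word list is available when the model is built) is to fold the stop words into a single regex and strip them inside custom_standardization:

import re
import string

import tensorflow as tf
from nltk.corpus import stopwords

# Build one alternation pattern like r'\b(i|me|my|...)\b' up front.
stop_words = stopwords.words("english")
stopword_pattern = r'\b(' + '|'.join(re.escape(w) for w in stop_words) + r')\b'

def custom_standardization(input_data):
    lowercase = tf.strings.lower(input_data)
    stripped_html = tf.strings.regex_replace(lowercase, '<br />', ' ')
    # Remove whole-word stop word matches before stripping punctuation.
    no_stopwords = tf.strings.regex_replace(stripped_html, stopword_pattern, '')
    return tf.strings.regex_replace(no_stopwords,
                                    '[%s]' % re.escape(string.punctuation), '')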
I recently installed tabulate into my conda environment and I am trying to tabulate my results in a print statement (source: Printing Lists as Tabular Data), but I am getting "TypeError: 'headers' is an invalid keyword argument for print()".
I have tried print(tabulate([['Alice', 24], ['Bob', 19]], headers=['Name', 'Age'], tablefmt='orgtbl')).
from tabulate import tabulate

i: int
with open("incre.txt", "w") as file:
    for i in range(1, 100, 5):
        mol = int((i*50)/(i+50))
        file.write(str(i) + " " + str(mol) + "\n")
    print(tabulate([[i], [mol]]), headers=['i', 'mol'], tablefmt='orgtbl')
file.close()
The expected result would be a table of i and mol values, but I am getting the TypeError above. What am I missing here?
There is a mistake in where you placed your parentheses; try this line instead:
print(tabulate([[i], [mol]], headers=['i' , 'mol'], tablefmt='orgtbl'))
What you were doing was like doing this:
x = tabulate([[i], [mol]])
print(x, headers=['i', 'mol'], tablefmt='orgtbl')
As you can see there, you were trying to call the print function with the headers and tablefmt keyword arguments, which caused the error: 'headers' is an invalid keyword argument for print().
Update:
I'm not sure, but I think what you are trying to achieve is:
from tabulate import tabulate

values = []
for i in range(1, 100, 5):
    mol = int((i*50)/(i+50))
    values.append([i, mol])

print(tabulate(values, headers=['i', 'mol'], tablefmt='orgtbl'))
In your code, you were printing i and mol only after exiting the for loop, so you would only have printed their last values.
I'm getting an error message from this line: cv.fit(bigdf['Description'])
I noticed that this error started happening after I created the Tokenize and RemoveStopWords functions; the rows returned in the pandas DataFrame now look like this:
['PS4', 'SpiderMan'], ['XBOX', 'SpiderMan'], ['XBOX', 'Blackops 4']
whereas before they were entire sentences (and the fit command was working before I created Tokenize(sentence) and RemoveStopWords(sentence)):
['PS4 SpiderMan'], ['XBOX SpiderMan'], ['XBOX Blackops 4']
Is there any way to get the fit to work with tokenized values, or some way of converting these tokens back to a sentence? I am using stemming and a stop word library in Portuguese.
def StemmingPortuguese(sentence):
    phrase = []
    for word in sentence:
        phrase.append(stemmer.stem(word))
    return phrase

def RemoveStopWords(sentence):
    return [word for word in sentence if word not in stopwords]

def TreatPortuguese(sentence):
    return StemmingPortuguese(RemoveStopWords(Tokenize(remove_accents(sentence))))

def Tokenize(sentence):
    sentence = sentence.lower()
    sentence = nltk.word_tokenize(sentence)
    return sentence

trainData = []
for name in files:
    if name.endswith(".txt"):
        #print(os.path.basename(name))
        trainData.append(pd.read_csv(os.path.basename(name), converters={'Description': TreatPortuguese},
                                     quoting=csv.QUOTE_NONE, delimiter=";", error_bad_lines=False,
                                     names=["Product", "Description", "Brand", "CategoryID"]))

bigdf = pd.concat(trainData)
print(bigdf['CategoryID'].value_counts())
print(bigdf[:2])

cv = CountVectorizer(analyzer="word")
cv.fit(bigdf['Description'])
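One way to get the fit to accept the already-tokenized column (a sketch, not tested against your data) is either to pass a callable analyzer so CountVectorizer treats each row as a ready-made list of tokens, or to join the tokens back into sentences:

from sklearn.feature_extraction.text import CountVectorizer

# Option 1: each row of bigdf['Description'] is already a list of tokens,
# so bypass CountVectorizer's own tokenization with a callable analyzer.
cv = CountVectorizer(analyzer=lambda tokens: tokens)
cv.fit(bigdf['Description'])

# Option 2: join the token lists back into plain sentences and keep the default analyzer.
cv = CountVectorizer(analyzer="word")
cv.fit(bigdf['Description'].str.join(' '))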
I have the following code where I am facing an error and am unable to identify the actual issue. The code takes a .json file that holds words and their meanings and finds the exact or nearest matches for the word given as input by the user, along with their meanings. The code was running fine until I tried to modify it a little: I wanted it to also match words where the first letter is capitalized, in the following line, after which it started throwing an exception:
Changed line:
if (word != "") and ((word in data.keys()) or (word.capitalize() in data.keys())):
Code:
import json
import difflib

def searchWord(word):
    if (word != "") and ((word in data.keys()) or (word.capitalize() in data.keys())):
        return data[word]
    else:
        closematch = difflib.get_close_matches(word, data.keys())[0]
        confirmation = (input(f"\nDid you mean: {closematch} (y/n): ")).lower()
        if confirmation == 'y':
            return data[closematch]
        else:
            return 'Word Not Found in Dictionary'

print('Loading Data...\n')
data = json.load(open('data.json'))
print('Data Loaded!\n')

word = (input('Enter word to lookup in dictionary: ')).lower()
meanings = searchWord(word)
if meanings == list:
    for meaning in meanings:
        print("\n" + meaning)
else:
    print(meanings[0])
Error:
Loading Data...
Data Loaded!
Enter word to lookup in dictionary: delhi
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
E:\Learning\Python\AdvancedPython\PythonMegaCourse\DictionaryApp\dictionary.py in <module>()
20 word = (input('Enter word to lookup in dictionary: ')).lower()
21
---> 22 meanings = searchWord(word)
23 if meanings == list:
24 for meaning in meanings:
E:\Learning\Python\AdvancedPython\PythonMegaCourse\DictionaryApp\dictionary.py in searchWord(word)
4 def searchWord(word):
5 if (word != "") and ((word in data.keys()) or (word.capitalize() in data.keys())):
----> 6 return data[word]
7 else:
8 closematch = difflib.get_close_matches(word,data.keys())[0]
KeyError: 'delhi'
The .json file does have a key named Delhi; however, the capitalize() doesn't seem to work.
When you try to access the word in the dictionary, you are not capitalizing it. This is not the cleanest way to handle it, but it should give you the idea:
if (word != "") and (word in data.keys()):
return data[word]
if (word != "") and (word.capitalize() in data.keys()):
return data[word.capitalize()]
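A slightly tidier sketch of the same idea (it assumes data has already been loaded from data.json, as in the original code):

def searchWord(word):
    # Try the word as typed, then with its first letter capitalized,
    # and return the entry for whichever form is actually in the dictionary.
    for candidate in (word, word.capitalize()):
        if candidate and candidate in data:
            return data[candidate]

    matches = difflib.get_close_matches(word, data.keys())
    if matches:
        closematch = matches[0]
        if input(f"\nDid you mean: {closematch} (y/n): ").lower() == 'y':
            return data[closematch]
    return 'Word Not Found in Dictionary'

Separately, meanings == list compares the result with the type object itself and is never true; isinstance(meanings, list) is probably what was intended there.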