import nltk
from nltk.tokenize import PunktSentenceTokenizer, WordPunctTokenizer

file = open("C:\\Users\\file.txt")
text = file.read()

def ie_preprocess(text):
    sent_tokenizer = PunktSentenceTokenizer(text)
    sents = sent_tokenizer.tokenize(text)
    print(sents)
    word_tokenizer = WordPunctTokenizer()
    words = nltk.word_tokenize(sents)
    print(words)
    tagged = nltk.pos_tag(words)
    print(tagged)

ie_preprocess(text)
nltk.word_tokenize() takes text, which is expected to be a string, but you are passing in sents, which is a list of sentences.
Instead, you want:
words = nltk.word_tokenize(text)
If you would like to tokenize each sentence into a list of words and get this back as a list of lists, you could use
words = [nltk.word_tokenize(sentence) for sentence in sents]
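Putting that fix together, a minimal corrected version of the function might look like the sketch below (it assumes the required NLTK data, the punkt sentence model and the POS tagger model, has already been downloaded):

import nltk
from nltk.tokenize import PunktSentenceTokenizer

def ie_preprocess(text):
    # split the raw text into sentences
    sent_tokenizer = PunktSentenceTokenizer(text)
    sents = sent_tokenizer.tokenize(text)
    # tokenize each sentence into words, then POS-tag each word list
    words = [nltk.word_tokenize(sentence) for sentence in sents]
    tagged = [nltk.pos_tag(word_list) for word_list in words]
    return tagged

print(ie_preprocess("The cat sat on the mat. The dog barked."))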
Here is my list
x = ['India,America,Australia,Japan']
How to convert above list into
x = ['India','America','Australia','Japan']
I tried using the strip and split methods, but it didn't work.
You can join that list into a single string and split it on commas:
test = ['India,America,Australia,Japan']
result = "".join(test).split(",")
print(result)
Output is this:
['India', 'America', 'Australia', 'Japan']
Or you can use the re (regex) module.
import re
x = "".join(['India,America,Australia,Japan'])
xText = re.compile(r"\w+")
mo = xText.findall(x)
print(mo)
The findall method matches every run of word characters, skipping the commas, and returns the matches as a list.
Output is this:
['India', 'America', 'Australia', 'Japan']
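Since the list holds a single comma-separated string, you could also just split that one element directly; a minimal sketch of the same idea:

x = ['India,America,Australia,Japan']
result = x[0].split(",")   # split the single string element on commas
print(result)              # ['India', 'America', 'Australia', 'Japan']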
The question is to:
Firstly, find the number of all words in a text file.
Secondly, delete common words like a, an, and, to, in, at, but, ... (it is allowed to write a list of these words).
Thirdly, find the number of the remaining (unique) words.
Make a list of them.
The file name should be used as the parameter of the function.
I have done the first part of the question:
import re

file = open('text.txt', 'r', encoding='latin-1')
word_list = file.read().split()
for x in word_list:
    print(x)

res = len(word_list)
print('The number of words in the text: ' + str(res))

def uncommonWords(file):
    uncommonwords = list(file)
    for i in uncommonwords:
        i += 1
        print(i)
The code works up to printing the number of words, but nothing appears after that.
You can do it like this:
# list of common words you want to remove
stop_words = set(["is", "the", "to", "in"])

# set to collect unique words
words_in_file = set()

with open("words.txt") as text_file:
    for line in text_file:
        for word in line.split():
            words_in_file.add(word)

# remove common words from word list
unique_words = words_in_file - stop_words
print(list(unique_words))
First, you may want to get rid of punctuation: as shown in this answer, you could do:
nonPunct = re.compile('.*[A-Za-z0-9].*')
filtered = [w for w in word_list if nonPunct.match(w)]
then, you could do
from collections import Counter
counts = Counter(filtered)
You can then access the list of unique words with list(counts.keys()), and you can choose to ignore the words you don't want with
[word for word in list(counts.keys()) if word not in common_words]
Hope this answers your question.
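Putting the pieces together, here is a minimal sketch of one function that takes the file name as its parameter, counts all words, drops a hand-written list of common words, and returns the remaining unique words (the file name and the stop-word list are illustrative placeholders):

from collections import Counter

def count_and_filter_words(filename, common_words=("a", "an", "and", "to", "in", "at", "but")):
    with open(filename, encoding="latin-1") as f:
        word_list = f.read().split()
    print("The number of words in the text:", len(word_list))
    # count every word, then drop the common ones
    counts = Counter(word_list)
    unique_words = [w for w in counts if w not in common_words]
    print("The number of remaining unique words:", len(unique_words))
    return unique_words

# usage: the file name is the parameter of the function
print(count_and_filter_words("text.txt"))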
I am working on some file operations with python.
I have two text files. The first contains many lines of bigram word-embedding results such as apple_pie 0.3434 0.6767 0.2312. The second contains the corresponding unigram word-embedding results, so for apple_pie it has apple 0.2334 0.3412 0.123 pie 0.976 0.75654 0.2312.
I want to append the apple and pie unigram results to the apple_pie bigram results, so that the output becomes something like:
apple_pie 0.3434 0.6767 0.2312 0.2334 0.3412 0.123 0.976 0.75654 0.2312 on one line. Does anybody know how to do this? Thanks...
bigram = open("bigram.txt", 'r')
unigram = open("unigram.txt", 'r')
combine = open("combine.txt", 'w')

bigram_lines = bigram.readlines()
unigram_lines = unigram.readlines()

iteration = 0
while iteration < len(bigram_lines):
    # split each bigram line into its numeric fields and its word fields
    num_list_bigram = []
    text_list_bigram = []
    for item in bigram_lines[iteration].strip().split(" "):
        if "." in item:
            num_list_bigram.append(item)
        else:
            text_list_bigram.append(item)
    # same for the matching unigram line
    num_list_unigram = []
    text_list_unigram = []
    for item in unigram_lines[iteration].strip().split(" "):
        if "." in item:
            num_list_unigram.append(item)
        else:
            text_list_unigram.append(item)
    iteration += 1
    # bigram word, then bigram numbers, then unigram numbers
    com_list = text_list_bigram + num_list_bigram + num_list_unigram
    for item in com_list:
        combine.write(item + " ")
    combine.write("\n")

bigram.close()
unigram.close()
combine.close()
Hopefully this will work for you.
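A somewhat shorter variant of the same idea, assuming the two files have matching lines in the same order, is to zip the lines and append only the numeric fields of each unigram line to the corresponding bigram line (a sketch, not a drop-in replacement for the code above):

with open("bigram.txt") as bigram, open("unigram.txt") as unigram, open("combine.txt", "w") as combine:
    for bigram_line, unigram_line in zip(bigram, unigram):
        # keep only the numeric fields of the unigram line
        unigram_nums = [item for item in unigram_line.split() if "." in item]
        combine.write(bigram_line.strip() + " " + " ".join(unigram_nums) + "\n")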
I have tokenized the text from the text files stored in a list and stored the tokenized text in a variable, but when I print that variable it shows the wrong result.
import glob
files = glob.glob("D:\Pakistan Constitution\*.txt")
documents = []
for file in files:
    with open(file) as f:
        documents.append(f.read())
stoplist = set('for a of the and to in'.split())
texts = [[word for word in document.lower().split() if word not in stoplist] for document in documents]
print(texts)
I expect the tokenized words, but the result looks like this:
['ÿþp\x00a\x00r\x00t\x00', '\x00v\x00', '\x00', '\x00r\x00e\x00l\x00a\x00t\x00i\x00o\x00n\x00s\x00', '\x00b\x00e\x00t\x00w\x00e\x00e\x00n\x00',
Can anyone help me with this?
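The ÿþ prefix and the interleaved \x00 bytes in the output suggest the files are UTF-16 encoded, while open() is reading them with the platform default encoding. A minimal sketch of the reading loop with an explicit encoding, assuming the files really are UTF-16:

import glob

files = glob.glob("D:\Pakistan Constitution\*.txt")
documents = []
for file in files:
    # ÿþ at the start of the text is a UTF-16 byte-order mark, so decode explicitly
    with open(file, encoding="utf-16") as f:
        documents.append(f.read())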
I want to remove words tagged with the specific part-of-speech tags VBD and VBN from my CSV file, but I'm getting the error "IndexError: list index out of range" after entering the following code:
for word in POS_tag_text_clean:
    if word[1] != 'VBD' and word[1] != 'VBN':
        words.append(word[0])
My CSV file has 10 reviews from 10 people, and the column name is Comment.
Here is my full code:
import re
import string
import nltk
import pandas as pd
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

df_Comment = pd.read_csv("myfile.csv")

def clean(text):
    stop = set(stopwords.words('english'))
    exclude = set(string.punctuation)
    lemma = WordNetLemmatizer()
    tagged = nltk.pos_tag(text)
    text = text.rstrip()
    text = re.sub(r'[^a-zA-Z]', ' ', text)
    stop_free = " ".join([i for i in text.lower().split() if ((i not in stop) and (not i.isdigit()))])
    punc_free = ''.join(ch for ch in stop_free if ch not in exclude)
    normalized = " ".join(lemma.lemmatize(word) for word in punc_free.split())
    return normalized

text_clean = []
for text in df_Comment['Comment']:
    text_clean.append(clean(text).split())
print(text_clean)

POS_tag_text_clean = [nltk.pos_tag(t) for t in text_clean]
print(POS_tag_text_clean)

words = []
for word in POS_tag_text_clean:
    if word[1] != 'VBD' and word[1] != 'VBN':
        words.append(word[0])
How can I fix the error?
It is a bit hard to understand your problem without an example and the corresponding outputs, but it might be this:
Assuming that text is a string, text_clean will be a list of lists of strings, where every string represents a word. After the part-of-speech tagging, POS_tag_text_clean will therefore be a list of lists of tuples, each tuple containing a word and its tag.
If I'm right, then your last loop actually loops over items from your dataframe instead of words, as the name of the variable suggests. If an item has only one word (which is not so unlikely, since you filter a lot in clean()), your call to word[1] will fail with an error similar to the one you report.
Instead, try this code:
words = []
for item in POS_tag_text_clean:
    words_in_item = []
    for word in item:
        if word[1] != 'VBD' and word[1] != 'VBN':
            words_in_item.append(word[0])
    words.append(words_in_item)
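Equivalently, and a bit more compactly, the same nested filtering can be written as a list comprehension (a sketch of the same logic, not a change in behaviour):

words = [
    [word for word, tag in item if tag not in ('VBD', 'VBN')]
    for item in POS_tag_text_clean
]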