Lemmatize words in a nested list - python-3.x

How do I lemmatize the words in a nested list in a single line? I tried a few things and I'm getting close, but I think I may have the syntax wrong. How do I fix it?
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
word_list = [['test','exams','projects'],['math','exam','things']]
word_list # type list
Try #1: does the lemmatization, but not in the format I need
for word in word_list:
    for e in word:
        print(lemmatizer.lemmatize(e))  # not the result I need
Try #2: a similar approach in one line, but it does not give correct results.
[[word for word in lemmatizer.lemmatize(str(doc))] for doc in word_list]
Output needed:
[['test','exam','project'],['math','exam','thing']]

I found a for-loop solution for my question above. I couldn't get it into a single line, but it is working for now. If anyone is looking for a solution:
word_list_lemma = []
for ls in word_list:
    word_lem = []
    for word in ls:
        word_lem.append(lemmatizer.lemmatize(word))
    word_list_lemma.append(word_lem)
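For reference, the same result in a single line is a nested list comprehension. Try #2 fails because str(doc) turns each whole sublist into one string before lemmatizing; lemmatize has to be called on each word instead:
word_list_lemma = [[lemmatizer.lemmatize(word) for word in doc] for doc in word_list]
# [['test', 'exam', 'project'], ['math', 'exam', 'thing']]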

Related

error while removing the stop-words from the text

I am trying to remove stopwords from my data and I have used this statement to download the stopwords.
stop = set(stopwords.words('english'))
This list has the character 'd' as one of the stopwords. So, when I apply this to my function, it removes 'd' from the text. Please see the attached picture for reference and guide me on how to fix this.
I checked out the code and noticed that you are applying the rem_stopwords function to the clean_text column, while you should apply it to the Tweet column.
Otherwise, NLTK removes d, I, and other characters only when they are independent tokens; a token here is a word after you split on spaces. So if you have i'd, it will remove neither d nor I, since they are combined into one word. However, if you have 'I like Football', it will remove I, since it is an independent token.
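A quick illustration of that token behaviour, using a toy stop set just for the demo:
demo_stop = {'d', 'i'}
print([w for w in "i'd like football".split() if w.lower() not in demo_stop])
# ["i'd", 'like', 'football'] -- "i'd" survives, d and i are not separate tokens
print([w for w in "I like Football".split() if w.lower() not in demo_stop])
# ['like', 'Football'] -- the standalone "I" is removed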
You can try this code; it should solve your problem:
import pandas as pd
from nltk.corpus import stopwords
import nltk
nltk.download('stopwords')
stop = set(stopwords.words('english'))
df['clean_text'] = df['Tweet'].apply(lambda x: ' '.join([word for word in x.split() if word.lower() not in stop]))
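For example, with a small hypothetical DataFrame (the column names Tweet and clean_text are taken from the question; the exact output depends on your NLTK version's stopword list):
df = pd.DataFrame({'Tweet': ["I'd like to watch Football", 'This is the Tweet']})
df['clean_text'] = df['Tweet'].apply(lambda x: ' '.join([word for word in x.split() if word.lower() not in stop]))
print(df['clean_text'].tolist())
# roughly: ["I'd like watch Football", 'Tweet'] -- "I'd" survives, standalone stopwords are dropped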

Count up misspelled words in a sentence of variable size

As part of a large project, I need a function that will check for any misspelt words in a sentence; the sentence can be one word, or 30 words, or really any size.
It needs to be fast. If possible I would like to use TextBlob or pyspellchecker, as python_language_tool has problems installing on my computer.
My code so far (non-working):
def spell2():
    from textblob import TextBlob
    count = 0
    sentence = "Tish soulhd al be corrrectt"
    split_sen = sentence.split(" ")
    for thing in split_sen:
        thing = Word(thing)
        thing.spellcheck()
        # if thing is not spelt correctly add to count, if it is go to
        # next word

spell2()
This gives me the following error:
thing = Word(thing)
NameError: name 'Word' is not defined
Any suggestions appreciated:)
def spell3():
    from spellchecker import SpellChecker
    s = "Tish soulhd al be corrrectt, riiiigghtttt?"
    wordlist = s.split()
    spell = SpellChecker()
    amount_miss = len(list(spell.unknown(wordlist)))
    print("Possible amount of misspelled words in the text:", amount_miss)

spell3()
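For completeness, the NameError in spell2 happens because Word is never imported (importing TextBlob alone is not enough). A minimal TextBlob-based sketch, assuming the TextBlob corpora are downloaded:
def spell2_fixed():
    from textblob import Word  # Word is what spell2 was missing
    sentence = "Tish soulhd al be corrrectt"
    count = 0
    for token in sentence.split(" "):
        # spellcheck() returns (candidate, confidence) pairs, best first;
        # count the token as misspelt if the top suggestion differs from it
        suggestion, confidence = Word(token.lower()).spellcheck()[0]
        if suggestion != token.lower():
            count += 1
    print("Possible amount of misspelled words in the text:", count)

spell2_fixed()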

removing NLTK StopWords

I am trying to remove stop words from my data set.
stopwords = nltk.corpus.stopwords.words('german')

def remove_stopwords(txt_clean):
    txt_clean = [Word for Word in txt_clean if Word not in stopwords]
    return txt_clean

data['Tweet_sw'] = data['Tweet_clean'].apply(lambda x: remove_stopwords(x))
data.head()
I have two problems with that.
First, the output is given character by character (separated by commas), even though I run the check against the list of stopwords word by word.
I can solve this problem with a join command, but I don't understand why the text is split into characters.
The second and real problem is that the removal of stop words does not work: words that are clearly in the list are not removed from the sentences.
Where is my mistake in this?
txt_clean = [Word for Word in txt_clean.split() if Word not in stopwords]
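The .split() is the whole fix: iterating over a string yields its characters, while iterating over str.split() yields words. For example:
text = 'das ist ein Test'
print([w for w in text][:4])      # ['d', 'a', 's', ' '] -- characters
print([w for w in text.split()])  # ['das', 'ist', 'ein', 'Test'] -- words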

How to find the number of common words in a text file and delete them in python?

The question is to:
Firstly, find the number of all words in a text file
Secondly, delete the common words like a, an, and, to, in, at, but, ... (it is allowed to write a list of these words)
Thirdly, find the number of the remaining words (unique words)
Make a list of them
The file name should be used as the parameter of the function
I have done the first part of the question:
import re

file = open('text.txt', 'r', encoding='latin-1')
word_list = file.read().split()
for x in word_list:
    print(x)
res = len(word_list)
print('The number of words in the text: ' + str(res))
def uncommonWords(file):
    uncommonwords = list(file)
    for i in uncommonwords:
        i += 1
        print(i)
The code prints up to the number of words, and nothing appears after that.
You can do it like this:
# list of common words you want to remove
stop_words = set(["is", "the", "to", "in"])

# set to collect unique words
words_in_file = set()
with open("words.txt") as text_file:
    for line in text_file:
        for word in line.split():
            words_in_file.add(word)

# remove common words from word list
unique_words = words_in_file - stop_words
print(list(unique_words))
First, you may want to get rid of punctuation: as shown in this answer, you can do:
nonPunct = re.compile('.*[A-Za-z0-9].*')
filtered = [w for w in text if nonPunct.match(w)]
Then, you could do:
from collections import Counter
counts = Counter(filtered)
You can then access the list of unique words with list(counts.keys()), and then you can choose to ignore the words you don't want with:
[word for word in list(counts.keys()) if word not in common_words]
Hope this answers your question.
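Putting the steps together, a minimal sketch of a single function that takes the file name as its parameter (the function name and the stop list here are placeholders, not from the question):
import re
from collections import Counter

COMMON_WORDS = {'a', 'an', 'and', 'to', 'in', 'at', 'but'}

def count_and_filter_words(filename):
    with open(filename, encoding='latin-1') as f:
        word_list = f.read().split()
    print('The number of words in the text:', len(word_list))
    # drop pure-punctuation tokens before counting
    nonPunct = re.compile('.*[A-Za-z0-9].*')
    counts = Counter(w for w in word_list if nonPunct.match(w))
    # remaining unique words, minus the common ones
    unique_words = [w for w in counts if w.lower() not in COMMON_WORDS]
    print('The number of remaining unique words:', len(unique_words))
    return unique_words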

Python: Comparing the strings in two files and printing the matches does not work properly

I am trying to compare the strings of the file "formatted_words.txt" with another customised file "dictionary.txt", and in the output I am trying to print those words from "formatted_words.txt" which are present in "dictionary.txt".
from itertools import izip

with open("formatted_words.txt") as words_file:
    with open("dictionary.txt") as dict_file:
        all_strings = list(map(str.strip, dict_file))
        for word in words_file:
            for a_string in all_strings:
                if word in a_string:
                    print a_string
Nevertheless, in the output all the words of the file "formatted_words.txt" are getting printed, though many words from this file are not in "dictionary.txt". I cannot use any built-in Python dictionary. Any help would be appreciated.
Using sets:
with open('formatted_words.txt') as words_file:
    with open('dictionary.txt') as dict_file:
        all_strings = set(map(str.strip, dict_file))
        words = set(map(str.strip, words_file))
        for word in all_strings.intersection(words):
            print(word)
Prints nothing because the intersection is empty
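The str.strip in both map calls matters: lines read from a file keep their trailing newline, so 'cat\n' never equals 'cat' in a set comparison. For example:
dict_lines = ['cat\n', 'dog\n']   # lines as read from a file
words = ['cat', 'bird']           # e.g. the last line of a file has no newline
print(set(dict_lines) & set(words))   # set() -- 'cat\n' != 'cat'
print(set(map(str.strip, dict_lines)) & set(map(str.strip, words)))  # {'cat'}
If the intersection is still empty after stripping, the two files genuinely share no exactly-equal words (for example due to case differences or punctuation attached to the words).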
