Struggling with removing stop words using nltk - nlp

I'm trying to remove the stop words from "I don't like ice cream." I have defined:
stop_words = set(nltk.corpus.stopwords.words('english'))
and the function
def stop_word_remover(text):
    return [word for word in text if word.lower() not in stop_words]
But when I apply the function to the string in question, I get this list:
[' ', 'n', '’', ' ', 'l', 'k', 'e', ' ', 'c', 'e', ' ', 'c', 'r', 'e', '.']
which, when I join the strings together as in ''.join(stop_word_remover("I don’t like ice cream.")), gives
' n’ lke ce cre.'
which is not what I was expecting.
Any tips on where I have gone wrong?

word for word in text iterates over the characters of text, not over words, because text is a plain string and iterating over a string yields single characters.
You should change your code as below:
import nltk
nltk.download('stopwords')
nltk.download('punkt')
from nltk.tokenize import word_tokenize
stop_words = set(nltk.corpus.stopwords.words('english'))
def stop_word_remover(text):
    word_tokens = word_tokenize(text)
    word_list = [word for word in word_tokens if word.lower() not in stop_words]
    return " ".join(word_list)
stop_word_remover("I don't like ice cream.")
## "n't like ice cream ."
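If you also want to drop the leftover punctuation and the contraction fragment "n't" (a possible refinement, not something the question asked for), one option is to keep only alphabetic tokens. This reuses word_tokenize and stop_words from the snippet above:
def stop_word_remover(text):
    word_tokens = word_tokenize(text)
    # keep only alphabetic tokens that are not stop words
    word_list = [word for word in word_tokens
                 if word.lower() not in stop_words and word.isalpha()]
    return " ".join(word_list)
stop_word_remover("I don't like ice cream.")
## 'like ice cream'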

Related

Python program that extracts most frequently found names from .csv file

I have created a program that generates 5000 random names, SSNs, cities, addresses, and emails and stores them in a fakeprofile.csv file. I am trying to extract the most common names from the file. I was able to get the program to work syntactically, but it fails to extract the frequent names.
Here's the code:
import re
import statistics
file_open = open('fakeprofile.csv').read()
frequent_names = re.findall('[A-Z][a-z]*', file_open)
print(frequent_names)
Sample in the file:
Alicia Walters 419-52-4141 Yorkstad 66616 Schultz Extensions Suite 225
Reynoldsmouth, VA 72465 stevenserin#stein.biz
Nicole Duffy 212-38-9009 West Timothy 51077 Phillips Ports Apt. 314
Hubbardville, IN 06723 kaitlinthomas#bennett-carter.com
Stephanie Lewis 442-20-1279 Jacquelineshire 650 Gutierrez Forge Apt. 839
West Christianbury, TN 13654 ukelley#gmail.com
Michael Harris 108-81-3733 East Toddberg 14387 Douglas Mission Suite 038
Garciaview, WI 58624 kshields#yahoo.com
Aaron Moreno 171-30-7715 Port Taraburgh 56672 Wagner Path
Lake Christopher, VA 37884 lucasscott#nguyen.info
Alicia Zimmerman 286-88-9507 Barberstad 5365 Heath Extensions Apt. 731
South Randyburgh, NJ 79367 daniellewebb#yahoo.com
Brittney Mcmillan 334-44-0321 Lisahaven PSC 3856, Box 2428
APO AE 03215 kevin95#hotmail.com
Amanda Perkins 327-31-6610 Perryville 8750 Hurst Harbor Apt. 929
Sample output:
', 'Lake', 'Brianna', 'P', 'A', 'Michael', 'Smith', 'Harveymouth', 'Patricia', 'Tunnel', 'West', 'William', 'G', 'A', 'Charles', 'Perkins', 'Lake', 'Marie', 'Lisa', 'Overpass', 'Suite', 'Kennedymouth', 'C', 'A', 'Barbara', 'Perez', 'Billyshire', 'Joshua', 'Village', 'Cindymouth', 'W', 'I', 'Curtis', 'Simmons', 'North', 'Mitchellport', 'Gordon', 'Crest', 'Suite', 'Jacksonburgh', 'C', 'O', 'Cameron', 'Berg', 'South', 'Dean', 'Christina', 'Coves', 'Williamton', 'T', 'N', 'Maria', 'Williams', 'North', 'Judith', 'Carson', 'Overpass', 'Apt', 'West', 'Amandastad', 'N', 'M', 'Hannah', 'Dennis', 'Rodriguezmouth', 'P', 'S', 'C', 'Box', 'A', 'P', 'O', 'A', 'E', 'Laura', 'Richardson', 'Lake', 'Kayla', 'Johnson', 'Place', 'Suite', 'Port', 'Jennifermouth', 'N', 'H', 'John', 'Lawson', 'Hintonhaven', 'Thomas', 'Via', 'Mossport', 'N', 'J', 'Jennifer', 'Hill', 'East', 'Phillip', 'P', 'S', 'C', 'Box', 'A', 'P', 'O', 'A', 'E', 'Cody', 'Jackson', 'Lake', 'Jessicamouth', 'Snyder', 'Ways', 'Apt', 'New', 'Stacey', 'M', 'E', 'Ryan', 'Friedman', 'Shahburgh', 'Jerry', 'Pike', 'Suite', 'Toddfort', 'N', 'V', 'Kathleen', 'Fox', 'Ferrellmouth', 'P', 'S', 'C', 'Box', 'A', 'P', 'O', 'A', 'P', 'Michael', 'Thompson', 'Port', 'Jessica', 'Boone', 'Spurs', 'Suite', 'Port', 'Ashleyland', 'C', 'O', 'Christopher', 'Marsh', 'North', 'Catherine', 'Scott', 'Trail', 'Apt', 'Baileyburgh', 'F', 'L', 'Richard', 'Rangel', 'New', 'Anna', 'Ray', 'Drive', 'Apt', 'Nunezland', 'I', 'A', 'Connor', 'Stanton', 'Troyshire', 'Rodgers', 'Hill', 'West', 'Annmouth', 'N', 'H', 'James', 'Medina',
My issue here is that I am unable to extract the most frequently found first names, and I cannot avoid those stray single capital letters. Instead, I have extracted every capitalized word (including the unnecessary single capital letters), and the output above is a small sample of everything extracted. I noticed that the first names are always on the odd rows, and I am trying to capture the most frequent first names from those rows.
The fakeprofile.csv file was created by this program:
import csv
import faker
from faker import Faker
fake = Faker()
name = fake.name(); print(name)
ssn = fake.ssn(); print(ssn)
city = fake.city(); print(city)
address = fake.address(); print(address)
email = fake.email(); print(email)
profile = fake.simple_profile()
for i,j in profile.items():
    print('{}: {}'.format(i,j))
print('Name: {}, SSN: {}, City: {}, Address: {}, Email: {}'.format(name,ssn,city,address,email))
with open('fakeprofile.csv', 'w') as f:
    for i in range(0,5001):
        print(f'{fake.name()} {fake.ssn()} {fake.city()} {fake.address()} {fake.email()}', file=f)
Does this achieve what you want?
import collections, re
# Read in all lines into a list
with open('fakeprofile.csv') as f:
    lines = f.readlines()
# Throw out every other line
lines = [line for i, line in enumerate(lines) if i % 2 == 0]
# Keep only first word of each line
names = [line.split()[0] for line in lines]
# Find most common names
n = 3
frequent_names = collections.Counter(names).most_common(n)
# Display most common names
for name, count in frequent_names:
    print(name, count)
To do the counting it uses collections.Counter together with its most_common() method.
I think it would have been better to use the pandas library for the CSV manipulation (collecting the desired information) and then apply a Python collection such as Counter to df['name']. Otherwise, could you give us more information about the CSV file? Thank you.
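A minimal sketch of that suggestion (assuming, as the sample shows, that the "CSV" is really whitespace-separated text with each profile spread over two lines and no header; pandas' value_counts is used here in place of Counter):
import pandas as pd
# each profile spans two lines; the first name is the first token of every other line
with open('fakeprofile.csv') as f:
    lines = [line for i, line in enumerate(f) if i % 2 == 0]
first_names = pd.Series([line.split()[0] for line in lines if line.strip()])
print(first_names.value_counts().head(3))  # the three most common first names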
So the main problem you have is that your regexp captures every capitalized word, including the stray single capital letters.
You are only interested in the first word of each odd line.
You can do something along those lines:
# either use a dict to count or a list to turn into a Counter
dico_count = {}
with open('fakeprofile.csv') as file_open:  # use of a context manager
    line_number = 1
    for line in file_open:  # iterate over all the lines
        if line_number % 2 != 0:  # odd line
            spt = line.strip().split()
            dico_count[spt[0]] = dico_count.get(spt[0], 0) + 1
        line_number += 1  # the original snippet never incremented this counter
frequent_name_counter = [(k, v) for k, v in sorted(dico_count.items(), key=lambda x: x[1], reverse=True)]

Why Gensim most similar in doc2vec gives the same vector as the output?

I am using the following code to get the ordered list of user posts.
model = doc2vec.Doc2Vec.load(doc2vec_model_name)
doc_vectors = model.docvecs.doctag_syn0
doc_tags = model.docvecs.offset2doctag
for w, sim in model.docvecs.most_similar(positive=[model.infer_vector('phone_comments')], topn=4000):
    print(w, sim)
    fw.write(w)
    fw.write(" (")
    fw.write(str(sim))
    fw.write(")")
    fw.write("\n")
fw.close()
However, I am also getting the vector "phone comments" (the one I use to find the nearest neighbours) at around 6th place in the list. Is there any mistake in my code, or is this an issue in Gensim (because a vector cannot be a neighbour of itself)?
EDIT
Doc2vec model training code
######Preprocessing
docs = []
analyzedDocument = namedtuple('AnalyzedDocument', 'words tags')
for key, value in my_d.items():
    value = re.sub("[^1-9a-zA-Z]", " ", value)
    words = value.lower().split()
    tags = key.replace(' ', '_')
    docs.append(analyzedDocument(words, tags.split(' ')))
sentences = []  # Initialize an empty list of sentences
######Get n-grams
#Get list of lists of tokenised words. 1 sentence = 1 list
for item in docs:
    sentences.append(item.words)
#identify bigrams and trigrams (trigram_sentences_project)
trigram_sentences_project = []
bigram = Phrases(sentences, min_count=5, delimiter=b' ')
trigram = Phrases(bigram[sentences], min_count=5, delimiter=b' ')
for sent in sentences:
    bigrams_ = bigram[sent]
    trigrams_ = trigram[bigram[sent]]
    trigram_sentences_project.append(trigrams_)
paper_count = 0
for item in trigram_sentences_project:
    docs[paper_count] = docs[paper_count]._replace(words=item)
    paper_count = paper_count + 1
# Train model
model = doc2vec.Doc2Vec(docs, size=100, window=300, min_count=5, workers=4, iter=20)
#Save the trained model for later use to take the similarity values
model_name = user_defined_doc2vec_model_name
model.save(model_name)
The infer_vector() method expects a list-of-tokens, just like the words property of the text examples (TaggedDocument objects, usually) that were used to train the model.
You're supplying a simple string, 'phone_comments', which will look to infer_vector() like the list ['p', 'h', 'o', 'n', 'e', '_', 'c', 'o', 'm', 'm', 'e', 'n', 't', 's']. Thus your origin vector for the most_similar() is probably garbage.
Further, you're not getting back the input 'phone_comments', you're getting back the different string 'phone comments'. If that's a tag-name in the model, then that must have been a supplied tag during model training. Its superficial similarity to phone_comments may be meaningless - they're different strings.
(But it may also hint that your training had problems, too, and trained the text that should have been words=['phone', 'comments'] as words=['p', 'h', 'o', 'n', 'e', ' ', 'c', 'o', 'm', 'm', 'e', 'n', 't', 's'] instead.)
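For illustration, a minimal sketch of the intended call, assuming the tag names and preprocessing shown in the question's training code:
# infer_vector expects a list of tokens, preprocessed like the training data
inferred = model.infer_vector(['phone', 'comments'])
for tag, sim in model.docvecs.most_similar(positive=[inferred], topn=10):
    print(tag, sim)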

Double for statement in list comprehension

wordlist = ['cat','dog','rabbit']
letterlist = [ ]
To list all the characters in all the words, we can do this:
letterlist = [word[i] for word in wordlist for i in range(len(word))]
['c', 'a', 't', 'd', 'o', 'g', 'r', 'a', 'b', 'b', 'i', 't']
However, when I try to do it in this way:
letterlist = [character for character in word for word in wordlist]
I get the error:
NameError: name 'word' is not defined on line 9
Can someone explain my error in understanding how list comprehension works?
Thanks.
Writing
wordlist = ["cat", "dog", "rabbit"]
letterlist = [character for character in word for word in wordlist]
is comparable to the following nested loop:
wordlist = ["cat", "dog", "rabbit"]
letterlist = []
for character in word:
    for word in wordlist:
        letterlist.append(character)
This loop will throw the same error as your list comprehension because you are attempting to reference character in word before defining word as an element of wordlist. You just have the order backwards. Try the following:
letterlist = [character for word in wordlist for character in word]
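With the corrected ordering, the comprehension flattens the words into characters just like the original double-for example:
wordlist = ['cat', 'dog', 'rabbit']
# outer loop over the words first, then inner loop over each word's characters
letterlist = [character for word in wordlist for character in word]
print(letterlist)
# ['c', 'a', 't', 'd', 'o', 'g', 'r', 'a', 'b', 'b', 'i', 't']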

Splitting punctuation into a list (Python)

I want to be able to generate a list that includes the punctuation as separate elements, but I am struggling to find a solution.
Example: "Hello world! I am here."
["Hello","world","!","I","am","here","."]
So far I know that
"Hello World! I am here.".split()
will evaluate to
['Hello', 'World!', 'I', 'am', 'here.']
You can use a regex:
>>> s="Hello world! I am here."
>>>
>>> import re
>>> re.findall(r'\w+|[^\w\s]',s)
['Hello', 'world', '!', 'I', 'am', 'here', '.']
re.findall() with the regex r'\w+|[^\w\s]' will find every run of word characters (\w+) or any single character that is neither a word character nor whitespace ([^\w\s]).
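As an extra illustration (not from the original answer), each punctuation mark comes out as its own list element, even when several appear in a row:
>>> re.findall(r'\w+|[^\w\s]', "Wait... really?!")
['Wait', '.', '.', '.', 'really', '?', '!']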

Is there any way to force ipython to interpret utf-8 symbols?

I'm using ipython notebook.
What I want to do is search a literal string for any Spanish accented letters (ñ,á,é,í,ó,ú,Ñ,Á,É,Í,Ó,Ú) and change them to their closest representation in the English alphabet.
I decided to write down a simple function and give it a go:
def remove_accent(n):
    listn = list(n)
    for i in range(len(listn)):
        if listn[i] == 'ó':
            listn[i] = 'o'
    return listn
Seemed simple, right? Simply check whether the accented character is there and change it to its closest representation. So I went ahead and tested it, getting the following output:
in []: remove_accent('whatever !## ó')
out[]: ['w',
'h',
'a',
't',
'e',
'v',
'e',
'r',
' ',
'!',
'#',
'#',
' ',
'\xc3',
'\xb3']
I've tried to change the default encoding from ASCII (I presume that is why I'm getting two positions for the accented character instead of one: '\xc3', '\xb3') to UTF-8, but this didn't work. What I would like to get is:
in []: remove_accent('whatever !## ó')
out[]: ['w',
'h',
'a',
't',
'e',
'v',
'e',
'r',
' ',
'!',
'#',
'#',
' ',
'o']
PS: this wouldn't be so bad if the accented character yielded just one position instead of two; then I would only need to change the if condition, but I haven't found a way to do that either.
Your problem is that you are getting two characters for the 'ó' character instead of one. Therefore, try converting the string to unicode first, so that every character becomes a single item, as follows:
def remove_accent(n):
    n_unicode = unicode(n, "UTF-8")
    listn = list(n_unicode)
    for i in range(len(listn)):
        if listn[i] == u'ó':
            listn[i] = 'o'.encode('utf-8')
        else:
            listn[i] = listn[i].encode('utf-8')
    return listn
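As an aside (not part of the original answer), if the goal is to strip the accent from any of the listed letters rather than handling each one with its own if branch, the standard-library unicodedata module can do it generically. A Python 3 sketch that returns a plain string rather than a list:
import unicodedata

def remove_accent(n):
    # decompose accented characters (NFD), then drop the combining accent marks
    decomposed = unicodedata.normalize('NFD', n)
    return ''.join(c for c in decomposed if unicodedata.category(c) != 'Mn')

remove_accent('whatever !## ó')
## 'whatever !## o'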
