I have found the following Python code that does similar work, but it only replaces the word with a manually selected synonym.
import nltk
from nltk.corpus import wordnet

synonyms = []
string = "i love winter season"
for syn in wordnet.synsets("love"):
    for l in syn.lemmas():
        synonyms.append(l.name())
print(synonyms)
rep = synonyms[2]
st = string.replace("love", rep, 1)
print(st)
rep = synonyms[2] just takes whatever synonym happens to be at index 2. What I want is to replace the selected word with a randomly selected synonym. How can I do that?
If I understand your question correctly, what you need is to select a random element from a list. This can be done in Python like so:
import random
random.choice(synonyms)
As answered here
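Putting that together with the code from your question, a minimal sketch could look like this (assuming the synonym list is not empty):

import random
from nltk.corpus import wordnet

string = "i love winter season"
word = "love"

# collect every lemma name from every synset of the word
synonyms = [l.name() for syn in wordnet.synsets(word) for l in syn.lemmas()]

if synonyms:
    rep = random.choice(synonyms)        # pick one synonym at random
    print(string.replace(word, rep, 1))  # replace the first occurrence

Note that WordNet lemma names can contain underscores for multi-word expressions, so you may want to replace those with spaces before inserting them.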
Say I have the code txt = "Hello my name is bob. I really like pies.", how would I extract each sentence individually and add them to a list? I created this messy script which gives me a rough count of the sentences in a string...
sentences = 0
capitals = [
    'A','B','C','D','E','F','G','H','I','J','K','L','M','N','O','P','Q','R','S',
    'T','U','V','W','X','Y','Z'
]
finish_markers = [
    '.','?','!'
]
newTxt = txt.split()
for x in newTxt[1:-1]:
    for caps in capitals:
        if caps in x:
            for fin in finish_markers:
                if fin in newTxt[newTxt.index(x) - 1]:
                    sentences += 1
for caps in capitals:
    if caps in newTxt[0]:
        sentences += 1
print("Sentence count...")
print(sentences)
It is using the txt variable mentioned above. However I would now like to extract each sentence and put them into a list so the final product would look something like this...
['Hello my name is bob.','I really like pies.']
I would prefer not to use any non-standard packages, because I want this script to work independently and offline. Thank you for any help!
Use nltk.tokenize
import nltk
sentences = nltk.sent_tokenize(txt)
This will give you a list of sentences.
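For the example string from the question, that should give something like this (you may need to download the Punkt tokenizer data first with nltk.download('punkt')):

>>> import nltk
>>> nltk.sent_tokenize("Hello my name is bob. I really like pies.")
['Hello my name is bob.', 'I really like pies.']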
You could work with a regex that looks for the sentence-ending characters (".", "?", "!") and then split the string on them.
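A minimal standard-library sketch of that idea (naive: it will also split after abbreviations such as "Mr."):

import re

txt = "Hello my name is bob. I really like pies."

# split on whitespace that follows ., ? or !, so the punctuation stays with its sentence
sentences = re.split(r'(?<=[.!?])\s+', txt.strip())
print(sentences)
# ['Hello my name is bob.', 'I really like pies.']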
You are trying to split a string into sentences; that is a bit hard to do with regular expressions or plain string handling. For your use case, I'd recommend an NLP library like NLTK. Then take a look at this: Tokenize a paragraph into sentence and then into words in NLTK.
I'm currently trying to get all possible pos tags of a single word using Python.
Traditional POS taggers give back only one tag if you enter a single word.
Is there a way to get all possibilities?
Is it possible to search a corpus (e.g. Brown) for a specific word and not just for a category?
Kind regards & thanks for help
You can collect the POS tags using this approach, specifically for the Brown corpus:
import nltk
from nltk.corpus import brown
from collections import Counter, defaultdict

# x is a dict which will have the word as key and its pos tags as values
x = defaultdict(list)

# loop over the first 100 tagged words and their pos tags
for word, pos in brown.tagged_words()[:100]:
    if pos not in x[word]:   # append each tag only once
        x[word].append(pos)  # adding the key-value pair to x

# print the pos tags for the word 'further'
print(x['further'])
# ['RBR']
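If you want every tag that a specific word takes anywhere in Brown, rather than only in the first 100 words, the same idea works over the whole tagged corpus (it just takes a little longer to run):

from collections import defaultdict
from nltk.corpus import brown

tags_by_word = defaultdict(set)
for word, pos in brown.tagged_words():
    tags_by_word[word.lower()].add(pos)

# every tag observed for 'further' anywhere in the Brown corpus
print(tags_by_word['further'])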
I'm trying to clean up a bunch of tweets so that they can be used for k-means clustering. I've written the following code that should strip each tweet of its unwanted characters.
from nltk.corpus import stopwords
import nltk
import json

with open("/Users/titus/Desktop/trumptweets.json", 'r', encoding='utf8') as f:
    data = json.loads(f.readline())

tweets = []
for sentence in data:
    tokens = nltk.wordpunct_tokenize(sentence['text'])
    type(tokens)
    text = nltk.Text(tokens)
    type(text)
    words = [w.lower() for w in text if w.isalpha() and w not in
             stopwords.words('english') and w is not 'the']
    s = " "
    useful_sentence = s.join(words)
    tweets.append(useful_sentence)
print(tweets)
I'm trying to remove words like "I" and "the", but for some reason I can't figure out how. If I look at the tweets after they've gone through the loop, the word "the" still occurs.
Question: How is it possible that there are still occurrences of "the" and "I" in the tweets? How should I fix this?
Beware of the processing order.
Here are two test strings for you:
THIS THE REMAINS.
this the is removed
Because "THE" is not "the". You lowercase after filtering, but you should lowercase first and then filter.
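Concretely, the filter could look like this (a sketch; build the stopword set once, before the tweet loop, so the lookup is fast):

stop_words = set(stopwords.words('english'))   # build once, before the loop

words = [w.lower() for w in text if w.isalpha() and w.lower() not in stop_words]

"the" and "i" are already in NLTK's English stopword list, so no separate check for them is needed.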
The bad news for you: k-means works very poorly on noisy, short texts like tweets, because it is sensitive to noise and TF-IDF vectors need fairly long texts to be reliable. So carefully verify your results; they are probably not as good as they seem in the first enthusiasm.
Have you tried lowercasing w in the check?
words = [w.lower() for w in text if w.isalpha() and w.lower() not in
         stopwords.words('english') and w.lower() is not 'the']
is (and is not) is the (reference) identity check: it compares whether two names point to the same object in memory. Typically it is only used to compare with None, or in a few other special cases.
In your case, use the != operator (or the negation of ==) to compare with the string "the".
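A quick illustration of the difference:

w = "".join(["t", "h", "e"])   # builds a new string object at runtime
print(w == "the")              # True: same value
print(w is "the")              # usually False: different objects (recent Python versions even warn about this)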
See also: Is there a difference between `==` and `is` in Python?
I am using wordnet to find the synonyms for a particular word as shown below
synonyms = wn.synsets('good','a')
where wn is wordnet. This returns a list of synsets like
Synset('good.a.01')
Synset('full.s.06')
Synset('good.a.03')
Synset('estimable.s.02')
Synset('beneficial.s.01')
etc...
How do I iterate through each synset and get its name and POS tag?
You can get the name and the pos tag of each synset like this:
from nltk.corpus import wordnet as wn

synonyms = wn.synsets('good', 'a')
for synset in synonyms:
    print(synset.name())
    print(synset.pos())
The name is the combination of word, pos and sense, such as 'full.s.06'. If you just want the word, you can split on the dot '.' and take the first element:
print(synset.name().split('.')[0])
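For the first synset in the list above, that would print 'good.a.01' for the name, 'a' for the POS, and 'good' after the split.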
I am using this in my code:
from nltk.corpus import wordnet

word = input()
for syn in wordnet.synsets(word):
    for l in syn.lemmas():
        print(syn.pos())
        print(l.name())
syn.pos() can also be called outside the inner loop, because all lemmas of a synset share the same POS.
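That is, the loop from the question could be restructured as:

for syn in wordnet.synsets(word):
    print(syn.pos())        # one POS per synset, shared by all of its lemmas
    for l in syn.lemmas():
        print(l.name())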
I am new to natural language processing and I want to use it to write a news aggregator(in Node.js in my case). Rather than just use a prepackage framework, I want to learn the nuts and bolts and I am starting with the NLP portion. I found this one tutorial that has been the most helpful so far:
http://www.p-value.info/2012/12/howto-build-news-aggregator-in-100-loc.html
In it, the author gets the RSS feeds and loops through them looking for the elements (or fields) title and description. I know Python and understand the code. But what I don't understand is what NLP is doing here with title and description under the hood (besides scraping and tokenizing, which is apparent... and those tasks don't need an NLP library).
import feedparser
import nltk

corpus = []
titles = []
ct = -1
for feed in feeds:
    d = feedparser.parse(feed)
    for e in d['entries']:
        words = nltk.wordpunct_tokenize(nltk.clean_html(e['description']))
        words.extend(nltk.wordpunct_tokenize(e['title']))
        lowerwords = [x.lower() for x in words if len(x) > 1]
        ct += 1
        print ct, "TITLE", e['title']
        corpus.append(lowerwords)
        titles.append(e['title'])
(Reading your question more carefully, maybe this was all already obvious to you, but it doesn't look like anything deeper or more interesting is going on.)
wordpunct_tokenize is set up here (last line) as
wordpunct_tokenize = WordPunctTokenizer().tokenize
WordPunctTokenizer is implemented by this code:
class WordPunctTokenizer(RegexpTokenizer):
    def __init__(self):
        RegexpTokenizer.__init__(self, r'\w+|[^\w\s]+')
The heart of this is just the regular expression r'\w+|[^\w\s]+', which defines what strings are considered to be tokens by this tokenizer. There are two options, separated by the |:
\w+, that is, one or more "word" characters (alphabetical or numeric)
[^\w\s]+, one or more characters that are neither "word" characters nor whitespace, so this matches any run of punctuation (see the small example below)
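For illustration, here is the same pattern applied directly with the re module (a small sketch, not how NLTK invokes it internally):

import re

pattern = r'\w+|[^\w\s]+'
print(re.findall(pattern, "Hello, world! It's 2012."))
# ['Hello', ',', 'world', '!', 'It', "'", 's', '2012', '.']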
Here is a reference for Python regular expressions.
I have not dug into RegexpTokenizer, but I assume it is set up such that the tokenize function returns an iterator that searches a string for the first match of the regular expression, then the next, and so on.