I would like to find keywords from a list, but return a zero if the word does not exist (in this case: part). In this example, collabor occurs 4 times and part 0 times.
My current output is
[['collabor', 4]]
But what I would like to have is
[['collabor', 4], ['part', 0]]
str1 = ["collabor", "part"]
x10 = []
for y in wordlist:
    for string in str1:
        if y.find(string) != -1:
            x10.append(y)
from collections import Counter
x11 = Counter(x10)
your_list = [list(i) for i in x11.items()]
rowssorted = sorted(your_list, key=lambda x: x[0])
print(rowssorted)
Although you have not stated your problem and requirements very clearly, I think I understand the task.
I assume that you have a set of words that may or may not occur in a given list, and you want to print the count of each of those words based on its occurrences in that list.
Code:
constants = ["part", "collabor"]
wordlist = ["collabor", "collabor"]

d = {}
for const in constants:
    d[const] = 0          # every keyword starts at zero

for word in wordlist:
    if word in d:         # words that are not keywords are ignored
        d[word] += 1

from collections import Counter
x11 = Counter(d)
your_list = [list(i) for i in x11.items()]
rowssorted = sorted(your_list, key=lambda x: x[0])
print(rowssorted)
output:
[['collabor', 2], ['part', 0]]
This approach gives the required output.
In Python, a dictionary is the usual way to count occurrences.
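As a side note, the general dictionary-counting idiom looks like this (a minimal illustration with made-up data, not the question's wordlist):
words = ["collabor", "part", "collabor"]
counts = {}
for w in words:
    counts[w] = counts.get(w, 0) + 1   # default to 0 for words not seen yet
print(counts)   # {'collabor': 2, 'part': 1}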
Hope it helps!
I am using the Hugging Face transformers library to check whether a sentence is well-formed or not. I am using a masked language model, XLM-R. I first tokenize my sentence, then mask each word of the sentence one by one, then run the masked sentences through the model and compute the probability that the predicted masked word is correct.
import copy
import math

import torch
from tqdm import tqdm

def calculate_scores(sent, model, tokenizer, device, print_pred=False, maskval=False):
    k = 0
    dic = {}
    ls = tokenizer.batch_encode_plus(sent)
    input_list = ls.input_ids
    h = 0
    with torch.no_grad():
        for i in tqdm(range(len(input_list))):
            item = input_list[i]
            real_input = item
            attmask = [1] * len(item)
            seg = [0] * len(item)
            seglist = [seg]
            masked_list = [real_input]
            attlist = [attmask]
            for j in range(1, len(item) - 1):
                # copy the sentence and mask exactly one position
                input = copy.deepcopy(real_input)
                input[j] = 50264          # hard-coded mask token id
                masked_list.append(input)
                attlist.append(attmask)
                seglist.append(seg)
            inid = torch.tensor(masked_list)
            segtensor = torch.tensor(seglist)
            atttensor = torch.tensor(attlist)
            inid = inid.to(device)
            segtensor = segtensor.to(device)
            output = model(inid, segtensor)
            predictions_logits = output.logits
            predictions = torch.softmax(predictions_logits, dim=2)
            ppscore = 0
            for j in range(1, len(item) - 1):
                # log2 probability of the original token at its masked position
                ppscore = ppscore + math.log(predictions[j, j, item[j]], 2)
            try:
                score = math.pow(2, (-1 / (len(item) - 2)) * ppscore)
                dic[sent[i]] = score
            except:
                print(sent[i])
                dic[sent[i]] = 10000000
    return dic
I will explain my code quickly. The function calculate_scores takes sent as input, which is a list of sentences. I first batch-encode this list of sentences. Then, for each encoded sentence, I generate masked copies in which exactly one token is masked and the rest are left unmasked. I feed these masked copies to the model and get the probabilities, and from them I compute the perplexity.
But this is not a very good way of utilizing the GPU. I want to process multiple sentences at once, but at the same time I also need a perplexity score for each individual sentence. How would I go about doing this?
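One possible direction, sketched under assumptions rather than tested: pad the masked copies of several sentences to a common length, run them through the model in fixed-size chunks, and keep track of which rows belong to which sentence so the per-sentence perplexities can still be computed. The helper below is illustrative only; batched_forward and chunk_size are made-up names, and pad_id should be your tokenizer's padding token id.
import torch

def batched_forward(masked_inputs, pad_id, model, device, chunk_size=64):
    # masked_inputs: list of token-id lists, possibly of different lengths
    max_len = max(len(ids) for ids in masked_inputs)
    input_ids = torch.full((len(masked_inputs), max_len), pad_id, dtype=torch.long)
    attention_mask = torch.zeros_like(input_ids)
    for row, ids in enumerate(masked_inputs):
        input_ids[row, :len(ids)] = torch.tensor(ids)
        attention_mask[row, :len(ids)] = 1

    all_logits = []
    with torch.no_grad():
        for start in range(0, len(masked_inputs), chunk_size):
            out = model(input_ids[start:start + chunk_size].to(device),
                        attention_mask=attention_mask[start:start + chunk_size].to(device))
            all_logits.append(out.logits.cpu())
    return torch.cat(all_logits, dim=0)   # shape: (num_masked_copies, max_len, vocab_size)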
I want to create a prediction function which completes a part of a sentence.
The model used here is a character-based RNN (LSTM). What are the steps we should follow?
I tried this, but I can't give the sentence as input:
def generate(self) -> Tuple[List[Token], torch.tensor]:
    start_symbol_idx = self.vocab.get_token_index(START_SYMBOL, 'tokens')
    # print(start_symbol_idx)
    end_symbol_idx = self.vocab.get_token_index(END_SYMBOL, 'tokens')
    padding_symbol_idx = self.vocab.get_token_index(DEFAULT_PADDING_TOKEN, 'tokens')

    log_likelihood = 0.
    words = []
    state = (torch.zeros(1, 1, self.hidden_size), torch.zeros(1, 1, self.hidden_size))

    word_idx = start_symbol_idx

    for i in range(self.max_len):
        tokens = torch.tensor([[word_idx]])
        embeddings = self.embedder({'tokens': tokens})
        output, state = self.rnn._module(embeddings, state)
        output = self.hidden2out(output)
        log_prob = torch.log_softmax(output[0, 0], dim=0)
        dist = torch.exp(log_prob)

        word_idx = start_symbol_idx
        while word_idx in {start_symbol_idx, padding_symbol_idx}:
            word_idx = torch.multinomial(
                dist, num_samples=1, replacement=False).item()

        log_likelihood += log_prob[word_idx]

        if word_idx == end_symbol_idx:
            break

        token = Token(text=self.vocab.get_token_from_index(word_idx, 'tokens'))
        words.append(token)

    return words, log_likelihood, start_symbol_idx
Here are two tutorials on how to use machine learning libraries to generate text: TensorFlow and PyTorch.
This code snippet is part of the AllenNLP "language model" tutorial. The generate function is defined there to compute the probability of tokens and to sample a sequence of tokens from the model's output distribution. The full code is in the Colab notebook below; you can refer to it: https://colab.research.google.com/github/mhagiwara/realworldnlp/blob/master/examples/generation/lm.ipynb#scrollTo=8AU8pwOWgKxE
After training the language model, you can use this function like this:
for _ in range(50):
    tokens, _ = model.generate()
    print(''.join(token.text for token in tokens))
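The tutorial's generate() always starts from the start symbol, so to complete a given prefix you could first feed the prefix tokens through the network to build up the LSTM state and only then start sampling. The method below is a rough sketch of that idea, not code from the tutorial: it reuses the same attributes (vocab, embedder, rnn, hidden2out, max_len) assumed above, and complete / prefix_tokens are made-up names.
def complete(self, prefix_tokens):
    # Sketch: prime the LSTM state with a prefix, then sample a continuation.
    # Returns plain strings rather than Token objects, for simplicity.
    start_idx = self.vocab.get_token_index(START_SYMBOL, 'tokens')
    end_idx = self.vocab.get_token_index(END_SYMBOL, 'tokens')
    state = (torch.zeros(1, 1, self.hidden_size), torch.zeros(1, 1, self.hidden_size))

    # Feed START plus the prefix characters through the network to build up the state.
    indices = [start_idx] + [self.vocab.get_token_index(t, 'tokens') for t in prefix_tokens]
    for idx in indices:
        embeddings = self.embedder({'tokens': torch.tensor([[idx]])})
        output, state = self.rnn._module(embeddings, state)

    words = list(prefix_tokens)
    for _ in range(self.max_len):
        log_prob = torch.log_softmax(self.hidden2out(output)[0, 0], dim=0)
        word_idx = torch.multinomial(torch.exp(log_prob), num_samples=1).item()
        if word_idx == end_idx:
            break
        words.append(self.vocab.get_token_from_index(word_idx, 'tokens'))
        embeddings = self.embedder({'tokens': torch.tensor([[word_idx]])})
        output, state = self.rnn._module(embeddings, state)
    return words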
I am trying to produce a vector that represents the match of a string and a list's elements. I have made a function in python3.x:
def vector_build(docs, var):
    vector = []
    features = docs.split(' ')
    for ngram in var:
        if ngram in features:
            vector.append(docs.count(ngram))
        else:
            vector.append(0)
    return vector
It works fine:
vector_build ('hi my name is peter',['hi', 'name', 'are', 'is'])
Out: [1, 1, 0, 1]
But this function does not scale to large data. When its string parameter 'docs' is larger than about 190 KB, it takes longer than it should. So I am trying to replace the for loop with the map function, like:
var = ['hi', 'name', 'are', 'is']
doc = 'hi my name is peter'
features = doc.split(' ')
vector = list(map(var,if ngram in var in features: vector.append(doc.count(ngram))))
But this returns the following error:
SyntaxError: invalid syntax
Is there a way to replace that for loop with map, lambda, itertools in order to make the execution faster?
You can use a list comprehension for this task. Doing the lookups in a set of features should also speed the function up.
var = ['hi', 'name', 'are', 'is']
doc = 'hi my name is peter'
features = doc.split(' ')
features_set = set(features) #faster lookups
vector = [doc.count(ngram) if ngram in features_set else 0 for ngram in var]
print(vector)
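Since the question asked specifically about map, the same expression can also be written with map and a lambda, although the list comprehension above is generally considered more readable in Python:
vector = list(map(lambda ngram: doc.count(ngram) if ngram in features_set else 0, var))
print(vector)   # [1, 1, 0, 1]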
This is the first time I am building a sentiment analysis machine learning model using the nltk NaiveBayesClassifier in Python. I know it is too simple of a model, but it is just a first step for me and I will try tokenized sentences next time.
The real issue I have with my current model is: I have clearly labeled the word 'bad' as negative in the training data set (as you can see from the 'negative_vocab' variable). However, when I ran the NaiveBayesClassifier on each sentence (lower case) in the list ['awesome movie', ' i like it', ' it is so bad'], the classifier mistakenly labeled 'it is so bad' as positive.
INPUT:
import nltk.classify.util
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import names
positive_vocab = [ 'awesome', 'outstanding', 'fantastic', 'terrific', 'good', 'nice', 'great', ':)' ]
negative_vocab = [ 'bad', 'terrible','useless', 'hate', ':(' ]
neutral_vocab = [ 'movie','the','sound','was','is','actors','did','know','words','not','it','so','really' ]
def word_feats(words):
    return dict([(word, True) for word in words])
positive_features_1 = [(word_feats(positive_vocab), 'pos')]
negative_features_1 = [(word_feats(negative_vocab), 'neg')]
neutral_features_1 = [(word_feats(neutral_vocab), 'neu')]
train_set = negative_features_1 + positive_features_1 + neutral_features_1
classifier = NaiveBayesClassifier.train(train_set)
# Predict
neg = 0
pos = 0
sentence = "Awesome movie. I like it. It is so bad"
sentence = sentence.lower()
words = sentence.split('.')
def word_feat(word):
    return dict([(word, True)])
# NOTE THAT THE FUNCTION 'word_feat(word)' I WROTE HERE IS DIFFERENT FROM THE 'word_feats(words)' FUNCTION I DEFINED EARLIER. THIS FUNCTION IS USED TO ITERATE OVER EACH OF THE THREE ELEMENTS IN THE LIST ['awesome movie', ' i like it', ' it is so bad'].

for word in words:
    classResult = classifier.classify(word_feat(word))
    if classResult == 'neg':
        neg = neg + 1
    if classResult == 'pos':
        pos = pos + 1
    print(str(word) + ' is ' + str(classResult))
print()
OUTPUT:
awesome movie is pos
i like it is pos
it is so bad is pos
To make sure the function 'word_feat(word)' iterates over each sentence instead of each word or letter, I wrote some diagnostic code to see what each element passed to 'word_feat(word)' looks like:
for word in words:
    print(word_feat(word))
And it printed out:
{'awesome movie': True}
{' i like it': True}
{' it is so bad': True}
So it seems like the function 'word_feat(word)' is correct?
Does anyone know why the classifier classified 'It is so bad' as positive? As mentioned before, I had clearly labeled the word 'bad' as negative in my training data.
This particular failure is because your word_feats() function expects a list of words (a tokenized sentence), but you pass it each word separately... so word_feats() iterates over its letters. You've built a classifier that classifies strings as positive or negative on the basis of the letters they contain.
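You can see the letter-level features for yourself; a quick illustration, reusing the word_feats() definition from the question:
def word_feats(words):
    return dict([(word, True) for word in words])

print(word_feats('bad'))   # {'b': True, 'a': True, 'd': True}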
You're probably in this predicament because you pay no attention to what you name your variables. In your main loop, none of the variables sentence, words, or word contain what their name claims. To understand and improve your program, start by naming things properly.
Bugs aside, this is not how you build a sentiment classifier. The training data should be a list of tokenized sentences (each labeled with its sentiment), not a list of individual words. Similarly, you classify tokenized sentences.
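For illustration only (toy sentences and labels, not real training data), the format described above would look roughly like this:
from nltk.classify import NaiveBayesClassifier

def word_feats(words):
    return dict([(word, True) for word in words])

# One training example per labeled, tokenized sentence.
train_set = [
    (word_feats(['awesome', 'movie']), 'pos'),
    (word_feats(['it', 'is', 'so', 'bad']), 'neg'),
]
classifier = NaiveBayesClassifier.train(train_set)
print(classifier.classify(word_feats(['so', 'bad'])))   # classify a tokenized sentence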
Here is the modified code for you
import nltk.classify.util
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import names
from nltk.corpus import stopwords
positive_vocab = [ 'awesome', 'outstanding', 'fantastic', 'terrific', 'good', 'nice', 'great', ':)' ]
negative_vocab = [ 'bad', 'terrible','useless', 'hate', ':(' ]
neutral_vocab = [ 'movie','the','sound','was','is','actors','did','know','words','not','it','so','really' ]
def word_feats(words):
    return dict([(word, True) for word in words])
positive_features_1 = [(word_feats(positive_vocab), 'pos')]
negative_features_1 = [(word_feats(negative_vocab), 'neg')]
neutral_features_1 = [(word_feats(neutral_vocab), 'neu')]
train_set = negative_features_1 + positive_features_1 + neutral_features_1
classifier = NaiveBayesClassifier.train(train_set)
# Predict
neg = 0
pos = 0
sentence = "Awesome movie. I like it. It is so bad."
sentence = sentence.lower()
sentences = sentence.split('.')   # this is actually a list of sentences

for sent in sentences:
    if sent != "":
        words = [word for word in sent.split(" ") if word not in stopwords.words('english')]
        classResult = classifier.classify(word_feats(words))
        if classResult == 'neg':
            neg = neg + 1
        if classResult == 'pos':
            pos = pos + 1
        print(str(sent) + ' --> ' + str(classResult))
print()
I modified the part where you were passing a 'list of words' as input to your classifier. You actually need to pass the sentences one by one, which means you need to pass a 'list of sentences'.
Also, for each sentence you need to pass its 'words as features', which means you need to split the sentence on whitespace.
Also, if you want your classifier to work properly for sentiment analysis, you should give less weight to stop-words like 'it', 'they', 'is', etc., as these words are not sufficient to decide whether a sentence is positive, negative or neutral.
The above code gives the output below:
awesome movie --> pos
i like it --> pos
it is so bad --> neg
So for any classifier, the input format for training and for prediction should be the same. Since you provide a list of words during training, use the same method to convert your test set as well.
Let me show a rewriting of your code. All I changed near the top was adding import re, as it is easier to tokenize with regexes. Everything else up to defining classifier is the same as your code.
I added one more test case (something really, really negative), but more importantly I used proper variable names - then it is much harder to get confused about what is going on:
test_data = "Awesome movie. I like it. It is so bad. I hate this terrible useless movie."
sentences = test_data.lower().split('.')
So sentences now contains 4 strings, each a single sentence.
I left your word_feat() function unchanged.
For using the classifier I did quite a big rewrite:
for sentence in sentences:
    if len(sentence) == 0:
        continue
    neg = 0
    pos = 0
    for word in re.findall(r"[\w']+", sentence):
        classResult = classifier.classify(word_feat(word))
        print(word, classResult)
        if classResult == 'neg':
            neg = neg + 1
        if classResult == 'pos':
            pos = pos + 1
    print("\n%s: %d vs -%d\n" % (sentence, pos, neg))
The outer loop is again descriptive, so that sentence contains one sentence.
I then have an inner loop where we classify each word in the sentence; I am using a regex to split the sentence up on whitespace and punctuation marks:
for word in re.findall(r"[\w']+", sentence):
    classResult = classifier.classify(word_feat(word))
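For instance, for one of the test sentences the regex produces:
import re
print(re.findall(r"[\w']+", " it is so bad"))   # ['it', 'is', 'so', 'bad']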
The rest is just basic adding up and reporting. I get this output:
awesome pos
movie neu
awesome movie: 1 vs -0
i pos
like pos
it pos
i like it: 3 vs -0
it pos
is neu
so pos
bad neg
it is so bad: 2 vs -1
i pos
hate neg
this pos
terrible neg
useless neg
movie neu
i hate this terrible useless movie: 2 vs -3
I still get the same as you - "it is so bad" is considered positive. And with the extra debug lines we can see it is because "it" and "so" are considered positive words, and "bad" is the only negative word, so overall it is positive.
I suspect this is because it hadn't seen those words in its training data.
...yes, if I add "it" and "so" to the list of neutral words, I get "it is so bad: 0 vs -1".
As next things to try, I'd suggest:
Try with more training data; toy examples like this carry the risk that the noise will swamp the signal.
Look into removing stop words.
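For the stop-word idea, here is a small sketch using NLTK's built-in stop-word list (it assumes the stopwords corpus has already been downloaded with nltk.download('stopwords')):
from nltk.corpus import stopwords

stop = set(stopwords.words('english'))
print([w for w in "it is so bad".split() if w not in stop])   # ['bad']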
You can try this code
from nltk.classify import NaiveBayesClassifier
def word_feats(words):
    return dict([(word, True) for word in words])
positive_vocab = [ 'awesome', 'outstanding', 'fantastic','terrific','good','nice','great', ':)','love' ]
negative_vocab = [ 'bad', 'terrible','useless','hate',':(','kill','steal']
neutral_vocab = [ 'movie','the','sound','was','is','actors','did','know','words','not' ]
positive_features = [(word_feats(pos), 'pos') for pos in positive_vocab]
negative_features = [(word_feats(neg), 'neg') for neg in negative_vocab]
neutral_features = [(word_feats(neu), 'neu') for neu in neutral_vocab]
train_set = negative_features + positive_features + neutral_features
classifier = NaiveBayesClassifier.train(train_set)
# Predict
neg = 0
pos = 0
sentence = " Awesome movie, I like it :)"
sentence = sentence.lower()
words = sentence.split(' ')
for word in words:
    classResult = classifier.classify(word_feats(word))
    if classResult == 'neg':
        neg = neg + 1
    if classResult == 'pos':
        pos = pos + 1
print('Positive: ' + str(float(pos)/len(words)))
print('Negative: ' + str(float(neg)/len(words)))
results are:
Positive: 0.7142857142857143
Negative: 0.14285714285714285