Related
In NER task we want to classification sentence tokens with using different approaches (BIO, for example). But we cant join any subtokens when tokenizer divides sentences stronger.
I would like to classificate 'weight 40.5 px' sentence with custom tokenization (by space in this example)
But after tokenization
tokenizer.convert_ids_to_tokens(tokenizer(['weight', '40.5', 'px'], is_split_into_words=True)['input_ids'])
i had
['[CLS]', 'weight', '40', '.', '5', 'p', '##x', '[SEP]']
when '40.5' splitted into another tokens '40', '.', '5'. Its problem for me, because i want to classificate 3 tokens ('weight', '40.5', 'px'), but it not merge automaticaly, because '40', '.', '5' not looks like '40', '##.', '##5'.
What can i do to solve this problem?
you can get the relation between raw text and tokenized tokens through “offset_mapping”
My goal is to find similarities between a word and a document. For example, I want to find the similarity between "new" and a document, for simplicity, say "Hello World!".
I used word2vec from gensim, but the problem is it does not find the similarity for an unseen word. Thus, I tried to use fastText from gensim as it can find similarity for words that are out of vocabulary.
Here is a sample of my document data:
[['This', 'is', 'the', 'only', 'rule', 'of', 'our', 'household'],
['If',
'you',
'feel',
'a',
'presence',
'standing',
'over',
'you',
'while',
'you',
'sleep',
'do'],
['NOT', 'open', 'your', 'eyes'],
['Ignore', 'it', 'and', 'try', 'to', 'fall', 'asleep'],
['This',
'may',
'sound',
'a',
'bit',
'like',
'the',
'show',
'Bird',
'Box',
'from',
'Netflix']]
I simply train data like this:
from gensim.models.fasttext import FastText
model = FastText(sentences_cleaned)
Consequently, I want to find the similarity between say, "rule" and this document.
model.wv.most_similar("rule")
However, fastText gives me this:
[('the', 0.1334390938282013),
('they', 0.12790171802043915),
('in', 0.12731242179870605),
('not', 0.12656228244304657),
('and', 0.11071767657995224),
('of', 0.08563747256994247),
('I', 0.06609072536230087),
('that', 0.05195673555135727),
('The', 0.002402491867542267),
('my', -0.009009800851345062)]
Obviously, it must have "rule" as the top similarity since the word "rule" appears in the first sentence of the document. I also tried stemming/lemmatization, but it doesn't work either.
Was my input format correct? I've seen lots of documents are using .cor or .bin format and I don't know what are those.
Thanks for any reply!
model.wv.most_similar('rule') asks for that's model's set-of-word-vectors (.wv) to return the words most-similar to 'rule'. That is, you've provided neither any document (multiple words) as a query, nor is there any way for the FastText model to return either a document itself, or a name of any documents. Only words, as it has done.
While FastText trains on texts – lists of word-tokens – it only models words/subwords. So it's unclear what you expected instead: the answer is of the proper form.
Those don't look like words very-much like 'rule', but you'll only get good results from FastText (and similar word2vec-algorithms) if you train them with lots of varied data showing many subtly-contrasting realistic uses of the relevant words.
How many texts, with how many words, are in your sentences_cleaned data? (How many uses of 'rule' and related words?)
In any real FastText/Word2Vec/etc model, trained with asequate data/parameters, no single sentence (like your 1st sentence) can tell you much about what the results "should" be. That only emerged from the full rich dataset.
I am using TfidfVectorizer
with following parameters:
smooth_idf=False, sublinear_tf=False, norm=None, analyzer='word', ngram_range=(1,2)
I am vectorizing following text: "red sun, pink candy. Green flower."
Here is output of get_feature_names():
['candy', 'candy green', 'coffee', 'flower', 'green', 'green flower', 'hate', 'icecream', 'like', 'moon', 'pink', 'pink candy', 'red', 'red sun', 'sun', 'sun pink']
Since "candy" and "green" are part of the separate sentences, why is "candy green" n-gram created?
Is there a way to prevent creation of n-grams spawning multiple sentences?
Depends on how you are passing that to TfidfVectorizer!
If passed as a single document, TfidfVectorizer will only keep words which contain 2 or more alphanumeric characters. Punctuation is completely ignored and always treated as a token separator. So your sentence becomes:
['red', 'sun', 'pink', 'candy', 'green', 'flower']
Now from these tokens, ngrams are generated.
Since TfidfVectorizer is a bag-of-words technique, working on words appearing in a document, it does not keep any information about the structure or order of words in a single document.
If you want them to be treated separately, then you should detect the sentences yourself and pass them as different documents.
Or else, pass your own analyzer and ngram generator to the TfidfVectorizer.
For more information on how TfidfVectorizer actually works, see my other answer:
sklearn TfidfVectorizer : Generate Custom NGrams by not removing stopword in them
I have a sequence to sequence model trained on tokens formed by spacy's tokenization. This is both encoder and decoder.
The output is a stream of tokens from a seq2seq model. I want to detokenize the text to form natural text.
Example:
Input to Seq2Seq: Some text
Output from Seq2Seq: This does n't work .
Is there any API in spacy to reverse tokenization done by rules in its tokenizer?
Internally spaCy keeps track of a boolean array to tell whether the tokens have trailing whitespace. You need this array to put the string back together. If you're using a seq2seq model, you could predict the spaces separately.
James Bradbury (author of TorchText) was complaining to me about exactly this. He's right that I didn't think about seq2seq models when I designed the tokenization system in spaCy. He developed revtok to solve his problem.
Basically what revtok does (if I understand correctly) is pack two extra bits onto the lexeme IDs: whether the lexeme has an affinity for a preceding space, and whether it has an affinity for a following space. Spaces are inserted between tokens whose lexemes both have space affinity.
Here's the code to find these bits for a spaCy Doc:
def has_pre_space(token):
if token.i == 0:
return False
if token.nbor(-1).whitespace_:
return True
else:
return False
def has_space(token):
return token.whitespace_
The trick is that you drop a space when either the current lexeme says "no trailing space" or the next lexeme says "no leading space". This means you can decide which of those two lexemes to "blame" for the lack of the space, using frequency statistics.
James's point is that this strategy adds very little entropy to the word prediction decision. Alternate schemes will expand the lexicon with entries like hello. or "Hello. His approach does neither, because you can code the string hello. as either (hello, 1, 0), (., 1, 1) or as (hello, 1, 0), (., 0, 1). This choice is easy: we should definitely "blame" the period for the lack of the space.
TL;DR
I've written a code that attempts to do it, the snippet is below.
Another approach, with a computational complexity of O(n^2) * would be to use a function I just wrote.
The main thought was "What spaCy splits, shall be rejoined once more!"
Code:
#!/usr/bin/env python
import spacy
import string
class detokenizer:
""" This class is an attempt to detokenize spaCy tokenized sentence """
def __init__(self, model="en_core_web_sm"):
self.nlp = spacy.load(model)
def __call__(self, tokens : list):
""" Call this method to get list of detokenized words """
while self._connect_next_token_pair(tokens):
pass
return tokens
def get_sentence(self, tokens : list) -> str:
""" call this method to get detokenized sentence """
return " ".join(self(tokens))
def _connect_next_token_pair(self, tokens : list):
i = self._find_first_pair(tokens)
if i == -1:
return False
tokens[i] = tokens[i] + tokens[i+1]
tokens.pop(i+1)
return True
def _find_first_pair(self,tokens):
if len(tokens) <= 1:
return -1
for i in range(len(tokens)-1):
if self._would_spaCy_join(tokens,i):
return i
return -1
def _would_spaCy_join(self, tokens, index):
"""
Check whether the sum of lengths of spaCy tokenized words is equal to the length of joined and then spaCy tokenized words...
In other words, we say we should join only if the join is reversible.
eg.:
for the text ["The","man","."]
we would joins "man" with "."
but wouldn't join "The" with "man."
"""
left_part = tokens[index]
right_part = tokens[index+1]
length_before_join = len(self.nlp(left_part)) + len(self.nlp(right_part))
length_after_join = len(self.nlp(left_part + right_part))
if self.nlp(left_part)[-1].text in string.punctuation:
return False
return length_before_join == length_after_join
Usage:
import spacy
dt = detokenizer()
sentence = "I am the man, who dont dont know. And who won't. be doing"
nlp = spacy.load("en_core_web_sm")
spaCy_tokenized = nlp(sentence)
string_tokens = [a.text for a in spaCy_tokenized]
detokenized_sentence = dt.get_sentence(string_tokens)
list_of_words = dt(string_tokens)
print(sentence)
print(detokenized_sentence)
print(string_tokens)
print(list_of_words)
output:
I am the man, who dont dont know. And who won't. be doing
I am the man, who dont dont know. And who won't . be doing
['I', 'am', 'the', 'man', ',', 'who', 'do', 'nt', 'do', 'nt', 'know', '.', 'And', 'who', 'wo', "n't", '.', 'be', 'doing']
['I', 'am', 'the', 'man,', 'who', 'dont', 'dont', 'know.', 'And', 'who', "won't", '.', 'be', 'doing']
Downsides:
In this approach you may easily merge "do" and "nt", as well as strip space between the dot "." and preceding word.
This method is not perfect, as there are multiple possible combinations of sentences that lead to specific spaCy tokenization.
I am not sure if there is a method to fully detokenize a sentence when all you have is spaCy separated text, but this is the best I've got.
After having searched for hours on Google, only a few answers came along, with this very stack question being opened on 3 of my tabs on chrome ;), and all it wrote was basically "don't use spaCy, use revtok". As I couldn't change the tokenization other researchers chose, I had to develop my own solution. Hope it helps someone ;)
If I am given a string for example, "I like ham. I like cheese too! Do you?", and I want to create a nested list where a new list is created in the main list when a ".", "?", or"!" is seen. I have tried:
string = "I like ham. I like cheese too! Do you?"
for i in string:
if i == "?" or i == "." or i == "!":
list = string.split(i)
print(list)
This however, only works once and does not work if there is more than one period, question mark, or exclamation mark in the string. What I am trying to get as output is:
[['I', 'like', 'ham'], ['I', 'like', 'cheese', 'too'], ['Do', 'you']]
If anyone could help me that was be great! Thank you in advanced.
This does what you are looking for:
>>> s = "I like ham. I like cheese too! Do you?"
>>> import re
>>> [sentence.split() for sentence in re.split('[?.!]', s) if sentence]
[['I', 'like', 'ham'], ['I', 'like', 'cheese', 'too'], ['Do', 'you']]
How it works
re.split will split a string on a regular expression:
>>> re.split('[?.!]', s)
['I like ham', ' I like cheese too', ' Do you', '']
where [?.!] is a regular expression that matches any of ?, ., or !.
Once that we have split the string into sentences, we need to do one more step and that is split the sentences into words. So, as an example taking a single sentence:
>>> sentence = 'I like ham'
>>> sentence.split()
['I', 'like', 'ham']
We do the same to all sentences with:
>>> [sentence.split() for sentence in re.split('[?.!]', s) if sentence]
[['I', 'like', 'ham'], ['I', 'like', 'cheese', 'too'], ['Do', 'you']]
Notes
list is a python builtin and string is the name of a python standard library. Since you might need them in the future, it is poor practice to overwrite them. So, it is good form to pick other names for script variables.