Lemmatizing words like "movin", "roamin" or any sort of slang word - nlp

I'm currently testing with both NLTK's and spaCy's lemmatizers (both the accuracy-focused and the large models), but they simply cannot properly lemmatize slang words that aren't in the dictionary.
import spacy
!python3 -m spacy download en_core_web_lg
nlp = spacy.load("en_core_web_lg", disable=['parser', 'ner'])
doc = nlp("Hello I am movin to the target so i can roamin around")
for token in doc:
    print(token, token.lemma_)
Hello hello
I I
am be
movin movin
to to
the the
target target
so so
i I
can can
roamin roamin
around around
This is my output, but I would want:
movin -> move
roamin -> roam
thank you!
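One possible workaround (a rough sketch, not a full solution): map known slang spellings to their standard forms before running the pipeline, so the stock lemmatizer only sees regular words. The slang_map lookup below is a hypothetical example and would need to be extended for real data.
import spacy

# Hypothetical lookup of slang spellings -> standard spellings; extend as needed.
slang_map = {"movin": "moving", "roamin": "roaming"}

def normalise(text):
    # Swap any known slang token for its standard spelling before lemmatizing.
    return " ".join(slang_map.get(word.lower(), word) for word in text.split())

nlp = spacy.load("en_core_web_lg", disable=['parser', 'ner'])
doc = nlp(normalise("Hello I am movin to the target so i can roamin around"))
for token in doc:
    print(token, token.lemma_)  # "moving" and "roaming" should now lemmatize to "move" and "roam"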

Related

What is *.subwords file in natural language processing to use as vocabulary file?

I have been trying to create a vocab file for an NLP task, to use in the tokenize method of trax, but I can't find which module/library to use to create the *.subwords file. Please help me out.
The easiest way to use trax.data.Tokenize with your own data and a subword vocabulary is to use Google's SentencePiece Python module:
import sentencepiece as spm
spm.SentencePieceTrainer.train('--input=data/my_data.csv --model_type=bpe --model_prefix=my_model --vocab_size=32000')
This creates two files:
my_model.model
my_model.vocab
We'll use this model in trax.data.Tokenize and we'll add the parameter vocab_type with the value "sentencepiece"
trax.data.Tokenize(vocab_dir='vocab/', vocab_file='my_model.model', vocab_type='sentencepiece')
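For context, a rough sketch of where that call usually sits in an input pipeline (trax.data.Serial, trax.data.Shuffle and the keys argument are assumptions based on typical trax usage, not part of the original answer):
from trax import data

# Hypothetical pipeline: examples are (text, label) pairs, and we tokenize the text field (key 0).
data_pipeline = data.Serial(
    data.Tokenize(vocab_dir='vocab/', vocab_file='my_model.model',
                  vocab_type='sentencepiece', keys=[0]),
    data.Shuffle(),
)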
I think this is the best way, since you can load the model and use it to get the control ids instead of hard-coding them:
sp = spm.SentencePieceProcessor()
sp.load('my_model.model')
print('bos=sp.bos_id()=', sp.bos_id())
print('eos=sp.eos_id()=', sp.eos_id())
print('unk=sp.unk_id()=', sp.unk_id())
print('pad=sp.pad_id()=', sp.pad_id())
sentence = "hello world"
# encode: text => id
print("Pieces: ", sp.encode_as_pieces(sentence))
print("Ids: ", sp.encode_as_ids(sentence))
# decode: id => text
print("Decode Pieces: ", sp.decode_pieces(sp.encode_as_pieces(sentence)))
print("Decode ids: ", sp.decode_ids(sp.encode_as_ids(sentence)))
print([sp.bos_id()] + sp.encode_as_ids(sentence) + [sp.eos_id()])
If you still want a subword file, try this:
python trax/data/text_encoder_build_subword.py \
--corpus_filepattern=data/data.txt --corpus_max_lines=40000 \
--output_filename=data/my_file.subword
I hope this helps, since there is little clear documentation out there on how to create compatible subword files.
You can use the TensorFlow API SubwordTextEncoder. Use the following code snippet:
import tensorflow_datasets as tfds

encoder = tfds.deprecated.text.SubwordTextEncoder.build_from_corpus(
    (text_row for text_row in text_dataset), target_vocab_size=2**15)
encoder.save_to_file(vocab_fname)
TensorFlow will append the .subwords extension to the above vocab file.
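As a rough usage sketch (assuming tensorflow_datasets is installed and vocab_fname is the prefix used above), the saved vocabulary can be loaded back and used to encode and decode text:
import tensorflow_datasets as tfds

# Load the vocabulary saved above; the ".subwords" extension is added automatically.
encoder = tfds.deprecated.text.SubwordTextEncoder.load_from_file(vocab_fname)

ids = encoder.encode("hello world")   # text -> list of subword ids
print(ids)
print(encoder.decode(ids))            # ids -> text
print(encoder.vocab_size)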

Using glove.6B.100d.txt embedding in spacy getting zero lex.rank

I am trying to load the GloVe 100d embeddings into the spaCy NLP pipeline.
I create the vocabulary in spacy format as follows:
python -m spacy init-model en spacy.glove.model --vectors-loc glove.6B.100d.txt
glove.6B.100d.txt is converted to word2vec format by adding "400000 100" in the first line.
Now spacy.glove.model/vocab has the following files:
5468549 key2row
38430528 lexemes.bin
5485216 strings.json
160000128 vectors
In the code:
import spacy
nlp = spacy.load("en_core_web_md")
from spacy.vocab import Vocab
vocab = Vocab().from_disk('./spacy.glove.model/vocab')
nlp.vocab = vocab
print(len(nlp.vocab.strings))
print(nlp.vocab.vectors.shape)
gives
407174
(400000, 100)
However the problem is that:
V=nlp.vocab
max_rank = max(lex.rank for lex in V if lex.has_vector)
print(max_rank)
gives 0
I just want to use the 100d glove embeddings within spacy in combination with "tagger", "parser", "ner" models from en_core_web_md.
Does anyone know how to go about doing this correctly (is this possible)?
The tagger/parser/ner models are trained with the included word vectors as features, so if you replace them with different vectors you are going to break all those components.
You can use new vectors to train a new model, but replacing the vectors in a model with trained components is not going to work well. The tagger/parser/ner components will most likely provide nonsense results.
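For the training route, a rough sketch with spaCy v2's command line (train.json and dev.json are placeholders for your own annotated data in spaCy's training format, and ./output is a hypothetical output directory):
# Train new tagger/parser/ner components on top of the GloVe vectors built earlier.
python -m spacy train en ./output ./train.json ./dev.json --vectors ./spacy.glove.model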
If you want 100d vectors instead of 300d vectors to save space, you can resize the vectors, which will truncate each entry to its first 100 dimensions. The performance will go down a bit as a result.
import spacy
nlp = spacy.load("en_core_web_md")
assert nlp.vocab.vectors.shape == (20000, 300)
nlp.vocab.vectors.resize((20000, 100))

How to convert the text into vector using word2vec embedding?

Suppose I have a dataframe shown below:
|Text
|Storm in RI worse than last hurricane
|Green Line derailment in Chicago
|MEG issues Hazardous Weather Outlook
I created word2vec model using below code:
def sent_to_words(sentences):
    for sentence in sentences:
        yield gensim.utils.simple_preprocess(str(sentence), deacc=True)
text_data = sent_to_words(df['Text'])
w2v_model = gensim.models.Word2Vec(text_data, size=100, min_count=1, window=5, iter=50)
Now, how will I convert the text in the 'Text' column to vectors using this word2vec model?
You can get the generated word embeddings with
w2v_model.wv
and the embedding of a specific word with
w2v_model.wv['word']
Word2Vec models can only map words to vectors, so, as @metalrt mentioned, you have to use a function over the set of word vectors to convert them to a single sentence vector. A good baseline is to compute the mean of the word vectors:
import numpy as np
df["Text"].apply(lambda text: np.mean([w2v_model.wv[word] for word in text.split() if word in w2v_model.wv]))
The example above implements very simple tokenization by whitespace characters. You can also use the spaCy library for better tokenization:
import spacy
nlp = spacy.load("en_core_web_sm")
df["Text"].apply(lambda text: np.mean([self.keyed_vectors[token.text] for token in nlp.pipe(text) if not token.is_punct and token.text in self.keyed_vectors]))

How to turn a list of words into a list of vectors using a pre-trained word2vec model(Google)?

I am trying to learn word2vec.
I am using the code below to load the Google pre-trained word2vec model in Python 3, but I am unsure how to turn a list such as ["I", "ate", "apple"] into a list of vectors (i.e. how do I get vectors from this model?).
import nltk
import gensim
# Load Google's pre-trained Word2Vec model.
model = gensim.models.KeyedVectors.load_word2vec_format('./model/GoogleNews-vectors-negative300.bin', binary=True)
You get the vector via idiomatic Python keyed-index-access (brackets). For example:
wv_apple = model['apple']
You can create a new list based on some operation on every item of an existing list via an idiomatic Python 'list comprehension' ([expression(x) for x in some_list]). For example:
words = ["I", "ate", "apple"]
vectors = [model[word] for word in words]
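One thing to watch out for (an added note, not from the original answer): the pretrained model does not contain every token, and indexing a missing word raises a KeyError, so you may want to skip out-of-vocabulary words:
words = ["I", "ate", "apple"]
# Keep only words present in the pretrained vocabulary to avoid KeyError.
vectors = [model[word] for word in words if word in model]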

Using British National Corpus in NLTK

I am new to NLTK (http://www.nltk.org/), and python for that matter. I wish to use the NLTK python library, but use the BNC for the corpus. I do not believe this corpus is distributed through the NLTK Data download. Is there a way to import the BNC corpus to be used by NLTK. If so, how? I did find a function called BNCCorpusReader but have no idea how to use it. Also, at the BNC site, I was able to download the corpus (http://ota.ox.ac.uk/desc/2554).
http://www.nltk.org/api/nltk.corpus.reader.html?highlight=bnc#nltk.corpus.reader.BNCCorpusReader.word
Update
I have tried entrophy's suggestion, but get the following error:
raise IOError('No such file or directory: %r' % _path)
OSError: No such file or directory: 'C:\\Users\\jason\\Documents\\NetBeansProjects\\DemoCollocations\\src\\Corpora\\bnc\\A\\A0\\A00.xml'
My code to read in the corpora:
bnc_reader = BNCCorpusReader(root="Corpora/bnc", fileids=r'[A-K]/\w*/\w*\.xml')
And my corpus is located in:
C:\Users\jason\Documents\NetBeansProjects\DemoCollocations\src\Corpora\bnc\
Regarding example usage of NLTK for collocation extraction, take a look at the following guide: A how-to guide by NLTK on collocations extraction.
As far as the BNC corpus reader is concerned, all the information is right there in the documentation:
from nltk.corpus.reader.bnc import BNCCorpusReader
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder
# Instantiate the reader like this
bnc_reader = BNCCorpusReader(root="/path/to/BNC/Texts", fileids=r'[A-K]/\w*/\w*\.xml')
# And say you wanted to extract all bigram collocations and then later wanted to
# sort them just by their frequency, this is what you would do.
# Again, take a look at the NLTK guide on collocations for more examples.
list_of_fileids = ['A/A0/A00.xml', 'A/A0/A01.xml']
bigram_measures = BigramAssocMeasures()
finder = BigramCollocationFinder.from_words(bnc_reader.words(fileids=list_of_fileids))
scored = finder.score_ngrams(bigram_measures.raw_freq)
print(scored)
The output of that will look something like this:
[(('of', 'the'), 0.004902261167963723), (('in', 'the'),0.003554139346773699),
(('.', 'The'), 0.0034315828175746064), (('Gift', 'Aid'), 0.0019609044671854894),
((',', 'and'), 0.0018996262025859428), (('for', 'the'), 0.0018383479379863962), ... ]
And if you wanted just the sorted list of bigrams (without the scores), you could try something like this:
sorted_bigrams = sorted(bigram for bigram, score in scored)
print(sorted_bigrams)
Resulting:
[('!', 'If'), ('!', 'Of'), ('!', 'Once'), ('!', 'Particularly'), ('!', 'Raising'),
('!', 'YOU'), ('!', '‘'), ('&', 'Ealing'), ('&', 'Public'), ('&', 'Surrey'),
('&', 'TRAINING'), ("'", 'SPONSORED'), ("'S", 'HOME'), ("'S", 'SERVICE'), ... ]
