I have a very large number of sentences, and the problem is that I cannot load them all into memory at once; especially when I tokenize the sentences and split them into lists of words, my RAM fills up really fast.
I couldn't find any example of how to train gensim word2vec in batches. I assume that in each epoch I have to somehow load a batch of data from disk, tokenize it, feed it to the model, then unload it and load the next batch.
How can I overcome this problem and train a word2vec model when I don't have enough RAM to load all the sentences (not even 20% of them)?
My sentences are stored in a text file, with each line representing one sentence.
You can define your own corpus class, as suggested in the gensim docs; the size of the corpus doesn't matter in this case, because sentences are streamed from disk one at a time instead of being held in memory:
from gensim.test.utils import datapath
from gensim import utils

class MyCorpus(object):
    """An iterator that yields sentences (lists of str)."""

    def __iter__(self):
        corpus_path = datapath('lee_background.cor')
        for line in open(corpus_path):
            # assume there's one document per line, tokens separated by whitespace
            yield utils.simple_preprocess(line)
Then train it as follows:
import gensim.models
sentences = MyCorpus()
model = gensim.models.Word2Vec(sentences=sentences)
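Since your sentences already live in a text file with one sentence per line, the same pattern can point directly at that file. Here is a minimal sketch (my_sentences.txt is a placeholder for your own path); because __iter__ restarts from the beginning on every call, gensim can stream over the file once to build the vocabulary and again for each training pass without ever holding it all in RAM:

from gensim import utils
import gensim.models

class FileCorpus(object):
    """Streams sentences from a plain-text file, one sentence per line."""

    def __init__(self, path):
        self.path = path

    def __iter__(self):
        with open(self.path, encoding='utf-8') as fin:
            for line in fin:
                # only one line is tokenized and held in memory at a time
                yield utils.simple_preprocess(line)

model = gensim.models.Word2Vec(sentences=FileCorpus('my_sentences.txt'))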
To compare different paragraphs, I am trying to use a transformer model: I feed each paragraph to the model and then, at the end, I intend to compare the outputs and see which paragraphs are most similar.
For this purpose, I am using the roberta-base model. I first run the RoBERTa tokenizer on a paragraph and then run the RoBERTa model on the tokenized output. But the process fails due to lack of memory; even 25 GB of RAM is not enough to complete the process for a paragraph with 1324 lines.
Any idea how I can make this better, or any suggestions as to what mistakes I might be making?
from transformers import RobertaTokenizer, RobertaModel
import torch
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
model = RobertaModel.from_pretrained("roberta-base").to(device)
inputs = tokenizer(dict_anrika['Anrika'], return_tensors="pt", truncation=True,
                   padding=True).to(device)
outputs = model(**inputs)
Sounds like you gave the model an input of shape [1324, longest_length_in_batch], which is huge. I tried a [1000, 512] input and found that even a server with 200 GB of RAM hits OOM.
One solution is to break the huge input into smaller batches, for example 10 lines at a time, and run the model on each batch separately.
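A minimal sketch of that idea, reusing the tokenizer, model and device from the question (it assumes dict_anrika['Anrika'] is a list of strings; the batch size of 10 and the mean-pooling of the outputs are just illustrative choices):

import torch

lines = dict_anrika['Anrika']  # list of strings, as in the question
batch_size = 10
embeddings = []

with torch.no_grad():  # no gradients needed for inference, saves a lot of memory
    for start in range(0, len(lines), batch_size):
        batch = lines[start:start + batch_size]
        inputs = tokenizer(batch, return_tensors="pt", truncation=True,
                           padding=True).to(device)
        outputs = model(**inputs)
        # mean-pool each line's token embeddings and move them off the GPU
        embeddings.append(outputs.last_hidden_state.mean(dim=1).cpu())

embeddings = torch.cat(embeddings)  # shape: [num_lines, hidden_size]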
I am working on text generation using a seq2seq model where a GloVe embedding is being used. I want to use a custom Word2Vec (CBOW/gensim) embedding in this code. Can anyone please help me use my custom embedding instead of GloVe?
def initialize_embeddings(self):
    """Reads the GloVe word embeddings and creates the embedding matrix and the word-to-index and index-to-word mappings."""
    # load the word embeddings
    self.word2vec = {}
    with open(glove_path % self.EMBEDDING_DIM, 'r') as file:
        for line in file:
            vectors = line.split()
            self.word2vec[vectors[0]] = np.asarray(vectors[1:], dtype="float32")

    # get the embeddings matrix
    self.num_words = min(self.MAX_VOCAB_SIZE, len(self.word2idx) + 1)
    self.embeddings_matrix = np.zeros((self.num_words, self.EMBEDDING_DIM))
    for word, idx in self.word2idx.items():
        if idx < self.num_words:  # valid row indices are 0 .. num_words - 1
            word_embeddings = self.word2vec.get(word)
            if word_embeddings is not None:
                self.embeddings_matrix[idx] = word_embeddings

    self.idx2word = {v: k for k, v in self.word2idx.items()}
This code is for a GloVe embedding that is read into a word2vec-style dictionary. I want to load my own trained Word2Vec embedding instead.
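For reference, a rough sketch of what I have in mind (assuming a gensim model saved as word2vec.model whose vector_size matches self.EMBEDDING_DIM; the names are placeholders and the code is untested):

import numpy as np
from gensim.models import Word2Vec

def initialize_embeddings_from_word2vec(self):
    """Builds the embedding matrix from a trained gensim Word2Vec model instead of GloVe."""
    w2v_model = Word2Vec.load("word2vec.model")  # placeholder path
    word_vectors = w2v_model.wv  # gensim KeyedVectors

    self.num_words = min(self.MAX_VOCAB_SIZE, len(self.word2idx) + 1)
    self.embeddings_matrix = np.zeros((self.num_words, self.EMBEDDING_DIM))
    for word, idx in self.word2idx.items():
        if idx < self.num_words and word in word_vectors:
            self.embeddings_matrix[idx] = word_vectors[word]

    self.idx2word = {v: k for k, v in self.word2idx.items()}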
word2vec and GloVe are techniques for producing word embeddings, i.e., for modelling text (a set of sentences) as computer-readable vectors.
While word2vec trains on the local context (neighboring words), GloVe looks at word co-occurrences across a whole text or corpus, so its approach is more global.
word2vec
There are two main approaches for word2vec, in which the algorithm loops through the words of each sentence. For each current word w it will try to predict
- the neighboring words from w: this is the Skip-Gram approach
- w from its surrounding context: this is the CBOW approach
Hence, word2vec will produce similar embeddings for words with similar contexts, for instance a noun and its plural, or two synonyms.
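In gensim the choice between the two is a single parameter (sg=0, the default, is CBOW; sg=1 is Skip-Gram). A minimal sketch with toy sentences:

from gensim.models import Word2Vec

toy_sentences = [["the", "cat", "sat", "on", "the", "mat"],
                 ["the", "dog", "sat", "on", "the", "rug"]]

cbow_model = Word2Vec(toy_sentences, sg=0, min_count=1)      # CBOW: predict w from its context
skipgram_model = Word2Vec(toy_sentences, sg=1, min_count=1)  # Skip-Gram: predict the context from w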
GloVe
The main intuition underlying the GloVe model is the simple observation that ratios of word-word co-occurrence probabilities have the potential to encode some form of meaning. In other words, the embeddings are based on the computation of distances between pairs of target words. The model computes the distance between two target words in a text by analyzing the co-occurrence of those two target words with some other probe words (contextual words).
https://nlp.stanford.edu/projects/glove/
For example, consider the co-occurrence probabilities for the target words "ice" and "steam" with various probe words from the vocabulary; the table at the link above shows actual probabilities from a 6 billion word corpus.
As one might expect, "ice" co-occurs more frequently with "solid" than it does with "gas", whereas "steam" co-occurs more frequently with "gas" than it does with "solid". Both words co-occur with their shared property "water" frequently, and both co-occur with the unrelated word "fashion" infrequently. Only in the ratio of probabilities does noise from non-discriminative words like "water" and "fashion" cancel out, so that large values (much greater than 1) correlate well with properties specific to "ice", and small values (much less than 1) correlate well with properties specific to "steam". In this way, the ratio of probabilities encodes some crude form of meaning associated with the abstract concept of thermodynamic phase.
Also, GloVe is very good at analogies and performs well on the word analogy dataset used to evaluate word2vec.
Hey guys, I have a pretrained word2vec binary file and I want to continue training it on my own corpus.
Approach I tried:
I extracted a txt file from the bin file I had, loaded it as a word2vec file, trained it further on my own corpus, and saved the model, but the model performs badly for the words that are in the pretrained bin file (I used the intersect_word2vec_format command for this).
Here is the script I used.
What approach should I take so that my model performs well on words from both the pretrained file and my corpus?
Load your model and use build_vocab with update=True:
import gensim
from gensim.models import Word2Vec
model = Word2Vec.load('w2vmodel.bin')
my_corpus = ... # load your corpus as sentences here
model.build_vocab(my_corpus, update=True)
model.train(my_corpus, total_examples=model.corpus_count, epochs=model.epochs)  # recent gensim versions require total_examples and epochs here
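After training, you would typically save the updated model again so it can be reloaded later (the file name is just a placeholder):

model.save('w2vmodel_updated.bin')  # reload later with Word2Vec.load('w2vmodel_updated.bin')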
It's not really clear to me when intersect_word2vec_format is helpful, but you can read more about its intended use case here. It does seem that it's not meant for ordinary re-training of vectors, though.
I am trying to train the word2vec model from gensim using the Italian wikipedia
"http://dumps.wikimedia.org/itwiki/latest/itwiki-latest-pages-articles.xml.bz2"
However, I am not sure what the best preprocessing for this corpus is.
The gensim model accepts a list of tokenized sentences.
My first try was to just use the standard WikiCorpus preprocessor from gensim. This extracts each article, removes punctuation and splits words on spaces. With this tool each "sentence" corresponds to an entire article, and I am not sure what impact this has on the model.
After this I train the model with default parameters. Unfortunately, after training it seems that I do not obtain very meaningful similarities.
What is the most appropriate preprocessing of the Wikipedia corpus for this task? (If this question is too broad, please help me by pointing to a relevant tutorial / article.)
This is the code of my first trial:
from gensim.corpora import WikiCorpus
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
corpus = WikiCorpus('itwiki-latest-pages-articles.xml.bz2',dictionary=False)
max_sentence = -1
def generate_lines():
    for index, text in enumerate(corpus.get_texts()):
        if index < max_sentence or max_sentence == -1:
            yield text
        else:
            break
from gensim.models.word2vec import BrownCorpus, Word2Vec
model = Word2Vec()
model.build_vocab(generate_lines()) #This strangely builds a vocab of "only" 747904 words which is << than those reported in the literature 10M words
model.train(generate_lines(),chunksize=500)
Your approach is fine.
model.build_vocab(generate_lines()) #This strangely builds a vocab of "only" 747904 words which is << than those reported in the literature 10M words
This could be because of pruning infrequent words (the default is min_count=5).
To speed up computation, you can consider "caching" the preprocessed articles as a plain .txt.gz file, one sentence (document) per line, and then simply using a word2vec.LineSentence corpus. This saves you from parsing the bzipped wiki XML on every iteration.
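A minimal sketch of that caching step (assuming a recent gensim where WikiCorpus.get_texts() yields lists of unicode tokens and LineSentence can read compressed files; the output file name is just a placeholder):

import gzip
from gensim.corpora import WikiCorpus
from gensim.models.word2vec import LineSentence, Word2Vec

# one-off pass: dump every preprocessed article as one whitespace-separated line
corpus = WikiCorpus('itwiki-latest-pages-articles.xml.bz2', dictionary=False)
with gzip.open('itwiki-articles.txt.gz', 'wt', encoding='utf-8') as out:
    for tokens in corpus.get_texts():
        out.write(' '.join(tokens) + '\n')

# later runs stream the cached plain-text file instead of re-parsing the XML
model = Word2Vec(LineSentence('itwiki-articles.txt.gz'), min_count=5)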
Why word2vec doesn't produce "meaningful similarities" for Italian wiki, I don't know. English wiki seems to work fine. See also here.
I've been working on a project to massage the wikipedia corpus and get vectors out of it.
I might generate the Italian vectors soon but in case you want to do it on your own take a look at:
https://github.com/idio/wiki2vec
I want to build a tf-idf model based on a corpus that cannot fit in memory. I read the tutorial, but there the corpus seems to be loaded all at once:
from sklearn.feature_extraction.text import TfidfVectorizer
corpus = ["doc1", "doc2", "doc3"]
vectorizer = TfidfVectorizer(min_df=1)
vectorizer.fit(corpus)
I wonder if I can load the documents into memory one by one instead of loading all of them.
Yes, you can; just make your corpus an iterator. For example, if your documents reside on disk, you can define a generator that takes the list of file names as an argument and yields the documents one by one, without loading everything into memory at once.
from sklearn.feature_extraction.text import TfidfVectorizer
def make_corpus(doc_files):
    for doc in doc_files:
        yield load_doc_from_file(doc)  # load_doc_from_file is a custom function for loading a doc from a file
file_list = ... # list of files you want to load
corpus = make_corpus(file_list)
vectorizer = TfidfVectorizer(min_df=1)
vectorizer.fit(corpus)
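Keep in mind that a generator is exhausted after a single pass, so for the transform step you need to build a fresh one (a small usage sketch, reusing make_corpus and file_list from above):

# fit() consumed the first generator, so create a new one for transform()
tfidf_matrix = vectorizer.transform(make_corpus(file_list))
print(tfidf_matrix.shape)  # (number of documents, vocabulary size)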