Does anyone know if I can get the full vocabulary of the GloVe model?
I'm looking to do the same thing this guy does with BERT in this video (at 15:40): https://www.youtube.com/watch?v=zJW57aCBCTk&ab_channel=ChrisMcCormickAI
The GloVe vectors and their vocabulary are simply distributed as (space-separated column) text files. On a Unix-derived OS, you can get the vocabulary with a command like:
cut -f 1 -d ' ' glove.6B.50d.txt
If you'd like to do it in Python, the following works. The only trick is that the files use no quoting. Rather, the GloVe files simply use space as a delimiter and space is not allowed inside tokens.
import csv

vocab = set()
with open("glove.6B.100d.txt", encoding="utf-8") as f:
    reader = csv.reader(f, delimiter=" ", quoting=csv.QUOTE_NONE, escapechar=None)
    for row in reader:
        vocab.add(row[0])
print(vocab)
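If you already have gensim installed, another option is to load the vectors as KeyedVectors and read the vocabulary off them. This is a minimal sketch, assuming gensim >= 4.0 (which can read GloVe's header-less text format via no_header=True):

from gensim.models import KeyedVectors

kv = KeyedVectors.load_word2vec_format("glove.6B.100d.txt", binary=False, no_header=True)
vocab = kv.index_to_key  # list of all words, ordered by frequency rank
print(len(vocab), vocab[:10])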
When I try to run FastText using gensim in Python, the best I can get is a most_similar result where every returned item is a single character. (I'm on a Windows machine, which I've heard can affect the result.)
I have all of my data stored either in a csv file in which I've already tokenized each sentence, or in the original txt file I started with. When I try to use the csv file, I end up with the single-character result.
Here's the code I'm using to process my csv file (I'm analyzing how sports articles discuss white vs. nonwhite NFL quarterbacks differently; this is the code for my NonWhite results csv file):
from gensim.models import FastText
from gensim.test.utils import get_tmpfile, datapath
from gensim import utils
import os

embedding_size = 200
window_size = 10
min_word = 5
down_sampling = 1e-2

# modelpath and csvpath are assumed to be defined earlier in the script
if os.path.isfile(modelpath):
    model1 = FastText.load(modelpath)
else:
    class NWIter():
        def __iter__(self):
            path = datapath(csvpath)
            with utils.open(path, 'r') as fin:
                for line in fin:
                    yield line

    model1 = FastText(vector_size=embedding_size, window=window_size, min_count=min_word, sample=down_sampling, workers=4)
    model1.build_vocab(corpus_iterable=NWIter())
    exs1 = model1.corpus_count
    model1.train(corpus_iterable=NWIter(), total_examples=exs1, epochs=50)
    model1.save(modelpath)
The cleaned CSV data looked like this, with each row representing a sentence that had been cleaned (stopwords removed, tokenized, and lemmatized).
When that didn't work, I attempted to bring in the raw text, but got lots of UTF-8 encoding errors with unrecognizable characters. I worked around those, finally getting the raw text file to load, only for the single-character results to come back.
So the issue persists whether I use the csv file or the txt file. I'd prefer to stick with the csv since I've already processed that data; how can I bring it in without Python (or gensim) treating individual characters as the unit of analysis?
Edit:
Here are the results I get when I run:
print('NonWhite: ',model1.wv.most_similar('smart', topn=10))
NonWhite: [('d', 0.36853086948394775), ('q', 0.326141357421875), ('s', 0.3181183338165283), ('M', 0.27458563446998596), ('g', 0.2703150510787964), ('o', 0.215525820851326), ('x', 0.2153075635433197), ('j', 0.21472081542015076), ('f', 0.20139966905117035), ('a', 0.18369245529174805)]
Gensim's FastText model (like its other models in the Word2Vec family) needs each individual text as a list of string tokens, not a plain string.
If you pass texts as plain strings, they appear to be lists of single characters, because of the way Python treats strings. Hence, the only 'words' the model sees are single characters, including the individual spaces.
If the format of your file is such that each line is already a space-delimited text, you could simply change your yield line to:
yield line.split()
If instead it's truly a CSV, and your desired training texts are in only one column of the CSV, you should pick out that field and properly break it into a list-of-string-tokens.
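For example, here is a minimal sketch of such an iterator, assuming the pre-cleaned sentence lives in a single column of the CSV (the class name and column index are just placeholders to adjust):

import csv

class NWIterCSV:
    def __init__(self, csvpath, text_column=0):
        self.csvpath = csvpath
        self.text_column = text_column

    def __iter__(self):
        with open(self.csvpath, newline='', encoding='utf-8') as fin:
            for row in csv.reader(fin):
                # split the space-delimited, already-cleaned sentence into tokens
                yield row[self.text_column].split()

Each yielded item is then a list of tokens, which is what build_vocab() and train() expect.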
spaCy's POS tagger is really convenient: it can tag raw sentences directly.
import spacy
sp = spacy.load('en_core_web_sm')
sen = sp(u"I am eating")
But I'm using the tokenizer from NLTK. So how can I use an already-tokenized sentence like
['I', 'am', 'eating'] rather than the raw string 'I am eating' with spaCy's tagger?
BTW, where can I find detailed spaCy documentation?
I can only find an overview on the official website.
Thanks.
There are two options:
You write a wrapper around the NLTK tokenizer and use it to convert text to spaCy's Doc format, then overwrite nlp.tokenizer with that new custom function (see the sketch at the end of this answer). More info here: https://spacy.io/usage/linguistic-features#custom-tokenizer.
Generate a Doc directly from a list of strings, like so:
from spacy.tokens import Doc

doc = Doc(nlp.vocab, words=[u"I", u"am", u"eating", u"."],
          spaces=[True, True, False, False])
Defining the spaces is optional; if you leave it out, each word will be followed by a space by default. This matters when using e.g. doc.text afterwards. More information here: https://spacy.io/usage/linguistic-features#own-annotations
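To actually get POS tags on such a hand-built Doc, you still have to run it through the pipeline components. Here is a minimal sketch, assuming a loaded en_core_web_sm model (nlp.pipeline is a list of (name, component) pairs):

import spacy
from spacy.tokens import Doc

nlp = spacy.load("en_core_web_sm")
doc = Doc(nlp.vocab, words=["I", "am", "eating", "."],
          spaces=[True, True, False, False])

# apply each pipeline component (tagger, parser, ...) to the pre-tokenized Doc
for name, component in nlp.pipeline:
    doc = component(doc)

print([(token.text, token.pos_) for token in doc])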
[edit]: note that nlp and doc are sort of 'standard' variable names in spaCy; they correspond to the variables sp and sen respectively in your code.
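As for option 1, here is a minimal sketch of wrapping the NLTK tokenizer so spaCy uses it instead of its own; it assumes NLTK's punkt data and an en_core_web_sm model are installed, and the function name is just an illustration:

import nltk
import spacy
from spacy.tokens import Doc

nlp = spacy.load("en_core_web_sm")

def nltk_tokenizer(text):
    # build a spaCy Doc from NLTK's tokens; spaCy adds default spaces
    return Doc(nlp.vocab, words=nltk.word_tokenize(text))

nlp.tokenizer = nltk_tokenizer  # replace spaCy's tokenizer

doc = nlp("I am eating.")
print([(token.text, token.pos_) for token in doc])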
I tried to follow this.
But somehow I wasted a lot of time and ended up with nothing useful.
I just want to train a GloVe model on my own corpus (~900Mb corpus.txt file).
I downloaded the files provided in the link above and compiled them using Cygwin (after editing the demo.sh file and changing it to VOCAB_FILE=corpus.txt; should I leave CORPUS=text8 unchanged?).
The output was:
cooccurrence.bin
cooccurrence.shuf.bin
text8
corpus.txt
vectors.txt
How can I use those files to load a GloVe model in Python?
You can do it using the glove_python library:
Install it: pip install glove_python
Then:
from glove import Corpus, Glove

# `lines` should be an iterable of tokenized sentences (lists of string tokens)
# Create a corpus object and build the co-occurrence matrix used by GloVe
corpus = Corpus()
corpus.fit(lines, window=10)

# Train the GloVe model on the co-occurrence matrix
glove = Glove(no_components=5, learning_rate=0.05)
glove.fit(corpus.matrix, epochs=30, no_threads=4, verbose=True)
glove.add_dictionary(corpus.dictionary)
glove.save('glove.model')
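To use the saved model later, you can load it back and query it. A minimal sketch, assuming the glove_python API (Glove.load and most_similar); 'king' is just a placeholder for a word that actually appears in your corpus:

from glove import Glove

glove = Glove.load('glove.model')
print(glove.most_similar('king', number=10))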
Reference: word vectorization using glove
This is how you run the model
$ git clone http://github.com/stanfordnlp/glove
$ cd glove && make
To train it on your own corpus, you just have to make changes to one file, namely demo.sh.
Remove the script from if to fi after 'make'.
Replace the CORPUS name with your file name 'corpus.txt'.
There is another if block at the end of demo.sh:
if [ "$CORPUS" = 'text8' ]; then
Replace text8 with your file name.
Run the demo.sh once the changes are made.
$ ./demo.sh
Make sure your corpus file is in the correct format. You'll need to prepare your corpus as a single text file with all words separated by one or more spaces or tabs. If your corpus has multiple documents, the documents (only) should be separated by newline characters.
Your corpus should go into the variable CORPUS. The vectors.txt file is the output, which is what you'll actually use. You can train GloVe in Python, but it takes more time and you need a C compiling environment. I tried it before and wouldn't recommend it.
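If you then want those trained vectors back in Python, one option is to load vectors.txt with gensim. A minimal sketch, assuming gensim >= 4.0 (which can read GloVe's header-less text format via no_header=True); 'king' is just a placeholder word from your corpus:

from gensim.models import KeyedVectors

vectors = KeyedVectors.load_word2vec_format('vectors.txt', binary=False, no_header=True)
print(vectors.most_similar('king', topn=10))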
Here is my take on this:
After cloning the repository, edit the demo.sh file: since you have to train it on your own corpus, replace the CORPUS name with your file's name.
Then remove the script between make and CORPUS, as that part only downloads an example corpus for you.
Then run make, which will create the four files in the build folder.
Now run ./demo.sh, which will train and do all the stuff mentioned in the script on your own corpus; the output will be generated as a vectors.txt file.
Note: Don't forget to keep your corpus file directly inside the GloVe folder.
I'm looking for a way to POS-tag a French sentence, the way the following code works for English sentences:
import nltk

def pos_tagging(sentence):
    var = sentence
    exampleArray = [var]
    for item in exampleArray:
        tokenized = nltk.word_tokenize(item)
        tagged = nltk.pos_tag(tokenized)
    return tagged
Here is the full source code; it works very well.
Download link for the Stanford tagger: https://nlp.stanford.edu/software/tagger.shtml#About
from nltk.tag import StanfordPOSTagger
jar = 'C:/Users/m.ferhat/Desktop/stanford-postagger-full-2016-10-31/stanford-postagger-3.7.0.jar'
model = 'C:/Users/m.ferhat/Desktop/stanford-postagger-full-2016-10-31/models/french.tagger'
import os
java_path = "C:/Program Files/Java/jdk1.8.0_121/bin/java.exe"
os.environ['JAVAHOME'] = java_path
pos_tagger = StanfordPOSTagger(model, jar, encoding='utf8' )
res = pos_tagger.tag('je suis libre'.split())
print (res)
NLTK doesn't come with pre-built resources for French. I recommend using the Stanford tagger, which ships with a trained French model. This code shows how you might set up NLTK for use with Stanford's French POS tagger. Note that the code is outdated (and written for Python 2), but you could use it as a starting point.
Alternatively, NLTK makes it very easy to train your own POS tagger on a tagged corpus and save it for later use. If you have access to a (sufficiently large) tagged French corpus, you can follow the instructions in the NLTK book and simply use your corpus in place of the Brown corpus. You're unlikely to match the performance of the Stanford tagger (unless you can train a tagger for your specific domain), but you won't have to install anything.
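For instance, here is a minimal sketch of training and saving a backoff n-gram tagger, with the Brown corpus standing in for an actual tagged French corpus (substitute your own tagged sentences; the file name is a placeholder):

import pickle
import nltk
from nltk.corpus import brown  # stand-in; replace with your tagged French corpus

tagged_sents = brown.tagged_sents()
split = int(len(tagged_sents) * 0.9)
train_sents, test_sents = tagged_sents[:split], tagged_sents[split:]

# backoff chain: bigram -> unigram -> default tag
t0 = nltk.DefaultTagger('NN')
t1 = nltk.UnigramTagger(train_sents, backoff=t0)
t2 = nltk.BigramTagger(train_sents, backoff=t1)
print(t2.accuracy(test_sents))  # use .evaluate() on NLTK versions before 3.6

# save the trained tagger for later use
with open('my_tagger.pickle', 'wb') as f:
    pickle.dump(t2, f)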
I have a text document from which I'd like to extract the noun phrases. In the first step I extract sentences, then I do part-of-speech (POS) tagging for each sentence, and then I do chunking using those POS tags. I used Stanford NLP for these tasks, and this is the code for extracting the sentences:
Reader reader = new StringReader(text);
DocumentPreprocessor dp = new DocumentPreprocessor(reader);
I think DocumentPreprocessor does POS tagging under the hood in order to extract the sentences. However, I'm also doing POS tagging to extract the noun phrases in the second phase. That is, POS tagging is done twice, and because it is computationally expensive, I'm looking for a way to do it only once. Is there any way to do POS tagging only once and use it for both sentence extraction and noun phrase extraction?
No, DocumentPreprocessor does not run a tagger while it loads the text. (NB, it does have the capability to parse pre-tagged text, i.e. parse tokens in a file like dog_NN.)
In short: you aren't doing extra work, so I suppose that's good news!
I'm not sure. Try using NLTK (a Python package):
import nltk

text = nltk.word_tokenize("And now for something completely different")
nltk.pos_tag(text)
[('And', 'CC'), ('now', 'RB'), ('for', 'IN'), ('something', 'NN'),
('completely', 'RB'), ('different', 'JJ')]