Method for distinguishing between words and non-words - nlp

I'm working with the Stack Exchange data dump and attempting to identify unique and novel words in the corpus. I'm doing this by referencing a very large word list and extracting the words not present in that reference list.
The problem I am running up against is that many of the unique tokens are non-words, like directory names, error codes, and other strings.
Is there a good method for differentiating word-like strings from non-word-like strings?
I'm using NLTK, but am not limited to that toolkit.

This is an interesting problem because it's so difficult to define what makes a combination of characters a word. I would suggest using supervised machine learning.
First, take the current output from your program and manually annotate each example as word or non-word.
Then, come up with some features, e.g.
number of characters
first three characters
last three characters
preceding word
following word
...
Then, use a library like scikit-learn to train a model that captures these differences and can predict the likelihood of "wordness" for any sequence of characters.
Potentially a one-class classifier would be useful here. But in any case, prepare some annotated data so that you can evaluate the accuracy of this or any other approach.
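A minimal sketch of such a classifier, assuming a hypothetical hand-annotated file labeled.tsv with token<TAB>label pairs; character n-grams stand in for the first/last-three-characters features listed above:

    # Sketch only: "labeled.tsv" is a hypothetical file you create by manually
    # annotating tokens as "word" or "nonword" (one token<TAB>label per line).
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split
    from sklearn.pipeline import make_pipeline

    tokens, labels = [], []
    with open("labeled.tsv", encoding="utf-8") as f:
        for line in f:
            tok, lab = line.rstrip("\n").split("\t")
            tokens.append(tok)
            labels.append(lab)

    X_train, X_test, y_train, y_test = train_test_split(
        tokens, labels, test_size=0.2, random_state=0)

    # Character 1-3-grams roughly approximate the "first/last three characters" features.
    clf = make_pipeline(
        CountVectorizer(analyzer="char_wb", ngram_range=(1, 3)),
        LogisticRegression(max_iter=1000),
    )
    clf.fit(X_train, y_train)
    print("held-out accuracy:", clf.score(X_test, y_test))
    print(clf.predict(["prettier", "/usr/lib/libfoo.so.1", "ERR_0x80004005"]))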

Related

How can I recover the likelihood of a certain word appearing in a given context from word embeddings?

I know that some methods of generating word embeddings (e.g. CBOW) are based on predicting the likelihood of a given word appearing in a given context. I'm working with the Polish language, which is sometimes ambiguous with respect to segmentation; e.g. 'Coś' can be treated either as one word or as two words that have been conjoined ('Co' + '-ś'), depending on the context. What I want to do is create a tokenizer which is context sensitive. Assuming that I have the vector representation of the preceding context, and all possible segmentations, could I somehow calculate, or approximate, the likelihood of particular words appearing in this context?
This very much depends on how you got your embeddings. The CBOW model has two parameter matrices: the input embedding matrix, denoted v, and the output projection matrix, denoted v'. If you want to recover the probabilities that are used in the CBOW model at training time, you need to get v' as well.
See equation (2) in the word2vec paper. Tools for pre-computing word embeddings usually don't do that, so you would need to modify them yourself.
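For reference, a rough reconstruction of that softmax (notation may differ slightly from the paper): CBOW averages the input embeddings of the context words and scores every candidate output word against that average,

$$ p(w_O \mid \text{context}) = \frac{\exp\big({v'_{w_O}}^{\top}\,\bar{v}\big)}{\sum_{w=1}^{W} \exp\big({v'_{w}}^{\top}\,\bar{v}\big)}, \qquad \bar{v} = \frac{1}{C}\sum_{c=1}^{C} v_{w_c} $$

so without v' the numerator cannot be computed from the input embeddings alone.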
Anyway, if you want to compute the probability of a word given a context, you should think about using a (neural) language model rather than a table of word embeddings. If you search the Internet, I am sure you will find something that suits your needs.

What is the difference between keras.tokenize.text_to_sequences and word embeddings

Difference between tokenize.fit_on_text, tokenize.text_to_sequence and word embeddings?
Tried to search on various platforms but didn't get a suitable answer.
Word embeddings are a way of representing words such that words with the same or similar meaning have a similar representation. Two commonly used algorithms that learn word embeddings are Word2Vec and GloVe.
Note that word embeddings can also be learnt from scratch while training your neural network on your specific NLP problem. You can also use transfer learning; in this case, that means transferring word representations learned on huge datasets to your problem.
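As a rough illustration of that transfer-learning idea (the pretrained model name below is just one of the sets gensim's downloader happens to offer, not something from the question):

    import gensim.downloader as api

    # Downloads 100-dimensional GloVe vectors pretrained on Wikipedia + Gigaword.
    wv = api.load("glove-wiki-gigaword-100")
    print(wv["dog"][:5])                   # the pre-learned representation of "dog"
    print(wv.most_similar("dog", topn=3))  # words with similar meaning sit nearby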
As for the tokenizer (I assume it's Keras that we're speaking of), taking from the documentation:
tokenizer.fit_on_texts() --> Creates the vocabulary index based on word frequency. For example, if you had the phrase "My dog is different from your dog, my dog is prettier", then word_index["dog"] = 1 because "dog" is the most frequent word (indices start at 1; 0 is reserved for padding).
tokenizer.texts_to_sequences() --> Transforms each text into a sequence of integers; every word in a sentence is replaced by its integer index. You can inspect tokenizer.word_index (a dictionary) to verify which integer was assigned to each word.
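A short runnable sketch of those two calls (using the tf.keras import path; the printed indices are what the frequency-based ordering would give for this toy phrase):

    from tensorflow.keras.preprocessing.text import Tokenizer

    texts = ["My dog is different from your dog, my dog is prettier"]
    tok = Tokenizer()
    tok.fit_on_texts(texts)        # builds the vocabulary index from word frequency
    print(tok.word_index)          # e.g. {'dog': 1, 'my': 2, 'is': 3, 'different': 4, ...}
    print(tok.texts_to_sequences(["my dog is prettier"]))   # e.g. [[2, 1, 3, 7]]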

Google's BERT for NLP: replace foreign characters in vocab.txt to add words?

I am fine-tuning the BERT model but need to add a few thousand words. I know that one can replace the ~1000 [unused#] lines at the top of the vocab.txt, but I also notice there are thousands of single foreign (Unicode) characters in the file, which I will never use. For fine-tuning, is it possible to replace those with my words, fine-tune, and have the model still work correctly?
The [unused#] token embeddings are essentially randomly initialized, since they were never used during pre-training. If you just replace them with your own words but don't pre-train further on your domain-specific corpus, those embeddings will remain essentially random, so IMO there won't be much benefit in replacing them and going straight to fine-tuning.
Let me point you to this GitHub issue. According to the author of the paper:
My recommendation would be to just use the existing wordpiece vocab and run pre-training for more steps on the in-domain text, and it should learn the compositionality "for free". Keep in mind that with a wordpiece vocabulary there are basically no out-of-vocabulary words, and you don't really know which words were seen in the pre-training and not. Just because a word was split up by word pieces doesn't mean it's rare; in fact many words which were split into wordpieces were seen 5,000+ times in the pre-training data.
But if you want to add more vocab you can either: (a) Just replace the "[unusedX]" tokens with your vocabulary. Since these were not used they are effectively randomly initialized. (b) Append it to the end of the vocab, and write a script which generates a new checkpoint that is identical to the pre-trained checkpoint, but with a bigger vocab where the new embeddings are randomly initialized (for initialization we used tf.truncated_normal_initializer(stddev=0.02)). This will likely require mucking around with some tf.concat() and tf.assign() calls.
Hope this helps!
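A hedged sketch of option (a), editing a copy of vocab.txt (the file names and example terms below are placeholders, and you would still need to point the tokenizer at the new file and continue pre-training/fine-tuning afterwards):

    # Overwrite the [unusedX] placeholder lines in a copy of BERT's vocab.txt
    # with your own domain-specific words (the example terms here are made up).
    new_words = ["myocarditis", "angioplasty", "stent"]

    with open("vocab.txt", encoding="utf-8") as f:
        vocab = f.read().splitlines()

    replacements = iter(new_words)
    for i, token in enumerate(vocab):
        if token.startswith("[unused"):
            try:
                vocab[i] = next(replacements)
            except StopIteration:
                break

    with open("vocab_custom.txt", "w", encoding="utf-8") as f:
        f.write("\n".join(vocab) + "\n")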

Detecting content based on position in sentence with OpenNLP

I've successfully used OpenNLP for document categorization and was also able to extract names using trained samples and regular expressions.
I was wondering if it is also possible to extract names (or more generally speaking, subjects) based on their position in a sentence?
E.g. instead of training with concrete names that are known a priori, like Travel to <START:location> New York <END>, I would prefer not to provide concrete examples but let OpenNLP decide that anything appearing at the specified position could be an entity. That way, I wouldn't have to provide each and every possible option (which is impossible in my case anyway) but only provide examples of the possible surrounding sentence.
That is context-based learning, and OpenNLP already does that. You have to train it with more, properly annotated examples to get good results.
For example, when "Professor X" occurs in a sentence, a trained OpenNLP model.bin will output X as a name, whereas when X appears in a sentence without "Professor" in front of it, it might not be recognized as a name.
According to the documentation, provide around 15,000 sentences of training data and you can expect good results.

stopword removing when using the word2vec

I have been trying word2vec for a while now using gensim's word2vec library. My question is: do I have to remove stopwords from my input text? Based on my initial experimental results, I see stopwords like 'of' and 'when' popping up when I call model.most_similar('someword').
But I haven't seen anything stating that stopword removal is necessary with word2vec. Is word2vec supposed to handle stop words even if you don't remove them?
What are the must-do preprocessing steps (for topic modeling, for example, stopword removal is almost mandatory)?
Gensim's implementation is based on Tomas Mikolov's original word2vec model, which automatically downsamples all frequent words based on their frequency.
As stated in the paper:
We show that subsampling of frequent words during training results in
a significant speedup (around 2x - 10x), and improves accuracy of the
representations of less frequent words.
What this means is that frequent words are sometimes left out of the window of words used for prediction. The sample parameter, which defaults to 0.001, controls how aggressively those words are pruned. If you want to remove specific stopwords that would not be pruned based on their frequency, you can still do that yourself.
Summary: stopword removal is unlikely to make a significant difference to the results.
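A minimal sketch of that parameter, assuming the gensim 4.x API and a toy corpus (smaller sample values downsample frequent words more aggressively):

    from gensim.models import Word2Vec

    sentences = [["the", "cat", "sat", "on", "the", "mat"],
                 ["the", "dog", "lay", "on", "the", "rug"]]   # toy corpus

    model = Word2Vec(sentences, vector_size=100, window=5, min_count=1,
                     sample=0.001)   # default threshold for downsampling frequent words
    print(model.wv.most_similar("cat", topn=3))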
Personally, I think removing stop words will give better results; check the link.
Also, for topic modeling you should preprocess the text; the following steps are a must:
Removal of stop words.
Tokenization.
Stemming and lemmatization.
As others mentioned before, it really depends on what you want to do, and the best answer cannot be found in personal opinions but in experiments. Stop words may play a role in word embeddings by associating related words through their relationship to some of those stop words. For example, city names may tend to be more closely associated in a word embedding not only because they are associated with verbs such as "come", "go", "went", "fly", and "drive", but also with prepositions such as "to", "from", and "in".
A hypothesis that can be empirically tested is whether the removal of those prepositions decreases the likelihood that those city names will be retrieved together.

Resources