Sent2Vec or Doc2Vec Testing - doc2vec

How can I test a sent2vec or doc2vec model that I've trained on a specific dataset? The process is all unsupervised, so I have no labels to help in the testing. My interest is in how the semantic similarity measure is computed. Thanks in advance.
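One sanity check often used for gensim Doc2Vec models is to re-infer vectors for the training documents and confirm that each one ranks its own trained vector as the nearest neighbour; gensim's `most_similar` ranks by cosine similarity, which is the semantic similarity measure in question. A minimal sketch, assuming the gensim 4.x API and a toy corpus as a placeholder:

```python
# Self-similarity sanity check for a trained gensim Doc2Vec model (sketch;
# the corpus and hyperparameters below are placeholders).
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

tokenized_docs = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["dogs", "are", "great", "pets"],
]
tagged = [TaggedDocument(words=d, tags=[i]) for i, d in enumerate(tokenized_docs)]
model = Doc2Vec(tagged, vector_size=50, min_count=1, epochs=40)

hits = 0
for i, doc in enumerate(tokenized_docs):
    inferred = model.infer_vector(doc)
    # most_similar ranks trained document vectors by cosine similarity
    top_tag, top_sim = model.dv.most_similar([inferred], topn=1)[0]
    hits += int(top_tag == i)
print(f"self-similarity hits: {hits}/{len(tokenized_docs)}")
```

The same idea extends to held-out probe pairs that you already know should (or should not) be similar.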

Related

How to train a linear SVM with H2O

The H2OSupportVectorMachineEstimator in H2O seems to only support "gaussian" as the value of the kernel_type parameter. Is there a way to train a linear SVM with H2O?
As you mentioned, based on the documentation (https://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science/svm.html), there is currently no way to train a linear SVM in H2O. Among linear models, I think it only offers GLM (https://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science/glm.html).
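If the goal is simply a linear decision boundary, one workaround (a sketch only, not an official H2O recipe) is to fall back on the GLM estimator, e.g. a logistic regression; the file name, column names, and hyperparameters below are assumptions:

```python
# Sketch: linear model via H2O GLM as a stand-in for a linear SVM.
import h2o
from h2o.estimators import H2OGeneralizedLinearEstimator

h2o.init()
train = h2o.import_file("train.csv")          # hypothetical dataset
train["label"] = train["label"].asfactor()    # treat the target as categorical
predictors = [c for c in train.columns if c != "label"]

glm = H2OGeneralizedLinearEstimator(family="binomial", alpha=0.0, lambda_search=True)
glm.train(x=predictors, y="label", training_frame=train)
print(glm.coef())                             # linear coefficients of the fitted model
```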

Unsupervised finetuning of BERT for embeddings only?

I would like to fine-tune BERT on unlabeled data from a specific domain and use the output layer to check the similarity between documents. How can I do that? Do I need to first fine-tune on a classification task (or question answering, etc.) and then take the embeddings? Or can I just take a pre-trained BERT model without a task head and fine-tune it on my own data?
There is no need to fine-tune for classification, especially if you do not have any supervised classification dataset.
You should continue training BERT the same unsupervised way it was originally trained, i.e., continue "pre-training" with the masked-language-model objective and next-sentence prediction. Hugging Face's implementation contains the class BertForPreTraining for this.
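A minimal sketch of continuing pre-training on domain text with Hugging Face transformers; for simplicity it uses BertForMaskedLM (masked-LM only) rather than BertForPreTraining with NSP, and the corpus file and hyperparameters are assumptions:

```python
# Sketch: continue masked-language-model training on an unlabeled domain corpus.
from datasets import load_dataset
from transformers import (AutoTokenizer, BertForMaskedLM,
                          DataCollatorForLanguageModeling,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")

# domain_corpus.txt: one passage per line (hypothetical file)
dataset = load_dataset("text", data_files={"train": "domain_corpus.txt"})["train"]
dataset = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=128),
    batched=True, remove_columns=["text"])

# The collator masks 15% of tokens on the fly
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="bert-domain", num_train_epochs=1,
                           per_device_train_batch_size=16),
    train_dataset=dataset,
    data_collator=collator,
)
trainer.train()

# Afterwards, sentence embeddings can be taken e.g. by mean-pooling the
# last hidden states of model.bert for each input.
```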

StanfordNLP training iteration for CRF classifier

I know it's a simple question, but I just want to make sure: are all the samples from the training dataset used in each iteration of the CRF classifier?
Yes, during the training process all training examples are used in each iteration.

Doc2Vec vs Avg Word Vectors : Which is better for Sentiment Analysis?

I was performing sentiment analysis on the IMDb dataset on Kaggle. I used the BOW approach with bigrams and that gave me a decent accuracy of ~89%. But I don't know how to approach the same task using word embeddings: should I go for averaged word vectors or doc2vec?
Someone please help. Thanks in advance.
Here's a recent blog post comparing word2vec averaging vs. doc2vec performance. The post favors doc2vec. It also depends on what classification model you are using (logistic regression, SVM, LSTM, etc.).
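For illustration, here is a minimal sketch of both feature strategies with gensim, feeding a logistic regression; the toy corpus and hyperparameters are placeholders, not the actual Kaggle IMDb pipeline:

```python
# Sketch: averaged word2vec vectors vs. doc2vec document vectors as features.
import numpy as np
from gensim.models import Word2Vec
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from sklearn.linear_model import LogisticRegression

tokenized_reviews = [
    ["great", "movie", "loved", "it"],
    ["terrible", "plot", "awful", "acting"],
]
labels = [1, 0]  # 1 = positive, 0 = negative

# Option 1: average of word2vec vectors per document
w2v = Word2Vec(sentences=tokenized_reviews, vector_size=100, min_count=1, epochs=20)
avg_features = np.array([
    np.mean([w2v.wv[w] for w in doc if w in w2v.wv], axis=0)
    for doc in tokenized_reviews
])

# Option 2: doc2vec document vectors
tagged = [TaggedDocument(words=doc, tags=[i]) for i, doc in enumerate(tokenized_reviews)]
d2v = Doc2Vec(tagged, vector_size=100, min_count=1, epochs=20)
d2v_features = np.array([d2v.dv[i] for i in range(len(tagged))])

# Either feature matrix can feed a downstream classifier
clf = LogisticRegression().fit(avg_features, labels)
```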

Anomaly detection in Text Classification

I have built a text classifier using OneClassSVM.
I have a training set that corresponds to only one label ("Yes") and I don't have data for the other ("No") label. My task is to build a classifier that labels a new unseen sentence (test data) as 1 if it is very similar to the training data, and as -1 (an anomaly) otherwise.
I have used Word2Vec to build the word embeddings for my training data. Then I am using word-vector averaging with a OneClassSVM to build an anomaly-detection classifier.
This classifier currently gives an accuracy of about 50-55%. I need to improve it further to build a robust classifier.
Any suggestions for this problem would be helpful...
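For reference, a minimal sketch of the pipeline described in the question (averaged word2vec vectors fed into a OneClassSVM); the toy corpus and the nu/gamma values are placeholders to be tuned:

```python
# Sketch: one-class anomaly detector over averaged word2vec sentence vectors.
import numpy as np
from gensim.models import Word2Vec
from sklearn.svm import OneClassSVM

train_sentences = [["please", "reset", "my", "password"],
                   ["how", "do", "i", "reset", "my", "password"]]
w2v = Word2Vec(sentences=train_sentences, vector_size=100, min_count=1, epochs=20)

def embed(tokens):
    # average the word vectors of in-vocabulary tokens
    vecs = [w2v.wv[t] for t in tokens if t in w2v.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(w2v.vector_size)

X_train = np.array([embed(s) for s in train_sentences])
ocsvm = OneClassSVM(kernel="rbf", nu=0.1, gamma="scale").fit(X_train)

# +1 = similar to the training data, -1 = anomaly
print(ocsvm.predict([embed(["reset", "password", "please"])]))
```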
I'd suggest a very different approach since you have no training examples for the negative class at all.
You could train a language model on your training data. At inference time, you score the input with the language model and classify it by thresholding the perplexity of the input sentence under the LM.
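A minimal sketch of that idea, using a simple bigram language model from NLTK as a stand-in (a neural LM trained on the same data would be used the same way); the perplexity threshold is an assumption that has to be tuned on held-out in-domain sentences:

```python
# Sketch: perplexity-threshold anomaly detector with an NLTK bigram LM.
from nltk.lm import Laplace
from nltk.lm.preprocessing import padded_everygram_pipeline, pad_both_ends
from nltk.util import ngrams

ORDER = 2
train_sentences = [["please", "reset", "my", "password"],
                   ["how", "do", "i", "reset", "my", "password"]]

train_ngrams, vocab = padded_everygram_pipeline(ORDER, train_sentences)
lm = Laplace(ORDER)          # add-one smoothing keeps unseen n-grams finite
lm.fit(train_ngrams, vocab)

def classify(tokens, threshold=8.0):
    # low perplexity -> looks like the training data (1), high -> anomaly (-1);
    # the threshold value is a placeholder to be tuned on held-out data
    padded = list(pad_both_ends(tokens, n=ORDER))
    ppl = lm.perplexity(list(ngrams(padded, ORDER)))
    return 1 if ppl <= threshold else -1

print(classify(["reset", "my", "password"]))
print(classify(["completely", "unrelated", "gibberish"]))
```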
