Calculating IDF (as in TF-IDF) when testing?

Calculating IDF (as in TF-IDF) when testing? - text

As I understand it, IDF is used to calculate how many documents have the term (sort of just the idea). You can calculate IDF (along with TF) in the training set since you have all the documents beforehand. But what if I don't have the test set beforehand and I'm getting test documents in a sequential manner (like from a web crawler), then how am I going to calculate the IDF for words in a document when it comes to testing?

For this state if your dataset is big enough you could using just training set for IDF. in the test phase if the new term be in train set use the IDF of training and if the term is new use the number of train set documents for calculate IDF.
For some purposes you could use smoothing methods for having better results.

If you only perform tests after indexing/crawling a whole bunch of documents you can calculate IDF after the crawling is done. You don't have to calculate IDF when you encounter a new document or a new term. You can calculate it on-the-fly when you need it to do some TD-IDF or other calculation.
If that is not enough, for some reason, you can still use the IDF of another document dataset, preferably with the same type of documents.

Related

Classification Problem - Every record is a unique class

I have thousands of records where each is a unique class. I think what this means is that, which ever ML algorithm I choose, I will not be able to execute a train/test split because the test split cannot be modelled. This is because there is only one record for each class, therefore there is no way the model could have trained for anything that is in the test split.
I can't justify introducing unseen records for prediction if my model cannot be validated.
Each record is similar to a generic accounting record with about 20 features. I thought about using KNN, but KNN would not be able to do a train/test split neither. How do I go about mitigating this? Is KNN the wrong model to use here?

Training Doc2vec with new data

I have a doc2vec model trained on documents with labels. I'm trying to continue training my model with model.train(). The new data comes with new labels as well, but, when I train it on more documents, the new labels aren't being recorded... Does anyone know what my problem might be?

Gensim's Doc2Vec only learns its set of tags at the same time it learns the corpus vocabulary of unique words – during the first call to .build_vocab() on the original corpus.
When you train with additional examples that have either words or tags that aren't already known to the model, those words or tags are simply ignored.
(The .build_vocab(…, update=True) option that's available on Word2Vec to expand its vocabulary has never been fully applied to Doc2Vec, either with respect to tags or with respect to a longstanding crashing bug. So it's not supported on Doc2Vec.)
Note that if it is your aim to create document-vectors that assist in some downstream-classification task, you may not want to supply your known-labels as tags, or at least not as a document's only tag.
The tags you supply to Doc2Vec are the units for which it learns vectors. If you have a million text examples, but only 5 different labels, if you feed those million examples into training each with only the label as a tag, the model is only learning 5 doc-vectors. It is, essentially, like you're training on only 5 mega-documents (passed in in chunks) – and thus 'summarizing' each label down to a single point in vector-space, when it might be far more useful to think of a label as covering a irregularly-shaped "point cloud".
So, you might instead want to use document-IDs rather than labels. (Or, labels and document-IDs.) Then, use the many varied vectors from all individual documents – rather than single vectors per label – to train some downstream classifier or clusterer.
And in that case, the arrival of documents with new labels might not require a full Doc2Vec-retraining. Instead, if the new documents still get useful vectors from inference on the older Doc2Vec model, those per-doc vectors may reflect enough about the new label's documents that downstream classifiers can learn to recognize them.
Ultiamtely, though, if you acquire much more training data, reflecting all new vocabularies & word-senses, the safest approach is to retrain a Doc2Vec model from scratch, using all data. Simply incremental training, even if it had official support, risks pulling those words/tags that appear in new data arbitrarily out-of-comparable-alignment with words/tags that were only trained in the original dataset. It is the interleaved co-training, alongside all other examples equally, which pushes-and-pulls all vectors in a model into useful relative arrangements.

Which additional features to use apart from Doc2Vec embeddings for Document Similarity?

So I am doing a project on document similarity and right now my features are only the embeddings from Doc2Vec. Since that is not showing any good results, after hyperparameter optimization and word embedding before the doc embedding... What other features can I add, so as to get better results?
My dataset is 150 documents, 500-700 words each, with 10 topics(labels), each document having one topic. Documents are labeled on a document level, and that labeling is currently used only for evaluation purposes.
Edit: The following is answer to gojomo's questions and elaborating on my comment on his answer:
The evaluation of the model is done on the training set. I am comparing if the label is the same as the most similar document from the model. For this I am first getting the document vector using the model's method 'infer_vector' and then 'most_similar' to get the most similar document. The current results I am getting are 40-50% of accuracy. A satisfactory score would be of at least 65% and upwards.
Due to the purpose of this research and it's further use case, I am unable to get a larger dataset, that is why I was recommended by a professor, as this is a university project, to add some additional features to the document embeddings of Doc2Vec. As I had no idea what he ment, I am asking the community of stackoverflow.
The end goal of the model is to do clusterization of the documents, again the labels for now being used only for evaluation purposes.
If I don't get good results with this model, I will try out the simpler ones mentioned by #Adnan S #gojomo such as TF-IDF, Word Mover's Distance, Bag of words, just presumed I would get better results using Doc2Vec.

You should try creating TD-IDF with 2 and 3 grams to generate a vector representation for each document. You will have to train the vocabulary on all the 150 documents. Once you have the TF-IDF vector for each document, you can use cosine similarity between any two of them.
Here is a blog article with more details and doc page for sklearn.

How are you evaluating the results as not good, and how will you know when your results are adequate/good?
Note that just 150 docs of 400-700 words each is a tiny, tiny dataset: typical datasets used published Doc2Vec results include tens-of-thousands to millions of documents, of hundreds to thousands of words each.
It will be hard for any of the Word2Vec/Doc2Vec/etc-style algorithms to do much with so little data. (The gensim Doc2Vec implementation includes a similar toy dataset, of 300 docs of 200-300 words each, as part of its unit-testing framework, and to eke out even vaguely useful results, it must up the number of training epochs, and shrink the vector size, significantly.)
So if intending to use Doc2Vec-like algorithms, your top priority should be finding more training data. Even if, in the end, only ~150 docs are significant, collecting more documents that use similar domain language can help improve the model.
It's unclear what you mean when you say there are 10 topics, and 1 topic per document. Are those human-assigned categories, and are those included as part of the training texts or tags passed to the Doc2Vec algorithm? (It might be reasonable to include it, depending on what your end-goals and document-similarity evaluations consist of.)
Are these topics the same as the labelling you also mention, and are you ultimately trying to predict the topics, or just using the topics as a check of the similarity-results?
As #adnan-s suggests in the other answer, it may also be worth trying more-simple count-based 'bag of words' document representations, including potentially on word n-grams or even character n-grams, or TF-IDF weighted.
If you have adequate word-vectors, as trained from your data or from other compatible sources, the "Word Mover's Distance" measure can be another interesting way to compute pairwise similarities. (However, it may be too expensive to calculate between many-hundred-word texts - working much faster on shorter texts.)

As others have already suggested your training set of 150 documents probably isn't big enough to create good representations. You could, however, try to use a pre-trained model and infer the vectors of your documents.
Here is a link where you can download a (1.4GB) DBOW model trained on English Wikipedia pages, working with 300-dimensional document vectors. I obtained the link from jhlau/doc2vec GitHub repository. After downloading the model you can use it as follows:
from gensim.models import Doc2Vec
# load the downloaded model
model_path = "enwiki_dbow/doc2vec.bin"
model = Doc2Vec.load(model_path)
# infer vector for your document
doc_vector = model.infer_vector(doc_words)
Where doc_words is a list of words in your document.
This, however, may not work for you in case your documents are very specific. But you can still give it a try.

How can I use Chi-square value for text classification using SVM?

I have both positive and negative training documents for a text classification problem. I am planning on calculating chi-square value for every feature in each document. Having that value, how may I proceed to classification using SVM? What would be the threshold value for the classification?

Chi-square value can be used to perform feature selection, which could be a pre-processing step. After that, you could greatly reduce your feature vocabulary (for example, select the most useful 100K terms from a 1M vocabulary). This step might have two benefit: 1. reduce your model size in the next step; 2. faster at prediction time. Cons: may or may not affect the classification performance.
To proceed with a classification, you still need to use those 100K features to train your model (for example, using SVM algorithm). After your model is learnt, you could use the model for classification.

Natural language query preprocessing

I am trying to implement a natural language query preprocessing module which would, given a query formulated in natural language, extract the keywords from that query and submit it to an Information Retrieval (IR) system.
At first, I thought about using some training set to compute tf-idf values of terms and use these values for estimating the importance of single words. But on second thought, this does not make any sense in this scenario - I only have a training collection but I dont have access to index the IR data. Would it be reasonable to only use the idf value for such estimation? Or maybe another weighting approach?
Could you suggest how to tackle this problem? Usually, the articles about NLP processing that I read address training and test data sets. But what if I only have the query and training data?

tf-idf (it's not capitalized, fyi) is a good choice of weight. Your intuition is correct here. However, you don't compute tf-idf on your training set alone. Why? You need to really understand what the tf and idf mean:
tf (term frequency) is a statistic that indicates whether a term appears in the document being evaluated. The simplest way to calculate it would simply be a boolean value, i.e. 1 if the term is in the document.
idf (inverse document frequency), on the other hand, measures how likely a term appears in a random document. It's most often calculated as the log of (N/number of document matches).
Now, tf is calculated for each of the document your IR system will be indexing over (if you don't have the access to do this, then you have a much bigger and insurmountable problem, since an IR without a source of truth is an oxymoron). Ideally, idf is calculated over your entire data set (i.e. all the documents you are indexing), but if this is prohibitively expensive, then you can random sample your population to create a smaller data set, or use a training set such as the Brown corpus.

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string