TF-IDF vs only IDF - nlp

Is there any case when IDF is better than TF-IDF? As far as I understand, TF is important for giving a weight to a word within a document in order to match that document against a predefined query. If I just want to rank the importance of words in a collection of documents, without any specific IR purpose, why should I use the TF term?

TF in TF-IDF means the frequency of a term in a document. In other words, TF-IDF is a measure of both the term and the document. Here is a good illustration of what I mean.
As far as I understand your case, you don't work with any particular document; instead, you want some integral characteristic for each word over the whole document collection. So you should use IDF (or simply DF, document frequency) if you want to find something like stop-words. See also this related question.
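For instance, here is a minimal sketch of that idea in Python (the docs list is just an illustrative collection of tokenized documents):
import math
from collections import Counter

# docs: an illustrative collection, each document already tokenized
docs = [["the", "cat", "sat"],
        ["the", "dog", "ran"],
        ["a", "cat", "and", "a", "dog"]]
N = len(docs)

# document frequency: in how many documents does each term appear?
df = Counter(term for doc in docs for term in set(doc))
# inverse document frequency: lowest for terms that occur almost everywhere
idf = {term: math.log(N / freq) for term, freq in df.items()}

# terms with the lowest idf behave like stop-words for this collection
print(sorted(idf, key=idf.get)[:5])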

Related

Which additional features to use apart from Doc2Vec embeddings for Document Similarity?

So I am doing a project on document similarity, and right now my only features are the embeddings from Doc2Vec. Since that is not showing any good results, even after hyperparameter optimization and training word embeddings before the doc embeddings, what other features can I add to get better results?
My dataset is 150 documents, 500-700 words each, with 10 topics (labels), each document having one topic. Documents are labeled at the document level, and that labeling is currently used only for evaluation purposes.
Edit: The following answers gojomo's questions and elaborates on my comment on his answer:
The evaluation of the model is done on the training set. I check whether the label is the same as that of the most similar document from the model. To do this, I first get the document vector using the model's infer_vector method and then most_similar to get the most similar document. The current accuracy I am getting is 40-50%. A satisfactory score would be at least 65%.
Due to the purpose of this research and its further use case, I am unable to get a larger dataset; that is why a professor recommended (this is a university project) that I add some additional features to the Doc2Vec document embeddings. As I had no idea what he meant, I am asking the Stack Overflow community.
The end goal of the model is to cluster the documents, with the labels again used only for evaluation purposes.
If I don't get good results with this model, I will try out the simpler ones mentioned by @Adnan S and @gojomo, such as TF-IDF, Word Mover's Distance, and bag of words; I just presumed I would get better results using Doc2Vec.
You should try creating TF-IDF vectors with 2- and 3-grams to generate a vector representation for each document. You will have to fit the vocabulary on all 150 documents. Once you have the TF-IDF vector for each document, you can use cosine similarity between any two of them.
Here is a blog article with more details, and the doc page for sklearn.
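A rough sketch of that approach with scikit-learn, assuming documents is your list of 150 raw text strings (the variable name is just a placeholder):
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# documents: placeholder for your list of 150 raw text strings
vectorizer = TfidfVectorizer(ngram_range=(1, 3), stop_words="english")
tfidf = vectorizer.fit_transform(documents)   # one TF-IDF vector per document

# pairwise cosine similarity between all documents
sims = cosine_similarity(tfidf)
# index of the document most similar to document 0 (excluding itself)
print(sims[0].argsort()[::-1][1])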
How are you evaluating the results as not good, and how will you know when your results are adequate/good?
Note that just 150 docs of 400-700 words each is a tiny, tiny dataset: the datasets behind typical published Doc2Vec results include tens of thousands to millions of documents, of hundreds to thousands of words each.
It will be hard for any of the Word2Vec/Doc2Vec/etc-style algorithms to do much with so little data. (The gensim Doc2Vec implementation includes a similar toy dataset, of 300 docs of 200-300 words each, as part of its unit-testing framework, and to eke out even vaguely useful results, it must up the number of training epochs, and shrink the vector size, significantly.)
So if intending to use Doc2Vec-like algorithms, your top priority should be finding more training data. Even if, in the end, only ~150 docs are significant, collecting more documents that use similar domain language can help improve the model.
It's unclear what you mean when you say there are 10 topics, and 1 topic per document. Are those human-assigned categories, and are those included as part of the training texts or tags passed to the Doc2Vec algorithm? (It might be reasonable to include it, depending on what your end-goals and document-similarity evaluations consist of.)
Are these topics the same as the labelling you also mention, and are you ultimately trying to predict the topics, or just using the topics as a check of the similarity-results?
As @adnan-s suggests in the other answer, it may also be worth trying simpler count-based 'bag of words' document representations, potentially including word n-grams or even character n-grams, or TF-IDF weighting.
If you have adequate word-vectors, as trained from your data or from other compatible sources, the "Word Mover's Distance" measure can be another interesting way to compute pairwise similarities. (However, it may be too expensive to calculate between many-hundred-word texts - working much faster on shorter texts.)
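As a rough sketch, if you already have compatible word vectors loaded in gensim, WMD can be computed as below (the vector file path is a placeholder, and gensim needs an extra solver package such as POT/pyemd installed for wmdistance):
from gensim.models import KeyedVectors

# "word_vectors.bin" is a placeholder for whatever pretrained vectors you use
wv = KeyedVectors.load_word2vec_format("word_vectors.bin", binary=True)

doc1 = "the bank approved the loan".split()
doc2 = "the credit application was accepted".split()

# Word Mover's Distance: lower means more similar
distance = wv.wmdistance(doc1, doc2)
print(distance)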
As others have already suggested your training set of 150 documents probably isn't big enough to create good representations. You could, however, try to use a pre-trained model and infer the vectors of your documents.
Here is a link where you can download a (1.4 GB) DBOW model trained on English Wikipedia pages, working with 300-dimensional document vectors. I obtained the link from the jhlau/doc2vec GitHub repository. After downloading the model, you can use it as follows:
from gensim.models import Doc2Vec
# load the downloaded model
model_path = "enwiki_dbow/doc2vec.bin"
model = Doc2Vec.load(model_path)
# infer vector for your document
doc_vector = model.infer_vector(doc_words)
Where doc_words is a list of words in your document.
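For example, a very naive way to build doc_words from a raw string (real preprocessing should match whatever tokenization the pretrained model used):
# naive whitespace tokenization; adjust to match the model's training preprocessing
doc_words = "Some raw document text goes here".lower().split()
doc_vector = model.infer_vector(doc_words)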
This, however, may not work for you in case your documents are very specific. But you can still give it a try.

Natural Language Processing in Python

How do I find similar issues for a new, unseen issue, based on past issues (each including a summary and description), using natural language processing in Python?
If I understand you correctly you have a new issue (query) and you want to look up other similar issues (documents) in your database. If so, then what you need is a way to find the similarity between your query and existing documents. And once you have them, you can rank them and select the most relevant ones. One such method that allows you to do this is Latent Semantic Indexing (LSI).
To do this you'll have to construct a document-term matrix. You'll use your existing documents to create a term-occurrence matrix across documents. This means you basically record how many times a word appears in each document (or some more complex measure, e.g. tf-idf). This can be done either through a bag-of-words representation or a TF-IDF representation.
Once you have that, you'll have to process your query so that it is in the same form as your documents. Now that you have your query in usable form, you can calculate the cosine similarity between documents and your query. The one with the highest cosine similarity is the closest match.
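A minimal sketch of such a pipeline with gensim, assuming documents is a list of your past issues as raw strings and query is the new issue text (both names, and num_topics, are placeholders you would tune):
from gensim import corpora, models, similarities

# documents: placeholder list of past issues as raw strings; query: the new issue text
tokenized = [doc.lower().split() for doc in documents]
dictionary = corpora.Dictionary(tokenized)
corpus = [dictionary.doc2bow(doc) for doc in tokenized]

tfidf = models.TfidfModel(corpus)                    # term weighting
lsi = models.LsiModel(tfidf[corpus], id2word=dictionary, num_topics=100)

# index the documents, then rank them by cosine similarity to the query
index = similarities.MatrixSimilarity(lsi[tfidf[corpus]])
query_bow = dictionary.doc2bow(query.lower().split())
sims = index[lsi[tfidf[query_bow]]]
print(sorted(enumerate(sims), key=lambda x: -x[1])[:5])   # top 5 matches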
Note: The topic that you may want to read about is Information Retrieval and LSI is just one such method. You should look into other methods as well.

Gensim doc2vec most_similar equivalent to get full documents

In Gensim's doc2vec implementation, gensim.models.keyedvectors.Doc2VecKeyedVectors.most_similar returns the tags and cosine similarity of the documents most similar to the query document. What if I want the actual documents themselves and not the tags? Is there a way to do that directly without searching for the document associated with the tag returned by most_similar?
Also, is there documentation on this? I can't seem to find the documentation for half of Gensim's classes.
The Doc2Vec class doesn't serve as a full document database that stores the original documents in their original formats. That would require a lot of extra complexity and state.
Instead, you just present the docs, with their particular tags, in the tokenized format it needs for training, and the model only learns and retains their vector representations.
If you need to then look-up the original documents, you must maintain your own (tags -> documents) lookup – which many projects will already have as the original source of the docs.
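A minimal sketch of that pattern (the raw_docs dict and its contents are purely illustrative; in newer gensim the document vectors live on model.dv, in older versions on model.docvecs):
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# raw_docs: your own (tag -> original text) lookup; contents here are illustrative
raw_docs = {"doc_0": "first document text about cats",
            "doc_1": "second document text about dogs",
            "doc_2": "third document text about birds"}

training = [TaggedDocument(words=text.lower().split(), tags=[tag])
            for tag, text in raw_docs.items()]
model = Doc2Vec(training, vector_size=50, min_count=1, epochs=40)

# most_similar returns (tag, cosine similarity); the lookup recovers the full text
vector = model.infer_vector("a query about cats".lower().split())
for tag, sim in model.dv.most_similar([vector], topn=2):   # model.docvecs in older gensim
    print(tag, sim, raw_docs[tag])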
The Doc2Vec class docs are at https://radimrehurek.com/gensim/models/doc2vec.html, but it may also be helpful to look at the example Jupyter notebooks included in the gensim docs/notebooks directory, also viewable online at:
https://github.com/RaRe-Technologies/gensim/tree/develop/docs/notebooks
The three notebooks related to Doc2Vec have filenames beginning doc2vec-.

Multiclass text classification: new class if input does not match to a class

I am trying to classify pieces of text into categories. I have 9 categories, but the sentences I have can belong to more categories than that. My objective is to take a piece of text and find the industry of each sentence. One common problem I have is that my training set does not have a "Porn" category, so sentences with porn material get classified as "Financial".
I want my classifier to check whether a sentence can be categorized into one of the classes, and if not, just report that it can't classify that text.
I am using a tf-idf vectorizer to transform the sentences and then feed the data to a LinearSVC.
Can anyone help me with this issue?
Or can anyone provide me with any useful material?
Firstly, the problem you have with the “Porn” documents being classified as “Financial” doesn’t seem to be entirely related to the other question here. I’ll address the main question for now.
The setting is that you have data for 9 categories, but the actual document universe is bigger. The problem is to determine that you haven’t seen the likes of a particular data point before. This seems to be more like outlier or anomaly detection, than classification.
You'll have to do some background reading to proceed further, but here are some points to get you started. One strategy to use is to determine if the new document is “similar” to other documents that you have in your collection. The idea being that an outlier is not likely to be similar to “normal” documents. To do this, you would need a robust measure of document similarity.
Outline of a potential method you can use:
Find a good representation of the documents, say tf-idf vectors, or better.
Benchmark the documents within your collection. For each document, the “goodness” score is the highest similarity score with all other documents in the collection. (Alternately, you can use k’th highest similarity, for some fault tolerance.)
Given the new document, measure its goodness score in a similar way.
How does the new document compare to other documents in terms of the goodness score? A very low goodness score is a sign of an outlier.
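A rough sketch of that outline using tf-idf vectors and scikit-learn (train_texts, new_text, the choice of k, and the percentile threshold are all placeholders you would tune):
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# train_texts: placeholder list of documents from the 9 known categories
vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(train_texts)

# goodness of each training document: k-th highest similarity to the others
sims = cosine_similarity(X)
np.fill_diagonal(sims, -1.0)              # ignore self-similarity
k = 3
goodness = np.sort(sims, axis=1)[:, -k]
threshold = np.percentile(goodness, 5)    # e.g. 5th percentile as the cut-off

# score a new document the same way; a very low score suggests an outlier
new_goodness = np.sort(cosine_similarity(vectorizer.transform([new_text]), X)[0])[-k]
print("possible outlier" if new_goodness < threshold else "classify as usual")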
Further reading:
Survey of Anomaly Detection
LSA, which is a technique for text representation and similarity computation.

Natural language query preprocessing

I am trying to implement a natural language query preprocessing module which would, given a query formulated in natural language, extract the keywords from that query and submit it to an Information Retrieval (IR) system.
At first, I thought about using some training set to compute tf-idf values of terms and using these values to estimate the importance of single words. But on second thought, this does not make any sense in this scenario - I only have a training collection, and I don't have access to index the IR data. Would it be reasonable to use only the idf value for such an estimation? Or maybe another weighting approach?
Could you suggest how to tackle this problem? Usually, the articles about NLP processing that I read address training and test data sets. But what if I only have the query and training data?
tf-idf (it's not capitalized, FYI) is a good choice of weight. Your intuition is correct here. However, you don't compute tf-idf on your training set alone. Why? You need to really understand what tf and idf mean:
tf (term frequency) is a statistic that indicates whether a term appears in the document being evaluated. The simplest way to calculate it would simply be a boolean value, i.e. 1 if the term is in the document.
idf (inverse document frequency), on the other hand, measures how likely a term appears in a random document. It's most often calculated as the log of (N/number of document matches).
Now, tf is calculated for each of the documents your IR system will be indexing over (if you don't have the access to do this, then you have a much bigger and insurmountable problem, since an IR system without a source of truth is an oxymoron). Ideally, idf is calculated over your entire data set (i.e. all the documents you are indexing), but if this is prohibitively expensive, then you can randomly sample your population to create a smaller data set, or use a training set such as the Brown corpus.
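Putting the two definitions together in a toy sketch (using the boolean tf variant and the log-based idf described above; the tiny collection is purely illustrative):
import math
from collections import Counter

# toy collection standing in for whatever data you can index or sample
docs = [["credit", "card", "payment"],
        ["loan", "payment", "terms"],
        ["credit", "score", "report"]]
N = len(docs)

# boolean tf variant described above: 1 if the term occurs in the document
def tf(term, doc):
    return 1 if term in doc else 0

# idf: log of (N / number of documents containing the term)
df = Counter(t for doc in docs for t in set(doc))
def idf(term):
    return math.log(N / df[term])

print(tf("payment", docs[0]) * idf("payment"))  # tf-idf weight of "payment" in the first doc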