Does a corpus take into account the word frequency in each document, or just the word frequency across all documents?

As stated in the title, I have some trouble understanding what a corpus contains. From the articles and guides I've read, it seems that a corpus only contains the frequency of words across all documents combined. If that is true, then when I train an LDA model, how does the calculation of the topic mixture work? Since the model is built from the text corpus, how does the topic mixture know the frequency of words in individual documents, which is used in calculating the topic mixture?
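For concreteness, here is a minimal sketch of what a corpus looks like in gensim (an assumption on my part; the guides may use a different library). Note that each document keeps its own word counts rather than one combined tally:

from gensim.corpora import Dictionary

# Two toy documents, already tokenized.
docs = [["apple", "banana", "apple"],
        ["banana", "cherry"]]

dictionary = Dictionary(docs)

# A gensim corpus holds one bag-of-words per document: each entry is
# that document's own list of (token_id, count) pairs.
corpus = [dictionary.doc2bow(doc) for doc in docs]
print(corpus)  # e.g. [[(0, 2), (1, 1)], [(1, 1), (2, 1)]]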

Related

Which additional features to use apart from Doc2Vec embeddings for Document Similarity?

So I am doing a project on document similarity, and right now my features are only the embeddings from Doc2Vec. Since that is not showing good results, even after hyperparameter optimization and training word embeddings before the document embeddings, what other features can I add to get better results?
My dataset is 150 documents, 500-700 words each, with 10 topics (labels), each document having one topic. Documents are labeled at the document level, and that labeling is currently used only for evaluation purposes.
Edit: The following answers gojomo's questions and elaborates on my comment on his answer:
The evaluation of the model is done on the training set. I check whether a document's label matches that of the most similar document returned by the model. For this I first get the document vector using the model's 'infer_vector' method and then 'most_similar' to get the most similar document. The current results I am getting are 40-50% accuracy. A satisfactory score would be at least 65% and upwards.
Due to the purpose of this research and its further use case, I am unable to get a larger dataset; that is why a professor recommended, as this is a university project, that I add some additional features to the Doc2Vec document embeddings. As I had no idea what he meant, I am asking the Stack Overflow community.
The end goal of the model is to cluster the documents, with the labels, again, being used only for evaluation purposes.
If I don't get good results with this model, I will try the simpler ones mentioned by @Adnan S and @gojomo, such as TF-IDF, Word Mover's Distance, and bag of words; I just presumed I would get better results using Doc2Vec.
You should try creating TF-IDF vectors with 2- and 3-grams to generate a vector representation for each document. You will have to train the vocabulary on all 150 documents. Once you have the TF-IDF vector for each document, you can use cosine similarity between any two of them.
Here is a blog article with more details, and the documentation page for sklearn.
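For concreteness, a minimal sketch with scikit-learn's TfidfVectorizer (the document list is a placeholder for your 150 texts):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Placeholder for your 150 raw document strings.
documents = ["first document text", "second document text"]

# Include unigrams, bigrams and trigrams in the vocabulary.
vectorizer = TfidfVectorizer(ngram_range=(1, 3))
tfidf = vectorizer.fit_transform(documents)

# Pairwise cosine similarity between all documents.
similarities = cosine_similarity(tfidf)
print(similarities.shape)  # (n_docs, n_docs)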
How are you evaluating the results as not good, and how will you know when your results are adequate/good?
Note that just 150 docs of 400-700 words each is a tiny, tiny dataset: typical datasets used in published Doc2Vec results include tens of thousands to millions of documents, of hundreds to thousands of words each.
It will be hard for any of the Word2Vec/Doc2Vec/etc-style algorithms to do much with so little data. (The gensim Doc2Vec implementation includes a similar toy dataset, of 300 docs of 200-300 words each, as part of its unit-testing framework, and to eke out even vaguely useful results it must significantly increase the number of training epochs and shrink the vector size.)
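To illustrate the kind of adjustments meant here, a hedged sketch (the token lists are placeholders, and the exact values would need tuning on your data):

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Placeholder: one token list per document (your ~150 docs).
tokenized_docs = [["some", "document", "tokens"], ["more", "words", "here"]]

tagged = [TaggedDocument(words=doc, tags=[str(i)])
          for i, doc in enumerate(tokenized_docs)]

# Smaller vectors and far more epochs than the defaults, in the spirit of
# what gensim's own test setup does for its toy dataset.
model = Doc2Vec(tagged, vector_size=50, epochs=100, min_count=1)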
So if you intend to use Doc2Vec-like algorithms, your top priority should be finding more training data. Even if, in the end, only the ~150 docs are significant, collecting more documents that use similar domain language can help improve the model.
It's unclear what you mean when you say there are 10 topics, and 1 topic per document. Are those human-assigned categories, and are those included as part of the training texts or tags passed to the Doc2Vec algorithm? (It might be reasonable to include it, depending on what your end-goals and document-similarity evaluations consist of.)
Are these topics the same as the labelling you also mention, and are you ultimately trying to predict the topics, or just using the topics as a check of the similarity-results?
As @adnan-s suggests in the other answer, it may also be worth trying simpler count-based 'bag of words' document representations, potentially over word n-grams or even character n-grams, or with TF-IDF weighting.
If you have adequate word-vectors, as trained from your data or from other compatible sources, the "Word Mover's Distance" measure can be another interesting way to compute pairwise similarities. (However, it may be too expensive to calculate between many-hundred-word texts; it works much faster on shorter texts.)
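A minimal sketch of how that looks in gensim, assuming you already have compatible word vectors (the vector file path is a placeholder):

from gensim.models import KeyedVectors

# "vectors.bin" is a placeholder path to word2vec-format vectors.
wv = KeyedVectors.load_word2vec_format("vectors.bin", binary=True)

doc1 = ["media", "conference", "press", "speech"]
doc2 = ["press", "briefing", "statement"]

# Word Mover's Distance: lower means more similar. Needs an
# optimal-transport backend (pyemd or POT, depending on gensim version),
# and cost grows quickly with document length.
distance = wv.wmdistance(doc1, doc2)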
As others have already suggested, your training set of 150 documents probably isn't big enough to create good representations. You could, however, try to use a pre-trained model and infer the vectors of your documents.
Here is a link where you can download a (1.4GB) DBOW model trained on English Wikipedia pages, with 300-dimensional document vectors. I obtained the link from the jhlau/doc2vec GitHub repository. After downloading the model you can use it as follows:
from gensim.models import Doc2Vec
# load the downloaded model
model_path = "enwiki_dbow/doc2vec.bin"
model = Doc2Vec.load(model_path)
# infer vector for your document
doc_vector = model.infer_vector(doc_words)
Where doc_words is a list of words in your document.
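For example, doc_words could be produced with gensim's own simple tokenizer; this is just one possible preprocessing choice, and ideally it should match whatever preprocessing the model was trained with:

from gensim.utils import simple_preprocess

text = "Your raw document text goes here."
doc_words = simple_preprocess(text)  # lowercased tokens, punctuation stripped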
This may not work for you, however, if your documents are very specific. But you can still give it a try.

What is the impact of word frequency on Gensim LDA Topic modelling

I am trying to use Gensim LDA modelling to topic-model a dataset of food recipes. I want the topics to be based on the key ingredients in each recipe, but the recipe text contains many generic English words that are not ingredient names, so my topic outcome is not as good as expected. I am trying to understand the impact of word frequency on the LDA topic outcome. Thanks.
Have you tried removing stop-words from the data on which you construct the LDA model?
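For concreteness, one way to do this in gensim; the recipe tokens and stop-word list are placeholders, and the filter_extremes thresholds would need tuning:

from gensim.corpora import Dictionary
from gensim.models import LdaModel

# Placeholder: one token list per recipe.
tokenized_recipes = [["add", "the", "tomato", "and", "basil", "sauce"],
                     ["simmer", "the", "tomato", "sauce", "and", "basil"],
                     ["boil", "the", "pasta", "and", "add", "sauce"]]

stopwords = {"the", "and", "add"}  # illustrative; use a real stop-word list
filtered = [[w for w in doc if w not in stopwords] for doc in tokenized_recipes]

dictionary = Dictionary(filtered)
# Also drop words appearing in too few or too many recipes; very frequent
# words are often generic English rather than ingredient names.
dictionary.filter_extremes(no_below=2, no_above=0.9)

corpus = [dictionary.doc2bow(doc) for doc in filtered]
lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=2)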
Also, please bear in mind that it is not really possible to directly influence the assignment of words among the topics. This has been discussed in the answer to this question: how to improve word assignment in different topics in lda

Assign more weight to certain documents within the corpus - LDA - Gensim

I am using LDA for topic modelling, but unfortunately my data is heavily skewed. I have documents from 10 different categories and would like each category to contribute equally to the LDA topics.
However, each category has a varying number of documents (one category, for example, holds more than 50% of all the documents, while several categories hold only 1-2% of them).
What would be the best approach to assign weights to these categories so that they contribute equally to my topics? If I run the LDA without doing so, my topics will be largely based on the category that holds over 50% of the documents in the corpus. I am exploring up-sampling, but would prefer a solution that directly assigns weights in LDA.
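As far as I know, gensim's LdaModel has no per-document weight parameter, so the up-sampling mentioned in the question is a practical workaround: repeat documents from small categories until every category contributes roughly equally. A rough sketch (the category names and counts are placeholders):

import random

# Placeholder: category label -> list of tokenized documents.
docs_by_category = {
    "big_category": [["tokens", "from", "a", "common", "doc"]] * 50,
    "small_category": [["tokens", "from", "a", "rare", "doc"]] * 2,
}

target = max(len(docs) for docs in docs_by_category.values())

balanced = []
for category, docs in docs_by_category.items():
    # Sample with replacement so every category reaches the same size.
    balanced.extend(random.choices(docs, k=target))

random.shuffle(balanced)
# `balanced` then feeds the usual Dictionary / doc2bow / LdaModel pipeline.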

Multiclass text classification: new class if input does not match to a class

I am trying to classify pieces of text into categories. I have 9 categories, but the sentences I have can belong to more categories than that. My objective is to take a piece of text and find the industry of each sentence. One common problem I have is that my training set does not have a "Porn" category, so sentences with porn material get classified as "Financial".
I want my classifier to check whether a sentence can be assigned to one of the classes and, if not, simply report that it can't classify that text.
I am using a TF-IDF vectorizer to transform the sentences and then feed the data to a LinearSVC.
Can anyone help me with this issue, or provide me with any useful material?
Firstly, the problem you have with the “Porn” documents being classified as “Financial” doesn’t seem to be entirely related to the other question here. I’ll address the main question for now.
The setting is that you have data for 9 categories, but the actual document universe is bigger. The problem is to determine that you haven't seen the likes of a particular data point before. This seems to be more like outlier or anomaly detection than classification.
You'll have to do some background reading to proceed further, but here are some points to get you started. One strategy to use is to determine if the new document is “similar” to other documents that you have in your collection. The idea being that an outlier is not likely to be similar to “normal” documents. To do this, you would need a robust measure of document similarity.
Outline of a potential method you can use (a concrete sketch follows the steps below):
Find a good representation of the documents, say tf-idf vectors, or better.
Benchmark the documents within your collection. For each document, the "goodness" score is its highest similarity score with all other documents in the collection. (Alternatively, you can use the k-th highest similarity, for some fault tolerance.)
Given the new document, measure its goodness score in a similar way.
How does the new document compare to other documents in terms of the goodness score? A very low goodness score is a sign of an outlier.
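A hedged sketch of the outline above, using TF-IDF vectors and cosine similarity (the documents and the threshold choice are placeholders):

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Placeholders: your labeled collection and one incoming text.
train_docs = ["financial report text", "another financial text", "media industry text"]
new_doc = "text of the incoming sentence"

vectorizer = TfidfVectorizer()
train_vecs = vectorizer.fit_transform(train_docs)

# Goodness of each known doc: highest similarity to any *other* known doc.
sims = cosine_similarity(train_vecs)
np.fill_diagonal(sims, -1.0)  # ignore self-similarity
train_goodness = sims.max(axis=1)

# Goodness of the new doc: highest similarity to any known doc.
new_vec = vectorizer.transform([new_doc])
new_goodness = cosine_similarity(new_vec, train_vecs).max()

# Flag as unclassifiable if it scores well below the collection.
threshold = np.percentile(train_goodness, 5)  # illustrative cut-off
if new_goodness < threshold:
    print("Cannot confidently classify this text.")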
Further reading:
Survey of Anomaly Detection
LSA, which is a technique for text representation and similarity computation.

How to produce a bag of words depending on relevance across corpus

I understand that TF-IDF (term frequency-inverse document frequency) is the solution here? But the TF part of TF-IDF is specific to a single document only, and I need to produce a bag of words that is relevant to the WHOLE corpus. Am I doing this wrong, or is there an alternative?
You may be able to do this if you compute the IDF on a different corpus. A general corpus containing newswire texts may be suitable. Then you can treat your own corpus as a single document to compute the TF. You will also need a strategy for words that are present in your corpus but not in the external corpus, as they won't have an IDF value. Finally, you can rank the words in your corpus according to their TF-IDF.
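A minimal sketch of this recipe with a toy external corpus (all token lists are placeholders, and the out-of-vocabulary strategy shown is just one option):

import math
from collections import Counter

# Placeholders: token lists for the external reference corpus,
# and your own corpus flattened into a single "document".
external_docs = [["stocks", "fell", "today"], ["rain", "expected", "today"]]
my_corpus_tokens = ["stocks", "stocks", "market", "rally"]

n_docs = len(external_docs)
df = Counter()
for doc in external_docs:
    df.update(set(doc))  # document frequency in the external corpus

tf = Counter(my_corpus_tokens)  # the whole corpus treated as one document

def idf(word):
    # One possible strategy for words missing from the external corpus:
    # give them the highest IDF, as if they occurred in no document.
    return math.log(n_docs / df[word]) if word in df else math.log(n_docs + 1)

# Rank the corpus vocabulary by TF-IDF.
scores = {w: count * idf(w) for w, count in tf.items()}
for word, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(word, round(score, 3))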
