Good day, fellow humans (?).
I have a methodological question, muddled by doing a lot of research in very little time.
The question arises from the following problem(s): I need to apply semi-supervised or unsupervised clustering on documents. I have ~300 documents classified with multi-labels and approximately 3400 documents not classified. The number of unsupervised documents could become ~10'000 in the next days.
The main idea is to apply semi-supervised clustering based on the labels at hand. Alternatively, we could go fully unsupervised with soft clustering.
We thought of creating embeddings for the whole documents, but here lies the confusion: which library is the best for such a task?
I guess the utmost importance needs to lie in the context of the whole document. As far as I know, BERT and FastText provide context-dependent word embedding, but not whole document embedding. On the other hand, Gensim's Doc2Vec is context-agnostic, right?
I think I saw a way to train sentence embeddings with BERT, via the HuggingFace API, and was wondering whether it could be useful to consider the whole document as a single sentence.
Do you have any suggestion? I'm probably exposing my utter ignorance and confusion on the matter, but my brain is melted.
Thank you very much for your time.
Edit to answer @gojomo:
My documents are on average ~180 words. The original task was that of multi-label text classification, i.e. each document can have from 1 to N labels, with the number of labels now being N=18. They are highly imbalanced.
Having only 330 labeled documents so far due to several issues, we asked the documents' provider to give also unlabeled data, that should reach the order of the 10k.
I used FastText in classification mode, but the result is obviously atrocious. I also ran a K-NN with Doc2Vec document embeddings, but the result is obviously still atrocious.
I was going to use biomedical BERT-based models (like BioBERT and SciBERT) to produce a NER tagging (trained on domain-specific datasets) on the documents to later apply a classifier.
Now that we have unlabeled documents at our disposal, we wanted to venture into semi-supervised classification or unsupervised clustering, just to explore possibilities. I have to say that this is just a master's thesis.


Find list of Out Of Vocabulary (OOV) words from my domain specific pdf while using FastText model

How can I find the list of Out Of Vocabulary (OOV) words from my domain specific pdf while using a FastText model? I need to fine tune FastText with my domain specific words.
A FastText model will already be able to generate vectors for OOV words.
So there's not necessarily any need either to list the specific OOV words in your PDF or to 'fine tune' a FastText model.
You just ask it for vectors, it gives them back. The vectors for full in-vocabulary words, that were trained from relevant training material, will likely be best, while vectors synthesized for OOV words from word-fragments (character n-grams) shared with training material will just be rough guesses - better than nothing, but not great.
(To train a good word-vector requires many varied examples of a word's use, interleaved with similarly good examples of its many 'peer' words – and traditionally, in one unified, balanced training session.)
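To make the n-gram point concrete, here is a minimal sketch in plain Python, with no FastText dependency. It lists which tokens are OOV relative to a known vocabulary and shows the character n-grams an OOV word shares with an in-vocabulary word. The function names, the toy vocabulary and the example tokens are all hypothetical; the `<`/`>` boundary markers and the 3-to-6 n-gram range follow FastText's default unsupervised settings (minn=3, maxn=6). With a model loaded in gensim 4.x, the vocabulary would presumably come from the keys of model.wv.key_to_index.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Character n-grams of a word wrapped in '<' and '>' boundary markers."""
    wrapped = f"<{word}>"
    return {
        wrapped[i:i + n]
        for n in range(min_n, max_n + 1)
        for i in range(len(wrapped) - n + 1)
    }


def find_oov(tokens, vocabulary):
    """Return the distinct tokens missing from the vocabulary, in first-seen order."""
    seen = set()
    oov = []
    for tok in tokens:
        if tok not in vocabulary and tok not in seen:
            seen.add(tok)
            oov.append(tok)
    return oov


# Hypothetical model vocabulary and hypothetical tokens extracted from a PDF.
vocabulary = {"cardiology", "the", "patient", "showed", "signs", "of"}
pdf_tokens = "the patient showed signs of cardiomyopathy".split()

oov_words = find_oov(pdf_tokens, vocabulary)

# Fragments the OOV word shares with a known word: a synthesized OOV
# vector can only draw on n-grams that were seen during training.
shared = char_ngrams("cardiomyopathy") & char_ngrams("cardiology")
print(oov_words, sorted(shared))
```

The fewer fragments an OOV word shares with well-trained words, the rougher its synthesized vector is likely to be.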
If you think you need to do more, you should expand your question with more details about why you think that's necessary, and what existing precedents (in docs/tutorials/papers) you're trying to match.
I've not seen a well-documented way to casually fine-tune, or incrementally expand the known-vocabulary of, an existing FastText model. There would be a lot of expert tradeoffs required, and in many cases simply training a new model with sufficient data is likely to be a safer approach.
Anyone seeking such fine-tuning should have a clear idea of:
what their incremental data might be able to add to an existing model
what process/code they will be using, and why that process/code might be expected to give meaningful results with their specific starting model & new data
how the results of any such process can be evaluated to ensure the extra fine-tuning steps are beneficial compared to alternatives

Multilingual free-text-items Text Classification for improving a recommender system

To improve the recommender system for Buyer Material Groups (BMG), our company wants to train a model using customers' historical spend data. The model should be trained on historical "Short text descriptions" to predict the appropriate BMG. The dataset has more than 500,000 rows and the text descriptions are multilingual (up to 40 characters).
Question 1: Can I use supervised learning, given that the descriptions are in multiple languages? If yes, are classic approaches like multinomial naive Bayes or SVM suitable?
Question 2: If I want to improve the first model in case it does not perform well, and use unsupervised multilingual embeddings to build a classifier, how can I train this classifier on the numerical labels later?
If you have other ideas or approaches, please feel free to share :). (It is a simple text classification problem.)
Can I use supervised learning, given that the descriptions are in multiple languages?
Yes, this is not a problem except it makes your data more sparse. If you actually only have 40 characters (is that not 40 words?) per item, you may not have enough data. Also the main challenge for supervised learning will be whether you have labels for the data.
If Yes, are classic approaches like multinomial naive bayes or SVM suitable?
They will work as well as they always have, though these days building a vector representation is probably a better choice.
If I want to improve the first model in case it does not perform well, and use unsupervised multilingual embeddings to build a classifier, how can I train this classifier on the numerical labels later?
Assuming the numerical labels are labels on the original data, you can add them as tokens like LABEL001 and the model can learn representations of them if you want to make an unsupervised recommender.
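As a rough sketch of that label-token idea: each description is tokenized and a pseudo-word built from its numeric label is appended, so that a word2vec/fasttext-style model trained on these token lists would learn a vector for the label alongside the real words. The helper name, the zero-padding width and the three example rows are hypothetical.

```python
def add_label_token(tokens, label_id, width=3):
    """Append a pseudo-word such as LABEL001 to a list of tokens."""
    return tokens + [f"LABEL{label_id:0{width}d}"]


# Hypothetical (short text description, numeric BMG label) rows.
rows = [
    ("Hex bolt M8 steel", 1),
    ("Schraube M8 verzinkt", 1),
    ("Toner cartridge black", 17),
]

# One token list per row, ready to be passed to an embedding trainer.
training_sentences = [
    add_label_token(text.lower().split(), label) for text, label in rows
]
print(training_sentences)
```

After training, the words nearest to a label token (or the label tokens nearest to a new description's words) could serve as a crude recommendation signal.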
Honestly these days I wouldn't start with Naive Bayes or classical models, I'd go straight to word vectors as a first test for clustering. Using fasttext or word2vec is pretty straightforward. The main problem is that if you really only have 40 characters per item, that just might not be enough data to cluster usefully.

Which additional features to use apart from Doc2Vec embeddings for Document Similarity?

So I am doing a project on document similarity and right now my features are only the embeddings from Doc2Vec. Since that is not showing any good results, after hyperparameter optimization and word embedding before the doc embedding... What other features can I add, so as to get better results?
My dataset is 150 documents, 500-700 words each, with 10 topics(labels), each document having one topic. Documents are labeled on a document level, and that labeling is currently used only for evaluation purposes.
Edit: The following is answer to gojomo's questions and elaborating on my comment on his answer:
The evaluation of the model is done on the training set. I am comparing if the label is the same as the most similar document from the model. For this I am first getting the document vector using the model's method 'infer_vector' and then 'most_similar' to get the most similar document. The current results I am getting are 40-50% of accuracy. A satisfactory score would be of at least 65% and upwards.
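For reference, this kind of check can be sketched in plain Python as a leave-one-out nearest-neighbour accuracy. The sketch assumes each document is excluded from its own neighbour search (otherwise a document can trivially match itself), and the 2-dimensional vectors and labels are hypothetical stand-ins for vectors that would come from 'infer_vector'.

```python
import math


def cosine(u, v):
    """Cosine similarity between two dense vectors given as lists."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def nearest_neighbour_accuracy(vectors, labels):
    """Share of documents whose most similar *other* document has the same label."""
    hits = 0
    for i, vec in enumerate(vectors):
        others = (j for j in range(len(vectors)) if j != i)
        best_j = max(others, key=lambda j: cosine(vec, vectors[j]))
        hits += labels[best_j] == labels[i]
    return hits / len(vectors)


# Hypothetical document vectors and their topic labels.
vectors = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.9], [0.8, 0.5]]
labels = ["sport", "sport", "tech", "tech", "tech"]

accuracy = nearest_neighbour_accuracy(vectors, labels)
print(accuracy)
```

Only the last toy document is mismatched here, because its nearest neighbour carries a different label.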
Due to the purpose of this research and its further use case, I am unable to get a larger dataset. That is why a professor recommended (this is a university project) that I add some additional features to the Doc2Vec document embeddings. As I had no idea what he meant, I am asking the Stack Overflow community.
The end goal of the model is to cluster the documents; again, the labels are for now used only for evaluation purposes.
If I don't get good results with this model, I will try out the simpler ones mentioned by @Adnan S and @gojomo, such as TF-IDF, Word Mover's Distance and bag of words. I just presumed I would get better results using Doc2Vec.
You should try creating TF-IDF features with 2- and 3-grams to generate a vector representation for each document. You will have to fit the vocabulary on all 150 documents. Once you have the TF-IDF vector for each document, you can use cosine similarity between any two of them.
Here is a blog article with more details and doc page for sklearn.
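As a rough, dependency-free illustration of the approach (a hand-rolled stand-in, not the sklearn code the links refer to), the sketch below builds TF-IDF vectors over word 1- to 3-grams and compares documents by cosine similarity. The function names and the three toy documents are hypothetical; the smoothed idf has the same form as the default documented for scikit-learn's TfidfVectorizer.

```python
import math
from collections import Counter


def word_ngrams(text, min_n=1, max_n=3):
    """Lower-cased word n-grams of a text, from min_n to max_n words long."""
    tokens = text.lower().split()
    return [
        " ".join(tokens[i:i + n])
        for n in range(min_n, max_n + 1)
        for i in range(len(tokens) - n + 1)
    ]


def tfidf_vectors(docs, min_n=1, max_n=3):
    """Sparse TF-IDF vectors (dicts), with the vocabulary fitted on all docs."""
    counts = [Counter(word_ngrams(doc, min_n, max_n)) for doc in docs]
    doc_freq = Counter(term for c in counts for term in c)
    n_docs = len(docs)
    # Smoothed idf: ln((1 + N) / (1 + df)) + 1
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1.0 for t, df in doc_freq.items()}
    return [{t: tf * idf[t] for t, tf in c.items()} for c in counts]


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Hypothetical toy documents.
docs = [
    "the cat sat on the mat",
    "the cat sat on the sofa",
    "stock markets fell sharply today",
]
vecs = tfidf_vectors(docs)
sim_close = cosine(vecs[0], vecs[1])
sim_far = cosine(vecs[0], vecs[2])
print(sim_close, sim_far)
```

Passing min_n=2 restricts the features to 2- and 3-grams only, as suggested above.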
How are you evaluating the results as not good, and how will you know when your results are adequate/good?
Note that just 150 docs of 500-700 words each is a tiny, tiny dataset: typical datasets used in published Doc2Vec results include tens-of-thousands to millions of documents, of hundreds to thousands of words each.
It will be hard for any of the Word2Vec/Doc2Vec/etc-style algorithms to do much with so little data. (The gensim Doc2Vec implementation includes a similar toy dataset, of 300 docs of 200-300 words each, as part of its unit-testing framework, and to eke out even vaguely useful results, it must up the number of training epochs, and shrink the vector size, significantly.)
So if intending to use Doc2Vec-like algorithms, your top priority should be finding more training data. Even if, in the end, only ~150 docs are significant, collecting more documents that use similar domain language can help improve the model.
It's unclear what you mean when you say there are 10 topics, and 1 topic per document. Are those human-assigned categories, and are those included as part of the training texts or tags passed to the Doc2Vec algorithm? (It might be reasonable to include it, depending on what your end-goals and document-similarity evaluations consist of.)
Are these topics the same as the labelling you also mention, and are you ultimately trying to predict the topics, or just using the topics as a check of the similarity-results?
As @adnan-s suggests in the other answer, it may also be worth trying simpler count-based 'bag of words' document representations, including potentially on word n-grams or even character n-grams, or TF-IDF weighted.
If you have adequate word-vectors, as trained from your data or from other compatible sources, the "Word Mover's Distance" measure can be another interesting way to compute pairwise similarities. (However, it may be too expensive to calculate between many-hundred-word texts - working much faster on shorter texts.)
As others have already suggested, your training set of 150 documents probably isn't big enough to create good representations. You could, however, try to use a pre-trained model and infer the vectors of your documents.
Here is a link where you can download a (1.4GB) DBOW model trained on English Wikipedia pages, working with 300-dimensional document vectors. I obtained the link from jhlau/doc2vec GitHub repository. After downloading the model you can use it as follows:
from gensim.models import Doc2Vec
# load the downloaded model
model_path = "enwiki_dbow/doc2vec.bin"
model = Doc2Vec.load(model_path)
# infer vector for your document
doc_vector = model.infer_vector(doc_words)
Where doc_words is a list of words in your document.
This, however, may not work for you in case your documents are very specific. But you can still give it a try.

Text representations : How to differentiate between strings of similar topic but opposite polarities?

I have been clustering a corpus by computing tf-idf weights for its sentences and grouping together those whose similarity, as computed with a gensim model, exceeds a certain threshold value.
tfidf_dic = DocSim.get_tf_idf()
ds = DocSim(model,stopwords=stopwords, tfidf_dict=tfidf_dic)
sim_scores = ds.calculate_similarity(source_doc, target_docs)
The problem is that despite putting high threshold values, sentences of similar topics but opposite polarities get clustered together as such:
Here is an example of the similarity weights obtained between "don't like it" & "i like it"
Are there any other methods, libraries or alternative models that can differentiate the polarities effectively by assigning them very low similarities or opposite vectors?
This is so that the outputs "i like it" and "dont like it" are in separate clusters.
PS: Pardon me if there are any conceptual errors as I am rather new to NLP. Thank you in advance!
The problem is in how you represent your documents. Tf-idf is good for representing long documents where keywords play a more important role. Here, it is probably the idf part of tf-idf that disregards the polarity because negative particles like "no" or "not" will appear in most documents and they will always receive a low weight.
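A small worked example of that idf effect, using a hypothetical five-sentence corpus and the plain log(N/df) form of idf (libraries often use smoothed variants):

```python
import math

# Hypothetical corpus in which the negation "not" appears in most sentences.
corpus = [
    "i do not like it",
    "it is not bad",
    "i like it",
    "not my favourite",
    "great product",
]
tokenised = [doc.split() for doc in corpus]
n_docs = len(tokenised)


def idf(term):
    """Plain inverse document frequency; assumes the term occurs at least once."""
    doc_freq = sum(term in doc for doc in tokenised)
    return math.log(n_docs / doc_freq)


idf_not = idf("not")              # appears in 3 of 5 sentences
idf_favourite = idf("favourite")  # appears in 1 of 5 sentences
print(idf_not, idf_favourite)
```

The frequent negation ends up with a much smaller weight than the rare content word, so it contributes little to the similarity between "i like it" and "i do not like it".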
I would recommend trying some neural embeddings that might capture the polarity. If you want to keep using Gensim, you can try doc2vec but you would need quite a lot of training data for that. If you don't have much data to estimate the representation, I would use some pre-trained embeddings.
Even simply averaging word embeddings may help (you can load FastText embeddings in Gensim). Alternatively, if you want a stronger model, you can try BERT or another large pre-trained model from the Transformers package.
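The averaging step itself is simple; a minimal sketch, assuming the word vectors are available as a plain mapping from word to list of floats (the 2-dimensional toy vectors and the function name below are hypothetical, standing in for pre-trained embeddings):

```python
def average_embedding(tokens, word_vectors):
    """Mean of the vectors of all known tokens; None if no token is known."""
    known = [word_vectors[t] for t in tokens if t in word_vectors]
    if not known:
        return None
    dim = len(known[0])
    return [sum(vec[k] for vec in known) / len(known) for k in range(dim)]


# Hypothetical toy vectors, not real pre-trained embeddings.
toy_vectors = {"i": [0.0, 1.0], "like": [1.0, 0.0], "it": [0.5, 0.5]}

sentence_vector = average_embedding("i like it".split(), toy_vectors)
print(sentence_vector)
```

Note that a plain average ignores word order, so whether it separates "i like it" from "dont like it" depends entirely on how distinct the vector for the negation word is.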
Unfortunately, simple text representations based merely on the sets-of-words don't distinguish such grammar-driven reversals-of-meaning very well.
The method needs to be sensitive to meaningful phrases, and the hierarchical, grammar-driven inter-word dependencies, to model that.
Deeper neural networks using convolutional/recurrent techniques do better, or methods which tree-model sentence-structure.
For ideas see for example...
"Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank"
...or a more recent summary presentation...
"Representations for Language: From Word Embeddings to Sentence Meanings"

Multiclass text classification with python and nltk

I am given a task of classifying a given news text data into one of the following 5 categories - Business, Sports, Entertainment, Tech and Politics
About the data I am using:
It consists of text data labeled as one of the 5 types of news (BBC news data).
I am currently using the nltk module to calculate the frequency distribution of every word in the training data with respect to each category (except the stopwords).
Then I classify the new data by calculating the sum of weights of all the words with respect to each of those 5 categories. The class with the most weight is returned as the output.
Here's the actual code.
This algorithm does predict new data accurately but I am interested to know about some other simple algorithms that I can implement to achieve better results. I have used Naive Bayes algorithm to classify data into two classes (spam or not spam etc) and would like to know how to implement it for multiclass classification if it is a feasible solution.
Thank you.
In classification, and especially in text classification, choosing the right machine learning algorithm often comes after selecting the right features. Features are domain dependent and require knowledge about the data, but good-quality features lead to better systems more quickly than tuning or selecting algorithms and parameters.
In your case you can either use word embeddings, as already suggested, or design your own custom features that you think will help in discriminating classes (whatever the number of classes is). For instance, how do you think a spam e-mail is often presented? A lot of mistakes, syntactic inversions, bad translation, odd punctuation, slang words... A lot of possibilities! Try to think about your case with sport, business, news etc.
You should try some new ways of creating/combining features and then choose the best algorithm. Also, have a look at other weighting methods than term frequencies, like tf-idf.
Since you're dealing with words, I would propose word embeddings, which give more insight into the relationships/meanings of words with respect to your dataset and thus may give better classification.
If you are looking for other implementations of classification, you can check my sample code here; these models from scikit-learn can easily handle multiple classes. Take a look at the scikit-learn documentation.
If you want a framework around this kind of classification that is easy to use, you can check out my rasa-nlu; it uses the spacy_sklearn model, and sample implementation code is here. All you have to do is prepare the dataset in the given format and train the model.
If you want more intelligence, you can check out my keras implementation here; it uses a CNN for text classification.
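On the multiclass Naive Bayes part of the question: nothing in the algorithm is specific to two classes. Below is a minimal hand-rolled sketch of multinomial Naive Bayes with add-one smoothing in plain Python, as a stand-in for the scikit-learn models mentioned above; the function names and the six toy training sentences are hypothetical.

```python
import math
from collections import Counter, defaultdict


def train_nb(docs, labels):
    """Collect per-class document counts, per-class word counts and the vocabulary."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in zip(docs, labels):
        tokens = text.lower().split()
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab


def predict_nb(text, model):
    """Return the class with the highest log prior plus summed log likelihoods."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    scores = {}
    for label, n_docs in class_counts.items():
        total_words = sum(word_counts[label].values())
        score = math.log(n_docs / total_docs)
        for tok in text.lower().split():
            if tok in vocab:  # words never seen in training are ignored
                # Add-one (Laplace) smoothing avoids zero probabilities.
                score += math.log(
                    (word_counts[label][tok] + 1) / (total_words + len(vocab))
                )
        scores[label] = score
    return max(scores, key=scores.get)


# Hypothetical toy training data with three classes.
docs = [
    "shares rise as profits grow", "team wins the cup final",
    "new phone chip released", "bank profits fall",
    "striker scores in final", "software update for phone",
]
labels = ["business", "sport", "tech", "business", "sport", "tech"]

model = train_nb(docs, labels)
print(predict_nb("profits at the bank", model))
```

The same two functions work unchanged for any number of classes, including the 5 news categories.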
Hope this helps.
