Good day,
I'm working on the following problem and have minimal knowledge of machine learning (ML):
Given a list of articles (A's, in text format) and a search string (SQ), select and order the articles (A's) most relevant to the search string (SQ).
Optimize point #1 for the case where a new article (A) is added, i.e. so the search accounts for the new record and it is taken into consideration next time.
I've selected Spark as the engine for the ML calculations and found examples that compute TF-IDF models (https://spark.apache.org/docs/2.0.0/ml-features.html#tf-idf). This ends up producing a feature vector of term frequencies for each article:
(8,[0,1,4],[0.287... (8,[0,1,6],[0.287... (8,[1,3,4],[0.0,0...
(sorry for the truncated results)
At this point I'm stuck. It looks like we might need to calculate a similar vector for SQ and then somehow order the articles by closeness. I'm not sure how to do that, though.
What would be the right way forward? Can you please share or point me to examples with an implementation?
Thank you in advance,
Vitaliy
Here is the short roadmap that worked for me (verified on a really small data set):
1) Build TF-IDF features for the corpus, obtaining a "features" vector per article.
2) Transform the search term with the same TF-IDF pipeline (reusing the corpus IDF model) to obtain its "features" vector.
3) Train k-means on the corpus features (from step 1) to get an "article id" to cluster mapping.
4) Predict the search term's cluster (using steps 2 and 3).
5) Filter the articles down to the predicted cluster (from step 4).
6) Within the selected cluster, filter further with LSH and order by the distance between each article's features and the search term's features (see the sketch below).
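Here is a minimal PySpark sketch of that roadmap, assuming the pyspark.ml Tokenizer/HashingTF/IDF/KMeans/LSH pipeline; the two toy articles, the query text and all parameter values (k, bucketLength, number of neighbours) are placeholders you would tune on your own data:

    from pyspark.sql import SparkSession
    from pyspark.ml.feature import Tokenizer, HashingTF, IDF, BucketedRandomProjectionLSH
    from pyspark.ml.clustering import KMeans

    spark = SparkSession.builder.appName("article-search").getOrCreate()

    # Toy corpus; in practice load your articles into a DataFrame.
    articles = spark.createDataFrame(
        [(0, "spark ml tf idf example"), (1, "machine learning search ranking")],
        ["id", "text"])

    tokenizer = Tokenizer(inputCol="text", outputCol="words")
    hashing_tf = HashingTF(inputCol="words", outputCol="raw_features")
    tf = hashing_tf.transform(tokenizer.transform(articles))

    idf_model = IDF(inputCol="raw_features", outputCol="features").fit(tf)
    corpus_features = idf_model.transform(tf)                                   # step 1

    kmeans_model = KMeans(k=2, featuresCol="features").fit(corpus_features)     # step 3
    clustered = kmeans_model.transform(corpus_features)                         # article id -> cluster

    query = spark.createDataFrame([(0, "machine learning search")], ["id", "text"])
    query_features = idf_model.transform(
        hashing_tf.transform(tokenizer.transform(query)))                       # step 2 (reuse corpus IDF)
    query_cluster = kmeans_model.transform(query_features).first()["prediction"]  # step 4

    candidates = clustered.filter(clustered.prediction == query_cluster)        # step 5

    # Step 6: rank the candidates by distance to the query vector, e.g. with LSH.
    lsh = BucketedRandomProjectionLSH(inputCol="features", outputCol="hashes", bucketLength=2.0)
    lsh_model = lsh.fit(candidates)
    key = query_features.first()["features"]
    ranked = lsh_model.approxNearestNeighbors(candidates, key, numNearestNeighbors=10)
    ranked.select("id", "distCol").show()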
Folks,
I have searched Google for different types of papers/blogs/tutorials etc. but haven't found anything helpful. I would appreciate it if anyone could help me. Please note that I am not asking for step-by-step code, but rather an idea/blog/paper or some tutorial.
Here's my problem statement:
Just like sentiment analysis is used to identify the positive or negative tone of a sentence, I want to find out whether a sentence is a forward-looking (future outlook) statement or not.
I do not want to use a bag-of-words approach that sums up the number of forward-looking words/phrases such as "going forward", "in near future" or "in 5 years from now", etc. I am not sure whether word2vec or doc2vec can be used. Please enlighten me.
Thanks.
It seems that what you are interested in doing is finding temporal statements in texts.
I'm not sure of your desired final output, but let's assume you want to find temporal phrases, or the sentences that contain them.
One methodology could be the following:
Create a list of temporal terms, e.g. [days, years, months, now, later]
Keep only the sentences that contain these key terms
Train a doc2vec model on those sentences
Infer a vector for a new sentence and compare it with a distance metric (see the sketch after this list)
GMM Cluster + Limit
Distance from average
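A rough gensim sketch of the first methodology; the toy sentences, the temporal-term list and the Doc2Vec parameters are placeholders, and the GMM / distance-from-average scoring steps are left out:

    from gensim.models.doc2vec import Doc2Vec, TaggedDocument

    # Toy corpus; replace with your own sentences.
    sentences = [
        "we expect revenue to grow over the next five years",
        "the company was founded in boston",
        "profits will likely improve later this year",
        "our headquarters are located downtown",
    ]

    # Steps 1-2: keep only sentences containing a temporal key term.
    temporal_terms = {"days", "years", "months", "now", "later", "year", "future"}
    temporal_sentences = [s for s in sentences
                          if temporal_terms & set(s.lower().split())]

    # Step 3: train a Doc2Vec model on the filtered sentences.
    tagged = [TaggedDocument(words=s.lower().split(), tags=[i])
              for i, s in enumerate(temporal_sentences)]
    model = Doc2Vec(tagged, vector_size=50, min_count=1, epochs=40)

    # Step 4: infer a vector for a new sentence and rank by cosine similarity.
    new_sentence = "sales should rise in the near future".lower().split()
    inferred = model.infer_vector(new_sentence)
    for tag, score in model.dv.most_similar([inferred], topn=2):
        print(temporal_sentences[tag], score)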
Another methodology could be:
Create a list of temporal terms, e.g. [days, years, months, now, later]
Do bigram and trigram collocation extraction
Keep the collocations that contain temporal terms
Use the relevant collocations in a kind of bag-of-collocations approach
Build binary feature vectors indicating which relevant collocations a text matches
Train a classifier on those features to recognise the higher-level class of a text (see the sketch below)
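A rough NLTK sketch of the collocation part of the second methodology (bigrams only for brevity; the toy tokens, the temporal-term list and the final classifier are placeholders you would replace with your own):

    from nltk.collocations import BigramCollocationFinder, BigramAssocMeasures

    # Toy tokenised corpus; replace with your own texts.
    tokens = ("going forward we expect growth . in the near future sales "
              "will rise . five years from now the market will change").split()

    temporal_terms = {"forward", "future", "years", "now", "later", "months", "days"}

    # Extract bigram collocations and keep the ones containing a temporal term.
    finder = BigramCollocationFinder.from_words(tokens)
    scored = finder.score_ngrams(BigramAssocMeasures().pmi)
    temporal_collocations = [pair for pair, score in scored
                             if temporal_terms & set(pair)]
    print(temporal_collocations)

    # Bag-of-collocations: one binary feature per kept collocation,
    # to be fed into whatever classifier you choose.
    def collocation_features(sentence_tokens, collocations=temporal_collocations):
        bigrams = set(zip(sentence_tokens, sentence_tokens[1:]))
        return [1 if c in bigrams else 0 for c in collocations]

    print(collocation_features("sales will rise in the near future".split()))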
This sounds like a good case for a bootstrapping approach if you have large amounts of text.
Both methodologies are really semi-supervised, since there is some need to find the initial temporal terms, but even that could be automated using a word2vec scheme and bootstrapping.
I am trying to do textual analysis on a bunch (about 140) of text documents. Each document, after preprocessing and removing unnecessary words and stopwords, has about 7000 sentences (as determined by NLTK's sentence tokenizer), and each sentence has about 17 words on average. My job is to find hidden themes in those documents.
I have thought about doing topic modeling. However, I cannot decide whether the data I have is enough to obtain meaningful results via LDA, or whether there is anything else I could do.
Also, how should I divide the texts into documents? Are 140 documents (each with roughly 7000 x 17 words) enough, or should I consider each sentence as a document? But then each document would have only 17 words on average, much like tweets.
Any suggestions would be helpful.
Thanks in advance.
I have worked along similar lines. This approach can work for up to about 300 such documents; to take it to a larger scale, you would need to replicate the approach using Spark.
Here it goes:
1) Prepare a TF-IDF matrix: represent the documents as term vectors. Why not LDA? Because with LDA you need to supply the number of themes up front, which you don't know yet. If you want a more sophisticated (more semantic) representation, try word2vec, GloVe, Google News vectors, etc.
2) Prepare a latent semantic space (LSA) from the TF-IDF matrix above. LSA is created with an SVD (one can use the Kaiser criterion to choose the number of dimensions).
Why do we do 2)?
a) The TF-IDF matrix is very sparse, and step 3 (t-SNE) is computationally expensive.
b) The LSA space can also be used to create a semantic search engine.
You can bypass 2) when your TF-IDF matrix is very small, but given your situation I don't think that will be the case, and only if you don't have other needs such as semantic search over these documents.
3) Use t-SNE (t-distributed stochastic neighbor embedding) to represent the documents in 3 dimensions. Prepare a spherical plot from the Euclidean coordinates.
4) Apply K-means iteratively to find the optimal number of clusters.
Once that is decided, prepare word clouds for each category. Those are your themes. (A rough sketch of steps 1-4 follows.)
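A compact scikit-learn sketch of steps 1-4, with toy documents and placeholder dimensionalities standing in for your real corpus and for the Kaiser-criterion choice:

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.decomposition import TruncatedSVD
    from sklearn.manifold import TSNE
    from sklearn.cluster import KMeans

    # Toy documents; replace with your ~140 preprocessed documents.
    documents = [
        "gene expression and protein folding",
        "protein structure prediction methods",
        "stock market returns and volatility",
        "volatility models for equity markets",
        "deep learning for image recognition",
        "convolutional networks recognise images",
        "climate change and rising sea levels",
        "sea level rise driven by warming",
    ]

    # 1) TF-IDF matrix.
    X = TfidfVectorizer(stop_words="english").fit_transform(documents)

    # 2) Latent semantic space (LSA) via truncated SVD; the number of dimensions
    #    here is a placeholder -- pick it via the Kaiser criterion or explained variance.
    X_lsa = TruncatedSVD(n_components=5, random_state=0).fit_transform(X)

    # 3) t-SNE embedding in 3 dimensions for plotting.
    X_3d = TSNE(n_components=3, perplexity=3, random_state=0).fit_transform(X_lsa)

    # 4) Run K-means for several k and inspect inertia (or silhouette) to pick one.
    for k in range(2, 6):
        km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_lsa)
        print(k, round(km.inertia_, 3))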
I'm trying to find similar documents based on their text in Spark. I'm using Python with Spark.
So far I have implemented RowMatrix, IndexedRowMatrix, and CoordinateMatrix to set this up, and then ran columnSimilarities (DIMSUM). The problem with DIMSUM is that it's optimized for many features and few items: http://stanford.edu/~rezab/papers/dimsum.pdf
Our initial approach was to create TF-IDF vectors of all words in all documents, then transpose that into a RowMatrix where we have a row for each word and a column for each item. Then we ran columnSimilarities, which gives us a CoordinateMatrix of ((item_i, item_j), similarity). This just doesn't work well when the number of columns > the number of rows.
We need a way to calculate all-pairs similarity with many items and few features (#items = 10^7, #features = 10^4). At a higher level, we're trying to create an item-based recommender that, given one item, will return a few quality recommendations based only on the text.
I'd write this as a comment instead of an answer, but SO won't let me comment yet.
This would be "trivially" solved by utilizing Elasticsearch's more-like-this query. From the docs you can see how it works and which factors are taken into account, which should be useful information even if you end up implementing this yourself in Python.
They have also implemented other interesting algorithms, such as the significant terms aggregation.
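A minimal sketch with the official Python client, assuming a hypothetical "articles" index with a "text" field; the document id and all parameter values are placeholders, and the more-like-this docs describe the full set of tuning options:

    from elasticsearch import Elasticsearch

    es = Elasticsearch("http://localhost:9200")

    body = {
        "size": 10,
        "query": {
            "more_like_this": {
                "fields": ["text"],
                # "like" can be free text or a reference to an already indexed document.
                "like": [{"_index": "articles", "_id": "item_42"}],
                "min_term_freq": 1,
                "max_query_terms": 25,
            }
        },
    }

    response = es.search(index="articles", body=body)
    for hit in response["hits"]["hits"]:
        print(hit["_id"], hit["_score"])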
I am new to LDA and I want to use it in my work. However, some problems have come up.
In order to get the best performance, I want to estimate the best number of topics. After reading "Finding Scientific Topics", I know that I can first calculate log P(w|z) and then use the harmonic mean of a series of P(w|z) values to estimate P(w|T).
My question is: what does "a series of" mean?
Unfortunately, there is no hard science yielding the correct answer to your question. To the best of my knowledge, the hierarchical Dirichlet process (HDP) is quite possibly the best way to arrive at the optimal number of topics.
If you are looking for deeper analyses, this paper on HDP reports the advantages of HDP in determining the number of groups.
A reliable way is to compute the topic coherence for different numbers of topics and choose the model that gives the highest topic coherence. But the highest value may not always fit the bill.
See this topic modeling example.
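For example, with gensim this can be sketched roughly as follows (toy corpus and a small range of k; in practice you would use your own documents and a wider range):

    from gensim.corpora import Dictionary
    from gensim.models import LdaModel, CoherenceModel

    # Toy tokenised corpus; replace with your own texts.
    texts = [
        ["topic", "model", "inference", "dirichlet"],
        ["gibbs", "sampling", "topic", "model"],
        ["stock", "market", "price", "returns"],
        ["market", "volatility", "price", "model"],
    ]
    dictionary = Dictionary(texts)
    corpus = [dictionary.doc2bow(t) for t in texts]

    # Fit LDA for several candidate numbers of topics and compare coherence.
    for k in range(2, 5):
        lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=k,
                       random_state=0, passes=10)
        coherence = CoherenceModel(model=lda, texts=texts, dictionary=dictionary,
                                   coherence="c_v").get_coherence()
        print(k, round(coherence, 4))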
First, some people use the harmonic mean to find the optimal number of topics; I tried it as well, but the results were unsatisfactory. So my suggestion is: if you are using R, the package "ldatuning" will be useful. It provides four metrics for choosing the optimal number of topics. Perplexity and log-likelihood based V-fold cross-validation are also very good options for finding the best topic model, though V-fold cross-validation is a bit time-consuming for large datasets. See "A heuristic approach to determine an appropriate number of topics in topic modeling".
Important links:
https://cran.r-project.org/web/packages/ldatuning/vignettes/topics.html
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4597325/
Let k = number of topics
There is no single best way, and I am not even sure whether there are any standard practices for this.
Method 1:
Try out different values of k, select the one that has the largest likelihood.
Method 2:
Instead of LDA, see if you can use HDP-LDA
Method 3:
If HDP-LDA is infeasible on your corpus (because of the corpus size), then take a uniform sample of your corpus and run HDP-LDA on that, and take the value of k given by HDP-LDA. For a small interval around this k, use Method 1 (see the sketch below).
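A rough gensim sketch combining Methods 1-3; the toy corpus and parameter values are placeholders, and log_perplexity is gensim's per-word likelihood bound, which should ideally be evaluated on held-out documents rather than the training corpus:

    from gensim.corpora import Dictionary
    from gensim.models import HdpModel, LdaModel

    # Toy tokenised corpus; replace with (a uniform sample of) your own corpus.
    texts = [
        ["gene", "protein", "expression", "cell"],
        ["protein", "folding", "structure", "cell"],
        ["market", "stock", "price", "returns"],
        ["price", "volatility", "market", "model"],
        ["image", "network", "learning", "model"],
        ["learning", "deep", "network", "image"],
    ]
    dictionary = Dictionary(texts)
    corpus = [dictionary.doc2bow(t) for t in texts]

    # Methods 2/3: let HDP propose topics; inspect the dominant ones for a rough k.
    hdp = HdpModel(corpus, dictionary)
    for topic in hdp.print_topics(num_topics=5, num_words=4):
        print(topic)

    # Method 1: scan a small interval of k and keep the model with the
    # largest (least negative) likelihood bound.
    for k in range(2, 5):
        lda = LdaModel(corpus, id2word=dictionary, num_topics=k,
                       random_state=0, passes=10)
        print(k, lda.log_perplexity(corpus))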
Since I am working on the same problem, I just want to add the method proposed by Wang et al. (2019) in their paper "Optimization of Topic Recognition Model for News Texts Based on LDA". Besides giving a good overview, they suggest a new method: first you train a word2vec model (e.g. using the word2vec package), then you apply a clustering algorithm capable of finding density peaks (e.g. from the densityClust package), and then you use the number of clusters found as the number of topics in the LDA algorithm.
If time permits, I will try this out. I also wonder if the word2vec model can make the LDA obsolete.
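A very rough Python sketch of that pipeline. Since densityClust is an R package, scikit-learn's MeanShift is used here purely as a stand-in mode-seeking clusterer rather than the density-peak algorithm from the paper, and the toy corpus and parameters are placeholders:

    from gensim.models import Word2Vec, LdaModel
    from gensim.corpora import Dictionary
    from sklearn.cluster import MeanShift

    # Toy tokenised corpus; replace with your own texts.
    texts = [
        ["gene", "protein", "expression", "cell", "biology"],
        ["protein", "folding", "structure", "cell", "biology"],
        ["market", "stock", "price", "returns", "finance"],
        ["price", "volatility", "market", "finance", "model"],
    ]

    # 1) Train word2vec on the corpus.
    w2v = Word2Vec(texts, vector_size=25, min_count=1, epochs=50, seed=0)
    word_vectors = w2v.wv[w2v.wv.index_to_key]

    # 2) Cluster the word vectors. The paper uses density-peak clustering
    #    (densityClust in R); MeanShift is only a stand-in here.
    labels = MeanShift().fit_predict(word_vectors)
    n_topics = len(set(labels))
    print("clusters found:", n_topics)

    # 3) Use the number of clusters as the number of topics for LDA.
    dictionary = Dictionary(texts)
    corpus = [dictionary.doc2bow(t) for t in texts]
    lda = LdaModel(corpus, id2word=dictionary, num_topics=max(2, n_topics),
                   random_state=0, passes=10)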
Okay, so usually topic models (such as LDA, pLSI, etc.) are used to infer topics that may be present in a set of documents, in an unsupervised fashion. I would like to know if anyone has any ideas as to how I can shoehorn my problem into an LDA framework, as there are very good tools available to solve LDA problems.
For the sake of being thorough, I have the following pieces of information as input:
A set of documents (segments of DNA from one organism, where each segment is a document)
A document can only have one topic in this scenario
A set of topics (segments of DNA from other organisms)
Words in this case are triplets of bases (for now)
The question I want to answer is: For the current document, what is its topic? In other words, for the given DNA segment, which other organism (same species) did it most likely come from? There could have been mutations and such since the exchange of segments occurred, so the two segments won't be identical.
The main difference between this and the classical LDA model is that I know the topics ahead of time.
My initial idea was to take a pLSA model (http://en.wikipedia.org/wiki/PLSA) and just set the topic nodes explicitly, then perform standard EM learning (if only there were a decent library that could handle Bayesian parameter learning with latent variables...), followed by inference using whatever algorithm (which shouldn't matter, because the model is a polytree anyway).
Edit: I think I've solved it, for anyone who might stumble across this. I figured out that you can use labelled LDA and just assign every label to every document. Since each label has a one-to-one correspondence with a topic, you're effectively saying to the algorithm: for each document, choose the topic from this given set of topics (the label set), instead of making up your own.
I have a similar problem, and just thought I'd add the solutions I'm going with for completeness's sake.
I also have a set of documents (PDF documents anywhere from 1 to 200 pages), though mine are regular English text data.
A set of known topics (mine include subtopics, but I won't address that here). Unlike the previous example, I may desire multiple topic labels.
Words (standard English, though named entities and acronyms are included in my corpus)
LDA-esque approach: Guided LDA
Guided LDA lets you seed words for your LDA categories. If you have n topics for your final decisions, you just create your GuidedLDA model with n seed topics, each of which contains the keywords that make up its topic name. E.g.: I want to cluster into the known topics "biochemistry" and "physics", so I seed my GuidedLDA with d = {0: ['biochemistry'], 1: ['physics']}. You can incorporate other guiding words if you can identify them; the guidedLDA implementation I'm using (the Python version) also makes it relatively easy to identify the top n words for a given topic. You can run GuidedLDA once with only the basic seed words, then use the top-n-words output to find more words to add to the topics. These top n words are also potentially helpful for the other approach I mention below.
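A minimal sketch along these lines with the guidedlda Python package; the toy documents, the seed words and the parameter values are placeholders you would adapt to your corpus:

    import numpy as np
    import guidedlda
    from sklearn.feature_extraction.text import CountVectorizer

    # Toy documents; replace with your own corpus.
    docs = [
        "the enzyme catalyses the biochemistry reaction in the cell",
        "protein biochemistry and enzyme kinetics in living cells",
        "quantum physics describes particles and energy levels",
        "classical and quantum physics of particle motion",
    ]

    vectorizer = CountVectorizer()
    X = vectorizer.fit_transform(docs).toarray()
    word2id = vectorizer.vocabulary_

    # Seed each known topic with the keywords that make up its name.
    seed_topic_list = [["biochemistry", "enzyme"], ["physics", "quantum"]]
    seed_topics = {}
    for topic_id, words in enumerate(seed_topic_list):
        for w in words:
            if w in word2id:
                seed_topics[word2id[w]] = topic_id

    model = guidedlda.GuidedLDA(n_topics=2, n_iter=100, random_state=7, refresh=20)
    model.fit(X, seed_topics=seed_topics, seed_confidence=0.15)

    # Inspect the top words per topic (useful for growing the seed lists).
    vocab = vectorizer.get_feature_names_out()
    for t in range(2):
        top = np.argsort(model.topic_word_[t])[:-6:-1]
        print(t, [vocab[i] for i in top])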
Non-LDA-esque approach: ~KNN
What I've ended up doing is using a word embedding model (word2vec has been superior to the alternatives in my case) to create a "topic vector" for every topic, based on the words that make up the topic/subtopic. E.g.: I have a category Biochemistry with a subcategory Molecular Biology. The most basic topic vector is just the word2vec vectors for Biochemistry, Molecular, and Biology all averaged together.
For every document I want to determine a topic for, I turn it into a "document vector" (same dimension & embedding model as my topic vectors; I've found that just averaging all the word2vec vectors in the doc has been the best solution for me so far, after a bit of preprocessing like removing stopwords). Then I just find the k closest topic vectors to the input document vector.
I should note that there's some ability to hand-tune this by changing the words that make up the topic vectors. One way to potentially identify further keywords is to use the guidedLDA model I mentioned earlier. (A condensed sketch of this approach follows.)
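A condensed sketch of this topic-vector / document-vector approach; the tiny Word2Vec model trained here only provides some vectors for illustration, and in practice you would load pretrained embeddings (e.g. the Google News vectors mentioned elsewhere in this thread):

    import numpy as np
    from gensim.models import Word2Vec

    # Toy corpus just to obtain some word vectors; replace with pretrained embeddings.
    sentences = [
        ["biochemistry", "molecular", "biology", "protein", "enzyme", "cell"],
        ["physics", "quantum", "particle", "energy", "force", "motion"],
        ["enzyme", "protein", "cell", "molecular", "biology", "reaction"],
        ["particle", "energy", "quantum", "physics", "field", "motion"],
    ]
    wv = Word2Vec(sentences, vector_size=25, min_count=1, epochs=200, seed=0).wv

    def average_vector(words):
        # Average the embeddings of the words we actually have vectors for.
        vectors = [wv[w] for w in words if w in wv]
        return np.mean(vectors, axis=0)

    # "Topic vectors": average of the words making up each topic/subtopic name.
    topics = {
        "Biochemistry / Molecular Biology": ["biochemistry", "molecular", "biology"],
        "Physics": ["physics"],
    }
    topic_vectors = {name: average_vector(words) for name, words in topics.items()}

    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    # "Document vector": average of the document's word vectors, then rank topics
    # by cosine similarity and keep the k closest.
    doc = ["the", "enzyme", "reaction", "in", "the", "cell"]
    doc_vector = average_vector(doc)
    ranked = sorted(topic_vectors.items(),
                    key=lambda kv: cosine(doc_vector, kv[1]), reverse=True)
    print([(name, round(cosine(doc_vector, vec), 3)) for name, vec in ranked])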
I would note that when I was testing these two solutions on a different corpus with labeled data (which I didn't use aside from evaluating accuracy and such) this ~KNN approach proved better than the GuidedLDA approach.
Why not simply use a supervised topic model? Jonathan Chang's lda package in R has an slda function that is quite nice. There is also a very helpful demo; just install the package and run demo(slda).