Simple Binary Text Classification - nlp

I seek the most effective and simple way to classify 800k+ scholarly articles as either relevant (1) or irrelevant (0) in relation to a defined conceptual space (here: learning as it relates to work).
Data is: title & abstract (mean=1300 characters)
Any approach may be used, or several combined, including supervised machine learning and/or establishing features that give rise to threshold values for inclusion, among others.
Approaches could draw on the key terms that describe the conceptual space, though a simple frequency count alone is too unreliable. Potential avenues might involve latent semantic analysis, n-grams, etc.
Generating training data may be realistic for up to 1% of the corpus, though this already means manually coding 8,000 articles (1=relevant, 0=irrelevant). Would that be enough?
Specific ideas and some brief reasoning are much appreciated so I can make an informed decision on how to proceed. Many thanks!

Several Ideas:
Run LDA and get the document-topic and topic-word distributions (say, 20 topics, depending on how many different topics your dataset covers). Label the top r% of documents with the highest weight on the relevant topic as relevant and the lowest nr% as non-relevant. Then train a classifier over those labelled documents.
Just use bag of words and retrieve the top r nearest neighbours to your query (your conceptual space) as relevant and the bottom nr percent as not relevant, then train a classifier over them.
If you had the citations you could run label propagation over the network graph by labelling very few papers.
Don't forget to make the title words different from your abstract words by changing the title words to title_word1, so that any classifier can put more weight on them.
Cluster the articles into, say, 100 clusters and then manually label those clusters. Choose the number of clusters (100 here) based on the coverage of different topics in your corpus. You can also use hierarchical clustering for this.
If it is the case that the number of relevant documents is way less than non-relevant ones, then the best way to go is to find the nearest neighbours to your conceptual space (e.g. using information retrieval implemented in Lucene). Then you can manually go down in your ranked results until you feel the documents are not relevant anymore.
Most of these methods are bootstrapping or weakly supervised approaches to text classification, about which you can find more literature.
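As a rough illustration of the bag-of-words idea combined with the title_ prefix trick, here is a minimal sketch in plain Python. The toy articles, the query terms and the cut-offs (top_r, bottom_nr) are all made up for the example, and raw counts with cosine similarity stand in for whatever retrieval engine (e.g. Lucene) you would really use.

```python
import math
from collections import Counter

def tokenize(title, abstract):
    """Lowercased bag of words; title tokens get a 'title_' prefix so they stay separate features."""
    title_tokens = ["title_" + w for w in title.lower().split()]
    return title_tokens + abstract.lower().split()

def cosine(a, b):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# Hypothetical toy corpus of (title, abstract) pairs
articles = [
    ("workplace learning", "informal learning among employees at work"),
    ("protein folding", "simulation of protein structures in cells"),
    ("training at work", "how employees learn new skills through training"),
    ("galaxy formation", "dark matter and the formation of galaxies"),
]

# The "query" is just the key terms of the conceptual space, given as a pseudo title and abstract
query = Counter(tokenize("learning work training", "learning work employees training skills"))

scores = [cosine(query, Counter(tokenize(t, a))) for t, a in articles]
ranked = sorted(range(len(articles)), key=lambda i: scores[i], reverse=True)

# Weak labels: the top of the ranking seeds the relevant class, the bottom the irrelevant class
top_r, bottom_nr = 2, 2
seed_relevant = ranked[:top_r]
seed_irrelevant = ranked[-bottom_nr:]
print(seed_relevant, seed_irrelevant)
```

The seed sets would then be the training data for an ordinary classifier; the manually coded articles are better spent on checking how good those weak labels are.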

Related

Why word embedding technique works

I have looked into some word embedding techniques, such as:
CBOW: from context to a single word. The weight matrix produced is used as the embedding vectors.
Skip-gram: from word to context (from what I see, it's actually word to word, as a single prediction is enough). Again, the weight matrix produced is used as the embedding.
Introductions to these tools always cite "cosine similarity", saying that words of similar meaning are converted to similar vectors.
But these methods are all based on 'context' and account only for the words around a target word. I would say they are 'syntagmatic' rather than 'paradigmatic'. So why would closeness in distance within a sentence indicate closeness in meaning? I can think of many frequently occurring counterexamples:
"Have a good day". (good and day are vastly different, though close in distance).
"toilet" "washroom" (two words of similar meaning, but a sentence containing one is unlikely to contain the other)
Any possible explanation?
This sort of "why" isn't a great fit for StackOverflow, but some thoughts:
The essence of word2vec & similar embedding models may be compression: the model is forced to predict neighbors using far less internal state than would be required to remember the entire training set. So it has to force similar words together, in similar areas of the parameter space, and force groups of words into various useful relative-relationships.
So, in your second example of 'toilet' and 'washroom', even though they rarely appear together, they do tend to appear around the same neighboring words. (They're synonyms in many usages.) The model tries to predict them both, to similar levels, when typical words surround them. And vice-versa: when they appear, the model should generally predict the same sorts of words nearby.
To achieve that, their vectors must be nudged quite close by the iterative training. The only way to get 'toilet' and 'washroom' to predict the same neighbors, through the shallow feed-forward network, is to corral their word-vectors to nearby places. (And further, to the extent they have slightly different shades of meaning – with 'toilet' more the device & 'washroom' more the room – they'll still skew slightly apart from each other towards neighbors that are more 'objects' vs 'places'.)
Similarly, words that are formally antonyms, but easily stand in for each other in similar contexts, like 'hot' and 'cold', will be somewhat close to each other at the end of training. (And their various nearer-synonyms will be clustered around them, as they tend to be used to describe similar nearby paradigmatically-warmer or -colder words.)
On the other hand, your example "have a good day" probably doesn't have a giant influence on either 'good' or 'day'. Both words' more unique (and thus predictively-useful) senses are more associated with other words. The word 'good' alone can appear everywhere, so has weak relationships everywhere, but still a strong relationship to other synonyms/antonyms on an evaluative ("good or bad", "likable or unlikable", "preferred or disliked", etc) scale.
All those random/non-predictive instances tend to cancel-out as noise; the relationships that have some ability to predict nearby words, even slightly, eventually find some relative/nearby arrangement in the high-dimensional space, so as to help the model for some training examples.
Note that a word2vec model isn't necessarily an effective way to predict nearby words. It might never be good at that task. But the attempt to become good at neighboring-word prediction, with fewer free parameters than would allow a perfect-lookup against training data, forces the model to reflect underlying semantic or syntactic patterns in the data.
(Note also that some research shows that a larger window influences word-vectors to reflect more topical/domain similarity – "these words are used about the same things, in the broad discourse about X" – while a tiny window makes the word-vectors reflect a more syntactic/typical similarity - "these words are drop-in replacements for each other, fitting the same role in a sentence". See for example Levy/Goldberg "Dependency-Based Word Embeddings", around its Table 1.)
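To make the "same neighbors, not neighbors of each other" point concrete, here is a small sketch. It is not word2vec: it simply counts the words seen within a window around each word (a count-based stand-in for the same distributional idea) and compares those context vectors. The sentences are invented for the example.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy corpus
sentences = [
    "please clean the toilet on the second floor",
    "please clean the washroom on the second floor",
    "the toilet near the lobby is closed",
    "the washroom near the lobby is closed",
    "he forgot to flush the toilet",
    "have a good day",
    "have a good trip",
    "the day was long",
]

def context_vectors(sentences, window=2):
    """Map each word to a Counter of the words seen within `window` positions of it."""
    vectors = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.split()
        for i, word in enumerate(words):
            lo, hi = max(0, i - window), min(len(words), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[word][words[j]] += 1
    return vectors

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

vectors = context_vectors(sentences)

# 'toilet' and 'washroom' never co-occur, yet they share almost all of their neighbours
sim_synonyms = cosine(vectors["toilet"], vectors["washroom"])
# 'good' and 'day' sit next to each other, yet their neighbourhoods barely overlap
sim_neighbours = cosine(vectors["good"], vectors["day"])
print(sim_synonyms, sim_neighbours)
```

On this toy data the first similarity comes out far higher than the second, which is the pattern the trained vectors end up reflecting as well.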
‘Embedding’ means a semantic vector representation, e.g. how to represent words such that synonyms are nearer to each other than to antonyms or other unrelated words.
Embedding algorithms like Word2vec map entities, be they e-commerce items or words (say, in the English language), to N-dimensional vectors. Now that you have a mathematical representation of the entities in a Euclidean space, you can use the associated semantics, such as the distance between vectors. For example: for a given item, say ‘Levis Jeans’, recommend the most related items which are often co-purchased with it.
This can be done easily: search for the nearest vectors to the vector of ‘Levis Jeans’ and recommend them. You will find that the nearest vectors correspond to items such as T-shirts, which are relevant to the Levis Jeans. Similarly, it preserves distance/similarity between words, e.g. King - Queen = Man - Woman.
Yes, Word2vec captures such co-occurrence relationships when mapping the items/words to vectors, also called ‘item/word embeddings’.
This is not specifically targeted to sentence embeddings but nevertheless here you get some crucial insights extremely relevant to the core logic behind embedding generation. Read till the end.

Determine text similarity through cluster analysis

I am a senior bachelor student in CS, currently working on my thesis. For this thesis I wrote a program that uses a density-based clustering approach, more specifically the OPTICS algorithm. I have an idea of how to use it, but I don't know if it is valid.
I want to use this algorithm for text classification. Texts are points in the set that have to be clustered, so that the resulting hierarchy consists of categories and subcategories of texts. For example, one such set is "Scientific literature", consisting of subsets "Mathematics", "Biology" etc.
I came up with the idea that I can analyze texts for specific words that occur in a particular text more often than in the whole dataset, excluding insignificant words like prepositions. Perhaps I can use open source natural language parsers for that purpose, like the Stanford parser. After that, the program combines these "characteristic words" from each text into one set, and a certain number of the most frequent words can be taken from this set. That number becomes the dimensionality for the clustering, and each word's frequency in a particular text is used as a coordinate of a point. Thus we can cluster them.
The question is: is that idea valid or complete nonsense? Can clustering in general, and density-based clustering in particular, be used for such classification? Maybe there is some literature that can point me in the right direction?
Clustering != classification.
Run the clustering algorithm, and study the results. Most likely, there will not be a cluster "scientific literature" with subjects "mathematics" - what do you do then?
Also, clusters will only give you sets, that is too coarse for similarity search - on the contrary, you need first to solve the similarity problem, before you can run clustering algorithms such as OPTICS.
The "idea" you described is pretty much what everybody has been trying for years already.
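For what it's worth, the "characteristic words as coordinates" idea is essentially TF-IDF weighting, and the similarity problem mentioned above comes down to choosing a distance on those vectors. A minimal sketch in plain Python follows; the documents, the tiny stopword list and the smoothed idf variant are assumptions made for illustration. The resulting pairwise distance matrix is the kind of input a density-based algorithm such as OPTICS works from.

```python
import math
from collections import Counter

STOPWORDS = {"the", "of", "in", "a", "and", "for", "on"}  # tiny illustrative list

# Hypothetical toy documents
docs = [
    "the proof of the theorem on prime numbers",
    "a theorem on prime numbers and a short proof",
    "the evolution of cells in marine species",
]

tokenized = [[w for w in d.split() if w not in STOPWORDS] for d in docs]
n_docs = len(docs)
# Document frequency: in how many documents each word appears
doc_freq = Counter(w for tokens in tokenized for w in set(tokens))

def tfidf(tokens):
    """Term count times a smoothed idf, so words common to the whole dataset weigh less."""
    counts = Counter(tokens)
    return {w: c * (math.log((1 + n_docs) / (1 + doc_freq[w])) + 1) for w, c in counts.items()}

def cosine_distance(a, b):
    """1 - cosine similarity between two sparse weight vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return 1.0 - (dot / norm if norm else 0.0)

vectors = [tfidf(tokens) for tokens in tokenized]
distances = [[cosine_distance(a, b) for b in vectors] for a in vectors]

for row in distances:
    print(["%.2f" % d for d in row])
```

Whether the clusters found on such distances line up with categories like "Mathematics" or "Biology" is a separate question, and it has to be checked by looking at the clusters.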

Systematic threshold for cosine similarity with TF-IDF weights

I am running an analysis of several thousand (e.g., 10,000) text documents. I have computed TF-IDF weights and have a matrix with pairwise cosine similarities. I want to treat the documents as a graph to analyze various properties (e.g., the path length separating groups of documents) and to visualize the connections as a network.
The problem is that there are too many similarities. Most are too small to be meaningful. I see many people dealing with this problem by dropping all similarities below a particular threshold, e.g., similarities below 0.5.
However, 0.5 (or 0.6, or 0.7, etc.) is an arbitrary threshold, and I'm looking for techniques that are more objective or systematic to get rid of tiny similarities.
I'm open to many different strategies. For example, is there an alternative to tf-idf that would make most of the small similarities 0? Are there other methods to keep only the significant similarities?
In short, take the average cosine value of an initial clustering or even all of the initial sentences and accept or reject clusters based on something akin to the following.
One way to look at the problem is to develop a score based on distance from the mean similarity, taking the high end for good measure: 1.5 standard deviations above the mean (about the 93rd percentile if the data were normal) tends to mark an outlier, with 3 (about the 99.9th percentile) being an extreme outlier. I cannot remember where, but this idea has had traction in other forums and formed the basis for my similarity.
Keep in mind that the data is not likely to be normally distributed.
average(cosine_similarities)+alpha*standard_deviation(cosine_similarities)
In order to obtain alpha, you could use the Wu Palmer score or another score as described by NLTK. Strong similarities with Wu Palmer should lead to a larger range of acceptance, while lower Wu Palmer scores should lead to a stricter acceptance. Therefore, taking 1 - Wu Palmer score would be advisable. You can even use this method for LSA or LDA groups. To be even more strict and take things close to 1.5 or more standard deviations, you could even try 1 + Wu Palmer (the cream of the crop), re-find the ultimate K, find the new score, cluster, and repeat.
Beware though, this would mean finding the Wu Palmer score of all relevant words, which is quite a large computational problem. Also, 10,000 documents is peanuts compared to what most algorithms are run on. The smallest I have seen for tweets was 15,000, and the 20 newsgroups set was 20,000 documents. I am pretty sure Alchemy API uses something akin to the 20 newsgroups set. They definitely use senti-wordnet.
The basic equation is not really mine so feel free to dig around for it.
Another thing to keep in mind is that the calculation is time intensive. It may be a good idea to use a Student's t value for estimating the expected value/mean Wu Palmer score of SOV pairings, and especially so if you try to take the entire sentence. Commons Math3 for Java/Scala includes the distribution, as does scipy for Python, and R should already have something as well.
Xbar +/- tsub(alpha/2)*sample_std/sqrt(sample_size)
Note: There is another option with this weight. You could use an algorithm that adds or subtracts from this threshold until achieving the best result. This would likely not be related solely to the cosine importance but possibly to an inflection point or gap as with Tibshirani's gap statistic.
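A minimal sketch of the mean-plus-alpha-standard-deviations cut-off described above, using only the standard library. The similarity values and alpha=1.5 are invented for the example, and the population standard deviation is an assumption, since the formula does not say which one is meant.

```python
import statistics

def similarity_threshold(similarities, alpha=1.5):
    """average(similarities) + alpha * standard_deviation(similarities)."""
    return statistics.mean(similarities) + alpha * statistics.pstdev(similarities)

# Hypothetical pairwise cosine similarities (one value per document pair, self-pairs excluded)
similarities = [0.02, 0.05, 0.01, 0.03, 0.04, 0.02, 0.06, 0.03, 0.71, 0.65]

threshold = similarity_threshold(similarities, alpha=1.5)
# Keep only the edges whose similarity clears the threshold
kept = [s for s in similarities if s >= threshold]
print(round(threshold, 3), kept)
```

Because cosine similarities are usually heavily skewed towards 0, the percentile interpretation of alpha does not carry over; it is better treated as a tuning knob.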

Number of Latent Semantic Indexing topics

I'm using gensim's package to implement LSI on a corpus. My goal is to find out the most frequently occurring distinct topics that appear in the corpus.
If I don't know the number of topics that are in the corpus (I'd estimate anywhere from 5 to 20), what is the best approach in setting the number of topics that LSI should search for? Is it better to look for a large number of topics (20-30), or a small number of topics (~5)?
From Radim himself:
that's a good question, but unfortunately without a good answer.
It is not true that increasing the number of dimensions always
improves retrieval accuracy. In fact, if you use all the dimensions
(=full rank of the training matrix), LSI will give you exactly the
same documents that you entered in, so LSI would become pointless.
If you're interested in the math side of it, have a look at this
issue: https://github.com/piskvorky/gensim/issues/28 Otherwise, just
set the dimensions to a few hundred~thousand which is the accepted
standard. Or try several different choices, measure the accuracy and
select dimensionality that works the best on your problem.
Best, Radim
This is what I do sometimes when I'm confused. Since you've already narrowed the number of topics down to 5-20, you can iterate over some of these values and see which one fits best.
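# Declare N_TOPICS (try each candidate value, e.g. 5 to 20) and train `lda` with it first.
# Keyword names assume a recent gensim, where show_topics returns (topic_id, topic_string) pairs.
for topic_id, topic in lda.show_topics(num_topics=N_TOPICS, num_words=20, log=False, formatted=True):
    print("TOPIC {0}: {1}\n".format(topic_id, topic))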

Supervised Learning for User Behavior over Time

I want to use machine learning to identify the signature of a user who converts to a subscriber of a website given their behavior over time.
Let's say my website has 6 different features which can be used before subscribing and users can convert to a subscriber at any time.
For a given user I have stats which represent the intensity on a continuous range of that user's interaction with features 1-6 on a daily basis so:
D1: f1,f2,f3,f4,f5,f6
D2: f1,f2,f3,f4,f5,f6
D3: f1,f2,f3,f4,f5,f6
D4: f1,f2,f3,f4,f5,f6
Let's say on day 5, the user converts.
Which machine learning algorithms would help me identify the most common patterns in feature usage which lead to a conversion?
(I know this is a super basic classification question, but I couldn't find a good example using longitudinal data, where input vectors are ordered by time like I have)
To develop the problem further, let's assume that each feature has 3 intensities at which the user can interact (H, M, L).
We can then represent each user as a string of states of interaction intensity. So, for a user:
LLLLMM LLMMHH LLHHHH
Would mean on day one they only interacted significantly with features 5 and 6, but by the third day they were interacting highly with features 3 through 6.
N-gram Style
I could make these states words and the lifetime of a user a sentence. (Would probably need to add a "conversion" word to the vocabulary as well)
If I ran these "sentences" through an n-gram model, I could get the likely future state of a user given his/her past few states, which is somewhat interesting. But what I really want to know is the most common sets of n-grams that lead to the conversion word. Rather than feeding in an n-gram and getting the next predicted word, I want to give the predicted word and get back the 10 most common n-grams (from my data) which would be likely to lead to the word.
Amaç Herdağdelen suggests identifying n-grams up to a practical n and then counting how many n-gram states each user has, then correlating with conversion data (I guess no conversion word in this example). My concern is that there would be too many n-grams to make this method practical. (If each state has 729 possibilities, i.e. 3^6, and we're using trigrams, that's 729^3, or roughly 387 million, possible trigrams!)
Alternatively, could I just go thru the data logging the n-grams which led to the conversion word and then run some type of clustering on them to see what the common paths are to a conversion?
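That last idea is easy to prototype before bringing in any clustering. Here is a minimal sketch, assuming each user history is a list of daily state strings with a made-up CONVERT marker appended on the day of subscription; the histories themselves are invented for the example.

```python
from collections import Counter

CONVERT = "CONVERT"  # hypothetical marker for the day a user subscribes

# Hypothetical histories: one state string per day, one letter (L/M/H) per feature
histories = [
    ["LLLLMM", "LLMMHH", "LLHHHH", CONVERT],
    ["LLLLLL", "LLMMHH", "LLHHHH", CONVERT],
    ["LLLLMM", "LLLLMM", "LLLLLL"],  # never converted
    ["MMLLLL", "LLMMHH", "LLHHHH", CONVERT],
]

def ngram_before_conversion(history, n):
    """Return the n daily states immediately preceding conversion, or None if unavailable."""
    if CONVERT not in history:
        return None
    end = history.index(CONVERT)
    if end < n:
        return None
    return tuple(history[end - n:end])

def common_paths(histories, n, top=10):
    """Count the n-grams that directly led to conversion and return the most frequent ones."""
    counts = Counter()
    for history in histories:
        gram = ngram_before_conversion(history, n)
        if gram is not None:
            counts[gram] += 1
    return counts.most_common(top)

paths = common_paths(histories, n=2)
print(paths)
```

Only n-grams that actually occur in the data are counted, so the huge number of theoretically possible trigrams never has to be enumerated. To tell real signals from states that are simply common, the same counts would also be needed for users who did not convert.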
Survival Style
Suggested by Iterator. I understand the analogy to a survival problem, but the literature here seems to focus on predicting time to death as opposed to the common sequence of events which leads to death. Further, when looking up the Cox proportional hazards model, I found that it does not even accommodate variables which change over time (it's good for differentiating between static attributes like gender and ethnicity), so it seems very much geared toward a different question than mine.
Decision Tree Style
This seems promising though I can't completely wrap my mind around how to structure the data. Since the data is not flat, is the tree modeling the chance of moving from one state to another down the line and when it leads to conversion or not? This is very different than the decision tree data literature I've been able to find.
Also, I need clarity on how to identify patterns which lead to conversion, rather than a model that predicts the likelihood of conversion after a given sequence.
Theoretically, hidden Markov models may be a suitable solution to your problem. The features on your site would constitute the alphabet, and you can use the sequence of interactions as positive or negative instances depending on whether a user finally subscribed or not. I don't have a guess about what the number of hidden states should be, but finding a suitable value for that parameter is part of the problem, after all.
As a side note, positive instances are trivial to identify, but the fact that a user has not subscribed so far doesn't necessarily mean s/he won't. You might consider limiting your data to sufficiently old users.
I would also consider converting the data to fixed-length vectors and applying conceptually simpler models that could give you some intuition about what's going on. You could use n-grams (consecutive interaction sequences of length n).
As an example, assuming that the interaction sequence of a given user is "f1,f3,f5", then "f1,f3,f5" would constitute a 3-gram (trigram). Similarly, for the same user and the same interaction sequence you would have "f1,f3" and "f3,f5" as the 2-grams (bigrams). In order to represent each user as a vector, you would identify all n-grams up to a practical n, and count how many times the user employed a given n-gram. Each column in the vector would represent the number of times a given n-gram is observed for a given user.
Then, probably with the help of some suitable normalization technique such as pointwise mutual information or tf-idf, you could look at the correlation between the n-grams and the final outcome to get a sense of what's going on, carry out feature selection to find the most prominent sequences that users are involved in, or apply classification methods such as nearest neighbor, support vector machine or naive Bayes to build a predictive model.
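The n-gram counting and the correlation step can be sketched in a few lines of plain Python. The user sequences and outcomes are invented, normalization (PMI, tf-idf) is left out, and plain Pearson correlation is just one possible choice for the correlation with the final outcome.

```python
import math
from collections import Counter

def ngram_counts(sequence, max_n=3):
    """Count every n-gram (as a tuple) of length 1..max_n in an interaction sequence."""
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(sequence) - n + 1):
            counts[tuple(sequence[i:i + n])] += 1
    return counts

def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either variable is constant."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx and sy else 0.0

# Hypothetical users: (interaction sequence, 1 if the user subscribed else 0)
users = [
    (["f1", "f3", "f5"], 1),
    (["f1", "f3", "f5", "f3", "f5"], 1),
    (["f2", "f1", "f2"], 0),
    (["f1", "f2", "f2"], 0),
]

vectors = [ngram_counts(seq) for seq, _ in users]
outcomes = [converted for _, converted in users]
vocabulary = sorted(set().union(*vectors))  # one column per n-gram observed in the data

# Correlate each n-gram column with the outcome
correlations = {
    gram: pearson([v[gram] for v in vectors], outcomes)
    for gram in vocabulary
}

for gram, r in sorted(correlations.items(), key=lambda kv: kv[1], reverse=True)[:5]:
    print(",".join(gram), round(r, 2))
```

The columns of `vectors` over `vocabulary` are the fixed-length representation that could then be fed to any of the classifiers mentioned above.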
This is rather like a survival analysis problem: over time the user will convert, or may drop out of the population, or will continue to appear in the data and not (yet) fall into either camp. For that, you may find the Cox proportional hazards model useful.
If you wish to pursue things from a different angle, namely one more from the graphical models perspective, then a Kalman filter may be more appealing. It is a generalization of the HMMs suggested by @AmaçHerdağdelen that works for continuous spaces.
For ease of implementation, I'd recommend the survival approach. It is the easiest to analyze, describe, and improve. After you have a firm handle on the data, feel free to drop in other methods.
Other than Markov chains, I would suggest decision trees or Bayesian networks. Both of these would give you a likelihood of a user converting after a sequence.
I forgot to mention this earlier. You may also want to take a look at the Google PageRank algorithm. It would help you account for the user completely disappearing [not subscribing]. The results of that would help you to encourage certain features to be used. [Because they're more likely to give you a sale]
I think n-grams are the most promising approach, because all sequences in data mining are treated as elements dependent on a few basic steps (HMM, CRF, ACRF, Markov fields). So I would try a classifier based on 1-grams and 2-grams.

Resources