I seek the most effective and simple way to classify 800k+ scholarly articles as either relevant (1) or irrelevant (0) in relation to a defined conceptual space (here: learning as it relates to work).
Data is: title & abstract (mean=1300 characters)
Any approaches may be used or even combined, including supervised machine learning and/or by establishing features that give rise to some threshold values for inclusion, among other.
Approaches could draw on the key terms that describe the conceptual space, though simple frequency count alone is too unreliable. Potential avenues might involve latent semantic analysis, n-grams, ..
Generating training data may be realistic for up to 1% of the corpus, though this already means manually coding 8,000 articles (1=relevant, 0=irrelevant), would that be enough?
Specific ideas and some brief reasoning are much appreciated so I can make an informed decision on how to proceed. Many thanks!
Several Ideas:
Run LDA and get document-topic and topic-word distributions say (20 topics depending on your dataset coverage of different topics). Assign the top r% of the documents with highest relevant topic as relevant and low nr% as non-relevant. Then train a classifier over those labelled documents.
Just use bag of words and retrieve top r nearest negihbours to your query (your conceptual space) as relevant and borrom nr percent as not relevant and train a classifier over them.
If you had the citations you could run label propagation over the network graph by labelling very few papers.
Don't forget to make the title words different from your abstract words by changing the title words to title_word1 so that any classifier can put more weights on them.
Cluster the articles into say 100 clusters and then choose then manually label those clusters. Choose 100 based on the coverage of different topics in your corpus. You can also use hierarchical clustering for this.
If it is the case that the number of relevant documents is way less than non-relevant ones, then the best way to go is to find the nearest neighbours to your conceptual space (e.g. using information retrieval implemented in Lucene). Then you can manually go down in your ranked results until you feel the documents are not relevant anymore.
Most of these methods are Bootstrapping or Weakly Supervised approaches for text classification, about which you can more literature.
I'm currently trying to use SVD to summarise documents, and then find similarities between documents. My question is, what do the resulting matrices actually mean, and how can I use them, for example to get the summarisation of one document, or to get a summarisation of all documents?
I have already built a term-document matrix, in the form of a 2D list, with one list per term, where each list contains the number of times a term appears in each document.
My knowledge of linear algebra is limited, so please don't give a maths heavy explanation. Furthermore, I have already googled a lot, and looked at many online pages to try to understand, but at this point, I feel I just need a simple explanation of how to achieve what I want.
Let's say I have a user search query which looks like:
"the happy bunny"
I have already computed tf-idf and have something like this (following are made up example values) for each document in which I am searching (of coures the idf is always the same):
tf idf score
the 0.06 1 0.06 * 1 = 0.06
happy 0.002 20 0.002 * 20 = 0.04
bunny 0.0005 60 0.0005 * 60 = 0.03
I have two questions with what to do next.
Firstly, the still has the highest score, even though it is adjusted for rarity by idf, still it's not exactly important - do you think I should square the idf values to weight in terms of rare words, or would this give bad results? Otherwise I'm worried that the is getting equal importance to happy and bunny, and it should be obvious that bunny is the most important word in the search. As long as rare always equals important then it would be always a good idea to weight in terms of rarity, but if that is not always the case then doing so could really mess up the results.
Secondly and more importantly: what is the best/preferred method for combining the scores for each word together to give each document a single score that represents how well it reflects the entire search query? I was thinking of adding them, but it has become apparent that that is going to give higher priority to a document containing 10,000 happy but only 1 bunny instead of another document with 500 happy and 500 bunny (which would be a better match).
First, make sure that you are computing the correct TF-IDF values. As others have pointed they do not look right. TF is relative to specific documents, and we often do not need to compute them for queries (since raw term frequency is almost always 1 in queries). There are different types of TF functions to pick from (check the Wikipedia page on tf-idf, it has a good coverage). Log Normalisation is common and the most efficient scheme, since it saves an extra disk access to get the respective document's total frequency maxF that is needed for something like Double Normalisation. When you are dealing with large volumes of documents this can be expensive, especially if you can't bring these into memory. A bit of insight on inverted files can go a long way in understanding some of the underlying complexities. Log normalisation is efficient and is a non-linear function, therefore better than raw frequency.
Once you are certain on your weighting scheme, then you may want to consider a stop list to get rid of very common/noisy words. These do not contribute to the rank of documents. It is generally recommended to use a stop list of high frequency, very common words. Do a search and you will find many available, including the one that Lucene uses.
The remaining lies on your ranking strategy and that will depend on your implementation/model. The vector space model (VSM) is simple and readily available with libraries like Lucene, Lemur, etc. VSM computes the Dot product or scalar of the weights of common terms between the query and a document. Term weights are normalised via vector length normalisation (which solves your second question), and the result of applying the model is a value between 0 and 1. This is also justified/interpreted as the Cosine of the angle between two vectors in a planar graph, or the Euclidean distance divided by the Euclidean vector length of two vectors.
One of the earliest comprehensive studies on weighting schemes and ranking with VSM is an article by Salton (pdf) and is a good read if you are interested in Information Retrieval. A bit outdated perhaps (notice how log normalisation is not mentioned in the article).
Your best read I believe is the book Introduction to Information Retrieval by Christopher Manning. It will take you through everything that you need to know, from indexing to ranking schemes, etc. A bit lacking on ranking models (does not cover some of the more complex probabilistic approaches).
You should reconsider your TF and IDF values, they do not look correct. The TF value is usually just how often the word occurs, so if the word "the" appeared 20 times it's tf value would be 20. A word like "the" should have a very low IDF value (possibly around 4 decimal places, 0.000...).
You could use stop word removal if word like the are not necessary, they would be removed rather than just given a low score.
A vector space model could be used for this.
can you compute tf-idf for amalgamated terms? That is, you first generate a sentiment that considers each of its component as equal before treating the sentiment as a single term for which you now compute the tf-idf
I want to use machine learning to identify the signature of a user who converts to a subscriber of a website given their behavior over time.
Let's say my website has 6 different features which can be used before subscribing and users can convert to a subscriber at any time.
For a given user I have stats which represent the intensity on a continuous range of that user's interaction with features 1-6 on a daily basis so:
D1: f1,f2,f3,f4,f5,f6
D2: f1,f2,f3,f4,f5,f6
D3: f1,f2,f3,f4,f5,f6
D4: f1,f2,f3,f4,f5,f6
Let's say on day 5, the user converts.
What machine using algorithms would help me identify which are the most common patterns in feature usage which lead to a conversion?
(I know this is a super basic classification question, but I couldn't find a good example using longitudinal data, where input vectors are ordered by time like I have)
To develop the problem further, let's assume that each feature has 3 intensities at which the user can interact (H, M, L).
We can then represent each user as a string of states of interaction intensity. So, for a user:
LLLLMM LLMMHH LLHHHH
Would mean on day one they only interacted significantly with features 5 and 6, but by the third day they were interacting highly with features 3 through 6.
N-gram Style
I could make these states words and the lifetime of a user a sentence. (Would probably need to add a "conversion" word to the vocabulary as well)
If I ran these "sentences" through an n-gram model, I could get the likely future state of a user given his/her past few state which is somewhat interesting. But, what I really want to know the most common sets of n-grams that lead to the conversion word. Rather than feeding in an n-gram and getting the next predicted word, I want to give the predicted word and get back the 10 most common n-grams (from my data) which would be likely to lead to the word.
Amaç Herdağdelen suggests identifying n-grams to practical n and then counting how many n-gram states each user has. Then correlating with conversion data (I guess no conversion word in this example). My concern is that there would be too many n-grams to make this method practical. (if each state has 729 possibilities, and we're using trigrams, thats a lot of possible trigrams!)
Alternatively, could I just go thru the data logging the n-grams which led to the conversion word and then run some type of clustering on them to see what the common paths are to a conversion?
Survival Style
Suggested by Iterator, I understand the analogy to a survival problem, but the literature here seems to focus on predicting time to death as opposed to the common sequence of events which leads to death. Further, when looking up the Cox Proportional Hazard model, I found that it does not event accommodate variables which change over time (its good for differentiating between static attributes like gender and ethnicity)- so it seems very much geared toward a different question than mine.
Decision Tree Style
This seems promising though I can't completely wrap my mind around how to structure the data. Since the data is not flat, is the tree modeling the chance of moving from one state to another down the line and when it leads to conversion or not? This is very different than the decision tree data literature I've been able to find.
Also, need clarity on how to identify patterns which lead to conversion instead a models predicts likely hood of conversion after a given sequence.
Theoretically, hidden markov models may be a suitable solution to your problem. The features on your site would constitute the alphabet, and you can use the sequence of interactions as positive or negative instances depending on whether a user finally subscribed or not. I don't have a guess about what the number of hidden states should be, but finding a suitable value for that parameter is part of the problem, after all.
As a side note, positive instances are trivial to identify, but the fact that a user has not subscribed so far doesn't necessarily mean s/he won't. You might consider to limit your data to sufficiently old users.
I would also consider converting the data to fixed-length vectors and apply conceptually simpler models that could give you some intuition about what's going on. You could use n-grams (consecutive interaction sequences of length n).
As an example, assuming that the interaction sequence of a given user ise "f1,f3,f5", "f1,f3,f5" would constitute a 3-gram (trigram). Similarly, for the same user and the same interaction sequence you would have "f1,f3" and "f3,f5" as the 2-grams (bigrams). In order to represent each user as a vector, you would identify all n-grams up to a practical n, and count how many times the user employed a given n-gram. Each column in the vector would represent the number of times a given n-gram is observed for a given user.
Then -- probably with the help of some suitable normalization techniques such as pointwise mutual information or tf-idf -- you could look at the correlation between the n-grams and the final outcome to get a sense of what's going on, carry out feature selection to find the most prominent sequences that users are involved in, or apply classification methods such as nearest neighbor, support machine or naive Bayes to build a predictive model.
This is rather like a survival analysis problem: over time the user will convert or will may drop out of the population, or will continue to appear in the data and not (yet) fall into neither camp. For that, you may find the Cox proportional hazards model useful.
If you wish to pursue things from a different angle, namely one more from the graphical models perspective, then a Kalman Filter may be more appealing. It is a generalization of HMMs, suggested by #AmaçHerdağdelen, which work for continuous spaces.
For ease of implementation, I'd recommend the survival approach. It is the easiest to analyze, describe, and improve. After you have a firm handle on the data, feel free to drop in other methods.
Other than Markov chains, I would suggest decision trees or Bayesian networks. Both of these would give you a likely hood of a user converting after a sequence.
I forgot to mention this earlier. You may also want to take a look at the Google PageRank algorithm. It would help you account for the user completely disappearing [not subscribing]. The results of that would help you to encourage certain features to be used. [Because they're more likely to give you a sale]
I think Ngramm is most promising approach, because all sequnce in data mining are treated as elements depndent on few basic steps(HMM, CRF, ACRF, Markov Fields) So I will try to use classifier based on 1-grams and 2 -grams.
I am trying to understand what is the relevance score that opencalais returns associated with each entity? What does it signify and how is it to be interpreted? I would be thankful for insights into this.
Their documentation states: The relevance capability detects the importance of each unique entity and assigns a relevance score in the range 0-1 (1 being the most relevant and important).
While they do not explain what 'relevance' means exactly, one would expect it to quantify the centrality of the entity to the discourse of the document. It's likely influenced by factors such as the entities mention frequency in this document as compared to its expected frequency in a random document (cf. TF-IDF), but could also involve more sophisticated discourse analysis.