Multi-Target Regression/Interpolations

Looking for some advice on the problem I have in front of me.
I have a data set of movies watched by users. For some users we know both that they watched a movie and what their rating for it is. For many others we know they watched the movie, but we don't know their rating.
I am looking for a way to assign a predicted, or perhaps interpolated, rating to the movies those users watched, based on the larger dataset that does have ratings, and I am trying to work out the best course of action. I have 1.5M users and 20K movies; however, only 10% of those movies are rated, by about 85% of users.
My first approach is thus cosine similarity: interpolate the rating from the nearest neighbor, and if the nearest neighbor doesn't have a value for a specific movie, go to the next nearest until all the movies have a rating. The other approach is NNMF with two features per movie -- one a binary representation of whether the movie was watched, the other the rating -- so when I look to "predict" for a user, I input their binary movie values and it returns their ratings.
My questions are: Does the NNMF approach make sense? I have never used NNMF in that way. Also, are there any other models you think make sense? I am wondering if there is more of a prediction algorithm that can be employed rather than an interpolation.
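For concreteness, here is a minimal sketch (NumPy/scikit-learn, on a toy matrix; the real 1.5M x 20K matrix would need a sparse representation) of the similarity-based interpolation described above. It predicts each missing rating as a similarity-weighted average over all neighbours who rated the movie, rather than falling back one nearest neighbour at a time.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Toy ratings matrix: rows = users, columns = movies, NaN = watched but unrated
R = np.array([
    [5.0, 3.0, np.nan, 1.0],
    [4.0, np.nan, np.nan, 1.0],
    [1.0, 1.0, np.nan, 5.0],
    [np.nan, 1.0, 5.0, 4.0],
])

rated = ~np.isnan(R)                  # mask of known ratings
R_filled = np.where(rated, R, 0.0)    # zero-fill only for the similarity step

# User-user cosine similarity on the zero-filled ratings
sim = cosine_similarity(R_filled)
np.fill_diagonal(sim, 0.0)            # a user should not be their own neighbour

# Predict each missing rating as a similarity-weighted average over the
# users who actually rated that movie
weights = sim @ rated.astype(float)   # total similarity mass per (user, movie)
scores = sim @ R_filled               # similarity-weighted rating sums
pred = np.divide(scores, weights, out=np.zeros_like(scores), where=weights > 0)

R_completed = np.where(rated, R, pred)
print(np.round(R_completed, 2))
```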

Related

Handling optional data in Logistic regression

I am working with data which contains marks and other features of students, and I am trying to predict whether they will get a high salary or not using scikit-learn in Python. I ran into a problem:
since a student does not take every subject, his/her score for a subject is -1 if he has not taken it (a student can take multiple subjects).
A snapshot taken from the data file was included in the original post.
I am trying to find a way to interpret the -1 in a way that doesn't alter the data much.
My Approach:
1. Take the percentile marks in each subject and then average all the percentiles for each student, giving a single number per student. This is a lot easier to work with, but it may lose some information about the distribution of marks.
2. Fill each -1 value with the average mark of all students in that subject, but this will not work if the data is biased towards one subject.
Is there any better way to deal with this kind of data?
Your "-1"'s amount to missing data, so you are asking how to approach a classification task with missing data. See here and here and here, among many others, for discussions on this topic.
A couple important considerations that come to mind:
One option is to "impute" the missing values, which is what you're describing with using "average marks." This approach often requires the assumption that the data is "missing at random" which in your case is unlikely to be true: for example, a bad student is more likely to not take a difficult subject, so missing values tell you something.
Regression models (like logistic regression) are in general going to require some type of imputation. But there are other models, like decision trees or random forests, that can handle missing data without imputation.
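To make the two options concrete, here is a minimal scikit-learn sketch with made-up columns and labels; it assumes the -1 markers are first converted to NaN, and it contrasts imputation (with indicator columns preserving the "did not take the subject" signal) against a tree-based model that accepts NaN directly.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical marks: -1 means the student did not take that subject
df = pd.DataFrame({
    "math":    [80, -1, 65, 90, -1, 70],
    "physics": [75, 60, -1, 85, 55, -1],
    "english": [70, 72, 68, -1, 60, 66],
})
y = np.array([1, 0, 0, 1, 0, 1])        # toy high-salary labels

X = df.replace(-1, np.nan)              # treat -1 as missing, not as a real score

# Option 1: impute, keeping "was it missing?" indicator columns so the model
# still sees that a subject was skipped (the data is not missing at random)
logreg = make_pipeline(
    SimpleImputer(strategy="mean", add_indicator=True),
    LogisticRegression(max_iter=1000),
)
logreg.fit(X, y)

# Option 2: a tree-based model that handles NaN without any imputation
hgb = HistGradientBoostingClassifier().fit(X, y)
```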

Simple Binary Text Classification

I seek the most effective and simple way to classify 800k+ scholarly articles as either relevant (1) or irrelevant (0) in relation to a defined conceptual space (here: learning as it relates to work).
Data is: title & abstract (mean=1300 characters)
Any approaches may be used or even combined, including supervised machine learning and/or establishing features that give rise to threshold values for inclusion, among others.
Approaches could draw on the key terms that describe the conceptual space, though a simple frequency count alone is too unreliable. Potential avenues might involve latent semantic analysis, n-grams, etc.
Generating training data may be realistic for up to 1% of the corpus, though this already means manually coding 8,000 articles (1 = relevant, 0 = irrelevant). Would that be enough?
Specific ideas and some brief reasoning are much appreciated so I can make an informed decision on how to proceed. Many thanks!
Several Ideas:
Run LDA and get document-topic and topic-word distributions (say 20 topics, depending on how well your dataset covers different topics). Assign the top r% of documents with the highest weight on the relevant topic(s) as relevant and the bottom nr% as non-relevant. Then train a classifier over those labelled documents.
Just use bag of words and retrieve the top r% nearest neighbours to your query (your conceptual space) as relevant and the bottom nr% as not relevant, and train a classifier over them (see the sketch after this list).
If you had the citations, you could run label propagation over the citation graph after labelling very few papers.
Don't forget to make the title words distinct from your abstract words, e.g. by changing each title word to title_word1, so that a classifier can put more weight on them.
Cluster the articles into, say, 100 clusters and then manually label those clusters. Choose the number of clusters based on the coverage of different topics in your corpus. You can also use hierarchical clustering for this.
If the number of relevant documents is far smaller than the number of non-relevant ones, the best way to go is to find the nearest neighbours to your conceptual space (e.g. using the information retrieval implemented in Lucene). Then you can manually go down your ranked results until you feel the documents are no longer relevant.
Most of these methods are bootstrapping or weakly supervised approaches to text classification, on which you can find more literature.
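As a rough sketch of the second idea (weak labels from the nearest neighbours to the conceptual-space query, then a classifier trained on them), here is a scikit-learn version; the toy documents, the query string, and the r/nr cut-offs are placeholders to be tuned on the real corpus.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics.pairwise import cosine_similarity

# Placeholder titles+abstracts; the real corpus would be the 800k articles
docs = [
    "Workplace learning and professional development in nursing teams",
    "On-the-job training and skill acquisition among factory workers",
    "Informal learning at work: a review of the literature",
    "Quantum error correction codes for superconducting qubits",
    "Sediment transport in braided river systems",
    "Deep learning architectures for image segmentation",
]
query = "learning at work workplace training professional development"

vec = TfidfVectorizer(stop_words="english", ngram_range=(1, 2))
X = vec.fit_transform(docs)
sims = cosine_similarity(X, vec.transform([query])).ravel()
order = np.argsort(-sims)                     # most similar first

# Weak labels: top r% -> relevant (1), bottom nr% -> irrelevant (0)
r, nr = 0.30, 0.30
n_pos = max(1, int(r * len(docs)))
n_neg = max(1, int(nr * len(docs)))
pos, neg = order[:n_pos], order[-n_neg:]

X_train = X[np.concatenate([pos, neg])]
y_train = np.concatenate([np.ones(n_pos), np.zeros(n_neg)])

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
relevance = clf.predict_proba(X)[:, 1]        # scores for the whole corpus
print(np.round(relevance, 2))
```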

PredictionIO for Content Recommendation e.g. Tweets

I recently installed PredictionIO.
What I'd like to achieve is to categorize content based on the words included in the text. But how can I import data like raw Tweets into PredictionIO? Is it possible to let PredictionIO run over the content, find strong words, and sort them into categories?
What I'd like to get is something like this: Query for Boston Red Sox --> keywords that should appear would be: baseball, Boston, sports, ...
So I'll add on a little to what Thomas said. He's right: it all depends on whether or not you have labels associated with your tweets. If your data is labeled, then this is a text classification problem. Look at this for more detailed info:
If you're instead looking to cluster, or group, a set of unlabeled observations, then, as Thomas said, your best bet is to incorporate LDA into the works. Look at the latter documentation for more information. Basically, once you run the LDA model you'll obtain an object of type DistributedLDAModel, which has a method topicDistributions that gives you, for each tweet, a vector in which each component is associated with a topic, and the entry of that component is the probability that the tweet belongs to that topic. You can cluster by assigning each tweet the topic with the highest probability.
You also have access to a matrix of size M x N, where M is the number of words in your vocabulary and N is the number of topics, or clusters, you wish to discover in your data. You can roughly interpret the (i, j)-th entry of this topics matrix as the probability that word i appears in a document given that the document belongs to topic j. Another rule you could use for clustering is to treat each word vector associated with your tweets as a vector of counts. Then you can interpret the (i, j) entry of the product of your word matrix (tweets as rows, words as columns) and the topics matrix returned by LDA as the probability that tweet i belongs to topic j (this follows under certain assumptions; feel free to ask if you want more details). Again, you now assign tweet i to the topic associated with the largest numerical value in row i of the resulting matrix. You can even use this clustering rule for assigning topics to incoming observations once you have used your original set of tweets for topic discovery!
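Purely to illustrate that assignment rule in a self-contained way (the answer refers to Spark's DistributedLDAModel; this sketch uses scikit-learn's LatentDirichletAllocation and made-up tweets, but the argmax-over-topics step is the same):

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

tweets = [
    "I love Boston Red Sox #GoRedSox",
    "Baseball time at Fenway Park. Red Sox FTW!",
    "Woohoo! I love sports #winning",
    "New phone released today, the camera is amazing",
]

vec = CountVectorizer(stop_words="english")
X = vec.fit_transform(tweets)                 # tweets as word-count vectors

lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topic = lda.fit_transform(X)              # per-tweet topic distributions

# Cluster by assigning each tweet its most probable topic
print(doc_topic.argmax(axis=1))

# The same rule works for new, incoming tweets
new_doc = vec.transform(["Red Sox win again"])
print(lda.transform(new_doc).argmax(axis=1))
```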
Now, for data processing, you can still use the Text Classification reference for transforming your Tweets to word count vectors via the DataSource and Preparator components. As for importing your data, if you have the tweets saved locally on a file, you can use PredictionIO's Python SDK to import your data. An example is also given in the classification reference.
Feel free to ask any questions if anything isn't clear, and good luck!
So, it really depends on whether you have labelled data.
For example:
Baseball :: "I love Boston Red Sox #GoRedSox"
Sports :: "Woohoo! I love sports #winning"
Boston :: "Baseball time at Fenway Park. Red Sox FTW!"
...
Then you would be able to train a model to classify Tweets against these keywords. You might be interested in the templates for MLlib Naive Bayes and Decision Trees.
If you don't have labelled data (really, who wants to manually label Tweets) you might be able to use approaches such as Topic Modeling (e.g., LDA).
I don't think there is a template for LDA, but since PredictionIO is an active open-source project it wouldn't surprise me if someone has already implemented one, so it might be a good idea to ask on the PredictionIO user or dev forums.

Model implicit and explicit behavioral data for recommendation engine

I have the following user behavior data:
1. like
2. dislike
3. rating
4. product viewed
5. product purchased
Spark MLlib supports implicit behavioral data with a binary confidence score of 0 or 1; see http://spark.apache.org/docs/latest/mllib-collaborative-filtering.html.
For example, if user 1 viewed product A, then the model input would look like
1,A,1 (userId, productId, binary confidence score)
But looking at the nature of the behavior, a product liked carries stronger confidence than a product viewed, and a product bought carries stronger confidence than a product viewed.
How can one model the data based on the type of behavior?
Actually implicit data does not have to be 0 or 1. It means that the values are treated like a confidence or strength of association, rather than a rating. You can simply model actions that show a higher association between user and item as having a higher confidence. A like is stronger than a view, and a purchase is stronger than a like.
In fact, negative confidence can be fit into this framework (and I know MLlib implements that). A dislike can mean a negative score.
What the values are is up to you to tune, really. I think it's reasonable to pick values that correspond to relative frequency if you have no better idea. For example, if there are generally 50x more page views than likes, maybe a like's value is 50x that of a page view.
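A sketch of what that might look like with MLlib's implicit ALS in PySpark; the event weights are illustrative, not tuned, and the product ids are assumed to have been mapped to integers:

```python
from pyspark import SparkContext
from pyspark.mllib.recommendation import ALS, Rating

sc = SparkContext(appName="implicit-behavior-weights")

# Illustrative confidence weights per behavior type (tune these, e.g. using
# the relative-frequency heuristic mentioned above)
EVENT_WEIGHTS = {"view": 1.0, "like": 3.0, "purchase": 10.0, "dislike": -3.0}

# (userId, productId, behavior) events, with products mapped to integer ids
events = sc.parallelize([
    (1, 100, "view"), (1, 100, "like"),
    (2, 100, "purchase"), (2, 101, "dislike"),
    (3, 101, "view"),
])

# Sum the weights per (user, product) so repeated actions increase confidence
ratings = (events
           .map(lambda e: ((e[0], e[1]), EVENT_WEIGHTS[e[2]]))
           .reduceByKey(lambda a, b: a + b)
           .map(lambda kv: Rating(kv[0][0], kv[0][1], kv[1])))

model = ALS.trainImplicit(ratings, rank=10, iterations=10, alpha=0.01)
```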

How to classify text when predefined categories are not available

I have a problem and am not sure which algorithm to apply.
I am thinking of applying clustering in case two, but I have no idea about case one:
I have 0.5 million credit card activity documents. Each document is well defined and contains one transaction per line: the date, the amount, the retailer name, and a short 5-20 word description of the retailer.
Sample:
2004-11-47,$500,Amazon,An online retailer providing goods and services including books, hardware, music, etc.
Questions:
1. How would you classify each entry given no predefined categories?
2. How would you do this if you were given predefined categories such as "restaurant", "entertainment", etc.?
1) How would you classify each entry given no predefined categories?
You wouldn't. Instead, you'd use some dimensionality reduction algorithm on the data's features to project them into 2-d, make a guess at the number of "natural" clusters, then run a clustering algorithm.
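A minimal sketch of that pipeline on a few made-up transaction descriptions (TF-IDF features, truncated SVD down to 2-d, then k-means with a guessed number of clusters):

```python
from sklearn.cluster import KMeans
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical retailer descriptions pulled from the transaction lines
descriptions = [
    "An online retailer providing goods and services including books, hardware, music",
    "Fast food restaurant serving burgers and fries",
    "Streaming service for movies and television",
    "Coffee shop and bakery serving breakfast",
]

X = TfidfVectorizer(stop_words="english").fit_transform(descriptions)

# Reduce to 2-d to eyeball how many "natural" clusters there might be
X_2d = TruncatedSVD(n_components=2, random_state=0).fit_transform(X)

# Then cluster with the guessed k
labels = KMeans(n_clusters=2, random_state=0, n_init=10).fit_predict(X_2d)
print(labels)
```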
2) How would you do this if you were given predefined categories such as "restaurant", "entertainment", etc.?
You'd manually label a bunch of them, then train a classifier on that and see how well it works with the usual machinery of accuracy/F1, cross validation, etc. Or you'd check whether a clustering algorithm picks up these categories well, but then you still need some labeled data.
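And for the labelled case, a correspondingly small sketch of training and cross-validating a classifier (the categories and example rows are made up):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Hypothetical manually labelled transaction descriptions
texts = [
    "Fast food restaurant serving burgers and fries",
    "Italian restaurant with pasta and pizza",
    "Coffee shop and bakery serving breakfast",
    "Movie theatre showing new releases",
    "Streaming service for films and television",
    "Concert venue hosting live music events",
]
labels = ["restaurant", "restaurant", "restaurant",
          "entertainment", "entertainment", "entertainment"]

clf = make_pipeline(TfidfVectorizer(stop_words="english"),
                    LogisticRegression(max_iter=1000))
scores = cross_val_score(clf, texts, labels, cv=3, scoring="f1_macro")
print(scores.mean())
```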
