Using Learning To Rank on textual documents? - python-3.x

I need some help implementing Learning To Rank (LTR). It is for my semester project and I'm totally new to this. The details are as follows:
I gathered around 90 documents and wrote 10 user queries. Now I have to rank these documents against each query using three algorithms: LambdaMART, AdaRank, and Coordinate Ascent. Previously I applied clustering techniques on a Vector Space Model, but that was easy. In this case, however, I don't know how to prepare the data for these algorithms, since the textual data (documents and queries) sits in separate .txt files. I have searched for solutions online and can't find a proper one, so could anyone here please point me in the right direction, i.e. the steps to follow? I would really appreciate it.

As you said, you have already applied clustering in a vector space model; the input to these ranking algorithms is vectors as well.
Why not have a look at the standard datasets introduced for the learning-to-rank problem (the LETOR benchmarks), in which documents are represented as vectors of features?
There is also a Java implementation of these algorithms (RankLib), which may give you an idea of how to solve the problem. I hope this helps!
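To make that concrete, here is a minimal sketch of turning raw text files into LETOR/RankLib-style feature vectors. The docs/ and queries/ folder layout, the zero labels, and the two features are illustrative assumptions; real training data needs actual relevance judgments per query-document pair.

```python
# Sketch: build LETOR-format rows ("<label> qid:<q> 1:<f1> 2:<f2> # docid")
# from plain-text documents and queries. Paths and labels are placeholders.
from pathlib import Path

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [p.read_text() for p in sorted(Path("docs").glob("*.txt"))]        # assumed layout
queries = [p.read_text() for p in sorted(Path("queries").glob("*.txt"))]  # assumed layout

vectorizer = TfidfVectorizer(stop_words="english")
doc_vecs = vectorizer.fit_transform(docs)
query_vecs = vectorizer.transform(queries)

with open("train.letor", "w") as out:
    for qid, qvec in enumerate(query_vecs, start=1):
        scores = cosine_similarity(qvec, doc_vecs).ravel()
        for docid, (doc, score) in enumerate(zip(docs, scores)):
            label = 0  # placeholder: supply your real relevance judgments here
            # Feature 1: TF-IDF cosine similarity; feature 2: document length.
            out.write(f"{label} qid:{qid} 1:{score:.6f} 2:{len(doc.split())} # doc{docid}\n")
```

If I remember RankLib's ranker IDs correctly, you can then train with something like `java -jar RankLib.jar -train train.letor -ranker 6` (6 = LambdaMART, 3 = AdaRank, 4 = Coordinate Ascent), but check the RankLib documentation for the exact flags.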

Related

Can we apply multi-criteria decision-making algorithms to incomplete data?

I am currently working on a project where a multi-criteria decision-making (MCDM) algorithm is needed in order to evaluate several alternatives for a given goal. After long research, I decided to use the AHP method for my case study. The problem is that the alternatives taken into account for the given goal contain incomplete data.
For example, I am interested in buying a house and I have three alternatives to consider. One criterion for comparing them is the size of the house. Let’s assume that I know the sizes of some of the rooms of these houses, but I do not have information about the actual sizes of the entire houses.
My questions are:
Can we apply AHP (or any MCDM method) when we are dealing with incomplete data?
What are the consequences?
And how can we minimize the impact of missing data in MCDM?
I would really appreciate some advice or help! Thanks!
If you are still looking for answers, let me try to address your questions.
Before the detailed explanation, a caveat: I can't answer with a technical, programming-language approach.
First, we can use uncertain data for MCDM and AHP with statistical methods.
To reduce the loss from missing data, you can use information-theoretic concepts such as entropy to weight the criteria.
The reliability of the result will depend on the accuracy of the probabilistic approach.
For the example you gave, you could estimate the overall size of a house from other houses that share the same known criteria. Accuracy will depend on the number of criteria and the reliability of the inference.
To get a rigorous answer to your problem, you may need optimization, linear algebra, calculus, and statistics above an intermediate level.
I'm a student in a management major, and I will help you as much as I can. I hope you get what you want.
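For reference, the core AHP computation itself is short. Below is a minimal sketch, with made-up comparison values, that derives criterion weights from a reciprocal pairwise-comparison matrix via the principal eigenvector and checks Saaty's consistency ratio; handling an incomplete matrix (e.g. filling missing entries by transitivity, a_ik ≈ a_ij · a_jk) would happen before this step.

```python
# Minimal AHP sketch: weights from the principal eigenvector of a reciprocal
# pairwise-comparison matrix. The comparison values are illustrative.
import numpy as np

# A[i][j] = how much more important criterion i is than criterion j (Saaty's 1-9 scale).
A = np.array([
    [1.0, 3.0, 5.0],
    [1/3, 1.0, 2.0],
    [1/5, 1/2, 1.0],
])

eigvals, eigvecs = np.linalg.eig(A)
k = np.argmax(eigvals.real)              # index of the principal eigenvalue
weights = np.abs(eigvecs[:, k].real)
weights /= weights.sum()                 # normalized priority vector

# Consistency: CI = (lambda_max - n) / (n - 1); Saaty's random index RI = 0.58 for n = 3.
n = A.shape[0]
ci = (eigvals[k].real - n) / (n - 1)
cr = ci / 0.58                           # CR < 0.1 is conventionally acceptable
print("weights:", weights.round(3), " consistency ratio:", round(cr, 3))
```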

Has anyone used Stanford CoreNLP for sentiment analysis in Spark? It does not work as desired for me

Has anyone used Stanford CoreNLP for sentiment analysis in Spark?
It is not working as desired, or maybe I need to do some work that I am not aware of.
The following is an example.
1) I look forward to interacting with kids of states governed by the congress. - POSITIVE
2) I look forward to interacting with CM of states governed by the congress. - NEGATIVE (CM is chief minister)
Please note the change of a single word here: kids -> CM.
Statement 2 is not negative, but CoreNLP tagged it as negative.
Is there anything I need to do to make it work as desired? Is any alteration required? Please let me know if I need to plug in any custom code.
If anyone has knowledge of this, please suggest something.
Also, please suggest any better alternative to CoreNLP.
Thanks.
Gaurav
The sentiment model relies on embeddings for all of the words in the statement. If you change one word, you can radically alter the score it gets. In fact, there is a whole area of machine-learning research (adversarial examples) that analyzes situations like this, where you can make changes that don't affect a human's judgment or perception but radically alter the learned model's choices. For instance, in vision problems there are ways to alter an image that are imperceptible to humans but fool the trained model.
It is important to remember that while a neural network can do impressive things, it is not a perfect replication of the human mind. Numerous examples like this have been presented to me over the last year, so there clearly are some deficiencies.
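You can reproduce the word-swap sensitivity easily. The sketch below uses Stanza, Stanford's Python NLP library, as a convenient stand-in for raw CoreNLP; it assumes `pip install stanza` and a downloaded English model, and the 0/1/2 mapping reflects Stanza's sentence-level sentiment classes.

```python
# Probe how a one-word change shifts the predicted sentiment class.
import stanza

# stanza.download("en")  # one-time model download
nlp = stanza.Pipeline(lang="en", processors="tokenize,sentiment")

LABELS = {0: "NEGATIVE", 1: "NEUTRAL", 2: "POSITIVE"}
pair = [
    "I look forward to interacting with kids of states governed by the congress.",
    "I look forward to interacting with CM of states governed by the congress.",
]
for text in pair:
    doc = nlp(text)
    # Each parsed sentence carries an integer sentiment class (0/1/2).
    print(LABELS[doc.sentences[0].sentiment], "<-", text)
```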

Content-based recommendation at scale

This question has probably been asked many times on blogs and Q&A websites, but I couldn't find a concrete answer yet.
I am trying to build a recommendation system for customers using only their purchase history.
Let's say my application has n products.
Step 1: compute item similarities for all n products based on their attributes (like country, type, price).
Step 2: when a user needs a recommendation, loop over the previously purchased products p of user u and fetch the similar products (similarity was computed in the previous step).
If I am right, we call this content-based recommendation, as opposed to collaborative filtering, since it doesn't involve co-occurrence of items or user preferences for an item.
My problem is multi-fold:
Is there any existing scalable ML platform that addresses content-based recommendation? (I am fine with adopting different technologies/languages.)
Is there a way to tweak Mahout to get this result?
Is classification a way to handle content based recommendation?
Is it something that a graph database is good at solving?
Note: I looked at Mahout for doing this at scale, since I am familiar with Java, Mahout apparently utilizes Hadoop for distributed processing, and it has the advantage of well-tested ML algorithms.
Your help is appreciated. Any examples would be really great. Thanks.
The so-called item-item recommenders are natural candidates for precomputing the similarities, because the attributes of the items rarely change. I would suggest you precompute the similarity between each pair of items and perhaps store only the top K per item; if you have enough resources, you could load the similarity matrix into main memory for real-time recommendation.
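A minimal sketch of that precompute step is below; the attribute encoding, the toy catalog, and the choice of K are all illustrative, and with mixed features you would normally scale numeric columns such as price so they don't dominate the cosine.

```python
# Precompute item-item cosine similarity over attribute vectors and keep
# the top-K neighbours per item. Catalog and K are made-up examples.
import numpy as np
from sklearn.feature_extraction import DictVectorizer
from sklearn.metrics.pairwise import cosine_similarity

items = [
    {"country": "US", "type": "book", "price": 12.0},
    {"country": "US", "type": "book", "price": 15.0},
    {"country": "DE", "type": "dvd",  "price": 9.0},
]
X = DictVectorizer(sparse=False).fit_transform(items)  # one-hot strings + raw numerics

sim = cosine_similarity(X)
np.fill_diagonal(sim, -1.0)                # an item should not recommend itself

K = 2
top_k = np.argsort(-sim, axis=1)[:, :K]    # indices of the K most similar items
print(top_k)                               # row i -> the K items most similar to item i

# At serving time: for each product a user bought, look up top_k[product]
# (from memory or a key-value store) and merge the candidate lists.
```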
Check out my answer to this question for a way to do this in Mahout: Does Mahout provide a way to determine similarity between content (for content-based recommendations)?
The example shows how to compute the textual similarity between the items and then load the precomputed values into main memory.
For a performance comparison of different data structures to hold the values, check out this question: Mahout precomputed Item-item similarity - slow recommendation

How to gather useful data for autocompletion?

I'm trying to implement a typical autocompletion box, like the one you know from amazon.com.
There, you type in a letter and get reasonable suggestions for what you might want to enter into the search box.
The box itself will be implemented with jQuery; the persistence layer and suggest algorithm will be based on Apache Lucene/Solr and its Suggest feature.
Additionally, I get weighted suggestions in the result, using Lucene's WFST suggester.
My problem is: what does e.g. Amazon do to achieve this kind of reasonable data?
I mean, where do they get all these keywords and scores so that the suggestions make sense?
Is it purely hand-made information on each product? That, I think, would be really tough.
Or is it possible to gather the data using techniques like clustering or classification from machine-learning theory? (Then I could use Mahout or Carrot2.)
Looking at Amazon's suggestions, I think the data contains:
name of the product
producer/manufacturer/author of the product/book
product-features (like color, type, size)
Does it contain more?
The next thing is that the suggestions themselves appear to be ranked. How do they arrive at this kind of score to weight the suggestions?
Is it simple user-click-path tracking, where you look at what the user typed into the box and what they selected, or which product they looked at afterward?
Is this kind of score computed for each query (maybe cached) using some logic? (Which? Maybe Bayes' theorem?)
They might use something as simple as building an n-gram model from user queries and/or product names and using that to predict the most likely auto-completions.
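As a toy illustration of that idea, the sketch below counts, for every query prefix seen in a (made-up) query log, which word follows it, and ranks completions by frequency; a production system would add click-through weighting, which matches your click-path intuition.

```python
# Toy next-word autocompletion from a query log: count continuations of each
# prefix, then suggest the most frequent ones. The log below is invented.
from collections import Counter, defaultdict

query_log = [
    "harry potter and the goblet of fire",
    "harry potter box set",
    "harry potter and the chamber of secrets",
]

continuations = defaultdict(Counter)
for q in query_log:
    tokens = q.split()
    for i in range(1, len(tokens)):
        prefix = " ".join(tokens[:i])
        continuations[prefix][tokens[i]] += 1

def suggest(prefix, k=3):
    """Return the k most frequent next words seen after `prefix` in the log."""
    return [w for w, _ in continuations[prefix].most_common(k)]

print(suggest("harry potter"))           # -> ['and', 'box']
print(suggest("harry potter and the"))   # -> ['goblet', 'chamber']
```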

Topic modeling using Mallet

I'm trying to use topic modeling with Mallet but have a question.
How do I know when I need to rebuild the model? For instance, I have a set of documents I crawled from the web; using the topic modeling provided by Mallet, I can build a model and infer topics for documents with it. But over time, new subjects may appear in newly crawled data. In that case, how do I know whether I should rebuild the model from scratch on everything crawled so far?
I was thinking of doing so for the documents I crawl each month. Can someone please advise?
Also, is topic modeling only suitable for text with a fixed number of topics (the input parameter k)? If not, how do I determine what number to use?
The answers to your questions depend in large part on the kind of data you're working with and the size of the corpus.
Regarding frequency, I'm afraid you'll just have to estimate how often your data changes in a meaningful way and remodel at that rate. You could start with a week and see if the new data lead to a significantly different model. If not, try two weeks and so on.
The number of topics you select is determined by what you're looking for in the model. The higher the number, the more fine-grained the results. If you want a broad overview of what's in your corpus, you could select say 10 topics. For a closer look, you could use 200 or some other suitably high number.
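If you want to experiment with the effect of the topic count before committing, a quick way in Python is gensim's LDA (Mallet itself is a Java toolkit; gensim is just a convenient stand-in here). The corpus and topic counts below are toy examples:

```python
# Compare topic granularity at different values of k on a toy corpus.
from gensim import corpora
from gensim.models import LdaModel

docs = [text.lower().split() for text in [
    "cats and dogs are pets",
    "dogs chase cats",
    "stocks and bonds are investments",
    "bond markets fell today",
]]
dictionary = corpora.Dictionary(docs)
corpus = [dictionary.doc2bow(d) for d in docs]

for num_topics in (2, 4):
    lda = LdaModel(corpus, num_topics=num_topics, id2word=dictionary,
                   passes=10, random_state=0)
    print(num_topics, "topics:")
    for topic_id, words in lda.print_topics(num_words=3):
        print(" ", topic_id, words)
```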
I hope that helps.
