Scoring Nutch results

I am crawling with Nutch 1.2 by providing seed links for the travel domain, and then indexing with Solr 3.1. I am getting results in my search engine, but now I want to score the indexed results and display them ranked in the search engine.
I have referred to these URLs:
1) http://wiki.apache.org/solr/QueryElevationComponent which covers boosting queries.
2) http://wiki.apache.org/solr/SolrRelevancyFAQ#index-time_boosts which covers boosting documents.
How do I boost the results at index time and retrieve them?
Thanks in advance!

What are your criteria for boosting the results?
Solr already does a good job of calculating a document's relevancy based on how frequently the query terms appear in it.
What are your specific requirements that are not covered by the default setup?
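If index-time boosting is still what you want, here is a minimal sketch of how a Solr 3.x document or field boost can be applied through the XML update handler, posted with Python's requests library. The core URL, field names, and boost values are illustrative assumptions, and field-level boosts only take effect when norms are enabled for the field.

    # Minimal sketch: apply index-time boosts via Solr's XML update handler (Solr 3.x style).
    # The core URL, field names, and boost values are assumptions for illustration.
    # boost="2.0" on <doc> boosts the whole document; boost="3.0" on <field> boosts that field.
    import requests

    update_xml = """
    <add>
      <doc boost="2.0">
        <field name="id">travel-001</field>
        <field name="title" boost="3.0">Cheap flights to Goa</field>
        <field name="content">Travel page crawled by Nutch</field>
      </doc>
    </add>
    """

    resp = requests.post(
        "http://localhost:8983/solr/update?commit=true",   # assumed Solr URL
        data=update_xml.encode("utf-8"),
        headers={"Content-Type": "text/xml; charset=utf-8"},
    )
    resp.raise_for_status()

The boost is folded into the document's score at query time, so documents with higher index-time boosts rank higher for otherwise equal matches.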

Related

Scoring relevancy in Azure Search

Is it possible to apply a scoring profile so that a document has greater relevancy for one country than for others?
How can I manage scoring like that?
You can add scoring profiles to an Azure Search index.
The example in the documentation shows boosting the score based on geo-location.
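As a rough sketch of what that looks like, a scoring profile is part of the index definition. The example below adds a tag-style scoring function that boosts documents whose country field matches a value passed at query time; the service name, index name, field names, API version, and key are placeholders, so check the Azure Search documentation for the exact schema.

    # Sketch: add a scoring profile to an Azure Search index via the REST API.
    # Service/index/field names, the api-version, and the key are placeholders.
    import requests

    SERVICE = "https://<your-service>.search.windows.net"   # placeholder
    API_VERSION = "2020-06-30"                               # assumed API version

    index_definition = {
        "name": "articles",
        "fields": [
            {"name": "id", "type": "Edm.String", "key": True},
            {"name": "title", "type": "Edm.String", "searchable": True},
            {"name": "country", "type": "Edm.String", "searchable": True, "filterable": True},
        ],
        "scoringProfiles": [
            {
                "name": "countryBoost",
                # Boost documents whose country matches the tag supplied at query time.
                "functions": [
                    {
                        "type": "tag",
                        "fieldName": "country",
                        "boost": 5,
                        "tag": {"tagsParameter": "userCountry"},
                    }
                ],
            }
        ],
        "defaultScoringProfile": "countryBoost",
    }

    resp = requests.put(
        f"{SERVICE}/indexes/articles",
        params={"api-version": API_VERSION},
        json=index_definition,
        headers={"api-key": "<admin-key>"},                  # placeholder
    )
    resp.raise_for_status()

At query time the country would then be supplied as a scoring parameter (for example scoringParameter=userCountry-FR), so the same index can favour different countries for different users.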

Suggestion on algorithm selection and implementation

Good day,
I'm working on the following problem and have minimal knowledge of machine learning (ML):
Given a list of articles (A's, text format) and a search string (SQ), select and order the articles (A's) most relevant to the search string (SQ).
Optimize point #1 for the case when a new article (A) is added, i.e. so the search takes the new record into account next time.
I've selected Spark as the engine for the ML calculations and found examples that compute TF-IDF models (https://spark.apache.org/docs/2.0.0/ml-features.html#tf-idf). This ends up producing a feature vector of term frequencies for each article:
(8,[0,1,4],[0.287... (8,[0,1,6],[0.287... (8,[1,3,4],[0.0,0...
(sorry for the truncated results)
At this point I am stuck. It looks like I need to compute a similar vector for SQ and somehow order the articles by closeness, but I am not sure how to do that.
What would be right way forward? Can you please share/point to examples with implementation?
Thank you in advance,
Vitaliy
Here is the short roadmap that worked for me (verified on a really small data set); a sketch follows the list:
1) Build TF-IDF for the corpus and obtain the "features" vectors.
2) Transform the search term with the same TF-IDF model to obtain its "features" vector.
3) Train k-means on the corpus features (from step 1) to get an "article id" to cluster mapping.
4) Predict the search term's cluster (from steps 2 and 3).
5) Filter the articles by the needed cluster (from step 4).
6) Filter the articles in the selected cluster further with LSH, and order them by the distance between their feature vectors and the search term's vector.
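A minimal pyspark sketch of steps 1–5, with made-up articles and parameter values (numFeatures, k); step 6 (LSH and exact-distance re-ranking) would run on the small candidate set this produces.

    # Sketch of the roadmap above with Spark ML (pyspark); the corpus, column names,
    # and parameter values (numFeatures, k) are illustrative assumptions.
    from pyspark.sql import SparkSession
    from pyspark.ml.feature import Tokenizer, HashingTF, IDF
    from pyspark.ml.clustering import KMeans

    spark = SparkSession.builder.appName("article-search").getOrCreate()

    articles = spark.createDataFrame(
        [(0, "cheap flights and hotels"),
         (1, "machine learning with spark"),
         (2, "travel insurance for flights")],
        ["id", "text"],
    )

    # Step 1: TF-IDF features for the corpus.
    tokenizer = Tokenizer(inputCol="text", outputCol="words")
    words = tokenizer.transform(articles)
    tf = HashingTF(inputCol="words", outputCol="rawFeatures", numFeatures=262144)
    featurized = tf.transform(words)
    idf_model = IDF(inputCol="rawFeatures", outputCol="features").fit(featurized)
    corpus = idf_model.transform(featurized).cache()

    # Step 3: cluster the corpus features with k-means.
    kmeans_model = KMeans(k=2, seed=42, featuresCol="features").fit(corpus)
    clustered = kmeans_model.transform(corpus)   # adds a "prediction" (cluster id) column

    # Steps 2 and 4: vectorize the search query with the same pipeline, predict its cluster.
    query = spark.createDataFrame([(100, "flights")], ["id", "text"])
    query_features = idf_model.transform(tf.transform(tokenizer.transform(query)))
    query_cluster = kmeans_model.transform(query_features).first()["prediction"]

    # Step 5: keep only the articles in the query's cluster; step 6 (LSH / exact
    # distance ranking) would run on this much smaller candidate set.
    candidates = clustered.filter(clustered.prediction == query_cluster)
    candidates.select("id", "text").show(truncate=False)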

Azure Search, using prefix with Scoring Profile

When using a prefix like "tru*", I see that the scores of the results are no longer being calculated against the scoring profile.
I'm looking for a solution that searches on part of a word and also orders the results.
(Two screenshots showed the search with '*' and without it.)
It turns out the scoring profile does still apply, but with '*' the base score is so low that no difference is visible after the scoring profile is applied.
The best solution for me was using Order By in the search request, as in the sketch below.
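For illustration, ordering prefix-query results explicitly with $orderby through the REST API could look like this; the service, index, field names, API version, and key are placeholders rather than a verified setup.

    # Sketch: prefix search ("tru*") ordered explicitly with $orderby instead of
    # relying on the relevance score. All names and the api-version are placeholders.
    import requests

    SERVICE = "https://<your-service>.search.windows.net"   # placeholder
    INDEX = "hotels"                                         # placeholder

    resp = requests.get(
        f"{SERVICE}/indexes/{INDEX}/docs",
        params={
            "api-version": "2020-06-30",      # assumed API version
            "search": "tru*",                 # prefix query
            "queryType": "simple",
            "$orderby": "rating desc",        # explicit ordering instead of score
            "$top": 20,
        },
        headers={"api-key": "<query-key>"},   # placeholder
    )
    resp.raise_for_status()
    for doc in resp.json()["value"]:
        print(doc)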

Document Similarity in ElasticSearch

I want to calculate the similarity between two documents indexed in Elasticsearch. I know it can be done in Lucene using term vectors. What is the direct way to do it?
I found that there is a similarity module that appears to do exactly this:
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/index-modules-similarity.html
How do I integrate this into my system? I am using pyelasticsearch for calling Elasticsearch commands, but I am open to using the REST API for similarity if needed.
I think the Elasticsearch documentation can easily be misinterpreted.
Here "similarity" is not a comparison of documents or fields but rather a mechanism for scoring matching documents based on the matching terms from the query.
The documentation states:
A similarity (scoring / ranking model) defines how matching documents
are scored.
The similarity algorithms that Elasticsearch supports are probabilistic models based on term distribution in the corpus (index).
The term "term vectors" can also be misinterpreted.
Here "term vectors" refers to statistics for the terms of a document that can easily be queried. Any similarity measurement across term vectors would then have to be done in your application, post-query. The documentation on term vectors states:
Returns information and statistics on terms in the fields of a
particular document.
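As a rough sketch of that post-query approach, the snippet below pulls the term vectors of two documents over the REST API (plain requests rather than pyelasticsearch) and computes a cosine similarity over their term frequencies; the index, type, document IDs, and field name are assumptions, and the exact _termvectors path differs slightly between Elasticsearch versions.

    # Sketch: fetch term vectors for two documents and compute cosine similarity
    # in the application, post-query. Index, type, ids, and the "body" field are
    # assumptions; the _termvectors endpoint path varies across ES versions.
    import math
    import requests

    ES = "http://localhost:9200"

    def term_freqs(index, doc_type, doc_id, field):
        """Return {term: term_freq} for one analyzed field of a document."""
        resp = requests.get(
            f"{ES}/{index}/{doc_type}/{doc_id}/_termvectors",
            params={"fields": field},
        )
        resp.raise_for_status()
        terms = resp.json()["term_vectors"][field]["terms"]
        return {term: info["term_freq"] for term, info in terms.items()}

    def cosine(a, b):
        shared = set(a) & set(b)
        dot = sum(a[t] * b[t] for t in shared)
        norm_a = math.sqrt(sum(v * v for v in a.values()))
        norm_b = math.sqrt(sum(v * v for v in b.values()))
        return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

    v1 = term_freqs("articles", "article", "1", "body")
    v2 = term_freqs("articles", "article", "2", "body")
    print("cosine similarity:", cosine(v1, v2))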
If you need a performant (fast) similarity metric over a very large corpus you might consider a low-rank embedding of your documents stored in an index for doing approximate nearest neighbor searches. After your KNN lookup, which greatly reduces the candidate set, you can do more costly metric calculations for ranking.
Here is an excellent resource for evaluation of approximate KNN solutions:
https://github.com/erikbern/ann-benchmarks
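For the approximate-nearest-neighbour route, here is a small sketch with Annoy, one of the libraries covered by that benchmark; the 64-dimensional random vectors stand in for whatever document embeddings you actually compute.

    # Sketch: approximate nearest neighbor lookup with Annoy. The random
    # 64-dimensional vectors are placeholders for real document embeddings.
    import random
    from annoy import AnnoyIndex

    DIM = 64
    index = AnnoyIndex(DIM, "angular")   # angular distance ~ cosine similarity

    # Add document embeddings (here: random placeholders) keyed by document id.
    for doc_id in range(1000):
        index.add_item(doc_id, [random.gauss(0, 1) for _ in range(DIM)])

    index.build(10)   # 10 trees; more trees = better recall, larger index

    # Find the 20 nearest candidates for a query embedding, then re-rank this
    # small candidate set with a more expensive exact metric if needed.
    query_vec = [random.gauss(0, 1) for _ in range(DIM)]
    print(index.get_nns_by_vector(query_vec, 20))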

Apache lucene inverted index

Does the Lucene index use TF-IDF as weights? Is it possible to define your own statistics and weights for each document, and "plug" them into Lucene?
Yes, the default scoring algorithm incorporates TF-IDF, and it is fully documented in the TFIDFSimilarity documentation.
There are a number of ways to customize the scoring of documents.
The simplest and most common is to incorporate a boost, either on a field at index time, or on a query term when querying.
Many query types modify the scoring used for that query. Examples include ConstantScoreQuery and DisjunctionMaxQuery.
The Similarity you use defines the scoring algorithm. You could select a different one (e.g. BM25Similarity).
You can implement your own Similarity, usually by extending a higher-level implementation such as DefaultSimilarity, TFIDFSimilarity, or SimilarityBase (see the sketch below).
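As a sketch of swapping in a different Similarity, the snippet below uses PyLucene (the Python bindings, mirroring the Lucene 4.x Java API) to index and search with BM25Similarity and a query-time term boost; the field names and document contents are made up.

    # Sketch: replace Lucene's default scoring with BM25 via PyLucene (Lucene 4.x API).
    import lucene
    from org.apache.lucene.analysis.standard import StandardAnalyzer
    from org.apache.lucene.document import Document, Field, TextField
    from org.apache.lucene.index import DirectoryReader, IndexWriter, IndexWriterConfig
    from org.apache.lucene.queryparser.classic import QueryParser
    from org.apache.lucene.search import IndexSearcher
    from org.apache.lucene.search.similarities import BM25Similarity
    from org.apache.lucene.store import RAMDirectory
    from org.apache.lucene.util import Version

    lucene.initVM()

    analyzer = StandardAnalyzer(Version.LUCENE_43)
    directory = RAMDirectory()

    # Use BM25 at index time (the Similarity writes the norms) ...
    config = IndexWriterConfig(Version.LUCENE_43, analyzer)
    config.setSimilarity(BM25Similarity())
    writer = IndexWriter(directory, config)

    doc = Document()
    doc.add(TextField("body", "cheap travel deals and flights", Field.Store.YES))
    writer.addDocument(doc)
    writer.close()

    # ... and the same Similarity at query time.
    searcher = IndexSearcher(DirectoryReader.open(directory))
    searcher.setSimilarity(BM25Similarity())

    # Query-time boost with the classic query parser syntax: travel^2.
    query = QueryParser(Version.LUCENE_43, "body", analyzer).parse("travel^2 flights")
    for hit in searcher.search(query, 10).scoreDocs:
        print(hit.doc, hit.score, searcher.doc(hit.doc).get("body"))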
Just go through this example. It may help you to see how you can make custom changes in the indexing process:
http://lucene.apache.org/core/4_3_1/demo/src-html/org/apache/lucene/demo/IndexFiles.html
