Metrics to rank text files

I have a set of text files in a particular domain, and I need to rank the files based on some metric.
Please help me out with a few metrics that can be used to rank my text files (term frequency, size, frequency of use, etc.). I would then like to use text mining techniques to rank the files based on one of these metrics.

The major issue that I had come across is how to rank the documents according to their relevance or some other metric.
I have now come to the conclusion that ranking documents based on their content (relevance) provides better results.
I am making use of a vector-based approach to rank documents based on the search words given in the query. I am not sure if that is the best approach, but it provides results with average accuracy.
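For reference, here is a minimal sketch of that vector-space approach using TF-IDF weights and cosine similarity. The file contents and the query are placeholder data, and it assumes scikit-learn is installed.

# Minimal sketch: rank text files by cosine similarity between their
# TF-IDF vectors and the query vector. Contents below are placeholders.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

documents = {
    "doc1.txt": "the quick brown fox jumps over the lazy dog",
    "doc2.txt": "search engines rank documents by term frequency",
    "doc3.txt": "ranking text files with tf-idf and cosine similarity",
}

vectorizer = TfidfVectorizer()
doc_matrix = vectorizer.fit_transform(documents.values())

query = "rank text documents"
query_vec = vectorizer.transform([query])

# One similarity score per document, higher means more relevant.
scores = cosine_similarity(query_vec, doc_matrix).ravel()

for name, score in sorted(zip(documents, scores), key=lambda x: -x[1]):
    print(f"{name}: {score:.3f}")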

Related

Do high cardinality fields affect performance for searches?

The Azure Search docs state that:
A high cardinality field consists of a facetable or filterable field that has a significant number of unique values, and as a result, consumes significant resources when computing results
But it's not clear whether this poor performance is limited to cases where the fields are specifically used in a filter/facet query, or whether it also affects performance when the field is queried against using search terms.
Can anyone with some deeper Azure Search knowledge weigh in?
After getting clarification from Microsoft, I can confirm that the answer is "no, performance is only affected when using the field in a facet/filter".
This poor performance is limited to when the fields are specifically used in a filter/facet query. The searchable terms will not be affected.
Fields that work best in faceted navigation have low cardinality: a small number of distinct values that repeat throughout documents in your search corpus (for example, a list of colors, countries/regions, or brand names).
If a field has a significant number of unique values, it will consume significant resources when computing facet navigation, because each distinct value becomes one facet and has to be calculated.
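For illustration, this is roughly what a facet request looks like when issued against the REST API. This is a sketch only: the service name, index name, api-version, field, and key are placeholders.

# Hypothetical facet query against Azure Cognitive Search via REST.
import requests

url = ("https://<service>.search.windows.net/indexes/<index>/docs/search"
       "?api-version=2023-11-01")
body = {
    "search": "*",
    # Each distinct value of 'brand' becomes one facet bucket the service
    # must compute; a high-cardinality field here is what gets expensive.
    "facets": ["brand,count:10"],
}
resp = requests.post(url, json=body, headers={"api-key": "<query-key>"})
print(resp.json().get("@search.facets"))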
At query time, a filter parser accepts criteria as input, converts the expression into atomic Boolean expressions represented as a tree, and then evaluates the filter tree over filterable fields in an index.
If a field has a significant number of unique values, the tree will be deep and will consume significant computing resources, because each unique value is evaluated in the filter and there is no cached result for duplicate items to reduce the computation.
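As a toy illustration of that tree evaluation (this is not Azure's actual parser), a filter such as color eq 'blue' and price lt 100 can be pictured as nested Boolean nodes:

# Illustrative only: a toy Boolean tree for the filter
#   color eq 'blue' and price lt 100
tree = ("and",
        ("eq", "color", "blue"),
        ("lt", "price", 100))

def evaluate(node, doc):
    # Recursively evaluate the tree against one document (fields assumed present).
    op = node[0]
    if op == "and":
        return all(evaluate(child, doc) for child in node[1:])
    if op == "or":
        return any(evaluate(child, doc) for child in node[1:])
    if op == "eq":
        return doc[node[1]] == node[2]
    if op == "lt":
        return doc[node[1]] < node[2]
    raise ValueError(f"unknown operator: {op}")

print(evaluate(tree, {"color": "blue", "price": 80}))   # True
print(evaluate(tree, {"color": "red", "price": 80}))    # False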
Searchable fields will not be affected by a significant number of unique values, because searchable fields have an inverted index to accelerate queries.
When you load the index, each field's inverted index is populated with all of the unique, tokenized words from each document, with a map to corresponding document IDs. For example, when indexing a hotels data set, an inverted index created for a City field might contain terms for Seattle, Portland, and so forth. Documents that include Seattle or Portland in the City field would have their document ID listed alongside the term.
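A toy version of that structure, assuming simple whitespace tokenization:

# Toy inverted index: unique tokens mapped to the IDs of documents containing them.
from collections import defaultdict

hotels = {
    1: {"City": "Seattle"},
    2: {"City": "Portland"},
    3: {"City": "Seattle"},
}

inverted = defaultdict(set)
for doc_id, doc in hotels.items():
    for token in doc["City"].lower().split():
        inverted[token].add(doc_id)

# Lookup is a single dictionary access, regardless of how many unique terms exist.
print(inverted["seattle"])   # {1, 3}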
I reached out to MS as well, this is the answer that I got:
“High cardinality” means different things to filterable vs searchable fields. Cardinality for filterable fields amounts to the uniqueness of the full value of the field. For searchable fields, it’s about the aggregate number of indexed terms that results from writing a document to the index. Complex custom analyzers, for example, can bloat the index by producing several tokens for each word in a string. Inverted indexes scale really well, so I wouldn’t be too concerned about having a high number of unique words in the index. But this should help you understand the unit of scale for each.
This mention in the documentation is primarily to raise awareness about what contributes to query performance and why they may see reduced performance as they add additional fields to the filter clause. I will add… You can improve the performance of individual queries by scaling up the number of partitions in your service. Going from 1 to 2 not only doubles the storage available to your service, it also doubles the amount of compute power available to execute queries. The data workload is divided roughly equally between each partition. It doesn’t usually equate to exactly twice the performance for your queries, but it can have a significant impact if you are seeing slow queries.

Algolia search keywords

I want to build a smart search with Algolia. The point is to use keywords to rank the results. Let's say a user types "smartphone blue cheap good camera". This should find all blue smartphones and order them by price and camera characteristics.
The idea is to somehow map those keywords to a ranking formula.
Does anyone know if this is possible with Algolia and, if so, what the best way is to achieve the desired result?
To automatically detect and filter by facet values (like blue, good camera), you could use Query Rules, in particular Dynamic Filtering.
However, that shouldn't be necessary. If you include the color (containing for instance the blue value) and characteristics (containing for instance the good camera value) attributes in your searchableAttributes list, then the search request will return relevant results based on purely textual relevance matched in those attributes.
On the other hand, sorting strategies affect Algolia indices at build time. Therefore, in order to change the sorting strategy based on the query (e.g. sort results by ascending price if the search query contains cheap), you will need to set up a new replica index whose results are sorted by price. On the frontend, when detecting a relevant keyword (e.g. cheap), you can decide to route the search queries to the primary index or to the sorted replica.
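Here is a sketch of that routing with the Algolia Python client. The app ID, API key, and index names are placeholders, and products_price_asc stands for a hypothetical replica configured to sort by ascending price.

# Sketch using the Algolia Python client (algoliasearch). The primary index is
# assumed to list the relevant attributes in searchableAttributes, e.g.
#   index.set_settings({"searchableAttributes": ["name", "color", "characteristics"]})
from algoliasearch.search_client import SearchClient

client = SearchClient.create("<APP_ID>", "<SEARCH_ONLY_API_KEY>")

def search(query: str):
    # Detect a sorting keyword and route the query to the sorted replica.
    if "cheap" in query.lower().split():
        index = client.init_index("products_price_asc")  # replica sorted by price
    else:
        index = client.init_index("products")            # primary index
    return index.search(query)

results = search("smartphone blue cheap good camera")
print([hit.get("name") for hit in results["hits"]])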

The implication of #search.score in Azure Search Service

I understand the reason for having scoring profiles and boosting results based on some fields, e.g. distance, rating, etc. To me, that's most likely applicable to structured documents like JSON files. The scenario that I cannot make sense of is when the indexer feeds the search index, let's say, an MS Word or PDF document from an Azure blob. We have two fields, "id" and "content", and I don't know how the search score would apply to them.
For example, there are two documents with different contents. I searched for a keyword, and the same keyword found in both documents resulted in two different scores for the two MS Word documents. My challenge is: why should the scores differ when both documents contain the same keyword?
The score is determined by many factors, for example the count of terms in each document and the number of searchable fields in which query terms were found. In your example, the documents have different lengths, so naturally they'll have different scores. HTH.
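As a simplified illustration (Azure actually uses a BM25-style scorer, so the exact numbers differ), the same keyword scores differently in documents of different lengths because term frequency is weighed against the rest of the document:

# Two documents both contain "contract", yet their scores differ because the
# term carries more relative weight in the shorter document.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "contract renewal",                                                  # short
    "this long report mentions a contract once among many other words",  # long
]
vec = TfidfVectorizer()
matrix = vec.fit_transform(docs)
query = vec.transform(["contract"])

print(cosine_similarity(query, matrix).ravel())  # two different scores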

Group analysis in Microsoft Azure

I have a data set, and I would like to produce a prediction model based on that data set, using Microsoft Azure.
This data set contains some groups of events that together make up a bigger event; for example, a few lines in the data set that are close in time (there is a time column) together create one event in time.
Does anybody know how I can do this? Is there any way to create a prediction model that learns not from a certain column, but from a different data set (of results, for that matter)?
Thanks
Yes, this is possible using Azure Machine Learning; have a look in the Gallery for various examples where this has been shown.
I think for your specific question it will be good to have a look at the Bike regression example. In that example, they show how to aggregate information from multiple rows to score on a single feature.
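As a sketch of that kind of row aggregation (the column names and the five-minute gap are assumptions, and this uses pandas rather than the Azure ML modules):

# Collapse rows that are close in time into one "event" row, so each event
# becomes a single training example.
import pandas as pd

df = pd.DataFrame({
    "time": pd.to_datetime([
        "2024-01-01 10:00", "2024-01-01 10:02", "2024-01-01 10:03",
        "2024-01-01 11:30", "2024-01-01 11:31",
    ]),
    "value": [1.0, 2.0, 3.0, 4.0, 5.0],
})

# Start a new event whenever the gap to the previous row exceeds 5 minutes.
df["event_id"] = (df["time"].diff() > pd.Timedelta(minutes=5)).cumsum()

events = df.groupby("event_id").agg(
    start=("time", "min"),
    duration=("time", lambda t: t.max() - t.min()),
    total=("value", "sum"),
    count=("value", "size"),
)
print(events)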

Influencing Solr search results with a field value

I've recently started experimenting with Solr. My data is indexed and searchable. My problem is in the sorting. I have three fields: Author, Title, Sales.
I would like to search against the author & title fields, but have the sales value influence the score so that matches with higher sales move toward the top, even if the initial match score is not the highest.
Simply sorting by sales does not produce valid results, since a result with a near-zero score for the search term but a lot of sales in general could end up above a perfect match for the term that has never been sold.
I am seeing results that, while great term matches, are not necessarily the product I want showing at the top of the list.
If you're using the dismax handler, you can add a boost function (bf) with the field you want to boost on, e.g.
http://...?q=foo&bf=fieldValue(sales)^1.5
...to make the value of the sales figure give a bump. You can, of course, make the function more complex if you want to munge the sales data in some way.
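For instance, here is a sketch of issuing such a query from Python; the host, collection, and field names are placeholders, and field() is the current name of the function query that reads a numeric field's value.

# Dismax query whose score is bumped by the sales field.
import requests

params = {
    "q": "foo",
    "defType": "dismax",
    "qf": "author title",       # search against author & title
    "bf": "field(sales)^1.5",   # let the sales figure influence the score
}
resp = requests.get("http://localhost:8983/solr/books/select", params=params)
print(resp.json()["response"]["docs"])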
More info can be found in the Solr documentation on function queries.
You may also just want to do this at index time since the sales data isn't going to be changing on the fly.
You can also use index-time boosting.
And there is detailed info in the Solr documentation on using function queries to influence scoring.
