Do high cardinality fields affect performance for searches?

The Azure Search docs state that:
A high cardinality field consists of a facetable or filterable field that has a significant number of unique values, and as a result, consumes significant resources when computing results
But the docs are not clear on whether this poor performance is limited to cases where the fields are specifically used in a filter/facet query, or whether it also affects performance when the field is queried using search terms.
Can anyone with some deeper Azure Search knowledge weigh in?

After getting clarification from Microsoft, I can confirm that the answer is "no, performance is only affected when using the field in a facet/filter".
This poor performance is limited to when the fields are specifically used in a filter/facet query. The searchable terms will not be affected.
Fields that work best in faceted navigation have low cardinality: a small number of distinct values that repeat throughout documents in your search corpus (for example, a list of colors, countries/regions, or brand names).
If the field has a significant number of unique values, computing the facet navigation will consume significant resources, because each distinct value is one facet that needs to be calculated.
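As a rough illustration of why facet cost grows with cardinality, here is a minimal sketch in plain Python. It is not how Azure Search computes facets internally; the `facet_counts` helper and the sample `color` and `sku` fields are made up for the example. The point is only that the number of buckets to compute equals the number of distinct values.

```python
from collections import Counter

def facet_counts(docs, field):
    """Count how many documents fall into each distinct value of `field`."""
    return Counter(doc[field] for doc in docs)

# Hypothetical sample data: `color` repeats across documents, `sku` is unique per document.
colors = ["red", "blue", "red", "green", "blue", "red"]
docs = [{"color": color, "sku": f"sku-{i}"} for i, color in enumerate(colors)]

color_facets = facet_counts(docs, "color")  # low cardinality: few buckets
sku_facets = facet_counts(docs, "sku")      # high cardinality: one bucket per document

print(len(color_facets), len(sku_facets))
```

With a field like `sku`, the number of facet buckets grows with the size of the corpus, while `color` stays at a handful no matter how many documents are added.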
At query time, a filter parser accepts criteria as input, converts the expression into atomic Boolean expressions represented as a tree, and then evaluates the filter tree over filterable fields in an index.
If the field has a significant number of unique values, the tree will be deep and consume significant computing resources. Because each unique value has to be evaluated in the filter, there are no cached results for duplicate values to reduce the computation.
Searchable fields are not affected by having a significant number of unique values, because searchable fields have an inverted index to accelerate queries.
When you load the index, each field's inverted index is populated with all of the unique, tokenized words from each document, with a map to corresponding document IDs. For example, when indexing a hotels data set, an inverted index created for a City field might contain terms for Seattle, Portland, and so forth. Documents that include Seattle or Portland in the City field would have their document ID listed alongside the term.
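The City example above can be sketched in a few lines of plain Python. This is a toy model under simple assumptions (lowercasing and whitespace tokenization, a hypothetical `build_inverted_index` helper), not the analyzer or storage format Azure Search actually uses.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each lowercased, whitespace-split term to the set of document IDs containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

# Hypothetical hotels data set: document ID -> value of the City field.
hotels = {1: "Seattle", 2: "Portland", 3: "Seattle"}
city_index = build_inverted_index(hotels)

# Looking up a term is a single dictionary access, however many distinct terms exist.
print(sorted(city_index["seattle"]))
```

A lookup touches only the entry for the queried term, which is why a large number of unique terms does not slow down term queries in the way a large number of facet buckets slows down faceting.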

I reached out to MS as well; this is the answer that I got:
“High cardinality” means different things to filterable vs searchable fields. Cardinality for filterable fields amounts to the uniqueness of the full value of the field. For searchable fields, it’s about the aggregate number of indexed terms that results from writing a document to the index. Complex custom analyzers, for example, can bloat the index by producing several tokens for each word in a string. Inverted indexes scale really well, so I wouldn’t be too concerned about having a high number of unique words in the index. But, this should help understand the unit of scale each.
This mention in the documentation is primarily to raise awareness about what contributes to query performance and why they may see reduced performance as they add additional fields to the filter clause. I will add…You can improve the performance of individual queries by scaling up the number of partitions in your service. Going from 1 to 2 not only doubles the storage available to your service, it also doubles the amount of compute power available to execute queries. The data workload is divided roughly equally between each partition. It doesn’t usually equate to exactly twice the performance for your queries, but it can have a significant impact if you are seeing slow queries.

Related

MongoDB: should number fields be indexed?

I'm trying to get a proper understanding of how to optimise queries in MongoDB. In this case it's for fields that hold an integer. Say I have a collection with two fields, value and cid, where value stores data of type string and cid stores data of type number.
I intend to write queries that search for records by matching the fields value and cid. I also expect the number of saved records in this collection to get very large, so queries could benefit from MongoDB indexes. It makes sense to me to index the value field, which holds a string. But I wonder whether the cid field requires indexing, or whether it's okay as is, given that it will hold integers.
I'm asking because I was going through a code base with this exact scenario and I can't figure out why the number field was not indexed. I hope my question makes sense.

Regardless of datatypes, generally speaking all queries should use an index. If you use a sort predicate, you can assist the database by having a compound index on both the equality portion of the query (the filter predicate) and the sorting portion (the sort predicate). MongoDB recommends following the index strategy referred to as the E.S.R. (Equality, Sort, Range) rule; see Performance Best Practices for the E.S.R. rule.
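To make the ordering concrete, here is a minimal sketch in plain Python that arranges fields in equality, sort, range order. The `esr_index_keys` helper and the `created_at` sort field are hypothetical; only value and cid come from the question. The resulting list of (field, direction) pairs has the shape that pymongo's `create_index` accepts for a compound index.

```python
ASCENDING, DESCENDING = 1, -1

def esr_index_keys(equality, sort, range_fields):
    """Order compound-index keys as Equality fields, then Sort fields, then Range fields."""
    keys = [(field, ASCENDING) for field in equality]
    keys.extend(sort)
    already_used = {field for field, _ in keys}
    keys.extend((field, ASCENDING) for field in range_fields if field not in already_used)
    return keys

# Equality matches on value and cid (from the question), sorted by a hypothetical created_at field.
keys = esr_index_keys(
    equality=["value", "cid"],
    sort=[("created_at", DESCENDING)],
    range_fields=[],
)
print(keys)
# With pymongo this list could be passed as: collection.create_index(keys)
```

Note that the integer cid field is part of the index just like the string value field; the datatype does not change whether a queried field benefits from being indexed.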

Algolia search keywords

I want to build a smart search with Algolia. The point is to use keywords to rank the results. Let's say a user types "smartphone blue cheap good camera". This should find all blue smartphones and order them by price and camera characteristics.
The idea is to somehow map those keywords to a ranking formula.
Does anyone know if this is possible with Algolia and, if so, what the best way to achieve the desired result is?

To automatically detect and filter by facet values (like blue, good camera), you could use Query Rules, in particular Dynamic Filtering.
However, that shouldn't be necessary. If you include the color (containing for instance the blue value) and characteristics (containing for instance the good camera value) attributes in your searchableAttributes list, then the search request will return relevant results based on purely textual relevance matched in those attributes.
On the other hand, sorting strategies impact the Algolia indices at build time. Therefore, in order to change the sorting strategy based on the query (e.g. sort results by ascending price if the search query contains cheap), you will need to set up a new replica index for which results are sorted by price. On the frontend, when detecting a relevant keyword (e.g. cheap), you can decide whether to send the search queries to the primary index or to the sorted replica.
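The keyword detection step could look roughly like the sketch below. It is plain Python with no Algolia client calls; the index names `products` and `products_price_asc` and the `route_query` helper are assumptions made for the example. The function picks the index to query and strips the sorting keyword from the text sent to the search.

```python
PRIMARY_INDEX = "products"  # hypothetical primary index name
REPLICA_BY_KEYWORD = {"cheap": "products_price_asc"}  # hypothetical replica sorted by ascending price

def route_query(query):
    """Return (index_name, cleaned_query): choose a sorted replica if a sort keyword is present."""
    index_name = PRIMARY_INDEX
    kept_words = []
    for word in query.split():
        replica = REPLICA_BY_KEYWORD.get(word.lower())
        if replica is None:
            kept_words.append(word)      # ordinary search term, keep it in the query
        elif index_name == PRIMARY_INDEX:
            index_name = replica         # first sort keyword decides the replica
    return index_name, " ".join(kept_words)

print(route_query("smartphone blue cheap good camera"))
```

The remaining words (smartphone, blue, good camera) are then matched textually against the searchable attributes of whichever index was chosen.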

The implication of @search.score in Azure Search Service

I understand the reason for having a scoring profile and boosting results based on some fields, e.g. distance, rating, etc. To me, that's most likely applicable to structured documents like JSON files. The scenario I cannot make sense of is when an indexer has the search service index, let's say, an MS Word or PDF document in Azure Blob storage. We have two fields, "id" and "content", and I don't know how the search score would apply to them.
For example, there are two documents with different contents. I searched for a keyword, and the same keyword was found in both documents, yet the two MS Word documents got two different scores. Why should the scores be different when both documents contain the same keyword?

The score is determined by many factors, for example, the count of terms in each document, and the number of searchable fields in which query terms were found. In your example, the documents have different lengths, so naturally they'll have different scores. HTH.
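As a toy illustration of the document-length effect, here is a sketch that scores a term by its frequency divided by the document length. This is not Azure Search's actual scoring formula, which takes more factors into account; the `length_normalized_tf` helper and the sample texts are made up for the example.

```python
def length_normalized_tf(term, text):
    """Toy relevance score: occurrences of `term` divided by the number of words in `text`."""
    words = text.lower().split()
    if not words:
        return 0.0
    return words.count(term.lower()) / len(words)

# Both hypothetical documents contain the keyword exactly once, but differ in length.
short_doc = "azure search index"
long_doc = "azure search index with many other unrelated words in the content"

short_score = length_normalized_tf("azure", short_doc)
long_score = length_normalized_tf("azure", long_doc)
print(short_score > long_score)
```

Even in this simplified model, the same keyword yields a higher score in the shorter document, because the term makes up a larger share of its content.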

Wide rows vs Collections in Cassandra

I am trying to model many-to-many relationships in Cassandra, something like an Item-User relationship. A user can like many items and an item can be bought by many users. Let us also assume that the order in which the "like" events occur is not a concern and that the most used query simply returns the "likes" based on the item as well as the user.
There are a couple of posts discussing data modeling
http://www.ebaytechblog.com/2012/07/16/cassandra-data-modeling-best-practices-part-1/
An alternative would be to store a collection of ItemID in the User table to denote the items liked by that user and do something similar in the Items table in CQL3.
Questions
Are there any hits in performance using the collection? I think they translate to composite columns? So the read pattern, caching and other factors should be similar?
Are collections less performant for write heavy applications? Is updating the collection frequently less performant?

There are a couple of advantages of using wide rows over collections that I can think of:
The number of elements allowed in a collection is 65535 (an unsigned short). If it's possible to have more than that many records in your collection, using wide rows is probably better as that limitation is much higher (2 billion cells (rows * columns) per partition).
When reading a collection column, the entire collection is read every time. Compare this to wide row where you can limit the number of rows being read in your query, or limit the criteria of your query based on clustering key (i.e. date > 2015-07-01).
For your particular use case I think modeling an 'items_by_user' table would be more ideal than a list<item> column on a 'users' table.

Cassandra sets or composite columns

I am storing account information in Cassandra. Each account has lists of data associated with it. For example, an account may have a list of friends and a list of liked books. Queries on accounts will always want all friends or all liked books or all of both. No filtering or searching is needed on either. The list of friends and books can grow and shrink.
Is it better to use a set column type or composite columns for this scenario?

I would suggest you not use sets if
You are concerned about disk space (each value is allocated a cell on disk, plus space for each cell's metadata, which is 15 bytes if I am not wrong; that consumes a lot if your data keeps growing).
Not going to grow a lot of data in that particular row, as each time the cells have to be fetched from different SSTables.
In these kinds of cases, the preferred option would be a JSON array. You can store it as text and read the data back from that.
Sets (or any other collections) were introduced for a completely different use case. If you need a particular value inside the list, or a value has to be updated frequently inside the same collection, you should make use of collections.
My take on your query is this:
Store all account-specific info in a JSON object of friends whose value is a list of books.

Sets are good for smaller collections of data. If you expect your friends / liked books lists to grow constantly and get large (there isn't a golden number here), it would be better to go with composite columns, as that model scales out better than collections and allows for straightforward querying, whereas collections require secondary indexes.
