Solr Order result without sort - search

I'm new to Solr.
I have a schema for products with 3 fields:
price, name, size.
The size field [type="text_general"] has 3 possible values:
small, medium, large.
I want to show results in a particular order:
The user has the option to choose a size.
If user selects small as the size the ordering should be
first show all small products followed by medium and large.
If user selects medium as the size the ordering should be
first show all medium products followed by small and large.
If user selects large as the size the ordering should be
first show all large products followed by medium and small.
How can I achieve this?
I tried grouping the results but had no success.
http://localhost:8983/solr/product/select?start=0&rows=10&q=*:*&group=true&group.field=size&group.format=simple&group.limit=100000&sort=size desc
I think a plain sort on the field is not the way to achieve this ordering.
Is there any other feature in Solr to do this?
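One direction that may be worth trying (this is a sketch, not something confirmed in this thread) is sorting by a function query instead of by the raw field. Solr's function queries include termfreq() and if(), so the selected size can be ranked 0, the second choice 1, and everything else 2. The helper names below are hypothetical, and it is assumed that the size field indexes the lowercase terms small, medium and large:

```python
from urllib.parse import urlencode

# Preferred display order for each size the user can select (from the question).
SIZE_ORDER = {
    "small": ["small", "medium", "large"],
    "medium": ["medium", "small", "large"],
    "large": ["large", "medium", "small"],
}


def size_rank_sort(selected: str, field: str = "size") -> str:
    """Build a sort clause that ranks documents by where their size falls
    in the preferred order for the selected size."""
    first, second, _last = SIZE_ORDER[selected]
    # termfreq() is greater than 0 when the term occurs in the field, so the
    # nested if() yields 0 for the first size, 1 for the second, 2 otherwise.
    return (
        f"if(termfreq({field},'{first}'),0,"
        f"if(termfreq({field},'{second}'),1,2)) asc"
    )


def build_query_string(selected: str, start: int = 0, rows: int = 10) -> str:
    """URL-encode the request parameters for a match-all query."""
    params = {
        "q": "*:*",
        "start": start,
        "rows": rows,
        "sort": size_rank_sort(selected),
    }
    return urlencode(params)


if __name__ == "__main__":
    print(size_rank_sort("medium"))
    print(build_query_string("small"))
```

The resulting string would be appended to the select URL of the core. A secondary sort (for example on price) could be added after the function, separated by a comma.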

Related

organize a set of data based on open slots/sold-out slots

I am trying to analyze data based on the following scenario:
A group of places, each with its own ID, becomes available for visiting from time to time for a limited number of people. This number varies according to how well the last visit season performed. So far, visit seasons have been opened 3 times.
Let's suppose ID_01 in those three seasons had the following available slots/sold-out slots ratio: 25/24, 30/30, and 30/30, ID_02 had: 25/15, 20/18, and 25/21, and ID_03 had: 25/10, 15/15 and 20/13.
What would be the best way to design the database for such analysis on a single table?
So far I have used a table for each ID with all their available slots and sold-out amounts, but as the number of IDs gets higher and the number of visit seasons too (way beyond three at this point) it has been proving to be not ideal, hard to keep track of, and terrible to work with.
The best solution I could come up with was putting all IDs on a column and adding two columns for each season (ID | 1_available | 1_soldout | 2_available | 2_soldout | ...).
The Wikipedia article on database normalization would be a good starting point.
Based on the information you provided in your question, you can create one table.
AvailableDate
-------------
AvailableDateID
LocationID
AvailableDate
AvailableSlots
SoldOutSlots
...
You may also have other columns you haven't mentioned. One possibility is SoldOutTimestamp.
The primary key is AvailableDateID. It's an auto-incrementing integer that has no meaning, other than to sort the rows in input order.
You also create a unique index on (LocationID, AvailableDate) and another unique index on (AvailableDate, LocationID). This allows you to retrieve the row by LocationID or by AvailableDate.
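As a rough sketch of that layout, here is the table built in an in-memory SQLite database, loaded with the figures from the question. The index names and the season dates are made up for illustration, and the column types are assumptions:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE AvailableDate (
    AvailableDateID INTEGER PRIMARY KEY AUTOINCREMENT,
    LocationID      TEXT    NOT NULL,
    AvailableDate   TEXT    NOT NULL,
    AvailableSlots  INTEGER NOT NULL,
    SoldOutSlots    INTEGER NOT NULL
);
-- Hypothetical index names; the two unique indexes described above.
CREATE UNIQUE INDEX idx_location_date ON AvailableDate (LocationID, AvailableDate);
CREATE UNIQUE INDEX idx_date_location ON AvailableDate (AvailableDate, LocationID);
""")

# Slot figures from the question; the season dates are invented placeholders.
seasons = ["2020-01-01", "2021-01-01", "2022-01-01"]
figures = {
    "ID_01": [(25, 24), (30, 30), (30, 30)],
    "ID_02": [(25, 15), (20, 18), (25, 21)],
    "ID_03": [(25, 10), (15, 15), (20, 13)],
}
rows = [
    (location, date, available, sold)
    for location, pairs in figures.items()
    for date, (available, sold) in zip(seasons, pairs)
]
conn.executemany(
    "INSERT INTO AvailableDate "
    "(LocationID, AvailableDate, AvailableSlots, SoldOutSlots) "
    "VALUES (?, ?, ?, ?)",
    rows,
)

# One row per location and season, so per-location analysis is a GROUP BY.
totals = conn.execute(
    "SELECT LocationID, SUM(AvailableSlots), SUM(SoldOutSlots) "
    "FROM AvailableDate GROUP BY LocationID ORDER BY LocationID"
).fetchall()

if __name__ == "__main__":
    for location, available, sold in totals:
        print(location, available, sold)
```

Adding a new season or a new location only adds rows; no new tables or columns are needed.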

Do high cardinality fields affect performance for searches?

The Azure Search docs state that:
A high cardinality field consists of a facetable or filterable field that has a significant number of unique values, and as a result, consumes significant resources when computing results
But it's not clear whether this poor performance is limited to when the fields are specifically used in a filter/facet query, or whether it also affects performance when the field is queried against using search terms.
Can anyone with some deeper Azure Search knowledge weigh in?
After getting clarification from Microsoft, I can confirm that the answer is "no, performance is only affected when using the field in a facet/filter".
This poor performance is limited to when the fields are specifically used in a filter/facet query. The searchable terms will not be affected.
Fields that work best in faceted navigation have low cardinality: a small number of distinct values that repeat throughout documents in your search corpus (for example, a list of colors, countries/regions, or brand names).
If the field has a significant number of unique values, it will consume significant resources when computing the facet navigation, because each distinct value becomes one facet that needs to be calculated.
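A toy illustration of that point (this is not how Azure Search is implemented internally, just a model of the idea): counting facets over a low-cardinality field produces a handful of buckets, while a field with a unique value per document produces one bucket per document. The field names and documents are invented:

```python
from collections import Counter


def facet_counts(docs, field):
    """Count how many documents fall under each distinct value of a field."""
    return Counter(doc[field] for doc in docs)


# Made-up sample documents: "color" repeats, "sku" is unique per document.
docs = [
    {"id": 1, "color": "red", "sku": "A-001"},
    {"id": 2, "color": "blue", "sku": "A-002"},
    {"id": 3, "color": "red", "sku": "A-003"},
    {"id": 4, "color": "blue", "sku": "A-004"},
]

color_facets = facet_counts(docs, "color")  # low cardinality: few buckets
sku_facets = facet_counts(docs, "sku")      # high cardinality: one bucket per doc

if __name__ == "__main__":
    print(len(color_facets), "color buckets")
    print(len(sku_facets), "sku buckets")
```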
At query time, a filter parser accepts criteria as input, converts the expression into atomic Boolean expressions represented as a tree, and then evaluates the filter tree over filterable fields in an index.
If the field has a significant number of unique values, the tree will be deep and consume significant computing resources. Because each unique value has to be evaluated in the filter, there are no cached results for duplicate items to reduce the calculation.
Searchable fields will not be affected if they have a significant number of unique values, because searchable fields use an inverted index to accelerate queries.
When you load the index, each field's inverted index is populated with all of the unique, tokenized words from each document, with a map to corresponding document IDs. For example, when indexing a hotels data set, an inverted index created for a City field might contain terms for Seattle, Portland, and so forth. Documents that include Seattle or Portland in the City field would have their document ID listed alongside the term.
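The structure described above can be sketched in a few lines. This is only a conceptual model with a naive lowercase/whitespace tokenizer, not the actual Azure Search implementation, and the hotel documents are invented:

```python
from collections import defaultdict


def build_inverted_index(docs, field):
    """Map each tokenized term in a field to the set of document IDs containing it."""
    index = defaultdict(set)
    for doc_id, doc in docs.items():
        # Naive tokenization: lowercase and split on whitespace.
        for term in doc[field].lower().split():
            index[term].add(doc_id)
    return index


# Made-up hotels data set keyed by document ID.
hotels = {
    1: {"City": "Seattle"},
    2: {"City": "Portland"},
    3: {"City": "Seattle"},
}

city_index = build_inverted_index(hotels, "City")

if __name__ == "__main__":
    # A term lookup is a single dictionary access, regardless of how many
    # distinct terms the index holds.
    print(sorted(city_index["seattle"]))
```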
I reached out to MS as well, this is the answer that I got:
“High cardinality” means different things to filterable vs searchable fields. Cardinality for filterable fields amounts to the uniqueness of the full value of the field. For searchable fields, it’s about the aggregate number of indexed terms that results from writing a document to the index. Complex custom analyzers, for example, can bloat the index by producing several tokens for each word in a string. Inverted indexes scale really well, so I wouldn’t be too concerned about having a high number of unique words in the index. But, this should help understand the unit of scale each.
This mention in the documentation is primarily to raise awareness about what contributes to query performance and why they may see reduced performance as they add additional fields to the filter clause. I will add…You can improve the performance of individual queries by scaling up the number of partitions in your service. Going from 1 to 2 not only doubles the storage available to your service, it also doubles the amount of compute power available to execute queries. The data workload is divided roughly equally between each partition. It doesn’t usually equate to exactly twice the performance for your queries, but it can have a significant impact if you are seeing slow queries.

Wide rows vs Collections in Cassandra

I am trying to model many-to-many relationships in Cassandra, something like an Item-User relationship. A user can like many items and an item can be bought by many users. Let us also assume that the order in which the "like" event occurs is not a concern and that the most used query is simply returning the "likes" based on item as well as the user.
There are a couple of posts discussing data modeling
http://www.ebaytechblog.com/2012/07/16/cassandra-data-modeling-best-practices-part-1/
An alternative would be to store a collection of ItemID in the User table to denote the items liked by that user and do something similar in the Items table in CQL3.
Questions
Are there any hits in performance using the collection? I think they translate to composite columns? So the read pattern, caching and other factors should be similar?
Are collections less performant for write heavy applications? Is updating the collection frequently less performant?
There are a couple of advantages of using wide rows over collections that I can think of:
The number of elements allowed in a collection is 65535 (an unsigned short). If it's possible to have more than that many records in your collection, using wide rows is probably better as that limitation is much higher (2 billion cells (rows * columns) per partition).
When reading a collection column, the entire collection is read every time. Compare this to a wide row, where you can limit the number of rows being read in your query, or restrict the query based on the clustering key (e.g. date > 2015-07-01).
For your particular use case I think modeling an 'items_by_user' table would be better than a list<item> column on a 'users' table.
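To make the suggestion a bit more concrete, here is a minimal sketch of what such tables might look like. The CQL is kept in Python strings so it could be handed to whichever driver you use. Only the 'items_by_user' name comes from this answer; the column names, the uuid types and the mirror table 'users_by_item' are assumptions:

```python
# CQL statements held as plain strings; nothing here talks to a cluster.

# One partition per user, one clustered row per liked item.
ITEMS_BY_USER = """
CREATE TABLE items_by_user (
    user_id uuid,
    item_id uuid,
    PRIMARY KEY (user_id, item_id)
);
"""

# Hypothetical mirror table for the reverse lookup (likes by item).
USERS_BY_ITEM = """
CREATE TABLE users_by_item (
    item_id uuid,
    user_id uuid,
    PRIMARY KEY (item_id, user_id)
);
"""

# Unlike a collection column, a wide row can be read in slices.
SELECT_ITEMS_PAGE = "SELECT item_id FROM items_by_user WHERE user_id = ? LIMIT 100;"

if __name__ == "__main__":
    for statement in (ITEMS_BY_USER, USERS_BY_ITEM, SELECT_ITEMS_PAGE):
        print(statement.strip())
```

Each "like" is then written to both tables, which trades an extra write for two cheap partition reads.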

Cassandra sets or composite columns

I am storing account information in Cassandra. Each account has lists of data associated with it. For example, an account may have a list of friends and a list of liked books. Queries on accounts will always want all friends or all liked books or all of both. No filtering or searching is needed on either. The list of friends and books can grow and shrink.
Is it better to use a set column type or composite columns for this scenario?
I would suggest not using sets if:
You are concerned about disk space (each value is allocated a cell on disk, plus space for the metadata of each cell, which is 15 bytes if I am not wrong; that adds up quickly if your data keeps growing).
You are not going to grow a lot of data in that particular row, as each time the cells have to be fetched from different SSTables.
In these kinds of cases, the preferred option would be a JSON array. You can store it as text and read the data back from that.
Sets (and other collections) were introduced for a different use case. If you need to work with a particular value inside the list, or a value inside the same collection has to be updated frequently, you should make use of collections.
My take on your question is this:
Store all account-specific info in a JSON object of friends whose value is a list of books.
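A small sketch of the JSON-as-text idea, using only the standard library. The account shape and the names in it are invented for illustration:

```python
import json

# Hypothetical account data: friends and liked books kept as plain lists.
account = {
    "friends": ["alice", "bob"],
    "liked_books": ["Dune", "Emma"],
}

# Serialize to a single text value, as it would be written to a text column.
stored = json.dumps(account)

# Reading always brings back everything, which matches the stated access pattern.
loaded = json.loads(stored)

# Adding or removing an entry means rewriting the whole value.
loaded["friends"].append("carol")
stored_again = json.dumps(loaded)

if __name__ == "__main__":
    print(stored)
    print(stored_again)
```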
Sets are good for smaller collections of data. If you expect your friends / liked books lists to grow constantly and get large (there isn't a golden number here), it would be better to go with composite columns, as that model scales out better than collections and allows for straightforward querying, compared to requiring secondary indexes on collections.

Sort Column Total

I have a view with column totals.
What I want is to sort the totals column in an XPages view or repeat control.
I am able to get the totals to display but cannot sort them.
Any suggestions?
Not very clean. I wouldn't do that in production, for performance reasons. It's a brute-force solution, but it should work for a fair number of categories (up to a few hundred).
Assume the view can be used for category lookups (similar to show single category). Then all you need is a list of categories in the correct order, based on totals rather than alphabetical order. Therefore, in the first cycle, loop through all categories (use NotesNavigator with cache) and store them as pairs of values: (category, totals). It may be a Map[String,Double] or a Set[Category], where Category is a POJO with category and totals attributes. In both cases you will need your own Comparator. If your categories are hierarchical, use only the top-level category (sorting a tree structure is more complicated).
For example:
Australia (5)
Brazilia (10)
Chile (7)
will sort as
Brazilia (10)
Chile (7)
Australia (5)
Cache this collection in viewScope (assuming totals are "static" for a short period of time; the user will need to reload the page to get updated data).
Feed this collection to a repeater with a simple data table (or view or repeater) showing only the selected category.
The GUI will be a bit odd with pagers (a pager for categories and a pager for the content of each category), but you will handle this, I hope.
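The ordering step itself is small. Here is a language-neutral sketch in Python, standing in for the POJO and Comparator mentioned above (in XPages this would be Java or SSJS, and the pairs would come from the view navigator rather than a literal list):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Category:
    """Stand-in for the POJO holding a category name and its column total."""
    name: str
    total: float


def sort_by_total(categories):
    """Order categories by total, highest first (the custom comparator)."""
    return sorted(categories, key=lambda c: c.total, reverse=True)


# Pairs collected in the first pass over the view (values from the example).
categories = [
    Category("Australia", 5),
    Category("Brazilia", 10),
    Category("Chile", 7),
]

ordered_names = [c.name for c in sort_by_total(categories)]

if __name__ == "__main__":
    # The outer repeat would iterate over this list and show each
    # category's documents with a single-category lookup.
    print(ordered_names)
```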
It's probably better to ask can this type of sorting or resorting be done in Notes rather than can it be done in XPages. If it can be done in Notes then you should be able to do the same in XPages - sometimes automatically.
XPages can only do so much with the view datasource. So if the datasource can't sort the categories by the totals then you won't be able to do this in XPages. At least not out of the box.
You might be able to do something with repeats (doing a lookup of the datasource, retrieving all documents under the category that has the highest total before moving on to the next category in the sequence), but it is likely to become pretty complicated and not worth it in the end.
Sorry if it's not the answer you're looking for.
I will try to clarify it.
There is a categorized view. The category is, for example, the company name.
The view has a column with totals, so each category also has a total.
The wish is for this categorized view to show the company category that has the highest total at the top of the view, without losing the categorization.
