SOLR sort by grouping num found - search

Hello StackOverflow,
Environment: Solr 7.5.0 / SolrCloud 7.5.0, custom search index.
According to the search requirements, the response should be grouped by a specific field, and the groups should be sorted by score and then by each group's numFound value.
I have seen some posts saying that this is not possible, or that it is possible only with two queries using facets (which is what is currently implemented). But those posts are pretty old, and something may have changed since.
For example:
how to order groups by count in solr
Solr 6: how to group by title and return groups that contain minimum 5 numFound
https://issues.apache.org/jira/browse/SOLR-9678?jql=text%20~%20%22numFound%22
I have also heard that this can be achieved with Solr subqueries, but I had no luck finding a solution that way either.
I would appreciate any help.
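For illustration only, here is a minimal client-side sketch of the ordering described above: it re-sorts the groups of an already fetched grouped response by top score and then by numFound. It assumes the request asked for scores (so each doclist carries maxScore) and that all groups of interest fit in one response; the field name category and the sample data are hypothetical. It is not a server-side solution.

```python
def sort_groups(response, field):
    """Order groups by best score, then by group size, both descending."""
    groups = response["grouped"][field]["groups"]

    def sort_key(group):
        doclist = group["doclist"]
        # maxScore is assumed to be present because scores were requested.
        return (doclist.get("maxScore", 0.0), doclist["numFound"])

    return sorted(groups, key=sort_key, reverse=True)


# Hypothetical grouped response, trimmed to the keys used here.
sample = {
    "grouped": {
        "category": {
            "matches": 19,
            "groups": [
                {"groupValue": "a", "doclist": {"numFound": 3, "maxScore": 1.5, "docs": []}},
                {"groupValue": "b", "doclist": {"numFound": 7, "maxScore": 1.5, "docs": []}},
                {"groupValue": "c", "doclist": {"numFound": 9, "maxScore": 0.8, "docs": []}},
            ],
        }
    }
}

ordered = [g["groupValue"] for g in sort_groups(sample, "category")]
print(ordered)
```

Groups a and b tie on score, so the larger group b comes first; c has the most documents but the lowest score, so it comes last.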

Related

How do the index and the inverted index work for facets in Solr?

I understand the theoretical concepts of indexes and inverted indexes. Primarily, Solr indexes documents using an inverted index (searching tokens instead of documents).
I've also read that Solr uses indexing for features such as facets.
As I understand it, for facets,
searching for a term and creating facets would require Solr to search all the terms in a field and match all the retrieved documents containing the search term, which would be costly, so indexing is used.
From what I understand, the index is used once all the documents matching the search terms have been retrieved: they are traversed, and a count of the unique values in the facet fields is calculated.
Is this a correct understanding of the concept, or is there something else?
There is more than one way faceting works in Solr.
Solr has a heuristic to select the best method, but there is also the facet.method parameter to choose it yourself.
Your description is mostly right, but Solr is fast because it caches the UnInvertedField instead of reading the values from the inverted index on each request.
With DocValues there is also an efficient storage format for an uninverted field.
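As a rough illustration of the difference (a toy model, not Solr's actual data structures), the sketch below builds an inverted index, derives an uninverted doc-to-terms view from it once, and then counts facets for a result set in two ways. The documents and values are made up; the two functions only loosely mirror the idea behind facet.method=enum (intersect each term's postings with the result set) and facet.method=fc (walk the matching documents through the uninverted view).

```python
from collections import Counter, defaultdict

# Hypothetical documents: doc id -> values of one facet field.
docs = {
    1: ["red", "small"],
    2: ["red"],
    3: ["blue", "small"],
    4: ["blue"],
}

# Inverted index: term -> set of doc ids (the structure used for searching).
inverted = defaultdict(set)
for doc_id, values in docs.items():
    for value in values:
        inverted[value].add(doc_id)

# Uninverted view: doc id -> terms, built once from the inverted index and kept.
uninverted = defaultdict(list)
for term, doc_ids in inverted.items():
    for doc_id in doc_ids:
        uninverted[doc_id].append(term)


def facets_by_intersection(matching):
    """Per-term approach: intersect every term's postings with the result set."""
    counts = {}
    for term, doc_ids in inverted.items():
        hits = len(doc_ids & matching)
        if hits:
            counts[term] = hits
    return counts


def facets_by_uninverted(matching):
    """Per-document approach: look up each matching doc's terms and count them."""
    counts = Counter()
    for doc_id in matching:
        counts.update(uninverted[doc_id])
    return dict(counts)


matching = inverted["small"]  # documents matching a search for "small"
print(facets_by_intersection(matching))
print(facets_by_uninverted(matching))
```

Both functions return the same counts; which one is cheaper depends on how many distinct terms the field has compared with how many documents match.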
These answers may also help you:
How does Lucene/Solr achieve high performance in multi-field / faceted search?
Solr faceted search performance recommendations
http://de.slideshare.net/lucenerevolution/seeley-solr-facetseurocon2011

What indexer do I use to find the list in the collection that is most similar to my list?

Let's say I have my list of ingredients:
{'potato','rice','carrot','corn'}
and I want to return lists from a database that are most similar to mine:
{'beans','potato','oranges','lettuce'},
{'carrot','rice','corn','apple'}
{'onion','garlic','radish','eggs'}
My query would return this first:
{'carrot','rice','corn','apple'}
I've used Solr, and have looked at CloudSearch, ElasticSearch, Algolia, Searchify and Swiftype. These engines only seem to let me put in one query string and then filter by other facets.
In a real scenario my search list will be about 200 items long and will be matching against about a million lists in my database.
What technology should I use to accomplish what I want to do?
Should I look away from search indexers and more towards database-esque things like mongo, map reduce, hadoop... All I know are the names of other technologies and I just need someone to point me in the right direction on what technology path I should be exploring for this.
With so much data I can't really loop through it, I need to query everything at once.
I wonder what keeps you from trying it with Solr, as Solr provides much of what you need. You can declare the field as type="string" multiValued="true" and save each list item as a value. Then, when querying, you specify each of the items in the list to look for as a search term for that field, and Solr will, by default, return the closest match.
If you need exact control over what will be regarded as a match (e.g. at least 40% of the terms from the search list have to be in a matching list), you can use the mm EDisMax parameter, cf. Solr Wiki
Having said that, I must add that I've never searched for 200 query terms (do I understand correctly that the list whose contents should be searched will contain about 200 items?) and do not know how well that performs. But I guess that setting up a test core and filling it with random lists using a script should not take more than a few hours, so it should be possible to evaluate the performance of this approach without investing too much time.
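A minimal sketch of what such a request could look like, assuming a multiValued string field named ingredients and a unique key named id (both names are hypothetical). It only builds the query parameters for the EDisMax parser with mm set to 40%; sending them to a real core and measuring performance with 200 terms is left to the test setup described above.

```python
from urllib.parse import urlencode


def build_similarity_query(items, field="ingredients", min_match="40%"):
    """Build EDisMax parameters that treat every list item as one optional clause."""
    # Quote each item so that a multi-word value stays a single clause.
    clauses = " ".join('"{}"'.format(item.replace('"', '\\"')) for item in items)
    return {
        "defType": "edismax",  # use the Extended DisMax query parser
        "q": clauses,
        "qf": field,           # hypothetical field holding the list items
        "mm": min_match,       # minimum share of clauses that must match
        "fl": "id,score",      # "id" is an assumed unique key name
        "rows": 10,
    }


params = build_similarity_query(["potato", "rice", "carrot", "corn"])
query_string = urlencode(params)
print(query_string)
```

The resulting string would be appended to the core's select URL; results come back ordered by score, so lists sharing more items with the query should tend to rank higher.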

Siren "more like this" query

I am using the newest SIREn distribution for Solr to index my data and search it. (http://siren.solutions/siren/downloads/)
Is there a simple way to search for similar documents in my indexed data? Something similar to the MoreLikeThis query of Solr (https://cwiki.apache.org/confluence/display/solr/MoreLikeThis).
My goal is to find documents that have a JSON structure similar to the one I am interested in.
best,
Bernd
If I remember correctly, SIREn stores the RDF representation of each resource within a dedicated field of the Solr document. I don't think the default MLT component that comes with Solr works for your scenario.
I mean, enabling that component will produce some kind of result, but I don't believe that it will meet your JSON "similarity" requirement.
On top of that, I suggest you post your request on the SIREn mailing list [1]: I'm sure the dev team will point you in the right direction.
[1] https://groups.google.com/forum/m/#!forum/siren-user

Creating an ElasticSearch index mixing two CouchDB rivers

I am trying to index two different CouchDB databases with Elastic Search, using a single index, but it seems that only one of the two databases is actually indexed. The documents from the other database are simply not included in the index.
Is there any limitation about the number of rivers which can be connected to a given index?
I am unable to find any documentation about this specific use case ...
You should create two indexes, with one river for each.
AFAIK you cannot have multiple rivers for the same index. See the content of your _river index.
BTW, you should ask questions like this on the elasticsearch.org mailing list.
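As a sketch of the "one river per index" setup, the snippet below builds the two _meta documents that would be PUT under the _river index, following the layout used by the CouchDB river plugin as far as I remember it. The database, river, and index names are hypothetical, and the host and port are assumed defaults; nothing is sent over the network here.

```python
import json


def couchdb_river_meta(db, index, host="localhost", port=5984):
    """Return the _meta body for one CouchDB river feeding one index."""
    return {
        "type": "couchdb",
        "couchdb": {"host": host, "port": port, "db": db},
        # Each river writes to its own index (hypothetical names).
        "index": {"index": index, "type": db},
    }


# Path to PUT -> request body, one river per database and per index.
rivers = {
    "_river/orders_river/_meta": couchdb_river_meta("orders", "orders_idx"),
    "_river/customers_river/_meta": couchdb_river_meta("customers", "customers_idx"),
}

for path, body in rivers.items():
    print("PUT /" + path)
    print(json.dumps(body, indent=2))
```

Both indexes can still be queried together by listing them in the search URL, for example /orders_idx,customers_idx/_search.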

Sphinx Search Multiple Indexes & Sources

I'm building a dynamic CMS, so every instance of the CMS will have its own tables in one MySQL DB. So far everything is working.
The environment:
8 different sites with different content. They share only the DB name, but each has different tables ($sitename_posts).
Search engine: Sphinx
Now I'm stuck at this: when, for example, a user makes a search on site 1, I want to search all the $sitename_posts tables and return the best results.
As the search engine I use Sphinx. I have tried it with two sources and two indexes, but when I search, for example:
$sphinx = new SphinxClient;
$sphinx->setServer($sphinx_host, $sphinx_port);
$sphinx->setMatchMode(SPH_MATCH_ANY);
$sphinx->setMaxQueryTime(10000);
$sphinx->SetSortMode(SPH_SORT_EXTENDED, '@relevance DESC');
$sphinx->SetLimits(0, 100, 300);
$result = $sphinx->query("Hello World", "index1 index2");
I get no results. But if I build only one index with multiple sources, I get results, but I can't identify which source the data came from, so I can't tell which site the content belongs to.
One more question: when I search the indexes, is it possible for Sphinx to return the ID and the index that ID belongs to? I need to identify which index each result belongs to.
Thanks for help!
If I understand the question correctly, it's worth looking into the following Sphinx features:
Distributed Indexes - This would allow you to have one index per site and also have a "virtual" distributed index which you could search from the application when you want to get data.
Index Merging - This is more permanent than the distributed index option but the indexer is able to merge multiple indexes into a single index. I would usually prefer to use distributed indexes.
Attributes - This would allow you to include a constant value in each of the indexes (e.g. siteId) which would allow you to identify which of the indexes the search result came from. It could also allow you to filter results when searching from the single distributed index.
Sphinx docs - http://sphinxsearch.com/docs/2.0.1/
Distributed indexes explained - http://sphinxsearch.com/docs/2.0.1/distributed.html
Configuring distributed indexes - http://sphinxsearch.com/docs/2.0.1/confgroup-index.html
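To make the combination of per-site indexes, a distributed index, and a site attribute more concrete, here is a small sketch that renders a sphinx.conf fragment for a list of sites. The source, index, and column names (title, content, site_id), the index path, and the id * 100 + site_id scheme for keeping document IDs unique across sites are all assumptions for illustration; connection settings are left out.

```python
def render_sphinx_conf(sites):
    """Render one source and index per site, plus a distributed index over all of them."""
    blocks = []
    for site_id, site in enumerate(sites, start=1):
        blocks.append(
            f"source {site}_src\n"
            "{\n"
            "    type = mysql\n"
            "    # connection settings (sql_host, sql_user, sql_pass, sql_db) omitted\n"
            # The first column is the document ID; site_id is stored as an attribute.
            f"    sql_query = SELECT id * 100 + {site_id} AS id, {site_id} AS site_id, "
            f"title, content FROM {site}_posts\n"
            "    sql_attr_uint = site_id\n"
            "}\n"
            f"index {site}_idx\n"
            "{\n"
            f"    source = {site}_src\n"
            f"    path = /var/lib/sphinx/{site}_idx\n"
            "}\n"
        )
    local_lines = "".join(f"    local = {site}_idx\n" for site in sites)
    blocks.append("index all_sites\n{\n    type = distributed\n" + local_lines + "}\n")
    return "\n".join(blocks)


conf = render_sphinx_conf(["site1", "site2"])
print(conf)
```

With a layout like this, the application would query all_sites, read site_id from each match's attributes to see which site the result belongs to, and could restrict a search to one site with a filter on site_id.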

Resources