Sphinx Search Multiply Indexes & Sources - search

Im making a dynamic CMS, so every instance of the CMS will have its on tables in one MYSQL DB. So far all is working.
The Envorioment:
8 different Sites with different content. they share only the DB name but all have differenttables ($sitename_posts)
search enigne SPHINX
Now im stuck at this: when for example user makes a search on site 1 i want search all tables $sitename_posts and return the best results.
As search engine i use sphinx. I have tried it with two sources and two indexes but when i search for example:
$sphinx = new SphinxClient;
$sphinx->setServer($sphinx_host, $sphinx_port);
$sphinx->setMatchMode(SPH_MATCH_ANY);
$sphinx->setMaxQueryTime(10000);
$sphinx->SetSortMode(SPH_SORT_EXTENDED, '#relevance DESC');
$sphinx->SetLimits(0, 100, 300);
$result = $sphinx->query("Hello World", (index1 index2);
I get no results. But if i build only one INDEX and multiply sources i get results, but i cant identify from which source i get the data, so i cant judge to which site the content belongs.
One more question is when i search the indexes, is it possible, that sphinx returns, the ID and to what index that id belongs? Cause i need to indentify which index belongs to which result.
Thanks for help!

If I understand the question correctly I it's worth you looking into the following Sphinx features:
Distributed Indexes - This would allow you to have one index per site and also have a "virtual" distributed index which you could search from the application when you want to get data.
Index Merging - This is more permanent than the distributed index option but the indexer is able to merge multiple indexes into a single index. I would usually prefer to use distributed indexes.
Attributes - This would allow you to include a constant value in each of the indexes (e.g. siteId) which would allow you to identify which of the indexes the search result came from. It could also allow you to filter results when searching from the single distributed index.
Sphinx docs - http://sphinxsearch.com/docs/2.0.1/
Distributed indexes explained - http://sphinxsearch.com/docs/2.0.1/distributed.html
Configuring distributed indexes - http://sphinxsearch.com/docs/2.0.1/confgroup-index.html

Related

Designing Twitter Search - How to sort large datasets?

I'm reading an article about how to design a Twitter Search. The basic idea is to map tweets based on their ids to servers where each server has the mapping
English word -> A set of tweetIds having this word
Now if we want to find all the tweets that have some word all we need is to query all servers and aggregate the results. The article casually suggests that we can also sort the results by some parameter like "popularity" but isn't that a heavy task, especially if the word is an hot word?
What is done in practice in such search systems?
Maybe some tradeoff are being used?
Thanks!
First of all, there are two types of indexes: local and global.
A local index is stored on the same computer as tweet data. For example, you may have 10 shards and each of these shards will have its own index; like word "car" -> sorted list of tweet ids.
When search is run we will have to send the query to every server. As we don't know where the most popular tweets are. That query will ask every server to return their top results. All of these results will be collected on the same box - the one executing the user request - and that process will pick top 10 of of entire population.
Since all results are already sorted in the index itself, it is a O(1) operation to pick top 10 results from all lists - as we will be doing simple heap/watermarking on set number of tweets.
Second nice property, we can do pagination - the next query will be also sent to every box with additional data - give me top 10, with popularity below X, where X is the popularity of last tweet returned to customer.
Global index is a different beast - it does not live on the same boxes as data (it could, but does not have to). In that case, when we search for a keyword, we know exactly where to look for. And the index itself is also sorted, hence it is fast to get top 10 most popular results (or get pagination).
Since the global index returns only tweet Ids and not tweet itself, we will have to lookup tweets for every id - this is called N+1 problem - 1 query to get a list of ids and then one query for every id. There are several ways to solve this - caching and data duplication are by far most common approaches.

SOLR sort by grouping num found

Hello StackOverflow,
Environments: Solr 7.5.0|SolrCloud 7.5.0, Custom search index.
According to requirements for search, it should group search response by specific field and sort groups by score and by numFound group value then
I saw some links that noticed that it is not possible or it is possible through two queries using facets (implemented currently). But they are pretty old and probably something changed.
For example:
how to order groups by count in solr
Solr 6: how to group by title and return groups that contain minimum 5 numFound
https://issues.apache.org/jira/browse/SOLR-9678?jql=text%20~%20%22numFound%22
Also, heard that this purpose can be reached through SOLR subqueries. But also no luck to find a solution for this.
Will appreciate any help.

Can you find a specific documents position in a sorted Azure Search index

We have several Azure Search indexes that use a Cosmos DB collection of 25K documents as a source and each index has a large number of document properties that can be used for sorting and filtering.
We have a requirement to allow users to sort and filter the documents and then search and jump to a specific documents page in the paginated result set.
Is it possible to query an Azure Search index with sorting and filtering and get the position/rank of a specific document id from the result set? Would I need to look at an alternative option? I believe there could be a way of doing this with a SQL back-end but obviously that would be a major undertaking to implement.
I've yet to find a way of doing this other than writing a query to paginate through until I find the required document which would be a relatively expensive and possibly slow task in terms of processing on the server.
There is no mechanism in Azure Search for filtering within the resultset of another query. You'd have to page through results, looking for the document ID on the client side. If your queries aren't very selective and produce many pages of results, this can be slow as $skip actually re-evaluates all results up to the page you specify.
You could use caching to make this faster. At least one Azure Search customer is using Redis to cache search results. If your queries are selective enough, you could even cache the results in memory so you'd only pay the cost of paging once.
Trying this at the moment. I'm using a two step process:
Generate your query but set $count=true and $top=0. The query result should contain a field named #odata.count.
You can then pick an index, then use $top=1 and $skip=<index> to return a single entry. There is one caveat: $skip will only accept numbers less than 100000

Creating an ElasticSearch index mixing two CouchDB rivers

I am trying to index two different CouchDB databases with Elastic Search, using a single index, but it seems that only one of the two databases is actually indexed. The documents from the other database are simply not included in the index.
Is there any limitation about the number of rivers which can be connected to a given index?
I am unable to find any documentation about this specific use case ...
You should create two index with one river for each.
Afaik you can not have multiple rivers for the same index. See the content of your _river index.
BTW you should ask questions to the elasticsearch.org mailing list.

Why should (or shouldn't) a Search Query return back only document IDs?

So for a new project, I'm building a system for an ecommerce site. The idea is to import products from suppliers and instead of inserting them directly into our catalog, we would store all the information in a staging area. Each supplier has their own stage (i.e. table in the database), and then I will flatten the multiple staging areas into a single entity (currently a single table but later on perhaps into Sphinx or Solr). Then our merchandisers would be able to search the staging products' relevant fields (name and description) and be shown a list of products that match and then choose to have those products pushed into the live catalog. The search will query on the single table (the flattened staging areas).
My design calls to only store searchable and filterable fields in the single flattened table - e.g. name, description, supplier_id, supplier_prod_id etc. And the search queries will return only the ID's of the items matching and a class (supplier_id) that would be used to identify which staging area the product is from.
Another senior engineer feels the flattened search table should include other meta fields (which would not be searched on), but could be used when 'pushing' the products from stage to live catalog. He also feels that the query should return all this other information.
I feel pretty strongly about only having searchable fields in the flattened table and having the search return only class/id pairs which could be used to fetch all the other necessary metadata about the product (simple select * from class_table where id in (1,2,3)).
Part of my reasoning is that this will make it easier later on to switch the flattened table from database to a search server like sphinx or solr and the rest of the code wouldn't have to be changed just because implementation of the search changed.
Am I on the right path? How can I convince the other engineer why it is important to keep only searchable fields and return only ID's? Or more specifically, why should a search application return only IDs of objects?
I think that you're on the right path. If those other fields provide no value to either uniquely identify a staged item or to allow the user to filter staged items, then the data is fundamentally useless until the item is pushed to the live environment. If the other engineer feels that the extra metadata will help the users make a more informed decision, then you might as well make those extra fields searchable (thereby meeting your stated purpose for the table(s).)
The only reason I could think of to pre-fetch that other, non-searchable data would be for a performance improvement on the push to the live environment.
You should use each tool for what it does best. A full text search engine, such as Solr or Sphinx, excels at searching textual fields and ranking the hits quickly. It has no special advantage in retrieving stored data in a select-like fashion. A database is optimized for that. So, yes, you are on the right path. Please see Search Engine versus DBMS for other issues involved in deciding what to store inside the search engine.
In the case of sphinx, it only returns document ids and named attributes back to you anyway (attributes being numerical data, for the most part). I'd say you've got the right idea as the other metadata is just a simple JOIN away from the flattened table if you need it.
You can regard Solr as a powerfull index, so as an index gives IDs back, it would be logical that solr does the same.
You can use the solr query parameter fl to ask for identifier only results, for instance fl=id.
However, there's a feature that needs solr to give you back some data too: the highlighting of search terms in the matched documents. If you don't need it, then using solr to retrieve the identifiers only is fine (I assume you need only the documents list, and no other features, like facets, related docs or spell checking).
That said, it should matter how you build your objects in your search function, either from the DB using uniquely solr to retrieve IDs or from solr returned fields (providing they're stored) or even a mix of both. Think solr to get the 'highlighted' content fields and DB for the other ones. Again if you don't need highlighting, this is not an issue.
I'm using Solr with thousands of documents but only return the ids for the following reasons :
For Solr :
- if some sync mistake append, it's not a big deal (especially in your case, displaying a different price can be a big issue... it's like the item will not be in the right place, but the data are right)
- you will save a lot of time because when you don't ask Solr to return the 'description' of documents (I mean many lines of text)
For your DB :
- you can cache your results, so it's even faster with an ID (you don't need all the data from Solr everytime !!!)
- you build you results in the same way (you don't need a specific method when you want to build html from Solr, and an other method from your DB)
I think there is a lot more...

Resources