Designing Twitter Search - How to sort large datasets?

I'm reading an article about how to design Twitter Search. The basic idea is to map tweets based on their ids to servers, where each server has the mapping
English word -> a set of tweetIds containing this word
Now if we want to find all the tweets that contain some word, all we need to do is query all servers and aggregate the results. The article casually suggests that we can also sort the results by some parameter like "popularity", but isn't that a heavy task, especially if the word is a hot word?
What is done in practice in such search systems?
Maybe some tradeoffs are being used?
Thanks!

First of all, there are two types of indexes: local and global.
A local index is stored on the same machine as the tweet data. For example, you may have 10 shards, and each of these shards will have its own index, e.g. word "car" -> sorted list of tweet ids.
When a search runs, we have to send the query to every server, since we don't know where the most popular tweets are. That query asks every server to return its top results. All of these results are collected on one box, the one executing the user request, and that process picks the top 10 out of the entire population.
Since all results are already sorted in the index itself, picking the top 10 across all lists is effectively an O(1) operation: we do a simple heap/watermark merge over a fixed number of tweets.
A second nice property: we can do pagination. The next query is also sent to every box with additional data: "give me your top 10 with popularity below X", where X is the popularity of the last tweet returned to the customer.
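To make that scatter-gather merge and its pagination cursor concrete, here is a minimal Python sketch; the shard data, popularity scores, and function names are illustrative, not from the article:

import heapq

# Each shard returns tweets pre-sorted by descending popularity
# as (popularity, tweet_id) tuples.
def merge_top_k(shard_results, k=10, below_popularity=None):
    # heapq.merge does a streaming k-way merge of already-sorted inputs,
    # so we touch only as many entries as needed to fill one page.
    merged = heapq.merge(*shard_results, key=lambda t: -t[0])
    page = []
    for popularity, tweet_id in merged:
        # below_popularity is the pagination cursor: only tweets strictly
        # less popular than the last result of the previous page pass.
        if below_popularity is not None and popularity >= below_popularity:
            continue
        page.append((popularity, tweet_id))
        if len(page) == k:
            break
    return page

shards = [
    [(98, "t1"), (40, "t7")],
    [(95, "t2"), (60, "t5"), (10, "t9")],
    [(70, "t3"), (55, "t6")],
]
print(merge_top_k(shards, k=3))                       # first page
print(merge_top_k(shards, k=3, below_popularity=70))  # next page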
A global index is a different beast: it does not live on the same boxes as the data (it could, but does not have to). In that case, when we search for a keyword, we know exactly where to look. And the index itself is also sorted, so it is fast to get the top 10 most popular results (or to paginate).
Since the global index returns only tweet ids and not the tweets themselves, we have to look up the tweet for every id. This is the classic N+1 problem: one query to get a list of ids, then one query for every id. There are several ways to solve this; caching and data duplication are by far the most common approaches.
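One common shape for softening that N+1 lookup is a single batched multi-get backed by a cache. A minimal sketch, assuming a hypothetical tweet_store with a multi_get method and a cache with get/set (none of these names come from the article):

# Resolve tweet ids to tweet bodies in one batched round trip.
def fetch_tweets(tweet_ids, cache, tweet_store):
    tweets = {tid: cache.get(tid) for tid in tweet_ids}
    missing = [tid for tid, body in tweets.items() if body is None]
    if missing:
        # One multi-get for all cache misses instead of one query per id.
        for tid, body in tweet_store.multi_get(missing).items():
            cache.set(tid, body)
            tweets[tid] = body
    return [tweets[tid] for tid in tweet_ids]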

Related

Using top with Azure Search Suggestions

I am building a search page with Azure Search. On my page, I have a search box. I want to provide suggestions to the users. In an attempt to do this, I'm using the Suggestions endpoint on my index. At this time, I have a request that includes the following query string:
search=sta&suggesterName=sites&$top=3
My question is, how does $top determine which three results to return? Is it the first three matches it encounters when going through the search index, or is it something else? Based on the URL structure, I don't think it's using a scoring profile, so I ruled out relevance. But then I started reading about the minimumCoverage field and got confused.
If the Suggest endpoint just returns the first $top matches it encounters, then why is the minimumCoverage field even needed?
In general, $top will give you the top N results based on whatever order the rest of the query specifies. For queries with no $orderby, the sort order is descending by relevance score. This applies to both Suggest and Search.
Note that just because you don't have a scoring profile (such as with Suggest), that doesn't mean Azure Search doesn't calculate relevance scores for each document. Scoring profiles can influence the score, but they do not completely define it.
For queries with an $orderby, the order of results is defined first by the fields in the $orderby, and then by score if there are any ties to be broken.
minimumCoverage has nothing to do with ordering or $top. It has to do with the way search queries are distributed: every query is executed concurrently against different subsets of the index (this happens regardless of whether or not you have multiple search units). Sometimes one of these subsets fails to execute for whatever reason, usually when your search service is under heavy load. The minimumCoverage parameter provides a way to relax the rule that normally says "X% of the index must successfully execute the query in order to consider the overall query a success" (X is 100 by default for Search and 80 by default for Suggest). This is a way to trade off completeness of search results for higher availability in the face of heavy load or partial outages.
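For reference, a Suggest request with both knobs spelled out might look like this in Python; the service name, index name, API version, and key are placeholders, not values from the question:

import requests

params = {
    "api-version": "2020-06-30",
    "search": "sta",
    "suggesterName": "sites",
    "$top": 3,
    # Accept partial results if at least 50% of the index subsets respond
    # (the Suggest default is 80).
    "minimumCoverage": 50,
}
resp = requests.get(
    "https://my-service.search.windows.net/indexes/my-index/docs/suggest",
    params=params,
    headers={"api-key": "<query-key>"},
)
print(resp.json())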

What indexer do I use to find the list in the collection that is most similar to my list?

Let's say I have my list of ingredients:
{'potato','rice','carrot','corn'}
and I want to return lists from a database that are most similar to mine:
{'beans','potato','oranges','lettuce'},
{'carrot','rice','corn','apple'},
{'onion','garlic','radish','eggs'}
My query would return this first:
{'carrot','rice','corn','apple'}
I've used Solr, and have looked at CloudSearch, ElasticSearch, Algolia, Searchify and Swiftype. These engines only seem to let me put in one query string and then filter by other facets.
In a real scenario my search list will be about 200 items long and will be matching against about a million lists in my database.
What technology should I use to accomplish what I want to do?
Should I look away from search indexers and more towards database-esque things like Mongo, MapReduce, Hadoop...? All I know are the names of these technologies, and I just need someone to point me in the right direction on which technology path I should be exploring for this.
With so much data I can't really loop through it; I need to query everything at once.
I wonder what keeps you from trying it with Solr, as Solr provides much of what you need. You can declare the field as type="string" multiValued="true" and save each list item as a value. Then, when querying, you specify each of the items in the list to look for as a search term for that field, and Solr will, by default, return the closest match first.
If you need exact control over what is regarded as a match (e.g. at least 40% of the terms from the search list have to be in a matching list), you can use the mm eDisMax parameter; cf. the Solr Wiki.
Having said that, I must add that I've never searched with 200 query terms (do I understand correctly that the list whose contents should be searched for will contain about 200 items?) and do not know how well that performs. But I guess that setting up a test core and filling it with random lists using a script should not take more than a few hours, so it should be possible to evaluate the performance of this approach without investing too much time.
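Such a test query could be sketched in Python as follows; the core name ("recipes"), field name ("ingredients"), and host are assumptions for illustration:

import requests

my_list = ["potato", "rice", "carrot", "corn"]
params = {
    "defType": "edismax",
    "qf": "ingredients",
    "q": " ".join(my_list),  # each list item becomes a search term
    "mm": "40%",             # at least 40% of the terms must match
    "fl": "id,ingredients,score",
    "rows": 10,
    "wt": "json",
}
resp = requests.get("http://localhost:8983/solr/recipes/select", params=params)
print(resp.json()["response"]["docs"])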

Personalized Search Results for Elasticsearch

How would one go about setting up Elasticsearch so that it returns personalized results?
For example, I would want results returned to a particular user to rank higher if they clicked on a result previously, or if they "starred" that result in the past. You could also have a "hide" option that pushes results further down the ranking. From what I've seen of Elasticsearch so far, it seems difficult to return different rankings to users based on each user's own dynamic data.
The solution would have to scale to thousands of users doing a dozen or so searches per day. Ideally, I would like the ranking to change in real-time, but it's not critical.
Elasticsearch provides a wide variety of scoring options, but to achieve what you describe you will need to do some additional work.
The function score query and the terms lookup filter would be the tools of choice.
First, create a document per user recording the links (or link ids) the user has visited and the links he has liked. This should be housed in a separate index and maintained from the client side, as the user updates this record by interacting with results.
Now when a user hits the data index, run a function score query with a filter function pointing to these fields.
In this approach, as the filter is cached, you should get decent performance too.
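A sketch of such a query as a Python dict, assuming a user_prefs index with one preference document per user (holding starred_ids and hidden_ids arrays) and a main links index whose documents carry a link_id field; all names are illustrative:

personalized_query = {
    "query": {
        "function_score": {
            "query": {"match": {"title": "some search terms"}},
            "functions": [
                {
                    # Boost starred links: the terms lookup pulls the id list
                    # from the user's preference document at query time.
                    "filter": {"terms": {"link_id": {
                        "index": "user_prefs", "id": "user-42",
                        "path": "starred_ids"}}},
                    "weight": 2.0,
                },
                {
                    # Demote hidden links.
                    "filter": {"terms": {"link_id": {
                        "index": "user_prefs", "id": "user-42",
                        "path": "hidden_ids"}}},
                    "weight": 0.1,
                },
            ],
            "score_mode": "multiply",
        }
    }
}
# e.g. es.search(index="links", body=personalized_query) with the official client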

Range-based, chronological pagination queries across multiple collections with MongoDB?

Is there an efficient way to do a range-based query across multiple collections, sorted by an index on timestamps? I basically need to pull in the latest 30 documents from 3 collections, and the obvious way would be to query each of the collections for the latest 30 docs, then filter and merge the result. However, that's somewhat inefficient.
Even if I were to select only the timestamp field in the first query and then do a second batch of queries for the latest 30 docs, I'm not sure that would be a better approach. That would be 90 documents (whole or single-field) per pagination request.
Essentially, the client can be subscribed to articles, and each category of article differs by 0-2 fields. I just picked 3 since that is the average number of articles that users are subscribed to so far in the beta. Because of the possible field differences, I didn't think it would be very consistent to put all of the articles of different types in a single collection.
MongoDB queries operate on one and only one collection at a time. Thus you need to structure your schema with collections that match your query needs.
Option A: Get Ids from supporting collection, load full docs, sort in memory
You need a supporting collection that combines the ids, main collection names, and timestamps of the 3 collections into a single collection. Query it to get your 30 id/collection pairs, then load the corresponding full documents with 3 additional queries (one to each main collection). Of course, those won't come back in the correct combined order, so you need to sort that page of results in memory before returning it to your client. A document in the supporting collection would look like:
{
  _id: ObjectId,
  updated: Date,
  type: String
}
This way lets Mongo do the pagination for you.
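A pymongo sketch of this option, with feed_index as a hypothetical name for the supporting collection:

from pymongo import DESCENDING, MongoClient

db = MongoClient()["mydb"]

# Page of 30 (id, source collection) pairs, newest first.
page = list(
    db.feed_index.find({}, {"_id": 1, "type": 1})
    .sort("updated", DESCENDING)
    .limit(30)
)

# Group ids by source collection, then one batched query per collection.
ids_by_type = {}
for entry in page:
    ids_by_type.setdefault(entry["type"], []).append(entry["_id"])

docs = {}
for coll_name, ids in ids_by_type.items():
    for doc in db[coll_name].find({"_id": {"$in": ids}}):
        docs[doc["_id"]] = doc

# Restore the combined chronological order from the index page.
results = [docs[e["_id"]] for e in page if e["_id"] in docs]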
Option B: 3 Queries, Union, Sort, Limit
Or, as you said, load 30 documents from each collection, sort the union set in memory, drop the extra 60, and return the combined result. This avoids the overhead of the extra collection and its synchronization maintenance.
So I would think your current approach (Option B as I call it) is the lesser of those 2 not-so-great options.
If your query is really to get the most recent articles based on a selection of categories, then I'd suggest you:
A) Store all of the documents in a single collection so a single query can fetch a combined paged result. Unless you have a very consistent date range across collections, you'll need to track date ranges and counts so that you can reasonably fetch a set of documents that can be merged; 30 from one collection may be older than all from another. You can add a compound index on timestamp and category and then limit the results.
B) Cache everything aggressively so that you rarely need to do the merges
You could use the same idea I explained here; although the post is related to MongoDB text search, it applies to any kind of query:
MongoDB Index optimization when using text-search in the aggregation framework
The idea is to query all your collections, ordering by date and id, then sort/merge the results in order to return the first page. Subsequent pages are retrieved by using the last document's date and id from the previous page.
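A pymongo sketch of that date-plus-id cursor across three collections (collection and field names are assumptions):

from pymongo import DESCENDING, MongoClient

db = MongoClient()["mydb"]
COLLECTIONS = ["articles_a", "articles_b", "articles_c"]
PAGE_SIZE = 30

def fetch_page(last_date=None, last_id=None):
    query = {}
    if last_date is not None:
        # Strictly older than the last document of the previous page,
        # with _id as the tie-breaker for equal timestamps.
        query = {"$or": [
            {"date": {"$lt": last_date}},
            {"date": last_date, "_id": {"$lt": last_id}},
        ]}
    candidates = []
    for name in COLLECTIONS:
        candidates.extend(
            db[name].find(query)
            .sort([("date", DESCENDING), ("_id", DESCENDING)])
            .limit(PAGE_SIZE)
        )
    # Union, sort, limit: keep the newest PAGE_SIZE across all collections.
    candidates.sort(key=lambda d: (d["date"], d["_id"]), reverse=True)
    return candidates[:PAGE_SIZE]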

Multiple queries in Solr

My problem: I have n fields (say around 10) in Solr that are searchable; they are all indexed and stored. I would like to run a query first on my whole index of say 5000 docs, which will hit around 500 docs on average. Next I would like to query using a different set of keywords on these 500 docs and NOT on the whole index.
So the first time I send a query a score will be generated, the second time I run a query the new score generated should be based on the 500 documents of the previous query, or in other words Solr should consider only these 500 docs as the whole index.
To summarise: an index of 5000 will be filtered to 500 and then to 50 (5000 > 500 > 50). It's basically filtering, but I would like to do this in Solr.
I have reasonable basic knowledge and am still learning.
Update: If represented mathematically it would look like this:
results1=f(query1)
results2=f(query2, results1)
final_results=f(query3, results2)
I would like this to be accomplished programmatically, and the end-user will only see the 50 results. So faceting is not an option.
Two likely implementations occur to me. The simplest approach would be to just add the first query to the second query:
+(first query) +(new query)
This is a good approach if the first query, which you want to filter on, changes often. If the first query is something like a category of documents, or something similar where you can benefit from reuse of the same filter, then a filter query is the better approach, using the fq parameter, something like:
q=field:query2&fq=categoryField:query1
Filter queries cache a set of document ids to filter against, so for commonly used searches, like categories, common date ranges, etc., a significant performance benefit can be gained. (For uncommon searches, or user-entered search strings, it may just incur needless overhead to cache the results and pollute the cache with a useless result set.)
Filter queries (fq) are specifically designed to do quick restriction of the result set by not doing any score calculation.
So, if you put your first query into fq parameter and your second score-generating query in the normal 'q' parameter, it should do what you ask for.
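Putting the 5000 > 500 > 50 funnel together as one request, sketched in Python (field names and core name are illustrative):

import requests

params = {
    "q": "field3:query3",        # final stage: the only query that is scored
    "fq": ["field1:query1",      # stage 1: 5000 -> ~500, cached, no scoring
           "field2:query2"],     # stage 2: narrows the candidates further
    "rows": 50,
    "wt": "json",
}
resp = requests.get("http://localhost:8983/solr/mycore/select", params=params)
print(resp.json()["response"]["numFound"])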
See also a question discussing this issue from the opposite direction.
I believe you want to use a nested query like this:
text:"roses are red" AND _query_:"type:poems"
You can read more about nested queries here:
http://searchhub.org/2009/03/31/nested-queries-in-solr/
You should take a look at "faceted search" in Solr: http://wiki.apache.org/solr/SolrFacetingOverview. It will help with this kind of "iterative" search.
