design a search function for high scale [closed] - search

Closed. This question needs to be more focused. It is not currently accepting answers.
Want to improve this question? Update the question so it focuses on one problem only by editing this post.
Closed 8 years ago.
Improve this question
(not sure if this is the right forum for this question)
I am very curious about how search in major site, say youtube/quora/stackexcahnge, works?
And I'm NOT looking for an answer like 'They Use Lucene Search engine'. I want to understand exactly how the indexing works there.
Is there a different Index for text search than the autocomplete feature?
Is it done in the background like map reduce.
How exactly does map reduce help deliver results? (I know that it counts words in each document but what happens after that when I search for a keyword?)
I also heard that google stopped using map reduce and now using cloud dataFlow here - how does that work?
Help Please :-)

I voted to close, because I think your question is too broad. Each bullet could form the basis of an SO question. That stated, I'll take a crack at answer how SolrCloud attempts to solve each of the problems you are asking about:
Is there a different Index for text search than the autocomplete feature?
The short answer is "yes". Solr has several options for implementing an autocomplete feature and all of them rely on either building a separate index or being supplied a separate dictionary. You can also roll your own in an even more sophisticated fashion as the blog post "Super flexible AutoComplete with Solr" demonstrates.
Is it done in the background like map reduce?
Generally speaking no. SolrCloud is based on the idea of shards with leaders and replicas. A shard being a subset of your overall index. With a shard being comprised of a leader and possibly one or more replicas.
Queries are executed against all shard leaders. With assigning a particular shard to serve as the aggregator of each shard's response, but unlike map reduce where the individual node responses have all the data the reducing node needs, the aggregating Solr shard may make multiple requests back to the other shards to figure out sort order - for example.
How exactly does map reduce help deliver results? (I know that it counts words in each document but what happens after that when I search for a keyword?)
See my response to your previous question. In short the query is executed against each shard, aggregated by one of those shards, and returned to the requestor. What Solr does - Lucene really - that's the useful magic part that people most often associate with it is Term Frequency Inverse Document Frequency indexing usually with stemming on text searches. While this is not exactly what happens under the hood, and you can vary what's actually done via configuration, it provides a fairly good idea of what's being done.
Other searching, on dates and numbers, or simple textual values is done in a fashion similar to database indexing. That is a simplification, if you want to understand it more fully read the JavaDoc on NumericRangeQuery for an in-depth explanation.
I also heard that google stopped using map reduce and now using cloud dataFlow here - how does that work?
If I knew the answer to that I would probably be working for Google and not answering StackOverflow questions :). Seriously whatever they've built is new PhD level work that as far as I know they haven't even release a research paper on, which is what they did with map reduce that led to Yahoo building Hadoop.

Related

How to extract categories out of short text documents?

My data contains the answers to the open-ended question: what are the reasons for recommending the organization you work for?
I want to use an algorithm / technique that, using this data, learns the categories (i.e. the reasons) that occur most frequently, and that a new answer to this question can be placed in one of these categories automatically.
I initially thought of topic modeling (for example LDA), but the text documents are very short in this problem (mostly between the 1 and 10 words per document). Therefore, is this an appropriate method? Or are there other models that are suitable for this? Perhaps a cluster method?
Note: the text is in Dutch
No, clustering will work even worse.
It can't do magic.
You'll need to put in additional information, such as labels to solve this problem - use classification.
Find the most common terms that clearly indicate one reason or another and begin labeling posts.

Event-Sourcing: Aggregate Roots and Performance

I'm building a StackOverflow clone using event-sourcing. The MVP is simple:
Users can post a question
Users can answer a question
Users can upvote and downvote answers to non-closed questions
I've modeled the question as the aggregate root. A question can have zero or more answers and an answer can have zero or more upvotes and downvotes.
This leads to a massive performance problem, though. To upvote an answer, the question (being the aggregate root) must be loaded which requires loading all of its answers. In non-event-sourced DDD, I would use lazy loading to solve this problem. But lazy loading in event-sourcing is non-trivial (http://docs.geteventstore.com/introduction/event-sourcing-basics/)
Is it correct to model the question as the aggregate root?
Firstly don't use lazy loading (while using ORM). You may find yourself in even worse situation, because of that, than waiting a little bit longer. If you need to use it, most of the times it means, that your model is just simply wrong.
You probably want to think about things like below:
How many answers to the question you expect.
What happens, if someone posted an answer while you were submiting yours. The same about upvotes.
Does upvote is just simply +1 and you don't care about it anymore or you can find all upvotes for user and for example change them to downvote (upvotes are identified).
You probably want to go for separate aggregates, not because of performance problems, but because of concurrency problems (question 2).
According to performance and the way your upvote behave you may think about modeling it as value object. (question 3)
Go ahead and read it http://dddcommunity.org/library/vernon_2011/
True performance hit you may achieve by using cqrs read/write separation
http://udidahan.com/2009/12/09/clarified-cqrs/
With a simple read model it should not be a problem. What is the maximum number of answer you would expect for a question? Maybe a few hundred which is not a big deal with a denormalized data model.
The upvote event would be triggered by a very simple command with a handful of properties.
The event handler would most likely have to load the whole question. But it is very quick to load those records by the aggregate root ID and replay the events. If the number of events per question gets very high (due to answer edits, etc) you can implement snapshots instead of replaying every single event. And that process is done asynchronously which will make the read model "eventually consistent" with the event store.

Approximate string matching algorithms state-of-the-art [closed]

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
We don’t allow questions seeking recommendations for books, tools, software libraries, and more. You can edit the question so it can be answered with facts and citations.
Closed 7 years ago.
Improve this question
I seek a state of the art algorithms to approximate string matching.
Do you offer me references(article, thesis,...)?
thank you
You might have got your answer already but I want to convey my points on approximate string matching so that, others might benefit. I am speaking from my experience having worked to solve cloud service problems to handle really large scale requirements.
If we just want to talk about the Approximate String matching algorithms, then there are many.
Few of them are:
Jaro-Winkler, Edit distance(Levenshtein), Jaccard similarity, Soundex/Phonetics based algorithms etc.
A simple googling would give us all the details.
Irony is, they work while you try to match two given input strings. Alright theoretically and to demonstrate the way fuzzy or approximate string matching works.
However, grossly understated point is, how do we use the same in production settings. Not everybody that I know of who were scouting for an approximate string matching algorithm knew how they could solve the same in the production environment.
Assuming that we have a list of millions of names and if we want to search a given input name against all the entries in the list using one of the standard algorithms above would mean disaster.
A typical, edit distance algorithm has a time complexity of O(N^2) where N is the number of characters in a string. To scan the list of size M, the complexity would be O(M * N^2). This would mean very high hardware requirements and it just doesn't work in your favor regardless of how much h/w you want to stack up.
This is where we have to start thinking about other approaches.
One of the common approaches to solve such a problem in the production environment is to use a standard search engine like -
Apache Lucene.
https://lucene.apache.org/
Lucene indexing engine indexes the reference data(Called as documents) and input query can be fired against the engine. The results are returned which are ranked based on how close they are to the input.
This is close to how google search engine works. Googles crawles and index the whole web but you should have a miniature system mimicking what Google does.
This works for most of the cases including complicated name matching where the first, middle and the last names are interchanged.
You can select your results based on the scores emitted by Lucene.
While you mature in your role, you will start thinking about using hosted solutions like Amazon Cloudsearch which wraps the Solr and ElastiSearch for you. Of-course it uses Lucene underneath and keeps you independent of potential size of the index due to larger reference data which is used for indexing.
http://aws.amazon.com/cloudsearch/
You might want to read about Levenshtein distance.
http://en.wikipedia.org/wiki/Levenshtein_distance

Difference between a search engine's relevance rankings and a recommender system [closed]

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
Questions concerning problems with code you've written must describe the specific problem — and include valid code to reproduce it — in the question itself. See SSCCE.org for guidance.
Closed 9 years ago.
Improve this question
What's the difference between a search engine's relevance rankings and a recommender system?
Don't both try and achieve the same purpose, i.e. finding the most relevant items for the user?
There is a major difference between a search engine and a recommender system:
In a search engine, the user knows what he is looking for, and he makes the query ! For instance, I might wonder if I should go to see a movie, and search information about it, like actors and directors.
In a recommender system, the user isn't supposed to know what we are recommending to her. We match her tastes with neighbours or whatever algorithm you like, and find things that she would't have looked after, like a new movie!
One is more about information retrieval, while the other is more about information filtering and discovery.
No, there are at two differents levels of analysis.
A search engine, look into a collection of data to get data that match a query. Even if all result are identicals or the results doesn't change from day to day. Very like a special form of databases.
The recommender system, use informations about you to provide specific improved content about the searched data. Very like a servant that know you well and use search engine for you.
Beward, some tool that starts as web-search engine are now more like recommender system.

Counts of web search hits [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Want to improve this question? Update the question so it focuses on one problem only by editing this post.
Closed 2 years ago.
Improve this question
I have a set of search queries in the size of approx. 10 millions. The goal is to collect the number of hits returned by a search engine for all of them. For example, Google returns about 47,500,000 for the query "stackoverflow".
The problem is that:
1- Google API is limited to 100 query per day. This is far from being useful to my task since I would have to get lots of counts.
2- I used Bing API but it does not return an accurate number. Accureate in the sense of matching the number of hits shown in Bing UI. Has anyone came across this issue before?
3- Issuing search queries to a search engine and parsing the html is one solution but it results in CAPTCHA and does not scale to this number of queries.
All I care about is that the number of hits and I am open for any suggestion.
Well, I was really hoping that someone would answer this since this is something that I also was interested in finding out but since it doesn't look like anyone will I will throw in these suggestions.
You could set up a series of proxies that change their IP every 100 requests so that you can query google as seemingly different people (seems like a lot of work). Or you can download wikipedia and write something to parse the data there so that when you search a term you can see how many pages it falls in. Of course that is a much smaller dataset than the whole web but it should get you started. Another possible data source is the google n-grams data which you can download and parse to see how many books and pages the search terms fall in. Maybe a combination of these methods could boost the accuracy on any given search term.
Certainly none of these methods are as good as if you could just get the google page counts directly but understandably that is data they don't want to give out for free.
I see this is a very old question but I was trying to do the same thing which brought me here. I'll add some info and my progress to date:
Firstly, the reason you get an estimate that can change wildly is because search engines use probabilistic algorithms to calculate relevance. This means that during a query they do not need to examine all possible matches in order to calculate the top N hits by relevance with a fair degree of confidence. That means that when the search concludes, for a large result set, the search engine actually doesn't know the total number of hits. It has seen a representative sample though, and it can use some statistics about the terms used in your query to set an upper limit on the possible number of hits. That's why you only get an estimate for large result sets. Running the query in such a way that you got an exact count would be much more computationally intensive.
The best I've been able to achieve is to refine the estimate by tricking the search engine into looking at more results. To do this you need to go to page 2 of the results and then modify the 'first' parameter in the URL to go way higher. Doing this may allow you to find the end of the result set (this worked for me last year I'm sure although today it only worked up to the first few thousand). Even if it doesn't allow you to get to the end of the result set you will see that the estimate gets better as the query engine considers more hits.
I found Bing slightly easier to use in the above way - but I was still unable to get an exact count for the site I was considering. Google seems to be actively preventing this use of their engine which isn't that surprising. Bing also seems to hit limits although they looked more like defects.
For my use case I was able to get both search engines to fairly similar estimates (148k for Bing, 149k for Google) using the above technique. The highest hit count I was able to get from Google was 323 whereas Bing went up to 700 - both wildly inaccurate but not surprising since this is not their intended use of the product.
If you want to do it for your own site you can use the search engine's webmaster tools to view indexed page count. For other sites I think you'd need to use the search engine API (at some cost).

Resources