I have a set of approximately 10 million search queries. The goal is to collect the number of hits returned by a search engine for each of them. For example, Google returns about 47,500,000 results for the query "stackoverflow".
The problems are:
1- The Google API is limited to 100 queries per day, which is far too few for a task of this size.
2- I used the Bing API, but it does not return an accurate number (accurate in the sense of matching the hit count shown in the Bing UI). Has anyone come across this issue before?
3- Issuing queries to a search engine directly and parsing the HTML is another option, but it triggers CAPTCHAs and does not scale to this number of queries.
All I care about is the number of hits, and I am open to any suggestions.
Well, I was really hoping someone would answer this, since it is something I was also interested in finding out, but since it doesn't look like anyone will, I will throw in these suggestions.
You could set up a series of proxies that change their IP every 100 requests so that you can query Google as seemingly different users (that seems like a lot of work). Or you could download Wikipedia and write something to parse the data so that when you search for a term you can see how many pages it appears in. That is of course a much smaller dataset than the whole web, but it should get you started. Another possible data source is the Google n-grams data, which you can download and parse to see how many books and pages the search terms appear in. Maybe a combination of these methods could improve the accuracy for any given search term.
Certainly none of these methods is as good as getting the Google page counts directly, but understandably that is data they don't want to give out for free.
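If you go the Wikipedia route, a rough sketch of counting how many pages each term appears in might look like the following (the dump file name and the term list are placeholders, and a real run would also want to strip wiki markup first):

```python
# Hypothetical sketch: estimate "page counts" for a set of terms from a
# local Wikipedia XML dump instead of querying a search engine.
# Assumes a downloaded dump such as enwiki-latest-pages-articles.xml;
# the file name and TERMS below are placeholders.
import re
import xml.etree.ElementTree as ET
from collections import Counter

TERMS = {"stackoverflow", "lucene", "mapreduce"}   # your query terms
page_counts = Counter()

# iterparse streams the dump so the whole file is never held in memory
for _, elem in ET.iterparse("enwiki-latest-pages-articles.xml"):
    if elem.tag.endswith("}text") and elem.text:
        words = set(re.findall(r"[a-z0-9]+", elem.text.lower()))
        for term in TERMS & words:
            page_counts[term] += 1          # term occurs on this page
        elem.clear()                        # release the element as we go

for term, count in page_counts.most_common():
    print(f"{term}: appears on {count} pages")
```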
I see this is a very old question, but I was trying to do the same thing, which is what brought me here. I'll add some info and my progress to date:
Firstly, the reason you get an estimate that can change wildly is because search engines use probabilistic algorithms to calculate relevance. This means that during a query they do not need to examine all possible matches in order to calculate the top N hits by relevance with a fair degree of confidence. That means that when the search concludes, for a large result set, the search engine actually doesn't know the total number of hits. It has seen a representative sample though, and it can use some statistics about the terms used in your query to set an upper limit on the possible number of hits. That's why you only get an estimate for large result sets. Running the query in such a way that you got an exact count would be much more computationally intensive.
The best I've been able to achieve is to refine the estimate by tricking the search engine into looking at more results. To do this, go to page 2 of the results and then modify the 'first' parameter in the URL to a much higher value. Doing this may allow you to find the end of the result set (this worked for me last year, although today it only worked up to the first few thousand results). Even if it doesn't get you to the end of the result set, you will see that the estimate gets better as the query engine considers more hits.
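Purely as an illustration of that URL trick (and not something that will scale, for the CAPTCHA reasons mentioned in the question), a rough sketch might look like this; the 'first' parameter is the one I used on Bing, and the regex for the hit-count string is a guess that will need adjusting to whatever markup the engine currently serves:

```python
# Rough sketch of the "deep pagination" trick described above, assuming
# the result pages accept a 'first' offset parameter in the URL.
# The hit-count regex is a guess and may need adjusting.
import re
import urllib.parse
import urllib.request

def estimated_hits(query, offset):
    params = urllib.parse.urlencode({"q": query, "first": offset})
    url = f"https://www.bing.com/search?{params}"
    req = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    html = urllib.request.urlopen(req).read().decode("utf-8", "ignore")
    match = re.search(r"([\d.,]+)\s+results", html)
    return match.group(1) if match else None

# The estimate tends to change as the offset grows, until the engine
# refuses to paginate any further.
for offset in (1, 100, 1000, 5000):
    print(offset, estimated_hits("stackoverflow", offset))
```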
I found Bing slightly easier to use in the above way, but I was still unable to get an exact count for the site I was considering. Google seems to be actively preventing this use of their engine, which isn't that surprising. Bing also seems to hit limits, although those looked more like defects.
For my use case I was able to get both search engines to fairly similar estimates (148k for Bing, 149k for Google) using the above technique. The highest hit count I was able to get from Google was 323, whereas Bing went up to 700; both are wildly inaccurate, but that's not surprising since this is not the intended use of either product.
If you want to do this for your own site, you can use the search engine's webmaster tools to view the indexed page count. For other sites I think you'd need to use the search engine API (at some cost).
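For completeness, here is a hedged sketch of what the API route could look like with Bing's Web Search API; the endpoint, header name and response field are from the v7 version of that API and may have changed (or been retired) since, and the subscription key is a placeholder:

```python
# Hedged sketch of the "use the search engine API" route, based on
# Bing's Web Search API v7. The endpoint, header and response field
# names reflect that version and may differ today; the key is a placeholder.
import requests

SUBSCRIPTION_KEY = "YOUR_BING_KEY"

def estimated_matches(query):
    resp = requests.get(
        "https://api.bing.microsoft.com/v7.0/search",
        headers={"Ocp-Apim-Subscription-Key": SUBSCRIPTION_KEY},
        params={"q": query, "count": 1},
    )
    resp.raise_for_status()
    data = resp.json()
    # Still an estimate, for the reasons described above.
    return data.get("webPages", {}).get("totalEstimatedMatches", 0)

print(estimated_matches("stackoverflow"))
```

Note that this returns the same kind of estimate as the UI, so it won't fix the accuracy issue from the question, but it avoids scraping.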
(not sure if this is the right forum for this question)
I am very curious: how does search on major sites, say YouTube/Quora/StackExchange, work?
And I'm NOT looking for an answer like 'They Use Lucene Search engine'. I want to understand exactly how the indexing works there.
Is there a different Index for text search than the autocomplete feature?
Is it done in the background like map reduce?
How exactly does map reduce help deliver results? (I know that it counts words in each document but what happens after that when I search for a keyword?)
I also heard that Google stopped using map reduce and is now using Cloud Dataflow; how does that work?
Help Please :-)
I voted to close, because I think your question is too broad; each bullet could form the basis of an SO question. That said, I'll take a crack at answering how SolrCloud attempts to solve each of the problems you are asking about:
Is there a different Index for text search than the autocomplete feature?
The short answer is "yes". Solr has several options for implementing an autocomplete feature, and all of them rely on either building a separate index or being supplied a separate dictionary. You can also roll your own in an even more sophisticated fashion, as the blog post "Super flexible AutoComplete with Solr" demonstrates.
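To make the "separate index or dictionary" point concrete, here is a small sketch against a local Solr core; the core name, field name and suggester name are hypothetical, and it assumes a SuggestComponent has already been configured in solrconfig.xml:

```python
# Hedged sketch: the regular search handler vs. a suggester handler on a
# hypothetical core named "products" with a suggester named "mySuggester".
import requests

SOLR = "http://localhost:8983/solr/products"

# Full-text search goes against the main inverted index.
hits = requests.get(f"{SOLR}/select",
                    params={"q": "title:lapt*", "wt": "json"}).json()
print(hits["response"]["numFound"])

# Autocomplete goes through a separate structure the suggester maintains,
# not the main index.
sugg = requests.get(f"{SOLR}/suggest",
                    params={"suggest": "true", "suggest.q": "lapt",
                            "suggest.dictionary": "mySuggester",
                            "wt": "json"}).json()
print(sugg["suggest"]["mySuggester"]["lapt"]["suggestions"])
```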
Is it done in the background like map reduce?
Generally speaking, no. SolrCloud is based on the idea of shards with leaders and replicas: a shard is a subset of your overall index, and each shard is made up of a leader and possibly one or more replicas.
Queries are executed against all shard leaders, with one particular shard assigned to act as the aggregator of the shards' responses. Unlike map reduce, where the individual node responses contain all the data the reducing node needs, the aggregating Solr shard may make multiple requests back to the other shards, for example to figure out sort order.
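Purely as an illustration of that scatter/gather flow (this is not Solr's actual code), here is a toy sketch where every "shard" returns its local top-k hits and one node merges them by score:

```python
# Toy scatter/gather: each shard scores only the documents it owns,
# an aggregator merges the partial top-k lists. Solr's aggregating shard
# may also issue follow-up requests to the other shards, which is where
# it differs from a single map-reduce pass.
import heapq

def score(text, terms):
    # toy relevance: how many query terms appear in the document
    return sum(term in text.lower() for term in terms)

def query_shard(shard, terms, k):
    scored = [(score(doc, terms), doc_id) for doc_id, doc in shard.items()]
    return heapq.nlargest(k, scored)

def search(shards, terms, k=3):
    candidates = []
    for shard in shards:                      # scatter: hit every shard leader
        candidates.extend(query_shard(shard, terms, k))
    return heapq.nlargest(k, candidates)      # gather: merge partial top-k lists

shards = [
    {"doc1": "Lucene inverted index basics", "doc2": "Cooking pasta"},
    {"doc3": "SolrCloud shards and replicas", "doc4": "Lucene scoring"},
]
print(search(shards, {"lucene", "solrcloud"}))
```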
How exactly does map reduce help deliver results? (I know that it counts words in each document but what happens after that when I search for a keyword?)
See my response to your previous question. In short, the query is executed against each shard, aggregated by one of those shards, and returned to the requestor. The useful "magic" that people most often associate with Solr (Lucene, really) is term frequency / inverse document frequency (TF-IDF) indexing, usually with stemming on text searches. While this is not exactly what happens under the hood, and you can vary what's actually done via configuration, it gives a fairly good idea of what's being done.
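To give a flavour of the TF-IDF idea (Lucene's real scoring adds length normalisation and newer versions default to BM25, so treat this purely as a concept sketch):

```python
# Minimal TF-IDF illustration: terms that are frequent in a document but
# rare across the collection get the highest weight.
import math
from collections import Counter

docs = [
    "the quick brown fox",
    "the lazy brown dog",
    "quick quick search indexing",
]
tokenized = [d.split() for d in docs]
df = Counter(term for doc in tokenized for term in set(doc))  # document frequency
N = len(docs)

def tf_idf(term, doc_tokens):
    tf = doc_tokens.count(term) / len(doc_tokens)
    idf = math.log(N / (1 + df[term])) + 1    # +1s avoid divide-by-zero / negative weights
    return tf * idf

for i, doc in enumerate(tokenized):
    print(i, round(tf_idf("quick", doc), 3))
```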
Other searching, on dates, numbers, or simple textual values, is done in a fashion similar to database indexing. That is a simplification; if you want to understand it more fully, read the JavaDoc on NumericRangeQuery for an in-depth explanation.
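As a rough analogy for that kind of numeric indexing (NumericRangeQuery actually uses a trie of value prefixes, which is cleverer than a plain sorted list, but the access pattern is similar):

```python
# Toy range search over a sorted numeric "column": binary search for the
# boundaries, then take the slice in between.
import bisect

prices = sorted([3, 7, 15, 15, 42, 99, 120])
lo, hi = 10, 100                              # find docs with 10 <= price <= 100
left = bisect.bisect_left(prices, lo)
right = bisect.bisect_right(prices, hi)
print(prices[left:right])                     # [15, 15, 42, 99]
```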
I also heard that Google stopped using map reduce and is now using Cloud Dataflow; how does that work?
If I knew the answer to that I would probably be working for Google and not answering StackOverflow questions :). Seriously, whatever they've built is new PhD-level work that, as far as I know, they haven't even released a research paper on, which is what they did with map reduce and what led to Yahoo building Hadoop.
Possible Duplicate:
Get random site names in bash
I'm writing a program for university that has to find the occurrences of words on the web. I need an algorithm that finds sites, counts the words used on them, and then records them and sorts them by how many times they are used. Therefore, the more sites my program checks, the better. At first I was thinking of generating random IPs, but the problem is that the process takes far too long (I left the computer searching the whole night and it found only 15 sites). I guess this is because sites' IPs aren't distributed evenly across the address space, and most IPs belong to users or other services. Now I have a couple of new approaches in mind and I wanted to know what you think:
What if I make random searches through Google using some sort of dictionary? The dictionary would start empty, and each time I perform a search I would check one site and add to the dictionary only the words that occur once on it, so that later searches won't send me back to that site and corrupt the occurrence counts.
Is this easy?
The first thing I want to do is search random pages of Google's results, not only the first one. How can this be done? I can't figure out how to determine the maximum number of result pages for a search, or how to jump directly to a specific page.
thanks
While I don't think you could (or should) do this in bash alone, take a look at the Google Custom Search API and this question. It allows you to query Google search programmatically.
As for what queries to use, you could resort to picking words randomly from a dictionary file, though that would not give you a uniform distribution, since words like 'cat' are far more common than 'epichorial', say. If you need something that takes those differences into account, you can use a word-frequency dictionary, although that seems to be the point of your research in itself, so perhaps it would not be appropriate.
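A hedged sketch of combining the two ideas, assuming you have created an API key and a custom search engine ID in the Google developer console (both placeholders below) and that a word list exists at /usr/share/dict/words:

```python
# Sample random dictionary words and ask the Custom Search JSON API how
# many results Google reports for each. API_KEY and CX are placeholders.
import random
import requests

API_KEY = "YOUR_API_KEY"
CX = "YOUR_SEARCH_ENGINE_ID"

with open("/usr/share/dict/words") as f:
    words = [w.strip() for w in f if w.strip().isalpha()]

for word in random.sample(words, 5):
    resp = requests.get(
        "https://www.googleapis.com/customsearch/v1",
        params={"key": API_KEY, "cx": CX, "q": word},
    ).json()
    total = resp.get("searchInformation", {}).get("totalResults")
    print(word, total)
```

Keep in mind the free quota for this API is small (on the order of 100 queries per day), so it limits how many sites you can discover this way.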
I'm getting some erratic results from Foursquare's venue search API and I'm wondering if anyone has any tips on how to process my input parameters for the most "intuitive" results.
For example, suppose I am searching for a venue called "Ise Sushi", around "New York, NY", which is equivalent to (lat: 40.7143528, lon: -74.00597309999999) using Google Maps API. Plugging into the Foursquare Venue API, we get:
https://api.foursquare.com/v2/venues/search?query=ise%20sushi&ll=40.7143528%2C-74.00597309999999
This yields pretty underwhelming results: the venue I'm looking for ends up rather far down the list, at 11th place. What's interesting is that reducing the precision of the coordinates appears to produce much better results. For example, suppose we were to round the coordinates to 3 significant digits:
https://api.foursquare.com/v2/venues/search?query=ise%20sushi&ll=40.7%2C-74.0
This time, the venue I'm looking for ends up in 2nd place, even though it is actually farther from the center of the search (1072 meters, vs. 833 meters using the first query).
Another modification that appears to help improve the quality of search is substituting underscores for spaces to separate our search terms. For example, here's the original query with underscores:
https://api.foursquare.com/v2/venues/search?query=ise_sushi&ll=40.7143528%2C-74.00597309999999
This produces the most intuitive-seeming results: the venue I'm looking for appears first, and is accompanied by just one other result, "Ise Restaurant" (which is tagged as a "sushi restaurant"). For what it's worth, this actually seems to be the result set of the same search conducted on Foursquare's own website.
I'm curious what lessons I should be learning from this. Should I be reducing the precision of my coordinates? Should I be connecting my search terms with underscores, and if so, does that limit how a user can order their search terms?
Although there are ranking improvements we can make on our end to find this distant exact match, it generally also helps to specify intent=browse (although it looks like in this case, for now, it may give you worse results). By default, /venues/search uses intent=checkin, which tries really hard to find close-by matches for checking in to, at the expense of other ways a venue might match your search. Learn more at https://developer.foursquare.com/docs/venues/search
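For reference, a hedged sketch of the same search with intent=browse; the client credentials and version date are placeholders, and the v2 endpoint also expects the v (version) parameter:

```python
# The venue search from the question, but with intent=browse and an
# explicit radius so matching is not biased toward check-in distance.
# CLIENT_ID, CLIENT_SECRET and the version date are placeholders.
import requests

params = {
    "query": "ise sushi",
    "ll": "40.7143528,-74.00597309999999",
    "intent": "browse",          # rank by match quality within the area
    "radius": 2000,              # metres; intent=browse wants an explicit area
    "client_id": "CLIENT_ID",
    "client_secret": "CLIENT_SECRET",
    "v": "20230101",
}
resp = requests.get("https://api.foursquare.com/v2/venues/search",
                    params=params).json()
for venue in resp["response"]["venues"]:
    print(venue["name"], venue["location"].get("distance"))
```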
Somebody posted a question an hour or so ago about the Drupal search engine; it went roughly like this:
I know Drupal should index anything that is returned by node_view(), but this is not happening for my custom content. Also: are there better alternatives to Drupal's built-in functionality?
As the question was removed while I was answering, and I didn't want to throw away 20 minutes of my life for nothing ;) I thought I would re-create it here. Hope this is fine by the rules of SO! :)
The Drupal search engine is probably not the most celebrated feature of Drupal, but it is fairly solid, sophisticated and reliable. There are plenty of modules that enhance or replace it, but, at least in my experience, there is no commonly accepted "better way" to manage searching and indexing.
However, for very big and busy sites, people often prefer external tools altogether, like a Google search box, or even dedicated software or hardware such as Solr/Lucene or the Google Search Appliance (GSA).
The link I provided above, however, sorts the search-related modules by descending usage statistics, so you will find the most commonly used ones on the first page. One that I personally like for English-language sites is the Porter stemmer module, which indexes words by their stem (e.g. highness, highest and higher will all be returned as matches for the word "high").
That was the general information on search and Drupal. As for your problem, there are a number of things you could check to track it down:
Has your cron.php been executed lately? Indexing is done as part of the cron run, so if you do not have a crontab set up, or you haven't run it by hand, your node has likely not been indexed yet.
Are the settings correct? Settings for the search module are located at http://example.com/admin/settings/search. Is your minimum word length appropriate for your needs (the default is 3 letters)?
Has 100% of the site been indexed? (You can check that from the settings page.) If it hasn't, and running cron.php doesn't solve the matter, look further down.
Does a re-index solve the problem? Especially if you inserted data by means of SQL queries directly on the Drupal tables, chances are Drupal hasn't realised the content of the node has changed and therefore hasn't updated the index.
Is the node you are trying to find visible? Unpublished nodes, or nodes that require higher permissions than yours to be viewed, are not returned in search results, AFAIK.
As for the "stuck indexing" that happened to me once as well. It turned out it was some PHP code within a node body that would trigger a PHP exception when the node was being indexed, and as a result the indexing process would halt and all the following nodes would not be indexed as well.
Hope this helps. Good luck!