Each Google search shows the number of pages found. Is it possible to find out what this number was at some point in the past?
The best thing you can try is Google Zeitgeist. It doesn't give the number of pages found, but it offers insight into the popularity of keywords, for example: "stackoverflow". You can also check the popularity of a website, for example: stackoverflow.com.
No other Google service comes to mind for your particular purpose.
No, this number is not available. You could run the query and parse the Google results to capture the information, but read the terms of use before you try that. In most cases, doing something like that will be against them.
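For illustration only, here is a minimal Python sketch of what such parsing might look like. The "About N results" phrasing and the markup it relies on are assumptions about a page that changes frequently, and scraping may well violate the terms of use; check them first.

import re
import requests

def google_result_count(query):
    """Fetch a Google results page and pull out the reported hit count."""
    resp = requests.get(
        "https://www.google.com/search",
        params={"q": query},
        # A browser-like User-Agent; plain requests are often blocked.
        headers={"User-Agent": "Mozilla/5.0"},
        timeout=10,
    )
    # Look for a phrase like "About 47,500,000 results" in the raw HTML.
    match = re.search(r"About ([\d,]+) results", resp.text)
    return int(match.group(1).replace(",", "")) if match else None

print(google_result_count("stackoverflow"))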
I need to know whether it is worth building a crawler on top of the results given by a search engine.
That is, for a given query, grab N URLs from a search engine and feed them into a crawler to find more pages relevant to the search. Is there any scientific paper/experiment showing that this helps gather more relevant pages than using only the URLs from the search engine?
If I understood correctly, you would be rebuilding the search engine, since its job is already to return the most relevant results first. Although you did not name your search engine directly (I guess it is Google), I suggest you use the advanced search options before trying anything else. Google provides an API for performing searches, which you can use in your system. If that approach does not fit, it is possible to crawl over Google results and even perform custom searches (for example, filtering results by site, term, etc.), but Google would not be happy with this and would eventually block your calls. I suggest you give its open API a try first.
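As a sketch of that suggestion, Google's Custom Search JSON API can return the top results for a query programmatically. The key and engine id below are placeholders you would obtain from Google's consoles, and quotas apply; treat this as an outline rather than a definitive integration.

import requests

API_KEY = "your-api-key"      # placeholder
ENGINE_ID = "your-engine-id"  # placeholder

def top_urls(query, n=10):
    """Return up to n result URLs for a query (the API caps num at 10)."""
    resp = requests.get(
        "https://www.googleapis.com/customsearch/v1",
        params={"key": API_KEY, "cx": ENGINE_ID, "q": query, "num": n},
        timeout=10,
    )
    resp.raise_for_status()
    return [item["link"] for item in resp.json().get("items", [])]

# These URLs could then seed the crawler:
for url in top_urls("information retrieval focused crawler"):
    print(url)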
I'm a computer science student and a bit inexperienced when it comes to web crawling and building search engines. At the moment I am using the latest version of OpenSearchServer and am crawling several thousand domains. When using the built-in search engine creation tool, I get search results that are related to my query, but they are ranked using a vector space model over the documents rather than the PageRank algorithm or something similar. As a result, the top results are only marginally helpful, whereas higher-quality results from sites such as Wikipedia are buried on the second page.
Is there some way to run a crude PageRank algorithm in OpenSearchServer? If not, is there a similarly easy-to-use open-source package that does this?
Thanks for the help! This is my first time doing anything like this so any feedback is greatly appreciated.
I am not familiar with OpenSearchServer, but I know that most students working on search engines use Lucene or Indri. Reading papers on novel approaches to document search, you will find that the majority use one of these two APIs. Lucene is more flexible than Indri in terms of defining different ranking algorithms. I suggest taking a look at both to see whether they suit your purpose.
As you mention, the web crawl template of OpenSearchServer uses a search query with relevancy based on the vector space model. But if you use the latest version (v1.5.11), it also mixes in the number of backlinks.
You can change the weight of the backlink-based score; by default it is set to 1.
We are currently working on providing more control on the relevance. This will be visible in future versions of OpenSearchServer.
I found a number of similar questions on SO, but they are all either 2+ years old or not exactly what I am looking for.
All I would like to do is obtain a list of Twitter users whose bio/profile contains certain terms (scientist, democrat, 'dog lover', etc.).
I've considered using a Google site search, but so far the results are incredibly noisy.
Any suggestions would be much appreciated!
CS
The Twitter API supports a People Search similar to the website's "Find on Twitter" feature. Although you cannot search using profile descriptions alone, it appears that the description content is used as part of the search space. If you can think of a way to narrow the results further by directly filtering the returned users' descriptions, you should be able to do what you're looking for. Check out the Twitter API documentation for more info.
Example:
Try searching for "husband father of three", and you get these results, which are obviously returned because of the profile descriptions.
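A rough sketch of calling that endpoint through the Tweepy library, assuming v1.1 API access and OAuth 1.0a user credentials (all tokens below are placeholders):

import tweepy

auth = tweepy.OAuth1UserHandler(
    "consumer_key", "consumer_secret",      # placeholders
    "access_token", "access_token_secret",  # placeholders
)
api = tweepy.API(auth)

# search_users matches names and profile descriptions, so bio terms surface.
for user in api.search_users(q="husband father of three", count=20):
    print(user.screen_name, "-", user.description)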
I have used a tool to search Twitter profiles using keywords and many advanced filters. I love the information provided by the FollowerSearch tool; it was very specific, which helped me analyze public Twitter profiles.
One of the best tools for quickly searching the more than 800 million public Twitter accounts in its database is FollowerSearch. With it, you can quickly search for Twitter influencers and Twitter bios, and you can look up profiles by location, line of work, number of followers, and so on.
Twitter Influencer Profile Search
A Twitter bio search will help you simplify the process, whether you're looking for influencers or new talent. You can discover Twitter users who share your interests and find exact information on all the accounts whose bios contain your search term.
Identify key accounts and Twitter influencers whose Twitter bios contain the required terms.
Look up new and budding talent.
Find Twitter users with similar interests.
Search Twitter profiles or bios for any desired term.
I created a tool that does exactly what you're looking for. Find70 lets you search for Twitter profiles by their bio. In fact, you can set up as many search filters as you want and define your own weighting for each filter. In your example above, you could search for scientist, democrat, 'dog lover', and it would return all the accounts that have those terms in their bio. This can be combined with other filters too. Here it is: http://www.find70.com/?t=stack
I'm trying to set up a Solr search engine. I've already configured the spell-checking system and the suggestions.
However, I can't seem to find how to retrieve the top 10 most searched words/terms/keywords in Solr/Lucene. How can I get this? I want to display them on my homepage.
Solr does not provide this kind of feature out of the box. There is the StatsComponent, which provides all kinds of statistics, but those are numeric only.
Depending on how you access Solr (directly or via your own app), you could intercept all calls and log the query string. I did this in a recent project where I logged all queries to a database. If you submit all keywords to another core on your Solr server, you can run faceting queries on your search terms, as described by Hyque.
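A minimal sketch of that logging idea, assuming a hypothetical "queries" core and a dynamic string field; neither exists in a stock Solr install, so adapt the names to your setup.

import requests

SOLR = "http://localhost:8983/solr"  # assumed Solr base URL

def log_query(q):
    """Store one search term as a document in the assumed 'queries' core."""
    requests.post(
        f"{SOLR}/queries/update?commit=true",
        json=[{"query_s": q}],  # 'query_s' is an assumed dynamic string field
        timeout=5,
    )

log_query("solr faceting")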
You could use a facet for retrieving the Top X words like this:
http://yourservergoeshere/solr/select?q=*:*&wt=xml&indent=true&facet=true&facet.query=*&facet.field=message&facet.limit=10&facet.mincount=1
The value of facet.field depends on the field you would like to search in. With facet.limit you (obviously) limit the number of results to 10. You'll find the facet results at the end of the response, starting with "facet_counts".
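If you query from code, the same facet request can be issued and parsed like this (a sketch reusing the server and "message" field from the URL above):

import requests

resp = requests.get(
    "http://yourservergoeshere/solr/select",
    params={
        "q": "*:*", "wt": "json", "facet": "true",
        "facet.field": "message", "facet.limit": 10, "facet.mincount": 1,
    },
    timeout=10,
)
# Solr returns facets as a flat [term, count, term, count, ...] list.
counts = resp.json()["facet_counts"]["facet_fields"]["message"]
for term, count in zip(counts[::2], counts[1::2]):
    print(term, count)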
Edit: I really should go to bed earlier. I didn't see the "most searched" in your question. Sorry for that.
Apache Solr does not provide any such capability as of today. There is a desire for this and a JIRA ticket corresponding to it. You can vote for it if you'd like to see it in Solr some day: https://issues.apache.org/jira/browse/SOLR-10359.
The stats component provides statistical information, but it is mostly numeric in nature. You could parse server logs and build a Frequently Searched Terms report from them (e.g., pump those logs into SiLK or Kibana for visualization).
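A hedged sketch of that log-mining idea; the file name and the q= capture pattern are assumptions to adapt to your servlet container's access-log format.

import re
from collections import Counter
from urllib.parse import unquote_plus

pattern = re.compile(r"[?&]q=([^&\s]+)")  # grab the q parameter of each request

counter = Counter()
with open("solr_access.log") as log:  # hypothetical log file
    for line in log:
        match = pattern.search(line)
        if match:
            counter[unquote_plus(match.group(1)).lower()] += 1

for term, hits in counter.most_common(10):
    print(term, hits)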
If you have the ability to change the front end and add some JavaScript to the UI, or can intercept the search request and make async or batch calls to tracking APIs, you can use SearchStax Analytics, which provides search analytics that track searches, clicks, cart actions, revenue, etc.
I have a set of search queries, approximately 10 million in total. The goal is to collect the number of hits returned by a search engine for each of them. For example, Google returns about 47,500,000 hits for the query "stackoverflow".
The problem is that:
1- The Google API is limited to 100 queries per day. This is far from sufficient for my task, since I would need a very large number of counts.
2- I used the Bing API, but it does not return an accurate number, accurate in the sense of matching the number of hits shown in the Bing UI. Has anyone come across this issue before?
3- Issuing search queries to a search engine and parsing the HTML is one solution, but it triggers CAPTCHAs and does not scale to this number of queries.
All I care about is the number of hits, and I am open to any suggestion.
Well, I was really hoping that someone would answer this, since it is something I was also interested in finding out, but since it doesn't look like anyone will, I will throw in these suggestions.
You could set up a series of proxies that change their IP every 100 requests so that you can query Google as seemingly different people (which seems like a lot of work). Or you could download Wikipedia and write something to parse the data so that when you search for a term you can see how many pages it appears in; of course, that is a much smaller dataset than the whole web, but it should get you started. Another possible data source is the Google n-grams data, which you can download and parse to see how many books and pages the search terms appear in. Maybe a combination of these methods could boost the accuracy for any given search term.
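As a variation on the Wikipedia suggestion, instead of parsing a downloaded dump you could query the live MediaWiki search API, whose totalhits field reports how many articles match a term. It is a much smaller stand-in for web hit counts, but it sketches the idea:

import requests

def wikipedia_hits(term):
    """Return the number of English Wikipedia articles matching a term."""
    resp = requests.get(
        "https://en.wikipedia.org/w/api.php",
        params={
            "action": "query", "list": "search",
            "srsearch": term, "srlimit": 1, "format": "json",
        },
        timeout=10,
    )
    return resp.json()["query"]["searchinfo"]["totalhits"]

print(wikipedia_hits("stackoverflow"))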
Certainly none of these methods is as good as getting the Google page counts directly, but understandably that is data they don't want to give out for free.
I see this is a very old question but I was trying to do the same thing which brought me here. I'll add some info and my progress to date:
Firstly, the reason you get an estimate that can change wildly is because search engines use probabilistic algorithms to calculate relevance. This means that during a query they do not need to examine all possible matches in order to calculate the top N hits by relevance with a fair degree of confidence. That means that when the search concludes, for a large result set, the search engine actually doesn't know the total number of hits. It has seen a representative sample though, and it can use some statistics about the terms used in your query to set an upper limit on the possible number of hits. That's why you only get an estimate for large result sets. Running the query in such a way that you got an exact count would be much more computationally intensive.
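To make the estimation idea concrete, here is a toy simulation; real engines use sophisticated per-term statistics rather than uniform sampling, so this is purely illustrative.

import random

corpus_size = 1_000_000
matching = set(random.sample(range(corpus_size), 150_000))  # true hits: 150k

sample = random.sample(range(corpus_size), 10_000)  # examine only 1% of docs
sample_hits = sum(1 for doc in sample if doc in matching)
estimate = sample_hits * (corpus_size / len(sample))
print(f"estimated hits: {estimate:.0f} (true: {len(matching)})")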
The best I've been able to achieve is to refine the estimate by tricking the search engine into looking at more results. To do this you need to go to page 2 of the results and then modify the 'first' parameter in the URL to go much higher. Doing this may allow you to find the end of the result set (this worked for me last year, I'm sure, although today it only worked up to the first few thousand). Even if it doesn't let you reach the end of the result set, you will see that the estimate gets better as the query engine considers more hits.
I found Bing slightly easier to use in the above way - but I was still unable to get an exact count for the site I was considering. Google seems to be actively preventing this use of their engine which isn't that surprising. Bing also seems to hit limits although they looked more like defects.
For my use case I was able to get both search engines to fairly similar estimates (148k for Bing, 149k for Google) using the above technique. The deepest hit I was actually able to reach on Google was 323, whereas Bing went up to 700; both are wildly short of the estimates, but that is not surprising, since this is not the intended use of the product.
If you want to do it for your own site you can use the search engine's webmaster tools to view indexed page count. For other sites I think you'd need to use the search engine API (at some cost).