Google Custom Search: is it possible to retrieve all the results?

I'm starting to use the Google Custom Search Engine to track the use of selected words over time in an online newspaper.
I see that, for example, a search reports a total of 22,000 matching articles, but when I try to retrieve results past index 100 I get nothing.
I also tried searching directly on the Google web page, but I can't get past page 10 there, so at most it shows me the first 1000 results.
Is it possible to retrieve every single result, or can I only get a small portion of them?
Thanks
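This matches a documented limit: the Custom Search JSON API will never return more than 100 results for a query, regardless of the totalResults it reports. A minimal Python sketch of paging up to that cap (the key and cx values are placeholders; the endpoint and parameters are the documented ones):
import requests

API_KEY = "YOUR_KEY"       # placeholder credentials
CX = "YOUR_ENGINE_ID"      # placeholder custom search engine id

def fetch_all(query, page_size=10, cap=100):
    items = []
    # 'start' is 1-based; the API stops serving results around the 100-result cap
    for start in range(1, cap + 1, page_size):
        resp = requests.get(
            "https://www.googleapis.com/customsearch/v1",
            params={"key": API_KEY, "cx": CX, "q": query,
                    "start": start, "num": page_size},
        )
        data = resp.json()
        items.extend(data.get("items", []))
        if "nextPage" not in data.get("queries", {}):
            break  # no further pages are available
    return items
So anything past the cap has to come from narrowing the query (for example with date restrictions) rather than from deeper paging.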

Related

Cloudant/Couch db pagination in search API - How to skip n number of records

I am building typical pagination that allows the user to click on a particular page number and view the results (similar to the Google search result view). I am using the Cloudant search API for this. The Cloudant search API provides a limit option but no skip option. How can I skip n results if the user is on page 1 and clicks on page 4?
I can see that the pagination is implemented using bookmarks. Does that mean I need to first get the bookmark for page 4 by sending three additional requests, one after another, to the search API?
There are a couple of different ways of handling this. One is the one you already suggested: fetch the intervening pages as needed to collect their bookmarks. I'm not sure there are many alternatives for search results, where we can't pre-calculate the results.
Another alternative, depending a bit on the details of what you are trying to do, is to create a view containing the data and use the keys to narrow the view down to the results you need. View outputs support limit and skip, which would enable you to implement pagination.
There's also a good example of pagination in the docs: http://docs.couchdb.org/en/2.1.0/ddocs/views/pagination.html
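In code, the bookmark-chaining approach looks roughly like this (a Python sketch; the account, database, design document and index names are placeholders):
import requests

SEARCH_URL = ("https://ACCOUNT.cloudant.com/mydb"
              "/_design/mydesign/_search/myindex")

def fetch_page(query, bookmark=None, limit=10):
    params = {"q": query, "limit": limit}
    if bookmark:
        params["bookmark"] = bookmark
    resp = requests.get(SEARCH_URL, params=params,
                        auth=("user", "password"))
    return resp.json()  # has "rows" plus a "bookmark" for the next page

def jump_to_page(query, page, limit=10):
    # No skip option: jumping from page 1 to page 4 costs three extra
    # requests, each one passing the previous response's bookmark back in.
    bookmark = None
    for _ in range(page - 1):
        bookmark = fetch_page(query, bookmark, limit)["bookmark"]
    return fetch_page(query, bookmark, limit)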

How can I crawl but not index web pages in OpenSearchServer?

I'm using OpenSearchServer to provide search functionality on a web site. I want to crawl all pages on the site for links to follow but I want to exclude some pages from the index. I can't work out how to do this.
Specifically the website includes a shop that has its own product search and I am keeping this search for products and categories. The product pages have URLs like http://www.thesite/p/123 so I don't want to include any page like this in the search results. However some product pages reference background info pages and I want these to be included in the search index.
The problem I have, described in more detail below, is that the filter has no effect on the results: it doesn't filter out the /p/ and /c/ results. And if I change the filter by unticking the negative box I get no results, so it seems to be either the contents of the field or the filter criteria that is causing the problem.
I first tried adding a negative filter, url:"http://www.thesite/p/*", to the default query called search in the Query > Filter tab on the index,
but it seems that wildcards are not supported for query filters, although they are supported for Crawler > Exclusion list filters.
I've since tried adding a new field called urlShop in Schema > Fields and populating it using an analyzer configured with the Whitespace Tokenizer and a regular expression (http://www.thesite/(c|p)/). When I use the Test button it generates two tokens for my test URL http://www.thesite/p/123:
http://www.thesite/p/
p
I'd hoped to be able to use the first one in a Query > Filter to exclude all the shop results and optionally be able to use the p (for product) or c (for category) if I need to search the product pages sometime in the future.
The urlShop field in the schema is set up as follows:
Indexed: yes
Stored: no (because I don't need the field back, just want to be able to filter on it)
TermVector: No
Analyzer: urlShop
Copy of: url
I've added urlShop:"http://www.thesite/p/" to Query > Filters with the negative box ticked.
This seems to have no effect on the results when I use the default renderer.
To see whether the filter affects the returned results at all, I unticked the negative box in the query filter; with it unticked I get no results in the default renderer. This leads me to believe that the urlShop field is not being populated, but I'm not sure how to check this directly.
I would like to know whether there is an easier way to do this, but if my approach makes sense in the context of OpenSearchServer, please can you help me identify what's wrong?
The website is running under IIS and OpenSearchServer will be configured on the same server running in Tomcat.
Finally figured this out...
Go to query and hit edit for your configured query. Then go to the filters tab. Add a query filter like this:
urlExact:"http://myurltoexclude*"
Check the "negative" box. Click add.
Now make sure to click "save in the tiny little button on the right hand side. This is the part I missed. The URLS are still in the DB and crawl, but at least they aren't returned in results.

Sphinx "reverse" search

We have a website where users put up ads for stuff they want to sell, with parameters such as price, location, title and description. These can then be searched using Sphinx, with users able to specify a min and max price, a location with a search radius (using Google Maps), and so on. Users can choose to save these searches and get emails when new ads appear that fit their search.
Herein lies the problem: we want to perform a reverse search every time an ad is posted. With the price, location, title and description as parameters, we want to search through all the saved searches and find the ones that would have matched the ad. The min and max price can presumably be handled in a query, along with some quorum syntax to get all saved searches with at least 2, or maybe just 1, keyword occurrence in the title/description.
Our problem lies mostly in the geo-search: how do we find all saved searches whose search circles would include the newly posted location, without running a separate search for every saved search?
That is the main question; any comments on our suggested solutions to the other problems are also very welcome. Thank you in advance / Jenny
The standard geo-search support in Sphinx should work just as well on a prospective index as in a normal retrospective search.
Having built a Sphinx index of all the saved searches...
...you run a query using the ad as the search query. Rather than filtering on a fixed radius, you use the radius from the attribute (i.e. the radius stored on the particular saved search). If you're using the API you can't use SetFilterRange directly; you need SetSelect to create a new virtual attribute:
$cl->SetSelect("*, IF(@geodist<radius,1,0) AS myfilter");
$cl->SetFilter('myfilter', array(1));
(And yes, the min/max price can just be done with normal filters too, inverting the logic relative to what you would use in a retrospective search.)
The complication is in the full-text query, if the saved search is anything more than a single keyword, but you appear to have already figured out that part.
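Putting the pieces together, here is a sketch of the prospective query using the sphinxapi Python client that ships with Sphinx. The index name (saved_searches), the attribute names (lat, lng, radius, minprice, maxprice) and the ad values are all assumptions:
import sphinxapi  # the sphinxapi.py client distributed with Sphinx

# A hypothetical newly posted ad.
ad_text = "vintage bicycle good condition"
ad_lat, ad_lng = 0.838, 0.349  # radians, as SetGeoAnchor expects
ad_price = 120.0

cl = sphinxapi.SphinxClient()
cl.SetServer('localhost', 9312)

# Anchor @geodist at the ad's position; lat/lng are attributes
# stored on each saved search.
cl.SetGeoAnchor('lat', 'lng', ad_lat, ad_lng)

# A saved search matches when the ad falls inside ITS radius (meters)
# and the ad's price falls inside ITS min/max price bounds.
cl.SetSelect(
    "*, IF(@geodist < radius AND minprice <= %f AND maxprice >= %f, 1, 0)"
    " AS myfilter" % (ad_price, ad_price))
cl.SetFilter('myfilter', [1])

# The ad text is the query: matches are saved searches whose stored
# keywords occur in the ad.
result = cl.Query(ad_text, 'saved_searches')
if result:
    for match in result['matches']:
        print(match['id'])  # notify the owner of this saved search
Note the inverted roles compared to a normal search: the ad is the query and the saved searches are the documents, which is exactly what makes this a prospective search.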

Pagination in Marklogic while using Search API

I have around 5,300,000 documents in MarkLogic Server and I am building a simple search application. The user enters a search term, MarkLogic searches for that term across all nodes in all documents, and the matching documents are returned as the result. I have implemented custom paging to show 10 results per page.
I am using the Search API for this:
import module namespace search = "http://marklogic.com/appservices/search"
    at "/MarkLogic/appservices/search/search.xqy";
declare variable $options :=
  <options xmlns="http://marklogic.com/appservices/search">
    <transform-results apply="raw"/>
  </options>;
search:search($p, $options, $noRecFrom, 10)/search:result
where $p is the input from the user and $noRecFrom is the number of the record to start from: for page 1 $noRecFrom is 1, for page 2 it is 11, for page 3 it is 21, and so on. For paging there are hyperlinks to go to the First, Next, Prev and Last pages.
To calculate the total number of records returned I am using:
for $x in search:search($p, $options)
return $x/@total
While the First, Next and Prev hyperlinks work perfectly, if someone clicks Last the application stops responding and the query does not return any output. Is this due to the large number of documents in the database, or am I implementing it wrongly?
Is there an efficient way to paginate in MarkLogic (with search:search) so that the user can go to the last page without a long delay in the query result, even for such a large database?
The way you've implemented it, you're running the search repeatedly in your for loop, and that would indeed be slow.
Instead, you should be calculating a $start parameter based on the @total value and the number of documents per page, and passing that in as an argument (I think it's the third one) to search:search.
I would also recommend making sure you can run in unfiltered mode. There is good information about optimizing for fast pagination (indexes, etc.) on the developer site; the idea is to resolve queries out of indexes to give very good, accurate unfiltered performance.
If you do it that way, even a jump to the last page should resolve quickly.
There is a tutorial on paginated search at http://developer.marklogic.com/learn/2006-09-paginated-search
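The same start/page-length arithmetic, including the jump straight to the last page, can be sketched in Python against MarkLogic's REST API. This assumes a REST instance on port 8000 with digest authentication; the /v1/search endpoint and its start/pageLength parameters are documented, but treat the rest as a sketch rather than the poster's setup:
import requests
from requests.auth import HTTPDigestAuth

BASE = "http://localhost:8000/v1/search"   # assumed REST instance
AUTH = HTTPDigestAuth("user", "password")  # placeholder credentials
PAGE = 10

def fetch_page(term, start):
    resp = requests.get(
        BASE,
        params={"q": term, "start": start,
                "pageLength": PAGE, "format": "json"},
        auth=AUTH,
    )
    return resp.json()  # includes "total" and "results"

# The first page reports the total; the last page is then plain arithmetic,
# not a walk through every intervening page.
first = fetch_page("marklogic", 1)
total = first["total"]
last_start = max(1, ((total - 1) // PAGE) * PAGE + 1)
last = fetch_page("marklogic", last_start)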
Once you have resolved the issues mentioned by cwhit above, if you still want to get to the last page of data faster, you could make your code smart enough to reverse the sort order and pull the correct offset of records.
Here's another tip:
To get better insight into what MarkLogic is doing with search:search, call
search:get-default-options()
to see the starting point for common search applications.

Google Analytics API to track Site Search?

So there's this nifty _trackPageview() API method on a tracker object, but is there a corresponding method that can be used to manually track a search? In other words, _trackPageview() reports to GA that a user hit a page; I want something like _trackSearch("terms") that would report to GA that a user searched for something.
Though not exactly what I was looking for, it seems that one can generate virtual page views to track search results programmatically.
Suppose you've set up a Site Search parameter called "q", so that when a tracked URI contains q=these+are+some+terms, GA will count it as a search hit. One can then use the _trackPageview() method to generate virtual search hits like so:
pageTracker._trackPageview('/custom/search?q=These+are+some+terms')
I pass search parameters by GET, so the URL for a search on "TEST" is
http://www.example.com/search?q=TEST
Selecting Content -> Site Search from my analytics account gives me a list of all keywords searched.
To learn more, check the documentation, especially the How do I set up Site Search for my profile? page.
