Create a search engine on specific sites and gather specific info - search

I need to create a search engine that crawls through a list of websites and searches them for a query. Those websites all return data in various formats and structures, and I need to collect specific info (in a single, unified structure) from all of them.
Is there a way I can do that with an existing engine such as Google Custom Search Engine? Or am I better off creating one of my own? If so, what's the first step I should take towards learning about indexing and searching these websites efficiently, without filling up my servers with useless data?
So to sum up: besides running a query against each of these websites' search boxes, I need to handle the results from each of them appropriately and lay them out in one unified structure in a single place. All the results are to be parsed and extracted into 4-6 fields (unless, of course, there is a way to do this with Google CSE).
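For reference, the pipeline being described here (query each site's own search endpoint, parse its result markup with a per-site adapter, normalize into the same handful of fields) could look roughly like the sketch below. The site URLs, query parameters, CSS selectors and field names are all made up for illustration; this is just the shape of a hand-rolled approach, assuming the requests and BeautifulSoup libraries.

```python
# Hypothetical sketch: query several sites and normalize results into one structure.
# Site URLs, query parameters and CSS selectors below are invented for illustration.
import requests
from bs4 import BeautifulSoup

SITES = [
    {
        "name": "example-shop",
        "search_url": "https://shop.example.com/search",
        "query_param": "q",
        "result_selector": "div.result",
        "fields": {"title": "h2", "price": ".price", "summary": ".teaser"},
    },
    # ... one adapter entry per site, each with its own selectors
]

def search_site(site, query):
    """Run one query against one site and yield normalized records."""
    resp = requests.get(site["search_url"],
                        params={site["query_param"]: query}, timeout=10)
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "html.parser")
    for hit in soup.select(site["result_selector"]):
        record = {"source": site["name"]}
        for field, selector in site["fields"].items():
            el = hit.select_one(selector)
            record[field] = el.get_text(strip=True) if el is not None else None
        yield record

def federated_search(query):
    """Query every configured site and return one flat, uniform result list."""
    results = []
    for site in SITES:
        try:
            results.extend(search_site(site, query))
        except requests.RequestException:
            continue  # skip sites that are down rather than failing the whole search
    return results
```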

Google CSE provides some interfaces to the standard Google web search. You can control the user interface and the search parameters, but you have no control over the indexing, nor any direct access to the index data.
You might be more interested in the Google Search APIs that are available with GAE (Google App Engine). These are quite different: they are search services in which you provide the data and control the indexes.
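To show the difference, here is a minimal sketch assuming the legacy App Engine Search API for the Python standard environment (since deprecated); the index name and fields are illustrative only:

```python
# Sketch of the legacy GAE Search API: you supply the documents and own the index.
from google.appengine.api import search

index = search.Index(name="products")  # hypothetical index name

# You control what gets indexed...
index.put(search.Document(
    doc_id="item-1",
    fields=[
        search.TextField(name="title", value="Blue widget"),
        search.NumberField(name="price", value=9.99),
    ],
))

# ...and you query your own index, not the public Google index.
results = index.search("title:widget")
for doc in results:
    print(doc.doc_id, doc.fields)
```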

As of December 2018, Google CSE lets you define a set of websites to run requests against: it allows up to 2,000 included website sources and up to 5,000 sources overall.
A simple comparison:
Google CSE provides a strong API (sketched below) and custom requests, with nothing to run on your own server, but in contrast the free tier permits only 100 requests per day.
Developing a new search engine can make sense for small sets of websites and gives you a search engine customized to the business needs, but it requires time, infrastructure, money, and development of the search engine algorithms: indexing, storage, and analysis.
To sum up: it depends on which side of that trade-off you really need.
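As a minimal sketch of calling that API (the Custom Search JSON API), where the API key and engine ID (cx) are placeholders you would substitute with your own:

```python
# Sketch of the Google Custom Search JSON API; key and cx are placeholders.
import requests

API_KEY = "YOUR_API_KEY"
CSE_ID = "YOUR_CSE_ID"  # the custom engine restricted to your list of sites

def cse_search(query, num=10):
    resp = requests.get(
        "https://www.googleapis.com/customsearch/v1",
        params={"key": API_KEY, "cx": CSE_ID, "q": query, "num": num},
        timeout=10,
    )
    resp.raise_for_status()
    data = resp.json()
    # Each item carries title / link / snippet you can map into your own structure.
    return [
        {"title": i.get("title"), "link": i.get("link"), "snippet": i.get("snippet")}
        for i in data.get("items", [])
    ]
```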

Related

How do I track keywords entered in Google Search using Google Tag Manager?

I want to know what keywords bring users to our website. The result should be such that, every time a user clicks on a link to the company's website, the page URL, timestamp, and keywords entered in the search are recorded.
I'm not really much of a coder, but I do understand the basics of Google Tag Manager. So I'd appreciate some solutions that can allow me to implement this in GTM's interface itself.
Thanks!
You don't track them. Well, that is unless you can deploy your GTM on Google's search result pages. Which you're extremely unlikely to be able to do.
HTTPS prevents query parameters from being passed along in the referrer, which is the core reason for this.
You still can, technically, track Google search keywords for the extremely rare users who manage to reach you from Google over plain HTTP, but again, there's no need to do anything in GTM: GA will track it automatically with its legacy keyword tracking.
Finally, you can use Google Search Console, where Google reports what keywords were used to get to your site. That information, however, is so heavily sampled that it's just not joinable to any of the GA data. It is possible to link GSC with GA, but that only gives GA a separate report fed from GSC and that's it. No real data joins.

Search engine components

I'm a middle school student learning computer programming, and I just have some questions about search engines like Google and Yahoo.
As far as I know, these search engines consist of:
Search algorithm & code
(Example: search.py file that accepts search query from the web interface and returns the search results)
Web interface for querying and showing result
Web crawler
What I am confused about is the Web crawler part.
Do Google's and Yahoo's web crawlers search through every single webpage existing on the WWW at query time? Or do they:
First download all the existing webpages on the WWW, save them on their huge servers, and then search through these saved pages?
If the latter is the case, then wouldn't the search results appearing on Google be outdated, since I suppose searching through all the webpages on the WWW would take a tremendous amount of time?
PS. One more question: how exactly does a web crawler retrieve all the web pages existing on the WWW? For example, does it go through all possible web addresses, like www.a.com, www.b.com, www.c.com, and so on? (although I know this can't be true)
Or is there some way to get access to all the existing webpages on the World Wide Web? (sorry for asking such a silly question..)
Thanks!!
The crawlers search through pages, download them, and save (parts of) them for later processing. So yes, you are right that the results search engines return can easily be outdated. And a couple of years ago they really were quite outdated. Only relatively recently did Google and others start doing more real-time searching, by collaborating with large content providers (such as Twitter) to get data from them directly and frequently, but they took real-time search offline again in July 2011. They also take note of, for example, how often a web page changes, so they know which pages to crawl more often than others. And they have special systems for this, such as the Caffeine web indexing system. See also their blog post Giving you fresher, more recent search results.
So what happens is:
Crawlers retrieve pages
Backend servers process them
Parse text, tokenize it, index it for full text search (see the toy sketch after this list)
Extract links
Extract metadata such as schema.org for rich snippets
Later they do additional computation based on the extracted data, such as
Page rank computation
In parallel they can be doing lots of other stuff such as
Entity extraction for Knowledge graph information
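To make the "parse, tokenize, index" step concrete, here is a toy in-memory inverted index in Python. It is nowhere near what the real systems do, but it shows the core idea: map each term to the set of pages containing it, then answer queries by intersecting those sets.

```python
# Toy illustration of the tokenize-and-index step: a minimal in-memory
# inverted index mapping each term to the set of pages that contain it.
import re
from collections import defaultdict

inverted_index = defaultdict(set)

def index_page(url, text):
    """Tokenize the page text and record which terms appear on which page."""
    for term in re.findall(r"[a-z0-9]+", text.lower()):
        inverted_index[term].add(url)

def search(query):
    """Return pages containing every term in the query (simple AND search)."""
    terms = re.findall(r"[a-z0-9]+", query.lower())
    if not terms:
        return set()
    result = inverted_index.get(terms[0], set()).copy()
    for term in terms[1:]:
        result &= inverted_index.get(term, set())
    return result

index_page("http://example.com/a", "Search engines build an inverted index of pages")
index_page("http://example.com/b", "Crawlers download pages for the index")
print(search("inverted index"))  # -> {'http://example.com/a'}
```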
Discovering what pages to crawl happens simply by starting with a page, following its links to other pages, following their links, and so on. In addition to that, they have other ways of learning about new web sites: for example, if people use Google's public DNS server, Google will learn about the pages they visit, and the same goes for links shared on G+, Twitter, etc.
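A toy sketch of that link-following discovery, using only the Python standard library (a real crawler would also respect robots.txt, rate limits, politeness delays, and so on):

```python
# Toy link-following crawler: start from a seed page, fetch it, extract links,
# and queue newly discovered URLs (breadth-first). For illustration only.
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

def crawl(seed, max_pages=50):
    seen, queue = {seed}, deque([seed])
    while queue and len(seen) <= max_pages:
        url = queue.popleft()
        try:
            html = urlopen(url, timeout=10).read().decode("utf-8", errors="replace")
        except Exception:
            continue  # unreachable pages are simply skipped
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            absolute = urljoin(url, href)
            if absolute.startswith("http") and absolute not in seen:
                seen.add(absolute)       # discovered only because something linked to it
                queue.append(absolute)
    return seen
```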
There is no way of knowing what all the existing web pages are. There may be some that are not linked from anywhere and that no one publicly shares a link to (and whose visitors don't use Google's DNS, etc.), so there is no way of knowing about them. Then there's the problem of the Deep Web. Hope this helps.
Crawling is not an easy task (for example, Yahoo now outsources crawling to Microsoft's Bing). You can read more about it in Page and Brin's own paper: The Anatomy of a Large-Scale Hypertextual Web Search Engine.
More details about storage, architecture, etc. can be found, for example, on the High Scalability website: http://highscalability.com/google-architecture

What are the pros & cons of using Google CSE vs implementing dedicated search engine for a site like stackoverflow?

I understand that the same work should not be repeated when Google CSE is already there, so what might be the reasons to consider implementing a dedicated search engine for a public-facing website similar to SO (and why did StackOverflow probably do that)? The paid version of CSE (Google Site Search) already eliminates several drawbacks that used to force a dedicated implementation. Cost may be one reason not to choose Google CSE, but what are the other reasons?
Another thing I want to ask: my site is of a similar kind to StackOverflow, so when Google indexes its content every now and then, won't that overload my database servers with lots of queries, perhaps at peak traffic times?
I'm looking forward to using the Google Custom Search API, but I need to clarify whether the 1000 paid queries I get for $5 are valid only for one day, or whether they carry over as extra queries (beyond the free ones) to the next day and so on. Can anyone clarify this too?
This depends on the content of your site, the frequency of the updates, and the kind of search you want to provide.
For example, with StackOverflow, there'd probably be no way to search for questions of an individual user through Google, but it can be done with an internal search engine easily.
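As an illustration of the kind of query an internal engine handles easily, here is a hedged sketch of a fielded Solr query restricted to one author. The core name and field names (author_id, title) are hypothetical and depend entirely on your own schema:

```python
# Example of a query a site-restricted Google search can't express:
# "questions by one specific user matching some text".
# Core name and field names (author_id, title, score) are hypothetical.
import requests

def questions_by_user(author_id, text):
    resp = requests.get(
        "http://localhost:8983/solr/questions/select",
        params={
            "q": text,                       # full-text part of the query
            "fq": f"author_id:{author_id}",  # filter query: restrict to one user
            "fl": "id,title,score",          # only return the fields we need
            "wt": "json",
        },
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()["response"]["docs"]
```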
Similarly, Google can retire their APIs at any time; in fact, if past experience is any indication, Google has already done so with the Google Web Search API, leaving a lot of non-profits whose projects were built on that API out on the street with no Google option for continuing their services (paying 100 USD/year for only 20,000 search queries per year may be fine for a posh blog, but it greatly limits what you can actually use the search API for).
On the other hand, you probably already want Google to index all of your pages to get the organic search traffic, so Google CSE would likely use rather minimal resources on your server compared to running a complete in-house search engine.
Now that Google Site Search is gone, the best search tool alternative for all the loyal Google fans is Google Custom Search (CSE).
Some of the features of Google Custom Search that I loved the most were:
It's free (with ads)
Ability to monetise those ads with your AdSense account
Tons of customization options, including removing the Google branding
Ability to link it with a Google Analytics account for highly comprehensive analytical reports
Powerful auto-correct feature to understand the real intention behind typos
Cons: lacks customer support
Read More: https://www.techrbun.com/2019/05/google-custom-search-features.html

Reverse Engineering data from Lucene/Solr Indexes

I am investigating whether it is feasible to deploy search servers to the cloud, and one of my questions revolves around data security. Currently all of our fields (except a few used for faceting) are indexed but not stored (except for the ID, which we use to retrieve the document after the search has completed).
If for some reason the servers in the cloud were compromised, would it be possible for that person to reverse engineer our data from the indexes, even without the fields being stored?
Depends on the security level you need and the sensitivity of the document content...
With the configuration you describe it wouldn't be possible to rebuild the original as a "clone"... BUT it would be possible to recover enough information to gain a lot of knowledge about the content... depending on the context this could be damaging...
An important point:
If you use the cloud-based servers to build the index and they get compromised, THEN there would be no need for "reversing" at all, depending on your configuration: at least not for any document you index after the servers are compromised, because for building the index the document gets sent over as-is (for example when using http://wiki.apache.org/solr/ExtractingRequestHandler)...
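A purely conceptual sketch (plain Python, not the Lucene API) of why indexed-but-not-stored fields still leak information: if term positions are indexed, the term dictionary plus postings is enough to approximately rebuild the text of a document, even with nothing "stored". The postings data below is invented for illustration.

```python
# Conceptual sketch: an inverted index exposing only terms and positional
# postings still lets an attacker approximately rebuild each document's text.
# What a compromised index essentially exposes: term -> [(doc_id, positions)]
postings = {
    "acme":     [(1, [0])],
    "acquires": [(1, [1])],
    "globex":   [(1, [2])],
    "for":      [(1, [3])],
    "2":        [(1, [4])],
    "billion":  [(1, [5])],
}

def reconstruct(doc_id):
    """Re-order terms by position to approximate the original field text."""
    terms_by_pos = {}
    for term, docs in postings.items():
        for d, positions in docs:
            if d == doc_id:
                for pos in positions:
                    terms_by_pos[pos] = term
    return " ".join(terms_by_pos[p] for p in sorted(terms_by_pos))

print(reconstruct(1))  # "acme acquires globex for 2 billion" -- sensitive, despite no stored fields
```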
As Yahia says, it's possible to get some information. If you're really concerned about this, use an encrypted file system, as Amazon suggests.

search integration

I am working on a website that currently has a number of disparate search functions, for example:
A crawl 'through the front door' of the website
A search that communicates with a web-service
etc...
What would be the best way to tie these together, and provide what appears to be a unified search function?
I found the following list on Wikipedia:
Free and open source enterprise search software
Lucene and Solr
Xapian
Vendors of proprietary enterprise search software
AskMeNow
Autonomy Corporation
Concept Searching Limited
Coveo
Dieselpoint, Inc.
dtSearch Corp.
Endeca Technologies Inc.
Exalead
Expert System S.p.A.
Funnelback
Google Search Appliance
IBM
ISYS Search Software
Microsoft (includes Microsoft Search Server, Fast Search & Transfer)
Open Text Corporation
Oracle Corporation
Queplix Universal Search Appliance
SAP
TeraText
Vivísimo
X1 Technologies, Inc.
ZyLAB Technologies
Thanks for any advice regarding this.
Solr is an unbelievably flexible solution for search. Just in the last year I coded two Solr-based websites and worked on a third existing one, and each worked in a very different way.
Solr simply eats XML requests to add something to an index, and XML requests to search for something inside an index. It doesn't do crawling or text extraction for you, but most of the time these are easy to do. There are many existing add-ons for the Solr/Lucene stack, so maybe something for you already exists.
I would avoid proprietary software unless you're sure Solr is insufficient. It's one of the nicest programs I've worked with: very flexible when you need it, and at the same time you can start in minutes without reading long manuals.
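To make the "XML in, results out" flow above concrete, here is a small sketch assuming a local standalone Solr instance with a core named "mycore"; the core and field names are illustrative and depend on your schema:

```python
# Sketch of Solr's XML update + query flow; core and field names are illustrative.
import requests

SOLR = "http://localhost:8983/solr/mycore"

# Add a document by POSTing an XML <add> request to the update handler.
add_xml = """
<add>
  <doc>
    <field name="id">doc-1</field>
    <field name="title">Integrating disparate search functions</field>
    <field name="body">Crawl results and web-service results, unified in one index.</field>
  </doc>
</add>
"""
requests.post(f"{SOLR}/update", data=add_xml.encode("utf-8"),
              headers={"Content-Type": "text/xml"}, params={"commit": "true"})

# Query the index; ask for JSON back to make the response easy to consume.
resp = requests.get(f"{SOLR}/select", params={"q": "title:disparate", "wt": "json"})
for doc in resp.json()["response"]["docs"]:
    print(doc["id"], doc.get("title"))
```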
Note that no matter what search solution you use, a search setup is "disparate" by nature.
You will still have an indexer, and a search UI, or the "framework".
You WILL corner yourself by marrying a specific search technology. You actually want to keep the UI as separate from the search backend as possible. The backend may stop scaling, or there may be a better search engine out there tomorrow.
Switching search engines is very common, so never - ever - write your interface with a specific search engine in mind. Always abstract it, so the UI is not aware of the actual search technology used.
Keep it modular, and you will thank yourself later.
By using a standard web services interface, you can also allow 3rd parties to build stuff for you, and they won't have to "learn" whatever search engine you use on the backend.
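One minimal way to express that abstraction in code, as a sketch only (the class and method names here are just one possible convention, not a standard interface):

```python
# Sketch of keeping the UI unaware of the backend: the application codes against
# a thin search interface, and the concrete engine is plugged in behind it.
from abc import ABC, abstractmethod

class SearchBackend(ABC):
    @abstractmethod
    def index(self, doc_id: str, fields: dict) -> None: ...

    @abstractmethod
    def search(self, query: str, limit: int = 10) -> list[dict]: ...

class SolrBackend(SearchBackend):
    """One possible implementation; swapping engines means writing another class."""
    def __init__(self, base_url: str):
        self.base_url = base_url

    def index(self, doc_id, fields):
        ...  # e.g. POST to Solr's update handler

    def search(self, query, limit=10):
        ...  # e.g. GET against Solr's select handler, map results to plain dicts
        return []

def render_results(backend: SearchBackend, query: str):
    # The UI layer only ever sees the abstract interface, never the engine.
    for hit in backend.search(query):
        print(hit.get("title"), hit.get("url"))
```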
Take a look at these similar questions:
Best text search engine for integrating with custom web app?
How do I implement Search Functionality in a website?
My personal recommendation: Solr.
All these companies offer different flavours of universal search. Smaller companies have carved out very functional and highly sought-after niches for themselves. For example, Queplix enables any search engine to work with structured data and enterprise applications by extracting the data, business objects, roles, and permissions from all indexed applications. It provides enterprise ranking criteria as well as data-compliance alerts.
Two other solutions that weren't as well known and/or available around the time the original question was asked:
Google Custom Search - especially since the disable public URL option was recently added
YaCy - you can join the network or download and roll your own independent servers
