How search engines find websites on the internet

I'm going to write a web crawler (an application that crawls the web from one site to another).
How can I find a list of the domains/IPs available on the internet (as complete as possible)?
How do search engines find websites (what do they use as a reliable list of registered IPs/domains for a starting point)?
Thanks

As Michael P's comment indicates, it depends on what your objective is.
My company recently wanted to answer a question about third-party tools used on leading websites. I used Alexa as a starting point to find the top (by traffic) websites, and created a parser that can answer the specific question my company asked. If you start from such a list, you can program your web crawler to follow the links it encounters to broaden your knowledge of sites on the web.
Hopefully that helps you think about the problem.
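The idea of starting from a seed list and following links can be sketched roughly as below. This is a minimal illustration using only the Python standard library, not how any particular search engine works; the names `LinkExtractor`, `fetch_html`, and `crawl`, and the `max_pages` limit, are assumptions made for the example.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse
from urllib.request import urlopen


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def fetch_html(url):
    """Download a page and return its text (requires network access)."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def crawl(seed_urls, fetch=fetch_html, max_pages=50):
    """Breadth-first crawl from the seeds; returns the hosts that answered."""
    queue = deque(seed_urls)
    visited = set()
    hosts = set()
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        if url in visited:
            continue
        visited.add(url)
        try:
            html = fetch(url)
        except Exception:
            continue  # unreachable or unreadable page: skip it
        hosts.add(urlparse(url).netloc)
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            absolute = urljoin(url, href)  # resolve relative links
            if urlparse(absolute).scheme in ("http", "https"):
                queue.append(absolute)
    return hosts
```

Calling `crawl(["http://example.com/"])` with a list of top sites as seeds would widen the set of known hosts as new links are discovered. A real crawler would also need to respect robots.txt and rate-limit its requests, which this sketch leaves out.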

Related

What are the pros & cons of using Google CSE vs implementing dedicated search engine for a site like stackoverflow?

I understand that work should not be repeated when Google CSE already exists, so what are the reasons to consider implementing a dedicated search engine for a public-facing website similar to SO (and why did Stack Overflow probably do that)? The paid version of CSE (Google Site Search) already eliminates several drawbacks that used to force a dedicated implementation. Cost may be one reason not to choose Google CSE, but what are the other reasons?
Another thing I want to ask: my site is similar in kind to Stack Overflow, so when Google indexes its content every now and then, won't that overload my database servers with lots of queries, possibly during peak traffic?
I look forward to using the Google Custom Search API, but I need to clarify whether the 1000 paid queries that I get for $5 are valid for only one day, or whether unused ones carry over to cover extra queries (beyond the free ones) on the next day and so on. Can anyone clarify this too?
This depends on the content of your site, the frequency of the updates, and the kind of search you want to provide.
For example, with Stack Overflow, there would probably be no way to search for the questions of an individual user through Google, but that can be done easily with an internal search engine.
Similarly, Google can deprecate their API at any time; in fact, Google has already done so with their Google Web Search API, which left a lot of non-profits that had built projects on that API on the street, with no Google options for continuing their services (paying 100 USD/year for only 20,000 search queries per year may be fine for a posh blog, but it greatly limits what you can actually use the search API for).
On the other hand, you probably already want to have Google index all of your pages, to get the organic search traffic, so Google CSE would probably use rather minimal resources of your server, compared to having a complete in-house search engine.
Now that Google Site Search is gone, the best alternative search tool for loyal Google fans is Google Custom Search (CSE).
Some of the features of Google Custom Search that I loved the most were:
It's free (with ads)
Ability to monetise those ads with your AdSense account
Tons of customization options, including removing the Google branding
Ability to link it with a Google Analytics account for comprehensive analytics reports
Powerful autocorrect feature that understands the real intention behind typos
Cons: lacks customer support
Read More: https://www.techrbun.com/2019/05/google-custom-search-features.html

Google Chrome extension to email parents about sites blocked by DNS

I want to build a Google Chrome extension that emails parents the addresses of illicit websites that are visited, meaning sites that are prohibited by DNS Nawala or a similar filter. The extension is expected to prevent the negative impact of internet use.
What steps should I take to build this extension?
Thank you.
This is a very broad "how do I create my entire project" question, but I'll try to give you some broad advice:
An extension alone will not be enough for this. You're going to need a web service as well. You'll likely need to divide the project into two parts:
A Chrome extension that monitors the websites a person visits. You can do this by using the chrome.tabs API. Simply look at each site the user visits, and if they visit any of the illicit sites on a blacklist, take an action, probably by making an API call to the web service mentioned below.
You're almost certainly going to need a web service developed in a server-side language like PHP or Java, or something similar. This web service would take care of sending the emails to parents. If we're just talking about sending an email to one parent, then this service could be quite simple. The extension would tell the web service to send an email when an illicit site is visited, and that's about it. If you're talking about a commercial project, then this service would probably need to be a full-fledged website that allows parents to sign up for these emails.
Again, this is a very broad question, and generally speaking Stack Overflow is more for asking specific programming questions. But hopefully this will get you moving in the right direction at least, so you can come back and ask more specific questions. :-)
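The web service half of this split could look roughly like the following. It is only a sketch in Python (rather than the PHP or Java mentioned above), and the blocklist contents, the sender address, the function names, and the SMTP server on localhost are all assumptions made for illustration.

```python
import smtplib
from email.message import EmailMessage
from urllib.parse import urlparse

# Hypothetical blocklist; a real service would load it from a maintained source.
BLOCKED_HOSTS = {"blocked.example", "illicit.example"}


def is_blocked(url, blocked_hosts=BLOCKED_HOSTS):
    """Return True if the URL's host, or a parent domain of it, is listed."""
    host = (urlparse(url).hostname or "").lower()
    parts = host.split(".")
    return any(".".join(parts[i:]) in blocked_hosts for i in range(len(parts)))


def build_alert(url, parent_address, sender="alerts@service.example"):
    """Build the notification email for one visited URL."""
    message = EmailMessage()
    message["Subject"] = "Blocked site visited"
    message["From"] = sender  # assumed sender address
    message["To"] = parent_address
    message.set_content(f"A blocked site was opened: {url}")
    return message


def handle_report(url, parent_address, send=None):
    """Handle one URL reported by the extension; True if an alert was sent."""
    if not is_blocked(url):
        return False
    message = build_alert(url, parent_address)
    if send is None:
        # Assumes an SMTP server is listening on localhost.
        with smtplib.SMTP("localhost") as smtp:
            smtp.send_message(message)
    else:
        send(message)
    return True
```

The extension would call an HTTP endpoint that wraps `handle_report`; passing a custom `send` function makes it possible to try the logic without a mail server.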

Great variation in web statistics

I have a blog site, a WP 3.0 install. I've dropped Google Analytics' tracking code into the footer (a recommended technique, I believe). I also have two different types of web statistics software available on the virtual server, through the hosting company. However, the web statistics vary greatly. Why such variation?
Statistics --
http://pastebin.com/Nc10iGaA
Thanks a million!
Google Analytics leaves most bot hits out of its traffic numbers, because it only counts visitors whose browsers run its JavaScript tracking code. You might be seeing Googlebot and other crawlers in the other hit counts.
I would guess that each package counts visits differently. You should look at the documentation to figure out what qualifies as a "visit," which may exclude or include crawlers, certain user agents, certain access patterns, etc.
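To see how much a single counting rule can change the totals, here is a small sketch that counts the same hits with and without crawler user agents. The marker list and function names are assumptions for illustration; real statistics packages use their own, usually much longer, rules.

```python
# Substrings assumed to identify crawlers; real tools use longer lists.
BOT_MARKERS = ("bot", "crawler", "spider", "slurp")


def is_bot(user_agent):
    """Guess whether a User-Agent string belongs to a crawler."""
    ua = user_agent.lower()
    return any(marker in ua for marker in BOT_MARKERS)


def count_hits(user_agents):
    """Count all hits, and hits that remain once crawlers are excluded."""
    total = len(user_agents)
    non_bot = sum(1 for ua in user_agents if not is_bot(ua))
    return {"all_hits": total, "excluding_bots": non_bot}
```

Running `count_hits` over the User-Agent column of a server log gives two different totals from identical traffic, which is the kind of gap that can appear between a log-based tool and a JavaScript-based one.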

Resource for developing a website

Can anyone recommend resources to learn how to develop websites, as opposed to web applications?
I am looking to develop a website for a consulting company to be precise. I would be more interested in best practices for creating the layout of a website (user appeal, eye candy, not an eye sore)
Thanks
-M
It really depends on the language you want to use, your current skill set, who's going to maintain the site, what OS the site will be hosted on, etc.
I suspect you need to narrow down your question.
What do you mean by website rather than web application? Are you talking about the dynamic nature of the content or something else?
Update:
If you're looking for discussions on design of websites (visual design, UX etc) then I'm a great fan of Smashing Magazine.
http://www.smashingmagazine.com/
It doesn't often speak about MS technologies (ASP.NET etc) but it's a great place to see discussions and papers on "what makes a great website". Some recent examples:
http://www.smashingmagazine.com/2009/05/15/optimizing-conversion-rates-its-all-about-usability/
http://www.smashingmagazine.com/2009/05/14/non-profit-website-design-examples-and-best-practices/
Subscribe to their RSS feed and see what those colouring-in people get up to.
Here's your first port of call.
Unless you're artistically inclined, I recommend purchasing or contracting the template design to someone who is skilled in this area.
For $60 a year, you can have unlimited downloads and unlimited use of all the templates at the following site:
http://www.dreamtemplate.com/
There are many more here:
http://www.templatemonster.com/website-templates.php
http://www.w3schools.com/
For purely informational sites, HTML and CSS will probably be plenty, though I would recommend using WordPress if you're just trying to put content on the internet.
If you speak German or French, http://www.selfhtml.org is quite a good resource.
Otherwise, I would recommend http://www.w3schools.com/ or http://htmldog.com/. Both are very good as they really go deeply into the matter and tell about standards from the beginning.
sitepoint.com
Their best content is packaged in their books, but their articles are good, too. Covers design best-practices and web standards, but also has good tips on the business of web design and managing clients.
You may want to look at the A List Apart website.
Simply the best I have seen for this.
I would also, since I have just been reminded of it, use
http://www.webmonkey.com/
http://w3schools.com/
http://www.w3schools.com/ is a good start.

Embed Google/Yahoo search into a web site or build your own

I am looking for an opinion on whether to use Google Custom Search, Yahoo Search Builder, or build my own for web projects (no more than 100 pages of content). If I should build my own, do you have any quick-start kits you could recommend?
Many thanks
Chris
I have had success using OpenSearch for my personal blog.
While working at BigCorp we used dedicated search appliances in yellow boxes, but in your case (around 100 pages) it does not make sense to go that route.
I would suggest going with either Google Custom Search, or Yahoo Search Builder (as long as they both index your site sufficiently to provide good results).
More often than not, you'll get better results and you don't have to worry about building your own custom engine (or implementing an off the shelf/open source piece of software to do the job for you).
I've used IBM OmniFind Yahoo Edition and had fantastic results with it. You are limited to a single index per implementation, but it's very fast, easy to integrate with, and extensible in terms of search customization. I've used it with an ASP.NET site without issue. One caveat is that it needs to be installed on the server and run as a service, so it is out of the question for most shared hosting. It has the indexing capabilities of general search engines (PDF, HTML, etc.), which is very nice.
Edit:
I forgot to mention that some of the reasons I liked it over other options are that it is free and doesn't require any additional hardware, just FYI.
The main situation I see Google/Yahoo as being sub-optimal is when your site relies on up-to-the-minute results. You're at the mercy of their crawling policies/speed/etc. If that's okay (and I suspect it will be for most 100ish page sites), use them - the results will be great. If realtime results are important, you may have to bite the bullet and install something locally.
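For a site of around 100 pages, "something local" does not have to be large. The following is a minimal sketch of an in-memory inverted index that can be updated the moment a page changes; the class name `SiteIndex` and its methods are hypothetical, and it does plain word matching with no ranking, stemming, or persistence.

```python
import re
from collections import defaultdict


class SiteIndex:
    """Tiny inverted index: maps each word to the pages that contain it."""

    def __init__(self):
        self._postings = defaultdict(set)  # word -> set of page ids
        self._words_by_page = {}           # page id -> set of its words

    def add_page(self, page_id, text):
        """Index a page, replacing any earlier version of it."""
        self.remove_page(page_id)
        words = set(re.findall(r"[a-z0-9]+", text.lower()))
        self._words_by_page[page_id] = words
        for word in words:
            self._postings[word].add(page_id)

    def remove_page(self, page_id):
        """Drop a page from the index if it is present."""
        for word in self._words_by_page.pop(page_id, set()):
            self._postings[word].discard(page_id)

    def search(self, query):
        """Return the pages that contain every word of the query."""
        words = re.findall(r"[a-z0-9]+", query.lower())
        if not words:
            return set()
        result = set(self._postings.get(words[0], set()))
        for word in words[1:]:
            result &= self._postings.get(word, set())
        return result
```

Calling `add_page` whenever content is saved keeps results current without waiting for an external crawler, which is the trade-off described above.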
Yahoo BOSS is cheaper and is recommended by many people.
I am going to integrate it soon.

Resources