How do web crawlers build directories of URLs to scrape the content they need?

I'm trying to understand how web crawling works. I have three questions:

* Do we have to have an initial directory of URLs to build a larger directory of URLs? How does this work?
* Are there any open-source web crawlers written in Python?
* Where is the best place to learn more about web crawlers?

Answering your second question first: Scrapy is a great tool for web scraping in Python.
When using it there are a number of ways to start the spiders. A CrawlSpider can be given a list of initial URLs to start from. It then scrapes these pages looking for new links, which are added to the queue of pages to crawl.
Another way to use it is with the SitemapSpider. For this spider you give the crawler a list of website sitemap URLs. The spider then reads the list of pages from each sitemap and crawls those.
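For a concrete picture, here is a minimal CrawlSpider sketch (it assumes Scrapy is installed, and the domain, start URL and selector are just placeholders):

```python
# Minimal sketch of a Scrapy CrawlSpider. Assumes Scrapy is installed
# (pip install scrapy); example.com is a placeholder domain.
import scrapy
from scrapy.crawler import CrawlerProcess
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class ExampleSpider(CrawlSpider):
    name = "example"
    allowed_domains = ["example.com"]      # stay on this site
    start_urls = ["https://example.com/"]  # the initial "seed" URLs

    # Follow every link found on a crawled page and pass each response to
    # parse_item; Scrapy keeps the queue of discovered URLs for you.
    rules = (Rule(LinkExtractor(), callback="parse_item", follow=True),)

    def parse_item(self, response):
        yield {"url": response.url, "title": response.css("title::text").get()}


if __name__ == "__main__":
    process = CrawlerProcess(settings={"LOG_LEVEL": "INFO"})
    process.crawl(ExampleSpider)
    process.start()
```

A SitemapSpider is set up the same way, except that you give it sitemap_urls instead of start_urls and it pulls the list of pages out of the sitemaps for you.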

Related

Google does not index my posts

I'd like to know why Google does not index the posts on my blog, which is written in NodeJS.
Link of a post : http://icecom.fr/articles-icecom/9
Anthony
There are several reasons why Google might not be indexing your website:
* There are no links to your website. Google follows links on the internet to reach other pages; if there are no links to your website, it won't find it.
* You are denying access to Google through the robots meta tag or robots.txt.
* You haven't waited long enough yet; Google may take some time before it has indexed your website.
Of course you can supply Google with the proper URLs with a [sitemap](https://support.google.com/webmasters/answer/156184?hl=en). A good place to create one if you're new to this could be [xml-sitemaps.com](http://www.xml-sitemaps.com/).
@szenbalu already mentioned that you can upload this sitemap.xml to Google Webmaster Tools; that way Google can index your site without needing any links, and it is usually faster too.
Another way to get your website indexed through Google Webmaster Tools is the 'Fetch as Google' tool. Here you can tell Google to fetch and index your website. This is especially useful if you change content and want it reindexed.
About your specific case:
* You do not block Google with the meta robots tag
* I cannot find a robots.txt file
* I cannot find any links to your articles on [OpenSiteExplorer](http://www.opensiteexplorer.org/)
I think that uploading a sitemap to Google Webmaster Tools plus using the Fetch as Google tool will get your site indexed in no time.
If you have any questions left, feel free to ask. :)
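If you want to rule out an accidental block yourself while you wait, the first two checks can be scripted. A rough sketch in Python (it assumes the requests package is installed; the URL is the one from the question):

```python
# Rough sketch: check whether a page is blocked by robots.txt or asks not to
# be indexed via a meta robots tag. Assumes the 'requests' package is
# installed; the URL is taken from the question.
import urllib.robotparser
from urllib.parse import urljoin, urlparse

import requests

url = "http://icecom.fr/articles-icecom/9"
root = "{0.scheme}://{0.netloc}/".format(urlparse(url))

# 1. robots.txt: is Googlebot allowed to fetch this URL?
rp = urllib.robotparser.RobotFileParser()
rp.set_url(urljoin(root, "robots.txt"))
rp.read()
print("robots.txt allows Googlebot:", rp.can_fetch("Googlebot", url))

# 2. meta robots: does the page itself contain a 'noindex' directive?
html = requests.get(url, timeout=10).text.lower()
print("page contains 'noindex':", "noindex" in html)
```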
Do you have a robots.txt file and a Google Webmaster Tools account connected to your page?
With Webmaster Tools you can upload a sitemap that Google will use to index your pages.

How to crawl English sites and avoid crawling other languages?

Hi, I need to crawl only sites whose language is English. I know Nutch can detect the language of a site with plugins like the language detector, but I need to prevent Nutch from crawling non-English sites. Although I know we have to crawl a page in order to detect its language, I want to leave the site at the first chance we can detect the language. Could you please tell me if this is possible? For example, if two or three pages of a site were fetched and they weren't English, Nutch should leave the site and abandon those pages and all of their URLs. Thanks for any help.
If you have a quick look at the list of HTTP header fields (http://en.wikipedia.org/wiki/List_of_HTTP_header_fields), you can ask for the content language and you will get an answer like this: "Content-Language: en".
You do not need to do a GET request (and download the whole page); you can ask for this header in a HEAD request, which downloads only the headers.
About "For example if two or three pages of a site were fetched and they weren't English nutch should leave the site and abandon those pages and all urls of them.":
A site could be multilingual, so you might get the first three pages in Spanish (or whatever) and leave the site, even though there are some pages in English.
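A rough sketch of that check in Python with the requests library (the URL is a placeholder; note that many servers simply never send Content-Language, so a missing header means "unknown", not "non-English"):

```python
# Rough sketch: ask for only the headers of a page and look at
# Content-Language before deciding whether to crawl the site fully.
# Assumes the 'requests' package is installed; the URL is a placeholder.
import requests

url = "https://example.com/"
resp = requests.head(url, allow_redirects=True, timeout=10)
lang = resp.headers.get("Content-Language", "")

if lang.lower().startswith("en"):
    print("Declared as English, keep crawling:", url)
elif lang:
    print("Declared as %r, skip this site" % lang)
else:
    # Many servers omit the header entirely, so fall back to fetching a few
    # pages and running on-page language detection instead of skipping.
    print("No Content-Language header; language unknown for", url)
```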

How to use locations.kml with sitemap.xml

I would like to make sure my website ranks as high as possible whenever my Google Places location ranks high.
I have seen references to creating a locations.kml file, putting it in the root directory of my site, and then adding lines in the sitemap.xml file that point to this .kml file.
I get this from this statement on the geolocations page:
Google no longer supports the Geo extension to the Sitemap protocol. We recommend that you tell Google about geographically-based URLs by including them in a regular Web Sitemap.
There is a link to the Web Sitemap page
http://support.google.com/webmasters/bin/answer.py?hl=en&answer=183668
I'm looking for examples of how to include Geo location information in the sitemap.xml file.
Would someone please point me to an example so that I can know how to code the reference?
I think the point is that you don't use any specific formatting in the sitemap. You make sure you include all your locally relevant pages in the sitemap as normal (i.e. you don't include any geo location in the sitemap).
Googlebot will use its normal methods for determining whether the page should be locally targeted.
(I think Google has found the sitemap protocol has been abused and/or misunderstood, so they don't need it to tell them so much about the page. Rather, it's just a way to find pages that might take a long time to discover through conventional means.)
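In other words, a location page goes into the sitemap as a perfectly ordinary entry. A minimal illustration (the URLs and dates are made up):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <!-- Location pages are listed like any other page; no geo extension is used. -->
  <url>
    <loc>http://www.example.com/locations/springfield</loc>
    <lastmod>2014-01-01</lastmod>
  </url>
  <url>
    <loc>http://www.example.com/locations/shelbyville</loc>
    <lastmod>2014-01-01</lastmod>
  </url>
</urlset>
```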

How do I check an entire website to see if any page in it links to a particular URL?

We have been hounded by an issue on our websites: web protection tools, like the ones from Norton, keep telling certain visitors in certain browsers that our websites are potential risks because we link to a certain http://something.abnormal.com/ (sample URL only).
I've been trying to scour the site page by page, to no avail.
My question: do you know of any tool that would be able to "crawl" our website's pages and then check whether any text, image, or anything else in them links to the abnormal URL that keeps bugging us?
Thanks so much! :)
What you want is a 'spider' application. I use the spider in 'Burp Suite', but there is a range of free, cheap and expensive ones.
The good thing about Burp is you can get it to spider the entire site and then look at every page for whatever you want, whether it be something to match a regex or dynamic content etc.
If your websites consist of a small number of static content pages, I would use wget to download all pages (ignoring images):
wget -r -np -R gif,jpg,png http://www.example.com
and then run a text search for the suspicious URL over the result. If your websites are more complex, HTTrack might be easier to configure for a text-only download.
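If you'd rather script it, a small same-site crawler can do the same job. A rough sketch (it assumes the requests and beautifulsoup4 packages are installed; the start URL and suspicious host are placeholders):

```python
# Rough sketch: crawl one site and report any page that references a
# suspicious host. Assumes 'requests' and 'beautifulsoup4' are installed;
# START and SUSPICIOUS are placeholders.
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

START = "http://www.example.com/"
SUSPICIOUS = "something.abnormal.com"

seen, queue = set(), [START]
while queue:
    page = queue.pop()
    if page in seen:
        continue
    seen.add(page)
    try:
        html = requests.get(page, timeout=10).text
    except requests.RequestException:
        continue

    # Flag any raw occurrence of the suspicious host (covers inline scripts).
    if SUSPICIOUS in html:
        print("Suspicious reference somewhere in", page)

    soup = BeautifulSoup(html, "html.parser")
    # Flag tags whose href/src points at the suspicious host.
    for tag in soup.find_all(True):
        for attr in ("href", "src"):
            value = tag.get(attr)
            if value and SUSPICIOUS in value:
                print("Found %s in <%s> on %s" % (value, tag.name, page))

    # Follow ordinary links, but only those that stay on our own site.
    for a in soup.find_all("a", href=True):
        absolute = urljoin(page, a["href"]).split("#")[0]
        if urlparse(absolute).netloc == urlparse(START).netloc:
            queue.append(absolute)
```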

Search engine robots.txt

I want to add a robots.txt so my web page can be found...
I have heard that putting a robots.txt with meta tags in the root of my site can do this.
Is this true? If so,
what would be the steps to add or generate this robots.txt?
I have found this
Robots.txt is more for telling crawlers where to go and where not to go once they've already reached your site.
A better way to get crawlers onto your site is to build a sitemap for it, then use Google Webmaster Tools to submit that sitemap to Google. You'll also want to put the sitemap at your site's root URL and tell Google where it is (all of this can be done in Google Webmaster Tools, linked to above).
No, it won't make your webpage suddenly visible. It just instructs web crawlers on how to index your site.
http://www.robotstxt.org/
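For reference, a minimal robots.txt that allows all crawlers and points them at a sitemap could look like this (the sitemap URL is a placeholder):

```
User-agent: *
Disallow:

Sitemap: http://www.example.com/sitemap.xml
```

The empty Disallow line means nothing is blocked; as said above, the file by itself won't make your page visible, but the Sitemap line at least tells crawlers where to find your list of pages.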
