How to crawl English sites and avoid crawling other languages? - nutch

Hi, I need to crawl only sites whose language is English. I know Nutch can detect the language of a site with plugins like the language detector, but I need to prevent Nutch from crawling non-English sites. I understand that we need to fetch a page before we can detect its language, but I want to leave the site at the first chance we can detect the language. Could you please tell me if this is possible? For example, if two or three pages of a site were fetched and they weren't English, Nutch should leave the site and abandon those pages and all of their URLs. Thanks for any help.

If you have a quick look at the list of HTTP header fields (http://en.wikipedia.org/wiki/List_of_HTTP_header_fields), you can ask for the content language and you will get an answer like this: "Content-Language: en".
You do not need to do a GET request (and download the whole page); you can ask for this header with a HEAD request (which downloads only the headers).
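Outside of Nutch, a quick way to try this idea by hand is a HEAD request from the command line (curl here; www.example.com is just a placeholder, and not every server actually sets the header):

curl -sI http://www.example.com/ | grep -i '^content-language'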
About "For example if two or three pages of a site were fetched and they weren't English nutch should leave the site and abandon those pages and all urls of them."
A site could be multi-language. So you can get the 3 first pages in spanish (or whatever) and you will leave the site, although there are some pages in English.

Related

How to show links in Google search results like this

I am wondering how I can show my website links like this (with the ">" sign) in Google search results.
I have also noticed that when I click on these types of results, they take me to an altogether different page of that website. I don't know if they are doing a 301 redirect. Please let me know if there is any SEO benefit to displaying links like this and doing the redirection.
I got the answer: it is done using schema.org. What Google is actually showing in the search result is a breadcrumb. I have to tell Google about my breadcrumb using a rich snippet.
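For illustration, a schema.org breadcrumb can be described with BreadcrumbList markup; a minimal JSON-LD sketch (embedded in the page inside a <script type="application/ld+json"> tag; the names and URLs below are made up) looks something like this:

{
  "@context": "https://schema.org",
  "@type": "BreadcrumbList",
  "itemListElement": [
    { "@type": "ListItem", "position": 1, "name": "Books", "item": "https://www.example.com/books" },
    { "@type": "ListItem", "position": 2, "name": "Authors", "item": "https://www.example.com/books/authors" }
  ]
}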
It's an SEO trick. When you submit a website to Google with a Webmaster Tools account, you have something called a sitemap. It's what Google uses to give those nice clean results. You can generate a sitemap for your site here

Remove incoming links from duplicate website

There is a duplicate development website that exists for legacy reasons and is pending complete removal. It always had a rule in its robots.txt file to deny all search engines, but at one point the robots.txt got deleted by accident, and for a period of time there were two cross-domain duplicates. Google indexed the entire duplicate website, which caused thousands of incoming links to the production website to show up in Google Webmaster Tools (Your site on the web > Links to your site).
The robots.txt got restored, and the entire development site is protected by a password, but the incoming links from the duplicate site remain in the production website's Webmaster Tools, even though the development site's robots.txt was downloaded by Google 19 hours ago.
I have spent hours reading about this and see a lot of contradictory advice on the web, so I would like to get an updated consensus from Stack Overflow on how to perform a complete site removal and get the links that point from the development site to the production site removed from Google.
Nobody will be able to tell you exactly how much time it will take for Google to remove the "bad" links from its index, but it's likely going to take a few days, not hours. Another thing to keep in mind is that only "good" crawlers will actually honor your robots.txt file, so if you don't want these links to show up elsewhere, just using a disallow rule in your robots.txt file certainly won't be enough.
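For reference, the deny-all rule the development site had in its robots.txt (blocking every compliant crawler from the whole site) is simply:

User-agent: *
Disallow: /

As noted above, this only keeps well-behaved crawlers from fetching pages; it does not by itself remove URLs that are already in the index or stop bots that ignore robots.txt.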

Writing an .htaccess file - RewriteBase?

Right I'll try and explain my situation as thoroughly as possible while also keeping it brief...
I'm just starting out as a web designer/developer, so I bought the unlimited hosting package with 123-reg. I set up a couple of websites, my main domain being designedbyross.co.uk. I have learnt how to map other domains to a folder within this directory. At the minute, one of my domains, scene63.com, is mapped to designedbyross.co.uk/blog63, which is working fine for the home page. However, when clicking on another link on scene63.com, for example page 2, the URL changes to designedbyross.co.uk/blog63/page2...
I have been advised by someone at 123-reg that I need to write a .htaccess file and use the RewriteBase directive (whatever that is?!). I have looked at a few websites to try to help me understand this, including http://httpd.apache.org/docs/2.0/mod/mod_rewrite.html, but it all isn't making much sense at the moment.
Finally, scene63.com is a WordPress site; I'm not sure whether that makes any difference to how the .htaccess file is structured...
Any help will be REALLY appreciated - Thanks.
I run my personal public website on Webfusion, which is another branded service offering by the same company on the same infrastructure, and my blog contains a bunch of articles (tagged Webfusion) on how to do this. You really need to do some reading and research -- the Apache docs, articles and HowTos like mine -- to help you get started and then come back with specific Qs, plus the supporting info that we need to answer them.
It sounds like you are using a 123-reg redirector service, or an equivalent, for scene63.com, which hides the redirection in an iframe. The issue here is that if the links on your site are site-relative, then because the URI has been redirected to http://designedbyross.co.uk/blog63/..., any new pages will be homed on designedbyross.co.uk. (I had the same problem with my wife's business site, which mapped the same way to one of my subdirectories.)
What you need to do is configure the blog so that its site base is http://scene63.com/ and force explicit site-based links, so that any hrefs in the pages are of the form http://scene63.com/page2, etc. How you do this depends on the blog engine, but most support this as an option.
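As an aside on RewriteBase itself, since the question asks what it is: it tells mod_rewrite the URL path that the per-directory rewrite rules are relative to. A stock WordPress .htaccess for a blog served out of a /blog63/ subfolder looks roughly like this (the /blog63/ path is taken from the question and may differ on your host):

# Enable mod_rewrite and anchor the relative rules at /blog63/
RewriteEngine On
RewriteBase /blog63/
# Leave direct requests for index.php alone
RewriteRule ^index\.php$ - [L]
# Anything that is not an existing file or directory goes to WordPress's front controller
RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
RewriteRule . /blog63/index.php [L]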
It turned out to be a 123-reg problem at the time: they were not correctly applying changes to the DNS.

How do I check an entire website to see if any page in it links to a particular URL?

We have been hounded by an issue on our websites because web protection facilities, like the ones from Norton, keep telling certain visitors in certain browsers that our websites are potential risks because we link to a certain http://something.abnormal.com/ (sample URL only).
I've been trying to scour the site page by page, to no avail.
My question: do you know of any site or tool that would be able to "crawl" our website's pages and then check whether any text, image, or anything else in them links to the abnormal URL that keeps bugging us?
Thanks so much! :)
What you want is a 'spider' application. I use the spider in 'Burp Suite', but there is a range of free, cheap and expensive ones.
The good thing about Burp is you can get it to spider the entire site and then look at every page for whatever you want, whether it be something to match a regex or dynamic content etc.
If your websites consist of a small number of static content pages, I would use wget to download all pages (ignoring images):
wget -r -np -R gif,jpg,png http://www.example.com
and then use a text search for the suspicious URL on the result. If your websites are more complex, HTTrack might be easier to configure for a text-only download.
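The text-search step could then be something like the following, where www.example.com is the directory wget creates for the host above and the URL is the sample one from the question:

grep -ril "something.abnormal.com" www.example.com/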

Mobile Site SEO - Playing Nicely with Google

If I have an iPhone version of my site, what are the things I need to make sure of so it doesn't interfere with SEO?
I've read quite a bit now about cloaking and sneaky javascript redirects, and am wondering how this fits into iPhone and Desktop websites playing together.
If my iPhone site has a totally different layout, where, say, the desktop site has a page with 3 posts and 10 images all on the page, and my iPhone site makes that 2 pages, one with the posts, one with the images (trying to think up an example where the structure is decently different), that's probably not best practice for SEO, so should I just tell Google not to look at the mobile site? If so, and assuming my client would like to automatically redirect mobile users to the iPhone site (I'm familiar with the idea of taking them to the regular page with a link to the mobile version instead), how do I keep this from looking like cloaking?
Google actually has a separate index and crawler for mobile content. So all you need to do is design your URLs in such a way that you can exclude Googlebot from the mobile pages and Googlebot-Mobile from the regular pages in robots.txt.
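A minimal robots.txt sketch of that idea, assuming the mobile pages all live under a /iphone/ path (the path is made up for illustration, and Allow is not in the original robots.txt standard but is understood by Google's crawlers):

User-agent: Googlebot
Disallow: /iphone/

User-agent: Googlebot-Mobile
Allow: /iphone/
Disallow: /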
Certainly you have the option of telling the search engines to not look at the mobile page. I would leave it though because you never know who is looking for something specific and maybe Google will prefer certain pages over others for mobi users.
If the 2 pages on mobile make sense to the visitor, then I would not worry about it for SEO. If you are redirecting based on mobile, then I don't see how the search engines could think you are cloaking, but if you want to be totally sure, I suggest using CSS to show different information based on media type.
The only problem I can think of would be duplicate content. The search engines may see both pages and not rank one as highly because they like what they see on the other page. There is no penalty other than the fact that one page is more interesting than the other and may get better rankings, whereas the other drops in rank. If you are making two separate pages, it would be an opportunity to tune your information to specific details and maybe get hits for both, but if you are using CSS, then it will rank as one page.