There is a duplicate development website that exists for legacy reasons and is pending complete removal. Its robots.txt file always had a rule denying all search engines, but at one point the robots.txt was deleted by accident. For a period of time there were two cross-domain duplicates, and Google indexed the entire duplicate website, which caused thousands of incoming links to the production website to show up in Google Webmaster Tools (Your site on the web > Links to your site).
The robots.txt got restored, and the entire development site is protected by a password, but the incoming links from the duplicate site remain in the production website webmaster tools, even though the development site robots.txt was downloaded by Google 19 hours ago.
I have spent hours reading about this, and see a lot of contradiction on the web, so would like to get an updated consensus from stackoverflow on how to perform a complete site removal and remove the links that point from the development site to the production site from Google.
Nobody will be able to tell you exactly how long it will take for Google to remove the "bad" links from its index, but it is likely to take a few days, not hours. Another thing to keep in mind is that only "good" crawlers actually honor your robots.txt file, so if you don't want these links to show up elsewhere, just using Disallow in your robots.txt file certainly won't be enough.
Quite a newbie question here, but our Web Developer recently left our (small) company and has left us in a bind.
We recently (2 days ago) redirected our site to a newer, mobile-friendly model, and it was working well for quite some time. For whatever reason, management decided they needed to roll the site back to its original model, and now the site breaks whenever you type in http://www.example.com. However, https:// works perfectly fine, and it seems to have something to do with the htaccess file -- but as I am just the project manager, coding comes second in terms of skill.
If it helps, our site is www.mauriprosailing.com -- I am currently still trying to figure out why the "www" and "http" are breaking the site.
If needed, I can post a .txt of our htaccess.
I appreciate all the help and apologize if this was too broad of a question!
Solution: Granted, this may not apply to everyone -- but the problem was not in the htaccess file but in the server's caching. The server was not pulling the right .css file, which caused an "explosion" of our site, and I found that purging all of the cached files did the trick.
Right I'll try and explain my situation as thoroughly as possible while also keeping it brief...
I'm just starting out as a web designer/developer, so I bought the unlimited hosting package with 123-reg. I set up a couple of websites, my main domain being designedbyross.co.uk. I have learnt how to map other domains to a folder within this directory. At the minute, one of my domains, scene63.com is mapped to designedbyross.co.uk/blog63 which is working fine for the home page. However when clicking on another link on scene63.com for example page 2, the URL changes to designedbyross.co.uk/blog63/page2...
I have been advised by someone at 123-reg that I need to write a .htaccess file and use the RewriteBase directive (whatever that is?!). I have looked at a few websites to try to understand this, including http://httpd.apache.org/docs/2.0/mod/mod_rewrite.html, but none of it is making much sense at the moment.
Finally, scene63.com is a wordpress site, whether that makes any difference to how the htaccess file is structured I'm not sure...
Any help will be REALLY appreciated - Thanks.
I run my personal public website on Webfusion, which is another branded service offering by the same company on the same infrastructure, and my blog contains a bunch of articles (tagged Webfusion) on how to do this. You really need to do some reading and research -- the Apache docs, articles and HowTos like mine -- to help you get started and then come back with specific Qs, plus the supporting info that we need to answer them.
It sounds like you are using a 123 redirector service, or equivalent, for scene63.com, which hides the redirection in an iframe. The issue here is that if the links on your site are site-relative, then because the URI has been redirected to http://designedbyross.co.uk/blog63/..., any new pages will be homed in designedbyross.co.uk. (I had the same problem with my wife's business site, which mapped the same way to one of my subdirectories.)
What you need to do is to configure the blog so that its site base is http://scene63.com/ and to force explicit site-based links so that any hrefs in the pages are of the form http://scene63.com/page2, etc. How you do this depends on the blog engine, but most support this as an option.
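To see why relative links end up under designedbyross.co.uk, here is a minimal Python sketch of how a browser resolves hrefs against the URL of the framed page. The frame URL is taken from the question; the `resolve` helper is only an illustrative name, not part of WordPress or 123-reg.

```python
from urllib.parse import urljoin

# URL the browser actually loads inside the redirector's frame (from the question).
FRAME_BASE = "http://designedbyross.co.uk/blog63/"


def resolve(href, base=FRAME_BASE):
    """Resolve an href the way a browser does against the page's own URL."""
    return urljoin(base, href)


# A relative link is resolved against the framed URL, not scene63.com.
relative_link = resolve("page2")

# An explicit absolute link is left untouched by resolution.
explicit_link = resolve("http://scene63.com/page2")

print(relative_link)
print(explicit_link)
```

This is why forcing the blog engine to emit full http://scene63.com/... links keeps visitors on the mapped domain.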
It turned out to be a problem with 123-reg, which at the time was not correctly applying changes to the DNS.
In Google Analytics I am getting hundreds of hits to pages which don't exist on my website, which I assume are some sort of spam or bot-related thing.
I want to make sure that this isn't going to cause any issues to my site or be a security risk.
My website URL is imageworkshop.com, and the links that I am seeing are to the following paths on this domain:
/imagework/ineeta.V1.02.07.php
/imagework/ineeta.V1.02.13.php
/imagework/ineeta.V1.02.15.php
/imagework/ineeta.V1.03.01.php
/imagework/ineeta.V1.02.16.php
/imagework/ineeta.V1.02.08.php
Each of these pages is showing 150-300 page views (they just show 404 errors).
Average time on page shows 2-4 minutes for these.
Source of link shows as (direct) in google analytics.
Is this some kind of attempt at a brute force / SQL injection attack?
The visits have all happened 3-4 days apart through the month of October 2011.
Any suggestions on what this is or if I should be concerned?
The website is built on WordPress and does use a few plugins - there is always a possibility that these links are related to a plugin, I guess?
I have wordpress up to date with the latest version (currently 3.2.1)
You're probably right. As long as those pages don't exist on your server there is no security risk.
I need your help and would like advice from a developer's point of view on how people run sites like copyscape.com. Basically, they search for copies of data across the whole internet. I want to know how they search and build a catalog of all websites on the internet, the same way Google builds an index of sites.
Please guide me on how they search data from all over the internet. How is it possible to keep track of each and every website? How does Google know that there is a new site on the internet, and how do its crawlers learn that a new website has launched? In short, I want to know how I can develop a site that searches for copies of data all over the internet without depending on any third-party API. Please advise; I hope you will help me.
thanks
Google's crawlers don't know when a new site is launched. Usually developers must submit their sites to Google or get incoming links from sites that are indexed.
And nobody has a copy of the entire Internet. There are websites that are not linked and never get visited by any crawler. This is called the deep web and is generally inaccessible to crawlers.
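As a rough illustration of link-based discovery, here is a toy sketch in Python. It is not how Google's crawler is implemented; the `links` dictionary stands in for real pages and their outgoing links, and all the site names are made up.

```python
from collections import deque


def discover(seeds, links):
    """Breadth-first walk: return every site reachable from the seed sites.

    `links` maps a site to the sites it links to (a stand-in for fetching
    a page and extracting its hrefs).
    """
    seen = set(seeds)
    queue = deque(seeds)
    while queue:
        site = queue.popleft()
        for target in links.get(site, []):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return seen


# Hypothetical link graph: nobody links to "unlinked.example".
links = {
    "indexed.example": ["blog.example"],
    "blog.example": ["newsite.example"],
    "unlinked.example": ["indexed.example"],
}

found = discover(["indexed.example"], links)
print(sorted(found))
```

In this toy graph, "newsite.example" is found because an already-known site links to it, while "unlinked.example" is never reached even though it links out to a known site.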
How do they do it exactly? I don't know. Maybe they index popular sites where text is likely to be copied, like Blogger, ezinearticles, etc. And if they don't find the text on those sites, they simply say it's original. Just a theory, and I am probably wrong.
Me? I would probably use Google. Just take a good chunk of text from the website you are checking, search for it as an exact phrase, and then filter out the results that are from the original website. And voilà, you have the websites that contain that exact phrase, which is presumably copied.
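The filtering half of that idea could look something like the Python sketch below. It assumes you already have a list of result URLs from whatever search you ran by hand; no search API is called, and `pick_phrase` and `filter_copies` are hypothetical helper names.

```python
from urllib.parse import urlparse


def pick_phrase(text, max_words=12):
    """Take the first few words of the text and quote them for an exact-phrase search."""
    words = text.split()
    return '"' + " ".join(words[:max_words]) + '"'


def filter_copies(result_urls, original_domain):
    """Drop results hosted on the original domain (or its subdomains)."""
    copies = []
    for url in result_urls:
        host = urlparse(url).hostname or ""
        if host == original_domain or host.endswith("." + original_domain):
            continue  # this result is the original site itself
        copies.append(url)
    return copies


# Made-up result list, as if pasted from a search results page.
results = [
    "http://example.com/article",
    "http://www.example.com/article?page=2",
    "http://other.example.org/stolen-article",
]
suspects = filter_copies(results, "example.com")
print(pick_phrase("the quick brown fox jumps over the lazy dog", max_words=4))
print(suspects)
```

Whatever remains in `suspects` still needs a manual look, since a matching phrase can also be a legitimate quote.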
If I buy a hosting (+ domain) service for the website of a friend of mine, and then I decide to use the remaining web space and mysql databases for my development and test...
will Google cache my development websites (in other folders and sub-URLs) under his website?
What's the downside of developing on a server that already hosts a production website? I was thinking of creating a tiny URL linking to www.myfriendwebsite.com/mydevelopmentSite.. in order to hide the real URL.
Thanks
If you don't link to it, don't submit it to Google, and don't list it in a sitemap -- Google won't find it.
But you could also just use a robots.txt to tell Google not to index it.
http://en.wikipedia.org/wiki/Robots_exclusion_standard
Update: to stop Google and malicious bots:
Disallow a directory in robots.txt for all user agents (User-agent: *), and then put your site in a hard-to-guess subdirectory of that directory -- also, don't leave directory browsing on.
Also -- don't link to it anywhere. You may not be able to stop others from linking to it, though -- in that case, only robots.txt will keep you out of Google. Malicious bots can still get to the site from the link.
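A minimal sketch of that setup, checked with Python's standard `urllib.robotparser`. The `/private/` directory and the `k3x9-dev` subdirectory are made-up names for illustration; the point is that robots.txt only names the parent directory, so it does not reveal the hidden path.

```python
from urllib.robotparser import RobotFileParser

# Example robots.txt: every well-behaved crawler is told to stay out of /private/.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(url, robots_txt=ROBOTS_TXT, agent="Googlebot"):
    """Return True if a crawler that honors robots.txt may fetch the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


# The production home page stays crawlable.
home_ok = is_allowed("http://www.myfriendwebsite.com/index.html")

# The development site lives in a hard-to-guess subdirectory (hypothetical name).
dev_ok = is_allowed("http://www.myfriendwebsite.com/private/k3x9-dev/")

print(home_ok, dev_ok)
```

This only models crawlers that choose to obey robots.txt; anything that ignores it needs to be stopped by other means, such as a password.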
Your hosting provider may have forbidden that in its Terms of Service (mine has). Other than that, I'd go for a subdomain instead of a subdirectory (like mydevelopmentsite.myfriendswebsite.com).