How do I check an entire website to see if any page in it links to a particular URL?

We have been hounded by an issue on our websites: web protection tools such as Norton keep telling certain visitors, in certain browsers, that our websites are potential risks because we link to a certain http://something.abnormal.com/ (sample URL only).
I've been trying to scour the site page by page, to no avail.
My question: do you know of any tool or site that can crawl our website's pages and check whether any text, image, or anything else in them links to the abnormal URL that keeps bugging us?
Thanks so much! :)

What you want is a 'spider' application. I use the spider in Burp Suite, but there is a range of free, cheap, and expensive ones.
The good thing about Burp is that you can have it spider the entire site and then inspect every page for whatever you want, whether that is something matching a regex, dynamic content, etc.

If your websites consist of a small number of static content pages, I would use wget to download all pages (ignoring images):
wget -r -np -R gif,jpg,png http://www.example.com
and then run a text search for the suspicious URL over the result (see the grep sketch below). If your websites are more complex, httrack might be easier to configure for a text-only download.
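For the search step, a minimal sketch with GNU grep, assuming wget saved the mirror into a www.example.com directory and using the sample URL from the question:

grep -r -l "something.abnormal.com" www.example.com/

This lists every downloaded file that contains the suspicious URL, which usually narrows the problem down to a shared template or an injected script.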


Launch Google search from a link

I am running a PHP-based website on a server run by a large host. My goal is very simple: include a link on my site to a Google search where I dynamically supply the search term.
Starting with the URL that appears in the address bar, I've narrowed the syntax down to
http://www.google.com/search?q=test
This works when I type it into the address bar. However, when I launch it from the server, it redirects to:
www.google.com/webhp...lots of characters
There are references on the web to webhp being related to a virus, but I'm pretty sure my host does not have any viruses on its servers.
Does anyone know the proper way to launch a simple Google search from a link? Is a straight link forbidden? I am willing to use JS to push the link to the client if necessary (which I use for Google Maps at Google's recommendation due to usage limits), but I want to keep things as simple as possible. This link is just to save people a few clicks.
Thanks for any suggestions.
Simply use the urlencode function:
<?php
// urlencode() escapes spaces, ampersands, etc., so the user's term
// survives as a single query-string parameter
echo '<a href="http://www.google.com/search?q=', urlencode($userinput), '">search</a>';
?>
If you wish to do it with JavaScript, the answer is here: Encode URL in JavaScript?
Try to track down the URL rewriting; I think it's a virus you need to remove: http://www.ehow.com/how_8728291_rid-webhp.html
WebHP is a computer virus that automatically sets your homepage to a fake Google site, known as Google.com/WebHP. This virus will also randomly open windows or tabs to load this website, as well as generate pop-ups and fake errors. Also installed with this virus is a rootkit which can disable your PC's firewall and other methods of security. If left untreated, the WebHP virus allows hackers to remotely access your computer and steal personal information, such as credit card numbers and email passwords.

How to crawl English sites and avoid crawling other languages?

Hi, I need to crawl only sites whose language is English. I know Nutch can detect the language of a site with plugins like the language detector, but I need to prevent Nutch from crawling the non-English sites. Although I know we need to fetch a page to detect its language, I want to leave the site at the first chance we can detect the language. Could you please tell me if this is possible? For example, if two or three pages of a site have been fetched and they weren't English, Nutch should leave the site and abandon those pages and all of their URLs. Thanks for any help.
If you have a quick look at the HTTP header fields (http://en.wikipedia.org/wiki/List_of_HTTP_header_fields) you can ask for the content language, and you will get an answer like this: "Content-Language: en".
You do not need to do a GET request (and download the whole page); you can ask for this header in a HEAD request (in order to download only the headers), as in the curl sketch below.
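A minimal sketch of that check with curl, assuming the server actually sets the header (many don't, so treat a missing header as "unknown" rather than "not English"):

curl -s -I http://www.example.com/ | grep -i '^Content-Language'

The -I flag sends a HEAD request, so only the headers are downloaded; the grep keeps just the language line, e.g. "Content-Language: en".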
About "For example if two or three pages of a site were fetched and they weren't English nutch should leave the site and abandon those pages and all urls of them."
A site could be multi-language. So you can get the 3 first pages in spanish (or whatever) and you will leave the site, although there are some pages in English.

How to use locations.kml with sitemap.xml

I would like to make sure my website ranks as high as possible whenever my Google Places location ranks high.
I have seen references to creating a locations.kml file and putting it in the root directory of my site, then creating lines in the sitemap.xml file to point to this .kml file.
I gather this from this statement on the geolocations page:
Google no longer supports the Geo extension to the Sitemap protocol. We recommend that you tell Google about geographically-based URLs by including them in a regular Web Sitemap.
There is a link to the Web Sitemap page
http://support.google.com/webmasters/bin/answer.py?hl=en&answer=183668
I'm looking for examples of how to include Geo location information in the sitemap.xml file.
Would someone please point me to an example so that I can know how to code the reference?
I think the point is that you don't use any specific formatting in the sitemap. You make sure you include all your locally relevant pages in the sitemap as normal (i.e. you don't include any geolocation data in the sitemap; see the sketch below).
Googlebot will use its normal methods for determining whether a page should be locally targeted.
(I think Google has found that the Sitemap protocol gets abused, and/or misunderstood, so they don't want it to tell them so much about the page. Rather, it's just a way to find pages that might take a long time to discover through conventional means.)
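For what that looks like in practice, a minimal sketch of a plain sitemap.xml; the URL is hypothetical, and note that there is no geo extension anywhere:

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://www.example.com/locations/springfield-office</loc>
  </url>
</urlset>

Each location page gets an ordinary <url> entry like any other page on the site.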

Writing an .htaccess file - RewriteBase?

Right, I'll try to explain my situation as thoroughly as possible while also keeping it brief...
I'm just starting out as a web designer/developer, so I bought the unlimited hosting package with 123-reg. I set up a couple of websites, my main domain being designedbyross.co.uk. I have learnt how to map other domains to a folder within this directory. At the minute, one of my domains, scene63.com, is mapped to designedbyross.co.uk/blog63, which is working fine for the home page. However, when clicking on another link on scene63.com, for example page 2, the URL changes to designedbyross.co.uk/blog63/page2...
I have been advised by someone at 123-reg that I need to write a .htaccess file and use the RewriteBase directive (whatever that is?!). I have looked at a few websites to try to understand this, including http://httpd.apache.org/docs/2.0/mod/mod_rewrite.html, but it all isn't making much sense at the moment.
Finally, scene63.com is a WordPress site; whether that makes any difference to how the .htaccess file is structured, I'm not sure...
Any help will be REALLY appreciated - Thanks.
I run my personal public website on Webfusion, which is another branded service offering by the same company on the same infrastructure, and my blog contains a bunch of articles (tagged Webfusion) on how to do this. You really need to do some reading and research -- the Apache docs, articles and HowTos like mine -- to help you get started, and then come back with specific questions, plus the supporting info that we need to answer them.
It sounds like you are using a 123 redirector service, or an equivalent, for scene63.com, which hides the redirection in an iframe. The issue here is that if the pages on your site use site-relative links, then because the URI has been redirected to http://designedbyross.co.uk/blog63/..., any new pages will be homed in designedbyross.co.uk. (I had the same problem with my wife's business site, which mapped the same way to one of my subdirectories.)
What you need to do is configure the blog so that its site base is http://scene63.com/ and force explicit site-based links, so that any hrefs in the pages are of the form http://scene63.com/page2, etc. How you do this depends on the blog engine, but most support this as an option; see the .htaccess sketch below for the rewrite side.
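For reference, this is the stock WordPress rewrite block for a blog living in a subdirectory; /blog63/ is the folder from the question, and RewriteBase is the directive 123-reg mentioned (treat this as a sketch to adapt, not a drop-in fix):

# Enable rewriting and anchor relative substitutions at the blog folder
RewriteEngine On
RewriteBase /blog63/
# Let requests for real files and directories through untouched
RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
# Route everything else to WordPress's front controller
RewriteRule . /blog63/index.php [L]

RewriteBase tells mod_rewrite which URL prefix to put back in front of relative substitutions, which matters when the public URL and the filesystem folder don't line up.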
It turned out to be a 123-reg problem at the time: they were not correctly applying changes to the DNS.

Mobile Site SEO - Playing Nicely with Google

If I have an iPhone version of my site, what are the things I need to make sure of so it doesn't interfere with SEO?
I've read quite a bit now about cloaking and sneaky JavaScript redirects, and am wondering how this applies when iPhone and desktop websites have to play together.
If my iPhone site has a totally different layout, where say the desktop site has a page with 3 posts and 10 images all on the page, and my iPhone site makes that 2 pages, one with the posts, one with the images (trying to think up an example where the structure is decently different), that's probably not best practice for SEO, so should I just tell Google not to look at the mobile site? If so, and assuming my client would like to automatically redirect mobile users to the iPhone site (I'm familiar with the idea of taking them to the regular page with a link to the mobile version instead), how do I make this not look like cloaking?
Google actually has a separate index and crawler for mobile content. So all you need to do is design your URLs in such a way that you can exclude googlebot from the mobile pages and googlebot-mobile from the regular pages in robots.txt, as in the sketch below.
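A minimal robots.txt sketch of that split, assuming the mobile pages live under a /mobile/ path (the path is hypothetical; the point is that the URL scheme makes the two sets separable):

# Keep the desktop crawler out of the mobile pages
User-agent: Googlebot
Disallow: /mobile/

# Keep the mobile crawler on the mobile pages only
User-agent: Googlebot-Mobile
Allow: /mobile/
Disallow: /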
Certainly you have the option of telling the search engines not to look at the mobile pages. I would leave it, though, because you never know who is looking for something specific, and maybe Google will prefer certain pages over others for mobile users.
If the 2 pages on mobile make sense to the visitor, then I would not worry about it for SEO. If you are redirecting based on mobile detection, then I don't see how the search engines could think you are cloaking, but if you want to be totally sure, I suggest using CSS to show different information based on media type (see the sketch below).
The only problem I can think of would be duplicate content. The search engines may see both pages and not rank one as highly because they like what they see on the other page. There is no penalty other than the fact that one page is more interesting than the other and may get better rankings, whereas the other drops in rank. If you are making two separate pages, it is an opportunity to tune your information to specific details and maybe get hits for both, but if you are using CSS then it will rank as one page.
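The media-type approach boils down to serving one URL and letting stylesheets change the presentation; a minimal sketch, with hypothetical file names:

<link rel="stylesheet" media="screen" href="desktop.css">
<link rel="stylesheet" media="only screen and (max-device-width: 480px)" href="iphone.css">

Crawlers and visitors get the same HTML at the same URL, so there is nothing that can be read as cloaking, and the duplicate-content concern disappears because there is only one page.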
