how to find all the urls/ pages on mysite.com - web

i have a website that i now support and need to list all live pages/ url's.
is there a crawler i can use to point to my homepage and have it list all the pages/url's that it finds.
then i can delete any that dont make their way into this listing as they will be orphan pages/url's that have never been cleaned up?
I am using DNN and want to kill un-needed pages.

Since you're using a database-driven CMS, you should be able to do this either via the DNN admin interface or by looking directly in the database. Far more reliable than a crawler.

Back in the old days I used wget for this exact purpose, using its recursive retrieval functionality. It might not be the most efficient way, but it was definitely effective. YMMV, of course, since some sites will return a lot more content than others.

Related

Apache Nutch: crawl only new pages for semantics analysis

I plan to tune up Nutch 2.2.X such way, that after initial crawling of the list of sites I launch the crawler daily and get HTML or plain text of new pages appeared on those sites this day only. Number of sites: hundreds.
Please be noted, that I'm not interested on updated, only new pages. Also I need new pages only starting from a date. Let's suppose it is the date of "Initial crawling".
Reading documentation and searching the Web iI got following questions can't find anywhere else:
What backend I should better use for Nutch for my task? I need page's text only once, then I never return to it. MySQL seems isn't an option as it is not supported by gora anymore. I tried use HBase, but seems I have to rollback to Nutch 2.1.x to get it working correctly. What are your ideas? How I may minimize disk space and other resources utilization?
May I perform my task not using indexing engine, like Solr? Not sure I need store large fulltext indexes. May Nutch >2.2 be launched without Solr and does it needs specific options for launching such way? Tutorials aren't clearly explain this question: everybody needs Solr, except me.
If I'd like to add a site to the crawling list, how I better perform it? Let's suppose I already crawling a list of sites and want to add a site to the list to monitor it from now. So I need to crawl the new site skipping pages content to add it to WebDB, and then run daily crawl as usual. For Nutch 1.x it may be possible to perform separate crawls and then merge them. How it may looks like for Nutch 2.x?
May this task be performed without custom plugins, and may it be performed with Nutch at all? Probably, I may write a custom plugin which detects somehow is the page already indexed, or it is new, and we need put the content to XML, or a database, etc. Should I write the plugin at all, or there is a way to solve the task with lesser blood? And how the plugin's algorithm may look like, if there is no way to live without it?
P.S. There is a lot of Nutch questions/answers/tutorials, and I honestly searched in the Web for 2 weeks, but haven't found answers to questions above.
I'm not using solr too. I just checked this documentation: https://today.java.net/pub/a/today/2006/01/10/introduction-to-nutch-1.html
It seems like there are command prompts that can show the data fetched using WebDB. I'm new to Nutch but I just follow this documentation. Check it out.

Stop a website redirecting to my website

I have a rogue website that is redirecting to my website. It doesn't have any content, 1 page, and some bad backlinks and no relationship with us or our niche, so it's safe to say its up to no good.
We want to distance ourselved from this as best as possible. I've requested that the registrar identify the culprit or remove the redirect, however, I wondered if it was possible to stop the site redirecting to our site full-stop.
We're using IIS7.5 on a Windows 2008 Server, and to date I've looked at blocking requests through urlrewriting but I've had no success. I've also read that request filtering may be an option but again little knowledge as to the capabilities of using this.
I would appreciate any advice regarding the 2 approaches above as to whether they are suited to what I want to achieve and if possible links to a clear example.
Here is an existing post that covers this exact topic. You will need to use URL Rewrite to implement the solution.

Is it possible to create an app for a site without an API?

I would like to create an app for a myBB forum. So the site on the forum will look nicer and much more cleaner on an iPhone or Android.
Is it possible without an API? It isn't my site ether.
everything is possible, it's just a matter of resources...
technically, you can write an app for everything on the web, but:
an API will tell you how you can do things with the site, without having to reverse engineer all pages/posts/..., and the format of every output resulting from post/get operations. reverse engineering may take a long time, and you will surely not come accross all possible results (error pages, bad authentication...);
an API is quite stable and is always updated with great care from the developpers so as not to break existing applications. without an API, there is no guarantees that your app will not break with the next release of the forum when it is upgraded;
a web API generally defines an output format which is easily parseable: many API outputs XML or JSON, which can be processed with standard libraries. without an API, the output format is plain HTML, which may be difficult to reorganize in order to show the results in a different format.
so, yes, you can definitely write an app for a myBB forum, but it may require a fair amount of work.
You can do, it's called screen scraping and is what was done before XML, the semantic web, SOAP, web services and then JSON apis tried to solve the problem better.
In screen scraping, you grab the site's HTML, parse it, get the data you want out of it, then do what you need with that data. It's more work, and breaks each time the site's layout changes, hence the history of improvements to it.
You mention the site in question is not yours. Many sites do not regard screen scraping as fair use, so check with the site's terms and conditions that you can legally create an app from the data posted there.
you can consider useing HTML5 ... do you think it doable for use app ?

What are the benefits of having an updated sitemap.xml?

The text below is from sitemaps.org. What are the benefits to do that versus the crawler doing their job?
Sitemaps are an easy way for
webmasters to inform search engines
about pages on their sites that are
available for crawling. In its
simplest form, a Sitemap is an XML
file that lists URLs for a site along
with additional metadata about each
URL (when it was last updated, how
often it usually changes, and how
important it is, relative to other
URLs in the site) so that search
engines can more intelligently crawl
the site.
Edit 1: I am hoping to get enough benefits so I canjustify the development of that feature. At this moment our system does not provide sitemaps dynamically, so we have to create one with a crawler which is not a very good process.
Crawlers are "lazy" too, so if you give them a sitemap with all your site URLs in it, they are more likely to index more pages on your site.
They also give you the ability to prioritize your pages so the crawlers know how frequently they change, which ones are more important to keep updated, etc. so they don't waste their time crawling pages that haven't changed, missing ones that do, or indexing pages you don't care much about (and missing pages that you do).
There are also lots of automated tools online that you can use to crawl your entire site and generate a sitemap. If your site isn't too big (less than a few thousand urls) those will work great.
Well, like that paragraph says sitemaps also provide meta data about a given url that a crawler may not be able to extrapolate purely by crawling. The sitemap acts as table of contents for the crawler so that it can prioritize content and index what matters.
The sitemap helps telling the crawler which pages are more important, and also how often they can be expected to be updated. This is information that really can't be found out by just scanning the pages themselves.
Crawlers have a limit to how many pages the scan of your site, and how many levels deep they follow links. If you have a lot of less relevant pages, a lot of different URLs to the same page, or pages that need many steps to get to, the crawler will stop before it comes to the most interresting pages. The site map offers an alternative way to easily find the most interresting pages, without having to follow links and sorting out duplicates.

Is it possible for a 3rd party to reliably discern your CMS?

I don't know much about poking at servers, etc, but in light of the (relatively) recent Wordpress security issues, I'm wondering if it's possible to obscure which CMS you might be using to the outside world.
Obviously you can rename the default login page, error messages, the favicon (I see the joomla one everywhere) and use a non-default template, but the sorts of things I'm wondering about are watching redirects somehow and things like that. Do most CMS leave traces?
This is not to replace other forms of security, but more of a curious question.
Thanks for any insight!
Yes, many CMS leave traces like the forming of identifiers and hierarchy of elements that are a plain giveaway.
This is however not the point. What is the point, is that there are only few very popular CMS. It is not necessary to determine which one you use. It will suffice to methodically try attack techniques for the 5 to 10 biggest CMS in use on your site to get a pretty good probability of success.
In the general case, security by obscurity doesn't work. If you rely on the fact that someone doesn't know something, this means you're vulnerable to certain attacks since you blind yourself to them.
Therefore, it is dangerous to follow this path. Chose a successful CMS and then install all the available security patches right away. By using a famous CMS, you make sure that you'll get security fixes quickly. Your biggest enemy is time; attackers can find thousands of vulnerable sites with Google and attack them simultaneously using bot nets. This is a completely automated process today. Trying to hide what software you're using won't stop the bots from hacking your site since they don't check which vulnerability they might expect; they just try the top 10 of the currently most successful exploits.
[EDIT] Bot nets with 10'000 bots are not uncommon today. As long as installing security patches is so hard, people won't protect their computers and that means criminals will have lots of resources to attack. On top of that, there are sites which sell exploits as ready-to-use plugins for bots (or bots or rent whole bot nets).
So as long as the vulnerability is still there, camouflaging your site won't help.
A lot of CMS's have id, classnames and structure patterns that can identify them (Wordpress for example). URLs have specific patterns too. You just need someone experienced with the plataform or with just some browsing to identify which CMS it's using.
IMHO, you can try to change all this structure in your CMS, but if you are into all this effort, I think you should just create your own CMS.
It's more important to keep everything up to date in your plataform and follow some security measures than try to change everything that could reveal the CMS you're using.
Since this question is tagged "wordpress:" you can hide your wordpress version by putting this in your theme's functions.php file:
add_action('init', 'removeWPVersionInfo');
function removeWPVersionInfo() {
remove_action('wp_head', 'wp_generator');
}
But, you're still going to have the usual paths, i.e., wp-content/themes/ etc... and wp-content/plugins/ etc... in page source, unless you figure out a way to rewrite those with .htaccess.

Resources