I have a question about Nutch.
Following the wiki, I am asked to edit crawl-urlfilter.txt:
+^http://([a-z0-9]*\.)*apache.org/
and I am asked to create a urls folder containing a list of seed URLs.
Do I need to list all the links both in crawl-urlfilter.txt and in the URL list?
Yes and no.
crawl-urlfilter.txt acts as a filter, so in your example only URLs on apache.org will ever be crawled.
The urls folder provides the 'seed' URLs where the crawler starts.
So if you want the crawler to stay within a set of sites, make sure each of them has a positive match with the filter; otherwise it will crawl the entire web. This may mean you have to put the whole list of sites in the filter.
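For example, a minimal Nutch 1.x setup might look like this (a sketch with illustrative seed URLs; adjust the paths to your install):

urls/seed.txt — the seed list the crawler starts from:
http://lucene.apache.org/nutch/
http://wiki.apache.org/nutch/

conf/crawl-urlfilter.txt — only URLs with a positive (+) match get crawled:
# stay on apache.org
+^http://([a-z0-9]*\.)*apache.org/
# skip everything else
-.

You would then kick off the crawl with something like bin/nutch crawl urls -dir crawl -depth 3 -topN 50.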
I need to index my company's employee manual, which is hosted on an external website. This page requires login, and supports auto-login through a query string parameter.
Like this: http://manual.externalprovider.com?token=xxxxxxxxx
When entering this URL in my content source I get no result and the following warning:
Item not crawled due to one of the following reasons: Preventive crawl rule; Specified content source hops/depth exceeded; URL has query string parameter; Required protocol handler not found; Preventive robots directive. (This item was deleted because it was excluded by a crawl rule.)
Is it impossible to crawl content that has a query string parameter in the start address? Any other suggestions on how to solve this?
I think it is possible, but you need to create a new crawl rule.
Go to Search Service Application -> Crawl Rules -> New Crawl Rule.
Then enter your starting URL, http://manual.externalprovider.com/*, check "Include all items in this path", and then check "Crawl complex URLs (URLs that contain a question mark (?))".
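If you prefer to script it, the same rule can be created with SharePoint's PowerShell cmdlets. A minimal sketch (run in the SharePoint Management Shell; the service application name "Search Service Application" is an assumption, substitute your own):

# Get the search service application (name is an assumption)
$ssa = Get-SPEnterpriseSearchServiceApplication -Identity "Search Service Application"
# Inclusion rule that also follows complex (query string) URLs
New-SPEnterpriseSearchCrawlRule -SearchApplication $ssa `
    -Path "http://manual.externalprovider.com/*" `
    -Type InclusionRule -FollowComplexUrls $true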
I put ShareThis on my site, and if I visit andrewwelch.info without the www, the share counts differ from those on www.andrewwelch.info. How can I make sure this doesn't happen?
ShareThis is rendered inside an IFRAME, and will use the parent frame's URL to determine the page someone is sharing.
You can add span tags with a st_url attribute to specify a canonical URL to use for a given page. An example is:
<span class="st_sharethis" st_url="http://sharethis.com" st_title="Sharing is great!"></span>
As a side note: to improve your search engine rankings you should ensure your site doesn't present two different versions of each page. Search engines may reduce the relevancy of your site in results if this is the case. For example, the content of the following pages (and of every other page on your site) is the same:
http://andrewwelch.info/
http://www.andrewwelch.info/
You need to fix this by choosing whether you want the "www" or not, then using one of the following methods:
Use a rel="canonical" link tag to tell search engines which version is the one you want indexed.
Respond to requests for the "www" or "non-www" hostname with a 301 redirect to the other.
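For example (a sketch that picks the "www" form purely for illustration; the redirect assumes Apache with mod_rewrite enabled):

<!-- option 1: canonical link tag in each page's <head> -->
<link rel="canonical" href="http://www.andrewwelch.info/" />

# option 2: .htaccess rule that 301-redirects non-www to www
RewriteEngine On
RewriteCond %{HTTP_HOST} ^andrewwelch\.info$ [NC]
RewriteRule ^(.*)$ http://www.andrewwelch.info/$1 [R=301,L]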
I'm new to Nutch and not really sure what is going on here. I run Nutch and it crawls my website, but it seems to ignore URLs that contain query strings. I've commented out the filters in crawl-urlfilter.txt so it looks like this now:
# skip urls with these characters
#-[]
# skip urls with slash delimited segment that repeats 3+ times
#-.*(/[^/]+)/[^/]+\1/[^/]+\1/
So I think I've effectively removed every filter and am telling Nutch to accept all URLs it finds on my website.
Does anyone have any suggestions? Is this a bug in Nutch 1.2? Should I upgrade to 1.3, and would that fix the issue? Or am I doing something wrong?
See my previous question: Adding URL parameter to Nutch/Solr index and search results. The first 'Edit' there should answer your question.
# skip URLs containing certain characters as probable queries, etc.
-[?*!#=]
You have to comment it out or modify it like this:
# skip URLs containing certain characters as probable queries, etc.
-[*!#]
By default, crawlers shouldn't crawl links with query strings, to avoid spam and fake search engines.
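Putting it together, a minimal filter file that keeps query-string URLs for a single site might look like this (a sketch with a placeholder domain; depending on your Nutch version the file is conf/crawl-urlfilter.txt or conf/regex-urlfilter.txt):

# skip URLs containing certain characters as probable queries, etc.
# ('?' and '=' removed from the set so query strings survive)
-[*!#]
# accept everything on the target site
+^http://([a-z0-9]*\.)*example.com/
# reject everything else
-.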
I'm in the process of rewriting all the URLs on my site that end with .php and/or are dynamic, so that they're static and more search-engine friendly.
I'm trying to decide if I should rewrite file names as simple strings of words, or if I should add .html to the end of everything. For example, which is better:
www.example.com/view-profiles
or
www.example.com/view-profiles.html
Does anyone know if the search engines favor doing it one way or another? I've looked all over Stack Overflow (and several other resources) but can't find an answer to this specific question.
Thanks!
SEO-optimized URLs should follow this logic (listed in order of priority):
unique (1 URL == 1 resource)
permanent (they do not change)
manageable (1 logic per site section, no complicated exceptions)
easily scalable logic
short
with a targeted keyword phrase
Based on this,
www.example.com/view-profiles
would be the better choice.
That said:
Google has something I call "DUST crawling prevention" (see the paper "Do Not Crawl in the DUST" from this Google researcher: http://research.google.com/pubs/author6593.html), so when Google discovers a URL it must decide whether that specific page is worth crawling.
Google gives URLs ending in .html a "bonus" credit of trust: "this is an HTML page, I probably want to crawl it".
That said: if your site mostly consists of HTML pages with actual textual content, this "bonus" is not needed.
I personally only add .html to HTML sitemap pages that consist solely of long lists, and only when I have a few million of them, as I have seen a slightly better crawl rate on those pages. For all other pages I strictly keep Franz's URL logic mentioned above.
Best regards,
Franz (Vienna, Austria)
P.S.: Please see https://webmasters.stackexchange.com/ for SEO questions that are not programming-related.
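On the mechanics of the rewrite itself, a minimal .htaccess sketch (assumes Apache with mod_rewrite; view-profiles.php is an illustrative file name):

RewriteEngine On
# if the requested name plus ".php" exists on disk, serve that file
# internally, so /view-profiles is handled by view-profiles.php
RewriteCond %{REQUEST_FILENAME}.php -f
RewriteRule ^([^.]+)$ $1.php [L]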
I am using Nutch to crawl websites, and strangely, for one of my websites the Nutch crawl returns only two URLs: the home page URL (http://mysite.com/) and one other.
The URLs on my website are basically of this format:
http://mysite.com/index.php?main_page=index&params=12
http://mysite.com/index.php?main_page=index&category=tub&param=17
i.e. the URLs differ only in the parameters appended (the part "http://mysite.com/index.php?" is common to all URLs).
Is Nutch unable to crawl such websites?
What Nutch settings do I need in order to crawl them?
I got the issue fixed.
It had everything to do with this URL filter:
# skip URLs containing certain characters as probable queries, etc.
-[?*!#=]
I commented out this filter and Nutch crawled all the URLs :)
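For reference, the relevant lines in conf/crawl-urlfilter.txt after the change would look like this (a sketch; in newer Nutch versions the same rule lives in conf/regex-urlfilter.txt):

# skip URLs containing certain characters as probable queries, etc.
# -[?*!#=]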