How to crawl URLs containing spaces using Apache Nutch?

I am using Nutch for crawling, but it fails on URLs that contain spaces. I have gone through this link http://lucene.472066.n3.nabble.com/URL-with-Space-td619127.html but did not get a satisfactory answer.
This works for a URL in the seed.txt file but not for URLs in the parsed content of a page.
I put a URL with spaces in conf/seed.txt, Nutch replaces the space with %20, and I am able to crawl the page.
I have added the following to regex-normalize.xml:
<regex>
<pattern> </pattern>
<substitution>%20</substitution>
</regex>
I also added a reference to regex-normalize.xml in nutch-site.xml, but I am still facing the same problem.
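(For reference, a rule in regex-normalize.xml only takes effect if the urlnormalizer-regex plugin is enabled and the file is picked up. A minimal sketch of the nutch-site.xml side, assuming the stock plugin and standard property names; the plugin.includes value shown is illustrative and should be merged with your version's defaults:)
<property>
<name>plugin.includes</name>
<value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>
<property>
<name>urlnormalizer.regex.file</name>
<!-- the file containing the substitution rule above -->
<value>regex-normalize.xml</value>
</property>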

I had the same problem, but with more characters, so I changed Fetcher.java.
New URLs are added to the queue in the "feeding" section.
You have to find this line:
nURL.set(url.toString());
and replace it with this:
nURL.set(URIUtil.encodeQuery(url.toString()));
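In context, the change might look roughly like this (a sketch against the Nutch 1.x fetcher, assuming the commons-httpclient URIUtil class that Nutch already ships; the surrounding try/catch and log call are illustrative):
import org.apache.commons.httpclient.URIException;
import org.apache.commons.httpclient.util.URIUtil;
// inside the fetcher's feeding loop, where new URLs are put on the queue
try {
  // percent-encode spaces and other illegal characters before queueing the URL
  nURL.set(URIUtil.encodeQuery(url.toString()));
} catch (URIException e) {
  LOG.warn("Skipping malformed URL: " + url, e);
}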

I had the same problem and added this to my regex-normalize.xml:
<regex>
<pattern> </pattern>
<substitution>%20</substitution>
</regex>

Related

Comma in the URL: Google ignores everything after the comma

My site has links that contain commas, e.g. http://example.com/productName,product2123123.html
I submitted a sitemap with these links and Google Webmaster Tools reports that the URLs are not found.
I see that Google ignores everything after the comma in the URL and tries to index http://example.com/productName, which is a wrong URL, so the site returns a 404.
Does Google have a bug? Or must I change my site's routing? Or change the comma to "%2C"? But could that remove my current pages from Google?
I'm not sure if this will solve it, but it may help you understand the problem better. Try reading the following links:
Using commas in URL's can break the URL sometimes?
Are you using commas in your URLs? Here’s what you need to know.
Google SEO News and Discussion Forum

IIS URL Rewrite pattern from subdirectory to subdomain with custom path extraction

Previously there was this...
http://www.website.com/blog/post/2013/04/16/Animal-Kingdom's-Wild-Africa-Trek.aspx
(notice the apostrophe in 'Kingdom's')
and it's now located at:
http://blog.website.com/post/Animal-Kingdoms-Wild-Africa-Trek
So, breaking it down, the parts are:
remove .aspx from the end of the URL
map the call from www. to blog. and remove the blog part of the path
remove the date from the URL
remove the apostrophe
I understand how to redirect the subdirectory to a subdomain, but I'm stuck on extracting the other parts of the path correctly, and on cleaning out the apostrophe too.
A complete solution would be a great help, thanks in advance.
There is no way to remove every ', because there can be more than one. You could try the following regexp (it allows up to 4 apostrophes), but it is quite "dangerous":
/blog/post/\d+/\d+/\d+/(([^']*)'*([^']*)'*([^']*)'*([^']*)'*).aspx
And the redirect URL will be:
http://blog.website.com/post/{R:2}{R:3}{R:4}{R:5}
Below is a screenshot of my IIS rule:
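Expressed as a web.config rule for the IIS URL Rewrite module, it might look roughly like this (a sketch: the rule name and redirectType are illustrative, and the leading slash is dropped from the pattern because IIS matches the URL without it):
<rewrite>
<rules>
<rule name="BlogPostRedirect" stopProcessing="true">
<!-- capture the apostrophe-split parts of the old path, skipping the date segments -->
<match url="^blog/post/\d+/\d+/\d+/(([^']*)'*([^']*)'*([^']*)'*([^']*)'*)\.aspx$" />
<action type="Redirect" url="http://blog.website.com/post/{R:2}{R:3}{R:4}{R:5}" redirectType="Permanent" />
</rule>
</rules>
</rewrite>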

How to re-crawl with Nutch

I am using Nutch 2.1 integrated with MySQL. I crawled 2 sites and Nutch successfully crawled them and stored the data in MySQL. I am using Solr 4.0.0 for searching.
Now my problem is: when I try to re-crawl a site like trailer.apple.com or any other site, it always crawls the last crawled URLs, even though I have removed the last crawled URLs from the seed.txt file and entered new URLs. Nutch is not crawling the new URLs.
Can anybody tell me what I am actually doing wrong?
Also, please suggest any Nutch plugin that can help with crawling video and movie sites.
Any help will be really appreciated.
I have the same problem. Nutch re-crawls only the old URLs, even though they no longer exist in seed.txt.
The first time I start Nutch I do the following:
Add the domain "www.domain01.com" to /root/Desktop/apache-nutch-2.1/runtime/local/urls/seed.txt (without quotes)
In /root/Desktop/apache-nutch-2.1/runtime/local/conf/regex-urlfilter.txt, add a new line:
# accept anything else
^http://([a-z0-9]*.)*www.domain01.com/sport/
In /root/Desktop/apache-nutch-2.1/conf/regex-urlfilter.txt, add a new line:
# accept anything else
^http://([a-z0-9]*.)*www.domain01.com/sport/
... and everything was fine.
Next I made the following changes:
Remove www.domain01.com from /root/Desktop/apache-nutch-2.1/runtime/local/urls/seed.txt and add two new domains: www.domain02.com and www.domain03.com
Remove www.domain01.com from /root/Desktop/apache-nutch-2.1/runtime/local/conf/regex-urlfilter.txt and add two new lines:
# accept anything else
^http://([a-z0-9]*.)www.domain02.com/sport/
^http://([a-z0-9].)*www.domain03.com/sport/
Remove www.domain01.com from /root/Desktop/apache-nutch-2.1/conf/regex-urlfilter.txt and add two new lines:
# accept anything else
^http://([a-z0-9]*.)www.domain02.com/sport/
^http://([a-z0-9].)*www.domain03.com/sport/
Next I executed the following commands:
updatedb
bin/nutch inject urls
bin/nutch generate urls
bin/nutch updatedb
bin/nutch crawl urls -depth 3
And Nutch still crawls www.domain01.com.
I don't know why.
I use Nutch 2.1 on Debian Linux 6.0.5 (x64), running in a virtual machine on Windows 7 (x64).
This post is a bit outdated but still valid for the most part: http://pascaldimassimo.com/2010/06/11/how-to-re-crawl-with-nutch/
Perhaps the last crawled pages are the ones that change the most. Nutch uses an adaptive algorithm to schedule re-crawls, so a very static page should not be re-crawled very often. You can override how often you want to re-crawl using nutch-site.xml. Also, the seed.txt file is only meant as a seed list; once you inject the URLs, Nutch does not use it anymore (unless you manually re-inject it).
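A minimal sketch of such an override in nutch-site.xml, assuming the standard db.fetch.interval.default property (the value is in seconds; 86400 here is just an example):
<property>
<name>db.fetch.interval.default</name>
<!-- re-fetch pages after 1 day instead of the default 30 days -->
<value>86400</value>
</property>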
Another configuration file that may help is regex-urlfilter.txt, if you want to restrict the crawl to a specific place or exclude certain domains/pages, etc.
Cheers.
Just add the property below to your nutch-site.xml; it works for me, check it:
<property>
<name>file.crawl.parent</name>
<value>false</value>
</property>
Then change regex-urlfilter.txt:
# skip file: ftp: and mailto: urls
#-^(file|ftp|mailto):
# accept anything else
+.
After that, remove the index directory, either manually or with a command like:
rm -r $NUTCH_HOME/indexdir
Then run your crawl command again.
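For example (a sketch; the -depth and -topN values are just illustrative):
bin/nutch crawl urls -depth 3 -topN 50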

Nutch 1.2 - Why won't nutch crawl url with query strings?

I'm new to Nutch and not really sure what is going on here. I run Nutch and it crawls my website, but it seems to ignore URLs that contain query strings. I've commented out the filters in crawl-urlfilter.txt so it looks like this now:
# skip urls with these characters
#-[]
#skip urls with slash delimited segment that repeats 3+ times
#-.*(/[^/]+)/[^/]+\1/[^/]+\1/
So I think I've effectively removed all the filters and am telling Nutch to accept every URL it finds on my website.
Does anyone have any suggestions? Is this a bug in Nutch 1.2? Should I upgrade to 1.3, and would that fix the issue I am having? Or am I doing something wrong?
See my previous question here Adding URL parameter to Nutch/Solr index and search results
The first 'Edit' should answer your question.
# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]
You have to comment it out or modify it as:
# skip URLs containing certain characters as probable queries, etc.
-[*!@]
By default, crawlers shouldn't crawl links with query strings, to avoid spam and fake search result pages.

How to use htaccess to rewrite url to html anchor tag (#)

I have a situation where I want to take the following URL:
/1/john
and have it redirect using Apache's htaccess file to go to
/page.php?id=1&name=john#john
so that it goes to an html anchor with the name of john.
I've found a lot of references to escaping special characters and to adding the [NE] flag so that the redirect doesn't escape the # sign, but these don't work. For example, adding [NE,R] means that the URL just appears in the browser address bar as the original: http://example.com/page.php?id=1&name=john#john.
This is possible using the [NE] (noescape) flag.
By default, special characters such as & and ? will be converted to their hexcode equivalents. Using the [NE] flag prevents that from happening.
More info: http://httpd.apache.org/docs/2.2/rewrite/flags.html#flag_ne
You can in fact do one of these things, but not both.
You can use the [NE] flag to tell Apache not to escape the '#' character, but for the redirect to work you have to specify an absolute URL to redirect to, not simply a relative page. Apache cannot scroll the window down to the anchor for you, but the browser will if you redirect to an absolute URL.
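For the original /1/john example, a rule along these lines might work (a sketch: the example.com host and the character classes are illustrative):
RewriteEngine On
# redirect /1/john to the real page plus an anchor; [NE] stops Apache from escaping
# the #, and the absolute URL makes it an external redirect the browser will follow
RewriteRule ^([0-9]+)/([A-Za-z0-9_-]+)$ http://example.com/page.php?id=$1&name=$2#$2 [NE,R=302,L]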
What you want to do can be accomplished with URL rewriting or, more specifically, URL beautification.
I quickly found this well-explained blog post for you; I hope it helps with the learning-to-rewrite-URLs part.
As for the # part (assuming you now know what I'm talking about), I don't see a problem with passing the same variable to the rewritten URL twice, like this (notice the last part of the first line):
RewriteRule ^([a-zA-Z0-9_]+)/([a-zA-Z0-9_]+)$ /$1/$2/#$2 [R]
RewriteRule ^([a-zA-Z0-9_]+)/([a-zA-Z0-9_]+)/$ /index.php?page=$1&subpage=$2
Though, you'll have to escape the #-part, and it seems that it can be done this way:
RewriteRule ^([a-zA-Z0-9_]+)/([a-zA-Z0-9_]+)$ /$1/$2/\%23$2 [R,NE]
BTW, URL rewriting is not that hard (but can become complicated, and I'm not an expert), but Google can help a lot along the way.
You cannot do an internal redirect to an anchor. (Just think about it: how would Apache scroll down to the anchor?) Your link should point to /1/john#john. Anchors aren't part of the request URI.
