I'm new to Nutch and not really sure what is going on here. When I run Nutch it crawls my website, but it seems to ignore URLs that contain query strings. I've commented out the filters in crawl-urlfilter.txt so it looks like this now:
# skip urls with these characters
#-[]
#skip urls with slash delimited segment that repeats 3+ times
#-.*(/[^/]+)/[^/]+\1/[^/]+\1/
So I think I've effectively removed any filter, which means I'm telling Nutch to accept all URLs it finds on my website.
Does anyone have any suggestions? Is this a bug in Nutch 1.2? Should I upgrade to 1.3, and would that fix the issue I'm having? Or am I doing something wrong?
See my previous question here: Adding URL parameter to Nutch/Solr index and search results.
The first 'Edit' should answer your question.
# skip URLs containing certain characters as probable queries, etc.
#-[?*!#=]
You have to either comment it out (as above) or modify it, for example:
# skip URLs containing certain characters as probable queries, etc.
-[*!#]
By default, crawlers skip links with query strings to avoid spam pages and crawler traps.
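If you want to check what the filters will do before launching another crawl, you can pipe a URL through the filter checker that ships with Nutch 1.x (the class name and the -allCombined flag are from the 1.x distribution; adjust if your version differs):
echo "http://www.example.com/page.php?id=12" | bin/nutch org.apache.nutch.net.URLFilterChecker -allCombined
A leading + in the output means the URL passes the filters; a leading - means it is still being rejected.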
How do you rewrite a URL in Notes 9 XPages?
Let's say I have:
www.example.com/myapp.nsf/page-name
How do I get rid of that .nsf part:
www.example.com/page-name
I don't want to set up lots of manual redirects, because my pages are formed dynamically, like WordPress.
I've read this: http://www.ibm.com/developerworks/lotus/library/ls-Web_site_rules/
It does not address the issue.
If you use substitution rules like the following, you can get rid of the db.nsf part and call your XPages directly as example.com/xpage1.xsp:
Rule (substitution): /db.nsf/* -> /db.nsf/*
Rule (substitution): /* -> /db.nsf/*
However, you have to "manually" generate your URLs without the db.nsf part (e.g. in menus), because the XPages runtime will include the db.nsf part in the URLs if you use, for instance, the openPage simple action.
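For example, a menu entry can simply be written as a plain link without the db.nsf part (xpage1.xsp is a placeholder page name here); the second substitution rule above then maps it back to the database:
<a href="/xpage1.xsp">Start page</a>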
To completely control what goes in and out, put your Domino server behind an Apache HTTP Server and use mod_rewrite. On Domino 9.0 for Windows you can use mod_domino.
You can do it with a mix of substitutions, a URL pattern, and partial refreshes.
I had the same problem; my customers want clean URLs for SEO.
My URLs now look like this:
www.myserver.de/products/financesoftware/anyproduct
First I used one substitution to cover the folder, database, and XPage parts of the URL.
My substitution: "/products" -> "/web/techdemo.nsf/product.xsp"
The problem with this is that any update on the page (while in redirect mode) sends the user back to the "dirty" URL.
I solved this by using partial refreshes only.
Last but not least, I use my own slash pattern at the end of the XPage call (.xsp).
In my case that's the "/financesoftware/anyproduct/" part.
I used facesContext.getExternalContext().getRequestPathInfo() to resolve that URL part.
Currently I use good old regular expressions to get the slash-separated parameters back out of the URL, but I am investigating a REST solution at the moment.
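To give an idea of that last step, here is a rough SSJS sketch (e.g. in beforePageLoad) of pulling the two slash-separated parameters back out; the exact shape of the path info depends on your substitution rules, so treat the regex as a starting point, not a finished solution:
// resolve the extra path info, e.g. ".../financesoftware/anyproduct/"
var pathInfo = String(facesContext.getExternalContext().getRequestPathInfo());
// grab the last two slash-separated segments
var match = pathInfo.match(/\/([^\/]+)\/([^\/]+)\/?$/);
if (match != null) {
    viewScope.category = match[1]; // e.g. "financesoftware"
    viewScope.product = match[2];  // e.g. "anyproduct"
}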
I haven't actually done this, but I just saw the option yesterday while looking for something else. In your XPage, go to All Properties and look at 'navigationRules' and 'pageBaseUrl'. I think you will find what you are looking for there.
I am using Nutch 2.1 integrated with MySQL. I crawled two sites, and Nutch successfully crawled them and stored the data in MySQL. I am using Solr 4.0.0 for searching.
Now my problem is that when I try to re-crawl a site like trailer.apple.com or any other site, it always crawls the previously crawled URLs, even though I have removed those URLs from the seed.txt file and entered new ones. Nutch is simply not crawling the new URLs.
Can anybody tell me what I am actually doing wrong?
Also, please suggest any Nutch plugin that can help with crawling video and movie sites.
Any help would be really appreciated.
I have the same problem. Nutch re-crawls only the old URLs, even though they no longer exist in seed.txt.
The first time I started Nutch I did the following:
Add the domain "www.domain01.com" to /root/Desktop/apache-nutch-2.1/runtime/local/urls/seed.txt (without quotes)
In /root/Desktop/apache-nutch-2.1/runtime/local/conf/regex-urlfilter.txt, add new line:
# accept anything else
^http://([a-z0-9]*.)*www.domain01.com/sport/
In /root/Desktop/apache-nutch-2.1/conf/regex-urlfilter.txt, add new line:
# accept anything else
^http://([a-z0-9]*.)*www.domain01.com/sport/
... and everything was fine.
Next I made the following changes:
Remove www.domain01.com from /root/Desktop/apache-nutch-2.1/runtime/local/urls/seed.txt and add two new domains: www.domain02.com and www.domain03.com
Remove www.domain01.com from /root/Desktop/apache-nutch-2.1/runtime/local/conf/regex-urlfilter.txt and add two new lines:
# accept anything else
^http://([a-z0-9]*.)www.domain02.com/sport/
^http://([a-z0-9].)*www.domain03.com/sport/
Remove www.domain01.com from /root/Desktop/apache-nutch-2.1/conf/regex-urlfilter.txt and add two new lines:
# accept anything else
^http://([a-z0-9]*.)www.domain02.com/sport/
^http://([a-z0-9].)*www.domain03.com/sport/
Next I executed the following commands:
updatedb
bin/nutch inject urls
bin/nutch generate urls
bin/nutch updatedb
bin/nutch crawl urls -depth 3
And Nutch still crawls www.domain01.com.
I don't know why.
I use Nutch 2.1 on Debian Linux 6.0.5 (x64), and Linux runs in a virtual machine on Windows 7 (x64).
This post is a bit outdated but still valid for the most part: http://pascaldimassimo.com/2010/06/11/how-to-re-crawl-with-nutch/. Perhaps the last-crawled pages are the ones that change the most. Nutch uses an adaptive algorithm to schedule re-crawls, so when a page is very static it should not be re-crawled very often. You can override how often you want to re-crawl in nutch-site.xml. Also, the seed.txt file is only meant to be a seed list; once you inject the URLs, Nutch does not use it anymore (unless you manually re-inject it).
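For example, to force more frequent re-fetching you can lower the default fetch interval in nutch-site.xml (db.fetch.interval.default is the standard property; the shipped default is 30 days, and the 86400 seconds = 1 day used here is only an illustrative value):
<property>
<name>db.fetch.interval.default</name>
<!-- re-fetch pages after one day instead of the 30-day default -->
<value>86400</value>
</property>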
Another configuration that may help is your regex-urlfilter.txt, if you want to restrict the crawl to a specific place or exclude certain domains/pages, etc. For example:
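A minimal sketch of such a filter (example.com is a placeholder; the first matching line wins, and the final -. rejects anything that did not match an earlier rule):
# only follow pages under the /sport/ section of example.com
+^http://www\.example\.com/sport/
# reject everything else
-.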
Cheers.
Just add the property below to your nutch-site.xml; it works for me, check it:
<property>
<name>file.crawl.parent</name>
<value>false</value>
</property>
And change regex-urlfilter.txt:
# skip file: ftp: and mailto: urls
#-^(file|ftp|mailto):
# accept anything else
+.
After that, remove the index directory, either manually or with a command like:
rm -r $NUTCH_HOME/indexdir
Then run your crawl command again.
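Putting the last two steps together (the index path is only an example; it is wherever you configured Nutch/Solr to write the index):
rm -r $NUTCH_HOME/indexdir
bin/nutch crawl urls -depth 3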
A big problem is that I am not a programmer! So I need to solve this with means within my own competence. I would be very happy for help!
I have an issue with a lot of duplicated URLs in the Google index and there are strong signs that it is causing SEO problems.
I don't have duplicate links on the site itself, but as it was once set up, the system allows all sorts of variations in the URL for certain pages. As long as the URL contains a specific article ID, the same content will be presented under an infinite number of URLs.
I guess the duplicates in Google's index have been growing over a long time and are due to malformed links from other sites that link to mine. The problem is that the system has accepted the variations.
Here are examples of variations that exists in the Google index:
site.com/a/Cow_Cat/id/5272
site.com/a/cow_cat/id/5272
site.com/a/cow…cat/id/5272
site.com/a/cowcat/id/5272
site.com/a/bird/id/5272
The first URL, with mixed case, is the one used site-wide, and for now I have to live with it; it would take too long to change everything to lower case. I cannot handle this manually via .htaccess, as there are a total of 300,000 articles. I believe tens of thousands of them have one or more duplicates.
My question is this:
Is it possible to create rules for canonical URLs in .htaccess so that the above URLs are handled as one, and likewise for the rest of the 300,000?
I.e., is there a way to say that all URLs of the form
/a/*/id/uniqueid
should be seen as one, based only on the unique ID, with no regard to the text represented by the "*"?
My hope is that it is possible to say that URLs matching a pattern like the above should be differentiated only by the last, unique segment.
If it is not possible in .htaccess, how would it be done with link rel="canonical" on each page? Can the code include wildcards?
I should add that the majority of the duplicates are caused by incoming links being lower case where the site itself uses a mix. Would it be OK to assign a canonical URL that is all lower case even though the site itself basically always uses a mix of lower/upper case?
If this is possible, I would be very happy for help with how to do it!
Jonas
Hi Michael! I am not an expert but this is how I think it could be done:
1) My problem is that the URLs have mixed cases and I cannot change that now.
2) If it is OK for the search engines, it would be fine for me to make the canonical URL identical to the actual URL except all lower case; that would solve approximately 90% of the duplicates. I.e., this would be the used URL: site.com/a/Cow_Cat/id/5272, and this would be the canonical: site.com/a/cow_cat/id/5272. As I understand it, that would be good SEO... or...?
My idea was NOT to change the address in the browser address bar (i.e. using a 301 redirect) but rather just to tell the search engines which URLs are duplicates. As I understand it, that can be done by defining a canonical URL either in .htaccess (as a pattern, I hope) or as a tag on each page.
3) IF it were possible to find a wildcard solution... I am not sure this is possible at all, but it would mean not assigning a specific canonical URL but rather a "group pattern", i.e. "Please, search engine, treat all URLs with this pattern - having the unique identifier at the end - as one and the same URL; you, the search engine, decide which one you prefer": /a/*/id/uniqueid
Would that work? It will only work in .htaccess if canonical URLs can be defined as a group, where the group is defined as a pattern with a specified part as the unique ID.
Is it possible, when adding a tag to each page, to say "all URLs containing this unique ID should be treated the same"? If that would work, it would look something like this:
link rel="canonical" /a/*/id/5272
I don't know if this wildcard syntax exists, but it would be nice :)
My advice would be to use 301 redirects with URL rewriting. Ask your webmaster to place this in your Apache config or virtual host config:
RewriteMap lc int:tolower
Then inside your .htaccess file you can use the map ${lc:$1} to convert matches to lower case. Here, the $1 part is a match (backreference from brackets in a regex in the RewriteRule) and the ${lc: } part is just how you apply the lc (lowercase) function set up earlier. Here is an example of what you might want in your .htaccess file:
# this matches a URL with any uppercase characters
RewriteCond %{REQUEST_URI} [A-Z]
# this rewrites it to lower case and issues a 301
RewriteRule (.*) /${lc:$1} [L,R=301]
As for matching the IDs, presuming your examples mean "always end with the ID", you could use a regex like:
^(.+/)(\d+)$
The first match (brackets) gets everything up to and including the forward slash before the ID, and the second part grabs the ID. We can then use it to point to a single, specific URL (like canonical, but with a 301).
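For instance, here is a sketch that collapses the case variants of the article URLs with a single 301 (it assumes the RewriteMap above is in place and that the all-lowercase form is the one you want to keep; only the case is changed, the slug itself is left alone):
# only redirect if there is at least one uppercase character, to avoid a redirect loop
RewriteCond %{REQUEST_URI} [A-Z]
RewriteRule ^a/(.+)/id/([0-9]+)$ /a/${lc:$1}/id/$2 [L,R=301]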
If you just want to use canonical tags, then you'll have to say what you're using code-wise, but an example of what I use (so as not to add tags to hundreds of individual pages, for instance) in PHP would be:
if (!empty($_SERVER["REDIRECT_URL"])) {
    // prefer the rewritten path when mod_rewrite has been applied
    $canonicalUrl = $_SERVER["SERVER_NAME"] . $_SERVER["REDIRECT_URL"];
} else if (!empty($_SERVER["REQUEST_URI"])) {
    // otherwise use the request URI with the query string stripped
    $canonicalUrl = $_SERVER["SERVER_NAME"] . preg_replace('/^([^?]+)\?.*$/', "$1", $_SERVER['REQUEST_URI']);
}
Here, the redirect URL is used if it's available, and if not, the request URI is used. This code strips off the query string (the ?something=true part in http://www.mysite.com/a/blah/12345/?something=true). Of course you can extend this code to build a custom path, not just strip the query string, by playing with the regex.
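Then, somewhere in the page's <head>, you would print the tag, lower-cased here to collapse the case variants (the http:// scheme is an assumption; adjust it to your site):
echo '<link rel="canonical" href="http://' . strtolower($canonicalUrl) . '" />';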
I have some doubts about Nutch.
While following the wiki, I am asked to edit crawl-urlfilter.txt:
+^http://([a-z0-9]*\.)*apache.org/
and I am asked to create a urls folder and a list of URLs...
Do I need to add all the links both in crawl-urlfilter.txt and in the list of URLs?
Yes and no.
crawl-urlfilter.txt acts as a filter, so in your example only URLs on apache.org will ever be crawled.
The urls folder gives the 'seed' URLs where the crawler starts.
So if you want the crawler to stay within a set of sites, you will want to make sure they have a positive match with the filter... otherwise it will crawl the entire web. This may mean you have to put the list of sites in the filter.
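A minimal sketch of how the two pieces fit together, using the wiki's apache.org example (the final -. line rejects anything that did not match an earlier rule):
urls/seed.txt (the seed list, where the crawler starts):
http://www.apache.org/
conf/crawl-urlfilter.txt (the filter; it must accept the seeds or nothing is crawled):
+^http://([a-z0-9]*\.)*apache.org/
-.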
I am using Nutch to crawl websites, and strangely, for one of my websites the Nutch crawl returns only two URLs: the home page URL (http://mysite.com/) and one other.
The URLs on my website are basically of this format:
http://mysite.com/index.php?main_page=index&params=12
http://mysite.com/index.php?main_page=index&category=tub&param=17
i.e. the URLs differ only in the parameters appended to them (the part "http://mysite.com/index.php?" is common to all URLs).
Is Nutch unable to crawl such websites?
What Nutch settings do I need in order to crawl such websites?
I got the issue fixed.
It had everything to do with the URL filter being set to:
# skip URLs containing certain characters as probable queries, etc.
-[?*!#=]
I commented out this filter and Nutch crawled all URLs :)
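For reference, commenting the rule out in conf/crawl-urlfilter.txt (or conf/regex-urlfilter.txt, whichever your crawl uses) looks like this; if you still want to skip anchors and wildcards, replacing it with a reduced character class such as -[*!#] is an alternative:
# skip URLs containing certain characters as probable queries, etc.
# -[?*!#=]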