Some sites have a URL pattern such as www.___.com/id=1 through www.___.com/id=1000. How can I crawl such a site using Nutch? Is there any way to provide seeds for fetching over a range?
I think the easiest way would be to have a script to generate your initial list of urls.
No. You have to inject them manually or with a script.
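A script along these lines could generate the seed list. This is only a minimal sketch: the host `www.example.com` is a placeholder for the elided site name, and `generate_seed_urls` and `write_seed_file` are hypothetical helper names, not part of Nutch.

```python
def generate_seed_urls(base, start, end):
    """Return one URL per id in the inclusive range [start, end]."""
    return [f"{base}/id={i}" for i in range(start, end + 1)]


def write_seed_file(path, urls):
    """Write the URLs one per line, the plain-text layout a seed file uses."""
    with open(path, "w", encoding="utf-8") as fh:
        fh.write("\n".join(urls) + "\n")
```

For example, `write_seed_file("seed.txt", generate_seed_urls("http://www.example.com", 1, 1000))` would produce a 1000-line file that can be placed in the seed directory before injecting.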
Related
My website changes very frequently, and I need a way to dynamically generate a new site map every day.
I tried to use sitemap.js but it requires me to give it specific urls for my site.
I am wondering if there's a way to have it crawl the site and generate a site map based on the urls it finds dynamically.
If not, is there any other server-side script that I can use to dynamically generate site maps?
Thanks
Does your website have a backend or some data storage, or is it plain HTML and nothing more? If it has either of the first two, you can extract the URLs from there. Otherwise you can:
1. Get your homepage and extract all URLs.
2. Omit those that point to other domains.
3. Repeat for the links that you've stored.
4. Do not store duplicates.
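The four steps above can be sketched as a small breadth-first crawler. This is an illustration using only the Python standard library, not a sitemap.js feature; the names `LinkExtractor`, `fetch_html`, `crawl_site` and the `max_pages` limit are assumptions made for the example.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse
from urllib.request import urlopen


class LinkExtractor(HTMLParser):
    """Collect the href value of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def fetch_html(url):
    """Download a page and decode it leniently as UTF-8."""
    with urlopen(url, timeout=10) as resp:
        return resp.read().decode("utf-8", errors="replace")


def crawl_site(start_url, fetch=fetch_html, max_pages=500):
    """Return the sorted same-domain URLs reachable from start_url."""
    domain = urlparse(start_url).netloc
    seen = {start_url}            # step 4: never store duplicates
    queue = deque([start_url])
    while queue:
        url = queue.popleft()
        try:
            html = fetch(url)     # step 1: get the page
        except OSError:
            continue              # unreachable page: skip it
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            absolute = urldefrag(urljoin(url, href)).url
            if urlparse(absolute).netloc != domain:
                continue          # step 2: omit other domains
            if absolute in seen or len(seen) >= max_pages:
                continue
            seen.add(absolute)
            queue.append(absolute)  # step 3: repeat for stored links
    return sorted(seen)
```

The returned list can then be handed to whatever sitemap generator you use. Passing `fetch` as a parameter keeps the traversal logic testable without network access.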
I am crawling a website using Apache Nutch. While crawling, I want nutch to ignore multiple url patterns like http://www.youtube.com/..so on..., http://www.twitter.com/so on.., etc.
I know how to configure the regex-urlfilter.txt file to crawl specific URLs.
But I don't know how to configure Nutch to ignore certain URL patterns.
I followed the URL below and found many useful examples:
https://scm.thm.de/pharus/nutch-config/blobs/66fba7d3dc015974b5c194e7ba49da60fe3c3199/Nutch-Config/conf/regex-urlfilter.txt
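As a rough illustration of how exclusion rules in that file behave: each rule line starts with `-` (reject) or `+` (accept) followed by a regular expression, and the first rule that matches a URL decides its fate. The sketch below emulates that first-match logic in Python so patterns can be tried out before a crawl. It is not Nutch code; the two host patterns are examples built from the sites named in the question, and Java and Python regex dialects can differ for advanced constructs.

```python
import re

# Rule text in the same format as regex-urlfilter.txt (example patterns only).
RULES = r"""
# Skip YouTube and Twitter.
-^https?://(www\.)?youtube\.com/
-^https?://(www\.)?twitter\.com/
# Accept everything else.
+.
"""


def parse_rules(text):
    """Turn rule lines into (allow, compiled_pattern) pairs."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank lines and comments carry no rule
        sign, pattern = line[0], line[1:]
        if sign not in ("+", "-"):
            raise ValueError(f"rule must start with + or -: {line!r}")
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def accepts(rules, url):
    """Apply the first matching rule; a URL matching no rule is rejected."""
    for allow, pattern in rules:
        if pattern.search(url):
            return allow
    return False
```

Because the first match wins, the `-` lines have to appear before the catch-all `+.` line.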
I use Nutch and would like to index an intranet. How can I make sure everything on the intranet gets indexed?
Thanks.
If you know all the URLs of the intranet, then write a sitemap (or an equivalent page listing all the URLs) and point the crawler to it. A robots.txt file alone will not do this, since it tells crawlers what to skip rather than what to fetch.
If you don't, then you can never be sure that you have crawled every URL, because there is no way to verify it after the crawl.
In that case, your best chance is to crawl at the maximum depth.
Regards
I crawl sites with Nutch 1.3. Now I want to delete a URL from the crawldb. How can I do this? And how do I read from the crawldb? I want to see the URLs that exist in it.
To read from the crawldb you can use the CrawlDbReader class (org.apache.nutch.crawl package). To remove a URL from the crawldb you can try the CrawlDbMerger class (same package) with the "-filter" option. But I suggest writing a MapReduce job to delete URLs according to your needs.
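From the command line, these two classes are reachable through the `readdb` and `mergedb` subcommands of `bin/nutch`. The sketch below only assembles the argument lists and leaves running them to you. The paths are placeholders, the helper names are hypothetical, and it assumes the URL filter configuration (for example regex-urlfilter.txt) already rejects the URL you want dropped, since "-filter" applies those filters while merging.

```python
import subprocess

NUTCH = "bin/nutch"  # assumed location of the Nutch launcher script


def readdb_stats_cmd(crawldb):
    """Command that prints summary statistics for a crawldb."""
    return [NUTCH, "readdb", crawldb, "-stats"]


def readdb_dump_cmd(crawldb, out_dir):
    """Command that dumps every crawldb entry as text into out_dir."""
    return [NUTCH, "readdb", crawldb, "-dump", out_dir]


def mergedb_filter_cmd(output_crawldb, crawldb):
    """Command that rewrites a crawldb through the configured URL filters."""
    return [NUTCH, "mergedb", output_crawldb, crawldb, "-filter"]


def run(cmd):
    """Execute a command and raise if it exits with a non-zero status."""
    subprocess.run(cmd, check=True)
```

For instance, `run(readdb_dump_cmd("crawl/crawldb", "crawldb_dump"))` would write the stored URLs to a text dump, and `run(mergedb_filter_cmd("crawl/crawldb_filtered", "crawl/crawldb"))` would produce a filtered copy.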
Does Nutch crawl automatically when I add new pages to the website?
No, you have to recrawl or create the index from scratch.
It won't re-crawl automatically. You can do either of the following:
Have the parent page of the new URL re-crawled, so that the new URL enters the crawldb and is fetched in a subsequent fetch round.
Add the new URL directly to the crawldb via the inject command.
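The second option could look roughly like this: write the new URLs to a seed directory, then pass that directory to `bin/nutch inject`. A minimal sketch, where the paths, the file name `seed.txt`, and the helper names are all assumptions made for illustration:

```python
import os
import subprocess


def write_seed_dir(seed_dir, urls):
    """Create seed_dir if needed and write the URLs one per line."""
    os.makedirs(seed_dir, exist_ok=True)
    path = os.path.join(seed_dir, "seed.txt")
    with open(path, "w", encoding="utf-8") as fh:
        fh.write("\n".join(urls) + "\n")
    return path


def inject_cmd(crawldb, seed_dir):
    """Command that adds the URLs found in seed_dir to the crawldb."""
    return ["bin/nutch", "inject", crawldb, seed_dir]


def inject(crawldb, seed_dir):
    """Run the inject command and raise if it fails."""
    subprocess.run(inject_cmd(crawldb, seed_dir), check=True)
```

After injecting, the new URLs are picked up by the next generate/fetch round like any other crawldb entry.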
You should do scheduled crawling to keep your data up-to-date.
Open source Java Job Schedulers