Setting Depth for Apache-Nutch Crawler

How do I set the depth for the Apache Nutch crawler?
The command below says that crawl is deprecated:
bin/nutch crawl seed.txt -dir crawler/stat -depth 1 -topN 5
I tried bin/crawl instead of crawl, but that gives the error:
class cannot be loaded : bin.crawl

If you really want to set a maximum depth, you should use the scoring-depth plugin. The crawl script lets you define the number of iterations, which is an upper limit on the depth but not the same thing.
The correct format for the crawl command is:
bin/crawl -s seed.txt crawler/stat 1
As with other Nutch scripts, simply run bin/crawl with no parameters to see the help message that explains how to use it.
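To make this concrete, here is a sketch of both approaches for a stock Nutch 1.x install. The scoring-depth plugin name and the scoring.depth.max property come from the 1.x codebase, and the plugin.includes value below is illustrative only; start from the value in your version's nutch-default.xml:

# limit the crawl to at most 3 generate/fetch/parse/update rounds
# (an upper bound on depth, per the answer above)
bin/crawl -s seed.txt crawler/stat 3

# conf/nutch-site.xml: add scoring-depth to plugin.includes and cap the depth
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-basic|scoring-depth</value>
</property>
<property>
  <name>scoring.depth.max</name>
  <value>3</value>
</property>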

Related

wget: obtaining files matching regex

According to the man page of wget, --accept-regex is the argument to use when I need to selectively transfer files whose names match a certain regular expression. However, I am not sure how to use --accept-regex.
Assuming I want to obtain files diffs-000107.tar.gz, diffs-000114.tar.gz, diffs-000121.tar.gz, diffs-000128.tar.gz in IMDB data directory ftp://ftp.fu-berlin.de/pub/misc/movies/database/diffs/. "diffs\-0001[0-9]{2}\.tar\.gz" seems to be an ok regex to describe the file names.
However, when executing the following wget command
wget -r --accept-regex='diffs\-0001[0-9]{2}\.tar\.gz' ftp://ftp.fu-berlin.de/pub/misc/movies/database/diffs/
wget indiscriminately acquires all files in the ftp://ftp.fu-berlin.de/pub/misc/movies/database/diffs/ directory.
I wonder if anyone could tell what I have possibly done wrong?
Be careful: --accept-regex matches against the complete URL, but our target is specific file names, so we will use -A instead.
For example,
wget -r -np -nH -A "IMG[012][0-9].jpg" http://x.com/y/z/
will download all the files from IMG00.jpg to IMG29.jpg from the URL.
Note that an element containing shell-like wildcards, e.g. ‘zelazny196[0-9]*’, is treated as a matching pattern; a plain element such as ‘books’ is treated as a suffix.
References:
wget manual: https://www.gnu.org/software/wget/manual/wget.html
regex: https://regexone.com/
I'm reading in the wget man page:
--accept-regex urlregex
--reject-regex urlregex
Specify a regular expression to accept or reject the complete URL.
and noticing that it mentions the complete URL (e.g. something like ftp://ftp.fu-berlin.de/pub/misc/movies/database/diffs/diffs-000121.tar.gz)
So I suggest (without having tried it) to use
--accept-regex='.*diffs\-0001[0-9][0-9]\.tar\.gz'
(and perhaps give the appropriate --regex-type too)
BTW, for such tasks I would also consider using some scripting language à la Python (or using libcurl or curl).
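Putting both answers together for the original IMDB example, an untested sketch of the two variants might look like this (note that -A takes shell-style patterns, so [0-9] works but the {2} repetition does not):

# glob-style accept list, matched against file names
wget -r -np -nH -A 'diffs-0001[0-9][0-9].tar.gz' \
    ftp://ftp.fu-berlin.de/pub/misc/movies/database/diffs/

# regex variant, matched against the complete URL
wget -r -np -nH --regex-type=posix \
    --accept-regex='.*diffs-0001[0-9][0-9]\.tar\.gz' \
    ftp://ftp.fu-berlin.de/pub/misc/movies/database/diffs/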

wkhtmltopdf doesn't show desktop size page in pdf

I have been trying to set up wkhtmltopdf to convert HTML to PDF documents on the fly on my Linux web server. I've looked around for solutions to the default 800x600 issue and found this:
/usr/bin/xvfb-run --server-args="-screen 0, 1024x768x24" wkhtmltopdf --use-xserver http://domain.com /home/location/apps/test1.pdf
However, this STILL gives the resulting pdf a width of 800. What am I missing?
I've tried running xvfb-run from the ssh command line and it works flawlessly. It also creates the pdf with normal output; it's just that the width is NOT being recognized.
Just a guess, but you might need another set of quotes around "0, 1024x768x24" so that it's considered a single argument after being passed on. So something like this:
/usr/bin/xvfb-run --server-args="-screen \"0, 1024x768x24\"" wkhtmltopdf --use-xserver http://domain.com /home/location/apps/test1.pdf
They are escaped with \ so that they are not treated as the end of the --server-args string, but are instead passed along inside it.
(I don't know xvfb-run in detail, but assume that it is in turn building an Xvfb command line from those arguments.)
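For what it's worth, the Xvfb man page gives the screen argument without a comma (-screen screennum WxHxD), so another guess worth trying, untested, is:

/usr/bin/xvfb-run --server-args="-screen 0 1024x768x24" \
    wkhtmltopdf --use-xserver http://domain.com /home/location/apps/test1.pdf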

GNU find, how to use debug options, -D

I've been using find more and more successfully in my daily work, especially with xargs, parallel, grep, and pcregrep.
I was reading the man pages for find tonight and found a section I'd always skipped right over: debug options.
I promptly tried:
find -D help
# then
find -D tree . -type f -name "fjdalfj"
# and
find -D search . -type f -name "fjaldkj"
Wow, some really interesting output... however, I was unsuccessful in tracking down what it means.
Can someone point me to a resource I can read about this, or perhaps explain a bit about how they would use these options and when they'd be useful?
On Fedora, $ man find provides illumination. Notice carefully where the -D options are placed: before the selection expression and before the search directories are listed. Most of the debug modes deal with how the selection expressions are parsed, as these can get quite involved.
tree: shows the generated parse tree for the expression.
search: verbosely log the navigation of the searched directories.
stat: trace calls to stat(2) and friends when getting the file attributes, such as its size.
rates: indicate how frequently each expression predicate is successful.
opt: provide details on how the selection expression was optimized after it is parsed.
exec: show details about the -exec, -execdir, and -okdir programs run.
Your mileage may vary; your version of find(1) may have different debug options. A $ find -D help is probably in order.
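For example, the opt output makes the optimizer's reordering visible. A quick way to poke at it (the exact output format varies between findutils versions) is:

# watch how cheap tests such as -name get reordered before expensive ones such as -size
find -D opt . -size +1M -name '*.log' 2>&1 | head -n 40

# compare against the parse tree of the expression as written
find -D tree . -size +1M -name '*.log' 2>&1 | head -n 40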

Nutch does not crawl multiple sites

I'm trying to crawl multiple sites using Nutch. My seed.txt looks like this:
http://1.a.b/
http://2.a.b/
and my regex-urlfilter.txt looks like this:
# skip file: ftp: and mailto: urls
-^(file|ftp|mailto):
# skip image and other suffixes we can't yet parse
# for a more extensive coverage use the urlfilter-suffix plugin
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)$
# skip URLs containing certain characters as probable queries, etc.
-[?*!#=]
# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/[^/]+)/[^/]+\1/[^/]+\1/
# accept anything else
#+.
+^http://1.a.b/*
+^http://2.a.b/*
I tried the following for the last part:
+^http://([a-z0-9]*\.)*a.b/*
The only site crawled is the first one. All other configuration is default.
I run the following command:
bin/nutch crawl urls -solr http://localhost:8984/solr/ -dir crawl -depth 10 -topN 10
Any ideas?!
Thank you!
Try this in regex-urlfilter.txt:
Old Settings:
# accept anything else
#+.
+^http://1.a.b/*
+^http://2.a.b/*
New Settings:
# accept anything else
+.
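If you would rather stay restricted to the two sites instead of accepting everything, note that the unescaped dots in the original rules match any character; a sketch of that last section with the dots escaped would be:

# accept only the two seed hosts
+^http://1\.a\.b/
+^http://2\.a\.b/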

Change of server causes script not to work. Can this be due to php.ini being different?

Here is proof that my site is not portable. I had some regex that worked perfectly on my old server. I have now transferred my site to a new server and it doesn't work.
$handle = popen('/usr/bin/python '.YOUTUBEDL.'youtube-dl.py -o '.VIDEOPATH.$fileName.'.flv '.$url.' 2>&1', 'rb');
while (!feof($handle)) {
    $progress = fread($handle, 8192);
    $pattern = '/(?<percent>[0-9]{1,3}\.[0-9]{1,2})% of (?<filesize>.+) at/';
    // ###### Does not execute this if - no matches
    if (preg_match_all($pattern, $progress, $matches)) {
        fwrite($fh, $matches[0][0]."\r\n");
    }
}
The output from the shell is something like this, and the regex should match the file size and percentage:
[download] 56.8% of 4.40M at 142.40k/s ETA 00:13
The regex worked on the previous server but not on this one. Why? How can I debug this?
The difference between the servers is that the previous one was Fedora and this one is CentOS. I also specified the shell as /bin/bash.
Is there anything in php.ini that could cause a change in this?
Please help.
Update
The output of $progress is this: (just a small sample)
[download] 87.1% of 4.40M at 107.90k/s ETA 00:05
[download] 89.0% of 4.40M at 107.88k/s ETA 00:04
[download] 91.4% of 4.40M at 106.09k/s ETA 00:03
[download] 92.9% of 4.40M at 105.55k/s ETA 00:03
Update 2
Could the regex fail because of extra spacing in the output?
Also, would a different shell make a difference?
[SOLVED]
This was solved, and it was due to the regex requiring a P: on the older PCRE version, named groups must be written (?P<name>...) rather than (?<name>...). See here for more details: Does this regex in PHP actually work?
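In other words, the (?P<name>...) syntax works on both old and new PCRE. A minimal sketch for checking the corrected pattern from the command line, using a sample line from the output above, might be:

php -r '
$line = "[download] 56.8% of 4.40M at 142.40k/s ETA 00:13";
$pattern = "/(?P<percent>[0-9]{1,3}\.[0-9]{1,2})% of (?P<filesize>.+) at/";
if (preg_match_all($pattern, $line, $m)) {
    echo $m["percent"][0], " ", $m["filesize"][0], "\n";  // prints: 56.8 4.40M
}
'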
Is the output from the working or the non-working server? It's possible safe mode is enabled on the second and the script is refusing to execute. Check the Notes section on this page for more information: http://us.php.net/popen
To verify (I do see the comments where you placed the output of the python command), print out the $progress variable in the PHP script. I'm not sure whether the output you supplied came from running the python command on the CLI or from the PHP script itself, but printing $progress will help determine whether popen() is really executing that command.
(You should be able to tell if safe mode is enabled by running phpinfo() in your script)
Or you could do this:
if (ini_get('safe_mode')) {
    print "safe mode on\n";
} else {
    print "safe mode off\n";
}
I've never heard of regexes breaking when switching PHP configs, so I'm wondering if it's the Python call that might be breaking. This might be a silly question, but have you checked that your exec line works from a terminal?
