Apache Nutch fetching but not saving file content - Cassandra

I asked Nutch to crawl a local file: http://localhost:8080/a.txt. I am running the HTTP server and I can see Nutch trying to access the file (and, before it, /robots.txt). I am using Cassandra as the backend.
However, I cannot see any data from the crawl. When I do
./bin/nutch readdb -dump data ..., I get the following output.
Can someone help me with a sane answer to this question? Where is the webpage data?
$ cat data/part-r-00000
http://localhost:8000/a.html key: localhost:http:8000/a.html
baseUrl: null
status: 2 (status_fetched)
fetchTime: 1426811920382
prevFetchTime: 1424219908314
fetchInterval: 2592000
retriesSinceFetch: 0
modifiedTime: 0
prevModifiedTime: 0
protocolStatus: (null)
parseStatus: (null)
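For reference, a minimal sketch of also dumping the stored content, assuming a Nutch 2.x readdb (WebTableReader) that accepts -content and -text flags; the flag names and the output directory data2 are assumptions, so check the usage printed by ./bin/nutch readdb first:
# assumed flags for a Nutch 2.x WebTableReader; verify with ./bin/nutch readdb (no arguments)
./bin/nutch readdb -dump data2 -content -text
cat data2/part-r-00000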

Related

{"error":"not_found","reason":"missing"} error while running couchdb-lucene on windows

I am running CouchDB and Couchdb-lucene on Windows Server 2019 version 1809.
I followed all the steps documented at https://github.com/rnewson/couchdb-lucene.
My CouchDB local.ini file:
[couchdb]
os_process_timeout = 60000
[external]
fti=D:/Python/python.exe "C:/couchdb-lucene-2.2.0/tools/couchdb-external-hook.py --remote-port 5986"
[httpd_db_handlers]
_fti = {couch_httpd_external, handle_external_req, <<"fti">>}
[httpd_global_handlers]
_fti = {couch_httpd_proxy, handle_proxy_req, <<"http://127.0.0.1:5986">>}
My couchdb-lucene.ini file:
[lucene]
# The output directory for Lucene indexes.
dir=indexes
# The local host name that couchdb-lucene binds to
host=localhost
# The port that couchdb-lucene binds to.
port=5986
# Timeout for requests in milliseconds.
timeout=10000
# Timeout for changes requests.
# changes_timeout=60000
# Default limit for search results
limit=25
# Allow leading wildcard?
allowLeadingWildcard=false
# couchdb server mappings
[local]
url = http://localhost:5984/
Curl outputs
C:\Users\serhato>curl http://localhost:5986/_fti
{"couchdb-lucene":"Welcome","version":"2.2.0-SNAPSHOT"}
C:\Users\serhato>curl http://localhost:5984
{"couchdb":"Welcome","version":"3.1.1","git_sha":"ce596c65d","uuid":"cc1269d5a23b98efa74a7546ba45f1ab","features":["access-ready","partitioned","pluggable-storage-engines","reshard","scheduler"],"vendor":{"name":"The Apache Software Foundation"}}
The design document I defined in CouchDB, which aims to create a full-text search index for the RenderedMessage field:
{
"_id": "_design/foo",
"_rev": "11-8ae842420bb4e122514fea6f05fac90c",
"fulltext": {
"by_message": {
"index": "function(doc) { var ret=new Document(); ret.add(doc.RenderedMessage); return ret }"
}
}
}
When I navigate to http://localhost:5984/dev-request-logs/_fti/_design/foo/by_message?q=hello
the response is
{"error":"not_found","reason":"missing"}
When I also navigate to http://localhost:5984/dev-request-logs/_fti/
the response is the same:
{"error":"not_found","reason":"missing"}
I think there is a problem with the external integration to the Lucene engine. So, out of curiosity, I tried to execute the Python command to check whether the script runs:
D:/Python/python.exe C:/couchdb-lucene-2.2.0/tools/couchdb-external-hook.py
but the result is
C:\Users\serhato>D:/Python/python.exe C:/couchdb-lucene-2.2.0/tools/couchdb-external-hook.py
File "C:\couchdb-lucene-2.2.0\tools\couchdb-external-hook.py", line 43
except Exception, e:
^
SyntaxError: invalid syntax
What might be the problem?
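For what it's worth, the except Exception, e: form in that traceback is Python 2-only syntax, which suggests D:/Python/python.exe is a Python 3 interpreter while the hook script is written for Python 2. A minimal check, assuming a Python 2.x interpreter is installed; the path below is a placeholder:
REM hypothetical Python 2 path; adjust to wherever Python 2.x is installed
C:\Python27\python.exe C:/couchdb-lucene-2.2.0/tools/couchdb-external-hook.py --remote-port 5986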
After hours of searching I finally came across this link:
https://github.com/rnewson/couchdb-lucene/issues/265
The query must go directly to Lucene, not through CouchDB itself. The URL below returns the result:
C:\Users\serhato>curl http://localhost:5986/localx/dev-requestlogs/_design/foo/by_message?q=hello
The original documentation is very misleading, as all the examples use the CouchDB default port, not the Lucene one. Or am I missing something?
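For reference, a minimal sketch of the direct couchdb-lucene query pattern that the working URL above follows; <section> stands for the server mapping name from couchdb-lucene.ini (here [local]) and <database> for the CouchDB database, both placeholders to adapt:
curl "http://localhost:5986/<section>/<database>/_design/foo/by_message?q=hello"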

LetsEncrypt-ACMESharp http-01 challenge on IIS invalid

On server A (non-IIS) I executed:
Import-Module ACMESharp
Initialize-ACMEVault
New-ACMERegistration -Contacts mailto:somebody#derryloran.com -AcceptTos
New-ACMEIdentifier -Dns www.derryloran.com -Alias dns1
Complete-ACMEChallenge dns1 -ChallengeType http-01 -Handler manual
The response back asked:
* Handle Time: [08/05/2017 22:46:27]
* Challenge Token: [BkqO-eYZ5sjgl9Uf3XpM5_s6e5OEgCj9FimuyPACOhI]
To complete this Challenge please create a new file
under the server that is responding to the hostname
and path given with the following characteristics:
* HTTP URL: [http://www.derryloran.com/.well-known/acme-challenge/BkqO-eYZ5sjgl9Uf3XpM5_s6e5OEgCj9FimuyPACOhI]
* File Path: [.well-known/acme-challenge/BkqO-eYZ5sjgl9Uf3XpM5_s6e5OEgCj9FimuyPACOhI]
* File Content: [BkqO-eYZ5sjgl9Uf3XpM5_s6e5OEgCj9FimuyPACOhI.X-01XUeWTE-LgpxWF4D-W_ZvEfu6ue2fAd7DJNhomQM]
* MIME Type: [text/plain]
Server B is serving www.derryloran.com and, I believe, serves the page at http://www.derryloran.com/.well-known/acme-challenge/BkqO-eYZ5sjgl9Uf3XpM5_s6e5OEgCj9FimuyPACOhI correctly, but when I then, back on Server A, execute:
Submit-ACMEChallenge dns1 -ChallengeType http-01
(Update-ACMEIdentifier dns1 -ChallengeType http-01).Challenges | Where-Object {$_.Type -eq "http-01"}
...but the status goes to invalid after a few seconds. FWIW, I've tried this several times, always with the same result. Why? What am I doing wrong?
I appreciate there's a lot more to do once I've got the certificate, but the site is being served in a Docker container, hence the Server A/B complexities...
Omg, how many times?!? The file had a BOM when created in VS. Recreating it with Notepad++ and saving as UTF-8 (without BOM), I'm now getting a valid response.
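A minimal sketch of writing the challenge file without a BOM from PowerShell; the path, token, and content below are placeholders for the values ACMESharp printed above:
# placeholders: substitute the real token and file content from the challenge output
$token   = "<challenge-token>"
$content = "<file-content>"
# UTF8Encoding($false) writes UTF-8 without a byte-order mark
$path = "C:\inetpub\wwwroot\.well-known\acme-challenge\$token"
[System.IO.File]::WriteAllText($path, $content, (New-Object System.Text.UTF8Encoding($false)))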

GitLab misdetects a binary file as a text file and raises an Internal Error (500 Whoops)

What is the issue?
When I open a link to a commit that involves a binary file from the Commits view of a project on GitLab, I receive an internal error: "500 Whoops, something went wrong on our end."
This issue also appears when creating a merge request whose origin is the same commit as above.
production.log says:
Started GET "/TempTest/bsp/commit/3098a49f2fd1c77be0c383994aa6655f5d15ebf8" for 127.0.0.1 at 2016-05-30 16:17:15 +0900
Processing by Projects::CommitController#show as HTML
Parameters:{"namespace_id"=>"TempTest", "project_id"=>"bsp", "id"=>"3098a49f2fd1c77be0c383994aa6655f5d15ebf8"}
Encoding::CompatibilityError (incompatible character encodings: UTF-8 and ASCII-8BIT):
app/views/projects/diffs/_file.html.haml:54:in `_app_views_projects_diffs__file_html_haml__1070266479743635718_49404820'
app/views/projects/diffs/_diffs.html.haml:22:in `block in _app_views_projects_diffs__diffs_html_haml__2984561770205002953_48487320'
app/views/projects/diffs/_diffs.html.haml:17:in `each_with_index'
app/views/projects/diffs/_diffs.html.haml:17:in `_app_views_projects_diffs__diffs_html_haml__2984561770205002953_48487320'
app/views/projects/commit/show.html.haml:12:in `_app_views_projects_commit_show_html_haml__3333221152053087461_45612480'
app/controllers/projects/commit_controller.rb:30:in `show'
lib/gitlab/middleware/go.rb:16:in `call'
Completed 500 Internal Server Error in 210ms (Views: 8.7ms | ActiveRecord: 10.5ms)
GitLab seems to misdetect a binary file as a text file,
so the HTML formatting engine seems to hit an error ("Encoding::CompatibilityError").
It's OK with me that GitLab sometimes misdetects a binary file as a text file, but the problem is that the GitLab server stops the transaction with an internal error when such a misdetection occurs.
Could anyone tell me how to keep the server transaction going even when such a misjudgment occurs?
For example, I would expect answers along these lines:
e.g. 1) Force a file to be recognized as binary.
e.g. 2) Bypass the HTML transformation when such an error occurs.
What I tried in order to resolve it:
I added the entry '*.XXX binary' to .gitattributes to confirm whether I could force GitLab to recognize a certain file as binary.
The Git client recognized the file as binary, and the diff did not output text. However, it had no effect in GitLab even after I pushed it.
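A minimal sketch of that .gitattributes approach, keeping the *.XXX placeholder from above, plus a way to confirm how Git classifies the file afterwards:
# *.XXX is a placeholder for the extension GitLab misdetects
echo "*.XXX binary" >> .gitattributes
git check-attr binary -- path/to/somefile.XXX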
Version info
I first faced this issue on GitLab 8.6.2, but the same issue occurs on 8.8.3.
I use git 2.7.2.
Thank you.

Nutch: Crawl Broken Links & Index Them in Solr

My purpose is to find out how many URLs in an HTML page are invalid (404, 500, HostNotFound). So, in Nutch, is there a config change we can make so that the web crawler crawls broken links and indexes them in Solr?
Once all the broken links and valid links are indexed in Solr, I can just check which URLs are invalid and remove them from my HTML page.
Any help will be highly appreciated.
Thanks in advance.
You don't need to index to Solr to find broken links.
Do the following:
bin/nutch readdb <crawlFolder>/crawldb/ -dump myDump
It will give you the links that are 404 as:
Status: 3 (db_gone)
Metadata: _pst_: notfound(14)
Go through the output file and you'll find all the broken links.
Example:
Put in the URL file: "http://www.wikipedia.com/somethingUnreal http://en.wikipedia.org/wiki/NocontentPage"
Run the crawl command: bin/nutch crawl urls.txt -depth 1
Run the readdb command: bin/nutch readdb crawl-20140214115539/crawldb/ -dump mydump
Open the output file "part-xxxxx" with a text editor
Results:
http://en.wikipedia.org/wiki/NocontentPage Version: 7
Status: 1 (db_unfetched)
...
Metadata: _pst_: exception(16), lastModified=0: Http code=503, url=http://en.wikipedia.org/wiki/NocontentPage
http://www.wikipedia.com/somethingUnreal Version: 7
Status: 5 (db_redir_perm)
...
Metadata: Content-Type: text/html_pst_: moved(12), lastModified=0: http://www.wikipedia.org/somethingUnreal
This command will give you a dump of just the broken links:
bin/nutch readdb <crawlFolder>/crawldb/ -dump myDump -status db_gone
Remember to exclude URLs with the following metadata in the dump, since that status comes from respecting robots.txt:
Metadata:
_pst_=robots_denied(18)
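As an illustration, a minimal sketch that pulls the candidate broken URLs out of the db_gone dump above; the part-00000 file name is an assumption and may differ, and entries flagged robots_denied should still be excluded as noted:
# print each URL line that immediately precedes a db_gone status line
grep -B 1 "db_gone" myDump/part-00000 | grep "^http"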

Nutch showing the following errors, what to do?

npun#nipun:~$ nutch crawl urls -dir crawl -depth 2 -topN 10
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/nutch/crawl/Crawl
Caused by: java.lang.ClassNotFoundException: org.apache.nutch.crawl.Crawl
at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
at java.lang.ClassLoader.loadClass(ClassLoader.java:307)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
at java.lang.ClassLoader.loadClass(ClassLoader.java:248)
Could not find the main class: org.apache.nutch.crawl.Crawl. Program will exit.
But when I run nutch from the terminal, it shows:
Usage: nutch [-core] COMMAND
where COMMAND is one of:
crawl one-step crawler for intranets
etc etc.....
Please tell me what to do.
Hey Tejasp, I did what you told me. I changed NUTCH_HOME to /nutch/runtime/local/bin and the Crawl.java file is there, but when I did this:
npun#nipun:~$ nutch crawl urls -dir crawl -depth 2 -topN 10
[Fatal Error] nutch-site.xml:6:6: The processing instruction target matching "[xX][mM][lL]" is not allowed.
Exception in thread "main" java.lang.RuntimeException: org.xml.sax.SAXParseException: The processing instruction target matching "[xX][mM][lL]" is not allowed.
at org.apache.hadoop.conf.Configuration.loadResource(Configuration.java:1168)
at org.apache.hadoop.conf.Configuration.loadResources(Configuration.java:1040)
at org.apache.hadoop.conf.Configuration.getProps(Configuration.java:980)
at org.apache.hadoop.conf.Configuration.set(Configuration.java:405)
at org.apache.hadoop.conf.Configuration.setBoolean(Configuration.java:585)
at org.apache.hadoop.util.GenericOptionsParser.processGeneralOptions(GenericOptionsParser.java:290)
at org.apache.hadoop.util.GenericOptionsParser.parseGeneralOptions(GenericOptionsParser.java:375)
at org.apache.hadoop.util.GenericOptionsParser.<init>(GenericOptionsParser.java:153)
at org.apache.hadoop.util.GenericOptionsParser.<init>(GenericOptionsParser.java:138)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:59)
at org.apache.nutch.crawl.Crawl.main(Crawl.java:55)
Caused by: org.xml.sax.SAXParseException: The processing instruction target matching "[xX][mM][lL]" is not allowed.
at org.apache.xerces.parsers.DOMParser.parse(Unknown Source)
at org.apache.xerces.jaxp.DocumentBuilderImpl.parse(Unknown Source)
at javax.xml.parsers.DocumentBuilder.parse(DocumentBuilder.java:180)
at org.apache.hadoop.conf.Configuration.loadResource(Configuration.java:1079)
... 10 more
It showed me this result. Now what...?
Also, I checked the nutch-site.xml file; I have made the following edits in it:
<configuration>
  <property>
    <name>http.agent.name</name>
    <value>PARAM_TEST</value><!-- Your crawler name here -->
  </property>
</configuration>
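That SAXParseException about "[xX][mM][lL]" typically means something appears before the <?xml ...?> declaration in nutch-site.xml (blank lines, stray characters, or a BOM); the reported location 6:6 suggests the declaration only starts on line 6. A quick way to look, assuming NUTCH_CONF_DIR points at the conf directory in use:
# show the first lines of the file so anything before the <?xml ...?> declaration is visible
head -5 $NUTCH_CONF_DIR/nutch-site.xml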
Sir, I did as you told me. This time I compiled Nutch with 'ant clean runtime', and the environment is now:
NUTCH_HOME=/nutch/runtime/deploy/bin
NUTCH_CONF_DIR=/nutch/runtime/local/conf
and now when I run the same command, it gives me this error:
npun#nipun:~$ nutch crawl urls -dir crawl -depth 2 -topN 10
Can't find Hadoop executable. Add HADOOP_HOME/bin to the path or run in local mode.
All I want is to create a search engine that can search for certain things on certain websites, for my final year project...
It seems that in Nutch version 2.x the name of the Crawl class has changed to Crawler.
I'm using Hadoop to run Nutch, so I use the following command for crawling:
hadoop jar apache-nutch-2.2.1.job org.apache.nutch.crawl.Crawler urls -solr http://<ip>:8983 -depth 2
If you crawl using Nutch on its own, the nutch script should reference the new class name.
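For comparison, a minimal sketch of the same crawl run through the local bin/nutch script, assuming a Nutch 2.x script whose crawl command maps to the Crawler class; the <ip> placeholder is carried over from the command above:
# assumed local-mode equivalent of the hadoop jar command above
bin/nutch crawl urls -solr http://<ip>:8983 -depth 2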
"but when i run nutch from terminal it show"
This verifies that the NUTCH_HOME/bin/nutch script is present at the correct location.
Please export NUTCH_HOME and NUTCH_CONF_DIR
Which mode of Nutch are you trying to use?
Local mode: jobs run without Hadoop. You need to have the Nutch jar inside NUTCH_HOME/lib. It's named after the version you are using, e.g. for Nutch release 1.3 the jar name is nutch-1.3.jar.
Hadoop mode: jobs run on a Hadoop cluster. You need to have the Nutch job file inside NUTCH_HOME. It's named after the release version, e.g. nutch-1.3.job.
If you happen to have these files (corresponding to the mode), then extract them and see if the Crawl.class file is indeed present inside.
If the Crawl.class file is not present, then obtain a new jar/job file by compiling the Nutch source.
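As a concrete illustration of the exports this answer asks for, a minimal local-mode sketch; both paths are placeholders for your own installation:
# placeholder paths; point these at your actual Nutch runtime
export NUTCH_HOME=/path/to/nutch/runtime/local
export NUTCH_CONF_DIR=$NUTCH_HOME/conf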
EDIT:
Don't use ant jar. Use ant clean runtime instead. The output gets generated inside the NUTCH_INSTALLATION_DIR/runtime/local directory. Run Nutch from there; that will be your NUTCH_HOME.
Export the required variables JAVA_HOME, NUTCH_HOME and NUTCH_CONF_DIR before running.
I have a feeling that the Crawl.class file is not present in the jar. Please extract the jar and check. FYI: the command to extract a jar file is jar -xvf <filename>.
If, after #2, you see that the class file isn't present in the jar, then see if the Nutch source code you downloaded has the Java file, i.e. nutch-1.x\src\java\org\apache\nutch\crawl\Crawl.java. If it's not present, get it from the internet and rebuild the Nutch jar.
If, after #2, the jar file has the class file and you still see the issue, then something is wrong with the environment. Try some other command like inject. Look for errors in the hadoop.log file. Let me know what you see.
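A minimal sketch of that check for local mode; the jar name follows the convention described above and will differ with your Nutch version:
# list the jar contents instead of extracting, and look for the Crawl class
jar -tf $NUTCH_HOME/lib/nutch-1.3.jar | grep "org/apache/nutch/crawl/Crawl.class"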
