Malformed URL: '', skipping (java.net.MalformedURLException - nutch

I crawl sites with Nutch 1.3. I see this exception in my log when Nutch crawls my sites:
Malformed URL: '', skipping (java.net.MalformedURLException: no protocol:
at java.net.URL.<init>(URL.java:567)
at java.net.URL.<init>(URL.java:464)
at java.net.URL.<init>(URL.java:413)
at org.apache.nutch.crawl.Generator$Selector.reduce(Generator.java:247)
at org.apache.nutch.crawl.Generator$Selector.reduce(Generator.java:109)
at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:463)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:411)
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:216)
)
How can I solve this?

According to the docs:
"MalformedURLException is thrown to indicate that a malformed URL has occurred. Either no legal protocol could be found in a specification string or the string could not be parsed."
Note that this exception is not thrown when the server is down or when the path points to a missing file. It occurs only when the URL cannot be parsed.
The error indicates that there is no protocol, and the empty quotes show that the crawler was handed no URL at all:
Malformed URL: '' , skipping (java.net.MalformedURLException: no protocol:
Here is an interesting article that I came across; have a look: http://www.symphonious.net/2007/03/29/javaneturl-or-javaneturi/
What is the exact URL you are trying to parse?

After configuring regex-urlfilter.txt and seed.txt, try this command:
./nutch plugin protocol-file org.apache.nutch.protocol.file.File file:\\\e:\\test.html
(In my example, the file is located at e:\test.html.)
Before this, I always ran:
./nutch plugin protocol-file org.apache.nutch.protocol.file.File \\\e:\test.html
and got this error, because the protocol file: was missing:
java.net.MalformedURLException : no protocol : \\e:\test.html

Malformed URL: ''
means that the URL was empty instead of being something like http://www.google.com.
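One way to look for the offending entry is to scan the seed list for lines that are empty or lack a protocol. The following is a minimal sketch, not part of Nutch: the function name find_bad_seeds and the sample lines are hypothetical, and treating '#' lines as comments is an assumption about the seed file.

```python
from urllib.parse import urlsplit


def find_bad_seeds(lines):
    """Return (line_number, text) pairs for entries with no protocol."""
    bad = []
    for number, raw in enumerate(lines, start=1):
        text = raw.strip()
        if text.startswith("#"):
            continue  # assumption: '#' marks a comment line in the seed file
        # An empty entry or one without a scheme ("http", "file", ...) is the
        # kind of input that java.net.URL rejects with "no protocol".
        if not text or not urlsplit(text).scheme:
            bad.append((number, text))
    return bad


if __name__ == "__main__":
    sample = ["http://www.google.com", "", "www.example.com/page", "file:///e:/test.html"]
    for number, text in find_bad_seeds(sample):
        print(f"line {number}: {text!r} has no protocol")
```

This only checks the seed file. If the seeds are clean, the empty URL may already be stored in the crawl database, which this sketch does not inspect.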

Related

Scrapy how to update image URL if current image url returns 404

I need to change the image link if the current image URL returns a 404 code.
I have implemented my own pipeline by extending FilesPipeline.
I expected the media_failed method to be called when a download returns 404, but that didn't happen.
In the item_completed method I see that the results for the failed URL contain the following info:
<class 'tuple'>: (False, <twisted.python.failure.Failure scrapy.pipelines.files.FileException: download-error>)
In this case I have to update the original image link and retry the download.
I see the following info in logs:
[scrapy.pipelines.files] WARNING: File (code: 404): Error downloading file from <GET https://any_dummy_link.jpg> referred in <None>
Request the image URL. If response.status is 404 you can handle it differently.
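If you would rather react inside item_completed, the (success, info) tuples described in the question can be paired with the URLs that produced them. This is a minimal sketch in plain Python, assuming the results arrive in the same order as the URLs that were requested. The helper split_results and the sample data are hypothetical, and the exception object stands in for the twisted Failure shown above.

```python
def split_results(urls, results):
    """Separate URLs whose download succeeded from those that failed.

    `results` mimics the list passed to item_completed: (success, info) tuples,
    assumed to be in the same order as `urls`.
    """
    ok, failed = [], []
    for url, (success, _info) in zip(urls, results):
        (ok if success else failed).append(url)
    return ok, failed


if __name__ == "__main__":
    urls = ["https://example.com/a.jpg", "https://example.com/b.jpg"]
    results = [(True, {"path": "full/a.jpg"}), (False, Exception("download-error"))]
    ok, failed = split_results(urls, results)
    print("retry with a replacement link:", failed)
```

The failed list is then the set of links to replace before scheduling another download.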

How do I get a specific version of Cassandra in Embedded-Cassandra

In Embedded-Cassandra (https://github.com/nosan/embedded-cassandra/wiki), the default version seems to be 3.11.4. I want to use 3.11.3. I tried setting the version but got an error:
val factory = new LocalCassandraFactory
println(s"factory is ${factory}")
factory.setVersion(("3.11.3"))
...
Error
WARN c.g.n.e.c.l.a.RemoteArtifact - HTTP (404 Not Found) status for URL 'http://www.mirrorservice.org/sites/ftp.apache.org/cassandra/3.11.3/apache-cassandra-3.11.3-bin.tar.gz'
Indeed, the version doesn't exist at
http://www.mirrorservice.org/sites/ftp.apache.org/cassandra/
How can I use a specific version of Cassandra in EmbeddedCassandra?
Did you get an exception or only a warning message?
RemoteArtifact tries to download an archive from several URLs.
https://apache.org/dyn/closer.cgi?action=download&filename=cassandra/${version}/apache-cassandra-${version}-bin.tar.gz
https://archive.apache.org/dist/cassandra/${version}/apache-cassandra-${version}-bin.tar.gz
The second link works fine for me.
https://archive.apache.org/dist/cassandra/3.11.3/apache-cassandra-3.11.3-bin.tar.gz
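To see which URLs would be tried for a given version, the two patterns quoted above can be expanded by hand. This is a small sketch in Python rather than the library's own code; the function name candidate_urls is hypothetical.

```python
from string import Template

# URL patterns quoted in the answer above; ${version} is the placeholder.
PATTERNS = [
    "https://apache.org/dyn/closer.cgi?action=download&filename=cassandra/${version}/apache-cassandra-${version}-bin.tar.gz",
    "https://archive.apache.org/dist/cassandra/${version}/apache-cassandra-${version}-bin.tar.gz",
]


def candidate_urls(version):
    """Expand each pattern for the given Cassandra version string."""
    return [Template(pattern).substitute(version=version) for pattern in PATTERNS]


if __name__ == "__main__":
    for url in candidate_urls("3.11.3"):
        print(url)
```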

%AddJar for hellospark_2.10-1.0.jar giving Name: java.util.zip.ZipException Message

I am trying to run %AddJar in my new notebook in IBM Bluemix.
%AddJar https://github.com/ibm-cds-labs/spark.samples/blob/master/dist/helloSpark-assembly-2.1.jar -f
However, I keep receiving this error -
Starting download from https://github.com/ibm-cds-labs/spark.samples/blob/master/dist/helloSpark-assembly-2.1.jar
Finished download of helloSpark-assembly-2.1.jar
Out[8]:
Name: java.util.zip.ZipException
Message: error in opening zip file
StackTrace: java.util.zip.ZipFile.open(Native Method)
java.util.zip.ZipFile.<init>(ZipFile.java:235)
java.util.zip.ZipFile.<init>(ZipFile.java:165)
java.util.zip.ZipFile.<init>(ZipFile.java:179)
I tried all sorts of URLs (raw, file, etc.) as specified in this other question, but nothing helped.
%AddJar for hellospark_2.10-1.0.jar giving Name: java.util.zip.ZipException Message: error in opening zip file
Please advise.
Thanks
Raj
Your URL points to an HTML page with a download button. You must use a URL that points to the actual JAR file instead. I got it by right-clicking on the download button and selecting "Copy Link Address". The URL has /raw/ instead of /blob/ in the path:
%AddJar https://github.com/ibm-cds-labs/spark.samples/raw/master/dist/helloSpark-assembly-2.1.jar -f
That line "worked" for me, in the sense that I got a totally different error message on the first try: Assertion failed. After restarting the kernel and re-executing the %AddJar, the error was gone. Maybe my service didn't have the download directory yet when I executed the line for the first time.
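The /blob/ to /raw/ rewrite, and the reason the HTML page fails with a ZipException, can be checked outside the notebook. This is a minimal sketch; the helper names to_raw_url and looks_like_jar are hypothetical, and it relies only on the fact that a JAR is a ZIP archive.

```python
import io
import zipfile


def to_raw_url(url):
    """Rewrite a GitHub 'blob' page URL into the matching 'raw' file URL."""
    return url.replace("/blob/", "/raw/", 1)


def looks_like_jar(data):
    """A JAR is a ZIP archive, so check whether the bytes open as one."""
    return zipfile.is_zipfile(io.BytesIO(data))


if __name__ == "__main__":
    page = "https://github.com/ibm-cds-labs/spark.samples/blob/master/dist/helloSpark-assembly-2.1.jar"
    print(to_raw_url(page))
    # An HTML download page is not a ZIP archive, which is why opening it fails.
    print(looks_like_jar(b"<!DOCTYPE html><html></html>"))
```

Running looks_like_jar on the downloaded bytes before calling %AddJar tells you whether the URL returned the archive or a web page.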

Gitlab misdetects binary file with text file and raises Internal Error (500 Whoops)

What is the issue?
When I click the link to a commit that involves a binary file in the Commits view of a project on GitLab, I receive an internal error: "500 Whoops, something went wrong on our end."
The issue also appears when I create a merge request whose source is that same commit.
production.log says:
Started GET "/TempTest/bsp/commit/3098a49f2fd1c77be0c383994aa6655f5d15ebf8" for 127.0.0.1 at 2016-05-30 16:17:15 +0900
Processing by Projects::CommitController#show as HTML
Parameters:{"namespace_id"=>"TempTest", "project_id"=>"bsp", "id"=>"3098a49f2fd1c77be0c383994aa6655f5d15ebf8"}
Encoding::CompatibilityError (incompatible character encodings: UTF-8 and ASCII-8BIT):
app/views/projects/diffs/_file.html.haml:54:in `_app_views_projects_diffs__file_html_haml__1070266479743635718_49404820'
app/views/projects/diffs/_diffs.html.haml:22:in `block in _app_views_projects_diffs__diffs_html_haml__2984561770205002953_48487320'
app/views/projects/diffs/_diffs.html.haml:17:in `each_with_index'
app/views/projects/diffs/_diffs.html.haml:17:in `_app_views_projects_diffs__diffs_html_haml__2984561770205002953_48487320'
app/views/projects/commit/show.html.haml:12:in `_app_views_projects_commit_show_html_haml__3333221152053087461_45612480'
app/controllers/projects/commit_controller.rb:30:in `show'
lib/gitlab/middleware/go.rb:16:in `call'
Completed 500 Internal Server Error in 210ms (Views: 8.7ms | ActiveRecord: 10.5ms)
GitLab seems to mistake the binary file for a text file,
so the HTML rendering runs into an error ("Encoding::CompatibilityError").
I can live with GitLab sometimes mistaking a binary file for a text file, but the problem is that the server aborts the request with an internal error when that happens.
Could anyone tell me how to let the request complete even when such a misdetection occurs?
For example, I imagine answers along these lines:
e.g.1) Force a file to be recognized as binary.
e.g.2) Bypass the HTML rendering when such an error occurs.
What I tried:
I added the line '*.XXX binary' to .gitattributes to check whether I could force GitLab to treat a certain file as binary.
The Git client recognized the file as binary, and the diff did not output any text. However, it had no effect in GitLab even after I pushed it.
versions info
I first hit this issue on GitLab 8.6.2, but the same issue occurs on 8.8.3.
I use git 2.7.2.
Thank you.

An activeweb-bootstrap can not find controller

I downloaded the ActiveWeb Bootstrap project from GitHub and am having trouble with it.
First, it was impossible to import it into Eclipse, so I executed mvn eclipse:eclipse, then imported the project into Eclipse and converted it to Maven.
Then I started the Jetty server from Eclipse and got an error at the path http://localhost:8080/activeweb-bootstrap/
URI Full Path: /activeweb-bootstrap/
URI Path: /activeweb-bootstrap/
Method: GET
org.javalite.activeweb.ClassLoadException: java.lang.ClassNotFoundException: app.controllers.ActivewebBootstrapController
at org.javalite.activeweb.DynamicClassFactory.getCompiledClass(DynamicClassFactory.java:62)
at org.javalite.activeweb.DynamicClassFactory.createInstance(DynamicClassFactory.java:23)
at org.javalite.activeweb.ControllerFactory.createControllerInstance(ControllerFactory.java:27)
at org.javalite.activeweb.Router.recognize(Router.java:80)
at org.javal
I was able to run this project properly only when I exported it as a WAR file. Why do I get this error if I start the project with Jetty from Eclipse?
updated:
62406 [qtp31348584-11] WARN org.javalite.activeweb.RequestDispatcher - ActiveWeb 404 WARNING:
Request URL: http://localhost:8080/activeweb-bootstrap/
ContextPath:
Query String: null
URI Full Path: /activeweb-bootstrap/
URI Path: /activeweb-bootstrap/
Method: GET
org.javalite.activeweb.ClassLoadException: java.lang.ClassNotFoundException: app.controllers.ActivewebBootstrapController
62406 [qtp31348584-11] WARN org.javalite.activeweb.ParamCopy - found 'session' value set by controller. It is reserved by ActiveWeb and will be overwritten.
62438 [qtp31348584-11] INFO org.javalite.activeweb.freemarker.FreeMarkerTemplateManager - Rendered template: '/system/404' with layout: '/layouts/default_layout'
Apparently this example had an old version of a dependency. Please clone it and try again. It is fixed already.
ANSWER UPDATE:
The link on the README.md file was incorrect. The project is mapped to root, so instead of accessing http://localhost:8080/activeweb-bootstrap/, you need to access:
http://localhost:8080/.
The README.md file was updated accordingly: https://github.com/javalite/activeweb-bootstrap
The message you are getting:
org.javalite.activeweb.ClassLoadException: java.lang.ClassNotFoundException: app.controllers.ActivewebBootstrapController
is perfectly valid, since the framework is trying to interpret your URI "activeweb-bootstrap" as a route to the controller app.controllers.ActivewebBootstrapController.
No such controller exists, so you get a 404.
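The naming seen in the log can be illustrated with a short sketch. This is not ActiveWeb's router, only a hypothetical Python function (controller_class_for) that reproduces the one mapping shown above, from the first URI segment to a class name in app.controllers.

```python
def controller_class_for(uri_path, package="app.controllers"):
    """Map the first URI segment to a controller class name, as seen in the log.

    Illustration only: the real framework's routing rules are not reproduced here.
    """
    first = uri_path.strip("/").split("/")[0]
    if not first:
        return None  # root path: no segment to map in this sketch
    camel = "".join(word.capitalize() for word in first.split("-"))
    return f"{package}.{camel}Controller"


if __name__ == "__main__":
    print(controller_class_for("/activeweb-bootstrap/"))
```

With the application mapped to root, the context name in the URL is read as a controller segment, which is why the lookup fails.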
