Nutch: input URL gets modified by nutch parsechecker

I am using the Nutch 1.0 parsechecker command to parse the following URL:
http://www.doctorslounge.com/forums/viewtopic.php?f=7&t=40267
But on running parsechecker I get the result below:
"bin/nutch parsechecker -dumpText http://www.doctorslounge.com/forums/viewtopic.php?f=7&t=40267"
[1] 8956
$ fetching: http://www.doctorslounge.com/forums/viewtopic.php?f=7
Fetch failed with protocol status: notfound(14), lastModified=0:http://www.doctorslounge.com/forums/viewtopic.php?f=7
Somehow Nutch is automatically modifying my input URL
http://www.doctorslounge.com/forums/viewtopic.php?f=7&t=40267
to
http://www.doctorslounge.com/forums/viewtopic.php?f=7
Can anyone help me circumvent this problem? Thanks.
P.S. It fetches other URLs of the same domain; the input http://www.doctorslounge.com/index.php/articles/page/51032 works perfectly.

This appears to be an internal issue with the particular site. The same thing happens when trying to run wget http://www.doctorslounge.com/forums/viewtopic.php?f=7&t=40267.
Try this:
bin/nutch parsechecker -dumpText "http://www.doctorslounge.com/forums/viewtopic.php?f=7&t=40267"
That is, you need to quote (or escape) the &; otherwise the shell treats the & as its background operator, which is why you got the job number [1] 8956 and why the URL was cut off at ?f=7.
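For completeness, here is the escaped variant as a minimal sketch (same command and URL as above, only the ampersand is backslash-escaped instead of the whole URL being quoted):
# Backslash-escaping the & also keeps the shell from splitting the command line
bin/nutch parsechecker -dumpText http://www.doctorslounge.com/forums/viewtopic.php?f=7\&t=40267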
The other problem you'll have with parsing this page with nutch is that it's prohibited by the site's robots.txt file:
User-agent: *
...
Disallow: /forums/viewtopic.php
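You can check that for yourself before crawling; a small sketch, assuming curl is available (wget -qO- would do the same job):
# Fetch the site's robots.txt and look for the disallowed forum path
curl -s http://www.doctorslounge.com/robots.txt | grep -i viewtopic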

Related

Wget and quoted URL

Currently I am struggling with mirroring a website using Wget.
Browsing the web, I came up with the following command to mirror a complete website:
wget --mirror --convert-links --adjust-extension --backup-converted --page-requisites -e robots=off http://www.example.com
As expected, after running the command there is a folder called www.example.com containing all downloaded files. However, some background images are missing. Digging through the files and logs I found that wget seems to have a problem with quoted image URLs.
The website uses the following CSS to include a background image:
<div ... style="background-image: url("/path/to/image") ;..." ... />
While collecting the page requisites, wget parses the quoted URL and tries to download the file,
http://www.example.com/"/path/to/image"
which obviously fails with an error 404:
--2018-01-08 18:04:00-- https://www.example.com/"/path/to/image"
Reusing existing connection to www.example.com:443.
HTTP request sent, awaiting response... 404 Not Found
2018-01-08 18:04:00 ERROR 404: Not Found
Unfortunately I cannot post the original domain for privacy reasons...
I already tried to find a solution on the web, but I did not manage to find the right keywords to search for, so as a last resort I must ask you for help.
Is there a way to tell Wget to ignore quotes inside URLs?
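As far as I know wget has no option for that, but here is a hedged sketch of one possible post-processing workaround, assuming the broken markup looks exactly like the snippet above and that the mirror folder is www.example.com (both are placeholders from the question):
# Pull the quoted url("...") values out of the mirrored files, drop the stray
# quotes, and fetch each image into the existing mirror directory.
grep -rhoE 'url\("[^"]+"\)' www.example.com \
  | sed -E 's/url\("([^"]+)"\)/\1/' \
  | sort -u \
  | while read -r path; do
      wget -x -nH -P www.example.com "https://www.example.com${path}"
    done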

Linking in localhost (LAMP)

I started using LAMP and Bootstrap about a month ago.
I developed a website that worked perfectly until I reinstalled LAMP.
Here is my progress:
0. reinstalled LAMP
1. moved my backed-up files to my "localhost" directory
2. ran "chmod 777 *" on each directory and file
3. when I open "localhost" in my browser (Firefox), "index.html" loads
4. when I click a link (say: index)
The browser responds:
http://localhost/undefined
Not Found
The requested URL /undefined was not found on this server.
Apache/2.4.7 (Ubuntu) Server at localhost Port 80
Is there any way to fix this? By the way, the linking works perfectly when I open file:///var/www/html/index.html.
The reason I want to use LAMP is to add .php files to handle forms.
Thanks
What happens when you hit http://localhost?
What exactly do you see? Have you tried http://localhost/html?
What exactly is your document root, as per the Apache conf?
You might need to check that you are placing your files in the root directory. It should be in the "htdocs" folder:
/opt/lampp/htdocs/
If all else fails, you can try using XAMPP, which is another free alternative to LAMP.
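Note that /opt/lampp/htdocs/ is the XAMPP layout; on a stock Ubuntu LAMP install (Apache via apt, which matches the Apache/2.4.7 (Ubuntu) error page above) the document root is normally /var/www/html. A quick sketch to confirm which directory your server is actually serving:
# Print the DocumentRoot configured for the enabled Apache sites (Ubuntu layout)
grep -Ri "DocumentRoot" /etc/apache2/sites-enabled/
# A default install typically prints: DocumentRoot /var/www/html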
I get this a lot: when your browser requests a URL that the server cannot map to a file, you get a Forbidden or Not Found error. The way to fix this is to make sure the link you click goes to an accessible URL; a path like /undefined usually means a script built the link from an undefined value. Check whether other links or scripts on the page are overwriting your link.
Finally, check whether you can access it from another browser.

Terminal - How to run the HTTP request 'PUT'

So, what I am trying to do is send the HTTP request 'PUT' from the Terminal in Linux. Not POST, not GET, 'PUT'.
I know that in the Terminal you can just type 'GET http://example.com/', but when I did 'PUT http://example.com' (and a bunch of other arguments after that...), the Terminal said that PUT is not a command.
Here's what I tried:
:~$ PUT http://example.com
PUT: command not found
Well, is there a substitute for the 'PUT' command, or some other way of sending that HTTP request from the terminal?
I don't want to use any external programs.... I don't want to download or install anything. Any other ways?
I would use curl to achieve this: curl -X PUT http://example.com
curl -X PUT -d arg=val -d arg2=val2 http://sssss.zzzz
will work. Or use Postman (www.getpostman.com) for HTTP requests if the terminal is not your main concern; otherwise, curl is always there.
You are getting
Terminal said that PUT is not a command.
because that text is not being redirected over a network connection (to something that understands HTTP). bash by itself has limited support for communicating over a network, as discussed in:
Tech Tip: TCP/IP Access Using bash
More on Using Bash's Built-in /dev/tcp File (TCP/IP)
Advanced Bash-Scripting Guide: Example 29-1. Using /dev/tcp for troubleshooting
Besides that, the HTTP specification says of PUT:
The PUT method requests that the enclosed entity be stored under the supplied Request-URI. If the Request-URI refers to an already existing resource, the enclosed entity SHOULD be considered as a modified version of the one residing on the origin server. If the Request-URI does not point to an existing resource, and that URI is capable of being defined as a new resource by the requesting user agent, the origin server can create the resource with that URI.
To clarify: if you are PUTting to an existing URI, you may be able to do this, and the command implicitly needs some data that reflects the modification.
The example in HTTP - Methods (TutorialsPoint) shows a PUT command used to store an HTML body on a URI. Your script has to redirect the data (as well as the initial request) onto the network connection.
You could do all of that using a here-document, or by redirecting a file, e.g. (adapting that example to show how it might be done; note that the /dev/tcp redirection is a bash feature, the Host header should name the server you actually connect to, the Content-Length must match the body you send, and a blank line has to separate the headers from the body):
cat >/dev/tcp/example.com/80 <<EOF
PUT /hello.htm HTTP/1.1
User-Agent: Mozilla/4.0 (compatible; MSIE5.01; Windows NT)
Host: example.com
Accept-Language: en-us
Connection: Keep-Alive
Content-type: text/html
Content-Length: 53

<html>
<body>
<h1>Hello, World!</h1>
</body>
</html>
EOF
But your script should also provide for reading the server's response.
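One way to do that (a sketch only, still using the placeholder example.com and a tiny body) is to open the connection bidirectionally on a spare file descriptor instead of piping cat into it:
# Open a two-way connection on fd 3 (the /dev/tcp path is a bash feature)
exec 3<>/dev/tcp/example.com/80
# Send a minimal PUT request; Content-Length matches the 5-byte body "hello"
printf 'PUT /hello.htm HTTP/1.1\r\nHost: example.com\r\nContent-Type: text/html\r\nContent-Length: 5\r\nConnection: close\r\n\r\nhello' >&3
# Print whatever the server sends back, then close the descriptor
cat <&3
exec 3>&-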
Using the -X flag with whatever HTTP verb you want:
curl -X PUT -H "Content-Type: multipart/form-data;" -d arg=val -d arg2=val2 localhost:8080
This example also uses the -d flag to provide arguments with your PUT request.

How to use wget on a page with authentication

I've been searching the internet about wget, and found many posts on how to use wget to log into a site that has a login page.
The site uses HTTPS, and the form fields the login page looks for are "userid" and "password". I've verified this by checking the Network tool in Chrome (hit F12).
I've been using the following posts as guidelines:
http://www.unix.com/shell-programming-and-scripting/131020-using-wget-curl-http-post-authentication.html
And
wget with authentication
What I've tried:
testlab:/lua_curl_tests# wget --save-cookies cookies.txt --post-data 'userid=myid&password=123123' https://10.123.11.22/cgi-bin/acd/myapp/controller/method1
wget: unrecognized option `--save-cookies'
BusyBox v1.21.1 (2013-07-05 16:54:31 UTC) multi-call binary.
And also
testlab/lua_curl_tests# wget
http://userid=myid:123123#10.123.11.22/cgi-bin/acd/myapp/controller/method1
Connecting to 10.123.11.22 (10.123.11.22:80) wget: server returned
error: HTTP/1.1 403 Forbidden
Can you tell me what I'm doing wrong? Ultimately, what I'd like to do is log in, post data, and then grab the resulting page.
I'm also currently looking at curl, to see if I should really be doing this with curl (lua-curl) instead.
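If you do end up with full curl (rather than the BusyBox applet), here is a hedged sketch of the same flow, using the field names, address, and path from the question (-k is only there in case the 10.x host uses a self-signed certificate):
# Log in over HTTPS and save whatever session cookie the server sets
curl -k -c cookies.txt -d 'userid=myid&password=123123' \
  https://10.123.11.22/cgi-bin/acd/myapp/controller/method1
# Reuse the saved cookie to post data and grab the resulting page
# (the data and URL below are placeholders for whichever page you want next)
curl -k -b cookies.txt -d 'somefield=somevalue' \
  https://10.123.11.22/cgi-bin/acd/myapp/controller/method1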

How can I test my browser ignoring Location headers?

I want to test a site with my Firefox ignoring Location: headers, like this example in PHP:
header('Location: another-page.php');
Is there a plugin available to do this, or any other method?
Would my best bet be surfing the site with Lynx? Does Lynx ignore them?
Thanks
You could try bringing up the pages with cURL.
It is a command line application that is invoked via:
curl http://url
cURL does not follow Location: headers by default.
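A small sketch of how that looks in practice (the URL is a placeholder for whichever page sends the Location: header):
# -i prints the response headers, so you can see the Location: header without it being followed
curl -i http://example.com/redirecting-page.php
# For comparison, -L is the flag that makes curl actually follow Location: headers
curl -L http://example.com/redirecting-page.php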
