Fetch failed with protocol status: exception(16), lastModified=0: Http code=406, url=https://www.randolphnj.org/ - nutch

I am trying to crawl the URL https://www.randolphnj.org/ with Nutch,
but the fetch fails with this error:
2020-09-22 15:03:08,395 INFO httpclient.Http: http.accept = text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
2020-09-22 15:03:08,395 INFO httpclient.Http: http.enable.cookie.header = true
2020-09-22 15:03:08,399 INFO conf.Configuration: found resource httpclient-auth.xml at file:/tmp/hadoop-unjar7802696204891280694/httpclient-auth.xml
Fetch failed with protocol status: exception(16), lastModified=0: Http code=406, url=https://www.randolphnj.org/
May I know what the reason is? Kindly help me solve it.

Most likely the server blocks requests whose "User-Agent" HTTP request header contains the string "Nutch". I was able to reproduce the behavior using wget: a request with "Nutch" in the agent string gets 406, while the default wget agent gets 200:
$> wget --header='User-Agent: mycrawler/Nutch-1.17' https://www.randolphnj.org/
--2020-09-25 10:55:42-- https://www.randolphnj.org/
Resolving www.randolphnj.org (www.randolphnj.org)... 63.247.128.112
Connecting to www.randolphnj.org (www.randolphnj.org)|63.247.128.112|:443... connected.
HTTP request sent, awaiting response... 406 Not Acceptable
2020-09-25 10:55:43 ERROR 406: Not Acceptable.
$> wget https://www.randolphnj.org/
--2020-09-25 11:02:25-- https://www.randolphnj.org/
Resolving www.randolphnj.org (www.randolphnj.org)... 63.247.128.112
Connecting to www.randolphnj.org (www.randolphnj.org)|63.247.128.112|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: ‘index.html’
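To narrow down which agent strings the server rejects, the comparison above can be scripted. This is only a sketch: probe_agents is a hypothetical helper name, it assumes curl is installed, and the agent strings in the usage line are examples. In Nutch itself the agent string is, as far as I know, assembled from http.agent.name and http.agent.version in nutch-site.xml, so check those properties if the probe confirms the block.

```shell
# probe_agents URL AGENT... : print the HTTP status the server returns
# for each User-Agent string (hypothetical helper, assumes curl is available)
probe_agents() {
    url=$1
    shift
    for agent in "$@"; do
        # -s: silent, -o /dev/null: discard the body,
        # -A: set the User-Agent, -w: print only the status code
        status=$(curl -s -o /dev/null -A "$agent" -w '%{http_code}' "$url")
        printf '%s\t%s\n' "$status" "$agent"
    done
}
```

Example call: probe_agents https://www.randolphnj.org/ 'mycrawler/Nutch-1.17' 'mycrawler/1.0'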


Why is the image that I am downloading from Google Drive not detected as an image?

!wget https://drive.google.com/file/d/1BeeBFUY6BS-H66NMcx3hF1rwE3Yi3p3z/view?usp=sharing
!mv view?usp=sharing HorizonZero.png
# Downloading lena.bmp
!wget https://drive.google.com/file/d/1sOxv4LHEuGNnZJEg7gl2ZiehdKU4ojjF/view?usp=sharing
!mv view?usp=sharing lena.bmp
The image link for horizon.png is actually
https://drive.google.com/file/d/1gGltsV4k7z3akTqB726FWzi5tUnNSCqf/view?usp=sharing
but in the !mv step I rename the downloaded file, which wget saves as view?usp=sharing.
The output should look like this:
--2020-08-01 19:30:57-- https://drive.google.com/uc?id=1Djfm4PqE7Su4WqEdZKiGL-8HtrbVBuMm
Resolving drive.google.com (drive.google.com)... 74.125.203.139, 74.125.203.113, 74.125.203.100, ...
Connecting to drive.google.com (drive.google.com)|74.125.203.139|:443... connected.
HTTP request sent, awaiting response... 302 Moved Temporarily
Location: https://doc-08-40-docs.googleusercontent.com/docs/securesc/ha0ro937gcuc7l7deffksulhg5h7mbp1/91pcoqqtp058dfpuoqspvbr6s66n7q9t/1596310200000/05356688754188258246/*/1Djfm4PqE7Su4WqEdZKiGL-8HtrbVBuMm [following]
Warning: wildcards not supported in HTTP.
--2020-08-01 19:30:57-- https://doc-08-40-docs.googleusercontent.com/docs/securesc/ha0ro937gcuc7l7deffksulhg5h7mbp1/91pcoqqtp058dfpuoqspvbr6s66n7q9t/1596310200000/05356688754188258246/*/1Djfm4PqE7Su4WqEdZKiGL-8HtrbVBuMm
Resolving doc-08-40-docs.googleusercontent.com (doc-08-40-docs.googleusercontent.com)... 74.125.203.132, 2404:6800:4008:c03::84
Connecting to doc-08-40-docs.googleusercontent.com (doc-08-40-docs.googleusercontent.com)|74.125.203.132|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 111636 (109K) [image/png]
Saving to: ‘uc?id=1Djfm4PqE7Su4WqEdZKiGL-8HtrbVBuMm’
uc?id=1Djfm4PqE7Su4 100%[===================>] 109.02K --.-KB/s in 0.002s
2020-08-01 19:30:58 (57.7 MB/s) - ‘uc?id=1Djfm4PqE7Su4WqEdZKiGL-8HtrbVBuMm’ saved [111636/111636]
--2020-08-01 19:31:07-- https://drive.google.com/uc?id=19xZhsjs_r0tLwtu_Wl5DB5rG26dhw069
Resolving drive.google.com (drive.google.com)... 74.125.204.139, 74.125.204.113, 74.125.204.100, ...
Connecting to drive.google.com (drive.google.com)|74.125.204.139|:443... connected.
HTTP request sent, awaiting response... 302 Moved Temporarily
Location: https://doc-10-40-docs.googleusercontent.com/docs/securesc/ha0ro937gcuc7l7deffksulhg5h7mbp1/frtrtg5v8nd6vf0ff6cs1dh3jvqv28ui/1596310200000/05356688754188258246/*/19xZhsjs_r0tLwtu_Wl5DB5rG26dhw069 [following]
Warning: wildcards not supported in HTTP.
--2020-08-01 19:31:07-- https://doc-10-40-docs.googleusercontent.com/docs/securesc/ha0ro937gcuc7l7deffksulhg5h7mbp1/frtrtg5v8nd6vf0ff6cs1dh3jvqv28ui/1596310200000/05356688754188258246/*/19xZhsjs_r0tLwtu_Wl5DB5rG26dhw069
Resolving doc-10-40-docs.googleusercontent.com (doc-10-40-docs.googleusercontent.com)... 74.125.203.132, 2404:6800:4008:c03::84
Connecting to doc-10-40-docs.googleusercontent.com (doc-10-40-docs.googleusercontent.com)|74.125.203.132|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 263222 (257K) [image/bmp]    <-- it is recognized as an image
Saving to: ‘uc?id=19xZhsjs_r0tLwtu_Wl5DB5rG26dhw069’
uc?id=19xZhsjs_r0tL 100%[===================>] 257.05K --.-KB/s in 0.003s
2020-08-01 19:31:08 (79.6 MB/s) - ‘uc?id=19xZhsjs_r0tLwtu_Wl5DB5rG26dhw069’ saved [263222/263222]
But my output looks like this. Please help:
--2021-06-01 07:17:27-- https://drive.google.com/file/d/1BeeBFUY6BS-H66NMcx3hF1rwE3Yi3p3z/view?usp=sharing
Resolving drive.google.com (drive.google.com)... 74.125.31.138, 74.125.31.102, 74.125.31.113, ...
Connecting to drive.google.com (drive.google.com)|74.125.31.138|:443... connected.
HTTP request sent, awaiting response... 302 Moved Temporarily
Location: https://accounts.google.com/ServiceLogin?service=wise&passive=1209600&continue=https://drive.google.com/file/d/1BeeBFUY6BS-H66NMcx3hF1rwE3Yi3p3z/view?usp%3Dsharing&followup=https://drive.google.com/file/d/1BeeBFUY6BS-H66NMcx3hF1rwE3Yi3p3z/view?usp%3Dsharing [following]
--2021-06-01 07:17:27-- https://accounts.google.com/ServiceLogin?service=wise&passive=1209600&continue=https://drive.google.com/file/d/1BeeBFUY6BS-H66NMcx3hF1rwE3Yi3p3z/view?usp%3Dsharing&followup=https://drive.google.com/file/d/1BeeBFUY6BS-H66NMcx3hF1rwE3Yi3p3z/view?usp%3Dsharing
Resolving accounts.google.com (accounts.google.com)... 108.177.12.84, 2607:f8b0:400c:c0b::54
Connecting to accounts.google.com (accounts.google.com)|108.177.12.84|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]    <-- not detected as an image
Saving to: ‘view?usp=sharing’
view?usp=sharing [ <=> ] 65.35K --.-KB/s in 0.001s
2021-06-01 07:17:27 (104 MB/s) - ‘view?usp=sharing’ saved [66920]
--2021-06-01 07:17:27-- https://drive.google.com/file/d/1sOxv4LHEuGNnZJEg7gl2ZiehdKU4ojjF/view?usp=sharing
Resolving drive.google.com (drive.google.com)... 173.194.217.101, 173.194.217.113, 173.194.217.139, ...
Connecting to drive.google.com (drive.google.com)|173.194.217.101|:443... connected.
HTTP request sent, awaiting response... 302 Moved Temporarily
Location: https://accounts.google.com/ServiceLogin?service=wise&passive=1209600&continue=https://drive.google.com/file/d/1sOxv4LHEuGNnZJEg7gl2ZiehdKU4ojjF/view?usp%3Dsharing&followup=https://drive.google.com/file/d/1sOxv4LHEuGNnZJEg7gl2ZiehdKU4ojjF/view?usp%3Dsharing [following]
--2021-06-01 07:17:27-- https://accounts.google.com/ServiceLogin?service=wise&passive=1209600&continue=https://drive.google.com/file/d/1sOxv4LHEuGNnZJEg7gl2ZiehdKU4ojjF/view?usp%3Dsharing&followup=https://drive.google.com/file/d/1sOxv4LHEuGNnZJEg7gl2ZiehdKU4ojjF/view?usp%3Dsharing
Resolving accounts.google.com (accounts.google.com)... 74.125.31.84, 2607:f8b0:400c:c03::54
Connecting to accounts.google.com (accounts.google.com)|74.125.31.84|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]    <-- not detected as an image
Saving to: ‘view?usp=sharing’
view?usp=sharing [ <=> ] 65.40K --.-KB/s in 0.002s
2021-06-01 07:17:27 (26.4 MB/s) - ‘view?usp=sharing’ saved [66972]
The problem is that the image is not recognized in Google Colaboratory: the download comes back as [text/html], but it should be an image type (png, jpg or bmp). Please help me get past this problem.
The following command should address your need:
!wget --load-cookies /tmp/cookies.txt "https://docs.google.com/uc?export=download&confirm=$(wget --quiet --save-cookies /tmp/cookies.txt --keep-session-cookies --no-check-certificate 'https://docs.google.com/uc?export=download&id=1BeeBFUY6BS-H66NMcx3hF1rwE3Yi3p3z' -O- | sed -rn 's/.*confirm=([0-9A-Za-z_]+).*/\1\n/p')&id=1BeeBFUY6BS-H66NMcx3hF1rwE3Yi3p3z" -O HorizonZero.png && rm -rf /tmp/cookies.txt
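In case you are wondering what it does: the inner wget requests the download page for the Google Drive file ID (which I have taken from your question), saves its cookies to /tmp/cookies.txt via --save-cookies (with --keep-session-cookies so that session cookies are written too), and sed extracts the confirm token from the response. The outer wget then reuses those cookies via --load-cookies, passes the token as confirm= to auto-confirm the download (useful if your file is big), and writes the result to HorizonZero.png with -O. --no-check-certificate skips the certificate check, and cookies.txt is removed once the download is over. Your original commands fetched the /file/d/<id>/view page, which your log shows being redirected to a Google login page, so wget saved HTML instead of the image.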
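If you have several share links, the file ID can be pulled out of each one with sed and turned into the uc?export=download form used in the command above. This is a sketch only: drive_file_id and drive_download_url are hypothetical helper names, and it assumes the link has the .../file/d/<id>/view shape shown in your question. Whether Google Drive serves a given file this way is not guaranteed.

```shell
# drive_file_id SHARE_URL : print the file ID from a ".../file/d/<id>/view" link
# (hypothetical helper; prints nothing if the URL has no /file/d/ part)
drive_file_id() {
    printf '%s\n' "$1" | sed -n 's#.*/file/d/\([^/?]*\).*#\1#p'
}

# drive_download_url SHARE_URL : print the direct-download URL form
# that the wget command above uses; returns 1 if no ID was found
drive_download_url() {
    id=$(drive_file_id "$1")
    [ -n "$id" ] || return 1
    printf 'https://docs.google.com/uc?export=download&id=%s\n' "$id"
}
```

Example call: drive_download_url 'https://drive.google.com/file/d/1BeeBFUY6BS-H66NMcx3hF1rwE3Yi3p3z/view?usp=sharing'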

wget: server returned error: HTTP/1.1 202 Accepted

Running wget from BusyBox v1.23.1 gives this error:
wget: server returned error: HTTP/1.1 202 Accepted
wget call :
wget http://182.72.194.130:7777/device_mgr/device-mgmt/app/cnc/sno/SCNC12J001/updates?cur_fw_ver=1.1(0)7&cur_config_ver=1.0
But when I tried the same request on Ubuntu, it worked. How can this be resolved?
HTTP status code 202 is defined as follows:
The HyperText Transfer Protocol (HTTP) 202 Accepted response status code indicates that the request has been received but not yet acted upon.
In other words, it can mean "got your request okay, but the resource is not yet ready", e.g. a tape archive needs to be mounted. It is best to try again a while later. When you repeated your request on Ubuntu, the resource had probably been mounted in the meantime.
Wget has some retry parameters you can play with to delay a follow-up request; see https://superuser.com/questions/493640/how-to-retry-connections-with-wget/689340#answer-689340
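The "try again later" advice can also be done by hand with a small polling loop. This is a sketch under assumptions: http_status and poll_until_ready are hypothetical helper names, and http_status relies on curl, which may not be present on a BusyBox-only system, so substitute whatever client you have that can report the status code.

```shell
# http_status URL : print the HTTP status code for URL
# (hypothetical helper; assumes curl is installed)
http_status() {
    curl -s -o /dev/null -w '%{http_code}' "$1"
}

# poll_until_ready URL [MAX_TRIES] [DELAY_SECONDS]
# Re-request URL until the server answers with something other than 202.
# Prints the last status seen; returns 1 if it was still 202 at the end.
poll_until_ready() {
    url=$1
    max_tries=${2:-5}
    delay=${3:-10}
    attempt=1
    while [ "$attempt" -le "$max_tries" ]; do
        status=$(http_status "$url")
        if [ "$status" != "202" ]; then
            echo "$status"
            return 0
        fi
        sleep "$delay"
        attempt=$((attempt + 1))
    done
    echo "202"
    return 1
}
```

Example call: poll_until_ready "$url" 5 10, with the URL quoted so the shell does not interpret the & and parentheses in it.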

Remove request from nginx error logs

How can we stop the request URL from being logged in nginx error logs? For example, an entry looks something like:
2015/09/01 15:26:03 [error] 30547#0: *208725 upstream prematurely closed connection while reading response header from upstream, client: 123.123.50.44, server: test.example.com, request: "GET /v1.3/status.json?...."
Is it possible to drop the request field from the log, since it can contain PII, so that the entry looks something like:
2015/09/01 15:26:03 [error] 30547#0: *208725 upstream prematurely closed connection while reading response header from upstream, client: 123.123.50.44, server: test.example.com
I was able to configure access logs but couldn't find a way to customize error logs.
Edit:
Is there a way to stop logging only upstream errors?

wget not downloading from authentic https url vaultpress

I'm using VaultPress to take my WordPress blog's backup
https://dashboard.vaultpress.com/
After clicking the download backup button, this site sends me a link from where I can download. When I click on this link, it starts downloading my backup in the browser, and that's perfect. But I'm trying to download this in my Ubuntu system using wget or curl but no success till now. Here is what the download URL looks like:
https://dashboard.vaultpress.com/12345/restore/?step=4&job=12345678&check=.
eric#eric:~# wget https://dashboard.vaultpress.com/12345/restore/?step=4&job=12345678&check=<somehashedvalue>
[5] 2229
[6] 2230
[6] Done job=12345678
eric#eric:~# --2015-02-08 02:25:07-- https://dashboard.vaultpress.com/12345/restore/?step=4
Resolving dashboard.vaultpress.com (dashboard.vaultpress.com)... 192.0.96.249, 192.0.96.250
Connecting to dashboard.vaultpress.com (dashboard.vaultpress.com)|192.0.96.249|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: / [following]
--2015-02-08 02:25:09-- https://dashboard.vaultpress.com/
Reusing existing connection to dashboard.vaultpress.com:443.
HTTP request sent, awaiting response... 302 Found
Location: /account/login/ [following]
--2015-02-08 02:25:09-- https://dashboard.vaultpress.com/account/login/
Reusing existing connection to dashboard.vaultpress.com:443.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: ‘index.html?step=4’
[ <=> ] 7,709 --.-K/s in 0s
2015-02-08 02:25:09 (20.9 MB/s) - ‘index.html?step=4’ saved [7709]
PS: The file size is almost 1 GB.
Then I used user/pass:
eric#eric:~# wget --user <myusername> --password <mypassword> https://aboveurl
I even used --ask-password:
eric#eric:~# wget --user <myusername> --ask-password https://aboveurl
But in this case, instead of asking password it completes the action and then asks for the password in another shell (I don't know the exact term), something like this:
eric#eric:~# wget --user <myusername> --ask-password https://dashboard.vaultpress.com/12345/restore/?step=4&job=12345678&check=<hashedvalue>
[1] 1979
[2] 1980
eric#eric:~# Password for user ‘<myusername>’: <mypassword-here>
<mypassword>: command not found
And then finally, I gave curl a try:
eric#eric:~# curl -u <myusername>:<mypassword> https://dashboard.vaultpress.com/12345/restore/?step=4&job=12345678&check=<hashedvalue>
[5] 2010
[6] 2011
eric#eric:~#
I don't know what's happening. What are those [5] 2010, [6] 2011 or [5] 2229 lines?
This solution is also not working:
wget with authentication
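The unquoted ampersands in your URL make the shell split the command at each & and run the pieces as background jobs. The number in square brackets is the job number, and the number after it is the PID. That is also why wget only requested ...?step=4, and why the password you typed was interpreted as a command.
Write the URL within double quotes and try again: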
wget "https://dashboard.vaultpress.com/12345/restore/?step=4&job=12345678&check=<somehashedvalue>"
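You can check what the shell actually passes to a command without touching the network. In this sketch show_args is a hypothetical helper that just prints its arguments, and the URL is the placeholder one from the question. Single quotes are used here because they also keep characters such as $ literal, which double quotes do not.

```shell
# show_args ARG... : print every argument received, one per line, in brackets
# (hypothetical helper, used only to inspect how the shell splits a command)
show_args() {
    for arg in "$@"; do
        printf '[%s]\n' "$arg"
    done
}

# Quoted, the &, ? and < > characters stay part of one single argument.
url='https://dashboard.vaultpress.com/12345/restore/?step=4&job=12345678&check=<somehashedvalue>'
show_args wget "$url"
```

With the quotes in place, show_args reports exactly two arguments: wget and the complete URL.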

Why can't I get this page on Linux with wget/telnet?

The URL www.jinfuwu.com can be accessed from a Windows browser and from Windows telnet,
but on my Ubuntu server I can't get the page:
telnet (Ubuntu):
root#ubuntu:~# telnet www.jinfuwu.com 80
Trying 121.199.111.176...
Connected to www.jinfuwu.com.
Escape character is '^]'.
GET / HTTP/1.1
Host: www.jinfuwu.com
HTTP/1.1 200 OK
Content-Type: text/html
Last-Modified: Sun, 05 Dec 2010 01:34:33 GMT
Accept-Ranges: bytes
ETag: "f671fd911c94cb1:0"
Server: Microsoft-IIS/7.5
X-Powered-By: ASP.NET
X-UA-Compatible: IE=EmulateIE7
Date: Sun, 05 Dec 2010 10:03:21 GMT
Content-Length: 1214
Connection closed by foreign host.
wget (ubuntu):
root#ubuntu:~# wget http://www.jinfuwu.com
--18:10:29-- http://www.jinfuwu.com/
=> `index.html.2'
Resolving www.jinfuwu.com... 121.199.111.176
Connecting to www.jinfuwu.com|121.199.111.176|:80... connected.
HTTP request sent, awaiting response...
Read error (Connection reset by peer) in headers.
Retrying.
....
But on Windows, using the telnet command, I can get the page.
telnet (Windows 7):
Run:
telnet www.jinfuwu.com 80
Paste:
GET / HTTP/1.1
Host: www.jinfuwu.com
and press Enter twice; I can then see the page's HTML code.
Google can access this site too, as this search shows:
site:jinfuwu.com
Can you tell me why?
BTW: www.joytg.com has the same problem.
Thanks a lot :)
I did some further digging for you and found that the root cause is misconfigured routers. You can read about it all here.
The workaround that article mentions is:
echo 0 > /proc/sys/net/ipv4/tcp_default_win_scale
However, this file has since changed, and on newer setups you need to run this instead:
echo 0 > /proc/sys/net/ipv4/tcp_window_scaling
You will need to be root to run that, though.
$ wget http://www.jinfuwu.com
--2010-12-05 12:58:39-- http://www.jinfuwu.com/
Resolving www.jinfuwu.com... 121.199.111.176
Connecting to www.jinfuwu.com|121.199.111.176|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 12145 (12K) [text/html]
Saving to: `index.html'
100%[====================================================>] 12,145 5.19K/s in 2.3s
2010-12-05 12:58:43 (5.19 KB/s) - `index.html' saved [12145/12145]
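Since writing 0 to tcp_window_scaling turns TCP window scaling off for the whole machine, it is worth checking the current value before and after applying the workaround. A minimal sketch, assuming a Linux system that exposes the setting under /proc; window_scaling_state is a hypothetical helper name, and the optional path argument exists only so the function can be pointed at another file.

```shell
# window_scaling_state [FILE] : report whether TCP window scaling is on
# (hypothetical helper; FILE defaults to the Linux /proc entry used above)
window_scaling_state() {
    file=${1:-/proc/sys/net/ipv4/tcp_window_scaling}
    if [ ! -r "$file" ]; then
        echo "unknown"
        return 1
    fi
    case "$(cat "$file")" in
        0) echo "disabled" ;;
        1) echo "enabled" ;;
        *) echo "unknown" ;;
    esac
}
```

Reading the value does not need root; only changing it does.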
FWIW, I can get the page just fine using wget or curl from MacPorts.
