HTTP requests in wget taking most of the time - linux

I'm retrieving a large amount of data via wget, with the following command:
wget --save-cookies ~/.urs_cookies --load-cookies ~/.urs_cookies --keep-session-cookies --content-disposition -i links.dat
My problem is that links.dat contains thousands of links. The files are relatively small (~100 kB), so each one takes about 0.2 s to download but around 5 s waiting for the HTTP response. It ends up taking 14 h to download all my data, with most of the time spent waiting for the requests.
URL transformed to HTTPS due to an HSTS policy
--2017-02-15 18:01:37-- https://goldsmr4.gesdisc.eosdis.nasa.gov/daac-bin/OTF/HTTP_services.cgi?FILENAME=%2Fdata%2FMERRA2%2FM2I1NXASM.5.12.4%2F1980%2F01%2FMERRA2_100.inst1_2d_asm_Nx.19800102.nc4&FORMAT=bmM0Lw&BBOX=43%2C1.5%2C45%2C3.5&LABEL=MERRA2_100.inst1_2d_asm_Nx.19800102.SUB.nc4&FLAGS=&SHORTNAME=M2I1NXASM&SERVICE=SUBSET_MERRA2&LAYERS=&VERSION=1.02&VARIABLES=t10m%2Ct2m%2Cu50m%2Cv50m
Connecting to goldsmr4.gesdisc.eosdis.nasa.gov (goldsmr4.gesdisc.eosdis.nasa.gov)|198.118.197.95|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 50223 (49K) [application/octet-stream]
Saving to: ‘MERRA2_100.inst1_2d_asm_Nx.19800102.SUB.nc4.1’
This might be a noob question, but it seems really counterproductive that it works this way. I have very little knowledge of what is happening behind the scenes, but I just wanted to be sure I'm not doing anything wrong and that the process can indeed be faster.
If it helps, I'm downloading MERRA-2 data for specific nodes.
Thanks!

Wget will re-use an existing connection for multiple requests to the same server, potentially saving you the time required to establish and tear down the socket.
You can do this by providing multiple URLs on the command line. For example, to download 100 URLs per batch:
#!/usr/bin/env bash
# Shared wget options; the cookie options keep the session alive across batches.
wget_opts=(
  --save-cookies ~/.urs_cookies
  --load-cookies ~/.urs_cookies
  --keep-session-cookies
  --content-disposition
)
manyurls=()
while read -r url; do
  manyurls+=( "$url" )
  # Once 100 URLs have accumulated, hand them to a single wget invocation.
  if [ "${#manyurls[@]}" -eq 100 ]; then
    wget "${wget_opts[@]}" "${manyurls[@]}"
    manyurls=()
  fi
done < links.dat
# Download whatever is left over (fewer than 100 URLs).
if [ "${#manyurls[@]}" -gt 0 ]; then
  wget "${wget_opts[@]}" "${manyurls[@]}"
fi
Note that I haven't tested this. It may work. If it doesn't, tell me your error and I'll try to debug.
So ... that's "connection re-use" or "keepalive". The other thing that would speed up your download is HTTP Pipelining, which basically allows a second request to be sent before the first response has been received. wget does not support this, and curl supports it in its library, but not the command-line tool.
I don't have a ready-made tool to suggest that supports HTTP pipelining. (Besides which, tool recommendations are off-topic.) You can see how pipelining works in this SO answer. If you feel like writing something in a language of your choice that supports libcurl, I'm sure any difficulties you come across would make for an interesting follow-up Stack Overflow question.

Related

Get files from a page that is redirected using wget

I am trying to download the contents of http://someurl.com/123 using wget. The problem is that http://someurl.com/123 is redirected to http://someurl.com/456.
I already tried using --max-redirect 0 and this is my result:
wget --max-redirect 0 http://someurl.com/123
...
HTTP request sent, awaiting response... 301 Moved Permanently
Location: http://someurl.com/456 [following]
0 redirections exceeded.
Is there a way to get the actual contents of http://someurl.com/123?
By default, wget can follow up to 20 redirections. But since you set --max-redirect to 0, it returned 0 redirections exceeded.
Try this command:
wget --content-disposition http://someurl.com/123
--max-redirect=number
Specifies the maximum number of redirections to follow for a resource. The default is 20, which is usually far more than necessary. However, on those occasions where you want to allow more (or fewer), this is the option to use.
--content-disposition
If this is set to on, experimental (not fully-functional) support for Content-Disposition headers is enabled. This can currently result in extra round-trips to the server for a HEAD request, and is known to suffer from a few bugs, which is why it is not currently enabled by default.

youtube api v3 search through bash and curl

I'm having a problem with the YouTube API. I am trying to make a bash application that will make watching YouTube videos easy on the command line in Linux. I'm trying to fetch some video search results through cURL, but it returns an error: curl: (16) HTTP/2 stream 1 was not closed cleanly: error_code = 1
The cURL command that I use is:
curl "https://ww.googleapis.com/youtube/v3/search" -d part="snippet" -d q="kde" -d key="~~~~~~~~~~~~~~~~"
And of course I add my YouTube data API key where the ~~~~~~~~ are.
What am I doing wrong?
How can I make it work and return the search attributes?
I can see two things that are incorrect in your request:
First, you mistyped "www" and said "ww". That is not a valid URL.
Then, curl's "-d" options are for POSTing only, not GETting, at least not by default. You have two options:
Add the -G switch to curl, which makes curl reinterpret the -d options as query parameters:
curl -G https://www.googleapis.com/youtube/v3/search -d part="snippet" -d q="kde" -d key="xxxx"
Rework your URL into a typical GET request:
curl "https://www.googleapis.com/youtube/v3/search?part=snippet&q=kde&key=XX"
As a tip, using bash to interpret the resulting JSON might not be the best way to go. You might want to look into using Python, JavaScript, etc. to run your query and interpret the resulting JSON.
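That said, if you want to stay on the command line, here is a minimal sketch that prints only the video titles. It assumes the jq JSON processor is installed, YOUR_API_KEY is a placeholder for a real key, and the .items[].snippet.title path follows the usual YouTube Data API v3 search response layout:
# Run the search and extract just the video titles from the JSON response.
curl -sG "https://www.googleapis.com/youtube/v3/search" \
  -d part="snippet" -d q="kde" -d key="YOUR_API_KEY" \
  | jq -r '.items[].snippet.title'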

wget result from site with cookies

I am trying to enter a number on the website below and get the result with wget.
[http://]
I've tried wget --cookies=on --save-cookies=site.txt URL to save the session ID/cookie,
then went on with wget --cookies=on --keep-session-cookies --load-cookies=site.txt URL, but with no luck.
I also tried monitoring the POST data sent with Wireshark and replicating it with wget --post-data, --referer, etc., but also without luck.
Does anyone have an easy way of doing this? I'm all ears! :)
All help is much appreciated.
Thank you.
The trick is to send the second request to http://202.91.22.186/dms-scw-dtac_th/page/?wicket:interface=:0:2:::0:
With my Xidel it seems to work like this (I do not have a valid number to test with):
xidel "http://202.91.22.186/dms-scw-dtac_th/page/" -f '{"post": "number=0812345678", "url": resolve-uri("?wicket:interface=:0:2:::0:")}' -e "#id5"
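If you would rather stay with wget, a rough equivalent of the same two-step flow might look like the sketch below. It is untested; the number field name and the wicket callback URL are taken from the answer above, and the real form may require additional hidden fields:
# Step 1: load the form page once to obtain the session cookie.
wget --save-cookies site.txt --keep-session-cookies -O /dev/null \
  'http://202.91.22.186/dms-scw-dtac_th/page/'
# Step 2: POST the number to the wicket callback URL using the same session.
wget --load-cookies site.txt --post-data 'number=0812345678' -O result.html \
  'http://202.91.22.186/dms-scw-dtac_th/page/?wicket:interface=:0:2:::0:'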

Wget in bash script with 406 Not Acceptable Error

On my Centos 6.2 I have this bash script:
[le_me]$ cat get.nb
#! /bin/bash
/usr/bin/wget -O /var/www/html/leFile.xml http://www.leSite.com/leFeed.xml
[le_me]$ source getFeeds.nb
: command not found
--2012-06-22 12:46:18-- http://www.leSite.com/leFeed.xml%0D
Resolving www.leSite.com... 1.2.3.4
Connecting to www.leSite.com|1.2.3.4|:80... connected.
HTTP request sent, awaiting response... 406 Not Acceptable
2012-06-22 12:46:18 ERROR 406: Not Acceptable.
The strange thing for me is that when I run this command
/usr/bin/wget -O /var/www/html/leFile.xml http://www.leSite.com/leFeed.xml
in the console, everything works fine and the file is downloaded without a problem.
I did google it and noticed this %0D, which is supposed to be a carriage return character, and I tried putting another space after the link like so: http://www.leSite.com/leFeed.xml[spaceChar]
and I got the file downloaded, but I'm concerned about the command not found output and about fetching that carriage return at the end (which, of course, I know is because of the space, but at least now I downloaded the file I originally wanted):
[le_me]$ source get.nb
: command not found
--2012-06-22 13:05:26-- http://www.leSite.com/leFeed.xml
Resolving www.leSite.com... 2.17.249.51
Connecting to www.leSite.com|2.17.249.51|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 35671435 (34M) [application/atom+xml]
Saving to: “/var/www/html/leFile.xml”
100%[=================================>] 35,671,435 37.2M/s in 0.9s
2012-06-22 13:05:27 (37.2 MB/s) - “/var/www/html/leFile.xml” saved [35671435/35671435]
--2012-06-22 13:05:27-- http://%0D/
Resolving \r... failed: Name or service not known.
wget: unable to resolve host address “\r”
FINISHED --2012-06-22 13:05:27--
Downloaded: 1 files, 34M in 0.9s (37.2 MB/s)
Can anyone shed some light on this please?
Your script file apparently has DOS-style line endings, and the carriage return character is interpreted as just another character on the command line. If you have no space after the URL, it is interpreted as the last character of the URL; if you have a space, it is interpreted as a separate one-character parameter.
You should save your script file with UNIX-style line endings. How you do that depends on your editor.
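If it is easier than reconfiguring your editor, either of the following strips the trailing carriage returns in place (assuming the dos2unix utility or GNU sed is available):
# Convert DOS line endings to UNIX line endings in place.
dos2unix get.nb
# Or, with GNU sed, delete the carriage return at the end of each line.
sed -i 's/\r$//' get.nb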
I'd suggest quoting the URL.
/usr/bin/wget -O /var/www/html/leFile.xml 'http://www.leSite.com/leFeed.xml'
The
: command not found
error suggests there is a problem in the http:// part. As a rule of thumb, I always quote URLs on the command line; they often contain bash/shell special characters.
In my case this command works without the 406 problem (with a real HTTP address). You should copy/paste the exact address; it probably contains something that causes the error.
If none of the other answers work, try curl as an alternative to wget:
curl -o /var/www/html/leFile.xml 'http://www.leSite.com/leFeed.xml'

wget: don't follow redirects

How do I prevent wget from following redirects?
--max-redirect 0
I haven't tried this; it will either allow none or allow an unlimited number.
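As a quick way to check (a sketch I haven't run here), combining --max-redirect with wget's standard --server-response and --spider options prints the first response's headers, including the Location it points to, without following the redirect or saving anything:
# Show the headers for the first hop only; the Location header reveals the target.
wget --max-redirect=0 --server-response --spider 'http://someurl.com/123'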
Use curl without -L instead of wget. Omitting that option when using curl prevents the redirect from being followed.
If you use curl -I <URL> then you'll get the headers instead of the redirect HTML.
If you use curl -IL <URL> then you'll get the headers for the URL, plus those for the URL you're redirected to.
Some versions of wget have a --max-redirect option: See here
wget follows up to 20 redirects by default. However, it does not span hosts. If you have asked wget to download example.com, it will not touch any resources at www.example.com. wget will detect this as a request to span to another host and decide against it.
In short, you should probably be executing:
wget --mirror www.example.com
Rather than
wget --mirror example.com
Now let's say the owner of www.example.com has several subdomains at example.com and we are interested in all of them. How to proceed?
Try this:
wget --mirror --domains=example.com example.com
wget will now visit all subdomains of example.com, including m.example.com and www.example.com.
In general, it is not a good idea to depend on a specific number of redirects.
For example, in order to download IntelliJ IDEA, the URL that is promised to always resolve to the latest version of Community Edition for Linux is something like https://download.jetbrains.com/product?code=IIC&latest&distribution=linux, but if you visit that URL nowadays, you are redirected twice before you reach the actual downloadable file. In the future you might be redirected three times, or not at all.
The way to solve this problem is with the HTTP HEAD verb. Here is how I solved it in the case of IntelliJ IDEA:
# This is the starting URL.
URL="https://download.jetbrains.com/product?code=IIC&latest&distribution=linux"
echo "URL: $URL"
# Issue HEAD requests until the actual target is found.
# The result contains the target location, among some irrelevant stuff.
LOC=$(wget --no-verbose --method=HEAD --output-file - "$URL")
echo "LOC: $LOC"
# Extract the URL from the result, stripping the irrelevant stuff.
URL=$(cut --delimiter=' ' --fields=4 <<< "$LOC")
echo "URL: $URL"
# Optional: download the actual file.
wget "$URL"
