wget: obtaining files matching regex - linux

According to the man page of wget, --accept-regex is the argument to use when I need to selectively transfer files whose names match a certain regular expression. However, I am not sure how to use --accept-regex.
Assuming I want to obtain the files diffs-000107.tar.gz, diffs-000114.tar.gz, diffs-000121.tar.gz, and diffs-000128.tar.gz from the IMDb data directory ftp://ftp.fu-berlin.de/pub/misc/movies/database/diffs/, "diffs\-0001[0-9]{2}\.tar\.gz" seems to be a reasonable regex to describe the file names.
However, when executing the following wget command
wget -r --accept-regex='diffs\-0001[0-9]{2}\.tar\.gz' ftp://ftp.fu-berlin.de/pub/misc/movies/database/diffs/
wget indiscriminately acquires all files in the ftp://ftp.fu-berlin.de/pub/misc/movies/database/diffs/ directory.
I wonder if anyone could tell me what I have possibly done wrong?

Be careful: --accept-regex matches against the complete URL, but our target here is specific file names, so we will use -A instead.
For example,
wget -r -np -nH -A "IMG[012][0-9].jpg" http://x.com/y/z/
will download all the files from IMG00.jpg to IMG29.jpg from the URL.
Note that a matching pattern contains shell-like wildcards, e.g. ‘books’ or ‘zelazny196[0-9]*’.
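Adapted to the files in the question, an untested sketch with -A could look like this (the bracket expressions are plain shell wildcards, so no {2} quantifier is needed):
wget -r -np -nH -A "diffs-0001[0-9][0-9].tar.gz" ftp://ftp.fu-berlin.de/pub/misc/movies/database/diffs/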
References:
wget manual: https://www.gnu.org/software/wget/manual/wget.html
regex: https://regexone.com/

I'm reading in the wget man page:
--accept-regex urlregex
--reject-regex urlregex
Specify a regular expression to accept or reject the complete URL.
and noticing that it mentions the complete URL (e.g. something like ftp://ftp.fu-berlin.de/pub/misc/movies/database/diffs/diffs-000121.tar.gz)
So I suggest (without having tried it) to use
--accept-regex='.*diffs\-0001[0-9][0-9]\.tar\.gz'
(and perhaps give the appropriate --regex-type too)
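Combined with the original command, that suggestion would be something like the following (equally untried; --regex-type=posix is the default and is given only for clarity):
wget -r --regex-type=posix --accept-regex='.*diffs\-0001[0-9][0-9]\.tar\.gz' ftp://ftp.fu-berlin.de/pub/misc/movies/database/diffs/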
BTW, for such tasks, I would also consider using some scripting language à la Python (or use libcurl or curl)
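As an aside, curl's built-in URL globbing can name the four files from the question directly; an untested example, where #1 in the output name expands to the current value from the braces:
curl -o 'diffs-0001#1.tar.gz' 'ftp://ftp.fu-berlin.de/pub/misc/movies/database/diffs/diffs-0001{07,14,21,28}.tar.gz'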

Curl execution in python failed

I wanted to download files for around 300 items. An example is below:
curl 'http://genome.jgi.doe.gov/ext-api/downloads/get-directory?organism=Absrep1' -b cookies > Absrep1.xml
This opens the page, downloads the content, and stores it as an XML file on my end.
I tried to write a batch script in Perl with the system command, like
system('curl 'http://genome.jgi.doe.gov/ext-api/downloads/get-directory?organism=Absrep1'
-b cookies > Absrep1.xml');
But it did not work; there was a syntax error, which I guess is due to the single quotes.
I tried with Python:
import subprocess
bash_com = 'curl "http://genome.jgi.doe.gov/ext-api/downloads/get-directory?organism=Absrep1" '
subprocess.Popen(bash_com)
output = subprocess.check_output(['bash','-c', bash_com])
It did not work; I get the error "File does not exist". Even if it worked, how can I include the
-b cookies > Absrep1.xml
part in it?
Please help. Thanks in Advance,
AP
In Perl, you should be able to use this:
system(q{curl 'http://genome.jgi.doe.gov/ext-api/downloads/get-directory?organism=Absrep1' -b cookies > Absrep1.xml});
However, you might be better off using LWP or possibly even HTTP::Tiny (unless you need the cookies) instead of shelling out. For more advanced uses, there is also WWW::Mechanize.
The syntax error is almost certainly down to the quotes in the system call:
system('curl 'http://genome.jgi.doe.gov/ext-api/downloads/get-directory?organism=Absrep1' -b cookies > Absrep1.xml');
The single quotes either need to be escaped, or alternative delimiters can be used, such as double quotes or the q/qq quoting operators, e.g.:
system(q{curl 'http://genome.jgi.doe.gov/ext-api/downloads/get-directory?organism=Absrep1' -b cookies > Absrep1.xml});
It's hard to tell from the context given, but wrapping the curl call in Perl or Python would likely be a less than optimal approach. Perl has LWP, Python has requests, and the bash shell is already well equipped to run simple batch jobs. It might be best to stick to a single interpreter unless there's a good reason not to.
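For instance, if the roughly 300 organism names were listed one per line in a file (organisms.txt is a name assumed here for illustration), a plain shell loop would sidestep the quoting problems entirely:
# organisms.txt: one organism name per line (hypothetical file)
while read -r org; do
  curl "http://genome.jgi.doe.gov/ext-api/downloads/get-directory?organism=$org" -b cookies > "$org.xml"
done < organisms.txt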

How to keep directory structure with aria2?

I need to download files simultaneously; wget doesn't support that, so I want to try aria2. But I don't see an option in aria2 to keep the directory structure.
Determine the directory structure first,
then build and use a download description file:
aria2c -i uri.txt
where uri.txt might contain
http://serverA/file1.iso http://mirror-serverB/file1.iso
# option lines must begin with a space, otherwise the line is treated as a URL!
 dir=/downloads/a
# not mandatory
 out=file1.iso
http://serverA/file2.iso http://mirror-serverB/file2.iso
 dir=/downloads/b
 out=file2.iso
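If the point is to reproduce the remote directory layout, the description file can be generated from a plain list of URLs; a rough sketch, assuming a hypothetical urls.txt with one URL per line:
# urls.txt: one URL per line (hypothetical input file)
while read -r url; do
  path=${url#*://*/}          # strip scheme and host
  printf '%s\n dir=%s\n' "$url" "downloads/$(dirname "$path")"
done < urls.txt > uri.txt
aria2c -i uri.txt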
Keep in mind that aria2 is a download utility, not a sync utility like rsync or lftp.
Referencing an rsync answer: https://stackoverflow.com/a/4147263/1163786
and an lftp answer: https://superuser.com/a/305236.

Bash String Interpolation wget

I'm trying to download files using wget and for some reason I think I'm encountering problems with string interpolation. I have a list of files to be downloaded that I have successfully parsed etc. (no small feat for me) and would like to incorporate these into a for-loop/wget combination that downloads these files en masse.
Please forgive me: the URLs won't work for you because the password and data have been changed.
I have tried single and double quotes, as well as escaping a few characters in the URL (the &s and #s, which I presume are at the marrow of this).
list of files
files.txt
/results/KIDFROST#LARAZA.com--B627-.txt.gz
/results/KIDFROST#LARAZA.com--B626-part002.csv.gz
/results/KIDFROST#LARAZA.com--B62-part001.csv.gz
wget statement works as a single command
wget -O files/$i "https://tickhistory.com/HttpPull/Download? user=KIDFROST#LARAZA.com&pass=RICOSUAVE&file=/results/KIDFROST#LARAZA.com--N6265-.txt.gz"
but a loop doesn't work
for i in `cat files.txt`; do
  wget -O files/$i "https://tickhistory.com/HttpPull/Download? user=KIDFROST#LARAZA.com&pass=RICOSUAVE&file=$i"
done
Edit: The main issue was "files/$i" instead of "files$i" (see comments).
Don't use a for loop over cat for reading files; use a while/read loop. It has fewer problems with word splitting, globbing, and so on.
while read -r i; do
  wget -O "files$i" "https://tickhistory.com/HttpPull/Download? user=KIDFROST#LARAZA.com&pass=RICOSUAVE&file=$i"
done < files.txt
Also make sure you quote variables to prevent expansion/splitting errors. I'm also not really sure why you're doing files/$i, since $i already contains the full path.

Handling command line options with multiple arguments for some flags

I'm writing a program where the command line usage should be something like:
mkblueprint FILE FILE FILE -o <output name> -s <string> -r <number> -p pOPT1 pOPT2 pOPT3
I'm currently using CmdLib and I can't figure out a way to handle this; a flag is required for each input (so I can't just have FILEs sitting alone) and there doesn't appear to be a way to pass multiple arguments to a flag, as with -p. These are extremely common in command-line programs, so I figure I'm just misunderstanding the documentation, but it's not mentioned in any command-line library I look at for Haskell.
After some more work with CmdLib I was able to handle the bare FILE input via the Extra tag and then checking that each string is a valid file, which seems to be the standard way to handle it despite the name. -p pOPT1 pOPT2 pOPT3 is apparently not allowed under the POSIX standard, which is why I'm not finding libraries that will do it.
You might consider the GetOpt bindings that come with base. They're not as sexy as some of the more modern alternatives, but they support bare arguments and final options well.

Trying to 'grep' links from downloaded html pages in bash shell environment without cut, sed, tr commands (only e/grep)

In a Linux shell, I am trying to return links to JPG files from a downloaded HTML file. So far I have only got to this point:
grep 'http://[:print:]*.jpg' 'www_page.html'
I don't want to use auxiliary commands like 'tr', 'cut', 'sed' etc...'lynx' is okay!
Using grep alone without massaging the file is doable but not recommended as many have pointed out in the comments.
If you can loosen your requirements a bit, you can use HTML Tidy to massage the downloaded HTML file so that each HTML element is on its own line; then the regular expression can stay as simple as you wanted, something like this:
$ tidy file.html | grep -o 'http://[[:print:]]*\.jpg'
Note the use of the -o option to grep, which prints only the matching part of the input.
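For completeness, a grep-only sketch without tidy (the approach the comments advise against); it assumes every link is enclosed in double quotes and appears literally in the page source:
grep -o 'http://[^"]*\.jpg' www_page.html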
