Using wget but ignoring URL parameters - Linux

I want to download the contents of a website where the URLs are built as
http://www.example.com/level1/level2?option1=1&option2=2
Within the URL, only http://www.example.com/level1/level2 is unique for each page; the values of option1 and option2 change. In fact, every unique page can have hundreds of different notations due to these variables. I am using wget to fetch all of the site's content, and because of this I have already downloaded more than 3GB of data. Is there a way to tell wget to ignore everything after the URL's question mark? I can't find it in the man pages.

You can use --reject-regex to specify a pattern for rejecting specific URLs, e.g.
wget --reject-regex "(.*)\?(.*)" -m -c --content-disposition http://example.com/
This will mirror the website but ignore the addresses containing a question mark, which is useful for mirroring wiki sites.

wget2 has this built in via options --cut-url-get-vars and --cut-file-get-vars.
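For example, a minimal sketch of how those options could be combined with a recursive download (exact behavior may vary between wget2 versions):
wget2 --cut-url-get-vars --cut-file-get-vars -r http://www.example.com/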

It does not help in your case, but for those who have already downloaded all of these files: you can quickly rename them to remove the question mark and everything after it as follows:
rename -v -n 's/[?].*//' *[?]*
The above command does a trial run and shows you how files will be renamed. If everything looks good with the trial run, then run the command again without the -n (nono) switch.
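For instance, with the Perl rename shipped on Debian/Ubuntu (the util-linux rename uses a different syntax), the actual run would be:
rename -v 's/[?].*//' *[?]*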

Problem solved. I noticed that the URLs I want to download are all search-engine friendly, with descriptions formed using dashes:
http://www.example.com/main-topic/whatever-content-in-this-page
All other URLs had references to the CMS. I got everything I needed with
wget -r http://www.example.com -A "*-*"
This did the trick. Thanks for sharing your thoughts!

kenorb's answer using --reject-regex is good, but it did not work in my case with an older version of wget. Here is the equivalent using wildcards, which works with GNU Wget 1.12:
wget --reject "*\?*" -m -c --content-disposition http://example.com/

Related

How to shorten the URLs of a static website copy (created with wget) to fewer than 256 characters?

I have to create a full website copy / mirror of https://www.landesmuseum-mecklenburg.de. This copy has to run locally on a Windows 7 system, which has no network connection. Windows 7 has a path length limit of 255 characters, which is exceeded in my case. How do I work around this?
I create a static website copy on a Debian system with: wget --mirror --adjust-extension --convert-links --restrict-file-names=windows --page-requisites -e robots=off --no-clobber --no-parent --base=./ https://www.landesmuseum-mecklenburg.de/.
This way I got nearly everything I need. Some special URLs / images are downloaded via a separate URL list with: wget -c -x -i imagelist.txt
After I create an archive of these files, transfer it to my Windows 7 test system, and extract it into a local web server (called "MiniWebserver"), I can visit http://localhost/ and everything seems to work.
But some deep links, especially images, end up with paths of over 255 characters in the Windows NTFS filesystem. None of these images are displayed in the local copy.
I tried wget's -nH --cut-dirs=5 options, but with no acceptable result (the index.html gets overwritten each time).
Another idea was to use the DOS short-name compatibility feature, so that long directory names would be translated to eight-character names, e.g. longdirname to longdi~1. But I have no idea how to automate this.
Update: Yet another idea
One more thing that came to my mind was to hash the entire path (e.g. with MD5) and use the hash instead of the full path and filename. In addition, all URLs in the downloaded .html files would have to be substituted accordingly. But again: I have no idea how to accomplish this with Debian command-line tools.
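For what it's worth, here is a rough sketch of that hashing idea with standard Debian command-line tools. The mirror directory name, the assets/ folder, and the naive link rewriting are all assumptions, and relative links from subdirectories would need more careful handling:
cd mirror                                   # the directory created by wget (name is an assumption)
mkdir -p assets
# move every non-HTML file to a short, hash-based name and rewrite references to it
find . -type f ! -name '*.html' ! -path './assets/*' -print | while IFS= read -r path; do
    rel=${path#./}                          # path as it appears in the downloaded HTML
    hash=$(printf '%s' "$rel" | md5sum | cut -d' ' -f1)
    ext=${rel##*.}
    mv "$rel" "assets/$hash.$ext"
    # naive substitution: assumes the HTML references the file by this exact path
    grep -rlF --include='*.html' -- "$rel" . | xargs -r sed -i "s|$rel|assets/$hash.$ext|g"
done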
You can try the following:
wget --trust-server-names <url>
--trust-server-names
If this is set to on, on a redirect the last component of the
redirection URL will be used as the local file name. By default,
the last component of the original URL is used.

What is the command to get the .listing file from an SFTP server using cURL?

In wget, I am getting the list of files and their properties from an FTP server using the command below:
wget --no-remove-listing ftp://myftpserver/ftpdirectory/
This will generate two files: .listing (this is what I am looking for in cURL) and index.html, which is the HTML version of the listing file.
My expectation:
How do I achieve the same in cURL? What is the command to get the .listing and index.html files from an FTP/SFTP server using cURL?
This is what I found on http://curl.haxx.se/docs/faq.html#How_do_I_get_an_FTP_directory_li
If you end the FTP URL you request with a slash, libcurl will provide you with a directory listing of that given directory. You can also set CURLOPT_CUSTOMREQUEST to alter what exact listing command libcurl would use to list the files.
The follow-up question that tends to follow the previous one is how a program is supposed to parse the directory listing: how does it know what's a file, what's a directory, what's a symlink, etc.? The harsh reality is that FTP provides no such fine and easy-to-parse output. The output format FTP servers use in response to LIST commands is entirely at the server's own liking, and the NLST output doesn't reveal any types and in many cases doesn't even include all the directory entries. Also, both LIST and NLST tend to hide unix-style hidden files (those that start with a dot) by default, so you need to do "LIST -a" or similar to see them.
I have checked it on Windows with Cygwin and it works for me:
Do not forget to use / at the end of the path.
$ curl -s -l -u test1#test.com:Test_PASSWD --ftp-ssl 'ftps://ftp.test.com/Ent/'
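To get something closer to wget's .listing (the full LIST output rather than names only), a sketch using the URL from the question:
curl --silent 'ftp://myftpserver/ftpdirectory/' -o .listing
curl --silent --list-only 'ftp://myftpserver/ftpdirectory/' -o names.txt
The same trailing-slash rule applies to sftp:// URLs when curl is built with SFTP support. Note that curl does not generate an index.html the way wget does; only the raw listing is available.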

wget some website .gif files

I have tried hard with different options, but I still can't wget
http://auno.org/ao/nanos.php?prof=nano-technician
I would like to get all the little .gif images.
My current request is:
wget -A.gif http://auno.org/ao/nanos.php?prof=nano-technician/
I tried many options like -O or -U firefox etc.
Thanks for reading.
This way you're only downloading the requested page, nothing else. If you want to download more than that, you must turn on recursive downloading (-r).
Also, as the images are on different hosts, you may want to enable host spanning (-H), and when you do that you should also specify a restrictive recursion limit to avoid downloading half the internet (-l):
wget -A .gif -r -l 1 -H http://auno.org/ao/nanos.php?prof=nano-technician/

How can wget save only certain file types linked to from pages linked to by the target page?

How can wget save only certain file types linked to from pages linked to by the target page, regardless of the domain the files are on?
Trying to speed up a task I have to do often.
I've been rooting through the wget docs and googling, but nothing seems to work. I keep getting either just the target page or the subpages without the files (even when using -H), so I'm obviously doing something wrong.
So, essentially, example.com/index1/ contains links to example.com/subpage1/ and example.com/subpage2/, while the subpages contain links to example2.com/file.ext and example2.com/file2.ext, etc. However, example.com/index1.html may link to example.com/index2/ which has links to more subpages I don't want.
Can wget even do this, and if not then what do you suggest I use? Thanks.
The following command worked for me.
wget -r --accept "*.ext" --level 2 "example.com/index1/"
This needs to be done recursively, so -r is added.
Something like this should work:
wget --accept "*.ext" --level 2 "example.com/index1/"
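Since the files in the question sit on a different host (example2.com), a host-spanning variant may also be needed; a sketch, assuming both domains are known in advance:
wget -r -l 2 -H -D example.com,example2.com --accept "*.ext" "http://example.com/index1/"
Here -H enables host spanning and -D limits it to the listed domains, so the recursion does not wander off to unrelated sites.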

Can wget be used to get all the files on a server?

Can wget be used to get all the files on a server? Suppose this is the directory structure of my site foo.com, built with the Django framework:
/web/project1
/web/project2
/web/project3
/web/project4
/web/templates
Without knowing the names of the directories (project1, project2, ...), is it possible to download all the files?
You could use
wget -r -np http://www.foo.com/pool/main/z/
-r (fetch files/folders recursively)
-np (do not ascend to the parent directory when retrieving recursively)
or
wget -nH --cut-dirs=2 -r -np http://www.foo.com/pool/main/z/
--cut-dirs=NUMBER (ignore NUMBER remote directory components)
-nH (by default, invoking Wget with -r http://fly.srk.fer.hr/ creates a directory structure beginning with fly.srk.fer.hr/; this option disables that behavior)
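To illustrate with a hypothetical file foo.tar.gz under that path:
wget -r -np ...                    would save it as www.foo.com/pool/main/z/foo.tar.gz
wget -nH --cut-dirs=2 -r -np ...   would save it as z/foo.tar.gz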
First of all, wget can only be used to retrieve files served by the web server. It's not clear from your question whether you mean actual files on disk or web pages. I would guess from the way you phrased your question that your intent is to download the server-side files, not the web pages served by Django. If that is correct, then no, wget won't work; you need to use something like rsync or scp.
If you do mean using wget to retrieve all of the generated pages from Django, then this will only work if links point to those directories. So, you need a page that has code like:
<ul>
<li><a href="/web/project1/">Project1</a></li>
<li><a href="/web/project2/">Project2</a></li>
<li><a href="/web/project3/">Project3</a></li>
<li><a href="/web/project4/">Project4</a></li>
<li><a href="/web/templates/">Templates</a></li>
</ul>
wget is not a psychic; it can only pull in pages it knows about.
Try recursive retrieval with the -r option.
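A minimal sketch, assuming foo.com serves an index page at /web/ that links to the project directories:
wget -r -np -l 2 http://foo.com/web/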
