How to use wget on a page with authentication - linux

I've been searching the internet about wget, and found many posts on how to use wget to log into a site that has a login page.
The site uses HTTPS, and the form fields the login page expects are "userid" and "password". I've verified this in the Network tab of Chrome's developer tools (F12).
I've been using the following posts as guidelines:
http://www.unix.com/shell-programming-and-scripting/131020-using-wget-curl-http-post-authentication.html
And
wget with authentication
What I've tried:
testlab:/lua_curl_tests# wget --save-cookies cookies.txt --post-data 'userid=myid&password=123123' https://10.123.11.22/cgi-bin/acd/myapp/controller/method1
wget: unrecognized option `--save-cookies'
BusyBox v1.21.1 (2013-07-05 16:54:31 UTC) multi-call binary.
And also
testlab/lua_curl_tests# wget
http://userid=myid:123123#10.123.11.22/cgi-bin/acd/myapp/controller/method1
Connecting to 10.123.11.22 (10.123.11.22:80) wget: server returned
error: HTTP/1.1 403 Forbidden
Can you tell me what I'm doing wrong? Ultimately, what I'd like to do is log in, post data, and then grab the resulting page.
I'm also currently looking at curl, to see if I should really be doing this with curl (lua-curl) instead.
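For what it's worth, the first error happens because BusyBox's wget applet simply doesn't implement --save-cookies; the cookie-based flow needs the full GNU wget (or curl). A minimal sketch of the usual two-step approach, using the login URL from the question; the second URL (method2) and its POST body are placeholders for whatever you want to fetch after logging in:
# step 1: POST the credentials and keep the session cookie (GNU wget, not BusyBox)
wget --no-check-certificate --keep-session-cookies --save-cookies cookies.txt \
  --post-data 'userid=myid&password=123123' \
  -O login.html 'https://10.123.11.22/cgi-bin/acd/myapp/controller/method1'
# step 2: reuse the cookie for the follow-up request (method2 and its data are placeholders)
wget --no-check-certificate --load-cookies cookies.txt \
  --post-data 'some=data' \
  -O result.html 'https://10.123.11.22/cgi-bin/acd/myapp/controller/method2'
The curl equivalent uses -c/-b for the cookie jar and -d for the POST body (curl -k -c cookies.txt -d 'userid=myid&password=123123' https://.../method1).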

Related

Wget error: HTTP request sent, awaiting response... 401 Unauthorized Authorization failed

So I have to download a whole webpage as a worst-case fallback in case our network collapses. But with wget I only get a 401 Unauthorized error. I suspect Kerberos.
I've tried curl; at first it only printed the index page's source, then I added --output and it downloaded only the index page. But only the index, because everything beyond it is password protected.
wget --header="Authorization: Basic XXXXXXXXXXXXXXXXXXX" --recursive --wait=5 --level=2 --execute="robots = off" --page-requisites --html-extension --convert-links --restrict-file-names=windows --domains my.domain --no-parent https://webpage.internal
and curl
curl --anyauth --negotiate -u admin https://webpage.internal --output index.html
Is there any way to use curl for the whole website, or is there a simple fix for my wget command?
Thanks.
Okay, I solved it myself. I just needed to change:
--header="Authorization: Basic XXXXXXXXXXXXXXXXXXX"
to
--header="Authorization: OAuth XXXXXXXXXXXXXXXXXXX"
and it started to clone.
Edit: this did not solve it after all.
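If the site really is behind Kerberos (Negotiate) authentication, note that GNU wget 1.x has no Negotiate/SPNEGO support, while curl does but cannot mirror recursively. One hedged workaround, assuming the server hands out an ordinary session cookie once the negotiated login succeeds: authenticate with curl, save the cookie jar, and let wget do the recursive part with it (webpage.internal as in the question; a valid ticket from kinit is assumed):
# authenticate via Negotiate with curl and capture the session cookie
curl --negotiate -u : -c cookies.txt -o /dev/null https://webpage.internal
# mirror with wget, reusing the cookie instead of an Authorization header
wget --load-cookies cookies.txt --recursive --level=2 --wait=5 \
  --page-requisites --convert-links --no-parent https://webpage.internal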

Ubuntu 18, proxy not working in terminal but works in browser

(a related and perhaps simpler problem to solve: proxy authentication via MSCHAPv2)
Summary: I am using Ubuntu 18; the proxy works with the web browser but not with terminal applications (wget, curl or apt update). Any clues? The problem seems to be interpreting the proxy's "PAC file"... Is it? How do I translate it into Linux's proxy variables? ... Or is the problem simpler: was my proxy config (see the step-by-step procedure below) wrong?
Details:
Running env | grep -i proxy in a terminal gives
https_proxy=http://user:pass#pac._ProxyDomain_/proxy.pac:8080
http_proxy=http://user:pass#pac._ProxyDomain_/proxy.pac:8080
no_proxy=localhost,127.0.0.0/8,::1
NO_PROXY=localhost,127.0.0.0/8,::1
ftp_proxy=http://user:pass#pac._ProxyDomain_/proxy.pac:8080
and the browser (Firefox) is working fine for any URL, but:
wget http://google.com says: Resolving pac._ProxyDomain_ (pac._ProxyDomain_)... etc.etc.0.26 Connecting to pac._ProxyDomain_ (pac._ProxyDomain_)|etc.etc.0.26|:80... connected.
Proxy request sent, awaiting response... 403 Forbidden
2019-07-25 12:52:19 ERROR 403: Forbidden.
curl http://google.com says "curl: (5) Could not resolve proxy: pac._ProxyDomain_/proxy.pac"
Notes
(recent news here: purging the exported proxy variables changes something, and I have not re-tested everything...)
The proxy configuration procedures that I used (is there a plug-and-play PAC file generator? Do I need a PAC file?)
Config procedures used
The machine was running with a direct, non-proxied internet connection... Then the machine was moved to the LAN with the proxy.
Added "export *_proxy" lines (http, https and ftp) to my ~/.profile. The URL definitions are of the form http_proxy="http://user:pwd#etc" (supposing that is correct, because it was tested before with the user:pwd#http://pac.domain/proxy.pac syntax and Firefox prompted for the proxy login) (if the current proxy password uses the # character, does it need to change?)
Added "export *_proxy" lines to ~root/.profile (is that needed?)
(you can reboot and test with echo $http_proxy)
visudo procedure described here
Rebooted and browsed with Firefox without needing to log in, directly (good, it's working!). Testing env | grep -i proxy shows all the correct values as expected.
Testing wget and curl as at the beginning of this report: the proxy fails.
Testing sudo apt update: it fails too.
... and one more step after that: supposing that no such file exists for apt, I created one with sudo nano /etc/apt/apt.conf.d/80proxy and added 3 lines of Acquire::*::proxy "value"; with value http://user:pass#pac._ProxyDomain_/proxy.pac:8080, where pass is etc%23etc, URL-encoded (a sketch of that file is below).
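For reference, a minimal sketch of what that 80proxy file usually looks like. Note that apt, like wget and curl, cannot evaluate a PAC file, so the value should be the real proxy endpoint (the host:port below is a placeholder), not the proxy.pac URL:
Acquire::http::Proxy "http://user:etc%23etc@proxy.example.com:8080/";
Acquire::https::Proxy "http://user:etc%23etc@proxy.example.com:8080/";
Acquire::ftp::Proxy "http://user:etc%23etc@proxy.example.com:8080/";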
Summary of tests performed
CONTEXT-1.1
(this was a problem, but I'm ignoring it for now to focus on the more relevant one)
After the (proxied) cable connection and the proxy configuration in the system (see the "Config procedures used" section above). Proxy password with a special character.
curl http://google.com says "curl: (5) Could not resolve proxy..."
When I change %23 to # everywhere in .profile, the wget error changes but the curl error does not. wget now reports "Error parsing proxy URL http://user:pass#pac._ProxyDomain_/proxy.pac:8080: Bad port number"
PS: when the password contained $, the system (something in the export http_proxy command or in a later use of http_proxy) confused it with a variable.
CONTEXT-1.2
Same as CONTEXT-1.1 above, but with a password containing no special characters. A good, clean proxy password.
curl http://google.com says "curl: (5) Could not resolve proxy..."
CONTEXT-2
After the (proxied) cable connection and with no proxy configuration in the system (but confirmed that the connection works in the browser after the automatic popup login form).
curl -x 192.168.0.1:8080 http://google.com says "curl: (7) Failed to connect..."
curl --verbose -x "http://user:pass#pac._proxyDomain_/proxy.pac" http://google.com says "curl: (5) Could not resolve proxy..."
Other configs in use
As @Roadowl suggested, I checked:
the files ~/.netrc and ~root/.netrc do not exist
the file /etc/wgetrc exists, but everything in it is commented out except passive_ftp = on
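A side note that may explain the symptoms, plus a minimal sketch (the host names below are placeholders): wget, curl and apt do not evaluate PAC files, so pointing the *_proxy variables at .../proxy.pac:8080 makes the tools send their proxy requests to the PAC web server itself, which would explain the 403 in the wget log above. The usual fix is to read the PROXY entry out of the PAC file and point the variables at that real host:port, with any special characters in the password URL-encoded:
# look at what proxy the PAC script actually returns (bypass the broken proxy vars)
curl -s --noproxy '*' http://pac._ProxyDomain_/proxy.pac | grep -oE 'PROXY [^";]+'
# then export the variables against that real endpoint (proxy.example.com:8080 is a placeholder,
# and a # in the password is written as %23)
export http_proxy="http://user:pa%23ss@proxy.example.com:8080"
export https_proxy="$http_proxy"
export ftp_proxy="$http_proxy"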

Wget and quoted URL

Currently I am struggling with mirroring a website using Wget.
Browsing the web, I came up with the following command to mirror a complete website:
wget --mirror --convert-links --adjust-extension --backup-converted --page-requisites -e robots=off http://www.example.com
As expected, after running the command there is a folder called www.example.com containing all downloaded files. However, some background images are missing. Digging through the files and logs I found that wget seems to have a problem with quoted image URLs.
The website uses the following CSS to include a background image:
<div ... style="background-image: url("/path/to/image") ;..." ... />
When collecting the page requisites, wget parses the URL and tries to download the file
http://www.example.com/"/path/to/image"
which obviously fails with an error 404:
--2018-01-08 18:04:00-- https://www.example.com/"/path/to/image"
Reusing existing connection to www.example.com:443.
HTTP request sent, awaiting response... 404 Not Found
2018-01-08 18:04:00 ERROR 404: Not Found
Unfortunately I cannot post the original domain for privacy reasons...
I already tried to find a solution on the web, but I did not manage to find the right keywords to search for, so as a last resort I am asking here for help.
Is there a way to tell Wget to ignore quotes inside URLs?
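I am not aware of a wget option for this, but as a hedged workaround (assuming the offending quotes appear literally in the saved files and all come from that background-image pattern, and using the placeholder domain from the question), the quoted URLs can be pulled out of the mirrored files afterwards and fetched separately into the same directory layout:
# extract the quoted background-image paths from the mirrored HTML,
# prepend the site, and let wget recreate the www.example.com/... hierarchy
grep -rhoP 'background-image:\s*url\("\K[^"]+' www.example.com \
  | sed 's|^|https://www.example.com|' \
  | wget --no-verbose --force-directories --input-file=-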

How to download Bing Static Map in Linux

I'm trying to download a static map using the Bing Maps API. It works when I load the URL from Chrome, but when I try curl or wget from Linux, I get an authentication-failed error.
The URLs are identical, but for some reason Bing seems to be blocking the calls from Linux?
Here are the commands I tried:
wget -O map.png http://dev.virtualearth.net/REST/V1/Imagery/Map/Road/...
curl -O map.png http://dev.virtualearth.net/REST/V1/Imagery/Map/Road/...
Error:
Resolving dev.virtualearth.net (dev.virtualearth.net)... 131.253.14.8
Connecting to dev.virtualearth.net (dev.virtualearth.net)|131.253.14.8|:80... connected.
HTTP request sent, awaiting response... 401 Unauthorized
Username/Password Authentication Failed.
--2016-10-24 15:42:30-- http://dev.virtualearth.net/REST/V1/Imagery/Map/Road/.../12?mapSize=340,500
Reusing existing connection to dev.virtualearth.net:80.
HTTP request sent, awaiting response... 401 Unauthorized
Username/Password Authentication Failed.
I'm not sure if it has anything to do with the key type; I've tried several, from Public Website to Dev/Test, but it still didn't work.
The URL needs to be wrapped in quotes, because the & symbol in the query string is otherwise interpreted by the shell: it cuts the command off at the first &, so any parameters after it (including the key) never reach the server, which is why you get a 401:
wget 'http://dev.virtualearth.net/REST/V1/Imagery/Map/Road/...'
Examples
Via wget:
wget -O map.jpg 'http://dev.virtualearth.net/REST/V1/Imagery/Map/Road/Bellevue%20Washington?mapLayer=TrafficFlow&key=<key>'
Via curl:
curl -o map.jpg 'http://dev.virtualearth.net/REST/V1/Imagery/Map/Road/Bellevue%20Washington?mapLayer=TrafficFlow&key=<key>'
Verified under Ubuntu 16.04.

Download file/folder from sharepoint using Curl/Wget automatically

I have been trying to use curl and wget to download a file from SharePoint. I am planning to make it a script which runs automatically every day and downloads the file from the URL.
I tried using curl with the following command:
curl -O --user Myusername:Mypassword https://OurDomain.sharepoint.com/_XXX&file=IPS_cleaned.xlsx&action=default
But it gave me an error about the SSL connection. I learned that there is a known bug in curl 7.35, so I downgraded it to 7.22, but it still gives me the same error.
I also tried using Wget
wget --user=Myusername --password=MyPassword --no-check-certificate https://OurDomain.sharepoint.com/_XXX&file=IPS_cleaned.xlsx&action=default
But it still gives me the error: Unable to establish SSL connection.
Can someone please let me know how I can accomplish this task?
UPDATE
I was able to resolve the SSL error in curl. Below is the command I used:
curl -O -L --sslv3 -A "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US) AppleWebKit/525.13 (KHTML, like Gecko) Chrome/0.A.B.C Safari/525.13" --user Myusername:Mypassword 'https://OurDomain.sharepoint.com/_%7BB21r-9CA2-345DEF%7D&file=IPS_cleaned.xlsx&action=default'
Now what it downloads is a file which, when I open it, shows me the SharePoint login page. It does not download the actual Excel file.
Any reason?
Another potential solution involves taking your SharePoint link and replacing the text after the '?' with download=1:
This:
https://my.sharepoint.com/:u:/g/XXX/XXXX-bunchofRandomText?e=kRlVi
Becomes this:
https://my.sharepoint.com/:u:/g/XXX/XXXX-bunchofRandomText?download=1
Now, you can just:
wget https://my.sharepoint.com/:u:/g/XXX/XXXX-bunchofRandomText?download=1
*Note, this example used a single file and a link where anyone with the link could access the file (no credentials required)
Please use rclone
Download and install the latest version from https://rclone.org/downloads
First option: use OneDrive to access SharePoint sites or your personal folder. This option will also help you to upload large files.
1. Create the rclone configuration using the rclone config command
2. Select New remote and give it a name
3. Select OneDrive as the cloud storage
4. Leave the client ID and secret blank
5. Edit advanced config: n
6. Remote config: use auto-config: y
7. Open the URL in a browser and give rclone access
8. Select the personal / shared site URL option
8a. For the shared site URL option you have to give the site URL, i.e. https://sharepoint.com/sites/SiteName
9. Select the personal / Documents drive. The Documents drive will show up if you selected the shared site URL option in step 8
Save the config and quit
And the configuration file contents will look like the following. If you selected the Personal option, the drive type will be personal.
[onedrive]
type = onedrive
token =
drive_id =
drive_type = documentLibrary
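With this remote configured, a download would look something like the following (the folder and file names are placeholders):
rclone copy --verbose onedrive:General/IPS_cleaned.xlsx ./DestFolder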
Second option: with this option, you can upload files of up to 2 GB in size.
1. Create the rclone configuration using the rclone config command
2. Select New remote and give it a name
3. Select WebDAV as the cloud storage
4. Give the site URL, username and password
5. Save and quit
And the configuration file contents will look like the following. The password will be stored in an encrypted format.
vim /root/.config/rclone/rclone.conf
[sharepoint]
type = webdav
url = https://sharepoint.com/sites/SiteName/Documents
vendor = sharepoint
user =
pass =
Download a file from SharePoint.
rclone copy --ignore-times --ignore-size --verbose sharepoint:SourceFolder/file.txt DestFolder
There is a Firefox plugin that captures the link with the session ID etc., and it provides a command you can paste into the console for curl or wget.
If anyone has a better suggestion, please let me know.
It gives you a curl or wget command with headers, cookies and all, with a copy-to-clipboard button, right on the download dialogue.
Download URL: https://addons.mozilla.org/en-US/firefox/addon/cliget
Reference: https://superuser.com/questions/27243/how-to-find-out-the-real-download-url-on-download-sites-that-use-redirects/1239026#1239026
I struggled with the same issue myself, and found my not-so-automatic but oh-so-convenient way, with a daily log-in:
logged into SharePoint with a browser,
exported the cookies,
and ran the following command.
wget --cookies=on --load-cookies cookies.txt --keep-session-cookies --no-check-certificate -m https://yoursharepoint.com
And files were downloaded just fine.
For anyone using curl to download a file on SharePoint shared with an "Anyone with the link" option: below are the steps I had to follow. Essentially you have to use the cookie from the share link, and then download the file from a different download link that they don't expose easily.
When sending the CURL command for the “share link” it returns a 302 message, a forward link, and a cookie. If we save that cookie and use it to hit a “download” link I am able to download the file. Essentially, Microsoft uses the initial “share link” to send the cookie to the browser, and then redirect to their “View File” website. On that website you need to use the cookie provided (authentication), and select your next function (On screen view, print, download, etc). When you click the download button you hit a different link. I was able to find this link by going to the "view page" website for the file/link, turning on developer tools, and watching the link the browser follows when hitting download. You can then replicate that link for each file. If we use that download link along with the cookie, we can download the file.
curl -i -c cookies.txt SHARE LINK
curl -o docsdownloaded.pdf -b cookies.txt DOWNLOAD LINK
Share Link Ex: https://tenant.sharepoint.com/:b:/s/Folder/EdNUf4xAVzFJgBoO0MqkfppR5tgobxLrmCnRqU4LFJQ?e=rOGNSD
Download Link Ex:https://tenant.sharepoint.com/sites/Folder/_layouts/15/download.aspx?SourceUrl=%2Fsites%2FFolder%2FShared%20Documents%2FGeneral%2FBig%2Dfile%2Epdf
Similar to the answer Zyglute gave, using cURL:
You can export your login cookie using the cookies.txt Chrome extension: https://chrome.google.com/webstore/detail/njabckikapfpffapmjgojcnbfjonfjfg
Then use the following code:
curl -b cookie.txt 'https://OurDomain.sharepoint.com/_XXX&file=IPS_cleaned.xlsx&action=default'
At some point your Sharepoint session will expire (not sure how long that takes), and you will need a new cookie file.
EDIT: If a malicious user gets a hold of your cookie.txt, they could get into your SharePoint account, so be sure to keep it safe.
Use wget, adding &download=1 at the end of the link.
wget "<yourlink>&download=1"
It will be downloaded with the <yourlink> string as its name; just mv it to the correct name afterwards (or use -O, as sketched below).
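A minimal variant that avoids the rename by setting the output file name directly (the file name here is just an example):
wget -O file.xlsx "<yourlink>&download=1"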
