Wget and quoted URL - linux

Currently I am struggling with mirroring a website using Wget.
Browsing the web, I came up with the following command to mirror a complete website:
wget --mirror --convert-links --adjust-extension --backup-converted --page-requisites -e robots=off http://www.example.com
As expected, after running the command there is a folder called www.example.com containing all downloaded files. However, some background images are missing. Digging through the files and logs I found that wget seems to have a problem with quoted image URLs.
The website uses the following CSS to include a background image:
<div ... style="background-image: url("/path/to/image") ;..." ... />
While collecting the page requisites, wget parses the URL and tries to download the file,
http://www.example.com/"/path/to/image"
which obviously fails with an error 404:
--2018-01-08 18:04:00-- https://www.example.com/"/path/to/image"
Reusing existing connection to www.example.com:443.
HTTP request sent, awaiting response... 404 Not Found
2018-01-08 18:04:00 ERROR 404: Not Found
Unfortunately I cannot post the original domain for privacy reasons...
I already tried to find a solution on the web, but I could not find the right keywords to search for, so as a last resort I am asking here for help.
Is there a way to tell Wget to ignore quotes inside URLs?
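One possible workaround, sketched here in Python rather than as a wget option, is to post-process the mirrored HTML: find the url(...) references, strip the quotes, and resolve them against the site root so they can be fetched separately. The function name quoted_css_urls and the regular expression are illustrative assumptions. The sketch handles double-quoted, single-quoted and &quot;-escaped targets, but it is not a full CSS parser.

```python
import html
import re
from urllib.parse import urljoin

# Matches url(...) with optional matching quotes around the target.
CSS_URL = re.compile(r"""url\(\s*(["']?)(.*?)\1\s*\)""")

def quoted_css_urls(markup, base_url):
    """Return absolute URLs for every url(...) reference in the markup."""
    text = html.unescape(markup)  # turn &quot; back into "
    return [urljoin(base_url, m.group(2)) for m in CSS_URL.finditer(text)]

snippet = '<div style="background-image: url(&quot;/path/to/image&quot;);"></div>'
for url in quoted_css_urls(snippet, "http://www.example.com/"):
    print(url)
```

The resulting list could be written to a file, one URL per line, and passed to wget with -i to download the missing images.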

Related

Unable to use wget for downloading GitHub search results

I am trying to use wget to download GitHub code search results into a logfile.
I've been using the following command:
wget -o logfile -r -l 2 https://github.com/search?l=Dockerfile&q=openjdk&type=Code&utf8=%E2%9C%93
However, I get a robots.txt file that says the following:
# If you would like to crawl GitHub contact us at support@github.com.
# We also provide an extensive API: https://developer.github.com/
Do I need some sort of permission from github for this?
Can someone help?
I think the message is pretty clear: you're trying to crawl the GitHub site and they don't like that.
They advise you to use the GraphQL API.
The v3 API is still REST, so you could do something like:
wget --output-document search-results.json --user <YOUR_GITHUB_ID> \
"https://api.github.com/search/code?q=openjdk+language:Dockerfile"

Nutch: input url gets modified by nutch parsechecker

I am using the parsechecker command of Nutch v1.0 to parse the following URL:
http://www.doctorslounge.com/forums/viewtopic.php?f=7&t=40267
But on running parsechecker I get the result below:
"bin/nutch parsechecker -dumpText http://www.doctorslounge.com/forums/viewtopic.php?f=7&t=40267"
[1] 8956
$ fetching: http://www.doctorslounge.com/forums/viewtopic.php?f=7
Fetch failed with protocol status: notfound(14), lastModified=0:http://www.doctorslounge.com/forums/viewtopic.php?f=7
Somehow Nutch is automatically modifying my input URL from
http://www.doctorslounge.com/forums/viewtopic.php?f=7&t=40267
to
http://www.doctorslounge.com/forums/viewtopic.php?f=7
Can anyone help me circumvent this problem? Thanks.
P.S. It fetches other URLs of the same domain. For example, this input works perfectly:
http://www.doctorslounge.com/index.php/articles/page/51032
This appears to be an internal issue with the particular site. The same thing happens when trying to run wget http://www.doctorslounge.com/forums/viewtopic.php?f=7&t=40267.
Try this:
bin/nutch parsechecker -dumpText "http://www.doctorslounge.com/forums/viewtopic.php?f=7&t=40267"
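That is, you need to quote (or escape) the &. Unquoted, the shell treats & as a control operator: it runs everything before it as a background job (hence the [1] 8956 line in your output), so the t=40267 part never reaches Nutch.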
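When such a command is assembled by a script, the quoting can be generated instead of typed by hand. Here is a minimal Python sketch using the standard shlex module; the variable names are illustrative.

```python
import shlex

url = "http://www.doctorslounge.com/forums/viewtopic.php?f=7&t=40267"

# shlex.quote wraps the argument in single quotes so the shell passes
# '&' and '?' through literally instead of interpreting them.
command = "bin/nutch parsechecker -dumpText " + shlex.quote(url)
print(command)

# Splitting the command with shell-like rules yields the URL intact.
assert shlex.split(command)[-1] == url
```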
The other problem you'll have with parsing this page with nutch is that it's prohibited by the site's robots.txt file:
User-agent: *
...
Disallow: /forums/viewtopic.php
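To check a URL against such rules before crawling, Python's standard urllib.robotparser can be used. The sketch below feeds it only the two rules quoted above (the elided lines are left out), so the result for other paths is an assumption about the rest of the file.

```python
from urllib.robotparser import RobotFileParser

# The rules quoted above; the elided lines ("...") are not included.
rules = [
    "User-agent: *",
    "Disallow: /forums/viewtopic.php",
]

parser = RobotFileParser()
parser.parse(rules)

forum_url = "http://www.doctorslounge.com/forums/viewtopic.php?f=7&t=40267"
article_url = "http://www.doctorslounge.com/index.php/articles/page/51032"

print(parser.can_fetch("*", forum_url))    # disallowed by the rule above
print(parser.can_fetch("*", article_url))  # allowed under these two rules only
```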

Download file/folder from sharepoint using Curl/Wget automatically

I have been trying to use curl and wget to download a file from SharePoint. I plan to turn this into a script that runs automatically every day and downloads the file from the URL.
I tried curl with the following command:
curl -O --user Myusername:Mypassword https://OurDomain.sharepoint.com/_XXX&file=IPS_cleaned.xlsx&action=default
But it gave me an error about the SSL connection. I learned that there is a known bug in curl 7.35, so I downgraded to 7.22, but I still get the same error.
I also tried wget:
wget --user=Myusername --password=MyPassword --no-check-certificate https://OurDomain.sharepoint.com/_XXX&file=IPS_cleaned.xlsx&action=default
But it still gives me an error: Unable to establish SSL connection
Can someone please let me know how I can accomplish this task?
UPDATE
I was able to resolve the error in curl. Below is the command that I used:
curl -O -L --sslv3 -A "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US) AppleWebKit/525.13 (KHTML, like Gecko) Chrome/0.A.B.C Safari/525.13" --user Myusername:Mypassword 'https://OurDomain.sharepoint.com/_%7BB21r-9CA2-345DEF%7D&file=IPS_cleaned.xlsx&action=default'
What it downloads now is a file which, when I open it, shows the SharePoint login page. It does not download the actual Excel file.
Any idea why?
Another potential solution involves taking your SharePoint link and replacing the text after the '?' with download=1:
This:
https://my.sharepoint.com/:u:/g/XXX/XXXX-bunchofRandomText?e=kRlVi
Becomes this:
https://my.sharepoint.com/:u:/g/XXX/XXXX-bunchofRandomText?download=1
Now, you can just run:
wget "https://my.sharepoint.com/:u:/g/XXX/XXXX-bunchofRandomText?download=1"
*Note: this example used a single file and a link where anyone with the link could access the file (no credentials required).
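The link rewrite described above can be scripted. This is a minimal Python sketch; to_download_link is a hypothetical helper name, and whether download=1 works depends on the link being accessible without credentials, as noted.

```python
from urllib.parse import urlsplit, urlunsplit

def to_download_link(share_link):
    """Replace everything after '?' in the share link with download=1."""
    parts = urlsplit(share_link)
    return urlunsplit(parts._replace(query="download=1"))

print(to_download_link(
    "https://my.sharepoint.com/:u:/g/XXX/XXXX-bunchofRandomText?e=kRlVi"
))
```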
Please use rclone.
Download and install the latest version from https://rclone.org/downloads
First option: Use OneDrive to access SharePoint sites or a personal folder. This option lets you upload large files.
1. Create the rclone configuration using the rclone config command.
2. Select New remote and give it a name.
3. Select OneDrive as the cloud storage.
4. Leave the client ID and secret blank.
5. Edit advanced config: n
6. Remote config: Use auto-config: y
7. Open the URL in the browser and give rclone access.
8. Select the personal or shared site URL option.
8a. For the shared site URL option you have to give the site URL, i.e. https://sharepoint.com/sites/SiteName
9. Select the personal or Documents drive. The Documents drive is shown if you selected the shared site URL option in step 8.
10. Save the config and quit.
The configuration file contents will look like the following. If you selected the Personal option, the drive type will be personal.
[onedrive]
type = onedrive
token =
drive_id =
drive_type = documentLibrary
Second option: With this option, you can upload files of up to 2 GB.
1. Create the rclone configuration using the rclone config command.
2. Select New remote and give it a name.
3. Select WebDAV as the cloud storage.
4. Give the site URL, username and password.
5. Save and quit.
The configuration file contents will look like the following. The password will be in an encrypted format.
vim /root/.config/rclone/rclone.conf
[sharepoint]
type = webdav
url = https://sharepoint.com/sites/SiteName/Documents
vendor = sharepoint
user =
pass =
Download a file from SharePoint.
rclone copy --ignore-times --ignore-size --verbose sharepoint:SourceFolder/file.txt DestFolder
There is a Firefox plugin, cliget, that captures the link together with the session ID and so on, and provides a curl or wget command you can paste into the console.
It gives you a curl or wget command with headers, cookies and all, with a copy-to-clipboard button, right in the download dialogue.
If anyone has a better suggestion please let me know.
Download URL: https://addons.mozilla.org/en-US/firefox/addon/cliget
Reference: https://superuser.com/questions/27243/how-to-find-out-the-real-download-url-on-download-sites-that-use-redirects/1239026#1239026
I struggled with the same issue myself and ended up with a not-so-automatic-but-man-so-convenient approach that needs a daily log-in:
1. Log into SharePoint with a browser.
2. Export the cookie.
3. Run the following command:
wget --cookies=on --load-cookies cookies.txt --keep-session-cookies --no-check-certificate -m https://yoursharepoint.com
And files were downloaded just fine.
For anyone using curl to download a file from SharePoint with an "Anyone with the link" sharing option, below are the steps I had to follow. Essentially, you have to take the cookie from the share link and then download the file from a different download link, one that is not easy to find.
When curl requests the “share link”, the server returns a 302 response, a forwarding link, and a cookie. Microsoft uses the initial “share link” to send the cookie to the browser and then redirects to its “View File” page. On that page the cookie serves as authentication, and you select your next action (on-screen view, print, download, etc.). Clicking the download button hits a different link. I found this link by opening the "view page" for the file, turning on developer tools, and watching which link the browser follows when I hit download. You can then replicate that link for each file. Using that download link together with the saved cookie, the file can be downloaded:
curl -i -c cookies.txt "<SHARE LINK>"
curl -o docsdownloaded.pdf -b cookies.txt "<DOWNLOAD LINK>"
Share link example: https://tenant.sharepoint.com/:b:/s/Folder/EdNUf4xAVzFJgBoO0MqkfppR5tgobxLrmCnRqU4LFJQ?e=rOGNSD
Download link example: https://tenant.sharepoint.com/sites/Folder/_layouts/15/download.aspx?SourceUrl=%2Fsites%2FFolder%2FShared%20Documents%2FGeneral%2FBig%2Dfile%2Epdf
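Once the download link pattern is known, building it for each file can be scripted. The Python sketch below assumes the pattern of the example link above also holds for your tenant. build_download_link and its parameters are hypothetical names. It leaves - and . unescaped, which should decode to the same path as the %2D and %2E forms in the example.

```python
from urllib.parse import quote

def build_download_link(tenant_host, site_path, file_path):
    """Build a download.aspx link for a file below the given site path."""
    # Percent-encode the whole server-relative path, slashes included.
    source = quote(f"{site_path}/{file_path}", safe="")
    return (
        f"https://{tenant_host}{site_path}"
        f"/_layouts/15/download.aspx?SourceUrl={source}"
    )

link = build_download_link(
    "tenant.sharepoint.com",
    "/sites/Folder",
    "Shared Documents/General/Big-file.pdf",
)
print(link)
```

The printed link would then take the place of the download link in the second curl command, together with the saved cookies.txt.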
Similar to the answer Zyglute gave, using cURL:
You can export your login cookie using the cookies.txt Chrome extension: https://chrome.google.com/webstore/detail/njabckikapfpffapmjgojcnbfjonfjfg
Then use the following command (the URL is quoted so the shell does not interpret the & characters):
curl -b cookie.txt "https://OurDomain.sharepoint.com/_XXX&file=IPS_cleaned.xlsx&action=default"
At some point your Sharepoint session will expire (not sure how long that takes), and you will need a new cookie file.
EDIT: If a malicious user gets a hold of your cookie.txt, they could get into your SharePoint account, so be sure to keep it safe.
Use wget, adding &download=1 at the end of the link:
wget "<yourlink>&download=1"
The file will be saved under a name derived from <yourlink>, so rename it with mv afterwards.

How to use wget on a page with authentication

I've been searching the internet about wget, and found many posts on how to use wget to log into a site that has a login page.
The site uses https, and the form fields the login page looks for are "userid" and "password". I've verified this by checking the Network tool in Chrome (opened with F12).
I've been using the following posts as guidelines:
http://www.unix.com/shell-programming-and-scripting/131020-using-wget-curl-http-post-authentication.html
And
wget with authentication
What I've tried:
testlab:/lua_curl_tests# wget --save-cookies cookies.txt --post-data 'userid=myid&password=123123' https://10.123.11.22/cgi-bin/acd/myapp/controller/method1
wget: unrecognized option `--save-cookies'
BusyBox v1.21.1 (2013-07-05 16:54:31 UTC) multi-call binary.
And also
testlab/lua_curl_tests# wget http://userid=myid:123123@10.123.11.22/cgi-bin/acd/myapp/controller/method1
Connecting to 10.123.11.22 (10.123.11.22:80)
wget: server returned error: HTTP/1.1 403 Forbidden
Can you tell me what I'm doing wrong? Ultimately, what I'd like to do is login, and post data, then grab the resulting page.
I'm also currently looking at curl - to see if I really should be doing this with curl (lua-curl)
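The BusyBox wget shown above rejects --save-cookies, so the cookie handling has to come from another tool. As a rough illustration of the log-in-then-fetch flow in Python's standard library, the sketch below builds a cookie-aware opener and a POST request with the two form fields. It assumes the URL from the first attempt is the form's real target, which has not been confirmed. build_login is a hypothetical helper name.

```python
import http.cookiejar
import urllib.parse
import urllib.request

# Assumption: this is the URL the login form posts to.
LOGIN_URL = "https://10.123.11.22/cgi-bin/acd/myapp/controller/method1"

def build_login(userid, password):
    """Return a cookie-storing opener and a POST request with the form fields."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    form = urllib.parse.urlencode({"userid": userid, "password": password})
    # Supplying data makes urllib send the request as a POST.
    request = urllib.request.Request(LOGIN_URL, data=form.encode("ascii"))
    return opener, request

opener, request = build_login("myid", "123123")
print(request.get_method(), request.data)
# To send it: page = opener.open(request).read()
# Later opener.open(...) calls reuse any session cookie the server set.
```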

How can I test my browser ignoring Location headers?

I want to test a site with my Firefox ignoring Location: headers like this example in PHP.
header('Location: another-page.php');
Is there a plugin available to do this, or any other method?
Would my best bet be surfing the site with Lynx? Does Lynx ignore them?
Thanks
You could try bringing up the pages with cURL.
It is a command line application that is invoked via:
curl http://url
cURL does not follow Location: headers by default.
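The same "do not follow redirects" behaviour can be reproduced in a script. This is a minimal Python sketch using the standard urllib. NoRedirect and fetch_without_redirects are illustrative names, not part of any library.

```python
import urllib.error
import urllib.request

class NoRedirect(urllib.request.HTTPRedirectHandler):
    """Refuse every redirect so the 3xx response itself can be inspected."""

    def redirect_request(self, req, fp, code, msg, headers, newurl):
        return None

def fetch_without_redirects(url):
    """Return (status, Location header or None) without following redirects."""
    opener = urllib.request.build_opener(NoRedirect)
    try:
        with opener.open(url) as response:
            return response.status, response.headers.get("Location")
    except urllib.error.HTTPError as err:
        # With redirects refused, urllib reports 3xx responses as HTTPError.
        return err.code, err.headers.get("Location")
```

Calling fetch_without_redirects on a page that sends a Location: header should return the 3xx status and the redirect target instead of the target page's content.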

Resources