Wget to access a university PDF - Linux

I am trying to use wget to download a PDF from my university website.
The URL looks something like this:
https://online.myuni.ac.uk/webapps/blackboard/execute/content/file?cmd=view&content_id=_xxxxxxx_1&course_id=_xxxxx_1
I have tried using wget both with cookies and with --user=xxx --password=xxx.
However, what it downloads is an HTML page showing a log-in screen that says I have insufficient permission.
I cannot get this to work and am not sure how to proceed. I am very inexperienced with Linux and programming in general, so any help is appreciated.

I finally got this working by exporting cookies from a browser; the key was to include the --keep-session-cookies parameter.
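A minimal sketch of that approach, assuming the browser cookies have been exported to a file named cookies.txt (the file names are placeholders and the URL is the placeholder one above):
# Load the exported cookies, keep session cookies alive, and write any new ones back to the jar.
wget --load-cookies=cookies.txt --save-cookies=cookies.txt --keep-session-cookies \
     -O notes.pdf \
     "https://online.myuni.ac.uk/webapps/blackboard/execute/content/file?cmd=view&content_id=_xxxxxxx_1&course_id=_xxxxx_1"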

Related

Getting wrong HTML content when trying to download a webpage using Python requests

I am trying to download a URL something like "https://some.com/ATS/cgi-bin/view.pl?axm_f_qtr_ins_chk_bd_sds%20status%20P01".
I think the problem is the %20 in it. Can you please help me with this encoding issue? The garbage webpage that gets downloaded is attached. Also, I cannot use BeautifulSoup; I am restricted to the requests package only. Thanks!

MarkLogic installation on Red Hat 7

I have a Red Hat instance on AWS and am trying to install MarkLogic. I got the URL after signing up as a MarkLogic developer. Using the following command, I get a 403 Forbidden response:
curl https://developer.marklogic.com/download/binaries/7.0/MarkLogic-7.0-6.x86_64.rpm?
The curl statement that you get after signing up and clicking the download link for a MarkLogic installer should include an access token, something like:
https://developer.marklogic.com/download/binaries/8.0/MarkLogic-8.0-4.2-x86_64.dmg?t=xxxxxxxxxxx&email=mypersonal%40email.com
You may have overlooked the last bit; it looks like the UI is wrapping the long URL after the ?. I suggest using the 'Copy to clipboard' button to make sure you get the full URL.
HTH!
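For instance, a sketch using the example URL above (the token and email values are placeholders); the quotes keep the shell from splitting the URL at the &:
# Download to an explicit file name so the query string does not end up in the file name.
curl -o MarkLogic-8.0-4.2-x86_64.dmg \
     "https://developer.marklogic.com/download/binaries/8.0/MarkLogic-8.0-4.2-x86_64.dmg?t=xxxxxxxxxxx&email=mypersonal%40email.com"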
Looks like you need to visit the site in a web browser first to register (Name, Email), and then you will get a link you can use with curl.
http://developer.marklogic.com/products/download-via-curl
You need to register an account first. Have you done that step yet?
The URL will have special characters in it (at minimum the & before the email parameter, and the @ in the email id, which appears as %40), so wrap the URL in quotes.
For example:
wget "http://developer.marklogic.com/download/binaries/9.0/MarkLogic-9.0-1.1.x86_64.rpm?t=Xtxyxyxyxx.&email=xyz.abc%40abc.com"

Download webpage source from a page that requires authentication

I would like to download the source code of a web page that requires authentication, using a shell script or something similar (like Perl, Python, etc.) on a Linux machine.
I tried to use wget and curl, but when I pass the URL, the source code that gets downloaded is for a page asking me for credentials. The same page is already open in Firefox or Chrome, but I don't know how I can re-use that session.
Basically, what I need to do is refresh this page on a regular basis and grep for some information in the source code. If I find what I'm looking for, I will trigger another script.
-- Edit --
Thanks @Alexufo. I managed to make it work this way:
1 - Download a Firefox addon that allows saving cookies to a TXT file. I used this addon: https://addons.mozilla.org/en-US/firefox/addon/export-cookies/
2 - Logged in to the site I want and saved the cookie.
3 - Using wget:
wget --load-cookies=cookie.txt 'http://my.url.com' -O output_file.txt
4 - Now the page source code is inside output_file.txt and I can parse it the way I want.
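Building on those steps, a minimal polling sketch (the URL, search text, and trigger script are placeholders) that could be run from cron to do the regular refresh-and-grep described above:
#!/bin/sh
# Re-fetch the page with the saved browser cookies and trigger another script
# when the text of interest shows up in the source.
wget --load-cookies=cookie.txt 'http://my.url.com' -O output_file.txt
if grep -q 'text I am looking for' output_file.txt; then
    ./trigger_script.sh   # placeholder for the follow-up script
fi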
cURL should work anywhere.
1) Do a first request for authorization and save the cookies.
2) Use the cookies when you make the second request to get the page source code (see the sketch below).
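A minimal sketch of that two-step flow with curl, assuming a form-based login (the login URL, form field names, and page URL are placeholders):
# Step 1: authorize and store the session cookies in a cookie jar.
curl -c cookies.txt -d 'username=me&password=secret' 'https://example.com/login'
# Step 2: reuse the cookies to fetch the page source.
curl -b cookies.txt -o page.html 'https://example.com/protected/page'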
Update:
Wget should work with POST authorization like curl:
wget with authentication
Update 2: http://www.httrack.com/
Mechanize (http://mechanize.rubyforge.org/) can do that. I am using it together with Ruby 2.0.0 for exactly that purpose.

How to auto-upload and check in files to SharePoint using curl?

I am trying to upload a file from Linux to SharePoint with my SharePoint login credentials.
I use the cURL utility to achieve this, and the upload is successful.
The command used is: curl --ntlm --user username:password --upload-file myfile.txt -k https://sharepointserver.com/sites/mysite/myfile.txt
The -k option is used to get past the certificate errors for the non-secure SharePoint site.
However, the uploaded file shows up in the "checked out" view (green arrow) in SharePoint from my login.
As a result, this file is non-existent for users from other logins.
My login has write-access privileges to SharePoint.
Any ideas on how to "check in" this file to SharePoint with cURL so that the file can be viewed from anyone's login?
I don't have curl available to test right now, but you might be able to fashion something out of the following information.
Check-in and check-out are handled by /_layouts/CheckIn.aspx.
The page has the following querystring variables:
List - A GUID that identifies the current list.
FileName - The name of the file with extension.
Source - The full URL to the AllItems.aspx page in the library.
I was able to get the CheckIn.aspx page to load correctly just using the FileName and Source parameters and omitting the List parameter. This is good because you don't have to figure out a way to look up the List GUID.
The CheckIn.aspx page posts back to itself with the following form parameters that control check-in:
PostBack - boolean, set to true.
CheckinAction - string, set to ActionCheckin.
KeepCheckout - set to 1 to keep the file checked out, or 0 to check it in.
CheckinDescription - a string of text.
Call this in curl like so (quote the URL so the shell does not split it at the &):
curl --data "PostBack=true&CheckinAction=ActionCheckin&KeepCheckout=0&CheckinDescription=SomeTextForCheckIn" "http://{Your Server And Site}/_layouts/checkin.aspx?Source={Full Url To Library}/Forms/AllItems.aspx&FileName={Doc And Ext}"
As I said, I don't have curl to test, but I got this to work using the Composer tab in Fiddler 2.
I'm trying this with curl now and there is an issue getting it to work. Fiddler was executing the request as a POST. If you try to do this as a GET request you will get a 500 error saying that the AllowUnsafeUpdates property of the SPWeb will not allow this request over GET. Sending the request as a POST should correct this.
Edit: I am currently going through the checkin.aspx source in the dotPeek decompiler and am seeing some additional options for the CheckinAction parameter that may be relevant, such as ActionCheckinPublish and ActionCheckinFromClientPublish. I will update this with any additional findings. The page is located at Microsoft.SharePoint.ApplicationPages.Checkin, for anyone else interested.
The above answer by Junx is correct. However, the Filename variable is not only the document filename and extension; it should also include the library name. I was able to get this to work using the following.
Example: http://domain/_layouts/Checkin.aspx?Filename=Shared Documents/filename.txt
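Putting the pieces together, a rough sketch that combines the NTLM credentials from the original upload command with the check-in POST described above (the server, site, library, and file names are placeholders; spaces in the URL are written as %20):
# Upload the file into the library, then POST to CheckIn.aspx so users on other logins can see it.
curl --ntlm --user username:password -k --upload-file myfile.txt \
     "https://sharepointserver.com/sites/mysite/Shared%20Documents/myfile.txt"

curl --ntlm --user username:password -k \
     --data "PostBack=true&CheckinAction=ActionCheckin&KeepCheckout=0&CheckinDescription=Uploaded+via+curl" \
     "https://sharepointserver.com/sites/mysite/_layouts/CheckIn.aspx?Source=https://sharepointserver.com/sites/mysite/Shared%20Documents/Forms/AllItems.aspx&FileName=Shared%20Documents/myfile.txt"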
My question about Performing multiple requests using cURL has a pretty comprehensive example using bash and cURL, although it suffers from having to re-enter the password for each request.
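One way to avoid re-entering the password for every request is to let curl read the credentials from a ~/.netrc file via its -n/--netrc option; a minimal sketch, with placeholder host and credentials:
# ~/.netrc (restrict permissions with: chmod 600 ~/.netrc)
#   machine sharepointserver.com login username password secret
#
# Each request can then pick up the credentials with -n instead of --user username:password.
curl -n --ntlm -k --upload-file myfile.txt "https://sharepointserver.com/sites/mysite/myfile.txt"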

Automatically saving web pages requiring login/HTTPS

I'm trying to automate some data scraping from a website. However, because the user has to go through a login screen, a wget cron job won't work, and because I need to make an HTTPS request, a simple Perl script won't work either. I've tried looking at the "DejaClick" addon for Firefox to simply replay a series of browser events (logging into the website, navigating to where the interesting data is, downloading the page, etc.), but the addon's developers for some reason didn't include saving pages as a feature.
Is there any quick way of accomplishing what I'm trying to do here?
A while back I used mechanize (wwwsearch.sourceforge.net/mechanize) and found it very helpful. It supports urllib2, so as I read now it should also work with HTTPS requests; my comment above will hopefully prove wrong.
You can record your actions with the IRobotSoft web scraper. See the demo here: http://irobotsoft.com/help/
Then use the saveFile(filename, TargetPage) function to save the target page.
