Download webpage source from a page that requires authentication - linux

I would like to download the source code of a webpage that requires authentication, using a shell script or something similar (Perl, Python, etc.) on a Linux machine.
I tried to use wget and curl, but when I pass the URL, the source code that gets downloaded is for a page that asks me for credentials. The same page is already open in Firefox or Chrome, but I don't know how I can re-use that session.
Basically what I need to do is refresh this page on a regular basis and grep for some information in the source code. If I find what I'm looking for, I will trigger another script.
-- Edit --
Thanks @Alexufo. I managed to make it work this way:
1 - Downloaded a Firefox addon that lets me save the cookies to a TXT file. I used this addon: https://addons.mozilla.org/en-US/firefox/addon/export-cookies/
2 - Logged in to the site I want and saved the cookies.
3 - Using wget:
wget --load-cookies=cookie.txt 'http://my.url.com' -O output_file.txt
4 - Now the page source code is inside output_file.txt and I can parse it the way I want.
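To handle the "refresh on a regular basis and grep" part, a minimal bash loop along these lines should work (the search pattern, the 300-second interval, and trigger.sh are hypothetical placeholders; cookie.txt is the file exported in step 2):
#!/bin/bash
# Periodically fetch the page with the exported cookies and grep for a pattern.
PATTERN="the text I am looking for"
while true; do
    wget --load-cookies=cookie.txt 'http://my.url.com' -O output_file.txt -q
    if grep -q "$PATTERN" output_file.txt; then
        ./trigger.sh        # run the other script once the pattern shows up
        break               # or keep looping, depending on what you need
    fi
    sleep 300               # wait 5 minutes before refreshing again
done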

curl should work anywhere.
1) Make a first request to log in, and save the cookies.
2) Use those cookies on a second request to get the source code of your page.
Update:
wget should work with POST authorization just like curl:
wget with authentication
Update 2: http://www.httrack.com/
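A rough sketch of that two-step flow with curl (the login URL and form field names below are made up; they need to match whatever the site's login form actually uses):
# 1) POST the login form and save the session cookies (URL and field names are hypothetical).
curl -c cookies.txt -d "username=myuser&password=mypass" 'http://my.url.com/login'
# 2) Reuse those cookies to fetch the protected page source.
curl -b cookies.txt 'http://my.url.com/protected/page' -o output_file.txt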

Mechanize (http://mechanize.rubyforge.org/) can do that. I am using it together with Ruby 2.0.0 for exactly that.

Related

Executing PHP scripts without opening a browser

I want to execute a PHP file located on my Apache server (localhost or remote) from Processing, but I want to do it without opening the browser. How can I do this? The PHP script just adds data to a MySQL database using values obtained from a GET request. Is this possible? Actually I tried using link("/receiver.php?a=1&b=2") but it opens a web page containing the PHP output.
Ideally such scripts should be generic so they can be used both from the web and from bash scripts. If you cannot change or modify the script, then I would suggest using curl from a terminal or bash script to make the HTTP request to the given link.
Refer to this solution for how to make a request with curl.
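For this particular case, something along these lines should work from a terminal or a bash script (the localhost path mirrors the receiver.php example from the question):
# Call the PHP script directly over HTTP; no browser is involved.
# Quote the URL so the shell does not treat '&' as a background operator.
curl "http://localhost/receiver.php?a=1&b=2"
# Add -s to silence progress output and -o /dev/null to discard the response body
# if you only care about the side effect (the MySQL insert):
curl -s -o /dev/null "http://localhost/receiver.php?a=1&b=2"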

MarkLogic installation on Red Hat 7

I have a Red Hat instance on AWS. I am trying to install MarkLogic. I got the URL after signing up as a MarkLogic developer. Using the following command, I get a 403 Forbidden response.
curl https://developer.marklogic.com/download/binaries/7.0/MarkLogic-7.0-6.x86_64.rpm?
The curl statement that you get after signing up and clicking the download link for a MarkLogic installer should include an access token. Something like:
https://developer.marklogic.com/download/binaries/8.0/MarkLogic-8.0-4.2-x86_64.dmg?t=xxxxxxxxxxx&email=mypersonal%40email.com
You may have overlooked the last bit; it looks like the UI wraps the long URL after the '?'. I suggest using the 'Copy to clipboard' button to make sure you get the full URL.
HTH!
Looks like you need to visit the site in a web browser first to register (Name, Email), and then you will get a link you can use with curl.
http://developer.marklogic.com/products/download-via-curl
You need to register an account first. Have you done that step yet?
The URL will have special characters in it (at minimum the '&' between the parameters and the encoded '@' in the email id), so wrap the URL in quotes.
For example:
wget "http://developer.marklogic.com/download/binaries/9.0/MarkLogic-9.0-1.1.x86_64.rpm?t=Xtxyxyxyxx.&email=xyz.abc%40abc.com"

Script to Open URL as HTTP POST request (not just GET) in web browser

How can I open a URL from a command line script as a POST request? I'd like to do this on Linux, Windows, and macOS, so a cross-platform way of doing this would be ideal.
For each of these I can open a GET request to a URL using:
xdg-open [url] - Linux
start [url] - Windows
open [url] - MacOS
... but I don't see a way of doing a POST request. Is it possible, and if so how?
Also, showing how to add a POST body to the request would be appreciated too. FYI, the script I'm writing is in Ruby, so anything built into the OS or something using Ruby is fine too.
UPDATED:
To clarify, I'd like this to open in the default browser, not just issue the POST request and get the result.
Sorry, but there's no reliable way to do this across browsers and across operating systems. This was a topic the IE team spent a bunch of time looking into when we built Accelerators in IE8.
The closest you can come is writing an HTML page to disk that has an auto-submitting form (or XmlHttpRequest), then invoking the browser to load that page, which will submit the form. This constrains your code to only submitting forms that are legal to submit from the browser (e.g. only the three possible content types, etc.).
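A minimal sketch of that temp-file approach from a shell script (the target URL and field names mirror the example in the Ruby answer below and are just placeholders; the command that opens the browser depends on the OS, as listed in the question):
#!/bin/bash
# Write a page with an auto-submitting POST form to a temp file, then open it in
# the default browser. URL and field names are hypothetical placeholders.
TMPFILE="$(mktemp --suffix=.html)"   # GNU mktemp (Linux)
cat > "$TMPFILE" <<'EOF'
<!DOCTYPE html>
<html>
  <body onload="document.forms[0].submit()">
    <form action="http://www.example.com/search.cgi" method="post">
      <input type="hidden" name="q"   value="ruby">
      <input type="hidden" name="max" value="50">
    </form>
  </body>
</html>
EOF
xdg-open "$TMPFILE"     # Linux; use 'open' on macOS or 'start' on Windows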
If you're using Ruby you can use the Net::HTTP library: http://ruby-doc.org/stdlib-2.0/libdoc/net/http/rdoc/Net/HTTP.html
require 'net/http'
uri = URI('http://www.example.com/search.cgi')
res = Net::HTTP.post_form(uri, 'q' => 'ruby', 'max' => '50')
puts res.body
I don't know how compatible my solution is across browsers and platforms.
But instead of using a temporary file as described by @EricLaw,
you can encode that HTML with the auto-submit form in a data URL, as I did here:
https://superuser.com/questions/1281950/open-chrome-with-method-post-body-data-on-specific-url/1631277#1631277

How to auto upload and check in files to SharePoint using curl?

I am trying to upload a file from Linux to SharePoint with my SharePoint login credentials.
I use the cURL utility to achieve this. The upload is successful.
The command used is : curl --ntlm --user username:password --upload-file myfile.txt -k https://sharepointserver.com/sites/mysite/myfile.txt
The -k option is used to get past the certificate errors for the non-secure SharePoint site.
However, this uploaded file shows up in the "checked out" view (green arrow) in SharePoint from my login.
As a result, this file is invisible to users from other logins.
My login has write access privileges to SharePoint.
Any ideas on how to "check in" this file to SharePoint with cURL so that the file can be viewed from anyone's login?
I don't have curl available to test right now, but you might be able to fashion something out of the following information.
Check in and check out is handled by /_layouts/CheckIn.aspx
The page has the following querystring variables:
List - A GUID that identifies the current list.
FileName - The name of the file with extension.
Source - The full URL to the AllItems.aspx page in the library.
I was able to get the CheckIn.aspx page to load correctly just using the FileName and Source parameters and omitting the List parameter. This is good because you don't have to figure out a way to look up the List GUID.
The CheckIn.aspx page postbacks to itself with the following form parameters that control checkin:
PostBack - boolean set to true.
CheckInAction - string set to ActionCheckin
KeepCheckout - set to 1 to keep checkout and 0 to keep checked in
CheckinDescription - string of text
Call this in curl like so (quote the URL so the shell does not split it at the '&'):
curl --data "PostBack=true&CheckinAction=ActionCheckin&KeepCheckout=0&CheckinDescription=SomeTextForCheckIn" "http://{Your Server And Site}/_layouts/checkin.aspx?Source={Full Url To Library}/Forms/AllItems.aspx&FileName={Doc And Ext}"
As I said, I don't have curl to test, but I got this to work using the Composer tab in Fiddler 2.
I'm trying this with curl now and there is an issue getting it to work. Fiddler was executing the request as a POST. If you try to do this as a GET request you will get a 500 error saying that the AllowUnsafeUpdates property of the SPWeb will not allow this request over GET. Sending the request as a POST should correct this.
Edit: I am currently going through the checkin.aspx source in the DotPeek decompiler and seeing some additional options for the ActionCheckin parameter that may be relevant, such as ActionCheckinPublish and ActionCheckinFromClientPublish. I will update this with any additional findings. The page class is located at Microsoft.SharePoint.ApplicationPages.Checkin, for anyone else interested.
The above answer by Junx is correct. However, the Filename variable is not just the document filename and extension; it should also include the library name. I was able to get this to work using the following.
Example: http://domain/_layouts/Checkin.aspx?Filename=Shared Documents/filename.txt
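Putting the upload and the check-in together, a rough bash sketch might look like this (server, site, library, file name, and credentials below are placeholders for your own values; the form parameters are the ones described above):
#!/bin/bash
# Placeholders: adjust server, site, library, credentials, and file name to your setup.
SERVER="https://sharepointserver.com"
SITE="/sites/mysite"
LIBRARY="Shared%20Documents"   # URL-encoded library name
FILE="myfile.txt"
# 1) Upload the file (it will be left checked out to your account).
curl --ntlm --user username:password -k --upload-file "$FILE" "$SERVER$SITE/$LIBRARY/$FILE"
# 2) Check it in so users with other logins can see it. Note the library name in
#    Filename and the quoted URL so the shell does not split it on '&'.
curl --ntlm --user username:password -k --data "PostBack=true&CheckinAction=ActionCheckin&KeepCheckout=0&CheckinDescription=Uploaded+via+curl" "$SERVER$SITE/_layouts/CheckIn.aspx?Filename=$LIBRARY/$FILE"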
My question about Performing multiple requests using cURL has a pretty comprehensive example using bash and cURL, although it suffers from having to re-enter the password for each request.

Automatically saving web pages requiring login/HTTPS

I'm trying to automate some data scraping from a website. However, because the user has to go through a login screen, a wget cron job won't work, and because I need to make an HTTPS request, a simple Perl script won't work either. I've tried looking at the "DejaClick" addon for Firefox to simply replay a series of browser events (logging into the website, navigating to where the interesting data is, downloading the page, etc.), but the addon's developers for some reason didn't include saving pages as a feature.
Is there any quick way of accomplishing what I'm trying to do here?
A while back I used mechanize (wwwsearch.sourceforge.net/mechanize) and found it very helpful. It supports urllib2, so it should also work with HTTPS requests, as I've now read. So my comment above could hopefully prove wrong.
You can record your actions with the IRobotSoft web scraper. See the demo here: http://irobotsoft.com/help/
Then use the saveFile(filename, TargetPage) function to save the target page.
