How to log in to a website using wget with a POST and a redirect - linux

What I'm looking for
I want to scrape a website and have an alert as soon as there are changes.
But I am having some trouble understanding the authentication mechanism this website uses.
I have tried almost every trick I know to connect via wget, but none of them worked.
The website: offre.astria.com
What I tried
username=XXXXX
password=XXXXX
wget --save-cookies cookies.txt \
--keep-session-cookies \
--post-data "username=${username}&password=${password}" \
--delete-after \
'https://cas.astria.com/cas-ext/login?service=https://offre.astria.com'
wget --load-cookies cookies.txt \
https://offre.astria.com
Following your comments, I also tried this:
username=XXXX
password=XXXX
code=$(wget -qO- 'https://cas.astria.com/cas-ext/login?service=https://offre.astria.com' | grep 'name="lt"' | cut -d'_' -f2)
hidden_code="_$code"
wget --save-cookies cookies.txt \
--keep-session-cookies \
--post-data "username=${username}&password=${password}&lt=${hidden_code}&_eventId=submit" \
--delete-after \
'https://cas.astria.com/cas-ext/login?service=https://offre.astria.com'
wget --load-cookies cookies.txt \
https://offre.astria.com
And the error goes from an HTTP 302 to an HTTP 500 servlet internal error.
Maybe that is because the hidden field changes its value between the first request and the second ...
I tried this too:
username=XXXX
password=XXXX
code=$(wget -qO- --save-cookies cookies.txt --keep-session-cookies --delete-after 'https://cas.astria.com/cas-ext/login?service=https://offre.astria.com' | grep 'name="lt"' | cut -d'_' -f2,3 | cut -d'"' -f1)
wget --load-cookies cookies.txt \
--post-data "username=${username}&password=${password}&lt=_${code}&_eventId=submit" \
--delete-after \
'https://cas.astria.com/cas-ext/login?service=https://offre.astria.com'
wget --load-cookies cookies.txt \
https://offre.astria.com
And I get the same results :s

If you look at the source of the web page, you will find the login form has a hidden field:
<input type="hidden" name="lt" value="_c72D26258-FFFE-C561-D3EC-9BAC7F85B9A4_kCB5ECD54-57E0-1FD5-8A2F-BF72E114C604" />
The value of this field is generated server side, and changes every time you refresh the page.
This is a mechanism to prevent exactly what you are trying to do.
You may be able to bypass it by having a script read the page, extract the value, and then use wget to submit a proper form, as sketched below.
It is however possible that you will encounter other defensive measures on the site.
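Something along these lines may work as a starting point: a minimal sketch that fetches the login page once, extracts the value of the hidden lt field from that same response, and posts it back together with the credentials inside the same cookie session. The field names (username, password, lt, _eventId) are taken from the attempts above; the extraction pattern assumes the attribute order shown in the HTML snippet, and the login form may require additional parameters.
#!/bin/bash
# Sketch only: log in to the CAS endpoint by replaying the hidden "lt" token.
username=XXXXX
password=XXXXX
login_url='https://cas.astria.com/cas-ext/login?service=https://offre.astria.com'

# 1. Fetch the login page and keep both the HTML and the session cookie.
page=$(wget -qO- --save-cookies cookies.txt --keep-session-cookies "$login_url")

# 2. Extract the value of the hidden "lt" field (assumes the name="lt" value="..." order shown above).
lt=$(printf '%s' "$page" | grep -o 'name="lt" value="[^"]*"' | cut -d'"' -f4)

# 3. Submit the form with the token, reusing the same cookie jar.
wget -qO /dev/null --load-cookies cookies.txt --save-cookies cookies.txt --keep-session-cookies \
--post-data "username=${username}&password=${password}&lt=${lt}&_eventId=submit" \
"$login_url"

# 4. Fetch the protected page with the now-authenticated session.
wget --load-cookies cookies.txt https://offre.astria.com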

Related

Exclude directories from wget to create sitemap

I'm trying to use a shell script to scrape a website and get a list of all its pages. I found a shell script ("Written by Elmar Hanlhofer https://www.plop.at") and it works well. However, I need to exclude some directories, and the documented approach isn't working for me.
# Example, exclude files from /print and /slide:
# files=$(find | grep -i html | grep -v "$SITE/print" | grep -v "$SITE/slide")
I need to exclude a forum install located at /support (and all of its child directories), so I modified the code to:
files=$(find | grep -i html | grep -v "$SITE/support")
However, it is still scanning /support/directory/directory/, etc. How do I modify the grep command to exclude /support AND ALL CHILD DIRECTORIES?
I am very new to Linux/Unix commands, so I may not be expressing this correctly. Thank you.
The original script downloads the whole site, then runs find to filter out the content you don't want.
The section related to wget is copied below:
wget \
--recursive \
--no-clobber \
--page-requisites \
--convert-links \
--restrict-file-names=windows \
--no-parent \
--directory-prefix="$TMP" \
--domains $DOMAIN \
--user-agent="$AGENT" \
$URL >& $WGET_LOG
To exclude the support directory, add the --exclude-directories option:
wget \
--recursive \
--no-clobber \
--page-requisites \
--convert-links \
--restrict-file-names=windows \
--no-parent \
--directory-prefix="$TMP" \
--domains $DOMAIN \
--user-agent="$AGENT" \
--exclude-directories=/support \
$URL >& $WGET_LOG
Read this answer if you want more control over which directories are excluded.
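If you prefer to keep the post-download filtering approach from the original script, a path test on find itself is one way to drop /support and everything beneath it. A sketch, assuming find is run from inside the mirrored directory as in the original script:
# Sketch: list HTML files but skip anything inside a support/ directory,
# however deeply it is nested.
files=$(find . -iname '*.html' ! -path '*/support/*')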

How to convert curl into python

I am reading the Instagram API documentation, and there is a curl command in it. I have been trying to work out what curl -F means, but somehow I am getting nowhere...
I would appreciate it if anyone could provide some insight on this topic.
curl -F 'client_id=CLIENT_ID' \
-F 'client_secret=CLIENT_SECRET' \
-F 'grant_type=authorization_code' \
-F 'redirect_uri=AUTHORIZATION_REDIRECT_URI' \
-F 'code=CODE' \
https://api.instagram.com/oauth/access_token
-F is short for --form, but for most curl tasks I often just paste the command into a web converter.
Check out https://curl.trillworks.com/
Looks like there is a library that does it too, but I've not used it: https://github.com/spulec/uncurl
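For the -F part itself: each -F (long form --form) adds one field to a multipart/form-data POST, whereas -d sends application/x-www-form-urlencoded data. A quick way to see the difference is to send both variants to a request-echoing service (httpbin.org is only used here as an example endpoint):
# Multipart form POST: every -F becomes one form part in the request body.
curl -F 'client_id=CLIENT_ID' -F 'code=CODE' https://httpbin.org/post
# URL-encoded POST of the same fields, for comparison.
curl -d 'client_id=CLIENT_ID&code=CODE' https://httpbin.org/post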

Mailgun API cURL Using HTML, Does it accept head and body tags?

I am trying to send an email using the Mailgun API through curl. I have an HTML email, so all I did was change text to html in the sample given. I added my template, but for some reason it doesn't seem to accept head and body tags - is that right?
I have the URL with the newsletter, please help me: http://bit.ly/10B63PU
Here on the documentation the sample code for sending with api:
http://documentation.mailgun.com/quickstart-sending.html#send-via-api
I learned I had to use --form-string, not just replace text with html. The example below is directly from Mailgun; before, they only listed the text version for sending mail.
http://documentation.mailgun.com/api-sending.html#examples
curl -s --user 'api:key-3ax6xnjp29jd6fds4gc373sgvjxteol0' \
https://api.mailgun.net/v2/samples.mailgun.org/messages \
-F from='Excited User <me@samples.mailgun.org>' \
-F to='foo@example.com' \
-F cc='bar@example.com' \
-F bcc='baz@example.com' \
-F subject='Hello' \
-F text='Testing some Mailgun awesomness!' \
--form-string html='<html>HTML version of the body</html>' \
-F attachment=@files/cartman.jpg \
-F attachment=@files/cartman.png
You need to specify the HTML payload to pass HTML, otherwise it will assume it is plain text.
curl -s --user 'api:key-3ax6xnjp29jd6fds4gc373sgvjxteol0' \
https://api.mailgun.net/v2/samples.mailgun.org/messages \
-F from='Excited User <me@samples.mailgun.org>' \
-F to=baz@example.com \
-F to=bar@example.com \
-F subject='Hello' \
-F text='Plain text version of my awesome newsletter' \
-F html='<body>My awesome newsletter</body>'
Thank you for asking the question - it might not be as obvious as it seems!

Curl: How to insert a value into a cookie?

How do I insert cookie values in curl? From the Firebug request headers I can see the following:
Cookie: PHPSESSID=gg792c2ktu6sch6n8q0udd94o0; was=1; uncheck2=1; uncheck3=1; uncheck4=1; uncheck5=0; hd=1; uncheck1=1
I have tried the following:
curl http://site.com/ -s -L -b cookie.c -c cookie.c -d "was=1; uncheck2=1; uncheck3=1; uncheck4=1; uncheck5=0; hd=1; uncheck1=1" > comic
and the only thing I see in cookie.c is
PHPSESSID=gg792c2ktu6sch6n8q0udd94o0; was=1;
To pass cookie keys/values to curl, you need the -b switch, not -d.
The -d switch is for form data, which is separated by & rather than by ; in a curl command.
So:
curl http://site.com/ \
-s \
-L \
-b cookie.c \
-c cookie.c \
-b "was=1; uncheck2=1; uncheck3=1; uncheck4=1; uncheck5=0; hd=1; uncheck1=1"
> comic
If you need to know the names of the form fields to be POSTed, you can run the following command:
mech-dump --forms http://site.com/
It comes with the libwww-mechanize-perl package on Debian and its derivatives.
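Once mech-dump has listed the form fields, you can feed them back to curl with -d (name=value pairs joined by &) while still sending the cookies with -b. A sketch with made-up field names, just to show the shape of the command:
# Sketch: the field names below are invented; replace them with the ones
# reported by mech-dump for the real form.
curl http://site.com/ \
-s \
-L \
-b cookie.c \
-c cookie.c \
-b "was=1; uncheck2=1; uncheck3=1; uncheck4=1; uncheck5=0; hd=1; uncheck1=1" \
-d "field1=value1&field2=value2" \
> comic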

make my autodownloading shell script better

So I want to download multiple files from RapidShare. This is what I currently have. I created a cookie by running:
wget \
--save-cookies ~/.cookies/rapidshare \
--post-data "login=USERNAME&password=PASSWORD" \
--no-check-certificate \
-O - \
https://ssl.rapidshare.com/cgi-bin/premiumzone.cgi \
> /dev/null
and now I have a shell script which I run, and which looks like this:
#!/bin/bash
wget -c --load-cookies ~/.cookies/rapidshare http://rapidshare.com/files/219920856/file1.rar
wget -c --load-cookies ~/.cookies/rapidshare http://rapidshare.com/files/393839302/file2.rar
wget -c --load-cookies ~/.cookies/rapidshare http://rapidshare.com/files/398293204/file3.rar
....
I want two things:
The shell script needs to read the files to download from a file.
The shell script should download anywhere from 2 - 8 files at a time.
Thanks!
When you want parallel jobs, think make.
#!/usr/bin/make -f
login:
	wget -qO/dev/null \
	  --save-cookies ~/.cookies/rapidshare \
	  --post-data "login=USERNAME&password=PASSWORD" \
	  --no-check-certificate \
	  https://ssl.rapidshare.com/cgi-bin/premiumzone.cgi
$(MAKEFILES):
%: login
	wget -ca$(addsuffix .log,$(notdir $@)) \
	  --load-cookies ~/.cookies/rapidshare $@
	@echo "Downloaded $@ (log in $(addsuffix .log,$(notdir $@)))"
Save this as rsget somewhere in your $PATH (make sure you use tabs, not spaces, for indentation), make it executable with chmod +x, and run:
rsget -kj8 \
http://rapidshare.com/files/219920856/file1.rar \
http://rapidshare.com/files/393839302/file2.rar \
http://rapidshare.com/files/398293204/file3.rar \
...
This will log in, then wget each target. -j8 tells make to run up to 8 jobs in parallel, and -k means "keep going even if a target returned failure".
Edit
Tested with GNU Make 3.79 and 3.81.
Try this. I think it should do what you want:
#! /bin/bash
MAX_CONCURRENT=8
URL_BASE="http://rapidshare.com/files/"
cookie_file=~/.cookies/rapidshare
# do your login thing here...
[ -n "$1" -a -f "$1" ] || { echo "please provide a file containing the stuff to download"; exit 1; }
inputfile=$1
count=0
while read -r x; do
    if [ $count -ge $MAX_CONCURRENT ]; then
        count=0
        wait
    fi
    { wget -c --load-cookies "$cookie_file" "${URL_BASE}$x" && echo "Downloaded $x"; } &
    count=$((count + 1))
done < "$inputfile"
# wait for the last batch of background downloads before exiting
wait
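If you would rather not manage the counter and wait calls yourself, xargs can do the batching. A compact alternative sketch, assuming GNU xargs (for -P), that the login/cookie step has already run, and that the input file given as $1 holds one "ID/filename" path per line as in the script above:
#!/bin/bash
# Sketch: run up to 8 wget downloads in parallel, reading "ID/filename" lines from $1.
xargs -P 8 -I{} \
wget -c --load-cookies ~/.cookies/rapidshare "http://rapidshare.com/files/{}" \
< "$1"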
