Bash scripting regarding Wget with credentials - linux

I am trying to write a script that logs in to a certain site and fetches a page as if I were logged in.
I've searched on Stack Overflow and it appears that I need three wget calls:
one to get the hidden token, one to post the credentials and save the cookies, and a last one to fetch the page I want. Here's the code:
#!/bin/bash
# get the login page to get the hidden field data
wget -a log.txt -O loginpage.html --user-agent="Mozilla/5.0" site/login
hiddendata=$(grep __Req loginpage.html | head -n1 | cut -d'"' -f6)
echo "Logging with user $1 and pass $2"
wget --secure-protocol=auto --save-cookies cookies.txt --post-data="LoginDataModel.LoginName=$1&LoginDataModel.Password=$2&__RequestVerificationToken=${hiddendata}" --user-agent="Mozilla/5.0" site/login/login
where site/login is the login page, site/login/login is the form's post action, and the post-data keys are the names of the login form's fields. When I run the script I get:
Logging with user x and pass y
--2015-02-07 12:29:07-- site/Login/Login
Resolving site (site)... 91.208.180.39
Connecting to site (site)|91.208.180.39|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: site/Login/Login [following]
--2015-02-07 12:29:18-- site/Login/Login
Resolving site (site)... 91.208.180.39
Connecting to site (site)|91.208.180.39|:80... connected.
HTTP request sent, awaiting response... 404 Not Found
2015-02-07 12:29:23 ERROR 404: Not Found.
When I check, site/login/login does exist. What am I doing wrong? Thank you.
I haven't written the third wget to fetch the page I want yet, since I can't log in properly.

I still can't comment, so I will post this as an answer. I would recommend using curl instead of wget.
An example of how to do it with curl is the AppNexus Authentication Service documentation. From your description, it's exactly what you need:
http://wiki.appnexus.com/display/sdk/Authentication+Service
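For the three-step flow in the question, a rough curl sketch could look like this. It reuses the same token extraction and field names as the wget version above; "site" is the question's placeholder hostname, and none of this has been tested against the real site, so treat it as a starting point only.
#!/bin/bash
# 1) fetch the login page, save any session cookies, and pull out the hidden token
token=$(curl -s -c cookies.txt -A "Mozilla/5.0" https://site/login \
        | grep __RequestVerificationToken | head -n1 | cut -d'"' -f6)
# 2) post the credentials plus the token, reading and updating the same cookie jar;
#    -L follows the 301 that wget reported instead of stopping on it
curl -s -L -b cookies.txt -c cookies.txt -A "Mozilla/5.0" \
     --data "LoginDataModel.LoginName=$1&LoginDataModel.Password=$2&__RequestVerificationToken=$token" \
     -o /dev/null https://site/login/login
# 3) fetch the page you actually want, authenticated through the cookie jar
curl -s -b cookies.txt -A "Mozilla/5.0" -o page.html https://site/the/page/you/want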

Related

How to get http status code and content separately using curl in linux

I have to fetch some data using the curl Linux utility. There are two cases: either the request succeeds or it fails. If the request succeeds I want to save the output to a file; if it fails for some reason, only the error code should be saved to a log file. I have searched a lot on the web but could not find an exact solution, which is why I have posted a new question on curl.
One option is to get the response code with -w, so you could do something like:
code=$(curl -s -o file -w '%{response_code}' http://example.com/)
if test "$code" != "200"; then
    echo "$code" >> response-log
else
    echo "wohoo, 'file' is fine"
fi
curl -I -s -L <Your URL here> | grep "HTTP/1.1"
curl + grep is your friend; you can then extract the status code for your needs.
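For example, a small sketch that keeps only the numeric code of the final response (it assumes the status lines really say HTTP/1.1, as in the grep above):
# -I asks for headers only, -L follows redirects; tail keeps the last status line, awk prints its second field
curl -I -s -L https://example.com/ | grep "HTTP/1.1" | tail -n1 | awk '{print $2}'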

grep and curl commands

I am trying to find the instances of the word (pattern) "Zardoz" in the output of this command:
curl http://imdb.com/title/tt0070948
I tried using: curl http://imdb.com/title/tt0070948 | grep "Zardoz"
but it just returned "file not found".
Any suggestions? I would like to use grep to do this.
You need to tell curl to use the -L (--location) option:
curl -L http://imdb.com/title/tt0070948 | grep "Zardoz"
(HTTP/HTTPS) If the server reports that the requested page has
moved to a different location (indicated with a Location: header
and a 3XX response code), this option will make curl redo the
request on the new place
When curl follows a redirect and the request is not a plain GET
(for example POST or PUT), it will do the following request with
a GET if the HTTP response was 301, 302, or 303. If the response
code was any other 3xx code, curl will re-send the following
request using the same unmodified method.
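A small variation on the same idea: once -L is in place you can just count the matching lines instead of printing them (a sketch; -s merely hides curl's progress meter):
# count the lines of the (redirected-to) page that mention Zardoz
curl -s -L http://imdb.com/title/tt0070948 | grep -c "Zardoz"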

How to see all Request URLs the server is doing (final URLs)

How can I list, from the command line, the URL requests that are made from the server (a *nix machine) to another machine?
For instance, I am on the command line of server ALPHA_RE.
I do a ping to google.co.uk and another ping to bbc.co.uk
I would like to see, from the prompt:
google.co.uk
bbc.co.uk
So, not the IP address of the machine I am pinging, and NOT a URL from servers that pass my request along to google.co.uk or bbc.co.uk, but the actual final URLs.
Note that only packages available in the normal Ubuntu repositories can be used, and it has to work from the command line.
Edit
The ultimate goal is to see which API URLs a PHP script (run by a cron job) requests, and which API URLs the server requests 'live'.
These mainly make GET and POST requests to several URLs, and I am interested in knowing the params:
Does it make a request to:
foobar.com/api/whatisthere?and=what&is=there&too=yeah
or to:
foobar.com/api/whatisthathere?is=it&foo=bar&green=yeah
And do the cron jobs or the server make any other GET or POST requests?
And that regardless of what response (if any) these APIs give.
Also, the API list is unknown, so you cannot grep for one particular URL.
Edit:
(The OLD ticket specified: Note that I cannot install anything on that server (no extra packages, I can only use the "normal" commands like tcpdump, sed, grep, ...) // but as getting this information with tcpdump is pretty hard, I have since made installing packages possible.)
You can use tcpdump and grep to get info about the network traffic from the host. The following command line should get you all lines containing Host:
tcpdump -i any -A -vv -s 0 | grep -e "Host:"
If I run the above in one shell and start a Links session to stackoverflow I see:
Host: www.stackoverflow.com
Host: stackoverflow.com
If you want to know more about the actual HTTP request, you can also add patterns to the grep for GET, PUT or POST requests (e.g. -e "GET"), which gives you some info about the relative URL (this should be combined with the previously determined host to get the full URL).
EDIT:
based on your edited question I have tried to make some modification:
first a tcpdump approach:
[root@localhost ~]# tcpdump -i any -A -vv -s 0 | egrep -e "GET" -e "POST" -e "Host:"
tcpdump: listening on any, link-type LINUX_SLL (Linux cooked), capture size 65535 bytes
E..v.[#.#.......h.$....P....Ga .P.9.=...GET / HTTP/1.1
Host: stackoverflow.com
E....x#.#..7....h.$....P....Ga.mP...>;..GET /search?q=tcpdump HTTP/1.1
Host: stackoverflow.com
And an ngrep one:
[root@localhost ~]# ngrep -d any -vv -W byline | egrep -e "Host:" -e "GET" -e "POST"
GET //meta.stackoverflow.com HTTP/1.1..Host: stackoverflow.com..User-Agent:
GET //search?q=tcpdump HTTP/1.1..Host: stackoverflow.com..User-Agent: Links
My test case was running links stackoverflow.com, putting tcpdump in the search field and hitting enter.
This gets you all the URL info on one line. A nicer alternative might be to simply run a reverse proxy (e.g. nginx) on your own server, modify the hosts file (as shown in Adam's answer), and have the reverse proxy redirect all queries to the actual host; the logging features of the reverse proxy then give you the URLs, and those logs would probably be a bit easier to read.
EDIT 2:
If you use a command line such as:
ngrep -d any -vv -W byline | egrep -e "Host:" -e "GET" -e "POST" --line-buffered | perl -lne 'print $3.$2 if /(GET|POST) (.+?) HTTP\/1\.1\.\.Host: (.+?)\.\./'
you should see the actual URLs.
A simple solution is to modify your '/etc/hosts' file to intercept the API calls and redirect them to your own web server:
127.0.0.1 api.foobar.com
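A hedged sketch of that hosts-file approach, assuming a local nginx (or any web server) is listening on 127.0.0.1 and writing its access log to the usual Ubuntu path; the hostname and the log location are only examples:
# point the API hostname at this machine (hosts file format: IP first, then hostname)
echo "127.0.0.1 api.foobar.com" | sudo tee -a /etc/hosts
# then watch the local web server's access log to see which paths and query strings get requested
sudo tail -f /var/log/nginx/access.log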

Auto-download involving password and specific online clicking

I want to use cron to do a daily download of portfolio info with 2 added complications:
It needs a password
I want to get the format I can get, when on the site myself, by clicking on "Download to a Spreadsheet".
If I use:
wget -U Chromium --user='e-address' --password='pass' \
https://www.google.com/finance/portfolio > "file_"`date +"%d-%m-%Y"`+.csv
I Get the response:
=========================================================================
--2013-10-20 12:16:13-- https://www.google.com/finance/portfolio
Resolving www.google.com (www.google.com)... 74.125.195.105, 74.125.195.103, 74.125.195.99, ...
Connecting to www.google.com (www.google.com)|74.125.195.105|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: ‘portfolio’
[ <=> ] 16,718 --.-K/s in 0.04s
2013-10-20 12:16:13 (431 KB/s) - ‘portfolio’ saved [16718]
==========================================================================
It saves to a file called "portfolio" rather than where I asked it to ("file_"`date +"%d-%m-%Y"`+.csv).
When I look at "portfolio" in the browser, it says I need to sign in to my account, i.e. no notice is taken of the user and password information I've included.
If I add to the web address the string I get by hovering over the "Download to a Spreadsheet" link:
wget -U Chromium --user='e-address' --password='pass' \
https://www.google.com/finance/portfolio?... > "file_"`date +"%d-%m-%Y"`+.csv
I get:
[1] 5175
[2] 5176
[3] 5177
[4] 5178
--2013-10-20 12:44:56-- https://www.google.com/finance/portfolio?pid=1
Resolving www.google.com (www.google.com)... [2] Done output=csv
[3]- Done action=view
[4]+ Done pview=pview
hg21@hg21-sda2:~$ 74.125.195.106, 74.125.195.103, 74.125.195.104, ...
Connecting to www.google.com (www.google.com)|74.125.195.106|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: ‘portfolio?pid=1’
[ <=> ] 16,768 --.-K/s in 0.05s
2013-10-20 12:44:56 (357 KB/s) - ‘portfolio?pid=1.1’ saved [16768]
and at this point it hangs. The file it writes (‘portfolio?pid=1’) is the same as the 'portfolio' file from the previous wget.
If I then put in my password it continues:
pass: command not found
[1]+ Done wget -U Chromium --user="e-address" --password='pass' https://www.google.com/finance/portfolio?pid=1
[1]+ Done wget -U Chromium --user="e-address" --password='pass' https://www.google.com/finance/portfolio?pid=1
Any help much appreciated.
There are a couple of issues here:
1) wget is not saving to the correct filename
Use the -O option instead of > shell redirection.
Change > file_`date +"%d-%m-%Y"`.csv to -O file_`date +"%d-%m-%Y"`.csv
Tip: If you use date +"%Y-%m-%d", your files will naturally sort chronologically.
This is essentially a duplicate of "wget command to download a file and save as a different filename".
See also man wget for options.
2) wget is spawning multiple processes and "hanging"
You have &s in your URL which are being interpreted by the shell instead of being included in the argument passed to wget. You need to wrap the URL in quotation marks.
https://finance.google.com/?...&...&...
becomes
"https://finance.google.com/?...&...&..."

curl and cookies - command line tool

I am trying to get more familiar with curl, so I am trying to send a POST request to a webpage on a site I am already logged into:
curl --data "name=value" http://www.mysite.com >> post_request.txt
The output consisted of the whole HTML page with a message telling me I was not logged in, so I retrieved the site's cookie (called PHPSESSID, is that of any importance?), stored it in /tmp/cookies.txt, and then ran:
curl --cookie /tmp/cookies.txt --cookie-jar /tmp/cookies.txt --data "name=value" http://www.mysite.com > post_request.txt
but I still get the same message. Could someone help?
I suggest you take a network dumping tool (Wireshark or tcpdump) and sniff the communication between your shell and the server. Check whether the cookie is sent and whether it has the right content.
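A minimal sketch of that check with tcpdump, assuming the site is reachable over plain HTTP on port 80 (over HTTPS the headers are encrypted, so you would need the server's logs or a debugging proxy instead); www.mysite.com is the question's placeholder hostname:
# dump the outgoing request headers for that host and show only the Cookie lines
sudo tcpdump -i any -A -s 0 'tcp port 80 and host www.mysite.com' | grep -e "Cookie:"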
