Get final URL after curl is redirected - linux

I need to get the final URL after a page redirect, preferably with curl or wget.
For example http://google.com may redirect to http://www.google.com.
The contents are easy to get (e.g. curl --max-redirs 10 http://google.com -L), but I'm only interested in the final URL (in the example above, http://www.google.com).
Is there any way of doing this by using only Linux built-in tools? (command line only)

curl's -w option and its variable url_effective are what you are looking for.
Something like
curl -Ls -o /dev/null -w %{url_effective} http://google.com
More info
-L Follow redirects
-s Silent mode. Don't output anything
-o FILE Write output to FILE instead of stdout
-w FORMAT What to output after completion
You might want to add -I (that is an uppercase i) as well, which will make the command not download any body. However, it then uses the HEAD method, which is not what the question asked for and risks changing what the server does. Sometimes servers don't respond well to HEAD even when they respond fine to GET.
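To make the trade-off concrete, here are both variants side by side (a sketch; both use only the options already mentioned above):
# GET: transfers the body but throws it away, so the server sees a normal request
curl -sL -o /dev/null -w '%{url_effective}\n' http://google.com
# HEAD: no body is transferred, but some servers treat HEAD differently than GET
curl -sIL -o /dev/null -w '%{url_effective}\n' http://google.com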

Thanks, that helped me. I made some improvements and wrapped that in a helper script "finalurl":
#!/bin/bash
curl "$1" -s -L -I -o /dev/null -w '%{url_effective}'
-o output to /dev/null
-I don't actually download, just discover the final URL
-s silent mode, no progressbars
This made it possible to call the command from other scripts like this:
echo `finalurl http://someurl/`

as another option:
$ curl -i http://google.com
HTTP/1.1 301 Moved Permanently
Location: http://www.google.com/
Content-Type: text/html; charset=UTF-8
Date: Sat, 19 Jun 2010 04:15:10 GMT
Expires: Mon, 19 Jul 2010 04:15:10 GMT
Cache-Control: public, max-age=2592000
Server: gws
Content-Length: 219
X-XSS-Protection: 1; mode=block
<HTML><HEAD><meta http-equiv="content-type" content="text/html;charset=utf-8">
<TITLE>301 Moved</TITLE></HEAD><BODY>
<H1>301 Moved</H1>
The document has moved
<A HREF="http://www.google.com/">here</A>.
</BODY></HTML>
But it doesn't go past the first one.

Thank you. I ended up implementing your suggestions: curl -i + grep
curl -i http://google.com -L | egrep -A 10 '301 Moved Permanently|302 Found' | grep 'Location' | awk -F': ' '{print $2}' | tail -1
Returns blank if the website doesn't redirect, but that's good enough for me as it works on consecutive redirections.
Could be buggy, but at a glance it works ok.
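A slightly more tolerant sketch (untested here) that doesn't depend on the exact status text, and therefore also catches 303/307/308 redirects, is to grab every Location header case-insensitively and keep the last one:
curl -siL http://google.com | grep -i '^location:' | awk '{print $2}' | tr -d '\r' | tail -1
Like the pipeline above, this still returns blank when there is no redirect at all.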

You can usually do this with wget: wget --content-disposition "url". Additionally, if you add -O /dev/null you will not actually be saving the file.
wget -O /dev/null --content-disposition example.com
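If you only have wget available, another fragile sketch is to parse its human-readable log for the last Location header it reports before settling on the final page (the exact output format depends on the wget version and locale):
wget -O /dev/null http://google.com 2>&1 | grep '^Location: ' | tail -1 | awk '{print $2}'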

The parameters -L (--location) and -I (--head) together still do an unnecessary HEAD request to the location URL.
If you are sure that you will have no more than one redirect, it is better to disable following the location and use the curl variable %{redirect_url}.
This code does only one HEAD request to the specified URL and takes %{redirect_url} from the Location header:
curl --head --silent --write-out "%{redirect_url}\n" --output /dev/null "https://goo.gl/QeJeQ4"
Speed test
all_videos_link.txt contains 50 goo.gl/bit.ly links that redirect to YouTube.
1. With follow location
time while read -r line; do
    curl -kIsL -w "%{url_effective}\n" -o /dev/null "$line"
done < all_videos_link.txt
Results:
real 1m40.832s
user 0m9.266s
sys 0m15.375s
2. Without follow location
time while read -r line; do
    curl -kIs -w "%{redirect_url}\n" -o /dev/null "$line"
done < all_videos_link.txt
Results:
real 0m51.037s
user 0m5.297s
sys 0m8.094s
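If a chain may be longer than one hop but you still want to stay with %{redirect_url}, one possible compromise (a sketch, not benchmarked against the numbers above) is to resolve it in a small loop, one HEAD request per hop:
#!/bin/bash
# follow redirects hop by hop via %{redirect_url}, without curl's -L
url="$1"
hops=0
while [ "$hops" -lt 10 ]; do                 # safety limit against redirect loops
    next=$(curl -ksI -o /dev/null -w '%{redirect_url}' "$url")
    [ -z "$next" ] && break                  # empty redirect_url: no further redirect
    url="$next"
    hops=$((hops + 1))
done
echo "$url"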

curl can only follow HTTP redirects. To also follow meta refresh directives and JavaScript redirects, you need a full-blown browser like headless Chrome:
#!/bin/bash
real_url () {
    printf 'location.href\nquit\n' | \
    chromium-browser --headless --disable-gpu --disable-software-rasterizer \
        --disable-dev-shm-usage --no-sandbox --repl "$@" 2> /dev/null \
        | tr -d '>>> ' | jq -r '.result.value'
}
If you don't have chrome installed, you can use it from a docker container:
#!/bin/bash
real_url () {
    printf 'location.href\nquit\n' | \
    docker run -i --rm --user "$(id -u "$USER")" --volume "$(pwd)":/usr/src/app \
        zenika/alpine-chrome --no-sandbox --repl "$@" 2> /dev/null \
        | tr -d '>>> ' | jq -r '.result.value'
}
Like so:
$ real_url http://dx.doi.org/10.1016/j.pgeola.2020.06.005
https://www.sciencedirect.com/science/article/abs/pii/S0016787820300638?via%3Dihub

This would work:
curl -I somesite.com | perl -n -e '/^Location: (.*)$/ && print "$1\n"'
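Note that over HTTP/2 curl prints header names in lowercase, so a case-insensitive match is a bit more robust; a sketch of the same idea:
curl -sI somesite.com | perl -n -e '/^location: (\S+)/i && print "$1\n"'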

I'm not sure how to do it with curl, but libwww-perl installs the GET alias.
$ GET -S -d -e http://google.com
GET http://google.com --> 301 Moved Permanently
GET http://www.google.com/ --> 302 Found
GET http://www.google.ca/ --> 200 OK
Cache-Control: private, max-age=0
Connection: close
Date: Sat, 19 Jun 2010 04:11:01 GMT
Server: gws
Content-Type: text/html; charset=ISO-8859-1
Expires: -1
Client-Date: Sat, 19 Jun 2010 04:11:01 GMT
Client-Peer: 74.125.155.105:80
Client-Response-Num: 1
Set-Cookie: PREF=ID=a1925ca9f8af11b9:TM=1276920661:LM=1276920661:S=ULFrHqOiFDDzDVFB; expires=Mon, 18-Jun-2012 04:11:01 GMT; path=/; domain=.google.ca
Title: Google
X-XSS-Protection: 1; mode=block

Can you try this?
#!/bin/bash
LOCATION=`curl -I 'http://your-domain.com/url/redirect?r=something&a=values-VALUES_FILES&e=zip' | perl -n -e '/^Location: (.*)$/ && print "$1\n"'`
echo "$LOCATION"
Note: when you execute the command, you have to put the URL in single quotes, like curl -I 'http://your-domain.com', so that the shell does not interpret the & characters.

You could use grep. Doesn't wget tell you where it's redirecting to? Just grep that out.

Related

How can I use WGET to get only status info and save it somewhere?

Can I use wget to get, let's say, the status 200 OK and save that status somewhere? If not, how can I do that using Ubuntu Linux?
Thanks!
With curl you can
curl -L -o /dev/null -s -w "%{http_code}\n" http://google.com >> status.txt
With wget, you use --save-headers to add the headers to the output, send the output to the console using -O -, discard the error stream using 2>/dev/null, and keep only the status line using grep HTTP/.
You can then write that into a file using > status_file:
$ wget --save-headers -O - http://google.com/ 2>/dev/null | grep HTTP/ > status_file
The question asks that the output of the wget command be stored somewhere. As an alternative, the following example stores the HTTP status in a shell variable (wget_status) and then prints it with echo:
$ wget_status=$(wget --server-response ${URL} 2>&1 | awk '/^  HTTP/{print $2}')
$ echo $wget_status
200
After the wget command finishes, the status can be acted on through the value of the wget_status variable, for example:
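A hedged follow-up sketch (the URL is just a placeholder; --spider avoids downloading the page, the awk pattern is loosened so it does not depend on exact indentation, and tail -1 keeps only the final status of a redirect chain):
URL="http://www.example.com/"
wget_status=$(wget --server-response --spider "${URL}" 2>&1 | awk '/HTTP\//{print $2}' | tail -1)

if [ "${wget_status}" = "200" ]; then
    echo "OK: ${URL}"
else
    echo "Got status ${wget_status} for ${URL}" >&2
fi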
For more information consult the following link as a reference:
https://www.unix.com/shell-programming-and-scripting/148595-capture-http-response-code-wget.html
The tests were executed using Cloud Shell on a Linux system:
Linux cs-335831867014-default 5.10.90+ #1 SMP Wed Mar 23 09:10:07 UTC 2022 x86_64 GNU/Linux

Redirect cat output to bash script [duplicate]

I need to write a bash script that reads the subdomains from "subdomains.txt" (one per line) and shows me their HTTP response codes. I want it to be used this way:
cat subdomains.txt | ./httpResponse
The problem is that I don't know how to make the bash script read the subdomain names. Obviously, I need to use a loop, something like this:
for subdomains in list
do
echo curl --write-out "%{http_code}\n" --silent --output /dev/null "subdomain"
done
But how can I populate the list in the loop using cat and the pipeline? Thanks a lot in advance!
It would help if you provided actual input and expected output, so I'll have to guess that the URL you are passing to curl is in some way derived from the input in the text file. If the exact URL is in the input stream, perhaps you merely want to replace $URL with $subdomain. In any case, to read the input stream, you can simply do:
while read -r subdomain; do
    URL="$subdomain"   # or derive the full URL from $subdomain here
    curl --write-out "%{http_code}\n" --silent --output /dev/null "$URL"
done
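Putting it together, a complete sketch of a hypothetical ./httpResponse script, assuming each input line is a bare hostname that should be prefixed with http://:
#!/bin/bash
# reads one subdomain per line from stdin and prints its HTTP status code
while read -r subdomain; do
    [ -z "$subdomain" ] && continue      # skip blank lines
    URL="http://${subdomain}"
    printf '%s: ' "$subdomain"
    curl --write-out "%{http_code}\n" --silent --output /dev/null "$URL"
done
It can then be called exactly as in the question (after chmod +x httpResponse): cat subdomains.txt | ./httpResponse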
Playing around with your example made me decide on wget, and here is another way...
#cat subdomains.txt
osmc
microknoppix
#cat httpResponse
for subdomain in $(cat subdomains.txt)
do
    wget -nv -O/dev/null --spider "${subdomain}"
done
#sh httpResponse 2>response.txt && cat response.txt
2021-04-05 13:49:25 URL: http://osmc/ 200 OK
2021-04-05 13:49:25 URL: http://microknoppix/ 200 OK
Since wget writes to stderr, 2>response.txt captures the right output.
The && works like a then: cat response.txt is executed only if httpResponse succeeded.
You can do this without cat and a pipeline. Use netcat and parse the first line with sed:
while read -r subdomain; do
echo -n "$subdomain: "
printf "GET / HTTP/1.1\nHost: %s\n\n" "$subdomain" | \
nc "$subdomain" 80 | sed -n 's/[^ ]* //p;q'
done < 'subdomains.txt'
subdomains.txt:
www.stackoverflow.com
www.google.com
output:
www.stackoverflow.com: 301 Moved Permanently
www.google.com: 200 OK

how to use wget spider to identify broken urls from a list of urls and save broken ones

I am trying to write a shell script to identify broken URLs from a list of URLs.
Here is an input_url.csv sample:
https://www.google.com/
https://www.nbc.com
https://www.google.com.hksjkhkh/
https://www.google.co.jp/
https://www.google.ca/
Here is what I have which works:
wget --spider -nd -nv -H --max-redirect 0 -o run.log -i input_url.csv
and this gives me '2019-09-03 19:48:37 URL: https://www.nbc.com 200 OK' for valid URLs, and for broken ones it gives me '0 redirections exceeded.'
What I want is to save only the broken links to my output file.
sample expect output:
https://www.google.com.hksjkhkh/
I think I would go with:
<input.csv xargs -n1 -P10 sh -c 'wget --spider --quiet "$1" || echo "$1"' --
You can use the -P <count> option of xargs to run <count> processes in parallel.
xargs runs the command sh -c '....' -- for each line of the input file, appending that line as the argument to the script.
Then sh inside runs wget ... "$1". The || checks if the return status is nonzero, which means failure. On wget failure, echo "$1" is executed.
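If you also want to keep the question's --max-redirect 0 behaviour and collect the failures in a file, a possible variant (a sketch; input_url.csv and broken_urls.txt are just example names) is:
<input_url.csv xargs -n1 -P10 sh -c \
    'wget --spider --quiet --max-redirect 0 "$1" || echo "$1"' -- > broken_urls.txt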
You could filter the output of wget -nd -nv and then regex the output, like this:
wget --spider -nd -nv -H --max-redirect 0 -i input 2>&1 | grep -v '200 OK' | grep 'unable' | sed 's/.* .//; s/.$//'
but this does not look extensible, it is not parallel, so it is probably slower and probably not worth the hassle.

How to ensure that user & pass is correct in curl using bash

I wrote the following script:
#!/bin/bash
if [ $# -ne 1 ]; then
    echo "Usage: ./script <input-file>"
    exit 1
fi
while read user pass; do
    curl -iL --data-urlencode user="$user" --data-urlencode password="$pass" http://foo.com/signin 1>/dev/null 2>&1
    if [ $? -eq 0 ]; then
        echo "ok"
    elif [ $? -ne 0 ]; then
        echo "failed"
    fi
done < $1
Question:
Even when I run it with a wrong user and pass, the result is "ok"...
How can I be sure whether my parameters are correct or not?
Thanks
It's because curl exits with status 0 as long as it gets a response from the server, regardless of what that response says. Typing that command with a random user/pass gets this:
$ curl -iL --data-urlencode user=BLAHBLAH --data-urlencode password=BLAH http://foo.com/signin
HTTP/1.1 301 Moved Permanently
Server: nginx/1.0.5
Date: Wed, 19 Feb 2014 05:53:36 GMT
Content-Type: text/html
Content-Length: 184
Connection: keep-alive
Location: http://www.foo.com/signin
. . .
. . .
<body>
<!-- This file lives in public/500.html -->
<div class="dialog">
<h1>We're sorry, but something went wrong.</h1>
</div>
</body>
</html>
Hence,
$ echo $?
0
But modify the URL to garbage:
$ curl -iL --data-urlencode user=BLAHBLAH --data-urlencode password=BLAH http://foof.com/signin
curl: (6) Could not resolve host: foof.com
$ echo $?
6
Even when the login fails, the HTTP server will still return a page with an error message. As curl is able to retrieve this page, it will finish successfully.
In order to get curl to fail on server errors, you need the --fail parameter. Although this may not be fail-safe according to the curl man page, it is worth a try.
If --fail does not work, you could parse the header in the output of your curl request or have a look at the --write-out parameter.
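If you go the --write-out route, a hedged sketch (assuming http://foo.com/signin answers a failed login with something other than a 2xx/3xx status code, which you would need to verify for your site):
#!/bin/bash
# judge the login by the HTTP status code instead of curl's exit code
while read -r user pass; do
    code=$(curl -s -o /dev/null -w '%{http_code}' -L \
        --data-urlencode user="$user" \
        --data-urlencode password="$pass" \
        http://foo.com/signin)
    if [ "$code" -ge 200 ] && [ "$code" -lt 400 ]; then
        echo "ok ($code)"
    else
        echo "failed ($code)"
    fi
done < "$1"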

CURL: grabbing liveleak video

Can I grab a video using curl?
I was using a website to download videos from liveleak, but it stopped working. I need this for one of my scripts.
Basically this is the link:
http://www.liveleak.com/e/955_1345380192
which is redirected to this:
http://edge.liveleak.com/80281E/u/u/ll2_player_files/mp55/player.swf?config=http://www.liveleak.com/player?a=config%26item_token=955_1345380192%26embed=1%26extra_params=
and that config link contains the video link. Every time I try to download it, I get:
--->Make sure a file_url, file_token or playlist_token are set!
http://www.liveleak.com/player?a=config%26item_token=955_1345380192%26embed=1%26extra_params=
What I've tried so far:
curl http://edge.liveleak.com/80281E/u/u/ll2_player_files/mp55/player.swf?config=http://www.liveleak.com/player?a=config%26item_token=955_1345380192%26embed=1%26extra_params= -s -L -b LCOOKIE -c LCOOKIE -o LIVE
curl http://edge.liveleak.com/80281E/u/u/ll2_player_files/mp55/player.swf?config=http://www.liveleak.com/player?a=config%26item_token=955_1345380192%26embed=1%26extra_params= -I
curl http://edge.liveleak.com/80281E/u/u/ll2_player_files/mp55/player.swf?config=http://www.liveleak.com/player?a=config%26item_token=955_1345380192%26embed=1%26extra_params= -v
curl http://www.liveleak.com/player?a=config&item_token=955_1345380192&embed=1&extra_params=
wget http://www.liveleak.com/player?a=config&item_token=955_1345380192&embed=1&extra_params=
curl -A "Mozilla/5.0 (X11; U; Linux x86_64; ru; rv:1.9.2.15) Gecko/20110303 Ubuntu/10.10 (maverick) Firefox/3.6.15" http://www.liveleak.com/player?a=config&item_token=955_1345380192&embed=1&extra_params=
Here is your epic one-liner:
UA="Mozilla/5.0 (X11; U; Linux x86_64; ru; rv:1.9.2.15) Gecko/20110303 Ubuntu/10.10 (maverick) Firefox/3.6.15"; curl -A "$UA" $(sed -n -e 's/.*<file>\(.*\)<\/file>.*/\1/p' <(wget -q -O - $(wget -U "$UA" -nv -r -np -nd -H --spider "http://www.liveleak.com/e/955_1345380192" 2>&1 | egrep ' URL:' | awk '{print $4}' | sed "s/.*\?config\=//g" | sed -e's/%\([0-9A-F][0-9A-F]\)/\\\\\x\1/g' | xargs echo -e)))
As requested, it uses curl (and a few additional tools); see the bash manual and the documentation of the other commands for an explanation.
Short summary: they moved the information about the video into an XML file. To make this easier next time, use the latest Firefox and its ability to inspect all HTTP requests and log their contents (no add-ons needed!)
