How to check status of URLs from text file using bash shell script - linux

I have to check the status of 200 HTTP URLs and find out which of these are broken links. The links are present in a simple text file (say URL.txt, in my ~ folder). I am using Ubuntu 14.04 and I am a Linux newbie, but I understand the bash shell is very powerful and could help me achieve what I want.
My exact requirement would be to read the text file which has the list of URLs and automatically check if the links are working and write the response to a new file with the URLs and their corresponding status (working/broken).

I created a file "checkurls.sh" and placed it in my home directory where the urls.txt file is also located. I gave execute privileges to the file using
$chmod +x checkurls.sh
The contents of checkurls.sh is given below:
#!/bin/bash
# Read each URL from the file passed as the first argument and record its HTTP status code.
while read -r url
do
    urlstatus=$(curl -o /dev/null --silent --head --write-out '%{http_code}' "$url")
    echo "$url $urlstatus" >> urlstatus.txt
done < "$1"
Finally, I executed it from command line using the following -
$./checkurls.sh urls.txt
Voila! It works.
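Since the stated requirement was a working/broken label rather than the raw status code, the loop body can classify each code as well. A minimal sketch, assuming anything in the 200-399 range counts as working:
#!/bin/bash
# Sketch: classify each URL as working/broken based on the HTTP status code.
# Codes in the 200-399 range are treated as "working"; adjust to taste.
while read -r url
do
    code=$(curl -o /dev/null --silent --head --write-out '%{http_code}' "$url")
    if [ "$code" -ge 200 ] 2>/dev/null && [ "$code" -lt 400 ]; then
        echo "$url working ($code)" >> urlstatus.txt
    else
        echo "$url broken ($code)" >> urlstatus.txt
    fi
done < "$1"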

#!/bin/bash
while read -ru 4 LINE; do
    read -r REP < <(exec curl -IsS "$LINE" 2>&1)
    echo "$LINE: $REP"
done 4< "$1"
Usage:
bash script.sh urls-list.txt
Sample:
http://not-exist.com/abc.html
https://kernel.org/nothing.html
http://kernel.org/index.html
https://kernel.org/index.html
Output:
http://not-exist.com/abc.html: curl: (6) Couldn't resolve host 'not-exist.com'
https://kernel.org/nothing.html: HTTP/1.1 404 Not Found
http://kernel.org/index.html: HTTP/1.1 301 Moved Permanently
https://kernel.org/index.html: HTTP/1.1 200 OK
For everything else, read the Bash Manual; see man curl, help, and man bash as well.

How about adding some parallelism to the accepted solution? Let's modify the script chkurl.sh to be a little easier to read and to handle just one request at a time:
#!/bin/bash
URL=${1?Pass URL as parameter!}
curl -o /dev/null --silent --head --write-out "$URL %{http_code} %{redirect_url}\n" "$URL"
And now you check your list using:
cat URL.txt | xargs -P 4 -L1 ./chkurl.sh
This could finish the job up to 4 times faster.
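If you want the results collected in one file, redirect the combined output; each invocation prints a single short line, so interleaving is normally not a problem (a sketch):
# Collect the status of every URL into one file, four checks at a time.
xargs -P 4 -L1 ./chkurl.sh < URL.txt > urlstatus.txt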

Herewith my full script that checks URLs listed in a file passed as an argument, e.g. 'checkurls.sh listofurls.txt'.
What it does:
check the url using curl and return the HTTP status code
send an email notification when the url returns a code other than 200
create a temporary lock file for failed urls (file naming could be improved)
send an email notification when the url becomes available again
remove the lock file once the url becomes available, to avoid further notifications
log events to a file and handle increasing log file size (AKA log rotation; uncomment the echo if code 200 logging is required)
Code:
#!/bin/bash
# bash (not plain sh) is needed for ((c++)) and the <<< here-strings below.
EMAIL="your#email.com"
DATENOW=`date +%Y%m%d-%H%M%S`
LOG_FILE="checkurls.log"
c=0
while read -r url
do
    ((c++))
    LOCK_FILE="checkurls$c.lock"
    urlstatus=$(/usr/bin/curl -H 'Cache-Control: no-cache' -o /dev/null --silent --head --write-out '%{http_code}' "$url")
    if [ "$urlstatus" = "200" ]
    then
        #echo "$DATENOW OK $urlstatus connection->$url" >> $LOG_FILE
        [ -e "$LOCK_FILE" ] && /bin/rm -f -- "$LOCK_FILE" > /dev/null && /bin/mail -s "NOTIFICATION URL OK: $url" $EMAIL <<< 'The URL is back online'
    else
        echo "$DATENOW FAIL $urlstatus connection->$url" >> $LOG_FILE
        if [ -e "$LOCK_FILE" ]
        then
            #no action - awaiting URL to be fixed
            :
        else
            /bin/mail -s "NOTIFICATION URL DOWN: $url" $EMAIL <<< 'Failed to reach or URL problem'
            /bin/touch "$LOCK_FILE"
        fi
    fi
done < "$1"

# REMOVE LOG FILE IF LARGER THAN 100MB
# allow up to 2000 lines average
maxsize=120000
size=$(/usr/bin/du -k "$LOG_FILE" | /bin/cut -f 1)
if [ "$size" -ge $maxsize ]; then
    /bin/rm -f -- $LOG_FILE > /dev/null
    echo "$DATENOW LOG file [$LOG_FILE] has been recreated" > $LOG_FILE
else
    #do nothing
    :
fi
Please note that changing the order of the listed urls in the text file will affect any existing lock files (remove all .lock files to avoid confusion). This could be improved by using the url as the file name, but certain characters such as : # / ? & would have to be handled for the operating system (see the sketch below).
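One way to handle that, as a hedged sketch: derive the lock-file name from a hash of the url, so it is stable regardless of line order and contains only filesystem-safe characters (md5sum from coreutils is assumed to be available).
# Hypothetical replacement for LOCK_FILE="checkurls$c.lock" inside the loop:
# hash the url so the lock-file name no longer depends on the line number.
LOCK_FILE="checkurls-$(printf '%s' "$url" | /usr/bin/md5sum | /bin/cut -d' ' -f1).lock"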

I recently released deadlink, a command-line tool for finding broken links in files. Install with
pip install deadlink
and use as
deadlink check /path/to/file/or/directory
or
deadlink replace-redirects /path/to/file/or/directory
The latter will replace permanent redirects (301) in the specified files.

If your input file contains one url per line, you can use a script to read each line and then try to ping the url; if the ping succeeds, the url is valid.
#!/bin/bash
INPUT="Urls.txt"
OUTPUT="result.txt"
while read -r line
do
    if ping -c 1 "$line" &> /dev/null
    then
        echo "$line valid" >> "$OUTPUT"
    else
        echo "$line not valid" >> "$OUTPUT"
    fi
done < "$INPUT"
exit
exit
ping options:
-c count
Stop after sending count ECHO_REQUEST packets. With the deadline option, ping waits for count ECHO_REPLY packets, until the timeout expires.
You can use this option as well to limit the waiting time:
-W timeout
Time to wait for a response, in seconds. The option affects only the timeout in the absence of any responses; otherwise ping waits for two RTTs.
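Note that ping expects a hostname rather than a full URL, so if the file really contains URLs (lines starting with http:// or https://), the scheme and path should be stripped first. A minimal sketch of that variant, reusing the same Urls.txt/result.txt names:
#!/bin/bash
# Sketch: extract the hostname from each URL before pinging it.
INPUT="Urls.txt"
OUTPUT="result.txt"
while read -r line
do
    host=${line#*://}   # drop the scheme (http:// or https://), if present
    host=${host%%/*}    # drop any path after the hostname
    if ping -c 1 -W 2 "$host" &> /dev/null
    then
        echo "$line valid" >> "$OUTPUT"
    else
        echo "$line not valid" >> "$OUTPUT"
    fi
done < "$INPUT"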

curl -s -I --http2 "http://$1" | tee -a fullscan_curl.txt | grep HTTP >> fullscan_httpstatus.txt
This works for me: it appends the response headers to fullscan_curl.txt and just the HTTP status line to fullscan_httpstatus.txt.

Related

Redirect cat output to bash script [duplicate]

I need to write a bash script, which will get the subdomains from "subdomains.txt", which are separated by line breaks, and show me their HTTP response code. I want it to look this way:
cat subdomains.txt | ./httpResponse
The problem is that I don't know how to make the bash script get the subdomain names. Obviously, I need to use a loop, something like this:
for subdomains in list
do
echo curl --write-out "%{http_code}\n" --silent --output /dev/null "subdomain"
done
But how can I populate the list in the loop, using cat and a pipeline? Thanks a lot in advance!
It would help if you provided actual input and expected output, so I'll have to guess that the URL you are passing to curl is in some way derived from the input in the text file. If the exact URL is in the input stream, perhaps you merely want to replace $URL with $subdomain. In any case, to read the input stream, you can simply do:
while read -r subdomain; do
    URL=$subdomain   # or derive the full URL from $subdomain here
    curl --write-out "%{http_code}\n" --silent --output /dev/null "$URL"
done
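Put together as a complete httpResponse script, it might look like this (a sketch; reading from stdin is what makes cat subdomains.txt | ./httpResponse work, and prefixing http:// is an assumption about how the full URL is formed):
#!/bin/bash
# Sketch: read subdomains from stdin and print each one with its HTTP code.
# The http:// prefix is an assumption; adjust the URL construction as needed.
while read -r subdomain; do
    code=$(curl --write-out "%{http_code}" --silent --output /dev/null "http://$subdomain")
    echo "$subdomain $code"
done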
Playing around with your example led me to settle on wget, and here is another way...
#cat subdomains.txt
osmc
microknoppix
#cat httpResponse
for subdomain in $(cat subdomains.txt)
do
    wget -nv -O/dev/null --spider ${subdomain}
done
#sh httpResponse 2>response.txt && cat response.txt
2021-04-05 13:49:25 URL: http://osmc/ 200 OK
2021-04-05 13:49:25 URL: http://microknoppix/ 200 OK
Since wget writes its output to stderr, 2>response.txt captures the right output.
The && acts like a then: cat response.txt is executed only if httpResponse succeeded.
You can do this without cat and a pipeline. Use netcat and parse the first line with sed:
while read -r subdomain; do
    echo -n "$subdomain: "
    printf "GET / HTTP/1.1\nHost: %s\n\n" "$subdomain" | \
        nc "$subdomain" 80 | sed -n 's/[^ ]* //p;q'
done < 'subdomains.txt'
subdomains.txt:
www.stackoverflow.com
www.google.com
output:
www.stackoverflow.com: 301 Moved Permanently
www.google.com: 200 OK

tail -f to several files piped to "while read" gives unexpected results

I wrote a quick bash script to monitor several log files and send their contents to a Mattermost chat. This is the script:
#!/bin/bash
# HOOKS
hook1=...#Mattermost hook
hook2=...#Mattermost hook
hook3=...#Mattermost hook
date=$1
year=$2
unbuffer tail -f -n0 /var/log/$year/hook1_*$date.log \
    /var/log/$year/hook2_*$date.log \
    /var/log/$year/hook3_*$date.log | grep "ERROR" | while read -r msg
do
    if [[ $msg = *"hook1"* ]]
    then
        hook=$hook1
    elif [[ $msg = *"hook2"* ]]
    then
        hook=$hook2
    else
        hook=$hook3
    fi
    msg=$(echo "$msg" | sed -r "s/\"/\\\\\"/g")
    curl -i -X POST --data-urlencode "payload={\"text\": \"$msg\"}" $hook
done
The sed part escapes the quotes so I can substitute the message into the curl payload.
However, several problems arise during execution:
There seem to be 2 processes in htop after I run the script once.
Sometimes messages are not sent to Mattermost as they appear, but instead they come out in batches.
Messages sometimes appear cut off, and some content is lost.
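One likely contributor to the batching, offered as a guess: GNU grep buffers its output when it writes into a pipe rather than a terminal, so matches reach the while loop in blocks instead of line by line. A minimal, self-contained sketch of the same pipeline with line buffering forced (the log path and the loop body are placeholders):
#!/bin/bash
# Sketch: same tail | grep | while pipeline, but grep flushes each matching
# line immediately (--line-buffered) instead of accumulating a full block.
# /var/log/example.log and the echo are placeholders for the real paths/body.
unbuffer tail -f -n0 /var/log/example.log \
    | grep --line-buffered "ERROR" \
    | while read -r msg
do
    echo "would POST to Mattermost: $msg"
done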

Keep track of execution time in a bash script and terminate current command if it takes too long

I'm trying to create a bash script to download files en masse from a certain website.
Their download links are sequential - e.g. it's just id=1, id=2, id=3 all the way up to 660000. The only requirement is that you have to be logged in, which makes this a bit harder. Oh, and the login will randomly time out after a few hours so I have to log back in.
Here's my current script, which works well about 99% of the time.
#!/bin/sh
cd downloads
for i in `seq 1 660000`
do
    lastname=""
    echo "Downloading file $i"
    echo "Downloading file $i" >> _downloadinglog.txt
    response=$(curl --write-out %{http_code} -b _cookies.txt -c _cookies.txt --silent --output /dev/null "[sample URL to make sure cookie is still logged in]")
    if ! [ $response -eq 200 ]
    then
        echo "Cookie didn't work, trying to re-log in..."
        curl -d "userid=[USERNAME]" -d "pwd=[PASSWORD]" -b _cookies.txt -c _cookies.txt --silent --output /dev/null "[login URL]"
        response=$(curl --write-out %{http_code} -b _cookies.txt -c _cookies.txt --silent --output /dev/null "[sample URL again]")
        if ! [ $response -eq 200 ]
        then
            echo "Something weird happened?? Response code $response. Logging in didn't fix issue, fix then resume from $(($i - 1))"
            echo "Something weird happened?? Response code $response. Logging in didn't fix issue, fix then resume from $(($i - 1))" >> _downloadinglog.txt
            exit 0
        fi
        echo "Downloading file $(($i - 1)) again in case cookie expiring caused it to fail"
        echo "Downloading file $(($i - 1)) again in case cookie expiring caused it to fail" >> _downloadinglog.txt
        lastname=$(curl --write-out %{filename_effective} -O -J -b _cookies.txt -c _cookies.txt "[URL to download files]?id=$(($i - 1))")
        echo "id $(($i - 1)) = $lastname" >> _downloadinglog.txt
        lastname=""
        echo "Downloading file $i"
        echo "Downloading file $i" >> _downloadinglog.txt
    fi
    lastname=$(curl --write-out %{filename_effective} -O -J -b _cookies.txt -c _cookies.txt "[URL to download files]?id=$i")
    echo "id $i = $lastname" >> _downloadinglog.txt
done
So basically what I have it doing is attempting to download a random file before moving to the next file in the set. If the download fails, we assume the login cookie is no longer valid and tell curl to log me back in.
This works great, and I was able to get several thousand files from it. But what would happen is - either my router goes down for a second or two, or THEIR site goes down for a minute or two, and curl will just sit there thinking it's downloading for hours. I once came back to it literally spending 24 hours on the same file. It doesn't seem to have the ability to know if the transfer timed out in the middle - only if it can't START the transfer.
I know there are ways to terminate execution of a command if you combine it with "sleep", but since this has to be "smart" and restart from where it left off, I can't just kill the whole script.
Any suggestions? I'm open to using something other than curl if I can use it to login via a terminal command.
You can try using the curl options --connect-timeout or --max-time.
--max-time should be your pick.
From manual :
--max-time
Maximum time in seconds that you allow the whole operation to take. This is useful for preventing your batch jobs from hanging for hours due to slow networks or links going down. Since 7.32.0, this option accepts decimal values, but the actual timeout will decrease in accuracy as the specified timeout increases in decimal precision. See also the --connect-timeout option.
If this option is used several times, the last one will be used.
Then capture the result of the command in a var and process further based on the result.
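For instance, the download call in the script above could be wrapped like this (a sketch; the 300-second cap and the single retry are arbitrary choices, and the bracketed URL placeholder is the question's own):
# Sketch: cap each download at 300 seconds and retry once before giving up.
lastname=$(curl --max-time 300 --write-out %{filename_effective} -O -J -b _cookies.txt -c _cookies.txt "[URL to download files]?id=$i")
if [ $? -ne 0 ]
then
    echo "Download of id $i failed or timed out, retrying once" >> _downloadinglog.txt
    lastname=$(curl --max-time 300 --write-out %{filename_effective} -O -J -b _cookies.txt -c _cookies.txt "[URL to download files]?id=$i")
fi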

Shell script with Wget - If else nested inside for loop

I'm trying to make a shell script that reads a list of download URLs to find if they're still active. I'm not sure what's wrong with my current script (I'm new to this), and any pointers would be a huge help!
user#pc:~/test# cat sites.list
http://www.google.com/images/srpr/logo3w.png
http://www.google.com/doesnt.exist
notasite
Script:
#!/bin/bash
for i in `cat sites.list`
do
    wget --spider $i -b
    if grep --quiet "200 OK" wget-log; then
        echo $i >> ok.txt
    else
        echo $i >> notok.txt
    fi
    rm wget-log
done
As is, the script outputs everything to notok.txt (the first google site should go to ok.txt).
But if I run:
wget --spider http://www.google.com/images/srpr/logo3w.png -b
And then do:
grep "200 OK" wget-log
It greps the string without any problems. What noob mistake did I make with the syntax? Thanks m8s!
The -b option is sending wget to the background, so you're doing the grep before wget has finished.
Try without the -b option:
if wget --spider $i 2>&1 | grep --quiet "200 OK" ; then
There are a few issues with what you're doing.
Your for i in will have problems with lines that contain whitespace. Better to use while read to read individual lines of a file.
You aren't quoting your variables. What if a line in the file (or word in a line) starts with a hyphen? Then wget will interpret that as an option. You have a potential security risk here, as well as an error.
Creating and removing files isn't really necessary. If all you're doing is checking whether a URL is reachable, you can do that without temp files and the extra code to remove them.
wget isn't necessarily the best tool for this. I'd advise using curl instead.
So here's a better way to handle this...
#!/bin/bash
sitelist="sites.list"
curl="/usr/bin/curl"
# Some errors, for good measure...
if [[ ! -f "$sitelist" ]]; then
    echo "ERROR: Sitelist is missing." >&2
    exit 1
elif [[ ! -s "$sitelist" ]]; then
    echo "ERROR: Sitelist is empty." >&2
    exit 1
elif [[ ! -x "$curl" ]]; then
    echo "ERROR: I can't work under these conditions." >&2
    exit 1
fi

# Allow more advanced pattern matching (for case..esac below)
shopt -s extglob

while read url; do

    # remove comments
    url=${url%%#*}

    # skip empty lines
    if [[ -z "$url" ]]; then
        continue
    fi

    # Handle just ftp, http and https.
    # We could do full URL pattern matching, but meh.
    case "$url" in
        @(f|ht)tp?(s)://*)
            # Get just the numeric HTTP response code
            http_code=$($curl -sL -w '%{http_code}' "$url" -o /dev/null)
            case "$http_code" in
                200|226)
                    # You'll get a 226 in ${http_code} from a valid FTP URL.
                    # If all you really care about is that the response is in the 200's,
                    # you could match against "2??" instead.
                    echo "$url" >> ok.txt
                    ;;
                *)
                    # You might want different handling for redirects (301/302).
                    echo "$url" >> notok.txt
                    ;;
            esac
            ;;
        *)
            # If we're here, we didn't get a URL we could read.
            echo "WARNING: invalid url: $url" >&2
            ;;
    esac
done < "$sitelist"
This is untested. For educational purposes only. May contain nuts.

Bash | curl | curls 2 URL's then stops

I am trying to write a simple bash script that will use a list from a text document and curl each URL that is on the list in order to see what the contents of each URL is. It allows me to cURL 2 sites and creates the text documents for the rest; however, it only downloads the first 2. I have already managed to write the script that pulls their IPs and places them in a separate file using the grep command. At first I tried:
#!/bin/bash
for var in `cat host.txt`; do
    curl -s $var >> /tmp/ping/html/$var.html
done
I have tried with and without the silent switch. I then tried the following:
#!/bin/bash
for var in `head -2 host.txt`; do
    curl $var >> /tmp/ping/html/$var.html
    wait
done
for var in `head -4 host.txt | tail -2`; do
    curl $var >> /tmp/ping/html/$var.html
done
This would try to do them all at the same time, again stopping after 2:
#!/bin/bash
for var in `head -2 host.txt`; do
    curl $var >> /tmp/ping/html/$var.html
done
wait
for var in `head -4 host.txt | tail -2`; do
    curl $var >> /tmp/ping/html/$var.html
done
This would do the same. I am new to bash scripting and only know some of the basics; any help would be appreciated.
Start with the simple: verify that you are in fact iterating over the entire list:
# This is the recommended way to iterate over the file. See
# http://mywiki.wooledge.org/BashFAQ/001
while read -r var; do
    echo "$var"
done < hosts.txt
Then add in the call to curl, checking its exit status:
while read -r var; do
    echo "$var"
    curl "$var" >> /tmp/ping/html/$var.html || echo "curl failed: $?"
done < hosts.txt
You pipe into $var, which could result in a wrong filename because of the two slashes in the URL. Additionally, I would quote the URL. For example, it works with the basename of the URL:
#!/bin/bash
for var in `cat host.txt`; do
    name=$(basename $var)
    curl -v -s "$var" -o "/tmp/ping/html/$name.html"
done
You may also want to skip blank lines and comments (#):
#!/bin/bash
file="host.txt"
curl="curl"
while read -r line
do
    [[ $line = \#* ]] || [[ -z "${line}" ]] && continue
    filename=$(basename $line)
    $curl -s "$line" >> "/tmp/ping/html/$filename.html"
done < "$file"
