Keep track of execution time in a bash script and terminate current command if it takes too long - linux

I'm trying to create a bash script to download files en masse from a certain website.
Their download links are sequential - e.g. it's just id=1, id=2, id=3 all the way up to 660000. The only requirement is that you have to be logged in, which makes this a bit harder. Oh, and the login will randomly time out after a few hours so I have to log back in.
Here's my current script, which works well about 99% of the time.
#!/bin/sh
cd downloads
for i in `seq 1 660000`
do
    lastname=""
    echo "Downloading file $i"
    echo "Downloading file $i" >> _downloadinglog.txt
    # Probe a known URL first to check whether the login cookie is still valid
    response=$(curl --write-out %{http_code} -b _cookies.txt -c _cookies.txt --silent --output /dev/null "[sample URL to make sure cookie is still logged in]")
    if ! [ $response -eq 200 ]
    then
        echo "Cookie didn't work, trying to re-log in..."
        curl -d "userid=[USERNAME]" -d "pwd=[PASSWORD]" -b _cookies.txt -c _cookies.txt --silent --output /dev/null "[login URL]"
        response=$(curl --write-out %{http_code} -b _cookies.txt -c _cookies.txt --silent --output /dev/null "[sample URL again]")
        if ! [ $response -eq 200 ]
        then
            echo "Something weird happened?? Response code $response. Logging in didn't fix issue, fix then resume from $(($i - 1))"
            echo "Something weird happened?? Response code $response. Logging in didn't fix issue, fix then resume from $(($i - 1))" >> _downloadinglog.txt
            exit 0
        fi
        echo "Downloading file $(($i - 1)) again in case cookie expiring caused it to fail"
        echo "Downloading file $(($i - 1)) again in case cookie expiring caused it to fail" >> _downloadinglog.txt
        lastname=$(curl --write-out %{filename_effective} -O -J -b _cookies.txt -c _cookies.txt "[URL to download files]?id=$(($i - 1))")
        echo "id $(($i - 1)) = $lastname" >> _downloadinglog.txt
        lastname=""
        echo "Downloading file $i"
        echo "Downloading file $i" >> _downloadinglog.txt
    fi
    lastname=$(curl --write-out %{filename_effective} -O -J -b _cookies.txt -c _cookies.txt "[URL to download files]?id=$i")
    echo "id $i = $lastname" >> _downloadinglog.txt
done
So basically what I have it doing is attempting to download a random file before moving to the next file in the set. If the download fails, we assume the login cookie is no longer valid and tell curl to log me back in.
This works great, and I was able to get several thousand files from it. But what would happen is - either my router goes down for a second or two, or THEIR site goes down for a minute or two, and curl will just sit there thinking it's downloading for hours. I once came back to it literally spending 24 hours on the same file. It doesn't seem to have the ability to know if the transfer timed out in the middle - only if it can't START the transfer.
I know there are ways to terminate execution of a command if you combine it with "sleep", but since this has to be "smart" and restart from where it left off, I can't just kill the whole script.
Any suggestions? I'm open to using something other than curl if I can use it to login via a terminal command.

You can try using the curl options --connect-timeout or --max-time.
--max-time should be your pick.
From the manual:
--max-time
Maximum time in seconds that you allow the whole operation to take. This is useful for preventing your batch jobs from hanging for hours due to slow networks or links going down. Since 7.32.0, this option accepts decimal values, but the actual timeout will decrease in accuracy as the specified timeout increases in decimal precision. See also the --connect-timeout option.
If this option is used several times, the last one will be used.
Then capture the result of the command in a var and process further based on the result.
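For example, here is a minimal sketch of how the download call from the question could be wrapped (the bracketed URL is the question's own placeholder, and the timeout and retry values are assumptions, not tested against that site). curl exits with code 28 when --max-time is exceeded:
download_with_timeout() {
    id="$1"
    tries=0
    while [ "$tries" -lt 3 ]; do
        # --max-time caps the whole transfer; --connect-timeout only caps the connect phase
        lastname=$(curl --max-time 300 --connect-timeout 15 \
            --write-out '%{filename_effective}' -O -J \
            -b _cookies.txt -c _cookies.txt \
            "[URL to download files]?id=$id")
        rc=$?
        if [ "$rc" -eq 0 ]; then
            echo "id $id = $lastname" >> _downloadinglog.txt
            return 0
        fi
        tries=$((tries + 1))
        echo "curl exited with code $rc for id $id (28 means timed out), retry $tries" >> _downloadinglog.txt
        sleep 10
    done
    return 1
}
The script could then call download_with_timeout $i in place of the bare curl line, and decide whether to re-login or skip that id when the function returns non-zero.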

Related

Time out a curl command shell script centOS

checkServer(){
    response=$(curl --connect-timeout 10 --write-out %{http_code} --silent --output /dev/null localhost:8080/patuna/servicecheck)
    if [ "$response" = "200" ];
    then
        echo "`date --rfc-3339=seconds` - Server is healthy, up and running"
        return 0
    else
        echo "`date --rfc-3339=seconds` - Server is not healthy (response code - $response), server is going to restart"
        startTomcat
    fi
}
Here I want to time out the curl command, but it doesn't work (CentOS 7, shell script). What I simply need to do is time out the curl command.
The error is: curl: option --connect-timeout=: is unknown
checkServer(){
    response=$(curl --max-time 20 --connect-timeout 0 --write-out %{http_code} --silent --output /dev/null localhost:8080/patuna/servicecheck)
    if [ "$response" = "200" ];
    then
        echo "`date --rfc-3339=seconds` - Server is healthy, up and running"
        return 0
    else
        echo "`date --rfc-3339=seconds` - Server is not healthy (response code - $response), server is going to restart"
        startTomcat
    fi
}
You can try the --max-time option.
Maximum time in seconds that you allow the whole operation to take. This is useful for preventing your batch jobs from hanging for hours due to slow networks or links going down. Since 7.32.0, this option accepts decimal values, but the actual timeout will decrease in accuracy as the specified timeout increases in decimal precision.
If you just want to check the HTTP status code, you might want to check out the --head option.
I suggest using --silent together with --show-error, in case you want to see the error message.
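One detail worth noting (my observation, not part of the original answer): when curl cannot connect at all or hits the timeout, %{http_code} is written as 000, so a check against "200" already falls through to the restart branch. A quick way to see that behavior, using a deliberately unreachable address as an assumption:
response=$(curl --connect-timeout 5 --max-time 5 --write-out '%{http_code}' \
    --silent --output /dev/null http://10.255.255.1/)
echo "response=$response"   # prints response=000 after roughly 5 seconds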

How to check status of URLs from text file using bash shell script

I have to check the status of 200 http URLs and find out which of these are broken links. The links are present in a simple text file (say URL.txt present in my ~ folder). I am using Ubuntu 14.04 and I am a Linux newbie. But I understand the bash shell is very powerful and could help me achieve what I want.
My exact requirement would be to read the text file which has the list of URLs and automatically check if the links are working and write the response to a new file with the URLs and their corresponding status (working/broken).
I created a file "checkurls.sh" and placed it in my home directory where the urls.txt file is also located. I gave execute privileges to the file using
$chmod +x checkurls.sh
The contents of checkurls.sh is given below:
#!/bin/bash
while read url
do
    urlstatus=$(curl -o /dev/null --silent --head --write-out '%{http_code}' "$url" )
    echo "$url $urlstatus" >> urlstatus.txt
done < $1
Finally, I executed it from command line using the following -
$./checkurls.sh urls.txt
Voila! It works.
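Since the surrounding discussion is about hanging transfers, one small hardening (my addition, not part of the original answer) is to cap each check with --max-time so a single unresponsive URL cannot stall the whole loop; the 10-second value is an arbitrary assumption:
#!/bin/bash
while read -r url
do
    urlstatus=$(curl --max-time 10 -o /dev/null --silent --head --write-out '%{http_code}' "$url")
    echo "$url $urlstatus" >> urlstatus.txt
done < "$1"
The rest of the workflow stays the same.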
#!/bin/bash
while read -ru 4 LINE; do
    read -r REP < <(exec curl -IsS "$LINE" 2>&1)
    echo "$LINE: $REP"
done 4< "$1"
Usage:
bash script.sh urls-list.txt
Sample:
http://not-exist.com/abc.html
https://kernel.org/nothing.html
http://kernel.org/index.html
https://kernel.org/index.html
Output:
http://not-exist.com/abc.html: curl: (6) Couldn't resolve host 'not-exist.com'
https://kernel.org/nothing.html: HTTP/1.1 404 Not Found
http://kernel.org/index.html: HTTP/1.1 301 Moved Permanently
https://kernel.org/index.html: HTTP/1.1 200 OK
For everything, read the Bash Manual. See man curl, help, man bash as well.
What about adding some parallelism to the accepted solution? Let's modify the script chkurl.sh to be a little easier to read and to handle just one request at a time:
#!/bin/bash
URL=${1?Pass URL as parameter!}
curl -o /dev/null --silent --head --write-out "$URL %{http_code} %{redirect_url}\n" "$URL"
And now you check your list using:
cat URL.txt | xargs -P 4 -L1 ./chkurl.sh
This could finish the job up to 4 times faster.
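For example (the output redirection is my assumption, to keep the results in a file like the accepted answer does):
xargs -P 4 -L1 ./chkurl.sh < URL.txt > urlstatus.txt
Because chkurl.sh prints the URL together with its status code, the lines in urlstatus.txt stay self-describing even when the four parallel workers finish out of order.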
Herewith my full script that checks URLs listed in a file passed as an argument, e.g. 'checkurls.sh listofurls.txt'.
What it does:
check url using curl and return HTTP status code
send email notifications when url returns a code other than 200
create a temporary lock file for failed urls (file naming could be improved)
send email notification when url becomes available again
remove lock file once url becomes available to avoid further notifications
log events to a file and handle increasing log file size (AKA log rotation; uncomment the echo if code 200 logging is required)
Code:
#!/bin/bash
EMAIL="your@email.com"
DATENOW=`date +%Y%m%d-%H%M%S`
LOG_FILE="checkurls.log"
c=0
while read url
do
    ((c++))
    LOCK_FILE="checkurls$c.lock"
    urlstatus=$(/usr/bin/curl -H 'Cache-Control: no-cache' -o /dev/null --silent --head --write-out '%{http_code}' "$url" )
    if [ "$urlstatus" = "200" ]
    then
        #echo "$DATENOW OK $urlstatus connection->$url" >> $LOG_FILE
        [ -e $LOCK_FILE ] && /bin/rm -f -- $LOCK_FILE > /dev/null && /bin/mail -s "NOTIFICATION URL OK: $url" $EMAIL <<< 'The URL is back online'
    else
        echo "$DATENOW FAIL $urlstatus connection->$url" >> $LOG_FILE
        if [ -e $LOCK_FILE ]
        then
            #no action - awaiting URL to be fixed
            :
        else
            /bin/mail -s "NOTIFICATION URL DOWN: $url" $EMAIL <<< 'Failed to reach or URL problem'
            /bin/touch $LOCK_FILE
        fi
    fi
done < $1

# REMOVE LOG FILE IF LARGER THAN 100MB
# allow up to 2000 lines average
maxsize=120000
size=$(/usr/bin/du -k "$LOG_FILE" | /bin/cut -f 1)
if [ $size -ge $maxsize ]; then
    /bin/rm -f -- $LOG_FILE > /dev/null
    echo "$DATENOW LOG file [$LOG_FILE] has been recreated" > $LOG_FILE
else
    #do nothing
    :
fi
Please note that changing the order of the listed urls in the text file will affect any existing lock files (remove all .lock files to avoid confusion). It could be improved by using the url as the file name, but certain characters such as : # / ? & would have to be handled for the operating system (see the sketch below).
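One possible way to do that (a sketch of my own, not part of the script above) is to derive the lock-file name from a hash of the url, so the name is stable regardless of list order and contains no awkward characters:
# hypothetical replacement for LOCK_FILE="checkurls$c.lock"
LOCK_FILE="checkurls_$(printf '%s' "$url" | md5sum | cut -c 1-32).lock"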
I recently released deadlink, a command-line tool for finding broken links in files. Install with
pip install deadlink
and use as
deadlink check /path/to/file/or/directory
or
deadlink replace-redirects /path/to/file/or/directory
The latter will replace permanent redirects (301) in the specified files.
If your input file contains one url per line, you can use a script to read each line and then try to ping the url; if the ping succeeds, the url is considered valid.
#!/bin/bash
INPUT="Urls.txt"
OUTPUT="result.txt"
while read line ;
do
    if ping -c 1 $line &> /dev/null
    then
        echo "$line valid" >> $OUTPUT
    else
        echo "$line not valid " >> $OUTPUT
    fi
done < $INPUT
exit
ping options:
-c count
Stop after sending count ECHO_REQUEST packets. With the deadline option, ping waits for count ECHO_REPLY packets, until the timeout expires.
You can use this option as well to limit the waiting time:
-W timeout
Time to wait for a response, in seconds. The option affects only the timeout in absence of any responses; otherwise ping waits for two RTTs.
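Combining both options, the check in the script above could look like this (the 2-second wait is an illustrative value, not something from the original answer):
if ping -c 1 -W 2 $line &> /dev/null
then
    echo "$line valid" >> $OUTPUT
else
    echo "$line not valid " >> $OUTPUT
fi
The rest of the loop stays the same.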
curl -s -I --http2 http://$1 >> fullscan_curl.txt | cut -d: -f1 fullscan_curl.txt | cat fullscan_curl.txt | grep HTTP >> fullscan_httpstatus.txt
It works for me.

Bash Ping Site Check Script. (Longer Timeout)

I am currently using a Raspberry Pi as a ping server, and this is the script I am using to check for an OK response.
I'm not familiar with bash scripting, so it's a bit of a beginner question about the curl call: is there a way to increase the timeout? It keeps falsely reporting websites as down.
#!/bin/bash
SITESFILE=/sites.txt #list the sites you want to monitor in this file
EMAILS=" " #list of email addresses to receive alerts (comma separated)

while read site; do
    if [ ! -z "${site}" ]; then
        CURL=$(curl -s --head $site)
        if echo $CURL | grep "200 OK" > /dev/null
        then
            echo "The HTTP server on ${site} is up!"
            sleep 2
        else
            MESSAGE="This is an alert that your site ${site} has failed to respond 200 OK."
            for EMAIL in $(echo $EMAILS | tr "," " "); do
                SUBJECT="$site (http) Failed"
                echo "$MESSAGE" | mail -s "$SUBJECT" $EMAIL
                echo $SUBJECT
                echo "Alert sent to $EMAIL"
            done
        fi
    fi
done < $SITESFILE
Yes, man curl:
--connect-timeout <seconds>
Maximum time in seconds that you allow the connection to the server to take.
This only limits the connection phase, once curl has connected this option is
of no more use. See also the -m, --max-time option.
You can also consider using ping to test the connection before calling curl. Something with a ping -c2 will give you 2 pings to test the connection. Then just check the return from ping (i.e. [[ $? -eq 0 ]] means ping succeeded, then connect with curl)
Also you can use [ -n ${site} ] (site is set) instead of [ ! -z ${site} ] (site is not unset). Additionally, you will generally want to use the [[ ]] test keywords instead of single [ ] for test constructs. For ultimate portability, just use test -n "${site}" (always double-quote when using test).
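A minimal sketch combining those suggestions (assuming sites.txt holds bare hostnames, and picking arbitrary timeout values):
while read -r site; do
    [[ -n ${site} ]] || continue
    if ! ping -c2 -W2 "${site}" &> /dev/null; then
        echo "${site} did not answer ping, skipping curl"
        continue
    fi
    if curl -s --connect-timeout 10 --max-time 30 --head "http://${site}" | grep -q "200 OK"; then
        echo "The HTTP server on ${site} is up!"
    else
        echo "The HTTP server on ${site} failed to respond 200 OK"
    fi
done < "$SITESFILE"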
I think you need this option --max-time <seconds>
-m/--max-time <seconds>
Maximum time in seconds that you allow the whole operation to take. This is useful for preventing your batch jobs from hanging for hours
due to slow networks or links going down.
--connect-timeout <seconds>
Maximum time in seconds that you allow the connection to the server to take. This only limits the connection phase, once curl has connected this option is of no more use.

scp: how to find out that copying was finished

I'm using the scp command to copy a file from one Linux host to another.
I run the scp command on host1 and copy a file from host1 to host2. The file is quite big and it takes some time to copy.
On host2 the file appears immediately, as soon as copying has started. I can do everything with this file even while copying is still in progress.
Is there any reliable way to find out if copying was finished or not on host2?
Off the top of my head, you could do something like:
touch tinyfile
scp bigfile tinyfile user@host:
Then when tinyfile appears you know that the transfer of bigfile is complete.
As pointed out in the comments, this assumes that scp will copy the files one by one, in the order specified. If you don't trust it, you could do them one by one explicitly:
scp bigfile user@host:
scp tinyfile user@host:
The disadvantage of this approach is that you would potentially have to authenticate twice. If this were an issue you could use something like ssh-agent.
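A hedged example of the two-command variant with ssh-agent, so the key passphrase is entered only once (the key path and host are placeholders of my own):
eval "$(ssh-agent -s)"
ssh-add ~/.ssh/id_rsa
scp bigfile user@host:
scp tinyfile user@host: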
On the sending side (host1) use a script like this:
#!/bin/bash
echo 'starting transfer'
scp FILE USER@DST_SERVER:DST_PATH
OUT=$?
if [ $OUT = 0 ]; then
    echo 'transfer successful'
    touch successful
    scp successful USER@DST_SERVER:DST_PATH
else
    echo 'transfer failed'
fi
On the receiving side (host2) make a script like this:
#!/bin/bash
SLEEP_TIME=30
MAX_CNT=10
CNT=0
while [[ ! -e successful && $CNT -lt $MAX_CNT ]]; do
    ((CNT++))
    sleep $SLEEP_TIME
done

if [[ -e successful ]]; then
    echo 'successful'
    rm successful
    # do something with FILE
fi
With CNT and MAX_CNT you avoid an endless loop (in case the file successful is never transferred).
The product of MAX_CNT and SLEEP_TIME should be equal to or greater than the expected transfer time. In my example the expected transfer time is less than 300 seconds.
A checksum (md5sum, sha256sum, sha512sum) of the local and remote files would tell you if they're identical.
For the situation where you don't have SSH access to the remote system - like an FTP server - you can download the file after it's uploaded and compare the checksums. I do this for files I send from production scripts at work. Below is a snippet from the script in which I do this.
MD5SRC=$(md5sum $LOCALFILE | cut -c 1-32)
MD5TESTFILE=$(mktemp -p /ramdisk)
curl \
    -o $MD5TESTFILE \
    -sS \
    -u $FTPUSER:$FTPPASS \
    ftp://$FTPHOST/$REMOTEFILE
MD5DST=$(md5sum $MD5TESTFILE | cut -c 1-32)
if [ "$MD5SRC" == "$MD5DST" ]
then
    echo "+Local and Remote files match!"
else
    echo "-Local and Remote files don't match"
fi
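When you do have SSH access to host2, the same idea is even simpler; this is a sketch only, and user@host2 and the remote path are placeholders of my own:
local_sum=$(md5sum "$LOCALFILE" | cut -c 1-32)
remote_sum=$(ssh user@host2 md5sum "$REMOTEFILE" | cut -c 1-32)
if [ "$local_sum" = "$remote_sum" ]; then
    echo "Local and remote files match"
fi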
If you use inotify-tools, then the solution will look like this:
while ! inotifywait -e close $(dirname ${bigfile_fullname}) 2>/dev/null | \
        grep -Eo "CLOSE $(basename ${bigfile_fullname})$" > /dev/null
do
    true
done
echo "File ${bigfile_fullname} closed"
After some investigation and discussion of the problem on other forums, I have found one more solution. Maybe it can help somebody.
There is a command "lsof". It lists open files. During copying, the file will be open, so the command
lsof | grep filename
will return a non-empty result.
So you might want to make a while loop that waits until lsof returns nothing, and then proceed with your task.
Example:
# provide your file name here
f=<nameOfYourFile>
lsofresult=`lsof | grep $f | wc -l`
while [ $lsofresult != 0 ]; do
    echo still copying file $f...
    sleep 5
    lsofresult=`lsof | grep $f | wc -l`
done; echo copying file $f is finished: `ls $f`
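A slightly tighter variant (my assumption, not from the original answer) asks lsof about the exact path and tests its exit status, which avoids grep matching unrelated files whose names merely contain $f:
# provide your file name here, as in the example above
f=<nameOfYourFile>
while lsof -- "$f" &> /dev/null; do
    echo "still copying file $f..."
    sleep 5
done
echo "copying file $f is finished: $(ls "$f")"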
For the duplicate question, How to check if file has been scp 100% to the remote location, which was for an expect script: to know whether a file has been transferred completely, we can add expect 100%, i.e. something like this:
expect -c "
    set timeout 1
    spawn scp user@$REMOTE_IP:/tmp/my.file user@$HOST_IP:/home/.
    expect yes/no { send yes\r ; exp_continue }
    expect password: { send $SCP_PASSWORD\r }
    expect 100%
    sleep 1
    exit
"
if [ -f "/home/my.file" ]; then
    echo "Success"
fi
If avoiding a second SSH handshake is important, you can use something like the following:
ssh host cat \> bigfile \&\& touch complete < bigfile
Then wait for the "complete" file to get created on the remote end.
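To actually wait on the receiving end, a small polling loop is enough; this is a sketch only, and the 5-second interval is an arbitrary choice:
# run on the remote host (or via ssh from host1) until the marker file shows up
while [ ! -e complete ]; do
    sleep 5
done
echo "bigfile is complete"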

wget download not completing all pages

I need to download over 30k pages on Linux and imagined I could do that with a simple bash script + wget. Here is what I came up with:
#!/bin/bash

start_time=$(date +%s)
for i in {1..30802}
do
    echo "Downloading page http://www.domain.com/page:$i"
    wget "http://www.domain.com/page:$i" -q -o /dev/null -b -O pages/$i
    running=$(ps -ef | grep wget | wc -l)
    while [ $running -gt 1000 ]
    do
        running=$(ps -ef | grep wget | wc -l)
        echo "Current running $running process."
        sleep 1;
    done
done

while [ $running -gt 1 ]
do
    running=$(ps -ef | grep wget | wc -l)
    echo "Waiting for all the process to end..."
    sleep 10;
done

finish_time=$(date +%s)
echo "Time duration: $((finish_time - start_time)) secs."
Some pages are not being completely downloaded!
Since the above code keeps 1k wget processes running in parallel and waits until the count drops before adding more, could it be that I am actually exhausting all the available internet bandwidth?
How could I make this more reliable, to make sure every page is actually being properly downloaded?
EDIT:
I heard that curl is a better option for downloading pages; is that true?
Here is a possible solution to your situation:
1) Change the way you call wget to something like this:
(wget "http://www.domain.com/page:$i" -q -o /dev/null -O pages/$i || touch $i.bad) &
2) When your script finishes, search for all *.bad files and relaunch the wget for each of them. Delete the corresponding .bad file before the new retry.
3) Do until no *.bad file exists.
That's the general idea. Hope that helped!
EDIT:
For the situation in which wget processes disappear, are killed or end abruptly, there is a possible refinement:
(wget "http://www.domain.com/page:$i" -q -o /dev/null -O pages/$i || touch $i.bad && touch $i.ok) &
Then you can analyze if some page has been downloaded completely or wget failed to end.
EDIT 2:
After some testings and digging, I've discovered that my former proposal was flawed. The order of the conditionals must be interchanged:
(wget "http://www.domain.com/page:$i" -q -o /dev/null -O pages/$i && touch $i.ok || touch $i.bad) &
So,
If the download is executed correctly by wget (i.e. it finished with an OK return code) then there must be two files: the downloaded page and the .ok file.
If the download fails (i.e. wget returns a KO return code), then there must be the .bad file, and perhaps a partial download of the page.
In any case, only the .ok files are significant: they say that the download was finished correctly (from the wget point of view, and I will discuss this later).
If no .ok file is found for a specific page, then surely it has not been downloaded, so it must be retried.
Then, we get to the most delicate part of your procedure: what happens if a web server, as a response to that big number of requests, answers the requests it cannot serve with an HTTP 200 response and a zero content length? That would be a good technique to avoid web copying or some kind of server attack.
If that's the case, you must take a look at the pattern of the responses. There will be an .ok file, but perhaps the file size of the downloaded page will be zero.
You can detect those zero-length downloads with:
filesize=$(cat $i.html | wc -c)
And then add some logic to the former procedure of .ok and .bad files:
retry=0
if [ -f $i.bad ]
then
    retry=1
elif [ -f $i.ok ]
then
    if [ $filesize -eq 0 ]
    then
        retry=1
    fi
else
    retry=1
fi

if [ $retry -eq 1 ]
then
    # retry the download here
    :
fi
Hope this helped!
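As a rough sketch of steps 2 and 3 above (file locations are assumed to match the earlier commands, and this loop keeps retrying until every page eventually succeeds):
while ls *.bad > /dev/null 2>&1; do
    for bad in *.bad; do
        i=${bad%.bad}
        rm -f "$bad"
        wget "http://www.domain.com/page:$i" -q -o /dev/null -O "pages/$i" && touch "$i.ok" || touch "$i.bad"
    done
done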
I don't know what kind of connection you have, but a high number of concurrent connections leads to packet loss. Also consider what kind of connection the server has. If this is not an in-house server, the party that hosts the server might think this is a denial-of-service attack and filter your IP. It is more reliable to just do it one by one. The bottleneck is almost always the internet connection; you can't do it any faster.
