wget download not completing all pages - linux

I need to download over 30k pages on Linux and imagined I could do that with a simple bash script + wget. Here is what I came up with:
#!/bin/bash
start_time=$(date +%s)
for i in {1..30802}
do
    echo "Downloading page http://www.domain.com/page:$i"
    wget "http://www.domain.com/page:$i" -q -o /dev/null -b -O pages/$i
    running=$(ps -ef | grep wget | wc -l)
    while [ $running -gt 1000 ]
    do
        running=$(ps -ef | grep wget | wc -l)
        echo "Current running $running process."
        sleep 1;
    done
done
while [ $running -gt 1 ]
do
    running=$(ps -ef | grep wget | wc -l)
    echo "Waiting for all the process to end..."
    sleep 10;
done
finish_time=$(date +%s)
echo "Time duration: $((finish_time - start_time)) secs."
Some pages are not being completely downloaded!
Since the above code keeps 1k wget processes running in parallel and
waits for the count to drop before adding more, could it be that I am
actually exhausting all the available bandwidth?
How could I make this more reliable, so that each page is actually
downloaded completely?
EDIT:
I heard that curl is a better option for downloading pages. Is that true?

Here is a possible solution to your situation:
1) Change the way you call wget to something like this:
(wget "http://www.domain.com/page:$i" -q -o /dev/null -O pages/$i || touch $i.bad) &
2) When your script finishes, search for all *.bad files and relaunch wget for each of them. Delete the corresponding .bad file before the new retry.
3) Repeat until no *.bad file exists; a minimal sketch follows.
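A minimal sketch of steps 2) and 3), assuming the .bad markers are created in the script's working directory and the pages go into pages/ as in your wget call (retries are done serially here to keep it simple):
while ls *.bad >/dev/null 2>&1; do
    for bad in *.bad; do
        i=${bad%.bad}                # recover the page number from the marker name
        rm -f "$bad"                 # delete the marker before the new retry
        wget "http://www.domain.com/page:$i" -q -o /dev/null -O "pages/$i" || touch "$i.bad"
    done
done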
That's the general idea. Hope that helped!
EDIT:
For the situation in which wget processes disappear, are killed or end abruptly, there is a possible refinement:
(wget "http://www.domain.com/page:$i" -q -o /dev/null -O pages/$i || touch $i.bad && touch $i.ok) &
Then you can analyze whether a page has been downloaded completely or whether wget failed to finish.
EDIT 2:
After some testing and digging, I've discovered that my former proposal was flawed: the order of the conditionals must be interchanged:
(wget "http://www.domain.com/page:$i" -q -o /dev/null -O pages/$i && touch $i.ok || touch $i.bad) &
So,
If the download is executed correctly by wget (i.e. it finishes with an OK return code), then there must be two files: the downloaded page and the .ok file.
If the download fails (i.e. wget returns a non-zero exit code), then there must be the .bad file, and perhaps a partial download of the page.
In any case, only the .ok files are significant: they say that the download finished correctly (from wget's point of view, which I will discuss later).
If no .ok file is found for a specific page, then surely it has not been downloaded, so it must be retried.
Then we get to the most delicate part of your procedure: what happens if a web server, as a response to that big number of requests, answers the requests it cannot serve with an HTTP 200 response and zero content length? That would be a good technique to avoid web copying or some kind of server attack.
If that's the case, you must take a look at the pattern of the responses. There will be an .ok file, but perhaps the file size of the downloaded page will be zero.
You can detect those zero-length downloads with:
filesize=$(wc -c < "pages/$i")
And then add some logic to the former procedure of .ok and .bad files:
retry=0
if [ -f $i.bad ]
then
    retry=1
elif [ -f $i.ok ]
then
    if [ $filesize -eq 0 ]
    then
        retry=1
    fi
else
    retry=1
fi

if [ $retry -eq 1 ]
then
    # retry the download
fi
Hope this helped!

I don't know what kind of connection you have, but a high number of concurrent connections leads to packet loss. Also consider what kind of connection the server has. If this is not an in-house server, the party that hosts it might think this is a denial-of-service attack and filter your IP. It is more reliable to just do it one by one. The bottleneck is almost always the internet connection; you can't do it any faster.
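For what it's worth, a minimal one-by-one sketch (reusing the URL pattern and pages/ directory assumed from the question) that leans on wget's built-in retry and timeout options and records anything that still fails:
#!/bin/bash
for i in {1..30802}; do
    echo "Downloading page $i"
    wget --tries=3 --timeout=30 --waitretry=5 -q \
        "http://www.domain.com/page:$i" -O "pages/$i" \
        || echo "$i" >> failed.txt    # retry these later
done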

Related

clear my script logs every 10 second

I have a script named run.sh. This is my script code:
#!/usr/bin/env bash
install() {
    sudo apt-get update
    sudo apt-get upgrade
}
if [ "$1" = "install" ]; then
    install
else
    if [ ! -f ./tg/tgcli ]; then
        echo "tg not found"
        echo "Run $0 install"
        exit 1
    fi
    #sudo service redis-server restart
    #./tg/tgcli -s ./bot/bot.lua -l 1 -E $@
    ./tg/tgcli -s ./bot/bot.lua $@
fi
When I run this script, it gives me output like this every second:
[09:54] 2014 Hello
[09:55] 2014 Hi
[09:57] 2014 How Are you ?
and many more like this (thousands per hour!),
and my server gets slow within 5 hours.
I checked the print commands in bot.lua but there is no way to remove the printing.
Can you add some code to clear my script's logs every 10 seconds?
Thanks a lot.
My script output doesn't get saved anywhere; it is just shown in the terminal.
I want something like the clear command on the Linux terminal that clears my script's logs every 5 or 10 minutes.
After 5 days of the script running I can (sometimes can't) log in to my server; it gets very slow and I must wait 3 to 5 minutes to log in. Amazingly, after I log in, the server gets fast again!
I forgot to say that I use byobu/screen to run my scripts, and I think screen is what slows my server down.
I don't think that something as simple as this would cause your server to slow down, but you can add a check to your script to calculate the size or line count of your log file every time it runs.
This function assumes you are redirecting your output to a log file. Set the variables to whatever makes the most sense.
log_check() {
    line_count=$(wc -l $log_file | awk '{print $1}')
    size_check=$(du -ax $log_file | awk '{print $1}')
    max_file_size="1500"
    max_file_length="1000"
    if [[ $line_count -ge $max_file_length || $size_check -ge $max_file_size ]]; then
        echo "" > $log_file
    fi
}
I would also recommend using [[ ]] over [ ] since this is a bash script; as long as you don't plan on it being POSIX compliant and only plan on using it with bash, [[ ]] is generally better than [ ].
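For instance (a small illustration, not part of the original answer), [[ ]] gives you pattern matching and avoids word-splitting surprises:
name="my file.txt"
if [[ $name == *.txt ]]; then     # pattern match works, no quoting needed
    echo "text file"
fi
# [ $name = *.txt ] would fail with "too many arguments" because of the space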
EDIT:
Since you are logging output to the terminal and not a file you can literally use the clear command in your script.
Try this out and see how the functionality works
for i in {1..20}; do
    echo $i
    if (( i == 10 )); then
        clear
    fi
done
I'm assuming your code has a loop somewhere; if not, it will be a bit more complex to clear the terminal session. I'm not really sure which part of your code is actually printing anything to stdout; I'm guessing it's this piece here:
./tg/tgcli -s ./bot/bot.lua $@
You could try something like this, which will background your initial process and then run clear every 60 seconds to clear the terminal window. Is there any reason you're not writing the output to a log file? That alone could solve some of your issues as well.
#!/bin/bash
./tg/tgcli -s ./bot/bot.lua $@ &
pid="$!"
check_pid() {
    ps -ef | grep "$pid" | grep -v 'grep' &>/dev/null
}
cnt=1
until ! check_pid; do
    if (( cnt == 6 )); then
        clear
        cnt=1
    fi
    sleep 10
    ((cnt++))
done

Keep track of execution time in a bash script and terminate current command if it takes too long

I'm trying to create a bash script to download files en masse from a certain website.
Their download links are sequential - e.g. it's just id=1, id=2, id=3 all the way up to 660000. The only requirement is that you have to be logged in, which makes this a bit harder. Oh, and the login will randomly time out after a few hours so I have to log back in.
Here's my current script, which works well about 99% of the time.
#!/bin/sh
cd downloads
for i in `seq 1 660000`
do
    lastname=""
    echo "Downloading file $i"
    echo "Downloading file $i" >> _downloadinglog.txt
    response=$(curl --write-out %{http_code} -b _cookies.txt -c _cookies.txt --silent --output /dev/null "[sample URL to make sure cookie is still logged in]")
    if ! [ $response -eq 200 ]
    then
        echo "Cookie didn't work, trying to re-log in..."
        curl -d "userid=[USERNAME]" -d "pwd=[PASSWORD]" -b _cookies.txt -c _cookies.txt --silent --output /dev/null "[login URL]"
        response=$(curl --write-out %{http_code} -b _cookies.txt -c _cookies.txt --silent --output /dev/null "[sample URL again]")
        if ! [ $response -eq 200 ]
        then
            echo "Something weird happened?? Response code $response. Logging in didn't fix issue, fix then resume from $(($i - 1))"
            echo "Something weird happened?? Response code $response. Logging in didn't fix issue, fix then resume from $(($i - 1))" >> _downloadinglog.txt
            exit 0
        fi
        echo "Downloading file $(($i - 1)) again in case cookie expiring caused it to fail"
        echo "Downloading file $(($i - 1)) again in case cookie expiring caused it to fail" >> _downloadinglog.txt
        lastname=$(curl --write-out %{filename_effective} -O -J -b _cookies.txt -c _cookies.txt "[URL to download files]?id=$(($i - 1))")
        echo "id $(($i - 1)) = $lastname" >> _downloadinglog.txt
        lastname=""
        echo "Downloading file $i"
        echo "Downloading file $i" >> _downloadinglog.txt
    fi
    lastname=$(curl --write-out %{filename_effective} -O -J -b _cookies.txt -c _cookies.txt "[URL to download files]?id=$i")
    echo "id $i = $lastname" >> _downloadinglog.txt
done
So basically what I have it doing is attempting to download a random file before moving to the next file in the set. If the download fails, we assume the login cookie is no longer valid and tell curl to log me back in.
This works great, and I was able to get several thousand files from it. But what would happen is - either my router goes down for a second or two, or THEIR site goes down for a minute or two, and curl will just sit there thinking it's downloading for hours. I once came back to it literally spending 24 hours on the same file. It doesn't seem to have the ability to know if the transfer timed out in the middle - only if it can't START the transfer.
I know there are ways to terminate execution of a command if you combine it with "sleep", but since this has to be "smart" and restart from where it left off, I can't just kill the whole script.
Any suggestions? I'm open to using something other than curl if I can use it to login via a terminal command.
You can try using the curl options --connect-timeout or --max-time.
--max-time should be your pick.
From manual :
--max-time
Maximum time in seconds that you allow the whole operation to take. This is useful for preventing your batch jobs from hanging for hours due to slow networks or links going down. Since 7.32.0, this option accepts decimal values, but the actual timeout will decrease in accuracy as the specified timeout increases in decimal precision. See also the --connect-timeout option.
If this option is used several times, the last one will be used.
Then capture the result of the command in a var and process further based on the result.
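For example, a sketch reusing the placeholder URL and cookie files from your script: curl exits with code 28 when the --max-time limit is hit, so a stalled transfer can be told apart from other failures:
lastname=$(curl --max-time 300 --write-out %{filename_effective} -O -J \
    -b _cookies.txt -c _cookies.txt "[URL to download files]?id=$i")
rc=$?
if [ $rc -eq 28 ]; then
    echo "id $i timed out after 300s, retrying..." >> _downloadinglog.txt
elif [ $rc -ne 0 ]; then
    echo "id $i failed with curl exit code $rc" >> _downloadinglog.txt
fi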

Is there a way to perform a "tail -f" from an url?

I currently use tail -f to monitor a log file: this way I get an autorefreshing console monitoring a web server.
Now, said webserver was moved to another host and I have no shell privileges for that.
Nevertheless I have a .txt network path, which in the end is a log file which is constantly updated.
So, I'd like to do something like tail -f, but on that url.
Would it be possible? In the end, "in Linux everything is a file", so...
You can do auto-refresh with the help of watch combined with wget.
It won't show history like tail -f; rather, it updates the screen like top.
Example of a command that shows the content of file.txt on the screen and updates the output every five seconds:
watch -n 5 wget -qO- http://fake.link/file.txt
Also, you can output the last n lines instead of the whole file:
watch -n 5 "wget -qO- http://fake.link/file.txt | tail"
In case you still need behaviour like tail -f (keeping history), I think you need to write a script that downloads the log file periodically, compares it to the previously downloaded version, and then prints the new lines. It should be quite easy.
I wrote a simple bash script to fetch the URL content every 2 seconds, compare it with the local file output.txt, and append the diff to the same file.
I wanted to stream AWS Amplify logs in my Jenkins pipeline.
while true; do comm -13 --output-delimiter="" <(cat output.txt) <(curl -s "$URL") >> output.txt; sleep 2; done
Don't forget to create the empty output.txt file first:
: > output.txt
View the stream:
tail -f output.txt
Original comment: https://stackoverflow.com/a/62347827/2073339
UPDATE:
I found a better solution using wget here:
while true; do wget -ca -o /dev/null -O output.txt "$URL"; sleep 2; done
https://superuser.com/a/514078/603774
I've made this small function and added it to the .*rc of my shell. This uses wget -c, so it does not re-download the whole page:
# Poll logs continuously over HTTP
logpoll() {
    FILE=$(mktemp)
    echo "———————— LOGPOLLING TO $FILE ————————"
    tail -f $FILE &
    tail_pid=$!
    bg %1
    stop=0
    trap "stop=1" SIGINT SIGTERM
    while [ $stop -ne 1 ]; do wget -co /dev/null -O $FILE "$1"; sleep 2; done
    echo "——————————— LOGPOLL DONE ————————————"
    kill $tail_pid
    rm $FILE
    trap - SIGINT SIGTERM
}
Explanation:
Create a temporary logfile using mktemp and save its path to $FILE
Make tail -f output the logfile continuously in the background
Make ctrl+c set stop to 1 instead of exiting the function
Loop until stop bit is set, i.e. until the user presses ctrl+c
wget given URL in a loop every two seconds:
-c - "continue getting partially downloaded file", so that wget continues instead of truncating the file and downloading again
-o /dev/null - wget's log messages shall be thrown into the void
-O $FILE - output the contents to the temp logfile we've created
Clean up after yourself: kill the tail -f, delete the temporary logfile, unset the signal handlers.
The proposed solutions periodically download the full file.
To avoid that, I've created a package and published it on NPM; it does a HEAD request (to get the size of the file) and then requests only the last bytes.
Check it out and let me know if you need any help.
https://www.npmjs.com/package/@imdt-os/url-tail
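The same idea can be sketched in plain bash with curl: a HEAD request to read Content-Length, then a ranged GET for only the bytes added since the last poll. This is only a sketch; it assumes the server reports Content-Length and honours Range requests, and reuses the placeholder URL from above:
URL="http://fake.link/file.txt"
offset=0
while true; do
    # ask only for the headers and pull out the current file size
    size=$(curl -sI "$URL" | tr -d '\r' | awk 'tolower($1) == "content-length:" {print $2}')
    if [ -n "$size" ] && [ "$size" -gt "$offset" ]; then
        curl -s -r "${offset}-$((size - 1))" "$URL"    # fetch and print only the new bytes
        offset=$size
    fi
    sleep 2
done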

How to check status of URLs from text file using bash shell script

I have to check the status of 200 http URLs and find out which of these are broken links. The links are present in a simple text file (say URL.txt present in my ~ folder). I am using Ubuntu 14.04 and I am a Linux newbie. But I understand the bash shell is very powerful and could help me achieve what I want.
My exact requirement would be to read the text file which has the list of URLs and automatically check if the links are working and write the response to a new file with the URLs and their corresponding status (working/broken).
I created a file "checkurls.sh" and placed it in my home directory where the urls.txt file is also located. I gave execute privileges to the file using
$chmod +x checkurls.sh
The contents of checkurls.sh is given below:
#!/bin/bash
while read url
do
    urlstatus=$(curl -o /dev/null --silent --head --write-out '%{http_code}' "$url")
    echo "$url $urlstatus" >> urlstatus.txt
done < $1
Finally, I executed it from command line using the following -
$./checkurls.sh urls.txt
Voila! It works.
#!/bin/bash
while read -ru 4 LINE; do
    read -r REP < <(exec curl -IsS "$LINE" 2>&1)
    echo "$LINE: $REP"
done 4< "$1"
Usage:
bash script.sh urls-list.txt
Sample:
http://not-exist.com/abc.html
https://kernel.org/nothing.html
http://kernel.org/index.html
https://kernel.org/index.html
Output:
http://not-exist.com/abc.html: curl: (6) Couldn't resolve host 'not-exist.com'
https://kernel.org/nothing.html: HTTP/1.1 404 Not Found
http://kernel.org/index.html: HTTP/1.1 301 Moved Permanently
https://kernel.org/index.html: HTTP/1.1 200 OK
For everything, read the Bash Manual. See man curl, help, man bash as well.
What about adding some parallelism to the accepted solution? Let's modify the script chkurl.sh to be a little easier to read and to handle just one request at a time:
#!/bin/bash
URL=${1?Pass URL as parameter!}
curl -o /dev/null --silent --head --write-out "$URL %{http_code} %{redirect_url}\n" "$URL"
And now you check your list using:
cat URL.txt | xargs -P 4 -L1 ./chkurl.sh
This could finish the job up to 4 times faster.
Here is my full script that checks URLs listed in a file passed as an argument, e.g. './checkurls.sh listofurls.txt'.
What it does:
check each url using curl and return the HTTP status code
send an email notification when a url returns a code other than 200
create a temporary lock file for failed urls (file naming could be improved)
send an email notification when a url becomes available again
remove the lock file once the url becomes available, to avoid further notifications
log events to a file and handle increasing log file size (AKA log rotation; uncomment the echo if code 200 logging is required)
Code:
#!/bin/bash
EMAIL="your@email.com"
DATENOW=`date +%Y%m%d-%H%M%S`
LOG_FILE="checkurls.log"
c=0
while read url
do
    ((c++))
    LOCK_FILE="checkurls$c.lock"
    urlstatus=$(/usr/bin/curl -H 'Cache-Control: no-cache' -o /dev/null --silent --head --write-out '%{http_code}' "$url")
    if [ "$urlstatus" = "200" ]
    then
        #echo "$DATENOW OK $urlstatus connection->$url" >> $LOG_FILE
        [ -e $LOCK_FILE ] && /bin/rm -f -- $LOCK_FILE > /dev/null && /bin/mail -s "NOTIFICATION URL OK: $url" $EMAIL <<< 'The URL is back online'
    else
        echo "$DATENOW FAIL $urlstatus connection->$url" >> $LOG_FILE
        if [ -e $LOCK_FILE ]
        then
            #no action - awaiting URL to be fixed
            :
        else
            /bin/mail -s "NOTIFICATION URL DOWN: $url" $EMAIL <<< 'Failed to reach or URL problem'
            /bin/touch $LOCK_FILE
        fi
    fi
done < $1

# REMOVE LOG FILE IF LARGER THAN ~120MB
# allow up to 2000 lines on average
maxsize=120000
size=$(/usr/bin/du -k "$LOG_FILE" | /bin/cut -f 1)
if [ $size -ge $maxsize ]; then
    /bin/rm -f -- $LOG_FILE > /dev/null
    echo "$DATENOW LOG file [$LOG_FILE] has been recreated" > $LOG_FILE
else
    #do nothing
    :
fi
Please note that changing the order of the listed urls in the text file will affect any existing lock files (remove all .lock files to avoid confusion). It could be improved by using the url as the file name, but certain characters such as : # / ? & would have to be handled for the operating system.
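One way around the naming problem (a sketch, not part of the script above) is to hash the url, so the lock file name is stable regardless of list order and contains no special characters:
url_hash=$(printf '%s' "$url" | md5sum | cut -c 1-32)
LOCK_FILE="checkurls_${url_hash}.lock"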
I recently released deadlink, a command-line tool for finding broken links in files. Install with
pip install deadlink
and use as
deadlink check /path/to/file/or/directory
or
deadlink replace-redirects /path/to/file/or/directory
The latter will replace permanent redirects (301) in the specified files.
If your input file contains one url per line, you can use a script to read each line and then try to ping the url's host; if the ping succeeds, the host is at least reachable (note that this checks the host, not the full URL).
#!/bin/bash
INPUT="Urls.txt"
OUTPUT="result.txt"
while read -r line
do
    # ping needs a hostname, not a full URL, so strip the scheme and path
    host=$(echo "$line" | sed -e 's|^[a-zA-Z]*://||' -e 's|/.*||')
    if ping -c 1 "$host" &> /dev/null
    then
        echo "$line valid" >> $OUTPUT
    else
        echo "$line not valid" >> $OUTPUT
    fi
done < $INPUT
exit
ping options :
-c count
Stop after sending count ECHO_REQUEST packets. With deadline option, ping waits for count ECHO_REPLY packets, until the timeout expires.
You can use this option as well to limit the waiting time:
-W timeout
Time to wait for a response, in seconds. The option affects only the timeout in the absence of any responses; otherwise ping waits for two RTTs.
curl -s -I --http2 "http://$1" >> fullscan_curl.txt
grep HTTP fullscan_curl.txt >> fullscan_httpstatus.txt
It works for me.

scp: how to find out that copying was finished

I'm using the scp command to copy a file from one Linux host to another.
I run the scp command on host1 and copy the file from host1 to host2. The file is quite big and it takes some time to copy.
On host2 file appears immediately as soon as copying was started. I can do everything with this file even if copying is still in progress.
Is there any reliable way to find out if copying was finished or not on host2?
Off the top of my head, you could do something like:
touch tinyfile
scp bigfile tinyfile user@host:
Then when tinyfile appears you know that the transfer of bigfile is complete.
As pointed out in the comments, this assumes that scp will copy the files one by one, in the order specified. If you don't trust it, you could do them one by one explicitly:
scp bigfile user@host:
scp tinyfile user@host:
The disadvantage of this approach is that you would potentially have to authenticate twice. If this were an issue you could use something like ssh-agent.
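A minimal ssh-agent sketch (assuming key-based authentication and that your key sits at the default ~/.ssh/id_rsa, which is an assumption):
eval "$(ssh-agent -s)"      # start the agent and export SSH_AUTH_SOCK / SSH_AGENT_PID
ssh-add ~/.ssh/id_rsa       # unlock the key once (path is an assumption)
scp bigfile user@host:
scp tinyfile user@host:
ssh-agent -k                # stop the agent when done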
On the sending side (host1), use a script like this:
#!/bin/bash
echo 'starting transfer'
scp FILE USER@DST_SERVER:DST_PATH
OUT=$?
if [ $OUT = 0 ]; then
    echo 'transfer successful'
    touch successful
    scp successful USER@DST_SERVER:DST_PATH
else
    echo 'transfer failed'
fi
On the receiving side (host2), use a script like this:
#!/bin/bash
SLEEP_TIME=30
MAX_CNT=10
CNT=0
while [[ ! -e successful && $CNT -lt $MAX_CNT ]]; do
    ((CNT++))
    sleep $SLEEP_TIME
done
if [[ -e successful ]]; then
    echo 'successful'
    rm successful
    # do something with FILE
fi
With CNT and MAX_CNT you avoid an endless loop (in case the file 'successful' is never transferred).
The product of MAX_CNT and SLEEP_TIME should be equal to or greater than the expected transfer time. In my example the expected transfer time is less than 300 seconds.
A checksum (md5sum, sha256sum, sha512sum) of the local and remote files would tell you if they're identical.
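If you do have SSH access to host2, a quick comparison might look like this (a sketch; the file names and paths are placeholders):
local_sum=$(md5sum bigfile | cut -c 1-32)
remote_sum=$(ssh user@host2 "md5sum /path/to/bigfile" | cut -c 1-32)
if [ "$local_sum" = "$remote_sum" ]; then
    echo "Checksums match - the copy is complete"
else
    echo "Checksums differ - still copying, or the transfer failed"
fi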
For the situation where you don't have SSH access to the remote system - like an FTP server - you can download the file after it's uploaded and compare the checksums. I do this for files I send from production scripts at work. Below is a snippet from the script in which I do this.
MD5SRC=$(md5sum $LOCALFILE | cut -c 1-32)
MD5TESTFILE=$(mktemp -p /ramdisk)
curl \
    -o $MD5TESTFILE \
    -sS \
    -u $FTPUSER:$FTPPASS \
    ftp://$FTPHOST/$REMOTEFILE
MD5DST=$(md5sum $MD5TESTFILE | cut -c 1-32)
if [ "$MD5SRC" == "$MD5DST" ]
then
    echo "+Local and Remote files match!"
else
    echo "-Local and Remote files don't match"
fi
If you use inotify-tools, then the solution will look like this:
while ! inotifywait -e close $(dirname ${bigfile_fullname}) 2>/dev/null | \
        grep -Eo "CLOSE $(basename ${bigfile_fullname})$" > /dev/null
do true
done
echo "File ${bigfile_fullname} closed"
After some investigation and discussion of the problem on other forums, I have found one more solution. Maybe it can help somebody.
There is a command "lsof". It lists open files. During copying the file will be opened, so the command
lsof | grep filename
will return a non-empty result.
So you might want to make a while loop to wait until lsof returns nothing and proceed with your task.
Example:
# provide your file name here
f=<nameOfYourFile>
lsofresult=$(lsof | grep "$f" | wc -l)
while [ $lsofresult != 0 ]; do
    echo "still copying file $f..."
    sleep 5
    lsofresult=$(lsof | grep "$f" | wc -l)
done
echo "copying file $f is finished: $(ls $f)"
For the duplicate question, "How to check if file has been scp 100% to the remote location", which was about an expect script: to know whether a file has been transferred completely, we can add expect 100%, i.e. something like this:
expect -c "
    set timeout 1
    spawn scp user@$REMOTE_IP:/tmp/my.file user@$HOST_IP:/home/.
    expect yes/no { send yes\r ; exp_continue }
    expect password: { send $SCP_PASSWORD\r }
    expect 100%
    sleep 1
    exit
"
if [ -f "/home/my.file" ]; then
    echo "Success"
fi
If avoiding a second SSH handshake is important, you can use something like the following:
ssh host cat \> bigfile \&\& touch complete < bigfile
Then wait for the "complete" file to get created on the remote end.
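On the remote end, waiting for that marker could be as simple as this sketch:
while [ ! -e complete ]; do
    sleep 5                  # poll until the marker file appears
done
echo "bigfile has been fully received"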
