How to make my script continue mirroring where it left off? - linux

I'm creating a script to download and mirror a site; the URLs are taken from a .txt file. The script is supposed to run daily for a few hours, so I need it to continue mirroring where it left off.
Here is the script:
# Created by Salik Sadruddin Merani
# email: ssm14293@gmail.com
# site: http://www.dragotech-innovations.tk
clear
echo ' Created by: Salik Sadruddin Merani'
echo ' email: ssm14293@gmail.com'
echo ' site: http://www.dragotech-innovations.tk'
echo
echo ' Info:'
echo ' This script will use the URLs provided in the File "urls.txt"'
echo ' Info: Logs will be saved in logfile.txt'
echo ' URLs are taken from the urls.txt file'
#
url=$(< ./urls.txt)
useragent='Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:25.0) Gecko/20100101 Firefox/25.0'
echo ' Mozilla Firefox User agent will be used'
cred='log=abc@123.org&pwd=abc123&wp-submit=Log+In&redirect_to=http://abc@123.org/wp-admin/&testcookie=1'
echo ' Loaded Credentials'
echo ' Logging In'
wget --save-cookies cookies.txt --post-data "${cred}" --keep-session-cookies http://members.ebenpagan.com/wp-login.php --delete-after
OIFS=$IFS
IFS=','
arr2=$url
for x in $arr2
do
echo ' Loading Cookies'
wget --spider --load-cookies cookies.txt --keep-session-cookies --mirror --convert-links --page-requisites "${x}" -U "${useragent}" --adjust-extension --continue -e robots=off --span-hosts --no-parent -o "log-file-$x.txt"
done
IFS=$OIFS
Problems with the script:
The script is not rewriting its links correctly so that they refer to the files in the parent directory; please tell me about that.
The script is not resuming after being aborted, even with the --continue option.

A smarter way to solve the problem is to work with two .txt files; let's call them "to_mirror.txt" and "mirrored.txt". Keep each URL on its own line. Declare a variable in your script initialized to 0, for example total_mirrored=0; it will be very important in our code. Then, every time the wget command executes and the site is mirrored, increment the "total_mirrored" variable by 1.
Upon exiting the loop, "total_mirrored" holds the number of URLs mirrored in this run.
Then extract the lines from "to_mirror.txt" in the range from the first line up to line "total_mirrored", and append them to "mirrored.txt".
After that, delete that range from "to_mirror.txt".
In this case the sed command can help you, see my example:
sed -n "1,$total_mirrored p" to_mirror.txt >> mirrored.txt && sed -i "1,$total_mirrored d" to_mirror.txt
You can learn a lot about the sed command by running man sed in your terminal, so I won't repeat here what each option does.
But know that:
>> appends to an existing file, or creates the file if no file of that name is present in the directory.
A && B runs B only if A succeeded.
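Putting the pieces together, here is a minimal sketch of that approach (the wget flags are illustrative, borrowed from the question; adjust them to your needs):
#!/bin/bash
# Sketch: mirror every URL in to_mirror.txt (one per line),
# then move the finished range into mirrored.txt.
total_mirrored=0
while IFS= read -r x; do
    # stop counting at the first failure so unfinished URLs stay in to_mirror.txt
    wget --mirror --convert-links --page-requisites --continue "$x" || break
    total_mirrored=$((total_mirrored + 1))
done < to_mirror.txt

if [ "$total_mirrored" -gt 0 ]; then
    sed -n "1,$total_mirrored p" to_mirror.txt >> mirrored.txt &&
    sed -i "1,$total_mirrored d" to_mirror.txt
fi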

The --continue flag in wget will attempt to resume the download of a single file in the current directory. Please refer to the wget man page for more info; it is quite detailed.
What you need is to resume the mirroring/downloading from where the script previously left off.
So it's more a modification of the script than some setting in wget. I can suggest a way to do that, but mind you, you can use a different approach as well.
Modify the urls.txt file to have one URL per line. Then refer to this pseudocode:
get the URL from the file
if (URL ends with a token #DONE), continue
else, wget command
append a token #DONE to the end of the URL in the file
This way, you will know which URL to continue from the next time you run the script. All URLs that have "#DONE" at the end will be skipped, and the rest will be downloaded.
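A minimal bash sketch of that pseudocode (note: the sed line assumes the URLs contain no characters special to sed, such as | or &):
#!/bin/bash
# Sketch: skip URLs already marked #DONE, mark each one when it finishes.
while IFS= read -r url; do
    case "$url" in
        *'#DONE') continue ;;   # already downloaded on a previous run
    esac
    wget --mirror --convert-links --page-requisites --continue "$url" &&
        sed -i "s|^$url\$|$url#DONE|" urls.txt
done < urls.txt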

Related

Downloading most recent file using curl

I need to download the most recent file, based on file creation time, from a remote site using curl. How can I achieve it?
These are files in remote site
user-producer-info-etl-2.0.0-20221213.111513-53-exec.jar
user-producer-info-etl-2.0.0-20221212.111513-53-exec.jar
user-producer-info-etl-2.0.0-20221214.111513-53-exec.jar
user-producer-info-etl-2.0.0-20221215.111513-53-exec.jar
Here, user-producer-info-etl-2.0.0-20221215.111513-53-exec.jar is the most recent file, which is the one I want to download. How can I achieve it?
Luckily for you, the file names contain dates that are alphabetically sortable!
I don't know your environment, so I'm guessing you have at least a shell, and I propose this bash answer:
First get the last file name
readonly endpoint="https://your-gitlab.local"
# Get the last filename
readonly most_recent_file="$(curl -s "${endpoint}/get_list_uri" | sort | tail -n 1)"
# Download it
curl -LOs "${endpoint}/get/${most_recent_file}"
You will obviously need to replace the URLs accordingly, but I'm sure you get the idea.
-L : follow HTTP 302 redirects
-O : download file to local dir keeping the name as is
-s : silent; don't show the progress meter or error messages
you can also specify a different local name with -o <file>, as in the example below
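For example, reusing the hypothetical endpoint and variable from above but choosing the local name yourself:
curl -Ls "${endpoint}/get/${most_recent_file}" -o "latest.jar"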
for more info:
man curl
hth

How does one create a wrapper around a program?

I want to learn to create a wrapper around a program in linux. How does one do this? A tutorial reference web-page/link or example will do. To clarify what I want to learn, I will explain with an example.
I use vim for editing text files, and use RCS as my simple revision control system. RCS allows you to check in and check out files. I would like to create a wrapper program named vir which, when I type in the shell:
$ vir temp.txt
will check the file temp.txt into RCS with ci -u temp.txt and then allow me to edit the file using vim.
When I get out and go back in, it will need to check out the file first, using co -l temp.txt, and allow me to edit the file as one normally does with vim; then, when I save and exit, it should check the file back in using ci -u temp.txt, and as part of that I should be able to add a version-control comment.
Basically, all I want to be doing on the command line is:
$ vir temp.txt
as one would with vim. And the wrapper should take care of the version control for me.
Take a look at rcsvers.vim, a vim plugin for automatically saving versions in RCS; you could modify that. There are also other RCS plugins for vim at vim.org
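If you'd rather script it yourself, here is a minimal sketch of such a vir wrapper (assuming the RCS ci/co tools and vim are on your PATH; a sketch, not a hardened implementation):
#!/bin/sh
# vir - edit a file under RCS control with vim
file="$1"
[ -n "$file" ] || { echo "usage: vir <file>" >&2; exit 1; }

# first run: put the file under RCS control
if [ ! -f "$file,v" ] && [ ! -f "RCS/$file,v" ]; then
    ci -u -t-"initial check-in" "$file"
fi

co -l "$file"   # check out with a lock so the working file is writable
vim "$file"     # edit as usual
ci -u "$file"   # check back in; ci prompts for a log comment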
I have a wrapper that enhances the ping command (using zsh); maybe it can help you:
# ping command wrapper - Last Change: out 27 2019 18:47
# source: https://www.cyberciti.biz/tips/unix-linux-bash-shell-script-wrapper-examples.html
ping(){
# Name: ping() wrapper
# Arg: (url|domain|ip)
# Purpose: Send ping request to domain by removing urls, protocol, username:pass using system /usr/bin/ping
local array=( "$@" ) # get all args in an array
local host=${array[-1]} # get the last arg
local args=${array[1,-2]} # get all args before the last arg in $@
#local _ping="/usr/bin/ping"
local _ping="/bin/ping"
local c=$(_getdomainnameonly "$host")
[ "$host" != "$c" ] && echo "Sending ICMP ECHO_REQUEST to \"$c\"..."
# pass args and host
# $_ping $args $c
# default args for ping
$_ping -n -c 2 -i 1 -W1 $c
}
_getdomainnameonly(){
# Name: _getdomainnameonly
# Arg: Url/domain/ip
# Returns: Only domain name
# Purpose: Get domain name and remove protocol part, username:password and other parts from url
# get url
local h="$1"
# upper to lowercase
local f="${h:l}"
# remove protocol part of hostname
f="${f#http://}"
f="${f#https://}"
f="${f#ftp://}"
f="${f#scp://}"
f="${f#scp://}"
f="${f#sftp://}"
# Remove username and/or username:password part of hostname
f="${f#*:*#}"
f="${f#*#}"
# remove all /foo/xyz.html*
f=${f%%/*}
# show domain name only
echo "$f"
}
What it does is hide the system ping behind a function called "ping": since shell functions take precedence over commands found via $PATH, the shell finds the function first. Then, inside the function, I define an internal variable that points to the real ping command:
local _ping="/bin/ping"
You can also see that the arguments are stored in an array.
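Usage looks like this (the confirmation line comes from the wrapper; the real ping output follows it):
$ ping https://user:pass@www.example.com/some/page.html
Sending ICMP ECHO_REQUEST to "www.example.com"...
(normal /bin/ping output follows)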

Why shell output redirect to a random name file?

I wrote a crontab job to make 3 POST requests every 10 minutes with cURL; here is the pseudocode:
#!/bin/sh
echo `date` >>/tmp/log
curl $a >>/tmp/log
curl $b >>/tmp/log
curl $c >>/tmp/log
That is all the code, but after the first echo to my /tmp/log, the other output was saved under a random file name like "A6E0U9~D". It doesn't happen all the time, and I have no clue why. :(
PS. I don't literally use "$a"; I use a raw string copied from the Chrome dev tools (one of them was added below). And every single line's output is fine on its own; the only problem is that some of the output gets redirected to a randomly named file.
the cURL link is deleted because it contained my login cookie
Not really a solution, but you can redirect the output of everything at once, rather than repeatedly appending to the same file.
#!/bin/sh
{
date
curl ...
curl ...
curl ...
} > /tmp/log
The benefit here is that all the output will appear in the same file, whether that file is /tmp/log or an oddly named file. If you still end up with another file aside from /tmp/log, then you know there must be a problem with one of the curl calls.
(Note that capturing and re-printing the output of date is redundant.)
In order to run each curl in parallel, you'll need to save the output from each, and concatenate them once all have finished.
#!/bin/sh
{
date
tmp1=$(mktemp) && curl ... > "$tmp1" &
tmp2=$(mktemp) && curl ... > "$tmp2" &
tmp3=$(mktemp) && curl ... > "$tmp3" &
wait
cat "$tmp1" "$tmp2" "$tmp3"
} > /tmp/log
rm "$tmp1" "$tmp2" "$tmp3"

I get a scheme missing error with cron

when I use this to download a file from an ftp server:
wget ftp://blah:blah@ftp.haha.com/"$(date +%Y%m%d -d yesterday)-blah.gz" /myFolder/Documents/"$(date +%Y%m%d -d yesterday)-blah.gz"
It says "20131022-blah.gz saved" (it downloads fine), however I get this:
/myFolder/Documents/20131022-blah.gz: Scheme missing (I believe this error prevents it from saving the file in /myFolder/Documents/).
I have no idea why this is not working.
Save the filename in a variable first:
OUT=$(date +%Y%m%d -d yesterday)-blah.gz
and then use -O switch for output file:
wget ftp://blah:blah@ftp.haha.com/"$OUT" -O /myFolder/Documents/"$OUT"
Without the -O, the output file name looks like a second file/URL to fetch, but it's missing http:// or ftp:// or some other scheme to tell wget how to access it. (Thanks @chepner)
Also note: if wget takes a long time to download a big file, the date can change between the two $(date ...) substitutions, so the download filename would differ from the filename being saved; storing it in a variable evaluates date only once.
In my case I had it working with the npm module http-server.
And I discovered that I simply had a leading space before http://.
So this was wrong: " http://localhost:8080/archive.zip".
Changed to the working form: "http://localhost:8080/archive.zip".
In my case, in cPanel, I used:
wget https://www.blah.com.br/path/to/cron/whatever

Use crontab job send mail, The email text turns to an attached file which named ATT00001.bin

I want to analyze some data on a Linux server, then send it as email text to my email account. When I execute this shell script on the command line, it works well. The weird thing is that when I put the whole procedure into a crontab job, the email text turns into an attached file. Can someone help?
#* * * * * sh -x /opt/bin/exec.sh >> /opt/bin/mailerror 2>&1
/* exec.sh */
#/bin/sh
cd /opt/bin
./analysis.sh > test
mail -s "Today's Weather" example@example.com < test
But when I execute exec.sh directly on the shell command line, the email arrives as text. Can someone explain this for me? Great thanks.
Ran into the same problem myself, only I'm piping text output into mailx (Heirloom mailx 12.4 7/29/08).
When running the script on the command line the email came out as normal email with a text body.
However, when I ran the exact same script via crontab the body of the email came as an attachment - ATT00001.BIN (Outlook), application/octet-stream (mutt) or "noname" (Gmail).
Took some research to figure this out, but here goes:
Problem
Mailx will, if it encounters unknown / control characters in text input, convert it into an attachment with application/octet-stream mime-type set.
From the man page:
for any file that contains formatting characters other than newlines and horizontal tabulators
So you need to remove those control characters, which can be done with e.g. tr:
echo "$Output" | /usr/bin/tr -cd '\11\12\15\40-\176' | mail ...
However, since I had Norwegian UTF-8 characters (æøå), the list expands; you don't really want to maintain such a list, and I need the Norwegian characters.
Inspecting the attachment, I found I had only \r, \n, and the "regular" printable ASCII characters, plus bytes 184 and 195, which indicate UTF-8.
Solution
Explicitly set the locale in your script:
LANG="en_US.UTF8" ; export LANG
Run export in your shell (or setenv if you run csh or tcsh) to see what your locale is set to.
Explanation
Mailx, when run in your shell with LANG set to a .UTF8 locale, will correctly identify the UTF-8 chars and continue.
When run from crontab, LANG is not set and defaults to LANG=C, since by default cron provides only a restricted set of environment variables (system dependent).
mailx (or other programs) will then not recognize the UTF-8 characters and will decide that the input contains unknown control characters.
My issue was UTF-8 characters; yours could be other control characters in your input. Run it through hexdump or od -c, but since it works OK in a regular shell, I suspect LANG issues.
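Applied to the exec.sh from the question, a minimal sketch of the fix (the locale value is an example; use whatever UTF-8 locale your system provides):
#!/bin/sh
# exec.sh - export a UTF-8 locale so mailx recognizes its input under cron
LANG="en_US.UTF8"; export LANG
cd /opt/bin || exit 1
./analysis.sh > test
mail -s "Today's Weather" example@example.com < test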
References:
linux mail < file.log has Content-Type: application/octet-stream (a noname attachment in Gmail)
http://alvinalexander.com/blog/post/linux-unix/how-remove-non-printable-ascii-characters-file-unix
I had this same issue and none of the above fixed the problem. Removing the extra carriage returns from the file fixed the issue for me:
cat logfile | tr -d \\r | mailx -s 'the logfile' to-me@.....
Thanks to this forum:
https://forums.opensuse.org/showthread.php/445955-mailx-creates-unwanted-attachment
Make sure you change this in your script
#/bin/sh
to be replaced by
#!/bin/sh
Coming to the problem
Your script assumes that it is being run from a particular directory (note that almost every path is a relative path, not an absolute path). cron happens to be running it from another directory.
The Fix for text appearing on email
mydir=$(dirname "$0") && cd "${mydir}" || exit 1
./analysis.sh > test
mail -s "Today's Weather" example@example.com < test
Explanation
$0 is the (possibly relative) filename of the shell script being executed. Given a filename, the dirname command returns the directory containing the filename.
So, that line changes directories to the directory containing the script or exits with an error code if either dirname or cd fails.
OR use full paths, like:
/opt/bin/analysis.sh > /opt/bin/test
mail -s "Today's Weather" example@example.com < /opt/bin/test
Note: The same problem is discussed earlier here
FOLLOW UP:
Try to remove
sh -x /opt/bin/exec.sh >> /opt/bin/mailerror 2>&1
and instead use
sh /opt/bin/exec.sh 2>&1 >> /opt/bin/mailerror
FOLLOW UP
You have to restart cron for changes to take effect if you do not use the crontab command to edit the file.
crontab -l > oldcrontab
cp oldcrontab newcrontab
echo "$newline" >> newcrontab
crontab < newcrontab
In my case, the cron job was not a shell script but a PHP script (so I couldn't put the export LANG line in it):
0 9 * * * apache php /test/myscript.php | mail -s "CRON - myscript" foo@bar.com
Solution:
To fix the same issue (content mailed as an attachment instead of in the body), I added LANG=fr_FR.UTF-8 at the beginning of the crontab file:
MAILTO=vme1.etc-crond-backoffice-conf
SHELL=/bin/bash
PATH=/sbin:/bin:/usr/sbin:/usr/bin:/usr/local/bin
LANG=fr_FR.UTF-8
0 9 * * * apache php /test/myscript.php | mail -s "CRON - myscript" foo@bar.com
NB: putting LANG=fr_FR.UTF-8 in the /etc/environment file and restarting the cron service worked too.
Reference:
Set LANG in crontab https://www.logikdev.com/2010/02/02/locale-settings-for-your-cron-job/
