wget newest file in another server's folder - linux

I have an automatic backup of a file running on a cronjob. It outputs into a folder, let's call it /backup, and appends a timestamp to each file every hour, like so:
file_08_07_2013_01_00_00.txt, file_08_07_2013_02_00_00.txt, etc.
I want to download these to another server, to keep as a separate backup. I normally just use wget to download a specific file, but I was wondering how I could automate this; ideally, every hour it would download the most recent file.
What would I need to look into to set this up?
Thanks!

wget can handle that; just enable time-stamping. I'm not even going to attempt my own explanation, here's a direct quote from the manual:
The usage of time-stamping is simple. Say you would like to download a file so that it keeps its date of modification.
wget -S http://www.gnu.ai.mit.edu/
A simple ls -l shows that the time stamp on the local file equals the state of the Last-Modified header, as returned by the server. As you can see, the time-stamping info is preserved locally, even without '-N' (at least for HTTP).
Several days later, you would like Wget to check if the remote file has changed, and download it if it has.
wget -N http://www.gnu.ai.mit.edu/
Wget will ask the server for the last-modified date. If the local file has the same timestamp as the server, or a newer one, the remote file will not be re-fetched. However, if the remote file is more recent, Wget will proceed to fetch it.
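Since each backup gets a new timestamped name, you could combine -N with a recursive fetch of the /backup directory and let cron run it every hour. A rough sketch, assuming the folder is served over HTTP with an index listing (the URL and local path below are placeholders):
# placeholder URL and local path; assumes directory listings are enabled on /backup
0 * * * * wget -q -N -r -np -nd -P /local/backup http://backup.example.com/backup/
Here -N skips files that haven't changed, -r -np keeps the recursion inside /backup, -nd flattens the result into one folder and -P sets that folder; each hourly run should then pull only the backup files it doesn't have yet.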

Related

lftp mirror with prefix

I did a partial download of a directory full of logs from a remote FTP server.
For example, mget *201610* will pull down all the logs with 201610 as part of the name.
What I would like to do is continue the download, since I only have about 100 of the 400 files for that month. Since I already downloaded a lot of the files, mirror -c would be the best option, as it only downloads the files I don't have.
Question:
How do I use mirror -c for only the subset of files that have 201610 in their names? I don't want to start downloading all the files for other months, just the missing ones from 201610.
Thanks
Use the -I option of mirror; it allows you to specify a wildcard for the files to be included.
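For example, something along these lines (host, credentials and paths are placeholders):
# host, credentials and paths below are placeholders
lftp -u user,password ftp.example.com -e 'mirror -c -I "*201610*" /remote/logs /local/logs; quit'
mirror -c skips the roughly 100 files you already have, and -I "*201610*" keeps the transfer limited to that month.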

How to overwrite ".listing" file when using "wget" command

I have a generic script that uses wget to download a file (passed as a parameter to the script) from an FTP server. The script always downloads the files into the same local folder. The problem I am running into is that the .listing file created by wget gets deleted by default, so if the script is called in parallel for different files, whichever process gets to delete the .listing file succeeds and the rest fail.
So I tried using --no-remove-listing with the wget command, but then I get the error:
File ".listing" already there; not retrieving.
I looked at another post, but as mentioned in the comments by the original poster, the question hasn't been answered even though it is marked as such.
One option I was thinking about is to change the script to create a subdirectory named after the file and download the file there. But since it is a large script, I was trying to see if there is an easier option that only changes the wget command.
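For reference, a rough sketch of the subdirectory workaround, with hypothetical paths and a hypothetical FTP URL, in case the wget-only route doesn't pan out:
file="$1"
workdir="/local/downloads/$file.d"   # private directory, so each run gets its own .listing
mkdir -p "$workdir" && cd "$workdir" || exit 1
wget "ftp://ftp.example.com/path/$file"
mv "$file" /local/downloads/ && cd / && rm -rf "$workdir"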

wget to download new wildcard files and overwrite old ones

I'm currently using wget to download specific files from a remote server. The files are updated every week, but always have the same file names, e.g. a newly uploaded file1.jpg will replace the local file1.jpg.
This is how I am grabbing them, nothing fancy:
wget -N -P /path/to/local/folder/ http://xx.xxx.xxx.xxx/remote/files/file1.jpg
This downloads file1.jpg from the remote server if it is newer than the local version, then overwrites the local one with the new one.
Trouble is, I'm doing this for over 100 files every week and have set up cron jobs to fire the 100 different download scripts at specific times.
Is there a way I can use a wildcard for the file name and have just one script that fires every 5 minutes for example?
Something like....
wget -N -P /path/to/local/folder/ http://xx.xxx.xxx.xxx/remote/files/*.jpg
Will that work? Will it check the local folder for all current file names, see what is new and then download and overwrite only the new ones? Also, is there any danger of it downloading partially uploaded files on the remote server?
I know that some kind of file sync script between servers would be a better option but they all look pretty complicated to set up.
Many thanks!
You can specify the files to be downloaded one by one in a text file, and then pass that file name using the -i or --input-file option.
e.g. contents of list.txt:
http://xx.xxx.xxx.xxx/remote/files/file1.jpg
http://xx.xxx.xxx.xxx/remote/files/file2.jpg
http://xx.xxx.xxx.xxx/remote/files/file3.jpg
....
then
wget .... --input-file list.txt
Alternatively, if all your *.jpg files are linked from a particular HTML page, you can use recursive downloading, i.e. let wget follow the links on your page to all linked resources. You might need to limit the "recursion level" and file types in order to prevent downloading too much. See wget --help for more info.
wget .... --recursive --level=1 --accept=jpg --no-parent http://.../your-index-page.html
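Putting those together, one way to get down to a single script is a list.txt containing all 100 URLs plus one cron entry (local and list paths below are placeholders):
# cron entry; local folder and list path are placeholders
*/5 * * * * wget -q -N -P /path/to/local/folder/ --input-file /path/to/list.txt
With -N, files whose remote copy hasn't changed since the last run are skipped, so firing this every 5 minutes stays cheap.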

Keep files updated from remote server

I have a server at hostname.com/files. Whenever a file has been uploaded, I want to download it.
I was thinking of creating a script that constantly checked the files directory. It would check the timestamp of the files on the server and download them based on that.
Is it possible to check a file's timestamp using a bash script? Are there better ways of doing this?
I could just download all the files on the server every hour. Would it therefore be better to use a cron job?
If you have a regular interval at which you'd like to update your files, yes, a cron job is probably your best bet. Just write a script that does the checking and run that at an hourly interval.
As @Barmar commented above, rsync could be another option. Put something like this in the crontab and you should be set:
# min hour day month day-of-week user command
17 * * * * user rsync -av http://hostname.com/ >> rsync.log
This would grab files from the server at that location and append the details to rsync.log at minute 17 of every hour. Right now, though, I can't seem to get rsync to fetch files from a web server.
Another option using wget is:
wget -Nrb -np -o wget.log http://hostname.com/
where -N re-downloads only files newer than the timestamp on the local version, -b sends the process to the background, -r recurses into directories and -o specifies a log file. -np makes sure wget doesn't ascend into the parent directory, which would end up spidering the entire server's content. This works against an arbitrary web server.
More details, as usual, will be in the man pages of rsync or wget.
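If you end up going the wget route, the cron wiring could be a slight variant of the command above: drop -b, since cron already runs it unattended, and point it at a dedicated local directory (paths below are placeholders):
# hourly cron entry; /local/files and the log path are placeholders
0 * * * * wget -N -r -np -nd -P /local/files -o /local/files/wget.log http://hostname.com/files/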

checksum remote file

Is there a program I can run via the command line that would compute a checksum of a remote file? For instance, get a checksum of https://stackoverflow.com/opensearch.xml
I want to be able to get an update when a new RSS/XML entry is available. I was thinking I could do a checksum of the file every once in a while, and if it is different then there must be an update. I'm looking to write a shell script that checks for new RSS/XML data.
A quick way to do this with curl is to pipe the output to sha1sum as follows:
curl -s http://stackoverflow.com/opensearch.xml | sha1sum
In order to compute a checksum of the file, you'll have to download it first.
Instead of this, use If-Modified-Since in your request headers; the server will respond with a 304 Not Modified status and no content if the file has not changed, or with the content of the file if it has. You may also be interested in checking for ETag support on the server.
If downloading the file is not a problem, you can use md5_file to get the MD5 checksum of the file.
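A quick way to check whether the server offers those validators at all is to look at the response headers before deciding on an approach (using the URL from the question):
curl -sI https://stackoverflow.com/opensearch.xml | grep -iE '^(last-modified|etag):'
If either header shows up, conditional requests are an option; if neither does, you're back to downloading and comparing checksums.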
curl
curl has a -z option:
-z/--time-cond <date expression>|<file>
(HTTP/FTP) Request a file that has been modified later than the given time and date, or one that has been modified before that time. The <date expression> can be all sorts of date strings or if it doesn't match any internal ones, it is taken as a filename and tries to get the modification date (mtime) from <file> instead. See the curl_getdate(3) man pages for date expression details.
So what you can do is:
$ curl http://stackoverflow.com/opensearch.xml -z opensearch.xml -o opensearch.xml
This will do the actual download only if the remote file is newer than the local one (the local file may be absent, in which case it will be downloaded). Which seems to be exactly what you need...
wget
wget also has an option to track timestamps: -N
When running Wget with -N, with or without -r or -p, the decision as to whether or not to download a newer copy of a file depends on the local and remote timestamp and size of the file.
-N, --timestamping Turn on time-stamping.
So in the case of wget, one can use:
$ wget -N http://stackoverflow.com/opensearch.xml
You can try this in bash:
wget <http://your file link>
md5sum <your file name>
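If you take that approach, a small sketch that turns it into a change check might look like this (file name and URL are just the ones from the question):
# re-download and compare against the previous checksum
old=$(md5sum opensearch.xml 2>/dev/null | cut -d' ' -f1)
wget -q -O opensearch.xml https://stackoverflow.com/opensearch.xml
new=$(md5sum opensearch.xml | cut -d' ' -f1)
[ "$old" != "$new" ] && echo "feed updated"
Note that the comparison also fires on the first run, when there is no previous copy to checksum.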
You should first examine the HTTP headers to see if the server itself is willing to tell you when the file was last modified; it's considered bad form to fetch the entire file if you don't need to.
Otherwise, you'll need to use something like wget or curl to fetch the file, so I really hope you don't plan to be working with anything large.
