I get a scheme missing error with cron - linux

When I use this to download a file from an FTP server:
wget ftp://blah:blah@ftp.haha.com/"$(date +%Y%m%d -d yesterday)-blah.gz" /myFolder/Documents/"$(date +%Y%m%d -d yesterday)-blah.gz"
It says "20131022-blah.gz saved" (it downloads fine), but then I get this:
/myFolder/Documents/20131022-blah.gz: Scheme missing (I believe this error prevents it from saving the file in /myFolder/Documents/).
I have no idea why this is not working.

Save the filename in a variable first:
OUT=$(date +%Y%m%d -d yesterday)-blah.gz
and then use the -O switch for the output file:
wget ftp://blah:blah@ftp.haha.com/"$OUT" -O /myFolder/Documents/"$OUT"
Without the -O, the output file name looks like a second file/URL to fetch, but it's missing http:// or ftp:// or some other scheme to tell wget how to access it. (Thanks @chepner)
Saving the name in a variable also matters for another reason: if wget takes a while to download a big file, the date could change between two separate $(date) evaluations, and the filename being downloaded would no longer match the filename being saved.
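Putting both pieces together, a minimal sketch of what the cron-invoked command could look like (host, credentials and paths are the placeholders from the question):
OUT="$(date +%Y%m%d -d yesterday)-blah.gz"   # evaluate the date once so the URL and the output path always match
wget "ftp://blah:blah@ftp.haha.com/$OUT" -O "/myFolder/Documents/$OUT"
If this line lives directly in the crontab rather than in a script, remember that cron treats unescaped % signs specially, so they would need to be escaped as \%.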

In my case I had it working with the npm module http-server.
I discovered that I simply had a leading space before http://.
So this was wrong: " http://localhost:8080/archive.zip".
Removing the space gave the working version: "http://localhost:8080/archive.zip".

In my case I used this in cPanel:
wget https://www.blah.com.br/path/to/cron/whatever

Related

Downloading most recent file using curl

I need to download the most recent file, based on file creation time, from a remote site using curl. How can I achieve it?
These are the files on the remote site:
user-producer-info-etl-2.0.0-20221213.111513-53-exec.jar
user-producer-info-etl-2.0.0-20221212.111513-53-exec.jar
user-producer-info-etl-2.0.0-20221214.111513-53-exec.jar
user-producer-info-etl-2.0.0-20221215.111513-53-exec.jar
Here, user-producer-info-etl-2.0.0-20221215.111513-53-exec.jar is the most recent file, and that is the one I want to download. How can I achieve it?
Luckily for you, the file names contain dates that are alphabetically sortable!
I don't know what environment you are in, so I'm guessing you have at least a shell, and I propose this bash answer.
First, get the last file name:
readonly endpoint="https://your-gitlab.local"
# Get the last filename
readonly most_recent_file="$(curl -s "${endpoint}/get_list_uri"|sort|tail -n 1)"
# Download it
curl -LOs "${endpoint}/get/${most_recent_file}"
You will obviously need to replace the URLs accordingly, but I'm sure you get the idea.
-L : follow HTTP redirects
-O : download the file to the local dir, keeping the name as is
-s : silent, don't show progress or network stats
You can also specify another local name with -o <the_most_recent_file>, for example:
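Reusing the variables from the snippet above (latest-exec.jar is just an arbitrary local name for illustration):
curl -Ls "${endpoint}/get/${most_recent_file}" -o latest-exec.jar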
for more info:
man curl
hth

Wget macro for downloading multiple URLs?

(NOTE: You need at least 10 reputation to post more than 2 links. I had to remove the http and the URLs, but it's still understandable, I hope!)
Hello!
I am trying to wget an entire website for personal educational use. Here's what the URLs look like:
example.com/download.php?id=1
I want to download all the pages from 1 to the last page, which is 4952,
so the first URL is:
example.com/download.php?id=1
and the last is
example.com/download.php?id=4952
What would be the most efficient method to download pages 1 through 4952?
My current command is (it's working perfectly fine, exactly the way I want it to):
wget -P /home/user/wget -S -nd --reject=.rar http://example.com/download.php?id=1
NOTE: The website has a troll and if you try to run the following command:
wget -P /home/user/wget -S -nd --reject=.rar --recursive --no-clobber --domains=example.com --no-parent http://example.com/download.php
it will download a 1000GB .rar file just to troll you!!!
I'm new to Linux, please be nice! Just trying to learn!
Thank you!
Notepad++ =
your URL + Column Editor = massive list of all URLs
wget -i your_file_with_all_urls = Success!
Thanks to Barmar
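If you would rather stay on the command line instead of using Notepad++, a quick shell alternative (using the placeholder URL and the options from your working command) is to generate the ids with seq and hand the list to wget -i:
seq 1 4952 | sed 's|^|http://example.com/download.php?id=|' > urls.txt   # one URL per line
wget -P /home/user/wget -S -nd --reject=.rar -i urls.txt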

Downloading a json file referenced by another json file with curl

I have a json file with the structure seen below:
{
url: "https://mysite.com/myjsonfile",
version_number: 69,
}
This json file is accessed from mysite.com/myrootjsonfile
I want to run a data-loading script that accesses mysite.com/myrootjsonfile, reads the url field from the json content, and uses curl to save the content at that URL to local storage.
This is my attempt so far.
curl -o assets/content.json 'https://mysite.com/myrootjsonfile' | grep -Po '(?<="url": ")[^"]*'
Unfortunately, instead of saving the content from mysite.com/myjsonfile, it's saving the content from the root file above, mysite.com/myrootjsonfile. Can anyone point out what I might be doing wrong? Bear in mind I'm completely new to curl. Thanks!
It is saving the content from myrootjsonfile because that is what you are telling curl to do: save that file to the location assets/content.json. The grep then reads its stdin, which is empty, because -o sent curl's output to the file instead of the pipe. You need to use two curl commands: one to download the root file (and process it to find the URL of the second), and a second one to download the actual content you want. You can use command substitution for this:
my_url=$(curl https://mysite.com/myrootjsonfile | grep -Po '(?<=url: ")[^"]*')
curl -o assets/content.json "$my_url"
I also changed the grep regex: this one matches the characters inside the quotes that follow url: (note that the key in your file is not itself quoted, which is why your original lookbehind never matched).
Assuming you wished to save the file to assets/content.json, note that flags are case sensitive.
Use -o instead of -O to redirect the output to assets/content.json.
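If jq happens to be available on your machine, it is a more robust way to pull out the url field than grep; note, though, that it only works if the root file is valid JSON with quoted keys (the snippet in the question shows unquoted keys, which jq would reject):
my_url=$(curl -s https://mysite.com/myrootjsonfile | jq -r '.url')   # -r outputs the raw string without quotes
curl -o assets/content.json "$my_url"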

Bash String Interpolation wget

I'm trying to download files using wget, and I think I'm running into some problems with string interpolation. I have a list of files to download that I have successfully parsed (no small feat for me) and would like to feed them into a for loop + wget combo that downloads these files en masse.
Please forgive me the URLs won't work for you because the password and data have been changed.
I have tried single and double quotes as well as escaping a few characters in the URL (the &s and @s, which I presume are at the heart of this).
list of files
files.txt
/results/KIDFROST@LARAZA.com--B627-.txt.gz
/results/KIDFROST@LARAZA.com--B626-part002.csv.gz
/results/KIDFROST@LARAZA.com--B62-part001.csv.gz
The wget statement works as a single command:
wget -O files/$i "https://tickhistory.com/HttpPull/Download?user=KIDFROST@LARAZA.com&pass=RICOSUAVE&file=/results/KIDFROST@LARAZA.com--N6265-.txt.gz"
but a loop doesn't work:
for i in `cat files.txt`; do
wget -O files/$i "https://tickhistory.com/HttpPull/Download?user=KIDFROST@LARAZA.com&pass=RICOSUAVE&file=$i"
done
Edit: The main issue was "files/$i" instead of "files$i" (see comments).
Don't use a for loop with cat for reading files; use a while/read loop. Fewer problems with word splitting, globbing, etc.
while read -r i; do
wget -O "files$i" "https://tickhistory.com/HttpPull/Download?user=KIDFROST@LARAZA.com&pass=RICOSUAVE&file=$i"
done < files.txt
Also make sure you quote variables to prevent expansion/splitting errors. I'm also not really sure why you're doing files/$i, since $i already contains a full path.

How to make my script continue mirroring where it left off?

I'm creating a script to download and mirror a site; the URLs are taken from a .txt file. The script is supposed to run daily for a few hours, so I need it to continue mirroring where it left off.
Here is the script:
# Created by Salik Sadruddin Merani
# email: ssm14293@gmail.com
# site: http://www.dragotech-innovations.tk
clear
echo ' Created by: Salik Sadruddin Merani'
echo ' email: ssm14293@gmail.com'
echo ' site: http://www.dragotech-innovations.tk'
echo
echo ' Info:'
echo ' This script will use the URLs provided in the File "urls.txt"'
echo ' Info: Logs will be saved in logfile.txt'
echo ' URLs are taken from the urls.txt file'
#
url=`< ./urls.txt`
useragent='Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:25.0) Gecko/20100101 Firefox/25.0'
echo ' Mozilla Firefox User agent will be used'
cred='log=abc@123.org&pwd=abc123&wp-submit=Log In&redirect_to=http://abc@123.org/wp-admin/&testcookie=1'
echo ' Loaded Credentials'
echo ' Logging In'
wget --save-cookies cookies.txt --post-data "${cred}" --keep-session-cookies http://members.ebenpagan.com/wp-login.php --delete-after
OIFS=$IFS
IFS=','
arr2=$url
for x in $arr2
do
echo ' Loading Cookies'
wget --spider --load-cookies cookies.txt --keep-session-cookies --mirror --convert-links --page-requisites ${x} -U ${useragent} -np --adjust-extension --continue -e robots=no --span-hosts --no-parent -o log-file-$x.txt
done
IFS=$OIFS
Problems with the script:
The script is not referencing its links correctly, i.e. making them refer to the files in the parent directory; please tell me about that.
The script is not resuming after being aborted even with the --continue option.
A smarter way to solve the problem is to work with two .txt files; let's affectionately call them "to_mirror.txt" and "mirrored.txt". Keep each URL on its own line. Declare a variable in your script with an initial value of 0, for example total_mirrored=0; it will be very important in our code. Then, every time the wget command is executed and a site is successfully mirrored, increment the value of "total_mirrored" by 1.
Upon exiting the loop, "total_mirrored" will hold some integer value.
Then extract the lines from "to_mirror.txt" in the range from the first line up to line "total_mirrored" and append them to "mirrored.txt".
After that, delete that range from "to_mirror.txt".
In this case the sed command can help you, see my example:
sed -n "1,$total_mirrored p" to_mirror.txt >> mirrored.txt && sed -i "1,$total_mirrored d" to_mirror.txt
You can learn a lot about the sed command by running man sed in your terminal, so I won't explain here what each option does as it's redundant.
But know that:
>> appends to the existing file, or creates the file if one with that name is not present in the directory. A && B runs B only if A succeeded.
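A minimal sketch of that bookkeeping, assuming to_mirror.txt holds one URL per line and leaving the exact wget options up to you:
total_mirrored=0
while read -r url; do
    # stop at the first failure so the counter still matches the order of lines in the file
    wget --mirror --convert-links --page-requisites --no-parent "$url" || break
    total_mirrored=$((total_mirrored + 1))
done < to_mirror.txt
# move the finished URLs from to_mirror.txt into mirrored.txt
if [ "$total_mirrored" -gt 0 ]; then
    sed -n "1,$total_mirrored p" to_mirror.txt >> mirrored.txt
    sed -i "1,$total_mirrored d" to_mirror.txt
fi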
The --continue flag in wget will attempt to resume the downloading of a single file in the current directory. Please refer to the man page of wget for more info. It is quite detailed.
What you need is resuming the mirroring/downloading from where the script previously left off.
So it's more a modification of the script than some setting in wget. I can suggest a way to do that, but mind you, you can use a different approach as well.
Modify the urls.txt file to have one URL per line, then refer to this pseudocode:
get the URL from the file
if (URL ends with a token #DONE), continue
else, wget command
append a token #DONE to the end of the URL in the file
This way, you will know which URL to continue from the next time you run the script. All URLs that have a "#DONE" at the end will be skipped, and the rest will be downloaded.
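A rough bash rendering of that pseudocode, again assuming one URL per line in urls.txt and leaving the mirroring options to taste:
lineno=0
while read -r url; do
    lineno=$((lineno + 1))
    case "$url" in
        *'#DONE') continue ;;                      # finished on a previous run, skip it
    esac
    if wget --mirror --convert-links --page-requisites --no-parent "$url"; then
        sed -i "${lineno}s/\$/ #DONE/" urls.txt    # append the token to that line
    fi
done < urls.txt   # the loop keeps reading the original file even while sed -i rewrites it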
