Linux: wget - scheme missing when using the -i option

I am trying to download multiple files from Yahoo Finance using wget.
To do that, I used a Python script to generate a text file with all the URLs I need.
When downloading a single file (a CSV file) using the following command:
wget ichart.finance.yahoo.com/table.csv?s=BIOM3.SA&a=00&b=5&c=1900&d=04&e=21&f=2013&g=d&ignore=.csv
everything goes OK!
However, when I add the -i option and read the URL from the file instead of passing it directly, I get the error:
Invalid URL ichart.finance.yahoo.com/table.csv?s=BIOM3.SA&a=00&b=5&c=1900&d=04&e=21&f=2013&g=d&ignore=.csv: Scheme missing
The file that contains the URLs is a text file with a single URL on each line. The URLs are exactly like the one in the first example, but with different parameters.
Is there a way to correct this?
Thanks a lot for reading!!

To solve the problem, I added double quotes around the links and a web protocol (scheme). For example:
"http://ichart.finance.yahoo.com/table.csv?s=BIOM3.SA&a=00&b=5&c=1900&d=04&e=21&f=2013&g=d&ignore=.csv"

Related

How do I curl a URL with an unknown filename at the end?

I'm talking to a server that creates a new zip file daily, e.g. data-1234.zip. Every day the previous zip is removed and a new one is created with an incremented number, e.g. data-1235.zip. The script will be run sporadically throughout the week, but it's on a lab system where the user can't manually update the name to match what's on the server.
The server only has one zip file in that directory, so it's just a matter of getting the naming convention right. There is, however, a "data.ini" file in the folder as well, so something that just searches by the first part of the name wouldn't necessarily work. I've seen posts similar to this question using regex, but the file number is currently at 10,609 and I'd rather not use expansion for potentially thousands of calls, depending on whether I get access to modify the script in the coming years. I've been searching for something similar to "data-*.zip" but haven't had any luck.
The question was solved by switching commands and running
lftp https://download.companyname.com/product/data/ -e "mget data-*.zip; bye"
since lftp allows wildcards in the filename, unlike curl.
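If you have to stay with curl, a rough sketch of a workaround (assuming the server exposes a plain HTML index of that directory, which is not guaranteed) is to fetch the listing, extract the current file name, and then download it:
# grab the directory listing and pull out the data-<number>.zip name
name=$(curl -s https://download.companyname.com/product/data/ | grep -oE 'data-[0-9]+\.zip' | head -n 1)
curl -O "https://download.companyname.com/product/data/$name"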

Log-in only once using wget multiple times on same ftp-server

Basically, I am using wget on a file containing multiple URLs. The command I use is:
wget -i list_of_urls
and I notice that for each row in "list_of_urls", wget performs a log-in step to the FTP server I'm downloading from. It does this automatically, without me entering any username or password. Each line produces the output
Connecting to ftp.ncbi.nlm.nih.gov (ftp.ncbi.nlm.nih.gov)|130.14.250.13|:21... connected.
Logging in as anonymous ... Logged in!
followed by the file downloading.
Is there any way to log in only once, for the first row, and then reuse that login to download all the following rows? Since the URLs point to the same FTP server, just different files, logging in for every row feels wasteful.
Edit: changed from "website" to "FTP server" since that was what I actually meant, thanks. Added a sample output of the log-in message.
After some fiddling around, I think using the rsync protocol solved the problem. This works here because the file host has both FTP and rsync servers containing the same files. For small file sizes I then simply use
rsync $(tr '\n' ' ' <list_of_urls) /usrpath/
which was much faster than using wget over FTP. I had to include the $(tr '\n' ' ' <list_of_urls) because the list of URLs is newline-separated, while rsync takes space-separated sources on the command line. The rsync protocol seems to log in only once in this case and then download all the files, which is why it went much faster.
Another problem arises with this method when list_of_urls is very long, which I haven't solved yet.
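One way around the long-list problem (an untested sketch, assuming list_of_urls holds one rsync URL per line) is to let xargs split the list into batches, so no single command line exceeds the system limit:
# run rsync on at most 100 sources at a time; the trailing _ fills $0 so the URLs land in "$@"
xargs -n 100 sh -c 'rsync "$@" /usrpath/' _ < list_of_urls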

wget - how to download embedded PDFs that have a download button, from a text-file URL list? Is it possible?

Happy New Year!
I wanted to see if anybody has ever successfully downloaded embedded PDF files from multiple URLs contained in a .txt file for a website.
For instance:
I tried several combinations of wget -i urlist.txt (which downloads all the HTML files perfectly); however, it doesn't also grab each HTML file's embedded PDF, which has a ?xxxxx slug on the end of the .pdf.
The exact example of this obstacle is the following:
For this dataset I have placed all 2 pages of links into url.txt:
https://law.justia.com/cases/washington/court-of-appeals-division-i/2014/
1 example URL within this dataset:
https://law.justia.com/cases/washington/court-of-appeals-division-i/2014/70147-9.html
The embedded pdf link is the following:
https://cases.justia.com/washington/court-of-appeals-division-i/2014-70147-9.pdf?ts=1419887549
The .pdf files are actually named like "2014-70147-9.pdf?ts=1419887549", i.e. .pdf?ts=xxxxxxxxxx,
and each one is different.
The URL list contains 795 links. Does anyone have a successful method to download every .html in my urls.txt while also downloading each one's accompanying .pdf?ts=xxxxxxxxxx file to go with the .html?
Thank you!
~ Brandon
Try using the following:
wget --level 1 --recursive --span-hosts --accept-regex 'https://law.justia.com/cases/washington/court-of-appeals-division-i/2014/.*html|https://cases.justia.com/washington/court-of-appeals-division-i/.*.pdf.*' --input-file=urllist.txt
Details about the --level, --recursive, --span-hosts, --accept-regex, and --input-file options can be found in the wget documentation at https://www.gnu.org/software/wget/manual/html_node/index.html.
You will also need to know how regular expressions work. You can start at https://www.grymoire.com/Unix/Regular.html
What you are looking for is a web scraper. Be careful not to break any rules if you ever use one.
You could also process the content you have received through wget with some string manipulation in a bash script.
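As a sketch of that string-manipulation route (assuming the PDF links appear verbatim in the HTML files you already downloaded into the current directory):
# pull every cases.justia.com PDF link out of the saved HTML, de-duplicate, then download
grep -hoE 'https://cases\.justia\.com/[^"]+\.pdf\?[^"]*' *.html | sort -u > pdflist.txt
wget -i pdflist.txt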

bash script for creating and then downloading the links

Hello.
So I want to make a script for my girlfriend that uses an external file to append words to a URL, then downloads the links and iterates.
The awkward thing is she doesn't want to tell me too much (I suspect the result of using the script will be for my benefit :P), so I'm not certain about the function; I'm kind of guessing.
The aim is for the script to contain a base URL. The script will iterate through an external file containing a list of words, append each word to the link, and then open that link. Then iterate, append, open, and so on.
Can someone help me out a bit with this? I'm a bit new to scripting.
Should I set up an external file to hold the base URL and then refer to that as well?
I'm thinking something along the lines of:
url=$(grep * url.txt)
for i in $(cat file.txt);
do
>> $url
wget $url
done
What and how much do I need to change and add?
Thanks for any help.
I have a file named source with the following content:
which-2.16.tar.gz
which-2.17.tar.gz
which-2.21.tar.gz
I wrote a script named downloader with the following content:
#!/bin/bash
url="http://ftp.gnu.org/gnu/which"   # source url
while read line
do
    wget "$url/$line"                # download url = source url + file name from the file
done <source                         # feeding filenames from the source file
Running downloader will download the files listed in source from the site given in url. Voila!
I guess you could employ a similar concept.
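Adapted to the original question, a minimal sketch might look like the following (url.txt and file.txt are the file names from the question; reading the base URL with $(<url.txt) assumes url.txt holds nothing but that single URL):
#!/bin/bash
base=$(<url.txt)          # the single base URL
while read -r word
do
    wget "$base$word"     # append each word to the base URL and download
done <file.txt            # feeding words from the word list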

SublimeText3 + pandown + pandoc: includes_paths not working

I'm using ST3 + Pandown + Pandoc to convert Markdown to PDF. I want to use Pandown's includes_paths setting to avoid typing the path to my image directory every time, but I haven't been able to get it to work. Here's an MWE:
I have a directory structure as follows:
text.markdown
test/img.pdf
In text.markdown, I have:
![](img.pdf)
I've set includes_paths as follows in Pandown.sublime-settings:
"includes_paths":
[
"test/"
],
But no dice. I've also tried an absolute path, ./test, and test. Any ideas?
I think Pandown's includes_paths only applies to Pandoc's --include-in-header, --include-before-body, and --include-after-body options, not to image locations etc.
From Pandown.sublime-settings, about includes_paths:
Pandoc apparently doesn't search for values for its --include
arguments anywhere but the working directory, which makes
working from a standard stylesheet or standard script
sort of tedious.
A workaround is to load the graphicx package via the YAML header and set \graphicspath:
---
header-includes:
- \usepackage{graphicx}
---
\graphicspath{{test/}}
![](img.pdf)
Pandoc will say that it can't find img.pdf, but the image will be present in the final pdf.
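If you can run Pandoc outside Pandown, newer Pandoc releases also have a --resource-path option that tells it where to search for images; whether it helps here depends on your Pandoc version, so treat this as an assumption-laden sketch rather than a confirmed fix:
# search the working directory and test/ for images referenced in the document
pandoc text.markdown --resource-path=.:test -o text.pdf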
