Downloading PDF files from a remote website - linux

How can I download all the PDF files present on a website? I don't want to use the wget command, as that makes it manual and time-consuming.
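A rough sketch of one way to do it without wget, assuming all the PDFs are linked directly from a single index page (the URL below is just a placeholder; if the links in the page are absolute URLs, drop the base-URL prefix in the loop):
curl -s http://example.com/docs/ \
  | grep -oE 'href="[^"]*\.pdf"' \
  | sed -e 's/^href="//' -e 's/"$//' \
  | while read -r link; do
      curl -s -O "http://example.com/docs/$link"   # -O keeps the remote filename
    done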

Related

Is there any way to download data from disk station to the server directly?

I have a disk station with some large files (a site with an interactive file browser where I can download files and archives). I also have a server with SSH access. Can I download a file directly to the server, without first downloading it to my local machine and then copying it to the server with scp?
Some people say to use
wget http://link-to-the-file
But I am not sure that there is a direct download link. Moreover, you have to specify the language of the archive and confirm it before downloading.
Note: I tried to use wget, but I can't work out whether there is a download link at all.
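A rough sketch, assuming you can copy the final download URL (and any session cookie the site sets once you have confirmed the language) from your browser's developer tools; the host, path, and cookie name below are placeholders:
ssh user@server
wget --header='Cookie: session=PASTE_COOKIE_VALUE_HERE' \
     -O archive.zip 'https://diskstation.example/download/archive.zip'
If no cookie or confirmation step is involved, plain wget with the copied URL is enough.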

Excel files downloaded from a GitHub repository become corrupt

This discussion on GitHub says that Excel files downloaded from a repository become corrupt because of the way the HTTP GET request is built.
But let's say I just need a button for downloading a .xlsx file from there that is not corrupted. Is that possible somehow, without getting into HTTP requests?
In any case, you can use SSH to download if you really don't want to use HTTP.
downloading file using SFTP with VBA
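As a sketch, two common routes, assuming a public repository (OWNER, REPO, the branch name main, and the file path are placeholders): fetch the raw file over HTTP instead of the HTML page around it, or clone over SSH and take the file from the working copy.
curl -L -o report.xlsx https://raw.githubusercontent.com/OWNER/REPO/main/path/to/report.xlsx
git clone git@github.com:OWNER/REPO.git   # SSH route, no HTTP involved
The corruption typically appears when the surrounding HTML page is saved instead of the raw binary, which is why the raw URL matters.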

Download images with MediaWiki API?

Is it possible to download images from Wikipedia with MediaWiki API?
No, it is not possible to get the images themselves via the API. Images in a MediaWiki are stored in plain folders, not in a database, and are not delivered dynamically (more information on that in Manual:Image administration).
However, you can retrieve the URLs of those image files via the API. See, for example, the API:Allimages list or the imageinfo property query modules. Then you can download the files from those URLs with your favourite tool (curl, wget, whatever).
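For instance, a rough sketch using the public Wikimedia Commons API (the file title is just an example):
# Ask the API for the direct URL of a file
curl -s 'https://commons.wikimedia.org/w/api.php?action=query&titles=File:Example.jpg&prop=imageinfo&iiprop=url&format=json'
# The response contains a "url" field under imageinfo; fetch that URL with curl -O or wget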
If your question is about downloading all images from Wikipedia, meta:data dumps would be a good start. You also may ask on the data-dump mailing list on how to sync with a repository like Wikimedia Commons.
If you already know the file's name, there is no need for the API.
This works:
# Curl download, -L: follow redirect, -C -: resume downloads, -O: keep filename
curl -L -C - -O http://commons.wikimedia.org/wiki/Special:FilePath/Example.jpg # Example.jpg

Is there a way to use a downloaded (xml-esque) file in an extension?

I'm attempting to write an emusic download manager extension for Chrome. This would essentially involve parsing the .emx file (xml-esque) downloaded from the site to get the temporary music file URL.
I've gone through everything I can find online about extensions working with downloads, but the problem is that the .emx file isn't directly accessible from the web. The link to download it seems to pass through a counter for your account so that it can deduct the cost, and then it pushes a file 0.emx to be downloaded. Hotlinking doesn't seem to work at all, not even copying and pasting the button's destination URL into the same browser tab.
I think the only way to get the file is to wait for the 0.emx file to be pushed. Is there any way I can use an extension to intercept that file before it goes to the downloads folder or use it after it's already downloaded?
Thanks!

What's the best way to save a complete webpage on a linux server?

I need to archive complete pages including any linked images etc. on my linux server. Looking for the best solution. Is there a way to save all assets and then relink them all to work in the same directory?
I've thought about using curl, but I'm unsure of how to do all of this. Also, will I maybe need PHP-DOM?
Is there a way to run Firefox on the server and copy the temporary files after the address has been loaded, or something similar?
Any and all input welcome.
Edit:
It seems as though wget is 'not' going to work, as the files need to be rendered. I have Firefox installed on the server; is there a way to load the URL in Firefox, grab the temp files, and then clear them afterwards?
wget can do that, for example:
wget -r http://example.com/
This will mirror the whole example.com site.
Some interesting options are:
-Dexample.com: do not follow links of other domains
--html-extension: renames pages with text/html content-type to .html
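For example, combining the options above (the domain and URL are placeholders):
wget -r -Dexample.com --html-extension http://www.example.com/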
Manual: http://www.gnu.org/software/wget/manual/
Use the following command:
wget -E -k -p http://yoursite.com
Use -E to adjust extensions, -k to convert links so the page loads from your local copy, and -p to download all objects embedded in the page.
Please note that this command does not download other pages hyperlinked from the specified page; it only downloads the objects required to load the specified page properly.
If all the content in the web page was static, you could get around this issue with something like wget:
$ wget -r -l 10 -p http://my.web.page.com/
or some variation thereof.
Since you also have dynamic pages, you cannot in general archive such a web page using wget or any simple HTTP client. A proper archive needs to incorporate the contents of the backend database and any server-side scripts. That means that the only way to do this properly is to copy the backing server-side files. That includes at least the HTTP server document root and any database files.
EDIT:
As a work-around, you could modify your webpage so that a suitably privileged user could download all the server-side files, as well as a text-mode dump of the backing database (e.g. an SQL dump). You should take extreme care to avoid opening any security holes through this archiving system.
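A rough sketch of such a server-side snapshot, assuming a MySQL backend and a document root under /var/www/html (both names and paths are assumptions; adjust them to your setup):
mysqldump -u backup_user -p site_db > /backups/site_db.sql      # text-mode dump of the database
tar -czf /backups/site_files.tar.gz /var/www/html               # archive of the HTTP document root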
If you are using a virtual hosting provider, most of them provide some kind of Web interface that allows backing-up the whole site. If you use an actual server, there is a large number of back-up solutions that you could install, including a few Web-based ones for hosted sites.
I tried a couple of tools, curl and wget included, but nothing worked up to my expectations.
Finally I found a tool that saves a complete webpage (images, scripts, linked pages... everything included). It's written in Rust and is named monolith. Take a look.
It does not save images and other scripts/stylesheets as separate files, but packs them all into a single HTML file.
For example, to save https://nodejs.org/en/docs/es6 to es6.html with all page requisites packed into one file, I would run:
monolith https://nodejs.org/en/docs/es6 -o es6.html
wget -r http://yoursite.com
Should be sufficient, and it will grab images and other media. There are plenty of options you can feed it.
Note: I don't believe wget or any other program supports downloading images specified through CSS, so you may need to do that yourself manually.
This page may have some more useful arguments: http://www.linuxjournal.com/content/downloading-entire-web-site-wget
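One commonly used combination of those options looks like this (the URL is a placeholder):
wget --mirror --convert-links --adjust-extension --page-requisites --no-parent http://yoursite.com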
