What's the best way to save a complete webpage on a linux server? - linux

I need to archive complete pages including any linked images etc. on my linux server. Looking for the best solution. Is there a way to save all assets and then relink them all to work in the same directory?
I've thought about using curl, but I'm unsure of how to do all of this. Also, will I maybe need PHP-DOM?
Is there a way to use firefox on the server and copy the temp files after the address has been loaded or similar?
Any and all input welcome.
Edit:
It seems as though wget is 'not' going to work, as the files need to be rendered. I have firefox installed on the server; is there a way to load the URL in firefox, grab the temp files, and then clear the temp files afterwards?

wget can do that, for example:
wget -r http://example.com/
This will mirror the whole example.com site.
Some interesting options (combined in the example below) are:
-Dexample.com: do not follow links of other domains
--html-extension: renames pages with text/html content-type to .html
Manual: http://www.gnu.org/software/wget/manual/
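For example, combining the recursive download with both options (a sketch; example.com stands in for your site, and note that -D only takes effect together with -H, which allows wget to follow links onto other hosts in the listed domains):
wget -r -H -Dexample.com --html-extension http://example.com/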

Use the following command:
wget -E -k -p http://yoursite.com
Use -E to adjust extensions (save HTML pages with an .html suffix). Use -k to convert links so the saved page loads from your local copy. Use -p to download all objects required by the page.
Please note that this command does not download other pages hyperlinked from the specified page. It only downloads the objects required to load the specified page properly.
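If some of the page requisites (fonts, scripts, images) live on other hosts such as a CDN, you may also need -H so wget is allowed to fetch them from those hosts; a hedged variant:
wget -E -H -k -p http://yoursite.com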

If all the content in the web page was static, you could get around this issue with something like wget:
$ wget -r -l 10 -p http://my.web.page.com/
or some variation thereof.
Since you also have dynamic pages, you cannot in general archive such a web page using wget or any simple HTTP client. A proper archive needs to incorporate the contents of the backend database and any server-side scripts. That means that the only way to do this properly is to copy the backing server-side files. That includes at least the HTTP server document root and any database files.
EDIT:
As a workaround, you could modify your webpage so that a suitably privileged user could download all the server-side files, as well as a text-mode dump of the backing database (e.g. an SQL dump). You should take extreme care to avoid opening any security holes through this archiving system.
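A minimal sketch of such a dump, assuming a MySQL backend and a document root under /var/www/html (the user name, database name, and paths here are placeholders):
# Dump the database to a text file, then archive it together with the document root
mysqldump -u backupuser -p mydb > /var/backups/mydb.sql
tar czf /var/backups/site-backup-$(date +%F).tar.gz /var/www/html /var/backups/mydb.sql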
If you are using a virtual hosting provider, most of them provide some kind of Web interface that allows backing-up the whole site. If you use an actual server, there is a large number of back-up solutions that you could install, including a few Web-based ones for hosted sites.

I tried a couple of tools, curl and wget included, but nothing worked up to my expectations.
Finally I found a tool that saves a complete webpage (images, scripts, linked pages... everything included). It's written in Rust and named monolith. Take a look.
It does not save images and other scripts/stylesheets as separate files, but packs them all into a single HTML file.
For example, to save https://nodejs.org/en/docs/es6 to es6.html with all page requisites packed into one file, run:
monolith https://nodejs.org/en/docs/es6 -o es6.html

wget -r http://yoursite.com
Should be sufficient and grab images/media. There are plenty of options you can feed it.
Note: I believe neither wget nor any other program supports downloading images specified through CSS, so you may need to do that yourself manually.
Here are some useful arguments: http://www.linuxjournal.com/content/downloading-entire-web-site-wget
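As a rough starting point for the manual CSS step mentioned above, you could list the url(...) references in the stylesheets wget already saved and fetch them yourself (a sketch; yoursite.com is the mirror directory wget -r creates, and relative paths still have to be resolved against each stylesheet's location):
# List unique url(...) references found in the downloaded CSS files
grep -rhoE 'url\([^)]+\)' yoursite.com/ --include='*.css' | sort -u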

Related

How can I make a script to download files with dynamic links?

I have no experience writing scripts on Linux, and I need to make a script to download files (multiple files on the same page) whose download links are dynamic. What path should I take, or what should I know, to be able to do it?
If you have a page with links to the resources you want to download, I suggest trying the Recursive Retrieval Options of GNU Wget. To limit the download to resources linked directly from that page, you can set the recursion level to 1, i.e. do something like
wget -r -l 1 http://www.example.com
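If only certain file types behind those links matter, an accept list narrows the download further; a sketch assuming the files are PDFs under a hypothetical /downloads/ path (swap the pattern and URL for your case):
wget -r -l 1 --no-parent -A '*.pdf' http://www.example.com/downloads/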

What does "non-interactive command line tool" mean?

I'm researching the wget command in Linux (I'm very new to Linux) and I found this statement which I don't understand:
GNU Wget is a free software package for retrieving files using HTTP, HTTPS and FTP, the most widely-used Internet protocols. It is a non-interactive commandline tool, so it may easily be called from scripts, cron jobs, terminals without X-Windows support, etc.
And what does "without X-Windows support" mean?
Also, what I understand about wget is that it downloads something, but how come I can run
wget http://google.com/
and see some weird text on the screen?
A little help here.
Wget downloads content to a file. So the text you see in your terminal is just a job log.
Non-interactive means that it doesn't prompt for any input while it works; you specify everything via command-line parameters.
X (and related) handles GUI rendering. See http://en.wikipedia.org/wiki/X_Window_System for details.
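That is exactly what makes wget handy in scripts and cron jobs: everything is given on the command line, so nothing stops to wait for a human. A hypothetical crontab entry, for instance:
# Fetch a report every night at 02:00 (the URL and output path are made up for illustration)
0 2 * * * wget -q -O /var/backups/report.html http://www.example.com/report.html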
It's easier to think of what wget DOESN'T do. Your typical browser reads a URL from a GUI, and when you click on it, the browser generates and sends a request to retrieve an HTML file. It then parses the (text-based) HTML source, sends further requests for content like images etc., and renders the whole thing to the GUI as a webpage.
Wget just sends the request and downloads the file. It can be told to recursively fetch links in the source file, so you could download the whole internet with a few keystrokes XD.
It's useful by itself for grabbing graphic and audio files without having to sit through a point-and-click session. You can also pipe the HTML source through a custom sed or perl filter to extract data (like going to the city transit page and converting schedule info to a spreadsheet format).
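A toy example of that pipe-through-a-filter idea (the URL and the sed pattern are made up; a real page would need its own expression):
# Print the page to stdout and pull the contents of <td> cells into a file
wget -qO- http://www.example.com/schedule.html | sed -n 's:.*<td>\([^<]*\)</td>.*:\1:p' > schedule.csv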

Download images with MediaWiki API?

Is it possible to download images from Wikipedia with MediaWiki API?
No, it is not possible to get the images via the API. Images in a MediaWiki are stored just in folders, not in a database and are not delivered dynamically (more information on that in the Manual:Image administration).
However, you can retrieve the URLs of those image files via the API. For example, see the API:Allimages list or the imageinfo property query modules. Then you can download the files from those URLs with your favourite tool (curl, wget, whatever).
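For instance, the imageinfo query module can return the direct URL of a given file page; a sketch against Wikimedia Commons (the file name is just an example):
curl 'https://commons.wikimedia.org/w/api.php?action=query&titles=File:Example.jpg&prop=imageinfo&iiprop=url&format=json'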
If your question is about downloading all images from Wikipedia, meta:data dumps would be a good start. You also may ask on the data-dump mailing list on how to sync with a repository like Wikimedia Commons.
Given you know the file's name, no need for the API.
This works:
# Curl download, -L: follow redirect, -C -: resume downloads, -O: keep filename
curl -L -C - -O http://commons.wikimedia.org/wiki/Special:FilePath/Example.jpg # Example.jpg

How tell wget to only download files within a specific path

I am trying to download the contents of a website, but since this website is in more than one language, I ran into a problem. The structure of the site is as follows:
Mysite.com
fr.Mysite.com
ar.Mysite.com
...
I just want to download the files in the Arabic subdomain (ar.Mysite.com), but using the -r option it also downloads the contents of the other languages. How can I tell wget to recursively download the files, but only the files within a specific domain?
The --no-parent option does what you want, if I understand the question. It limits queries to those that lie underneath the "directory" specified by the original URL.
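Putting that together with recursion, using the subdomain exactly as it appears in the question:
wget -r --no-parent http://ar.Mysite.com/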

Coldfusion security issue...how to hide directory of files?

So, I decided to try to break my website... I googled my site by typing in site:mysite.com/whatever and, behold, all of the users' uploaded files were available for viewing under a specific directory.
What kind of script/countermeasure should I use to block these files from being viewed? I already have a script that checks the path and the logged-in status, but this doesn't seem to be working. I've looked all over for solutions, but I can't quite find one. I'm using ColdFusion 8.
This isn't a ColdFusion issue so much as a web server configuration issue.
You should either:
configure your web server not to show a directory of files when using a URL without a filename (e.g., http://www.example.com/files/); a sketch for Apache follows this list
drop a blank default web document (index.html, index.htm, default.htm, index.cfm, whatever) into that directory so that it displays that document rather than the list of files. If you use index.cfm, it'll fire your Application.cfm/cfc in your file path and use whatever other security you've built.
(or, better, do both)
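For the first option, on Apache one way to switch off the listing for that directory is a one-line .htaccess file (a sketch; it assumes Apache with AllowOverride enabled, and the path is hypothetical):
# Disable directory listings for the uploads directory
echo 'Options -Indexes' > /var/www/html/uploads/.htaccess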
The best way to secure your file listings and the files themselves is to store them in another folder outside of the Web site root folder. You can then serve them up using CFDIRECTORY and CFCONTENT. The pages that display the files can check your access controls and only serve the files to those allowed to see them.
