I need to pre-compress some very large HTML/XML/JSON files (large data dumps) using either gzip or deflate. I never want to serve the files uncompressed: they are so large and repetitive that compression will probably work very, very well. Some older browsers cannot handle the decompression, but my typical customers will not be using them (although it would be nice if I could generate some kind of 'hey, you need to upgrade your browser' message).
I auto-generate the files, and I can easily generate .htaccess files to go along with each file type. Essentially, what I want is an always-on version of mod_gunzip. Because the files are large, and because I will be serving them repeatedly, I need a method that lets me compress once, really well, on the command line.
I have found some information on this site and others about how to do this with gzip, but I wondered if someone could step me through how to do it with deflate. Bonus points for a complete answer that includes what my .htaccess file should look like, as well as the command-line invocation I should use (GNU/Linux) to obtain optimal compression. Super bonus points for an answer that also addresses how to send a "sorry, no file for you" message to non-compliant browsers.
It would be lovely if we could create a "precompression" tag to cover questions like this.
-FT
Edit: Found AddEncoding in mod_mime
This works:
<IfModule mod_mime.c>
<Files "*.html.gz">
ForceType text/html
</Files>
<Files "*.xml.gz">
ForceType application/xml
</Files>
<Files "*.js.gz">
ForceType application/javascript
</Files>
<Files "*.gz">
AddEncoding gzip .gz
</Files>
</IfModule>
The docs make it sound like only the AddEncoding line should be needed, but I couldn't get that to work on its own.
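For the command-line side of the question, plain gzip at its highest setting is usually enough for files this repetitive; the invocation below is only a sketch:
gzip -9 < BIGfile.html > BIGfile.html.gz
For the "sorry, no file for you" bonus, one possible sketch (assuming mod_rewrite is enabled, and with /upgrade-browser.html as a placeholder page you would create yourself) is to redirect clients that do not advertise gzip support:
<IfModule mod_rewrite.c>
RewriteEngine On
# Clients that do not send Accept-Encoding: gzip get pointed at an upgrade notice
RewriteCond %{HTTP:Accept-Encoding} !gzip
RewriteRule \.(html|xml|js)\.gz$ /upgrade-browser.html [R,L]
</IfModule>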
Also, Lighttpd's mod_compress can compress files and cache the compressed copies.
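For reference, a minimal lighttpd sketch (the directive names come from mod_compress; the cache path is just an example) might look like:
server.modules += ( "mod_compress" )
compress.cache-dir = "/var/cache/lighttpd/compress/"
compress.filetype = ( "text/html", "application/xml", "application/javascript" )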
If I were you, I would look at built-in filesystem compression instead of doing this at the Apache layer.
On Solaris, ZFS has transparent compression; run zfs set compression=on on the dataset to compress the whole filesystem.
Similarly, Windows can compress folders, and Apache will serve the content oblivious to the fact that it is compressed on disk.
Linux also has filesystems that do transparent compression.
For the command line, compile zlib's zpipe (http://www.zlib.net/zpipe.c) and then run, for example:
zpipe < BIGfile.html > BIGfile.htmlz
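If it helps, zpipe.c depends only on zlib itself, so building it is typically just:
cc -o zpipe zpipe.c -lz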
Then, using Zash's example, set up a filter to change the header. That should leave you with raw deflate files, which modern browsers probably support.
For another way to compress the files, take a look at pigz with its zlib (-z) or PKWare zip (-K) output options. Test whether these come through correctly with Content-Encoding set.
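A rough sketch of those pigz invocations (the long flag names here are taken from pigz's manual; -9 is just an example level):
pigz --zlib -9 BIGfile.html    # zlib-wrapped deflate, writes BIGfile.html.zz
pigz --zip -9 BIGfile.html     # single-entry PKWARE zip, writes BIGfile.html.zip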
A quick way to compress content without dealing directly with mod_gzip/mod_deflate is to use ob_gzhandler and modify the headers (before any output is sent to the browser).
<?php
/* Replace CHANGE_ME with the correct MIME type of your large file,
   e.g. application/json */
ob_start('ob_gzhandler');
header('Content-Type: CHANGE_ME; charset=UTF-8');
header('Cache-Control: must-revalidate');
$offset = 60 * 60 * 2;
$ExpStr = 'Expires: ' . gmdate('D, d M Y H:i:s', time() + $offset) . ' GMT';
header($ExpStr);
/* Stuff to generate your large files here */
Related
Is it possible to download a file nested in a zip file without downloading the entire zip archive?
For example, from a URL that could look like:
https://www.any.com/zipfile.zip?dir1\dir2\ZippedFileName.txt
Depending on whether you are asking for a simple way to implement this on the server side or for a way to use standard protocols so you can do it from the client side, there are different answers:
Doing it with the server's intentional support
Optimally, you implement a handler on the server that accepts a query string on any file download, similar to your suggestion (I would, however, include a variable name, for example ?download_partial=dir1/dir2/file). The server can then extract the file from the ZIP archive and serve just that (perhaps via a compressed stream if the file is large).
If this is the path you are going down and you update the question with the technology used on the server, someone may be able to answer with suggested code.
But on with the slightly more fun way...
Doing it opportunistically if the server cooperates a little
There are two things that conspire to make this somewhat feasible, but it is only worth it if the ZIP file is massive compared to the file you want from it.
ZIP files have a directory that says where in the archive each file is. This directory is present at the end of the archive.
HTTP servers optionally allow clients to download just a byte range of a resource.
So, if we issue a HEAD request for the URL of the ZIP file (HEAD /path/file.zip), we may get back an Accept-Ranges: bytes header and a Content-Length header that tells us the length of the ZIP file. If we have those, we can issue a GET request with a header such as Range: bytes=1000000-1024000, which would give us part of the file.
The directory of files is towards the end of the archive, so if we request a reasonably sized block from the end of the file, we will likely get the central directory included. We then look up the file we want and know where it is located in the large ZIP file.
We can then request just that range from the server, and decompress the result...
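A rough Node.js sketch of that probing step (the URL comes from the question, the 64 KiB tail size is an arbitrary assumption, and the central-directory parsing itself is only described in comments):
// probe-zip-tail.mjs -- needs Node 18+ for the built-in fetch
const url = 'https://www.any.com/zipfile.zip'; // placeholder URL from the question

const head = await fetch(url, { method: 'HEAD' });
if (head.headers.get('accept-ranges') !== 'bytes') throw new Error('server does not advertise byte ranges');
const length = Number(head.headers.get('content-length'));

// The End Of Central Directory record (and usually the whole central directory)
// sits at the very end of the archive, so fetch only the last 64 KiB.
const start = Math.max(0, length - 64 * 1024);
const tail = await fetch(url, { headers: { Range: `bytes=${start}-${length - 1}` } });
const buf = Buffer.from(await tail.arrayBuffer());
console.log('got', buf.length, 'bytes of the archive tail');

// From here you would scan buf for the EOCD signature (0x06054b50), read the
// central directory to find the offset and compressed size of the entry you
// want, issue one more ranged GET for just that slice, and inflate it
// (ZIP entries are raw deflate, so zlib.inflateRawSync would apply).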
Diagrams.net, previously and still more widely known as draw.io, is a popular tool for drawing diagrams of various kinds. It stores diagrams in an XML-based format that uses the file ending .drawio. The file content has the structure:
<mxfile {...}>
<diagram {...}>
{the-actual-diagram-content}
</diagram>
</mxfile>
According to the documentation page Extracting the XML from mxfiles, the string {the-actual-diagram-content} contains the actual diagram data in compressed format, "compressed with the standard deflate process". I'd like to decompress this data in my node.js app to parse and modify it.
I have found an older, similar question on Stack Overflow that wants the same thing, but it uses the libraries "atob" and later "pako". I'd like to achieve the same with the more standard "zlib" node.js module, which, if this really is "the standard deflate process", should be possible.
However, all my attempts to "inflate" the compressed string fail. I have mostly tried variations of the following code, with different encodings ('base64', 'utf8') and methods ('inflateSync', 'unzipSync', 'gunzipSync'):
zlib.inflateSync(Buffer.from(string, 'base64')).toString();
All attempts fail with the error "Error: incorrect header check". I read this as "dude, seriously, you're using the wrong unzip algorithm for this". However, I cannot figure out what the right algorithm or settings are.
The sample string I'd like to decode is below. Using the jgraph inflate/deflate tool, it decompresses perfectly fine. However, the settings used there, "URL Encode", "Deflate", and "Base64", sound exactly like what I am already trying.
zVdbk6I6EP41Vp3z4BYXL/Ao3nV0VEYZfQsQITOBIEQu/voNAgrqrHtOzVbti5X+0t0kX/eXxJrYdeKhDzx7RkyIawJnxjWxVxOEZktmvymQZIAoNzLA8pGZQfwVUNEJ5iCXo0dkwqDiSAnBFHlV0CCuCw1awYDvk6jqtie4+lUPWPAOUA2A71ENmdTOUKnJXfERRJZdfJnn8hkHFM45ENjAJFEJEvs1sesTQrORE3chTrkreMniBl/MXhbmQ5f+TkCXT0gX48NHW1CSsVHXjta0nmcJAT7mG+7kq6VJQYFPjq4J0yxcTVQiG1GoesBIZyNWc4bZ1MHM4tkwoD75vFDFNqnkX4A+hfGXS+cvhLBGgsSB1E+YSxHQyjlMbuzoWhK+4NkulaPwA3kXWJfUV6LYIOfqP/Am3PGm/IW8ia3mX8abeMebSdIYesceNJkOcxNinUT9K6CcATaRsoOYWM8EAp92UsUz3CXu2c01b5AS42xygNLVn8sDMLJcNjYYtdBnAAY6xAowPq1zHbsEE/+ap1pbFuMn72Vjmxo/moXZi8uTvaSwYkTfi+WwcSmKWdeg1ChiMqJSdn7dFIxMcvQN+Fz8jDgL0mfN/mWT1bkfXFNqVxqtXjSVDzGgKKyu9VFX5ekXBLFdXHIL0o3wpZvGzPaYR5UPv5tEolhNJIg3iTISfpGocCT7fQArPmchXHj5/9po3GnjXhQYs4sPPj9OQOBlt+EexWmbPjhffEIBBfo5ddpY+Q1aQthVDsv2b51IX8v+voNKx9CjU+ibmqjOV2t/sf98SZvPS8qeBV46DCh0DYT/CTfzlR7MrF5PmC8SIz5MdfnN4UVrzlmdlql4q46i66/m3uq7mtdTvZ3jrFQ0Zp9S4EjXoqUgK+uX5cTtbS3TmDV36a4E5bhgxA0mW3u1w5PpSRuph+6SIW9HNIC0cewe9e0cxsJyA4VOe8v2qyznChq33wP8Ee3DE97iYWvIqXY74k/4oIKDGQta/LnmY9WBRjsaAaN90jSwWrNYHIZr5vGxZt8eeMTHLyCQArwGfZ2OY3Uh0tZouLKlYLya0FVfjjZyM3ZM5VVDn6d4ISvcECB5rYSOOEpCJA90I54dtp0X7Mubo77bACKoZiNgz0vFOxmKLr3tk7QLlNZorbYnQ6HhBtPe20Hy2QJUcR8vwlktvaUHvPc+VDYn1yFm0kY4lJfSXBxN9d3rsKmpO3VuyWa4kLr/PpfZQ9kAjEnUKd6e3I3U+MJGJL1v6vL5hDdROQOMPeAWl8udlP+kDMUHMhS/S4bVx0i98Q0yZOb1BZ25X/+GiP2f
What am I doing wrong?
Use zlib.inflateRawSync(). What you have there is a raw deflate stream, not a zlib stream.
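For example, a minimal Node.js sketch of the full decode chain (assuming the payload follows the documented base64 -> raw deflate -> URL-encoded XML layout):
const zlib = require('zlib');

function decodeDrawioDiagram(compressed) {
  const raw = Buffer.from(compressed, 'base64');               // undo the base64 layer
  const inflated = zlib.inflateRawSync(raw).toString('utf8');  // raw deflate: no zlib header, hence the "incorrect header check"
  return decodeURIComponent(inflated);                         // undo the URL encoding to get the XML
}

// Usage: pass the text content of the <diagram> element.
// console.log(decodeDrawioDiagram(sampleString));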
I would like to compress a directory directly into a Poco::HTTPServerResponse stream. However, downloading the zip file produced by the following code results in a corrupt archive. I do know that this compression approach works for locally created zip files, as I have successfully done that much. What am I missing, or is this simply not possible? (Poco v1.6.1)
std::string directory = "/tmp/data";
response.setStatusAndReason(HTTPResponse::HTTPStatus::HTTP_OK);
response.setKeepAlive(true);
response.setContentType("application/zip");
response.set("Content-Disposition","attachment; filename=\"data.zip\"");
Poco::Zip::Compress compress(response.send(),false);
compress.addRecursive(directory,
Poco::Zip::ZipCommon::CompressionMethod::CM_STORE,
Poco::Zip::ZipCommon::CompressionLevel::CL_MAXIMUM,
false, "data");
compress.close();
I use the same technique successfully, with only a slight difference: the compression method and the compression level (CM and CL).
compress.addFile( cacheFile, Poco::DateTime(), currentFile.GetName(), Poco::Zip::ZipCommon::CM_DEFLATE, Poco::Zip::ZipCommon::CL_SUPERFAST );
A zip file normally corresponds to the DEFLATE algorithm, so when unzipping, your explorer/archive manager probably cannot work it out.
Either that, or it is pointless to use a MAXIMUM compression level with the STORE method (STORE is non-compressing by definition).
EDIT: Just tried it; actually, it's because CM_STORE internally uses headers (probably some kind of tar-like layout). Once your files have been added to the zip stream and you close it, Poco tries to reorder the headers and resets the position of the output stream to the start in order to write them.
Since that cannot be done on the HTTP output stream (your bytes have already been sent!), it fails.
Switching to CM_DEFLATE should fix your problem.
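For clarity, that change amounts to the code from the question with only the compression method swapped; a sketch along those lines (not tested here):
Poco::Zip::Compress compress(response.send(), false);
compress.addRecursive(directory,
                      Poco::Zip::ZipCommon::CompressionMethod::CM_DEFLATE,
                      Poco::Zip::ZipCommon::CompressionLevel::CL_MAXIMUM,
                      false, "data");
compress.close();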
I want to use wget to download files linked from the main page of a website, but I only want to download text/html files. Is it possible to limit wget to text/html files based on the mime content type?
I don't think they have implemented this yet, as it is still on their bug list.
http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=21148
You might have to do everything by file extension.
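For instance, a sketch of the extension-based approach using wget's suffix accept list (the URL is a placeholder):
wget -r -A 'html,htm' https://<site>/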
Wget2 has this feature.
--filter-mime-type — Specify a list of MIME types to be saved or ignored.
The wget2 documentation for --filter-mime-type=list says:
Specify a comma-separated list of MIME types that will be downloaded. Elements of the list may contain wildcards. If a MIME type starts with the character '!' it won't be downloaded; this is useful when trying to download something with exceptions. For example, download everything except images:
wget2 -r https://<site>/<document> --filter-mime-type=*,\!image/*
It is also useful to download files that are compatible with an application of your system. For instance, download every file that is compatible with LibreOffice Writer from a website using the recursive mode:
wget2 -r https://<site>/<document> --filter-mime-type=$(sed -r '/^MimeType=/!d;s/^MimeType=//;s/;/,/g' /usr/share/applications/libreoffice-writer.desktop)
Wget2 has not been released as of today, but it will be soon. Debian unstable already ships an alpha version.
Look at https://gitlab.com/gnuwget/wget2 for more info. You can post questions/comments directly to bug-wget@gnu.org.
Add an Accept header to the options:
wget --header='Accept: text/html'
Note that this only tells the server what you prefer; it does not make wget filter by the returned MIME type.
The directory contains about a dozen HTML files. Index.html contains links to all the others.
The same directory contains hundreds of Word files. The HTML files contain links to the Word files.
All links are relative, i.e., no protocol, no host, no path, and no leading slash.
Click a link to an HTML file and it works. Click a link to a Word doc and the browser says it can't be found. To get a more precise error, I used wget.
Oversimplified version:
wget "http://Lang-Learn.us/RTR/Immigration.html"
gives me the file I asked for, but
wget "http://Lang-Learn.us/RTR/Al otro lado.doc"
tells me that Lang-Learn.us doesn't exist (400)
Same results if I use "lang-learn.us" instead. I did verify correct casing on the filenames themselves, and I also tried escaping the spaces with %20 (it didn't help, not that I expected it to after the host-name message).
The actual session:
MBP:~ wgroleau$ wget "http://Lang-Learn.us/RTR/Immigration.html"
--2011-03-09 00:39:51-- http://lang-learn.us/RTR/Immigration.html
Resolving lang-learn.us... 208.109.14.87
Connecting to lang-learn.us|208.109.14.87|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: `Immigration.html.2'
[ <=> ] 5,973 --.-K/s in 0s
2011-03-09 00:39:51 (190 MB/s) - `Immigration.html.2' saved [5973]
MBP:~ wgroleau$ wget "http://Lang-Learn.us/RTR/Al otro lado.doc"
--2011-03-09 00:40:11-- http://lang-learn.us/RTR/Al%20otro%20lado.doc
Resolving lang-learn.us... 208.109.14.87
Connecting to lang-learn.us|208.109.14.87|:80... connected.
HTTP request sent, awaiting response... 400 No Host matches server name lang-learn.us
2011-03-09 00:40:11 ERROR 400: No Host matches server name lang-learn.us.
The error looks like an issue with redirection or domain mapping,
but how could that be turned on or off by the file extension?
The hosting provider at first tried to tell me I don't know how to write HTML, but when I mentioned that I've been in software for thirty years and doing web work for several, he put me on hold to find someone who actually knows something. Eventually they came back and said it's MY fault for not having the correct stuff in .htaccess.
Setting aside the obvious retort that it is the hosting provider's job to put the correct stuff in httpd.conf, I made a couple of attempts. But 99% of my web work has been content in HTML/PHP/Perl, and I know nearly nothing about .htaccess.
The following two attempts did NOT work:
AddType application/msword .doc
AddType application/octet-stream .doc
UPDATE: By using
<FilesMatch "\.html$">
ForceType application/octet-stream
</FilesMatch>
I verified that the server does allow .htaccess, but requesting a .doc instead of an HTML file still gets that idiotic "ERROR 400: No Host matches server name lang-learn.us".
Finally, after hours with more than one "tech supporter," I got them to admit that they had made a configuration error. Besides telling me to use .htaccess, they had earlier suggested that I ask the client to convert his hundreds of Word files into HTML pages.
Since the provider is the one that screwed up, there technically is no answer to the question of what I can do to fix it.