Can data be added to a file for better compression?

If I understand the basic idea of ZIP compression correctly (and I think compression in general), compressed files are just patterns found in the original data expressed in shorter notation. Are there compression algorithms out there that insert junk/unimportant data to a file to add patterns where there once were none? Is that violating some file integrity rule, or even just diminishing returns?
Mostly I was thinking of adding whitespace to something that doesn't care about it, like an HTML file.
EDIT: a more concrete example would probably be better:
.class-a {
display: block;
color: #fff;
}
.class-b {display:block;color:#fff;}
Obviously minification (and reusing classes) would be the best practice here, but this is a question about what an algorithm could do, not what humans should do. Would adding whitespace so that the latter line matches the former provide any benefit whatsoever?
EDITEDIT: This all sounds like some bizarre parody of lossy compression, now that I think about it. Gainy compression or some nonsense.

No, in principle adding more information to the file will increase the amount of information that must be included in the compressed file too, so the compressed file will be larger.
If there is the string AAA in a file, and if a repetition of that pattern is added, then the compressed file will have to include a representation of AAA plus something to say that the pattern is repeated elsewhere. Recording where the pattern is repeated takes up space too.
Another way of looking at it, using the HTML example, would be that if you add lots of whitespace, then that will probably compress well, so the final size of the compressed file will at best stay the same. So the "compression ratio" would be higher, but there also wouldn't be any more interesting content in the uncompressed file, so the absolute improvement in compression would be at best zero.
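You can test this directly with Python's zlib (a quick sketch; on strings this tiny, zlib's fixed overhead dominates, but the direction of the effect still shows):

import zlib

minified = b'.class-b {display:block;color:#fff;}'
# The same rule with the whitespace from the formatted example added back.
expanded = b'.class-b {\n    display: block;\n    color: #fff;\n}'

# The whitespace compresses well, but it still has to be represented,
# so the expanded version should not come out smaller than the minified one.
print(len(zlib.compress(minified)))
print(len(zlib.compress(expanded)))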

Related

Can antlr4 be used to parse very large gzip compressed files?

I am trying to parse a very large gzip-compressed (10+ GB) file in Python 3. Instead of creating the parse tree, I used embedded actions based on the suggestions in this answer.
However, looking at the FileStream code it wants to read the entire file and then parse it. This will not work for big files.
So, this is a two part question.
Can ANTLR4 use a file stream, probably custom, that allows it to read chunks of the file at a time? What should the class interface look like?
Predicated on the above having "yes", would that class need to handle seek operations, which would be a problem if the underlying file is gzip compressed?
Short answer: no, not possible.
Long(er) answer: ANTLR4 can potentially use unlimited lookahead, so it relies on the stream being able to seek to any position with no delay, or parsing speed will slow to nearly a halt. For that reason all runtimes use a normal file stream that reads in the entire file at once.
There were discussions/attempts in the past to create a stream that buffers only part of the input, but I haven't heard of anything that actually works.
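In practice that means decompressing the whole thing up front and handing ANTLR the resulting text, e.g. with the Python runtime (a sketch; the file name is made up, and it assumes you have enough RAM for the decompressed input):

import gzip
from antlr4 import InputStream  # antlr4-python3-runtime

# Decompress the entire file into memory, since the parser may seek anywhere.
with gzip.open('big_input.gz', 'rt', encoding='utf-8') as f:
    data = f.read()

stream = InputStream(data)
# ...hand `stream` to your generated lexer as usual.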

Do any of the Python compression module algorithms simply store the data for speed optimisation?

From Wikipedia, about ZPAQ Compression-
ZPAQ has 5 compression levels from fast to best. At all but the best level, it uses the statistics of the order-1 prediction table used for deduplication to test whether the input appears random. If so, it is stored without compression as a speed optimization.
I've been working with the Python Data Compression and Archiving module, and wonder if any of those implementations (ZLIB, BZ2, LZMA) do the same? Do any of them simply store the data 'as-is' when it looks almost random? I'm not a coding expert and can't really follow the source code.
Related: How to efficiently predict if data is compressible
Some incomplete / best-guess remarks:
LZMA2 seems to do that, although for a different reason: compression ratio, not compression time.
This is indicated on Wikipedia:
LZMA2 is a simple container format that can include both uncompressed data and LZMA data, possibly with multiple different LZMA encoding parameters.
The XZ LZMA2 encoder processes the input in chunks (of up to 2 MB uncompressed size or 64 KB compressed size, whichever is lower), handing each chunk to the LZMA encoder, and then deciding whether to output an LZMA2 LZMA chunk including the encoded data, or to output an LZMA2 uncompressed chunk, depending on which is shorter (LZMA, like any other compressor, will necessarily expand rather than compress some kinds of data).
The latter quote also shows that there is no compression-speed gain to expect, as it's more or less a do-both-and-pick-whichever-is-shorter approach.
(The article focuses on xz-based LZMA2; this probably carries over to whatever is inside Python, but no guarantees.)
The above, together with Python's docs:
Compression filters:
FILTER_LZMA1 (for use with FORMAT_ALONE)
FILTER_LZMA2 (for use with FORMAT_XZ and FORMAT_RAW)
would make me think you've got everything you need and just have to use the right filter.
So check your reasoning again (is it about time or compression ratio?) and try the LZMA2 filter on custom-prepared mixed data (if you don't want to trust this blindly).
Intuition: I don't expect the more classic zlib/bz2 formats to exploit incompressible data (but that's a pure guess).
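One way to check that guess yourself with the standard lzma module (a sketch; exact sizes will vary by machine and version):

import lzma
import os

filters = [{'id': lzma.FILTER_LZMA2, 'preset': 6}]

random_data = os.urandom(1_000_000)        # incompressible
text_data = b'display: block; ' * 62_500   # highly repetitive, same size

for name, data in (('random', random_data), ('text', text_data)):
    out = lzma.compress(data, format=lzma.FORMAT_XZ, filters=filters)
    # Random input should come out barely larger than it went in
    # (stored in uncompressed LZMA2 chunks); the text should shrink a lot.
    print(name, len(data), '->', len(out))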

how to check compression type without decompressing?

I wrote code in Node.js to decompress different file types (tar, tar.gz, etc.).
I do not have the filename available to me.
Currently I use brute force to decompress: the first attempt that succeeds wins.
I want to improve this by knowing the compression type beforehand.
Is there a way to do this?
Your "brute force" approach would actually work very well, since the software would determine incredibly quickly, usually within the first few bytes, that it had been handed the wrong thing. Except for the one that will work.
You can see this answer for a list of prefix bytes for common formats. You would also need to detect the tar format within a compressed format, which is not detailed there. Even if you find a matching prefix, you still need to proceed to decompress and decode to test the hypothesis, which is essentially your brute force method.
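For reference, that prefix check is only a few lines, sketched here in Python (the magic numbers are the standard ones for gzip, bzip2, xz and zip; uncompressed tar is recognized by the "ustar" magic at offset 257):

MAGIC = [
    (b'\x1f\x8b', 'gzip'),
    (b'BZh', 'bzip2'),
    (b'\xfd7zXZ\x00', 'xz'),
    (b'PK\x03\x04', 'zip'),
]

def sniff(path):
    with open(path, 'rb') as f:
        head = f.read(265)  # covers every prefix plus the tar check below
    for magic, name in MAGIC:
        if head.startswith(magic):
            return name
    if head[257:262] == b'ustar':  # uncompressed tar
        return 'tar'
    return None

Even on a match you still have to decompress to confirm the guess, which brings you back to a (better targeted) version of the brute-force method.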

How to include UTF-8 data: URIs (for SVGs), in LESS?

Reasoning
I'm bundling small minified SVGs (icons) with my CSS via LESS's data-uri method, to reduce HTTP requests similar to the purpose of icon fonts such as Octicons or Ye Olde CSS Sprites.
However, LESS encodes them in Base64.
This is sub-optimal in the case of SVG, which can be Data URI'd in UTF-8 (example).
There are three reasons why this is sub-optimal:
1: Base64 is silly for text
The purpose of Base64 is to encode binary data using only 6 bits per byte, making it safe to embed in text files. This is great for PNGs and JPEGs, but it makes any text file 33% larger for no reason. If you're now thinking "well gzip takes care of that", then keep in mind that...
2: Encoding text in Base64 makes gzip much less effective
To understand why this is the case, consider this:
btoa('width') === 'd2lkdGg='
btoa(' width') === 'IHdpZHRo'
btoa('  width') === 'ICB3aWR0aA=='
Note how a single leading space changes every character of the output: the shared substring 'width' no longer produces any shared substring in the Base64, so gzip can't exploit it.
As a practical example, let's take an actual SVG and experiment with it:
$ wc -c *
68630 tiger.svg
25699 tiger.svg.gz
91534 tiger.txt
34633 tiger.txt.gz
Even after gzipping, it's still ~35% larger.
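The same experiment is a few lines of Python, if you want to reproduce it (a sketch; tiger.svg stands in for whatever file you test with):

import base64
import gzip

with open('tiger.svg', 'rb') as f:
    svg = f.read()
b64 = base64.b64encode(svg)

# Base64 hides the byte-aligned repetition from DEFLATE,
# so the encoded version gzips noticeably worse.
print(len(gzip.compress(svg)), len(gzip.compress(b64)))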
3: It disregards some free sources of redundancy
Think about the width example above. Every SVG will have this substring, and if you embed SVGs in a CSS, you'll probably have this keyword somewhere else (or others), which gzip could benefit from (because this is how Huffman Coding works), but not if it's hidden by Base64.
The Question
How can I embed SVGs in LESS as data: URIs using UTF-8 instead of Base64?
I can imagine a thousand ways to do this involving build tools like Grunt, but it breaks my workflow because I won't be able to do things like style: include:less all.less from a Jade view (I do this in development), or even just #import 'images.less'; from a less file.
I'm an idiot. This is simple:
data-uri('image/svg+xml;charset=UTF-8', 'path/to.svg')
I had to read LESS's source to figure this one out.
All the benefits I mention above are gained here, in particular that if you have tons of small SVGs, they will benefit from the redundancy between each other. And it works in all browsers.

Will random data appended to a JPG make it unusable?

So, to simplify my life I want to be able to append 1 to 7 additional characters to the end of some JPG images my program is processing*. These are dummy padding bytes (fillers, etc., probably all 0x00), just to make the file size a multiple of 8 bytes for block encryption.
Having tried this out with a few programs, it appears they are fine with the additional characters, which occur after the FF D9 that specifies the end of the image - so it appears that the file format is well defined enough that the 'corruption' I'm adding at the end shouldn't matter.
I can always post process the files later if needed, but my preference is to do the simplest thing possible - which is to let them remain (I'm decrypting other file types and they won't mind, so having a special case is annoying).
I figure with all the talk of Steganography hullaballo years ago, someone has some input here...
*(Encryption processes 8-byte blocks; I don't want to store the pre-encryption file size, so I append 0x00 bytes to the input data and leave them there after decryption.)
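A minimal sketch of the padding scheme described above (the file name is hypothetical):

def pad_to_block(data: bytes, block: int = 8) -> bytes:
    # Append 0x00 bytes so len(data) is a multiple of `block`.
    return data + b'\x00' * (-len(data) % block)

with open('photo.jpg', 'rb') as f:
    jpeg = f.read()

padded = pad_to_block(jpeg)
assert len(padded) % 8 == 0  # ready for an 8-byte block cipher
# Any added bytes land after the FF D9 end-of-image marker.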
No, you can add bytes to the end of a JPG file without making it unusable. The header of the JPG file tells the program how to read it, so the program reading it will stop at the end of the JPG data.
In fact, people have hidden zip files inside jpg files by appending the zip data to the end of the jpg data. Because of the way these formats are structured, the resulting file is valid in either format.
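That trick is easy to reproduce (a sketch; the file names are made up). JPEG readers stop at the FF D9 marker, while ZIP readers locate the central directory from the end of the file, so each format tolerates the other's data:

with open('photo.jpg', 'rb') as img, open('secret.zip', 'rb') as zf:
    with open('combo.jpg', 'wb') as out:
        out.write(img.read() + zf.read())
# combo.jpg still displays as an image; rename it to .zip and it extracts.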
You can, but the results may be unpredictable.
Even though there is enough information in the format to tell the client to ignore the extra data it is likely not a case the programmer tested for.
A paranoid program might look at the size, notice the discrepancy and decide it won't process your file because clearly it doesn't fully understand it. This is particularly likely when reading data from the web when random bytes in a file could be considered a security risk.
You can embed your data in the XMP tag within a JPEG (or EXIF or IPTC fields for that matter).
XMP is XML so you have a fair bit of flexibility there to do you own custom stuff.
It's probably not the simplest thing possible but putting your data here will maintain the integrity of the JPEG and require no "post processing".
Your data will then show up in other imaging software such as Photoshop, which may not be ideal.
As others have stated, you have no control how programs process image files and therefore some programs may find the images valid others may not.
However, there is a bigger issue here. Judging by your question, I'm deducing you're practicing "security through obscurity." It's widely considered a very bad practice. Use Google to find a plethora of articles about the topic.
