Zipping downsampled data (with anti-aliasing) comes out bigger than zipping the original - zip

So I have 10 files, each holding about 150 signals of 10,000 floating-point numbers. (They are SEGY files, which hold seismic data.)
Each one of those files weighs about 95MB.
When I compress them all together using Zip I get an approximately 440MB archive.
I downsampled those signals by a factor of 2, so each file is now about 47MB. I zipped them all together and got an archive of size 660MB.
How is that possible?
EDIT:
Apparently, I downsampled with an anti-aliasing filter. When removing that filter, the compression behaved as expected.
Still wondering, why would an anti-aliasing filter cause this kind of behavior?

Seismic data isn't very compressible with general-purpose compression tools like zip. You need specialized compression, such as BWT-based methods.
why would an anti-aliasing filter cause this kind of behavior?
Compression works by eliminating similarities in the data with shorthands. Any time you see an increase in size, you have data that looks random to the compressor. It attempts to compress, but ultimately makes the file larger.
Aliasing in seismic data presents as frequency distortions introduced by inadequately sampling a signal, and it can also lead to signal-to-noise issues. Anti-aliasing attempts to smooth this out. The net effect is that prior to anti-aliasing you have fewer data points than after it, and the additional data points reduce the chance that the compression algorithm will be able to eliminate similar chunks of data.
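If you want to reproduce the effect outside of SEGY, here is a minimal sketch (assuming NumPy and SciPy, with a synthetic float32 signal standing in for a seismic trace) that downsamples by 2 with and without SciPy's anti-aliasing decimate filter and compares the zlib-compressed sizes:

```python
import zlib

import numpy as np
from scipy.signal import decimate

# Synthetic stand-in for one seismic trace: a smooth oscillation plus noise.
rng = np.random.default_rng(0)
t = np.arange(10_000)
signal = (np.sin(0.01 * t) + 0.05 * rng.standard_normal(t.size)).astype(np.float32)

# Downsample by 2 twice: once by simply dropping every other sample,
# once with scipy.signal.decimate, which applies an anti-aliasing
# (low-pass) filter before discarding samples.
plain = signal[::2]
antialiased = decimate(signal, 2, ftype="fir").astype(np.float32)

for name, data in [("original", signal),
                   ("dropped samples", plain),
                   ("anti-aliased", antialiased)]:
    raw = data.tobytes()
    print(f"{name:16s} raw={len(raw):6d} zipped={len(zlib.compress(raw, 9)):6d}")
```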

Related

How should I handle large video datasets in Google Cloud ML Engine?

I am experimenting with video classification using Keras in Cloud ML Engine. My dataset consists of video sequences saved as separate images (e.g. seq1_frame1.png, seq1_frame2.png, ...) which I have uploaded to a GCS bucket.
I use a CSV file referencing the start and end frames of different subclips, and a generator which feeds batches of clips to the model. The generator is responsible for loading frames from the bucket, reading them as images, and concatenating them into numpy arrays.
My training is fairly long, and I suspect the generator is my bottleneck due to the numerous reading operations.
In the examples I found online, people usually save pre-formatted clips as TFRecord files directly to GCS. I feel this solution isn't ideal for very large datasets as it implies duplicating the data, even more so if we decide to extract overlapping subclips.
Is there something wrong with my approach? And more importantly, is there a "gold standard" for using large video datasets for machine learning?
PS: I explained my setup for reference, but my question is not bound to Keras, generators or Cloud ML.
In this, you are almost always going to be trading time for space. You just have to work out which is more important.
In theory, for every frame you have height*width*3 bytes, assuming 3 colour channels. One possible way to save space is to use only one channel (probably green, or, better still, convert your complete dataset to greyscale). That would reduce your full-size video data to one third of its size. Colour data in video tends to be stored at a lower resolution than luminance data anyway, so it might not affect your training, but it depends on your source files.
As you probably know, .png is a lossless image format. Every time you load one, the generator has to decompress it first and then concatenate it into the clip. You could save even more space by using a different compression codec, but that would mean every clip needs full decompression and would probably add to your time. You're right, the repeated decompression will take time, and saving the video uncompressed will take up quite a lot of space. There are places you could save space, though:
reduce to greyscale (or green scale as above); a sketch of this follows the list
temporally subsample frames (do you need EVERY consecutive frame, or could you sample every second one?)
do you use whole frames or just patches? Can you crop or rescale the video sequences?
are you using optical flow? It's pretty processor intensive, consider it as a pre-processing step, too, so you only have to do it once per clip (again this is trading space for time)
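If it helps, here is a rough sketch (Python, assuming Pillow and NumPy; the frame naming comes from the question and the helper name is made up) of a loader that applies two of the suggestions above, greyscale conversion and temporal subsampling:

```python
import numpy as np
from PIL import Image

# A rough sketch of a clip loader, not the asker's exact generator. Paths like
# "seq1_frame1.png" follow the question; the function name is made up.
def load_clip(frame_paths, temporal_step=2):
    frames = []
    for path in frame_paths[::temporal_step]:        # temporal subsampling: every 2nd frame
        img = Image.open(path).convert("L")          # greyscale: 1 channel instead of 3
        frames.append(np.asarray(img, dtype=np.float32) / 255.0)
    return np.stack(frames)                          # shape: (num_frames, height, width)
```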

How to process KML/GeoJSON in Nodejs?

I've run out of Google searches and am reaching out for help here. We are currently processing a KML file using geoXml3 on the client side, but ideally I would want to pre-process it on the server side and send the polygons to the client, because the KML file is 18MB and it takes forever to download on the client side before the client parses it and draws the polygons on the Google map.
We changed the KML files to GeoJSON, reduced the size, and compressed it; after all that circus the response time is still not good. I just want to know if there is a way / library in Node that can do this.
When you say that you are compressing the file, what do you mean? If you mean an algorithm such as zip or lha, that won't necessarily reduce the size of the file that much. What you want to do is to remove line segments from the KML file. In reducing some geographic information, I found that there were lengths of many miles that varied less than a foot from a straight line. Since the data points were spaced every few feet, that meant that the vast majority of the points in the KML file could be removed without making an appreciable change in the appearance of the geometry. Looking for straight line segments is relatively simple.
You should also keep in mind the scale of the map you are viewing and the spacing of the data points in the KML file. Even if the lines are complex curves, it may be possible to remove large numbers of points by trying to curve fit segments of the features and reduce the size of the data in this way.
You seem to be implying that downloading the data from the server to the client takes far more time than processing the data on the server. If this is correct, reducing the number of points is the most efficient method.
There are a number of tricks you can try to reduce the download & rendering time of a KML polygon file.
As already suggested in previous answers the key is to reduce the size of your data. This can be accomplished in many ways, depending on your use case:
Reduce the number of points that each line segment / polygon boundary is made up of. There are plenty of algorithms for this; the Douglas-Peucker line simplification algorithm is the best known (sketched after this list).
Reduce the precision of your data points. If your coordinates are stored to a high degree of precision (i.e. latitudes / longitudes with many decimal places), you can round them to fewer decimal places. Note that you might have to play with this, as it will make the boundaries of your polygons choppy / jagged if you go too far.
Compression. It looks as though you have already experimented with this. Gzip compression should be able to reduce the wire payload size of KML significantly.
Finally, if you are still not getting the results you want, you could consider generalizing your data further by removing small / insignificant polygons. Again, this will depend on your use case.
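To make the first two points concrete, here is a minimal Douglas-Peucker plus coordinate-rounding sketch. It is written in Python purely for illustration; on the Node side you would more likely use an existing library (simplify-js or @turf/simplify, for example) than hand-roll it:

```python
import math

def perpendicular_distance(p, a, b):
    """Distance from point p to the line through a and b (all (lon, lat) tuples)."""
    if a == b:
        return math.hypot(p[0] - a[0], p[1] - a[1])
    num = abs((b[0] - a[0]) * (a[1] - p[1]) - (a[0] - p[0]) * (b[1] - a[1]))
    return num / math.hypot(b[0] - a[0], b[1] - a[1])

def douglas_peucker(points, epsilon):
    """Drop points that deviate less than epsilon (in degrees) from the simplified line."""
    if len(points) < 3:
        return list(points)
    # Find the point farthest from the chord between the two endpoints.
    dmax, index = 0.0, 0
    for i in range(1, len(points) - 1):
        d = perpendicular_distance(points[i], points[0], points[-1])
        if d > dmax:
            dmax, index = d, i
    if dmax <= epsilon:
        return [points[0], points[-1]]            # the stretch is "straight enough"
    left = douglas_peucker(points[:index + 1], epsilon)
    right = douglas_peucker(points[index:], epsilon)
    return left[:-1] + right                      # avoid duplicating the split point

def round_coords(points, decimals=5):
    """About 1 m of precision at 5 decimal places; enough for most web maps."""
    return [(round(lon, decimals), round(lat, decimals)) for lon, lat in points]
```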

Why is GIF image size more than the sum of individual frame size?

I just tried to convert few JPEGs to a GIF image using some online services. For a collection of 1.8 MB of randomly selected JPEGs, the resultant GIF was about 3.8 MB in size (without any extra compression enabled).
I understand GIF is lossless compression. That's why I expected the resultant output to be around 1.8 MB (the input size). Can someone please help me understand what's happening with this extra space?
Additionally, is there a better way to bundle a set of images which are similar to each other (for transmission) ?
JPEG is a lossy format, but it is still compressed. When it is decompressed into raw pixel data and then recompressed as GIF, it is logical to end up with a bigger size.
GIF is a worse compression method for photographs; it is suited mostly for flat-colored drawings. It uses LZW, a dictionary-based scheme (not plain run-length encoding), so you need long horizontal runs or repeated sequences of identically colored pixels to get good compression.
If you have images that are similar to each other, maybe you should consider packing them as consecutive frames (the more similar, the closer together) of a video stream and using a lossless video compressor (or even risking a lossy one), but maybe this is overkill.
If you have a color image, multiply the width x height x 3. That is the normal size of the uncompressed image data.
GIF and JPEG are two different methods for compressing that data. GIF uses the LZW method of compression. In that method the encoder creates a dictionary of previously encountered data sequences and writes codes representing those sequences rather than the actual data. This can actually result in a file larger than the raw image data if the encoder cannot find such sequences.
These repeated sequences are more likely to occur in drawings, where the same colors are reused, than in photographic images, where the color varies subtly throughout.
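To make the dictionary idea concrete, here is a toy LZW encoder, byte-oriented rather than GIF's exact variable-width-code variant, showing why a flat run of one color collapses to far fewer codes than photo-like noise:

```python
import os

def lzw_encode(data):
    """Toy LZW: emit a code for the longest already-seen sequence, then learn a new one."""
    table = {bytes([i]): i for i in range(256)}   # start with all single-byte sequences
    next_code = 256
    out, current = [], b""
    for value in data:
        candidate = current + bytes([value])
        if candidate in table:
            current = candidate                   # keep extending the current match
        else:
            out.append(table[current])            # emit the code for the longest match
            table[candidate] = next_code          # add the new sequence to the dictionary
            next_code += 1
            current = bytes([value])
    if current:
        out.append(table[current])
    return out

flat = bytes([7] * 1000)      # drawing-like: one long run of the same "color"
noisy = os.urandom(1000)      # photo-like: values vary everywhere
# Code counts are only a rough proxy for size (each code is 9+ bits), but the gap is
# large: a few dozen codes for the run vs. nearly one code per byte for the noise.
print(len(lzw_encode(flat)), len(lzw_encode(noisy)))
```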
JPEG uses a series of compression steps. These have the drawback that you might not get out exactly what you put in. The first of these is conversion from RGB to YCbCr. There is not a 1-to-1 mapping between these colorspaces so modification can occur there.
Next is subsampling. The reason for going to YCbCr is that you can sample the Cb and Cr components at a lower rate than the Y component and still get a good representation of the original image. If you keep 4 Y samples for every 1 Cb and 1 Cr sample, you reduce the amount of data to compress by half.
Next is the discrete cosine transform. This is a real number calculation performed on integers. That can produce rounding errors.
Next is quantization. In this step the less significant values from the DCT are discarded (less data to compress). It also introduces errors from the integer division.
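Here is a small numerical illustration of that quantization step (Python with NumPy/SciPy; the 8x8 block is synthetic and the table is the well-known JPEG luminance quantization table, so treat it as a sketch rather than a real encoder):

```python
import numpy as np
from scipy.fftpack import dct

# The standard JPEG luminance quantization table (from the JPEG spec, Annex K).
Q_LUMA = np.array([
    [16, 11, 10, 16,  24,  40,  51,  61],
    [12, 12, 14, 19,  26,  58,  60,  55],
    [14, 13, 16, 24,  40,  57,  69,  56],
    [14, 17, 22, 29,  51,  87,  80,  62],
    [18, 22, 37, 56,  68, 109, 103,  77],
    [24, 35, 55, 64,  81, 104, 113,  92],
    [49, 64, 78, 87, 103, 121, 120, 101],
    [72, 92, 95, 98, 112, 100, 103,  99],
])

def dct2(block):
    """2-D DCT-II applied row-wise then column-wise, as JPEG does per 8x8 block."""
    return dct(dct(block.T, norm="ortho").T, norm="ortho")

rng = np.random.default_rng(0)
block = rng.integers(100, 140, size=(8, 8)).astype(np.float64) - 128  # level-shifted pixels

coeffs = dct2(block)                      # 8x8 DCT coefficients
quantized = np.round(coeffs / Q_LUMA)     # divide by the table; small values become 0
print("zero coefficients:", int(np.sum(quantized == 0)), "out of 64")
```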

How can a jpeg encoder become more efficient

Earlier I read about mozjpeg, a project from Mozilla to create a JPEG encoder that is more efficient, i.e. creates smaller files.
As I understand (jpeg) codecs, a jpeg encoder would need to create files that use an encoding scheme that can also be decoded by other jpeg codecs. So how is it possible to improve the codec without breaking compatibility with other codecs?
Mozilla does mention that the first step for their encoder is to add functionality that can detect the most efficient encoding scheme for a certain image, which would not break compatibility. However, they intend to add more functionality, first of which is "trellis quantization", which seems to be a highly technical algorithm to do something (I don't understand).
I'm also not entirely sure this question belongs on Stack Overflow; it might also fit Super User, since the question is not specifically about programming. So if anyone feels it should be on Super User, feel free to move this question.
JPEG is somewhat unique in that it involves a series of compression steps. There are two that provide the most opportunities for reducing the size of the image.
The first is sampling. In JPEG one usually converts from RGB to YCbCr. In RGB, each component carries roughly equal weight. In YCbCr, the Y component is much more important than the Cb and Cr components. If you sample the latter at 4 to 1, a 4x4 block of pixels gets reduced from 16+16+16 values to 16+1+1. Just by sampling you have reduced the data to be compressed to roughly a third of its size.
The other is quantization. You take the sampled pixel values, divide them into 8x8 blocks, and perform the Discrete Cosine Transform on them. For 8bpp input this takes an 8x8 block of 8-bit data and converts it to an 8x8 block of 16-bit data (an expansion rather than a compression at that point).
The DCT tends to produce larger values in the upper left corner (the low frequencies) and smaller values, close to zero, towards the lower right corner (the high frequencies). The low-frequency coefficients are more valuable than the high-frequency ones.
The 16-bit values are then "quantized" (divided, in plain English).
The compression process defines an 8x8 quantization matrix: each DCT coefficient is divided by the corresponding entry in the quantization matrix. Because this is integer division, the small values go to zero. Long runs of zero values are then combined using run-length compression; the more consecutive zeros you get, the better the compression.
Generally, the quantization values are much higher for the high-frequency coefficients than for the low-frequency ones. You are trying to force those DCT coefficients to zero unless they are very large.
This is where much of the loss (not all of it though) comes from in JPEG.
The trade off is to get as many zeros as you can without noticeably degrading the image.
The choice of quantization matrices is the major factor in compression. Most JPEG libraries present a "quality" setting to the user, which translates into the selection of a set of quantization matrices in the encoder. If someone could devise better quantization matrices, you could get better compression.
This book explains the JPEG process in plain English:
http://www.amazon.com/Compressed-Image-File-Formats-JPEG/dp/0201604434/ref=sr_1_1?ie=UTF8&qid=1394252187&sr=8-1&keywords=0201604434
JPEG provides you with multiple options. For example, you can use the standard Huffman tables or generate Huffman tables optimal for a specific image. The same goes for quantization tables. You can also switch from Huffman coding to arithmetic coding for the entropy-encoding stage; the patents covering arithmetic coding as used in JPEG have expired. All of these options are lossless (no additional loss of data).
One of the options used by Mozilla is to use progressive JPEG compression instead of baseline. You can play with how many frequencies you include in each scan (SS, spectral selection) as well as the number of bits used for each frequency (SA, successive approximation); consecutive scans add further frequencies and/or further bits for each frequency. Again, all of these options are lossless. For the standard test images used for JPEG, switching to progressive encoding improved compression from 41 KB per image to 37 KB, and that is just for one setting of SS and SA. Given the speed of computers today, you could automatically try many different options and choose the best one.
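As a crude illustration of "try several options and keep the smallest output", here is a sketch that uses Pillow rather than mozjpeg; at a fixed quality the optimize and progressive flags should only change the entropy coding, not the decoded pixels:

```python
import io

from PIL import Image

def smallest_jpeg(img, quality=75):
    """Encode the same image with baseline/progressive scans and default/optimized
    Huffman tables, and keep whichever variant comes out smallest."""
    best = None
    for progressive in (False, True):
        for optimize in (False, True):
            buf = io.BytesIO()
            img.save(buf, format="JPEG", quality=quality,
                     optimize=optimize, progressive=progressive)
            data = buf.getvalue()
            if best is None or len(data) < len(best):
                best = data
    return best

# Example usage (hypothetical file names):
# with Image.open("photo.png") as img:
#     open("photo.jpg", "wb").write(smallest_jpeg(img.convert("RGB")))
```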
Although it is hardly used, the original JPEG standard also had a lossless mode with 7 different choices of predictor. Today you could compress using each of the 7 choices and pick the best one, applying the same principle I outlined above. And remember, none of these incurs additional loss of data; switching between them is lossless.

Windowing and lossless compression

I'm studying how FLAC works, although my question is valid for any lossless codec.
I wonder how a codec can be lossless if the original signal is multiplied by a window which is not rectangular.
I think this operation will modify the stream, which we don't want to change.
I know a rectangular window has a terrible spectral response (a sinc, with many lobes), but what's the problem? We don't want to disturb the audio stream, and by multiplying by something different from 1, we will.
Thank you.
A window function can be applied when you want to transform your signal from the time domain to the frequency domain. If you are working with chunks of data then a window can be applied to minimise the effects of spectral leakage.
You can use a (symmetrical) window and apply it to chunks of audio if you also introduce what's known as overlap. Usually 50% overlap is used: the last 50% of your previous chunk overlaps the first 50% of your next chunk, and the windowed chunks are added back together. As long as the shifted copies of the window sum to one (the constant-overlap-add condition, which the usual window choices satisfy), this is a lossless operation.
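A quick numerical check of that claim, using NumPy, a synthetic signal, and a "periodic" Hann window whose 50%-shifted copies sum to exactly one:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(4096)

N = 512                       # frame length
H = N // 2                    # 50% overlap -> hop of half a frame
n = np.arange(N)
w = 0.5 - 0.5 * np.cos(2 * np.pi * n / N)   # "periodic" Hann: w[n] + w[n + N/2] == 1

# Cut the signal into 50%-overlapping frames and window each one.
frames = [w * x[start:start + N] for start in range(0, len(x) - N + 1, H)]

# Overlap-add the windowed frames back together.
y = np.zeros_like(x)
for i, frame in enumerate(frames):
    y[i * H:i * H + N] += frame

# Away from the edges (where only one window contributes), the windows sum to 1,
# so the reconstruction matches the original signal.
print(np.allclose(x[H:-H], y[H:-H]))   # True
```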
