DEFLATE (RFC1951) dynamic huffman "incomplete length" - deflate

I've been studying RFC1951 and 'puff.c', and have a question about the issue of "incomplete length".
As near I can tell, defining a "dynamic" Huffman code table that allows for more codes than specified by HLIT+257 will produce an error, at least by puff.c. For example, an error is produced by 'puff.c' if, as a simple debugging test, I were to use a Huffman table of all 9-bit codes to define only 257 lit/lens. Is this outcome purposeful or a bug? And can I assume that any "inflator" based on the 'zlib' library will produce the same error?
I can't find any specification in RFC 1951 that should REQUIRE the use of a sufficiently tight Huffman code. Certainly, I can see that using an "under-subscribed" Huffman table might be inefficient, in terms of compression, but I'm not sure why such a table should be prohibited from use.
My interest isn't simply hypothetical. I really want to use an under-subscribed, literal-only, Huffman code (but NOT the example cited above) to compress some application specific images into PNG files. But I want to make sure it will work with any PNG image viewer.

The RFC specifies that the codes are Huffman codes, which by definition are complete codes. (Complete means that all bit patterns are used.)
zlib will reject incomplete or oversubscribed codes, except in the special case noted in the RFC:
If only one distance code is used, it is encoded using one bit, not
zero bits; in this case there is a single code length of one, with one
unused code.
There the incomplete code 0 for the single symbol, with code 1 unused, is permitted.
(That, by the way, is unnecessary. If there is only one distance symbol, then you don't need any bits to specify it. You know that that distance symbol must be used with any length. If that symbol needs extra bits, then those extra bits immediately follow the length. But, oh well -- for that case Phil Katz put an extraneous zero bit in every match, and now we're stuck with it.)
The fact that the RFC even had to note this special case is another clue that incomplete codes are not accepted otherwise.
There is sort of another exception in deflate, in that the fixed literal/length code is incomplete, with two unused codes at the end.
The bottom line is, no, you will not be able to use an incomplete code in a dynamic header (except the special case) and expect zlib or any compliant deflate decoder to be able to decode it.
As for why this strictness is useful, constraints on dynamic headers permit rapid detection of non-deflate streams or corrupted deflate streams. Similarly, a dynamic header with no end code is not permitted by zlib, so as to avoid the case of a bogus dynamic header permitting any following random bits to be decodable forever, never detecting an error. The unused fixed codes also help in this regard, since eventually they trigger an error in random input.
By the way, if you want to define a fixed, complete Huffman code for your case, it's super simple, and would reduce the size of almost all of your codes by one bit. Just encode eight bits for the symbols 0..253, using that symbol number directly as the code (reversing the bits of course), and nine bits for symbols 254..257, using the codes 508..511 (bits reversed).

Related

Paraview "possible mismatch of datasize with declaration" error

Paraview (v4.1.0 64-bit, OSX 10.9.2) is giving me the following error:
Generic Warning: In /Users/kitware/Dashboards/MyTests/NightlyMaster/ParaViewSuperbuild-Release/paraview/src/paraview/VTK/IO/Legacy/vtkDataReader.cxx, line 1388
Error reading ascii data. Possible mismatch of datasize with declaration.
I'm not sure why. I've double-checked that fields are all of the expected lengths, and none of the values are NaN, inf, or otherwise extremely large. The issue starts with the output from timestep 16 (0-15 produces no error). Graphically, steps 0-15 produce plots of my data as expected; step 16 shows the "Y/Yc" series having an unexpectedly large point (0.5625, 2.86616e+36).
Is fine:
http://www.filedropper.com/ring0000015
Produces error:
http://www.filedropper.com/ring0000016
I have been facing the same problem for the last 6 months and been struggling to find a solution. I was given the following reasons to explain the error(http://www.cfd-online.com/Forums/paraview/139451-error-while-reading-vtk-files-paraview.html#post503315):
It could be a problem due to the character used for the line ending (http://en.wikipedia.org/wiki/Newline)
In a nutshell:
a)On Windows, line transition is with CR+LF.
b)On Linux, line transition is with LF only.
c)On Mac, some older versions used CR only. Nowadays I guess it should use LF as well.
CR= "Carriage Return" byte
LF= "Line Feed" byte
There might be one or more values that are of type NaN or Inf or some other special computational numeric definition for non-real numbers. They might be readable on Linux, but not on Mac, perhaps of the next possibility. If this is the case,
Location based numeric definitions, aka Locale, might be triggering situations where values are being stored with commas or with a strange scientific notation. For example, if a value "1.0002" is stored as "1,0002" or even perhaps "1.0002ES+000"
I have viewed other forums, and they have generally stated #2 and #3 and the possible solutions -- it has in general worked. However, none of the above seemed to solve my problem.
I noticed that some of the stored solution values in the ASCII files were as small as 10.e-34. I had a feeling that the underflow conditions maybe be triggering problems. I put a check in my code for underflow conditions and rounded them off to 0. This fixed the issue, with the solution being displayed at all times without error messages.
This may not fix the Inf/NaN problems, but if the numbers in the vtk file are too large or too small (i.e. 1e-50, 1e45), this may cause the same error.
One solution in this case is to change the datatype specification. When I had this problem, I specified the datatype as "float", which uses a 32-bit floating point representation (same as "float32"). Changing it to "float64" uses a 64-bit double-precision representation, which is consistent with my C++ code that generated the vtk file that uses doubles. This may eliminate the problem.
If you are using Fortran, this problem also occur when you write to file but not close it in code.
For example:
do i=1,10
write(numb,'(i3)')i
open(unit=1, file='test'//numb//'.vtk')
write(1,*).......
enddo

Haskell Char and unicode surrogates

I was playing around with strings and discovered that Haskell (correctly) disallows characters above Unicode code point 0x10ffff (ie one gets something like a sequence out of range error if one attempts to use something above this limit). Out of curiosity, i played around with the Unicode surrogate halves (0xd800 to 0xdfff) - invalid Unicode codepoints, and discovered that they seem to be permitted. I am curious as to why this is. Is it simply because being a bounded item means only defining a maximum and a minimum?
Disallowing the surrogate code units would indeed make Char a more correct type for Unicode code points. The Report says that Char is "an enumeration whose values represent Unicode characters", so probably this should be considered a GHC bug.
There's no specific notion of "a bounded item", but it would require extra checks in various places (right now chr just needs to make one comparison to check if its argument is valid, for instance) and possibly make some things behave more strangely (if people indirectly expect code points to be contiguous).
I don't know that there's an especially good rationale for it, though, or that the trade-off was even considered originally. In Haskell 1.4, Char was just a 16-bit type, so it would have been natural to extend it to 17*2^16 values without adding extra checks. This issue is occasionally brought up -- I've brought it up before -- but most people don't seem to worry about it very much. It's probably reasonable to file a GHC bug about it, though, to get a proper discussion going.
Note that Data.Text (which uses UTF-16 as its internal representations) does disallow the invalid code units (it has to).

linux libiconv transcode from ISO8859 or IBM850 to UTF8 error

I don't know what the original code is, so I assume that the original code is IBM850 or ISO8859-1.My process below
IBM850 -> UTF8
if this is OK, I consider the original code is IBM850, if NOK,do next step:
ISO8859-1 -> UTF8
if this is OK, I consider the original code is UTF8.
But there is a problem,
if the original code is ISO8859-1, it will be recognised to IBM850.
if the original code is IBM850, it will be recognised to ISO8859-1.
It seems that there are common ground between IBM850 and ISO8859-1.
Who can help me, thanks.
Yes, only the most trivial kind of autodetection is possible by testing whether conversion fails or succeeds. It's not going to work for input encodings where (almost) any input is valid.
You should know something more about your likely output, to test whether if it makes more sense after translating from IBM850 or from ISO8859-1. That's what enca and libenca do. You can probably start with some simple expectations to check:
Does your source happen to be within the ASCII subset of both encodings? Then you're happy with any conversion (but you have no way to know the original encoding at all).
Does your code use box drawing characters? If it does not, it would be easy to reject some candidates for IBM850.
Does your code use control characters from ISO8859-1? If it does not, it would be easy to reject some candidates for ISO8859-1 if codepoints 0x80-0x9F are used.
Do the fragments of your code which are non-ASCII always represent a text in a natural language? Then you can use frequency tables for characters and their pairs, selecting the source encoding which makes the result closer to your natural language(s) on these criteria. (If both variants are almost equally acceptable, it's probably better to give an error message and leave the final decision to humans).

Encode string to another base with more characters?

I know that I can encode numbers to a base like 65 to decrease the size of the character display (even if the number is smaller in binary).
However, is there a way to encode UTF-8 text to another base with more characters than our standard 26 letter English alphabet? In other words, Instead of requiring 4 "characters" for the word "four" - I can create a representation or hash using only, maybe 2 (i.e. "6$")?
I believe the point of Base64 is you can easily convert any binary data into "human readable" letters and numbers. It makes it easy to transcribe arbitrary data to newsgroups or transmit them over text based protocols.
If you want to further "compress" this data, you need to figure out how many characters you want to allow. There's only so many combinations of 8 bits. The most efficient would be to use all of them, in which case why just not use gzip?
Your question seems related to Order-0 entropy coding :
http://en.wikipedia.org/wiki/Entropy_encoding
The most famous algorithm is this family is Huffman coding :
http://en.wikipedia.org/wiki/Huffman_coding
Huffman will not only tells you that only 64 characters are used and therefore only 6 bits per characters are necessary : it will also make a difference between frequent characters, such as (space), and rare ones, such as (;). It will then create a code in which frequent characters use less bits than rarer ones, resulting in better compression (typically 4.5bits per character on English texts).
Huffman coding is an all-around compression technique, used as part of many compression algorithms, including zip.
You can find a demo program which only applies one pass of Huffman compression here (Huff0), it will help you determine how much can be gained by using this technique for your sample inputs :
http://fastcompression.blogspot.com/p/huff0-range0-entropy-coders.html

What defines data that can be stored in strings

A few days ago, I asked why its not possible to store binary data, such as a jpg file into a string variable.
Most of the answers I got said that string is used for textual information such as what I'm writing now.
What is considered textual data though? Bytes of a certain nature represent a jpg file and those bytes could be represented by character byte values...I think. So when we say strings are for textual information, is there some sort of range or list of characters that aren't stored?
Sorry if the question sounds silly. Just trying to 'get it'
I see three major problems with storing binary data in strings:
Most systems assume a certain encoding within string variables - e.g. if it's a UTF-8, UTF-16 or ASCII string. New line characters may also be translated depending on your system.
You should watch out for restrictions on the size of strings.
If you use C style strings, every null character in your data will terminate the string and any string operations performed will only work on the bytes up to the first null.
Perhaps the most important: it's confusing - other developers don't expect to find random binary data in string variables. And a lot of code which works on strings might also get really confused when encountering binary data :)
I would prefer to store binary data as binary, you would only think of converting it to text when there's no other choice since when you convert it to a textual representation it does waste some bytes (not much, but it still counts), that's how they put attachments in email.
Base64 is a good textual representation of binary files.
I think you are referring to binary to text encoding issue. (translate a jpg into a string would require that sort of pre-processing)
Indeed, in that article, some characters are mentioned as not always supported, other can be confusing:
Some systems have a more limited character set they can handle; not only are they not 8-bit clean, some can't even handle every printable ASCII character.
Others have limits on the number of characters that may appear between line breaks.
Still others add headers or trailers to the text.
And a few poorly-regarded but still-used protocols use in-band signaling, causing confusion if specific patterns appear in the message. The best-known is the string "From " (including trailing space) at the beginning of a line used to separate mail messages in the mbox file format.
Whoever told you you can't put 'binary' data into a string was wrong. A string simply represents an array of bytes that you most likely plan on using for textual data... but there is nothing stopping you from putting any data in there you want.
I do have to be careful though, because I don't know what language you are using... and in some languages \0 ends the string.
In C#, you can put any data into a string... example:
byte[] myJpegByteArray = GetBytesFromSomeImage();
string myString = Encoding.ASCII.GetString(myJpegByteArray);
Before internationalization, it didn't make much difference. ASCII characters are all bytes, so strings, character arrays and byte arrays ended up having the same implementation.
These days, though, strings are a lot more complicated, in order to deal with thousands of foreign language characters and the linguistic rules that go with them.
Sure, if you look deep enough, everything is just bits and bytes, but there's a world of difference in how the computer interprets them. The rules for "text" make things look right when it's displayed to a human, but the computer is free to monkey with the internal representation. For example,
In Unicode, there are many encoding systems. Changing between them makes every byte different.
Some languages have multiple characters that are linguistically equivalent. These could switch back and forth when you least expect it.
There are different ways to end a line of text. Unintended translations between CRLF and LF will break a binary file.
Deep down everything is just bytes.
Things like strings and pictures are defined by rules about how to order bytes.
strings for example end in a byte with value 32 (or something else)
jpg's don't
Depends on the language. For example in Python string types (str) are really byte arrays, so they can indeed be used for binary data.
In C the NULL byte is used for string termination, so a sting cannot be used for arbitrary binary data, since binary data could contain null bytes.
In C# a string is an array of chars, and since a char is basically an alias for 16bit int, you can probably get away with storing arbitrary binary data in a string. You might get errors when you try to display the string (because some values might not actually correspond to a legal unicode character), and some operations like case conversions will probably fail in strange ways.
In short it might be possible in some langauges to store arbitrary binary data in strings, but they are not designed for this use, and you may run into all kinds of unforseen trouble. Most languages have a byte-array type for storing arbitrary binary data.
I agree with Jacobus' answer:
In the end all data structures are made up of bytes. (Well, if you go even deeper: of bits). With some abstraction, you could say that a string or a byte array are conventions for programmers, on how to access them.
In this regard, the string is an abstraction for data interpreted as a text. Text was invented for communication among humans, computers or programs do not communicate very well using text. SQL is textual, but is an interface for humans to tell a database what to do.
So in general, textual data, and therefore strings, are primarily for human to human, or human to machine interaction (say for the content of a message box). Using them for something else (e.g. reading or writing binary image data) is possible, but carries lots of risk bacause you are using the data type for something it was not designed to handle. This makes it much more error prone. You may be able to store binary data in strings, mbut just because you are able to shoot yourself in the foot, you should avoid doing so.
Summary: You can do it. But you better don't.
Your original question (c# - What is string really good for?) made very little sense. So the answers didn't make sense, either.
Your original question said "For some reason though, when I write this string out to a file, it doesn't open." Which doesn't really mean much.
Your original question was incomplete, and the answers were misleading and confusing. You CAN store anything in a String. Period. The "strings are for text" answers were there because you didn't provide enough information in your question to determine what's going wrong with your particular bit of C# code.
You didn't provide a code snippet or an error message. That's why it's hard to 'get it' -- you're not providing enough details for us to know what you don't get.

Resources