I am using the Graphics.EasyRender Haskell package to render PDF files.
Everything is fine with plain ASCII characters, but I get encoding problems when using, for example, German umlauts like öäüß or French characters like éè.
let document = newpage_defer $ do
      textbox align_left (Font Helvetica 12) (Color_Gray 0) 10 100 100 100 0 myStr

myStr :: String
myStr = "test text ä"
the output is
test text ˆ⁄
I also tried encoding the string as UTF-8, but then it gets worse:
test text ˆ´⁄
As a test I also tried to generate Postscript output from the EasyRender package, but I have the same problems there.
PDF is my main concern. Any idea how to use these umlauts in the target PDF?
Each PDF font has its own encoding, and usually PDF files have fonts embedded together with encoding information. easyrender doesn't embed fonts; it uses the standard PDF fonts, which are deprecated.
AFAIU, Helvetica uses StandardEncoding (see the PDF reference for details), and it doesn't contain the ä character. So I don't think you can output non-ASCII characters with easyrender.
To confirm, try to draw a string with the byte (octal) 341 (and also 256 and 306 in case it uses MacRomanEncoding or WinAnsiEncoding); it should be rendered as the character Æ.
ADDED: The character ä in UTF-8 is represented as two bytes, octal 303 and 244. In StandardEncoding these represent ˆ and ⁄ -- exactly what you see in the first output. Not sure where the ´ comes from in the second one.
There are two issues:
Re-encoding a standard font like Helvetica to be able to "show" characters in the range 128-255.
Representing characters 128-255 in a Postscript string.
The first is covered in Section 5.9.1 of the Postscript Language Reference, page 349. Another reference is http://apps.jcns.fz-juelich.de/doku/sc/ps-latin
The second is a deficiency in the easyrender library. One way to represent the character Ä (capital A with an umlaut) is via the octal escape \304. That is, in the Postscript file you need to have:
(\304) show
However, looking at the library's source code, the function ps_escape is incapable of producing octal escape codes for characters in the range 128-255.
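For illustration, here is a hedged sketch of what an octal-escaping variant could look like (ps_escape' is a made-up name for this example; it is not part of easyrender):

import Data.Char (ord)
import Text.Printf (printf)

-- Escape a string for use inside Postscript parentheses, emitting octal
-- escapes for bytes outside printable ASCII (assumes Latin-1 range input).
ps_escape' :: String -> String
ps_escape' = concatMap esc
  where
    esc '('  = "\\("
    esc ')'  = "\\)"
    esc '\\' = "\\\\"
    esc c
      | ord c < 32 || ord c > 126 = printf "\\%03o" (ord c)
      | otherwise                 = [c]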
Another way to solve this problem is to emit the generated Postscript in the Latin1 encoding:
import System.IO
...

main = do
  let doc = newpage_defer $ do
        ...
  let output = render_string Format_PDF doc
  hSetEncoding stdout latin1
  putStr output
Putting these two ideas together:
import Graphics.EasyRender
import System.IO

reencode_fonts = " /Helvetica findfont dup length dict begin { 1 index /FID ne {def} {pop pop} ifelse } forall /Encoding ISOLatin1Encoding def currentdict end /Helvetica exch definefont pop"

my_custom = custom { ps_defs = reencode_fonts }

document = newpage_defer $ do
  textbox align_left (Font Helvetica 12) (Color_Gray 0) 10 100 100 100 0 myStr
  endpage 500 500

myStr :: String
myStr = "test text ä"

out = render_custom_string Format_PS my_custom document

main = do
  hSetEncoding stdout latin1
  putStr out
While importing data from a flat file, I noticed some embedded hex-values in the string (<0x00>, <0x01>).
I want to replace them with specific characters, but am unable to do so. Removing them won't work either.
What it looks like in the exported flat file: https://i.imgur.com/7MQpoMH.png
Another example: https://i.imgur.com/3ZUSGIr.png
This is what I've tried:
(And mind, <0x01> represents a non-editable entity; it's not recognized here.)
import io

with io.open('1.txt', 'r+', encoding="utf-8") as p:
    s = p.read()
# included in case it bears any significance
import re
import binascii
s = "Some string with hex: <0x01>"
s = s.encode('latin1').decode('utf-8')
# throws e.g.: >>> UnicodeDecodeError: 'utf-8' codec can't decode byte 0xfc in position 114: invalid start byte
s = re.sub(r'<0x01>', r'.', s)
s = re.sub(r'\\0x01', r'.', s)
s = re.sub(r'\\\\0x01', r'.', s)
s = s.replace('\0x01', '.')
s = s.replace('<0x01>', '.')
s = s.replace('0x01', '.')
Or something along these lines, in hopes of getting a grasp of it while iterating through the whole string:
for x in s:
    try:
        base64.encodebytes(x)
        base64.decodebytes(x)
        s.strip(binascii.unhexlify(x))
        s.decode('utf-8')
        s.encode('latin1').decode('utf-8')
    except:
        pass
Nothing seems to get the job done.
I'd expect the characters to be replaceable with the methods I've dug up, but they are not. What am I missing?
NB: I have to preserve umlauts (äöüÄÖÜ)
-- edit:
Could I introduce the hex-values in the first place when exporting? If so, is there a way to avoid that?
with io.open('out.txt', 'w', encoding="utf-8") as temp:
    temp.write(s)
Judging from the images, these are actually control characters.
Your editor displays them in this greyed-out way showing you the value of the bytes using hex notation.
You don't have the characters "0x01" in your data, but really a single byte with the value 1, so unhexlify and friends won't help.
In Python, these characters can be produced in string literals with escape sequences using the notation \xHH, with two hexadecimal digits.
The fragment from the first image is probably equal to the following string:
"sich z\x01 B. irgendeine"
Your attempts to remove them were close.
s = s.replace('\x01', '.') should work.
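If several different control characters can occur, here is a small sketch of the same idea (the character class is an assumption about which bytes you want to replace):

import re

s = "sich z\x01 B. irgendeine"   # example line with an embedded control byte

# Replace any C0 control character except tab and newline with '.'
cleaned = re.sub(r'[\x00-\x08\x0b-\x1f]', '.', s)
print(cleaned)   # sich z. B. irgendeine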
Can someone please help me deal with byte-order mark (BOM) bytes versus UTF8 characters in the first line of an XHTML file?
Using Python 3.5, I opened the XHTML file as UTF8 text:
inputTopicFile = open(inputFileName, "rt", encoding="utf8")
As shown in this hex-editor, the first line of that UTF8-encoded XHTML file begins with the three-bytes UTF8 BOM EF BB BF:
I wanted to remove the UTF8 BOM from what I supposed were equivalent to the three initial character positions [0:2] in the string. So I tried this:
firstLine = firstLine[3:]
Didn't work -- the characters <? were no longer present at the start of the resulting line.
So I did this experiment:
for charPos in range(0, 3):
    print("charPos {0} == {1}".format(charPos, firstLine[charPos]))
Which printed:
charPos 0 ==
charPos 1 == <
charPos 2 == ?
I then added .encode to that loop as follows:
for charPos in range(0, 3):
    print("charPos {0} == {1}".format(charPos, eachLine[charPos].encode('utf8')))
Which gave me:
charPos 0 == b'\xef\xbb\xbf'
charPos 1 == b'<'
charPos 2 == b'?'
Evidently Python 3 in some way "knows" that the 3-byte BOM is a single unit of non-character data? Meaning that one cannot try to process the first three 8-bit bytes(?) in the line as if they were UTF-8 characters?
At this point I know that I can "trick" my code into giving me what I want by specifying firstLine = firstLine[1:]. But it seems wrong to do it that way(?)
So what's the correct way to discard the first three BOM bytes in a UTF-8 string on the way to working with only the UTF-8 characters?
EDIT: The solution, per the comment made by Anthony Sottile, turned out to be as simple as using encoding="utf-8-sig" when I opened the source XHTML file:
inputTopicFile = open(inputFileName, "rt", encoding="utf-8-sig")
That strips out the BOM. Voila!
As you mentioned in your edit, you can open the file with the utf-8-sig encoding, but to answer your question of why it was behaving this way:
Python 3 distinguishes between byte strings (the ones with the b prefix) and character strings (without the b prefix), and prefers to use character strings whenever possible. A byte string works with the actual bytes; a character string works with Unicode codepoints. The BOM is a single codepoint, U+FEFF, so in a regular string Python 3 will treat it as a single character (because it is a single character). When you call encode, you turn the character string into a byte string.
Thus the results you were seeing are exactly what you should have: Python 3 does know what counts as a single character, which is all it sees until you call encode.
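A small illustration of that byte/character distinction (the XML fragment here is made up):

raw = b'\xef\xbb\xbf<?xml version="1.0"?>'   # the bytes the hex editor shows
text = raw.decode('utf-8')                   # decoding yields characters

print(raw[:3])                    # b'\xef\xbb\xbf' -- three bytes
print(repr(text[0]))              # '\ufeff'        -- but only one character
print(text.lstrip('\ufeff')[:5])  # <?xml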
I want to convert a string like "\u****" to Text (Unicode) in Haskell.
I have a Java properties file, and it has the following content:
i18n.test.key=\u0050\u0069\u006e\u0067\u0020\uc190\uc2e4\ub960\u0020\ud50c\ub7ec\uadf8\uc778
I want to convert it to Text (Unicode) in Haskell.
I think I can do it like this:
Convert "\u****" to word8 array
Convert word8 array to ByteString
Use Text.Encoding.decodeUtf8 convert ByteString to text
But step 1 is little complicated for me.
How to do it in Haskell?
A simple solution may look like this:
import qualified Data.ByteString as BS
import qualified Data.Text.Encoding as T

decodeJava = T.decodeUtf16BE . BS.concat . gobble

gobble [] = []
gobble ('\\':'u':a:b:c:d:rest) = let sym = convert16 [a,b] [c,d]
                                 in sym : gobble rest
gobble _ = error "decoding error"

convert16 hi lo = BS.pack [read $ "0x" ++ hi, read $ "0x" ++ lo]
Notes:
Your string is UTF16-encoded, therefore you need decodeUtf16BE.
Decoding will fail if there are other characters in the string. This code will work with your example only if you remove the trailing i.
Constructing the words by appending 0x and, in particular, using read is very slow, but will do the trick for small data.
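For bigger inputs, a hand-rolled conversion avoids read entirely; a hedged sketch (hexPair is a made-up helper name):

import Data.Char (digitToInt)
import Data.Word (Word8)

-- Convert two hexadecimal digits to a byte without going through read.
-- Assumes the input characters really are hex digits.
hexPair :: Char -> Char -> Word8
hexPair hi lo = fromIntegral (digitToInt hi * 16 + digitToInt lo)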
If you replace \u with \x then this is a valid Haskell string literal.
my_string = "\x0050\x0069\x006e..."
You can then convert to Text if you want, or leave it as String, or whatever.
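For instance, with the text package (a trivial sketch; my_text is just a made-up name):

import qualified Data.Text as T

my_text :: T.Text
my_text = T.pack "\x0050\x0069\x006e"   -- or T.pack my_string from above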
Watch out, Java normally uses UTF-16 to encode its strings, so interpreting the bytes as UTF-8 will probably not work.
If the codes in your file are UTF-16, you need to do the following:
find the numeric value (Unicode code point) for each quadruple
check if this is a high surrogate character. If it is, the following character will be a low surrogate character. The pair of surrogate characters can be mapped to one Unicode code point.
make a String from your list of Unicode code points with map toEnum
The following is a quote from the Java doc http://docs.oracle.com/javase/7/docs/api/ :
The char data type (and therefore the value that a Character object encapsulates) are based on the original Unicode specification, which defined characters as fixed-width 16-bit entities. The Unicode Standard has since been changed to allow for characters whose representation requires more than 16 bits. The range of legal code points is now U+0000 to U+10FFFF, known as Unicode scalar value. (Refer to the definition of the U+n notation in the Unicode Standard.)
The set of characters from U+0000 to U+FFFF is sometimes referred to as the Basic Multilingual Plane (BMP). Characters whose code points are greater than U+FFFF are called supplementary characters. The Java platform uses the UTF-16 representation in char arrays and in the String and StringBuffer classes. In this representation, supplementary characters are represented as a pair of char values, the first from the high-surrogates range, (\uD800-\uDBFF), the second from the low-surrogates range (\uDC00-\uDFFF).
Java has methods to combine a high surrogate character and a low surrogate character to get the Unicode point. You may want to check the source of the java.lang.Character class to find out how exactly they do this, but I guess it is some simple bit-operation.
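It is indeed simple bit arithmetic; a hedged Haskell sketch (combineSurrogates is a made-up name for this example):

import Data.Char (chr)

-- Combine a high surrogate (0xD800-0xDBFF) and a low surrogate (0xDC00-0xDFFF)
-- into the Unicode code point they encode together.
combineSurrogates :: Int -> Int -> Char
combineSurrogates hi lo = chr (0x10000 + (hi - 0xD800) * 0x400 + (lo - 0xDC00))

-- Example: combineSurrogates 0xD83D 0xDE00 == '\x1F600'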
Another possibility would be to check for a Haskell library that does UTF-16 decoding.
In the following:
my $string = "Can you \x{FB01}nd my r\x{E9}sum\x{E9}?\n";
The \x{FB01} and \x{E9} are code points. And code points are encoded via an encoding scheme to a series of octets.
So the character ﬁ, which has the code point \x{FB01}, is part of the string $string. But how does this work? Are all the characters in this sentence (including the ASCII ones) encoded via UTF-8?
If yes, why do I get the following behavior?
my $str = "Some arbitrary string\n";
if (Encode::is_utf8($str)) {
    print "YES str IS UTF8!\n";
}
else {
    print "NO str IT IS NOT UTF8\n";
}
This prints "NO str IT IS NOT UTF8\n"
Additionally Encode::is_utf8($string) returns true.
In what way are $string and $str different, such that one is considered UTF-8 and the other not?
And in any case, what is the encoding of $str? ASCII? Is this the default for Perl?
In C, a string is a collection of octets, but Perl has two string storage formats:
String of 8-bit values.
String of 72-bit values. (In practice, limited to 32-bit or 64-bit.)
As such, you don't need to encode code points to store them in a string.
my $s = "\x{2660}\x{2661}";
say length $s; # 2
say sprintf '%X', ord substr($s, 0, 1); # 2660
say sprintf '%X', ord substr($s, 1, 1); # 2661
(Internally, an extension of UTF-8 called "utf8" is used to store the strings of 72-bit chars. That's not something you should ever have to know except to realize the performance implications, but there are bugs that expose this fact.)
Encode's is_utf8 reports which type of string a scalar contains. It's a function that serves absolutely no use except to debug the bugs I previously mentioned.
An 8-bit string can store the value of "abc" (or the string in the OP's $str), so Perl used the more efficient 8-bit (UTF8=0) string format.
An 8-bit string can't store the value of "\x{2660}\x{2661}" (or the string in the OP's $string), so Perl used the 72-bit (UTF8=1) string format.
Zero is zero whether it's stored in a floating point number, a signed integer or an unsigned integer. Similarly, the storage format of strings conveys no information about the value of the string.
You can store code points in an 8-bit string (if they're small enough) just as easily as a 72-bit string.
You can store bytes in a 72-bit string just as easily as an 8-bit string.
In fact, Perl will switch between the two formats at will. For example, if you concatenate $string with $str, you'll get a string in the 72-bit format.
You can alter the storage format of a string with the builtins utf8::downgrade and utf8::upgrade, should you ever need to work around a bug.
utf8::downgrade($s); # Switch to strings of 8-bit values (UTF8=0).
utf8::upgrade($s); # Switch to strings of 72-bit values (UTF8=1).
You can see the effect using Devel::Peek.
>perl -MDevel::Peek -e"$s=chr(0x80); utf8::downgrade($s); Dump($s);"
SV = PV(0x7b8a74) at 0x4a84c4
REFCNT = 1
FLAGS = (POK,pPOK)
PV = 0x7bab9c "\200"\0
CUR = 1
LEN = 12
>perl -MDevel::Peek -e"$s=chr(0x80); utf8::upgrade($s); Dump($s);"
SV = PV(0x558a6c) at 0x1cc843c
REFCNT = 1
FLAGS = (POK,pPOK,UTF8)
PV = 0x55ab94 "\302\200"\0 [UTF8 "\x{80}"]
CUR = 2
LEN = 12
The \x{FB01} and \x{E9} are code points.
Not quite: the numeric values inside the braces are code points. The whole \x expression is just a notation for a character. There are several notations for characters, most of them starting with a backslash, but the common one is the simple string literal. You might as well write:
use utf8;
my $string = "Can you find my résumé?\n";
# ↑ ↑ ↑
And code points are encoded via an encoding scheme to a series of octets.
True, but so far your string is a string of characters, not a buffer of octets.
But how does this work?
Strings consist of characters. That's just Perl's model. You as a programmer are supposed to deal with it at this level.
Of course, the computer can't, and the internal data structure must have some form of internal encoding. Far too much confusion ensues because "Perl can't keep a secret"; the details leak out occasionally.
Are all the characters in this sentence (including the ASCII ones) encoded via UTF-8?
No, the internal encoding is lax UTF8 (no dash). It does not have some of the restrictions that UTF-8 (a.k.a. UTF-8-strict) has.
UTF-8 goes up to 0x10_ffff, UTF8 goes up to 0xffff_ffff_ffff_ffff on my 64-bit system. Codepoints greater than 0xffff_ffff will emit a non-portability warning, though.
In UTF-8 certain codepoints are non-characters or illegal characters. In UTF8, anything goes.
Encode::is_utf8
… is an internals function, and is clearly marked as such. You as a programmer are not supposed to peek. But since you want to peek, no one can stop you. Devel::Peek::Dump is a better tool for getting at the internals.
Read http://p3rl.org/UNI for an introduction to the topic of encoding in Perl.
is_utf8 is a badly-named function that doesn't mean what you think it means, and it has nothing to do with that. The answer to your question is that $string doesn't have an encoding, because it's not encoded. When you call Encode::encode with some encoding, the result will be a string that is encoded and has a known encoding.
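A hedged sketch of that distinction (the byte counts assume UTF-8):

use Encode qw(encode);

my $string = "Can you \x{FB01}nd my r\x{E9}sum\x{E9}?\n";  # a string of characters, no encoding
my $octets = encode('UTF-8', $string);                     # a string of bytes with a known encoding

print length($string), "\n";   # 23 -- counts characters
print length($octets), "\n";   # 27 -- counts bytes (the fi ligature is 3 bytes, each e-acute is 2)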
I can't find a basic description of how string data is stored in Perl! It's like all the documentation assumes I already know this for some reason. I know about encode(), decode(), and I know I can read raw bytes into a Perl "string" and output them again without Perl screwing with them. I know about open modes. I also gather Perl must use some internal format to store character strings and can differentiate between character and binary data. Please, where is this documented?
An equivalent question is: given this Perl,
$x = decode($y);
Decode to WHAT and from WHAT??
As far as I can figure, there must be a flag on the string data structure that says it holds either binary or character data (in some internal format which, BTW, is a superset of Unicode - http://perldoc.perl.org/Encode.html#DESCRIPTION). But I'd like it if that were stated in the docs or confirmed/discredited here.
This is a great question. To investigate, we can dive a little deeper by using Devel::Peek to see what is actually stored in our strings (or other variables).
First, let's start with an ASCII string:
$ perl -MDevel::Peek -E 'Dump "string"'
SV = PV(0x9688158) at 0x969ac30
REFCNT = 1
FLAGS = (POK,READONLY,pPOK)
PV = 0x969ea20 "string"\0
CUR = 6
LEN = 12
Then we can turn on Unicode IO layers and do the same:
$ perl -MDevel::Peek -CSAD -E 'Dump "string"'
SV = PV(0x9eea178) at 0x9efcce0
REFCNT = 1
FLAGS = (POK,READONLY,pPOK)
PV = 0x9f0faf8 "string"\0
CUR = 6
LEN = 12
From there, let's try to manually add some wide characters:
$ perl -MDevel::Peek -CSAD -e 'Dump "string \x{2665}"'
SV = PV(0x9be1148) at 0x9bf3c08
REFCNT = 1
FLAGS = (POK,READONLY,pPOK,UTF8)
PV = 0x9bf7178 "string \342\231\245"\0 [UTF8 "string \x{2665}"]
CUR = 10
LEN = 12
From that you can clearly see that Perl has interpreted this correctly as utf8. The problem is that if I don't give the octets using the \x{} escaping, the representation looks more like the regular string:
$ perl -MDevel::Peek -CSAD -E 'Dump "string ♥"'
SV = PV(0x9143058) at 0x9155cd0
REFCNT = 1
FLAGS = (POK,READONLY,pPOK)
PV = 0x9168af8 "string \342\231\245"\0
CUR = 10
LEN = 12
All Perl sees is bytes, and it has no way to know that you meant them as a Unicode character, unlike when you entered the escaped octets above. Now let's use decode and see what happens:
$ perl -MDevel::Peek -CSAD -MEncode=decode -E 'Dump decode "utf8", "string ♥"'
SV = PV(0x8681100) at 0x8683068
REFCNT = 1
FLAGS = (TEMP,POK,pPOK,UTF8)
PV = 0x869dbf0 "string \342\231\245"\0 [UTF8 "string \x{2665}"]
CUR = 10
LEN = 12
Ta-da! Now you can see that the string is correctly represented internally, matching what you entered when you used the \x{} escaping.
The actual answer is it is "decoding" from bytes to characters, but I think it makes more sense when you see the Peek output.
Finally, you can make Perl see your source code as UTF-8 by using the utf8 pragma, like so:
$ perl -MDevel::Peek -CSAD -Mutf8 -E 'Dump "string ♥"'
SV = PV(0x8781170) at 0x8793d00
REFCNT = 1
FLAGS = (POK,READONLY,pPOK,UTF8)
PV = 0x87973b8 "string \342\231\245"\0 [UTF8 "string \x{2665}"]
CUR = 10
LEN = 12
Rather like the fluid string/number status of its scalar variables, the internal format of Perl's strings is variable and depends on the contents of the string.
Take a look at perluniintro, which says this.
Internally, Perl currently uses either whatever the native eight-bit character set of the platform (for example Latin-1) is, defaulting to UTF-8, to encode Unicode strings. Specifically, if all code points in the string are 0xFF or less, Perl uses the native eight-bit character set. Otherwise, it uses UTF-8.
What that means is that a string like "I have £ two" is stored as (bytes) I have \x{A3} two. (The pound sign is U+00A3.) Now if I append a multi-byte Unicode character such as U+263A - a smiling face - Perl will convert the whole string to UTF-8 before it appends the new character, giving (bytes) I have \xC2\xA3 two\xE2\x98\xBA. Removing this last character again leaves the string UTF-8 encoded, as I have \xC2\xA3 two.
But I wonder why you need to know this. Unless you are writing an XS extension in C the internal format is transparent and invisible to you.
Perl's internal string format is implementation-dependent, but it is usually a superset of UTF-8. It doesn't matter what it is, because you use decode and encode to convert strings between the internal format and other encodings.
Decode converts to Perl's internal format; encode converts from Perl's internal format.
Binary data is stored internally the same way characters 0 through 255 are.
Encode and decode just convert between formats. For example, UTF-8 encoding a string means every resulting character is a single octet, with a Perl character value of 0 through 255, i.e. the string consists of UTF-8 octets.
Short answer: It's a mess
Slightly longer: The difference isn't visible to the programmer.
Basically you have to remember if your string contains bytes or characters, where characters are unicode codepoints. If you only encounter ASCII, the difference is invisible, which is dangerous.
Data itself and the representation of such data are distinct, and should not be confused. Strings are (conceptually) a sequence of codepoints, but are represented as a byte array in memory, and represented as some byte sequence when encoded. If you want to store binary data in a string, you re-interpret the number of a codepoint as a byte value, and restrict yourself to codepoints in 0–255.
(E.g. a file has no encoding. The information in that file has some encoding (be it ASCII, UTF-16 or EBCDIC at a character level, and Perl, HTML or .ini at an application level))
The exact storage format of a string is irrelevant, but you can store complete integers inside such a string:
# this will work if your perl was compiled with large integers
my $string = chr 2**64; # this is so not unicode
say ord $string; # 18446744073709551615
The internal format is adjusted accordingly to accommodate such values; normal strings won't take up one integer per character.
Perl can handle more than Unicode can, so it's very flexible. Sometimes you want to interface with something that cannot, so you can use encode(...) and decode(...) to handle those transformations. See http://perldoc.perl.org/utf8.html