Converting string to 64 bit floats in node.js - node.js

This question has two inter-related parts which I am deeply struggling with. I just want to point out I have literally been struggling with this for days!
Part 1. I have a string of data which is like the following:
?\21167\200Z\251\3072\26467\000\350-\306%\33767\240\314\203\310\030\n77\200U\374\307\013577\000#\036\306\376_77\200s\215\307\361\21277\000t\235\306\344\26577\000\204\000\307\327\34077\000\264\217\306\313\01387\000R&\307\276687\000\312\210\306\261a87\000\364\026\306\244\21487p\257\"\311\227\26787\000#U\306\212\34287\000\324\210\306}\r97\000\274*\306p897\000\270\016\306J\27197\000\250u\306=\34497\000\224r\3060\017:7\000\316\213\306#::7#\254\223\310\026e:7\000d\251\306\374\272:7\000(-\306\360\345:7\000\322\202\306\343\020;7\200D\230\307\311f;7\000\\\314\306\274\221;7\0000\246\306\257\274;7\000\230\342\305\242\347;7\000\000\220\310\225\022<7\0002\003\307\210=<7\200X\316\307|h<7\000-\033\307o\223<7\000\000\367\305b\276<7\000|\237\306U\351<7\000Di\306H\024=7\000\356>\307;?=7\000\024u\306!\225=7\240J\317\310\025\300=7\000\224#\306\010\353=7\000\373\027\307\373\025>7\0008p\310\324\226>7\000\360\205\306\272\354>7\000\226m\307\255\027?7\000\224\304\306\241B?7\200D\240\310\224m?7\000\304x\306\207\230?7\000B\337\306z\303?7\000;\217\307`\031#7\000\252q\307SD#7\000\244d\3069\232#7\000\270\324\306-\305#7\200\311\266\307 \360#7\340P\233\310\023\033A7\000\014\245\307\006FA7\000B\324\307\371pA7\000\362\002\307\354\233A7\000L#\306\337\306A7\000$\031\306\322\361A7\000\002\004\307\306\034B7\000,U\306\254rB7\000\274\341\306\237\235B7 W\212\310\222\310B7\000\177N\307\205\363B7\0004\351\306x\036C7\000\004\037\306kIC7#;f\310^tC7\000\371Q\307E\312C7\000$f\3078\365C7\000\032\320\306+ D7\000\226\252\306\036KD7\200\350\216\310\021vD7\000\270?\306\367\313D7\000#\032\306\353\366D7\200\2100\310\336!E7\000\270[\306\321LE7\000\"{\307\304wE7\000^\301\306\267\242E7\000\246\240\306\252\315E7\000b\006\307\235\370E7\200\003M\311\220#F7\000\374\203\306\203NF7\000T{\306wyF7\000PT\306j\244F7\000\364^\306]\317F7\000\324\003\307P\372F7\000\303 \307K^67\000\204\000\306X367\000\324\341\306e\01067\000\271\001\307r\33557\000\374\004\306\177\26257\000\240\235\305\214\20757\000\217\242\307\231\\57\000PS\306\246157 
\013\236\310\262\00657\000 \222\307\277\33347\000\244\210\306\314\26047\000\226\353\306\346Z47\200\257P\310\331\20547\000\213M\307\363/47\000$\021\306\000\00547\000\373\241\307\r\33237\000z\355\307\032\25737\000\010\244\306&\20437\000\271p\3103Y37\000\307f\307#.37\000\241^\307M\00337\000l&\306Z\33027\000\226\213\306g\25527\000fx\307t\20227\300\352\020\310\201W27\000\253\231\307\215,27\000v\267\306\247\32617#\373r\310\264\25317\000X\202\305\301\20017\300\357\020\310\316U17\000\016\227\306\333*17\000\030\010\307\350\37707\300XN\310\365\32407\000\350U\306\001\25207\000\344\363\306\016\17707\200\324\360\307\033T07\000\022\"\307()07 \213\375\3105\376/7\000\265\222\307O\250/7\000\251r\307B\323/7\200O\215\307\\}/7\200\235\376\307iR/7\000\324\274\307\202\374.7\000S*\307u\'/7\000R-\307\217\321.7\000j\177\307\234\246.7\200)\346\307\251{.7\000\364\350\307\266P.7\000!\310\307\303%.7\000-/\307\320\372-7\200\'\207\307\334\317-7\300\357\006\310\351\244-7\000$(\310\366y-7\000\177:\307\003O-7\200t\361\307\020$-7\0001\217\307\035\371,7\000`G\305*\316,7\200\t3\3107\243,7\300\022\017\310Dx,7\000\244m\307PM,7\000\327o\307]\",7\000\004s\307j\367+7 \360\257\310w\314+7\000\265;\307\204\241+7\000D2\306\221v+7\000\261`\307\236K+7\000 \313\306\253 +7 \337!\311\267\365*7#6L\310\304\312*7\000\271;\310\321\237*7\000%\205\307\336t*7\000\0145\307\353I*7\200E\242\307\370\036*7\000~E\307\005\364)7\000\311\031\307\022\311)7\000\302\374\307\037\236)7\200\276\t\310+s)7\000\261Z\3108H)7\200\350\325\307E\035)7\240\321\201\310R\362(7\000\276\334\307_\307(7\000\246\016\310l\234(7\300\006\254\310yq(7\000\266\024\310\206F(7\200Q2\310\223\033(7\200d\235\307\237\360\'7\000\311\314\307\254\305\'7#i\250\310\271\232\'7\300\254Y\310\306o\'7#\260\t\310\323D\'7\000\336\232\307\340\031\'7\000w&\307\355\356&7\360\226\017\311\372\303&7\000\240\226\306\006\231&7\200\253Q\310\023n&7#OT\310 
C&7\000\034&\310-\030&7\000\213d\310:\355%7\000\000\200\302\363\202\0317\000\031\273G\000X\0317\000\246\276G\r-\0317\000\314\271F\032\002\0317\000\242\037G\'\327\0307\200\334\237G4\254\0307\3401\260HA\201\0307\000\262\307FNV\0307\000\230_FZ+\0307\000\022\256Fg\000\0307\200\354\232Gt\325\0277\000\246\355F\201\252\0277\000\336\355F\216\177\0277\000:$G\233T\0277\000\324jF\250)\0277\000\273\231G\265\376\0267p\365\031I\302\323\0267\000j\240G\333}\0267\000\370\247E\350R\0267\000tbF\365\'\0267\000Q7G\034\247\0257\000\304\036F)|\0257\000\214mF5Q\0257`\254\203HB&\0257\000\213|GO\373\0247\000o#Hi\245\0247\000\026\213Fvz\0247\000\014\346F\203O\0247\000\277\236G\220$\0247\000\014\214F\235\371\0237\200\355\345G\251\316\0237\000\312\253F\266\243\0237\200\001\203G\303x\0237\200\034\206G\320M\0237\000\030\236G\335\"\0237#x\033H\352\367\0227\000\3630G\367\314\0227\000>\260F\004\242\0227\000|\242F\021w\0227\000\240uH\035L\0227\000Z\005G7\366\0217\000\':GQ\240\0217\000-\tGkJ\0217\000\340LFx\037\0217\000\224\316F\204\364\0207\000\270\365F\221\311\0207\000\004\332F\236\236\0207\000\224\021G\253s\0207\000\334UG\270H\0207\000\274\211F\305\035\0207\200\243\353G\337\307\0177\000z\037H\370q\0177\000\t3G\022\034\0177\000\335<G\037\361\0167#\275+H,\306\0167\200\323\272G9\233\0167\300\030)HSE\0167\000\026\237F_\032\0167\000L\227Fl\357\r7\000~\243Fy\304\r7\000\306\037G\206\231\r7\000\334XF\240C\r7\000 dF\272\355\0147\000\374\324F\307\302\0147\000\014\207F\340l\0147\000\327CG\372\026\0147\000\370\261F\007\354\0137\000\025\003G\024\301\0137# 
\\H!\226\0137\000\254\247F.k\0137\000\024\270F;#\0137\000\362\216Fa\277\n7\000\020\023F{i\n7\000(\033G\347\255\0317\0008\226E\332\330\0317\000\000\276F\315\003\0327\000\230\253E\300.\0327\000XEF\263Y\0327\000cSH\246\204\0327\200`\224G\231\257\0327\000\366\337G\214\332\0327\000P\353Es0\0337\000&\212Ff[\0337\000\263JGY\206\0337\000\212gG?\334\0337\200\260\202GL\261\0337\000\324PF2\007\0347\340\023\037I%2\0347\000\256\314F\030]\0347\000\250#F\013\210\0347\000<\251G\377\262\0347\000\320hE\362\335\0347\000\270nF\3303\0357\200\263\327G\313^\0357\000\016\227F\276\211\0357\200\313\244G\261\264\0357\0005\262G\244\337\0357\200\307\301G\230\n\0367\2001\200G\2135\0367\000\020:F~`\0367\200\330\234Gq\213\0367\200\222\314Gd\266\0367\000\357BGW\341\0367\000\240\234FJ\014\0377\200\025\332G=7\0377\000d\021F0b\0377\000dhH$\215\0377\000\002iG\027\270\0377\200[\246G\n\343\0377\200e\264G\375\r 7\000 2G\3608 7\000#\302E\343c 7\200y\250G\326\216 7\000\335DG\311\271 7\000\323\263G\275\344 7\000B\214F\260\017!7\200\246\243G\226e!7\0006`G\243:!7\200\036\344G\211\220!7\200u\371G|\273!7\000\340\202Go\346!7\000\272\377Fb\021\"7#v<HU<\"7\000\"\330FIg\"7\000\004FG<\222\"7\200;\271G/\275\"7\200C\243H\"\350\"7\000\370\337F\025\023#7\300\r\001H\010>#7\000\215^H\373h#7\200K\202G\356\223#7\000\236\350G\341\276#7\200k\303G\325\351#7\000F\346G\310\024$7\3002\030H\273?$7\200\220\276G\241\225$7`d\371H\256j$7\240\275\036I\224\300$7#y\036H\207\353$7\000#\237Gz\026%7\000^<Gal%7\200^?HnA%7\2000\231HT\227%7\200\310\323GG\302%7\300+DH
I am reading data from a file and I end up with a variable myString with the above data in it. When I do typeof on it, it says it's a string, and when I do a console.log on it, it outputs the above exactly. So I created a new node.js file and put all of the data above inside quotes:
newString = "?\21167\200Z\251\3072\26467\000 etc"
When I console.log(newString) it outputs
?‰67€Z©Ç2´67è-Æ%ß67 Ìƒ77€UüÇ 577ÉÆþ_77€sÇñŠ77tÆäµ77„Ç×à7 etc
Rather than the original string "?\21167\200Z\251\3072\26467\000..." etc
What is going on here?
Part 2. Because of 1, I have struggled to extract the 64 bit floats from this string.
I can do it in Python!:
import numpy as np
data_new = b"?\21167\200Z\251\3072\26467\000\350-\306%\33767\240\314\2 etc etc"
print(np.frombuffer(data_new, dtype="f4,f4"))
Is there an easy way to do this in node?

Explanation
In JavaScript, \<OCTAL SEQUENCE> inside a string literal produces the character whose code point is that octal number. This feature is deprecated in modern JavaScript but still works for backward-compatibility reasons (except in strict mode, where it throws a SyntaxError).
The octal sequence must be in the range 0o0..0o377 (decimal 0..255 or hexadecimal 0x0..0xff) for this to happen.
'\1'.codePointAt(0).toString(16) // 1
String.fromCodePoint(0x01) // <Start of Heading>
// https://www.compart.com/en/unicode/U+0001
'\377'.codePointAt(0).toString(16) // ff
String.fromCodePoint(0xff) // "ÿ"
If the digit after the backslash isn't a valid octal digit, the backslash is simply "swallowed":
'\9'.codePointAt(0).toString(16) // 39
String.fromCodePoint(0x39) // "9"
// 9 is not a valid octal number, so the string
// simply evaluates to the character "9"
Solutions
Option 1
Replace every backslash \ with a double backslash \\ to escape it. The string then contains the backslashes and digits themselves instead of the characters the escapes would produce.
'\\1'.codePointAt(0).toString(16) // 5c
String.fromCodePoint(0x5c) // Reverse Solidus (AKA backslash)
// https://www.compart.com/en/unicode/U+005C
Option 2
Use String.raw with a template literal.
String.raw`\1\2\3` === "\\1\\2\\3" // true
Note that this will not work if there is a backslash at the very end of the string:
String.raw`\` // Uncaught SyntaxError: Unexpected end of input
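For Part 2, here is a minimal sketch in Node.js, under a few assumptions that the question does not confirm: each character of the string holds exactly one byte (code points 0..255, as you get when the file is read with the 'latin1' encoding), and the floats are little-endian. The NumPy dtype "f4,f4" describes records of two 32-bit floats (64 bits per record), so the sketch reads pairs of 32-bit values. The name decodeFloatPairs is made up for this example.

```javascript
// Assumption: binaryString has one byte per character (code points 0..255).
// Assumption: values are little-endian 32-bit floats, matching NumPy's "f4"
// on a little-endian machine.
function decodeFloatPairs(binaryString) {
  // 'latin1' maps each character 0..255 back to the byte with the same value.
  const buf = Buffer.from(binaryString, 'latin1');
  const pairs = [];
  // Each record is 8 bytes: two consecutive 32-bit floats.
  for (let offset = 0; offset + 8 <= buf.length; offset += 8) {
    pairs.push([buf.readFloatLE(offset), buf.readFloatLE(offset + 4)]);
  }
  return pairs;
}

// Eight bytes holding the floats 1.5 and -2.25 (written with \x escapes,
// which are not deprecated and also work in strict mode).
console.log(decodeFloatPairs('\x00\x00\xc0\x3f\x00\x00\x10\xc0')); // [ [ 1.5, -2.25 ] ]
```

If the data comes from a file, it is probably simpler to skip the string entirely: fs.readFileSync(path) called without an encoding returns a Buffer, which can be passed to readFloatLE directly.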

Related

Why the string in unicode form is not equal to its unicode code point value?

We can get the string 你's unicode code point value:
u'你'.encode('unicode-escape')
b'\\u4f60'
Why the string in unicode form is not equal to its unicode code point value?
u'你' == u'\x4f\x60'
False
u'你' == u'\\u4f60'
False
It is, but the strings you are comparing against are not the right ones. The first one is two separate single-byte characters, and the second one has the backslash escaped, meaning that it is the literal 6 characters \u4f60.
u'你' == u"\u4f60"
True
The encoded byte string has the two backslashes since the encoding escapes it, making it not equivalent even if turned back into a string unless you decode it with unicode-escape as well.
Side note: the u prefix is the default in Python 3.
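The same distinction can be sketched in JavaScript, whose \u, \x and \\ escapes behave the same way for this example. This is only an illustration in a different language, not Python; the variable names are made up.

```javascript
// The three spellings from the question, written as JavaScript literals.
const ch = '\u4f60';         // one character, U+4F60
const twoChars = '\x4f\x60'; // two characters: "O" and "`"
const escaped = '\\u4f60';   // six characters: backslash, u, 4, f, 6, 0

console.log(ch === '你');      // true
console.log(twoChars.length); // 2
console.log(escaped.length);  // 6

// Escaped text only becomes the character after an explicit decoding step,
// here by parsing the four hex digits that follow the backslash and the u.
const decoded = String.fromCharCode(parseInt(escaped.slice(2), 16));
console.log(decoded === ch);  // true
```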

How to convert string like "//u****" to text?

I want to convert a string like "\u****" to text (Unicode) in Haskell.
I have a Java properties file, and it has the following content:
i18n.test.key=\u0050\u0069\u006e\u0067\u0020\uc190\uc2e4\ub960\u0020\ud50c\ub7ec\uadf8\uc778
I want to convert it to text (Unicode) in Haskell.
I think I can do it like this:
Convert "\u****" to word8 array
Convert word8 array to ByteString
Use Text.Encoding.decodeUtf8 convert ByteString to text
But step 1 is a little complicated for me.
How can I do it in Haskell?
A simple solution may look like this:
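import qualified Data.ByteString as BS
import qualified Data.Text.Encoding as T
-- Decode a string made up solely of \uXXXX escapes into Text.
decodeJava = T.decodeUtf16BE . BS.concat . gobble
-- Consume one \uXXXX escape at a time, yielding its two bytes (big-endian).
gobble [] = []
gobble ('\\':'u':a:b:c:d:rest) = let sym = convert16 [a,b] [c,d]
                                 in sym : gobble rest
gobble _ = error "decoding error"
-- Parse two pairs of hex digits into a two-byte ByteString.
convert16 hi lo = BS.pack [read $ "0x"++hi, read $ "0x"++lo]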
Notes:
Your string is UTF16-encoded, therefore you need decodeUtf16BE.
Decoding will fail if there are other characters in the string. This code will work with your example only if you remove the trailing i.
Constructing the words by appending 0x and, in particular, using read is very slow, but will do the trick for small data.
If you replace \u with \x then this is a valid Haskell string literal.
my_string = "\x0050\x0069\x006e..."
You can then convert to Text if you want, or leave it as String, or whatever.
Watch out, Java normally uses UTF-16 to encode its strings, so interpreting the bytes as UTF-8 will probably not work.
If the codes in your file are UTF-16, you need to do the following:
find the numeric value (UTF-16 code unit) for each quadruple of hex digits
check if this is a high surrogate character. If so, the following character will be a low surrogate character. The pair of surrogate characters can be mapped to a single Unicode code point.
make a String from your list of Unicode numbers with map toEnum
The following is a quote from the Java doc http://docs.oracle.com/javase/7/docs/api/ :
The char data type (and therefore the value that a Character object encapsulates) are based on the original Unicode specification, which defined characters as fixed-width 16-bit entities. The Unicode Standard has since been changed to allow for characters whose representation requires more than 16 bits. The range of legal code points is now U+0000 to U+10FFFF, known as Unicode scalar value. (Refer to the definition of the U+n notation in the Unicode Standard.)
The set of characters from U+0000 to U+FFFF is sometimes referred to as the Basic Multilingual Plane (BMP). Characters whose code points are greater than U+FFFF are called supplementary characters. The Java platform uses the UTF-16 representation in char arrays and in the String and StringBuffer classes. In this representation, supplementary characters are represented as a pair of char values, the first from the high-surrogates range, (\uD800-\uDBFF), the second from the low-surrogates range (\uDC00-\uDFFF).
Java has methods to combine a high surrogate character and a low surrogate character to get the Unicode point. You may want to check the source of the java.lang.Character class to find out how exactly they do this, but I guess it is some simple bit-operation.
Another possibility would be to check for a Haskell library that does UTF-16 decoding.
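The steps above can be sketched as follows. The sketch is in JavaScript rather than Haskell, purely as an illustration of the logic, and the helper names decodeJavaEscapes and combineSurrogates are made up. It handles only plain \uXXXX escapes, not the other escape forms a Java properties file may contain.

```javascript
// Hypothetical helper: replace every \uXXXX escape with the UTF-16 code unit
// it names. JavaScript strings are sequences of UTF-16 code units, so a high
// surrogate followed by a low surrogate forms one character automatically.
function decodeJavaEscapes(text) {
  return text.replace(/\\u([0-9a-fA-F]{4})/g, (_, hex) =>
    String.fromCharCode(parseInt(hex, 16)));
}

// Hypothetical helper: the bit operation that maps a surrogate pair to a
// code point. Each surrogate carries 10 bits of (code point - 0x10000).
function combineSurrogates(high, low) {
  return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00);
}

console.log(decodeJavaEscapes('\\u0050\\u0069\\u006e\\u0067')); // Ping
console.log(combineSurrogates(0xD83D, 0xDE00).toString(16));    // 1f600
```

In a language whose strings hold full code points (as Haskell's String does), the second helper is the part that needs porting: whenever a value falls in the range 0xD800..0xDBFF, it is combined with the value that follows it.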

Output UUID in Go as a short string

Is there a built-in way, or a reasonably standard package, that allows you to convert a standard UUID into a short string that would enable shorter URLs?
I.e. taking advantage of using a larger range of characters such as [A-Za-z0-9] to output a shorter string.
I know we can use base64 to encode the bytes, as follows, but I'm after something that creates a string that looks like a "word", i.e. no + and /:
id = base64.StdEncoding.EncodeToString(myUuid.Bytes())
A universally unique identifier (UUID) is a 128-bit value, which is 16 bytes. For human-readable display, many systems use a canonical format using hexadecimal text with inserted hyphen characters, for example:
123e4567-e89b-12d3-a456-426655440000
This has length 16*2 + 4 = 36. You may choose to omit the hyphens, which gives you:
fmt.Printf("%x\n", uuid)
fmt.Println(hex.EncodeToString(uuid))
// Output: 32 chars
123e4567e89b12d3a456426655440000
123e4567e89b12d3a456426655440000
You may choose to use base32 encoding (which encodes 5 bits with 1 symbol in contrast to hex encoding which encodes 4 bits with 1 symbol):
fmt.Println(base32.StdEncoding.EncodeToString(uuid))
// Output: 26 chars
CI7EKZ7ITMJNHJCWIJTFKRAAAA======
Trim the trailing = signs when transmitting, so this will always be 26 chars. Note that you have to append "======" again before decoding the string with base32.StdEncoding.DecodeString().
If this is still too long for you, you may use base64 encoding (which encodes 6 bits with 1 symbol):
fmt.Println(base64.RawURLEncoding.EncodeToString(uuid))
// Output: 22 chars
Ej5FZ-ibEtOkVkJmVUQAAA
Note that base64.RawURLEncoding produces a base64 string (without padding) which is safe for URL inclusion, because the 2 extra chars in the symbol table (beyond [0-9a-zA-Z]) are - and _, both of which are safe to include in URLs.
Unfortunately for you, the base64 string may contain these 2 extra chars beyond [0-9a-zA-Z]. So read on.
Interpreted, escaped string
If you want to avoid these 2 extra characters, you may choose to turn your base64 string into an interpreted, escaped string similar to the interpreted string literals in Go. For example, if you want to insert a backslash in an interpreted string literal, you have to double it because backslash is a special character that starts an escape sequence, e.g.:
fmt.Println("One backslash: \\") // Output: "One backslash: \"
We may choose to do something similar to this. We have to designate a special character: let it be 9.
Reasoning: base64.RawURLEncoding uses the charset: A..Za..z0..9-_, so 9 represents the highest code with alphanumeric character (61 decimal = 111101b). See advantage below.
So whenever the base64 string contains a 9, replace it with 99. And whenever the base64 string contains the extra characters, use a sequence instead of them:
9 => 99
- => 90
_ => 91
This is a simple replacement table which can be captured by a value of strings.Replacer:
var escaper = strings.NewReplacer("9", "99", "-", "90", "_", "91")
And using it:
fmt.Println(escaper.Replace(base64.RawURLEncoding.EncodeToString(uuid)))
// Output:
Ej5FZ90ibEtOkVkJmVUQAAA
This will slightly increase the length as sometimes a sequence of 2 chars will be used instead of 1 char, but the gain will be that only [0-9a-zA-Z] chars will be used, as you wanted. The average length will be less than 1 additional character: 23 chars. Fair trade.
Logic: For simplicity, let's assume all possible UUIDs have equal probability (a UUID is not completely random, so this is not the case, but let's set this aside as this is just an estimation). The last base64 symbol will never be a replaceable char (that's why we chose the special char to be 9 instead of, say, A), so 21 chars may turn into a replaceable sequence. The chance of one being replaceable is 3 / 64 = 0.047, so on average 21*3/64 = 0.98 chars turn into a 2-char sequence, and this equals the expected number of extra characters.
To decode, use an inverse decoding table captured by the following strings.Replacer:
var unescaper = strings.NewReplacer("99", "9", "90", "-", "91", "_")
Example code to decode an escaped base64 string:
fmt.Println("Verify decoding:")
s := escaper.Replace(base64.RawURLEncoding.EncodeToString(uuid))
dec, err := base64.RawURLEncoding.DecodeString(unescaper.Replace(s))
fmt.Printf("%x, %v\n", dec, err)
Output:
123e4567e89b12d3a456426655440000, <nil>
Try all the examples on the Go Playground.
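For readers following along in Node.js, here is a rough JavaScript port of the same escaping scheme. It is a sketch, not the Go code above: the helper names toBase64Url, escapeId and unescapeId are made up, and the URL-safe alphabet is produced by hand from standard base64 so that it does not depend on a particular Node.js version.

```javascript
// Replacement tables mirroring the Go replacers: 9 is the escape character.
const ESCAPE = { '9': '99', '-': '90', '_': '91' };
const UNESCAPE = { '9': '9', '0': '-', '1': '_' };

// Standard base64 converted to the unpadded URL-safe alphabet.
function toBase64Url(buf) {
  return buf.toString('base64')
    .replace(/\+/g, '-')
    .replace(/\//g, '_')
    .replace(/=+$/, '');
}

// Escape the three special characters so only [0-9a-zA-Z] remains.
function escapeId(b64url) {
  return b64url.replace(/[9_-]/g, (c) => ESCAPE[c]);
}

// Scan left to right: every 9 starts a two-character sequence.
function unescapeId(s) {
  return s.replace(/9([901])/g, (_, c) => UNESCAPE[c]);
}

const uuid = Buffer.from('123e4567e89b12d3a456426655440000', 'hex');
const short = escapeId(toBase64Url(uuid));
console.log(short); // Ej5FZ90ibEtOkVkJmVUQAAA

// Buffer.from(..., 'base64') also accepts the URL-safe alphabet without padding.
const back = Buffer.from(unescapeId(short), 'base64');
console.log(back.toString('hex')); // 123e4567e89b12d3a456426655440000
```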
If you just want a fairly random string to use as a slug, it is better not to bother with UUIDs at all.
You can simply use Go's native math/rand library to make random strings of the desired length:
package main
import (
	"encoding/hex"
	"fmt"
	"math/rand"
)
func main() {
	b := make([]byte, 4) // 4 random bytes become 8 hex characters
	rand.Read(b) // math/rand is not cryptographically secure
	s := hex.EncodeToString(b)
	fmt.Println(s)
}
Another option is math/big. While base64 has a constant output of 22 characters, math/big can get down to 2 characters, depending on the input:
package main
import (
"encoding/base64"
"fmt"
"math/big"
)
type uuid [16]byte
func (id uuid) encode() string {
return new(big.Int).SetBytes(id[:]).Text(62)
}
func main() {
var id uuid
for n := len(id); n > 0; n-- {
id[n - 1] = 0xFF
s := base64.RawURLEncoding.EncodeToString(id[:])
t := id.encode()
fmt.Printf("%v %v\n", s, t)
}
}
Result:
AAAAAAAAAAAAAAAAAAAA_w 47
AAAAAAAAAAAAAAAAAAD__w h31
AAAAAAAAAAAAAAAAAP___w 18owf
AAAAAAAAAAAAAAAA_____w 4GFfc3
AAAAAAAAAAAAAAD______w jmaiJOv
AAAAAAAAAAAAAP_______w 1hVwxnaA7
AAAAAAAAAAAA_________w 5k1wlNFHb1
AAAAAAAAAAD__________w lYGhA16ahyf
AAAAAAAAAP___________w 1sKyAAIxssts3
AAAAAAAA_____________w 62IeP5BU9vzBSv
AAAAAAD______________w oXcFcXavRgn2p67
AAAAAP_______________w 1F2si9ujpxVB7VDj1
AAAA_________________w 6Rs8OXba9u5PiJYiAf
AAD__________________w skIcqom5Vag3PnOYJI3
AP___________________w 1SZwviYzes2mjOamuMJWv
_____________________w 7N42dgm5tFLK9N8MT7fHC7
https://golang.org/pkg/math/big

AS3 - "\u2605" NOT the same as "\\u"+"2605"?

I'm trying to make a text field where people type the Unicode code without the backslash, and I want to add the backslash after they have typed it. So the user types u2605 and the code converts it to "\u2605"; I then convert this to a Unicode character and insert it in the textflow.
My code:
this works:
span.text = publicFunctions.htmlUnescape(he.encode("\u2605"))
this doesn't work:
span.text = publicFunctions.htmlUnescape(he.encode("\\u"+"2605"))
How do I make a string that acts as a Unicode string?
Tried all sorts of things: escape(unescape()), converting to a number, "\\u", "\u" ... nothing helps.
trace("\u2605" == "\\u"+"2605") ... will return false. So will
trace("\u2605" == "\u"+"2605")
"\u2605" is a string with a single character, the character with the code point 2605, while "\\u" + "2605" is a string with 6 characters (the backslash, the u and the four digit number).
If you want to construct a Unicode character from just the four digits, you should be able to use String.fromCharCode. The catch is that the escape sequence is written in hexadecimal, while the method takes a numeric character code. So if the user enters a hexadecimal string, you will have to parse it first:
trace(String.fromCharCode(parseInt('2605', 16)) == '\u2605');
That's an interesting issue! I don't think you can concatenate a string literal and achieve what you're trying to do. The relevant character escaping happens when the string literal is originally formed, which means that you need the whole sequence together in the first place.
But you should be able to take the user-supplied number and dynamically generate a Unicode string with String.fromCharCode(...).
http://help.adobe.com/en_US/FlashPlatform/reference/actionscript/3/String.html#fromCharCode()
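As a rough illustration of that approach, here is a sketch in JavaScript rather than ActionScript 3 (both have String.fromCharCode and parseInt). The function name charFromUserInput and the set of accepted prefixes are assumptions made for this example.

```javascript
// Hypothetical helper: accept "u2605", "\u2605", "U+2605" or plain "2605"
// and return the corresponding character.
function charFromUserInput(input) {
  // Drop an optional leading "\u", "u" or "U+" so only hex digits remain.
  const hex = input.trim().replace(/^(\\u|u\+?)/i, '');
  if (!/^[0-9a-f]{1,6}$/i.test(hex)) {
    throw new RangeError('not a hexadecimal code point: ' + input);
  }
  // The escape sequence is hexadecimal, so parse with radix 16.
  // String.fromCodePoint also covers code points above U+FFFF.
  return String.fromCodePoint(parseInt(hex, 16));
}

console.log(charFromUserInput('u2605') === '\u2605'); // true
console.log(charFromUserInput('U+2605'));             // ★
```

The key point is the same as in the answers above: the escape is resolved when a literal is compiled, so text assembled at runtime has to be parsed as a number and converted explicitly.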

Perl's default string encoding and representation

In the following:
my $string = "Can you \x{FB01}nd my r\x{E9}sum\x{E9}?\n";
The \x{FB01} and \x{E9} are code points. And code points are encoded via an encoding scheme to a series of octets.
So the character ﬁ, which has the code point \x{FB01}, is part of the string in $string. But how does this work? Are all the characters in this sentence (including the ASCII ones) encoded via UTF-8?
If yes why do I get the following behavior?
my $str = "Some arbitrary string\n";
if(Encode::is_utf8($str)) {
print "YES str IS UTF8!\n";
}
else {
print "NO str IT IS NOT UTF8\n";
}
This prints "NO str IT IS NOT UTF8\n"
Additionally Encode::is_utf8($string) returns true.
In what way are $string and $str different and one is considered UTF-8 and the other not?
And in any case what is the encoding of $str? ASCII? Is this the default for Perl?
In C, a string is a collection of octets, but Perl has two string storage formats:
String of 8-bit values.
String of 72-bit values. (In practice, limited to 32-bit or 64-bit.)
As such, you don't need to encode code points to store them in a string.
my $s = "\x{2660}\x{2661}";
say length $s; # 2
say sprintf '%X', ord substr($s, 0, 1); # 2660
say sprintf '%X', ord substr($s, 1, 1); # 2661
(Internally, an extension of UTF-8 called "utf8" is used to store the strings of 72-bit chars. That's not something you should ever have to know except to realize the performance implications, but there are bugs that expose this fact.)
Encode's is_utf8 reports which type of string a scalar contains. It's a function that serves absolutely no use except to debug the bugs I previously mentioned.
An 8-bit string can store the value of "abc" (or the string in the OP's $str), so Perl used the more efficient 8-bit (UTF8=0) string format.
An 8-bit string can't store the value of "\x{2660}\x{2661}" (or the string in the OP's $string), so Perl used the 72-bit (UTF8=1) string format.
Zero is zero whether it's stored in a floating point number, a signed integer or an unsigned integer. Similarly, the storage format of strings conveys no information about the value of the string.
You can store code points in an 8-bit string (if they're small enough) just as easily as a 72-bit string.
You can store bytes in a 72-bit string just as easily as an 8-bit string.
In fact, Perl will switch between the two formats at will. For example, if you concatenate $string with $str, you'll get a string in the 72-bit format.
You can alter the storage format of a string with the builtins utf8::downgrade and utf8::upgrade, should you ever need to work around a bug.
utf8::downgrade($s); # Switch to strings of 8-bit values (UTF8=0).
utf8::upgrade($s); # Switch to strings of 72-bit values (UTF8=1).
You can see the effect using Devel::Peek.
>perl -MDevel::Peek -e"$s=chr(0x80); utf8::downgrade($s); Dump($s);"
SV = PV(0x7b8a74) at 0x4a84c4
REFCNT = 1
FLAGS = (POK,pPOK)
PV = 0x7bab9c "\200"\0
CUR = 1
LEN = 12
>perl -MDevel::Peek -e"$s=chr(0x80); utf8::upgrade($s); Dump($s);"
SV = PV(0x558a6c) at 0x1cc843c
REFCNT = 1
FLAGS = (POK,pPOK,UTF8)
PV = 0x55ab94 "\302\200"\0 [UTF8 "\x{80}"]
CUR = 2
LEN = 12
The \x{FB01} and \x{E9} are code points.
Not quite; the numeric values inside the braces are code points. The whole \x expression is just a notation for a character. There are several notations for characters, most of them starting with a backslash, but the common one is the simple string literal. You might as well write:
use utf8;
my $string = "Can you ﬁnd my résumé?\n";
#                     ↑       ↑   ↑
And code points are encoded via an encoding scheme to a series of octets.
True, but so far your string is a string of characters, not a buffer of octets.
But how does this work?
Strings consist of characters. That's just Perl's model. You as a programmer are supposed to deal with it at this level.
Of course, the computer can't, and the internal data structure must have some form of internal encoding. Far too much confusion ensues because "Perl can't keep a secret"; the details leak out occasionally.
Are all the characters in this sentence (including the ASCII ones) encoded via UTF-8?
No, the internal encoding is lax UTF8 (no dash). It does not have some of the restrictions that UTF-8 (a.k.a. UTF-8-strict) has.
UTF-8 goes up to 0x10_ffff, UTF8 goes up to 0xffff_ffff_ffff_ffff on my 64-bit system. Codepoints greater than 0xffff_ffff will emit a non-portability warning, though.
In UTF-8 certain codepoints are non-characters or illegal characters. In UTF8, anything goes.
Encode::is_utf8
… is an internals function, and is clearly marked as such. You as a programmer are not supposed to peek. But since you want to peek, no one can stop you. Devel::Peek::Dump is a better tool for getting at the internals.
Read http://p3rl.org/UNI for an introduction to the topic of encoding in Perl.
is_utf8 is a badly-named function that doesn't mean what you think it means or have anything to do with that. The answer to your question is that $string doesn't have an encoding, because it's not encoded. When you call Encode::encode with some encoding, the result of that will be a string that is encoded and has a known encoding.

Resources