Writing Bytes to strings.builder prints nothing - string

I am learning go and am unsure why this piece of code prints nothing
package main
import (
	"strings"
)
func main() {
	var sb strings.Builder
	sb.WriteByte(byte(127))
	println(sb.String())
}
I would expect it to print 127

You are appending a byte to the string's buffer, not the characters "127".
Since Go strings are UTF-8, any number <=127 will be the same character as that number in ASCII. As you can see in this ASCII chart, 127 will get you the "delete" character. Since "delete" is a non-printable character, println doesn't output anything visible.
Here's an example of doing the same thing from your question, but using a printable character. 90 for "Z". You can see that it does print out Z.
If you want to append the characters "127" you can use sb.WriteString("127") or sb.Write([]byte("127")). If you want to append the string representation of a byte, you might want to look at using fmt.Sprintf.
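As a minimal sketch of the difference, the snippet below contrasts writing a raw byte with writing its decimal digits. The helper names byteAsChar and byteAsDecimal are made up for this illustration; strconv.Itoa is used as one alternative to fmt.Sprintf.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// byteAsChar (hypothetical helper) appends b as one raw byte,
// which is what sb.WriteByte does in the question.
func byteAsChar(b byte) string {
	var sb strings.Builder
	sb.WriteByte(b)
	return sb.String()
}

// byteAsDecimal (hypothetical helper) appends the decimal digits of b
// instead of the raw byte.
func byteAsDecimal(b byte) string {
	var sb strings.Builder
	sb.WriteString(strconv.Itoa(int(b)))
	return sb.String()
}

func main() {
	fmt.Println(byteAsChar(90))     // Z
	fmt.Println(byteAsDecimal(127)) // 127

	// fmt.Sprintf produces the same digits as strconv.Itoa here.
	s := fmt.Sprintf("%d", byte(127))
	fmt.Println(s) // 127

	// %q makes the non-printable DEL byte visible.
	fmt.Printf("%q\n", byteAsChar(127)) // "\x7f"
}
```

Printing with %q is a convenient way to check whether a string that looks empty actually contains non-printable bytes.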
Note: I'm not an expert on character encoding so apologies if the terminology in this answer is incorrect.

Related

Converting string to 64 bit floats in node.js

This question has two inter-related parts which I am deeply struggling with. I just want to point out I have literally been struggling with this for days!
Part 1. I have a string of data which is like the following:
?\21167\200Z\251\3072\26467\000\350-\306%\33767\240\314\203\310\030\n77\200U\374\307\013577\000#\036\306\376_77\200s\215\307\361\21277\000t\235\306\344\26577\000\204\000\307\327\34077\000\264\217\306\313\01387\000R&\307\276687\000\312\210\306\261a87\000\364\026\306\244\21487p\257\"\311\227\26787\000#U\306\212\34287\000\324\210\306}\r97\000\274*\306p897\000\270\016\306J\27197\000\250u\306=\34497\000\224r\3060\017:7\000\316\213\306#::7#\254\223\310\026e:7\000d\251\306\374\272:7\000(-\306\360\345:7\000\322\202\306\343\020;7\200D\230\307\311f;7\000\\\314\306\274\221;7\0000\246\306\257\274;7\000\230\342\305\242\347;7\000\000\220\310\225\022<7\0002\003\307\210=<7\200X\316\307|h<7\000-\033\307o\223<7\000\000\367\305b\276<7\000|\237\306U\351<7\000Di\306H\024=7\000\356>\307;?=7\000\024u\306!\225=7\240J\317\310\025\300=7\000\224#\306\010\353=7\000\373\027\307\373\025>7\0008p\310\324\226>7\000\360\205\306\272\354>7\000\226m\307\255\027?7\000\224\304\306\241B?7\200D\240\310\224m?7\000\304x\306\207\230?7\000B\337\306z\303?7\000;\217\307`\031#7\000\252q\307SD#7\000\244d\3069\232#7\000\270\324\306-\305#7\200\311\266\307 \360#7\340P\233\310\023\033A7\000\014\245\307\006FA7\000B\324\307\371pA7\000\362\002\307\354\233A7\000L#\306\337\306A7\000$\031\306\322\361A7\000\002\004\307\306\034B7\000,U\306\254rB7\000\274\341\306\237\235B7 W\212\310\222\310B7\000\177N\307\205\363B7\0004\351\306x\036C7\000\004\037\306kIC7#;f\310^tC7\000\371Q\307E\312C7\000$f\3078\365C7\000\032\320\306+ D7\000\226\252\306\036KD7\200\350\216\310\021vD7\000\270?\306\367\313D7\000#\032\306\353\366D7\200\2100\310\336!E7\000\270[\306\321LE7\000\"{\307\304wE7\000^\301\306\267\242E7\000\246\240\306\252\315E7\000b\006\307\235\370E7\200\003M\311\220#F7\000\374\203\306\203NF7\000T{\306wyF7\000PT\306j\244F7\000\364^\306]\317F7\000\324\003\307P\372F7\000\303 \307K^67\000\204\000\306X367\000\324\341\306e\01067\000\271\001\307r\33557\000\374\004\306\177\26257\000\240\235\305\214\20757\000\217\242\307\231\\57\000PS\306\246157 
\013\236\310\262\00657\000 \222\307\277\33347\000\244\210\306\314\26047\000\226\353\306\346Z47\200\257P\310\331\20547\000\213M\307\363/47\000$\021\306\000\00547\000\373\241\307\r\33237\000z\355\307\032\25737\000\010\244\306&\20437\000\271p\3103Y37\000\307f\307#.37\000\241^\307M\00337\000l&\306Z\33027\000\226\213\306g\25527\000fx\307t\20227\300\352\020\310\201W27\000\253\231\307\215,27\000v\267\306\247\32617#\373r\310\264\25317\000X\202\305\301\20017\300\357\020\310\316U17\000\016\227\306\333*17\000\030\010\307\350\37707\300XN\310\365\32407\000\350U\306\001\25207\000\344\363\306\016\17707\200\324\360\307\033T07\000\022\"\307()07 \213\375\3105\376/7\000\265\222\307O\250/7\000\251r\307B\323/7\200O\215\307\\}/7\200\235\376\307iR/7\000\324\274\307\202\374.7\000S*\307u\'/7\000R-\307\217\321.7\000j\177\307\234\246.7\200)\346\307\251{.7\000\364\350\307\266P.7\000!\310\307\303%.7\000-/\307\320\372-7\200\'\207\307\334\317-7\300\357\006\310\351\244-7\000$(\310\366y-7\000\177:\307\003O-7\200t\361\307\020$-7\0001\217\307\035\371,7\000`G\305*\316,7\200\t3\3107\243,7\300\022\017\310Dx,7\000\244m\307PM,7\000\327o\307]\",7\000\004s\307j\367+7 \360\257\310w\314+7\000\265;\307\204\241+7\000D2\306\221v+7\000\261`\307\236K+7\000 \313\306\253 +7 \337!\311\267\365*7#6L\310\304\312*7\000\271;\310\321\237*7\000%\205\307\336t*7\000\0145\307\353I*7\200E\242\307\370\036*7\000~E\307\005\364)7\000\311\031\307\022\311)7\000\302\374\307\037\236)7\200\276\t\310+s)7\000\261Z\3108H)7\200\350\325\307E\035)7\240\321\201\310R\362(7\000\276\334\307_\307(7\000\246\016\310l\234(7\300\006\254\310yq(7\000\266\024\310\206F(7\200Q2\310\223\033(7\200d\235\307\237\360\'7\000\311\314\307\254\305\'7#i\250\310\271\232\'7\300\254Y\310\306o\'7#\260\t\310\323D\'7\000\336\232\307\340\031\'7\000w&\307\355\356&7\360\226\017\311\372\303&7\000\240\226\306\006\231&7\200\253Q\310\023n&7#OT\310 
C&7\000\034&\310-\030&7\000\213d\310:\355%7\000\000\200\302\363\202\0317\000\031\273G\000X\0317\000\246\276G\r-\0317\000\314\271F\032\002\0317\000\242\037G\'\327\0307\200\334\237G4\254\0307\3401\260HA\201\0307\000\262\307FNV\0307\000\230_FZ+\0307\000\022\256Fg\000\0307\200\354\232Gt\325\0277\000\246\355F\201\252\0277\000\336\355F\216\177\0277\000:$G\233T\0277\000\324jF\250)\0277\000\273\231G\265\376\0267p\365\031I\302\323\0267\000j\240G\333}\0267\000\370\247E\350R\0267\000tbF\365\'\0267\000Q7G\034\247\0257\000\304\036F)|\0257\000\214mF5Q\0257`\254\203HB&\0257\000\213|GO\373\0247\000o#Hi\245\0247\000\026\213Fvz\0247\000\014\346F\203O\0247\000\277\236G\220$\0247\000\014\214F\235\371\0237\200\355\345G\251\316\0237\000\312\253F\266\243\0237\200\001\203G\303x\0237\200\034\206G\320M\0237\000\030\236G\335\"\0237#x\033H\352\367\0227\000\3630G\367\314\0227\000>\260F\004\242\0227\000|\242F\021w\0227\000\240uH\035L\0227\000Z\005G7\366\0217\000\':GQ\240\0217\000-\tGkJ\0217\000\340LFx\037\0217\000\224\316F\204\364\0207\000\270\365F\221\311\0207\000\004\332F\236\236\0207\000\224\021G\253s\0207\000\334UG\270H\0207\000\274\211F\305\035\0207\200\243\353G\337\307\0177\000z\037H\370q\0177\000\t3G\022\034\0177\000\335<G\037\361\0167#\275+H,\306\0167\200\323\272G9\233\0167\300\030)HSE\0167\000\026\237F_\032\0167\000L\227Fl\357\r7\000~\243Fy\304\r7\000\306\037G\206\231\r7\000\334XF\240C\r7\000 dF\272\355\0147\000\374\324F\307\302\0147\000\014\207F\340l\0147\000\327CG\372\026\0147\000\370\261F\007\354\0137\000\025\003G\024\301\0137# 
\\H!\226\0137\000\254\247F.k\0137\000\024\270F;#\0137\000\362\216Fa\277\n7\000\020\023F{i\n7\000(\033G\347\255\0317\0008\226E\332\330\0317\000\000\276F\315\003\0327\000\230\253E\300.\0327\000XEF\263Y\0327\000cSH\246\204\0327\200`\224G\231\257\0327\000\366\337G\214\332\0327\000P\353Es0\0337\000&\212Ff[\0337\000\263JGY\206\0337\000\212gG?\334\0337\200\260\202GL\261\0337\000\324PF2\007\0347\340\023\037I%2\0347\000\256\314F\030]\0347\000\250#F\013\210\0347\000<\251G\377\262\0347\000\320hE\362\335\0347\000\270nF\3303\0357\200\263\327G\313^\0357\000\016\227F\276\211\0357\200\313\244G\261\264\0357\0005\262G\244\337\0357\200\307\301G\230\n\0367\2001\200G\2135\0367\000\020:F~`\0367\200\330\234Gq\213\0367\200\222\314Gd\266\0367\000\357BGW\341\0367\000\240\234FJ\014\0377\200\025\332G=7\0377\000d\021F0b\0377\000dhH$\215\0377\000\002iG\027\270\0377\200[\246G\n\343\0377\200e\264G\375\r 7\000 2G\3608 7\000#\302E\343c 7\200y\250G\326\216 7\000\335DG\311\271 7\000\323\263G\275\344 7\000B\214F\260\017!7\200\246\243G\226e!7\0006`G\243:!7\200\036\344G\211\220!7\200u\371G|\273!7\000\340\202Go\346!7\000\272\377Fb\021\"7#v<HU<\"7\000\"\330FIg\"7\000\004FG<\222\"7\200;\271G/\275\"7\200C\243H\"\350\"7\000\370\337F\025\023#7\300\r\001H\010>#7\000\215^H\373h#7\200K\202G\356\223#7\000\236\350G\341\276#7\200k\303G\325\351#7\000F\346G\310\024$7\3002\030H\273?$7\200\220\276G\241\225$7`d\371H\256j$7\240\275\036I\224\300$7#y\036H\207\353$7\000#\237Gz\026%7\000^<Gal%7\200^?HnA%7\2000\231HT\227%7\200\310\323GG\302%7\300+DH
I am reading data from a file and I end up with a variable myString with the above data in it. When I do typeof on it, it says it's a string, and when I do a console.log on it, it outputs the above exactly. So I created a new node.js file and put all of the data above inside quotes:
newString = "?\21167\200Z\251\3072\26467\000 etc"
When I console.log(newString) it outputs
?‰67€Z©Ç2´67è-Æ%ß67 Ìƒ77€UüÇ 577ÉÆþ_77€sÇñŠ77tÆäµ77„Ç×à7 etc
Rather than the original string "?\21167\200Z\251\3072\26467\000..." etc
What is going on here?
Part 2. Because of 1, I have struggled to extract the 64 bit floats from this string.
I can do it in Python:
import numpy as np
data_new = b"?\21167\200Z\251\3072\26467\000\350-\306%\33767\240\314\2 etc etc"
print(np.frombuffer(data_new, dtype="f4,f4"))
Is there an easy way to do this in node?
Explanation
In JavaScript, \<OCTAL SEQUENCE> inside a string literal will give you the Unicode character represented by that codepoint in octal. This feature is deprecated in modern JavaScript but still works for backward-compatibility reasons (except in strict mode, where it throws a SyntaxError).
The octal sequence must be in the range 0o0..0o377 (decimal 0..255 or hexadecimal 0x0..0xff) for this to happen.
'\1'.codePointAt(0).toString(16) // 1
String.fromCodePoint(0x01) // <Start of Heading>
// https://www.compart.com/en/unicode/U+0001
'\377'.codePointAt(0).toString(16) // ff
String.fromCodePoint(0xff) // "ÿ"
If the number isn't valid octal, the backslash is simply "swallowed":
'\9'.codePointAt(0).toString(16) // 39
String.fromCodePoint(0x39) // "9"
// 9 is not a valid octal number, so the string
// simply evaluates to the character "9"
Solutions
Option 1
Replace all the backslashes \ with double-backslashes \\ to escape them, then you should be good to go.
'\\1'.codePointAt(0).toString(16) // 5c
String.fromCodePoint(0x5c) // Reverse Solidus (AKA backslash)
// https://www.compart.com/en/unicode/U+005C
Option 2
Use String.raw with a template literal.
String.raw`\1\2\3` === "\\1\\2\\3" // true
Note that this will not work if there is a backslash at the very end of the string:
String.raw`\` // Uncaught SyntaxError: Unexpected end of input

Lua gmatch odd characters (Slovak alphabet)

I am trying to extract the characters from a string of a word in Slovak. For example, the word for "TURTLE" is "KORYTNAČKA". However, it skips over the "Č" character when I try to extract it from the string:
local str = "KORYTNAČKA"
for c in str:gmatch("%a") do print(c) end
--result: K,O,R,Y,T,N,A,K,A
I am reading this page and I have also tried just pasting in the string itself as a set, but it comes up with something weird:
local str = "KORYTNAČKA"
for c in str:gmatch("["..str.."]") do print(c) end
--result: K,O,R,Y,T,N,A,Ä,Œ,K,A
Anyone know how to solve this?
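Lua is 8-bit clean: a string can hold any byte values, but the string library treats every byte as one character. The pattern "%a" matches a single one-byte letter, so (in the default C locale) the two bytes that make up "Č" in UTF-8 are not matched and the character is skipped.
The pattern "["..str.."]" seems to work because the set contains every byte of the string, including the two bytes of "Č". gmatch then matches those two bytes one at a time, which is why they print as two separate characters (Ä and Œ) instead of one.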
If UTF-8 is used, you can use the pattern "[\0-\x7F\xC2-\xF4][\x80-\xBF]*" to match a single UTF-8 byte sequence in Lua 5.2, like this:
local str = "KORYTNAČKA"
for c in str:gmatch("[\0-\x7F\xC2-\xF4][\x80-\xBF]*") do
    print(c)
end
In Lua 5.1 (which is the version Corona SDK is using), use this:
local str = "KORYTNAČKA"
for c in str:gmatch("[%z\1-\127\194-\244][\128-\191]*") do
    print(c)
end
For details about this pattern, see Equivalent pattern to “[\0-\x7F\xC2-\xF4][\x80-\xBF]*” in Lua 5.1.
Lua has no built-in treatment for Unicode strings. The Ä,Œ you see are the two bytes of the UTF-8 encoding of the single character Č, printed separately.
Yu Hao already provided a sample solution, but for more details here is a good source.
I've tested this solution and found that it works properly in Lua 5.1 (reserve link). You can extract individual characters using the utf8sub function; see the sample.
string.gmatch(str, "[%z\1-\127\192-\253][\128-\191]*")
Use the utf8 plugin, then replace string.gmatch with utf8.gmatch.
Example (tested on Win7, it works for me)
yourfilename.lua
local utf8 = require( "plugin.utf8" )
for c in utf8.gmatch( "KORYTNAČKA", "%a" ) do print(c) end
and
build.settings
settings =
{
    plugins =
    {
        ["plugin.utf8"] =
        {
            publisherId = "com.coronalabs"
        },
    },
}
Read more :
Introducing the UTF-8 string plugin,
Documentation for utf8 plugin,
Lua String Manipulation,
string.gmatch().
Have a nice day:)

Ignore character accents when sorting strings

I'm writing a golang program, which takes a list of strings and sorts them into bucket lists by the first character of string. However, I want it to group accented characters with the unaccented character that it most resembles. So, if I have a bucket for the letter A, then I want strings that start with Á to be included.
Does Go have anything built-in for determining this, or is my best bet to just have a large switch statement with all characters and their accented variations?
Looks like there are some addon packages for this. Here's an example...
package main
import (
	"fmt"
	"golang.org/x/text/collate"
	"golang.org/x/text/language"
)
func main() {
	strs := []string{"abc", "áab", "aaa"}
	cl := collate.New(language.English, collate.Loose)
	cl.SortStrings(strs)
	fmt.Println(strs)
}
outputs:
[aaa áab abc]
Also, check out the following reference on text normalization:
http://blog.golang.org/normalization

AS3 - "\u2605" NOT the same as "\\u"+"2605"?

I'm trying to make a text field where people write the Unicode escape without the backslash. I want to add the backslash after they have typed it. So the user types u2605 and the code converts it to "\u2605"; I then convert this to a Unicode character and insert it in the textflow.
My code:
this works:
span.text = publicFunctions.htmlUnescape(he.encode("\u2605"))
this doesn't work:
span.text = publicFunctions.htmlUnescape(he.encode("\\u"+"2605"))
How do I make a string that acts as a unicode string?
Tried all sorts of things, escape(unescape()), convert to number, "\u", "\\u" ... nothing helps.
trace("\u2605" == "\\u"+"2605") ... will return false. So will
trace("\u2605" == "\u"+"2605")
"\u2605" is a string with a single character, the character with the code point U+2605 (hexadecimal 2605), while "\\u" + "2605" is a string with 6 characters (the backslash, the u and the four digits).
If you want to construct a unicode character from just the four digits, you should be able to use String.fromCharCode. The catch is that the escape sequence is written in hexadecimal, while the method takes a numeric character code. So if the user enters a hexadecimal string, you will have to parse it as base 16 first:
trace(String.fromCharCode(parseInt('2605', 16)) == '\u2605');
That's an interesting issue! I don't think you can concatenate a string literal and achieve what you're trying to do. The relevant character escaping happens when the string literal is originally formed, which means that you need the whole sequence together in the first place.
But you should be able to take the user-supplied number and dynamically generate a Unicode string with String.fromCharCode(...).
http://help.adobe.com/en_US/FlashPlatform/reference/actionscript/3/String.html#fromCharCode()

Perl's default string encoding and representation

In the following:
my $string = "Can you \x{FB01}nd my r\x{E9}sum\x{E9}?\n";
The \x{FB01} and \x{E9} are code points. And code points are encoded via an encoding scheme to a series of octets.
So the character è which has the codepoint \x{FB01} is part of the string of $string. But how does this work? Are all the characters in this sentence (including the ASCII ones) encoded via UTF-8?
If yes why do I get the following behavior?
my $str = "Some arbitrary string\n";
if (Encode::is_utf8($str)) {
    print "YES str IS UTF8!\n";
}
else {
    print "NO str IT IS NOT UTF8\n";
}
This prints "NO str IT IS NOT UTF8\n"
Additionally Encode::is_utf8($string) returns true.
In what way are $string and $str different and one is considered UTF-8 and the other not?
And in any case what is the encoding of $str? ASCII? Is this the default for Perl?
In C, a string is a collection of octets, but Perl has two string storage formats:
String of 8-bit values.
String of 72-bit values. (In practice, limited to 32-bit or 64-bit.)
As such, you don't need to encode code points to store them in a string.
my $s = "\x{2660}\x{2661}";
say length $s; # 2
say sprintf '%X', ord substr($s, 0, 1); # 2660
say sprintf '%X', ord substr($s, 1, 1); # 2661
(Internally, an extension of UTF-8 called "utf8" is used to store the strings of 72-bit chars. That's not something you should ever have to know except to realize the performance implications, but there are bugs that expose this fact.)
Encode's is_utf8 reports which type of string a scalar contains. It's a function that serves absolutely no use except to debug the bugs I previously mentioned.
An 8-bit string can store the value of "abc" (or the string in the OP's $str), so Perl used the more efficient 8-bit (UTF8=0) string format.
An 8-bit string can't store the value of "\x{2660}\x{2661}" (or the string in the OP's $string), so Perl used the 72-bit (UTF8=1) string format.
Zero is zero whether it's stored in a floating point number, a signed integer or an unsigned integer. Similarly, the storage format of strings conveys no information about the value of the string.
You can store code points in an 8-bit string (if they're small enough) just as easily as a 72-bit string.
You can store bytes in a 72-bit string just as easily as an 8-bit string.
In fact, Perl will switch between the two formats at will. For example, if you concatenate $string with $str, you'll get a string in the 72-bit format.
You can alter the storage format of a string with the builtins utf8::downgrade and utf8::upgrade, should you ever need to work around a bug.
utf8::downgrade($s); # Switch to strings of 8-bit values (UTF8=0).
utf8::upgrade($s); # Switch to strings of 72-bit values (UTF8=1).
You can see the effect using Devel::Peek.
>perl -MDevel::Peek -e"$s=chr(0x80); utf8::downgrade($s); Dump($s);"
SV = PV(0x7b8a74) at 0x4a84c4
  REFCNT = 1
  FLAGS = (POK,pPOK)
  PV = 0x7bab9c "\200"\0
  CUR = 1
  LEN = 12
>perl -MDevel::Peek -e"$s=chr(0x80); utf8::upgrade($s); Dump($s);"
SV = PV(0x558a6c) at 0x1cc843c
  REFCNT = 1
  FLAGS = (POK,pPOK,UTF8)
  PV = 0x55ab94 "\302\200"\0 [UTF8 "\x{80}"]
  CUR = 2
  LEN = 12
The \x{FB01} and \x{E9} are code points.
Not quite: the numeric values inside the braces are code points. The whole \x expression is just a notation for a character. There are several notations for characters, most of them starting with a backslash, but the most common one is the plain string literal. You might as well write:
use utf8;
my $string = "Can you ﬁnd my résumé?\n";
#                     ↑       ↑   ↑
And code points are encoded via an encoding scheme to a series of octets.
True, but so far your string is a string of characters, not a buffer of octets.
But how does this work?
Strings consist of characters. That's just Perl's model. You as a programmer are supposed to deal with it at this level.
Of course, the computer can't, and the internal data structure must have some form of internal encoding. Far too much confusion ensues because "Perl can't keep a secret": the details leak out occasionally.
Are all the characters in this sentence (including the ASCII ones) encoded via UTF-8?
No, the internal encoding is lax UTF8 (no dash). It does not have some of the restrictions that UTF-8 (a.k.a. UTF-8-strict) has.
UTF-8 goes up to 0x10_ffff, UTF8 goes up to 0xffff_ffff_ffff_ffff on my 64-bit system. Codepoints greater than 0xffff_ffff will emit a non-portability warning, though.
In UTF-8 certain codepoints are non-characters or illegal characters. In UTF8, anything goes.
Encode::is_utf8
… is an internals function, and is clearly marked as such. You as a programmer are not supposed to peek. But since you want to peek, no one can stop you. Devel::Peek::Dump is a better tool for getting at the internals.
Read http://p3rl.org/UNI for an introduction to the topic of encoding in Perl.
is_utf8 is a badly-named function that doesn't mean what you think it means and has nothing to do with the encoding of your data. The answer to your question is that $string doesn't have an encoding, because it's not encoded. When you call Encode::encode with some encoding, the result is a string that is encoded and has a known encoding.

Resources