How to display unicode for "micron" or lower case "Mu" in a CEdit? - visual-c++

How to display unicode for "micron" or lower case "Mu" in a CEdit?
thx
CEdit *ed = new CEdit();
ed->Create(WS_VISIBLE | WS_BORDER, CRect(200,100,300,120), this, 10);
CString s;
s.Format("%c", xx); //<--- how to do unicode here?
ed->SetWindowText(s);
I think the Unicode code points are:
micron 0x00b5
squared 0x00b2
cubed 0x00b3
degree 0x00b0

If you're using the UNICODE project setting, this is trivial:
s.Format(L"%c", 0x3BC); // U+03BC GREEK SMALL LETTER MU (the micro sign is U+00B5)
The CString will contain wide (UTF-16) characters, which can hold most common Unicode characters in a single 16-bit code unit; the only problem comes with code points above U+FFFF, which require surrogate pairs.

Related

Getting the last two characters of Japanese words returns question marks in NetBeans 8.2

I have this code:
String str;
Scanner Lectura = new Scanner(System.in);
System.out.print("Type a word in plain form: ");
str=Lectura.next();
//obtains word
String substring = str.length() > 2 ? str.substring(str.length() - 2) : str;
System.out.print(substring);
...to get the last two characters of a word. I intend to use it with Japanese verbs, but for some reason it just returns question marks:
https://i.stack.imgur.com/3e7C2.png
I typed in 食べる which should return べる; the last two characters. Here's proof it's not a unicode problem or something (the console font is so small):
https://i.stack.imgur.com/dTHEz.png
What's the problem? What can I do to get those characters and not question marks?
I guess the Scanner is using your system's default encoding. Your sample works for me (my OS default is UTF-8). You can specify the encoding when creating the Scanner; maybe that will help:
Scanner Lectura = new Scanner(System.in, "UTF-8");
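If it helps, here is that fix folded into a self-contained sketch (the class name and the lowercased variable are just for illustration; the terminal you print to must also be able to render the characters):
import java.util.Scanner;

public class LastTwoChars {
    public static void main(String[] args) {
        // Read with an explicit encoding instead of the platform default
        Scanner lectura = new Scanner(System.in, "UTF-8");

        System.out.print("Type a word in plain form: ");
        String str = lectura.next();

        // Last two UTF-16 code units; fine for kana, which are in the BMP
        String substring = str.length() > 2 ? str.substring(str.length() - 2) : str;
        System.out.println(substring);
    }
}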

How to convert string like "\u****" to text?

I want to convert a string like "\u****" to text (Unicode) in Haskell.
I have a Java properties file with the following content:
i18n.test.key=\u0050\u0069\u006e\u0067\u0020\uc190\uc2e4\ub960\u0020\ud50c\ub7ec\uadf8\uc778
I want to convert it to text (Unicode) in Haskell.
I think I can do it like this:
Convert the "\u****" escapes to a list of Word8 values
Convert the Word8 list to a ByteString
Use Text.Encoding.decodeUtf8 to convert the ByteString to Text
But step 1 is a little complicated for me.
How do I do it in Haskell?
A simple solution may look like this:
import qualified Data.ByteString as BS
import qualified Data.Text.Encoding as T

decodeJava = T.decodeUtf16BE . BS.concat . gobble

gobble [] = []
gobble ('\\':'u':a:b:c:d:rest) =
  let sym = convert16 [a,b] [c,d]
  in  sym : gobble rest
gobble _ = error "decoding error"

-- Two hex-digit pairs become the two bytes of one UTF-16 code unit.
convert16 hi lo = BS.pack [read ("0x" ++ hi), read ("0x" ++ lo)]
Notes:
Your string is UTF-16-encoded, therefore you need decodeUtf16BE.
Decoding will fail if there are other characters in the string. This code will work with your example only if you remove the trailing i.
Constructing the words by appending 0x and, in particular, using read is very slow, but will do the trick for small data.
If you replace \u with \x then this is a valid Haskell string literal.
my_string = "\x0050\x0069\x006e..."
You can then convert to Text if you want, or leave it as String, or whatever.
Watch out, Java normally uses UTF-16 to encode its strings, so interpreting the bytes as UTF-8 will probably not work.
If the codes in your file are UTF-16, you need to do the following:
find the numeric value of each \uXXXX quadruple (a UTF-16 code unit)
check whether it is a high surrogate. If so, the following value will be a low surrogate, and the pair of surrogates maps to a single Unicode code point.
make a String from your list of code point numbers with map toEnum (or Data.Char.chr)
The following is a quote from the Java doc http://docs.oracle.com/javase/7/docs/api/ :
The char data type (and therefore the value that a Character object encapsulates) are based on the original Unicode specification, which defined characters as fixed-width 16-bit entities. The Unicode Standard has since been changed to allow for characters whose representation requires more than 16 bits. The range of legal code points is now U+0000 to U+10FFFF, known as Unicode scalar value. (Refer to the definition of the U+n notation in the Unicode Standard.)
The set of characters from U+0000 to U+FFFF is sometimes referred to as the Basic Multilingual Plane (BMP). Characters whose code points are greater than U+FFFF are called supplementary characters. The Java platform uses the UTF-16 representation in char arrays and in the String and StringBuffer classes. In this representation, supplementary characters are represented as a pair of char values, the first from the high-surrogates range, (\uD800-\uDBFF), the second from the low-surrogates range (\uDC00-\uDFFF).
Java has methods to combine a high surrogate character and a low surrogate character to get the Unicode point. You may want to check the source of the java.lang.Character class to find out how exactly they do this, but I guess it is some simple bit-operation.
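Since the answer points at those methods, here is a minimal Java sketch of combining a surrogate pair with the public java.lang.Character API (the class name and the sample pair are just illustrative):
public class SurrogatePairDemo {
    public static void main(String[] args) {
        // U+1F618 is stored in UTF-16 as the surrogate pair D83D DE18
        char high = '\uD83D';
        char low  = '\uDE18';

        if (Character.isHighSurrogate(high) && Character.isLowSurrogate(low)) {
            // Combine the pair back into one Unicode code point
            int codePoint = Character.toCodePoint(high, low);
            System.out.printf("U+%X%n", codePoint); // prints U+1F618
        }
    }
}
The underlying arithmetic, 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00), is what you would reproduce in Haskell.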
Another possibility would be to check for a Haskell library that does UTF-16 decoding.

Swift string count() vs NSString .length not equal

Why do these two lines give me different results?
var str = "Hello 😘" // the square is an emoji
count(str) // returns 7
(str as NSString).length // returns 8
This is because Swift uses extended grapheme clusters. Swift sees the smiley as one character, but the NSString method counts it as two UTF-16 code units (a surrogate pair), even though together they represent a single symbol.
I think the documentation says it best:
The character count returned by the count(_:) function is not always the same as the length property of an NSString that contains the same characters. The length of an NSString is based on the number of 16-bit code units within the string’s UTF-16 representation and not the number of Unicode extended grapheme clusters within the string. To reflect this fact, the length property from NSString is called utf16Count when it is accessed on a Swift String value.
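The same distinction shows up outside Swift. Purely as an illustration, in Java rather than Swift, String.length() counts UTF-16 code units while codePointCount counts code points; for this particular string the code-point count happens to match Swift's character count:
public class CountVsLength {
    public static void main(String[] args) {
        String s = "Hello 😘"; // the emoji is U+1F618, outside the Basic Multilingual Plane

        // UTF-16 code units: the emoji needs a surrogate pair, so this prints 8
        System.out.println(s.length());

        // Unicode code points: the emoji is a single code point, so this prints 7
        System.out.println(s.codePointCount(0, s.length()));
    }
}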

How to read CSV which is in Hindi font using C#.net?

I am trying to read data from csv and put that in drop down. This CSV is written in Hindi font (shusha.ttf).
While reading each line I am getting junk values.
string sFileName = "C://MyFile.csv";
Assembly assem = Assembly.GetCallingAssembly();
FileStream[] fss = assem.GetFiles();
if (!File.Exists(sFileName))
{
    MessageBox.Show("Items File Not Present");
    return false;
}
StreamReader sr = new StreamReader(sFileName);
string sItem = null;
bool isFirstLine = true;
do
{
    sItem = sr.ReadLine();
    if (sItem != null)
    {
        string[] arrItems = sItem.Split(',');
        if (!isFirstLine)
        {
            listItems.Add(arrItems[0]);
        }
        isFirstLine = false;
    }
} while (sItem != null);
return true;
You're not providing an encoding parameter to the StreamReader, so it is assuming a default encoding, which is evidently not the encoding the file was written with.
Not all text files or csv files are the same. Encoding systems choose how to convert 'characters' (glyphs, word pictures, letters, whatever) to bytes to store in a computer.
There are many different encoding systems: ASCII, EBCDIC, UTF-8, UTF-16, UTF-32, etc.
You need to figure out which encoding the file was written with and pass that as the Encoding parameter to the StreamReader class.
I would have guessed the file was written in UTF-8, since it's a pretty universal standard for non-English text; but StreamReader already defaults to UTF-8 when you don't provide a value, so since you're getting junk it is probably not UTF-8. It could be UTF-16, or perhaps some other encoding entirely.
For the curious who want some background on Unicode: Unicode is a standard that assigns simple numbers to glyphs, ranging from ASCII to snowmen to Mandarin, etc. Unicode just gives each glyph a number, known as a code point. Unicode however is NOT an encoding; it doesn't say how to actually represent those code points as bytes.
UTF-8 is a Unicode encoding that can cover the entirety of the Unicode space, as are UTF-16 and UTF-32. UTF-8 writes 1 byte out for code points below a certain value, 2 bytes for code points below a certain higher value, and so on, and uses signaling bits in each byte to help indicate whether a code point was written using one, two, three, etc. bytes.
Internally, for instance, C# represents strings using UTF-16, which is why if you look at the raw memory for strings containing only ASCII text, you'll see lots of 0 values: ASCII doesn't need the high byte, so those bytes end up being all 0.
Here's a link from Wikipedia that explains how UTF-8 packs bits from the code point value, with signaling bits, into bytes to store in memory: https://en.wikipedia.org/wiki/UTF-8
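To make the code-point-versus-encoding distinction concrete, here is a small illustration (in Java here, though the same experiment works with System.Text.Encoding.GetBytes in C#; the class name is just for the example). The Devanagari letter क (U+0915) is a single code point, but the bytes stored depend entirely on the encoding chosen:
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class OneCodePointManyEncodings {
    public static void main(String[] args) {
        String ka = "\u0915"; // DEVANAGARI LETTER KA, a single Unicode code point

        // UTF-8 needs three bytes for this code point: E0 A4 95
        System.out.println(Arrays.toString(ka.getBytes(StandardCharsets.UTF_8)));

        // UTF-16 (big-endian, no BOM) needs one 16-bit code unit: 09 15
        System.out.println(Arrays.toString(ka.getBytes(StandardCharsets.UTF_16BE)));
    }
}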

Perl's default string encoding and representation

In the following:
my $string = "Can you \x{FB01}nd my r\x{E9}sum\x{E9}?\n";
The \x{FB01} and \x{E9} are code points. And code points are encoded via an encoding scheme to a series of octets.
So the character ﬁ (the "fi" ligature), which has the code point \x{FB01}, is part of $string. But how does this work? Are all the characters in this sentence (including the ASCII ones) encoded via UTF-8?
If yes why do I get the following behavior?
my $str = "Some arbitrary string\n";
if (Encode::is_utf8($str)) {
    print "YES str IS UTF8!\n";
}
else {
    print "NO str IT IS NOT UTF8\n";
}
This prints "NO str IT IS NOT UTF8\n"
Additionally Encode::is_utf8($string) returns true.
In what way are $string and $str different and one is considered UTF-8 and the other not?
And in any case what is the encoding of $str? ASCII? Is this the default for Perl?
In C, a string is a collection of octets, but Perl has two string storage formats:
String of 8-bit values.
String of 72-bit values. (In practice, limited to 32-bit or 64-bit.)
As such, you don't need to encode code points to store them in a string.
my $s = "\x{2660}\x{2661}";
say length $s; # 2
say sprintf '%X', ord substr($s, 0, 1); # 2660
say sprintf '%X', ord substr($s, 1, 1); # 2661
(Internally, an extension of UTF-8 called "utf8" is used to store the strings of 72-bit chars. That's not something you should ever have to know except to realize the performance implications, but there are bugs that expose this fact.)
Encode's is_utf8 reports which type of string a scalar contains. It's a function that serves absolutely no use except to debug the bugs I previously mentioned.
An 8-bit string can store the value of "abc" (or the string in the OP's $str), so Perl used the more efficient 8-bit (UTF8=0) string format.
An 8-bit string can't store the value of "\x{2660}\x{2661}" (or the string in the OP's $string), so Perl used the 72-bit (UTF8=1) string format.
Zero is zero whether it's stored in a floating point number, a signed integer or an unsigned integer. Similarly, the storage format of strings conveys no information about the value of the string.
You can store code points in an 8-bit string (if they're small enough) just as easily as a 72-bit string.
You can store bytes in a 72-bit string just as easily as an 8-bit string.
In fact, Perl will switch between the two formats at will. For example, if you concatenate $string with $str, you'll get a string in the 72-bit format.
You can alter the storage format of a string with the builtins utf8::downgrade and utf8::upgrade, should you ever need to work around a bug.
utf8::downgrade($s); # Switch to strings of 8-bit values (UTF8=0).
utf8::upgrade($s); # Switch to strings of 72-bit values (UTF8=1).
You can see the effect using Devel::Peek.
>perl -MDevel::Peek -e"$s=chr(0x80); utf8::downgrade($s); Dump($s);"
SV = PV(0x7b8a74) at 0x4a84c4
REFCNT = 1
FLAGS = (POK,pPOK)
PV = 0x7bab9c "\200"\0
CUR = 1
LEN = 12
>perl -MDevel::Peek -e"$s=chr(0x80); utf8::upgrade($s); Dump($s);"
SV = PV(0x558a6c) at 0x1cc843c
REFCNT = 1
FLAGS = (POK,pPOK,UTF8)
PV = 0x55ab94 "\302\200"\0 [UTF8 "\x{80}"]
CUR = 2
LEN = 12
The \x{FB01} and \x{E9} are code points.
Not quite: the numeric values inside the braces are code points. The whole \x expression is just a notation for a character. There are several notations for characters, most of them starting with a backslash, but the common one is the simple string literal. You might as well write:
use utf8;
my $string = "Can you ﬁnd my résumé?\n";
#                     ↑       ↑   ↑
And code points are encoded via an encoding scheme to a series of octets.
True, but so far your string is a string of characters, not a buffer of octets.
But how does this work?
Strings consist of characters. That's just Perl's model. You as a programmer are supposed to deal with it at this level.
Of course, the computer can't, and the internal data structure must have some form of internal encoding. Far too much confusion ensues because "Perl can't keep a secret"; the details leak out occasionally.
Are all the characters in this sentence (including the ASCII ones) encoded via UTF-8?
No, the internal encoding is lax UTF8 (no dash). It does not have some of the restrictions that UTF-8 (a.k.a. UTF-8-strict) has.
UTF-8 goes up to 0x10_ffff, UTF8 goes up to 0xffff_ffff_ffff_ffff on my 64-bit system. Codepoints greater than 0xffff_ffff will emit a non-portability warning, though.
In UTF-8 certain codepoints are non-characters or illegal characters. In UTF8, anything goes.
Encode::is_utf8
… is an internals function, and is clearly marked as such. You as a programmer are not supposed to peek. But since you want to peek, no one can stop you. Devel::Peek::Dump is a better tool for getting at the internals.
Read http://p3rl.org/UNI for an introduction to the topic of encoding in Perl.
is_utf8 is a badly named function that doesn't mean what you think it means, and it has nothing to do with your question. The answer to your question is that $string doesn't have an encoding, because it isn't encoded. When you call Encode::encode with some encoding, the result will be a string that is encoded and has a known encoding.
