Are there any direct consequences of toggling between Unicode, MBCS, and Not Set in the VS C++ project setting Configuration Properties->General->Character Set, apart from the definition of the _UNICODE and _MBCS macros (which then, of course, indirectly has consequences through the _T() macro and the generic-text mappings for string functions)?
I am not expecting it to, but since the documentation says "Tells the compiler to use the specified character set", I'd like to be certain that it specifically has no influence on how literal non-ASCII text placed into strings or wstrings is encoded. (I am aware that non-ASCII literals in the source are not portable, but I am maintaining a solution where they are used heavily.)
Thanks in advance.
No, it only affects the macro definitions, which in turn can have wide-ranging effects on anything from <tchar.h> or on the Windows T string pointer types (LPTSTR etc.).
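For example (a minimal sketch, assuming a Windows program that includes <windows.h> and <tchar.h>), the very same source line resolves to different types and functions depending on that setting:

#include <windows.h>
#include <tchar.h>

int main()
{
    // "Unicode" defines UNICODE and _UNICODE: TCHAR is wchar_t, _T("...")
    // becomes L"...", LPCTSTR is LPCWSTR and _tcslen maps to wcslen.
    // "Multi-Byte Character Set" defines _MBCS instead: TCHAR is char,
    // _T("...") is a narrow literal, LPCTSTR is LPCSTR and _tcslen is strlen.
    // "Not Set" defines neither macro and also gives you the narrow mappings.
    TCHAR greeting[] = _T("hello");
    LPCTSTR p = greeting;
    return (int)_tcslen(p);
}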
If you use any non-ASCII characters in your string literals then you depend heavily on the way the compiler decodes the text in your source code file. The default encoding it assumes is your system code page, as configured in Control Panel + Regional and Language Options. That will not work well once your source code file strays too far from your machine. Saving the file as UTF-8 with a BOM is wise so this is never a problem; in the IDE that's done with Save As, the arrow on the Save button, "Save with Encoding", pick 65001. Support for UTF-8 encoded source code files is spotty in older versions of C++ compilers.
For unadorned string literals, C++ follows C: they are narrow strings in the (traditionally ASCII-based) execution character set. If you prefix them with anything, the game changes.
C++0x (now C++11) standardises Unicode string literals, UTF-8, UTF-16 and UTF-32 in particular. This is a new feature.
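For illustration, here is what those literal prefixes look like (C++11 syntax; note that since C++20 a u8 literal yields char8_t rather than char):

// The prefix pins the encoding of the literal, independent of the
// execution character set the compiler would otherwise apply:
const char*     a = u8"gr\u00fcn";  // UTF-8 encoded bytes
const char16_t* b = u"gr\u00fcn";   // UTF-16 code units
const char32_t* c = U"gr\u00fcn";   // UTF-32 code units
const wchar_t*  d = L"gr\u00fcn";   // wide execution charset (UTF-16 on Windows)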
I have a huge file dump to handle (7000+ files), all encoded in OEM-US (and I need them to remain OEM-US, or be returned to OEM-US, when I'm done).
The search-in-files feature of Notepad++ would actually solve all my problems. (It's a single-use job - I don't want to bore you with the details, but it's about sanitizing old code that has partially been written in foreign languages like German or French, including their notorious characters like äöüèéàç.)
The thing is: most of the time, Notepad++ detects the wrong encodings, and different encodings for different files. Usually it detects ANSI or UTF-8, but sometimes it gets exotic and all of a sudden my files are supposed to be encoded in Shift-JIS or Big5, which messes up my search terms as they sometimes turn different special chars into the same set of replacement chars.
So I'm looking for a way to either
a) Tell Notepad++ which encoding to select for the "search in files" job I want to run.
b) Convert all files to UTF-8, run the search-replace job there and restore the encoding to OEM-US afterwards.
or
c) Find a different piece of software to handle this issue for me.
Can someone help me?
All our source code is valid UTF-8; however, some users on Windows cannot build it because their system is configured for a different encoding.
Without adding a BOM to the source files, is it possible to tell MSVC to treat all source as UTF-8, irrespective of the user's system encoding?
See the MSDN documentation on this topic (that approach requires adding a BOM header).
You can try:
add_compile_options("$<$<C_COMPILER_ID:MSVC>:/utf-8>")
add_compile_options("$<$<CXX_COMPILER_ID:MSVC>:/utf-8>")
By default, Visual Studio detects a byte-order mark to determine if the source file is in an encoded Unicode format, for example, UTF-16 or UTF-8. If no byte-order mark is found, it assumes the source file is encoded using the current user code page, unless you have specified a code page by using /utf-8 or the /source-charset option.
References
Docs - Visual C++ - Documentation - IDE and Tools - Building - Build Reference: /utf-8 (Set Source and Executable character sets to UTF-8)
If you happen to write cross-platform code, solving the problem with a command-line switch, i.e.
add_compile_options("$<$<C_COMPILER_ID:MSVC>:/utf-8>")
add_compile_options("$<$<CXX_COMPILER_ID:MSVC>:/utf-8>")
or adding something like /utf-8 or /source-charset to the CFLAGS, means you may have to do a similar thing for the other platforms as well.
If possible, it might therefore be better to avoid the problem instead of solving it, by using \uXXXX escape sequences instead of literal Unicode characters in strings: that way the source specifies which Unicode characters to use, but doesn't actually contain them.
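A small sketch of that approach (the word "Käse" is just a hypothetical example): the string literals themselves contain only ASCII, yet the program still gets the intended characters; how they end up encoded then depends only on the execution character set or on the literal prefix used:

// No non-ASCII byte inside any string literal:
const char*    s1 = "K\u00e4se";    // "Käse" in the narrow execution charset
const char*    s2 = u8"K\u00e4se";  // "Käse" always encoded as UTF-8 (C++11)
const wchar_t* s3 = L"K\u00e4se";   // "Käse" as a wide string (UTF-16 on Windows)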
Some characters that I enter in the editor are not displayed identically to those on the keyboard. So I get error messages like this:
Character with decimal value 176 does not belong to the PL/I character
set. It will be ignored.
when trying to compile a PL/I program.
Sometimes the character is even displayed properly, but I still get a similar error message.
Examples of such characters are the ones that represent logical OR and logical NOT.
How can I solve this problem? Is it a setting of the editor, or a setting of the IBM Personal Communications program? Or maybe it is better to enter the hexadecimal code of those symbols (how do I do that, if it is possible, and how do I determine which code I need)?
There are a lot of places where this can go wrong...
The keyboard driver on your client machine has to be configured correctly for the keyboard you use. But if other programs work correctly and only the mainframe emulation behaves strangely, then this should be OK.
The PCOMM session has to be configured to use the correct host code page. Ask your mainframe technical people what is used and configure your terminal emulation accordingly. Since we don't use PCOMM I can't help you with this; you will have to look around the session settings a bit.
In PL/I most characters are taken from the range that is identical in most EBCDIC code pages. The main exceptions are the characters for the OR and NOT operators, which may differ. The IBM default for OR is '4F'X, which is a pipe character '|' in code page 1140 (English) but an exclamation mark '!' in code page 1141 (German). The default for NOT is '5F'X, which is a logical NOT sign '¬' in 1140 but a caret '^' in 1141.
Since these problems are well known, the compiler offers two options, OR() and NOT(), to set the characters to be used for these operators. So you might check in your compile listing whether these parameters are set in your installation and what their values are, since those are the characters you have to use.
I have a text file full of non-ASCII characters.
I cannot detect the encoding with either file or enca:
file non_ascii.txt
non_ascii.txt: Non-ISO extended-ASCII text
enca non_ascii.txt
Unrecognized encoding
But I can open it normally in Windows Notepad++
Edit: the statement above is misleading; sorry for that.
In fact, I picked some parts of the original file, put them into a new text file and then opened it in Notepad++.
The two parts are shown below. They are decoded in two different ways by Notepad++.
Question:
How can I detect the file's encoding under Linux?
How do I recover the characters represented by <F1><EE><E9><E4><FF>?
I couldn't get a result from "grep 'сойдя' win.txt", even though "сойдя" is encoded as <F1><EE><E9><E4><FF>.
A slice of the file content follows:
less non_ascii.txt
"non_ascii.txt" may be a binary file. See it anyway?
<F1><EE><E9><E4><FF>
<F2><F0><E0><EA><F2><EE><E2><E0><F2><FC><F1><FF>
<D0><F2><E9><E4><D7><E9><E7><E1><EC><E1><F3><F8>
<D1><E5><EA><F3><ED><E4>
<F0><E0><E7><E3><F0><F3><E7><EA><E8>
<EF><EE><E4><F1><F2><E0><E2><EB><FF><F2><FC>
<F0><E0><E7><E3><F0><F3><E7><EA><E5>
<F1><EE><E9><E4><F3>
<F0><E0><E7><E3><F0><F3><E7><EA><E0>
<F1><EE><E2><EB><E0><E4><E0><EB><E8>
<C1><D7><E9><E1><F0><EF><FE><F4><E1>
<CB><C1><D3><D3><C9><D4><C5><D2><C9><D4>
<F1><EE><E2><EB><E0><E4><E0><EB><EE>
<F1><EE><E9><E4><E8>
<F1><EE><E2><EB><E0><E4><E0><EB><E0>
Your question really has two parts: (1) how do I identify an unknown encoding and (2) how do I convert that to something useful?
The first part is the real challenge, and really cannot be answered in universal terms -- in the general case, there is no reliable way to identify an unknown 8-bit encoding. Some encodings give you good hints (UTF-8 is an excellent example) and in many cases, if you have a good idea what the text is supposed to represent, the problem can be solved.
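To illustrate why UTF-8 is such a good hint, here is a rough C++ sketch (my own illustration, not a complete validator) of the kind of structural check a detection tool can apply: random 8-bit text almost never satisfies UTF-8's strict lead/continuation byte pattern, so data that passes this test is very probably UTF-8.

#include <string>

// Returns true if every non-ASCII byte fits UTF-8's lead/continuation pattern.
// Simplified: overlong forms, surrogates and the U+10FFFF limit are ignored.
bool looks_like_utf8(const std::string& data)
{
    for (size_t i = 0; i < data.size(); ) {
        unsigned char b = static_cast<unsigned char>(data[i]);
        size_t extra;
        if      (b < 0x80)           extra = 0;    // plain ASCII
        else if ((b & 0xE0) == 0xC0) extra = 1;    // 2-byte sequence
        else if ((b & 0xF0) == 0xE0) extra = 2;    // 3-byte sequence
        else if ((b & 0xF8) == 0xF0) extra = 3;    // 4-byte sequence
        else                         return false; // not a valid lead byte
        if (i + extra >= data.size()) return false; // truncated sequence
        for (size_t k = 1; k <= extra; ++k)
            if ((static_cast<unsigned char>(data[i + k]) & 0xC0) != 0x80)
                return false;                       // continuation byte expected
        i += extra + 1;
    }
    return true;
}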
A mapping of 8-bit character meanings can be helpful (cough, the link is to mine) and in this case quickly hints at Windows code page 1251. Kudos for the hex dumps and the picture with the representation you expect!
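As a quick cross-check of the code page 1251 guess (my own sketch, not part of the original answer): CP1251 maps the bytes 0xC0-0xFF linearly onto the Cyrillic block U+0410-U+044F, which is exactly why the hex codes above decode to Russian words.

#include <cstdio>
#include <string>

// Convert the Cyrillic part of CP1251 to UTF-8. Bytes in 0x80-0xBF
// (Ё, №, quotation marks, ...) are not handled here; use iconv for real work.
std::string cp1251_cyrillic_to_utf8(const std::string& in)
{
    std::string out;
    for (unsigned char b : in) {
        if (b < 0x80) {
            out += static_cast<char>(b);                 // ASCII passes through
        } else if (b >= 0xC0) {
            unsigned cp = 0x0410 + (b - 0xC0);           // А..я
            out += static_cast<char>(0xC0 | (cp >> 6));  // two-byte UTF-8 sequence
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {
            out += '?';
        }
    }
    return out;
}

int main()
{
    const std::string bytes = "\xF1\xEE\xE9\xE4\xFF";             // first line of the dump
    std::printf("%s\n", cp1251_cyrillic_to_utf8(bytes).c_str());  // prints: сойдя
}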
With that out of the way, converting is easy.
iconv -f cp1251 -t utf-8 non_ascii.txt >utf8.txt
Provided your Linux system is set up to use UTF-8 at the terminal, your grep command should work on utf8.txt now.
The indication that some of the text is "ANSI" (which is a bogus term anyway) is probably just a red herring -- as far as I can tell, everything in your excerpt looks like well-formed CP1251.
Some tools like chardet do a reasonable job of at least steering you in the right direction, though you have to understand that, like a human expert, they have to guess what the text is supposed to represent. There are corner cases where they just don't have enough information to guess correctly, either because there are several candidate encodings with very few differences (for example, Latin-1 vs Latin-9 vs Windows-1252, all of which also overlap with plain 7-bit US-ASCII in the first 128 positions) or because the input doesn't contain enough information to establish any common patterns.
I want to use the printer (Windows driver) to print Japanese in a VB6 project.
My project runs in a Japanese Windows environment (the OS is originally English, with the region and related language set to Japanese).
I use the Printer object to print a simple Japanese string such as "レジースター"; the code looks like this:
Dim s As String
s="レジースター"
Printer.Print s
Printer.EndDoc
but the output is garbled text like "OEvƒOEƒ|[ƒg".
Has anybody succeeded in printing Japanese with the VB6 Printer object in a Japanese-language Windows environment? Please help me.
Finally I found that the key is simple; it's a little bit tricky, and I still don't know why. Just set the font of the Printer object, as in "Printer.Font.Charset = 128" (128 is the Shift-JIS charset value for Japanese).
ATTN: please note my particular case: my OS is English, with the language and region set to Japanese.
What confuses me is the default ANSI behaviour of Windows. As we know, the default value of Printer.Font.Charset is 0, which means ANSI (if the language environment is Japanese it should use code page 932; if it is English, Windows-1252).
My OS is set to Japanese (not a native Japanese OS; it was originally English). When I write a file in Japanese it displays Japanese correctly, but when I use the Printer object to print, .Font.Charset is 0 (ANSI) yet it still seems to use the original OS code page, which is weird. And when I set the system to Chinese or Korean, both languages print fine; only Japanese has this problem.
The trick that I have used for something like this is to use two nested StrConv() calls, one with the vbFromUnicode constant and the other with the vbUnicode constant.
It takes a little experimenting to get right, but it should look something like this; swap the constants and/or locale values until you get the right conversion for your system:
Dim s As String
s = "レジースター"
Dim newS As String
newS = StrConv(StrConv(s, vbFromUnicode, LocaleID1), vbUnicode, LocaleID2)
Printer.Print newS
LocaleID*N* stands for a Windows locale ID (LCID) - StrConv's optional third argument - for example 1033 for English (US) and 1041 for Japanese; the conversion uses the ANSI code page associated with that locale (1252 and 932, respectively).
Although all strings in VB6 are Unicode (UTF-16) internally, when it comes to interfacing with the outside world VB6 is completely non-Unicode.
You cannot store レジースター in your project file because the file is in ANSI.
You cannot simply pass a string to a declared API function because it will undergo an automatic conversion to ANSI first. To avoid this, you have to declare string parameters as pointers.
Apparently the same happens with the Print call - the string gets converted to the current ANSI code page before it reaches the printer driver.
You can try printing manually by creating a device context and printing on it.
Or you can search for another solution, like this one (I did not try it).