PL/I character set and IBM Personal Communications - wrong characters are displayed - mainframe

Some characters that I enter in the editor are not displayed identically to those on the keyboard, so I get error messages like this:
Character with decimal value 176 does not belong to the PL/I character
set. It will be ignored.
when trying to compile a PL/I program.
Sometimes the character is even displayed properly, but I still get a similar error message.
Examples of such characters are the ones that represent logical OR and logical NOT.
How can I solve this problem? Is it a setting of the editor, or a setting of IBM Personal Communications? Or maybe it is better to enter the hex code of those symbols (how do I do that, if it is possible, and how do I determine which code I need)?

There are a lot of places where this can go wrong...
The keyboard driver on your client machine has to be configured correctly for the keyboard you use. But if other programs work correctly and only the mainframe emulation behaves strangely, then this part should be OK.
The PCOMM session has to be configured to use the correct host code page. Ask your mainframe technical staff what is used and configure your terminal emulation accordingly. Since we don't use PCOMM I can't help you with the details; you will have to look around the session settings a bit.
In PL/I most characters are taken from the range that is identical across most EBCDIC code pages. The main exceptions are the characters for the OR and NOT operators, which may differ. The IBM default for OR is '4F'X, which is a pipe character '|' in code page 1140 (English) but an exclamation mark '!' in code page 1141 (German). The default for NOT is '5F'X, which is a logical NOT sign '¬' in 1140 but a caret '^' in 1141.
Since these problems are well known, the compiler offers two options, OR() and NOT(), to set the characters to be used for these operators. So check your compile listing to see whether these parameters are set in your installation and what their values are, since those are the characters you have to use.
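For illustration, a compile-time options record might look like the following. This is a hedged sketch: whether options are passed on a *PROCESS statement, in JCL, or elsewhere depends on your installation, so verify the exact form with your site.

*PROCESS OR('|') NOT('¬');

With these values the compiler would accept '|' for OR and '¬' for NOT, matching code page 1140.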

Unicode Codepoints for special characters in MS Keyboard Layout Creator

My goal:
I am trying to get the MS Keyboard Layout Creator to allow me to perform a carriage return/enter whenever I hit the [R-Arrow] key in combination with the [Control] key, but still have the [R-Arrow] key perform as normal (i.e. move one character right) when hit alone. I'm doing this because my laptop keyboard's [Enter] key is busted, and I want to use this hack for a short time before I go ahead and get another keyboard. Yes, I know it might be easier to get a new one. :)
As far as I can tell, I have almost figured everything out. The only pieces of information I still need are the exact hexadecimal codepoints for both 1) right-arrow navigation and 2) enter/carriage return. I am hoping someone can direct me to this info. I have found the Unicode reference, but I am unable to discern which codes I might use for the carriage return and the right-arrow navigation (not the right-arrow ASCII character →; I don't care about that).
Example code in my existing KLC file:
KBD Layout01 "Layout01 Description"
COPYRIGHT "(c) 2017 Company"
COMPANY "Company"
LOCALENAME "en-US"
LOCALEID "00000409"
VERSION 1.0
SHIFTSTATE
0 //Column 4
1 //Column 5 : Shft
2 //Column 6 : Ctrl
LAYOUT ;an extra '#' at the end is a dead key
//SC VK_ Cap 0 1 2
//-- ---- ---- ---- ---- ----
39 SPACE 0 0020 0020 -1 // SPACE, SPACE, <none>
53 DECIMAL 0 002e 002e -1 // FULL STOP, FULL STOP,
My understanding of the code (SPACEBAR example)
Looking at the preexisting examples in the file (the space and the decimal), I have figured out the following:
Note: the examples in parentheses below refer only to the spacebar.
The first number is the keyboard key (e.g. 39 above)
The word which follows that number is the designated label to refer to that key (e.g. SPACE above)
The next three numbers are hexadecimal codepoints/symbols which correspond to the SHIFTSTATE columns:
The first is the codepoint for what the key will output when pressed with no modifier (shift state 0).
The second is the codepoint for what the key will output if pressed together with the SHIFT key.
The third is the codepoint for what the key will output if pressed together with the CONTROL key.
The goal: figuring out the codes for right-arrow navigation and enter
I have figured out this much of the line I want to add so that pressing the right key alone will still navigate right, but the combination Control-Right will instead trigger a carriage return/enter:
4d RIGHT 0 ??I don't know?? ??I don't know?? -1
I believe I know the following:
4d (in the 1st column) is the scan code for the right arrow key
The name RIGHT (in the 2nd column) is the handle/name for the right arrow key
0 (in the 3rd column) means don't change the key when Caps Lock is on
What I need your help to figure out
What the codepoint/hexadecimal/Unicode symbol is for performing right-arrow navigation (I think that is what goes in the fourth column if I want [Right-Arrow] on its own to keep moving the cursor one character to the right).
What the codepoint/hexadecimal/Unicode symbol is for performing a carriage return/enter (I think that is what goes in the sixth column if I want [Control]-[Right-Arrow] to trigger an enter/carriage return).
It may be that I am mistaken and the symbols I need are not unicode codepoints; if I am wrong, please correct me, as that info will help me get closer to my goal. Any help would be greatly appreciated!
I don't know if you still need this, but as I had already written down most of it, I'm posting it.
I looked into it for a while. I haven't found an actual definitive answer, but I can give you some hints (I'm posting this as an answer nonetheless because it would have been too unwieldy as comments).
I have a strong feeling that what you ask is not possible (that control keys such as the arrows cannot be mapped to different keys/characters/functions when a modifier such as ctrl is pressed).
I'm not really a huge expert in these things but I can give you some pointers:
(in the following there is a good deal of information not much related to your problem, but it might help you understand better)
When you press a key in Windows there are at least 3 sets of codes that are involved:
Scan codes: these are the codes that are actually generated by the hardware and sent to the PC. I have little knowledge of them; I never had a need to use them and I was too young when they were more relevant. They can theoretically vary from keyboard to keyboard but they're largely standardized; USB keyboards really are standardized, as far as I could tell, and their scan codes ought to be those listed in the HID Usage Tables (section 10). Wikipedia has some info but not a full list of the traditional codes. Most likely you won't need these, though (but maybe you will). By the way, these scan codes are also passed to the applications (I'm not sure how reliably) but they hardly ever use them.
Virtual-key codes: the scan codes in Windows are translated by the keyboard driver into a common set of key codes specified by Microsoft: the virtual-key codes. These are independent of the keyboard and are what's (normally) used by applications when they need to handle individual key presses.
Unicode (or other charset) characters: Windows recognizes when the keys being pressed are supposed to produce printable characters and passes these characters to the applications. When an application is only interested in printable characters, it looks only at these, although when it needs to do more complex things (shortcuts...) it also has access to the virtual-key codes (and, if it really wants, to the scan codes). Unicode is a character set, not a "key-code set", so it generally contains only printable characters. To facilitate interoperability with ASCII and other legacy charsets it also includes the control characters defined in previous standards, but the arrow keys are not among those control characters, so there are no Unicode codepoints for the keyboard's arrows. The sketch after this list illustrates the difference between the last two levels.
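To see those two levels side by side, here is a minimal C++ console sketch (Windows-only; it reads raw console key events). Arrow keys report a virtual-key code but no character, while Enter reports VK 0x0D together with the character U+000D:

#include <windows.h>
#include <cstdio>

// Read raw key events and print both the virtual-key code and the
// translated character (if any). Arrow keys (e.g. VK_RIGHT = 0x27) have
// a VK code but char U+0000; Enter shows VK 0x0D and char U+000D (CR).
int main()
{
    HANDLE in = GetStdHandle(STD_INPUT_HANDLE);
    INPUT_RECORD rec;
    DWORD n;
    for (int i = 0; i < 20 && ReadConsoleInput(in, &rec, 1, &n); ++i) {
        if (rec.EventType == KEY_EVENT && rec.Event.KeyEvent.bKeyDown) {
            std::printf("VK=0x%02X char=U+%04X\n",
                        rec.Event.KeyEvent.wVirtualKeyCode,
                        (unsigned)rec.Event.KeyEvent.uChar.UnicodeChar);
        }
    }
    return 0;
}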
In the second column of the .klc file it would appear that you have to put the name of the virtual-key constant with the VK_ prefix removed. Quite weird indeed.
Several Microsoft documentation pages say that the WDK kbd.h file that you can also find in the inc directory of the Microsoft Keyboard Layout Creator has the detailed information about this stuff. Personally I couldn't make too much out of it, though.
If you really want to dig into this the late Michael Kaplan's blog has probably the information you're looking for, somewhere.
Your best option is most likely to use some other application. I stumbled upon KbdEdit, which does handle the arrow keys, but it really seems that it can't assign a different function to the key when used with a modifier (though you can change the effect of the key altogether, irrespective of the pressed modifier).
For the Enter key you would likely need to use the virtual key, which is 0D (VK_RETURN).
The sequence of characters used to indicate line breaks on Windows is CR LF, which have (in Unicode and almost every other existing charset) codepoints 0D and 0A, respectively.
The Windows message that notifies applications of entered characters (point 3 above - I mean the WM_CHAR message, by the way) nevertheless reports only a CR (0D) when you press Enter; so if those klc files use Unicode codepoints in some part, there's a good chance that they use that (CR) to indicate the Enter key.
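If that guess is right, the line you are trying to write might look something like the following - purely speculative and untested; the 000d in the Ctrl column is my assumption, and as said above I suspect the format won't accept any of this for an arrow key:

4d RIGHT 0 -1 -1 000d // Ctrl state mapped to CR (U+000D)?; base and Shift left undefined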
All in all, your best bet is probably to just assign Enter to a different key (for example a function key, the right Ctrl or Win key if you have them, or the Caps Lock).

How to convert "binary text" to "visible text"?

I have a text file full of non-ASCII characters.
I cannot detect the encoding with either file or enca.
file non_ascii.txt
non_ascii.txt: Non-ISO extended-ASCII text
enca non_ascii.txt
Unrecognized encoding
But I can open it normally in Notepad++ on Windows.
Edit: the statement above is misleading. Sorry for that.
In fact, I picked some parts of the original file, put them into a new text file, and then opened it in Notepad++.
The two parts are shown below. They are decoded in two different ways by Notepad++.
Question:
How can I detect the file's encoding under Linux?
How do I recover the characters represented by <F1><EE><E9><E4><FF>?
I couldn't get a result from grep 'сойдя' win.txt even though "сойдя" is encoded as <F1><EE><E9><E4><FF>.
The file content slice as follows:
less non_ascii.txt
"non_ascii.txt" may be a binary file. See it anyway?
<F1><EE><E9><E4><FF>
<F2><F0><E0><EA><F2><EE><E2><E0><F2><FC><F1><FF>
<D0><F2><E9><E4><D7><E9><E7><E1><EC><E1><F3><F8>
<D1><E5><EA><F3><ED><E4>
<F0><E0><E7><E3><F0><F3><E7><EA><E8>
<EF><EE><E4><F1><F2><E0><E2><EB><FF><F2><FC>
<F0><E0><E7><E3><F0><F3><E7><EA><E5>
<F1><EE><E9><E4><F3>
<F0><E0><E7><E3><F0><F3><E7><EA><E0>
<F1><EE><E2><EB><E0><E4><E0><EB><E8>
<C1><D7><E9><E1><F0><EF><FE><F4><E1>
<CB><C1><D3><D3><C9><D4><C5><D2><C9><D4>
<F1><EE><E2><EB><E0><E4><E0><EB><EE>
<F1><EE><E9><E4><E8>
<F1><EE><E2><EB><E0><E4><E0><EB><E0>
Your question really has two parts: (1) how do I identify an unknown encoding and (2) how do I convert that to something useful?
The first part is the real challenge, and really cannot be answered in universal terms -- in the general case, there is no reliable way to identify an unknown 8-bit encoding. Some encodings give you good hints (UTF-8 is an excellent example) and in many cases, if you have a good idea what the text is supposed to represent, the problem can be solved.
A mapping of 8-bit character meanings can be helpful (cough, the link is to mine) and in this case quickly hints at Windows code page 1251. Kudos for the hex dumps and the picture with the representation you expect!
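One quick way to test such a guess is to feed one of the dumped byte sequences through iconv (assuming a UTF-8 terminal) and see whether a plausible word comes out:

echo -n $'\xf1\xee\xe9\xe4\xff' | iconv -f cp1251 -t utf-8
сойдя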
With that out of the way, converting is easy.
iconv -f cp1251 -t utf-8 non_ascii.txt >utf8.txt
Provided your Linux system is set up to use UTF-8 at the terminal, your grep command should work on utf8.txt now.
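That is, the search from your question should now succeed:

grep 'сойдя' utf8.txt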
The indication that some of the text is "ANSI" (which is a bogus term anyway) is probably just a red herring -- as far as I can tell, everything in your excerpt looks like well-formed CP1251.
Some tools like chardet do a reasonable job of at least steering you in the right direction, though you have to understand that, like a human expert, they have to guess what the text is supposed to represent. There are corner cases where they just don't have enough information to guess correctly, either because there are several candidate encodings with very few differences (for example, Latin-1 vs Latin-9 vs Windows-1252, all of which also overlap with plain 7-bit US-ASCII in the first 128 positions) or because the input doesn't contain enough information to establish any common patterns.
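For example, the command-line front end that ships with the Python chardet package (assuming it is installed; the exact name and output format may vary by version) prints its best-guess encoding together with a confidence value:

chardetect non_ascii.txt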

Why are some characters like "○" overlapped?

I stumbled upon an interesting character while Skyping. I was initially interested in sending an empty message but, since Skype has some validation rules, it doesn't allow empty messages.
But, playing with a few ALT+NUMPAD codes, I saw that Alt+777 produces ○ in most editors but ̉ on Skype, which delivers an empty-looking message, and this is where curiosity got me. I started to abuse this symbol and noticed that it can overlap other characters. So if I type the symbol inside a word, the result is a ̉mủt̉ả̉̉̉t̉ẻd̉W̉ỏ̉r̉̉d̉̉.
You can see for yourself that "̉mủt̉ả̉̉̉t̉ẻd̉W̉ỏ̉r̉̉d̉̉".length == 24 while "mutatedWord".length == 11.
And the funny thing is that, up until now, I could only generate that weird apostrophe-like mark on Skype; anywhere else it results in ○.
Can someone explain this behavior to me?
You're typing Unicode codepoint 777 (decimal), usually notated in hex as U+0309. This codepoint is one of Unicode's combining characters, namely "combining hook above". You've already discovered what they do: they add a diacritic to the character that precedes them.
If there's no base letter for the mark to attach to, there's no consistent behavior. If Skype renders it differently from the other apps you've tried, it is probably using a different font engine.
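You can reproduce the length effect from the question in any language that counts UTF-16 code units; here is a small C++ sketch (the counts mirror the JavaScript .length values above):

#include <iostream>
#include <string>

int main()
{
    // U+0309 (combining hook above) placed after the 'u': it renders as a
    // diacritic on the 'u' but still occupies its own UTF-16 code unit.
    std::u16string plain  = u"mutatedWord";
    std::u16string marked = u"mu\u0309tatedWord";
    std::cout << plain.size()  << '\n';  // 11
    std::cout << marked.size() << '\n';  // 12
}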

Is an EOT character sitting over the terminal prompt an issue?

Warning: you know how they say "there's no such thing as a stupid question"? Well, this one is, or at least I suspect it's really minor, but what the heck, why not ask. Search engines didn't bring me anything remotely useful, though that could be bad search-term-fu.
I recently downloaded sqlite3 onto Ubuntu 10 to start learning SQL commands. I untarred 3.7.12.01 and ran make install.
After creating a test.db with create table test (id), I decided to see what I'd get if I cat it. Just because.
The result is an EOT character (U+0004) sitting right over my prompt. Illustrated screenshot: http://imgur.com/omfMa
I realise this is not the type of file you would use cat on. I only want to know, before I go further:
does the strange placement of this character signal any future issues when actually playing around with SQL, or some issue with newlines, or fonts (this is monofur set at a large font size), or similar?
I've never seen an output character placed directly over my prompt before.
The character is placed over your prompt because it is a double-width character, and terminals in general are not good at handling double-width characters. It does not mean anything.
There are some control codes which can do very funny things to your terminal, such as changing colors, fonts etc.
But none of them does any real harm - you should be able to reset your terminal to a healthy state, or close it and open a new one.
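For example, the reset utility (part of ncurses) reinitializes the terminal:

reset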

Does VS C++ Character Set compiler setting influence character encoding?

Are there any direct consequences of toggling between Unicode, MBCS, and Not Set for the VS C++ compiler settings Configuration Properties->General->Character Set, apart from the setting of the _UNICODE, _MBCS and _T macros (which then, of course, indirectly has consequences through the generic text mappings for string functions)?
I am not expecting it to, but since the documentation says "Tells the compiler to use the specified character set", I'd like to be certain that, specifically, it doesn't have any influence on how literal non-ASCII text put into strings or wstrings is encoded. (I am aware that non-ASCII literals in the source are not portable, but I am maintaining a solution where this is used heavily.)
Thanks in advance.
No, it only affects the macro definitions, which in turn can have wide-ranging effects on anything from <tchar.h> to the Windows T string pointer types (LPTSTR etc.).
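For instance, the same source line compiles to wide or narrow calls depending on the setting; a minimal sketch (MessageBox is just a convenient example of a T-mapped API):

#include <windows.h>
#include <tchar.h>

// With Character Set = Unicode, _UNICODE/UNICODE are defined: TCHAR is
// wchar_t, _T("...") becomes L"...", and MessageBox expands to MessageBoxW.
// With MBCS or Not Set, TCHAR is char and the narrow variants are used.
int main()
{
    LPCTSTR text = _T("hello");
    MessageBox(nullptr, text, _T("demo"), MB_OK);
    return 0;
}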
If you use any non-ASCII codes in your string literals then you depend heavily on the way the compiler decodes the text in your source code file. The default encoding it assumes is your system code page as configured in Control Panel + Regional and Language options. This will not work well when your source code file ever strays too far from your machine. Saving the source as UTF-8 with a BOM is wise so this is never a problem. In the IDE that's set with Save As: click the arrow on the Save button, choose "Save with Encoding", and pick 65001. Support for UTF-8 encoded source files is spotty in older versions of C++ compilers.
For unadorned strings, C++ follows C: it's ASCII. If you wrap them with anything, the game changes.
C++0x (now C++11) standardises Unicode string literals, with the UTF encodings in particular. This is a new feature.
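For reference, the new literal forms look like this under C++11 (note that since C++20 the u8 literal instead yields const char8_t*):

const char*     s8  = u8"caf\u00E9";  // UTF-8 encoded
const char16_t* s16 = u"caf\u00E9";   // UTF-16 encoded
const char32_t* s32 = U"caf\u00E9";   // UTF-32 encoded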
