Why are some characters like "○" overlapped? - string

I stumbled upon an interesting character while skyping. I was initially interested to send an empty message but, since Skype has some validation rules, it doesn't allow empty messages.
But, with a few ALT+NUMPAD ASCII symbols I saw that for alt+777 it returns ○ on most editors and ̉ on skype, which delivers an empty-like message and this is where curiosity got to me. I started to abuse this symbol and noticed that it can overlap other characters. So if I write the ASCII symbol on a word, the result would be a ̉mủt̉ả̉̉̉t̉ẻd̉W̉ỏ̉r̉̉d̉̉.
You can see for yourself that "̉mủt̉ả̉̉̉t̉ẻd̉W̉ỏ̉r̉̉d̉̉".length == 24 while "mutatedWord".length == 11.
And the funny thing is that, up until now, I could only generate that weird aphostrophy on skype, anywhere else it results in ○.
Can someone explain this behavior to me?

You're writing Unicode codepoint 777 (but usually notated in hex as U+0307). This codepoint is one of Unicode's combinbing characters, namely "combining hook above". You've already discovered what they do: they add a diacritic to the next character.
If there's no letter following, there's no consisitent behavior. If Skype renders it differently from the other apps you've tried, it probably is using a different font engine.

Related

Unicode character order problem when text is displayed

I am working on an application that converts text into some other characters of the extended ASCII character set that gets displayed in custom font.
The program operation essentially parses the input string using regex, locates the standard characters and outputs them as converted before returning a string with the modified text which displays correctly when viewed with the correct font.
Every now and again, the function returns a string where the characters are displayed in the wrong order, almost like they are corrupted or some data is missing from the Unicode double width spacing. I have examined the binary output, the hex data, and inspected the data in the function before i return it and everything looks ok, but every once in a while something goes wrong and cant quite put my finger on it.
To see an example of what i mean when i say the order is weird, just take a look at the following piece of converted text output from the program and try to highlight it with your mouse. You will see that it doesn't highlight in the order you expect despite how it appears.
Has anyone seen anything like this before and have they any ideas as to what is going on?
ך┼♫יἯ╡П♪דἰ
You are mixing various Unicode characters with different LTR/RTL characteristics.
LTR means "left-to-right" and is the direction that English (and many other western language) text is written.
RTL is "right-to-left" and is used mostly by Arabic and Hebrew (as well as several other scripts).
By default when rendering Unicode text the engine will try to use the directionality of the characters to figure out what direction a given part of the code should go. And normally that works just fine because Hebrew words will have only Hebrew letters and English words will only use letters from the Latin alphabet, so for each chunk there's a easily guessable direction that makes sense.
But you are mixing letters from different scripts and with different directionality.
For example ך is U+05DA HEBREW LETTER FINAL KAF, but you also use two other Hebrew characters. You can use something like this page to list the Unicode characters you used.
You can either
not use "wrong" directionality letters or
make the direction explict using a Left-to-right mark character.
Edit: Last but not least: I just realized that you said "custom font": if you expect displaying with a specific custom font, then you should really be using one of the private use areas in Unicode: they are explicitly reserved for private use like this (i.e. where the characters don't match the publicly defined glyphs for the codepoints). That would also avoid surprises like the ones you get, where some of the used characters have different rendering properties.

PL/I character set and IBM Personal Communications - wrong characters are displayed

Some characters that I enter in editor displayed not identical to those on keyboard. So I have error messages like this:
Character with decimal value 176 does not belong to the PL/I character
set. It will be ignored.
when trying to compile PL/I programm.
Sometimes character can displayed even properly, but I still have similar error message.
Examples of this characters are character that represents logical OR, logical NOT.
How to solve this problem? Is it a settings of editor, or settings of program IBM Personal Communications? Or may be it is better to enter 16-code of those symbols (how to do that if possible, and how to determine what code I need)?
There is a lot of places where this can go wrong...
The keyboard-driver on your client machine has to be configured correctly for the keyboard you use. But if other programs work correctly and only the mainframe emulation behaves strangely then this should be OK.
The PCOMM-session has to be configured to use the correct Host-codepage. Ask your mainframe technical guys what is used and configure your terminal emulation accordingly. Since we don't use PCOMM I can't help you with this, you will have to look around the session settings a bit.
In PL/I most characters are taken from the range that is identical in most EBCDIC codepages. The main exceptions are the characters for the OR- and NOT-operators which may differ. IBM-default for OR is '4F'X, which is a pipe-character '|' in codepage 1140 (English), but an exclamation mark '!' in codepage 1141 (German). Default for NOT is '5F'X which is a logical NOT-sign '¬' in 1140 but a caret '^' in 1141.
Since these problems are well known the compiler offers two options OR() and NOT() to set the characters to be used for these operators. So you might look in your compile-listing whether these parameters are set in your installation and what their values are since these are the characters you have to use.

Underlining text posted to slack

I am trying to see if there is a way to underline a text posted to slack. I am using webhook for posting messages to slack.
You can approximate it with Unicode’s COMBINING LOW LINE character: http://www.fileformat.info/info/unicode/char/0332/index.htm . Before posting, split your string along grapheme boundaries and insert a COMBINING LOW LINE after each. This sort of works, but with Slack’s default font the underline sometimes splits visually between characters. It’s enough though to give an impression, which might be what you want if, for example, you’re trying to give an example of the position of a link within a piece of text.
I don't think this can be done. See https://api.slack.com/docs/formatting for the available message formatting options.

How to convert "binary text" to "visible text"?

I have a text file full of non-ASCII characters.
I can not detect the encoding by either file or enca.
file non_ascii.txt
non_ascii.txt: Non-ISO extended-ASCII text
enca non_ascii.txt
Unrecognized encoding
But I can open it normally in Windows Notepad++
Edit: The expression above leads misunderstanding. Sorry for this.
In fact, I picked some parts of the original file and put them into new text file, then opened in notepad++.
The 2 parts shows as below. They are decoded in 2 different ways by notepad++.
Question:
How could I detect the files encoding under linux?
how do I recover the characters represented by <F1><EE><E9><E4><FF>?
I couldn't get result by "grep 'сойдя' win.txt" even though the "сойдя" is encoded into <F1><EE><E9><E4><FF>?
The file content slice as follows:
less non_ascii.txt
"non_ascii.txt" may be a binary file. See it anyway?
<F1><EE><E9><E4><FF>
<F2><F0><E0><EA><F2><EE><E2><E0><F2><FC><F1><FF>
<D0><F2><E9><E4><D7><E9><E7><E1><EC><E1><F3><F8>
<D1><E5><EA><F3><ED><E4>
<F0><E0><E7><E3><F0><F3><E7><EA><E8>
<EF><EE><E4><F1><F2><E0><E2><EB><FF><F2><FC>
<F0><E0><E7><E3><F0><F3><E7><EA><E5>
<F1><EE><E9><E4><F3>
<F0><E0><E7><E3><F0><F3><E7><EA><E0>
<F1><EE><E2><EB><E0><E4><E0><EB><E8>
<C1><D7><E9><E1><F0><EF><FE><F4><E1>
<CB><C1><D3><D3><C9><D4><C5><D2><C9><D4>
<F1><EE><E2><EB><E0><E4><E0><EB><EE>
<F1><EE><E9><E4><E8>
<F1><EE><E2><EB><E0><E4><E0><EB><E0>
Your question really has two parts: (1) how do I identify an unknown encoding and (2) how do I convert that to something useful?
The first part is the real challenge, and really cannot be answered in universal terms -- in the general case, there is no reliable way to identify an unknown 8-bit encoding. Some encodings give you good hints (UTF-8 is an excellent example) and in many cases, if you have a good idea what the text is supposed to represent, the problem can be solved.
A mapping of 8-bit character meanings can be helpful (cough, the link is to mine) and in this case quickly hints at Windows code page 1251. Kudos for the hex dumps and the picture with the representation you expect!
With that out of the way, converting is easy.
iconv -f cp1251 -t utf-8 non_ascii.txt >utf8.txt
Provided your Linux system is set up to use UTF-8 at the terminal, your grep command should work on utf-8.txt now.
The indication that some of the text is "ANSI" (which is a bogus term anyway) is probably just a red herring -- as far as I can tell, everything in your excerpt looks like well-formed CP1251.
Some tools like chardet do a reasonable job of at least steering you in the right direction, though you have to understand that, like a human expert, they have to guess what the text is supposed to represent. There are corner cases where they just don't have enough information to guess correctly, either because there are several candidate encodings with very few differences (for example, Latin-1 vs Latin-9 vs Windows-1252, all of which also overlap with plain 7-bit US-ASCII in the first 128 positions) or because the input doesn't contain enough information to establish any common patterns.

Troubles with text encoding

I'm having some troubles with text encoding. Parsing a website gives me a Data.Text string
"Project - Fran\195\167ois Dubois",
which I need to write to a file. So I'm using Data.Text.Lazy.Encoding.encodeUtf8 to convert it into a Bytestring. The problem is that this yields garbled output:
"Project - François Dubois".
What am I missing here?
If you have gotten Fran\195\167ois inside your Data.Text, you already have a UTF-8-encoded François.
That's inconvenient because Data.Text[.Lazy] is supposed to be UTF-16 encoded text, and the two code units 195 and 167 are interpreted as the unicode code points 195 resp. 167 which are 'Ã' resp. '§'. If you UTF-8-encode the text, these are converted to the byte sequences c383 ([195,131]) resp c2a7 ([194,167]).
The most likely way for getting into this situation is that the data you got from the website was UTF-8 encoded, but was interpreted as ISO-8859-1 (Latin 1) encoded (or another 8-bit encoding; 8859-15 is widespread too).
The proper way of handling it is avoiding the situation altogether [that may not be possible, unfortunately].
If the source of your data states its encoding correctly - as a website should - find out the encoding and interpret the data accordingly. If an incorrect encoding is stated, you are of course out of luck, and if no encoding is specified, you have to guess right (the natural guess nowadays is UTF-8, at least for languages using a variant of the Latin alphabet).
If avoiding the situation is not possible, the easiest ways of fixing it are
replacing the occurrences of the offending sequence with the desired one before encoding:
encodeUtf8 $ replace (pack "Fran\195\167ois") (pack "Fran\231ois") contents
assuming everything else is ASCII or inadvertent UTF-8 too, interpret the Text code units as bytes:
Data.ByteString.Lazy.Char8.pack $ Data.Text.Lazy.unpack contents
The former is more efficient, but becomes inconvenient if there are many different misencodings (caused by different accented letters, for example). The latter works only in the assumed situation (no code units above 255 in the Text) and is rather inefficient for long texts.
I am not completely sure if less can show UTF-8 encoded characters properly. GVim can. You can check this link on SO to find out how you can view UTF-8 data in gVim.
And regarding the other issue of being able to pass this to graphviz, I think you need to set the encoding on the command-line as explained in the Graph NonAscii FAQ.
From what you are explaining, I think there are no issues with how the data is being persisted. If you pass the encoding properly to graphviz, I think your problem will be resolved.
P.S: Creating an answer since it is easier to create descriptive links

Resources