Unicode character order problem when text is displayed

Unicode character order problem when text is displayed - text

I am working on an application that converts text into some other characters of the extended ASCII character set that gets displayed in custom font.
The program operation essentially parses the input string using regex, locates the standard characters and outputs them as converted before returning a string with the modified text which displays correctly when viewed with the correct font.
Every now and again, the function returns a string where the characters are displayed in the wrong order, almost like they are corrupted or some data is missing from the Unicode double width spacing. I have examined the binary output, the hex data, and inspected the data in the function before i return it and everything looks ok, but every once in a while something goes wrong and cant quite put my finger on it.
To see an example of what i mean when i say the order is weird, just take a look at the following piece of converted text output from the program and try to highlight it with your mouse. You will see that it doesn't highlight in the order you expect despite how it appears.
Has anyone seen anything like this before and have they any ideas as to what is going on?
ך┼♫יἯ╡П♪דἰ

You are mixing various Unicode characters with different LTR/RTL characteristics.
LTR means "left-to-right" and is the direction that English (and many other western language) text is written.
RTL is "right-to-left" and is used mostly by Arabic and Hebrew (as well as several other scripts).
By default when rendering Unicode text the engine will try to use the directionality of the characters to figure out what direction a given part of the code should go. And normally that works just fine because Hebrew words will have only Hebrew letters and English words will only use letters from the Latin alphabet, so for each chunk there's a easily guessable direction that makes sense.
But you are mixing letters from different scripts and with different directionality.
For example ך is U+05DA HEBREW LETTER FINAL KAF, but you also use two other Hebrew characters. You can use something like this page to list the Unicode characters you used.
You can either
not use "wrong" directionality letters or
make the direction explict using a Left-to-right mark character.
Edit: Last but not least: I just realized that you said "custom font": if you expect displaying with a specific custom font, then you should really be using one of the private use areas in Unicode: they are explicitly reserved for private use like this (i.e. where the characters don't match the publicly defined glyphs for the codepoints). That would also avoid surprises like the ones you get, where some of the used characters have different rendering properties.

Related

The structure of Arabic letters in Unicode

I got two different "versions" of Arabic letters on Wikipedia. The first example seems to be 3 sub-components in one:
"ـمـ".split('').map(x => x.codePointAt(0).toString(16))
[ '640', '645', '640' ]
Finding this "m medial" letter on this page gives me this:
ﻤ
fee4
The code points 640 and 645 are the "Arabic tatwheel" ـ and "Arabic letter meem" م. What the heck? How does this work? I don't see anywhere in the information so far on Unicode Arabic how these glyphs are "composed". Why is it composed from these parts? Is there a pattern for the structure of all glyphs? (All the glyphs on the first Wikipedia page are similar, but the second they are one code point). Where do I find information on how to parse out the characters effectively in Arabic (or any other language for that matter)?

Arabic is a script with cursive joining; the shape of the letters changes depending on whether they occur initially, medially, or finally within a word. Sometimes you may want to display these contextual forms in isolation, for example to simply show what they look like.
The recommended way to go about this is by using special join-causing characters for the letters to connect to. One of these is the tatweel (also called kashida), which is essentially a short line segment with “glue” at each end. So if you surround the letter م with a tatweel character on both sides, the text renderer automatically selects its medial form as if it occured in the middle of a word (ـمـ). The underlying character code of the م doesn’t change, only its visible glyph.
However, for historical reasons Unicode also contains a large set of so-called presentation forms for Arabic. These represent those same contextual letter shapes, but as separate character codes that do not change depending on their surroundings; putting the “isolated” presentation form of م between two tatweels does not affect its appearance, for instance: ـﻡـ
It is not recommended to use these presentation forms for actually writing Arabic. They exist solely for compatibility with old legacy encodings and aren’t needed for correctly typesetting Arabic text. Wikipedia just used them for demonstration purposes and to show off that they exist, I presume. If you encounter presentation forms, you can usually apply Unicode normalisation (NFKD or NFKC) to the string to get the underlying base letters. See the Unicode FAQ on presentation forms for more information.

What kind of sign is "‎" and what is it used for

What kind of sign is "‎" and what is it used for (note there is a invisible sign there)?
I have searched through all my documents and found a lot of them. They messed upp my htaccess file. I think I got them when I copied webadresses from google to redirect. So maybe a warning searching through your documents for this one also :)

It is U+200E LEFT-TO-RIGHT MARK. (A quick way to check out such things is to copy a string containing the character and paste it in the writing area in my Full Unicode input utility, then click on the “Show U+” button there, and use Fileformat.Info character search to check out the name and other properties of the character, on the basis of its U+... number.)
The LEFT-TO-RIGHT MARK sets the writing direction of directionally neutral characters. It does not affect e.g. English or Arabic words, but it may mess up text that contains parentheses for example – though for text in English, there should be no confusion in this sense.
But, of course, when text is processed programmatically, as when a web server processes a .htaccess file, they are character data and make a big difference.

String wired to Case Structure always goes to Default - LabVIEW

I am trying to take the readout from an instrument and if the readout matches certain strings (from the instrument programming manual) I want to set an indicator to a specific value, different for each possible string. A case structure seems like the best option, with all the possible readouts as the cases. I did this and added "" as the default case to send out a value for the no-match case. The trouble is that if I wire the readout string to the case structure it always executes the default case no matter what the readout (and yes, before anyone asks, I verified that the readout strings match my cases exactly). To check that the case structure was working I wired a constant to the case structure and it works fine, even when I copy and paste the value from the readout string to the constant. Also, I made sure that case insensitive matching was selected so that's not the issue. Anyone have an idea why this is happening? I can post sample VI's if necessary.

Found the problem. Converted the strings into byte arrays and looked at the ascii values. Apparently one had a new line character at the end even though there was no new line on the indicator. Fixed comparison by trimming white space on strings. Look out for that.

To check exactly what's in your string you can wire it to an indicator, right-click on that indicator and choose '\'Codes Display. This will then show codes such as \n for newline, \00 for ASCII 0, \FF for ASCII 255, etc.

Why pointer show unicode string reversed?

I have problems with unicode strings. My pointer to a string in farsi (saved as Unicode, codepage 1200) return the string reversed. Why? I know that farsi is a right-to-left language, but this is a C/C++ matter. My pointer to a string should point to the start of secuence as is stored in file.
I'm using VC++2005, standard console app.
Any help will be welcomed, I have attached screenshot and sample project.
test project
screen capture
Regards,
Juan

If the order is reversed in VC++2005, then probably it just does not handle directionality right, i.e. it displays Arabic letters left to right instead of properly obeying their inherent directionality. Such things happen in many editors and development tools. It does not as such affect the behavior of applications.

How to convert chars (Â or Ã or â€) to ASCII codes while creating word document using C#?

I have string " Single 63”x14” rear window" am parsing this string into HTML and creating a word document applying styles using(System.IO.File.ReadAllText(styleSheet)).
in the document am getting this string as "Single 63â€x14â€ rear window" in C#.
How can I get the correct character to show up in Word?

You would have to find out the incoming encoding of the string
" Single 63”x14” rear window"
And also which encoding the word document allows.
It appears that the encoding characters for those funky quotes are not supported by Word. You could always create a nifty little string parser to search for characters outside the Word encoding range and replace them with either String.Empty, or search for specific supported characters that look similar.
Eg. String.Replace("”","\"");
(this probably wouldn't work without directly manipulating the encoding values, but you haven't provided those so can't give an exact example)

The encoding you are looking at appears to be UTF-8. It's actually probably exactly what you want, you just need to view it using a tool which supports UTF-8, and if you process it and put it on a web page, add the required HTML meta tag so that browsers will display it using the correct encoding.

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string