Which Letter takes up the most EM (globally)? - styling

I was reading up on changing placeholder text when I stumbled across this question.
I went back and learnt about placeholders, anyway. And one SO answer said something along the lines of:
Be careful when designing your placeholder text, since anything outside of the control will be cut off.
Putting these two answers together made me think (yes, I know, bad thing to do!):
What is the longest letter in EM in Global (language) terms?
(since we are meant to size letters in EM and all).
The longest in the English Alphabet is 'W' apparently (from linked Question) - so in terms of global languages, what is?
If I had a control like:
+------------------------+
|123456789101112131415161|
+------------------------+
where the placeholder was 24 numbers long, how can I ensure they all fit?
Since numbers seem to be the same EM width:
11111
22222
33333
44444
55555
66666
77777
88888
99999
How can I ensure that 24 characters, no matter what length/EM width will fit?
I could just go:
+------------------------+
|WWWWWWWWWWWWWWWWWWWWWWWW|
+------------------------+
But what if there is a wider letter used from another language? How can I ensure that the placeholder text can be read? (without resizing the input itself dynamically)? I literally want the minimum width it would have to be to display 24 characters, no more - no matter what language is placed in the field.
Here's an example of an even longer 'letter' than English's W:
WWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWW
ŒŒŒŒŒŒŒŒŒŒŒŒŒŒŒŒŒŒŒŒŒŒŒŒŒŒŒŒŒŒŒŒŒŒŒŒŒŒŒŒ
ÆÆÆÆÆÆÆÆÆÆÆÆÆÆÆÆÆÆÆÆÆÆÆÆÆÆÆÆÆÆÆÆÆÆÆÆÆÆÆÆ
EDIT
I know how I would test (as above), but I don't know the 'contenders' for the widest character in the world.

determine your font (eg. Arial)
determine your font-size (eg. 10px)
determine your font-weight (eg. bold)
determine your character-set (eg. UTF-8)
print a couple of same characters (eg. 24) per row for each character of the character set
divide and conquer
-> remove rows that are obviously shorter than others and refresh the page
-> repeat removal as long as there is more than one row on the page (for equally long rows just pick any one)

Print each character enclosed in a span and then look up the calculated width for that character (either in your browser's devtools, or automate this with a simple JavaScript script that finds the widest of your characters).
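The comparison step can be sketched independently of how the widths are obtained. This is a minimal sketch in Python rather than JavaScript; `widest_characters`, `min_field_width` and `sample_measure` are hypothetical names, and the widths in `SAMPLE_WIDTHS` are invented placeholders, not measurements of any real font. In practice the measuring function would read the rendered width of each span for your chosen font, size and weight.

```python
from typing import Callable, Iterable, List, Tuple


def widest_characters(chars: Iterable[str],
                      measure: Callable[[str], int],
                      top: int = 3) -> List[Tuple[str, int]]:
    """Return the `top` widest characters with their measured widths."""
    ranked = sorted(chars, key=measure, reverse=True)
    return [(ch, measure(ch)) for ch in ranked[:top]]


def min_field_width(chars: Iterable[str],
                    measure: Callable[[str], int],
                    count: int) -> int:
    """Width needed to show `count` copies of the widest candidate."""
    return count * max(measure(ch) for ch in chars)


# Placeholder widths in font units (assuming 1000 units per em).
# These numbers are made up for illustration only.
SAMPLE_WIDTHS = {"i": 200, "1": 550, "W": 950, "Œ": 1000}


def sample_measure(ch: str) -> int:
    return SAMPLE_WIDTHS[ch]


if __name__ == "__main__":
    print(widest_characters(SAMPLE_WIDTHS, sample_measure, top=2))
    print(min_field_width(SAMPLE_WIDTHS, sample_measure, count=24))
```

The result only holds for the set of candidate characters you actually measured and for that one font, so a fallback font can still change the answer.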

Related

Vim: Utf-8 ې character breaks displayed string

I have a file with the hex content db90 3031 46, which should be displayed in Vim as "ې" followed by "01F", but it is never displayed correctly. The same happens in other places, such as the terminal and the browser: I always get ې01F. Why is that? Paste it into Google and try it yourself; you will never be able to put "ې" with 0 as the next character.
That's an Arabic character with right-to-left indicator, so you probably need to switch back to left-to-right mode, such as with U+200e.
The Unicode bidirectional stuff is rather complex - the behaviour you are seeing is probably caused by the fact that the Latin digits are marked EN = European number (a weak type), while letters such as F are marked L = left to right (a strong type).
Weak types are treated differently in the Unicode specification, such as with this quote which covers your particular case (my emphasis):
Problematic cases may occur when a right-to-left paragraph begins with left-to-right characters, or there are nested segments of different-direction text, or there are weak characters on directional boundaries. In these cases, embeddings or directional marks may be required to get the right display.
So your code point followed by a digit renders as "ې7" (I typed that 7 in after the Arabic character despite the fact it's showing up before it), while following it with a letter gives "ېX".
For what it's worth, the text "ې‎7" was generated here by inserting the character reference &#x200e; between the two characters, the HTML equivalent of the U+200e Unicode code point.
If you head on over to this UTF-8 codec site and enter %u06D0%u200e7 into the decoding section, you'll see that it comes out in your desired order (removing the %u200e shows it in the order you're describing in your question).
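The bidirectional classes mentioned above can be inspected directly. The following is a small sketch using Python's standard `unicodedata` module; the helper name `bidi_classes` is just for illustration. It decodes the bytes from the question, lists the class of each code point, and builds a variant with U+200E inserted after the Arabic letter.

```python
import unicodedata

# The bytes from the question: db90 is the UTF-8 encoding of U+06D0.
raw = bytes.fromhex("db90303146")
text = raw.decode("utf-8")


def bidi_classes(s: str):
    """Pair each code point with its Unicode bidirectional class."""
    return [(f"U+{ord(ch):04X}", unicodedata.bidirectional(ch)) for ch in s]


LRM = "\u200e"  # LEFT-TO-RIGHT MARK, a strong left-to-right character

# Insert the mark between the Arabic letter and the digits that follow it.
fixed = text[0] + LRM + text[1:]

if __name__ == "__main__":
    for code_point, bidi_class in bidi_classes(text):
        print(code_point, bidi_class)
    print([f"U+{ord(ch):04X}" for ch in fixed])
```

The Arabic letter reports class AL, the digits EN (weak) and the letter F reports L (strong), which matches the explanation of why the digits are reordered and the letter is not.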

What exactly does IDWriteTextFormat1::SetLastLineWrapping do?

Documentation for IDWriteTextFormat1::SetLastLineWrapping() is insufficient:
Sets the wrapping mode of the last line. If [the single BOOL parameter is] set to FALSE, the last line is not wrapped. If set to TRUE, the last line is wrapped.
For IDWriteTextLayout2::SetLastLineWrapping() it is equally terse:
Set whether or not the last word on the last line is wrapped.
Some details lacking that I want to know:
Which one is the last line? In my tests, sometimes it is the last visible line that gets the extra word, but sometimes it is the next one, and then more lines follow. (Is it a bug in DirectWrite? In my code?) Here "visible" means (for horizontal lines, in the vertical direction) completely inside the layout rectangle.
How does it interact with IDWriteTextFormat::SetTrimming()? Some tests suggest that behaviour is different when trimming is set. (With SetTrimming(&DWRITE_TRIMMING{DWRITE_TRIMMING_GRANULARITY_CHARACTER,0,0}, nullptr);.)
It was intended for rectangular tiles where the last word of the application name would otherwise be hidden because it got wrapped to the next line. By default, this would yield:
"here is an example of a long application name" ->
<----------->.
here is an  /|\
example of a |
long        \|/
application
name
So if the tile height was short (say only 3 lines), then none of the word "application" would be visible, but it's more useful to also show at least part of the next word.
<----------->.
here is an  /|\
example of a |
long applica\|/
name
In conjunction with ellipsis character trimming, it looks like:
It always means the last visible line whether trimming is enabled or not, but when trimming is enabled, any partial lines are trimmed out, making the last "visible" the last untrimmed line. So that's the difference you're seeing.

Usable Unicode Ranges for Custom Text Process

I am working on a processor that splits text into blocks with marks:
LOREM IPSUM SED AMED
will be parsed like:
{word:1}LOREM{/word:1}{space:2}
{word:3}IPSUM{/word:3}{space:4}
{word:5}SED{/word:5}{space:6}
{word:7}AMED{/word:7}
But I don't want to use "{word}" etc., because it slows the processor down: the tag is itself a string that has to be parsed again... I need marks like these:
\E002\0001 LOREM \E003\0001 \E004\0002
\E002\0003 IPSUM \E003\0004 \E004\0005
\E002\0006 SED \E003\0006 \E004\0007
\E002\0008 AMED \E003\0008
The first code, \E002, is the element type number; its last bit marks the element's close, so the element type number increments by 2.
The second code, \0001, is the element index, used for stacking.
I just picked \E002 arbitrarily for this example.
But \0001 is also a regular Unicode code point, which leads me back to where I started...
So which Unicode range can I use? \ff0000? Or how else can I solve this?
Thanks!
The Unicode Consortium thought of this. There is a range of Unicode code points that are meant to never represent a displayable character, but meta-codes instead:
Noncharacters are code points that are permanently reserved and will never have characters
assigned to them.
...
Tag characters were intended to support a general scheme for the internal tagging of text
streams in the absence of other mechanisms, such as markup languages. The use of tag
characters for language tagging is deprecated.
(http://www.unicode.org/versions/Unicode9.0.0/ch23.pdf)
You should be able to use regular control characters as "private" tags, because these should never occur in proper strings. This would be the range from U+0000 to U+001F, excluding tab (U+0009), the common "returns" (U+000A and U+000D), and, for safety, U+0000 itself (some libraries do not like Null characters in the middle of strings).
Non-characters
Noncharacters are code points that are permanently reserved in the Unicode Standard for
internal use. They are not recommended for use in open interchange of Unicode text data.
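You can use U+FFFE and U+FFFF, which are officially defined as noncharacters. (U+FEFF, their byte-swapped neighbour, is not a noncharacter: it is the byte order mark, a format character that some tools strip or interpret.) There are several more "officially not-a-characters" defined, and you can be fairly sure they would not occur in regular text strings.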
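The set of noncharacters is fixed by the standard: U+FDD0..U+FDEF plus the last two code points of every plane (U+FFFE, U+FFFF, U+1FFFE, U+1FFFF, and so on up to U+10FFFF), 66 in total. A minimal Python sketch of a membership test follows; `is_noncharacter` is a hypothetical helper name, not a library function.

```python
import unicodedata


def is_noncharacter(code_point: int) -> bool:
    """True for the 66 code points Unicode reserves as noncharacters."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("not a Unicode code point")
    in_fdd0_block = 0xFDD0 <= code_point <= 0xFDEF
    # The last two code points of each plane end in FFFE or FFFF.
    plane_end = (code_point & 0xFFFE) == 0xFFFE
    return in_fdd0_block or plane_end


if __name__ == "__main__":
    for cp in (0xFFFE, 0xFFFF, 0xFEFF, 0xFDD0, 0xE000, 0x2028):
        print(f"U+{cp:04X}", is_noncharacter(cp),
              unicodedata.category(chr(cp)))
```

Printing the general category alongside shows how the other candidates differ: private-use code points report Co, U+2028 reports Zl, and U+FEFF reports Cf (a format character).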
A few random sequences with predefined definitions, and so highly unlikely to occur in plain text strings are:
Specials: U+FFF0–U+FFF8
The nine unassigned Unicode code points in the range U+FFF0..U+FFF8 are reserved for
special character definitions.
Annotation Characters: U+FFF9–U+FFFB
An interlinear annotation consists of annotating text that is related to a sequence of annotated
characters. For all regular editing and text-processing algorithms, the annotated characters
are treated as part of the text stream. The annotating text is also part of the content,
but for all or some text processing, it does not form part of the main text stream.
Tag Characters: U+E0000–U+E007F
This block encodes a set of 95 special-use tag characters to enable the spelling out of ASCII-based
string tags using characters that can be strictly separated from ordinary text content
characters in Unicode.
(all quotations from the chapter as above)
Staying within conventions, you can also use U+2028 (line separator) and/or U+2029 (paragraph separator).
Technically, your use of U+E000–U+F8FF (the "Private Use Area") is okay-ish, because these code points only can define an unambiguous character in combination with a certain font. However, it is possible these codes may pop up if you get your plain text from a source where the font was included.
As for how to encode this into your strings: it doesn't really matter if the numerical code immediately following your private tag marker is a valid Unicode character or not. If you see one of your own tag markers, then the value immediately following is always your own private sequence number.
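One way to picture this is a marker character followed by one character that carries the sequence number. The sketch below is only an illustration under assumptions: the three Private Use Area markers and the names `mark` and `unmark` are made up for this example, and the whitespace itself is kept in the stream after its tag so the original text can be restored.

```python
import re

# Hypothetical private markers taken from the Private Use Area.
WORD_OPEN, WORD_CLOSE, SPACE = "\ue002", "\ue003", "\ue004"
MARKERS = {WORD_OPEN, WORD_CLOSE, SPACE}


def mark(text: str) -> str:
    """Wrap each word and whitespace run in marker + index pairs."""
    out = []
    for index, token in enumerate(re.findall(r"\S+|\s+", text), start=1):
        # The index is stored as a single character. It need not be a
        # meaningful character, but indexes in 0xD800..0xDFFF would
        # produce lone surrogates that cannot be encoded as UTF-8.
        tag = chr(index)
        if token.isspace():
            out.append(SPACE + tag + token)
        else:
            out.append(WORD_OPEN + tag + token + WORD_CLOSE + tag)
    return "".join(out)


def unmark(marked: str) -> str:
    """Drop every marker together with the index character after it."""
    out = []
    chars = iter(marked)
    for ch in chars:
        if ch in MARKERS:
            next(chars, None)  # the value after a marker is always an index
        else:
            out.append(ch)
    return "".join(out)


if __name__ == "__main__":
    marked = mark("LOREM IPSUM SED AMED")
    print([f"{ord(ch):04X}" for ch in marked[:8]])
    print(unmark(marked))
```

Because the reader always consumes the character after a marker as an index, it does not matter that values such as U+0001 are also ordinary code points elsewhere in the text.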
As you see, there are lots of possibilities. I guess the most important criterion is whether you want to use other functions on these strings. If you create a string that is technically invalid Unicode (for instance, because it includes not-a-character values), some external functions may fail to work on it, or silently remove the bad values. In such a case, you'd need to rigorously stick to a system in which you only use 'valid' code points.

Change the next N characters in VIM

Say I have the following line:
|add_test() (| == cursor position)
And want to replace the 'add' with a 'del'.
del|_test()
I can either press x three times and then press i to insert and type del.
What I want is something like 3c or 3r to overwrite just 3 characters.
Both of these don't do what I want, 3c overwrites 3 characters with the same
character, 3r does several other things.
Is there an easy way to do this without manually pressing x and inserting the text?
3s, "substitute 3 characters" is the same as c3l. 3cl and c3l should be the same, I don't know why you'd see the same character repeated. I'm also a fan of using t, e.g. ct_ as another poster mentioned, then I don't have to count characters and can just type "del".
I struggled with the "replace a couple of characters" for a few days too; 'r' was great for single characters, R was great for a string of matching length, but I wanted something like the OP is asking for. So, I typed :help x and read for a while, it turns out that the description of s and S are just a couple of pages down from x.
In other words, :help is your friend. Read and learn.
Use c{motion} command:
cf_ - change up to the first '_' (including);
ct_ - change up to the first '_' (excluding);
cw - change the first word;
The word is determined by the iskeyword option. Check it with :set iskeyword? and remove any '_', like this: :set iskeyword=#,48-57,192-255.
By the way see :help c and :help motion if you want more.
I think 3cl is what you want. It changes 3 characters to the right. I'd type ct_del<esc>, but that's not what you asked.
c3  ('c', '3', space), then type the characters you want to insert. (Or you can use right-arrow or l rather than space.)
Or, as @Mike just said in a comment, R works nicely if the number of characters happens to match the number of characters you're deleting.
Or ct_ to change from the cursor to the next _ character.
Or, as @bloody suggests in a comment, 3s.
If the words have the same length you can use the R command, which replaces what you had previously with what you type.
The other answers given use numbers. When the text is longer it's easier to not have to count. For example I often make headlines in markdown files like:
Some super duper long title that I don't want to have to count
Duplicate the line with yy p
Some super duper long title that I don't want to have to count
Some super duper long title that I don't want to have to count
Highlight the whole line with V
then use r{char} or in this case r= to get:
Some super duper long title that I don't want to have to count
==============================================================
(I added a space above to trip stack overflow's markdown formatting)

How to avoid certain formatting with PAR?

PAR does, in my opinion, much better formatting than Vim's default formatter.
But sometimes PAR doesn't work very well.
For example:
this is a test this is a test this is a test.
this is my text this is my text this is my text.
After formatting with par 44 this becomes:
this is a test this is a test this is a t.
this is tes my text this is my text this t.
this is is my tex t.
Is there a way to avoid this kind of formatting?
Par is very powerful, and complex. I don't know 10% of its capacity, but would
be worth spending an entire week just to master it.
What's happening with your text is related to the last characters in each line.
As you can see, every line in:
this is a test this is a test this is a test.
this is my text this is my text this is my text.
Ends with "t.". The Par manual says in the DESCRIPTION section:
Each output paragraph is generated from the corresponding input paragraph
as follows:
1) An optional prefix and/or suffix is removed from each input line.
2) The remainder is divided into words (separated by spaces).
3) The words are joined into lines to make an eye-pleasing paragraph.
4) The prefixes and suffixes are reattached.
Probably [1] Par is guessing that "t." is a suffix, and thus removing
it in step 1. After everything is formatted, Par puts the "t." back and aligns
it.
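To see why "t." is a plausible guess, here is a minimal Python sketch that computes the longest string shared by the end of every input line. This only illustrates the idea of a common suffix; it is not Par's actual algorithm, whose rules are more involved, and `common_suffix` is a hypothetical helper name.

```python
from typing import List


def common_suffix(lines: List[str]) -> str:
    """Longest string that every line ends with."""
    if not lines:
        return ""
    shortest = min(len(line) for line in lines)
    n = 0
    # Walk backwards from the end while all lines agree on the character.
    while n < shortest and len({line[-1 - n] for line in lines}) == 1:
        n += 1
    return lines[0][len(lines[0]) - n:]


if __name__ == "__main__":
    paragraph = [
        "this is a test this is a test this is a test.",
        "this is my text this is my text this is my text.",
    ]
    print(repr(common_suffix(paragraph)))
```

For the two lines from the question the shared ending is "t.", because "test." and "text." agree on the final two characters and differ on the third from the end.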
To solve this, pass the s option with a value of 0. This way suffixes will be
disabled.
:%!par s0w44
[1] I'm saying probably because I'm not completely sure of that. As I
said earlier, I'm not a master; maybe there is something else involved.

Resources