Why is the ~ sign used before a number? (tilde)

I have seen in many places that a number is presented prefixed with the ~ sign. I couldn't understand why this special character is used before a number.
As an example, every mobile display's pixel density is presented with this character prefixed (~403 ppi density).

The tilde character (~) is often used before a number to indicate that it is an approximation.

Related

Is there a module or regex in Python to convert all fonts to a uniform font? (Text is coming from Twitter)

I'm working with some text from Twitter, using Tweepy. All that is fine, and at the moment I'm just looking to start with some basic frequency counts for words. However, I'm running into an issue where users' ability to use different fonts for their tweets makes some words look like unique words, when in reality they're words that have already been encountered, just in a different font/size, like in the picture below (those words were counted previously and appear earlier in the spreadsheet).
This messes up the accuracy of the counts. I'm wondering if there's a package or general solution to make all the words a uniform font/size - either while I'm tokenizing it (just by hand, not using a module) or while writing it to the csv (using the csv module). Or any other solutions for this that I may not be considering. Thanks!
You can (mostly) solve your problem by normalising your input, using unicodedata.normalize('NFKC', str).
The KC normalization form (NF stands for Normalization Form; K and C denote a Compatibility decomposition followed by a Canonical composition) first does a "compatibility decomposition" on the text, which replaces Unicode characters representing style variants, and then does a canonical composition on the result, so that ñ, which is converted to an n and a separate combining tilde by the decomposition, is turned back into an ñ, the canonical composite for that character. (If you don't want the recomposition step, use NFKD normalisation.) See Unicode Annex 15 for a more precise description, with examples.
Unicode contains a number of symbols, mostly used for mathematics, which are simply stylistic variations on some letter or digit, or, in some cases, on several letters or digits, such as ¼ or ℆. In particular, this includes commonly-used symbols written in font variants which have particular mathematical or other meanings, such as ℒ (the Laplace transform) and ℚ (the set of rational numbers). Compatibility decomposition will strip out the stylistic information, which reduces those four examples to '1/4', 'c/u', 'L' and 'Q', respectively.
The first published Unicode standard defined a block of letter-like symbols in the Basic Multilingual Plane (BMP). (All of the above examples are drawn from that block.) In Unicode 3.1, complete Latin and Greek alphabets and digits were added in the Mathematical Alphanumeric Symbols block, which includes 13 different font variants of the 52 upper- and lower-case letters of the Roman alphabet, 58 Greek letters in five font variants (some of which could pass for Roman letters, such as 𝝪, which is upsilon, not capital Y), and the 10 digits in five variants (𝟎 𝟙 𝟤 𝟯 𝟺), plus a few loose characters which mathematicians apparently asked for.
None of these should be used outside of mathematical typography, but that's not a constraint which most users of social networks care about. So people compensate for the lack of styled text in Twitter (and elsewhere) by using these Unicode characters, despite the fact that they are not properly rendered on all devices, make life difficult for screen readers, cannot readily be searched, and all the other disadvantages of using hacked typography, such as the issue you are running into. (Some of the rendering problems are also visible in your screenshot.)
Compatibility decomposition can go a long way in resolving the problem, but it also tends to erase information which is really useful. For example, x² and H₂O become just x2 and H2O, which might or might not be what you wanted. But it's probably the best you can do.
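As a minimal sketch (Python assumed, since the question mentions tokenizing and the csv module), NFKC normalization folds the styled variants into their plain equivalents before counting:

```python
import unicodedata

def normalize_token(token: str) -> str:
    """Fold styled Unicode variants (mathematical alphanumerics,
    superscripts, etc.) into their plain compatibility equivalents."""
    return unicodedata.normalize('NFKC', token)

# Mathematical sans-serif bold and double-struck letters fold to ASCII:
assert normalize_token('𝗵𝗲𝗹𝗹𝗼 𝕨𝕠𝕣𝕝𝕕') == 'hello world'
# ...but note the information loss mentioned above:
assert normalize_token('x²') == 'x2'
```

Applying `normalize_token` to every token before counting makes the styled and plain spellings of a word fall into the same bucket.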

What is this crazy German character combination to represent an umlaut?

I was just parsing the following website.
There one finds the text
und wären damit auch
At first, the "ä" looks perfectly fine, but once I inspect it, it turns out that this is not the regular "ä" (represented as ascw 228) but this:
ascw: 97, char: a
ascw: 776, char: ¨
I have never before seen an "ä" represented like this.
How can it happen that a website uses this weird character combination and what might be the benefit from it?
What you don't mention in your question is the encoding used. Quite obviously it is a Unicode-based encoding.
In Unicode, code point U+0308 (776 in decimal) is the combining diaeresis. Out of the letter a and the diaeresis, the German character ä is created.
There are indeed two ways to represent German characters with umlauts (ä in this case). Either as a single code point:
U+00E4 latin small letter A with diaeresis
Or as a sequence of two code points:
U+0061 latin small letter A
U+0308 combining diaeresis
Similarly you would combine two code points for an upper case 'Ä':
U+0041 latin capital letter A
U+0308 combining diaeresis
Unicode generally favours the two-code-point form, as combining sequences require fewer code points to enable a wide range of characters with diacritics. However, for historical reasons (compatibility with legacy character sets), precomposed code points exist for letters with German umlauts and French accents.
The Unicode libraries in most programming languages provide functions to normalize a string, i.e. to either convert all sequences into a single code point where possible, or expand all single code points into the two-code-point sequence. Also see Unicode Normalization Forms.
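A short illustration (Python's `unicodedata` assumed) of the two representations and how normalization converts between them:

```python
import unicodedata

composed   = '\u00E4'    # 'ä' as a single code point (precomposed)
decomposed = 'a\u0308'   # 'a' followed by the combining diaeresis

# They render identically but are different code point sequences:
assert composed != decomposed
assert len(composed) == 1 and len(decomposed) == 2

# Normalization makes them comparable:
assert unicodedata.normalize('NFC', decomposed) == composed   # compose
assert unicodedata.normalize('NFD', composed) == decomposed   # decompose
```

Normalizing all input to one form (usually NFC) before comparing or hashing strings avoids exactly the kind of surprise described in the question.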
Oh my, this was the answer to my original problem with the name of a file upload:
Cannot convert argument 2 to ByteString because the character at index 6 has value 776 which is greater than 255
Posting for future reference.

Why do ANSI color escapes end in 'm' rather than ']'?

ANSI terminal color escapes can be done with \033[...m in most programming languages. (You may need to do \e or \x1b in some languages)
What has always seemed odd to me is how they start with \033[ but end in m. Is there some historical reason for this (perhaps ] was mapped to the slot now occupied by m in the ASCII table?), or is it an arbitrary character choice?
It's not completely arbitrary, but follows a scheme laid out by committees, and documented in ECMA-48 (the same as ISO 6429). Except for the initial Escape character, the succeeding characters are specified by ranges.
While the pair Escape[ is widely used (this is called the control sequence introducer CSI), there are other control sequences (such as Escape], the operating system command OSC). These sequences may have parameters, and a final byte.
In the question, using CSI, the m is a final byte, which happens to tell the terminal what the sequence is supposed to do. The parameters if given are a list of numbers. On the other hand, with OSC, the command-type is at the beginning, and the parameters are less constrained (they might be any string of printable characters).
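The CSI structure described above (semicolon-separated numeric parameters, then a final byte; m selects graphic rendition) can be sketched in Python, an assumption since the question names no specific language:

```python
CSI = '\033['   # ESC followed by '[' is the Control Sequence Introducer

def sgr(*params: int) -> str:
    """Build a Select Graphic Rendition sequence: CSI, a list of
    semicolon-separated numeric parameters, and the final byte 'm'."""
    return CSI + ';'.join(str(p) for p in params) + 'm'

assert sgr(1, 31) == '\033[1;31m'   # bold, red foreground
assert sgr(0) == '\033[0m'          # reset all attributes
print(sgr(1, 31) + 'bold red' + sgr(0))
```

Swapping the final byte changes what the sequence does: the same parameter list ending in H, for example, moves the cursor instead of setting colours.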

What do you call the different types of characters of a password when it is being validated?

I hope this question is not too pedantic, but is there a technical term for the different "categories" that are part of a password when it is being validated? For example default AD password complexity requirements that must be met (Microsoft calls them "categories"):
Passwords must contain characters from three of the following five **categories**:
Uppercase characters of European languages (A through Z, with diacritic marks, Greek and Cyrillic characters)
Lowercase characters of European languages (a through z, sharp-s, with diacritic marks, Greek and Cyrillic characters)
Base 10 digits (0 through 9)
Nonalphanumeric characters: ~!@#$%^&*_-+=`|\(){}[]:;"'<>,.?/
Any Unicode character that is categorized as an alphabetic character but is not uppercase or lowercase. This includes Unicode characters from Asian languages.
Is there a term used by security engineers or cryptographers to refer these "categories"?
There's not any official term for these. I would tend to call it a "character type".
For example, this term is used in Novell's document Creating Password Policies:
The password must contain at least one character from three of the four types of character, uppercase, lowercase, numeric, and special
and similar wording appears in this NIST document regarding Enterprise Password Management.
AFAIK, in 10 years working in security, no final, shared nomenclature has emerged for this. MS's "categories" is a good term and probably the most used, but it is not formally shared across contexts (i.e. Java, PHP, OWASP, Oracle, ... could each call it differently).
Academically speaking, they are only factors that enlarge the base character set of an offline brute-force attack, increase rainbow table creation time, or defeat trivial dictionary attacks. Brute-force complexity is roughly |C|^n, where n is the expected length of the password, C is the character set chosen, and |C| is the number of elements in it.
Having more categories increases the value of |C|, so they should be called something like "password character set subsets" instead of "categories", but you can see why nobody bothers with the theoretical bit here; the nomenclature is unfriendly.
If you look for it and you find the way academics call them, please post it, it is always useful.
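The keyspace growth described above can be sketched with a quick back-of-the-envelope calculation (Python assumed; the 8-character length is an arbitrary illustration):

```python
import string

# Each added category enlarges the base character set C; the keyspace
# for an n-character password grows as |C|**n.
categories = {
    'lowercase only': string.ascii_lowercase,                 # |C| = 26
    '+ uppercase':    string.ascii_letters,                   # |C| = 52
    '+ digits':       string.ascii_letters + string.digits,   # |C| = 62
    '+ punctuation':  string.ascii_letters + string.digits
                      + string.punctuation,                   # |C| = 94
}

n = 8
for name, charset in categories.items():
    size = len(charset) ** n
    print(f'{name:15s} |C| = {len(charset):3d}   |C|^{n} = {size:.3e}')
```

Going from 26 lowercase letters to 94 mixed characters multiplies the 8-character keyspace by a factor of roughly 30,000, which is why validators count categories at all.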

Is there a unicode range that is a copy of the first 128 characters?

I would like to be able to put certain characters into a text without them being interpreted by the computer. So I was wondering: is there a range that is defined as mapping to the same glyphs etc. as the range 0-0x7F (the ASCII range)?
Please note I state that the range 0-0x7F is the same as ASCII, so the question is not what range maps to ASCII.
I am asking whether there is another range that also maps to the same glyphs, i.e. when rendered it will look the same, but when interpreted it can be seen as a different code.
so I can write
print "hello "world""
where the characters shown in bold (in the original post) avoid the 0-0x7F (ASCII) range
Additional:
I was meaning homographic and behaviourally identical; everything the same except a different code point. I was hoping for the whole ASCII/128-character set, directly mapped (a fixed offset added to them all).
The reason: to avoid interpretation by any language that uses some of the ASCII characters as part of its syntax but allows any Unicode character in literal strings, e.g. (when UTF-8 encoded) C, HTML, CSS, …
I was trying to retrofit the idea of "no reserved words" / "word colours" (string literals one colour, keywords another, variables another, numbers another, etc.) so that a string literal or variable name (though not in this case) can contain any character.
I interpret the question to mean "is there a set of code points which are homographic with the low 7-bit ASCII set". The answer is no.
There are some code points which are conventionally rendered homographically (e.g. Cyrillic uppercase А, U+0410, looks identical to ASCII 65 in many fonts, and quite similar in most fonts which support this code point), but they are different code points with different semantics. Similarly, there are some code points which basically render identically but have a specific set of semantics, like the non-breaking space U+00A0, which renders identically to ASCII 32 but is specified as having a particular line-breaking property; or the RIGHT SINGLE QUOTATION MARK U+2019, which is an unambiguous quotation mark, as opposed to its twin ASCII 39, the "apostrophe".
But in summary, there are many symbols in the basic ASCII block which do not coincide with a homograph in another code block. You might be able to find homographs or near-homographs for your sample sentence, though; I would investigate the IPA phonetic symbols and the Greek and Cyrillic blocks.
The answer to the question asked is "No", as @tripleee described, but the following note might be relevant if the purpose is trickery or fun of some kind:
The printable ASCII characters excluding the space have been duplicated at U+FF01 to U+FF5E, but these are fullwidth characters intended for use in CJK texts. Their shape is (and is meant to be) different: ｈｅｌｌｏ ｗｏｒｌｄ. (Your browser may be unable to render them.) So they are not really homographic with ASCII characters but could be used for some special purposes. (I have no idea of what the purpose might be here.)
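The fullwidth block sits at a fixed offset from printable ASCII: U+FF01 through U+FF5E is exactly 0xFEE0 above U+0021 through U+007E. A small sketch (Python assumed) of the mapping:

```python
def to_fullwidth(text: str) -> str:
    """Map printable ASCII (except space) into the fullwidth block
    U+FF01..U+FF5E by adding the fixed offset 0xFEE0."""
    return ''.join(
        chr(ord(c) + 0xFEE0) if 0x21 <= ord(c) <= 0x7E else c
        for c in text
    )

assert to_fullwidth('!') == '\uFF01'
assert to_fullwidth('~') == '\uFF5E'
print(to_fullwidth('hello world'))  # fullwidth letters, plain space
```

Note that NFKC normalization (as discussed in the first answer above) folds these fullwidth characters straight back to ASCII, so the disguise does not survive normalization.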
It depends on the Unicode encoding you use.
In UTF-8, the first 128 code points are encoded as single bytes with exactly their ASCII values. In UTF-16, the first 128 ASCII characters are encoded as code units 0x0000 to 0x007F (2 bytes each).
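To illustrate in Python (an assumption, as the answer names no language), the ASCII range is byte-identical in UTF-8 but takes two bytes per character in UTF-16:

```python
# ASCII bytes pass through UTF-8 unchanged; UTF-16 uses 16-bit code units.
assert 'A'.encode('utf-8') == b'A'
assert 'A'.encode('utf-16-be') == b'\x00A'
assert len('hello'.encode('utf-8')) == 5
assert len('hello'.encode('utf-16-be')) == 10
```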
