Why were certain special keyboard characters chosen?

Of all of the characters available, and given the relatively small number of keys, why were some of the special characters chosen?
I am referring in particular to:
~
|
` (but not ´)
^ (but no other accents)
among others. It seems an odd choice in the competition for attention.

Related

Is there a module or regex in Python to convert all fonts to a uniform font? (Text is coming from Twitter)

I'm working with some text from Twitter, using Tweepy. All that is fine, and at the moment I'm just looking to start with some basic frequency counts for words. However, I'm running into an issue where users' ability to use different fonts in their tweets makes some words look like their own unique word, when in reality they're words that have already been encountered but in a different font/font size, as in the picture below (those words were counted previously and appear earlier in the spreadsheet).
This messes up the accuracy of the counts. I'm wondering if there's a package or general solution to make all the words a uniform font/size - either while I'm tokenizing it (just by hand, not using a module) or while writing it to the csv (using the csv module). Or any other solutions for this that I may not be considering. Thanks!
You can (mostly) solve your problem by normalising your input, using unicodedata.normalize('NFKC', str).
The KC normalisation form (NF stands for "normalisation form") first does a "compatibility decomposition" on the text, which replaces Unicode characters representing styled variants with their base characters, and then does a canonical composition on the result, so that ñ, which is converted to an n and a separate ~ diacritic by the decomposition, is then turned back into an ñ, the canonical composite for that character. (If you don't want the recomposition step, use NFKD normalisation.) See Unicode Standard Annex #15 for a more precise description, with examples.
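For instance, a quick illustrative check (assuming Python 3 and the standard-library unicodedata module) shows the decomposition and recomposition described above:

import unicodedata
print(len(unicodedata.normalize('NFKD', 'ñ')))  # 2: 'n' plus a combining tilde
print(len(unicodedata.normalize('NFKC', 'ñ')))  # 1: recomposed back into 'ñ'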
Unicode contains a number of symbols, mostly used for mathematics, which are simply stylistic variations on some letter or digit, or, in some cases, on several letters or digits, such as ¼ or ℆. In particular, this includes commonly-used symbols written with font variants which have particular mathematical or other meanings, such as ℒ (the Laplace transform) and ℚ (the set of rational numbers). Compatibility decomposition strips out the stylistic information, which reduces those four examples to '1/4', 'c/u', 'L' and 'Q', respectively.
The first published Unicode standard defined a Letterlike Symbols block in the Basic Multilingual Plane (BMP). (All of the above examples are drawn from that block.) In Unicode 3.1, complete Latin and Greek alphabets and digits were added in the Mathematical Alphanumeric Symbols block, which includes 13 different font variants of the 52 upper- and lower-case letters of the Roman alphabet, 58 Greek letters in five font variants (some of which could pass for Roman letters, such as 𝝪, which is upsilon, not a capital Y), and the 10 digits in five variants (𝟎 𝟙 𝟤 𝟯 𝟺). And a few loose characters which mathematicians apparently asked for.
None of these should be used outside of mathematical typography, but that's not a constraint which most users of social networks care about. So people compensate for the lack of styled text on Twitter (and elsewhere) by using these Unicode characters, despite the fact that they are not properly rendered on all devices, make life difficult for screen readers, cannot readily be searched, and all the other disadvantages of using hacked typography, such as the issue you are running into. (Some of the rendering problems are also visible in your screenshot.)
Compatibility decomposition can go a long way in resolving the problem, but it also tends to erase information which is really useful. For example, x² and H₂O become just x2 and H2O, which might or might not be what you wanted. But it's probably the best you can do.
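As a rough sketch of what that looks like in practice (the styled sample string below is made up for illustration, not taken from the question):

import unicodedata
sample = "𝗦𝗢𝗠𝗘 𝒘𝒐𝒓𝒅𝒔 and x² and H₂O"
print(unicodedata.normalize('NFKC', sample))   # -> 'SOME words and x2 and H2O'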

vim delete single space between (character/number) and (character/number)

I have a text file with four variables (TA 000, TB 111, T2 333, T56 R88), separated from each other by 3 single spaces, like:
TA 000 TB 111 T2 333 T56 R88
Is it possible to erase the single space within the variables with vim, maintaining intact the 3 spaces that separate the variables?
TA000 TB111 T2333 T56R88
Certainly. One approach is with capturing groups, capturing the words + single space + numbers, and reassembling only with words + numbers:
:%s/\(\w\+\) \(\d\+\)/\1\2/g
Another approach matches only the single space (and replaces it with nothing), asserting (but not matching) the stuff around it:
:%s/\w\zs \ze\d//g
The \zs and \ze (you can look up anything here via :h /\zs etc.) are specific to Vim. A variation (that would work in other regular expression engines, too) would be using positive lookahead and lookbehind, but the syntax is more complex.
If the three spaces have special meaning (to limit the matching places), you can incorporate those into both approaches, too. I leave that to you, as such relatively easy problems provide great learning experiences :-)
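For reference, a rough Python equivalent of the lookaround variation mentioned above (the shortened sample line is just for illustration):

import re
line = "TA 000 TB 111 T2 333"
# (?<=\w) and (?=\d) assert, but do not consume, the characters around the space
print(re.sub(r"(?<=\w) (?=\d)", "", line))   # -> 'TA000 TB111 T2333'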

What do you call the different types of characters of a password when it is being validated?

I hope this question is not too pedantic, but is there a technical term for the different "categories" of characters that are checked when a password is being validated? For example, the default AD password complexity requirements that must be met (Microsoft calls them "categories"):
Passwords must contain characters from three of the following five **categories**:
Uppercase characters of European languages (A through Z, with diacritic marks, Greek and Cyrillic characters)
Lowercase characters of European languages (a through z, sharp-s, with diacritic marks, Greek and Cyrillic characters)
Base 10 digits (0 through 9)
Nonalphanumeric characters: ~!@#$%^&*_-+=`|\(){}[]:;"'<>,.?/
Any Unicode character that is categorized as an alphabetic character but is not uppercase or lowercase. This includes Unicode characters from Asian languages.
Is there a term used by security engineers or cryptographers to refer to these "categories"?
There's not any official term for these. I would tend to call it a "character type".
For example, this term is used in Novell's document Creating Password Policies:
The password must contain at least one character from three of the four types of character: uppercase, lowercase, numeric, and special
and in this NIST document regarding Enterprise Password Management.
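As an illustration of how such a "three of four types" rule is typically checked, here is a minimal sketch in Python using ASCII-only types; the function names and the minimum length are my own, not taken from any of the documents above:

import string

def count_character_types(password):
    # the four commonly checked "character types": upper, lower, digit, special
    present = [
        any(c in string.ascii_uppercase for c in password),
        any(c in string.ascii_lowercase for c in password),
        any(c in string.digits for c in password),
        any(not c.isalnum() for c in password),
    ]
    return sum(present)

def meets_policy(password, required_types=3, min_length=8):
    return len(password) >= min_length and count_character_types(password) >= required_types

print(meets_policy("Tr0ub4dor&3"))   # True: all four types present
print(meets_policy("alllowercase"))  # False: only one type present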
AFAIK, after 10 years working in security, no final, shared nomenclature has emerged for this. MS's "categories" is a good term and probably the most used, but it is not formally shared across contexts (i.e. Java could call it differently; PHP, OWASP, Oracle, ... could each have their own).
Academically speaking, they are only factors that enlarge the basic character set, increasing the cost of an offline brute-force attack or of rainbow table creation, and resisting trivial dictionary attacks. Brute-force complexity is roughly |C|^n, where n is the expected length of the password, C is the character set chosen, and |C| is the number of elements in it.
Having more categories increases the value of |C|, so they should really be called something like "password character set subsets" instead of "categories", but you can see why nobody bothers with the theoretical bit here; the nomenclature is unfriendly.
If you look into it and find out what academics call them, please post it; that is always useful.
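To put rough numbers on the |C|^n point above, here is an illustrative calculation for an assumed 8-character password, using approximate character-set sizes:

n = 8
for label, size in [("lowercase only", 26),
                    ("lower + upper", 52),
                    ("lower + upper + digits", 62),
                    ("all printable ASCII", 95)]:
    print(f"{label}: {size}**{n} = {size**n:.2e} candidates")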

Why are special characters not allowed in variable names?

Why are special characters (except underscore) not allowed in variable names in programming languages?
Is there any reason related to computer architecture or organisation?
Most languages have long histories, using ASCII (or EBCDIC) character sets. Those languages tend to have simple identifier descriptions (e.g., starts with A-Z, followed by A-Z,0-9, maybe underscore; COBOL allows "-" as part of a name). When all you had was an 029 keypunch or a teletype, you didn't have many other characters, and most of them got used as operator syntax or punctuation.
On older machines, this did have the advantage that you could encode an identifier as a radix-37 (A-Z, 0-9, null) number [6 characters in 32 bits] or a radix-64 (A-Z, a-z, 0-9, underscore and null) number [6 characters in 36 bits, a common word size in earlier generations of machines] for small symbol tables. A consequence: many older languages had 6-character limits on identifier sizes (e.g., FORTRAN).
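A sketch of the radix-37 packing described above (illustrative Python; the choice of a blank as the null/fill symbol is my own). Since 37**6 ≈ 2.57e9, six such characters fit in a 32-bit word, just as 64**6 = 2**36 fits in a 36-bit word:

ALPHABET = " " + "ABCDEFGHIJKLMNOPQRSTUVWXYZ" + "0123456789"   # 37 symbols, blank as the fill

def pack_identifier(name):
    """Pack an identifier of up to 6 characters into one radix-37 integer."""
    value = 0
    for ch in name.upper().ljust(6)[:6]:
        value = value * 37 + ALPHABET.index(ch)
    return value

def unpack_identifier(value):
    chars = []
    for _ in range(6):
        value, digit = divmod(value, 37)
        chars.append(ALPHABET[digit])
    return "".join(reversed(chars)).rstrip()

print(37**6 < 2**32)                                 # True: 6 radix-37 characters fit in 32 bits
print(pack_identifier("COUNT"))                      # one machine word per symbol-table entry
print(unpack_identifier(pack_identifier("COUNT")))   # -> 'COUNT'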
LISP languages have long been much more permissive; names can be anything but characters with special meaning to LISP, e.g., ( ) [ ] ' ` #, and usually there are ways to insert these characters into names using some kind of escape convention. Our PARLANSE language is like LISP; it uses "~" as an escape, so you can write ~(begin+~)end as a single identifier whose actual spelling is "(begin+end)".
More modern languages (Java, C#, Scala, ...., uh, even PARLANSE) grew up in an era of Unicode, and tend to allow most of Unicode in identifiers (actually, they tend to allow named Unicode subsets as parts of identifiers). An identifier made of Chinese characters is perfectly legal in such languages.
It's kind of a matter of taste in the Western hemisphere: most identifier names still tend to use just letters and digits (sometimes, Western European letters). I don't know what Japanese and Chinese programmers really use for identifier names now that they have Unicode-capable character sets; what little Asian code I have seen tends to follow Western identifier conventions, but the comments tend to use much more of the local native and/or Unicode character set.
Fundamentally it is because they're mostly used as operators or separators, so it would introduce ambiguity.
Is there any reason related to computer architecture or organization?
No. The computer can't see the variable names. Only the compiler can. But it has to be able to distinguish a variable name from two variable names separated by an operator, and most language designers have adopted the principle that the meaning of a computer program should not be affected by white space.

Detecting syllables in a word containing non-alphabetical characters

I'm implementing a readability test and have implemented a simple algorithm for detecting syllables.
I detect sequences of vowels and count them per word; for example, the word "should" contains one sequence of vowels, which is 'ou'. Before counting them I remove suffixes like -les, -e, -ed (for example, the word "like" contains one syllable but two sequences of vowels, so this method works).
But...
Consider these words / sequences:
x-ray (it contains two syllables)
I'm (one syllable; maybe I could just remove all apostrophes in the text?)
goin'
I'd've
n' (for example Pork n' Beans)
3rd (how should this be treated?)
12345
What should I do with special characters? Remove them all? That would be OK for most words, but not for "n'" and "x-ray". And how should digits be treated?
These are special cases, but I'd be very glad to hear any experience or ideas on this subject.
I'd advise you to first determine how much of your data consists of these kinds of words and how much it matters to your program's overall performance. Also compile some statistics of which kinds occur most.
There's no simple correct solution for this problem, but I can suggest a few heuristics:
A ' between two consonants (shouldn't) seems to mark the elision of a syllable
A ' with a vowel or word boundary on one side (I'd, goin') seems not to do so (but note that goin' is still two syllables)
Any word, including n' is at least one syllable long
Dashes (-) may be handled by treating the text on both sides as separate words
3rd can be handled by code that writes ordinals out as words, or by simpler heuristics.
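A rough sketch of some of these heuristics (illustrative Python; only the dash splitting, apostrophe removal, and one-syllable minimum are implemented, and the ordinal/number cases are left out):

import re

def count_syllables(token):
    total = 0
    # treat the text on each side of a dash as a separate word (x-ray -> x + ray)
    for part in token.lower().split("-"):
        part = part.replace("'", "")            # drop apostrophes (I'm, goin', n')
        if not part:
            continue
        sequences = re.findall(r"[aeiouy]+", part)
        total += max(1, len(sequences))          # every word is at least one syllable
    return total

for w in ["x-ray", "I'm", "goin'", "n'", "shouldn't"]:
    print(w, count_syllables(w))   # 2, 1, 1, 1, 1
# goin' and shouldn't are under-counted here; the apostrophe heuristics above
# would need extra handling to catch those cases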
