I am writing a program that basically gets terminal output and needs to strip out ANSI/VT-100 escape codes. I can find specs on the codes, but there are many, and they all have different sizes (numbers of characters).
Does anyone know of an algorithmic way of figuring out the size of a given code? (I.e. something like "if it starts with a letter it will be 2 chars, but if it starts with a symbol it will be 1 char".)
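The ECMA-48 grammar described further down this page does give such a rule, at least for the common CSI sequences: after ESC [ come zero or more parameter bytes (0x30-0x3F), zero or more intermediate bytes (0x20-0x2F), and exactly one final byte (0x40-0x7E). A minimal Perl sketch along those lines (strip_ansi is a made-up name, and rarer sequence types are only roughly handled):

# Sketch of stripping ECMA-48 escape sequences. Order matters:
# CSI first, then OSC, then leftover two-character escapes.
sub strip_ansi {
    my ($s) = @_;
    # CSI: ESC [ , parameter bytes, intermediate bytes, one final byte
    $s =~ s/\e\[ [\x30-\x3F]* [\x20-\x2F]* [\x40-\x7E]//gx;
    # OSC: ESC ] ... terminated by BEL or by ST (ESC \)
    $s =~ s/\e\] .*? (?:\a|\e\\)//gsx;
    # Anything else: ESC plus a single byte (rougher; some sequences are longer)
    $s =~ s/\e[\x20-\x7E]//g;
    return $s;
}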
I have a legacy app in Perl processing XML, most likely encoded in UTF-8, which needs to store some data from that XML in a database that uses windows-1252 for historical reasons. Yes, this setup can't support all possible characters of the Unicode standard, but in practice I don't need that anyway and can try to be reasonably compatible.
The specific problem currently is a file containing LATIN SMALL LETTER U, COMBINING DIAERESIS (U+0075 U+0308), which makes Perl's encoding of the Unicode string to windows-1252 break with the following exception:
"\x{0308}" does not map to cp1252
I was able to work around that problem using Unicode::Normalize::NFKC, which creates the character U+00FC (ü) that maps perfectly fine to windows-1252. That of course led to other problems, e.g. with the character VULGAR FRACTION ONE HALF (½, U+00BD), because NFKC creates DIGIT ONE, FRACTION SLASH, DIGIT TWO (1⁄2, U+0031 U+2044 U+0032) for it, and Perl dies again:
"\x{2044}" does not map to cp1252
According to the normalization rules, this is perfectly fine for NFKC. I used it because I thought it would give me the most compatible result, but that was wrong. Using NFC instead fixed both problems, as both characters have an NFC normalization compatible with windows-1252 in this case.
This approach gets additionally problematic for characters that do have a windows-1252-compatible normalization in general, just not under NFC. One example is LATIN SMALL LIGATURE FI (ﬁ, U+FB01). According to its normalization rules, its representation after NFC is incompatible with windows-1252, while using NFKC this time results in two characters compatible with windows-1252: fi (U+0066 U+0069).
My current approach is to simply try encoding to windows-1252 as is; if that fails, I use NFC and try again; if that fails, I use NFKC and try again; and if that still fails, I give up for now. This works in the cases I'm currently dealing with, but obviously fails if all three characters of my examples above are present in one string at the same time. There's always one character left then which results in windows-1252-incompatible output, regardless of the order of NFC and NFKC. The only question is which character breaks when.
BUT the important point is that each character by itself could be normalized to something compatible with windows-1252. It just seems that there's no one-shot solution.
So, is there some API I'm missing, which already converts in the most backwards compatible way?
If not, what's the approach I would need to implement myself to support all the above characters within one string?
It sounds like I would need to process each string Unicode character by Unicode character, normalize each one individually with whatever is most compatible with windows-1252, and then concatenate the results again. Is there some incremental Unicode-character parser available which already deals with combining characters and the like? Does a simple Unicode-character-based regular expression handle this already?
Unicode::Normalize provides additional functions to work on partial strings and such, but I must admit that I currently don't fully understand their purpose. The examples focus on concatenation as well, but from my understanding I first need some parsing to be able to normalize individual characters differently.
I don't think you're missing an API because a best-effort approach is rather involved. I'd try something like the following:
Normalize using NFC. This combines decomposed sequences like LATIN SMALL LETTER U, COMBINING DIAERESIS.
Extract all codepoints which aren't combining marks using the regex /\PM/g. This throws away all combining marks remaining after NFC conversion which can't be converted to Windows-1252 anyway. Then for each code point:
If the codepoint can be converted to Windows-1252, do so.
Otherwise try to normalize the codepoint with NFKC. If the NFKC mapping differs from the input, apply all steps recursively on the resulting string. This handles things like ligatures.
As a bonus: If the codepoint is invariant under NFKC, convert to NFD and try to convert the first codepoint of the result to Windows-1252. This converts characters like Ĝ to G.
Otherwise ignore the character.
There are of course other approaches that convert unsupported characters to ones that look similar, but those require manually created mappings.
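A rough Perl sketch of the steps above (to_cp1252 is a made-up name; Encode and Unicode::Normalize are core modules):

use Encode qw(encode FB_CROAK);
use Unicode::Normalize qw(NFC NFKC NFD);

sub to_cp1252 {
    my ($str) = @_;
    my $out = '';
    # NFC first; then /(\PM)/g keeps only non-combining-mark codepoints.
    for my $cp (NFC($str) =~ /(\PM)/g) {
        my $copy  = $cp;   # encode() with FB_CROAK consumes its argument
        my $bytes = eval { encode('cp1252', $copy, FB_CROAK) };
        if (defined $bytes) {
            $out .= $bytes;                 # directly encodable
        }
        elsif (NFKC($cp) ne $cp) {
            $out .= to_cp1252(NFKC($cp));   # ligatures etc.: recurse
        }
        else {
            # Bonus step: try the first codepoint of the NFD decomposition,
            # which turns e.g. Ĝ into plain G.
            my $base = substr(NFD($cp), 0, 1);
            $bytes = eval { encode('cp1252', $base, FB_CROAK) };
            $out .= $bytes if defined $bytes;
            # Anything still unencodable is dropped.
        }
    }
    return $out;
}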
Since it seems that you can convert individual characters as needed (to the cp1252 encoding), one way is to process the string character by character, as proposed, once a word fails the procedure.
The \X in Perl's regex matches a logical Unicode character, an extended grapheme cluster, either as a single codepoint or a sequence. So if you indeed can convert all individual (logical) characters into the desired encoding, then with
while ($word =~ /(\X)/g) { ... }
you can access the logical characters and apply your working procedure to each.
In case you can't handle all logical characters that may come up, piece together an equivalent of \X using specific character properties, for finer granularity with combining marks or such (like /((.)\p{Mn}?)/, or \p{Nonspacing_Mark}). The full, grand list is in perluniprops.
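For example, combining \X with the try-as-is / NFC / NFKC cascade from the question (a sketch, with cp1252 handled by the core Encode module):

use Encode qw(encode FB_CROAK);
use Unicode::Normalize qw(NFC NFKC);

# Apply the as-is / NFC / NFKC cascade per grapheme cluster.
my $out = '';
while ($word =~ /(\X)/g) {
    my $g = $1;
    for my $candidate ($g, NFC($g), NFKC($g)) {
        my $copy  = $candidate;   # FB_CROAK consumes its argument
        my $bytes = eval { encode('cp1252', $copy, FB_CROAK) };
        if (defined $bytes) { $out .= $bytes; last }
    }
}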
ANSI terminal color escapes can be done with \033[...m in most programming languages. (You may need to use \e or \x1b in some languages.)
What has always seemed odd to me is how they start with \033[ but end in m. Is there some historical reason for this (perhaps ] was mapped to the slot that is now occupied by m in the ASCII table?), or is it an arbitrary character choice?
It's not completely arbitrary, but follows a scheme laid out by committees, and documented in ECMA-48 (the same as ISO 6429). Except for the initial Escape character, the succeeding characters are specified by ranges.
While the pair Escape[ is widely used (this is called the control sequence introducer, CSI), there are other control sequences (such as Escape], the operating system command, OSC). These sequences may have parameters, and a final byte.
In the question, using CSI, the m is a final byte, which happens to tell the terminal what the sequence is supposed to do. The parameters, if given, are a list of numbers. On the other hand, with OSC, the command type is at the beginning, and the parameters are less constrained (they might be any string of printable characters).
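To make the byte ranges concrete, a small Perl illustration of that grammar (the ranges are the ones ECMA-48 specifies):

# Per ECMA-48, a CSI sequence is: ESC [ , zero or more parameter bytes
# (0x30-0x3F), zero or more intermediate bytes (0x20-0x2F), and exactly
# one final byte (0x40-0x7E) which selects the function.
if ("\e[1;31m" =~ /^\e\[([\x30-\x3F]*)([\x20-\x2F]*)([\x40-\x7E])$/) {
    my ($params, $intermediate, $final) = ($1, $2, $3);
    # Here $params is "1;31" and $final is "m" (select graphic rendition)
}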
Is there a way to set the maximum number of characters for GtkTextBuffer? What I need is when the maximum number of characters is reached, further typing produces no output. Beep would be nice but not required. Or maybe I need to do something with GtkTextView?
I need my GtkTextView to serve as a word-wrapping single line entry, so I want to disallow Return characters as well.
I'm looking for a solution for GTK+2 and GTK+3.
This question has a similar intent, but in that case the poster wants to display only a certain number of trailing characters. ptomato (who would know more about it than I) recommends using a callback on typing to cut or truncate the string.
Your callback would be very similar, except that you would cut off the trailing characters and strip any newline characters as they were fed in.
You could also use a similar mechanism to the comment fields on SO, where you are notified that your string is too long and either disallow continuing or truncate it to size only when the data is actually being used.
For wrapping, you also want to read about wrapping modes, although you may already know about that.
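As a sketch of that callback, using the Perl Gtk3 bindings (the limit, and error_bell as the beep, are my assumptions; the Gtk2 version is nearly identical):

use Gtk3 -init;

my $MAX_CHARS = 80;                 # assumption: pick your own limit
my $view      = Gtk3::TextView->new;
$view->set_wrap_mode('word');       # word-wrapping, as described above
my $buffer    = $view->get_buffer;

$buffer->signal_connect('insert-text' => sub {
    my ($buf, $iter, $text, $len) = @_;
    # Reject the insertion outright if it contains a newline or would
    # push the buffer past the limit; stopping the emission means the
    # default handler never runs, so nothing gets inserted.
    if ($text =~ /\R/ or $buf->get_char_count + length($text) > $MAX_CHARS) {
        $buf->signal_stop_emission_by_name('insert-text');
        $view->error_bell;          # the optional beep
    }
});

A fancier version could strip the newlines and insert the truncated remainder itself (with the handler blocked to avoid recursion) instead of rejecting the whole insertion.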
I would like to be able to put quote marks and other characters into a text without their being interpreted by the computer, so I was wondering: is there a range that is defined as mapping to the same glyphs etc. as the range 0-0x7F (the ASCII range)?
Please note I state that the range 0-0x7F is the same as ASCII, so the question is not what range maps to ASCII.
I am asking whether there is another range that also maps to the same glyphs, i.e. when rendered it will look the same, but when interpreted it can be seen as a different code.
so I can write
print "hello "world""
where the inner quote characters avoid the 0-0x7F (ASCII) range.
Additional:
I meant homographic and behaviourally identical; everything the same except having a different code point. I was hoping for the whole 128-character ASCII set, directly mapped (an offset added to them all).
The reason: to avoid interpretation by any language that uses some of the ASCII characters as part of its syntax but allows any Unicode character in literal strings, e.g. (when UTF-8 encoded) C, HTML, CSS, …
I was trying to retrofit the idea of “no reserved words” / “word colours” (string literals one colour, keywords another, variables another, numbers another, etc.) so that a string literal or variable name (though not in this case) can contain any character.
I interpret the question to mean "is there a set of code points which are homographic with the low 7-bit ASCII set". The answer is no.
There are some code points which are conventionally rendered homographically (e.g. Cyrillic uppercase А U+0410 looks identical to ASCII 65 in many fonts, and quite similar in most fonts which support this code point), but they are different code points with different semantics. Similarly, there are some code points which basically render identically but have a specific set of semantics, like the non-breaking space U+00A0, which renders identically to ASCII 32 but is specified as having a particular line-breaking property; or the RIGHT SINGLE QUOTATION MARK U+2019, which is an unambiguous quotation mark, as opposed to its twin ASCII 39, the "apostrophe".
But in summary, there are many symbols in the basic ASCII block which do not coincide with a homograph in another code block. You might be able to find homographs or near-homographs for your sample sentence, though; I would investigate the IPA phonetic symbols and the Greek and Cyrillic blocks.
The answer to the question asked is “No”, as @tripleee described, but the following note might be relevant if the purpose is trickery or fun of some kind:
The printable ASCII characters excluding the space have been duplicated at U+FF01 to U+FF5E, but these are fullwidth characters intended for use in CJK texts. Their shape is (and is meant to be) different: ｈｅｌｌｏ ｗｏｒｌｄ. (Your browser may be unable to render them.) So they are not really homographic with ASCII characters but could be used for some special purposes. (I have no idea what the purpose might be here.)
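Interestingly, that block sits at a constant offset from ASCII, so the "offset added to them all" mapping the questioner hoped for does exist mechanically here, e.g. in Perl:

# Map printable ASCII (excluding the space) to the fullwidth block:
# U+FF01..U+FF5E is exactly the range 0x21..0x7E shifted by 0xFEE0.
my $text = 'hello world';
(my $fullwidth = $text) =~ s/([!-~])/chr(ord($1) + 0xFEE0)/ge;
# $fullwidth is now "ｈｅｌｌｏ ｗｏｒｌｄ" (the space stays as-is)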
It depends on the Unicode encoding you use.
In UTF-8, the first 128 characters are encoded with exactly their ASCII code numbers, one byte each. In UTF-16, the same ASCII characters occupy code units 0x0000 to 0x007F, two bytes each.
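This is easy to check in Perl, for instance:

use Encode qw(encode);

# 'A' (U+0041) is one byte in UTF-8 but two bytes in UTF-16BE.
print unpack('H*', encode('UTF-8',    'A')), "\n";   # prints 41
print unpack('H*', encode('UTF-16BE', 'A')), "\n";   # prints 0041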
I found out after testing that Linux allows any character in a file name except for / and null (\0). So what sequences should I not allow in a filename? I heard a leading - may confuse some command-line programs; that doesn't matter to me, but it may bother other people if they decide to collect a bunch of files and filter them with some GNU programs.
It was suggested to me to remove leading and trailing spaces, and I plan to, if only because the user typically doesn't mean to have leading/trailing spaces.
What problematic sequences might there be, and what sequences should I consider not allowing?
I am also considering disallowing characters illegal in Windows, just for convenience. I think I may disallow dashes at the beginning (a dash is a legal Windows character).
Your question is somewhat confusing since you talk at length about Linux, but then in a comment to another answer you say that you are generating filenames for people to download, which presumably means that you have absolutely no control whatsoever over the filesystem and operating system that the files will be stored on, making Linux completely irrelevant.
For the purpose of this answer I'm going to assume that your question is wrong and your comment is correct.
The vast majority of operating systems and filesystems in use today fall roughly into three categories: POSIX, Windows and MacOS.
The POSIX specification is very clear on what a filename that is guaranteed to be portable across all POSIX systems looks like. The characters that you can use are defined in Section 3.276 (Portable Filename Character Set) of the Open Group Base Specification as:

ABCDEFGHIJKLMNOPQRSTUVWXYZ
abcdefghijklmnopqrstuvwxyz
0123456789._-

The maximum filename length that you can rely on is defined in Section 13.23.3.5 (<limits.h> Minimum Values) as 14. (The relevant constant is _POSIX_NAME_MAX.)
So a filename which is up to 14 characters long and contains only the 65 characters listed above is safe to use on all POSIX-compliant systems, which gives you 24407335764928225040435790 combinations (or roughly 84 bits).
If you don't want to annoy your users, you should add two more restrictions: don't start the filename with a dash or a dot. Filenames starting with a dot are customarily interpreted as "hidden" files and are not displayed in directory listings unless explicitly requested. And filenames starting with a dash may be interpreted as an option by many commands. (Sidenote: it is amazing how many users don't know about the rm ./-rf or rm -- -rf tricks.)
This leaves you at 23656340818315048885345458 combinations (still 84 bits).
Windows adds a couple of new restrictions to this: filenames cannot end with a dot, and filenames are case-insensitive. This reduces the character set from 65 to 39 characters (37 for the first, 38 for the last character). It doesn't add any length restrictions; Windows can deal with 14 characters just fine.
This reduces the possible combinations to 17866587696996781449603 (73 bits).
Another restriction is that Windows treats everything after the last dot as a filename extension which denotes the type of the file. If you want to avoid potential confusion (say, if you generate a filename like abc.mp3 for a text file), you should avoid dots altogether.
You still have 13090925539866773438463 combinations (73 bits).
If you have to worry about DOS, then additional restrictions apply: the filename consists of one or two parts (separated by a dot), where neither of the two parts can contain a dot. The first part has a maximum length of 8, the second of 3 characters. Again, the second part is usually reserved to indicate the file type, which leaves you only 8 characters.
Now you have 4347792138495 possible filenames or 41 bits.
The good news is that you can use the 3-character extension to actually correctly indicate the file type, without breaking the POSIX filename limit (8+3+1 = 12 < 14).
If you want your users to be able to burn the files onto a CD-R formatted with ISO9660 Level 1, then you have to disallow hyphen anywhere, not just as the first character. Now, the remaining character set looks like

ABCDEFGHIJKLMNOPQRSTUVWXYZ
0123456789_

which gives you 3512479453921 combinations (41 bits).
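Put together, a sanitizer that enforces the portable character set plus the two user-friendliness rules above might look like this sketch (sanitize_filename is a made-up name; relax or tighten the rules to taste):

# Clamp a proposed name to the POSIX portable filename character set,
# with no leading dash or dot and the conservative 14-character limit.
sub sanitize_filename {
    my ($name) = @_;
    $name =~ s/^\s+|\s+$//g;           # strip leading/trailing whitespace
    $name =~ s/[^A-Za-z0-9._-]/_/g;    # portable filename character set
    $name =~ s/^[-.]+//;               # no leading dash (options) or dot (hidden)
    return substr($name, 0, 14);       # _POSIX_NAME_MAX
}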
I would leave the determination of what's "valid" up to the OS and filesystem driver. Let the user type whatever they want, and pass it on. Handle errors from the OS in an appropriate manner. The exception is I think it's reasonable to strip leading and trailing spaces. If people want to create filenames with embedded spaces or leading dashes or question marks, and their chosen filesystem allows it, it shouldn't be up to you to try to prevent them.
It's possible to mount different filesystems at different mount points (or drives in Windows) that have different rules regarding legal characters in a file name. Handling this sort of thing inside your application will be much more work than is necessary, because the OS will already do it for you.
Since you seem to be interested primarily in Linux, one thing to avoid is characters that the (typical) shell will try to interpret, for example, as a wildcard. You can create a file named "*" if you insist, but you might have some users who don't appreciate it much.
Are you developing an application where you have to ask the user to create files themselves? If that's what you are doing, then you can set the rules in your application (e.g. only allow [a-zA-Z0-9_.] and reject the rest of the special characters). This is much simpler to enforce.
urlencode all strings to be used as filenames and you'll only have to worry about length. This answer might be worth reading.
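With URI::Escape from CPAN that is essentially a one-liner; passing an explicit unsafe-set keeps the result within the portable characters discussed above (plus the % from the escapes):

use URI::Escape qw(uri_escape);

# Escape everything outside an explicit whitelist; the result contains
# only A-Za-z0-9._- plus %XX escapes.
my $fname = uri_escape('my file?.txt', '^A-Za-z0-9._-');
# $fname is "my%20file%3F.txt"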
I'd recommend the use of a set of whitelist characters. In general, symbols in filenames will annoy people.
By all means allow people to use a-z, 0-9, and Unicode characters above 0x80, but do not allow arbitrary symbols; things like & and , will cause a lot of annoyance, as well as full stops in inappropriate places.
I think the ASCII symbols which are safe to allow are: full stop, underscore, and hyphen.
Allowing any OTHER ascii symbols in the filename is asking for trouble.
A filename should also not start with an ASCII symbol. Policy on spaces in filenames is tricky, as users may expect to be able to use them, but some filenames are obviously silly (such as those which START with spaces).