I found out after testing that Linux allows any character in a file name except for / and null (\0). So what sequences should I not allow in a filename? I heard a leading - may confuse some command line programs, which doesn't matter to me, but it may bother other people if they decide to collect a bunch of files and filter them with some GNU programs.
It was suggested to me to remove leading and trailing spaces, and I plan to, but only because the user typically doesn't mean to have leading/trailing spaces.
What problematic sequence might there be and what sequence should I consider not allowing?
I am also considering not allowing characters that are illegal in Windows, just for convenience. I think I may not allow dashes at the beginning (dash is a legal Windows character).
Your question is somewhat confusing since you talk at length about Linux, but then in a comment to another answer you say that you are generating filenames for people to download, which presumably means that you have absolutely no control whatsoever over the filesystem and operating system that the files will be stored on, making Linux completely irrelevant.
For the purpose of this answer I'm going to assume that your question is wrong and your comment is correct.
The vast majority of operating systems and filesystems in use today fall roughly into three categories: POSIX, Windows and MacOS.
The POSIX specification is very clear on what a filename that is guaranteed to be portable across all POSIX systems looks like. The characters that you can use are defined in Section 3.276 (Portable Filename Character Set) of the Open Group Base Specification as:
ABCDEFGHIJKLMNOPQRSTUVWXYZ
abcdefghijklmnopqrstuvwxyz
0123456789._-
The maximum filename length that you can rely on is defined in Section 13.23.3.5 (<limits.h> Minimum Values) as 14. (The relevant constant is _POSIX_NAME_MAX.)
So, a filename which is up to 14 characters long and contains only the 65 characters listed above, is safe to use on all POSIX compliant systems, which gives you 24407335764928225040435790 combinations (or roughly 84 bits).
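The figure above can be reproduced with a few lines of Python (used here purely for illustration; the constant names are made up for this sketch). It adds up the number of possible names of each length from 1 to 14 over the 65-character set.

```python
import math

ALPHABET_SIZE = 65  # 26 upper-case + 26 lower-case + 10 digits + '.', '_', '-'
MAX_LEN = 14        # value of _POSIX_NAME_MAX

# Count the names of length 1, 2, ..., 14 over the portable character set.
total = sum(ALPHABET_SIZE ** n for n in range(1, MAX_LEN + 1))

print(total)             # number of distinct portable filenames
print(math.log2(total))  # the same quantity expressed in bits
```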
If you don't want to annoy your users, you should add two more restrictions: don't start the filename with a dash or a dot. Filenames starting with a dot are customarily interpreted as "hidden" files and are not displayed in directory listings unless explicitly requested. And filenames starting with a dash may be interpreted as an option by many commands. (Sidenote: it is amazing how many users don't know about the rm ./-rf or rm -- -rf tricks.)
This leaves you at 23656340818315048885345458 combinations (still 84 bits).
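As a minimal sketch, the POSIX rules so far (portable character set, at most 14 characters, no leading dash or dot) could be checked like this. The function and constant names are hypothetical, and Python is only the illustration language.

```python
import string

# Portable Filename Character Set: letters, digits, '.', '_' and '-'.
PORTABLE_CHARS = set(string.ascii_letters + string.digits + "._-")
POSIX_NAME_MAX = 14  # minimum guaranteed maximum length


def is_portable_name(name: str) -> bool:
    """Return True if `name` follows the conservative POSIX rules above."""
    if not 1 <= len(name) <= POSIX_NAME_MAX:
        return False
    if name[0] in "-.":
        # A leading '-' looks like an option, a leading '.' hides the file.
        return False
    return all(ch in PORTABLE_CHARS for ch in name)
```

For example, `is_portable_name("report_1.txt")` is accepted, while `"-rf"`, `".hidden"` and any name longer than 14 characters are rejected.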
Windows adds a couple of new restrictions to this: filenames cannot end with a dot and filenames are case-insensitive. This reduces the character set from 65 to 39 characters (37 for the first, 38 for the last character). It doesn't add any length restrictions, Windows can deal with 14 characters just fine.
This reduces the possible combinations to 17866587696996781449603 (73 bits).
Another restriction is that Windows treats everything after the last dot as a filename extension which denotes the type of the file. If you want to avoid potential confusion (say, if you generate a filename like abc.mp3 for a text file), you should avoid dots altogether.
You still have 13090925539866773438463 combinations (73 bits).
If you have to worry about DOS, then additional restrictions apply: the filename consists of one or two parts (separated by a dot), where neither of the two parts can contain a dot. The first part has a maximum length of 8, the second of 3 characters. Again, the second part is usually reserved to indicate the file type, which leaves you only 8 characters.
Now you have 4347792138495 possible filenames or 41 bits.
The good news is that you can use the 3 character extension to actually correctly indicate the file type, without breaking the POSIX filename limit (8+3+1 = 12 < 14).
If you want your users to be able to burn the files onto a CD-R formatted with ISO9660 Level 1, then you have to disallow hyphen anywhere, not just as the first character. Now, the remaining character set looks like
ABCDEFGHIJKLMNOPQRSTUVWXYZ
0123456789_
which gives you 3512479453921 combinations (41 bits).
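Putting the strictest rules from this answer together (upper-case letters, digits and underscore only, at most 8 characters, plus an optional extension of up to 3 characters after a single dot), a check might look like the sketch below. The pattern and function name are assumptions made for illustration; they encode only the restrictions described here, not the full ISO 9660 specification.

```python
import re

# 1-8 name characters, optionally followed by '.' and a 1-3 character extension.
STRICT_NAME = re.compile(r"[A-Z0-9_]{1,8}(\.[A-Z0-9_]{1,3})?")


def is_strict_name(name: str) -> bool:
    """Return True if `name` fits the 8.3, upper-case-only scheme above."""
    return STRICT_NAME.fullmatch(name) is not None
```

With this, `README.TXT` passes, while `readme.txt` (lower case), `MY-FILE` (hyphen) and `A.B.C` (two dots) do not.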
I would leave the determination of what's "valid" up to the OS and filesystem driver. Let the user type whatever they want, and pass it on. Handle errors from the OS in an appropriate manner. The exception is I think it's reasonable to strip leading and trailing spaces. If people want to create filenames with embedded spaces or leading dashes or question marks, and their chosen filesystem allows it, it shouldn't be up to you to try to prevent them.
It's possible to mount different filesystems at different mount points (or drives in Windows) that have different rules regarding legal characters in a file name. Handling this sort of thing inside your application will be much more work than is necessary, because the OS will already do it for you.
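A rough sketch of that approach in Python: strip the surrounding spaces, hand the name to the OS, and report whatever error comes back. The function name and the way the error is reported are assumptions for illustration only.

```python
import os


def try_create(directory: str, requested_name: str) -> bool:
    """Try to create an empty file; let the OS decide whether the name is valid."""
    name = requested_name.strip(" ")  # the one clean-up step suggested above
    path = os.path.join(directory, name)
    try:
        # Mode "x" fails instead of silently overwriting an existing file.
        with open(path, "x"):
            pass
    except (OSError, ValueError) as exc:
        # OSError: rejected by the OS or filesystem.
        # ValueError: Python itself refuses names with an embedded NUL.
        print(f"Cannot create {name!r}: {exc}")
        return False
    return True
```

The same code then behaves correctly on any mounted filesystem, because the rules are never duplicated in the application.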
Since you seem to be interested primarily in Linux, one thing to avoid is characters that the (typical) shell will try to interpret, for example, as a wildcard. You can create a file named "*" if you insist, but you might have some users who don't appreciate it much.
Are you developing an application where you have to ask the user to create files themselves? If that's what you are doing, then you can set the rules in your application (e.g. only allow [a-zA-Z0-9_.] and reject all other special characters). This is much simpler to enforce.
urlencode all strings to be used as filenames and you'll only have to worry about length. This answer might be worth reading.
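In Python, for instance, percent-encoding a name could be sketched with `urllib.parse.quote`; passing `safe=""` makes sure that `/` is encoded as well. The wrapper function name is hypothetical.

```python
from urllib.parse import quote, unquote


def encode_filename(name: str) -> str:
    """Percent-encode everything except letters, digits and '_', '.', '-', '~'."""
    return quote(name, safe="")


encoded = encode_filename("a/b c?.txt")
print(encoded)           # percent-encoded form, safe to use as one path component
print(unquote(encoded))  # the original string is recoverable
```

Two caveats: the encoded name can be up to three times as long as the input (more for non-ASCII text, which is encoded byte by byte as UTF-8), and the special names `.` and `..` pass through unchanged, so they still need separate handling.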
I'd recommend the use of a set of whitelist characters. In general, symbols in filenames will annoy people.
By all means allow people to use a-z, 0-9 and Unicode characters > 0x80, but do not allow arbitrary symbols: things like & and , will cause a lot of annoyance, as will full stops in inappropriate places.
I think the ASCII symbols which are safe to allow are: full stop, underscore, hyphen.
Allowing any OTHER ASCII symbols in the filename is asking for trouble.
A filename should also not start with an ASCII symbol. Policy on spaces in filenames is tricky, as users may expect to be able to use them, but some filenames are obviously silly (such as those which START with spaces).
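A minimal sketch of such a whitelist in Python follows. The names are hypothetical, and two points are assumptions rather than part of the advice above: upper-case ASCII letters are accepted along with a-z, and spaces are rejected outright because the policy on them is left open.

```python
ALLOWED_ASCII_SYMBOLS = "._-"  # full stop, underscore, hyphen


def is_allowed_char(ch: str) -> bool:
    """Accept ASCII letters and digits, the three symbols, and code points > 0x80."""
    if ord(ch) > 0x80:
        return True
    return (ch.isascii() and ch.isalnum()) or ch in ALLOWED_ASCII_SYMBOLS


def is_allowed_name(name: str) -> bool:
    """Apply the whitelist and refuse names that start with an ASCII symbol."""
    if not name or name[0] in ALLOWED_ASCII_SYMBOLS:
        return False
    return all(is_allowed_char(ch) for ch in name)
```

A name such as `résumé-2024.txt` passes, while `a&b`, `_x` and `my file` are rejected.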
I am taking user input in an application I am writing and would like to expand escape sequences that the user enters.
For example if the user enters \\n it will be interpreted into a str \\\\n. I want to, in a general way, interpret that string (into a newline) and similar ones.
I could of course use String::replace() on the most essential ones and live without the rest, but I would prefer a general solution that also handles hex escapes (\x61 is a).
Escapes are usually handled by the lexer / parser (basically they're part of the language grammar). I don't think there's a stdlib function which would manage them, as that would be done at a much lower level.
Furthermore, escapes tend to be highly language-specific, in possibly unexpected ways. Rust has a singularly small list of escape sequences, which is probably desirable compared to the garbage available from C's, but I still do not know that you'd want to allow e.g. arbitrary hex or unicode escape sequences.
I would therefore recommend setting up your own explicitly supported list of escapes, though if you really do not want to there are probably third-party packages which can help you.
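To make the "explicit list" idea concrete, here is a small sketch. It is written in Python for illustration (the question concerns Rust, where the same loop translates directly), and the set of supported escapes is an assumption chosen for the example: a handful of single-character escapes plus two-digit hex escapes. Anything else is reported as an error rather than guessed at.

```python
import string

# Explicitly supported single-character escapes (an illustrative choice).
SIMPLE_ESCAPES = {"n": "\n", "t": "\t", "r": "\r", "0": "\0", "\\": "\\", '"': '"'}


def unescape(text: str) -> str:
    """Expand the supported escape sequences in `text`; reject all others."""
    out = []
    i = 0
    while i < len(text):
        ch = text[i]
        if ch != "\\":
            out.append(ch)
            i += 1
            continue
        if i + 1 >= len(text):
            raise ValueError("dangling backslash at end of input")
        code = text[i + 1]
        if code in SIMPLE_ESCAPES:
            out.append(SIMPLE_ESCAPES[code])
            i += 2
        elif code == "x":
            digits = text[i + 2:i + 4]
            if len(digits) != 2 or any(d not in string.hexdigits for d in digits):
                raise ValueError(f"invalid hex escape at position {i}")
            out.append(chr(int(digits, 16)))
            i += 4
        else:
            raise ValueError(f"unsupported escape \\{code} at position {i}")
    return "".join(out)
```

Here `unescape(r"a\nb")` yields a three-character string with a real newline in the middle, and `unescape(r"\x61")` yields `a`.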
I have a legacy app in Perl processing XML, most likely encoded in UTF-8, which needs to store some data from that XML in a database that uses windows-1252 for historical reasons. Yes, this setup can't support all possible characters of the Unicode standard, but in practice I don't need to anyway and can try to be reasonably compatible.
The specific problem currently is a file containing LATIN SMALL LETTER U, COMBINING DIAERESIS (U+0075 U+0308), which makes Perl break the existing encoding of the Unicode string to windows-1252 with the following exception:
"\x{0308}" does not map to cp1252
I was able to work around that problem using Unicode::Normalize::NFKC, which creates the character U+00FC (ü), which maps perfectly fine to windows-1252. That led to another problem of course, e.g. in the case of the character VULGAR FRACTION ONE HALF (½, U+00BD), because NFKC creates DIGIT ONE, FRACTION SLASH, DIGIT TWO (1/2, U+0031 U+2044 U+0032) for that and Perl dies again:
"\x{2044}" does not map to cp1252
According to normalization rules, this is perfectly fine for NFKC. I used that because I thought it would give me the most compatible result, but that was wrong. Using NFC instead fixed both problems, as both characters provide a normalization compatible with windows-1252 in that case.
This approach gets additionally problematic for characters for which a normalization compatible with windows-1252 is available in general, only different from NFC. One example is LATIN SMALL LIGATURE FI (ﬁ, U+FB01). According to its normalization rules, its representation after NFC is incompatible with windows-1252, while using NFKC this time results in two characters compatible with windows-1252: fi (U+0066 U+0069).
My current approach is to simply try encoding as windows-1252 as is; if that fails I use NFC and try again; if that fails I use NFKC and try again; and if that fails I give up for now. This works in the cases I'm currently dealing with, but obviously fails if all three characters of my examples above are present in a string at the same time. There's always one character then which results in windows-1252-incompatible output, regardless of the order of NFC and NFKC. The only question is which character breaks when.
BUT the important point is that each character by itself could be normalized to something being compatible with windows-1252. It only seems that there's no one-shot-solution.
So, is there some API I'm missing, which already converts in the most backwards compatible way?
If not, what's the approach I would need to implement myself to support all the above characters within one string?
Sounds like I would need to process each string Unicode character by Unicode character, normalize each one individually with whatever is most compatible with windows-1252 and then concatenate the results again. Is there some incremental Unicode-character parser available which already deals with combining characters and such? Does a simple Unicode-character based regular expression handle this already?
Unicode::Normalize provides additional functions to work on partial strings and such, but I must admit that I currently don't fully understand their purpose. The examples focus on concatenation as well, but from my understanding I first need some parsing to be able to normalize individual characters differently.
I don't think you're missing an API because a best-effort approach is rather involved. I'd try something like the following:
Normalize using NFC. This combines decomposed sequences like LATIN SMALL LETTER U, COMBINING DIAERESIS.
Extract all codepoints which aren't combining marks using the regex /\PM/g. This throws away all combining marks remaining after NFC conversion which can't be converted to Windows-1252 anyway. Then for each code point:
If the codepoint can be converted to Windows-1252, do so.
Otherwise try to normalize the codepoint with NFKC. If the NFKC mapping differs from the input, apply all steps recursively on the resulting string. This handles things like ligatures.
As a bonus: If the codepoint is invariant under NFKC, convert to NFD and try to convert the first codepoint of the result to Windows-1252. This converts characters like Ĝ to G.
Otherwise ignore the character.
There are of course other approaches that convert unsupported characters to ones that look similar, but they require creating mappings manually.
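The steps above could be sketched as follows. This is an illustration in Python rather than Perl, using the standard `unicodedata` module and the built-in `cp1252` codec; the function names are hypothetical, and the Perl version would use Unicode::Normalize and Encode in the same places.

```python
import unicodedata


def to_cp1252_best_effort(text: str) -> bytes:
    """Best-effort conversion following the steps described above."""
    out = bytearray()
    # Step 1: NFC composes sequences such as 'u' + COMBINING DIAERESIS.
    for ch in unicodedata.normalize("NFC", text):
        # Step 2: drop combining marks that survived NFC (the /\PM/g filter).
        if unicodedata.category(ch).startswith("M"):
            continue
        out += _convert_char(ch)
    return bytes(out)


def _convert_char(ch: str) -> bytes:
    try:
        return ch.encode("cp1252")  # directly representable
    except UnicodeEncodeError:
        pass
    nfkc = unicodedata.normalize("NFKC", ch)
    if nfkc != ch:
        # Compatibility mapping exists (e.g. ligatures): recurse on the result.
        return to_cp1252_best_effort(nfkc)
    # Bonus step: keep only the base character of the canonical decomposition.
    base = unicodedata.normalize("NFD", ch)[0]
    try:
        return base.encode("cp1252")
    except UnicodeEncodeError:
        return b""  # give up on this character
```

Because every character is handled on its own, a string containing the decomposed ü, the ½ and the ﬁ ligature from the question is converted in a single pass.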
Since it seems that you can convert individual characters as needed (to cp-1252 encoding), one way is to process character by character, as proposed, once a word fails the procedure.
The \X in Perl's regex matches a logical Unicode character, an extended grapheme cluster, either as a single codepoint or a sequence. So if you indeed can convert all individual (logical) characters into the desired encoding, then with
while ($word =~ /(\X)/g) { ... }
you can access the logical characters and apply your working procedure to each.
In case you can't handle all logical characters that may come up, piece together an equivalent of \X using specific character properties, for finer granularity with combining marks or such (like /((.)\p{Mn}?)/, or \p{Nonspacing_Mark}). The full, grand, list is in perluniprops.
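For comparison, here is a rough Python sketch of the simpler "base character plus following marks" grouping mentioned above. Python's built-in `re` module has no `\X`, so this uses `unicodedata.category` instead; it is only an approximation and does not implement full extended grapheme clusters. The function name is made up for the example.

```python
import unicodedata


def simple_clusters(text: str) -> list:
    """Group each character with the combining marks (category M*) that follow it."""
    clusters = []
    for ch in text:
        if clusters and unicodedata.category(ch).startswith("M"):
            clusters[-1] += ch  # attach the mark to the preceding base character
        else:
            clusters.append(ch)
    return clusters
```

Each element of the returned list can then be normalized and encoded individually.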
I am building a string compressor and for simplicity reasons, I wanted to use some non-printable characters.
1) Is it in some way "bad" to use the 0-31 ASCII characters?
2) Can these characters occur in a normal text string?
If the answer is "partially":
3) Which of them are better to use in this case? I think I will need at most 9 of them.
Well the answer is that it depends on how you're using it. If you're treating the "string" as binary, then binary by definition can have any value. However if it is meant to be read/printed, it could cause serious problems to use characters 0-31.
It isn't too big a deal for the most part, except that 0 is treated as "end of string" on many platforms. Though again, it depends entirely on how you're using it. My advice would be, at the very least, to avoid character 0. If you want the user to be able to copy and paste the string, then none of these would be suitable. They must be printable characters, in other words.
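If control characters are used anyway, one cautious option is to pick only codes that do not occur in the input and to skip 0 entirely. The sketch below is a hypothetical helper, not something from the question; note that ordinary text routinely contains tab (9), line feed (10) and carriage return (13).

```python
def free_control_codes(text: str, count: int) -> list:
    """Return up to `count` codes from 1-31 that do not appear in `text`."""
    used = {ord(ch) for ch in text if ord(ch) < 32}
    # Start at 1: code 0 is avoided because many platforms treat it as a terminator.
    return [code for code in range(1, 32) if code not in used][:count]
```

If fewer than the requested number of codes are free, the returned list is simply shorter, which the caller would need to check.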
Is there a way to set the maximum number of characters for GtkTextBuffer? What I need is when the maximum number of characters is reached, further typing produces no output. Beep would be nice but not required. Or maybe I need to do something with GtkTextView?
I need my GtkTextView to serve as a word-wrapping single line entry, so I want to disallow Return characters as well.
I'm looking for a solution for GTK+2 and GTK+3.
This question has a similar intent, but in this case, the poster wants to only display a certain number of trailing characters. ptomato (who would know more about it than I) recommends using a callback for typing to cut or truncate the string.
Your callback would be very similar, except that you would cut off the trailing characters and strip any newline characters as they were fed in.
You could also use a similar mechanism to the comment fields on SO, where you are notified that your string is too long and either disallow continuing or truncate it to size only when the data is actually being used.
For wrapping, you also want to read about wrapping modes, although you may already know about that.
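The text handling inside such a callback could look like the sketch below. It shows only the string logic in Python; how it is wired to the text buffer's signals is not shown here, and the function name is hypothetical.

```python
def clamp_single_line(text: str, max_chars: int) -> str:
    """Strip line breaks, then cut the text down to at most `max_chars` characters."""
    without_breaks = text.replace("\r", "").replace("\n", "")
    return without_breaks[:max_chars]
```

Stripping the line breaks before truncating means that removed newlines do not count against the limit.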
I have some ASCII-encoded files containing ascii representations of individual Unicode characters like ..., --, and so on that I'd like to convert to e.g. Unicode ellipsis and en-dash symbols for display purposes. This could be as simple as a simple replace filter over all such mappings (in the right order, to catch things like --- -> — and -- -> –, of course). (note: there are more than just those)
Does there exist a database of all such conversions somewhere? I assume the inverse must exist somehow to be able to gracefully convert unicode to plaintext whenever possible, e.g. … -> ....
It doesn't have to be extremely accurate or anything, as long as the conversion is appropriate in most cases and makes sense. The output will just be displayed to the user and won't be further processed. I could just compile a list myself as I go, but it would be nice to save time and avoid duplicating effort if it has already been done.
Thanks!
A comprehensive list isn't a very good idea as there are a lot of Unicode characters that exist for compatibility, or are poorly supported (see my comment). Instead, you probably want to use a curated list/library like SmartyPants (ports/alternatives can be found for most other languages).
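For a sense of scale, a hand-curated list with ordered replacements can be very small. The sketch below is an illustration in Python with a made-up three-entry table and function name; it is not the SmartyPants implementation. Longer patterns come first so that `---` is not consumed as `--` plus `-`.

```python
# Ordered (ASCII sequence, Unicode replacement) pairs; longest patterns first.
REPLACEMENTS = [
    ("---", "\u2014"),  # em dash
    ("--", "\u2013"),   # en dash
    ("...", "\u2026"),  # horizontal ellipsis
]


def smarten(text: str) -> str:
    """Apply each replacement in order over the whole text."""
    for ascii_form, unicode_form in REPLACEMENTS:
        text = text.replace(ascii_form, unicode_form)
    return text
```

The inverse conversion would apply the same table with the pairs swapped, where order no longer matters because each Unicode character is distinct.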