How does Linux determine filename case on ISO 9660? - linux

Here is a quote from this article:
ISO 9660 is not a complex file system, but has a few quirks that are
worth remembering. It seems that some operating systems also create
non-compliant CDs, so beware! The main example of this is the
character set that is available for file names. Strictly, filenames may
only consist of uppercase letters A-Z, digits, dots, and underscores.
Further there is a semicolon which separates the visible file name
from its version number suffix. Many operating systems also allow
lower case letters and other characters. Linux's VFS displays lower
case filenames to the user despite the CD contents actually containing
upper case characters.
So my question is, how does Linux know which letters are supposed to be uppercase and which letters are supposed to be lowercase, when on the CD they are all uppercase?

The ISO 9660 filesystem itself only supports filenames in the uppercase 8.3 format.
Some technologies have been designed over the years to extend ISO 9660 with features like long filenames, lowercase letters, and file permissions. Joliet is the Windows solution, while Rock Ridge is the one that works with Linux. In essence they store the original filename, with its proper case, in a lookup table recorded on the removable medium. More information is in the Wikipedia article for ISO 9660.
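For discs that carry no Rock Ridge or Joliet data at all there is no stored case information to consult: the Linux isofs driver's default name translation (the map=normal behaviour described in mount(8)) simply downcases the plain ISO 9660 names and strips the version suffix. A rough Python sketch of that mapping, for illustration only:

    def iso9660_default_map(name: str) -> str:
        """Sketch of the documented map=normal translation for non-Rock Ridge
        volumes: drop a trailing ';1' version suffix, map any remaining ';'
        to '.', and lowercase the ASCII letters."""
        if name.endswith(";1"):
            name = name[:-2]
        return name.replace(";", ".").lower()

    print(iso9660_default_map("README.TXT;1"))  # prints: readme.txt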

Related

What is the best way to search a large file for hexadecimal and export readable results to a file? (OS Agnostic)

My goal is to search a 500 GB file for a series of hexadecimal characters and to export the results into a text file. I need to automate this, as there are many patterns to be searched.
The results need to include the location in the file and the 100 preceding hex character values (represented in both hex and ASCII).
As noted, this is OS agnostic (and language agnostic, if anyone suggests scripts or code).
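One way to sketch this in Python (the file name and pattern below are made up, and "100 preceding hex characters" is read here as 100 bytes; adjust context if 50 bytes were meant). It scans the file in chunks, so a 500 GB file never has to fit in memory:

    import sys

    def find_pattern(path, hex_pattern, context=100, chunk_size=1 << 20):
        """Scan a large binary file for a hex pattern, yielding
        (offset, preceding_bytes) without loading the whole file."""
        needle = bytes.fromhex(hex_pattern)
        keep = max(len(needle) - 1, context)   # bytes carried between chunks
        tail = b""
        base = 0                               # file offset of tail[0]
        with open(path, "rb") as f:
            while True:
                chunk = f.read(chunk_size)
                if not chunk:
                    break
                buf = tail + chunk
                # matches entirely inside the carried-over tail were already reported
                start = max(len(tail) - len(needle) + 1, 0)
                while (i := buf.find(needle, start)) != -1:
                    yield base + i, buf[max(i - context, 0):i]
                    start = i + 1
                tail = buf[-keep:]
                base += len(buf) - len(tail)

    if __name__ == "__main__":
        path, pattern = sys.argv[1], sys.argv[2]   # e.g. image.bin deadbeef
        for offset, before in find_pattern(path, pattern):
            print(f"offset 0x{offset:x}")
            print("  hex:  ", before.hex())
            print("  ascii:", before.decode("ascii", errors="replace"))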

How do text editors store data above 1 byte?

The basic question is: how does Notepad (or another basic text editor) store data? I ran into this because I was trying to compare the file sizes of different compression techniques, and realized something isn't quite right.
To elaborate:
If I save a text file with the following contents:
a
The file is 1 byte. This one happens to be 97, or 0x61.
I create a text file with the following contents:
!"#$%&'()*+,-./0123456789:;<=>?#ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~ ¡¢£¤¥¦§¨©ª«¬­®¯°±²³´µ¶·¸¹º»¼½¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿ
Which is all the characters from 0-255, or 0x00 to 0xFF.
This file is 256 bytes. 1 byte for each character. This makes sense to me.
Then I append the following character to the end of the above string.
†
A character not contained in the above string. All 8-bit characters were already used. This character is 8224, or 0x2020: a 2-byte character.
And yet, the file size has only changed from 256 to 257 bytes. In fact, the above character saved by itself only takes 1 byte.
What am I missing?
Edit: Please note that in the second text block, many of the characters do not show up here.
In ANSI encoding (the 8-bit, Microsoft-specific encoding), each character is saved in one byte (8 bits).
ANSI is also called Windows-1252, or Windows Latin-1.
You should have a look at the ANSI table in the ANSI Character Codes Chart or the Windows-1252 article.
So for the † character, the code is 134, byte 0x86.
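A quick way to check this in a Python session (illustration only):

    >>> "†".encode("cp1252")     # Windows-1252 ("ANSI") stores it in one byte
    b'\x86'
    >>> ord("†"), hex(ord("†"))  # the Unicode code point is the 8224 / 0x2020 mentioned above
    (8224, '0x2020')
    >>> "†".encode("utf-8")      # the same character needs 3 bytes in UTF-8
    b'\xe2\x80\xa0'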
Using one byte to encode a character only makes sense on the surface. It works okay if you speak English, but it is a fair disaster if you speak Chinese or Japanese. Unicode today has definitions for 110,187 typographic symbols, with room to grow up to 1.1 million. A byte is not a good way to store a Unicode symbol since it can encode only 256 distinct values.
Accordingly, text editors must always encode text when they store it to a file. Encoding is required to map 110,187 values onto a byte-oriented storage medium. Inevitably that takes more than 1 byte per character if you speak Chinese.
There have been lots and lots of encoding schemes in common use. Popular in the previous century were code pages, a scheme built on language-specific character sets: each mapping tries as hard as it can to need only 1 byte of storage per character by picking the 256 characters most likely to be needed in the language. Japanese, Korean and Chinese used multi-byte mappings because they had to; other languages got by with 1.
Code pages have been an enormous disaster: a program cannot properly read a text file that was encoded in another language's code page. It worked while text files stayed close to the machine that created them; the Internet in particular broke that usage. Japanese was particularly prone to this disaster since it had more than one code page in common use. The result is called mojibake: the user looks at gibberish in the text editor. Unicode came around in 1992 to try to solve this disaster. One new standard to replace all the other ones, which tends to invite yet another kind of disaster.
You are subjected to that kind of disaster, particularly if you use Notepad, a program that tries to be compatible with text files created over the past 30 years. Google "bush hid the facts" for a hilarious story about that. Note the dialog you get when you use File > Save As: it has an extra combobox titled "Encoding". The default is ANSI, a broken name from the previous century that means "code page". As you found out, that character indeed needs only 1 byte in your machine's default code page. Which code page depends on where you live; it is 1252 in Western Europe and the Americas. You'd see 0x86 if you looked at the file with a hex viewer.
Given that the dialog gives you a choice, you should not favor ANSI's mojibake anymore; always favor UTF-8 instead. Maybe they'll update Notepad some day so it uses a better default, but that is very hard to do.
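To make the effect of that Encoding choice concrete, here is a small Python sketch (the file names are made up) that saves a similar string under several encodings and compares the resulting sizes:

    import os

    text = "a" * 256 + "†"   # roughly the shape of the original experiment

    for enc in ("cp1252", "utf-8", "utf-16"):
        with open(f"sample-{enc}.txt", "w", encoding=enc) as f:
            f.write(text)
        print(enc, os.path.getsize(f"sample-{enc}.txt"), "bytes")

    # cp1252 -> 257 bytes (one byte per character, including the †)
    # utf-8  -> 259 bytes (the † alone takes three bytes)
    # utf-16 -> 516 bytes (two bytes per character plus a 2-byte BOM)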

Rationale of fileencoding and encoding in vim or elsewhere

I don't get why there are both encoding and fileencoding in Vim.
To my knowledge, a file is like an array of bytes. When we create a text file, we create an array of characters (or symbols), encode this character array with encoding X into an array of bytes, and save the byte array to disk. When it is read in a text editor, the byte array is decoded with encoding X to reconstruct the original character array, and each character is displayed with a glyph according to the font. In this process, only one encoding is involved.
In Vim I set encoding and fileencoding to utf-8, following the Vim wiki about working with Unicode, which says:
encoding sets how vim shall represent characters internally. Utf-8
is necessary for most flavors of Unicode.
fileencoding sets the encoding for a particular file (local to
buffer)
"How vim shall represent characters internally" vs "encoding for a particular file"... resambles Unicode vs UTF-8? If so, why should a user bother with the former?
Any hint?
You're right; most programs have a fixed internal encoding (speaking of C datatypes, that's either char, which mostly then uses the underlying locale and may not be able to represent all characters, or UTF-8; or wchar_t (wide characters), which can represent the Unicode range). The choice is mainly driven by the programming language and the available APIs (as having to convert back and forth is tedious and not efficient).
Vim, because it supports a large variety of platforms (starting with the old Amiga, where development began) and is geared towards programmers and highly advanced users, allows you to configure the internal representation.
Heuristics:
As long as all characters are recognizable, you don't need to care.
If certain files don't look right, you have to teach Vim to recognize the encoding via 'fileencodings', or explicitly specify it.
If certain characters do not show up right, you need to switch the 'encoding'. With utf-8, you're on the safe side.
If you have problems in the terminal only, fiddle with 'termencoding'.
As you can see, though it can be confusing to the beginner, you actually have all the power available to you!
I'll preface this by saying that I'm not a vim expert by any means.
I think the flaw in your thinking is here:
When it is read in a text editor, the byte array is decoded with encoding X to reconstruct the original character array, and each character is displayed with a glyph according to the font.
The thing is, vim is not responsible for rendering the glyph here. vim reads bytes from a file, stores them internally and sends bytes to the terminal which renders the glyph using a font. vim itself never touches fonts and hence never really needs to understand "characters". It only needs to work with bytes internally which it moves back and forth between files, internal buffers and the terminal.
Hence, there are three possible different byte storages involved:
fileencoding
(internal) encoding
termencoding
vim will convert between those as necessary. It could read from a Shift-JIS encoded file, store the data internally as UTF-16 and send/receive I/O to/from the terminal in UTF-8. I am not sure why you'd want to change the internal byte handling of vim (again, not an expert), but in any case, you can alter that setting if you want to.
Hypothesising follows: If you set encoding to a Unicode encoding, you're safe to be able to handle any possible character you may encounter. However, in some circumstances those Unicode encodings may be too large to comfortably fit into memory in very limited systems, so in this case you may want to use a more specialised encoding if you know what you're doing.
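As an analogy for those three conversions (not Vim's actual implementation; the file name is hypothetical), here is a Python sketch where bytes come in as Shift-JIS, live internally as Unicode text, and go out to the terminal as UTF-8:

    import sys

    with open("notes.txt", "rb") as f:
        raw = f.read()                     # bytes exactly as stored on disk

    text = raw.decode("shift_jis")         # 'fileencoding': disk bytes -> internal text
    # ... editing happens on 'text', the internal representation ('encoding') ...

    sys.stdout.buffer.write(text.encode("utf-8"))   # 'termencoding': internal -> terminal

    with open("notes.txt", "wb") as f:
        f.write(text.encode("shift_jis"))  # written back out in the file's own encoding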

Different UTF-8 signature for same diacritics (umlauts) - 2 binary ways to write umlauts

I have quite a big problem and can't find any help on the web:
I moved a page of a website from OSX to Linux (both systems are running with de_DE.UTF-8) and ran into a rather obscure problem:
Some of the files were not found anymore, but they obviously existed on the hard drive with (visibly) the same name. All those files contained German umlauts.
I took one sample image, copied the original request URI from the webpage and called it directly: same error. After rewriting the filename it worked. And yes, I did not mistype it!
This surprised me, so I took a look into the Apache log, where I found these entries:
192.168.56.10 - - [27/Aug/2012:20:03:21 +0200] "GET /images/Sch%C3%B6ne-Lau-150x150.jpg HTTP/1.1" 304 0 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_4) AppleWebKit/537.1"
192.168.56.10 - - [27/Aug/2012:20:03:57 +0200] "GET /images/Scho%CC%88ne-Lau-150x150.jpg HTTP/1.1" 404 4205 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_4) AppleWebKit/537.1"
That was something for me to investigate ... Here's what I found in the UTF-8 character table at http://www.utf8-chartable.de/:
ö c3 b6 LATIN SMALL LETTER O WITH DIAERESIS
¨ cc 88 COMBINING DIAERESIS
I think you've already heard of dead-keys: http://en.wikipedia.org/wiki/Dead_key If not, read the article. It's quite interesting ;)
Does that mean that OSX saves all diacritics separately from the letter? Does that really mean that OSX saves the character ö as o and ¨ instead of using the real character that results from the combination?
If yes, do you know of a good script that I could use to rename these files? This won't be the first page I move from OSX to Linux ...
It's not quite the same thing as dead keys, but it's related. As you've worked out, U+00F6 and U+006F followed by U+0308 have the same visual result.
There are in fact Unicode rules for knowing when to treat them the same, based on decompositions. There's a decomposition table in the character database that tells us that U+00F6 canonically decomposes to U+006F followed by U+0308.
As well as canonical decompositions, there are compatibility decompositions. These lose some information; for example, ² ends up being decomposed to 2. This is clearly a destructive change, but it is useful for searching when you want to be a bit fuzzy (it is how Google knows a search for fiſh should return results about fish).
If there is more than one combining character after a non-combining character, we can re-order them as long as we don't re-order those of the same class. This becomes clear when we consider that it doesn't matter whether we put a cedilla on something and then an acute accent, or the acute and then the cedilla, but if we put both an acute and an umlaut on a letter it clearly matters which way around they go.
From this, we have 4 normalisation forms. Put strings into an appropriate normalisation form before doing comparisons, and you don't get tripped up.
NFD: Break everything apart by canonically decomposing it as much as possible. Reorder combining characters in order of their combining class, but keep any with the same class in the same order relative to each other.
NFC: First put everything into NFD. Then look continually at the combining characters in order: if there isn't an earlier one of the same class and there is an equivalent single precomposed character, replace them, and re-do the scan looking to compose further.
NFKD: Like NFD, but using compatibility decomposition (damaging change, but useful for comparisons as explained above).
NFKC: Do NFKD, then re-combine (canonical compositions only) as per NFC.
There are also some re-combinations banned from use in NFC so that text that was valid NFC in one version of Unicode doesn't cease to be NFC if Unicode has more characters added to it.
Of NFD and NFC, NFC is clearly the more concise. It's not the most concise possible, but it is one that is very concise and can be tested for and/or created in a very efficient streaming manner.
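A quick illustration with Python's unicodedata module; the two byte sequences from the Apache log above fall straight out of NFC and NFD:

    >>> import unicodedata
    >>> nfc = unicodedata.normalize("NFC", "Schöne")
    >>> nfd = unicodedata.normalize("NFD", "Schöne")
    >>> nfc == nfd            # visually identical, but different code point sequences
    False
    >>> nfc.encode("utf-8").hex()
    '536368c3b66e65'
    >>> nfd.encode("utf-8").hex()
    '5363686fcc886e65'
    >>> unicodedata.normalize("NFC", nfd) == nfc   # compare in one form and they match
    True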
Mac OSX uses NFD for file names. Because they're weirdos. (Okay, there are better arguments than that, they just didn't convince me!)
The Web Character Model uses NFC.* As such, you should use NFC on web stuff as much as possible. There can though be security considerations in blindly converting stuff to NFC. But if it starts from you, it should start in NFC.
Any programming language that deals with text should have a nice way of normalising text into any of these forms. If yours doesn't, complain (or if yours is open source, contribute!).
See http://unicode.org/faq/normalization.html for more, or http://unicode.org/reports/tr15/ for the full gory details.
*For extra fun, if you inserted something beginning with a combining long solidus overlay (U+0338) at the start of an XML or HTML element's content, it would turn the > of the tag into ≯, turning well-formed XML into gibberish. For this reason the web character model insists that each entity must itself be NFC and not start with a combining character.
Thanks, Jon Hanna, for a lot of background information here! It was important for getting to the full answer: a way to convert from one normalisation form to the other.
As my changes are in the filesystem (because of file uploads), which is linked from the database, I now have to update my database dump. The files already got renamed during the move (maybe by the FTP client ...).
Command line tools to convert charsets on Linux are:
iconv - converting the content of a stream (maybe a file)
convmv - converting the filenames in a directory
The charset utf-8-mac (as described at http://loopkid.net/articles/2011/03/19/groking-hfs-character-encoding), which I could use with iconv, seems to exist only on OSX systems, so I would have to move my SQL dump to my Mac, convert it and move it back. Another option would be to rename the files back to NFD using convmv, but I think that would hinder more than help in the future.
The tool convmv has a built-in (OS-independent) option for enforcing NFC- or NFD-compliant filenames: http://www.j3e.de/linux/convmv/man/
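If convmv is not at hand, roughly the same renaming can be done with a few lines of Python (a sketch only; the path is hypothetical and a UTF-8 filesystem is assumed):

    import os
    import unicodedata

    def rename_to_nfc(root):
        """Rename every file and directory below 'root' to the NFC form of its
        name (useful after copying NFD-named files over from HFS+)."""
        for dirpath, dirnames, filenames in os.walk(root, topdown=False):
            for name in filenames + dirnames:
                nfc = unicodedata.normalize("NFC", name)
                if nfc != name:
                    os.rename(os.path.join(dirpath, name),
                              os.path.join(dirpath, nfc))

    rename_to_nfc("/var/www/wordpress/wp-content/uploads")  # hypothetical path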
PHP itself (the language my system, Wordpress, is based on) offers a compatibility layer here:
In PHP, how do I deal with the difference in encoded filenames on HFS+ vs. elsewhere? After I have fixed this issue for myself, I will go and write some tests and may also file a bug report with Wordpress and the other systems I work with ;)
Linux distros treat filenames as binary strings, meaning no encoding is assumed - though the graphical shell (Gnome, KDE, etc) might make some assumptions based on environment variables, locale, etc.
OS X, on the other hand, requires or enforces (I forget which) its own variant of UTF-8, with Unicode normalization that expands all diacritics into combining characters.
On Linux when people do use Unicode in filenames they tend to prefer UTF-8 with precomposed characters when it comes to diacritics.

How would one store German text in an embedded system?

I've created a memory-mapped 1-bit interface to an LCD in an embedded system, along with 4 or 5 bitmapped fonts for the 90+ printable ASCII characters. Writing to the screen is as simple as using an echo-like statement (it's embedded Linux).
Other than something strictly proprietary, what recommendations can people make for storing German (or Spanish, or French, for that matter)? Unicode seems to be a pretty heavy hitter.
If I understand you right, you are searching for a lightweight encoding for German characters? In Europe, you would normally use Latin-1 or, better, ISO 8859-15. This is an 8-bit ASCII extension containing most of the characters used by Western languages.
Well, UTF-8 isn't that big. I recommend it if you want to be able to use one or more languages for which you can't find a matching character in Latin-1.
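A quick comparison of what the two options actually cost for German text (Python, illustration only):

    text = "Größenänderung über Ärger"   # arbitrary sample with umlauts and ß

    for enc in ("iso-8859-15", "utf-8"):
        print(enc, len(text.encode(enc)), "bytes for", len(text), "characters")

    # iso-8859-15 -> 25 bytes (1 byte per character)
    # utf-8       -> 30 bytes (the umlauts and ß take 2 bytes each)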
