As far as I can tell I shouldn't be using ÅÄÖ (something like they have no visual representation in ASCII??).
So what is considered more SEO friendly? Replacing, e.g., all "ä" with "a" or "ae"? (The CMS Umbraco replaces with ae and I'm leaning towards this).
EDIT: Summary of how some Swedish sites do it:
aftonbladet.se/ ä => a (http://www.aftonbladet.se/kropphalsa/)
uppsatser.se/ ä => ä (http://www.uppsatser.se/om/v%C3%A5rd+av/)
lindqvist.com/ ä => a (http://www.lindqvist.com/b/google-maps-placering-ar-gratis)
umbraco CMS sites (like vaxab.se) ä => ae (http://vaxab.se/tjaenster.aspx)
dn.se/ ä => a (http://www.dn.se/sthlm/brak-utanfor-aspuddsbadet-1.1008899)
In my opinion, if accented characters have to be retained for their real meaning, then they shouldn't be replaced.
In most cases, replacing them with their ASCII alphabet cousins should be fine. E.g. replacing "résumé" with "resume" makes sense, as that's what people search for and understand given the right context.
Otherwise, retain the accented characters using their URL-encoded representation, e.g.:
%C3%85%C3%84%C3%96 is the URL-encoded representation of "ÅÄÖ" in UTF-8 (%C5%C4%D6 is the same text encoded as Latin-1).
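As a rough illustration of the options above, here is a minimal Python sketch. The names strip_diacritics and swedish_ae are made up for this example, and the "vowel + e" mapping is only an assumption about what a CMS such as Umbraco does, not its actual code.

```python
import unicodedata
from urllib.parse import quote


def strip_diacritics(text):
    """Replace accented letters with their plain base letters (ä -> a)."""
    # NFKD splits "ä" into "a" plus a combining diaeresis, which we then drop.
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# Hypothetical "vowel + e" style mapping for the Swedish letters.
_AE_MAP = str.maketrans({
    "å": "aa", "ä": "ae", "ö": "oe",
    "Å": "Aa", "Ä": "Ae", "Ö": "Oe",
})


def swedish_ae(text):
    """Replace å/ä/ö with aa/ae/oe, leaving everything else untouched."""
    return text.translate(_AE_MAP)


print(strip_diacritics("résumé"))         # plain-letter variant
print(swedish_ae("tjänster"))             # "ae" variant
print(quote("ÅÄÖ"))                       # percent-encoded UTF-8 (the default)
print(quote("ÅÄÖ", encoding="latin-1"))   # percent-encoded Latin-1
```

Which variant is friendlier to search engines is not something this sketch can settle; it only shows that each convention is a few lines of code.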
Here are some tips.
For what it's worth, Wikipedia, one of the biggest and most prominent sites on the Internet, preserves all sorts of weird diacritics in its URLs. Conventions for changing diacritics to plain letters will vary between languages: Acute and grave accents in French can be safely removed. (The French themselves tend to remove them when writing in block capitals.) Fadas in Irish should be retained if possible, or cleanly removed. Diaereses may be simply stripped, or replaced with a double vowel, or replaced with vowel+e, depending on the language. And for languages written in the Greek or Cyrillic alphabets, there's no obvious system unless you want to go for full transcription.
And there's no need: ru.wikipedia.org/wiki/Заглавная_страница
I'd suggest using the IDNA ToASCII that is the standard for internationalised top-level domains. ICANN have recently approved native-script top level domains using this algorithm, so I'd be surprised if any of the major search engines decided not to support it in future.
http://en.wikipedia.org/wiki/Internationalized_domain_name
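For reference, Python ships an "idna" codec that performs a ToASCII-style conversion. A small sketch follows, assuming a made-up host name; note that the built-in codec follows the older IDNA 2003 rules and that this conversion is defined for host names, so applying it to URL paths would be a choice of your own rather than part of the standard.

```python
# Made-up example host name containing Swedish letters.
hostname = "räksmörgås.se"

# ToASCII: each non-ASCII label becomes an "xn--" Punycode label.
ascii_form = hostname.encode("idna")
print(ascii_form)

# ToUnicode: the conversion is reversible.
round_trip = ascii_form.decode("idna")
print(round_trip)
```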
I have a file with the hex content db90 3031 46, which should be displayed in vim as "ې" followed by "01F", but I noticed that it is never displayed correctly. Then I noticed that it is the same in other places: in the terminal and in the browser I always get ې01F. Why is that? Just paste that into Google and try it yourself; you will never be able to put "ې" with 0 as the next character.
That's an Arabic character with right-to-left indicator, so you probably need to switch back to left-to-right mode, such as with U+200e.
The Unicode bidirectional stuff is rather complex - the behaviour you are seeing is probably caused by the fact that the Latin digits are marked EN = European number (a weak type), while letters such as F are marked L = left to right (a strong type).
Weak types are treated differently in the Unicode specification, such as with this quote which covers your particular case (my emphasis):
Problematic cases may occur when a right-to-left paragraph begins with left-to-right characters, or there are nested segments of different-direction text, or there are weak characters on directional boundaries. In these cases, embeddings or directional marks may be required to get the right display.
So your code point followed by a digit renders as "ې7" (I typed that 7 in after the Arabic character despite the fact it's showing up before it), while following it with a letter gives "ېX".
For what it's worth, the text "ې7" was generated here by inserting between the two characters, the HTML equivalent of the U+200e Unicode code point.
If you head on over to this UTF-8 codec site and enter %u06D0%u200e7 into the decoding section, you'll see that it comes out in your desired order (removing the %u200e shows it in the order you're describing in your question).
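The character classes mentioned above can be inspected from Python. This is only a sketch for looking at the data; the variable names are made up, and how the result is drawn still depends on the terminal or editor doing the rendering.

```python
import unicodedata

# The bytes from the question, decoded as UTF-8.
raw = bytes.fromhex("db90303146")
text = raw.decode("utf-8")

# Print each code point with its Unicode bidirectional class.
for ch in text:
    print(f"U+{ord(ch):04X}", unicodedata.bidirectional(ch))

# Insert U+200E LEFT-TO-RIGHT MARK after the Arabic letter so that the
# weak-typed digits sit next to a strong left-to-right character.
fixed = text[0] + "\u200e" + text[1:]
print([f"U+{ord(ch):04X}" for ch in fixed])
```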
When pasting text from outside sources into a plain-text editor (e.g. TextMate or Sublime Text 2) a common problem is that special characters are often pasted in as well. Some of these characters render fine, but depending on the source, some might not display correctly (usually showing up as a question mark with a box around it).
So this is actually 2 questions:
Given a special character (e.g., ’ or ♥) can I determine the UTF-8 character code used to display that character from inside my text editor, and/or convert those characters to their character codes?
For those "extra-special" characters that come in as garbage, is there any way to figure out what encoding was used to display that character in the source text, and can those characters somehow be converted to UTF-8?
My favorite site for looking up characters is fileformat.info. They have a great Unicode character search that includes a lot of useful information about each character and its various encodings.
If you see the question mark with a box, that means you pasted something that can't be interpreted, often because it's not legal UTF-8 (not every byte sequence is legal UTF-8). One possibility is that it's UTF-16 with an endian mode that your editor isn't expecting. If you can get the full original source into a file, the file command is often the best tool for determining the encoding.
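If a scripting language is at hand, both questions can be approached with a few lines of Python. This is a sketch only: describe and try_decodings are hypothetical helper names, and the list of candidate encodings is a guess that you would adapt to your source.

```python
import unicodedata


def describe(ch):
    """Return the code point, Unicode name, and UTF-8 bytes of one character."""
    code_point = f"U+{ord(ch):04X}"
    name = unicodedata.name(ch, "<unnamed>")
    utf8_hex = " ".join(f"{b:02x}" for b in ch.encode("utf-8"))
    return code_point, name, utf8_hex


def try_decodings(raw, candidates=("utf-8", "utf-16", "cp1252")):
    """Decode raw bytes with each candidate encoding; keep the ones that work."""
    results = {}
    for encoding in candidates:
        try:
            results[encoding] = raw.decode(encoding)
        except UnicodeDecodeError:
            pass  # this encoding cannot explain the bytes
    return results


for ch in "’♥":
    print(describe(ch))

# A single byte 0x92, as a Windows-1252 source would send for ’.
print(try_decodings(b"\x92"))
```

A successful decode does not prove the guess is right, since many byte sequences are valid in several encodings; it only narrows the candidates.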
At &what I built a tool to focus on searching for characters. It indexes all the Unicode and HTML entity tables, but also supplements with hacker dictionaries and a database of keywords I've collected, so you can search for words like heart, quot, weather, umlaut, hash, cloverleaf and get what you want. By focusing on search, it avoids having to hunt around the Unicode pages, which can be frustrating. Give it a try.
I was (re)reading Joel's great article on Unicode and came across this paragraph, which I didn't quite understand:
For example, you could encode the Unicode string for Hello (U+0048
U+0065 U+006C U+006C U+006F) in ASCII, or the old OEM Greek Encoding,
or the Hebrew ANSI Encoding, or any of several hundred encodings that
have been invented so far, with one catch: some of the letters might
not show up! If there's no equivalent for the Unicode code point
you're trying to represent in the encoding you're trying to represent
it in, you usually get a little question mark: ? or, if you're really
good, a box. Which did you get? -> �
Why is there a question mark, and what does he mean by "or, if you're really good, a box"? And what character is he trying to display?
There is a question mark because the encoding process recognizes that the encoding can't support the character, and substitutes a question mark instead. By "if you're really good," he means, "if you have a newer browser and proper font support," you'll get a fancier substitution character, a box.
In Joel's case, he isn't trying to display a real character, he literally included the Unicode replacement character, U+FFFD REPLACEMENT CHARACTER.
It’s a rather confusing paragraph, and I don’t really know what the author is trying to say. Anyway, different browsers (and other programs) have different ways of handling problems with characters. A question mark “?” may appear in place of a character for which there is no glyph in the font(s) being used, so that it effectively says “I cannot display the character.” Browsers may alternatively use a small rectangle, or some other indicator, for the same purpose.
But the “�” symbol is REPLACEMENT CHARACTER that is normally used to indicate data error, e.g. when character data has been converted from some encoding to Unicode and it has contained some character that cannot be represented in Unicode. Browsers often use “�” in display for a related purpose: to indicate that character data is malformed, containing bytes that do not constitute a character, in the character encoding being applied. This often happens when data in some encoding is being handled as if it were in some other encoding.
So “�” does not really mean “unknown character”, still less “undisplayable character”. Rather, it means “not a character”.
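The two substitutions discussed above can be reproduced in Python, as a small sketch using the codecs' own "replace" error handling (the sample strings are made up):

```python
# Encoding direction: the target encoding has no slot for the character,
# so the encoder substitutes a plain question mark.
encoded = "naïve ♥".encode("ascii", errors="replace")
print(encoded)

# Decoding direction: the bytes are not valid in the assumed encoding,
# so the decoder substitutes U+FFFD REPLACEMENT CHARACTER.
decoded = b"caf\xe9".decode("utf-8", errors="replace")
print(decoded)
print(f"U+{ord(decoded[-1]):04X}")
```

A box, by contrast, is a font matter: the text decoded fine, but the font in use has no glyph for the character, which is not something the codec layer can show.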
A question mark appears when a byte sequence in the raw data does not match the data's character set, so it cannot be decoded properly. That happens if the data is malformed, if the data's charset is explicitly stated incorrectly in the HTTP headers or the HTML itself, if the charset is guessed incorrectly by the browser when other information is missing, or if the user's browser settings override the data's charset with an incompatible charset.
A box appears when a decoded character does not exist in the font that is being used to display the data.
Just what it says - some browsers show "a weird character" or a question mark for characters outside of the current known character set. It's their "hey, I don't know what this is" character. Get an old version of Netscape, paste some text from Microsoft Word which is using smart quotes, and you'll get question marks.
http://blog.salientdigital.com/2009/06/06/special-characters-showing-up-as-a-question-mark-inside-of-a-black-diamond/ has a decent explanation.
Using vim 7.2.330 on an Ubuntu host from an XP host, I'm stuck at how to type/paste the following line in a text file:
include_once(‘/full/path/to/app’);
The document says it's important to use ASCII 145 and 146, but vim turns them into "<92><93>", and Nano turns them into �.
Note that I'm using a European keyboard layout, not the US layout.
Does someone know how to solve this?
Thank you.
Er, you should not be using the 2 types of special quotes for string quoting in PHP.
You should be typing
include_once('/full/path/to/app');
( That's ASCII character 39 )
This is not what it says at the end of
this document:
www.wpbbpthemes.org/integration/
"beware some pasting of this code make
the ‘ character change, make sure it’s
the button left of the enter key on
your [US] keyboard"
No, you are misinterpreting it. Lots of software on Windows, and varying keyboards, erroneously produce "smart quotes". Word and Internet Explorer are examples. As a result, copy-pasting from these applications puts the wrong type of character in your source code, often conflicting with the content-encoding the document is served as, which renders in the displaying browser as a silly Ä or similar character.
Do not use characters 145 and 146 in your PHP source, it is not necessary, and it won't work.
Also, it is important to note that the authors of that page have USED THE WRONG QUOTES IN THEIR EXAMPLES, which as such WILL NOT WORK AS STATED.
Their statement "beware some pasting will make the character change" is misleading: they have the incorrect character in their own source, so copy-pasting it will never work.
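To see what bytes 145 and 146 actually are, and to repair text that already contains them, something like the following Python sketch could be used. It assumes the text came from a Windows-1252 source, and straighten_quotes is a made-up helper name.

```python
# Bytes 145 and 146 are the curly single quotes in Windows-1252, not ASCII.
curly = bytes([145, 146]).decode("cp1252")
print([f"U+{ord(ch):04X}" for ch in curly])

# Map the common "smart" quotes to the plain ASCII quotes PHP expects.
_SMART_TO_ASCII = str.maketrans({
    "\u2018": "'", "\u2019": "'",   # single curly quotes -> ASCII 39
    "\u201c": '"', "\u201d": '"',   # double curly quotes -> ASCII 34
})


def straighten_quotes(text):
    """Replace curly quotes with their straight ASCII equivalents."""
    return text.translate(_SMART_TO_ASCII)


print(straighten_quotes("include_once(‘/full/path/to/app’);"))
```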
In an application that accepts, stores, processes, and displays Unicode text (for the purpose of discussion, let's say that it's a web application), which characters should always be removed from incoming text?
I can think of some, mostly listed in the C0 and C1 control codes Wikipedia article:
The range 0x00-0x1F (mostly control characters), excluding 0x09 (tab), 0x0A (LF), and 0x0D (CR)
The range 0x7F-0x9F (more control characters)
Ranges of characters that can safely be accepted would be even better to know.
There are other levels of text filtering — one might canonicalize characters that have multiple representations, replace nonbreaking characters, and remove zero-width characters — but I'm mainly interested in the basics.
See the W3 Unicode in XML and other markup languages note. It defines a class of characters as ‘discouraged for use in markup’, which I'd definitely filter out for most web sites. It notably includes such characters as:
U+2028–9 which are funky newlines that will confuse JavaScript if you try to use them in a string literal;
U+202A–E which are bidi control codes that wily users can insert to make text appear to run backwards in some browsers, even outside of a given HTML element;
language override control codes that could also have scope outside of an element;
BOM.
Additionally, you'd want to filter/replace the characters that are not valid in Unicode at all (U+FFFF et al), and, if you are using a language that works in UTF-16 natively (eg. Java, Python on Windows), any surrogate characters (U+D800–U+DFFF) that do not form valid surrogate pairs.
The range 0x00-0x1F (mostly control characters), excluding 0x09 (tab), 0x0A (LF), and 0x0D (CR)
And arguably (esp for a web application), lose CR as well, and turn tabs into spaces.
The range 0x7F-0x9F (more control characters)
Yep, away with those, except in case where people might really mean them. (SO used to allow them, which allowed people to post strings that had been mis-decoded, which was occasionally useful for diagnosing Unicode problems.) For most sites I think you'd not want them.
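A baseline filter along these lines might look like the Python sketch below. The exact set of characters is a policy decision: this version is an assumption that strips the C0 controls (keeping tab, LF, and CR), DEL and the C1 controls, U+2028-9, the bidi controls U+202A-E, the BOM, the noncharacters U+FFFE and U+FFFF, and lone surrogates. The name sanitize is made up.

```python
import re

# Characters removed by this sketch; adjust the class to your own policy.
_DISALLOWED = re.compile(
    r"["
    r"\x00-\x08\x0b\x0c\x0e-\x1f"   # C0 controls except TAB, LF, CR
    r"\x7f-\x9f"                    # DEL and the C1 controls
    r"\u2028\u2029"                 # line and paragraph separators
    r"\u202a-\u202e"                # bidi embedding/override controls
    r"\ufeff"                       # byte order mark
    r"\ufffe\uffff"                 # noncharacters
    r"\ud800-\udfff"                # surrogates (always unpaired in a Python str)
    r"]"
)


def sanitize(text):
    """Remove the disallowed characters, leaving all other text untouched."""
    return _DISALLOWED.sub("", text)


print(repr(sanitize("a\x00b\tc\u202ed\ufeff")))
```

Whether to delete these characters or replace them with a visible marker (a space or U+FFFD, say) depends on whether you want the loss to be noticeable.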
I suppose it depends on your purpose. You could limit the user to the keyboard characters if that is your whim, which is 9, 10, 13 and 32-126. If you are using UTF-8, bytes in the 0x80+ range signify that you have a multi-byte Unicode character. In the 8-bit extensions of ASCII, 0x80+ consists of special display/format characters and localized extensions that depend on the language at the location.
Note that the keyboard characters can differ depending on location, since users can input characters in their native language, which will be outside the 0x00-0x7f range if their language uses accents or a non-Latin script (Arabic, Chinese, Japanese, Greek, Cyrillic, etc.).
If you take a look here you can see what characters from UTF-8 will display.
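The byte ranges mentioned above are easy to check in Python. A small sketch, with made-up helper names utf8_bytes and is_keyboard_ascii:

```python
def utf8_bytes(ch):
    """Return the UTF-8 bytes of a character as a list of integers."""
    return list(ch.encode("utf-8"))


def is_keyboard_ascii(text):
    """True if text uses only tab, LF, CR, and the printable ASCII range 32-126."""
    return all(ord(ch) in (9, 10, 13) or 32 <= ord(ch) <= 126 for ch in text)


for ch in ("A", "ä", "Ж", "中"):
    # ASCII stays a single byte below 0x80; everything else becomes
    # two or more bytes that are all 0x80 or above.
    print(ch, [hex(b) for b in utf8_bytes(ch)])

print(is_keyboard_ascii("plain text"), is_keyboard_ascii("tjänster"))
```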