Permalinks: Why are hyphens (-) used over underscores (_) to replace spaces (and other unwanted characters)?

In some large web systems I have come across lately, friendly permalinks, i.e. the part of a URL path that is based on an (often user-specified) string rather than a numeric ID, have their spaces (and other unwanted or disallowed characters that would otherwise need to be URL-escaped) replaced by hyphens (-), and not by underscores (_).
An example:
In the URL http://example.com/blog/this-is-my-first-post, this-is-my-first-post is the friendly permalink. Using underscores, this would be http://example.com/blog/this_is_my_first_post.
Is this only a personal preference, or is there a technical reason to use hyphens over underscores?
Hypothetical possibilities I thought of:
Maybe it matters for Search Engine Optimization?
Maybe it is actually important for how URL paths are interpreted?
Maybe there is a historical reason?
What I do know:
Hyphens are treated as word breaks in most (if not all?) computer systems/programs; e.g. use Ctrl+Left/Ctrl+Right to move through a sentence_that_uses_underscores vs. a sentence-that-uses-hyphens.
In normal text that a user enters (e.g. names for objects or blog posts), actual hyphens are used more often than underscores.
Could someone shed some light on this?

Google has spoken:
Consider using punctuation in your URLs. The URL http://www.example.com/green-dress.html is much more useful to us than http://www.example.com/greendress.html. We recommend that you use hyphens (-) instead of underscores (_) in your URLs.
https://support.google.com/webmasters/answer/76329?hl=en
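For what it's worth, the replacement itself is simple to do in code. A minimal slugify sketch in Python (the function name and the exact character rules here are my own choice, not anything Google prescribes):

import re
import unicodedata

def slugify(title: str) -> str:
    """Turn a user-entered title into a hyphen-separated permalink segment."""
    # Normalize and drop accents so the slug stays ASCII-only.
    ascii_title = unicodedata.normalize("NFKD", title).encode("ascii", "ignore").decode("ascii")
    # Replace every run of characters that is not a letter or digit with a single hyphen.
    return re.sub(r"[^a-zA-Z0-9]+", "-", ascii_title).strip("-").lower()

print(slugify("This is my first post!"))  # this-is-my-first-post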

Related

Unicode character order problem when text is displayed

I am working on an application that converts text into other characters from the extended ASCII character set, which are then displayed in a custom font.
The program essentially parses the input string with a regex, locates the standard characters, and outputs them converted, before returning a string with the modified text, which displays correctly when viewed with the correct font.
Every now and again, the function returns a string where the characters are displayed in the wrong order, almost as if they are corrupted or some data is missing from the Unicode double-width spacing. I have examined the binary output and the hex data, and inspected the data in the function before I return it, and everything looks OK, but every once in a while something goes wrong and I can't quite put my finger on it.
To see an example of what I mean when I say the order is weird, just take a look at the following piece of converted text output from the program and try to highlight it with your mouse. You will see that it doesn't highlight in the order you expect, despite how it appears.
Has anyone seen anything like this before and have they any ideas as to what is going on?
ך┼♫יἯ╡П♪דἰ
You are mixing various Unicode characters with different LTR/RTL characteristics.
LTR means "left-to-right" and is the direction in which English (and many other Western-language) text is written.
RTL is "right-to-left" and is used mostly by Arabic and Hebrew (as well as several other scripts).
By default, when rendering Unicode text, the engine will try to use the directionality of the characters to figure out which direction a given part of the text should go. Normally that works just fine, because Hebrew words will have only Hebrew letters and English words will only use letters from the Latin alphabet, so for each chunk there's an easily guessable direction that makes sense.
But you are mixing letters from different scripts and with different directionality.
For example ך is U+05DA HEBREW LETTER FINAL KAF, but you also use two other Hebrew characters. You can use something like this page to list the Unicode characters you used.
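If you'd rather inspect the string locally, here is a quick diagnostic sketch using Python's unicodedata module (not part of the asker's program, just one assumed way to list the code points and their bidi classes):

import unicodedata

text = "ך┼♫יἯ╡П♪דἰ"  # the converted output from the question
for ch in text:
    # Print the code point, its Unicode name, and its bidirectional class
    # ('R' for the Hebrew letters, 'L' for the Greek/Cyrillic ones, 'ON' for the symbols).
    print(f"U+{ord(ch):04X} {unicodedata.name(ch, '<unnamed>')} bidi={unicodedata.bidirectional(ch)}")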
You can either
not use "wrong" directionality letters or
make the direction explicit using a Left-to-right mark character (see the sketch below).
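A minimal sketch of that second option, assuming the output is meant to read left to right; the helper name and the exact insertion rule are mine:

import unicodedata

LRM = "\u200E"  # U+200E LEFT-TO-RIGHT MARK

def force_ltr(s: str) -> str:
    """Follow every strongly right-to-left character with a LEFT-TO-RIGHT MARK
    so a bidi-aware renderer keeps the visual order left-to-right."""
    out = []
    for ch in s:
        out.append(ch)
        if unicodedata.bidirectional(ch) in ("R", "AL"):  # Hebrew / Arabic letters
            out.append(LRM)
    return "".join(out)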
Edit: Last but not least: I just realized that you said "custom font". If you expect the text to be displayed with a specific custom font, then you should really be using one of the Private Use Areas in Unicode: they are explicitly reserved for private use like this (i.e. where the characters don't match the publicly defined glyphs for the code points). That would also avoid surprises like the ones you are getting, where some of the characters used have different rendering properties.

Ellipsis ignored in search engines

I noticed that the ellipsis '...' is ignored in all the search engines I tested, including the Stack Overflow search engine, when I was trying to search for literature to address this post. Is there any way to avoid this?
Sadly this is not directly possible; most search engines ignore all punctuation. Google has made an effort to allow certain types to be searched, like "," or ".", but all of these are special cases and don't follow a general rule.
What you can do, however, is search for "sizeof ellipsis", which brings up the results of thoughtful writers.
Side note: "..." in particular is somewhat special in ordinary text; there are lots of different ways to use it in printing, with differences in spacing, bracing, vertical position, etc. Also, there is the possibility of using the HTML entity &hellip;, which results in the single character …, as does Unicode U+2026 HORIZONTAL ELLIPSIS.
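For completeness, a small sketch of that entity/code-point relationship, and of normalizing three ASCII dots to the single character (using Python's standard html module):

import html

# The HTML entity &hellip; decodes to the single character U+2026 HORIZONTAL ELLIPSIS.
print(html.unescape("&hellip;"))               # …
print(html.unescape("&hellip;") == "\u2026")   # True

# Normalizing three ASCII dots to the single ellipsis character:
print("to be continued...".replace("...", "\u2026"))  # to be continued…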

What kind of sign is "‎" and what is it used for

What kind of sign is "‎" and what is it used for (note that there is an invisible sign there)?
I have searched through all my documents and found a lot of them. They messed up my .htaccess file. I think I got them when I copied web addresses from Google to redirect. So maybe this is also a warning to search through your own documents for this one :)
It is U+200E LEFT-TO-RIGHT MARK. (A quick way to check out such things is to copy a string containing the character and paste it in the writing area in my Full Unicode input utility, then click on the “Show U+” button there, and use Fileformat.Info character search to check out the name and other properties of the character, on the basis of its U+... number.)
The LEFT-TO-RIGHT MARK sets the writing direction of directionally neutral characters. It does not affect e.g. English or Arabic words, but it may mess up text that contains parentheses for example – though for text in English, there should be no confusion in this sense.
But, of course, when text is processed programmatically, as when a web server processes a .htaccess file, such characters are just character data and can make a big difference.
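If you want to hunt them down programmatically, a minimal sketch (the file name and the set of marks looked for are just assumptions for illustration):

# Report any stray directional marks hiding in a config file such as .htaccess.
MARKS = {"\u200E": "LEFT-TO-RIGHT MARK", "\u200F": "RIGHT-TO-LEFT MARK"}

with open(".htaccess", encoding="utf-8") as f:
    for lineno, line in enumerate(f, start=1):
        for ch in line:
            if ch in MARKS:
                print(f"line {lineno}: found U+{ord(ch):04X} {MARKS[ch]}")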

Arabic and other Right-to-left slugs ok?

I'm creating a multilingual site, where an item has a slug for each of the site's languages.
For Arabic slugs (and I assume any other right-to-left languages) it acts strangely when you try to highlight it: the cursor moves in the opposite direction while in the RTL text, etc. This isn't a terribly big deal, but it made me think that maybe it's not "normal" or "ok".
I've seen some sites with Arabic slugs, but I've also seen completely Arabic sites that still use English slugs.
Is there a benefit one way or the other? A suggested method? Is doing it the way I am ok? Any suggestions welcome.
I suppose that by "slug" you mean a direct permanent URL to a page. If you don't, you can ignore the rest of this answer :)
Such URLs will work, but avoid them if you can. The fact that it's right-to-left is actually not the worst problem. The worst problem with any non-ASCII URL is that in a lot of contexts it will show up like this: https://ar.wikipedia.org/wiki/%D9%86%D9%87%D8%B1_%D9%81%D8%A7%D8%B1%D8%AF%D8%A7%D8%B1 (it's just a link to a random article in the Arabic Wikipedia). You will see a long trail of percent signs and numbers, even if the title is short, because each non-ASCII character turns into about six ASCII characters (an Arabic letter is two bytes in UTF-8, and each byte becomes a three-character %XX escape). This gets very ugly when you have to paste it into an email, for example.
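You can reproduce the blow-up in a couple of lines of Python (just an illustration of the percent-encoding rule, using the URL above):

from urllib.parse import quote, unquote

encoded = "%D9%86%D9%87%D8%B1_%D9%81%D8%A7%D8%B1%D8%AF%D8%A7%D8%B1"  # path from the Wikipedia URL above
title = unquote(encoded)          # the original 10-character Arabic title
print(len(title), len(encoded))   # 10 55: each Arabic letter is 2 UTF-8 bytes, i.e. 6 encoded characters
assert quote(title) == encoded    # quoting it again gives back the long percent-encoded form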
I write a blog in Hebrew and I manually change the slug of every post to some ASCII name.

Space in URL; did the browser get smarter, or the server?

It looks like today you no longer have to encode spaces as %20 in your HTML links or image links. For example, suppose you have this image at 〔http://example.com/i/my house.jpg〕. Notice the space there. In your HTML code, you can just do this:
<img src="http://example.com/i/my house.jpg" alt="my house">
It works in all current versions of browsers. What I'm not sure about, though, is whether the browser encodes it before requesting the URL, or whether a particular server (Apache) will do the right thing with paths containing spaces.
Addendum:
Sorry about the confusion. My real question is about the HTTP protocol.
I'll leave this one as-is and mark it answered.
I posted a new question here.
does HTTP protocol require space be encoded in file path?
The browser makes the correction.
You still have to encode the spaces though. Just because it works in the browsers you use doesn't make it valid, and doesn't mean it will work everywhere.
You can see a list of reserved characters and other characters that should be encoded here: http://www.blooberry.com/indexdot/html/topics/urlencoding.htm
RFC1738 specifically states:
Thus, only alphanumerics, the special characters "$-_.+!*'(),", and reserved characters used for their reserved purposes may be used unencoded within a URL.
RFC 2396 takes precedence over RFC 1738 and expounds on space usage in URLs:
The space character is excluded because significant spaces may disappear and insignificant spaces may be introduced when URI are transcribed or typeset or subjected to the treatment of word-processing programs. Whitespace is also used to delimit URI in many contexts.
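If you'd rather not rely on the browser's correction, encoding the path yourself is a one-liner. A sketch using Python's urllib (the path is the example from the question):

from urllib.parse import quote

path = "/i/my house.jpg"
print(quote(path))  # /i/my%20house.jpg  (slashes are kept, the space becomes %20)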
