How to limit text in UTF-8 to only script characters?

I want to restrict a UTF-8 string to only script characters in any language. By script characters I mean only those characters in the language's written script, i.e. no symbols or special characters. Same as scripts here: http://www.unicode.org/charts/index.html
Would I have to go off and identify these character ranges for each and every language in UTF-8? Or is there something, e.g. a regex or a library, that I can make use of?

Depending on the language you're implementing this in, you might be able to use Unicode character categories in regular expressions.
The following expression should match all letters and numbers, but exclude punctuation, whitespace, symbols, etc.
[\p{L}\p{N}]*
Here's a small demo on regex101.
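For example, in Python the third-party regex module (unlike the standard re module) supports these property escapes; a minimal sketch:

# Minimal sketch using the third-party "regex" module, which supports
# \p{...} property escapes (the standard "re" module does not).
import regex

def keep_script_characters(text):
    # Keep only letters and digits from any script; drop symbols,
    # punctuation, whitespace, etc.
    return "".join(regex.findall(r"[\p{L}\p{N}]", text))

print(keep_script_characters("Grüße, 世界! 123 ☂"))  # -> Grüße世界123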

Related

Reading a text file with unicode characters - Python3

I am trying to read a text file which has unicode characters (u) and other tags (\n, \u) in the text; here is an example:
(u'B9781437714227000962', u'Definition\u2014Human papillomavirus
(HPV)\u2013related proliferation of the vaginal mucosa that leads to
extensive, full-thickness loss of maturation of the vaginal
epithelium.\n')
How can I remove these unicode tags using Python 3 on Linux?
To remove the unicode escape sequences (or better: to translate them) in Python 3:
a.encode('utf-8').decode('unicode_escape')
The decode part translates the unicode escape sequences into the corresponding unicode characters. Unfortunately this (un)escaping does not work directly on strings, so you need to encode the string first before decoding it.
But as pointed out in the question comments, you have a serialized document. Try to deserialize it with the correct tools, and you will automatically get the unicode "unescaping" as well.
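A minimal sketch of that round trip, assuming the text really contains literal backslash escape sequences such as \u2014:

# Minimal sketch: translating literal \uXXXX escape sequences inside a str.
a = "Definition\\u2014Human papillomavirus (HPV)\\u2013related proliferation\\n"

# encode() gives bytes; decoding those bytes with 'unicode_escape'
# turns the literal escape sequences into the real characters.
decoded = a.encode("utf-8").decode("unicode_escape")
print(decoded)  # Definition—Human papillomavirus (HPV)–related proliferation

# Caveat: 'unicode_escape' decodes the bytes as Latin-1, so any non-ASCII
# characters already present in the original string would be mangled.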

Usable Unicode Ranges for Custom Text Process

I am working on a processor that splits text into blocks with marks:
LOREM IPSUM SED AMED
will be parsed like:
{word:1}LOREM{/word:1}{space:2}
{word:3}IPSUM{/word:3}{space:4}
{word:5}SED{/word:5}{space:6}
{word:7}AMED{/word:7}
But I don't want to use "{word}" etc., because it slows the processor down (it is a string again)... I need marks like these:
\E002\0001 LOREM \E003\0001 \E004\0002
\E002\0003 IPSUM \E003\0004 \E004\0005
\E002\0006 SED \E003\0006 \E004\0007
\E002\0008 AMED \E003\0008
The first code, \E002, is the element type number; its last bit represents the element's close, so the element number increments by 2.
The second code, \0001, is the element index for stacking.
I just used \E002 arbitrarily for this example.
But \0001 is also used in the Unicode range, and this leads me back to where I started...
So which Unicode range can I use? \ff0000? Or how can I solve this?
Thanks!
The Unicode Consortium thought of this. There are ranges of Unicode code points that are meant never to represent a displayable character, but to act as meta-codes instead:
Noncharacters are code points that are permanently reserved and will never have characters
assigned to them.
...
Tag characters were intended to support a general scheme for the internal tagging of text
streams in the absence of other mechanisms, such as markup languages. The use of tag
characters for language tagging is deprecated.
(http://www.unicode.org/versions/Unicode9.0.0/ch23.pdf)
You should be able to use regular control characters as "private" tags, because these should never occur in proper strings. This would be the range from U+0000 to U+001F, excluding tab (U+0009), the common "returns" (U+000A and U+000D), and, for safety, U+0000 itself (some libraries do not like Null characters in the middle of strings).
Non-characters
Noncharacters are code points that are permanently reserved in the Unicode Standard for
internal use. They are not recommended for use in open interchange of Unicode text data.
You can use U+FFFE and U+FFFF, which are officially defined as noncharacters. (U+FEFF, by contrast, is the byte order mark, a valid character.) There are several more officially defined noncharacters (for example U+FDD0..U+FDEF), and you can be fairly sure they would not occur in regular text strings.
A few more ranges with predefined definitions, and therefore highly unlikely to occur in plain-text strings, are:
Specials: U+FFF0–U+FFF8
The nine unassigned Unicode code points in the range U+FFF0..U+FFF8 are reserved for
special character definitions.
Annotation Characters: U+FFF9–U+FFFB
An interlinear annotation consists of annotating text that is related to a sequence of annotated
characters. For all regular editing and text-processing algorithms, the annotated characters
are treated as part of the text stream. The annotating text is also part of the content,
but for all or some text processing, it does not form part of the main text stream.
Tag Characters: U+E0000–U+E007F
This block encodes a set of 95 special-use tag characters to enable the spelling out of ASCII-based
string tags using characters that can be strictly separated from ordinary text content
characters in Unicode.
(all quotations from the chapter as above)
Staying within conventions, you can also use U+2028 (line separator) and/or U+2029 (paragraph separator).
Technically, your use of U+E000–U+F8FF (the "Private Use Area") is okay-ish, because these code points can only define an unambiguous character in combination with a certain font. However, it is possible that these codes pop up if you get your plain text from a source where the font was included.
As for how to encode this into your strings: it doesn't really matter if the numerical code immediately following your private tag marker is a valid Unicode character or not. If you see one of your own tag markers, then the value immediately following is always your own private sequence number.
As you can see, there are lots of possibilities. I guess the most important criterion is whether you want to use other functions on these strings. If you create a string that is technically invalid Unicode (for instance, because it includes not-a-character values), some external functions may choose to fail on it, or silently remove the bad values. In such a case, you'd need to rigorously stick to a system in which you only use 'valid' code points.
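As an illustration of one of these options, here is a minimal Python sketch that uses two of the officially defined noncharacters (U+FDD0 and U+FDD1) as private word/space markers; the specific code points are an arbitrary choice for the example, and the question's per-element index numbering is left out to keep it short:

# Minimal sketch: noncharacters U+FDD0/U+FDD1 as private word/space markers.
# These code points never occur in valid interchanged text, but note the
# caveat above: some external libraries may reject strings containing them.
WORD_MARK = "\ufdd0"
SPACE_MARK = "\ufdd1"

def mark_blocks(text):
    parts = []
    for index, word in enumerate(text.split(" "), start=1):
        if index > 1:
            parts.append(SPACE_MARK)    # stands in for {space:n}
        parts.append(WORD_MARK + word)  # stands in for {word:n}...{/word:n}
    return "".join(parts)

marked = mark_blocks("LOREM IPSUM SED AMED")
print(marked.encode("unicode_escape").decode("ascii"))
# \ufdd0LOREM\ufdd1\ufdd0IPSUM\ufdd1\ufdd0SED\ufdd1\ufdd0AMED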

How can I find the character code of a special character in my text editor?

When pasting text from outside sources into a plain-text editor (e.g. TextMate or Sublime Text 2) a common problem is that special characters are often pasted in as well. Some of these characters render fine, but depending on the source, some might not display correctly (usually showing up as a question mark with a box around it).
So this is actually 2 questions:
Given a special character (e.g., ’ or ♥) can I determine the UTF-8 character code used to display that character from inside my text editor, and/or convert those characters to their character codes?
For those "extra-special" characters that come in as garbage, is there any way to figure out what encoding was used to display that character in the source text, and can those characters somehow be converted to UTF-8?
My favorite site for looking up characters is fileformat.info. They have a great Unicode character search that includes a lot of useful information about each character and its various encodings.
If you see the question mark with a box, that means you pasted something that can't be interpreted, often because it's not legal UTF-8 (not every byte sequence is legal UTF-8). One possibility is that it's UTF-16 with an endian mode that your editor isn't expecting. If you can get the full original source into a file, the file command is often the best tool for determining the encoding.
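If you just need the code point and byte sequence of a character you can already see, a quick one-off script also works; a minimal Python sketch:

# Minimal sketch: print the code point and UTF-8 bytes of each character.
for ch in "’♥":
    print(ch, f"U+{ord(ch):04X}", ch.encode("utf-8").hex(" "))
# ’ U+2019 e2 80 99
# ♥ U+2665 e2 99 a5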
At &what I built a tool focused on searching for characters. It indexes all the Unicode and HTML entity tables, but also supplements them with hacker dictionaries and a database of keywords I've collected, so you can search for words like heart, quot, weather, umlaut, hash, or cloverleaf and get what you want. By focusing on search, it avoids having to hunt around the Unicode pages, which can be frustrating. Give it a try.

find non LaTeX characters (eg. acute accents) with regex in vim

I was pasting BibTeX references into a bibliography file. Some names contain characters that LaTeX skips, for example á. Is there a way in Vim or regex to search for all characters that are skipped by LaTeX? One way, I would think, is to write a regex that searches for anything that isn't 0-9, a-z, A-Z, and some characters like / \ $.
I am not familiar with which characters LaTeX ignores, but if the file you are editing is encoded in UTF-8, you might try searching for characters outside the ASCII repertoire (0–127; or 32–127).
As a search command in Vim:
/[^\d0-\d127]
/[^\d32-\d127]
You can also use hex or octal instead of decimal; see :help /[]. This requires that l and \ not be present in the value of cpoptions (they are not present in the default state).
This should work for any encoding that is “the same as ASCII (where it is defined)” (i.e. UTF-8 and most “latin” encodings). If you are dealing with an encoding that clashes with ASCII, then you will need to refine the range specification.
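If you would rather get a report of the offending characters than jump through them in Vim, a minimal Python sketch along the same lines (the filename is a placeholder; it flags anything outside the printable ASCII range, plus tabs and newlines):

# Minimal sketch: report non-ASCII characters in a UTF-8 encoded .bib file.
# "references.bib" is a placeholder filename for this example.
with open("references.bib", encoding="utf-8") as f:
    for lineno, line in enumerate(f, start=1):
        for col, ch in enumerate(line, start=1):
            if not (32 <= ord(ch) <= 127 or ch in "\t\n"):
                print(f"line {lineno}, col {col}: {ch!r} (U+{ord(ch):04X})")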

Bare-minimum text sanitation

In an application that accepts, stores, processes, and displays Unicode text (for the purpose of discussion, let's say that it's a web application), which characters should always be removed from incoming text?
I can think of some, mostly listed in the C0 and C1 control codes Wikipedia article:
The range 0x00-0x19 (mostly control characters), excluding 0x09 (tab), 0x0A (LF), and 0x0D (CR)
The range 0x7F-0x9F (more control characters)
Ranges of characters that can safely be accepted would be even better to know.
There are other levels of text filtering — one might canonicalize characters that have multiple representations, replace nonbreaking characters, and remove zero-width characters — but I'm mainly interested in the basics.
See the W3 Unicode in XML and other markup languages note. It defines a class of characters as ‘discouraged for use in markup’, which I'd definitely filter out for most web sites. It notably includes such characters as:
U+2028–9 which are funky newlines that will confuse JavaScript if you try to use them in a string literal;
U+202A–E which are bidi control codes that wily users can insert to make text appear to run backwards in some browsers, even outside of a given HTML element;
language override control codes that could also have scope outside of an element;
BOM.
Additionally, you'd want to filter/replace the characters that are not valid in Unicode at all (U+FFFF et al.) and, if you are using a language that works in UTF-16 natively (e.g. Java, Python on Windows), any surrogate characters (U+D800–U+DFFF) that do not form valid surrogate pairs.
The range 0x00-0x19 (mostly control characters), excluding 0x09 (tab), 0x0A (LF), and 0x0D (CR)
And arguably (especially for a web application), lose CR as well, and turn tabs into spaces.
The range 0x7F-0x9F (more control characters)
Yep, away with those, except in cases where people might really mean them. (SO used to allow them, which let people post strings that had been mis-decoded; that was occasionally useful for diagnosing Unicode problems.) For most sites I think you'd not want them.
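Pulling these lists together, a minimal Python sketch of that kind of baseline filter (the exact set of ranges is a judgement call, per the discussion above):

import re

# Minimal sketch of a baseline filter: strip C0/C1 controls (keeping tab,
# LF and CR), the U+2028/U+2029 separators, the bidi embedding/override
# codes, the BOM and the noncharacters. Adjust the ranges to taste;
# unpaired surrogates would need separate handling.
DISALLOWED = re.compile(
    "[\u0000-\u0008\u000B\u000C\u000E-\u001F"  # C0 controls except \t \n \r
    "\u007F-\u009F"                            # DEL plus C1 controls
    "\u2028\u2029"                             # line/paragraph separators
    "\u202A-\u202E"                            # bidi embedding/override codes
    "\uFEFF"                                   # BOM / zero width no-break space
    "\uFDD0-\uFDEF\uFFFE\uFFFF]"               # noncharacters
)

def sanitize(text: str) -> str:
    return DISALLOWED.sub("", text)

print(sanitize("hello\u202Eworld\u0000!"))  # helloworld!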
I suppose it depends on your purpose. You could limit the user to the keyboard characters if that is your whim, which is 9, 10, 13, [32-126]. In UTF-8, a byte value above 0x7F signifies that you are inside a multi-byte Unicode character. In extended ASCII code pages, the range above 0x7F consists of special display/format characters and is localized to allow extensions depending on the language at the location.
Note that the keyboard characters can differ depending on location, since users can input characters in their native language, which will fall outside the 0x00-0x7F range if their language doesn't use an unaccented Latin script (Arabic, Chinese, Japanese, Greek, Cyrillic, etc.).
If you take a look here you can see what characters from UTF-8 will display.