The specification for RDF N-Triples states that string literals must be encoded.
https://www.w3.org/TR/n-triples/#grammar-production-STRING_LITERAL_QUOTE
Does this "encoding" have a name I can look up to use it in my programming language? If not, what does it mean in practice?
The grammar productions that you need are right in the document that you linked to:
[9] STRING_LITERAL_QUOTE ::= '"' ([^#x22#x5C#xA#xD] | ECHAR | UCHAR)* '"'
[141s] BLANK_NODE_LABEL ::= '_:' (PN_CHARS_U | [0-9]) ((PN_CHARS | '.')* PN_CHARS)?
[10] UCHAR ::= '\u' HEX HEX HEX HEX | '\U' HEX HEX HEX HEX HEX HEX HEX HEX
[153s] ECHAR ::= '\' [tbnrf"'\]
This means that a string literal begins and ends with a double quote ("). Inside of the double quotes, you can have:
any character except #x22 (the double quote), #x5C (the backslash), #xA (line feed), and #xD (carriage return), which have to be written with the escapes below instead;
a unicode character represented with a \u followed by four hex digits, or a \U followed by eight hex digits; or
an escape character, which is a \ followed by any of t, b, n, r, f, ", ', and \, which represent various characters.
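In practice, the rules above can be sketched in a few lines of Python. This is only a minimal illustration under my own assumptions: the function name escape_ntriples_literal and the ascii_only flag are made up for this example and are not part of any library. It escapes the four characters that the grammar forbids in raw form and, optionally, writes non-ASCII characters as UCHAR escapes.

```python
# Hypothetical helper, not a library API: escapes a Python string as an
# N-Triples STRING_LITERAL_QUOTE following the grammar quoted above.
_ECHAR = {
    "\\": "\\\\",  # #x5C backslash
    '"': '\\"',    # #x22 double quote
    "\n": "\\n",   # #xA line feed
    "\r": "\\r",   # #xD carriage return
}


def escape_ntriples_literal(value: str, ascii_only: bool = False) -> str:
    parts = []
    for ch in value:
        if ch in _ECHAR:
            parts.append(_ECHAR[ch])
        elif ascii_only and ord(ch) > 0x7E:
            # UCHAR: \u + 4 hex digits, or \U + 8 hex digits beyond the BMP
            cp = ord(ch)
            parts.append(f"\\u{cp:04X}" if cp <= 0xFFFF else f"\\U{cp:08X}")
        else:
            parts.append(ch)
    return '"' + "".join(parts) + '"'


print(escape_ntriples_literal('This "Literal" needs escaping!'))
```

The remaining ECHAR forms (\t, \b, \f, \') are allowed but not required on output, so this sketch leaves those characters as they are.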
You could use Literal#n3()
e.g.
# pip install rdflib
>>> from rdflib import Literal
>>> lit = Literal('This "Literal" needs escaping!')
>>> s = lit.n3()
>>> print(s)
"This \"Literal\" needs escaping!"
In addition to Josh's answer: it is almost always a good idea to normalize Unicode data to NFC. In Java, for example, you can use the following routine:
java.text.Normalizer.normalize("rdf literal", Normalizer.Form.NFC);
For more information see: http://www.macchiato.com/unicode/nfc-faq
What is NFC?
For various reasons, Unicode sometimes has multiple representations of the same character. For example, each of the following sequences (the first two being single-character sequences) represent the same character:
U+00C5 ( Å ) LATIN CAPITAL LETTER A WITH RING ABOVE
U+212B ( Å ) ANGSTROM SIGN
U+0041 ( A ) LATIN CAPITAL LETTER A + U+030A ( ̊ ) COMBINING RING ABOVE
These sequences are called canonically equivalent. The first of these forms is called NFC - for Normalization Form C, where the C is for composition. For more information on these, see the introduction of UAX #15: Unicode Normalization Forms. A function transforming a string S into the NFC form can be abbreviated as toNFC(S), while one that tests whether S is in NFC is abbreviated as isNFC(S).
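As a rough illustration of toNFC(S) and isNFC(S), here is a small Python sketch using the standard unicodedata module. The names to_nfc and is_nfc are my own, chosen to mirror the abbreviations in the quote.

```python
import unicodedata


def to_nfc(s: str) -> str:
    """Return s in Normalization Form C."""
    return unicodedata.normalize("NFC", s)


def is_nfc(s: str) -> bool:
    """True if s is already in NFC (normalizing changes nothing)."""
    return s == to_nfc(s)


# The three canonically equivalent sequences listed above
forms = ["\u00C5", "\u212B", "A\u030A"]
normalized = [to_nfc(f) for f in forms]

# All three collapse to the single code point U+00C5
print([[f"U+{ord(ch):04X}" for ch in n] for n in normalized])
```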
I know that gforth stores characters as their codepoints in the stack, but the material I'm learning from doesn't show any word that helps to convert each character to codepoint.
I also want to sum the codepoints of the string. What should I use to do that?
In Forth we distinguish primitive characters (usually an octet that covers ASCII) and extended characters (usually Unicode).
Any character is always represented on the stack as its code point, but how extended characters are represented in memory is implementation-dependent.
See also Extended-Character word set:
Extended characters are stored in memory encoded as one or more primitive characters (pchars).
So to convert a character into a code point it's enough to read this character from the memory.
To read a primitive character, we use c@ ( c-addr -- char )
: sum-codes ( c-addr u -- sum ) 0 -rot over + swap ?do i c@ + 1 chars +loop ;
\ test
"test passed" sum-codes .
NB: native string literals are supported in the recent versions of Gforth. Before that you need to use the word s" as s" test passed".
To read an extended character, we can use xc@+ ( xc-addr1 -- xc-addr2 xchar )
: sum-xcodes ( c-addr u -- sum )
  over + >r 0 swap
  begin ( sum xc-addr ) dup r@ u< while
    xc@+ ( sum xc-addr2 xchar ) swap >r + r>
  repeat drop rdrop
repeat drop rdrop
;
\ test
"test ⇦ ⇨ ⇧ ⇩" 2dup dump cr sum-xcodes . cr
dump shows that in Gforth the extended characters are stored in the memory in UTF-8 encoding.
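For comparison only (this is Python, not Forth), the difference between summing primitive characters and summing extended characters can be sketched as below. The function names are my own; the sketch assumes the string is stored as UTF-8, as the dump output suggests for Gforth.

```python
def sum_utf8_bytes(text: str) -> int:
    """Sum the UTF-8 bytes, analogous to walking the buffer one pchar at a time."""
    return sum(text.encode("utf-8"))


def sum_code_points(text: str) -> int:
    """Sum the Unicode code points, analogous to walking the buffer one xchar at a time."""
    return sum(ord(ch) for ch in text)


# For pure ASCII both sums agree; for extended characters they differ.
print(sum_utf8_bytes("abc"), sum_code_points("abc"))
print(sum_utf8_bytes("⇦"), sum_code_points("⇦"))
```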
I'm a Julia newbie. When I was testing out the language, I got this error.
First of all, I define a String b as "he§y".
Julia seems to behave strangely when I have "special" characters in a String...
When I try to get the third character of b (it's supposed to be '§'), everything is OK.
However, when I try to get the fourth character of b (it's supposed to be 'y'), a "StringIndexError" is thrown.
I don't believe the compiler could throw that error at you. Do you mean a runtime error?
I know nothing about the Julia language, but the symptoms suggest that string indexing is based not on code points but on some encoding.
The Julia documentation seems to support my hypothesis:
https://docs.julialang.org/en/stable/manual/strings/
The built-in concrete type used for strings (and string literals) in Julia is String. This supports the full range of Unicode characters via the UTF-8 encoding. (A transcode function is provided to convert to/from other Unicode encodings.)
...
Conceptually, a string is a partial function from indices to characters: for some index values, no character value is returned, and instead an exception is thrown. This allows for efficient indexing into strings by the byte index of an encoded representation rather than by a character index, which cannot be implemented both efficiently and simply for variable-width encodings of Unicode strings.
Edit: The following example, quoted from the Julia documentation, demonstrates exactly the "problem" you are facing.
julia> s = "\u2200 x \u2203 y"
"∀ x ∃ y"
Whether these Unicode characters are displayed as escapes or shown as
special characters depends on your terminal's locale settings and its
support for Unicode. String literals are encoded using the UTF-8
encoding. UTF-8 is a variable-width encoding, meaning that not all
characters are encoded in the same number of bytes. In UTF-8, ASCII
characters – i.e. those with code points less than 0x80 (128) – are
encoded as they are in ASCII, using a single byte, while code points
0x80 and above are encoded using multiple bytes – up to four per
character. This means that not every byte index into a UTF-8 string is
necessarily a valid index for a character. If you index into a string
at such an invalid byte index, an error is thrown:
julia> s[1]
'∀': Unicode U+2200 (category Sm: Symbol, math)
julia> s[2]
ERROR: StringIndexError("∀ x ∃ y", 2)
[...]
julia> s[3]
ERROR: StringIndexError("∀ x ∃ y", 3)
Stacktrace:
[...]
julia> s[4]
' ': ASCII/Unicode U+0020 (category Zs: Separator, space)
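To see which indices are valid, it can help to compute where each character starts in the UTF-8 encoding. The following Python sketch does that with 1-based byte positions, matching Julia's convention. The helper name char_start_indices is my own, and the sketch only mimics the behaviour described in the quoted documentation; it is not how Julia implements it.

```python
from typing import List


def char_start_indices(text: str) -> List[int]:
    """Return the 1-based byte index at which each character starts in UTF-8."""
    indices = []
    position = 1
    for ch in text:
        indices.append(position)
        position += len(ch.encode("utf-8"))  # 1 to 4 bytes per character
    return indices


print(char_start_indices("he§y"))     # '§' takes two bytes, so index 4 is not a character start
print(char_start_indices("∀ x ∃ y"))  # '∀' and '∃' take three bytes each
```

For the string in the question, the character 'y' therefore starts at byte index 5, while index 4 falls in the middle of '§'.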
Get unicode point of character Ä.
Python3 version.
>>> str="Ä"
>>> str.encode("unicode-escape")
b'\\xc4'
How do I get the single-backslash format b'\xc4' instead of b'\\xc4' as my output?
It's not entirely clear to me what you want, so I'll give you a few options.
Get the (Unicode) code point of a character as an integer:
>>> ord('Ä')
196
Display the integer in hex notation:
>>> hex(ord('Ä'))
'0xc4'
or with string formatting:
>>> '{:X}'.format(ord('Ä'))
'C4'
However, you talk about backslashes and show the bytestring b'\xc4'.
This is the Latin-1 encoding of 'Ä' (all characters with a Unicode codepoint below 256 can be encoded with Latin-1, and their byte value equals the Unicode codepoint).
>>> 'Ä'.encode('latin-1')
b'\xc4'
This is a bytestring of length 1.
It is displayed in a way in which you could type this character, i.e. using an escape sequence with backslash-x and a two-digit hex number.
The "unicode-escape" codec produces these four ASCII characters (\, x, c, 4), not as a str but as a bytes object (because str.encode() returns bytes by definition).
To get a backslash in a str/bytes literal, you need to type two backslashes, so the representation form also uses two backslashes:
>>> 'Ä'.encode('unicode-escape')
b'\\xc4'
The "unicode-escape" codec is very Python-specific and I don't see a lot of applications; maybe if you want to write your own pickle protocol or parse fragments of Python source code.
I want to convert a string like "\u****" to text (Unicode) in Haskell.
I have a Java properties file, and it has the following content:
i18n.test.key=\u0050\u0069\u006e\u0067\u0020\uc190\uc2e4\ub960\u0020\ud50c\ub7ec\uadf8\uc778
I want to convert it to text (Unicode) in Haskell.
I think I can do it like this:
Convert "\u****" to word8 array
Convert word8 array to ByteString
Use Text.Encoding.decodeUtf8 convert ByteString to text
But step 1 is a little complicated for me.
How can I do it in Haskell?
A simple solution may look like this:
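import qualified Data.ByteString as BS
import qualified Data.Text.Encoding as T

-- Treat each \uXXXX escape as one big-endian UTF-16 code unit and decode the result.
decodeJava = T.decodeUtf16BE . BS.concat . gobble
gobble [] = []
gobble ('\\':'u':a:b:c:d:rest) = let sym = convert16 [a,b] [c,d]
                                 in sym : gobble rest
gobble _ = error "decoding error"
-- Build the two bytes of one code unit from its high and low hex-digit pairs.
convert16 hi lo = BS.pack [read $ "0x"++hi, read $ "0x"++lo]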
Notes:
Your string is UTF16-encoded, therefore you need decodeUtf16BE.
Decoding will fail if there are other characters in the string. This code will work with your example only if you remove the trailing i.
Constructing the words by appending 0x and, in particular, using read is very slow, but will do the trick for small data.
If you replace \u with \x then this is a valid Haskell string literal.
my_string = "\x0050\x0069\x006e..."
You can then convert to Text if you want, or leave it as String, or whatever.
Watch out, Java normally uses UTF-16 to encode its strings, so interpreting the bytes as UTF-8 will probably not work.
If the codes in your file are UTF-16, you need to do the following:
find the numeric value (UTF-16 code unit) for each quadruple of hex digits
check if this is a high surrogate character. If this is so, the following character will be a low surrogate character. The pair of surrogate characters can be mapped to a Unicode code point.
make a String from your list of Unicode numbers with map toEnum
The following is a quote from the Java doc http://docs.oracle.com/javase/7/docs/api/ :
The char data type (and therefore the value that a Character object encapsulates) are based on the original Unicode specification, which defined characters as fixed-width 16-bit entities. The Unicode Standard has since been changed to allow for characters whose representation requires more than 16 bits. The range of legal code points is now U+0000 to U+10FFFF, known as Unicode scalar value. (Refer to the definition of the U+n notation in the Unicode Standard.)
The set of characters from U+0000 to U+FFFF is sometimes referred to as the Basic Multilingual Plane (BMP). Characters whose code points are greater than U+FFFF are called supplementary characters. The Java platform uses the UTF-16 representation in char arrays and in the String and StringBuffer classes. In this representation, supplementary characters are represented as a pair of char values, the first from the high-surrogates range, (\uD800-\uDBFF), the second from the low-surrogates range (\uDC00-\uDFFF).
Java has methods to combine a high surrogate character and a low surrogate character to get the Unicode point. You may want to check the source of the java.lang.Character class to find out how exactly they do this, but I guess it is some simple bit-operation.
Another possibility would be to check for a Haskell library that does UTF-16 decoding.
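The surrogate handling described above can be sketched as follows. The sketch is in Python rather than Haskell, purely to show the arithmetic; the names combine_surrogates and decode_java_escapes are my own. It assumes the input contains only \uXXXX escapes and literal characters, and it ignores the other escape rules of the .properties format.

```python
import re

_ESCAPE = re.compile(r"\\u([0-9a-fA-F]{4})")


def combine_surrogates(high: int, low: int) -> int:
    """Map a high/low surrogate pair to the code point it encodes."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


def decode_java_escapes(text: str) -> str:
    # Step 1: turn every \uXXXX escape into its UTF-16 code unit.
    units = _ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), text)
    # Step 2: merge each high surrogate with the low surrogate that follows it.
    out = []
    i = 0
    while i < len(units):
        unit = ord(units[i])
        if (0xD800 <= unit <= 0xDBFF and i + 1 < len(units)
                and 0xDC00 <= ord(units[i + 1]) <= 0xDFFF):
            out.append(chr(combine_surrogates(unit, ord(units[i + 1]))))
            i += 2
        else:
            out.append(units[i])
            i += 1
    return "".join(out)


print(decode_java_escapes(r"\u0050\u0069\u006e\u0067"))
```

The example string from the question contains no surrogates, so for that input step 2 changes nothing; it only matters for characters above U+FFFF.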
Is it possible to call setlocale(LC_CTYPE, "ru_RU.utf8"), run an isalpha() check on each symbol of the string "рус eng", and get the following result:
р alpha
у alpha
с alpha
not alpha
e not alpha
n not alpha
g not alpha
Right now, when I set the locale to ru_RU.utf8, all symbols except the space are alpha.
The isalpha function asks the question:
The isalpha() function shall test whether c is a character of class alpha in the program's current locale.
and goes on to note:
The c argument is an int, the value of which the application shall ensure is representable as an unsigned char or equal to the value of the macro EOF. If the argument has any other value, the behavior is undefined.
Which means that it only works for single-byte characters; in a UTF-8 locale, where every non-ASCII character takes several bytes, that effectively means ASCII only.
The test is pretty much: is the character in the ranges [A-Z] or [a-z]? Nothing more.
Now, if you want to test characters outside of this range, then you need to use one of the wide character variants such as iswalpha.
What it looks like you're asking is whether you can perform a test that will reject characters that are not explicitly Cyrillic letters. That's not going to work with the iswalpha() test, because it treats alpha characters from pretty much all character sets as alpha. If you read the locale definition of ru_RU (glibc source localedata/locales/ru_RU), you will see that it uses the i18n file as its data source for character types, and that file determines what is considered an alpha.
If the input data is truly only from the russian alphabet, then you can check if the character is non-ascii and if that is the case then accept it as a valid character; unfortunately there is a good chance that some characters that are typed e.g. е (i.e. CYRILLIC SMALL LETTER IE Unicode: U+0435, UTF-8: D0 B5) will be entered using the latin character e (i.e. LATIN SMALL LETTER E Unicode: U+0065, UTF-8: 65) and so would be missed by this test.
If you want to test for those Cyrillic characters explicitly, then you need to test for the character ranges:
% CYRILLIC/
<U0400>..<U042F>;<U0460>..(2)..<U047E>;/
<U0480>;<U048A>..(2)..<U04BE>;<U04C0>;<U04C1>..(2)..<U04CD>;/
<U04D0>..(2)..<U04FE>;/
% CYRILLIC SUPPLEMENT/
<U0500>..(2)..<U0522>;/
% CYRILLIC SUPPLEMENT 2/
<UA640>..(2)..<UA65E>;<UA662>..(2)..<UA66C>;<UA680>..(2)..<UA696>;/
% CYRILLIC/
<U0430>..<U045F>;<U0461>..(2)..<U047F>;/
<U0481>;<U048B>..(2)..<U04BF>;<U04C2>..(2)..<U04CE>;/
<U04CF>;/
<U04D1>..(2)..<U0523>;/
% CYRILLIC SUPPLEMENT 2/
<UA641>..(2)..<UA65F>;<UA663>..(2)..<UA66D>;<UA681>..(2)..<UA697>;/
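As a rough sketch of such an explicit test, here is the idea in Python rather than C. It approximates the ranges above with the contiguous Cyrillic and Cyrillic Supplement blocks (U+0400 to U+052F) combined with a generic letter check, so it is not an exact transcription of the list. The names is_cyrillic_letter and classify are made up for this example.

```python
def is_cyrillic_letter(ch: str) -> bool:
    """Approximation: a letter whose code point lies in U+0400..U+052F."""
    return ch.isalpha() and 0x0400 <= ord(ch) <= 0x052F


def classify(text: str):
    """Label each character the way the question asks for."""
    return [(ch, "alpha" if is_cyrillic_letter(ch) else "not alpha") for ch in text]


for ch, label in classify("рус eng"):
    print(ch, label)
```

In C the same approach would mean decoding the input to wide characters first (for example with mbstowcs) and then comparing each wchar_t against the ranges.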