Confusion regarding UTF8 substring length - python-3.x

Can someone please help me deal with byte-order mark (BOM) bytes versus UTF8 characters in the first line of an XHTML file?
Using Python 3.5, I opened the XHTML file as UTF8 text:
inputTopicFile = open(inputFileName, "rt", encoding="utf8")
As shown in this hex-editor, the first line of that UTF8-encoded XHTML file begins with the three-bytes UTF8 BOM EF BB BF:
I wanted to remove the UTF8 BOM from what I supposed were equivalent to the three initial character positions [0:2] in the string. So I tried this:
firstLine = firstLine[3:]
Didn't work -- the characters <? were no longer present at the start of the resulting line.
So I did this experiment:
for charPos in range(0, 3):
print("charPos {0} == {1}".format(charPos, firstLine[charPos]))
Which printed:
charPos 0 ==
charPos 1 == <
charPos 2 == ?
I then added .encode to that loop as follows:
for charPos in range(0, 3):
print("charPos {0} == {1}".format(charPos, eachLine[charPos].encode('utf8')))
Which gave me:
charPos 0 == b'\xef\xbb\xbf'
charPos 1 == b'<'
charPos 2 == b'?'
Evidently Python 3 in some way "knows" that the 3-bytes BOM is a single unit of non-character data? Meaning that one cannot try to process the first three 8-bit bytes(?) in the line as if they were UTF8 characters?
At this point I know that I can "trick" my code into giving me with I want by specifying firstLine = firstLine[1:]. But it seems wrong to do it that way(?)
So what's the correct way to discard the first three BOM bytes in a UTF8 string on the way to working with only the UTF8 characters?
EDIT: The solution, per the comment made by Anthony Sottile, turned out to be as simple as using encoding="utf-8-sig" when I opened the source XHTML file:
inputTopicFile = open(inputFileName, "rt", encoding="utf-8-sig")
That strips out the BOM. Voila!

As you mentioned in your edit, you can open the file with the utf8-sig encoding, but to answer your question of why it was behaving this way:
Python 3 distinguishes between byte strings (the ones with the b prefix) and character strings (without the b prefix), and prefers to use character strings whenever possible. A byte string works with the actual bytes; a character string works with Unicode codepoints. The BOM is a single codepoint, U+FEFF, so in a regular string Python 3 will treat it as a single character (because it is a single character). When you call encode, you turn the character string into a byte string.
Thus the results you were seeing are exactly what you should have: Python 3 does know what counts as a single character, which is all it sees until you call encode.

Related

can not decoed using utf-8 after encoding with utf-8

In a situation I had to store data as utf-8 and now when I want to fetch and decode('utf-8') data it's just simply does not work. Consider line below as an example:
\x0d\x0a\xd8\xb3\xd8\xa7\xd9\x82\xdb\x8c\xe2\x80\x8c\xd9\x86\xd8\xa7\xd9\x85\xd9\x87
You can simply copy the line below to convert the string above to the human readable format:
b"\x0d\x0a\xd8\xb3\xd8\xa7\xd9\x82\xdb\x8c\xe2\x80\x8c\xd9\x86\xd8\xa7\xd9\x85\xd9\x87".decode("utf-8")
However could not find a way to convert the string to bytestring without corrupting the string. I tried following methods but all of them failed:
.decode("utf-8")
.decode()
.bytes()
Up until this point I could not find solution in OS or other places. Appreciate any help.
x0d\x0a\xd8\xb3\xd8\xa7\xd9\x82\xdb\x8c\xe2\x80\x8c\xd9\x86\xd8\xa7\xd9\x85\xd9\x87
b'x0d\x0a\xd8\xb3\xd8\xa7\xd9\x82\xdb\x8c\xe2\x80\x8c\xd9\x86\xd8\xa7\xd9\x85\xd9\x87'
The above lines (both given in the question) are particular instances of String and Bytes literals (respectively):
\xhh Character with hex value hh (2, 3)
2 Unlike in Standard C, exactly two hex digits are
required.
3 In a bytes literal, hexadecimal and octal escapes denote
the byte with the given value. In a string literal, these escapes
denote a Unicode character with the given value.
Let's check the string defined in such a way (inside Python prompt):
>>> xstr = "\x0d\x0a\xd8\xb3\xd8\xa7\xd9\x82\xdb\x8c\xe2\x80\x8c\xd9\x86\xd8\xa7\xd9\x85\xd9\x87"
>>> xstr
'\r\nساÙ\x82Û\x8câ\x80\x8cÙ\x86اÙ\x85Ù\x87'
>>> print( xstr)
ساÙÛâÙاÙ
Ù
>>>
Apparently, the print( xstr) output does not resemble a word in any known language however all its characters belong (by definition) to Unicode range r'[\u0000-\u00ff]' i.e. the first 256 of characters in Unicode, and voila - it's iso-8859-1 aka 'latin1'.
We need to get an encoded version of the xstr string as a bytes object, e.g. using str.encode method or built-in bytes() function. Then
print( bytes(xstr,'latin1').decode()); print(xstr.encode("latin1").decode())
ساقی‌نامه
ساقی‌نامه

How to find arabic character all occurrences with its dialect?

I am trying to find the occurence of arabic character with its harakat in string such as "رَّ" in "بِسْمِ ٱللَّهِ ٱلرَّحْمَٰنِ ٱلرَّحِيمِ".
Arabic characters can take harakat for example "ر" is the original arabic character but can have harakat so it can look something like this "رَّ"> I am using Python 3 to find the character occurence with a specific harakat but could not do that. I have tried for loop and tried converting the string to unicode but could not do that.
str = "مرة رجل حكيم قال بِسْمِ ٱللَّهِ ٱلرَّحْمَٰنِ ٱلرَّحِيمِ"
i=0
for s in str:
if s == "رَّ":
i = i + 1
print(i)
Expected output is 2 but 0 is what I get.
len("رَّ") returns 3, which means the glyph is represented by three characters. Your loop checks a single character at a time and so never finds a match.
You need to be looking for substrings, which is exactly what .count() is for.
i = str.count('رَّ')

how can I find out the ASCII code of a character in gnuplot

Does anybody know how to find out the ascii code of a character in gnuplot?
There is no official function to do that, so it must probably be some kind of trick.
(OK, i found a way, answer directly below)
In lieu of an ord(ch) function, one can build a string with all characters, and find the position of the one in question using the strstrt()function.
# make a string that contains all ASCII chars from 1 to 255
ALLCHARS = ''; do for [i=1:255] {ALLCHARS = ALLCHARS.sprintf('%c',i)}
# return position of character in ALLCHARS if ch contains 1 char, -1 otherwise
ord(ch) = (strlen(ch) == 1) ? strstrt(ALLCHARS,ch): -1
# test with ASCII char 12
pr n=12, testch = sprintf('%c',n), ord(testch)
The NUL (ASCII-code zero) character is missing, because gnuplot anyway has no single character variable type that could hold it. gnuplot strings are NUL terminated.

Remove non ASCII characters in octave

I'm trying to remove non ASCII characters read from a data file using OCTAVE but I can't make it work. I tried getting the ASCII codes of these "weird" characters and they do have random ASCII codes. An example string of characters is this:
asdqwФЕДЕРАЛЬ234НОЕ234 АГЕНТСqewwqedasТВО ПasdsadО ОБРАasdasdЗОВАНИЮ
Госудаsadasdsagwfрственная акадеasdмия профессиональной п
Do you guys have any suggestions on how can i remove the non ASCII characters from this string? Or better yet, how will I be able to determine if a given string has non ASCII characters?
Thanks in advance!
To remove all non ASCII chars in the range of 0..127 decimal use
a = "asdqwФЕДЕРАЛЬ234НОЕ234 АГЕНТСqewwqedasТВО ПasdsadО ОБРАasdasdЗОВАНИЮ Госудаsadasdsagwfрственная акадеasdмия профессиональной п";
a(! isascii (a)) = []
which gives
a = asdqw234234 qewwqedas asdsad asdasd sadasdsagwf asd
and if you just want to check if there are non ASCII chars:
any (! isascii("foobar"))
ans = 0
any (! isascii("foobaröäüß"))
ans = 1

Python: Printing unicode characters beyond FFFF

On Python 3 printing unicode characters can be printed like this:
print('\uFFFF')
But how can I print higher unicode characters like 001FFFFF? print('\u001FFFFF') will just print 001F as unicode character and then 4 times F. Trying to use print('\u001F\uFFFF') will result in 2 unicode characters instead of the wanted one. Is it possible to print somehow the unicode character 001FFFFF in Python 3?
Use an upper-case U.
print('\U001FFFFF')
There is another way in Python 3, using the built-in function chr(i), which
Return the string representing a character whose Unicode code point is
the integer i.
and
The valid range for the argument is from 0 through 1,114,111 (0x10FFFF in base 16).
so there are no limitation for the hex digit value.
print(chr(97))
print(chr(0xFFFF))
print(chr(0x10080))

Resources