Pulling valid data from a bytestring in Python 3

Given the following bytestring, how can I remove any characters matching \xFF, and create a list object from what's left (by splitting on removed areas)?
b"\x07\x00\x00\x00~\x10\x00pts/5\x00\x00/5\x00\x00user\x00\x00"
Desired result:
["~", "pts/5", "/5", "user"]
The above string is just an example - I'd like to remove any \x.. (non-decoded) bytes.
I'm using Python 3.2.3, and would prefer to use standard libraries only.

>>> a = b"\x07\x00\x00\x00~\x10\x00pts/5\x00\x00/5\x00\x00user\x00\x00"
>>> import re
>>> re.findall(br"[^\x00-\x1f\x7f-\xff]+", a)
[b'~', b'pts/5', b'/5', b'user']
The results are still bytes objects. (Note the br"..." prefix: the reversed spelling rb"..." is only accepted from Python 3.3 onward, so it would be a syntax error on your 3.2.3.) If you want the results to be strings:
>>> [i.decode("ascii") for i in re.findall(br"[^\x00-\x1f\x7f-\xff]+", a)]
['~', 'pts/5', '/5', 'user']
Explanation:
[^\x00-\x1f\x7f-\xff]+ matches one or more (+) characters that are not in the set ([^...]) covering ASCII 0 through 31 (\x00-\x1f) or ASCII 127 through 255 (\x7f-\xff).
Be aware that this approach only works if the "embedded texts" are ASCII. It will remove all extended alphabetic characters (like ä, é, €) from strings encoded in an 8-bit codepage such as latin-1, and it will effectively destroy strings encoded in UTF-8 and other Unicode encodings, because those contain byte values in the ranges 0-31 and 127-255 as parts of their multi-byte character codes.
Of course, you can always manually fine-tune the exact ranges you want to remove according to the example given in this answer.
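For instance, a minimal equivalent variant that whitelists printable ASCII (space through tilde) directly, rather than blacklisting the control and high ranges:
>>> re.findall(br"[ -~]+", a)
[b'~', b'pts/5', b'/5', b'user']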

Related

Cannot decode using utf-8 after encoding with utf-8

In a situation where I had to store data as UTF-8, now when I want to fetch and decode('utf-8') the data, it simply does not work. Consider the line below as an example:
\x0d\x0a\xd8\xb3\xd8\xa7\xd9\x82\xdb\x8c\xe2\x80\x8c\xd9\x86\xd8\xa7\xd9\x85\xd9\x87
You can simply run the line below to convert the string above to human-readable form:
b"\x0d\x0a\xd8\xb3\xd8\xa7\xd9\x82\xdb\x8c\xe2\x80\x8c\xd9\x86\xd8\xa7\xd9\x85\xd9\x87".decode("utf-8")
However, I could not find a way to convert the string to a bytestring without corrupting it. I tried the following methods, but all of them failed:
.decode("utf-8")
.decode()
.bytes()
Up until this point I could not find a solution on SO or elsewhere. I'd appreciate any help.
"\x0d\x0a\xd8\xb3\xd8\xa7\xd9\x82\xdb\x8c\xe2\x80\x8c\xd9\x86\xd8\xa7\xd9\x85\xd9\x87"
b"\x0d\x0a\xd8\xb3\xd8\xa7\xd9\x82\xdb\x8c\xe2\x80\x8c\xd9\x86\xd8\xa7\xd9\x85\xd9\x87"
The above lines (both given in the question) are particular instances of String and Bytes literals (respectively):
\xhh    Character with hex value hh (notes 2, 3)
Note 2: Unlike in Standard C, exactly two hex digits are required.
Note 3: In a bytes literal, hexadecimal and octal escapes denote the byte with the given value. In a string literal, these escapes denote a Unicode character with the given value.
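The difference is easy to see at the prompt, using \xd8 (the first non-ASCII escape from the question):
>>> "\xd8"    # str literal: the Unicode character U+00D8
'Ø'
>>> b"\xd8"   # bytes literal: a single byte with value 0xD8
b'\xd8'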
Let's check the string defined in such a way (at the Python prompt):
>>> xstr = "\x0d\x0a\xd8\xb3\xd8\xa7\xd9\x82\xdb\x8c\xe2\x80\x8c\xd9\x86\xd8\xa7\xd9\x85\xd9\x87"
>>> xstr
'\r\nساÙ\x82Û\x8câ\x80\x8cÙ\x86اÙ\x85Ù\x87'
>>> print(xstr)
ساÙÛâÙاÙ
Ù
>>>
Apparently, the print(xstr) output does not resemble a word in any known language. However, all its characters belong (by definition) to the Unicode range r'[\u0000-\u00ff]', i.e. the first 256 characters of Unicode, and voilà: that range maps one-to-one onto iso-8859-1, aka 'latin1'.
We need to get an encoded version of the xstr string as a bytes object, e.g. using the str.encode method or the built-in bytes() constructor, and then decode those bytes as UTF-8 (the default):
>>> print(bytes(xstr, 'latin1').decode()); print(xstr.encode("latin1").decode())
ساقی‌نامه
ساقی‌نامه
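The same latin1 round trip repairs any str whose characters were read byte-for-byte from UTF-8 data. As a minimal sketch (fix_mojibake is a hypothetical name, not a standard function):
>>> def fix_mojibake(s):
...     # reinterpret code points 0-255 as raw bytes, then decode properly
...     return s.encode("latin1").decode("utf-8")
...
>>> fix_mojibake(xstr)
'\r\nساقی\u200cنامه'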

Python3 and combining Diacritics

I've been having a problem with Unicode in Python 3, and I can't seem to understand why it's happening.
symbol= "ῇ̣"
print(len(symbol))
>>>>2
This letter comes from a word, ἐ̣ν̣τ̣ῇ̣[αὐτ]ῇ, in which I have combining diacritical marks. I want to do the statistical analysis in Python 3 and store the results in a database, and I also store each character's position (index) in the text. The database application correctly counts the symbol variable in the example as one character, whereas Python counts it as two, throwing off the entire indexing.
The project requires me to keep the diacritics, so I can't simply ignore them or do a .replace("combining diacritical mark","") on the string.
Since Python 3 uses Unicode strings by default, I'm a bit dumbfounded by this.
I have tried the base(), strip(), and strip_length() methods from greek-accentuation (https://pypi.org/project/greek-accentuation/), but they don't help either.
Project requirements are:
Detect the alphabet belonging to the character (OK)
Store string-positions (needed for highlighting in the database) (NotOK)
Be able to process multiple languages/alphabets mixed in one string. (OK)
Iterate over CSV-input. (OK)
Ignore set of predefined strings (OK)
Ignore set of strings that match certain conditions (OK)
This is the simplified code for this project:
# -*- coding: utf-8 -*-
import csv
from alphabet_detector import AlphabetDetector

ad = AlphabetDetector()

with open("tbltext.csv", "r", encoding="utf8") as txt:
    data = csv.reader(txt)
    for row in data:
        text = row[1]
        ### Here I have some string manipulation (lowering everything,
        ### replacing the predefined set of strings by equal-length '-', ...),
        ### then I use the ad module to detect the language by looping over
        ### my characters; this is where it goes wrong.
        for letter in text:
            lang = ad.detect_alphabet(letter)
If I use the word ἐ̣ν̣τ̣ῇ̣[αὐτ]ῇ as an example with a for loop, my result is:
>>> word = "ἐ̣ν̣τ̣ῇ̣[αὐτ]ῇ"
>>> for letter in word:
... print(letter)
...
ἐ
̣
ν
̣
τ
̣
ῇ
̣
[
α
ὐ
τ
]
ῇ
How can I make Python see letters with a combining diacritical mark as one letter instead of making it print the letter and the diacritical mark separately?
The string does have length 2, and that is correct: it contains two code points:
>>> import unicodedata
>>> list(hex(ord(c)) for c in symbol)
['0x1fc7', '0x323']
>>> list(unicodedata.name(c) for c in symbol)
['GREEK SMALL LETTER ETA WITH PERISPOMENI AND YPOGEGRAMMENI', 'COMBINING DOT BELOW']
So you should not use len to count the characters.
You could count only the non-combining characters:
>>> len(''.join(ch for ch in symbol if unicodedata.combining(ch) == 0))
1
From: How do I get the "visible" length of a combining Unicode string in Python? (ported here to Python 3).
But this is not necessarily the optimal solution either, depending on why you are counting characters. I think in your case it is enough, but fonts can merge characters into ligatures, and in some languages those are visually new (and very different) characters, not mere ligatures as in Western languages.
As a last comment: you should normalize your strings. With the code above it doesn't matter in this case, but in other cases you may get different results, especially if someone used compatibility characters (e.g. the micro sign for units, or Eszett, instead of the true Greek characters).
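If you also need the positions of user-perceived characters, here is a minimal sketch that groups each base character with the combining marks that follow it (this only approximates grapheme clusters; the full segmentation rules are in Unicode UAX #29):
>>> import unicodedata
>>> word = "ἐ̣ν̣τ̣ῇ̣[αὐτ]ῇ"
>>> clusters = []
>>> for ch in word:
...     if clusters and unicodedata.combining(ch):
...         clusters[-1] += ch   # attach combining mark to the preceding base
...     else:
...         clusters.append(ch)  # start a new user-perceived character
...
>>> len(clusters)
10
>>> clusters[3]
'ῇ̣'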

How to get single backslash instead of double backslash with encode("unicode-escape")?

Get unicode point of character Ä.
Python3 version.
>>> str="Ä"
>>> str.encode("unicode-escape")
b'\\xc4'
How do I get the single-backslash form b'\xc4' instead of b'\\xc4' as my output?
It's not entirely clear to me what you want, so I'll give you a few options.
Get the (Unicode) code point of a character as an integer:
>>> ord('Ä')
196
Display the integer in hex notation:
>>> hex(ord('Ä'))
'0xc4'
or with string formatting:
>>> '{:X}'.format(ord('Ä'))
'C4'
However, you talk about backslashes and show the bytestring b'\xc4'.
This is the Latin-1 encoding of 'Ä' (all characters with a Unicode codepoint below 256 can be encoded with Latin-1, and their byte value equals the Unicode codepoint).
>>> 'Ä'.encode('latin-1')
b'\xc4'
This is a bytestring of length 1.
It is displayed in a way in which you could type this character, i.e. using an escape sequence with backslash-x and a two-digit hex number.
The "unicode-escape" codec produces these four ASCII characters (\, x, c, 4), not as a str but as a bytes object (because str.encode() returns bytes by definition).
To get a backslash in a str/bytes literal, you need to type two backslashes, so the representation form also uses two backslashes:
>>> 'Ä'.encode('unicode-escape')
b'\\xc4'
The "unicode-escape" codec is very Python-specific and I don't see a lot of applications; maybe if you want to write your own pickle protocol or parse fragments of Python source code.

How to convert a string like "\u****" to text?

I want to convert a string like "\u****" to text (Unicode) in Haskell.
I have a Java properties file with the following content:
i18n.test.key=\u0050\u0069\u006e\u0067\u0020\uc190\uc2e4\ub960\u0020\ud50c\ub7ec\uadf8\uc778
I want to convert it to Text (Unicode) in Haskell.
I think I can do it like this:
Convert "\u****" to word8 array
Convert word8 array to ByteString
Use Text.Encoding.decodeUtf8 convert ByteString to text
But step 1 is a little complicated for me.
How can I do this in Haskell?
A simple solution may look like this (imports added for completeness):
import qualified Data.ByteString as BS
import qualified Data.Text.Encoding as T

decodeJava = T.decodeUtf16BE . BS.concat . gobble

gobble [] = []
gobble ('\\':'u':a:b:c:d:rest) =
  let sym = convert16 [a,b] [c,d]
  in  sym : gobble rest
gobble _ = error "decoding error"

convert16 hi lo = BS.pack [read $ "0x" ++ hi, read $ "0x" ++ lo]
Notes:
Your string is UTF-16-encoded, therefore you need decodeUtf16BE.
Decoding will fail if there are other characters in the string. This code only works on input consisting purely of \u escapes, so for your example you must first strip the i18n.test.key= prefix.
Constructing the words by appending "0x" and, in particular, using read is very slow, but it will do the trick for small data.
If you replace \u with \x then this is a valid Haskell string literal.
my_string = "\x0050\x0069\x006e..."
You can then convert to Text if you want, or leave it as String, or whatever.
Watch out, Java normally uses UTF-16 to encode its strings, so interpreting the bytes as UTF-8 will probably not work.
If the codes in your file are UTF-16, you need to do the following:
find the numeric value (Unicode code point) for each quadruple
check whether it is a high surrogate. If so, the following quadruple will be a low surrogate; the pair of surrogates maps to one Unicode code point.
make a String from your list of Unicode numbers with map toEnum
The following is a quote from the Java doc http://docs.oracle.com/javase/7/docs/api/ :
The char data type (and therefore the value that a Character object encapsulates) are based on the original Unicode specification, which defined characters as fixed-width 16-bit entities. The Unicode Standard has since been changed to allow for characters whose representation requires more than 16 bits. The range of legal code points is now U+0000 to U+10FFFF, known as Unicode scalar value. (Refer to the definition of the U+n notation in the Unicode Standard.)
The set of characters from U+0000 to U+FFFF is sometimes referred to as the Basic Multilingual Plane (BMP). Characters whose code points are greater than U+FFFF are called supplementary characters. The Java platform uses the UTF-16 representation in char arrays and in the String and StringBuffer classes. In this representation, supplementary characters are represented as a pair of char values, the first from the high-surrogates range, (\uD800-\uDBFF), the second from the low-surrogates range (\uDC00-\uDFFF).
Java has methods to combine a high surrogate character and a low surrogate character to get the Unicode point. You may want to check the source of the java.lang.Character class to find out how exactly they do this, but I guess it is some simple bit-operation.
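The bit-operation is indeed simple. Here it is worked through in Python (the language used elsewhere on this page), with the surrogate pair for U+1F600 as an example; the constants 0xD800, 0xDC00, and 0x10000 come straight from the UTF-16 definition:
>>> hi, lo = 0xD83D, 0xDE00   # high and low surrogate values
>>> hex(0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00))
'0x1f600'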
Another possibility would be to check for a Haskell library that does UTF-16 decoding.

Convert string of encoded escape sequences to Unicode in Python 3

I'm attempting to decode a Python string containing a series of Shift-JIS escape sequences in Python. When I create a bytes literal containing the sequences, I can use decode('shift-jis') to get the expected result.
>>> seq = b'\201u\202\240\202\246\202\244\202\242\202\250\201v'
>>> seq.decode("shift-jis")
'「あえういお」'
The problem is that the sequences are passed in as a plain Python string. When I use str.encode, the sequence is interpreted as Unicode and extra bytes of \xc2 are inserted:
>>> seq = "\201u\202\240\202\246\202\244\202\242\202\250\201v"
>>> str.encode(seq)
b'\xc2\x81u\xc2\x82\xc2\xa0\xc2\x82\xc2\xa6\xc2\x82\xc2\xa4\xc2\x82\xc2\xa2\xc2\x82\xc2\xa8\xc2\x81v'
Is there a way to directly convert a Python string containing encoded escape sequences into a bytes literal, in the same way as placing a b in front of a string produces a bytes literal with the escaped characters?
str.encode defaults to UTF-8 encoding, hence the \xc2 prefixes you are getting. (Check Wikipedia for the details of UTF-8 if you want.) What you want instead is for code points 0 to 255 to be turned into bytes 0 to 255; in other words, the same data in an object of a different class. Latin-1 does exactly this.
>>> seqb = seq.encode('latin-1')
>>> seqb.decode('shift-jis')
'「あえういお」'
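The two steps can also be combined into one line; note that .encode('latin-1') raises UnicodeEncodeError if the string contains any code point above 255, which serves as a useful sanity check here:
>>> seq.encode('latin-1').decode('shift-jis')
'「あえういお」'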
