While importing data from a flat file, I noticed some embedded hex-values in the string (<0x00>, <0x01>).
I want to replace them with specific characters, but am unable to do so. Removing them won't work either.
What it looks like in the exported flat file: https://i.imgur.com/7MQpoMH.png
Another example: https://i.imgur.com/3ZUSGIr.png
This is what I've tried:
(Note that <0x01> stands for a non-editable entity; it is not rendered here.)
import io
with io.open('1.txt', 'r+', encoding="utf-8") as p:
    s = p.read()
# included in case it bears any significance
import re
import binascii
s = "Some string with hex: <0x01>"
s = s.encode('latin1').decode('utf-8')
# throws e.g.: >>> UnicodeDecodeError: 'utf-8' codec can't decode byte 0xfc in position 114: invalid start byte
s = re.sub(r'<0x01>', r'.', s)
s = re.sub(r'\\0x01', r'.', s)
s = re.sub(r'\\\\0x01', r'.', s)
s = s.replace('\0x01', '.')
s = s.replace('<0x01>', '.')
s = s.replace('0x01', '.')
Or something along these lines, hoping to get a handle on it while iterating through the whole string:
for x in s:
    try:
        base64.encodebytes(x)
        base64.decodebytes(x)
        s.strip(binascii.unhexlify(x))
        s.decode('utf-8')
        s.encode('latin1').decode('utf-8')
    except:
        pass
Nothing seems to get the job done.
I'd expect the characters to be replaceable with the methods I've dug up, but they are not. What am I missing?
NB: I have to preserve umlauts (äöüÄÖÜ)
-- edit:
Could I introduce the hex-values in the first place when exporting? If so, is there a way to avoid that?
with io.open('out.txt', 'w', encoding="utf-8") as temp:
    temp.write(s)
Judging from the images, these are actually control characters.
Your editor displays them in this greyed-out way showing you the value of the bytes using hex notation.
You don't have the characters "0x01" in your data, but really a single byte with the value 1, so unhexlify and friends won't help.
In Python, these characters can be produced in string literals with escape sequences using the notation \xHH, with two hexadecimal digits.
The fragment from the first image is probably equal to the following string:
"sich z\x01 B. irgendeine"
Your attempts to remove them were close.
s = s.replace('\x01', '.') should work.
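As a rough sketch of the same idea applied to every control character at once, the snippet below replaces all C0 control characters except tab, line feed and carriage return with a dot. The sample string and the name CONTROL_CHARS are made up for illustration; only the '\x01' replacement itself comes from the answer above.

```python
import re

# Hypothetical sample text: two control characters plus an umlaut.
s = "sich z\x01 B. für irgendeine\x00"

# C0 control characters, leaving out \t (0x09), \n (0x0a) and \r (0x0d).
CONTROL_CHARS = re.compile(r'[\x00-\x08\x0b\x0c\x0e-\x1f]')

# Replace every match with a dot; umlauts are outside the range and stay intact.
cleaned = CONTROL_CHARS.sub('.', s)
print(cleaned)  # sich z. B. für irgendeine.
```

Because the character class only covers code points below 0x20, characters such as äöüÄÖÜ are never touched.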
I have a vector "char" of type |S1, as in the example below:
masked_array(data=[b'E', b'U', b'3', b'7', b'6', b'8', b' ', b' ', b' ', b' '],
mask=False,
fill_value=b'N/A',
dtype='|S1')
I want to get the string in it, in this example 'EU3768'
This example is taken from a netcdf file. Library used is netCDF4.
Further question: Why is there a b in front of all single letters?
Thanks for your help :)
First, the most basic question: what does the b in front of each letter mean? The b prefix marks a bytes literal, so each element is a bytes object holding one raw byte rather than a text character. To turn the bytes back into a string they must be decoded; the data shown here is plain ASCII, so decoding it as UTF-8 works. With that as a preamble, the following code will do the trick.
I am assuming that you can extract data from the masked_array. Then perform the following operations:
# Convert the list of bytes to a list of strings
ds = list(map(lambda x: x.decode('utf-8'), data))
# Convert the list of strings to a single string and strip any trailing spaces
sd = ''.join(ds).strip()
This could of course be performed in a single line of code as follows:
sd = ''.join(list(map(lambda x: x.decode('utf-8'), data))).strip()
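A slightly shorter variant, as a sketch: join the single-byte items first and decode the result once, instead of decoding every item separately. The list named data below is a hypothetical stand-in for the values extracted from the masked array.

```python
# Hypothetical stand-in for the values extracted from the masked array.
data = [b'E', b'U', b'3', b'7', b'6', b'8', b' ', b' ', b' ', b' ']

# Concatenate the one-byte items into a single bytes object,
# decode it once, then strip the padding spaces.
text = b''.join(data).decode('utf-8').strip()
print(text)  # EU3768
```

Decoding once after joining also behaves better if a multi-byte UTF-8 character is spread over several one-byte items, which decoding item by item could not handle.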
As an answer to your follow-up question, you might be able to let NumPy do some of the work by operating directly on the underlying bytes. For example, I can create a large number of similarly shaped objects via:
import numpy as np
from string import ascii_letters, digits
letters = np.array(list(ascii_letters + digits), dtype='S1')
v = np.random.choice(letters, (100_000, 10))
The first three elements of this look like:
[[b'W' b'B' b'W' b'4' b'O' b'B' b'A' b'4' b'Q' b'n']
[b'I' b'I' b'T' b'u' b'K' b'K' b'U' b'a' b'r' b'r']
[b'V' b'f' b'n' b'U' b'G' b'0' b'j' b'R' b'm' b'C']]
I can then convert these back to strings via some byte-level shenanigans:
[bytes.decode(s) for s in np.frombuffer(v, dtype='S10')]
The first three look like:
['WBW4OBA4Qn', 'IITuKKUarr', 'VfnUG0jRmC']
which hopefully makes sense. This takes ~20ms which is quicker than a version which goes through Python:
[b''.join(r).decode() for r in v]
taking ~200ms. This is still much faster than the version of code you posted, so maybe you could be accessing netcdf more efficiently.
I want to split a string in two, based on the last space in the string.
Example:
full_text = "0.808 um"
value, unit = full_text.rsplit(" ")
This should have worked, but I get the error:
ValueError: not enough values to unpack (expected 2, got 1)
So I printed the result of the split:
['0.808\xa0um']
In my example the string is static, but in reality I receive the strings from a database, and I can't tell in advance whether a space is a regular space or not.
I want to preserve the encoding of the non-space characters I receive, but still be able to split.
You would simply need to expect more and different kinds of whitespace to split by. In your case you're dealing with a no-break space. The regular expression \s would match it and a few other kinds of whitespace:
>>> import re
>>> re.split(r'\s', '0.808\xa0um')
['0.808', 'um']
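Since the question asks for a split on the last space only, here is a minimal sketch of an alternative without regular expressions. In Python 3, str.split and str.rsplit called without a separator split on any Unicode whitespace, which includes the no-break space. The variable names are taken from the question.

```python
full_text = "0.808\xa0um"

# sep=None means "any run of Unicode whitespace"; maxsplit=1 together with
# rsplit splits only at the last whitespace in the string.
value, unit = full_text.rsplit(None, 1)
print(value, unit)  # 0.808 um
```

The other characters are left exactly as received; only the whitespace at the split point is consumed.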
I'm converting strings to floats using float(x). However, for some reason one of the strings is "71.2\x0060". I've tried following this answer, but it does not remove the stray byte.
>>> s = "71.2\x0060"
>>> "".join([x for x in s if ord(x) < 127])
'71.2\x0060'
Other methods I've tried are:
>>> s.split("\\x")
['71.2\x0060']
>>> s.split("\x")
ValueError: invalid \x escape
I'm not sure why this string is not formatted correctly, but I'd like to get as much precision from this string and move on.
Going off of wim's comment, the answer might be this:
>>> s.split("\x00")
['71.2', '60']
So I should do:
>>> float(s.split("\x00")[0])
71.2
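Wrapped up as a small helper, this could look like the sketch below. The function name parse_leading_float is made up for illustration, and it assumes that everything after the first NUL character is garbage that can be discarded, as in the steps above.

```python
def parse_leading_float(text):
    """Parse the part of text before the first NUL character as a float."""
    # partition returns (head, separator, tail); head is the whole string
    # when no NUL character is present.
    head, _, _ = text.partition("\x00")
    return float(head)

print(parse_leading_float("71.2\x0060"))  # 71.2
print(parse_leading_float("3.5"))         # 3.5
```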
Unfortunately the POSIX group \p{XDigit} does not exist in the re module. To remove the hex control characters with regular expressions anyway, you can try the following.
import re
re.sub(r'[\x00-\x1F]', r'', '71.2\x0060') # or:
re.sub(r'\\x[0-9a-fA-F]{2}', r'', r'71.2\x0060')
Output:
'71.260'
'71.260'
The r prefix marks a raw string literal. Take a look at the control characters up to hex 1F in the ASCII table: https://www.torsten-horn.de/techdocs/ascii.htm
I know there are tons of questions about this, but somehow I could not find a solution to my problem (in Python 3):
toto="//\udcc3\udca0"
fp = open('cool', 'w')
fp.write(toto)
I get:
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'utf-8' codec can't encode character '\udcc3' in position 2: surrogates not allowed
How can I make it work?
A clarification: the string "//\udcc3\udca0" is given to me and I have no control over it. '\udcc3\udca0' is supposed to represent the character 'à'.
'\udcc3\udca0' is supposed to represent the character 'à'
The proper way to write 'à' using Python Unicode escapes is '\u00E0'. Its UTF-8 encoding is b'\xc3\xa0'.
It seems that whatever process produced your string was trying to use the UTF-8 representation, but instead of properly converting it to a Unicode string, it put the individual bytes in the U+DCxx range used by Python 3's surrogateescape convention.
>>> 'à'.encode('UTF-8').decode('ASCII', 'surrogateescape')
'\udcc3\udca0'
To fix the string, invert the operations that mangled it.
toto="//\udcc3\udca0"
toto = toto.encode('ASCII', 'surrogateescape').decode('UTF-8')
# At this point, toto == '//à', as intended.
fp = open('cool', 'w')
fp.write(toto)
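The same repair as a reusable function, as a sketch only: the name unescape_surrogates is invented here, and the approach assumes that every character in the input is either plain ASCII or one of the U+DC80 to U+DCFF escape surrogates. A genuine non-ASCII character in the input would make the encode step raise UnicodeEncodeError.

```python
def unescape_surrogates(text):
    """Turn surrogate-escaped UTF-8 bytes back into the intended characters."""
    # Step 1: map each escape surrogate back to the raw byte it stands for.
    raw = text.encode('ascii', 'surrogateescape')
    # Step 2: decode those bytes as the UTF-8 they were meant to be.
    return raw.decode('utf-8')

fixed = unescape_surrogates("//\udcc3\udca0")
print(fixed)                  # //à
print(fixed.encode('utf-8'))  # b'//\xc3\xa0'
```

After the repair the string no longer contains surrogates, so encoding it as UTF-8 (which is what writing to the file does) succeeds.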
I'm using Python 3 and have a string that is displayed as escaped bytes:
strategyName=\xe7\x99\xbe\xe5\xba\xa6
I need to turn it into readable Chinese characters by decoding it:
orig=b'strategyName=\xe7\x99\xbe\xe5\xba\xa6'
result=orig.decode('UTF-8')
print(result)
which prints the following, and that is what I want:
strategyName=百度
But if I store it in a regular string, it behaves differently:
str0='strategyName=\xe7\x99\xbe\xe5\xba\xa6'
result_byte=str0.encode('UTF-8')
result_str=result_byte.decode('UTF-8')
print(result_str)
strategyName=ç¾åº¦é£é©çç¥
Please help me understand why this is happening and how I can fix it.
Thanks a lot
Your problem is using a str literal when you're trying to store the UTF-8 encoded bytes of your string. You should just use the bytes literal, but if that str form is necessary, the correct approach is to encode in latin-1 (which is a 1-1 converter for all ordinals below 256 to the matching byte value) to get the bytes with utf-8 encoded data, then decode as utf-8:
str0 = 'strategyName=\xe7\x99\xbe\xe5\xba\xa6'
result_byte = str0.encode('latin-1') # Only changed line
result_str = result_byte.decode('UTF-8')
print(result_str)
Of course, the other approach could be to just type the Unicode escapes you wanted in the first place instead of byte level escapes that correspond to a UTF-8 encoding:
result_str = 'strategyName=\u767e\u5ea6'
No rigmarole needed.
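For strings that arrive in the mangled str form at runtime, the latin-1 round trip can be wrapped in a small function. This is a sketch; the name redecode_utf8 is made up, and it assumes every character in the input has a code point below 256, otherwise the latin-1 encode step raises UnicodeEncodeError.

```python
def redecode_utf8(mojibake):
    """Reinterpret a str whose code points are really UTF-8 byte values."""
    # latin-1 maps code points 0-255 one-to-one onto byte values, so this
    # recovers the original bytes, which are then decoded as UTF-8.
    return mojibake.encode('latin-1').decode('utf-8')

str0 = 'strategyName=\xe7\x99\xbe\xe5\xba\xa6'
print(redecode_utf8(str0))  # strategyName=百度
```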