Remove double escape characters from a string and make it a binary - string

I have a string looking like this:
"\\xd6\\x83\\x8dd!VT\\x92\\xaaA\\x05\\xe0\\x9b\\x8b\\xf1"
and I want to remove the double escape characters in order to make it a proper binary. Is that even possible?

The source string looks pretty much like a bytes string, so you could do:
>>> import ast
>>> s = "\\xd6\\x83\\x8dd!VT\\x92\\xaaA\\x05\\xe0\\x9b\\x8b\\xf1"
>>> print(ast.literal_eval("b'''%s'''" % s))
b'\xd6\x83\x8dd!VT\x92\xaaA\x05\xe0\x9b\x8b\xf1'

Related

How to convert backslash string to forward slash string in python3?

I am using python3 in my ubuntu machine.
I have a string variable which consist a path with backslash and i need to convert it to forward slash string. So i tried
import pathlib
s = '\dir\wnotherdir\joodir\more'
x = repr(s)
p = pathlib.PureWindowsPath(x)
print(p.as_posix())
this will print correctly as
/dir/wnotherdir/joodir/more
but for different other string path, it acts weirdly. Ex, for the string,
'\dir\aotherdir\oodir\more'
it correctly replacing backslashes but the value is wrong because of the character 'a' in original string
/dir/x07otherdir/oodir/more
What is the reason for this behavior?
This has nothing to do with paths per-se. The problem here is that the \a is being interpreted as an ASCII BELL. As a rule of thumb whenever you want to disable the special interpretation of escaped string literals you should use raw strings:
>>> import pathlib
>>> r = r'\dir\aotherdir\oodir\more'
>>> pathlib.PureWindowsPath(r)
PureWindowsPath('/dir/aotherdir/oodir/more')
>>>

Convert string with "\x" character to float

I'm converting strings to floats using float(x). However for some reason, one of the strings is "71.2\x0060". I've tried following this answer, but it does not remove the bytes character
>>> s = "71.2\x0060"
>>> "".join([x for x in s if ord(x) < 127])
'71.2\x0060'
Other methods I've tried are:
>>> s.split("\\x")
['71.2\x0060']
>>> s.split("\x")
ValueError: invalid \x escape
I'm not sure why this string is not formatted correctly, but I'd like to get as much precision from this string and move on.
Going off of wim's comment, the answer might be this:
>>> s.split("\x00")
['71.2', '60']
So I should do:
>>> float(s.split("\x00")[0])
71.2
Unfortunately the POSIX group \p{XDigit} does not exist in the re module. To remove the hex control characters with regular expressions anyway, you can try the following.
impore re
re.sub(r'[\x00-\x1F]', r'', '71.2\x0060') # or:
re.sub(r'\\x[0-9a-fA-F]{2}', r'', r'71.2\x0060')
Output:
'71.260'
'71.260'
r means raw. Take a look at the control characters up to hex 1F in the ASCII table: https://www.torsten-horn.de/techdocs/ascii.htm

How to replace hex value in a string

While importing data from a flat file, I noticed some embedded hex-values in the string (<0x00>, <0x01>).
I want to replace them with specific characters, but am unable to do so. Removing them won't work either.
What it looks like in the exported flat file: https://i.imgur.com/7MQpoMH.png
Another example: https://i.imgur.com/3ZUSGIr.png
This is what I've tried:
(and mind, <0x01> represents a none-editable entity. It's not recognized here.)
import io
with io.open('1.txt', 'r+', encoding="utf-8") as p:
s=p.read()
# included in case it bears any significance
import re
import binascii
s = "Some string with hex: <0x01>"
s = s.encode('latin1').decode('utf-8')
# throws e.g.: >>> UnicodeDecodeError: 'utf-8' codec can't decode byte 0xfc in position 114: invalid start byte
s = re.sub(r'<0x01>', r'.', s)
s = re.sub(r'\\0x01', r'.', s)
s = re.sub(r'\\\\0x01', r'.', s)
s = s.replace('\0x01', '.')
s = s.replace('<0x01>', '.')
s = s.replace('0x01', '.')
or something along these lines in hopes to get a grasp of it while iterating through the whole string:
for x in s:
try:
base64.encodebytes(x)
base64.decodebytes(x)
s.strip(binascii.unhexlify(x))
s.decode('utf-8')
s.encode('latin1').decode('utf-8')
except:
pass
Nothing seems to get the job done.
I'd expect the characters to be replacable with the methods I've dug up, but they are not. What am I missing?
NB: I have to preserve umlauts (äöüÄÖÜ)
-- edit:
Could I introduce the hex-values in the first place when exporting? If so, is there a way to avoid that?
with io.open('out.txt', 'w', encoding="utf-8") as temp:
temp.write(s)
Judging from the images, these are actually control characters.
Your editor displays them in this greyed-out way showing you the value of the bytes using hex notation.
You don't have the characters "0x01" in your data, but really a single byte with the value 1, so unhexlify and friends won't help.
In Python, these characters can be produced in string literals with escape sequences using the notation \xHH, with two hexadecimal digits.
The fragment from the first image is probably equal to the following string:
"sich z\x01 B. irgendeine"
Your attempts to remove them were close.
s = s.replace('\x01', '.') should work.

How to remove digits from the end of a string in Python 3.x?

I want to remove digits from the end of a string, but I have no idea.
Can the split() method work? How can I make that work?
The initial string looks like asdfg123,and I only want asdfg instead.
Thanks for your help!
No, split would not work, because split only can work with a fixed string to split on.
You could use the str.rstrip() method:
import string
cleaned = yourstring.rstrip(string.digits)
This uses the string.digits constant as a convenient definition of what needs to be removed.
or you could use a regular expression to replace digits at the end with an empty string:
import re
cleaned = re.sub(r'\d+$', '', yourstring)
You can use str.rstrip with digit characters you want to remove trailing characters of the string:
>>> 'asdfg123'.rstrip('0123456789')
'asdfg'
Alternatively, you can use string.digits instead of '0123456789':
>>> import string
>>> string.digits
'0123456789'
>>> 'asdfg123'.rstrip(string.digits)
'asdfg'
Regex!
import re
intialString = "asdfg123"
newString = re.search("[^\d]*", intialString).group()
print(newString)
Expected Outcome is "asdfg".

Convering double backslash to single backslash in Python 3

I have a string like so:
>>> t
'\\u0048\\u0065\\u006c\\u006c\\u006f\\u0020\\u20ac\\u0020\\u00b0'
That I made using a function that converts unicode to the representative Python escape sequences. Then, when I want to convert it back, I can't get rid of the double backslash so that it is interpreted as unicode again. How can this be done?
>>> t = unicode_encode("
>>> t
'\\u0048\\u0065\\u006c\\u006c\\u006f\\u0020\\u20ac\\u0020\\u00b0'
>>> print(t)
\u0048\u0065\u006c\u006c\u006f\u0020\u20ac\u0020\u00b0
>>> t.replace('\\','X')
'Xu0048Xu0065Xu006cXu006cXu006fXu0020Xu20acXu0020Xu00b0'
>>> t.replace('\\', '\\')
'\\u0048\\u0065\\u006c\\u006c\\u006f\\u0020\\u20ac\\u0020\\u00b0'
Of course, I can't do this, either:
>>> t.replace('\\', '\')
File "<ipython-input-155-b46c447d6c3d>", line 1
t.replace('\\', '\')
^
SyntaxError: EOL while scanning string literal
Not sure if this is appropriate for your situation, but you could try using unicode_escape:
>>> t
'\\u0048\\u0065\\u006c\\u006c\\u006f\\u0020\\u20ac\\u0020\\u00b0'
>>> type(t)
<class 'str'>
>>> enc_t = t.encode('utf_8')
>>> enc_t
b'\\u0048\\u0065\\u006c\\u006c\\u006f\\u0020\\u20ac\\u0020\\u00b0'
>>> type(enc_t)
<class 'bytes'>
>>> dec_t = enc_t.decode('unicode_escape')
>>> type(dec_t)
<class 'str'>
>>> dec_t
'Hello € °'
Or in abbreviated form:
>>> t.encode('utf_8').decode('unicode_escape')
'Hello € °'
You take your string and encode it using UTF-8, and then decode it using unicode_escape.
Since a backslash is an escape character and you are searching for two backslashes you need to replace four backslashes with two - i.e.:
t.replace("\\\\", "\\")
This will replace every r"\\" with r"\". The r indicates raw string. So, for example, if you type print(r"\\") into idle or any python script (or print r"\\" in Python 2) you will get \\\\. This means that every "\\" is really just a r"\".
user1632861 suggested that you use .replace("\\", ""), but this replaces ever r"\" with nothing. Try the above method instead. :D
In this case, however, it appears as though you are reading/receiving data, and you probably want to use the correct encoding and then decode to unicode (as the person above me suggested).
You only got one backslash in your code, but backslashes are represent as \\. As you can see, when you use print(), there's only one backslash. So if you want to get rid of one of the two backslashes, don't do anything, it's not there. If you wanna get rid of both, just remove one. Again use \\ to represent one backslash: t.replace("\\", "")
So your string never has two backslashes in the first place, it shouldn't be the problem.

Resources