Raw string encoding from input() - python-3.x

I'm trying to scan a variable directory, said variable is defined by an input(), yet the program throws out this issue:
(unicode error) 'unicodeescape' codec can't decode bytes in position 320-321: truncated \UXXXXXXXX escape
Current code: Not Working
import os
import time
print("Enter directory name.\nDirectory name example:\nC:\Users\example\Documents")
dirname = input()
with os.scandir(dirname) as dir_entries:
    for entry in dir_entries:
        info = entry.stat()
        file_name = os.path.basename(entry)
        my_time = time.strftime('%Y-%m-%d %H:%M:%S', time.localtime(info.st_ctime))
        rawmb = (info.st_size/(1024*1024))
        truncated = round(rawmb, 3)
        print(file_name)
        print(my_time)
        print(truncated,"MB")
        print('===========================')
I considered using \\ or / instead of \ during the input, but that makes copying the directory from the file explorer impossible.
I have no idea how to include an r in front of the input() string.
.decode,.encode didn't seem to work for me either, but I most likely just used them wrong.
Edit #1
Tried the solution from J_H
Do this after input(): for ch in dirname: print(ch, ord(ch))
Result:
Same error.
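A possible explanation, offered as a sketch rather than a confirmed diagnosis: the 'unicodeescape' error is raised while Python parses a string literal in the source file, before input() ever runs. The only literal above that contains \U is the example path inside print(), so marking that part as a raw string (or doubling its backslashes) should be enough. The text returned by input() is never escape-processed, so a path pasted from the file explorer can be used as-is. The names PROMPT, list_directory and main below are illustrative and not part of the original code.

```python
import os
import time

# The r"..." prefix keeps the backslashes of the example path literal, so "\U"
# is no longer parsed as the start of a \UXXXXXXXX escape sequence.
PROMPT = ("Enter directory name.\nDirectory name example:\n"
          + r"C:\Users\example\Documents")


def list_directory(dirname):
    """Return (name, timestamp, size in MB) for each entry in dirname."""
    rows = []
    with os.scandir(dirname) as dir_entries:
        for entry in dir_entries:
            info = entry.stat()
            # st_ctime is the creation time on Windows (metadata change time on Unix).
            stamp = time.strftime('%Y-%m-%d %H:%M:%S',
                                  time.localtime(info.st_ctime))
            size_mb = round(info.st_size / (1024 * 1024), 3)
            rows.append((entry.name, stamp, size_mb))
    return rows


def main():
    print(PROMPT)
    # input() returns the typed text unchanged; no r prefix is needed or possible here.
    dirname = input()
    for name, stamp, size_mb in list_directory(dirname):
        print(name)
        print(stamp)
        print(size_mb, "MB")
        print('===========================')
```

Calling main() reproduces the interactive behaviour of the original script; list_directory() can also be called directly with any path string.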

Related

How to convert Hex to original file format?

I have a .tgz file that was formatted as shell code, it looks like this (Hex):
"\x1F\x8B\x08\x00\x44\x7A\x91\x4F\x00\x03\xED\x59\xED\x72.."
It was generated this way (python3):
import os
def main():
    dump_src = "MyPlugin.tgz"
    fc = ""
    try:
        with open(dump_src, 'rb') as fd:
            fcr = fd.read()
            for byte in bytearray(fcr):
                fc += "\\x{:02x}".format(byte)
    except:
        fcr = dump_src
        for byte in bytearray(fcr):
            fc += "\\x{:02x}".format(byte)
    print(fc)
    # failed attempt:
    fcback = bytes(int(fc[i+2:i+4], 16) for i in range(0, len(fc), 4))
    print (fcback)
if __name__ == "__main__":
    main()
How can I convert this back to the original tgz archive?
Edit: failed attempt in the last section outputs this:
b'\x8b\x00\x10]\x03\x93o0\x85%\xe2!\xa4H\xf1Fi\xa7\x15\xf61&\x13N\xd9[\xfag\x11V\x97\xd3\xfb%\xf7\xe3\\\xae\xc2\xff\xa4>\xaf\x11\xcc\x93\xf1\x0c\x93\xa4\x1b\xefxj\xc3?\xf9\xc1\xe8\xd1\xd9\x01\x97qB"\x1a\x08\x9cO\x7f\xe9\x19\xe3\x9c\x05\xf2\x04a\xaa\x00A,\x15"RN-\xb6\x18K\x85\xa1\x11\x83\xac/\xffR\x8a\xa19\xde\x10\x0b\x08\x85\x93\xfc]\x8a^\xd2-T\x92\x9a\xcc-W\xc7|\xba\x9c\xb3\xa6V0V H1\x98\xde\x03#\x14\'\n 1Y\xf7R\x14\xe2#\xbe*:\xe0\xc8\xbb\xc9\x0bo\x8bm\xed.\xfd\xae\xef\x9fT&\xa1\xf4\xcf\xa7F\xf4\xef\xbb"8"\xb5\xab,\x9c\xbb\xfc3\x8b\xf5\x88\xf4A\x0ek%5eO\xf4:f\x0b\xd6\x1bi\xb6\xf3\xbf\xf7\xf9\xad\xb5[\xdba7\xb8\xf9\xcd\xba\xdd,;c\x0b\xaaT"\xd4\x96\x17\xda\x07\x87& \xceH\xd6\xbf\xd2\xeb\xb4\xaf\xbd\xc2\xee\xfc\'3zU\x17>\xde\x06u\xe3G\x7f\x1e\xf3\xdf\xb6\x04\x10A\x04\x10A\x04\x10A\x04\x10A\xff\x9f\xab\xe8(\x00'
And when I output it to a file (e.g. via python3 main.py > MyFile.tgz) the file is corrupted.
Since you know the format of the data (each byte is encoded as a string of 4 characters in the format "\xAB") it's easy to revert the conversion and get the original bytes again. It'll only take one line of Python code:
data = bytes(int(fc[i+2:i+4], 16) for i in range(0, len(fc), 4))
This uses:
range(start, stop, step) with step 4 to iterate in groups of 4 characters through your string
slicing to get each group of 2 hexadecimal digits
int(x, base) to convert the hexadecimal string to an integer
a generator expression to immediately pass the converted elements to:
bytes() to create a bytes object with the data
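The variable data is now of type bytes and you could directly write it to a file (to decompress with an external archive program), or pass it to gzip.decompress() (to further process it in Python; plain zlib.decompress() expects a zlib header, so for gzip data its wbits argument has to be set to accept the gzip format).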
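As a small worked example of the one-liner, here is a round trip on four sample bytes (a gzip-style header start). The helper name decode_escaped_hex and the sample value are made up for illustration; only the conversion expression comes from the answer.

```python
def decode_escaped_hex(fc):
    """Turn text like '\\x1f\\x8b' (4 characters per byte) back into bytes."""
    return bytes(int(fc[i+2:i+4], 16) for i in range(0, len(fc), 4))


sample = b"\x1f\x8b\x08\x00"
# Encode the same way as the question's code: backslash, 'x', two hex digits.
encoded = "".join("\\x{:02x}".format(byte) for byte in sample)
decoded = decode_escaped_hex(encoded)

print(encoded)            # the 16-character text form
print(decoded == sample)  # the round trip restores the original bytes
```

Because every byte occupies exactly four characters, stripping the "\x" markers and calling bytes.fromhex() on the rest would be an equivalent alternative.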
UPDATE (follow-up on the comments and updated question):
Firstly, I have tested the above code and it does result in the same bytes as the input. Are you really sure that the example output in your question is the actual result of the code in your question? Please try to be careful when copying code and/or output. A few remarks:
Your code is not properly formatted, so I cannot run it without making modifications. And when I have made modifications to the code, I might run different code than you do, yielding different results. So next time please copy-paste your exact (working, tested) code without modifications.
The format string in your code uses lowercase hexadecimal format, and your first example output uses uppercase. So that output cannot be from this code.
I don't have access to your file "MyPlugin.tgz", but when I test your code with another .tgz file (after fixing the IndentationErrors), my output is correct. It starts with \x1f\x8b as expected (this is the magic number in the gzip header). I can't explain why your output is different...
Secondly, it seems like you don't fully understand how bytes and string representations work. When you write print(fcback), a string representation of the Python object fcback (in this case a bytes object) is printed. The string representation of a bytes object is not the same as the binary data! When printing a bytes object, each byte that corresponds to a printable ASCII character is replaced by that character, other bytes are escaped (similar to the formatted string that your code generates). Also, it starts with b' and ends with '.
You cannot print binary data to your terminal and then pipe the output to a file. This will result in a different file. The correct way to write the data to a file is using file.write(data) in your Python code.
Here's a fully working example:
def binary_to_text(data):
    """Convert a bytes object to a formatted text string."""
    text = ""
    for byte in data:
        text += "\\x{:02x}".format(byte)
    return text
def text_to_binary(text):
    """Convert a formatted text string to a bytes object."""
    return bytes(int(text[i+2:i+4], 16) for i in range(0, len(text), 4))
def main():
    # Read the binary data from input file:
    with open('MyPlugin.tgz', 'rb') as input_file:
        input_data = input_file.read()
    # Convert binary to text (based on your original code):
    text = binary_to_text(input_data)
    print(text[0:100])
    # Convert the text back to binary:
    output_data = text_to_binary(text)
    print(output_data[0:100])
    # Write the binary data back to a file:
    with open('MyPlugin-restored.tgz', 'wb') as output_file:
        output_file.write(output_data)
if __name__ == '__main__':
    main()
Note that I only print the first 100 elements to keep the output short. Also notice that the second print-statement prints a much longer text. This is because the first print gets 100 characters (which are printed "as is"), while the second print gets 100 bytes (of which most bytes are escaped, causing the output to be longer).

Trying to output hex data as readable text in Python 3.6

I am trying to read hex values from specific offsets in a file, and then show that as normal text. Upon reading the data from the file and saving it to a variable named uName, and then printing it, this is what I get:
Card name is: b'\x95\xdc\x00'
Here's the code:
cardPath = str(input("Enter card path: "))
print("Card name is: ", end="")
with open(cardPath, "rb+") as f:
    f.seek(0x00000042)
    uName = f.read(3)
print(uName)
How can I remove the 'b' I am getting at the beginning? And how can I remove the '\x'es so that b'\x95\xdc\x00' becomes 95dc00? If I can do that, then I guess I can convert it to text using binascii.
I am sorry if my mistake is really really stupid because I don't have much experience with Python.
A string that starts with b in Python is a byte string (a bytes object).
Usually, you can use decode() or str(byte_string, 'UTF-8') to decode the byte string (i.e. the string that starts with b') to a regular string.
EXAMPLE
str(b'\x70\x79\x74\x68\x6F\x6E','UTF-8')
'python'
b'\x70\x79\x74\x68\x6F\x6E'.decode()
'python'
However, in your case, decoding raises a UnicodeDecodeError:
str(b'\x95\xdc\x00','UTF-8')
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x95 in position 0: invalid start byte
I guess you need to find out the encoding for your file and then specify it when you open the file, like below:
open("u.item", encoding="THE_ENCODING_YOU_FOUND")
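If the goal is only the hexadecimal digits (95dc00) rather than decoded text, no encoding needs to be known. A minimal sketch, assuming the three bytes shown in the question; the variable names are illustrative:

```python
import binascii

raw = b'\x95\xdc\x00'  # the three bytes read from the file in the question

# bytes.hex() (available since Python 3.5) returns the hex digits as a str,
# without the b'' wrapper and without the \x markers.
hex_text = raw.hex()

# binascii.hexlify() returns the same digits as bytes, so decode them to str.
hex_text_alt = binascii.hexlify(raw).decode('ascii')

print("Card name is:", hex_text)
```

Whether those bytes can then be turned into readable text still depends on finding the encoding the file actually uses.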

How to read the file without encoding and extract desired urls with python3?

Environment: Python 3.
There are many files; some of them are encoded with gbk, others with utf-8.
I want to extract all the jpg URLs with a regular expression.
For example, s.html is encoded with gbk:
tree = open("/tmp/s.html","r").read()
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb4 in position 135: invalid start byte
tree = open("/tmp/s.html","r",encoding="gbk").read()
pat = r"http://.+\.jpg"
result = re.findall(pat,tree)
print(result)
['http://somesite/2017/06/0_56.jpg']
It is a huge job to open every file with a manually specified encoding. I want a smart way to extract the jpg URLs from all the files.
I had a similar problem, and I solved it as follows.
In get_file_encoding(filename), I first check whether the file starts with a BOM (Byte Order Mark); if so, I take the encoding from the BOM, using the function get_file_bom_encoding(filename).
If that does not return a value, I get a list from the function get_all_file_encodings(filename).
This list contains all the encodings in which the file can be opened. For my purpose I needed just one and did not care about the rest, so I simply chose the first value of the list: file_encoding = str(encoding_list[0][0])
This is obviously not 100% accurate, but at least it gives you either the correct encoding from the BOM or a list of encodings in which the file can be opened without errors. If you do the same as I did, it gives you one encoding (the first value in the list) that the file can be opened with without errors.
Here is the code:
# -*- coding: utf-8 -*-
import codecs
def get_file_bom_encoding(filename):
    # Read only the first bytes; the longest BOM (UTF-32) is 4 bytes.
    with open(filename, 'rb') as openfileobject:
        head = openfileobject.read(4)
    # Test UTF-32 before UTF-16: the UTF-32 LE BOM begins with the UTF-16 LE BOM.
    if head.startswith(codecs.BOM_UTF8): return 'utf_8'
    if head.startswith(codecs.BOM_UTF32_BE): return 'utf_32'
    if head.startswith(codecs.BOM_UTF32_LE): return 'utf_32'
    if head.startswith(codecs.BOM_UTF16_BE): return 'utf_16'
    if head.startswith(codecs.BOM_UTF16_LE): return 'utf_16'
    return ''
def get_all_file_encodings(filename):
    encoding_list = []
    encodings = ('utf_8', 'utf_16', 'utf_16_le', 'utf_16_be',
                 'utf_32', 'utf_32_be', 'utf_32_le',
                 'cp850' , 'cp437', 'cp852', 'cp1252', 'cp1250' , 'ascii',
                 'utf_8_sig', 'big5', 'big5hkscs', 'cp037', 'cp424', 'cp500',
                 'cp720', 'cp737', 'cp775', 'cp855', 'cp856', 'cp857',
                 'cp858', 'cp860', 'cp861', 'cp862', 'cp863', 'cp864',
                 'cp865', 'cp866', 'cp869', 'cp874', 'cp875', 'cp932',
                 'cp949', 'cp950', 'cp1006', 'cp1026', 'cp1140', 'cp1251',
                 'cp1253', 'cp1254', 'cp1255', 'cp1256', 'cp1257',
                 'cp1258', 'euc_jp', 'euc_jis_2004', 'euc_jisx0213',
                 'euc_kr', 'gb2312', 'gbk', 'gb18030', 'hz', 'iso2022_jp',
                 'iso2022_jp_1', 'iso2022_jp_2', 'iso2022_jp_2004',
                 'iso2022_jp_3', 'iso2022_jp_ext', 'iso2022_kr', 'latin_1',
                 'iso8859_2', 'iso8859_3', 'iso8859_4', 'iso8859_5',
                 'iso8859_6', 'iso8859_7', 'iso8859_8', 'iso8859_9',
                 'iso8859_10', 'iso8859_13', 'iso8859_14', 'iso8859_15',
                 'iso8859_16', 'johab', 'koi8_r', 'koi8_u', 'mac_cyrillic',
                 'mac_greek', 'mac_iceland', 'mac_latin2', 'mac_roman',
                 'mac_turkish', 'ptcp154', 'shift_jis', 'shift_jis_2004',
                 'shift_jisx0213'
                 )
    for e in encodings:
        try:
            fh = codecs.open(filename, 'r', encoding=e)
            fh.readlines()
        except UnicodeDecodeError:
            fh.close()
        except UnicodeError:
            fh.close()
        else:
            encoding_list.append([e])
            fh.close()
            continue
    return encoding_list
def get_file_encoding(filename):
    file_encoding = get_file_bom_encoding(filename)
    if file_encoding != '':
        return file_encoding
    encoding_list = get_all_file_encodings(filename)
    file_encoding = str(encoding_list[0][0])
    if file_encoding[-3:] == '_be' or file_encoding[-3:] == '_le':
        file_encoding = file_encoding[:-3]
    return file_encoding
def main():
    print('This Python script is only for import functionality, it does not run interactively')
if __name__ == '__main__':
    main()
I am sure that there are modules/packages which can do this more accurately, but this uses only standard packages (which was another requirement I had).
It does mean that you are reading the files multiple times, which is not a very fast way of doing things. You may be able to adapt this to suit your own particular problem, or even improve on it. In particular, the "reading multiple times" part is something you could look at: stop as soon as one working encoding is found and open the file with it immediately.
If they have mixed encoding, you could try one encoding and fall back to another:
# first open as binary
with open(..., 'rb') as f:
    f_contents = f.read()
try:
    contents = f_contents.decode('UTF-8')
except UnicodeDecodeError:
    contents = f_contents.decode('gbk')
...
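The try/fall-back idea can be wrapped in a small helper so it applies to every file in the same way. This is only a sketch: the function name decode_with_fallback and the default candidate list are assumptions, and the order matters, because a permissive encoding placed first would accept almost any input.

```python
def decode_with_fallback(data, encodings=('utf-8', 'gbk')):
    """Decode bytes with the first candidate encoding that raises no error.

    Returns a (text, encoding_used) tuple.
    """
    for encoding in encodings:
        try:
            return data.decode(encoding), encoding
        except UnicodeDecodeError:
            # This candidate does not fit; try the next one.
            continue
    raise ValueError("none of the candidate encodings could decode the data")


def read_text(path, encodings=('utf-8', 'gbk')):
    """Read a file as bytes and decode it with the fallback chain."""
    with open(path, 'rb') as f:
        return decode_with_fallback(f.read(), encodings)
```

UTF-8 goes first because it is strict: most gbk-encoded Chinese text is not valid UTF-8, so the fallback is usually reached when needed, although short byte sequences can occasionally be valid in both.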
If they are html files, you may also be able to find the encoding tag, or search them as binary with a binary regex:
contents = open(..., 'rb').read()
regex = re.compile(rb'http://.+\.jpg')
result = regex.findall(contents)
# now you'll probably want to `.decode()` each of the urls, but you should be able to do that pretty trivially with even the `ASCII` codec
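A slightly fuller sketch of the binary-regex approach follows. The pattern is a tightened variant of the one above, not the original: it is non-greedy and excludes whitespace, quotes, angle brackets and non-ASCII bytes, on the assumption that the URLs sit inside quoted HTML attributes. The names JPG_URL and find_jpg_urls are made up for this example.

```python
import re

# Bytes pattern: match from "http://" up to the first ".jpg", without crossing
# whitespace, quotes, angle brackets or non-ASCII bytes.
JPG_URL = re.compile(rb'http://[^\s"\'<>\x80-\xff]+?\.jpg')


def find_jpg_urls(raw):
    """Return all matching URLs in raw bytes as str, whatever the file encoding."""
    # The pattern only admits ASCII bytes, so the ASCII codec is sufficient.
    return [match.decode('ascii') for match in JPG_URL.findall(raw)]
```

Since the file is never decoded as a whole, the same function can be applied to gbk and utf-8 files alike.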
Though now that I think of it, you probably don't really want to use regex to capture the links, as you'll then have to deal with HTML entities (such as &amp;), and you may do better with something like pyquery.
Here's a quick example using pyquery
import pyquery
contents = open(..., 'rb').read()
pq = pyquery.PyQuery(contents)
images = pq.find('img')
for img in images:
    img = pyquery.PyQuery(img)
    src = img.attr('src')
    # skip <img> tags without a src attribute
    if src and src.endswith('.jpg'):
        print(src)
Not on my computer with things installed, so mileage with these code samples may vary

urlopen() throwing error in python 3.3

from urllib.request import urlopen
def ShowResponse(param):
    uri = str("mysite.com/?param="+param+"&submit=submit")
    print(urlopen(uri).read())
file = open("myfile.txt","r")
if file.mode == "r":
    filelines = file.readlines()
    for line in filelines:
        line = line.strip()
        ShowResponse(line)
This is my Python code, but when I run it I get the error "UnicodeEncodeError: 'ascii' codec can't encode characters in position 47-49: ordinal not in range(128)".
I don't know how to fix this. I'm new to Python.
I'm going to assume that the stack trace shows that line 4 (uri = str(...)) is throwing the given error and that myfile.txt contains UTF-8 characters.
The error occurs because a Unicode string (decoded from what is assumed to be UTF-8) is being encoded as ASCII. ASCII simply cannot represent your character.
URIs (including the Query String) must encode non-ASCII chars as percent-encoded UTF-8 bytes. Example:
€ (EURO SIGN) encoded in UTF-8 is:
0xE2 0x82 0xAC
Percent-encoded, it's:
%E2%82%AC
Therefore, your code needs to re-encode your parameter to UTF-8 then percent-encode it:
from urllib.request import urlopen
from urllib.parse import quote  # quote() is documented in urllib.parse
def ShowResponse(param):
    param_utf8 = param.encode("utf-8")
    param_perc_encoded = quote(param_utf8)
    # or uri = str("mysite.com/?param="+param_perc_encoded+"&submit=submit")
    uri = str("mysite.com/?param={0}&submit=submit".format(param_perc_encoded) )
    print(urlopen(uri).read())
You'll also see I've changed your uri = definition slightly to use str.format() (https://docs.python.org/2/library/string.html#format-string-syntax), which I find easier for building complex strings than concatenation with +. In this example, {0} is replaced with the first argument to .format().
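The percent-encoding of the euro sign described above can be checked without any network access. A short sketch using urllib.parse; the parameter names mirror the question's query string and the variable names are illustrative:

```python
from urllib.parse import quote, urlencode

# quote() accepts str (encoded as UTF-8 by default) or bytes.
encoded_from_str = quote("€")
encoded_from_bytes = quote("€".encode("utf-8"))

# urlencode() builds a whole query string and percent-encodes each value.
query = urlencode({"param": "€", "submit": "submit"})

print(encoded_from_str)
print(query)
```

Using urlencode() for the whole query string avoids assembling "&" and "=" by hand, which may be convenient when there are several parameters.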

Python3 String has no decode for windows-1256

I have an Arabic string in windows-1256, that I need to convert into ascii, so that it can be sent into html2text. However upon execution an error returns stating str object has no attribute 'decode'
filename=codecs.open(keyworddir + "\\" + item, "r", encoding = "windows-1256")
outputfile=filename.readlines()
file=open(keyworddir + "\\" + item, "w")
for line in outputfile:
    line=line.decode(encoding='windows-1256')
    line=line.encode('UTF-8')
    file.write(line)
file.close()
In Python 3, str is already a decoded Unicode string, so you cannot decode line again.
What you have missed is that decoding happens implicitly while reading the file. codecs.open with "r" mode reads the file as a text file with the given encoding and automatically decodes all text.
So, you can either:
open the file in binary mode: filename=open(keyworddir + "\\" + item, "rb"); the lines will now be bytes and they will be decodable
or, better, simply remove the superfluous decoding line: line=line.decode(encoding='windows-1256')
Note:
you should consider opening the output file with codecs.open(keyworddir + "\\" + item, "w", encoding = "utf-8"), therefore making it unnecessary to manually encode the line
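Putting these suggestions together, a conversion can be sketched with the built-in open() and explicit encodings on both sides, so no manual decode() or encode() call is needed. The function name and the separate source and destination paths are assumptions for this example (the question overwrites the file in place); "cp1256" is Python's codec name for windows-1256.

```python
def convert_to_utf8(src_path, dst_path, src_encoding="cp1256"):
    """Re-encode a text file from src_encoding to UTF-8."""
    # newline="" disables newline translation so line endings are preserved.
    with open(src_path, "r", encoding=src_encoding, newline="") as src:
        text = src.read()          # already a decoded str in Python 3
    with open(dst_path, "w", encoding="utf-8", newline="") as dst:
        dst.write(text)            # encoded to UTF-8 by the file object
```

Reading the whole file into memory keeps the sketch short; for very large files a line-by-line loop over src would work the same way.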
I had similar problems. It took me five days of work trying to solve this; finally I used the following solution.
Before opening the file, run this command on the command line (this is the Linux command line, of course):
iconv -f 'windows-1256' -t 'utf-8' '[your file name]' -o '[output file name]'
You can run command-line commands automatically from Python code using this function:
import subprocess
def run_cmd(cmd):
    process = subprocess.Popen(cmd, shell=True, stdout=subprocess.PIPE)
    process.wait()

Resources