I'm trying to scan a directory whose path is supplied by an input() call, yet the program throws this error:
(unicode error) 'unicodeescape' codec can't decode bytes in position 320-321: truncated \UXXXXXXXX escape
Current code (not working):
import os
import time
print("Enter directory name.\nDirectory name example:\nC:\Users\example\Documents")
dirname = input()
with os.scandir(dirname) as dir_entries:
    for entry in dir_entries:
        info = entry.stat()
        file_name = os.path.basename(entry)
        my_time = time.strftime('%Y-%m-%d %H:%M:%S', time.localtime(info.st_ctime))
        rawmb = (info.st_size/(1024*1024))
        truncated = round(rawmb, 3)
        print(file_name)
        print(my_time)
        print(truncated, "MB")
        print('===========================')
I considered using \\ or / instead of \ when typing the input, but that makes it impossible to copy the directory path straight from the file explorer.
I have no idea how to put an r in front of a string that comes from input().
.decode and .encode didn't seem to work for me either, but I most likely just used them wrong.
Edit #1
Tried the solution from J_H
Do this after input(): for ch in dirname: print(ch, ord(ch))
Result:
Same error.
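For what it's worth, escape sequences such as \U are only processed in string literals written in the source file; the string that input() returns is never escape-decoded, so no r prefix is needed for the typed path. A minimal sketch of the difference (the example path is hypothetical):
# Escape handling applies to source literals only, not to input() values.
literal = "C:\\Users\\example\\Documents"   # in source code, backslashes must be escaped (or use r"...")
typed = input("Paste a directory path: ")   # backslashes typed here arrive as-is
print(repr(typed))                          # repr() shows exactly what the program received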
Maybe I have a trivial issue, but I can't find a solution.
I use a Raspberry Pi to read a value from the serial port. The input from serial looks like b'1\r\n'.
From this input I only need the number. I tried this code:
data = str(data)
data = data[2:7]
data = data.replace("\r\n","")
print(data)
The output of this code is "1\r\n". I can't get rid of this part of the string; the replace function just doesn't work and I don't know why.
Since you have bytes, you can use the decode method of bytes to get back a string. You can then use the rstrip method of str to remove the trailing newline characters.
data = b'1\r\n'
print(data)
data = data.decode('utf-8').rstrip()
print(data)
OUTPUT
b'1\r\n'
1
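As a side note on why the replace call in the question did not work: calling str() on a bytes object produces its repr, which contains a literal backslash, r, backslash, n rather than a real carriage return and newline, so replace("\r\n", "") has nothing to match. A small sketch:
data = b'1\r\n'
as_text = str(data)                      # "b'1\\r\\n'" -- the repr, with literal backslashes
print(as_text.replace("\r\n", ""))       # prints b'1\r\n' unchanged: no real CR/LF present
print(data.decode('utf-8').rstrip())     # decoding first, then stripping, prints 1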
Environment: Python 3.
There are many files; some of them are encoded with GBK, others with UTF-8.
I want to extract all the .jpg URLs with a regular expression.
For s.html, which is encoded with GBK:
tree = open("/tmp/s.html","r").read()
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb4 in position 135: invalid start byte
tree = open("/tmp/s.html","r",encoding="gbk").read()
pat = r"http://.+\.jpg"
result = re.findall(pat,tree)
print(result)
['http://somesite/2017/06/0_56.jpg']
It is a huge job to open every file with its encoding specified explicitly; I want a smarter way to extract the .jpg URLs from all the files.
I had a similar problem, and I solved it as follows.
In get_file_encoding(filename), I first check whether there is a BOM (byte order mark) in the file and, if so, take the encoding from the BOM, using the function get_file_bom_encoding(filename).
If that does not return a value, I get a list from the function get_all_file_encodings(filename).
This list holds all encodings the file can be opened in. For my purposes I just needed one and did not care about the rest, so I simply chose the first value of the list: file_encoding = str(encoding_list[0][0]).
This is obviously not 100% accurate, but at least it gives you either the correct encoding from the BOM or a list of encodings the file can be opened in. If you do the same, it gives you one encoding (the first value) that the file can be opened with without errors.
Here is the code:
# -*- coding: utf-8 -*-
import codecs
def get_file_bom_encoding(filename):
    with open(filename, 'rb') as openfileobject:
        line = str(openfileobject.readline())
        if line[2:14] == str(codecs.BOM_UTF8).split("'")[1]: return 'utf_8'
        # UTF-32 is checked before UTF-16: the UTF-32 LE BOM starts with the
        # same two bytes as the UTF-16 LE BOM and would otherwise be misdetected.
        if line[2:18] == str(codecs.BOM_UTF32_BE).split("'")[1]: return 'utf_32'
        if line[2:18] == str(codecs.BOM_UTF32_LE).split("'")[1]: return 'utf_32'
        if line[2:10] == str(codecs.BOM_UTF16_BE).split("'")[1]: return 'utf_16'
        if line[2:10] == str(codecs.BOM_UTF16_LE).split("'")[1]: return 'utf_16'
        return ''
def get_all_file_encodings(filename):
    encoding_list = []
    encodings = ('utf_8', 'utf_16', 'utf_16_le', 'utf_16_be',
                 'utf_32', 'utf_32_be', 'utf_32_le',
                 'cp850', 'cp437', 'cp852', 'cp1252', 'cp1250', 'ascii',
                 'utf_8_sig', 'big5', 'big5hkscs', 'cp037', 'cp424', 'cp500',
                 'cp720', 'cp737', 'cp775', 'cp855', 'cp856', 'cp857',
                 'cp858', 'cp860', 'cp861', 'cp862', 'cp863', 'cp864',
                 'cp865', 'cp866', 'cp869', 'cp874', 'cp875', 'cp932',
                 'cp949', 'cp950', 'cp1006', 'cp1026', 'cp1140', 'cp1251',
                 'cp1253', 'cp1254', 'cp1255', 'cp1256', 'cp1257',
                 'cp1258', 'euc_jp', 'euc_jis_2004', 'euc_jisx0213',
                 'euc_kr', 'gb2312', 'gbk', 'gb18030', 'hz', 'iso2022_jp',
                 'iso2022_jp_1', 'iso2022_jp_2', 'iso2022_jp_2004',
                 'iso2022_jp_3', 'iso2022_jp_ext', 'iso2022_kr', 'latin_1',
                 'iso8859_2', 'iso8859_3', 'iso8859_4', 'iso8859_5',
                 'iso8859_6', 'iso8859_7', 'iso8859_8', 'iso8859_9',
                 'iso8859_10', 'iso8859_13', 'iso8859_14', 'iso8859_15',
                 'iso8859_16', 'johab', 'koi8_r', 'koi8_u', 'mac_cyrillic',
                 'mac_greek', 'mac_iceland', 'mac_latin2', 'mac_roman',
                 'mac_turkish', 'ptcp154', 'shift_jis', 'shift_jis_2004',
                 'shift_jisx0213'
                 )
    for e in encodings:
        try:
            fh = codecs.open(filename, 'r', encoding=e)
            fh.readlines()
        except UnicodeDecodeError:
            fh.close()
        except UnicodeError:
            fh.close()
        else:
            encoding_list.append([e])
            fh.close()
            continue
    return encoding_list

def get_file_encoding(filename):
    file_encoding = get_file_bom_encoding(filename)
    if file_encoding != '':
        return file_encoding
    encoding_list = get_all_file_encodings(filename)
    file_encoding = str(encoding_list[0][0])
    if file_encoding[-3:] == '_be' or file_encoding[-3:] == '_le':
        file_encoding = file_encoding[:-3]
    return file_encoding

def main():
    print('This Python script is only for import functionality, it does not run interactively')

if __name__ == '__main__':
    main()
I am sure that there are modules/packages which can do this more accurately, but this is done with standard packages (which was another requirement I had).
It does mean that you are reading the files multiple times, which is not a very fast way of doing things. You may be able to use this for your own particular problem, or even improve on it; the "reading multiple times" part in particular is something you could look at, for example by opening the file immediately once a working encoding is found.
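For example, a minimal usage sketch of the functions above (the file path is just an example): detect the encoding once, then open the file with it.
path = '/tmp/s.html'                     # example file, adjust to your own
enc = get_file_encoding(path)            # BOM check first, then trial decoding
with open(path, 'r', encoding=enc) as f:
    text = f.read()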
If they have mixed encoding, you could try one encoding and fall back to another:
# first open as binary
with open(..., 'rb') as f:
    f_contents = f.read()

try:
    contents = f_contents.decode('UTF-8')
except UnicodeDecodeError:
    contents = f_contents.decode('gbk')
...
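Applying the same fallback to the "all the files" part of the question, a minimal sketch (the directory, glob pattern and regex are assumptions, not from the original post):
import glob
import re

pat = re.compile(r'http://.+?\.jpg')          # non-greedy, so one URL per match
urls = []
for name in glob.glob('/tmp/*.html'):         # assumed location of the files
    with open(name, 'rb') as f:
        raw = f.read()
    try:
        text = raw.decode('utf-8')
    except UnicodeDecodeError:
        text = raw.decode('gbk')              # fall back, as above
    urls.extend(pat.findall(text))
print(urls)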
If they are HTML files, you may also be able to find the encoding declared in a meta tag, or search them as binary with a bytes regex:
import re

contents = open(..., 'rb').read()
regex = re.compile(rb'http://.+\.jpg')
result = regex.findall(contents)
# now you'll probably want to `.decode()` each of the urls, but you should be able to do that pretty trivially with even the `ASCII` codec
Though now that I think of it, you probably don't really want to use a regex to capture the links, as you'll then have to deal with HTML entities (such as &amp;) and may do better with something like pyquery.
Here's a quick example using pyquery
import pyquery

contents = open(..., 'rb').read()
pq = pyquery.PyQuery(contents)
images = pq.find('img')
for img in images:
    img = pyquery.PyQuery(img)
    if img.attr('src').endswith('.jpg'):
        print(img.attr('src'))
I'm not at my computer with these things installed, so mileage with these code samples may vary.
import csv

samsung = ['samsung','s1','s2','s3','s4','s5','s6','s7','galaxy']
iphone = ['iphone']
problemTypes = []
solution = []
instruction = []

def solve():
    def foundProblem():
        for queryPart in whatProblem:
            for problem in problemTypes:
                if queryPart == problem:
                    return True
    solved = False
    readCSV = csv.reader(csvfile, delimiter = ',')
    for row in readCSV:
        solution = row[0]
        problemTypes = row[1].split()
        instruction = row[2]
        if foundProblem():
            print('The solution is to '+solution)
            print(instruction)
            solved = True
    if solved == False:
        print('Solution not found.\nPlease contact your supplier')

whatProblem = str(input('What seems to be the issue with your smartphone?\n')).lower().split()
version = input('What type of phone do you have?\n').lower().split()
if version == iphone:
    with open('iPhone.csv') as csvfile:
        solve()
elif version in samsung:
    with open('samsung.csv') as csvfile:
        solve()
else:
    print('Phone not supported')
This is an attempt at creating a troubleshooter using multiple CSV files; however, I have run into a problem with the Samsung part. It seems the program cannot recognise that the input is actually part of the samsung list. I am new here, so if I have formatted this wrong please let me know, and if the solution is extremely simple please know that I am new to coding.
Try at least to change this line:
version = input('What type of phone do you have?\n').lower().split()
into:
version = input('What type of phone do you have?\n').lower().split()[0]
But with the way the input is currently read, you force the user to enter 'samsung' exactly, which is not the most accessible approach. Keep on learning and trying and it will work out fine!
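With that change applied, a minimal sketch of how the version checks might then look (just a sketch; the membership tests work once version is a single word):
version = input('What type of phone do you have?\n').lower().split()[0]
if version in iphone:                          # e.g. 'iphone'
    with open('iPhone.csv') as csvfile:
        solve()
elif version in samsung:                       # e.g. 'samsung', 's3', 'galaxy'
    with open('samsung.csv') as csvfile:
        solve()
else:
    print('Phone not supported')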
It's not entirely clear which bit you're having a problem with, but this extract looks wrong:
with open('iPhone.csv') as csvfile:
    solve()
You probably intend to use csvfile within the block:
with open('iPhone.csv') as csvfile:
    solve(csvfile)
and to change the implementation of solve to accept csvfile as an argument. As it is, it looks like you're trying (and failing) to communicate via a global variable; even if that did work, it's a poor practice that leads to unmaintainable code!
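A minimal sketch of that change, with the unchanged body of solve elided:
def solve(csvfile):
    # body as in the question, now reading from the parameter instead of a global
    readCSV = csv.reader(csvfile, delimiter=',')
    ...

with open('iPhone.csv') as csvfile:
    solve(csvfile)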
I'm not sure exactly what your problem is either, but maybe try this statement instead of your other elif statement:
elif any(version in s for s in samsung):
Check if a Python list item contains a string inside another string
I am trying to extract the main article from a web page. I can accomplish the main text extraction using Python's readability module. However, the text I get back often contains several &copy; strings (there is a ; at the end of this string, but this editor won't allow the full string to be entered (strange!)). I have tried using the Python replace function, I have also tried using the regular expression module's replace function, and I have also tried using the Unicode encode and decode functions. None of these approaches has worked. For the replace and regular-expression approaches I just get back my original text with the &copy; strings still present, and with the Unicode encode/decode approach I get back the error message:
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa9' in position 2099: ordinal not in range(128)
Here is the code I am using that takes the initial URL and, using readability, extracts the main article. I have left in all my commented-out code corresponding to the different approaches I have tried to remove the &copy; string. It appears as though &copy; is interpreted as u'\xa9'.
from readability.readability import Document
import urllib
import re

def find_main_article_text_2():
    #url = 'http://finance.yahoo.com/news/questcor-pharmaceuticals-closes-transaction-acquire-130000695.html'
    url = "http://us.rd.yahoo.com/finance/industry/news/latestnews/*http://us.rd.yahoo.com/finance/external/cbsm/SIG=11iiumket/*http://www.marketwatch.com/News/Story/Story.aspx?guid=4D9D3170-CE63-4570-B95B-9B16ABD0391C&siteid=yhoof2"
    html = urllib.urlopen(url).read()
    readable_article = Document(html).summary()
    readable_title = Document(html).short_title()
    #readable_article.replace("u'\xa9'"," ")
    #print re.sub("&copy;", '', readable_article)
    #unicodedata.normalize('NFKD', readable_article).encode('ascii','ignore')
    print readable_article
    #print readable_article.decode('latin9').encode('utf8'),
    print "There are ", readable_article.count("&copy;"), "&copy;'s"
    #print readable_article.encode( sys.stdout.encoding , '' )
    #sent_tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
    #sents = sent_tokenizer.tokenize(readable_article)
    #new_sents = []
    #for sent in sents:
    #    unicode_sent = sent.decode('utf-8')
    #    s1 = unicode_sent.encode('ascii', 'ignore')
    #    s2 = s1.replace("\n","")
    #    new_sents.append(s1)
    #print new_sents
    # u'\xa9'
I have included inside the def a URL that I have been testing the code with. If anybody has any ideas on how to remove this &copy; string, I would appreciate the help. Thanks, George
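One possible approach (a sketch, not a verified fix): the UnicodeEncodeError above shows that readable_article is already a unicode object, so the \xa9 character can be removed from it directly instead of via encode/decode round trips.
# Python 2 sketch: drop the character straight from the unicode text
cleaned = readable_article.replace(u'\xa9', u'')
# or, if the literal entity also appears in the text:
# cleaned = re.sub(u'&copy;|\xa9', u'', readable_article)
print cleaned.encode('utf-8')    # encode explicitly if the console is not UTF-8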