Trouble opening files in python - python-3.x

I am trying to open a file in Python. The code I am using is:
poem = open("C:\Users\Draco\OneDrive\Documents\Programming\Conan.txt")
When I use this code I get the following error message:
SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 2-3: truncated \UXXXXXXXX escape
I have checked to see if the system can find the file using the code:
>>> import os.path
>>> os.path.isfile("Conan.txt")
This originally came back as False, but I have managed to get it to return True. However, I have still been unable to get the code to work. Any suggestions?

Ah, you might want to escape the escape, so to speak.
Backslashes are special characters in Python, and if you want them to be literal backslashes you need to escape each one with another backslash:
open("C:\\Users\\Draco\\OneDrive\\Documents\\Programming\\Conan.txt")
or use a raw string (notice the 'r' prefix):
open(r"C:\Users\Draco\OneDrive\Documents\Programming\Conan.txt")
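Beyond escaping or raw strings, a third option worth sketching (not from the original answer) is pathlib, which makes the separator problem go away entirely; the path below is the asker's example:

```python
from pathlib import Path

# Forward slashes need no escaping in a string literal, and Path
# objects can be passed straight to open() on any platform.
poem_path = Path("C:/Users/Draco/OneDrive/Documents/Programming/Conan.txt")

print(poem_path.name)     # Conan.txt
# poem = open(poem_path)  # works once the file exists at that location
```

On Windows, pathlib also accepts and normalizes backslashes, so either separator works once the escaping issue is out of the way.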

Related

Python3 UnicodeEncodeError when file name to write has croatian characters

I'm writing from pandas to csv like this:
df.to_csv(extractpath+'/extract_{}'.format(locname))
Now, when the locname variable contains Croatian characters, I get an error:
*UnicodeEncodeError: 'ascii' codec can't encode character '\u0161' in position 53: ordinal not in range(128)*
The only workaround I came up with is this:
df.to_csv(extractpath+'/extract_{}'.format(locname.encode('utf-8')))
However, although the error is now gone, the file names are no longer correct; for example, they look like:
*extract_b'Vara\xc5\xbedin'* instead of *extract_Varaždin*
How can I properly solve the problem?
I had wrongly set the LANG variable in my shell without realizing it, so when I executed the Python script from a shell script it didn't work. It works now after correcting the LANG variable.
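A quick way to spot this kind of misconfiguration before writing files is to check what encodings Python has picked up from the environment; a small diagnostic sketch (not from the original answer):

```python
import locale
import sys

# The filesystem encoding is what str filenames are encoded with when
# handed to the OS. If either of these reports 'ascii' (often shown as
# 'ANSI_X3.4-1968' under LANG=C), names like 'Varaždin' cannot be encoded.
print(sys.getfilesystemencoding())
print(locale.getpreferredencoding())
```

With LANG set to a UTF-8 locale, both should report a UTF-8 variant and non-ASCII filenames work without any .encode() workaround.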

Syntax error: unicode error on trying to open image using PIL in Python

I am new to Python. I was trying to open an image using Python Pillow, but:
>>> from PIL import Image
>>> im=Image.open("C:\Users\User1\Desktop\map3.jpg")
File "<stdin>", line 1
SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 2-3: truncated \UXXXXXXXX escape
I tried using the image name alone and it still didn't work:
>>> from PIL import Image
>>> im=Image.open("map3.jpg")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Users\Deepthy\AppData\Local\Programs\Python\Python35-32\lib\site-packages\PIL\Image.py", line 2258, in open
fp = builtins.open(filename, "rb")
FileNotFoundError: [Errno 2] No such file or directory: 'map3.jpg'
The problem originates with Windows' legacy. Try this instead:
im=Image.open("C:/Users/User1/Desktop/map3.jpg")
In Python source code, just as in most other languages including C, C++, Java, JavaScript, C#, Ruby, Perl, and PHP, the backslash is an escape character. It is used to indicate that the following character(s) have special meaning. In a string literal in Python, \u (or \U) begins a Unicode escape sequence. However, \Users is not a valid Unicode escape. Something like \u20ac would be valid (it is the euro currency symbol).
I say that the problem originates with Windows' legacy because Windows (and MS-DOS before it) chose to use the backslash as the separator character in paths. This is unfortunate, since the backslash already held special meaning in program source code and so needs to be escaped, for example print("How many do you see \\?"). Fortunately, the Windows API functions all accept a forward slash instead of a backslash, even though programs like Explorer and cmd.exe do not.
Your second problem (the FileNotFoundError) is exactly what it says: you specified a relative path, and the file does not exist in the current working directory.
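To see why a relative name fails, a small diagnostic sketch (the file name is the asker's; the checks are standard os.path calls):

```python
import os

# A bare 'map3.jpg' is resolved against the current working
# directory, which is often not where the file actually lives.
print(os.getcwd())                  # where Python is looking
print(os.path.abspath("map3.jpg"))  # the full path being tried
print(os.path.isfile("map3.jpg"))   # False unless the file is there
```

If os.path.isfile() prints False, either move the file into the working directory or pass the full absolute path to Image.open().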

How to only keep Big5 characters in a text file

I scraped some text from a Taiwanese website. I got rid of the HTML and kept only what I needed as txt files. The content of the txt files displays correctly in Firefox/Chrome. With Python 3, if I do f = open(text_file).read() I get this error:
'utf-8' codec can't decode byte 0xa1 in position 29: invalid start byte
ETA: I use Ubuntu, so I'm happy with any solution in Python or in the terminal!
And if I do f = codecs.open(os.path.join(path, 'my_text.txt'), 'r', encoding='Big5') and then read() I get this message:
'big5' codec can't decode byte 0xf9 in position 1724: illegal multibyte sequence
I only need the Chinese characters; how can I keep only those encoded as Big5? Would that get rid of the error?
The builtin open() function has an errors parameter:
with open(filename, encoding='utf-8', errors='replace') as file:
    text = file.read()
It is possible that your file uses some other character encoding or even (if the code that saves the text is buggy) a mixture of several character encodings.
You can see what encoding is used by your browser e.g., in Chrome: "More tools -> Encoding".
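If the goal is literally to keep only what the Big5 codec can decode, the same errors mechanism works on the decode side too. A sketch with invented sample bytes (b'\xa4\xa4\xa4\xe5' is '中文' in Big5; b'\xf9\xff' is not a valid Big5 sequence):

```python
raw = b"\xa4\xa4\xa4\xe5 abc \xf9\xff"      # invented sample bytes
text = raw.decode("big5", errors="ignore")  # drop invalid sequences
print(text)   # the invalid trailing bytes are simply gone
```

The same applies when reading a file directly: open(path, encoding='big5', errors='ignore'). Note this silently discards data, so it is only appropriate when you genuinely don't need the undecodable bytes.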

How to print() a string in Python3 without exceptions?

Seemingly simple question: How do I print() a string in Python3? Should be a simple:
print(my_string)
But that doesn't always work. Depending on the content of my_string, your environment variables, and the OS you use, it can throw a UnicodeEncodeError exception:
>>> print("\u3423")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character '\u3423' in position 0: ordinal not in range(128)
Is there a clean portable way to fix this?
To expand a bit: the problem here is that a Python 3 string is a sequence of Unicode characters, while the terminal can have any encoding. If you are lucky, your terminal can handle all the characters contained in the string and everything will be fine; if it can't (e.g. somebody set LANG=C), you get an exception.
If you manually encode a string in Python3 you can supply an error handler that ignores or replaces unencodable characters:
"\u3423".encode("ascii", errors="replace")
For print() I don't see an easy way to plug in an error handler, and even if there were, a blanket error handler seems like a terrible idea, as it would modify the data. A conditional error handler might work (i.e. check isatty() and decide based on that what to do), but it seems awfully hacky to go through all that trouble just to print() a string, and I am not even sure it wouldn't fail in some cases.
A real-world example of this problem is this one:
Python3: UnicodeEncodeError only when run from crontab
The most practical way to solve this issue seems to be to force the output encoding to utf-8:surrogateescape. This will not only force UTF-8 output, but also ensure that surrogate escaped strings, as returned by os.fsdecode(), can be printed without throwing an exception. On command line this looks like this:
PYTHONIOENCODING=utf-8:surrogateescape python3 -c 'print("\udcff")'
To do this from within the program itself, one has to reassign stdout and stderr; this can be done as follows (line_buffering=True is important, as otherwise the output won't get flushed properly):
import sys
import io
sys.stdout = io.TextIOWrapper(sys.stdout.buffer, errors="surrogateescape", line_buffering=True)
sys.stderr = io.TextIOWrapper(sys.stderr.buffer, errors="surrogateescape", line_buffering=True)
print("\udcff")
This approach will cause characters to be incorrectly displayed on terminals not set to UTF-8, but to me that seems strongly preferable to randomly throwing exceptions and making it impossible to print filenames without corrupting them, since on Linux systems they might not be in any valid encoding at all.
I read in a few places that utf-8:surrogateescape might become the default in the future, but as of Python 3.6.0b2 that is not the case.
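On Python 3.7 or newer there is a less invasive way to do the same thing: TextIOWrapper.reconfigure() changes the stream's settings in place, so no manual re-wrapping of sys.stdout.buffer is needed. A sketch of the same idea:

```python
import sys

# Python 3.7+: reconfigure the existing streams in place rather than
# replacing them with new TextIOWrapper objects.
sys.stdout.reconfigure(encoding="utf-8", errors="surrogateescape")
sys.stderr.reconfigure(encoding="utf-8", errors="surrogateescape")

print("\udcff")  # a lone surrogate now round-trips instead of raising
```

Because the original stream objects are kept, anything that already holds a reference to sys.stdout (e.g. logging handlers) picks up the new settings too.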
Is there a clean portable way to fix this?
Set PYTHONIOENCODING=<encoding>:<error_handler> e.g.,
$ PYTHONIOENCODING=utf-8 python your_script.py >output-in-utf-8.txt
In your case, I'd configure your environment (LANG, LC_CTYPE) to accept non-ascii input:
$ locale charmap
The reason it gives you an error is that Python tries to interpret \u as the start of a Unicode escape sequence, just as \r means carriage return, \n newline, \t tab, and so on. A \u escape must be followed by exactly four hex digits, so a truncated one is rejected when the source is parsed:
If:
my_string = '\u112'
print(my_string)
That will give you a SyntaxError. To keep the backslash literal instead of starting an escape, escape it with another backslash:
my_string = '\\u112'
print(my_string)
Output:
\u112

Why doesn't print() in Python 3.2 seem to be defaulting to UTF-8?

I'm writing scripts to clean up Unicode text files (stored as UTF-8), and I chose to use Python 3.x (3.2) rather than the more popular 2.x because 3.x is supposed to default to UTF-8. Maybe I'm doing something wrong, but it seems that the print() function, at least, still is not defaulting to UTF-8. If I try to print a string (msg below is a string) that contains special characters, I still get a UnicodeEncodeError like this:
print(label, msg)
... in encode
return codecs.charmap_encode(input,self.errors,encoding_map)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u0968' in position
38: character maps to <undefined>
If I use the encode() method first (which does nicely default to UTF-8), I can avoid the error:
print(label, msg.encode())
This also works for printing objects or lists containing Unicode strings, something I often have to do when debugging, since str() seems to default to UTF-8. But do I really need to remember to use print(str(myobj).encode()) every single time I want to do a print(myobj)? If so, I suppose I could try to wrap it in my own function, but I'm not confident about handling all the argument permutations that print() supports.
Also, my script loads regular expressions from a file and applies them one by one. Before applying encode(), I was able to print something fairly legible to the console:
msg = 'Applying regex {} of {}: {}'.format(i, len(regexes), regex._findstr)
print(msg)
Applying regex 5 of 15: ^\\ge[0-9]*\b([ ]+[0-9]+\.)?[ ]*
However, this crashes if the regex includes literal unicode characters, so I applied encode() to the string first. But now the regexes are very hard to read on-screen (and I suspect I may have similar trouble if I try to write code that saves these regexes back to disk):
msg = 'Applying regex {} of {}: {}'.format(i, len(regexes), regex._findstr)
print(msg.encode())
b'Applying regex 5 of 15: ^\\\\ge[0-9]*\\b([ ]+[0-9]+\\.)?[ ]*'
I'm not very experienced yet in Python, so I may be misunderstanding. Any explanations or links to tutorials (for Python 3.x; most of what I see online is for 2.x) would be much appreciated.
print() doesn't default to any encoding; it just uses whatever encoding the output device (such as a console) claims to support. Your console encoding appears to be non-Unicode, so print() tries to encode your Unicode strings in that encoding and fails. The easiest way to get around this is to tell the console to use UTF-8 (e.g. export LC_ALL=en_US.UTF-8 on Unix systems).
The easier way to proceed is to use only Unicode (str) inside your script, and use encoded data only when you interact with the "outside" world, that is, when you have input to decode or output to encode.
To do so, every time you read something, use decode; every time you output something, use encode.
For your regexes, use the re.UNICODE flag.
I know this doesn't answer your question point by point, but I think applying such a methodology should keep you safe from encoding issues.
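A minimal sketch of that methodology with an invented sample string (note that in Python 3, re.UNICODE is already the default for str patterns, so the flag mostly serves as documentation):

```python
import re

# Work with str inside the program; with re.UNICODE, \w treats
# accented letters like ü and ß as word characters.
pattern = re.compile(r"\w+", re.UNICODE)

words = pattern.findall("Büro 42 straße")
print(words)   # ['Büro', '42', 'straße']
```

Decoding would happen once at input (e.g. open(path, encoding='utf-8')) and encoding once at output; everything in between, including the regex work, stays in str.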
