Assigning a renamed CSV file path to a variable in Python gives UnicodeDecodeError [duplicate] - python-3.x

I'm trying to get a Python 3 program to do some manipulations with a text file filled with information. However, when trying to read the file I get the following error:
Traceback (most recent call last):
File "SCRIPT LOCATION", line NUMBER, in <module>
text = file.read()
File "C:\Python31\lib\encodings\cp1252.py", line 23, in decode
return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x90 in position 2907500: character maps to `<undefined>`

The file in question is not using the CP1252 encoding. It's using another encoding. Which one you have to figure out yourself. Common ones are Latin-1 and UTF-8. Since 0x90 doesn't actually mean anything in Latin-1, UTF-8 (where 0x90 is a continuation byte) is more likely.
You specify the encoding when you open the file:
file = open(filename, encoding="utf8")

If file = open(filename, encoding="utf-8") doesn't work, try
file = open(filename, errors="ignore") if you just want to strip the undecodable characters. (docs)

Alternatively, if you don't need to decode the file, for example when uploading it to a website, use:
open(filename, 'rb')
where r = reading, b = binary

As an extension to @LennartRegebro's answer:
If you can't tell what encoding your file uses and the solution above does not work (it's not utf8), and you find yourself merely guessing, there are online tools that you can use to identify the encoding. They aren't perfect, but they usually work just fine. After you figure out the encoding, you should be able to use the solution above.
EDIT: (Copied from comment)
The quite popular text editor Sublime Text has a command to display the encoding if it has been set...
Go to View -> Show Console (or Ctrl+`)
Type view.encoding() into the field at the bottom and hope for the best (I was unable to get anything but Undefined, but maybe you will have better luck...)
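If you'd rather guess locally instead of pasting the file into an online tool, a small sketch along these lines may help; it assumes the third-party chardet package is installed (it is not part of the standard library), and filename stands for your file's path:
import chardet

# Read the raw bytes and let chardet guess the encoding. The result is a dict
# like {'encoding': 'Windows-1252', 'confidence': 0.73, ...} - a guess, not a guarantee.
with open(filename, 'rb') as f:
    guess = chardet.detect(f.read())
print(guess)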

TLDR: Try: file = open(filename, encoding='cp437')
Why? When one uses:
file = open(filename)
text = file.read()
Python assumes the file uses the same codepage as the current environment (cp1252 in the case of the opening post) and tries to decode it to str (Unicode). If the file contains characters with values not defined in this codepage (like 0x90), we get UnicodeDecodeError. Sometimes we don't know the encoding of the file, sometimes the file's encoding may be unhandled by Python (like e.g. cp790), and sometimes the file can even contain mixed encodings.
If such characters are unneeded, one may decide to replace them with the Unicode replacement character (�), using:
file = open(filename, errors='replace')
Another workaround is to use:
file = open(filename, errors='ignore')
The offending characters are then simply dropped, but other decoding errors will be masked too.
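A small sketch (my addition) of the difference between the two, using the byte from the traceback, 0x90, which is undefined in cp1252:
raw = b'spam\x90eggs'
print(raw.decode('cp1252', errors='replace'))  # 'spam\ufffdeggs' - the byte becomes U+FFFD
print(raw.decode('cp1252', errors='ignore'))   # 'spameggs' - the byte is silently dropped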
A very good solution is to specify the encoding, yet not just any encoding (like cp1252), but one which has ALL 256 byte values defined (like cp437):
file = open(filename, encoding='cp437')
Codepage 437 is the original DOS encoding. All byte values are defined, so there are no errors while reading the file, no errors are masked out, and the characters are preserved (not quite left intact, but still distinguishable).
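A quick sketch (my addition) of why cp437 never raises: every byte value decodes to some character, and the result round-trips back to the same bytes:
data = bytes(range(256))              # every possible byte value
text = data.decode('cp437')           # always succeeds - nothing is undefined
assert text.encode('cp437') == data   # nothing is lost, only re-labelled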

Stop wasting your time: just add encoding="cp437" and errors='ignore' to your code for both reading and writing:
open('filename.csv', encoding="cp437", errors='ignore')
open(file_name, 'w', newline='', encoding="cp437", errors='ignore')
Godspeed

For me, encoding with utf16 worked:
file = open('filename.csv', encoding="utf16")

For those working in Anaconda on Windows, I had the same problem. Notepad++ helped me solve it.
Open the file in Notepad++. In the bottom right it will tell you the current file encoding.
In the top menu, next to "View", locate "Encoding". Under "Encoding" go to "Character sets" and there, with patience, look for the encoding that you need. In my case the encoding "Windows-1252" was found under "Western European".

Before you apply the suggested solution, you can check what character appeared in your file (and in the error log), in this case 0x90: https://unicodelookup.com/#0x90/1 (or directly at the Unicode Consortium site http://www.unicode.org/charts/ by searching for 0x0090),
and then consider removing it from the file.
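If you want to see what is actually sitting at that spot before removing it, a small sketch (my addition; filename stands for your file's path, and the offset comes from the traceback above):
with open(filename, 'rb') as f:
    raw = f.read()
print(hex(raw[2907500]))        # the byte at the position reported in the error
print(raw[2907480:2907520])     # a little raw context around it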

def read_files(file_path):
    with open(file_path, encoding='utf8') as f:
        text = f.read()
    return text
OR (AND)
def write_files(text, file_path):
    # open in 'wb' (write binary) mode, since we are writing encoded bytes
    with open(file_path, 'wb') as f:
        f.write(text.encode('utf8', 'ignore'))

In newer versions of Python (starting with 3.7), you can add the interpreter option -Xutf8, which should fix your problem. If you use PyCharm, just go to Run > Edit Configurations (in the Configuration tab, change the value of the field Interpreter options to -Xutf8).
Or, equivalently, you can just set the environment variable PYTHONUTF8 to 1.
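To confirm the option actually took effect, a tiny check (my addition; works on Python 3.7+):
import sys

# 1 when the interpreter was started with -Xutf8 or with PYTHONUTF8=1,
# meaning open() defaults to UTF-8 regardless of the locale.
print(sys.flags.utf8_mode)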

For me, changing the MySQL character encoding to match my code helped sort out the solution:
photo = open('pic3.png', encoding='latin1')

Related

Python3 detect that a file's encoding is not as specified in open statement

When opening a file and specifying an encoding, is there a way to detect that the file you're opening is not encoded in the encoding specified?
NOTE: I'm not using 'with' here, to try and keep the example clear
e.g.
import sys

try:
    WorkF = open(MYFILE, 'r', encoding = 'utf-8')
except IOError as error:
    print(f'Error opening file {MYFILE} : {error}')
    sys.exit(1)
How do I detect here that the file is not in utf-8?
This is to avoid reading a compressed file, for example, which will nastily fall over.
Filenames are specified on the CLI, and sometimes users give a gzipped file instead of the gunzipped version.
i.e. trying to avoid:
UnicodeDecodeError: 'ascii' codec can't decode byte 0x8b in position 1: ordinal not in range(128)
The only way I can think of doing this is to wrap the entire processing of the contents in a try: except UnicodeDecodeError: block, but that is ugly.
Perhaps there is a way to detect before opening it ?
So the most Pythonic way is to just try to open it and catch the error; you can then present the user with information indicating they cannot use your program with compressed files. It is not clear why you would consider that "ugly". If you attempt to identify all the variations of magic bytes for compressed or other binary files, you will undoubtedly miss some anyway.
try:
    with open(MYFILE, 'r', encoding = 'utf-8') as WorkF:
        for line in WorkF:
            print(line)
except UnicodeDecodeError:
    print('Not for use with compressed files.')
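That said, if you only care about the gzip case mentioned in the question, a minimal sketch (my addition, reusing MYFILE and sys from the question) that peeks at the two-byte gzip magic number before opening the file as text could look like this:
def looks_gzipped(path):
    # gzip files start with the magic bytes 0x1f 0x8b
    with open(path, 'rb') as f:
        return f.read(2) == b'\x1f\x8b'

if looks_gzipped(MYFILE):
    print(f'{MYFILE} appears to be gzip-compressed; please pass the uncompressed file.')
    sys.exit(1)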

Python: backslashes (\u...) in files

I have a file with unicode characters in the \u format. I want to write them to another file as ordinary unicode strings. But I can't get the backslash to be interpreted as an escape character.
So I have this in a file, for example,
\u1203\u1208\u1208 \u0074\u00E4\u0068\u0061\u006C\u00E4\u006C\u00E4,
which should print out like this.
>>> print("\u1203\u1208\u1208 \u0074\u00E4\u0068\u0061\u006C\u00E4\u006C\u00E4")
ሃለለ tähalälä
But instead I get this.
>>> with open('ti.txt') as f:
...     for line in f:
...         print(line)
...
\u1203\u1208\u1208 \u0074\u00E4\u0068\u0061\u006C\u00E4\u006C\u00E4
I've tried every combination of str(), repr(), encode().decode() that I can think of. But those backslashes still end up as backslashes.
Best answer (2021+):
Use the built-in codecs library which exists since Python 2.4+: https://docs.python.org/3/library/codecs.html#codecs.decode
Example:
import codecs
# This is True (successfully decoded):
print(codecs.decode(r"\u1234", "unicode_escape") == "\u1234")
With this answer, you don't need to convert strings into bytes-type objects to be able to decode them. It's a replacement for the bad "str.encode().decode()" pattern that many people wrongly use.
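As a rough illustration (my own sketch, not part of the answer above), the same call works line by line when the file is opened in ordinary text mode, assuming ti.txt contains the escaped text shown in the question:
import codecs

with open('ti.txt') as f:
    for line in f:
        print(codecs.decode(line, 'unicode_escape'))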
You want the built-in unicode_escape text encoding.
First, you should open the file in binary mode ('rb') so that it will let you call .decode() on the data. Then just do line.decode('unicode_escape').
Modified code:
with open('ti.txt', 'rb') as f:
    for line in f:
        print(line.decode('unicode_escape'))
In action:
$ cat ti.txt
\u1203\u1208\u1208 \u0074\u00E4\u0068\u0061\u006C\u00E4\u006C\u00E4
$ python parse_unicode_escape.py
ሃለለ tähalälä

Replace CRLF with LF in Python 3.6

I've tried searching the web, and a number of different things I've read on the web, but don't seem to get the desired result.
I'm using Windows 7 and Python 3.6.
I'm connecting to an Oracle db with cx_oracle and creating a text file with the query results. The file that is created (which I'll call my_file.txt to make it easy) has 3688 lines in it, all with CRLF, which need to be converted to the Unix LF.
If I run python crlf.py my_file.txt it is all converted correctly and there are no issues, but that means I need to run another command manually, which I do not want to do.
So I tried adding the code below to my file.
filename = "NameOfFileToBeConverted"
fileContents = open(filename,"r").read()
f = open(filename,"w", newline="\n")
f.write(fileContents)
f.close()
This does convert the majority of the CRLF to LF, but at line 3501 it has a NUL character 3500 times on the one line, followed by a row of data from the database, and it ends with the CRLF; every line from here on still has the CRLF.
So with that not working, I removed it and then tried
import subprocess
subprocess.Popen("crlf.py "+ filename, shell=True)
I also tried using
import os
os.system("crlf.py "+ filename)
The "+ filename" in the two examples above is just providing the filename that is created during the data extract.
I don't know what else to try from here.
Convert Line Endings in-place (with Python 3)
Windows to Linux/Unix
Here is a short script for directly converting Windows line endings (\r\n also called CRLF) to Linux/Unix line endings (\n also called LF) in-place (without creating an extra output file):
# replacement strings
WINDOWS_LINE_ENDING = b'\r\n'
UNIX_LINE_ENDING = b'\n'

# relative or absolute file path, e.g.:
file_path = r"c:\Users\Username\Desktop\file.txt"

with open(file_path, 'rb') as open_file:
    content = open_file.read()

content = content.replace(WINDOWS_LINE_ENDING, UNIX_LINE_ENDING)

with open(file_path, 'wb') as open_file:
    open_file.write(content)
Linux/Unix to Windows
Just swap the line endings to content.replace(UNIX_LINE_ENDING, WINDOWS_LINE_ENDING).
Code Explanation
Important: Binary Mode We need to make sure that we open the file both times in binary mode (mode='rb' and mode='wb') for the conversion to work.
When opening files in text mode (mode='r' or mode='w' without b), the platform's native line endings (\r\n on Windows and \r on old Mac OS versions) are automatically converted to Python's Unix-style line endings: \n. So the call to content.replace() couldn't find any line endings to replace.
In binary mode, no such conversion is done.
Binary Strings In Python 3, strings are Unicode text by default. But we open our files in binary mode - therefore we need to add b in front of our replacement strings to tell Python to handle those strings as bytes, too.
Raw Strings On Windows the path separator is a backslash \ which we would need to escape in a normal Python string with \\. By adding r in front of the string we create a so called raw string which doesn't need any escaping. So you can directly copy/paste the path from Windows Explorer.
Alternative We open the file twice to avoid the need to reposition the file pointer. We could also have opened the file once with mode='rb+', but then we would have needed to move the pointer back to the start after reading the content (open_file.seek(0)) and truncate the original content before writing the new one (open_file.truncate()).
Simply opening the file again in write mode does that automatically for us.
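A minimal sketch of that single-open alternative (my addition, reusing the names from the snippet above):
with open(file_path, 'rb+') as open_file:
    content = open_file.read()
    open_file.seek(0)       # move the pointer back to the start
    open_file.truncate()    # discard the original content
    open_file.write(content.replace(WINDOWS_LINE_ENDING, UNIX_LINE_ENDING))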
Cheers and happy programming,
winklerrr

How to write a unicode character to a file that is not supported by utf-8, python

I'm trying to write some Hebrew text to a .txt file using Python, but since Hebrew is non-ASCII I am getting errors. I am trying to get the literal text שלום in the text file, as opposed to a representation of it.
hebrew_word = "שלום"
file = open("file_with_hebrew.txt", "w")
file.writelines(hebrew_word)
file.close()
part of my stack trace:
UnicodeEncodeError: 'charmap' codec can't encode character '\u05e9' in position 0: character maps to <undefined>
hebrew_word = "שלום"
with open('file_with_hebrew.txt', 'w', encoding='utf-8') as file:
    #                                  ^^^^^^^^^^^^^^^^
    file.writelines(hebrew_word)
Make sure you specify an encoding when opening the file; in your case it defaulted to an encoding not able to represent Hebrew.
Your script works just fine; you are doing it right, and UTF-8 can represent those characters. What Python version are you using, and on what platform?
From open() doc:
In text mode, if encoding is not specified the encoding used is
platform dependent: locale.getpreferredencoding(False) is called to
get the current locale encoding.
So you should specify the encoding when opening the file for writing, in case your platform doesn't have UTF-8 as the default:
hebrew_word = "שלום"
with open("file_with_hebrew.txt", "w", encoding='UTF-8') as file:
    file.writelines(hebrew_word)
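As a quick check (my own sketch, not part of the answer), you can see which encoding open() would otherwise default to on your platform:
import locale

# What open() falls back to in text mode when no encoding is given;
# on many Windows setups this is something like 'cp1252', not UTF-8.
print(locale.getpreferredencoding(False))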

Python 3: Issue writing new lines along side unicode to text file

I ran into an issue when writing the header of a text file in python 3.
I have a header that contains unicode AND new line characters. The following is a minimum working example:
with open('my_log.txt', 'wb') as my_file:
    str_1 = '\u2588\u2588\u2588\u2588\u2588\n\u2588\u2588\u2588\u2588\u2588'
    str_2 = 'regular ascii\nregular ascii'
    my_file.write(str_1.encode('utf8'))
    my_file.write(bytes(str_2, 'UTF-8'))
The above works, except the output file does not have the new lines (it basically looks like I replaced '\n' with ''). Like the following:
██████████regular asciiregular ascii
I was expecting:
█████
█████
regular ascii
regular ascii
I have tried replacing '\n' with u'\u000A' and other characters based on similar questions - but I get the same result.
An additional, and probably related, question: I know I am making my life harder with the above encoding and byte methods. Still getting used to unicode in py3 so any advice regarding that would be great, thanks!
EDIT
Based on Ignacio's response and some more research: The following seems to produce the desired results (basically converting from '\n' to '\r\n' and ensuring the encoding is correct on all the lines):
with open('my_log.txt', 'wb') as my_file:
    str_1 = '\u2588\u2588\u2588\u2588\u2588\r\n\u2588\u2588\u2588\u2588\u2588'
    str_2 = '\r\nregular ascii\r\nregular ascii'
    my_file.write(str_1.encode('utf8'))
    my_file.write(str_2.encode('utf8'))
Since you mentioned wanting advice using Unicode on Python 3...
You are probably using Windows since the \n isn't working correctly for you in binary mode. Linux uses \n line endings for text, but Windows uses \r\n.
Open the file in text mode and declare the encoding you want, then just write the Unicode strings. Below is an example that includes different escape codes for Unicode:
#coding:utf8
str_1 = '''\
\u2588\N{FULL BLOCK}\U00002588█
regular ascii'''

with open('my_log.txt', 'w', encoding='utf8') as my_file:
    my_file.write(str_1)
You can use a four-digit escape \uxxxx, eight-digit escape \Uxxxxxxxx, or the Unicode codepoint \N{codepoint_name}. The Unicode characters can also be directly used in the file as long as the #coding: declaration is present and the source code file is saved in the declared encoding.
Note that the default source encoding for Python 3 is utf8 so the declaration I used above is optional, but on Python 2 the default is ascii. The source encoding does not have to match the encoding used to open a file.
Use w or wt for writing text (t is the default). On Windows \n will translate to \r\n in text mode.
'wb'
The file is opened in binary mode. As such, \n isn't translated into the native newline format. If you open the file in a text editor that doesn't treat LF as a line break character, then all the text will appear on a single line in the editor. Either open the file in text mode with an appropriate encoding, or translate the newlines manually before writing.
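A minimal sketch of that second option (my addition, assuming Windows-style \r\n output is what you want), translating the newlines by hand before encoding and writing in binary mode:
str_2 = 'regular ascii\nregular ascii'
with open('my_log.txt', 'wb') as my_file:
    # replace LF with CRLF ourselves, since binary mode does no translation
    my_file.write(str_2.replace('\n', '\r\n').encode('utf8'))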
