How to preserve emojis when reading a csv file into pandas - python-3.x

I have a csv file with a column message which contains text (mainly in English but also with some special characters like Spanish or French) and emojis.
df = pd.read_csv('myfile.csv', encoding='utf-8') gives me this error: UnicodeDecodeError: 'utf-8' codec can't decode byte 0x97 in position 83: invalid start byte
df = pd.read_csv('myfile.csv', encoding='mac_roman') reads the file OK but replaces emojis with underscores (________). The same happens with windows-1252 and iso-8859-1.
I tried utf-16, utf-32, cp1252, etc. Nothing works.
My aim is to keep the emojis as they are and then decode them into words (smiley face, thumbs up, etc.) with the emoji Python package.
Maybe someone had a similar problem and could suggest a way around? Thank you!
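One direction worth trying, sketched below under an assumption: 0x97 is an em dash in cp1252, so the file may be mostly UTF-8 with a few stray single-byte characters. Reading with the decoding errors relaxed keeps the valid UTF-8 emojis, and emoji.demojize then turns them into words as planned. The encoding_errors argument needs pandas 1.3 or newer; myfile.csv and the message column are taken from the question.
import pandas as pd
import emoji

# Assumes the file is mostly UTF-8 with a few stray bytes; those become U+FFFD
# instead of aborting the read (encoding_errors requires pandas 1.3+).
df = pd.read_csv('myfile.csv', encoding='utf-8', encoding_errors='replace')

# Turn emojis into words, e.g. 👍 -> ':thumbs_up:' (the exact alias depends on
# the emoji package version).
df['message'] = df['message'].apply(emoji.demojize)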

Related

A Utf8 encoded file produces UnicodeDecodeError during parsing

I'm trying to reformat a text file so I can upload it to a pipeline (QIIME2). I tested the first few lines of my .txt file (it is tab-separated), and the conversion was successful. However, when I try to run the script on the whole file, I encounter an error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x96 in position 16: invalid start byte
I have identified that the file encoding is Utf8, so I am not sure where the problem is arising from.
$ file filename.txt
filename: UTF-8 Unicode text, with very long lines, with CRLF line terminator
I have also reviewed some of the lines that are associated with the error, and I am not able to visually identify any unorthodox characters.
I have tried to force encode it using:
$ iconv -f UTF8 -t UTF8 filename.txt > new_file.txt
However, the error produced is:
iconv: illegal input sequence at position 152683
How I'm understanding this is that whatever character occurs at the position is not readable/translatable using utf-8 encoding, but I am not sure then why the file is said to be encoded in utf-8.
I am running this on Linux, and the data itself are sequence information from the BOLD database (if anyone else has run into similar problems when trying to convert this into a format appropriate for QIIME2).
file is wrong. The file command doesn't read the entire file; it bases its guess on a sample of the file. I don't have a source reference for this, but file is so fast on huge files that there's no other explanation.
I'm guessing that your file is actually UTF-8 in the beginning, because UTF-8 has characteristic byte sequences. It's quite unlikely that a piece of text only looks like UTF-8 but isn't actually.
But the part of the text containing the byte 0x96 cannot be UTF-8. It's likely that some text was encoded with an 8-bit encoding like CP1252, and then concatenated to the UTF-8 text. This is something that shouldn't happen, because now you have multiple encodings in a single file. Such a file is broken with respect to text encoding.
This is all just guessing, but in my experience, this is the most likely explanation for the scenario you described.
For text with broken encoding, you can use the third-party Python library ftfy ("fixes text for you").
It will cut your text at every newline character and try to find (guess) the right encoding for each portion.
It doesn't magically do the right thing always, but it's pretty good.
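A minimal sketch of that route, assuming ftfy is installed (pip install ftfy) and that the damaged file is the filename.txt from the question. Decoding as latin-1 never raises, since every byte maps to some character; ftfy then repairs the mis-decoded portions where it can:
import ftfy

with open('filename.txt', encoding='latin-1') as f:   # never raises UnicodeDecodeError
    broken = f.read()

fixed = ftfy.fix_text(broken)   # repairs mojibake where it can be recognised

with open('filename_fixed.txt', 'w', encoding='utf-8') as f:
    f.write(fixed)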
To give you more detailed guidance, you'd have to show the code of the script you're calling (if it's your code and you want to fix it).

I assign a CSV file's location path to a variable in Python and it gives a UnicodeDecodeError [duplicate]

I'm trying to get a Python 3 program to do some manipulations with a text file filled with information. However, when trying to read the file I get the following error:
Traceback (most recent call last):
File "SCRIPT LOCATION", line NUMBER, in <module>
text = file.read()
File "C:\Python31\lib\encodings\cp1252.py", line 23, in decode
return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x90 in position 2907500: character maps to <undefined>
The file in question is not using the CP1252 encoding. It's using another encoding. Which one you have to figure out yourself. Common ones are Latin-1 and UTF-8. Since 0x90 doesn't actually mean anything in Latin-1, UTF-8 (where 0x90 is a continuation byte) is more likely.
You specify the encoding when you open the file:
file = open(filename, encoding="utf8")
If file = open(filename, encoding="utf-8") doesn't work, try
file = open(filename, errors="ignore"), if you want to remove unneeded characters. (docs)
Alternatively, if you don't need to decode the file, such as uploading the file to a website, use:
open(filename, 'rb')
where r = reading, b = binary
As an extension to Lennart Regebro's answer:
If you can't tell what encoding your file uses, the solution above does not work (it's not UTF-8), and you find yourself merely guessing, there are online tools you can use to identify the encoding. They aren't perfect, but they usually work just fine. After you figure out the encoding, you should be able to use the solution above.
EDIT: (Copied from comment)
The quite popular text editor Sublime Text has a command to display the encoding if it has been set...
Go to View -> Show Console (or Ctrl+`)
Type view.encoding() into the field at the bottom and hope for the best (I was unable to get anything but Undefined, but maybe you will have better luck...)
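A programmatic alternative to the online tools, assuming the third-party chardet package is installed (pip install chardet), is to let it guess the encoding from the raw bytes:
import chardet

with open(filename, 'rb') as f:
    guess = chardet.detect(f.read())   # e.g. {'encoding': 'Windows-1252', 'confidence': 0.73, ...}

text = open(filename, encoding=guess['encoding']).read()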
TLDR: Try: file = open(filename, encoding='cp437')
Why? When one uses:
file = open(filename)
text = file.read()
Python assumes the file uses the same codepage as the current environment (cp1252 in the case of the opening post) and tries to decode it to str (Unicode). If the file contains characters with values not defined in this codepage (like 0x90), we get a UnicodeDecodeError. Sometimes we don't know the encoding of the file, sometimes the file's encoding is unhandled by Python (like, e.g., cp790), and sometimes the file contains mixed encodings.
If such characters are unneeded, one may decide to replace them with the Unicode replacement character (�), using:
file = open(filename, errors='replace')
Another workaround is to use:
file = open(filename, errors='ignore')
The offending characters are then silently dropped, and other errors will be masked too.
A very good solution is to specify the encoding, yet not any encoding (like cp1252), but the one which has ALL characters defined (like cp437):
file = open(filename, encoding='cp437')
Codepage 437 is the original DOS encoding. All codes are defined, so there are no errors while reading the file, no errors are masked out, the characters are preserved (not quite left intact but still distinguishable).
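To make the difference between the three options concrete, here is what each one does to the byte 0x90 from the traceback:
>>> b'\x90'.decode('cp437')             # every byte value is defined in cp437
'É'
>>> b'\x90'.decode('utf-8', 'replace')  # becomes the replacement character U+FFFD
'�'
>>> b'\x90'.decode('utf-8', 'ignore')   # silently dropped
''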
Stop wasting your time: just add encoding="cp437" and errors='ignore' to your code for both read and write:
open('filename.csv', encoding="cp437", errors='ignore')
open(file_name, 'w', newline='', encoding="cp437", errors='ignore')
Godspeed
For me, encoding with utf16 worked:
file = open('filename.csv', encoding="utf16")
For those working with Anaconda on Windows, I had the same problem. Notepad++ helped me solve it.
Open the file in Notepad++. In the bottom right it will tell you the current file encoding.
In the top menu, next to "View", locate "Encoding". Under "Encoding", go to "Character sets" and patiently look for the encoding that you need. In my case the encoding "Windows-1252" was found under "Western European".
Before you apply the suggested solution, you can check which character appeared in your file (and in the error log), in this case 0x90: https://unicodelookup.com/#0x90/1 (or directly at the Unicode Consortium site http://www.unicode.org/charts/ by searching for 0x0090),
and then consider removing it from the file.
def read_files(file_path):
    with open(file_path, encoding='utf8') as f:
        text = f.read()
    return text
OR (AND), to write the text back out while dropping characters that cannot be encoded:
def write_files(text, file_path):
    with open(file_path, 'wb') as f:
        f.write(text.encode('utf8', 'ignore'))
In newer versions of Python (starting with 3.7), you can add the interpreter option -Xutf8, which should fix your problem. If you use PyCharm, just go to Run > Edit Configurations (in the Configuration tab, change the value in the Interpreter options field to -Xutf8).
Or, equivalently, you can just set the environment variable PYTHONUTF8 to 1.
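You can verify from inside the script that UTF-8 mode is actually in effect (Python 3.7+):
import sys
print(sys.flags.utf8_mode)   # 1 when -Xutf8 or PYTHONUTF8=1 is active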
For me, changing the MySQL character encoding to match my code helped sort out the solution: photo = open('pic3.png', encoding='latin1')

Trouble opening files in python

I am trying to open a file in Python. The code I am using is:
poem = open("C:\Users\Draco\OneDrive\Documents\Programming\Conan.txt")
When I use this code I get the following error message:
SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 2-3: truncated \UXXXXXXXX escape
I have checked to see if the system can find the file using the code:
>>> import os.path
>>> os.path.isfile("Conan.txt")
This originally came back as false, but I have managed to get it to say true. However, I have still been unable to get the code to work. Any suggestions?
Ah, you might want to escape the escape, so to speak.
Backslashes are special characters in Python, and if you want them to be actual backslashes you need to escape them with a backslash:
open("C:\\Users\\Draco\\OneDrive\\Documents\\Programming\\Conan.txt")
or use a raw string (notice the 'r'):
open(r"C:\Users\Draco\OneDrive\Documents\Programming\Conan.txt")

How to write a unicode character to a file that is not supported by utf-8, python

I'm trying to write some Hebrew text to a .txt file using Python, but seeing as Hebrew is non-ASCII and non-UTF-8 I am getting errors. I am trying to get the literal text שלום in the text file, as opposed to a representation of it.
hebrew_word = "שלום"
file = open("file_with_hebrew.txt", "w")
file.writelines(hebrew_word)
file.close()
part of my stack trace:
UnicodeEncodeError: 'charmap' codec can't encode character '\u05e9' in position 0: character maps to <undefined>
hebrew_word = "שלום"
with open('file_with_hebrew.txt', 'w', encoding='utf-8') as file:
# ^^^^^^^^^^^^^^^^
file.writelines(hebrew_word)
Make sure you specify an encoding when opening the file; in your case it defaulted to an encoding not able to represent Hebrew.
Your script works just fine. You are doing it right, and UTF-8 is perfectly able to represent those characters. What Python version are you using, and on what platform?
From the open() documentation:
In text mode, if encoding is not specified the encoding used is platform dependent: locale.getpreferredencoding(False) is called to get the current locale encoding.
So you should specify the encoding when opening the file for writing, in case your platform doesn't have UTF-8 as the default:
hebrew_word = "שלום"
with open("file_with_hebrew.txt", "w", encoding='UTF-8') as file
file.writelines(hebrew_word)
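If you want to see which encoding open() would have picked by default on your platform (the value the quoted documentation refers to), you can check it directly:
import locale
print(locale.getpreferredencoding(False))   # e.g. 'cp1252' on many Windows setups, 'UTF-8' on most Linux systems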

How to only keep Big5 characters in a text file

I scraped some text from a Taiwanese website. I got rid of the HTML and only kept what I needed as txt files. The content of the txt file displays correctly in Firefox/Chrome. With Python 3, if I do f = open(text_file).read() I get this error:
'utf-8' codec can't decode byte 0xa1 in position 29: invalid start byte
ETA: I use Ubuntu, so I'm happy with any solution in Python or in the terminal!
And if I do f = codecs.open(os.path.join(path, 'my_text.txt'), 'r', encoding='Big5') and then read() I get this message:
'big5' codec can't decode byte 0xf9 in position 1724: illegal multibyte sequence
I only need the Chinese characters; how can I keep only those encoded as Big5? That would get rid of the error, yes?
The built-in open() function has an errors parameter:
with open(filename, encoding='utf-8', errors='replace') as file:
    text = file.read()
It is possible that your file uses some other character encoding or even (if the code that saves the text is buggy) a mixture of several character encodings.
You can see what encoding is used by your browser e.g., in Chrome: "More tools -> Encoding".
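If the goal really is to keep only the Big5-decodable Chinese text and discard everything else, one possible sketch (the CJK range filter is just a heuristic, adjust it to your data; my_text.txt is the file name from the question):
with open('my_text.txt', 'rb') as f:
    raw = f.read()

text = raw.decode('big5', errors='ignore')   # drop byte sequences that are not valid Big5

# Optionally keep only CJK ideographs (U+4E00..U+9FFF); punctuation etc. is discarded
chinese_only = ''.join(ch for ch in text if '\u4e00' <= ch <= '\u9fff')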
