BeautifulSoup UnicodeEncodeError - python-3.x

I'm trying to parse an HTML page that I saved to my computer (Windows 10):
from bs4 import BeautifulSoup

with open("res/JLPT N5 vocab list.html", "r", encoding="utf8") as f:
    soup = BeautifulSoup(f, "html.parser")

tables = soup.find_all("table")
sectable = tables[1]
for tr in sectable.contents[1:]:
    if tr.name == "tr":
        try:
            # first column: the Japanese word inside the row's first link
            print(tr.td.a.get_text())
        except AttributeError:
            continue
It should print all of the Japanese words in the first column, but an error was raised at print(tr.td.a.get_text()): UnicodeEncodeError: 'charmap' codec can't encode characters in position 0-1: character maps to <undefined>. How can I solve this error?

Finally, I solved it, following the Miscellaneous section of the Beautiful Soup documentation:
UnicodeEncodeError: 'charmap' codec can't encode character u'\xfoo' in position bar (or just about any other UnicodeEncodeError) - This is not a problem with Beautiful Soup. This problem shows up in two main situations. First, when you try to print a Unicode character that your console doesn’t know how to display. (See this page on the Python wiki for help.) Second, when you’re writing to a file and you pass in a Unicode character that’s not supported by your default encoding. In this case, the simplest solution is to explicitly encode the Unicode string into UTF-8 with u.encode("utf8").
In my case, it was because I tried to print a Unicode character that my console didn't know how to display.
So I enabled TrueType fonts for the console, changed the system locale to Japanese (so that the console encoding changed and a font that supports Japanese could be chosen), and then changed the console font to MS ゴシック (this font only appeared after I changed the system locale).
If I want to write the text to a file instead, I just open the file with the encoding explicitly set to UTF-8.
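A minimal sketch of that last point, reusing the loop from the question (the output file name n5_words.txt is just an example):

# Writing with an explicit encoding sidesteps the 'charmap' error,
# because the Windows default codec is never consulted.
with open("n5_words.txt", "w", encoding="utf8") as out:
    for tr in sectable.contents[1:]:
        if tr.name == "tr":
            try:
                out.write(tr.td.a.get_text() + "\n")
            except AttributeError:
                continue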

Related

character recognition issue when parsing xml

When using Python to extract the contents of an XML file and import them into a CSV file, I found that some characters were not recognised correctly: they were converted to strange characters, e.g. é was turned into Ã© and Ü into Ãœ.
I suspect it is due to the character encoding. The XML's text locale is en-GB, and in Python I set NLS_LANG to AL32UTF8.
I tried to apply the decode method to the following code, but it did not work at all:
abstract = i.find('abstract').text
Could anyone shed some light on this?
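For what it's worth, the Ã©/Ãœ pattern is the classic symptom of UTF-8 bytes being decoded as cp1252 (or Latin-1) somewhere in the pipeline. A minimal sketch of the usual round-trip repair, assuming that is indeed the cause:

# "é" encoded as UTF-8 is the two bytes 0xC3 0xA9; decoding those
# bytes as cp1252 yields "Ã©". Reversing the wrong decode and
# redoing it as UTF-8 recovers the original text.
garbled = "Ã©"
fixed = garbled.encode("cp1252").decode("utf8")
print(fixed)  # é

The cleaner fix is to avoid the mismatch in the first place, e.g. by passing encoding='utf-8' when opening the CSV output file.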

Why is my Jupyter Notebook using ascii codec even on Python3?

I'm analyzing a dataframe that contains French characters.
I set up an IPython kernel based on Python3 within my jupyter notebook, so the default encoding should be utf-8.
However, I can no longer save my notebook as soon as an accented character appears in my .ipynb (like é, è...), even though those are handled in utf-8.
The error message that I'm getting when trying to save is this:
Unexpected error while saving file: Videos.ipynb 'ascii' codec can't encode characters in position 347-348: ordinal not in range(128)
Here is some minimal code that gives me the same error message in a basic Python3 kernel:
import pandas as pd
d = {'EN': ['Hey', 'Wassup'], 'FR': ['Hé', 'ça va']}
df = pd.DataFrame(data=d)
(actually, a simple cell containing just "é" as text is already enough to prevent saving)
I've seen similar questions, but all of them were based on Python 2.7, so nothing relevant. I also tried several things in order to fix this:
- Including # coding: utf-8 at the top of my notebook
- Specifying the utf-8 encoding when reading the csv file
- Trying to read the file with latin-1 encoding, then saving (still not supported by the ascii codec)
- Checking my default encoding in Python 3, just to make sure:
>>> import sys
>>> sys.getdefaultencoding()
'utf-8'
- Opening the .ipynb in Notepad++: the encoding is set to utf-8 there. I can add accented characters and save, but then I can no longer open the notebook in Jupyter (I get an "unknown error" message).
The problem comes from saving the notebook and not reading the file, so basically I want to switch to utf-8 encoding for saving my .ipynb files but don't know where.
I suspect the issue might come from the fact that I'm using WSL on Windows 10 - just a gut feeling, though.
Thanks a lot for your help!
Well, it turns out uninstalling and then reinstalling Jupyter Notebook did the trick. Not sure what happened, but it's now solved.

ascii codec error when running Python 3 code via Apache (WSGI)

Objective: insert Japanese text into a .ini file
Steps:
Python version used: 3.6, with the Flask framework
The library used for writing the config file is configparser
Issue:
When I run the code via the "flask run" command, there are no issues: the Japanese text is inserted into the .ini file correctly.
But when I run the same code via Apache (WSGI), I get the following error:
'ascii' codec can't encode characters in position 17-23: ordinal not in range(128)
Never interact with text files without explicitly specifying the encoding. Sadly, even Python's official documentation neglects to obey this simple rule. The difference you are seeing is almost certainly environmental: under flask run, your shell's locale typically makes open() default to UTF-8, while a process started by Apache usually has no locale configured, so the default falls back to ASCII.
import configparser

config_path = 'your_file.ini'
config = configparser.ConfigParser()

# read with an explicit encoding instead of the platform default
with open(config_path, encoding='utf8') as fp:
    config.read_file(fp)

# and the same when writing the file back out
with open(config_path, 'w', encoding='utf8') as fp:
    config.write(fp)
utf8 is a reasonable choice for storing Unicode characters, pick a different encoding if you have a preference.
Japanese characters typically take three bytes each in UTF-8, while UTF-16 stores them in two, so picking utf16 can result in smaller .ini files; either way there is no functional difference.
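If you want to confirm that diagnosis, a quick sketch is to log the default text encoding from inside the WSGI process (this is the codec open() falls back to when none is given):

import locale

# Under Apache with no locale configured this often reports
# 'ANSI_X3.4-1968' (i.e. ASCII); in a normal shell session it
# is usually 'UTF-8'.
print(locale.getpreferredencoding())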

Utf-8 codec can't decode byte 0xff in position 185: invalid start byte

Is anyone else getting this error when running PyInstaller?
Utf-8 codec can't decode byte 0xff in position 185: invalid start byte
I saved my Python file with Notepad++ as UTF-8 without BOM, but that hasn't helped. PyInstaller was working fine earlier, and suddenly I began getting this error. Is anyone else experiencing the same issue?
Regards,
A bit late to the party, but I was having this exact issue. You can open the file with 'rb' so that Python won't try to decode the text as ANSI. I did mine like this:
with open(path_to_file, 'rb') as f:
    contents = f.read()

# decode the raw bytes explicitly (this file happened to be UTF-16),
# then strip the trailing newline; the decode must come before rstrip,
# since bytes.rstrip would need a bytes argument, not "\n"
contents = contents.decode("utf-16").rstrip("\n")
contents = contents.split("\r\n")
The contents.split is just for formatting: when you decode the data in the file, it keeps all the \r\n line endings (on Windows) or \n (on Linux).
hope this helps!
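If you're not sure what encoding a mystery file actually uses, a crude heuristic sketch (not part of the original answer) is to try a few likely codecs against the raw bytes:

# The first codec that decodes cleanly is a good candidate, though
# cp1252 accepts almost any byte, so treat its success as a hint only.
with open(path_to_file, "rb") as f:
    raw = f.read()
for enc in ("utf-8", "utf-16", "cp1252"):
    try:
        print(enc, "->", raw.decode(enc)[:40])
    except UnicodeDecodeError:
        print(enc, "-> fails")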

open a unicode-named file on linux from Python

I was trying to open a file on Ubuntu in Python using:
open('<unicode_string>', "wb")
unicode_string is '\u9879\u76ee\u7ba1\u7406'; it is Chinese text.
But I get the following error:
UnicodeEncodeError: 'latin-1' codec can't encode characters in position 0-3: ordinal not in range(256)
I am trying to understand what limits this behavior. I went through some links, and I understand that the OS's filesystem is responsible for encoding the string. On Windows, the 'mbcs' encoding handles possibly every character. What could be the problem on Linux?
It does not fail on all Linux setups. What should I be checking?
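The 'latin-1' in the traceback is the clue: on Linux, Python derives the filesystem encoding from the process locale, so a session running under a C/POSIX or Latin-1 locale cannot encode CJK file names. A quick check, as a sketch:

import locale
import sys

# On a healthy setup both of these report a UTF-8 codec. If they
# show 'ascii' or 'latin-1' instead, the process locale is the
# culprit; exporting something like LANG=en_US.UTF-8 before starting
# Python is the usual fix (that locale name is just an example).
print(sys.getfilesystemencoding())
print(locale.getpreferredencoding())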
