Degree character (°) encoding/decoding - python-3.x

I am on Windows and use Python 3.
I have a text file which contains degree characters (°).
I want to read the whole text file, do some processing, and write it back with the modifications applied. Here is a sample of my code:
with io.open('myTextFile.txt',encoding='ASCII') as f:
    for item in allItem:
        i=0
        myData = pd.DataFrame(data=np.zeros((n,1)))
        for line in f:
            myRegex = "(AD"+item+")"
            if re.match(myRegex,line):
                myData.loc[i,0] = line
                i+=1
        myData = myData[(myData.T != 0).any()]
        myData = myData.append(pd.DataFrame(["\n"],index=[myData.index[-1]+1]))
        myData = myData[0].map(lambda x: x.strip()).to_frame()
        myData.to_csv('myModifiedTextFile.txt', header = False, index = False, mode='a', quoting=csv.QUOTE_NONE, escapechar=' ', encoding = 'ASCII')
However, I am getting Unicode errors even though I tried specifying the encoding:
'ascii' codec can't decode byte 0xe9 in position 512: ordinal not in range(128)

ASCII is not very useful here, since it only defines 128 characters, and the degree sign is not one of them. I am unsure what the actual encoding of your file is. Unicode and the commonly used Windows code pages (1250/1252) have the degree sign at 0xB0.
I assume that in your file there is a degree sign at position 512 and that it is causing the error. If this is the case, you need to be more specific with your encoding argument: figure out which code page / encoding was used to save the file, and confirm it by looking up that code page and finding the degree sign at 0xE9.
If there is a different character at position 512 ("é" is a good candidate), then simply specify an encoding like cp1250, cp1252, or cp1257.
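A minimal sketch of that check, assuming (hypothetically) that the file was saved as Windows-1252. The sample text and the temporary path are made up for illustration. It shows ASCII rejecting the raw bytes, while cp1252 maps 0xB0 to ° and 0xE9 to é:

```python
import os
import tempfile

# Hypothetical sample text; the real file contents are unknown.
sample = "Temperature: 25 \u00b0C, caf\u00e9"

# Assumption: the file was saved with the Windows-1252 code page.
path = os.path.join(tempfile.mkdtemp(), "myTextFile.txt")
with open(path, "w", encoding="cp1252") as f:
    f.write(sample)

# Inspect the raw bytes: 0xB0 is the degree sign, 0xE9 is e-acute.
with open(path, "rb") as f:
    raw = f.read()

# ASCII cannot decode any byte above 0x7F, so this fails.
try:
    raw.decode("ascii")
    ascii_error = None
except UnicodeDecodeError as exc:
    ascii_error = exc

# Reading with the encoding the file was saved in works.
with open(path, encoding="cp1252") as f:
    text = f.read()
```

The same encoding argument would then be passed when writing the modified file back, so that the degree sign survives the round trip.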

Related

Read .txt with emoji characters in python

I am trying to read a chat history with smileys in it, but I get the following error:
UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 38: character maps to <undefined>
My code looks like this:
file_name = "chat_file.txt"
chat = open(file_name)
chatText = chat.read() # read data
chat.close()
print(chatText)
I am pretty certain that it's because of elements like: ❤
How can I implement the correct transformation format, or what is the correct file encoding, so that Python can read these elements?
Never open text files without specifying their encoding.
Also, use with blocks; they call .close() automatically so you don't have to.
file_name = "chat_file.txt"
with open(file_name, encoding="utf8") as chat:
    chat_text = chat.read()
print(chat_text)
iso-8859-1 is a legacy encoding, which means it cannot represent emoji. For emoji, the text file has to use a Unicode encoding, and the most common one is UTF-8.
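As an illustration, here is a small sketch with a made-up chat line and a temporary file (both are assumptions, not the asker's data). The heart character is stored in UTF-8 as the bytes E2 9D A4. Decoding them with a legacy code page such as cp1252 fails on 0x9d, which matches the error in the question, while UTF-8 reads them back correctly:

```python
import os
import tempfile

# Hypothetical chat line containing a heart (U+2764).
chat_line = "I \u2764 Python"

path = os.path.join(tempfile.mkdtemp(), "chat_file.txt")
with open(path, "w", encoding="utf-8") as chat:
    chat.write(chat_line)

with open(path, "rb") as chat:
    raw = chat.read()  # b'I \xe2\x9d\xa4 Python'

# cp1252 has no character assigned to byte 0x9d, so decoding fails there.
try:
    raw.decode("cp1252")
    legacy_error = None
except UnicodeDecodeError as exc:
    legacy_error = exc

# With the correct encoding the text round-trips unchanged.
with open(path, encoding="utf-8") as chat:
    chat_text = chat.read()
```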

'charmap' codec can't encode characters in position XX

I have a simple script that attempts to extract multiple JSON objects from a single file and store them as a list:
import json
URL = r"C:\Users\Kenneth\Youtube_comment_parser\Testing.txt"
with open(URL, 'r', encoding="utf-8") as handle:
    json_data = [json.loads(line) for line in handle]
print(json_data) # Can't .encode() because it's a list
Even after specifying utf-8 encoding, I'm still running into a codec error. If possible, I would also like to change this object into a dictionary, but this is as far as I've got.
The exact error reads:
UnicodeEncodeError: 'charmap' codec can't encode characters in position
394-395: character maps to <undefined>
Thanks in advance.
I was able to solve this issue by removing one Unicode character that was producing the "<undefined>" error, the string '\ufeff'; after that, the rest displayed nicely. This required me to iterate over the keys in the list of dictionaries and replace as necessary.
import json
URL = r"C:\Users\Kenneth\Youtube_comment_parser\Testing.txt"
# Read the whole file as UTF-8; the with block closes it afterwards
with open(URL, encoding='utf-8') as json1_file:
    json1_str = json1_file.read()
# One JSON object per line
json1_str = [d.strip() for d in json1_str.splitlines()]
json1_data = [json.loads(i) for i in json1_str]
# Remove the U+FEFF character from every value
json1_data = [{key: value.replace('\ufeff', '') for key, value in record.items()}
              for record in json1_data]
print(json1_data[1]['text'].encode('utf-8'))
Still not sure why I have to open with utf-8 and then encode again with my print statement, but it produced the string nicely.
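A possible explanation, sketched under assumptions: reading and printing are two separate steps. open(..., encoding='utf-8') decodes the file, while print() encodes the text again for the console, and a Windows console often uses a legacy code page (the 'charmap' codec in the error). The sample data and the console_safe helper below are hypothetical. The sketch uses the standard "utf-8-sig" codec to drop a leading '\ufeff', and escapes characters the assumed console encoding cannot represent:

```python
import io
import json

# Hypothetical sample: two JSON lines saved with a UTF-8 byte order mark.
raw = b'\xef\xbb\xbf{"text": "caf\xc3\xa9"}\n{"text": "ok"}\n'

# "utf-8-sig" decodes UTF-8 and drops a leading U+FEFF if one is present.
with io.TextIOWrapper(io.BytesIO(raw), encoding="utf-8-sig") as handle:
    records = [json.loads(line) for line in handle if line.strip()]


def console_safe(text, encoding="cp1252"):
    """Escape characters that the given console encoding cannot represent.

    The default of cp1252 is an assumption about the console; in practice
    sys.stdout.encoding reports the encoding actually in use.
    """
    return text.encode(encoding, errors="backslashreplace").decode(encoding)


escaped = console_safe("\u2764")        # not in cp1252, so it is escaped
unchanged = console_safe("caf\u00e9")   # representable, so left as-is
```

Printing the result of console_safe(...) then avoids the UnicodeEncodeError without turning the text into a bytes object first.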

Trying to output hex data as readable text in Python 3.6

I am trying to read hex values from specific offsets in a file, and then show that as normal text. Upon reading the data from the file and saving it to a variable named uName, and then printing it, this is what I get:
Card name is: b'\x95\xdc\x00'
Here's the code:
cardPath = str(input("Enter card path: "))
print("Card name is: ", end="")
with open(cardPath, "rb+") as f:
    f.seek(0x00000042)
    uName = f.read(3)
print(uName)
How can I remove the 'b' I am getting at the beginning? And how can I remove the '\x'es so that b'\x95\xdc\x00' becomes 95dc00? If I can do that, then I guess I can convert it to text using binascii.
I am sorry if my mistake is really stupid; I don't have much experience with Python.
Strings that start with b in Python are byte strings.
Usually, you can use decode() or str(byte_string, 'UTF-8') to decode a byte string (i.e. a string that starts with b') into a string.
EXAMPLE
>>> str(b'\x70\x79\x74\x68\x6F\x6E','UTF-8')
'python'
>>> b'\x70\x79\x74\x68\x6F\x6E'.decode()
'python'
However, in your case, decoding raises a UnicodeDecodeError.
>>> str(b'\x95\xdc\x00','UTF-8')
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x95 in position 0: invalid start byte
I guess you need to find out the encoding of your file and then specify it when you open the file, like below:
open("u.item", encoding="THE_ENCODING_YOU_FOUND")
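For the literal request in the question (turning b'\x95\xdc\x00' into 95dc00), a short sketch using the standard bytes.hex() method (Python 3.5+) or binascii.hexlify. The variable names are illustrative only. Note that this produces the hexadecimal digits of the bytes, not decoded text, so it does not depend on knowing the file's encoding:

```python
import binascii

# The three bytes read from the file in the question.
u_name = b"\x95\xdc\x00"

# bytes.hex() returns the hex digits as a normal str, with no b prefix.
hex_text = u_name.hex()

# binascii.hexlify returns bytes, so decode it to get a str.
hex_via_binascii = binascii.hexlify(u_name).decode("ascii")

# bytes.fromhex reverses the conversion.
round_trip = bytes.fromhex(hex_text)

print("Card name is:", hex_text)
```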

urlopen() throwing error in python 3.3

from urllib.request import urlopen
def ShowResponse(param):
    uri = str("mysite.com/?param="+param+"&submit=submit")
    print(urlopen(uri).read())
file = open("myfile.txt","r")
if file.mode == "r":
    filelines = file.readlines()
    for line in filelines:
        line = line.strip()
        ShowResponse(line)
This is my Python code, but when I run it, it causes the error "UnicodeEncodeError: 'ascii' codec can't encode characters in position 47-49: ordinal not in range(128)".
I don't know how to fix this. I'm new to Python.
I'm going to assume that the stack trace shows that line 4 (uri = str(...)) is throwing the given error and that myfile.txt contains non-ASCII characters encoded as UTF-8.
The error occurs because you're trying to convert a Unicode object (decoded from assumed UTF-8) to an ASCII string object. ASCII simply cannot represent your character.
URIs (including the query string) must encode non-ASCII characters as percent-encoded UTF-8 bytes. Example:
€ (EURO SIGN) encoded in UTF-8 is:
0xE2 0x82 0xAC
Percent-encoded, it's:
%E2%82%AC
Therefore, your code needs to re-encode your parameter to UTF-8 then percent-encode it:
from urllib.request import urlopen
from urllib.parse import quote
def ShowResponse(param):
    param_utf8 = param.encode("utf-8")
    param_perc_encoded = quote(param_utf8)
    # or uri = str("mysite.com/?param="+param_perc_encoded+"&submit=submit")
    uri = str("mysite.com/?param={0}&submit=submit".format(param_perc_encoded) )
    print(urlopen(uri).read())
You'll also see I've changed your uri = definition slightly to use str.format() (https://docs.python.org/2/library/string.html#format-string-syntax), which I find makes complex strings easier to build than concatenation with +. In this example, {0} is replaced with the first argument to .format().
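The percent-encoding step can be checked on its own, without making a network request. This is a small sketch using urllib.parse; the "http://" scheme in the final URL is an assumption added for illustration, since the original snippet omits it:

```python
from urllib.parse import quote, urlencode

param = "\u20ac"  # EURO SIGN

# quote() encodes the text as UTF-8 and percent-encodes each byte.
encoded = quote(param)

# urlencode() builds the whole query string and quotes every value.
query = urlencode({"param": param, "submit": "submit"})

# Assumption: the site is reachable over http; only the string is built here.
uri = "http://mysite.com/?" + query
```

urlencode() is a convenient alternative when there are several parameters, because it also handles characters such as & and = inside values.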

Unable to remove string from text I am extracting from html

I am trying to extract the main article from a web page. I can accomplish the main text extraction using Python's readability module. However the text I get back often contains several &#13 strings (there is a ; at the end of this string but this editor won't allow the full string to be entered (strange!)). I have tried using the python replace function, I have also tried using regular expression's replace function, I have also tried using the unicode encode and decode functions. None of these approaches have worked. For the replace and Regular Expression approaches I just get back my original text with the &#13 strings still present and with the unicode encode decode approach I get back the error message:
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa9' in position 2099: ordinal not in range(128)
Here is the code I am using that takes the initial URL and, using readability, extracts the main article. I have left in all my commented-out code that corresponds to the different approaches I have tried to remove the &#13 string. It appears as though &#13 is interpreted to be u'\xa9'.
import urllib
from readability.readability import Document
def find_main_article_text_2():
    #url = 'http://finance.yahoo.com/news/questcor-pharmaceuticals-closes-transaction-acquire-130000695.html'
    url = "http://us.rd.yahoo.com/finance/industry/news/latestnews/*http://us.rd.yahoo.com/finance/external/cbsm/SIG=11iiumket/*http://www.marketwatch.com/News/Story/Story.aspx?guid=4D9D3170-CE63-4570-B95B-9B16ABD0391C&siteid=yhoof2"
    html = urllib.urlopen(url).read()
    readable_article = Document(html).summary()
    readable_title = Document(html).short_title()
    #readable_article.replace("u'\xa9'"," ")
    #print re.sub("&#13;",'',readable_article)
    #unicodedata.normalize('NFKD', readable_article).encode('ascii','ignore')
    print readable_article
    #print readable_article.decode('latin9').encode('utf8'),
    print "There are " ,readable_article.count("&#13;"),"&#13;'s"
    #print readable_article.encode( sys.stdout.encoding , '' )
    #sent_tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
    #sents = sent_tokenizer.tokenize(readable_article)
    #new_sents = []
    #for sent in sents:
    #    unicode_sent = sent.decode('utf-8')
    #    s1 = unicode_sent.encode('ascii', 'ignore')
    #    #s2 = s1.replace("\n","")
    #    new_sents.append(s1)
    #print new_sents
    # u'\xa9'
I have a URL that I have been testing the code with included inside the def. If anybody has any ideas on how to remove this &#13 I would appreciate the help. Thanks, George
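One possible approach, sketched in Python 3 rather than the Python 2 of the question, with a made-up sample string and a hypothetical helper name. Two details may be relevant: str.replace returns a new string instead of modifying the original, so its result has to be assigned, and &#13; is the character reference for a carriage return (U+000D), whereas u'\xa9' is the copyright sign, so the UnicodeEncodeError is likely a separate issue caused by encoding to ASCII:

```python
def strip_carriage_returns(markup):
    """Remove the literal '&#13;' entity and any real carriage returns.

    Hypothetical helper for illustration; it returns a new string and
    leaves every other character, including U+00A9, untouched.
    """
    return markup.replace("&#13;", "").replace("\r", "")


# Made-up sample standing in for the text returned by readability.
sample = "First line&#13;\nSecond line \u00a9 2013\r\n"

cleaned = strip_carriage_returns(sample)
remaining = cleaned.count("&#13;")
```

Because the copyright sign is kept, writing the cleaned text out would still need an encoding that can represent it, such as UTF-8.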

Resources