Decode Unicode chars - string

I am using Stackoverflow API to fetch some data but the response contains Unicode and some other encodings which I am unable to decode. Here's one sample text:
you can use numpy \u0026#39;\u0026#39;\u0026#39;
python
import numpy as np
np.random.random((pN,C,K))\u0026#39;\u0026#39;\u0026#39;
What's the best way to decode this response in Go?
The sample text is a Python answer I received as response from Stackoverflow API.

Several layers of encoding seem to be in place here:
first, the \u0026 could be a & character
this gives ' which seems to be an XML character literal for '
the resulting ''' seems to mark up the enclosed text to be taken literally including the line breaks. It is also possible that the python indicates the language for syntax highlighting of the enclosed text.

Related

weird characters in Pandas dataframe - how to standardize to UTF-8?

I'm using Python + Camelot (OCR library) to read a PDF, clean up, and write to Excel or csv. There are some non-standard dashes that print out a weird character.
Using Camelot means I'm not calling "read_csv". It's coming from the PDF. A value that is supposed to be "1-4" prints out as 1–4.
I fixed this using a regular expression but a colleague mentioned I should standardize to UTF-8. I tried to do that for the header like this:
header = df.iloc[0, 1:].str.encode('utf-8')
but then that value becomes b'1\xe2\x80\x934'.
Any advice? The goal is to simply use standard text.

Arabic text replaced with escape sequences when creating CSV files using python

I am trying to create a CSV file that contains Arabic tweets collected using tweepy for a project I am doing. All is fine gathering the data, however, when i am writing to the CSV file all Arabic results are escaped with \xXXXX sequences
as follows:
b'#\xd8\xa7\xd9\x84\xd9\x8a\xd9\x88\xd9\x85_\xd8\xa7\xd9\x84\xd8\xb9\xd8\xa7\xd9\x84\xd9\x85\xd9\x8a_\xd9\x84\xd9\x84\xd8\xa7\xd8\xb9\xd8\xa7\xd9\x82\xd9\x87_2017 \xd8\xa7\xd9\x84\xd8\xa5\xd8\xb9\xd8\xa7\xd9\x82\xd8\xa9 \xd8\xa7\xd9\x84\xd8\xad\xd9\x82\xd9\x8a\xd9\x82\xd9\x8a\xd8\xa9 \xd8\xa7\xd8\xb9\xd8\xa7\xd9\x82\xd8\xa9 \xd8\xa7\xd9\x84\xd9\x81\xd9\x83\xd8\xb1 \xd9\x88\xd9\x84\xd9\x8a\xd8\xb3\xd8\xaa \xd8\xa7\xd8\xb9\xd8\xa7\xd9\x82\xd8\xa9
I looked at many previously asked questions and all I could find was suggestions for python 2 or answers similar to the one I am writing. When I was creating JSON files instead I was using ensure_ascii=False but I couldn't find anything similar for CSV. Below is my code:
with codecs.open('tweets.csv', 'a', encoding='utf-8') as file:
fieldnames = ['tweet', 'country']
writer = csv.DictWriter(file, fieldnames=fieldnames)
data = {'tweet': status.text, 'country': status.place.full_name}
writer.writerow(data)
I tried adding .encoding='utf-8' to status.text and status.place as well but that also didn't work. Any suggestions?
You have to make sure the Arabic string you have is decoded into UTF-8 before you write it. Assuming status.text is of type bytes you should type text=status.text.decode('utf-8'). (Maybe you have to do this for status.place.full_name too.) But if it's of type str then it won't have an decode() method. To avoid escape sequences in your file, a str object should be written anyway.
If you try to specify the encoding of a bytes object (like the one you presumably have) as 'utf-8' that won't work because the text is already in UTF-8 bytes. So in order to get UTF-8 characters you must call decode() on the bytes object. That way it writes the UTF-8 characters and not the UTF-8 bytes.

Python 3.6 - Can't convert Unicode in a string

I am doing some scraping work with Python 3.6 and retrieved some URLs in strings following this format:
someURL = 'http:\u002F\u002Fsomewebsite.com\u002Fsomefile.jpg'
I have been trying to convert the Unicode backslash (\u002F) in those strings in order to use the URLs (using regex methods, encode() on the strings, etc.), to no avail. The string still keeps the Unicode backslash and, if I pass it on to Requests' get(), for example, I am greeted by the following error message:
InvalidURL: Failed to parse: http:\u002F\u002Fsomewebsite.com\u002Fsomefile.jpg"
I searched for solutions in this forum and and others, but can't seem to put the finger on it. I am sure it is simple...
Use codecs.decode with the encoding named 'unicode-escape':
import codecs
print(codecs.decode(someURL, 'unicode-escape'))
# prints 'http://somewebsite.com/somefile.jpg'

How to solve the ascii problems when I try to download pictures?

Here are my codes:
import re,urllib
from urllib import request, parse
def gh(url):
html=urllib.request.urlopen(url).read().decode('utf-8')
return html
def gi(x):
r=r'src="(.+?\.jpg)"'
imgre=re.findall(r, x)
y=0
for iu in imgre:
urllib.request.urlretrieve(iu, '%s.jpg' %y)
y=y+1
va=gh('http://tieba.baidu.com/p/3497570603')
print(gi(va))
when it is running, it occurs:
UnicodeEncodeError: 'ascii' codec can't encode character '\u65e5' in position 873: ordinal not in range(128)
I have decoded the content of website with 'utf-8' which turns into string, and where is the 'ascii codec' problem from?
The problem is that the HTML content of http://tieba.baidu.com/p/3497570603 contains references to .png images so the non-greedy regular expression matches long strings of text such as
http://static.tieba.baidu.com/tb/editor/images/client/image_emoticon28.png" ><br><br><br><br>
...
title="蓝钻"><img src="http://imgsrc.baidu.com/forum/pic/item/bede9735e5dde711c981db20a0efce1b9f1661d5.jpg
Calling the urlretrieve() method with URLs consisting of long strings containing non-ASCII characters results in the UnicodeEncodeError being thrown while trying to convert the URL argument to ASCII.
A better regular expression to avoid matching too much text would be
r=r'src="([^"]+?\.jpg)"'
Debugging
In the spirit of teaching someone to fish rather than simply giving them a fish for one day, I’d recommend that you use print statements to debug problems such as this. I was able to diagnose this particular issue by replacing the urllib.request.urlretrieve(iu, '%s.jpg' %y) line with print(iu).

Python3 Base64 decode of a var containing ==

So Ive got a string of:
YDNhZip1cDg1YWg4cCFoKg==
that needs to be decoded using Pythons Base64 module.
Ive written the code
import base64
test = 'YDNhZip1cDg1YWg4cCFoKg=='
print(test)
print(base64.b64decode(test))
which gives the answer
b'`3afup85ah8p!h'
when, according to the website decoders Ive used, its really
`3afup85ah8p!h
Im guessing that its decoding the additional quotes.
Is there some way that I can save this variable with a delimiter, as another type of variable, or run the b64encode on a section of the string as slice doesnt seem to work?
b' is Python's way of delimiting data from bytes, see: What does the 'b' character do in front of a string literal?
i.e., it is decoding it correctly.

Resources