Open URL with non-ASCII characters (emojis) in its parameters with urllib - python-3.x

I'm making a Telegram bot that sends messages to a channel. Everything works fine until I try to send a message with a non-ASCII character (like an emoji) inside the URL's text parameter. Whenever I try to run something like this:
botMessage = '🚨'
urlRequest = f'https://api.telegram.org/bot{telegram_token}/sendMessage?chat_id={chat_id}&text={botMessage}'
urlRequest = urlRequest.replace(" ", "%20")
urllib.request.urlopen(urlRequest)
I get this error:
UnicodeEncodeError: 'ascii' codec can't encode character '\U0001f6a8' in position 95: ordinal not in range(128)

Non-ASCII characters are not allowed in a URL. This is a limitation of the URL syntax used by HTTP and is not related to Telegram. Use the urllib.parse.quote function to percent-encode the text (as UTF-8 bytes) into ASCII, as follows:
import urllib.parse
import urllib.request
# Percent-encode the message; quote() also turns spaces into %20,
# so the separate replace(" ", "%20") step is no longer needed
botMessage = urllib.parse.quote('🚨')
urlRequest = f'https://api.telegram.org/bot{telegram_token}/sendMessage?chat_id={chat_id}&text={botMessage}'
urllib.request.urlopen(urlRequest)
There are many Python libraries for Telegram bots. They are easy to use and hide these details.
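As an alternative to quoting each value by hand, the whole query string can be built with urllib.parse.urlencode. This is only a minimal sketch: the helper name build_send_message_url and the sample token and channel values are made up for illustration, and no request is actually sent.

```python
import urllib.parse


def build_send_message_url(token: str, chat_id: str, text: str) -> str:
    """Build a sendMessage URL with every query value percent-encoded."""
    # urlencode() encodes non-ASCII text as UTF-8 percent-escapes
    # and turns spaces into '+', so the result is pure ASCII.
    query = urllib.parse.urlencode({"chat_id": chat_id, "text": text})
    return f"https://api.telegram.org/bot{token}/sendMessage?{query}"


# Placeholder values, not real credentials
url = build_send_message_url("TOKEN", "@my_channel", "🚨 alert now")
print(url)
```

The resulting string can be passed to urllib.request.urlopen in the same way as in the answer above.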

Related

Python 3: UnicodeEncodeError: 'ascii' codec can't encode character '\xe4'

I am trying to send some emails with pandas from an Excel file.
I have had this error for over a week now, and even after hours of searching through SO, Google, forums and so on, I just can't come up with an answer to fix the problem.
Here is the code:
import pandas as pd
import smtplib
your_name = "myname"
your_email = "mymail"
your_password = "mypw"
server = smtplib.SMTP_SSL('smtp.gmail.com', 465)
server.ehlo()
server.login(your_email, your_password)
# Read the file
email_list = pd.read_excel("myfile.xlsx")
# Get all the Names, Email Addresses, Subjects and Messages
all_emails = email_list['Email']
all_messages = email_list['Text']
# Loop through the emails
for idx in range(len(all_emails)):
    # Get each records name, email, subject and message
    email = all_emails[idx]
    message = all_messages[idx]
    # Create the email to send
    full_email = ("From: {0} <{1}>\n"
                  "To: <{2}>\n"
                  "Subject: My_Subject_Title\n\n"
                  "{3}"
                  .format(your_name, your_email, email, message))
    # In the email field, you can add multiple other emails if you want
    # all of them to receive the same text
    try:
        server.sendmail(your_email, [email], full_email)
        print('Email to {} successfully sent!\n\n'.format(email))
    except Exception as e:
        print('Email to {} could not be sent :( because {}\n\n'.format(email, str(e)))
server.close()
I am getting the error:
'ascii' codec can't encode character '\xe4'
So obviously the error is caused by some European letters inside my Excel file.
What I tried (along with several other approaches) was to encode the file:
email_list = pd.read_excel("myfile.xlsx", encoding=("utf-8"))
>>> TypeError: read_excel() got an unexpected keyword argument 'encoding'
or:
email_list = pd.read_excel("myfile.xlsx")
email_list.encode("utf-8")
>>> AttributeError: 'DataFrame' object has no attribute 'encode'
None of it seems to work.
I'd be happy if someone could help me figure out what I'm doing wrong.
I'm very new to Python and these are my first real attempts to solve some actual work-related problems.
Thanks a lot in advance!
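One possible direction, offered as a sketch rather than a confirmed fix: when smtplib's sendmail() is given a str, it encodes it as ASCII, so a body containing 'ä' fails. Building the message with email.message.EmailMessage lets the standard library declare a UTF-8 charset instead. The helper name build_email, the addresses, and the sample text below are assumptions for illustration only.

```python
from email.message import EmailMessage


def build_email(sender_name, sender_addr, recipient, subject, body):
    """Build a message whose non-ASCII body is declared as UTF-8."""
    msg = EmailMessage()
    msg["From"] = f"{sender_name} <{sender_addr}>"
    msg["To"] = recipient
    msg["Subject"] = subject
    # set_content() picks a utf-8 charset when the text is not pure ASCII
    msg.set_content(body)
    return msg


# Hypothetical sample data standing in for one row of the Excel file
msg = build_email("myname", "me@example.com", "you@example.com",
                  "My_Subject_Title", "Grüße, hier ist ein ä")
raw = msg.as_bytes()  # serializes without raising UnicodeEncodeError
```

With the connected server object from the question, server.send_message(msg) would then take the place of server.sendmail(your_email, [email], full_email).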

imaplib with Python 3.7.4 occasionally returns an attachment that fails to be decoded

Some background:
imaplib with Python 3.7.4 occasionally returns a photo attachment (jpg) that fails to be decoded after being downloaded from the server. I've confirmed across multiple emails that the photos are sent with base64 encoding. Most photos work; however, certain ones don't, for whatever reason. At this time, I don't know which email client is being used to send the particular email that causes the crash, or the source of the photo (phone, camera, PC, etc.). I've tested every file type supported by python-pillow without any issues, though. It's just this one photo/email. Lastly, if I remove the attachment there are no issues, so it's something to do with the photo. All Python packages are the current versions.
The commented lines in the code below show what I've tried. I tried the following encodings:
utf-8 (which fails to decode it at all)
Traceback (most recent call last):
File "(file path)", line 514, in DoEmail
raw_email_string = raw_email.decode('utf-8')
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x92 in position 10922: invalid start byte
cp1252 (Which returns a NoneType object when trying to save the file.)
Traceback (most recent call last):
part.get_payload(decode=True))
TypeError: a bytes-like object is required, not 'NoneType'
I've looked at the documentation for email.parser Source and email.parser Docs and imaplib Docs. Also a good example by MattH and attachment example by John Paul Hayes.
My Question:
Why do certain photos, even though they seem to be encoded correctly, cause it to crash? And how do I fix it? Is there a better method to get and save the attachments?
Relevant Code:
# Site is the email server address
# Port is the email server port, usually 993.
mail = imaplib.IMAP4_SSL(host=Site, port=Port) # imaplib module implements connection based on IMAPv4 protocol
mail.login(Email, password)
mail.select('inbox', readonly=False) # Connected to inbox.
# SearchPhrase is the Phrase used when finding unique emails.
result, data = mail.uid('SEARCH', None, f'Subject "{SearchPhrase}"') # search and return uids instead
if result == 'OK':
    EmailIdList = data[0].split() # EmailIdList is a space separated byte string of the ids
    count = len(EmailIdList)
    for x in range(count):
        if GUI: GUI.resultStatus = resx.currentProgress(x+1, count)
        latest_email_uid = EmailIdList[x] # unique ids wrt label selected
        EmailID = latest_email_uid.decode('utf-8')
        result, email_data = mail.uid('fetch', latest_email_uid, '(RFC822)')
        if result == 'OK':
            raw_email = email_data[0][1]
            # try:
            #     raw_email_string = raw_email.decode('utf-8')
            # except:
            #     raw_email_string = raw_email.decode('cp1252')
            # email_message = email.message_from_string(raw_email)
            email_message = email.message_from_bytes(raw_email)
            print(email_message)
            dt = parse(email_message['Date']) #dateutil.parser.parse()
            day = str(dt.strftime("%B %d, %Y")) #date())
            msg.get_content_charset(), 'ignore').encode('utf8', 'replace')
            # this will loop through all the available multiparts in email
            for part in email_message.walk():
                charset = part.get_content_charset()
                if part.get_content_maintype() != 'multipart' and part.get('Content-Disposition') is not None:
                    fileName = part.get_filename().replace('\n','').replace('\r','')
                    if fileName != '' and fileName is not None:
                        print(fileName)
                        with open(fileName, 'wb') as f:
                            ######## ---- HERE ---- ##########
                            f.write(part.get_payload(decode=True))
                elif part.get_content_type() == "text/plain": # get only text/plain
                    body = str(part.get_payload(decode=True), str(charset), "ignore").replace('\r','')
                    print(body)
                elif part.get_content_type() == "text/html": # get only html
                    html = str(part.get_payload(decode=True), str(charset), "ignore").replace('\n', '').replace('\r', ' ')
                    print(html)
                else:
                    continue
Edit:
I believe these are the MIME Headers for the image in question.
------=_NextPart_000_14A6_01D55B4C.3FE8C840
Content-Type: image/jpeg;
name="8~a~0ff68d6a-12aa-49bf-9908-0b28ecd7ec83~634676194557918023.jpg"
Content-Transfer-Encoding: base64
Content-Disposition: attachment;
filename="8~a~0ff68d6a-12aa-49bf-9908-0b28ecd7ec83~634676194557918023.jpg"
Edit: The location of the crash (when it decodes the base64 data to save the file) is denoted by: ######## ---- HERE ---- ##########
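A defensive variant of the attachment loop is sketched below. It does not diagnose why this particular photo fails; it only guards against the two values that can be None here, since get_filename() returns None when no filename is present and get_payload(decode=True) returns None for a part that is itself multipart. The function name iter_attachments and the demo message are made up for illustration.

```python
import email
from email.message import EmailMessage


def iter_attachments(raw_email: bytes):
    """Yield (filename, payload_bytes) for each decodable attachment."""
    message = email.message_from_bytes(raw_email)
    for part in message.walk():
        if part.get_content_maintype() == "multipart":
            continue
        if part.get("Content-Disposition") is None:
            continue
        name = part.get_filename()               # may be None
        payload = part.get_payload(decode=True)  # may be None
        if not name or payload is None:
            continue  # skip instead of crashing on f.write(None)
        yield name.replace("\r", "").replace("\n", ""), payload


# Hypothetical in-memory message standing in for a fetched email
demo = EmailMessage()
demo["Subject"] = "demo"
demo.set_content("see attachment")
demo.add_attachment(b"\xff\xd8\xff\xe0fake-jpeg", maintype="image",
                    subtype="jpeg", filename="photo.jpg")
attachments = list(iter_attachments(demo.as_bytes()))
```

Logging the content type of any skipped part may help identify what the problematic email actually contains.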

bs4 can't recognize encoding python3

I am trying to scrape a few pages using Python3 for the first time. I have used Python2 many times with bs4 without any trouble, but I can't seem to be able to switch to python3, as I am always getting encoding errors.
For example, I am trying to scrape https://www.pgatour.com/webcom/tournaments/the-bahamas-great-exuma-classic/leaderboard.html
I have searched through a few threads here that have similar questions, without success.
Here is my code:
r = requests.get('https://www.pgatour.com/webcom/tournaments/the-bahamas-great-exuma-classic/leaderboard.html')
r.encoding = r.apparent_encoding
soup = bs.BeautifulSoup(r.text,'html5lib')
print(soup)
I get the following error:
UnicodeEncodeError: 'ascii' codec can't encode character '\xd7' in position 28935: ordinal not in range(128)
I also tried to change r.encoding = r.apparent_encoding to r.encoding = "utf-8", getting the same error.
You can change the encoding as follows. This should fix your error.
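import requests
from bs4 import BeautifulSoup as BS
r = requests.get("https://s3.amazonaws.com/codecademy-content/courses/beautifulsoup/cacao/index.html")
print(r.encoding)
# .encode('utf-8') returns bytes, so print() no longer has to encode the text itself
soup = BS(r.content, 'html.parser').encode('utf-8')
print(soup)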
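The traceback names the 'ascii' codec, which suggests the error may come from print() writing to a stdout whose encoding is ASCII, rather than from requests or bs4. The sketch below reproduces that situation with in-memory streams; the stream names and sample text are made up for illustration.

```python
import io

text = "3 \u00d7 4"  # contains the '\xd7' character from the traceback

# A text stream that can only encode ASCII, like a misconfigured stdout
ascii_stream = io.TextIOWrapper(io.BytesIO(), encoding="ascii")
try:
    ascii_stream.write(text)
    ascii_stream.flush()
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True

# The same text written through a UTF-8 stream succeeds
utf8_buffer = io.BytesIO()
utf8_stream = io.TextIOWrapper(utf8_buffer, encoding="utf-8")
utf8_stream.write(text)
utf8_stream.flush()
utf8_bytes = utf8_buffer.getvalue()
```

If this is the cause, setting PYTHONIOENCODING=utf-8 in the environment, or calling sys.stdout.reconfigure(encoding='utf-8') on Python 3.7+, are common ways to change the encoding that print() uses.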

json() on "requests" response raises UnicodeEncodeError

I'm querying Github's Jobs API with python3, using the requests library, but running into an error parsing the response.
Library: http://docs.python-requests.org/en/latest/
Code:
import requests
import json
url = 'https://jobs.github.com/positions.json?'
response = requests.get(url)
print(response.json())
Error:
UnicodeEncodeError: 'ascii' codec can't encode character '\u2019' in
position 321: ordinal not in range(128)
Using this API in the past with Ruby, I've never run into this issue.
I also tried converting it to a dictionary but it resulted in the same errors.
There are other questions on SO about the UnicodeEncodeError (mostly re: opening files), but I'm not familiar with Python and didn't find them helpful.
First, check that the response is actually JSON. Try printing response.text and see if it looks like a valid JSON object.
Assuming it is JSON: it's very hacky, but you can replace the non-ASCII characters with their escaped Unicode representation:
import json
import re
def escape_unicode(c):
    # Turn one non-ASCII character into its backslash escape, e.g. '\u2019'
    return c.encode('ascii', 'backslashreplace').decode('ascii')
response = ...
text = response.text
escaped = re.sub(r'[^\x00-\x7F]', lambda m: escape_unicode(m.group(0)), text)
json_response = json.loads(escaped)
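If the error is raised by print() rather than by the JSON parsing, a less hacky option may be to serialize the parsed data back with json.dumps, whose default ensure_ascii=True escapes every non-ASCII character. This is a sketch with a made-up dictionary standing in for response.json().

```python
import json

# Hypothetical stand-in for the parsed API response
data = {"title": "We\u2019re hiring"}

# ensure_ascii defaults to True, so the output contains only ASCII
ascii_safe = json.dumps(data)
print(ascii_safe)
```

This also avoids a limitation of the regex approach: for characters outside the Basic Multilingual Plane, 'backslashreplace' produces a \U escape with eight hex digits, which is not a valid JSON escape.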

Python3 encode unicode

When I make a GET request with the requests library in Python 3, I get this response:
{"status":true,"data":[{"koatuu":7121586600,"zona":8,"kvartal":2,"parcel":501,"cadnum":"7121586600:08:002:0501","ownershipcode":100,"purpose":"\u0414\u043b\u044f \u0432\u0435\u0434\u0435\u043d\u043d\u044f \u0442\u043e\u0432\u0430\u0440\u043d\u043e\u0433\u043e \u0441\u0456\u043b\u044c\u0441\u044c\u043a\u043e\u0433\u043e\u0441\u043f\u043e\u0434\u0430\u0440\u0441\u044c\u043a\u043e\u0433\u043e \u0432\u0438\u0440\u043e\u0431\u043d\u0438\u0446\u0442\u0432\u0430","use":"\u0414\u043b\u044f \u0432\u0435\u0434\u0435\u043d\u043d\u044f \u0442\u043e\u0432\u0430\u0440\u043d\u043e\u0433\u043e \u0441\u0456\u043b\u044c\u0441\u044c\u043a\u043e\u0433\u043e\u0441\u043f\u043e\u0434\u0430\u0440\u0441\u044c\u043a\u043e\u0433\u043e \u0432\u0438\u0440\u043e\u0431\u043d\u0438\u0446\u0442\u0432\u0430","area":"1.3397","unit_area":"\u0433\u0430 ","ownershipvalue":null,"id_office":630}]}
How can I get cp1252 letters as response?
My code is:
import requests
url = 'http://map.land.gov.ua/kadastrova-karta/get-parcel-Info?koatuu=7121586600&zone=08&quartal=002&parcel=0004'
page = requests.get(url)
print(page.text)
print(page.json())
solved my problem :D
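The \uXXXX sequences in the raw text are JSON escapes, and parsing the text decodes them into ordinary characters. The sketch below shows this with a shortened, hand-written fragment of the response above. Note that the characters are Cyrillic, which cp1252 cannot represent.

```python
import json

# Shortened fragment of the response, written as a raw JSON string
raw = '{"unit_area": "\\u0433\\u0430 ", "id_office": 630}'

data = json.loads(raw)  # escapes become real characters
readable = json.dumps(data, ensure_ascii=False)  # keep them unescaped
print(readable)
```

With requests, page.json() performs the same parsing step as json.loads here.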
