Can't sent cyrillic text as the parameter of the request - python-3.x

I am trying to send get request with parameters which consist of some cyrillic text. I tried a lot of solutions and get other errors with encoding and decoding. Nothing helps.
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import http.client
import json
import urllib.parse
data = {}
data["id"] = 123123
values = {}
values["someText"] = "йцколфыовал фыоварфылова фоывафлова"
data["values"] = values
conn = http.client.HTTPConnection("example.com")
params = {"data": json.dumps(data)}
conn.request("GET", "url_request?{}".format(urllib.parse.urlencode(params)))
res = conn.getresponse()
data = res.read()
I expect a successfull sending of the request with cyrillic text.
I got next error message:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd4 in position
254: invalid continuation byte "
Thanks.

Related

Encoding error trying to write file with python

Here is the full script:
import requests
import bs4
res = requests.get('https://example.com')
soup = bs4.BeautifulSoup(res.text, 'lxml')
page_HTML_code = soup.prettify()
multiline_code = """{}""".format(page_HTML_code)
f = open("testfile.txt","w+")
f.write(multiline_code)
f.close()
So I'm trying to write the entire Downloaded HTML as a file while keeping it neat and clean.
I do understand that it has problems with the text and can't save certain characters, but I'm not sure how to encode the text correctly.
Can anyone help?
This is the error message that I will get
"C:\Location", line 16, in <module>
f.write(multiline_code)
File "C:\\Anaconda3\lib\encodings\cp1252.py", line 19, in encode
return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u0421' in position 209: character maps to <undefined>
I did some digging around and this worked:
import requests
import bs4
res = requests.get('https://example.com')
soup = bs4.BeautifulSoup(res.text, 'lxml')
page_HTML_code = soup.prettify()
multiline_code = """{}""".format(page_HTML_code)
#add the Encoding part when opening file and this did the trick
with open('testfile.html', 'w+', encoding='utf-8') as fb:
fb.write(multiline_code)

outputting python script results into text

I have a want to save my python script's result into a txt file.
My python code
from selenium import webdriver
bro = r"D:\Developer\Software\Python\chromedriver.exe"
driver=webdriver.Chrome(bro)
duo=driver.get('http://www.lolduo.com')
body=driver.find_elements_by_tag_name('tr')
for post in body:
print(post.text)
driver.close()
Some codes that I've tried
import subprocess
with open("output.txt", "w") as output:
subprocess.call(["python", "./file.py"], stdout=output);
I tried this code and it only makes a output.txt file and has nothing inside it
D:\PythonFiles> file.py > result.txt
Exception:
UnicodeEncodeError: 'charmap' codec can't encode character '\u02c9' in
position 0: character maps to
and only prints out 1/3 of the results of the script into a text file.
You can try below code to write data to text file:
from selenium import webdriver
bro = r"D:\Developer\Software\Python\chromedriver.exe"
driver = webdriver.Chrome(bro)
driver.get('http://www.lolduo.com')
body = driver.find_elements_by_tag_name('tr')
with open("output.txt", "w", encoding="utf8") as output:
output.write("\n".join([post.text for post in body]))
driver.close()
You can try this. This Is my Python Code:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from time import sleep
import time
bro = r"D:\Developer\Software\Python\chromedriver.exe"
driver = webdriver.Chrome(bro)
driver.get('http://www.lolduo.com')
body = driver.find_elements_by_tag_name('tr') .text
with open('output15.txt', mode='w') as f:
for post in body:
print(post)
f.write(post)
time.sleep(2)
driver.close()

python 3.6 ascii codec error in urllib request

I'm trying to download picture from website with my python script, but every time i use georgian alphabet in url it gets error "UnicodeEncodeError: 'ascii' codec can't encode characters"
here is my code:
import os
import urllib.request
def download_image(url):
fullfilename = os.path.join('/images', 'image.jpg')
urllib.request.urlretrieve(url, fullfilename)
download_image(u'https://example.com/media/სდასდსადადსაფა_8QXjrbi.jpg')
I think it's better to use requests library in your example which deals with utf-8 characters.
Here is the code:
import requests
def download_image(url):
request = requests.get(url)
local_path = 'images/images.jpg'
with open(local_path, 'wb') as file:
file.write(request.content)
my_url = 'https://upload.wikimedia.org/wikipedia/commons/thumb/8/81/ერეკლე_II_ბავშვობის_სურათი.jpgw/459px-ერეკლე_II_ბავშვობის_სურათი.jpg'
download_image(my_url)

Get API response chunk encoded. ERROR: data byte not string (ubuntu)

I have some code that works when I run it on a Windows machine, but when it runs in Ubuntu on a google ComputeEngine VM I get the following error.
Traceback (most recent call last): File "firehose_get.py", line 43,
in
print(json.dumps(json.loads(line),indent=2)) File "/home/stuartkirkup/anaconda3/lib/python3.5/json/init.py", line
312, in loads
s.class.name)) TypeError: the JSON object must be str, not 'bytes'
It's exactly the same code that runs fine on Windows. I've done quite a bit of reading and it looks like an encoding issue - and as you'll see from some of the commented out sections in my code I've tried some ways to change the encoding but without joy. I've tried various things but can't work out how to debug it ... I'm fairly new to Python
I'm using Anaconda which some further reading says it has an ill advised setdefaultencoding hack built in.
Here is the stream header showing it's chunked data, which I believe is why it's bytes
{'Transfer-Encoding': 'chunked', 'Date': 'Thu, 17 Aug 2017 16:53:35 GMT', 'Content-Type': 'application/json', 'x-se
rver': 'db220', 'Content-Encoding': 'gzip'}
Code file - firehose_requests.py (with api keys infor replaced by ####)
import requests
MAX_REDIRECTS = 1000
def get(url, **kwargs):
kwargs.setdefault('allow_redirects', False)
for i in range(0, MAX_REDIRECTS):
response = requests.get(url, **kwargs)
#response.encoding = 'utf-8'
print ("test")
print (response.headers)
if response.status_code == requests.codes.moved or \
response.status_code == requests.codes.found:
if 'Location' in response.headers:
url = response.headers['Location']
content_type_header = response.headers.get('content_type')
print (content_type_header)
continue
else:
print ("Error when reading the Location field from HTTP headers")
return response
Code file - firehose_get.py
import json
import requests
from time import sleep
import argparse
#import ConfigParser
import firehose_requests
from requests.auth import HTTPBasicAuth
# Make it work for Python 2+3 and with Unicode
import io
try:
to_unicode = unicode
except NameError:
to_unicode = str
#request a token from Adobe
request_access_token = requests.post('https://api.omniture.com/token', data={'grant_type':'client_credentials'}, auth=HTTPBasicAuth('##############-livestream-poc','488##############1')).json()
#print(request_access_token)
#grab the token from the JSON returned
access_token = request_access_token["access_token"]
print(access_token)
url = 'https://livestream.adobe.net/api/1/stream/eecoukvanilla-##############'
sleep_sec=0
rec_count=10
bearer = "Bearer " + access_token
headers = {"Authorization": bearer,"accept-encoding":"gzip,deflate"}
r = firehose_requests.get(url, stream=True, headers=headers)
#open empty file
with open('output_file2.txt', 'w') as outfile:
print('', file=outfile)
#Read the Stream
if r.status_code == requests.codes.ok:
count = 0
for line in r.iter_lines():
if line:
#write to screen
print ("\r\n")
print(json.dumps(json.loads(line),indent=2))
#append data to file
with open('output_file2.txt', 'a') as outfile:
print("\r\n", file=outfile)
print(json.dumps(json.loads(line),ensure_ascii = False),file=outfile)
#with io.open('output_file2.txt', 'w', encoding='utf8') as outfile:
# str_ = json.dumps(json.loads(line),
# indent=4, sort_keys=True,
# separators=(',', ': '), ensure_ascii=False)
# outfile.write(to_unicode(str_))
#Break the loop if there are is a -n argument
if rec_count is not None:
count = count + 1
if count >= rec_count:
break
#How long to wait between writes
if sleep_sec is not None :
sleep(sleep_sec)
else:
print ("There was a problem with the Request")
print ("Returned Status Code: " + str(r.status_code))
Thanks
OK I worked it out. I found a lot of people also getting this error but no solutions posted, so this is how I did it
parse and decode the JSON like this
json_parsed = json.loads(line.decode("utf-8"))
Full code:
import json
import requests
from time import sleep
import argparse
#import ConfigParser
import firehose_requests
from requests.auth import HTTPBasicAuth
# Make it work for Python 2+3 and with Unicode
import io
try:
to_unicode = unicode
except NameError:
to_unicode = str
#request a token from Adobe
request_access_token = requests.post('https://api.omniture.com/token', data={'grant_type':'client_credentials'}, auth=HTTPBasicAuth('##########-livestream-poc','488################1')).json()
#print(request_access_token)
#grab the token from the JSON returned
access_token = request_access_token["access_token"]
print(access_token)
url = 'https://livestream.adobe.net/api/1/stream/##################'
sleep_sec=0
rec_count=10
bearer = "Bearer " + access_token
headers = {"Authorization": bearer,"accept-encoding":"gzip,deflate"}
r = firehose_requests.get(url, stream=True, headers=headers, )
#open empty file
with open('output_file.txt', 'w') as outfile:
print('', file=outfile)
#Read the Stream
if r.status_code == requests.codes.ok:
count = 0
for line in r.iter_lines():
if line:
#parse and decode the JSON
json_parsed = json.loads(line.decode("utf-8"))
#write to screen
#print (str(json_parsed))
#append data to file
with open('output_file.txt', 'a') as outfile:
#write to file
print(json_parsed,file=outfile)
#Break the loop if there are is a -n argument
if rec_count is not None:
count = count + 1
if count >= rec_count:
break
#How long to wait between writes
if sleep_sec is not None :
sleep(sleep_sec)
else:
print ("There was a problem with the Request")
print ("Returned Status Code: " + str(r.status_code))

How to display proper output when using re.findall() in python?

I made a python script to get the latest stock price from the Yahoo Finance.
import urllib.request
import re
htmlfile = urllib.request.urlopen("http://finance.yahoo.com/q?s=GOOG");
htmltext = htmlfile.read();
price = re.findall(b'<span id="yfs_l84_goog">(.+?)</span>',htmltext);
print(price);
It works smoothly but when I output the price it comes out like this[b'1,217.04']
This might be a petty issue to ask but I'm new to python scripting so please help me if you can.
I want to get rid of 'b'. If I remove 'b' from b'<span id="yfs_l84_goog">"then it shows this error.
File "C:\Python33\lib\re.py", line 201, in findall
return _compile(pattern, flags).findall(string)
TypeError: can't use a string pattern on a bytes-like object
I want the output to be just
1,217.04
b'' is a syntax for bytes literals in Python. It is how you could define byte sequences in Python source code.
What you see in the output is the representation of the single bytes object inside the price list returned by re.findall(). You could decode it into a string and print it:
>>> for item in price:
... print(item.decode()) # assume utf-8
...
1,217.04
You could also write bytes directly to stdout e.g., sys.stdout.buffer.write(price[0]).
You could use an html parser instead of a regex to parse html:
#!/usr/bin/env python3
import cgi
from html.parser import HTMLParser
from urllib.request import urlopen
url = 'http://finance.yahoo.com/q?s=GOOG'
def is_price_tag(tag, attrs):
return tag == 'span' and dict(attrs).get('id') == 'yfs_l84_goog'
class Parser(HTMLParser):
"""Extract tag's text content from html."""
def __init__(self, html, starttag_callback):
HTMLParser.__init__(self)
self.contents = []
self.intag = None
self.starttag_callback = starttag_callback
self.feed(html)
def handle_starttag(self, tag, attrs):
self.intag = self.starttag_callback(tag, attrs)
def handle_endtag(self, tag):
self.intag = False
def handle_data(self, data):
if self.intag:
self.contents.append(data)
# download and convert to Unicode
response = urlopen(url)
_, params = cgi.parse_header(response.headers.get('Content-Type', ''))
html = response.read().decode(params['charset'])
# parse html (extract text from the price tag)
content = Parser(html, is_price_tag).contents[0]
print(content)
Check whether yahoo provides API that doesn't require web scraping.
Okay after searching for a while. I found the solution.Works fine for me.
import urllib.request
import re
htmlfile = urllib.request.urlopen("http://finance.yahoo.com/q?s=GOOG");
htmltext = htmlfile.read();
pattern = re.compile('<span id="yfs_l84_goog">(.+?)</span>');
price = pattern.findall(str(htmltext));
print(price);

Resources