Python3 encode unicode

When I do a GET request with the requests library in Python 3, I get this response:
{"status":true,"data":[{"koatuu":7121586600,"zona":8,"kvartal":2,"parcel":501,"cadnum":"7121586600:08:002:0501","ownershipcode":100,"purpose":"\u0414\u043b\u044f \u0432\u0435\u0434\u0435\u043d\u043d\u044f \u0442\u043e\u0432\u0430\u0440\u043d\u043e\u0433\u043e \u0441\u0456\u043b\u044c\u0441\u044c\u043a\u043e\u0433\u043e\u0441\u043f\u043e\u0434\u0430\u0440\u0441\u044c\u043a\u043e\u0433\u043e \u0432\u0438\u0440\u043e\u0431\u043d\u0438\u0446\u0442\u0432\u0430","use":"\u0414\u043b\u044f \u0432\u0435\u0434\u0435\u043d\u043d\u044f \u0442\u043e\u0432\u0430\u0440\u043d\u043e\u0433\u043e \u0441\u0456\u043b\u044c\u0441\u044c\u043a\u043e\u0433\u043e\u0441\u043f\u043e\u0434\u0430\u0440\u0441\u044c\u043a\u043e\u0433\u043e \u0432\u0438\u0440\u043e\u0431\u043d\u0438\u0446\u0442\u0432\u0430","area":"1.3397","unit_area":"\u0433\u0430 ","ownershipvalue":null,"id_office":630}]}
How can I get readable characters in the response instead of the \uXXXX escapes?
My code is:
import requests
url = 'http://map.land.gov.ua/kadastrova-karta/get-parcel-Info?koatuu=7121586600&zone=08&quartal=002&parcel=0004'
page = requests.get(url)
print(page.text)

print(page.json())
solved my problem :D
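For context, those \uXXXX sequences are standard JSON string escapes, and json.loads (which response.json() calls under the hood) turns them back into real Unicode characters. A minimal sketch with a made-up payload in the same shape:

```python
import json

# hypothetical payload using the same \uXXXX escapes as the API response
raw = '{"unit_area": "\\u0433\\u0430"}'

data = json.loads(raw)
print(data["unit_area"])  # prints the Ukrainian "га" (hectare)
```

So the escapes were never a problem with the encoding of the response, only with printing the raw JSON text instead of the parsed data.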

Related

How to get the final destination URL after redirections in python requests?

Response from the actual destination URL is needed.
I have tried the solution mentioned in this SO question.
import requests
doi_link = 'https://doi.org/10.1016/j.artint.2018.07.007'
response = requests.get(url= doi_link ,allow_redirects=True )
print(response.status_code,response.url, response.history)
#Outputs: 200 https://linkinghub.elsevier.com/retrieve/pii/S0004370218305988 [<Response [302]>]
Why does allow_redirects stop in the middle of the redirect chain?
The final URL I get when following the link manually in a browser is https://www.sciencedirect.com/science/article/pii/S0004370218305988?via%3Dihub
I want to obtain this URL programmatically.
EDIT
As suggested in comments the final call to the destination is made using JS.
As suggested here: Python Requests library redirect new url
You can use the response history to get the final URL. In this case, the final URL will return a 200, however, it will have the "final final" redirect in the HTML. You can parse the final HTML to get the redirectURL.
I would use something like beautifulsoup4 to make parsing very easy - pip install beautifulsoup4
import requests
from bs4 import BeautifulSoup
from urllib.parse import unquote
from html import unescape

doi_link = 'https://doi.org/10.1016/j.artint.2018.07.007'
response = requests.get(url=doi_link, allow_redirects=True)

# show the redirect chain requests actually followed
for resp in response.history:
    print(resp.status_code, resp.url)

# the final response is a 200, but its HTML carries one more
# JS-driven redirect; parse it to get the real article URL
soup = BeautifulSoup(response.text, 'html.parser')
redirect_url = soup.find(name="input", attrs={"name": "redirectURL"})["value"]

# the URL embedded in the HTML is percent-encoded and HTML-escaped
final_url = unescape(unquote(redirect_url))
print(final_url)
article_resp = requests.get(final_url)
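The unquote/unescape step matters because the value sitting in the hidden input is percent-encoded (and possibly HTML-escaped) rather than a plain URL. A small sketch on a made-up redirectURL value:

```python
from html import unescape
from urllib.parse import unquote

# hypothetical redirectURL value as it might appear in the final HTML
raw = 'https%3A%2F%2Fwww.sciencedirect.com%2Fscience%2Farticle%2Fpii%2FS0004370218305988%3Fvia%3Dihub'

print(unescape(unquote(raw)))
# https://www.sciencedirect.com/science/article/pii/S0004370218305988?via=ihub
```

unquote reverses the percent-encoding and unescape handles any HTML entities (e.g. &amp;) that might be present; applying unescape to an already-plain URL is a no-op, so chaining both is safe.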

Unable to read wiki page by BeautifulSoup

I tried to read a wiki page using urllib and BeautifulSoup as follows, based on this answer.
import urllib.parse as parse, urllib.request as request
from bs4 import BeautifulSoup

name = "メインページ"
root = 'https://ja.wikipedia.org/wiki/'
url = root + parse.quote_plus(name)
response = request.urlopen(url)
html = response.read()
print(html)
soup = BeautifulSoup(html.decode('UTF-8'), features="lxml")
print(soup)
The code runs without errors, but the Japanese characters do not come out readable.
Your approach seems correct and works for me.
Try printing the parsed data with the following code and check the output.
soup = BeautifulSoup(html.decode('UTF-8'), features="lxml")
some_japanese = soup.find('div', {'id': 'mw-content-text'}).text.strip()
print(some_japanese)
In my case, I get the following (this is part of the output):
ウィリアム・バトラー・イェイツ(1865年6月13日 - 1939年1月28日)は、アイルランドの詩人・劇作家。幼少のころから親しんだアイルランドの妖精譚などを題材とする抒情詩で注目されたのち、民族演劇運動を通じてアイルランド文芸復興の担い手となった。……
If this is not working for you, try saving the HTML content to a file and checking the page in a browser to see whether the Japanese text is fetched properly. (Again, it's working fine for me.)
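To save the page to a file as suggested above, write the decoded text with an explicit UTF-8 encoding so the Japanese characters survive the round trip. A sketch (the file name is arbitrary, and `html` stands in for the bytes returned by response.read()):

```python
# stand-in for the bytes returned by response.read()
html = "<p>メインページ</p>".encode("utf-8")

# write the decoded text with an explicit encoding
with open("page.html", "w", encoding="utf-8") as f:
    f.write(html.decode("utf-8"))

# reading it back yields the same Japanese text
with open("page.html", encoding="utf-8") as f:
    print(f.read())
```

Omitting encoding="utf-8" makes open() fall back on the platform default (e.g. cp1252 on Windows), which is a common source of exactly this kind of UnicodeEncodeError.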

google search using python3 script

This is the code that I'm using:
import requests, sys, webbrowser, bs4

res = requests.get('https://google.com/search?q=' + ''.join(sys.argv[1:]))
res.raise_for_status()
soup = bs4.BeautifulSoup(res.text, "html.parser")
linkElements = soup.select('.r a')
linkToOpen = min(3, len(linkElements))
for i in range(linkToOpen):
    webbrowser.open('https://google.com' + linkElements[i].get('href'))
When I run this code (python search.py 'something'), I get the following error:
Use res.status_code to determine the status.
If it returns 200 and you are still getting the error, you might have a bad connection.
Otherwise, try entering the URL as 'https://www.google.com/', the standard form we usually see in the browser.
Let me know whether this helped.
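As a side note, joining sys.argv fragments straight into the URL leaves spaces and special characters unescaped; urllib.parse.urlencode builds the query string safely. A sketch (the search term is made up):

```python
from urllib.parse import urlencode

# urlencode escapes spaces and special characters for us
query = urlencode({'q': 'python requests tutorial'})
url = 'https://www.google.com/search?' + query
print(url)
# https://www.google.com/search?q=python+requests+tutorial
```

The same dict can also be passed directly to requests.get via its params argument, which performs the encoding internally.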

bs4 can't recognize encoding python3

I am trying to scrape a few pages using Python 3 for the first time. I have used Python 2 with bs4 many times without any trouble, but I can't seem to switch to Python 3, as I keep getting encoding errors.
For example, I am trying to scrape https://www.pgatour.com/webcom/tournaments/the-bahamas-great-exuma-classic/leaderboard.html
I have searched through a few threads here that have similar questions, without success.
Here is my code:
import requests
import bs4 as bs

r = requests.get('https://www.pgatour.com/webcom/tournaments/the-bahamas-great-exuma-classic/leaderboard.html')
r.encoding = r.apparent_encoding
soup = bs.BeautifulSoup(r.text, 'html5lib')
print(soup)
I get the following error:
UnicodeEncodeError: 'ascii' codec can't encode character '\xd7' in position 28935: ordinal not in range(128)
I also tried changing r.encoding = r.apparent_encoding to r.encoding = "utf-8", but I get the same error.
You can change the encoding as follows; this should fix your error.
import requests
from bs4 import BeautifulSoup as BS

r = requests.get("https://s3.amazonaws.com/codecademy-content/courses/beautifulsoup/cacao/index.html")
print(r.encoding)
soup = BS(r.content, 'html.parser').encode('utf-8')
print(soup)
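For what it's worth, the UnicodeEncodeError above comes from print(), not from parsing: the console's encoder (ASCII in this traceback) cannot represent '\xd7' (the × sign). A minimal sketch that reproduces the error without any network access:

```python
text = 'resolution: 1920\xd71080'  # contains the multiplication sign ×

try:
    text.encode('ascii')  # what an ASCII terminal does implicitly on print()
except UnicodeEncodeError as e:
    print('fails as expected:', e)

# encoding explicitly to UTF-8 sidesteps the console's codec
print(text.encode('utf-8'))
```

On Python 3.7+ another option is sys.stdout.reconfigure(encoding='utf-8'), which lets print() emit the characters directly instead of raising.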

BeautifulSoup parser and Cyrillic characters

I'm trying to parse this URL http://mapia.ua/ru/search?&city=%D0%9D%D0%B8%D0%BA%D0%BE%D0%BB%D0%B0%D0%B5%D0%B2&page=1&what=%D0%BE%D0%BE%D0%BE using BeautifulSoup.
But I get strange characters like this: ��� �1 ��� "����"
Here is my code:
from bs4 import BeautifulSoup
import urllib.request
URL = urllib.request.urlopen('http://mapia.ua/ru/search?city=%D0%9D%D0%B8%D0%BA%D0%BE%D0%BB%D0%B0%D0%B5%D0%B2&what=%D0%BE%D0%BE%D0%BE&page=1').read()
soup = BeautifulSoup(URL, 'html.parser')
print(soup.h3.get_text())
Can anybody help me?
P.S. I'm using python 3
I found this:
import urllib.request
from bs4 import BeautifulSoup

with urllib.request.urlopen('http://python.org/') as response:
    html = response.read()

# decode to text, dropping any bytes that are not valid UTF-8
soup = BeautifulSoup(html.decode('utf-8', 'ignore'), 'html.parser')
From:
How to correctly parse UTF-8 encoded HTML to Unicode strings with BeautifulSoup?
Also:
Delete every non-UTF-8 symbol from a string
Hope it helps ;)
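Incidentally, the ��� characters are U+FFFD replacement characters: they appear when bytes in one encoding (e.g. the legacy cp1251 codec often used for Cyrillic) are decoded as UTF-8 with errors replaced or ignored. A small sketch with a made-up string:

```python
# Cyrillic "ооо" encoded with the legacy cp1251 codec
data = 'ооо'.encode('cp1251')

# decoding those bytes as UTF-8 substitutes U+FFFD (�) for invalid sequences
print(data.decode('utf-8', 'replace'))

# decoding with the right codec recovers the original text
print(data.decode('cp1251'))
```

So when you see �, the fix is usually to find and use the page's real encoding (from the Content-Type header or the meta charset tag) rather than to strip the broken characters afterwards.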
