Why Is JSON Truncated During Linux HTML Response Parsing? - linux

import requests
from bs4 import BeautifulSoup
url = "https://music.163.com/discover/toplist?id=19723756"
headers = {
    'User-Agent': "PostmanRuntime/7.15.2",
}
response = requests.request("GET", url, headers=headers)
r = response.text
soup = BeautifulSoup(response.text, "lxml")
textarea = soup.find('textarea', attrs={'id': 'song-list-pre-data'}).get_text()
print(textarea)
In the Linux environment, the JSON extracted from the textarea is truncated. The printed output ends with:
xxxxxx ee":0,"album":{"id":158052587,"name":"Sakana~( ˵>ㅿㅿ
I think it is probably because of the special symbols (the emoticon characters) in the data.
How do you deal with this situation?

You need to convert the string to a list of JSON objects; then you can print each song.
I tested on Ubuntu 20.04 and on Windows in the VS Code terminal. Both work.
Code
import requests
import json
from bs4 import BeautifulSoup

url = "https://music.163.com/discover/toplist?id=19723756"
headers = {
    'User-Agent': "PostmanRuntime/7.15.2",
}
response = requests.request("GET", url, headers=headers)
soup = BeautifulSoup(response.text, "lxml")  # pass an explicit parser to avoid the bs4 warning
textarea = soup.find('textarea', attrs={'id': 'song-list-pre-data'}).get_text()
json_list = json.loads(textarea)
for song in json_list:
    print("album:", song['album']['name'], ", artists: ", song['artists'][0]['name'], "duration: ", song['duration'])
Result on Ubuntu 20.04 (screenshot)
Result on the VS Code terminal (screenshot)
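If the output still looks cut off on Linux, a quick sanity check (reusing json_list from the code above) is to look at how many items parsed and to re-serialize one entry; json.loads would already have raised an error if the JSON string itself were truncated:

# Confirm the whole list parsed; a truncated JSON string would not load at all.
print("songs parsed:", len(json_list))
# Re-serialize the first song with non-ASCII characters preserved to inspect the full entry.
print(json.dumps(json_list[0], ensure_ascii=False, indent=2))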

Related

Download Specific file from Website with BeautifulSoup

Following the documentation of BeautifulSoup, I am trying to download a specific file from a webpage. First trying to find the link that contains the file name:
import re
import requests
from bs4 import BeautifulSoup
url = requests.get("https://www.bancentral.gov.do/a/d/2538-mercado-cambiario")
parsed = BeautifulSoup(url.text, "html.parser")
link = parsed.find("a", text=re.compile("TASA_DOLAR_REFERENCIA_MC.xls"))
path = link.get('href')
print(f"{path}")
But with no success. Then trying to print every link on that page, I get no links:
import re
import requests
from bs4 import BeautifulSoup
url = requests.get("https://www.bancentral.gov.do/a/d/2538-mercado-cambiario")
parsed = BeautifulSoup(url.text, "html.parser")
link = parsed.find_all('a')
for links in parsed.find_all("a href"):
    print(links.get('a href'))
It looks like the URL of the file is dynamic: it appends a ?v=123456789 parameter (something like a file version) to the end of the URL, which is why I need to locate the file by its file name.
(E.g. https://cdn.bancentral.gov.do/documents/estadisticas/mercado-cambiario/documents/TASA_DOLAR_REFERENCIA_MC.xls?v=1612902983415)
Thanks.
Actually, you are dealing with a dynamic JavaScript page that is fully loaded via an XHR request to the following URL once the page loads.
Below is a direct call to the back-end API, which identifies the content by the page id (2538); from the returned HTML we can then locate your desired URL.
import requests
from bs4 import BeautifulSoup


def main(url):
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:85.0) Gecko/20100101 Firefox/85.0'
    }
    with requests.Session() as req:
        req.headers.update(headers)
        data = {
            "id": "2538",
            "languageName": "es"
        }
        r = req.post(url, data=data)
        soup = BeautifulSoup(r.json()['result']['article']['content'], 'lxml')
        target = soup.select_one('a[href*=TASA_DOLAR_REFERENCIA_MC]')['href']
        r = req.get(target)
        with open('data.xls', 'wb') as f:
            f.write(r.content)


if __name__ == "__main__":
    main('https://www.bancentral.gov.do/Home/GetContentForRender')

Exception has occurred: UnicodeDecodeError 'utf-8' codec can't decode byte 0xf1 in position

I'm scraping this website, but when I iterate over the links I get the following error:
Exception has occurred: UnicodeDecodeError
'utf-8' codec can't decode byte 0xf1 in position 614: invalid continuation byte
my code:
import requests
from bs4 import BeautifulSoup as soup
links=['https://www.yapo.cl/vi/74410346.htm?ca=15_s', 'https://www.yapo.cl/vi/73845701.htm?ca=15_s']
for link in links:
    uClient = requests.get(link)
    soup = soup(uClient.content, "html.parser")
    containers = soup.findAll("div", {"class": "price price-final"})
    print(containers)
I tried to get the data for one URL and it worked for me.
import requests
from bs4 import BeautifulSoup
headers = {
    "user-agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_5)",
    "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "accept-charset": "cp1254,ISO-8859-9,utf-8;q=0.7,*;q=0.3",
    "accept-encoding": "gzip,deflate,sdch",
    "accept-language": "tr,tr-TR,en-US,en;q=0.8",
}
with requests.Session() as session:
    session.headers = headers
    r = session.get('https://www.yapo.cl/vi/74410346.htm?ca=15_s', headers=headers)
    soup = BeautifulSoup(r.text, "html.parser")
    data = soup.find("div", {"class": "price price-final"})
    response = session.get("https://www.yapo.cl/vi/74410346.htm?ca=15_s".format(data=data))
    soup = BeautifulSoup(response.text, "html.parser")
    print(data.text)
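The UnicodeDecodeError itself usually means the page is not UTF-8 (byte 0xf1 is "ñ" in Latin-1). A minimal sketch of handling that explicitly, assuming the pages are Latin-1/Windows-1252 encoded; it also avoids re-binding the name soup, which breaks the original loop on the second iteration:

import requests
from bs4 import BeautifulSoup

links = ['https://www.yapo.cl/vi/74410346.htm?ca=15_s', 'https://www.yapo.cl/vi/73845701.htm?ca=15_s']
for link in links:
    r = requests.get(link)
    # Let requests guess the encoding from the body instead of assuming UTF-8;
    # alternatively set r.encoding = 'latin-1' if the charset is known.
    r.encoding = r.apparent_encoding
    page = BeautifulSoup(r.text, "html.parser")
    containers = page.find_all("div", {"class": "price price-final"})
    print(containers)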

Unable to understand the 403 Error from HTML parsing using BeautifulSoup4 with Python3.x

I am in the Coursera Course Python For Everyone Course and I attempted one of the questions from the textbook:
import urllib.request, urllib.parse, urllib.error
from bs4 import BeautifulSoup
import ssl
# Ignore SSL certificate errors
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE
url = 'https://www.py4e.com/book.htm'
html = urllib.request.urlopen(url, context=ctx).read()
soup = BeautifulSoup(html, 'html.parser')
# Retrieve all of the anchor tags
tags = soup('a')
for tag in tags:
    print(tag.get('href', None))
I don't understand the error:
urllib.error.HTTPError: HTTP Error 403: Forbidden
According to the full traceback, the error starts at line 18. From reading other SO posts and this similar question, it probably has something to do with the SSL certificate and the website thinking I'm a bot.
Why doesn't the code work?
import requests
from bs4 import BeautifulSoup
url = 'https://www.py4e.com/book.htm'
headers = requests.utils.default_headers()
headers.update({
    'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0',
})
Link = requests.get(url, headers=headers)
soup = BeautifulSoup(Link.content, "lxml")
# Retrieve all of the anchor tags
tags = soup('a')
for tag in tags:
    print(tag.get('href', None))
Output:
http://amzn.to/1KkULF3
book/index.htm
http://amzn.to/1KkULF3
http://amzn.to/1hLcoBy
http://amzn.to/1KkV42z
http://amzn.to/1fNOnbd
http://amzn.to/1N74xLt
http://do1.dr-chuck.net/py4inf/EN-us/book.pdf
http://do1.dr-chuck.net/py4inf/ES-es/book.pdf
https://twitter.com/fertardio
translations/KO/book_009_ko.pdf
http://www.xwmooc.net/python/
http://fanwscu.gitbooks.io/py4inf-zh-cn/
book_270.epub
translations/ES/book_272_es4.epub
https://www.gitbook.com/download/epub/book/fanwscu/py4inf-zh-cn
html-270/
html_270.zip
http://itunes.apple.com/us/book/python-for-informatics/id554638579?mt=13
http://www-personal.umich.edu/~csev/books/py4inf/ibooks//python_for_informatics.ibooks
http://www.py4inf.com/code
http://www.greenteapress.com/thinkpython/thinkCSpy/
http://allendowney.com/
Updated code for urllib:
import urllib.request, urllib.parse, urllib.error
from bs4 import BeautifulSoup
import ssl
# Ignore SSL certificate errors
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE
url = 'https://www.py4e.com/book.htm'
from urllib.request import Request, urlopen
req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
html = urlopen(req).read()
soup = BeautifulSoup(html, 'html.parser')
# Retrieve all of the anchor tags
tags = soup('a')
for tag in tags:
    print(tag.get('href', None))

Changing Scrapy/Splash user agent

How can I set the user agent for Scrapy with Splash in an equivalent way like below:
import requests
from bs4 import BeautifulSoup
ua = {"User-Agent":"Mozilla/5.0"}
url = "http://www.example.com"
page = requests.get(url, headers=ua)
soup = BeautifulSoup(page.text, "lxml")
My spider would look similar to this:
import scrapy
from scrapy_splash import SplashRequest


class ExampleSpider(scrapy.Spider):
    name = "example"
    allowed_domains = ["example.com"]
    start_urls = ["https://www.example.com/"]

    def start_requests(self):
        for url in self.start_urls:
            yield SplashRequest(
                url,
                self.parse,
                args={'wait': 0.5}
            )
You need to set user_agent attribute to override default user agent:
class ExampleSpider(scrapy.Spider):
    name = 'example'
    user_agent = 'Mozilla/5.0'
In this case UserAgentMiddleware (which is enabled by default) will override the USER_AGENT setting value with 'Mozilla/5.0'.
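Equivalently, the user agent can be set project-wide via USER_AGENT in settings.py, or per spider with custom_settings; a minimal sketch (the value here is just an example):

import scrapy

class ExampleSpider(scrapy.Spider):
    name = 'example'
    # Same effect as USER_AGENT = 'Mozilla/5.0' in settings.py,
    # but scoped to this spider only.
    custom_settings = {
        'USER_AGENT': 'Mozilla/5.0',
    }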
You can also override headers per request:
scrapy_splash.SplashRequest(url, headers={'User-Agent': custom_user_agent})
The proper way is to alter the Splash script to include it; no harm in also adding it to the spider, if that works as well.
http://splash.readthedocs.io/en/stable/scripting-ref.html?highlight=agent
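A rough sketch of that approach, assuming the execute endpoint and the documented splash:set_user_agent function (see the scripting reference above):

import scrapy
from scrapy_splash import SplashRequest

# Lua script run by Splash: set the user agent before navigating.
lua_script = """
function main(splash, args)
    splash:set_user_agent('Mozilla/5.0')
    assert(splash:go(args.url))
    assert(splash:wait(0.5))
    return splash:html()
end
"""

class ExampleSpider(scrapy.Spider):
    name = "example"
    start_urls = ["https://www.example.com/"]

    def start_requests(self):
        for url in self.start_urls:
            yield SplashRequest(
                url,
                self.parse,
                endpoint='execute',
                args={'lua_source': lua_script},
            )

    def parse(self, response):
        self.log(response.text[:200])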
If you use pure Splash (not the scrapy-splash package), you can just pass a headers param with a 'User-Agent' key, and all requests made while rendering that page will use this user agent.
https://splash.readthedocs.io/en/stable/api.html?highlight=User-Agent
Here is an example:
import requests
import json

headers = {
    'User-Agent': 'Mozilla/5.0',
}
param = {
    'url': your_aim_url,
    'headers': headers,
    'html': 1,
    'har': 1,
    'response_body': 1,
}
session = requests.Session()
session.headers.update({'Content-Type': 'application/json'})
response = session.post(url='http://127.0.0.1:8050/render.json', json=param)
response_json = json.loads(response.text)
print(response_json.get('html'))  # page html
print(response_json.get('har'))   # har with response body; if you do not want the response body, set 'response_body' to 0
You can check the request header in har to see if the user-agent is correct.
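For example (a rough sketch; the HAR follows the standard log/entries layout), the headers Splash actually sent for the first recorded request can be printed like this:

# Inspect the request headers of the first HAR entry returned by render.json.
first_request = response_json['har']['log']['entries'][0]['request']
for header in first_request['headers']:
    print(header['name'], '=', header['value'])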

Python3 requests.get ignoring part of my URL (BEAUTIFULSOUP + PYTHON WEBSCRAPING)

I'm using requests.get like so:
import urllib3
import requests
from bs4 import BeautifulSoup
urllib3.disable_warnings()
cookies = {
    'PYPF': '3OyMLS2-xJlxKilWEOSvMQXAhyCgIhvAxYfbB8S_5lGBxxAS18Z7I8Q',
    '_ga': 'GA1.2.227320333.1496647453',
    '_gat': '1',
    '_gid': 'GA1.2.75815641.1496647453'
}
params = {
    'platform': 'xbox'
}
page = requests.get("http://www.rl-trades.com/#pf=xbox", headers={'Platform': 'Xbox'}, verify=False, cookies=cookies, params=params).text
page
soup = BeautifulSoup(page, 'html.parser')
... etc.
But from my testing, it seems requests.get is ignoring the '#pf=xbox' part of 'http://www.rl-trades.com/#pf=xbox'.
Is this because I am having to set verify to false? What is going on here?
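For what it's worth, this is general URL behaviour rather than anything to do with verify=False: the part after '#' is a fragment, which is never sent to the server and is applied client-side (here by the site's JavaScript), so requests only ever fetches the bare page. A small standard-library check illustrates the split:

from urllib.parse import urldefrag

url, fragment = urldefrag("http://www.rl-trades.com/#pf=xbox")
print(url)       # http://www.rl-trades.com/  -> what is actually requested
print(fragment)  # pf=xbox                    -> applied client-side by the browser/JS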
