This question already has answers here:
Scraper in Python gives "Access Denied"
(3 answers)
Closed 2 years ago.
This is the first time I've encountered a site where it wouldn't 'allow me access' to the webpage. I'm not sure why, and I can't figure out how to scrape this website.
My attempt:
import requests
from bs4 import BeautifulSoup

def html(url):
    return BeautifulSoup(requests.get(url).content, "lxml")

url = "https://www.g2a.com/"
soup = html(url)
print(soup.prettify())
Output:
<html>
<head>
<title>
Access Denied
</title>
</head>
<body>
<h1>
Access Denied
</h1>
You don't have permission to access "http://www.g2a.com/" on this server.
<p>
Reference #18.4d24db17.1592006766.55d2bc1
</p>
</body>
</html>
I've looked into it for a while now and found that there is supposed to be some type of token (access, refresh, etc.). I also saw action="/search" in the page source, but I wasn't sure what to do with just that.
You need to send certain HTTP headers to obtain the information from this page (Accept-Language):
import requests
from bs4 import BeautifulSoup

headers = {'Accept-Language': 'en-US,en;q=0.5'}

def html(url):
    return BeautifulSoup(requests.get(url, headers=headers).content, "lxml")

url = "https://www.g2a.com/"
soup = html(url)
print(soup.prettify())
Prints:
<!DOCTYPE html>
<html lang="en-us">
<head>
<link href="polyfill.g2a.com" rel="dns-prefetch"/>
<link href="images.g2a.com" rel="dns-prefetch"/>
<link href="id.g2a.com" rel="dns-prefetch"/>
<link href="plus.g2a.com" rel="dns-prefetch"/>
... and so on.
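If Accept-Language alone doesn't unblock a given site, a common variation is to also send a browser-like User-Agent. A minimal sketch of the same approach; the exact header values here are illustrative, not something this site is known to require:

import requests
from bs4 import BeautifulSoup

# Illustrative browser-like headers; the User-Agent string is just an example
headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.149 Safari/537.36',
    'Accept-Language': 'en-US,en;q=0.5',
}
soup = BeautifulSoup(requests.get("https://www.g2a.com/", headers=headers).content, "lxml")
print(soup.title)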
Related
I am using the Python code below to download a csv file from Databricks Filestore. Usually, files can be downloaded via the browser when kept in Filestore.
When I directly enter the url to the file in my browser, the file downloads ok. BUT when I try to do the same via the code below, the content of the downloaded file is not the csv but some html code - see far below.
Here is my Python code:
import requests

def download_from_dbfs_filestore(file):
    url = "https://databricks-hot-url/files/{0}".format(file)
    req = requests.get(url)
    req_content = req.content
    my_file = open(file, 'wb')
    my_file.write(req_content)
    my_file.close()
Here is the HTML. It appears to be referencing a login page, but I am not sure what to do from here:
<!doctype html><html><head><meta charset="utf-8"/>
<meta http-equiv="Content-Language" content="en"/>
<title>Databricks - Sign In</title><meta name="viewport" content="width=960"/>
<link rel="icon" type="image/png" href="/favicon.ico"/>
<meta http-equiv="content-type" content="text/html; charset=UTF8"/><link rel="icon" href="favicon.ico">
</head><body class="light-mode"><uses-legacy-bootstrap><div id="login-page">
</div></uses-legacy-bootstrap><script src="login/login.xxxxx.js"></script>
</body>
</html>
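(A quick way to confirm what happened here, before any fix, is to inspect the response instead of writing it straight to disk. A diagnostic sketch around the same GET, where url is the Filestore URL from the function above:)

import requests

req = requests.get(url)
print(req.status_code)                  # can be 200 even when you got a login page
print(req.headers.get('Content-Type'))  # text/html here, rather than a CSV type
print(req.url)                          # redirects often end at a sign-in URL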
Solved the problem by using the base64 module's b64decode:
import base64
import requests

DOMAIN = <your databricks 'host' url>
TOKEN = <your databricks 'token'>
jsonbody = {"path": <your dbfs Filestore path>}

response = requests.get('https://%s/api/2.0/dbfs/read/' % (DOMAIN),
                        headers={'Authorization': 'Bearer %s' % TOKEN},
                        json=jsonbody)
if response.status_code == 200:
    csv = base64.b64decode(response.json()["data"]).decode('utf-8')
    print(csv)
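To mirror the original download function, the decoded text can then be written to disk. A minimal follow-up sketch (the filename is an arbitrary example); note that the DBFS read endpoint may cap how much a single call returns, so very large files could need paging via its offset/length parameters (worth checking the API docs):

# Save the decoded CSV locally; 'my_data.csv' is just an example name
with open('my_data.csv', 'w', encoding='utf-8') as f:
    f.write(csv)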
I am trying to extract the OpenGraph "og" properties from a website. What I want is a list of all the property values in the document that start with "og".
What I've tried is:
soup.find_all("meta", property="og:")
and
soup.find_all("meta", property="og")
But it does not find anything unless I specify the complete tag.
A few examples are:
<meta content="https://www.youtube.com/embed/Rv9hn4IGofM" property="og:video:url"/>,
<meta content="https://www.youtube.com/embed/Rv9hn4IGofM" property="og:video:secure_url"/>,
<meta content="text/html" property="og:video:type"/>,
<meta content="1280" property="og:video:width"/>,
<meta content="720" property="og:video:height"/>
Expected output would be:
l = ["og:video:url", "og:video:secure_url", "og:video:type", "og:video:width", "og:video:height"]
How can I do this?
Thank you
Use the CSS selector meta[property]:
metas = soup.select('meta[property]')
propValue = [v['property'] for v in metas]
print(propValue)
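That selector matches every meta tag that has a property attribute, whatever its value. To restrict it to OpenGraph tags specifically, the CSS prefix match ^= also works with BeautifulSoup's select (a small variation on the same idea; soup is the parsed document as above):

# Only meta tags whose property value starts with "og:"
og_metas = soup.select('meta[property^="og:"]')
og_props = [m['property'] for m in og_metas]
print(og_props)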
Is this what you want?
from bs4 import BeautifulSoup
sample = """
<html>
<body>
<meta content="https://www.youtube.com/embed/Rv9hn4IGofM" property="og:video:url"/>,
<meta content="https://www.youtube.com/embed/Rv9hn4IGofM" property="og:video:secure_url"/>,
<meta content="text/html" property="og:video:type"/>,
<meta content="1280" property="og:video:width"/>,
<meta content="720" property="og:video:height"/>
</body>
</html>
"""
print([m["property"] for m in BeautifulSoup(sample, "html.parser").find_all("meta")])
Output:
['og:video:url', 'og:video:secure_url', 'og:video:type', 'og:video:width', 'og:video:height']
You can check if og exists in the property attribute as follows. Note the t and guard in the lambda: find_all calls it with None for meta tags that have no property attribute at all, so checking t first avoids a TypeError:
...
soup = BeautifulSoup(html, "html.parser")
og_elements = [
    tag["property"]
    for tag in soup.find_all("meta", property=lambda t: t and "og" in t)
]
print(og_elements)
In my Python code (.py file), I set a session variable to Chinese characters:
session.mystring = '你好'
At first, Flask choked. So, at the very top of the file, I put:
# -*- coding: utf-8 -*-
That solved the problem. I could then display the string in my Flask app by just doing:
{{ session.mystring }}
Here's the weird problem. If I write the Chinese characters directly into the template (.html file), like so:
<!-- this works -->
<p>{{ session.mystring }}</p>
<!-- this doesn't -->
<p>你好</p>
The browser (Edge) displays the following error:
'utf-8' codec can't decode byte 0xb5 in position 516: invalid start byte
I tried putting the following at the top of the template:
<head>
<meta charset="utf-8">
</head>
But that doesn't help. Is there any way I can insert the Chinese characters directly as a literal in the Jinja2 template?
EDIT (June 7, 2020):
Made my question clearer by actually putting the Chinese characters into the question (thought I couldn't previously).
This should work by simply adding the line below in the HTML file:
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
Working example with a minimal Flask app:
app.py:
from flask import Flask, render_template

app = Flask(__name__)

@app.route('/')
def test():
    data = '中华人民共和国'
    return render_template('index.html', data=data)

if __name__ == '__main__':
    app.run()
In templates/index.html:
<!DOCTYPE html>
<html lang="en">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<title>Document</title>
</head>
<body>
{{ data }}
</body>
</html>
Output: the rendered page displays 中华人民共和国.
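If the "'utf-8' codec can't decode byte" error persists even with the meta tag, it is worth checking that the .html file itself is saved as UTF-8: Jinja2 reads template files as UTF-8 by default, so a template saved in a legacy encoding will raise exactly this kind of error no matter what the markup says. A quick diagnostic sketch:

# Verify the template file is valid UTF-8; raises UnicodeDecodeError if not
with open('templates/index.html', 'rb') as f:
    f.read().decode('utf-8')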
I am new to web scraping and wanted to scrape all the character portraits from the LoL site. When I examined one of the pictures in the browser, it was in an img src="url" tag, and I want to get that URL to download the picture. But when I do soup.select('img[src]') or soup.select('img'), it returns an empty list and I don't know why.
Here is the code:
import requests
import bs4

data = requests.get(website)
data.raise_for_status()
soup = bs4.BeautifulSoup(data.text, "lxml")
print(soup)   # soup returns HTML

elems = soup.select('img[src]')
print(elems)  # elems returns an empty list
It might be possible to do this with requests, but it seems that your GET request does not receive the full page source; the images are added by JavaScript. You can get around this by using Selenium to render the page and then grab the content.
from selenium import webdriver
import bs4

driver = webdriver.Chrome()
driver.get('https://na.leagueoflegends.com/en/game-info/champions/')
page_source = driver.page_source
driver.close()

soup = bs4.BeautifulSoup(page_source, "lxml")
print(soup)

elems = soup.find_all('img')
for elem in elems:
    print(elem.attrs['src'])
Output:
https://ddragon.leagueoflegends.com/cdn/9.11.1/img/champion/Aatrox.png
https://ddragon.leagueoflegends.com/cdn/9.11.1/img/champion/Ahri.png
https://ddragon.leagueoflegends.com/cdn/9.11.1/img/champion/Akali.png
https://ddragon.leagueoflegends.com/cdn/9.11.1/img/champion/Alistar.png
https://ddragon.leagueoflegends.com/cdn/9.11.1/img/champion/Amumu.png
https://ddragon.leagueoflegends.com/cdn/9.11.1/img/champion/Anivia.png
...
Use the same endpoint the page does; you can find it in the browser's network tab.
import requests
base = 'https://ddragon.leagueoflegends.com/cdn/9.11.1/img/champion/'
r = requests.get('https://ddragon.leagueoflegends.com/cdn/9.11.1/data/en_US/champion.json').json()
images = [base + r['data'][item]['image']['full'] for item in r['data']]
print(images)
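From there, downloading the portraits themselves is a small extra step. A hedged sketch under the assumption that the ddragon CDN URLs above are what you want on disk (the portraits directory name is arbitrary):

import os
import requests

base = 'https://ddragon.leagueoflegends.com/cdn/9.11.1/img/champion/'
r = requests.get('https://ddragon.leagueoflegends.com/cdn/9.11.1/data/en_US/champion.json').json()
images = [base + r['data'][item]['image']['full'] for item in r['data']]

os.makedirs('portraits', exist_ok=True)
for url in images:
    # Use the last path segment (e.g. Aatrox.png) as the local filename
    filename = os.path.join('portraits', url.rsplit('/', 1)[-1])
    with open(filename, 'wb') as f:
        f.write(requests.get(url).content)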
Here is your answer:
import requests
from bs4 import BeautifulSoup

soup = BeautifulSoup(requests.get("https://na.leagueoflegends.com/en/game-info/champions/").text, 'lxml')
soup.find_all('a')  # these are your tags, e.g. a, script, link
Output (the anchor tags, shown by their link text):
[Get Started,
What is League of Legends?,
New Player Guide,
Chat Commands,
Community Interaction,
The Summoner's Code,
Champions,
Items,
Summoners,
Summoner Spells,
Game Modes,
Summoner's Rift,
The Twisted Treeline,
Howling Abyss,
Home,
Game Info]
soup.find_all('script')
Output:
[<script>(function(w,d,s,l,i){w[l]=w[l]||[];w[l].push({'gtm.start':new Date().getTime(),event:'gtm.js'});var f=d.getElementsByTagName(s)[0],j=d.createElement(s),dl=l!='dataLayer'?'&l='+l:'';j.async=true;j.src='//www.googletagmanager.com/gtm.js?id='+i+dl;f.parentNode.insertBefore(j,f);})(window,document,'script','dataLayer','GTM-N98J');</script>,
<script>window.ga = window.ga || function(){(ga.q=ga.q||[]).push(arguments)};ga.l = +new Date;</script>,
<script src="https://lolstatic-a.akamaihd.net/lolkit/1.1.6/modernizr.js" type="text/javascript"></script>,
<script src="//ajax.googleapis.com/ajax/libs/jquery/1.9.1/jquery.min.js"></script>,
<script src="https://lolstatic-a.akamaihd.net/lolkit/1.1.6/riot-all.js" type="text/javascript"></script>,
<script src="https://lolstatic-a.akamaihd.net/lolkit/1.1.6/riot-kit-all.js" type="text/javascript"></script>,
<script type="text/javascript">rg_force_language = 'en_US';rg_force_manifest = 'https://ddragon.leagueoflegends.com/realms/na.js';rg_assets = 'https://lolstatic-a.akamaihd.net/game-info/1.1.9';</script>,
<script type="text/javascript">window.riotBarConfig = {touchpoints: {activeTouchpoint: 'game'},locale: {landingUrlPattern : 'https://na.leagueoflegends.com//game-info/'},footer: {enabled: true,container: {renderFooterInto: '#footer'}}};</script>,
<script async="" src="https://lolstatic-a.akamaihd.net/riotbar/prod/latest/en_US.js"></script>,
<script src="https://ddragon.leagueoflegends.com/cdn/dragonhead.js" type="text/javascript"></script>,
<script src="https://lolstatic-a.akamaihd.net/frontpage/apps/prod/LolGameInfo-Harbinger/en_US/0d258ed5be6806b967afaf2b4b9817406912a7ac/assets/assets/js/riot-dd-utils.js" type="text/javascript"></script>,
<script src="https://lolstatic-a.akamaihd.net/frontpage/apps/prod/LolGameInfo-Harbinger/en_US/0d258ed5be6806b967afaf2b4b9817406912a7ac/assets/assets/js/riot-dd-i18n.js" type="text/javascript"></script>,
<script src="https://lolstatic-a.akamaihd.net/frontpage/apps/prod/LolGameInfo-Harbinger/en_US/0d258ed5be6806b967afaf2b4b9817406912a7ac/assets/assets/js/external/jquery.lazy-load.js" type="text/javascript"></script>,
<script src="https://lolstatic-a.akamaihd.net/frontpage/apps/prod/LolGameInfo-Harbinger/en_US/0d258ed5be6806b967afaf2b4b9817406912a7ac/assets/assets/js/DDFilterApp.js" type="text/javascript"></script>,
<script src="https://lolstatic-a.akamaihd.net/frontpage/apps/prod/LolGameInfo-Harbinger/en_US/0d258ed5be6806b967afaf2b4b9817406912a7ac/assets/assets/js/DDMarkupItem.js" type="text/javascript"></script>,
<script src="https://lolstatic-a.akamaihd.net/frontpage/apps/prod/LolGameInfo-Harbinger/en_US/0d258ed5be6806b967afaf2b4b9817406912a7ac/assets/assets/js/DDMarkupContainer.js" type="text/javascript"></script>,
<script src="https://lolstatic-a.akamaihd.net/frontpage/apps/prod/LolGameInfo-Harbinger/en_US/0d258ed5be6806b967afaf2b4b9817406912a7ac/assets/assets/js/champions/ChampionsListGridItem.js" type="text/javascript"></script>,
<script src="https://lolstatic-a.akamaihd.net/frontpage/apps/prod/LolGameInfo-Harbinger/en_US/0d258ed5be6806b967afaf2b4b9817406912a7ac/assets/assets/js/champions/ChampionsListGridView.js" type="text/javascript"></script>,
<script src="https://lolstatic-a.akamaihd.net/frontpage/apps/prod/LolGameInfo-Harbinger/en_US/0d258ed5be6806b967afaf2b4b9817406912a7ac/assets/assets/js/champions/ChampionsListApp.js" type="text/javascript"></script>]
soup.find_all('link')
Output:
[<link href="https://lolstatic-a.akamaihd.net/lolkit/1.1.6/lol-kit.css" rel="stylesheet"/>,
<link href="https://lolstatic-a.akamaihd.net/frontpage/apps/prod/LolGameInfo-Harbinger/en_US/0d258ed5be6806b967afaf2b4b9817406912a7ac/assets/assets/css/base-styles.css" rel="stylesheet"/>,
<link href="https://lolstatic-a.akamaihd.net/lolkit/1.1.6/resources/images/favicon.ico" rel="SHORTCUT ICON"/>]
I'm trying to scrape this site called whoscored.com and here's the simple code I'm using to scrape a particular page of it.
import requests
import pandas as pd
from bs4 import BeautifulSoup
headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36'}
page = "https://www.whoscored.com/Teams/13/RefereeStatistics/England-Arsenal"
pageTree = requests.get(page, headers=headers)
pageSoup = BeautifulSoup(pageTree.content, 'lxml')
print(pageSoup)
The code runs just fine but here's what it's returning -
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
<title>404 - File or directory not found.</title>
<style type="text/css">
<!--
body{margin:0;font-size:.7em;font-family:Verdana, Arial, Helvetica, sans-serif;background:#EEEEEE;}
fieldset{padding:0 15px 10px 15px;}
h1{font-size:2.4em;margin:0;color:#FFF;}
h2{font-size:1.7em;margin:0;color:#CC0000;}
h3{font-size:1.2em;margin:10px 0 0 0;color:#000000;}
#header{width:96%;margin:0 0 0 0;padding:6px 2% 6px 2%;font-family:"trebuchet MS", Verdana, sans-serif;color:#FFF;background-color:#555555;}
#content{margin:0 0 0 2%;position:relative;}
.content-container{background:#FFF;width:96%;margin-top:8px;padding:10px;position:relative;}
-->
</style>
</head>
<body>
<div id="header"><h1>Server Error</h1></div>
<div id="content">
<div class="content-container"><fieldset>
<h2>404 - File or directory not found.</h2>
<h3>The resource you are looking for might have been removed, had its name changed, or is temporarily unavailable.</h3>
</fieldset></div>
</div>
<script type="text/javascript">
//<![CDATA[
(function() {
var _analytics_scr = document.createElement('script');
_analytics_scr.type = 'text/javascript'; _analytics_scr.async = true;
_analytics_scr.src = '/_Incapsula_Resource?SWJIYLWA=719d34d31c8e3a6e6fffd425f7e032f3&ns=1&cb=1578388490';
var _analytics_elem = document.getElementsByTagName('script')[0];
_analytics_elem.parentNode.insertBefore(_analytics_scr, _analytics_elem);
})();
// ]]>
</script></body>
</html>
As you can see, it returns "404 - File or directory not found" and "The resource you are looking for might have been removed, had its name changed, or is temporarily unavailable." There's another bunch of script at the end which I'm not at all familiar with.
I have a few ideas why this might be happening. Maybe it's the JavaScript (I see some at the end), or it's due to some sort of counter-measure by the website. However, I'd like to know exactly what the problem is and what I can do to solve it, to make sure I'm getting the data I'm trying to scrape from the page - which, by the way, is the entire table.
The little I got from reading similar questions on here is that I need to use Selenium but I'm not sure how. Any help would be appreciated.
I'm on IDLE. My Python version is 3.7 (64-bit), and my computer is 64-bit.
In your code you had England/Arsenal in the URL, but it has to be England-Arsenal - note the / versus the -.
But the page uses JavaScript, so you can't get the data with BeautifulSoup alone. You will have to use Selenium to control a web browser, which will load the page and run its JavaScript. After the page renders, you can get the HTML from the browser (via Selenium) and use BeautifulSoup to search it for your data.
Get tables with Selenium and BeautifulSoup
import selenium.webdriver
from bs4 import BeautifulSoup

url = "https://www.whoscored.com/Teams/13/RefereeStatistics/England-Arsenal"

driver = selenium.webdriver.Firefox()
#driver = selenium.webdriver.Chrome()
driver.get(url)
#print(driver.page_source)  # HTML after JavaScript has run

soup = BeautifulSoup(driver.page_source, 'lxml')
all_tables = soup.find_all('table')
print('len(all_tables):', len(all_tables))

for table in all_tables:
    print(table)
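Since the question already imports pandas and the goal is the entire table, a hedged follow-up: once Selenium has rendered the page, pandas.read_html can parse every <table> in the HTML straight into DataFrames (assuming the rendered page really does use <table> markup):

import pandas as pd

# driver.page_source is the rendered HTML from the Selenium session above
dfs = pd.read_html(driver.page_source)
for df in dfs:
    print(df.head())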