I am web scraping with bs4 and the URLs won't show up - python-3.x

I am new to web scraping and wanted to scrape all the character portraits from the LoL site. When I inspected one of the pictures in the browser, it was in an <img src="url"> tag, and I want to get that URL to download the picture. But when I do soup.select('img[src]') or soup.select('img'), it returns an empty list and I don't know why.
Here is the code:
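import requests
import bs4
# the champions page, as used in the answers below
website = 'https://na.leagueoflegends.com/en/game-info/champions/'
data = requests.get(website)
data.raise_for_status()
soup = bs4.BeautifulSoup(data.text, "lxml")
print(soup)
# soup prints the HTML
elems = soup.select('img[src]')
print(elems)
# elems is an empty list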

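It might be possible to do this with requests alone, but it seems that your GET request does not receive the full page source; presumably the images are added by JavaScript after the page loads.
You can work around this by using Selenium to load the page in a browser and read the rendered source.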
from selenium import webdriver
import bs4
driver = webdriver.Chrome()
driver.get('https://na.leagueoflegends.com/en/game-info/champions/')
page_source = driver.page_source
driver.close()
soup = bs4.BeautifulSoup(page_source, "lxml")
print(soup)
elems = soup.find_all('img')
for elem in elems:
    print(elem.attrs['src'])
Output:
https://ddragon.leagueoflegends.com/cdn/9.11.1/img/champion/Aatrox.png
https://ddragon.leagueoflegends.com/cdn/9.11.1/img/champion/Ahri.png
https://ddragon.leagueoflegends.com/cdn/9.11.1/img/champion/Akali.png
https://ddragon.leagueoflegends.com/cdn/9.11.1/img/champion/Alistar.png
https://ddragon.leagueoflegends.com/cdn/9.11.1/img/champion/Amumu.png
https://ddragon.leagueoflegends.com/cdn/9.11.1/img/champion/Anivia.png
...
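
To illustrate why the plain requests approach comes back empty, here is a minimal sketch. Both HTML snippets are made up for the example (they are not the real page source): the first stands for what an HTTP client receives before any script has run, the second for the DOM after a browser has rendered it. The helper name image_urls is also just illustrative.

```python
from bs4 import BeautifulSoup

# Hypothetical page source as a plain HTTP client would receive it:
# the grid is empty and a script is expected to fill it in later.
static_html = """
<html><body>
  <div id="champion-grid"></div>
  <script src="ChampionsListApp.js"></script>
</body></html>
"""

# Hypothetical DOM after a browser has executed the script.
rendered_html = """
<html><body>
  <div id="champion-grid">
    <img src="https://ddragon.leagueoflegends.com/cdn/9.11.1/img/champion/Aatrox.png">
    <img src="https://ddragon.leagueoflegends.com/cdn/9.11.1/img/champion/Ahri.png">
  </div>
</body></html>
"""


def image_urls(html):
    """Return the src of every <img> that has one."""
    soup = BeautifulSoup(html, "html.parser")
    return [img["src"] for img in soup.select("img[src]")]


static_urls = image_urls(static_html)      # nothing to find: no <img> exists yet
rendered_urls = image_urls(rendered_html)  # the tags are present after rendering
print(static_urls)
print(rendered_urls)
```

The selector is the same in both cases; only the input differs, which is why switching to Selenium's page_source changes the result.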

Use the same endpoint the page itself uses to load the champion data. You can find it in the Network tab of your browser's developer tools.
import requests
base = 'https://ddragon.leagueoflegends.com/cdn/9.11.1/img/champion/'
r = requests.get('https://ddragon.leagueoflegends.com/cdn/9.11.1/data/en_US/champion.json').json()
images = [base + r['data'][item]['image']['full'] for item in r['data']]
print(images)
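
As a rough sketch of what that list comprehension does, the following runs offline on a tiny sample payload. The sample only mimics the fields the answer reads (data, image, full); the real champion.json has many more entries and keys. The function name champion_image_urls is hypothetical.

```python
# Sample payload shaped like the fields used above; not the real file contents.
sample_payload = {
    "data": {
        "Aatrox": {"image": {"full": "Aatrox.png"}},
        "Ahri": {"image": {"full": "Ahri.png"}},
    }
}

BASE = "https://ddragon.leagueoflegends.com/cdn/9.11.1/img/champion/"


def champion_image_urls(payload, base=BASE):
    """Map each champion key to the full URL of its portrait."""
    return {
        name: base + entry["image"]["full"]
        for name, entry in payload["data"].items()
    }


urls = champion_image_urls(sample_payload)
for name, url in urls.items():
    print(name, url)
```

Keeping the champion key next to each URL makes it easy to name the files when saving them, for example by writing requests.get(url).content to a file called name + ".png".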

Here is your answer
import requests
from bs4 import BeautifulSoup
soup = BeautifulSoup(requests.get("https://na.leagueoflegends.com/en/game-info/champions/").text, 'lxml')
soup.find_all('a')  # put the tag you want here, e.g. 'a', 'script', 'link'
OUTPUT:
Out[21]:
[Get Started,
What is League of Legends?,
New Player Guide,
Chat Commands,
Community Interaction,
The Summoner's Code,
Champions,
Items,
Summoners,
Summoner Spells,
Game Modes,
Summoner's Rift,
The Twisted Treeline,
Howling Abyss,
Home,
Game Info]
soup = BeautifulSoup(requests.get("https://na.leagueoflegends.com/en/game-info/champions/").text, 'lxml')
soup.find_all('script')
Out[22]:
[<script>(function(w,d,s,l,i){w[l]=w[l]||[];w[l].push({'gtm.start':new Date().getTime(),event:'gtm.js'});var f=d.getElementsByTagName(s)[0],j=d.createElement(s),dl=l!='dataLayer'?'&l='+l:'';j.async=true;j.src='//www.googletagmanager.com/gtm.js?id='+i+dl;f.parentNode.insertBefore(j,f);})(window,document,'script','dataLayer','GTM-N98J');</script>,
<script>window.ga = window.ga || function(){(ga.q=ga.q||[]).push(arguments)};ga.l = +new Date;</script>,
<script src="https://lolstatic-a.akamaihd.net/lolkit/1.1.6/modernizr.js" type="text/javascript"></script>,
<script src="//ajax.googleapis.com/ajax/libs/jquery/1.9.1/jquery.min.js"></script>,
<script src="https://lolstatic-a.akamaihd.net/lolkit/1.1.6/riot-all.js" type="text/javascript"></script>,
<script src="https://lolstatic-a.akamaihd.net/lolkit/1.1.6/riot-kit-all.js" type="text/javascript"></script>,
<script type="text/javascript">rg_force_language = 'en_US';rg_force_manifest = 'https://ddragon.leagueoflegends.com/realms/na.js';rg_assets = 'https://lolstatic-a.akamaihd.net/game-info/1.1.9';</script>,
<script type="text/javascript">window.riotBarConfig = {touchpoints: {activeTouchpoint: 'game'},locale: {landingUrlPattern : 'https://na.leagueoflegends.com//game-info/'},footer: {enabled: true,container: {renderFooterInto: '#footer'}}};</script>,
<script async="" src="https://lolstatic-a.akamaihd.net/riotbar/prod/latest/en_US.js"></script>,
<script src="https://ddragon.leagueoflegends.com/cdn/dragonhead.js" type="text/javascript"></script>,
<script src="https://lolstatic-a.akamaihd.net/frontpage/apps/prod/LolGameInfo-Harbinger/en_US/0d258ed5be6806b967afaf2b4b9817406912a7ac/assets/assets/js/riot-dd-utils.js" type="text/javascript"></script>,
<script src="https://lolstatic-a.akamaihd.net/frontpage/apps/prod/LolGameInfo-Harbinger/en_US/0d258ed5be6806b967afaf2b4b9817406912a7ac/assets/assets/js/riot-dd-i18n.js" type="text/javascript"></script>,
<script src="https://lolstatic-a.akamaihd.net/frontpage/apps/prod/LolGameInfo-Harbinger/en_US/0d258ed5be6806b967afaf2b4b9817406912a7ac/assets/assets/js/external/jquery.lazy-load.js" type="text/javascript"></script>,
<script src="https://lolstatic-a.akamaihd.net/frontpage/apps/prod/LolGameInfo-Harbinger/en_US/0d258ed5be6806b967afaf2b4b9817406912a7ac/assets/assets/js/DDFilterApp.js" type="text/javascript"></script>,
<script src="https://lolstatic-a.akamaihd.net/frontpage/apps/prod/LolGameInfo-Harbinger/en_US/0d258ed5be6806b967afaf2b4b9817406912a7ac/assets/assets/js/DDMarkupItem.js" type="text/javascript"></script>,
<script src="https://lolstatic-a.akamaihd.net/frontpage/apps/prod/LolGameInfo-Harbinger/en_US/0d258ed5be6806b967afaf2b4b9817406912a7ac/assets/assets/js/DDMarkupContainer.js" type="text/javascript"></script>,
<script src="https://lolstatic-a.akamaihd.net/frontpage/apps/prod/LolGameInfo-Harbinger/en_US/0d258ed5be6806b967afaf2b4b9817406912a7ac/assets/assets/js/champions/ChampionsListGridItem.js" type="text/javascript"></script>,
<script src="https://lolstatic-a.akamaihd.net/frontpage/apps/prod/LolGameInfo-Harbinger/en_US/0d258ed5be6806b967afaf2b4b9817406912a7ac/assets/assets/js/champions/ChampionsListGridView.js" type="text/javascript"></script>,
<script src="https://lolstatic-a.akamaihd.net/frontpage/apps/prod/LolGameInfo-Harbinger/en_US/0d258ed5be6806b967afaf2b4b9817406912a7ac/assets/assets/js/champions/ChampionsListApp.js" type="text/javascript"></script>]
soup = BeautifulSoup(requests.get("https://na.leagueoflegends.com/en/game-info/champions/").text, 'lxml')
soup.find_all('link')
Out[23]:
[<link href="https://lolstatic-a.akamaihd.net/lolkit/1.1.6/lol-kit.css" rel="stylesheet"/>,
<link href="https://lolstatic-a.akamaihd.net/frontpage/apps/prod/LolGameInfo-Harbinger/en_US/0d258ed5be6806b967afaf2b4b9817406912a7ac/assets/assets/css/base-styles.css" rel="stylesheet"/>,
<link href="https://lolstatic-a.akamaihd.net/lolkit/1.1.6/resources/images/favicon.ico" rel="SHORTCUT ICON"/>]

Related

BeautifulSoup Access Denied parsing error [duplicate]

This is the first time I've encountered a site where it wouldn't 'allow me access' to the webpage. I'm not sure why and I can't figure out how to scrape from this website.
My attempt:
import requests
from bs4 import BeautifulSoup
def html(url):
    return BeautifulSoup(requests.get(url).content, "lxml")
url = "https://www.g2a.com/"
soup = html(url)
print(soup.prettify())
Output:
<html>
<head>
<title>
Access Denied
</title>
</head>
<body>
<h1>
Access Denied
</h1>
You don't have permission to access "http://www.g2a.com/" on this server.
<p>
Reference #18.4d24db17.1592006766.55d2bc1
</p>
</body>
</html>
I've looked into it for a while now and found that there is supposed to be some type of token (access, refresh, etc.).
I also noticed action="/search", but I wasn't sure what to do with just that.
The request needs to send some HTTP headers (here, Accept-Language) before the server will return the page:
import requests
from bs4 import BeautifulSoup
headers = {'Accept-Language': 'en-US,en;q=0.5'}
def html(url):
    return BeautifulSoup(requests.get(url, headers=headers).content, "lxml")
url = "https://www.g2a.com/"
soup = html(url)
print(soup.prettify())
Prints:
<!DOCTYPE html>
<html lang="en-us">
<head>
<link href="polyfill.g2a.com" rel="dns-prefetch"/>
<link href="images.g2a.com" rel="dns-prefetch"/>
<link href="id.g2a.com" rel="dns-prefetch"/>
<link href="plus.g2a.com" rel="dns-prefetch"/>
... and so on.
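
If a header does not seem to take effect, it can help to check what would actually be sent. The sketch below assumes the requests library and uses a Session so the header is applied to every request; it only prepares the request and prints its headers, nothing goes over the network. Which headers a particular site checks is site-specific and may change over time.

```python
import requests

# A session applies its default headers to every request made through it.
session = requests.Session()
session.headers.update({"Accept-Language": "en-US,en;q=0.5"})

# Build the request without sending it, so the merged headers can be inspected.
request = requests.Request("GET", "https://www.g2a.com/")
prepared = session.prepare_request(request)

for name, value in prepared.headers.items():
    print(f"{name}: {value}")
```

The printed list also shows the default User-Agent that requests sends, which is another header some servers look at.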

Python BS4 - NameError: name 'tagID' is not defined

I'm new to python. I'm building an app to parse and clean MSWord generated HTML.
In the following code, I pass in content as a BS4 object and attempt to update a specific span tag with new attributes.
from bs4.element import Tag
content = '''<html>
<head></head>
<body>
<span style="background: #c0c0c0">Table 1</span>
<span style="background: #c0c0c0">Figure 1</span>
</body>
</html>'''
def clean_table_figure_id_tags(content):
    for element in content.findAll('span', style='background: #ccc0'):
        # inspect the existing tag to determine table or figure
        if 'Table' in element.string:
            tagID = 'TableId'
        elif 'Figure' in element.string:
            tagID = 'FigureId'
        # tagID = content(elementString)
        newTag = Tag(builder=content.builder, name='span', attrs={'id': tagID, 'class': 'variable'})
        newTag.string = element.string
        element.replace_with(newTag)
    return content
However, I receive the following error: NameError: name 'tagID' is not defined
Any help is greatly appreciated.
If I understand you correctly, you are looking for something like this:
from bs4 import BeautifulSoup as bs
content = """[your html above]"""
soup = bs(content,'lxml')
for elem in soup.select('span[style="background: #c0c0c0"]'):
    if "Table" in elem.text:
        elem.attrs['id'] = 'TableId'
    if "Figure" in elem.text:
        elem.attrs['id'] = 'FigureId'
print(soup.prettify())
Output:
<html>
<head>
</head>
<body>
<span id="TableId" style="background: #c0c0c0">
Table 1
</span>
<span id="FigureId" style="background: #c0c0c0">
Figure 1
</span>
</body>
</html>
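
One likely cause of the error in the question is that tagID is only assigned inside the if/elif branches, so a span whose text contains neither "Table" nor "Figure" reaches the line that uses tagID with the name never bound. A minimal sketch of guarding against that follows; the third span and the function name tag_table_and_figure_spans are made up for illustration, and the attributes are replaced with id and class as the question intended.

```python
from bs4 import BeautifulSoup

html = """
<html><head></head><body>
<span style="background: #c0c0c0">Table 1</span>
<span style="background: #c0c0c0">Figure 1</span>
<span style="background: #c0c0c0">Note</span>
</body></html>
"""


def tag_table_and_figure_spans(soup):
    """Give table/figure spans an id and class; leave other spans alone."""
    for span in soup.find_all("span", style="background: #c0c0c0"):
        text = span.get_text()
        if "Table" in text:
            tag_id = "TableId"
        elif "Figure" in text:
            tag_id = "FigureId"
        else:
            # Neither case applies: skip instead of using an unbound name.
            continue
        # Replace the old attributes (including style) with the new ones.
        span.attrs = {"id": tag_id, "class": "variable"}
    return soup


soup = tag_table_and_figure_spans(BeautifulSoup(html, "html.parser"))
spans = soup.find_all("span")
print(spans)
```

Using get_text() rather than .string also avoids a None when a span contains nested tags.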

How to fix ```<function Render.__getattr__.<locals>.template at 0x7f1dafac8400>``` error in python3 web.py

I'm using web.py with Python 3 and get an error.
Error:
<function Render.__getattr__.<locals>.template at 0x7fcf15a43840>
main.py
import web
from web.template import render
urls = (
    '/', 'Home'
)
render = web.template.render('Views/Templates/', base='MainLayout')
app = web.application(urls, globals())
# routes
class Home:
    def GET(self):
        return render.Home
if __name__ == '__main__':
    app.run()
Views/Templates/MainLayout.html
$def with (page)
<html>
<head>
<title> Title </title>
</head>
<body>
<div id='app'>
$:page
</div>
</body>
</html>
Views/Templates/Home.html
<h1> Hello World!</h1>

How do I access content in a script tag using bs4

I am new to python and I'm trying to use beautiful soup to find a script tag on a page that has the dataLayer and then retrieve the value of postNo and print it.
<head>
<script src="https://cdnjs.cloudflare.com/ajax/libs/jquery/3.3.1/jquery.min.js"></script>
<script src="https://maxcdn.bootstrapcdn.com/bootstrap/3.4.0/js/bootstrap.min.js"></script>
<!-- Data Layer - Begin -->
<script>
dataLayer = [
{
'country': 'UnitedKingdom',
'site': 'Blog',
'postNo': '34',
'pageType': 'Home',
'pageType2': 'Blog',
'pageType3': 'Top Tips'
}
];
</script>
<!-- Data Layer - End -->
</head>
Any help or pointers would be highly appreciated.
Thanks
import requests
import bs4
import json
html = '''
<head>
<script src="https://cdnjs.cloudflare.com/ajax/libs/jquery/3.3.1/jquery.min.js"></script>
<script src="https://maxcdn.bootstrapcdn.com/bootstrap/3.4.0/js/bootstrap.min.js"></script>
<!-- Data Layer - Begin -->
<script>
dataLayer = [
{
'country': 'UnitedKingdom',
'site': 'Blog',
'postNo': '34',
'pageType': 'Home',
'pageType2': 'Blog',
'pageType3': 'Top Tips'
}
];
</script>
<!-- Data Layer - End -->
</head>'''
soup = bs4.BeautifulSoup(html, 'html.parser')
scripts = soup.find_all('script')
for script in scripts:
    if 'dataLayer = ' in script.text:
        jsonStr = script.text.strip()
        jsonStr = jsonStr.split('[')[1].strip()
        jsonStr = jsonStr.split(']')[0].strip()
        jsonStr = jsonStr.replace("'", '"')
        jsonObj = json.loads(jsonStr)
print(jsonObj['postNo'])
Output:
34
Just pull the list out of the HTML and parse it with ast.literal_eval. See the code below.
from bs4 import BeautifulSoup
import ast
html = '''
<head>
<script src="https://cdnjs.cloudflare.com/ajax/libs/jquery/3.3.1/jquery.min.js"></script>
<script src="https://maxcdn.bootstrapcdn.com/bootstrap/3.4.0/js/bootstrap.min.js"></script>
<!-- Data Layer - Begin -->
<script>
dataLayer = [
{
'country': 'UnitedKingdom',
'site': 'Blog',
'postNo': '34',
'pageType': 'Home',
'pageType2': 'Blog',
'pageType3': 'Top Tips'
}
];
</script>
<!-- Data Layer - End -->
</head>'''
soup = BeautifulSoup(html, 'html.parser')
content = soup.findAll('script')[2].text.replace(';','').replace('dataLayer = ','').strip()
data = ast.literal_eval(content)
print([x['postNo'] for x in data])
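
Both answers depend on details of this particular page (splitting on brackets, or the script being the third one). As a hedged alternative, the sketch below searches every script for the dataLayer assignment with a regular expression and parses the captured list with ast.literal_eval. The function name extract_data_layer and the shortened HTML are illustrative, and the approach only works when the JavaScript literal also happens to be a valid Python literal (quoted keys, no true/false/null).

```python
import ast
import re

from bs4 import BeautifulSoup

html = """
<head>
<script src="https://cdnjs.cloudflare.com/ajax/libs/jquery/3.3.1/jquery.min.js"></script>
<script>
dataLayer = [
{
'country': 'UnitedKingdom',
'site': 'Blog',
'postNo': '34'
}
];
</script>
</head>
"""

# Capture the [...] that follows "dataLayer =" up to the closing "];".
DATA_LAYER = re.compile(r"dataLayer\s*=\s*(\[.*?\])\s*;", re.DOTALL)


def extract_data_layer(html):
    """Return the dataLayer list from the first script that defines it, else None."""
    soup = BeautifulSoup(html, "html.parser")
    for script in soup.find_all("script"):
        # script.string is None for external scripts that only have a src.
        match = DATA_LAYER.search(script.string or "")
        if match:
            return ast.literal_eval(match.group(1))
    return None


data = extract_data_layer(html)
print(data[0]["postNo"])
```

Because the search does not rely on the position of the script tag, it keeps working if other scripts are added before it.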

An uncaught error when trying to update the result of a python script using brython

I'm new to Brython. I get an uncaught error when trying to display the result of a Python script on my page using Brython.
Uncaught Error
at Object._b_.AttributeError.$factory (eval at $make_exc (brython.js:7017), <anonymous>:31:340)
at attr_error (brython.js:6184)
at Object.$B.$getattr (brython.js:6269)
at CallWikifier0 (eval at $B.loop (brython.js:4702), <anonymous>:34:66)
at eval (eval at $B.loop (brython.js:4702), <anonymous>:111:153)
at $B.loop (brython.js:4702)
at $B.inImported (brython.js:4696)
at $B.loop (brython.js:4708)
at $B.inImported (brython.js:4696)
at $B.loop (brython.js:4708)
Here is my page:
<html>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8"/>
<head>
<script src="/brython-3.7.0rc1/www/src/brython.js"></script>
<script src="/brython-3.7.0rc1/www/src/brython_stdlib.js"></script>
</head>
<body onload="brython()">
<script type="text/python" src="vttToWiki.py"></script>
<div class="wrap">title</div>
<p id="content-title"></p>
<div class="wrap">subtitle</div>
<div id="content-subtitle" class="content-div">
<p id="message" font-color=white>loading...</p>
</div>
<button id="mybutton">click!</button>
</body>
</html>
Here is my Python script. It contacts http://www.wikifier.org/annotate-article, a website for semantic annotation. Given a text, the function CallWikifier gets the annotation result and the corresponding wiki URL.
import sys, json, urllib.parse, urllib.request
from browser import alert, document, html
element = document.getElementById("content-title")
def CallWikifier(text, lang="en", threshold=0.8):
    # Prepare the URL.
    data = urllib.parse.urlencode([
        ("text", text), ("lang", lang),
        ("userKey", "trnwaxlgwchfgxqqnlfltzdfihiakr"),
        ("pageRankSqThreshold", "%g" % threshold), ("applyPageRankSqThreshold", "true"),
        ("nTopDfValuesToIgnore", "200"), ("nWordsToIgnoreFromList", "200"),
        ("wikiDataClasses", "true"), ("wikiDataClassIds", "false"),
        ("support", "true"), ("ranges", "false"),
        ("includeCosines", "false"), ("maxMentionEntropy", "3")
    ])
    url = "http://www.wikifier.org/annotate-article"
    # Call the Wikifier and read the response.
    req = urllib.request.Request(url, data=data.encode("utf8"), method="POST")
    with urllib.request.urlopen(req, timeout = 60) as f:
        response = f.read()
        response = json.loads(response.decode("utf8"))
    # Save the response as a JSON file
    with open("record.json", "w") as f:
        json.dump(response, f)
    # update the annotations.
    for annotation in response["annotations"]:
        alert(annotation["title"])
        elt = document.createElement("B")
        txt = document.createTextNode(f"{annotation}\n")
        elt.appendChild(txt)
        element.appendChild(elt)
document["mybutton"].addEventListener("click", CallWikifier("foreign minister has said Damascus is ready " +"to offer a prisoner exchange with rebels."))
If you have any idea what causes this, please let me know. Thank you for your advice!

Resources