HTML confusion with Python BeautifulSoup - python-3.x

I'm following thenewboston's tutorials on YouTube, and my code runs without errors.
I'm trying to print the "Generic Line List" and all the links following that list; it can be found at the bottom of this link:
http://playrustwiki.com/wiki/List_of_Items
import requests
from bs4 import BeautifulSoup

def trade_spider(max_pages):
    page = 1
    while page <= max_pages:  # makes our pages change every time
        url = 'http://playrustwiki.com/wiki/List_of_Items' + str(page)
        source_code = requests.get(url)
        plain_text = source_code.text
        soup = BeautifulSoup(plain_text)  # find all the links in soup or all the titles
        for link in soup.findAll('a', {'class': 'a href'}):  # links are a for anchors in HTML
            href = link.get('href')  # href attribute
            print(href)
        page += 1

trade_spider(1)
I've tried different HTML attributes, but I think that's where my confusion starts: I can't find the correct attribute to pass to my scraper, or I'm calling the wrong one.
Please help~
Thanks :)

The idea here is to find the element that has the Generic Line List text, then find the next ul sibling via find_next_sibling() and get all the links inside it via find_all():
h3 = soup.find('h3', text='Generic Line List')
generic_line_list = h3.find_next_sibling('ul')
for link in generic_line_list.find_all('a', href=True):
    print(link['href'])
Demo:
>>> import requests
>>> from bs4 import BeautifulSoup
>>>
>>> url = 'http://playrustwiki.com/wiki/List_of_Items'
>>> soup = BeautifulSoup(requests.get(url).content)
>>>
>>> h3 = soup.find('h3', text='Generic Line List')
>>> generic_line_list = h3.find_next_sibling('ul')
>>> for link in generic_line_list.find_all('a', href=True):
...     print(link['href'])
...
/wiki/Wood_Barricade
/wiki/Wood_Shelter
...
/wiki/Uber_Hunting_Bow
/wiki/Cooked_Chicken_Breast
/wiki/Anti-Radiation_Pills
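
Note that newer BeautifulSoup releases prefer the string= argument over the deprecated text=, and it is good practice to name the parser explicitly. A minimal sketch of the same approach under those assumptions (the heading text is assumed to match exactly):
import requests
from bs4 import BeautifulSoup

url = 'http://playrustwiki.com/wiki/List_of_Items'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')

# string= replaces the deprecated text= argument in recent bs4 releases
h3 = soup.find('h3', string='Generic Line List')
if h3 is not None:  # guard in case the heading is missing or renamed
    for link in h3.find_next_sibling('ul').find_all('a', href=True):
        print(link['href'])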

Related

Is there a way to extract data from response.content in Python?

I'm trying to figure out how to scrape/extract an image URL out of response.content.
This is the URL I'm trying to extract: <img src="/Content/images/asos-logo-2022-93x28.png"
The problem is that everything after the /Content/images/ part can change...
Any help appreciated!
You can use Beautiful Soup for this:
>>> import requests
>>> from bs4 import BeautifulSoup
>>> r = requests.get("https://stackoverflow.com/q/71636643/1416672")
>>> html = r.text
>>> soup = BeautifulSoup(html, 'html.parser')
>>> for item in soup.find_all('img'): print(item['src'])
...
https://cdn.sstatic.net/Img/teams/teams-illo-free-sidebar-promo.svg?v=47faa659a05e
https://www.gravatar.com/avatar/f96b33e2715bf57ba8e434140f0aeeba?s=64&d=identicon&r=PG&f=1
/posts/71636643/ivc/9bb6
https://sb.scorecardresearch.com/p?c1=2&c2=17440561&cv=3.6.0&cj=1
If you want to match a specific image, check the docs for how to search by CSS class or any other CSS selector.
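For example, to grab only images whose src starts with the /Content/images/ prefix from the question, a CSS attribute selector works (a minimal sketch; it assumes the soup object from the session above):
# ^= matches attribute values that start with the given prefix
for item in soup.select('img[src^="/Content/images/"]'):
    print(item['src'])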

How can I get the href of anchor tag using Beautiful Soup?

I am trying to get the href of the anchor tag of the very first video search result on YouTube using Beautiful Soup. I am searching by the "a" tag and class_="yt-simple-endpoint style-scope ytd-video-renderer".
But I am getting None output:
from bs4 import BeautifulSoup
import requests

source = requests.get("https://www.youtube.com/results?search_query=MP+election+results+2018%3A+BJP+minister+blames+conspiracy+as+reason+while+losing").text
soup = BeautifulSoup(source, 'lxml')
# print(soup2.prettify())

a = soup.findAll("a", class_="yt-simple-endpoint style-scope ytd-video-renderer")
a_fin = soup.find("a", class_="compact-media-item-image")

print(a)
from bs4 import BeautifulSoup
import requests

source = requests.get("https://www.youtube.com/results?search_query=MP+election+results+2018%3A+BJP+minister+blames+conspiracy+as+reason+while+losing").text
soup = BeautifulSoup(source, 'lxml')
first_serach_result_link = soup.findAll('a', attrs={'class': 'yt-uix-tile-link'})[0]['href']
Heavily inspired by this answer.
Another option is to render the page first with Selenium.
import bs4
from selenium import webdriver

url = 'https://www.youtube.com/results?search_query=MP+election+results+2018%3A+BJP+minister+blames+conspiracy+as+reason+while+losing'
browser = webdriver.Chrome(r'C:\chromedriver_win32\chromedriver.exe')
browser.get(url)
source = browser.page_source

soup = bs4.BeautifulSoup(source, 'html.parser')
hrefs = soup.find_all("a", class_="yt-simple-endpoint style-scope ytd-video-renderer")
for a in hrefs:
    print(a['href'])
Output:
/watch?v=Jor09n2IF44
/watch?v=ym14AyqJDTg
/watch?v=g-2V1XJL0kg
/watch?v=eeVYaDLC5ik
/watch?v=StI92Bic3UI
/watch?v=2W_4LIAhbdQ
/watch?v=PH1WZPT5IKw
/watch?v=Au2EH3GsM7k
/watch?v=q-j1HEnDn7w
/watch?v=Usjg7IuUhvU
/watch?v=YizmwHibomQ
/watch?v=i2q6Fm0E3VE
/watch?v=OXNAMyEvcH4
/watch?v=vdcBtAeZsCk
/watch?v=E4v2StDdYqs
/watch?v=x7kCuRB0f7E
/watch?v=KERtHNoZrF0
/watch?v=TenbA4wWIJA
/watch?v=Ey9HfjUyUvY
/watch?v=hqsuOT0URJU
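As a side note, recent Selenium releases (4.x) no longer accept the driver path as a positional argument; it goes through a Service object instead. A sketch under that assumption, reusing the chromedriver path from the answer:
from selenium import webdriver
from selenium.webdriver.chrome.service import Service

# Selenium 4 style: the executable path is wrapped in a Service object
service = Service(r'C:\chromedriver_win32\chromedriver.exe')
browser = webdriver.Chrome(service=service)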
It's dynamic HTML, so you can use Selenium; or, to get static HTML, use the Googlebot user-agent:
headers = {'User-Agent': 'Googlebot/2.1 (+http://www.google.com/bot.html)'}
source = requests.get("https://.......", headers=headers).text
soup = BeautifulSoup(source, 'lxml')

links = soup.findAll("a", class_="yt-uix-tile-link")
for link in links:
    print(link['href'])
Try looping over the matches:
from urllib.request import urlopen
from bs4 import BeautifulSoup

data = urlopen("some_url")
html_data = data.read()
soup = BeautifulSoup(html_data, 'html.parser')

for a in soup.findAll('a', href=True):
    print(a['href'])
The class you're searching for does not exist in the scraped HTML. You can verify this by printing the soup variable.
For example:
a = soup.findAll("a", class_="sign-in-link")
gives output as:
[<a class="sign-in-link" href="https://accounts.google.com/ServiceLogin?passive=true&continue=https%3A%2F%2Fwww.youtube.com%2Fsignin%3Faction_handle_signin%3Dtrue%26app%3Ddesktop%26feature%3Dplaylist%26hl%3Den%26next%3D%252Fresults%253Fsearch_query%253DMP%252Belection%252Bresults%252B2018%25253A%252BBJP%252Bminister%252Bblames%252Bconspiracy%252Bas%252Breason%252Bwhile%252Blosing&uilel=3&hl=en&service=youtube">Sign in</a>]
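A quick way to check which anchor classes are actually present in the HTML you received is to collect them all (a diagnostic sketch, not part of the original answer):
# gather the distinct class strings of every anchor in the scraped page
anchor_classes = set()
for a in soup.find_all('a', class_=True):
    anchor_classes.add(' '.join(a['class']))
print(anchor_classes)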

BeautifulSoup4 findAll() not getting all of the links on the webpage

I am trying to grab all of the 'a' links from a webpage:
from bs4 import BeautifulSoup
import requests

source_code = requests.get(starting_url)
plain_text = source_code.text
soup = BeautifulSoup(plain_text, "html.parser")
for link in soup.findAll('a'):
    href = link.get('href')
    print(href)
and the list printed out is not all the links on the page. If I print out plain_text, I can see all these links, but they are not printed as href.
First week learning Python! All help is greatly appreciated. Thanks!
Update: I forgot to share the plain_text file here. Sorry for the confusion.
The plain_text is pretty long, so I'll just post the starting_url:
starting_url = 'https://freeexampapers.com/index.php?option=com_content&view=article&id=1&Itemid=101&jsmallfib=1&dir=JSROOT/IB'
and yes, I'm a high school student :-)
Since you have not given any data sample, here is a sample you could try:
import re

soup = BeautifulSoup(html_page, "html.parser")
for link in soup.findAll('a', attrs={'href': re.compile("^http://")}):
    print(link.get('href'))
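Since many sites serve links over HTTPS, you may want the pattern to accept both schemes (a small variation on the sample above):
import re

# ^https?:// matches both http:// and https:// links
for link in soup.findAll('a', attrs={'href': re.compile(r"^https?://")}):
    print(link.get('href'))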
This should do it.
import requests
from bs4 import BeautifulSoup

Link = 'https://animetosho.org/view/jacobswaggedup-kill-la-kill-bd-1280x720-mp4-batch.n677876'
q = requests.get(Link)
soup = BeautifulSoup(q.text, 'html.parser')
# print(soup)

subtitles = soup.findAll('div', {'class': 'links'})
# print(subtitles)

with open("Anilinks.txt", "w") as f:
    for link in subtitles:
        x = link.find_all('a', limit=26)
        for a in x:
            url = a['href']
            f.write(url + '\n')
Now, if you want to do something like keep only certain links in the text file, do the following.
# Store the links we need in a list
links_to_keep = []
with open("Anilinks.txt", "r") as f:
    for line in f.readlines():
        if 'solidfiles.com' in line:
            links_to_keep.append(line)

# Write only the links in our list back to the file
with open("Anilinks.txt", "w") as f:
    for link in links_to_keep:
        f.write(link)
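You could also filter while writing and skip the second pass over the file entirely (a sketch under the same assumptions about the page structure):
# write only the solidfiles.com links in a single pass
with open("Anilinks.txt", "w") as f:
    for div in soup.findAll('div', {'class': 'links'}):
        for a in div.find_all('a', href=True):
            if 'solidfiles.com' in a['href']:
                f.write(a['href'] + '\n')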

Converting Beautifulsoup scraped table to list

Scraping a column from Wikipedia with BeautifulSoup returns only the last row, while I want all of them in a list:
from urllib.request import urlopen
from bs4 import BeautifulSoup

site = "https://en.wikipedia.org/wiki/Agriculture_in_India"
html = urlopen(site)
soup = BeautifulSoup(html, "html.parser")
table = soup.find("table", {'class': 'wikitable sortable'})

for row in table.find_all("tr")[1:]:
    col = row.find_all("td")
    if len(col) > 0:
        com = str(col[1].string.strip("\n"))

list(com)
com
Out: 'ZTS'
So it only shows the last row as a string, while I was expecting a list with each line as a string value, so that I can assign the list to a new variable:
"Rice", "Buffalo milk", "Cow milk", "Wheat"
Can anyone help me?
Your method will not work because you are not "adding" anything to com.
One way to do what you desire is:
from urllib.request import urlopen
from bs4 import BeautifulSoup

site = "https://en.wikipedia.org/wiki/Agriculture_in_India"
html = urlopen(site)
soup = BeautifulSoup(html, "html.parser")
table = soup.find("table", {'class': 'wikitable sortable'})

com = []
for row in table.find_all("tr")[1:]:
    col = row.find_all("td")
    if len(col) > 0:
        temp = col[1].contents[0]
        try:
            to_append = temp.contents[0]
        except Exception as e:
            to_append = temp
        com.append(to_append)
print(com)
This will give you what you require.
Explanation
col[1].contents[0] gives the first child of the tag; .contents gives you a list of the tag's children, and here there is a single child, hence index 0.
In some cases, the content inside the <td> cell is an <a href> tag, so I apply another .contents[0] to get the text.
In other cases it is plain text rather than a link; that is what the try/except handles, since a plain string has no .contents and raises an exception, in which case we append the string itself.
See the official documentation for details.
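If you only need the visible text of each cell, get_text() flattens nested tags for you and avoids the try/except entirely (an alternative sketch, assuming the same table variable as above):
# get_text(strip=True) returns the cell's text whether or not it wraps a link
com = []
for row in table.find_all("tr")[1:]:
    col = row.find_all("td")
    if len(col) > 0:
        com.append(col[1].get_text(strip=True))
print(com)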

Why am I getting duplicate links? And how do I fetch links on the next pages?

I am getting duplicate links for the links I am trying to obtain, and I am not sure why. I am also trying to fetch links like these from all the pages, but I am not sure how to write the code to move to the next page. Could someone please help me understand how I would go about this?
import requests
from bs4 import BeautifulSoup

url = 'http://www.gosugamers.net/counterstrike/teams'
r = requests.get(url)
page = r.text
soup = BeautifulSoup(page)

#all_teams = []
for team_links in soup.find_all('a', href=True):
    if team_links['href'] == '' or team_links['href'].startswith('/counterstrike/teams'):
        print(team_links.get('href').replace('/counterstrike/teams', url))
The team links are in anchor tags inside the h3 tags which are inside the div with the details class:
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

base = "http://www.gosugamers.net"
url = 'http://www.gosugamers.net/counterstrike/teams'
r = requests.get(url)
page = r.text
soup = BeautifulSoup(page)

for team_links in soup.select("div.details h3 a"):
    print(urljoin(base, team_links["href"]))
Which gives you:
http://www.gosugamers.net/counterstrike/teams/16338-motv
http://www.gosugamers.net/counterstrike/teams/16337-absolute-monster
http://www.gosugamers.net/counterstrike/teams/16258-immortals-cs
http://www.gosugamers.net/counterstrike/teams/16251-ireal-star-gaming
http://www.gosugamers.net/counterstrike/teams/16176-team-genesis
http://www.gosugamers.net/counterstrike/teams/16175-potadies
http://www.gosugamers.net/counterstrike/teams/16174-crowns-gg
http://www.gosugamers.net/counterstrike/teams/16173-visomvet
http://www.gosugamers.net/counterstrike/teams/16172-team-phenomenon
http://www.gosugamers.net/counterstrike/teams/16152-kriklekrakle
http://www.gosugamers.net/counterstrike/teams/16148-begenius
http://www.gosugamers.net/counterstrike/teams/16144-blubblub
http://www.gosugamers.net/counterstrike/teams/16142-team-1231
http://www.gosugamers.net/counterstrike/teams/16141-vsv
http://www.gosugamers.net/counterstrike/teams/16140-tbi
http://www.gosugamers.net/counterstrike/teams/16136-deadweight
http://www.gosugamers.net/counterstrike/teams/16135-me-myself-and-i
http://www.gosugamers.net/counterstrike/teams/16085-pur-esports
http://www.gosugamers.net/counterstrike/teams/15850-falken
http://www.gosugamers.net/counterstrike/teams/15815-team-abyssal
You are literally parsing all the links on the page; that is why you see the dupes.
To get all the teams, we can follow the next-page link until the span with the "Next" text is no longer there, which only happens on the last page:
def get_all(url, base):
    r = requests.get(url)
    page = r.text
    soup = BeautifulSoup(page)
    for team_links in soup.select("div.details h3 a"):
        yield urljoin(base, team_links["href"])
    nxt = soup.find("div", {"class": "pages"}).find("span", text="Next")
    while nxt:
        r = requests.get(urljoin(base, nxt.find_previous("a")["href"]))
        page = r.text
        soup = BeautifulSoup(page)
        for team_links in soup.select("div.details h3 a"):
            yield urljoin(base, team_links["href"])
        nxt = soup.find("div", {"class": "pages"}).find("span", text="Next")
If we run it for a couple of seconds, you can see we get the next pages:
In [26]: for link in get_all(url, base):
   ....:     print(link)
   ....:
http://www.gosugamers.net/counterstrike/teams/16386-cantonese-cs
http://www.gosugamers.net/counterstrike/teams/16338-motv
http://www.gosugamers.net/counterstrike/teams/16337-absolute-monster
http://www.gosugamers.net/counterstrike/teams/16258-immortals-cs
http://www.gosugamers.net/counterstrike/teams/16251-ireal-star-gaming
http://www.gosugamers.net/counterstrike/teams/16176-team-genesis
http://www.gosugamers.net/counterstrike/teams/16175-potadies
http://www.gosugamers.net/counterstrike/teams/16174-crowns-gg
http://www.gosugamers.net/counterstrike/teams/16173-visomvet
http://www.gosugamers.net/counterstrike/teams/16172-team-phenomenon
http://www.gosugamers.net/counterstrike/teams/16152-kriklekrakle
http://www.gosugamers.net/counterstrike/teams/16148-begenius
http://www.gosugamers.net/counterstrike/teams/16144-blubblub
http://www.gosugamers.net/counterstrike/teams/16142-team-1231
http://www.gosugamers.net/counterstrike/teams/16141-vsv
http://www.gosugamers.net/counterstrike/teams/16140-tbi
http://www.gosugamers.net/counterstrike/teams/16136-deadweight
http://www.gosugamers.net/counterstrike/teams/16135-me-myself-and-i
http://www.gosugamers.net/counterstrike/teams/16085-pur-esports
http://www.gosugamers.net/counterstrike/teams/15850-falken
http://www.gosugamers.net/counterstrike/teams/15815-team-abyssal
http://www.gosugamers.net/counterstrike/teams/15810-ex-deathtrap
http://www.gosugamers.net/counterstrike/teams/15808-mix123
http://www.gosugamers.net/counterstrike/teams/15651-undertake-esports
http://www.gosugamers.net/counterstrike/teams/15644-five
http://www.gosugamers.net/counterstrike/teams/15630-five
http://www.gosugamers.net/counterstrike/teams/15627-inetkoxtv
http://www.gosugamers.net/counterstrike/teams/15626-tetr-s
http://www.gosugamers.net/counterstrike/teams/15625-rozenoir-esports-white
http://www.gosugamers.net/counterstrike/teams/15619-fragment-gg
http://www.gosugamers.net/counterstrike/teams/15615-monarchs-gg
http://www.gosugamers.net/counterstrike/teams/15602-ottoman-fire
http://www.gosugamers.net/counterstrike/teams/15591-respect
http://www.gosugamers.net/counterstrike/teams/15569-moonbeam-gaming
http://www.gosugamers.net/counterstrike/teams/15563-team-tilt
http://www.gosugamers.net/counterstrike/teams/15534-dynasty-uk
http://www.gosugamers.net/counterstrike/teams/15507-urbantech
http://www.gosugamers.net/counterstrike/teams/15374-innova
http://www.gosugamers.net/counterstrike/teams/15373-g3x
http://www.gosugamers.net/counterstrike/teams/15372-cnb
http://www.gosugamers.net/counterstrike/teams/15370-intz
http://www.gosugamers.net/counterstrike/teams/15369-2kill
http://www.gosugamers.net/counterstrike/teams/15368-supernova
http://www.gosugamers.net/counterstrike/teams/15367-biggods
http://www.gosugamers.net/counterstrike/teams/15366-playzone
http://www.gosugamers.net/counterstrike/teams/15365-pride
http://www.gosugamers.net/counterstrike/teams/15359-rising-orkam
http://www.gosugamers.net/counterstrike/teams/15342-team-foxez
http://www.gosugamers.net/counterstrike/teams/15336-angels
http://www.gosugamers.net/counterstrike/teams/15331-atlando-esports
http://www.gosugamers.net/counterstrike/teams/15329-xfinity-esports
http://www.gosugamers.net/counterstrike/teams/15326-nano-reapers
http://www.gosugamers.net/counterstrike/teams/15322-erase-team
http://www.gosugamers.net/counterstrike/teams/15318-heyguys
http://www.gosugamers.net/counterstrike/teams/15317-illusory
http://www.gosugamers.net/counterstrike/teams/15285-dismay
http://www.gosugamers.net/counterstrike/teams/15284-kingdom-esports
http://www.gosugamers.net/counterstrike/teams/15283-team-rival
http://www.gosugamers.net/counterstrike/teams/15282-ze-pug-godz
http://www.gosugamers.net/counterstrike/teams/15281-unlimited-potential1
In the page source you can see that every page bar the last has a span with Next, while the last page only has spans with Previous and First.
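The same crawl can also be written without duplicating the fetch-and-parse step, by treating the first page like any other (a refactoring sketch of the function above, assuming the same markup):
def get_all(url, base):
    while url:
        soup = BeautifulSoup(requests.get(url).text, 'html.parser')
        for team_link in soup.select("div.details h3 a"):
            yield urljoin(base, team_link["href"])
        # the anchor for the next page sits just before the "Next" span
        nxt = soup.find("div", {"class": "pages"}).find("span", text="Next")
        url = urljoin(base, nxt.find_previous("a")["href"]) if nxt else None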
