Use beautifulsoup to download href links - python-3.x

Looking to download href links using BeautifulSoup 4, Python 3 and the requests library.
This is the code that I have now. I thought it would be tough to use regex in this situation, but I'm not sure whether this can be done with BeautifulSoup instead. I have to download all of the shape files from the grid and am looking to automate this task. Thank you!
URL:
https://earth-info.nga.mil/index.php?dir=coordsys&action=gars-20x20-dloads
import requests
from bs4 import BeautifulSoup
import re
URL = 'https://earth-info.nga.mil/index.php?dir=coordsys&action=gars-20x20-dloads'
page = requests.get(URL)
soup = BeautifulSoup(page.content,'html.parser')
results = re.findall(r'<a[^>]* href="([^"]*)"', page)
print(results)

Those files are all associated with an area tag, so I would simply select those:
import requests
from bs4 import BeautifulSoup as bs
r = requests.get('https://earth-info.nga.mil/index.php?dir=coordsys&action=gars-20x20-dloads')
soup = bs(r.content, 'lxml')
files = ['https://earth-info.nga.mil/' + i['href'] for i in soup.select('area')]
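To actually download the shape files rather than just collect the URLs, you can loop over files and write each response to disk. A minimal sketch continuing from the snippet above, assuming each href ends in a usable filename:
for url in files:
    # take the last path segment as the local filename (an assumption about the hrefs)
    name = url.rsplit('/', 1)[-1]
    resp = requests.get(url)
    resp.raise_for_status()
    with open(name, 'wb') as f:
        f.write(resp.content)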

page is a Response object, not a string; use its text attribute in order to search for all a tags with regex.
Instead of:
results = re.findall(r'<a[^>]* href="([^"]*)"', page)
Use:
results = re.findall(r'<a[^>]* href="([^"]*)"', page.text)
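Since your snippet already builds a soup object, you can also skip the regex entirely and let BeautifulSoup collect the hrefs; a small sketch of that alternative:
results = [a['href'] for a in soup.find_all('a', href=True)]
print(results)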

Related

Extract max site number from web page

I need to extract the max page number from a properties web site.
My code:
import requests
from bs4 import BeautifulSoup
URL = 'https://www.nehnutelnosti.sk/vyhladavanie/'
page = requests.get(URL)
soup = BeautifulSoup(page.content, 'html.parser')
results = soup.find(id = 'inzeraty')
sitenums = soup.find_all('ul', class_='component-pagination__items d-flex align-items-center')
sitenums.find_all('li', class_='component-pagination__item')
My code returns error:
AttributeError: ResultSet object has no attribute 'find_all'. You're probably treating a list of elements like a single element. Did you call find_all() when you meant to call find()?
Thanks for any help.
You could use a css selector and grab the second value from the end.
Here's how:
import requests
from bs4 import BeautifulSoup
css = ".component-pagination .component-pagination__item a, .component-pagination .component-pagination__item span"
page = requests.get('https://www.nehnutelnosti.sk/vyhladavanie/')
soup = BeautifulSoup(page.content, 'html.parser').select(css)[-2]
print(soup.getText(strip=True))
Output:
2309
A similar idea, but doing the filtering faster within the css selector rather than by indexing, using :nth-last-child:
The :nth-last-child() CSS pseudo-class matches elements based on their
position among a group of siblings, counting from the end.
import requests
from bs4 import BeautifulSoup as bs
r = requests.get('https://www.nehnutelnosti.sk/vyhladavanie')
soup = bs(r.text, "lxml")
print(int(soup.select_one('.component-pagination__item:nth-last-child(2) a').text.strip()))
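If the pagination markup shifts so that positional selection breaks, a more defensive variant (an assumption about the layout, not something from the answers above) is to collect every numeric pagination entry and take the maximum:
import requests
from bs4 import BeautifulSoup
r = requests.get('https://www.nehnutelnosti.sk/vyhladavanie')
soup = BeautifulSoup(r.text, 'html.parser')
# keep only pagination entries whose text is purely numeric, then take the largest
texts = [el.get_text(strip=True) for el in soup.select('.component-pagination__item')]
print(max(int(t) for t in texts if t.isdigit()))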

How can I scrape a <h1> tag using BeautifulSoup? [Python]

I am currently coding a price tracker for different websites, but I have run into an issue.
I'm trying to scrape the contents of an h1 tag using BeautifulSoup4, but I don't know how. I've tried to use a dictionary, as suggested in
https://stackoverflow.com/a/40716482/14003061, but it returned None.
Can someone please help? It would be appreciated!
Here's the code:
from termcolor import colored
import requests
from bs4 import BeautifulSoup
import smtplib
def choice_bwfo():
    print(colored("You have selected Buy Whole Foods Online [BWFO]", "blue"))
    url = input(colored("\n[ 2 ] Paste a product link from BWFO.\n", "magenta"))
    url_verify = requests.get(url, headers=headers)
    soup = BeautifulSoup(url_verify.content, 'html5lib')
    item_block = BeautifulSoup.find('h1', {'itemprop' : 'name'})
    print(item_block)
choice_bwfo()
here's an example URL you can use:
https://www.buywholefoodsonline.co.uk/organic-spanish-bee-pollen-250g.html
Thanks :)
This script will print the content of the <h1> tag:
import requests
from bs4 import BeautifulSoup
url = 'https://www.buywholefoodsonline.co.uk/organic-spanish-bee-pollen-250g.html'
# create `soup` variable from the URL:
soup = BeautifulSoup(requests.get(url).content, 'html.parser')
# print text of first `<h1>` tag:
print(soup.h1.get_text())
Prints:
Organic Spanish Bee Pollen 250g
Or you can do:
print(soup.find('h1', {'itemprop' : 'name'}).get_text())
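Note that find() returns None when no matching tag exists, so on a page without that markup .get_text() would raise an AttributeError. A small defensive sketch:
h1 = soup.find('h1', {'itemprop': 'name'})
if h1 is not None:
    print(h1.get_text())
else:
    print('no product name found on this page')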

BeautifulSoup doesn't catch the complete link

When I try to get the links on a web page, bs4 doesn't catch the entire link; it stops before the ?ref_ part.
I'll explain the question through the code:
import requests
from bs4 import BeautifulSoup
imdb_link = "https://www.imdb.com/chart/top?ref_=nv_mv_250"
site = requests.get(imdb_link)
soup = BeautifulSoup(site.text, 'lxml')
for items in soup.find("table", class_="chart").find_all(class_="titleColumn"):
    link = items.find("a").get('href')
    print(link)
The output is:
/title/tt0111161/
/title/tt0068646/
/title/tt0071562/
/title/tt0468569/
/title/tt0050083/
/title/tt0108052/
/title/tt0167260/
...and so on..
But that's incomplete. As you can see on the web page, the links should be:
/title/tt0111161/?ref_=adv_li_tt
/title/tt0068646/?ref_=adv_li_tt
...and so on...
How can I get the entire link? I mean the ?ref_=adv_li_tt too?
I use Python 3.7.4
Overall it might be interesting to work out how to get the full link (I think you would need selenium for that, to allow the javascript to run on the page), but you don't need the full link as seen on the rendered page. What you have, with the prefix https://www.imdb.com added, is perfectly serviceable.
import requests
from bs4 import BeautifulSoup as bs
with requests.Session() as s:
    r = s.get('https://www.imdb.com/chart/top?ref_=nv_mv_25')
    soup = bs(r.content, 'lxml')
    links = ['https://www.imdb.com' + i['href'] for i in soup.select('.titleColumn a')]
    for link in links:
        r = s.get(link)
        soup = bs(r.content, 'lxml')
        print(soup.select_one('title').text)
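As an aside, urllib.parse.urljoin from the standard library can build the absolute URLs instead of string concatenation; a small sketch of that variant:
from urllib.parse import urljoin
links = [urljoin('https://www.imdb.com', i['href']) for i in soup.select('.titleColumn a')]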
You could let selenium load the page so the content renders, then pass the page source over to bs4 to get the links as they appear on the page:
from selenium import webdriver
from bs4 import BeautifulSoup as bs
d = webdriver.Chrome()
d.get('https://www.imdb.com/chart/top?ref_=nv_mv_25')
soup = bs(d.page_source, 'lxml')
d.quit()
links = ['https://www.imdb.com' + i['href'] for i in soup.select('.titleColumn a')]
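One caveat with this sketch: page_source is read immediately after get(), so if the chart renders lazily you may want an explicit wait before parsing. A hedged example using selenium's expected conditions, placed before reading d.page_source:
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
# wait up to 10 seconds for at least one title link to appear
WebDriverWait(d, 10).until(EC.presence_of_element_located((By.CSS_SELECTOR, '.titleColumn a')))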

Why is 'amp;' included in many of the links ('a') that I'm trying to scrape using BeautifulSoup in Python? What's the better way to remove it?

I am using findAll('a') or variations of it to extract a particular tag or class, but I'm getting 'amp;' in the middle of the link in many places.
Example:
Example, the actual link and the erroneous ('amp;') one:
https://help.cybonline.co.uk/system/selfservice.controller?CONFIGURATION=1113&PARTITION_ID=1&secureFlag=true&TIMEZONE_OFFSET=&CMD=VIEW_ARTICLE&ARTICLE_ID=14311&CUST_PREV_CMD=null
https://help.cybonline.co.uk/system/selfservice.controller?CONFIGURATION=1113&amp;PARTITION_ID=1&amp;secureFlag=true&amp;TIMEZONE_OFFSET=&amp;CMD=VIEW_ARTICLE&amp;ARTICLE_ID=14311&amp;CUST_PREV_CMD=null
The raw href comes out like this:
"selfservice.controller?CONFIGURATION=1113&amp;PARTITION_ID=1&amp;secureFlag=false&amp;TIMEZONE_OFFSET=&amp;CMD=VIEW_ARTICLE&amp;ARTICLE_ID=14271&amp;CUST_PREV_CMD=BROWSE_TOPIC"
I can get rid of it using regex, but is there a better way to do it?
The website I'm having a problem with is cybonline
I don't see that problem at all with lxml. Can you try running the following?
import requests
from bs4 import BeautifulSoup as bs
base_url = 'https://help.cybonline.co.uk/system/'
r = requests.get('https://help.cybonline.co.uk/system/selfservice.controller?CONFIGURATION=1113&PARTITION_ID=1&secureFlag=true&TIMEZONE_OFFSET=&CMD=BROWSE_TOPIC&TOPIC_ID=55956')
soup = bs(r.content, 'lxml')
links = [base_url + item['href'] for item in soup.select('.articleAnchor')]
print(links)
If not, you can use replace
base_url + item['href'].replace('amp;', '')
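Alternatively, if you are pulling hrefs out of the raw HTML (for example with regex) rather than through a parser, the standard library's html.unescape decodes entities like &amp; back to & properly instead of patching the string by hand; a small sketch with a made-up href:
from html import unescape
href = 'selfservice.controller?CONFIGURATION=1113&amp;PARTITION_ID=1'
print(unescape(href))  # selfservice.controller?CONFIGURATION=1113&PARTITION_ID=1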
If you want to remove that stray amp; you can simply use replace while fetching the value.
import requests
from bs4 import BeautifulSoup
html = requests.get("https://help.cybonline.co.uk/system/selfservice.controller?CONFIGURATION=1113&PARTITION_ID=1&secureFlag=true&TIMEZONE_OFFSET=&CMD=BROWSE_TOPIC&TOPIC_ID=55956").text
soup = BeautifulSoup(html, 'html.parser')
for a in soup.find_all('a', class_='articleAnchor'):
    # dropping the stray 'amp;' turns '&amp;' back into '&'
    link = a['href'].replace('amp;', '')
    print(link)
OR
import requests
from bs4 import BeautifulSoup
html = requests.get("https://help.cybonline.co.uk/system/selfservice.controller?CONFIGURATION=1113&PARTITION_ID=1&secureFlag=true&TIMEZONE_OFFSET=&CMD=BROWSE_TOPIC&TOPIC_ID=55956").text
soup = BeautifulSoup(html, 'html.parser')
for a in soup.select('a.articleAnchor'):
    # same fix: drop the stray 'amp;' so '&amp;' becomes '&'
    link = a['href'].replace('amp;', '')
    print(link)

web scraping with beautiful soup in python

I want to crawl the homepage of youtube to pull out all the links to videos. Following is the code:
from bs4 import BeautifulSoup
import requests
s = 'https://www.youtube.com/'
html = requests.get(s)
html = html.text
s = BeautifulSoup(html, features="html.parser")
for e in s.find_all('a', {'id': 'video-title'}):
    link = e.get('href')
    text = e.string
    print(text)
    print(link)
    print()
Nothing happens when I run the above code. It seems like the id is not getting discovered. What am I doing wrong?
It is because you are not getting the same HTML as your browser does.
import requests
from bs4 import BeautifulSoup
s = requests.get("https://youtube.com").text
soup = BeautifulSoup(s,'lxml')
print(soup)
Save this code's output to a file named test.html and open it. You will see that it is not the same as the browser's version of the page; it looks corrupted.
See these questions below.
HTML in browser doesn't correspond to scraped data in python
Python requests not giving me the same HTML as my browser is
Basically, I recommend you use Selenium WebDriver, as it behaves like a real browser.
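A minimal sketch of that approach, assuming chromedriver is installed and on your PATH, and that the id video-title from the question matches the browser-rendered page:
from selenium import webdriver
from bs4 import BeautifulSoup
d = webdriver.Chrome()
d.get('https://www.youtube.com/')
soup = BeautifulSoup(d.page_source, 'html.parser')
d.quit()
# the rendered HTML now contains the video-title anchors the question looks for
for e in soup.find_all('a', {'id': 'video-title'}):
    print(e.string, e.get('href'))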
Yes, this is a strange scrape, but if you scrape at the 'div id="content"' level, you are able to get the data you are requesting. I was able to get the titles of each video, but it appears youtube has some rate limiting or throttling, so I do not think you will be able to get ALL of the titles and links. At any rate, below is what I got working for the titles:
import requests
from bs4 import BeautifulSoup
url = 'https://www.youtube.com/'
response = requests.get(url)
page = response.text
soup = BeautifulSoup(page, 'html.parser')
links = soup.find_all('div', id='content')
for each in links:
    print(each.text)
Maybe this could help for scraping all videos from the youtube home page:
from bs4 import BeautifulSoup
import requests
r = 'https://www.youtube.com/'
html = requests.get(r)
all_videos = []
soup = BeautifulSoup(html.text, 'html.parser')
for i in soup.find_all('a'):
    if i.has_attr('href'):
        text = i.attrs.get('href')
        if text.startswith('/watch?'):
            urls = r + text
            all_videos.append(urls)
print('Total Videos', len(all_videos))
print('LIST OF VIDEOS', all_videos)
This code snippet selects all links from the youtube.com homepage that contain /watch? in their href attribute (links to videos):
from bs4 import BeautifulSoup
import requests
soup = BeautifulSoup(requests.get('https://www.youtube.com/').text, 'lxml')
for a in soup.select('a[href*="/watch?"]'):
    print('https://www.youtube.com{}'.format(a['href']))
Prints:
https://www.youtube.com/watch?v=pBhkG2Zwf-c
https://www.youtube.com/watch?v=pBhkG2Zwf-c
https://www.youtube.com/watch?v=gnn7GwqXek4
https://www.youtube.com/watch?v=gnn7GwqXek4
https://www.youtube.com/watch?v=AMKDVfucPfA
https://www.youtube.com/watch?v=AMKDVfucPfA
https://www.youtube.com/watch?v=daQcqPHx9uw
https://www.youtube.com/watch?v=daQcqPHx9uw
https://www.youtube.com/watch?v=V_MXGdSBbAI
https://www.youtube.com/watch?v=V_MXGdSBbAI
https://www.youtube.com/watch?v=KEW9U7s_zks
https://www.youtube.com/watch?v=KEW9U7s_zks
https://www.youtube.com/watch?v=EM7ZR5z3kCo
https://www.youtube.com/watch?v=EM7ZR5z3kCo
https://www.youtube.com/watch?v=6NPHk-Yd4VU
https://www.youtube.com/watch?v=6NPHk-Yd4VU
https://www.youtube.com/watch?v=dHiAls8loz4
https://www.youtube.com/watch?v=dHiAls8loz4
https://www.youtube.com/watch?v=2_mDOWLhkVU
https://www.youtube.com/watch?v=2_mDOWLhkVU
...and so on
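Each video shows up twice in that output, presumably because the thumbnail and the title both link to it. If you want unique links in their original order, a small sketch continuing from the snippet above:
links = ['https://www.youtube.com{}'.format(a['href']) for a in soup.select('a[href*="/watch?"]')]
# dict.fromkeys keeps the first occurrence of each key and preserves insertion order
unique_links = list(dict.fromkeys(links))
print(unique_links)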
