Web scraper crawler using Breadth First Search in Python - python-3.x

I want to build a web crawler that starts from a Wikipedia page, opens every link on that page and saves the results, traversing the links in breadth-first order. I have looked at many sources and Stack Overflow questions, but I have not been able to implement it.
I tried the following code:
import requests
from parsel import Selector
import time
start = time.time()
### Crawling to the website fetch links and images -> store images -> crawl more to the fetched links and scrape more images
all_images = {}  # website links as "keys" and images link as "values"
# GET request to recurship site
response = requests.get('https://en.wikipedia.org/wiki/Plant')
selector = Selector(response.text)
href_links = selector.xpath('//a/@href').getall()
image_links = selector.xpath('//img/@src').getall()
for link in href_links:
    try:
        response = requests.get(link)
        if response.status_code == 200:
            image_links = selector.xpath('//img/@src').getall()
            all_images[link] = image_links
    except Exception as exp:
        print('Error navigating to link : ', link)
print(all_images)
end = time.time()
print("Time taken in seconds : ", (end-start))
but for every link it prints "Error navigating to link". How do I go about this? I am a complete beginner in this field.

Your href_links are relative paths for wiki links (they start with /wiki/), so requests.get(link) fails on them.
You must prepend the base URL of Wikipedia:
base_url = 'https://en.wikipedia.org'
href_links = [base_url + link for link in selector.xpath('//a/@href').getall()]
Note that this only works for wiki links. If the page also has external links in href, use something like this:
href_links = []
for link in selector.xpath('//a/@href').getall():
    if not link.startswith('http'):
        href_links.append(base_url + link)
    else:
        href_links.append(link)
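Since the question asks for a breadth-first traversal, here is a minimal sketch of how the link fix could fit into a BFS loop. The names LinkParser, extract_links, bfs_crawl and the fetch parameter are hypothetical and not part of the original code. The sketch uses only the standard library and assumes the caller passes in a function that downloads a page, so the traversal logic can be tried without network access.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag


class LinkParser(HTMLParser):
    """Collects the href value of every <a> tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def extract_links(page_url, html_text):
    """Return absolute http(s) links found in html_text, without fragments."""
    parser = LinkParser()
    parser.feed(html_text)
    # urljoin resolves relative paths such as /wiki/... against the page URL
    absolute = (urldefrag(urljoin(page_url, href)).url for href in parser.links)
    return [url for url in absolute if url.startswith("http")]


def bfs_crawl(start_url, fetch, max_pages=20):
    """Visit pages level by level; fetch(url) must return HTML text or None."""
    queue = deque([start_url])
    seen = {start_url}
    order = []
    while queue and len(order) < max_pages:
        url = queue.popleft()  # FIFO order is what makes this breadth-first
        html_text = fetch(url)
        if html_text is None:
            continue
        order.append(url)
        for link in extract_links(url, html_text):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return order
```

With requests, the fetch argument could be something like lambda u: requests.get(u, timeout=10).text. The seen set keeps a page from being queued twice, and max_pages is an assumed safety limit, since a Wikipedia page links to far more pages than you probably want to download.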

Related

I can't get data from a website using beautiful soup(python)

I have a problem. I am trying to write a web scraping script in Python that gets the titles and the links of the articles. The page I want to get the data from is https://ec.europa.eu/commission/presscorner/home/en . The problem is that when I run the code, I don't get anything. Why is that? Here is the code:
from bs4 import BeautifulSoup as bs
import requests
#url_1 = "https://ec.europa.eu/info/news_en?pages=159399#news-block"
url_2 = "https://ec.europa.eu/commission/presscorner/home/en"
links = [url_2]
for i in links:
    site = i
    page = requests.get(site).text
    doc = bs(page, "html.parser")
    # if site == url_1:
    #     h3 = doc.find_all("h3", class_="listing__title")
    #     for b in h3:
    #         title = b.text
    #         link = b.find_all("a")[0]["href"]
    #         if(link[0:5] != "https"):
    #             link = "https://ec.europa.eu" + link
    #         print(title)
    #         print(link)
    #         print()
    if site == url_2:
        ul = doc.find_all("li", class_="ecl-list-item")
        for d in ul:
            title_2 = d.text
            link_2 = d.find_all("a")[0]["href"]
            if(link_2[0:5] != "https"):
                link_2 = "https://ec.europa.eu" + link_2
            print(title_2)
            print(link_2)
            print()
(I also want to get data from another URL, the one commented out in the script, but from that link I do get all the data I want.)
Set a breakpoint after the line page = requests.get(site).text and you will see the data you actually pull. The webpage loads most of its content via JavaScript, so the list items you are looking for are not in the HTML that requests downloads. That's why you're not able to scrape any data.
You can either drive a real browser with Selenium or use a proxy service that renders JavaScript for you, although such services are usually paid.
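As an alternative to a breakpoint, a small check like the following can tell you whether the elements you want exist in the raw HTML at all. This is only a sketch using the standard library; ClassCounter and count_static_matches are hypothetical names, and the HTML strings in the usage note are made up for illustration.

```python
from html.parser import HTMLParser


class ClassCounter(HTMLParser):
    """Counts start tags with a given tag name and CSS class."""

    def __init__(self, tag, css_class):
        super().__init__()
        self.tag = tag
        self.css_class = css_class
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # The class attribute may hold several space-separated class names
        classes = (dict(attrs).get("class") or "").split()
        if tag == self.tag and self.css_class in classes:
            self.count += 1


def count_static_matches(html_text, tag, css_class):
    """Return how many matching elements are present in the downloaded HTML."""
    counter = ClassCounter(tag, css_class)
    counter.feed(html_text)
    return counter.count
```

If count_static_matches(page, "li", "ecl-list-item") returns 0 for the text that requests downloaded, while the browser's inspector shows such items, the content is most likely added by JavaScript after the page loads.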

How to collect URL links for pages that are not numerically ordered

When URLs follow a numeric order, it's simple to fetch all the articles on a given website.
However, a website such as https://mongolia.mid.ru/en_US/novosti has articles with URLs like
https://mongolia.mid.ru/en_US/novosti/-/asset_publisher/hfCjAfLBKGW0/content/10-iula-sostoalas-vstreca-crezvycajnogo-i-polnomocnogo-posla-rossijskoj-federacii-v-mongolii-i-k-azizova-i-ministra-inostrannyh-del-mongolii-n-enhtajv?inheritRedirect=false&redirect=https%3A%2F%2Fmongolia.mid.ru%3A443%2Fen_US%2Fnovosti%3Fp_p_id%3D101_INSTANCE_hfCjAfLBKGW0%26p_p_lifecycle%3D0%26p_p_state%3Dnormal%26p_p_mode%3Dview%26p_p_col_id%3Dcolumn-1%26p_p_col_count%3D1
How do I fetch all the article URLs on this website when there is no numeric order whatsoever?
There's order to that chaos.
If you take a good look at the source code you'll surely notice the next button. If you click it and inspect the url (it's long, I know) you'll see there's a value at the very end of it - _cur=1. This is the number of the current page you're at.
The problem, however, is that you don't know how many pages there are, right? But, you can programmatically keep checking for a url in the next button and stop when there are no more pages to go to.
Meanwhile, you can scrape for article urls while you're at the current page.
Here's how to do it:
import requests
from lxml import html
url = "https://mongolia.mid.ru/en_US/novosti"
next_page_xpath = '//*[@class="pager lfr-pagination-buttons"]/li[2]/a/@href'
article_xpath = '//*[@class="title"]/a/@href'
def get_page(url):
    return requests.get(url).content
def extractor(page, xpath):
    return html.fromstring(page).xpath(xpath)
def head_option(values):
    # First match, or None when the XPath matched nothing
    return next(iter(values), None)
articles = []
while True:
    page = get_page(url)
    print(f"Checking page: {url}")
    articles.extend(extractor(page, article_xpath))
    next_page = head_option(extractor(page, next_page_xpath))
    # Stop when there is no next link or it is the 'javascript:;' placeholder
    if next_page is None or next_page == 'javascript:;':
        break
    url = next_page
print(f"Scraped {len(articles)}.")
# print(articles)
This gets you 216 article urls. If you want to see the article urls, just uncomment the last line - # print(articles)
Here's a sample of 2:
['https://mongolia.mid.ru:443/en_US/novosti/-/asset_publisher/hfCjAfLBKGW0/content/24-avgusta-sostoalas-vstreca-crezvycajnogo-i-polnomocnogo-posla-rossijskoj-federacii-v-mongolii-i-k-azizova-s-ministrom-energetiki-mongolii-n-tavinbeh?inheritRedirect=false&redirect=https%3A%2F%2Fmongolia.mid.ru%3A443%2Fen_US%2Fnovosti%3Fp_p_id%3D101_INSTANCE_hfCjAfLBKGW0%26p_p_lifecycle%3D0%26p_p_state%3Dnormal%26p_p_mode%3Dview%26p_p_col_id%3Dcolumn-1%26p_p_col_count%3D1', 'https://mongolia.mid.ru:443/en_US/novosti/-/asset_publisher/hfCjAfLBKGW0/content/19-avgusta-2020-goda-sostoalas-vstreca-crezvycajnogo-i-polnomocnogo-posla-rossijskoj-federacii-v-mongolii-i-k-azizova-s-zamestitelem-ministra-inostran?inheritRedirect=false&redirect=https%3A%2F%2Fmongolia.mid.ru%3A443%2Fen_US%2Fnovosti%3Fp_p_id%3D101_INSTANCE_hfCjAfLBKGW0%26p_p_lifecycle%3D0%26p_p_state%3Dnormal%26p_p_mode%3Dview%26p_p_col_id%3Dcolumn-1%26p_p_col_count%3D1']

Scraping Site Data with out Selenium

I am trying to pull CMS historical data from their site. I have some working code that pulls the download links from the page. My problem is that the links are split across several pages, so I need to iterate through all the available pages and extract the download links from each. The obvious choice would be to use Selenium to click through the pages and get the data, but due to company policy I cannot run Selenium in this environment. Is there a way to go through the pages and extract the links without it? The website does not show the POST link when you go to the next page, and I am out of ideas for reaching the next page without the POST link or Selenium.
Current working code that pulls the links from the first page:
import pandas as pd
from datetime import datetime
#from selenium import webdriver
from lxml import html
import requests
def http_request_get(url, session=None, payload=None, parse=True):
    """ Sends a GET HTTP request to a website and returns its HTML content and full url address. """
    if payload is None:
        payload = {}
    if session:
        content = session.get(url, params=payload, verify=False, headers={"content-type":"text"})
    else:
        content = requests.get(url, params=payload, verify=False, headers={"content-type":"text"})
    content.raise_for_status() # Raise HTTPError for bad requests (4xx or 5xx)
    if parse:
        return html.fromstring(content.text), content.url
    else:
        return content.text, content.url
def get_html(link):
    """
    Returns a html.
    """
    page_parsed, _ = http_request_get(url=link, payload={'t': ''}, parse=True)
    return page_parsed
cmslink = "https://www.cms.gov/Research-Statistics-Data-and-Systems/Statistics-Trends-and-Reports/MCRAdvPartDEnrolData/Monthly-Contract-and-Enrollment-Summary-Report"
content, _ = http_request_get(url=cmslink,payload={'t':''},parse=True)
linkTable = content.cssselect('td[headers="view-dlf-1-title-table-column"]')[0]
headers = linkTable[0].xpath('//a/@href')
df1 = pd.DataFrame(headers,columns= ['links'])
df1SubSet = df1[df1['links'].str.contains('contract-summary', case=False)]
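These two URLs together give you all 166 entries: they request 100 items per page (items_per_page=100) and differ only in the page parameter (page=0 and page=1). I have also changed the condition for capturing the hrefs. Give this a try.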
cmslinks=[
'https://www.cms.gov/Research-Statistics-Data-and-Systems/Statistics-Trends-and-Reports/MCRAdvPartDEnrolData/Monthly-Contract-and-Enrollment-Summary-Report?items_per_page=100&items_per_page_options%5B5%5D=5%20per%20page&items_per_page_options%5B10%5D=10%20per%20page&items_per_page_options%5B25%5D=25%20per%20page&items_per_page_options%5B50%5D=50%20per%20page&items_per_page_options%5B100%5D=100%20per%20page&combine=&page=0',
'https://www.cms.gov/Research-Statistics-Data-and-Systems/Statistics-Trends-and-Reports/MCRAdvPartDEnrolData/Monthly-Contract-and-Enrollment-Summary-Report?items_per_page=100&items_per_page_options%5B5%5D=5%20per%20page&items_per_page_options%5B10%5D=10%20per%20page&items_per_page_options%5B25%5D=25%20per%20page&items_per_page_options%5B50%5D=50%20per%20page&items_per_page_options%5B100%5D=100%20per%20page&combine=&page=1']
df=pd.DataFrame()
for cmslink in cmslinks:
    print(cmslink)
    content, _ = http_request_get(url=cmslink,payload={'t':''},parse=True)
    linkTable = content.cssselect('td[headers="view-dlf-1-title-table-column"]')[0]
    headers = linkTable[0].xpath("//a[contains(text(),'Contract Summary') or contains(text(),'Monthly Enrollment by CPSC')]/@href")
    df1 = pd.DataFrame(headers,columns= ['links'])
    # DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
    df = pd.concat([df, df1])
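If the number of entries changes, the page URLs can be generated instead of hard-coded. The sketch below builds them from the two query parameters that differ or matter in the URLs above, items_per_page and page. It assumes the server accepts a query string with only those two parameters, which I have not verified; page_urls is a hypothetical helper name.

```python
from urllib.parse import urlencode

BASE_URL = (
    "https://www.cms.gov/Research-Statistics-Data-and-Systems/"
    "Statistics-Trends-and-Reports/MCRAdvPartDEnrolData/"
    "Monthly-Contract-and-Enrollment-Summary-Report"
)


def page_urls(base_url, total_items, items_per_page=100):
    """Build one URL per result page, given the total number of entries."""
    page_count = -(-total_items // items_per_page)  # ceiling division
    return [
        f"{base_url}?{urlencode({'items_per_page': items_per_page, 'page': page})}"
        for page in range(page_count)
    ]
```

For 166 entries at 100 per page, page_urls(BASE_URL, 166) yields two URLs, one ending in page=0 and one in page=1, which can be passed to the loop in place of cmslinks.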

Crawler skipping content of the first page

I've created a crawler that parses certain content from a website.
First, it scrapes the links to the categories from the left sidebar.
Second, it harvests all the links spread through the pagination that lead to the profile listings.
Finally, it goes to each profile page and scrapes the name, phone number and web address.
So far, it is doing well. The only problem I see with this crawler is that it always starts scraping from the second page and skips the first. I suppose there is some way to get around this. Here is the complete code I am using:
import requests
from lxml import html
url="https://www.houzz.com/professionals/"
def category_links(mainurl):
    req=requests.Session()
    response = req.get(mainurl).text
    tree = html.fromstring(response)
    for titles in tree.xpath("//a[@class='sidebar-item-label']/@href"):
        next_pagelink(titles) # links to the category from left-sided bar
def next_pagelink(process_links):
    req=requests.Session()
    response = req.get(process_links).text
    tree = html.fromstring(response)
    for link in tree.xpath("//ul[@class='pagination']//a[@class='pageNumber']/@href"):
        profile_pagelink(link) # the whole links spread through pagination connected to the profile page
def profile_pagelink(procured_links):
    req=requests.Session()
    response = req.get(procured_links).text
    tree = html.fromstring(response)
    for titles in tree.xpath("//div[@class='name-info']"):
        links = titles.xpath(".//a[@class='pro-title']/@href")[0]
        target_pagelink(links) # profile page of each link
def target_pagelink(main_links):
    req=requests.Session()
    response = req.get(main_links).text
    tree = html.fromstring(response)
    def if_exist(titles,xpath):
        info=titles.xpath(xpath)
        if info:
            return info[0]
        return ""
    for titles in tree.xpath("//div[@class='container']"):
        name = if_exist(titles,".//a[@class='profile-full-name']/text()")
        phone = if_exist(titles,".//a[contains(concat(' ', @class, ' '), ' click-to-call-link ')]/@phone")
        web = if_exist(titles,".//a[@class='proWebsiteLink']/@href")
        print(name,phone,web)
category_links(url)
The problem with the first page is that it doesn't have a 'pagination' class, so the expression tree.xpath("//ul[@class='pagination']//a[@class='pageNumber']/@href") returns an empty list and the profile_pagelink function never gets executed.
As a quick fix you can handle this case separately in the category_links function:
def category_links(mainurl):
    response = requests.get(mainurl).text
    tree = html.fromstring(response)
    if mainurl == "https://www.houzz.com/professionals/":
        profile_pagelink("https://www.houzz.com/professionals/")
    for titles in tree.xpath("//a[@class='sidebar-item-label']/@href"):
        next_pagelink(titles)
I also noticed that target_pagelink prints a lot of empty strings as a result of if_exist returning "". You can skip those cases by adding a condition in the for loop:
for titles in tree.xpath("//div[@class='container']"): # use class='profile-cover' if you get duplicates
    name = if_exist(titles,".//a[@class='profile-full-name']/text()")
    phone = if_exist(titles,".//a[contains(concat(' ', @class, ' '), ' click-to-call-link ')]/@phone")
    web = if_exist(titles,".//a[@class='proWebsiteLink']/@href")
    if name+phone+web:
        print(name,phone,web)
Finally, requests.Session is mostly useful for persisting cookies and headers across requests, which is not necessary for your script. You can just use requests.get and get the same results.
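A more general way to avoid skipping a first page, instead of special-casing one URL, is to always treat the page you are currently on as the first page to scrape and then add whatever pagination links it exposes. The helper below is a hypothetical sketch of that idea (pages_to_visit is not part of the original code); it is a pure function, so it can be tried without hitting the site.

```python
def pages_to_visit(current_url, pagination_links):
    """Return current_url followed by its pagination links, without duplicates."""
    ordered = []
    seen = set()
    for link in [current_url, *pagination_links]:
        if link not in seen:
            seen.add(link)  # remember the link so repeated pager entries are skipped
            ordered.append(link)
    return ordered
```

Inside next_pagelink, the loop could then iterate over pages_to_visit(process_links, tree.xpath(...)) and call profile_pagelink on each result, so the category's own page is scraped even when the pager lists only the later pages.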

Python web crawler doesn't crawl all pages

I'm trying to make a web crawler that crawls a set number of pages, but it only crawls the first page, and prints it as many times as the number of pages I want to crawl.
import requests
from bs4 import BeautifulSoup
company_list = []
def web_spider(max_pages):
    page = 1
    while page <= max_pages:
        url = 'http://www.forbes.com/global2000/list/#page:' + str(page) + '_sort:0_direction:asc_search:_filter:All%20industries_' \
              'filter:All%20countries_filter:All%20states'
        source_code = requests.get(url)
        plain_text = source_code.text
        soup = BeautifulSoup(plain_text, 'html.parser')
        for link in soup.findAll('a'):
            if link.parent.name == 'td':
                href = link.get('href')
                x = href[11:len(href)-1]
                company_list.append(x)
        page += 1
        print(page)
    return company_list
Edit: Did it another way.
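Everything after the # in your URL is a fragment. Fragments are handled by the browser and are never sent to the server, so every request in your loop fetches the same page no matter which page number you put there.
If you just want the dataset, you can use your browser's developer tools to find which network resources are used: start recording network traffic and refresh the page to see how the table is populated. In this case I found the following URL: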
https://www.forbes.com/forbesapi/org/global2000/2020/position/true.json?limit=2000
Does that help you?
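To see why the loop in the question keeps getting the first page, you can strip the fragment from the URLs it builds and compare what is left. This is a small sketch using the standard library; request_target is a hypothetical helper name, and the two URLs are shortened versions of the ones built in the question.

```python
from urllib.parse import urldefrag


def request_target(url):
    """Return the URL without its fragment, which is what the server receives."""
    return urldefrag(url).url


page_1 = 'http://www.forbes.com/global2000/list/#page:1_sort:0_direction:asc'
page_2 = 'http://www.forbes.com/global2000/list/#page:2_sort:0_direction:asc'

# Both URLs reduce to the same address once the fragment is removed
same_request = request_target(page_1) == request_target(page_2)
print(request_target(page_1), same_request)
```

Because both URLs reduce to the same address, changing the page number in the fragment cannot change the response. The paging has to happen through the JSON endpoint above or some other request that carries the page in the path or query string.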

Resources