Python web crawler doesn't crawl all pages - python-3.x

I'm trying to make a web crawler that crawls a set number of pages, but it only ever crawls the first page and prints it as many times as the number of pages I want to crawl.
def web_spider(max_pages):
    page = 1
    while page <= max_pages:
        url = 'http://www.forbes.com/global2000/list/#page:' + str(page) + '_sort:0_direction:asc_search:_filter:All%20industries_' \
              'filter:All%20countries_filter:All%20states'
        source_code = requests.get(url)
        plain_text = source_code.text
        soup = BeautifulSoup(plain_text)
        for link in soup.findAll('a'):
            if link.parent.name == 'td':
                href = link.get('href')
                x = href[11:len(href) - 1]
                company_list.append(x)
        page += 1
        print(page)
    return company_list
Edit: Did it another way.

If you want the dataset, you can use your browser's developer tools to see which network resources the page loads: click Record network traffic, refresh the page, and watch how the table is populated. In this case I found the following URL:
https://www.forbes.com/forbesapi/org/global2000/2020/position/true.json?limit=2000
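The #page:... part of the original URL is a fragment, which the browser never sends to the server, so every request in the loop returns the same first page; pulling the JSON endpoint above avoids that. A minimal sketch, assuming only that the endpoint returns JSON (the exact layout may differ, so it just inspects the top-level keys first):
import requests

# JSON endpoint found via the browser's network tab (see above).
url = ('https://www.forbes.com/forbesapi/org/global2000/2020/position/true.json'
       '?limit=2000')

resp = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'}, timeout=10)
resp.raise_for_status()
data = resp.json()

# The response layout is an assumption worth verifying: print the top-level
# keys before digging for the company list.
print(list(data.keys()) if isinstance(data, dict) else type(data))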
Does that help you?

Related

I can't get data from a website using beautiful soup(python)

I have a problem. I am trying to create a web scraping script in Python that gets the titles and links of the articles. The page I want to get all the data from is https://ec.europa.eu/commission/presscorner/home/en . The problem is that when I run the code, I don't get anything. Why is that? Here is the code:
from bs4 import BeautifulSoup as bs
import requests

#url_1 = "https://ec.europa.eu/info/news_en?pages=159399#news-block"
url_2 = "https://ec.europa.eu/commission/presscorner/home/en"
links = [url_2]

for i in links:
    site = i
    page = requests.get(site).text
    doc = bs(page, "html.parser")
    # if site == url_1:
    #     h3 = doc.find_all("h3", class_="listing__title")
    #     for b in h3:
    #         title = b.text
    #         link = b.find_all("a")[0]["href"]
    #         if(link[0:5] != "https"):
    #             link = "https://ec.europa.eu" + link
    #         print(title)
    #         print(link)
    #         print()
    if site == url_2:
        ul = doc.find_all("li", class_="ecl-list-item")
        for d in ul:
            title_2 = d.text
            link_2 = d.find_all("a")[0]["href"]
            if(link_2[0:5] != "https"):
                link_2 = "https://ec.europa.eu" + link_2
            print(title_2)
            print(link_2)
            print()
(I also want to get data from another URL, the one commented out in the script, but from that link I do get all the data I want.)
Set a breakpoint after the line page = requests... and inspect what you actually pull back: the webpage loads most of its content via JavaScript, which is why you're not able to scrape any data with requests alone.
You can either use Selenium, which drives a real browser, or a rendering/proxy service that executes the JavaScript for you (the latter are typically paid services).
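A minimal Selenium sketch of that approach, assuming Chrome is available locally; the ecl-list-item class is taken from the question and may have changed on the live site:
import time

from bs4 import BeautifulSoup
from selenium import webdriver

url = "https://ec.europa.eu/commission/presscorner/home/en"

driver = webdriver.Chrome()   # assumes Chrome (and a matching driver) is installed
driver.get(url)               # JavaScript runs here, unlike requests.get()
time.sleep(5)                 # crude wait for the rendered list; a WebDriverWait
                              # on the selector would be more robust
soup = BeautifulSoup(driver.page_source, "html.parser")
driver.quit()

for item in soup.find_all("li", class_="ecl-list-item"):
    a = item.find("a")
    if a and a.get("href"):
        link = a["href"]
        if not link.startswith("https"):
            link = "https://ec.europa.eu" + link
        print(item.get_text(strip=True), link)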

How to collect URL links for pages that are not numerically ordered

When URLs are numerically ordered, it's simple to fetch all the articles on a given website.
However, when we have a website such as https://mongolia.mid.ru/en_US/novosti where there are articles with URLs like
https://mongolia.mid.ru/en_US/novosti/-/asset_publisher/hfCjAfLBKGW0/content/10-iula-sostoalas-vstreca-crezvycajnogo-i-polnomocnogo-posla-rossijskoj-federacii-v-mongolii-i-k-azizova-i-ministra-inostrannyh-del-mongolii-n-enhtajv?inheritRedirect=false&redirect=https%3A%2F%2Fmongolia.mid.ru%3A443%2Fen_US%2Fnovosti%3Fp_p_id%3D101_INSTANCE_hfCjAfLBKGW0%26p_p_lifecycle%3D0%26p_p_state%3Dnormal%26p_p_mode%3Dview%26p_p_col_id%3Dcolumn-1%26p_p_col_count%3D1
How do I fetch all the article URLs on this website, where there's no numeric order whatsoever?
There's order to that chaos.
If you take a good look at the source code you'll surely notice the next button. If you click it and inspect the url (it's long, I know) you'll see there's a value at the very end of it - _cur=1. This is the number of the current page you're at.
The problem, however, is that you don't know how many pages there are, right? But, you can programmatically keep checking for a url in the next button and stop when there are no more pages to go to.
Meanwhile, you can scrape for article urls while you're at the current page.
Here's how to do it:
import requests
from lxml import html

url = "https://mongolia.mid.ru/en_US/novosti"
next_page_xpath = '//*[@class="pager lfr-pagination-buttons"]/li[2]/a/@href'
article_xpath = '//*[@class="title"]/a/@href'

def get_page(url):
    return requests.get(url).content

def extractor(page, xpath):
    return html.fromstring(page).xpath(xpath)

def head_option(values):
    return next(iter(values), None)

articles = []
while True:
    page = get_page(url)
    print(f"Checking page: {url}")
    articles.extend(extractor(page, article_xpath))
    next_page = head_option(extractor(page, next_page_xpath))
    if next_page == 'javascript:;':
        break
    url = next_page

print(f"Scraped {len(articles)}.")
# print(articles)
This gets you 216 article URLs. If you want to see them, just uncomment the last line, # print(articles).
Here's a sample of 2:
['https://mongolia.mid.ru:443/en_US/novosti/-/asset_publisher/hfCjAfLBKGW0/content/24-avgusta-sostoalas-vstreca-crezvycajnogo-i-polnomocnogo-posla-rossijskoj-federacii-v-mongolii-i-k-azizova-s-ministrom-energetiki-mongolii-n-tavinbeh?inheritRedirect=false&redirect=https%3A%2F%2Fmongolia.mid.ru%3A443%2Fen_US%2Fnovosti%3Fp_p_id%3D101_INSTANCE_hfCjAfLBKGW0%26p_p_lifecycle%3D0%26p_p_state%3Dnormal%26p_p_mode%3Dview%26p_p_col_id%3Dcolumn-1%26p_p_col_count%3D1', 'https://mongolia.mid.ru:443/en_US/novosti/-/asset_publisher/hfCjAfLBKGW0/content/19-avgusta-2020-goda-sostoalas-vstreca-crezvycajnogo-i-polnomocnogo-posla-rossijskoj-federacii-v-mongolii-i-k-azizova-s-zamestitelem-ministra-inostran?inheritRedirect=false&redirect=https%3A%2F%2Fmongolia.mid.ru%3A443%2Fen_US%2Fnovosti%3Fp_p_id%3D101_INSTANCE_hfCjAfLBKGW0%26p_p_lifecycle%3D0%26p_p_state%3Dnormal%26p_p_mode%3Dview%26p_p_col_id%3Dcolumn-1%26p_p_col_count%3D1']

Web scraper crawler using Breadth First Search in Python

I want to create a web crawler for a Wikipedia page (all the links within the page get opened and saved too) which needs to be implemented in a breadth-first-search way. I have been looking at a lot of sources and Stack Overflow code/problems but have been unable to implement it.
I tried the following code :
import requests
from parsel import Selector
import time

start = time.time()

### Crawling to the website fetch links and images -> store images -> crawl more to the fetched links and scrape more images

all_images = {}  # website links as "keys" and images link as "values"

# GET request to recurship site
response = requests.get('https://en.wikipedia.org/wiki/Plant')
selector = Selector(response.text)
href_links = selector.xpath('//a/@href').getall()
image_links = selector.xpath('//img/@src').getall()

for link in href_links:
    try:
        response = requests.get(link)
        if response.status_code == 200:
            image_links = selector.xpath('//img/@src').getall()
            all_images[link] = image_links
    except Exception as exp:
        print('Error navigating to link : ', link)

print(all_images)

end = time.time()
print("Time taken in seconds : ", (end-start))
but this prints an error saying "Error navigating to link" for every link. How do I go about fixing it? I am a total newbie in this field.
Your href_links will be relative paths for wiki links.
You must prepend Wikipedia's base URL.
base_url = 'https://en.wikipedia.org/'
href_links = [base_url + link for link in selector.xpath('//a/@href').getall()]
Note that this will only work for wiki-relative links; if you have external links in href, use something like this:
href_links = []
for link in selector.xpath('//a/@href').getall():
    if not link.startswith('http'):
        href_links.append(base_url + link)
    else:
        href_links.append(link)
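A slightly more robust variant of the same idea (not from the original answer) is urllib.parse.urljoin, which leaves absolute links untouched and also copes with protocol-relative links such as //upload.wikimedia.org/...:
from urllib.parse import urljoin

import requests
from parsel import Selector

base_url = 'https://en.wikipedia.org/wiki/Plant'
selector = Selector(requests.get(base_url).text)

# Resolve every href against the page it was found on.
href_links = [urljoin(base_url, link)
              for link in selector.xpath('//a/@href').getall()]
print(href_links[:10])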

How to change the page no. of a page's search result for Web Scraping using Python?

I am scraping data from a webpage which contains search results using Python.
I am able to scrape data from the 1st search result page.
I want to loop using the same code, changing the search result page with each loop cycle.
Is there any way to do it? Is there a way to click the 'Next' button without actually opening the page in a browser?
At a high level this is possible; you will need to use requests or Selenium in addition to BeautifulSoup.
Here is an example of locating an element and clicking the button by XPath:
# assumes 'driver' is an existing Selenium WebDriver and sleep is time.sleep
html = driver.page_source
soup = BeautifulSoup(html, 'lxml')
sleep(1)  # Time in seconds.
ele = driver.find_element_by_xpath("/html/body/div[1]/div/div/div/div[2]/div[2]/div/table/tfoot/tr/td/div//button[contains(text(),'Next')]")
ele.click()
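A fuller sketch of that loop with the driver setup included; the URL and the 'Next' XPath are placeholders for whatever site you are actually scraping:
from time import sleep

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()                      # assumes a local Chrome install
driver.get("https://example.com/search?q=term")  # placeholder search URL

while True:
    soup = BeautifulSoup(driver.page_source, "lxml")
    # ... parse the current result page with soup here ...
    try:
        next_btn = driver.find_element(By.XPATH, "//button[contains(text(),'Next')]")
    except NoSuchElementException:
        break                                    # no more result pages
    next_btn.click()
    sleep(1)                                     # let the next page load

driver.quit()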
Yes, of course you can do what you described. Although you didn't post the actual page you're working with, here is an example to help you get started.
import requests
from bs4 import BeautifulSoup

url = "http://www.bolsamadrid.es/ing/aspx/Empresas/Empresas.aspx"

res = requests.get(url, headers={"User-Agent": "Mozilla/5.0"})
soup = BeautifulSoup(res.text, "lxml")

for page in range(7):
    formdata = {}
    for item in soup.select("#aspnetForm input"):
        if "ctl00$Contenido$GoPag" in item.get("name"):
            formdata[item.get("name")] = page
        else:
            formdata[item.get("name")] = item.get("value")

    req = requests.post(url, data=formdata)
    soup = BeautifulSoup(req.text, "lxml")
    for items in soup.select("#ctl00_Contenido_tblEmisoras tr")[1:]:
        data = [item.get_text(strip=True) for item in items.select("td")]
        print(data)

Beautifulsoup get rid of embedded js and css in html

I need to parse multiple HTML pages fetched through requests.get(). I just need to keep the content of each page and get rid of the embedded JavaScript and CSS. I saw the following posts but no solution works for me:
http://stackoverflow.com/questions/14344476/how-to-strip-entire-html-css-and-js-code-or-tags-from-html-page-in-python, http://stackoverflow.com/questions/1936466/beautifulsoup-grab-visible-webpage-text, and http://stackoverflow.com/questions/2081586/web-scraping-with-python
I have code that runs, but it doesn't strip the js or the css... here is my code...
count = 1
for link in clean_urls[:2]:
    page = requests.get(link, timeout=5)
    try:
        page = BeautifulSoup(page.content, 'html.parser').text
        webpage_out = open(my_params['q'] + '_' + str(count) + '.txt', 'w')
        webpage_out.write(clean_page)
        count += 1
    except:
        pass
    webpage_out.close()
I tried to include the solutions from the links mentioned above but no code works for me.
What line of code can get rid of the embedded js and embedded css?
Question Update 4 OCT 2016
The CSV file that I read is something like this...
trump,clinton
data science, operating system
windows,linux
diabetes,cancer
I hit gigablast.com with those terms, searching one row at a time. One search would be trump clinton. The result is a list of URLs. I requests.get(url) and process those URLs, getting rid of timeouts and status_code 400s, and building a clean list in clean_urls = []. After that I fire the following code...
count = 1
for link in clean_urls[:2]:
    page = requests.get(link, timeout=5)
    try:
        page = BeautifulSoup(page.content, 'html.parser').text
        webpage_out = open(my_params['q'] + '_' + str(count) + '.txt', 'w')
        webpage_out.write(clean_page)
        count += 1
    except:
        pass
    webpage_out.close()
On the line page = BeautifulSoup(page.content, 'html.parser').text I get the text of the entire web page, including styles and scripts if they were embedded. I can't target them with BeautifulSoup afterwards because the tags are no longer there. I did try page = BeautifulSoup(page.content, 'html.parser') with find_all('<script>') to get rid of the scripts, but I ended up erasing the entire file. The desired outcome would be all the text of the HTML without any...
body {
font: something;
}
or any javascript...
$(document).ready(function(){
    $some code
});
The final file should have no code whatsoever, just the content of the document.
I used this code to get rid of JavaScript and CSS code while scraping an HTML page:
import requests
from bs4 import BeautifulSoup

url = 'https://corporate.walmart.com/our-story/our-business'
r = requests.get(url)
html_doc = r.text
soup = BeautifulSoup(html_doc, 'html.parser')
title = soup.title.string

# remove all embedded script and style elements
for script in soup(["script", "style"]):
    script.decompose()

with open('output_file.txt', "a") as text_file:
    text_file.write("\nURL : " + url)
    text_file.write("\nTitle : " + title)
    for p_tag_data in soup.find_all('p'):
        text_file.write("\n" + p_tag_data.text)
    for li_tag_data in soup.find_all('li'):
        text_file.write("\n" + li_tag_data.text)
    for div_tag_data in soup.find_all('div'):
        text_file.write("\n" + div_tag_data.text)
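Since the goal is all of the visible text rather than just the p, li and div tags, a shorter variant (my sketch, not part of the original answer) is to call get_text() on the soup after decomposing the script and style tags:
import requests
from bs4 import BeautifulSoup

url = 'https://corporate.walmart.com/our-story/our-business'
soup = BeautifulSoup(requests.get(url).text, 'html.parser')

# Drop embedded JavaScript and CSS first, then keep everything that is left.
for tag in soup(["script", "style"]):
    tag.decompose()

clean_text = soup.get_text(separator="\n", strip=True)

with open('output_file.txt', 'w') as text_file:
    text_file.write(clean_text)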
