I can't get data from a website using Beautiful Soup (Python) - python-3.x

I have a problem. I am trying to create a web scraping script in Python that gets the titles and links of articles. The page I want to get all the data from is https://ec.europa.eu/commission/presscorner/home/en . The problem is that when I run the code, I don't get anything. Why is that? Here is the code:
from bs4 import BeautifulSoup as bs
import requests

#url_1 = "https://ec.europa.eu/info/news_en?pages=159399#news-block"
url_2 = "https://ec.europa.eu/commission/presscorner/home/en"
links = [url_2]

for i in links:
    site = i
    page = requests.get(site).text
    doc = bs(page, "html.parser")
    # if site == url_1:
    #     h3 = doc.find_all("h3", class_="listing__title")
    #     for b in h3:
    #         title = b.text
    #         link = b.find_all("a")[0]["href"]
    #         if(link[0:5] != "https"):
    #             link = "https://ec.europa.eu" + link
    #         print(title)
    #         print(link)
    #         print()
    if site == url_2:
        ul = doc.find_all("li", class_="ecl-list-item")
        for d in ul:
            title_2 = d.text
            link_2 = d.find_all("a")[0]["href"]
            if(link_2[0:5] != "https"):
                link_2 = "https://ec.europa.eu" + link_2
            print(title_2)
            print(link_2)
            print()
(I also want to get data from another URL, the one commented out in the script, and from that link I do get all the data I want.)

Set a breakpoint after the line page = requests... and you will see the data you pull. The webpage loads most of its contents via JavaScript; that's why you're not able to scrape any data with requests alone.
You can either use Selenium, or a proxy/rendering service that executes the JavaScript for you, though the latter are usually paid services.
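A minimal sketch of the Selenium route, assuming a recent Selenium with Chrome available and reusing the ecl-list-item class from the question (that class name is an assumption and should be verified against the rendered DOM in your browser):
from bs4 import BeautifulSoup as bs
from selenium import webdriver
import time

driver = webdriver.Chrome()
driver.get("https://ec.europa.eu/commission/presscorner/home/en")
time.sleep(5)  # crude wait for the JavaScript-rendered list to appear
doc = bs(driver.page_source, "html.parser")
driver.quit()

# "ecl-list-item" is taken from the question, not verified against the live page
for d in doc.find_all("li", class_="ecl-list-item"):
    a = d.find("a")
    if a and a.get("href"):
        link = a["href"]
        if not link.startswith("https"):
            link = "https://ec.europa.eu" + link
        print(d.get_text(strip=True))
        print(link)
        print()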

Related

Output from web scraping with bs4 returns empty lists

I am trying to scrape specific information from a website of 25 pages, but when I run my code I get empty lists. My output is supposed to be a dictionary with the specific information scraped. Any help would be appreciated.
# Loading libraries
import requests
from bs4 import BeautifulSoup
import pandas as pd
import mitosheet

# Assigning column names using class_ names
name_selector = "af885_1iPzH"
old_price_selector = "f6eb3_1MyTu"
new_price_selector = "d7c0f_sJAqi"
discount_selector = "._6c244_q2qap"

# Placeholder list
data = []

# Looping over each page
for i in range(1, 26):
    url = "https://www.konga.com/category/phones-tablets-5294?brand=Samsung&page=" + str(i)
    website = requests.get(url)
    soup = BeautifulSoup(website.content, 'html.parser')
    name = soup.select(name_selector)
    old_price = soup.select(old_price_selector)
    new_price = soup.select(new_price_selector)
    discount = soup.select(discount_selector)

    # Combining the elements into a zipped list to be able to pull the data simultaneously
    for names, old_prices, new_prices, discounts in zip(name, old_price, new_price, discount):
        dic = {"Phone Names": names.getText(), "New Prices": new_prices.getText(), "Old Prices": old_prices.getText(), "Discounts": discounts.getText()}
        data.append(dic)

data
I tested the below and it works for me, getting 40 name values.
I wasn't able to get the values using Beautiful Soup, but I could get them directly through Selenium.
If you decide to use Chrome and PyCharm as I have, then:
Open Chrome. Click on the three dots near the top right. Click on Settings, then About Chrome, to see your Chrome version. Download the corresponding driver here. Save the driver in the PyCharm PATH folder.
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from webdriver_manager.chrome import ChromeDriverManager

driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))

# Assigning column names using class_ names
name_selector = "af885_1iPzH"

# Looping over each page
for i in range(1, 27):
    url = "https://www.konga.com/category/phones-tablets-5294?brand=Samsung&page=" + str(i)
    driver.get(url)
    xPath = './/*[@class="' + name_selector + '"]'
    name = driver.find_elements(By.XPATH, xPath)
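    # (added sketch, not part of the original answer) pull the visible text
    # out of each matched element while still inside the page loop
    for n in name:
        print(n.text)

# release the browser once all pages have been visited
driver.quit()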

Web scraping from multiple sites

I have a problem. I want to get the title of each news article and the link to the article from multiple websites. Here is the code:
from bs4 import BeautifulSoup as bs
import requests

url_1 = "https://ec.europa.eu/commission/presscorner/home/en"
url_2 = "https://ec.europa.eu/info/news_en?pages=159399#news-block"
link = [url_1, url_2]
i = 1;
while i <= len(link):
    site = link[i-1]
    page = requests.get(site).text
    doc = bs(page, "html.parser")
    h3 = doc.find_all("h3", class_="listing__title")
    for b in h3:
        print(b.text)
        link = b.find_all("a")[0]["href"]
        if(link[0:5] != "https"):
            link = "https://ec.europa.eu" + link
        print(link)
        print()
    i += 1
The problem is that I get an error for an invalid link, and I don't know how to solve it (I know that for the first link I have to search for different tags, but when I use an if statement to decide which site I am scraping, I don't get anything as a result). What can I do in order to solve the problem?
The problem is that you are using the variable link for two things. First you set it to the list of URLs and later, inside the for b in h3 loop you overwrite it.
Change to
from bs4 import BeautifulSoup as bs
import requests

url_1 = "https://ec.europa.eu/commission/presscorner/home/en"
url_2 = "https://ec.europa.eu/info/news_en?pages=159399#news-block"
links = [url_1, url_2]
i = 1;
while i <= len(links):
    site = links[i-1]
    page = requests.get(site).text
    doc = bs(page, "html.parser")
    h3 = doc.find_all("h3", class_="listing__title")
    for b in h3:
        print(b.text)
        link = b.find_all("a")[0]["href"]
        if(link[0:5] != "https"):
            link = "https://ec.europa.eu" + link
        print(link)
        print()
    i += 1
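As a side note (an assumption, not part of the original answer), the manual counter can be dropped entirely by iterating over the list directly, which is the more idiomatic Python form:
for site in links:
    page = requests.get(site).text
    doc = bs(page, "html.parser")
    # ... same h3 parsing and printing as above ...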

BeautifulSoup: Is there a way to set the starting point of find_all() method?

Given a soup I need to get n elements with class="foo".
This can be done by:
soup.find_all(class_='foo', limit=n)
However, this is a slow process, as the elements I'm trying to find are located at the very bottom of the document.
Here is my code:
main_num = 1
main_page = 'https://rawdevart.com/search/?page={p_num}&ctype_inc=0'

# get_soup returns bs4 soup of a link
main_soup = get_soup(main_page.format(p_num=main_num))

# get_last_page returns the number of pages which is 64
last_page_num = get_last_page(main_soup)

for sub_num in range(1, last_page_num+1):
    sub_soup = get_soup(main_page.format(p_num=sub_num))
    arr_links = sub_soup.find_all(class_='head')
    # process arr_links
The class head is an attribute of the a tag on this page, so I assume you want to grab all follow links and keep moving through all the search pages.
Here's how you might want to get that done:
import requests
from bs4 import BeautifulSoup

base_url = "https://rawdevart.com"
total_pages = BeautifulSoup(
    requests.get(f"{base_url}/search/?page=1&ctype_inc=0").text,
    "html.parser",
).find(
    "small",
    class_="d-block text-muted",
).getText().split()[2]

pages = [
    f"{base_url}/search/?page={n}&ctype_inc=0"
    for n in range(1, int(total_pages) + 1)
]

all_follow_links = []
for page in pages[:2]:
    r = requests.get(page).text
    all_follow_links.extend(
        [
            f'{base_url}{a["href"]}' for a in
            BeautifulSoup(r, "html.parser").find_all("a", class_="head")
        ]
    )

print(all_follow_links)
Output:
https://rawdevart.com/comic/my-death-flags-show-no-sign-ending/
https://rawdevart.com/comic/tsuki-ga-michibiku-isekai-douchuu/
https://rawdevart.com/comic/im-not-a-villainess-just-because-i-can-control-darkness-doesnt-mean-im-a-bad-person/
https://rawdevart.com/comic/tensei-kusushi-wa-isekai-wo-meguru/
https://rawdevart.com/comic/iceblade-magician-rules-over-world/
https://rawdevart.com/comic/isekai-demo-bunan-ni-ikitai-shoukougun/
https://rawdevart.com/comic/every-class-has-been-mass-summoned-i-strongest-under-disguise-weakest-merchant/
https://rawdevart.com/comic/isekai-onsen-ni-tensei-shita-ore-no-kounou-ga-tondemosugiru/
https://rawdevart.com/comic/kubo-san-wa-boku-mobu-wo-yurusanai/
https://rawdevart.com/comic/gabriel-dropout/
and more ...
Note: to get all the pages just remove the slicing from this line:
for page in pages[:2]:
# the rest of the loop body
So it looks like this:
for page in pages:
# the rest of the loop body

Web scraper crawler using Breadth First Search in Python

I want to create a web crawler for a Wikipedia page (all the links within the page get opened and saved too) which needs to be implemented in a breadth-first-search way. I have been looking at a lot of sources and Stack Overflow code/problems but am unable to implement it.
I tried the following code :
import requests
from parsel import Selector
import time

start = time.time()

### Crawling to the website fetch links and images -> store images -> crawl more to the fetched links and scrape more images

all_images = {}  # website links as "keys" and image links as "values"

# GET request to recurship site
response = requests.get('https://en.wikipedia.org/wiki/Plant')
selector = Selector(response.text)
href_links = selector.xpath('//a/@href').getall()
image_links = selector.xpath('//img/@src').getall()

for link in href_links:
    try:
        response = requests.get(link)
        if response.status_code == 200:
            image_links = selector.xpath('//img/@src').getall()
            all_images[link] = image_links
    except Exception as exp:
        print('Error navigating to link : ', link)

print(all_images)
end = time.time()
print("Time taken in seconds : ", (end-start))
but this fails with "Error navigating to link" for the links it visits. How do I go about it? I am a total newbie in this field.
Your href_links will be relative paths for the wiki links.
You must prepend Wikipedia's base URL.
base_url = 'https://en.wikipedia.org/'
href_links = [base_url + link for link in selector.xpath('//a/@href').getall()]
Note that this will work for wiki links; if you have external links in href, use something like this:
href_links = []
for link in selector.xpath('//a/@href').getall():
    if not link.startswith('http'):
        href_links.append(base_url + link)
    else:
        href_links.append(link)
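The question also asks for a breadth-first crawl, which neither snippet above implements. A minimal sketch, assuming the same requests/parsel stack, a FIFO queue for breadth-first order, and a depth limit (the limit is my assumption, added only to keep the crawl finite):
import requests
from collections import deque
from parsel import Selector

base_url = 'https://en.wikipedia.org'
start_page = 'https://en.wikipedia.org/wiki/Plant'

all_images = {}                    # page url -> list of image links on it
visited = {start_page}
queue = deque([(start_page, 0)])   # (url, depth); FIFO order gives breadth-first traversal
max_depth = 1                      # assumption: crawl the start page plus its direct links

while queue:
    url, depth = queue.popleft()
    try:
        response = requests.get(url, timeout=10)
    except requests.RequestException:
        print('Error navigating to link : ', url)
        continue
    selector = Selector(response.text)
    all_images[url] = selector.xpath('//img/@src').getall()
    if depth < max_depth:
        for link in selector.xpath('//a/@href').getall():
            if link.startswith('/wiki/'):
                link = base_url + link
            if link.startswith('http') and link not in visited:
                visited.add(link)
                queue.append((link, depth + 1))

print(len(all_images), 'pages crawled')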

How can I scrape data that is not present in the page's source code?

scrape.py
# code to scrape the links from the html
from bs4 import BeautifulSoup
import urllib.request

data = open('scrapeFile', 'r')
html = data.read()
data.close()
soup = BeautifulSoup(html, features="html.parser")

# code to extract links
links = []
for div in soup.find_all('div', {'class': 'main-bar z-depth-1'}):
    # print(div.a.get('href'))
    links.append('https://godamwale.com' + str(div.a.get('href')))
print(links)

file = open("links.txt", "w")
for link in links:
    file.write(link + '\n')
    print(link)
I have successfully got the list of links by using this code. But when I want to scrape the data from those links, their HTML pages don't contain the data in the source code, which makes extracting it tough. I have used the Selenium driver, but it doesn't work well for me.
I want to scrape the data from the link below, which contains data in HTML sections covering customer details, licence and automation, commercial details, floor-wise details, and operational details. I want to extract these data with name, location, contact number and type.
https://godamwale.com/list/result/591359c0d6b269eecc1d8933
That is the link. If someone finds a solution, please share it with me.
Using Developer tools in your browser, you'll notice that whenever you visit that link there is a request to https://godamwale.com/public/warehouse/591359c0d6b269eecc1d8933 that returns a JSON response, probably containing the data you're looking for.
Python 2.x:
import urllib2, json
contents = json.loads(urllib2.urlopen("https://godamwale.com/public/warehouse/591359c0d6b269eecc1d8933").read())
print contents
Python 3.x:
import urllib.request, json
contents = json.loads(urllib.request.urlopen("https://godamwale.com/public/warehouse/591359c0d6b269eecc1d8933").read().decode('UTF-8'))
print(contents)
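As a small aside (an assumption, not part of the original answer), the same endpoint can also be fetched with the requests library, which the rest of this page already uses:
import requests

# same JSON endpoint as above; requests decodes the JSON body directly
contents = requests.get("https://godamwale.com/public/warehouse/591359c0d6b269eecc1d8933").json()
print(contents)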
Here you go. The main problem with the site seems to be that it takes time to load, which is why it was returning an incomplete page source. You have to wait until the page loads completely; notice the time.sleep(8) line in the code below:
from bs4 import BeautifulSoup
import requests
from selenium import webdriver
import time

CHROMEDRIVER_PATH = r"C:\Users\XYZ\Downloads/Chromedriver.exe"
wd = webdriver.Chrome(CHROMEDRIVER_PATH)
response = wd.get("https://godamwale.com/list/result/591359c0d6b269eecc1d8933")
time.sleep(8)  # wait until page loads completely

soup = BeautifulSoup(wd.page_source, 'lxml')
props_list = []
propvalues_list = []

div = soup.find_all('div', {'class': 'row'})
for childtags in div[6].findChildren('div', {'class': 'col s12 m4 info-col'}):
    props = childtags.find("span").contents
    props_list.append(props)
    propvalue = childtags.find("p", recursive=True).contents
    propvalues_list.append(propvalue)

print(props_list)
print(propvalues_list)
Note: the code will return the construction details in two separate lists.
