Web scraping from multiple sites - python-3.x

I have a problem. I want to take the titles of news articles and the links to those articles from multiple websites. Here is the code:
from bs4 import BeautifulSoup as bs
import requests

url_1 = "https://ec.europa.eu/commission/presscorner/home/en"
url_2 = "https://ec.europa.eu/info/news_en?pages=159399#news-block"
link = [url_1, url_2]
i = 1
while i <= len(link):
    site = link[i-1]
    page = requests.get(site).text
    doc = bs(page, "html.parser")
    h3 = doc.find_all("h3", class_="listing__title")
    for b in h3:
        print(b.text)
        link = b.find_all("a")[0]["href"]
        if(link[0:5] != "https"):
            link = "https://ec.europa.eu" + link
        print(link)
        print()
    i += 1
The problem is that I get an error about an invalid link and I don't know how to solve it (I know that for the first link I have to search for different tags, but when I use an if statement to decide which site I am scraping, I don't get anything as a result). What can I do to solve the problem?

The problem is that you are using the variable link for two things. First you set it to the list of URLs, and later, inside the for b in h3 loop, you overwrite it with a single href, so on the next pass of the while loop you index into a string and request an invalid URL.
Change it to:
from bs4 import BeautifulSoup as bs
import requests

url_1 = "https://ec.europa.eu/commission/presscorner/home/en"
url_2 = "https://ec.europa.eu/info/news_en?pages=159399#news-block"
links = [url_1, url_2]
i = 1
while i <= len(links):
    site = links[i-1]
    page = requests.get(site).text
    doc = bs(page, "html.parser")
    h3 = doc.find_all("h3", class_="listing__title")
    for b in h3:
        print(b.text)
        link = b.find_all("a")[0]["href"]
        if(link[0:5] != "https"):
            link = "https://ec.europa.eu" + link
        print(link)
        print()
    i += 1
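As a side note, the manual index bookkeeping is not needed at all; iterating over the list directly sidesteps the naming clash. A minimal sketch of that variant, with the same URLs and tags as above:

from bs4 import BeautifulSoup as bs
import requests

links = ["https://ec.europa.eu/commission/presscorner/home/en",
         "https://ec.europa.eu/info/news_en?pages=159399#news-block"]

for site in links:
    doc = bs(requests.get(site).text, "html.parser")
    for b in doc.find_all("h3", class_="listing__title"):
        print(b.text)
        href = b.find_all("a")[0]["href"]    # a different name from the list of URLs
        if not href.startswith("https"):
            href = "https://ec.europa.eu" + href
        print(href)
        print()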

Related

I can't get data from a website using Beautiful Soup (Python)

I have a problem. I am trying to create a web scraping script using Python that gets the titles and the links of the articles. The site I want to get all the data from is https://ec.europa.eu/commission/presscorner/home/en . The problem is that when I run the code, I don't get anything. Why is that? Here is the code:
from bs4 import BeautifulSoup as bs
import requests

#url_1 = "https://ec.europa.eu/info/news_en?pages=159399#news-block"
url_2 = "https://ec.europa.eu/commission/presscorner/home/en"
links = [url_2]

for i in links:
    site = i
    page = requests.get(site).text
    doc = bs(page, "html.parser")
    # if site == url_1:
    #     h3 = doc.find_all("h3", class_="listing__title")
    #     for b in h3:
    #         title = b.text
    #         link = b.find_all("a")[0]["href"]
    #         if(link[0:5] != "https"):
    #             link = "https://ec.europa.eu" + link
    #         print(title)
    #         print(link)
    #         print()
    if site == url_2:
        ul = doc.find_all("li", class_="ecl-list-item")
        for d in ul:
            title_2 = d.text
            link_2 = d.find_all("a")[0]["href"]
            if(link_2[0:5] != "https"):
                link_2 = "https://ec.europa.eu" + link_2
            print(title_2)
            print(link_2)
            print()
(I also want to get data from another URL, the one commented out in the script, but from that link I do get all the data I want.)
Set a breakpoint after the line page = requests... and inspect the data you actually pull: the webpage loads most of its contents via JavaScript, which is why you are not able to scrape any data with requests alone.
You can either use Selenium or a proxy/rendering service that executes JavaScript, though the rendering services are typically paid.
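A minimal sketch of the Selenium route, assuming a local Chrome install with chromedriver on the PATH and reusing the li class from the question; any Selenium-supported browser works the same way:

import time
from bs4 import BeautifulSoup as bs
from selenium import webdriver

driver = webdriver.Chrome()                    # assumes chromedriver is on the PATH
driver.get("https://ec.europa.eu/commission/presscorner/home/en")
time.sleep(5)                                  # crude wait for the scripts to finish; an explicit wait is more robust
doc = bs(driver.page_source, "html.parser")    # HTML after JavaScript has run
driver.quit()

for d in doc.find_all("li", class_="ecl-list-item"):
    anchors = d.find_all("a")
    if not anchors:
        continue
    link = anchors[0]["href"]
    if not link.startswith("https"):
        link = "https://ec.europa.eu" + link
    print(d.text.strip())
    print(link)
    print()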

Scraping reports from a website using BeautifulSoup in python

I am trying to download reports from a company's website, https://www.investorab.com/investors-media/reports-presentations/. In the end, I would like to download all the available reports.
I have next to no experience in web scraping, so I have some trouble defining the correct search pattern. Previously I have needed to pull out all links containing PDFs, i.e. I could use soup.select('div[id="id-name"] a[data-type="PDF"]'). But for this website, no data type is listed on the links. How do I select all links under "Reports and presentations"? Here is what I have tried, but it returns an empty list:
from bs4 import BeautifulSoup
import requests
url = "https://www.investorab.com/investors-media/reports-presentations/"
response = requests.get(url)
soup = BeautifulSoup(response.text, 'lxml')
# Select all reports, publication_dates
reports = soup.select('div[class="html not-front not-logged-in no-sidebars page-events-archive i18n-en"] a[href]')
pub_dates = soup.select('div[class="html not-front not-logged-in no-sidebars page-events-archive i18n-en"] div[class="field-content"]')
I would also like to select all publications date, but also ends up with an empty list. Any help in the right direction is appreciated.
What you'll need to do is iterate through the pages; what I did was just iterate through the year parameter. Once you have the list for a year, get the link of each report, then within each link find the PDF link. You then use that PDF link to write the file:
from bs4 import BeautifulSoup
import requests
import os

# Gets all the links
linkList = []
url = 'https://vp053.alertir.com/v3/en/events-archive?'
for year in range(1917, 2021):
    query = 'type%5B%5D=report&type%5B%5D=annual_report&type%5B%5D=cmd&type%5B%5D=misc&year%5Bvalue%5D%5Byear%5D=' + str(year)
    response = requests.get(url + query)
    soup = BeautifulSoup(response.text, 'html.parser')
    links = soup.find_all('a', href=True)
    linkList += [link['href'] for link in links if 'v3' in link['href']]
    print('Gathered links for year %s.' % year)

# Go to each link and download the PDFs within them
print('Downloading PDFs...')
for link in linkList:
    url = 'https://vp053.alertir.com' + link
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')
    for pdflink in soup.select("a[href$='.pdf']"):
        folder_location = 'C:/test/pdfDownloads/'
        if not os.path.exists(folder_location):
            os.mkdir(folder_location)
        try:
            filename = os.path.join(folder_location, pdflink['href'].split('/')[-1])
            with open(filename, 'wb') as f:
                f.write(requests.get('https://vp053.alertir.com' + pdflink['href']).content)
            print('Saved: %s' % pdflink['href'].split('/')[-1])
        except Exception as ex:
            print('%s not saved. %s' % (pdflink['href'], ex))
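One small refinement, not in the original answer: the destination folder only needs to be created once, so the existence check can be hoisted above both loops. A minimal sketch, keeping the same illustrative path:

import os

folder_location = 'C:/test/pdfDownloads/'      # same illustrative path as above
os.makedirs(folder_location, exist_ok=True)    # create once, before any downloads start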

Python/Selenium - how to loop through hrefs in <li>?

Web URL: https://www.ipsos.com/en-us/knowledge/society/covid19-research-in-uncertain-times
I want to parse the HTML as below:
I want to get all hrefs within the <li> elements and the highlighted text. I tried this code:
elementList = driver.find_element_by_class_name('block-wysiwyg').find_elements_by_tag_name("li")
for i in range(len(elementList)):
    driver.find_element_by_class_name('blcokwysiwyg').find_elements_by_tag_name("li").get_attribute("href")
But the block returned none.
Can anyone please help me with the above code?
I suppose the following will fetch you the required content.
import requests
from bs4 import BeautifulSoup

link = 'https://www.ipsos.com/en-us/knowledge/society/covid19-research-in-uncertain-times'
r = requests.get(link)
soup = BeautifulSoup(r.text, "html.parser")

for item in soup.select(".block-wysiwyg li"):
    item_text = item.get_text(strip=True)
    item_link = item.select_one("a[href]").get("href")
    print(item_text, item_link)
Try it this way:
coronas = driver.find_element_by_xpath("//div[@class='block-wysiwyg']/ul/li")
hr = coronas.find_element_by_xpath('./a')
print(coronas.text)
print(hr.get_attribute('href'))
Output:
The coronavirus is touching the lives of all Americans, but race, age, and income play a big role in the exact ways the virus — and the stalled economy — are affecting people. Here's what that means.
https://www.ipsos.com/en-us/america-under-coronavirus
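Since find_element_by_xpath only returns the first match, here is a short sketch of the looping variant with the same Selenium 3 style API, in case you want every list item:

items = driver.find_elements_by_xpath("//div[@class='block-wysiwyg']/ul/li")
for item in items:
    anchors = item.find_elements_by_xpath('./a')   # may be empty for plain list items
    print(item.text)
    if anchors:
        print(anchors[0].get_attribute('href'))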

Scraping wikipedia infobox geography vcard

I have been trying to scrape the data from the Website row of the various cities' vcard tables on Wikipedia, but somehow I get the results for the Coordinates section, which is located at the beginning of the table.
I have tried specifying "Website" while selecting the specific tags in the table.
def getAdditionalInfo(url):
    try:
        city_page = PageContent('https://en.wikipedia.org' + url)
        table = city_page.find('table', {'class' : 'infobox geography vcard'})
        additional_details = []
        read_content = False
        for tr in table.find_all('tr'):
            if (tr.get('class') == ['mergedtoprow'] and not read_content):
                link = tr.find('th')
                if (link and (link.get_text().strip() == 'Website')):
                    read_content = True
            elif ((tr.get('class') == ['mergedbottomrow']) or tr.get('class') == ['mergedrow'] and read_content):
                additional_details.append(tr.find('td').get_text().strip('\n'))
        return additional_details
    except Exception as error:
        print('Error occured: {}'.format(error))
        return []
I want to append this data as a new column which shows the website link for each city's official page, which I would be getting from this function.
With bs4 4.7.1 you can use :contains to target the Website table header and then get the href attribute of the a tag in the next td. Clearly there are other cases where this pattern could match, so perhaps some other form of validation is required on input values.
You could add an additional class selector for the vcard if you wish: result = soup.select_one('.vcard th:contains(Website) + td > [href]')
Python
import requests
from bs4 import BeautifulSoup as bs

cities = ['Paris', 'Frankfurt', 'London']
base = 'https://en.wikipedia.org/wiki/'

with requests.Session() as s:
    for city in cities:
        r = s.get(base + city)
        soup = bs(r.content, 'lxml')
        result = soup.select_one('th:contains(Website) + td > [href]')
        if result is None:
            print(city, 'selector failed to find url')
        else:
            print(city, result['href'])
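Note that newer Soup Sieve releases deprecate :contains in favour of :-soup-contains; on a recent install the same selector would read:

result = soup.select_one('th:-soup-contains("Website") + td > [href]')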
If I understand the problem correctly, you want to extract the official URL of each city from Wikipedia:
import requests
from bs4 import BeautifulSoup

def getAdditionalInfo(url):
    soup = BeautifulSoup(requests.get('https://en.wikipedia.org' + url).text, 'lxml')
    for th in soup.select('.vcard th'):
        if not th.text.lower() == 'website':
            continue
        yield th.parent.select_one('td').text

cities = ['/wiki/Paris', '/wiki/London', '/wiki/Madrid']
for city in cities:
    for info in getAdditionalInfo(city):
        print(f'{city}: {info}')
This prints:
/wiki/Paris: www.paris.fr
/wiki/London: london.gov.uk
/wiki/Madrid: www.madrid.es
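The generator above yields the visible link text (www.paris.fr and so on). If you want the href attribute of the anchor instead, a minimal variation of the same function (the name getWebsiteHref is just illustrative):

def getWebsiteHref(url):
    # Same lookup as above, but return the anchor's href instead of its text
    soup = BeautifulSoup(requests.get('https://en.wikipedia.org' + url).text, 'lxml')
    for th in soup.select('.vcard th'):
        if th.text.lower() != 'website':
            continue
        a = th.parent.select_one('td a[href]')
        if a:
            yield a['href']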

Scraping the stackoverflow user data

import requests
from bs4 import BeautifulSoup
import csv

response = requests.get('https://stackoverflow.com/users?page=3&tab=reputation&filter=week').text
soup = BeautifulSoup(response, 'lxml')

for items in soup.select('.user-details'):
    name = items.select("a")[0].text
    location = items.select(".user-location")[0].text
    reputation = items.select(".reputation-score")[0].text
    print(name, location, reputation)
    with open('stackdata.csv', 'a', newline='') as csv_file:
        writer = csv.writer(csv_file)
        writer.writerow([name, location, reputation])
When we change the URL in this code, the output remains the same.
I came across a similar problem. The solution that works for me is using Selenium. Though I used a headless browser, i.e. PhantomJS, I assume it should work for other browsers too.
from selenium import webdriver

driver = webdriver.PhantomJS('/home/practice/selenium/webdriver/phantomjs/bin/phantomjs')

users = []
page_num = 1
driver.get('https://stackoverflow.com/users?page={page_num}&tab=reputation&filter=week'.format(page_num=page_num))
content = driver.find_element_by_id('content')
for details in content.find_elements_by_class_name('user-details'):
    users.append(details.text)
print(users)
Change the page_num to get the desired result.
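To walk several pages rather than a single one, the same call can sit inside a loop; a brief sketch reusing the driver above, with an illustrative page range:

users = []
for page_num in range(1, 4):   # illustrative range; adjust to the pages you need
    driver.get('https://stackoverflow.com/users?page={page_num}&tab=reputation&filter=week'.format(page_num=page_num))
    content = driver.find_element_by_id('content')
    for details in content.find_elements_by_class_name('user-details'):
        users.append(details.text)
print(users)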
Hope this will help!
