Scraping reports from a website using BeautifulSoup in python - python-3.x

I am trying to download reports from a company's website, https://www.investorab.com/investors-media/reports-presentations/. In the end, I would like to download all the available reports.
I have next to no experience with web scraping, so I have some trouble defining the correct search pattern. Previously I have needed to extract all links containing PDFs, i.e. I could use soup.select('div[id="id-name"] a[data-type="PDF"]'). But for this website there is no data type listed on the links. How do I select all links under "Reports and presentations"? Here is what I have tried, but it returns an empty list:
from bs4 import BeautifulSoup
import requests
url = "https://www.investorab.com/investors-media/reports-presentations/"
response = requests.get(url)
soup = BeautifulSoup(response.text, 'lxml')
# Select all reports, publication_dates
reports = soup.select('div[class="html not-front not-logged-in no-sidebars page-events-archive i18n-en"] a[href]')
pub_dates = soup.select('div[class="html not-front not-logged-in no-sidebars page-events-archive i18n-en"] div[class="field-content"]')
I would also like to select all publication dates, but that also ends up as an empty list. Any help in the right direction is appreciated.

What you'll need to do is iterate through the pages, or what I did was just iterate through the year parameter. Once you get the list for the year, get the link of each report, then within each link, find the pdf link. You'll then use that pdf link to write to file:
from bs4 import BeautifulSoup
import requests
import os
# Gets all the links
linkList = []
url = 'https://vp053.alertir.com/v3/en/events-archive?'
for year in range(1917,2021):
    query = 'type%5B%5D=report&type%5B%5D=annual_report&type%5B%5D=cmd&type%5B%5D=misc&year%5Bvalue%5D%5Byear%5D=' + str(year)
    response = requests.get(url + query)
    soup = BeautifulSoup(response.text, 'html.parser')
    links = soup.find_all('a', href=True)
    linkList += [link['href'] for link in links if 'v3' in link['href']]
    print('Gathered links for year %s.' % year)
# Go to each link and get the pdfs within them
print('Downloading PDFs...')
for link in linkList:
    url = 'https://vp053.alertir.com' + link
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')
    for pdflink in soup.select("a[href$='.pdf']"):
        folder_location = 'C:/test/pdfDownloads/'
        if not os.path.exists(folder_location):
            os.mkdir(folder_location)
        try:
            filename = os.path.join(folder_location, pdflink['href'].split('/')[-1])
            with open(filename, 'wb') as f:
                f.write(requests.get('https://vp053.alertir.com' + pdflink['href']).content)
            print('Saved: %s' % pdflink['href'].split('/')[-1])
        except Exception as ex:
            print('%s not saved. %s' % (pdflink['href'], ex))
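As a side note, the hand-encoded query string above can also be built with urllib.parse.urlencode, which percent-encodes the bracketed parameter names for you. A minimal sketch, under the assumption that the archive endpoint expects exactly these parameters:
from urllib.parse import urlencode

year = 2020  # in the loop above this comes from range(1917, 2021)
params = [
    ('type[]', 'report'),
    ('type[]', 'annual_report'),
    ('type[]', 'cmd'),
    ('type[]', 'misc'),
    ('year[value][year]', year),
]
query = urlencode(params)  # produces the same %5B/%5D-escaped string as above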

Related

How can I ensure that relative links are saved as absolute URLs in the output file?

I need to develop a web-links scraper program in Python that extracts all of the unique web links that point to other web pages from the HTML code of the "Current Estimates" web link, both within the "US Census Bureau" website (see web link below) and outside that domain, and that writes them to a comma-separated values (CSV) file as absolute uniform resource identifiers (URIs).
I use the code below in a Jupyter Notebook, and it seems to generate a CSV, but part of my code produces a doubled scheme (http:https://) on links that are already absolute, when it should only be adding the prefix to relative links:
http:https://www.census.gov/data/training-workshops.html
http:https://www.census.gov/programs-surveys/sis.html
I need better code that can change the relative links to absolute ones. I believe full_url = urljoin(url, link.get("href")) should be doing this, but something is incorrect.
How can I ensure that relative links are saved as absolute URLs in the output file?
import requests
from bs4 import BeautifulSoup, SoupStrainer
import csv
from urllib.parse import urljoin
import re
url = 'https://www.census.gov/programs-surveys/popest.html'
r = requests.get(url)
raw_html = r.text
print(r.text)
soup = BeautifulSoup(raw_html, 'html.parser')
print(soup.prettify())
for link in soup.find_all('a', href=True):
    full_url = urljoin(url, link.get("href"))
    print(link.get('href'))
links_set = set()
for link in soup.find_all(href=re.compile('a')):
    print(link.get('href'))
for item in soup.find_all('a', href=re.compile(r'html')):
    links_set.add(item.get('href'))
links = [x[:1]=='http' and x or 'http:'+x for x in links_set]
with open("C996FinalAssignment.csv", "w") as csv_file:
    writer = csv.writer(csv_file, delimiter="\n")
    writer.writerow(links)
Try this.
import requests
import csv
from simplified_scrapy.simplified_doc import SimplifiedDoc
url = 'https://www.census.gov/programs-surveys/popest.html'
r = requests.get(url)
raw_html = r.text
print(r.text)
doc = SimplifiedDoc(raw_html)
lstA = doc.listA(url=url) # It will help you turn relative links into absolute links
links = [a.url for a in lstA]
with open("C996FinalAssignment.csv", "w") as csv_file:
writer = csv.writer(csv_file,delimiter="\n")
writer.writerow(links)
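For what it's worth, the doubled prefix in the question comes from x[:1] == 'http': x[:1] is a single character, so the test is always false and 'http:' gets prepended to every link, absolute ones included. If you would rather stay with requests and BeautifulSoup, urljoin alone already does the relative-to-absolute conversion; a minimal sketch, assuming every <a href> on the page should end up in the CSV:
import csv
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

url = 'https://www.census.gov/programs-surveys/popest.html'
soup = BeautifulSoup(requests.get(url).text, 'html.parser')
# urljoin leaves absolute URLs untouched and resolves relative ones against the page URL
links = {urljoin(url, a['href']) for a in soup.find_all('a', href=True)}
with open("C996FinalAssignment.csv", "w", newline='') as csv_file:
    writer = csv.writer(csv_file)
    writer.writerows([[link] for link in sorted(links)])  # one absolute URL per row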

Using Beautifulsoup to parse a big comment?

I'm using BS4 to parse the webpage requested in the code below.
You'll notice there are two separate tables on the page. Here's the relevant snippet of my code, which successfully returns the data I want from the first table, but does not find anything from the second table:
# import packages
import urllib3
import certifi
from bs4 import BeautifulSoup
import pandas as pd
#settings
http = urllib3.PoolManager(
    cert_reqs='CERT_REQUIRED',
    ca_certs=certifi.where())
gamelog_offense = []
#scrape the data and write the .csv files
url = "https://www.sports-reference.com/cfb/schools/florida/2018/gamelog/"
response = http.request('GET', url)
soup = BeautifulSoup(response.data, features="html.parser")
cnt = 0
for row in soup.findAll('tr'):
    try:
        col = row.findAll('td')
        Pass_cmp = col[4].get_text()
        Pass_att = col[5].get_text()
        gamelog_offense.append([Pass_cmp, Pass_att])
        cnt += 1
    except:
        pass
print("Finished writing with " + str(cnt) + " records")
Finished writing with 13 records
I've verified that the data from the SECOND table is contained within the soup (I can see it!). After lots of troubleshooting, I've discovered that the entire second table is completely contained within one big comment (why?). I've managed to extract this comment into a single comment object using the code below, but can't figure out what to do with it after that to extract the data I want. Ideally, I'd like to parse the comment the same way I'm successfully parsing the first table. I've tried using the ideas from similar Stack Overflow questions (Selenium, PhantomJS)... no luck.
import bs4
defense = soup.find(id="all_defense")
for item in defense.children:
    if isinstance(item, bs4.element.Comment):
        big_comment = item
print(big_comment)
<div class="table_outer_container">
<div class="overthrow table_container" id="div_defense">
...and so on....
Posting an answer here in case others find it helpful. Many thanks to @TomasCarvalho for directing me to a solution. I was able to pass the big comment as HTML into a second soup instance using the following code, and then just used the original parsing code on the new soup instance. (Note: the try/except is there because some of the teams have no gamelog, and you can't call .children on a NoneType.)
try:
    defense = soup.find(id="all_defense")
    for item in defense.children:
        if isinstance(item, bs4.element.Comment):
            html = item
    Dsoup = BeautifulSoup(html, features="html.parser")
except:
    html = ''
    Dsoup = BeautifulSoup(html, features="html.parser")
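From there the second table can be walked the same way as the first; a minimal sketch, assuming Dsoup was built as above and that the pass columns sit at the same indices as in the offense loop (an assumption worth checking against the real table):
gamelog_defense = []
for row in Dsoup.findAll('tr'):
    col = row.findAll('td')
    if len(col) > 5:  # skip header and spacer rows instead of relying on a bare except
        gamelog_defense.append([col[4].get_text(), col[5].get_text()])
print("Parsed " + str(len(gamelog_defense)) + " defensive rows")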

AttributeError: 'NoneType' object has no attribute 'text' python3 + proxy

There are some questions like this online, but I looked at them and none of them helped me. I am currently working on a script that pulls an item name from http://www.supremenewyork.com/shop/all/accessories
I want it to pull this information from the Supreme UK site, but I'm having trouble with the proxy part; right now I'm struggling with this script, and every time I run it I get the error listed above in the title.
Here is my script:
import requests
from bs4 import BeautifulSoup
URL = ('http://www.supremenewyork.com/shop/all/accessories')
proxy_script = requests.get(URL).text
soup = BeautifulSoup(proxy_script, 'lxml')
for item in soup.find_all('div', class_='inner-article'):
    name = soup.find('h1', itemprop='name').text
    print(name)
I always get this error, and when I run the script without the .text at the end of the itemprop='name' line I just get a bunch of Nones,
like this:
None
None
None etc......
There are exactly as many Nones as there are items available to print.
Here we go, I've commented the code that I've used below.
The reason we use class_='something' is that the word class is a reserved keyword in Python.
import requests
from bs4 import BeautifulSoup

URL = ('http://www.supremenewyork.com/shop/all/accessories')
#UK_Proxy1 = '178.62.13.163:8080'
#proxies = {
#    'http': 'http://' + UK_Proxy1,
#    'https': 'https://' + UK_Proxy1
#}
#proxy_script = requests.get(URL, proxies=proxies).text
proxy_script = requests.get(URL).text
soup = BeautifulSoup(proxy_script, 'lxml')
thetable = soup.find('div', class_='turbolink_scroller')
items = thetable.find_all('div', class_='inner-article')
for item in items:
    only_text = item.h1.a.text
    # by doing .<tag> we extract information just from that tag
    # example: bsobject = <html><body><b>ey</b></body></html>
    # if we print bsobject.body.b it will return `<b>ey</b>`
    color = item.p.a.text
    print(only_text, color)
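If you do want the request to go through the UK proxy from the commented-out lines, requests accepts a proxies mapping; a minimal sketch that simply re-enables those lines, assuming the proxy at that address is still alive (free proxies tend to go down quickly):
import requests

URL = 'http://www.supremenewyork.com/shop/all/accessories'
UK_Proxy1 = '178.62.13.163:8080'  # address taken from the commented-out lines above
proxies = {
    'http': 'http://' + UK_Proxy1,
    'https': 'https://' + UK_Proxy1,
}
proxy_script = requests.get(URL, proxies=proxies, timeout=10).text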

How can I extract the Foursquare url location from Swarm webpage in python3?

Suppose we have this Swarm URL, "https://www.swarmapp.com/c/dZxqzKerUMc". How can we get the URL behind the "Apple Williamsburg" hyperlink on that page?
I tried to filter it out by HTML tags, but there are many tags and lots of foursquare.com links.
Below is part of the source code of the link given above:
<h1><strong>Kristin Brooks</strong> at <a
href="https://foursquare.com/v/apple-williamsburg/57915fa838fab553338ff7cb"
target="_blank">Apple Williamsburg</a></h1>
The foursquare URL in the code is not always the same, so what is the best way to get that specific URL uniquely for every given Swarm URL?
I tried this:
import bs4
import requests
def get_4square_url(link):
    response = requests.get(link)
    soup = bs4.BeautifulSoup(response.text, "html.parser")
    link = [a.attrs.get('href') for a in
            soup.select('a[href=https://foursquare.com/v/*]')]
    return link
print (get_4square_url('https://www.swarmapp.com/c/dZxqzKerUMc'))
I used https://foursquare.com/v/ as a pattern to get the desired URL:
import re
import requests
import bs4
import urllib3

def get_4square_url(link):
    try:
        response = requests.get(link)
        soup = bs4.BeautifulSoup(response.text, "html.parser")
        for elem in soup.find_all('a',
                href=re.compile(r'https://foursquare\.com/v/')):  # here is my pattern
            link = elem['href']
            return link
    except (requests.exceptions.HTTPError,
            requests.exceptions.ConnectionError,
            requests.exceptions.ConnectTimeout,
            urllib3.exceptions.MaxRetryError):
        pass
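An equivalent CSS-selector version also works; the select call in the question fails because the attribute value is not quoted, whereas a quoted prefix match with ^= does the job. A minimal sketch:
import requests
from bs4 import BeautifulSoup

def get_4square_url(link):
    soup = BeautifulSoup(requests.get(link).text, "html.parser")
    # prefix match on the href attribute; the value must be quoted
    match = soup.select_one('a[href^="https://foursquare.com/v/"]')
    return match['href'] if match else None

print(get_4square_url('https://www.swarmapp.com/c/dZxqzKerUMc'))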

Beautifulsoup get rid of embedded js and css in html

I need to parse multiple HTML pages fetched with requests.get(). I just need to keep the content of each page and get rid of the embedded JavaScript and CSS. I saw the following posts, but none of their solutions work for me:
http://stackoverflow.com/questions/14344476/how-to-strip-entire-html-css-and-js-code-or-tags-from-html-page-in-python, http://stackoverflow.com/questions/1936466/beautifulsoup-grab-visible-webpage-text, and http://stackoverflow.com/questions/2081586/web-scraping-with-python
I have code that runs, but it doesn't strip the JS or the CSS... here is my code:
count = 1
for link in clean_urls[:2]:
    page = requests.get(link, timeout=5)
    try:
        page = BeautifulSoup(page.content, 'html.parser').text
        webpage_out = open(my_params['q'] + '_' + str(count) + '.txt', 'w')
        webpage_out.write(clean_page)
        count += 1
    except:
        pass
webpage_out.close()
I tried to include the solutions from the links mentioned above, but none of that code works for me.
What line of code can get rid of the embedded JS and embedded CSS?
Question Update 4 OCT 2016
The csv file that is read in is something like this...
trump,clinton
data science, operating system
windows,linux
diabetes,cancer
I hit gigablast.com with those terms to search one row at a time. One search would be trump clinton. The result is a list of URLs. I requests.get(url) each one and process those URLs, getting rid of timeouts and status_code 400s, and building a clean list, clean_urls = []. After that I fire the same code shown above.
On this line of code, page = BeautifulSoup(page.content, 'html.parser').text, I have the text of the entire web page, including styles and scripts if they were embedded. I can't target them with BeautifulSoup because the tags are no longer there. I did try page = BeautifulSoup(page.content, 'html.parser') with find_all('<script>') and tried to get rid of the script, but I ended up erasing the entire file. The desired outcome would be all the text of the HTML without any...
body {
font: something;
}
or any javascript...
$(document).ready(function(){
$some code
)};
The final file should have no code whatsoever, just the content of the document.
I used this code to get rid of JavaScript and CSS code while scraping an HTML page:
import requests
from bs4 import BeautifulSoup
url = 'https://corporate.walmart.com/our-story/our-business'
r = requests.get(url)
html_doc = r.text
soup = BeautifulSoup(html_doc, 'html.parser')
title = soup.title.string
for script in soup(["script", "style"]):
    script.decompose()
with open('output_file.txt', "a") as text_file:
    text_file.write("\nURL : " + url)
    text_file.write("\nTitle : " + title)
    for p_tag_data in soup.find_all('p'):
        text_file.write("\n" + p_tag_data.text)
    for li_tag_data in soup.find_all('li'):
        text_file.write("\n" + li_tag_data.text)
    for div_tag_data in soup.find_all('div'):
        text_file.write("\n" + div_tag_data.text)
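If the goal is the whole visible text rather than just the <p>, <li> and <div> contents, soup.get_text() after the decompose loop gives that in one call; a minimal sketch using the same Walmart page as above:
import requests
from bs4 import BeautifulSoup

url = 'https://corporate.walmart.com/our-story/our-business'
soup = BeautifulSoup(requests.get(url).text, 'html.parser')
for tag in soup(["script", "style"]):
    tag.decompose()
# what is left is the rendered text; the script and style bodies are gone
clean_text = soup.get_text(separator="\n", strip=True)
with open('output_file.txt', 'w') as text_file:
    text_file.write(clean_text)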
