Web scraping - non href - python-3.x

I have a list of websites in a CSV from which I'd like to capture all PDFs.
BeautifulSoup select works fine on <a href>, but there is one website that puts the PDF link in data-url="https://example.org/abc/qwe.pdf" and soup couldn't catch anything.
Is there any code I could use to get everything that starts with "data-url" and ends with .pdf?
I apologize for the messy code. I'm still learning. Please let me know if I can provide clarification.
Thank you :D
The CSV looks like this:
123456789 https://example.com
234567891 https://example2.com
import os
import requests
import csv
from urllib.parse import urljoin
from bs4 import BeautifulSoup

# Write csv into tuples
with open('links.csv') as f:
    url = [tuple(line) for line in csv.reader(f)]
print(url)

# If there is no such folder, the script will create one automatically
folder_location = r'C:\webscraping'
if not os.path.exists(folder_location):
    os.mkdir(folder_location)

def url_response(url):
    global i
    final = a
    response = requests.get(url)
    soup = BeautifulSoup(response.text, "html.parser")
    for link in soup.select("a[href$='.pdf']"):
        # Translating captured URLs into local addresses
        filename = os.path.join(folder_location, link['href'].split('/')[-1])
        print(filename)
        # Writing files into said addresses
        with open(filename, 'wb') as f:
            f.write(requests.get(urljoin(url, link['href'])).content)
        # Rename files
        os.rename(filename, str(final) + "_" + str(i) + ".pdf")
        i = i + 1

# Loop the csv
for a, b in url:
    i = 0
    url_response(b)

If BeautifulSoup is not helping you, a regex solution to find the links would be as follows:
Sample HTML:
txt = """
<html>
<body>
<p>
<data-url="https://example.org/abc/qwe.pdf">
</p>
<p>
<data-url="https://example.org/def/qwe.pdf">
</p>
</html>
"""
Regex code to extract links inside data-url:
import re

re1 = '(<data-url=")'  # STARTS WITH
re2 = '((?:http|https)(?::\\/{2}[\\w]+)(?:[\\/|\\.]?)(?:[^\\s"]*))'  # HTTP URL
re3 = '(">)'  # ENDS WITH

rg = re.compile(re1 + re2 + re3, re.IGNORECASE | re.DOTALL)
links = re.findall(rg, txt)
for i in range(len(links)):
    print(links[i][1])
Output:
https://example.org/abc/qwe.pdf
https://example.org/def/qwe.pdf

Yes, use an attribute = value CSS selector with the $ (ends with) operator. data-url is simply another attribute, just like the href in your existing selector:
soup.select('[data-url$=".pdf"]')
Combining the two with the CSS OR (grouping) syntax:
soup.select('[href$=".pdf"],[data-url$=".pdf"]')
You can then test with has_attr to determine which attribute an element matched and decide what to do with it.
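For example, a minimal sketch (assuming soup has already been built from response.text as in your existing function):
for tag in soup.select('[href$=".pdf"],[data-url$=".pdf"]'):
    # has_attr tells you which attribute this particular element matched on
    pdf_link = tag['href'] if tag.has_attr('href') else tag['data-url']
    print(pdf_link)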

Related

Use beautifulsoup to download href links

Looking to download href links using BeautifulSoup4, Python 3 and the requests library.
This is the code that I have now. I thought it would be tough to use regex in this situation, but I'm not sure if this can be done using BeautifulSoup. I have to download all of the shape files from the grid and am looking to automate this task. Thank you!
URL:
https://earth-info.nga.mil/index.php?dir=coordsys&action=gars-20x20-dloads
import requests
from bs4 import BeautifulSoup
import re
URL = 'https://earth-info.nga.mil/index.php?dir=coordsys&action=gars-20x20-dloads'
page = requests.get(URL)
soup = BeautifulSoup(page.content,'html.parser')
results = re.findall(r'<a[^>]* href="([^"]*)"', page)
print(results)
Those files are all associated with the area tag, so I would simply select those:
import requests
from bs4 import BeautifulSoup as bs
r = requests.get('https://earth-info.nga.mil/index.php?dir=coordsys&action=gars-20x20-dloads')
soup = bs(r.content, 'lxml')
files = ['https://earth-info.nga.mil/' + i['href'] for i in soup.select('area')]
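From there, a minimal sketch for saving each file locally (the folder name here is made up, and it assumes each href ends in a usable file name):
import os

os.makedirs('shapefiles', exist_ok=True)  # hypothetical output folder
for file_url in files:
    file_name = file_url.split('/')[-1]
    with open(os.path.join('shapefiles', file_name), 'wb') as f:
        f.write(requests.get(file_url).content)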
You can convert page to a string (page.text) in order to search for all the a tags using regex.
Instead of:
results = re.findall(r'<a[^>]* href="([^"]*)"', page)
Use:
results = re.findall(r'<a[^>]* href="([^"]*)"', page.text)

How can I scrape a <h1> tag using BeautifulSoup? [Python]

I am currently coding a price tracker for different websites, but I have run into an issue.
I'm trying to scrape the contents of an h1 tag using BeautifulSoup4, but I don't know how. I've tried to use a dictionary, as suggested in https://stackoverflow.com/a/40716482/14003061, but it returned None.
Can someone please help? It would be appreciated!
Here's the code:
from termcolor import colored
import requests
from bs4 import BeautifulSoup
import smtplib

def choice_bwfo():
    print(colored("You have selected Buy Whole Foods Online [BWFO]", "blue"))
    url = input(colored("\n[ 2 ] Paste a product link from BWFO.\n", "magenta"))
    url_verify = requests.get(url, headers=headers)
    soup = BeautifulSoup(url_verify.content, 'html5lib')
    item_block = BeautifulSoup.find('h1', {'itemprop': 'name'})
    print(item_block)

choice_bwfo()
Here's an example URL you can use:
https://www.buywholefoodsonline.co.uk/organic-spanish-bee-pollen-250g.html
Thanks :)
This script will print the content of the <h1> tag:
import requests
from bs4 import BeautifulSoup
url = 'https://www.buywholefoodsonline.co.uk/organic-spanish-bee-pollen-250g.html'
# create `soup` variable from the URL:
soup = BeautifulSoup(requests.get(url).content, 'html.parser')
# print text of first `<h1>` tag:
print(soup.h1.get_text())
Prints:
Organic Spanish Bee Pollen 250g
Or you can do:
print(soup.find('h1', {'itemprop' : 'name'}).get_text())
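For completeness, the reason the original function did not work is that find was called on the BeautifulSoup class instead of on the soup object that was just created. A minimal correction of that part of the function (assuming headers is defined elsewhere in the script, since it is not shown in the question):
url_verify = requests.get(url, headers=headers)  # assumes `headers` is defined elsewhere
soup = BeautifulSoup(url_verify.content, 'html5lib')
# call find on the soup instance, not on the BeautifulSoup class
item_block = soup.find('h1', {'itemprop': 'name'})
print(item_block.get_text(strip=True) if item_block else 'no matching <h1> found')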

How can I ensure that relative links are saved as absolute URLs in the output file?

I need to develop a web-link scraper in Python that extracts all of the unique links pointing to other web pages from the HTML of the "Current Estimates" page on the "US Census Bureau" website (see link below), both inside and outside that domain, and writes them to a comma-separated values (CSV) file as absolute uniform resource identifiers (URIs).
I use the code below in Jupyter Notebook and it does generate a CSV, but part of my code is adding a double scheme ("http:https://") to links that are already absolute, when it should only be prefixing the relative links:
http:https://www.census.gov/data/training-workshops.html
http:https://www.census.gov/programs-surveys/sis.html
I need better code that converts the relative links to absolute ones. I believe full_url = urljoin(url, link.get("href")) should be doing this, but something is incorrect.
How can I ensure that relative links are saved as absolute URLs in the output file?
import requests
from bs4 import BeautifulSoup, SoupStrainer
import csv
from urllib.parse import urljoin
import re

url = 'https://www.census.gov/programs-surveys/popest.html'
r = requests.get(url)
raw_html = r.text
print(r.text)

soup = BeautifulSoup(raw_html, 'html.parser')
print(soup.prettify())

for link in soup.find_all('a', href=True):
    full_url = urljoin(url, link.get("href"))
    print(link.get('href'))

links_set = set()
for link in soup.find_all(href=re.compile('a')):
    print(link.get('href'))

for item in soup.find_all('a', href=re.compile(r'html')):
    links_set.add(item.get('href'))

links = [x[:1] == 'http' and x or 'http:' + x for x in links_set]

with open("C996FinalAssignment.csv", "w") as csv_file:
    writer = csv.writer(csv_file, delimiter="\n")
    writer.writerow(links)
Try this.
import requests
import csv
from simplified_scrapy.simplified_doc import SimplifiedDoc

url = 'https://www.census.gov/programs-surveys/popest.html'
r = requests.get(url)
raw_html = r.text
print(r.text)

doc = SimplifiedDoc(raw_html)
lstA = doc.listA(url=url)  # It will help you turn relative links into absolute links
links = [a.url for a in lstA]

with open("C996FinalAssignment.csv", "w") as csv_file:
    writer = csv.writer(csv_file, delimiter="\n")
    writer.writerow(links)
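If you would rather stay with requests, BeautifulSoup and the standard library, here is a minimal sketch that resolves every href against the page URL with urljoin before writing the file (reusing the output file name from the question):
import csv
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

url = 'https://www.census.gov/programs-surveys/popest.html'
soup = BeautifulSoup(requests.get(url).text, 'html.parser')

# urljoin leaves absolute links untouched and resolves relative ones against `url`
links = {urljoin(url, a['href']) for a in soup.find_all('a', href=True)}

with open("C996FinalAssignment.csv", "w", newline="") as csv_file:
    writer = csv.writer(csv_file)
    for link in sorted(links):
        writer.writerow([link])
Writing one URL per row also avoids the delimiter="\n" workaround in the original code.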

BeautifulSoup: parsing thousands of pages

I have a script that parses a list with thousands of URLs.
My problem is that it would take ages to finish that list.
Each URL request takes around 4 seconds before the page is loaded and can be parsed.
Is there any way to parse a really large number of URLs quickly?
My code looks like this:
from bs4 import BeautifulSoup
import requests

# read url-list
with open('urls.txt') as f:
    content = f.readlines()

# remove whitespace characters
content = [line.strip('\n') for line in content]

# LOOP through urllist and get information
for i in range(5):
    try:
        for url in content:
            # get information
            link = requests.get(url)
            data = link.text
            soup = BeautifulSoup(data, "html5lib")
            # just example scraping
            name = soup.find_all('h1', {'class': 'name'})
    except Exception:
        pass  # the except clause is cut off in the original snippet
EDIT:
How do I handle asynchronous requests with hooks in this example? I tried the following, as described in Asynchronous Requests with Python requests:
from bs4 import BeautifulSoup
import grequests

def parser(response):
    for url in urls:
        # get information
        link = requests.get(response)
        data = link.text
        soup = BeautifulSoup(data, "html5lib")
        # just example scraping
        name = soup.find_all('h1', {'class': 'name'})

# read urls.txt and store in list variable
with open('urls.txt') as f:
    urls = f.readlines()

# you may also want to remove whitespace characters
urls = [line.strip('\n') for line in urls]

# A list to hold our things to do via async
async_list = []

for u in urls:
    # The "hooks = {..." part is where you define what you want to do
    #
    # Note the lack of parentheses following do_something, this is
    # because the response will be used as the first argument automatically
    rs = grequests.get(u, hooks={'response': parser})

    # Add the task to our list of things to do via async
    async_list.append(rs)

# Do our list of things to do via async
grequests.map(async_list, size=5)
This doesn't work for me. I don't even get any error in the console; it just runs for a long time until it stops.
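For what it's worth, the hook callback already receives the finished response object, so calling requests.get again inside parser is what keeps the attempt above from working. A minimal sketch of a corrected hook, assuming grequests is installed and urls.txt is the same file as before:
import grequests  # import before requests so gevent can monkey-patch
from bs4 import BeautifulSoup

def parser(response, *args, **kwargs):
    # `response` is the already-fetched page passed in by the hook
    soup = BeautifulSoup(response.text, "html5lib")
    names = soup.find_all('h1', {'class': 'name'})
    print(response.url, len(names))

with open('urls.txt') as f:
    urls = [line.strip() for line in f if line.strip()]

reqs = (grequests.get(u, hooks={'response': parser}) for u in urls)
grequests.map(reqs, size=5)  # at most 5 requests in flight at a time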

BeautifulSoup returns urls of pages on same website shortened

My code for reference:
import httplib2
from bs4 import BeautifulSoup

h = httplib2.Http('.cache')
response, content = h.request('http://csb.stanford.edu/class/public/pages/sykes_webdesign/05_simple.html')

soup = BeautifulSoup(content, "lxml")
urls = []
for tag in soup.findAll('a', href=True):
    urls.append(tag['href'])

responses = []
contents = []
for url in urls:
    try:
        response1, content1 = h.request(url)
        responses.append(response1)
        contents.append(content1)
    except:
        pass
The idea is that I get the payload of a webpage and then scrape it for hyperlinks. One of the links is to yahoo.com, the other to 'http://csb.stanford.edu/class/public/index.html'.
However the result I'm getting from BeautifulSoup is:
>>> urls
['http://www.yahoo.com/', '../../index.html']
This presents a problem, because the second part of the script cannot be executed on the second, shortened URL. Is there any way to make BeautifulSoup retrieve the full URL?
That's because the link on the webpage is actually of that form. The HTML from the page is:
<p>Or let's just link to <a href=../../index.html>another page on this server</a></p>
This is called a relative link.
To convert this to an absolute link, you can use urljoin from the standard library.
from urllib.parse import urljoin  # Python 3

urljoin('http://csb.stanford.edu/class/public/pages/sykes_webdesign/05_simple.html',
        '../../index.html')
# returns 'http://csb.stanford.edu/class/public/index.html'
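Applied to the script above, you would join every scraped href against the page it came from before requesting it:
from urllib.parse import urljoin

base = 'http://csb.stanford.edu/class/public/pages/sykes_webdesign/05_simple.html'
urls = [urljoin(base, tag['href']) for tag in soup.findAll('a', href=True)]
# urls is now ['http://www.yahoo.com/', 'http://csb.stanford.edu/class/public/index.html']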
