BeautifulSoup: parse thousands of pages - python-3.x

I have a script parsing a list with thousands of URLs.
My problem is that it would take ages to get through that list:
each URL request takes around 4 seconds before the page is loaded and can be parsed.
Is there any way to parse a really large number of URLs quickly?
My code looks like this:
from bs4 import BeautifulSoup
import requests

# read url-list
with open('urls.txt') as f:
    content = f.readlines()

# remove whitespace characters
content = [line.strip('\n') for line in content]

# LOOP through url list and get information
for i in range(5):
    try:
        for url in content:
            # get information
            link = requests.get(url)
            data = link.text
            soup = BeautifulSoup(data, "html5lib")
            # just example scraping
            name = soup.find_all('h1', {'class': 'name'})
    except requests.exceptions.RequestException:
        # except clause assumed (missing in the post); skip failed requests
        pass
EDIT:
How do I handle asynchronous requests with hooks in this example? I tried the following, as mentioned in Asynchronous Requests with Python requests:
from bs4 import BeautifulSoup
import grequests

def parser(response):
    for url in urls:
        # get information
        link = requests.get(response)
        data = link.text
        soup = BeautifulSoup(data, "html5lib")
        # just example scraping
        name = soup.find_all('h1', {'class': 'name'})

# read urls.txt and store in list variable
with open('urls.txt') as f:
    urls = f.readlines()

# you may also want to remove whitespace characters
urls = [line.strip('\n') for line in urls]

# A list to hold our things to do via async
async_list = []
for u in urls:
    # The "hooks = {..." part is where you define what you want to do
    #
    # Note the lack of parentheses following parser; this is
    # because the response will be used as the first argument automatically
    rs = grequests.get(u, hooks={'response': parser})
    # Add the task to our list of things to do via async
    async_list.append(rs)

# Do our list of things to do via async
grequests.map(async_list, size=5)
This doesn't work for me. I don't even get an error in the console; it just runs for a long time until it stops.
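For reference, a requests/grequests response hook is called with the Response object itself, so the callback can parse response.text directly instead of fetching the URL again. A minimal sketch (untested, assuming the same urls.txt input and the h1/class "name" selector from above):

from bs4 import BeautifulSoup
import grequests

def parser(response, *args, **kwargs):
    # The hook receives the already-fetched Response; no second requests.get() needed.
    soup = BeautifulSoup(response.text, "html5lib")
    names = soup.find_all('h1', {'class': 'name'})
    print(response.url, len(names))

with open('urls.txt') as f:
    urls = [line.strip('\n') for line in f]

reqs = [grequests.get(u, hooks={'response': parser}) for u in urls]
grequests.map(reqs, size=5)  # run at most 5 requests concurrently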

Related

I am trying to delete lines of text in python that start with /

I am trying to scrape a website and then save the links to a text file. In the text file, I would like to delete any line that does not start with "/". How could I do that?
This is everything I have so far:
import requests
from bs4 import BeautifulSoup

page = requests.get("https://wiki.stardewvalley.net/Stardew_Valley_Wiki")
soup = BeautifulSoup(page.content, 'html.parser')

wikilinks = []
for con in soup.find_all('div', class_="mainmenuwrapper"):
    for links in soup.find_all('a', href=True):
        if links.text:
            wikilinks.append(links['href'])

# print(wikilinks)
with open('./scrapeNews/output.txt', 'w') as f:
    for item in wikilinks:
        f.write("%s\n" % item)
You can use the built-in startswith() method to check whether a link starts with "/". However, since there is other information besides links, you can filter to only write links that start with "http", instead of just filtering for "/".
...
with open("./scrapeNews/output.txt", "w") as f:
for item in wikilinks:
if not str(item).startswith("http"):
continue
f.write("%s\n" % item)

BeautifulSoup: Is there a way to set the starting point of find_all() method?

Given a soup I need to get n elements with class="foo".
This can be done by:
soup.find_all(class_='foo', limit=n)
However, this is a slow process, as the elements I'm trying to find are located at the very bottom of the document.
Here is my code:
main_num = 1
main_page = 'https://rawdevart.com/search/?page={p_num}&ctype_inc=0'

# get_soup returns bs4 soup of a link
main_soup = get_soup(main_page.format(p_num=main_num))

# get_last_page returns the number of pages which is 64
last_page_num = get_last_page(main_soup)

for sub_num in range(1, last_page_num+1):
    sub_soup = get_soup(main_page.format(p_num=sub_num))
    arr_links = sub_soup.find_all(class_='head')
    # process arr_links
The class head is an attribute of the a tags on this page, so I assume you want to grab all follow links and keep moving through all the search pages.
Here's how you might get that done:
import requests
from bs4 import BeautifulSoup

base_url = "https://rawdevart.com"
total_pages = BeautifulSoup(
    requests.get(f"{base_url}/search/?page=1&ctype_inc=0").text,
    "html.parser",
).find(
    "small",
    class_="d-block text-muted",
).getText().split()[2]

pages = [
    f"{base_url}/search/?page={n}&ctype_inc=0"
    for n in range(1, int(total_pages) + 1)
]

all_follow_links = []
for page in pages[:2]:
    r = requests.get(page).text
    all_follow_links.extend(
        [
            f'{base_url}{a["href"]}' for a in
            BeautifulSoup(r, "html.parser").find_all("a", class_="head")
        ]
    )

print(all_follow_links)
Output:
https://rawdevart.com/comic/my-death-flags-show-no-sign-ending/
https://rawdevart.com/comic/tsuki-ga-michibiku-isekai-douchuu/
https://rawdevart.com/comic/im-not-a-villainess-just-because-i-can-control-darkness-doesnt-mean-im-a-bad-person/
https://rawdevart.com/comic/tensei-kusushi-wa-isekai-wo-meguru/
https://rawdevart.com/comic/iceblade-magician-rules-over-world/
https://rawdevart.com/comic/isekai-demo-bunan-ni-ikitai-shoukougun/
https://rawdevart.com/comic/every-class-has-been-mass-summoned-i-strongest-under-disguise-weakest-merchant/
https://rawdevart.com/comic/isekai-onsen-ni-tensei-shita-ore-no-kounou-ga-tondemosugiru/
https://rawdevart.com/comic/kubo-san-wa-boku-mobu-wo-yurusanai/
https://rawdevart.com/comic/gabriel-dropout/
and more ...
Note: to get all the pages just remove the slicing from this line:
for page in pages[:2]:
    # the rest of the loop body
So it looks like this:
for page in pages:
    # the rest of the loop body

Web scraping - non href

I have a list of websites in a csv from which I'd like to capture all PDFs.
BeautifulSoup's select works fine on <a href>, but there is a website that puts its PDF links in <data-url="https://example.org/abc/qwe.pdf">, and soup couldn't catch anything.
Is there any code I could use to get everything that starts with "data-url" and ends with ".pdf"?
I apologize for the messy code. I'm still learning, so please let me know if I can provide clarification.
Thank you :D
The csv looks like this
123456789 https://example.com
234567891 https://example2.com
import os
import requests
import csv
from urllib.parse import urljoin
from bs4 import BeautifulSoup

#Write csv into tuples
with open('links.csv') as f:
    url = [tuple(line) for line in csv.reader(f)]
print(url)

#If there is no such folder, the script will create one automatically
folder_location = r'C:\webscraping'
if not os.path.exists(folder_location):
    os.mkdir(folder_location)

def url_response(url):
    global i
    final = a
    response = requests.get(url)
    soup = BeautifulSoup(response.text, "html.parser")
    for link in soup.select("a[href$='.pdf']"):
        #Translating captured URLs into local addresses
        filename = os.path.join(folder_location, link['href'].split('/')[-1])
        print(filename)
        #Writing files into said addresses
        with open(filename, 'wb') as f:
            f.write(requests.get(urljoin(url, link['href'])).content)
        #Rename files
        os.rename(filename, str(final) + "_" + str(i) + ".pdf")
        i = i + 1

#Loop the csv
for a, b in url:
    i = 0
    url_response(b)
If beautifulsoup is not helping you, a regex solution to find the links would be as follows:
Sample HTML:
txt = """
<html>
<body>
<p>
<data-url="https://example.org/abc/qwe.pdf">
</p>
<p>
<data-url="https://example.org/def/qwe.pdf">
</p>
</html>
"""
Regex code to extract links inside data-url:
import re

re1 = '(<data-url=")' ## STARTS WITH
re2 = '((?:http|https)(?::\\/{2}[\\w]+)(?:[\\/|\\.]?)(?:[^\\s"]*))' # HTTP URL
re3 = '(">)' ## ENDS WITH

rg = re.compile(re1 + re2 + re3, re.IGNORECASE | re.DOTALL)
links = re.findall(rg, txt)

for i in range(len(links)):
    print(links[i][1])
Output:
https://example.org/abc/qwe.pdf
https://example.org/def/qwe.pdf
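Applied to a real page rather than the sample string, the same pattern could simply be run over the fetched HTML; a quick sketch (the URL below is a placeholder standing in for one of the sites from the csv):

import re
import requests

# same pattern as above, compiled once
rg = re.compile('(<data-url=")'
                '((?:http|https)(?::\\/{2}[\\w]+)(?:[\\/|\\.]?)(?:[^\\s"]*))'
                '(">)', re.IGNORECASE | re.DOTALL)

html = requests.get("https://example.com").text  # placeholder URL
pdf_links = [m[1] for m in re.findall(rg, html) if m[1].endswith(".pdf")]
print(pdf_links)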
Yes, use an attribute = value selector with the $ "ends with" operator. data-url is simply another attribute, just like the href in your existing selector:
soup.select('[data-url$=".pdf"]')
Combining both with the OR (comma) syntax:
soup.select('[href$=".pdf"],[data-url$=".pdf"]')
You can then test with has_attr to determine what to do with each element retrieved.
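A minimal sketch of that has_attr branch (assuming the soup from the question and that both link styles appear on the page):

for tag in soup.select('[href$=".pdf"],[data-url$=".pdf"]'):
    if tag.has_attr('data-url'):
        pdf_link = tag['data-url']   # elements carrying a data-url="...pdf" attribute
    else:
        pdf_link = tag['href']       # ordinary <a href="...pdf"> elements
    print(pdf_link)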

Making a try-except workaround to apply to many statements in a single line by creating a separate function

I am scraping dictionary data from the https://www.dictionary.com/ website. The purpose is to remove the unwanted elements from the dictionary pages and save them offline for further processing. Because the web pages are somewhat unstructured, the elements mentioned in the code below to be removed may or may not be present; the absence of an element raises an exception (in snippet 2). Since in the actual code there are many elements to be removed, and each may be present or absent, wrapping every such statement in its own try - except would increase the lines of code drastically.
Thus I am working on a workaround for this problem by creating a separate function for try - except (in snippet 3), the idea for which I got from here. But I am unable to get the code in snippet 3 working: a command such as soup.find_all('style') is returning None, whereas it should return the list of all the style tags, as in snippet 2. I cannot apply the referred solution directly because sometimes I have to reach the element to remove indirectly, by referring to its parent or sibling, as in soup.find('h2',{'class':'css-1iltn77 e17deyx90'}).parent
Snippet 1 is used to set up the environment for code execution.
It would be great if you could provide some suggestions to get snippet 3 working.
Snippet 1 (Setting the environment for executing code):
import urllib.request
import requests
from bs4 import BeautifulSoup
import re
headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11',
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',}
folder = "dictionary_com"
Snippet 2 (working):
def makedefinition(url):
    success = False
    while success==False:
        try:
            request = urllib.request.Request(url, headers=headers)
            final_url = urllib.request.urlopen(request, timeout=5).geturl()
            r = requests.get(final_url, headers=headers, timeout=5)
            success=True
        except:
            success=False
    soup = BeautifulSoup(r.text, 'lxml')
    soup = soup.find("section",{'class':'css-1f2po4u e1hj943x0'})
    # there are many more elements to remove. mentioned only 2 for shortness
    remove = soup.find_all("style") # style tags
    remove.extend(soup.find('h2',{'class':'css-1iltn77 e17deyx90'}).parent) # related content in the page
    for x in remove: x.decompose()
    return(soup)
# testing code on multiple urls
#url = "https://www.dictionary.com/browse/a"
#url = "https://www.dictionary.com/browse/a--christmas--carol"
#url = "https://www.dictionary.com/brdivowse/affection"
#url = "https://www.dictionary.com/browse/hot"
#url = "https://www.dictionary.com/browse/move--on"
url = "https://www.dictionary.com/browse/cuckold"
#url = "https://www.dictionary.com/browse/fear"
maggi = makedefinition(url)
with open(folder+"/demo.html", "w") as file:
    file.write(str(maggi))
Snippet 3 (not working):
soup = None

def safe_execute(command):
    global soup
    try:
        print(soup) # correct soup is printed
        print(exec(command)) # this should print the list of style tags but printing None, and for related content this should throw some exception
        return exec(command) # None is being returned for style
    except Exception:
        print(Exception.with_traceback())
        return []

def makedefinition(url):
    global soup
    success = False
    while success==False:
        try:
            request = urllib.request.Request(url, headers=headers)
            final_url = urllib.request.urlopen(request, timeout=5).geturl()
            r = requests.get(final_url, headers=headers, timeout=5)
            success=True
        except:
            success=False
    soup = BeautifulSoup(r.text, 'lxml')
    soup = soup.find("section",{'class':'css-1f2po4u e1hj943x0'})
    # there are many more elements to remove. mentioned only 2 for shortness
    remove = safe_execute("soup.find_all('style')") # style tags
    remove.extend(safe_execute("soup.find('h2',{'class':'css-1iltn77 e17deyx90'}).parent")) # related content in the page
    for x in remove: x.decompose()
    return(soup)
# testing code on multiple urls
#url = "https://www.dictionary.com/browse/a"
#url = "https://www.dictionary.com/browse/a--christmas--carol"
#url = "https://www.dictionary.com/brdivowse/affection"
#url = "https://www.dictionary.com/browse/hot"
#url = "https://www.dictionary.com/browse/move--on"
url = "https://www.dictionary.com/browse/cuckold"
#url = "https://www.dictionary.com/browse/fear"
maggi = makedefinition(url)
with open(folder+"/demo.html", "w") as file:
    file.write(str(maggi))
In your code in snippet 3 you use the exec built-in function, which returns None regardless of what it does with its argument. For details see this SO thread.
Remedy:
Use exec to modify a variable and return it instead of returning the output of exec itself.
def safe_execute(command):
    d = {}
    try:
        exec(command, d)
        return d['output']
    except Exception:
        print(Exception.with_traceback())
        return []
Then call it as something like this:
remove = safe_execute("output = soup.find_all('style')")
EDIT:
Upon executing this code, None is again returned. When debugging, however, if we print(soup) inside the try section, a correct soup value is printed, but exec(command, d) gives NameError: name 'soup' is not defined.
This disparity has been overcome by using eval() instead of exec(). The function defined is:
def safe_execute(command):
    global soup
    try:
        output = eval(command)
        return(output)
    except Exception:
        return []
And the call looks like:
remove = safe_execute("soup.find_all('style')")
remove.extend(safe_execute("soup.find('h2',{'class':'css-1iltn77 e17deyx90'}).parent"))
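The NameError makes sense once you know how exec handles namespaces: when you pass an explicit globals dict such as d = {}, the executed code can no longer see the module-level name soup; passing globals() (or using eval with the default namespaces, as above) avoids that. A tiny illustration, assuming it is run at module level:

soup = "some value"

exec("print(soup)")              # works: falls back to the caller's globals
exec("print(soup)", globals())   # works: module globals passed explicitly
try:
    exec("print(soup)", {})      # fails: an empty globals dict hides the name
except NameError as e:
    print(e)                     # name 'soup' is not defined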

Problem exporting Web Url results into CSV using beautifulsoup3

Problem: I tried to export the results (Name, Address, Phone) into CSV but the CSV code is not returning the expected results.
#Import the installed modules
import requests
from bs4 import BeautifulSoup
import json
import re
import csv
#To get the data from the web page we will use requests get() method
url = "https://www.lookup.pk/dynamic/search.aspx?searchtype=kl&k=gym&l=lahore"
page = requests.get(url)
# To check the http response status code
print(page.status_code)
#Now I have collected the data from the web page, let's see what we got
print(page.text)
#The above data can be viewed in a pretty format by using beautifulsoup's prettify() method. For this we will create a bs4 object and use the prettify method
soup = BeautifulSoup(page.text, 'lxml')
print(soup.prettify())
#Find all DIVs that contain Companies information
product_name_list = soup.findAll("div",{"class":"CompanyInfo"})
#Find all Companies Name under h2tag
company_name_list_heading = soup.findAll("h2")
#Find all Address on page Name under a tag
company_name_list_items = soup.findAll("a",{"class":"address"})
#Find all Phone numbers on page Name under ul
company_name_list_numbers = soup.findAll("ul",{"class":"submenu"})
# Create for loop to print out all company Data
for company_address in company_name_list_items:
    print(company_address.prettify())

# Create for loop to print out all company Names
for company_name in company_name_list_heading:
    print(company_name.prettify())

# Create for loop to print out all company Numbers
for company_numbers in company_name_list_numbers:
    print(company_numbers.prettify())
Below is the code to export the results (name, address & phone number) into CSV:
outfile = open('gymlookup.csv','w', newline='')
writer = csv.writer(outfile)
writer.writerow(["name", "Address", "Phone"])
product_name_list = soup.findAll("div",{"class":"CompanyInfo"})
company_name_list_heading = soup.findAll("h2")
company_name_list_items = soup.findAll("a",{"class":"address"})
company_name_list_numbers = soup.findAll("ul",{"class":"submenu"})
Here is the for loop to loop over data.
for company_name in company_name_list_heading:
    names = company_name.contents[0]

for company_numbers in company_name_list_numbers:
    names = company_numbers.contents[1]

for company_address in company_name_list_items:
    address = company_address.contents[1]

writer.writerow([name, Address, Phone])
outfile.close()
You need to work on understanding how for loops work, and also the difference between strings, variables, and other data types. You also need to work on using what you have seen in other Stack Overflow questions and learning to apply it. This is essentially the same as your other 2 questions you already posted, just a different site you're scraping from (I didn't flag it as a duplicate, as you're new to Stack Overflow and web scraping, and I remember what it was like to learn). I'll still answer your questions, but eventually you need to be able to find the answers on your own and learn how to adapt and apply them; coding isn't paint by numbers. I do see you are adapting some of it: good job in finding the "div",{"class":"CompanyInfo"} tag to get the company info.
The data you are pulling (name, address, phone) needs to be within a nested loop over the div class=CompanyInfo elements/tags. You could theoretically keep it the way you have it now by putting those into lists and then writing to the csv file from your lists, but there's a risk of data missing, and then your data/info could be off or not matched with the correct corresponding company.
Here's what the full code looks like. Notice that the variables are stored within the loop and then written. It then goes to the next block of CompanyInfo and continues.
#Import the installed modules
import requests
from bs4 import BeautifulSoup
import csv
#To get the data from the web page we will use requests get() method
url = "https://www.lookup.pk/dynamic/search.aspx?searchtype=kl&k=gym&l=lahore"
page = requests.get(url)
# To check the http response status code
print(page.status_code)
#Now I have collected the data from the web page, let's see what we got
print(page.text)
#The above data can be viewed in a pretty format by using beautifulsoup's prettify() method. For this we will create a bs4 object and use the prettify method
soup = BeautifulSoup(page.text, 'html.parser')
print(soup.prettify())
outfile = open('gymlookup.csv','w', newline='')
writer = csv.writer(outfile)
writer.writerow(["Name", "Address", "Phone"])
#Find all DIVs that contain Companies information
product_name_list = soup.findAll("div",{"class":"CompanyInfo"})
# Now loop through those elements
for element in product_name_list:
    # Takes 1 block of the "div",{"class":"CompanyInfo"} tag and finds/stores name, address, phone
    name = element.find('h2').text
    address = element.find('address').text.strip()
    phone = element.find("ul",{"class":"submenu"}).text.strip()
    # writes the name, address, phone to csv
    writer.writerow([name, address, phone])
    # now will go to the next "div",{"class":"CompanyInfo"} tag and repeats
outfile.close()
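One design note: wrapping the CSV write in a with block closes the file automatically even if the scrape raises an exception midway. A minimal sketch of the same loop written that way, reusing product_name_list from above:

import csv

with open('gymlookup.csv', 'w', newline='') as outfile:
    writer = csv.writer(outfile)
    writer.writerow(["Name", "Address", "Phone"])
    for element in product_name_list:
        name = element.find('h2').text
        address = element.find('address').text.strip()
        phone = element.find("ul", {"class": "submenu"}).text.strip()
        writer.writerow([name, address, phone])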
