Optimizing a webcrawl - python-3.x

The following crawl, though very short, is painfully slow. I mean, "Pop in a full-length feature film," slow.
from urllib.request import urlopen
from bs4 import BeautifulSoup

def bestActressDOB():
    # create empty bday list
    bdays = []
    # for every base url
    for actress in getBestActresses("https://en.wikipedia.org/wiki/Academy_Award_for_Best_Actress"):
        # use actress list to create unique actress url
        URL = "http://en.wikipedia.org" + actress
        # connect to html
        html = urlopen(URL)
        # create soup object
        bsObj = BeautifulSoup(html, "lxml")
        # get text from <span class="bday">
        try:
            bday = bsObj.find("span", {"class": "bday"}).get_text()
        except AttributeError:
            print(URL)
        bdays.append(bday)
        print(bday)
    return bdays
It grabs the name of every actress nominated for an Academy Award from a table on one Wikipedia page, converts that table into a list, and uses those names to build URLs for each actress's wiki page, where it grabs her date of birth. The data will be used to calculate the age at which each actress was nominated for, or won, the Academy Award for Best Actress. Beyond Big O, is there a way to speed this up in real (wall-clock) time? I have little experience with this sort of thing, so I am unsure of how normal this is. Thoughts?
Edit: Requested sub-routine
def getBestActresses(URL):
    bestActressNomineeLinks = []
    html = urlopen(URL)
    try:
        soup = BeautifulSoup(html, "lxml")
        table = soup.find("table", {"class": "wikitable sortable"})
    except AttributeError:
        print("Error creating/navigating soup object")
    table_row = table.find_all("tr")
    for row in table_row:
        first_data_cell = row.find_all("td")[0:1]
        for datum in first_data_cell:
            actress_name = datum.find("a")
            links = actress_name.attrs['href']
            bestActressNomineeLinks.append(links)
    #print(bestActressNomineeLinks)
    return bestActressNomineeLinks
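For context on the speed question: nearly all of the time here is spent waiting on one HTTP request after another, so overlapping the downloads usually helps far more than micro-optimizing the parsing. A minimal sketch (not the original poster's code), assuming the same urlopen/BeautifulSoup stack and the link list returned by getBestActresses:

from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen
from bs4 import BeautifulSoup

def fetch_bday(path):
    # fetch one actress page and pull the bday span, returning None if it is absent
    soup = BeautifulSoup(urlopen("https://en.wikipedia.org" + path), "lxml")
    span = soup.find("span", {"class": "bday"})
    return span.get_text() if span else None

def bestActressDOB_concurrent(paths):
    # 8 workers is an arbitrary, politeness-minded choice; tune as needed
    with ThreadPoolExecutor(max_workers=8) as pool:
        return [b for b in pool.map(fetch_bday, paths) if b is not None]

Calling bestActressDOB_concurrent(getBestActresses("https://en.wikipedia.org/wiki/Academy_Award_for_Best_Actress")) should return the same birth dates in far less wall-clock time, network permitting.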

I would recommend trying a faster computer, or even running it on a service like Google Cloud Platform, Microsoft Azure, or Amazon Web Services. There is no code that will make it go faster.

Related

Beautiful Soup Value not extracting properly

Recently I was working with Python and Beautiful Soup to extract some data and put it into a pandas DataFrame.
I used Beautiful Soup to extract some of the hotel data from the website booking.com.
I was able to extract most of the attributes correctly, without any empty values.
Here is my code snippet:
def get_Hotel_Facilities(soup):
    try:
        title = soup.find_all("div", attrs={"class":"db29ecfbe2 c21a2f2d97 fe87d598e8"})
        new_list = []
        # Inner NavigatableString Object
        for i in range(len(title)):
            new_list.append(title[i].text.strip())
    except AttributeError:
        new_list = ""
    return new_list
The above code is my function to retrieve the facilities of a hotel and return them as a list.
page_no = 0
d = {"Hotel_Name":[], "Hotel_Rating":[], "Room_type":[], "Room_price":[], "Room_sqft":[], "Facilities":[], "Location":[]}
while (page_no <= 25):
    URL = f"https://www.booking.com/searchresults.html?aid=304142&label=gen173rf-1FCAEoggI46AdIM1gDaGyIAQGYATG4ARfIAQzYAQHoAQH4AQKIAgGiAg1wcm9qZWN0cHJvLmlvqAIDuAKwwPadBsACAdICJDU0NThkNDAzLTM1OTMtNDRmOC1iZWQ0LTdhOTNjOTJmOWJlONgCBeACAQ&sid=2214b1422694e7b065e28995af4e22d9&sb=1&sb_lp=1&src=theme_landing_index&src_elem=sb&error_url=https%3A%2F%2Fwww.booking.com%2Fhotel%2Findex.html%3Faid%3D304142%26label%3Dgen173rf1FCAEoggI46AdIM1gDaGyIAQGYATG4ARfIAQzYAQHoAQH4AQKIAgGiAg1wcm9qZWN0cHJvLmlvqAIDuAKwwPadBsACAdICJDU0NThkNDAzLTM1OTMtNDRmOC1iZWQ0LTdhOTNjOTJmOWJlONgCBeACAQ%26sid%3D2214b1422694e7b065e28995af4e22d9%26&ss=goa&is_ski_area=0&checkin_year=2023&checkin_month=1&checkin_monthday=13&checkout_year=2023&checkout_month=1&checkout_monthday=14&group_adults=2&group_children=0&no_rooms=1&b_h4u_keep_filters=&from_sf=1&offset{page_no}"
    new_webpage = requests.get(URL, headers=HEADERS)
    soup = BeautifulSoup(new_webpage.content, "html.parser")
    links = soup.find_all("a", attrs={"class":"e13098a59f"})
    for link in links:
        new_webpage = requests.get(link.get('href'), headers=HEADERS)
        new_soup = BeautifulSoup(new_webpage.content, "html.parser")
        d["Hotel_Name"].append(get_Hotel_Name(new_soup))
        d["Hotel_Rating"].append(get_Hotel_Rating(new_soup))
        d["Room_type"].append(get_Room_type(new_soup))
        d["Room_price"].append(get_Price(new_soup))
        d["Room_sqft"].append(get_Room_Sqft(new_soup))
        d["Facilities"].append(get_Hotel_Facilities(new_soup))
        d["Location"].append(get_Hotel_Location(new_soup))
    page_no += 25
The above code is the main loop: it traverses the linked result pages, retrieves the URLs of the listing pages, and then visits each one to retrieve the corresponding attributes.
I was able to retrieve the rest of the attributes correctly, but I am not able to retrieve the facilities reliably: only some of the room facilities are being returned, and some are not.
Here is my output after making it into a pandas DataFrame:
Facilities o/p image
Please help me with this problem, i.e. why some facilities come through and some do not.
P.S.: the facilities are available on the website.
I have tried using all of the corresponding classes and attributes for retrieval, but I am still not getting the facilities column properly.
Probably as a protective measure, the HTML fetched by the requests doesn't seem to be consistent in its layout or even its contents.
There might be more possible selectors, but try:
def get_Hotel_Facilities(soup):
    selectors = ['div[data-testid="property-highlights"]', '#facilities',
                 '.hp-description~div div.important_facility']
    new_list = []
    for sel in selectors:
        for sect in soup.select(sel):
            new_list += list(sect.stripped_strings)
    return list(set(new_list))  # set <--> unique
But even with this, the results are inconsistent. E.g.: I tested on this page with
for i in range(10):
    soup = BeautifulSoup(cloudscraper.create_scraper().get(url).content)
    fl = get_Hotel_Facilities(soup) if soup else []
    print(f'[{i}] {len(fl)} facilities: {", ".join(fl)}')
(But the inconsistencies might be due to using cloudscraper - maybe you'll get better results with your headers?)

Not getting any data entry with 'find_all' while scraping Spotify Charts webpage

I am trying to scrape the Spotify charts page containing the top 200 songs in India on 2022-02-01. My Python code:
#It reads the webpage.
def get_webpage(link):
    page = requests.get(link)
    soup = bs(page.content, 'html.parser')
    return(soup)

#It collects the data for each country, and write them in a list.
#The entries are (in order): Song, Artist, Date, Play Count, Rank
def get_data():
    rows = []
    soup = get_webpage('https://spotifycharts.com/regional/in/daily/2022-02-01')
    entries = soup.find_all("td", class_ = "chart-table-track")
    streams = soup.find_all("td", class_= "chart-table-streams")
    print(entries)
    for i, (entry, stream) in enumerate(zip(entries, streams)):
        song = entry.find('strong').get_text()
        artist = entry.find('span').get_text()[3:]
        play_count = stream.get_text()
        rows.append([song, artist, date, play_count, i+1])
    return(rows)
I tried printing the entries and streams but got blank values:
entries = soup.find_all("td", class_ = "chart-table-track")
streams = soup.find_all("td", class_= "chart-table-streams")
I have copied/referenced this from Here, and when I tried running the full script it gave the error 'NoneType' object has no attribute 'find_all' in the country function. Hence I tried a smaller section, as above.
NoneType suggests that it doesn't find the entries or streams; if you print the soup, you will see that the elements targeted by the entries and streams selectors do not exist.
After checking your soup object, it seems that Cloudflare is blocking your access to Spotify and you need to complete a CAPTCHA to get around this. There is a library for bypassing Cloudflare called "cloudscraper".
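A minimal sketch of swapping requests for cloudscraper (assuming pip install cloudscraper, and that the chart markup is otherwise unchanged from the question):

import cloudscraper
from bs4 import BeautifulSoup

scraper = cloudscraper.create_scraper()  # behaves like a requests session
page = scraper.get('https://spotifycharts.com/regional/in/daily/2022-02-01')
soup = BeautifulSoup(page.content, 'html.parser')
entries = soup.find_all("td", class_="chart-table-track")
streams = soup.find_all("td", class_="chart-table-streams")
print(len(entries), len(streams))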

Website server redirects me but I get 200 as the status code

I'm learning web scraping with Python, and as an all-in-one exercise I'm trying to build a game catalog using the Beautiful Soup and requests modules as my main tools. The problem, though, lies in handling the statements related to the requests module.
DESCRIPTION:
The exercise is about getting all the genre tags used to classify the games that start with the letter A, beginning with the first page. Each page shows around (or exactly) 30 games, so if you want to access a specific page of a letter independently, you have to use a URL of this form:
https://vandal.elespanol.com/juegos/13/pc/letra/a/inicio/1
https://vandal.elespanol.com/juegos/13/pc/letra/a/inicio/2
https://vandal.elespanol.com/juegos/13/pc/letra/a/inicio/3
And so on...
As a matter of fact, each alphabet letter main page has the form:
URL: https://vandal.elespanol.com/juegos/13/pc/letra/ which is equivalent to https://vandal.elespanol.com/juegos/13/pc/letra/a/inicio/.
Scraping genres from a few pages is no big deal, but if I want to scrape them all for a letter, how do I know when I'm done scraping the genres from every game of that letter?
When you request the URL https://vandal.elespanol.com/juegos/13/pc/letra/a/inicio/200, for example, you get redirected to the corresponding letter's main page, i.e. the first 30 games, since there are no more games to return.
Bearing that in mind, I was thinking about checking the status_code of the requests.get() response, but I get 200 as the status code, whereas when inspecting the requests in Chrome DevTools I see a 301. At the end of the program I save the scraped genres to a file.
Here's the picture.
And here's the code:
from bs4 import BeautifulSoup
import string
import requests
from string import ascii_lowercase

def write_genres_to_file(site_genres):
    with open('/home/l0new0lf/Desktop/generos.txt', 'w') as file_:
        print(f'File "{file_.name}" OPENED to write {len(site_genres)} GENRES')
        counter = 1
        site_genres_length = len(site_genres)
        for num in range(site_genres_length):
            print('inside File Loop')
            if counter != 2:
                if counter == 3:
                    file_.write(f'{site_genres[num]}' + '\n')
                    print('wrote something')
                    counter = 0
                else: file_.write(f'{site_genres[num]}')
            else: file_.write(f'{site_genres[num]:^{len(site_genres[num])+8}}')
            print(f'Wrote genre "{site_genres[num]}" SUCCESSFULLY!')
            counter += 1

def get_tags():
    #TITLE_TAG_SELECTOR = 'tr:first-child td.ta14b.t11 div a strong'
    #IMG_TAG_SELECTOR = 'tr:last-child td:first-child a img'
    #DESCRIPTION_TAG_SELECTOR = 'tr:last-child td:last-child p'
    GENRES_TAG_SELECTOR = 'tr:last-child td:last-child div.mt05 p'
    GAME_SEARCH_RESULTS_TABLE_SELECTOR = 'table.mt1.tablestriped4.froboto_real.blanca'
    GAME_TABLES_CLASS = 'table transparente tablasinbordes'

    site_genres = []
    for i in ['a']:
        counter = 1
        while True:
            rq = requests.get(f'https://vandal.elespanol.com/juegos/13/pc/letra/{i}/inicio/{counter}')
            if rq:
                print('Request GET: from ' + f'https://vandal.elespanol.com/juegos/13/pc/letra/{i}/inicio/{counter}' + ' Got Workable Code !')
            if rq.status_code == 301 or rq.status_code == 302 or rq.status_code == 303 or rq.status_code == 304:
                print(f'No more games in letter {i}\n**REDIRECTING TO **')
                break
            counter += 1
            soup = BeautifulSoup(rq.content, 'lxml')
            main_table = soup.select_one(GAME_SEARCH_RESULTS_TABLE_SELECTOR)
            #print('This is the MAIN TABLE:\n' + str(main_table))
            game_tables = main_table.find_all('table', {'class': GAME_TABLES_CLASS})
            #print('These are the GAME TABLES:\n' + str(game_tables))
            for game in game_tables:
                genres_str = str(game.select_one(GENRES_TAG_SELECTOR).contents[1]).strip().split(' / ')
                for genre in genres_str:
                    if not genre in site_genres:
                        site_genres.append(genre)
    write_genres_to_file(site_genres)

get_tags()
So, roughly, my question is: how can I know when I'm done scraping all the games starting with a certain letter, so that I can start scraping the games of the next one?
NOTE: the only ideas I had were to compare, on every iteration, whether the returned HTML structure is the same as the letter's first page, or to check whether I'm receiving repeated games. But I don't think that's the way to go about it.
Any help is truly welcome, and I'm very sorry for the very long problem description, but I thought it was necessary.
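One detail worth noting: requests follows redirects by default, so the 301 the browser shows is swallowed and the final response reports 200. A minimal sketch of spotting the redirect directly, using the same URL pattern as the question:

import requests

url = 'https://vandal.elespanol.com/juegos/13/pc/letra/a/inicio/200'

# Option 1: let requests follow the redirect, then inspect the history
rq = requests.get(url)
if rq.history:  # non-empty when at least one redirect happened
    print('redirected:', rq.history[0].status_code, '->', rq.url)

# Option 2: refuse to follow redirects and read the status code as-is
rq = requests.get(url, allow_redirects=False)
if rq.status_code in (301, 302, 303, 307, 308):
    print('no more pages; server redirects to', rq.headers['Location'])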
I simply would not rely on the status code alone. You might get a non-200 status even for pages that are there, for example if you exceed a certain rate described in their robots.txt, or if your network has a delay or error.
So, to reply to your question, "How do I ensure that I scraped all pages corresponding to a certain letter?": you may save the "visible text" as in this reply, BeautifulSoup Grab Visible Webpage Text, and hash its content. When you hit the same hash, you know that you have already crawled/scraped that page, and you can then move on incrementally to the next letter.
As an example of a hashing snippet, I would use the following:
import urllib.request
from bs4 import BeautifulSoup

def from_text_to_hash(url: str) -> str:
    """Getting visible text and hashing it"""
    url_downloaded = urllib.request.urlopen(url)
    soup = BeautifulSoup(url_downloaded, "lxml")
    visible_text = soup.title.text + "\t" + soup.body.text
    current_hash = str(hash(visible_text))
    return current_hash
And you keep track of the current_hash values in a set.
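A minimal sketch of how that set could drive the paging loop, reusing from_text_to_hash above and the URL pattern from the question:

seen_hashes = set()
page = 1
while True:
    url = f'https://vandal.elespanol.com/juegos/13/pc/letra/a/inicio/{page}'
    current_hash = from_text_to_hash(url)
    if current_hash in seen_hashes:
        break  # the redirect served a page we already hashed, so this letter is exhausted
    seen_hashes.add(current_hash)
    # ... scrape this page's genres here ...
    page += 1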

Could not find a class name for BeautifulSoup on the website to crawl on

I am new to Beautiful Soup.
I am trying to get the "Ranking Criteria" class at the link below.
Unfortunately, I used "criteria" as the class for soup.find_all(), but it showed no content there.
I could not find any other class names that give me the data I want (overall score, academic reputation, and so on).
I actually want to do web crawling for multiple universities, so I hope to use a URL that I can format for various universities (just changing the university's name).
Otherwise, I would have just used the outerHTML for that (tested and it worked, but I did not know how to customize it for multiple universities).
My code is as below. I ended up using get_text():
import requests
from bs4 import BeautifulSoup

r = requests.get("https://www.topuniversities.com/universities/california-institute-technology-caltech")
html = r.text
soup = BeautifulSoup(html, 'html.parser')
tds = soup.get_text()
print(tds)
It was not successful, as it grabbed too much, which made it hard to pick out the information I want.
Any help would be highly appreciated! Thanks!
The link I am trying to scrape:
The data is loaded dynamically via a JavaScript Ajax request, but you can use the requests module to simulate it.
For example:
import requests
from bs4 import BeautifulSoup

url = 'https://www.topuniversities.com/universities/california-institute-technology-caltech'
soup = BeautifulSoup( requests.get(url).content, 'html.parser' )

ajax_url = 'https://www.topuniversities.com' + soup.select_one('a.use-ajax')['href'].replace('nojs', 'ajax')
data = requests.post(ajax_url).json()

for d in data:
    if 'data' in d:
        soup = BeautifulSoup(d['data'], 'html.parser')
        break

for div in soup.select('div.criteria'):
    criteria = div.find(text=True).strip()
    ranking = div.b.get_text(strip=True)
    print('{:<30} {}'.format(criteria, ranking))
Prints:
Overall Score: 97
Academic Reputation: 97
Employer Reputation: 82.8
Faculty Student: 100
Citations per Faculty: 99.9
International Faculty: 100
International Students: 88.2

Unable to insert data into table using python and oracle database

My database table looks like this
I have a web crawler that fetches news from the website, and I am trying to store it in this table. I have used scrappy and the Beautiful Soup libraries. The code below shows my crawler logic.
import requests
from bs4 import BeautifulSoup
import os
import datetime
import cx_Oracle

def scrappy(url):
    try:
        r = requests.get(url)
        soup = BeautifulSoup(r.text, 'html.parser')
        title = soup.find('title').text.split('|')[0]
        time = soup.find('span', attrs={'class':'time_cptn'}).find_all('span')[2].contents[0]
        full_text = soup.find('div', attrs={'class':'article_content'}).text.replace('Download The Times of India News App for Latest India News','')
    except:
        return ('','','','')
    else:
        return (title,time,url,full_text)

def pathmaker(name):
    path = "Desktop/Web_Crawler/CRAWLED_DATA/{}".format(name)
    try:
        os.makedirs(path)
    except OSError:
        pass
    else:
        pass

def filemaker(folder,links_all):
    #k=1
    for link in links_all:
        scrapped = scrappy(link)
        #textfile=open('Desktop/Web_Crawler/CRAWLED_DATA/{}/text{}.txt'.format(x,k),'w+')
        #k+=1
        Title = scrapped[0]
        Link = scrapped[2]
        Dates = scrapped[1]
        Text = scrapped[3]
        con = cx_Oracle.connect('shivams/tiger#127.0.0.1/XE')
        cursor = con.cursor()
        sql_query = "insert into newsdata values(:1,:2,:3,:4)"
        cursor.executemany(sql_query,[Title,Link,Dates,Text])
        con.commit()
        cursor.close()
        con.close()
        #textfile.write('Title\n{}\n\nLink\n{}\n\nDate & Time\n{}\n\nText\n{}'.format(scrapped[0],scrapped[2],scrapped[1],scrapped[3]))
        #textfile.close()
    con.close()

folders_links = [('India','https://timesofindia.indiatimes.com/india'),('World','https://timesofindia.indiatimes.com/world'),('Business','https://timesofindia.indiatimes.com/business'),('Homepage','https://timesofindia.indiatimes.com/')]

for x,y in folders_links:
    pathmaker(x)
    r = requests.get(y)
    soup = BeautifulSoup(r.text, 'html.parser')
    if x != 'Homepage':
        links = soup.find('div', attrs={'class':'main-content'}).find_all('span', attrs={'class':'twtr'})
        links_all = ['https://timesofindia.indiatimes.com'+links[x]['data-url'].split('?')[0] for x in range(len(links))]
    else:
        links = soup.find('div', attrs={'class':'wrapper clearfix'})
        total_links = links.find_all('a')
        links_all = []
        for p in range(len(total_links)):
            if 'href' in str(total_links[p]) and '.cms' in total_links[p]['href'] and 'http' not in total_links[p]['href'] and 'articleshow' in total_links[p]['href']:
                links_all += ['https://timesofindia.indiatimes.com'+total_links[p]['href']]
    filemaker(x,links_all)
Earlier I was creating text files and storing the news in them, but now I want to store it in the database so my web application can access it. My database logic is in the filemaker function. I am trying to insert the values into the table, but it's not working and gives various types of errors. I followed other posts on the website but they didn't work in my case. Can anyone help me with this? Also, I am not sure if this is the correct way to insert CLOB data, as I am using it for the first time.
You can do the following:
cursor.execute(sql_query, [Title, Link, Dates, Text])
or, if you build up a list of these values, you can then do the following:
allValues = []
allValues.append([Title, Link, Dates, Text])
cursor.executemany(sql_query, allValues)
Hope that explains things!
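On the CLOB question raised in the post: cx_Oracle can normally bind a Python str straight into a CLOB column, but if the driver complains about the size of the text you can declare the bind type explicitly. A hedged sketch, assuming the fourth column of newsdata is the CLOB (credentials and values below are placeholders):

import cx_Oracle

# placeholder values standing in for the scraped fields from the question
Title, Link, Dates, Text = 'some title', 'https://example.com', '2020-01-01', 'long article body ...'

con = cx_Oracle.connect('user/password@127.0.0.1/XE')      # placeholder credentials
cursor = con.cursor()
sql_query = "insert into newsdata values (:1, :2, :3, :4)"
cursor.setinputsizes(None, None, None, cx_Oracle.CLOB)     # declare the 4th bind as a CLOB
cursor.execute(sql_query, [Title, Link, Dates, Text])
con.commit()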
