I am scraping a forum page for posts and relevant links using BeautifulSoup.
The links I want on the page match the pattern r"xx/res/[0-9]{5}\.html$".
So far, so good: I can find them in my BeautifulSoup object, and printing returns links in the following format: /xx/res/83071.html.
I now want to prepend the domain name 'http://website.com' to each result, and use the full url as the basis for further scraping.
My successful code looks like this:
import re
from urllib.request import urlopen
from bs4 import BeautifulSoup

url = 'http://website.com/xx/index.html'
res = urlopen(url)
soup = BeautifulSoup(res, 'html.parser')
links = soup.find_all('a', href=re.compile(r"xx/res/[0-9]{5}\.html$"))
for l in links:
    print(l['href'])
As an example, the following is printed to the console:
/xx/res/83071.html
/xx/res/81813.html
/xx/res/92014.html
/xx/res/92393.html
Hoping to get some help with the correct syntax to concatenate the prepended string to the output.
Thanks.
This will work for you:
url = 'http://website.com/xx/index.html'
res = urlopen(url)
soup = BeautifulSoup(res, 'html.parser')
links = soup.find_all('a', href=re.compile(r"xx/res/[0-9]{5}\.html$"))
for l in links:
    print('http://website.com' + l['href'])
There are several ways to do it. I personally like the str.format method.
Store the base URL:
base_url = 'http://website.com'
Your print line would then be:
print('{}{}'.format(base_url, l['href']))
where each {} gets replaced by .format with the variables you pass in. Note that the href already starts with a slash, so no extra separator is needed.
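If you would rather not hard-code the scheme and host, urllib.parse.urljoin from the standard library can build the absolute URL from the page URL you already have. A minimal sketch, reusing the url and links variables from the code above:

from urllib.parse import urljoin

# urljoin resolves the relative href against the page URL, so
# '/xx/res/83071.html' becomes 'http://website.com/xx/res/83071.html'
for l in links:
    print(urljoin(url, l['href']))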
Recently I was working with Python Beautiful Soup to extract some data and put it into a pandas DataFrame.
I used Beautiful Soup to extract some of the hotel data from the website booking.com.
I was able to extract some of the attributes correctly, without any empty values.
Here is my code snippet:
def get_Hotel_Facilities(soup):
    try:
        title = soup.find_all("div", attrs={"class":"db29ecfbe2 c21a2f2d97 fe87d598e8"})
        new_list = []
        # Inner NavigatableString Object
        for i in range(len(title)):
            new_list.append(title[i].text.strip())
    except AttributeError:
        new_list = ""
    return new_list
The above code is my function to retrieve the facilities of a hotel and return them as a list.
page_no = 0
d = {"Hotel_Name":[], "Hotel_Rating":[], "Room_type":[], "Room_price":[], "Room_sqft":[], "Facilities":[], "Location":[]}
while page_no <= 25:
    URL = f"https://www.booking.com/searchresults.html?aid=304142&label=gen173rf-1FCAEoggI46AdIM1gDaGyIAQGYATG4ARfIAQzYAQHoAQH4AQKIAgGiAg1wcm9qZWN0cHJvLmlvqAIDuAKwwPadBsACAdICJDU0NThkNDAzLTM1OTMtNDRmOC1iZWQ0LTdhOTNjOTJmOWJlONgCBeACAQ&sid=2214b1422694e7b065e28995af4e22d9&sb=1&sb_lp=1&src=theme_landing_index&src_elem=sb&error_url=https%3A%2F%2Fwww.booking.com%2Fhotel%2Findex.html%3Faid%3D304142%26label%3Dgen173rf1FCAEoggI46AdIM1gDaGyIAQGYATG4ARfIAQzYAQHoAQH4AQKIAgGiAg1wcm9qZWN0cHJvLmlvqAIDuAKwwPadBsACAdICJDU0NThkNDAzLTM1OTMtNDRmOC1iZWQ0LTdhOTNjOTJmOWJlONgCBeACAQ%26sid%3D2214b1422694e7b065e28995af4e22d9%26&ss=goa&is_ski_area=0&checkin_year=2023&checkin_month=1&checkin_monthday=13&checkout_year=2023&checkout_month=1&checkout_monthday=14&group_adults=2&group_children=0&no_rooms=1&b_h4u_keep_filters=&from_sf=1&offset{page_no}"
    new_webpage = requests.get(URL, headers=HEADERS)
    soup = BeautifulSoup(new_webpage.content, "html.parser")
    links = soup.find_all("a", attrs={"class":"e13098a59f"})
    for link in links:
        new_webpage = requests.get(link.get('href'), headers=HEADERS)
        new_soup = BeautifulSoup(new_webpage.content, "html.parser")
        d["Hotel_Name"].append(get_Hotel_Name(new_soup))
        d["Hotel_Rating"].append(get_Hotel_Rating(new_soup))
        d["Room_type"].append(get_Room_type(new_soup))
        d["Room_price"].append(get_Price(new_soup))
        d["Room_sqft"].append(get_Room_Sqft(new_soup))
        d["Facilities"].append(get_Hotel_Facilities(new_soup))
        d["Location"].append(get_Hotel_Location(new_soup))
    page_no += 25
The above code is the main part: the while loop traverses the paginated results and retrieves the URLs of the listing pages. After retrieving them, it visits every page to retrieve the corresponding attributes.
I was able to retrieve the rest of the attributes correctly, but not the facilities: only some of the room facilities are returned, and some are not.
Here is my output after converting it into a pandas DataFrame:
[facilities output image]
Please help me with this problem: why are some facilities returned and some not?
P.S.: the facilities are available on the website. I have tried using all the corresponding classes and attributes for retrieval, but I am not getting the facilities column properly.
Probably as a protective measure, the HTML fetched by the requests doesn't seem to be consistent in its layout or even its contents.
There may be more possible selectors, but try:
def get_Hotel_Facilities(soup):
    selectors = ['div[data-testid="property-highlights"]', '#facilities',
                 '.hp-description~div div.important_facility']
    new_list = []
    for sel in selectors:
        for sect in soup.select(sel):
            new_list += list(sect.stripped_strings)
    return list(set(new_list))  # set <--> unique
But even with this, the results are inconsistent. E.g.: I tested on this page with
for i in range(10):
    soup = BeautifulSoup(cloudscraper.create_scraper().get(url).content, 'html.parser')
    fl = get_Hotel_Facilities(soup) if soup else []
    print(f'[{i}] {len(fl)} facilities: {", ".join(fl)}')
(But the inconsistencies might be due to using cloudscraper - maybe you'll get better results with your headers?)
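If the served markup really does vary from request to request, one workaround is to simply re-fetch the page until one of the selectors matches. A rough sketch, assuming the HEADERS dict and the get_Hotel_Facilities function from above:

import time
import requests
from bs4 import BeautifulSoup

def fetch_facilities(url, max_tries=5):
    # Re-request the page until at least one facility is found,
    # since the served HTML is not always the same.
    for _ in range(max_tries):
        resp = requests.get(url, headers=HEADERS)
        soup = BeautifulSoup(resp.content, 'html.parser')
        facilities = get_Hotel_Facilities(soup)
        if facilities:
            return facilities
        time.sleep(1)  # brief pause before retrying
    return []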
I was wondering if someone could help me put together some code for
https://finance.yahoo.com/quote/TSCO.l?p=TSCO.L
I currently use this code to scrape the current price
currentPriceData = soup.find_all('div', {'class':'My(6px) Pos(r) smartphone_Mt(6px)'})[0].find('span').text
This works fine, but I occasionally get an error; I'm not really sure why, as the links are all correct. I would like to retry getting the price,
so something like:
try:
    currentPriceData = soup.find_all('div', {'class':'My(6px) Pos(r) smartphone_Mt(6px)'})[0].find('span').text
except Exception:
    currentPriceData = soup.find_all('span', {'class':'Trsdu(0.3s) Fw(b) Fz(36px) Mb(-4px) D(ib)'})[0].text
The problem is that I can't get it to scrape the number using this method. Any help would be greatly appreciated.
The data is embedded within the page as a JavaScript variable, but you can use the json module to parse it.
For example:
import re
import json
import requests
url = 'https://finance.yahoo.com/quote/TSCO.l?p=TSCO.L'
html_data = requests.get(url).text
# The next line extracts from the HTML source the JavaScript variable
# that holds all the data rendered on the page.
# BeautifulSoup cannot run JavaScript, so we use the `json` module
# to extract the data instead.
# NOTE: When you view the source in Firefox/Chrome, you can search for
# `root.App.main` to see it.
data = json.loads(re.search(r'root\.App\.main = ({.*?});\n', html_data).group(1))
# uncomment this to print all data:
# print(json.dumps(data, indent=4))
# We now have the Javascript variable extracted to standard python
# dict, so now we just print contents of some keys:
price = data['context']['dispatcher']['stores']['QuoteSummaryStore']['price']['regularMarketPrice']['fmt']
currency_symbol = data['context']['dispatcher']['stores']['QuoteSummaryStore']['price']['currencySymbol']
print('{} {}'.format(price, currency_symbol))
Prints:
227.30 £
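Since some of those nested keys can be missing for other tickers, it can help to walk the parsed dict defensively. A small sketch with a hypothetical deep_get helper (the key path is the one used above):

def deep_get(d, *keys, default=None):
    # Walk nested dicts, returning `default` if any key is absent.
    for key in keys:
        if not isinstance(d, dict) or key not in d:
            return default
        d = d[key]
    return d

price = deep_get(data, 'context', 'dispatcher', 'stores',
                 'QuoteSummaryStore', 'price', 'regularMarketPrice', 'fmt')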
I am trying to use the built-in search function from the site but I keep getting results from the main page. Not sure what I am doing wrong.
import requests
from bs4 import BeautifulSoup
body = {'input': 'ferris'}  # <-- have also tried 'query'
con = requests.post('http://www.collegedata.com/', data=body)
soup = BeautifulSoup(con.content, 'html.parser')
products = soup.findAll('div', {'class': 'schoolCityCol'})
print(soup)
print (products)
You have two issues in your code:
The POST URL is incorrect. You should correct this:
con = requests.post('http://www.collegedata.com/cs/search/college/college_search_tmpl.jhtml', data=body)
Your POST data is incorrect too:
body = {'method': 'submit', 'collegeName': 'ferris', 'searchType': '1'}
You can use the developer tools in any browser (preferably Chrome) and check the POST URL and data on the Network tab.
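Putting both corrections together, the full request might look like this (a sketch; the search URL and form fields are the ones identified above, and the schoolCityCol selector is reused from the question):

import requests
from bs4 import BeautifulSoup

body = {'method': 'submit', 'collegeName': 'ferris', 'searchType': '1'}
con = requests.post(
    'http://www.collegedata.com/cs/search/college/college_search_tmpl.jhtml',
    data=body)
soup = BeautifulSoup(con.content, 'html.parser')
products = soup.find_all('div', {'class': 'schoolCityCol'})
print(products)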
Suppose we have this Swarm URL: https://www.swarmapp.com/c/dZxqzKerUMc. How can we get the URL under the "Apple Williamsburg" hyperlink in the link above?
I tried to filter it out by HTML tags, but there are many tags and lots of foursquare.com links.
Below is a part of the source code of the given link above:
<h1><strong>Kristin Brooks</strong> at <a
href="https://foursquare.com/v/apple-williamsburg/57915fa838fab553338ff7cb"
target="_blank">Apple Williamsburg</a></h1>
The foursquare URL in the code is not always the same, so what is the best way to get that specific URL for every given Swarm URL?
I tried this:
import bs4
import requests

def get_4square_url(link):
    response = requests.get(link)
    soup = bs4.BeautifulSoup(response.text, "html.parser")
    link = [a.attrs.get('href') for a in
            soup.select('a[href=https://foursquare.com/v/*]')]
    return link

print(get_4square_url('https://www.swarmapp.com/c/dZxqzKerUMc'))
I used https://foursquare.com/v/ as a pattern to get the desired URL:
import re
import bs4
import requests
import urllib3

def get_4square_url(link):
    try:
        response = requests.get(link)
        soup = bs4.BeautifulSoup(response.text, "html.parser")
        for elem in soup.find_all('a',
                href=re.compile(r'https://foursquare\.com/v/')):  # here is my pattern
            link = elem['href']
        return link
    except (requests.exceptions.HTTPError,
            requests.exceptions.ConnectionError,
            requests.exceptions.ConnectTimeout,
            urllib3.exceptions.MaxRetryError):
        pass

Note that the exceptions must be caught as a tuple; except A or B evaluates the or first and only catches A.
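Alternatively, the CSS-selector approach from the question can be made to work: a prefix match on an attribute uses ^= and the value must be quoted. A minimal sketch:

import requests
from bs4 import BeautifulSoup

def get_4square_url(link):
    response = requests.get(link)
    soup = BeautifulSoup(response.text, 'html.parser')
    # ^= matches hrefs that start with the foursquare venue prefix
    a = soup.select_one('a[href^="https://foursquare.com/v/"]')
    return a['href'] if a else None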
I have been developing a web crawler for this website (http://www.bobaedream.co.kr/cyber/CyberCar.php?gubun=I&page=1), but I am having trouble crawling the title of each listing. I am pretty sure the problem is the line carinfo_title = carinfo.find_all('a', class_='title').
Please check out the attached code and website code, and then give me any advice.
Thanks.
(Website Code)
https://drive.google.com/open?id=0BxKswko3bYpuRV9seTZZT3REak0
(My code)
from bs4 import BeautifulSoup
import urllib.request

target_url = "http://www.bobaedream.co.kr/cyber/CyberCar.php?gubun=I&page=1"

def fetch_post_list():
    URL = target_url
    res = urllib.request.urlopen(URL)
    html = res.read()
    soup = BeautifulSoup(html, 'html.parser')
    table = soup.find('table', class_='cyber')

    # Car Info and Link
    carinfo = table.find_all('td', class_='carinfo')
    carinfo_title = carinfo.find_all('a', class_='title')
    print(carinfo_title)
    return carinfo_title

fetch_post_list()
You have multiple elements with the carinfo class, and for each "carinfo" you need to get to the car title. Loop over the result of table.find_all('td', class_='carinfo'):
for carinfo in table.find_all('td', class_='carinfo'):
    carinfo_title = carinfo.find('a', class_='title')
    print(carinfo_title.get_text())
Would print:
미니 쿠퍼 S JCW
지프 랭글러 3.8 애니버서리 70주년 에디션
...
벤츠 뉴 SLK200 블루이피션시
포르쉐 뉴 카이엔 4.8 GTS
마쯔다 MPV 2.3
Note that if you need only car titles, you can simplify it down to a single line:
print([elm.get_text() for elm in soup.select('table.cyber td.carinfo a.title')])
where the string inside the .select() method is a CSS selector.