I'm trying to scrape some content, but I cannot get a double for-loop to work. I tried looking up other examples/solutions, but I am having no luck on my own. I am using Python 3.x and BS4.
Context:
In the HTML content there is a container holding 11x ("div",{"class":"days"})
Within each of these, there can be 1-8x ("div",{"class":"item"})
From each item, I want the 'name' and 'description' fields
page_soup = soup(page_html, "html.parser")
days = page_soup.findAll("div",{"class":"days"})
for item in days.findAll("div",{"class":"item"}):
    name = item.h3.a.text
    description = item.h4.a.text
    print(name, description)
This gives me the error AttributeError: ResultSet object has no attribute 'findAll'. When I add days = days[0], it gives me the correct details for the first 'days'. But now I want it to loop through all 11 'days'. How do I loop through them?
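One way to get the double loop working, sketched under the assumption that the markup matches the description above (the sample HTML here is made up for illustration): findAll returns a ResultSet, which behaves like a list of tags, so loop over it first and call find_all on each individual tag.

```python
from bs4 import BeautifulSoup

# Made-up sample markup standing in for the real page (assumption).
page_html = """
<div class="days">
  <div class="item"><h3><a>Name A</a></h3><h4><a>Desc A</a></h4></div>
  <div class="item"><h3><a>Name B</a></h3><h4><a>Desc B</a></h4></div>
</div>
<div class="days">
  <div class="item"><h3><a>Name C</a></h3><h4><a>Desc C</a></h4></div>
</div>
"""

page_soup = BeautifulSoup(page_html, "html.parser")

results = []
# Outer loop: each element of the ResultSet is a single Tag.
for day in page_soup.find_all("div", {"class": "days"}):
    # Inner loop: a Tag does have find_all, so this call works.
    for item in day.find_all("div", {"class": "item"}):
        name = item.h3.a.text
        description = item.h4.a.text
        results.append((name, description))
        print(name, description)
```

If some items lack an h3 or h4, item.h3 would be None and the attribute access would fail, so a real page may need a check before reading .a.text.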
Recently I was working with Python and Beautiful Soup to extract some data and put it into a pandas DataFrame.
I used Beautiful Soup to extract hotel data from the website booking.com.
I was able to extract some of the attributes correctly, without any empty values.
Here is my code snippet:
def get_Hotel_Facilities(soup):
    try:
        title = soup.find_all("div", attrs={"class":"db29ecfbe2 c21a2f2d97 fe87d598e8"})
        new_list = []
        # Inner NavigableString Object
        for i in range(len(title)):
            new_list.append(title[i].text.strip())
    except AttributeError:
        new_list = ""
    return new_list
The above code is my function to retrieve the facilities of a hotel and return them as a list.
page_no = 0
d = {"Hotel_Name":[], "Hotel_Rating":[], "Room_type":[],"Room_price":[],"Room_sqft":[],"Facilities":[],"Location":[]}
while (page_no<=25):
    # Note: the offset parameter needs "=" before the value to select a results page.
    URL = f"https://www.booking.com/searchresults.html?aid=304142&label=gen173rf-1FCAEoggI46AdIM1gDaGyIAQGYATG4ARfIAQzYAQHoAQH4AQKIAgGiAg1wcm9qZWN0cHJvLmlvqAIDuAKwwPadBsACAdICJDU0NThkNDAzLTM1OTMtNDRmOC1iZWQ0LTdhOTNjOTJmOWJlONgCBeACAQ&sid=2214b1422694e7b065e28995af4e22d9&sb=1&sb_lp=1&src=theme_landing_index&src_elem=sb&error_url=https%3A%2F%2Fwww.booking.com%2Fhotel%2Findex.html%3Faid%3D304142%26label%3Dgen173rf1FCAEoggI46AdIM1gDaGyIAQGYATG4ARfIAQzYAQHoAQH4AQKIAgGiAg1wcm9qZWN0cHJvLmlvqAIDuAKwwPadBsACAdICJDU0NThkNDAzLTM1OTMtNDRmOC1iZWQ0LTdhOTNjOTJmOWJlONgCBeACAQ%26sid%3D2214b1422694e7b065e28995af4e22d9%26&ss=goa&is_ski_area=0&checkin_year=2023&checkin_month=1&checkin_monthday=13&checkout_year=2023&checkout_month=1&checkout_monthday=14&group_adults=2&group_children=0&no_rooms=1&b_h4u_keep_filters=&from_sf=1&offset={page_no}"
    new_webpage = requests.get(URL, headers=HEADERS)
    soup = BeautifulSoup(new_webpage.content,"html.parser")
    links = soup.find_all("a", attrs={"class":"e13098a59f"})
    for link in links:
        new_webpage = requests.get(link.get('href'), headers=HEADERS)
        new_soup = BeautifulSoup(new_webpage.content, "html.parser")
        d["Hotel_Name"].append(get_Hotel_Name(new_soup))
        d["Hotel_Rating"].append(get_Hotel_Rating(new_soup))
        d["Room_type"].append(get_Room_type(new_soup))
        d["Room_price"].append(get_Price(new_soup))
        d["Room_sqft"].append(get_Room_Sqft(new_soup))
        d["Facilities"].append(get_Hotel_Facilities(new_soup))
        d["Location"].append(get_Hotel_Location(new_soup))
    page_no += 25
The above code is the main part: the while loop goes through the search result pages and collects the URLs of the hotel pages. It then visits each of those pages to retrieve the corresponding attributes.
I was able to retrieve the rest of the attributes correctly, but not the facilities: only some of the room facilities are returned, and others are missing.
Here is the output after converting it to a pandas DataFrame:
[Image: Facilities output]
Please help me understand why some facilities come through and some do not.
P.S.: The facilities are available on the website.
I have tried all the corresponding classes and attributes, but I still cannot get the Facilities column populated properly.
Probably as a preventive measure against scraping, the HTML returned for these requests doesn't seem to be consistent in layout or even in content.
There might be more possible selectors, but try
def get_Hotel_Facilities(soup):
    selectors = ['div[data-testid="property-highlights"]', '#facilities',
                 '.hp-description~div div.important_facility']
    new_list = []
    for sel in selectors:
        for sect in soup.select(sel):
            new_list += list(sect.stripped_strings)
    return list(set(new_list)) # set <--> unique
But even with this, the results are inconsistent. E.g.: I tested on this page with
for i in range(10):
    soup = BeautifulSoup(cloudscraper.create_scraper().get(url).content)
    fl = get_Hotel_Facilities(soup) if soup else []
    print(f'[{i}] {len(fl)} facilities: {", ".join(fl)}')
(But the inconsistencies might be due to using cloudscraper - maybe you'll get better results with your headers?)
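A side note on the function above: list(set(...)) does not keep the order in which the strings were found, so the printed order can differ between runs even when the same facilities come back. If that matters, a sketch like the following keeps first-seen order instead. The helper name unique_in_order and the sample strings are made up for illustration.

```python
def unique_in_order(strings):
    """Drop duplicates while keeping the first occurrence of each string."""
    # dict keys are unique and keep insertion order (Python 3.7+).
    return list(dict.fromkeys(strings))


# Hypothetical facility strings, as overlapping selectors might return them.
found = ["Free WiFi", "Pool", "Free WiFi", "Parking", "Pool"]
facilities = unique_in_order(found)
print(facilities)
```

This only stabilizes ordering; it does nothing about pages that genuinely return different HTML.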
I'm trying to run a spider to extract information from real estate advertisements.
My code:
import scrapy
from ..items import RealestateItem
class AddSpider(scrapy.Spider):
    name = 'Add'
    start_urls = ['https://www.exampleurl.com/2-bedroom-apartment-downtown-4154251/']
    def parse(self, response):
        items = RealestateItem()
        whole_page = response.css('body')
        for item in whole_page:
            Title = response.css(".obj-header-text::text").extract()
            items['Title'] = Title
            yield items
After running in console:
scrapy crawl Add -o Data.csv
In .csv file I get
['\n 2-bedroom-apartment ']
Tried adding strip method to function:
Title = response.css(".obj-header-text::text").extract().strip()
But scrapy returns:
Title = response.css(".obj-header-text::text").extract().strip()
AttributeError: 'list' object has no attribute 'strip'
Is there an easy way to make scrapy write just this into the .csv file:
2-bedroom-apartment
AttributeError: 'list' object has no attribute 'strip'
You get this error because .extract() returns a list, and .strip() is a string method.
If that selector always returns ONE item, you could use .get() [or extract_first()] instead of .extract(); this returns the first item as a string instead of a list.
If you need it to return a list, you can loop through the list, calling strip on each item, like:
title = response.css(".obj-header-text::text").extract()
title = [item.strip() for item in title]
You can also use an XPath selector instead of a CSS selector; that way you can use normalize-space to strip whitespace.
title = response.xpath('normalize-space(.//*[@class="obj-header-text"]/text())').extract()
This XPath may need some adjustment; you didn't post the source, so I couldn't check it.
I was using BeautifulSoup to scrape product names and prices from listings, and similar code worked on other websites. On this website, however, the elements I pass to soup.findAll are there when I inspect the page, but no text is scraped and an AttributeError occurs. Can anyone help by taking a look at the code and inspecting the website?
I checked and ran it many times, and the same issue remained.
The code is here:
url = 'https://shopee.co.id/Handphone-Aksesoris-cat.40'
re = requests.get(url,headers=headers)
print(str(re.status_code))
soup = BeautifulSoup(re.text, "html.parser")
for el in soup.findAll('div', attrs={"class": "collection-card_collecton-title"}):
    name = el.get.text()
    print(name)
AttributeError: 'NoneType' object has no attribute 'text'
You are missing an i in the class name (collecton). However, even with that fixed, the content is loaded dynamically from an API call. requests does not run JavaScript, so the follow-up call that updates the DOM never happens, which is why the elements are not in the HTML you fetch. You can find the call in the browser's network tab; it returns JSON.
import requests
r = requests.get('https://shopee.co.id/api/v2/custom_collection/get?category_id=40&platform=0').json()
titles = [i['collection_title'] for i in r['collections'][0]['list_popular_collection']]
print(titles)
Prices as well:
import requests
r = requests.get('https://shopee.co.id/api/v2/custom_collection/get?category_id=40&platform=0').json()
titles, prices = zip(*[(i['collection_title'], i['price']) for i in r['collections'][0]['list_popular_collection']])
print(titles, prices)
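The zip(*...) step in that last snippet can be hard to read, so here it is spelled out on made-up data. The dictionary below only imitates the shape the answer indexes into; the titles and prices are invented, and the real API may return different fields or values.

```python
# Made-up response data shaped like the JSON used in the answer (assumption).
r = {
    "collections": [
        {
            "list_popular_collection": [
                {"collection_title": "Phone case", "price": 15000},
                {"collection_title": "Charger", "price": 45000},
                {"collection_title": "Earphones", "price": 30000},
            ]
        }
    ]
}

entries = r["collections"][0]["list_popular_collection"]

# Build (title, price) pairs, then transpose them into two tuples with zip(*...).
pairs = [(e["collection_title"], e["price"]) for e in entries]
titles, prices = zip(*pairs)

print(titles)
print(prices)
```

Note that if the list is empty, zip(*pairs) yields nothing and the two-name unpacking raises ValueError, so it may be worth checking that entries is non-empty first.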
I am trying to use beautifulsoup4 in Python to scrape the URL from the HTML below, but I get this error: AttributeError: 'NoneType' object has no attribute 'get'
HTML code:
<a class="top NQHJEb dfhHve" href="https://globalnews.ca/news/5137005/donald-trump-robert-mueller-report/" ping="/url?sa=t&source=web&rct=j&url=https://globalnews.ca/news/5137005/donald-trump-robert-mueller-report/&ved=0ahUKEwiS9pn-4rzhAhWOyIMKHSOPD6QQvIgBCDcwAg"><img class="th BbeB2d" src="https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcQ_Nf-kVlqsQz8NeNgQ9a9YRiA7Fl4DJ6Jod0sxNXapOK_iJebx20dgROk5YBl8IqFQX6S-eeY2" alt="Story image for trump from Globalnews.ca" onload="typeof google==='object'&&google.aft&&google.aft(this)" data-iml="1554598687532" data-atf="3"></a>
My python code:
URL_results = soup.find_all('a', class_= 'top NQHJEb dfhHve').get('href')
You are applying the method to a list (the result of find_all). Instead, apply it to each element:
URL_results = [a.attrs.get('href') for a in soup.find_all('a', class_= 'top NQHJEb dfhHve')]
I prefer
URL_results = [item['href'] for item in soup.select('a.top.NQHJEb.dfhHve')]
You may also be able to remove some of the classes from the current compound class selector, e.g.
URL_results = [item['href'] for item in soup.select('a.dfhHve')]
You will need to play around and see.
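To compare the variants, here is a small sketch on a trimmed copy of the markup from the question, plus one made-up anchor without an href. It shows that .get("href") returns None for a missing attribute, while selecting on a[href] skips such anchors entirely.

```python
from bs4 import BeautifulSoup

# Trimmed version of the question's markup; the second anchor is made up
# to show what happens when href is missing.
html = """
<a class="top NQHJEb dfhHve" href="https://globalnews.ca/news/5137005/donald-trump-robert-mueller-report/"></a>
<a class="top NQHJEb dfhHve"></a>
"""
soup = BeautifulSoup(html, "html.parser")

anchors = soup.select("a.top.NQHJEb.dfhHve")

# .get("href") returns None when the attribute is absent;
# a["href"] would raise KeyError for the second anchor.
all_values = [a.get("href") for a in anchors]

# Adding [href] to the selector keeps only anchors that have the attribute.
hrefs = [a["href"] for a in soup.select("a.top.NQHJEb.dfhHve[href]")]

print(all_values)
print(hrefs)
```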
I have been trying to delete the first instance of an element using BeautifulSoup, and I am sure I am missing something. I did not use find_all since I need to target only the first instance, which is always a header (div) and has the class HubHeader. The class is also used in other places in combination with a div tag. Unfortunately, I can't change the setup of the base HTML.
I also tried select_one outside of a loop, and it still did not work.
def delete_header(filename):
    html_docs = open(filename,'r')
    soup = BeautifulSoup( html_docs, "html.parser")
    print (soup.select_one(".HubHeader")) #testing
    for div in soup.select_one(".HubHeader"):
        div.decompose()
    print (soup.select_one(".HubHeader")) #testing
    html_docs.close()
delete_header("my_file")
The most recent error is this:
AttributeError: 'NavigableString' object has no attribute 'decompose'
I am using select_one() and decompose().
Short answer: replace
for div in soup.select_one(".HubHeader"):
    div.decompose()
with one line:
soup.select_one(".HubHeader").decompose()
Longer answer: your code iterates over a bs4.element.Tag object. .select_one() returns a single Tag (or None if nothing matches), and looping over a Tag walks its children, which can be NavigableString objects with no decompose() method. .select() returns a list of tags, so with .select() your loop would run, but it would remove every element with the selected class, not just the first.
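The difference can be seen on a small made-up document (the HTML below is an assumption, not the asker's file). The sketch first shows what iterating over the Tag from select_one actually yields, then removes only the first match.

```python
from bs4 import BeautifulSoup

# Made-up document: the first .HubHeader is the header to remove.
html = (
    '<div class="HubHeader">Header text</div>'
    '<p>Body</p>'
    '<div class="HubHeader">Keep me</div>'
)
soup = BeautifulSoup(html, "html.parser")

header = soup.select_one(".HubHeader")

# Iterating over a Tag walks its children; here the only child is a string.
child_types = [type(child).__name__ for child in header]
print(child_types)

# Guard against None in case nothing matches, then remove the first match only.
if header is not None:
    header.decompose()

remaining = [div.get_text() for div in soup.select(".HubHeader")]
print(remaining)
```

The None check is optional for a file that is known to contain the header, but without it a missing header raises AttributeError on the decompose() call.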