Class consists of four parts separated by spaces - python-3.x

I am trying to scrape a website using Python and Beautiful Soup. The goal is to build a CSV file with the relevant information (location, unit size, rent, ...).
I am not 100% sure what the problem is, but I think it has to do with the structure of the class: "result matches_criteria_and_filters first_listing highlighted"
First part of the code:
import requests
from bs4 import BeautifulSoup
r = requests.get("https://www.publicstorage.com/storage-search-landing.aspx?location=New+York")
c = r.content
After that I would need the elements with class="result matches_criteria_and_filters first_listing highlighted", but this is where I am stuck.
Solutions that I found in other threads were not working.
soup.select("result.matches_criteria_and_filters.first_listing.highlighted")
Another possibility I found is to search for the classes separately, but that did not work either.
soup.find_all(attrs={'class': 'result'})
soup.find_all(attrs={'class': 'matches_criteria_and_filters'})
Everything I tried, gave empty or none objects.

First try getting the parent div by its id, then iterate through its children; code similar to the following:
soup = BeautifulSoup(your_html, 'lxml')
results_div = soup.find('div', {'id': 'results'})
# now iterate through all child divs
for child_div in results_div.find_all('div', recursive=False):
    # then do whatever you want to do with each child div
    print(child_div)
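As for the select call in the question: it fails because in a CSS selector every class needs its own leading dot, so "result.matches_criteria_and_filters..." looks for a <result> tag rather than the class. A minimal sketch against hypothetical markup mirroring the listing classes:

```python
from bs4 import BeautifulSoup

# Hypothetical snippet mirroring the listing markup from the question
html = """
<div id="results">
  <div class="result matches_criteria_and_filters first_listing highlighted">Unit A</div>
  <div class="result matches_criteria_and_filters">Unit B</div>
</div>
"""
soup = BeautifulSoup(html, 'html.parser')

# Each class in the selector gets its own leading dot
listings = soup.select('.result.matches_criteria_and_filters')
print([d.get_text() for d in listings])  # ['Unit A', 'Unit B']
```

Bear in mind that if the live page fills in the results with JavaScript, requests alone will never see these divs, no matter which selector is used.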

Related

Why does find_next_sibling in bs4 work on one line of code but not another, very similar, line of code?

I'm writing a simple web scraper to get data from the Texas Commission on Environmental Quality (TCEQ) website. The info I need is inside 'td' tags. I'm scraping the appropriate 'td' by referencing the preceding 'th', which all have the same text used to ID. I'm using find_next_sibling to scrape the data into a variable.
Here is my code:
import requests
from bs4 import BeautifulSoup
URL = "https://www2.tceq.texas.gov/oce/eer/index.cfm?fuseaction=main.getDetails&target=323191"
r = requests.get(URL)
soup = BeautifulSoup(r.content, 'html.parser')
###This one works
report = soup.find("th", text="Incident Tracking Number:").find_next_sibling("td").text
###This one doesn't
owner = soup.find("th", text="Name of Owner or Operator:").find_next_sibling("td").text
I'm getting this error: AttributeError: 'NoneType' object has no attribute 'find_next_sibling'. This code has several lines like the two above, and, like them, some of them work and some of them don't. I've looked into the HTML to see if there's another tag, but I'm not seeing it if it's there. Please and thank you for any help!
When using the text parameter, you should make sure you provide the text exactly. In your case, there's a space at the end.
soup.find('th', text='Name of Owner or Operator: ').find_next_sibling('td').text
This prints:
\n \n \n \n \n PHILLIPS 66 COMPANY\n \n \n
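If you would rather not depend on the exact trailing whitespace, a regex on the header text is more forgiving. A minimal sketch with stand-in markup (the real TCEQ table is larger):

```python
import re
from bs4 import BeautifulSoup

# Stand-in for one row of the TCEQ details table
html = '<tr><th>Name of Owner or Operator: </th><td> PHILLIPS 66 COMPANY </td></tr>'
soup = BeautifulSoup(html, 'html.parser')

# A regex search tolerates the trailing space that defeats the exact-text lookup
# (newer bs4 versions prefer string= over text=)
th = soup.find('th', text=re.compile(r'Name of Owner or Operator:'))
owner = th.find_next_sibling('td').get_text(strip=True)
print(owner)  # PHILLIPS 66 COMPANY
```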

How to get specific text in a soup.find method on python?

I'm having multiple issues trying to scrape a website where the CSS classes are all the same. I'm still learning about the soup.find method and what I can do with it. The problem is that several elements on the page share <span class="list-quest">, and when I use soup.find(class_='list-quest'), for example, I only get the first result from the top of the page. Is there a way to get the exact specific element? Possibly using Born [dd-mm-yyyy]? Sadly, I do not know how to use a specific keyword like that with Python to find it.
<span class="list-quest">Born [dd-mm-yyyy]:</span>
By using a regex on the text attribute:
Regex:
Born \d{2}-\d{2}-\d{4}:
Python code:
from bs4 import BeautifulSoup
import re
text = '<span class="list-quest">Born 01-01-2019:</span>'
soup = BeautifulSoup(text,features='html.parser')
tag = soup.find('span',attrs={'class':'list-quest'} , text=re.compile(r'Born \d{2}-\d{2}-\d{4}'))
print(tag.text)
With bs4 4.7.1+, you might be able to use :contains:
item = soup.select_one('span.list-quest:contains("Born ")')
if item is not None: print(item.text)
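On newer releases of soupsieve (the CSS engine bundled with bs4), the non-standard :contains pseudo-class is spelled :-soup-contains; a small sketch:

```python
from bs4 import BeautifulSoup

html = '<span class="list-quest">Born 01-01-2019:</span>'
soup = BeautifulSoup(html, 'html.parser')

# :-soup-contains is the current spelling of the deprecated :contains
item = soup.select_one('span.list-quest:-soup-contains("Born")')
if item is not None:
    print(item.text)  # Born 01-01-2019:
```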

Processing all values of an array with get_text

(Disclaimer: I'm a newbie, I'm sorry if this problem is really obvious)
Hello,
I built a little script that first finds certain parts of the HTML markup within a local file and then displays the information without HTML tags.
I used bs4 and find_all / get_text for this. Take a look:
from bs4 import BeautifulSoup
with open("/Users/user1/Desktop/testdatapython.html") as fp:
    soup = BeautifulSoup(fp, "lxml")
titleResults = soup.find_all('span', attrs={'class':'caption-subject'})
firstResult = titleResults[0]
firstStripped = firstResult.get_text()
print(firstStripped)
This actually works so far. But I want to do this for all values of titleResults, not only the first one, and I can't call get_text on a list.
Which way would be best to accomplish this? The number of values for titleResults is always changing since the local html file is only a sample.
Thank you in advance!
P.S. I already looked up this related thread but it is not enough for understanding or solving the problem sadly:
BeautifulSoup get_text from find_all
find_all returns a list, so iterate over it:
for result in titleResults:
    stripped = result.get_text()
    print(stripped)
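Equivalently, a list comprehension collects every stripped title in one expression; a self-contained sketch with sample markup standing in for the local file:

```python
from bs4 import BeautifulSoup

# Sample markup standing in for the local HTML file
html = ('<span class="caption-subject">First Title</span>'
        '<span class="caption-subject">Second Title</span>')
soup = BeautifulSoup(html, 'html.parser')

titleResults = soup.find_all('span', attrs={'class': 'caption-subject'})
titles = [result.get_text(strip=True) for result in titleResults]
print(titles)  # ['First Title', 'Second Title']
```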

Can't get beautiful soup to return the correct article titles, links, and img. Help debug?

I've been trying to scrape data for a project from the Times for the last 7 hours, and yes, it has to be done without the API. It's been a war of attrition, but this code that checks out keeps returning NaNs; am I missing something simple? Towards the bottom of the page is every story contained within the front page: little cards that have an image, 3 article titles, and their corresponding links. My code either doesn't grab a thing, partially grabs it, or grabs something completely wrong. There should be about 35 cards with 3 links apiece, for 105 articles. I've gotten it to recognize 27 cards, with a lot of NaNs instead of strings and none of the individual articles.
import csv, requests, re, json
from bs4 import BeautifulSoup
handle = 'http://www.'
location = 'ny'
ping = handle + location + 'times.com'
pong = requests.get(ping, headers = {'User-agent': 'Gordon'})
soup = BeautifulSoup(pong.content, 'html.parser')
# upper cards attempt
for i in soup.find_all('div', {'class': 'css-ki19g7 e1aa0s8g0'}):
    print(i.a.get('href'))
    print(i.a.text)
    print('')
# lower cards attempt
count = 0
for i in soup.find_all('div', {"class": "css-1ee8y2t assetWrapper"}):
    try:
        print(i.a.get('href'))
        count += 1
    except:
        pass
print('current card pickup: ', count)
print('the goal card pickup:', 35)
Everything clickable uses "css-1ee8y2t assetWrapper", but when I find_all I'm only getting 27 of them. I wanted to start from css-guaa7h and work my way down, but it only returns NaNs. Other promising but fruitless divs are:
div class="css-2imjyh" data-testid="block-Well" data-block-tracking-id="Well"
div class="css-a11566"
div class="css-guaa7h"
div class="css-zygc9n"
div data-testid="lazyimage-container" # for images
Current attempt:
h3 class="css-1d654v4">Politics
My hope is running out; why is just getting a first job harder than working hard labor?
I checked their website and it's using ajax to load the articles as soon as you scroll down. You'll probably have to use selenium. Here's an answer that might help you do that: https://stackoverflow.com/a/21008335/7933710

Getting all tags with multiple attributes with SoupStrainer and BeautifulSoup

I'm trying to get all the occurrences of the 'td' tag when the class attribute has one of a few different values.
I know how to do this with BeautifulSoup after the fact, but because of the time it takes, I'm trying to speed things up by selectively parsing each page with SoupStrainer. I first tried the below, but it doesn't seem to work.
strainer = SoupStrainer('td', attrs={'class': ['Value_One', 'Value_Two']})
soup = BeautifulSoup(foo.content, "lxml", parse_only=strainer)
Does anybody know of a way to make this work (it doesn't have to involve SoupStrainer or even Beautiful Soup)?
Depending on what you mean, of course, you might be able to use scrapy, which lets you formulate XPath expressions such as the one used here. It takes advantage of the fact that the two class attributes are similar. Many other ways of making selections are available.
>>> from scrapy.selector import Selector
>>> selector = Selector(text=open('temp.htm').read())
>>> selector.xpath('.//td[contains(@class,"Value")]/text()').extract()
['value one', 'value two']
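Staying with SoupStrainer, a callable in the attrs dict can match either class value at parse time; a sketch under the assumption that each td carries a single class (hypothetical markup):

```python
from bs4 import BeautifulSoup, SoupStrainer

# Hypothetical markup: two wanted classes plus one that should be skipped
html = """
<table>
  <tr><td class="Value_One">value one</td></tr>
  <tr><td class="Value_Two">value two</td></tr>
  <tr><td class="Other">skip me</td></tr>
</table>
"""

# A function filter is evaluated against the raw attribute value while parsing
strainer = SoupStrainer('td', attrs={'class': lambda c: c in ('Value_One', 'Value_Two')})
soup = BeautifulSoup(html, 'html.parser', parse_only=strainer)
print([td.get_text() for td in soup.find_all('td')])  # ['value one', 'value two']
```

Only the matching td elements are ever built into the tree, which is where the speedup over filtering after a full parse comes from.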
