Web Scraping Location Data with BeautifulSoup - python-3.x

I am trying to scrape a webpage for address data (the highlighted street address shown in the screenshot from the original post) using the find() function of the BeautifulSoup library. Most online tutorials only provide examples where the data can be pinpointed by a specific class; however, for this particular site the street address is an element nested inside a larger class="dataCol col02 inlineEditWrite" block, and I'm not sure how to get at it with the find() function.
What would be the arguments to find() to get the street address in this example? Any help would be greatly appreciated.

This should get you started: it finds every div element with the class "dataCol col02 inlineEditWrite", then searches for td elements within each one and prints the first td element's text:
divTags = soup.find_all("div", {"class": "dataCol col02 inlineEditWrite"})
for tag in divTags:
    tdTags = tag.find_all("td")
    print(tdTags[0].text)
The example above assumes you want to print the first td element from every div with the class "dataCol col02 inlineEditWrite"; if you only want the first matching div, use:
divTags = soup.find_all("div", {"class": "dataCol col02 inlineEditWrite"})
tdTags = divTags[0].find_all("td")
print(tdTags[0].text)
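For completeness, a minimal end-to-end sketch; the URL here is a hypothetical placeholder, and it assumes the street address sits in the first td inside that div:

import requests
from bs4 import BeautifulSoup

html = requests.get("https://example.com/record/123").text  # hypothetical URL; substitute the page you are scraping
soup = BeautifulSoup(html, "html.parser")

divTag = soup.find("div", {"class": "dataCol col02 inlineEditWrite"})
if divTag is not None:
    tdTags = divTag.find_all("td")
    print(tdTags[0].text.strip())  # assumes the street address is the first td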

Related

Filtering out elements found with Beautiful Soup based on a keyword in any attribute

Here is an example of a URL.
import re
import requests
from bs4 import BeautifulSoup

url = 'https://rapaxray.com'
headers = {'User-Agent': 'Mozilla/5.0'}  # headers were not shown in the original snippet

# logo
html_content = requests.get(url, headers=headers).text
soup = BeautifulSoup(html_content, "lxml")
images_found = soup.find_all('img', {'src': re.compile(r'(jpe?g)|(png)|(svg)$')})
images_found
First I'm narrowing the list of elements down to the ones whose src contains jpg, png or svg. In this case I only get 3 elements. Then I would like to filter those elements to show me only the ones that have the keyword 'logo' in ANY attribute.
The element I'm looking for in this example looks like this:
<img alt="Radiology Associates, P.A." class="attachment-full size-full astra-logo-svg" loading="lazy" src="https://rapaxray.com/wp-content/uploads/2019/09/RAPA100.svg"/>
I want to select this element from all of the others based on the condition that it has the keyword 'logo' in ANY of its attributes.
The challenge is that:
I have thousands of URLs, and the keyword 'logo' could be in a different attribute on different pages.
Logic like if 'logo' in any(attribute for attribute in list_of_possible_attributes_that_this_element_has) doesn't work the way an ordinary comprehension does, because I couldn't find a way to access every possible attribute without naming it explicitly.
Checking every specific name is also problematic, because a particular attribute may exist on one element but not on another, which raises an error.
The case above is extra challenging because the attribute value can itself be a list, so we would need to flatten it to check whether the keyword is in it.
For most of the URLs the element I'm looking for is not returned first, as it is in this example, so simply taking the top match is not an option.
Is there a way of filtering elements based on a keyword in ANY of their attributes, without prior knowledge of the attribute's name?
If I understood you correctly, you could use a filter function similar to this answer to search for all tags such that any tag attribute's value contains val:
def my_filter(tag, val):
    types = ['.jpg', '.jpeg', '.svg', '.png']
    if tag is not None and tag.name == "img" and tag.has_attr("src"):
        if all(y not in tag['src'] for y in types):
            return False
        for key in tag.attrs.keys():
            if isinstance(tag[key], list):
                if any(val in entry for entry in tag[key]):
                    return True
            else:
                if val in tag[key]:
                    return True
    return False
res = soup.find_all(lambda tag: my_filter(tag, "logo"))
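A small usage sketch: once res is populated, the matching logo URLs can be pulled from the src attribute:

logo_urls = [tag.get('src') for tag in res]
print(logo_urls)  # e.g. ['https://rapaxray.com/wp-content/uploads/2019/09/RAPA100.svg']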

Accessing Span Elements

When trying to scrape the integer rating value from IMDB reviews, I am confused about how to access the rating when the inspected HTML is just a bare span containing the number, e.g. 10, and the value changes for each individual rating (e.g. 7). How would I use soup.find_all to access these values and add them to a list? I am confused about how to do this when there is no class listed on the element.
rate = soup.find_all('span')
rate_list = []
for i in range(0, len(rate)):
    rate_list.append(rate[i].get_text())
Try using the fact that the target span sits next to the star icon:
ratings = [i.text for i in soup.select('.ipl-star-icon + span')]
But, in case there are ratings for everything, I would probably loop over the reviews (for review in soup.select('.lister-item-content'): ...) and test whether review.select_one('.ipl-star-icon + span') is not None.
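A minimal sketch of that per-review loop, assuming the page has already been parsed into soup and using the selectors from the answer above:

ratings = []
for review in soup.select('.lister-item-content'):
    rating = review.select_one('.ipl-star-icon + span')
    # Guard against reviews that have no star rating
    ratings.append(rating.text if rating is not None else None)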

How to iterate over WebElements and get a new WebElement in Robot Framework

I am trying to get the href attribute from an HTML list using Robot Framework keywords. For example, suppose the HTML code is:
<ul class="my-list">
<li class="my-listitem"><a href="...">...</a></li>
...
<li class="my-listitem"><a href="...">...</a></li>
</ul>
I have tried to use the keywords WebElement, WebElements and for loop without success. How can I do it?
This is my MWE
*** Test Cases ***
    @{a tags} =    Create List
    @{href attr} =    Create List
    @{li items} =    Get WebElements    class:my-listitem
    FOR    ${li}    IN    @{li items}
        ${a tag} =    Get WebElement    tag:a
        Append To List    @{a tags}    ${a tag}
    END
    FOR    ${a tag}    IN    @{a tags}
        ${attr} =    Get Element Attribute    css:my-listitem    href
        Append To List    @{href attr}    ${attr}
    END
Thanks in advance.
The href is an attribute of the a elements, not the li, thus you need to target them. Get a reference for all such elements, and then get their href in the loop:
${the a-s}=    Get WebElements    xpath=//li[@class='my-listitem']/a    # by targeting the correct element, the list is a reference to all such "a" elements
${all href}=    Create List
FOR    ${el}    IN    @{the a-s}    # loop over each of them
    ${value}=    Get Element Attribute    ${el}    href    # get the individual href
    Append To List    ${all href}    ${value}    # and store it in a result list
END
Log To Console    ${all href}
Here is a possible solution (not tested):
@{my_list}=    Get WebElements    xpath=//li[@class='my-listitem']/a
FOR    ${element}    IN    @{my_list}
    ${attr}=    Get Element Attribute    ${element}    href
    Log    ${attr}    html=True
END

I am doing some web scraping and during that I've run into an error which states, "'NoneType' object is not subscriptable"

I am using bs4 for web scraping. This is the HTML code that I am scraping.
items is a list of these div tags, i.e. <div class="list_item odd" itemscope=""...>,
and the tag that I really want from each element in items is:
<p class="cert-runtime-genre">
<img title="R" alt="Certificate R" class="absmiddle certimage" src="https://m...>
<time datetime="PT119M">119 min</time>
-
<span>Drama</span>
<span class="ghost">|</span>
<span>War</span>
</p>
The main class of this list is saved in items. From that I want to scrape the img tag and then access its title attribute so that I can save all of the movies' certifications (i.e. R, PG, etc.) in a database. But when I loop over items it raises an error saying the result is not subscriptable. I tried list comprehensions, a simple for loop, and indexing items with a predefined integer array; nothing works and I still get the same error. (items is not None and is subscriptable, i.e. it is a list.) Yet when I index it directly with an integer it works fine, e.g. items[0] or items[1], and gives the correct result for each corresponding element of items. The error line is below:
cert = [item.find(class_ = "absmiddle certimage")["title"] for item in items] or
cert = [item.find("img",{"class": "absmiddle certimage"})["title"] for item in items]
and this is what works fine: cert = items[0].find(class_ = "absmiddle certimage")["title"]
Any suggestion will be appreciated.
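That error usually means find() returned None for at least one element of items, i.e. some list items have no img with that class. A minimal sketch of one way to guard against that, reusing the same items list:

cert = []
for item in items:
    img = item.find("img", {"class": "absmiddle certimage"})
    # Record None (or skip) when an item has no certificate image
    cert.append(img["title"] if img is not None else None)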

Scrape info from a span title

My html looks like this:
<h3>Current Guide Price <span title="92"> 92
</span></h3>
The info I am trying to get is the 92.
Here is another HTML page where I need to get the same data:
<h3>Current Guide Price <span title="4,161"> 4,161
</span></h3>
I would need to get the 4,161 from this page.
Here is the link to the page for reference:
http://services.runescape.com/m=itemdb_oldschool/viewitem?obj=1613
What I have tried:
/h3/span[@title="92"]@title
/h3/span[@title="92"]/text()
/div[@class="stats"]/h3/span[@title="4,161"]@title
Since the info I need is in the actual span tag, it is hard to grab the data in a dynamic way that I can use across many different pages.
from lxml import html
import requests

baseUrl = 'http://services.runescape.com/m=itemdb_oldschool/viewitem?obj=2355'
page = requests.get(baseUrl)
tree = html.fromstring(page.content)
price = tree.xpath('//h3/span')
price2 = tree.xpath('//h3/span/@title')
for p in price:
    print(p.text.strip())
for p2 in price2:
    print(p2)
The output is 92 in both cases.
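If the goal is the number itself, a small follow-up sketch (reusing the tree from above) that strips the thousands separator from the title value:

title_values = tree.xpath('//h3/span/@title')  # e.g. ['4,161']
if title_values:
    price_int = int(title_values[0].replace(',', ''))
    print(price_int)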
