How to parse only the second span tag in an HTML document using Python bs4 - python-3.x

I want to parse only one span tag in my HTML document. There are three sibling span tags without any class or id. I am targeting the second one only, using BeautifulSoup 4.
Given the following html document:
<div class="adress">
<span>35456 street</span>
<span>city, state</span>
<span>zipcode</span>
</div>
I tried:
for spn in soup.findAll('span'):
    data = spn[1].text
but it didn't work. The expected result is the text of the second span stored in a variable:
data = "city, state"
I'd also like to know how to get the first and second span texts concatenated into one variable.

You are trying to slice an individual span (a Tag instance). Get rid of the for loop and slice the findAll result instead, i.e.
>>> soup.findAll('span')[1]
<span>city, state</span>
You can get the first and second tags together using:
>>> soup.findAll('span')[:2]
[<span>35456 street</span>, <span>city, state</span>]
or, as a string:
>>> "".join([str(tag) for tag in soup.findAll('span')[:2]])
'<span>35456 street</span><span>city, state</span>'

Another option:
data = soup.select_one('div > span:nth-of-type(2)').get_text(strip=True)
print(data)
Output:
city, state
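Putting the approach above together as a runnable sketch on the question's own HTML (using html.parser; any parser works):

```python
from bs4 import BeautifulSoup

html = """
<div class="adress">
<span>35456 street</span>
<span>city, state</span>
<span>zipcode</span>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
spans = soup.find_all("span")

# Second span only (indexing the result list, not an individual Tag)
data = spans[1].text
print(data)  # city, state

# First and second span texts joined into one variable
combined = ", ".join(s.text for s in spans[:2])
print(combined)  # 35456 street, city, state
```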

Related

How to get text which is inside the span tag using selenium webdriver?

I want to get the text which is inside the span. However, I am not able to achieve it. The text is nested as ul > li > span > a > span. I am using Selenium with Python.
Below is the code which I tried:
departmentCategoryContent = driver.find_elements_by_class_name('a-list-item')
departmentCategory = departmentCategoryContent.find_elements_by_tag_name('span')
After this, I am just iterating over departmentCategory and printing the text using .text, i.e.
[ print(x.text) for x in departmentCategory ]
However, this is generating an error: AttributeError: 'list' object has no attribute 'find_elements_by_tag_name'.
Can anyone tell me what I am doing wrong and how I can get the text?
Problem:
As far as I understand, departmentCategoryContent is a list, not a single WebElement, so it doesn't have the find_elements_by_tag_name() method.
Solution:
You can choose one of the two ways below:
Loop over the list departmentCategoryContent first, then call find_elements_by_tag_name() on each element.
Or save time with a single statement, using find_elements_by_css_selector():
departmentCategory = driver.find_elements_by_css_selector('.a-spacing-micro.apb-browse-refinements-indent-2 .a-list-item span')
[ print(x.text) for x in departmentCategory ]
Explanation:
Your locator .a-list-item span will return all the span tags belonging to the div with class .a-list-item. There are 88 items, including the unwanted tags.
So you need a more specific locator to separate it from the other divs. In this case, I use two more classes: .a-spacing-micro.apb-browse-refinements-indent-2
You're looping over the wrong thing. You want to loop through the 'a-list-item' list and find a single span element that is a child of each WebElement. Try this:
departmentCategoryContent = driver.find_elements_by_class_name('a-list-item')
for x in departmentCategoryContent:
    print(x.find_element_by_tag_name('span').text)
Note that the second DOM search is find_element (not find_elements), which returns a single WebElement, not a list.
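To make the corrected loop concrete without a live browser, here is a minimal sketch in which a stub class stands in for Selenium WebElements (the stub class and sample texts are invented for illustration; only the loop shape matches the real API):

```python
class FakeElement:
    """Stand-in for a Selenium WebElement (illustrative only)."""
    def __init__(self, text="", children=None):
        self.text = text
        self._children = children or []

    def find_element_by_tag_name(self, tag):
        # find_element (singular) returns one element, not a list
        return self._children[0]

# driver.find_elements_by_class_name('a-list-item') returns a LIST,
# so we must loop over it before searching inside each element.
department_category_content = [
    FakeElement(children=[FakeElement(text="Books")]),
    FakeElement(children=[FakeElement(text="Electronics")]),
]

texts = [item.find_element_by_tag_name("span").text
         for item in department_category_content]
print(texts)  # ['Books', 'Electronics']
```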

Scrapy parse is returning an empty array, regardless of yield

I am brand new to Scrapy, and I could use a hint here. I realize that there are quite a few similar questions, but none of them seem to fix my problem. I have the following code written for a simple web scraper:
import scrapy
from ScriptScraper.items import ScriptItem

class ScriptScraper(scrapy.Spider):
    name = "script_scraper"
    allowed_domains = ["https://proplay.ws"]
    start_urls = ["https://proplay.ws/dramas/"]

    def parse(self, response):
        for column in response.xpath('//div[@class="content-column one_fourth"]'):
            text = column.xpath('//p/b/text()').extract()
            item = ScriptItem()
            item['url'] = "test"
            item['title'] = text
            yield item
I will want to do some more involved scraping later, but right now, I'm just trying to get the scraper to return anything at all. The HTML for the site I'm trying to scrape looks like this:
<div class="content-column one_fourth">
::before
<p>
<b>
All dramas
<br>
(in alphabetical
<br>
order):
</b>
</p>
...
</div>
and I am running the following command in the Terminal:
scrapy parse --spider=script_scraper -c parse_ITEM -d 2 https://proplay.ws/dramas/
According to my understanding of Scrapy, the code I have written should be yielding the text "All dramas"; however, it is yielding an empty array instead. Can anyone give me a hint as to why this is not producing the expected yield? Again, I apologize for the repetitive question.
Your XPath expression is not quite right for the data you want to extract. If you want the first column's first-row item, your XPath expression should be:
item = {}
item['text'] = response.xpath('//div[@class="content-column one_fourth"][1]/p[1]/b/text()').extract()[0]
The extract() function returns all the matches for the expression as a list. If you want only the first, use extract()[0] or extract_first().
Go through this page https://devhints.io/xpath to get more knowledge related to Xpath.
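The extract()-returns-a-list behaviour can be seen without running a spider. Here is a sketch using lxml (assumed to be installed) on the HTML from the question; Scrapy's selectors are built on the same XPath engine:

```python
from lxml import html

doc = html.fromstring("""
<div class="content-column one_fourth">
  <p><b>All dramas<br>(in alphabetical<br>order):</b></p>
</div>
""")

# text() matches every text node under <b>, so the result is a list:
# the <br> tags split the text into three fragments
matches = doc.xpath('//div[@class="content-column one_fourth"][1]/p[1]/b/text()')
print(matches[0])  # All dramas
```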

Querying <div class="name"> in Python

I am trying to follow the guide posted here: https://medium.freecodecamp.org/how-to-scrape-websites-with-python-and-beautifulsoup-5946935d93fe
I am at this point, where I am supposed to get the name of presumably the stock.
Take out the div of name and get its value
name_box = soup.find('h1', attrs={'class': 'name'})
I suspect I will also have trouble when querying the price. Do I have to replace 'price' with 'priceText__1853e8a5' as found in the html?
get the index price
price_box = soup.find('div', attrs={'class': 'price'})
Thanks, this would be a massive help.
If you replace price with priceText__1853e8a5 you will get your result, but I suspect that the class name is dynamically generated (note the hash at the end). So to get your result you need something more robust.
You can target tags in BeautifulSoup with CSS selectors (via the select()/select_one() methods). This example will target all <span> tags with a class attribute that begins with priceText (the ^= operator - more info about CSS selectors here).
from bs4 import BeautifulSoup
import requests
r = requests.get('https://www.bloomberg.com/quote/SPX:IND')
soup = BeautifulSoup(r.text, 'lxml')
print(soup.select_one('span[class^="priceText"]').text)
This prints:
2,813.36
You have several options to do that.
Getting the value by an appropriate XPath:
//span[contains(@class, 'priceText__')]
Writing a regex to find the exact element:
price_tag = soup.find_all('span', {'class': re.compile(r'priceText__.*?')})
I am not sure about the regex pattern as I am bad at it. Edits are welcome.
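Both variants can be checked on an inline snippet; the class suffix below is made up to mimic the dynamically generated hash:

```python
import re
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    '<span class="priceText__1853e8a5">2,813.36</span>', 'html.parser')

# CSS attribute selector: class attribute begins with "priceText"
css_hit = soup.select_one('span[class^="priceText"]').text

# Regex on the class, anchored at the start of the name
regex_hits = soup.find_all('span', {'class': re.compile(r'^priceText__')})

print(css_hit, regex_hits[0].text)  # 2,813.36 2,813.36
```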

Extracting div attributes based on text

I am using Python 3.6 with bs4 to implement this task.
my div tag looks like this
<div class="Portfolio" portfolio_no="345">VBHIKE324</div>
<div class="Portfolio" portfolio_no="567">SCHF54TYS</div>
I need to extract portfolio_no, i.e. 345. It is a dynamic value that keeps changing across the div tags, whereas the text remains the same.
for data in soup.find_all('div', class_='Portfolio', text='VBHIKE324'):
    print(data)
It outputs None, whereas I'm looking for output like 345.
Here you go
for data in soup.find_all('div', {'class': 'Portfolio'}):
    print(data['portfolio_no'])
If you want the portfolio_no for the one with text VBHIKE324 then you can do something like this
for data in soup.find_all('div', {'class': 'Portfolio'}):
    if data.text == 'VBHIKE324':
        print(data['portfolio_no'])
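A self-contained version of that answer, run against the two divs from the question:

```python
from bs4 import BeautifulSoup

html = """
<div class="Portfolio" portfolio_no="345">VBHIKE324</div>
<div class="Portfolio" portfolio_no="567">SCHF54TYS</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Tag attributes are read with dict-style access: data['portfolio_no']
numbers = {d.text: d["portfolio_no"]
           for d in soup.find_all("div", {"class": "Portfolio"})}
print(numbers["VBHIKE324"])  # 345
```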

Removing unwanted html from an href tag in Python [duplicate]

This question already has an answer here:
BeautifulSoup getting href [duplicate]
(1 answer)
Closed 6 years ago.
I want to be able to scrape out a list of links. I cannot do this directly with BeautifulSoup because of the way the HTML is structured.
start_list = soup.find_all(href=re.compile('id='))
print(start_list)
[<b>Act of Valor</b>,
<b>Action Jackson</b>]
I am looking to pull just the href information. I am thinking of some sort of filter where I can put all of the bold tags into a list and then filter them out of another list which contains the information above.
start_list = soup.find_all('a', href=re.compile('id='))
start_list_soup = BeautifulSoup(str(start_list), 'html.parser')
things_to_remove = start_list_soup.find_all('b')
The idea is to be able to loop through things_to_remove and remove all occurrences of its contents from start_list
start_list = soup.find_all(href=re.compile('id='))
href_list = [i['href'] for i in start_list]
href is an attribute of the tag; when you use find_all you get a bunch of tags, so just iterate over them and use tag['href'] to access the attribute.
To understand why we use [], you should know that a tag's attributes are stored in a dictionary.
Document:
A tag may have any number of attributes. The tag <b class="boldest">
has an attribute “class” whose value is “boldest”. You can access a
tag’s attributes by treating the tag like a dictionary:
tag['class']
# u'boldest'
You can access that dictionary directly as .attrs:
tag.attrs
# {u'class': u'boldest'}
List comprehensions are simple; you can reference this PEP. In this case, it is equivalent to the for loop:
href_list = []
for i in start_list:
    href_list.append(i['href'])
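A runnable sketch of the list-comprehension answer; the two sample links are invented to match the id= pattern from the question:

```python
import re
from bs4 import BeautifulSoup

html = """
<a href="/movies?id=101"><b>Act of Valor</b></a>
<a href="/movies?id=102"><b>Action Jackson</b></a>
"""

soup = BeautifulSoup(html, "html.parser")

# find_all returns Tag objects; read the href attribute on each one
start_list = soup.find_all(href=re.compile('id='))
href_list = [i['href'] for i in start_list]
print(href_list)  # ['/movies?id=101', '/movies?id=102']
```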
