Extract text between dynamic HTML tags using python soup - python-3.x

I have a requirement where I need to extract the text between HTML tags. I used BeautifulSoup to extract the data and store the text in a variable for further processing. Later I found that the text I need appears under two different tags. However, please note that I still need to extract the text and store it in the same variable. My earlier code and a sample of the HTML are provided below. Please help me get to my end result, i.e. the expected output.
Sample HTML text:
<DIV CLASS="c0"><P CLASS="c1"><SPAN CLASS="c2">1 of 80 DOCUMENTS</SPAN></P>
<DIV CLASS="c0"><BR><P CLASS="c1"><SPAN CLASS="c2">Financial Times (London, England)</SPAN></P>
<DIV CLASS="c0"><BR><P CLASS="c1"><SPAN CLASS="c2">Copyright 2015 The Financial Times Ltd.<BR>All Rights Reserved<BR>Please do not cut and paste FT articles and redistribute by email or post to the web.</SPAN></P>
<DIV CLASS="c0"><P CLASS="c1"><SPAN CLASS="c2">80 of 80 DOCUMENTS</SPAN></P>
</DIV>
<BR><DIV CLASS="c3"><P CLASS="c1"><SPAN CLASS="c2">Financial Times (London,England)</SPAN></P>
</DIV>
<DIV CLASS="c3"><P CLASS="c1"><SPAN CLASS="c2">Copyright 1990 The Financial Times Limited</SPAN></P>
</DIV>
From the above HTML, I need to store the document counts (1 of 80 DOCUMENTS, 80 of 80 DOCUMENTS) in a single variable, and the other fields follow the same pattern. I wrote the following code for div.c0:
from bs4 import BeautifulSoup

soup = BeautifulSoup(response, 'html.parser')
docpublicationcpyright = soup.select('div.c0')
list1 = [b.text.strip() for b in docpublicationcpyright]
# The div.c0 blocks repeat in groups of three: count, publication, copyright
doccountvalues = list1[0:len(list1):3]
publicationvalues = list1[1:len(list1):3]
copyrightvalues = list1[2:len(list1):3]
documentcount = doccountvalues
publicationpaper = publicationvalues
Any help would be greatly appreciated.

The given sample HTML is not properly structured; for example, the closing tag is missing for the first DIV element. Even so, for this type of HTML you can still scrape the required data with the help of regular expressions.
I wrote sample code considering only the sample HTML posted in the question and was able to extract all three required fields:
import re
from bs4 import BeautifulSoup

soup = BeautifulSoup(response, 'html.parser')
documentElements = soup.find_all('span', text=re.compile(r'of [0-9]+ DOCUMENTS'))
documentCountList = []
publicationPaperList = []
documentPublicationCopyrightList = []
for elem in documentElements:
    documentCountList.append(elem.get_text().strip())
    if elem.parent.find_next_sibling('div'):
        publicationPaperList.append(elem.parent.find_next_sibling('div').find('span').get_text().strip())
        documentPublicationCopyrightList.append(elem.parent.find_next_sibling('div').find_all('span')[1].get_text())
    else:
        publicationPaperList.append(elem.parent.parent.find_next('div').get_text().strip())
        documentPublicationCopyrightList.append(elem.parent.parent.find_next('div').find_next('div').get_text().strip())
print(documentCountList)
print(publicationPaperList)
print(documentPublicationCopyrightList)
The output looks like this:
[u'1 of 80 DOCUMENTS', u'80 of 80 DOCUMENTS']
[u'Financial Times (London, England)', u'Financial Times (London,England)']
[u'Copyright 2015 The Financial Times Ltd.All Rights ReservedPlease do not cut and paste FT articles and redistribute by email or post to the web.', u'Copyright 1990 The Financial Times Limited']
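If it helps, here is a minimal alternative sketch of my own (not part of the answer above), assuming every field of interest sits in a span.c2 as in the sample: classify each span's text instead of relying on whether the wrapper is div.c0 or div.c3.
import re
from bs4 import BeautifulSoup

soup = BeautifulSoup(response, 'html.parser')
documentcount = []
publicationpaper = []
copyrights = []
for span in soup.find_all('span', class_='c2'):
    text = span.get_text(' ', strip=True)
    if re.search(r'\d+ of \d+ DOCUMENTS', text):
        documentcount.append(text)     # e.g. '1 of 80 DOCUMENTS'
    elif text.startswith('Copyright'):
        copyrights.append(text)        # copyright notices
    else:
        publicationpaper.append(text)  # publication names
print(documentcount)
print(publicationpaper)
print(copyrights)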

Related

Scrapy parse is returning an empty array, regardless of yield

I am brand new to Scrapy, and I could use a hint here. I realize that there are quite a few similar questions, but none of them seem to fix my problem. I have the following code written for a simple web scraper:
import scrapy
from ScriptScraper.items import ScriptItem

class ScriptScraper(scrapy.Spider):
    name = "script_scraper"
    allowed_domains = ["https://proplay.ws"]
    start_urls = ["https://proplay.ws/dramas/"]

    def parse(self, response):
        for column in response.xpath('//div[@class="content-column one_fourth"]'):
            text = column.xpath('//p/b/text()').extract()
            item = ScriptItem()
            item['url'] = "test"
            item['title'] = text
            yield item
I will want to do some more involved scraping later, but right now, I'm just trying to get the scraper to return anything at all. The HTML for the site I'm trying to scrape looks like this:
<div class="content-column one_fourth">
::before
<p>
<b>
All dramas
<br>
(in alphabetical
<br>
order):
</b>
</p>
...
</div>
and I am running the following command in the Terminal:
scrapy parse --spider=script_scraper -c parse_ITEM -d 2 https://proplay.ws/dramas/
According to my understanding of Scrapy, the code I have written should be yielding the text "All dramas"; however, it is yielding an empty array instead. Can anyone give me a hint as to why this is not producing the expected yield? Again, I apologize for the repetitive question.
Your XPath expression does not quite match the data you want to extract. If you want the item from the first row of the first column, your XPath expression should be:
item = {}
item['text'] = response.xpath('//div[@class="content-column one_fourth"][1]/p[1]/b/text()').extract()[0]
The extract() function returns all matches for the expression as a list. If you only want the first match, use extract()[0] or extract_first().
Go through https://devhints.io/xpath to learn more about XPath.
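One more hedged note of my own (not part of the answer above): inside the original loop, '//p/b/text()' is an absolute XPath, so it searches the whole page from every column instead of just the current column; prefixing it with a dot makes it relative. A minimal sketch of the spider's parse method with that change, keeping the original ScriptItem fields:
def parse(self, response):
    for column in response.xpath('//div[@class="content-column one_fourth"]'):
        # './/p/b/text()' is relative to the current column; '//p/b/text()' is not.
        text = column.xpath('.//p/b/text()').extract_first()
        item = ScriptItem()
        item['url'] = "test"
        item['title'] = text
        yield item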

Extracting contents of nested <p> Tags with Beautiful Soup

I am having a hard time putting the advantages of Beautiful Soup to work for my use case. There are many similar, but not always identical, nested p tags that I want to get the contents from. Examples follow:
<p><span class="example" data-location="1:20">20</span>normal string</p>
<p><span class="example" data-location="1:21">21</span>this text <strong>belongs together</strong></p>
<p><span class="example" data-location="1:22">22</span>some text (<span class="referencequote">a reference text</span>)that might continue</p>
<p><span class="example" data-location="1:23">23</span>more text</p><div class="linebreak"></div>
<p><span class="example" data-location="1:22">24</span>text with (<span class="referencequote">first</span>)two references <span class="referencequote">first</span>.</p>
I need to save the string of the example span as well as the strings inside the p tag, regardless of their styling, and, if applicable, the referencequote texts. So from the examples above I would like to extract:
example = 20, text = 'normal string', reference = []
example = 21, text = 'this text belongs together', reference = []
example = 22, text = 'some text that might continue', reference = ['a reference text']
example = 23, text = 'more text', reference = []
example = 24, text = 'text with two references', reference = ['first', 'second']
What I was trying to do is collect all items with the "example" class and then loop through their parent's contents.
for span in bs.find_all("span", {"class": "example"}):
    references = []
    for item in span.parent.contents:
        if type(item) == NavigableString:
            text = item
        elif (item['class'][0]) == 'verse':
            number = int(item.string)
        elif (item['class']) == 'referencequote':
            references.append(item.string)
        else:
            pass  # how to handle <strong> tags?
    verses.append(MyClassObject(n=number, t=text, r=references))
My approach is very error-prone, and there might be even more tags like <strong> or <em> that I am ignoring right now. The get_text() method unfortunately gives back something like '22 some text a reference text that might continue'.
There must be a more elegant way to extract this information. Could you give me some ideas for other approaches? Thanks in advance!
Try this.
from simplified_scrapy.core.regex_helper import replaceReg
from simplified_scrapy import SimplifiedDoc,utils
html = '''
<p><span class="example" data-location="1:20">20</span>normal string</p>
<p><span class="example" data-location="1:21">21</span>this text <strong>belongs together</strong></p>
<p><span class="example" data-location="1:22">22</span>some text (<span class="referencequote">a reference text</span>)that might continue</p>
<p><span class="example" data-location="1:23">23</span>more text</p><div class="linebreak"></div>
<p><span class="example" data-location="1:22">24</span>text with (<span class="referencequote">first</span>)two references <span class="referencequote">second</span>.</p>
'''
html = replaceReg(html,"<[/]*strong>","") # Pretreatment
doc = SimplifiedDoc(html)
ps = doc.ps
for p in ps:
    text = ''.join(p.spans.nextText())
    text = replaceReg(text, "[()]+", "")  # Remove ()
    span = p.span  # Get the first span
    spans = span.getNexts(tag="span").text  # Get references
    print(span["class"], span.text, text, spans)
Result:
example 20 normal string []
example 21 this text belongs together []
example 22 some text that might continue ['a reference text']
example 23 more text []
example 24 text with two references. ['first', 'second']
Here are more examples. https://github.com/yiyedata/simplified-scrapy-demo/tree/master/doc_examples
A different approach I found - without regex, and maybe more robust to other spans that might come up:
for s in bsItem.select('span'):
    if s['class'][0] == 'example':
        # do whatever is needed with the content of this span
        s.extract()
    elif s['class'][0] == 'referencequote':
        # do whatever is needed with the content of this span
        s.extract()
    # check for all spans with a class whose text you want excluded
# finally, get all the remaining text (the spans above have been removed)
text = bsItem.text.replace(' ()', '')
Maybe that approach is of interest to someone reading this :)
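Putting that idea together, here is a self-contained sketch of my own (assuming plain BeautifulSoup, the sample <p> blocks from the question stored in a variable named html, and a made-up helper name parse_paragraph). Leftover punctuation around removed references may still need minor cleanup:
from bs4 import BeautifulSoup

def parse_paragraph(p):
    # Number from the leading "example" span, then remove that span from the tree.
    example = int(p.find("span", class_="example").extract().get_text())
    # Collect and remove every reference span.
    references = [s.extract().get_text() for s in p.find_all("span", class_="referencequote")]
    # Whatever is left (including <strong>/<em> children) is the running text.
    text = p.get_text(" ", strip=True).replace("( )", "").replace("()", "")
    return example, " ".join(text.split()), references

soup = BeautifulSoup(html, "html.parser")
for p in soup.find_all("p"):
    print(parse_paragraph(p))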

How to parse the only the second span tag in an HTML document using python bs4

I want to parse only one span tag in my HTML document. There are three sibling span tags without any class or id. I am targeting only the second one, using BeautifulSoup 4.
Given the following html document:
<div class="adress">
<span>35456 street</span>
<span>city, state</span>
<span>zipcode</span>
</div>
I tried:
for spn in soup.findAll('span'):
    data = spn[1].text
but it didn't work. The expected result is the text of the second span stored in a variable:
data = "city, state"
I would also like to know how to get the first and second spans concatenated into one variable.
You are trying to index into an individual span (a Tag instance). Get rid of the for loop and index the findAll result instead, i.e.
>>> soup.findAll('span')[1]
<span>city, state</span>
You can get the first and second tags together using:
>>> soup.findAll('span')[:2]
[<span>35456 street</span>, <span>city, state</span>]
or, as a string:
>>> "".join([str(tag) for tag in soup.findAll('span')[:2]])
'<span>35456 street</span><span>city, state</span>'
Another option:
data = soup.select_one('div > span:nth-of-type(2)').get_text(strip=True)
print(data)
Output:
city, state
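If you want the two values joined as plain text rather than as markup, a small sketch of my own (not from either answer above):
spans = soup.findAll('span')
data = ", ".join(s.get_text(strip=True) for s in spans[:2])
print(data)  # 35456 street, city, state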

Querying <div class="name"> in Python

I am trying to follow the guide posted here: https://medium.freecodecamp.org/how-to-scrape-websites-with-python-and-beautifulsoup-5946935d93fe
I am at the point where I am supposed to get the name of (presumably) the stock.
Take out the div of name and get its value
name_box = soup.find('h1', attrs={'class': 'name'})
I suspect I will also have trouble when querying the price. Do I have to replace 'price' with 'priceText__1853e8a5' as found in the html?
get the index price
price_box = soup.find('div', attrs={'class': 'price'})
Thanks, this would be a massive help.
If you replace price with priceText__1853e8a5 you will get your result, but I suspect that the class name changes dynamically/is dynamically generated (note the number at the end). So to get your result you need something more robust.
You can target tags in BeautifulSoup with CSS selectors (via the select()/select_one() methods). This example will target all <span> tags whose class attribute begins with priceText (the ^= operator - more info about CSS selectors here).
from bs4 import BeautifulSoup
import requests
r = requests.get('https://www.bloomberg.com/quote/SPX:IND')
soup = BeautifulSoup(r.text, 'lxml')
print(soup.select_one('span[class^="priceText"]').text)
This prints:
2,813.36
You have several options to do that.
Getting the value by an appropriate XPath:
//span[contains(@class, 'priceText__')]
Writing a regex to find the exact element:
import re
price_tag = soup.find_all('span', {'class': re.compile(r'priceText__.*?')})
I am not sure about the regex pattern, as I am bad at it. Edits are welcome.
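Since the answer above welcomes edits, a hedged refinement of my own: anchoring the pattern to the start of the class name avoids accidental matches, and class_ reads slightly more idiomatically (this reuses the soup built from r.text in the first answer).
import re

# Match any span whose class starts with 'priceText' (the suffix is generated).
price_tags = soup.find_all('span', class_=re.compile(r'^priceText'))
if price_tags:
    print(price_tags[0].get_text())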

Extracting div attributes based on text

I am using Python 3.6 with bs4 to implement this task.
My div tags look like this:
<div class="Portfolio" portfolio_no="345">VBHIKE324</div>
<div class="Portfolio" portfolio_no="567">SCHF54TYS</div>
I need to extract portfolio_no, i.e. 345. As it is a dynamic value, it keeps changing across div tags, whereas the text remains the same.
for data in soup.find_all('div', class_='Portfolio', text='VBHIKE324'):
    print(data)
It outputs None, whereas I'm looking for output like 345.
Here you go
for data in soup.find_all('div', {'class': 'Portfolio'}):
    print(data['portfolio_no'])
If you want the portfolio_no for the one with text VBHIKE324, then you can do something like this:
for data in soup.find_all('div', {'class': 'Portfolio'}):
    if data.text == 'VBHIKE324':
        print(data['portfolio_no'])
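As a small follow-up sketch of my own (assuming only the two divs shown in the question), you can also build a lookup from the text to the portfolio_no in one pass:
portfolio_by_text = {
    div.get_text(strip=True): div['portfolio_no']
    for div in soup.find_all('div', class_='Portfolio')
}
print(portfolio_by_text['VBHIKE324'])  # 345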
