BeautifulSoup get text in tag - python-3.x

I am trying to get data from Yellow Pages, but I only need the numbered plumbers. I can't get the number text in h2 class='n'. I can get the class="business-name" text, but I need only the numbered plumbers, not the advertisements. What is my mistake? Thank you very much.
This is the HTML:
<div class="info">
<h2 class="n">1. <a class="business-name" href="/austin-tx/mip/johnny-rooter-11404675?lid=171372530" rel="" data-impressed="1"><span>Johnny Rooter</span></a></h2>
</div>
And this is my python code:
import requests
from bs4 import BeautifulSoup as bs
url = "https://www.yellowpages.com/austin-tx/plumbers"
req = requests.get(url)
data = req.content
soup = bs(data, "lxml")
links = soup.findAll("div", {"class": "info"})
for link in links:
    for content in link.contents:
        try:
            print(content.find("h2", {"class": "n"}).text)
        except:
            pass

You need a different class selector to limit the results to that section:
import requests
from bs4 import BeautifulSoup as bs
url = "https://www.yellowpages.com/austin-tx/plumbers"
req = requests.get(url)
data = req.content
soup = bs(data, "lxml")
links = [item.text.replace('\xa0','') for item in soup.select('.organic h2')]
print(links)
.organic is a single class selector, taken from a compound class, on a parent element that contains only the numbered plumbers, so the selection starts after the ads.
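To see why the parent class matters, here is a minimal offline sketch. The HTML string is a made-up approximation of the page structure (an ad block followed by an organic block). The organic and n class names come from the answer above; the paid class and the business names other than Johnny Rooter are assumptions for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one ad listing (no number) and two organic, numbered listings.
sample_html = """
<div class="search-results paid">
  <div class="info"><h2 class="n"><a class="business-name"><span>Sponsored Plumbing</span></a></h2></div>
</div>
<div class="search-results organic">
  <div class="info"><h2 class="n">1.&nbsp;<a class="business-name"><span>Johnny Rooter</span></a></h2></div>
  <div class="info"><h2 class="n">2.&nbsp;<a class="business-name"><span>Acme Plumbing</span></a></h2></div>
</div>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# Descendant selector: only h2 elements inside an element whose class list includes "organic".
numbered = [h2.get_text().replace("\xa0", " ") for h2 in soup.select(".organic h2")]
print(numbered)
```

Here the non-breaking space is replaced with a regular space rather than removed, which keeps the number and the name separated. The ad block is skipped because its h2 has no ancestor with the organic class.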

Related

How to extract both content and markup in a class?

I'm trying to extract the content marked by <div class="sense"> in abc. With ''.join(map(str, soup.select_one('.sense').contents)), I only get the content between the tags, i.e. xyz. For my task, I also need the full <div class="sense">xyz</div>.
from bs4 import BeautifulSoup
abc = """abcdd<div class="sense">xyz</div>"""
soup = BeautifulSoup(abc, 'html.parser')
content1 = ''.join(map(str, soup.select_one('.sense').contents))
print(content1)
and the result is xyz. Could you please elaborate on how to achieve my goal?
Try:
from bs4 import BeautifulSoup
abc = """abcdd<div class="sense">xyz</div>"""
soup = BeautifulSoup(abc, 'html.parser')
div = soup.find('div', attrs={'class': 'sense'})
print(div)
prints:
<div class="sense">xyz</div>
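As a small sketch of the difference, the snippet below puts the three common views of the same tag side by side, using the abc string from the question. The variable names are only illustrative.

```python
from bs4 import BeautifulSoup

abc = """abcdd<div class="sense">xyz</div>"""
soup = BeautifulSoup(abc, "html.parser")
div = soup.select_one(".sense")

outer_html = str(div)               # the tag itself plus everything inside it
inner_html = div.decode_contents()  # only what is inside the tag, markup included
text_only = div.get_text()          # only the text, with all markup removed

print(outer_html)
print(inner_html)
print(text_only)
```

Converting the Tag with str() is what print(div) does implicitly, so it is the simplest way to keep the surrounding markup.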

How to only retrieve the tag I specify using BeautifulSoup

I just want the written text out of this website: https://algorithms-tour.stitchfix.com/ so I can put it in Word doc and read it.
When I run the code, I get all the html and the tags, at the very end I get what I want, but I just want to separate the text.
import requests
from bs4 import BeautifulSoup
url = "https://algorithms-tour.stitchfix.com"
response = requests.get(url)
html = response.text
soup = BeautifulSoup(html, "html.parser")
item = soup.find_all("p")
print(item)
Is there a way to get just content so I can clean it up some more?
You have a few options for this. If you only want text found within p tags, you can do this:
import requests
from bs4 import BeautifulSoup
url = "https://algorithms-tour.stitchfix.com"
response = requests.get(url)
html = response.text
soup = BeautifulSoup(html, "html.parser")
items = soup.find_all("p")
result = []
for item in items:
    # get_text() also works when a <p> contains nested tags, where .string would be None
    result.append(item.get_text())
print(result)
Note that soup.find_all returns an iterable list, and not a single object.
An alternative, and easier method is to just use soup.get_text:
import requests
from bs4 import BeautifulSoup
url = "https://algorithms-tour.stitchfix.com"
response = requests.get(url)
html = response.text
soup = BeautifulSoup(html, "html.parser")
print(soup.get_text())
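One detail worth checking when collecting paragraph text is how .string and get_text() differ. The sketch below uses a made-up two-paragraph document (an assumption, not the real page) to show it.

```python
from bs4 import BeautifulSoup

# Hypothetical document: one plain paragraph and one with nested markup.
html = """
<html><body>
<p>Plain paragraph.</p>
<p>Paragraph with <b>nested</b> markup.</p>
</body></html>
"""
soup = BeautifulSoup(html, "html.parser")
paragraphs = soup.find_all("p")

# .string is only set when a tag has exactly one child; otherwise it is None.
strings = [p.string for p in paragraphs]

# get_text() concatenates all text inside the tag, nested tags included.
texts = [p.get_text() for p in paragraphs]

print(strings)
print("\n".join(texts))
```

Joining the texts list with newlines gives plain text that can be pasted into a document, without the navigation and script text that soup.get_text() on the whole page may include.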

BeautifulSoup python: Get the text with no tags and get the adjacent links

I am trying to extract the movie titles and their links from this site:
from bs4 import BeautifulSoup
from requests import get
link = "https://tamilrockerrs.ch"
r = get(link).content
#r = open('json.html','rb').read()
b = BeautifulSoup(r,'html5lib')
a = b.findAll('p')[1]
But the problem is that there is no tag for the titles, so I can't extract them. And if I could, how would I bind each link to its title?
Thanks in advance.
You can find the title and link this way:
from bs4 import BeautifulSoup
import requests
url= "http://tamilrockerrs.ch"
response= requests.get(url)
data = response.text
soup = BeautifulSoup(data, 'html.parser')
data = soup.find_all('div', {"class":"title"})
for film in data:
    print("Title:", film.find('a').text) # get the title here
    print("Link:", film.find('a').get("href")) # get the link here
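To keep each title bound to its link rather than only printing them, one option is to collect them as records. This is a sketch on invented markup that follows the div class="title" structure the answer relies on; the movie names and paths are placeholders.

```python
from bs4 import BeautifulSoup

# Hypothetical markup modelled on the selector used in the answer.
sample_html = """
<div class="title"><a href="/movie-one">Movie One</a></div>
<div class="title"><a href="/movie-two">Movie Two</a></div>
<div class="title">No link here</div>
"""
soup = BeautifulSoup(sample_html, "html.parser")

films = []
for block in soup.find_all("div", class_="title"):
    anchor = block.find("a", href=True)
    if anchor is None:
        # Skip blocks without a link instead of raising AttributeError.
        continue
    films.append({"title": anchor.get_text(strip=True), "link": anchor["href"]})

print(films)
```

Looking up the anchor once per block and checking for None avoids a crash on blocks that have the class but no link.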

How can I get the href of anchor tag using Beautiful Soup?

I am trying to get the href of anchor tag of the very first video search on YouTube using Beautiful Soup. I am searching it by using the "a" and class_="yt-simple-endpoint style-scope ytd-video-renderer".
But I am getting None output:
from bs4 import BeautifulSoup
import requests
source = requests.get("https://www.youtube.com/results?search_query=MP+election+results+2018%3A+BJP+minister+blames+conspiracy+as+reason+while+losing").text
soup = BeautifulSoup(source,'lxml')
# print(soup2.prettify())
a =soup.findAll("a", class_="yt-simple-endpoint style-scope ytd-video-renderer")
a_fin = soup.find("a", class_="compact-media-item-image")
#
print(a)
from bs4 import BeautifulSoup
import requests
source = requests.get("https://www.youtube.com/results?search_query=MP+election+results+2018%3A+BJP+minister+blames+conspiracy+as+reason+while+losing").text
soup = BeautifulSoup(source,'lxml')
first_search_result_link = soup.findAll('a',attrs={'class':'yt-uix-tile-link'})[0]['href']
Heavily inspired by this answer.
Another option is to render the page first with Selenium.
import bs4
from selenium import webdriver
url = 'https://www.youtube.com/results?search_query=MP+election+results+2018%3A+BJP+minister+blames+conspiracy+as+reason+while+losing'
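# raw string so the backslashes in the Windows path are not treated as escapes;
# newer Selenium releases may expect the driver path via a Service object instead
browser = webdriver.Chrome(r'C:\chromedriver_win32\chromedriver.exe')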
browser.get(url)
source = browser.page_source
soup = bs4.BeautifulSoup(source,'html.parser')
hrefs = soup.find_all("a", class_="yt-simple-endpoint style-scope ytd-video-renderer")
for a in hrefs:
    print(a['href'])
Output:
/watch?v=Jor09n2IF44
/watch?v=ym14AyqJDTg
/watch?v=g-2V1XJL0kg
/watch?v=eeVYaDLC5ik
/watch?v=StI92Bic3UI
/watch?v=2W_4LIAhbdQ
/watch?v=PH1WZPT5IKw
/watch?v=Au2EH3GsM7k
/watch?v=q-j1HEnDn7w
/watch?v=Usjg7IuUhvU
/watch?v=YizmwHibomQ
/watch?v=i2q6Fm0E3VE
/watch?v=OXNAMyEvcH4
/watch?v=vdcBtAeZsCk
/watch?v=E4v2StDdYqs
/watch?v=x7kCuRB0f7E
/watch?v=KERtHNoZrF0
/watch?v=TenbA4wWIJA
/watch?v=Ey9HfjUyUvY
/watch?v=hqsuOT0URJU
The HTML is dynamic. You can use Selenium, or, to get static HTML, send a Googlebot user agent:
import requests
from bs4 import BeautifulSoup
headers = {'User-Agent' : 'Googlebot/2.1 (+http://www.google.com/bot.html)'}
source = requests.get("https://.......", headers=headers).text
soup = BeautifulSoup(source, 'lxml')
links = soup.findAll("a", class_="yt-uix-tile-link")
for link in links:
    print(link['href'])
Try looping over the matches:
from urllib.request import urlopen
from bs4 import BeautifulSoup
data = urlopen("some_url")
html_data = data.read()
soup = BeautifulSoup(html_data, "html.parser")
for a in soup.findAll('a', href=True):
    print(a['href'])
The class which you're searching does not exist in the scraped html. You can identify it by printing the soup variable.
For example:
a =soup.findAll("a", class_="sign-in-link")
gives output as:
[<a class="sign-in-link" href="https://accounts.google.com/ServiceLogin?passive=true&continue=https%3A%2F%2Fwww.youtube.com%2Fsignin%3Faction_handle_signin%3Dtrue%26app%3Ddesktop%26feature%3Dplaylist%26hl%3Den%26next%3D%252Fresults%253Fsearch_query%253DMP%252Belection%252Bresults%252B2018%25253A%252BBJP%252Bminister%252Bblames%252Bconspiracy%252Bas%252Breason%252Bwhile%252Blosing&uilel=3&hl=en&service=youtube">Sign in</a>]
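A related pitfall is how a multi-word class_ value is matched. The sketch below uses invented anchors (the hrefs are placeholders) to compare an exact class string with a CSS selector, and to list which classes are actually present in the parsed HTML.

```python
from bs4 import BeautifulSoup

# Hypothetical anchors: same classes in different order, plus an unrelated link.
sample_html = """
<a class="yt-simple-endpoint style-scope ytd-video-renderer" href="/watch?v=aaa">First</a>
<a class="style-scope yt-simple-endpoint ytd-video-renderer" href="/watch?v=bbb">Second</a>
<a class="sign-in-link" href="/signin">Sign in</a>
"""
soup = BeautifulSoup(sample_html, "html.parser")

# A multi-word class_ string is compared with the whole attribute value, so order matters.
exact = [a["href"] for a in soup.find_all("a", class_="yt-simple-endpoint style-scope ytd-video-renderer")]

# A CSS selector checks each class independently, in any order.
by_selector = [a["href"] for a in soup.select("a.yt-simple-endpoint.ytd-video-renderer")]

# Collecting the classes that exist is a quick way to confirm what the server returned.
present = {cls for a in soup.find_all("a", class_=True) for cls in a["class"]}

print(exact)
print(by_selector)
print(sorted(present))
```

If the class you expect is missing from the collected set, the markup was most likely generated by JavaScript after page load, which is the situation described above.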

Extracting Data from HTML Span using Beautiful Soup

I want to extract "1.02 Crores" and "7864" from the HTML code and save them in different columns of a CSV file.
Code:
<div class="featuresvap _graybox clearfix"><h3><span><i class="icon-inr"></i>1.02 Crores</span><small> # <i class="icon-inr"></i><b>7864/sq.ft</b> as per carpet area</small></h3>
Not sure about the actual data, but this is something I threw together quickly. If you need it to fetch a website, use import requests. You'll need to add url = 'yourwebpagehere' and page = requests.get(url), change soup to soup = BeautifulSoup(page.text, 'lxml'), and remove the html variable, since it would no longer be needed.
from bs4 import BeautifulSoup
import csv
html = '<div class="featuresvap _graybox clearfix"><h3><span><i class="icon-inr"></i>1.02 Crores</span><small> # <i class="icon-inr"></i><b>7864/sq.ft</b> as per carpet area</small></h3>'
soup = BeautifulSoup(html, 'lxml')
findSpan = soup.find('span')
findB = soup.find('b')
print([findSpan.text, findB.text.replace('/sq.ft', '')])
with open('NAMEYOURFILE.csv', 'w', newline='') as writer:
    csv_writer = csv.writer(writer)
    csv_writer.writerow(["First Column Name", "Second Column Name"])
    # write the extracted text, not the Tag objects
    csv_writer.writerow([findSpan.text, findB.text.replace('/sq.ft', '')])
The code below is self-explanatory:
from bs4 import BeautifulSoup
# data for first column
firstCol = []
# data for second column
secondCol = []
# listURL is assumed to be a list of page URLs defined elsewhere
for url in listURL:
    html = '.....' # downloaded html
    soup = BeautifulSoup(html, 'html.parser')
    # 'select_one' select using CSS selectors, return only first element
    fCol = soup.select_one('.featuresvap h3 span')
    # remove: <i class="icon-inr"></i>
    fCol.find("i").extract()
    sCol = soup.select_one('.featuresvap h3 b')
    firstCol.append(fCol.text)
    secondCol.append(sCol.text.replace('/sq.ft', ''))
with open('results.csv', 'w') as fl:
    csvContent = ','.join(firstCol) + '\n' + ','.join(secondCol)
    fl.write(csvContent)
''' sample results
1.02 Crores | 2.34 Crores
7864 | 2475
'''
print('finish')
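If one row per listing with one value per column is preferred, a variant using the csv module might look like the sketch below. The first HTML snippet is the one from the question (with the div closed); the second is invented from the sample results above. The column names are assumptions, and an in-memory buffer stands in for a real file.

```python
import csv
import io
from bs4 import BeautifulSoup

# The second snippet is hypothetical, modelled on the sample results.
pages = [
    '<div class="featuresvap _graybox clearfix"><h3><span><i class="icon-inr"></i>1.02 Crores</span>'
    '<small> # <i class="icon-inr"></i><b>7864/sq.ft</b> as per carpet area</small></h3></div>',
    '<div class="featuresvap _graybox clearfix"><h3><span><i class="icon-inr"></i>2.34 Crores</span>'
    '<small> # <i class="icon-inr"></i><b>2475/sq.ft</b> as per carpet area</small></h3></div>',
]

rows = []
for html in pages:
    soup = BeautifulSoup(html, "html.parser")
    price = soup.select_one(".featuresvap h3 span").get_text(strip=True)
    rate = soup.select_one(".featuresvap h3 b").get_text(strip=True).replace("/sq.ft", "")
    rows.append([price, rate])

# io.StringIO stands in for open("results.csv", "w", newline="").
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["price", "rate_per_sqft"])  # hypothetical column names
writer.writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)
```

Letting csv.writer build each line also handles quoting if a value ever contains a comma, which a manual ','.join would not.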

Resources