I'm not very familiar with BeautifulSoup, and for the life of me I can't seem to retrieve the table in this HTML. I parsed the page using Beautiful Soup and came up empty. Any help would be appreciated. Thanks!
import requests
from bs4 import BeautifulSoup

url = 'https://definitivehc.maps.arcgis.com/home/item.html?id=1044bb19da8d4dbfb6a96eb1b4ebf629&view=list&showFilters=false#data'
response = requests.get(url, timeout=10)
bs4 = BeautifulSoup(response.content, 'lxml')
table_body = bs4.find('table')
rows = table_body.find_all('tr')
for row in rows:
    cols = row.find_all('th')
    cols = [x.text.strip() for x in cols]
    print(cols)
So I could generate the header for the table, but could not retrieve the data from the table itself. Here is the HTML:
<table class="dgrid-row-table" role="presentation">
<tr>
<td class="dgrid-cell dgrid-cell-padding dgrid-column-0 field-HOSPITAL_NAME"
role="gridcell"><div>Phoenix VA Health Care System (AKA Carl T Hayden VA
Medical Center)</div>
</td>
:
:
<td....................</td>
<td....................</td>
<td....................</td>
<td....................</td>
...and there are several other TDs. I'm trying to capture all the values from the table. Here is my attempt so far:
import time

from bs4 import BeautifulSoup
from selenium import webdriver

url = 'https://definitivehc.maps.arcgis.com/home/item.html?id=1044bb19da8d4dbfb6a96eb1b4ebf629&view=list&showFilters=false#data'
browser = webdriver.Chrome(r"C:\Users\lab\chromedriver")
browser.get(url)
time.sleep(15)
html = browser.page_source
soup = BeautifulSoup(html, "html.parser")  # note: "html" alone is not a valid parser name
table_body = soup.find('table', {'class': 'dgrid-row-table', 'role': 'presentation'})
rows = table_body.find_all('tr')
for row in rows:
    cols = row.find_all('td')
    cols = [x.text.strip() for x in cols]
    print(cols)
The loop prints nothing when I run it. Thanks.
Using selenium:
from bs4 import BeautifulSoup
import time
from selenium import webdriver

url = "https://definitivehc.maps.arcgis.com/home/item.html?id=1044bb19da8d4dbfb6a96eb1b4ebf629&view=list&showFilters=false#data"
browser = webdriver.Chrome('/usr/bin/chromedriver')
browser.get(url)
time.sleep(15)  # give the JavaScript-rendered grid time to load
html = browser.page_source
soup = BeautifulSoup(html, "html.parser")
print(len(soup.find_all("table")))
print(soup.find("table", {"id": "dgrid_0-header"}))
browser.close()
browser.quit()
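Once the rendered page source is in hand, parsing the rows works the way the question attempted. A minimal sketch against a stand-in snippet of the dgrid markup (the HTML string below is a made-up sample, not the live page):

```python
from bs4 import BeautifulSoup

# Stand-in for browser.page_source: a made-up sample of the dgrid markup
html = """
<table class="dgrid-row-table" role="presentation">
  <tr>
    <td class="dgrid-cell field-HOSPITAL_NAME" role="gridcell"><div>Phoenix VA Health Care System</div></td>
    <td class="dgrid-cell field-NUM_LICENSED_BEDS" role="gridcell"><div>197</div></td>
  </tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
rows = []
for table in soup.find_all("table", {"class": "dgrid-row-table"}):
    for tr in table.find_all("tr"):
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        if cells:
            rows.append(cells)

print(rows)
```

On the real page each record sits in its own dgrid-row-table, which is why the sketch loops over every matching table rather than a single one.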
Related
I'm trying to extract the content marked by <div class="sense"> in abc. With ''.join(map(str, soup.select_one('.sense').contents)) I only get the content between the tags, i.e. xyz. To do my job, I also need the full <div class="sense">xyz</div>.
from bs4 import BeautifulSoup
abc = """abcdd<div class="sense">xyz</div>"""
soup = BeautifulSoup(abc, 'html.parser')
content1 = ''.join(map(str, soup.select_one('.sense').contents))
print(content1)
and the result is xyz. Could you please elaborate on how to achieve my goal?
Try:
from bs4 import BeautifulSoup
abc = """abcdd<div class="sense">xyz</div>"""
soup = BeautifulSoup(abc, 'html.parser')
div = soup.find('div', attrs={'class': 'sense'})
print(div)
prints:
<div class="sense">xyz</div>
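The difference: .contents is the list of the tag's children, while the tag object itself stringifies with its own markup included. A short sketch keeping the select_one style from the question (same toy input):

```python
from bs4 import BeautifulSoup

abc = """abcdd<div class="sense">xyz</div>"""
soup = BeautifulSoup(abc, 'html.parser')

div = soup.select_one('.sense')
inner = ''.join(map(str, div.contents))  # children only: 'xyz'
outer = str(div)                         # the tag including its own markup

print(inner)
print(outer)
```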
I am trying to extract some information from this site using BeautifulSoup. I am familiar with extracting tags by class/attributes, but how can I extract the URL from the tr's data-url attribute?
import requests
import re
from bs4 import BeautifulSoup

url = "https://www.amcham.org.sg/events-list/?item%5Bdate_start%5D=07%2F05%2F2019&item%5Bdate_end%5D=09/17/2019#page-1"
webpage_response = requests.get(url)
webpage = webpage_response.content
soup = BeautifulSoup(webpage, "html.parser")
table = soup.find_all("tbody")
for i in table:
    rows = i.find_all("tr")
    for row in rows:
        print(row)
<tr data-url="https://www.amcham.org.sg/event/8914">
<td class="date">July 09, 2019</td>
Try (picking up on your code):
myurl = [row['data-url'] for row in soup.find_all('tr', attrs={'data-url': True})]
print(myurl)
Source:
https://stackoverflow.com/a/24198276/1447509
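A self-contained illustration of the attrs={'data-url': True} filter, run against a stand-in snippet modeled on the markup in the question (not the live site):

```python
from bs4 import BeautifulSoup

# Made-up snippet modeled on the events table in the question
html = """
<tbody>
  <tr data-url="https://www.amcham.org.sg/event/8914"><td class="date">July 09, 2019</td></tr>
  <tr data-url="https://www.amcham.org.sg/event/8920"><td class="date">July 11, 2019</td></tr>
  <tr><td class="date">a row without a link</td></tr>
</tbody>
"""

soup = BeautifulSoup(html, "html.parser")
# attrs={'data-url': True} matches only tags that actually carry the attribute,
# so rows without a link are skipped
urls = [tr['data-url'] for tr in soup.find_all('tr', attrs={'data-url': True})]
print(urls)
```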
I am trying to get data from Yellow Pages, but I need only the numbered plumbers. I can't get the number text in the h2 class='n' headings. I can get the class="business-name" text, but I need only the numbered plumbers, not the ads. What is my mistake? Thank you very much.
This is the HTML:
<div class="info">
<h2 class="n">1. <a class="business-name" href="/austin-tx/mip/johnny-rooter-11404675?lid=171372530" rel="" data-impressed="1"><span>Johnny Rooter</span></a></h2>
</div>
And this is my python code:
import requests
from bs4 import BeautifulSoup as bs

url = "https://www.yellowpages.com/austin-tx/plumbers"
req = requests.get(url)
data = req.content
soup = bs(data, "lxml")
links = soup.findAll("div", {"class": "info"})
for link in links:
    for content in link.contents:
        try:
            print(content.find("h2", {"class": "n"}).text)
        except:
            pass
You need a different class selector to limit results to that section:
import requests
from bs4 import BeautifulSoup as bs
url = "https://www.yellowpages.com/austin-tx/plumbers"
req = requests.get(url)
data = req.content
soup = bs(data, "lxml")
links = [item.text.replace('\xa0','') for item in soup.select('.organic h2')]
print(links)
.organic is a single class selector, taken from a compound class, on a parent element that restricts matches to the numbered plumbers and excludes the ads.
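A sketch of how a parent class selector skips a sibling section; the class names mirror the ones in the answer, but the HTML is a made-up sample, not the live Yellow Pages markup:

```python
from bs4 import BeautifulSoup

# Made-up sample: ad results sit outside the .organic container
html = """
<div class="search-results paid">
  <h2><a class="business-name"><span>Sponsored Plumber</span></a></h2>
</div>
<div class="search-results organic">
  <h2 class="n">1.&nbsp;<a class="business-name"><span>Johnny Rooter</span></a></h2>
  <h2 class="n">2.&nbsp;<a class="business-name"><span>Ace Drains</span></a></h2>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
# '.organic h2' matches h2 elements only inside the organic-results container
links = [item.text.replace('\xa0', '') for item in soup.select('.organic h2')]
print(links)
```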
I want to extract "1.02 Crores" and "7864" from the HTML code and save them in different columns of a CSV file.
Code:
<div class="featuresvap _graybox clearfix"><h3><span><i class="icon-inr"></i>1.02 Crores</span><small> # <i class="icon-inr"></i><b>7864/sq.ft</b> as per carpet area</small></h3>
Not sure about the actual data, but this is just something I threw together quickly. If you need it to navigate to a website, use import requests: add url = 'yourwebpagehere' and page = requests.get(url), change the soup line to soup = BeautifulSoup(page.text, 'lxml'), and remove the html variable since it would be unneeded.
from bs4 import BeautifulSoup
import csv

html = '<div class="featuresvap _graybox clearfix"><h3><span><i class="icon-inr"></i>1.02 Crores</span><small> # <i class="icon-inr"></i><b>7864/sq.ft</b> as per carpet area</small></h3>'
soup = BeautifulSoup(html, 'lxml')
findSpan = soup.find('span')
findB = soup.find('b')
print([findSpan.text, findB.text.replace('/sq.ft', '')])
with open('NAMEYOURFILE.csv', 'w+', newline='') as writer:
    csv_writer = csv.writer(writer)
    csv_writer.writerow(["First Column Name", "Second Column Name"])
    # write the extracted text, not the Tag objects
    csv_writer.writerow([findSpan.text, findB.text.replace('/sq.ft', '')])
Self-explained in the code:
from bs4 import BeautifulSoup

firstCol = []   # data for first column
secondCol = []  # data for second column
for url in listURL:  # listURL: your list of page URLs
    html = '.....'  # downloaded html
    soup = BeautifulSoup(html, 'html.parser')
    # 'select_one' selects using CSS selectors and returns only the first element
    fCol = soup.select_one('.featuresvap h3 span')
    # remove: <i class="icon-inr"></i>
    fCol.find('i').extract()
    sCol = soup.select_one('.featuresvap h3 b')
    firstCol.append(fCol.text)
    secondCol.append(sCol.text.replace('/sq.ft', ''))

with open('results.csv', 'w') as fl:
    csvContent = ','.join(firstCol) + '\n' + ','.join(secondCol)
    fl.write(csvContent)

''' sample results
1.02 Crores | 2.34 Crores
7864 | 2475
'''
print('finish')
Scraping a column from Wikipedia with BeautifulSoup returns only the last row, while I want all of them in a list:
from urllib.request import urlopen
from bs4 import BeautifulSoup

site = "https://en.wikipedia.org/wiki/Agriculture_in_India"
html = urlopen(site)
soup = BeautifulSoup(html, "html.parser")
table = soup.find("table", {'class': 'wikitable sortable'})
for row in table.find_all("tr")[1:]:
    col = row.find_all("td")
    if len(col) > 0:
        com = str(col[1].string.strip("\n"))
        list(com)
com
Out: 'ZTS'
So it only shows the last row; I was expecting a list with each cell's text as a string value, so that I can assign the list to a new variable:
"Rice", "Buffalo milk", "Cow milk", "Wheat"
Can anyone help me?
Your method will not work because you never append anything to com; each iteration of the loop just overwrites it.
One way to do what you want is:
from urllib.request import urlopen
from bs4 import BeautifulSoup

site = "https://en.wikipedia.org/wiki/Agriculture_in_India"
html = urlopen(site)
soup = BeautifulSoup(html, "html.parser")
table = soup.find("table", {'class': 'wikitable sortable'})
com = []
for row in table.find_all("tr")[1:]:
    col = row.find_all("td")
    if len(col) > 0:
        temp = col[1].contents[0]
        try:
            to_append = temp.contents[0]
        except Exception:
            to_append = temp
        com.append(to_append)
print(com)
This will give you what you require.
Explanation
col[1].contents[0] gives the first child of the tag; .contents gives you the list of a tag's children, and here there is a single child, hence index 0.
In some cases the content inside the <td> tag is an <a href> link, so I apply another .contents[0] to get the text.
In other cases it is not a link. For those I used the exception handler: a plain string child has no .contents of its own, so the lookup raises and we append the string directly.
See the official documentation for details
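A minimal illustration of the linked-vs-plain cell handling on a toy table (made-up HTML, not the Wikipedia page):

```python
from bs4 import BeautifulSoup

# Made-up rows: the second cell is a link in one row and plain text in the other
html = ('<table><tr><td>1</td><td><a href="/wiki/Rice">Rice</a></td></tr>'
        '<tr><td>2</td><td>Wheat</td></tr></table>')
soup = BeautifulSoup(html, 'html.parser')

com = []
for row in soup.find_all('tr'):
    col = row.find_all('td')
    temp = col[1].contents[0]          # either an <a> Tag or a plain string
    try:
        com.append(temp.contents[0])   # Tag: descend once more to its text
    except Exception:
        com.append(temp)               # a NavigableString has no .contents

print(com)
```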