Extracting the Link from this HTML - python-3.x

I am trying to extract some information from this site using BeautifulSoup. I am familiar with extracting tags by class or attributes, but how can I extract the URL stored in the data-url attribute of a tr?
import requests
from bs4 import BeautifulSoup

url = "https://www.amcham.org.sg/events-list/?item%5Bdate_start%5D=07%2F05%2F2019&item%5Bdate_end%5D=09/17/2019#page-1"
webpage_response = requests.get(url)
webpage = webpage_response.content
soup = BeautifulSoup(webpage, "html.parser")

table = soup.find_all("tbody")
for i in table:
    rows = i.find_all("tr")
    for row in rows:
        print(row)
<tr data-url="https://www.amcham.org.sg/event/8914">
<td class="date">July 09, 2019</td>

Try (picking up on your code):
urls = [row['data-url'] for row in soup.find_all('tr', attrs={'data-url': True})]
print(urls)
Source:
https://stackoverflow.com/a/24198276/1447509
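If you also want the date that sits in the same row (as in the snippet above), a minimal sketch could pair the two. It assumes every event row carries a data-url attribute and a td with class "date", matching the HTML shown earlier:
# Sketch: pair each event URL with the date cell in the same <tr data-url=...> row.
import requests
from bs4 import BeautifulSoup

url = "https://www.amcham.org.sg/events-list/?item%5Bdate_start%5D=07%2F05%2F2019&item%5Bdate_end%5D=09/17/2019#page-1"
soup = BeautifulSoup(requests.get(url).content, "html.parser")

events = []
for row in soup.find_all("tr", attrs={"data-url": True}):
    date_cell = row.find("td", class_="date")  # e.g. "July 09, 2019"
    events.append({
        "url": row["data-url"],
        "date": date_cell.get_text(strip=True) if date_cell else None,
    })
print(events)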

Related

Getting Empty content Beautiful Soup Python

I am not very familiar with BeautifulSoup, and for the life of me I can't seem to retrieve the table in this HTML. I parsed the HTML page using Beautiful Soup and came up empty. Any help will be appreciated. Thanks!
import requests
from bs4 import BeautifulSoup

url = 'https://definitivehc.maps.arcgis.com/home/item.html?id=1044bb19da8d4dbfb6a96eb1b4ebf629&view=list&showFilters=false#data'
response = requests.get(url, timeout=10)
bs4 = BeautifulSoup(response.content, 'lxml')
table_body = bs4.find('table')
rows = table_body.find_all('tr')
for row in rows:
    cols = row.find_all('th')
    cols = [x.text.strip() for x in cols]
    print(cols)
So I could generate the header for the table, but could not retrieve the data from the table itself. Here is the HTML:
<table class="dgrid-row-table" role="presentation">
<tr>
<td class="dgrid-cell dgrid-cell-padding dgrid-column-0 field-HOSPITAL_NAME"
role="gridcell"><div>**Phoenix VA Health Care System (AKA Carl T Hayden VA
Medical Center)**</div>
</td>
:
:
<td....................</td>
<td....................</td>
<td....................</td>
<td....................</td>
...and there are several other TDs. I am trying to capture all the values from the table. Here is my attempt so far:
from bs4 import BeautifulSoup as Soup
from selenium import webdriver
import time

url = 'https://definitivehc.maps.arcgis.com/home/item.html?id=1044bb19da8d4dbfb6a96eb1b4ebf629&view=list&showFilters=false#data'
browser = webdriver.Chrome(r"C:\Users\lab\chromedriver")
browser.get(url)
time.sleep(15)
html = browser.page_source
soup = Soup(html, "html.parser")
table_body = soup.find('table', {'class': 'dgrid-row-table', 'role': 'presentation'})
rows = table_body.find_all('tr')
for row in rows:
    cols = row.find_all('td')
    cols = [x.text.strip() for x in cols]
    print(cols)
The columns generate nothing when I run it. Thanks.
Using Selenium (the grid is rendered by JavaScript, so the plain requests response never contains the table data; letting a real browser load the page first works around that):
from bs4 import BeautifulSoup
import time
from selenium import webdriver

url = "https://definitivehc.maps.arcgis.com/home/item.html?id=1044bb19da8d4dbfb6a96eb1b4ebf629&view=list&showFilters=false#data"
browser = webdriver.Chrome('/usr/bin/chromedriver')
browser.get(url)
time.sleep(15)  # give the JavaScript grid time to render
html = browser.page_source
soup = BeautifulSoup(html, "html.parser")
print(len(soup.find_all("table")))
print(soup.find("table", {"id": "dgrid_0-header"}))
browser.close()
browser.quit()
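From there, a minimal sketch of pulling the cell text out of the rendered grid might look like the following. It assumes soup was built from browser.page_source as above and that the rows still use the dgrid-row-table class shown in the question's HTML:
# Sketch: extract cell text from the rendered dgrid rows.
all_rows = []
for table in soup.find_all("table", {"class": "dgrid-row-table"}):
    for tr in table.find_all("tr"):
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        if cells:  # skip empty structural rows
            all_rows.append(cells)

for row_values in all_rows:
    print(row_values)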

BeautifulSoup python: Get the text with no tags and get the adjacent links

I am trying to extract the movie titles and their links from this site.
from bs4 import BeautifulSoup
from requests import get
link = "https://tamilrockerrs.ch"
r = get(link).content
#r = open('json.html','rb').read()
b = BeautifulSoup(r,'html5lib')
a = b.findAll('p')[1]
But the problem is that there are no tags around the titles, so I can't extract them; and even if I could, how can I bind the links and titles together?
Thanks in advance.
You can find the title and link this way.
from bs4 import BeautifulSoup
import requests

url = "http://tamilrockerrs.ch"
response = requests.get(url)
data = response.text
soup = BeautifulSoup(data, 'html.parser')
data = soup.find_all('div', {"class": "title"})
for film in data:
    print("Title:", film.find('a').text)        # get the title here
    print("Link:", film.find('a').get("href"))  # get the link here

BeautifulSoup: get text in tag

I am trying to get data from Yellow Pages, but I need only the numbered plumbers. I can't get the numbered text in the h2 class='n' elements; I can get the a class="business-name" text, but I need only the numbered plumbers, not the advertisement listings. What is my mistake? Thank you very much.
This is the HTML:
<div class="info">
<h2 class="n">1. <a class="business-name" href="/austin-tx/mip/johnny-rooter-11404675?lid=171372530" rel="" data-impressed="1"><span>Johnny Rooter</span></a></h2>
</div>
And this is my Python code:
import requests
from bs4 import BeautifulSoup as bs

url = "https://www.yellowpages.com/austin-tx/plumbers"
req = requests.get(url)
data = req.content
soup = bs(data, "lxml")
links = soup.findAll("div", {"class": "info"})
for link in links:
    for content in link.contents:
        try:
            print(content.find("h2", {"class": "n"}).text)
        except:
            pass
You need a different class selector to limit the results to that section:
import requests
from bs4 import BeautifulSoup as bs
url = "https://www.yellowpages.com/austin-tx/plumbers"
req = requests.get(url)
data = req.content
soup = bs(data, "lxml")
links = [item.text.replace('\xa0','') for item in soup.select('.organic h2')]
print(links)
.organic is a single class selector, taken from a compound class, on a parent element that restricts the matches to the numbered plumbers and excludes the ad listings.
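If the links are needed as well, a small sketch could pull both the numbered heading text and the business-name href from the same .organic section, reusing the soup object from the snippet above; the h2.n and a.business-name selectors are taken from the HTML in the question and are assumed to still apply:
# Sketch: pair each numbered heading with its business-name link,
# restricted to the .organic (non-ad) section.
results = []
for h2 in soup.select('.organic h2.n'):
    name = h2.get_text(strip=True).replace('\xa0', ' ')
    anchor = h2.select_one('a.business-name')
    href = anchor.get('href') if anchor else None
    results.append((name, href))

for name, href in results:
    print(name, href)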

Extracting Data from HTML Span using Beautiful Soup

I want to extract"1.02 Crores" and "7864" from html code and save them in different column in csv file.
Code:
<div class="featuresvap _graybox clearfix"><h3><span><i class="icon-inr"></i>1.02 Crores</span><small> # <i class="icon-inr"></i><b>7864/sq.ft</b> as per carpet area</small></h3>
I am not sure about the actual data, but this is just something that I threw together quickly. If you need it to fetch a live web page, use the requests module: add url = 'yourwebpagehere' and page = requests.get(url), change the soup line to soup = BeautifulSoup(page.text, 'lxml'), and remove the html variable since it would no longer be needed.
from bs4 import BeautifulSoup
import csv

html = '<div class="featuresvap _graybox clearfix"><h3><span><i class="icon-inr"></i>1.02 Crores</span><small> # <i class="icon-inr"></i><b>7864/sq.ft</b> as per carpet area</small></h3>'
soup = BeautifulSoup(html, 'lxml')
findSpan = soup.find('span')
findB = soup.find('b')
print([findSpan.text, findB.text.replace('/sq.ft', '')])

with open('NAMEYOURFILE.csv', 'w+', newline='') as writer:
    csv_writer = csv.writer(writer)
    csv_writer.writerow(["First Column Name", "Second Column Name"])
    csv_writer.writerow([findSpan.text, findB.text.replace('/sq.ft', '')])
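For reference, a minimal sketch of the requests-based variant described above; 'yourwebpagehere' is the placeholder from the answer, not a real endpoint, and the span/b structure is assumed to match the snippet in the question:
# Sketch of the live-page variant: fetch the page with requests, then parse.
import requests
from bs4 import BeautifulSoup

url = 'yourwebpagehere'  # placeholder URL
page = requests.get(url)
soup = BeautifulSoup(page.text, 'lxml')
findSpan = soup.find('span')
findB = soup.find('b')
print([findSpan.text, findB.text.replace('/sq.ft', '')])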
Self-explained in the code:
from bs4 import BeautifulSoup

# data for first column
firstCol = []
# data for second column
secondCol = []
for url in listURL:  # listURL: your list of page URLs
    html = '.....'  # downloaded html
    soup = BeautifulSoup(html, 'html.parser')
    # 'select_one' selects using CSS selectors and returns only the first element
    fCol = soup.select_one('.featuresvap h3 span')
    # remove: <i class="icon-inr"></i>
    fCol.find("i").extract()
    sCol = soup.select_one('.featuresvap h3 b')
    firstCol.append(fCol.text)
    secondCol.append(sCol.text.replace('/sq.ft', ''))

with open('results.csv', 'w') as fl:
    csvContent = ','.join(firstCol) + '\n' + ','.join(secondCol)
    fl.write(csvContent)

''' sample results
1.02 Crores | 2.34 Crores
7864 | 2475
'''
print('finish')
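Note that the snippet above writes each column as a row of the CSV. If one row per listing is preferred, a small sketch using csv.writer and zip could replace the file-writing step, assuming firstCol and secondCol were filled as above (the header names are only illustrative):
# Sketch: one CSV row per listing, pairing price with rate per sq.ft.
import csv

with open('results.csv', 'w', newline='') as fl:
    writer = csv.writer(fl)
    writer.writerow(['Price', 'Rate per sq.ft'])  # hypothetical header names
    for price, rate in zip(firstCol, secondCol):
        writer.writerow([price, rate])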

Parsing through HTML to extract data from table rows with beautiful soup

I'm using BeautifulSoup to extract stock information from the NASDAQ website. I want to retrieve information specifically from the table rows on the HTML page, but I always get an error (line 12).
#import html-parser
from bs4 import BeautifulSoup
from requests import get
url = 'https://www.nasdaq.com/symbol/amzn' #AMZN is just an example
response = get(url)
#Create parse tree (BeautifulSoup Object)
soup = BeautifulSoup(response.text, 'html.parser')
data = soup.find_all(class_= 'column span-1-of-2')
table = data.find(class_= 'table-row') #This is where the error occurs
print(table)
You can do something like this to get the data from table rows.
import requests
from bs4 import BeautifulSoup
import re
r = requests.get("https://www.nasdaq.com/")
print(r)
soup = BeautifulSoup(r.content, 'html.parser')
data = soup.find('table',{'id':'indexTable', 'class':'floatL marginB5px'}).script.text
matches = re.findall(r'nasdaqHomeIndexChart.storeIndexInfo(.*);\r\n', data)
table_rows = [re.findall(r'\".*\"', row) for row in matches]
print(table_rows)
table_rows is a list of lists containing the table data.
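As for the error in the original snippet: soup.find_all() returns a ResultSet (a list of tags), which has no .find() method, so .find() has to be called on each individual element (or use soup.find() to get a single match). A minimal sketch of that fix, reusing the class names from the question (whether those classes still exist on the current NASDAQ page is an assumption):
# Sketch: iterate over the ResultSet before calling .find() on each element.
import requests
from bs4 import BeautifulSoup

response = requests.get('https://www.nasdaq.com/symbol/amzn')
soup = BeautifulSoup(response.text, 'html.parser')

for section in soup.find_all(class_='column span-1-of-2'):
    table = section.find(class_='table-row')
    if table is not None:
        print(table.get_text(strip=True))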
