Extracting the Link from this HTML - python-3.x

I am trying to extract some information from this site using BeautifulSoup. I am familiar with extracting tags by class or attributes, but how can I extract the URL stored in the data-url attribute of a tr?
import requests
from bs4 import BeautifulSoup

url = "https://www.amcham.org.sg/events-list/?item%5Bdate_start%5D=07%2F05%2F2019&item%5Bdate_end%5D=09/17/2019#page-1"
webpage_response = requests.get(url)
webpage = webpage_response.content
soup = BeautifulSoup(webpage, "html.parser")

table = soup.find_all("tbody")
for i in table:
    rows = i.find_all("tr")
    for row in rows:
        print(row)
<tr data-url="https://www.amcham.org.sg/event/8914">
<td class="date">July 09, 2019</td>

Try (picking up on your code):
urls = [row['data-url'] for row in soup.find_all('tr', attrs={'data-url': True})]
print(urls)
Source:
https://stackoverflow.com/a/24198276/1447509
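If you also want the date that sits in the same row (as in the snippet above), a minimal sketch could pair the two. It assumes every event row carries a data-url attribute and a td with class "date", matching the HTML shown earlier:
# Sketch: pair each event URL with the date cell in the same <tr data-url=...> row.
import requests
from bs4 import BeautifulSoup

url = "https://www.amcham.org.sg/events-list/?item%5Bdate_start%5D=07%2F05%2F2019&item%5Bdate_end%5D=09/17/2019#page-1"
soup = BeautifulSoup(requests.get(url).content, "html.parser")

events = []
for row in soup.find_all("tr", attrs={"data-url": True}):
    date_cell = row.find("td", class_="date")  # e.g. "July 09, 2019"
    events.append({
        "url": row["data-url"],
        "date": date_cell.get_text(strip=True) if date_cell else None,
    })
print(events)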

Related

Getting Empty content Beautiful Soup Python

I am not very familiar with BeautifulSoup, and for the life of me I can't seem to retrieve the table in this HTML. I parsed the HTML page using Beautiful Soup and came up empty. Any help will be appreciated. Thanks!
import requests
from bs4 import BeautifulSoup

url = 'https://definitivehc.maps.arcgis.com/home/item.html?id=1044bb19da8d4dbfb6a96eb1b4ebf629&view=list&showFilters=false#data'
response = requests.get(url, timeout=10)
bs4 = BeautifulSoup(response.content, 'lxml')
table_body = bs4.find('table')
rows = table_body.find_all('tr')
for row in rows:
    cols = row.find_all('th')
    cols = [x.text.strip() for x in cols]
    print(cols)
So I could generate the header for the table, but could not retrieve the data from the table itself. Here is the HTML:
<table class="dgrid-row-table" role="presentation">
<tr>
<td class="dgrid-cell dgrid-cell-padding dgrid-column-0 field-HOSPITAL_NAME"
role="gridcell"><div>**Phoenix VA Health Care System (AKA Carl T Hayden VA
Medical Center)**</div>
</td>
:
:
<td....................</td>
<td....................</td>
<td....................</td>
<td....................</td>
...and there are several other TDs. I am trying to capture all the values from the table. Here is my attempt so far:
from bs4 import BeautifulSoup as Soup
from selenium import webdriver
import time

url = 'https://definitivehc.maps.arcgis.com/home/item.html?id=1044bb19da8d4dbfb6a96eb1b4ebf629&view=list&showFilters=false#data'
browser = webdriver.Chrome(r"C:\Users\lab\chromedriver")
browser.get(url)
time.sleep(15)
html = browser.page_source
soup = Soup(html, "html.parser")
table_body = soup.find('table', {'class': 'dgrid-row-table', 'role': 'presentation'})
rows = table_body.find_all('tr')
for row in rows:
    cols = row.find_all('td')
    cols = [x.text.strip() for x in cols]
    print(cols)
The columns generate nothing when I run it. Thanks.
Using Selenium (the grid is rendered by JavaScript, so the plain requests response never contains the table data; letting a real browser load the page first works around that):
from bs4 import BeautifulSoup
import time
from selenium import webdriver

url = "https://definitivehc.maps.arcgis.com/home/item.html?id=1044bb19da8d4dbfb6a96eb1b4ebf629&view=list&showFilters=false#data"
browser = webdriver.Chrome('/usr/bin/chromedriver')
browser.get(url)
time.sleep(15)  # give the JavaScript grid time to render
html = browser.page_source
soup = BeautifulSoup(html, "html.parser")
print(len(soup.find_all("table")))
print(soup.find("table", {"id": "dgrid_0-header"}))
browser.close()
browser.quit()
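From there, a minimal sketch of pulling the cell text out of the rendered grid might look like the following. It assumes soup was built from browser.page_source as above and that the rows still use the dgrid-row-table class shown in the question's HTML:
# Sketch: extract cell text from the rendered dgrid rows.
all_rows = []
for table in soup.find_all("table", {"class": "dgrid-row-table"}):
    for tr in table.find_all("tr"):
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        if cells:  # skip empty structural rows
            all_rows.append(cells)

for row_values in all_rows:
    print(row_values)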

BeautifulSoup python: Get the text with no tags and get the adjacent links

I am trying to extract the movie titles and their links from this site.
from bs4 import BeautifulSoup
from requests import get
link = "https://tamilrockerrs.ch"
r = get(link).content
#r = open('json.html','rb').read()
b = BeautifulSoup(r,'html5lib')
a = b.findAll('p')[1]
But the problem is that there are no tags around the titles, so I can't extract them; and even if I could, how can I bind the links and titles together?
Thanks in advance.
You can find the title and link this way.
from bs4 import BeautifulSoup
import requests

url = "http://tamilrockerrs.ch"
response = requests.get(url)
data = response.text
soup = BeautifulSoup(data, 'html.parser')
data = soup.find_all('div', {"class": "title"})
for film in data:
    print("Title:", film.find('a').text)        # get the title here
    print("Link:", film.find('a').get("href"))  # get the link here

BeautifulSoup: get text in tag

I am trying to get data from Yellow Pages, but I need only the numbered plumbers. I can't get the numbered text in the h2 class='n' elements; I can get the a class="business-name" text, but I need only the numbered plumbers, not the advertisement listings. What is my mistake? Thank you very much.
This is the HTML:
<div class="info">
<h2 class="n">1. <a class="business-name" href="/austin-tx/mip/johnny-rooter-11404675?lid=171372530" rel="" data-impressed="1"><span>Johnny Rooter</span></a></h2>
</div>
And this is my Python code:
import requests
from bs4 import BeautifulSoup as bs

url = "https://www.yellowpages.com/austin-tx/plumbers"
req = requests.get(url)
data = req.content
soup = bs(data, "lxml")
links = soup.findAll("div", {"class": "info"})
for link in links:
    for content in link.contents:
        try:
            print(content.find("h2", {"class": "n"}).text)
        except:
            pass
You need a different class selector to limit the results to that section:
import requests
from bs4 import BeautifulSoup as bs
url = "https://www.yellowpages.com/austin-tx/plumbers"
req = requests.get(url)
data = req.content
soup = bs(data, "lxml")
links = [item.text.replace('\xa0','') for item in soup.select('.organic h2')]
print(links)
.organic is a single class selector, taken from a compound class, on a parent element that restricts the matches to the numbered plumbers and excludes the ad listings.
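If the links are needed as well, a small sketch could pull both the numbered heading text and the business-name href from the same .organic section, reusing the soup object from the snippet above; the h2.n and a.business-name selectors are taken from the HTML in the question and are assumed to still apply:
# Sketch: pair each numbered heading with its business-name link,
# restricted to the .organic (non-ad) section.
results = []
for h2 in soup.select('.organic h2.n'):
    name = h2.get_text(strip=True).replace('\xa0', ' ')
    anchor = h2.select_one('a.business-name')
    href = anchor.get('href') if anchor else None
    results.append((name, href))

for name, href in results:
    print(name, href)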

Extracting Data from HTML Span using Beautiful Soup

I want to extract"1.02 Crores" and "7864" from html code and save them in different column in csv file.
Code:
<div class="featuresvap _graybox clearfix"><h3><span><i class="icon-inr"></i>1.02 Crores</span><small> # <i class="icon-inr"></i><b>7864/sq.ft</b> as per carpet area</small></h3>
I am not sure about the actual data, but this is just something that I threw together quickly. If you need it to fetch a live web page, use the requests module: add url = 'yourwebpagehere' and page = requests.get(url), change the soup line to soup = BeautifulSoup(page.text, 'lxml'), and remove the html variable since it would no longer be needed.
from bs4 import BeautifulSoup
import csv

html = '<div class="featuresvap _graybox clearfix"><h3><span><i class="icon-inr"></i>1.02 Crores</span><small> # <i class="icon-inr"></i><b>7864/sq.ft</b> as per carpet area</small></h3>'
soup = BeautifulSoup(html, 'lxml')
findSpan = soup.find('span')
findB = soup.find('b')
print([findSpan.text, findB.text.replace('/sq.ft', '')])

with open('NAMEYOURFILE.csv', 'w+', newline='') as writer:
    csv_writer = csv.writer(writer)
    csv_writer.writerow(["First Column Name", "Second Column Name"])
    csv_writer.writerow([findSpan.text, findB.text.replace('/sq.ft', '')])
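For reference, a minimal sketch of the requests-based variant described above; 'yourwebpagehere' is the placeholder from the answer, not a real endpoint, and the span/b structure is assumed to match the snippet in the question:
# Sketch of the live-page variant: fetch the page with requests, then parse.
import requests
from bs4 import BeautifulSoup

url = 'yourwebpagehere'  # placeholder URL
page = requests.get(url)
soup = BeautifulSoup(page.text, 'lxml')
findSpan = soup.find('span')
findB = soup.find('b')
print([findSpan.text, findB.text.replace('/sq.ft', '')])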
Self-explained in the code:
from bs4 import BeautifulSoup

# data for first column
firstCol = []
# data for second column
secondCol = []
for url in listURL:  # listURL: your list of page URLs
    html = '.....'  # downloaded html
    soup = BeautifulSoup(html, 'html.parser')
    # 'select_one' selects using CSS selectors and returns only the first element
    fCol = soup.select_one('.featuresvap h3 span')
    # remove: <i class="icon-inr"></i>
    fCol.find("i").extract()
    sCol = soup.select_one('.featuresvap h3 b')
    firstCol.append(fCol.text)
    secondCol.append(sCol.text.replace('/sq.ft', ''))

with open('results.csv', 'w') as fl:
    csvContent = ','.join(firstCol) + '\n' + ','.join(secondCol)
    fl.write(csvContent)

''' sample results
1.02 Crores | 2.34 Crores
7864 | 2475
'''
print('finish')
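Note that the snippet above writes each column as a row of the CSV. If one row per listing is preferred, a small sketch using csv.writer and zip could replace the file-writing step, assuming firstCol and secondCol were filled as above (the header names are only illustrative):
# Sketch: one CSV row per listing, pairing price with rate per sq.ft.
import csv

with open('results.csv', 'w', newline='') as fl:
    writer = csv.writer(fl)
    writer.writerow(['Price', 'Rate per sq.ft'])  # hypothetical header names
    for price, rate in zip(firstCol, secondCol):
        writer.writerow([price, rate])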

Parsing through HTML to extract data from table rows with beautiful soup

I'm using BeautifulSoup to extract stock information from the NASDAQ website. I want to retrieve information specifically from the table rows on the HTML page, but I always get an error (line 12).
#import html-parser
from bs4 import BeautifulSoup
from requests import get
url = 'https://www.nasdaq.com/symbol/amzn' #AMZN is just an example
response = get(url)
#Create parse tree (BeautifulSoup Object)
soup = BeautifulSoup(response.text, 'html.parser')
data = soup.find_all(class_= 'column span-1-of-2')
table = data.find(class_= 'table-row') #This is where the error occurs
print(table)
You can do something like this to get the data from table rows.
import requests
from bs4 import BeautifulSoup
import re
r = requests.get("https://www.nasdaq.com/")
print(r)
soup = BeautifulSoup(r.content, 'html.parser')
data = soup.find('table',{'id':'indexTable', 'class':'floatL marginB5px'}).script.text
matches = re.findall(r'nasdaqHomeIndexChart.storeIndexInfo(.*);\r\n', data)
table_rows = [re.findall(r'\".*\"', row) for row in matches]
print(table_rows)
table_rows is a list of lists containing the table data.
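As for the error in the original snippet: soup.find_all() returns a ResultSet (a list of tags), which has no .find() method, so .find() has to be called on each individual element (or use soup.find() to get a single match). A minimal sketch of that fix, reusing the class names from the question (whether those classes still exist on the current NASDAQ page is an assumption):
# Sketch: iterate over the ResultSet before calling .find() on each element.
import requests
from bs4 import BeautifulSoup

response = requests.get('https://www.nasdaq.com/symbol/amzn')
soup = BeautifulSoup(response.text, 'html.parser')

for section in soup.find_all(class_='column span-1-of-2'):
    table = section.find(class_='table-row')
    if table is not None:
        print(table.get_text(strip=True))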
