Unable to get the expected html element details using Python - python-3.x

I am trying to scrape a website using Python. The page loads without errors, but the output is not the content I expect. I think it has something to do with the JavaScript on the web page.
My code is below:
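import re
from bs4 import BeautifulSoup
# `driver` is a Selenium WebDriver created earlier (its setup is not shown in the question)
driver.get("https://my website")
soup = BeautifulSoup(driver.page_source, 'lxml')
all_text = soup.text
# Replace newlines, tabs and carriage returns with spaces, then collapse repeated spaces
ct = all_text.replace('\n', ' ')
cl_text = ct.replace('\t', ' ')
cln_text_t = cl_text.replace('\r', ' ')
cln_text = re.sub(' +', ' ', cln_text_t)
print(cln_text)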
Instead of the website content, it returns the text below. Any idea how I could fix this?
html, body {height:100%;margin:0;} You have to enable javascript in your browser to use an application built with Vaadin.........

Why do you need BeautifulSoup at all? It only parses HTML and does not execute JavaScript.
If you need the text of the web page, you can select the document root with the simple XPath expression //html and read the innerText property of the resulting WebElement.
Suggested code change:
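from selenium.webdriver.common.by import By
driver.get("my website")
# find_element(By.XPATH, ...) replaces find_element_by_xpath, which was removed in Selenium 4
root = driver.find_element(By.XPATH, "//html")
all_text = root.get_attribute("innerText")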
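As a side note on the cleanup step in the question, the chain of replace calls followed by re.sub can be folded into one regular expression. This is only a minimal sketch: the function name normalize_whitespace and the sample string are made up for illustration, and unlike the original it also strips leading and trailing whitespace and treats every whitespace character (not just \n, \t, \r and space) the same way.

```python
import re


def normalize_whitespace(text):
    """Collapse every run of whitespace (spaces, tabs, newlines) into one space."""
    return re.sub(r"\s+", " ", text).strip()


if __name__ == "__main__":
    # Hypothetical sample; in the question this would be the page text.
    sample = "html,\n body \t{height:100%;}\r\n  You have to enable javascript"
    print(normalize_whitespace(sample))
```

The result can be applied to all_text regardless of whether the text came from BeautifulSoup or from the innerText property.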

Related

web scraping help needed

I was wondering if someone could help me put together some code for
https://finance.yahoo.com/quote/TSCO.l?p=TSCO.L
I currently use this code to scrape the current price
currentPriceData = soup.find_all('div', {'class':'My(6px) Pos(r) smartphone_Mt(6px)'})[0].find('span').text
This works most of the time, but I occasionally get an error. I am not sure why, as the links are all correct. When that happens I would like to try to get the price again,
with something like:
try:
    currentPriceData = soup.find_all('div', {'class':'My(6px) Pos(r) smartphone_Mt(6px)'})[0].find('span').text
except Exception:
    currentPriceData = soup.find('span', {'class':'Trsdu(0.3s) Fw(b) Fz(36px) Mb(-4px) D(ib)'})[0].text
The problem is that I can't get it to scrape the number using this method. Any help would be greatly appreciated.
The data is embedded in the page as a JavaScript variable, and you can use the json module to parse it.
For example:
import re
import json
import requests
url = 'https://finance.yahoo.com/quote/TSCO.l?p=TSCO.L'
html_data = requests.get(url).text
# The next line extracts, from the HTML source, the JavaScript variable
# that holds all the data rendered on the page.
# BeautifulSoup cannot run JavaScript, so we use the `json` module
# to parse the data instead.
# NOTE: When you view the source in Firefox/Chrome, you can search for
# `root.App.main` to see it.
data = json.loads(re.search(r'root\.App\.main = ({.*?});\n', html_data).group(1))
# Uncomment this to print all data:
# print(json.dumps(data, indent=4))
# We now have the JavaScript variable as a standard Python dict,
# so we just print the contents of some keys:
price = data['context']['dispatcher']['stores']['QuoteSummaryStore']['price']['regularMarketPrice']['fmt']
currency_symbol = data['context']['dispatcher']['stores']['QuoteSummaryStore']['price']['currencySymbol']
print('{} {}'.format(price, currency_symbol))
Prints:
227.30 £
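The extraction step can also be tried offline. The sketch below runs the same kind of regex and json.loads against a small hand-written HTML string instead of the live page. The sample_html string and its values, and the helper names extract_app_state and dig, are invented for illustration; the real page is far larger and its layout may change.

```python
import json
import re

# Hypothetical stand-in for the page source.
sample_html = (
    "<script>\n"
    'root.App.main = {"context": {"dispatcher": {"stores": {"QuoteSummaryStore": '
    '{"price": {"regularMarketPrice": {"raw": 227.3, "fmt": "227.30"}, '
    '"currencySymbol": "\\u00a3"}}}}}};\n'
    "</script>"
)


def extract_app_state(html_text):
    """Return the embedded JSON object as a dict, or None if it is not found."""
    match = re.search(r"root\.App\.main = (\{.*?\});\n", html_text)
    if match is None:
        return None
    return json.loads(match.group(1))


def dig(data, *keys, default=None):
    """Walk nested dicts key by key, returning default when a key is missing."""
    for key in keys:
        if not isinstance(data, dict) or key not in data:
            return default
        data = data[key]
    return data


if __name__ == "__main__":
    state = extract_app_state(sample_html)
    price = dig(state, "context", "dispatcher", "stores", "QuoteSummaryStore", "price")
    print(dig(price, "regularMarketPrice", "fmt"), dig(price, "currencySymbol"))
```

Returning None or a default instead of raising is one way to handle a response that lacks the expected data, which may be what causes the occasional error described in the question; the caller can then decide whether to request the page again.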

How to fix "businessObject not defined"

I am a newbie to Python and web scraping. To practice, I am just trying to pull some business names from some HTML tags on a website. However, the code is not running and is throwing an 'object is not defined' error.
from bs4 import BeautifulSoup
import requests
url = 'https://marketplace.akc.org/groomers/?location=Michigan&page=1'
response = requests.get(url, timeout = 5)
content = BeautifulSoup(response.content, "html.parser")
for business in content.find_all('div', attrs={"class": "groomer-salon-card__details"}):
    businessObject = {
        "BusinessName": business.find('h4', attrs={"class": "groomer-salon-card__name"}).text.encode('utf-8')
    }
print (businessObject)
Expected: I am trying to retrieve the business names from this web page.
Result:
NameError: name 'businessObject' is not defined
When you called
content.find_all('div', attrs={"class": "groomer-salon-card__details"})
you actually got an empty list, because nothing matched.
So the body of the loop
for business in content.find_all('div', attrs={"class": "groomer-salon-card__details"}):
never ran, and the name
businessObject
was never assigned. As mentioned in the comments, that led to your error.
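The behaviour described above can be reproduced without any scraping. This is a minimal sketch with made-up names (collect_names, last_seen, the salon strings): a name assigned only inside a loop body stays unbound when the iterable is empty, whereas collecting results into a list created before the loop always leaves something valid to print.

```python
def collect_names(cards):
    """Return one dict per card; an empty input simply yields an empty list."""
    results = []
    for card in cards:
        results.append({"BusinessName": card})
    return results


if __name__ == "__main__":
    # last_seen is assigned only inside the loop body, so it is never bound
    # when the list is empty, and reading it afterwards raises NameError.
    try:
        for card in []:
            last_seen = card
        print(last_seen)
    except NameError as exc:
        print("NameError:", exc)

    # Collecting into a list avoids the problem.
    print(collect_names([]))
    print(collect_names(["Salon A", "Salon B"]))
```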
The content is dynamically loaded from elsewhere in the DOM using JavaScript (alongside other DOM modifications). You can still use a regex to pull out the JavaScript object that holds the content used to update the DOM as you saw it in the browser. You then parse it with a JSON parser as follows:
import requests, re, json
url = 'https://marketplace.akc.org/groomers/?location=Michigan&page=1'
response = requests.get(url, timeout = 5)
p = re.compile(r'state: (.*?)\n', re.DOTALL)
data = json.loads(p.findall(response.text)[0])
for listing in data['content']['search_results']['pages']['data']:
    print(listing['organization_name'])
If you view the page source, you will see that the DOM is essentially populated dynamically from top to bottom, with mutation observers monitoring progress.
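A regex that stops at the first newline depends on the object being written on a single line. As an alternative sketch (not the method used in the answer above), json.JSONDecoder.raw_decode can parse a JSON value starting at a known position and ignore whatever follows it. The sample text, the function name parse_json_after, and the assumption that the object directly follows a state: marker are all illustrative.

```python
import json

# Hypothetical page fragment; the JSON object spans several lines here.
sample_page = (
    "<script>\n"
    "window.app = {\n"
    '  state: {"content": {"search_results": {"pages": {"data": [\n'
    '    {"organization_name": "Salon A"},\n'
    '    {"organization_name": "Salon B"}\n'
    "  ]}}}},\n"
    "  other: 1\n"
    "};\n"
    "</script>"
)


def parse_json_after(text, marker):
    """Parse the first JSON value that follows marker; raise ValueError if absent."""
    start = text.find(marker)
    if start == -1:
        raise ValueError("marker not found: " + marker)
    start += len(marker)
    # raw_decode does not skip leading whitespace, so advance past it first.
    while start < len(text) and text[start].isspace():
        start += 1
    value, _end = json.JSONDecoder().raw_decode(text, start)
    return value


if __name__ == "__main__":
    data = parse_json_after(sample_page, "state:")
    for listing in data["content"]["search_results"]["pages"]["data"]:
        print(listing["organization_name"])
```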

Can I get url, that is generated by JavaScript, using Selenium and Python 3?

I am writing a parser using Selenium and Python 3.7 for the following site: https://www.oddsportal.com/soccer/germany/bundesliga/nurnberg-dortmund-fNa2KmU4/
How can I get a URL that is generated by JavaScript, using Selenium in Python 3?
I need the event URLs on the sites from which the data in the table is taken.
For example, it seems to me that the data in the first row (10Bet) comes from this page: https://www.10bet.com/sports/football/germany-1-bundesliga/20190218/nurnberg-vs-dortmund/
How can I get the URL of this page?
To get the URLs of all the links on a page, you can store every element with the tag name a in a list of WebElements and then read the href attribute of each one.
You can refer to the following code, which covers every link present in the whole page:
List<WebElement> links = driver.findElements(By.tagName("a")); // Stores all the link WebElements in a list
for (WebElement ele : links) // Iterates over the links to read the URL of each one
{
    String url = ele.getAttribute("href"); // getAttribute() with "href" as the argument returns the link
    System.out.println(url);
}
In case you need one particular link explicitly, you need to pass the XPath of that element:
WebElement ele = driver.findElement(By.xpath("Pass the XPath of the element"));
and after that you need to do this:
String url = ele.getAttribute("href"); // to get the URL of the particular element
I have also shared the link with you so you can go and check; I have highlighted the elements in it.
Let us know whether that helped.
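For readers working in Python rather than Java, the idea of collecting every href can be sketched without a browser by parsing an HTML string, for instance the value of driver.page_source once the page has rendered. Note that this swaps Selenium's element lookup for the standard library's html.parser; the class name LinkCollector, the function collect_links and the sample markup are invented for illustration.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                self.links.append(value)


def collect_links(html_text):
    """Return all href values found in html_text, in document order."""
    parser = LinkCollector()
    parser.feed(html_text)
    parser.close()
    return parser.links


if __name__ == "__main__":
    # Hypothetical markup standing in for driver.page_source.
    sample = (
        '<table><tr><td><a href="https://example.com/bookmaker/one">One</a></td></tr>'
        '<tr><td><a name="anchor-only">no link</a></td></tr>'
        '<tr><td><a href="/relative/path">Two</a></td></tr></table>'
    )
    for link in collect_links(sample):
        print(link)
```

Unlike reading the attribute through a live WebElement, this returns relative values exactly as they are written in the markup.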
Try the code below, which prints the required URLs:
from selenium import webdriver
from selenium.webdriver.common.by import By
driver = webdriver.Chrome('C:\\NotBackedUp\\chromedriver.exe')
driver.maximize_window()
driver.get('https://www.oddsportal.com/soccer/germany/bundesliga/nurnberg-dortmund-fNa2KmU4/')
# Locators for locating the required URLs
xpath = "//div[@id='odds-data-table']//tbody//tr"
rows = "//div[@id='odds-data-table']//tbody//tr/td[1]"
print("=> Required URL's is/are : ")
# Fetching & printing the required URLs
# find_elements(By.XPATH, ...) replaces find_elements_by_xpath, which was removed in Selenium 4
for i in range(1, len(driver.find_elements(By.XPATH, rows)), 1):
    anchor = driver.find_elements(By.XPATH, xpath+"["+str(i)+"]/td[2]/a")
    if len(anchor) > 0:
        print(anchor[0].get_attribute('href'))
    else:
        a = driver.find_elements(By.XPATH, "//div[@id='odds-data-table']//tbody//tr["+(str(i+1))+"]/td[1]//a[2]")
        if len(a) > 0:
            print(a[0].get_attribute('href'))
print('Done...')
I hope it helps...

Finding tweet id from parsed html page

I am trying to get tweet id from the parsed HTML. Here is my code:
tweet_ids = []
stat = statnum_parser(page_soup)
name = stat["Full_Name"]
print(page_soup.select("div.tweet"))
for tweet in page_soup.select("div.tweet"): # doesn't work properly
    if tweet['data-name'] == name:
        tweet_ids.append(tweet['data-tweet-id'])
The if condition checks that the tweet is not a retweet. The for loop does not work properly. Can someone help me?
I am using Selenium and BeautifulSoup.
I figured out the problem: I was not using Selenium properly with BeautifulSoup. Here is the code to get the HTML content of a static website properly:
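from bs4 import BeautifulSoup
from selenium import webdriver
path_to_chrome_driver="path_to_your_chrome_driver"
driver = webdriver.Chrome(executable_path=path_to_chrome_driver)
driver.base_url = "URL of the website"
driver.get(driver.base_url)
# Assumed continuation (not in the original answer): parse the loaded page
# so that page_soup.select("div.tweet") works on the HTML Selenium received
page_soup = BeautifulSoup(driver.page_source, "html.parser")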
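Once the HTML is available, the filtering logic from the question can be checked in isolation. The sketch below does so on a hand-written HTML string using only the standard library's html.parser rather than BeautifulSoup; the class name TweetIdCollector, the function own_tweet_ids and the sample markup are invented for illustration, and the attribute names data-name and data-tweet-id are taken from the question.

```python
from html.parser import HTMLParser


class TweetIdCollector(HTMLParser):
    """Collect data-tweet-id from <div class="tweet ..."> whose data-name matches."""

    def __init__(self, full_name):
        super().__init__()
        self.full_name = full_name
        self.tweet_ids = []

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        attributes = dict(attrs)
        classes = (attributes.get("class") or "").split()
        if "tweet" not in classes:
            return
        # .get avoids a KeyError when a div lacks one of the data attributes.
        tweet_id = attributes.get("data-tweet-id")
        if attributes.get("data-name") == self.full_name and tweet_id:
            self.tweet_ids.append(tweet_id)


def own_tweet_ids(html_text, full_name):
    """Return the ids of tweets whose data-name equals full_name."""
    collector = TweetIdCollector(full_name)
    collector.feed(html_text)
    collector.close()
    return collector.tweet_ids


if __name__ == "__main__":
    # Hypothetical markup standing in for the parsed page.
    sample = (
        '<div class="tweet js-stream-tweet" data-name="Jane Doe" data-tweet-id="101"></div>'
        '<div class="tweet" data-name="Someone Else" data-tweet-id="102"></div>'
        '<div class="tweet" data-name="Jane Doe"></div>'
        '<div class="promo" data-name="Jane Doe" data-tweet-id="103"></div>'
    )
    print(own_tweet_ids(sample, "Jane Doe"))
```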

Crawler skipping content of the first page

I've created a crawler which parses certain content from a website.
First, it scrapes the category links from the left sidebar.
Second, it harvests all the links spread across the pagination that lead to the profile pages.
Finally, it visits each profile page and scrapes the name, phone and web address.
So far, it is doing well. The only problem I see with this crawler is that it always starts scraping from the second page, skipping the first page. I suppose there is some way to get around this. Here is the complete code I am trying:
import requests
from lxml import html
url="https://www.houzz.com/professionals/"
def category_links(mainurl):
    req=requests.Session()
    response = req.get(mainurl).text
    tree = html.fromstring(response)
    for titles in tree.xpath("//a[@class='sidebar-item-label']/@href"):
        next_pagelink(titles) # links to the category from left-sided bar
def next_pagelink(process_links):
    req=requests.Session()
    response = req.get(process_links).text
    tree = html.fromstring(response)
    for link in tree.xpath("//ul[@class='pagination']//a[@class='pageNumber']/@href"):
        profile_pagelink(link) # the whole links spread through pagination connected to the profile page
def profile_pagelink(procured_links):
    req=requests.Session()
    response = req.get(procured_links).text
    tree = html.fromstring(response)
    for titles in tree.xpath("//div[@class='name-info']"):
        links = titles.xpath(".//a[@class='pro-title']/@href")[0]
        target_pagelink(links) # profile page of each link
def target_pagelink(main_links):
    req=requests.Session()
    response = req.get(main_links).text
    tree = html.fromstring(response)
    def if_exist(titles,xpath):
        info=titles.xpath(xpath)
        if info:
            return info[0]
        return ""
    for titles in tree.xpath("//div[@class='container']"):
        name = if_exist(titles,".//a[@class='profile-full-name']/text()")
        phone = if_exist(titles,".//a[contains(concat(' ', @class, ' '), ' click-to-call-link ')]/@phone")
        web = if_exist(titles,".//a[@class='proWebsiteLink']/@href")
        print(name,phone,web)
category_links(url)
The problem with the first page is that it doesn't have a 'pagination' class, so the expression tree.xpath("//ul[@class='pagination']//a[@class='pageNumber']/@href") returns an empty list and the profile_pagelink function never gets executed.
As a quick fix, you can handle this case separately in the category_links function:
def category_links(mainurl):
    response = requests.get(mainurl).text
    tree = html.fromstring(response)
    if mainurl == "https://www.houzz.com/professionals/":
        profile_pagelink("https://www.houzz.com/professionals/")
    for titles in tree.xpath("//a[@class='sidebar-item-label']/@href"):
        next_pagelink(titles)
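A slightly more general variant of the quick fix, as a sketch only: always treat the page you are currently on as the first page to scrape, then append whatever pagination links were found. The function name pages_to_scrape and the example URLs are hypothetical.

```python
def pages_to_scrape(current_url, pagination_hrefs):
    """Return current_url followed by the pagination links, without duplicates."""
    ordered = [current_url]
    seen = {current_url}
    for href in pagination_hrefs:
        if href not in seen:
            seen.add(href)
            ordered.append(href)
    return ordered


if __name__ == "__main__":
    first = "https://example.com/professionals/"
    # No pagination links found: the current page is still scraped.
    print(pages_to_scrape(first, []))
    # Pagination links found: the current page comes first, repeats are dropped.
    print(pages_to_scrape(first, [first + "p/15", first + "p/30", first + "p/15"]))
```

In the crawler, next_pagelink could call profile_pagelink for each URL returned here, passing the result of its pagination XPath query as pagination_hrefs.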
I also noticed that target_pagelink prints a lot of empty strings as a result of if_exist returning "". You can skip those cases by adding a condition in the for loop:
for titles in tree.xpath("//div[@class='container']"): # use class='profile-cover' if you get duplicates
    name = if_exist(titles,".//a[@class='profile-full-name']/text()")
    phone = if_exist(titles,".//a[contains(concat(' ', @class, ' '), ' click-to-call-link ')]/@phone")
    web = if_exist(titles,".//a[@class='proWebsiteLink']/@href")
    if name+phone+web :
        print(name,phone,web)
Finally, requests.Session is mostly used for persisting cookies and headers across requests, which is not necessary for your script. You can just use requests.get and get the same results.
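The empty-row check can also be expressed with any(), which does not rely on the three fields being strings that can be concatenated. A minimal sketch, with hypothetical helper names first_or_default and has_data and made-up sample values:

```python
def first_or_default(values, default=""):
    """Return the first item of a list-like XPath result, or default if it is empty."""
    return values[0] if values else default


def has_data(*fields):
    """True when at least one field is non-empty."""
    return any(fields)


if __name__ == "__main__":
    # Each inner list stands in for the result of one XPath query.
    rows = [
        (["Acme Design"], ["555-0100"], []),
        ([], [], []),
    ]
    for name_hits, phone_hits, web_hits in rows:
        name = first_or_default(name_hits)
        phone = first_or_default(phone_hits)
        web = first_or_default(web_hits)
        if has_data(name, phone, web):
            print(name, phone, web)
```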
