Trying to scrape a date/time field - rvest

I am new here as well as to web scraping. I am trying to figure out how to get the date/time from a piece of HTML. I would think this is quite simple with rvest, but it is already taking me a while. My best guess was:
test <- page %>% html_nodes("span") %>% html_attr("time")
But it returns all "NA".
Thank you for your advice!
<div class="v-popover"><span aria-describedby="popover_1chdwnsl8d" class="trigger" style="display: inline-block;"><time datetime="2019-03-30T04:55:56.000Z" title="Saturday, March 30, 2019, 05:55:56 AM" class="review-date--tooltip-target">Mar 30, 2019</time> <div class="tooltip-container-2"></div> <!----></span> </div>
</div>

library(rvest)
pg <- read_html(<path>)
datetime <- pg %>% html_node(xpath = "//time") %>% html_attr("datetime")
date <- pg %>% html_node(xpath = "//time") %>% html_text()
datetime_long <- pg %>% html_node(xpath = "//time") %>% html_attr("title")
These calls return the values as text; you may want to convert them to a Date or date-time (e.g. with as.POSIXct()).

Related

When scraping HTML with an "encoded" part, is it possible to get it?

One of the final steps in my project is to get the price of a product. I got everything I need except the price.
Source:
<div class="prices">
<div class="price">
<div class="P01 tooltip"><span>Product 1</span></div>€<div class="encoded" data-price="bzMzlXaZjkxLjUxNA==">151.4</div>
</div>
<div class="price">
<div class="Po1plus tooltip"><span>Product 1 +</span></div>€<div class="encoded" data-price="MGMSKJDFsTcxLjU0NA==">184.4</div>
</div>
What I need to get is the value after the
==">
I don't know if there is some protection on the encoded part, but the closest I get is returning this: <div class="encoded" data-price="bzMzlXaZjkxLjUxNA=="></div>
Don't know if it is relevant, but I'm using "html.parser" for the parsing.
PS: I'm not trying to hack anything; this is just a personal project to help me learn.
Edit: if I get no price when parsing the text, can the other methods get it without a different parser?
EDIT 2:
This is my code:
page_soup = soup(pagehtml, "html.parser")
pricebox = page_soup.findAll("div",{ "id":"stationList"})
links = pricebox[0].findAll("a",)
det = links[0].findAll("div",)
det[7].text
#or
det[7].get_text()
the result is ''
With Regex
I suppose there are ways to do this using BeautifulSoup; anyway, here is one approach using the third-party regex module.
import regex
# Assume 'source_code' is the source code posted in the question
prices = regex.findall(r'(?<=data\-price[\=\"\w]+\>)[\d\.]+(?=\<\/div)', source_code)
# ['151.4', '184.4']
# or
[float(p) for p in prices]
# [151.4, 184.4]
Here is a short explanation of the regular expression:
[\d\.]+ is what we are actually capturing: \d matches a digit, \. matches a literal period, and combining them in a character class with + matches one or more digits/periods.
The lookarounds before/after further specify what has to precede/follow a potential match.
(?<=data\-price[\=\"\w]+\>) means any potential match must be preceded by data-price...>, where ... is one or more of =, " or a word character (A-Za-z0-9_).
Finally, (?=\<\/div) means any match must be followed by </div.
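For comparison, here is a minimal sketch of the same extraction with the standard re module, which does not support the variable-width lookbehind used above, so the price is captured with a group instead (assuming source_code holds the HTML from the question):
import re
# Capture the digits/periods between the closing quote of data-price and </div
prices = re.findall(r'data-price="[^"]*">([\d.]+)</div', source_code)
# ['151.4', '184.4']
[float(p) for p in prices]
# [151.4, 184.4]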
With lxml
Here is an approach using the module lxml
import lxml.html
tree = lxml.html.fromstring(source_code)
[float(p.text_content()) for p in tree.find_class('encoded')]
# [151.4, 184.4]
"html.parser" works fine as a parser for your problem. As you are able to get this <div class="encoded" data-price="bzMzlXaZjkxLjUxNA=="></div> on your own that means you only need prices now and for that you can use get_text() which is an inbuilt function present in BeautifulSoup.
This function returns whatever the text is in between the tags.
Syntax of get_text() :tag_name.get_text()
Solution to your problem :
from bs4 import BeautifulSoup
data ='''
<div class="prices">
<div class="price">
<div class="P01 tooltip"><span>Product 1</span></div>€<div class="encoded" data-price="bzMzlXaZjkxLjUxNA==">151.4</div>
</div>
<div class="price">
<div class="Po1plus tooltip"><span>Product 1 +</span></div>€<div class="encoded" data-price="MGMSKJDFsTcxLjU0NA==">184.4</div>
</div>
'''
soup = BeautifulSoup(data,"html.parser")
# Searching for all the div tags with class:encoded
a = soup.findAll('div', {'class': 'encoded'})
# Using list comprehension to get the price out of the tags
prices = [price.get_text() for price in a]
print(prices)
Output
['151.4', '184.4']
Hope you get what you are looking for. :)
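As a small follow-up, if you need the prices as numbers rather than strings, a one-line conversion of the strings already collected above should do (assuming every matched tag contains a plain decimal number):
# Convert the extracted strings to floats
numeric_prices = [float(p) for p in prices]
# [151.4, 184.4]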

Scrape the text of one span among multiple same-named span elements within a p tag on a website

I want to scrape the text from the span tag within multiple span tags with similar names. Using python, beautifulsoup to parse the website.
I just cannot uniquely identify that specific gross-amount span element.
The span tag has name="nv" and a data-value, but the other one has those too. I just want to extract the gross numerical dollar figure in millions.
Please advise.
this is the structure :
<p class="sort-num_votes-visible">
<span class="text-muted">Votes:</span>
<span name="nv" data-value="93122">93,122</span>
<span class="ghost">|</span>
<span class="text-muted">Gross:</span>
<span name="nv" data-value="69,645,701">$69.65M</span>
</p>
I want the text from the second name="nv" span, the one that follows the span with class="text-muted" containing Gross:.
What you can do is find the <span> tag that has the text 'Gross:'. Then, once it finds that tag, tell it to go find the next <span> tag (which is the value amount), and get that text.
from bs4 import BeautifulSoup as BS
html = '''<p class="sort-num_votes-visible">
<span class="text-muted">Votes:</span>
<span name="nv" data-value="93122">93,122</span>
<span class="ghost">|</span>
<span class="text-muted">Gross:</span>
<span name="nv" data-value="69,645,701">$69.65M</span>
</p>'''
soup = BS(html, 'html.parser')
gross_value = soup.find('span', text='Gross:').find_next('span').text
Output:
print (gross_value)
$69.65M
or if you want to get the data-value, change that last line to:
gross_value = soup.find('span', text='Gross:').find_next('span')['data-value']
Output:
print (gross_value)
69,645,701
And finally, if you need those values as an integer instead of a string, so you can aggregate in some way later:
gross_value = int(soup.find('span', text='Gross:').find_next('span')['data-value'].replace(',', ''))
Output:
print (gross_value)
69645701
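And if what you actually want is the displayed dollar figure in millions rather than the raw data-value, a small sketch (assuming the visible text always has the form "$69.65M"):
gross_text = soup.find('span', text='Gross:').find_next('span').text
# Strip the leading '$' and trailing 'M' to get the value in millions
gross_millions = float(gross_text.strip('$M'))
# 69.65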

How to scrape a string from the div tag using Selenium and Python?

I have source code like the code below. I'm trying to scrape out the '11 tigers' string. I'm new to XPath; can anyone suggest how to get it using Selenium or Beautiful Soup? I'm thinking driver.find_element_by_xpath or soup.find_all.
source:
<div class="count-box fixed_when_handheld s-vgLeft0_5 s-vgPullBottom1 s-vgRight0_5 u-colorGray6 u-fontSize18 u-fontWeight200" style="display: block;">
<div class="label-container u-floatLeft">11 tigers</div>
<div class="u-floatRight">
<div class="hide_when_tablet hide_when_desktop s-vgLeft0_5 s-vgRight0_5 u-textAlignCenter">
<div class="js-show-handheld-filters c-button c-button--md c-button--blue s-vgRight1">
Filter
</div>
<div class="js-save-handheld-filters c-button c-button--md c-button--transparent">
Save
</div>
</div>
</div>
<div class="cb"></div>
</div>
You can use the same .count-box .label-container CSS selector for both BS and Selenium.
BS:
page = BeautifulSoup(yourhtml, "html.parser")
# if you need first one
label = page.select_one(".count-box .label-container").text
# if you need all
labels = page.select(".count-box .label-container")
for label in labels:
    print(label.text)
Selenium:
labels = driver.find_elements_by_css_selector(".count-box .label-container")
for label in labels:
    print(label.text)
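If the count box is rendered dynamically, a hedged sketch with an explicit wait before reading the label (assuming the standard Selenium expected_conditions helpers; adjust the timeout to your page):
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Wait until the label is visible, then read its text
label = WebDriverWait(driver, 10).until(
    EC.visibility_of_element_located((By.CSS_SELECTOR, ".count-box .label-container"))
).text
print(label)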
Variant of the answer given by Sers.
page = BeautifulSoup(html_text, "lxml")
# first one
label = page.find('div', {'class': 'label-container'}).text
# for all
labels = page.find_all('div', {'class': 'label-container'})
for label in labels:
    print(label.text)
Use the lxml parser as it's faster. You need to install it explicitly via pip install lxml.
To extract the text 11 tigers you can use either of the following solutions:
Using css_selector:
my_text = driver.find_element_by_css_selector("div.count-box>div.label-container.u-floatLeft").get_attribute("innerHTML")
Using xpath:
my_text = driver.find_element_by_xpath("//div[contains(@class, 'count-box')]/div[@class='label-container u-floatLeft']").get_attribute("innerHTML")
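If you only need the number, a small follow-up (assuming the label text always starts with the count):
# "11 tigers" -> 11
tiger_count = int(my_text.split()[0])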

Python Selenium Firefox <input type="date" /> select with .click()

I know there are lots of these out there, but none has proved helpful yet. I cannot find the popup calendar no matter how hard I try. Any suggestions?
html:
<label for="id_start_date">Start Date</label>
<input class="form-control" id="id_start_date" type="date" value="" />
test file:
start_date = self.browser.find_element_by_id('id_start_date')
start_date_label = self.browser.find_element_by_xpath(
    "//label[@for='id_start_date']"
)
self.assertEqual('Start Date', start_date_label.text)
# Today's Date is preloaded into the field. Jeff clicks on it
# and it pops open a little calendar to choose a date.
start_date.click()
"""
What is supposed to go here so I can find and select my desired date?
"""
It does show up on my page when I manually check it. I am at a complete loss.
You don't have to click on the calendar. You can set the value directly because the element is an input.
Pay attention to the date format; assuming it is MM/dd/yyyy, the code below works.
start_date = self.browser.find_element_by_id('id_start_date')
start_date.clear() # clear any value that was in the field before (if you don't clear, the new string will be appended)
start_date.send_keys("08/17/2018")
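If send_keys is ignored by the native date widget, a hedged alternative is to set the value from script in the HTML5 value format yyyy-MM-dd (this sketch assumes the form accepts a value set this way):
# Set the <input type="date"> value directly; the value attribute always uses ISO format
self.browser.execute_script(
    "arguments[0].value = arguments[1];", start_date, "2018-08-17"
)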

How to replace selected string from content editable div?

I'm trying to remove chapter titles from a contenteditable="true" div tag using Python and selenium-webdriver. First I search for the chapter title, which is usually on the first line, then I replace it with an empty value and save. But the change is not saved after refreshing the browser, even though I can see the code working. Here is my code:
##getting content editable div tag
input_field = driver.find_element_by_css_selector('.trumbowyg-editor')
### getting innerHTML of content editable div
chapter_html = input_field.get_attribute('innerHTML')
chapter_content = input_field.get_attribute('innerHTML')
if re.search('<\w*>', chapter_html):
    chapter_content = re.split('<\w*>|</\w*>', chapter_html)
    first_chapter = chapter_content[1]
### replacing first_chapter with ''
chapter_replace = chapter_html.replace(first_chapter, '')
### writing back innerHTML without first_chapter string
driver.execute_script("arguments[0].innerHTML = arguments[1];",input_field, chapter_replace)
time.sleep(1)
## click on save button
driver.find_element_by_css_selector('.btn.save-button').click()
How can I handle this? It works when I do it manually (so it probably isn't a site problem or bug)... Please help.
Relevant HTML is following:
<div class="trumbowyg-editor" dir="ltr" contenteditable="true">
<p>Chapter 1</p>
<p> There is some text</p>
<p> There is some text</p>
<p> There is some text</p>
</div>
As per the HTML you have shared, to replace the chapter title with an empty value you have to induce WebDriverWait with the expected_conditions clause set to visibility_of_element_located, and you can use the following block of code:
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

page_number = WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.XPATH, "//div[@class='trumbowyg-editor' and @contenteditable='true']/p[contains(.,'Chapter')]")))
driver.execute_script("arguments[0].innerHTML = '';", page_number)
# or
driver.execute_script("arguments[0].innerText = '';", page_number)
# or
driver.execute_script("arguments[0].textContent = '';", page_number)
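One more point on why the edit may not persist: rich-text editors like Trumbowyg typically keep their own internal model and only save content they have been notified about. A hedged sketch, not confirmed by the question, is to dispatch an input event after rewriting the innerHTML in your original code so the editor registers the change before you click save:
# Rewrite the editor content and notify the editor that it changed
driver.execute_script("""
    arguments[0].innerHTML = arguments[1];
    arguments[0].dispatchEvent(new Event('input', { bubbles: true }));
""", input_field, chapter_replace)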
