Scraping TripAdvisor search query results - rvest

I'm trying to scrape the number of times a particular search term (in this case "sunset") is referenced in TripAdvisor reviews at different sights/locations, but I'm getting an HTTP 403 error.
Is there a fix, or is this TripAdvisor not wanting me to scrape this page?
install.packages("rvest")
install.packages("xml2")
library(rvest)
library(xml2)

place <- xml2::read_html("https://www.tripadvisor.com/Search?q=sunset&geo=186216") %>%
  html_nodes(".result-title") %>%
  html_text()
place

sunsets <- xml2::read_html("https://www.tripadvisor.com/Search?q=sunset&geo=186216") %>%
  html_nodes(".review-mention-block") %>%
  html_text()
sunsets
Thanks!
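For what it's worth, a 403 here usually means the server rejected the request based on its headers rather than the URL being wrong; TripAdvisor is known to block non-browser clients, and its terms restrict scraping, so the request may be refused regardless. As a hedged sketch (in Python for illustration; in R the same idea is passing httr::user_agent() to the request), sending a browser-like User-Agent sometimes gets a normal response:

```python
from urllib.request import Request, urlopen

# Hypothetical browser-like headers; TripAdvisor may still refuse
# automated access server-side even with these set.
URL = "https://www.tripadvisor.com/Search?q=sunset&geo=186216"
req = Request(URL, headers={
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                  "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36",
    "Accept-Language": "en-US,en;q=0.9",
})
# Uncomment to try the request; it may still raise HTTPError 403:
# html = urlopen(req, timeout=10).read()
```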

Related

Mutate: character string is not in a standard unambiguous format

I have a column titled started_at which is formatted like this: 4/12/2021 18:25. When I try to run my code, I get the following message:
Error in `mutate()`:
! Problem while computing `day_of_week = wday(start_time, label = TRUE)`.
Caused by error in `as.POSIXlt.character()`:
! character string is not in a standard unambiguous format
This is the code I am trying to run:
cyclistic_trips_merge_v2 %>%
  mutate(day_of_week = wday(start_time, label = TRUE)) %>% # creates weekday field using wday()
  group_by(usertype, day_of_week) %>% # groups by usertype and weekday
  summarise(number_of_rides = n(), # calculates the number of rides
            average_duration = mean(ride_length)) %>% # calculates the average duration
  arrange(usertype, day_of_week)
I am new to R, and this is a capstone project. I usually get stuck and then figure my way around things with web searches, but right now I am stumped. The date/time mentioned above is stored as a string, and I believe that is the problem, but what do I need to convert it to, and how? Can anyone please help? Losing my mind.
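The error means as.POSIXlt() cannot guess the format of "4/12/2021 18:25"; the usual R fix is to parse it with an explicit format first, e.g. lubridate::mdy_hm(started_at), and then call wday() on the result. The same explicit-format idea, sketched in Python for illustration:

```python
from datetime import datetime

# "4/12/2021 18:25" is month/day/year with 24-hour time; an explicit
# format string removes the ambiguity the error complains about.
parsed = datetime.strptime("4/12/2021 18:25", "%m/%d/%Y %H:%M")
print(parsed.isoformat())      # 2021-04-12T18:25:00
print(parsed.strftime("%A"))   # Monday -- the weekday label wday() would produce
```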

How to parse only the second span tag in an HTML document using python bs4

I want to parse only one span tag in my HTML document. There are three sibling span tags without any class or id. I am targeting the second one only, using BeautifulSoup 4.
Given the following html document:
<div class="adress">
<span>35456 street</span>
<span>city, state</span>
<span>zipcode</span>
</div>
I tried:
for spn in soup.findAll('span'):
    data = spn[1].text
but it didn't work. The expected result is the text of the second span stored in a variable:
data = "city, state"
I would also like to get the first and second span concatenated into one variable.
You are trying to index an individual span (a Tag instance). Get rid of the for loop and index the findAll result instead, i.e.
>>> soup.findAll('span')[1]
<span>city, state</span>
You can get the first and second tags together using:
>>> soup.findAll('span')[:2]
[<span>35456 street</span>, <span>city, state</span>]
or, as a string:
>>> "".join([str(tag) for tag in soup.findAll('span')[:2]])
'<span>35456 street</span><span>city, state</span>'
Another option:
data = soup.select_one('div > span:nth-of-type(2)').get_text(strip=True)
print(data)
Output:
city, state
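Putting the accepted approach together as a runnable sketch over the HTML from the question (the first-and-second concatenation is joined with a comma here for readability):

```python
from bs4 import BeautifulSoup

html = """
<div class="adress">
<span>35456 street</span>
<span>city, state</span>
<span>zipcode</span>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

spans = soup.find_all("span")   # the three sibling <span> tags
data = spans[1].get_text()      # index the ResultSet, not an individual Tag
street_and_city = ", ".join(s.get_text() for s in spans[:2])

print(data)              # city, state
print(street_and_city)   # 35456 street, city, state
```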

How to remove <strong> tags from the result

I am trying to get the list of college names from an online dataset table (search result). The college names are between the <strong> and </strong> tags, and I am not sure how to remove those from the result.
geo_table = soup.find('table',{'id':'ctl00_cphCollegeNavBody_ucResultsMain_tblResults'})
Colleges=geo_table.findAll('strong')
Colleges
I am thinking that the problem is I am extracting the wrong part, because <strong> just makes the line bold. Where shall I find the college name?
This is a sample output:
href="?s=IL+MA+PA&p=14.0802+14.0801+14.3901&l=91+92+93+94&id=211440"
To fetch the href value, find_all the <a> tags and iterate over them, reading the href attribute from each; to fetch the college name, find the <strong> tag inside each link and get its text value.
geo_table = soup.find('table', {'id': 'ctl00_cphCollegeNavBody_ucResultsMain_tblResults'})
Colleges = geo_table.findAll('a')
for college in Colleges:
    print('href : ' + college['href'])
    print('college Name : ' + college.find('strong').text)
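A self-contained version of the same idea; note the table markup below is invented to mimic the results table, so the real page's HTML may differ:

```python
from bs4 import BeautifulSoup

# Hypothetical snippet shaped like the College Navigator results table
html = """
<table id="ctl00_cphCollegeNavBody_ucResultsMain_tblResults">
  <tr><td><a href="?s=IL&amp;id=211440"><strong>Example College</strong></a></td></tr>
  <tr><td><a href="?s=PA&amp;id=123456"><strong>Sample University</strong></a></td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")
geo_table = soup.find("table", {"id": "ctl00_cphCollegeNavBody_ucResultsMain_tblResults"})

# .text on the <strong> Tag drops the tags and leaves only the name
names = [strong.text for strong in geo_table.find_all("strong")]
links = [a["href"] for a in geo_table.find_all("a")]

print(names)   # ['Example College', 'Sample University']
```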

Querying <div class="name"> in Python

I am trying to follow the guide posted here: https://medium.freecodecamp.org/how-to-scrape-websites-with-python-and-beautifulsoup-5946935d93fe
I am at this point, where I am supposed to get the name of presumably the stock.
# Take out the div of name and get its value
name_box = soup.find('h1', attrs={'class': 'name'})
I suspect I will also have trouble when querying the price. Do I have to replace 'price' with 'priceText__1853e8a5' as found in the html?
# get the index price
price_box = soup.find('div', attrs={'class': 'price'})
Thanks, this would be a massive help.
If you replace price with priceText__1853e8a5 you will get your result, but I suspect that the class name is dynamically generated (note the hash-like suffix), so you need something more robust.
You can target tags in BeautifulSoup with CSS selectors (via the select()/select_one() methods). This example targets all <span> tags whose class attribute begins with priceText (the ^= attribute-prefix operator).
from bs4 import BeautifulSoup
import requests
r = requests.get('https://www.bloomberg.com/quote/SPX:IND')
soup = BeautifulSoup(r.text, 'lxml')
print(soup.select_one('span[class^="priceText"]').text)
This prints:
2,813.36
You have several options to do that.
Getting the value by an appropriate XPath:
//span[contains(@class, 'priceText__')]
Writing a regex to find the exact element (requires import re):
price_tag = soup.find_all('span', {'class': re.compile(r'priceText__')})
A lazy .*? tail on the pattern is unnecessary here, since BeautifulSoup applies the regex with search() rather than requiring a full match.
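Both options can be sanity-checked offline against a static snippet (the class value below is copied from the question):

```python
import re
from bs4 import BeautifulSoup

html = '<div><span class="priceText__1853e8a5">2,813.36</span></div>'
soup = BeautifulSoup(html, "html.parser")

# CSS attribute-prefix selector
css_hit = soup.select_one('span[class^="priceText"]').text

# Equivalent regex match on the class attribute
re_hit = soup.find("span", {"class": re.compile(r"^priceText")}).text

print(css_hit, re_hit)   # 2,813.36 2,813.36
```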

Extract text between dynamic HTML tags using python soup

I have a requirement where I need to extract the text between HTML tags. I used BeautifulSoup to extract the data and stored the text in a variable for further processing. Later I found that the text I need to extract comes in two different tags; however, please note that I need to extract the text into the same variable. My earlier code and a sample of the HTML are provided below. Please help me get my end result, i.e. the expected output.
Sample HTML text:
<DIV CLASS="c0"><P CLASS="c1"><SPAN CLASS="c2">1 of 80 DOCUMENTS</SPAN></P>
<DIV CLASS="c0"><BR><P CLASS="c1"><SPAN CLASS="c2">Financial Times (London, England)</SPAN></P>
<DIV CLASS="c0"><BR><P CLASS="c1"><SPAN CLASS="c2">Copyright 2015 The Financial Times Ltd.<BR>All Rights Reserved<BR>Please do not cut and paste FT articles and redistribute by email or post to the web.</SPAN></P>
<DIV CLASS="c0"><P CLASS="c1"><SPAN CLASS="c2">80 of 80 DOCUMENTS</SPAN></P>
</DIV>
<BR><DIV CLASS="c3"><P CLASS="c1"><SPAN CLASS="c2">Financial Times (London,England)</SPAN></P>
</DIV>
<DIV CLASS="c3"><P CLASS="c1"><SPAN CLASS="c2">Copyright 1990 The Financial Times Limited</SPAN></P>
</DIV>
From the above HTML, I need to store the document markers (1 of 80 DOCUMENTS, 80 of 80 DOCUMENTS) in a single variable; the other text follows a similar approach. I wrote this code for div.c0:
soup = BeautifulSoup(response, 'html.parser')
docpublicationcpyright = soup.select('div.c0')
list1 = [b.text.strip() for b in docpublicationcpyright]
doccountvalues = list1[0:len(list1):3]
publicationvalues = list1[1:len(list1):3]
copyrightvalues = list1[2:len(list1):3]
documentcount = doccountvalues
publicationpaper = publicationvalues
Any help would be greatly appreciated.
The given sample HTML is not properly structured; for example, the closing tag is missing for the first DIV element. Even for this type of HTML, you can scrape the required data using regular expressions.
I wrote sample code considering only the HTML posted in the question, and it is able to extract all three required fields:
import re
from bs4 import BeautifulSoup

soup = BeautifulSoup(response, 'html.parser')
documentElements = soup.find_all('span', text=re.compile(r'of [0-9]+ DOCUMENTS'))
documentCountList = []
publicationPaperList = []
documentPublicationCopyrightList = []
for elem in documentElements:
    documentCountList.append(elem.get_text().strip())
    nextDiv = elem.parent.find_next_sibling('div')
    if nextDiv:
        publicationPaperList.append(nextDiv.find('span').get_text().strip())
        documentPublicationCopyrightList.append(nextDiv.find_all('span')[1].get_text())
    else:
        publicationPaperList.append(elem.parent.parent.find_next('div').get_text().strip())
        documentPublicationCopyrightList.append(elem.parent.parent.find_next('div').find_next('div').get_text().strip())
print(documentCountList)
print(publicationPaperList)
print(documentPublicationCopyrightList)
The output looks like this:
[u'1 of 80 DOCUMENTS', u'80 of 80 DOCUMENTS']
[u'Financial Times (London, England)', u'Financial Times (London,England)']
[u'Copyright 2015 The Financial Times Ltd.All Rights ReservedPlease do not cut and paste FT articles and redistribute by email or post to the web.', u'Copyright 1990 The Financial Times Limited']
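The same span-matching idea in a compact, self-contained form, over a trimmed version of the sample HTML (Python 3, so the output has no u'' prefixes):

```python
import re
from bs4 import BeautifulSoup

html = """
<DIV CLASS="c0"><P CLASS="c1"><SPAN CLASS="c2">1 of 80 DOCUMENTS</SPAN></P>
<DIV CLASS="c0"><P CLASS="c1"><SPAN CLASS="c2">Financial Times (London, England)</SPAN></P>
<DIV CLASS="c0"><P CLASS="c1"><SPAN CLASS="c2">80 of 80 DOCUMENTS</SPAN></P>
"""
soup = BeautifulSoup(html, "html.parser")

# string= filters each <span> by its text content
counts = [s.get_text(strip=True)
          for s in soup.find_all("span", string=re.compile(r"of \d+ DOCUMENTS"))]
print(counts)   # ['1 of 80 DOCUMENTS', '80 of 80 DOCUMENTS']
```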
