rvest: reading a page that is redirected to another URL

I would like to get a table in a div with the id Analysis from a website. The website I'm trying to access is https://www.finanzen.net/aktien/credit_suisse-aktie. When I load the site manually, it redirects to https://www.finanzen.ch/aktien/credit_suisse-aktie?countryredirect=https%3A%2F%2Fwww.finanzen.net%2Faktien%2Fcredit_suisse-aktie. If I try reading the first URL with read_html, the table I'm looking for is not available and the code results in an error:
require(rvest)
url1 <- "https://www.finanzen.net/aktien/credit_suisse-aktie"
test <- read_html(url1)
test %>% html_element("#Analysis") %>% html_table()
If I try it with the second URL, it works:
url2 <- "https://www.finanzen.ch/aktien/credit_suisse-aktie?countryredirect=https%3A%2F%2Fwww.finanzen.net%2Faktien%2Fcredit_suisse-aktie"
test <- read_html(url2)
test %>% html_element("#Analysis") %>% html_table()
So my question is: is there a way to tell read_html to follow the redirect from the first URL and read only the page it lands on?
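For comparison, a minimal sketch in Python (the language used in the other questions on this page), assuming the redirect is an ordinary HTTP redirect: requests follows redirects by default and exposes the final address in r.url, which could then be fed to read_html. If the redirect is performed in JavaScript rather than via HTTP headers, this will print the original URL unchanged. In R, httr's GET() similarly follows HTTP redirects and reports the final URL on its response object.

import requests

url1 = "https://www.finanzen.net/aktien/credit_suisse-aktie"
r = requests.get(url1, allow_redirects=True)
print(r.url)  # the URL after all redirects, presumably the finanzen.ch address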

Related

Scraping from website list returns a null result based on XPath

So I'm trying to scrape the job listings off this site: https://www.dsdambuster.com/careers
I have the following code:
import requests
from lxml import html

url = "https://www.dsdambuster.com/careers"
page = requests.get(url, verify=False)
tree = html.fromstring(page.content)
# Absolute XPath down to the job title nodes
path = '/html/body/div[1]/section/div/div/div[2]/div[1]/div/div[2]/div/div[9]/div[1]/div[3]/div[*]/div[1]/a[*]/div/div[1]/div'
jobs = tree.xpath(path)
for job in jobs:
    title = job.text
    print(title)
Not too sure why it wouldn't work...
I see 2 issues here:
You are using a very bad XPath. Absolute paths like this are extremely fragile and unreliable: they break as soon as the page layout changes.
Instead of
'/html/body/div[1]/section/div/div/div[2]/div[1]/div/div[2]/div/div[9]/div[1]/div[3]/div[*]/div[1]/a[*]/div/div[1]/div'
Please use
'//div[@class="vf-vacancy-title"]'
You are possibly missing a wait / delay.
I'm not familiar with the library you are using here, but with Selenium, which I am familiar with, you would need to wait for the elements to be completely loaded before extracting their text contents.
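A minimal Selenium sketch combining both suggestions (the Chrome driver and the 10-second timeout are assumptions; the class name comes from the XPath above):

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get("https://www.dsdambuster.com/careers")

# Wait up to 10 seconds for the vacancy titles to appear in the DOM
titles = WebDriverWait(driver, 10).until(
    EC.presence_of_all_elements_located((By.XPATH, '//div[@class="vf-vacancy-title"]'))
)
for title in titles:
    print(title.text)

driver.quit()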

Automatically generate Google search query

I am trying to parse HTML data of certain patents to gather information, using Python 3.7 and bs4.
My problem, simplified:
Given this URL
https://patents.google.com/patent/X/en?oq=Y
Where:
X = automatically generated string by Google
Y = My user input (a patent number)
and usually: X == Y (some patent number)
I need to get the value of X.
A more detailed description of my problem:
For 90% of my queries, there is no problem, as I can simply parse using the following code:
import requests
from bs4 import BeautifulSoup

patent_number = "EP1000000B1"
patent_url = "https://patents.google.com/patent/" + patent_number + "/en?oq=" + patent_number
r = requests.get(patent_url)
response = r.content
soup = BeautifulSoup(response, "html.parser")
However, sometimes the query structure varies, for instance:
I try to search for patent number WO198700753A1 using the code above, but I get a 404 error, because the URL
https://patents.google.com/patent/WO198700753A1/en?oq=WO198700753A1
does not exist.
This part does not seem to be relevant:
en?oq=" + patent_number
but the first part is.
Searching Google Patents by hand reveals that Google automatically redirects my query from WO198700753A1 to WO1987000753A1 (another 0 added).
Is there any way to automatically generate my URL (the part in the middle), so my program will always find results?
Thanks for your help ;)
At the moment the structure of the link has changed and the oq parameter is not needed. The patent number cannot be randomly generated, because each patent has its own unique number; if you generate a patent number that does not exist, then that page does not exist either.
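Since the question notes that Google itself redirects WO198700753A1 to WO1987000753A1, one option worth sketching: let requests follow the redirect and read the canonical number X back out of the final URL. This assumes the redirect is issued as a plain HTTP redirect, which may not hold for every patent number.

import requests

patent_number = "WO198700753A1"  # the user input Y
r = requests.get("https://patents.google.com/patent/" + patent_number + "/en",
                 allow_redirects=True)
r.raise_for_status()

# r.url holds the URL after any redirects, e.g. .../patent/WO1987000753A1/en
canonical = r.url.split("/patent/")[1].split("/")[0]
print(canonical)  # the automatically generated string X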

How to open "partial" links using Python?

I'm working on a webscraper that opens a webpage, and prints any links within that webpage if the link contains a keyword (I will later open these links for further scraping).
For example, I am using the requests module to open "cnn.com", and then trying to parse out all href/links within that webpage. Then, if any of the links contain a specific word (such as "china"), Python should print that link.
I could simply open the main page using requests, save all hrefs into a list (links), and then use:
links = [...]
keyword = "china"
for link in links:
    if keyword in link:
        print(link)
However, the problem with this method is that the links I originally parsed out aren't full links. For example, all links on CNBC's webpage are structured like this:
href="https://www.cnbc.com/2019/08/11/how-recession-affects-tech-industry.html"
But for CNN's page, they're written like this (not full links... they're missing the part that comes before the "/"):
href="/2019/08/10/europe/luxembourg-france-amsterdam-tornado-intl/index.html"
This is a problem because I'm writing more script to automatically open these links to parse them. But Python can't open
"/2019/08/10/europe/luxembourg-france-amsterdam-tornado-intl/index.html"
because it isn't a full link.
So, what is a robust solution to this (something that works for other sites too, not just CNN)?
EDIT: I know the links I wrote as an example in this post don't contain the word "China", but these are just examples.
Try using the urljoin function from the urllib.parse package. It takes two parameters: the first is the URL of the page you're currently parsing, which serves as the base for relative links; the second is the link you found. If the link you found starts with http:// or https://, it'll return just that link; otherwise it will resolve the URL relative to what you passed as the first parameter.
So for example:
#!/usr/bin/env python3
from urllib.parse import urljoin

print(
    urljoin(
        "https://www.cnbc.com/",
        "/2019/08/10/europe/luxembourg-france-amsterdam-tornado-intl/index.html"
    )
)
# prints "https://www.cnbc.com/2019/08/10/europe/luxembourg-france-amsterdam-tornado-intl/index.html"

print(
    urljoin(
        "https://www.cnbc.com/",
        "http://some-other.website/"
    )
)
# prints "http://some-other.website/"
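Putting that together with the keyword filter from the question, a sketch of the whole loop (the page and keyword are just the examples from above):

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

base_url = "https://www.cnn.com/"
keyword = "china"

page = requests.get(base_url)
soup = BeautifulSoup(page.content, "html.parser")

# Resolve every href against the page URL, then filter by keyword
for a in soup.find_all("a", href=True):
    full_link = urljoin(base_url, a["href"])
    if keyword in full_link:
        print(full_link)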

Newbie webscraping with python3

This is my first attempt at using web scraping in Python to extract some links from a webpage.
This is the webpage I am interested in getting some data from:
http://www.hotstar.com/tv/bhojo-gobindo/14172/seasons/season-5
I am interested in extracting all instances of the following from the above webpage:
href="/tv/bhojo-gobindo/14172/gobinda-is-in-a-fix/1000196352"
I have written the following regex to extract all matches of the above type of links:
r"href=\"(\/tv\/bhojo-gobindo\/14172\/.*\/\d{10})\""
Here is the quick code I have written to try to extract all the regex matches:
#!/usr/bin/python3
import re
import requests
url = "http://www.hotstar.com/tv/bhojo-gobindo/14172/seasons/season-5"
page = requests.get(url)
l = re.findall(r'href=\"(\/tv\/bhojo-gobindo\/14172\/.*\/\d{10})\"', page.text)
print(l)
When I run the above code I get the following output:
./links2.py
[]
When I inspect the webpage using the developer tools in the browser, I can see these links, but when I try to extract the text I am interested in (href="/tv/bhojo-gobindo/14172/gobinda-is-in-a-fix/1000196352") using the python3 script, I get no matches.
Am I downloading the webpage correctly? How do I make sure I am getting the whole webpage from within my script? I have a feeling I am missing parts of the web page when using requests to get it.
Any help please.
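If the links are injected by JavaScript, requests only ever sees the initial HTML and re.findall has nothing to match. One way to check is to render the page in a real browser first; a sketch of the same script on top of Selenium (assuming a Chrome driver is available):

#!/usr/bin/python3
import re
from selenium import webdriver

url = "http://www.hotstar.com/tv/bhojo-gobindo/14172/seasons/season-5"

driver = webdriver.Chrome()
driver.get(url)
# page_source holds the DOM after JavaScript has run, unlike requests' raw response
l = re.findall(r'href=\"(\/tv\/bhojo-gobindo\/14172\/.*\/\d{10})\"', driver.page_source)
driver.quit()

print(l)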

Making a webcrawler - Won't go into my for-loop

I'm making a webcrawler for fun. Basically what I want to do for example is to crawl this page
http://www.premierleague.com/content/premierleague/en-gb/matchday/results.html?paramClubId=ALL&paramComp_8=true&paramSeasonId=2010&view=.dateSeason
and first of all get all the home teams. Here is my code:
import requests
from bs4 import BeautifulSoup

def urslit_spider(max_years):
    year = 2010
    while year <= max_years:
        url = 'http://www.premierleague.com/content/premierleague/en-gb/matchday/results.html?paramClubId=ALL&paramComp_8=true&paramSeasonId=' + str(year) + '&view=.dateSeason'
        source_code = requests.get(url)
        plain_text = source_code.text
        soup = BeautifulSoup(plain_text, "html.parser")
        # Look for anchor tags carrying the home-team class
        for link in soup.findAll('a', {'class': 'clubs rHome'}):
            lid = link.string
            print(lid)
        year += 1
I've found out that the code won't enter the for loop. It gives me no error, but it doesn't do anything. I tried to search for this but can't find what's wrong.
The link you provided redirected me to the homepage. Tinkering with the URL I get to http://br.premierleague.com/en-gb/matchday/results.html
At this URL I get all the home team names using:
soup.findAll('td', {'class': 'home'})
How can I navigate to the link you provided? Maybe the HTML is different on that page.
Edit: Looks like the content of this website is loaded from this URL: http://br.premierleague.com/pa-services/api/football/lang_en_gb/i18n/competition/fandr/api/gameweek/1.json
Tinkering with the URL parameters, you can find lots of information.
I still can't open the URL you provided (it keeps redirecting me), but on the link I provided I can't extract the table info from the HTML (with BeautifulSoup), because the page gathers its info from that JSON above.
The best thing to do is to use that JSON to get the information you need. My advice is to use the json package from Python.
If you are new to JSON, you can use this website to make the JSON more readable: https://jsonformatter.curiousconcept.com/
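A minimal sketch of that approach (the layout of the JSON payload is an assumption, so it only inspects the top level; check the response in the formatter above first):

import requests

json_url = "http://br.premierleague.com/pa-services/api/football/lang_en_gb/i18n/competition/fandr/api/gameweek/1.json"

data = requests.get(json_url).json()  # parse the response body as JSON

# The payload structure is undocumented here, so start by listing what is in it
print(type(data))
if isinstance(data, dict):
    print(list(data.keys()))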
