Python: Can't extract tbody information from website

Python: Can't extract tbody information from website - python-3.x

I want to extract all links of this website: https://pflegefinder.bkk-dachverband.de/pflegeheime/searchresult.php?required=1&statistics=1&searchdata%5BmaxDistance%5D=0&searchdata%5BcareType%5D=inpatientCare#/tab/general
The information I want are stored in the tbody: page code
Every time I try to extract the data I get no result.
from bs4 import BeautifulSoup
import requests
from requests_html import HTMLSession
url = "https://pflegefinder.bkk-dachverband.de/pflegeheime/searchresult.php?required=1&statistics=1&searchdata%5BmaxDistance%5D=0&searchdata%5BcareType%5D=inpatientCare#complex-searchresult"
session = HTMLSession()
r = session.get(url)
r.html.render()
soup = BeautifulSoup(r.html.html,'html.parser')
print(r.html.search("Details"))
Thank you for your help!

The site uses a backend api to deliver the info, if you look at your browser's Developer Tools - Network - fetch/XHR and refresh the page you'll see the data load via json in a request with a similar url to the one you posted.
You can scrape that data like this, it returns json which is easy enough to parse:
import requests
headers = {
'Referer':'https://pflegefinder.bkk-dachverband.de/pflegeheime/searchresult.php?required=1&statistics=1&searchdata%5BmaxDistance%5D=0&searchdata%5BcareType%5D=inpatientCare',
'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.71 Safari/537.36'
}
for page in range(2):
url = f'https://pflegefinder.bkk-dachverband.de/api/nursing-homes?required=1&statistics=1&maxDistance=0&careType=inpatientCare&limit=20&offset={page*20}'
resp = requests.get(url,headers=headers).json()
print(resp)
The api checks that you have a "Referer" header otherwise you get a 400 response.

Related

How to scrape an article which requires login to view full content using python?

I am trying to scrape an article from The Wall Street Journal and it requires log-in to view the whole content. So, I have written a code like the below using Python Requests:
import requests
from bs4 import BeautifulSoup
import re
import base64
import json
username= <username>
password= <password>
base_url= "https://accounts.wsj.com"
session = requests.Session()
r = session.get("{}/login".format(base_url))
soup = BeautifulSoup(r.text, "html.parser")
credentials_search = re.search("Base64\.decode\('(.*)'", r.text, re.IGNORECASE)
base64_decoded = base64.b64decode(credentials_search.group(1))
credentials = json.loads(base64_decoded)
connection = <connection_name>
r = session.post(
'https://sso.accounts.dowjones.com/usernamepassword/login',
data = {
"username": username,
"password": password,
"connection": connection,
"client_id": credentials["clientID"],
"state": credentials["internalOptions"]["state"],
"nonce": credentials["internalOptions"]["nonce"],
"scope": credentials["internalOptions"]["scope"],
"tenant": "sso",
"response_type": "code",
"protocol": "oauth2",
"redirect_uri": "https://accounts.wsj.com/auth/sso/login"
})
soup = BeautifulSoup(r.text, "html.parser")
login_result = dict([
(t.get("name"), t.get("value"))
for t in soup.find_all('input')
if t.get("name") is not None
])
r = session.post(
'https://sso.accounts.dowjones.com/login/callback',
data = login_result,
headers={"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.4951.67 Safari/537.36"},
)
# article get request
r = session.get(
"https://www.wsj.com/articles/singapore-prime-minister-lee-rejects-claims-he-misused-state-powers-in-family-feud-1499094761",
headers={"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.4951.67 Safari/537.36"}
)
print(r.text)
am able to login through the request but still I am not getting full article to scrape. Can anyone help me with this? Thanks in advance :-)

An easy and reliable solution is using the selenium webdriver.
With selenium you create an automated browser window which opens the website and from there you can make it choose the elements to log in. Then the content loads in as usual, like when you look for it manually on your browser.
Then you can soup that page with BeautifulSoup.
from bs4 import BeautifulSoup
from selenium import webdriver
driver = webdriver.Firefox(executable_path="C:\Program Files (x86)\geckodriver.exe")
# Download the driver for your desired browser and place it in any path
# for Chrome it's: driver = webdriver.Chrome("C:\Program Files (x86)\chromedriver.exe")
# open your website link
driver.get("https://www.your-url.com")
# then soup the page with BS
html = driver.page_source
page_soup = BeautifulSoup(html)
From there you can use the "page_soup" as you usually would.
Any questions? :)

I want to open the first link that appear when i do a search on google

I want to get the first link from the html parser, but I'm getting anything(tried to print).
Also when i inspect the page on browser, the links are under class='r'
But when i print the soup.prettify(), and closely analyse then i find there is no class='r', instead class="BNeawe UPmit AP7Wnd".
Please help, thanks in advance!
import requests
import sys
import bs4
import webbrowser
def open_web(query):
res = requests.get('https://google.com/search?q=' + query)
res.raise_for_status()
soup = bs4.BeautifulSoup(res.text, "html.parser")
link_elements = soup.select('.r a')
link_to_open = min(1, len(link_elements))
for i in range(link_to_open):
webbrowser.open('https://google.com' + link_elements[i].get('href'))
open_web('youtube')

The problem is that google serves different HTML when you don't specify User-Agent in headers. To add User-Agent to your request, put it in the headers= attribute:
import requests
import bs4
def open_web(query):
headers = {'User-Agent':'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:68.0) Gecko/20100101 Firefox/68.0'}
res = requests.get('https://google.com/search?q=' + query, headers=headers)
res.raise_for_status()
soup = bs4.BeautifulSoup(res.text, "html.parser")
link_elements = soup.select('.r a')
print(link_elements)
open_web('youtube')
Prints:
[<a href="https://www.youtube.com/?gl=EE&hl=et" onmousedown="return rwt(this,'','','','1','AOvVaw2lWnw7oOhIzXdoFGYhvwv_','','2ahUKEwjove3h7onkAhXmkYsKHbWPAUYQFjAAegQIBhAC','','',event)"><h3 class="LC20lb">
... and so on.

You received a completely different HTML with different elements and selectors thus the output is empty. The reason why Google blocks your request is because default requests user-agent is python-requests and Google understands it and blocks it. Check what's your user-agent.
User-agent let identifies the browser, its version number, and its host operating system that representing a person (browser) in a Web context that lets servers and network peers identify if it's a bot or not.
Sometimes you can receive a different HTML, with different selectors.
You can pass URL params as a dict() which is more readable and requests do everything for you automatically (same goes for adding user-agent into headers):
params = {
"q": "My query goes here"
}
requests.get("YOUR_URL", params=params)
If you want to get the very first link then use select_one() instead.
Code and full example in the online IDE:
from bs4 import BeautifulSoup
import requests
headers = {
'User-agent':
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}
params = {
"q": "My query goes here"
}
html = requests.get('https://www.google.com/search', headers=headers, params=params)
soup = BeautifulSoup(html.text, 'lxml')
link = soup.select_one('.yuRUbf a')['href']
print(link)
# https://dev.mysql.com/doc/refman/8.0/en/entering-queries.html
Alternatively, you can do the same thing by using Google Organic Results API from SerpApi. It's a paid API with a free plan.
The difference in your case is that you only need to extract the data you want from JSON string rather than figuring out how to extract, maintain or bypass blocks from Google.
Code to integrate:
import os
from serpapi import GoogleSearch
params = {
"engine": "google",
"q": "My query goes here",
"hl": "en",
"api_key": os.getenv("API_KEY"),
}
search = GoogleSearch(params)
results = search.get_dict()
# [0] means first index of search results
link = results['organic_results'][0]['link']
# https://dev.mysql.com/doc/refman/8.0/en/entering-queries.html
Disclaimer, I work for SerpApi.

Scraping store locations from a complex website

I am new to web scraping and I need to scrape store locations from the given website. The information I need includes location title, address, city, state, country, phone. So far I have extracted the webpage but I don't know how to go forward
url = 'https://www.rebounderz.com/all-locations/'
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5)
AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102
Safari/537.36'}
response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.text, 'html.parser')
Please guide me how can I get the required information. I have searched other answers and looked into tutorials too but the structure of this website has made me confused.

import urllib
from urllib.request import urlopen
from bs4 import BeautifulSoup
import ssl
url = "https://www.rebounderz.com/all-locations/"
context = ssl._create_unverified_context()
headers = {}
headers['User-Agent'] = 'Mozilla/5.0 (X11; Linux i686) AppleWebKit/537.17 (KHTML, like Gecko) Chrome/24.0.1312.27 Safari/537.17'
request = urllib.request.Request(url, headers=headers)
html = urlopen(request, context=context)
soup = BeautifulSoup(html, 'lxml')
divs = soup.find_all('div', {"class":"size1of3"})
for div in divs:
print(div.find("h5").get_text())
print(div.find("p").get_text())

404 error using urllib but URL works fine in browser AND the full web page is returned in the error

I'm trying to open a web page in python using urllib ( to scrape it ). The web page looks fine in a browser but I get a 404 error with urlopen. However, if look at the text returned with the error, it actually has the full web page in it.
from urllib.request import Request, urlopen
from urllib.error import HTTPError, URLError
from bs4 import BeautifulSoup
try:
html = urlopen('http://www.enduroworldseries.com/series-rankings')
except HTTPError as e:
err = e.read()
code = e.getcode()
print(err)
When I run the code, the exception is caught and 'code' is '404'. The err variable has the complete html that shows up if you look at the page in a browser. So why am I getting an error?
Not sure if it matters but other pages on the same domain load fine with urlopen.

I found a solution without knowing what the initial problem was. Simply replaced urllib with the requests library.
req = Request('http://www.enduroworldseries.com/series-rankings', headers={'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36'})
html = urlopen(req)
bsObj = BeautifulSoup(html, "html.parser")
Became
response = requests.get('http://www.enduroworldseries.com/series-rankings', {'User-Agent': 'Mozilla/5.0'})
bsObj = BeautifulSoup(response.content, "html.parser")

Flurry scraping using python3 requests.Session()

This seems really straight forward, but for some reason this isn't connecting to flurry correctly and I unable to scrape the data.
loginurl = "https://dev.flurry.com/secure/loginPage.do"
csvurl = "https://dev.flurry.com/eventdata"
session = requests.Session()
login = session.post(loginurl, data={'loginEmail': 'user', 'loginPassword': 'pass'})
data = session.get(csvurl)
Every time I try to use this, I get redirected back to the login screen (loginurl) without fetching the new data. Has anyone been able to connect to flurry like this successfully before?
Any and all help would be greatly appreciated, thanks.

There are two more form fields to be populated struts.token.name and the value from struts.token.name i.e token, you also have to post to loginAction.do:
You can do an initial get and parse the values using bs4 then post the data:
from bs4 import BeautifulSoup
import requests
loginurl = "https://dev.flurry.com/secure/loginAction.do"
csvurl = "https://dev.flurry.com/eventdata"#
data = {'loginEmail': 'user', 'loginPassword': 'pass'}
with requests.Session() as session:
session.headers.update({
"User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.82 Safari/537.36"})
soup = BeautifulSoup(session.get(loginurl).content)
name = soup.select_one("input[name=struts.token.name]")["value"]
data["struts.token.name"] = name
data[name] = soup.select_one("input[name={}]".format(name))["value"]
login = session.post(loginurl, data=data)

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string

Python: Can't extract tbody information from website - python-3.x

Related

How to scrape an article which requires login to view full content using python?

I want to open the first link that appear when i do a search on google

Scraping store locations from a complex website

404 error using urllib but URL works fine in browser AND the full web page is returned in the error

Flurry scraping using python3 requests.Session()

Categories

Resources