I am trying to fetch some data from a webpage using bs4, but I am having trouble opening the link. Here is the code I am using:
from bs4 import BeautifulSoup as soup
from urllib.request import urlopen
my_url = "https://www.transfermarkt.com/wettbewerbe/europa/"
client = urlopen(my_url)
page_html = client.read()
client.close()
The curious thing is that only this particular link fails; others work completely fine. What is so special about this link, and how can I open it?
The problem is the User-Agent header: the site appears to reject the default one that urllib sends. Use urllib.request.Request to set a different header.
from urllib.request import urlopen, Request
from bs4 import BeautifulSoup as soup
my_url = "https://www.transfermarkt.com/wettbewerbe/europa/"
client = Request(my_url, headers={"User-Agent" : "Mozilla/5.0"})
page = urlopen(client).read()
print(page)
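A slightly more defensive version of the same idea, as a minimal sketch: it sets the header and also reports why a request failed instead of raising. The helper names build_request and fetch are made up for this illustration, and "Mozilla/5.0" is simply the value used in the answer above.

```python
from urllib.error import HTTPError, URLError
from urllib.request import Request, urlopen


def build_request(url, user_agent="Mozilla/5.0"):
    """Return a Request that carries a browser-like User-Agent header."""
    return Request(url, headers={"User-Agent": user_agent})


def fetch(url, timeout=10):
    """Return the response body as bytes, or None if the request fails."""
    try:
        with urlopen(build_request(url), timeout=timeout) as response:
            return response.read()
    except HTTPError as err:
        # The server answered, but with an error status (e.g. 403 or 404).
        print(f"Server refused the request: {err.code} {err.reason}")
    except URLError as err:
        # No usable answer at all (DNS failure, refused connection, ...).
        print(f"Could not reach the server: {err.reason}")
    return None
```

Calling fetch("https://www.transfermarkt.com/wettbewerbe/europa/") should then return the page bytes or print the reason for the failure. HTTPError is caught first because it is a subclass of URLError.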
Related
import requests
from bs4 import BeautifulSoup
source = requests.get('https://shop.travisscott.com/password').text
soup = BeautifulSoup(source,'lxml')
Article = soup.find('p')
print(Article.prettify())
I am trying to extract the text inside a span with a given id, but I get blank output.
I have also tried using the text of the parent div element, but that fails too. Can anyone please help me?
Below is my code.
import requests
from bs4 import BeautifulSoup
r = requests.get('https://www.paperplatemakingmachines.com/')
soup = BeautifulSoup(r.text,'lxml')
mob = soup.find('span',{"id":"tollfree"})
print(mob.text)
I want the text inside that span, which is the mobile number.
You'll have to use Selenium, as that text is not present in the initial response, or at least not without searching through the <script> tags.
from bs4 import BeautifulSoup as soup
from selenium import webdriver
import time
# Raw string so the backslashes in the Windows path are not treated as escapes
driver = webdriver.Chrome(r'C:\chromedriver_win32\chromedriver.exe')
url='https://www.paperplatemakingmachines.com/'
driver.get(url)
# It's better to use Selenium's WebDriverWait, but I'm still learning how to use that correctly
time.sleep(5)
soup = BeautifulSoup(driver.page_source, 'html.parser')
driver.close()
mob = soup.find('span',{"id":"tollfree"})
print(mob.text)
The data is actually rendered dynamically by a script. What you need to do is parse the data out of the script:
import requests
import re
from bs4 import BeautifulSoup
r = requests.get('https://www.paperplatemakingmachines.com/')
soup = BeautifulSoup(r.text,'lxml')
script= soup.find('script')
mob = re.search("(?<=pns_no = \")(.*)(?=\";)", script.text).group()
print(mob)
Another way of using a regex to find the number:
import requests
import re
from bs4 import BeautifulSoup as bs
r = requests.get('https://www.paperplatemakingmachines.com/',)
soup = bs(r.content, 'lxml')
r = re.compile(r'var pns_no = "(\d+)"')
data = soup.find('script', text=r).text
script = r.findall(data)[0]
print('+91-' + script)
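Both regex answers can be tried offline before pointing them at the live site. The sketch below runs against a made-up script snippet (SAMPLE_HTML is a stand-in, and the real page source may look different) and uses a pattern that stops at the closing quote, which avoids the over-matching a greedy (.*) can produce when several statements share a line.

```python
import re

# Hypothetical stand-in for the page source; the real markup may differ.
SAMPLE_HTML = '<script>var pns_no = "1234567890"; var other = "x";</script>'

# [^"]* cannot run past the closing quote, unlike a greedy (.*).
PNS_PATTERN = re.compile(r'pns_no\s*=\s*"([^"]*)"')


def extract_pns_no(html):
    """Return the value assigned to pns_no, or None if it is not present."""
    match = PNS_PATTERN.search(html)
    return match.group(1) if match else None


print(extract_pns_no(SAMPLE_HTML))
```

Returning None for a missing variable, rather than calling .group() on a failed search, keeps the script from crashing with an AttributeError if the site changes its markup.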
I am trying to scrape info from a website (program name and program ID), but it returns an empty list.
I am not sure if I am mixing up the syntax, but this is what I have:
soup.find_all('h3', class_='ama__h3')
The website link is https://freida.ama-assn.org/Freida/#/programs?program=residencies&specialtiesToSearch=140
from urllib.request import urlopen
from bs4 import BeautifulSoup as BS
import pandas as pd
from urllib.parse import urlparse, urlsplit
import requests
res = requests.get('https://freida.ama-assn.org/Freida/#/programs?program=residencies&specialtiesToSearch=140')
soup = BS(res.text, 'html5lib')
print(soup.prettify())
soup.find_all('h3', class_='ama__h3')
The problem is that you are parsing with html5lib. For well-formed HTML the choice of parser does not matter much, but for malformed HTML (like this page) html5lib seems to have issues. Use html.parser or lxml instead (html.parser is apparently the safer choice).
This code does what you want:
soup = BeautifulSoup(res.text, 'html.parser')
programs = soup.find_all("a", class_='ama__promo--background')
for program in programs:
    program_name = program.find("h3").text
    # The last <small> holds "<label>: <id>"; keep the part after the colon
    program_id = program.find_all("small")[-1].text.split(': ')[1].strip()
    print(program_name, program_id)
I have written a program that requests a website and reads it, but the request fails with a certificate problem that I could not solve. I have searched and read some articles but found nothing that helps. Thanks.
import bs4 as bs
import urllib.request
sauce = urllib.request.urlopen('https://stackoverflow.com/').read()
soup = bs.BeautifulSoup(sauce, 'lxml')
for s in soup.find_all('a'):
    print(s.string)
Use the requests module.
Demo:
import bs4 as bs
import requests
sauce = requests.get('https://stackoverflow.com/')
soup = bs.BeautifulSoup(sauce.content, 'lxml')
for s in soup.find_all('a'):
    print(s.string)
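If you would rather stay with urllib, one option is to pass an explicit SSL context to urlopen. This is only a sketch: the question does not show the traceback, so it assumes the failure comes from Python not finding a usable CA bundle. The helper names and the optional cafile parameter are made up for this example.

```python
import ssl
from urllib.request import urlopen


def make_verified_context(cafile=None):
    """Build an SSL context that verifies certificates and host names.

    cafile: optional path to a PEM file with CA certificates; None means
    "use the system default trust store".
    """
    return ssl.create_default_context(cafile=cafile)


def read_page(url, cafile=None, timeout=10):
    """Fetch url over HTTPS with certificate verification enabled."""
    context = make_verified_context(cafile)
    with urlopen(url, context=context, timeout=timeout) as response:
        return response.read()
```

read_page('https://stackoverflow.com/') would then stand in for the urlopen(...).read() call in the question. Verification stays switched on here; turning it off would hide the error but also remove the protection HTTPS is meant to give.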
I am using Python 3.4 and BeautifulSoup to grab a zip file from a webpage so that I can download it and unzip it into a folder. I can get BeautifulSoup to print() all the hrefs on the page, but I want one specific href, the one ending in "=Hospital_Revised_Flatfiles.zip". Is that possible? This is what I have so far; it only lists the hrefs from the URL.
The full href of the file is https://data.medicare.gov/views/bg9k-emty/files/Dlx5-ywq01dGnGrU09o_Cole23nv5qWeoYaL-OzSLSU?content_type=application%2Fzip%3B%20charset%3Dbinary&filename=Hospital_Revised_Flatfiles.zip
but the crazy stuff in the middle changes whenever they update the file, and there is no way of knowing what it will change to.
Please let me know if I left anything out of the question that might be helpful. I'm using Python 3.4 and BeautifulSoup4 (bs4).
from bs4 import BeautifulSoup
import requests
import re
url = "https://data.medicare.gov/data/hospital-compare"
r = requests.get(url)
data = r.text
soup = BeautifulSoup(data)
for link in soup.find_all('a'):
    print(link.get('href'))
from bs4 import BeautifulSoup
import requests
url = "https://data.medicare.gov/data/hospital-compare"
r = requests.get(url)
data = r.text
soup = BeautifulSoup(data, "html.parser")
# href=True skips <a> tags without an href (has_key() does not exist in bs4)
for link in soup.find_all('a', href=True):
    if link['href'].endswith("=Hospital_Revised_Flatfiles.zip"):
        print(link['href'])
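An alternative to matching on the end of the string is to read the filename parameter from the query string, which keeps working if the site reorders its parameters. A minimal sketch using only the standard library; find_file_link is a made-up helper, and the hrefs would come from the find_all('a') loop.

```python
from urllib.parse import parse_qs, urljoin, urlsplit


def find_file_link(hrefs, filename, base_url=""):
    """Return the first href whose ?filename= parameter equals `filename`.

    Relative links are resolved against base_url. Returns None if no
    link matches.
    """
    for href in hrefs:
        if not href:  # link.get('href') yields None for <a> without href
            continue
        names = parse_qs(urlsplit(href).query).get("filename", [])
        if filename in names:
            return urljoin(base_url, href)
    return None
```

With the loop from the question this could be called as find_file_link((a.get('href') for a in soup.find_all('a')), "Hospital_Revised_Flatfiles.zip", url). The returned URL can then be passed to requests.get and the response body written to disk or opened with zipfile.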