Request.get not rendering all 'hrefs' in HTML Python

I am trying to fetch the "Contact Us" page of multiple websites. It works for some of them, but for others the text returned by requests.get does not contain all the 'href' links. When I inspect the page in the browser the links are visible, but they do not come through in requests.
I looked for a solution, but with no luck.
Below is the code. The webpage I am trying to scrape is https://portcullis.co/ :
import requests
from bs4 import BeautifulSoup
headers = {"Accept-Language": "en-US, en;q=0.5"}
def page_contact(url):
    r = requests.get(url, headers=headers)
    txt = BeautifulSoup(r.text, 'html.parser')
    links = []
    for link in txt.findAll('a'):
        links.append(link.get('href'))
    return r, links
The output generated is:
<Response [200]> []
Since it works fine for some other websites, I would prefer a fix that does not cater only to this website but works for all websites.
Any help is highly appreciated!
Thanks!

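This is another way to solve this, using only selenium and not BeautifulSoup:
import time
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
browser = webdriver.Chrome()  # assumes a Chrome driver is available on PATH
browser.set_page_load_timeout(100)  # must be set before the page is loaded
browser.get(url)
time.sleep(3)
# wait up to 20 seconds for the first <a> element to appear
WebDriverWait(browser, 20).until(lambda d: d.find_element(By.TAG_NAME, "a"))
time.sleep(20)  # extra pause for links that load late
# case-insensitive match on link text containing "contact"
elements = browser.find_elements(By.XPATH, "//a[contains(translate(., 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz') , 'contact')]")
final_link = []
for el in elements:
    final_link.append(el.get_attribute("href"))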

This fetches the rendered page source; you can then find the relevant links by passing it to BeautifulSoup:
from selenium import webdriver
from bs4 import BeautifulSoup
import time
browser = webdriver.Chrome()  # assumes a Chrome driver is available on PATH
browser.get('Your url')
time.sleep(5)  # give the page's JavaScript time to run
htmlSource = browser.page_source
txt = BeautifulSoup(htmlSource, 'html.parser')
browser.close()
links = []
for link in txt.findAll('a'):
    links.append(link.get('href'))
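Since the question asks for something that works across sites, one option is to try the cheap static fetch first and only fall back to a rendered fetch when no links come back. The sketch below is illustrative only: `extract_links`, `collect_links`, `fetch_static` and `fetch_rendered` are hypothetical names, and it uses the standard library's `html.parser` instead of BeautifulSoup so that it runs without extra packages.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class _AnchorCollector(HTMLParser):
    """Collects the href of every <a> tag that actually has one."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:  # skip anchors without an href
                self.hrefs.append(href)


def extract_links(html, base_url):
    """Return absolute URLs for all anchors found in the given HTML."""
    collector = _AnchorCollector()
    collector.feed(html)
    return [urljoin(base_url, href) for href in collector.hrefs]


def collect_links(url, fetch_static, fetch_rendered):
    """Try the static fetch first; render only if it yields no links."""
    links = extract_links(fetch_static(url), url)
    if not links:
        links = extract_links(fetch_rendered(url), url)
    return links
```

Here `fetch_static` could wrap `requests.get(url, headers=headers).text` and `fetch_rendered` could wrap the selenium `page_source` approach from the answers above; both take a URL and return an HTML string.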

Related

BeautifulSoup doesn't catch the complete link

When I try to get the link on a web page, bs4 doesn't catch the entire link, it stops before the **?ref**.....
I'll explain the question through the code:
import requests
from bs4 import BeautifulSoup
imdb_link = "https://www.imdb.com/chart/top?ref_=nv_mv_250"
site = requests.get(imdb_link)
soup = BeautifulSoup(site.text, 'lxml')
for items in soup.find("table", class_="chart").find_all(class_="titleColumn"):
    link = items.find("a").get('href')
    print(link)
The output is:
/title/tt0111161/
/title/tt0068646/
/title/tt0071562/
/title/tt0468569/
/title/tt0050083/
/title/tt0108052/
/title/tt0167260/
...and so on..
But that is wrong, as you can see on the web page, where the links look like this:
/title/tt0111161/?ref_=adv_li_tt
/title/tt0068646/?ref_=adv_li_tt
...and so on...
How can I get the entire link? I mean the ?ref_=adv_li_tt too?
I use Python 3.7.4

It might be interesting to work out how to get the full link, which I think needs selenium so that JavaScript can run on the page, but you don't need the full link as seen on the rendered page. What you have, with the prefix https://www.imdb.com added, is perfectly serviceable.
import requests
from bs4 import BeautifulSoup as bs
with requests.Session() as s:
    r = s.get('https://www.imdb.com/chart/top?ref_=nv_mv_25')
    soup = bs(r.content, 'lxml')
    links = ['https://www.imdb.com' + i['href'] for i in soup.select('.titleColumn a')]
    for link in links:
        r = s.get(link)
        soup = bs(r.content, 'lxml')
        print(soup.select_one('title').text)
You could let selenium load the page so the content renders, then pass it over to bs4 to get the links as they appear on the page:
from selenium import webdriver
from bs4 import BeautifulSoup as bs
d = webdriver.Chrome()
d.get('https://www.imdb.com/chart/top?ref_=nv_mv_25')
soup = bs(d.page_source, 'lxml')
d.quit()
links = ['https://www.imdb.com' + i['href'] for i in soup.select('.titleColumn a')]
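As a small variation on the string concatenation used above, the standard library's `urljoin` can build the absolute links. This is only a sketch: the two sample hrefs are copied from the question's output, and `BASE_URL` is an illustrative name.

```python
from urllib.parse import urljoin

BASE_URL = "https://www.imdb.com"  # site root taken from the question

# relative hrefs as returned by the static HTML
relative_links = ["/title/tt0111161/", "/title/tt0068646/"]

# urljoin resolves each href against the site root
absolute_links = [urljoin(BASE_URL, href) for href in relative_links]
print(absolute_links)
```

Unlike plain concatenation, `urljoin` leaves an href that is already absolute unchanged, which helps when a page mixes relative and absolute links.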

How to change the page no. of a page's search result for Web Scraping using Python?

I am scraping data from a webpage which contains search results using Python.
I am able to scrape data from the 1st search result page.
I want to loop using the same code, changing the search result page with each loop cycle.
Is there any way to do it? Is there a way to click 'Next' button without actually opening the page in a browser?

At a high level this is possible; you will need to use requests or selenium in addition to beautifulsoup.
Here is an example of defining an element and clicking the button by xpath:
from time import sleep
from bs4 import BeautifulSoup
from selenium.webdriver.common.by import By
# driver is an existing selenium webdriver with the results page loaded
html = driver.page_source
soup = BeautifulSoup(html, 'lxml')
sleep(1)  # Time in seconds.
ele = driver.find_element(By.XPATH, "/html/body/div[1]/div/div/div/div[2]/div[2]/div/table/tfoot/tr/td/div//button[contains(text(),'Next')]")
ele.click()

Yes, of course you can do what you described. Although you didn't post your actual code, here is a solution to help you get started.
import requests
from bs4 import BeautifulSoup
url = "http://www.bolsamadrid.es/ing/aspx/Empresas/Empresas.aspx"
res = requests.get(url, headers={"User-Agent": "Mozilla/5.0"})
soup = BeautifulSoup(res.text, "lxml")
for page in range(7):
    # rebuild the form from the current page, overriding only the page field
    formdata = {}
    for item in soup.select("#aspnetForm input"):
        name = item.get("name")
        if not name:  # inputs without a name are not submitted
            continue
        if "ctl00$Contenido$GoPag" in name:
            formdata[name] = page
        else:
            formdata[name] = item.get("value")
    req = requests.post(url, data=formdata)
    soup = BeautifulSoup(req.text, "lxml")
    for items in soup.select("#ctl00_Contenido_tblEmisoras tr")[1:]:
        data = [item.get_text(strip=True) for item in items.select("td")]
        print(data)
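The form-building step in that loop can be isolated into a small function, which makes it easier to check without hitting the site. The following is a minimal sketch under assumptions: `build_formdata` is a hypothetical helper, and the sample input names and values other than `ctl00$Contenido$GoPag` are made up for illustration.

```python
def build_formdata(inputs, page, page_field="ctl00$Contenido$GoPag"):
    """Copy every named form input, overriding the paging field with `page`."""
    formdata = {}
    for name, value in inputs:
        if not name:  # unnamed inputs are not part of a form submission
            continue
        formdata[name] = page if page_field in name else value
    return formdata


# made-up (name, value) pairs standing in for the page's <input> elements
sample_inputs = [
    ("__VIEWSTATE", "abc"),
    (None, "ignored"),
    ("ctl00$Contenido$GoPag", ""),
]
print(build_formdata(sample_inputs, 3))
```

With BeautifulSoup, the `inputs` argument could be produced as `[(i.get("name"), i.get("value")) for i in soup.select("#aspnetForm input")]`.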

web scraping with beautiful soup in python

I want to crawl the homepage of YouTube to pull out all the links to videos. The following is the code:
from bs4 import BeautifulSoup
import requests
s = 'https://www.youtube.com/'
html = requests.get(s)
html = html.text
s = BeautifulSoup(html, features="html.parser")
for e in s.find_all('a', {'id': 'video-title'}):
    link = e.get('href')
    text = e.string
    print(text)
    print(link)
    print()
Nothing happens when I run the above code. It seems like the id is not being found. What am I doing wrong?

It is because you are not getting the same HTML as your browser has.
import requests
from bs4 import BeautifulSoup
s = requests.get("https://youtube.com").text
soup = BeautifulSoup(s,'lxml')
print(soup)
Save this code's output to a file named test.html and open it in your browser. You will see that it is not the same as what the browser normally shows, and it looks corrupted.
See these questions below.
HTML in browser doesn't correspond to scraped data in python
Python requests not giving me the same HTML as my browser is
Basically, I recommend using Selenium WebDriver, as it behaves like a browser.

Yes, this is a strange scrape, but if you scrape at the 'div id="content"' level, you are able to get the data you are requesting. I was able to get the titles of each video, but it appears youtube has some rate limiting or throttling, so I do not think you will be able to get ALL of the titles and links. At any rate, below is what I got working for the titles:
import requests
from bs4 import BeautifulSoup
url = 'https://www.youtube.com/'
response = requests.get(url)
page = response.text
soup = BeautifulSoup(page, 'html.parser')
links = soup.find_all('div', id='content')
for each in links:
    print(each.text)

Maybe this could help for scraping all videos from the YouTube home page:
from bs4 import BeautifulSoup
import requests
r = 'https://www.youtube.com'  # no trailing slash, so r + '/watch?...' has a single slash
html = requests.get(r)
all_videos = []
soup = BeautifulSoup(html.text, 'html.parser')
for i in soup.find_all('a'):
    if i.has_attr('href'):
        text = i.attrs.get('href')
        if text.startswith('/watch?'):
            urls = r + text
            all_videos.append(urls)
print('Total Videos', len(all_videos))
print('LIST OF VIDEOS', all_videos)

This code snippet selects all links from the youtube.com homepage that contain /watch? in their href attribute (links to videos):
from bs4 import BeautifulSoup
import requests
soup = BeautifulSoup(requests.get('https://www.youtube.com/').text, 'lxml')
for a in soup.select('a[href*="/watch?"]'):
    print('https://www.youtube.com{}'.format(a['href']))
Prints:
https://www.youtube.com/watch?v=pBhkG2Zwf-c
https://www.youtube.com/watch?v=pBhkG2Zwf-c
https://www.youtube.com/watch?v=gnn7GwqXek4
https://www.youtube.com/watch?v=gnn7GwqXek4
https://www.youtube.com/watch?v=AMKDVfucPfA
https://www.youtube.com/watch?v=AMKDVfucPfA
https://www.youtube.com/watch?v=daQcqPHx9uw
https://www.youtube.com/watch?v=daQcqPHx9uw
https://www.youtube.com/watch?v=V_MXGdSBbAI
https://www.youtube.com/watch?v=V_MXGdSBbAI
https://www.youtube.com/watch?v=KEW9U7s_zks
https://www.youtube.com/watch?v=KEW9U7s_zks
https://www.youtube.com/watch?v=EM7ZR5z3kCo
https://www.youtube.com/watch?v=EM7ZR5z3kCo
https://www.youtube.com/watch?v=6NPHk-Yd4VU
https://www.youtube.com/watch?v=6NPHk-Yd4VU
https://www.youtube.com/watch?v=dHiAls8loz4
https://www.youtube.com/watch?v=dHiAls8loz4
https://www.youtube.com/watch?v=2_mDOWLhkVU
https://www.youtube.com/watch?v=2_mDOWLhkVU
...and so on
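Each link in the output above appears twice. If duplicates are not wanted, one simple option is to de-duplicate while keeping the original order. A minimal sketch, using a few of the printed URLs as sample data:

```python
# sample data copied from the printed output above
links = [
    "https://www.youtube.com/watch?v=pBhkG2Zwf-c",
    "https://www.youtube.com/watch?v=pBhkG2Zwf-c",
    "https://www.youtube.com/watch?v=gnn7GwqXek4",
    "https://www.youtube.com/watch?v=gnn7GwqXek4",
]

# dict keys are unique and keep insertion order (Python 3.7+)
unique_links = list(dict.fromkeys(links))
print(unique_links)
```

Using `set(links)` would also remove duplicates, but it does not preserve the order in which the links appeared on the page.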

Python 3.6: How can I get the content from a dynamic page?

I'm trying to get the content from this web page "http://www.fibalivestats.com/u/ACBS/333409/pbp.html" with this code:
import requests
from bs4 import BeautifulSoup
r = requests.get("http://www.fibalivestats.com/u/ACBS/333409/pbp.html")
if r.status_code != 200:
    print("Error!!!")
html = r.content
soup = BeautifulSoup(html, "html.parser")
print(soup)
I get the template of the page, but not the data associated with each tag.
How can I get the data? I'm new to Python.

In this case the JavaScript is not being executed, so it never fills in the elements. requests only downloads the raw HTML; it does not build a DOM or run the scripts that normally populate the page once the DOM is "ready". I'd suggest using a webdriver such as Selenium.
It will mimic a browser and the JavaScript will be executed. An example is below.
from selenium import webdriver
from bs4 import BeautifulSoup
browser = webdriver.Firefox()
browser.get("http://www.fibalivestats.com/u/ACBS/333409/pbp.html")
html_source = browser.page_source
browser.quit()  # close the browser once the source has been read
soup = BeautifulSoup(html_source, "html.parser")
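On a page that loads its data asynchronously, reading `page_source` immediately after `get()` may still return the empty template. A common remedy is to poll until some condition holds instead of sleeping for a fixed time. The helper below is a generic, illustrative sketch of that polling idea; `wait_for` is a hypothetical name and is not part of selenium, whose own `WebDriverWait` serves the same purpose.

```python
import time


def wait_for(condition, timeout=10.0, interval=0.5):
    """Call `condition` until it returns a truthy value or `timeout` seconds pass."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met within {} seconds".format(timeout))
        time.sleep(interval)
```

With a live browser, the condition could be something like `lambda: "some expected text" in browser.page_source`, where the expected text is whatever marks the data as loaded on the page in question.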

Asynchronous scraping with Python: grequests and Beautifulsoup4

I am trying to scrape this site. I managed to do it using urllib and beautifulsoup, but urllib is too slow. I want to make asynchronous requests because there are thousands of URLs. I found that grequests is a nice package for this.
Example:
import grequests
from bs4 import BeautifulSoup
pages = []
page="https://www.spitogatos.gr/search/results/residential/sale/r100/m100m101m102m103m104m105m106m107m108m109m110m150m151m152m153m154m155m156m157m158m159m160m161m162m163m164m165m166m167m168m169m170m171m172m173m174m175m176m177m178m179m180m181m182m183m184m185m186m187m188m189m190m191m192m193m194m195m196m197m198m106001m125000m"
base = page  # keep the un-suffixed search URL
for i in range(1, 1000):
    pages.append(page)
    page = base + "/offset_{}".format(i*10)
rs = (grequests.get(item) for item in pages)
a=grequests.map(rs)
The problem is that I don't know how to continue and use beautifulsoup to get the HTML code of every page.
It would be nice to hear your ideas. Thank you!

Refer to the script below, and also check the source link after it. It will help.
import grequests
from bs4 import BeautifulSoup
# links is the list of page URLs to fetch
reqs = (grequests.get(link) for link in links)
resp = grequests.imap(reqs, grequests.Pool(10))
for r in resp:
    soup = BeautifulSoup(r.text, 'lxml')
    results = soup.find_all('a', attrs={"class": 'product__list-name'})
    print(results[0].text)
    prices = soup.find_all('span', attrs={'class': "pdpPriceMrp"})
    print(prices[0].text)
    discount = soup.find_all("div", attrs={"class": "listingDiscnt"})
    print(discount[0].text)
Source: https://blog.datahut.co/asynchronous-web-scraping-using-python/
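When `grequests.map` is used, as in the question, the returned list contains `None` for requests that failed, so it is worth filtering before parsing. The sketch below shows one way to do that; `parse_responses` and `parse` are hypothetical names, and the sample responses are stand-in objects so the example runs without a network.

```python
from types import SimpleNamespace


def parse_responses(responses, parse):
    """Apply `parse` to the text of each successful response, skipping failures."""
    results = []
    for response in responses:
        # skip failed requests (None) and non-200 answers
        if response is None or response.status_code != 200:
            continue
        results.append(parse(response.text))
    return results


# stand-ins for what grequests.map might return
sample_responses = [
    SimpleNamespace(status_code=200, text="<html>page 1</html>"),
    None,
    SimpleNamespace(status_code=404, text="not found"),
]
print(parse_responses(sample_responses, len))
```

In real use, `parse` could be a function that builds `BeautifulSoup(html, 'lxml')` and pulls out the fields of interest, and `responses` would be the list `a` from the question.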
