URLS from website domain with "hidden" layers - web

I cannot find a way to extract all URLs from the following search pages:
(1) https://www.ah.nl/zoeken?query=vegan
(2) https://www.jumbo.com/zoeken/?searchTerms=vegan
For the first, the problem is that the products are 'hidden': as a website visitor you need to click a button at the bottom of the page to show more items. I tried BeautifulSoup, but it does not extract the 'hidden' URLs.
from bs4 import BeautifulSoup
from urllib.request import Request, urlopen
import pandas
req = Request('https://www.ah.nl/zoeken?query=vegan')
html_page = urlopen(req)
soup = BeautifulSoup(html_page, "lxml")
df = pandas.DataFrame()
links = []
for link in soup.findAll('a'):
    links.append(link.get('href'))
With the second link, the issue is that the results are spread over multiple pages, which the code above does not handle either. In a previous question, it was suggested to use:
import requests
import pandas as pd
url = 'https://www.sainsburys.co.uk/groceries-api/gol-services/product/v1/product'
payload = {
    'filter[keyword]': 'vegan',
    'include[PRODUCT_AD]': 'citrus',
    'page_number': '1',
    'page_size': '2000',
    'sort_order': 'FAVOURITES_FIRST'
}
jsonData = requests.get(url, params=payload).json()
products = jsonData['products']
df = pd.DataFrame(products)
However, I have not yet worked with requests and query parameters, and cannot figure out how to adjust these parameters to work with link (2).
Hopefully someone can help me with these two links. Thank you.
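For readers who, like the asker, have not used query parameters before, here is a minimal sketch of how a params dictionary ends up in the URL. It uses only the standard library's urlencode, which approximates what requests.get(url, params=payload) does when it builds the request. The helper name build_url is hypothetical, and the payload is a subset of the one quoted above.

```python
from urllib.parse import urlencode


def build_url(base_url, params):
    """Return base_url with params appended as a percent-encoded query string."""
    return f"{base_url}?{urlencode(params)}"


# Subset of the payload quoted in the question
payload = {
    'filter[keyword]': 'vegan',
    'page_number': '1',
    'page_size': '2000',
}
url = build_url(
    'https://www.sainsburys.co.uk/groceries-api/gol-services/product/v1/product',
    payload,
)
print(url)  # the square brackets in the key are percent-encoded as %5B and %5D
```

Adapting this to another site then comes down to finding, in the browser's network tools, which endpoint the page calls and which parameter names it expects.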

ah.nl site:
First, we send one request to get the total number of pages. Then we loop over the pages and request each one. As an example, I print the title, price, and link of each product to the console.
import requests
import json
def get_data(query):
    page_size = 36
    # The first request is only used to read the total number of pages
    url = f"https://www.ah.nl/zoeken/api/products/search?page=1&size={page_size}&query={query}"
    response = requests.request("GET", url)
    json_obj = json.loads(response.text)
    # Assumes the API's page index starts at 0; adjust the range if it starts at 1
    for page in range(int(json_obj['page']['totalPages'])):
        url = f"https://www.ah.nl/zoeken/api/products/search?page={page}&size={page_size}&query={query}"
        response = requests.request("GET", url)
        json_obj = json.loads(response.text)
        # Each card groups one or more products
        for products in json_obj['cards']:
            for product in products['products']:
                print(product['title'], product['price']['now'], product['link'])
get_data('vegan')
If you have any questions, I'll be happy to answer. If you need a code example for the second site, let me know and I will add one.
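Since the question asks for the URLs rather than console output, the same pagination idea can be sketched so that it returns a list. This is only an illustration under assumptions: collect_products and fake_fetch are hypothetical names, the JSON shape ('page', 'cards', 'products') is taken from the answer above, and pages are assumed to be numbered from 0. The HTTP call is passed in as a function so the loop can be tried without network access.

```python
def collect_products(fetch_page, query, page_size=36):
    """Walk every result page and return a flat list of product dicts.

    fetch_page(query, page, page_size) must return the decoded JSON of one page.
    """
    first = fetch_page(query, 0, page_size)
    total_pages = int(first['page']['totalPages'])
    products = []
    for page in range(total_pages):
        # Reuse the first response instead of requesting page 0 twice
        data = first if page == 0 else fetch_page(query, page, page_size)
        for card in data['cards']:
            products.extend(card['products'])
    return products


def fake_fetch(query, page, page_size):
    """Stand-in for a real HTTP call: two pages with one product each."""
    return {
        'page': {'totalPages': 2},
        'cards': [{'products': [{'title': f'{query} item {page}',
                                 'link': f'/product/{page}'}]}],
    }


links = [p['link'] for p in collect_products(fake_fetch, 'vegan')]
print(links)  # ['/product/0', '/product/1']
```

A real fetch_page would wrap a requests.get(...).json() call against the search endpoint used in the answer.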

Related

Extract a particular link present in each of the considered web pages

I'm having trouble extracting a particular link from each of the web pages I'm considering.
In particular, considering for example the following websites:
https://lefooding.com/en/restaurants/ezkia
https://lefooding.com/en/restaurants/tekes
I would like to know if there is a unique way to extract the field WEBSITE (above the map) shown in the table on the left of the page.
For the reported cases, I would like to extract the links:
https://www.ezkia-restaurant.fr/
https://www.tekesrestaurant.com/
There are no unique tags to refer to and this makes extraction difficult.
I've thought of a solution using the selector, but it doesn't seem to work. For the first link I have:
from bs4 import BeautifulSoup
import requests
url = "https://lefooding.com/en/restaurants/ezkia"
res = requests.get(url)
soup = BeautifulSoup(res.text, 'html.parser')
data = soup.find("div", {"class": "e-rowContent"})
print(data)
but there is no trace of the link I need here. Does anyone know of a possible solution?
Try this:
import requests
from bs4 import BeautifulSoup
urls = [
    "https://lefooding.com/en/restaurants/ezkia",
    "https://lefooding.com/en/restaurants/tekes",
]
with requests.Session() as s:
    for url in urls:
        soup = [
            link.strip() for link
            in BeautifulSoup(
                s.get(url).text, "lxml"
            ).select(".pageGuide__infos a")[-1]
        ]
        print(soup)
Output:
['https://www.ezkia-restaurant.fr']
['https://www.tekesrestaurant.com/']

Request.get not rendering all 'hrefs' in HTML Python

I am trying to fetch the "Contact Us" page of multiple websites. It works for some of the websites, but for others the text returned by requests.get does not contain all the 'href' links. When I inspect the page in the browser the links are visible, but they do not come through in requests.
I looked for a solution, but with no luck.
Below is the code; the webpage I am trying to scrape is https://portcullis.co/:
import requests
from bs4 import BeautifulSoup
headers = {"Accept-Language": "en-US, en;q=0.5"}
def page_contact(url):
    r = requests.get(url, headers=headers)
    txt = BeautifulSoup(r.text, 'html.parser')
    links = []
    for link in txt.findAll('a'):
        links.append(link.get('href'))
    return r, links
The output generated is:
<Response [200]> []
Since it works fine for some other websites, I would prefer a fix that does not cater only to this website but works for all websites.
Any help is highly appreciated!
Thanks!
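Here is another way to solve this, using only Selenium and not BeautifulSoup:
import time
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
url = "https://portcullis.co/"
# Recent Selenium 4 releases locate the driver themselves; older versions need the driver path
browser = webdriver.Chrome()
browser.set_page_load_timeout(100)
browser.get(url)
time.sleep(3)  # give client-side scripts a moment to render
# Wait (up to 20 s) until at least one anchor is present
WebDriverWait(browser, 20).until(lambda d: d.find_element(By.TAG_NAME, "a"))
# Case-insensitive match on link text containing "contact"
elements = browser.find_elements(By.XPATH, "//a[contains(translate(., 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz') , 'contact')]")
final_link = [el.get_attribute("href") for el in elements]
browser.quit()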
This fetches the rendered page source; you can then find the relevant links by passing it to BeautifulSoup:
from selenium import webdriver
from bs4 import BeautifulSoup
import time
# Recent Selenium 4 releases locate the driver themselves; older versions need the driver path
browser = webdriver.Chrome()
browser.get('Your url')
time.sleep(5)  # wait for client-side rendering to finish
htmlSource = browser.page_source
txt = BeautifulSoup(htmlSource, 'html.parser')
browser.close()
links = []
for link in txt.findAll('a'):
    links.append(link.get('href'))
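Once the rendered source is available, the case-insensitive "contact" filter from the XPath above can also be applied in plain Python. The following is a minimal sketch using only the standard library's html.parser. The names ContactLinkParser and find_contact_links are hypothetical, and the sample HTML is made up for illustration, not taken from any of the sites discussed.

```python
from html.parser import HTMLParser


class ContactLinkParser(HTMLParser):
    """Collect hrefs of <a> elements whose text contains 'contact' (any case)."""

    def __init__(self):
        super().__init__()
        self.contact_links = []
        self._href = None   # href of the <a> currently open, if any
        self._text = []     # text pieces collected inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            self._href = dict(attrs).get('href')
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == 'a' and self._href is not None:
            if 'contact' in ''.join(self._text).lower():
                self.contact_links.append(self._href)
            self._href = None


def find_contact_links(html_source):
    parser = ContactLinkParser()
    parser.feed(html_source)
    parser.close()
    return parser.contact_links


# Made-up fragment; in practice html_source would be browser.page_source
sample = '<a href="/about">About</a><a href="/contact-us"><span>CONTACT Us</span></a>'
print(find_contact_links(sample))  # ['/contact-us']
```

Anchors without an href attribute are skipped, which also avoids the None entries that link.get('href') can produce.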

Scraping Site Data without Selenium

I am currently trying to pull CMS historical data from their site. I have some working code that pulls the download links from the page. My problem is that the links are split across pages, so I need to iterate through all the available pages and extract the download links. The obvious choice is to use Selenium to click through the pages and collect the data, but due to company policy I cannot run Selenium in this environment. Is there a way to go through the pages and extract the links without it? The website does not show the POST link when you go to the next page, so I am out of ideas for reaching the next page without either the POST link or Selenium.
Current working code to pull links from the first page:
import pandas as pd
from datetime import datetime
#from selenium import webdriver
from lxml import html
import requests
def http_request_get(url, session=None, payload=None, parse=True):
    """ Sends a GET HTTP request to a website and returns its HTML content and full url address. """
    if payload is None:
        payload = {}
    if session:
        content = session.get(url, params=payload, verify=False, headers={"content-type":"text"})
    else:
        content = requests.get(url, params=payload, verify=False, headers={"content-type":"text"})
    content.raise_for_status() # Raise HTTPError for bad requests (4xx or 5xx)
    if parse:
        return html.fromstring(content.text), content.url
    else:
        return content.text, content.url
def get_html(link):
    """
    Returns a html.
    """
    page_parsed, _ = http_request_get(url=link, payload={'t': ''}, parse=True)
    return page_parsed
cmslink = "https://www.cms.gov/Research-Statistics-Data-and-Systems/Statistics-Trends-and-Reports/MCRAdvPartDEnrolData/Monthly-Contract-and-Enrollment-Summary-Report"
content, _ = http_request_get(url=cmslink,payload={'t':''},parse=True)
linkTable = content.cssselect('td[headers="view-dlf-1-title-table-column"]')[0]
# '@href' selects the href attribute of every anchor
headers = linkTable[0].xpath('//a/@href')
df1 = pd.DataFrame(headers,columns= ['links'])
df1SubSet = df1[df1['links'].str.contains('contract-summary', case=False)]
These are the two urls that will give you the total 166 entries. I have also changed the condition for capturing hrefs. Give this a try.
cmslinks=[
'https://www.cms.gov/Research-Statistics-Data-and-Systems/Statistics-Trends-and-Reports/MCRAdvPartDEnrolData/Monthly-Contract-and-Enrollment-Summary-Report?items_per_page=100&items_per_page_options%5B5%5D=5%20per%20page&items_per_page_options%5B10%5D=10%20per%20page&items_per_page_options%5B25%5D=25%20per%20page&items_per_page_options%5B50%5D=50%20per%20page&items_per_page_options%5B100%5D=100%20per%20page&combine=&page=0',
'https://www.cms.gov/Research-Statistics-Data-and-Systems/Statistics-Trends-and-Reports/MCRAdvPartDEnrolData/Monthly-Contract-and-Enrollment-Summary-Report?items_per_page=100&items_per_page_options%5B5%5D=5%20per%20page&items_per_page_options%5B10%5D=10%20per%20page&items_per_page_options%5B25%5D=25%20per%20page&items_per_page_options%5B50%5D=50%20per%20page&items_per_page_options%5B100%5D=100%20per%20page&combine=&page=1']
df = pd.DataFrame()
for cmslink in cmslinks:
    print(cmslink)
    content, _ = http_request_get(url=cmslink,payload={'t':''},parse=True)
    linkTable = content.cssselect('td[headers="view-dlf-1-title-table-column"]')[0]
    headers = linkTable[0].xpath("//a[contains(text(),'Contract Summary') or contains(text(),'Monthly Enrollment by CPSC')]/@href")
    df1 = pd.DataFrame(headers,columns= ['links'])
    # DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
    df = pd.concat([df, df1])
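Rather than hard-coding one URL per page, the page URLs could be generated from the entry count. The sketch below rests on an unverified assumption: that the site accepts just the items_per_page and page query parameters and ignores the omitted items_per_page_options and combine parameters. The helper name page_urls is hypothetical; the figure of 166 entries comes from the answer above.

```python
from math import ceil
from urllib.parse import urlencode


def page_urls(base_url, total_entries, items_per_page=100):
    """Return one URL per result page, with pages numbered from 0."""
    n_pages = ceil(total_entries / items_per_page)
    return [
        f"{base_url}?{urlencode({'items_per_page': items_per_page, 'page': page})}"
        for page in range(n_pages)
    ]


base = ("https://www.cms.gov/Research-Statistics-Data-and-Systems/"
        "Statistics-Trends-and-Reports/MCRAdvPartDEnrolData/"
        "Monthly-Contract-and-Enrollment-Summary-Report")
for u in page_urls(base, 166):   # 166 entries at 100 per page gives 2 pages
    print(u)
```

If the shortened URLs do not return the expected table, fall back to the full query strings listed in the answer.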

Unable to use the site search function

I am trying to use the built-in search function from the site but I keep getting results from the main page. Not sure what I am doing wrong.
import requests
from bs4 import BeautifulSoup
body = {'input':'ferris'} # <-- also have tried'query'
con = requests.post('http://www.collegedata.com/', data=body)
soup = BeautifulSoup(con.content, 'html.parser')
products = soup.findAll('div', {'class': 'schoolCityCol'})
print(soup)
print (products)
You have 2 issues in your code:
The POST URL is incorrect. It should be:
con = requests.post('http://www.collegedata.com/cs/search/college/college_search_tmpl.jhtml', data=body)
Your POST data is incorrect too:
body = {'method':'submit', 'collegeName':'ferris', 'searchType':'1'}
You can use the developer tools in any browser (preferably Chrome) and check the POST URL and data in the Network tab.

How can I extract the Foursquare url location from Swarm webpage in python3?

Suppose we have this Swarm URL: "https://www.swarmapp.com/c/dZxqzKerUMc". How can we get the URL behind the Apple Williamsburg hyperlink on that page?
I tried to filter it out by HTML tags, but there are many tags and lots of foursquare.com links.
Below is part of the source code of the page above:
<h1><strong>Kristin Brooks</strong> at <a
href="https://foursquare.com/v/apple-williamsburg/57915fa838fab553338ff7cb"
target="_blank">Apple Williamsburg</a></h1>
The Foursquare URL in the code is not always the same, so what is the best way to get that specific URL for any given Swarm URL?
I tried this:
import bs4
import requests
def get_4square_url(link):
    response = requests.get(link)
    soup = bs4.BeautifulSoup(response.text, "html.parser")
    link = [a.attrs.get('href') for a in
            soup.select('a[href=https://foursquare.com/v/*]')]
    return link
print (get_4square_url('https://www.swarmapp.com/c/dZxqzKerUMc'))
I used https://foursquare.com/v/ as a pattern to get the desired URL:
import re
import bs4
import requests
import urllib3
def get_4square_url(link):
    try:
        response = requests.get(link)
        soup = bs4.BeautifulSoup(response.text, "html.parser")
        # Match anchors whose href contains the Foursquare venue prefix (my pattern)
        for elem in soup.find_all('a', href=re.compile(r'https://foursquare\.com/v/')):
            link = elem['href']
            return link  # first matching link
    # Several exception types must be given as a tuple, not chained with "or"
    except (requests.exceptions.HTTPError,
            requests.exceptions.ConnectionError,
            requests.exceptions.ConnectTimeout,
            urllib3.exceptions.MaxRetryError):
        pass
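The pattern itself can be checked without any network access by running it against the HTML fragment quoted in the question. This is a minimal sketch using only the re module. The name first_venue_url is hypothetical, and the expression assumes the href value is wrapped in double quotes, as it is in that fragment.

```python
import re

# Capture the full venue URL from a double-quoted href attribute
FOURSQUARE_VENUE = re.compile(r'href="(https://foursquare\.com/v/[^"]+)"')


def first_venue_url(html_source):
    """Return the first Foursquare venue URL in html_source, or None."""
    match = FOURSQUARE_VENUE.search(html_source)
    return match.group(1) if match else None


# Fragment quoted in the question
snippet = '''<h1><strong>Kristin Brooks</strong> at <a
href="https://foursquare.com/v/apple-williamsburg/57915fa838fab553338ff7cb"
target="_blank">Apple Williamsburg</a></h1>'''
print(first_venue_url(snippet))
# https://foursquare.com/v/apple-williamsburg/57915fa838fab553338ff7cb
```

For full pages, an HTML parser as in the answer above is the more robust choice, since regular expressions are sensitive to quoting and attribute order.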

Resources