Request the same webpage as seen in webbrowser - python-3.x

I'm trying to scrape a webpage, but I'm running into the problem that the page content is different from what I see in Firefox.
This is my code:
import requests
from bs4 import BeautifulSoup

url = "https://www.sareb.es/es_ES/inmuebles"
with requests.get(url, verify=False) as html_file:
    soup = BeautifulSoup(html_file.content, "html.parser")
    soup.find_all("h3")
I want to scrape the prices, which are in h3 tags, but they don't appear in the output of soup.find_all("h3").
Is there any way to retrieve the "same" webpage?
Thanks

You can use requests against the JSON endpoint instead, and create a loop over the page number to get more results. There are 1,217 pages in total.
import requests

url = "https://www.sareb.es/dynamic/assets/json?"
params = {
    "lang": "en_US",
    "page": "1",
    "orderField": "score",
    "orderDirection": "DESC",
    "compId": "7aa1f42482964610VgnVCMServera5ecbf0aRCRD",
    "rtbPage": "Home > Inmuebles"
}
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.119 Safari/537.36",
    "X-CSRF-TOKEN": "5618a1c8-9d8e-4a88-95a8-2eef4c2b3455",
    "X-Requested-With": "XMLHttpRequest"
}

r = requests.get(url, params=params, headers=headers, verify=False)
d = r.json()
results = d['result']['assetsPage']['content']
for result in results:
    print(result['type'], result['price'], result['city'], result['district'])
Results:
Country House 281.000 € Sueca Valencia/València
Country House To consult Moaña Pontevedra
Country House 152.000 € Corcos Valladolid
Country House 36.000 € Alcalà de Xivert Castellón/Castelló
Office 130.800 € Sagunto/Sagunt Valencia/València
Office 643.370 € Valencia Valencia/València
Office 646.495 € Valencia Valencia/València
Office 3.100 € Valencia Valencia/València
Office 144.700 € / 1.070 € Palmas de Gran Canaria (Las) Palmas, Las
Office 326.400 € / 1.635 € Murcia Murcia
Offices From 60.000 € Colmenar Viejo Madrid
Office 519.300 € / 2.564 € Alicante/Alacant Alicante/Alacant
Office To consult Palma de Mallorca Balears, Illes
Office 97.000 € Santa Lucía de Tirajana Palmas, Las
Offices From 444.200 € Sevilla Sevilla
Offices From 34.700 € Villamayor Salamanca
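The answer above mentions looping over the page number (1,217 pages in total); a minimal sketch of that loop, reusing the same params and headers from the answer, might look like this (the small range is just for illustration):

import requests

url = "https://www.sareb.es/dynamic/assets/json?"
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.119 Safari/537.36",
    "X-CSRF-TOKEN": "5618a1c8-9d8e-4a88-95a8-2eef4c2b3455",
    "X-Requested-With": "XMLHttpRequest"
}

all_results = []
for page in range(1, 4):  # raise the upper bound (up to 1217) to walk the full listing
    params = {
        "lang": "en_US",
        "page": str(page),
        "orderField": "score",
        "orderDirection": "DESC",
        "compId": "7aa1f42482964610VgnVCMServera5ecbf0aRCRD",
        "rtbPage": "Home > Inmuebles"
    }
    r = requests.get(url, params=params, headers=headers, verify=False)
    all_results.extend(r.json()["result"]["assetsPage"]["content"])

print(len(all_results))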

Related

find_elements by CSS_Selector Python Selenium

I realized Selenium removed some attributes; my code is not able to use the each_item.find_element(By.CSS_SELECTOR, ...) statements:
for i in range(pagenum):
    driver.get(f"https://www.adiglobaldistribution.us/search?attributes=dd1a8f50-5ac8-ec11-a837-000d3a006ffb&page={i}&criteria=Tp-link")
    time.sleep(5)
    wait = WebDriverWait(driver, 10)
    search_items = driver.find_elements(By.CSS_SELECTOR, "[class='rd-thumb-details-price']")
    for each_item in search_items:
        item_title = each_item.find_element(By.CSS_SELECTOR, "span[class='rd-item-name-desc']").text
        item_name = each_item.find_element(By.CSS_SELECTOR, "span[class='item-num-mfg']").text[7:]
        item_link = each_item.find_element(By.CSS_SELECTOR, "div[class='item-thumb'] a").get_attribute('href')
        item_price = each_item.find_element(By.CSS_SELECTOR, "div[class='rd-item-price rd-item-price--list']").text[2:].replace("\n", ".")
        item_stock = each_item.find_element(By.CSS_SELECTOR, "div[class='rd-item-price']").text[19:]
        table = {"title": item_title, "name": item_name, "Price": item_price, "Stock": item_stock, "link": item_link}
        data_adi.append(table)
Error:
You are probably approaching the whole situation the wrong way. Those products are hydrated into the page by JavaScript once the page loads, so you can scrape the API endpoint directly and avoid the complexities (and slowness) of Selenium. Here is a solution based on requests and pandas, scraping the API endpoint (found under Dev Tools > Network tab):
import requests
import pandas as pd

headers = {
    'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.0.0 Safari/537.36"
}

full_df = pd.DataFrame()
for x in range(1, 4):
    r = requests.get(f'https://www.adiglobaldistribution.us/api/v2/adiglobalproducts/?applyPersonalization=true&boostIds=&categoryId=16231864-9ed5-4536-a8b3-ae870078e9f7&expand=pricing,brand&getAllAttributeFacets=false&hasMarketingTileContent=false&includeAttributes=IncludeOnProduct&includeSuggestions=false&makeBrandUrls=false&page={x}&pageSize=36&previouslyPurchasedProducts=false&query=&searchWithin=&sort=Bestseller', headers=headers)
    df = pd.json_normalize(r.json()['products'])
    full_df = pd.concat([full_df, df], axis=0, ignore_index=True)

# print([x for x in full_df.columns])
print(full_df[['basicListPrice', 'modelNumber', 'name', 'properties.countrY_OF_ORIGIN', 'productDetailUrl', 'properties.minimuM_QTY', 'properties.onsalenow']])
Result printed in terminal:
basicListPrice modelNumber name properties.countrY_OF_ORIGIN productDetailUrl properties.minimuM_QTY properties.onsalenow
0 51.99 TL-SG1005P TP-Link TL-SG1005P 5-Port Gigabit Desktop Switch with 4-Port PoE China /Catalog/shop-brands/tp-link/FP-TLSG1005P 1 0
1 81.99 C7 TP-Link ARCHER C7 AC1750 Wireless Dual Band Gigabit Router China /Catalog/shop-brands/tp-link/FP-ARCHERC7 1 0
2 18.99 TL-POE150S TP-Link TL-POE150S PoE Injector, IEEE 802.3af Compliant China /Catalog/shop-brands/tp-link/FP-TLPOE150S 1 0
3 19.99 TL-WR841N TP-Link TL-WR841N 300Mbps Wireless N Router China /Catalog/shop-brands/tp-link/FP-TLWR841N 1 0
4 43.99 TL-PA4010 KIT TP-Link TL-PA4010KIT AV600 600Mbps Powerline Starter Kit China /Catalog/shop-brands/tp-link/FP-TLPA4010K 1 0
... ... ... ... ... ... ... ...
85 76.99 TL-SL1311MP TP-Link TL-SL1311MP 8-Port 10/100mbps + 3-Port Gigabit Desktop Switch With 8-Port PoE+ /Catalog/shop-brands/tp-link/FP-TSL1311MP 1 0
86 35.99 C20 TP-Link ARCHER C20 IEEE 802.11ac Ethernet Wireless Router China /Catalog/shop-brands/tp-link/FP-ARCHERC20 1 0
87 29.99 TL-WR802N TP-Link TL-WR802N 300Mbps Wireless N Nano Router, Pocket Size China /Catalog/shop-brands/tp-link/FP-TLWR802N 1 0
88 100.99 EAP610 TP-Link EAP610_V2 AX1800 CEILING MOUNT WI-FI 6" China /Catalog/shop-brands/tp-link/FP-EAP610V2 1 0
89 130.99 EAP650 TP-Link EAP650 AX3000 Ceiling Mount Wi-Fi 6 Access Point China /Catalog/shop-brands/tp-link/FP-EAP650 1 0
90 rows × 7 columns
You can further inspect that JSON response and see if there is more useful information you need from there.
Relevant pandas docs: https://pandas.pydata.org/docs/reference/api/pandas.json_normalize.html
And for requests docs, see https://requests.readthedocs.io/en/latest/
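As a quick way to follow that suggestion, a small sketch like the one below prints the top-level keys of the response and every flattened product column. The shortened query string is an assumption; if it returns nothing, reuse the full URL from the loop above:

import requests
import pandas as pd

headers = {'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.0.0 Safari/537.36"}
r = requests.get('https://www.adiglobaldistribution.us/api/v2/adiglobalproducts/?page=1&pageSize=36&sort=Bestseller', headers=headers)
data = r.json()

print(list(data.keys()))                                   # top-level keys in the response
print(list(pd.json_normalize(data['products']).columns))   # every flattened product field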

Creating a function for my python web scraper that will output a dictionary

I have created my web scraper and added a function, but unfortunately my function is not being called and the output is not coming out as a dictionary. How do I create and call the function and store the output as a dictionary? Below is my code and function so far.
from bs4 import BeautifulSoup
import requests

top_stories = []

def get_stories():
    """ user agent to facilitates end-user interaction with web content"""
    headers = {
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.101 Safari/537.36'
    }
    base_url = 'www.example.com'
    source = requests.get(base_url).text
    soup = BeautifulSoup(source, 'html.parser')
    articles = soup.find_all("article", class_="card")
    print(f"Number of articles found: {len(articles)}")
    for article in articles:
        try:
            headline = article.h3.text.strip()
            link = base_url + article.a['href']
            text = article.find("div", class_="field--type-text-with-summary").text.strip()
            img_url = base_url + article.picture.img['data-src']
            print(headline, link, text, img_url)
            stories_dict = {}
            stories_dict['Headline'] = headline
            stories_dict['Link'] = link
            stories_dict['Text'] = text
            stories_dict['Image'] = img_url
            top_stories.append(stories_dict)
        except AttributeError as ex:
            print('Error:', ex)

get_stories()
To get the data in a dictionary format (dict), you can create a dictionary as follows:
top_stories = {"Headline": [], "Link": [], "Text": [], "Image": []}
and append the correct data to it.
(By the way, when you specified your headers, it should have been a dict, not a set.)
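To make that point concrete, here is the difference side by side (user-agent string shortened for brevity):

# A set literal: values only, no keys - requests cannot use this as headers.
headers_set = {"Mozilla/5.0 (Windows NT 10.0; Win64; x64) ..."}

# A dict literal: header name mapped to header value, which is what requests expects.
headers_dict = {"user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) ..."}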
from bs4 import BeautifulSoup
import requests

def get_stories():
    """user agent to facilitates end-user interaction with web content"""
    headers = {
        "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.101 Safari/537.36"
    }
    top_stories = {"Headline": [], "Link": [], "Text": [], "Image": []}
    base_url = "https://www.jse.co.za/"
    source = requests.get(base_url, headers=headers).text
    soup = BeautifulSoup(source, "html.parser")
    articles = soup.find_all("article", class_="card")
    print(f"Number of articles found: {len(articles)}")
    for article in articles:
        try:
            top_stories["Headline"].append(article.h3.text.strip())
            top_stories["Link"].append(base_url + article.a["href"])
            top_stories["Text"].append(
                article.find("div", class_="field--type-text-with-summary").text.strip()
            )
            top_stories["Image"].append(base_url + article.picture.img["data-src"])
        except AttributeError as ex:
            print("Error:", ex)
    print(type(top_stories))
    print(top_stories)

get_stories()
Output:
Number of articles found: 6
<class 'dict'>
{'Headline': ['South Africa offers investment opportunities to Asia Pacific investors', 'South Africa to showcase investment opportunities to the UAE market', 'South Africa to showcase investment opportunities to UK investors', 'JSE to become 100% owner of JSE Investor Services and expands services to include share plan administration services', 'Thungela Resources lists on the JSE after unbundling from Anglo American', 'JSE welcomes SAB’s B-BBEE scheme that gives investors exposure to AB InBev global market'], 'Link': ['https://www.jse.co.za//news/market-news/south-africa-offers-investment-opportunities-asia-pacific-investors', 'https://www.jse.co.za//news/market-news/south-africa-showcase-investment-opportunities-uae-market', 'https://www.jse.co.za//news/market-news/south-africa-showcase-investment-opportunities-uk-investors', 'https://www.jse.co.za//news/market-news/jse-become-100-owner-jse-investor-services-and-expands-services-include-share-plan', 'https://www.jse.co.za//news/market-news/thungela-resources-lists-jse-after-unbundling-anglo-american', 'https://www.jse.co.za//news/market-news/jse-welcomes-sabs-b-bbee-scheme-gives-investors-exposure-ab-inbev-global-market'], 'Text': ['The Johannesburg Stock Exchange (JSE) and joint sponsors, Citi and Absa Bank are collaborating to host the annual SA Tomorrow Investor conference, which aims to showcase the country’s array of investment opportunities to investors in the Asia Pacific region, mainly from Hong Kong and Singapore.', 'The Johannesburg Stock Exchange (JSE) and joint sponsors, Citi and Absa Bank are collaborating to host the SA Tomorrow Investor conference, which aims to position South Africa as a preferred investment destination for the United Arab Emirates (UAE) market.', 'The Johannesburg Stock Exchange (JSE) and joint sponsors Citi and Absa Bank are collaborating to host the annual SA Tomorrow Investor conference, which aims to showcase the country’s array of investment opportunities to investors in the United Kingdom.', 'The Johannesburg Stock Exchange (JSE) is pleased to announce that it has embarked on a process to incorporate JSE Investor Services Proprietary Limited (JIS) as a wholly owned subsidiary of the JSE by acquiring the minority shareholding of 25.15 % from LMS Partner Holdings.', 'Shares in Thungela Resources, a South African thermal coal exporter, today commenced trading on the commodity counter of the Main Board of the Johannesburg Stock Exchange (JSE).', 'From today, Black South African retail investors will get the opportunity to invest in the world’s largest beer producer, AB InBev, following the listing of SAB Zenzele Kabili on the Johannesburg Stock Exchange’s (JSE) Empowerment Segment.'], 'Image': ['https://www.jse.co.za//sites/default/files/styles/standard_lg/public/medial/images/2021-06/Web_Banner_0.jpg?h=4ae650de&itok=hdGEy5jA', 'https://www.jse.co.za//sites/default/files/styles/standard_lg/public/medial/images/2021-06/Web_Banner2.jpg?h=4ae650de&itok=DgPFtAx8', 'https://www.jse.co.za//sites/default/files/styles/standard_lg/public/medial/images/2021-06/Web_Banner.jpg?h=4ae650de&itok=Q0SsPtAz', 'https://www.jse.co.za//sites/default/files/styles/standard_lg/public/medial/images/2020-12/DSC_0832.jpg?h=156fdada&itok=rL3M2gpn', 'https://www.jse.co.za//sites/default/files/styles/standard_lg/public/medial/images/2021-06/Thungela_Web_Banner_1440x390.jpg?h=4ae650de&itok=kKRO5fQk', 
'https://www.jse.co.za//sites/default/files/styles/standard_lg/public/medial/images/2021-05/SAB-Zenzele.jpg?h=4ae650de&itok=n9osAP33']}
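One further tweak worth considering (not part of the answer above): have get_stories return the dictionary so the caller can store it rather than only print it. A minimal sketch:

def get_stories():
    top_stories = {"Headline": [], "Link": [], "Text": [], "Image": []}
    # ... the scraping loop from the answer above fills these lists ...
    return top_stories

stories = get_stories()   # the dict is now stored in a variable for later use
print(stories["Headline"])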

Can't scrape Google search results with BeautifulSoup

I want to scrape Google search results, but whenever I try to do so, the program returns an empty list:
from bs4 import BeautifulSoup
import requests
keyWord = input("Input Your KeyWord :")
url = f'https://www.google.com/search?q={keyWord}'
src = requests.get(url).text
soup = BeautifulSoup(src, 'lxml')
container = soup.findAll('div', class_='g')
print(container)
Complementing Andrej Kesely's answer: if you're getting empty results, you can always climb one div up or down to test and go from there.
Code (say you want to scrape the title, summary and link):
from bs4 import BeautifulSoup
import requests
import json

headers = {
    'User-agent':
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}

html = requests.get('https://www.google.com/search?q=ice cream', headers=headers).text
soup = BeautifulSoup(html, 'lxml')

summary = []
for container in soup.findAll('div', class_='tF2Cxc'):
    heading = container.find('h3', class_='LC20lb DKV0Md').text
    article_summary = container.find('span', class_='aCOpRe').text
    link = container.find('a')['href']
    summary.append({
        'Heading': heading,
        'Article Summary': article_summary,
        'Link': link,
    })

print(json.dumps(summary, indent=2, ensure_ascii=False))
Portion of output:
[
  {
    "Heading": "Ice cream - Wikipedia",
    "Article Summary": "Ice cream (derived from earlier iced cream or cream ice) is a sweetened frozen food typically eaten as a snack or dessert. It may be made from dairy milk or cream and is flavoured with a sweetener, either sugar or an alternative, and any spice, such as cocoa or vanilla.",
    "Link": "https://en.wikipedia.org/wiki/Ice_cream"
  },
  {
    "Heading": "Jeni's Splendid Ice Creams",
    "Article Summary": "Jeni's Splendid Ice Cream, built from the ground up with superlative ingredients. Order online, visit a scoop shop, or find the closest place to buy Jeni's near you.",
    "Link": "https://jenis.com/"
  }
]
Alternatively, you can do this using Google Search Engine Results API from SerpApi. It's a paid API with a free trial.
import os
from serpapi import GoogleSearch

params = {
    "engine": "google",
    "q": "ice cream",
    "api_key": os.getenv("API_KEY"),
}

search = GoogleSearch(params)
results = search.get_dict()

for result in results["organic_results"]:
    print(f"Title: {result['title']}\nSummary: {result['snippet']}\nLink: {result['link']}\n")
Portion of the output:
Title: Ice cream - Wikipedia
Summary: Ice cream (derived from earlier iced cream or cream ice) is a sweetened frozen food typically eaten as a snack or dessert. It may be made from dairy milk or cream and is flavoured with a sweetener, either sugar or an alternative, and any spice, such as cocoa or vanilla.
Link: https://en.wikipedia.org/wiki/Ice_cream
Title: 6 Ice Cream Shops to Try in Salem, Massachusetts ...
Summary: 6 Ice Cream Shops to Try in Salem, Massachusetts · Maria's Sweet Somethings, 26 Front Street · Kakawa Chocolate House, 173 Essex Street · Melt ...
Link: https://www.salem.org/icecream/
Title: Melt Ice Cream - Salem
Summary: Homemade ice cream made on-site in Salem, MA. Bold innovative flavors, exceptional customer service, local ingredients.
Link: https://meltsalem.com/
Disclaimer, I work for SerpApi.
To get the correct result page from Google, specify the User-Agent HTTP header. For English-only results, add the hl=en parameter to the URL:
from bs4 import BeautifulSoup
import requests
keyWord = input("Input Your KeyWord :")
url = f'https://www.google.com/search?hl=en&q={keyWord}'
headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:77.0) Gecko/20100101 Firefox/77.0'}
src = requests.get(url, headers=headers).text
soup = BeautifulSoup(src, 'lxml')
containers = soup.findAll('div', class_='g')
for c in containers:
    print(c.get_text(strip=True, separator=' '))

How do I scrape web pages where the page number is not shown in the URL, or no link is provided?

Hi, I need to scrape the following web page, but when it comes to pagination I am not able to fetch a URL with a page number:
https://www.nasdaq.com/market-activity/commodities/ho%3Anmx/historical
You could go through the API to iterate through each page (or, in the case of this API, an offset number, since the response tells you how many total records there are). Take the total records, divide by the limit you set, use math.ceil to round up, then iterate over that range, passing each multiple of the limit as the offset parameter.
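A rough sketch of that offset approach follows; the totalRecords field and the offset parameter name are assumptions based on the description above, so verify both against the actual response in the Network tab:

import math
import requests

url = 'https://api.nasdaq.com/api/quote/HO%3ANMX/historical'
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36'}
limit = 100
payload = {'assetclass': 'commodities', 'fromdate': '2020-01-05',
           'todate': '2020-02-05', 'limit': str(limit)}

first = requests.get(url, headers=headers, params=payload).json()
total = first['data']['totalRecords']      # assumed field name - check the actual response
pages = math.ceil(total / limit)           # number of requests needed to cover all records

rows = first['data']['tradesTable']['rows']
for page in range(1, pages):
    payload['offset'] = str(page * limit)  # assumed parameter name - check the Network tab
    more = requests.get(url, headers=headers, params=payload).json()
    rows += more['data']['tradesTable']['rows']

print(len(rows))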
Or, just easier, adjust the limit to something higher, and get it in one request:
import requests
import pandas as pd

url = 'https://api.nasdaq.com/api/quote/HO%3ANMX/historical'
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36'}
payload = {
    'assetclass': 'commodities',
    'fromdate': '2020-01-05',
    'limit': '9999',
    'todate': '2020-02-05'}

data = requests.get(url, headers=headers, params=payload).json()
df = pd.json_normalize(data['data']['tradesTable']['rows'])  # json_normalize now lives in the top-level pandas namespace
Output:
print (df.to_string())
close date high low open volume
0 1.5839 02/04/2020 1.6179 1.5697 1.5699 66,881
1 1.5779 02/03/2020 1.6273 1.5707 1.6188 62,146
2 1.6284 01/31/2020 1.6786 1.6181 1.6677 68,513
3 1.642 01/30/2020 1.699 1.6305 1.6952 70,173
4 1.7043 01/29/2020 1.7355 1.6933 1.7261 69,082
5 1.7162 01/28/2020 1.7303 1.66 1.674 79,852
6 1.6829 01/27/2020 1.7305 1.6598 1.7279 97,184
7 1.7374 01/24/2020 1.7441 1.7369 1.7394 80,351
8 1.7943 01/23/2020 1.7981 1.7558 1.7919 89,084
9 1.8048 01/22/2020 1.811 1.7838 1.7929 90,311
10 1.8292 01/21/2020 1.8859 1.8242 1.8782 53,130
11 1.8637 01/17/2020 1.875 1.8472 1.8669 79,766
12 1.8647 01/16/2020 1.8926 1.8615 1.8866 99,020
13 1.8822 01/15/2020 1.9168 1.8797 1.9043 92,401
14 1.9103 01/14/2020 1.9224 1.8848 1.898 62,254
15 1.898 01/13/2020 1.94 1.8941 1.9366 61,328
16 1.9284 01/10/2020 1.96 1.9262 1.9522 67,329
17 1.9501 01/09/2020 1.9722 1.9282 1.9665 73,527
18 1.9582 01/08/2020 1.9776 1.9648 1.9759 110,514
19 2.0324 01/07/2020 2.0392 2.0065 2.0274 72,421
20 2.0339 01/06/2020 2.103 2.0193 2.0755 87,832

Trying to scrape and segregate into Headings and Contents. The problem is that both have the same class and tags. How to segregate?

I am trying to scrape http://www.intermediary.natwest.com/intermediary-solutions/lending-criteria.html, segregating it into two parts, Heading and Content. The problem is that both have the same class and tags. Other than using regex and hard-coding, how can I distinguish them and extract them into two columns in Excel?
In the picture (https://ibb.co/8X5xY9C), and on the website linked above, bold text (except the single alphabet letters like 'A' and the later 'back to top' links) represents a heading, and the non-bold explanation just below it represents the content. The content can even contain 'li' and 'ul' blocks further down the page, which should fall under the respective heading.
#Code to Start With
from bs4 import BeautifulSoup
import requests

url = "http://www.intermediary.natwest.com/intermediary-solutions/lending-criteria.html"
html = requests.get(url)
soup = BeautifulSoup(html.text, "html.parser")
Heading = soup.findAll('strong')
content = soup.findAll('div', {"class": "comp-rich-text"})
The output Excel file should look something like this:
https://i.stack.imgur.com/NsMmm.png
I've thought about it a little more and thought of a better solution. Rather than "crowd" my initial solution, I chose to add a 2nd solution here:
So, thinking about it again and following my logic of splitting the HTML by the headlines (essentially breaking it up wherever we find <strong> tags), I chose to convert the sections to strings using .prettify(), split on those specific strings/tags, and read the pieces back into BeautifulSoup to pull the text. From what I can see it hasn't missed anything, but you'll have to search through the dataframe to double-check:
import requests
from bs4 import BeautifulSoup
import pandas as pd

url = 'http://www.intermediary.natwest.com/intermediary-solutions/lending-criteria.html'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36'}

response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.text, 'html.parser')

sections = soup.find_all('div', {'class': 'accordion-section-content'})
results = {}
for section in sections:
    splits = section.prettify().split('<strong>')
    for each in splits:
        try:
            headline, content = each.split('</strong>')[0].strip(), each.split('</strong>')[1]
            headline = BeautifulSoup(headline, 'html.parser').text.strip()
            content = BeautifulSoup(content, 'html.parser').text.strip()
            content_split = content.split('\n')
            content = ' '.join([text.strip() for text in content_split if text != ''])
            results[headline] = content
        except:
            continue

df = pd.DataFrame(results.items(), columns=['Headings', 'Content'])
df.to_csv('C:/test.csv', index=False)
Output:
print (df)
Headings Content
0 Age requirements Applicants must be at least 18 years old at th...
1 Affordability Our affordability calculator is the same one u...
2 Agricultural restriction The only acceptable agricultural tie is where ...
3 Annual percentage rate of charge (APRC) The APRC is all fees associated with the mortg...
4 Adverse credit We consult credit reference agencies to look a...
5 Applicants (number of) The maximum number of applicants is two.
6 Armed Forces personnel Unsecured personal loans are only acceptable f...
7 Back to back Back to back is typically where the vendor has...
8 Customer funded purchase: when the customer has funded the purchase usin...
9 Bridging: residential mortgage applications where the cu...
10 Inherited: a recently inherited property where the benefi...
11 Porting: where a fixed/discounted rate was ported to a ...
12 Repossessed property: where the vendor is the mortgage lender in pos...
13 Part exchange: where the vendor is a large national house bui...
14 Bank statements We accept internet bank statements in paper fo...
15 Bonus For guaranteed bonuses we will consider an ave...
16 British National working overseas Applicants must be resident in the UK. Applica...
17 Builder's Incentives The maximum amount of acceptable incentive is ...
18 Buy-to-let (purpose) A buy-to-let mortgage can be used for: Purcha...
19 Capital Raising - Acceptable purposes permanent home improvem...
20 Buy-to-let (affordability) Buy to Let affordability must be assessed usin...
21 Buy-to-let (eligibility criteria) The property must be in England, Scotland, Wal...
22 Definition of a portfolio landlord We define a portfolio landlord as a customer w...
23 Carer's Allowance Carer's Allowance is paid to people aged 16 or...
24 Cashback Where a mortgage product includes a cashback f...
25 Casual employment Contract/agency workers with income paid throu...
26 Certification of documents When submitting copies of documents, please en...
27 Child Benefit We can accept up to 100% of working tax credit...
28 Childcare costs We use the actual amount the customer has decl...
29 When should childcare costs not be included? There are a number of situations where childca...
.. ... ...
108 Shared equity We lend on the Government-backed shared equity...
109 Shared ownership We do not lend against Shared Ownership proper...
110 Solicitors' fees We have a panel of solicitors for our fees ass...
111 Source of deposit We reserve the right to ask for proof of depos...
112 Sole trader/partnerships We will take an average of the last two years'...
113 Standard variable rate A standard variable rate (SVR) is a type of v...
114 Student loans Repayment of student loans is dependent on rec...
115 Tenure Acceptable property tenure: Feuhold, Freehold,...
116 Term Minimum term is 3 years Residential - Maximum...
117 Unacceptable income types The following forms of income are classed as u...
118 Bereavement allowance: paid to widows, widowers or surviving civil pa...
119 Employee benefit trusts (EBT): this is a tax mitigation scheme used in conjun...
120 Expenses: not acceptable as they're paid to reimburse pe...
121 Housing Benefit: payment of full or partial contribution to cla...
122 Income Support: payment for people on low incomes, working les...
123 Job Seeker's Allowance: paid to people who are unemployed or working 1...
124 Stipend: a form of salary paid for internship/apprentic...
125 Third Party Income: earned by a spouse, partner, parent who are no...
126 Universal Credit: only certain elements of the Universal Credit ...
127 Universal Credit The Standard Allowance element, which is the n...
128 Valuations: day one instruction We are now instructing valuations on day one f...
129 Valuation instruction A valuation will be automatically instructed w...
130 Valuation fees A valuation will always be obtained using a pa...
131 Please note: W hen upgrading the free valuation for a home...
132 Adding fees to the loan Product fees are the only fees which can be ad...
133 Product fee This fee is paid when the mortgage is arranged...
134 Working abroad Previously, we required applicants to be empl...
135 Acceptable - We may consider applications from people who: ...
136 Not acceptable - We will not consider applications from people...
137 Working and Family Tax Credits We can accept up to 100% of Working Tax Credit...
[138 rows x 2 columns]
EDIT: SEE OTHER SOLUTION PROVIDED
It's tricky. I tried essentially to grab the headings, then use them to grab all the text that comes after each heading and precedes the next one. The code below is a little messy and requires some cleaning up, but hopefully it gets you to a point you can work with, or moves you in the right direction:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import re

url = 'http://www.intermediary.natwest.com/intermediary-solutions/lending-criteria.html'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36'}

response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.text, 'html.parser')

sections = soup.find_all('div', {'class': 'accordion-section-content'})
results = {}
for section in sections:
    headlines = section.find_all('strong')
    headlines = [each.text for each in headlines]
    for i, headline in enumerate(headlines):
        if headline != headlines[-1]:
            next_headline = headlines[i + 1]
        else:
            next_headline = ''
        try:
            find_content = section(text=headline)[0].parent.parent.find_next_siblings()
            if ':' in headline and 'Gifted deposit' not in headline and 'Help to Buy' not in headline:
                content = section(text=headline)[0].parent.nextSibling
                results[headline] = content.strip()
                break
        except:
            find_content = section(text=re.compile(headline))[0].parent.parent.find_next_siblings()
        if find_content == []:
            try:
                find_content = section(text=headline)[0].parent.parent.parent.find_next_siblings()
            except:
                find_content = section(text=re.compile(headline))[0].parent.parent.parent.find_next_siblings()
        content = []
        for sibling in find_content:
            if next_headline not in sibling.text or headline == headlines[-1]:
                content.append(sibling.text)
            else:
                content = '\n'.join(content)
                results[headline.strip()] = content.strip()
                break
        if headline == headlines[-1]:
            content = '\n'.join(content)
            results[headline] = content.strip()

df = pd.DataFrame(results.items())
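Since the question asks for two columns in Excel, the DataFrame built by either solution can be written out with to_excel instead of to_csv. A sketch (requires the openpyxl package; the small results dict here just stands in for the one built above):

import pandas as pd

# Stand-in for the `results` dict built by either solution above.
results = {"Age requirements": "Applicants must be at least 18 years old at the time of application."}

df = pd.DataFrame(results.items(), columns=["Headings", "Content"])
df.to_excel("lending_criteria.xlsx", index=False)   # needs openpyxl installed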

Resources