find_elements by CSS_SELECTOR in Python Selenium - python-3.x

I realized Selenium removed some attributes, and my code is no longer able to use the each_item.find_element(By.CSS_SELECTOR, ...) statement:
for i in range(pagenum):
    driver.get(f"https://www.adiglobaldistribution.us/search?attributes=dd1a8f50-5ac8-ec11-a837-000d3a006ffb&page={i}&criteria=Tp-link")
    time.sleep(5)
    wait = WebDriverWait(driver, 10)
    search_items = driver.find_elements(By.CSS_SELECTOR, "[class='rd-thumb-details-price']")
    for each_item in search_items:
        item_title = each_item.find_element(By.CSS_SELECTOR, "span[class='rd-item-name-desc']").text
        item_name = each_item.find_element(By.CSS_SELECTOR, "span[class='item-num-mfg']").text[7:]
        item_link = each_item.find_element(By.CSS_SELECTOR, "div[class='item-thumb'] a").get_attribute('href')
        item_price = each_item.find_element(By.CSS_SELECTOR, "div[class='rd-item-price rd-item-price--list']").text[2:].replace("\n", ".")
        item_stock = each_item.find_element(By.CSS_SELECTOR, "div[class='rd-item-price']").text[19:]
        table = {"title": item_title, "name": item_name, "Price": item_price, "Stock": item_stock, "link": item_link}
        data_adi.append(table)
Error:

You are probably approaching the whole situation the wrong way. Those products are hydrated into the page by JavaScript once the page loads, so you can scrape the API endpoint directly and avoid the complexities (and slowness) of Selenium. Here is a solution based on requests and pandas, scraping the API endpoint (found under Dev tools - Network tab):
import requests
import pandas as pd

headers = {
    'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.0.0 Safari/537.36"
}

full_df = pd.DataFrame()
for x in range(1, 4):
    r = requests.get(f'https://www.adiglobaldistribution.us/api/v2/adiglobalproducts/?applyPersonalization=true&boostIds=&categoryId=16231864-9ed5-4536-a8b3-ae870078e9f7&expand=pricing,brand&getAllAttributeFacets=false&hasMarketingTileContent=false&includeAttributes=IncludeOnProduct&includeSuggestions=false&makeBrandUrls=false&page={x}&pageSize=36&previouslyPurchasedProducts=false&query=&searchWithin=&sort=Bestseller', headers=headers)
    df = pd.json_normalize(r.json()['products'])
    full_df = pd.concat([full_df, df], axis=0, ignore_index=True)

# print([x for x in full_df.columns])
print(full_df[['basicListPrice', 'modelNumber', 'name', 'properties.countrY_OF_ORIGIN', 'productDetailUrl', 'properties.minimuM_QTY', 'properties.onsalenow']])
Result printed in terminal:
basicListPrice modelNumber name properties.countrY_OF_ORIGIN productDetailUrl properties.minimuM_QTY properties.onsalenow
0 51.99 TL-SG1005P TP-Link TL-SG1005P 5-Port Gigabit Desktop Switch with 4-Port PoE China /Catalog/shop-brands/tp-link/FP-TLSG1005P 1 0
1 81.99 C7 TP-Link ARCHER C7 AC1750 Wireless Dual Band Gigabit Router China /Catalog/shop-brands/tp-link/FP-ARCHERC7 1 0
2 18.99 TL-POE150S TP-Link TL-POE150S PoE Injector, IEEE 802.3af Compliant China /Catalog/shop-brands/tp-link/FP-TLPOE150S 1 0
3 19.99 TL-WR841N TP-Link TL-WR841N 300Mbps Wireless N Router China /Catalog/shop-brands/tp-link/FP-TLWR841N 1 0
4 43.99 TL-PA4010 KIT TP-Link TL-PA4010KIT AV600 600Mbps Powerline Starter Kit China /Catalog/shop-brands/tp-link/FP-TLPA4010K 1 0
... ... ... ... ... ... ... ...
85 76.99 TL-SL1311MP TP-Link TL-SL1311MP 8-Port 10/100mbps + 3-Port Gigabit Desktop Switch With 8-Port PoE+ /Catalog/shop-brands/tp-link/FP-TSL1311MP 1 0
86 35.99 C20 TP-Link ARCHER C20 IEEE 802.11ac Ethernet Wireless Router China /Catalog/shop-brands/tp-link/FP-ARCHERC20 1 0
87 29.99 TL-WR802N TP-Link TL-WR802N 300Mbps Wireless N Nano Router, Pocket Size China /Catalog/shop-brands/tp-link/FP-TLWR802N 1 0
88 100.99 EAP610 TP-Link EAP610_V2 AX1800 CEILING MOUNT WI-FI 6" China /Catalog/shop-brands/tp-link/FP-EAP610V2 1 0
89 130.99 EAP650 TP-Link EAP650 AX3000 Ceiling Mount Wi-Fi 6 Access Point China /Catalog/shop-brands/tp-link/FP-EAP650 1 0
90 rows × 7 columns
You can further inspect that JSON response and see if there is more useful information you need from there.
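For example (a small sketch that reuses the r and full_df variables from the loop above, so it should run right after that snippet), you could list the top-level keys of the last response and every column that json_normalize produced:
# Inspect the last API response and the flattened DataFrame built above
print(list(r.json().keys()))        # top-level keys of the JSON payload
print(full_df.columns.tolist())     # every column available after json_normalize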
Relevant pandas docs: https://pandas.pydata.org/docs/reference/api/pandas.json_normalize.html
And for requests docs, see https://requests.readthedocs.io/en/latest/

Related

Extract only ASINs from a product listing page where the price is visible on Amazon

I am trying to generate the URLs of the products where the price is visible on the listing page, i.e. https://www.amazon.com/s?k=ps5&rh=p_36%3A27500-65000, and my goal is to skip the remaining ASINs where the price is not on the listing page.
The logic I came up with is something like this:
1. Grab the tag which contains all the product listings.
2. Filter the tag with an if/else condition to extract only the products that have a price.
I am struggling with the execution; I did some web scraping a few months ago, and right now I am a bit rusty and trying to get back up to speed, so any help would be much appreciated.
Here is my function:
from requests_html import HTMLSession

s = HTMLSession()

def get_product_links(session):
    # https://www.amazon.com/s?k=ps5&rh=p_36%3A27500-65000
    url = session.get(
        base_url + search_term + price_filter,
        headers=headers,
    )
    print(url.status_code)
    tag = url.html.find("div[data-component-type=s-search-result]")
    price_tag = [pr.find("span.a-offscreen", first=True) for pr in tag]
    print(price_tag)
    check_price = [price.text for price in price_tag if price != None]
    print(check_price)
    if len(check_price) > 0:
        product_asins = [
            asin.attrs["data-asin"]
            for asin in url.html.find("div[data-asin]")
            if asin.attrs["data-asin"] != ""
        ]
        product_link = [
            "https://www.amazon.com/dp/" + link for link in product_asins
        ]
        return product_link
    else:
        print("Skipping Product...")
To get links only for products which have a price, you can use this example:
import requests
from bs4 import BeautifulSoup

headers = {
    "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:89.0) Gecko/20100101 Firefox/89.0",
    "Accept-Language": "en-US,en;q=0.5",
}

url = "https://www.amazon.com/s?k=ps5&rh=p_36%3A27500-65000"
soup = BeautifulSoup(requests.get(url, headers=headers).content, "html.parser")

for asin in soup.select("[data-asin]"):
    num = asin["data-asin"].strip()
    price = asin.select_one(".a-price .a-offscreen")
    if num and price:
        print(asin.h2.text)
        print(price.text, "https://www.amazon.com/dp/{}".format(num))
        print()
Prints:
HexGaming Esports Ultimate Controller 4 Remap Buttons & Interchangeable Thumbsticks & Hair Trigger Compatible with PS5 Customized Controller PC Wireless FPS Esport Gamepad - Wild Attack
$289.99 https://www.amazon.com/dp/B09KMYCY1C
Samsung Electronics 980 PRO SSD with Heatsink 2TB PCIe Gen 4 NVMe M.2 Internal Solid State Hard Drive, Heat Control, Max Speed, PS5 Compatible, MZ-V8P2T0CW
$349.99 https://www.amazon.com/dp/B09JHKSNNG
G-STORY 15.6" Inch IPS 4k 60Hz Portable Monitor Gaming display Integrated with PS5(not included) 3840×2160 With 2 HDMI ports,FreeSync,Built-in 2 of Multimedia Stereo Speaker,UL Certificated AC Adapter
$379.99 https://www.amazon.com/dp/B073ZJ1K8G
Thrustmaster T248, Racing Wheel and Magnetic Pedals, HYBRID DRIVE, Magnetic Paddle Shifters, Dynamic Force Feedback, Screen with Racing Information (PS5, PS4, PC)
$399.99 https://www.amazon.com/dp/B08Z5CX6V2
WD_BLACK 1TB SN850 NVMe Internal Gaming SSD Solid State Drive with Heatsink - Works with Playstation 5, Gen4 PCIe, M.2 2280, Up to 7,000 MB/s - WDS100T1XHE
$189.99 https://www.amazon.com/dp/B08PHSVW7K
Thrustmaster T300 RS - Gran Turismo Edition Racing Wheel (PS5,PS4,PC)
$449.99 https://www.amazon.com/dp/B01M1L2NRL
Seagate FireCuda 530 2TB Internal Solid State Drive - M.2 PCIe Gen4 ×4 NVMe 1.4, PS5 Internal SSD, speeds up to 7300MB/s, 3D TLC NAND, 2550 TBW, 1.8M MTBF, Heatsink, Rescue Services (ZP2000GM3A023)
$399.99 https://www.amazon.com/dp/B0977K2C74
OWC 2TB Aura P12 Pro NVMe M.2 SSD
$329.00 https://www.amazon.com/dp/B07VZ79XQ6
Sabrent 2TB Rocket 4 Plus NVMe 4.0 Gen4 PCIe M.2 Internal Extreme Performance SSD + M.2 NVMe Heatsink for The PS5 Console (SB-RKT4P-PSHS-2TB)
$329.99 https://www.amazon.com/dp/B09G2MZ4VR
Sony Playstation PS4 1TB Black Console
$468.00 https://www.amazon.com/dp/B012CZ41ZA
Thrustmaster TH8A Shifter (PS5, PS4, XBOX Series X/S, One, PC)
$199.99 https://www.amazon.com/dp/B005L0Z2BQ
WD_BLACK 2TB P50 Game Drive SSD - Portable External Solid State Drive, Compatible with Playstation, Xbox, PC, & Mac, Up to 2,000 MB/s - WDBA3S0020BBK-WESN
$348.99 https://www.amazon.com/dp/B07YFG9PG2
Logitech G923 Racing Wheel and Pedals for PS 5, PS4 and PC featuring TRUEFORCE up to 1000 Hz Force Feedback, Responsive Pedal, Dual Clutch Launch Control, and Genuine Leather Wheel Cover
$399.98 https://www.amazon.com/dp/B07PFB72NL
GIGABYTE AORUS Gen4 7000s SSD 2TB PCIe 4.0 NVMe M.2, Nanocarbon Coated Aluminum Heatsink, 3D TLC NAND, SSD- GP-AG70S2TB
$319.99 https://www.amazon.com/dp/B08XY93JT3
THRUSTMASTER T-LCM Pedals (PS5, PS4, XBOX Series X/S, One, PC
$229.99 https://www.amazon.com/dp/B083MNB4D8
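If you want the links back as a list (as the original get_product_links function intended) rather than printed, a small variation of the loop above could collect them instead; this is just a sketch built on the same soup object and selectors as the example:
# Collect links for priced products into a list instead of printing them
product_links = [
    "https://www.amazon.com/dp/{}".format(asin["data-asin"].strip())
    for asin in soup.select("[data-asin]")
    if asin["data-asin"].strip() and asin.select_one(".a-price .a-offscreen")
]
print(product_links)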

Creating a function for my python web scraper that will output a dictionary

I have created my web scraper and added a function, but unfortunately my function is not being called and the output is not coming out as a dictionary. How do I create and call the function and store the output as a dictionary? Below is my code and function so far.
from bs4 import BeautifulSoup
import requests

top_stories = []

def get_stories():
    """ user agent to facilitates end-user interaction with web content"""
    headers = {
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.101 Safari/537.36'
    }
    base_url = 'www.example.com'
    source = requests.get(base_url).text
    soup = BeautifulSoup(source, 'html.parser')
    articles = soup.find_all("article", class_="card")
    print(f"Number of articles found: {len(articles)}")
    for article in articles:
        try:
            headline = article.h3.text.strip()
            link = base_url + article.a['href']
            text = article.find("div", class_="field--type-text-with-summary").text.strip()
            img_url = base_url + article.picture.img['data-src']
            print(headline, link, text, img_url)
            stories_dict = {}
            stories_dict['Headline'] = headline
            stories_dict['Link'] = link
            stories_dict['Text'] = text
            stories_dict['Image'] = img_url
            top_stories.append(stories_dict)
        except AttributeError as ex:
            print('Error:', ex)

get_stories()
To get the data in a dictionary format (dict), you can create a dictionary as follows:
top_stories = {"Headline": [], "Link": [], "Text": [], "Image": []}
and append the correct data to it.
(By the way, where you specified your headers, it should have been a dict, not a set.)
from bs4 import BeautifulSoup
import requests

def get_stories():
    """user agent to facilitates end-user interaction with web content"""
    headers = {
        "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.101 Safari/537.36"
    }
    top_stories = {"Headline": [], "Link": [], "Text": [], "Image": []}
    base_url = "https://www.jse.co.za/"
    source = requests.get(base_url, headers=headers).text
    soup = BeautifulSoup(source, "html.parser")
    articles = soup.find_all("article", class_="card")
    print(f"Number of articles found: {len(articles)}")
    for article in articles:
        try:
            top_stories["Headline"].append(article.h3.text.strip())
            top_stories["Link"].append(base_url + article.a["href"])
            top_stories["Text"].append(
                article.find("div", class_="field--type-text-with-summary").text.strip()
            )
            top_stories["Image"].append(base_url + article.picture.img["data-src"])
        except AttributeError as ex:
            print("Error:", ex)
    print(type(top_stories))
    print(top_stories)

get_stories()
Output:
Number of articles found: 6
<class 'dict'>
{'Headline': ['South Africa offers investment opportunities to Asia Pacific investors', 'South Africa to showcase investment opportunities to the UAE market', 'South Africa to showcase investment opportunities to UK investors', 'JSE to become 100% owner of JSE Investor Services and expands services to include share plan administration services', 'Thungela Resources lists on the JSE after unbundling from Anglo American', 'JSE welcomes SAB’s B-BBEE scheme that gives investors exposure to AB InBev global market'], 'Link': ['https://www.jse.co.za//news/market-news/south-africa-offers-investment-opportunities-asia-pacific-investors', 'https://www.jse.co.za//news/market-news/south-africa-showcase-investment-opportunities-uae-market', 'https://www.jse.co.za//news/market-news/south-africa-showcase-investment-opportunities-uk-investors', 'https://www.jse.co.za//news/market-news/jse-become-100-owner-jse-investor-services-and-expands-services-include-share-plan', 'https://www.jse.co.za//news/market-news/thungela-resources-lists-jse-after-unbundling-anglo-american', 'https://www.jse.co.za//news/market-news/jse-welcomes-sabs-b-bbee-scheme-gives-investors-exposure-ab-inbev-global-market'], 'Text': ['The Johannesburg Stock Exchange (JSE) and joint sponsors, Citi and Absa Bank are collaborating to host the annual SA Tomorrow Investor conference, which aims to showcase the country’s array of investment opportunities to investors in the Asia Pacific region, mainly from Hong Kong and Singapore.', 'The Johannesburg Stock Exchange (JSE) and joint sponsors, Citi and Absa Bank are collaborating to host the SA Tomorrow Investor conference, which aims to position South Africa as a preferred investment destination for the United Arab Emirates (UAE) market.', 'The Johannesburg Stock Exchange (JSE) and joint sponsors Citi and Absa Bank are collaborating to host the annual SA Tomorrow Investor conference, which aims to showcase the country’s array of investment opportunities to investors in the United Kingdom.', 'The Johannesburg Stock Exchange (JSE) is pleased to announce that it has embarked on a process to incorporate JSE Investor Services Proprietary Limited (JIS) as a wholly owned subsidiary of the JSE by acquiring the minority shareholding of 25.15 % from LMS Partner Holdings.', 'Shares in Thungela Resources, a South African thermal coal exporter, today commenced trading on the commodity counter of the Main Board of the Johannesburg Stock Exchange (JSE).', 'From today, Black South African retail investors will get the opportunity to invest in the world’s largest beer producer, AB InBev, following the listing of SAB Zenzele Kabili on the Johannesburg Stock Exchange’s (JSE) Empowerment Segment.'], 'Image': ['https://www.jse.co.za//sites/default/files/styles/standard_lg/public/medial/images/2021-06/Web_Banner_0.jpg?h=4ae650de&itok=hdGEy5jA', 'https://www.jse.co.za//sites/default/files/styles/standard_lg/public/medial/images/2021-06/Web_Banner2.jpg?h=4ae650de&itok=DgPFtAx8', 'https://www.jse.co.za//sites/default/files/styles/standard_lg/public/medial/images/2021-06/Web_Banner.jpg?h=4ae650de&itok=Q0SsPtAz', 'https://www.jse.co.za//sites/default/files/styles/standard_lg/public/medial/images/2020-12/DSC_0832.jpg?h=156fdada&itok=rL3M2gpn', 'https://www.jse.co.za//sites/default/files/styles/standard_lg/public/medial/images/2021-06/Thungela_Web_Banner_1440x390.jpg?h=4ae650de&itok=kKRO5fQk', 
'https://www.jse.co.za//sites/default/files/styles/standard_lg/public/medial/images/2021-05/SAB-Zenzele.jpg?h=4ae650de&itok=n9osAP33']}
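One further tweak worth considering: instead of only printing top_stories inside get_stories(), you could return it so the caller can keep working with the dictionary. A minimal sketch (the scraping loop itself is elided and would be the same as above):
def get_stories():
    top_stories = {"Headline": [], "Link": [], "Text": [], "Image": []}
    # ... same scraping loop as above, filling top_stories ...
    return top_stories   # hand the dictionary back to the caller

stories = get_stories()
print(type(stories))     # <class 'dict'>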

How do I scrape web pages where the page number is not shown in the URL, or no link is provided?

Hi, I need to scrape the following web pages, but when it comes to pagination I am not able to fetch a URL with a page number.
https://www.nasdaq.com/market-activity/commodities/ho%3Anmx/historical
You could go through the API to iterate through each page (or, in the case of this API, an offset number, since it tells you how many total records there are): take the total records, divide by the limit, and use math.ceil to round up; then iterate the range from 1 to that number, using the multiple of the limit as the offset parameter.
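A rough sketch of that offset approach (the totalRecords field and the offset parameter are assumptions here; check the actual JSON and query string in the Network tab):
import math
import requests

url = 'https://api.nasdaq.com/api/quote/HO%3ANMX/historical'
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36'}
payload = {'assetclass': 'commodities', 'fromdate': '2020-01-05',
           'todate': '2020-02-05', 'limit': 50, 'offset': 0}

first = requests.get(url, headers=headers, params=payload).json()
total = first['data']['totalRecords']          # assumed field name for the total record count
pages = math.ceil(total / payload['limit'])    # round up so the last partial page is included

rows = first['data']['tradesTable']['rows']
for page in range(1, pages):
    payload['offset'] = page * payload['limit']   # assumed offset parameter
    resp = requests.get(url, headers=headers, params=payload).json()
    rows += resp['data']['tradesTable']['rows']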
Or, just easier, adjust the limit to something higher, and get it in one request:
import requests
from pandas.io.json import json_normalize

url = 'https://api.nasdaq.com/api/quote/HO%3ANMX/historical'
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36'}

payload = {
    'assetclass': 'commodities',
    'fromdate': '2020-01-05',
    'limit': '9999',
    'todate': '2020-02-05'}

data = requests.get(url, headers=headers, params=payload).json()
df = json_normalize(data['data']['tradesTable']['rows'])
Output:
print (df.to_string())
close date high low open volume
0 1.5839 02/04/2020 1.6179 1.5697 1.5699 66,881
1 1.5779 02/03/2020 1.6273 1.5707 1.6188 62,146
2 1.6284 01/31/2020 1.6786 1.6181 1.6677 68,513
3 1.642 01/30/2020 1.699 1.6305 1.6952 70,173
4 1.7043 01/29/2020 1.7355 1.6933 1.7261 69,082
5 1.7162 01/28/2020 1.7303 1.66 1.674 79,852
6 1.6829 01/27/2020 1.7305 1.6598 1.7279 97,184
7 1.7374 01/24/2020 1.7441 1.7369 1.7394 80,351
8 1.7943 01/23/2020 1.7981 1.7558 1.7919 89,084
9 1.8048 01/22/2020 1.811 1.7838 1.7929 90,311
10 1.8292 01/21/2020 1.8859 1.8242 1.8782 53,130
11 1.8637 01/17/2020 1.875 1.8472 1.8669 79,766
12 1.8647 01/16/2020 1.8926 1.8615 1.8866 99,020
13 1.8822 01/15/2020 1.9168 1.8797 1.9043 92,401
14 1.9103 01/14/2020 1.9224 1.8848 1.898 62,254
15 1.898 01/13/2020 1.94 1.8941 1.9366 61,328
16 1.9284 01/10/2020 1.96 1.9262 1.9522 67,329
17 1.9501 01/09/2020 1.9722 1.9282 1.9665 73,527
18 1.9582 01/08/2020 1.9776 1.9648 1.9759 110,514
19 2.0324 01/07/2020 2.0392 2.0065 2.0274 72,421
20 2.0339 01/06/2020 2.103 2.0193 2.0755 87,832

Trying to scrape and segregate into Headings and Contents. The problem is that both have the same class and tags. How to segregate?

I am trying to web scrape http://www.intermediary.natwest.com/intermediary-solutions/lending-criteria.html, segregating it into two parts, Heading and Content. The problem is that both have the same class and tags. Other than using regex and hard coding, how can I distinguish them and extract them into two columns in Excel?
In the picture (https://ibb.co/8X5xY9C), or on the website linked above, bold text (except the alphabet letters such as 'A' and the later 'back to top' links) represents a heading, and the non-bold explanation just below it represents the content (the content even consists of 'li' and 'ul' blocks later in the page, which should come under the respective heading).
#Code to Start With
from bs4 import BeautifulSoup
import requests
url = "http://www.intermediary.natwest.com/intermediary-solutions/lending-criteria.html";
html = requests.get(url)
soup = BeautifulSoup(html.text, "html.parser")
Heading = soup.findAll('strong')
content = soup.findAll('div', {"class": "comp-rich-text"})
The output Excel should look something like this:
https://i.stack.imgur.com/NsMmm.png
I've thought about it a little more and thought of a better solution. Rather than "crowd" my initial solution, I chose to add a 2nd solution here:
So thinking about it again, and following my logic of splitting the HTML by the headlines (essentially breaking it up where we find <strong> tags), I chose to convert to strings using .prettify(), and then split on those specific strings/tags and read them back into BeautifulSoup to pull the text. From what I see, it looks like it hasn't missed anything, but you'll have to search through the dataframe to double check:
import requests
from bs4 import BeautifulSoup
import pandas as pd

url = 'http://www.intermediary.natwest.com/intermediary-solutions/lending-criteria.html'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36'}

response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
sections = soup.find_all('div', {'class': 'accordion-section-content'})

results = {}
for section in sections:
    splits = section.prettify().split('<strong>')
    for each in splits:
        try:
            headline, content = each.split('</strong>')[0].strip(), each.split('</strong>')[1]
            headline = BeautifulSoup(headline, 'html.parser').text.strip()
            content = BeautifulSoup(content, 'html.parser').text.strip()
            content_split = content.split('\n')
            content = ' '.join([text.strip() for text in content_split if text != ''])
            results[headline] = content
        except:
            continue

df = pd.DataFrame(results.items(), columns=['Headings', 'Content'])
df.to_csv('C:/test.csv', index=False)
Output:
print (df)
Headings Content
0 Age requirements Applicants must be at least 18 years old at th...
1 Affordability Our affordability calculator is the same one u...
2 Agricultural restriction The only acceptable agricultural tie is where ...
3 Annual percentage rate of charge (APRC) The APRC is all fees associated with the mortg...
4 Adverse credit We consult credit reference agencies to look a...
5 Applicants (number of) The maximum number of applicants is two.
6 Armed Forces personnel Unsecured personal loans are only acceptable f...
7 Back to back Back to back is typically where the vendor has...
8 Customer funded purchase: when the customer has funded the purchase usin...
9 Bridging: residential mortgage applications where the cu...
10 Inherited: a recently inherited property where the benefi...
11 Porting: where a fixed/discounted rate was ported to a ...
12 Repossessed property: where the vendor is the mortgage lender in pos...
13 Part exchange: where the vendor is a large national house bui...
14 Bank statements We accept internet bank statements in paper fo...
15 Bonus For guaranteed bonuses we will consider an ave...
16 British National working overseas Applicants must be resident in the UK. Applica...
17 Builder's Incentives The maximum amount of acceptable incentive is ...
18 Buy-to-let (purpose) A buy-to-let mortgage can be used for: Purcha...
19 Capital Raising - Acceptable purposes permanent home improvem...
20 Buy-to-let (affordability) Buy to Let affordability must be assessed usin...
21 Buy-to-let (eligibility criteria) The property must be in England, Scotland, Wal...
22 Definition of a portfolio landlord We define a portfolio landlord as a customer w...
23 Carer's Allowance Carer's Allowance is paid to people aged 16 or...
24 Cashback Where a mortgage product includes a cashback f...
25 Casual employment Contract/agency workers with income paid throu...
26 Certification of documents When submitting copies of documents, please en...
27 Child Benefit We can accept up to 100% of working tax credit...
28 Childcare costs We use the actual amount the customer has decl...
29 When should childcare costs not be included? There are a number of situations where childca...
.. ... ...
108 Shared equity We lend on the Government-backed shared equity...
109 Shared ownership We do not lend against Shared Ownership proper...
110 Solicitors' fees We have a panel of solicitors for our fees ass...
111 Source of deposit We reserve the right to ask for proof of depos...
112 Sole trader/partnerships We will take an average of the last two years'...
113 Standard variable rate A standard variable rate (SVR) is a type of v...
114 Student loans Repayment of student loans is dependent on rec...
115 Tenure Acceptable property tenure: Feuhold, Freehold,...
116 Term Minimum term is 3 years Residential - Maximum...
117 Unacceptable income types The following forms of income are classed as u...
118 Bereavement allowance: paid to widows, widowers or surviving civil pa...
119 Employee benefit trusts (EBT): this is a tax mitigation scheme used in conjun...
120 Expenses: not acceptable as they're paid to reimburse pe...
121 Housing Benefit: payment of full or partial contribution to cla...
122 Income Support: payment for people on low incomes, working les...
123 Job Seeker's Allowance: paid to people who are unemployed or working 1...
124 Stipend: a form of salary paid for internship/apprentic...
125 Third Party Income: earned by a spouse, partner, parent who are no...
126 Universal Credit: only certain elements of the Universal Credit ...
127 Universal Credit The Standard Allowance element, which is the n...
128 Valuations: day one instruction We are now instructing valuations on day one f...
129 Valuation instruction A valuation will be automatically instructed w...
130 Valuation fees A valuation will always be obtained using a pa...
131 Please note: W hen upgrading the free valuation for a home...
132 Adding fees to the loan Product fees are the only fees which can be ad...
133 Product fee This fee is paid when the mortgage is arranged...
134 Working abroad Previously, we required applicants to be empl...
135 Acceptable - We may consider applications from people who: ...
136 Not acceptable - We will not consider applications from people...
137 Working and Family Tax Credits We can accept up to 100% of Working Tax Credit...
[138 rows x 2 columns]
EDIT: SEE OTHER SOLUTION PROVIDED
It's tricky. I essentially tried to grab the headings, then use those to grab all the text that follows each heading and precedes the next heading. The code below is a little messy and requires some cleaning up, but hopefully it gets you to a point you can work from, or moves you in the right direction:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import re

url = 'http://www.intermediary.natwest.com/intermediary-solutions/lending-criteria.html'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36'}

response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
sections = soup.find_all('div', {'class': 'accordion-section-content'})

results = {}
for section in sections:
    headlines = section.find_all('strong')
    headlines = [each.text for each in headlines]
    for i, headline in enumerate(headlines):
        if headline != headlines[-1]:
            next_headline = headlines[i+1]
        else:
            next_headline = ''
        try:
            find_content = section(text=headline)[0].parent.parent.find_next_siblings()
            if ':' in headline and 'Gifted deposit' not in headline and 'Help to Buy' not in headline:
                content = section(text=headline)[0].parent.nextSibling
                results[headline] = content.strip()
                break
        except:
            find_content = section(text=re.compile(headline))[0].parent.parent.find_next_siblings()
        if find_content == []:
            try:
                find_content = section(text=headline)[0].parent.parent.parent.find_next_siblings()
            except:
                find_content = section(text=re.compile(headline))[0].parent.parent.parent.find_next_siblings()
        content = []
        for sibling in find_content:
            if next_headline not in sibling.text or headline == headlines[-1]:
                content.append(sibling.text)
            else:
                content = '\n'.join(content)
                results[headline.strip()] = content.strip()
                break
        if headline == headlines[-1]:
            content = '\n'.join(content)
            results[headline] = content.strip()

df = pd.DataFrame(results.items())

How to handle such errors?

companies = pd.read_csv("http://www.richard-muir.com/data/public/csv/CompaniesRevenueEmployees.csv", index_col = 0)
companies.head()
I'm getting this error; please suggest what approaches should be tried:
"'utf-8' codec can't decode byte 0xb7 in position 7"
Try encoding as 'latin1' on macOS.
import pandas as pd

companies = pd.read_csv("http://www.richard-muir.com/data/public/csv/CompaniesRevenueEmployees.csv",
                        index_col=0,
                        encoding='latin1')
Downloading the file and opening it in Notepad++ shows it is ANSI-encoded. If you are on a Windows system, this should fix it:
import pandas as pd
url = "http://www.richard-muir.com/data/public/csv/CompaniesRevenueEmployees.csv"
companies = pd.read_csv(url, index_col = 0, encoding='ansi')
print(companies)
If you are not on Windows, you need to research how to convert ANSI-encoded text to something you can read (see the sketch after the output below).
See: https://docs.python.org/3/library/codecs.html#standard-encodings
Output:
Name Industry \
0 Walmart Retail
1 Sinopec Group Oil and gas
2 China National Petroleum Corporation Oil and gas
... ... ...
47 Hewlett Packard Enterprise Electronics
48 Tata Group Conglomerate
Revenue (USD billions) Employees
0 482 2200000
1 455 358571
2 428 1636532
... ... ...
47 111 302000
48 108 600000
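As a non-Windows fallback, one commonly working guess (a guess, not guaranteed for this particular file) is cp1252, the usual codepage behind Windows "ANSI" for Western European text:
import pandas as pd

url = "http://www.richard-muir.com/data/public/csv/CompaniesRevenueEmployees.csv"
# 'ansi' is a Windows-only alias; cp1252 is the typical Windows Western-European codepage
companies = pd.read_csv(url, index_col=0, encoding='cp1252')
print(companies.head())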
