How to Reach the Next Element in HTML - python-3.x

I am having trouble with my code not printing all the strings that I want, and I am unsure how to change that.
I am trying to scrape all the strings, including values like "460 hp # 7000 rpm", which it is currently not scraping. Ideally, the strings in the strong elements are kept separate. I have tried adding another .next_sibling and changing the br to p and strong, but those attempts just return an error.
The HTML is as follows:
<div class="specs-content">
<p>
<strong>Displacement:</strong>
" 307 cu in, 5038 "
<br>
<strong>Power:</strong>
" 460 hp # 7000 rpm "
<br>
<strong>Torque:</strong>
" 420 lb-ft # 4600 rpm "
</p>
<p>
<strong>TRANSMISSION:</strong>
" 10-speed automatic with manual shifting mode "
</p>
<p>
<strong>CHASSIS</strong>
<br>
" Suspension (F/R): struts/multilink "
<br>
" Brakes (F/R): 15.0-in vented disc/13.0-in vented disc "
<br>
" Tires: Michelin Pilot Sport 4S, F: 255/40ZR-19 (100Y) R: 275/40ZR-19 (105Y) "
</p>
</div>
I have written the following code thus far:
import requests
from bs4 import BeautifulSoup
URL = requests.get('https://www.LinkeHere.com')
soup = BeautifulSoup(URL.text, 'html.parser')
FindClass = soup.find(class_='specs-content')
FindElement = FindClass.find_all('br')
for Specs in FindElement:
    Specs = Specs.next_sibling
    print(Specs.string)
This returns:
Power:
Torque:
Suspension (F/R): struts/multilink
Brakes (F/R): 13.9-in vented disc/13.0-in vented disc
Tires: Michelin Pilot Sport 4S, 255/40ZR-19 (100Y)

You can use the get_text() method, passing a newline (\n) as the separator argument:
from bs4 import BeautifulSoup
html = """THE ABOVE HTML SNIPPET"""
soup = BeautifulSoup(html, "html.parser")
for tag in soup.find_all(class_="specs-content"):
    print(tag.get_text(strip=True, separator="\n").replace('"', ""))
Output:
Displacement:
307 cu in, 5038
Power:
460 hp # 7000 rpm
Torque:
420 lb-ft # 4600 rpm
TRANSMISSION:
10-speed automatic with manual shifting mode
CHASSIS
Suspension (F/R): struts/multilink
Brakes (F/R): 15.0-in vented disc/13.0-in vented disc
Tires: Michelin Pilot Sport 4S, F: 255/40ZR-19 (100Y) R: 275/40ZR-19 (105Y)
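If you want to keep each strong label separate and paired with its value, as the question asks, one option is to walk each <strong> tag and take its immediate next_sibling text node. A minimal sketch against a trimmed version of the HTML above (the real page may add whitespace or extra tags between label and value):

```python
from bs4 import BeautifulSoup, NavigableString

# trimmed version of the HTML shown in the question
html = """<div class="specs-content">
<p><strong>Displacement:</strong> 307 cu in, 5038 <br>
<strong>Power:</strong> 460 hp # 7000 rpm <br>
<strong>Torque:</strong> 420 lb-ft # 4600 rpm </p>
</div>"""

soup = BeautifulSoup(html, "html.parser")
specs = {}
for strong in soup.select(".specs-content strong"):
    # the value is the text node that immediately follows the <strong> tag
    value = strong.next_sibling
    if isinstance(value, NavigableString):
        specs[strong.get_text(strip=True)] = value.strip()

print(specs)
```

This keeps each label/value pair separate instead of flattening everything into one text blob.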

Related

I am trying to make a web scraper that checks for all the courses in a Coursera specialization and then gives a list of them.

The problem is that there is a button saying "SHOW MORE" on the webpage; because of that, the script can't access all the courses.
from bs4 import BeautifulSoup
import requests

res = requests.get('https://www.coursera.org/specializations/digital-manufacturing-design-technology#courses')
txt = res.text
status = res.status_code
# print(txt)
# print(status)  # 200 is the code for success
page = requests.get('https://www.coursera.org/specializations/digital-manufacturing-design-technology#courses')
soup = BeautifulSoup(page.content, 'lxml')
# Display the title of the specialisation
specialization_title = soup.find('h1')
print(specialization_title.text)
print("\n")
# Display the courses inside the specialisation
number_of_courses = soup.find('h2', class_='headline-4-text bold m-b-1')
print(number_of_courses.text)
print("\n")
course_cards = soup.find_all('h3', class_='headline-3-text bold m-t-1 m-b-2')
for course in course_cards:
    print(course.text)
The data is stored inside the page as a JavaScript object. You can use the re and json modules to decode it:
import re
import json
import requests
from textwrap import shorten

url = "https://www.coursera.org/specializations/digital-manufacturing-design-technology#courses"
html_doc = requests.get(url).text
data = json.loads(
    re.search(r"window\.__APOLLO_STATE__ = (.*});", html_doc).group(1)
)
# uncomment to print all data:
# print(json.dumps(data, indent=4))
i = 1
for k, v in data.items():
    if "SDPCourse:" in k and "." not in k:
        print(
            "{:<3} {:<8} {:<60} {}".format(
                i,
                v["averageInstructorRating"],
                v["name"],
                shorten(v["description"], 40),
            )
        )
        i += 1
Prints:
1 4.6 Digital Manufacturing & Design This course will expose you to the [...]
2 4.56 Digital Thread: Components This course will help you [...]
3 4.76 Digital Thread: Implementation There are opportunities throughout [...]
4 4.41 Advanced Manufacturing Process Analysis Variability is a fact of life in [...]
5 4.56 Intelligent Machining Manufacturers are increasingly [...]
6 4.55 Advanced Manufacturing Enterprise Enterprises that seek to become [...]
7 4.55 Cyber Security in Manufacturing The nature of digital [...]
8 4.54 MBSE: Model-Based Systems Engineering This Model-Based Systems [...]
9 4.68 Roadmap to Success in Digital Manufacturing & Design Learners will create a roadmap to [...]
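The core of this technique is the regex that slices the window.__APOLLO_STATE__ assignment out of the page source and hands it to json.loads. A minimal offline sketch against a toy page (the real Coursera object is far larger, and this shape is assumed for illustration):

```python
import re
import json

# toy page standing in for the real Coursera response; the key and field
# names mirror those used in the answer above, but the shape is assumed
html_doc = """<html><script>
window.__APOLLO_STATE__ = {"SDPCourse:1": {"name": "Digital Manufacturing & Design", "averageInstructorRating": 4.6}};
</script></html>"""

# capture everything between "= " and the final "};" and parse it as JSON
match = re.search(r"window\.__APOLLO_STATE__ = (.*});", html_doc)
data = json.loads(match.group(1))
print(data["SDPCourse:1"]["name"])
```

The same pattern works for many sites that inline their state object into a script tag, as long as the object sits on a single line ending in `};`.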

How to evaluate escape character when reading a text file in Python

I have a text file which looks like this:
["RUN DATE: 2/08/18 9:00:24 USER:XXXXXX DISPLAY: MENULIST PROG NAME: MH4567 PAGE 1\nMENU: ADCS00 Visual Basic Things
Service\n 80 Printer / Message Control\n 90 Sign Off\nSelection or
command\n===>____________________________________________________________________________\n____________________________________________________________________________
____\n F3=Exit F4=Prompt F9=Retrieve F12=Previous\n 80 CALL PGM(GUCMD)\nAUTHORIZED: DOAPROCESS FDOAPROCES FOESUPR FPROGRAMMR OESUPR PROGRAMMER\n
90 SIGNOFF\nAUTHORIZED: DOAPROCESS FDOAPROCES FOESUPR FPROGRAMMR OESUPR PROGRAMMER\n", "RUN DATE: 5/09/19 9:00:24 USER:XXXXXX DISPLAY:
MENULIST PROG NAME: MH4567 PAGE 2\nMENU: APM001 Accounts Payable Menu\n MENU
OPTIONS DISPLAY PROGRAMS\n 1 Radar Processing 30 Vendor\n 2 Prepaid Processing
31 Prepaid\n"]
I want to translate every '\n' into a new line and then print.
I tried this code but it didn't work:
import json

contents = open("file.txt", "r")
Lines = contents.readlines()
for l in Lines:
    print(l)
The above code prints each physical line of the .txt file on a new line, but does not interpret the literal '\n' characters inside the strings.
Use:
with open('./sample.txt', 'r') as file_:
    txt = file_.read().replace('\\n', '\n')
print(txt)
Output
"RUN DATE: 2/08/18 9:00:24 USER:XXXXXX DISPLAY: MENULIST PROG NAME: MH4567 PAGE 1
MENU: ADCS00 Visual Basic Things
Service
80 Printer / Message Control
90 Sign Off
Selection or
command
===>____________________________________________________________________________
____________________________________________________________________________
____
F3=Exit F4=Prompt F9=Retrieve F12=Previous
80 CALL PGM(GUCMD)
AUTHORIZED: DOAPROCESS FDOAPROCES FOESUPR FPROGRAMMR OESUPR PROGRAMMER
90 SIGNOFF
AUTHORIZED: DOAPROCESS FDOAPROCES FOESUPR FPROGRAMMR OESUPR PROGRAMMER
", "RUN DATE: 5/09/19 9:00:24 USER:XXXXXX DISPLAY:
MENULIST PROG NAME: MH4567 PAGE 2
MENU: APM001 Accounts Payable Menu
MENU
OPTIONS DISPLAY PROGRAMS
1 Radar Processing 30 Vendor
2 Prepaid Processing
31 Prepaid
"

Beautiful Soup Scraping

I'm having issues with code that used to work but no longer functions correctly.
My python code is scraping a website using beautiful soup and extracting event data (date, event, link).
My code pulls all of the events, which are located in the tbody. Each event is stored in a <tr class="Box">. The issue is that my scraper seems to stop after this <tr style="box-shadow: none;">. After it reaches this section (which contains 3 advertisements on the site for events that I don't want to scrape), the code stops pulling event data from within the <tr class="Box"> rows. Is there a way to skip this tr style / ignore future cases?
import pandas as pd
import bs4 as bs
from bs4 import BeautifulSoup
import urllib.request
import warnings

warnings.filterwarnings("ignore", category=UserWarning, module='bs4')
source = urllib.request.urlopen('https://10times.com/losangeles-us/technology/conferences').read()
soup = bs.BeautifulSoup(source, 'html.parser')
# ---Get Event Data---
test1 = []
table = soup.find('tbody')
table_rows = table.find_all('tr')  # find table rows (tr)
for x in table_rows:
    data = x.find_all('td')  # find table data
    row = [x.text for x in data]
    if len(row) > 2:  # Excludes rows with only event name/link, but no data.
        test1.append(row)
test1
The data is loaded dynamically via JavaScript, so you don't see more results. You can use this example to load more pages:
import requests
from bs4 import BeautifulSoup

url = "https://10times.com/ajax?for=scroll&path=/losangeles-us/technology/conferences"
params = {"page": 1, "ajax": 1}
headers = {"X-Requested-With": "XMLHttpRequest"}

for params["page"] in range(1, 4):  # <-- increase number of pages here
    print("Page {}..".format(params["page"]))
    soup = BeautifulSoup(
        requests.get(url, headers=headers, params=params).content,
        "html.parser",
    )
    for tr in soup.select('tr[class="box"]'):
        tds = [td.get_text(strip=True, separator=" ") for td in tr.select("td")]
        print(tds)
Prints:
Page 1..
['Tue, 29 Sep - Thu, 01 Oct 2020', 'Lens Los Angeles', 'Intercontinental Los Angeles Downtown, Los Angeles', 'LENS brings together the entire Degreed community - our clients, invited prospective clients, thought leaders, partners, employees, executives, and industry experts for two days of discussion, workshops,...', 'Business Services IT & Technology', 'Interested']
['Wed, 30 Sep - Sat, 03 Oct 2020', 'FinCon', 'Long Beach Convention & Entertainment Center, Long Beach 20.1 Miles from Los Angeles', 'FinCon will be helping financial influencers and brands create better content, reach their audience, and make more money. Collaborate with other influencers who share your passion for making personal finance...', 'Banking & Finance IT & Technology', 'Interested 7 following']
['Mon, 05 - Wed, 07 Oct 2020', 'NetDiligence Cyber Risk Summit', 'Loews Santa Monica Beach Hotel, Santa Monica 14.6 Miles from Los Angeles', 'NetDiligence Cyber Risk Summit will conference are attended by hundreds of cyber risk insurance, legal/regulatory and security/privacy technology leaders from all over the world. Connect with leaders in...', 'IT & Technology', 'Interested']
... etc.
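To answer the original question directly, the advertisement rows can be skipped by checking each row's style attribute: the event rows carry a class, the ad rows carry an inline style. A minimal sketch on a toy table (the markup is an assumed simplification of the real page):

```python
from bs4 import BeautifulSoup

# toy table mimicking the structure described in the question
html = """<tbody>
<tr class="box"><td>Event A</td><td>Venue A</td></tr>
<tr style="box-shadow: none;"><td>Advertisement</td></tr>
<tr class="box"><td>Event B</td><td>Venue B</td></tr>
</tbody>"""

soup = BeautifulSoup(html, "html.parser")
events = []
for tr in soup.find_all("tr"):
    if tr.get("style"):  # skip the inline-styled advertisement rows
        continue
    events.append([td.get_text(strip=True) for td in tr.find_all("td")])

print(events)
```

The same check drops the `<tr style="box-shadow: none;">` block while keeping every `<tr class="box">` event row.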

Trying to scrape and segregate into Headings and Contents; the problem is that both have the same class and tags. How to segregate?

I am trying to web scrape http://www.intermediary.natwest.com/intermediary-solutions/lending-criteria.html, segregating it into 2 parts, Heading and Content. The problem is that both have the same class and tags. Other than using regex and hard-coding, how can I distinguish them and extract them into 2 columns in Excel?
In the picture (https://ibb.co/8X5xY9C), or in the website link provided, bold text (except the single alphabet letters like 'A' and the 'back to top' links) represents a Heading, and the non-bold explanation just below it represents the Content (the Content can even consist of 'li' and 'ul' blocks later in the site, which should come under the respective Heading).
# Code to start with
from bs4 import BeautifulSoup
import requests

url = "http://www.intermediary.natwest.com/intermediary-solutions/lending-criteria.html"
html = requests.get(url)
soup = BeautifulSoup(html.text, "html.parser")
Heading = soup.findAll('strong')
content = soup.findAll('div', {"class": "comp-rich-text"})
The output Excel should look something like this:
https://i.stack.imgur.com/NsMmm.png
I've thought about it a little more and came up with a better solution. Rather than "crowd" my initial solution, I chose to add a 2nd solution here:
So thinking about it again, and following my logic of splitting the html by the headlines (essentially breaking it up where we find <strong> tags), I chose to convert to strings using .prettify(), split on those specific strings/tags, and read the pieces back into BeautifulSoup to pull the text. From what I can see, it looks like it hasn't missed anything, but you'll have to search through the dataframe to double check:
import requests
from bs4 import BeautifulSoup
import pandas as pd

url = 'http://www.intermediary.natwest.com/intermediary-solutions/lending-criteria.html'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36'}
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
sections = soup.find_all('div', {'class': 'accordion-section-content'})
results = {}
for section in sections:
    splits = section.prettify().split('<strong>')
    for each in splits:
        try:
            headline, content = each.split('</strong>')[0].strip(), each.split('</strong>')[1]
            headline = BeautifulSoup(headline, 'html.parser').text.strip()
            content = BeautifulSoup(content, 'html.parser').text.strip()
            content_split = content.split('\n')
            content = ' '.join([text.strip() for text in content_split if text != ''])
            results[headline] = content
        except:
            continue
df = pd.DataFrame(results.items(), columns=['Headings', 'Content'])
df.to_csv('C:/test.csv', index=False)
Output:
print (df)
Headings Content
0 Age requirements Applicants must be at least 18 years old at th...
1 Affordability Our affordability calculator is the same one u...
2 Agricultural restriction The only acceptable agricultural tie is where ...
3 Annual percentage rate of charge (APRC) The APRC is all fees associated with the mortg...
4 Adverse credit We consult credit reference agencies to look a...
5 Applicants (number of) The maximum number of applicants is two.
6 Armed Forces personnel Unsecured personal loans are only acceptable f...
7 Back to back Back to back is typically where the vendor has...
8 Customer funded purchase: when the customer has funded the purchase usin...
9 Bridging: residential mortgage applications where the cu...
10 Inherited: a recently inherited property where the benefi...
11 Porting: where a fixed/discounted rate was ported to a ...
12 Repossessed property: where the vendor is the mortgage lender in pos...
13 Part exchange: where the vendor is a large national house bui...
14 Bank statements We accept internet bank statements in paper fo...
15 Bonus For guaranteed bonuses we will consider an ave...
16 British National working overseas Applicants must be resident in the UK. Applica...
17 Builder's Incentives The maximum amount of acceptable incentive is ...
18 Buy-to-let (purpose) A buy-to-let mortgage can be used for: Purcha...
19 Capital Raising - Acceptable purposes permanent home improvem...
20 Buy-to-let (affordability) Buy to Let affordability must be assessed usin...
21 Buy-to-let (eligibility criteria) The property must be in England, Scotland, Wal...
22 Definition of a portfolio landlord We define a portfolio landlord as a customer w...
23 Carer's Allowance Carer's Allowance is paid to people aged 16 or...
24 Cashback Where a mortgage product includes a cashback f...
25 Casual employment Contract/agency workers with income paid throu...
26 Certification of documents When submitting copies of documents, please en...
27 Child Benefit We can accept up to 100% of working tax credit...
28 Childcare costs We use the actual amount the customer has decl...
29 When should childcare costs not be included? There are a number of situations where childca...
.. ... ...
108 Shared equity We lend on the Government-backed shared equity...
109 Shared ownership We do not lend against Shared Ownership proper...
110 Solicitors' fees We have a panel of solicitors for our fees ass...
111 Source of deposit We reserve the right to ask for proof of depos...
112 Sole trader/partnerships We will take an average of the last two years'...
113 Standard variable rate A standard variable rate (SVR) is a type of v...
114 Student loans Repayment of student loans is dependent on rec...
115 Tenure Acceptable property tenure: Feuhold, Freehold,...
116 Term Minimum term is 3 years Residential - Maximum...
117 Unacceptable income types The following forms of income are classed as u...
118 Bereavement allowance: paid to widows, widowers or surviving civil pa...
119 Employee benefit trusts (EBT): this is a tax mitigation scheme used in conjun...
120 Expenses: not acceptable as they're paid to reimburse pe...
121 Housing Benefit: payment of full or partial contribution to cla...
122 Income Support: payment for people on low incomes, working les...
123 Job Seeker's Allowance: paid to people who are unemployed or working 1...
124 Stipend: a form of salary paid for internship/apprentic...
125 Third Party Income: earned by a spouse, partner, parent who are no...
126 Universal Credit: only certain elements of the Universal Credit ...
127 Universal Credit The Standard Allowance element, which is the n...
128 Valuations: day one instruction We are now instructing valuations on day one f...
129 Valuation instruction A valuation will be automatically instructed w...
130 Valuation fees A valuation will always be obtained using a pa...
131 Please note: W hen upgrading the free valuation for a home...
132 Adding fees to the loan Product fees are the only fees which can be ad...
133 Product fee This fee is paid when the mortgage is arranged...
134 Working abroad Previously, we required applicants to be empl...
135 Acceptable - We may consider applications from people who: ...
136 Not acceptable - We will not consider applications from people...
137 Working and Family Tax Credits We can accept up to 100% of Working Tax Credit...
[138 rows x 2 columns]
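The prettify()/split('<strong>') idea can be seen in isolation on a toy snippet. The markup below is an assumed simplification of one accordion section, but the headline/content pairing works the same way:

```python
from bs4 import BeautifulSoup

# toy snippet mimicking one accordion section (assumed markup)
html = ("<div><p><strong>Age requirements</strong></p>"
        "<p>Applicants must be 18.</p>"
        "<p><strong>Affordability</strong></p>"
        "<p>Use the calculator.</p></div>")

soup = BeautifulSoup(html, "html.parser")
results = {}
# everything before the first <strong> is not a headline, so skip chunk 0
for chunk in soup.prettify().split("<strong>")[1:]:
    headline, _, content = chunk.partition("</strong>")
    # read each half back into BeautifulSoup to strip the remaining tags
    headline = BeautifulSoup(headline, "html.parser").get_text(strip=True)
    content = " ".join(BeautifulSoup(content, "html.parser").get_text().split())
    results[headline] = content

print(results)
```

Each chunk after the split begins with a headline and runs up to (but not including) the next <strong>, which is exactly the heading/content pairing the question asks for.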
EDIT: SEE OTHER SOLUTION PROVIDED
It's tricky. I tried essentially to grab the headings, then use them to grab all the text that follows each heading and precedes the next one. The code below is a little messy and requires some cleaning up, but hopefully it gets you to a point where you can work with it, or at least moves you in the right direction:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import re

url = 'http://www.intermediary.natwest.com/intermediary-solutions/lending-criteria.html'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36'}
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
sections = soup.find_all('div', {'class': 'accordion-section-content'})
results = {}
for section in sections:
    headlines = section.find_all('strong')
    headlines = [each.text for each in headlines]
    for i, headline in enumerate(headlines):
        if headline != headlines[-1]:
            next_headline = headlines[i+1]
        else:
            next_headline = ''
        try:
            find_content = section(text=headline)[0].parent.parent.find_next_siblings()
            if ':' in headline and 'Gifted deposit' not in headline and 'Help to Buy' not in headline:
                content = section(text=headline)[0].parent.nextSibling
                results[headline] = content.strip()
                break
        except:
            find_content = section(text=re.compile(headline))[0].parent.parent.find_next_siblings()
        if find_content == []:
            try:
                find_content = section(text=headline)[0].parent.parent.parent.find_next_siblings()
            except:
                find_content = section(text=re.compile(headline))[0].parent.parent.parent.find_next_siblings()
        content = []
        for sibling in find_content:
            if next_headline not in sibling.text or headline == headlines[-1]:
                content.append(sibling.text)
            else:
                content = '\n'.join(content)
                results[headline.strip()] = content.strip()
                break
        if headline == headlines[-1]:
            content = '\n'.join(content)
            results[headline] = content.strip()
df = pd.DataFrame(results.items())

Scraping youtube playlist

I've been trying to write a Python script which fetches the names of the songs contained in a playlist whose link is provided from the terminal, e.g. https://www.youtube.com/watch?v=foE1mO2yM04&list=RDGMEMYH9CUrFO7CfLJpaD7UR85wVMfoE1mO2yM04.
I've found that the names can be extracted using the "li" tag or the "h4" tag.
I wrote the following code,
import sys
import requests
from bs4 import BeautifulSoup

link = sys.argv[1]
req = requests.get(link)
try:
    req.raise_for_status()
except Exception as exc:
    print('There was a problem:', exc)
soup = BeautifulSoup(req.text, "html.parser")
Then I tried using li-tag as:
i = soup.findAll('li')
print(type(i))
for o in i:
    print(o.get('data-video-title'))
But it printed "None" that many times. I believe it is not able to reach the li tags which contain the data-video-title attribute.
Then I tried using div and h4 tags as,
for i in soup.findAll('div', attrs={'class': 'playlist-video-description'}):
    o = i.find('h4')
    print(o.text)
But again nothing happens.
import requests
from bs4 import BeautifulSoup

url = 'https://www.youtube.com/watch?v=foE1mO2yM04&list=RDGMEMYH9CUrFO7CfLJpaD7UR85wVMfoE1mO2yM04'
data = requests.get(url)
data = data.text
soup = BeautifulSoup(data, "html.parser")
h4 = soup.find_all("h4")
for h in h4:
    print(h.text)
output:
Mike Posner - I Took A Pill In Ibiza (Seeb Remix) (Explicit)
Alan Walker - Faded
Calvin Harris - This Is What You Came For (Official Video) ft. Rihanna
Coldplay - Hymn For The Weekend (Official video)
Jonas Blue - Fast Car ft. Dakota
Calvin Harris & Disciples - How Deep Is Your Love
Galantis - No Money (Official Video)
Kungs vs Cookin’ on 3 Burners - This Girl
Clean Bandit - Rockabye ft. Sean Paul & Anne-Marie [Official Video]
Major Lazer - Light It Up (feat. Nyla & Fuse ODG) [Remix] (Official Lyric Video)
Robin Schulz - Sugar (feat. Francesco Yates) (OFFICIAL MUSIC VIDEO)
DJ Snake - Middle ft. Bipolar Sunshine
Jonas Blue - Perfect Strangers ft. JP Cooper
David Guetta ft. Zara Larsson - This One's For You (Music Video) (UEFA EURO 2016™ Official Song)
DJ Snake - Let Me Love You ft. Justin Bieber
Duke Dumont - Ocean Drive
Galantis - Runaway (U & I) (Official Video)
Sigala - Sweet Lovin' (Official Video) ft. Bryn Christopher
Martin Garrix - Animals (Official Video)
David Guetta & Showtek - Bad ft.Vassy (Lyrics Video)
DVBBS & Borgeous - TSUNAMI (Original Mix)
AronChupa - I'm an Albatraoz | OFFICIAL VIDEO
Lilly Wood & The Prick and Robin Schulz - Prayer In C (Robin Schulz Remix) (Official)
Kygo - Firestone ft. Conrad Sewell
DEAF KEV - Invincible [NCS Release]
Eiffel 65 - Blue (KNY Factory Remix)
OK guys, I have figured out what was happening. My code was fine and it works; the problem was that I was passing the link as an argument from the terminal, and the link contained symbols that the shell interprets specially, e.g. '&'.
Now I am passing the link as a quoted string in the terminal and everything works fine. A dumb yet time-consuming mistake.
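The shell behaviour above is easy to demonstrate with urllib.parse: an unquoted '&' ends the command (backgrounding it), so everything after it, including the list parameter that identifies the playlist, never reaches sys.argv:

```python
from urllib.parse import urlparse, parse_qs

url = "https://www.youtube.com/watch?v=foE1mO2yM04&list=RDGMEMYH9CUrFO7CfLJpaD7UR85wVMfoE1mO2yM04"
qs = parse_qs(urlparse(url).query)
print(qs["list"])  # the playlist id lives after the '&'

# Unquoted in a shell, the command is cut at the '&', so the script
# would only ever receive the part before it:
truncated = url.split("&")[0]
print(parse_qs(urlparse(truncated).query))  # no 'list' key any more
```

Quoting the URL (`python script.py "https://...&list=..."`) keeps the query string intact.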
