I need help scraping the following information from the webpage - python-3.x

I need to obtain details of federal agencies from a website that lists them alphabetically across pages a to w. I need the agency name, website, and contact address for each agency.
The code that I have only returns the agency names from one page.
import requests
import bs4
res = requests.get("https://www.usa.gov/federal-agencies/")
soup = bs4.BeautifulSoup(res.text, 'lxml')
for i in soup.select('.url'):
    print(i.text)
I expect to get the agency name, website, and contact address for all the pages, from page a to page w.

You are going to have to iterate through each page, and follow the link of each item to then pull the data you want:
Code:
import requests
import bs4
# Visit the listing page for each letter (a-z)
for letter in map(chr, range(97, 123)):
    res = requests.get("https://www.usa.gov/federal-agencies/%s" % letter)
    soup = bs4.BeautifulSoup(res.text, 'lxml')
    section = soup.find('ul', {'class': 'one_column_bullet'})
    links = ['https://www.usa.gov' + i['href'] for i in section.find_all('a', {'class': 'url'})]
    # Follow each agency link and pull the details from its own page
    for link in links:
        res2 = requests.get(link)
        soup = bs4.BeautifulSoup(res2.text, 'lxml')
        agency_name = soup.find('h1').text
        website = soup.find('h3', {'class': 'org'}).findNext('a')['href']
        try:
            address = soup.find('p', {'class': 'spk street-address'}).text.strip()
            address = address.split('\n')
            address = ' '.join([i.strip() for i in address if i.strip() != ''])
        except AttributeError:
            # find() returned None: this page has no address block
            address = 'N/A'
        print('Name:\t\t%s\nWebsite:\t%s\nAddress:\t%s\n' % (agency_name, website, address))
Output:
Name: U.S. AbilityOne Commission
Website: http://www.abilityone.gov
Address: 1401 S. Clark Street Suite 715 Arlington, VA 22202-3259
Name: U.S. Access Board
Website: http://www.access-board.gov/
Address: 1331 F St., NW Suite 1000 Washington, DC 20004-1111
Name: Administration for Children and Families
Website: http://www.acf.hhs.gov/
Address: 330 C St., SW Washington, DC 20201
Name: Administration for Community Living
Website: http://www.acl.gov
Address: One Massachusetts Ave., NW Washington, DC 20201
Name: Administration for Native Americans
Website: http://www.acf.hhs.gov/programs/ana/
Address: 2nd Floor, West Aerospace Center 370 L'Enfant Promenade, SW Washington, DC 20447-0002
Name: Administrative Conference of the United States
Website: http://acus.gov/
Address: 1120 20th St., NW Suite 706 South Washington, DC 20036
Name: Administrative Office of the U.S. Courts
Website: http://www.uscourts.gov/
Address: One Columbus Circle, NE Washington, DC 20544
...
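The question asks for pages a to w, while the loop above walks a to z. A minimal sketch of generating only the a-w listing URLs, assuming the per-letter URL pattern used in the answer still holds (the helper name letter_urls is hypothetical):

```python
import string

# URL pattern taken from the answer above; assumed to still be valid.
BASE = "https://www.usa.gov/federal-agencies/"

def letter_urls(first="a", last="w"):
    """Return one listing URL per letter from first to last, inclusive."""
    letters = string.ascii_lowercase
    start, stop = letters.index(first), letters.index(last)
    return [BASE + letter for letter in letters[start:stop + 1]]

urls = letter_urls()
print(len(urls), urls[0], urls[-1])
```

Each URL in the list can then be passed to requests.get in place of the string formatting inside the outer loop.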

Related

NoSuchElementException: Message: no such element: Unable to locate element: {"method":"xpath","selector":"/html/body/form/table"}

I'm trying to get the table with all ISIN codes from the following website, but I'm getting this error:
selenium.common.exceptions.NoSuchElementException: Message: no such element: Unable to locate element: {"method":"xpath","selector":"/html/body/form/table"}
Code
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
PATH = r"C:\Users\HP\Downloads\chromedriver_win32\chromedriver.exe"
driver = webdriver.Chrome(PATH)
driver.get('http://stockcare.net/ISINNumber.asp')
table = driver.find_element_by_xpath('/html/body/form/table')
print(table)
driver.quit()
Once I get the table, I want to store it in a pandas DataFrame. Can someone help me out?
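The table on this site is inside a frame. That's why Selenium throws "no such element".
You need to switch to that frame first, then get the innerHTML. The snippet below uses Selenium's Java bindings; in the Python bindings the frame switch is driver.switch_to.frame(1).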
driver.switchTo().frame(1);
String table = driver.findElement(By.xpath("/html/body/form/table")).getAttribute("innerHTML");
Thread.sleep(3000);
System.out.println("Table: "+ table);
Output: Your output will be like this. Now from this, you can extract the values.
Table:
<tbody><tr bgcolor="#6699cc" style="color:white;">
<th width="8%" class="specialTDBig">Group</th>
<th width="10%" class="specialTDBig">BSE Code</th>
<th width="33%" class="specialTDBig">BSE Name</th>
<th width="16%" class="specialTDBig">ISINNO</th>
<th width="20%" class="specialTDBig">Old Name</th>
</tr>
<tr>
<td class="specialTDBig">A</td>
<td class="specialTDBig">523395</td>
<td class="specialTDBig">3M India Ltd </td>
<td class="specialTDBig">INE470A01017</td>
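Once the innerHTML is in hand, the cell values still have to be pulled out of the markup. A rough sketch using only the standard library's html.parser, run on a shortened copy of the output above (the Old Name column is left out because the data row is cut off there). The class name TableRows is made up for illustration, and the pandas step is only suggested in a comment, not executed:

```python
from html.parser import HTMLParser

class TableRows(HTMLParser):
    """Collect the text of every <th>/<td> cell, grouped by <tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])          # start a new row
        elif tag in ("th", "td"):
            self._in_cell = True
            self.rows[-1].append("")      # start a new, empty cell

    def handle_endtag(self, tag):
        if tag in ("th", "td"):
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self.rows[-1][-1] += data     # text inside the current cell

# Shortened copy of the innerHTML printed above.
html = """<tbody><tr bgcolor="#6699cc" style="color:white;">
<th class="specialTDBig">Group</th><th class="specialTDBig">BSE Code</th>
<th class="specialTDBig">BSE Name</th><th class="specialTDBig">ISINNO</th>
</tr><tr>
<td class="specialTDBig">A</td><td class="specialTDBig">523395</td>
<td class="specialTDBig">3M India Ltd </td><td class="specialTDBig">INE470A01017</td>
</tr></tbody>"""

parser = TableRows()
parser.feed(html)
rows = [[cell.strip() for cell in row] for row in parser.rows]
header, body = rows[0], rows[1:]
print(header)
print(body)
# With pandas installed, pandas.DataFrame(body, columns=header) would hold the table.
```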
To print the table data as the elements are within an <iframe> you have to:
Induce WebDriverWait for the desired frame to be available and switch to it.
Induce WebDriverWait for the desired element to be visible.
You can use the following Locator Strategy:
Using CSS_SELECTOR:
driver.get("http://stockcare.net/ISINNumber.asp")
WebDriverWait(driver, 20).until(EC.frame_to_be_available_and_switch_to_it((By.CSS_SELECTOR,"form[name='frmScrip'] iframe")))
print(WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.CSS_SELECTOR, "form[name='frmDetails']"))).text)
Note: You have to add the following imports:
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
Console Output:
Group BSE Code BSE Name ISINNO Old Name
A 523395 3M India Ltd INE470A01017 Birla 3m
A 524348 Aarti Drugs INE767A01016
A 524208 Aarti Ind(5) INE769A01020
A 541988 Aavas Financiers INE216P01012
A 500002 Abb India Ltd.(2) INE117A01022 Abb Ltd.(2)
A 543187 ABB Power Prod(2) INE07Y701011
A 500488 Abbott (I) Ltd INE358A01014 Knoll Pharm.
A 500410 ACC INE012A01025
A 532762 Action Const(2) INE731H01025
A 512599 Adani Enter(1) INE423A01024
A 541450 Adani Green Energy INE364U01010
A 532921 Adani Ports(2) INE742F01042 Mundra Port
A 519183 ADF Foods Ltd INE982B01019 American Dry
A 535755 Adi.Bir.Fash INE647O01011 Pantaloons F
A 540691 Adit.Bir.Capital INE674K01013
A 540025 Advanced Enzym(2) INE837H01020
A 500003 Aegis Logist(1) INE208C01025
A 542752 Affle(India)(2) INE00WC01027
A 500215 Agro Tech Food INE209A01019 ITC Agro-tec
A 532811 Ahluwalia Co(2) INE758C01029
A 532683 Aia Engineer(2) INE212H01026
A 532331 Ajanta Pharma(2) INE031B01049
A 500710 Akzo Nobel India INE133A01011 ICI India
A 506235 Alembic Ltd(2) INE426A01027
A 533573 Alembic Pha(2) INE901L01018
A 539523 Alkem Laborat(2) INE540L01014
A 506767 Alkyl Amines(2) INE150B01039
A 532480 Allahabad Bk INE428A01015
A 532749 Allcargo Logis(2) INE418H01029 Allcargo Glo
A 500008 Amar Raja B(1) INE885A01032
A 540902 Amber Enterp.India INE371P01015
A 500425 Ambuja Cem(2) INE079A01024
A 532418 Andhra Bank INE434A01013
A 532259 Apar Ind. INE372A01015
A 523694 Apcotex Indus(2) INE116A01032
A 540692 Apex Frozen Foods INE346W01013
A 533758 APL Apollo Tub(2) INE702C01027
A 508869 Apollo Hosp.(5) INE437A01024
A 531761 Apollo Pipes Ltd INE126J01016
A 538566 Apollo Tricoat Tu(2) INE919P01029 Best Steel L
A 500877 Apollo Tyre(1) INE438A01022
A 532475 Aptech Ltd INE266F01018 Aptech Train
A 531179 Arman Fin.Serv. INE109C01017 Arman Leasin
A 542484 Arvind Fashion (4) INE955V01021
A 500101 Arvind Ltd INE034A01011 Arvind Mills
A 515030 Asahi Ind Gl(1) INE439A01020
A 500477 Ashok Leyland(1) INE208A01029
A 533271 Ashoka Buildcon(5) INE442H01029
A 532888 Asian Granito INE022I01019
A 500820 Asian Paint(1) INE021A01026
A 533138 Astec Lifesci INE563J01010
A 540975 Aster DM Healthcare INE914M01019
A 532493 Astra Micro(2) INE386C01029
A 532830 Astral Ltd(1) INE006I01046 Astral Poly(
A 506820 Astrazen.Ph(2) INE203A01020 Astra Idl
A 500027 Atul Ltd. INE100A01010
A 540611 AU Small Finan Bank INE949L01017
A 524804 Aurobindo Ph(1) INE406A01037
A 539289 Aurum Proptech Ltd(5 INE898S01029 Majesco Ltd(
A 505010 Auto Axles INE449A01011
A 512573 Avanti Feeds(1) INE871C01038
A 540376 Avenue Supermarts INE192R01011
A 532215 Axis Bank (2) INE238A01034
A 532977 Bajaj Auto Ltd INE917I01010
A 533229 Bajaj Consumer Ca(1) INE933K01021 Bajaj Corp(1
A 500031 Bajaj Elec(2) INE193E01025
A 500034 Bajaj Finance(2) INE296A01024
A 532978 Bajaj Finvest(5) INE918I01018
A 500032 Bajaj Hind Sugar(1) INE306A01021 Bajaj Hindus
A 500490 Bajaj Hold&Inves INE118A01012 Bajaj Auto
A 530999 Balaji Amine(2) INE050E01027
A 532382 Balaji Tele(2) INE794B01026
A 502355 Balkrish Ind(2) INE787D01026
A 523319 Balmer Lawri INE164A01016
A 500038 Balrampur Chi(1) INE119A01028
A 541153 Bandhan Bank Ltd INE545U01014
A 532525 Bank Maharashtra INE457A01014
A 532134 Bank of Baroda(2) INE028A01039
A 532149 Bank of India INE084A01016
A 500041 Bannari Aman INE459A01010
A 500042 Basf India INE373A01013
A 500043 Bata India(5) INE176A01028
A 506285 Bayer Corp INE462A01022 Bayer India
A 500048 BEML Ltd INE258A01016 Bh.Earth Mov
A 509480 Berger Paint(1) INE463A01038
A 533303 BF Investment Ltd INE878K01010
A 532430 BF Utilities(5) INE243D01012
A 500052 Bhansali Eng(1) INE922A01025
A 503960 Bharat Bijle(10) INE464A01028
A 541143 Bharat Dynamics INE171Z01018
A 500049 Bharat Elec(1) INE263A01024
A 533228 Bharat Financial INE180K01011 SKS Microfin
A 500493 Bharat Forge(2) INE465A01025
A 500103 Bharat heavy(2) INE257A01026
A 500547 Bharat Petro INE029A01011
A 532454 Bharti Artl(5) INE397D01024
A 532523 Biocon Ltd(5) INE376G01013
A 500335 Birla Corp INE340A01012
A 532400 BirlaSoft Ltd(2) INE836A01035 Kpit Technol
A 506197 Bliss Gvs Ph(1) INE416D01022 Bliss Chem
A 526612 Blue Dart Ex INE233B01017
A 500067 Blue Star(2) INE472A01039
A 524370 Bodal Chem(2) INE338D01028 Dintex Dye
A 501425 Bombay Burmah(2) INE050A01025
A 500020 Bombay Dyeing(2) INE032A01023
A 502219 Borosil Renewable INE666D01022 Borosil Glas
A 500530 BOSCH LTD INE323A01026 Motor ind
A 532929 Brigade Enterpr INE791I01019
A 500825 Britania Ind(1) INE216A01030
A 532321 Cadila Health(1) INE010B01027
A 532834 Camlin Fine Sc(1) INE052I01032
A 532483 Canara Bank INE476A01014
A 511196 Canfin Homes(2) INE477A01020
A 524742 Caplin Point(2) INE475E01026
A 531595 Capri Global Capi INE180C01018 Money Mat Fi
A 513375 Carborundum(1) INE120A01034
A 534804 CARE Ratings Ltd INE752H01013 Credit Analy
A 500870 Castrol Ind(2) INE172A01027
A 519600 CCL Products(2) INE421D01022
A 500878 Ceat Limited INE482A01020
A 532885 Central Bank INE483A01010
A 500040 Century INE055A01016
A 532548 Centuryply(1) INE348B01021
A 532443 Cera Sanitar(5) INE739E01017 Madhusudan O
A 500084 CESC Ltd.(1) INE486A01021
A 542399 Chalet Hotels Ltd INE427F01016
A 500085 Chambal Fert INE085A01013
A 500110 Chennai Pet. INE178A01016 Madras Ref.
A 511243 Chola.Inv&fin(2) INE121A01024
A 504973 Cholamandalam Fin(2) INE149A01025 TI Fin.Holdi
A 534758 Cigniti Techno INE675C01017 Chakkilam In
A 500087 Cipla Ltd(2) INE059A01026
A 532210 City Union Bnk(1) INE491A01021
A 506390 Clariant Chem INE492A01029 Colour Chem
A 533278 Coal India Ltd INE522F01014
A 540678 Cochin Shipyard INE704P01017
A 532541 Coforge Ltd INE591G01017 NIIT Tech
A 500830 Colgate Palm(1) INE259A01022 Colgate
A 531344 Contain.Corp(5) INE111A01025
A 506395 Coromand.Inter(1) INE169A01031 Coromand.Fer
A 532179 Corpn.Bank(2) INE112A01023
A 508814 Cosmo Films INE757A01017 Cosmo Fil(pr
A 541770 Credit Access Grameen INE741K01010
A 500092 CRISIL Rating(1) INE007A01025
A 539876 Crompt.Grev.Cons(2) INE299U01018
A 542867 CSB Bank Ltd INE679A01013
A 500480 Cummins Ind(2) INE298A01020 Kirlos.cumm
A 532175 Cyient Ltd(5) INE136B01020 Infotech Ent
A 500096 Dabur India(1) INE016A01026
A 500097 Dalmia Bhar.Sug(2) INE495A01022 Dalmia Cemen
A 542216 Dalmia Bharat Ltd INE00R701025 Odisha Cemen
A 533151 DB CORP LTD INE950I01011
A 532772 DCB Bank Ltd INE503A01015 Dev Cred ban
A 523367 Dcm Shriram Ltd(2) INE499A01024 Dcm Shriram
A 500645 Deepak Fert. INE501A01019
A 506401 Deepak Nitrit(2) INE288B01029
A 532848 Delta Corp (1) INE124G01033 Arrow Webtex
A 533137 Den Networks Ltd INE947J01015
A 532121 Dena Bank INE077A01010
A 519588 DFM Foods(2) INE456C01020
A 500119 Dhampur Suga INE041A01016
A 507717 Dhanuka Agrit(2) INE435G01025 Dhanuka Pest
A 540047 Dilip Buildcon INE917M01012
A 540701 Dishman Carbogen Amics INE385W01011
A 532488 Divi's Lab.(2) INE361B01024
A 540699 Dixon Techno(2) INE935N01020
A 532868 Dlf Ltd(2) INE271C01023
A 541403 Dollar Indus.(2) INE325C01035
A 539524 DR Lal Pathlabs INE600L01024
A 500124 DR Reddys Lab(5) INE089A01023
A 523618 Dredging Cor INE506A01018
A 532610 Dwarkes Sugar(1) INE366A01041
A 532927 Eclerx Serv.Ltd INE738I01010
A 532922 Edelweiss Fin(1) INE532F01054 Edelweiss Ca
A 505200 Eicher Motor(1) INE066A01021
A 500125 Eid Parry(1) INE126A01031
A 500840 Eih Ltd(2) INE230A01023
A 500123 Elantas Beck In INE280B01018 Schenec.Beck
A 522074 Elgi Equipment(1) INE285A01027
A 531162 Emami Ltd(1) INE548C01032 Himani Ltd.
A 540153 Endurance Tech. INE913H01037
A 532178 Engineers(i)(5) INE510A01028
A 500135 EPL Limited(2) INE255A01020 Essel Propp(
A 539844 Equitas Holdings INE988K01017
A 540596 Eris Lifescienc(1) INE406M01024
A 500133 Esab India INE284A01012
A 500495 Escorts Ltd INE042A01014
A 531508 Eveready Ind(5) INE128A01029
A 500650 Excel Ind.(5) INE369A01029
A 500086 Exide Ind(1) INE302A01020 Chloride Ltd
A 531599 Fdc Ltd(1) INE258B01022
A 505744 Fed Mog Goetze INE529A01010 Goetze India
A 500469 Federal Bank(2) INE171A01029
A 541557 Fine Organic Indus. INE686Y01026
A 500144 Finolex Cabl(2) INE235A01022
A 500940 Finolex Ind.(2) INE183A01024
A 532809 Firstsource Solut INE684F01012
A 500033 Force Motor INE451A01017
A 532843 Fortis Healthcare INE061F01013
A 533400 Future Consum.(6) INE220J01025 Future Consu
A 536507 Future Lifestyl(2) INE452O01016
A 540064 Future Retail Lt(2) INE752P01024
A 505714 Gabriel Ind(1) INE524A01029
A 532155 Gail (I) Ltd INE129A01019
A 540935 Galaxy surfactants INE600K01018
A 542011 Garden Reach ShipBld INE382Z01011
A 509557 Garware Technical INE276A01018 Garware Wall
A 532622 Gateway Dist INE852F01015
A 532345 Gati Ltd(2) INE152B01027 Gati Corpn.
A 532309 GE Power India Lt INE878A01011 Alstom India
A 500620 GE Shiping INE017A01032
A 522275 GE T&D India(2) INE200A01026 Alstom T&D (
A 540755 General Insur.Corp(5) INE481Y01014
A 500173 GFL Ltd(1) INE538A01037 Guj.Flouro(1
A 511676 GIC Housing INE289B01019
A 507815 Gillette Ind INE322A01010 Ind.Shaving
A 500660 Glaxo Ltd. INE159A01016
A 500676 Glaxosmithkl INE264A01014 Smithkl.Co
A 532296 Glenmark Phar(1) INE935A01035
A 505255 Gmm Pfaudler(2) INE541A01023 Guj.Machine
A 532754 Gmr Infrastr(1) INE776C01039
A 500163 Godfrey Phil(2) INE260B01028
A 540743 Godrej Agrovet INE850D01014
A 532424 Godrej Consum(1) INE102D01028
A 500164 Godrej Indus(1) INE233A01035 Godrej Soaps
A 533150 Godrej Propert(5) INE484J01027
A 500168 Goodyear (I) INE533A01012
A 532482 Granules(i)(1) INE101D01020
A 509488 Graphite Ind(2) INE371A01025 Carbon Everf
A 500300 Grasim Ind.(2) INE047A01021
A 501455 Greaves Cotton(2) INE224A01026 Greaves Ltd.
A 526797 Greenply Ind(1) INE461C01038
A 506076 Grindwell Nort(5) INE536A01023
A 511288 Gruh Finance(2) INE580B01029
A 530001 Guj.Alkali INE186A01019
A 524226 Guj.Amb.Exp(1) INE036B01030
A 500171 Guj.H.Chem INE539A01019
A 517300 Guj.Ind.Pow. INE162A01010
A 532181 Guj.Mineral(2) INE131A01031
A 500670 Guj.Narmada INE113A01013
A 532702 Guj.Petronet INE246F01010
A 500690 Guj.St.Fe.Ch(2) INE026A01025
A 542812 Gujarat Fluoroch(1) INE09N301011
A 539336 Gujarat Gas Ltd(2 INE844O01030
A 533248 Gujarat Pipavav INE517F01014
A 538567 Gulf Oil Lubri(2) INE635Q01029
A 541019 H.G.Infra Eng. INE926X01010
A 533162 Hathway Cable(2) INE982F01036
A 531531 Hatsun Agro(1) INE473B01035 Hatsun Milk
A 517354 Havells Ind(1) INE176B01034
A 508486 Hawkins Cook INE979B01015
A 532281 HCL Techno(2) INE860A01027
A 541729 HDFC Asset Manag(5) INE127D01025
A 500180 HDFC Bank(2) INE040A01026
A 540777 HDFC Insurance Co INE795G01014 HDFC Standar
A 500010 HDFC(2) INE001A01036
A 539787 Healthcare Global Ent INE075I01017
A 509631 HEG Ltd. INE545A01016
A 500292 Heideberg Cement INE578A01017 Mys Cement
A 519552 Heritage Food(5) INE978A01027
A 500182 Hero Motocorp(2) INE158A01026 Hero Honda(2
A 524669 Hester Biosciences INE782E01017 Hester Pharm
A 532129 Hexaware Tec(2) INE093A01033 Aptech Ltd.
A 524735 Hikal Ltd(2) INE475B01022
A 509675 HIL Ltd INE557A01011 Hyd.Indus
A 500183 Him.Fut.Comm(1) INE548A01028
A 500184 Himadri Spec(1) INE019C01026 Himadri Chem
A 514043 Himatsing Seid(5) INE049A01027
A 500185 Hind.Const(1) INE549A01026
A 513599 Hind.Copper (5) INE531E01026
A 500186 Hind.Oil Exp INE345A01011
A 500104 Hind.Petrol INE094A01015
A 500696 Hind.Unilever(1) INE030A01027 Hind.Lever(1
A 500188 Hind.Zinc(2) INE267A01025
A 500440 Hindalco(1) INE038A01020
A 541154 Hindustan Aeronautics INE066F01012
A 522064 Honda India Power Prod INE634A01018 Honda Siel
A 517174 Honeywell INE671A01010 Tata Honeywl
A 540530 Housing & Urban Dev INE031A01017
A 509820 Huntamaki India L INE275B01026 Huntamaki PP
A 532174 ICICI Bankin(2) INE090A01021
A 540716 ICICI Lombard INE765G01017
A 540133 ICICI Prudent Life INE726G01019
A 541179 ICICI Securit(5) INE763G01038
A 532835 Icra Ltd INE725G01011
A 500116 IDBI Ltd. INE008A01015 Industrial D
A 539437 IDFC FIRST BANK L INE092T01019 IDFC BANK LT
A 532659 Idfc Ltd INE043D01016
A 505726 IFB Ind.Ltd. INE559A01017
A 500106 IFCI Ltd. INE039A01010
A 517380 Igarshi Mot INE188B01013 CG Igarshi M
A 532636 IIFL Holding (2) INE530B01024 India Infoli
A 542773 IIFL Securities(2) INE489L01022
A 542772 IIFL Wealth Manag(2) INE466L01020
A 500201 Ind.Glycols INE560A01015
A 504741 Ind.Hume Pip(2) INE323C01030
A 530005 India Cement INE383A01012
A 532189 India Touris INE353K01014
A 535789 Indiabul Hous.F(2) INE148I01020
A 532832 Indiabul Real(2) INE069I01010
A 542726 IndiaMart Intermesh INE933S01016
A 532814 Indian Bank INE562A01011
A 540750 Indian Energy Exc(1) INE022Q01020
A 500850 Indian Hotel(1) INE053A01029
A 530965 Indian Oil Corp INE242A01010
A 532388 Indian Over. INE565A01014
A 521016 Indo Count(2) INE483B01026
A 532612 Indoco Rem(2) INE873D01024
A 541336 Indostar Capital Fin INE896L01010
A 532514 Indra Gas(2) INE203G01027
A 534816 Indus Towers INE121J01017 Bharti Infra
A 532187 Indusind Bnk INE095A01012
A 539807 Infibeam Avenues(1) INE483S01020 Infibeam Inc
A 532777 Info Edge INE663F01024
A 500209 Infosys Ltd(5) INE009A01021
A 500210 Ingersoll INE177A01018
A 532706 Inox Leisure INE312H01016
A 532851 Insecticides INEO7OI01018
A 538835 Intellect Design(5) INE306R01017
A 539448 InterGlobe Aviation INE646L01027
A 524164 IOL Chemical & Pharma INE485C01011 IndsOrganic
A 500214 ION Exchange INE570A01014
A 524494 Ipca Lab.Ltd (2) INE571A01020
A 532947 IRB Infrastruct INE821I01014
A 541956 Ircon Inter.Ltd(2 INE962Y01021
A 542830 IRCTC Railway Cateri INE335Y01020
A 533033 ISGEC Heavy Eng.( INE858B01029
A 500875 Itc Ltd(1) INE154A01025
A 509496 Itd Cement(1) INE686A01026
A 523610 ITI Ltd. INE248A01017
A 532209 J&K Bank(1) INE168A01041
A 532940 J.Kumar Infrapro(5) INE576I01022
A 532705 Jagran Praka(2) INE199G01027
A 512237 Jai Corp Ltd(1) INE070D01027
A 500219 Jain Irri(2) INE175A01038
A 532532 Jaipra.Associt(2) INE455F01025
A 520051 Jamna Auto(1) INE039C01032
A 506943 Jb Chemical(2) INE572A01028
A 500227 Jind.Polys INE197D01010
A 500378 Jind.Saw(2) INE324A01024 Saw Pipes
A 539597 Jind.Stain(His)(1) INE455T01018
A 532508 Jind.Stainl(2) INE220G01021 Jsl Stainles
A 532286 Jind.Steel Pow(1) INE749A01030
A 532644 Jk Cement INE823G01014
A 500380 JK Laksh.Cem(5) INE786A01032
A 532162 JK Paper INE789E01012 Central Pulp
A 530007 JK Tyre & Ind(2) INE573A01042
A 523405 JM Financial(1) INE780C01023
A 522263 JMC Projects(2) INE890A01024
A 523398 Johnson Controls- INE782A01015 Hitachi Home
A 532642 JSW Holdings Ltd INE824G01012 Jindal South
A 500228 JSW Steel(1) INE019A01038
A 520057 JTEKT India Ltd(1 INE643A01035 Sona Koyo St
A 533155 Jubilant Foodwork INE797F01012
A 530019 Jubilant Pharmova INE700A01033 Jubilant Lif
A 535648 Just Dial Ltd INE599M01018
A 532926 Jyothy Lab.(1) INE668F01031 Jyothy Labor
A 500233 Kajaria Cer(1) INE217B01028
A 532268 Kale Consul. INE793A01012
A 522287 Kalpa.Power(2) INE220B01022
A 500235 Kalyani Stel(5) INE907A01026
A 532468 Kama Holdings Ltd INE411F01010 Srf Polymers
A 500165 Kansai Nerolec(1) INE531A01024
A 532652 Karnatka Bank INE614B01018 590002
A 532899 Kaveri Seed (2) INE455I01029
A 532714 Kec Intern(2) INE389H01022
A 517569 Kei Indust(2) INE878B01027
A 505890 Kenna Metal W INE717A01029
A 532967 Kiri Industries INE415I01015 Kiridyes & C
A 533293 Kirl.OilEng(2) INE146L01010
A 521248 Kitex Garment(1) INE602G01020
A 532942 KNR Constr.(2) INE634I01029
A 532924 Kolte-Patil Dep INE094I01018
A 500247 Kotak Bank(5) INE237A01028
A 542651 Kpit Techno.Ltd INE04I401011
A 532889 KPR.Mill ltd(1 INE930H01031
A 530813 KRBL Ltd.(1) INE001B01026 Khushi Ram B
A 500249 KSB Ltd INE999A01015 KSB Pumps
A 533519 L&T Fin.Holding INE498L01015
A 540115 L&T Techno.Service(2) INE010V01017
A 534690 Laksh.Vilas Bank INE694C01018
A 500252 Lakshmi Mach(10) INE269B01029
A 526947 La-opala Rg(2) INE059D01020
A 540005 Larsen & Toubro Inf(1) INE214T01019
A 500510 Larsen Tubro(2) INE018A01030
A 540222 Laurus Labs Ltd(2 INE947Q01028
A 541233 Lemon Tree Hotels INE970X01018
A 500250 LG Balkrishnan INE337A01034 Lg Balabros
A 500253 LIC Hous.Fin(2) INE115A01026
A 531633 Lincoln Pharm INE405C01035
A 523457 Linde india Ltd. INE473A01011 BOC (i) Ltd.
A 532783 Lt Foods Ltd(1) INE818H01020
A 500257 Lupin Ltd.(2) INE326A01037 Lupin Chem
A 539542 Lux Indus.Ltd(2) INE150G01020
A 532720 M&M Finance(2) INE774D01024
A 500266 Mah.Scooter INE288A01013
A 500265 Maha.Seamless(5) INE271B01025
A 539957 Mahanagar Gas Ltd INE002S01010
A 532756 Mahind CIE Automo INE536H01010 Mahind Forgi
A 532313 Mahind Lifes Dev INE813A01018 Mahindra Ges
A 500520 Mahindra & Mah(5) INE101A01026
A 533088 Mahindra Holidays INE998I01010
A 540768 Mahindra Logistics INE766P01016
A 531213 Manappuram Fin(2) INE522D01027 Manappuram G
A 502157 Mangalam Cem INE347A01017
A 531642 Marico (1) INE196A01026
A 524404 Marksans(1) INE750C01026 Tasc Pharma.
A 532500 Maruti Suzuki(5) INE585B01010 Maruti Udyog
A 540749 MAS Financial Serv INE348L01012
A 523704 Mastek Ltd(5) INE759A01021
A 500271 Max Fin.Serv.(2) INE180A01020 Max India(2)
A 539981 Max India Ltd(2) INE153U01017
A 522249 Mayur Uniquo(5) INE040D01038
A 532865 Meghmani Organ(1) INE974H01013
A 542650 Metropolis Health(2) INE112L01020
A 538962 Minda Corpo.Lt(2) INE842C01021
A 532539 Minda Ind(2) INE405E01023
A 532819 Mindtree LTD INE018I01017
A 513377 Mineral&Metl(1) INE123F01029 MMTC Ltd
A 541195 Mishra Dhatu Nigam INE099Z01011
A 533286 MOIL LTD INE490G01020
A 533080 Mold-Tek Packaging INE893J01011 MoldTek Plas
A 524084 Monsanto (I) INE274B01011 Monsanto Che
A 500288 Morepen Lab(2) INE083A01026
A 517334 Motherson Sumi(1) INE775A01035
A 532892 Motilal OswalF(1) INE338I01027
A 526299 Mphasis Bfl INE356A01018 BFL Software
A 500290 MRF Ltd. INE883A01011
A 500109 MRPL INE103A01014
A 542597 MSTC Limited INE255X01014
A 534091 Multi Commodty.Exch INE745G01035
A 533398 Muthoot Finance INE414G01012
A 539551 Narayana Hrudayalaya INE410P01011
A 523630 Nat.Fertiliser INE870D01012
A 526371 Nat.Mineral(1) INE584A01023
A 500298 Nat.Peroxide INE585A01020
A 524816 Natco Pharma INE987B01018
A 537291 Nath Bio-Genes(I) INE448G01010
A 532234 National Alum(5) INE139A01034
A 513023 Nava Bharat V(2) INE725A01022
A 532504 Navin Fluori(2) INE048G01026
A 508989 Navneet Education(2) INE060A01024 Navneet Publ
A 534309 NBCC(India) Lt(1) INE095N01031
A 500294 NCC Ltd(2) INE868B01028 Nagar.Constr
A 502168 NCL Ind. INE732C01016
A 542665 Neogen Chemicals INE136S01016
A 505355 NESCO LTD INE317F01027 New Std.Eng.
A 500790 Nestle (I)Ltd. INE239A01016
A 532798 Network18 Med(5) INE870H01013 Network18 Fi
A 524558 Neuland Lab. INE794A01010
A 540900 Newgen Softw.Tech INE619B01017
A 533098 NHPC LIMITED INE848E01016
A 500304 NIIT Ltd.(2) INE161A01038
A 523385 Nilkamal Pls INE310A01015
A 540767 Nippon Life India Asset INE298J01013 Reliance Nip
A 500307 Nirlon INE910A01012
A 513683 NLC India Ltd INE589A01014 Neyveli Lign
A 500730 NOCIL INE163A01018
A 500672 Novartis India(5) INE234A01025 Hind.ciba
A 530367 Nrb Bearing(2) INE349A01021
A 532555 NTPC Ltd INE733E01010
A 531209 Nucle.Soft E INE096B01018
A 533273 Oberoi Realty INE093I01010
A 533106 OIL INDIA LTD INE274J01014
A 532880 Omaxe Ltd INE800H01010
A 500312 ONGC Corp.(5) INE213A01029
A 532466 Oracle Financ(5) INE881D01027 Iflex Solu(5
A 535754 Orient Cement(1) INE876N01018
A 541301 Orient Electric INE142Z01019
A 506579 Orient.Carb. INE321D01016
A 500315 Oriental Bank INE141A01014
A 532827 Page Industr INE761H01022
A 532900 Paisalo Digital INE420C01042 SE Investmen
A 531349 Panacea Biot(1) INE922B01023
A 539889 Parag Milk Foods INE883N01014
A 531120 Patel Engg.(1) INE244B01030
A 534809 PC Jeweller Ltd INE785M01013
A 533179 Persistent System INE262H01013
A 532522 Petronet Lng INE347G01014
A 500680 Pfizer Ltd. INE182A01018
A 506590 Phil.Carbon(2) INE602A01023
A 503100 Phoenix Mill(2) INE211B01039
A 523642 PI Indus.Ltd(1) INE603J01030
A 500331 Pidilite Ind(1) INE318A01026
A 539883 Pilani Inv.&Ind INE417C01014
A 500302 Piramal Enter(2) INE140A01024 Piramal Heal
A 540173 PNB Housing Fin INE572E01012
A 539150 PNC Infratech (2) INE195J01029
A 531768 Poly Medicur(5) INE205C01021
A 542652 Polycab India INE455K01017
A 524051 Polyplex INE633B01018
A 524000 Poonawalla Fincor INE511C01022 Magma Fincor
A 532810 Power Finan INE134E01011
A 532898 Power Grid Corp. INE752E01010
A 539302 Power Mech Proj. INE211R01019
A 506022 Prakash Ind. INE603A01013
A 540724 Prataap Snacks (5) INE393P01035
A 533274 Prestige EstatePr INE811K01011
A 540293 Pricol Limited(1) INE726V01018
A 542907 Prince Pipes&Fittings INE689W01016
A 500338 Prism Johnson INE010A01011 Prism Cement
A 530117 Privi Speciality INE959A01019 Fairchem Spe
A 500126 Procter & Gamble INE199A01012 Merck Ltd
A 500459 Procter&gamb INE179A01014
A 540544 PSP Projects Ltd INE488V01015
A 532524 PTC India INE877F01012
A 533344 PTC India Financ INE560K01014
A 532461 Punj.Nat.Bank(2) INE160A01022
A 532689 Pvr Ltd INE191H01014
A 539978 Quess Corp Ltd INE615P01015
.
.
.
A 532648 Yes Bank ltd(2) INE528G01027
A 505537 Zee Enter(1) INE256A01028 Zee Telefilm
A 504067 Zensar Tech.(2) INE520A01027
A 531335 Zydus Wellness INE768C01010 Carnation He
Reference
You can find a couple of relevant discussions in:
Ways to deal with #document under iframe
Switch to an iframe through Selenium and python
selenium.common.exceptions.NoSuchElementException: Message: no such element: Unable to locate element while trying to click Next button with selenium
selenium in python : NoSuchElementException: Message: no such element: Unable to locate element

Creating a function for my python web scraper that will output a dictionary

I have created my web scraper and added a function, but the function is not being called and the output is not coming out as a dictionary. How do I create and call the function and store the output as a dictionary? Below is my code and function so far.
from bs4 import BeautifulSoup
import requests
top_stories = []
def get_stories():
    """ user agent to facilitates end-user interaction with web content"""
    headers = {
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.101 Safari/537.36'
    }
    base_url = 'www.example.com'
    source = requests.get(base_url).text
    soup = BeautifulSoup(source, 'html.parser')
    articles = soup.find_all("article", class_="card")
    print(f"Number of articles found: {len(articles)}")
    for article in articles:
        try:
            headline = article.h3.text.strip()
            link = base_url + article.a['href']
            text = article.find("div", class_="field--type-text-with-summary").text.strip()
            img_url = base_url + article.picture.img['data-src']
            print(headline,link,text,img_url)
            stories_dict = {}
            stories_dict['Headline'] = headline
            stories_dict['Link'] = link
            stories_dict['Text'] = text
            stories_dict['Image'] = img_url
            top_stories.append(stories_dict)
        except AttributeError as ex:
            print('Error:',ex)
get stories()
To get the data in a dictionary format (dict), you can create a dictionary as follows:
top_stories = {"Headline": [], "Link": [], "Text": [], "Image": []}
and append the correct data to it.
(By the way, the headers you specified should have been a dict, not a set.)
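The point about headers is easy to check: braces around a bare string create a set, while requests takes a dictionary of header names and values. The question's code also never passes headers to requests.get. A small sketch with a shortened placeholder user-agent string:

```python
ua = "Mozilla/5.0 (example user agent)"  # placeholder value, not a real browser string

as_written = {ua}                 # braces with no key: this is a set
as_intended = {"user-agent": ua}  # key/value pair: this is a dict

print(type(as_written).__name__)
print(type(as_intended).__name__)
```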
from bs4 import BeautifulSoup
import requests
def get_stories():
    """user agent to facilitates end-user interaction with web content"""
    headers = {
        "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.101 Safari/537.36"
    }
    top_stories = {"Headline": [], "Link": [], "Text": [], "Image": []}
    base_url = "https://www.jse.co.za/"
    source = requests.get(base_url, headers=headers).text
    soup = BeautifulSoup(source, "html.parser")
    articles = soup.find_all("article", class_="card")
    print(f"Number of articles found: {len(articles)}")
    for article in articles:
        try:
            top_stories["Headline"].append(article.h3.text.strip())
            top_stories["Link"].append(base_url + article.a["href"])
            top_stories["Text"].append(
                article.find("div", class_="field--type-text-with-summary").text.strip()
            )
            top_stories["Image"].append(base_url + article.picture.img["data-src"])
        except AttributeError as ex:
            print("Error:", ex)
    print(type(top_stories))
    print(top_stories)
get_stories()
Output:
Number of articles found: 6
<class 'dict'>
{'Headline': ['South Africa offers investment opportunities to Asia Pacific investors', 'South Africa to showcase investment opportunities to the UAE market', 'South Africa to showcase investment opportunities to UK investors', 'JSE to become 100% owner of JSE Investor Services and expands services to include share plan administration services', 'Thungela Resources lists on the JSE after unbundling from Anglo American', 'JSE welcomes SAB’s B-BBEE scheme that gives investors exposure to AB InBev global market'], 'Link': ['https://www.jse.co.za//news/market-news/south-africa-offers-investment-opportunities-asia-pacific-investors', 'https://www.jse.co.za//news/market-news/south-africa-showcase-investment-opportunities-uae-market', 'https://www.jse.co.za//news/market-news/south-africa-showcase-investment-opportunities-uk-investors', 'https://www.jse.co.za//news/market-news/jse-become-100-owner-jse-investor-services-and-expands-services-include-share-plan', 'https://www.jse.co.za//news/market-news/thungela-resources-lists-jse-after-unbundling-anglo-american', 'https://www.jse.co.za//news/market-news/jse-welcomes-sabs-b-bbee-scheme-gives-investors-exposure-ab-inbev-global-market'], 'Text': ['The Johannesburg Stock Exchange (JSE) and joint sponsors, Citi and Absa Bank are collaborating to host the annual SA Tomorrow Investor conference, which aims to showcase the country’s array of investment opportunities to investors in the Asia Pacific region, mainly from Hong Kong and Singapore.', 'The Johannesburg Stock Exchange (JSE) and joint sponsors, Citi and Absa Bank are collaborating to host the SA Tomorrow Investor conference, which aims to position South Africa as a preferred investment destination for the United Arab Emirates (UAE) market.', 'The Johannesburg Stock Exchange (JSE) and joint sponsors Citi and Absa Bank are collaborating to host the annual SA Tomorrow Investor conference, which aims to showcase the country’s array of investment opportunities to investors in 
the United Kingdom.', 'The Johannesburg Stock Exchange (JSE) is pleased to announce that it has embarked on a process to incorporate JSE Investor Services Proprietary Limited (JIS) as a wholly owned subsidiary of the JSE by acquiring the minority shareholding of 25.15 % from LMS Partner Holdings.', 'Shares in Thungela Resources, a South African thermal coal exporter, today commenced trading on the commodity counter of the Main Board of the Johannesburg Stock Exchange (JSE).', 'From today, Black South African retail investors will get the opportunity to invest in the world’s largest beer producer, AB InBev, following the listing of SAB Zenzele Kabili on the Johannesburg Stock Exchange’s (JSE) Empowerment Segment.'], 'Image': ['https://www.jse.co.za//sites/default/files/styles/standard_lg/public/medial/images/2021-06/Web_Banner_0.jpg?h=4ae650de&itok=hdGEy5jA', 'https://www.jse.co.za//sites/default/files/styles/standard_lg/public/medial/images/2021-06/Web_Banner2.jpg?h=4ae650de&itok=DgPFtAx8', 'https://www.jse.co.za//sites/default/files/styles/standard_lg/public/medial/images/2021-06/Web_Banner.jpg?h=4ae650de&itok=Q0SsPtAz', 'https://www.jse.co.za//sites/default/files/styles/standard_lg/public/medial/images/2020-12/DSC_0832.jpg?h=156fdada&itok=rL3M2gpn', 'https://www.jse.co.za//sites/default/files/styles/standard_lg/public/medial/images/2021-06/Thungela_Web_Banner_1440x390.jpg?h=4ae650de&itok=kKRO5fQk', 'https://www.jse.co.za//sites/default/files/styles/standard_lg/public/medial/images/2021-05/SAB-Zenzele.jpg?h=4ae650de&itok=n9osAP33']}
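The links in the output above contain a double slash after the host (https://www.jse.co.za//news/...), presumably because base_url ends in a slash and each href starts with one. One way to avoid that, sketched here with the standard library's urllib.parse.urljoin and a path copied from the output:

```python
from urllib.parse import urljoin

base_url = "https://www.jse.co.za/"
href = "/news/market-news/thungela-resources-lists-jse-after-unbundling-anglo-american"

concatenated = base_url + href     # plain concatenation keeps both slashes
joined = urljoin(base_url, href)   # urljoin resolves the path against the base

print(concatenated)
print(joined)
```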

Beautiful Soup Scraping

I'm having issues with previously working code that no longer functions correctly.
My Python code scrapes a website with Beautiful Soup and extracts event data (date, event, link).
My code pulls all of the events located in the tbody. Each event is stored in a <tr class="Box">. The issue is that my scraper seems to stop after this <tr style="box-shadow: none;">. After it reaches this section (which contains 3 advertisements on the site for events that I don't want to scrape), the code stops pulling event data from the <tr class="Box"> rows. Is there a way to skip this tr style and ignore future cases?
import pandas as pd
import bs4 as bs
from bs4 import BeautifulSoup
import urllib.request
import warnings
warnings.filterwarnings("ignore", category=UserWarning, module='bs4')
source = urllib.request.urlopen('https://10times.com/losangeles-us/technology/conferences').read()
soup = bs.BeautifulSoup(source,'html.parser')
#---Get Event Data---
test1=[]
table = soup.find('tbody')
table_rows = table.find_all('tr') #find table rows (tr)
for x in table_rows:
    data = x.find_all('td') #find table data
    row = [x.text for x in data]
    if len(row) > 2: #Excludes rows with only event name/link, but no data.
        test1.append(row)
test1
The data is loaded dynamically via JavaScript, which is why you don't see more results in the HTML you downloaded. You can use this example to load more pages:
import requests
from bs4 import BeautifulSoup
url = "https://10times.com/ajax?for=scroll&path=/losangeles-us/technology/conferences"
params = {"page": 1, "ajax": 1}
headers = {"X-Requested-With": "XMLHttpRequest"}
for params["page"] in range(1, 4): # <-- increase number of pages here
    print("Page {}..".format(params["page"]))
    soup = BeautifulSoup(
        requests.get(url, headers=headers, params=params).content,
        "html.parser",
    )
    for tr in soup.select('tr[class="box"]'):
        tds = [td.get_text(strip=True, separator=" ") for td in tr.select("td")]
        print(tds)
Prints:
Page 1..
['Tue, 29 Sep - Thu, 01 Oct 2020', 'Lens Los Angeles', 'Intercontinental Los Angeles Downtown, Los Angeles', 'LENS brings together the entire Degreed community - our clients, invited prospective clients, thought leaders, partners, employees, executives, and industry experts for two days of discussion, workshops,...', 'Business Services IT & Technology', 'Interested']
['Wed, 30 Sep - Sat, 03 Oct 2020', 'FinCon', 'Long Beach Convention & Entertainment Center, Long Beach 20.1 Miles from Los Angeles', 'FinCon will be helping financial influencers and brands create better content, reach their audience, and make more money. Collaborate with other influencers who share your passion for making personal finance...', 'Banking & Finance IT & Technology', 'Interested 7 following']
['Mon, 05 - Wed, 07 Oct 2020', 'NetDiligence Cyber Risk Summit', 'Loews Santa Monica Beach Hotel, Santa Monica 14.6 Miles from Los Angeles', 'NetDiligence Cyber Risk Summit will conference are attended by hundreds of cyber risk insurance, legal/regulatory and security/privacy technology leaders from all over the world. Connect with leaders in...', 'IT & Technology', 'Interested']
... etc.
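On the original question of skipping the advertisement rows: selecting rows by class leaves out any tr that lacks that class, wherever it appears in the table. The sketch below runs on a small hypothetical HTML snippet (the markup, event names and the lowercase "box" class are assumptions made for illustration, not the real page), and it does not address the dynamic loading described above.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: two event rows with an advertisement row between them.
html = """
<table><tbody>
<tr class="box"><td>Mon, 05 Oct 2020</td><td>Event A</td><td>Venue A</td></tr>
<tr style="box-shadow: none;"><td>Advertisement</td></tr>
<tr class="box"><td>Tue, 06 Oct 2020</td><td>Event B</td><td>Venue B</td></tr>
</tbody></table>
"""

soup = BeautifulSoup(html, "html.parser")

rows = []
# Only <tr> elements carrying the "box" class are matched, so the ad row is skipped.
for tr in soup.find_all("tr", class_="box"):
    rows.append([td.get_text(strip=True) for td in tr.find_all("td")])

print(rows)
```

Because the filter is on the rows you want rather than the rows you don't, new kinds of non-event rows are ignored as well.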

Trouble returning web scraping output as dictionary

I am attempting to scrape a website's staff roster, and I want the end product to be a dictionary in the format {staff: position}. I am currently stuck with it returning every staff name and position as a separate string. It is hard to post the output clearly, but it essentially goes down the list of names, then the list of positions. So, for example, the first name on the list should be paired with the first position, and so on. I have determined that each name and position is a bs4.element.Tag. I believe I need to make a list of the names and a list of the positions, then use zip to put the elements in a dictionary. I have tried implementing this, but nothing so far has worked. The lowest I could get to the text I need using the class_ parameter was the individual div that the p is contained in. I am still inexperienced with Python and new to web scraping, but I am relatively well versed in HTML and CSS, so help would be greatly appreciated.
# Simple script attempting to scrape
# the staff roster off of the
# Greenville Drive website
import requests
from bs4 import BeautifulSoup
URL = 'https://www.milb.com/greenville/ballpark/frontoffice'
page = requests.get(URL)
soup = BeautifulSoup(page.content, 'html.parser')
staff = soup.find_all('div', class_='l-grid__col l-grid__col--xs-12 l-grid__col--sm-4 l-grid__col--md-3 l-grid__col--lg-3 l-grid__col--xl-3')
for staff in staff:
    data = staff.find('p')
    if data:
        print(data.text.strip())
position = soup.find_all('div', class_='l-grid__col l-grid__col--xs-12 l-grid__col--sm-4 l-grid__col--md-6 l-grid__col--lg-6 l-grid__col--xl-6')
for position in position:
    data = position.find('p')
    if data:
        print(data.text.strip())
# This code so far provides the needed data, but need it in a dict()
BeautifulSoup has find_next(), which returns the next tag in the document that matches the given filters. Find each "staff" div and then use find_next() to get the adjacent "position" div.
import requests
from bs4 import BeautifulSoup
URL = 'https://www.milb.com/greenville/ballpark/frontoffice'
page = requests.get(URL)
soup = BeautifulSoup(page.content, 'html.parser')
staff_class = 'l-grid__col l-grid__col--xs-12 l-grid__col--sm-4 l-grid__col--md-3 l-grid__col--lg-3 l-grid__col--xl-3'
position_class = 'l-grid__col l-grid__col--xs-12 l-grid__col--sm-4 l-grid__col--md-6 l-grid__col--lg-6 l-grid__col--xl-6'
result = {}
for staff in soup.find_all('div', class_=staff_class):
    data = staff.find('p')
    if data:
        staff_name = data.text.strip()
        # the position div is the next matching div after the staff div
        position_div = staff.find_next('div', class_=position_class)
        position_name = position_div.text.strip()
        result[staff_name] = position_name
print(result)
Output
{'Craig Brown': 'Owner/Team President', 'Eric Jarinko': 'General Manager', 'Nate Lipscomb': 'Special Advisor to the President', 'Phil Bargardi': 'Vice President of Sales', 'Jeff Brown': 'Vice President of Marketing', 'Greg Burgess, CSFM': 'Vice President of Operations/Grounds', 'Jordan Smith': 'Vice President of Finance', 'Ned Kennedy': 'Director of Inside Sales', 'Patrick Innes': 'Director of Ticket Operations', 'Micah Gold': 'Senior Account Executive', 'Molly Mains': 'Senior Account Executive', 'Houghton Flanagan': 'Account Executive', 'Jeb Maloney': 'Account Executive', 'Olivia Adams': 'Inside Sales Representative', 'Tyler Melson': 'Inside Sales Representative', 'Toby Sandblom': 'Inside Sales Representative', 'Katie Batista': 'Director of Sponsorships and Community Engagement', 'Matthew Tezza': 'Sponsor Services and Activations Manager', 'Melissa Welch': 'Sponsorship and Community Events Manager', 'Beth Rusch': 'Director of West End Events', 'Kristin Kipper': 'Events Manager', 'Grant Witham': 'Events Manager', 'Alex Guest': 'Director of Game Entertainment & Production', 'Lance Fowler': 'Director of Video Production', 'Davis Simpson': 'Director of Media and Creative Services', 'Cameron White': 'Media Relations Manager', 'Ed Jenson': 'Broadcaster', 'Adam Baird': 'Accountant', 'Mike Agostino': 'Director of Food and Beverage', 'Roger Campana': 'Assistant Director of Food and Beverage', 'Wilbert Sauceda': 'Executive Chef', 'Elise Parish': 'Premium Services Manager', 'Timmy Hinds': 'Director of Facility Operations', 'Zack Pagans': 'Assistant Groundskeeper', 'Amanda Medlin': 'Business and Team Operations Manager', 'Allison Roedell': 'Office Manager'}
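To see the find_next() pairing in isolation, here is a minimal sketch on a hypothetical snippet. The class names "name" and "position" are simplified stand-ins for the long grid classes on the real page, and the two staff entries are copied from the output above.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup: each name div is followed by its position div.
html = """
<div class="row">
  <div class="name"><p><b>Craig Brown</b></p></div>
  <div class="position"><p>Owner/Team President</p></div>
</div>
<div class="row">
  <div class="name"><p><b>Eric Jarinko</b></p></div>
  <div class="position"><p>General Manager</p></div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

result = {}
for name_div in soup.find_all("div", class_="name"):
    # find_next() searches forward in document order from name_div.
    position_div = name_div.find_next("div", class_="position")
    if position_div is not None:  # guard against a name with no following position
        result[name_div.get_text(strip=True)] = position_div.get_text(strip=True)

print(result)
```

The None check matters on real pages: find_next() returns None when nothing matches, and calling .text on that would raise an AttributeError.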
Solution using CSS selectors and zip():
import requests
from bs4 import BeautifulSoup
url = 'https://www.milb.com/greenville/ballpark/frontoffice'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')
out = {}
for name, position in zip(soup.select('div:has(+ div p) b'),
                          soup.select('div:has(> div b) + div p')):
    out[name.text] = position.text
from pprint import pprint
pprint(out)
Prints:
{'Adam Baird': 'Accountant',
'Alex Guest': 'Director of Game Entertainment & Production',
'Allison Roedell': 'Office Manager',
'Amanda Medlin': 'Business and Team Operations Manager',
'Beth Rusch': 'Director of West End Events',
'Brady Andrews': 'Assistant Director of Facility Operations',
'Brooks Henderson': 'Merchandise Manager',
'Bryan Jones': 'Facilities Cleanliness Manager',
'Cameron White': 'Media Relations Manager',
'Craig Brown': 'Owner/Team President',
'Davis Simpson': 'Director of Media and Creative Services',
'Ed Jenson': 'Broadcaster',
'Elise Parish': 'Premium Services Manager',
'Eric Jarinko': 'General Manager',
'Grant Witham': 'Events Manager',
'Greg Burgess, CSFM': 'Vice President of Operations/Grounds',
'Houghton Flanagan': 'Account Executive',
'Jeb Maloney': 'Account Executive',
'Jeff Brown': 'Vice President of Marketing',
'Jenny Burgdorfer': 'Director of Merchandise',
'Jordan Smith ': 'Vice President of Finance',
'Katie Batista': 'Director of Sponsorships and Community Engagement',
'Kristin Kipper': 'Events Manager',
'Lance Fowler': 'Director of Video Production',
'Matthew Tezza': 'Sponsor Services and Activations Manager',
'Melissa Welch': 'Sponsorship and Community Events Manager',
'Micah Gold': 'Senior Account Executive',
'Mike Agostino': 'Director of Food and Beverage',
'Molly Mains': 'Senior Account Executive',
'Nate Lipscomb': 'Special Advisor to the President',
'Ned Kennedy': 'Director of Inside Sales',
'Olivia Adams': 'Inside Sales Representative',
'Patrick Innes': 'Director of Ticket Operations',
'Phil Bargardi': 'Vice President of Sales',
'Roger Campana': 'Assistant Director of Food and Beverage',
'Steve Seman': 'Merchandise / Ticketing Advisor',
'Timmy Hinds': 'Director of Facility Operations',
'Toby Sandblom': 'Inside Sales Representative',
'Tyler Melson': 'Inside Sales Representative',
'Wilbert Sauceda': 'Executive Chef',
'Zack Pagans': 'Assistant Groundskeeper'}
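One detail in the output above: the key 'Jordan Smith ' carries a trailing space, because the tag text is used unstripped. A minimal sketch of the same zip() pairing with whitespace stripped, using a few values copied from the output as hypothetical input lists (no scraping involved):

```python
# Hypothetical stand-ins for the text of the two selected tag lists.
names = ["Craig Brown", "Jordan Smith ", "Eric Jarinko"]
positions = ["Owner/Team President", "Vice President of Finance", "General Manager"]

# zip() pairs items by index; strip() removes stray whitespace from keys and values.
roster = {name.strip(): position.strip() for name, position in zip(names, positions)}

print(roster)
```

Note that zip() stops at the shorter of the two lists, so if the selectors return different numbers of names and positions, the extra items are silently dropped. Comparing len(names) and len(positions) first is a cheap sanity check.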

Trying to scrape a page and separate it into Headings and Contents. The problem is that both have the same class and tags. How to separate them?

I am trying to scrape http://www.intermediary.natwest.com/intermediary-solutions/lending-criteria.html and separate it into two parts, Heading and Content. The problem is that both have the same class and tags. Other than using regex and hard-coding, how can I distinguish them and extract them into two columns in Excel?
In the picture (https://ibb.co/8X5xY9C) or at the website link provided, bold text (except the alphabet letters such as "A", and the later 'back to top' links) represents a Heading, and the explanation (the non-bold text just below the bold) represents the Content. Content also includes 'li' and 'ul' blocks later in the site, which should come under their respective Heading.
#Code to Start With
from bs4 import BeautifulSoup
import requests
url = "http://www.intermediary.natwest.com/intermediary-solutions/lending-criteria.html"
html = requests.get(url)
soup = BeautifulSoup(html.text, "html.parser")
Heading = soup.findAll('strong')
content = soup.findAll('div', {"class": "comp-rich-text"})
The output Excel file should look something like this:
https://i.stack.imgur.com/NsMmm.png
I've thought about it a little more and came up with a better solution. Rather than crowd my initial solution, I chose to add a second solution here:
Following my logic of splitting the HTML by the headlines (essentially breaking it up wherever we find <strong> tags), I chose to convert each section to a string using .prettify(), split on those specific tags, and then read the pieces back into BeautifulSoup to pull the text. From what I can see it hasn't missed anything, but you'll have to search through the dataframe to double check:
import requests
from bs4 import BeautifulSoup
import pandas as pd
url = 'http://www.intermediary.natwest.com/intermediary-solutions/lending-criteria.html'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36'}
response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.text, 'html.parser')
sections = soup.find_all('div',{'class':'accordion-section-content'})
results = {}
for section in sections:
    # split the section's HTML wherever a headline (<strong>) starts
    splits = section.prettify().split('<strong>')
    for each in splits:
        try:
            headline, content = each.split('</strong>')[0].strip(), each.split('</strong>')[1]
            headline = BeautifulSoup(headline, 'html.parser').text.strip()
            content = BeautifulSoup(content, 'html.parser').text.strip()
            content_split = content.split('\n')
            # drop blank lines and join the rest into one line of text
            content = ' '.join([ text.strip() for text in content_split if text.strip() != ''])
            results[headline] = content
        except IndexError:
            # no closing </strong> in this piece, so it has no headline
            continue
df = pd.DataFrame(results.items(), columns = ['Headings','Content'])
df.to_csv('C:/test.csv', index=False)
Output:
print (df)
Headings Content
0 Age requirements Applicants must be at least 18 years old at th...
1 Affordability Our affordability calculator is the same one u...
2 Agricultural restriction The only acceptable agricultural tie is where ...
3 Annual percentage rate of charge (APRC) The APRC is all fees associated with the mortg...
4 Adverse credit We consult credit reference agencies to look a...
5 Applicants (number of) The maximum number of applicants is two.
6 Armed Forces personnel Unsecured personal loans are only acceptable f...
7 Back to back Back to back is typically where the vendor has...
8 Customer funded purchase: when the customer has funded the purchase usin...
9 Bridging: residential mortgage applications where the cu...
10 Inherited: a recently inherited property where the benefi...
11 Porting: where a fixed/discounted rate was ported to a ...
12 Repossessed property: where the vendor is the mortgage lender in pos...
13 Part exchange: where the vendor is a large national house bui...
14 Bank statements We accept internet bank statements in paper fo...
15 Bonus For guaranteed bonuses we will consider an ave...
16 British National working overseas Applicants must be resident in the UK. Applica...
17 Builder's Incentives The maximum amount of acceptable incentive is ...
18 Buy-to-let (purpose) A buy-to-let mortgage can be used for: Purcha...
19 Capital Raising - Acceptable purposes permanent home improvem...
20 Buy-to-let (affordability) Buy to Let affordability must be assessed usin...
21 Buy-to-let (eligibility criteria) The property must be in England, Scotland, Wal...
22 Definition of a portfolio landlord We define a portfolio landlord as a customer w...
23 Carer's Allowance Carer's Allowance is paid to people aged 16 or...
24 Cashback Where a mortgage product includes a cashback f...
25 Casual employment Contract/agency workers with income paid throu...
26 Certification of documents When submitting copies of documents, please en...
27 Child Benefit We can accept up to 100% of working tax credit...
28 Childcare costs We use the actual amount the customer has decl...
29 When should childcare costs not be included? There are a number of situations where childca...
.. ... ...
108 Shared equity We lend on the Government-backed shared equity...
109 Shared ownership We do not lend against Shared Ownership proper...
110 Solicitors' fees We have a panel of solicitors for our fees ass...
111 Source of deposit We reserve the right to ask for proof of depos...
112 Sole trader/partnerships We will take an average of the last two years'...
113 Standard variable rate A standard variable rate (SVR) is a type of v...
114 Student loans Repayment of student loans is dependent on rec...
115 Tenure Acceptable property tenure: Feuhold, Freehold,...
116 Term Minimum term is 3 years Residential - Maximum...
117 Unacceptable income types The following forms of income are classed as u...
118 Bereavement allowance: paid to widows, widowers or surviving civil pa...
119 Employee benefit trusts (EBT): this is a tax mitigation scheme used in conjun...
120 Expenses: not acceptable as they're paid to reimburse pe...
121 Housing Benefit: payment of full or partial contribution to cla...
122 Income Support: payment for people on low incomes, working les...
123 Job Seeker's Allowance: paid to people who are unemployed or working 1...
124 Stipend: a form of salary paid for internship/apprentic...
125 Third Party Income: earned by a spouse, partner, parent who are no...
126 Universal Credit: only certain elements of the Universal Credit ...
127 Universal Credit The Standard Allowance element, which is the n...
128 Valuations: day one instruction We are now instructing valuations on day one f...
129 Valuation instruction A valuation will be automatically instructed w...
130 Valuation fees A valuation will always be obtained using a pa...
131 Please note: W hen upgrading the free valuation for a home...
132 Adding fees to the loan Product fees are the only fees which can be ad...
133 Product fee This fee is paid when the mortgage is arranged...
134 Working abroad Previously, we required applicants to be empl...
135 Acceptable - We may consider applications from people who: ...
136 Not acceptable - We will not consider applications from people...
137 Working and Family Tax Credits We can accept up to 100% of Working Tax Credit...
[138 rows x 2 columns]
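To make the split-on-<strong> idea easier to follow, here is a minimal sketch on a hypothetical snippet. The markup is invented for illustration and only loosely mirrors the page; the two headings are borrowed from the output above.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: each heading is bold, its content follows until the next heading.
html = (
    "<div><p><strong>Age requirements</strong></p>"
    "<p>Applicants must be at least 18.</p>"
    "<p><strong>Term</strong></p>"
    "<ul><li>Minimum term is 3 years</li></ul></div>"
)

results = {}
# Everything before the first <strong> has no heading, so skip the first piece.
for chunk in html.split("<strong>")[1:]:
    # partition() splits once: heading text, the closing tag, and the rest.
    headline, _, rest = chunk.partition("</strong>")
    heading = BeautifulSoup(headline, "html.parser").get_text(strip=True)
    content = BeautifulSoup(rest, "html.parser").get_text(" ", strip=True)
    results[heading] = content

print(results)
```

Since the results are stored in a dict, two headings with identical text would overwrite each other; collecting (heading, content) tuples in a list avoids that if the page repeats a heading.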
EDIT: SEE OTHER SOLUTION PROVIDED
It's tricky. I tried to grab the headings, then use those to grab all the text that follows each heading and precedes the next heading. The code below is a little messy and needs cleaning up, but hopefully it gives you something to work with or gets you moving in the right direction:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import re
url = 'http://www.intermediary.natwest.com/intermediary-solutions/lending-criteria.html'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36'}
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
sections = soup.find_all('div',{'class':'accordion-section-content'})
results = {}
for section in sections:
    headlines = section.find_all('strong')
    headlines = [each.text for each in headlines ]
    for i, headline in enumerate(headlines):
        if headline != headlines[-1]:
            next_headline = headlines[i+1]
        else:
            next_headline = ''
        try:
            find_content = section(text=headline)[0].parent.parent.find_next_siblings()
            if ':' in headline and 'Gifted deposit' not in headline and 'Help to Buy' not in headline:
                content = section(text=headline)[0].parent.nextSibling
                results[headline] = content.strip()
                break
        except:
            find_content = section(text=re.compile(headline))[0].parent.parent.find_next_siblings()
        if find_content == []:
            try:
                find_content = section(text=headline)[0].parent.parent.parent.find_next_siblings()
            except:
                find_content = section(text=re.compile(headline))[0].parent.parent.parent.find_next_siblings()
        content = []
        for sibling in find_content:
            if next_headline not in sibling.text or headline == headlines[-1]:
                content.append(sibling.text)
            else:
                content = '\n'.join(content)
                results[headline.strip()] = content.strip()
                break
        if headline == headlines[-1]:
            content = '\n'.join(content)
            results[headline] = content.strip()
df = pd.DataFrame(results.items())

Resources