Python/Selenium - how to loop through hrefs in <li>? - python-3.x

Web URL: https://www.ipsos.com/en-us/knowledge/society/covid19-research-in-uncertain-times
I want to parse the HTML as below:
I want to get all hrefs within the < li > elements and the highlighted text. I tried the code
elementList = driver.find_element_by_class_name('block-wysiwyg').find_elements_by_tag_name("li")
for i in range(len(elementList)):
driver.find_element_by_class_name('blcokwysiwyg').find_elements_by_tag_name("li").get_attribute("href")
But the block returned none.
Can anyone please help me with the above code?

I suppose it will fetch you the required content.
import requests
from bs4 import BeautifulSoup
link = 'https://www.ipsos.com/en-us/knowledge/society/covid19-research-in-uncertain-times'
r = requests.get(link)
soup = BeautifulSoup(r.text,"html.parser")
for item in soup.select(".block-wysiwyg li"):
item_text = item.get_text(strip=True)
item_link = item.select_one("a[href]").get("href")
print(item_text,item_link)

Try is this way:
coronas = driver.find_element_by_xpath("//div[#class='block-wysiwyg']/ul/li")
hr = coronas.find_element_by_xpath('./a')
print(coronas.text)
print(hr.get_attribute('href'))
Output:
The coronavirus is touching the lives of all Americans, but race, age, and income play a big role in the exact ways the virus — and the stalled economy — are affecting people. Here's what that means.
https://www.ipsos.com/en-us/america-under-coronavirus

Related

Python requests pull not always retrieving data

To practice programming, I am trying to help a friend review a subreddit's data via webscraping and requests and bs4. (I prefer requests for this task since I am moving this script over to my Rasberry Pi and don't think its little heart could even get Chrome installed.
I am running into an issue where my requests only outputs the results sometimes, meaning it will pull the name and url of the post, say 1 out of 5 times when run. When the request returns no data, it doesn't return an error, the program just stops.
from time import sleep
import requests
import os
import re
i = 1
selections = ""
r = requests.get("https://www.reddit.com/r/hardwareswap/search?q=Vive&restrict_sr=1", timeout = None)
soup = BeautifulSoup(r.text, 'html.parser')
results = soup.find_all('a', attrs = {'data-click-id':'body'})
textitems = []
for result in results:
textitems.append(result.text.strip())
for result in textitems:
print(result)
links = soup.find_all('a', attrs = {'data-click-id':'body'})
for link in links:
print(link.attrs['href'])
Any thoughts as to why this happens? My initial thoughts were it was either due to a reddit policy or an invalid URL.
Thanks!

Could not find a class name for BeautifulSoup on the website to crawl on

I am new to Beautiful Soup.
I am trying to get the "Ranking Criteria" class in the below link.
Unfortunately, I used the "criteria" as its class for soup_findAll(),
but it showed no content there.
I could not find any other class names which can give me the data I want
(Overall score, academic reputation and so on)
I actually wanted to do web crawling for multiple universities,
so I hope to use the URL which I can format for various universities(Just change the universities name).
Otherwise, I would have just used the outerHTML for that(tested and it worked, but I did not know
how to customize that for multiple universities)
My code is as below. I ended up using get_text():
r = requests.get("https://www.topuniversities.com/universities/california-institute-technology-caltech")
html = r.text
soup = BeautifulSoup(html, 'html.parser')
tds = soup.get_text()
print(tds)
It was not successful, as it got too much stuff which left me hard to evaluate the information I want.
Any help would be highly appreciated! Thanks!
The link I am trying to scrape:
The data is loaded dynamically via JavaScript Ajax request. But you could use requests module to simulate it.
For example:
import requests
from bs4 import BeautifulSoup
url = 'https://www.topuniversities.com/universities/california-institute-technology-caltech'
soup = BeautifulSoup( requests.get(url).content, 'html.parser' )
ajax_url = 'https://www.topuniversities.com' + soup.select_one('a.use-ajax')['href'].replace('nojs', 'ajax')
data = requests.post(ajax_url).json()
for d in data:
if 'data' in d:
soup = BeautifulSoup(d['data'], 'html.parser')
break
for div in soup.select('div.criteria'):
criteria = div.find(text=True).strip()
ranking = div.b.get_text(strip=True)
print('{:<30} {}'.format(criteria, ranking))
Prints:
Overall Score: 97
Academic Reputation: 97
Employer Reputation: 82.8
Faculty Student: 100
Citations per Faculty: 99.9
International Faculty: 100
International Students: 88.2

How to change the page no. of a page's search result for Web Scraping using Python?

I am scraping data from a webpage which contains search results using Python.
I am able to scrape data from the 1st search result page.
I want to loop using the same code, changing the search result page with each loop cycle.
Is there any way to do it? Is there a way to click 'Next' button without actually opening the page in a browser?
At a high level this is possible, you will need to use requests or selenium in addition to beautifulsoup.
Here is an example of defining a element and clicking the button by xpath:
html = driver.page_source
soup = BeautifulSoup(html,'lxml')
sleep(1) # Time in seconds.
ele = driver.find_element_by_xpath("/html/body/div[1]/div/div/div/div[2]/div[2]/div/table/tfoot/tr/td/div//button[contains(text(),'Next')]")
ele.click()
Yes, of course you can do what you described. Although you didn't post an actual couple solutions to help you get started.
import requests
from bs4 import BeautifulSoup
url = "http://www.bolsamadrid.es/ing/aspx/Empresas/Empresas.aspx"
res = requests.get(url,headers = {"User-Agent":"Mozilla/5.0"})
soup = BeautifulSoup(res.text,"lxml")
for page in range(7):
formdata = {}
for item in soup.select("#aspnetForm input"):
if "ctl00$Contenido$GoPag" in item.get("name"):
formdata[item.get("name")] = page
else:
formdata[item.get("name")] = item.get("value")
req = requests.post(url,data=formdata)
soup = BeautifulSoup(req.text,"lxml")
for items in soup.select("#ctl00_Contenido_tblEmisoras tr")[1:]:
data = [item.get_text(strip=True) for item in items.select("td")]
print(data)

Problem exporting Web Url results into CSV using beautifulsoup3

Problem: I tried to export results (Name, Address, Phone) into CSV but the CSV code not returning expected results.
#Import the installed modules
import requests
from bs4 import BeautifulSoup
import json
import re
import csv
#To get the data from the web page we will use requests get() method
url = "https://www.lookup.pk/dynamic/search.aspx?searchtype=kl&k=gym&l=lahore"
page = requests.get(url)
# To check the http response status code
print(page.status_code)
#Now I have collected the data from the web page, let's see what we got
print(page.text)
#The above data can be view in a pretty format by using beautifulsoup's prettify() method. For this we will create a bs4 object and use the prettify method
soup = BeautifulSoup(page.text, 'lxml')
print(soup.prettify())
#Find all DIVs that contain Companies information
product_name_list = soup.findAll("div",{"class":"CompanyInfo"})
#Find all Companies Name under h2tag
company_name_list_heading = soup.findAll("h2")
#Find all Address on page Name under a tag
company_name_list_items = soup.findAll("a",{"class":"address"})
#Find all Phone numbers on page Name under ul
company_name_list_numbers = soup.findAll("ul",{"class":"submenu"})
Created for loop to print out all company Data
for company_address in company_name_list_items:
print(company_address.prettify())
# Create for loop to print out all company Names
for company_name in company_name_list_heading:
print(company_name.prettify())
# Create for loop to print out all company Numbers
for company_numbers in company_name_list_numbers:
print(company_numbers.prettify())
Below is the code to export the results (name, address & phonenumber) into CSV
outfile = open('gymlookup.csv','w', newline='')
writer = csv.writer(outfile)
writer.writerow(["name", "Address", "Phone"])
product_name_list = soup.findAll("div",{"class":"CompanyInfo"})
company_name_list_heading = soup.findAll("h2")
company_name_list_items = soup.findAll("a",{"class":"address"})
company_name_list_numbers = soup.findAll("ul",{"class":"submenu"})
Here is the for loop to loop over data.
for company_name in company_name_list_heading:
names = company_name.contents[0]
for company_numbers in company_name_list_numbers:
names = company_numbers.contents[1]
for company_address in company_name_list_items:
address = company_address.contents[1]
writer.writerow([name, Address, Phone])
outfile.close()
You need to work on understanding how for loops work, and also the difference between strings, and variables and other datatypes. You also need to work on using what you have seen from other stackoverflow questions and learn to apply that. This is essentially the same as youre other 2 questions you already posted, but just a different site you're scraping from (but I didn't flag it as a duplicate, as you're new to stackoverflow and web scrpaing and I remember what it was like to try to learn). I'll still answer your questions, but eventually you need to be able to find the answers on your own and learn how to adapt it and apply (coding isn't a paint by colors. Which I do see you are adapting some of it. Good job in finding the "div",{"class":"CompanyInfo"} tag to get the company info)
That data you are pulling (name, address, phone) needs to be within a nested loop of the div class=CompanyInfo element/tag. You could theoretically have it the way you have it now, by putting those into a list, and then writing to the csv file from your lists, but theres a risk of data missing and then your data/info could be off or not with the correct corresponding company.
Here's what the full code looks like. notice that the variables are stored with in the loop, and then written. It then goes to the next block of CompanyInfo and continues.
#Import the installed modules
import requests
from bs4 import BeautifulSoup
import csv
#To get the data from the web page we will use requests get() method
url = "https://www.lookup.pk/dynamic/search.aspx?searchtype=kl&k=gym&l=lahore"
page = requests.get(url)
# To check the http response status code
print(page.status_code)
#Now I have collected the data from the web page, let's see what we got
print(page.text)
#The above data can be view in a pretty format by using beautifulsoup's prettify() method. For this we will create a bs4 object and use the prettify method
soup = BeautifulSoup(page.text, 'html.parser')
print(soup.prettify())
outfile = open('gymlookup.csv','w', newline='')
writer = csv.writer(outfile)
writer.writerow(["Name", "Address", "Phone"])
#Find all DIVs that contain Companies information
product_name_list = soup.findAll("div",{"class":"CompanyInfo"})
# Now loop through those elements
for element in product_name_list:
# Takes 1 block of the "div",{"class":"CompanyInfo"} tag and finds/stores name, address, phone
name = element.find('h2').text
address = element.find('address').text.strip()
phone = element.find("ul",{"class":"submenu"}).text.strip()
# writes the name, address, phone to csv
writer.writerow([name, address, phone])
# now will go to the next "div",{"class":"CompanyInfo"} tag and repeats
outfile.close()

Web Crawler keeps saying no attribute even though it really has

I have been developing a web-crawler for this website (http://www.bobaedream.co.kr/cyber/CyberCar.php?gubun=I&page=1). But I have a trouble at crawling each title of the stock. I am pretty sure that there is attribute for carinfo_title = carinfo.find_all('a', class_='title').
Please check out the attached code and website code, and then give me any advice.
Thanks.
(Website Code)
https://drive.google.com/open?id=0BxKswko3bYpuRV9seTZZT3REak0
(My code)
from bs4 import BeautifulSoup
import urllib.request
target_url = "http://www.bobaedream.co.kr/cyber/CyberCar.php?gubun=I&page=1"
def fetch_post_list():
URL = target_url
res = urllib.request.urlopen(URL)
html = res.read()
soup = BeautifulSoup(html, 'html.parser')
table = soup.find('table', class_='cyber')
#Car Info and Link
carinfo = table.find_all('td', class_='carinfo')
carinfo_title = carinfo.find_all('a', class_='title')
print (carinfo_title)
return carinfo_title
fetch_post_list()
You have multiple elements with the carinfo class and for every "carinfo" you need to get to the car title. Loop over the result of the table.find_all('td', class_='carinfo'):
for carinfo in table.find_all('td', class_='carinfo'):
carinfo_title = carinfo.find('a', class_='title')
print(carinfo_title.get_text())
Would print:
미니 쿠퍼 S JCW
지프 랭글러 3.8 애니버서리 70주년 에디션
...
벤츠 뉴 SLK200 블루이피션시
포르쉐 뉴 카이엔 4.8 GTS
마쯔다 MPV 2.3
Note that if you need only car titles, you can simplify it down to a single line:
print([elm.get_text() for elm in soup.select('table.cyber td.carinfo a.title')])
where the string inside the .select() method is a CSS selector.

Resources