How to fix "businessObject not defined" - python-3.x

I am a newbie to Python and web scraping. To practice, I am trying to pull some business names from HTML tags on a website. However, the code throws an 'object is not defined' error.
from bs4 import BeautifulSoup
import requests
url = 'https://marketplace.akc.org/groomers/?location=Michigan&page=1'
response = requests.get(url, timeout = 5)
content = BeautifulSoup(response.content, "html.parser")
for business in content.find_all('div', attrs={"class": "groomer-salon-card__details"}):
    businessObject = {
        "BusinessName": business.find('h4', attrs={"class": "groomer-salon-card__name"}).text.encode('utf-8')
    }

print(businessObject)
Expected: I am trying to retrieve the business names from this web page.
Result:
NameError: name 'businessObject' is not defined

When you did
content.find_all('div', attrs={"class": "groomer-salon-card__details"})
you actually got back an empty list, because nothing matched. So when you did
for business in content.find_all('div', attrs={"class": "groomer-salon-card__details"}):
the loop body never ran, and businessObject was never created. As mentioned in the comments, that is what caused your error.
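You can reproduce the error with a minimal sketch (hypothetical names): iterating over an empty list never executes the loop body, so a name assigned inside it is never created.

```python
# What find_all() returns when nothing matches:
matches = []

# The loop body never runs, so 'obj' is never defined.
for item in matches:
    obj = {"name": item}

try:
    print(obj)
except NameError as err:
    error_message = str(err)
    print(error_message)  # name 'obj' is not defined
```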
Content is dynamically loaded from elsewhere in the DOM using JavaScript (along with other DOM modifications). You can still regex out the JavaScript object that contains the content used to update the DOM as you saw it in the browser, then parse it with a JSON parser as follows:
import requests, re, json
url = 'https://marketplace.akc.org/groomers/?location=Michigan&page=1'
response = requests.get(url, timeout = 5)
p = re.compile(r'state: (.*?)\n', re.DOTALL)
data = json.loads(p.findall(response.text)[0])
for listing in data['content']['search_results']['pages']['data']:
    print(listing['organization_name'])
If you view the page source you will see that the DOM is essentially populated dynamically from top to bottom, with mutation observers monitoring progress.
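The regex-and-parse technique above can be demonstrated offline. The markup below is made up for illustration and mirrors the structure the answer assumes (a `state:` key whose value is a JSON object embedded in a script tag):

```python
import re
import json

# Hypothetical page with a JS object literal embedded in a script tag.
html = """
<script>
window.__APP__ = {
  state: {"content": {"search_results": [{"organization_name": "Paws & Claws"}]}}
};
</script>
"""

# Non-greedy match: everything between 'state: ' and the end of that line.
p = re.compile(r'state: (.*?)\n', re.DOTALL)
data = json.loads(p.findall(html)[0])
print(data['content']['search_results'][0]['organization_name'])
```

This only works when the embedded object is valid JSON; JS literals with unquoted keys or trailing commas would need a more forgiving parser.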

Related

BeautifulSoup WebScraping Issue: Cannot find specific classes for this specific Website (Python 3.7)

I am a bit new to web scraping. I have built web scrapers with the methods below before, but on this specific website the parser cannot locate the specific class ('mainTitle___mbpq1'), which refers to the text of an announcement. Whenever I run the code it returns None, and the same happens for the majority of other classes. I want to capture this info without using Selenium, since that slows the process down from what I understand. I think the issue is that the data comes as JSON via script tags (I may be completely wrong, just a guess), but I do not know much about this area, so any help would be much appreciated.
I have attempted the code below, with no success.
from bs4 import BeautifulSoup
import re
import requests
# Method 1
url_4 = "https://www.kucoin.com/news/categories/listing"
res = requests.get(url_4)
html_page = res.content
soup = BeautifulSoup(html_page, 'html.parser')
texts = soup.body
text = soup.body.div.find('div',{'class':'mainTitle___mbpq1'})
print(text)
from bs4 import BeautifulSoup
import urllib3
import re
# Method2
http = urllib3.PoolManager()
comm = re.compile("<!--|-->")
def make_soup(url):
    page = http.request('GET', url)
    soupdata = BeautifulSoup(page.data, features="lxml")
    return soupdata
soup = make_soup(url_4)
Annouce_Info = soup.find('div',{'class':'mainTitle___mbpq1'})
print(Annouce_Info)
The data is loaded from an external source via JavaScript. To print all article titles, you can use this example:
import json
import requests
url = "https://www.kucoin.com/_api/cms/articles"
params = {"page": 1, "pageSize": 10, "category": "listing", "lang": ""}
data = requests.get(url, json=params).json()
# uncomment this to print all data:
# print(json.dumps(data, indent=4))
for item in data["items"]:
    print(item["title"])
Prints:
PhoenixDAO (PHNX) Gets Listed on KuCoin!
LABS Group (LABS) Gets Listed on KuCoin! World Premiere!
Polkadex (PDEX) Gets Listed on KuCoin! World Premiere!
Announcement of Polkadex (PDEX) Token Sale on KuCoin Spotlight
KuCoin Futures Has Launched USDT Margined NEO, ONT, XMR, SNX Contracts
Introducing the Polkadex (PDEX) Token Sale on KuCoin Spotlight
Huobi Token (HT) Gets Listed on KuCoin!
KuCoin Futures Has Launched USDT Margined XEM, BAT, XTZ, QTUM Contracts
RedFOX Labs (RFOX) Gets Listed on KuCoin!
Boson Protocol (BOSON) Gets Listed on KuCoin! World Premiere!
If you are trying to scrape information about new listings on crypto exchanges, you may be interested in this API:
https://rapidapi.com/Diver44/api/new-cryptocurrencies-listings/
import requests
url = "https://new-cryptocurrencies-listings.p.rapidapi.com/new_listings"
headers = {
    'x-rapidapi-host': "new-cryptocurrencies-listings.p.rapidapi.com",
    'x-rapidapi-key': "your-key"
}
response = requests.request("GET", url, headers=headers)
print(response.text)
It includes an endpoint with new listings from the biggest exchanges, and a very useful endpoint with information about which exchanges sell a specific coin and its price on those exchanges.

Even though the Python code seems correct, an AttributeError occurs and no text is scraped

When I was using BeautifulSoup to scrape product names and prices from listings, similar code worked on another website. But when running it against this website, the soup.findAll attributes are there, yet no text is scraped and an AttributeError occurs. Can anyone help take a look at the code and inspect the website?
I checked and ran it many times, and the same issue remained.
The code is here:
url = 'https://shopee.co.id/Handphone-Aksesoris-cat.40'
re = requests.get(url,headers=headers)
print(str(re.status_code))
soup = BeautifulSoup(re.text, "html.parser")
for el in soup.findAll('div', attrs={"class": "collection-card_collecton-title"}):
    name = el.get.text()
    print(name)
AttributeError: 'NoneType' object has no attribute 'text'
You are missing an i in the class name. However, the content is loaded dynamically from an API call, which is why you can't find it in your response: requests does not run JavaScript, so the follow-up call that updates the DOM never happens. You can find the API call in the browser's Network tab; it returns JSON.
import requests
r = requests.get('https://shopee.co.id/api/v2/custom_collection/get?category_id=40&platform=0').json()
titles = [i['collection_title'] for i in r['collections'][0]['list_popular_collection']]
print(titles)
Prices as well:
import requests
r = requests.get('https://shopee.co.id/api/v2/custom_collection/get?category_id=40&platform=0').json()
titles, prices = zip(*[(i['collection_title'], i['price']) for i in r['collections'][0]['list_popular_collection']])
print(titles,prices)
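As a side note, the effect of the class-name typo can be reproduced offline. The snippet below uses made-up markup: one character off in the class name and find_all() silently returns an empty list instead of raising an error.

```python
from bs4 import BeautifulSoup

# Hypothetical markup with the correctly spelled class name.
html = '<div class="collection-card_collection-title">Handphone</div>'
soup = BeautifulSoup(html, 'html.parser')

# Missing 'i' in 'collection' -> empty result, no error raised.
misspelled = soup.find_all('div', attrs={"class": "collection-card_collecton-title"})
correct = soup.find_all('div', attrs={"class": "collection-card_collection-title"})

print(misspelled)       # []
print(correct[0].text)  # Handphone
```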

How to change the page no. of a page's search result for Web Scraping using Python?

I am scraping data from a webpage which contains search results using Python.
I am able to scrape data from the 1st search result page.
I want to loop using the same code, changing the search result page with each loop cycle.
Is there any way to do it? Is there a way to click 'Next' button without actually opening the page in a browser?
At a high level this is possible; you will need to use requests or Selenium in addition to BeautifulSoup.
Here is an example of locating an element and clicking the 'Next' button by XPath:
from time import sleep
from bs4 import BeautifulSoup

sleep(1)  # give the page time to render (time in seconds)
html = driver.page_source  # assumes a Selenium webdriver instance named 'driver'
soup = BeautifulSoup(html, 'lxml')
ele = driver.find_element_by_xpath("/html/body/div[1]/div/div/div/div[2]/div[2]/div/table/tfoot/tr/td/div//button[contains(text(),'Next')]")
ele.click()
Yes, of course you can do what you described. Although you didn't post an actual URL, here is an example to help you get started.
import requests
from bs4 import BeautifulSoup
url = "http://www.bolsamadrid.es/ing/aspx/Empresas/Empresas.aspx"
res = requests.get(url,headers = {"User-Agent":"Mozilla/5.0"})
soup = BeautifulSoup(res.text,"lxml")
for page in range(7):
    formdata = {}
    for item in soup.select("#aspnetForm input"):
        if "ctl00$Contenido$GoPag" in item.get("name"):
            formdata[item.get("name")] = page
        else:
            formdata[item.get("name")] = item.get("value")

    req = requests.post(url, data=formdata)
    soup = BeautifulSoup(req.text, "lxml")
    for items in soup.select("#ctl00_Contenido_tblEmisoras tr")[1:]:
        data = [item.get_text(strip=True) for item in items.select("td")]
        print(data)

Unable to use the site search function

I am trying to use the built-in search function from the site but I keep getting results from the main page. Not sure what I am doing wrong.
import requests
from bs4 import BeautifulSoup
body = {'input':'ferris'}  # <-- have also tried 'query'
con = requests.post('http://www.collegedata.com/', data=body)
soup = BeautifulSoup(con.content, 'html.parser')
products = soup.findAll('div', {'class': 'schoolCityCol'})
print(soup)
print (products)
You have two issues in your code:
The POST URL is incorrect. It should be:
con = session.post('http://www.collegedata.com/cs/search/college/college_search_tmpl.jhtml', data=body)
Your POST data is incorrect too. It should be:
body = {'method':'submit', 'collegeName':'ferris', 'searchType':'1'}
You can use the developer tools in any browser (preferably Chrome) and check the POST URL and data on the Network tab.
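Putting both fixes together without actually sending anything: the sketch below builds the corrected POST (URL and form fields taken from the answer above; the site may have changed since) and inspects what would go over the wire.

```python
import requests

# Corrected endpoint and form fields from the answer above.
url = 'http://www.collegedata.com/cs/search/college/college_search_tmpl.jhtml'
body = {'method': 'submit', 'collegeName': 'ferris', 'searchType': '1'}

# Prepare the request without sending it, to verify the payload.
prepared = requests.Request('POST', url, data=body).prepare()
print(prepared.method)  # POST
print(prepared.body)    # method=submit&collegeName=ferris&searchType=1
```

Inspecting a PreparedRequest like this is a convenient way to confirm that your form data matches what the browser sends in the Network tab.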

Web Crawler keeps saying no attribute even though it really has

I have been developing a web crawler for this website (http://www.bobaedream.co.kr/cyber/CyberCar.php?gubun=I&page=1). But I am having trouble crawling the title of each listing. I am pretty sure the attribute exists for carinfo_title = carinfo.find_all('a', class_='title').
Please check out the attached code and website code, and then give me any advice.
Thanks.
(Website Code)
https://drive.google.com/open?id=0BxKswko3bYpuRV9seTZZT3REak0
(My code)
from bs4 import BeautifulSoup
import urllib.request
target_url = "http://www.bobaedream.co.kr/cyber/CyberCar.php?gubun=I&page=1"
def fetch_post_list():
    URL = target_url
    res = urllib.request.urlopen(URL)
    html = res.read()
    soup = BeautifulSoup(html, 'html.parser')
    table = soup.find('table', class_='cyber')

    # Car Info and Link
    carinfo = table.find_all('td', class_='carinfo')
    carinfo_title = carinfo.find_all('a', class_='title')
    print(carinfo_title)
    return carinfo_title

fetch_post_list()
You have multiple elements with the carinfo class, and for every "carinfo" you need to get the car title. Loop over the result of table.find_all('td', class_='carinfo'):
for carinfo in table.find_all('td', class_='carinfo'):
    carinfo_title = carinfo.find('a', class_='title')
    print(carinfo_title.get_text())
Would print:
미니 쿠퍼 S JCW
지프 랭글러 3.8 애니버서리 70주년 에디션
...
벤츠 뉴 SLK200 블루이피션시
포르쉐 뉴 카이엔 4.8 GTS
마쯔다 MPV 2.3
Note that if you need only car titles, you can simplify it down to a single line:
print([elm.get_text() for elm in soup.select('table.cyber td.carinfo a.title')])
where the string inside the .select() method is a CSS selector.
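The one-liner can be tried offline with made-up markup mirroring the page structure described above (nested table/td/a with the same classes):

```python
from bs4 import BeautifulSoup

# Hypothetical markup with the same class structure as the target page.
html = """
<table class="cyber">
  <tr><td class="carinfo"><a class="title">Mini Cooper S</a></td></tr>
  <tr><td class="carinfo"><a class="title">Jeep Wrangler 3.8</a></td></tr>
</table>
"""
soup = BeautifulSoup(html, 'html.parser')

# Descendant CSS selector: table with class 'cyber' -> td 'carinfo' -> a 'title'
titles = [elm.get_text() for elm in soup.select('table.cyber td.carinfo a.title')]
print(titles)  # ['Mini Cooper S', 'Jeep Wrangler 3.8']
```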
