Not able to parse webpage contents using beautiful soup

Not able to parse webpage contents using beautiful soup - python-3.x

I have been using Beautiful Soup for parsing webpages for some data extraction. It has worked perfectly well for me so far, for other webpages. But however I'm trying to count the number of < a> tags in this page,
from bs4 import BeautifulSoup
import requests
catsection = "cricket"
url_base = "http://www.dnaindia.com/"
i = 89
url = url_base + catsection + "?page=" + str(i)
print(url)
#This is the page I'm trying to parse and also the one in the hyperlink
#I get the correct url i'm looking for at this stage
r = requests.get(url)
data = r.text
soup = BeautifulSoup(data, 'html.parser')
j=0
for num in soup.find_all('a'):
j=j+1
print(j)
I'm getting the output as 0. This makes me think that the 2 lines after r=requests.get(url) is probably not working(there's obviously no chance that there's zero < a> tags in the page), and i'm not sure about what alternative solution I can use here. Does anybody have any solution or faced a similar kind of problem before?
Thanks, in advance.

You need to pass some of the information along with the request to the server.
Following code should work...You can play along with other parameter as well
from bs4 import BeautifulSoup
import requests
catsection = "cricket"
url_base = "http://www.dnaindia.com/"
i = 89
url = url_base + catsection + "?page=" + str(i)
print(url)
headers = {
'User-agent': 'Mozilla/5.0'
}
#This is the page I'm trying to parse and also the one in the hyperlink
#I get the correct url i'm looking for at this stage
r = requests.get(url, headers=headers)
data = r.text
soup = BeautifulSoup(data, 'html.parser')
j=0
for num in soup.find_all('a'):
j=j+1
print(j)

Put any url in the parser and check the number of "a" tags available on that page:
from bs4 import BeautifulSoup
import requests
url_base = "http://www.dnaindia.com/cricket?page=1"
res = requests.get(url_base, headers={'User-agent': 'Existed'})
soup = BeautifulSoup(res.text, 'html.parser')
a_tag = soup.select('a')
print(len(a_tag))

Related

How To Refactor Web Scraping Code In Python

I am web scraping data from the below url and was able to do it correctly but i am looking for more reliable and beautiful way to do it
import pandas as pd
from bs4 import BeautifulSoup
import requests
pages = list(range(1, 548))
list_of_url = []
for page in pages:
URL = "https://www.stats.gov.sa/ar/isic4?combine=&combine_1=All&items_per_page=5" + "&page=" + str(page)
#print (URL)
list_of_url.append(URL)
print(list_of_url)
list_activities = []
#page_number = 1
for url in list_of_url:
URL = url
page = requests.get(URL)
soup = BeautifulSoup(page.content, "html.parser")
results = soup.find('div', class_='view-content')
#print(results.prettify())
try:
activities = results.find_all("tr", class_=["views-row-first odd","even","odd","even","views-row-last odd"])
except:
print("in the activities line thisis a pad url", URL)
continue
try:
for activity in activities:
activity_section = activity.find('td', class_='views-field views-field-field-chapter-desc-en-et').text.strip()
activity_name = activity.find("td", class_="views-field views-field-field-activity-description-en-et").text.strip()
activity_code = activity.find("td", class_="views-field views-field-field-activity-code active").text.strip()
list_activities.append([activity_section,activity_name,activity_code])
except:
print("url not founf")
continue
page_number += 1
df = pd.DataFrame(list_activities, columns=["activity_section", "activity_name", "activity_code"])
df.head()
I am web scraping data from the below url and was able to do it correctly but i am looking for more reliable and beautiful way to do it

Here is a shorter version for your code:
import pandas as pd
from bs4 import BeautifulSoup
import requests
list_activities = []
URLS = [f'https://www.stats.gov.sa/ar/isic4?combine=&combine_1=All&items_per_page=5&page={page}' for page in range(1,3)]
for URL in URLS:
page = requests.get(URL)
soup = BeautifulSoup(page.text, "html.parser")
results = soup.find('div', class_='view-content')
activities = results.find_all("tr", class_=["views-row-first odd","even","odd","even","views-row-last odd"])
list_activities += [[
activity.find('td', class_='views-field views-field-field-chapter-desc-en-et').text.strip(),
activity.find("td", class_="views-field views-field-field-activity-description-en-et").text.strip(),
activity.find("td", class_="views-field views-field-field-activity-code active").text.strip()
] for activity in activities]
df = pd.DataFrame(list_activities, columns=["activity_section", "activity_name", "activity_code"])
df.head()
However, as an engineer at WebScrapingAPI I would recommend you implement a stealthier scraper if you want to scrape this website on the long run. As per my testing, it does not feature any known bot detection providers right now. But being a government website it might use a private detection system.

How to add new data to an array in Python?

What is wrong in my code ?
I would like to add all urls from website to the array
import requests
from bs4 import BeautifulSoup
url = 'https://globalthiel.pl'
reqs = requests.get(url)
soup = BeautifulSoup(reqs.text, 'html.parser')
urls = []
for link in soup.find_all('a'):
url.append(link.get('href').text)
for i in urls:
print(i, end="")

You have a typo in your code. Replace
url.append(link.get('href').text)
with
urls.append(link.get('href').text)

Parse through website with Beautiful Soup to find matching Data

I am trying Python + BeautifulSoup to loop through a website in order to find a matching string contained in a tag.
When the matching substring is found stop the iteration and print the span, can't find a way to make this work.
this is what I could manage to work out so far
import urllib.request
from bs4 import BeautifulSoup as b
num = 1
base_url = "https://v-tac.it/led-products-results-page/?q="
request = '500'
separator = '&start='
page_num = "1"
url = base_url + request + separator + page_num
html = urllib.request.urlopen(url).read()
soup = b(html, "html.parser")
for i in range(100) :
for post in soup.findAll("div",{"class" : "spacer"}):
h = post.findAll("span")[0].text
if "request" in h:
break
print(h)
num += 1
page_num = str(num)
url = base_url + request + separator + page_num
html = urllib.request.urlopen(url).read()
soup = b(html, "html.parser")
print("We are at page " + page_num)
But it doesn't return anything, it only cycles through the pages.
Thanks in advance for any help

If it is in the text then with bs4 4.7.1 you should be able to use :contains
soup.select_one('.spacer span:contains("request")').text if soup.select_one('.spacer span:contains("request")') is not None else 'Not found'
I'm not sure why when you have for i in range(100) , you don't use i instead of num later; then you wouldn't need +=

Why am I getting duplicate links ? And how do I fetch links on the next pages?

I am getting duplicate links when for the links I am trying to obtain, I am not sure why. Also I am trying to fetch all the links like the ones I am getting from all the pages. But I am not sure how to write the code to click next page. Could someone please help me understand how I would go about this?
import requests
from bs4 import BeautifulSoup
url = 'http://www.gosugamers.net/counterstrike/teams'
r = requests.get(url)
page = r.text
soup = BeautifulSoup(page)
#all_teams = []
for team_links in soup.find_all('a', href=True):
if team_links['href'] == '' or team_links['href'].startswith('/counterstrike/teams'):
print (team_links.get('href').replace('/counterstrike/teams', url))

The team links are in anchor tags inside the h3 tags which are inside the div with the details class:
import requests
from bs4 import BeautifulSoup
from urlparse import urljoin
base = "http://www.gosugamers.net"
url = 'http://www.gosugamers.net/counterstrike/teams'
r = requests.get(url)
page = r.text
soup = BeautifulSoup(page)
for team_links in soup.select("div.details h3 a"):
print ( urljoin(base, team_links["href"]))
Which gives you:
http://www.gosugamers.net/counterstrike/teams/16338-motv
http://www.gosugamers.net/counterstrike/teams/16337-absolute-monster
http://www.gosugamers.net/counterstrike/teams/16258-immortals-cs
http://www.gosugamers.net/counterstrike/teams/16251-ireal-star-gaming
http://www.gosugamers.net/counterstrike/teams/16176-team-genesis
http://www.gosugamers.net/counterstrike/teams/16175-potadies
http://www.gosugamers.net/counterstrike/teams/16174-crowns-gg
http://www.gosugamers.net/counterstrike/teams/16173-visomvet
http://www.gosugamers.net/counterstrike/teams/16172-team-phenomenon
http://www.gosugamers.net/counterstrike/teams/16152-kriklekrakle
http://www.gosugamers.net/counterstrike/teams/16148-begenius
http://www.gosugamers.net/counterstrike/teams/16144-blubblub
http://www.gosugamers.net/counterstrike/teams/16142-team-1231
http://www.gosugamers.net/counterstrike/teams/16141-vsv
http://www.gosugamers.net/counterstrike/teams/16140-tbi
http://www.gosugamers.net/counterstrike/teams/16136-deadweight
http://www.gosugamers.net/counterstrike/teams/16135-me-myself-and-i
http://www.gosugamers.net/counterstrike/teams/16085-pur-esports
http://www.gosugamers.net/counterstrike/teams/15850-falken
http://www.gosugamers.net/counterstrike/teams/15815-team-abyssal
You are literally parsing all the links on the page, that is why you see the dupes.
To get all the teams we can parse the next page link until the span with the "Next" text is not there any more which only happens for the last page:
def get_all(url, base):
r = requests.get(url)
page = r.text
soup = BeautifulSoup(page)
for team_links in soup.select("div.details h3 a"):
yield (urljoin(base, team_links["href"]))
nxt = soup.find("div", {"class": "pages"}).find("span", text="Next")
while nxt:
r = requests.get(urljoin(base, nxt.find_previous("a")["href"]))
page = r.text
soup = BeautifulSoup(page)
for team_links in soup.select("div.details h3 a"):
yield (urljoin(base, team_links["href"]))
nxt = soup.find("div", {"class": "pages"}).find("span", text="Next")
If we run it for a couple of seconds, you can see we get the next pages:
In [26]: for link in (get_all(url, base)):
....: print(link)
....:
http://www.gosugamers.net/counterstrike/teams/16386-cantonese-cs
http://www.gosugamers.net/counterstrike/teams/16338-motv
http://www.gosugamers.net/counterstrike/teams/16337-absolute-monster
http://www.gosugamers.net/counterstrike/teams/16258-immortals-cs
http://www.gosugamers.net/counterstrike/teams/16251-ireal-star-gaming
http://www.gosugamers.net/counterstrike/teams/16176-team-genesis
http://www.gosugamers.net/counterstrike/teams/16175-potadies
http://www.gosugamers.net/counterstrike/teams/16174-crowns-gg
http://www.gosugamers.net/counterstrike/teams/16173-visomvet
http://www.gosugamers.net/counterstrike/teams/16172-team-phenomenon
http://www.gosugamers.net/counterstrike/teams/16152-kriklekrakle
http://www.gosugamers.net/counterstrike/teams/16148-begenius
http://www.gosugamers.net/counterstrike/teams/16144-blubblub
http://www.gosugamers.net/counterstrike/teams/16142-team-1231
http://www.gosugamers.net/counterstrike/teams/16141-vsv
http://www.gosugamers.net/counterstrike/teams/16140-tbi
http://www.gosugamers.net/counterstrike/teams/16136-deadweight
http://www.gosugamers.net/counterstrike/teams/16135-me-myself-and-i
http://www.gosugamers.net/counterstrike/teams/16085-pur-esports
http://www.gosugamers.net/counterstrike/teams/15850-falken
http://www.gosugamers.net/counterstrike/teams/15815-team-abyssal
http://www.gosugamers.net/counterstrike/teams/15810-ex-deathtrap
http://www.gosugamers.net/counterstrike/teams/15808-mix123
http://www.gosugamers.net/counterstrike/teams/15651-undertake-esports
http://www.gosugamers.net/counterstrike/teams/15644-five
http://www.gosugamers.net/counterstrike/teams/15630-five
http://www.gosugamers.net/counterstrike/teams/15627-inetkoxtv
http://www.gosugamers.net/counterstrike/teams/15626-tetr-s
http://www.gosugamers.net/counterstrike/teams/15625-rozenoir-esports-white
http://www.gosugamers.net/counterstrike/teams/15619-fragment-gg
http://www.gosugamers.net/counterstrike/teams/15615-monarchs-gg
http://www.gosugamers.net/counterstrike/teams/15602-ottoman-fire
http://www.gosugamers.net/counterstrike/teams/15591-respect
http://www.gosugamers.net/counterstrike/teams/15569-moonbeam-gaming
http://www.gosugamers.net/counterstrike/teams/15563-team-tilt
http://www.gosugamers.net/counterstrike/teams/15534-dynasty-uk
http://www.gosugamers.net/counterstrike/teams/15507-urbantech
http://www.gosugamers.net/counterstrike/teams/15374-innova
http://www.gosugamers.net/counterstrike/teams/15373-g3x
http://www.gosugamers.net/counterstrike/teams/15372-cnb
http://www.gosugamers.net/counterstrike/teams/15370-intz
http://www.gosugamers.net/counterstrike/teams/15369-2kill
http://www.gosugamers.net/counterstrike/teams/15368-supernova
http://www.gosugamers.net/counterstrike/teams/15367-biggods
http://www.gosugamers.net/counterstrike/teams/15366-playzone
http://www.gosugamers.net/counterstrike/teams/15365-pride
http://www.gosugamers.net/counterstrike/teams/15359-rising-orkam
http://www.gosugamers.net/counterstrike/teams/15342-team-foxez
http://www.gosugamers.net/counterstrike/teams/15336-angels
http://www.gosugamers.net/counterstrike/teams/15331-atlando-esports
http://www.gosugamers.net/counterstrike/teams/15329-xfinity-esports
http://www.gosugamers.net/counterstrike/teams/15326-nano-reapers
http://www.gosugamers.net/counterstrike/teams/15322-erase-team
http://www.gosugamers.net/counterstrike/teams/15318-heyguys
http://www.gosugamers.net/counterstrike/teams/15317-illusory
http://www.gosugamers.net/counterstrike/teams/15285-dismay
http://www.gosugamers.net/counterstrike/teams/15284-kingdom-esports
http://www.gosugamers.net/counterstrike/teams/15283-team-rival
http://www.gosugamers.net/counterstrike/teams/15282-ze-pug-godz
http://www.gosugamers.net/counterstrike/teams/15281-unlimited-potential1
You can see in the source for the first and any bar the last page the span with Next:
And when we get to the last, there is only spans with Previous and First:

slowing down a webscrape using python

I am trying to scrape a site and the problem I am running into is the page takes time to load. So by the time my scraping is done I may get only five items when there may be 25. Is there a way to slow down python. I am using beautifulSoup
Here is the code I am using
import urllib
import urllib.request
from bs4 import BeautifulSoup
theurl="http://agscompany.com/product-category/fittings/tube-nuts/316-tube/"
thepage = urllib.request.urlopen(theurl)
soup = BeautifulSoup(thepage,"html.parser")
for pn in soup.find_all('div',{"class":"shop-item-text"}):
pn2 = pn.text
print(pn2)
Thank you

All the results can be accessed from theses pages :
http://agscompany.com/product-category/fittings/tube-nuts/316-tube/page/
http://agscompany.com/product-category/fittings/tube-nuts/316-tube/page/2/
...
So you can access them with a loop on the page number :
import urllib
import urllib.request
from bs4 import BeautifulSoup
theurl="http://agscompany.com/product-category/fittings/tube-nuts/316-tube/"
for i in range(1,5):
thepage = urllib.request.urlopen(theurl + '/page/' + str(i) + '/')
soup = BeautifulSoup(thepage,"html.parser")
for pn in soup.find_all('div',{"class":"shop-item-text"}):
pn2 = pn.text
print(pn2)

More generic version of #Kenavoz's answer.
This approach doesn't care about how many pages there are.
Also, I would go for requests rather than urllib.
import requests
from bs4 import BeautifulSoup
url_pattern = 'http://agscompany.com/product-category/fittings/tube-nuts/316-tube/page/{index}/'
status_code = 200
url_index = 1
while status_code == 200:
url = url_pattern.format(index=url_index)
response = requests.get(url)
status_code = response.status_code
url_index += 1
soup = BeautifulSoup(response.content, 'html.parser')
page_items = soup.find_all('div', {'class': 'shop-item-text'})
for page_item in page_items:
print(page_item.text)

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string

Not able to parse webpage contents using beautiful soup - python-3.x

Related

How To Refactor Web Scraping Code In Python

How to add new data to an array in Python?

Parse through website with Beautiful Soup to find matching Data

Why am I getting duplicate links ? And how do I fetch links on the next pages?

slowing down a webscrape using python

Categories

Resources