Slowing down a web scrape using Python - python-3.x

I am trying to scrape a site, and the problem I am running into is that the page takes time to load. So by the time my scraping is done, I may get only five items when there may be 25. Is there a way to slow down Python? I am using BeautifulSoup.
Here is the code I am using:
import urllib
import urllib.request
from bs4 import BeautifulSoup

theurl = "http://agscompany.com/product-category/fittings/tube-nuts/316-tube/"
thepage = urllib.request.urlopen(theurl)
soup = BeautifulSoup(thepage, "html.parser")
for pn in soup.find_all('div', {"class": "shop-item-text"}):
    pn2 = pn.text
    print(pn2)
Thank you

All the results can be accessed from these pages:
http://agscompany.com/product-category/fittings/tube-nuts/316-tube/page/
http://agscompany.com/product-category/fittings/tube-nuts/316-tube/page/2/
...
So you can access them with a loop on the page number:
import urllib
import urllib.request
from bs4 import BeautifulSoup

# No trailing slash here, so appending '/page/' does not produce a double slash
theurl = "http://agscompany.com/product-category/fittings/tube-nuts/316-tube"
for i in range(1, 5):
    thepage = urllib.request.urlopen(theurl + '/page/' + str(i) + '/')
    soup = BeautifulSoup(thepage, "html.parser")
    for pn in soup.find_all('div', {"class": "shop-item-text"}):
        pn2 = pn.text
        print(pn2)
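On the original question of "slowing down": the listing here is server-rendered, so a delay will not reveal more items, but spacing out requests is still good practice when looping over pages. A minimal sketch of that idea (the URL pattern is the one from the answer above; the half-second delay is an arbitrary choice, and the actual fetch is left as a comment):

```python
import time

BASE = "http://agscompany.com/product-category/fittings/tube-nuts/316-tube"

def page_url(base, index):
    # Build one paginated URL; `base` carries no trailing slash.
    return f"{base}/page/{index}/"

for i in range(1, 3):
    url = page_url(BASE, i)
    print(url)
    # fetch and parse `url` here, then pause before the next request
    time.sleep(0.5)  # polite delay between page fetches
```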

More generic version of @Kenavoz's answer.
This approach doesn't care about how many pages there are.
Also, I would go for requests rather than urllib.
import requests
from bs4 import BeautifulSoup

url_pattern = 'http://agscompany.com/product-category/fittings/tube-nuts/316-tube/page/{index}/'
url_index = 1
while True:
    url = url_pattern.format(index=url_index)
    response = requests.get(url)
    if response.status_code != 200:
        break  # stop once a page is missing, without parsing the error page
    soup = BeautifulSoup(response.content, 'html.parser')
    page_items = soup.find_all('div', {'class': 'shop-item-text'})
    for page_item in page_items:
        print(page_item.text)
    url_index += 1
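Some sites return 200 with an empty listing instead of a 404 for out-of-range pages, so a complementary stopping rule is to break when a page yields no items. A sketch of just the extraction step on inline HTML (the markup is a made-up stand-in for the real listing page):

```python
from bs4 import BeautifulSoup

def extract_items(html):
    # Pull the text of every product block; an empty list signals the last page.
    soup = BeautifulSoup(html, 'html.parser')
    return [div.get_text(strip=True)
            for div in soup.find_all('div', {'class': 'shop-item-text'})]

sample = '''
<div class="shop-item-text">Tube Nut A</div>
<div class="shop-item-text">Tube Nut B</div>
'''
print(extract_items(sample))          # ['Tube Nut A', 'Tube Nut B']
print(extract_items('<p>empty</p>'))  # [] -> would stop the loop
```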

Related

Beautiful soup cannot find any element

I am just starting out with web scraping. I am having trouble with Beautiful Soup. I have tried changing the div class to other classes as well, but it always returns []. Here is my code:
import time
from bs4 import BeautifulSoup
from selenium import webdriver
driver = webdriver.Chrome(executable_path="C:/Users/MuhIsmail/Downloads/cd79/chromedriver.exe")
url = "https://www.cricbuzz.com/cricket-match/live-scores"
driver.get(url)
driver.maximize_window()
time.sleep(4)
content = driver.page_source
soup = BeautifulSoup(content, "html.parser")
scores = soup.find_all('div', class_='col-xs-9 col-lg-9 dis-inline')
print(scores)
import requests
from bs4 import BeautifulSoup

r = requests.get("https://www.cricbuzz.com/cricket-match/live-scores")
soup = BeautifulSoup(r.text, 'html.parser')
for item in soup.select("a.cb-mat-mnu-itm:nth-child(5)"):
    print(item.text)
Output:
MLR vs SYS - SYS Won
It is returning [] because there are no elements on the page with that class.
If you open your browser console and do a simple
document.getElementsByClassName('col-xs-9 col-lg-9 dis-inline')
it will return no results.
I tried this as well:
import requests
from bs4 import BeautifulSoup
url = "https://www.cricbuzz.com/cricket-match/live-scores"
r = requests.get(url)
soup = BeautifulSoup(r.content, 'html.parser')
scores = soup.find_all('div', {'class':'col-xs-9 col-lg-9 dis-inline'})
print(scores)
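The underlying issue is that requests only sees the initial HTML the server sends; elements that JavaScript adds after load never appear in r.content. A tiny illustration with inline HTML standing in for such a server response (the markup is invented for the demo):

```python
from bs4 import BeautifulSoup

# What the server actually sends: a script that would inject the scores
# client-side, but no rendered score markup.
static_html = '''
<div id="app"></div>
<script>/* JS would insert <div class="col-xs-9 ..."> here */</script>
'''

soup = BeautifulSoup(static_html, 'html.parser')
scores = soup.find_all('div', class_='col-xs-9 col-lg-9 dis-inline')
print(scores)  # [] -- the class only exists after JavaScript runs
```

This is why the Selenium version (which runs the JavaScript, then reads driver.page_source) finds the elements while plain requests does not.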

How to scrape from web all children of an attribute with one class?

I have tried to get the highlighted area (in the screenshot) of the website using BeautifulSoup4, but I cannot get what I want. Maybe you have a recommendation for doing it another way.
Screenshot of the website I need to get data from
from bs4 import BeautifulSoup
import requests
import pprint
import re
import pyperclip
import urllib
import csv
import html5lib
urls = ['https://e-mehkeme.gov.az/Public/Cases?page=1',
'https://e-mehkeme.gov.az/Public/Cases?page=2'
]
# scrape elements
for url in urls:
    response = requests.get(url)
    soup = BeautifulSoup(response.content, "html.parser")
    content = soup.findAll("input", class_="casedetail filled")
    print(content)
My expected output is like this:
Ətraflı məlumat:
İşə baxan hakim və ya tərkib
Xəyalə Cəmilova - sədrlik edən hakim
İlham Kərimli - tərkib üzvü
İsmayıl Xəlilov - tərkib üzvü
Tərəflər
Cavabdeh: MAHMUDOV MAQSUD SOLTAN OĞLU
Cavabdeh: MAHMUDOV MAHMUD SOLTAN OĞLU
İddiaçı: QƏHRƏMANOVA AYNA NUĞAY QIZI
İşin mahiyyəti
Mənzil mübahisələri - Mənzildən çıxarılma
Using the base URL, first get all the case IDs, then pass each ID to the target URL and get the value of the first td tag.
import requests
from bs4 import BeautifulSoup

urls = ['https://e-mehkeme.gov.az/Public/Cases?page=1',
        'https://e-mehkeme.gov.az/Public/Cases?page=2'
        ]
target_url = "https://e-mehkeme.gov.az/Public/CaseDetail?caseId={}"
for url in urls:
    response = requests.get(url)
    soup = BeautifulSoup(response.content, "html.parser")
    for caseid in soup.select('input.casedetail'):
        # print(caseid['value'])
        soup1 = BeautifulSoup(requests.get(target_url.format(caseid['value'])).content, 'html.parser')
        print(soup1.select_one("td").text)
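The key step is reading the value attribute of each hidden input and substituting it into the detail URL. A sketch of just that step on inline HTML (the markup mimics the listing page's inputs, with made-up IDs):

```python
from bs4 import BeautifulSoup

listing_html = '''
<input type="hidden" class="casedetail filled" value="101">
<input type="hidden" class="casedetail filled" value="102">
'''

soup = BeautifulSoup(listing_html, 'html.parser')
# Attribute access on a Tag reads the HTML attribute, here the case ID.
case_ids = [tag['value'] for tag in soup.select('input.casedetail')]
detail_urls = [f'https://e-mehkeme.gov.az/Public/CaseDetail?caseId={cid}'
               for cid in case_ids]
print(detail_urls)
```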
I would write it this way, extracting the ID that needs to be put in the GET request for the detailed info:
import requests
from bs4 import BeautifulSoup as bs

urls = ['https://e-mehkeme.gov.az/Public/Cases?page=1', 'https://e-mehkeme.gov.az/Public/Cases?page=2']

def get_soup(url):
    r = s.get(url)
    soup = bs(r.content, 'lxml')
    return soup

with requests.Session() as s:
    for url in urls:
        soup = get_soup(url)
        detail_urls = [f'https://e-mehkeme.gov.az/Public/CaseDetail?caseId={i["value"]}' for i in soup.select('.caseId')]
        for next_url in detail_urls:
            soup = get_soup(next_url)
            data = [string for string in soup.select_one('[colspan="4"]').stripped_strings]
            print(data)
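stripped_strings walks every text node under a tag and trims the surrounding whitespace, which is what flattens the detail cell into a clean list. A small demo on inline HTML (the cell contents are invented, echoing the expected output above):

```python
from bs4 import BeautifulSoup

detail_html = '''
<table>
  <tr><td colspan="4">
    Ətraflı məlumat:
    <b>Tərəflər</b>
    İddiaçı: EXAMPLE NAME
  </td></tr>
</table>
'''

soup = BeautifulSoup(detail_html, 'html.parser')
cell = soup.select_one('[colspan="4"]')
# Each text fragment under the cell, whitespace-trimmed, in document order.
data = list(cell.stripped_strings)
print(data)  # ['Ətraflı məlumat:', 'Tərəflər', 'İddiaçı: EXAMPLE NAME']
```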

How to get the values from this with beautifulsoup?

I am trying to learn BeautifulSoup and am scraping this website.
My python code looks like this:
import requests
from bs4 import BeautifulSoup

print("Enter the last 3 characters from the share link")
share_link = input()
link = "https://website.com" + share_link
print(link)
r = requests.get(link)
raw = r.text
soup = BeautifulSoup(raw, features="html.parser")
print(soup.prettify())
inputTag = soup.find("input", {"id": "hiddenInput"})
output = inputTag["value"]
print(output)
It gives me this output:
{"broadcastId":"BroadcastID: 252940","rtmp_url":"rtmp://live.gchao.cn/live/23331_9wx2w0c9","sex":0,"accountType":"26073","hls_url":"http://live.gchao.cn/live/23331_9wx2w0c9.m3u8","onlineNum":99,"likeNum":67,"live_id":282878,"flv_url":"http://live.gchao.cn/live/23331_9wx2w0c9.flv?txSecret=40d318efbbbca6afb8be2450b8d1f8fa&txTime=5D6086D1","user_id":252940,"stream_id":"23331_9wx2w0c9","nick_name":"Princess","sdkAppID":"1400088004","info_id":33189,"info_name":"Hi","IM_ID":"#TGS#aXMZYZ7FB","earning":424}
How do I get inside this and with beautifulsoup get the values?
If it is JSON, you can load it with the json library and then parse it, e.g.:
import json
s = '{"broadcastId":"BroadcastID: 252940","rtmp_url":"rtmp://live.gchao.cn/live/23331_9wx2w0c9","sex":0,"accountType":"26073","hls_url":"http://live.gchao.cn/live/23331_9wx2w0c9.m3u8","onlineNum":99,"likeNum":67,"live_id":282878,"flv_url":"http://live.gchao.cn/live/23331_9wx2w0c9.flv?txSecret=40d318efbbbca6afb8be2450b8d1f8fa&txTime=5D6086D1","user_id":252940,"stream_id":"23331_9wx2w0c9","nick_name":"Princess","sdkAppID":"1400088004","info_id":33189,"info_name":"Hi","IM_ID":"#TGS#aXMZYZ7FB","earning":424}'
data = json.loads(s)
print(data['broadcastId'])
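Putting the two steps together: read the input tag's value attribute with BeautifulSoup, decode it with json.loads, then index into the resulting dict. A sketch on inline HTML (the JSON payload is trimmed to a few of the fields shown above):

```python
import json
from bs4 import BeautifulSoup

html = '''<input id="hiddenInput"
    value='{"broadcastId":"BroadcastID: 252940","onlineNum":99,"likeNum":67}'>'''

soup = BeautifulSoup(html, 'html.parser')
payload = soup.find("input", {"id": "hiddenInput"})["value"]  # the raw JSON string
data = json.loads(payload)                                    # now a plain dict
print(data["broadcastId"])  # BroadcastID: 252940
print(data["onlineNum"])    # 99
```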

Not able to use BeautifulSoup to get span content of Nasdaq100 future

from bs4 import BeautifulSoup
import re
import requests

url = 'www.barchart.com/futures/quotes/NQU18'
r = requests.get("https://" + url)
data = r.text
soup = BeautifulSoup(data)
price = soup.find('span', {'class': 'last-change',
                           'data-ng-class': "highlightValue('priceChange')"}).text
print(price)
Result:
[[ item.priceChange ]]
That is not the span's content; the result should be the price. Where am I going wrong?
The following is the span tag of the page:
2nd screenshot: How can I get the time?
Use price = soup.find('span', {'class': 'up'}).text instead to get the +X.XX value:
from bs4 import BeautifulSoup
import requests
url = 'www.barchart.com/futures/quotes/NQU18'
r = requests.get("https://" +url)
data = r.text
soup = BeautifulSoup(data, "lxml")
price = soup.find('span', {'class': 'up'}).text
print(price)
Output currently is:
+74.75
The tradeTime you seek seems not to be present in the page source, since it's dynamically generated through JavaScript. You can, however, find it elsewhere if you're a little clever, and use the json library to parse the JSON data from a certain script element:
import json

# `soup` is the BeautifulSoup object from the previous snippet
trade_time = soup.find('script', {"id": 'barchart-www-inline-data'}).text
json_data = json.loads(trade_time)
print(json_data["NQU18"]["quote"]["tradeTime"])
This outputs:
2018-06-14T18:14:05
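The same pattern works on any page that embeds JSON in a script element. A self-contained sketch with made-up data mirroring the structure above:

```python
import json
from bs4 import BeautifulSoup

page = '''
<script id="barchart-www-inline-data" type="application/json">
{"NQU18": {"quote": {"tradeTime": "2018-06-14T18:14:05"}}}
</script>
'''

soup = BeautifulSoup(page, 'html.parser')
# .string yields the script tag's raw text; json.loads tolerates the newlines.
raw = soup.find('script', {'id': 'barchart-www-inline-data'}).string
data = json.loads(raw)
print(data["NQU18"]["quote"]["tradeTime"])  # 2018-06-14T18:14:05
```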
If these don't solve your problem then you will have to resort to something like Selenium that can run JavaScript to get what you're looking for:
from selenium import webdriver

driver = webdriver.Chrome()
url = "https://www.barchart.com/futures/quotes/NQU18"
driver.get(url)
result = driver.find_element_by_xpath('//*[@id="main-content-column"]/div/div[1]/div[2]/span[2]/span[1]')
print(result.text)
Currently the output is:
-13.00

Not able to parse webpage contents using beautiful soup

I have been using Beautiful Soup for parsing webpages for some data extraction. It has worked perfectly well for me so far, for other webpages. However, I'm now trying to count the number of <a> tags in this page:
from bs4 import BeautifulSoup
import requests
catsection = "cricket"
url_base = "http://www.dnaindia.com/"
i = 89
url = url_base + catsection + "?page=" + str(i)
print(url)
#This is the page I'm trying to parse and also the one in the hyperlink
#I get the correct url i'm looking for at this stage
r = requests.get(url)
data = r.text
soup = BeautifulSoup(data, 'html.parser')
j = 0
for num in soup.find_all('a'):
    j = j + 1
print(j)
I'm getting the output as 0. This makes me think that the two lines after r = requests.get(url) are probably not working (there's obviously no chance that there are zero <a> tags on the page), and I'm not sure what alternative solution I can use here. Does anybody have a solution, or has anyone faced a similar problem before?
Thanks in advance.
You need to pass some information along with the request to the server.
The following code should work... You can play with the other parameters as well:
from bs4 import BeautifulSoup
import requests
catsection = "cricket"
url_base = "http://www.dnaindia.com/"
i = 89
url = url_base + catsection + "?page=" + str(i)
print(url)
headers = {
    'User-agent': 'Mozilla/5.0'
}
#This is the page I'm trying to parse and also the one in the hyperlink
#I get the correct url i'm looking for at this stage
r = requests.get(url, headers=headers)
data = r.text
soup = BeautifulSoup(data, 'html.parser')
j = 0
for num in soup.find_all('a'):
    j = j + 1
print(j)
Put any URL in the parser and check the number of "a" tags available on that page:
from bs4 import BeautifulSoup
import requests
url_base = "http://www.dnaindia.com/cricket?page=1"
res = requests.get(url_base, headers={'User-agent': 'Existed'})
soup = BeautifulSoup(res.text, 'html.parser')
a_tag = soup.select('a')
print(len(a_tag))
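Counting tags does not need a manual counter variable; len() on the result of find_all or select does it directly. A sketch on inline HTML (the links are made up):

```python
from bs4 import BeautifulSoup

html = '''
<a href="/cricket/1">Report</a>
<a href="/cricket/2">Scorecard</a>
<a href="/cricket/3">News</a>
'''

soup = BeautifulSoup(html, 'html.parser')
# find_all returns a list of matching tags, so len() gives the count.
print(len(soup.find_all('a')))  # 3
```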