Get an xhr document that loads when you visit a page - python-3.x

I am trying to obtain the elements that appear below the photos on the following pages, or on equivalent ones:
https://www.nosetime.com/xiangshui/947895-oulong-xuecheng-atelier-cologne-orange.html
https://www.nosetime.com/xiangshui/705357-pomelo-paradis.html
https://www.nosetime.com/xiangshui/592260-cl-mentine-california.html
https://www.nosetime.com/xiangshui/612353-oulong-atelier-cologne-trefle.html
https://www.nosetime.com/xiangshui/911317-oulong-nimingmeigui-atelier-cologne.html
But I can't find them in the page source. They appear to be loaded dynamically by JavaScript; in fact, the data seems to arrive in an XHR response.
So how can I get the XHR document that is downloaded when I visit the page?
I tried:
url = "https://www.nosetime.com/xiangshui/350870-oulong-atelier-cologne-oolang-infini.html"
r = requests.post(url, headers=headers)
data = r.json()
print(data)
But it returns:
---------------------------------------------------------------------------
JSONDecodeError Traceback (most recent call last)
<ipython-input-8-e72156ddb336> in <module>()
2
3 r = requests.post(url, headers=headers)
----> 4 data = r.json()
5
6 print(data)
3 frames
/usr/lib/python3.6/json/decoder.py in raw_decode(self, s, idx)
355 obj, end = self.scan_once(s, idx)
356 except StopIteration as err:
--> 357 raise JSONDecodeError("Expecting value", s, err.value) from None
358 return obj, end
JSONDecodeError: Expecting value: line 1 column 1 (char 0)
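For context (not part of the original question): the JSONDecodeError happens because that URL returns the HTML page itself rather than JSON, so r.json() has nothing to parse. A minimal sanity check before calling .json(), assuming a placeholder user-agent:
import requests

headers = {"user-agent": "Mozilla/5.0"}  # assumed placeholder UA
url = "https://www.nosetime.com/xiangshui/350870-oulong-atelier-cologne-oolang-infini.html"
r = requests.get(url, headers=headers)

# Inspect what actually came back before calling .json()
print(r.status_code, r.headers.get("Content-Type"))  # typically text/html for this URL
print(r.text[:200])  # HTML markup, not JSON, hence the JSONDecodeError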

Just add the right headers and there you have the data.
import requests

headers = {
    "referer": "https://www.nosetime.com/xiangshui/350870-oulong-atelier-cologne-oolang-infini.html",
    "user-agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.111 Safari/537.36",
}

response = requests.get("https://www.nosetime.com/app/item.php?id=350870", headers=headers).json()
print(response["id"], response["isscore"], response["brandid"])
For some reason I can't paste the entire JSON output as SO thinks this is spam... o.O. Anyhow, this should get you the JSON response.
This prints:
350870 8.6 10091761
EDIT:
If you have more products, you can simply loop over the product URLs and extract what you need from the JSON. For example:
import requests

product_urls = [
    "https://www.nosetime.com/xiangshui/947895-oulong-xuecheng-atelier-cologne-orange.html",
    "https://www.nosetime.com/xiangshui/705357-pomelo-paradis.html",
    "https://www.nosetime.com/xiangshui/592260-cl-mentine-california.html",
    "https://www.nosetime.com/xiangshui/612353-oulong-atelier-cologne-trefle.html",
    "https://www.nosetime.com/xiangshui/911317-oulong-nimingmeigui-atelier-cologne.html",
]

for product_url in product_urls:
    headers = {
        "referer": product_url,
        "user-agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.111 Safari/537.36",
    }
    product_id = product_url.split("/")[-1].split("-")[0]
    response = requests.get(
        f"https://www.nosetime.com/app/item.php?id={product_id}",
        headers=headers,
    ).json()
    print(f"Product name: {response['enname']} | Rating: {response['isscore']}")
Output:
Product name: Atelier Cologne Orange Sanguine, 2010 | Rating: 8.9
Product name: Atelier Cologne Pomelo Paradis, 2015 | Rating: 8.8
Product name: Atelier Cologne Clémentine California, 2016 | Rating: 8.6
Product name: Atelier Cologne Trefle Pur, 2010 | Rating: 8.6
Product name: Atelier Cologne Rose Anonyme, 2012 | Rating: 7.7
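As a small hardening note (a sketch under the assumption that the item.php endpoint and field names above stay as shown): reusing a requests.Session and checking the status code makes failures on individual products easier to spot:
import requests

session = requests.Session()
session.headers.update({
    "user-agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.111 Safari/537.36",
})

for product_url in product_urls:  # the same list as above
    product_id = product_url.split("/")[-1].split("-")[0]
    resp = session.get(
        f"https://www.nosetime.com/app/item.php?id={product_id}",
        headers={"referer": product_url},  # per-request referer, as in the loop above
    )
    resp.raise_for_status()  # fail loudly instead of crashing later on .json()
    data = resp.json()
    print(data.get("enname"), data.get("isscore"))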

Related

Loop pages and save contents in Excel file from website in Python

I'm trying to loop over the pages from this link and extract the interesting part.
The relevant contents are the ones circled in red in the screenshot from the original post.
Here's what I've tried:
url = 'http://so.eastmoney.com/Ann/s?keyword=购买物业&pageindex={}'

for page in range(10):
    r = requests.get(url.format(page))
    soup = BeautifulSoup(r.content, "html.parser")
    print(soup)
xpath for each element (might be helpful for those that don't read Chinese):
/html/body/div[3]/div/div[2]/div[2]/div[3]/h3/span --> 【润华物业】
/html/body/div[3]/div/div[2]/div[2]/div[3]/h3/a --> 润华物业:关于公司购买理财产品的公告
/html/body/div[3]/div/div[2]/div[2]/div[3]/p/label --> 2017-04-24
/html/body/div[3]/div/div[2]/div[2]/div[3]/p/span --> 公告编号:2017-019 证券代码:836007 证券简称:润华物业 主办券商:国联证券
/html/body/div[3]/div/div[2]/div[2]/div[3]/a --> http://data.eastmoney.com/notices/detail/836007/AN201704250530124271,JWU2JWI2JWE2JWU1JThkJThlJWU3JTg5JWE5JWU0JWI4JTlh.html
I need to save the output to an Excel file. How could I do that in Python? Many thanks.
BeautifulSoup won't see this stuff, as it's rendered dynamically by JS, but there's an API endpoint you can query to get what you're after.
Here's how:
import requests
import pandas as pd


def clean_up(text: str) -> str:
    return text.replace('</em>', '').replace(':<em>', '').replace('<em>', '')


def get_data(page_number: int) -> dict:
    url = f"http://searchapi.eastmoney.com/business/Web/GetSearchList?type=401&pageindex={page_number}&pagesize=10&keyword=购买物业&name=normal"
    headers = {
        "Referer": f"http://so.eastmoney.com/Ann/s?keyword=%E8%B4%AD%E4%B9%B0%E7%89%A9%E4%B8%9A&pageindex={page_number}",
        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:83.0) Gecko/20100101 Firefox/83.0",
    }
    return requests.get(url, headers=headers).json()


def parse_response(response: dict) -> list:
    for item in response["Data"]:
        title = clean_up(item['NoticeTitle'])
        date = item['NoticeDate']
        url = item['Url']
        notice_content = clean_up(" ".join(item['NoticeContent'].split()))
        company_name = item['SecurityFullName']
        print(f"{company_name} - {title} - {date}")
        yield [title, url, date, company_name, notice_content]


def save_results(parsed_response: list):
    df = pd.DataFrame(
        parsed_response,
        columns=['title', 'url', 'date', 'company_name', 'content'],
    )
    df.to_excel("test_output.xlsx", index=False)


if __name__ == "__main__":
    output = []
    for page in range(1, 11):
        for parsed_row in parse_response(get_data(page)):
            output.append(parsed_row)
    save_results(output)
This outputs:
栖霞物业购买资产的公告 - 2019-09-03 16:00:00 - 871792
索克物业购买资产的公告 - 2020-08-17 00:00:00 - 832816
中都物业购买股权的公告 - 2019-12-09 16:00:00 - 872955
开元物业:开元物业购买银行理财产品的公告 - 2015-05-21 16:00:00 - 831971
开元物业:开元物业购买银行理财产品的公告 - 2015-04-12 16:00:00 - 831971
盛全物业:拟购买房产的公告 - 2017-10-30 16:00:00 - 834070
润华物业购买资产暨关联交易公告 - 2016-08-23 16:00:00 - 836007
润华物业购买资产暨关联交易公告 - 2017-08-14 16:00:00 - 836007
萃华珠宝:关于拟购买物业并签署购买意向协议的公告 - 2017-07-10 16:00:00 - 002731
赛意信息:关于购买办公物业的公告 - 2020-12-02 00:00:00 - 300687
And saves the results to an .xlsx file (test_output.xlsx) that can be opened directly in Excel.
PS. I don't know Chinese (?) so you'd have to look into the response contents and pick more stuff out.
Updated answer based on @baduker's solution, but it is not working for looping over the pages (a possible fix is sketched after the code below).
import requests
import pandas as pd

for page in range(10):
    url = "http://searchapi.eastmoney.com/business/Web/GetSearchList?type=401&pageindex={}&pagesize=10&keyword=购买物业&name=normal"
    headers = {
        "Referer": "http://so.eastmoney.com/Ann/s?keyword=%E8%B4%AD%E4%B9%B0%E7%89%A9%E4%B8%9A&pageindex={}",
        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:83.0) Gecko/20100101 Firefox/83.0",
    }
    response = requests.get(url, headers=headers).json()
    output_data = []
    for item in response["Data"]:
        # print(item)
        # print('*' * 40)
        title = item['NoticeTitle'].replace('</em>', '').replace(':<em>', '').replace('<em>', '')
        url = item['Url']
        date = item['NoticeDate'].split(' ')[0]
        company_name = item['SecurityFullName']
        content = item['NoticeContent'].replace('</em>', '').replace(':<em>', '').replace('<em>', '')
        # url_code = item['Url'].split('/')[5]
        output_data.append([title, url, date, company_name, content])
    names = ['title', 'url', 'date', 'company_name', 'content']
    df = pd.DataFrame(output_data, columns=names)
    df.to_excel('test.xlsx', index=False)
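A likely reason the loop does not work: the {} placeholder in url (and in the Referer) is never filled in with page, so every iteration requests the same page, and output_data plus the Excel file are rebuilt inside the loop, so test.xlsx ends up holding only the last page. A minimal sketch of one way to fix both issues, assuming the same endpoint and field names as in @baduker's answer:
import requests
import pandas as pd

base_url = (
    "http://searchapi.eastmoney.com/business/Web/GetSearchList"
    "?type=401&pageindex={}&pagesize=10&keyword=购买物业&name=normal"
)
headers = {
    "Referer": "http://so.eastmoney.com/Ann/s?keyword=%E8%B4%AD%E4%B9%B0%E7%89%A9%E4%B8%9A",
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:83.0) Gecko/20100101 Firefox/83.0",
}

output_data = []  # collect rows across all pages
for page in range(1, 11):
    # fill in the page number before sending the request
    response = requests.get(base_url.format(page), headers=headers).json()
    for item in response["Data"]:
        title = item['NoticeTitle'].replace('</em>', '').replace(':<em>', '').replace('<em>', '')
        date = item['NoticeDate'].split(' ')[0]
        content = item['NoticeContent'].replace('</em>', '').replace(':<em>', '').replace('<em>', '')
        output_data.append([title, item['Url'], date, item['SecurityFullName'], content])

# build the DataFrame once, after all pages have been collected
df = pd.DataFrame(output_data, columns=['title', 'url', 'date', 'company_name', 'content'])
df.to_excel('test.xlsx', index=False)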

how to use scrapy-rotating-proxies with full settings or rotate ip/per request?

hello folks,
I am scraping a website using scrapy-rotating-proxies. I have also tried other proxy solutions, but they either don't suit my requirements or I can't implement them the way I want.
There are lots of API URLs to fetch, and I want to use one IP per request. When I run the scraper I have at most 100 IPs available, but it is not using all of them and keeps reanimating some instead. You can see some logs below:
2020-11-06 09:35:56 [scrapy.extensions.logstats] INFO: Crawled 21 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2020-11-06 09:35:56 [rotating_proxies.middlewares] INFO: Proxies(good: 0, dead: 1, unchecked: 87, reanimated: 1, mean backoff time: 122s)
2020-11-06 09:36:26 [rotating_proxies.middlewares] INFO: Proxies(good: 0, dead: 1, unchecked: 87, reanimated: 1, mean backoff time: 122s)
2020-11-06 09:36:56 [scrapy.extensions.logstats] INFO: Crawled 21 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2020-11-06 09:36:56 [rotating_proxies.middlewares] INFO: Proxies(good: 0, dead: 1, unchecked: 87, reanimated: 1, mean backoff time: 122s)
2020-11-06 09:37:26 [rotating_proxies.middlewares] INFO: Proxies(good: 0, dead: 1, unchecked: 87, reanimated: 1, mean backoff time: 122s)
2020-11-06 09:37:56 [scrapy.extensions.logstats] INFO: Crawled 21 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2020-11-06 09:37:56 [rotating_proxies.middlewares] INFO: Proxies(good: 0, dead: 1, unchecked: 87, reanimated: 1, mean backoff time: 122s)
2020-11-06 09:37:56 [rotating_proxies.middlewares] DEBUG: 1 proxies moved from 'dead' to 'reanimated'
2020-11-06 09:37:59 [rotating_proxies.expire] DEBUG: Proxy <https://92.60.190.249:50335> is DEAD
2020-11-06 09:37:59 [rotating_proxies.middlewares] DEBUG: Retrying <GET https://shopee.com.my/api/v2/search_items/?by=sales&limit=50&match_id=243&newest=0&order=desc&page_type=search&version=2> with another proxy (failed 3 times, max retries: 5)
2020-11-06 09:37:59 [scrapy_user_agents.middlewares] DEBUG: Proxy is detected https://92.60.190.249:50335
2020-11-06 09:37:59 [scrapy_user_agents.middlewares] DEBUG: Assigned User-Agent Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.146 Safari/537.36
2020-11-06 09:38:10 [rotating_proxies.expire] DEBUG: Proxy <https://162.223.89.220:8080> is GOOD
2020-11-06 09:38:10 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://shopee.com.my/api/v2/search_items/?by=sales&limit=50&match_id=243&newest=0&order=desc&page_type=search&version=2> (referer: https://shopee.com.my/api/v2/search_items/?by=sales&limit=50&match_id=243&newest=0&order=desc&page_type=search&version=2)
2020-11-06 09:38:10 [scrapy_user_agents.middlewares] DEBUG: Assigned User-Agent Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36
2020-11-06 09:38:26 [rotating_proxies.middlewares] INFO: Proxies(good: 1, dead: 1, unchecked: 85, reanimated: 2, mean backoff time: 249s)
2020-11-06 09:38:56 [scrapy.extensions.logstats] INFO: Crawled 22 pages (at 1 pages/min), scraped 0 items (at 0 items/min)
2020-11-06 09:38:56 [rotating_proxies.middlewares] INFO: Proxies(good: 1, dead: 1, unchecked: 85, reanimated: 2, mean backoff time: 249s)
INFO: Proxies(good: 0, dead: 1, unchecked: 87, reanimated: 1, mean backoff time: 122s)
As you can see, I don't want to reanimate IPs while there are still 87 unchecked IPs remaining. If I send one IP per request, the website won't catch me and will send me a valid response.
The standalone spider I am using:
from scrapy import Spider, Request
import json
from micheal_app import db, server
import ast
from micheal_app import db
import requests as r
from micheal_app.models import Top_sales_data, Categories
from parsel import Selector
import random
from scrapy.crawler import CrawlerProcess
headers = {
    'authority': 'shopee.com.my',
    'method': 'GET',
    'scheme': 'https',
    'accept': '*/*',
    'accept-encoding': 'gzip, deflate, br',
    'accept-language': 'en-US,en;q=0.9',
    'user-agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.111 Safari/537.36',
    'if-none-match': '*',
    'x-api-source': 'pc',
    'x-requested-with': 'XMLHttpRequest',
    'x-shopee-language': 'en'
}

if server.pythonanywhere:
    path = '/home/wholesalechina/mysite/proxies.txt'
else:
    path = 'D:\Freelancer\Orders\micheal git\proxies.txt'


class TopSalesAutoScraper(Spider):
    name = 'top_sales_auto_scraper'
    start_urls = ['https://free-proxy-list.net', 'https://www.sslproxies.org/']
    proxies = []
    li = [1, 2, 3]
    # allowed_domains = ['shopee.com', 'shopee.com.my', 'shopee.com.my/api/']

    custom_settings = {
        'AUTOTHROTTLE_ENABLED': 'True',
        # The initial download delay
        'AUTOTHROTTLE_START_DELAY': '1',
        # The maximum download delay to be set in case of high latencies
        'AUTOTHROTTLE_MAX_DELAY': '60',
        # The average number of requests Scrapy should be sending in parallel to
        # each remote server
        'AUTOTHROTTLE_TARGET_CONCURRENCY': '1.0',
        'CONCURRENT_REQUESTS': '1',
        'DOWNLOADER_MIDDLEWARES': {
            'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
            'scrapy_user_agents.middlewares.RandomUserAgentMiddleware': 500,
            # ...
            'rotating_proxies.middlewares.RotatingProxyMiddleware': 610,
            'rotating_proxies.middlewares.BanDetectionMiddleware': 620,
            # ...
        },
        'ROTATING_PROXY_LIST_PATH': path,
    }

    def __init__(self):
        # Removing that category data from database
        for i in self.li:
            cat_data = Categories.query.filter_by(id=i).first()
            Top_sales_data.query.filter_by(category=cat_data.category).delete()
            db.session.commit()
        f = open('proxies.txt', 'w')
        for url in self.start_urls:
            req = r.get(url)
            response = Selector(text=req.text)
            table = response.css('table')
            rows = table.css('tr')
            cols = [row.css('td::text').getall() for row in rows]
            for col in cols:
                if col and col[4] == 'elite proxy' and col[6] == 'yes':
                    self.proxies.append('https://' + col[0] + ':' + col[1])
        for proxy in self.proxies:
            f.write(proxy + '\n')
        print('proxies found: {0}'.format(len(self.proxies)))
        f.close()

    def start_requests(self):
        # getting all categories
        all_categories = []
        for i in self.li:
            all_categories.append(Categories.query.get(i))
        # Looping through all categories and their subcategories
        for category in all_categories:
            data = ast.literal_eval(category.subcategories)
            for d in data:
                subcat_url = d['subcategory url']
                id = subcat_url.split('.')[-1]
                headers['path'] = '/api/v2/search_items/?by=sales&limit=50&match_id={0}&newest=0&order=desc&page_type=search&version=2'.format(id)
                headers['referer'] = 'https://shopee.com.my{0}?page=0&sortBy=sales'.format(subcat_url)
                url = 'https://shopee.com.my/api/v2/search_items/?by=sales&limit=50&match_id={0}&newest=0&order=desc&page_type=search&version=2'.format(id)
                yield Request(url=url, headers=headers, callback=self.parse,
                              meta={'proxy': None},
                              cb_kwargs={'subcat': d['subcategory name'], 'category': category.category})

    def parse(self, response, subcat, category):
        try:
            jdata = json.loads(response.body.decode('utf-8'))
        except Exception as e:
            print('This is failed subcat url: {0} and tring again.'.format(subcat))
            print('and the exception is: {0}'.format(e))
            yield Request(response.url, dont_filter=True, callback=self.parse,
                          cb_kwargs={'subcat': subcat, 'category': category})
        else:
            items = jdata['items']
            for item in items:
                name = item['name']
                image_path = item['image']
                absolute_image = 'https://cf.shopee.com.my/file/{0}_tn'.format(image_path)
                subcategory = subcat
                monthly_sold = item['sold']
                price = float(item['price']) / 100000
                price = '{:.2f}'.format(price)
                total_sold = item['historical_sold']
                location = item['shop_location']
                stock = item['stock']
                shop_id = str(item['shopid'])
                item_id = str(item['itemid'])
                # making product url
                if name[0] == '-':
                    name = name[1:]
                p_name = name.replace(' ', '-').replace('[', '').replace(']', '').replace('/', '').replace('--', '-').replace('+', '').replace('%', '').replace('#', '')
                pro_url = 'https://shopee.com.my/' + p_name + '-i.' + shop_id + '.' + item_id
                top_data = Top_sales_data(photo=absolute_image, product_name=name, category=category,
                                          subcategory=subcategory, monthly_sold=monthly_sold, price=price,
                                          total_sold=total_sold, location=location, stock=stock,
                                          product_url=pro_url)
                db.session.add(top_data)
                db.session.commit()


process = CrawlerProcess()
process.crawl(TopSalesAutoScraper)
process.start()
In a try/except block I check whether the response is 200 but the JSON body is empty or invalid; in that case I yield the request again, expecting the retry to go out through a different IP. If that is not how it works, how can I change the IP? (One possible approach is sketched after the code below.)
See the parse code again:
def parse(self, response, subcat, category):
    try:
        jdata = json.loads(response.body.decode('utf-8'))
    except Exception as e:
        print('This is failed subcat url: {0} and tring again.'.format(subcat))
        print('and the exception is: {0}'.format(e))
        yield Request(response.url, dont_filter=True, callback=self.parse,
                      cb_kwargs={'subcat': subcat, 'category': category})
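One possible way to get stricter one-proxy-per-request behaviour (a sketch, not from the original post, and it bypasses scrapy-rotating-proxies' own good/dead/reanimated bookkeeping) is to pick a proxy yourself and put it in request.meta['proxy'], which Scrapy's built-in HttpProxyMiddleware honours. The hypothetical helper below assumes the self.proxies list built in __init__ above:
import random

# Sketch: choose a proxy explicitly for each request instead of letting the
# rotating middleware decide. Scrapy's HttpProxyMiddleware uses whatever is
# placed in meta['proxy'] for that single request.
def make_request(self, url, subcat, category):
    proxy = random.choice(self.proxies)  # self.proxies is filled in __init__ above
    return Request(
        url=url,
        headers=headers,
        callback=self.parse,
        dont_filter=True,
        meta={'proxy': proxy},
        cb_kwargs={'subcat': subcat, 'category': category},
    )
The same idea can be applied on the retry path in parse, so a failed sub-category is retried through a different, randomly chosen proxy. Keep in mind that free proxies scraped from free-proxy-list.net are often already blocked, so a 200 response with an empty JSON body can persist no matter how the rotation is done.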

Result is not stacking as desired via tuple or dict

I am grabbing some financial data off a website: a list of names and a list of numbers. If I print them independently, the results come out fine, but I can't manage to put them together.
import requests
import bs4
import numpy as np


def fundamentals(name, number):
    url = 'http://quotes.money.163.com/f10/dbfx_002230.html?date=2020-03-31#01c08'
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) '
                      'AppleWebKit/537.36 (KHTML, like Gecko) '
                      'Chrome/63.0.3239.132 Safari/537.36 QIHU 360SE'
    }
    response = requests.get(url, headers=headers).content
    soup = bs4.BeautifulSoup(response, 'html5lib')
    DuPond_name = soup.find_all(name='td', attrs={'class': 'dbbg01'})
    DuPond_number = soup.find_all(name='td', attrs={'class': 'dbbg02'})
    for i, names in enumerate(DuPond_name, 1):
        name = names.getText()
        print(name)
    for i, numbers in enumerate(DuPond_number, 1):
        number = numbers.getText()
        print(number)
    return {name: number}


if __name__ == '__main__':
    print(fundamentals(name=[], number=[]))
Output:
净资产收益率
总资产收益率
权益乘数
销售净利率
总资产周转率
净利润
营业收入
营业收入
平均资产总额
营业收入
全部成本
投资收益
所得税
其他
营业成本
销售费用
管理费用
财务费用
-1.16%
-0.63%
1/(1-40.26%)
-9.33%
0.07
-131,445,229.01
1,408,820,489.46
1,408,820,489.46
9,751,224,017.79
1,408,820,489.46
1,704,193,442.22
5,971,254
17,965,689
--
776,103,494
274,376,792.25
186,977,519.02
5,173,865.88
{'财务费用': '5,173,865.88'}
Process finished with exit code 0
The final dict is only giving me the very last pair. How can I fix it? Or, even sweeter, how can I put them into DataFrame form? Thank you all for the help!
Can you try this:
import pandas as pd
df = pd.DataFrame([list(DuPond_name), list(DuPond_number)]).T
For reference, this is how I tested it:
import pandas as pd
ls1 = ['净资产收益率','总资产收益率','权益乘数','销售净利率','总资产周转率','净利润','营业收入','营业收入','平均资产总额','营业收入','全部成本','投资收益','所得税','其他','营业成本','销售费用','管理费用','财务费用']
ls2 = ['-1.16%','-0.63%','1/(1-40.26%)','-9.33%','0.07','-131,445,229.01','1,408,820,489.46','1,408,820,489.46','9,751,224,017.79','1,408,820,489.46','1,704,193,442.22','5,971,254','17,965,689','--','776,103,494','274,376,792.25','186,977,519.02','5,173,865.88']
df = pd.DataFrame([ls1, ls2]).T
print(df)
This is the output I got:
0 1
0 净资产收益率 -1.16%
1 总资产收益率 -0.63%
2 权益乘数 1/(1-40.26%)
3 销售净利率 -9.33%
4 总资产周转率 0.07
5 净利润 -131,445,229.01
6 营业收入 1,408,820,489.46
7 营业收入 1,408,820,489.46
8 平均资产总额 9,751,224,017.79
9 营业收入 1,408,820,489.46
10 全部成本 1,704,193,442.22
11 投资收益 5,971,254
12 所得税 17,965,689
13 其他 --
14 营业成本 776,103,494
15 销售费用 274,376,792.25
16 管理费用 186,977,519.02
17 财务费用 5,173,865.88
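To answer the dict part of the question directly (a sketch, assuming the same DuPond_name and DuPond_number result sets as in the question): the original function only returns the last pair because name and number are plain variables that get overwritten on every loop iteration and the return happens once, after both loops. Extracting the text and pairing the two lists works like this:
import pandas as pd

# pull the text out of the bs4 Tag objects first
names = [tag.get_text(strip=True) for tag in DuPond_name]
numbers = [tag.get_text(strip=True) for tag in DuPond_number]

# dict of name -> number (note: duplicate names such as 营业收入 will collapse to one key)
result = dict(zip(names, numbers))

# or keep every row by using a DataFrame instead
df = pd.DataFrame({'name': names, 'number': numbers})
print(df)
Note that pd.DataFrame([list(DuPond_name), list(DuPond_number)]).T as written above stores the Tag objects themselves; extracting the text first, as in this sketch, gives plain strings.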

Receiving a list index out of range error when returning soup.select

I'm attempting to scrape a price from an Amazon page, but I'm getting a list index out of range error on return elems[0].text.strip().
How do I test the length of the list returned by soup.select?
Here's the code:
def getAmazonPrice(productUrl):
    res = requests.get(productUrl, headers={
        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.110 Safari/537.36"
    })
    res.raise_for_status()
    soup = bs4.BeautifulSoup(res.text, 'html.parser')
    elems = soup.select('#buyNewSection > a > h5 > div > div.a-column.a-span8.a-text-right.a-span-last > div > span.a-size-medium.a-color-price.offer-price.a-text-normal')
    return elems[0].text.strip()

price = getAmazonPrice('https://www.amazon.com/Automate-Boring-Stuff-Python-2nd/dp/1593279922/ref=sr_1_1?keywords=automate+the+boring+stuff+with+python&qid=1574259797&sr=8-1')
print('The price is ' + price)
Traceback (most recent call last):
File "/Users/xxx/Documents/MyPythonScripts/scrape.py", line 14, in
price = getAmazonPrice('https://www.amazon.com/Automate-Boring-Stuff-Python-2nd/dp/1593279922/ref=sr_1_1?keywords=automate+the+boring+stuff+with+python&qid=1574259797&sr=8-1')
File "/Users/xxx/Documents/MyPythonScripts/scrape.py", line 12, in getAmazonPrice
return elems[0].text.strip()
IndexError: list index out of range
Test elems before indexing it:
if elems:  # or: if len(elems) > 0:
    return elems[0].text.strip()
else:
    return None  # handle the failure however you prefer (raise, log, retry, ...)
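For context (a sketch, not part of the original answer): an empty list usually means the selector did not match, either because Amazon changed its markup or because it served a captcha/robot-check page instead of the product page, so it can help to fail with a clear message rather than an IndexError:
import bs4
import requests

def getAmazonPrice(productUrl):
    headers = {
        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.110 Safari/537.36"
    }
    res = requests.get(productUrl, headers=headers)
    res.raise_for_status()
    soup = bs4.BeautifulSoup(res.text, 'html.parser')
    # select_one returns the first match or None instead of a list
    elem = soup.select_one('span.a-color-price')  # broader, hypothetical selector; adjust to the markup you actually see
    if elem is None:
        raise ValueError('Price element not found: the selector may be stale, '
                         'or Amazon returned a captcha/robot-check page.')
    return elem.text.strip()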

How to use python to click the “load more” to extract links of names

I want to get the links for the names from all the pages by clicking "Load More", and I need help with the pagination.
I've got the logic to print the links for the names, but the pagination part is missing.
for pos in positions:
    url = "https://247sports.com/Season/2021-Football/CompositeRecruitRankings/?InstitutionGroup=HighSchool"
    two = requests.get("https://247sports.com/Season/2021-Football/CompositeRecruitRankings/?InstitutionGroup=HighSchool" + pos, headers=HEADERS)
    bsObj = BeautifulSoup(two.content, 'lxml')
    main_content = urljoin(url, bsObj.select(".data-js")[1]['href'])  # extracting the link leading to the page containing everything available here
    response = requests.get(main_content)
    obj = BeautifulSoup(response.content, 'lxml')
    names = obj.findAll("div", {"class": "recruit"})
    for player_name in names:
        player_name.find('a', {'class': ' rankings-page__name-link'})
        for all_players in player_name.find_all('a', href=True):
            player_urls = site + all_players.get('href')
            # print(player_urls)
I expect output like: https://247sports.com/Player/Jack-Sawyer-46049925/
(links for all the player names)
You can just iterate through the page parameter in the requests. Since you could keep iterating forever, I had it check for when players start to repeat (essentially when the next iteration doesn't add new players). It seems to stop after 21 pages, which gives 960 players.
import requests
from bs4 import BeautifulSoup

url = 'https://247sports.com/Season/2021-Football/CompositeRecruitRankings/'
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.90 Safari/537.36'}

player_links = []
prior_count = 0
for page in range(1, 101):
    # print('Page: %s' % page)
    payload = {
        'ViewPath': '~/Views/SkyNet/PlayerSportRanking/_SimpleSetForSeason.ascx',
        'InstitutionGroup': 'HighSchool',
        'Page': '%s' % page}
    response = requests.get(url, headers=headers, params=payload)
    soup = BeautifulSoup(response.text, 'html.parser')
    recruits = soup.find_all('div', {'class': 'recruit'})
    for recruit in recruits:
        print('https://247sports.com' + recruit.find('a')['href'])
        player_links.append('https://247sports.com' + recruit.find('a')['href'])
    current_count = len(list(set(player_links)))
    if prior_count == current_count:
        print('No more players')
        break
    else:
        prior_count = current_count
Output:
print (player_links)
['https://247sports.com/Player/Korey-Foreman-46056100', 'https://247sports.com/Player/Jack-Sawyer-46049925', 'https://247sports.com/Player/Tommy-Brockermeyer-46040211', 'https://247sports.com/Player/James-Williams-46049981', 'https://247sports.com/Player/Payton-Page-46055295', 'https://247sports.com/Player/Camar-Wheaton-46050152', 'https://247sports.com/Player/Brock-Vandagriff-46050870', 'https://247sports.com/Player/JT-Tuimoloau-46048440', 'https://247sports.com/Player/Emeka-Egbuka-46048438', 'https://247sports.com/Player/Tony-Grimes-46048912', 'https://247sports.com/Player/Sam-Huard-46048437', 'https://247sports.com/Player/Amarius-Mims-46079928', 'https://247sports.com/Player/Savion-Byrd-46078964', 'https://247sports.com/Player/Jake-Garcia-46053996', 'https://247sports.com/Player/Agiye-Hall-46055274', 'https://247sports.com/Player/Caleb-Williams-46040610', 'https://247sports.com/Player/JJ-McCarthy-46042742', 'https://247sports.com/Player/Dylan-Brooks-46079585', 'https://247sports.com/Player/Nolan-Rucci-46058902', 'https://247sports.com/Player/GaQuincy-McKinstry-46052990', 'https://247sports.com/Player/Will-Shipley-46056925', 'https://247sports.com/Player/Maason-Smith-46057128', 'https://247sports.com/Player/Isaiah-Johnson-46050757', 'https://247sports.com/Player/Landon-Jackson-46049327', 'https://247sports.com/Player/Tunmise-Adeleye-46050288', 'https://247sports.com/Player/Terrence-Lewis-46058521', 'https://247sports.com/Player/Lee-Hunter-46058922', 'https://247sports.com/Player/Raesjon-Davis-46056065', 'https://247sports.com/Player/Kyle-McCord-46047962', 'https://247sports.com/Player/Beaux-Collins-46049126', 'https://247sports.com/Player/Landon-Tengwall-46048781', 'https://247sports.com/Player/Smael-Mondon-46058273', 'https://247sports.com/Player/Derrick-Davis-Jr-46049676', 'https://247sports.com/Player/Troy-Franklin-46048840', 'https://247sports.com/Player/Tywone-Malone-46081337', 'https://247sports.com/Player/Micah-Morris-46051663', 'https://247sports.com/Player/Donte-Thornton-46056489', 'https://247sports.com/Player/Bryce-Langston-46050326', 'https://247sports.com/Player/Damon-Payne-46041148', 'https://247sports.com/Player/Rocco-Spindler-46049869', 'https://247sports.com/Player/David-Daniel-46076804', 'https://247sports.com/Player/Branden-Jennings-46049721', 'https://247sports.com/Player/JaTavion-Sanders-46058800', 'https://247sports.com/Player/Chris-Hilton-46055801', 'https://247sports.com/Player/Jason-Marshall-46051367', ... ]
