not able to scrape the data and even the links are not changing while clciking on pagination using python - python-3.x

want to scrape data from of each block and want to change the pages, but not able to do that,help me someone to crack this.
i tried to crawl data using headers and form data , but fail to do that.
below is my code.
from bs4 import BeautifulSoup
import requests
url='http://www.msmemart.com/msme/listings/company-list/agriculture-product-stocks/1/585/Supplier'
headers={
"Content-Type": "application/x-www-form-urlencoded; charset=UTF-8",
"Cookie": "unique_visitor=49.35.36.33; __utma=189548087.1864205671.1549441624.1553842230.1553856136.3; __utmc=189548087; __utmz=189548087.1553856136.3.3.utmcsr=nsic.co.in|utmccn=(referral)|utmcmd=referral|utmcct=/; csrf=d665df0941bbf3bce09d1ee4bd2b079e; ci_session=ac6adsb1eb2lcoogn58qsvbjkfa1skhv; __utmt=1; __utmb=189548087.26.10.1553856136",
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36",
"X-Requested-With": "XMLHttpRequest",
"Accept": "application/json, text/javascript, */*; q=0.01",
}
data ={
'catalog': 'Supplier',
'category':'1',
'subcategory':'585',
'type': 'company-list',
'csrf': '0db0757db9473e8e5169031b7164f2a4'
}
r = requests.get(url,data=data,headers=headers)
soup = BeautifulSoup(html,'html.parser')
div = soup.find('div',{"id":"listings_result"})
for prod in div.find_all('b',string='Products/Services:').next_sibling:
print(prod)
getting "ConnectionResetError: [WinError 10054] An existing connection was forcibly closed by the remote host" 2-3 times, i want to crawl all text details in a block.plz someone help me to found this.

Related

Scraping kenpom data in 2023

I have been scouring across the web to figure out how to use beautifulsoup and pandas to scrape kenpom.com college basketball data. I do not have an account to his website, hence why I am not using the kenpompy library. I have seen some past examples to scrape it from years past, including using the pracpred library (though I have zero experience on it, I'll admit) or using the curlconverter to grab the headers, cookies, and parameters during the requests.get, but now the website seems stingier in terms of grabbing the main table these days. I have used the following code so far:
import requests
from bs4 import BeautifulSoup
url ='https://kenpom.com/index.php?y=2023'
import requests
import requests
cookies = {
'PHPSESSID': 'f04463ec42584dbd1bf7a480098947d1',
'_ga': 'GA1.2.120164513.1673124870',
'_gid': 'GA1.2.622021765.1673496414',
'__stripe_mid': '71a4117b-cbef-4d3c-b31b-9d18e5c99a33183485',
'__stripe_sid': '99b77b80-1222-4f7a-b2a8-acf5cf19c7d18a637f',
'kenpomtry': 'https%3A%2F%2Fkenpom.com%2Fsummary.php%3Fs%3DRankAPL_Off%26y%3D2021',
'_gat_gtag_UA_11558853_1': '1',
}
headers = {
'authority': 'kenpom.com',
'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
'accept-language': 'en-US,en;q=0.9',
# 'cookie': 'PHPSESSID=f04463ec42584dbd1bf7a480098947d1; _ga=GA1.2.120164513.1673124870; _gid=GA1.2.622021765.1673496414; __stripe_mid=71a4117b-cbef-4d3c-b31b-9d18e5c99a33183485; __stripe_sid=99b77b80-1222-4f7a-b2a8-acf5cf19c7d18a637f; kenpomtry=https%3A%2F%2Fkenpom.com%2Fsummary.php%3Fs%3DRankAPL_Off%26y%3D2021; _gat_gtag_UA_11558853_1=1',
'sec-ch-ua': '"Not?A_Brand";v="8", "Chromium";v="108", "Google Chrome";v="108"',
'sec-ch-ua-mobile': '?0',
'sec-ch-ua-platform': '"Windows"',
'sec-fetch-dest': 'document',
'sec-fetch-mode': 'navigate',
'sec-fetch-site': 'none',
'sec-fetch-user': '?1',
'upgrade-insecure-requests': '1',
'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36',
}
params = {
'y': '2023',
}
response = requests.get('https://kenpom.com/index.php', params=params, cookies=cookies, headers=headers)
soup = BeautifulSoup(response.text, 'html.parser')
soup
# table = soup.find('table',{'id':'ratings-table'}).tbody
Any suggestions beyond this would be truly appreciated.
Using requests and adding UserAgent as headers (It's pretty messy with multiple hierarchal indexes so will need further parsing to clean completely):
import pandas as pd
import requests
from bs4 import BeautifulSoup
headers = {
"User-Agent":
"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.5112.79 Safari/537.36"
}
url = "https://kenpom.com/index.php?y=2023"
with requests.Session() as request:
response = request.get(url, timeout=30, headers=headers)
if response.status_code != 200:
print(response.raise_for_status())
soup = BeautifulSoup(response.text, "html.parser")
df = pd.concat(pd.read_html(str(soup)))

Scraping values from View Source using Requests Python 3

So this code below is working fine but when i change the url to another site it doesn't work
import requests
import re
url = "https://www.autotrader.ca/a/ram/1500/hamilton/ontario/19_12052335_/?showcpo=ShowCpo&ncse=no&ursrc=pl&urp=2&urm=8&sprx=-2"
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36'}
response = requests.get(url, headers=headers)
phone_number = re.findall('"phoneNumber":"([\d-]+)"', response.text)
print(phone_number)
['905-870-7127']
This code below doesn't work it gives the output [] Please tell me what am i doing wrong
import requests
import re
urls = "https://www.kijijiautos.ca/vip/22686710/","https://www.kijijiautos.ca/vip/22686710/"
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36'}
for url in urls:
response = requests.get(url, headers=headers)
number = re.findall('"number":"([\d-]+)"', response.text)
print(number)
[]
I think you are not getting The HTTP 200 OK success status as a response.for that cause you are unable to get the exptected ouptput. To get the HTTP 200 OK success status, I have changed the headers from inspecting http requests.
please try this
import requests
import re
import requests
headers = {
'authority': 'www.kijijiautos.ca',
'sec-ch-ua': '"Chromium";v="94", "Google Chrome";v="94", ";Not A Brand";v="99"',
'pragma': 'no-cache',
'accept-language': 'en-CA',
'sec-ch-ua-mobile': '?0',
'user-agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.81 Safari/537.36',
'content-type': 'application/json',
'accept': 'application/json',
'cache-control': 'no-cache',
'x-client-id': 'c89e7ff8-1d5a-4c2b-a095-c08dc08ccd3b',
'x-client': 'ca.move.web.app',
'sec-ch-ua-platform': '"Linux"',
'sec-fetch-site': 'same-origin',
'sec-fetch-mode': 'cors',
'sec-fetch-dest': 'empty',
'referer': 'https://www.kijijiautos.ca/cars/hyundai/sonata/used/',
'cookie': 'mvcid=c89e7ff8-1d5a-4c2b-a095-c08dc08ccd3b; locale=en-CA; trty=e; _gcl_au=1.1.1363596757.1633936124; _ga=GA1.2.1193080228.1633936126; _gid=GA1.2.71842091.1633936126; AAMC_kijiji_0=REGION%7C3; aam_uuid=43389576784435124231935699643302941454; _fbp=fb.1.1633936286669.1508597061; __gads=ID=bb71a6fc168c1c33:T=1633936286:S=ALNI_MZk3lgy-9xgSGLPnfrkBET60uS6fA; GCLB=COyIgrWs-PWPsQE; lux_uid=163402080128473094; cto_bundle=zxCnjF95NFglMkZrTG5EZ2dzNHFSdjJ6QSUyQkJvM1BUbk5WTkpjTms0aWdZb3RxZUR3Qk1nd1BmcSUyQjZaZVFUdFdSelpua3pKQjFhTFk0U2ViTHVZbVg5ODVBNGJkZ2NqUGg1cHZJN3V0MWlwRkQwb1htcm5nNDlqJTJGUUN3bmt6ZFkzU1J0bjMzMyUyRkt5aGVqWTJ5RVJCa2ZJQUwxcFJnJTNEJTNE; _td=7f855061-c320-4570-b2d2-73c94bd22b13; rbzid=54THgSkyCRKwhVBqy+iHmjb1RG+uE6uH1XjpsXIazO5nO45GtpIXHGYii/PbJcdG3ahjIgKaBrjh0Yx2J6YCOLHEv3QYL559oz3jQaVrssH2/1Ui9buvIpuCwBOGG2xXGWW2qvcU5807PGsdubQDUvLkxmy4sor+4EzCI1OoUHMOG2asQwsgChqwzJixVvrE21E/NJdRfDLlejb5WeGEgU4B3dOYH95yYf5h+7fxV6H/XLhqbNa8e41DM3scfyeYWeqWCWmOH2VWZ7i3oQ0OXW1SkobLy0D6G+V9J5QMxb0=; rbzsessionid=ca53a07d3404ca93b3f8bc879291dc83; _uetsid=131a47702a6211ecba407d9ff6588dde; _uetvid=131b13602a6211ecacd0b56b0815e9b2',
}
response = requests.get('https://www.kijijiautos.ca/consumer/svc/a/22686710', headers=headers)
if response.status_code == 200:
# print(response.text)
numbers = re.findall(r'"number":"\+\d+"', response.text) # number one or more
print(numbers[0])
else:
print('status code is ', response.status_code)
output
# "number":"+17169905088"

How to show the hidden contents under "View More" when using Selenium - Python

driver = webdriver.Chrome(r'XXXX\chromedriver.exe')
FB_bloomberg_URL = 'https://www.bloomberg.com/quote/FB:US'
driver.get(FB_bloomberg_URL)
board_members = driver.find_elements_by_xpath('//* [#id="root"]/div/div/section[3]/div[10]/div[1]/div[2]/div/div[2]')[0]
board=board_members.text
board.split('\n')
I wrote the coding above to scrape the board information from Bloomberg for FaceBook. But I have trouble to extract all board members because others are hidden behind the "View More". How can I extract all names?
Thanks for the help.
You can do whole thing with requests and grab the appropriate cookie to pass to API from prior GET. The API can be found in the network tab when clicking the view more link and inspecting the web traffic.
import requests
headers = {
'dnt': '1',
'accept-encoding': 'gzip, deflate, br',
'accept-language': 'en-US,en;q=0.9',
'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36',
'accept': '*/*',
'referer': 'https://www.bloomberg.com/quote/FB:US',
'authority': 'www.bloomberg.com',
'cookie':''
}
with requests.Session() as s:
r = s.get('https://www.bloomberg.com/quote/FB:US')
headers['cookie'] = s.cookies.get_dict()['_pxhd']
r = s.get('https://www.bloomberg.com/markets2/api/peopleForCompany/11092218', headers = headers).json()
board_members = [item['name'] for item in r['boardMembers']]
print(board_members)

LinkedIn HTTP Error 999 - Request denied

I am writing a simple script to get public profile visible without login on LinkedIn.
Below is my code to get the page for beautifulsoup. I am using public proxies as well.
import urllib.request, urllib.error
from bs4 import BeautifulSoup
url = "https://www.linkedin.com/company/amazon"
proxy = urllib.request.ProxyHandler({'https': proxy, })
opener = urllib.request.build_opener(proxy)
urllib.request.install_opener(opener)
hdr = {
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3218.0 Safari/537.36',
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
'Accept-Encoding': 'gzip, deflate, br',
'Accept-Language': 'en-US,en;q=0.9,hi;q=0.8',
'Connection': 'keep-alive'}
req = urllib.request.Request(url, headers=hdr)
page = urllib.request.urlopen(req, timeout=20)
self.soup = BeautifulSoup(page.read(), "lxml")
But it is raising "HTTPError 999 - request Denied" error. This is only for testing purpose till I am getting access via partnership program.
What am I doing wrong? Please help.
You did not do anything wrong, LinkedIn blacklist cloud servers ip addresses to prevent "stealing" their data. Questionable practice but this is how it is.

Extracting data tables from website

I want to extract a data table from a website. Pandas read_html is giving a HTTP error 403. Is there any other module through which I can extract the data by python.
Here is the website: https://pakstockexchange.com/stock2/index_new.php?section=research&page=show_price_table_new&symbol=ABOT
Mask your session as if you were using a browser:
import requests
header = {
"User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.75 Safari/537.36",
"X-Requested-With": "XMLHttpRequest"
}
r = requests.get(url, headers=header)
dfs = pd.read_html(r.text)

Resources