Scraping values from View Source using Requests Python 3

Scraping values from View Source using Requests Python 3 - python-3.x

So this code below is working fine but when i change the url to another site it doesn't work
import requests
import re
url = "https://www.autotrader.ca/a/ram/1500/hamilton/ontario/19_12052335_/?showcpo=ShowCpo&ncse=no&ursrc=pl&urp=2&urm=8&sprx=-2"
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36'}
response = requests.get(url, headers=headers)
phone_number = re.findall('"phoneNumber":"([\d-]+)"', response.text)
print(phone_number)
['905-870-7127']
This code below doesn't work it gives the output [] Please tell me what am i doing wrong
import requests
import re
urls = "https://www.kijijiautos.ca/vip/22686710/","https://www.kijijiautos.ca/vip/22686710/"
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36'}
for url in urls:
response = requests.get(url, headers=headers)
number = re.findall('"number":"([\d-]+)"', response.text)
print(number)
[]

I think you are not getting The HTTP 200 OK success status as a response.for that cause you are unable to get the exptected ouptput. To get the HTTP 200 OK success status, I have changed the headers from inspecting http requests.
please try this
import requests
import re
import requests
headers = {
'authority': 'www.kijijiautos.ca',
'sec-ch-ua': '"Chromium";v="94", "Google Chrome";v="94", ";Not A Brand";v="99"',
'pragma': 'no-cache',
'accept-language': 'en-CA',
'sec-ch-ua-mobile': '?0',
'user-agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.81 Safari/537.36',
'content-type': 'application/json',
'accept': 'application/json',
'cache-control': 'no-cache',
'x-client-id': 'c89e7ff8-1d5a-4c2b-a095-c08dc08ccd3b',
'x-client': 'ca.move.web.app',
'sec-ch-ua-platform': '"Linux"',
'sec-fetch-site': 'same-origin',
'sec-fetch-mode': 'cors',
'sec-fetch-dest': 'empty',
'referer': 'https://www.kijijiautos.ca/cars/hyundai/sonata/used/',
'cookie': 'mvcid=c89e7ff8-1d5a-4c2b-a095-c08dc08ccd3b; locale=en-CA; trty=e; _gcl_au=1.1.1363596757.1633936124; _ga=GA1.2.1193080228.1633936126; _gid=GA1.2.71842091.1633936126; AAMC_kijiji_0=REGION%7C3; aam_uuid=43389576784435124231935699643302941454; _fbp=fb.1.1633936286669.1508597061; __gads=ID=bb71a6fc168c1c33:T=1633936286:S=ALNI_MZk3lgy-9xgSGLPnfrkBET60uS6fA; GCLB=COyIgrWs-PWPsQE; lux_uid=163402080128473094; cto_bundle=zxCnjF95NFglMkZrTG5EZ2dzNHFSdjJ6QSUyQkJvM1BUbk5WTkpjTms0aWdZb3RxZUR3Qk1nd1BmcSUyQjZaZVFUdFdSelpua3pKQjFhTFk0U2ViTHVZbVg5ODVBNGJkZ2NqUGg1cHZJN3V0MWlwRkQwb1htcm5nNDlqJTJGUUN3bmt6ZFkzU1J0bjMzMyUyRkt5aGVqWTJ5RVJCa2ZJQUwxcFJnJTNEJTNE; _td=7f855061-c320-4570-b2d2-73c94bd22b13; rbzid=54THgSkyCRKwhVBqy+iHmjb1RG+uE6uH1XjpsXIazO5nO45GtpIXHGYii/PbJcdG3ahjIgKaBrjh0Yx2J6YCOLHEv3QYL559oz3jQaVrssH2/1Ui9buvIpuCwBOGG2xXGWW2qvcU5807PGsdubQDUvLkxmy4sor+4EzCI1OoUHMOG2asQwsgChqwzJixVvrE21E/NJdRfDLlejb5WeGEgU4B3dOYH95yYf5h+7fxV6H/XLhqbNa8e41DM3scfyeYWeqWCWmOH2VWZ7i3oQ0OXW1SkobLy0D6G+V9J5QMxb0=; rbzsessionid=ca53a07d3404ca93b3f8bc879291dc83; _uetsid=131a47702a6211ecba407d9ff6588dde; _uetvid=131b13602a6211ecacd0b56b0815e9b2',
}
response = requests.get('https://www.kijijiautos.ca/consumer/svc/a/22686710', headers=headers)
if response.status_code == 200:
# print(response.text)
numbers = re.findall(r'"number":"\+\d+"', response.text) # number one or more
print(numbers[0])
else:
print('status code is ', response.status_code)
output
# "number":"+17169905088"

Related

Response provided for One Page(200) but not for Other(401)

On the same website here, a consistent valid response is provided for One Page(200-url/nifty) but not provided for anotherpage(401-url/oispurtscontracts).
It does provide a valid response sometimes and other times returns a 401 error.
Browser cache cleared and reloaded.
Please provide a solution.
Error :
response.status_code = 401 for https://www.nseindia.com/api/live-analysis-oi-spurts-contracts
Code:
import requests
def connRequest(url,headers):
session = requests.Session()
request = session.get(url, headers=headers)
cookies = dict(request.cookies)
# print(cookies)
print(f"response.status_code = {request.status_code} for {url}")
response = session.get(url, headers=headers, cookies=cookies).json()
print(f"response = {response}")
return response
# Working - Response Provided
def nifty_Working():
url = 'https://www.nseindia.com/api/option-chain-indices?symbol=NIFTY'
# data = requests.get(url)
headers = {
'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/110.0.0.0 Safari/537.36',
'accept-encoding': 'gzip, deflate, br',
'accept-language': 'en-US,en;q=0.5',
'Accept':'application/json'
}
response = connRequest(url, headers)
# 401 Error
def oiSpurtsContracts_NotWorking():
url = 'https://www.nseindia.com/api/live-analysis-oi-spurts-contracts'
headers = {
'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36',
'accept-encoding': 'gzip, deflate, br',
'accept-language': 'en-US,en',
'Accept': 'application/json'
}
response = connRequest(url, headers)
def main():
# Working - Response Provided
nifty_Working()
print()
print()
print()
print()
# 401 Error
time.sleep(1)
oiSpurtsContracts_NotWorking()
main()

Can I use post method in requests lib on this Binance site?

Here is the site:
https://www.binance.com/ru/futures-activity/leaderboard?type=myProfile&tradeType=PERPETUAL&encryptedUid=E921F42DCD4D9F6ECC0DFCE3BAB1D11A
I am parsing the positions of the trader by selenium, but today realize, that I can use post method.
Here is what "NETWORK" shows:
Here is response preview:
I have no experience with post method of requests, I tried this, but doesn't work.
import requests
hd = {'accept':"*/*",'user-agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.4951.67 Safari/537.36'}
ses = requests.Session()
c = ses.post('https://www.binance.com/bapi/futures/v1/public/future/leaderboard/getOtherPosition',headers=hd)
print(c.text)
Output is:
{"code":"000002","message":"illegal parameter","messageDetail":null,"data":null,"success":false}
Can someone help me to do it, please? Is it real?

It's working as POST method
import requests
url='https://www.binance.com/bapi/futures/v1/public/future/leaderboard/getOtherPerformance'
headers= {
"content-type": "application/json",
"x-trace-id": "4c3d6fce-a2d8-421e-9d5b-e0c12bd2c7c0",
"x-ui-request-trace": "4c3d6fce-a2d8-421e-9d5b-e0c12bd2c7c0"
}
payload = {"encryptedUid":"E921F42DCD4D9F6ECC0DFCE3BAB1D11A","tradeType":"PERPETUAL"}
req=requests.post(url,headers=headers,json=payload).json()
#print(req)
for item in req['data']:
roi = item['value']
print(roi)
Output:
-0.023215
-91841.251668
0.109495
390421.996614
-0.063094
-266413.73955621
0.099181
641189.24407088
0.072079
265977.556474
-0.09197
-400692.52138279
-0.069988
-469016.33171481
0.0445
292594.20440128

I am used Curlconverter, and it helped me a lot! Here is the working code:
import requests
headers = {
'authority': 'www.binance.com',
'accept': '*/*',
'accept-language': 'ru-RU,ru;q=0.9,en-US;q=0.8,en;q=0.7,ja;q=0.6,ko;q=0.5,zh-CN;q=0.4,zh;q=0.3',
'bnc-uuid': '0202c537-8c2b-463a-bdef-33761d21986a',
'clienttype': 'web',
'csrftoken': 'd41d8cd98f00b204e9800998ecf8427e',
'device-info': 'eyJzY3JlZW5fcmVzb2x1dGlvbiI6IjE5MjAsMTA4MCIsImF2YWlsYWJsZV9zY3JlZW5fcmVzb2x1dGlvbiI6IjE5MjAsMTA0MCIsInN5c3RlbV92ZXJzaW9uIjoiV2luZG93cyAxMCIsImJyYW5kX21vZGVsIjoidW5rbm93biIsInN5c3RlbV9sYW5nIjoicnUtUlUiLCJ0aW1lem9uZSI6IkdNVCszIiwidGltZXpvbmVPZmZzZXQiOi0xODAsInVzZXJfYWdlbnQiOiJNb3ppbGxhLzUuMCAoV2luZG93cyBOVCAxMC4wOyBXaW42NDsgeDY0KSBBcHBsZVdlYktpdC81MzcuMzYgKEtIVE1MLCBsaWtlIEdlY2tvKSBDaHJvbWUvMTAxLjAuNDk1MS42NyBTYWZhcmkvNTM3LjM2IiwibGlzdF9wbHVnaW4iOiJQREYgVmlld2VyLENocm9tZSBQREYgVmlld2VyLENocm9taXVtIFBERiBWaWV3ZXIsTWljcm9zb2Z0IEVkZ2UgUERGIFZpZXdlcixXZWJLaXQgYnVpbHQtaW4gUERGIiwiY2FudmFzX2NvZGUiOiI1ZjhkZDMyNCIsIndlYmdsX3ZlbmRvciI6Ikdvb2dsZSBJbmMuIChJbnRlbCkiLCJ3ZWJnbF9yZW5kZXJlciI6IkFOR0xFIChJbnRlbCwgSW50ZWwoUikgVUhEIEdyYXBoaWNzIDYyMCBEaXJlY3QzRDExIHZzXzVfMCBwc181XzAsIEQzRDExKSIsImF1ZGlvIjoiMTI0LjA0MzQ3NTI3NTE2MDc0IiwicGxhdGZvcm0iOiJXaW4zMiIsIndlYl90aW1lem9uZSI6IkV1cm9wZS9Nb3Njb3ciLCJkZXZpY2VfbmFtZSI6IkNocm9tZSBWMTAxLjAuNDk1MS42NyAoV2luZG93cykiLCJmaW5nZXJwcmludCI6IjE5YWFhZGNmMDI5ZTY1MzU3N2Q5OGYwMmE0NDE4Nzc5IiwiZGV2aWNlX2lkIjoiIiwicmVsYXRlZF9kZXZpY2VfaWRzIjoiMTY1MjY4OTg2NTQwMGdQNDg1VEtmWnVCeUhONDNCc2oifQ==',
'fvideo-id': '3214483f88c0abbba34e5ecf5edbeeca1e56e405',
'lang': 'ru',
'origin': 'https://www.binance.com',
'referer': 'https://www.binance.com/ru/futures-activity/leaderboard?type=myProfile&tradeType=PERPETUAL&encryptedUid=E921F42DCD4D9F6ECC0DFCE3BAB1D11A',
'sec-ch-ua': '" Not A;Brand";v="99", "Chromium";v="101", "Google Chrome";v="101"',
'sec-ch-ua-mobile': '?0',
'sec-ch-ua-platform': '"Windows"',
'sec-fetch-dest': 'empty',
'sec-fetch-mode': 'cors',
'sec-fetch-site': 'same-origin',
'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.4951.67 Safari/537.36',
'x-trace-id': 'e9d5223c-5d71-4834-8563-c253a1fc3ae8',
'x-ui-request-trace': 'e9d5223c-5d71-4834-8563-c253a1fc3ae8',
}
json_data = {
'encryptedUid': 'E921F42DCD4D9F6ECC0DFCE3BAB1D11A',
'tradeType': 'PERPETUAL',
}
response = requests.post('https://www.binance.com/bapi/futures/v1/public/future/leaderboard/getOtherPosition', headers=headers, json=json_data)
print(response.text)
So output now is:
{"code":"000000","message":null,"messageDetail":null,
"data":{
"otherPositionRetList":[{"symbol":"ETHUSDT","entryPrice":1985.926527932,"markPrice":2013.57606795,"pnl":41926.93300012,"roe":0.05492624,"updateTime":[2022,5,22,15,35,39,358000000],"amount":1516.370,"updateTimeStamp":1653233739358,"yellow":true,"tradeBefore":false},{"symbol":"KSMUSDT","entryPrice":80.36574159583,"markPrice":79.46000000,"pnl":-1118.13799285,"roe":-0.01128900,"updateTime":[2022,5,16,11,0,5,608000000],"amount":1234.5,"updateTimeStamp":1652698805608,"yellow":false,"tradeBefore":false},{"symbol":"IMXUSDT","entryPrice":0.9969444089129,"markPrice":0.97390429,"pnl":-13861.75961996,"roe":-0.02365747,"updateTime":[2022,5,22,15,57,3,329000000],"amount":601636,"updateTimeStamp":1653235023329,"yellow":true,"tradeBefore":false},{"symbol":"MANAUSDT","entryPrice":1.110770201096,"markPrice":1.09640000,"pnl":-6462.14960820,"roe":-0.05242685,"updateTime":[2022,5,21,16,6,2,291000000],"amount":449691,"updateTimeStamp":1653149162291,"yellow":false,"tradeBefore":false},{"symbol":"EOSUSDT","entryPrice":1.341744945184,"markPrice":1.35400000,"pnl":-4572.78323455,"roe":-0.09051004,"updateTime":[2022,5,22,11,47,48,542000000],"amount":-373134.3,"updateTimeStamp":1653220068542,"yellow":true,"tradeBefore":false},{"symbol":"BTCUSDT","entryPrice":29174.44207538,"markPrice":30015.10000000,"pnl":-173841.33354801,"roe":-0.47613317,"updateTime":[2022,5,21,15,13,0,252000000],"amount":-206.792,"updateTimeStamp":1653145980252,"yellow":false,"tradeBefore":false},{"symbol":"DYDXUSDT","entryPrice":2.21378804417,"markPrice":2.11967778,"pnl":-48142.71521969,"roe":-0.08879676,"updateTime":[2022,5,18,16,40,18,654000000],"amount":511556.5,"updateTimeStamp":1652892018654,"yellow":false,"tradeBefore":false}],"updateTime":[2022,5,16,11,0,5,608000000],"updateTimeStamp":1652698805608},"success":true}

why postmant request and scrapy request send me to diferent web?

i did this spider, and when i make a request with that, send me to diferent web
a regret that I put the same parameters and almost the same header
def start_requests(self):
url ="https://assr.parcelquest.com/Statewide"
rows =self.getPin('parcels/Parcel.csv')
for row in rows:
params = {
'co3': 'MRN',
'apn': 022-631-01,
'recaptchaSuccess':'0',
'IndexViewModel': 'PQGov.Models.IndexViewModel'
}
header = {
'Content-Type': 'application/x-www-form-urlencoded',
'Accept' : '*/*',
'host':'assr.parcelquest.com',
'Referer': 'https://assr.parcelquest.com/Statewide/Index',
'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.75 Safari/537.36',
'Accept-Encoding':'gzip, deflate, br',
'Connection':'keep-alive'
}
#print(params)
yield scrapy.Request(url=self.url,headers=header,body=json.dumps(params),method='POST',callback=self.property,meta = {'parcel':row},dont_filter=True)
this is postman:
could explain me anybody why?

How to prevent beautifulsoap from extracting the information as strange symbols?

I am trying to extract an information with beautifulsoap, however when I do it it extracts it with very rare symbols. But when I enter directly to the page everything looks good and the page has the label <meta charset="utf-8">
my code is:
HEADERS = {
'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36',
'referrer': 'https://google.com',
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
'Accept-Encoding': 'gzip, deflate, br',
'Accept-Language': 'en-US,en;q=0.9',
'Pragma': 'no-cache',
}
urls = 'https://www.jcchouinard.com/web-scraping-with-python-and-requests-html/'
r = requests.get(urls, headers=HEADERS)
soup = bs4.BeautifulSoup(r.text, "html.parser")
print (soup)
Nevertheless, the result I get is this:
J{$%X Àà’8}±ŸÅ
I guess it's something with the encoding, however I don't understand why since the page is utf-8.
It is worth clarifying that this only happens in some cases, since with others I manage to extract the information without problems.
Edit: updated with a sample url.
Edit2: added the headers dictionary, which is the one that generates the problem.

The problem is Accept-Encoding HTTP header. There you have br specified, which means brotli compression method. requests module cannot handle that. Remove br and the server responds without this compression method.
import requests
from bs4 import BeautifulSoup
HEADERS = {
'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36',
'referrer': 'https://google.com',
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
'Accept-Encoding': 'gzip, deflate', # <-- remove br
'Accept-Language': 'en-US,en;q=0.9',
'Pragma': 'no-cache',
}
urls = 'https://www.jcchouinard.com/web-scraping-with-python-and-requests-html/'
r = requests.get(urls, headers=HEADERS)
soup = BeautifulSoup(r.text, "html.parser")
print (soup)
Prints:
<!DOCTYPE html>
<html lang="fr-FR">
<head><style>img.lazy{min-height:1px}</style><link as="script" href="https://www.jcchouinard.com/wp-content/plugins/w3-total-cache/pub/js/lazyload.min.js?x73818" rel="preload"/>
...and so on.

import requests
from bs4 import BeautifulSoup
headers = {
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:79.0) Gecko/20100101 Firefox/79.0'
}
def main(url):
r = requests.get(url, headers=headers)
soup = BeautifulSoup(r.content, 'html.parser')
print(soup.prettify())
main("https://www.jcchouinard.com/web-scraping-with-python-and-requests-html/")
You've just to add headers

How to show the hidden contents under "View More" when using Selenium - Python

driver = webdriver.Chrome(r'XXXX\chromedriver.exe')
FB_bloomberg_URL = 'https://www.bloomberg.com/quote/FB:US'
driver.get(FB_bloomberg_URL)
board_members = driver.find_elements_by_xpath('//* [#id="root"]/div/div/section[3]/div[10]/div[1]/div[2]/div/div[2]')[0]
board=board_members.text
board.split('\n')
I wrote the coding above to scrape the board information from Bloomberg for FaceBook. But I have trouble to extract all board members because others are hidden behind the "View More". How can I extract all names?
Thanks for the help.

You can do whole thing with requests and grab the appropriate cookie to pass to API from prior GET. The API can be found in the network tab when clicking the view more link and inspecting the web traffic.
import requests
headers = {
'dnt': '1',
'accept-encoding': 'gzip, deflate, br',
'accept-language': 'en-US,en;q=0.9',
'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36',
'accept': '*/*',
'referer': 'https://www.bloomberg.com/quote/FB:US',
'authority': 'www.bloomberg.com',
'cookie':''
}
with requests.Session() as s:
r = s.get('https://www.bloomberg.com/quote/FB:US')
headers['cookie'] = s.cookies.get_dict()['_pxhd']
r = s.get('https://www.bloomberg.com/markets2/api/peopleForCompany/11092218', headers = headers).json()
board_members = [item['name'] for item in r['boardMembers']]
print(board_members)

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string

Scraping values from View Source using Requests Python 3 - python-3.x

Related

Response provided for One Page(200) but not for Other(401)

Can I use post method in requests lib on this Binance site?

why postmant request and scrapy request send me to diferent web?

How to prevent beautifulsoap from extracting the information as strange symbols?

How to show the hidden contents under "View More" when using Selenium - Python

Categories

Resources