How do I get the browser cookies passed in headers using rvest/httr?

I am trying to scrape a table online. I can manually add the cookies to my GET request, but is there a way to obtain them programmatically in R?
require("httr");require("rvest");require("dplyr")
# build urls
uaString = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:108.0) Gecko/20100101 Firefox/108.0"
pg <- session(paste0("https://www.barchart.com/etfs-funds/etfs-by-asset-class?dropdown1=1&viewName=main&orderBy=managedAssets&orderDir=desc"),httr::user_agent(uaString))
#pg <- session(url="https://www.barchart.com")
# save page cookies
cookies <- pg$response$cookies
# get token
token <- URLdecode(dplyr::recode("XSRF-TOKEN",
                                 !!!setNames(cookies$value, cookies$name)))
pg2 <- httr::GET(
  url = paste0("https://www.barchart.com/proxies/core-api/v1/quotes/get?",
               "lists=etfs.byAsset(1%3Bus)&fields=symbol%2CsymbolName%2ClastPrice",
               "%2CpercentChange%2CpercentChange1m%2CpercentChange3m%2CaverageVolume20d",
               "%2CtrendSpotterSignal%2Cleverage%2CmanagedAssets%2CsymbolCode%2CsymbolType",
               "%2Copinion%2CrelativeStrength20d%2ChistoricVolatility20d%2ClowPrice1y%2ChighPrice1y",
               "%2CweightedAlpha%2CpercentChangeYtd%2CpercentChange1y%2CdailyLastPrice",
               "%2CmanagementFee%2CdividendRateForward%2CdividendYieldForward",
               "%2ChasOptions&meta=field.shortName%2Cfield.type%2Cfield.description",
               "%2Clists.lastUpdate&hasOptions=true&orderBy=managedAssets&orderDir=desc",
               "&page=1&limit=10000&in(leverage%2C(1x%2C2x%2C3x%2C-1x%2C-2x%2C-3x))=&raw=1"),
  config = httr::add_headers(
    `accept` = "application/json",
    `accept-encoding` = "gzip, deflate, br",
    `accept-language` = "en-US,en;q=0.5",
    `connection` = "keep-alive",
    `Host` = "www.barchart.com",
    `referer` = "https://www.barchart.com/etfs-funds/etfs-by-asset-class?dropdown1=262&viewName=main&orderBy=managedAssets&orderDir=desc",
    `user-agent` = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:108.0) Gecko/20100101 Firefox/108.0",
    `sec-fetch-dest` = "empty",
    `sec-fetch-mode` = "cors",
    `cookie` = "BROWSER_COOKIES",
    `sec-fetch-site` = "same-origin",
    `x-xsrf-token` = token),
  handle = pg$handle)
# raw data
data_raw <- httr::content(pg2)
This works when I manually copy the cookie value from my browser and paste it in place of 'BROWSER_COOKIES' above.
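For what it's worth, this endpoint follows the common Laravel/Angular XSRF pattern: the server sets an XSRF-TOKEN cookie, and the client must echo its URL-decoded value back in an x-xsrf-token header, which is what the URLdecode(...) line above does in R. A minimal sketch of the same round trip (in Python, with invented placeholder cookie values):

```python
from urllib.parse import unquote

# Pretend these are the cookies captured from the initial page load
# (the values here are made-up placeholders).
cookies = {"XSRF-TOKEN": "abc123%3D%3D", "laravel_session": "opaque-session-id"}

# The header value is simply the URL-decoded cookie value.
token = unquote(cookies["XSRF-TOKEN"])
headers = {"x-xsrf-token": token}

print(headers["x-xsrf-token"])  # abc123==
```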

Related

How can I specify the exact http method with python requests?

In Burp Suite, the first line of a captured request is usually GET / HTTP/1.1. However, I am currently practicing Host header injection by supplying an absolute URL, in order to send something like this:
GET https://vulnerable-website.com/ HTTP/1.1
Host: bad-stuff-here
In python I am using the requests library and am unable to specify the exact GET request I need.
import requests
burp0_url = "https://vulnerable-website.com:443/"
burp0_cookies = {[redacted]}
burp0_headers = {"Host": "bad-stuff-here", "User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:68.0) Gecko/20100101 Firefox/68.0", "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8", "Accept-Language": "en-US,en;q=0.5", "Accept-Encoding": "gzip, deflate", "Referer": "https://vulnerable-website.com/", "Connection": "close", "Upgrade-Insecure-Requests": "1"}
output = requests.get(burp0_url, headers=burp0_headers, cookies=burp0_cookies)
print(output, output.text)
I have tried specifying the GET request in the header dictionary (header = {"GET": " / HTTP/1.1", ...}), however this only results in a GET header, not a request line, being sent on the 6th line of the request:
GET / HTTP/1.1
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:68.0) Gecko/20100101 Firefox/68.0
Accept-Encoding: gzip, deflate
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Connection: close
GET: /
Host: bad-stuff-here
Accept-Language: en-US,en;q=0.5
Referer: https://vulnerable-website.com/
Upgrade-Insecure-Requests: 1
Cookie: [redacted]
This is a very specific problem and I'm not sure if anyone has had the same issues but any help is appreciated. Maybe a workaround with urllib or something I'm missing. Thanks.
requests uses urllib3 under the hood.
You have to craft the request yourself, because none of the clients (urllib, requests, http.client) will let you insert a control character, by design.
You can use a plain socket for this:
import socket
from contextlib import closing
from functools import partial

msg = b'GET / HTTP/1.1\r\n\r\n'
s = socket.create_connection(("vulnerable-website.com", 80))
with closing(s):
    s.sendall(msg)
    buf = b''.join(iter(partial(s.recv, 4096), b''))
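To get the exact Burp-style request (absolute URL in the request line plus a forged Host header), you can build the raw bytes yourself before writing them to the socket. A sketch, using the question's placeholder hostnames:

```python
def build_request(target: str, host_header: str) -> bytes:
    """Assemble a raw HTTP/1.1 GET request with an arbitrary Host header."""
    lines = [
        f"GET {target} HTTP/1.1",
        f"Host: {host_header}",
        "Connection: close",
        "",  # blank line terminates the header block
        "",
    ]
    return "\r\n".join(lines).encode("ascii")

msg = build_request("https://vulnerable-website.com/", "bad-stuff-here")
print(msg.decode().splitlines()[0])  # GET https://vulnerable-website.com/ HTTP/1.1
```

The resulting bytes can then be sent over the same socket.create_connection(...) as above (for an https:// target you would first wrap the socket with ssl.create_default_context().wrap_socket(...)).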

Understanding Bearer Authorization for web scraping using python 3.8 and requests

So I am looking to scrape the following site:
https://hyland.csod.com/ux/ats/careersite/4/home?c=hyland
What I am running into with the Python requests library is that the request requires an Authorization header carrying a token of some kind. While I can get this to work if I manually visit the page, copy the token, paste it in, and then run my program, I am wondering how I could avoid this step (after all, what is the point of running a scraper if I still have to visit the actual site manually and retrieve the authorization token?).
I am new to authorization/bearer headers and am hoping someone can clarify how the browser generates this token and how I can simulate that. Here is my code:
import requests
import json
import datetime
today = datetime.datetime.today()
url = "https://hyland.csod.com/services/x/career-site/v1/search"
# actual site: https://hyland.csod.com/ux/ats/careersite/4/home?c=hyland
headers = {
    'authority': 'hyland.csod.com',
    'origin': 'https://hyland.csod.com',
    'authorization': 'Bearer eyJhbGciOiJIUzUxMiIsInR5cCI6IkpXVCIsImNsaWQiOiI0bDhnbnFhbGk3NjgifQ.eyJzdWIiOi0xMDMsImF1ZCI6IjRxNTFzeG5oY25yazRhNXB1eXZ1eGh6eCIsImNvcnAiOiJoeWxhbmQiLCJjdWlkIjoxLCJ0emlkIjoxNCwibmJkIjoiMjAxOTEyMzEyMTE0MTU5MzQiLCJleHAiOiIyMDE5MTIzMTIyMTUxNTkzNCIsImlhdCI6IjIwMTkxMjMxMjExNDE1OTM0In0.PlNdWXtb1uNoMuGIhI093ZbheRN_DwENTlkNoVr0j7Zah6JHd5cukudVFnZEiQmgBZ_nlDU4C-9JO_2We380Vg',
    'content-type': 'application/json',
    'accept': 'application/json; q=1.0, text/*; q=0.8, */*; q=0.1',
    'x-requested-with': 'XMLHttpRequest',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.88 Safari/537.36',
    'csod-accept-language': 'en-US',
    'referer': 'https://hyland.csod.com/ux/ats/careersite/4/home?c=hyland',
    'accept-encoding': 'gzip, deflate, br',
    'cookie': 'CYBERU_lastculture=en-US; ASP.NET_SessionId=4q51sxnhcnrk4a5puyvuxhzx; cscx=hyland^|-103^|1^|14^|KumB4VhzYXML22MnMxjtTB9SKgHiWW0tFg0HbHnOek4=; c-s=expires=1577909201~access=/clientimg/hyland/*^!/content/hyland/*~md5=78cd5252d2efff6eb77d2e6bf0ce3127',
}
data = ['{"careerSiteId":4,"pageNumber":1,"pageSize":25,"cultureId":1,"searchText":"","cultureName":"en-US","states":["oh"],"countryCodes":[],"cities":[],"placeID":"","radius":null,"postingsWithinDays":null,"customFieldCheckboxKeys":[],"customFieldDropdowns":[],"customFieldRadios":[]}',
'{"careerSiteId":4,"pageNumber":2,"pageSize":25,"cultureId":1,"searchText":"","cultureName":"en-US","states":["oh"],"countryCodes":[],"cities":[],"placeID":"","radius":null,"postingsWithinDays":null,"customFieldCheckboxKeys":[],"customFieldDropdowns":[],"customFieldRadios":[]}']
def hyland(url, data):
    # for openings in data:
    dirty = requests.post(url, headers=headers, data=data).text
    if 'Unauthorized' in dirty:
        print(dirty)
        print("There was an error connecting. Check Info")
    # print(dirty)
    clean = json.loads(dirty)
    cleaner = json.dumps(clean, indent=4)
    print("Openings at Hyland Software in Westlake as of {}".format(today.strftime('%m-%d-%Y')))
    for i in range(0, 60):
        try:
            print(clean["data"]["requisitions"][i]["displayJobTitle"])
            print("")
            print("")
        except:
            print("{} Openings at Hyland".format(i))
            break

for datum in data:
    hyland(url, data=datum)
So basically what my code is doing is sending a post request to the url above along with the headers and necessary data to retrieve what I want. This scraper works for a short period of time, but if I leave and come back after a few hours it no longer works due to authorization (at least that is what I have concluded).
Any help/ clarification on how all this works would be greatly appreciated.
Your code has a few problems:
1. As you noted, you have to get the bearer token.
2. You have to send your requests using requests.session() (this page seems to pay attention to the cookies you send).
3. Optional: your headers included a lot of unnecessary entries that can be removed.
All in all, the working code is below:
import requests
import json
import datetime
today = datetime.datetime.today()
session = requests.session()
url = "https://hyland.csod.com:443/ux/ats/careersite/4/home?c=hyland"
headers = {"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:71.0) Gecko/20100101 Firefox/71.0", "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8", "Accept-Language": "en-US,en;q=0.5", "Accept-Encoding": "gzip, deflate", "DNT": "1", "Connection": "close", "Upgrade-Insecure-Requests": "1"}
raw = session.get(url, headers=headers).text
token = raw[raw.index("token")+8:]
token = token[:token.index("\"")]
bearer_token = f"Bearer {token}"
url = "https://hyland.csod.com/services/x/career-site/v1/search"
# actual site: https://hyland.csod.com/ux/ats/careersite/4/home?c=hyland
headers = {"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:71.0) Gecko/20100101 Firefox/71.0", "Authorization": bearer_token}
data = ['{"careerSiteId":4,"pageNumber":1,"pageSize":25,"cultureId":1,"searchText":"","cultureName":"en-US","states":["oh"],"countryCodes":[],"cities":[],"placeID":"","radius":null,"postingsWithinDays":null,"customFieldCheckboxKeys":[],"customFieldDropdowns":[],"customFieldRadios":[]}',
'{"careerSiteId":4,"pageNumber":2,"pageSize":25,"cultureId":1,"searchText":"","cultureName":"en-US","states":["oh"],"countryCodes":[],"cities":[],"placeID":"","radius":null,"postingsWithinDays":null,"customFieldCheckboxKeys":[],"customFieldDropdowns":[],"customFieldRadios":[]}']
def hyland(url, data, session=session):
    # for openings in data:
    dirty = session.post(url, headers=headers, data=data).text
    if 'Unauthorized' in dirty:
        print(dirty)
        print("There was an error connecting. Check Info")
    # print(dirty)
    clean = json.loads(dirty)
    cleaner = json.dumps(clean, indent=4)
    print("Openings at Hyland Software in Westlake as of {}".format(today.strftime('%m-%d-%Y')))
    for i in range(0, 60):
        try:
            print(clean["data"]["requisitions"][i]["displayJobTitle"])
            print("")
            print("")
        except:
            print("{} Openings at Hyland".format(i))
            break

for datum in data:
    hyland(url, data=datum, session=session)
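One caveat: the slice-based extraction above (raw.index("token") + 8) depends on the exact spacing of the embedded JSON and will silently grab garbage if the page layout shifts. A regex is less brittle; this sketch assumes the token still appears as a "token":"..." field somewhere in the page source:

```python
import re

def extract_token(page_source: str) -> str:
    """Pull the bearer token out of an embedded "token":"..." JSON field."""
    match = re.search(r'"token"\s*:\s*"([^"]+)"', page_source)
    if match is None:
        raise ValueError("no token found in page source")
    return match.group(1)

# A made-up snippet standing in for the career-site HTML:
sample = '{"config":{"token":"eyJhbGciOi.example.sig"}}'
print(extract_token(sample))  # eyJhbGciOi.example.sig
```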
Hope this helps.

How to show the hidden contents under "View More" when using Selenium - Python

driver = webdriver.Chrome(r'XXXX\chromedriver.exe')
FB_bloomberg_URL = 'https://www.bloomberg.com/quote/FB:US'
driver.get(FB_bloomberg_URL)
board_members = driver.find_elements_by_xpath('//*[@id="root"]/div/div/section[3]/div[10]/div[1]/div[2]/div/div[2]')[0]
board=board_members.text
board.split('\n')
I wrote the code above to scrape the board information for Facebook from Bloomberg, but I have trouble extracting all the board members because some are hidden behind "View More". How can I extract all the names?
Thanks for the help.
You can do the whole thing with requests: grab the appropriate cookie from a prior GET and pass it to the API. The API endpoint can be found in the browser's network tab by clicking the "View More" link and inspecting the web traffic.
import requests
headers = {
    'dnt': '1',
    'accept-encoding': 'gzip, deflate, br',
    'accept-language': 'en-US,en;q=0.9',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36',
    'accept': '*/*',
    'referer': 'https://www.bloomberg.com/quote/FB:US',
    'authority': 'www.bloomberg.com',
    'cookie': ''
}

with requests.Session() as s:
    r = s.get('https://www.bloomberg.com/quote/FB:US')
    headers['cookie'] = s.cookies.get_dict()['_pxhd']
    r = s.get('https://www.bloomberg.com/markets2/api/peopleForCompany/11092218', headers=headers).json()
    board_members = [item['name'] for item in r['boardMembers']]
    print(board_members)
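The code above forwards only the _pxhd cookie value. If the endpoint ever requires more of the session's cookies, you can serialize the whole jar into a proper Cookie header value instead; a small sketch (the cookie names here are invented):

```python
def cookie_header(cookies: dict) -> str:
    """Serialize a name -> value cookie dict into a Cookie header value."""
    return "; ".join(f"{name}={value}" for name, value in cookies.items())

jar = {"_pxhd": "abc123", "session_id": "xyz789"}
print(cookie_header(jar))  # _pxhd=abc123; session_id=xyz789
```

With requests this would be headers['cookie'] = cookie_header(s.cookies.get_dict()).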

Why can't I get results from the lagou website when web scraping?

I'm using Python 3.6.5 and my OS is macOS 10.13.6.
I'm learning web scraping and I want to fetch data from this website (https://www.lagou.com/jobs/list_python?labelWords=&fromSearch=true&suginput=).
Here is my code:
# encoding: utf-8
import requests
from lxml import etree
def parse_list_page():
    url = 'https://www.lagou.com/jobs/positionAjax.json?city=%E6%B7%B1%E5%9C%B3&needAddtionalResult=false'
    headers = {
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.109 Safari/537',
        'Host': 'www.lagou.com',
        'Referer': 'https://www.lagou.com/jobs/list_python?labelWords=&fromSearch=true&suginput=',
        'X-Anit-Forge-Code': '0',
        'X-Anit-Forge-Token': None,
        'X-Requested-With': 'XMLHttpRequest',
    }
    data = {
        'first': 'false',
        'pn': 1,
        'kd': 'python',
    }
    response = requests.post(url, headers=headers, data=data)
    print(response.json())

def main():
    parse_list_page()

if __name__ == '__main__':
    main()
I appreciate you taking the time to answer my question.
I found the answer myself; here is the code:
# encoding: utf-8
import requests
from lxml import etree
import time
def parse_list_page():
    url = 'https://www.lagou.com/jobs/list_python?px=default&city=%E6%B7%B1%E5%9C%B3'
    headers = {
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.109 Safari/537',
        'Host': 'www.lagou.com',
        'Referer': 'https://www.lagou.com/',
        'Connection': 'keep-alive',
        'Accept-Encoding': 'gzip, deflate, br',
        'Accept-Language': 'zh-CN,zh;q=0.9,en;q=0.8',
        'Upgrade-Insecure-Requests': '1',
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
        'Cache-Control': 'no-cache',
        'Pragma': 'no-cache',
    }
    response = requests.get(url, headers=headers)
    # print(response.text)
    r = requests.utils.dict_from_cookiejar(response.cookies)
    print(r)
    print('=' * 30)
    # r['LGUID'] = r['LGRID']
    # r['user_trace_token'] = r['LGRID']
    # r['LGSID'] = r['LGRID']
    cookies = {
        # 'X_MIDDLE_TOKEN': 'df7c1d3cfdf279f0caf13df990723620',
        # 'JSESSIONID': 'ABAAABAAAIAACBI29FE9BDFB6838D8DD69C580E517292C9',
        # '_ga': 'GA1.2.820168368.1551196380',
        # '_gat': '1',
        # 'Hm_lvt_4233e74dff0ae5bd0a3d81c6ccf756e6': '1551196381',
        # 'user_trace_token': '20190226235303-99bc357a-39de-11e9-921f-525400f775ce',
        # 'LGSID': '20190311094827-c3bc2393-439f-11e9-a15a-525400f775ce',
        # 'PRE_UTM': '',
        # 'PRE_HOST': '',
        # 'PRE_SITE': '',
        # 'PRE_LAND': 'https%3A%2F%2Fwww.lagou.com%2F',
        # 'LGUID': '20190226235303-99bc3944-39de-11e9-921f-525400f775ce',
        # '_gid': 'GA1.2.1391680888.1552248111',
        # 'index_location_city': '%E6%B7%B1%E5%9C%B3',
        # 'TG-TRACK-CODE': 'index_search',
        # 'LGRID': '20190311100452-0ed0525c-43a2-11e9-9113-5254005c3644',
        # 'Hm_lpvt_4233e74dff0ae5bd0a3d81c6ccf756e6': '1552269893',
        # 'SEARCH_ID': 'aae3c38ec76545fc86cd4e23153afe44',
    }
    cookies.update(r)
    print(r)
    print('=' * 30)
    print(cookies)
    print('=' * 30)
    headers = {
        'Origin': 'https://www.lagou.com',
        'X-Anit-Forge-Code': '0',
        'X-Anit-Forge-Token': None,
        'X-Requested-With': 'XMLHttpRequest',
        'Accept-Encoding': 'gzip, deflate, br',
        'Accept-Language': 'zh-CN,zh;q=0.9,en;q=0.8',
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.109 Safari/537.36',
        'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8',
        'Accept': 'application/json, text/javascript, */*; q=0.01',
        'Referer': 'https://www.lagou.com/jobs/list_python?px=default&city=%E6%B7%B1%E5%9C%B3',
        'Connection': 'keep-alive',
    }
    params = {
        'px': 'default',
        'city': '深圳',
        'needAddtionalResult': 'false'
    }
    data = {
        'first': 'true',
        'pn': 1,
        'kd': 'python',
    }
    url_json = 'https://www.lagou.com/jobs/positionAjax.json'
    response = requests.post(url=url_json, headers=headers, params=params, cookies=cookies, data=data)
    print(response.json())

def main():
    parse_list_page()

if __name__ == '__main__':
    main()
The reason I couldn't get the JSON response is the site's anti-scraping rule: you must send back the cookies set by the first request.
So save the cookies from the first request, then merge them into the cookies you send with the second (JSON) request. I hope this helps anyone who hits the same problem while web scraping.
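The save-then-reuse flow described above boils down to a plain dict merge, which is all cookies.update(r) in the answer does. A minimal, network-free sketch of the mechanism (the cookie names mimic Lagou's, but the values are invented):

```python
# Cookies captured from the first GET of the listing page (values invented).
first_response_cookies = {"LGUID": "fake-uid", "user_trace_token": "fake-token"}

# Start from any cookies you already hold, then fold in the fresh ones,
# exactly as cookies.update(r) does in the answer above.
cookies = {"index_location_city": "%E6%B7%B1%E5%9C%B3"}
cookies.update(first_response_cookies)

# The merged dict is what the second, JSON request must carry:
# requests.post(url_json, headers=headers, params=params, cookies=cookies, data=data)
print(sorted(cookies))  # ['LGUID', 'index_location_city', 'user_trace_token']
```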

I want to crawl a website with Python, but I've hit a problem: the requests library works, yet Scrapy gets a 400 (code below)

I want to crawl a website with Python, but I've run into trouble: the requests library works fine, yet Scrapy returns a 400 error. The code is below.
import requests

urls = "https://pan.baidu.com/s/1sj1JLJv"
headers = {
    'User-Agent': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; 360SE)',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    "Accept-Language": "zh-CN,zh;q=0.9,en-US;q=0.5,en;q=0.3",
    "Accept-Encoding": "gzip, deflate",
    'Content-Length': '0',
    "Connection": "keep-alive"
}
print(str((requests.get(urls, headers=headers)).content, 'utf-8'))
from scrapy_redis.spiders import RedisCrawlSpider

class baiduuSpider(RedisCrawlSpider):
    ...
    ...
    ...
    urls = "https://pan.baidu.com/s/1sj1JLJv"
    yield scrapy.Request(url=urls, headers=headers, callback=self.first_parse)

    def first_parse(self, response):
        print(response.body.decode('utf-8'))
How do I fix this problem?
I'm sorry, but you won't succeed, because the page loads its content dynamically.
You need to execute the JavaScript on the fly with a tool such as Selenium or Splash.
