Find marked elements, using python

Find marked elements, using python - python-3.x

I need to get an element from this website: https://channelstore.roku.com/en-gb/details/38e7b84fe064cf927ad471ed632cc3d8/vlrpdd2
Task image:
I tried this code:
import requests
from bs4 import BeautifulSoup
page = requests.get('https://channelstore.roku.com/en-gb/details/38e7b84fe064cf927ad471ed632cc3d8/vlrpdd2')
soup = BeautifulSoup(page.content, 'html.parser')
print(soup.prettify())
I got a document, but I only see metadata without the result I expected:

If you inspect your browsers Network calls (Click on F12), you'll see that the data is loaded dynamically from:
https://channelstore.roku.com/api/v6/channels/detailsunion/38e7b84fe064cf927ad471ed632cc3d8
So, to mimic the response, you can send a GET request to the URL.
Note, there's no need for BeautifulSoup:
import requests
headers = {
'authority': 'channelstore.roku.com',
'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
'accept-language': 'en-US,en;q=0.9',
'cache-control': 'max-age=0',
'cookie': '_csrf=ZaFYG2W7HQA4xqKW3SUfuta0; ks.locale=j%3A%7B%22language%22%3A%22en%22%2C%22country%22%3A%22GB%22%7D; _usn=c2062c71-f89e-456f-9374-7c7767afc665; _uc=54c11aa8-597e-4155-bfa1-379add00fc85%3Aa044adeb2f02798a3c8335d874d49562; _ga=GA1.3.760826055.1671811832; _gid=GA1.3.2471563.1671811832; _ga=GA1.1.760826055.1671811832; _cs_c=0; roku_test; AWSELB=0DC9CDB91658555B919B869A2ED9157DFA13B446022D0100EBAAD7261A39D5A536AC0223E5570FAECF0099832FA9F5DB8028018FCCD9D0A49D8F2BDA087916BC1E51F73D1E; AWSELBCORS=0DC9CDB91658555B919B869A2ED9157DFA13B446022D0100EBAAD7261A39D5A536AC0223E5570FAECF0099832FA9F5DB8028018FCCD9D0A49D8F2BDA087916BC1E51F73D1E; _gat_UA-678051-1=1; _ga_ZZXW5ZLMQ5=GS1.1.1671811832.1.1.1671812598.0.0.0; _cs_id=8a1f1aec-e083-a585-e054-158b6714ab4a.1671811832.1.1671812598.1671811832.1.1705975832137; _cs_s=10.5.0.1671814398585',
'sec-ch-ua': '"Not?A_Brand";v="8", "Chromium";v="108", "Google Chrome";v="108"',
'sec-ch-ua-mobile': '?0',
'sec-ch-ua-platform': '"macOS"',
'sec-fetch-dest': 'document',
'sec-fetch-mode': 'navigate',
'sec-fetch-site': 'none',
'sec-fetch-user': '?1',
'upgrade-insecure-requests': '1',
'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36',
}
params = {
'country': 'GB',
'language': 'en',
}
response = requests.get(
'https://channelstore.roku.com/api/v6/channels/detailsunion/38e7b84fe064cf927ad471ed632cc3d8',
params=params,
headers=headers,
).json()
# Uncomment to print all the data
# from pprint import pprint
# pprint(response)
print(response.get("feedChannel").get("name"))
print("rating: ", response.get("feedChannel").get("starRatingCount"))
Prints:
The Silver Collection Comedies
rating: 3

you will need to use selenium for python as you need to load the javascript.
beautifulsoup can only really handle static websites.

Related

Scraping kenpom data in 2023

I have been scouring across the web to figure out how to use beautifulsoup and pandas to scrape kenpom.com college basketball data. I do not have an account to his website, hence why I am not using the kenpompy library. I have seen some past examples to scrape it from years past, including using the pracpred library (though I have zero experience on it, I'll admit) or using the curlconverter to grab the headers, cookies, and parameters during the requests.get, but now the website seems stingier in terms of grabbing the main table these days. I have used the following code so far:
import requests
from bs4 import BeautifulSoup
url ='https://kenpom.com/index.php?y=2023'
import requests
import requests
cookies = {
'PHPSESSID': 'f04463ec42584dbd1bf7a480098947d1',
'_ga': 'GA1.2.120164513.1673124870',
'_gid': 'GA1.2.622021765.1673496414',
'__stripe_mid': '71a4117b-cbef-4d3c-b31b-9d18e5c99a33183485',
'__stripe_sid': '99b77b80-1222-4f7a-b2a8-acf5cf19c7d18a637f',
'kenpomtry': 'https%3A%2F%2Fkenpom.com%2Fsummary.php%3Fs%3DRankAPL_Off%26y%3D2021',
'_gat_gtag_UA_11558853_1': '1',
}
headers = {
'authority': 'kenpom.com',
'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
'accept-language': 'en-US,en;q=0.9',
# 'cookie': 'PHPSESSID=f04463ec42584dbd1bf7a480098947d1; _ga=GA1.2.120164513.1673124870; _gid=GA1.2.622021765.1673496414; __stripe_mid=71a4117b-cbef-4d3c-b31b-9d18e5c99a33183485; __stripe_sid=99b77b80-1222-4f7a-b2a8-acf5cf19c7d18a637f; kenpomtry=https%3A%2F%2Fkenpom.com%2Fsummary.php%3Fs%3DRankAPL_Off%26y%3D2021; _gat_gtag_UA_11558853_1=1',
'sec-ch-ua': '"Not?A_Brand";v="8", "Chromium";v="108", "Google Chrome";v="108"',
'sec-ch-ua-mobile': '?0',
'sec-ch-ua-platform': '"Windows"',
'sec-fetch-dest': 'document',
'sec-fetch-mode': 'navigate',
'sec-fetch-site': 'none',
'sec-fetch-user': '?1',
'upgrade-insecure-requests': '1',
'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36',
}
params = {
'y': '2023',
}
response = requests.get('https://kenpom.com/index.php', params=params, cookies=cookies, headers=headers)
soup = BeautifulSoup(response.text, 'html.parser')
soup
# table = soup.find('table',{'id':'ratings-table'}).tbody
Any suggestions beyond this would be truly appreciated.

Using requests and adding UserAgent as headers (It's pretty messy with multiple hierarchal indexes so will need further parsing to clean completely):
import pandas as pd
import requests
from bs4 import BeautifulSoup
headers = {
"User-Agent":
"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.5112.79 Safari/537.36"
}
url = "https://kenpom.com/index.php?y=2023"
with requests.Session() as request:
response = request.get(url, timeout=30, headers=headers)
if response.status_code != 200:
print(response.raise_for_status())
soup = BeautifulSoup(response.text, "html.parser")
df = pd.concat(pd.read_html(str(soup)))

HTTP Request using Python

For those who are reading my thread, I'd like to thank you in advance for your assistance in advance and would also like to ask for a bit of leniency when it comes to incorrect terminology, as I am still a 'Newbie'.
I've been trying to retrieve stock codes from KRX website, as I could not find any other resource to retrieve the information that I need. I tried to use requests library in python, but because the data I needed was loaded Asynchronously, which made the data inaccessible.
The problem is that in order to retrieve the information, I need to make two requests to an endpoint, one to retrieve code to be used as body for the second request, but when I made the second request, it returns empty list.
I managed to locate the API calls which retrieved the stock codes as shown below.
TwoRequests
To my knowledge, it requires two API calls, one to retrieve code, which works as access token for the second request in order to retrieve the Stock code that I am trying to retrieve.
I've managed to retrieve the code for the first request with the following codes
import requests
url = 'https://global.krx.co.kr/contents/COM/GenerateOTP.jspx'
headers = {
'Cookie': 'SCOUTER=x22rkf7ltsmr7l; __utma=88009422.986813715.1652669493.1652669493.1652669493.1; SCOUTER=z6pj0p85muce99; JSESSIONID=bOnAJtLWSpK1BiCuhWD0ldj1TqW5z6wEcn65oVgtyie841OlbdJs3fEHpUs1QtAV.bWRjX2RvbWFpbi9tZGNvd2FwMS1tZGNhcHAwMQ==; JSESSIONID=C2794518AD56B7119F0DA630B73B05AA.58tomcat2',
'Connection': 'keep-alive',
'accept': '*/*',
'accept-enconding': 'gzip, deflate, br',
'accept-language': 'en-US,en;q=0.9,ko;q=0.8',
'host': 'global.krx.co.kr',
'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.114 Safari/537.36',
}
params = {
'bld': 'COM/stock_isu_info',
'name': 'finderBld',
'_': '1668677450106',
}
# make get request to the url and keep the connection open
response = requests.get(url, headers=headers, params=params, stream=True)
# response = requests.get(url, params=params, headers=headers)
relay_data = response.text
but upon sending a request to the second endpoint with the code as payload, it returns empty list, but I was expecting the response value for the second request as the following:
PayloadNeeded
The code I used to make the second request is the following (I added lots values for the header and body in hopes to retrieve the data by simulating the values used on the web page):
url = 'https://global.krx.co.kr/contents/GLB/99/GLB99000001.jspx'
headers = {
# ':authority': 'global.krx.co.kr',
# ':method': 'POST',
# ':path': '/contents/GLB/99/GLB99000001.jspx',
# ':scheme': 'https',
'accept': '*/*',
'accept-encoding': 'gzip, deflate, br',
'accept-language': 'en-US,en;q=0.9,ko;q=0.8',
'content-length': '0',
'content-type': 'application/x-www-form-urlencoded; charset=UTF-8',
'Cookie': 'SCOUTER=x22rkf7ltsmr7l; __utma=88009422.986813715.1652669493.1652669493.1652669493.1; SCOUTER=z6pj0p85muce99; JSESSIONID=bOnAJtLWSpK1BiCuhWD0ldj1TqW5z6wEcn65oVgtyie841OlbdJs3fEHpUs1QtAV.bWRjX2RvbWFpbi9tZGNvd2FwMS1tZGNhcHAwMQ==; JSESSIONID=C2794518AD56B7119F0DA630B73B05AA.58tomcat2',
'origin': 'https://global.krx.co.kr',
'referer': 'https://global.krx.co.kr/contents/GLB/99/GLB99000001.jsp',
'sec-ch-ua': '"Google Chrome";v="89", "Chromium";v="89", ";Not A Brand";v="99"',
'sec-ch-ua-mobile': '?0',
'sec-fetch-dest': 'empty',
'sec-fetch-mode': 'cors',
'sec-fetch-site': 'same-origin',
'sec-gpc': '1',
'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.114 Safari/537.36',
'x-requested-with': 'XMLHttpRequest',
}
payload = {
'market_gubun': '0',
'isu_cdnm': 'All',
'isu_cd': '',
'isu_nm': '',
'isu_srt_cd': '',
'sort':'',
'ck_std_ind_cd': '20',
'par_pr': '',
'cpta_scl': '',
'sttl_trm': '',
'lst_stk_vl': '1',
'in_lst_stk_vl': '',
'in_lst_stk_vl2': '',
'cpt': '1',
'in_cpt': '',
'in_cpt2': '',
'nat_tot_amt': '1',
'in_nat_tot_amt': '',
'in_nat_tot_amt2': '',
'pagePath': '/contents/GLB/03/0308/0308010000/GLB0308010000.jsp',
'code': relay_data,
'pageFirstCall': 'Y',
}
# make request with url, headers, body
response = requests.post(url, headers=headers, data=payload)
print(response.text)
And here is the output for the code above:
{"DS1":[]}
Any help would be very much appreciated

How can get the json data automatically instead of copy and paste manually?

I want to get the json data in the target url:
target url
To get it manually :open it in brower manually and copy,paste.I want a more samrt way--programmatically and automatically,have tried with several way,all failed.
Method 1--traditional way with wget or curl:
wget https://xueqiu.com/stock/cata/stocktypelist.json?page=1&size=300
--2021-02-09 11:55:44-- https://xueqiu.com/stock/cata/stocktypelist.json?page=1
Resolving xueqiu.com (xueqiu.com)... 39.96.249.191
Connecting to xueqiu.com (xueqiu.com)|39.96.249.191|:443... connected.
HTTP request sent, awaiting response... 403 Forbidden
2021-02-09 11:55:44 ERROR 403: Forbidden.
Method 2--scrapy with selenium:
>>> from selenium import webdriver
>>> browser = webdriver.Chrome()
>>> url="https://xueqiu.com/stock/cata/stocktypelist.json?page=1&size=300"
>>> browser.get(url)
It happen to me in the browser:
{"error_description":"遇到错误，请刷新页面或者重新登录帐号后再试","error_uri":"/stock/cata/stocktypelist.json","error_code":"400016"}
Method 3--build a mitmproxy:
mitmweb --listen-host 127.0.0.1 -p 8080
Set proxy in browser and open the target url in browser
Error info in terminal:
Web server listening at http://127.0.0.1:8081/
Opening in existing browser session.
Proxy server listening at http://127.0.0.1:8080
127.0.0.1:41268: clientconnect
127.0.0.1:41270: clientconnect
127.0.0.1:41268: HTTP/2 connection terminated by client: error code: 0, last stream id: 0, additional data: None
Error info in browser:
error_description "遇到错误，请刷新页面或者重新登录帐号后再试"
error_uri "/stock/cata/stocktypelist.json"
error_code "400016"
So powerful site to protect the data ,is there no way to get the data automatically?

You could use requests module
import json
import requests
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:77.0) Gecko/20100101 Firefox/77.0",}
import requests
cookies = {
'xq_a_token': '176b14b3953a7c8a2ae4e4fae4c848decc03a883',
'xqat': '176b14b3953a7c8a2ae4e4fae4c848decc03a883',
'xq_r_token': '2c9b0faa98159f39fa3f96606a9498edb9ddac60',
'xq_id_token': 'eyJ0eXAiOiJKV1QiLCJhbGciOiJSUzI1NiJ9.eyJ1aWQiOi0xLCJpc3MiOiJ1YyIsImV4cCI6MTYxMzQ0MzE3MSwiY3RtIjoxNjEyODQ5MDY2ODI3LCJjaWQiOiJkOWQwbjRBWnVwIn0.VuyNicSjIvVkp9FrCzIlRyx8487XM4HH1C3X9KsFA2FipFiilSifBhux9pMNRyziHHiEifhX-xOgccc8IG1mn8cOylOVy3b-L1YG2T5Hs8MKgx7qm4gnV5Mzm_5_G5BiNtO44aczUcmp0g53dp7-0_Bvw3RlwXzT1DTvCKTV-s_zfBsOPyFTfiqyDUxU-oBRvkz1GpgVJzJL4EmZ8zDE2PBqeW00ueLLC7qPW50WeDCsEFS4ZPAvd2SbX9JPk-lU2WzlcMck2S9iFYmpDwuTeQuPbSeSl6jt5suwTImSgJDIUP9o2TX_Z7nNRDTYxvbP8XlejSt8X0pRDPDd_zpbMQ',
'u': '661612849116563',
'device_id': '24700f9f1986800ab4fcc880530dd0ed',
'Hm_lvt_1db88642e346389874251b5a1eded6e3': '1612849123',
's': 'c111f3y1kn',
'Hm_lpvt_1db88642e346389874251b5a1eded6e3': '1612849252',
}
headers = {
'Connection': 'keep-alive',
'Cache-Control': 'no-cache',
'sec-ch-ua': '"Chromium";v="88", "Google Chrome";v="88", ";Not A Brand";v="99"',
'sec-ch-ua-mobile': '?0',
'Upgrade-Insecure-Requests': '1',
'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.150 Safari/537.36',
'Accept': 'image/avif,image/webp,image/apng,image/svg+xml,image/*,*/*;q=0.8',
'Sec-Fetch-Site': 'same-origin',
'Sec-Fetch-Mode': 'no-cors',
'Sec-Fetch-User': '?1',
'Sec-Fetch-Dest': 'image',
'Accept-Language': 'en-US,en;q=0.9',
'Pragma': 'no-cache',
'Referer': '',
}
params = (
('page', '1'),
('size', '300'),
)
response = requests.get('https://xueqiu.com/stock/cata/stocktypelist.json', headers=headers, params=params, cookies=cookies)
print(response.status_code)
json_data = response.json()
print(json_data)

You could use scrapy:
import json
import scrapy
class StockSpider(scrapy.Spider):
name = 'stock_spider'
start_urls = ['https://xueqiu.com/stock/cata/stocktypelist.json?page=1&size=300']
custom_settings = {
'DEFAULT_REQUEST_HEADERS': {
'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.16; rv:85.0) Gecko/20100101 Firefox/85.0',
'Host': 'xueqiu.com',
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
'Accept-Language': 'en-US',
'Accept-Encoding': 'gzip,deflate,br',
'Connection': 'keep-alive',
'Cache-Control': 'no-cache',
'Sec-Fetch-Dest': 'document',
'Sec-Fetch-Mode': 'navigate',
'Sec-Fetch-Site': 'none',
'Sec-Fetch-User': '?1',
'Upgrade-Insecure-Requests': '1',
'Pragma': 'no-cache',
'Referer': '',
},
'ROBOTSTXT_OBEY': False
}
handle_httpstatus_list = [400]
def parse(self, response):
json_result = json.loads(response.body)
yield json_result
Run spider: scrapy crawl stock_spider

How to prevent beautifulsoap from extracting the information as strange symbols?

I am trying to extract an information with beautifulsoap, however when I do it it extracts it with very rare symbols. But when I enter directly to the page everything looks good and the page has the label <meta charset="utf-8">
my code is:
HEADERS = {
'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36',
'referrer': 'https://google.com',
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
'Accept-Encoding': 'gzip, deflate, br',
'Accept-Language': 'en-US,en;q=0.9',
'Pragma': 'no-cache',
}
urls = 'https://www.jcchouinard.com/web-scraping-with-python-and-requests-html/'
r = requests.get(urls, headers=HEADERS)
soup = bs4.BeautifulSoup(r.text, "html.parser")
print (soup)
Nevertheless, the result I get is this:
J{$%X Àà’8}±ŸÅ
I guess it's something with the encoding, however I don't understand why since the page is utf-8.
It is worth clarifying that this only happens in some cases, since with others I manage to extract the information without problems.
Edit: updated with a sample url.
Edit2: added the headers dictionary, which is the one that generates the problem.

The problem is Accept-Encoding HTTP header. There you have br specified, which means brotli compression method. requests module cannot handle that. Remove br and the server responds without this compression method.
import requests
from bs4 import BeautifulSoup
HEADERS = {
'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36',
'referrer': 'https://google.com',
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
'Accept-Encoding': 'gzip, deflate', # <-- remove br
'Accept-Language': 'en-US,en;q=0.9',
'Pragma': 'no-cache',
}
urls = 'https://www.jcchouinard.com/web-scraping-with-python-and-requests-html/'
r = requests.get(urls, headers=HEADERS)
soup = BeautifulSoup(r.text, "html.parser")
print (soup)
Prints:
<!DOCTYPE html>
<html lang="fr-FR">
<head><style>img.lazy{min-height:1px}</style><link as="script" href="https://www.jcchouinard.com/wp-content/plugins/w3-total-cache/pub/js/lazyload.min.js?x73818" rel="preload"/>
...and so on.

import requests
from bs4 import BeautifulSoup
headers = {
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:79.0) Gecko/20100101 Firefox/79.0'
}
def main(url):
r = requests.get(url, headers=headers)
soup = BeautifulSoup(r.content, 'html.parser')
print(soup.prettify())
main("https://www.jcchouinard.com/web-scraping-with-python-and-requests-html/")
You've just to add headers

why i can't get the result from lagou this web site by using web scraping

I m using python 3.6.5 and my os system is macOS 10.13.6.
I m learning Web Scraping and I want to catch data from this web site(https://www.lagou.com/jobs/list_python?labelWords=&fromSearch=true&suginput=)
Here is my code:
# encoding: utf-8
import requests
from lxml import etree
def parse_list_page():
url = 'https://www.lagou.com/jobs/positionAjax.json?city=%E6%B7%B1%E5%9C%B3&needAddtionalResult=false'
headers = {
'User-Agent':'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.109 Safari/537',
'Host':'www.lagou.com',
'Referer':'https://www.lagou.com/jobs/list_python?labelWords=&fromSearch=true&suginput=',
'X-Anit-Forge-Code':'0',
'X-Anit-Forge-Token':None,
'X-Requested-With':'XMLHttpRequest',
}
data = {
'first':'false',
'pn':1,
'kd':'python',
}
response = requests.post(url,headers=headers,data=data)
print(response.json())
def main():
parse_list_page()
if __name__ == '__main__':
main()
I m appreciate you for spending the time to answer my question.

I got the answer, here is the code below:
# encoding: utf-8
import requests
from lxml import etree
import time
def parse_list_page():
url = 'https://www.lagou.com/jobs/list_python?px=default&city=%E6%B7%B1%E5%9C%B3'
headers = {
'User-Agent':'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.109 Safari/537',
'Host':'www.lagou.com',
'Referer':'https://www.lagou.com/',
'Connection':'keep-alive',
'Accept-Encoding':'gzip, deflate, br',
'Accept-Language':'zh-CN,zh;q=0.9,en;q=0.8',
'Upgrade-Insecure-Requests':'1',
'Accept':'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
'Cache-Control':'no-cache',
'Pragma':'no-cache',
}
response = requests.get(url,headers=headers)
# print(response.text)
r = requests.utils.dict_from_cookiejar(response.cookies)
print(r)
print('='*30)
# r['LGUID'] = r['LGRID']
# r['user_trace_token'] = r['LGRID']
# r['LGSID'] = r['LGRID']
cookies = {
# 'X_MIDDLE_TOKEN':'df7c1d3cfdf279f0caf13df990723620',
# 'JSESSIONID':'ABAAABAAAIAACBI29FE9BDFB6838D8DD69C580E517292C9',
# '_ga':'GA1.2.820168368.1551196380',
# '_gat':'1',
# 'Hm_lvt_4233e74dff0ae5bd0a3d81c6ccf756e6':'1551196381',
# 'user_trace_token':'20190226235303-99bc357a-39de-11e9-921f-525400f775ce',
# 'LGSID':'20190311094827-c3bc2393-439f-11e9-a15a-525400f775ce',
# 'PRE_UTM':'',
# 'PRE_HOST':'',
# 'PRE_SITE':'',
# 'PRE_LAND':'https%3A%2F%2Fwww.lagou.com%2F',
# 'LGUID':'20190226235303-99bc3944-39de-11e9-921f-525400f775ce',
# '_gid':'GA1.2.1391680888.1552248111',
# 'index_location_city':'%E6%B7%B1%E5%9C%B3',
# 'TG-TRACK-CODE':'index_search',
# 'LGRID':'20190311100452-0ed0525c-43a2-11e9-9113-5254005c3644',
# 'Hm_lpvt_4233e74dff0ae5bd0a3d81c6ccf756e6':'1552269893',
# 'SEARCH_ID':'aae3c38ec76545fc86cd4e23153afe44',
}
cookies.update(r)
print(r)
print('=' * 30)
print(cookies)
print('=' * 30)
headers = {
'Origin':'https://www.lagou.com',
'X-Anit-Forge-Code': '0',
'X-Anit-Forge-Token': None,
'X-Requested-With': 'XMLHttpRequest',
'Accept-Encoding': 'gzip, deflate, br',
'Accept-Language': 'h-CN,zh;q=0.9,en;q=0.8',
'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.109 Safari/537.36',
'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8',
'Accept': 'application/json, text/javascript, */*; q=0.01',
'Referer': 'https://www.lagou.com/jobs/list_python?px=default&city=%E6%B7%B1%E5%9C%B3',
'Connection': 'keep-alive',
}
params = {
'px':'default',
'city':'深圳',
'needAddtionalResult':'false'
}
data = {
'first':'true',
'pn':1,
'kd':'python',
}
url_json = 'https://www.lagou.com/jobs/positionAjax.json'
response = requests.post(url=url_json,headers=headers,params=params,cookies=cookies,data=data)
print(response.json())
def main():
parse_list_page()
if __name__ == '__main__':
main()
The reason why I can't get the json as response is the against web scraping rules here is you need to use the first cookie when you send the request.
so when you first send the request you need to save the cookies and then update it to use your second page request. Hope it will helpful for you to do web scraping when you face this problem

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string

Find marked elements, using python - python-3.x

you will need to use selenium for python as you need to load the javascript. beautifulsoup can only really handle static websites.

Related

Scraping kenpom data in 2023

HTTP Request using Python

How can get the json data automatically instead of copy and paste manually?

How to prevent beautifulsoap from extracting the information as strange symbols?

why i can't get the result from lagou this web site by using web scraping

Categories

Resources