I am scraping data from trip.com, a hotel listing website. After entering the details and clicking the search button, the search results open in a new tab and are generated dynamically. When I scroll down, more results are downloaded and displayed. As I understand it, to scrape this dynamically generated data I need the headers of the API call that returns the JSON. The issue is that this site generates one header parameter dynamically, and in what looks like an encrypted format. This is my request URL:
Request URL: https://www.trip.com/restapi/soa2/16709/json/rateplan?testab=ec23b14de9ad450c7b74612efc288bfdd523314036afe19b5fe135f206284aab
and this is my request header:
:authority: www.trip.com
:method: POST
:path: /restapi/soa2/16709/json/rateplan?testab=ec23b14de9ad450c7b74612efc288bfdd523314036afe19b5fe135f206284aab
:scheme: https
accept: application/json
accept-encoding: gzip, deflate, br
accept-language: en-GB,en-US;q=0.9,en;q=0.8
cache-control: no-cache
content-length: 1697
content-type: application/json
cookie: ibulanguage=EN; cookiePricesDisplayed=USD; ibu_online_home_language_match={"isFromTWNotZh":false,"isFromIPRedirect":false,"isFromLastVisited":false,"isRedirect":false,"isShowSuggestion":false,"lastVisited":""}; _abtest_userid=55c19cf3-dcd6-4f4a-bfba-5965c52ac66c; _tp_search_latest_channel_name=hotels; _RF1=45.115.185.74; _RSG=BJ4Q9HdNV80BpEgEyf8ZZ9; _RDG=286d5feba1bdad2eee089fc228174f22ec; _RGUID=021f5e74-4968-44cb-98e3-229f0ea8eccb; ibulocale=en_us; g_state={"i_p":1600591022929,"i_l":3}; Union=AllianceID=1078337&SID=2036545&OUID=ctag.hash.d23ecf76442c&SourceID=&AppID=&OpenID=&Expires=1602581159329&createtime=1599989159; IBU_TRANCE_LOG_URL=/hotels/mumbai-hotel-detail-762871/grand-hyatt-mumbai/?checkIn=2020-09-14&checkOut=2020-09-15&cityId=724&adult=2&children=0&ages=&crn=1&travelpurpose=0&curr=USD&showtotalamt=0&hoteluniquekey=H4sIAAAAAAAAAOPaycjFK8Fk8B8GGIWYOBilFjNyfJl7U12Iy9DE0sTczNzQwMhgCrNFs44jAwgcaHDwBDMKWh0CeCYxSnKCeef3OAiC6AbVnQ5OrBxr_SRYZjB-P663gpFxIyNEY5LDDkamE4x-C5j-PnnDvIuJleM1uwTTISA9SVCC5RQTwyUmhltMDI-YGF4xMXxiYvgFVdHEzNDFzDCJGaJuFjPDImYGIRaQG6UUjMxTjI0NE00tzYzMTSwT00B0qplJYpKxUXKiuaW5ArdG16GPv1iNGKyYpRjdPBiD2Iwd3SyMXKJkuJg9_YIE4xpqS16d2m4vxRwa7KKoqyj_JSdM2iGJNTVPNyIi4x1LAWMXI5MA4yRGTo7m3U8-Mp5gTAYA1R43aDgBAAA(; librauuid=3lSNuDO18464CG5a; intl_ht1=h4%3D724_762871; hotel=762871; hotelhst=1164390341; _bfa=1.1599889636407.b231b.1.1599996200640.1600004365027.18.57; _bfs=1.1; _bfi=p1%3D10320668147%26p2%3D10320668147%26v1%3D57%26v2%3D56; IBU_TRANCE_LOG_P=22266407054
origin: https://www.trip.com
p: 22266407054
pid: 584e7499-4df6-45dd-8242-94cb5dec36c5
pragma: no-cache
referer: https://www.trip.com/hotels/mumbai-hotel-detail-762871/grand-hyatt-mumbai/?checkIn=2020-09-14&checkOut=2020-09-15&cityId=724&adult=2&children=0&ages=&crn=1&travelpurpose=0&curr=USD&showtotalamt=0&hoteluniquekey=H4sIAAAAAAAAAOPaycjFK8Fk8B8GGIWYOBilFjNyfJl7U12Iy9DE0sTczNzQwMhgCrNFs44jAwgcaHDwBDMKWh0CeCYxSnKCeef3OAiC6AbVnQ5OrBxr_SRYZjB-P663gpFxIyNEY5LDDkamE4x-C5j-PnnDvIuJleM1uwTTISA9SVCC5RQTwyUmhltMDI-YGF4xMXxiYvgFVdHEzNDFzDCJGaJuFjPDImYGIRaQG6UUjMxTjI0NE00tzYzMTSwT00B0qplJYpKxUXKiuaW5ArdG16GPv1iNGKyYpRjdPBiD2Iwd3SyMXKJkuJg9_YIE4xpqS16d2m4vxRwa7KKoqyj_JSdM2iGJNTVPNyIi4x1LAWMXI5MA4yRGTo7m3U8-Mp5gTAYA1R43aDgBAAA(
sec-fetch-dest: empty
sec-fetch-mode: cors
sec-fetch-site: same-origin
user-agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.102 Safari/537.36
Here the value of the testab parameter is generated dynamically when I scroll down on the site, but I cannot work out how this testab value is being generated. Is it produced by encrypting the rest of the request header info? FYI, I have all of the request header info except the "path" value. If the value is generated by encryption, how do I proceed with scraping this? Also, I cannot use Selenium or any browser-based scraping here.
The testab value is being generated at random using the following JavaScript in the file https://ak-s.tripcdn.com/modules/ibu/ibu-hotel-online/smart/smart.353046e23a610af9fcf9.js
key: "gencb",
value: function gencb(r) {
    var o = function() {
        for (var e = "qwertyuiopasdfg$hjklzxcvbnmQWERTYUIOPASDFGHJKLZXCVBNM", t = "", n = 0; n < 10; n++)
            t += e.charAt(~~(Math.random() * e.length));
        return t
    }();
    return window[o] = function(e) {
        delete window[o];
        var t = e(),
            n = "?";
        r.realUrl && 0 < r.realUrl.indexOf("?") && (n = "&"),
        r.realUrl += n + "testab=" + encodeURIComponent(t)
    },
    o
}
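To make the mechanism clearer: the 10-character name that gencb builds is purely random and carries no information, as this Python sketch of the same generation shows. The actual testab value is whatever e() returns when the page later invokes the registered window callback, so it cannot be derived from the request headers alone:
import random

# Same character set the site's gencb function draws from
CHARSET = "qwertyuiopasdfg$hjklzxcvbnmQWERTYUIOPASDFGHJKLZXCVBNM"

def random_callback_name(length=10):
    """Mimics the window-callback name generation; the name itself carries no data."""
    return "".join(random.choice(CHARSET) for _ in range(length))

print(random_callback_name())  # e.g. 'qZx$kLmWva'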
It is then written to the server in a POST request to https://www.trip.com/restapi/soa2/16709/json/getHotelScript, but it is encrypted in the hotelUuidKey. Unless you are able to crack the encryption, you had better render the page using JavaScript.
You say you can't use Selenium or any browser-based solution, but have you looked at PyQt?
https://doc.qt.io/qt-5/qtwebengine-overview.html#qt-webengine-core-module
The Qt WebEngine core is based on the Chromium Project. Chromium provides its own network and painting engines and is developed tightly together with its dependent modules.
Note: Qt WebEngine is based on Chromium, but does not contain or use any services or add-ons that might be part of the Chrome browser that is built and delivered by Google.
import sys

from PyQt5.QtCore import QUrl
from PyQt5.QtWidgets import QApplication
from PyQt5.QtWebEngineCore import QWebEngineUrlRequestInterceptor
from PyQt5.QtWebEngineWidgets import QWebEngineView, QWebEnginePage, QWebEngineProfile


class WebEngineUrlRequestInterceptor(QWebEngineUrlRequestInterceptor):

    def interceptRequest(self, info):
        # Fires for every request the page makes; we only care about the rateplan call
        if info.requestUrl().url().startswith('https://www.trip.com/restapi/soa2/16709/json/rateplan?testab='):
            print(info.requestUrl().url())
            # Do stuff
            sys.exit()


class MyWebEnginePage(QWebEnginePage):

    def acceptNavigationRequest(self, url, _type, isMainFrame):
        return QWebEnginePage.acceptNavigationRequest(self, url, _type, isMainFrame)


if __name__ == "__main__":
    app = QApplication(sys.argv)
    browser = QWebEngineView()
    interceptor = WebEngineUrlRequestInterceptor()
    profile = QWebEngineProfile()
    profile.setRequestInterceptor(interceptor)
    page = MyWebEnginePage(profile, browser)
    url = 'https://www.trip.com/hotels/mumbai-hotel-detail-762871/grand-hyatt-mumbai/?checkIn=2020-09-14&checkOut=2020-09-15&cityId=724&adult=2&children=0&ages=&crn=1&travelpurpose=0&curr=USD&showtotalamt=0&hoteluniquekey=H4sIAAAAAAAAAOPaycjFK8Fk8B8GGIWYOBilFjNyfJl7U12Iy9DE0sTczNzQwMhgCrNFs44jAwgcaHDwBDMKWh0CeCYxSnKCeef3OAiC6AbVnQ5OrBxr_SRYZjB-P663gpFxIyNEY5LDDkamE4x-C5j-PnnDvIuJleM1uwTTISA9SVCC5RQTwyUmhltMDI-YGF4xMXxiYvgFVdHEzNDFzDCJGaJuFjPDImYGIRaQG6UUjMxTjI0NE00tzYzMTSwT00B0qplJYpKxUXKiuaW5ArdG16GPv1iNGKyYpRjdPBiD2Iwd3SyMXKJkuJg9_YIE4xpqS16d2m4vxRwa7KKoqyj_JSdM2iGJNTVPNyIi4x1LAWMXI5MA4yRGTo7m3U8-Mp5gTAYA1R43aDgBAAA('
    page.setUrl(QUrl(url))
    browser.setPage(page)
    browser.show()
    sys.exit(app.exec_())
Adapted from https://stackoverflow.com/a/50786759/839338 authored by eyllanesc
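One caveat: on builds against Qt 5.13 or later, setRequestInterceptor is deprecated in favour of a per-profile URL interceptor; if you hit a deprecation warning or an AttributeError, the following variant should work instead (an assumption based on the Qt 5.13 API change, not verified against every PyQt5 release):
profile.setUrlRequestInterceptor(interceptor)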
This outputs the link (and a few warnings), e.g.
https://www.trip.com/restapi/soa2/16709/json/rateplan?testab=15feb5b1067d2e4e2b979fe97830d884c5e3a07*e145f7ΒΌ(5955400Z380ac6a6
Updated in response to comment
You just need to grab the cookies and make a request. Very quick and dirty code is below.
import requests
import sys

from PyQt5.QtCore import QUrl
from PyQt5.QtWidgets import QApplication
from PyQt5.QtWebEngineCore import QWebEngineUrlRequestInterceptor
from PyQt5.QtWebEngineWidgets import QWebEngineView, QWebEnginePage, QWebEngineProfile
from PyQt5.QtNetwork import QNetworkCookie


class WebEngineUrlRequestInterceptor(QWebEngineUrlRequestInterceptor):

    def __init__(self, on_network_call):
        super().__init__()
        self.on_network_call = on_network_call

    def interceptRequest(self, info):
        if info.requestUrl().url().startswith('https://www.trip.com/restapi/soa2/16709/json/rateplan?testab='):
            self.on_network_call(info)
            sys.exit()


class MyWebEnginePage(QWebEnginePage):

    def acceptNavigationRequest(self, url, _type, isMainFrame):
        return QWebEnginePage.acceptNavigationRequest(self, url, _type, isMainFrame)


def on_network_call(info):
    # Replay the intercepted rateplan call with requests, using the cookies
    # harvested from the browser profile
    print(info.requestUrl().url())
    headers = {
        'authority': 'www.trip.com',
        'pragma': 'no-cache',
        'cache-control': 'no-cache',
        'accept': 'application/json',
        'dnt': '1',
        'p': '99783168614',
        'pid': '256f8038-1c06-4173-99b5-880dc120042f',
        'user-agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.102 Safari/537.36',
        'content-type': 'application/json',
        'origin': 'https://www.trip.com',
        'sec-fetch-site': 'same-origin',
        'sec-fetch-mode': 'cors',
        'sec-fetch-dest': 'empty',
        'referer': 'https://www.trip.com/hotels/mumbai-hotel-detail-762871/grand-hyatt-mumbai/?checkIn=2020-09-14&checkOut=2020-09-15&cityId=724&adult=2&children=0&ages=&crn=1&travelpurpose=0&curr=USD&showtotalamt=0&hoteluniquekey=H4sIAAAAAAAAAOPaycjFK8Fk8B8GGIWYOBilFjNyfJl7U12Iy9DE0sTczNzQwMhgCrNFs44jAwgcaHDwBDMKWh0CeCYxSnKCeef3OAiC6AbVnQ5OrBxr_SRYZjB-P663gpFxIyNEY5LDDkamE4x-C5j-PnnDvIuJleM1uwTTISA9SVCC5RQTwyUmhltMDI-YGF4xMXxiYvgFVdHEzNDFzDCJGaJuFjPDImYGIRaQG6UUjMxTjI0NE00tzYzMTSwT00B0qplJYpKxUXKiuaW5ArdG16GPv1iNGKyYpRjdPBiD2Iwd3SyMXKJkuJg9_YIE4xpqS16d2m4vxRwa7KKoqyj_JSdM2iGJNTVPNyIi4x1LAWMXI5MA4yRGTo7m3U8-Mp5gTAYA1R43aDgBAAA(',
        'accept-language': 'en-GB,en-US;q=0.9,en;q=0.8',
    }
    data = '{"checkIn":"2020-09-15","checkOut":"2020-09-16","priceType":"0","adult":2,"popularFacilityType":"","hotelUniqueKey":"H4sIAAAAAAAAAOPaycjFK8Fk8B8GGIWYOBilFjNyfJl7U12Iy9DE0sTczNzQwMhgCrNFs44jAwgcaHDwBDMKWh0CeCYxSnKCeef3OAiC6AbVnQ5OrBxr_SRYZjB-P663gpFxIyNEY5LDDkamE4x-C5j-PnnDvIuJleM1uwTTISA9SVCC5RQTwyUmhltMDI-YGF4xMXxiYvgFVdHEzNDFzDCJGaJuFjPDImYGIRaQG6UUjMxTjI0NE00tzYzMTSwT00B0qplJYpKxUXKiuaW5ArdG16GPv1iNGKyYpRjdPBiD2Iwd3SyMXKJkuJg9_YIE4xpqS16d2m4vxRwa7KKoqyj_JSdM2iGJNTVPNyIi4x1LAWMXI5MA4yRGTo7m3U8-Mp5gTAYA1R43aDgBAAA(","child":0,"roomNum":1,"masterHotelId":762871,"age":"","cityId":"724","hotel":"762871","versionControl":[{"key":"RoomCardVersionB","value":"T"}],"signInRoomKey":"","signInType":0,"filterCondition":null,"unAvailableRoomInfo":null,"minPriceRoomKey":"","Head":{"Locale":"en-XX","Currency":"USD","AID":"","SID":"","ClientID":"1600039009299.2v21ry","OUID":"","CAID":"","CSID":"","COUID":"","TimeZone":"1","PageID":"10320668147","HotelExtension":{"WebpSupport":true,"Qid":"","hasAidInUrl":false,"group":"TRIP","PID":"256f8038-1c06-4173-99b5-880dc120042f","hotelUuidKey":"S96K39i7Te47IA7idYlfYp6E3YLpemawnOWOYhgjs6wZFv0lEPYtNjoSwHSybpjsY1pKL4KazvlLjFYoTvU1YQByTZjUBvc9ed7YG9jHZy5Y1fekTv0NEghwGqWbsenZi8BwMYtY5OInLeo9YmDvFSeDrNbeUZjnkwDfY7bwzSEkY1dRSYX0INbWBYaqYonikdikSiXNj5Y5bjSQi4gYBkwPoJoGRcaYT7woY0ZR7fwa7W6XW4hR7BRqpJT4JMfy9SEcbRgaE4ZEaY4FyfQK11xomETtvc1KQtY3aWGBr90yBXET9vSOvhkyg1E9DJGYUaRkNwG3W9fW6QWf7iDOv5DWqbWFHvfSYHdvdtvOYaXjOcwLkvthjUYAqR9ZwqdjAHW53eZPROqWzSJ3PWPYPnRgqwmFW43jDSePDRBPWtcY3niTYHpRqLwUgWz6WPURD1RUZJ8bJ73ytTEFlWGmW6G","hotelUuid":"dhX4uhn0MdpHusaD"},"Frontend":{"vid":"1600039009299.2v21ry","sessionID":2,"pvid":6},"P":"99783168614","Device":"PC","Version":"0"}}'
    r = requests.post(info.requestUrl().url(), cookies=to_cookie_dict(), data=data, headers=headers)
    print(r.json())


def on_cookie_added(cookie):
    for c in cookies:
        if c.hasSameIdentifier(cookie):
            return
    cookies.append(QNetworkCookie(cookie))


def to_cookie_dict():
    cookie_dict = {}
    for c in cookies:
        cookie_dict[bytearray(c.name()).decode()] = bytearray(c.value()).decode()
    print(cookie_dict)
    return cookie_dict


if __name__ == "__main__":
    app = QApplication(sys.argv)
    browser = QWebEngineView()
    interceptor = WebEngineUrlRequestInterceptor(on_network_call)
    profile = QWebEngineProfile()
    cookie_store = profile.cookieStore()
    cookie_store.cookieAdded.connect(on_cookie_added)
    cookies = []
    profile.setRequestInterceptor(interceptor)
    page = MyWebEnginePage(profile, browser)
    url = 'https://www.trip.com/hotels/mumbai-hotel-detail-762871/grand-hyatt-mumbai/?checkIn=2020-09-14&checkOut=2020-09-15&cityId=724&adult=2&children=0&ages=&crn=1&travelpurpose=0&curr=USD&showtotalamt=0&hoteluniquekey=H4sIAAAAAAAAAOPaycjFK8Fk8B8GGIWYOBilFjNyfJl7U12Iy9DE0sTczNzQwMhgCrNFs44jAwgcaHDwBDMKWh0CeCYxSnKCeef3OAiC6AbVnQ5OrBxr_SRYZjB-P663gpFxIyNEY5LDDkamE4x-C5j-PnnDvIuJleM1uwTTISA9SVCC5RQTwyUmhltMDI-YGF4xMXxiYvgFVdHEzNDFzDCJGaJuFjPDImYGIRaQG6UUjMxTjI0NE00tzYzMTSwT00B0qplJYpKxUXKiuaW5ArdG16GPv1iNGKyYpRjdPBiD2Iwd3SyMXKJkuJg9_YIE4xpqS16d2m4vxRwa7KKoqyj_JSdM2iGJNTVPNyIi4x1LAWMXI5MA4yRGTo7m3U8-Mp5gTAYA1R43aDgBAAA('
    page.setUrl(QUrl(url))
    browser.setPage(page)
    browser.show()
    sys.exit(app.exec_())
Thanks to "How to capture the response of a request intercepted by QWebEngineUrlRequestInterceptor?" authored by eriel marimon and https://stackoverflow.com/a/48154459/839338 authored by eyllanesc.
Related
I have an application running on AWS that makes a request to a page to pull meta tags using requests. I'm finding that the page allows curl requests, but does not allow requests from the requests library.
Works:
curl https://www.seattletimes.com/nation-world/mount-st-helens-which-erupted-41-years-ago-starts-reopening-after-covid-closures/
Hangs Forever:
import requests
requests.get('https://www.seattletimes.com/nation-world/mount-st-helens-which-erupted-41-years-ago-starts-reopening-after-covid-closures/')
What is the difference between curl and requests here? Should I just spawn a curl process to make my requests?
Either of the agents below does indeed work. One can also use the user_agent module (available on PyPI) to generate random and valid web user agents.
import requests

agent = (
    "Mozilla/5.0 (X11; Linux x86_64) "
    "AppleWebKit/537.36 (KHTML, like Gecko) "
    "Chrome/85.0.4183.102 Safari/537.36"
)
# or one can use
# agent = "curl/7.61.1"

url = ("https://www.seattletimes.com/nation-world/"
       "mount-st-helens-which-erupted-41-years-ago-starts-reopening-after-covid-closures/")
r = requests.get(url, headers={'user-agent': agent})
Or, using the user_agent module:
import requests
from user_agent import generate_user_agent

agent = generate_user_agent()
url = ("https://www.seattletimes.com/nation-world/"
       "mount-st-helens-which-erupted-41-years-ago-starts-reopening-after-covid-closures/")
r = requests.get(url, headers={'user-agent': agent})
To further explain: requests sets a default user agent, and The Seattle Times is blocking this user agent. However, with python-requests one can easily change the header parameters in the request, as shown above.
To illustrate the default parameters:
r = requests.get('https://google.com/')
print(r.request.headers)
>>> {'User-Agent': 'python-requests/2.25.1', 'Accept-Encoding': 'gzip, deflate', 'Accept': '*/*', 'Connection': 'keep-alive'}
vs. the updated header parameter
agent = "curl/7.61.1"
r = requests.get('https://google.com/', headers={'user-agent': agent})
print(r.request.headers)
>>>{'user-agent': 'curl/7.61.1', 'Accept-Encoding': 'gzip, deflate', 'Accept': '*/*', 'Connection': 'keep-alive'}
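If you are making several requests, a requests.Session lets you set the user agent once and reuse it across calls (a minimal sketch of the same fix):
import requests

session = requests.Session()
# Every request made through this session now carries the custom UA
session.headers.update({'user-agent': 'curl/7.61.1'})

url = ("https://www.seattletimes.com/nation-world/"
       "mount-st-helens-which-erupted-41-years-ago-starts-reopening-after-covid-closures/")
r = session.get(url)
print(r.status_code)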
I'm trying to scrape product URLs from the Amazon webshop by going through every page.
import requests
from bs4 import BeautifulSoup

headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:66.0) Gecko/20100101 Firefox/66.0", "Accept-Encoding": "gzip, deflate", "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8", "DNT": "1", "Connection": "close", "Upgrade-Insecure-Requests": "1"}

products = set()

for i in range(1, 21):
    url = 'https://www.amazon.fr/s?k=phone%2Bcase&page=' + str(i)
    response = requests.get(url, headers=headers)
    soup = BeautifulSoup(response.content, 'html.parser')
    print(soup)  # prints the HTML content saying Error on Amazon's side
    links = soup.select('a.a-link-normal.a-text-normal')
    for tag in links:
        url_product = 'https://www.amazon.fr' + tag.attrs['href']
        products.add(url_product)
Instead of getting the content of the page, I get a "Sorry, something went wrong on our end" HTML Error Page. What is the reason behind this? How can I successfully bypass this error and scrape the products?
Regarding your question: be informed that Amazon does not allow automated access to its data. You can double-check this by inspecting the response via r.status_code, which will lead you to this error message:
To discuss automated access to Amazon data please contact api-services-support@amazon.com
Therefore you can use the Amazon API, or you can route the GET request through proxies via the proxies argument, as sketched below.
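For the proxy route, note that requests expects a dict mapping scheme to proxy URL rather than a bare list (a minimal sketch; the proxy addresses are placeholders, not real endpoints):
import requests

# Placeholder proxy endpoints -- substitute your own
proxies = {
    'http': 'http://10.10.1.10:3128',
    'https': 'http://10.10.1.10:1080',
}

r = requests.get('https://www.amazon.fr/s?k=phone+case&page=1', proxies=proxies)
print(r.status_code)  # confirm whether Amazon accepted the request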
Here's the correct way to pass headers to Amazon without getting blocked, and it works.
import requests
from bs4 import BeautifulSoup

headers = {
    'Host': 'www.amazon.fr',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:70.0) Gecko/20100101 Firefox/70.0',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.5',
    'Accept-Encoding': 'gzip, deflate, br',
    'Connection': 'keep-alive',
    'Upgrade-Insecure-Requests': '1',
    'TE': 'Trailers'
}

for page in range(1, 21):
    # note the f-string prefix, so {page} is actually substituted into the URL
    r = requests.get(
        f'https://www.amazon.fr/s?k=phone+case&page={page}&ref=sr_pg_{page}', headers=headers)
    soup = BeautifulSoup(r.text, 'html.parser')
    for item in soup.findAll('a', attrs={'class': 'a-link-normal a-text-normal'}):
        print(f"https://www.amazon.fr{item.get('href')}")
Trying to extract links and handle the lazy loading, but I am not even getting the links.
Code:
from bs4 import BeautifulSoup
import requests

r = requests.get('http://www.indiabusinessguide.in/business-categories/agriculture/agricultural-equipment.html')
soup = BeautifulSoup(r.text, 'lxml')
links = soup.find_all('a', class_='link_orange')
for link in links:
    print(link['href'])
Please help me handle this loading and extract the links.
Try using the lxml library. The response is received by sending a POST request to the site's AJAX URL using requests.
import requests
from lxml import html

contact_list = []

def scrape(url, pages):
    for page in range(1, pages):
        headers = {
            "Content-Type": "application/x-www-form-urlencoded; charset=UTF-8",
            "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.109 Safari/537.36",
            "X-Requested-With": "XMLHttpRequest",
            "Cookie": "PHPSESSID=2q0tk3fi1kid0gbdfboh94ed56",
        }
        data = {
            "page": f"{page}"
        }
        r = requests.post(url, headers=headers, data=data)
        tree = html.fromstring(r.content)
        links = tree.xpath('//a[@class="link_orange"]')
        for link in links:
            # print(link.get('href'))
            contact_list.append(link.get('href'))

url = "http://www.indiabusinessguide.in/ajax_advertiselist.php"
scrape(url, 10)
print(contact_list)
print(len(contact_list))
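One caveat: the hard-coded PHPSESSID cookie is tied to a single browsing session and will expire. A fresh one can presumably be picked up by visiting the listing page first with a requests.Session (a sketch under that assumption):
import requests

session = requests.Session()
# Visiting the listing page should set a fresh PHPSESSID on the session
session.get('http://www.indiabusinessguide.in/business-categories/agriculture/agricultural-equipment.html')
print(session.cookies.get('PHPSESSID'))  # reuse this session for the POST requests above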
I am new to Python and am having the same issue that was once raised on the requests GitHub. I am trying to authenticate to a website that redirects you to a security question after the initial login. Both the initial login and the subsequent page use the same "action URL", and on the second POST request I receive a 404. Here is my code; any help would be greatly appreciated. After asking on GitHub, I was told this was a question to ask here, as the problem wasn't on their end (although they had an issue on their GitHub about this):
from bs4 import BeautifulSoup as bs
import requests
import time

sources = ["https://www.dandh.com/v4/view?pageReq=dhMainNS"]
req = requests.Session()

def login():
    authentication_url = "https://www.dandh.com/v4/dh"
    header = {"user-agent": "Mozilla/5.0 (Windows NT 10.0; rv:60.0) Gecko/20100101 Firefox/60.0"}
    payload = {"Login": "12345",
               "PW": "12345",
               "Request": "Login"}
    payload2 = {"securityAnswer": "12345",
                "Request": "postForm"}
    req.post(authentication_url, data=payload, headers=header)
    time.sleep(3)
    req.post(authentication_url, data=payload2, headers=header)
    time.sleep(3)

def print_object(sources):
    for url in sources:
        soup_object = bs(req.get(url).text, "html.parser")
        print(soup_object.get_text())

def main():
    login()
    print_object(sources)

main()
Part 1
After looking around the website, it turns out a large part of the problem lies in payload2: you just have to add another item to it, namely "formName": "loginChallengeValidation", so overall payload2 should look something like this:
payload2 = {"formName": "loginChallengeValidation",
            "securityAnswer": your_security_answer,
            "Request": "postForm"}
This will prevent you from getting the status code 404. Hope this helps.
Part 2
Even though this is the issue in your question, I doubt that this is all you really want (as the code in Part 1 will redirect you to another verification form). In order to get access to the website itself, you have to add the following lines:
header2 = {"user-agent": "Mozilla/5.0 (Windows NT 10.0; rv:60.0) Gecko/20100101 Firefox/60.0", "Referer":"https://www.dandh.com/v4/view?pageReq=LoginChallengeValidation"}
and
req.post("https://www.dandh.com/v4/view?pageReq=LoginChallengeValidation", headers=header)
So your final code should look like this:
from bs4 import BeautifulSoup as bs
import requests
import time

sources = ["https://www.dandh.com/v4/view?pageReq=dhMainNS"]
req = requests.Session()

def login():
    authentication_url = "https://www.dandh.com/v4/dh"
    header = {"user-agent": "Mozilla/5.0 (Windows NT 10.0; rv:60.0) Gecko/20100101 Firefox/60.0"}
    header2 = {"user-agent": "Mozilla/5.0 (Windows NT 10.0; rv:60.0) Gecko/20100101 Firefox/60.0", "Referer": "https://www.dandh.com/v4/view?pageReq=LoginChallengeValidation"}
    payload = {"Login": your_username,
               "PW": your_password,
               "Request": "Login"}
    payload2 = {"formName": "loginChallengeValidation",
                "securityAnswer": your_security_answer,
                "Request": "postForm", "btContinue": ""}
    req.post(authentication_url, data=payload, headers=header)
    req.post("https://www.dandh.com/v4/view?pageReq=LoginChallengeValidation", headers=header)
    time.sleep(3)
    req.post(authentication_url, data=payload2, headers=header2)
    time.sleep(3)

def print_object(sources):
    for url in sources:
        soup_object = bs(req.get(url).text, "html.parser")
        print(soup_object.get_text())

def main():
    login()
    print_object(sources)

main()
(PS: you should replace your_username, your_password and your_security_answer with your credentials)
Also, I would like to note that I don't think that time.sleep(3) is useful in the code.
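Instead of sleeping, each step's response can be checked directly. A minimal sketch, reusing the names from the code above, and assuming a failed step shows up in the status code (which I have not verified against dandh.com):
resp = req.post(authentication_url, data=payload, headers=header)
if resp.status_code != 200:
    # Hypothetical failure check -- the site may signal errors differently
    raise RuntimeError("Login step failed with HTTP %s" % resp.status_code)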
Really hope this helps.
As you know, Instagram announced that they have changed their endpoint APIs this month.
It looks like, in the wake of Cambridge Analytica, Instagram has changed its endpoint formats and requires a logged-in user session for all requests.
I'm not sure which endpoints need updating, but I was specifically using the media/comments endpoints, which are now as follows:
Media OLD:
https://www.instagram.com/graphql/query/?query_id=17888483320059182&id={0}&first=100&after={1}
Media NEW:
https://www.instagram.com/graphql/query/?query_hash=42323d64886122307be10013ad2dcc44&variables=%7B%22id%22%3A%2221575514%22%2C%22first%22%3A12%2C%22after%22%3A%22AQAHXuz1DPmI3FFLOzy5iKEhHOLKw3lt_ozVR40TphSdns0Vp5j_ZEU6Qj0CW6IqNtVGO5pmLCQoX0Y8RVS9aRTT2lWPp6vf8vFqjo1QfxRYmA%22%7D
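For clarity, the variables blob in the new format is just URL-encoded JSON, so the new-style URL can be assembled like this (a sketch using the values visible in the example above):
import urllib.parse

query_hash = '42323d64886122307be10013ad2dcc44'
variables = '{"id":"21575514","first":12,"after":"AQAHXuz1DPmI3FFLOzy5iKEhHOLKw3lt_ozVR40TphSdns0Vp5j_ZEU6Qj0CW6IqNtVGO5pmLCQoX0Y8RVS9aRTT2lWPp6vf8vFqjo1QfxRYmA"}'
url = ('https://www.instagram.com/graphql/query/?query_hash=%s&variables=%s'
       % (query_hash, urllib.parse.quote(variables)))
print(url)  # reproduces the "Media NEW" URL above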
The script that I used to work around this problem is as follows:
#!/usr/bin/env python3

import requests
import urllib.parse
import hashlib
import json

#CHROME_UA = 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36'
CHROME_UA = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36'

def getSession_old(rhx_gis, csrf_token, variables):
    """ Get session preconfigured with required headers & cookies. """
    # Hash input: "rhx_gis:csrf_token:user_agent:variables"
    print(variables)
    values = "%s:%s:%s:%s" % (
        rhx_gis,
        csrf_token,
        CHROME_UA,
        variables)
    x_instagram_gis = hashlib.md5(values.encode()).hexdigest()
    session = requests.Session()
    session.headers = {
        'user-agent': CHROME_UA,
        'x-instagram-gis': x_instagram_gis
    }
    print(x_instagram_gis)
    session.cookies.set('ig_pr', '2')
    session.cookies.set('csrftoken', csrf_token)
    return session

def getSession(rhx_gis, variables):
    """ Get session preconfigured with required headers & cookies. """
    # Hash input: "rhx_gis:variables"
    values = "%s:%s" % (
        rhx_gis,
        variables)
    x_instagram_gis = hashlib.md5(values.encode()).hexdigest()
    session = requests.Session()
    session.headers = {
        'x-instagram-gis': x_instagram_gis
    }
    return session

if __name__ == '__main__':
    session = requests.Session()
    session.headers = {'user-agent': CHROME_UA}
    response = session.get("https://www.instagram.com/selenagomez")
    data = json.loads(response.text.split("window._sharedData = ")[1].split(";</script>")[0])
    csrf = data['config']['csrf_token']
    rhx_gis = data['rhx_gis']
    variables = '{"id":"460563723","first":10,"after":"AQBf8puhlt8nU2JzmYdMMTuH0FbMgUM1fnIOZIH7n94DM4VLWkVILUAKVB-5dqvxQEI-Wd0ttlEDzimaaqwC98jccQaDQT4tSF56c_NlWi_shg"}'
    session = getSession(rhx_gis, variables)
    query_hash = '42323d64886122307be10013ad2dcc44'
    encoded_vars = urllib.parse.quote(variables, safe='"')
    url = 'https://www.instagram.com/graphql/query/?query_hash=%s&variables=%s' % (query_hash, encoded_vars)
    print(url)
    print(session.get(url).text)
I am sure this script was working well until 11 days ago, but it is not working now.
Does anyone know how to get user posts without authenticating?