Problems with session in requests on python - python-3.x

I'm trying to log in to Plus500 from Python. Everything looks fine: status code 200 and a response from the server. But the server will not accept my request.
I did every single step the web browser does, with the same headers as the browser. Always the same result.
url = "https://trade.plus500.com/AppInitiatedImm/WebTrader2/?webvisitid=" + self.tokesession+ "&page=login&isInTradeContext=false"
header = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:66.0) Gecko/20100101 Firefox/66.0",
"Host": "trade.plus500.com",
"Connection": "keep-alive",
"Referer": "https://app.plus500.com/trade",
"Cookie":"webvisitid=" + self.cookiesession + ";"+\
"IP="+self.hasip}
param = "ClientType=WebTrader2&machineID=33b5db48501c9b0e5552ea135722b2c6&PrimaryMachineId=33b5db48501c9b0e5552ea135722b2c6&hl=en&cl=en-GB&AppVersion=87858&refurl=https%3A%2F%2Fwww.plus500.co.uk%2F&SessionID=0&SubSessionID=0"
response = self.session.request(method="POST",
url=url,
params=param,
headers=header,stream=True)
The code above is the initialization of the web app; after that I do the login. But it always comes back with the JSON reply AppSessionRquired. I think I have already tried everything I can think of. Does anyone have an idea?
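One thing that may be worth trying (a sketch, not a confirmed fix): let requests.Session manage the cookies itself by visiting the app entry page through the same session before sending the initialization POST, instead of hand-building the Cookie header. The URLs and query parameters below are taken from the question; the cookie names and the overall flow are assumptions:

import requests

session = requests.Session()
session.headers.update({
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:66.0) Gecko/20100101 Firefox/66.0",
    "Referer": "https://app.plus500.com/trade",
})

# Assumed flow: hit the app entry point first so the server can set its own
# session cookies (webvisitid, IP, ...) on this Session object.
session.get("https://app.plus500.com/trade")

init_url = "https://trade.plus500.com/AppInitiatedImm/WebTrader2/"
init_params = {
    "webvisitid": session.cookies.get("webvisitid", ""),  # assumption: cookie name matches the query parameter
    "page": "login",
    "isInTradeContext": "false",
}
response = session.post(init_url, params=init_params)
print(response.status_code, response.text[:200])

Letting the Session carry the cookies also means any values the server rotates between requests are picked up automatically.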

Related

Python requests code worked yesterday but now returns TooManyRedirects: Exceeded 30 redirects

I am trying to get data from a site with the requests library using this simple code (running on Google Colab):
import requests, json

def GetAllStocks():
    url = 'https://iboard.ssi.com.vn/dchart/api/1.1/defaultAllStocks'
    res = requests.get(url)
    return json.loads(res.text)
This worked well until this morning, and I cannot figure out why it is now returning a "TooManyRedirects: Exceeded 30 redirects" error.
I can still get the data just by browsing the URL directly in Google Chrome in Incognito mode, so I do not think this is because of cookies. I tried passing the whole set of headers, but it still does not work. I tried passing allow_redirects=False, and the returned status_code is 302.
I am not sure what else I could try, as this is so strange to me.
Any guidance is much appreciated. Thank you very much!
You need to send a User-Agent header to mimic regular browser behaviour.
import requests, json, random

def GetAllStocks():
    user_agents = [
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.5112.79 Safari/537.36",
        "Mozilla/5.0 (Windows NT 10.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.0.0 Safari/537.36",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.0.0 Safari/537.36",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:101.0) Gecko/20100101 Firefox/101.0",
        "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:77.0) Gecko/20190101 Firefox/77.0",
        "Mozilla/5.0 (Windows NT 10.0; WOW64; rv:77.0) Gecko/20100101 Firefox/77.0",
    ]
    headers = {
        "User-Agent": random.choice(user_agents),
        "Accept": "application/json",
    }
    url = "https://iboard.ssi.com.vn/dchart/api/1.1/defaultAllStocks"
    res = requests.get(url, headers=headers)
    return json.loads(res.text)

data = GetAllStocks()
print(data)
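As an aside, not part of the answer above: if you want to see what the server was doing before the fix, disabling redirects and printing the Location header shows where each 302 points; a redirect that loops back to essentially the same URL usually means the request is being rejected for not looking browser-like rather than the resource having moved.

import requests

url = "https://iboard.ssi.com.vn/dchart/api/1.1/defaultAllStocks"
# With redirects disabled, requests returns the 302 itself instead of following it,
# so the Location header the server sends back can be inspected directly.
res = requests.get(url, allow_redirects=False)
print(res.status_code, res.headers.get("Location"))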

Understanding Bearer Authorization for web scraping using python 3.8 and requests

So I am looking to scrape the following site:
https://hyland.csod.com/ux/ats/careersite/4/home?c=hyland
What I am running into with the Python requests library is that the request requires an Authorization header carrying a token of some kind. While I can get this to work if I manually go to the page, copy the token, and then run my program, I am wondering how I could bypass this issue (after all, what is the point of running a scraper if I still have to visit the actual site manually and retrieve the authorization token?).
I am new to authorization/bearer headers and am hoping someone can clarify how the browser generates a token to retrieve this information and how I can simulate that. Here is my code:
import requests
import json
import datetime

today = datetime.datetime.today()

url = "https://hyland.csod.com/services/x/career-site/v1/search"
# actual site: https://hyland.csod.com/ux/ats/careersite/4/home?c=hyland
headers = {
    'authority': 'hyland.csod.com',
    'origin': 'https://hyland.csod.com',
    'authorization': 'Bearer eyJhbGciOiJIUzUxMiIsInR5cCI6IkpXVCIsImNsaWQiOiI0bDhnbnFhbGk3NjgifQ.eyJzdWIiOi0xMDMsImF1ZCI6IjRxNTFzeG5oY25yazRhNXB1eXZ1eGh6eCIsImNvcnAiOiJoeWxhbmQiLCJjdWlkIjoxLCJ0emlkIjoxNCwibmJkIjoiMjAxOTEyMzEyMTE0MTU5MzQiLCJleHAiOiIyMDE5MTIzMTIyMTUxNTkzNCIsImlhdCI6IjIwMTkxMjMxMjExNDE1OTM0In0.PlNdWXtb1uNoMuGIhI093ZbheRN_DwENTlkNoVr0j7Zah6JHd5cukudVFnZEiQmgBZ_nlDU4C-9JO_2We380Vg',
    'content-type': 'application/json',
    'accept': 'application/json; q=1.0, text/*; q=0.8, */*; q=0.1',
    'x-requested-with': 'XMLHttpRequest',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.88 Safari/537.36',
    'csod-accept-language': 'en-US',
    'referer': 'https://hyland.csod.com/ux/ats/careersite/4/home?c=hyland',
    'accept-encoding': 'gzip, deflate, br',
    'cookie': 'CYBERU_lastculture=en-US; ASP.NET_SessionId=4q51sxnhcnrk4a5puyvuxhzx; cscx=hyland^|-103^|1^|14^|KumB4VhzYXML22MnMxjtTB9SKgHiWW0tFg0HbHnOek4=; c-s=expires=1577909201~access=/clientimg/hyland/*^!/content/hyland/*~md5=78cd5252d2efff6eb77d2e6bf0ce3127',
}
data = ['{"careerSiteId":4,"pageNumber":1,"pageSize":25,"cultureId":1,"searchText":"","cultureName":"en-US","states":["oh"],"countryCodes":[],"cities":[],"placeID":"","radius":null,"postingsWithinDays":null,"customFieldCheckboxKeys":[],"customFieldDropdowns":[],"customFieldRadios":[]}',
        '{"careerSiteId":4,"pageNumber":2,"pageSize":25,"cultureId":1,"searchText":"","cultureName":"en-US","states":["oh"],"countryCodes":[],"cities":[],"placeID":"","radius":null,"postingsWithinDays":null,"customFieldCheckboxKeys":[],"customFieldDropdowns":[],"customFieldRadios":[]}']

def hyland(url, data):
    # for openings in data:
    dirty = requests.post(url, headers=headers, data=data).text
    if 'Unauthorized' in dirty:
        print(dirty)
        print("There was an error connecting. Check Info")
    # print(dirty)
    clean = json.loads(dirty)
    cleaner = json.dumps(clean, indent=4)
    print("Openings at Hyland Software in Westlake as of {}".format(today.strftime('%m-%d-%Y')))
    for i in range(0, 60):
        try:
            print(clean["data"]["requisitions"][i]["displayJobTitle"])
            print("")
            print("")
        except:
            print("{} Openings at Hyland".format(i))
            break

for datum in data:
    hyland(url, data=datum)
So basically my code sends a POST request to the URL above, along with the headers and the data needed to retrieve what I want. This scraper works for a short period of time, but if I leave and come back after a few hours it no longer works, apparently because of authorization (at least that is what I have concluded).
Any help or clarification on how all this works would be greatly appreciated.
Your code has a few problems:
As you noted, you have to get the bearer token first.
You have to send your requests through requests.session() (this webpage seems to pay attention to the cookies you send).
Optional: your request carried a lot of unnecessary headers that can be removed.
All in all, here below is the working code:
import requests
import json
import datetime

today = datetime.datetime.today()
session = requests.session()

url = "https://hyland.csod.com:443/ux/ats/careersite/4/home?c=hyland"
headers = {"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:71.0) Gecko/20100101 Firefox/71.0",
           "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
           "Accept-Language": "en-US,en;q=0.5",
           "Accept-Encoding": "gzip, deflate",
           "DNT": "1",
           "Connection": "close",
           "Upgrade-Insecure-Requests": "1"}
raw = session.get(url, headers=headers).text
token = raw[raw.index("token") + 8:]
token = token[:token.index("\"")]
bearer_token = f"Bearer {token}"

url = "https://hyland.csod.com/services/x/career-site/v1/search"
# actual site: https://hyland.csod.com/ux/ats/careersite/4/home?c=hyland
headers = {"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:71.0) Gecko/20100101 Firefox/71.0",
           "Authorization": bearer_token}
data = ['{"careerSiteId":4,"pageNumber":1,"pageSize":25,"cultureId":1,"searchText":"","cultureName":"en-US","states":["oh"],"countryCodes":[],"cities":[],"placeID":"","radius":null,"postingsWithinDays":null,"customFieldCheckboxKeys":[],"customFieldDropdowns":[],"customFieldRadios":[]}',
        '{"careerSiteId":4,"pageNumber":2,"pageSize":25,"cultureId":1,"searchText":"","cultureName":"en-US","states":["oh"],"countryCodes":[],"cities":[],"placeID":"","radius":null,"postingsWithinDays":null,"customFieldCheckboxKeys":[],"customFieldDropdowns":[],"customFieldRadios":[]}']

def hyland(url, data, session=session):
    # for openings in data:
    dirty = session.post(url, headers=headers, data=data).text
    if 'Unauthorized' in dirty:
        print(dirty)
        print("There was an error connecting. Check Info")
    # print(dirty)
    clean = json.loads(dirty)
    cleaner = json.dumps(clean, indent=4)
    print("Openings at Hyland Software in Westlake as of {}".format(today.strftime('%m-%d-%Y')))
    for i in range(0, 60):
        try:
            print(clean["data"]["requisitions"][i]["displayJobTitle"])
            print("")
            print("")
        except:
            print("{} Openings at Hyland".format(i))
            break

for datum in data:
    hyland(url, data=datum, session=session)
hope this helps
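One caveat on the answer above: the string-slicing extraction (raw.index("token") + 8) is brittle, because it assumes the token value always starts exactly eight characters after the first occurrence of "token" in the page. A slightly more defensive variation, sketched under the assumption that the page embeds the token as "token":"<value>" somewhere in its HTML (the exact key may differ, so inspect the page source first):

import re
import requests

session = requests.session()
page = session.get("https://hyland.csod.com/ux/ats/careersite/4/home?c=hyland",
                   headers={"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:71.0) Gecko/20100101 Firefox/71.0"}).text

# Assumption: the page source contains something like "token":"eyJ...";
# adjust the pattern after inspecting the actual HTML.
match = re.search(r'"token"\s*:\s*"([^"]+)"', page)
if match:
    bearer_token = f"Bearer {match.group(1)}"
else:
    raise RuntimeError("Could not find a token in the page source")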

Expand short urls in python using requests library

I have a large number of short URLs and I want to expand them. I found the following code somewhere online (I have lost the source):
short_url = "t.co/NHBbLlfCaa"
r = requests.get(short_url)
if r.status_code == 200:
print("Actual url:%s" % r.url)
It works perfectly, but I get this error when I hit the same server many times:
urllib3.exceptions.MaxRetryError: HTTPConnectionPool(host='www.fatlossadvice.pw', port=80): Max retries exceeded with url: /TIPS/KILLED-THAT-TREADMILL-WORKOUT-WORD-TO-TIMMY-GACQUIN.ASP (Caused by NewConnectionError(': Failed to establish a new connection: [Errno 11004] getaddrinfo failed',))
I tried many solutions, like the ones in Max retries exceeded with URL in requests, but nothing worked.
I was thinking about another approach: pass a User-Agent in the request and change it randomly each time (using a large pool of user agents):
user_agent_list = [
    'Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:25.0) Gecko/20100101 Firefox/25.0',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:25.0) Gecko/20100101 Firefox/25.0',
    'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:24.0) Gecko/20100101 Firefox/24.0',
    'Mozilla/5.0 (Windows NT 6.2; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/32.0.1667.0 Safari/537.36',
]
r = requests.get(short_url, headers={'User-Agent': user_agent_list[np.random.randint(0, len(user_agent_list))]})
if r.status_code == 200:
    print("Actual url:%s" % r.url)
My problem is that r.url always returns the short URL instead of the long (expanded) one.
What am I missing?
You can prevent the error by adding allow_redirects=False to the requests.get() call, so requests does not follow the redirect to a page that no longer exists (which is what raises the error). You then have to examine the Location header sent by the server yourself (in the URLs below, replace XXXX with https and remove the spaces):
import requests

short_url = ["XXXX t.co /namDL4YHYu",
             'XXXX t.co /MjvmV',
             'XXXX t.co /JSjtxfaxRJ',
             'XXXX t.co /xxGSANSE8K',
             'XXXX t.co /ZRhf5gWNQg']

for url in short_url:
    r = requests.get(url, allow_redirects=False)
    try:
        print(url, r.headers['location'])
    except KeyError:
        print(url, "Page doesn't exist!")
Prints:
XXXX t.co/namDL4YHYu http://gottimechillinaround.tumblr.com/post/133931725110/tip-672
XXXX t.co/MjvmV Page doesn't exist!
XXXX t.co/JSjtxfaxRJ http://www.youtube.com/watch?v=rE693eNyyss
XXXX t.co/xxGSANSE8K http://www.losefattips.pw/Tips/My-stretch-before-and-after-my-workout-is-just-as-important-to-me-as-my-workout.asp
XXXX t.co/ZRhf5gWNQg http://www.youtube.com/watch?v=3OK1P9GzDPM
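If the goal is simply the final expanded URL, another option (a sketch, not part of the answer above) is to let requests follow the redirects itself and read r.url from the final response, while mounting a retry-enabled adapter so an occasional DNS or connection failure does not immediately surface as MaxRetryError. The retry settings here are arbitrary assumptions:

import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

session = requests.Session()
# Retry transient failures a few times with a small back-off instead of letting
# a single hiccup raise MaxRetryError right away.
retry = Retry(total=3, backoff_factor=0.5, status_forcelist=[429, 500, 502, 503, 504])
session.mount("http://", HTTPAdapter(max_retries=retry))
session.mount("https://", HTTPAdapter(max_retries=retry))

def expand(short_url):
    # HEAD is enough to walk the redirect chain; r.url holds the final address.
    try:
        r = session.head(short_url, allow_redirects=True, timeout=10)
        return r.url
    except requests.RequestException as exc:
        return f"could not expand ({exc})"

print(expand("https://t.co/NHBbLlfCaa"))  # example link taken from the question

HEAD keeps the traffic light; if a particular target server rejects HEAD requests, fall back to GET with stream=True and close the response without reading the body.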

Not able to scrape the data and the links are not changing while clicking on pagination using python

I want to scrape the data from each block and move through the pages, but I am not able to do that. Can someone help me crack this?
I tried to crawl the data using headers and form data, but that failed as well.
Below is my code.
from bs4 import BeautifulSoup
import requests

url = 'http://www.msmemart.com/msme/listings/company-list/agriculture-product-stocks/1/585/Supplier'
headers = {
    "Content-Type": "application/x-www-form-urlencoded; charset=UTF-8",
    "Cookie": "unique_visitor=49.35.36.33; __utma=189548087.1864205671.1549441624.1553842230.1553856136.3; __utmc=189548087; __utmz=189548087.1553856136.3.3.utmcsr=nsic.co.in|utmccn=(referral)|utmcmd=referral|utmcct=/; csrf=d665df0941bbf3bce09d1ee4bd2b079e; ci_session=ac6adsb1eb2lcoogn58qsvbjkfa1skhv; __utmt=1; __utmb=189548087.26.10.1553856136",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36",
    "X-Requested-With": "XMLHttpRequest",
    "Accept": "application/json, text/javascript, */*; q=0.01",
}
data = {
    'catalog': 'Supplier',
    'category': '1',
    'subcategory': '585',
    'type': 'company-list',
    'csrf': '0db0757db9473e8e5169031b7164f2a4'
}
r = requests.get(url, data=data, headers=headers)
soup = BeautifulSoup(r.text, 'html.parser')               # was BeautifulSoup(html, ...); html was never defined
div = soup.find('div', {"id": "listings_result"})
for b in div.find_all('b', string='Products/Services:'):  # iterate each tag; a ResultSet has no next_sibling
    print(b.next_sibling)
getting "ConnectionResetError: [WinError 10054] An existing connection was forcibly closed by the remote host" 2-3 times, i want to crawl all text details in a block.plz someone help me to found this.

LinkedIn HTTP Error 999 - Request denied

I am writing a simple script to get the public profile that is visible without logging in on LinkedIn.
Below is my code to fetch the page for BeautifulSoup. I am using public proxies as well.
import urllib.request, urllib.error
from bs4 import BeautifulSoup

url = "https://www.linkedin.com/company/amazon"
proxy = urllib.request.ProxyHandler({'https': proxy, })
opener = urllib.request.build_opener(proxy)
urllib.request.install_opener(opener)
hdr = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3218.0 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
    'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
    'Accept-Encoding': 'gzip, deflate, br',
    'Accept-Language': 'en-US,en;q=0.9,hi;q=0.8',
    'Connection': 'keep-alive'}
req = urllib.request.Request(url, headers=hdr)
page = urllib.request.urlopen(req, timeout=20)
self.soup = BeautifulSoup(page.read(), "lxml")
But it raises an "HTTPError 999 - Request Denied" error. This is only for testing purposes until I get access via the partnership program.
What am I doing wrong? Please help.
You did not do anything wrong; LinkedIn blacklists cloud server IP addresses to prevent "stealing" of their data. A questionable practice, but that is how it is.
